Two-Phase Memetic Algorithm with Fuzzy Restrictions for Solving the Task of Providing Group Anonymity
Oleg Chertov and Dan Tavrov

Abstract It has become a common practice to provide public access to primary nonaggregated statistical data. Necessary precautions need to be taken to guarantee that
sensitive data features are masked, and data privacy cannot be violated.
In the case of providing group anonymity in a given dataset, i.e. protecting information about a group of people, it is important to protect intrinsic data features and
distributions. To solve this task, it is expedient to modify the dataset by introducing
a certain level of distortion, which can be done using different approaches. Once a
particular modification approach is chosen, the goal is to introduce a minimal amount
of distortion under that approach.
In this work, we show that the task of providing group anonymity can be reduced to
the minimum cost network flow problem, where the network represents a model of
data modification, subject to fuzzy restrictions that encompass the chosen modification approach. The problem of determining such fuzzy restrictions is an ill-defined
one, and often can be solved only by expert evaluation.
The flow with a minimum cost can be found by well-known algorithms of polynomial complexity. However, the general task of choosing the modification approach
to provide group anonymity at the minimum possible cost is deemed to be of high
computational complexity. We propose to solve this task using an appropriately tailored
memetic algorithm.

Oleg Chertov
National Technical University of Ukraine "Kyiv Polytechnic Institute", 37 Peremohy Prospekt,
03056 Kyiv, Ukraine, e-mail: chertov@i.ua
Dan Tavrov
National Technical University of Ukraine "Kyiv Polytechnic Institute", 37 Peremohy Prospekt,
03056 Kyiv, Ukraine, e-mail: dan.tavrov@i.ua

1 Introduction
A man is a social animal [1]. Most of our everyday actions depend on or are based
on how people treat them, especially the ones closest to us (or whose opinion is the
most relevant to us). Such people constitute what may be called a close circle.
At the same time, a person may not always be eager to disclose information about
members of her close circle. This reluctance may originate either from subjective
views, or from the nature of the circle (religious identity, professional community,
income level, LGBT community, radical group etc.).
What we face here is a problem of concealing membership of a given respondent in a certain group. This problem can be formulated as the task of masking
certain characteristics of a given respondent [2, 3, 4]. This task is usually called the
task of providing individual anonymity, where anonymity means [5] the property of
a subject to be unidentifiable within a set of other subjects. It is possible to set a
complementary task of providing group anonymity, where we need to conceal information not about a single respondent, but about a group of respondents (e.g., we
need to mask regional, age, or other kinds of distributions of a certain group).
In Sect. 2, we set the stage by discussing the recent advances in the field of
providing group anonymity. We also discuss different ways to apply evolutionary
computing methods to solving the task of providing group anonymity with minimal
data distortion.
In Sect. 3, we provide the formal definition of the task of providing group
anonymity, introduce appropriate notation, and discuss how this task can be reduced
to the minimum cost flow network optimization problem with fuzzy restrictions on
the network architecture.
In Sect. 4, we describe the memetic algorithm for solving the task of providing
group anonymity. In Sect. 5, we introduce a novel idea of applying this memetic
algorithm in two phases.
In Sect. 6, we illustrate the application of the proposed algorithm to a task based on real data,
namely protecting the regional distribution of military personnel. In Sect. 7,
we draw conclusions and outline directions for further research.

2 Related Work
The procedure for providing data group anonymity should meet the following conditions [6, p. 399]:
1. Disclosure risk is low or at least adequate to protected information importance.
2. Both original and protected data, when analyzed, yield close, or even equal results.
3. The cost of transforming the data is acceptable.

Two-Phase Memetic Algorithm for Providing Group Anonymity

Approaches for solving the task of providing group anonymity (TPGA) can be
divided into two groups [7], the ones that obtain the needed solution in two stages,
and the ones that achieve the same goal in a single stage.
All two-stage methods imply carrying out the following two steps:
1. Define the modified distribution that masks important aggregate characteristics
of a certain group of respondents.
2. Define the procedure for modifying the primary data so that they correspond to
the new distribution.
In many cases, the results of the first stage have to comply with the requirement
of preserving certain properties of the distribution being modified. To this end, we
need to apply specific methods. For instance:
- if we need to preserve core statistical characteristics of the distribution, such as
its mean value and standard deviation, we can use the normalization method [8];
- if we need to preserve high-frequency features of the distribution, we can use the
wavelet transforms method [9];
- if we need to preserve the cyclic and periodic components of the distribution, we
can use the singular spectrum analysis method [10].
How well a particular modified distribution masks the data features can typically
be assessed only subjectively by an expert, and such an assessment is usually
imprecise.
After having obtained the modified distribution, we need to modify the data to
adapt them to this distribution. Typically, we want to introduce as little distortion
as possible. In this work, we show that modifying the data with minimal distortion
can be viewed as a well-known minimum cost flow problem in network flow optimization [11, p. 296]. Moreover, there exists a unique correspondence between the
modified distribution and the architecture of the network. In the last decades, many
algorithms of polynomial and pseudopolynomial complexity have been proposed
for solving this task.
In the absence of external restrictions, we can obtain modified distributions that
are equivalent in terms of how well they mask the data features. Consider the case of
masking the regional distribution of military personnel. Extreme values in this distribution might correspond to the sites of classified military facilities. To provide
group anonymity in this case, we need to mask these extreme values, which can be
done using at least two completely different approaches [12, p. 20]:
- the extremum transition approach tells us to modify the distribution in such a way
that completely new extreme values emerge, which are different from the initial
ones;
- the Ali Baba's wife approach implies that we do not eliminate some or all of the
extreme values but rather add new alleged ones, so that it becomes impossible to
discriminate between the real and made-up extremums.
For instance, under the Ali Baba's wife approach, we can create new extreme values
for any region without any specific considerations, as long as the old ones become
indiscernible. Moreover, even if we decide to create alleged extremums in particular
regions, the imprecise nature of the quality of the solution enables us to define a lot
of similar distributions that differ in the absolute value of the extremums created. However, as much as the modified distributions obtained by various modifications can
be equivalent at the first stage of the group anonymity problem, they correspond to
different network architectures, and thus inevitably lead to obtaining data distortions
of different magnitude at the second stage.
To come to grips with these challenges, we can use the single-stage approach to
solving the TPGA. It suggests that we try to obtain a modified distribution that, on
the one hand, suffices in masking sensitive data features, and on the other hand, leads
to introducing minimal data distortion. In this work, we show that the single-stage
approach to the TPGA can be formalized as the minimum cost flow problem with additional constraints imposed on the network architecture. These constraints account
for the quality of the modified distribution for masking the sensitive data features.
Considering the subjective and imprecise nature of such constraints, we propose
that they be stated in the form of fuzzy restrictions imposed on the distribution to be
obtained.
The problem stated this way is no longer computationally tractable. Due to the
nonlinear nature of fuzzy restrictions in most practical applications, the problem
is no longer a linear programming problem. Due to the nature of the data being
anonymized (which are often so-called big data), the problem of finding the minimum cost flow becomes large-scale, and imposing additional nonlinear constraints
makes solving such a problem an arduous task. As was mentioned in [12, p. 12], preserving data utility in the sense of minimal data distortion is an optimization
task at least of the same complexity as the well-known k-anonymization problem in
the field of statistical disclosure control, which is known to be NP-hard [13].
In this work, we propose to use memetic algorithms (MAs) [14] to solve the task
of providing group anonymity. Memetic algorithms are usually implemented as evolutionary algorithms with incorporated local search procedures [15, p. 173], sometimes called memes after the term introduced in [16, p. 192]. A review of memetic
algorithms for dealing with constrained optimization problems presented in [17]
shows the benefits of using MAs instead of conventional heuristics. Novel applications of memetic algorithms to solving optimization tasks have been proposed in
[18].
Four commonly used ways to handle constraints in any evolutionary algorithm
can be distinguished [15]:
1. Using penalty functions that reduce fitness of infeasible solutions [19].
2. Using repair functions that convert infeasible solutions into feasible ones [20].
3. Restricting search to a feasible subspace of the search space by using a specific
alphabet for problem representation [15, pp. 215-216].
4. Using decoder functions that map infeasible solutions into feasible ones, thus
transforming the initial search space into another one [21].
In some cases, fuzzy restrictions can be deduced from additional considerations.
For instance, the solutions might be restricted to the ones preserving specific data
features like high-frequency [9] or periodic [10] components, as mentioned above.
It is therefore possible to limit the search space by using a special form of individual
representation in the memetic algorithm [22]. In general, however, there are no specific requirements for the solution to meet, and using penalty functions seems to be
the most appropriate alternative.
In general, it is not clear what fuzzy restrictions should be included in the problem definition, as different restrictions guide evolution in different directions, not
necessarily leading to obtaining optimal solutions. If the restrictions are too severe
(we know the modified distribution to a high precision), the level of distortion introduced into the dataset might become too high. On the other hand, if the restrictions
are too mild, the modified distribution may not mask sensitive data features at all.
Since it is not always possible to define appropriate fuzzy restrictions at the outset
of solving the TPGA, we will discuss the two-phase memetic algorithm for solving
the TPGA, which possesses the following characteristics:
1. In the first phase, we define preliminary fuzzy restrictions, and obtain the approximate solutions.
2. In the second phase, we analyze the results obtained after the first phase, adapt
the fuzzy restrictions accordingly, and obtain the refined solutions.

3 Theoretic Background
3.1 The Generic Scheme of Providing Group Anonymity
Let the data for anonymizing be gathered in a depersonalized microfile $M$. Each
record $r_i$, $i = \overline{1, \rho}$, of this microfile contains values of attributes $w_j$, $j = \overline{1, \eta}$. The set
of all the values of $w_j$ is denoted by $\nabla w_j$.
Let $w_{v_j}$, $j = \overline{1, t}$, denote vital microfile attributes. Then, a vital value combination
$V$ can be defined as an element of the Cartesian product $\nabla w_{v_1} \times \nabla w_{v_2} \times \ldots \times \nabla w_{v_t}$. We
will denote a set of vital value combinations by $\mathbf{V} = \{V_1, \ldots, V_{l_v}\}$. We call microfile
records whose attribute values belong to $\mathbf{V}$ vital records.
Let $w_p$ denote a parameter microfile attribute. Then, a parameter value $P$ can be
defined as a value of this attribute, $P \in \nabla w_p$. We will denote a set of parameter values
by $\mathbf{P} = \{P_1, \ldots, P_{l_p}\}$. Parameter values can be used to divide the microfile $M$ into
submicrofiles $M_1, \ldots, M_{l_p}$. Each submicrofile $M_k$ has exactly $\rho_k$ records, $k = \overline{1, l_p}$,
$\sum_k \rho_k = \rho$.
Let us denote by G (V, P) a group of respondents whose distribution needs to
be protected when providing group anonymity. The group is thereby determined by
appropriately defined values of parameter and vital microfile attributes.
The task of providing group anonymity [7] lies in performing modification of
M in order to mask sensitive data features. The generic scheme of providing group
anonymity according to the single-stage approach to solving the TPGA goes as follows:

1. Construct a (depersonalized) microfile $M$ representing statistical data to be processed.
2. Define groups of respondents $G_i(\mathbf{V}_i, \mathbf{P}_i)$ to be protected, $i = \overline{1, k}$.
3. For each $i$ from 1 to $k$:
a. Choose a dataset of arbitrary structure $\Omega_i(M, G_i)$ called the goal representation that represents features of $G_i$ in a way appropriate for their masking.
b. Define the algorithm $A: \Omega_i(M, G_i) \to \Omega_i^*(M^*, G_i)$ called the modifying algorithm and obtain the modified goal representation $\Omega_i^*$ and the modified microfile $M^*$.
4. Prepare the modified microfile $M^*$ for publishing.
In this work, we will illustrate the two-phase modifying algorithm based on
the most widely used goal representation, the quantity signal, in the form
$q = (q_1, q_2, \ldots, q_{l_p})$, where $q_i$ is a total number of respondents in a submicrofile $M_i$,
$i = \overline{1, l_p}$, whose vital attribute values belong to $\mathbf{V}$.
Any modifying algorithm $A$ has to provide two kinds of data modification:
- the quantity signal $q$ has to be altered in order to mask its sensitive features
according to fuzzy restrictions imposed on all or some of its values;
- on the level of microfile modification, this can be achieved by swapping the
vital and non-vital records between different submicrofiles. Records should be
swapped in a pairwise fashion to preserve the number of records in each submicrofile.
Swapping two records between the submicrofiles obviously introduces a certain
amount of distortion into the microfile, which can be measured by the influential
metric [12]

$$ \mathrm{InfM}(r, r^*) = \sum_{p=1}^{n_{ord}} \omega_p \left( \frac{r(I_p) - r^*(I_p)}{r(I_p) + r^*(I_p)} \right)^2 + \sum_{k=1}^{n_{cat}} \gamma_k \chi^2\left( r(J_k), r^*(J_k) \right), \tag{1} $$

where $I_p$ ($J_k$) stands for the $p$th ordinal ($k$th categorical) influential attribute (an attribute whose distribution over parameter values is of interest for researchers), $r(\cdot)$
stands for the operator returning the attribute value of the record $r$, $\chi(v_1, v_2)$ stands
for the operator which is equal to a certain number $\chi_1$ if its arguments belong to
the same category, and $\chi_2$ otherwise, and $\omega_p$ and $\gamma_k$ are nonnegative weights (the more
important the attribute, the greater the weight).
A properly defined modifying algorithm has to modify the quantity signal to
mask sensitive data features, and at the same time minimize the distortion introduced
into the microfile in terms of (1).
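
The structure of the metric (1) can be illustrated with a minimal Python sketch. The record layout (dictionaries keyed by attribute name), the function name, and the default values chosen for the two categorical constants are assumptions made for illustration only; "same category" is approximated here by equality of values, and an ordinal term with a zero denominator is skipped.

```python
def influential_metric(r, r_star, ordinal, categorical, omega, gamma,
                       chi_same=0.0, chi_diff=1.0):
    """Distortion caused by swapping records r and r_star, shaped like (1).

    ordinal / categorical: names of the influential attributes (hypothetical layout).
    omega / gamma: nonnegative weights, one per ordinal / categorical attribute.
    chi_same / chi_diff: assumed values of the categorical operator.
    """
    total = 0.0
    # Ordinal part: weighted squared relative difference of attribute values.
    for attr, weight in zip(ordinal, omega):
        a, b = r[attr], r_star[attr]
        if a + b != 0:  # assumption: a zero denominator contributes nothing
            total += weight * ((a - b) / (a + b)) ** 2
    # Categorical part: weighted squared category indicator.
    for attr, weight in zip(categorical, gamma):
        chi = chi_same if r[attr] == r_star[attr] else chi_diff
        total += weight * chi ** 2
    return total
```

For example, two records that differ in age (30 versus 10) and in a categorical attribute, with weights 1.0 and 2.0, give 0.25 + 2.0 = 2.25 under these assumed constants.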

3.2 Fuzzy Restrictions Overview


A fuzzy restriction [23] is a fuzzy relation $R(X)$ that can be treated as an elastic
constraint on the values that a variable $X$ can assume. A fuzzy restriction can be
expressed in its canonical form [24]
$$ CF(R(X)): X \text{ isr } R, \tag{2} $$
where $X$ is the restricted variable, $R$ is the restricting relation, and $r$ is an indexical variable
that defines the way in which $R$ restricts $X$. In the simplest case, the restriction can
be singular, i.e. $R$ is a number.
One of the main types of fuzzy restrictions is the possibilistic restriction in the
form
$$ R(X): X \text{ is } A, \tag{3} $$
where $A$ is a fuzzy subset [25] of space $U$, with the membership function $\mu_A(u)$, $u \in U$,
and $r$ is left blank intentionally. The fuzzy set $A$ defines the possibility distribution
of $X$, i.e. the set of possible values of $X$ with appropriate membership grades. The
membership function $\mu_A(u)$ defines a compatibility degree of a particular value $u$ of
$X$ with the fuzzy restriction $R(X)$.
When the variable $X$ is $n$-ary, i.e. $X = (X_1, \ldots, X_n)$, and variables $X_1, \ldots, X_n$ are
noninteractive, the relation $R(X_1, \ldots, X_n)$ can be expressed [26, p. 310] as a Cartesian product
$$ R(X_1, \ldots, X_n) = R(X_1) \times \ldots \times R(X_n), \tag{4} $$
where $X_1, \ldots, X_n$ are noninteractive variables defined in the subspaces $U_1, \ldots, U_n$,
respectively, and $R(X_1), \ldots, R(X_n)$ are the fuzzy restrictions imposed on the variables $X_1, \ldots, X_n$, respectively. In this case, we can rewrite (3) as
$$ \begin{aligned} R(X_1) &: X_1 \text{ is } A_1, \\ &\ldots \\ R(X_n) &: X_n \text{ is } A_n, \end{aligned} \tag{5} $$
where $A_i$ is a fuzzy subset of $U_i$ with a membership function $\mu_{A_i}(u_i)$, $u_i \in U_i$, $i = \overline{1, n}$. According to the rule of maximal restriction [23, p. 9], the compatibility degree
of the value $u = (u_1, \ldots, u_n)$ of the variable $X$ with the fuzzy restriction $R(X)$ equals
$$ \mu(u_1, \ldots, u_n) = \mu_{A_1}(u_1) \wedge \ldots \wedge \mu_{A_n}(u_n), \tag{6} $$
where $\wedge$ is the fuzzy intersection.
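
The compatibility degree in (6) can be sketched in Python as follows. The sketch assumes that the fuzzy intersection is taken as the minimum, which is a common choice but is not stated in the text; the function name and the representation of membership functions as plain callables are also assumptions.

```python
def compatibility(values, memberships):
    """Compatibility degree of the value (u_1, ..., u_n) in the sense of (6).

    memberships[i] is the membership function of the fuzzy subset restricting
    the i-th variable; the fuzzy intersection is assumed to be `min`.
    """
    if len(values) != len(memberships):
        raise ValueError("one membership function per variable is required")
    return min(mu(u) for mu, u in zip(memberships, values))
```

With two hypothetical membership functions that return 0.5 and 0.25 for the given values, the overall compatibility degree is 0.25, so the most violated restriction determines the result.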


When the variables $X_1, \ldots, X_n$ are interactive, for instance, when the values of
these variables have to satisfy the condition
$$ u_1 + \ldots + u_n = u_0, $$
where $u_0$ is some number, the relation $R(X_1, \ldots, X_n)$ cannot be expressed in the form
(4), and (5) becomes [26, p. 311]
$$ \begin{aligned} R(X_1) &: X_1 \text{ is } A_1, \\ R(X_2 \mid u_1) &: X_2 \text{ is } A_2, \\ &\ldots \\ R(X_n \mid u_1, \ldots, u_{n-1}) &: X_n \text{ is } A_n, \end{aligned} \tag{7} $$
where $R(X_i \mid u_1, \ldots, u_{i-1})$ is a fuzzy restriction conditioned on the values $u_1, \ldots, u_{i-1}$,
and $A_i$ is the fuzzy subset of $U_i$ with the membership function $\mu_{A_i \mid u_1, \ldots, u_{i-1}}(u_i)$,
$i = \overline{1, n}$, which depends on the values $u_1, \ldots, u_{i-1}$. The compatibility degree of the
value $u = (u_1, \ldots, u_n)$ of the variable $X$ with the fuzzy restriction $R(X)$ in this case
equals
$$ \mu(u_1, \ldots, u_n) = \mu_{A_1}(u_1) \wedge \ldots \wedge \mu_{A_n \mid u_1, \ldots, u_{n-1}}(u_n), \tag{8} $$
where $\wedge$ is the fuzzy intersection.

3.3 Fuzzy Restrictions Imposed on the Modified Goal Signal




In the context of solving the TPGA, the modified quantity signal $q^* = (q_1^*, \ldots, q_{l_p}^*)$
can be treated as a value of an $l_p$-ary variable $X = (X_1, \ldots, X_{l_p})$. Variables $X_1, \ldots, X_{l_p}$
are interactive, since their values must satisfy the condition
$$ q_1^* + \ldots + q_{l_p}^* = Q_0, \tag{9} $$
where $Q_0$ is the total number of vital records in the microfile $M$.


In the general case, to mask sensitive features of the signal $q$, we need to impose
the following fuzzy restrictions on the values of $X$:
$$ \begin{aligned} R(X_1) &: X_1 \text{ is } A_1, \\ R(X_2 \mid q_1^*) &: X_2 \text{ is } A_2, \\ &\ldots \\ R(X_{l_p} \mid q_1^*, \ldots, q_{l_p - 1}^*) &: X_{l_p} \text{ is } A_{l_p}, \end{aligned} \tag{10} $$
which also guarantee that (9) is satisfied.
In practice, to mask sensitive features of the signal q, and also to minimally
constrain the search space for the algorithm A, it is sufficient to impose restrictions
only on a certain proper subset of atomic variables. In this work, we will additionally
assume without loss of generality that the variables in this proper subset are
mutually noninteractive.
Let us denote by $J = (i_1, \ldots, i_k)$ the index sequence, which is a proper subsequence of $I = (1, \ldots, l_p)$. Let us denote by $\bar{J}$ the complement
of $J$ to $I$. Also, let us
denote by $X_J$ the $k$-ary variable $X_J = (X_{i_1}, \ldots, X_{i_k})$. We can then rewrite (10) as

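$$ \begin{aligned} R(X_{i_1}) &: X_{i_1} \text{ is } A_{i_1}, \\ R(X_{i_2}) &: X_{i_2} \text{ is } A_{i_2}, \\ &\ldots \\ R(X_{i_k}) &: X_{i_k} \text{ is } A_{i_k}, \\ R(X_{\bar{J}} \mid q_J^*) &: \sum_{i \in \bar{J}} q_i^* = Q_0 - \sum_{j \in J} q_j^*. \end{aligned} \tag{11} $$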

3.4 The Task of Providing Group Anonymity as the Generalized Minimum Cost Flow Problem
Let us first discuss the simpler task of providing group anonymity using the two-stage approach. As mentioned in Sect. 2, in this case we first obtain the modified
signal $q^*$ that masks important characteristics of a group of respondents in some
sense. At the second stage, we swap the vital and non-vital microfile records between different submicrofiles in order to achieve the given modified quantity signal
$q^*$ with the minimal distortion introduced into the microfile in terms of (1).
Let us denote by $\delta_i$ the valence of the submicrofile $M_i$, $i = \overline{1, l_p}$. The valences
determine how many records need to be added to (if $\delta_i > 0$) or removed from (if
$\delta_i < 0$) the submicrofile $M_i$. Obviously, $\delta_i = q_i^* - q_i$, $i = \overline{1, l_p}$.
The task of minimizing the distortion introduced at the second stage can be
viewed as a task of finding the minimum cost flow in a network. When discussing
this task and its generalization, we will use the notation employed in [11].
Let $G = (N, A)$ be a directed network defined by a set $N$ of $n$ nodes and a set $A$ of
$m$ directed arcs. Every arc $(i, j) \in A$ has a cost $c_{ij}$ and a capacity $u_{ij}$. Each node $i \in N$
is associated with a number $b(i)$ that indicates its supply (if $b(i) > 0$) or demand (if
$b(i) < 0$). The minimum cost flow problem can then be stated as follows:
$$ \min z(x) = \sum_{(i,j) \in A} c_{ij} x_{ij}, $$
subject to:
$$ \begin{aligned} \sum_{\{j \mid (i,j) \in A\}} x_{ij} - \sum_{\{j \mid (j,i) \in A\}} x_{ji} &= b(i) \quad \forall i \in N, \\ 0 \le x_{ij} &\le u_{ij} \quad \forall (i,j) \in A, \\ \sum_{i=1}^{n} b(i) &= 0. \end{aligned} \tag{12} $$

The network for the TPGA has the architecture with the following characteristics:
- the set $N$ consists of four disjoint subsets $N_1, N_2, N_3, N_4$;
- nodes $N_1^k \in N_1$ and $N_4^k \in N_4$ represent the submicrofiles $M_k$, $k = \overline{1, l_p}$, i.e. $|N_1| = |N_4| = l_p$;
- the set $N_2$ can be divided into $l_p$ disjoint subsets $N_2^{(1)}, \ldots, N_2^{(l_p)}$. Nodes $N_2^{(k,i)} \in N_2^{(k)}$ represent vital records of the submicrofile $M_k$, $k = \overline{1, l_p}$, $i = \overline{1, q_k}$, i.e., $|N_2^{(k)}| = q_k$, $k = \overline{1, l_p}$;
- the set $N_3$ can be analogously divided into $l_p$ disjoint subsets $N_3^{(1)}, \ldots, N_3^{(l_p)}$. Nodes $N_3^{(k,i)} \in N_3^{(k)}$ represent non-vital records of the submicrofile $M_k$, $k = \overline{1, l_p}$, $i = \overline{1, \rho_k - q_k}$, i.e., $|N_3^{(k)}| = \rho_k - q_k$, $k = \overline{1, l_p}$;
- the set $A$ consists of the three disjoint subsets $A_{N_1 N_2}$, $A_{N_2 N_3}$, $A_{N_3 N_4}$;
- the arcs in $A_{N_1 N_2}$ connect the nodes $N_1^k \in N_1$ with the nodes $N_2^{(k,i)} \in N_2^{(k)}$, $k = \overline{1, l_p}$, $i = \overline{1, q_k}$. There are no other arcs in $A_{N_1 N_2}$;
- the arcs in $A_{N_3 N_4}$ connect the nodes $N_3^{(k,i)} \in N_3^{(k)}$ with the nodes $N_4^k \in N_4$, $k = \overline{1, l_p}$, $i = \overline{1, \rho_k - q_k}$. There are no other arcs in $A_{N_3 N_4}$;
- the arcs in $A_{N_2 N_3}$ connect each node $N_2^{(k,i)} \in N_2^{(k)}$ with each node $N_3^{(l,j)} \in N_3^{(l)}$, $k = \overline{1, l_p}$, $l = \overline{1, l_p}$, $k \ne l$, $i = \overline{1, q_k}$, $j = \overline{1, \rho_l - q_l}$;
- the supplies of the nodes $N_1^k \in N_1$, $k = \overline{1, l_p}$, equal $b(N_1^k) = \begin{cases} \delta_k, & \delta_k > 0 \\ 0, & \delta_k \le 0 \end{cases}$;
- the demands of the nodes $N_4^k \in N_4$, $k = \overline{1, l_p}$, equal $b(N_4^k) = \begin{cases} \delta_k, & \delta_k < 0 \\ 0, & \delta_k \ge 0 \end{cases}$;
- the supplies/demands of the nodes in $N_2$ and $N_3$ equal 0;
- the capacities of all the arcs in $A$ equal 1;
- the costs of the arcs in $A_{N_1 N_2}$ and $A_{N_3 N_4}$ equal 0;
- the cost of each arc $(N_2^{(k,i)}, N_3^{(l,j)}) \in A_{N_2 N_3}$, $k = \overline{1, l_p}$, $l = \overline{1, l_p}$, $k \ne l$, $i = \overline{1, q_k}$, $j = \overline{1, \rho_l - q_l}$, equals the value of (1) for appropriate microfile records.


The network described above is presented in Fig. 1.
With the single-stage approach to solving the TPGA, which is the main focus
of our discussion, we do not know the modified signal $q^*$ in advance. We can only
impose certain fuzzy restrictions described in Sect. 3.2 on the values of $q^*$. In this
case, the problem of minimizing the distortion introduced into the microfile can be
formulated as follows:
$$ \min z(x) = \sum_{(i,j) \in A} c_{ij} x_{ij}, $$
subject to:
$$ \begin{aligned} \sum_{\{j \mid (i,j) \in A\}} x_{ij} - \sum_{\{j \mid (j,i) \in A\}} x_{ji} &= b(i) \quad \forall i \in N, \\ 0 \le x_{ij} &\le u_{ij} \quad \forall (i,j) \in A, \\ \sum_{i=1}^{n} b(i) &= 0, \\ \mu\left(q_1^*, \ldots, q_{l_p}^*\right) &\ge \alpha, \end{aligned} \tag{13} $$
where $\mu(q_1^*, \ldots, q_{l_p}^*)$ is the degree of compatibility of the modified quantity signal
$q^*$ (or, equivalently, of the supplies/demands of the network) with the fuzzy restriction $R(X)$ in the form (11), and $\alpha$ is the user-defined constant, which expresses the
lowest acceptable degree of compatibility of the modified quantity signal with the
fuzzy restriction.
If there are additional constraints that need to be imposed on the modified quantity signal in order to preserve its intrinsic features (such as statistical characteristics, high-frequency features, or cyclic and periodic components), we can add them
to (13). In this work, we will assume that the problem to solve can be fully described
by (13).

Fig. 1 The architecture of a network for the task of providing group anonymity
To solve this complex problem, we propose to use the appropriately tailored
memetic modifying algorithm (MMA) introduced in [7], which is described in the
following sections.

4 Memetic Modifying Algorithm


4.1 Algorithm Outline
In this section, we will describe the memetic modifying algorithm for solving the
task (13). We will use the notation proposed in [7], which may conflict with the notation from
the previous discussion. We believe the meaning of each symbol will be clear from
the context.
The algorithm consists of the following steps:

1. Create initial population $P = \{U_i\}$ of $\mu$ randomly generated individuals, $i = \overline{1, \mu}$.
Apply local search operator $S(U_i)$ $\forall i = \overline{1, \mu}$.
2. Calculate fitness function $f(U_i)$ $\forall i = \overline{1, \mu}$.
3. If termination condition holds, stop. Continue otherwise.
4. Select pairs of parents. Put them into set $P'$.
5. Apply recombination operator $R(U_{i_1}, U_{i_2})$ to each pair $\langle U_{i_1}, U_{i_2} \rangle$ from $P'$, $i_1 = \overline{1, \mu}$, $i_2 = \overline{1, \mu}$, $i_1 \ne i_2$. Put the resulting offspring into population $P''$.
6. Apply mutation operator $M(U_j)$ $\forall U_j \in P''$, $j = \overline{1, \lambda}$.
7. Apply local search operator $S(U_j)$ $\forall j = \overline{1, \lambda}$.
8. Calculate fitness function $f(U_j)$ $\forall j = \overline{1, \lambda}$.
9. Select among individuals from $P \cup P''$ the $\mu$ fittest ones. Put them into $P$ in place of
the current ones.
10. Go to step 3.
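
The control flow of the steps above can be sketched as a generic Python loop. This is only an illustration under several assumptions that the text leaves open: parents are paired uniformly at random, a fixed generation budget stands in for the termination condition, and every operator (`init`, `fitness`, `recombine`, `mutate`, `local_search`) is a hypothetical callable supplied by the caller.

```python
import random

def memetic_search(init, fitness, recombine, mutate, local_search,
                   pop_size, generations, seed=0):
    """Generic memetic loop; all operators are supplied by the caller."""
    rng = random.Random(seed)
    # Step 1: random initial population, each individual refined by local search.
    population = [local_search(init(rng)) for _ in range(pop_size)]
    # Step 3: a fixed generation budget stands in for the termination condition.
    for _ in range(generations):
        offspring = []
        # Step 4: parents are paired uniformly at random (an assumption).
        for _ in range(pop_size // 2):
            parent1, parent2 = rng.sample(population, 2)
            # Step 5: recombination yields two offspring per pair.
            offspring.extend(recombine(parent1, parent2, rng))
        # Steps 6-7: mutation followed by local search.
        offspring = [local_search(mutate(child, rng)) for child in offspring]
        # Steps 2, 8, 9: evaluate fitness, keep the fittest pop_size individuals.
        merged = population + offspring
        population = sorted(merged, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)
```

Because survivors are chosen from the union of the current population and the offspring, the best fitness in the population never decreases from one generation to the next.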
Each individual is a matrix $U$ with $Q$ rows and four columns with the following
elements:
1. Element of the first column $u_{i1}$ $\forall i = \overline{1, Q}$ is an index of a submicrofile to remove
vital records from. The set of such submicrofiles needs to be defined by the user.
2. Element of the third column $u_{i3}$ $\forall i = \overline{1, Q}$ is an index of a submicrofile to add
vital records to. The set of such submicrofiles also needs to be defined by the
user.
3. Element of the second column $u_{i2}$ $\forall i = \overline{1, Q}$ is an index of the record from $M_{u_{i1}}$
to be removed.
4. Element of the fourth column $u_{i4}$ $\forall i = \overline{1, Q}$ is an index of the record from $M_{u_{i3}}$ to
be swapped with the one defined by $u_{i2}$.
The number of rows can vary from individual to individual.
Let us denote by $|U|(i)$ the total count of occurrences of index $i$ in the first and
the third column of $U$. Then, for any index $i$ in the first column, $|U|(i) \le q_i$.
Each particular pair $\langle u_{i1}, u_{i2} \rangle$ ($\langle u_{i3}, u_{i4} \rangle$) $\forall i = \overline{1, Q}$ can occur in $U$ only once.
Requirements mentioned above cannot be violated during the whole run of the
algorithm.
Each individual $U$ uniquely defines both a modified quantity signal $q^*$ and a
precise sequence of pairwise swaps to be performed in order to modify the microfile,
thereby defining a solution to the TPGA.
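
How an individual determines the modified quantity signal can be sketched in Python. The sketch assumes that an individual is stored as a list of 4-tuples (one per row) and that submicrofile indexes are 0-based; both choices, as well as the function name, are illustrative assumptions.

```python
from collections import Counter

def modified_quantity_signal(q, U):
    """Derive the modified quantity signal from the initial signal q and individual U.

    Each row of U is (source submicrofile, vital record index,
    target submicrofile, non-vital record index), with 0-based submicrofile indexes.
    """
    removed = Counter(row[0] for row in U)  # vital records leaving each submicrofile
    added = Counter(row[2] for row in U)    # vital records entering each submicrofile
    return [q_k - removed[k] + added[k] for k, q_k in enumerate(q)]
```

For instance, with q = [5, 2, 3] and two rows that both remove a vital record from submicrofile 0 and add one each to submicrofiles 1 and 2, the result is [3, 3, 4]. The total number of vital records is unchanged, as pairwise swapping requires.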

4.2 Fitness Function


In this work, we propose to use the fitness function as a sum of three independent
terms:
$$ f(U) = \varphi(U) + \psi(U) + \xi(U), \tag{14} $$
where $\varphi(U)$ estimates the quality of a TPGA solution in terms of minimizing
microfile distortion, $\psi(U)$ is a penalty function that estimates the quality of a TPGA solution in terms of its compatibility with the imposed fuzzy
restrictions (11), and $\xi(U)$ is a penalty term against obtaining individuals with too many rows. As all three terms are of equal importance, their values
lie in the interval [0, 1].
We propose to use the following expression for the first term of (14):
$$ \varphi(U) = \frac{C_{max} - \sum_{i=1}^{Q} \mathrm{InfM}\left( M_{u_{i1}}(u_{i2}), M_{u_{i3}}(u_{i4}) \right)}{C_{max}}, \tag{15} $$
where $C_{max}$ is the greatest possible value of the cumulative influential metric (1), and
$M_i(j)$ is the operator yielding the $j$th record of the submicrofile $M_i$, $i = \overline{1, l_p}$.
We also propose to use the following expression for the penalty function:
$$ \psi(U) = \bigwedge_{j = i_1}^{i_k} \mu_{A_j}\left( |U|(j) \right), \tag{16} $$
where indexes $i_1, \ldots, i_k$ define the proper subset of the submicrofiles subject to the
fuzzy restrictions (11). It is worth noting that we do not need to explicitly incorporate
the last restriction from (11) into the penalty function, as by the nature of pairwise
swapping, the sum of all the elements of the modified quantity signal $q^*$ will remain
equal to the sum of all the elements of the initial quantity signal $q$.
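
A minimal Python sketch of how the three terms of (14) might be combined is given below. It rests on several assumptions that go beyond the text: the distortion between two records is provided by a hypothetical callable `inf_m`, the fuzzy intersection is taken as the minimum, each membership function is evaluated at the corresponding value of the modified quantity signal, and the third term is passed in as a callable because its expression is not given here.

```python
def distortion_term(U, inf_m, c_max):
    """First term, shaped like (15): 1 for no distortion, 0 for the maximal one."""
    total = sum(inf_m(u1, u2, u3, u4) for u1, u2, u3, u4 in U)
    return (c_max - total) / c_max

def restriction_term(q_star, memberships):
    """Penalty term: compatibility of the restricted signal elements.

    memberships maps a submicrofile index to its membership function;
    `min` is assumed as the fuzzy intersection.
    """
    return min(mu(q_star[j]) for j, mu in memberships.items())

def fitness(U, q_star, inf_m, c_max, memberships, size_term):
    """Sum of three terms, each expected to lie in [0, 1]."""
    return (distortion_term(U, inf_m, c_max)
            + restriction_term(q_star, memberships)
            + size_term(U))
```

Keeping each term in [0, 1] means that no single criterion dominates the sum, which matches the stated intent of treating the three terms as equally important.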

4.3 Other Algorithm Components


Operator $R(U_{i_1}, U_{i_2})$ should be defined as a proper recombination operator applied
to two parent individuals $U_{i_1}$ and $U_{i_2}$ that yields two offspring individuals $U_{j_1}$ and
$U_{j_2}$. It should be applied with a high probability $p_c$. In this work, we propose to use
a recombination operator based on the cut and splice operator [27]. It randomly
generates two crossover points $k_1 \in [0, Q_{i_1}]$ and $k_2 \in [0, Q_{i_2}]$, then splits each parent at the corresponding point, and exchanges the tails between them, thus creating the
offspring.
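
The cut and splice idea can be sketched in Python as follows, assuming that an individual is a list of rows and that a seeded `random.Random` instance is passed in. The sketch does not repair offspring that repeat a pair of indexes, which a full implementation would have to handle in order to keep individuals valid.

```python
import random

def cut_and_splice(parent1, parent2, rng):
    """Cut each parent at its own random point and exchange the tails."""
    k1 = rng.randint(0, len(parent1))  # crossover point in the first parent
    k2 = rng.randint(0, len(parent2))  # crossover point in the second parent
    child1 = parent1[:k1] + parent2[k2:]
    child2 = parent2[:k2] + parent1[k1:]
    return child1, child2
```

Since the two crossover points are independent, the offspring generally differ in length from their parents, which is why the number of rows can vary between individuals.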
Operator $M(U)$ should be defined as a proper mutation operator applied to the
individual $U$ that yields the mutated one $U'$. In this work, we propose to use an operator
that is a superposition $M = M_4 \circ M_3 \circ M_2 \circ M_1$ of the following operators:
1. Operator $M_1$ is applied with small probability $p_{m1}$ to the first column of $U$ as to
the permutation. Each pair $\langle u_{i1}, u_{i2} \rangle$ needs to be preserved $\forall i = \overline{1, Q}$.
2. Operator $M_2$ is applied with small probability $p_{m2}$ to the third column of $U$ as to
the permutation. Each pair $\langle u_{i3}, u_{i4} \rangle$ needs to be preserved $\forall i = \overline{1, Q}$.
3. Operator $M_3$ is applied with small probability $p_{m3}$ to the second column of $U$ as
to the vector of categorical values.
4. Operator $M_4$ is applied with small probability $p_{m4}$ to the fourth column of $U$ as
to the vector of categorical values.
Operator $S(U)$ in this work is defined as an operator applied to the individual $U$
that yields the modified one $U'$ according to the following procedure:
1. Carry out steps 2-4 $\forall i = \overline{1, Q}$.
2. Generate a uniformly distributed random number $r \in [0, 1]$.
3. If $r \le p_{mem}$, assign to $u_{i4}$ the index of a record from $M_{u_{i3}}$ closest to the record $u_{i2}$
from $M_{u_{i1}}$ in terms of (1).
If $r > p_{mem}$, assign to $u_{i2}$ the index of a record from $M_{u_{i1}}$ closest to the record $u_{i4}$
from $M_{u_{i3}}$ in terms of (1).
4. Go to step 2.
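
The local search procedure above can be sketched in Python. The data layout is an assumption: `vital[k]` and `nonvital[l]` are hypothetical lists of candidate record indexes in each submicrofile, and `dist(k, i, l, j)` is a hypothetical callable returning the value of (1) for record `i` of submicrofile `k` and record `j` of submicrofile `l`. The sketch does not check that pairs of indexes remain unique within the individual.

```python
import random

def local_search(U, vital, nonvital, dist, p_mem, rng):
    """Greedy refinement of each row towards the closest swap partner."""
    improved = []
    for u1, u2, u3, u4 in U:
        if rng.random() <= p_mem:
            # Re-pick the non-vital record closest to the chosen vital record.
            u4 = min(nonvital[u3], key=lambda j: dist(u1, u2, u3, j))
        else:
            # Re-pick the vital record closest to the chosen non-vital record.
            u2 = min(vital[u1], key=lambda i: dist(u1, i, u3, u4))
        improved.append((u1, u2, u3, u4))
    return improved
```

Each row is changed in only one of its two record columns per call, so the submicrofiles involved in a swap, and hence the modified quantity signal, stay the same while the distortion of the swap can only decrease or remain unchanged.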
Other MMA components, such as selection, initialization, termination, population size etc. can be chosen individually for each TPGA at hand.

5 Two-Phase Memetic Algorithm


In this section, we will discuss the two-phase memetic algorithm for the TPGA,
in which sensitive data features to be masked are maximum values of the quantity
signal.
For a certain element of the quantity signal, there can be defined two types of
fuzzy restrictions:
1. Decreasing fuzzy restriction, the membership function for which is a monotonically non-increasing function that tends to unity as the corresponding quantity
signal value decreases to a particular value.
2. Increasing fuzzy restriction, the membership function for which is a monotonically non-decreasing function that tends to unity as the corresponding quantity
signal value increases to a particular value.
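
As an illustration, the two types of restrictions can be modeled by piecewise-linear membership functions. The linear shape, the function names, and the parameters `target` and `limit` are assumptions of this sketch; any monotone function with the stated behavior would serve equally well.

```python
def decreasing_restriction(target, limit):
    """Membership is 1 at or below `target`, 0 at or above `limit` (target < limit)."""
    def mu(q):
        if q <= target:
            return 1.0
        if q >= limit:
            return 0.0
        return (limit - q) / (limit - target)
    return mu

def increasing_restriction(limit, target):
    """Membership is 0 at or below `limit`, 1 at or above `target` (limit < target)."""
    def mu(q):
        if q >= target:
            return 1.0
        if q <= limit:
            return 0.0
        return (q - limit) / (target - limit)
    return mu
```

For example, `decreasing_restriction(100, 200)` assigns full compatibility to a signal value of 100 or less, a compatibility of 0.5 to the value 150, and zero compatibility to any value of 200 or more.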
In most cases, at the outset of the TPGA, we can determine only the decreasing fuzzy
restrictions for the submicrofiles to remove vital records from. There are many ways
to choose the submicrofiles to add records to, and to define appropriate increasing
fuzzy restrictions. Each particular choice leads to a completely different
TPGA in the form of (13), with a different minimal distortion that can be introduced.
In some cases, with the help of information from external sources, it may be possible to choose the submicrofiles to add vital records to, and to define appropriate increasing fuzzy restrictions. However, in general, when there is no additional information
other than that present in the data at hand, it is not clear which submicrofiles to choose
and which restrictions to impose, because different restrictions guide evolution in
different directions, not necessarily leading to good solutions.
In this work, we propose to leave the choice of the submicrofiles to add
records to up to the evolutionary process itself, according to the following procedure:


1. Based on analyzing the quantity signal $q$ representing microfile $M$, define suitable decreasing fuzzy restrictions for those signal elements that violate the requirement of masking maximum signal values.
2. Apply the memetic algorithm as described in Sect. 4.
3. Classify the obtained individuals into feasible solutions, which are compatible with the
decreasing restrictions to a high degree and mask maximum signal values; subfeasible solutions, which are compatible with the decreasing restrictions to a high
degree but do not mask maximum signal values; and infeasible solutions, which
are not compatible with the decreasing restrictions to a high degree.
4. Group all subfeasible solutions for which it is possible to define the same increasing fuzzy restrictions into clusters. One solution may belong to several clusters.
5. Choose the cluster with the smallest mean value of the cumulative metric (1). If
the cluster contains fewer solutions than the population size used in the
algorithm, increase its size to the population size by duplicating solutions at random. If the cluster
contains more solutions than the population size, decrease its size to the population size by removing solutions at
random.
6. Apply the memetic algorithm of step 2 to the set of solutions obtained in step 5, taken as
the initial population.
The first two steps of this procedure constitute the first phase of the MMA; the
other four constitute the second phase.
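Steps 3 and 5 of the procedure above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the representation of a cluster as a list of `(solution, metric)` pairs, and the cut-off `threshold=0.9` for "compatible to a high degree" are all assumptions introduced here for illustration, since the section does not quantify that degree.

```python
import random
from statistics import mean


def classify(compatibility, masks_maxima, threshold=0.9):
    """Step 3: label one solution (threshold is a hypothetical cut-off)."""
    if compatibility < threshold:
        return "infeasible"
    return "feasible" if masks_maxima else "subfeasible"


def pick_cluster(clusters):
    """Step 5, part 1: label of the cluster with the smallest mean metric.

    `clusters` maps a label to a non-empty list of (solution, metric) pairs.
    """
    return min(clusters, key=lambda label: mean(m for _, m in clusters[label]))


def resize(solutions, size, rng=random):
    """Step 5, part 2: bring a non-empty cluster to exactly `size` solutions."""
    pool = list(solutions)
    if len(pool) > size:
        return rng.sample(pool, size)          # drop solutions at random
    while len(pool) < size:
        pool.append(rng.choice(solutions))     # duplicate solutions at random
    return pool
```

The list returned by `resize` would then serve as the initial population for the second run of the memetic algorithm (step 6).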
It is worth noting that the quality of the solution heavily depends on the quality
of the decreasing restriction functions. If the restrictions are too severe (too many
swaps need to be performed), the cumulative metric (1) might become too great.
On the other hand, if they are too mild, the solution may not be feasible, since
maximums may remain greater than other signal values, even though their absolute
values have decreased.

6 Practical Results
6.1 General Description of the Task
To illustrate the application of the memetic modifying algorithm to a real-data
task of providing group anonymity, we consider the problem of
masking the regional distribution of active-duty military personnel in the state of Massachusetts (the U.S.) according to the 5-Percent Public Use Microdata Sample File
of the 2000 U.S. census [28]. A total of 141,838 records were taken for analysis.
To define the group of active-duty military personnel distributed by place of work, we
took "Military Service" as the vital attribute, its value "Active Duty" as the only
vital value, "Place of Work PUMA", where PUMA stands for the Public Use
Microdata Area, as the parameter attribute, and its values 25010–25120 with a
step of 10 as parameter values. The latter values are the codes of Massachusetts
statistical areas.
Quantity signal $q$ corresponding to the group is presented in Fig. 8 (solid line).
Signal elements 1, 2, ..., 12 correspond to statistical areas 25010, 25020, ..., 25120,
respectively.

6.2 The First Phase of the Algorithm


As we can see from the graph of the quantity signal (Fig. 8, solid line), anonymity
can be provided by reducing the values of the second, the seventh, the ninth, and the
twelfth signal elements. In other words, we need to impose the following decreasing
fuzzy restrictions in the form (11) on the modified quantity signal $q$, treated as a 12-ary
variable:
$$
\begin{aligned}
&R(X_2) : X_2 \text{ is } A_2, \\
&R(X_7) : X_7 \text{ is } A_7, \\
&R(X_9) : X_9 \text{ is } A_9, \\
&R(X_{12}) : X_{12} \text{ is } A_{12}, \\
&R(X_J \mid q_{\bar{J}}) : \sum_{i \in J} q_i = 186 - \sum_{j \in \bar{J}} q_j,
\end{aligned}
$$
where $J = \{2, 7, 9, 12\}$, $\bar{J} = \{1, 3, 4, 5, 6, 8, 10, 11\}$.


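For this practical example, we chose the following membership
functions for the fuzzy sets $A_i$, $i \in J$ (Figs. 2–5):
$$
\mu_2(x) =
\begin{cases}
1, & x \le 20, \\
1 - 2\left(\dfrac{x - 20}{47}\right)^2, & 20 \le x \le 43.5, \\
2\left(\dfrac{x - 67}{47}\right)^2, & 43.5 \le x \le 67, \\
0, & x \ge 67,
\end{cases}
$$
$$
\mu_7(x) =
\begin{cases}
1, & x \le 25, \\
1 - 2\left(\dfrac{x - 25}{5}\right)^2, & 25 \le x \le 27.5, \\
2\left(\dfrac{x - 30}{5}\right)^2, & 27.5 \le x \le 30, \\
0, & x \ge 30,
\end{cases}
$$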


Fig. 2 Membership function associated with the decreasing fuzzy restriction for the second signal
element for the example

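$$
\mu_9(x) =
\begin{cases}
1, & x \le 25, \\
1 - 2\left(\dfrac{x - 25}{3}\right)^2, & 25 \le x \le 26.5, \\
2\left(\dfrac{x - 28}{3}\right)^2, & 26.5 \le x \le 28, \\
0, & x \ge 28,
\end{cases}
$$
$$
\mu_{12}(x) =
\begin{cases}
1, & x \le 25, \\
1 - 2\left(\dfrac{x - 25}{13}\right)^2, & 25 \le x \le 31.5, \\
2\left(\dfrac{x - 38}{13}\right)^2, & 31.5 \le x \le 38, \\
0, & x \ge 38.
\end{cases}
$$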
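The four decreasing membership functions of this example share one shape and differ only in their two breakpoints, so they can be generated from a single helper. The Python sketch below assumes the standard Z-shaped (piecewise quadratic) form, which matches the breakpoints and denominators listed above; the name `z_membership` and the dictionary `mu` are introduced here for illustration only.

```python
def z_membership(a, b):
    """Decreasing membership function: 1 for x <= a, 0 for x >= b,
    with a piecewise quadratic transition that equals 0.5 at the midpoint."""
    width = b - a
    mid = (a + b) / 2

    def mu(x):
        if x <= a:
            return 1.0
        if x <= mid:
            return 1.0 - 2.0 * ((x - a) / width) ** 2
        if x <= b:
            return 2.0 * ((x - b) / width) ** 2
        return 0.0

    return mu


# Breakpoints for signal elements 2, 7, 9, and 12 as given in the text.
mu = {
    2: z_membership(20, 67),
    7: z_membership(25, 30),
    9: z_membership(25, 28),
    12: z_membership(25, 38),
}
```

For instance, `mu[2](43.5)` evaluates to 0.5, the crossover point of the restriction on the second signal element.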
Indexes from $J$ were chosen to appear in the first column of individuals in the
MMA population; indexes from the complementary set $\bar{J}$ (the remaining signal elements) were chosen to appear in the third column.
To minimize the distortion introduced into the microfile, we took "Sex", "Age",
"Hispanic or Latino Origin", "Marital Status", "Educational Attainment", "Citizenship Status", and "Person's Total Income in 1999" as the influential attributes. We
considered all these attributes to be categorical ones. To simplify the matter, we
chose the following parameters of (1): k = 1 k = 1, 7, 1 = 1, 2 = 0. In this case,
the metric (1) shows the number of attribute values to be altered during one swap of
the records between the submicrofiles.
To prevent individuals in the MMA from growing indefinitely, we used the following penalty term, which depends on the number of rows $Q_U$ of the individual $U$ (Fig. 6):
$$
\frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}}
$$


The fitness function (14) for the first phase of the example is as follows:
$$
f_{ex1}(U) = \frac{1099 - \sum_{i=1}^{Q} \sum_{k=1}^{7} \left| \operatorname{sign}\bigl( M_{u_{i1}}(u_{i2}, W_k) - M_{u_{i3}}(u_{i4}, W_k) \bigr) \right|}{1099}
+ \sum_{j \in J} \mu_j\bigl(|U|(j)\bigr)
+ \frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}},
$$
where $\operatorname{sign}(\cdot)$ is a function yielding $-1$ if its argument is negative, 0 if it equals 0,
and 1 if it is positive; $W_k$, $k = 1, \dots, 7$, is the $k$th influential attribute; $M_j(i, W_k)$ returns
the value of the attribute $W_k$ of the $i$th record in submicrofile $M_j$.
We chose the swap mutation [29] as mutation operators M1 and M2 , and the
random resetting mutation [15, p. 43] as mutation operators M3 and M4 . We decided
to apply tournament selection [30] as an efficient and easy-to-implement selection
operator, with the tournament size 5.
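Tournament selection with tournament size 5 can be sketched as follows. This is a generic textbook version rather than the authors' code: the parallel-list layout of `population` and `fitness`, sampling contestants without replacement, and the `maximize` flag (the direction of optimization is not restated in this section) are assumptions made for the illustration.

```python
import random


def tournament_select(population, fitness, size=5, maximize=True, rng=random):
    """Pick one parent: draw `size` distinct contestants, return the best one.

    `fitness[i]` is the fitness of `population[i]`; the population must
    contain at least `size` individuals.
    """
    contestants = rng.sample(range(len(population)), size)
    pick = max if maximize else min
    winner = pick(contestants, key=lambda idx: fitness[idx])
    return population[winner]
```

Calling the function once per required parent yields a mating pool in which fitter individuals are favored without ever needing global fitness statistics.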
The population was initialized by randomly generating matrices with different
numbers of rows. Elements of the first column were generated with probabilities
proportional to the values of the corresponding elements of q. Elements of the
third column were generated with probabilities proportional to the total numbers
of records in corresponding submicrofiles.
During the MMA run, we applied linear fitness scaling [31, p. 79] to prevent premature convergence. We also multiplied the mutation probabilities by a factor of
10 whenever the standard deviation of the population fitness values dropped below
0.03.
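The adaptive mutation rule just described can be expressed in a few lines of Python. The sketch makes assumptions that the text does not settle: it uses the population standard deviation (`statistics.pstdev`), applies the factor to the base probabilities for the current generation only, and caps the result at 1.0 so that it remains a valid probability. The function name is hypothetical.

```python
from statistics import pstdev


def boosted_mutation_rates(base_rates, fitness_values, threshold=0.03, factor=10.0):
    """Return mutation probabilities, multiplied by `factor` when the
    population fitness values have become too uniform."""
    if pstdev(fitness_values) < threshold:
        # Low diversity: boost mutation, capping at 1.0 (cap is an assumption).
        return [min(1.0, p * factor) for p in base_rates]
    return list(base_rates)
```

With base probabilities for $M_1, \dots, M_4$ passed as `base_rates`, the rule leaves them unchanged while the population is diverse and raises them tenfold once the fitness values cluster together.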

Fig. 3 Membership function associated with the decreasing fuzzy restriction for the seventh signal
element for the example


Fig. 4 Membership function associated with the decreasing fuzzy restriction for the ninth signal
element for the example

6.3 The Second Phase of the Algorithm


Among 3000 solutions obtained as the result of the first phase of applying MMA,
only 754 (25.133%) were feasible ones. Two solutions with the lowest cumulative
metrics (1) are given in Fig. 8(a) (dashed-dotted and dotted lines). The mean cumulative metric (1) was 57.901.
The majority of solutions were subfeasible ones (1837, or 61.233%). We divided
them into several clusters, the most prominent ones presented in Table 1.

Fig. 5 Membership function associated with the decreasing fuzzy restriction for the twelfth signal
element for the example


Fig. 6 Penalty term heavily penalizing individuals with more than 100 rows
Table 1 Clusters obtained after the first phase of the memetic modifying algorithm
Quantity Signal Elements to Increase   Cluster Size   Mean Metric
1 and 6                                78             45.436
3 and 6                                84             46.048
3 and 10                               26             46.269
4 and 6                                43             48.488
6 and 8                                183            46.519
8 and 10                               101            44.238

As can be deduced from Table 1, it is reasonable to choose for the second phase the solutions
in which the values of elements 8 and 10 should be increased, since this cluster has the smallest mean metric (44.238). This leads
us to the following increasing restriction functions (Fig. 7):

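$$
\mu_8(x) = \mu_{10}(x) =
\begin{cases}
0, & x \le 15, \\
2\left(\dfrac{x - 15}{12}\right)^2, & 15 \le x \le 21, \\
1 - 2\left(\dfrac{x - 27}{12}\right)^2, & 21 \le x \le 27, \\
1, & x \ge 27.
\end{cases}
$$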
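The increasing restriction for the eighth and tenth signal elements can be sketched in Python in the same spirit. The code assumes an S-shaped (piecewise quadratic) function that rises from 0 at $x \le 15$ to 1 at $x \ge 27$, which is what the definition of an increasing fuzzy restriction in Sect. 5 requires; the name `s_membership` is introduced here for illustration.

```python
def s_membership(a, b):
    """Increasing membership function: 0 for x <= a, 1 for x >= b,
    with a piecewise quadratic transition that equals 0.5 at the midpoint."""
    width = b - a
    mid = (a + b) / 2

    def mu(x):
        if x <= a:
            return 0.0
        if x <= mid:
            return 2.0 * ((x - a) / width) ** 2
        if x <= b:
            return 1.0 - 2.0 * ((x - b) / width) ** 2
        return 1.0

    return mu


# Breakpoints 15 and 27 as given in the text for elements 8 and 10.
mu_8 = mu_10 = s_membership(15, 27)
```

Under this form, a modified signal value of 21 for element 8 or 10 is compatible with the restriction to degree 0.5, and any value of 27 or more is fully compatible.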

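The fitness function (14) for the second phase of the example is as follows:
$$
f_{ex2}(U) = \frac{1099 - \sum_{i=1}^{Q} \sum_{k=1}^{7} \left| \operatorname{sign}\bigl( M_{u_{i1}}(u_{i2}, W_k) - M_{u_{i3}}(u_{i4}, W_k) \bigr) \right|}{1099}
+ \sum_{j \in J \cup \{8, 10\}} \mu_j\bigl(|U|(j)\bigr)
+ \frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}}.
$$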

Fig. 7 Membership functions associated with the increasing fuzzy restriction function for the
eighth and tenth signal elements from the example

Among 3000 solutions obtained as the result of the second phase of applying
MMA, 2693 (or 89.767%) were feasible. Two solutions with the lowest
cumulative metrics (1) are given in Fig. 8(b) (dashed-dotted and dotted lines). The mean
cumulative metric (1) was 47.873. In other words, it is sufficient to alter as few as
0.005% of the microfile attribute values in order to provide group anonymity.
These results are better than the ones obtained in [7], even though restrictions imposed on the solution are stricter here. In our opinion, this demonstrates the practical
utility of the two-phase approach to solving the task of providing group anonymity.
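The 0.005% figure quoted above can be checked with a few lines of arithmetic. The sketch assumes that "microfile attribute values" refers to the seven influential attributes of all 141,838 records; the text does not spell out this denominator, so it is an interpretation rather than a stated fact.

```python
records = 141_838            # records taken for analysis (Sect. 6.1)
influential_attributes = 7   # attributes entering metric (1)
mean_metric = 47.873         # mean cumulative metric after the second phase

# Assumed denominator: every influential attribute value in the microfile.
total_values = records * influential_attributes
share_percent = 100 * mean_metric / total_values

print(f"{total_values} attribute values, {share_percent:.4f}% altered on average")
```

Under this assumption the share comes out just below 0.005%, in line with the rounded value reported in the text.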

7 Conclusion and Future Research


Combining local search techniques with evolutionary algorithms to increase the efficiency of the latter has become a widely accepted practice. This idea can be
applied to solving real-life problems in quite diverse ways, yielding results of varying quality.
In this work, we proposed a two-phase approach to applying memetic algorithms to the task of providing group anonymity, which can provide superior
results. During the first phase, feasible solutions to the optimization task are obtained, and potential ways of improving them are discovered. During the second
phase, we can formulate more concise constraints, which leads to solutions of better quality.
The presented experimental results show the two-phase memetic algorithm to be worthy of practical interest.
However, several issues need further investigation, for example, automatic clustering of the results obtained after the first phase, enhancing algorithm efficiency by
choosing appropriate components, and analyzing algorithm efficiency as a function of
its parameters.

Fig. 8 Initial (solid line) and modified quantity signals: (a) feasible one with the metric 40 (dashed-dotted line), feasible one with the metric 43 (dotted line), subfeasible one (dashed line); (b) the one
with the metric 37 (dashed-dotted line), the one with the metric 38 (dotted line)

References
1. Brooks, D.: The Social Animal: The Hidden Sources of Love, Character, and Achievement.
Random House Trade Paperbacks, New York (2011)
2. Aggarwal, C. C., Yu, P. S.: A general survey of privacy-preserving data mining: models and
algorithms. In: Aggarwal, C. C., Yu, P. S. (eds.) Privacy-Preserving Data Mining: Models and
Algorithms. Advances in Database Systems, vol. 34, pp. 11–52. Springer Science+Business
Media, LLC, New York (2008)
3. Fung, B., Wang, K., Chen, R., Yu, P.: Privacy-preserving data publishing: a survey of recent
developments. ACM Comp. Surv. 42(4), 1–53 (2010)


4. Sowmyarani, C. N., Srinivasan, G. N.: Survey on recent developments in privacy preserving
models. Int. J. of Comput. Appl. 38(9), 18–22 (2012)
5. Pfitzmann, A., Hansen, M.: A terminology for talking about privacy by data minimization:
anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management, Version v0.34, http://dud.inf.tu-dresden.de/Anon_Terminology.shtml (2010)
6. Chertov, O., Pilipyuk, A.: Statistical disclosure control methods for microdata. In: 2009 International Symposium on Computing, Communication, and Control. Proc. of CSIT, vol. 1,
pp. 339–343. IACSIT Press, Singapore (2011)
7. Chertov, O., Tavrov, D.: Memetic algorithm for solving the task of providing group anonymity.
In: Jamshidi, M., Kreinovich, V., Kacprzyk, J. (eds.) Advanced Trends in Soft Computing.
Studies in Fuzziness and Soft Computing, vol. 312, pp. 281–292. Springer International Publishing Switzerland (2014)
8. Liu, L., Wang, J., Zhang, J.: Wavelet-based data perturbation for simultaneous privacy-preserving and statistics-preserving. In: 2008 IEEE International Conference on Data Mining
Workshops, pp. 27–35. IEEE Computer Society Press (2008)
9. Chertov, O., Tavrov, D.: Providing group anonymity using wavelet transform. In: MacKinnon, L. M.
(ed.) Data Security and Security Data. LNCS, vol. 6121, pp. 25–36. Springer-Verlag, Berlin, Heidelberg (2012)
10. Tavrov, D., Chertov, O.: SSA-caterpillar in group anonymity. Paper presented at the World
Conference on Soft Computing, San Francisco, CA, 23–26 May 2011
11. Ahuja, R. K., Magnanti, T. L., Orlin, J. B.: Network Flows. Theory, Algorithms, and Applications. Prentice Hall, Upper Saddle River (1993)
12. Chertov, O. (ed.): Group Methods of Data Processing. Lulu.com, Raleigh (2010)
13. Meyerson, A., Williams, R.: General k-anonymization is hard. Carnegie Mellon School of
Computer Science, Tech. Rep. CMU-CS-03-113 (2003)
14. Moscato, P.: On evolution, search, optimization, genetic algorithms and martial arts: toward
memetic algorithms. Caltech Concurrent Computation Program, Caltech, CA, C3P Rep. 826
(1989)
15. Eiben, A. E., Smith, J. E.: Introduction to Evolutionary Computing. Springer-Verlag, Berlin,
Heidelberg (2007)
16. Dawkins, R.: The Selfish Gene: 30th Anniversary Edition. Oxford University Press, Oxford,
New York (2006)
17. Ray, T., Sarker, R.: Memetic algorithms in constrained optimization. In: Neri, F., Cotta, C.,
Moscato, P. (eds.) Handbook of Memetic Algorithms, pp. 135–151. Springer-Verlag, Berlin,
Heidelberg (2012)
18. Kumar, S., Sharma, V. K., Kumari, R.: Improved onlooker bee phase in artificial bee colony
algorithm. Int. J. of Comput. Appl. 90(6), 31–39 (2014)
19. Smith, A. E., Coit, D. W.: Penalty functions. In: Bäck, T., Fogel, D. B., Michalewicz, Z. (eds.)
Evolutionary Computation 2. Advanced Algorithms and Operators, pp. 41–48. Institute of
Physics Publishing, Bristol, Philadelphia (2000)
20. Michalewicz, Z.: Repair algorithms. In: Bäck, T., Fogel, D. B., Michalewicz, Z. (eds.) Evolutionary Computation 2. Advanced Algorithms and Operators, pp. 56–61. Institute of Physics
Publishing, Bristol, Philadelphia (2000)
21. Michalewicz, Z.: Decoders. In: Bäck, T., Fogel, D. B., Michalewicz, Z. (eds.) Evolutionary
Computation 2. Advanced Algorithms and Operators, pp. 49–55. Institute of Physics Publishing, Bristol, Philadelphia (2000)
22. Chertov, O. R., Tavrov, D. Y.: Memetychnyi alhorytm dlia modyfikatsii mikrofailu z minimizatsiieiu spotvoren u protsesi zabezpechennia hrupovoi anonimnosti (Memetic algorithm
for microfile modification with distortion minimization while providing group anonymity).
Shtuchnyi Intellekt 3(61), 399–410 (2013)
23. Zadeh, L. A.: Calculus of fuzzy restrictions. In: Zadeh, L. A., Fu, K.-S., Tanaka, K., Shimura,
M. (eds.) Fuzzy Sets and Their Applications to Cognitive and Decision Processes, pp. 1–39.
Academic Press, New York (1975)
24. Zadeh, L. A.: Toward a restriction-centered theory of truth and meaning (RCT). Inf. Sci. 248,
1–14 (2013)


25. Zadeh, L. A.: Fuzzy sets. Inf. and Cont. 8, 338–353 (1965)
26. Zadeh, L. A.: The concept of a linguistic variable and its application to approximate
reasoning-II. Inf. Sci. 8, 301–357 (1975)
27. Goldberg, D. E., Korb, B., Deb, K.: Messy genetic algorithms: motivation, analysis, and first
results. Comp. Sys. 3, 493–530 (1989)
28. U. S. Census 2000. 5-Percent Public Use Microdata Sample Files,
http://www.census.gov/census2000/PUMS5.html
29. Syswerda, G.: Schedule optimization using genetic algorithms. In: Davis, L. (ed.) Handbook
of Genetic Algorithms, pp. 332–349. Van Nostrand Reinhold, New York (1991)
30. Brindle, A.: Genetic algorithms for function optimization. Doctoral Dissertation, Department
of Computer Science, Tech. Rep. TR81-2, University of Alberta (1981)
31. Goldberg, D. E.: Genetic Algorithms in Search, Optimization, and Machine Learning.
Addison-Wesley (1989)
