
A Novel Anonymization Technique for Privacy Preserving Data Publishing

Introduction
A naive data publishing approach may provide insufficient protection, so privacy must be provided for microdata publishing.

Privacy protection is an important issue in data processing.


Data must be anonymized to protect privacy.

Privacy preserving data publishing provides methods for publishing useful information while protecting individual privacy.

Microdata contains records, each of which holds information about an individual entity such as a person or an organization. The data mining community has focused on hiding sensitive rules generated from transactional databases. Before anonymizing the data, one can analyze its characteristics and use them in the anonymization.

Literature Survey
In both the generalization and bucketization approaches, attributes are partitioned into three categories:
1) Identifiers, which can uniquely identify an individual, such as Name or Social Security Number.
2) Quasi-Identifiers (QIs), which the adversary may already know and which, taken together, can potentially identify an individual, e.g., Birthdate, Sex, and Zip code.
3) Sensitive Attributes (SAs), which are unknown to the adversary and are considered sensitive, such as Disease and Salary.
In both techniques, one first removes identifiers from the data and then partitions the tuples into buckets. Generalization then transforms the QI values in each bucket into coarser, less specific values, whereas bucketization separates the SAs from the QIs by randomly permuting the SA values within each bucket; the sketch below illustrates both transformations on a toy table.
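As a toy illustration (not the project's actual implementation), the following Python sketch applies both transformations to a small table; the bucket assignment, the Age ranges, and the Zip-code masking rule are assumptions made purely for demonstration.

import random

# Toy microdata: QIs = (age, sex, zipcode), SA = disease.
rows = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 24, "sex": "F", "zipcode": "47906", "disease": "flu"},
    {"age": 52, "sex": "M", "zipcode": "47905", "disease": "bronchitis"},
    {"age": 56, "sex": "F", "zipcode": "47905", "disease": "gastritis"},
]
buckets = [rows[:2], rows[2:]]  # assumed horizontal partition into two buckets

def generalize(bucket):
    # Generalization: replace each QI value with a coarser value shared by the whole bucket.
    ages = [r["age"] for r in bucket]
    age_range = "[{0}-{1}]".format(min(ages), max(ages))
    zip_mask = bucket[0]["zipcode"][:4] + "*"
    return [{"age": age_range, "sex": "*", "zipcode": zip_mask, "disease": r["disease"]}
            for r in bucket]

def bucketize(bucket):
    # Bucketization: keep the QI values intact but randomly permute the SA values.
    diseases = [r["disease"] for r in bucket]
    random.shuffle(diseases)
    return [dict(r, disease=d) for r, d in zip(bucket, diseases)]

print([generalize(b) for b in buckets])  # generalized table
print([bucketize(b) for b in buckets])   # bucketized table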

Privacy threats
When publishing microdata, there are three types of privacy disclosure threats: 1) membership disclosure, where the adversary learns whether an individual's record is present in the published data; 2) identity disclosure, where an individual is linked to a specific published record; and 3) attribute disclosure, where the adversary learns the value of an individual's sensitive attribute.

Problem Specification
The anonymization techniques for privacy preserving microdata publishing are: 1) Generalization and 2) Bucketization.

Generalization loses a considerable amount of information, especially for high-dimensional data, due to the curse of dimensionality. Bucketization gives better data utility than generalization but has its own drawbacks.

Bucketization does not prevent membership disclosure. It requires a clear separation between QIs and SAs, yet in many data sets it is unclear which attributes are QIs and which are SAs. Moreover, by separating the QI attributes from the SAs, bucketization breaks the correlation between these attributes.

To overcome the drawbacks of the above approaches, a novel technique called slicing is implemented.

Slicing
Slicing partitions the data set both vertically and horizontally. Attributes are grouped into columns, where each column contains a subset of attributes (vertical partitioning). Tuples are partitioned into buckets, where each bucket contains a subset of tuples (horizontal partitioning).

Within each bucket, the values in each column are randomly permuted to break the linking between different columns.
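A minimal sketch of this slicing step, assuming the column grouping and the bucket assignment are already given (in the full technique they come from attribute clustering and tuple partitioning, described later):

import random

def slice_table(rows, columns, buckets):
    # Vertical partition: `columns` groups the attribute names.
    # Horizontal partition: `buckets` groups row indices.
    # Within each bucket, the values of each column are shuffled so that
    # the link between different columns is broken.
    sliced = []
    for bucket in buckets:
        pieces = []
        for col in columns:
            values = [tuple(rows[i][a] for a in col) for i in bucket]
            random.shuffle(values)
            pieces.append(values)
        sliced.append(list(zip(*pieces)))  # one permuted row per original tuple
    return sliced

rows = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 24, "sex": "F", "zipcode": "47906", "disease": "flu"},
    {"age": 52, "sex": "M", "zipcode": "47905", "disease": "bronchitis"},
    {"age": 56, "sex": "F", "zipcode": "47905", "disease": "gastritis"},
]
columns = [("age", "sex"), ("zipcode", "disease")]  # assumed vertical partition
buckets = [[0, 1], [2, 3]]                          # assumed horizontal partition
print(slice_table(rows, columns, buckets))

Each sliced row then pairs an (age, sex) value with a (zipcode, disease) value drawn from the same bucket but not necessarily from the same original tuple, which is how the cross-column linking is broken.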

Slicing Algorithms
Our algorithm consists of three phases: 1) attribute partitioning, 2) column generalization, and 3) tuple partitioning.

Flow Chart for Project Design


Algorithm
The algorithm maintains two data structures: a queue of buckets Q and a set of sliced buckets SB. Initially, Q contains only one bucket which includes all tuples and SB is empty. In each iteration, the algorithm removes a bucket from Q and splits the bucket into two buckets.

If the sliced table after the split satisfies l-diversity, then the algorithm puts the two buckets at the end of the queue Q.

Cont..
Otherwise, we cannot split the bucket anymore and the algorithm puts the bucket into SB. When Q becomes empty, we have computed the sliced table. The set of sliced buckets is SB. The main part of the tuple-partition algorithm is to check whether a sliced table satisfies l-diversity.


Tuple Partitioning Algorithm


Algorithm tuple-partition(T, ℓ)
1. Q = {T}; SB = ∅.
2. while Q is not empty
3.    remove the first bucket B from Q; Q = Q ∖ {B}.
4.    split B into two buckets B1 and B2.
5.    if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.       Q = Q ∪ {B1, B2}.
7.    else SB = SB ∪ {B}.
8. return SB.
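A Python rendering of the tuple-partition pseudocode above. It is only a sketch: split() and diversity_check() are supplied by the caller (a simplified diversity check is sketched after the diversity-check pseudocode below), and the single-tuple guard is an addition so the loop always terminates.

from collections import deque

def tuple_partition(T, ell, split, diversity_check):
    # Q: queue of buckets still eligible for splitting; SB: finished sliced buckets.
    Q = deque([list(T)])
    SB = []
    while Q:
        B = Q.popleft()
        if len(B) < 2:  # guard added in this sketch: a single-tuple bucket cannot be split
            SB.append(B)
            continue
        B1, B2 = split(B)
        # Keep the split only if the whole candidate bucketing stays l-diverse.
        if diversity_check(T, list(Q) + [B1, B2] + SB, ell):
            Q.append(B1)
            Q.append(B2)
        else:
            SB.append(B)
    return SB

A trivial split() for experimentation could simply halve the bucket, e.g. lambda B: (B[:len(B)//2], B[len(B)//2:]); a split on a quasi-identifier median would be closer to the intended algorithm.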

Diversity Check Algorithm


Algorithm diversity-check(T, T*, ℓ)
1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.    record f(v) for each column value v in bucket B.
4.    for each tuple t ∈ T
5.       calculate p(t, B) and find D(t, B).
6.       L[t] = L[t] ∪ {(p(t, B), D(t, B))}.
7. for each tuple t ∈ T
8.    calculate p(t, s) for each sensitive value s based on L[t].
9.    if p(t, s) > 1/ℓ, return false.
10. return true.
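A simplified stand-in for diversity-check, usable with the tuple_partition sketch above. It only verifies that, inside every candidate bucket, no sensitive value accounts for more than a 1/ℓ fraction of the tuples; the full algorithm additionally weights each bucket by the matching probability p(t, B) of every tuple. The sensitive-attribute name "disease" is merely an assumed default.

from collections import Counter

def diversity_check_simple(T, buckets, ell, sa="disease"):
    # T is unused in this simplified version; it is kept only for signature
    # parity with the pseudocode's diversity-check(T, T*, l).
    for B in buckets:
        if not B:
            continue
        counts = Counter(row[sa] for row in B)
        if max(counts.values()) > len(B) / ell:
            return False
    return True

# Example use with the tuple_partition sketch and the toy rows shown earlier:
# halve = lambda B: (B[:len(B)//2], B[len(B)//2:])
# SB = tuple_partition(rows, 2, halve, diversity_check_simple)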

Examples for Anonymization Techniques

This is the original microdata table and its anonymized versions produced by the different anonymization techniques.

The above figure consists of QIs and an SA: Age, Sex, and Zipcode are the QIs and Disease is the SA. The generalized table satisfies 4-anonymity.

The above table is the bucketized table, which satisfies 2-diversity.

The above tables show multiset-based generalization and one-attribute-per-column slicing, and the table below shows the sliced table.

Design and Implementation


Use case diagram
[Use case diagram: the User interacts with the Data Slicing, Diverse Slicing, Correlation Measure, Attribute Clustering, and Tuple Partitioning use cases.]

It shows the activities of the user. The user can provide privacy to the microdata by generalizing the records, dividing them into a number of buckets, and breaking the correlation between attributes. The attributes of the table are sliced by performing random permutation and applying the probability function.

Class diagram

It shows how the probability functions are calculated and how values are randomly permuted. It has five different classes with their attributes and methods, describing how the data is retrieved and how the methods are applied.

Modules
1) Data slicing
2) Diverse slicing
3) Correlation measure
4) Attribute clustering
5) Tuple partitioning

The attributes of the tables are used for slicing and bucketization. Slicing first partitions attributes into columns and then partitions tuples into buckets. Diverse slicing extends this analysis to the general case, introducing the notion of ℓ-diverse slicing and applying the probability function to the attributes.

Two correlation measures are used: one for measuring the correlation between two continuous attributes and one for two categorical attributes. After computing the correlations for each pair of attributes, we use clustering to partition the attributes into columns. Tuple partitioning is then performed, in which tuples are partitioned into buckets.
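A sketch of these two steps, assuming (as in the slicing paper by Li et al.) the Pearson correlation coefficient for continuous attributes and the mean-square contingency coefficient for categorical attributes; the greedy pairing at the end is only a stand-in for the k-medoid clustering used there.

import math
from collections import Counter

def pearson(xs, ys):
    # Correlation between two continuous attributes.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mean_square_contingency(xs, ys):
    # Correlation between two categorical attributes, normalized to [0, 1].
    n = len(xs)
    fx, fy, fxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    phi2 = 0.0
    for a, na in fx.items():
        for b, nb in fy.items():
            expected = (na / n) * (nb / n)
            observed = fxy.get((a, b), 0) / n
            phi2 += (observed - expected) ** 2 / expected
    d = min(len(fx), len(fy))
    return phi2 / (d - 1) if d > 1 else 0.0

def greedy_columns(attrs, corr):
    # Stand-in for k-medoid clustering: repeatedly pair the two most
    # correlated attributes that are not yet assigned to a column.
    remaining = set(attrs)
    columns = []
    while len(remaining) > 1:
        a, b = max(((x, y) for x in remaining for y in remaining if x < y),
                   key=lambda p: corr[p])
        columns.append((a, b))
        remaining -= {a, b}
    columns.extend((a,) for a in remaining)
    return columns

Here attrs is the list of attribute names and corr is assumed to map each alphabetically ordered attribute pair to its measured correlation; highly correlated attributes end up in the same column, and tuple partitioning then proceeds as sketched earlier.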

H/W and S/W Specifications

Hardware: (1) 2 GB RAM, (2) 320 GB hard disk, (3) Intel processor (P3).
Software: (1) Visual Studio, (2) Windows XP/7.

Results

Retrieving the dataset file
Showing the dataset files to display the path of the file
Selecting the dataset test file
Displaying the path of the dataset file
Displaying the message in the textbox when the user cancels browsing
Displaying the file path of the dataset attribute
Displaying the attributes from the dataset
Selecting the attribute set from the dropdown list
Displaying the attribute domains of the attribute set
Displaying the tuples
Entering the number of buckets
Displaying the generalization values in both text and table format
Displaying the bucketization values in text format
Displaying the values which have a clear separation between Quasi-Identifiers and Sensitive Attributes
Displaying the diversity mapping based on the salary values
Displaying the sliced values in the textbox
Displaying the sliced sets in the table
Showing the probability based on the countries and salary of all the buckets
Providing security for the sliced sets and storing the values in the database
Duplicating the attributes
Displaying the duplicate attributes in the table format

Comparison
Graph showing the time (msec) consumed by the three techniques: Anonymity, Diversity, and Slicing.

Comparative graph showing the time (msec) consumed by Slicing and Enhanced Slicing.

Conclusion
A dataset is taken and anonymization techniques are applied to it to protect the privacy of the microdata. The attribute values are considered and the probability functions are applied. The generalization, bucketization, and slicing techniques are implemented. DES is used to provide security for the sliced-set table. Overlapping slicing is performed, which duplicates an attribute in more than one column. The comparison, based on the time consumed, shows that slicing preserves better data utility than generalization and bucketization.
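The DES step mentioned above is not detailed in the slides; a minimal sketch, assuming the pycryptodome library, an illustrative 8-byte key, and rows serialized as plain strings (key management and the real serialization format are outside this sketch):

from Crypto.Cipher import DES                # pycryptodome
from Crypto.Util.Padding import pad, unpad

KEY = b"8bytekey"                            # hypothetical key; DES keys are 8 bytes long

def encrypt_row(row_text):
    # Encrypt one serialized row of the sliced table before storing it.
    cipher = DES.new(KEY, DES.MODE_ECB)      # ECB mode only to keep the sketch short
    return cipher.encrypt(pad(row_text.encode("utf-8"), DES.block_size))

def decrypt_row(blob):
    cipher = DES.new(KEY, DES.MODE_ECB)
    return unpad(cipher.decrypt(blob), DES.block_size).decode("utf-8")

# e.g. store encrypt_row("([22-24], *) | (4790*, flu)") in the database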

Future Work
Slicing gives better privacy than generalization and bucketization, but there is still scope in the future to further increase the privacy of microdata publishing by using different anonymization techniques.

References
[1] T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.
[2] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.
[3] B.-C. Chen, K. LeFevre, and R. Ramakrishnan, "Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 770-781, 2007.
[4] C. Dwork, "Differential Privacy: A Survey of Results," Proc. Fifth Int'l Conf. Theory and Applications of Models of Computation (TAMC), pp. 1-19, 2008.

Thank you..

