
A Novel Anonymization Technique for Privacy Preserving Data Publishing

Introduction
A naive data publishing approach may provide insufficient protection, so privacy must be provided for microdata publishing.

Privacy protection is an important issue in data processing.


Data must be anonymized to protect privacy.

Privacy preserving data publishing provides methods for publishing useful information while protecting individual privacy.

Microdata contains records, each of which holds information about an individual entity such as a person or an organization. The data mining community has focused on hiding sensitive rules generated from transactional databases. Before anonymizing the data, one can analyze its characteristics and use them in the anonymization.

Literature Survey
In both the generalization and bucketization approaches, attributes are partitioned into three categories:
1) Identifiers, which can uniquely identify an individual, such as Name or Social Security Number.
2) Quasi-Identifiers (QIs), which the adversary may already know and which, taken together, can potentially identify an individual, e.g., Birthdate, Sex, and Zip code.
3) Sensitive Attributes (SAs), which are unknown to the adversary and are considered sensitive, such as Disease and Salary.
In both techniques, one first removes identifiers from the data and then partitions the tuples into buckets. Generalization then transforms the QI values in each bucket into coarser, less specific values, whereas bucketization separates the SAs from the QIs by randomly permuting the SA values within each bucket; the sketch below illustrates both transformations on a toy table.
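As a toy illustration (not the project's actual implementation), the following Python sketch applies both transformations to a small table; the bucket assignment, the Age ranges, and the Zip-code masking rule are assumptions made purely for demonstration.

import random

# Toy microdata: QIs = (age, sex, zipcode), SA = disease.
rows = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 24, "sex": "F", "zipcode": "47906", "disease": "flu"},
    {"age": 52, "sex": "M", "zipcode": "47905", "disease": "bronchitis"},
    {"age": 56, "sex": "F", "zipcode": "47905", "disease": "gastritis"},
]
buckets = [rows[:2], rows[2:]]  # assumed horizontal partition into two buckets

def generalize(bucket):
    # Generalization: replace each QI value with a coarser value shared by the whole bucket.
    ages = [r["age"] for r in bucket]
    age_range = "[{0}-{1}]".format(min(ages), max(ages))
    zip_mask = bucket[0]["zipcode"][:4] + "*"
    return [{"age": age_range, "sex": "*", "zipcode": zip_mask, "disease": r["disease"]}
            for r in bucket]

def bucketize(bucket):
    # Bucketization: keep the QI values intact but randomly permute the SA values.
    diseases = [r["disease"] for r in bucket]
    random.shuffle(diseases)
    return [dict(r, disease=d) for r, d in zip(bucket, diseases)]

print([generalize(b) for b in buckets])  # generalized table
print([bucketize(b) for b in buckets])   # bucketized table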

Privacy threats
When publishing microdata, there are three types of privacy disclosure threats: 1) membership disclosure, where the adversary learns whether an individual's record is present in the published data; 2) identity disclosure, where an individual is linked to a specific published record; and 3) attribute disclosure, where the adversary learns the value of an individual's sensitive attribute.

Problem Specification
The anonymization techniques for privacy preserving microdata publishing are: 1) Generalization and 2) Bucketization.

Generalization loses a considerable amount of information, especially for high-dimensional data, due to the curse of dimensionality. Bucketization gives better data utility than generalization but has its own drawbacks.

Bucketization does not prevent membership disclosure. It requires a clear separation between QIs and SAs, yet in many data sets it is unclear which attributes are QIs and which are SAs. Moreover, by separating the QI attributes from the SAs, bucketization breaks the correlation between these attributes.

To overcome the drawbacks of the above approaches, a novel technique called slicing is implemented.

Slicing
Slicing partitions the data set both vertically and horizontally. Attributes are grouped into columns, where each column contains a subset of attributes (vertical partitioning). Tuples are partitioned into buckets, where each bucket contains a subset of tuples (horizontal partitioning).

Within each bucket, the values in each column are randomly permuted to break the linking between different columns.
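A minimal sketch of this slicing step, assuming the column grouping and the bucket assignment are already given (in the full technique they come from attribute clustering and tuple partitioning, described later):

import random

def slice_table(rows, columns, buckets):
    # Vertical partition: `columns` groups the attribute names.
    # Horizontal partition: `buckets` groups row indices.
    # Within each bucket, the values of each column are shuffled so that
    # the link between different columns is broken.
    sliced = []
    for bucket in buckets:
        pieces = []
        for col in columns:
            values = [tuple(rows[i][a] for a in col) for i in bucket]
            random.shuffle(values)
            pieces.append(values)
        sliced.append(list(zip(*pieces)))  # one permuted row per original tuple
    return sliced

rows = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 24, "sex": "F", "zipcode": "47906", "disease": "flu"},
    {"age": 52, "sex": "M", "zipcode": "47905", "disease": "bronchitis"},
    {"age": 56, "sex": "F", "zipcode": "47905", "disease": "gastritis"},
]
columns = [("age", "sex"), ("zipcode", "disease")]  # assumed vertical partition
buckets = [[0, 1], [2, 3]]                          # assumed horizontal partition
print(slice_table(rows, columns, buckets))

Each sliced row then pairs an (age, sex) value with a (zipcode, disease) value drawn from the same bucket but not necessarily from the same original tuple, which is how the cross-column linking is broken.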

Slicing Algorithms
Our algorithm consists of three phases: 1) attribute partitioning, 2) column generalization, and 3) tuple partitioning.

Flow Chart for Project Design


Algorithm
The algorithm maintains two data structures: a queue of buckets Q and a set of sliced buckets SB. Initially, Q contains only one bucket which includes all tuples and SB is empty. In each iteration, the algorithm removes a bucket from Q and splits the bucket into two buckets.

If the sliced table after the split satisfies l-diversity, then the algorithm puts the two buckets at the end of the queue Q.

Cont..
Otherwise, we cannot split the bucket anymore and the algorithm puts the bucket into SB. When Q becomes empty, we have computed the sliced table. The set of sliced buckets is SB. The main part of the tuple-partition algorithm is to check whether a sliced table satisfies l-diversity.


Tuple Partitioning Algorithm


Algorithm tuple-partition(T, ℓ)
1. Q = {T}; SB = ∅.
2. while Q is not empty
3.    remove the first bucket B from Q; Q = Q ∖ {B}.
4.    split B into two buckets B1 and B2.
5.    if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.       Q = Q ∪ {B1, B2}.
7.    else SB = SB ∪ {B}.
8. return SB.
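A Python rendering of the tuple-partition pseudocode above. It is only a sketch: split() and diversity_check() are supplied by the caller (a simplified diversity check is sketched after the diversity-check pseudocode below), and the single-tuple guard is an addition so the loop always terminates.

from collections import deque

def tuple_partition(T, ell, split, diversity_check):
    # Q: queue of buckets still eligible for splitting; SB: finished sliced buckets.
    Q = deque([list(T)])
    SB = []
    while Q:
        B = Q.popleft()
        if len(B) < 2:  # guard added in this sketch: a single-tuple bucket cannot be split
            SB.append(B)
            continue
        B1, B2 = split(B)
        # Keep the split only if the whole candidate bucketing stays l-diverse.
        if diversity_check(T, list(Q) + [B1, B2] + SB, ell):
            Q.append(B1)
            Q.append(B2)
        else:
            SB.append(B)
    return SB

A trivial split() for experimentation could simply halve the bucket, e.g. lambda B: (B[:len(B)//2], B[len(B)//2:]); a split on a quasi-identifier median would be closer to the intended algorithm.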

Diversity Check Algorithm


Algorithm diversity-check(T, T*, ℓ)
1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.    record f(v) for each column value v in bucket B.
4.    for each tuple t ∈ T
5.       calculate p(t, B) and find D(t, B).
6.       L[t] = L[t] ∪ {(p(t, B), D(t, B))}.
7. for each tuple t ∈ T
8.    calculate p(t, s) for each sensitive value s based on L[t].
9.    if p(t, s) > 1/ℓ, return false.
10. return true.
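A simplified stand-in for diversity-check, usable with the tuple_partition sketch above. It only verifies that, inside every candidate bucket, no sensitive value accounts for more than a 1/ℓ fraction of the tuples; the full algorithm additionally weights each bucket by the matching probability p(t, B) of every tuple. The sensitive-attribute name "disease" is merely an assumed default.

from collections import Counter

def diversity_check_simple(T, buckets, ell, sa="disease"):
    # T is unused in this simplified version; it is kept only for signature
    # parity with the pseudocode's diversity-check(T, T*, l).
    for B in buckets:
        if not B:
            continue
        counts = Counter(row[sa] for row in B)
        if max(counts.values()) > len(B) / ell:
            return False
    return True

# Example use with the tuple_partition sketch and the toy rows shown earlier:
# halve = lambda B: (B[:len(B)//2], B[len(B)//2:])
# SB = tuple_partition(rows, 2, halve, diversity_check_simple)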

Examples for Anonymization Techniques

This is the original microdata table and its anonymized versions produced by the different anonymization techniques.

The above figure consists of QIs and an SA: Age, Sex, and Zipcode are the QIs and Disease is the SA. The generalized table satisfies 4-anonymity.

The above table is the bucketized table, which satisfies 2-diversity.

The above tables show multiset-based generalization and one-attribute-per-column slicing, and the table below shows the sliced table.

Design and Implementation


Use case diagram
[Use case diagram: the User interacts with the Data Slicing, Diverse Slicing, Correlation Measure, Attribute Clustering, and Tuple Partitioning use cases.]

It shows the activities of the user. The user can provide privacy to the microdata by generalizing the records, dividing them into a number of buckets, and breaking the correlation between attributes. The attributes of the table are sliced by performing random permutation and applying the probability function.

Class diagram

It shows how the probability functions are calculated and how values are randomly permuted. It has five different classes with their attributes and methods, describing how the data is retrieved and how the methods are applied.

Modules
1) Data slicing
2) Diverse slicing
3) Correlation measure
4) Attribute clustering
5) Tuple partitioning

The attributes of the tables are used for slicing and bucketization. Slicing first partitions attributes into columns and then partitions tuples into buckets. Diverse slicing extends this analysis to the general case, introducing the notion of ℓ-diverse slicing and applying the probability function to the attributes.

Two correlation measures are used: one for measuring the correlation between two continuous attributes and one for two categorical attributes. After computing the correlations for each pair of attributes, we use clustering to partition the attributes into columns. Tuple partitioning is then performed, in which tuples are partitioned into buckets.
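A sketch of these two steps, assuming (as in the slicing paper by Li et al.) the Pearson correlation coefficient for continuous attributes and the mean-square contingency coefficient for categorical attributes; the greedy pairing at the end is only a stand-in for the k-medoid clustering used there.

import math
from collections import Counter

def pearson(xs, ys):
    # Correlation between two continuous attributes.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mean_square_contingency(xs, ys):
    # Correlation between two categorical attributes, normalized to [0, 1].
    n = len(xs)
    fx, fy, fxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    phi2 = 0.0
    for a, na in fx.items():
        for b, nb in fy.items():
            expected = (na / n) * (nb / n)
            observed = fxy.get((a, b), 0) / n
            phi2 += (observed - expected) ** 2 / expected
    d = min(len(fx), len(fy))
    return phi2 / (d - 1) if d > 1 else 0.0

def greedy_columns(attrs, corr):
    # Stand-in for k-medoid clustering: repeatedly pair the two most
    # correlated attributes that are not yet assigned to a column.
    remaining = set(attrs)
    columns = []
    while len(remaining) > 1:
        a, b = max(((x, y) for x in remaining for y in remaining if x < y),
                   key=lambda p: corr[p])
        columns.append((a, b))
        remaining -= {a, b}
    columns.extend((a,) for a in remaining)
    return columns

Here attrs is the list of attribute names and corr is assumed to map each alphabetically ordered attribute pair to its measured correlation; highly correlated attributes end up in the same column, and tuple partitioning then proceeds as sketched earlier.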

H/W and S/W Specifications

Hardware: (1) 2 GB RAM, (2) 320 GB hard disk, (3) Intel processor (P3).
Software: (1) Visual Studio, (2) Windows XP/7.

Results

Retrieving the dataset file
Showing the dataset files to display the path of the file
Selecting the dataset test file
Displaying the path of the dataset file
Displaying the message in the textbox when the user cancels browsing
Displaying the file path of the dataset attribute
Displaying the attributes from the dataset
Selecting the attribute set from the dropdown list
Displaying the attribute domains of the attribute set
Displaying the tuples
Entering the number of buckets
Displaying the generalization values in both text and table format
Displaying the bucketization values in text format
Displaying the values which have a clear separation between Quasi-Identifiers and Sensitive Attributes
Displaying the diversity mapping based on the salary values
Displaying the sliced values in the textbox
Displaying the sliced sets in the table
Showing the probability based on the countries and salary of all the buckets
Providing security for the sliced sets and storing the values in the database
Duplicating the attributes
Displaying the duplicate attributes in the table format

Comparison
Graph showing the time (msec) consumed by the three techniques: Anonymity, Diversity, and Slicing.

Comparative graph showing the time (msec) consumed by Slicing and Enhanced Slicing.

Conclusion
A dataset is taken and anonymization techniques are applied to it to protect the privacy of the microdata. The attribute values are considered and the probability functions are applied. The generalization, bucketization, and slicing techniques are implemented. DES is used to provide security for the sliced-set table. Overlapping slicing is performed, which duplicates an attribute in more than one column. The comparison, based on the time consumed, shows that slicing preserves better data utility than generalization and bucketization.
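The DES step mentioned above is not detailed in the slides; a minimal sketch, assuming the pycryptodome library, an illustrative 8-byte key, and rows serialized as plain strings (key management and the real serialization format are outside this sketch):

from Crypto.Cipher import DES                # pycryptodome
from Crypto.Util.Padding import pad, unpad

KEY = b"8bytekey"                            # hypothetical key; DES keys are 8 bytes long

def encrypt_row(row_text):
    # Encrypt one serialized row of the sliced table before storing it.
    cipher = DES.new(KEY, DES.MODE_ECB)      # ECB mode only to keep the sketch short
    return cipher.encrypt(pad(row_text.encode("utf-8"), DES.block_size))

def decrypt_row(blob):
    cipher = DES.new(KEY, DES.MODE_ECB)
    return unpad(cipher.decrypt(blob), DES.block_size).decode("utf-8")

# e.g. store encrypt_row("([22-24], *) | (4790*, flu)") in the database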

Future Work
Slicing gives better privacy than generalization and bucketization, but there is still scope in the future to further increase the privacy of microdata publishing by using different anonymization techniques.

References
[1] T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.
[2] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.
[3] B.-C. Chen, K. LeFevre, and R. Ramakrishnan, "Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 770-781, 2007.
[4] C. Dwork, "Differential Privacy: A Survey of Results," Proc. Fifth Int'l Conf. Theory and Applications of Models of Computation (TAMC), pp. 1-19, 2008.

Thank you..

