
ENHANCEMENT OF PRIVACY PRESERVATION METHOD WITH

CLUSTERING AND CRYPTOGRAPHIC TECHNIQUES IN DATA


MINING
A

Dissertation

Submitted

in partial fulfillment

for the award of Degree of

Master of Technology

In Department of Computer Science & Engineering

(With Specialization in Computer Science & Engineering)

Supervisor:                                Submitted By:
Dr. Akash Saxena                           Jeetendra Mittal
Associate Professor                        Enrollment No: 14E2CICSM40P603
CITM, Jaipur                               CITM, Jaipur

Department of Computer Science & Engineering

Compucom Institute of Technology and Management

Rajasthan Technical University

(MAY 2018)
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the Dissertation entitled
“Enhancement of Privacy Preservation Method with Clustering and Cryptographic
Techniques in Data Mining”, in partial fulfillment for the award of the Degree of “Master of
Technology” in the Department of Computer Science & Engineering with Specialization in
Computer Science & Engineering, submitted to the Department of Computer Science &
Engineering, Compucom Institute of Technology and Management, Rajasthan
Technical University, Kota, is a record of my own investigation carried out under the guidance
of Dr. Akash Saxena, Associate Professor, CITM, Jaipur.

I have not submitted the matter presented in this Dissertation anywhere else for the award of
any other degree.

(Jeetendra Mittal)

Computer Science & Engineering

Enrollment No: 14E2CICSM40P603

Compucom Institute of Technology and Management, Jaipur

Counter Signed by

(Dr. Akash Saxena)

Associate Professor

Computer Science & Engineering

Compucom Institute of Technology and Management, Jaipur


ACKNOWLEDGEMENT

It is the foundation of a building that defines its ability to stand firm. The foundation
of my research work is not my effort alone; it rests on the efforts and insights of many key
people.

I sincerely take this opportunity to acknowledge all those who directly or indirectly have been
a great support and inspiration throughout the research work.

First and foremost, I would like to thank one of my mentors, Dr. Akash Saxena, Associate
Professor, CITM. It has been an honor to be his research student. He contributed his keen and
creative insights wherever they applied to the research work, and amidst his busy schedule
he made himself available for almost every query related to the work. I sincerely appreciate
his contribution, direct and indirect.

I would also like to acknowledge the Registrar, Mr. Pawan Agarwal, CITM, for his support
and significant contribution, direct or indirect, during each phase of this dissertation.

Lastly, I would like to thank my family and friends for their love, motivation and
encouragement in all my pursuits. Last but not least, my special thanks go to the
Principal, Prof. (Dr.) M. R. Farooqui, and my Institute, Compucom Institute of
Technology and Management, Jaipur, for giving me the opportunity to work in this great
environment.

Jeetendra Mittal
ABSTRACT
Data mining is the process of extracting knowledge hidden in large volumes of raw data. The
knowledge must be new, not self-evident, and one must be able to use it. To address privacy
concerns, the original data is modified by a sanitization process that hides sensitive
knowledge before release. Data mining has been studied extensively and has proved useful in
many fields, including the Internet of Things (IoT) and business development. However, data
mining approaches also face serious challenges due to the growing disclosure of sensitive
data and the violation of privacy. Privacy-Preserving Data Mining (PPDM), a fundamental
branch of data mining and an active topic in privacy preservation, has gained particular
attention in recent years. Several researchers address the preservation of sensitive
knowledge in the form of association rules by suppressing the frequent itemsets. Clustering
is the technique that groups objects with similar characteristics into clusters.
Anonymization protects the identity of the individual: it encrypts identifiers such as unique
numbers and names, whereas data that is not encrypted provides little or no guarantee. This
dissertation describes the privacy concerns that arise from data mining, particularly in
national security applications. We discuss privacy-preserving data mining by an anonymization
method in which we use hierarchical clustering to partition the given data and the DES
algorithm to encrypt the data, in order to protect sensitive data from an attacker. The
Advanced Encryption Standard (AES) is an algorithm that secures the data and makes attacks
very difficult to apply. With the proposed work, privacy preservation of the data is
increased, as shown by the results. AES produces its result in minimum time, which shows that
the proposed approach produces results faster than existing approaches.

CONTENTS
ABSTRACT i
CONTENTS ii
LIST OF FIGURES v
LIST OF TABLES vi
LIST OF ABBREVIATIONS vii
CHAPTER 1 1-33
INTRODUCTION 1
1.1 Data mining 1
1.2 Goals of Data Mining 2
1.2.1 Prediction 2
1.2.2 Identification 2
1.2.3 Classification 2
1.2.4 Optimization 2
1.3 Advantages of Data Mining 3
1.4 Privacy preserving data mining 3
1.5 Types of data mining system 4
1.6 Defining Privacy for Data Mining 5
1.6.1 Aims and non-aims of this section 6
1.6.2 Privacy and personal data 7
1.6.3 Privacy as hiding: confidentiality 7
1.6.3.1 Privacy as hiding/confidentiality as the focus of PPDM 8
1.6.4 Privacy as control: informational self-determination 10
1.6.5 Privacy as practice: identity construction 12
1.7 Privacy preserving applications 14
1.7.1 Medical Database 14
1.7.2 Bioterrorism Application 14
1.8 Privacy threats 14
1.8.1 Identity Disclosure 14
1.8.2 Attribute Disclosure 14
1.8.3 Membership Disclosure 14
1.9 Evaluation criteria for privacy-preserving algorithm 15

1.10 Background 16
1.10.1 Security Vs Privacy 17
1.10.2 Privacy Issues and Policies 17
1.11 Requirements of a PPDM algorithm 17
1.12 Need for privacy 18
1.13 Comparisons of different privacy preservation techniques 19
1.14 Clustering 22
1.15 PPDM Techniques 22
1.16 Privacy Preserving Techniques 24
1.16.1 Heuristic-based techniques 24
1.16.2 Cryptography-based strategies 24
1.16.3 Reconstruction-based techniques 24
1.16.4 Anonymization based PPDM 25
1.16.5 Perturbation Based PPDM 26
1.16.6 Randomized Response Based PPDM 27
1.16.7 Cryptography Based PPDM 28
1.17 Issues in designing a PPDM algorithm 29
1.17.1 Challenges of PPDM Algorithm Information Loss 29
1.17.2 Requirements of a PPDM algorithm 30
1.18 Data Encryption Standard (DES) 31
1.19 Advanced Encryption Standard (AES) 32
1.19.1 Substitute Byte transformation 33
1.19.2 Shift Rows transformation 33
1.19.3 Mix columns transformation 33
1.19.4 Add round key transformation 33
CHAPTER 2 34-46
LITERATURE SURVEY 34
CHAPTER 3 47-54
SIMULATION TOOL 47
3.1 Simulation Environment 47
3.2 The MATLAB system comprises of five major sections 48
3.2.1. Development Environment 48

3.2.2. The MATLAB Mathematical Function Library 48

3.2.3. The MATLAB Language 49

3.2.4. Handle Graphics 49


3.2.5. The MATLAB Application Program Interface (API) 49
3.3 MATLAB optimization toolbox 49
3.4 The MATLAB Language 50
3.4.1 MATLAB Variables and Operators 50
3.4.2 Control Flow Constructs 52
3.4.3 Array and Matrix Indexing 52
3.4.4 Libraries 53
3.5 Supported MATLAB Subset 53
CHAPTER 4 55-60
PROPOSED WORK 55
4.1 Proposed Work 55
CHAPTER 5 61-64
RESULT ANALYSIS 61
5.1 THE RESULT OF PROPOSED WORK-I 61
5.2 THE RESULT OF PROPOSED WORK-II 63
CHAPTER 6 65
CONCLUSION 65
REFERENCES 66-73

LIST OF FIGURES
Figure No. Figure Caption Page No.
Figure 1.1 What revealing search data reveals 13
Figure 1.2 Linking Attack 25
Figure 1.3 Randomization Response Mode 27
Figure 4.1 Flowchart of Proposed Work-I 57
Figure 4.2 Flowchart of Proposed Work-II 60
Figure 5.1 Accuracy of the base and proposed approaches 61
Figure 5.2 Error rate of the base and proposed approaches 62
Figure 5.3 Elapsed time of the base and proposed approaches 63
Figure 5.4 Accuracy of the base and proposed approaches 64
Figure 5.5 Error rate of the base and proposed approaches 64

LIST OF TABLES

Table No. Table Caption Page No.

Table 1.1 Advantages and Limitations of PPDM Techniques 15
Table 1.2 Comparison of different privacy preservation techniques 19
Table 3.1 MATLAB Optimization functions 49
Table 5.1 Accuracy of the base method and proposed method results 61
Table 5.2 Error rate of the base method and proposed method results 62
Table 5.3 Comparison of elapsed time between base and proposed techniques 63

LIST OF ABBREVIATIONS

PPDM Privacy Preserving Data Mining

FIP Fair Information Practices

SNS Social Network Sites

DPQR Data Perturbation and Query Restriction

DDM Distributed Data Mining

SMC Secure Multiparty Computation

SDC Statistical Disclosure Control

DES Data Encryption Standard

IBM International Business Machines

NBS National Bureau of Standards

GA Genetic Algorithms

MPC Multi-Party Computation

PHI Protected Health Information

API Application Program Interface

Chapter- 1

INTRODUCTION

1.1 DATA MINING:


Data mining (DM) processes are used extensively in a variety of fields. When designing a
network intrusion detection system (IDS), it is important to identify and respond to attacks
in less time and to raise the proper alarm. DM strategies are among the most interesting and
productive techniques that can be used to design an IDS. DM-based intrusion detection
strategies generally fall into one of two classes: anomaly detection and misuse detection.
Normally, the process of data mining refers to extracting descriptive models from huge data
stores. Using DM algorithms in an IDS gives excellent performance and security: such
frameworks are able to detect both known and unknown attacks on the system. Various DM
strategies such as summarization, clustering, and classification can be used for
investigating and recognizing intrusions [1]. Recently, data mining has come to be viewed as
a threat to privacy because of the widespread proliferation of the electronic data maintained
by organizations, which has led to increased concerns about the privacy of the underlying
data. Over the last few decades, many approaches and techniques, for example classification
and association rule mining, have been proposed for altering or transforming the data in a
way that preserves privacy. Preserving the privacy of individuals' information is essential
for the owners of the data [1].

The data mining process enables an organization to use its vast amount of data to discover
relationships and connections among the data and thereby improve business efficiency [1].
Data mining technology can build these analyses on its own, using a combination of
statistics, artificial intelligence, machine learning algorithms, and data stores. To
confront the resulting risk to privacy, several researchers have proposed remedies for this
cumbersome situation.

Balancing the privacy of the data against the legitimate needs of the data consumer is a
major problem. The original data is modified by a sanitization method that hides sensitive
knowledge before release so that these issues can be addressed. Privacy preservation of
sensitive knowledge is addressed by numerous researchers in the form of association rules,
by suppressing the frequent itemsets. Since data mining generates association rules, hiding
sensitive rules is accomplished by changing the support and confidence of those rules [2]. A
new idea, named "not altering the support", has been proposed to hide an association rule. A
key problem that arises in any mass collection of data is that of confidentiality. The need
for privacy is sometimes due to regulation (e.g., for medical databases) or may be driven by
business interests. Fortunately, data mining does not necessarily violate privacy: the goal
of data mining is to draw conclusions across populations rather than to reveal information
about individuals.

1.2 GOALS OF DATA MINING:


In general, the goals of DM fall into the following groups:

1.2.1 Prediction: Prediction determines the relationships between independent variables and
the associations between dependent and independent variables.

1.2.2 Identification: Data patterns are used to identify the existence of an item, an event,
or a pattern of customer behavior. Authentication, which verifies a known identity, is a form
of identification.

1.2.3 Classification: Data mining can help to partition the data so that different classes
can be recognized based on grouping parameters.

1.2.4 Optimization: DM can improve the use of limited resources, for example time, space,
money, or materials, and maximize output under a predetermined set of constraints [2].
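The classification goal above can be made concrete with a minimal nearest-centroid
classifier: records are partitioned into known classes, and a new record is assigned to the
class whose centroid it is closest to. This is an illustrative sketch only; the customer data
and class names are hypothetical, not taken from the dissertation.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(record, labeled_data):
    """Assign `record` to the class with the nearest centroid (squared Euclidean)."""
    centroids = {label: centroid(pts) for label, pts in labeled_data.items()}
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(record, centroids[label]))

# Hypothetical customer records: (monthly spend, visits per month)
training = {
    "regular":    [(120.0, 8), (100.0, 10), (140.0, 9)],
    "occasional": [(20.0, 1), (35.0, 2), (25.0, 1)],
}
print(classify((110.0, 7), training))  # -> regular
```

Real systems would use a richer model (decision trees, neural networks), but the principle
of grouping by learned parameters is the same.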

1.3 ADVANTAGES OF DATA MINING:
Data mining applications are growing continuously across industries, providing hidden
knowledge that allows businesses to increase efficiency and grow. DM approaches play a basic
part in many domains. To characterize security issues, a large amount of information
containing historical data must be analyzed, and it is difficult for people to find a pattern
in such a huge quantity of data. DM, however, appears well suited to overcome this difficulty
and can be used to discover those patterns [2].

1.4 PRIVACY PRESERVING DATA MINING:


PPDM is a research area concerned with the privacy risks arising from personally identifiable
data when it is considered for data mining. PPDM has therefore become an increasingly
essential area of research and a novel research direction in data mining. A number of
techniques and methodologies have been developed for PPDM (Philip S. Yu et al. 2010). A set
of criteria has been identified on the basis of which a PPDM algorithm may be evaluated
(Charu C. Aggarwal et al. 2008):

 Privacy level
 Hiding failure
 Data quality
 Complexity
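Of these criteria, hiding failure is the easiest to quantify: the fraction of sensitive
patterns that can still be discovered from the sanitized data. A minimal sketch of such a
metric follows; the pattern sets are hypothetical examples, not data from this work.

```python
def hiding_failure(sensitive_patterns, patterns_mined_after_sanitization):
    """Fraction of sensitive patterns still discoverable after sanitization.
    0.0 means all sensitive patterns were hidden; 1.0 means none were."""
    if not sensitive_patterns:
        return 0.0
    leaked = set(sensitive_patterns) & set(patterns_mined_after_sanitization)
    return len(leaked) / len(sensitive_patterns)

# Hypothetical frequent itemsets before/after sanitization.
sensitive = [("bread", "beer"), ("diapers", "beer")]
mined_after = [("bread", "milk"), ("diapers", "beer")]
print(hiding_failure(sensitive, mined_after))  # -> 0.5
```

A good sanitization algorithm drives this value toward 0 while keeping data quality high,
which is exactly the trade-off the four criteria capture.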

The primary challenges of PPDM approaches for association rule mining are that they are
excessively expensive, that it is hard to recover the original data after hiding, and that
they must be efficient enough for very large datasets (Wei Zhao et al. 2007). The objective
of this work is to implement a distortion algorithm using association rule hiding for PPDM
that is efficient in providing confidentiality and improves overall performance (Charu C.
Aggarwal et al. 2008). The debate on PPDM has received particular attention as data mining
has been broadly adopted by public and private organizations. We have witnessed three main
landmarks that represent the progress and achievement of this new research area: the
conceptive landmark, the deployment landmark, and the potential landmark. We describe these
landmarks as follows. The conceptive landmark characterizes the period in which influential
figures in the community, including O'Leary (1991, 1995), Piatetsky-Shapiro (1995), Klösgen
(1995), and Clifton and Marks (1996), examined the achievements of knowledge discovery and
some of the fundamental areas where it could conflict with privacy concerns. The key finding
was that knowledge discovery can open new threats to informational privacy and data security
if not carried out or used properly.

The deployment landmark is the current period, in which more and more PPDM techniques have
been developed and published in refereed conferences. The information available today is
spread over countless papers and conference proceedings. The results achieved in recent years
are promising and suggest that PPDM will achieve the goals that have been set for it. At this
stage, however, there is no consensus about what privacy preservation means in data mining.
In addition, there is no consensus on privacy concepts, guidelines, and requirements as a
foundation for the development and deployment of new PPDM techniques.

The sheer number of approaches is causing confusion among designers, practitioners, and
others interested in this technology. One of the most important challenges in PPDM now is to
establish the groundwork for further research and development in this area.

1.5 TYPES OF DATA MINING SYSTEM:

Data mining frameworks can be categorized by different criteria, as described below [2]:

a) Classification of data mining frameworks according to the type of data source mined:
In an organization, many kinds of data are available, and systems can be grouped by the kind
of data they handle.

b) Classification of data mining systems according to the data model:

There is a large number of data models (the relational data model, the object model, the
object-oriented data model, the hierarchical data model, or the Web data model), and each
model handles data differently. According to the data model, the data mining framework
characterizes the data in the model.

c) Classification of data mining systems according to the mining techniques used:

This classification is based on the analysis approach employed, such as machine learning,
neural networks, genetic algorithms, statistics, visualization, and database-oriented or
data-warehouse-oriented techniques. The classification can also take into account the degree
of user interaction involved in the data mining system, such as query-driven systems,
interactive exploratory systems, or autonomous systems. A comprehensive system would provide
a wide variety of data mining approaches to fit different situations and options, and provide
different levels of user interaction.

1.6 DEFINING PRIVACY FOR DATA MINING:


In general, privacy preservation concerns two main dimensions: users' personal information
and information about their collective activity. We refer to the former as individual privacy
preservation and the latter as collective privacy preservation, which is related to corporate
privacy in (Clifton et al., 2002).

i. User privacy preservation: The primary aim of data privacy is the protection of
personally identifiable information. In general, information is considered personally
identifiable if it can be linked, directly or indirectly, to an individual person. Thus,
when personal data are subjected to mining, the attribute values associated with
individuals are private and must be protected from exposure. Miners are then able to
learn from global models rather than from the characteristics of a particular user.
ii. Collective privacy preservation: Protecting personal data may not be enough.
Sometimes we may also need to protect against learning sensitive knowledge representing
the activities of a group. We refer to the protection of such sensitive knowledge as
collective privacy preservation. The goal here is quite similar to that of statistical
databases, in which security control mechanisms provide aggregate data about groups
(populations) while at the same time preventing exposure of personal information about
individuals. However, in contrast to the case of statistical databases, another objective
of collective privacy preservation is to protect sensitive knowledge that can provide
competitive advantage in the business world. In the case of collective privacy
preservation, organizations have to deal with some interesting conflicts. For example,
when personal data undergo analyses that create new facts about users' shopping patterns,
hobbies, or preferences, these facts may be used in recommender systems to predict or
influence their future shopping patterns. In general, this state of affairs is beneficial
to both customers and organizations. However, when companies share data in a
collaborative project, the aim is not only to protect personally identifiable data but
also sensitive knowledge represented by some strategic patterns [3]. In this part, we
start from the data most commonly considered as being implicated in privacy debates:
data protection and privacy breaches on the Internet. We then proceed to describe three
relevant groups of privacy approaches. The aim of this framework is to abstract from and
complement current privacy definitions, as explained. We illustrate the various faces of
privacy using the example of the AOL logs, highlight different notions of identity,
extend the discussion by investigating whose privacy may be at stake, and characterize
specifics of the Web relevant to our questions. We summarize with the resulting aim of
this article, to be presented in the subsequent sections.

1.6.1 Aims and non-aims of this section:


There are myriad definitions of privacy, not only in computer science but also in the legal,
social, and other sciences. It is not our purpose here to list them all or investigate all of
their details. What we aim to do instead is to present a high-level taxonomy of approaches by
asking what the overall focus of each definition is. We will refer to selected individual
definitions to illustrate details and to selected surveys to demonstrate ranges of
definitions.

The taxonomy distinguishes between three views of privacy: privacy as hiding, as control,
and as practice. These are summarized below. The definitions have been put into the context
of the present article; in particular, we will show how these three views cut across
individual definitions' differences with regard to the subjects of privacy and the type of
data and knowledge. While the objective is thus to generalize beyond the large number of
privacy definitions and establish a general framework for the analysis, we do make one
choice throughout the rest of the paper: we concentrate on persons' privacy. This is in line
with the bulk of current scientific and popular treatments, but the choice was also made to
respect the complexity of the topic. We therefore begin this section with a definition of
"personal data" as the target of persons' privacy. We complement this general focus of the
paper with a discussion of some issues of business and state secrets. A thorough analysis of
these questions would require the space of another article [3].

1.6.2 Privacy and personal data:


Since computers are about data and data processing, any concept of privacy in computational
environments will concern data, in particular "personal data": "any data relating to an
identified or identifiable natural person [...]; an identifiable person is one who can be
identified, directly or indirectly, in particular by reference to an identification number or
to one or more factors specific to his physical, physiological, mental, economic, cultural or
social identity". Observe the attention to identity, which is thought to be targeted and
identifiable for one natural person; in line with this emphasis, US terminology talks about
personally identifiable information. The standard types of personal data are the profile data
describing the users, including name, address, health status, etc. An important question for
Web mining is whether IP addresses are personal data. Privacy advocates have long argued that
they are, and the European Union has recently embraced this position. If this becomes
official policy, it is likely to have strong effects on how organizations can collect and
process Web usage data. As a consequence, it is not the content of a piece of data that
defines it as protection-worthy, nor the fact that the data were produced by an individual,
pertain to an individual, or describe an individual; the main pertinent question is whether
the data can be linked to a natural person. European law permits the analysis of data if
records are identified by a pseudonym. Under US law, a record holder may assign a code to a
de-identified record in order to permit the original record holder to re-identify the record.
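The pseudonymization just described can be sketched as replacing each identifier with a keyed
code that only the record holder can reproduce. The sketch below is illustrative; the field
names and key are hypothetical, and a real deployment would follow applicable law.

```python
import hashlib
import hmac

SECRET_KEY = b"record-holder-only"  # held only by the original record holder

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed pseudonym (HMAC-SHA256).
    Without the key, the pseudonym cannot be linked back to the person."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice Example", "zip": "302001", "diagnosis": "flu"}
released = {"pid": pseudonymize(record["name"]),   # code instead of the name
            "zip": record["zip"],
            "diagnosis": record["diagnosis"]}
print(released["pid"] != record["name"])  # the identifier no longer appears
```

Because the mapping is keyed and deterministic, the record holder can re-identify a record
later, while recipients of the released data cannot.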

1.6.3 Privacy as hiding: confidentiality:


In a classical article, privacy has been defined as "the right to be let alone". Although
initially intended as a right that protects people against gossip and slander, this construct
has since acquired a broader meaning. Namely, it refers to an individualistic liberal
tradition in which an inherently prior self is granted a sphere of autonomy free from
intrusions from both a tyrannical state and the pressure of social norms. That privacy
encompasses this sense of a protected sphere is generally acknowledged in sociology, and
legal scholars, courts and regulators have recognized its data-dependency: the private sphere
is something which is potentially undermined by the exposure of (personal) data. This notion
is also popular in computer science and has been expressed as an autonomous (digital) sphere
in which data about persons is protected, such that outside of this sphere the data remains
confidential. Data confidentiality, the protection of data from unauthorized access, is a
strong and useful translation of such privacy concerns into digital space. A key reason is
that once data about a person exists in digital form, it is extremely difficult to provide
individuals with any guarantees on the control of that data. Data collected using current
technologies represent activities of users in social life that for many are assumed to be
private. To preserve privacy is then to keep this data private, in other words confidential
from a greater public. Not exchanging the data at all would preserve privacy but is
inconvenient and probably also not desirable. Therefore, a great deal of privacy research in
computer science is concerned with weaker forms of data confidentiality such as anonymity.

Anonymity is achieved by unlinking a person's identity from the traces that her actions leave
in information systems. Anonymity keeps the identification of individuals private; however,
it is not necessarily concerned with how public the traces themselves become. This is also
reflected in data protection legislation, which by definition cannot and does not cover
anonymous data [3].

Anonymity can also be established on different grounds. In communications, anonymity is
accomplished when an individual is not identifiable within a limited set of users, called the
anonymity set. An individual completes a transaction anonymously if she cannot be
distinguished by an observer from others in that set. The observer (adversary) may possess
some extra information, typically knowledge about the likelihood of different individuals
having carried out a given transaction. The observer could also be the service provider or
some other party with observation capabilities or with the ability to actively manipulate
messages. Depending on the observer's capabilities, different sets may be constructed with
varying degrees of anonymity for a given anonymity set. What level of anonymity is adequate
in a given setting depends on the legal and social consequences of a data breach and remains
an open question. In databases and PPDM, the conditions for establishing anonymity sets and
the targeted objectives are somewhat different than in communications. Anonymity is a popular
requirement when (Web or other) data are to be analyzed (e.g., data-mined), especially when
this is done by third parties. One difference from communications anonymity is that PPDM
methods aspire to preserve the utility of the anonymized data for analysts [3].
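One common way to make the anonymity-set idea concrete for tabular data is k-anonymity: every
combination of quasi-identifier values must be shared by at least k records, so each person
hides in an anonymity set of size at least k. A minimal check follows; the table and column
names are hypothetical examples, not the dissertation's dataset.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows,
    i.e. each record hides in an anonymity set of size >= k."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

table = [
    {"age_band": "30-39", "zip3": "302", "disease": "flu"},
    {"age_band": "30-39", "zip3": "302", "disease": "cold"},
    {"age_band": "40-49", "zip3": "302", "disease": "flu"},
]
print(is_k_anonymous(table, ["age_band", "zip3"], 2))  # -> False
```

The lone 40-49 record breaks 2-anonymity; generalizing its age band or suppressing the row
would restore the guarantee, at some cost in data utility.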

1.6.3.1 Privacy as hiding/confidentiality as the focus of PPDM:


The setup of PPDM as a class of mining methods clearly shows its focus on the privacy as
confidentiality. PPDM is stimulated through privacy as “the proper of an entity to be comfy from
unauthorized divulgence of sensible data that are contained in electronic archive or that can be
determined as aggregate and complex data from information put away in an electronic
repository”. A look at further definitions of privacy from the PPDM literature illustrates the
underlying idea that ‘privacy obtains when certain data are hidden ‘regardless of whether the
data are general or, for example, from social sites or query logs: For example, focus on the
“freedom from unauthorized intrusion “and demand “solutions that ensure data will not be
released”, and they observe that the “disclosure of knowledge about an entity (information about
an individual) is a potential individual privacy violation”, and that the analogous holds for the
disclosure of knowledge about sets of data of other entities like corporations. “[S]et[ting] up the
data to secure privacy of individual users even as keeping the global network properties[…] is
almost always completed by the means of anonymization, a simple procedure in which each
individual’s ‘name’ e.g., e-mail address, phone number, or actual name is replaced by a random
user ID” distinguish “general adversary who is trying to discover any useful information” from a
“particular competitor who tries to disclose information”, where the information is or includes
that about a specific website or business. PPDM ways handle the database inference
predicament: “The drawback that arises when personal knowledge can also be derived from
released data by using unauthorized users”, and the objective of PPDM is to “strengthen
algorithms for modifying the real data by some means, so that the private data and personal
competencies stay private even after mining process”. Here, “private data” are the given inputs to
the mining process that are supposed to remain confidential, and “privacy knowledge” is that
part of the knowledge inferred during mining that is supposed to remain confidential. Put more
simply, the objective is to learn what we are allowed to learn from data that we are not allowed
to see. To achieve this, PPDM methods must solve the (data mining/publishing) anonymization

9
problem: to “produce an anonymous that satisfies a given privacy requirement determined by the
chosen privacy model and to retain as much data utility as possible”. Key concepts of PPDM are
defined below. Following security-research terminology, adversaries are also called “attackers”,
who perform an attack: a “sequence of activities that result in the disclosure of confidential
information”. A wider view considers a setting often found in PPDM: the publishing, by a data
publisher (e.g., a hospital), of at least partly sensitive information on data subjects (e.g.,
patients), for an audience of data recipients. The latter are in general not known a priori, could
be ill-intentioned, and may perform arbitrary data mining tasks. The aim of PPDP is then that
“access to the published data should no longer allow the attacker to learn anything extra about
any target victim, compared to not having access to the database, even in the presence of any
background knowledge the attacker has obtained from other sources”. Due to the impossibility of this
in the face of arbitrary background knowledge, one usually assumes limited and specific
background knowledge of the attacker, or requires that probabilistically, the posterior beliefs
after looking at the published data are not much different from the prior beliefs. The same idea
lies behind differential privacy, which “ensures that the removal or addition of a single database
item does not (substantially) affect the outcome of any analysis”. Informally, the goal is not so
much to hide the data as to prevent information about individuals from being gleaned from them.
A wide literature exists on PPDM and PPDP, which cannot be covered here. For details of
specific algorithms and method groups, see also (graphs/networks). Taking a closer look at these
data-centric definitions of privacy, one sees that alongside the focus on confidentiality (not
seeing data, not learning about an entity, protection from disclosure), there is also the recognition
that data need not be kept confidential in every case, but could be disclosed as long as someone
entitled to do so “decides” or “authorizes” disclosure/communication. This someone is often the
data subject, but may also be unspecified; we will return to this question. This move away from
unconditional hiding, or “privacy-as-confidentiality”, leads to the notion of privacy as control, to
be discussed next.
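Before moving on, the differential-privacy guarantee quoted above can be made concrete with a minimal sketch. The standard way to realize it for a count query is the Laplace mechanism: a count changes by at most 1 when a single record is added or removed (sensitivity 1), so noise with scale 1/epsilon suffices. The dataset, predicate, and parameter values below are invented purely for illustration:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a zero-mean Laplace distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon, rng=random):
    """Epsilon-differentially-private count: the true count plus
    Laplace noise with scale 1/epsilon (sensitivity of a count is 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [34, 45, 29, 61, 50, 38, 47]                 # toy "database"
noisy = dp_count(ages, lambda a: a > 40, epsilon=0.5)
```

A smaller epsilon yields noisier answers and stronger privacy; this is exactly the sense in which a single record's presence or absence does not substantially affect the outcome.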

1.6.4 Privacy as control: informational self-determination:


A much broader conception of privacy, reflected in many legal codifications, defines the term not
only as a matter of concealing personal information, but also as the ability to control what
happens with it [3]. This notion does not call for strict data parsimony. One reason is that the
revelation of data is, under some circumstances, necessary and beneficial, and that control may
help to avert abuses of data collected in this way.

This idea is expressed in Westin’s definition of (data) privacy: “the right of the individual to
decide what information about himself should be communicated to others and under what
circumstances”, and in the term informational self-determination, first used in a German
constitutional ruling concerning personal data gathered during the 1983 census, and highly
influential in Europe and beyond since then: “the protection of the individual against unlimited
collection, storage, use, and disclosure of his/her personal data is covered by the general
personality rights of the [German Constitution]. This general right warrants in this respect the
capacity of the individual to determine in principle the disclosure and the use of his/her personal
data. Limitations to this informational self-determination are allowed only in case of an
overriding public interest”. The idea of self-determination is likewise expressed in international
guidelines for data protection, such as the OECD Guidelines on the Protection of Privacy and
Transborder Flows of Personal Data, the Fair Information Practices (FIP) of notice, choice,
access, and security, or the principles of the EU Data Protection Directives. As an illustration,
consider the principles set up within the OECD guidelines: collection limitation, data quality,
purpose specification, and use limitation. In
sociological accounts, privacy as control is tied closely to the ability to separate identities, which
allows individuals to selectively employ revelation and concealment to facilitate their social
performances and relationships. Computer science has applied these ideas in systems for identity
management and access control [3].

Although informational self-determination principles are desirable, relying only on them when
building systems can be misleading. Collection limitation in one system does not protect against
the aggregation of those data across many systems. Openness may be unmanageable in current
ubiquitous-technology environments, where the number of data controllers increases
exponentially. A user may be overwhelmed by the difficulties of individual participation and
unable to judge the risk of revealing information, or of using automated agents for such decision-
making. Even if all these principles were implemented, it would be very hard to identify
violations. In the case of trusted parties, system security violations (e.g., hacked systems), design
failures (e.g., information leakages), or the linking of different sources of safely released data
may cause unwanted release of information. Informational self-determination hence offers little
protection against the collection of anonymized data, against profiling based on correlations and
patterns found in this aggregated data, and against the resulting desirable or undesirable
discriminations. Finally, privacy as control is an abstract concept that does not consider how
people actually do and want to construct their identities. This is the topic of privacy as practice,
to which we turn next.

1.6.5 Privacy as practice: identity construction:


Despite interesting research results on privacy-preserving methods and tools, individuals are
confronted every day with the collection of huge amounts of data about them. This has many
reasons. Some services require identification (e.g., hospitals or employment situations).
Commercial interests in collecting information often extend beyond such contexts. Popular and
usable privacy-enhancing technologies are rare to non-existent. Surveillance technologies collect
information on a mass level without consent. Furthermore, people often simply wish to reveal
meaningful data about themselves, together with their names, etc. By privacy as practice, we
refer to the definition of the right to privacy as the freedom from unreasonable constraints on the
construction of one’s own identity, which includes the capacity to deliberately reveal or conceal
data. This approach requires domain-specific and sociological analysis of users’ and
communities’ information revelation and concealment needs, as in the examples given. The
diversity of user concerns stressed here is often not emphasized in privacy-as-confidentiality and
privacy-as-control approaches [3].

Privacy as practice demands the possibility to intervene in the flows of existing data and to
re-negotiate boundaries with respect to collected data.

These two activities rest on, yet extend, the idea of privacy as informational self-determination:
they demand transparency concerning collected sets of data and the analysis methods and
decisions applied to them. In this sense, these approaches define privacy not only as a right but
also as a public good. Sociologists have investigated the idea that privacy is (social) practice
from various viewpoints. One account distinguishes two further types of privacy in addition to
the above-mentioned right to be let alone and the possibility of separating identities. The third
type is the construction of the public/private divide. This distinction concerns the social
negotiation of what remains private (i.e., silent and out of the public discourse) and what
becomes public. For instance, the decision by individuals to keep their voting choices private is
generally accepted today, while in the case of domestic violence, interest groups and individuals
have successfully lobbied over the past decades to redefine the “domestic” as a public issue.

Figure 1.1 What revealing search data reveals

The fourth type is protection from surveillance. Here, surveillance refers to the creation and
management of social knowledge about population groups. This kind of privacy can easily be
violated if individual observations are collated and used for statistical classification. When
applied to individuals, such classifications make statements about their (non-)compliance with
norms, their belonging to groups with given residences and valuations, and many other things.
Arguably, such processes may pose unreasonable constraints on the construction of identities.
Market segmentation is one example of the classification of population groups. In computer
science accounts of privacy in networks, and in particular social network sites (SNS), similar
ideas have been expressed [3].

These definitions emphasize that confidentiality and individual control are part of privacy, but
not all of it. Privacy includes strategic concealment, but also the revelation of information in
different contexts, and these decisions are based on, and part of, a process of collective
negotiation. Tools should therefore support data concealment and revelation to help individuals
practice privacy individually and collectively.

1.7 PRIVACY PRESERVING APPLICATIONS:
1.7.1 Medical Database: Traditionally, a simple global search-and-replace method has been used
in order to maintain privacy.

1.7.2 Bioterrorism Application: It is essential to examine medical data for privacy preservation
in bioterrorism applications. For instance, biological agents such as Bacillus anthracis are widely
found in the natural environment, so it is critical to distinguish an anthrax attack from an
ordinary outbreak. It is therefore important to track incidences of common diseases, with the
corresponding information reported to public health agencies. Respiratory illnesses, however,
are not reportable diseases. This calls for a solution that releases more identifiable data in
accordance with public health law [4].

1.8 PRIVACY THREATS:


Releasing the results of data mining can give rise to privacy threats. Several privacy disclosure
risks are possible in microdata publishing, namely identity disclosure, membership disclosure,
and attribute disclosure. Each of these threats increases the overall disclosure risk. Anonymizing
the data and protecting it through various disclosure protections results in better utility.

1.8.1 Identity Disclosure: Usually an individual is linked to a record in the published table. If his
identity is disclosed, then the corresponding sensitive value of that individual is revealed [4].

1.8.2 Attribute Disclosure: Attribute disclosure is possible when information about an
individual's record is revealed, i.e., when an attribute of the individual can be inferred with high
confidence from the released data. According to the authors of [5], matching multiple buckets is
important to protect against attribute disclosure.

1.8.3 Membership Disclosure: Membership information in the released table can reveal the
identity of an individual through various attacks. If the selection criteria for inclusion in the table
correspond to a sensitive attribute value, then inferring membership itself constitutes a
disclosure [6].

1.9 EVALUATION CRITERIA FOR PRIVACY-PRESERVING ALGORITHMS:
A fundamental requirement in the development and evaluation of privacy-preserving data
mining algorithms is the identification of suitable evaluation criteria and the development of
related benchmarks. In most cases, no single privacy-preserving algorithm outperforms all other
algorithms on all possible measures.

It is therefore vital to provide users with a set of metrics that allows them to select the most
suitable privacy-preserving technique for their data with respect to specific parameters. An
initial list of evaluation parameters for assessing the quality of privacy-preserving data mining
algorithms is given below [7]:

(i) Performance: the performance of a mining algorithm, measured in terms of the time
required to achieve the privacy criteria.
(ii) Data Utility: essentially a measure of information loss, i.e., the loss in the functionality
of the data in providing the results that could be generated in the absence of PPDM
algorithms.

Table 1.1 Advantages and Limitations of PPDM Techniques [8]

Technique | Advantages | Limitations
Anonymization based PPDM | Identity or sensitive data about record owners are hidden. | Vulnerable to linking attacks; heavy loss of information.
Perturbation based PPDM | Different attributes are preserved independently. | Original data values cannot be regenerated; loss of information.
Randomized Response based PPDM [8] | Generally straightforward and helpful for concealing data about individuals; better efficiency compared to cryptography-based PPDM techniques [8]. | Loss of individuals' information; not suitable for multiple-attribute databases.
Condensation approach based PPDM | Uses pseudo-data rather than altered data; very effective in the case of stream data. | Huge amount of information lost; the pseudo-data retain the same format as the original data.
Cryptography based PPDM [8] | Transformed data are exact and protected; better privacy compared to the randomized approach. | Especially hard to scale when multiple parties are involved.

(iii) Uncertainty level: a measure of the uncertainty with which the sensitive data that has
been hidden can still be predicted.
(iv) Resistance: a measure of the tolerance shown by a PPDM algorithm towards various
data mining algorithms and models.

All the criteria discussed above need to be quantified for a better evaluation of privacy-
preserving algorithms, but two criteria are particularly important: the quantification of privacy
loss and of information loss. A privacy metric is a measure that indicates how closely the
original value of an attribute can be estimated: if it can be estimated with high confidence,
privacy is low, and vice versa. The lack of precision in estimating the original dataset is known
as information loss, which can defeat the purpose of data mining. A balance therefore needs to
be achieved between privacy and information loss. Dakshi Agrawal and Charu Agrawal [9]
discuss the quantification of both privacy and information loss in detail.
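The privacy/information-loss trade-off just discussed can be made concrete with the randomized response technique listed in Table 1.1. In the minimal sketch below (population size, proportions, and probabilities are invented for illustration), each respondent answers a sensitive yes/no question truthfully only with probability p, which hides any individual's answer while the population proportion can still be estimated in aggregate:

```python
import random

def randomized_response(truth, p_truth, rng):
    """Answer truthfully with probability p_truth, otherwise lie.
    Each individual gains plausible deniability for their answer."""
    return truth if rng.random() < p_truth else not truth

def estimate_proportion(answers, p_truth):
    """Undo the randomization in aggregate:
    E[observed yes-rate] = p*pi + (1-p)*(1-pi), solved for pi."""
    observed = sum(answers) / len(answers)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = random.Random(1)
# Simulated population: 30% hold the sensitive attribute.
population = [rng.random() < 0.30 for _ in range(20000)]
answers = [randomized_response(t, 0.75, rng) for t in population]
pi_hat = estimate_proportion(answers, 0.75)
```

Lowering p_truth increases each individual's deniability (more privacy) but widens the estimator's variance (more information loss), which is exactly the balance described above.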

1.10 BACKGROUND:
Privacy is a vital concern when allowing access to different classes of datasets, such as business
and medical datasets, for mining. Privacy is especially essential for medical data, since it
contains private information such as the type of disease associated with a patient's ID, name,
and address. In particular, when mining medical data, realistic data must be available to make
precise predictions; otherwise the results will be useless. Any release of person-specific
information leads to several problems, including ethical issues. Privacy can thus be characterized
as preventing the undesired disclosure of data while performing data mining on aggregate
results.

1.10.1 Security vs. Privacy:
Security is the capability to manage access to information and to protect it from unauthorized
disclosure, modification, and destruction [10]. A medical dataset consists of all information
related to a patient. Privacy is a more particular term, characterized as the right of a person to
keep his personal data from being revealed. In medical datasets, a specific disease of a person
must not be disclosed into the public domain. Today several known methods of PPDM exist and
are examined thoroughly below.

1.10.2 Privacy Issues and Policies:

Privacy is the capacity of an individual or group to protect information about themselves. The
most notable issues are providing confidentiality while preserving the data, and the
computational overhead. If cryptographic procedures are used for privacy preservation, they add
considerable computational complexity. In a distributed setting, when the number of participants
becomes larger, the communication cost grows exponentially [11].

A privacy policy is a set of principles that disclose some of the ways an organization collects,
manages, discloses, and uses customers' data. In order to ensure privacy, one must address
various privacy attacks, which requires a high degree of deliberation. In data mining, a privacy
attack occurs when someone's precise private information is openly linked to him. Since it is
difficult to identify all types of attacks in advance, data providers can follow the policies issued
by different nations, such as HIPAA of the US, the Information Technology Act, 2000 of India,
and the Data Protection Act of the UK.

1.11 REQUIREMENTS OF A PPDM ALGORITHM:

I. Accuracy: Accuracy is closely related to the information loss resulting from the hiding
strategy: the lower the information loss, the better the data quality. This measure largely
depends on the specific class of PPDM algorithms. A PPDM algorithm must always
maintain high accuracy to reduce information loss (Aggarwal et al. 2008).
II. Completeness and Consistency: Completeness assesses the degree of missing data in the
sanitized database. Incomplete data has a significant impact on data mining results and
prevents data mining algorithms from providing an accurate representation of the
underlying data. Consistency is related to the semantic constraints holding on the data,
and it measures how many of these constraints are still satisfied after the sanitization
(Kantarcioglu et al. 2007).
III. Scalability: Scalability is another important aspect for assessing the performance of a
PPDM algorithm. In particular, scalability describes the efficiency trends as data sizes
increase. This parameter concerns the growth of both execution-time and storage
requirements, and of the communication costs incurred by a data mining technique, as
data size increases (Bertino et al. 2008).
IV. Data quality: Data quality is an important aspect of PPDM. High-quality data that has
been prepared specifically for data mining tasks will result in useful data mining models
and output. Conversely, low-quality data has a critically negative effect on the utility of
data mining results (Bettini et al. 2009).
V. Security: Security is the level of protection against danger, damage, loss, and crime.
There are two main approaches to dealing with the privacy problems that arise today.
The first is a legal and policy approach, whereby organizations are limited in how they
store and use data based on privacy law and public policy. It typically works by assessing
situations and deciding whether the privacy breach caused by using the given data is
justified or not. The second approach is technological, and provides enforced privacy
guarantees through cryptographic means. This approach makes it possible to use the data
while preventing privacy breaches [12].

1.12 NEED FOR PRIVACY:

With the present-day world becoming digitized, there is an expansion of electronic data, and it
is essential to analyze, for example, the financial patterns of users in society. Privacy is a critical
concern whenever data disclosure is considered. For example, medical data is sensitive because
it contains information about patients' diseases, so it is important to privatize this data before
making it available for data mining. In medical settings it is vital to protect the mining model
with effective privacy measures; otherwise it will lead to inaccurate predictions, which is
unacceptable. Person-specific details must not be disclosed, since doing so may be considered
unethical. Privacy can be characterized as the prevention of undesired disclosure of data when
data mining is performed on aggregate results. Privacy must be addressed at all levels at which
mining is done. Both privacy and security constrain the data mining task. A clear demarcation
between the security and privacy requirements of published data is essential. [12] identifies the
importance of security and privacy in data mining [13][14]. In that paper, the authors first
distinguish between privacy and security in the context of Census data. The rest of this section
gives an introduction to privacy policies and issues handled by various governing bodies within
India and other countries.

1.13 COMPARISONS OF DIFFERENT PRIVACY PRESERVATION TECHNIQUES [15]:

Table 1.2: Comparisons of different privacy preservation techniques

Title | Algorithm | Parameters | Conclusion
Anonymization of Centralized and Distributed Social Networks via Sequential Clustering | Sequential clustering with the SaNGreeA algorithm | Clustering coefficient, diameter, average distance, effective diameter, epidemic threshold | Offers a sequential clustering algorithm for anonymizing social networks.
Data Mining for Privacy Preserving Association Rules Based on Improved MASK Algorithm [15] | Data Perturbation and Query Restriction (DPQR) | Multi-parameter perturbation | Both the privacy-preserving degree and time efficiency are achieved; DPQR is suitable for Boolean data.
K-Anonymity for Crowdsourcing Database | K-Anonymity algorithm | Number of tuples and spaces, used to measure overall system performance | Outperforms standard k-anonymity approaches in retaining the adequacy of crowdsourcing.
Privacy Preserving Decision Tree Learning Using Unrealized Data Sets [15] | Tree learning algorithm, decision tree generation | Temperature, humidity, wind, play | The decision tree algorithm combines well with other privacy-preserving methodologies, such as cryptography, for additional protection.
Secure and Privacy-Preserving Smartphone-Based Traffic Information Systems | KeyGen(n) algorithm with a group signature center (GSC) timestamp | Accuracy, simulation, location | A localization algorithm suitable for GPS location samples, evaluated through realistic simulations.
On Design and Analysis of Privacy-Preserving SVM Classifier [15] | Data mining algorithm, SVM classification algorithm, kernel adatron algorithm, and the Datafly algorithm | Cost parameter and kernel parameter, used to quantify system performance | PPSVC achieves classification accuracy comparable to the original SVM classifier while securing the sensitive content of support vectors.
Privacy-Preserving Gradient-Descent Methods | GA | Language-modeling smoothing parameters and weight parameters, used to gauge system performance | The secure building blocks are scalable, and the proposed protocols allow choosing a more secure protocol for each application scenario.
A Data Mining Perspective in PPDM Systems [15] | C5.0 data mining algorithm, commutative RSA cryptographic algorithm | Area under the ROC curve, dataset ID, sensitivity, 1-specificity | Overcomes the overheads arising from key exchange and key computation by adopting the cryptographic algorithm.
Incentive Compatible Privacy-Preserving Data Analysis | Data analysis algorithms | Deterministically non-cooperatively computable (DNCC) | By Claim 5.1, if the last step of a PPDA task is in DNCC, it is always possible to make the whole PPDA task satisfy the DNCC model.
Privacy and Quality Preserving Multimedia Data Aggregation for Participatory Sensing Systems [15] | Outlier detection / anomaly detection algorithm, secure hash algorithm | Detection rate, data range indices, anomaly score | A general process for computing bounds on nonlinear privacy-preserving data mining (PPDM) approaches, with applications to anomaly detection.

1.14 CLUSTERING:
Clustering [16] is a data mining technique whose most important algorithm, k-means, has been
extensively studied in the context of privacy preservation. Surveying privacy-preserving
k-means clustering approaches separately from other privacy-preserving data mining approaches
is worthwhile because this algorithm is used in important areas, such as image and signal
processing, where the privacy problem is strongly posed. Most works on privacy-preserving
clustering build on the k-means algorithm by applying the model of secure multi-party
computation to various data distributions (vertically, horizontally, and arbitrarily partitioned
data). Among the formulations of partition clustering based on the minimization of an objective
function, the k-means algorithm is the most widely used and studied. Given a dataset D of n
entities (objects, data points, items, ...) in real p-dimensional space R^p and an integer k, the
k-means clustering algorithm partitions D into k disjoint subsets, called clusters. Each cluster is
represented by its center, which is the centroid of all entities in that subset.

The need to preserve privacy in the k-means algorithm arises when it is applied to data
distributed over several sites, so-called "parties", that wish to cluster the union of their datasets.
The aim is to prevent any party from seeing or inferring the data of another party during the
execution of the algorithm. This is accomplished by using secure multi-party computation,
which provides a formal model for preserving the privacy of data [16].
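To make the above concrete, here is a minimal single-site k-means sketch (the toy points are invented for illustration). Privacy-preserving variants compute exactly these assignment and centroid-update steps, but jointly over partitioned data via secure multi-party computation, so that no party reveals its raw points:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain (non-private) k-means on tuples in R^p."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers drawn from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each center moves to the centroid of its cluster.
        for j, c in enumerate(clusters):
            if c:
                centers[j] = tuple(sum(x) / len(c) for x in zip(*c))
    return centers

pts = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
centers = kmeans(pts, k=2)
```

In the secure multi-party setting, each sum and count in the update step is computed with cryptographic primitives so that only the final cluster centers become known to the parties.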

1.15 PPDM TECHNIQUES:

Recent years have seen broad research in the field of PPDM. As a research direction in data
mining and statistical databases, privacy-preserving data mining has received substantial
attention, and many researchers have carried out a good number of studies in the area. Since its
inception in 2000 with the pioneering work of Agrawal and Srikant [17] and Lindell and Pinkas
[18], privacy-preserving data mining has gained increasing popularity in the data mining
research community and has become an essential topic in data mining research [19-20].

As a result, a whole new set of methodologies was introduced to permit the mining of data while
at the same time avoiding the release of any hidden and sensitive information. Most of the
existing methodologies can be grouped into two general categories [21]: (i) methodologies that
protect the sensitive data itself in the mining process, and (ii) methodologies that protect the
sensitive data mining results (i.e., the extracted knowledge) produced by the application of data
mining. The first category refers to methodologies that apply perturbation, sampling,
generalization or suppression, transformation, and similar techniques to the original datasets in
order to generate sanitized counterparts that can be safely disclosed to untrusted parties. The
goal of this category of methodologies is to enable the data miner to obtain accurate data mining
results even when it is not provided with the real data. Secure multiparty computation techniques
have also been proposed to enable multiple data holders to collectively mine their data without
revealing their datasets to each other. The second category deals with techniques that restrict the
disclosure of sensitive knowledge patterns derived through the application of data mining
algorithms, as well as with techniques for reducing the effectiveness of classifiers in
classification tasks so that they do not reveal sensitive data. In contrast to the centralized model,
the Distributed Data Mining (DDM) model assumes that an individual's data is distributed
across multiple sites. Within this area, algorithms are developed for the problem of efficiently
obtaining mining results from all the data across these distributed sources. A straightforward
approach to data mining over multiple sources that will not share data is to run existing data
mining tools at each site independently and to combine the results [22]. However, this will often
fail to give globally valid results. Issues that cause a difference between local and global results
include: (i) values for a single entity may be split across sources, so mining at individual sites
cannot detect cross-site correlations; (ii) the same item may be duplicated at different sites and
will then be over-weighted in the results; (iii) the data at a single site is likely to come from the
same local population, which can hide global patterns. PPDM tends to transform the original
data so that the results of the data mining task do not violate privacy constraints. Following is
the list of five dimensions along which different PPDM techniques can be classified [23]:

i. Data or rule hiding
ii. Data distribution
iii. Data modification
iv. Data mining algorithms
v. Privacy preservation

Data or Rule Hiding: This dimension indicates whether raw data or aggregated data should be
hidden. Hiding aggregated data in the form of rules is very difficult, and for this reason
heuristics are often developed.

Data Distribution: This dimension refers to the distribution of the data. Some approaches are
developed for centralized data, while others address a distributed data scenario. Distributed data
scenarios can be further divided into horizontal and vertical data partitioning.

Data Modification: Data modification changes the original values of a database that is to be
released to the public, in order to guarantee high privacy protection. Methods of data
modification include:

i. Perturbation: replacing an attribute value with a new value (e.g., changing a 1-value to a
0-value, or adding noise).
ii. Blocking: replacing an existing attribute value with a "?".
iii. Swapping: exchanging values between individual records.
iv. Sampling: releasing the data of only a sample of the population.
v. Encryption: various cryptographic procedures are used for encryption.
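Three of these modification methods can be sketched on a toy table (field names and values are invented for illustration; this is not a production sanitizer):

```python
import random

def sanitize(records, seed=0):
    """Apply perturbation, blocking, and swapping to a list of records."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]        # copy; never modify the originals
    for r in out:
        r["age"] += rng.randint(-2, 2)      # perturbation: add small noise
        r["zip"] = r["zip"][:3] + "??"      # blocking: mask digits with '?'
    # Swapping: exchange the sensitive value between two random records.
    i, j = rng.sample(range(len(out)), 2)
    out[i]["disease"], out[j]["disease"] = out[j]["disease"], out[i]["disease"]
    return out

rows = [{"age": 34, "zip": "30213", "disease": "flu"},
        {"age": 45, "zip": "30214", "disease": "asthma"},
        {"age": 29, "zip": "30215", "disease": "cancer"}]
released = sanitize(rows)
```

Note how aggregate properties survive: the multiset of diseases is unchanged and ages are off by at most 2, yet individual records no longer match their true values exactly.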

1.16 PRIVACY PRESERVING TECHNIQUES:

1.16.1 Heuristic-based techniques: an adaptive modification that alters only selected values, so
as to minimize the utility loss, rather than all available values.

1.16.2 Cryptography-based techniques: this family includes secure multiparty computation, in
which a computation is considered secure if no party learns anything beyond its own input and
the final result. Cryptography-based algorithms protect privacy in a distributed setting by using
encryption techniques.
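A classic building block of such protocols is a secure sum based on additive secret sharing. The sketch below simulates all parties in one process, purely for illustration: each party's private value is split into random shares so that any n-1 shares look uniformly random and reveal nothing, yet the published partial sums reconstruct the total:

```python
import random

PRIME = 2_147_483_647  # all share arithmetic is done modulo this prime

def share(value, n_parties, rng):
    """Split a value into n additive shares summing to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_inputs, seed=0):
    """Each party distributes shares of its input to all parties; every
    party publishes only the sum of the shares it received, and the
    published partial sums add up to the total of the private inputs."""
    rng = random.Random(seed)
    n = len(private_inputs)
    all_shares = [share(v, n, rng) for v in private_inputs]
    partial = [sum(all_shares[i][j] for i in range(n)) % PRIME
               for j in range(n)]
    return sum(partial) % PRIME

total = secure_sum([120, 75, 230])   # -> 425
```

Real protocols run the parties on separate machines and add defenses against collusion and malicious behavior; this sketch shows only the arithmetic idea behind the secure computation.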

1.16.3 Reconstruction-based techniques: here the original distribution of the data is
reconstructed from the randomized data. Based on the dimensions above, different PPDM
techniques may be grouped into the following five categories [24-25, 26]:

• Anonymization based PPDM
• Perturbation based PPDM
• Randomized Response based PPDM
• Condensation approach based PPDM
• Cryptography based PPDM

We discuss these in detail in the following subsections.

1.16.4 Anonymization based PPDM:

The basic form of the data in a table comprises the following four kinds of attributes:

1) Explicit Identifier: a set of attributes containing information that identifies a record
owner explicitly, such as name, SSN, and so on.
2) Quasi Identifier: a set of attributes that could potentially identify a record owner when
joined with publicly available data.
3) Sensitive Attribute: a set of attributes that contains sensitive person-specific information,
such as disease or salary.
4) Anonymization refers to an approach where the identity and/or sensitive data about
record owners are hidden, under the assumption that the sensitive data should be retained
for analysis. Clearly, explicit identifiers should be removed, but there is still a danger of
privacy intrusion when quasi-identifiers are linked to publicly available data. Such
attacks are called linking attacks. For instance, attributes such as DOB, Sex, Race, and
Zip are available in public records such as voter lists.

Figure 1.2 Linking Attack

Such attributes are present in medical records as well and, when linked, can be used to infer the identity of the corresponding individual with high probability, as shown in figure 1.2. The sensitive data in a medical record is the disease or even the medication prescribed. Although the explicit identifiers such as name and SS number have been removed from the medical records, the identity of an individual can still be predicted with high probability. Sweeney [27] proposed the k-anonymity model, which uses generalization and suppression to achieve k-anonymity, i.e., any individual is indistinguishable from at least k-1 others with respect to the quasi-identifier attributes in the anonymized dataset. In other words, a table is k-anonymous if the quasi-identifier values of each row are identical to those of at least k-1 other rows. Replacing a value with a less specific but semantically consistent value is called generalization, while suppression involves blocking (withholding) values altogether. Releasing such data for mining reduces the risk of identification when it is combined with publicly available data; but, at the same time, the accuracy of applications on the transformed data is reduced. Various algorithms have been proposed recently to implement k-anonymity using generalization and suppression. Although the anonymization method guarantees that the transformed data is truthful, it suffers heavy information loss. Moreover, it is not safe against the homogeneity attack and the background-knowledge attack [28].

The limitations of the k-anonymity model stem from two assumptions.

First, it may be very difficult for the owner of a database to determine which of the attributes are or are not available in external tables. The second limitation is that the k-anonymity model assumes a particular method of attack, while in real scenarios there is no reason why the attacker should not try other methods. Nevertheless, as a research direction, k-anonymity in combination with other privacy-preserving methods should be examined for detecting and even blocking k-anonymity violations [27].
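The generalization and suppression idea above can be sketched in a short Python example. The table, the choice of quasi-identifiers, and the ZIP-generalization rule below are illustrative assumptions, not data from the cited work:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """Check whether every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

def generalize_zip(row, digits=2):
    """Generalization: replace the trailing digits of the ZIP code with '*'."""
    g = dict(row)
    g["Zip"] = g["Zip"][:-digits] + "*" * digits
    return g

records = [
    {"Sex": "M", "Zip": "30211", "Disease": "Flu"},
    {"Sex": "M", "Zip": "30212", "Disease": "Cancer"},
    {"Sex": "F", "Zip": "30215", "Disease": "Flu"},
    {"Sex": "F", "Zip": "30218", "Disease": "Asthma"},
]

qi = ["Sex", "Zip"]
print(is_k_anonymous(records, qi, k=2))    # raw table: False, every row is unique
released = [generalize_zip(r) for r in records]
print(is_k_anonymous(released, qi, k=2))   # generalized table: True
```

After generalization, each quasi-identifier combination is shared by at least two rows, so a linking attack on Sex and Zip can no longer single out one individual.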

1.16.5 Perturbation Based PPDM:


Perturbation has been used in statistical disclosure control because of its inherent simplicity, efficiency, and ability to preserve statistical information. In perturbation, the original values are replaced with synthetic values so that the statistical information computed from the perturbed data does not differ from the statistical information computed from the original data to any large degree. The perturbed records no longer correspond to real record holders, so an attacker cannot perform identity linkages or recover sensitive knowledge from the released data. Another branch of privacy-preserving data mining that addresses the drawbacks of the perturbation approach is the family of cryptographic techniques [28].
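A minimal sketch of additive perturbation follows, assuming zero-mean Gaussian noise and an illustrative list of salaries: individual values are masked, while aggregate statistics are approximately preserved.

```python
import random
import statistics

def perturb(values, sigma=5.0, seed=7):
    """Additive perturbation: mask each value with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, sigma) for v in values]

salaries = [42, 55, 61, 48, 70, 53, 66, 59] * 50   # repeated to give a large sample
noisy = perturb(salaries)

# No noisy value reveals the original, but the means remain close.
print(round(statistics.mean(salaries), 1))
print(round(statistics.mean(noisy), 1))
```

Because the noise has zero mean, averaging over many records cancels it out; this is what allows the original distribution to be reconstructed for mining while individual records stay hidden.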

1.16.6 Randomized Response Based PPDM:
Fundamentally, randomized response is a statistical technique introduced by Warner for handling a survey problem. In randomized response, the data is scrambled in such a way that the central place cannot tell, with a probability better than a predefined threshold, whether the data received from a user contains correct or incorrect information. Although the information received from each individual user is distorted, if the number of users is large the aggregate information of these users can be estimated with good accuracy. This is particularly useful for decision-tree classification, which depends on aggregate values of a dataset rather than individual data items. Data collection in the randomization method is done in two steps. In the first step, the data providers randomize their data and transmit the randomized data to the data receiver. In the second step, the data receiver reconstructs the original distribution of the data by using a distribution reconstruction algorithm. The randomized response model is shown in figure 1.3.

Figure 1.3 Randomization Response Model

The randomization method is comparatively simple and does not require knowledge of the distribution of the other records in the data. Hence, the randomization method can be applied at data-collection time, and it does not require a trusted server holding all the original records in order to perform the anonymization procedure. The weakness of a randomized-response-based PPDM technique is that it treats all records equally, irrespective of their local density.
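Warner's randomized response and the distribution-reconstruction step can be sketched as follows; the truthful-answer probability p = 0.75 and the 30% population rate below are illustrative assumptions:

```python
import random

def randomized_response(truth, p, rng):
    """Warner's mechanism: report the true answer with probability p,
    otherwise report its negation."""
    return truth if rng.random() < p else not truth

def estimate_true_rate(responses, p):
    """Reconstruction step: observed = p*pi + (1-p)*(1-pi); solve for pi."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(1)
truths = [rng.random() < 0.30 for _ in range(100_000)]   # 30% hold the sensitive trait
reported = [randomized_response(t, 0.75, rng) for t in truths]
print(round(estimate_true_rate(reported, 0.75), 2))
```

No single report can be trusted (one in four is flipped), yet the aggregate rate is recovered to within a fraction of a percent, which is exactly the property the decision-tree use case relies on.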

This poses a problem: outlier records become more vulnerable to adversarial attacks compared with records in denser regions of the data. One remedy is to indiscriminately add noise to all of the records in the data; however, this reduces the utility of the data for mining purposes, as the reconstructed distribution may not yield results consistent with the purpose of the data mining.

Condensation approach based PPDM: The condensation approach constructs constrained clusters in the dataset and then generates pseudo-data from the statistics of these clusters. It is called condensation because of its approach of using condensed statistics of the clusters to generate pseudo-data. It creates groups of non-homogeneous size from the data, such that each record is guaranteed to lie in a group whose size is at least equal to its anonymity level. Subsequently, pseudo-data is generated from each group so as to create a synthetic dataset with the same aggregate distribution as the original data. This approach can be used effectively for the classification problem. The use of pseudo-data provides an additional layer of protection, as it becomes difficult to perform adversarial attacks on synthetic data. Moreover, the aggregate behavior of the data is preserved, making it useful for a variety of data mining problems [28]. This method achieves better privacy preservation than other techniques because it uses pseudo-data rather than modified data.

In addition, it works without redesigning data mining algorithms, since the pseudo-data has the same format as the original data. It is very effective for data-stream problems, where the data is highly dynamic. At the same time, data mining results are affected, since a large amount of information is lost through the compression of a larger number of records into a single statistical group entity.
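The condensation idea, generating pseudo-data from per-group statistics, can be sketched as below. The grouping of a single numerical attribute and the Gaussian sampling are simplifying assumptions made here for illustration; the published approach works on multidimensional clusters with full first- and second-order statistics.

```python
import random
import statistics

def condense(values, group_size):
    """Partition sorted records into groups of at least `group_size` and keep
    only the per-group statistics (the 'condensed' representation)."""
    values = sorted(values)
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        groups[-2].extend(groups.pop())    # fold a short tail into the previous group
    return [(statistics.mean(g), statistics.pstdev(g), len(g)) for g in groups]

def pseudo_data(stats, seed=3):
    """Generate synthetic records from the condensed group statistics."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sd) for mu, sd, n in stats for _ in range(n)]

ages = [23, 25, 31, 34, 41, 44, 52, 55, 61, 64]
stats = condense(ages, group_size=5)       # group size plays the role of k
synthetic = pseudo_data(stats)
print(len(synthetic))                      # same number of records as the original
print(round(statistics.mean(synthetic)))   # aggregate behaviour approximately preserved
```

Only the tuples in `stats` would be released; the synthetic records drawn from them have the same format and similar aggregates as the originals, but correspond to no real individual.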

1.16.7 Cryptography Based PPDM:


Consider a scenario in which multiple medical institutions wish to conduct joint research for mutual benefit without revealing unnecessary information. In this scenario, research regarding symptoms, diagnoses and medication based on various parameters is to be conducted, and at the same time the privacy of the individuals is to be protected. Such scenarios are referred to as distributed computing scenarios. The parties involved in such mining tasks may be mutually untrusted parties or competitors; therefore protecting privacy becomes a major concern. Cryptographic techniques find their utility in such situations for two reasons: first, they offer a well-defined model of privacy that includes methods for proving and quantifying it; second, a large toolset of cryptographic algorithms and constructs is available for implementing privacy-preserving protocols. The data may be distributed among the participants vertically or horizontally. All of these techniques are essentially based on a special encryption protocol known as Secure Multiparty Computation (SMC) technology. SMC, as used in distributed privacy-preserving data mining, comprises a set of secure sub-protocols employed over horizontally and vertically partitioned data: secure sum, secure set union, secure size of set intersection, and scalar product. Although cryptographic techniques guarantee that the transformed data is exact and secure, the approach fails to deliver when more than a few parties are involved. Additionally, the data mining results themselves may breach the privacy of individual records. A good number of solutions exist for the semi-honest model, but very few studies have addressed the malicious model [28].
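The secure sum sub-protocol mentioned above can be sketched for the semi-honest setting. The hospital counts, the modulus, and the single-function framing (in the real protocol each party holds its value and only masked totals travel between sites) are illustrative assumptions:

```python
import random

def secure_sum(private_values, modulus=1_000_000, seed=11):
    """Secure sum sketch: party 1 adds a random mask r to its value; each
    subsequent party adds its own value mod m, seeing only a masked running
    total; party 1 finally removes r to reveal the sum and nothing else."""
    rng = random.Random(seed)
    r = rng.randrange(modulus)
    running = (r + private_values[0]) % modulus    # party 1 starts the circuit
    for v in private_values[1:]:                   # each party sees a masked total
        running = (running + v) % modulus
    return (running - r) % modulus                 # party 1 unmasks the final sum

hospital_counts = [120, 310, 95]                   # each site's private patient count
print(secure_sum(hospital_counts))                 # → 525
```

Because every intermediate total is offset by r, which only party 1 knows, no party in the ring learns another party's private count, yet all parties obtain the exact aggregate.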

1.17 ISSUES IN DESIGNING A PPDM ALGORITHM:


The major challenges for a PPDM algorithm for association rule hiding are information loss, computational expense, the need to recover the original data after hiding, and efficiency on very large datasets.

1.17.1 Challenges of a PPDM Algorithm:

Information loss: The information loss is defined as the ratio between the sum of the absolute errors made in computing the frequencies of items from a sanitized database and the sum of all the frequencies of items in the original database. Inference control in databases, also known as Statistical Disclosure Control (SDC), is concerned with protecting data so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond [29]. This is an important application in several areas, such as official statistics, health statistics, and e-commerce (sharing of customer data). Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users.
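The information-loss ratio defined above can be computed directly; the item frequencies below are illustrative values, not taken from any cited experiment:

```python
def information_loss(original_freq, sanitized_freq):
    """Ratio of the total absolute frequency error to the total
    original frequency, per the definition above."""
    total_error = sum(abs(original_freq[item] - sanitized_freq.get(item, 0))
                      for item in original_freq)
    return total_error / sum(original_freq.values())

orig = {"bread": 40, "milk": 35, "beer": 25}
sanit = {"bread": 38, "milk": 30, "beer": 12}   # frequencies after rule hiding
print(information_loss(orig, sanit))            # → 0.2
```

Here the sanitization distorted 20 frequency counts out of 100, giving a loss ratio of 0.2; a good hiding strategy keeps this ratio as small as possible.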

i. Expensive: Many of the encryption-based protocols use the idea introduced by Yao. In Yao's protocol, one of the parties computes a scrambled (garbled) version of a Boolean circuit for evaluating the desired function. The encrypted circuit comprises encryptions of all possible bit values on every possible wire in the circuit. The number of encryptions is roughly 4m, where m is the number of gates in the circuit. The encryptions can use symmetric-key encryption, which has a typical ciphertext length of 64 bits. The garbled circuit is sent to the other party, which can then evaluate the circuit to obtain the final result. These approaches are, in general, expensive since they require intricate encryptions for every individual bit [30].
ii. Recover original data after hiding: PPDM comprises a number of procedures for recovering data from a very large database that also contains sensitive data. k-anonymity is a method to suppress or generalize the data so that the data cannot be accessed by any unauthorized user.
iii. Support of large datasets: Due to continuous advances in hardware technology, large amounts of data can now be easily stored. Databases and data warehouses today store and manage increasingly large amounts of data. Thus, a PPDM algorithm must be designed and implemented with the capability of handling huge datasets that may keep growing. The slower the decline in the efficiency of a PPDM algorithm as the data size increases, the better its scalability. The scalability measure is therefore essential in identifying practical PPDM techniques [31].

1.17.2 Requirements of a PPDM algorithm:

A. Accuracy:

The accuracy is closely related to the information loss resulting from the hiding strategy: the smaller the information loss, the better the data quality. A PPDM algorithm must therefore maintain high accuracy to keep information loss low.

B. Completeness and Consistency:

Completeness assesses the level of missed data in the sanitized database. Incomplete data has a
significant impact on data mining results and impairs the data mining algorithms from providing
an accurate representation of the underlying data.

C. Scalability:

In particular, scalability describes the efficiency trends as data sizes increase. This parameter concerns the growth of both execution time and storage requirements, as well as the communication costs incurred by a data mining technique, as the data size increases [32].

D. Data quality:

It is an important aspect of PPDM. High-quality data that has been prepared specifically for data
mining tasks will result in useful data mining models and output. Alternatively, low-quality data
has a significant negative impact on the utility of data mining results [33].

E. Security:

It is the degree of protection against danger, damage, loss, and crime. There are two main approaches to dealing with the privacy problems that arise today. The first is a legal and policy approach, whereby organizations are limited in how they store and use data based on privacy law and public policy; it typically works by assessing scenarios and deciding whether the privacy breach caused by using the given data is justified or not. The second approach is technological, and provides enforced privacy guarantees through cryptographic means. This approach has the capability of enabling the data to be used while preventing privacy breaches [34].

1.18 DATA ENCRYPTION STANDARD (DES):

DES was the final outcome of a project set up by the International Business Machines (IBM) Corporation in the late 1960s, which resulted in a cipher referred to as LUCIFER. A modified form of LUCIFER was put forward as a proposal for the new national encryption standard requested by the National Bureau of Standards (NBS), and it was fully adopted in 1977 as DES. DES is based on a cipher called the Feistel block cipher, a block cipher developed by the IBM cryptography researcher Horst Feistel in the early 1970s. It consists of a number of rounds in which each round performs bit-shuffling, nonlinear substitutions (S-boxes) and exclusive-OR operations. When a plaintext message is to be encrypted, it is arranged into the 64-bit blocks required for input; if the number of bits in the message is not evenly divisible by 64, the last block is padded [35] [36]. DES performs an initial permutation on the whole 64-bit block of data. The block is then split into two 32-bit sub-blocks, Li and Ri, which are passed through 16 rounds (the subscript i in Li and Ri denotes the current round). All rounds are identical, and the effect of increasing their number is twofold: the algorithm's security is increased while its temporal efficiency is decreased. Clearly these are two conflicting outcomes and a compromise has to be made; for DES the number chosen was 16, most likely to guarantee the elimination of any correlation between the ciphertext and either the plaintext or the key. At the end of the 16th round, the 32-bit Li and Ri output halves are swapped to form what is known as the pre-output. This [R16, L16] concatenation is permuted using a function that is the exact inverse of the initial permutation. The output of this final permutation is the 64-bit ciphertext [37] [38].
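The Feistel structure described above can be sketched with a toy round function. The round function, round keys and plaintext below are illustrative stand-ins: real DES uses expansion, S-boxes, fixed permutations and 16 key-scheduled rounds, none of which are modeled here.

```python
def feistel_encrypt(block64, round_keys, f):
    """Toy Feistel network: split the 64-bit block into two 32-bit halves and
    apply one swap-and-mix step per round key, then the final half-swap."""
    left, right = block64 >> 32, block64 & 0xFFFFFFFF
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    return (right << 32) | left            # [R_n, L_n] pre-output swap, as in DES

def feistel_decrypt(block64, round_keys, f):
    """Decryption is the same structure with the round keys reversed."""
    return feistel_encrypt(block64, list(reversed(round_keys)), f)

# Stand-in round function (DES's real F uses expansion, S-boxes, permutation).
f = lambda half, key: ((half * 0x9E3779B1) ^ key) & 0xFFFFFFFF

keys = [0x0F0F0F0F, 0x12345678, 0xCAFEBABE, 0x0BADF00D]
pt = 0x0123456789ABCDEF
ct = feistel_encrypt(pt, keys, f)
print(hex(ct))
print(feistel_decrypt(ct, keys, f) == pt)   # → True
```

The key property on display is that the Feistel construction is invertible even when the round function f itself is not: running the same network with the key order reversed undoes every round, which is why DES decryption reuses the encryption circuit.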

1.19 ADVANCED ENCRYPTION STANDARD (AES):


AES is the new encryption standard recommended by NIST to replace DES in 2001. The AES algorithm operates on 128-bit data blocks and supports key lengths of 128, 192, and 256 bits; the algorithm is referred to as AES-128, AES-192, or AES-256 depending on the key length. During the encryption-decryption process, AES performs 10 rounds for 128-bit keys, 12 rounds for 192-bit keys, and 14 rounds for 256-bit keys to deliver the final ciphertext or to recover the original plaintext [39]. AES processes a 128-bit data block that can be divided into four basic operational blocks.

These blocks are treated as an array of bytes and organized as a matrix of order 4×4 known as the state. For both encryption and decryption, the cipher begins with an AddRoundKey stage. Before reaching the final round, this output passes through nine main rounds, during each of which four transformations are performed:

1) Sub-bytes,

2) Shiftrows,

3) Mix-columns,

4) Add round Key.

In the last (tenth) round, there is no Mix-columns transformation. Figure 4 demonstrates the overall procedure. Decryption is the reverse process of encryption and uses the inverse functions: Inverse Substitute Bytes, Inverse Shift Rows and Inverse Mix Columns. Each round of AES is governed by the following transformations [40]:

1.19.1 Substitute Byte transformation:


AES operates on a 128-bit data block, which means each data block comprises 16 bytes. In this transformation, each byte of the state is replaced by another byte according to a substitution table (S-box).

1.19.2 Shift Rows transformation:

It is a simple byte transposition: the bytes in the last three rows of the state are cyclically shifted, with the offset depending on the row position. For the second row, a 1-byte circular left shift is performed; for the third and fourth rows, 2-byte and 3-byte circular left shifts are performed respectively.
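The row shifts above can be sketched directly, representing the state as a list of four rows (in the actual standard the state is filled column-by-column from the input bytes; plain integers stand in for bytes here):

```python
def shift_rows(state):
    """AES ShiftRows: row r of the 4x4 state is rotated left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def inv_shift_rows(state):
    """Inverse transformation: rotate each row right by its index."""
    return [row[-r:] + row[:-r] if r else row for r, row in enumerate(state)]

state = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]

shifted = shift_rows(state)
print(shifted[1])                          # → [5, 6, 7, 4]  (row 1 rotated left by 1)
print(inv_shift_rows(shifted) == state)    # → True
```

Row 0 is untouched, rows 1 to 3 rotate by 1, 2 and 3 positions, and Inverse Shift Rows, used during decryption, exactly undoes the transposition.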

1.19.3 Mix columns transformation:


This round is equivalent to a matrix multiplication of each column of the state: a fixed matrix is multiplied with each column vector. In this operation, the bytes are treated as polynomials rather than numbers.

1.19.4 Add round key transformation:

It is a bitwise XOR between the 128 bits of the present state and the 128 bits of the round key. This transformation is its own inverse [40].

Chapter- 2

LITERATURE SURVEY

Tamanna Kachwala et al. [41] note that there are vast future research directions for privacy-preserving data mining (PPDM). First, present studies tend to use various terminologies to report similar or related practices. For example, authors have used data modification, data perturbation, data sanitation, data hiding, and pre-processing as possible ways of preserving privacy; however, all of these in fact refer to approaches that modify the actual data so that private data and knowledge remain private even after the mining process. The lack of a common language for these discussions causes misconceptions and slows down research breakthroughs.

Accordingly, there is a need to standardize the terminology and practice of PPDM. Second, most prior PPDM algorithms were designed for use with data stored in a centralized database. However, in the present global digital environment, data is often stored at multiple sites; with recent advances in information and communication technologies, distributed PPDM methodology may have wider application, particularly in medical, healthcare, banking, military and supply-chain settings. Third, data-hiding methodologies have been the dominant techniques for securing the privacy of mining results, which may lead to leakage of sensitive rules. While some of the algorithms are precise in preserving specific rules as in the original data, they may reduce the accuracy of the other, non-sensitive rules.

Kun Liu et al. [42] explore the possibility of using multiplicative random projection matrices for privacy-preserving distributed data mining. This class of problems is directly related to various other data mining problems such as principal component analysis, clustering and classification. The paper makes primary contributions on two grounds. First, it explores independent component analysis as a possible tool for breaching privacy in the deterministic multiplicative model. Then, it proposes an approximate random-projection-based technique that raises the level of privacy protection while still preserving key properties of the data attributes.

Jaideep Vaidya [43] shows that general and efficient distributed privacy-preserving knowledge discovery is truly feasible. The paper considers the privacy and security aspects associated with distributed data that is partitioned either vertically or horizontally across multiple sites, and the problem of performing data mining on such data. Since random decision trees (RDTs) can build equivalent, accurate, and sometimes even better models at much lower cost, the paper proposes distributed privacy-preserving RDTs; the technique exploits the fact that the randomness in the structure can provide strong privacy with less computation. The results show that the privacy-preserving RDT algorithm scales linearly with dataset size and requires substantially less time than alternative cryptographic approaches.

Chun-Wei Lin et al. [44] discussed a greedy-based approach to hide sensitive itemsets through transaction insertion. The proposed approach first calculates the maximal number of transactions to insert into the original database for fully hiding the sensitive itemsets. The dummy items of the transactions to be inserted are designed by a statistical technique, which can greatly reduce the side effects in PPDM. The sensitive itemsets are then hidden by adding the new transactions into the actual database, thus raising the minimum count threshold to reach the goal. Three factors are taken into consideration. First, the transactions should be determined carefully, incurring the least amount of side effects, in order to completely hide the sensitive itemsets; here, the sensitive itemsets are evaluated respectively to find the maximal number of transactions to insert. Second, the length of each newly inserted transaction is computed according to empirical rules of the standard normal distribution. Last, the already existing large itemsets are alternatively added into the newly inserted transactions according to the transaction lengths determined in the second step. This step avoids the loss of the large itemsets, thereby diminishing the side effects in PPDM.

M. Mahendran et al. [45] proposed a heuristic approach that ensures the privacy of the output by protecting the extracted patterns (itemsets) from malicious inference problems. An efficient algorithm named the Pattern-based Maxcover Algorithm is proposed. This algorithm decreases the dissimilarity between the source dataset and the released database; moreover, the protected patterns cannot be fetched from the released database by an adversary or counterpart, even with a low arbitrary support threshold. Mechanisms are needed that can lead to new privacy control systems for converting a given database into a new one in such a way as to preserve the general rules mined from the real database. The process of transforming the source database into a new database that hides some confidential patterns or rules is called the sanitization process. To do so, a minimum number of transactions are modified by deleting one or more items from them, or even by adding noise to the data by turning items from 0 to 1 in some transactions. The released database is known as a sanitized database. Here, the approach is to slightly alter some data, but this is perfectly acceptable in some real applications.

Chun-Wei Lin et al. [46] proposed two algorithms based on the genetic algorithm (GA): a simple genetic algorithm to delete transactions (sGA2DT) and a pre-large genetic algorithm to delete transactions (pGA2DT). Genetic algorithms (GAs) are able to search for optimal solutions using the natural principles of evolution. A GA-based framework comprising the two algorithms is designed and proposed to address the optimal-selection problems of heuristic-based approaches. A flexible evaluation function, containing three factors with adjustable weightings, is designed to determine whether certain transactions should be chosen for deletion in order to hide sensitive itemsets. The proposed algorithms delete a pre-defined number of transactions for hiding the sensitive itemsets. The pre-large concept is also adopted to reduce the execution time spent rescanning the original database for chromosome evaluation in the proposed algorithms. A straightforward (greedy) approach is designed as a benchmark to evaluate the performance of the two proposed algorithms, sGA2DT and pGA2DT, with regard to execution time, the three side effects (missing itemsets, hiding failures and artificial itemsets), and database dissimilarity in the experiments.

S. Lohiya [47] states that sensitive classification rules can be hidden to keep sensitive and private data from other users. This technique has two steps for preserving privacy: first, identify the transactions supporting the sensitive rule; second, substitute the known values with unknown values. In this approach, the actual database is scanned to identify the transactions supporting the sensitive rule, and then, for every such transaction, the algorithm replaces the sensitive data with unknown values. This approach is applicable to applications that can tolerate unknown values for some of the attributes.

Yu Zhu and Lei Liu [48] study the derivation of optimal randomization schemes for privacy-preserving density estimation. The effect of randomization on data mining is measured by performance degradation and mutual information loss, while interval-based metrics are computed for privacy and privacy loss.

V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati [49] use k-anonymity to protect the identity of respondents in a released dataset, so that each respondent is indistinguishable from at least k-1 others. It quantifies the amount of anonymity retained during the data mining process. K-anonymization reduces the efficiency of data mining algorithms on the anonymized data while rendering privacy preservation. The original k-anonymity proposal and its enforcement via generalization and suppression, to protect the respondents' identities while releasing truthful data, are elaborated and discussed, along with the various ways of applying generalization and suppression.

Yehuda Lindell and Benny Pinkas [50] present an introduction to secure multiparty computation (SMC) and its applicability to PPDM. They discuss the common kinds of errors that appear in the literature when PPDM is implemented with SMC techniques, along with the issues involved in efficiency, and also illustrate the difficulties of building highly efficient protocols.

Ueli Maurer [51] proposed a simple approach to multi-party computation (MPC) with straightforward security proofs. This work achieves security only in the passive-adversary setting, without the possibility of extending it to active-adversary settings.

Aris Gkoulalas-Divanis and Grigorios Loukides [52] address sequential pattern hiding. Publishing sequence datasets offers remarkable opportunities for discovering interesting data patterns. The paper considers how to sanitize data to prevent the disclosure of sensitive patterns during sequential pattern mining, while ensuring that the non-sensitive patterns can still be discovered. The first algorithm attempts to sanitize the data with minimal modification, while the second focuses on minimizing the adverse side effects.

Amruta Mhatre and Durga Toshniwal [53] presented a novel approach to hide sensitive co-occurring sequential patterns. This technique works on dynamic databases, whereas most standard privacy-preservation approaches work only on static databases. Dynamic databases are a generalized model encompassing static, dynamic, and incremental databases, and the process is extended to suit these different types of databases. The technique presented here prevents the occurrence of the most sensitive kinds of patterns by repeatedly suppressing the patterns and keeping them from becoming frequent. Further work is suggested to develop methods for choosing the pattern to be blocked.

Shikha Sharma and Pooja Jain [54] base their work on reducing the confidence and support of the sensitive rules. A modified algorithm is used to hide the sensitive association rules without side effects. To hide a sensitive element, the algorithm repeatedly increases the hiding counter of the rule until its confidence falls below the minimum specified threshold, as opposed to rescanning all transactions and ordering them in increasing or decreasing order. Once the confidence goes below the minimum specified confidence threshold, the rule is hidden, i.e., it will not be found by any data mining algorithm.

Shaofei Wu et al. [55] proposed a new algorithm to balance privacy preservation and knowledge discovery in association rule mining. The scheme executes a filter after the mining stage to screen out or hide the restricted association rules that were discovered. Before executing the algorithm, the data structure of the database and the sensitive association rule set are examined in order to build an efficient model.

Chirag N. Modi et al. [56] proposed an algorithm that provides protection against both the participating parties and other parties that may obtain data through an unsecured channel.

Stanley R. M. Oliveira [57] aims at balancing privacy with the disclosure of useful knowledge by trying to minimize the impact on the sanitized transactions and to reduce the accidentally hidden and ghost rules. Utility here is measured as the number of non-sensitive rules that were hidden as side effects of the data-modification process.

Mohammad Reza Keyvanpour et al. [58] present a careful survey highlighting scores of works on different existing privacy-preserving techniques, their uses and shortcomings. The majority of techniques for privacy computation use some form of data transformation to perform privacy preservation. Typically, such methodologies reduce the granularity of representation or limit access to resources in order to reduce the privacy risk. This reduction in granularity results in some loss of the efficacy of data mining algorithms; this is the usual trade-off between information loss and privacy. Researchers have developed strategies to enable data mining approaches to be applied while preserving the privacy of individuals. Although several methodologies have been proposed for privacy-preserving data mining, the survey gives a detailed overview of some of the approaches used for PPDM and proposes a classification based on the three common methods of PPDM: the data modification approach, the data sanitization approach, and the Secure Multiparty Computation approach.

Ali Inan et al. [59] focus on discovering object-based divergence for privacy preservation. Having thoroughly investigated the diverse techniques available for privacy preservation, they find that existing privacy-preservation techniques operate at only a single level. Even the recently proposed privacy-preservation procedures, such as comparative studies of the different cryptographic and perturbation methods evaluated within their respective theoretical frameworks to demonstrate their effectiveness, are single-level ones. The framework used to compare and contrast each approach on a common platform forms the basis for identifying the appropriate method or approach for a given kind of application of privacy-preserving collaborative filtering. Nonetheless, there are many situations where sharing of information can lead to mutual gain, as in the case of privacy-preserving secure consensus, as mentioned.

E. Poovammal and M. Ponnavaikko [60] proposed a microdata sanitization technique that secures privacy against malicious attacks while preserving data utility for mining tasks. Their approach applies a graded grouping transformation to numerical sensitive attributes and a table-based mapping transformation to categorical sensitive attributes. Experiments on the Adult dataset, comparing results on the actual and transformed tables, show that their task-independent technique can secure privacy while retaining information and utility. In general, two families of approaches, the statistics-based approach and the crypto-based approach, are used for PPDM. One advantage of statistics-based techniques is that they handle huge datasets efficiently.

Patrick Sharkey et al. [61] proposed an approach for statistics-based PPDM. Their approach differs fundamentally from existing techniques: instead of publishing their private datasets (not even in sanitized form), data owners share with each other knowledge models extracted from their own private datasets. The knowledge models obtained from the individual datasets are used to generate pseudo-data, and this data is then used to extract the prevailing "global" knowledge models. Several technical issues, while instrumental, must be carefully addressed. In particular, they propose an algorithm for generating pseudo-data based on decision-tree paths, a procedure for adapting indistinguishability measures of datasets to assess the privacy of decision trees, and an algorithm for pruning a decision tree so as to guarantee a given privacy requirement. An experimental study performed in different environments with several types of datasets, predictive models, and utility measures shows that predictive models learned with the proposed approach are significantly more accurate than those obtained with the existing l-diversity strategy. As organizations collect and share ever more data about their customers, violations of customer privacy are increasing rapidly. Although some sharing serves the general public, for instance identifying the behavior of a disease in medical research, individuals remain worried about the intrusion on their privacy. To avoid such violations, the sensitive attributes of the data are mapped to another domain such that the real values are not disclosed and yet the original associations are preserved.
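One simple way to realize such a domain mapping (a hypothetical sketch of our own, not the authors' exact scheme) is to replace each sensitive value by its rank among the distinct values: the real magnitudes are hidden, yet order relations, and hence order-based associations, are preserved:

```python
def rank_map(values):
    """Map each value to its rank among the sorted distinct values.
    Real magnitudes are hidden, but the ordering (and therefore
    order-based associations) is preserved."""
    ranks = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [ranks[v] for v in values]

salaries = [52000, 71000, 52000, 96000, 64000]
masked = rank_map(salaries)
print(masked)  # → [1, 3, 1, 4, 2]
```

Because the mapping is strictly monotone, any two records compare the same way before and after transformation, which is exactly the "associations preserved" property described above.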

Nafees Qamar et al. [62] address the challenge of determining clinically relevant patterns by treating medical datasets as a black box for both internal and external users of the data: a remote-query mechanism enables users to build and execute database queries. The novelty of the method lies in avoiding the complicated data de-identification systems commonly used to protect patient privacy. The implemented toolkit combines software engineering technologies such as Java EE and RESTful web services to deliver medical data in an unidentifiable XML structure while restricting users according to the need-to-know privacy principle. As a consequence, the procedure inhibits adversarial processing of the data, such as attacks on a medical dataset that use advanced computational approaches to uncover Protected Health Information (PHI). The strategy is validated on an endoscopic reporting application built on openEHR and MST standards. The proposed process is largely motivated by the problems of querying medical datasets faced by medical researchers and by governmental or non-governmental organizations that monitor health-care services to improve the quality of care.

Alexandre Evfimievski et al. [63] observe that PPDM arose in response to two equally important and fundamental (and disjoint) needs: data analysis to deliver better services, and guarantees of the privacy rights of data owners. Difficult as the task of addressing such needs may seem, many tangible efforts have been accomplished. They present an overview of popular PPDM techniques, namely suppression, cryptography, randomization, and summarization. The privacy guarantees, advantages, and disadvantages of each approach are stated to provide a balanced view of the state of the art. Finally, scenarios where PPDM may be used and some directions for future work are outlined.

K. Sashirekha et al. [64] present privacy-preserving data mining as a means of protecting mobile individuals from attackers. The privacy threat includes predicting movement patterns from collected statistical information: an intruder monitors activity models to forecast group movement and tries to access the private data of mobile users. Privacy can be provided by methods such as randomization, distributed privacy-preserving data mining, and k-anonymization, and multi-level frameworks are used to give better privacy. They analyze the different privacy-preserving methods, the multi-level trust policy, and the limitations that arise when using high-dimensional datasets.

Stanley R. M. Oliveira et al. [65] note that concerns about PPDM have emerged globally and that the recent proliferation of PPDM approaches is evident. Motivated by maximizing the number of successful approaches, the current generation of PPDM work moves toward standardization, which will play an essential role in the future of the field. They investigate what urgently needs to be done and take a few steps toward such standardization: first, they describe the difficulties of specifying which data is private in data mining and discuss how privacy can be violated in data mining; they then argue that privacy preservation in data mining rests on users' personal data and on data concerning their mutual activity.

Yehuda Lindell et al. [66] address the problem of privacy-preserving data mining (PPDM). In particular, they consider the setting where two parties owning confidential databases wish to run a data mining computation on the union of their databases without revealing any unnecessary information. The work is motivated by the need both to protect privileged information and to enable its use for research or other purposes. The problem is a special case of secure multiparty computation (SMC) and can, in principle, be solved using known generic protocols. However, data mining algorithms are complex and the input typically consists of massive datasets, so generic protocols are of no practical use and efficient special-purpose protocols are required. They focus on the problem of decision-tree learning with the well-known ID3 algorithm. Their protocol is considerably more efficient than generic solutions and demands only very few rounds of communication and reasonable bandwidth.
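At the core of ID3 is the information-gain computation that Lindell and Pinkas evaluate securely. The plain (non-private) functionality their protocol protects can be sketched as follows; the secure two-party evaluation of the x·log(x) terms, which is their actual contribution, is omitted:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting `rows` (a list of dicts) on `attr`:
    entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

rows = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "weak"}]
labels = ["yes", "no", "yes", "yes"]
print(round(info_gain(rows, labels, "wind"), 3))  # → 0.811
```

ID3 repeatedly selects the attribute with the highest information gain; in the two-party setting, each count in this computation is split between the parties, which is what makes a special-purpose secure protocol necessary.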

Shahejad Khan et al. [67] present a perturbation-based PPDM approach that provides privacy before the data is distributed. Existing approaches assume a single user trust level on the database; they relax this assumption and extend the scope of perturbation-based PPDM to multilevel trust (MLT-PPDM). In this setting, a fraudulent data miner may gain access to differently perturbed copies of the distributed data and may aggregate these copies to obtain more precise knowledge about the data than the owner ever intended to release. Their scheme allows the data owner to generate perturbed copies of its data for arbitrary trust levels on demand, giving data owners maximum flexibility.

Shweta Taneja et al. [68] describe PPDM as hiding an individual's sensitive identity without sacrificing the usability of the data. It has become a vital area of concern, but this branch of research is still in its infancy. People today are well aware of privacy intrusions on sensitive data and are increasingly reluctant to share their data. The major concern is that even non-sensitive data may reveal sensitive information, including personal data, facts, or patterns. Numerous PPDM methodologies have been proposed in the literature. The authors study the state-of-the-art approaches and present a tabular comparison of the work done by various authors; as future work they plan a hybrid of these techniques to preserve the privacy of sensitive data.

Chris Clifton et al. [69] note that privacy-preserving mining of distributed data has many applications, each posing different constraints: what is considered private, what the desired results are, how the data is distributed, what constraints are placed on collaboration and cooperative computing, and so on. They propose as a solution a toolkit of components that can be combined for specific privacy-preserving data mining applications. The paper presents some components of such a toolkit and shows how they can be used to solve several privacy-preserving data mining (PPDM) problems.
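One classic component of such a toolkit is the secure-sum protocol, in which parties compute the sum of their private values without any party revealing its own value. A minimal single-process simulation of the standard ring-based protocol (our own sketch, not code from [69]) is:

```python
import random

def secure_sum(private_values, modulus=10**6):
    """Ring-based secure sum: the initiating party masks its value with a
    random offset, each subsequent party adds its own value modulo
    `modulus`, and the initiator removes the mask at the end. No party
    ever sees another party's raw value, only a masked running total."""
    rng = random.Random()
    mask = rng.randrange(modulus)
    running = (mask + private_values[0]) % modulus
    for v in private_values[1:]:
        running = (running + v) % modulus  # each site sees only a masked total
    return (running - mask) % modulus

print(secure_sum([120, 340, 55]))  # → 515
```

The masked running total is uniformly distributed, so intermediate parties learn nothing; the protocol is, however, vulnerable to collusion between the neighbors of a party, which is why real toolkits combine it with other components.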

Yuna Oh et al. [70] observe that the increasing number of mobile-device users implies the growth of personalized location-based services (LBS). Despite their proliferation, the hazard of violating users' privacy by exposing their location information remains, and many studies have accordingly sought to avert privacy violations in LBS. However, previous research mostly considers protecting users' location information without considering semantic location privacy violations through contextual information. This paper explains how a user's behavior can be inferred from semantic knowledge that combines spatial and temporal information, and proposes a privacy-preserving method to prevent the exposure of sensitive behavior in semantic LBS. An Android application is implemented to validate the proposed approach. According to the experimental results, the proposed b-diversity approach is shown to avert exposure of sensitive behavior while reducing the degradation of data utility.

Alexandre Evfimievski et al. [71] show that randomization is potentially vulnerable to privacy breaches: based on the data distribution, one may learn with high confidence that some randomized records satisfy a specified property, even though privacy is preserved on average. They exhibit a new formulation of privacy breaches together with a technique, "amplification", for limiting them. Like earlier methods, amplification makes it possible to bound privacy breaches without any knowledge of the actual data distribution. They instantiate the technique for the problem of mining association rules and adapt the mining algorithm to limit privacy breaches without knowledge of the data distribution. They also address the problem that the randomization required to avoid privacy breaches (when mining association rules) results in long transactions: by using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, only the seed needs to be exchanged instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, they define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
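The randomization operator underlying such schemes can be illustrated with per-item random flipping (a toy version in the spirit of [71], not their exact select-a-size operator): each item's presence bit is kept with probability p and flipped otherwise, and the true support is later estimated from the randomized counts:

```python
import random

def randomize(bits, p=0.9, rng=None):
    """Keep each presence bit with probability p, flip it otherwise."""
    rng = rng or random.Random(42)
    return [b if rng.random() < p else 1 - b for b in bits]

def estimate_support(randomized, p=0.9):
    """Unbiased estimate of the true fraction s of 1s, inverting
    E[observed] = p*s + (1-p)*(1-s)  =>  s = (observed - (1-p)) / (2p - 1)."""
    obs = sum(randomized) / len(randomized)
    return (obs - (1 - p)) / (2 * p - 1)

true_bits = [1] * 700 + [0] * 300          # true support 0.7
noisy = randomize(true_bits)
print(round(estimate_support(noisy), 2))   # ≈ 0.7
```

The amplification analysis in [71] asks a sharper question than this aggregate accuracy: given one randomized record, how much can an adversary's belief about the original record change? Bounding that ratio is what the flipping probability p is tuned for.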

Nissim Matatov et al. [72] present an approach for achieving k-anonymity by partitioning the original dataset into several projections such that each of them satisfies k-anonymity. Moreover, any attempt to rejoin the projections yields a table that still adheres to k-anonymity. A classifier is trained on each projection, and an unlabelled instance is then classified by combining the classifications of all the classifiers. Given the k-anonymity constraints and the classification accuracy, a genetic algorithm (GA) is used by the proposed data mining privacy by decomposition (DMPD) algorithm to search for the best feature-set partitioning. Ten different datasets were used with DMPD to evaluate its classification performance against other k-anonymity-based methods. The results show that DMPD performed better than existing k-anonymity-based algorithms without requiring domain-dependent knowledge. They also evaluate, using multi-objective optimization techniques, the trade-off between the two conflicting goals in PPDM: privacy and predictive performance. Since the volume of traffic data in networks has been increasing at a staggering rate, a substantial body of research tries to mine the traffic data in order to obtain valuable information. For instance, several studies identify Internet worms and intrusions by detecting abnormal traffic patterns. However, because network traffic data contains information about the Internet usage patterns of users, the privacy of network users may be threatened during the mining process.

Seung-Woo Kim et al. [73] proposed a robust technique that preserves privacy during the mining of sequential patterns on network traffic data. Their strategy uses an N-repository server model, which works as a single mining server, and a retention-replacement technique, which probabilistically transforms the answer to each query, to find the frequent sequential patterns without breaching privacy. The technique also speeds up the overall mining process by maintaining meta-tables at each site so that it can quickly be determined whether a candidate pattern has ever occurred at that site. The accuracy and effectiveness of the technique are shown through experiments on real-world network traffic data. Recently, various methods based on random perturbation of data records have been introduced to protect user privacy in the data mining process.

K. Srinivasa Rao and V. Chiranjeevi [74] concentrated on an improved distortion process that attempts to improve accuracy by selectively altering the list of items. In the typical distortion process, the presence or absence of every item is modified with an equal probability, and tuning the probability parameters to balance privacy and accuracy is exceptionally difficult. In the enhanced distortion strategy, frequent one-itemsets and non-frequent one-itemsets are altered with different probabilities controlled by two parameters, fp and nfp respectively. These two probability parameters are tuned flexibly by the owner of the data according to his or her requirements for privacy and accuracy. Experiments performed on real-world datasets show a considerable increase in accuracy at a very marginal cost in privacy.
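The selective distortion can be sketched as follows (our own hypothetical reading of the two-parameter scheme, not the authors' code): items belonging to frequent one-itemsets are retained with probability fp and all other items with probability nfp, so the data owner trades privacy against accuracy by tuning the two parameters:

```python
import random

def distort(transaction, frequent_items, fp=0.9, nfp=0.5, rng=None):
    """Keep each item of a transaction with probability `fp` if it belongs
    to a frequent one-itemset and `nfp` otherwise.  Higher fp preserves
    the support counts that mining cares about; lower nfp adds privacy
    on the rarer (more identifying) items."""
    rng = rng or random.Random(7)
    kept = []
    for item in transaction:
        p = fp if item in frequent_items else nfp
        if rng.random() < p:
            kept.append(item)
    return kept

frequent = {"bread", "milk"}
t = ["bread", "milk", "caviar", "soap"]
print(distort(t, frequent))
```

A fuller implementation would also insert false items with some probability, so that absence of an item is not disclosed either; the sketch above shows only the deletion side of the distortion.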

L. Sweeney [75] presents a formal protection model entitled k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. The paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless additional safeguards are in place. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus, and k-Similar provide guarantees of privacy protection.
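The k-anonymity condition itself is easy to check mechanically. The sketch below (illustrative only, not Sweeney's Datafly algorithm) counts how often each quasi-identifier combination occurs and verifies that every record is hidden among at least k-1 others:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears
    at least k times in `records` (a list of dicts)."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

table = [
    {"zip": "130**", "age": "30-39", "disease": "flu"},
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "148**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "20-29", "disease": "ulcer"},
]
print(is_k_anonymous(table, ["zip", "age"], k=2))  # → True
print(is_k_anonymous(table, ["zip", "age"], k=3))  # → False
```

Systems such as Datafly search for the least amount of generalization and suppression that makes this check pass; the check itself is only the acceptance test.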

A. Machanavajjhala, Gehrke, Kifer, and Venkitasubramaniam [76] show that a k-anonymized dataset cannot prevent major attacks when diversity is lacking in the sensitive attributes. They introduce a framework known as l-diversity, which gives stronger privacy guarantees, and verify that l-diversity and k-anonymity are sufficiently similar in their formulation that k-anonymity algorithms can be modified to work with l-diversity. Advances in bar-code technology have made it possible for retail organizations to collect and store massive quantities of sales data, referred to as basket data. A record in such data typically contains the date and the items purchased in the transaction, and large organizations view such databases as a central part of their marketing infrastructure.
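Checking (distinct) l-diversity amounts to requiring that each quasi-identifier group contains at least l distinct sensitive values. The sketch below is a simplified illustration; [76] also define stronger entropy and recursive variants:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least `l` distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

table = [
    {"zip": "130**", "age": "30-39", "disease": "flu"},
    {"zip": "130**", "age": "30-39", "disease": "flu"},   # group has one value only
    {"zip": "148**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "20-29", "disease": "ulcer"},
]
print(is_l_diverse(table, ["zip", "age"], "disease", l=2))  # → False
```

The table above is 2-anonymous yet fails 2-diversity: anyone matched to the first group learns the disease outright, which is exactly the homogeneity attack the authors describe.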

M. Elmisery et al. [77] present a novel clustering algorithm for vertically partitioned data and test its performance through experiments and complexity analysis. They then present a private version of this protocol based on homomorphic encryption; the protocol is robust against colluding attacks. In privacy-preserving range set union for rare cases in healthcare data, J. Y. Chun et al. propose a privacy-preserving range set union protocol (PPRSU) that is used to discover rare cases in the private medical datasets of individuals while preserving privacy. The range set union Rt1,t2 is the set of elements that at least t1 and at most t2 parties have in their private sets. PPRSU can be used to build new set operations as well as conventional set operations, and it discloses no information other than what can be inferred from the range set union and the size of each private set.
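Ignoring the cryptographic machinery, the range set union Rt1,t2 itself is just a threshold count over the parties' sets. A plain (non-private) reference computation of the functionality that PPRSU computes privately looks like:

```python
from collections import Counter

def range_set_union(private_sets, t1, t2):
    """Elements held by at least t1 and at most t2 of the parties:
    the plaintext functionality that the PPRSU protocol computes
    without revealing the individual sets."""
    counts = Counter(x for s in private_sets for x in set(s))
    return {x for x, c in counts.items() if t1 <= c <= t2}

parties = [{"A", "B", "C"}, {"B", "C"}, {"C", "D"}]
print(sorted(range_set_union(parties, 1, 1)))  # → ['A', 'D']
```

Setting t1 = t2 = 1 selects exactly the "rare cases" held by a single party, which is the healthcare use case described above.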

Chapter- 3

SIMULATION TOOL

MATLAB is used in a wide variety of applications, including signal and image processing, communications, control design, test and measurement, computational biology, parallel computing, and financial modeling and analysis.

MATLAB is an array language, initially popular for rapid prototyping, but now increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also contain control-flow-dominated scalar regions that affect the program's execution time. Today's computer systems offer enormous computing power in the form of traditional CPU cores as well as throughput-oriented accelerators such as graphics processing units. Accordingly, a strategy that maps the control-flow-dominated regions of a MATLAB program to the CPU and the data-parallel regions to the GPU can dramatically improve application performance [78].

MATLAB programs are declarative and naturally express data-level parallelism, as the language provides several high-level operators that work directly on arrays. MATLAB is widely used as a programming language to write many kinds of simulations; it is used extensively to simulate and design systems in areas such as control engineering, image processing, and communications. These programs are typically long-running, and developers expend significant effort trying to shorten their running times. Today, MATLAB programs are often translated into compiled languages such as C or FORTRAN to improve performance. These translations are normally done either by hand or by automated systems that compile MATLAB code to C or FORTRAN.

3.1 SIMULATION ENVIRONMENT:


MATLAB is an interactive analysis and visualization tool designed with powerful support for matrices and matrix operations. In addition, MATLAB has excellent graphics capabilities and its own powerful programming language. One reason for MATLAB's success is the availability of sets of MATLAB programs designed to support particular tasks. These collections of programs are called toolboxes, and the particular toolbox of interest to us is the image processing toolbox. Rather than describing all of MATLAB's capabilities, we restrict ourselves to those aspects concerned with the handling of images, introducing functions, commands, and techniques as needed.

A MATLAB function is a keyword that accepts various parameters and produces an output: for instance a matrix, a string, a graph, or a figure. Examples of such functions are sin, imread, and imclose. There are many functions in MATLAB and, as we shall see, it is straightforward (and sometimes necessary) to write our own. A command is a particular use of a function. MATLAB is an excellent language for technical computing: it combines computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses comprise [78]:

i. Math and computation

ii. Algorithm development
iii. Modeling, simulation, and prototyping
iv. Data analysis, exploration, and visualization
v. Engineering and scientific graphics
vi. Application development, including GUI building.

3.2 THE MATLAB SYSTEM COMPRISES OF FIVE MAJOR SECTIONS:

3.2.1. Development Environment:


This is the set of tools and facilities that help you use MATLAB functions and files. Many of these tools are graphical user interfaces.

3.2.2. The MATLAB Mathematical Function Library:


It is a vast collection of computational algorithms ranging from elementary functions such as sum, sine, cosine, and complex arithmetic to more sophisticated functions such as matrix inverse, Bessel functions, matrix eigenvalues, and fast Fourier transforms [78].

48
3.2.3. The MATLAB Language:
This is a high-level matrix/array language with control-flow statements, functions, data structures, input/output, and object-oriented programming features. It permits both "programming in the small", to quickly create quick-and-dirty throw-away programs, and "programming in the large", to create complete, large, and complex application programs.

3.2.4. Handle Graphics:


This is the MATLAB graphics system. It includes high-level commands for two- and three-dimensional data visualization, image processing, animation, and presentation graphics.

3.2.5. The MATLAB Application Program Interface (API):


This is a library that allows you to write FORTRAN and C programs that interact with MATLAB. It includes facilities for calling MATLAB as a computational engine, calling routines from MATLAB (dynamic linking), and reading and writing MAT-files.

3.3 MATLAB OPTIMIZATION TOOLBOX:


There are a number of optimization functions in the MATLAB Optimization Toolbox; they range from linear to nonlinear problems and from quadratic single-objective to multi-objective functions. Table 3.1 lists the optimization functions available in MATLAB.

Table 3.1: MATLAB Optimization functions [79]

S.N. Name (MATLAB) Problem

1 linprog Linear programming

2 fsolve Solve a system of nonlinear equations

3 fzero Scalar nonlinear zero finding

4 fminbnd Scalar bounded nonlinear function minimization

5 lsqlin Linear least squares with linear constraints

6 lsqnonneg Linear least squares with non-negativity constraints

7 quadprog Quadratic programming

8 fminunc Multidimensional unconstrained nonlinear minimization

9 fmincon Multidimensional constrained nonlinear minimization

10 fminsearch Multidimensional unconstrained nonlinear minimization (derivative-free)

11 fseminf Multidimensional constrained minimization (semi-infinite)

12 lsqcurvefit Nonlinear curve fitting via least squares

13 lsqnonlin Nonlinear least squares with upper and lower bounds

14 fgoalattain Multidimensional goal attainment optimization

15 fminimax Multidimensional minimax optimization

3.4 THE MATLAB LANGUAGE:


MATLAB is a high-level language developed by MathWorks. It is a dynamically typed, array-based programming language that is very popular for developing numerical and scientific applications. MATLAB has grown into a diverse and vast language over the years; this section describes some of its important features [79].

3.4.1 MATLAB Variables and Operators:


MATLAB programs, at a basic level, are similar to programs written in a language like C++.
Each program has a set of variables and the program manipulates these variables through
operators and function calls. Values are assigned to variables through the assignment operator =.
For example,

a = 42;

assigns the integer value 42 to the variable a. MATLAB supports variables of a number of primitive types such as logical, int, real, complex, and string. It is also possible to construct arrays with elements of these primitive types. A programmer may construct a matrix of random real elements as follows:

n = 100;
a = rand(n, n);

In the above example, a is a 100×100 matrix of reals, with each element initialized to a random value. In MATLAB, all variables are matrices; scalars are just single-element matrices. However, a variable does not need to be declared to be of a particular type before it is used: MATLAB is a weakly, dynamically typed language. It is said to be dynamically typed because the types of variables are determined only at runtime, and weakly typed because the type of a variable can change through the course of a program. For example, the following is a valid MATLAB program.

a = 42;
disp(a); % 'a' is an integer
% more code...
a = 'Hello World';
disp(a); % 'a' is a string
% even more code...
Here the type of a changes from integer to string when a string is assigned to it on line 4. This is one of the features of MATLAB that make it difficult to compile MATLAB code statically. MATLAB provides a rich set of operators that work on matrices; they are overloaded to perform the appropriate action depending on the size and type of their input operands. Consider the following code segment:
x = 10;
y = 20;
a = rand(100, 100);
z = x + y;
b = a + a;
c = x + a;
The + operators on lines 4, 5, and 6 all perform different operations at runtime. The + on line 4 performs a scalar addition on the variables x and y; the + on line 5 adds two 100×100 matrices; and the + on line 6 adds the scalar x to each element of the matrix a. All arithmetic operators in MATLAB are similarly overloaded. The * operator, for example, performs the appropriate form of multiplication depending on the sizes of its arguments: if both arguments are matrices, it performs a matrix multiplication, while if one is a matrix and the other is a vector, it performs a matrix-vector multiplication. MATLAB also provides operators for element-wise multiplication and division [79].

a = rand(100, 100);
b = rand(100, 100);
c = a .* b;
d = a * b;
In the above code, each element in c is the product of the corresponding elements of a and b, whereas d is the matrix product of a and b.

3.4.2 Control Flow Constructs:


MATLAB provides most of the common control-flow structures: it supports if/else statements and for and while loops. It also supports user-defined functions.

3.4.3 Array and Matrix Indexing:


The basic indexing mechanism is the same as in languages like C++ where an array variable is
indexed using an integer index. In the following code, line 2 assigns 42 to the fifth element of the
vector a.
a = ones (10, 1);
a (5) = 42;
The function ones returns an array of the requested size (a vector of length 10 in this case) with each element initialized to the value 1. MATLAB also supports more sophisticated forms of indexing than the primitive indexing described above: it is possible to index arrays with other arrays. For instance, consider the following code segment.
a = ones (10, 1);
i = 1:3;
a (i) = 42;
i is a vector containing the elements 1, 2, and 3 (the colon operator is described below). After line 3 is executed, the first three elements of a are assigned the value 42. It is also possible to index arrays with index arrays of dimensionality higher than one. Because the idiom in the above example is very common, MATLAB provides the colon operator, which can be used to construct arrays whose values change linearly according to a predefined step size. The array a in the example below contains all integers between 10 and 25 in steps of 5, i.e., its elements are 10, 15, 20, and 25.
a = 10:5:25;

The colon also plays a special role in array indexing. It is used to specify all elements of an array
along a particular dimension.
a = ones (10, 10);
a(:, 1) = 42;
Line 2 in the above example assigns the value 42 to every element in the first column of the 10×10 matrix a. MATLAB also provides the keyword end: when used within the indexer of a particular array, end represents the index of the last element of that array in that dimension.
a = ones (10, 10);
a(5:end, 1) = 42;
The last line in the above code segment assigns 42 to elements 5 to 10 of the first column of the matrix; obviously, such selection mechanisms are also valid on the right-hand side of assignment statements. MATLAB requires the programmer to ensure the compatibility of sub-array dimensions when they are specified by the mechanisms described above. MATLAB also does not require the index into an array to be smaller than the length of the array: when an array is indexed past its end in any dimension, the array simply grows to accommodate the index. Consider the following example.
a = ones(10, 1); % a is a vector of length 10
a(15) = 42;      % a now has length 15
After line 2 in the above program is executed, the vector a has a length of 15. Elements created
when the array expands are assigned a value of zero. Thus, elements a(11) to a(14) get the value
0, while a(15) gets the value 42.

3.4.4 Libraries:
MATLAB has a wide variety of toolboxes that provide users with domain-specific functionality. For example, the communications toolbox provides functionality required for the design and simulation of communication systems, and the image processing toolbox provides APIs for several frequently used image processing functions. However, most of these toolboxes are closed source.

3.5 SUPPORTED MATLAB SUBSET:


As MATLAB has grown into a diverse and vast language over time, the compiler described in this thesis supports only a representative subset of MATLAB. A short description of the supported subset of MATLAB and a collection of additional assumptions made by our compiler implementation are presented below [79].

1) MATLAB supports variables with the primitive types logical, int, real, complex, and
string. It is also possible to build arrays of these types with any number of dimensions.
Currently, our compiler supports every primitive type except string and complex. Further,
arrays are restricted to a maximum of three dimensions.
2) MATLAB supports indexing with multi-dimensional arrays. However, our implementation
currently supports indexing only with one-dimensional arrays.
3) In MATLAB, it is possible to grow arrays by assigning to elements past their
end. We currently do not support indexing past the end of arrays. Further, in this thesis,
we refer to assignments to arrays through indexed expressions (for example, a(i)) as
indexed assignments or partial assignments.
4) We assume that the MATLAB program to be compiled is a single script with no calls to
user-defined or toolbox functions. Support for user-defined functions can
be added by extending the frontend of the compiler. Also, anonymous functions and
function handles are not currently supported [79].
5) In general, the types and shapes (array sizes) of MATLAB variables are not known until
runtime. Our compiler currently relies on a simple data-flow analysis to extract the sizes and
types of program variables. It also relies on programmer input when types cannot be
determined. We intend to extend our type system to support symbolic type inference in
the future. Ultimately, we envision that the techniques described in this thesis will be used in
both compile-time and run-time systems [79].

Chapter- 4

PROPOSED WORK

Data collected using current technologies represents activities of users' social lives that
many assume to be private. To preserve privacy is then to keep this data private, in other
words confidential from the greater public. Not exchanging the data at all would preserve privacy but is
inconvenient and usually also undesirable. Therefore, a great deal of privacy research in
computer science is concerned with weaker forms of data confidentiality such as anonymity.
Anonymity is achieved by unlinking a person's identity from the traces that her
actions leave in information systems. Anonymity keeps the identities of the
individuals private; however, it is not necessarily concerned with how
public the resulting traces become. This is also reflected in data protection legislation, which by
definition cannot and does not cover anonymous data. Clustering is a data mining technique
that has not yet received its due attention in this context, although its most
important algorithm, k-means, has been studied extensively with regard to privacy
preservation.

4.1 PROPOSED WORK:

Cluster Analysis (data segmentation) has a variety of goals that relate to grouping or segmenting a
collection of objects (i.e., observations, individuals, cases, or data rows) into subsets or clusters, such that
those within each cluster are more closely related to one another than objects assigned to different
clusters. Central to all of the goals of cluster analysis is the notion of degree of similarity (or
dissimilarity) between the individual objects being clustered. There are two major methods of clustering:
hierarchical clustering and k-means clustering. For information on k-means clustering, refer to the k-
Means Clustering section.
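The notion of dissimilarity between objects can be made concrete with a simple distance function. The sketch below (illustrative Python, not part of the thesis implementation) shows the Euclidean distance commonly used to compare objects before clustering:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Objects that are "more closely related" have a smaller distance.
print(euclidean((0, 0), (3, 4)))  # -> 5.0
```

Any other metric (Manhattan, cosine, etc.) could be substituted without changing the clustering procedure itself.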

In hierarchical clustering, the data is not partitioned into a particular cluster in a single step. Instead, a
series of partitions takes place, which may run from a single cluster containing all objects to n clusters
that each contain a single object. Hierarchical Clustering is subdivided into agglomerative methods,
which proceed by a series of fusions of the n objects into groups, and divisive methods, which
separate n objects successively into finer groupings. Agglomerative techniques are more
commonly used, and this is the method implemented in XLMiner. Hierarchical clustering may be
represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or
divisions made at each successive stage of analysis. Following is an example of a dendrogram.

In this thesis, the proposed work increases the security of the data by clustering it and then
encrypting/decrypting it. Two algorithms are proposed to demonstrate the efficiency of this
approach.

A. Proposed Work-1
The proposed algorithm attempts to present a new approach for complex encryption and
decryption of data based on parallel programming, so that the new method can make use
of a multi-core processor to achieve higher speed with a better degree of protection.

ALGORITHM-
1. Partition the given dataset using hierarchical clustering.
Given:
    a set X of objects {x1, ..., xn}
    a distance function dist(c1, c2)
for i = 1 to n
    ci = {xi}
end for
C = {c1, ..., cn}
l = n + 1
while C.size > 1 do
    - (cmin1, cmin2) = pair (ci, cj) in C with minimum dist(ci, cj)
    - remove cmin1 and cmin2 from C
    - add {cmin1, cmin2} to C
    - l = l + 1
end while
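The merge loop above can be sketched as runnable code. The following is an illustrative single-linkage implementation in Python under the same structure (find the pair of clusters at minimum distance, merge them, repeat); it is a sketch of the algorithm, not the thesis's MATLAB code:

```python
import math

def dist(c1, c2):
    # Single-linkage: distance between the two closest members of the clusters.
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        # Find the pair (i, j) with minimum inter-cluster distance.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)  # add the fused cluster back to C
    return clusters

print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))
```

The two nearby pairs end up in separate clusters, mirroring the sequence of fusions the pseudocode describes.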

FLOW DIAGRAM-
Figure 4.1 Flowchart of Proposed Work-I

Procedure –
Step 1 – Consider the dataset as input.
Step 2 – Apply the anonymization technique to that particular dataset.
Step 3 – The hierarchical clustering technique is used to partition the dataset into clusters.
Step 4 – The DES encryption technique is used to suppress the data values.
Step 5 – The final result is obtained by the union of the LHS and RHS values formed by the
anonymization technique.
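Step 2's anonymization can be illustrated with a toy suppression/generalization pass. The record fields below are hypothetical examples, and the code is an illustrative Python sketch rather than the thesis implementation: direct identifiers are suppressed and quasi-identifiers are coarsened before the records reach the clustering and DES stages.

```python
def anonymize(record):
    """Suppress the direct identifier and generalize quasi-identifiers."""
    decade = (record["age"] // 10) * 10
    return {
        "name": "*",                            # direct identifier: fully suppressed
        "age": f"{decade}-{decade + 9}",        # generalized to a ten-year range
        "zip": record["zip"][:3] + "**",        # partial suppression of the ZIP code
        "disease": record["disease"],           # sensitive value retained for mining
    }

row = {"name": "Alice", "age": 34, "zip": "30212", "disease": "flu"}
print(anonymize(row))  # -> {'name': '*', 'age': '30-39', 'zip': '302**', 'disease': 'flu'}
```

Generalization keeps records useful for clustering while making individual re-identification harder.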

B. Proposed Work-II

Advanced Encryption Standard (AES) is a symmetric-key block cipher: a single key is used by
both the sender and receiver to encrypt and decrypt the information. Although the block
length of Rijndael can be 128, 192, or 256 bits, the AES standard adopted only the block
length of 128 bits; the key length, however, can be 128, 192, or 256 bits. The AES
algorithm's internal operations are performed on a two-dimensional array of bytes called the State,
in which each byte comprises 8 bits. The State consists of 4 rows of bytes, and each row has Nb bytes.
Each byte is denoted by S(i, j) (0 ≤ i < 4, 0 ≤ j < Nb). Since the block length is 128 bits, each row
of the State contains Nb = 128/(4 x 8) = 4 bytes.

The four bytes in each column of the State array form a 32-bit word, with the row number as
the index for the four bytes in each word. At the start of encryption or decryption, the array of
input bytes is mapped to the State array; a 128-bit block can thus be expressed as 16
bytes: in0, in1, in2, ..., in15. Encryption/decryption is performed on the State, at the
end of which the final value is mapped to the output byte array out0, out1, out2, ..., out15.
The key of the AES algorithm can likewise be mapped to 4 rows of bytes, except that the
number of bytes in each row, denoted by Nk, can be 4, 6, or 8 when the length of the key K is
128, 192, or 256 bits, respectively. AES is an iterative algorithm; each iteration
is called a round. The total number of rounds, Nr, is 10 when Nk = 4, 12 when
Nk = 6, and 14 when Nk = 8.
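The relationships between block length, key length, and round count described above can be checked with a few lines of plain arithmetic (illustrative Python, not an AES implementation):

```python
def aes_params(key_bits, block_bits=128):
    """Return (Nb, Nk, Nr) for a given AES key length in bits."""
    nb = block_bits // 32            # 32-bit words per block: 128 / 32 = 4
    nk = key_bits // 32              # 32-bit words in the key: 4, 6, or 8
    nr = {4: 10, 6: 12, 8: 14}[nk]   # rounds as stated: 10, 12, or 14
    return nb, nk, nr

for bits in (128, 192, 256):
    print(bits, aes_params(bits))
# 128 -> (4, 4, 10), 192 -> (4, 6, 12), 256 -> (4, 8, 14)
```

Note that the round counts also satisfy Nr = Nk + 6 for every allowed key length.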

In the initial step, we get the data from the database on which the operations will be performed.
The overall dataset is fetched from the database and divided into groups known as clusters.
Hierarchical clustering is used for the clustering process. The items are grouped by
calculating the distances between them, and the two clusters at minimum distance are merged.
A new, larger cluster is thus formed, and the distances of the items are updated. The data is
then encrypted using the AES algorithm, which involves various operations for key generation,
after which the optimal results are obtained.

Proposed Algorithm:

Step 1: Start
Step 2: Input the dataset from the database
Step 3: Apply hierarchical clustering
a. Compute the distances between data items
b. Put items into clusters
c. If the distance between two clusters is minimal
d. Then merge both clusters
e. Update the distances
Step 4: Apply the AES algorithm over the output
a. Key selection
b. Generation of multiple keys
c. Encryption
d. Decryption
Step 5: Get the optimal result
Step 6: Exit

Figure 4.2: Flowchart of Proposed Work-II

Chapter- 5

RESULT ANALYSIS

The simulation of the proposed work was done with MATLAB 2013. The two graphs
demonstrated below show that the proposed technique has better accuracy and a lower error
rate. The time graph, also demonstrated below, reports lower elapsed time than the existing techniques.

5.1 THE RESULT OF PROPOSED WORK-I:

Table 5.1: Accuracy of the base method and the proposed method

Figure 5.1: Accuracy of the base and proposed approaches

The above graph presents the analysis of accuracy for the base and proposed methods and
shows that the proposed method better preserves privacy.

Table 5.2: Error rate of the base method and the proposed method

Figure 5.2: Error rate of the base and proposed approaches

The above graph presents the analysis of error rate for the base and proposed methods and
shows that the proposed method is better for preserving privacy, as its error rate is lower
than that of the base method.

5.2 THE RESULT OF PROPOSED WORK-II:

Table 5.3: Comparison of elapsed time between the base and proposed techniques

Figure 5.3: Elapsed time of the base and proposed approaches (time in seconds vs. number of records, 100-500)

Figure 5.4: Accuracy of the base and proposed approaches (accuracy vs. number of records, 100-500)

Figure 5.5: Error rate of the base and proposed approaches (error rate vs. number of rounds, 100-500)

Chapter- 6

CONCLUSION

Data mining deals with the automatic extraction of previously unknown patterns from huge
data sets. These data sets usually include sensitive individual information or
significant business information, which consequently gets exposed to other parties during
data mining activities. This creates an obstruction to the data mining process. A solution to this
problem is provided by privacy-preserving data mining (PPDM). Privacy preservation for
data analysis is a challenging research problem because of the increasingly large volumes of data
sets involved, and it therefore requires in-depth research. Each privacy-preserving technique has its own
importance. PPDM is a dedicated set of data mining activities in which techniques are developed
to protect the privacy of the data so that the knowledge discovery process can be carried out
without a barrier. The principle of PPDM is to prevent sensitive details from leaking during the mining
process while still producing precise data mining results. Data encryption and anonymization are widely
adopted approaches to combat privacy breaches. Nonetheless, encryption is not always appropriate for data
that must be processed and shared, and anonymizing huge data sets and managing anonymized data remain
challenges for classic anonymization processes. Privacy-preserving data mining
addresses two critical needs: data analysis that delivers better services, and assurance
of the privacy rights of the data owners. Substantial efforts have been made to address
these needs. The results of our proposed work show that by performing hierarchical clustering and
encrypting the data using the DES method, we can achieve greater preservation of privacy. The goal of
this thesis is to discuss clustering, with the introduction of hierarchical clustering and the AES
algorithm, as a privacy-preserving technique that is helpful in mining a large amount of data
with reasonable efficiency and security.

REFERENCES

[1]Vinutha H.P, Dr.Poornima B, “A Survey - Comparative Study on Intrusion Detection


System”, International Journal of Advanced Research in Computer and Communication
Engineering, Vol. 4, Issue 7, ISSN (Online) 2278-1021, July 2015.

[2]V. Jaiganesh, S. Mangayarkarasi, Dr. P. Sumathi, “Intrusion Detection Systems: A Survey


and Analysis of Classification Techniques”, International Journal of Advanced Research in
Computer and Communication Engineering, Vol. 2, Issue 4, ISSN (Print) : 2319-5940, April
2013.

[3] Dhivakar K, Mohana S, “A Survey on Privacy Preservation Recent Approaches and


Techniques”, International Journal of Innovative Research in Computer and Communication
Engineering, Vol. 2, Issue 11, November 2014.

[4] R.Natarajan , Dr.R.Sugumar, M.Mahendran, K.Anbazhagan, “A survey on Privacy


Preserving Data Mining”, International Journal of Advanced Research in Computer and
Communication Engineering, Vol. 1, Issue 1, MARCH 2012.

[5] Charu C. Aggarwal, “A General survey of privacy preserving Data Mining Models and
Algorithms”, IBM,T. J. Watson Research Centre.

[6] B.Vani, D.Jayanthi, “Efficient Approach for Privacy Preserving Micro data Publishing Using
Slicing”, IJRCTT, 2013.

[7] Tiancheng Li , Jian Zhang , Ian Molloy,“Slicing: A New Approach for Privacy Preserving
Data Publishing”, IEEE Transaction on KDD, 2012.

[8] S.V. Vassilios , B. Elisa, N.F. Igor, P.P. Loredana, S. Yucel and T. Yannis, “State of the Art
in Privacy Preserving Data Mining”, Published in SIGMOD Record, 33, pp: 50-57, 2004.

[9] Helger Lipmaa, “Cryptographic Techniques in Privacy Preserving Data Mining”, University
College London, Estonian Tutorial, 2007.

[10] D. Agrawal and C. Agarwal, “On the Design and Quantification of Privacy Preserving Data
Mining Algorithms”, PODS, pp: 247-255, 2001.

[11]. Majid BM, Asger GM, Rashid Ali, “Privacy Preserving Data Mining Techniques: Current
Scenario and Future Prospects”, Proceedings of 3rd ICCCT, India, 26-32, 2012.

[12]. Kamakshi P, Vinaya BA, “Preserving Privacy and Sharing the Data in Distributed
Environment using Cryptographic on Perturbed data” Journal of Computing, April; 2(4), 115-
119, 2010.

[13]. Benny P, “Cryptographic Techniques for Privacy-preserving data mining”, ACM SIGKDD
Explorations, December; 4(2), 12- 19, 2008.

[14] Alpa K. Shah, Ravi Gulati, “Contemporary Trends in Privacy Preserving Collaborative Data
Mining– A Survey”, Proceedings in IEEE International Conference on Electrical, Electronics,
Signals, Communication and Optimization (EESCO), 2015.

[15] Alpa K. Shah, Ravi Gulati, “Privacy, Collaboration and Security – Imperative Existence in
Data Mining” VNSGU Journal of Science and Technology Vol 4 ,No 1, Pg. 44-49, July 2015.

[16] Jisha Jose Panackal1 ,Dr Anitha S Pillai, “Privacy Preserving Data Mining: An Extensive
Survey”, in Proceedings of Proc. of Int. Conf. on Multimedia Processing, Communication and
Info. Tech., MPCIT, 2013.

[17] R. Agrawal and R. Srikant. “Privacy Preserving Data Mining”,ACM SIGMOD Conference
on Management of Data, pp: 439-450, 2000.

[18] Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Journal of Cryptology, 15(3),
pp.36-54, 2000.

[19] Aris Gkoulalas-Divanis and Vassilios S. Verikios, “An Overview of Privacy Preserving
Data Mining”, Published by The ACM Student Magazine, 2010.

[20] Stanley, R. M. O. and R. Z Osmar, “Towards Standardization in Privacy Preserving Data


Mining”, Published in Proceedings of 3rd Workshop on Data Mining Standards, WDMS, USA,
p.7-17, 2004.

[21] Elisa, B., N.F. Igor and P.P. Loredana. “A Framework for Evaluating Privacy Preserving
Data Mining Algorithms”, Published by Data Mining Knowledge Discovery, pp.121-154, 2005.

[22] Andreas Prodromidis, Philip Chan, and Salvatore Stolfo, : “Metalearning in distributed data
mining systems: Issues and approaches”. In “Advances in Distributed and Parallel Knowledge
Discovery”, AAAI/MIT Press, September 2000.

[23] S.V. Vassilios , B. Elisa, N.F. Igor, P.P. Loredana, S. Yucel and T. Yannis, 2004, “State of
the Art in Privacy Preserving Data Mining” Published in SIGMOD Record, 33, pp: 50-57, 2004.

[24] Wang P, "Survey on Privacy preserving data mining", International Journal of Digital
Content Technology and its Applications, Vol. 4, No. 9, 2011.

[25] Dharmendra Thakur and Prof. Hitesh Gupta,” An Exemplary Study of Privacy Preserving
Association Rule Mining Techniques”, P.C.S.T., BHOPAL C.S Dept, P.C.S.T., BHOPAL India,
International Journal of Advanced Research in Computer Science and Software Engineering
,vol.3 issue 11, 2013.

[26] C.V.Nithya and A.Jeyasree, ”Privacy Preserving Using Direct and Indirect Discrimination
Rule Method”, Vivekanandha College of Technology for Women Namakkal India, International
Journal of Advanced Research in Computer Science and Software Engineering ,vol.3 issue 12,
2013.

[27] Sweeney L, "Achieving k-Anonymity privacy protection uses generalization and


suppression" International journal of Uncertainty, Fuzziness and Knowledge based systems,
10(5), 571-588, 2002.

[28] Gayatri Nayak, Swagatika Devi, "A survey on Privacy Preserving Data Mining:
Approaches and Techniques", International Journal of Engineering Science and Technology,
Vol. 3 No. 3, 2127-2133, 2011.

[29] Agrawal D., and Aggarwal C.C, “On the Design and Quantification of Privacy Preserving
Data Mining Algorithms”, Proceedings of the 20th ACM Symposium on Principles of Database
Systems, pp. 247-255, 2007.

[30] Agrawal, R., and Srikant , “Privacy Preserving Data Mining”, Proceedings of the 19th ACM
International Conference on Knowledge Discovery and Data Mining, Canada, pp. 439-450,
2007.

[31] Benjamin C. Fung M. and Ke Wang, “Privacy-Preserving Data Publishing: A Survey of


Recent Developments”, ACM Computing Surveys, Vol. 42, No. 4, pp.322-435, 2010.

[32] Bertino E., Nai Fovino and Parasiliti Provenza, “A Framework for Evaluating Privacy
Preserving Data Mining Algorithm”, Journal of Data Mining and Knowledge Discovery, pp. 78-
87, 2005.

[33] Bikramjit Saikia and Debkumar Bhowmik , “Study of Association Rule Mining and
different hiding Techniques”, PhD thesis, Department of computer Science Engineering,
National Institute of Technology, pp.55-63, 2009.

[34] T.Nandhini, 2D. Vanathi, 3Dr.P.Sengottuvelan “A Review on Privacy Preservation in Data


Mining”, International Journal of UbiComp (IJU), Vol.7, No.3, July 2016.

[35]Gurjeevan Singh, Ashwani Kumar Singla,K.S. Sandha, ”Through Put Analysis Of Various
Encryption Algorithms”, IJCST Vol. 2, Issue 3, September 2011.
[36] Ramesh, A. et al., “Performance analysis of encryption algorithms for Information
Security”, Circuits, Power and Computing Technologies (ICCPCT), pp. 840-844, March 2013.
[37]Shashi Mehrotra Seth, Rajan Mishra,” Comparative Analysis Of Encryption Algorithms For
Data Communication”, IJCST Vol. 2, Issue 2, pp.192- 192 , June 2011.
[38] Agarwal, R., Dafouti, D., Tyagi, S., “Performance analysis of data encryption algorithms”,
Electronics Computer Technology (ICECT), 2011 3rd International Conference, Vol. 5, pp. 399-
403, April 2011.
[39] Mr. Gurjeevan Singh, Mr. Ashwani Singla and Mr. K S Sandha, "Cryptography Algorithm
Comparison for Security Enhancement in Wireless Intrusion Detection System", International
Journal of Multidisciplinary Research, Vol.1 Issue 4, pp. 143-151, August 2011.
[40] Akash Kumar Mandal, Chandra Parakash and Mrs. Archana Tiwari, “Performance
Evaluation of Cryptographic Algorithms: DES and AES”, Conference on Electrical, Electronics
and Computer Science, pp. 1-5, 2012.

[41] Tamanna Kachwala, Sweta Parmar “An Approach for Preserving Privacy in Data Mining”
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 4, Issue 9ISSN: 2277 128X, September 2014.

[42] Kun Liu, Hillol Kargupta, and Jessica Ryan, Random Projection-Based Multiplicative Data
Perturbation for Privacy PreservingDistributed Data Mining, IEEE transactions on knowledge
and data engineering, vol. 18, no. 1, pp. 92-106, January 2006.

[43] Jaideep Vaidya, Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi, A Random
Decision Tree Framework forPrivacy-preserving Data Mining, Journal of latex class files, vol. 6,
no. 1, pp. 1- 14, , January 2007.

[44]. Chun-Wei Lin, Tzung-Pei Hong, Chia-Ching Chang, and Shyue-Liang Wang “A Greedy-
based Approach for Hiding Sensitive Itemsets by Transaction Insertion”, Journal of Information
Hiding and Multimedia Signal Processing, Volume 4, Number 4, October 2013.

[45]. M.Mahendran, Dr.R.Sugumar, K.Anbazhagan, R.Natarajan “An Efficient Algorithm for


Privacy Preserving Data Mining Using Heuristic Approach”, International Journal of Advanced
Research in Computer and Communication Engineering, Vol. 1, Issue 9, November 2012.

[46]. Chun-Wei Lin · Tzung-Pei Hong ·Kuo-Tung Yang ·Leon Shyue-LiangWang “The GA-
based algorithms for optimizing hiding sensitive itemsets through transaction deletion”, Springer
Science+Business Media New York, 2014.

[47]. S. Lohiya and L. Ragha, “Privacy Preserving in Data Mining Using Hybrid Approach”, in
proceedings of 2012 Fourth International Conference on Computational Intelligence and
Communication Networks, IEEE 2012.

[48] Yu Zhu& Lei Liu, “Optimal Randomization for Privacy Preserving Data Mining”, ACM,
August 2004.

[49] V. Ciriani, S. De Capitani di, Vimercati, S. Foresti, and P. Samarati, “k-Anonymity”


Springer US, Advances in Information Security, 2007.

[50] Yehuda Lindell, Benny Pinkas, “Secure Multiparty Computation for Privacy-Preserving
Data Mining”, IACR Cryptology ePrint Archive 2008: 197, 2008.

[51] U. Maurer, “Secure multi-party computation made simple,” in Proc. 3rd Int. Conf. Security
in Communication Networks (SCN’02), Berlin, Heidelberg, pp. 14–28, Springer-Verlag, , 2003.

[52] Aris Gkoulalas-Divanis, & Grigorios Loukides, “Revisiting Sequential Pattern Hiding to
Enhance Utility”, ACM, August 2011.

[53] Amruta Mhatre, Durga Toshniwal, “Hiding Co-occurring Sensitive Patterns in Progressive
Databases”, ACM, March 22, 2010.

[54] Shikha Sharma & Pooja Jain, “A Novel Data Mining Approach for Information Hiding”,
International Journal of Computers and Distributed Systems, Vol. No.1, Issue 3, October 2012.

[55] Shaofei Wu and Hui Wang ,"Research On The PrivacyPreserving Algorithm Of Association
Rule Mining InCentralized Database”, IEEE International Symposiums on Information
Processing, 2008.

[56] Chirag N. Modi, Udai Pratap Rao and Dhiren R. Patel, "An Efficient Approach for
Preventing Disclosure of Sensitive Association Rules in Databases", International Conference on
Advances in Communication, Network, and Computing,IEEE, 2010.

[57] S.R.M. Oliveira, O.R. Zaıane, Y. Saygin, “Secure association rule sharing, advances in
knowledge discovery and data mining”, in Proceedings of the 8th Pacific-Asia Conference
(PAKDD2004), Sydney, Australia, pp.74–85, 2004.

[58]MohammadReza Keyvanpour et al.,(2011), “Classification and Evaluation the Privacy


Preserving Data Mining Techniques by using a Data Modification–based Framework”,
International Journal on Computer Science and Engineering (IJCSE), Vol. 3, No. 2, pp. 862–870.

[59]Ali Inan, Yucel Saygin, Erkay Savas, Ayca Azgin Hintoglu and Albert Levi (2006), “Privacy
preserving clustering on horizontally partitioned data,”, 2013.

[60] E. Poovammal, M. Ponnavaikko, "Privacy and Utility Preserving Task Independent Data
Mining", International Journal of Computer Applications, Vol:1, No. 15, pp: 104-111,

[61] Patrick Sharkey, Hongwei Tian, Weining Zhang, and Shouhuai Xu, "Privacy-Preserving
Data Mining through Knowledge Model Sharing", 2012.

[62] NafeesQamar, Yilong Yang, AndrasNadas and Zhiming Liu,” Querying medical datasets
while preserving privacy”, The 6th International Conference on Current and Future Trends of
Information and Communication Technologies in Healthcare (ICTH 2016), Procedia Computer
Science 98, pp 324 – 331, 2016.

[63] Alexandre Evfimievski, Tyrone Grandison, “Privacy Preserving Data Mining” USA.

[64] K.Sashirekha, B.A.Sabarish, Arockia Selvaraj, “A Study on Privacy Preserving Data


Mining” ISSN(Online):2320-9801July 2014

[65] Stanley R. M. Oliveira and Osmar R. Zaïane, “Toward Standardization in Privacy-


Preserving Data Mining”, http://www.cs.ualberta.ca/~oliveira/psdm/psdm index.html.

[66] Yehuda Lindell, Benny Pinkas, “Privacy Preserving Data Mining”, 2009.

[67] Shahejad Khan, TejasGorhe, Ramesh Vig and Prof.BharatiA.Patil,” Enabling Multi-level
Trust in Privacy Preserving Data Mining”, 2015 International Conference on Green Computing
and Internet of Things (ICGCIoT) IEEE, pp 1369- 1372, 2015.

[68] Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita, “A Review on Privacy
Preserving Data Mining: Techniques and Research Challenges”, (IJCSIT) International Journal
of Computer Science and Information Technologies, Vol. 5 (2), 2310-2315, 2014.

[69] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, Michael Y. Zhu, “Tools for Privacy
Preserving Distributed Data Mining”, Volume 4, Issue 2 - page 1

[70] Yuna Oh, Kangsoo Jung and Seog Park,” A privacy preserving technique to prevent
sensitive behavior exposure in semantic location-based service”, 18th International Conference
on Knowledge-Based and Intelligent Information & Engineering Systems - KES2014, Procedia
Computer Science 35, pp 318 – 327, 2014.

[71] Alexandre Evfimievski, Johannes Gehrke, Ramakrishnan Srikant, “Limiting Privacy


Breaches in Privacy Preserving Data Mining”, PODS 2003, June 912, 2003.

[72] Nissim Matatov, Lior Rokach, Oded Maimon, "Privacy-preserving data mining: A feature
set partitioning approach", Information Sciences, Volume 180, Issue 14, Pages 2696-2720, 15
July 2010.

[73] Seung-Woo Kim, Sanghyun Park, Jung-Im Won, Sang-Wook Kim, "Privacy preserving data
mining of sequential patterns for network traffic data", Information Sciences Volume 178, Issue
3, Pages 694-713, 1 February 2008.

[74] K.Srinivasa Rao, V.Chiranjeevi, "Distortion Based Algorithms For Privacy Preserving
Frequent Item Set Mining", International Journal of Data Mining & Knowledge Management
Process (IJDKP) Vol.1, No.4, July 2011.

[75] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J.Uncertainity Fuzziness
Knows.-Base Syst., vol. 10, no. 5, pp. 557–570,2002.

[76] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy


beyond kanonymity”, in Proc. 22nd Int. Conf. Data Eng., p. 24, 2006.

[77]Chun, Ji Young, Dong Hoon Lee, and Ik Rae Jeong. "Privacy-preserving range set union for
rare cases in healthcare data", IET Communications 6, no. 18 (2012): 3288-3293.

[78] Charu C. Aggarwal, Stephen C. Gates and Philip S. Yu, “On the merits of building
categorization systems by supervised clustering”, Proceedings of the fifth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, Pages 352 – 356, 1999.

[79] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, (1998), “ROCK: A Robust Clustering
Algorithm forCategorical Attributes”, In Proceedings of the 15th International Conference on
Data Engineering, 1999.
