Dissertation
Submitted
in partial fulfillment
Master of Technology
(MAY 2018)
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the Dissertation entitled
“Enhancement of Privacy Preservation Method With Clustering and Cryptographic
Techniques in Data Mining”, submitted in partial fulfillment for the award of the Degree of
“Master of Technology” in the Department of Computer Science & Engineering with
Specialization in Computer Science & Engineering to the Department of Computer Science &
Engineering, Compucom Institute of Technology and Management, Rajasthan
Technical University, Kota, is a record of my own investigation carried out under the guidance
of Dr. Akash Saxena, Associate Professor, CITM, Jaipur.
I have not submitted the matter presented in this Dissertation anywhere for the award of any
other Degree.
(Jeetendra Mittal)
Counter Signed by
Associate Professor
It is the foundation of an architecture that defines its ability to stand firm. Likewise, the
foundation of my research work is not my sole effort alone; it drew on the efforts and insights
of many key people.
I sincerely take this opportunity to acknowledge all those who directly or indirectly have been
a great support and inspiration throughout the research work.
First and foremost, I would like to thank my mentor, Dr. Akash Saxena, Associate
Professor, CITM. It has been an honor to be his research student. He contributed his keen
and creative insights wherever applicable in the research work. Despite his busy schedule, he
made himself available for any query related to the work almost every time. I sincerely
appreciate his contributions, direct and indirect.
I would also like to acknowledge the Registrar, Mr. Pawan Agarwal, CITM, for his support
and significant contribution, direct and indirect, during each phase of the dissertation.
Lastly, I would like to thank my family and friends for their love, motivation, and
encouragement in all my pursuits. Last but not least, my special thanks go to the
Principal, Prof. (Dr.) M. R. Farooqui, and my institute, Compucom Institute of
Technology and Management, Jaipur, for giving me the opportunity to work in such a great
environment.
Jeetendra Mittal
ABSTRACT
Data mining is the process of extracting knowledge hidden in large volumes of raw data. The
knowledge must be new, not self-evident, and one must be able to use it. To address this, the
original data is altered by a sanitization procedure that hides sensitive knowledge before
release. Data mining has been studied extensively and has proved useful in various fields,
including the Internet of Things (IoT) and business development. However, data mining
approaches also face serious challenges due to the growing disclosure of sensitive data and the
violation of privacy. Privacy-Preserving Data Mining (PPDM), a fundamental branch of data
mining and an active topic in privacy preservation, has gained particular attention in recent
years. Privacy preservation of sensitive knowledge has been addressed by several researchers
in the form of association rules, by suppressing the frequent itemsets. Clustering is the
technique that groups objects with similar characteristics into clusters. Anonymization protects
the identity of the individual: it encrypts identifiers such as unique numbers and names,
whereas data that is not encrypted provides little or no guarantee. This work describes the
privacy concerns that arise from data mining, particularly for national security applications.
We discuss privacy-preserving data mining by an anonymization method in which we use
hierarchical clustering to partition the given data and the DES algorithm to encrypt the data, in
order to protect sensitive data from an attacker. The Advanced Encryption Standard (AES) is
an algorithm that provides security to the data and is very difficult to attack. With the proposed
work, the privacy preservation of the data is increased, as shown by the results. AES produces
its result in minimum time, which shows that the proposed approach produces results faster
than existing approaches.
CONTENTS
ABSTRACT i
CONTENTS ii
LIST OF FIGURES v
LIST OF TABLES vi
LIST OF ABBREVIATIONS vii
CHAPTER 1 1-33
INTRODUCTION 1
1.1 Data mining 1
1.2 Goals of Data Mining 2
1.2.1 Prediction 2
1.2.2 Identification 2
1.2.3 Classification 2
1.2.4 Optimization 2
1.3 Advantages of Data Mining 3
1.4 Privacy preserving data mining 3
1.5 Types of data mining system 4
1.6 Defining Privacy for Data Mining 5
1.6.1 Aims and non-aims of this section 6
1.6.2 Privacy and personal data 7
1.6.3 Privacy as hiding confidentiality 7
1.6.3.1 Privacy as hiding/confidentiality as the focus of PPDM 8
1.6.4 Privacy as control: informational self-determination 10
1.6.5 Privacy as practice: identity construction 12
1.7 Privacy preserving applications 14
1.7.1 Medical Database 14
1.7.2 Bioterrorism Application 14
1.8 Privacy threats 14
1.8.1 Identity Disclosure 14
1.8.2 Attribute Disclosure 14
1.8.3 Membership Disclosure 14
1.9 Evaluation criteria for privacy-preserving algorithm 15
1.10 Background 16
1.10.1 Security Vs Privacy 17
1.10.2 Privacy Issues and Policies 17
1.11 Requirements of a PPDM algorithm 17
1.12 Need for privacy 18
1.13 Comparisons of different privacy preservation techniques 19
1.14 Clustering 22
1.15 PPDM Techniques 22
1.16 Privacy Preserving Techniques 24
1.16.1 Heuristic-based techniques 24
1.16.2 Cryptography-based strategies 24
1.16.3 Reconstruction-based techniques 24
1.16.4 Anonymization based PPDM 25
1.16.5 Perturbation Based PPDM 26
1.16.6 Randomized Response Based PPDM 27
1.16.7 Cryptography Based PPDM 28
1.17 Issues in designing a PPDM algorithm 29
1.17.1 Challenges of PPDM Algorithm Information Loss 29
1.17.2 Requirements of a PPDM algorithm 30
1.18 Data Encryption Standard (DES) 31
1.19 Advanced Encryption Standard (AES) 32
1.19.1 Substitute Byte transformation 33
1.19.2 Shift Rows transformation 33
1.19.3 Mix columns transformation 33
1.19.4 Add round key transformation 33
CHAPTER 2 34-46
LITERATURE SURVEY 34
CHAPTER 3 47-54
SIMULATION TOOL 47
3.1 Simulation Environment 47
3.2 The MATLAB system consists of five major sections 48
3.2.1. Development Environment 48
3.2.2. The MATLAB Mathematical Function Library 48
LIST OF FIGURES
Figure No. Figure Caption Page No.
Figure 1.1 What revealing search data reveals 13
Figure 1.2 Linking Attack 25
Figure 1.3 Randomization Response Mode 27
Figure 4.1 Flowchart of Proposed Work-I 57
Figure 4.2 Flowchart of Proposed Work-II 60
Figure 5.1 Accuracy of the base and proposed approaches 61
Figure 5.2 Error rate of the base and proposed approaches 62
Figure 5.3 Elapsed time of the base and proposed approaches 63
Figure 5.4 Accuracy of the base and proposed approaches 64
Figure 5.5 Error rate of the base and proposed approaches 64
LIST OF TABLES
LIST OF ABBREVIATIONS
GA Genetic algorithms
Chapter 1
INTRODUCTION
1.1 DATA MINING:
The data mining process enables an organization to utilize its vast amounts of data to discover
relationships among the data and to improve business efficiency [1]. Data mining technology
can develop these analyses on its own, using a combination of statistics, artificial intelligence,
machine learning algorithms, and data stores. In order to confront the challenging risks involved,
several remedies for this cumbersome situation have been proposed by researchers.
Balancing the privacy of the data against the legitimate needs of the data consumer is a
major problem. The original data is modified by the sanitization method to hide sensitive
knowledge before release so that these issues can be addressed. Privacy preservation of
sensitive knowledge has been addressed by numerous researchers in the form of association
rules, by suppressing the frequent itemsets. As data mining offers the generation of association
rules, hiding sensitive rules is accomplished by altering the support and confidence of the
association rules [2]. A new approach, named “not altering the support”, has been proposed to
hide an association rule. A key confidentiality problem arises in any mass collection of data.
The need for privacy is sometimes due to regulation (e.g., for medical databases) or may be
motivated by business interests. Fortunately, data mining does not often violate privacy: its
goal is to draw conclusions across populations, rather than to reveal information about
individuals.
1.2 GOALS OF DATA MINING:
1.2.1 Prediction: Prediction determines the relationships between independent variables and
the associations between dependent and independent variables.
1.2.2 Identification: Data patterns are used to identify the existence of an item, an event, or
a pattern of customer behavior. Authentication, which verifies a known entity, is a form of
identification.
1.2.3 Classification: Data mining can help partition the data so that different classes can be
recognized based on grouping parameters.
1.2.4 Optimization: DM can optimize the use of limited resources, for example, time, space,
money, or materials, and maximize output variables under a given set of constraints [2].
1.3 ADVANTAGES OF DATA MINING:
Data mining applications are growing continuously in different industries because they provide
hidden knowledge that allows businesses to increase efficiency and grow. DM approaches play
a basic role in many different domains. For the characterization of security issues, a large
amount of data containing historical information must be analyzed. It is difficult for people to
discover a pattern in such a huge quantity of data. DM, however, appears well suited to
overcome this difficulty and can be used to discover those patterns [2].
A PPDM technique is commonly evaluated on the following parameters:
- Privacy level
- Hiding failure
- Data quality
- Complexity
The primary challenges of PPDM approaches for association rule mining are that they are
excessively expensive, that it is hard to recover the original data after hiding, and that they
must be efficient enough for very large datasets (Wei Zhao et al. 2007). PPDM is a research
area concerned with the privacy implications of personally identifiable information when it is
used for data mining. The objective of this work is to implement a distortion algorithm using
association rule hiding for PPDM which is efficient in providing confidentiality and improves
overall performance (Charu C. Aggarwal et al. 2008). The debate on PPDM has received
special attention as data mining has been widely adopted by public and private organizations.
We have witnessed three main landmarks that represent the progress and achievement of this
new research area: the conceptive landmark, the deployment landmark, and the prospective
landmark. We describe these landmarks as follows: The conceptive landmark characterizes the
period in which influential figures in the community, such as O'Leary (1991, 1995),
Piatetsky-Shapiro (1995), Klösgen (1995), and Clifton and Marks (1996), examined the
achievements of knowledge discovery and some of the fundamental areas where it could
conflict with privacy concerns. The key finding was that knowledge discovery can open new
threats to informational privacy and data security if not carried out or used properly.
The deployment landmark is the current period, in which more and more PPDM techniques
have been developed and published in refereed conferences. The information available today is
spread over countless papers and conference proceedings. The results achieved in recent years
are promising and suggest that PPDM will achieve the goals that have been set for it. At this
stage, there is no consensus about what privacy preservation means in data mining. In
addition, there is no consensus on privacy concepts, guidelines, and requirements as a
foundation for the development and deployment of new PPDM techniques.
The sheer number of techniques is causing confusion among developers, practitioners, and
others interested in this technology. One of the most important challenges in PPDM now is to
establish the groundwork for further research and development in this area.
1.5 TYPES OF DATA MINING SYSTEM:
Data mining systems can be categorized according to various criteria, as described below [2]:
a) Classification of data mining systems according to the type of data source
mined:
In an organization, many kinds of data are available, and systems in this category are grouped
according to the kind of data they handle.
c) Classification of data mining systems according to the mining techniques used:
This classification is according to the data analysis approach utilized, such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or
data-warehouse-oriented approaches, and so on. The classification can also take into account
the degree of user interaction involved in the data mining system, such as query-driven
systems, interactive exploratory systems, or autonomous systems. A comprehensive system
would provide a wide variety of data mining techniques to fit different situations and options,
and would offer different levels of user interaction.
i. User privacy preservation: The primary aim of data privacy is the protection of
personally identifiable information. In general, information is considered personally
identifiable if it can be linked, directly or indirectly, to an individual person. Thus,
when personal data are subjected to mining, the attribute values associated with
individuals are private and must be protected from disclosure. Miners are then able
to learn from global models rather than from the data of any particular user.
ii. Collective privacy preservation: Protecting personal data may not be sufficient.
Sometimes, we may also need to protect against learning sensitive knowledge
representing the activities of a group. We refer to the protection of such sensitive
knowledge as collective privacy preservation. The goal here is quite similar to that
of statistical databases, in which security control mechanisms provide aggregate
information about groups (populations) while, at the same time, preventing
disclosure of confidential information about individuals. However, in contrast to
the case of statistical databases, another objective of collective privacy preservation
is to protect sensitive knowledge that can provide a competitive advantage in the
commercial world. In the case of collective privacy preservation, organizations
have to cope with some interesting conflicts. For example, when private data
undergoes analysis processes that produce new knowledge about users' shopping
patterns, hobbies, or preferences, this knowledge could be used in recommender
systems to predict or influence their future shopping patterns. In general, this state
of affairs is beneficial to both customers and organizations. However, when
organizations share data in a collaborative project, the goal is not only to protect
personally identifiable information but also sensitive knowledge represented by
some strategic patterns [3]. In this part, we start from the data most commonly
considered as being implicated in privacy debates: data protection and privacy
breaches on the Internet. We then proceed to describe three relevant groups of
privacy approaches. The aim of this framework is to abstract from and complement
current privacy definitions, as explained. We illustrate the various faces of privacy
using the example of the AOL logs, highlight different notions of identity, extend
the discussion by investigating whose privacy may be at stake, and characterize
specifics of the Web relevant to our questions. We conclude with the resulting aim
of this article, to be presented in the subsequent sections.
The taxonomy distinguishes between three views of privacy: privacy as hiding, as control,
and as practice. These are summarized here. The definitions have been put into the context of
the present article; in particular, we will show how these three views cut across individual
definitions' differences with regard to the subjects of privacy and the type of data and knowledge.
While the objective is thus to generalize beyond the large number of privacy definitions to
characterize a general framework for the analysis, we do make one choice throughout the rest of
the paper: we concentrate on persons' privacy. This is in line with the bulk of current scientific
and popular treatments, but the choice was also made to respect the complexity of the topic. We
therefore begin this section with a definition of “personal data” as the target of persons' privacy.
We complement this general focus of the paper with a discussion of some issues of business and
state secrets. A thorough analysis of these questions would require the space of another article
[3].
free from intrusions by both a tyrannical state and the pressure of social norms. That
privacy encompasses this sense of a protected sphere is generally acknowledged in sociology,
and legal scholars, courts, and regulators have recognized its data-dependency: the private sphere
is something that can be undermined by the disclosure of (personal) data. This notion
is also popular in computer science and has been explained as an autonomous (digital) sphere
in which the data about persons is protected, such that outside of this sphere the data remains
confidential. Data confidentiality, the protection of data from unauthorized access, is a strong and
useful translation of such privacy concerns into the digital space. A key reason is that once data
about a person exists in a digital form, it is extremely difficult to provide individuals with
any guarantees about the control of that data. Data collected using current technologies
represent activities of users in social life that for many are assumed to be private. To preserve
privacy is then to keep this data private, in other words confidential from a greater public. Not
exchanging the data at all would preserve privacy but is inconvenient and probably also not
desirable. Therefore, a great deal of privacy research in computer science is concerned with
weaker forms of data confidentiality such as anonymity.
Anonymity is achieved by unlinking a person's identity from the traces that her
actions leave in information systems. Anonymity keeps the identification of the
individuals private; however, it is not necessarily concerned with how
public the traces thus become. This is also reflected in data protection legislation, which by
definition cannot and does not cover anonymous data [3].
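The idea of unlinking identities from traces can be made concrete with a small pseudonymization sketch. This is purely illustrative and not part of this dissertation's method; the record layout, column names, and pseudonym format are all assumptions:

```python
import secrets

def pseudonymize(records, identifier_key):
    """Replace the direct identifier in each record with a random
    pseudonym.  The mapping table must be guarded separately (or
    destroyed) for the release to count as unlinked."""
    mapping = {}   # real identity -> pseudonym; must stay confidential
    released = []
    for rec in records:
        real_id = rec[identifier_key]
        if real_id not in mapping:
            mapping[real_id] = "P-" + secrets.token_hex(4)
        anon = dict(rec)                       # copy the record
        anon[identifier_key] = mapping[real_id]  # unlink the identity
        released.append(anon)
    return released, mapping

patients = [
    {"name": "Alice", "age": 34, "disease": "flu"},
    {"name": "Bob",   "age": 51, "disease": "diabetes"},
    {"name": "Alice", "age": 34, "disease": "asthma"},
]
released, secret_map = pseudonymize(patients, "name")
```

Note that both Alice records receive the same pseudonym, so her traces remain linkable to each other: this illustrates the point above that anonymity hides who acted, not how public the traces themselves become.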
remains an open question. In databases and PPDM, the conditions for establishing anonymity
sets and the targeted objectives are somewhat different than in communications. Anonymity is a
popular requirement when (Web or other) data are to be analyzed (e.g. data-mined), especially
when this is done by third parties. One difference from communications anonymity is that PPDM
methods aspire to preserve the utility of the anonymized data for analysts [3].
problem: to “produce an anonymized version of the data that satisfies a given privacy
requirement determined by the chosen privacy model and to retain as much data utility as
possible”. Key concepts of PPDM are defined as follows. Following security-research
terminology, adversaries are also called “attackers” who perform an attack: a "sequence of
activities that result in the disclosure of confidential information". This takes a wider look at a
setting often found in PPDM: the publishing, by a data publisher (e.g., a hospital), of at least
partly sensitive information on data subjects (e.g., patients), for an audience of data recipients.
The latter are in general not known a priori, could be ill-intentioned, and may perform arbitrary
data mining tasks. The aim of privacy-preserving data publishing (PPDP) is then that "access to
the published data should not allow the attacker to learn anything extra about any target victim
compared to no access to the database, even in the presence of any attacker's background
knowledge obtained from other sources”. Due to the impossibility of this in the face of arbitrary
background knowledge, one usually assumes limited and specific background knowledge of the
attacker, or requires that, probabilistically, the posterior beliefs after seeing the published data
are not much different from the prior beliefs. The same idea lies behind differential privacy,
which “ensures that the removal or addition of a single database item does not (substantially)
affect the outcome of any analysis”. Informally, this could be considered not as hiding data, but
as preventing information from being gleaned from it. A wide literature exists on PPDM and
PPDP, which cannot be covered here; for details of specific algorithms and method groups, see
also the literature on graphs/networks. Taking a closer look at these data-centric definitions of
privacy, one sees that alongside the focus on confidentiality (not seeing data, not learning about
an entity, protection from disclosure), there is also the recognition that data need not be kept
confidential in every case, but could be disclosed as long as someone entitled to do so “decides”
or “authorizes” the disclosure/communication. This someone is often the data subject, but may
also be unspecified; we will return to this question. This move away from unconditional hiding,
or “privacy-as-confidentiality”, leads to the notion of privacy as control, to be discussed next.
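The differential-privacy guarantee quoted above can be sketched with the Laplace mechanism on a counting query: noise calibrated to the query's sensitivity is added to an aggregate, so adding or removing one record barely shifts the output distribution. This is a generic textbook construction, not this dissertation's proposed method; the dataset and the epsilon value are invented:

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Differentially private counting query.  A count has sensitivity 1
    (adding/removing one record changes it by at most 1), so Laplace
    noise with scale 1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) via the inverse-CDF method.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ages = [34, 51, 29, 62, 45]
noisy_over_40 = dp_count(ages, lambda a: a > 40, epsilon=1.0)  # true count is 3
```

A single noisy answer can be far from the truth, but averaged over many hypothetical releases the answer is unbiased; smaller epsilon means more noise and stronger privacy.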
is that the disclosure of data is necessary and beneficial under certain circumstances, and that
control may help to avert abuses of data collected in this way.
This idea is expressed in Westin's definition of (data) privacy: “the right of the individual to
decide what information about himself should be communicated to others and under what
circumstances”, and in the term informational self-determination, first used in a German
constitutional ruling concerning personal data gathered during the 1983 census, and highly
influential in Europe and beyond since then: “the protection of the individual against unlimited
collection, storage, use and disclosure of his/her personal data is encompassed by the general
personality rights of the [German Constitution]. This basic right warrants in this respect the
capacity of the individual to determine in principle the disclosure and the use of his/her personal
data. Limitations to this informational self-determination are allowed only in case of overriding
public interest”. Informational self-determination is likewise expressed in international
guidelines for data protection, such as the OECD's Guidelines on the Protection of Privacy and
Transborder Flows of Personal Data, the Fair Information Practices (FIP) of notice, choice,
access, and security, or the principles of the EU Data Protection Directives. As an illustration,
consider the principles set out in the OECD guidelines: collection limitation, data quality,
purpose specification, use limitation. In sociological accounts, privacy as control is tied closely
to the ability to separate identities, which allows individuals to selectively employ revelation
and concealment to facilitate their social performances and relationships. Computer science has
applied these ideas in systems for identity management and access control [3].
Although informational self-determination principles are desirable, relying only on them when
building systems can be misleading. Collection limitation in one system does not protect against
the aggregation of those data across many systems. Openness may be overwhelming in current
ubiquitous-technology environments, where the number of data controllers increases
exponentially. A user may be overwhelmed by the difficulties of individual participation and
unable to judge the risk of revealing information or of using automated agents for such decision-
making. Even if all these principles were implemented, it would be very difficult to identify
violations. In the case of trusted parties, system security violations (i.e. hacked systems), design
failures (i.e. information leakages), or the linking of different sources of safely released data
may cause unwanted disclosure of information. Informational self-determination hence offers
little protection concerning the collection of anonymized data, profiling based on correlations
and patterns found in this aggregated data, and the resulting desirable or undesirable
discriminations. Finally, privacy as control is an abstract concept that does not consider how
people actually do and want to construct their identities. This is the topic of privacy as practice,
to which we turn next.
Privacy as practice demands the possibility to intervene in the flows of existing data and the
re-negotiation of boundaries with respect to aggregated data.
These two activities rest on, yet extend, the idea of privacy as informational self-determination:
they demand transparency concerning aggregated sets of data and the analysis methods and
decisions applied to them. In this sense, these approaches define privacy not only as a right but
also as a public good. Sociologists have investigated the idea that privacy is (social) practice
from various viewpoints. One account distinguishes two further types of privacy in addition to
the above-mentioned right to be let alone and the possibility of separating identities. The third
type is the construction of the public/private divide. This distinction concerns the social
negotiation of what remains private (i.e. silent and out of the public discourse) and what
becomes public. For instance, the decision by individuals to keep their voting choices private is
generally accepted today, while in the case of domestic violence, interest groups and individuals
have successfully lobbied over the past decades to redefine the “domestic” as a public issue.
The fourth type is protection from surveillance. Here, surveillance refers to the creation
and management of social knowledge about population groups. This kind of privacy can easily
be violated if individual observations are collated and used for statistical classification. When
applied to people, such classifications make statements about their (non-)compliance with
norms, their membership in associations with given residences and valuations, and many other
attributes. Arguably, such processes may pose unreasonable constraints on the construction of
identities. Market segmentation is an example of the classification of population groups. In
computer science accounts of privacy in networks, and in particular social network sites (SNS),
similar ideas have been expressed [3].
These definitions emphasize that confidentiality and individual control are part of privacy, but
not all of it. Privacy includes strategic concealment, but also the revelation of information in
different contexts, and these decisions are based on and form part of a process of collective
negotiation. Tools should, therefore, support data concealment and revelation to help
individuals practice privacy individually and collectively.
1.7 PRIVACY PRESERVING APPLICATIONS:
1.7.1 Medical Database: Traditionally, only a global search-and-replace strategy has been
utilized in order to maintain privacy.
1.7.2 Bioterrorism Application: It is essential to examine medical data for privacy
preservation in bioterrorism applications. For instance, biological agents such as Bacillus
anthracis (anthrax) are widely found in the natural environment, so it is critical to distinguish an
anthrax attack from an ordinary outbreak. It is important to track incidences of common
diseases, and the corresponding information would be reported to the public health agencies.
Respiratory illnesses, however, were not reportable diseases. Public health law provides an
answer regarding how much identifiable data may be used [4].
1.8 PRIVACY THREATS:
1.8.1 Identity Disclosure: Usually, an individual is linked to a record in the published table. If
his identity is disclosed, then the corresponding sensitive value of the individual is revealed [4].
1.8.2 Attribute Disclosure: Attribute disclosure is possible when information about an
individual's record is revealed: before the data is even delivered, attributes of the user can be
inferred with high confidence. According to the authors of [5], matching multiple buckets is
important to protect against attribute disclosure.
1.8.3 Membership Disclosure: Membership information in the released table can reveal the
identity of an individual through various attacks. If the selection criterion is not a sensitive
attribute value, then it can lead to membership disclosure [6].
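A standard defense against identity disclosure is k-anonymity, in which quasi-identifiers are generalized until every record is indistinguishable from at least k-1 others. The sketch below is purely illustrative, not this dissertation's algorithm; the single "age" quasi-identifier, the bin widths, and the value of k are assumptions:

```python
from collections import Counter

def generalize_age(age, width):
    """Coarsen an exact age into an interval of the given width."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymize(records, k, widths=(5, 10, 20, 50, 100)):
    """Widen the 'age' quasi-identifier until every equivalence class
    (group of records sharing the generalized age) has >= k members,
    which blocks identity disclosure via that attribute."""
    for width in widths:
        table = [dict(r, age=generalize_age(r["age"], width)) for r in records]
        sizes = Counter(t["age"] for t in table)
        if min(sizes.values()) >= k:
            return table, width
    raise ValueError("k-anonymity not reachable with the given widths")

rows = [{"age": 23, "disease": "flu"},  {"age": 27, "disease": "cold"},
        {"age": 26, "disease": "flu"},  {"age": 41, "disease": "asthma"},
        {"age": 45, "disease": "flu"},  {"age": 48, "disease": "cold"}]
anon_rows, used_width = k_anonymize(rows, k=3)
```

Even a k-anonymous table can still suffer attribute disclosure: if every record in an equivalence class shares the same disease value, the sensitive value is revealed without identifying any single record, which is exactly the threat described in 1.8.2.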
1.9 EVALUATION CRITERIA FOR PRIVACY-PRESERVING ALGORITHMS:
A fundamental characteristic in the development and evaluation of privacy-preserving data
mining algorithms is the clear identification of suitable evaluation criteria and the development
of related benchmarks. In most cases, no single privacy-preserving algorithm exists that
outperforms all other algorithms on all possible measures.
It is therefore vital to provide users with a set of metrics which will allow them to select the
most suitable privacy-preserving technique for their data, with respect to some specific
parameters. An initial list of evaluation parameters to be used for assessing the quality of
privacy-preserving data mining algorithms is given below [7]:
(i) Performance: the performance of a mining algorithm is measured in terms of the time
required to achieve the privacy criteria.
(ii) Data Utility: Data utility is basically a measure of information loss, or the loss in the
functionality of the data in providing results that could be generated in the
absence of PPDM algorithms.
Technique | Advantages | Limitations
… PPDM | The altered data retains the same format as the original data. | Information loss is very real in the case of stream data.
Cryptography-based PPDM [8] | Transformed data are exact and protected; better privacy compared to the randomized approach. | This technique is especially tough to scale when multiple parties are involved.
(iii) Uncertainty level: a measure of the uncertainty with which the sensitive data that has
been hidden can still be predicted.
(iv) Resistance: a measure of the tolerance shown by a PPDM algorithm towards various
data mining algorithms and models. As such, all the criteria discussed above need to be
quantified for a better evaluation of privacy-preserving algorithms, but two very
important criteria are the quantification of privacy loss and of information loss. A
privacy metric is a measure that demonstrates how closely the original value of an
attribute can be estimated. If it can be estimated with higher confidence, the privacy is
low, and vice versa. Lack of precision in estimating the original dataset is known as
information loss, which can defeat the purpose of data mining. So, a balance needs to
be achieved between privacy and information loss. Dakshi Agrawal and Charu
Agrawal [9] have discussed the quantification of both privacy and information loss in
detail.
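The balance between privacy and information loss described above can be illustrated with a toy additive-perturbation experiment: larger noise makes the original values harder to estimate (more privacy) but also makes the released data deviate more from the truth (more information loss). The RMS measure below is a crude stand-in for the metrics of [9], and the data and noise levels are invented:

```python
import math
import random

def perturb(values, sigma):
    """Release a noisy copy of the data via additive Gaussian perturbation."""
    return [v + random.gauss(0.0, sigma) for v in values]

def information_loss(original, released):
    """Root-mean-square deviation between the original and released
    values: a crude proxy for the information-loss metrics above."""
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(original, released))
                     / len(original))

random.seed(7)
salaries = [300.0, 450.0, 520.0, 610.0, 700.0]
mild   = perturb(salaries, sigma=5.0)    # originals easy to estimate: low privacy, low loss
strong = perturb(salaries, sigma=100.0)  # originals hard to estimate: high privacy, high loss
```

Comparing `information_loss(salaries, mild)` with `information_loss(salaries, strong)` shows the trade-off directly: the release with more privacy carries the larger loss.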
1.10 BACKGROUND:
Privacy is a vital concern when allowing access to different classes of data sets, such as
business and medical datasets, for mining. Privacy is especially essential with respect to
medical data, since it contains private information such as the type of disease associated with a
patient's ID, name, and address. In particular, while mining medical data, the real data should
be available for making precise predictions; otherwise, the results will be useless. Any kind of
release of person-specific information leads to several problems, including moral issues. In this
manner, privacy can be characterized as avoiding undesired disclosure of data while performing
data mining on collective outcomes.
1.10.1 Security Vs Privacy:
Security is the capability to manage access to information and to protect it from unauthorized
disclosure, modification of data and destruction of information [10]. A medical dataset
consists of all information related to the patient. Privacy is a more particular term, which is
characterized as the right of a person to keep his personal data from being revealed. In
medical datasets, the specific disease of a person must not be disclosed into the public
domain. Several known PPDM methods exist today and are examined thoroughly.
A privacy policy is a set of principles that disclose some of the ways a party
collects, manages, discloses and uses a customer's data. In order to ensure privacy, the author
must address various privacy attacks, which requires a high degree of deliberation. In data mining,
a privacy attack occurs when one's precise private information is openly linked to him. Since it is
difficult to identify all types of attacks in advance, private providers can follow policies
issued by different nations, such as HIPAA of the US, the Information Technology Act,
2000 of India and the Data Protection Act of the UK.
underlying data. Consistency is related to the semantic constraints holding on the data,
and it measures how many of these constraints are still satisfied after the sanitization
(Kantarcioglu et al. 2007).
III. Scalability: Scalability is another important aspect for assessing the performance of a PPDM
algorithm. In particular, scalability describes the efficiency trends when data sizes
increase. This parameter concerns the growth of both the execution and storage
requirements, and of the communication costs required by a data mining system, as
the data size increases (Bertino et al. 2008).
IV. Data quality: It is an important aspect of PPDM. High-quality data that has been
prepared specifically for data mining tasks will result in useful data mining models and
output. On the other hand, low-quality data has a significant negative impact on the utility of data
mining results (Bettini et al. 2009).
V. Security: It is the level of protection against danger, damage, loss, and crime. There are two
main approaches to dealing with the problems of privacy that arise today. The
first is a legal and policy approach, whereby organizations are limited in how they store
and use data based on privacy law and public policy. It ordinarily works by assessing
situations and deciding whether the privacy breach caused by using the given data is
justified or not. The second approach is technological, and gives enforced privacy
guarantees through cryptographic means. This approach has the ability to enable the data
to be used while preventing privacy breaches [12].
demarcation between security and privacy requirements of published data is essential. [12]
addresses the importance of security and privacy in data mining [13]
[14]. In this paper, the authors first distinguish between privacy and security in the context of
Census data. The rest of the section gives an introduction to privacy policies and issues that
are handled by different governing bodies within India and other countries.
Privacy Preserving Decision Tree Learning Using Unrealized Data Sets [15]
Techniques: tree learning algorithm, decision tree generation.
Parameters: Temperature, Humidity, Wind, Play.
Remarks: The decision tree algorithm combines well with other security-preserving methodologies, for example cryptography, for additional protection.

Secure and Privacy-Preserving Smartphone-Based Traffic Information Systems
Techniques: KeyGen(n) algorithm, timestamp.
Parameters: GSC (group signature center), accuracy, simulation, location.
Remarks: A localization algorithm suitable for GPS location samples, evaluated through realistic simulations.

On Design and Analysis of the Privacy-Preserving SVM Classifier [15]
Techniques: data mining algorithm, the SVM classification algorithm, kernel adatron algorithm and the datafly algorithm.
Parameters: cost parameter and kernel parameter are used to quantify the performance of the framework.
Remarks: PPSVC can achieve classification accuracy comparable to the original SVM classifier, while securing the sensitive content of the support vectors.

Privacy-Preserving Gradient-Descent Methods
Techniques: GA.
Parameters: language-modeling smoothing parameters and weight parameters are used to gauge the performance of the framework.
Remarks: The secure building blocks are scalable, and the proposed protocols allow a more secure protocol to be chosen for the applications in every scenario.

A Data Mining Perspective in PPDM Systems [15]
Techniques: C5.0 data mining algorithm, commutative RSA cryptographic algorithm.
Parameters: area under the ROC curve, dataset id, sensitivity, specificity-1.
Remarks: Overcomes the overheads arising from key exchange and key computation by adopting the cryptographic algorithm.

Incentive Compatible Privacy-Preserving Data Analysis
Techniques: data analysis algorithms.
Parameters: deterministically non-cooperatively computable (DNCC) model.
Remarks: By Claim 5.1, for the last step in a PPDA task under DNCC it is always possible to make the whole PPDA task satisfy the DNCC model.

Privacy and Quality Preserving Multimedia Data Aggregation for Participatory Sensing Systems [15]
Techniques: outlier detection / anomaly detection algorithm, secure hash algorithm.
Parameters: detection rate, data range indices, anomaly score.
Remarks: A general process for computing bounds on nonlinear privacy-preserving data-mining (PPDM), with applications to anomaly detection.
1.14 CLUSTERING:
Clustering [16] is a data mining strategy that has not yet received its due attention in the
privacy-preserving literature, although its most important algorithm, the k-means algorithm,
has been well studied with regard to privacy preservation. Surveying privacy-preserving
k-means clustering approaches apart from other privacy-preserving data mining ones is important
due to the use of this algorithm in other important areas, such as image and signal
processing, where the privacy issue is strongly posed. The greater part of the work on
privacy-preserving clustering builds on the k-means algorithm by applying the model of
secure multi-party computation on various distributions (vertically, horizontally and arbitrarily
partitioned data). Among the formulations of partition clustering based on the minimization of an
objective function, the k-means algorithm is the most widely used and studied. Given a dataset D
of n entities (objects, data points, items, ...) in real p-dimensional space Rp and an
integer k, the k-means clustering algorithm partitions the dataset D of entities into k disjoint
subsets, called clusters. Each cluster is represented by its center, which is the centroid of all
entities in that subset.
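The partitioning just described can be sketched as a minimal, non-private k-means in plain Python. This is illustrative only: the point lists and parameter names are invented for the example, and a privacy-preserving variant would replace the distance and averaging computations with secure multi-party sub-protocols.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: partition `points` (lists of floats) into k clusters."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)          # initial centers drawn from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to its nearest center
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        for i, cl in enumerate(clusters):    # move each center to its cluster centroid
            if cl:
                centers[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centers, clusters

# two well-separated 1-D groups around 1.0 and 8.0
centers, clusters = kmeans([[1.0], [1.2], [0.8], [8.0], [8.3], [7.9]], k=2)
```

After a few iterations the centers settle on the two group means, regardless of which points were sampled as the initial centers.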
The need to preserve privacy in the k-means algorithm arises when it is applied to data
distributed over several sites, so-called "parties", that wish to do clustering on the
union of their datasets. The point is to keep any party from seeing or deriving the data of another
party during the execution of the algorithm. This is accomplished by using secure multi-party
computation, which gives a formal model for preserving the privacy of data [16].
Recent years have seen broad research in the field of PPDM. As a research direction in data
mining and statistical databases, privacy-preserving data mining has received substantial attention,
and many researchers have carried out a good number of studies in the area. Since its
inception in 2000 with the pioneering work of Agrawal and Srikant [17] and Lindell and
Pinkas [18], privacy-preserving data mining has gained increasing popularity in the data mining
research community. PPDM has become an essential issue in data mining research [19-20].
As a result, a whole new set of methodologies was introduced to permit mining of data while
at the same time preventing the release of any hidden and sensitive information. Most of the
current methodologies can be grouped into two general classes [21]: (i) methodologies that
secure the sensitive data itself in the mining procedure, and (ii) methodologies that secure the
sensitive data mining results (i.e. extracted knowledge) that were produced by the application of
data mining. The first class refers to methodologies that apply perturbation,
sampling, generalization or suppression, transformation, and similar techniques to the original
datasets in order to generate sanitized counterparts that can be safely disclosed to
untrusted parties. The objective of this class of methodologies is to enable the
data miner to get accurate data mining results when it is not furnished with the real data. Secure
Multiparty Computation systems have also been proposed to enable multiple data
holders to collectively mine their data without revealing their datasets to each other. The
second class deals with techniques that restrict the exposure of sensitive knowledge patterns
derived through the use of data mining algorithms, as well as methods for minimizing the
effectiveness of classifiers in classification tasks, so that they do not reveal sensitive data. In
contrast to the centralized model, the Distributed Data Mining (DDM) model acknowledges
that an individual's data may be spread over various sites. Algorithms are developed within this
area for the problem of efficiently obtaining the mining results from the considerable
amount of data across these distributed sources. A straightforward technique for data mining
over multiple sources that will not share data is to run existing data mining tools at each site
independently and combine the results [22]. However, this will often fail to give globally
valid output. Issues that cause a difference between local and global results include: (i) values
for a single entity may be divided across sources, so data mining at individual sites will be
unable to detect cross-site correlations; (ii) the same item may be duplicated at different
sites and will be over-weighted in the results; (iii) data at a single site is likely
to be from the same population. PPDM tends to transform the original data so
that the result of the data mining task does not violate privacy constraints.
Following is the list of five dimensions along which different PPDM techniques can
be classified [23]:
v. Privacy preservation
Data or Rule Hiding: This dimension refers to whether raw data or aggregated data should
be hidden. Hiding aggregated data in the form of rules is extremely difficult,
and for this reason heuristics have frequently been developed.
Data Distribution: This dimension refers to the distribution of data. Some of
the methodologies are developed for centralized data, while others address a distributed data
scenario. Distributed data situations can be further divided into horizontal data partition and vertical data
partition.
Data Modification: Data modification is used with the aim of changing the original values of a
database that is to be released to the public, and in this way to guarantee high privacy
protection. Methods of data modification include:
1.16.1 Heuristic-based techniques: an adaptive modification that alters only selected values,
minimizing the utility loss, rather than all available values.
1.16.3 Reconstruction-based techniques: where the original distribution of the data is reconstructed
from the randomized data. Based on these dimensions, different PPDM strategies may be
classified into the following five categories [24-25, 26].
• Anonymization based PPDM
• Perturbation based PPDM
• Randomized Response based PPDM
• Condensation approach based PPDM
• Cryptography based PPDM
We discuss these in detail in the following subsections.
Such records are also available in medical records and, when linked, can be used to infer the identity
of the corresponding individual with high probability, as shown in figure 1.2. Sensitive data in
the medical record is the illness or even the medication prescribed. The explicit identifiers like name,
SSN and so on have been removed from the medical records. Even so, the
identity of an individual can still be predicted with a higher likelihood. Sweeney [27] proposed the
k-anonymity model, using generalization and suppression to achieve k-anonymity, i.e. any
individual is indistinguishable from at least k-1 others with respect to the quasi-identifier attributes
in the anonymized dataset. In other words, we can describe a table as k-anonymous if the quasi-identifier
values of every row are equal to those of at least k-1 other rows. Replacing a value
with a less specific but semantically consistent value is called generalization, and
suppression involves blocking the values. Releasing such data for mining reduces the risk of
identification when combined with publicly available data. But, at the same time, the accuracy
of applications on the transformed data is reduced. Various algorithms have been proposed
recently to implement k-anonymity using generalization and suppression. Although
the anonymization technique guarantees that the transformed data is truthful, it suffers
heavy information loss. In addition, it is not safe against the homogeneity attack and the background
knowledge attack [28].
First, it may be very difficult for the owner of a database to decide which of the attributes are
available in external tables and which are not. The second restriction is that the k-
anonymity model assumes a specific strategy of attack, while in real circumstances there is
no reason why the attacker should not try other strategies. In any case, as a
research direction, k-anonymity in combination with other privacy-preserving methods should be
examined for detecting and even blocking k-anonymity violations [27].
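The generalization and suppression operations can be made concrete with a small illustrative sketch. The attribute names, the decade-range generalization of age, and the ZIP-truncation rule below are assumptions chosen for illustration, not Sweeney's exact scheme:

```python
from collections import Counter

def generalize(record):
    """Generalize the quasi-identifiers: coarsen age to a decade range and
    truncate the ZIP code; the explicit identifier (name) is simply dropped."""
    decade = (record["age"] // 10) * 10
    return (decade, decade + 9), record["zip"][:3] + "**"

def is_k_anonymous(records, k):
    """Every combination of generalized quasi-identifiers must occur >= k times."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

patients = [
    {"name": "A", "age": 23, "zip": "30201", "disease": "flu"},
    {"name": "B", "age": 27, "zip": "30245", "disease": "asthma"},
    {"name": "C", "age": 21, "zip": "30288", "disease": "flu"},
]
print(is_k_anonymous(patients, k=3))  # all three generalize to ((20, 29), "302**")
```

With all three records falling into the same generalized group, the table is 3-anonymous but not 4-anonymous; note it is also vulnerable to a homogeneity attack on the (non-generalized) disease column whenever a group shares one disease.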
1.16.6 Randomized Response Based PPDM:
Fundamentally, randomized response is a statistical technique introduced by Warner for handling
a survey problem. In randomized response, the data is scrambled in such a way that the central place
cannot tell, with a probability better than a predefined threshold, whether the data from a
user contains correct or incorrect information. The information received from every
single user is distorted, but if the number of users is large, the aggregate information of these users
can be estimated with a good degree of accuracy. This is particularly useful for
decision-tree classification, since it depends on aggregate values of a dataset rather than
individual data items. The data collection process in the randomization strategy is done in two
steps. In the first step, the data providers randomize their data and transfer the randomized
data to the data receiver. In the second step, the data receiver reconstructs the original
distribution of the data by using a distribution reconstruction algorithm. The randomized
response model is shown in figure 1.3.
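Warner's scheme for a single binary attribute can be sketched as follows; the truth probability p = 0.7, the seed and the sample sizes are illustrative assumptions, not values from the original work:

```python
import random

def randomized_response(truth, p=0.7, rnd=random):
    """Warner's scheme: report the true bit with probability p, flip it otherwise."""
    return truth if rnd.random() < p else 1 - truth

def estimate_proportion(reports, p=0.7):
    """Invert the randomization: E[report] = p*pi + (1-p)*(1-pi), solve for pi."""
    mean = sum(reports) / len(reports)
    return (mean - (1 - p)) / (2 * p - 1)

rnd = random.Random(42)
true_bits = [1] * 300 + [0] * 700                  # true proportion pi = 0.3
reports = [randomized_response(b, 0.7, rnd) for b in true_bits]
print(round(estimate_proportion(reports, 0.7), 2))  # close to 0.3
```

No single report reveals a user's true bit with certainty, yet the aggregate proportion is recovered to within sampling error, which is exactly the trade-off the text describes.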
The randomization strategy is comparatively simple and does not require knowledge of
the distribution of other records in the data. Thus, the randomization strategy can be
executed at data collection time. It does not require a trusted server to hold all the original
records in order to perform the anonymization process. The weakness of
a randomized response based PPDM system is that it treats all the records equally,
irrespective of their local density.
This poses a problem in that outlier records become more vulnerable to adversarial
attacks compared to records in denser regions of the data. One remedy is to
deliberately add noise to every record in the data. However, this
reduces the utility of the data for mining purposes, as the reconstructed distribution may not yield
results consistent with the purpose of the data mining.
1.16.7 Condensation Approach Based PPDM:
The condensation approach builds constrained clusters in the dataset and then
generates pseudo-data from the statistics of these clusters. It is called condensation because of its
approach of using condensed statistics of the clusters to generate pseudo-data. It creates groups of
varying size from the data, such that it is guaranteed that each record lies in a group
whose size is at least equal to its anonymity level. Then, pseudo-data are generated
from each group in order to create a synthetic dataset with the same aggregate
distribution as the original data. This approach can be effectively used for the classification problem.
The use of pseudo-data provides an additional layer of protection, as it becomes hard to
perform adversarial attacks on synthetic data. Moreover, the aggregate behavior of the data is
preserved, making it useful for a variety of data mining problems [28]. This method helps in
better privacy preservation as compared to other techniques, as it uses pseudo-data rather than
modified data.
In addition, it works even without redesigning data mining algorithms, since the pseudo-data has
the same format as the original data. It is very effective in the case of
data stream problems, where the data is highly dynamic. At the same time, data mining results get
affected, as a considerable amount of information is lost because of the compression of a larger
number of records into a single statistical group entity.
well-defined model for privacy that incorporates methods for modeling
and evaluating it. The data may be distributed among multiple collaborators vertically or
horizontally. All of these techniques are largely based on a special encryption
protocol known as Secure Multiparty Computation (SMC) technology. SMC, as used in
distributed privacy-preserving data mining, comprises a set of secure sub-protocols
that are used on horizontally and vertically partitioned data: secure sum, secure set union,
secure size of set intersection and scalar product. Although cryptographic techniques guarantee that the
transformed data is exact and secure, this approach fails to deliver when more than
a few parties are involved. Additionally, the data mining results themselves may breach the
privacy of individual records. There exist a good number of solutions for semi-honest
models, but in the case of malicious models far fewer studies have been made [28].
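Of the sub-protocols listed, secure sum is the simplest to illustrate. The following toy ring-protocol sketch assumes honest-but-curious parties and omits all networking; the function and parameter names are invented for the example:

```python
import random

def secure_sum(private_values, modulus=10**6, seed=1):
    """Toy secure-sum ring protocol: the initiator masks its value with a random
    offset R, each subsequent party adds its own value to the running total mod M,
    and the initiator removes R at the end. No party ever sees another party's
    raw value, only a masked running total."""
    rnd = random.Random(seed)
    R = rnd.randrange(modulus)
    running = (R + private_values[0]) % modulus     # initiator masks its input
    for v in private_values[1:]:                    # pass around the ring
        running = (running + v) % modulus
    return (running - R) % modulus                  # initiator unmasks the total

print(secure_sum([120, 340, 55]))  # 515
```

The modulus must exceed the largest possible sum so the unmasked result is exact; this sketch also ignores collusion, which real SMC protocols must address.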
i. Expensive: a considerable number of the encryption-based protocols use the idea
introduced by Yao. In Yao's protocol one of the parties computes a garbled
version of a Boolean circuit for evaluating the desired function. The garbled circuit
comprises encryptions of all possible bit values on every possible wire
in the circuit. The number of encryptions is roughly 4m, where m is the number of gates
in the circuit. The encryptions can use symmetric key encryption, which has a
typical ciphertext length of 64 bits. The garbled circuit is sent to the other party,
which can then evaluate the circuit to obtain the final result. These
methodologies are, in general, costly since they require complicated
encryptions for every individual bit [30].
ii. Recover original data after hiding: PPDM comprises a number of procedures to
recover data from very large databases that also contain sensitive
data. k-anonymity is a method to suppress or generalize the data so that the
data cannot be accessed by any unauthorized users.
iii. Support of large datasets: Due to the continuous advances in hardware technology,
large amounts of data can now be easily stored. Databases and data warehouses
today store and manage amounts of data which are increasingly large. Thus, a PPDM
algorithm must be designed and implemented with the ability to handle huge
datasets that may continue to grow. The slower the decrease in the
efficiency of a PPDM algorithm as data sizes grow, the better its
scalability. Hence, the scalability measure is essential in identifying practical PPDM
techniques [31].
A. Accuracy:
Accuracy is closely related to the information loss resulting from the hiding strategy: the smaller
the information loss, the better the data quality. A PPDM algorithm always has to maintain
high accuracy to reduce information loss.
B. Completeness:
Completeness assesses the level of missed data in the sanitized database. Incomplete data has a
significant impact on data mining results and impairs the data mining algorithms from providing
an accurate representation of the underlying data.
C. Scalability:
In particular, scalability describes the efficiency trends when data sizes increase. This parameter
concerns the growth of both the execution and storage requirements, and of the costs of the
communications required by a data mining procedure, as the data size increases [32].
D. Data quality:
It is an important aspect of PPDM. High-quality data that has been prepared specifically for data
mining tasks will result in useful data mining models and output. Alternatively, low-quality data
has a significant negative impact on the utility of data mining results [33].
E. Security:
It is the degree of protection against danger, damage, loss, and crime. There are two main
approaches to dealing with the problems of privacy that arise today. The first is a
legal and policy approach whereby organizations are limited in how they store and use data
based on privacy law and public policy. It ordinarily works by assessing situations and deciding
whether the privacy breach caused by using the given data is justified or not. The second
approach is technological, and gives enforced privacy guarantees through cryptographic means.
This approach has the capacity to enable the data to be utilized while preventing
privacy breaches [34].
DES was the end result of a project set up by the International Business
Machines (IBM) Corporation in the late 1960s, which resulted in a cipher referred to as
LUCIFER. A modified form of LUCIFER was advanced as a proposal for
the new national encryption standard requested by the National Bureau of Standards
(NBS). It was finally adopted in 1977 as the DES. DES relies on a construction called the Feistel
block cipher. This is a block cipher developed by the IBM cryptography
researcher Horst Feistel in the early 1970s. It consists of a number of rounds in which each round
performs bit-shuffling, nonlinear substitutions (S-boxes) and exclusive-OR operations. Once a
plaintext message is to be encrypted, it is arranged into the 64-bit blocks required as
input. If the number of bits in the message is not evenly divisible by 64,
then the last block can be padded [35] [36]. DES performs an initial permutation on the entire
64-bit block of data. The block is then split into two 32-bit sub-blocks, Li and Ri, which
are then passed into 16 rounds (the subscript i in Li and Ri denotes the current
round). Each of the rounds is identical, and the effect of increasing their number is twofold: the
algorithm's security is increased and its temporal efficiency decreased. Clearly, these are
two conflicting outcomes and a compromise has to be made. For DES the number selected
was 16, most likely to guarantee the elimination of any correlation between the ciphertext
and either the plaintext or the key. At the end of the 16th round, the 32-bit Li and Ri output halves are
swapped to create what is known as the pre-output. This [R16, L16] concatenation is permuted using a
function that is the exact inverse of the initial permutation. The output of this final
permutation is the 64-bit ciphertext [37] [38].
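The round structure described above can be illustrated with a generic Feistel skeleton. This is not DES itself: the round function f below is a made-up stand-in, the key schedule is a fixed list, and the permutations and S-boxes are omitted; only the invertibility of the structure (same network, reversed round keys) is being demonstrated.

```python
def feistel_encrypt(left, right, round_keys, f):
    """Generic Feistel structure (the skeleton DES follows): each round feeds the
    right half through f with the round key, XORs the result into the left half,
    then swaps the halves. The final swap mirrors the DES pre-output step."""
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    return right, left

def feistel_decrypt(left, right, round_keys, f):
    """Decryption is the identical network run with the round keys reversed."""
    return feistel_encrypt(left, right, list(reversed(round_keys)), f)

# toy 32-bit round function and key schedule -- placeholders, not DES's f or keys
f = lambda half, key: (half * 31 + key) & 0xFFFFFFFF
keys = [0x1A2B, 0x3C4D, 0x5E6F, 0x7081]

ct = feistel_encrypt(0xDEADBEEF, 0x01234567, keys, f)
pt = feistel_decrypt(*ct, keys, f)
print(pt == (0xDEADBEEF, 0x01234567))  # True: the structure is invertible
```

Note that f itself never needs to be invertible; the XOR cancellation is what makes the whole network reversible, which is why DES can use a lossy round function.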
These blocks are treated as an array of bytes and organized as a 4×4 matrix that
is known as the state. For both encryption and decryption, the cipher begins with an
AddRoundKey stage. Before reaching the final round, however, this output goes through
nine main rounds, during each of which four transformations are performed:
1) Sub-bytes,
2) Shift-rows,
3) Mix-columns,
4) Add round key.
In the last (tenth) round, there is no Mix-columns transformation. Figure 4 demonstrates the
overall procedure. Decryption is the reverse process of encryption, using the inverse
functions: Inverse Substitute Bytes, Inverse Shift Rows and Inverse Mix Columns. Each round
of AES is governed by the following transformations [40]:
Shift Rows: It is a simple byte transposition. The bytes in the last three rows of the state,
depending on the row position, are cyclically shifted: for the second row, a 1-byte circular left
shift is performed; for the third and fourth rows, 2-byte and 3-byte circular left shifts are performed
respectively.
Add Round Key: It is a bitwise XOR between the 128 bits of the present state and the 128 bits of the round key. This
transformation is its own inverse [40].
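Two of these transformations are easy to demonstrate directly on a 4×4 state; the byte values and the round key below are arbitrary illustrations, and SubBytes/MixColumns are omitted:

```python
def shift_rows(state):
    """ShiftRows on a 4x4 AES state (list of 4 rows): row i is rotated
    left by i bytes; row 0 is left untouched."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

def add_round_key(state, round_key):
    """AddRoundKey: bitwise XOR of state and round key; since XOR is
    self-inverting, applying the same key twice restores the state."""
    return [[b ^ k for b, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]

state = [[0x00, 0x01, 0x02, 0x03],
         [0x10, 0x11, 0x12, 0x13],
         [0x20, 0x21, 0x22, 0x23],
         [0x30, 0x31, 0x32, 0x33]]
key = [[0xAA] * 4 for _ in range(4)]          # arbitrary illustrative round key

masked = add_round_key(state, key)
assert add_round_key(masked, key) == state    # AddRoundKey is its own inverse
assert shift_rows(state)[1] == [0x11, 0x12, 0x13, 0x10]  # row 1 rotated by one
```

The second assertion shows the 1-byte circular left shift of the second row described above; rows 2 and 3 rotate by 2 and 3 bytes in the same way.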
Chapter- 2
LITERATURE SURVEY
Tamanna Kachwala et al. [41] note that there are vast future research directions for privacy-
preserving data mining (PPDM). First, present studies tend to use various terminologies to report similar
or related practice. For example, people have used data modification, data perturbation, data sanitation,
data hiding, and pre-processing as possible ways of preserving privacy;
however, all in fact relate to using some approach to modify the actual data so that
private data and knowledge remain private even after the mining process. Lacking a common
language for these discussions will cause misconceptions and slow down research
breakthroughs.
Along these lines, there is a need for standardizing the terminology and practice of PPDM.
Second, most prior PPDM algorithms are intended for use with data
stored in a centralized database. However, in the present global digital environment, data is
regularly stored across multiple sites. With recent advances in information and communication
technologies, distributed PPDM methodology may have a wider application,
particularly in medical, healthcare, banking, military and supply chain
settings. Third, data hiding methodologies have been the dominant
techniques for protecting the privacy of mining results, which may otherwise lead to sensitive
rule leakage. While some of the algorithms preserve the sensitive rules accurately with respect
to the original data, they may reduce the accuracy of the other, non-sensitive rules.
Kun Liu et al. [42] explore the possibility of using multiplicative random projection matrices
for privacy-preserving distributed data mining. This class of problem is directly related to
various other data-mining issues such as principal component analysis, clustering and
classification. The paper makes primary contributions on two grounds. First, it explores
Independent Component Analysis as a possible tool for breaching privacy in the deterministic
multiplicative perturbation model. Then, it proposes an approximate random projection-
based technique to improve the level of privacy protection while still preserving the
statistical properties of the data attributes.
Jaideep Vaidya [43] shows that general and efficient privacy-preserving distributed knowledge
discovery is truly feasible. The paper considers the privacy and security aspects associated
with distributed data that is partitioned either vertically or horizontally over multiple
sites, and the motivation for performing data mining on such data. Since
Random Decision Trees (RDTs) can be used to develop equivalent, accurate and sometimes better
models at much lower cost, this paper proposed distributed privacy-preserving
RDTs; the technique exploits the fact that randomness in the structure can also give
strong privacy with less computation. The results of the paper show that the privacy-preserving
RDT algorithm scales linearly with dataset size and requires considerably less time
than alternative cryptographic methodologies.
Chun-Wei Lin et al. [44] discuss a greedy-based approach to hiding sensitive itemsets,
while preserving their validity, through transaction insertion. The proposed approach first
calculates the maximal number of transactions to insert into the real database for fully hiding the
sensitive itemsets. The dummy items of the transactions to be inserted are
designed by a statistical technique, which can greatly reduce the side effects in
PPDM. The sensitive itemsets are then hidden by adding new transactions into the actual
database, thus raising the minimum count threshold needed to reach the goal. Three factors
are taken into consideration. First, the transactions should be chosen carefully, incurring the
smallest amount of side effects, to completely cover the sensitive itemsets; here, the sensitive
itemsets are evaluated respectively to find the maximal number of transactions to insert. Second,
the length of each newly inserted transaction is computed according to empirical
rules on the standard normal distribution. Last, the already existing large itemsets are then
alternatively added into the newly inserted transactions according to the transaction lengths
determined in the second step. This step averts the missing failure of the large
itemsets, diminishing the side effects in PPDM.
M. Mahendran et al. [45] proposed a heuristic approach which ensures output privacy,
preventing the extracted patterns (itemsets) from malicious inference
issues. An efficient algorithm named the Pattern-based Maxcover
algorithm is proposed. This algorithm decreases the dissimilarity between the source dataset and the
released database; moreover, the protected patterns cannot be fetched from the released
database by an adversary or counterpart even with an arbitrarily low support threshold. Mechanisms
are needed that can lead to new privacy control systems which convert a given
database into a new one in such a way as to preserve the general rules mined from the original database.
The process of transforming the source database into a new database that hides some
confidential patterns or rules is called the sanitization process. To do so, a small number
of transactions are modified by deleting one or more items from them, or even
by adding noise to the data by turning some items from 0 to 1 in selected transactions. The released
database is known as a sanitized database. Here, the approach slightly alters some data, but
this is perfectly acceptable in some real applications.
Chun-Wei Lin et al. [46] proposed two algorithms based on the genetic algorithm (GA): a
simple genetic algorithm to delete transactions (sGA2DT) and a pre-large genetic algorithm to
delete transactions (pGA2DT). Genetic algorithms (GAs) search for optimal results using the
natural principles of evolution. A GA-based framework comprising these two algorithms is
proposed to handle the particular problems of heuristic-based approaches. A flexible evaluation
function, containing three factors with adjustable weightings, is designed to decide whether
certain transactions should be deleted in order to hide the sensitive itemsets. The proposed
algorithms delete a pre-defined number of transactions to hide the sensitive itemsets. The
simple genetic algorithm and the pre-large concept are used to reduce the execution time spent
rescanning the original database during chromosome evaluation, as well as the population size
of the proposed algorithms. A straightforward (greedy) approach is designed as a benchmark to
evaluate the performance of the two proposed algorithms, sGA2DT and pGA2DT, with regard
to execution time, the three side effects (missing itemsets, hiding failures, and artificial
itemsets), and database dissimilarity in the experiments.
S. Lohiya [47] states that sensitive classification rules can be used for hiding sensitive and
private data from other users. This technique uses two steps to preserve privacy: first,
recognize the transactions supporting the sensitive rule; second, substitute the known values
with unknown values. The approach scans the actual database to identify the transactions
supporting the sensitive rule, and then, for every such transaction, the algorithm replaces the
sensitive data with unknown values. This approach is applicable to applications where
unknown values can be stored for some of the attributes.
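The two-step procedure above can be sketched directly. The Python fragment below is a minimal illustration, not the author's algorithm: the record layout (a dict per record), the rule encoding, and the `'?'` unknown marker are assumptions.

```python
# Sketch of the two steps described above: (1) find records supporting a
# sensitive classification rule, (2) replace the known sensitive values
# there with the unknown marker '?'.

def supports(record, rule):
    """True if the record matches every attribute/value pair of the rule."""
    return all(record.get(attr) == val for attr, val in rule.items())

def hide_rule(db, rule, sensitive_attr):
    """Replace the sensitive attribute with '?' in supporting records."""
    for record in db:
        if supports(record, rule):
            record[sensitive_attr] = '?'
    return db

db = [{'age': 'young', 'disease': 'flu'},
      {'age': 'old',   'disease': 'flu'}]
hide_rule(db, {'age': 'young'}, 'disease')
print(db[0]['disease'])  # '?'  - sensitive value suppressed
print(db[1]['disease'])  # 'flu' - non-matching record untouched
```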
Yu Zhu and Lei Liu [48] derive optimal randomization schemes for privacy-preserving density
estimation. The impact of randomization on data mining is measured by performance
degradation and mutual information loss, while interval-based metrics are computed for the
loss of privacy.
V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati [49] utilized k-anonymity,
in which the identity of each respondent in a released dataset is indistinguishably combined
with at least k-1 other respondents. It quantifies the amount of anonymity retained during the
data mining process. K-anonymization reduces the efficiency of data mining algorithms on the
anonymized data while rendering privacy preservation. The original k-anonymity proposal and
its enforcement by means of generalization and suppression, to safeguard respondents'
identities before releasing the genuine data, were elaborated, and various ways of applying
generalization and suppression were discussed.
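The k-anonymity property itself is easy to state in code: every combination of quasi-identifier values in the released table must occur in at least k records. The following Python check is a minimal sketch; the column names and generalized values (`476**`, `2*`) are assumptions for illustration.

```python
# Minimal sketch of the k-anonymity property: each quasi-identifier
# combination in the released table must appear in at least k records.

from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

table = [{'zip': '476**', 'age': '2*'},
         {'zip': '476**', 'age': '2*'},
         {'zip': '476**', 'age': '2*'}]
print(is_k_anonymous(table, ['zip', 'age'], k=3))  # True
print(is_k_anonymous(table, ['zip', 'age'], k=4))  # False
```

Generalization (replacing `47677` by `476**`) and suppression (dropping outlier records) are exactly the operations used to make this check pass for a chosen k.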
Yehuda Lindell and Benny Pinkas [50] present an introduction to secure multiparty
computation (SMC) and its applicability to PPDM. They discuss the common kinds of errors
that appear in the literature when PPDM is implemented with SMC techniques, together with
the issues of efficiency, and exhibit the difficulties of building highly efficient protocols.
Ueli Maurer [51] proposed a straightforward approach to multi-party computation (MPC) with
simple security proofs. This work achieves security only in the passive-adversary setting,
without the possibility of extending it to active-adversary settings.
Aris Gkoulalas-Divanis and Grigorios Loukides [52] proposed sequential pattern hiding.
Publishing sequence datasets offers remarkable opportunities for discovering interesting data
patterns. This paper considers how to sanitize data so as to prevent the disclosure of sensitive
patterns during sequential pattern mining, while ensuring that the non-sensitive patterns can
still be discovered. The first algorithm attempts to sanitize the data with minimal change, while
the second focuses on limiting the adverse side effects.
Amruta Mhatre and Durga Toshniwal [53] presented a novel approach to conceal sensitive
co-occurring sequential patterns. This strategy works on dynamic databases, whereas the large
majority of standard privacy-preservation methodologies work only on static databases.
Dynamic databases are a generalized model of static and incremental databases, and the
process is extended to suit these different types of databases. The strategy introduced here
prevents the occurrence of the most sensitive kinds of patterns by suppressing the frequently
occurring patterns and keeping them from being frequent. It is further examined in order to
develop methods to choose the pattern to be blocked.
Shikha Sharma and Pooja Jain [54] base their work on reducing the confidence and support of
sensitive rules. A modified algorithm is used to hide the sensitive association rules without side
effects. To hide a sensitive element, the algorithm repeatedly increases the count of
transactions hiding the rule until its confidence falls below the specified minimum threshold,
instead of rescanning all the transactions and ordering them in increasing or decreasing order.
Once the confidence falls below the specified minimum confidence threshold, the rule is
hidden, i.e., it will not be found by any data mining algorithm.
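The confidence-reduction idea can be sketched as follows. For a rule X → Y, confidence is supp(X ∪ Y)/supp(X); removing a consequent item from supporting transactions lowers it. This Python fragment is an illustration of that mechanism only, not the authors' modified algorithm; the victim-item choice is an assumption.

```python
# Illustrative sketch of hiding a sensitive rule X -> Y by driving its
# confidence supp(X u Y)/supp(X) below a threshold: repeatedly remove a
# consequent item from one supporting transaction at a time.

def confidence(db, X, Y):
    sx = sum(1 for t in db if X <= t)
    sxy = sum(1 for t in db if (X | Y) <= t)
    return sxy / sx if sx else 0.0

def hide_by_confidence(db, X, Y, min_conf):
    db = [set(t) for t in db]
    victim = next(iter(Y))                 # consequent item to remove
    for t in db:
        if confidence(db, X, Y) < min_conf:
            break                          # rule already hidden
        if (X | Y) <= t:
            t.discard(victim)
    return db

db = [{'a', 'b'}, {'a', 'b'}, {'a', 'c'}]
out = hide_by_confidence(db, {'a'}, {'b'}, min_conf=0.4)
print(confidence(out, {'a'}, {'b'}))  # below the 0.4 threshold
```

Note that only the transactions actually supporting the rule are touched, which is what keeps the side effects on non-sensitive rules small.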
Shaofei Wu et al. [55] proposed a new algorithm to balance privacy preservation and
knowledge discovery in association rule mining. The scheme applies a filter after the mining
stage to pass through or hide the restricted discovered association rules. Before executing the
algorithms, the data structure of the database and the sensitive association rule set is examined
to build an effective model.
Chirag N. Modi et al. [56] proposed an algorithm that provides protection against participating
parties and other parties that can obtain data through an unsecured channel.
Stanley R. M. Oliveira's [57] goal is to balance privacy with the disclosure of useful
knowledge by trying to reduce the impact on the sanitized transactions and to reduce the
accidentally hidden and ghost rules. Utility here is measured as the number of non-sensitive
rules that were hidden as a side effect of the data modification process.
Mohammad Reza Keyvanpour et al. [58] present a careful survey bringing to the limelight
scores of works on different existing privacy-preserving techniques, their uses, and their
deficiencies. The majority of privacy-computation techniques use some form of data
transformation to perform privacy preservation. Characteristically, such methodologies limit
the granularity of representation or restrict access to resources in order to protect privacy. This
reduction in granularity results in some loss of efficacy of the data mining algorithms; this is
the usual trade-off between information loss and privacy. Researchers have developed
strategies that enable data mining to be applied while preserving the privacy of individuals.
Although several methodologies have been proposed for privacy-preserving data mining
(PPDM), the survey gives a quick outline and a detailed review of some of the approaches used
for PPDM, and proposes a classification based on the three common PPDM methods: the data
modification approach, the data sanitization approach, and the secure multiparty computation
approach.
Ali Inan et al. [59] focus on discovering object-based divergence for privacy preservation.
Having thoroughly investigated the available diverse techniques for privacy preservation, they
find that existing privacy-preservation techniques operate at only a single level. Even the
recently proposed techniques, such as comparative studies of cryptographic and perturbation
methods that were implemented and evaluated within their own theoretical frameworks to
demonstrate their effectiveness, are also single-level ones. The framework used to compare and
contrast each approach on a common platform is the basis for finding the appropriate method
for a given application of privacy-preserving collaborative filtering. Nonetheless, there are
many situations where sharing of information can lead to general gain, as in the case of the
privacy-preserving secure accord mentioned.
E. Poovammal and M. Ponnavaikko [60] proposed a technique of microdata sanitization for
securing privacy against malicious attacks as well as for protecting the data utility for the
mining task. In the proposed approach, a graded grouping transformation and a table-based
mapping transformation are applied to numerical sensitive attributes and categorical sensitive
attributes, respectively. They performed experiments on the adult dataset and compared the
results of the actual table and the transformed table to prove that their proposed
task-independent technique is able to secure privacy, information, and utility. Generally, two
approaches, called the statistics-based approach and the crypto-based approach, are used to
deal with PPDM. One advantage of the statistics-based technique is that it handles huge
datasets efficiently.
Patrick Sharkey et al. [61] proposed a statistics-based PPDM approach. Their approach differs
fully from existing techniques because it allows the data owners to share with each other the
knowledge models extracted from their own private datasets, rather than requiring the data
owners to distribute any of their private datasets (not even in sanitized form). Here, the
knowledge models obtained from the individual datasets are used to generate pseudo-data, and
such data is then utilized to extract the prevailing "global" knowledge models. A few technical
issues are instrumental and must be carefully addressed. In particular, they proposed an
algorithm for generating pseudo-data based on decision tree paths, a procedure for adjusting
the imperceptibility measures of datasets to evaluate the privacy of decision trees, and an
algorithm for pruning a decision tree so as to guarantee a given privacy requirement.
Experimental studies performed in different environments, with several types of datasets,
predictive models, and utility measures, have shown that the predictive models learned using
the proposed approach are significantly more accurate than those obtained using the existing
l-diversity strategy. Since organizations are increasingly gathering and sharing data about their
customers, infringement of customer privacy is growing very rapidly. Although some sharing
benefits the general public, for example identifying the behavior of a disease in medical
research, individuals are worried about the intrusion into their privacy. To avoid such
violations, the sensitive attributes of the data are mapped to another domain such that the real
values are not disclosed and yet the original associations are preserved.
Nafees Qamar et al. [62] address the challenge of determining clinically relevant patterns.
Medical datasets are treated as a black box for both internal and external users of the data, with
a remote-query mechanism enabling users to build and execute database queries. The novelty
of the method lies in avoiding the complicated data de-identification systems that are mostly
used to protect patient privacy. The implemented toolkit combines software engineering
technologies, such as Java EE and RESTful web services, to transform medical data into an
unidentifiable XML structure while limiting users according to the need-to-know privacy
principle. As a consequence, the procedure inhibits speculative processing of the data, such as
attacks by an adversary on a medical dataset using advanced computational approaches to
uncover Protected Health Information (PHI). The strategy is validated on an endoscopic
reporting application based on openEHR and MST standards. The proposed process is
essentially motivated by the issues medical researchers and governmental or non-governmental
organizations face when querying datasets while monitoring healthcare services to improve the
quality of care.
Alexandre Evfimievski et al. [63] present that PPDM arose in response to two equally essential
(and apparently disparate) needs: data analysis to deliver better services, and guaranteeing the
privacy rights of data owners. Difficult as the task of addressing both needs may seem, many
tangible efforts have been accomplished. An overview of popular PPDM techniques is
presented, namely suppression, cryptography, randomization, and summarization. The privacy
guarantees, advantages, and disadvantages of each approach are stated to provide a balanced
view of the state of the art. Finally, the scenarios where PPDM may be used and some
directions for future work are outlined.
K. Sashirekha et al. [64] present that privacy-preserving data mining addresses the issue of
securing mobile individuals from attackers. The privacy threat includes predicting movement
patterns based on collected statistical information: an intruder monitors activity models to
foresee group movement and tries to access the private data of mobile users. Privacy can be
provided by the methods of randomization, distributed privacy-preserving data mining, and
k-anonymization. To give better privacy, multi-level frameworks are used. Here, an analysis
was carried out on different privacy-preserving methods and the multi-level trust policy,
including their limitations when using high-dimensional data sets.
Stanley R. M. Oliveira et al. [65] present that problems concerning PPDM have emerged
globally, and the recent proliferation of PPDM approaches is obvious. Motivated by
maximizing the number of successful approaches, the current generation of PPDM moves
toward standardization, because it plays an essential role in the future of PPDM. The authors
examine what urgently needs to be done and take a few steps toward such standardization:
first, they describe the issues faced in defining which data is private in data mining, and
discuss how privacy can be violated in data mining. They then argue that privacy preservation
within data mining is based on users' personal data and data concerning their collective
activity.
Yehuda Lindell et al. [66] address the issue of privacy-preserving data mining (PPDM). In
particular, they consider the situation where two parties owning protected databases wish to
run a data mining computation on the union of their databases without revealing any
unnecessary information. The work is motivated by the need both to protect privileged data
and to enable its use for research or other purposes. The above problem is a particular case of
secure multi-party computation (SMC) and can, in principle, be solved using known generic
protocols. However, data mining algorithms are complex and, moreover, the input usually
consists of massive data sets, so the generic protocols are of no practical use in such cases and
efficient protocols are required. The focus is on the problem of decision tree learning with the
well-known ID3 algorithm. The proposed protocol is considerably more efficient than generic
solutions and demands both very few rounds of communication and reasonable bandwidth.
Shahejad Khan et al. [67] present that the PPDM approach creates privacy, before data is
distributed, by means of perturbation. The existing approach to this issue handles privacy at a
single user-trust level on the database. They therefore relax this assumption and broaden the
scope of perturbation-based PPDM to multilevel trust (MLT-PPDM). In this setting the data
miner must be treated more cautiously, since a fraudulent data miner, using different means,
can access differently perturbed copies of the data and may aggregate these copies to obtain
more precise knowledge about the data than the owner would ever want to be leaked. The data
owner is enabled to generate perturbed copies of its data for multi-level trust (MLT-PPDM) on
demand. This feature gives maximum flexibility and tractability to data owners.
Shweta Taneja et al. [68] present that PPDM deals with covering an individual's sensitive
identity without weakening the usability of the data. It has become a vital area of concern, but
this branch of research is still in its infancy. People today are well aware of privacy intrusions
into sensitive data and are more reluctant to share their data. The major concern is that even
non-sensitive data may deliver sensitive information, including personal data, facts, or patterns.
Numerous PPDM methodologies have been proposed in the literature. The authors study the
state-of-the-art approaches and present a tabular comparison of the work done by various
authors; as future work, they plan a hybrid of these techniques to preserve the privacy of
sensitive data.
Chris Clifton et al. [69] present that privacy-preserving mining of distributed data has many
applications, and each application poses various constraints: what is considered private, what
are the desired results, how is the data distributed, what are the constraints on collaboration
and cooperative computing, and so on. They suggest that the solution is a toolkit of
components that can be combined for specific privacy-preserving data mining applications.
The paper presents some components of such a toolkit and shows how they can be used to
solve several privacy-preserving data mining (PPDM) problems.
Yuna Oh et al. [70] present that the increasing number of mobile device users suggests the
expansion of personalized location-based services (LBS). Despite their proliferation, the
hazard of violating users' privacy by exposing a person's location information remains, and
numerous studies have therefore sought to prevent privacy violations in LBS. However,
previous research mostly considered protecting users' location information, without
considering semantic location privacy violations through contextual information. This paper
explains a method of inferring a user's behavior using semantic knowledge involving spatial
and temporal information, and recommends a privacy-preserving method to hinder the
exposure of sensitive behavior in semantic LBS. An Android application is implemented to
validate the proposed approach. According to the experimental results, the proposed
b-diversity approach is validated to avert exposure of sensitive behavior while also reducing
the degradation of data utility.
Alexandre Evfimievski et al. [71] present that this technique is potentially vulnerable to
privacy breaches: based on the data distribution, one may learn with high confidence that some
randomized records satisfy a specified property, even though privacy is preserved on average.
Here, a new formulation of privacy breaches is exhibited, together with a technique,
"amplification", for limiting them. Like earlier strategies, amplification makes it possible to
bound the probability of privacy breaches without any knowledge of the distribution of the
original data. The technique is instantiated for the problem of association rule mining, and the
algorithm is adjusted to limit privacy breaches without knowledge of the data distribution.
Additionally, the issue is addressed that the randomization required to avoid privacy breaches
(when mining association rules) results in long transactions. By using pseudorandom
generators and carefully choosing seeds such that the desired items from the original
transaction are present in the randomized transaction, only the seed needs to be transmitted
instead of the transaction, bringing a dramatic drop in communication and storage cost.
Finally, new information measures are defined that take privacy breaches into account when
quantifying the amount of privacy preserved by randomization.
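The randomization idea underlying this line of work can be sketched with a simple retention scheme over 0/1 transaction data: each bit is kept with probability p and flipped otherwise, and the true support is estimated from the randomized counts. The parameters and column sizes below are illustrative assumptions, not values from the paper.

```python
# Sketch of bitwise randomization over 0/1 transaction data: keep each
# bit with probability p, flip it otherwise, then invert the known noise
# model to estimate the true support from the randomized column.

import random

def randomize(bits, p, rng=random.random):
    """Keep each bit with probability p, flip it otherwise."""
    return [b if rng() < p else 1 - b for b in bits]

def estimate_support(randomized_column, p):
    """Unbiased estimate of the true fraction of 1s. The observed rate r
    satisfies r = p*s + (1-p)*(1-s); solving for s gives the estimate."""
    r = sum(randomized_column) / len(randomized_column)
    return (r - (1 - p)) / (2 * p - 1)

random.seed(1)
p = 0.9
column = [1] * 700 + [0] * 300          # true support 0.7
noisy = randomize(column, p)
print(estimate_support(noisy, p))       # close to 0.7
```

Amplification-style analyses then ask how confidently an adversary can infer an individual's original bit from its randomized value, and bound that confidence independently of the data distribution.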
Nissim Matatov et al. [72] have presented an approach for achieving k-anonymity by
partitioning the original dataset into several projections such that each one of them adheres to
k-anonymity. Moreover, any attempt to rejoin the projections results in a table that still adheres
to k-anonymity. A classifier is trained on each projection, and an unlabelled instance is then
classified by combining the classifications of all the classifiers. Based on the k-anonymity
constraints and the classification accuracy, a genetic algorithm (GA) is used by the proposed
data mining privacy by decomposition (DMPD) algorithm to search for the best feature-set
partitioning. Ten different datasets were used with DMPD to evaluate its classification
performance against other k-anonymity-based methods. The results show that DMPD
performed better than the existing k-anonymity-based algorithms, with no need for
domain-dependent knowledge. They also evaluated the tradeoff between the two conflicting
objectives in PPDM, privacy and predictive performance, by using multi-objective
optimization techniques. Since the total amount of traffic data in networks has been increasing
at a striking rate, a substantial body of research has tried to mine traffic data in order to obtain
valuable information. For instance, there are several studies on identifying Internet worms and
intrusions by detecting anomalous traffic patterns. However, as network traffic data contains
information about users' Internet usage patterns, network users' privacy may be compromised
during the mining process.
Seung-Woo Kim et al. [73] proposed a robust technique that preserves privacy during
sequential pattern mining on network traffic data. Their strategy uses an N-repository server
model, which works as a single mining server, and a retention replacement technique, which
probabilistically changes the answer to each query so as to locate the frequent sequential
patterns without breaching privacy. The technique also speeds up the overall mining process
by maintaining meta tables at each site, so that it can be quickly determined whether a
candidate pattern has ever occurred at that site. The accuracy and effectiveness of the
technique were shown by experiments using real-world network traffic data. In recent years,
various methods based on random perturbation of data records have been introduced to protect
user privacy in the data mining process.
The k-anonymity protection model is vital because it forms the basis on which the real-world
systems known as Datafly, µ-Argus, and k-Similar provide guarantees of privacy protection.
M. Elmisery et al. [77] present a novel clustering algorithm for vertically partitioned data and
test its performance through experiments and complexity analysis. They then present a private
version of this protocol based on homomorphic encryption; the protocol is robust against
colluding attacks. In privacy-preserving range set union for uncommon cases in healthcare
data, J. Y. Chun et al. propose a privacy-preserving range set union protocol (PPRSU) that is
used to discover uncommon cases in the private medical datasets of individuals while
preserving privacy. The range set union R(t1, t2) is the set of elements that at least t1 parties
and at most t2 parties have in their private sets. PPRSU can be used to build new set operations
as well as conventional set operations, and it does not disclose any other information except
what can be inferred from the range set union and the size of each private set.
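The definition of the range set union R(t1, t2) is easy to illustrate in the clear. The Python fragment below computes it on plaintext sets purely to show what the PPRSU protocol outputs; the real protocol computes the same result cryptographically, without any party revealing its private set. The example data are assumptions.

```python
# Plaintext illustration of the range set union R(t1, t2): the elements
# held by at least t1 and at most t2 of the parties. The actual PPRSU
# protocol computes this without revealing the private sets themselves.

from collections import Counter

def range_set_union(private_sets, t1, t2):
    counts = Counter(x for s in private_sets for x in set(s))
    return {x for x, c in counts.items() if t1 <= c <= t2}

parties = [{'flu', 'rare1'}, {'flu', 'rare1'}, {'flu'}, {'flu', 'rare2'}]
# Conditions held by at least 1 and at most 2 parties (uncommon cases):
print(sorted(range_set_union(parties, 1, 2)))  # ['rare1', 'rare2']
```

Setting t1 = 1 and a small t2, as here, picks out exactly the rare cases that the healthcare application is after, while common elements such as `'flu'` are excluded.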
Chapter- 3
SIMULATION TOOL
MATLAB is used in a wide variety of applications, including signal and image processing,
communications, control design, test and measurement, computational biology, parallel
computing, and financial modeling and analysis.
MATLAB is an array language, initially popular for rapid prototyping, but now increasingly
used to develop production code for numerical and scientific applications. Typical MATLAB
programs have plentiful data parallelism. These programs also have control-flow-dominated
scalar regions that affect the program's execution time. Today's computer systems have
tremendous computing power in the form of traditional CPU cores and also
throughput-oriented accelerators such as graphics processing units (GPUs). Accordingly, a
strategy that maps the control-flow-dominated regions of a MATLAB program to the CPU and
the data-parallel regions to the GPU can drastically improve application performance [78].
MATLAB programs are declarative and naturally express data-level parallelism, as the
language provides several high-level operators that work directly on arrays. MATLAB is
commonly used as a programming language to write different kinds of simulations, and is used
extensively to simulate and design systems in areas like control engineering, image processing,
and communications. These programs are typically long-running, and developers expend
significant effort trying to shorten their running times. At present, MATLAB programs are
often translated into compiled languages like C or FORTRAN to enhance performance; these
translations are done either by hand or by automated systems that compile MATLAB code to
C or FORTRAN. MATLAB also ships with a number of toolboxes, and the toolbox of
particular interest to us is the image processing toolbox. Instead of describing all of
MATLAB's capabilities, we restrict ourselves to those aspects concerned with the handling of
images, introducing functions, commands, and techniques as required.
A MATLAB function is a keyword that accepts various parameters and produces an output:
for instance a matrix, a string, a graph, or a figure. Examples of such functions are sin, imread,
and imclose. There are many functions in MATLAB and, as we shall see, it is straightforward
(and sometimes necessary) to write our own. A command is a particular use of a function.
MATLAB is an excellent language for technical computing. It integrates computation,
visualization, and programming in an easy-to-use environment where problems and solutions
are expressed in familiar mathematical notation. Typical uses include [78]:
3.2.3. The MATLAB Language:
This is a high-level matrix/array language with control flow statements, functions, data
structures, input/output, and object-oriented programming features. It permits both
"programming in the small", to rapidly create quick and dirty throw-away programs, and
"programming in the large", to create complete, large, and complex application programs.
(Table of MATLAB optimization functions, e.g. quadprog – quadratic programming.)
a = 42;
This assigns the integer value 42 to the variable a. MATLAB supports variables of a number of
primitive types such as logical, int, real, complex, and string. It is also possible to construct
arrays with elements of these primitive types. A programmer may construct a matrix of
random real elements as follows:
n = 100;
a = rand(n, n);
In the above example, a is a 100×100 matrix of reals, and each element is initialized with a
random value. In MATLAB, all variables are matrices; scalars are just single-element
matrices. A variable does not, however, need to be declared to be of a particular type before it
is used: MATLAB is a weakly, dynamically typed language. It is said to be dynamically typed
because the types of variables are determined only at runtime, and weakly typed because the
type of a variable can change through the course of a program. For example, the following is a
valid MATLAB program.
a = 42;
disp(a); % 'a' is an integer
% More code...
a = 'Hello World';
disp(a); % 'a' is a string
% even more code...
Here the type of a changes from integer to string when a string is assigned to it on line 4. This
is one of the features of MATLAB that make it difficult to statically compile MATLAB code.
MATLAB provides a rich set of operators to operate on matrices. They are overloaded to
perform appropriate actions depending on the size and type of their input operands. Consider the
following code segment
1 x = 10;
2 y = 20;
3 a = rand(100, 100);
4 z = x + y;
5 b = a + a;
6 c = x + a;
The “+” operators on lines 4, 5 and 6 all perform different operations at runtime. The plus on line
4 performs a scalar addition on the variables x and y. The + operator on line 5 however adds two
100×100 matrices. The + operator on line 6 adds the scalar x to each element of the matrix a. All
arithmetic operators in MATLAB are similarly overloaded. The * operator for example performs
the appropriate form of multiplication depending on the sizes of its arguments. For example, if
both arguments are matrices, it performs a matrix multiplication. However, if one is a matrix and
the other is a vector, it performs a matrix-vector multiplication. MATLAB also provides
operators to perform element-wise multiplications and division [79].
a = rand(100, 100);
b = rand(100, 100);
c = a .* b;
d = a * b;
In the above code, each element in c is the product of the corresponding elements of a and b
whereas d is the matrix product of a and b.
The colon also plays a special role in array indexing. It is used to specify all elements of an array
along a particular dimension.
a = ones (10, 10);
a(:, 1) = 42;
Line 2 in the above example assigns the value 42 to every element in the first column of the
10×10 matrix a. MATLAB also provides the keyword end. When used within the indexer for a
particular array, the end keyword represents the index of the last element of that array in that
dimension.
a = ones (10, 10);
a(5:end, 1) = 42;
The last line in the above code segment assigns 42 to elements 5 to 10 of the first column of
the matrix a. Such selection mechanisms are, of course, also valid on the right-hand side of
assignment statements. MATLAB requires the programmer to ensure the compatibility of
sub-array dimensions when they are specified by the mechanisms described above. MATLAB
also does not require the index of an array to be smaller than the length of the array: when an
array is indexed past its end in any dimension, the array simply grows to accommodate the
index. Consider the following example.
Consider the following example.
a = ones (10, 1);  % a is a vector of length 10
a(15) = 42;        % a now has length 15
After line 2 in the above program is executed, the vector a has a length of 15. Elements created
when the array expands are assigned a value of zero. Thus, elements a(11) to a(14) get the value
0, while a(15) gets the value 42.
3.4.4 Libraries:
MATLAB has a wide variety of toolsets that provide users with domain-specific functionality.
For example, the communication toolbox provides functionality required for the design and
simulation of communication systems and the image processing toolbox provides APIs to several
frequently used image processing functions. However, most of these toolboxes are closed source.
The subset of MATLAB supported and the assumptions made by our compiler implementation
are presented below [79].
1) MATLAB supports variables with the primitive types logical, int, real, complex and
string. It is also possible to build arrays of these types with any number of dimensions.
Currently, our compiler supports every primitive type except string and complex. Further,
arrays are restricted to a maximum of three dimensions.
2) MATLAB supports indexing with multi-dimensional arrays. However, our implementation
currently supports indexing only with one-dimensional arrays.
3) In MATLAB, it is possible to change the size of arrays by assigning to elements past their
end. We currently do not support indexing past the end of arrays. Further, in this thesis,
we refer to assignments to arrays through indexed expressions (for example, a(i)) as
indexed assignments or partial assignments.
4) We assume that the MATLAB program to be compiled is a single script with no calls to
user-defined or toolbox functions. Support for user-defined functions can be added by
extending the frontend of the compiler. Also, anonymous functions and function handles
are not currently supported [79].
5) In general, the types and shapes (array sizes) of MATLAB variables are not known until
runtime. Our compiler currently relies on a simple data-flow analysis to extract the sizes
and types of program variables. It also relies on programmer input when types cannot be
determined. We intend to extend our type system to support symbolic type inference in
the future. Ultimately, we envision that the techniques described in this thesis will be used
in both compile-time and run-time systems [79].
Chapter- 4
PROPOSED WORK
Data collected using current technologies represents activities of users in social life that many
assume to be private. To preserve privacy is then to keep this data private, in other words
confidential from the greater public. Not exchanging the data at all would preserve privacy, but
this is inconvenient and usually not desirable. Therefore, a great deal of privacy research in
computer science is concerned with weaker forms of data confidentiality such as anonymity.
Anonymity is achieved by unlinking the user's identity from the traces that her actions leave in
information systems. Anonymity keeps the identification of individuals private, but it is not
concerned with how public those traces become. This is also reflected in data protection
legislation, which by definition does not cover anonymized data. Clustering is a data mining
strategy that has not yet taken its rightful place in this area, although the most important
algorithm of this technique, the k-means algorithm, has been studied extensively with regard to
privacy preservation.
Cluster Analysis (data segmentation) has a variety of goals that relate to grouping or segmenting a
collection of objects (i.e., observations, individuals, cases, or data rows) into subsets or clusters, such that
those within each cluster are more closely related to one another than objects assigned to different
clusters. Central to all of the goals of cluster analysis is the notion of degree of similarity (or
dissimilarity) between the individual objects being clustered. There are two major methods of clustering:
hierarchical clustering and k-means clustering. For information on k-means clustering, refer to the k-
Means Clustering section.
In hierarchical clustering, the data is not partitioned into a particular cluster in a single step. Instead, a
series of partitions takes place, which may run from a single cluster containing all objects to n clusters
that each contain a single object. Hierarchical Clustering is subdivided into agglomerative methods,
which proceed by a series of fusions of the n objects into groups, and divisive methods, which
separate n objects successively into finer groupings. Agglomerative techniques are more
commonly used, and this is the method implemented in XLMiner. Hierarchical clustering may be
represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or
divisions made at each successive stage of analysis. Following is an example of a dendrogram.
In this thesis, the proposed work increases the security of the data by clustering it and then
encrypting/decrypting it. Two proposed algorithms are presented to demonstrate the efficiency
of the proposed work.
A. Proposed Work-I
The proposed algorithm is an attempt to present a new approach for complex encryption and
decryption of data based on parallel programming, in such a way that the new method can make
use of multiple-core processors to achieve higher speed with a better degree of protection.
ALGORITHM-
1. Partition the given dataset using hierarchical clustering.
   Given:
     a set X of objects {x1, ..., xn}
     a distance function dist(c1, c2)
   for i = 1 to n
     Ci = {xi}
   end for
   C = {C1, ..., Cn}
   l = n + 1
   while C.size > 1 do
     (Cmin1, Cmin2) = pair (Ci, Cj) in C with minimum dist(Ci, Cj)
     remove Cmin1 and Cmin2 from C
     add {Cmin1, Cmin2} to C
     l = l + 1
   end while
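The merge loop above can be sketched in Python. The single-linkage inter-cluster distance used here is an illustrative assumption, since the pseudocode leaves dist(Ci, Cj) abstract:

```python
# Naive agglomerative (bottom-up) hierarchical clustering sketch.
# Each object starts in its own cluster; the two closest clusters
# are merged repeatedly until a single cluster remains.

def dist(a, b):
    """Single-linkage distance between two clusters of 1-D points
    (an illustrative choice; any inter-cluster distance works)."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(objects):
    clusters = [frozenset([x]) for x in objects]   # Ci = {xi}
    merges = []                                    # record of fusions
    while len(clusters) > 1:
        # find the pair (Cmin1, Cmin2) at minimum distance
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        c1, c2 = clusters[i], clusters[j]
        # remove Cmin1 and Cmin2 from C, add their union to C
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(c1 | c2)
        merges.append((set(c1), set(c2)))
    return clusters[0], merges

root, merges = agglomerate([1, 2, 10, 11, 50])
print(root)        # the single remaining cluster containing all objects
print(merges[0])   # first fusion: the two closest singleton clusters
```

This naive version recomputes all pairwise distances on every iteration (cubic time); real implementations maintain a distance matrix and update it after each merge.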
FLOW DIAGRAM-
Figure 4.1 Flowchart of Proposed Work-I
Procedure –
Step1 - Consider the dataset for input.
Step2 - Apply anonymization technique to that particular dataset.
Step3 - Hierarchical clustering technique is used to partition the data sets into clusters.
Step4 – DES encryption technique is used to suppress the data values.
Step5 – The final result is obtained by the union of the LHS and RHS values formed by the
anonymization technique.
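As a minimal sketch of the anonymization in Step 2, the following generalizes quasi-identifiers before clustering and encryption. The record layout (age, ZIP, diagnosis) and the generalization rules (age banded to a decade, ZIP truncated) are hypothetical illustrations, not the dataset or rules used in this work:

```python
# Toy generalization of quasi-identifiers: the usual preliminary
# step before the clustering and encryption stages of the pipeline.

def generalize(record):
    """Coarsen quasi-identifiers so individual records become less
    linkable: band the age to a decade, keep only a ZIP prefix."""
    age, zip_code, diagnosis = record
    lo = (age // 10) * 10
    return (f"{lo}-{lo + 9}", zip_code[:3] + "**", diagnosis)

records = [(34, "30216", "flu"), (37, "30214", "cold")]
anonymized = [generalize(r) for r in records]
print(anonymized)
# both records now share the quasi-identifier pair ("30-39", "302**")
```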
B. Proposed Work-II
Advanced Encryption Standard (AES) is a symmetric-key block cipher: the same key is used to
encrypt and decrypt the information by both the sender and receiver. Although the block length
of Rijndael can be 128, 192, or 256 bits, the AES standard adopted only the block length of 128
bits, while the key length can be 128, 192, or 256 bits. The AES algorithm's internal operations
are performed on a two-dimensional array of bytes called the State, where each byte comprises
8 bits. The State consists of 4 rows of bytes, and each row has Nb bytes. Each byte is denoted by
Si,j (0 ≤ i < 4, 0 ≤ j < Nb). Since the block length is 128 bits, each row of the State contains
Nb = 128/(4 × 8) = 4 bytes.
The four bytes in each column of the State array form a 32-bit word, with the row number as the
index for the four bytes in each word. At the start of encryption or decryption, the array of input
bytes is mapped to the State array; a 128-bit block can be expressed as 16 bytes: in0, in1, in2,
..., in15. Encryption/decryption is performed on the State, at the end of which the final value is
mapped to the output byte array out0, out1, out2, ..., out15. The key of the AES algorithm can
likewise be mapped to 4 rows of bytes, except that the number of bytes in each row, denoted by
Nk, can be 4, 6, or 8 when the length of the key, K, is 128, 192, or 256 bits, respectively. The
AES algorithm is iterative; each iteration is called a round. The total number of rounds, Nr, is 10
when Nk = 4, 12 when Nk = 6, and 14 when Nk = 8.
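The State mapping and round counts described above can be checked with a few lines of Python, following the standard AES definitions (the State is filled column-major, s[r][c] = in[r + 4c], and Nr is fixed by Nk):

```python
# Map a 16-byte AES input block to the 4 x Nb State array
# (column-major: s[r][c] = in[r + 4*c], with Nb = 4 for AES).

def to_state(block):
    assert len(block) == 16
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]

# Number of rounds Nr as a function of key words Nk (4, 6 or 8),
# i.e. AES-128, AES-192 and AES-256 respectively.
ROUNDS = {4: 10, 6: 12, 8: 14}

block = list(range(16))   # in0 .. in15
state = to_state(block)
print(state[0])           # first row: in0, in4, in8, in12
print(ROUNDS[4], ROUNDS[6], ROUNDS[8])
```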
In the initial step, we fetch the data from the database on which the operations are to be
performed. The overall dataset fetched from the database is divided into groups known as
clusters. For the clustering process, hierarchical clustering is used: items are grouped by
computing the distances between them, and the two clusters at minimum distance are merged.
A new, larger cluster is formed and the distances to the remaining items are updated. The data
is then encrypted using the AES algorithm, which involves several operations for key
generation, after which the optimal results are obtained.
Proposed Algorithm:
Step:1 Start
Step:2 Input dataset from the database
Step:3 Apply Hierarchical clustering
a. Compute the distances between data items
b. Put items into clusters
c. If the distance between two clusters is minimal
d. Then merge both clusters
e. Update the distances
Step:4 Apply AES Algorithm over the output
a. Key selection
b. Generation of multiple keys
c. Encryption
d. Decryption
Step:5 Get optimal result
Step:6 Exit
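A compact sketch of Steps 3-4 follows. The gap-based grouping is a deliberate simplification of the full hierarchical procedure, and a SHA-256-derived XOR keystream stands in for the real AES cipher purely to keep the example self-contained and runnable; it is not secure and not the cipher proposed in this work:

```python
import hashlib

def cluster(values, threshold=5):
    """Group sorted values whenever the gap to the previous value is
    within `threshold` (a simplification of the Step 3 merge loop)."""
    vals = sorted(values)
    clusters, current = [], [vals[0]]
    for v in vals[1:]:
        if v - current[-1] <= threshold:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)
    return clusters

def xor_stream(data, key):
    """Placeholder cipher: XOR with a SHA-256-derived keystream.
    Stands in for AES (Step 4) only to keep the sketch runnable."""
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(b ^ s for b, s in zip(data, stream))

values = [1, 3, 20, 22, 24, 90]
groups = cluster(values)
key = b"demo-key"
ciphertexts = [xor_stream(repr(g).encode(), key) for g in groups]
recovered = [xor_stream(c, key).decode() for c in ciphertexts]
print(groups)      # [[1, 3], [20, 22, 24], [90]]
print(recovered)   # decrypting round-trips each cluster's data
```

Because XOR with the same keystream is its own inverse, applying `xor_stream` twice recovers the plaintext, mirroring the symmetric encrypt/decrypt steps of the proposed flow.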
Figure 4.2: Flowchart of Proposed Work-II
Chapter- 5
RESULT ANALYSIS
The simulation of the proposed work was carried out in MATLAB 2013. Two graphs are
presented below, showing that the proposed technique has better accuracy and a lower error
rate. A time graph is also presented, showing lower elapsed time than the existing techniques.
Table 5.1: Accuracy of the base method results and the proposed method results
The above graph represents the analysis of accuracy for the base and proposed method results
and shows that the proposed method is better at preserving privacy.
Table 5.2: Error rate of the base method results and the proposed method results
Figure 5.2: Error rate of the base and proposed approaches
The above graph represents the analysis of error rate for the base and proposed method results
and shows that the proposed method is better at preserving privacy, as the error rate of the
proposed method is lower than that of the base method.
5.2 THE RESULT OF PROPOSED WORK-II:
Table 5.3: Comparison of elapsed time between the base and proposed techniques
[Figure: Elapsed time in seconds versus number of records (100-500) for the base and proposed
techniques]
Figure 5.3: Elapsed time of the base and proposed approaches
[Figure: Accuracy versus number of records (100-500) for the base and proposed methods]
[Figure: Error rate versus number of rounds (100-500) for the base and proposed methods]
Figure 5.5: Error rate of the base and proposed approaches
Chapter- 6
CONCLUSION
Data Mining deals with the automatic extraction of previously unknown patterns from huge
quantities of data. These datasets usually include sensitive individual information or significant
business information, which consequently gets exposed to other parties during Data Mining
activities. This creates an obstruction to the Data Mining process. A solution to this problem is
provided by privacy-preserving data mining (PPDM). Privacy preservation for data analysis is a
challenging research problem because of the increasingly large volumes of data involved, and it
therefore requires in-depth research. Each privacy-preserving technique has its own importance.
PPDM is a dedicated set of Data Mining activities in which techniques are developed to protect
the privacy of the data so that the knowledge discovery process can be carried out without
barriers. The principle of PPDM is to prevent sensitive details from leaking during the mining
process while still producing precise Data Mining results. Data encryption and anonymization
are widely adopted approaches to combat privacy breaches. Nonetheless, encryption alone is
not well suited for data that must be processed and shared, and anonymizing huge datasets and
working with anonymized data remain challenges for classic anonymization processes.
Privacy-preserving data mining responds to two critical needs: data analysis that delivers better
services, and ensuring the privacy rights of the data owners. Substantial efforts have been made
to address these needs. The results of our proposed work show that by performing hierarchical
clustering and encrypting the data using the DES method we can achieve stronger preservation
of privacy. The goal of this thesis is to discuss clustering, with the introduction of hierarchical
clustering and the AES algorithm as privacy-preserving techniques that are helpful in mining
large amounts of data with reasonable efficiency and security.
REFERENCES
[5] Charu C. Aggarwal, “A General survey of privacy preserving Data Mining Models and
Algorithms”, IBM,T. J. Watson Research Centre.
[6] B.Vani, D.Jayanthi, “Efficient Approach for Privacy Preserving Micro data Publishing Using
Slicing”, IJRCTT, 2013.
[7] Tiancheng Li , Jian Zhang , Ian Molloy,“Slicing: A New Approach for Privacy Preserving
Data Publishing”, IEEE Transaction on KDD, 2012.
[8] S.V. Vassilios , B. Elisa, N.F. Igor, P.P. Loredana, S. Yucel and T. Yannis, “State of the Art
in Privacy Preserving Data Mining”, Published in SIGMOD Record, 33, pp: 50-57, 2004.
[9] Helger Lipmaa, “Cryptographic Techniques in Privacy Preserving Data Mining”, University
College London, Estonian Tutorial, 2007.
[10] D. Agrawal and C. Agarwal, “On the Design and Quantification of Privacy Preserving Data
Mining Algorithms”, PODS, pp: 247-255, 2001.
[11]. Majid BM, Asger GM, Rashid Ali, “Privacy Preserving Data Mining Techniques: Current
Scenario and Future Prospects”, Proceedings of 3rd ICCCT, India, 26-32, 2012.
[12]. Kamakshi P, Vinaya BA, “Preserving Privacy and Sharing the Data in Distributed
Environment using Cryptographic on Perturbed data” Journal of Computing, April; 2(4), 115-
119, 2010.
[13]. Benny P, “Cryptographic Techniques for Privacy-preserving data mining”, ACM SIGKDD
Explorations, December; 4(2), 12- 19, 2008.
[14] Alpa K. Shah, Ravi Gulati, “Contemporary Trends in Privacy Preserving Collaborative Data
Mining– A Survey”, Proceedings in IEEE International Conference on Electrical, Electronics,
Signals, Communication and Optimization (EESCO), 2015.
[15] Alpa K. Shah, Ravi Gulati, “Privacy, Collaboration and Security – Imperative Existence in
Data Mining” VNSGU Journal of Science and Technology Vol 4 ,No 1, Pg. 44-49, July 2015.
[16] Jisha Jose Panackal, Dr. Anitha S. Pillai, “Privacy Preserving Data Mining: An Extensive
Survey”, in Proceedings of the Int. Conf. on Multimedia Processing, Communication and Info.
Tech., MPCIT, 2013.
[17] R. Agrawal and R. Srikant. “Privacy Preserving Data Mining”,ACM SIGMOD Conference
on Management of Data, pp: 439-450, 2000.
[18] Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Journal of Cryptology, 15(3),
pp.36-54, 2000.
[19] Aris Gkoulalas-Divanis and Vassilios S. Verikios, “An Overview of Privacy Preserving
Data Mining”, Published by The ACM Student Magazine, 2010.
[21] Elisa, B., N.F. Igor and P.P. Loredana. “A Framework for Evaluating Privacy Preserving
Data Mining Algorithms”, Published by Data Mining Knowledge Discovery, pp.121-154, 2005.
[22] Andreas Prodromidis, Philip Chan, and Salvatore Stolfo, : “Metalearning in distributed data
mining systems: Issues and approaches”. In “Advances in Distributed and Parallel Knowledge
Discovery”, AAAI/MIT Press, September 2000.
[23] S.V. Vassilios , B. Elisa, N.F. Igor, P.P. Loredana, S. Yucel and T. Yannis, 2004, “State of
the Art in Privacy Preserving Data Mining” Published in SIGMOD Record, 33, pp: 50-57, 2004.
[24] Wang P, "Survey on Privacy preserving data mining", International Journal of Digital
Content Technology and its Applications, Vol. 4, No. 9, 2011.
[25] Dharmendra Thakur and Prof. Hitesh Gupta,” An Exemplary Study of Privacy Preserving
Association Rule Mining Techniques”, P.C.S.T., BHOPAL C.S Dept, P.C.S.T., BHOPAL India,
International Journal of Advanced Research in Computer Science and Software Engineering
,vol.3 issue 11, 2013.
[26] C.V.Nithya and A.Jeyasree, ”Privacy Preserving Using Direct and Indirect Discrimination
Rule Method”, Vivekanandha College of Technology for Women Namakkal India, International
Journal of Advanced Research in Computer Science and Software Engineering ,vol.3 issue 12,
2013.
[28] Gayatri Nayak, Swagatika Devi, "A survey on Privacy Preserving Data Mining:
Approaches and Techniques", International Journal of Engineering Science and Technology,
Vol. 3 No. 3, 2127-2133, 2011.
[29] Agrawal D., and Aggarwal C.C, “On the Design and Quantification of Privacy Preserving
Data Mining Algorithms”, Proceedings of the 20th ACM Symposium on Principles of Database
Systems, pp. 247-255, 2007.
[30] Agrawal, R., and Srikant , “Privacy Preserving Data Mining”, Proceedings of the 19th ACM
International Conference on Knowledge Discovery and Data Mining, Canada, pp. 439-450,
2007.
[32] Bertino E., Nai Fovino and Parasiliti Provenza, “A Framework for Evaluating Privacy
Preserving Data Mining Algorithm”, Journal of Data Mining and Knowledge Discovery, pp. 78-
87, 2005.
[33] Bikramjit Saikia and Debkumar Bhowmik , “Study of Association Rule Mining and
different hiding Techniques”, PhD thesis, Department of computer Science Engineering,
National Institute of Technology, pp.55-63, 2009.
[35] Gurjeevan Singh, Ashwani Kumar Singla, K. S. Sandha, “Throughput Analysis Of Various
Encryption Algorithms”, IJCST, Vol. 2, Issue 3, September 2011.
[36] Ramesh, A. et al., “Performance analysis of encryption algorithms for Information
Security”, Circuits, Power and Computing Technologies (ICCPCT), pp. 840-844, March 2013.
[37] Shashi Mehrotra Seth, Rajan Mishra, “Comparative Analysis Of Encryption Algorithms For
Data Communication”, IJCST, Vol. 2, Issue 2, pp. 192-192, June 2011.
[38] Agarwal, R., Dafouti, D., Tyagi, S., “Performance analysis of data encryption algorithms”,
Electronics Computer Technology (ICECT), 2011 3rd International Conference, Vol. 5, pp.
399-403, April 2011.
[39] Mr. Gurjeevan Singh, Mr. Ashwani Singla and Mr. K S Sandha, "Cryptography Algorithm
Comparison for Security Enhancement in Wireless Intrusion Detection System", International
Journal of Multidisciplinary Research, Vol.1 Issue 4, pp. 143-151, August 2011.
[40] Akash Kumar Mandal, Chandra Parakash and Mrs. Archana Tiwari, “Performance
Evaluation of Cryptographic Algorithms: DES and AES”, Conference on Electrical, Electronics
and Computer Science, pp. 1-5, 2012.
[41] Tamanna Kachwala, Sweta Parmar, “An Approach for Preserving Privacy in Data Mining”,
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 4, Issue 9, ISSN: 2277-128X, September 2014.
[42] Kun Liu, Hillol Kargupta, and Jessica Ryan, “Random Projection-Based Multiplicative
Data Perturbation for Privacy Preserving Distributed Data Mining”, IEEE Transactions on
Knowledge and Data Engineering, vol. 18, no. 1, pp. 92-106, January 2006.
[43] Jaideep Vaidya, Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi, “A Random
Decision Tree Framework for Privacy-preserving Data Mining”, Journal of LaTeX Class Files,
vol. 6, no. 1, pp. 1-14, January 2007.
[44]. Chun-Wei Lin, Tzung-Pei Hong, Chia-Ching Chang, and Shyue-Liang Wang “A Greedy-
based Approach for Hiding Sensitive Itemsets by Transaction Insertion”, Journal of Information
Hiding and Multimedia Signal Processing, Volume 4, Number 4, October 2013.
[46] Chun-Wei Lin, Tzung-Pei Hong, Kuo-Tung Yang, Leon Shyue-Liang Wang, “The GA-based
algorithms for optimizing hiding sensitive itemsets through transaction deletion”, Springer
Science+Business Media New York, 2014.
[47]. S. Lohiya and L. Ragha, “Privacy Preserving in Data Mining Using Hybrid Approach”, in
proceedings of 2012 Fourth International Conference on Computational Intelligence and
Communication Networks, IEEE 2012.
[48] Yu Zhu and Lei Liu, “Optimal Randomization for Privacy Preserving Data Mining”, ACM,
August 2004.
[50] Yehuda Lindell, Benny Pinkas, “Secure Multiparty Computation for Privacy-Preserving
Data Mining”, IACR Cryptology ePrint Archive 2008: 197, 2008.
[51] U. Maurer, “Secure multi-party computation made simple,” in Proc. 3rd Int. Conf. Security
in Communication Networks (SCN’02), Berlin, Heidelberg, pp. 14–28, Springer-Verlag, , 2003.
[52] Aris Gkoulalas-Divanis, & Grigorios Loukides, “Revisiting Sequential Pattern Hiding to
Enhance Utility”, ACM, August 2011.
[53] Amruta Mhatre, Durga Toshniwal, “Hiding Co-occurring Sensitive Patterns in Progressive
Databases”, ACM, March 22, 2010.
[54] Shikha Sharma & Pooja Jain, “A Novel Data Mining Approach for Information Hiding”,
International Journal of Computers and Distributed Systems, Vol. No.1, Issue 3, October 2012.
[55] Shaofei Wu and Hui Wang, “Research On The Privacy Preserving Algorithm Of Association
Rule Mining In Centralized Database”, IEEE International Symposiums on Information
Processing, 2008.
[56] Chirag N. Modi, Udai Pratap Rao and Dhiren R. Patel, “An Efficient Approach for
Preventing Disclosure of Sensitive Association Rules in Databases”, International Conference
on Advances in Communication, Network, and Computing, IEEE, 2010.
[57] S.R.M. Oliveira, O.R. Zaıane, Y. Saygin, “Secure association rule sharing, advances in
knowledge discovery and data mining”, in Proceedings of the 8th Pacific-Asia Conference
(PAKDD2004), Sydney, Australia, pp.74–85, 2004.
[59] Ali Inan, Yucel Saygin, Erkay Savas, Ayca Azgin Hintoglu and Albert Levi, “Privacy
preserving clustering on horizontally partitioned data”, 2006.
[60] E. Poovammal, M. Ponnavaikko, “Privacy and Utility Preserving Task Independent Data
Mining”, International Journal of Computer Applications, Vol. 1, No. 15, pp. 104-111.
[61] Patrick Sharkey, Hongwei Tian, Weining Zhang, and Shouhuai Xu, "Privacy-Preserving
Data Mining through Knowledge Model Sharing", 2012.
[62] Nafees Qamar, Yilong Yang, Andras Nadas and Zhiming Liu, “Querying medical datasets
while preserving privacy”, The 6th International Conference on Current and Future Trends of
Information and Communication Technologies in Healthcare (ICTH 2016), Procedia Computer
Science 98, pp. 324-331, 2016.
[63] Alexandre Evfimievski, Tyrone Grandison, “Privacy Preserving Data Mining” USA.
[66] Yehuda Lindell, Benny Pinkas, “Privacy Preserving Data Mining”, 2009.
[67] Shahejad Khan, Tejas Gorhe, Ramesh Vig and Prof. Bharati A. Patil, “Enabling Multi-level
Trust in Privacy Preserving Data Mining”, 2015 International Conference on Green Computing
and Internet of Things (ICGCIoT), IEEE, pp. 1369-1372, 2015.
[68] Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita, “A Review on Privacy
Preserving Data Mining: Techniques and Research Challenges”, (IJCSIT) International Journal
of Computer Science and Information Technologies, Vol. 5 (2), 2310-2315, 2014.
[69] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, Michael Y. Zhu, “Tools for Privacy
Preserving Distributed Data Mining”, Volume 4, Issue 2 - page 1
[70] Yuna Oh, Kangsoo Jung and Seog Park,” A privacy preserving technique to prevent
sensitive behavior exposure in semantic location-based service”, 18th International Conference
on Knowledge-Based and Intelligent Information & Engineering Systems - KES2014, Procedia
Computer Science 35, pp 318 – 327, 2014.
[72] Nissim Matatov, Lior Rokach, Oded Maimon, "Privacy-preserving data mining: A feature
set partitioning approach", Information Sciences, Volume 180, Issue 14, Pages 2696-2720, 15
July 2010.
[73] Seung-Woo Kim, Sanghyun Park, Jung-Im Won, Sang-Wook Kim, "Privacy preserving data
mining of sequential patterns for network traffic data", Information Sciences Volume 178, Issue
3, Pages 694-713, 1 February 2008.
[74] K.Srinivasa Rao, V.Chiranjeevi, "Distortion Based Algorithms For Privacy Preserving
Frequent Item Set Mining", International Journal of Data Mining & Knowledge Management
Process (IJDKP) Vol.1, No.4, July 2011.
[75] L. Sweeney, “k-anonymity: A model for protecting privacy”, Int. J. Uncertainty, Fuzziness
and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.
[77]Chun, Ji Young, Dong Hoon Lee, and Ik Rae Jeong. "Privacy-preserving range set union for
rare cases in healthcare data", IET Communications 6, no. 18 (2012): 3288-3293.
[78] Charu C. Aggarwal, Stephen C. Gates and Philip S. Yu, “On the merits of building
categorization systems by supervised clustering”, Proceedings of the fifth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, Pages 352 – 356, 1999.
[79] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, “ROCK: A Robust Clustering Algorithm
for Categorical Attributes”, in Proceedings of the 15th International Conference on Data
Engineering, 1999.