
Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

Abstract:
Data clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Sampling has been recognized as an important technique for improving the efficiency of clustering. However, when sampling is applied, the points that are not sampled do not receive cluster labels after the normal clustering process. Although there is a straightforward approach in the numerical domain, the problem of how to allocate those unlabeled data points into proper clusters remains a challenging issue in the categorical domain. In this paper, a mechanism named Maximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on a novel categorical clustering representative, namely, the N-Nodeset Importance Representative (abbreviated as NNIR), which represents clusters by the importance of the combinations of attribute values. MARDL has two advantages:
1) MARDL exhibits high execution efficiency.
2) MARDL can achieve high intracluster similarity and low intercluster similarity, which are regarded as the most important properties of clusters, thus benefiting the analysis of cluster behaviors.
MARDL is empirically validated on real and synthetic data sets and is shown to be significantly more efficient than prior works while attaining clustering results of high quality.
Introduction
Data clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. The problem of clustering is defined as follows: given a set of data objects, partition the data objects into groups in such a way that objects in the same group are similar while objects in different groups are dissimilar according to a predefined similarity measurement. Clustering analysis can therefore help us gain insight into the distribution of data. However, a difficult problem with learning in many real-world domains is that the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features. In other words, the concepts that we try to learn from the data drift with time. For example, the buying preferences of customers may change with time, depending on the current day of the week, the availability of alternatives, the discounting rate, etc. As the concepts behind the data evolve with time, the underlying clusters may also change considerably. Performing clustering on the entire time-evolving data set not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results.

The problem of clustering time-evolving data in the numerical domain has been explored in previous works. However, this problem has not been widely discussed in the categorical domain, with the exception of Web log transactions. Categorical attributes also prevalently exist in real data with drifting concepts. For example, buying records of customers, Web logs that record the browsing history of users, and Web documents often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. Therefore, the problem of clustering time-evolving data in the categorical domain remains a challenging issue. As a result, a framework for performing clustering on categorical time-evolving data is proposed in this paper. Instead of designing a specific clustering algorithm, we propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data. Fig. 1 shows our entire framework for performing clustering on categorical time-evolving data. In order to detect drifting concepts, the sliding window technique is adopted. Sliding windows conveniently eliminate outdated records, and the sliding window technique has been utilized in several previous works on clustering time-evolving data in the numerical domain.
Therefore, based on the sliding window technique, we can test whether the characteristics of clusters in the latest data points in the current window are similar to the last clustering result. In a previous work in the numerical domain, a strategy similar to our framework, which utilizes clustering results to analyze drifting concepts, is proposed. There, an online clustering algorithm is performed on each time frame, and several numerical characteristics, such as the mean and standard deviation of cluster centers, are used to represent clustering results and detect the changes between time frames. After a change is detected, an offline voting-based classification algorithm is performed to associate each change with its corresponding event. However, in the categorical domain, the above procedure is infeasible because the numerical characteristics of clusters are difficult to define. Therefore, to capture the characteristics of clusters, an effective cluster representative that summarizes the clustering results is required. In this paper, a practical categorical clustering representative, named the "Node Importance Representative" (abbreviated as NIR), is utilized. NIR represents clusters by measuring the importance of each attribute value in the clusters. Based on NIR, we propose the "Drifting Concept Detection" (abbreviated as DCD) algorithm. In DCD, the incoming categorical data points in the present sliding window are first allocated into the corresponding proper clusters of the last clustering result, and the number of outliers that cannot be assigned to any cluster is counted. After that, the distributions of clusters and outliers in the last clustering result and the current temporal clustering result are compared with each other. If the distribution has changed (exceeding certain criteria), the concepts are said to drift. In a concept-drifting window, the data points are reclustered, and the last clustering representative is dumped out. On the contrary, if the concept is steady, the clustering representative (NIR) is updated. Moreover, the framework presented in this paper not only detects drifting concepts in categorical data but also explains them by analyzing the relationship between clustering results at different times. The analyzing algorithm is named "Cluster Relationship Analysis" (CRA). When a drifting concept is detected by DCD, the last clustering representative is dumped out. Therefore, each recorded clustering representative represents a successive stable clustering result, and different dumped-out representatives describe different concepts in the data set. By analyzing the relationship between clustering results, we may capture the time-evolving trend that explains why the clustering results change in the data set.
Scope of the project:
Our contributions in this paper can be summarized as follows:

 A generalized framework for performing clustering on categorical time-evolving data is proposed in this paper. In particular, this framework is independent of clustering algorithms, and any categorical clustering algorithm can be utilized in it.
 Based on the sliding window technique and the categorical clustering representative NIR, the algorithm DCD is proposed in this paper. In DCD, the distributions of clustering results are tested to determine whether the concepts drift.
 The CRA algorithm, which analyzes the relationships between clustering results generated by different concepts, is also presented in this paper. We may capture the time-evolving trend in the data set by analyzing the evolving clustering results.

Module Description:

Collection of Data (Data Set):

This is the collection of data, extracted from the database, that we are going to cluster. The data from the database is time-evolving categorical data (i.e., it is not organized in a sequential manner).

Initializing the Sliding Window:

The sliding window is used to form a subset of data of a specified size from the data set (i.e., a collection of data from the database) and transfer it to the next module.
Formation of Clusters:
The result of the sliding window is clustered data, which contains groups of data subsets with a common relationship.

Cluster Distributions Comparator:

The Drifting Concept Detection (DCD) algorithm detects the difference in cluster distribution between the current data subset and the last clustering result.

Re-Clustering and Outliers:

If the difference between the cluster distributions is large enough, the sliding window is considered a concept-drifting window, and re-clustering is performed. A data point that does not belong to any proper cluster is called an outlier.

Literature Survey:
Data mining:
Data mining is the process of extracting patterns from data. As more
data are gathered, with the amount of data doubling every three years,[1] data
mining is becoming an increasingly important tool to transform these data
into information. It is commonly used in a wide range of profiling practices,
such as marketing, surveillance, fraud detection and scientific discovery.

While data mining can be used to uncover patterns in data samples, it is


important to be aware that the use of non-representative samples of data may
produce results that are not indicative of the domain. Similarly, data mining
will not find patterns that may be present in the domain, if those patterns are
not present in the sample being "mined". There is a tendency for
insufficiently knowledgeable "consumers" of the results to attribute "magical
abilities" to data mining, treating the technique as a sort of all-seeing crystal
ball. Like any other tool, it only functions in conjunction with the
appropriate raw material: in this case, indicative and representative data that
the user must first collect. Further, the discovery of a particular pattern in a
particular set of data does not necessarily mean that pattern is representative
of the whole population from which that data was drawn. Hence, an
important part of the process is the verification and validation of patterns on
other samples of data.

The term data mining has also been used in a related but negative
sense, to mean the deliberate searching for apparent but not necessarily
representative patterns in large numbers of data. To avoid confusion with the
other sense, the terms data dredging and data snooping are often used. Note,
however, that dredging and snooping can be (and sometimes are) used as
exploratory tools when developing and clarifying hypotheses.

Humans have been "manually" extracting patterns from data for


centuries, but the increasing volume of data in modern times has called for
more automated approaches. Early methods of identifying patterns in data
include Bayes' theorem (1700s) and Regression analysis (1800s). The
proliferation, ubiquity and increasing power of computer technology have
increased data collection and storage. As data sets have grown in size and
complexity, direct hands-on data analysis has increasingly been augmented
with indirect, automatic data processing. This has been aided by other
discoveries in computer science, such as Neural networks, Clustering,
Genetic algorithms (1950s), Decision trees (1960s) and Support vector
machines (1980s). Data mining is the process of applying these methods to
data with the intention of uncovering hidden patterns.[2] It has been used for
many years by businesses, scientists and governments to sift through
volumes of data such as airline passenger trip records, census data and
supermarket scanner data to produce market research reports. (Note,
however, that reporting is not always considered to be data mining).

A primary reason for using data mining is to assist in the analysis of


collections of observations of behaviour. Such data are vulnerable to
collinearity because of unknown interrelations. An unavoidable fact of data
mining is that the (sub-)set(s) of data being analysed may not be
representative of the whole domain, and therefore may not contain examples
of certain critical relationships and behaviours that exist across other parts of
the domain. To address this sort of issue, the analysis may be augmented
using experiment-based and other approaches, such as Choice Modelling for
human-generated data. In these situations, inherent correlations can be either
controlled for, or removed altogether, during the construction of the
experimental design.

There have been some efforts to define standards for data mining, for
example the 1999 European Cross Industry Standard Process for Data
Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM
1.0). These are evolving standards; later versions of these standards are
under development. Independent of these standardization efforts, freely
available open-source software systems like RapidMiner, Weka, KNIME,
and the R Project have become an informal standard for defining data-
mining processes. Most of these systems are able to import and export
models in PMML (Predictive Model Markup Language) which provides a
standard way to represent data mining models so that these can be shared
between different statistical applications. PMML is an XML-based language
developed by the Data Mining Group (DMG)[3], an independent group
composed of many data mining companies. PMML version 4.0 was released
in June 2009.[3][4][5]

Research and evolution

In addition to industry driven demand for standards and


interoperability, professional and academic activity have also made
considerable contributions to the evolution and rigour of the methods and
models; an article published in a 2008 issue of the International Journal of
Information Technology and Decision Making summarises the results of a
literature survey which traces and analyses this evolution.[6]

The premier professional body in the field is the Association for


Computing Machinery's Special Interest Group on Knowledge Discovery
and Data Mining (SIGKDD). Since 1989 they have hosted an
annual international conference and published its proceedings,[7] and since
1999 have published a biannual academic journal titled "SIGKDD
Explorations".[8] Other Computer Science conferences on data mining
include:

• DMIN - International Conference on Data Mining;[9]


• DMKD - Research Issues on Data Mining and Knowledge Discovery;
• ECML-PKDD - European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases;
• ICDM - IEEE International Conference on Data Mining;[10]
• MLDM - Machine Learning and Data Mining in Pattern Recognition;
• SDM - SIAM International Conference on Data Mining

Knowledge Discovery in Databases (KDD) is the name coined by


Gregory Piatetsky-Shapiro in 1989 to describe the process of finding
interesting, interpreted, useful and novel data. There are many nuances to
this process, but roughly the steps are to preprocess raw data, mine the data,
and interpret the results.[11]

Pre-processing

Once the objective for the KDD process is known, a target data set
must be assembled. As data mining can only uncover patterns already
present in the data, the target dataset must be large enough to contain these
patterns while remaining concise enough to be mined in an acceptable
timeframe. A common source for data is a datamart or data warehouse.

The target set is then cleaned. Cleaning removes the observations with
noise and missing data. The clean data is reduced into feature vectors, one
vector per observation. A feature vector is a summarised version of the raw
data observation. For example, a black and white image of a face which is
100px by 100px would contain 10,000 bits of raw data. This might be turned
into a feature vector by locating the eyes and mouth in the image. Doing so
would reduce the data for each vector from 10,000 bits to three codes for the
locations, dramatically reducing the size of the dataset to be mined, and
hence reducing the processing effort. The feature(s) selected will depend on
what the objective(s) is/are; obviously, selecting the "right" feature(s) is
fundamental to successful data mining.

The feature vectors are divided into two sets, the "training set" and the
"test set". The training set is used to "train" the data mining algorithm(s),
while the test set is used to verify the accuracy of any patterns found.
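As a simple illustration of this split, a minimal Java sketch (the class name and the hold-out fraction are illustrative choices, not prescribed by the text):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TrainTestSplit {

    /** Shuffles the feature vectors and holds out the given fraction as the test set. */
    static <T> List<List<T>> split(List<T> vectors, double testFraction) {
        List<T> copy = new ArrayList<>(vectors);
        Collections.shuffle(copy);
        int testSize = (int) Math.round(copy.size() * testFraction);
        List<T> test = new ArrayList<>(copy.subList(0, testSize));
        List<T> train = new ArrayList<>(copy.subList(testSize, copy.size()));
        return Arrays.asList(train, test); // element 0: training set, element 1: test set
    }
}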

Data mining

Data mining commonly involves four classes of task:[11]


• Classification - Arranges the data into predefined groups. For example,
an email program might attempt to classify an email as legitimate or
spam. Common algorithms include Decision Tree Learning, Nearest
neighbor, naive Bayesian classification and Neural network.
• Clustering - Is like classification but the groups are not predefined, so
the algorithm will try to group similar items together.
• Regression - Attempts to find a function which models the data with
the least error.
• Association rule learning - Searches for relationships between
variables. For example a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket
can determine which products are frequently bought together and use
this information for marketing purposes. This is sometimes referred to
as "market basket analysis".

Results validation

The final step of knowledge discovery from data is to verify that the
patterns produced by the data mining algorithms occur in the wider data set.
Not all patterns found by the data mining algorithms are necessarily valid. It
is common for the data mining algorithms to find patterns in the training set
which are not present in the general data set; this is called overfitting. To
overcome this, the evaluation uses a test set of data which the data mining
algorithm was not trained on. The learnt patterns are applied to this test set,
and the resulting output is compared to the desired output. For example, a
data mining algorithm trying to distinguish spam from legitimate emails
would be trained on a training set of sample emails. Once trained, the learnt
patterns would be applied to the test set of emails which it had not been
trained on, and the accuracy of these patterns can then be measured by how
many emails they correctly classify. A number of statistical methods may be
used to evaluate the algorithm, such as ROC curves.

If the learnt patterns do not meet the desired standards, then it is necessary to
reevaluate and change the preprocessing and data mining. If the learnt
patterns do meet the desired standards then the final step is to interpret the
learnt patterns and turn them into knowledge.

Notable uses

Games
Since the early 1960s, with the availability of oracles for certain
combinatorial games, also called tablebases (e.g. for 3x3-chess) with any
beginning configuration, small-board dots-and-boxes, small-board hex, and
certain endgames in chess, dots-and-boxes, and hex, a new area for data
mining has been opened up. This is the extraction of human-usable strategies
from these oracles. Current pattern recognition approaches do not seem to
fully have the required high level of abstraction in order to be applied
successfully. Instead, extensive experimentation with the tablebases,
combined with an intensive study of tablebase-answers to well designed
problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is
used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John
Nunn in chess endgames are notable examples of researchers doing this
work, though they were not and are not involved in tablebase generation.

Business

Data mining in customer relationship management applications can


contribute significantly to the bottom line. Rather than randomly
contacting a prospect or customer through a call center or sending mail, a
company can concentrate its efforts on prospects that are predicted to have a
high likelihood of responding to an offer. More sophisticated methods may
be used to optimise resources across campaigns so that one may predict
which channel and which offer an individual is most likely to respond to —
across all potential offers. Additionally, sophisticated applications could be
used to automate the mailing. Once the results from data mining (potential
prospect/customer and channel/offer) are determined, this "sophisticated
application" can either automatically send an e-mail or regular mail. Finally,
in cases where many people will take an action without an offer, uplift
modeling can be used to determine which people will have the greatest
increase in responding if given an offer. Data clustering can also be used to
automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but also
they recognise that the number of predictive models can quickly become
very large. Rather than one model to predict which customers will churn, a
business could build a separate model for each region and customer type.
Then instead of sending an offer to all people that are likely to churn, it may
only want to send offers to customers that will likely take the offer. And
finally, it may also want to determine which customers are going to be
profitable over a window of time and only send the offers to those that are
likely to be profitable. In order to maintain this quantity of models, they
need to manage model versions and move to automated data mining.

Data mining can also be helpful to human-resources departments in


identifying the characteristics of their most successful employees.
Information obtained, such as universities attended by highly successful
employees, can help HR focus recruiting efforts accordingly. Additionally,
Strategic Enterprise Management applications help a company translate
corporate-level goals, such as profit and margin share targets, into
operational decisions, such as production plans and workforce levels.[12]

Another example of data mining, often called the market basket


analysis, relates to its use in retail sales. If a clothing store records the
purchases of customers, a data-mining system could identify those
customers who favour silk shirts over cotton ones. Although some
explanations of relationships may be difficult, taking advantage of them is
easier. The example deals with association rules within transaction-based
data. Not all data are transaction based and logical or inexact rules may also
be present within a database. In a manufacturing application, an inexact rule
may state that 73% of products which have a specific defect or problem will
develop a secondary problem within the next six months.

Market basket analysis has also been used to identify the purchase
patterns of the Alpha consumer. Alpha Consumers are people that play a key
role in connecting with the concept behind a product, then adopting that
product, and finally validating it for the rest of society. Analyzing the data
collected on this type of user has allowed companies to predict future
buying trends and forecast supply demands.

Data Mining is a highly effective tool in the catalog marketing


industry. Catalogers have a rich history of customer transactions on millions
of customers dating back several years. Data mining tools can identify
patterns among customers and help identify the most likely customers to
respond to upcoming mailing campaigns.

Related to an integrated-circuit production line, an example of data


mining is described in the paper "Mining IC Test Data to Optimize VLSI
Testing."[13] In this paper the application of data mining and decision
analysis to the problem of die-level functional test is described. Experiments
mentioned in this paper demonstrate the ability of applying a system of
mining historical die-test data to create a probabilistic model of patterns of
die failure which are then utilised to decide in real time which die to test
next and when to stop testing. This system has been shown, based on
experiments with historical test data, to have the potential to improve profits
on mature IC products.

Science and engineering

In recent years, data mining has been widely used in areas of science
and engineering, such as bioinformatics, genetics, medicine, education and
electrical power engineering.

In the study of human genetics, an important goal is to understand
the mapping relationship between the inter-individual variation in
human DNA sequences and variability in disease susceptibility. In lay terms,
it is to find out how the changes in an individual's DNA sequence affect the
risk of developing common diseases such as cancer. This is very important
to help improve the diagnosis, prevention and treatment of the diseases. The
data mining technique that is used to perform this task is known as
multifactor dimensionality reduction.[14]

In the area of electrical power engineering, data mining techniques


have been widely used for condition monitoring of high voltage electrical
equipment. The purpose of condition monitoring is to obtain valuable
information on the insulation's health status of the equipment. Data
clustering such as the self-organizing map (SOM) has been applied to the
vibration monitoring and analysis of transformer on-load tap-changers
(OLTCs). Using vibration monitoring, it can be observed that each
tap change operation generates a signal that contains information about the
condition of the tap changer contacts and the drive mechanisms. Obviously,
different tap positions will generate different signals. However, there was
considerable variability amongst normal condition signals for the exact same
tap position. SOM has been applied to detect abnormal conditions and to
estimate the nature of the abnormalities.

Data mining techniques have also been applied for dissolved gas
analysis (DGA) on power transformers. DGA, as a diagnostics for power
transformer, has been available for many years. Data mining techniques such
as SOM have been applied to analyse data and to determine trends which are
not obvious to the standard DGA ratio techniques such as the Duval Triangle.[15]
In educational research, data mining has been used to study the
factors leading students to choose to engage in behaviors which reduce their
learning[16] and to understand the factors influencing university student
retention.[17] A similar example of the social application of data mining is its
use in expertise finding systems, whereby descriptors of human expertise are
extracted, normalised and classified so as to facilitate the finding of experts,
particularly in scientific and technical fields. In this way, data mining can
facilitate institutional memory.

Other examples of data mining applications include mining
biomedical data facilitated by domain ontologies,[18] mining clinical trial
data,[19] and traffic analysis using SOM,[20] among others.

In adverse drug reaction surveillance, the Uppsala Monitoring Centre


has, since 1998, used data mining methods to routinely screen for reporting
patterns indicative of emerging drug safety issues in the WHO global
database of 4.6 million suspected adverse drug reaction incidents[21].
Recently, similar methodology has been developed to mine large collections
of electronic health records for temporal patterns associating drug
prescriptions to medical diagnoses[22].

Spatial Data mining

Spatial data mining is the application of data mining techniques to


spatial data. Spatial data mining follows the same general approach as data
mining, with the end objective of finding patterns in geography. So far, data
mining and Geographic Information Systems (GIS) have existed as two
separate technologies, each with its own methods, traditions and approaches
to visualisation and data analysis. Particularly, most contemporary GIS have
only very basic spatial analysis functionality. The immense explosion in
geographically referenced data occasioned by developments in IT, digital
mapping, remote sensing, and the global diffusion of GIS emphasises the
importance of developing data driven inductive approaches to geographical
analysis and modeling.

Data mining, which is the partially automated search for hidden patterns
in large databases, offers great potential benefits for applied GIS-based
decision-making. Recently, the task of integrating these two technologies
has become critical, especially as various public and private sector
organisations possessing huge databases with thematic and geographically
referenced data begin to realise the huge potential of the information hidden
there. Among those organisations are:

• offices requiring analysis or dissemination of geo-referenced


statistical data
• public health services searching for explanations of disease clusters
• environmental agencies assessing the impact of changing land-use
patterns on climate change
• geo-marketing companies doing customer segmentation based on
spatial location.

Challenges

Geospatial data repositories tend to be very large. Moreover, existing


GIS datasets are often splintered into feature and attribute components that
are conventionally archived in hybrid data management systems.
Algorithmic requirements differ substantially for relational (attribute) data
management and for topological (feature) data management [23]. Related to
this is the range and diversity of geographic data formats, which also present
this is the range and diversity of geographic data formats, that also presents
unique challenges. The digital geographic data revolution is creating new
types of data formats beyond the traditional "vector" and "raster" formats.
Geographic data repositories increasingly include ill-structured data such as
imagery and geo-referenced multi-media [24].

There are several critical research challenges in geographic knowledge


discovery and data mining. Miller and Han [25] offer the following list of
emerging research topics in the field:

• Developing and supporting geographic data warehouses - Spatial


properties are often reduced to simple aspatial attributes in
mainstream data warehouses. Creating an integrated GDW requires
solving issues in spatial and temporal data interoperability, including
differences in semantics, referencing systems, geometry, accuracy and
position.
• Better spatio-temporal representations in geographic knowledge
discovery - Current geographic knowledge discovery (GKD)
techniques generally use very simple representations of geographic
objects and spatial relationships. Geographic data mining techniques
should recognise more complex geographic objects (lines and
polygons) and relationships (non-Euclidean distances, direction,
connectivity and interaction through attributed geographic space such
as terrain). Time needs to be more fully integrated into these
geographic representations and relationships.
• Geographic knowledge discovery using diverse data types - GKD
techniques should be developed that can handle diverse data types
beyond the traditional raster and vector models, including imagery
and geo-referenced multimedia, as well as dynamic data types (video
streams, animation).

Surveillance

Data mining programs previously used by the U.S.
government to stop terrorism include the Total Information Awareness (TIA) program,
Secure Flight (formerly known as Computer-Assisted Passenger
Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization,
Insight, Semantic Enhancement (ADVISE[26]), and the Multistate Anti-
Terrorism Information Exchange (MATRIX).[27] These programs have been
discontinued due to controversy over whether they violate the US
Constitution's 4th amendment, although many programs that were formed
under them continue to be funded by different organisations, or under
different names.[28]

Two plausible data mining techniques in the context of combating


terrorism include "pattern mining" and "subject-based data mining".

Pattern mining

"Pattern mining" is a data mining technique that involves finding existing


patterns in data. In this context patterns often means association rules. The
original motivation for searching association rules came from the desire to
analyze supermarket transaction data, that is, to examine customer behaviour
in terms of the purchased products. For example, an association rule "beer
=> crisps (80%)" states that four out of five customers that bought beer also
bought crisps.
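To make the rule concrete, the 80 percent confidence can be computed directly; a minimal sketch in Java with hypothetical transaction data (class and method names are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleConfidence {

    /** Confidence of x => y: fraction of transactions containing x that also contain y. */
    static double confidence(List<Set<String>> transactions, String x, String y) {
        long withX = transactions.stream().filter(t -> t.contains(x)).count();
        long withBoth = transactions.stream().filter(t -> t.contains(x) && t.contains(y)).count();
        return withX == 0 ? 0.0 : (double) withBoth / withX;
    }

    public static void main(String[] args) {
        List<Set<String>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList("beer", "crisps")),
            new HashSet<>(Arrays.asList("beer", "crisps")),
            new HashSet<>(Arrays.asList("beer", "crisps")),
            new HashSet<>(Arrays.asList("beer", "crisps")),
            new HashSet<>(Arrays.asList("beer", "milk")));
        // Four of the five beer transactions also contain crisps, so this prints 0.8.
        System.out.println(confidence(txns, "beer", "crisps"));
    }
}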

In the context of pattern mining as a tool to identify terrorist activity, the


National Research Council provides the following definition: "Pattern-based
data mining looks for patterns (including anomalous data patterns) that
might be associated with terrorist activity — these patterns might be
regarded as small signals in a large ocean of noise."[29][30][31] Pattern mining
includes new areas such as Music Information Retrieval (MIR), where
patterns seen in both the temporal and non-temporal domains are imported
into classical knowledge discovery search techniques.

Subject-based data mining

"Subject-based data mining" is a data mining technique involving the


search for associations between individuals in data. In the context of
combatting terrorism, the National Research Council provides the following
definition: "Subject-based data mining uses an initiating individual or other
datum that is considered, based on other information, to be of high interest,
and the goal is to determine what other persons or financial transactions or
movements, etc., are related to that initiating datum."[30]

Privacy concerns and ethics

Some people believe that data mining itself is ethically neutral.[32]


However, the ways in which data mining can be used can raise questions
regarding privacy, legality, and ethics.[33] In particular, data mining
government or commercial data sets for national security or law enforcement
purposes, such as in the Total Information Awareness Program or in
ADVISE, has raised privacy concerns.[34][35]

Data mining requires data preparation which can uncover information


or patterns which may compromise confidentiality and privacy obligations.
A common way for this to occur is through data aggregation. Data
aggregation is when the data are accrued, possibly from various sources, and
put together so that they can be analyzed.[36] This is not data mining per se,
but a result of the preparation of data before and for the purposes of the
analysis. The threat to an individual's privacy comes into play when the data,
once compiled, cause the data miner, or anyone who has access to the
newly-compiled data set, to be able to identify specific individuals,
especially when originally the data were anonymous.

It is recommended that an individual is made aware of the following before


data are collected:

• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use them,
• the security surrounding access to the data, and in addition,
• how collected data can be updated.[36]
Privacy concerns have also been somewhat addressed by Congress via the
passage of regulatory controls such as HIPAA. The Health Insurance
Portability and Accountability Act (HIPAA) requires individuals to be given
"informed consent" regarding any information that they provide and its
intended future uses by the facility receiving that information. According to
an article in Biotech Business Week, “In practice, HIPAA may not offer any
greater protection than the longstanding regulations in the research arena,
says the AAHC. More importantly, the rule's goal of protection through
informed consent is undermined by the complexity of consent forms that are
required of patients and participants, which approach a level of
incomprehensibility to average individuals.” (40) This underscores the
necessity for data anonymity in data aggregation practices.

One may additionally modify the data so that they are anonymous, so that
individuals may not be readily identified.[36] However, even de-identified
data sets can contain enough information to identify individuals, as occurred
when journalists were able to find several individuals based on a set of
search histories that were inadvertently released by AOL. [37]

Clustering

A computer cluster is a group of linked computers, working together


closely so that in many respects they form a single computer. The
components of a cluster are commonly, but not always, connected to each
other through fast local area networks. Clusters are usually deployed to
improve performance and/or availability over that of a single computer,
while typically being much more cost-effective than single computers of
comparable speed or availability.[1]

Cluster categorizations

High-availability (HA) clusters

High-availability clusters (also known as Failover Clusters) are


implemented primarily for the purpose of improving the availability of
services which the cluster provides. They operate by having redundant
nodes, which are then used to provide service when system components fail.
The most common size for an HA cluster is two nodes, which is the
minimum requirement to provide redundancy. HA cluster implementations
attempt to use redundancy of cluster components to eliminate single points
of failure.

There are many commercial implementations of High-Availability


clusters for many operating systems. The Linux-HA project is one
commonly used free software HA package for the Linux operating system.

Load-balancing clusters

Load-balancing is when multiple computers are linked together to share


computational workload or function as a single virtual computer. Logically,
from the user side, they are multiple machines, but function as a single
virtual machine. Requests initiated from the user are managed by, and
distributed among, all the standalone computers to form a cluster. This
results in balanced computational work among different machines,
improving the performance of the cluster system.

Compute clusters

Often clusters are used primarily for computational purposes, rather


than handling IO-oriented operations such as web service or databases. For
instance, a cluster might support computational simulations of weather or
vehicle crashes. The primary distinction within compute clusters is how
tightly-coupled the individual nodes are. For instance, a single compute job
may require frequent communication among nodes - this implies that the
cluster shares a dedicated network, is densely located, and probably has
homogeneous nodes. This cluster design is usually referred to as a Beowulf
Cluster. The other extreme is where a compute job uses one or a few nodes,
and needs little or no inter-node communication. This latter category is
sometimes called "Grid" computing. Tightly-coupled compute clusters are
designed for work that might traditionally have been called
"supercomputing". Middleware such as MPI (Message Passing Interface) or
PVM (Parallel Virtual Machine) permits compute clustering programs to be
portable to a wide variety of clusters.

Grid computing

Grids are usually computer clusters, but more focused on throughput


like a computing utility rather than running fewer, tightly-coupled jobs.
Often, grids will incorporate heterogeneous collections of computers,
possibly distributed geographically, sometimes administered by unrelated
organizations.

Grid computing is optimized for workloads which consist of many


independent jobs or packets of work, which do not have to share data
between the jobs during the computation process. Grids serve to manage the
allocation of jobs to computers which will perform the work independently
of the rest of the grid cluster. Resources such as storage may be shared by all
the nodes, but intermediate results of one job do not affect other jobs in
progress on other nodes of the grid.

An example of a very large grid is the Folding@home project. It is


analyzing data that is used by researchers to find cures for diseases such as
Alzheimer's and cancer. Another large project is the SETI@home project,
which may be the largest distributed grid in existence. It uses approximately
three million home computers all over the world to analyze data from the
Arecibo Observatory radiotelescope, searching for evidence of
extraterrestrial intelligence. In both of these cases, there is no inter-node
communication or shared storage. Individual nodes connect to a main,
central location to retrieve a small processing job. They then perform the
computation and return the result to the central server. In the case of the
@home projects, the software is generally run when the computer is
otherwise idle. UC Berkeley has developed an open-source application,
BOINC, to allow individual users to contribute to the above and other
projects, such as LHC@home (Large Hadron Collider), from a single manager,
which can then be set to allocate a percentage of idle time to each of the
projects a node is signed up for. The software and a project list can be
downloaded from the BOINC website.

The grid setup means that the nodes can take however many jobs they
are able to process in one session and then return the results and acquire a
new job from a central project server.

Implementations

The TOP500 organization's semiannual list of the 500 fastest


computers usually includes many clusters. TOP500 is a collaboration
between the University of Mannheim, the University of Tennessee, and the
National Energy Research Scientific Computing Center at Lawrence
Berkeley National Laboratory. As of June 18, 2008, the top supercomputer
is the Department of Energy's IBM Roadrunner system, with a performance of
1026 TFlops measured with the High-Performance LINPACK benchmark.

Clustering can provide significant performance benefits versus price.


The System X supercomputer at Virginia Tech, the 28th most powerful
supercomputer on Earth as of June 2006[2], is a 12.25 TFlops computer
cluster of 1100 Apple XServe G5 2.3 GHz dual-processor machines (4 GB
RAM, 80 GB SATA HD) running Mac OS X and using InfiniBand
interconnect. The cluster initially consisted of Power Mac G5s; the rack-
mountable XServes are denser than desktop Macs, reducing the aggregate
size of the cluster. The total cost of the previous Power Mac system was
$5.2 million, a tenth of the cost of slower mainframe supercomputers. (The
Power Mac G5s were sold off.)

The central concept of a Beowulf cluster is the use of commercial off-


the-shelf (COTS) computers to produce a cost-effective alternative to a
traditional supercomputer. One project that took this to an extreme was the
Stone Souper computer.

However, it is worth noting that FLOPS (floating point operations per
second) aren't always the best metric for supercomputer speed. Clusters can
have very high FLOPS, but they cannot access all the data the cluster as a
whole has at once. Therefore, clusters are excellent for parallel computation,
but much poorer than traditional supercomputers at non-parallel computation.

JavaSpaces is a specification from Sun Microsystems that enables
clustering computers via a distributed shared memory.

Consumer game consoles

Due to the increasing computing power of each generation of game


consoles, a novel use has emerged where they are repurposed into HPC
clusters. Some examples of game console clusters are Sony PlayStation
clusters and Microsoft Xbox clusters. It has been suggested that countries
which are restricted from buying supercomputing technologies may be
obtaining game systems to build computer clusters for military use.
Existing system:
• We propose a generalized clustering framework that utilizes existing
clustering algorithms and detects whether there is a drifting concept in
the incoming data.

Existing system advantage:

• We use the NIR representative, which has two advantages (see the sketch after this list):
1. A node is important in a cluster when the frequency of the node is high in that cluster.
2. A node is important in a cluster if the node appears prevalently in that cluster rather than in other clusters.
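A minimal sketch of this idea in Java (the weighting below is a simplified stand-in for NIR's exact formula, and all names are illustrative): the importance of a node (an attribute value) combines how frequent it is inside the cluster with how exclusive it is to that cluster.

public class NodeImportance {

    /**
     * Simplified node-importance score for cluster i.
     * frequency   captures advantage 1 (the node occurs often inside the cluster);
     * exclusivity captures advantage 2 (the node occurs mostly in this cluster).
     * This is an illustrative stand-in, not the exact NIR weighting.
     */
    static double importance(int countInCluster, int clusterSize, int countInAllClusters) {
        if (clusterSize == 0 || countInAllClusters == 0) {
            return 0.0;
        }
        double frequency = (double) countInCluster / clusterSize;
        double exclusivity = (double) countInCluster / countInAllClusters;
        return frequency * exclusivity;
    }
}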
Proposed system:
• We test the accuracy of DCD on both synthetic and real data sets. First, we test the accuracy of the drifting concepts detected by DCD. Then, in order to evaluate the results of the clustering algorithms, we adopt two widely used methods.
• In order to detect the drifting concepts at different sliding windows, we propose the algorithm DCD, which compares the cluster distributions between the last clustering result and the current temporal clustering result.
• In order to observe the relationship between different clustering results, we propose the algorithm CRA, which analyzes and shows the changes between different clustering results.
Proposed system advantage:
• We propose a framework to perform clustering on categorical time-evolving data. The framework detects the drifting concepts at different sliding windows, generates the clustering results based on the current concept, and shows the relationship between clustering results through visualization.

Project Description
The problem of clustering categorical time-evolving data is formulated as follows: Suppose that a series of categorical data points D is given, where each data point is a vector of q attribute values, i.e., p_j = (p_j^1, p_j^2, ..., p_j^q). Let A = {A_1, A_2, ..., A_q}, where A_a is the a-th categorical attribute, 1 <= a <= q. In addition, suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t, where the number of data points in each S^t is N. The superscript t is the identification number of the sliding window, and t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1. Based on the foregoing, the objective of the framework is to perform clustering on the data set D, consider the drifting concepts between S^t and S^(t+1), and also analyze the relationship between different clustering results. For ease of presentation, several notations are defined as follows. In our framework, several clustering results at different time stamps will be reported. Each clustering result C^[t1,t2] is formed by one stable concept that persists for a period of time, i.e., the sliding windows from t1 to t2. The clustering result C^[t1,t2] contains k^[t1,t2] clusters, i.e., C^[t1,t2] = {c_1^[t1,t2], c_2^[t1,t2], ..., c_(k^[t1,t2])^[t1,t2]}, where c_i^[t1,t2], 1 <= i <= k^[t1,t2], is the i-th cluster in C^[t1,t2]. If t1 = t2 = t, we simplify the superscript to t. For example, the first clustering result, obtained from the initial clustering step, is C^1. Moreover, if we do not point out a specific time stamp, the superscript is omitted for ease of presentation. In addition, when the DCD algorithm is performed, a temporal clustering result, which is utilized to detect the drifting concept at each sliding window, is obtained. The notation C'^t is used to represent the temporal clustering result at time stamp t. Fig. 2 shows an example of a data set D with 15 data points, three attributes, and sliding window size N = 5. The initial clustering is performed on the first sliding window S^1, and the clustering result C^1, which contains two clusters c_1^1 and c_2^1, is obtained. All of the symbols utilized in this paper are summarized in the table of symbols.
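Under these definitions, the decomposition of D into the windows S^t is straightforward; a minimal Java sketch (class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;

public class WindowedDataset {

    /** A categorical data point p_j = (p_j^1, ..., p_j^q); values[a-1] holds the value of attribute A_a. */
    public static final class DataPoint {
        final String[] values;
        DataPoint(String[] values) { this.values = values; }
    }

    /**
     * Splits D into continuous subsets S^1, S^2, ... of N points each;
     * the window at list index t-1 corresponds to time stamp t.
     */
    public static List<List<DataPoint>> split(List<DataPoint> d, int n) {
        List<List<DataPoint>> windows = new ArrayList<>();
        for (int i = 0; i + n <= d.size(); i += n) {
            windows.add(new ArrayList<>(d.subList(i, i + n)));
        }
        return windows;
    }
}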
Algorithm Used:

DRIFTING CONCEPT DETECTION

In this section, we introduce our DCD algorithm. The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset S^t and the last clustering result C^[te, t-1] and to decide whether reclustering is required in S^t. Therefore, the incoming categorical data points in S^t should be able to be allocated efficiently into the corresponding proper clusters of the last clustering result. We name this allocating process "data labeling" [7]. In this paper, we modify our previous work on the labeling process in order to detect outliers in S^t. A data point that does not belong to any proper cluster is called an outlier. After labeling, the last clustering result C^[te, t-1] and the current temporal clustering result C'^t obtained by data labeling are compared with each other. If the difference of cluster distributions is large enough, sliding window t is considered a concept-drifting window, and S^t is reclustered. The flowchart of the DCD algorithm is shown in the figure.
 Data Labeling and Outlier Detection
The goal of data labeling is to decide the most appropriate cluster label for each incoming data point. Specifically, suppose that a data point p_j is given. The similarity S(c_i, p_j) between p_j and cluster c_i, 1 <= i <= k, is measured, and the cluster that obtains max_i S(c_i, p_j) is considered the most appropriate cluster. In this paper, the clusters are represented by an effective clustering representative, named NIR.
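A sketch of this labeling step in Java (the similarity function is abstracted; in the framework it is computed from the NIR representative, and the outlier rule shown here is a simplified stand-in):

public class DataLabeler {

    /** Similarity S(c_i, p_j) between cluster i and point p_j; supplied by the representative. */
    interface Similarity {
        double score(int clusterIndex, String[] point);
    }

    /**
     * Returns the index of the cluster with maximal similarity, or -1
     * (outlier) when even the best similarity is below the threshold.
     */
    static int label(String[] point, int k, Similarity sim, double minSimilarity) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < k; i++) {
            double s = sim.score(i, point);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return bestScore >= minSimilarity ? best : -1;
    }
}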

 Cluster Distributions Comparison

In the cluster distributions comparison step, the last clustering result and the current temporal clustering result obtained by data labeling are compared with each other to detect the drifting concept. The clustering results are said to be different according to the following two criteria:
 1. The clustering results are different if quite a large number of outliers are found by data labeling.
 2. The clustering results are different if quite a large number of clusters vary in their ratio of data points.

Since the idea of data labeling is to apply the original clustering characteristics to the incoming data points [7], the outliers that cannot be allocated to any cluster may be generated by different concepts. Therefore, if too many outliers are detected in the data labeling step, a drifting concept may be occurring in the current sliding window. As a result, a threshold named the outlier threshold is set in this step. If the ratio of outliers in the current sliding window is larger than the outlier threshold, the clustering results are said to be different, and the concept is said to drift. Moreover, another type of drifting concept is also detected in this step. The ratio of data points in a cluster may be changed dramatically by a drifting concept, e.g., a cluster that contained half of the data points in the last clustering result may suddenly disappear in the current clustering result. In order to detect this change, we adopt a double-threshold method. One threshold, named the cluster variation threshold, is utilized to determine whether the variation of the ratio of data points in a cluster is large enough; a cluster that exceeds the cluster variation threshold is seen as a different cluster. Then, the number of different clusters is counted, and the ratio of different clusters is compared with the other threshold, named the cluster difference threshold. If the ratio of different clusters is larger than the cluster difference threshold, the concept is said to drift in the current sliding window.
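Putting the two criteria together, the drift decision can be sketched as follows (threshold names follow the text; the ratio arrays are assumed to be aligned by cluster, and the variation measure is an illustrative choice):

public class DriftDetector {

    /**
     * Returns true when the concept is said to drift: either the outlier
     * ratio in the current window exceeds the outlier threshold, or the
     * ratio of clusters whose share of data points varied beyond the
     * cluster variation threshold exceeds the cluster difference threshold.
     */
    static boolean conceptDrifts(double outlierRatio,
                                 double[] lastRatios, double[] currentRatios,
                                 double outlierThreshold,
                                 double variationThreshold,
                                 double differenceThreshold) {
        if (outlierRatio > outlierThreshold) {
            return true; // criterion 1: too many outliers
        }
        int differentClusters = 0;
        for (int i = 0; i < lastRatios.length; i++) {
            if (Math.abs(lastRatios[i] - currentRatios[i]) > variationThreshold) {
                differentClusters++; // criterion 2: this cluster varied too much
            }
        }
        return (double) differentClusters / lastRatios.length > differenceThreshold;
    }
}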
System Architecture:
System Design
Use case Diagram:

[Use case diagram with the following elements: Data Source, Sliding Window, Generating Cluster Representatives, Drifting Data, Cluster Representative, Cluster Representatives Analyser.]

Class Diagram

[Class diagram with four classes:]

Cluster Info
  Attributes: clusterName, clusterType, clusterSize
  Methods: getClusterInfo(), getClusterData(), updateClusterdata(), insertDriftData()

Cluster Analyser
  Attributes: clusterName, clusterstate
  Methods: analyseData(), getResult(), getUpdate()

Data Store
  Attributes: dataType, dataSize
  Methods: updateData(), insertData(), formCluster(), insertDriftindatas()

Sliding Windows
  Attributes: windowSize, windowState
  Methods: updateWindow(), getWindowState(), getDriftingDatas()
Sequence diagram

[Sequence diagram between Datapoints, Sliding Window, Cluster, Cluster Representatives, and Cluster Analyser:]
1. Get the data
2. Send the recent data
3. Append with appropriate cluster
4. Form the new cluster
5. Analyse the cluster
Collaboration Diagram

[Collaboration diagram linking Datapoints, Sliding Window, Cluster Representatives, Cluster, and Cluster Analyser with the messages: 1. Get the data; 2. Send the recent data; 3. Append with appropriate cluster; 4. Form the new cluster; 5. Analyse the cluster.]
State Diagram:

[State diagram: Received new data -> Updated the Sliding Window -> Updated the Cluster Representatives -> Cluster analysed with updated data.]
Activity Diagram:

[Activity diagram: Receiving the new data -> Updating the Sliding Window -> Updating the Cluster Representatives -> Updating the Clusters -> Analysing the cluster with the new data.]
Dataflow Diagram:

[Data flow diagram: Data Points -> Sliding Window -> Generating Cluster Representatives -> Drifting Data -> Updating Cluster Representatives / Re-Clustering -> Cluster Relationship Analysis.]
Login Page:
Admin Page:
Node Search page:
Forming Cluster according to node:
Link to get Sliding window:
Sliding Window Transacted output module:
Formation of Cluster:
Updating the data:
Successful transacted page:
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Banking login page</title>
</head>
<body bgcolor="silver">
<!-- The form belongs inside the body; it submits the credentials to the loginvalidator servlet. -->
<form name="loginform" method="get" action="loginvalidator">
<br><br><br><br><br><br><br><br><br><br><br><br>
<center>
<h1>Welcome to Clustering algorithm</h1>
</center>
<table align="center" border="2">
<tr><td>UserName</td><td><input type="text" name="username"></td></tr>
<tr><td>Password</td><td><input type="password" name="password"></td></tr>
</table>
<center><input type="submit" name="Submit" value="Login"></center>
</form>
</body>
</html>

..

package ser;

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import Accessinformation.Accessinformation;
import Helperview.Helperview;

public class loginvalidator extends HttpServlet {

	private static final long serialVersionUID = 1L;

	public loginvalidator() {
		super();
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {

		String username = request.getParameter("username");
		String password = request.getParameter("password");
		PrintWriter out = response.getWriter();

		Helperview view = new Helperview();
		view.setUsername(username);
		view.setPassword(password);

		HttpSession session = request.getSession();
		Accessinformation dao = new Accessinformation();

		try {
			// Look up the user's role and details from the database.
			view = dao.getTransactioninformation(username, password);

			// The null check must come before any use of view, or a failed
			// lookup would throw a NullPointerException.
			if (view == null) {
				out.println("please provide information");
			} else if ("admin".equalsIgnoreCase(view.getRole())) {
				System.out.println("Inside admin " + view.getUsername());
				session.setAttribute("admin", view);
				response.sendRedirect("admin.jsp");
			} else {
				System.out.println("From Servlet " + view.getUsername());
				session.setAttribute("accholder", view);
				response.sendRedirect("accountholder.jsp");
			}
		} catch (Exception e) {
			e.printStackTrace();
			System.out.println(e);
		}
	}
}

..

package Helperview;

public class Helperview {

	// Table - 1
	String username, password, role;
	// Table - 2
	String accountnumber, transaction, amount, branch;

	// Table - 2
	public String getAccountnumber() {
		return accountnumber;
	}

	public void setAccountnumber(String accountnumber) {
		this.accountnumber = accountnumber;
	}

	public String getTransaction() {
		return transaction;
	}

	public void setTransaction(String transaction) {
		this.transaction = transaction;
	}

	public String getAmount() {
		return amount;
	}

	public void setAmount(String amount) {
		this.amount = amount;
	}

	public String getBranch() {
		return branch;
	}

	public void setBranch(String branch) {
		this.branch = branch;
	}

	// Table - 1
	public String getRole() {
		return role;
	}

	public void setRole(String role) {
		this.role = role;
	}

	public String getUsername() {
		return username;
	}

	public void setUsername(String username) {
		this.username = username;
	}

	public String getPassword() {
		return password;
	}

	public void setPassword(String password) {
		this.password = password;
	}
}
..

package AdminDBConnector;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AdminDBConnector {

	public static Connection getConnection() {
		Connection connection = null;
		try {
			// JDBC-ODBC bridge driver; the ODBC data source is named "ADMIN".
			Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
			connection = DriverManager.getConnection("jdbc:odbc:ADMIN");
		} catch (SQLException e) {
			e.printStackTrace();
			System.out.println(e);
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
			System.out.println(e);
		}
		return connection;
	}

	public static void close(Connection conn, PreparedStatement pst, ResultSet rs) {
		// Close each resource independently so that one failure does not leak the others.
		try {
			if (pst != null) {
				pst.close();
			}
		} catch (SQLException e) {
			e.printStackTrace();
			System.out.println(e);
		}
		try {
			if (rs != null) {
				rs.close();
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		try {
			if (conn != null && !conn.isClosed()) {
				conn.close();
			}
		} catch (SQLException e) {
			e.printStackTrace();
			System.out.println(e);
		}
	}
}
..

package DataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collection;
import AdminDBConnector.AdminDBConnector;
import Helperview.Helperview;

public class DataSource {

    // Fetches all transactions recorded for the given branch (node).
    public static Collection<Helperview> getDataSource(String node) {
        Connection conn = null;
        ResultSet rs = null;
        PreparedStatement pst = null;
        Helperview view = null;
        Collection<Helperview> nodeColl = new ArrayList<Helperview>();
        try {
            conn = AdminDBConnector.getConnection();
            String query = "select * from transaction where Branch=?";
            pst = conn.prepareStatement(query);
            pst.setString(1, node);
            rs = pst.executeQuery();
            while (rs.next()) {
                view = new Helperview();
                view.setAccountnumber(rs.getString(1));
                view.setTransaction(rs.getString(2));
                view.setAmount(rs.getString(3));
                view.setBranch(rs.getString(4));
                nodeColl.add(view);
                System.out.println("From Data Source: " + view.getAccountnumber());
                System.out.println("From Data Source: " + view.getTransaction());
                System.out.println("From Data Source: " + view.getAmount());
                System.out.println("From Data Source: " + view.getBranch());
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            AdminDBConnector.close(conn, pst, rs);
        }
        return nodeColl;
    }
}
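
A caller can consume getDataSource as in the following illustrative fragment; the branch name "mumbai" is only an example value, not a fixed part of the interface.

// Illustrative usage: list all transactions recorded at one branch.
for (Helperview v : DataSource.getDataSource("mumbai")) {
    System.out.println(v.getAccountnumber() + " " + v.getTransaction()
            + " " + v.getAmount() + " " + v.getBranch());
}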

..

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<%@ page import="Helperview.Helperview"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Welcome Admin Page</title>
</head>
<body bgcolor="silver">
<br><br><br><br><br><br><br><br><br><br><br>
<% Helperview view = (Helperview) session.getAttribute("admin"); %>
<center>
<h1>Welcome <%= view.getUsername() %> !!!</h1>
<a href="clusterinformation.jsp">Click here to view cluster</a>
</center>
</body>
</html>

..


<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<%@ page import="java.util.*"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Transacted information</title>
</head>
<body bgcolor="silver">
<table border="0" background="silver" align="center">
<tr><td><img src="qw.PNG"/></td></tr>
<tr><td><img src="er1.png"/></td></tr>
</table>
<h2>The recent transacted accounts are listed below:</h2>
<table align="center" border="1">
<%
    ArrayList<String> set = (ArrayList<String>) session.getAttribute("mapValue");
    Iterator<String> iterator = set.iterator();
    while (iterator.hasNext()) {
        String element = iterator.next();
%>
<tr align="center"><td align="center"><h4 align="center"><%= element %></h4></td></tr>
<%
    }
%>
</table>
</body>
</html>

..

package transactionservlet;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;

public class Trigger_SlidingWindow {

    // Gets the branch names captured in triggertable by the database trigger.
    public ArrayList<String> getBranch() {
        String url = "jdbc:odbc:ADMIN";
        Connection con;
        String query = "select * from triggertable";
        Statement stmt;
        ArrayList<String> columnSet = new ArrayList<String>();
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection(url);
            stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(query);
            while (rs.next()) {
                columnSet.add(rs.getString(1));
            }
        } catch (Exception e) {
            System.err.println(e.getMessage() + " " + e);
        }
        System.out.println("account: " + columnSet);
        return columnSet;
    }
}
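
An illustrative caller, assuming the triggertable defined below has already been populated by the trigger, might refresh the sliding window like this:

// Illustrative usage: fetch the branches seen in the current window.
Trigger_SlidingWindow window = new Trigger_SlidingWindow();
ArrayList<String> branches = window.getBranch();
for (String branch : branches) {
    System.out.println("branch in window: " + branch);
}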

….

create trigger transactiondbtrigger on transactiondb after update as
begin
    declare @accountnumb numeric(10)
    declare @amttransaction varchar(20)
    declare @amt numeric(10)
    declare @branch varchar(20)
    declare @state varchar(20)
    declare @accounttype varchar(20)
    declare @phone varchar(20)
    declare @fax varchar(20)

    select @accountnumb = accountnumber from inserted
    select @amttransaction = amounttransaction from inserted
    select @amt = amount from inserted
    select @branch = branch from inserted
    select @state = state from inserted
    select @accounttype = accounttype from inserted
    select @phone = phonenumber from inserted
    select @fax = faxnumber from inserted

    insert into triggertable(account, transactiontype, amount, branch, state,
        accounttype, phone, fax)
    values (@accountnumb, @amttransaction, @amt, @branch, @state,
        @accounttype, @phone, @fax)
end

create table triggertable(account numeric(10), transactiontype varchar(20),
    amount numeric(10), branch varchar(20), state varchar(20),
    accounttype varchar(20), phone varchar(20), fax varchar(20));

select * from triggertable

update transactiondb set amounttransaction='deposit', amount=2000,
    branch='mumbai', state='maharastra', accounttype='instant',
    phonenumber='066-223456', faxnumber='066-569856'
where accountnumber=1000


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Transaction</title>
</head>
<body bgcolor="silver">
<table border="0" align="center">
<tr><td><img src="qw.PNG" /></td></tr>
<tr><td><img src="er1.png" /></td></tr>
</table>
<form method="get" action="updation">
<br><br><br><br><br><br><br>
<table align="center" border="2">
<tr>
<td>Account number:</td>
<td><input type="text" size="5" name="accountnumber" align="center"></td>
</tr>
<tr>
<td>Transaction:</td>
<td><select name="transaction">
<option value="withdrawal">Withdrawal</option>
<option value="deposit">Deposit</option>
</select></td>
</tr>
<tr>
<td>Amount:</td>
<td><input type="text" name="amount"></td>
</tr>
<tr>
<td>City:</td>
<td><input type="text" name="city"></td>
</tr>
<tr>
<td>State:</td>
<td><input type="text" name="state"></td>
</tr>
<tr>
<td>Account type:</td>
<td><select name="acctype">
<option value="saving">Saving</option>
<option value="instant">Instant</option>
<option value="current">Current</option>
</select></td>
</tr>
<tr>
<td>Phone number:</td>
<td><input type="text" name="phone"></td>
</tr>
<tr>
<td>Fax number:</td>
<td><input type="text" name="fax"></td>
</tr>
</table>
<center><input type="submit" value="Submit" /></center>
</form>
</body>
</html>

….

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import AdminDBConnector.AdminDBConnector;

public class AllTransactedNode {

    // Retrieves every transaction in the given date window that matches the
    // requested branch, transaction type, and account type.
    public static ArrayList<String> getClusterinformationofall(String fromdate,
            String todate, String branch, String transtype, String acctype) {
        Connection conn = null;
        PreparedStatement pst = null;
        ResultSet rs = null;
        ArrayList<String> arr = new ArrayList<String>();
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            conn = DriverManager.getConnection("jdbc:odbc:ADMIN");
            String query = "select * from transactiondb where dateoftransaction "
                    + "between ? and ? and branch=? and amounttransaction=? "
                    + "and accounttype=?";
            pst = conn.prepareStatement(query);
            pst.setString(1, fromdate);
            pst.setString(2, todate);
            pst.setString(3, branch);
            pst.setString(4, transtype);
            pst.setString(5, acctype);
            rs = pst.executeQuery();
            while (rs.next()) {
                // Flatten the ten columns of each row, then mark the row end.
                for (int i = 1; i <= 10; i++) {
                    arr.add(rs.getString(i));
                }
                arr.add("&");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            AdminDBConnector.close(conn, pst, rs);
        }
        System.out.println(arr);
        return arr;
    }
}

….

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import AdminDBConnector.AdminDBConnector;

public class AllTransaction {

    // Retrieves every transaction recorded at the given branch within the
    // requested date window.
    public static ArrayList<String> getClusterinformation(String fromdate,
            String todate, String branch) {
        Connection conn = null;
        PreparedStatement pst = null;
        ResultSet rs = null;
        ArrayList<String> arr = new ArrayList<String>();
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            conn = DriverManager.getConnection("jdbc:odbc:ADMIN");
            String query = "select * from transactiondb where dateoftransaction "
                    + "between ? and ? and branch=?";
            pst = conn.prepareStatement(query);
            pst.setString(1, fromdate);
            pst.setString(2, todate);
            pst.setString(3, branch);
            rs = pst.executeQuery();
            while (rs.next()) {
                // Flatten the ten columns of each row, then mark the row end.
                for (int i = 1; i <= 10; i++) {
                    arr.add(rs.getString(i));
                }
                arr.add("&");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            AdminDBConnector.close(conn, pst, rs);
        }
        System.out.println(arr);
        return arr;
    }
}
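
Both query helpers flatten each ten-column row into the result list and append "&" as a row delimiter. A consumer (for example, the JSP that displays cluster information) can recover the individual records with a small helper such as the hypothetical one sketched below; it is not part of the project's listings.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the flat "&"-delimited list back into rows.
public class RowSplitter {
    public static List<List<String>> split(List<String> flat) {
        List<List<String>> rows = new ArrayList<List<String>>();
        List<String> row = new ArrayList<String>();
        for (String field : flat) {
            if ("&".equals(field)) { // end of one transaction record
                rows.add(row);
                row = new ArrayList<String>();
            } else {
                row.add(field);
            }
        }
        return rows;
    }
}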

System requirements:

Hardware requirements:
Processor : Pentium IV 2.6 GHz
RAM : 512 MB DDR RAM
Monitor : 15" color
Hard disk : 20 GB
Floppy drive : 1.44 MB
CD drive : LG 52x
Keyboard : Standard 102 keys
Mouse : 3 buttons

Software requirements:
Front end : JSP, Servlets
Back end : SQL Server 2000
Tools used : Dreamweaver
Operating system : Windows XP

Conclusion:
In this paper, we proposed a framework for clustering categorical
time-evolving data. The framework detects drifting concepts across
sliding windows, generates clustering results based on the current
concept, and visualizes the relationships between successive clustering
results. To detect drifting concepts, we proposed the algorithm DCD,
which compares the cluster distribution of the last clustering result
with that of the temporary clustering result obtained on the current
sliding window. If the two distributions differ significantly, the last
clustering result is dumped out and the data in the current sliding
window are reclustered. In addition, to trace the relationships between
different clustering results, we proposed the algorithm CRA, which
analyzes and displays the changes between them. The experimental
evaluation shows that performing DCD is faster than reclustering the
entire data set from scratch, and that DCD provides high-quality
clustering results with correctly detected drifting concepts on both
synthetic and real data. These results demonstrate that our framework
is practical for detecting drifting concepts in time-evolving
categorical data.
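
As a rough illustration of the drift test that DCD performs, the following sketch compares the relative cluster sizes of the last clustering result with those of the current sliding window and signals a drift when the total variation between the two distributions exceeds a threshold. The method name, the distance measure, and the threshold are simplifications chosen for illustration, not the exact formulation of DCD.

// Illustrative drift check, assuming both results have the same number
// of clusters. last[i] and current[i] hold the fraction of data points
// assigned to cluster i in each result.
public static boolean conceptDrifted(double[] last, double[] current,
        double threshold) {
    double distance = 0.0;
    for (int i = 0; i < last.length; i++) {
        distance += Math.abs(last[i] - current[i]);
    }
    distance /= 2.0; // total variation distance, in [0, 1]
    return distance > threshold; // drift detected: recluster the window
}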

REFERENCES

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, “A Framework for Clustering
Evolving Data Streams,” Proc. 29th Int’l Conf. Very Large Data Bases
(VLDB), 2003.
[2] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park, “Fast
Algorithms for Projected Clustering,” Proc. ACM SIGMOD ’99, pp. 61-72,
1999.
[3] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik, “Limbo:
Scalable Clustering of Categorical Data,” Proc. Ninth Int’l Conf. Extending
Database Technology (EDBT), 2004.
[4] D. Barbara´, Y. Li, and J. Couto, “Coolcat: An Entropy-Based Algorithm
for Categorical Clustering,” Proc. ACM Int’l Conf. Information and
Knowledge Management (CIKM), 2002.
[5] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-Based Clustering over
an Evolving Data Stream with Noise,” Proc. Sixth SIAM Int’l Conf. Data
Mining (SDM), 2006.
[6] D. Chakrabarti, R. Kumar, and A. Tomkins, “Evolutionary Clustering,”
Proc. ACM SIGKDD ’06, pp. 554-560, 2006.
[7] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, “Labeling Unclustered
Categorical Data into Clusters Based on the Important Attribute Values,”
Proc. Fifth IEEE Int’l Conf. Data Mining (ICDM), 2005.
[8] Y. Chi, X.-D. Song, D.-Y. Zhou, K. Hino, and B.L. Tseng, “Evolutionary
Spectral Clustering by Incorporating Temporal Smoothness,” Proc. ACM
SIGKDD ’07, pp. 153-162, 2007.
[9] B.-R. Dai, J.-W. Huang, M.-Y. Yeh, and M.-S. Chen, “Adaptive
Clustering for Multiple Evolving Streams,” IEEE Trans. Knowledge and
Data Eng., vol. 18, no. 9, pp. 1166-1180, Sept. 2006.
[10] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood
from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc.,
1977.
[11] D.H. Fisher, “Knowledge Acquisition via Incremental Conceptual
Clustering,” Machine Learning, 1987.
[12] M.M. Gaber and P.S. Yu, “Detection and Classification of Changes
in Evolving Data Streams,” Int’l J. Information Technology and Decision
Making, vol. 5, no. 4, pp. 659-670, 2006.
[13] V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS—Clustering
Categorical Data Using Summaries,” Proc. ACM SIGKDD, 1999.
[14] D. Gibson, J.M. Kleinberg, and P. Raghavan, “Clustering Categorical
Data: An Approach Based on Dynamical Systems,” VLDB J., vol. 8, nos. 3-
4, pp. 222-236, 2000.
[15] M.A. Gluck and J.E. Corter, “Information, Uncertainty, and the Utility of
Categories,” Proc. Seventh Ann. Conf. Cognitive Science Soc., pp. 283-287,
1985.
[16] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering
Algorithm for Categorical Attributes,” Proc. 15th Int’l Conf. Data Eng.
(ICDE), 1999.
[17] E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher, “Clustering Based
on Association Rule Hypergraphs,” Proc. ACM SIGMOD Workshop
Research Issues in Data Mining and Knowledge Discovery (DMKD), 1997.
[18] J. Han and M. Kamber, Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2001.
[19] Z. Huang, “Extensions to the k-Means Algorithm for Clustering Large
Data Sets with Categorical Values,” Data Mining and Knowledge
Discovery, 1998.
[20] Z. Huang and M.K. Ng, “A Fuzzy k-Modes Algorithm for Clustering
Categorical Data,” IEEE Trans. Fuzzy Systems, 1999.
[21] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data
Streams,” Proc. ACM SIGKDD, 2001.
[22] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice Hall,
1988.
[23] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,”
ACM Computing Surveys, 1999.
[24] O. Nasraoui and C. Rojas, “Robust Clustering for Tracking Noisy
Evolving Data Streams,” Proc. Sixth SIAM Int’l Conf. Data Mining (SDM),
2006.
[25] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, “A Web
Usage Mining Framework for Mining Evolving User Profiles in Dynamic
Web Sites,” IEEE Trans. Knowledge and Data Eng., vol. 20, no. 2, pp. 202-
215, Feb. 2008.
[26] G. Salton and M.J. McGill, Introduction to Modern Information
Retrieval. McGraw-Hill, 1986.
[27] G. Salton, A. Wong, and C.S. Yang, “A Vector Space Model for
Automatic Indexing,” Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[28] C.E. Shannon, “A Mathematical Theory of Communication,” Bell
System Technical J., 1948.
[29] Y. Sun, Q. Zhu, and Z. Chen, “An Iterative Initial-Points Refinement
Algorithm for Categorical Data Clustering,” Pattern Recognition Letters, vol.
23, no. 7, 2002.
[30] H. Wang, W. Fan, P.S. Yu, and J. Han, “Mining Concept-Drifting Data
Streams Using Ensemble Classifiers,” Proc. ACM SIGKDD, 2003.
[31] G. Widmer and M. Kubat, “Learning in the Presence of Concept Drift
and Hidden Contexts,” Machine Learning, 1996.
[32] M.-Y. Yeh, B.-R. Dai, and M.-S. Chen, “Clustering over Multiple
Evolving Streams by Events and Correlations,” IEEE Trans. Knowledge and
Data Eng., vol. 19, no. 10, pp. 1349-1362, Oct. 2007.
[33] M.J. Zaki and M. Peters, “Clicks: Mining Subspace Clusters in
Categorical Data via k-Partite Maximal Cliques,” Proc. 21st Int’l
Conf. Data Eng., 2005.
[34] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An Efficient Data
Clustering Method for Very Large Databases,” Proc. ACM SIGMOD, 1996.
[35] A. Zhou, F. Cao, W. Qian, and C. Jin, “Tracking Clusters in Evolving
Data Streams over Sliding Windows,” Knowledge and Information Systems,
2007.
