Abstract:
Data clustering is an important technique for exploratory data analysis
and has been the focus of substantial research in several domains for decades.
Sampling has been recognized as an important technique for improving the
efficiency of clustering. However, when sampling is applied, the points that
are not sampled are left without cluster labels after the normal clustering
process. Although there is a straightforward approach in the numerical
domain, the problem of how to allocate those unlabeled data points into
proper clusters remains a challenging issue in the categorical domain. In
this paper, a mechanism named Maximal Resemblance Data Labeling
(abbreviated as MARDL) is proposed to allocate each unlabeled data point
into the appropriate cluster based on a novel categorical cluster
representative, namely, the N-Nodeset Importance
Representative (abbreviated as NNIR), which represents clusters by the
importance of the combinations of attribute values. MARDL has two
advantages:
1) MARDL exhibits high execution efficiency.
2) MARDL achieves high intracluster similarity and low intercluster
similarity, which are regarded as the most important properties of clusters,
thus benefiting the analysis of cluster behaviors.
MARDL is empirically validated on real and synthetic data sets and is
shown to be significantly more efficient than prior works while attaining
results of high quality.
Introduction
Data clustering is an important technique for exploratory data analysis
and has been the focus of substantial research in several domains for
decades. The problem of clustering is defined as follows: given a set of data
objects, partition the objects into groups in such a way that objects in the
same group are similar while objects in different groups are dissimilar,
according to a predefined similarity measurement. Clustering analysis can
therefore help us gain insight into the distribution of data. However, a
difficulty with learning in many real-world domains is that the concept of
interest may depend on some hidden context that is not given explicitly in
the form of predictive features. In other words, the concepts that we try to
learn from the data drift with time. For example, the buying preferences of
customers may change with time, depending on the current day of the week,
the availability of alternatives, the discount rate, etc. As the concepts behind
the data evolve with time, the underlying clusters may also change
considerably. Performing clustering on the entire time-evolving data set not
only decreases the quality of the clusters but also disregards the
expectations of users, who usually require recent clustering results.
Module Description:
Literature Survey:
Data mining:
Data mining is the process of extracting patterns from data. As more
data are gathered, with the amount of data doubling every three years,[1] data
mining is becoming an increasingly important tool to transform these data
into information. It is commonly used in a wide range of profiling practices,
such as marketing, surveillance, fraud detection and scientific discovery.
The term data mining has also been used in a related but negative
sense, to mean the deliberate searching for apparent but not necessarily
representative patterns in large amounts of data. To avoid confusion with the
other sense, the terms data dredging and data snooping are often used. Note,
however, that dredging and snooping can be (and sometimes are) used as
exploratory tools when developing and clarifying hypotheses.
There have been some efforts to define standards for data mining, for
example the 1999 European Cross Industry Standard Process for Data
Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM
1.0). These are evolving standards; later versions of these standards are
under development. Independent of these standardization efforts, freely
available open-source software systems like RapidMiner, Weka, KNIME,
and the R Project have become an informal standard for defining data-
mining processes. Most of these systems are able to import and export
models in PMML (Predictive Model Markup Language) which provides a
standard way to represent data mining models so that these can be shared
between different statistical applications. PMML is an XML-based language
developed by the Data Mining Group (DMG)[3], an independent group
composed of many data mining companies. PMML version 4.0 was released
in June 2009.[3][4][5]
Pre-processing
Once the objective for the KDD process is known, a target data set
must be assembled. As data mining can only uncover patterns already
present in the data, the target dataset must be large enough to contain these
patterns while remaining concise enough to be mined in an acceptable
timeframe. A common source of data is a data mart or data warehouse.
The target set is then cleaned. Cleaning removes observations that contain
noise or have missing data. The clean data are reduced to feature vectors,
one vector per observation. A feature vector is a summarised version of the raw
data observation. For example, a black and white image of a face which is
100px by 100px would contain 10,000 bits of raw data. This might be turned
into a feature vector by locating the eyes and mouth in the image. Doing so
would reduce the data for each vector from 10,000 bits to three codes for the
locations, dramatically reducing the size of the dataset to be mined, and
hence reducing the processing effort. The feature(s) selected will depend on
what the objective(s) is/are; obviously, selecting the "right" feature(s) is
fundamental to successful data mining.
The feature vectors are divided into two sets, the "training set" and the
"test set". The training set is used to "train" the data mining algorithm(s),
while the test set is used to verify the accuracy of any patterns found.
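As a small illustration of this split (a sketch only, not part of the project code; class and method names are ours), the following Java method shuffles the feature vectors and divides them at a given fraction:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: splits a list of feature vectors into a training
// set used to fit the mining algorithm and a test set used to verify
// the patterns it finds.
public class TrainTestSplit {

    public static <T> List<List<T>> split(List<T> vectors, double trainFraction) {
        List<T> shuffled = new ArrayList<T>(vectors);
        Collections.shuffle(shuffled); // randomise so both sets are representative
        int cut = (int) (shuffled.size() * trainFraction);
        List<List<T>> sets = new ArrayList<List<T>>();
        sets.add(new ArrayList<T>(shuffled.subList(0, cut)));               // training set
        sets.add(new ArrayList<T>(shuffled.subList(cut, shuffled.size()))); // test set
        return sets;
    }
}

A common choice is trainFraction = 0.7, leaving 30 percent of the observations for verification.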
Results validation
If the learnt patterns do not meet the desired standards, then it is necessary to
reevaluate and change the preprocessing and data mining. If the learnt
patterns do meet the desired standards then the final step is to interpret the
learnt patterns and turn them into knowledge.
Notable uses
Games
Since the early 1960s, with the availability of oracles for certain
combinatorial games, also called tablebases (e.g., for 3x3 chess with any
beginning configuration, small-board dots-and-boxes, small-board hex, and
certain endgames in chess, dots-and-boxes, and hex), a new area for data
mining has been opened up: the extraction of human-usable strategies
from these oracles. Current pattern recognition approaches do not seem to
achieve the high level of abstraction required to be applied
successfully. Instead, extensive experimentation with the tablebases,
combined with an intensive study of tablebase answers to well-designed
problems and with knowledge of prior art, i.e., pre-tablebase knowledge, is
used to yield insightful patterns. Berlekamp in dots-and-boxes and John
Nunn in chess endgames are notable examples of researchers doing this
work, though they were not and are not involved in tablebase generation.
Business
Businesses employing data mining may see a return on investment, but they
also recognise that the number of predictive models can quickly become
very large. Rather than one model to predict which customers will churn, a
business could build a separate model for each region and customer type.
Then, instead of sending an offer to all people who are likely to churn, it may
want to send offers only to customers who are likely to accept the offer.
Finally, it may also want to determine which customers are going to be
profitable over a window of time and send the offers only to those who are
likely to be profitable. In order to maintain this quantity of models,
businesses need to manage model versions and move to automated data mining.
Market basket analysis has also been used to identify the purchase
patterns of the Alpha consumer. Alpha consumers are people who play a key
role in connecting with the concept behind a product, then adopting that
product, and finally validating it for the rest of society. Analyzing the data
collected on these types of users has allowed companies to predict future
buying trends and forecast supply demands.
In recent years, data mining has been widely used in areas of science
and engineering, such as bioinformatics, genetics, medicine, education, and
electrical power engineering.
Data mining techniques have also been applied to dissolved gas
analysis (DGA) on power transformers. DGA, as a diagnostic for power
transformers, has been available for many years. Data mining techniques such
as self-organizing maps (SOMs) have been applied to analyse the data and to
determine trends that are not obvious to standard DGA ratio techniques
such as the Duval Triangle.[15]
In educational research, data mining has been used to study the
factors leading students to choose to engage in behaviors that reduce their
learning[16] and to understand the factors influencing university student
retention.[17] A similar example of the social application of data mining is its
use in expertise-finding systems, whereby descriptors of human expertise are
extracted, normalised, and classified so as to facilitate the finding of experts,
particularly in scientific and technical fields. In this way, data mining can
facilitate institutional memory.
Data mining, which is the partially automated search for hidden patterns
in large databases, offers great potential benefits for applied GIS-based
decision-making. Recently, the task of integrating these two technologies
has become critical, especially as various public and private sector
organisations possessing huge databases with thematic and geographically
referenced data begin to realise the huge potential of the information hidden
there.
Challenges
Before data are collected and mined, it is considered good practice for an
individual to be made aware of the following:
• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use them,
• the security surrounding access to the data, and, in addition,
• how collected data can be updated.[36]
Privacy concerns have also been somewhat addressed by the U.S. Congress
via the passage of regulatory controls such as HIPAA. The Health Insurance
Portability and Accountability Act (HIPAA) requires individuals to be given
"informed consent" regarding any information that they provide and its
intended future uses by the facility receiving that information. According to
an article in Biotech Business Week, “In practice, HIPAA may not offer any
greater protection than the longstanding regulations in the research arena,
says the AAHC. More importantly, the rule's goal of protection through
informed consent is undermined by the complexity of consent forms that are
required of patients and participants, which approach a level of
incomprehensibility to average individuals.”[40] This underscores the
necessity for data anonymity in data aggregation practices.
One may additionally modify the data so that they are anonymous, so that
individuals may not be readily identified.[36] However, even de-identified
data sets can contain enough information to identify individuals, as occurred
when journalists were able to identify several individuals based on a set of
search histories that were inadvertently released by AOL.[37]
Clustering
Cluster categorizations include load-balancing clusters, compute clusters,
and grids.
Grid computing
The grid setup means that the nodes can take on as many jobs as they
are able to process in one session, return the results, and then acquire a
new job from a central project server.
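This pull-based dispatch can be sketched in Java as follows (a hypothetical model, with all names ours; real grid middleware is far more involved): each node thread repeatedly takes a job from a central queue, processes it, and comes back for the next one.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of the grid setup: a central job queue from which
// every node pulls one job at a time; nodes run until interrupted.
public class GridNode implements Runnable {
    private final BlockingQueue<Runnable> jobs;

    public GridNode(BlockingQueue<Runnable> jobs) {
        this.jobs = jobs;
    }

    public void run() {
        try {
            while (true) {
                Runnable job = jobs.take(); // acquire a new job from the central server
                job.run();                  // process it and return the result
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // node shuts down
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>();
        for (int i = 0; i < 4; i++) {          // four worker nodes
            new Thread(new GridNode(queue)).start();
        }
        for (int i = 0; i < 10; i++) {         // ten jobs submitted to the server
            final int id = i;
            queue.add(new Runnable() {
                public void run() { System.out.println("job " + id + " done"); }
            });
        }
    }
}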
Project Description
The problem of clustering the categorical time-evolving data is
formulated as follows: Suppose that a series of categorical data points D
is given, where each data point is a vector of q attribute values, i.e.,
p_j = (p_1j, p_2j, ..., p_qj). Let A = {A_1, A_2, ..., A_q}, where A_a is the
a-th categorical attribute, 1 <= a <= q. In addition, suppose that the window
size N is also given. The data set D is separated into several continuous
subsets S^t, where the number of data points in each S^t is N. The
superscript number t is the identification number of the sliding window,
and t is also called the time stamp in this paper. For example, the first N
data points in D are located in the first subset S^1. Based on the foregoing,
the objective of the framework is to perform clustering on the data set D,
consider the drifting concepts between S^t and S^(t+1), and also analyze the
relationship between different clustering results. For ease of presentation,
several notations are defined as follows: In our framework, several
clustering results at different time stamps will be reported. Each
clustering result C^[t1,t2] is formed by one stable concept that persists
for a period of time, i.e., the sliding windows from t1 to t2. The
clustering result C^[t1,t2] contains k^[t1,t2] clusters, i.e.,
C^[t1,t2] = {c_1^[t1,t2], c_2^[t1,t2], ..., c_(k^[t1,t2])^[t1,t2]}, where
c_i^[t1,t2], 1 <= i <= k^[t1,t2], is the i-th cluster in C^[t1,t2]. If
t1 = t2 = t, we simplify the superscript to t. For example, the first
clustering result, which is obtained from the initial clustering step, is C^1.
Moreover, if we do not point out a specific time stamp, the superscript will
be omitted for ease of presentation. In addition, when the DCD algorithm is
performed, a temporal clustering result, which is utilized to detect the
drifting concept at each sliding window, will be obtained. The notation C'^t
is used to represent the temporal clustering result at time stamp t. Fig. 2
shows an example of a data set D with 15 data points, three attributes, and
sliding window size N = 5. The initial clustering is performed on the first
sliding window S^1, and the clustering result C^1, which contains two
clusters, c_1^1 and c_2^1, is obtained. All of the symbols utilized in this
paper are summarized in the symbol table.
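As a minimal sketch of this partitioning (illustrative only; names and types are our assumptions, with each categorical data point modeled as a String array of q attribute values), the data set D can be cut into consecutive windows of N points:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: partitions the categorical data set D into
// consecutive sliding windows S^1, S^2, ... of N data points each.
public class WindowPartitioner {

    public static List<List<String[]>> partition(List<String[]> d, int n) {
        List<List<String[]>> windows = new ArrayList<List<String[]>>();
        for (int start = 0; start < d.size(); start += n) {
            int end = Math.min(start + n, d.size());
            // the window added here carries time stamp t = windows.size() + 1
            windows.add(new ArrayList<String[]>(d.subList(start, end)));
        }
        return windows;
    }
}

With the 15-point example above and N = 5, this yields exactly the three windows S^1, S^2, and S^3.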
Algorithm Used:
The algorithm proceeds in three parts: maintaining the sliding window,
generating the cluster representatives, and handling the drifting data.
[Class diagram: Data Source, Cluster Representative, and Cluster Info
(clusterName, clusterType, clusterSize) interact with a Cluster Analyser
(clusterName, clusterState) and the sliding window; the operations named
are getClusterInfo(), getClusterData(), updateClusterData(),
insertDriftData(), analyseData(), getResult(), getUpdate(), updateData(),
updateWindow(), insertData(), getWindowState(), formCluster(),
getDriftingDatas(), and insertDriftingDatas().]
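Read as Java, the class diagram suggests interfaces along the following lines (a sketch only; parameter and return types are assumptions, since the diagram records names but not signatures):

import java.util.List;

// Sketch of the operations named in the class diagram; all signatures
// are assumed, as the diagram itself records only the names.
interface ClusterInfo {
    String getClusterInfo();
    String getClusterData();
    void updateClusterData(String data);
    void insertDriftData(String data);
}

interface ClusterAnalyser {
    void analyseData();
    String getResult();
    String getUpdate();
}

interface SlidingWindow {
    void updateData(String data);
    void updateWindow();
    void insertData(String data);
    String getWindowState();
}

interface DataSourceOps {
    void formCluster();
    List<String> getDriftingDatas();
    void insertDriftingDatas(List<String> datas);
}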
Sequence Diagram:
[Sequence diagram: the Cluster Analyser gets the data from the Data Points
and updates the Cluster Representatives; the first message is
"1: Get the data".]
State Diagram:
[State diagram: received new data -> updated the sliding window ->
cluster analysed with updated data.]
Activity Diagram:
[Activity diagram: receiving the new data -> updating the sliding window ->
updating the clusters; data points identified as drifting data lead to
updating the cluster representatives and re-clustering.]
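Taken together, the diagrams describe a per-window control flow of the following shape (a sketch under assumed helper methods; cluster(), distributionsDiffer(), and merge() stand in for the project's actual clustering algorithm and the DCD drift test):

import java.util.List;

// Sketch of the per-window control flow implied by the activity diagram.
public class DriftLoop {

    Clustering last; // the last stable clustering result

    void onNewWindow(List<String[]> window) {
        // 1. Build the temporal clustering result C'^t for this window.
        Clustering temporal = cluster(window);

        // 2. Compare cluster distributions between the last result and C'^t.
        if (distributionsDiffer(last, temporal)) {
            // Concept drift detected: dump out the last clustering result
            // and recluster the current window.
            output(last);
            last = cluster(window);
        } else {
            // Same concept: fold the window into the current representatives.
            last = merge(last, window);
        }
    }

    // Placeholders standing in for the project's algorithms.
    Clustering cluster(List<String[]> window) { return new Clustering(); }
    boolean distributionsDiffer(Clustering a, Clustering b) { return false; }
    Clustering merge(Clustering c, List<String[]> w) { return c; }
    void output(Clustering c) { }

    static class Clustering { }
}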
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
</head>
<body bgcolor="silver">
<form>
<center>
<!-- the form fields were elided in the source -->
</center>
</form>
</body>
</html>
..
package ser;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import Accessinformation.Accessinformation;
import Helperview.Helperview;

// Validates the submitted credentials and redirects to the admin or
// account-holder page according to the user's role. The class and
// method declarations were missing in the source and are reconstructed.
public class loginvalidator extends HttpServlet {

    public loginvalidator() {
        super();
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String username = request.getParameter("username");
        String password = request.getParameter("password");
        //out.println(username);
        //out.println(password);

        Accessinformation dao = new Accessinformation();
        Helperview view = new Helperview();
        view.setUsername(username);
        view.setPassword(password);

        HttpSession session = request.getSession();
        try {
            view = dao.getTransactioninformation(username, password);
            if (view == null) {
                // the failure branch was empty in the source
            } else if ("admin".equalsIgnoreCase(view.getRole())) {
                System.out.println("Inside admin " + view.getUsername());
                session.setAttribute("admin", view);
                //RequestDispatcher dis = request.getRequestDispatcher("/admin.jsp");
                response.sendRedirect("admin.jsp");
            } else {
                System.out.println("From Servlet " + view.getUsername());
                session.setAttribute("accholder", view);
                //RequestDispatcher dis = request.getRequestDispatcher("/accountholder.jsp");
                response.sendRedirect("accountholder.jsp");
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println(e);
        }
    }
}
..
package Helperview;

// Bean carrying one transaction record (Table - 2) together with the
// login fields (Table - 1); reconstructed from the getter/setter
// fragments in the source.
public class Helperview {

    // Table - 2
    private String accountnumber;
    private String transaction;
    private String amount;
    private String branch;

    // Table - 1
    private String role;
    private String username;
    private String password;

    public String getAccountnumber() { return accountnumber; }
    public void setAccountnumber(String accountnumber) { this.accountnumber = accountnumber; }
    public String getTransaction() { return transaction; }
    public void setTransaction(String transaction) { this.transaction = transaction; }
    public String getAmount() { return amount; }
    public void setAmount(String amount) { this.amount = amount; }
    public String getBranch() { return branch; }
    public void setBranch(String branch) { this.branch = branch; }
    public String getRole() { return role; }
    public void setRole(String role) { this.role = role; }
    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }
    public String getPassword() { return password; }
    public void setPassword(String password) { this.password = password; }
}
..
package AdminDBConnector;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Opens JDBC-ODBC connections to the ADMIN data source and closes the
// statement, result set, and connection. The class and method
// declarations were missing in the source and are reconstructed.
public class AdminDBConnector {

    public static Connection getConnection() {
        Connection connection = null;
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            connection = DriverManager.getConnection("jdbc:odbc:ADMIN");
        } catch (SQLException e) {
            e.printStackTrace();
            System.out.println(e);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.out.println(e);
        }
        return connection;
    }

    public static void close(PreparedStatement pst, ResultSet rs, Connection conn) {
        try {
            if (pst != null) {
                pst.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
            System.out.println(e);
        }
        try {
            if (rs != null) {
                rs.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            if (conn != null) {
                conn.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
            System.out.println(e);
        }
    }
}
..
package DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collection;
import AdminDBConnector.AdminDBConnector;
import Helperview.Helperview;

// Loads the rows belonging to one node from the database and returns
// them as Helperview beans. The class and method declarations and the
// SELECT statement were elided in the source and are reconstructed.
public class DataSource {

    public Collection<Helperview> getNodeData(String node) {
        Collection<Helperview> nodeColl = new ArrayList<Helperview>();
        String query = "..."; // statement elided in the source
        Connection conn = null;
        PreparedStatement pst = null;
        ResultSet rs = null;
        try {
            conn = AdminDBConnector.getConnection();
            pst = conn.prepareStatement(query);
            pst.setString(1, node);
            rs = pst.executeQuery();
            while (rs.next()) {
                Helperview view = new Helperview(); // one bean per row
                view.setAccountnumber(rs.getString(1));
                view.setTransaction(rs.getString(2));
                view.setAmount(rs.getString(3));
                view.setBranch(rs.getString(4));
                nodeColl.add(view);
                // System.out.println("collection" + nodeColl);
                System.out.println("From Data Source: " + view.getAccountnumber());
                System.out.println("From Data Source: " + view.getTransaction());
                System.out.println("From Data Source: " + view.getAmount());
                System.out.println("From Data Source: " + view.getBranch());
            }
        } catch (SQLException e) {
            e.printStackTrace();
            System.out.println(e);
        }
        return nodeColl;
    }
}
..
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<%@page import="Helperview.Helperview"%>
<html>
<head>
</head>
<body bgcolor="silver">
<center>
<!-- the page body was elided in the source -->
</center>
</body>
</html>
..
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<%@page import="Helperview.Helperview"%>
<html>
<head>
</head>
<body bgcolor="silver">
<center>
<%-- the logged-in user's bean is placed in the session by loginvalidator --%>
<% Helperview view = (Helperview) session.getAttribute("accholder"); %>
<h1>Welcome <%out.println(view.getUsername()); %> !!!</h1>
</center>
</body>
</html>
..
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Transacted information</title>
</head>
<body bgcolor="silver">
<table>
<tr><td><img src="qw.PNG"/></td></tr>
<tr><td><img src="er1.png"/></td></tr>
</table>
<h2>The recent transacted accounts are listed below:</h2>
<table>
<tr>
<td>
<h4>
<%
    // the origin of this collection was elided in the source; a request
    // attribute set by the transaction servlet is assumed here
    java.util.List accounts = (java.util.List) request.getAttribute("accounts");
    java.util.Iterator iterator = accounts.iterator();
    while (iterator.hasNext()) {
        Object element = iterator.next();
        out.println(element);
    }
%>
</h4>
</td>
</tr>
</table>
</body>
</html>
package transactionservlet;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Reads the recently transacted account numbers from triggertable.
// The class and method declarations and the connection URL were
// missing in the source and are reconstructed.
public class TransactionDao {

    // assumed to be the same ADMIN data source used elsewhere
    private final String url = "jdbc:odbc:ADMIN";

    public List<String> getTransactedAccounts() {
        List<String> columnSet = new ArrayList<String>();
        Connection con;
        String query = "select * from triggertable";
        Statement stmt;
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection(url);
            stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(query);
            while (rs.next()) {
                columnSet.add(rs.getString(1));
                // columnamt.add(rs.getInt(3));
            }
        } catch (Exception e) {
            System.err.println("ClassNotFoundException: ");
            System.err.println(e.getMessage() + e);
        }
        System.out.println("account: " + columnSet);
        // System.out.println("amount remaining:" + columnamt);
        return columnSet;
        // return columnamt;
    }
}
….
-- fragment of a SQL Server stored procedure; only the parameter list
-- of the insert and the closing "end" survive in the source
(@accountnumb,@amttransaction,@amt,@branch,@state,@accounttype,@phone,@fax)
end
…
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Transaction</title>
</head>
<body bgcolor="silver">
<!-- the form action and most input fields were elided in the source;
     the inputs below mirror the surviving row labels, and their name
     attributes are assumptions -->
<form method="post">
<table>
<tr>
<td>Account number:</td>
<td><input type="text" name="accountnumber" align="center"></td>
</tr>
<tr>
<td>Transaction:</td>
<td><select name="transaction">
<option value="withdrawal">Withdrawal</option>
<option value="deposit">Deposit</option>
</select></td>
</tr>
<tr>
<td>Amount:</td>
<td><input type="text" name="amount"></td>
</tr>
<tr>
<td>City:</td>
<td><input type="text" name="city"></td>
</tr>
<tr>
<td>State:</td>
<td><input type="text" name="state"></td>
</tr>
<tr>
<td>Account type:</td>
<td><select name="acctype">
<option value="saving">Saving</option>
<option value="instant">Instant</option>
<option value="current">Current</option>
</select></td>
</tr>
<tr>
<td>Phone number:</td>
<td><input type="text" name="phone"></td>
</tr>
<tr>
<td>Fax number:</td>
<td><input type="text" name="fax"></td>
</tr>
</table>
</form>
</body>
</html>
….
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import AdminDBConnector.AdminDBConnector;

// Fetches the transaction report filtered by date range, branch,
// transaction type, and account type. The method declaration and all
// but the tail of the SQL ("... accounttype=?") were elided in the
// source and are reconstructed here.
public ArrayList<String> getReport(String fromdate, String todate,
        String branch, String transtype, String acctype) {
    ArrayList<String> arr = new ArrayList<String>();
    Connection conn = null;
    PreparedStatement pst = null;
    ResultSet rs = null;
    String query = "... accounttype=?"; // full statement elided in the source
    try {
        // conn = AdminDBConnector.getConnection();
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        conn = DriverManager.getConnection("jdbc:odbc:ADMIN");
        pst = conn.prepareStatement(query);
        pst.setString(1, fromdate);
        pst.setString(2, todate);
        pst.setString(3, branch);
        pst.setString(4, transtype);
        pst.setString(5, acctype);
        rs = pst.executeQuery();
        while (rs.next()) {
            // ten columns per row; "&" marks the end of each row
            for (int i = 1; i <= 10; i++) {
                arr.add(rs.getString(i));
            }
            arr.add("&");
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e);
    } finally {
    }
    System.out.println(arr);
    return arr;
}
….
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import AdminDBConnector.AdminDBConnector;

// Variant of the report query above, filtered by date range and branch
// only. The method declaration and the SQL statement were elided in
// the source and are reconstructed here.
public ArrayList<String> getReport(String fromdate, String todate, String branch) {
    ArrayList<String> arr = new ArrayList<String>();
    Connection conn = null;
    PreparedStatement pst = null;
    ResultSet rs = null;
    String query = "..."; // statement elided in the source
    try {
        // conn = AdminDBConnector.getConnection();
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        conn = DriverManager.getConnection("jdbc:odbc:ADMIN");
        pst = conn.prepareStatement(query);
        pst.setString(1, fromdate);
        pst.setString(2, todate);
        pst.setString(3, branch);
        rs = pst.executeQuery();
        while (rs.next()) {
            // ten columns per row; "&" marks the end of each row
            for (int i = 1; i <= 10; i++) {
                arr.add(rs.getString(i));
            }
            arr.add("&");
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e);
    } finally {
        System.out.println(arr);
    }
    return arr;
}
Hardware requirements:
Processor : Pentium IV, 2.6 GHz
RAM : 512 MB DDR RAM
Monitor : 15" color
Hard disk : 20 GB
Floppy drive : 1.44 MB
CD drive : LG 52x
Keyboard : Standard 102 keys
Mouse : 3 buttons
Software requirements:
Front end : JSP, Servlets
Back end : SQL Server 2000
Tools used : Dreamweaver
Operating system : Windows XP
Conclusion:
Here we proposed a framework to perform clustering on
categorical time-evolving data. The framework detects the
drifting concepts at different sliding windows, generates the
clustering results based on the current concept, and also shows
the relationship between clustering results by visualization. In
order to detect the drifting concepts at different sliding
windows, we proposed the algorithm DCD to compare the
cluster distributions between the last clustering result and the
temporal clustering result of the current window. If the results are quite
different, the last clustering result is dumped out, and the
current data in the sliding window are reclustered.
In addition, in order to observe the relationship between
different clustering results, we proposed the algorithm CRA to
analyze and show the changes between different clustering
results. The experimental evaluation shows that performing
DCD is faster than clustering the entire data set once,
and that DCD provides high-quality clustering results with
correctly detected drifting concepts on both synthetic and real
data. Therefore, the results demonstrate that our
framework is practical for detecting drifting concepts in time-
evolving categorical data.
REFERENCES