
A New Identification Method for Visualizing Trends in Transactional Data
Dr. S. Subramanian, Principal, Sri Krishna College of Engg. & Tech., Coimbatore, India
Mr. B. Gopinathan, Research Scholar, Adhiyamaan College of Engg, Hosur, India (gopinathanme@gmail.com)
R. Sajula Robin, P.G Scholar, Adhiyamaan College of Engg, Hosur, India (sajularobin@yahoo.co.in)

Abstract - Nowadays many organizations are capturing more data about their customers, suppliers, competitors, and business environment. Most of this data is multiattribute and temporal in nature. Many data mining and business intelligence techniques are used to discover patterns in such data; however, visualizing and analyzing this type of data can be extremely difficult because it can have numerous attributes. Hence a new technique is needed to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. A new data analysis and visualization technique is proposed that presents complex multiattribute temporal data in a cohesive graphical manner by building on well-established data mining methods. Cluster-based Temporal Representation of EveNt Data (C-TREND) is introduced, a system that implements the temporal cluster graph construct, which maps multiattribute temporal data to a two-dimensional directed graph that identifies trends in dominant data types over time. C-TREND provides an end user with the ability to generate graphs from data and adjust graph parameters.

Keywords - Clustering, data and knowledge visualization, data mining, interactive data exploration and discovery, temporal data mining, trend analysis.

I. INTRODUCTION

BUSINESS intelligence applications represent an important opportunity for data mining techniques to help firms gather and analyze information about their performance, customers, competitors, and business environment. Knowledge representation and data visualization tools constitute one form of business intelligence technique that presents information to users in a manner that supports business decision-making processes.
Business intelligence tools gain their strength by supporting decision-makers. The research field of data mining has developed a number of methods for identifying patterns in data to provide insights and decision support to users. Data mining and business intelligence approaches are often used for class identification and data visualization in knowledge management systems. Increasingly, knowledge discovery in data (KDD) techniques are providing new analytical structures that complement and sometimes replace existing human-expert-based techniques to provide improved support for decision making. Identifying and visualizing temporal relationships (e.g., trends) in data constitutes an important problem that is relevant in many business, scientific, and academic settings.
Additionally, it is often desired to aggregate over the temporal dimension (e.g., by day, month, quarter, year, etc.) to match corporate reporting standards. Hence a new technique is needed to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. The intuitive analytical construct onto which the multidimensional temporal data are mapped is known as a temporal cluster graph.
In this paper the approach used for addressing these types of issues is to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. The main contribution of this paper is to develop a novel and useful approach for visualization and analysis of multiattribute transactional data based on a new temporal cluster graph construct, and to implement this approach as the Cluster-based Temporal Representation of EveNt Data (C-TREND) system.

Fig 1. Reducing multiattribute temporal complexity by partitioning data into time periods and producing a temporal cluster graph.
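
The time-period partitioning that this approach relies on can be illustrated with a small sketch. It is only an assumption about how records might be grouped: the monthly granularity, the `(date, attributes)` record layout, the sample values, and the function name `partition_by_month` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(records):
    """Group (date, attributes) records into chronologically ordered monthly partitions."""
    buckets = defaultdict(list)
    for when, attributes in records:
        buckets[(when.year, when.month)].append(attributes)
    # Sorting the (year, month) keys keeps the partitions in chronological order.
    return [buckets[key] for key in sorted(buckets)]


# Hypothetical customer records: purchase date plus age and income attributes.
records = [
    (date(2010, 2, 3), {"age": 41, "income": 72000}),
    (date(2010, 1, 15), {"age": 25, "income": 31000}),
    (date(2010, 1, 28), {"age": 33, "income": 54000}),
]
partitions = partition_by_month(records)
```

Each element of `partitions` is one time period's multiattribute data subset, which can then be mined on its own and compared with its neighbours.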

Consider the plot of a retailer's customers by age and income over three months in Fig. 1. Xs represent customers in the first month, triangles represent customers in the second month, and circles represent customers in the third month. An analyst may be tasked with the job of discovering trends in customer type over these three months.
In Fig. 1a, the data are plotted together; identifying patterns in the data and relationships over time is difficult.
In Fig. 1b, partitioning the data by time leads to the identification of clusters within each period.
In Fig. 1c, the multidimensional temporal data are mapped into an intuitive analytical construct known as a temporal cluster graph.

II. RELATED WORK
Some techniques related to this paper are discussed below.

A. Temporal Data Mining
Temporal data mining is concerned with data mining of large sequential data sets. By sequential data, we mean data that is ordered with respect to some index. For example, time series constitute a popular class of sequential data, where records are indexed by time.
Temporal data mining is an important extension as it has the capability of mining activity rather than just states and, thus, inferring relationships of contextual and temporal proximity, some of which may also indicate a cause-effect association. Moreover, temporal data mining has the ability to mine the behavioral aspects of (communities of) objects as opposed to simply mining rules that describe their states at a point in time, i.e., there is the promise of understanding why rather than merely what. [1]
Many studies in conventional data mining distinguish two strategic goals for the discovery process: 1) the description of the characteristics of a population and 2) the prediction of its evolution in the future. In the context of the discovery of similarities in temporal data, we categorize data mining research across three dimensions: data type, mining paradigm, and temporal ordering. [7]
Temporal data mining approaches depend on the nature of the event sequence being studied. Probably the most common form of temporal data mining, time series analysis, is used to mine sequences of continuous real-valued elements and is often regression based, relying on the prespecified definition of a model. Moreover, standard time series analysis techniques typically are examples of supervised learning; in other words, they estimate the effects of a set of independent variables on a dependent variable. [5] Another common area of temporal data mining research is sequence analysis. Sequence analysis is often used when the sequence is composed of a series of nominal symbols. A sequential pattern is a subsequence that appears frequently in a sequence database. Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data mining task and has broad applications, such as business analysis, web mining, security, and bio-sequence analysis. [6],[9]

B. Data Visualization

Visual data exploration [4] aims at integrating the human in the data exploration process, applying human perceptual abilities to the large data sets available in today's computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the human to get insight into the data, draw conclusions, and directly interact with the data. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Visual data exploration usually follows a three-step process: Overview, Zoom and Filter, Details-on-Demand.
First, the user needs to get an overview of the data. In the overview, the user identifies interesting patterns and focuses on one or more of them. For analyzing the patterns, the user needs to drill down and access details of the data. Visualization technology may be used for all three steps of the data exploration process.
Visualization techniques are useful for showing an overview of the data, allowing the user to identify interesting subsets. In this step, it is important to keep the overview visualization while focusing on the subset using another visualization technique. An alternative is to distort the overview visualization in order to focus on the interesting subsets. To further explore the interesting subsets, the user needs a drill-down capability in order to get the details about the data. Visualization technology not only provides the base visualization techniques for all three steps, but also bridges the gaps between the steps.
The techniques can be classified based on three criteria: 1) the data to be visualized, 2) the visualization technique, and 3) the interaction and distortion technique used.
Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data. In scientific visualization, the graphical models are typically constructed from measured or simulated data representing objects or concepts associated with phenomena from the physical world. As such, the data and, hence, its derived visual representations represent objects that exist in a 1D (one-dimensional), 2D, or 3D object space. [2]
Eventually, data will also include a temporal dimension, and the presence of spatial and temporal dimensions is a determinant factor in deriving visual representations from the data. In information visualization, the graphical models may represent abstract concepts and

relationships that do not necessarily have a counterpart in the physical world, e.g., information describing user accesses to pages of an Internet portal or records describing selected properties of different car brands and models. Typically, each data unit describes multiple related attributes (usually more than four) that are not of a spatial or temporal nature. Interaction techniques provide the user with the ability to dynamically change the visual representation and can empower the user's perception of information. A comprehensive framework for user interface techniques used in visualization systems includes: 1) Interactive Filtering 2) Interactive Zooming

Interactive Filtering: A number of interaction techniques have been developed to improve interactive filtering in data exploration. An example of an interactive tool which can be used for interactive filtering is Magic Lenses. The basic idea of Magic Lenses is to use a tool like a magnifying glass to support filtering the data directly in the visualization. The data under the magnifying glass is processed by the filter and the result is displayed differently than the remaining data set.

Interactive Zooming: Zooming is a well-known technique which is widely used in a number of applications. In dealing with large amounts of data, it is important to present the data in a highly compressed form to provide an overview of the data, but, at the same time, allow a variable display of the data at different resolutions. Zooming not only means displaying the data objects larger, but also means that the data representation automatically changes to present more details at higher zoom levels.

III. TEMPORAL CLUSTER GRAPHS

The temporal cluster graph is a new data mining technique for identifying and visualizing trends in multiattribute temporal data. It allows the user to adjust and visualize the clustering solution for each partition. Hierarchical and graph-based techniques are used by the temporal cluster graph to provide interactive filtering and zooming capabilities for visualization. The temporal cluster graph is a directed graph that consists of a set of nodes V = {V1, V2, ..., Vt}, where each subset Vi corresponds to a data partition and contains ki nodes.

A. Temporal Cluster Graph Definition

To obtain the graph several steps are required:

1. The transactional data set D is partitioned based on time periods into t data subsets D1, ..., Dt (indexed chronologically), and each Di is a multiattribute data subset containing records with m attributes.
2. Data within each partition is then clustered using the Apriori algorithm.
3. The node Vi,j ∈ Vi is the jth node in the ith partition. Nodes are labeled with the size of the cluster they represent (i.e., the number of data points in that cluster).
4. Edges connect nodes in adjacent partitions and are labeled with a distance value between the two nodes, thus representing the similarity between the clusters connected by the edge.

B. Graph Parameters
Three graph parameters are proposed for displaying information at different levels of analysis:
1. Partition Zoom
2. Within-Period Trend Strength
3. Cross-Period Trend Strength

C. Partition Zoom

The zoom feature dynamically changes the size of the clustering solution in a data partition. It allows users to apply their domain expertise by adjusting in real time the underlying clustering solution used to build a trend graph and to interactively evaluate multiple trend views.
Each data partition Di has a corresponding ki value, where ki refers to the number of clusters in the clustering solution for that partition. For example, a value of ki = 5 corresponds to the clustering solution for the ith partition that contains exactly five clusters.

D. Within-Period Trend Strength

Within-period trend strength is a user-specified parameter that can be used to determine if nodes generated by the clustering solution are "strong" enough to be included in the trend analysis.
Within-period trend strength is denoted by the parameter α. Each data partition utilizes the same value of α. The nodes in the trend graph are created by clustering each data partition. For every data partition Di, the clustering solution contains ki clusters, and some of these clusters can be filtered out based on the within-period trend strength parameter α.

E. Cross-Period Trend Strength
Cross-period trend strength is a user-specified parameter that can be used to filter out spurious edges based on their weight. Cross-period trend strength is denoted by the parameter β.
An edge is included in the output graph if it satisfies two criteria:
1. The edge is incident to two nodes that are both included in the output graph (as determined by the clustering solution and within-period trend strength α).
2. The edge weight is less than or equal to a threshold η that depends on the cross-period trend strength β.
The edge threshold η is calculated by taking the

average of the weights of all the possible edges among the nodes in two adjacent data partitions (say, partitions i and i+1) and adjusting it by the user-specified β parameter. Only edges with weights at or below this adjusted average are included in the graph.

V. C-TREND IMPLEMENTATION

A. C-TREND Overview

C-TREND (Cluster-based Temporal Representation of EveNt Data) is a new method for discovering and visualizing trends and temporal patterns in transactional attribute-value data that builds upon standard data mining clustering techniques.
The C-TREND technique consists of two major processes:
1) Offline preprocessing of the data
2) Online interactive analysis and visualization of the trends

B. Offline Preprocessing of the Data

In the preprocessing phase, the data set is partitioned based on time periods, and each partition is clustered using one of many traditional clustering techniques such as a hierarchical approach. The results of the clustering for each partition are used to generate two data structures: the node list and the edge list.
Creating these lists in the preprocessing phase allows for more effective (real-time) visualization updates of the C-TREND output graphs. Based on these data structures, graph entities (nodes and edges) are generated and rendered as a temporal cluster graph in the system output window.

B1. Data Clustering

C-TREND can be implemented with multiple different standard clustering algorithms (e.g., agglomerative or divisive hierarchical clustering or partition-based clustering) and could be expanded to include new efficient clustering techniques such as clustering by passing messages between data points. Specifically, the Apriori algorithm is utilized here, and the clustering is performed separately for each partition of data.
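
The node list and edge list produced by preprocessing, and the α and β filters applied to them, might be sketched as follows. This is a minimal illustration under stated assumptions, not the C-TREND implementation: each partition is assumed to be already clustered (the clustering step is left pluggable), edge weights are assumed to be Euclidean distances between cluster centroids, a node is assumed to pass when its size is at least α times the size of its partition, and the threshold is assumed to be η = β × (mean weight of all possible edges between two adjacent partitions). The names `Node`, `build_node_list`, `build_edge_list`, and `filter_graph` are hypothetical.

```python
from dataclasses import dataclass
from math import dist
from statistics import mean


@dataclass(frozen=True)
class Node:
    partition: int  # index i of the time period
    index: int      # index j of the cluster within partition i
    size: int       # number of data points in the cluster
    center: tuple   # cluster centroid (an assumed cluster representation)


def build_node_list(clustered_partitions):
    """clustered_partitions[i][j] holds the points of cluster j in period i."""
    nodes = []
    for i, clusters in enumerate(clustered_partitions):
        for j, points in enumerate(clusters):
            center = tuple(mean(axis) for axis in zip(*points))
            nodes.append(Node(i, j, len(points), center))
    return nodes


def build_edge_list(nodes):
    """All possible edges between nodes of adjacent partitions, weighted by distance."""
    return [
        (a, b, dist(a.center, b.center))
        for a in nodes
        for b in nodes
        if b.partition == a.partition + 1
    ]


def filter_graph(nodes, edges, alpha, beta):
    """Apply the within-period (alpha) and cross-period (beta) filters."""
    totals = {}
    for n in nodes:
        totals[n.partition] = totals.get(n.partition, 0) + n.size
    # Assumption: keep a node holding at least alpha of its partition's points.
    kept = {n for n in nodes if n.size >= alpha * totals[n.partition]}

    # Assumption: eta = beta * mean weight of all possible edges from period i to i+1.
    weights = {}
    for a, _, w in edges:
        weights.setdefault(a.partition, []).append(w)
    eta = {i: beta * mean(ws) for i, ws in weights.items()}

    kept_edges = [
        (a, b, w) for a, b, w in edges
        if a in kept and b in kept and w <= eta[a.partition]
    ]
    return kept, kept_edges
```

Because both lists are computed once, changing α or β only re-runs the inexpensive `filter_graph` step, which is consistent with the real-time updates described above.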

APRIORI ALGORITHM
The Apriori steps are as follows:

1) Count item occurrences to determine the frequent item sets of size one.
2) Generate candidate item sets from the frequent item sets found so far.
3) Count the support of the candidate item sets; the pruning process ensures that every subset of a candidate is already known to be a frequent item set.
4) Use the frequent item sets to generate the desired rules.

Algorithm 1. Apriori

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
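
Algorithm 1 can be turned into a short runnable sketch. The version below follows the same loop (generate candidates from Lk, count them against every transaction, keep those meeting the minimum support) and returns the union of all Lk. It makes a few assumptions that the pseudocode leaves open: `min_support` is an absolute transaction count, candidates are generated by uniting pairs of frequent k-itemsets, and the sample baskets are invented for illustration. Rule generation (step 4) is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Scan the database once and keep candidates meeting min_support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent single items.
    current = count({frozenset([item]) for t in transactions for item in t})
    frequent = dict(current)

    k = 1
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=2)
```

With these baskets and a minimum support of two transactions, the pair {milk, butter} occurs only once, so the prune step discards the three-item candidate before it is ever counted.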

Fig 2. The C-TREND Process

C-TREND produces a dendrogram for each data partition and utilizes a global input value N that represents the maximum-sized cluster solution maintained for each data partition. A useful solution will consist of a set of N << n clusters (n is the number of data points in partition i) and,

therefore, C-TREND has to store only 2N-1 nodes per partition.

VI. INTERACTIVE DATA VISUALIZATION

Interactive analysis includes the presentation of output graphs in a graphical user interface (GUI) that allows the user to adjust k for each partition and the α and β parameters, which prompts C-TREND to redraw the output graph based on these new values in real time.
C-TREND utilizes a series of validation flags to maintain and update the displayed state of the output trend graph. Combinations of the validation flags are used to determine whether or not each possible edge and node should be displayed in the graph, and as these flags change, the displayed components of the graph also change.
Each cluster in the node list (dendrogram data structures) possesses two flags: k-pass and α-pass. These flags are used to indicate whether the cluster should be included in the output graph based on the ki value and the α value, respectively. Specifically, when ki is changed, the dendrogram data structure is updated so that only the clusters that should be extracted for the clustering solution of size ki have a valid k-pass flag. Similarly, when α is changed, the dendrogram data structure is updated so that only the clusters that are large enough to pass the node filter based on α are assigned a valid α-pass flag.
The nodes that have both valid k-pass and α-pass flags make up the set of nodes that are both large enough and in the desired clustering solution and therefore are included in the output graph.

VII. CONCLUSION

By harnessing computational techniques of data mining, we have developed a new temporal clustering technique for discovering, analyzing, and visualizing trends in multiattribute temporal data.
The proposed technique is versatile, and the implementation of the technique as the C-TREND system gives significant data representation power to the user: domain experts have the ability to adjust parameters and clustering mechanisms to fine-tune trend graphs.
The C-TREND implementation is scalable: the time required to adjust trend parameters is quite low even for larger data sets, which provides for real-time visualization capabilities.
Furthermore, the proposed temporal clustering analysis technique is applicable in many different data analysis contexts and can provide insights for analysts performing historical analyses and generating forecasts.

REFERENCES

[1] C.M. Antunes and A.L. Oliveira, “Temporal Data Mining: An Overview,” Proc. ACM SIGKDD Workshop Data Mining, pp. 1-13, Aug. 2001.

[2] M.C.F. de Oliveira and H. Levkowitz, “From Visual Data Exploration to Visual Data Mining: A Survey,” IEEE Trans. Visualization and Computer Graphics, vol. 9, no. 3, pp. 378-394, July-Sept. 2003.

[3] A. Jain, M. Murty, and P. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[4] D.A. Keim, “Information Visualization and Visual Data Mining,” IEEE Trans. Visualization and Computer Graphics, vol., no. 1, pp. 1-8, 2002.

[5] E. Keogh and S. Kasetty, “On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration,” Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 349-371, 2003.

[6] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 10, pp. 1-17, Oct. 2004.

[7] J. Roddick and M. Spiliopoulou, “A Survey of Temporal Knowledge Discovery Paradigms and Methods,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 4, pp. 750-767, July/Aug. 2002.

[8] S.F. Roth and J. Mattis, “Data Characterization for Intelligent Graphics Presentations,” Proc. Conf. Human Factors in Computing Systems (CHI ’90), pp. 193-200, 1990.

[9] M. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31-60, 2001.
