
Nimrat Kaur Sidhu*, Rajneet Kaur**

*(Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India)

**(Assistant Professor, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India)

Abstract--Clustering is the most commonly used technique of data mining, under which patterns are discovered in the underlying data. This paper presents how clustering is carried out and discusses the applications of clustering. It also provides a framework for the mixed-attributes clustering problem and shows how customer data can be clustered to identify high-profit, high-value and low-risk customers.
Keywords-- Data Mining, Customer Clustering, Categorical Data

I. INTRODUCTION
Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications. Three important components of a data mining system are the databases, the data mining engine, and the pattern evaluation modules. Next [1] are a few important definitions that are used in the clustering technique of data mining.

A. Cluster
A cluster is an ordered list of objects which have some common characteristics. The objects belong to an interval [a, b].

B. Distance between Two Clusters
The distance between two clusters involves some or all elements of the two clusters; the clustering method determines how the distance is computed. The distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as d = [ Σ (pi − qi)² ]^(1/2).

C. Similarity Measures
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents Di and Dj. Typical similarity measures generate values of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

D. Threshold
The threshold is the lowest input value of similarity required to join two objects in one cluster. A threshold T(J) is given for the Jth variable (1 ≤ J ≤ N). Cases are partitioned into clusters so that within each cluster the Jth variable has a range less than T(J). The thresholds should be chosen fairly large, especially if there are many variables. The procedure is equivalent to converting each variable to a categorical variable (using the thresholds to define the categories); the clusters are then cells of the multidimensional contingency table between all variables.

E. Similarity Matrix
The similarities between objects, calculated by the function SIMILAR(Di, Dj) and represented in the form of a matrix, constitute a similarity matrix.

F. Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. every incoming object's similarity is compared with the initiator. The initiator is called the cluster seed.
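The definitions above can be put together in a short sketch. The code below is not from the paper: it uses a simple inverse-distance similarity function (one of many ways to map a distance into [0, 1]) and hypothetical 2-D points, but it shows the Euclidean metric (I.B), the similarity matrix (I.E), and single-pass clustering against cluster seeds with a threshold (I.D, I.F):

```python
import math

def euclidean(p, q):
    """d = [ sum_i (p_i - q_i)^2 ]^(1/2), the Euclidean metric of Section I.B."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def similar(di, dj):
    """Map a distance into [0, 1]: 1 for perfect agreement, 0 in the limit of
    no agreement. This inverse-distance form is an illustrative choice."""
    return 1.0 / (1.0 + euclidean(di, dj))

def similarity_matrix(objects):
    """Matrix of SIMILAR(Di, Dj) over all pairs (Section I.E)."""
    return [[similar(a, b) for b in objects] for a in objects]

def seed_clustering(objects, threshold):
    """Single-pass clustering: every incoming object is compared with the
    seed (initiator) of each existing cluster (Sections I.D and I.F)."""
    clusters = []  # each cluster is a list; clusters[k][0] is its seed
    for obj in objects:
        for c in clusters:
            if similar(obj, c[0]) >= threshold:
                c.append(obj)
                break
        else:
            clusters.append([obj])  # obj becomes the seed of a new cluster
    return clusters

points = [(0, 0), (0.1, 0.2), (5, 5), (5.2, 4.9)]
groups = seed_clustering(points, threshold=0.5)
print(len(groups))  # the two well-separated groups are recovered
```

The threshold 0.5 here is arbitrary; as Section I.D notes, thresholds should be chosen fairly large relative to the similarity scale, or nearby objects end up in separate clusters.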

ISSN: 2231-2803

http://www.ijcttjournal.org


International Journal of Computer Trends and Technology (IJCTT) - volume4Issue4 April 2013

II. BASIC CLUSTERING STEPS

A. Preprocessing and Feature Selection
Most clustering models assume that all data items are represented by n-dimensional feature vectors. This step therefore involves choosing an appropriate feature set, and performing appropriate preprocessing and feature extraction on the data items to measure the values of the chosen features. It is often desirable to choose a subset of all the available features, to reduce the dimensionality of the problem space. This step often requires a good deal of domain knowledge and data analysis.

B. Similarity Measure
The similarity measure plays an important role in the clustering process, where a set of objects is grouped into several clusters so that similar objects are in the same cluster and dissimilar ones in different clusters. In clustering, an object is represented by its features, and the similarity relationship between objects is measured by a similarity function: a function which takes two sets of data items as input and returns a similarity measure between them.

C. Clustering Algorithm
Clustering algorithms are general schemes which use particular similarity measures as subroutines. The choice of clustering algorithm depends on the desired properties of the final clustering, e.g. the relative importance of compactness, parsimony, and inclusiveness; other considerations include the usual time and space complexity. A clustering algorithm attempts to find natural groups of components (or data) based on some similarity, and also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is basically a statistical description of the cluster centroids, together with the number of components in each cluster [2].

D. Result Validation
Do the results make sense? If not, we may want to iterate back to some prior stage. It may also be useful to test for clustering tendency, to try to guess whether clusters are present at all; note that any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist [3].

E. Result Interpretation and Application
Typical applications of clustering include data compression (representing data samples by their cluster representative), hypothesis generation (looking for patterns in the clustering of the data), hypothesis testing (e.g. verifying feature correlations or other data properties through a high degree of cluster formation), and prediction (once clusters have been formed from data and characterized, new data items can be classified by the characteristics of the cluster to which they would belong).

III. CLUSTERING TECHNIQUES
Traditionally, clustering techniques are broadly divided into hierarchical and partitioning techniques. Hierarchical clustering is further subdivided into agglomerative and divisive.

A. Agglomerative
Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

B. Divisive
Start with one all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide at each step which cluster to split and how to perform the split.

Hierarchical techniques produce a nested sequence of partitions, with a single all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The result of a hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram. This tree displays the merging process and the intermediate clusters; for document clustering, the dendrogram provides a taxonomy, or hierarchical index. [2]
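The agglomerative procedure of Section III.A can be sketched in a few lines. This is an illustrative single-linkage implementation on hypothetical 2-D points, not an algorithm used later in the paper; production systems would use an optimized library routine:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, k):
    """Agglomerative clustering (Section III.A): start with each point as
    its own cluster and repeatedly merge the closest pair of clusters until
    k clusters remain. Cluster distance is the minimum pairwise distance
    between members (single linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best_i, best_j, best_d = 0, 1, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # merge the closest pair; popping j (> i) does not shift index i
        clusters[best_i].extend(clusters.pop(best_j))
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
result = single_linkage(pts, 3)
print(sorted(len(c) for c in result))  # two pairs and one singleton
```

Recording the sequence of merges (instead of stopping at k clusters) yields exactly the nested sequence of partitions that a dendrogram displays.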


IV. CUSTOMER DATA CLUSTERING
Customer clustering is the most important data mining methodology used in marketing and customer relationship management (CRM). It uses customer-purchase transaction data to track buying behavior and create strategic business initiatives. Companies want to keep high-profit, high-value, and low-risk customers. This cluster typically represents the 10 to 20 percent of customers who create 50 to 80 percent of a company's profits; a company would not want to lose these customers, and the strategic initiative for this segment is obviously retention. A low-profit, high-value, and low-risk customer segment is also an attractive one, and the obvious goal here would be to increase its profitability.

A. Architecture
The approach is a two-phase model. In the first phase, the data are collected from our organization's retail smart store and cleansed: the noise is removed first, so incomplete, missing and irrelevant data are removed, and the rest is formatted into the required format. In the second phase, the clusters are generated and profiled to identify the best clusters. Fig. 1 illustrates the whole process. [3]

Fig 1. Clustering Process

B. Experiments and Results
For this study, the transaction data of our organization's retail smart store have been taken. Using these data, customers have been clustered using the IBM Intelligent Miner tool. The first steps in the clustering process involve selecting the data set and the algorithm. Two types of algorithms are available in the I-Miner process: [3]
1) Demographic clustering
2) Neural clustering
In this exercise, the demographic clustering process has been chosen, since it works best for continuous data and all the data types in the data set are continuous. The next step in the process is to choose the basic run parameters. The basic parameters available for demographic clustering include:
1) Maximum number of clusters
2) Maximum number of passes through the data
3) Accuracy
4) Similarity threshold
The input parameters for customer clustering are:
1) Recency
2) Total customer profit
3) Total customer revenue
4) Top revenue department
The data are first extracted from the Oracle databases and flat files and converted into flat files. Subsequently, the I-Miner process picks up the file and processes it. The entire output data set has the customer information appended to the end of each record.

C. Cluster Profiling
The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to assess the potential business value of each cluster quantitatively, by profiling the aggregate values of the shareholder-value variables by cluster.

V. CLUSTERING NUMERIC AND CATEGORICAL DATA
Clustering typically groups data into sets in such a way that intra-cluster similarity is maximized while inter-cluster similarity is minimized.

A. Cluster Ensembles
A cluster ensemble combines several runs of different clustering algorithms to obtain a common partition of the original dataset, aiming for consolidation of the results from a portfolio of individual clustering results. In [4], the authors formally defined the cluster ensemble problem as an optimization problem and proposed combiners for solving it based on a hyper-graph model.

B. Cluster Ensemble: The Viewpoint of Categorical Data Clustering
Clustering aims at discovering groups and patterns in data sets. In general, the output produced by a particular clustering algorithm is an assignment of the data objects in the dataset to different groups; in other words, it is sufficient to identify each data object with a unique cluster label. From the viewpoint of clustering, data objects with different cluster labels are considered to be in different clusters: if two objects are in the same cluster they are considered fully similar, and otherwise fully dissimilar. Thus, cluster labels cannot be given a natural ordering in the way real numbers can; that is, the output of a clustering algorithm can be viewed as categorical. Since the output of each individual clustering algorithm is categorical, the cluster ensemble problem can be viewed as a categorical data clustering problem, in which the runs of the different clustering algorithms are combined into a new categorical dataset. Transforming the cluster ensemble problem into a categorical data clustering problem has the following advantages. First, some efficient algorithms for clustering categorical data have been proposed recently [1,8,14]; these algorithms can be fully exploited, and the cluster ensemble problem can benefit from advances in the research on categorical data clustering. Further, the problem of categorical data clustering is relatively simple and provides a unified framework for problem formalization. For clustering datasets with mixed types of attributes, we propose a novel divide-and-conquer technique. First, the original mixed dataset is divided into two sub-datasets: the pure categorical dataset and the pure numeric dataset. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce the corresponding clusters. Last, the clustering results on the categorical and numeric datasets are combined as a categorical dataset, on which a categorical data clustering algorithm is exploited to get the final clusters.

C. Overview
The steps involved in the cluster ensemble based algorithm framework are described in figure 1. First, the original mixed dataset is divided into two sub-datasets: the pure categorical dataset and the pure numeric dataset. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce the corresponding clusters. Finally, the clustering results on the categorical and numeric datasets are combined as a categorical dataset, on which the categorical data clustering algorithm is exploited to get the final clusters. Because the framework gets its clustering output from clustering both the split categorical dataset and the split numeric dataset, it is named CEBMDC (Cluster Ensemble Based Mixed Data Clustering).
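The three CEBMDC steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: a toy k-means stands in for the numeric clusterer, a greedy overlap-based procedure stands in for the categorical clusterer (where a dedicated algorithm such as Squeezer [5] or ROCK [6] would be used), and the customer records and column meanings are hypothetical:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: stand-in clusterer for the pure numeric sub-dataset."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to the nearest center (squared Euclidean distance)
        labels = [min(range(k),
                      key=lambda c: sum((p[i] - centers[c][i]) ** 2
                                        for i in range(len(p))))
                  for p in points]
        # recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def categorical_cluster(rows, k):
    """Toy categorical clusterer: greedy seed matching on attribute overlap,
    a stand-in for a real categorical algorithm such as Squeezer [5]."""
    seeds, labels = [], []
    for r in rows:
        overlaps = [sum(a == b for a, b in zip(r, s)) for s in seeds]
        if seeds and max(overlaps) >= len(r) / 2:
            labels.append(overlaps.index(max(overlaps)))
        elif len(seeds) < k:
            seeds.append(r)
            labels.append(len(seeds) - 1)
        else:
            labels.append(overlaps.index(max(overlaps)))
    return labels

def cebmdc(numeric_part, categorical_part, k):
    """CEBMDC: cluster each sub-dataset separately, combine the two label
    columns into a new categorical dataset, and cluster that for the final result."""
    num_labels = kmeans(numeric_part, k)
    cat_labels = categorical_cluster(categorical_part, k)
    combined = [(str(a), str(b)) for a, b in zip(num_labels, cat_labels)]
    return categorical_cluster(combined, k)

# hypothetical mixed customer records, split into the two sub-datasets:
# numeric attributes (revenue, recency) and categorical attributes (region, channel)
numeric = [(100.0, 2.0), (110.0, 3.0), (5.0, 40.0), (8.0, 45.0)]
categorical = [("north", "retail"), ("north", "retail"),
               ("south", "web"), ("south", "web")]
final = cebmdc(numeric, categorical, k=2)
print(final)  # rows 0-1 end up in one cluster, rows 2-3 in the other
```

The final categorical-clustering pass is what reconciles the two independent partitions: rows whose numeric and categorical labels agree are grouped together even if the two base clusterers numbered their clusters differently.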

VI. CONCLUSION
Clustering has proved to be the most extensively used data mining technique. In this paper, its various applications were discussed. For customer clustering, the demographic clustering technique was used. For categorical data, existing clustering algorithms were integrated so that they can be fully exploited. In future, alternative clustering algorithms can be integrated into the algorithm framework to gain better insight, and further advances can be made in clustering techniques.

VII. REFERENCES

[1] I.K. Ravichandra Rao, "Data Mining and Clustering Techniques", DRTC Workshop on Semantic Web, 8-10 December 2003, DRTC, Bangalore.
[2] John A. Hartigan, "Clustering Algorithms", John Wiley, New York, 1975.
[3] Dr. Sankar Rajagopal, "Customer Data Clustering Using Data Mining Technique", International Journal of Database Management Systems (IJDMS), Vol. 3, No. 4, November 2011.
[4] A. Strehl, J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitions", Proc. of the 8th National Conference on Artificial Intelligence and 4th Conference on Innovative Applications of Artificial Intelligence, pp. 93-99, 2002.
[5] Z. He, X. Xu, S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science and Technology, Vol. 17, No. 5, pp. 611-625, 2002.
[6] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proc. 1999 Int. Conf. Data Engineering, pp. 512-521, Sydney, Australia, March 1999.
[7] Ke Wang, Chu Xu, Bing Liu, "Clustering Transactions Using Large Items", Proc. 1999 ACM International Conference on Information and Knowledge Management, pp. 483-490, 1999.

