
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617, www.journalofcomputing.org

The Formulation of a Data Mining Theory for the Knowledge Extraction by means of a Multiagent System
DOST MUHAMMAD KHAN 1, NAWAZ MOHAMUDALLY 2, D K R BABAJEE 3

1 Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN & PhD Student, School of Innovative Technologies & Engineering, University of Technology Mauritius (UTM), MAURITIUS

2 Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS

3 Lecturer, Department of Applied Mathematical Sciences, SITE, University of Technology, Mauritius

Abstract: Data mining is the extraction of useful information from datasets, providing the knowledge needed for effective decision making. Extracting knowledge from a dataset is not a single-step process; rather, it is a multi-step process involving different data mining processes such as clustering, classification and visualization, where the output of one data mining process is the input of another. The existing data mining algorithms and techniques focus on one specific data mining task at a time and thus fail to address the fundamental need of supporting the extraction of knowledge as a multi-step process. Therefore, in this paper we propose a multi-step data mining theory (DMT) and a framework for the extraction of knowledge called the knowledge extraction process (KEP). The clusters of the given dataset are generated in the first step; the second step consists of two data mining processes, classification and visualization, which take the output of the first step and generate the knowledge as the final output. A MAS approach is used to validate the proposed DMT.

Key-Words: MAS, DMT, KDD, KEP

1 Introduction
The goals of data mining are pattern and feature extraction, visualization of data and evaluation of results, and thus the discovery of knowledge from datasets. These goals can be achieved by using different data mining algorithms. According to Y. Y. Yao, a data mining system may be viewed as an intermediate system between a database and an application, whose main task is to change the data into usable knowledge. The existence of knowledge in the data is unrelated to whether we have an algorithm to extract it. Many studies on data mining rely on the existence of an algorithm rather than the existence of knowledge. Worse, more often than not one implicitly assumes that a particular type of knowledge is useful simply because an algorithm has been found for it. It is therefore not surprising that we have many algorithms mining easily extractable knowledge, which may not necessarily be interesting and useful [1]. According to Theodore Johnson et al., knowledge discovery from databases (KDD) is not a one-step process; it is a multi-step process, where the output of one step can be the input of another, involving different data mining processes such as data partitioning, aggregation and data transformation. The existing data mining algorithms and techniques focus on one specific task at a time and thus fail to address the fundamental need of supporting KDD as a multi-step process [2]. According to Usama Fayyad et al., data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that produce a particular enumeration of patterns over the data. The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user [3][4][5][6]. Data mining is an iterative process, and different processes are required to extract the knowledge from the dataset. According to J. D. Ullman, the data mining process, also known as the data mining life cycle, is a six-step process as illustrated in Fig. 1.



Fig. 1 The Data Mining Processes

The first process is data gathering; most businesses already perform data gathering in the form of data warehousing and web crawling. The second process is data cleansing or pre-processing, in which the noise within the data and irrelevant data are removed; bogus data and/or errors are also eliminated. The third process is dataset preparation: the relevant data is selected, analyzed and transformed into the proper format for mining, and one might decide to remove some of the data or add additional data. A well-prepared dataset can significantly improve the knowledge extracted through data mining. The most important process in the data mining life cycle is pattern extraction and discovery, which is often thought of as data mining itself and requires concentration and effort. A feature is a combination of attributes in the data that is of special interest and captures important characteristics of the data. The remaining processes are visualization and evaluation of results. The visualization of each association helps to interpret the results and transform them into knowledge as the final output; hence all these processes are required in the extraction of knowledge from a dataset [7][8][9][10][11].

A multiagent system (MAS) approach is useful in designing systems whose domains require a MAS, and it speeds up the performance of the system. A MAS approach has been successfully implemented in the application of a Unified Medical Data Miner (UMDM), in the integration of the K-means clustering and Decision Tree data mining algorithms, and in the generation of initial centroids for the K-means clustering data mining algorithm [12][17][18][19][20][21]. In this paper we formulate a data mining theory for the extraction of knowledge from a given dataset, in which either the combination of clustering and classification or the combination of clustering and visualization generates the knowledge. The rest of the paper is organized as follows: Section 2 deals with the mathematical formulation of the proposed theory, Section 3 discusses the methodology, a MAS approach, Section 4 presents the results and discussion, and finally the conclusion is drawn in Section 5.

2 Mathematical Formulation of the Proposed Theory

The proposed theory is that the unification of clustering and classification, and the unification of clustering and visualization, each followed by the interpretation and evaluation of the results, generate the knowledge. This is shown in equations (1) and (2).

Clustering ∧ Classification → Knowledge    (1)
Clustering ∧ Visualization → Knowledge    (2)

The proposed DMT is first formulated through a Boolean expression. Suppose that the inputs A, B and C denote the clustering, classification and visualization data mining processes respectively, and that K is the extracted knowledge, as output. Table 1 (truth table) further illustrates the DMT.

Table 1 Truth Table
Sr.#  A  B  C  K
1     0  0  0  0
2     0  0  1  0
3     0  1  0  0
4     0  1  1  0
5     1  0  0  0
6     1  0  1  1
7     1  1  0  1
8     1  1  1  1

The table shows that the output K is 1 if all three inputs are 1, or if the input A is 1 and either input B or input C is 1. The Boolean expression derived from the truth table is given below:

Knowledge = (Clustering ∧ Classification) ∨ (Clustering ∧ Visualization)    (3)

Equation (3) proves the proposed theory that the knowledge is extracted either from the combination of clustering and classification or from the combination of clustering and visualization.
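As a quick illustrative check (not part of the original paper), the following short Python sketch enumerates the eight input combinations of Table 1 and confirms that the expression K = A ∧ (B ∨ C), which is equivalent to equation (3), reproduces the output column K.

```python
from itertools import product

# A = clustering, B = classification, C = visualization, K = knowledge.
# Equation (3): K = (A AND B) OR (A AND C) = A AND (B OR C).
for sr, (a, b, c) in enumerate(product([0, 1], repeat=3), start=1):
    k = a and (b or c)
    print(sr, a, b, c, int(k))   # matches the Sr.#, A, B, C, K columns of Table 1
```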

We also formulate the proposed theory through mathematical functions. When thinking about inputs and outputs, we can treat algorithms as functions: input a value into the algorithm, follow the prescribed steps, and get an output. A function is a relationship between an input variable and an output variable in which there is exactly one output for each input. An algorithm is a function if it is consistent, i.e. every time it is given an input it produces an output, and each input produces exactly one possible output.


It is not necessary that all functions work on numbers, and not all functions need to follow a computational algorithm [14][15][16]. The selected data mining algorithms satisfy the conditions of a function; therefore these algorithms are applied as functions. Let

S = {set of n attributes}
A = {set of n clusters of n attributes}
B = {set of n rules/classifiers of n clusters of n attributes}
C = {set of n 2D graphs of n clusters of n attributes}

We can describe this in a more specific way. Let S = {s1, s2, ..., sn}, where s1, s2, ..., sn are partitioned datasets containing at least any two attributes and only one class attribute of the original dataset.

A = {C1(c1, c2, ..., cn), C2(c1, c2, ..., cn), ..., Cn(c1, c2, ..., cn)}

where C1(c1, c2, ..., cn) is the set of clusters of partition s1, C2(c1, c2, ..., cn) is the set of clusters of partition s2, ..., and Cn(c1, c2, ..., cn) is the set of clusters of partition sn.

B = {R1(r1, r2, ..., rn), R2(r1, r2, ..., rn), ..., Rn(r1, r2, ..., rn)}

where R1(r1, r2, ..., rn) is the set of rules/classifiers of cluster C1, R2(r1, r2, ..., rn) is the set of rules/classifiers of cluster C2, ..., and Rn(r1, r2, ..., rn) is the set of rules/classifiers of cluster Cn.

C = {V1(v1, v2, ..., vn), V2(v1, v2, ..., vn), ..., Vn(v1, v2, ..., vn)}

where V1(v1, v2, ..., vn) is the set of 2D graphs of cluster C1, V2(v1, v2, ..., vn) is the set of 2D graphs of cluster C2, ..., and Vn(v1, v2, ..., vn) is the set of 2D graphs of cluster Cn. Fig. 2 depicts these sets and their respective algorithms.

Fig. 2 The Composition of the Functions

1. f : S → A, f(Si) = Cj, where i, j = 1, 2, ..., n.
2. g : A → B, g(Cj) = Rk, where j, k = 1, 2, ..., n.
3. h : A → C, h(Cj) = Vl, where j, l = 1, 2, ..., n.

The function f is a mapping from the set of datasets to the set of clusters, the function g is a mapping from the set of clusters to the set of rules/classifiers, and the function h is a mapping from the set of clusters to the set of 2D graphs. The function f takes the set of data as input, applies the K-means clustering algorithm and produces the clusters as output. The function g takes the clusters as input, applies the C4.5 (decision tree) algorithm and produces the rules/classifiers as output. The function h takes the clusters as input, applies the data visualization algorithm and produces 2D graphs as output. The knowledge can be interpreted and evaluated either from the set of rules or from the 2D graphs, and is accepted or rejected by the user.

Lemma: f, g and h are functions.
Proof:
1. f is a function. Suppose that f is not a function; then s1 could be mapped to both cluster set 1 and cluster set 2. This is impossible because clustering requires that s1 is mapped to cluster set 1 only, which leads to a contradiction. Therefore f is a function.
2. g is a function. We define a rule depending on the cluster: cluster 1 has rule 1/classifier 1 and cluster 2 has rule 2/classifier 2. We cannot define two rules/classifiers for one cluster, otherwise we would not be able to classify the attributes. Therefore g is a function.
3. h is a function. The 2D graph depends on the cluster, and a cluster cannot produce two different 2D graphs. Therefore h is a function.
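To make the lemma concrete, here is a minimal Python sketch (our own illustration; scikit-learn's KMeans and CART-style DecisionTreeClassifier stand in for the paper's K-means and C4.5 implementations, and a scatter plot stands in for the 2D graph) in which f, g and h are ordinary functions that can be composed as g ∘ f and h ∘ f:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

def f(k, n, X, y):
    """f : S -> A. Split a partition (attributes X, class y) into k clusters with K-means."""
    labels = KMeans(n_clusters=k, max_iter=n, n_init=10, random_state=0).fit_predict(X)
    return [(X[labels == i], y[labels == i]) for i in range(k)]   # the set of clusters C_j

def g(clusters, feature_names):
    """g : A -> B. Fit a decision tree on each cluster and return its if-then rules."""
    return [export_text(DecisionTreeClassifier().fit(Xc, yc), feature_names=feature_names)
            for Xc, yc in clusters]                               # the set of rules R_k

def h(clusters):
    """h : A -> C. Produce one 2D graph (scatter of the two attributes) per cluster."""
    figures = []
    for Xc, _ in clusters:
        fig, ax = plt.subplots()
        ax.scatter(Xc[:, 0], Xc[:, 1])
        figures.append(fig)
    return figures                                                # the set of 2D graphs V_l

# The compositions used in the theory, (g o f) and (h o f), on a two-attribute partition of Iris.
iris = load_iris()
X, y = iris.data[:, :2], iris.target
rules  = g(f(2, 10, X, y), list(iris.feature_names[:2]))          # (g o f)
graphs = h(f(2, 10, X, y))                                        # (h o f)
print(rules[0])
```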


1. f : S → A, f(Si) = Cj, where i, j = 1, 2, ..., n.
   g : A → B, g(Cj) = Rk, where j, k = 1, 2, ..., n.
   For Rk ∈ B:
   Rk = g(Cj) = g(f(Si)) = (g ∘ f)(Si)    [composition of the functions]    (4)
   (g ∘ f) : S → B
   dom(f) = S, codom(f) = A; dom(g) = A, codom(g) = B; dom(g ∘ f) = S, codom(g ∘ f) = B

The function g ∘ f is a mapping from the set of data to the set of rules. The knowledge is derived from the interpretation and evaluation of the set of rules/classifiers obtained through the composition of the functions in equation (4).

2. f : S → A, f(Si) = Cj, where i, j = 1, 2, ..., n.
   h : A → C, h(Cj) = Vl, where j, l = 1, 2, ..., n.
   For Vl ∈ C:
   Vl = h(Cj) = h(f(Si)) = (h ∘ f)(Si)    [composition of the functions]    (5)
   (h ∘ f) : S → C
   dom(f) = S, codom(f) = A; dom(h) = A, codom(h) = C; dom(h ∘ f) = S, codom(h ∘ f) = C

The function h ∘ f is a mapping from the set of data to the set of 2D graphs. The knowledge is derived from the interpretation and evaluation of the 2D graphs obtained through the composition of the functions in equation (5). We first pass the partitions of the dataset through the composition of clustering and classification, and the results obtained are interpreted and evaluated to extract the knowledge. Secondly, we pass the partitions of the dataset through the composition of clustering and visualization, and the results obtained are interpreted and evaluated to extract the knowledge. Therefore, we attempt to discover the knowledge in both ways. The order of the functions must be the same as given in equations (4) and (5): first apply the clustering algorithm, then the classification algorithm, and interpret the results to get the knowledge; in the second case, apply the clustering algorithm, then the visualization algorithm, and interpret the results to obtain the knowledge. This is the proof of the mathematical formulation of the proposed theory.

Illustration: Let S be a dataset with two vertical partitions s1 and s2; the set A is the set of clusters of each partition, the set B is the set of rules of each cluster, and the set C is the set of 2D graphs of each cluster. The mathematical notation of these sets is given below.

S = {s1, s2}
A = {C1(c1, c2), C2(c1, c2)}
B = {R1(r1, r2), R2(r1, r2)}
C = {V1(v1, v2), V2(v1, v2)}

Dataset S:
CT  SECS  Mitoses  Class
3   2     1        Benign
5   2     1        Benign
1   2     1        Malignant
3   5     3        Malignant

The partitions s1 and s2 are given below:

s1:
CT  Mitoses  Class
3   1        Benign
5   1        Benign
1   1        Malignant
3   3        Malignant

s2:
CT  SECS  Class
3   2     Benign
5   2     Benign
1   2     Malignant
3   5     Malignant

1. f (K-means) : S → A
The K-means clustering algorithm takes three inputs: k, the number of clusters; n, the number of iterations; and sp, the dataset. It produces k clusters of the given dataset. The function f is illustrated in equation (6).

f(k, n, sp) = Ck(ck)    (6)

where k and n are positive nonzero integers and sp is a vertical partition of the dataset S. Suppose we want to create two clusters of the datasets s1 and s2 and the number of iterations is set to 10, i.e. k = 2 and n = 10; then the function f(2, 10, s1) = C1(c1, c2) is for the dataset s1.
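A sketch of this call on the toy partition s1, again using scikit-learn's KMeans as a stand-in for the paper's implementation (the resulting cluster memberships depend on the implementation and its initialization, so they need not coincide exactly with the C1(c1) and C1(c2) listed in the case study):

```python
import numpy as np
from sklearn.cluster import KMeans

# Partition s1 of the BreastCancer sample: attributes CT and Mitoses (the class column is not clustered).
X = np.array([[3, 1], [5, 1], [1, 1], [3, 3]])

# f(2, 10, s1): k = 2 clusters, n = 10 iterations.
labels = KMeans(n_clusters=2, max_iter=10, n_init=10, random_state=0).fit_predict(X)

for i in range(2):
    print(f"cluster c{i + 1}:", X[labels == i].tolist())
```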

Fig. 3 The mapping of function f for s1



Fig. 3 shows the mapping between set S and set A using the K-means clustering algorithm as the function f. The values of the other required parameters of the algorithm are the number of clusters k = 2, the number of iterations n = 10 and the dataset s1. This is a many-to-one (three-to-one) mapping because all the members of set S are mapped to a single member of set A. Similarly, the function f(2, 10, s2) = C2(c1, c2) is for the dataset s2.

Fig. 4 The mapping of function f for s2

Fig. 4 shows the mapping between set S and set A using the K-means clustering algorithm as the function f. The values of the other required parameters of the algorithm are the number of clusters k = 2, the number of iterations n = 10 and the dataset s2. This is a many-to-one (three-to-one) mapping because all the members of set S are mapped to a single member of set A. Figures 3 and 4 show that f is a many-to-one function. Hence f is a function because it satisfies the conditions of a function; therefore the K-means clustering algorithm is a function. It is an unsupervised machine learning algorithm which requires inputs with certain conditions and then produces the output. If we optimize the values of k and n then the K-means clustering algorithm will be one-to-one, otherwise it is many-to-one.

Remark: The process of creating clusters of the given dataset does not split the dataset into smaller datasets; the dataset remains the same and only datapoints are shifted within the dataset. For our own convenience we illustrate the different clusters of the dataset. This process does not create new datasets from the given dataset.

2. g (C4.5) : A → B
The C4.5 algorithm takes the clusters of the dataset as input and produces the classifiers/rules as output. The function g is illustrated in equation (7).

g(Cj) = Rk    (7)

Fig. 5 The mapping of function g for s1

Fig. 5 shows the mapping between set A and set B using the C4.5 (decision tree) algorithm as the function g. The algorithm takes the set of clusters C1(c1, c2) of dataset s1 and produces the set of classifiers/rules R1(r1, r2). This is a one-to-one mapping because the members of set A are mapped to the corresponding members of set B.

Fig. 6 The mapping of function g for s2

Fig. 6 shows the mapping between set A and set B using the C4.5 (decision tree) algorithm as the function g. The algorithm takes the set of clusters C2(c1, c2) of dataset s2 and produces the set of classifiers/rules R2(r1, r2). This is a one-to-one mapping because the members of set A are mapped to the corresponding members of set B. Figures 5 and 6 show that g is a one-to-one function. Hence g is a function because it satisfies the conditions of a function; therefore the C4.5 algorithm is a function. It is a supervised machine learning algorithm which requires the input and then produces the output.

Remark: For our own convenience we write if-then-else in the rules/classifiers created by the C4.5 algorithm; the algorithm itself does not place if-then-else in any of the rules/classifiers. The rules/classifiers can further be used in a simple query to validate the results.
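The paper's C4.5 implementation is not listed; as an approximation, scikit-learn's DecisionTreeClassifier (a CART-style tree, used here with the entropy criterion to stay close to C4.5's information gain) can play the role of g, and its export_text helper prints the learned rules for a small cluster such as the two-row C1(c1) used in the case study later in the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# A two-row cluster with attributes CT and Mitoses plus the class attribute (values from the paper's example).
X = np.array([[3, 1], [3, 3]])
y = np.array(["Benign", "Malignant"])

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["CT", "Mitoses"]))
# The printed tree is equivalent to: if Mitoses <= 2 then Benign else Malignant,
# comparable in spirit to the rule R1(r1), although the threshold is chosen by the tree itself.
```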


3. h (Data Visualization) : A → C
The data visualization algorithm takes the clusters as input and produces 2D graphs as output. The function h is illustrated in equation (8).

h(Cj) = Vl    (8)
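The exact chart type behind the 2D graphs is not specified beyond the figures; a plausible matplotlib sketch of h for a single cluster, plotting each attribute of the cluster against the class labels in the manner of the V graphs shown later, is:

```python
import matplotlib.pyplot as plt

def h_single(cluster):
    """Draw one 2D graph for a cluster given as {attribute: values}, with a 'Class' entry."""
    fig, ax = plt.subplots()
    for name, values in cluster.items():
        if name != "Class":
            ax.plot(cluster["Class"], values, marker="o", label=name)
    ax.set_xlabel("Class")
    ax.legend()
    return fig

# Cluster C1(c1) from the case study below (attributes CT and Mitoses).
h_single({"CT": [3, 3], "Mitoses": [1, 3], "Class": ["Benign", "Malignant"]})
plt.show()
```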

Fig. 7 The mapping of function h for s1

Fig. 7 shows the mapping between set A and set C using the data visualization algorithm as the function h. The algorithm takes the set of clusters C1(c1, c2) of dataset s1 and produces the set of 2D graphs V1(v1, v2). This is a one-to-one mapping because the members of set A are mapped to the corresponding members of set C.

Fig. 8 The mapping of function h for s2

Fig. 8 shows the mapping between set A and set C using the data visualization algorithm as the function h. The algorithm takes the set of clusters C2(c1, c2) of dataset s2 and produces the set of 2D graphs V2(v1, v2). This is a one-to-one mapping because the members of set A are mapped to the corresponding members of set C. Figures 7 and 8 show that h is a one-to-one function. Hence h is a function because it satisfies the conditions of a function; therefore data visualization (2D graphs) is a function.

Remark: The purpose of a 2D graph is to identify the type of relationship, if any, between the attributes. The graph is used when there is a variable being tested; in this case the attribute (variable) class is the test attribute.

The following case further demonstrates the proposed data mining theory.

Case: BreastCancer, a medical dataset with four attributes, is chosen as an example to explain the theory discussed above. We create two vertical partitions; these partitions are the actual inputs of the clustering algorithm, and we want to create two clusters of each partition. Similarly, we produce the rules and 2D graphs of each partition, and finally we obtain the knowledge as output after evaluating and interpreting all the obtained results. The whole process is discussed below.

S = {s1, s2}
A = {C1(c1, c2), C2(c1, c2)}
B = {R1(r1, r2), R2(r1, r2)}
C = {V1(v1, v2), V2(v1, v2)}

The set of data of dataset S is shown in the tables below:

Dataset S:
CT  SECS  Mitoses  Class
3   2     1        Benign
5   2     1        Benign
1   2     1        Malignant
3   5     3        Malignant

s1:
CT  Mitoses  Class
3   1        Benign
5   1        Benign
1   1        Malignant
3   3        Malignant

s2:
CT  SECS  Class
3   2     Benign
5   2     Benign
1   2     Malignant
3   5     Malignant
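The two vertical partitions can, for instance, be produced with a few lines of pandas (a sketch using the sample rows above; the paper does not prescribe a particular tool):

```python
import pandas as pd

# The BreastCancer sample used in the case study: each vertical partition keeps a subset of
# the attributes plus the single class attribute (column names follow the paper).
S = pd.DataFrame({
    "CT":      [3, 5, 1, 3],
    "SECS":    [2, 2, 2, 5],
    "Mitoses": [1, 1, 1, 3],
    "Class":   ["Benign", "Benign", "Malignant", "Malignant"],
})

s1 = S[["CT", "Mitoses", "Class"]]   # partition s1
s2 = S[["CT", "SECS", "Class"]]      # partition s2
print(s1, s2, sep="\n\n")
```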
The set of clusters of the data is shown in the following tables:

C1(c1):
CT  Mitoses  Class
3   1        Benign
3   3        Malignant

C1(c2):
CT  Mitoses  Class
5   1        Benign
1   1        Malignant



C2(c1):
CT  SECS  Class
3   2     Benign
1   2     Malignant

C2(c2):
CT  SECS  Class
5   2     Benign
3   5     Malignant

The set of rules/classifiers of the set of clusters is described below:

R1(r1): If CT = 3 & Mitoses = 3 then Class = Malignant else Class = Benign
R1(r2): If CT = 1 & Mitoses = 1 then Class = Malignant else Class = Benign
R2(r1): If CT = 3 & SECS = 2 then Class = Benign else Class = Malignant
R2(r2): If CT = 5 & SECS = 2 then Class = Benign else Class = Malignant

The following figures show the set of 2D graphs of the set of clusters; each graph plots the attribute values of a cluster against its class labels (Benign, Malignant):

V1(v1): 2D graph of C1(c1) (attributes CT and Mitoses)
V1(v2): 2D graph of C1(c2) (attributes CT and Mitoses)
V2(v1): 2D graph of C2(c1) (attributes CT and SECS)
V2(v2): 2D graph of C2(c2) (attributes CT and SECS)

We interpret and evaluate the results obtained from the set V1(v1, v2). The structure of the 2D graphs of V1(v1) and V1(v2) is identical. The result obtained from these graphs is that if the attributes CT and Mitoses have the same values then the patient has the Malignant class of BreastCancer, otherwise the Benign class of BreastCancer. The interpretation and evaluation of the set V2(v1, v2) show that the 2D graphs of V2(v1) and V2(v2) are similar. The result obtained from these graphs is that if the attributes CT and SECS have variable values then the patient has the Benign class of BreastCancer, otherwise the Malignant class of BreastCancer. Table 2 summarizes the steps involved in the proposed theory through the composition of the functions.

Table 2 The Composition of the Functions
S    f             g ∘ f         h ∘ f
s1   C1(c1, c2)    R1(r1, r2)    V1(v1, v2)
s2   C2(c1, c2)    R2(r1, r2)    V2(v1, v2)

In this way the knowledge is extracted from the given dataset S through the composition of clustering and classification as g ∘ f and the composition of clustering and visualization as h ∘ f, followed by the interpretation and evaluation of the results, which depends on the user's selection according to the business requirements.

3 Methodology: A MAS Approach

Three data mining algorithms are selected: K-means for clustering, C4.5 for classification/rules and 2D graphs for data visualization. On the basis of the proposed data mining theory (DMT), a MAS is developed; the architecture of the MAS is shown in Fig. 9.


Fig. 9 The Architecture of the MAS

Fig. 9 has three main components: the dataset, which is the basic input; a MAS implementing the multi-step knowledge extraction process (KEP) based on the proposed DMT; and finally the knowledge as output. The dataset is the required basic input, and an agent creates the appropriate partitions of the dataset. Another agent takes the partitioned datasets as input, applies the K-means clustering data mining algorithm for the clustering process and produces the set of clusters as output. Similarly, an agent takes the set of clusters as input, applies the C4.5 (decision tree) data mining algorithm for the classification process and produces the set of rules/classifiers as output; the interpretation and evaluation of these rules yields the knowledge as the final output. Finally, another agent takes the set of clusters as input, applies the data visualization algorithm for the visualization process and produces the set of 2D graphs as output; the interpretation and evaluation of these 2D graphs yields the knowledge as the final output. In summary, in the first step the clusters of the given dataset are created; in the second step either the rules or the 2D graphs of these clusters are produced; and in the third step the rules and 2D graphs are interpreted and evaluated to produce the knowledge. Hence the knowledge is extracted through the multi-step processes of clustering, classification and visualization. Similarly, other data mining algorithms for clustering, classification and visualization can be deployed to validate the proposed DMT.
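The paper does not list the agent code or the agent platform; the following sketch only mirrors the message flow just described, with plain Python classes standing in for the partitioning, clustering, classification and visualization agents (class and method names are our own illustration, and scikit-learn again stands in for the K-means and C4.5 implementations):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

class PartitionAgent:
    def run(self, dataset, column_groups):
        # One vertical partition (including the class attribute) per group of columns.
        return [dataset[cols] for cols in column_groups]

class ClusteringAgent:
    def __init__(self, k, n):
        self.k, self.n = k, n
    def run(self, partition):
        X = partition.drop(columns="Class")
        labels = KMeans(n_clusters=self.k, max_iter=self.n, n_init=10).fit_predict(X)
        return [partition[labels == i] for i in range(self.k)]        # clusters C_j

class ClassificationAgent:
    def run(self, clusters):
        rules = []
        for c in clusters:
            tree = DecisionTreeClassifier().fit(c.drop(columns="Class"), c["Class"])
            rules.append(export_text(tree, feature_names=list(c.columns.drop("Class"))))
        return rules                                                  # rules/classifiers R_k

class VisualizationAgent:
    def run(self, clusters):
        figures = []
        for c in clusters:
            fig, ax = plt.subplots()
            for col in c.columns.drop("Class"):
                ax.plot(c["Class"], c[col], marker="o", label=col)
            ax.legend()
            figures.append(fig)
        return figures                                                # 2D graphs V_l

# Example wiring for one partition of the BreastCancer sample (S as in the pandas sketch earlier):
# parts    = PartitionAgent().run(S, [["CT", "Mitoses", "Class"], ["CT", "SECS", "Class"]])
# clusters = ClusteringAgent(k=2, n=10).run(parts[0])
# rules    = ClassificationAgent().run(clusters)
# graphs   = VisualizationAgent().run(clusters)
```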

4 Results and Discussion

The MAS based on the proposed DMT is tested on a variety of datasets, such as Diabetes and Breast Cancer, two medical datasets, Iris, a flower dataset, Cars, a vehicle dataset, and Sales, an accounts dataset [13]. We present some of the results for the BreastCancer, Iris and Sales datasets. The results and the extracted knowledge from BreastCancer, a medical dataset, using the MAS are given below.

Fig. 10 2D graph between the attributes single epithelial cell size and bare nuclei of dataset BreastCancer

The graph in Fig. 10 can be divided into three sections: the value of the attributes single epithelial cell size and bare nuclei is constant in the beginning, then the value of these attributes is variable, then in some region of the graph the values are equal, and then again they are constant. The conclusion of this graph is that if the values of these attributes are either constant or equal then the patient has the benign class of breast cancer, whereas variable values of these attributes give the malignant class of breast cancer, which shows the knowledge. Fig. 11 shows the rules of dataset Breast Cancer.

1. if Mitoses = 1 then
2.   if BCh = 1 then Class = benign
3.   else if BCh = 2 then Class = benign
4.   else if BCh = 3 then Class = benign
5.   else if BCh = 4 then Class = benign
6.   else if BCh = 7 then Class = malignant
7.   else if BCh = 8 then Class = malignant

Fig. 11 The Rules of dataset Breast Cancer

The rules in the form of if-then-else are shown in Fig. 11; they make it easy to take a decision. The result is that if the value of the attribute mitoses is 1 and the value of the attribute bland chromatin (BCh) is 1, 2 or 3 then the patient has the benign class of breast cancer, otherwise the malignant class of breast cancer, which we can say is knowledge. These rules can be used in a simple query to further validate the results.
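The "simple query" validation the paper refers to can be sketched with pandas (the dataframe rows here are hypothetical; BCh stands for the bland chromatin attribute as in Fig. 11):

```python
import pandas as pd

# Hypothetical rows with the two attributes used by the Fig. 11 rules.
df = pd.DataFrame({
    "Mitoses": [1, 1, 1, 5],
    "BCh":     [2, 3, 8, 7],
    "Class":   ["benign", "benign", "malignant", "malignant"],
})

# Query form of the extracted knowledge: Mitoses = 1 and BCh in {1, 2, 3} -> benign.
predicted_benign = df.query("Mitoses == 1 and BCh in [1, 2, 3]")
print(predicted_benign)
# Rows returned by the query can be compared against the Class column to validate the rules.
```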


The results and the extracted knowledge from Iris, a flower dataset, using the MAS are given below.

Fig. 12 The graph between petal_length and petal_width of the Iris dataset

The structure of the graph in Fig. 12 is complex. At the beginning of the graph there is no relationship between the attributes petal_length and petal_width, then there exists a relationship between these attributes, then the attributes again have distinct values, and at the end there is a relationship between the attributes. The outcome of this graph is that if there is no relationship between the attributes, the value of the attribute class is Irisvirginica, and if there exists a relationship between the attributes then the value of the attribute class is Irisversicolor, which shows the knowledge. Fig. 13 shows the rules of dataset Iris.

1. if petal_width = 0 then Class = Irissetosa
2. else if petal_width = 2 then Class = Irissetosa, Irisversicolor
3. else if petal_length = 3 then Class = Irisversicolor
4. else if petal_length = 1 then Class = Irisvirginica, Irisversicolor
5. else Class = Irisvirginica

Fig. 13 The Rules of dataset Iris

The result for the dataset Iris is that if the values of the attributes petal_width and petal_length are not equal then the class is either Irissetosa or Irisvirginica. This is knowledge, and these rules can be used in a simple query to further validate the results. The results and the extracted knowledge from Sales, an accounts dataset, using the MAS are given below.

Fig. 14 The graph between average sales and forecast sales of the Sales dataset

The graph in Fig. 14 shows the relationship between the average sales and the forecast sales of the dataset Sales. The graph shows that in the beginning the average sales and the forecast sales are equal, and in the middle of the graph the average sales are either higher than or equal to the forecast sales. The graph also shows that the gap between the two values is quite significant during months 20, 21 and 22, i.e. the value of the average sales is much higher than the forecast sales, but at the end of the graph both values are again equal. The outcome of this graph is that during months 20, 21 and 22 the average sales are higher than the forecast sales, and for the remaining months both values are equal, which also represents the knowledge. Fig. 15 shows the rules of dataset Sales.

1. if AverageIndex = 1.13 then Class = NotOk
2. else if Month = 3 then Class = NotOk
3. else if Month = 16 then Class = NotOk
4. else Class = Ok

Fig. 15 The Rules of dataset Sales

The result of Fig. 15 is that if the AverageIndex is 1.13 or the value of the Month is 3 or 16 then the required sales target is not achieved, otherwise the sales are within the target for the rest of the period. This is knowledge, and these rules can be used in a simple query to further validate the results. Similarly, the rules and 2D graphs can also be discussed for the other clusters of these datasets. It is easy to extract the knowledge from the rules and 2D graphs of the given datasets.

5 Conclusion

In this paper we present a multi-step DMT for the discovery of knowledge from a dataset, where the output of one data mining process is the input of other processes. The foundation of our proposed theory is that without clustering there is no classification and no visualization, and hence there is no knowledge. Either the composition of clustering and classification or the composition of clustering and visualization, expressed through mathematical functions, can produce the knowledge; this is called the KEP. In the first step the clusters of a dataset are created, and in the second step the rules are created, followed by the interpretation and evaluation to extract the knowledge. Similarly, the 2D graphs of the clusters are generated in the second step, followed by the interpretation and evaluation to extract the knowledge.



The clustering and classification processes of data mining produce the knowledge, and the clustering and visualization processes of data mining also generate the knowledge. The proposed DMT is demonstrated and proved through the Boolean expression and the composition of mathematical functions. We illustrate the DMT with examples and case studies. Furthermore, the proposed DMT is implemented in a MAS and tested on a variety of datasets. At this stage we can say that the results are satisfactory and consistent. In future work, more tests can be conducted to validate the proposed DMT.

Acknowledgment
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC, Pakistan project 6467/F II.

References:
[1] Yao, Y. Y.: A step toward the foundations of data mining, Proc. SPIE 5098, 254, http://dx.doi.org/10.1117/12.509161, 2003.
[2] Johnson, Theodore, Lakshmanan, Laks V. S., Ng, Raymond T.: The 3W Model and Algebra for Unified Data Mining, Proceedings of the 26th VLDB Conference, Cairo, Egypt, 2000.
[3] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press, Cambridge, 1996.
[4] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.: From data mining to knowledge discovery: an overview, AAAI Press, Cambridge, 1996.
[5] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.: The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, 39(11):27-34, 1996.
[6] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.: Knowledge discovery and data mining: towards a unifying framework, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 82-88, Portland, Oregon, 1996.
[7] Cios, K. J., Pedrycz, W., Swiniarski, R. W., Kurgan, L. A.: Data Mining: A Knowledge Discovery Approach, chapter 2, pp. 9-24, ISBN 978-0-387-33333-5, Springer, 2007.
[8] Julio Ponce, Adem Karahoca: Data Mining and Knowledge Discovery in Real Life Applications, chapter 1, pp. 1-16, ISBN 978-3-902613-53-0, ITech, Vienna, Austria, 2009.
[9] Ullman, J. D.: Data Mining: A knowledge discovery in databases, URL: http://www-db.stanford.edu/~ullman/mining, visited 2009.
[10] Threaling, Kurt: An Introduction to Data Mining, http://www.threaling.com/dmintro/dmintro.htm, 2003.
[11] Kantardzic, Mehmed: Data Mining: Concepts, Models, Methods, and Algorithms, ISBN: 471228524, John Wiley & Sons, 2003.
[12] Maindonald, John: Data Mining Methodological Weaknesses and Suggested Fixes, Proc. Fifth Australasian Data Mining Conference (AusDM2006), 2006.
[13] US Census Bureau for datasets, URL: www.sgi.com/tech/mlc/db, visited 2011.
[14] Lovász, L., Pelikán, J., Vesztergombi, K.: Discrete Mathematics: Elementary and Beyond, pp. 38, Springer, ISBN: 0-387-95585-2, 2003.
[15] Dries, Lou van den: Mathematical Logic, Math 570, Lecture Notes, Fall Semester 2007.
[16] MathWorksheetsGo: Evaluating Functions, URL: www.MathWorksheetsGo.com, 2011.
[17] Khan, Dost Muhammad & Mohamudally, Nawaz: A Multiagent System (MAS) for the Generation of Initial Centroids for K-means Clustering Data Mining Algorithm based on Actual Sample Datapoints, JNIT, Vol. 1, Number 2, pp. 85-95, ISSN: 2092-8637, 2010.
[18] Khan, Dost Muhammad & Mohamudally, Nawaz: An Agent Oriented Approach for Implementation of the Range Method of Initial Centroids in K-Means Clustering Data Mining Algorithm, IJIPM, Volume 1, Number 1, pp. 104-113, ISSN: 2093-4009, 2010.
[19] Mohamudally, Nawaz, Khan, Dost Muhammad: Application of a Unified Medical Data Miner (UMDM) for Prediction, Classification, Interpretation and Visualization on Medical Datasets: The Diabetes Dataset Case, P. Perner (Ed.): ICDM 2011, LNAI 6870, Springer-Verlag Berlin Heidelberg, pp. 78-95, 2011.
[20] Khan, Dost Muhammad & Mohamudally, Nawaz: An Integration of k-means clustering and Decision Tree (ID3) towards a more Efficient Data Mining Algorithm, Journal of Computing, ISSN: 2151-9617, Volume 3, Issue 12, pp. 76-82, 2011.
[21] Stone, Peter & Veloso, Manuela: Multiagent Systems: A Survey from a Machine Learning Perspective, URL: http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/97MAS-survey/revised-survey.html, 1997.

