
1. INTRODUCTION

For the lab project, we applied two clustering analysis algorithms, the k-means method and the hierarchical method, to our datasets, analysed the results obtained, and compared the results of the two methods.

i. The k-means Clustering Method:

Construct a partition of a database D of n objects into a set of k clusters.

1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat:
   a. Assign each object to its closest cluster center, using Euclidean distance.
   b. Compute the new cluster centers by calculating the mean point of each cluster.
4. Until:
   a. There is no change in the cluster centers, or
   b. No object changes its cluster.
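The steps above can be sketched in Python. This is a minimal illustration of the generic textbook procedure, not Weka's SimpleKMeans implementation; the sample points, k, and seed below are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=10):
    """Minimal k-means sketch: pick k random objects as initial centers,
    then repeat assign/recompute until the assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 2: k random objects as centers
    assignment = None
    for _ in range(max_iter):
        # step 3a: assign each object to its closest center (Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # step 4: no object changed its cluster
            break
        assignment = new_assignment
        # step 3b: recompute each center as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels, centers = kmeans(pts, k=2)
```

With two well-separated groups of points, the two nearby pairs end up in the same cluster regardless of the random initialization.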

ii. Hierarchical Method:

With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not given as input.

Hierarchical clustering is of two types:

a. Agglomerative:

Agglomerative clustering starts with as many clusters as there are records, each cluster holding only one record. Pairs of clusters are then successively merged until the number of clusters reduces to k. At each stage, the pair of clusters nearest to each other is merged. If the merging is continued, it terminates in a hierarchy of clusters that ends with a single cluster containing all the records.

b. Divisive:

The divisive algorithm takes the opposite approach from the agglomerative techniques: it starts with all the records in one cluster and then tries to split that cluster into smaller pieces.
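The agglomerative direction can be illustrated with a short sketch using single linkage, where the distance between two clusters is the distance between their closest members (the same linkage the Weka run below uses); the 2-D points are illustrative. This naive version is O(n^3) and is meant only to show the merge loop.

```python
import math

def agglomerative(points, k):
    """Sketch of agglomerative (bottom-up) clustering with single linkage:
    start with one cluster per record, repeatedly merge the nearest pair
    of clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # every record is its own cluster
    while len(clusters) > k:
        # single linkage: cluster distance = distance between closest members
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the nearest pair
    return clusters

groups = agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], k=2)
```

Continuing the loop down to k = 1 would produce the single all-records cluster described above.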

2. DESCRIPTION OF ROUGH DATABASE

The database we have used consists of more than 1300 records with 12 attributes, each attribute having its own values. Some records contain null values or values left empty. The values of some attributes are also represented by numbers, such as 0 for no and 1 for yes. During ARFF file generation, we use either their nominal values or their real values.

Figure 2: Screenshot of Rough Database

3. KDD STEPS FOLLOWED IN DATABASE

The KDD steps we followed on this database are as follows:

i. Data Selection:
 It is the first stage of the KDD process, in which we collect and select the data set or database required to work with.
 From more than 1300 records, each with 12 attributes having their own values, we selected around 100 records (104 instances in the final ARFF file) with all 12 attributes and their respective values.
ii. Data Cleaning:
 This is the second stage of KDD, in which we try to eliminate defects such as human errors, values not available when collected, and values not entered due to misunderstanding, through de-duplication, domain consistency checks, and disambiguation.
 In this step, we removed duplicate data and replaced null values with suitable values.
iii. Coding:
 Coding is one of the most important stages, where further cleaning and transformation of the data is done.
 Here, we converted the attributes' yes/no values to 0/1, as in attributes like health_check, toilet, and animal.
iv. Data Mining:
 It consists of the different rules, techniques, and algorithms used for the mining purpose.
 To extract useful information from the data, we used clustering tasks, which include the k-means and hierarchical methods.
v. Reporting:
 This stage involves documenting the results obtained from the learning algorithms.
 For this, we analyzed the results obtained from mining.
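The coding step (iii) above can be sketched as a small mapping function. The column names are taken from the ARFF file later in this report; the record shown is illustrative.

```python
# Map yes/no answers to 1/0 for binary attributes such as
# health_check, toilet, and animal (names assumed from the ARFF file).
YES_NO = {"yes": 1, "no": 0}

def encode_record(record, binary_fields=("health_check", "toilet", "animal")):
    """Return a copy of the record with yes/no fields recoded as 1/0;
    other fields, and unrecognized values, pass through unchanged."""
    return {
        field: YES_NO.get(value, value) if field in binary_fields else value
        for field, value in record.items()
    }

row = {"cast": "chhetri", "health_check": "no", "toilet": "yes"}
print(encode_record(row))  # {'cast': 'chhetri', 'health_check': 0, 'toilet': 1}
```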

4. ARFF FILE DESCRIPTION

An ARFF file has two parts:

i. The header section defines the relation (data set) name, the attribute names, and their types.
 @relation: This is the first line in any ARFF file, written in the header section, followed by the relation/data set name, here info_family.
 @attribute: The ARFF file consists of the attributes cast and health_check with nominal values; the remaining attributes have real values.

ii. The data section lists the data instances.

ARFF File:

@relation info_family

@attribute cast {bahun,chhetri,newar,magar,rai,damai,bishworkarma,sarki}

@attribute type real

@attribute health_check {yes,no}

@attribute child_death real

@attribute smoke_male real

@attribute smoke_female real

@attribute special_chulo real

@attribute self_home real

@attribute toilet real

@attribute animal real

@attribute self_field numeric

@attribute drinking_water_filter real

@data

chhetri,3,no,0,1,2,0,1,1,1,15,1

chhetri,1,no,0,1,1,0,1,1,1,10,1

damai,1,no,0,2,2,0,1,1,0,16,0

chhetri,1,no,0,2,2,0,0,0,0,11,0

chhetri,1,no,0,2,2,1,1,0,1,3,0

chhetri,3,no,0,2,2,1,1,1,1,2,1

chhetri,3,no,0,1,2,1,1,1,1,6,0

chhetri,1,no,0,2,2,0,1,1,1,4,1

chhetri,3,no,0,2,2,0,1,1,1,8,0

chhetri,2,no,0,1,1,0,1,1,1,8,0

sarki,1,no,1,2,2,1,1,1,1,9,0

chhetri,1,no,0,1,2,0,0,0,1,7,0

chhetri,1,no,0,1,1,0,1,0,0,1,0

chhetri,1,no,0,1,1,0,1,1,1,3,0

chhetri,3,no,0,1,2,0,1,0,1,1,0

chhetri,1,no,0,1,1,1,1,1,1,7,0

chhetri,1,no,0,1,1,0,1,1,1,6,0

chhetri,4,no,0,1,2,0,1,0,1,3,0

chhetri,1,no,0,1,1,0,1,1,1,3.5,0

chhetri,3,no,0,1,1,0,1,1,1,2.5,0

chhetri,1,no,0,1,1,0,1,1,1,3,0

chhetri,1,no,0,1,1,0,1,1,1,6,0

chhetri,1,yes,0,1,1,0,1,0,0,1,1

chhetri,1,no,0,1,1,0,1,1,1,1,1

chhetri,3,no,0,1,1,0,1,1,1,3,0

chhetri,1,yes,0,1,1,1,1,1,1,3,0

chhetri,1,no,0,1,1,0,1,1,1,1,0

chhetri,1,no,0,1,2,1,1,1,1,0.5,1

chhetri,1,no,0,1,2,1,1,1,1,5,1

chhetri,3,no,0,1,2,1,1,1,1,2,0

sarki,1,no,0,1,2,0,1,1,1,0.5,0

newar,1,no,0,1,2,0,1,1,1,4,0

chhetri,1,no,0,1,2,0,1,1,1,2,1

chhetri,1,no,0,1,2,0,1,0,1,6,0

chhetri,3,no,0,1,2,0,0,1,1,1,0

chhetri,3,no,0,1,1,0,1,1,1,4,0

chhetri,3,no,0,1,1,1,1,1,1,1,1

chhetri,4,no,0,1,2,1,1,0,1,2,1

chhetri,4,no,0,1,1,0,1,0,1,2,0

chhetri,3,no,0,1,2,0,1,0,0,2,1

sarki,3,yes,0,1,1,1,1,1,1,2,1

bishworkarma,3,yes,0,1,1,1,1,1,1,3,1

bahun,3,no,0,1,1,1,1,1,1,1,1

chhetri,1,no,0,1,1,1,1,1,1,5,1

chhetri,3,no,0,1,1,0,1,1,1,2,0

bishworkarma,3,no,0,1,1,1,1,1,1,6,0

chhetri,1,no,0,1,1,1,1,1,1,8,1

chhetri,1,no,0,1,1,1,1,1,1,2,1

chhetri,3,yes,0,1,2,1,1,1,1,1,1

bahun,1,no,0,1,2,1,1,1,1,7,0

bahun,3,no,0,1,2,0,1,1,1,2,1

bahun,3,no,0,1,2,0,1,1,1,2,1

magar,1,no,0,1,2,1,1,1,1,3,0

chhetri,1,no,0,1,2,0,1,1,1,5,0

chhetri,3,no,0,1,2,1,1,1,1,4,0

chhetri,1,no,0,1,2,0,1,1,1,5,1

chhetri,4,no,0,1,2,0,1,1,1,3,0

chhetri,1,no,0,1,1,0,1,1,1,3,1

chhetri,3,no,0,1,1,0,1,1,1,2,0

chhetri,1,no,0,1,2,0,1,1,1,12,0

chhetri,1,no,0,1,1,0,1,1,1,7,0

chhetri,3,yes,0,1,1,0,1,1,1,4,0

chhetri,1,no,0,1,2,0,1,1,1,7,0

sarki,3,no,0,1,2,0,1,1,1,4,0

sarki,1,no,0,1,1,0,1,0,0,3,0

sarki,1,no,1,1,1,0,1,0,0,1,0

chhetri,3,no,0,1,1,0,1,0,1,10,0

chhetri,3,no,0,1,2,1,1,1,1,9,1

bishworkarma,2,no,0,1,2,0,1,1,1,16,0

chhetri,3,no,0,1,2,0,1,0,1,2,0

chhetri,1,yes,0,1,2,0,1,1,1,11,1

chhetri,3,no,0,1,1,0,0,0,1,2,1

sarki,1,no,0,1,1,0,1,1,1,6,0

sarki,1,no,0,1,1,0,1,0,1,8,0

sarki,1,no,0,1,2,0,1,0,0,5,0

sarki,1,no,0,1,2,0,1,0,0,4,0

chhetri,3,no,0,1,2,0,1,0,1,9,0

chhetri,1,no,0,1,1,0,1,1,1,8,0

chhetri,1,no,0,1,1,0,1,0,1,6,1

chhetri,4,no,0,1,1,0,1,0,1,1,0

chhetri,4,no,0,1,1,0,1,1,1,3,0

chhetri,3,no,0,1,1,1,1,1,1,6,1

chhetri,3,no,0,1,1,1,1,1,1,2,0

chhetri,1,no,0,1,1,0,1,1,1,7,1

chhetri,1,no,0,1,1,1,1,1,1,7,1

chhetri,3,no,0,1,1,0,1,1,1,9,0

chhetri,1,no,0,1,1,0,1,1,1,7,0

chhetri,1,no,0,1,2,0,1,1,1,4,1

chhetri,1,yes,0,1,2,0,1,0,0,5,0

chhetri,1,no,0,1,2,0,1,1,1,7,0

chhetri,1,no,0,1,2,0,1,1,1,5,1

chhetri,1,no,0,1,2,1,1,1,1,2,0

sarki,1,no,0,1,2,0,1,0,1,3,1

chhetri,1,no,0,1,2,0,1,0,1,7,1

chhetri,1,no,0,1,2,0,1,0,1,3,1

chhetri,3,no,0,1,2,0,1,0,1,3,0

sarki,1,yes,0,2,1,0,1,1,1,16,1

bishworkarma,5,no,0,2,1,0,0,0,0,4,0

chhetri,1,no,0,2,1,0,0,0,0,7,0

chhetri,4,no,0,2,1,1,1,1,0,2,1

chhetri,4,no,0,2,1,0,1,0,0,1,0

damai,1,no,0,2,1,0,1,0,1,2,1

magar,1,no,0,1,1,1,1,0,1,9,1

rai,1,no,0,1,1,0,1,1,1,3,0

Figure 4: Screenshot of ARFF File
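The two-part structure above can be read with a minimal sketch of an ARFF parser. Real tools such as Weka's own loaders also handle quoting, comments in data, and sparse instances; this illustration does not.

```python
def parse_arff(text):
    """Minimal ARFF reader sketch: collect the relation name and attribute
    names from the header, then split each @data row on commas."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            data.append(line.split(","))
    return relation, attributes, data

sample = "@relation info_family\n@attribute cast {bahun,chhetri}\n@attribute type real\n@data\nchhetri,3\n"
rel, attrs, rows = parse_arff(sample)
print(rel, attrs, rows)  # info_family ['cast', 'type'] [['chhetri', '3']]
```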

5. DESCRIPTION OF THE ALGORITHMS' OUTPUT

In both the k-means method and the hierarchical method, we created five clusters to view the output on our datasets.

From the k-means algorithm, we got the following output:

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10

Relation: info_family

Instances: 104

Attributes: 12

cast

type

child_death

smoke_male

smoke_female

special_chulo

self_home

toilet

animal

self_field

drinking_water_filter

Ignored:

health_check

Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans

======

Number of iterations: 5

Within cluster sum of squared errors: 109.96780303086852

Initial starting points (random):

Cluster 0: bishworkarma,3,0,1,1,1,1,1,1,3,1

Cluster 1: chhetri,1,0,1,1,0,1,1,1,6,0

Cluster 2: chhetri,1,0,1,1,0,1,1,1,3,1

Cluster 3: magar,1,0,1,2,1,1,1,1,3,0

Cluster 4: chhetri,1,0,1,1,1,1,1,1,8,1

Missing values globally replaced with mean/mode
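The line above refers to Weka's global missing-value replacement. As a sketch of that rule (mean for numeric attributes, mode for nominal ones), with `None` standing in for a missing value:

```python
import statistics

def impute(column, numeric):
    """Replace missing values (None) with the column mean for numeric
    attributes, or the most frequent value (mode) for nominal ones."""
    present = [v for v in column if v is not None]
    fill = statistics.mean(present) if numeric else statistics.mode(present)
    return [fill if v is None else v for v in column]

print(impute([1.0, None, 3.0], numeric=True))            # [1.0, 2.0, 3.0]
print(impute(["no", None, "no", "yes"], numeric=False))  # ['no', 'no', 'no', 'yes']
```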

Final cluster centroids:
                                        Cluster#
Attribute                Full Data             0             1             2             3             4
                           (104.0)         (4.0)        (41.0)        (22.0)        (20.0)        (17.0)
========================================================================================================
cast                       chhetri  bishworkarma       chhetri       chhetri       chhetri       chhetri
type                        1.9231             3        2.0732        1.4545           1.8        2.0588
child_death                 0.0192             0        0.0244             0          0.05             0
smoke_male                   1.125             1        1.0976        1.1364           1.2        1.1176
smoke_female                1.4808             1        1.2683        1.5909             2        1.3529
special_chulo               0.2788             1             0             0           0.4             1
self_home                   0.9423             1        0.9024        0.9545          0.95             1
toilet                      0.6923             1        0.4878        0.6364          0.95        0.8824
animal                      0.8654             1        0.7561        0.9091          0.95        0.9412
self_field                  4.8173             3        4.5122        5.1364         5.975        4.2059
drinking_water_filter        0.375          0.75             0             1             0        0.8235

Time taken to build model (full training data) : 0.03 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 4 ( 4%)

1 41 ( 39%)

2 22 ( 21%)

3 20 ( 19%)

4 17 ( 16%)

Class attribute: health_check

Classes to Clusters:

0 1 2 3 4 <-- assigned to cluster

2 2 3 0 2 | yes

2 39 19 20 15 | no

Cluster 0 <-- No class

Cluster 1 <-- no

Cluster 2 <-- yes

Cluster 3 <-- No class

Cluster 4 <-- No class

Incorrectly clustered instances : 62.0 59.6154 %
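The incorrectly-clustered figure can be reproduced from the classes-to-clusters table above. A sketch of the arithmetic: each instance whose true class differs from its cluster's assigned class (or whose cluster got no class) counts as incorrect.

```python
# Per-cluster counts copied from the classes-to-clusters table above.
yes_counts = [2, 2, 3, 0, 2]
no_counts = [2, 39, 19, 20, 15]
# Weka's class-to-cluster assignment: cluster 1 -> no, cluster 2 -> yes;
# the rest get no class, so all of their members count as incorrect.
cluster_label = [None, "no", "yes", None, None]

correct = sum(
    y if label == "yes" else n if label == "no" else 0
    for y, n, label in zip(yes_counts, no_counts, cluster_label)
)
total = sum(yes_counts) + sum(no_counts)
incorrect = total - correct
print(incorrect, round(100 * incorrect / total, 4))  # 62 59.6154
```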

From the hierarchical algorithm, we got the following output:

=== Run information ===

Scheme: weka.clusterers.HierarchicalClusterer -N 5 -L SINGLE -P -A "weka.core.EuclideanDistance -R first-last"

Relation: info_family

Instances: 104

Attributes: 12

cast

type

child_death

smoke_male

smoke_female

special_chulo

self_home

toilet

animal

self_field

drinking_water_filter

Ignored:

health_check

Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

Time taken to build model (full training data) : 0.04 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 100 ( 96%)

1 1 ( 1%)

2 1 ( 1%)

3 1 ( 1%)

4 1 ( 1%)

Class attribute: health_check

Classes to Clusters:

0 1 2 3 4 <-- assigned to cluster

8 0 0 1 0 | yes

92 1 1 0 1 | no

Cluster 0 <-- no

Cluster 1 <-- No class

Cluster 2 <-- No class

Cluster 3 <-- yes

Cluster 4 <-- No class

Incorrectly clustered instances : 11.0 10.5769 %

6. DECISION FROM THE OUTPUT
Here, in this example, the SimpleKMeans algorithm divided the given data into 5 clusters: 4 instances in the 1st cluster, 41 in the 2nd, 22 in the 3rd, 20 in the 4th, and 17 in the 5th. Likewise, hierarchical clustering divided the data into 5 clusters: 100 instances in the 1st cluster and 1 instance in each of the 2nd, 3rd, 4th, and 5th clusters.

Using the k-means clustering method, we got a more informative result, as it distributed the data fairly evenly among the five clusters, and the time to build the model was 0.03 seconds. Using hierarchical clustering, however, we got very unevenly distributed clusters, and the time taken to build the model was 0.04 seconds. Although the hierarchical run reports fewer incorrectly clustered instances (10.58% against 59.62% for k-means), this is largely because its one giant cluster absorbs the majority class, so the clustering itself tells us little. Hence, from the output, we can conclude that the k-means clustering method is better than the hierarchical method for this clustering analysis.

Thus, for the given data set, the best output was obtained from the k-means algorithm.

7. CONCLUSION
In this lab, we first created the data set using Excel, where the first row indicates the attribute names and the remaining rows represent the actual data. The Excel file was converted to an ARFF file with the help of the WEKA tools. Here, the data type of each attribute is recognized automatically according to the type of data entered in the data set.

Using the Weka 3.8 software, we performed various data mining techniques using different algorithms, and we also compared the algorithms with each other in terms of the results they produced. In this way, we can use Weka and its tools, such as the ARFF Viewer to view ARFF files and the Explorer to run ARFF files through various data mining algorithms.

