
A COMPARATIVE ANALYSIS BETWEEN K-MEANS AND Y-MEANS ALGORITHMS IN FISHER'S IRIS DATA SETS

V. Leela (a), K. Sakthi Priya (b), R. Manikandan (c)

(a) M.Tech VLSI Design, Department of Computing, SASTRA University, Thanjavur-613401, India. Email: leelaeclipse@gmail.com
(b) M.Tech VLSI Design, Department of Computing, SASTRA University, Thanjavur-613401, India. Email: sakthieceb@gmail.com
(c) Senior Assistant Professor, Department of ICT, SASTRA University, Thanjavur-613401, India. Email: manikandan75@core.sastra.edu

ABSTRACT: Cluster analysis plays a vital role in various fields by grouping similar data from an available database. Various clustering algorithms exist for clustering data, but no single algorithm suits every process. This paper mainly addresses the comparative performance analysis of the partition-based K-Means and Y-Means algorithms on the Iris flower data set. The experimental results on the Iris data set show that the Y-Means algorithm yields better clustering and time complexity than the K-Means algorithm, in fewer iterations.
Keywords - K-Means Algorithm, Y-Means Algorithm, Cluster Analysis.
1. INTRODUCTION
Cluster analysis groups the given data objects based only on information found in the data, and describes the objects and their relationships. The objective is that objects within a group be similar to one another and different from the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering. Clustering is an effective technique for exploratory data analysis, and has found applications in a wide variety of areas.
In this paper, we mainly review two algorithms: the K-Means and Y-Means algorithms. Most existing clustering methods can be categorized into four types: partitioning, hierarchical, grid-based, and model-based methods. K-Means and Y-Means are examples of partitional methods. The two algorithms are compared on the Iris flower data set by clustering the three species of Iris flower, and the results are obtained in MATLAB.
2. METHODOLOGY
Clustering is one of the most widely performed analyses on gene expression data. Every clustering algorithm is based on an index of similarity or dissimilarity between data points. Each cluster is a collection of data objects that are similar to one another, placed within the same cluster, but dissimilar to objects in other clusters. The Iris data sets are taken from three different species in order to classify each species from common data sets. The clustering process of each algorithm differs in how it identifies the similar groups.
2.1. THE K-MEANS ALGORITHM
K-Means is one of the simplest unsupervised learning algorithms used to partition the given data objects into clusters. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). The main procedure is to initialize k centroids, one for each cluster group. These centroids have to be selected carefully, since their placement will always affect the end result. Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function:
J = Σ_{b=1}^{k} Σ_{a=1}^{n} ||x_a^(j) - c_b||^2

where ||x_a^(j) - c_b||^2 is a chosen distance measure between a data point x_a^(j) and the cluster centre c_b, and is an indicator of the distance of the n data points from their respective cluster centers.
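As an illustration (a Python sketch, not the authors' MATLAB code; the function name and toy points are hypothetical), the squared-error objective can be computed directly from the assignments:

```python
import numpy as np

def squared_error(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid,
    i.e. the objective J that K-Means tries to minimize."""
    return float(sum(np.sum((X[labels == b] - c) ** 2)
                     for b, c in enumerate(centroids)))

# Toy example: two groups of points, centroids placed at their means.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(squared_error(X, centroids, labels))  # each point is 1 away -> 4.0
```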
So, the better choice is to place the initial centroids as far apart from each other as possible. The algorithm is composed of the following steps:
(1) Place k points into the space represented by the objects that are being clustered. These k points represent the initial group of centroids.
(2) Assign each object to the group that has the closest centroid.
(3) After all objects have been assigned, recalculate the positions of the k centroids.
(4) Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
The algorithm is drastically sensitive to the randomly selected initial cluster centers. The K-Means algorithm can be run repeatedly to minimize this effect.
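The four steps above can be sketched in a few lines of Python (a minimal illustration, not the authors' MATLAB implementation; the deterministic, evenly spaced initialization is a simplification of step (1), which normally places the k initial points randomly):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-Means following steps (1)-(4) in the text."""
    # (1) Initialize k centroids from the data (deterministic pick for brevity).
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # (2) Assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Recalculate each centroid as the mean of its assigned members.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # (4) Stop once the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

On two well-separated groups of points, the loop converges in a couple of iterations and recovers one cluster per group.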
Strengths of K-Means include:
- Simplicity and applicability to a wide variety of data types. It is also quite efficient, even when multiple iterations are performed.
- It provides the best results when the clusters are dense and the distinction between clusters is obvious.
- It is efficient and scalable when the data set is large.
Weaknesses of K-Means include:
- It depends on the initial centroids and the final number of clusters, and can undergo degeneracy.
- The algorithm is not apposite for non-convex cluster shapes or for cluster sizes that are highly variable.
- It is sensitive to noise points, marginal points and isolated points.
2.2. THE Y-MEANS CLUSTERING
Y-Means is based on the K-Means algorithm. The main difference between the two is Y-Means' ability to autonomously decide the number of clusters based on the statistical nature of the data. This makes the number of final clusters that the algorithm produces a self-defined number, rather than a user-defined constant as in the case of K-Means. This overcomes one of the main drawbacks of K-Means: since a user-defined k cannot guarantee a suitable partition of a data set with an unknown distribution, a random value of the initial k usually results in poor clustering.
Y-Means can find an appropriate value of the final k (number of centroids), independent of the initial k, by using a sequence of splitting, deleting and merging the clusters, even without knowledge of the distribution of the data. To eliminate the effect of dominating features due to feature-range differences, the data set is first normalized. Next, the standard K-Means algorithm is run over the training data. As a result, the final number of clusters is independent of the initial k; moreover, the selection of the k initial centroids is likewise independent of the final results. The standard K-Means algorithm uses Euclidean distance as its distance function. The Y-Means algorithm uses the following function to identify a single outlier per iteration for each cluster. Let O_bc(B_j, C_l) be a Boolean outlier detection function:
O_bc(B_j, C_l), j ∈ [1, m], l ∈ [1, n]

The Y-Means algorithm iteratively identifies outliers and converts them to new centroids.
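The text states only that a Boolean test flags one outlier per cluster per iteration, not the test itself. A hedged sketch of that splitting step, assuming a simple distance-threshold rule (mean plus 2.5 standard deviations of the within-cluster distances — an illustrative assumption, not the paper's definition):

```python
import numpy as np

def split_outliers(X, centroids, labels, z=2.5):
    """One Y-Means-style splitting pass: in each cluster, flag the farthest
    member as an outlier if it lies beyond mean + z*std of the within-cluster
    distances, and promote it to a new centroid.
    The threshold rule here is an assumption for illustration."""
    new_centroids = list(centroids)
    for j, c in enumerate(centroids):
        members = X[labels == j]
        if len(members) < 3:           # too small to estimate a spread
            continue
        d = np.linalg.norm(members - c, axis=1)
        cutoff = d.mean() + z * d.std()
        if d.max() > cutoff:
            new_centroids.append(members[d.argmax()])
    return np.array(new_centroids)
```

A point far from an otherwise tight cluster is detected and becomes a centroid of its own, which is how the final number of clusters grows independently of the initial k.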
3. EXPERIMENTAL RESULTS
("periental !or% is done through '6T-6:
prograing language. 6n iportant step in ost
clustering process is to select a distance easure#
!hich deterine the siilarity bet!een each data
objects fro calculation. This !ill anipulate the
shape of the clusters# as soe data objects !ill be
close to one another according to one distance and
farther a!ay according to another. They are
distinction !hether the clustering uses syetric
or asyetric distances. The syetric and +&
nor distance easure is used in this !or%. In the
(uclidean space *
n
# the distance bet!een t!o
points is usually given by the (uclidean distance
1+&nor distance2.
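For reference, the 2-norm distance on R^n used here is simply (a minimal sketch; the function name is hypothetical):

```python
import math

def euclidean(p, q):
    """2-norm (Euclidean) distance between two points in R^n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean([0, 0], [3, 4]))  # classic 3-4-5 triangle -> 5.0
```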
The Iris flower has various species; three species, namely Iris setosa, Iris virginica and Iris versicolor, are taken for clustering based on the available data sets provided by Fisher's Iris data set. The fifty data sets per species commonly record the length and width of the sepals and petals of the three species. One cluster contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor, which are not separable without the species information. So these experimental results demonstrate the efficiency of clustering all three species, together with the time complexity. The K-Means algorithm classifies the species with user-defined iterations, and depends on the initial centroids and the final number of clusters.
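The feature-range issue that motivates the normalization step in Y-Means is visible in the Iris measurements themselves. A small sketch using min-max scaling (one common normalization; the text does not specify the exact scheme used) on the first recorded row of each species from Fisher's data set:

```python
import numpy as np

# First recorded row of each species in Fisher's Iris data
# (sepal length, sepal width, petal length, petal width, in cm).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Iris setosa
    [7.0, 3.2, 4.7, 1.4],  # Iris versicolor
    [6.3, 3.3, 6.0, 2.5],  # Iris virginica
])

# Min-max scaling: each feature is rescaled to [0, 1], so wide-range
# features such as petal length no longer dominate the 2-norm distance.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm.round(3))
```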
The outputs obtained from the clustering are shown below: K-Means clustering is shown in Fig-1, and Fig-2 shows the iterations when N=40.
Fig-1: K-MEANS CLUSTERING
Fig-2: K-MEANS CLUSTERING WHEN N=40
Fig-3: Y-MEANS CLUSTERING
Fig-4: Y-MEANS CLUSTERING WHEN N=25
The Y-Means clustering algorithm overcomes the drawbacks of K-Means clustering by working on a trained set of normalized input data. It then undergoes the process of splitting and merging, with deletion of empty clusters to avoid degeneracy. As shown in Figures 3 and 4, the output obtained by Y-Means clusters the data set in fewer iterations (N=25) than K-Means. The red, blue and green colors indicate the three species of Iris flowers.
TABLE 1: K-MEANS RESULTS

Cluster    1    2    3    4    5   Total   Distance
1 size    50   69   27   25   21    210        37
2 size    39   41   35   47   30    219         0
3 size    46   33   43   48   30    196         9
4 size    30   41   44   39   30    221         8
5 size    51   30   43   38   30    221        18
TABLE 2: Y-MEANS RESULTS

Cluster    1    2    3    4    5   Total   Distance
1 size    39   43   39   49   38    172         8
2 size    62   20   46   29   46    156         0
3 size    37   44   31   36   42    165         9
4 size    52   45   20   37   30    162         8
5 size    33   33   30   42   52    163        18
Fig-5: AVERAGE RUN TIME
The results of both algorithms are analyzed based on the number of data points and the computational time of each algorithm. The behavior of the algorithms is analyzed through these experimental results. The number of data points is clustered by each algorithm according to the distribution of the arbitrary shapes of the data points. Time-complexity analysis is a part of computational complexity theory that is used to describe an algorithm's use of computational resources; in this case, the maximum and minimum times taken by the Y-Means algorithm are 172 and 156 respectively. Likewise, from Table 1, 221 and 196 are the maximum and minimum times taken by the K-Means algorithm. The performance of the algorithms has been analyzed over several iterations by considering different numbers of data points as input (300 data points, 400 data points, etc.) and numbers of clusters of 10 and 15 (for which the results are not shown); the results obtained are found to be highly adequate. Figure 5 shows the graph of the average results over the distribution of data points. The average execution time is taken from Tables 1 and 2. It is easy to identify from Figure 5 that there is a difference between the times of the two algorithms: the average execution time of the Y-Means algorithm is much lower than that of the K-Means algorithm.
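The best-case/worst-case run times reported in Tables 1 and 2 come from the authors' MATLAB runs; the measurement pattern itself can be sketched as a simple wall-clock harness (a hypothetical helper, not the authors' code):

```python
import time

def time_call(fn, repeats=5):
    """Return (min, max) wall-clock milliseconds over several repeated runs,
    mirroring the best-case/worst-case timing comparison in the text."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return min(samples), max(samples)

best, worst = time_call(lambda: sorted(range(10000)))
```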
4. CONCLUSIONS
This paper presents a comparative analysis of two unsupervised clustering algorithms, namely Y-Means and K-Means, for data classification. Experimental results show that Y-Means performs very well compared with K-Means. Furthermore, we also analyzed the overall performance of the algorithms by using three different species from the Iris data set as the initial input for clustering. The outcome of this experiment is that Y-Means provides better clustering of the data set than K-Means, in fewer iteration steps and with an improved run time. Our future work will mostly concentrate on various ways to improve not only the performance of the algorithm but also the accuracy and efficiency of the clustering.
