Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
Blackwell Publishing and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series C (Applied Statistics).
http://www.jstor.org
100 APPLIED STATISTICS
C FIND WIMUM ENTRY
C
6o PIvcr = ,xC
KK = 0
DO 70 I = II, s
K = INDEX(I)
IF (ABS(LU(IC, IIl .LE. PIVOT) GOTO70
PIVOT = ABS(W(K, IIM
KK I
70 CONTIlUE
IF (IE *EQ. 0) GOTO 10
C
c SWITCHORDER
C
ISAVE = INDEX(ltK)
= INDEX('II
INDMEX(KiC)
INDEX(II) = ISAVE
C
C PUT IN COUJimSor LU ONE AT A TIME
C
IF (INTIA IIBASE(II) IR
IF (II *EQ. MNGOTO90
J = II + 1
ix) 80 I = it M
K = INDEXVI)
LU(E, II) = W(E, II / LU(ISAVE, II)
80 CONTINUE
90 CCNTINUE
EKE = IRCW
RETURN
END
AlgorithmAS 136
A K-MeansClustering
Algorithm
By J. A. HARTIGAN and M. A. WONG
New Haven,Connecticut,
Yale University, U.S.A.
Keywords: K-MEANS CLUSTERING ALGORITHM; TRANSFER ALGORITHM
LANGUAGE
ISO Fortran
DESCRIPTION AND PURPOSE
METHOD
The algorithmrequiresas inputa matrixof M pointsin N dimensionsand a matrixof
K initialclustercentresin N dimensions.The numberof pointsin clusterL is denotedby
NC(L). D(I, L) is theEuclideandistancebetweenpointI and clusterL. The generalprocedure
is to searchfora K-partitionwithlocallyoptimalwithin-cluster sum of squares by moving
pointsfromone clusterto another.
STATISTICAL ALGORITHMS 101
Step 1. For each pointI (I = 1,2, ..., M), finditsclosestand secondclosestclustercentres,
IC1(I) and IC2(I) respectively.AssignpointI to clusterICI(I).
Step 2. Update theclustercentresto be the averagesof pointscontainedwithinthem.
Step 3. Initially,all clustersbelongto thelive set.
Step 4. This is the optimal-transfer (OPTRA) stage:
Considereach pointI (I = 1,2, ..., M) in turn. If clusterL (L = 1,2, ..., K) is updatedin the
last quick-transfer (QTRAN) stage, then it belongs to the live set throughoutthis stage.
Otherwise, at each step,itis notin thelivesetifithas notbeen updatedin thelastM optimal-
transfersteps. Let point I be in clusterLI. If LI is in the live set, do Step 4a; otherwise,
do Step 4b.
Step4a. Computetheminimumofthequantity,R2 = [NC(L) * D(I, L)2]/[NC(L)+ 1],over
all clustersL (L LI, L= 1,2, ..., K). Let L2 be the clusterwiththe smallestR2. If this
value is greaterthanor equal to [NC(L1) * D(I, Ll)2]/[NC(Ll) - 1], no reallocationis necessary
andL2 is thenewIC2(I). (Note thatthevalue [NC(L1) * D(I,LI)2]/[NC(Ll) - 1] is remembered
and willremainthesameforpointI untilclusterLI is updated.) Otherwise, pointI is allocated
to clusterL2 and LI is thenew IC2(I). Clustercentresare updatedto be themeansof points
assignedto themif reallocationhas takenplace. The two clustersthat are involvedin the
transfer of pointI at thisparticularstepare now in theliveset.
Step 4b. This stepis the same as Step 4a, exceptthatthe minimumR2 is computedonly
over clustersin the live set.
Step 5. Stop if the live set is empty.Otherwise,go to Step 6 afterone pass throughthe
data set.
Step 6. This is thequick-transfer (QTRAN) stage:
Considereach pointI (1 = 1,2, ..., M) in turn. Let LI = IC1(I) and L2 = IC2(I). It is not
necessaryto checkthe pointI ifboth theclustersLI and L2 have not changedin the last M
steps. Computethevalues
RI = [NC(L1) * D(I,L1)2]/[NC(L1)- 1] and R2 = [NC(L2) * D(I,L2)2]/[NC(L2)+ 11.
(As noted earlier,RI is remembered and will remainthe same untilclusterLI is updated.)
If RI is less thanR2, pointI remainsin clusterL1. Otherwise,switchICl (I) and IC2(I) and
updatethecentresof clustersLI and L2. The twoclustersare also notedfortheirinvolvement
in a transfer at thisstep.
Step 7. If no transfer tookplace in thelast M steps,go to Step4. Otherwise,
go to Step 6.
STRUCTURE
SUBROUTINE KMNS (A, M, N, C, K, ICI, IC2, NC, AN1, AN2, NCP, D, ITRAN, LIVE,
ITER, WSS, IFAULT)
Formalparameters
A Real array(M, N) input: the data matrix
M Integer input: the numberof points
N Integer input: the numberof dimensions
C Real array(K, N) input: thematrixof initialclustercentres
output: the matrixof finalclustercentres
K Integer input: thenumberof clusters
IC1 Integerarray(M) output: the clustereach pointbelongsto
IC2 Integerarray(M) workspace: thisarrayis used to remembertheclusterwhich
each pointis mostlikelyto be transferred
to at
each step
NC Integerarray(K) output: the numberof pointsin each cluster
AN1 Real array(K) workspace:
AN2 Real array(K) workspace:
102 APPLIED STATISTICS
NCP Integerarray(K) workspace:
D Real array(M) workspace:
ITRAN Integerarray(K) workspace:
LIVE Integerarray(K) workspace:
ITE R Integer input: themaximumnumberof iterationsallowed
WSS Real array(K) sumofsquaresofeach cluster
output: thewithin-cluster
IEA ULT Integer output: see Fault Diagnosticsbelow
FAULT DIAGNOSTICS
IFA ULT -0 No fault
IFA ULT 1 At leastone clusteris emptyaftertheinitialassignment.(A bettersetofinitial
clustercentresis called for)
IFA ULT = 2 The allowedmaximumnumberof iterationsis exceeded
IFA ULT = 3 K is less thanor equal to 1 or greaterthanor equal to M
Auxiliaryalgorithms
The followingauxiliaryalgorithmsare called: SUBROUTINE OPTRA (A, M, N, C, K,
IC1, IC2, NC, AN1, AN2, NCP, D, ITRAN, LIVE, INDEX) and SUBROUTINE QTRAN
(A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D, ITRAN, INDEX) whichare included.
RELATED ALGORITHMS
A relatedalgorithm is AS 113(A transferalgorithm fornon-hierarchial classification)
given
byBanfieldand Bassill(1977). Thisalgorithm uses swopsas wellas transfersto tryto overcome
theproblemof local optima;thatis, forall pairsof points,a testis made whetherexchanging
theclustersto whichthepointsbelongwillimprovethecriterion.It willbe substantially more
expensivethanthepresentalgorithmforlargeM.
The presentalgorithmis similarto Algorithm AS 58 (Euclideanclusteranalysis)givenby
Sparks(1973). Bothalgorithms aim at finding
a K-partitionof thesample,withwithin-cluster
sum of squares whichcannot be reducedby movingpointsfromone clusterto the other.
However,the implementation of AlgorithmAS 58 does not satisfythis condition. At the
stage whereeach point is examinedin turnto see if it should be reassignedto a different
cluster,onlythe closestcentreis used to checkforpossiblereallocationof the givenpoint;
a clustercentreother than the closest one may have the smallestvalue of the quantity
+ 1)} dI2,wheren,is thenumberof pointsin cluster/and di is thedistancefromclusterI
{nl/(n1
to the givenpoint. Hence, in general,AlgorithmAS 58 does not providea locally optimal
solution.
The two algorithms are testedon variousgenerateddata sets. The timeconsumedon the
IBM 370/158and thewithin-cluster sum of squares of the resultingK-partitions are givenin
Table 1. While comparingthe entriesof the table, note that AS 58 does not give locally
optimalsolutionsand so shouldbe expectedto take less time. The WSS are different forthe
two algorithmsbecause theyarriveat different partitionsof the sets of points. A savingof
about 50 per centin timeoccursin KMNS due to using"live" setsand due to usinga quick-
transferstagewhichreducesthenumberof optimaltransfer iterationsby a factorof 4. Thus,
KMNS comparedto AS 58 is locallyoptimaland takesless time,especiallywhenthenumber
of clustersis large.
TIME AND AccupAcY
The timeis approximately equal to CMNKI whereI is the numberof iterations.For an
IBM 370/158,C = 21 x 10-5sec. However,different data structures requirequite different
numbersof iterations;and a carefulselectionof initialclustercentreswill also lead to a
considerablesavingin time.
Storagerequirement: M(N+ 3) + K(N+ 7).
STATISTICAL ALGORITHMS 103
TABLE 1
Time(sec) WSS
ADDITIONALCOMMENTS
One way of obtainingthe initial clustercentresis suggestedhere. The points are
firstordered by their distancesto the overall mean of the sample. Then, for cluster
L (L = 1,2, ..., K), the {1 + (L -1) * [M/K]}thpointis chosen to be its initialclustercentre.
someK samplepointsarechosenas theinitialclustercentres.Usingthisinitialization
In effect,
process,it is guaranteedthat no clusterwill be emptyafterthe initialassignmentin the
subroutine.A quick initialization, whichis dependenton theinputorderof thepoints,takes
the firstK pointsas the initialcentres.
ACKNOWLEDGEMENTS
This researchis supportedby National ScienceFoundationGrantMCS75-08374.
REFERENCES
BANFIELD, C. F. and BASSILL,L. C. (1977). Algorithm AS113. A transfer algorithm fornon-hierarchical
Appl.Statist.,26, 206-210.
classification.
HARTIGAN,J.A. (1975). Clustering New York: Wiley.
Algorithms.
SPARKS,D. N. (1973). Algorithm AS 58. Euclideanclusteranalysis.Appl.Statist.,22, 126-130.
SUBROUTINEKIRTS(A, M, N, C, K, ICI, ICZ, NC, ANI, AN2, NCP,
* D, ITRAN, LIVE, ITER, WSS, IFAULT)
C
C ALGORITHMAS 136 APPL. STATIST. (1979) VOL.28, NO.1
c
C DIVIDE M POINTS IN N-DIMENSIONALSPACE INTO K CWSTERS
C SO THIATTHE WITHIN CUJSTERSUM OF SQUARESIS MINIMIZED.
C
DIMENSION A(M, N, ICI(Ml, IC2(M), D(Ml
DIMENSIONC(K, N), NC(K), AN1(K), AN2(K), NCP(K)
DIMENSION ITRAN(K), LIVE(K), WSStK), DT(2)
C
C DEFINE BIG TO BE A VERY LARGEPOSITIVE NUMBER
C
DATABIG /1,0E1O/
104 APPLIED STATISTICS
IFAULT = 3
IF (I LE. 1 OR, K GE, M) RETURN
C
C FOR EACH POITr I, FIND ITS TWO CLOSEST CENTRES,
C IC1(I) AND IC2(I). ASSIGN IT TO IC1(I).
C
DO 50 I = 1, M
ICl(I) = 1
IC2(I) = 2
DO 10 IL-- 1, 2
DT(IL) = 0.0
DO 10 J = 1, N;
DA = A(I, J) - C(IL, J)
DT(IL) = DT(IL) + DA * DA
10 COtNTINUE
IF (I)T(1) . . T(2!)) GOTO 2o
ICI(I) = 2
IC2(I) = 1
TEMP = DT(1
DT(1) DT(2)
DT() = TEMP
20 DO 50 L = 3, K
DB = 0.0
DO 30 J = 1, NJ
DC = A(I, J) - C(L, J)
DB = DB + DC * DC
IF (DO .GE. DT(2)) GOTO 50
30 CONTINUE
IF (DB *LT. DT(1)* GCTO 40
DT(2) = D
IC2(I = L
GOTO 50
40 DT(2) = DT(1l
IC2(I) = ICI(I)
DT(l) = DO
IC1(I) = L
50 C0NTIN4UE
C
C UPDATE CUISTER CENTRES TO BE TIIE AVERAGE
C (OF POINTS CONTAINED WITHIN THEM
C
DO 70 L 1, K
NC(L) = 0
DO 6o = 1, tN
6o C(L, J) = 0.0
70 CONTINUE
DO 90 I = 1, M
L = IC(I)
TIC(L) = NC(L) + i
D)O So 3 = 1, 1N
80 C(L, J) = C(L, J) + A(I, 3)
90 CONTINTUE
50 CONTINUE3
R2 = DC * AN2(L)
L2 = L
6o CoNTINUE3
IF (R2 .LT. DIV) G(FO 70
C
C IF NO TRAiNSFER IS NECESSARY, 12 IS THE NE?t IC2(I)
C
IC2(I) = J2
GTrO go
C
C UPDATE CLJSTER CENTRES, LIVE, NCP, AN1 AND AN2
C FOR CLUSTERS L AND 12, AND) UPDAT13 IC1(I) AND IC2(I)
C
70 INDIX = 0
LIVE(L1) = M+ I
LIVEL'T) = M + I
NCP(L1) = I
NCP(LE) = I
ALl = NC(1)
ALY = ALA - 1. 0
ALE2 = C S(2)
ALT = ALTE?+ 1,0
DO 8o0J 1, N
C(L1, J) = (C(L,r J) * ALI - A(I, J)) ALW
C(L2, J) - (C(2, J) * AL9E + A(I, J)) / ALT
80 CONTINUE)
NC(L1) = NCC(L1I 1
NC(L2) = NC(1 2) + 1
AN2 (LI) = ALW / ALL
AN1(LIL = BIG
IF (ALW GT. 1.0) AN(L1 = A1T / (ALW - 1.0)
AN1(L2) = ALT / AU
AN2(L2) = ALT / (ALT + 1,0)
IC1(I) = 2
IC2(I) = LI
90 COnUE=
IF (INDEX .EQ, MN RETURN
100 CONITINUE
DOI 110 L 1, 1
C
C ITRAN(Ll IS SET TO ZERO BEFORE ENTERING, QTRAXL
C ALSO, LIVE(L) liAS TO BE DECREASED BY M BEFORE
C RE-ENTERING OPTRAt
C
ITRAN (L) 0
LIVE(L! - LIVE(L) - M
110 C(ONTINUE
RETURIN
END
C
SUBROlTINE QTRAN(A, M, N, C, K, IC1, IC2, NC, AN1,
* ANE, NCP, D, ITRAN, INDEX)
C
C ALORITIJULAS 136.?. APPL. STATIST. (17q) VOL.28, NO.1
C
C TIIIS IS TIIE QUICK TRANiSFER STAGE.
C IC1(IN IS TIHE CLJSTER WHIICII POINT I BELONGS TO.
C IC2(I) IS TE CLUJSTER WHIICHIP0INT I IS MOST
C LIKEL.Y TO BE TR.ANSFERRED TO.
C FOR EACH POINT I, IC1(I) AND IC2(I* ARE SWITCHED, IFs
C NECESSARY, TO REDUCE WITHIN CLUSTER SUM OF SQUARES.
C THE CLJUSTER CENTRES ARE UPDATED AFTE1R EACH STEP
C
DIMENSION A(V, N), IC1(M), IC2(M), D(M)
DIMENSIOi C(KC, N1), NC(KC), AN1(1), AN2(K), NCP(K), ITRAN(K)
C
C DEFINE BIG TO BE A VERY LRGlE POSITIVE NUMIIER
C
DATA BIG /1,OE1O/
108 APPLIED STATISTICS
C IN THE )OPTIMAL-TRANSFERSTAGE, NCP(L) INDICATES THE
C STEP AT WHICHICUJSTER L IS LAST UPDATED
C IN TIHE QUICIC-TRANSFER STAGE, NCP(LN IS EQUAL TO THE
C STEP AT WHICII CLUSTER L IS LAST UPDATED PLUS M
C
ICOUN = 0
ISTEP = 0
10 DO 70 I = 1, M
ICWUN ICOlUN+ 1
ISTEP ISTEP + 1
LI = IC1(I)
L2 = ICZ(I)
C
C IF POINT I IS THE ONLY MEMBEROF CUJSTER Li, NO TRANSFER
C
IF (NC(.L1) *EQ. 1) GOTO 6
C
C IF IST1EP IS GREATER THIANNCP(LA), NO NEED TO RECOMPUTE
C DISTANCE FROM4POINT I TO CLUSTER LI
C NOTE THAT IF CLUSTER LI IS LAST UPDATED EXACTLY MASTEPS
C AGO WE STILL NEED TO COMPUTE THE, DISTANCE FROM POINT I
C TO CLUSTER LI
C
IF (ISTEP .GT. NCP(L1)) GOTO 30
DA =0.0
DO 20 J = 1, NJ
DB = A(I, J) - C(LX, J)
DA = DA + DO * DO
20 CONTIlUE
D(I) = DA * ANl(Li)
C
C IF ISTEP IS GREATER THAN OR EQUAL TO BOTl NCP(Li) AND
C NCP(12) THERE WILL BE NO TRANSFER OF POINT I AT THIS STEP
C
30 IF (ISTEP .GE. NCP(i) .AND. ISTEP GE. NCP(L2)) GOTo 6o
R2 = D(I) / AN2(L2)
DD = 0.0
DO 40 J = 1, N
DE = A(I, J) - C(L2, J)
DD = DO + DE * DE
IF (DD GE. R12) GOTr Go
40 COtNTINUE
C
C UPD}ATECLJUSTERCENTRES, NCP, NC, ITRAN, ANI AND AN2
C FOR CLlJSTERS LA AND 12. ALSO, UPDATE IC1(I) AND IC2(I).
C NOTE THIAT IF Al4Y UPDATING OCCURS IN THIS STAGE,
C INDEX IS SET BACK TO 0
C
ICOUN 0
INDEX 0
ITRAN(Li) = 1
ITRAWT(2) = 1
NCP(L1) = ISTEP + M
NCP(L2) = ISTEP + M
ALi = N4C
(LI)
ALW = ALL - 1.0
AI2 = NC(U.)
ALT = AL2 + 1. 0
DO 507 = 1, N
C(LI, J) = (C(Li, J) * ALI - A'I, J.) / ALU
C(L2, J) (C(12, J) * ALS + A(I, J)) / ALT
50 CONTINIUE
NC(LI) =NC(Li - 1
NC(L2) = NC(12) + 1
AN2(L1) = ALU( / ALi
AN1(LU) = BIG
IF (ATJ .GT. 1.0) AN1() = ALW / (ALW - 1.0)
AN1(L-2) = ALT / ALI
AN2(12) = ALT / t(ALT + 1.0)
ICiCI) -= 2
IC2(I) = I,
C
C IF NO REALUICATION TOOK PLACE IN TliE LAST M STEPS, RERN
C
6o IF (ICOUN .EQ. M) RETURN
70 CONTINUE
GOYTOl10
IND