Solutions To Tutorial 4 Cluster Analysis

RESEARCH AND SURVEY STATISTICS – STA3022F
SOLUTION TO TUTORIAL #4
Week 5 2007
CLUSTER ANALYSIS
QUESTION 1: HOTEL PROFILE ANALYSIS study
ITERATION 1: Merge C and D at a distance of 1.21

Revised distance matrix:
         A     B    CD     E     F     G     H     I     J     K
A     0.00  3.97  3.82  6.83  4.07  3.52  4.40  2.99  4.30  4.75
B           0.00  1.39  5.33  3.49  2.07  2.45  2.59  2.32  2.83
CD                0.00  5.08  3.11  1.73  3.01  1.85  2.78  3.54
E                       0.00  4.44  6.48  7.10  5.82  6.09  5.33
F                             0.00  3.36  4.65  2.77  3.51  3.59
G                                   0.00  2.37  1.50  2.50  3.82
H                                         0.00  2.91  1.62  2.90
I                                               0.00  2.68  3.62
J                                                     0.00  1.88
K                                                           0.00
ITERATION 2: Merge CD and B at a distance of 1.39

         A   CDB     E     F     G     H     I     J     K
A     0.00  3.82  6.83  4.07  3.52  4.40  2.99  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.45  1.85  2.32  2.83
E                 0.00  4.44  6.48  7.10  5.82  6.09  5.33
F                       0.00  3.36  4.65  2.77  3.51  3.59
G                             0.00  2.37  1.50  2.50  3.82
H                                   0.00  2.91  1.62  2.90
I                                         0.00  2.68  3.62
J                                               0.00  1.88
K                                                     0.00
ITERATION 3: Merge G and I at a distance of 1.50

         A   CDB     E     F    GI     H     J     K
A     0.00  3.82  6.83  4.07  2.99  4.40  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.45  2.32  2.83
E                 0.00  4.44  5.82  7.10  6.09  5.33
F                       0.00  2.77  4.65  3.51  3.59
GI                            0.00  2.37  2.50  3.62
H                                   0.00  1.62  2.90
J                                         0.00  1.88
K                                               0.00
ITERATION 4: Merge H and J at a distance of 1.62

         A   CDB     E     F    GI    HJ     K
A     0.00  3.82  6.83  4.07  2.99  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.32  2.83
E                 0.00  4.44  5.82  6.09  5.33
F                       0.00  2.77  3.51  3.59
GI                            0.00  2.37  3.62
HJ                                  0.00  1.88
K                                         0.00
ITERATION 5: Merge CDB and GI at a distance of 1.73

           A CDBGI     E     F    HJ     K
A       0.00  2.99  6.83  4.07  4.30  4.75
CDBGI         0.00  5.08  2.77  2.32  2.83
E                   0.00  4.44  6.09  5.33
F                         0.00  3.51  3.59
HJ                              0.00  1.88
K                                     0.00
ITERATION 6: Merge HJ and K at a distance of 1.88

           A CDBGI     E     F   HJK
A       0.00  2.99  6.83  4.07  4.30
CDBGI         0.00  5.08  2.77  2.32
E                   0.00  4.44  5.33
F                         0.00  3.51
HJK                             0.00
ITERATION 7: Merge CDBGI and HJK at a distance of 2.32

                   A  CDBGIHJK         E         F
A               0.00      2.99      6.83      4.07
CDBGIHJK                  0.00      5.08      2.77
E                                   0.00      4.44
F                                             0.00
ITERATION 8: Merge CDBGIHJK and F at a distance of 2.77

                     A  CDBGIHJKF          E
A                 0.00       2.99       6.83
CDBGIHJKF                    0.00       4.44
E                                       0.00
ITERATION 9: Merge CDBGIHJKF and A at a distance of 2.99

              CDBGIHJKFA           E
CDBGIHJKFA          0.00        4.44
E                               0.00
ITERATION 10: Merge CDBGIHJKFA and E at a distance of 4.44 (the smaller of the previous
cluster-to-E distance of 4.44 and the A-to-E distance of 6.83)
Amalgamation schedule:
Iteration  Merged objects   Distance
1          C,D              1.21
2          CD,B             1.39
3          G,I              1.50
4          H,J              1.62
5          CDB,GI           1.73
6          HJ,K             1.88
7          CDBGI,HJK        2.32
8          CDBGIHJK,F       2.77
9          CDBGIHJKF,A      2.99
10         CDBGIHJKFA,E     4.44
Dendrogram:
Brief comment:
There appear to be 5 clusters of hotels that emerge (using a cut-off of around 2.00). Hotels B, C, D,
G, and I form cluster 1, and hotels H, J, and K form cluster 2. Of the remaining hotels, hotels F and
A are somewhat dissimilar to the rest, and form their own clusters. Hotel E is completely dissimilar
to the rest, and also forms its own cluster. You need to examine attribute evaluations (cluster
profiles) to establish reasons for similarities and differences.
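The merging steps for Question 1 can be sketched in a few lines of Python. This is a minimal sketch, assuming single linkage (the distance between two clusters is the smallest distance between any of their members), which is what the revised matrices above imply. It starts from the revised matrix after iteration 1, so C and D are already merged. The function name `single_linkage` and the way merged clusters are named are illustrative choices, not part of the tutorial.

```python
# Upper triangle of the revised distance matrix after ITERATION 1 (C and D merged).
labels = ["A", "B", "CD", "E", "F", "G", "H", "I", "J", "K"]
upper = [
    [3.97, 3.82, 6.83, 4.07, 3.52, 4.40, 2.99, 4.30, 4.75],  # A
    [1.39, 5.33, 3.49, 2.07, 2.45, 2.59, 2.32, 2.83],        # B
    [5.08, 3.11, 1.73, 3.01, 1.85, 2.78, 3.54],              # CD
    [4.44, 6.48, 7.10, 5.82, 6.09, 5.33],                    # E
    [3.36, 4.65, 2.77, 3.51, 3.59],                          # F
    [2.37, 1.50, 2.50, 3.82],                                # G
    [2.91, 1.62, 2.90],                                      # H
    [2.68, 3.62],                                            # I
    [1.88],                                                  # J
]


def single_linkage(labels, upper):
    """Return the merge schedule as (cluster_a, cluster_b, distance) tuples."""
    # Key every distance by the unordered pair of cluster names.
    dist = {}
    for i, row in enumerate(upper):
        for k, d in enumerate(row):
            dist[frozenset((labels[i], labels[i + 1 + k]))] = d

    clusters = list(labels)
    schedule = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        d = dist.pop(pair)
        a, b = sorted(pair, key=clusters.index)   # keep the original ordering
        merged = a + b
        schedule.append((a, b, d))
        # Single linkage: the new distance is the smaller of the two old ones.
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((merged, c))] = min(
                    dist.pop(frozenset((a, c))), dist.pop(frozenset((b, c)))
                )
        clusters = [merged if c == a else c for c in clusters if c != b]
    return schedule


for step, (a, b, d) in enumerate(single_linkage(labels, upper), start=2):
    print(f"Iteration {step}: merge {a} and {b} at {d:.2f}")
```

Merged clusters are named by joining their labels in matrix order, so the names may be spelled differently from the hand solution (for example BCD rather than CDB), but the groupings are the same.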
QUESTION 2: NEWSPAPERS SEGMENTATION ANALYSIS study
Key point: in this example you MUST standardise the data before doing the cluster
analysis, otherwise the READER (readership) variable, which is of a much higher order of
magnitude, will dominate the analysis unfairly.
Unstandardised Data:
             READER  ARTICLE   ADVERT   COMMUN

A              3000        2       10        6
B              2000        7        2        4
C              6000        7        9        2
D              2000        4        4        4
E              5000        7        7        3
Mean           3600      5.4      6.4      3.8
Std. Dev.   1816.59     2.30     3.36     1.48
Standardised Data:
             READER  ARTICLE   ADVERT   COMMUN

A           -0.3303  -1.4769   1.0709   1.4832
B           -0.8808   0.6950  -1.3089   0.1348
C            1.3212   0.6950   0.7735  -1.2136
D           -0.8808  -0.6081  -0.7140   0.1348
E            0.7707   0.6950   0.1785  -0.5394
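The standardisation step can be checked with a short Python sketch. It assumes the sample standard deviation (divisor n - 1), which is what reproduces the Std. Dev. row above. The variable names are illustrative.

```python
from statistics import mean, stdev

# Unstandardised data: READER, ARTICLE, ADVERT, COMMUN for each newspaper.
raw = {
    "A": [3000, 2, 10, 6],
    "B": [2000, 7, 2, 4],
    "C": [6000, 7, 9, 2],
    "D": [2000, 4, 4, 4],
    "E": [5000, 7, 7, 3],
}

columns = list(zip(*raw.values()))          # one tuple per variable
means = [mean(col) for col in columns]
sds = [stdev(col) for col in columns]       # sample standard deviation (n - 1)

# z-score: (value - mean) / standard deviation, variable by variable.
standardised = {
    name: [(x - m) / s for x, m, s in zip(row, means, sds)]
    for name, row in raw.items()
}

for name, row in standardised.items():
    print(name, " ".join(f"{z:8.4f}" for z in row))
```

After this step every variable has mean 0 and standard deviation 1, so READER no longer outweighs the other three variables in the distance calculation.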
ITERATION #0: Initial Distance Matrix
A B C D E
A 0.0000 3.5358 3.8478 2.4621 3.2888
B 3.5358 0.0000 3.3171 1.4325 2.3225
C 3.8478 3.3171 0.0000 3.2523 1.0543
D 2.4621 1.4325 3.2523 0.0000 2.3825
E 3.2888 2.3225 1.0543 2.3825 0.0000
e.g. Distance between A and B:
\[ D = \sqrt{(-0.3303 + 0.8808)^2 + (-1.4769 - 0.6950)^2 + (1.0709 + 1.3089)^2 + (1.4832 - 0.1348)^2} = 3.5358 \]
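The same Euclidean distance can be computed for every pair with Python's standard library. This is a small sketch using the standardised values as printed (four decimals), so the last digit of a distance may occasionally differ from the matrix above.

```python
import math

# Standardised profiles: READER, ARTICLE, ADVERT, COMMUN.
z = {
    "A": [-0.3303, -1.4769, 1.0709, 1.4832],
    "B": [-0.8808, 0.6950, -1.3089, 0.1348],
    "C": [1.3212, 0.6950, 0.7735, -1.2136],
    "D": [-0.8808, -0.6081, -0.7140, 0.1348],
    "E": [0.7707, 0.6950, 0.1785, -0.5394],
}

# math.dist returns the square root of the summed squared differences.
distance = {(a, b): math.dist(z[a], z[b]) for a in z for b in z}

print(f"d(A, B) = {distance['A', 'B']:.4f}")
for a in z:
    print(a, " ".join(f"{distance[a, b]:.4f}" for b in z))
```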
ITERATION #1
STEP 1: Merge C and E at distance 1.05.
STEP 2: Revise distance matrix.
2.1) Compute average attribute profile for two merged objects.
           READER  ARTICLE   ADVERT   COMMUN
C          1.3212   0.6950   0.7735  -1.2136
E          0.7707   0.6950   0.1785  -0.5394
Average    1.0459   0.6950   0.4760  -0.8765
2.2) Revise the data matrix.

A -0.3303 -1.4769 1.0709 1.4832
B -0.8808 0.6950 -1.3089 0.1348
CE 1.0459 0.6950 0.4760 -0.8765
D -0.8808 -0.6081 -0.7140 0.1348
2.3) Recompute the distance matrix.
A B CE D
A 0.0000 3.5358 3.5402 2.4621
B 3.5358 0.0000 2.8144 1.4325
CE 3.5402 2.8144 0.0000 2.8016
D 2.4621 1.4325 2.8016 0.0000
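ITERATION #2
STEP 1: Merge B and D at distance 1.4325.
STEP 2: Revise distance matrix.
2.1) Compute average attribute profile for two merged objects.
           READER  ARTICLE   ADVERT   COMMUN
B         -0.8808   0.6950  -1.3089   0.1348
D         -0.8808  -0.6081  -0.7140   0.1348
Average   -0.8808   0.0434  -1.0114   0.1348
2.2) Revise the data matrix.
           READER  ARTICLE   ADVERT   COMMUN
A         -0.3303  -1.4769   1.0709   1.4832
BD        -0.8808   0.0434  -1.0114   0.1348
CE         1.0459   0.6950   0.4760  -0.8765
2.3) Recompute the distance matrix.
                A       BD       CE
A          0.0000   2.9612   3.5402
BD         2.9612   0.0000   2.7151
CE         3.5402   2.7151   0.0000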
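ITERATION #3
STEP 1: Merge BD and CE at distance 2.7151.
STEP 2: Revise distance matrix.
2.1) Compute average attribute profile for two merged objects.
           READER  ARTICLE   ADVERT   COMMUN
BD        -0.8808   0.0434  -1.0114   0.1348
CE         1.0459   0.6950   0.4760  -0.8765
Average    0.0826   0.3692  -0.2677  -0.3708
2.2) Revise the data matrix.
           READER  ARTICLE   ADVERT   COMMUN
A         -0.3303  -1.4769   1.0709   1.4832
BDCE       0.0826   0.3692  -0.2677  -0.3708
2.3) Recompute the distance matrix.
                A     BDCE
A          0.0000   2.9678
BDCE       2.9678   0.0000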
ITERATION #4
At the final step, merge A and BDCE at a distance of 2.9678.
AMALGAMATION SCHEDULE:
ITERATION DISTANCE MERGED OBJECTS

1 1.0543 C, E
2 1.4325 B, D
3 2.7151 BD, CE
4 2.9678 A, BDCE
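The whole procedure for Question 2 (merge the closest pair, replace it by the average profile, recompute distances) can be sketched as follows. This is a minimal sketch that mirrors the hand calculation: each merged cluster is represented by the simple average of the two profiles being merged. The function name `centroid_schedule` is illustrative, and because intermediate averages are not rounded here, the fourth decimal of a distance may differ slightly from the hand solution.

```python
import math

# Standardised profiles: READER, ARTICLE, ADVERT, COMMUN.
profiles = {
    "A": [-0.3303, -1.4769, 1.0709, 1.4832],
    "B": [-0.8808, 0.6950, -1.3089, 0.1348],
    "C": [1.3212, 0.6950, 0.7735, -1.2136],
    "D": [-0.8808, -0.6081, -0.7140, 0.1348],
    "E": [0.7707, 0.6950, 0.1785, -0.5394],
}


def centroid_schedule(profiles):
    """Repeatedly merge the two closest profiles and average them."""
    clusters = dict(profiles)
    schedule = []
    while len(clusters) > 1:
        names = list(clusters)
        # Distance between every pair of current cluster profiles.
        pairs = [
            (math.dist(clusters[a], clusters[b]), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        ]
        d, a, b = min(pairs)
        # Average attribute profile of the two merged objects (step 2.1).
        merged = [(x + y) / 2 for x, y in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[a + b] = merged          # revised data matrix (step 2.2)
        schedule.append((a, b, d))
    return schedule


for step, (a, b, d) in enumerate(centroid_schedule(profiles), start=1):
    print(f"{step}  {d:.4f}  {a}, {b}")
```

Averaging the two profiles with equal weight is exact here because every merge joins clusters of equal size; with unequal cluster sizes a size-weighted average would be needed to get the true centroid.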
DENDROGRAM:
There appear to be 3 distinct clusters (A, BD and CE), suggesting the cut-off point should be
around 2.
To profile clusters, average the attribute values within each cluster. In this case it is possible to use
either the RAW or STANDARDISED data. Using the standardised data is suggested, as you’ve
already worked out the relevant numbers.
Profiles:
CLUSTER READER ARTICLE ADVERT COMMUN

A -0.3303 -1.4769 1.0709 1.4832
BD -0.8808 0.0434 -1.0114 0.1348
CE 1.0459 0.6950 0.4760 -0.8765
Brief suggested Interpretation:
Cluster BD is characterised by a small readership and very few adverts. It offers a moderate degree
of quality in its articles and community news.
Cluster CE is characterised by a large readership, high standard of articles, but poor community
news. It has an average amount of advertising.
Cluster A has moderate-to-low readership, many adverts, poor articles, but excellent community
news.
QUESTION 3: AGRICULTURAL CONTROL BOARDS study
1. Completed Euclidean Distance Matrix
Euclidean distances (Cluster Analysis Q4)

Case No. Meat Maize Dec Citrus Sugar Wheat
Meat 0.00 2.13 1.53 2.27 0.65 1.99
Maize 2.13 0.00 1.74 2.46 1.62 0.38
Dec 1.53 1.74 0.00 0.88 1.09 1.62
Citrus 2.27 2.46 0.88 0.00 1.94 2.32
Sugar 0.65 1.62 1.09 1.94 0.00 1.54
Wheat 1.99 0.38 1.62 2.32 1.54 0.00
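The completed matrix can be reproduced from the raw attribute values. The sketch below assumes the budget, net export and membership figures quoted in the cluster profiles of part 3, and that "Dec" stands for the Deciduous fruit board. The distances are computed on the unstandardised values, which is what matches the matrix above.

```python
import math

# (budget, net export, membership) for each control board.
boards = {
    "Meat": (1.52, 1.38, 0.68),
    "Maize": (2.22, 3.32, 1.20),
    "Dec": (1.79, 1.89, 2.10),
    "Citrus": (1.95, 1.56, 2.90),
    "Sugar": (1.45, 1.90, 1.06),
    "Wheat": (2.48, 3.04, 1.20),
}

# Euclidean distance between every pair of boards.
distance = {(a, b): math.dist(boards[a], boards[b]) for a in boards for b in boards}

for a in boards:
    print(a.ljust(7), " ".join(f"{distance[a, b]:.2f}" for b in boards))
```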
2. Dendrogram
Amalgamation Schedule (Cluster Analysis Q4)

Single Linkage
Euclidean distances
Linkage distance   Objects in merged cluster
0.3820994          Maize, Wheat
0.6478426          Meat, Sugar
0.8800568          Dec, Citrus
1.094212           Meat, Sugar, Dec, Citrus
1.542757           Meat, Sugar, Dec, Citrus, Maize, Wheat
[Figure: Tree Diagram for 6 Cases, Single Linkage, Euclidean distances. Cases in order: Meat,
Sugar, Dec, Citrus, Maize, Wheat. Horizontal axis: Linkage Distance, 0.2 to 1.6.]
3. Discussion of findings. If the cut-off is set at 1.0, three clusters emerge.
Cluster 1
Control Board     Budget   Net export   Membership
Meat              1.52     1.38         0.68
Sugar             1.45     1.90         1.06
Cluster mean      1.485    1.64         0.87
Description: Control boards have a low budget, modest exports and very few members.
Cluster 2
Control Board     Budget   Net export   Membership
Deciduous fruit   1.79     1.89         2.10
Citrus fruit      1.95     1.56         2.90
Cluster mean      1.87     1.725        2.50
Description: Control boards have a very limited budget, modest exports, but a large
membership base.
Cluster 3
Control Board     Budget   Net export   Membership
Maize             2.22     3.32         1.20
Wheat             2.48     3.04         1.20
Cluster mean      2.35     3.18         1.20
Description: Control boards have large budgets and undertake large-scale exports, but have few
members.
If the cut-off were set at 1.2, clusters 1 and 2 would merge. Profile-wise, they are both “small”
in terms of budget size and level of exports, but differ on membership size: the
“Deciduous/Citrus” boards have a larger number of members than the “Meat/Sugar” boards.
4. Standardisation issues
Standardisation removes the influence of scale on the calculation of the distance matrix. This is
important because a failure to do so can result in variables that are measured with a small natural
unit (e.g. grams) dominating the distance calculations at the expense of those variables measured
with a large natural unit (e.g. tons). In this case, all variables have already been informally
standardised so that the numbers in the table are of similar magnitudes (around 1). No statistical
standardisation (subtract mean, divide by standard deviation) is required.
If we want to find out what the standardised profiles would be for each cluster, we need to know
the mean and standard deviation of each variable. These are (you need to know how to calculate
these – see Stats 1!)

           Budget   Net export   Membership
Mean       1.90     2.18         1.52
Std dev    0.40     0.80         0.82
Standardised scores can then be calculated in the usual way:

Std score = (Unstd score - Mean) / Std dev
              CLUSTER 1             CLUSTER 2             CLUSTER 3

              Unstdised  Stdised    Unstdised  Stdised    Unstdised  Stdised
Variable      mean       mean       mean       mean       mean       mean
Budget        1.49       -1.04      1.87       -0.08      2.35       1.12
Net export    1.64       -0.67      1.73       -0.57      3.18       1.24
Membership    0.87       -0.80      2.50       1.19       1.20       -0.39
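The table of standardised cluster means can be reproduced as follows. This is a sketch under two assumptions: the cluster membership from the cut-off of 1.0, and the sample standard deviation (divisor n - 1) computed over all six boards. It uses unrounded means and standard deviations, and the names are illustrative.

```python
from statistics import mean, stdev

# (budget, net export, membership) for each control board.
boards = {
    "Meat": (1.52, 1.38, 0.68),
    "Sugar": (1.45, 1.90, 1.06),
    "Dec": (1.79, 1.89, 2.10),
    "Citrus": (1.95, 1.56, 2.90),
    "Maize": (2.22, 3.32, 1.20),
    "Wheat": (2.48, 3.04, 1.20),
}
clusters = {1: ["Meat", "Sugar"], 2: ["Dec", "Citrus"], 3: ["Maize", "Wheat"]}

columns = list(zip(*boards.values()))
means = [mean(col) for col in columns]
sds = [stdev(col) for col in columns]       # sample standard deviation

profiles = {}
for cid, members in clusters.items():
    # Unstandardised cluster mean for each of the three variables.
    raw_mean = [mean(boards[m][j] for m in members) for j in range(3)]
    # Std score = (Unstd score - Mean) / Std dev.
    std_mean = [(x - m) / s for x, m, s in zip(raw_mean, means, sds)]
    profiles[cid] = (raw_mean, std_mean)

for cid, (raw_mean, std_mean) in profiles.items():
    print(cid, [round(v, 3) for v in raw_mean], [round(v, 2) for v in std_mean])
```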
QUESTION 4: MAGAZINE CONTENT AND PREFERENCE study
(1) H0: µCluster1 (Décor) = µCluster2 (Décor)

H1: µCluster1 (Décor) ≠ µCluster2 (Décor)
MSTreatment = Between SS / df = 8.06 / 1 = 8.06

MSError = Within SS / df = 50.93 / 58 = 0.87
F = MST / MSE = 8.06 / 0.87 = 9.26
Fcrit = F(1,58) (0.05) ≈ 4.00
Fstat > Fcrit, so reject H0 and conclude that the DECOR attribute average value is not the
same between clusters. DECOR is therefore able to significantly discriminate between the
two clusters.
(2) H0: µCluster1 (Garden) = µCluster2 (Garden) = µCluster3 (Garden)

H1: At least one mean differs from the rest
MST = 12.54 / 2 = 6.27

MSE = 46.45 / 57 = 0.81
F = 6.27 / 0.81 = 7.74
Fcrit = F(2,57) (0.05) ≈ 3.15
Fstat > Fcrit, so reject H0 and conclude that the GARDEN attribute average value is not the
same across the 3 clusters. GARDEN is therefore able to significantly discriminate
between the three clusters.
(3) For Buy attribute,

MST = 33.16 / 2 = 16.58
MSE = 25.83 / 57 = 0.45
F = 16.58 / 0.45 = 36.85
Fcrit = F(2,57) (0.05) ≈ 3.15
Fstat > Fcrit, so BUY is clearly significant.
The significant predictor variables are BUY, DÉCOR, GARDEN and PAY.
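The three F tests follow the same arithmetic, which the sketch below collects in one helper. The sums of squares, degrees of freedom and critical values are the ones quoted above; the function name `f_statistic` is illustrative. Because the code does not round the mean squares before dividing, its F values are slightly smaller than the hand-calculated ones (about 9.18 rather than 9.26 for DÉCOR), without changing any conclusion.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F ratio: mean square between / mean square within."""
    ms_treatment = ss_between / df_between
    ms_error = ss_within / df_within
    return ms_treatment / ms_error


# (between SS, between df, within SS, within df, critical F at 5%)
tests = {
    "DECOR": (8.06, 1, 50.93, 58, 4.00),
    "GARDEN": (12.54, 2, 46.45, 57, 3.15),
    "BUY": (33.16, 2, 25.83, 57, 3.15),
}

results = {}
for name, (ssb, dfb, ssw, dfw, f_crit) in tests.items():
    f = f_statistic(ssb, dfb, ssw, dfw)
    results[name] = (f, f > f_crit)
    print(f"{name}: F = {f:.2f}, reject H0: {f > f_crit}")
```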
(iv) Cluster profiles for Cluster 1 on significant variables only:
From TABLE 10:

CLUSTER 1
BUY 1.07
DECOR 0.23
GARDEN -0.37
PAY 0.78
Respondents in cluster 1 buy considerably more lifestyle magazines than those in other
clusters, and are also willing to pay more for a magazine than those in other clusters. They
have a slightly above average interest in décor and a slightly below average interest in
gardening features, but these are secondary effects. Essentially, this is a cluster which
contains strong fans of lifestyle magazines.
(v) Noting from Figures 4.5 and 4.6 that each cluster has 30 observations, one can complete the
table:
            NEED = Yes   NEED = No   Row Totals

Cluster 1   19           11          30
Cluster 2   16           14          30
Totals      35           25          60
CLUSTER 1: 19 “YES”es out of 30 = 63%

CLUSTER 2: 16 “YES”es out of 30 = 53%
H0: There is no significant association between clusters and the perceived need for a
new lifestyle magazine.
H1: There is a significant association between clusters and the perceived need for a new
lifestyle magazine.
Use the Pearson Chi-squared statistic given in Figure 4.9 of 0.617, and compare this to a
critical Chi-squared statistic (at the 5% level, with 1 degree of freedom) of 3.84. Since the
test statistic is less than the critical value, one cannot reject the null hypothesis of no
association at the 5% level. Cluster membership appears to have no significant association
with perceived need for a new magazine.
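The Pearson Chi-squared statistic can be recomputed from the completed table. This is a minimal sketch of the standard formula (sum of (observed - expected)^2 / expected, with expected = row total x column total / grand total); the critical value of 3.84 is the one quoted above.

```python
# Observed counts: rows are clusters 1 and 2, columns are NEED = Yes / No.
observed = [[19, 11], [16, 14]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # count expected under H0
        chi2 += (obs - expected) ** 2 / expected

critical = 3.84   # 5% level, 1 degree of freedom
print(f"chi-squared = {chi2:.3f}, reject H0: {chi2 > critical}")
```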
(vi) Up to you to provide your own interpretation. Both 2-cluster and 3-cluster models are able
to pick up the clusters of disinterested readers (Cluster 2 in both solutions). The 3-cluster
model has the seeming advantage of distinguishing between those readers who want more
décor and more gardening articles (cluster 3), and those who are more or less happy with
the current format (cluster 1). Fortunately for the magazine publishers, those that are more
or less happy with the current format are those that tend to buy more magazines and pay a
premium. Those that desire a different format (one with more décor and gardening features)
tend not to buy a large number of magazines, and those that they do buy, they buy at an
average price.
On the basis of the fact that a 3-cluster solution picks up these important groupings, it
would seem to be preferable to a 2-cluster solution.
(vii) Question doesn’t specify whether you should use the 2-cluster or 3-cluster solution, so
we’ll look at both. For the 2-cluster solution
Euclidean distance of respondent 10 to Cluster 1 = 1.55

So, respondent 10 is closest to cluster 1 and should be included in that group.

For the 3-cluster solution
Euclidean distance of respondent 10 to Cluster 1 = 1.73

So, respondent 10 is again closest to cluster 1 and should be included in that group.
(viii) Again, the question hasn’t specified whether you should use the 2-cluster or 3-cluster
solution. This solution is just for the 2-cluster solution (the 3-cluster solution is left to you)
So, the new respondent is closer to cluster 1 than cluster 2 and should be included in cluster
1. To compute how this changes the centroid, just note that
• The previous cluster 1 centroid (without the new respondent) was given by (0.55,
0.36, 0.20, 0.81) and had 30 members.
• An average is calculated as the sum of all observations divided by the number of
observations.
• The old sum for each attribute was therefore 30*(0.55, 0.36, 0.20, 0.81) = (16.58,
11.00, 6.12, 24.31). These sums appear to be based on unrounded centroid values, so
they differ slightly from 30 times the rounded figures.
• The new sum is (16.58+0.2, 11.0-0.4, 6.12+0.3, 24.31+0.1) = (16.77, 10.60, 6.43,
24.41)
• Finally the new average/centroid is (1/31)*(16.77, 10.60, 6.43, 24.41) = (0.54,
0.34, 0.21, 0.79)
• Note how the new cluster 1 centroid is very close to the old one. This is expected:
one new observation shouldn’t change a centroid very much.
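The centroid update described in the bullets can be sketched as a small function. It takes the old attribute sums and the new respondent's values as quoted above; the function name `add_to_centroid` is illustrative.

```python
def add_to_centroid(old_sum, n_members, new_point):
    """Return the centroid after adding one observation to a cluster."""
    # New sum = old sum + new observation, attribute by attribute.
    new_sum = [s + x for s, x in zip(old_sum, new_point)]
    # New centroid = new sum divided by the new number of members.
    return [s / (n_members + 1) for s in new_sum]


old_sum = [16.58, 11.00, 6.12, 24.31]     # attribute sums over the 30 members
new_respondent = [0.2, -0.4, 0.3, 0.1]    # values added in the bullets above

new_centroid = add_to_centroid(old_sum, 30, new_respondent)
print([round(c, 2) for c in new_centroid])
```

Working from the sums avoids recomputing the mean from all 31 observations; only the old centroid and the cluster size are needed.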