SOLUTION TO TUTORIAL #4
Week 5 2007
CLUSTER ANALYSIS
ITERATION 1: Merge C and D at a distance of 1.21
Revised distance matrix:
      A     B     CD    E     F     G     H     I     J     K
A     0.00  3.97  3.82  6.83  4.07  3.52  4.40  2.99  4.30  4.75
B           0.00  1.39  5.33  3.49  2.07  2.45  2.59  2.32  2.83
CD                0.00  5.08  3.11  1.73  3.01  1.85  2.78  3.54
E                       0.00  4.44  6.48  7.10  5.82  6.09  5.33
F                             0.00  3.36  4.65  2.77  3.51  3.59
G                                   0.00  2.37  1.50  2.50  3.82
H                                         0.00  2.91  1.62  2.90
I                                               0.00  2.68  3.62
J                                                     0.00  1.88
K                                                           0.00
ITERATION 2: Merge CD and B at a distance of 1.39
Revised distance matrix:
      A     CDB   E     F     G     H     I     J     K
A     0.00  3.82  6.83  4.07  3.52  4.40  2.99  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.45  1.85  2.32  2.83
E                 0.00  4.44  6.48  7.10  5.82  6.09  5.33
F                       0.00  3.36  4.65  2.77  3.51  3.59
G                             0.00  2.37  1.50  2.50  3.82
H                                   0.00  2.91  1.62  2.90
I                                         0.00  2.68  3.62
J                                               0.00  1.88
K                                                     0.00
ITERATION 3: Merge G and I at a distance of 1.50
Revised distance matrix:
      A     CDB   E     F     GI    H     J     K
A     0.00  3.82  6.83  4.07  2.99  4.40  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.45  2.32  2.83
E                 0.00  4.44  5.82  7.10  6.09  5.33
F                       0.00  2.77  4.65  3.51  3.59
GI                            0.00  2.37  2.50  3.62
H                                   0.00  1.62  2.90
J                                         0.00  1.88
K                                               0.00
ITERATION 4: Merge H and J at a distance of 1.62
Revised distance matrix:
      A     CDB   E     F     GI    HJ    K
A     0.00  3.82  6.83  4.07  2.99  4.30  4.75
CDB         0.00  5.08  3.11  1.73  2.32  2.83
E                 0.00  4.44  5.82  6.09  5.33
F                       0.00  2.77  3.51  3.59
GI                            0.00  2.37  3.62
HJ                                  0.00  1.88
K                                         0.00
ITERATION 5: Merge CDB and GI at a distance of 1.73
Revised distance matrix:
       A      CDBGI  E      F      HJ     K
A      0.00   2.99   6.83   4.07   4.30   4.75
CDBGI         0.00   5.08   2.77   2.32   2.83
E                    0.00   4.44   6.09   5.33
F                           0.00   3.51   3.59
HJ                                 0.00   1.88
K                                         0.00
ITERATION 6: Merge HJ and K at a distance of 1.88
Revised distance matrix:
       A      CDBGI  E      F      HJK
A      0.00   2.99   6.83   4.07   4.30
CDBGI         0.00   5.08   2.77   2.32
E                    0.00   4.44   5.33
F                           0.00   3.51
HJK                                0.00
ITERATION 7: Merge CDBGI and HJK at a distance of 2.32
Revised distance matrix:
            A           CDBGIHJK    E           F
A           0.00        2.99        6.83        4.07
CDBGIHJK                0.00        5.08        2.77
E                                   0.00        4.44
F                                               0.00
ITERATION 8: Merge CDBGIHJK and F at a distance of 2.77
Revised distance matrix:
            A           CDBGIHJKF   E
A           0.00        2.99        6.83
CDBGIHJKF               0.00        4.44
E                                   0.00
ITERATION 9: Merge CDBGIHJKF and A at a distance of 2.99
Revised distance matrix:
            CDBGIHJKFA  E
CDBGIHJKFA  0.00        4.44
E                       0.00
Amalgamation schedule:
Iteration  Merged objects  Distance
1          C,D             1.21
2          CD,B            1.39
3          G,I             1.50
4          H,J             1.62
5          CDB,GI          1.73
6          HJ,K            1.88
7          CDBGI,HJK       2.32
8          CDBGIHJK,F      2.77
9          CDBGIHJKF,A     2.99
10         CDBGIHJKFA,E    4.44
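The single-linkage steps can be checked mechanically. The following is a minimal sketch, not the tutorial's own method of working: it starts from the first distance matrix shown (with C and D already merged into CD) and repeatedly merges the closest pair, taking the minimum of the two old distances as the new distance. The names `LABELS`, `UPPER` and `single_linkage` are illustrative only.

```python
from itertools import combinations

# Upper-triangular distances from the first matrix (C and D already merged as "CD").
LABELS = ["A", "B", "CD", "E", "F", "G", "H", "I", "J", "K"]
UPPER = [
    [3.97, 3.82, 6.83, 4.07, 3.52, 4.40, 2.99, 4.30, 4.75],  # A
    [1.39, 5.33, 3.49, 2.07, 2.45, 2.59, 2.32, 2.83],        # B
    [5.08, 3.11, 1.73, 3.01, 1.85, 2.78, 3.54],              # CD
    [4.44, 6.48, 7.10, 5.82, 6.09, 5.33],                    # E
    [3.36, 4.65, 2.77, 3.51, 3.59],                          # F
    [2.37, 1.50, 2.50, 3.82],                                # G
    [2.91, 1.62, 2.90],                                      # H
    [2.68, 3.62],                                            # I
    [1.88],                                                  # J
]


def single_linkage(labels, upper):
    """Return the merge schedule as a list of (cluster_a, cluster_b, distance)."""
    # dist maps an unordered pair of cluster names to their current distance.
    dist = {}
    for i, row in enumerate(upper):
        for offset, d in enumerate(row, start=1):
            dist[frozenset((labels[i], labels[i + offset]))] = d

    clusters = list(labels)
    schedule = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        d = dist[frozenset((a, b))]
        merged = a + b
        rest = [c for c in clusters if c not in (a, b)]
        # Single linkage: distance to the merged cluster is the smaller old distance.
        for c in rest:
            dist[frozenset((merged, c))] = min(
                dist[frozenset((a, c))], dist[frozenset((b, c))]
            )
        clusters = rest + [merged]
        schedule.append((a, b, d))
    return schedule


# Iteration numbering starts at 2 because C,D (iteration 1) is already merged.
for step, (a, b, d) in enumerate(single_linkage(LABELS, UPPER), start=2):
    print(f"{step:>2}  {a},{b}  {d:.2f}")
```

The cluster names printed by the sketch are simple concatenations, so the letter order may differ from the labels used in the matrices.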
Dendrogram:
Brief comment:
There appear to be 5 clusters of hotels that emerge (using a cut-off of around 2.00). Hotels B, C, D,
G, and I form cluster 1, and hotels H, J, and K form cluster 2. Of the remaining hotels, hotels F and
A are somewhat dissimilar to the rest, and form their own clusters. Hotel E is completely dissimilar
to the rest, and also forms its own cluster. You need to examine attribute evaluations (cluster
profiles) to establish reasons for similarities and differences.
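The link between a cut-off and the number of clusters can be sketched in a few lines: every merge at or below the cut-off reduces the cluster count by one. This is only an illustration; `clusters_at` is a hypothetical helper, and the last merge distance is taken as the smaller of the two remaining distances to E (6.83 and 4.44), as single linkage prescribes.

```python
# Merge distances for iterations 1 to 10 of the hotel example.
# The last value is min(6.83, 4.44), the single-linkage distance from E
# to the cluster containing all other hotels.
MERGE_DISTANCES = [1.21, 1.39, 1.50, 1.62, 1.73, 1.88, 2.32, 2.77, 2.99, 4.44]
N_HOTELS = 11  # hotels A to K


def clusters_at(cutoff, n_objects, merge_distances):
    """Number of clusters left when only merges at or below the cut-off are made."""
    merges_done = sum(1 for d in merge_distances if d <= cutoff)
    return n_objects - merges_done


print(clusters_at(2.00, N_HOTELS, MERGE_DISTANCES))  # 5
```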
Question 2: NEWSPAPERS SEGMENTATION ANALYSIS study
Key Point: In this example you MUST standardise the data before doing the cluster
analysis, otherwise the READERSHIP variable (which is of a much higher order of magnitude)
will dominate the analysis unfairly.
Unstandardised Data:
Standardised Data:
        A       B       C       D       E
A       0.0000  3.5358  3.8478  2.4621  3.2888
B       3.5358  0.0000  3.3171  1.4325  2.3225
C       3.8478  3.3171  0.0000  3.2523  1.0543
D       2.4621  1.4325  3.2523  0.0000  2.3825
E       3.2888  2.3225  1.0543  2.3825  0.0000
$$D_{AB} = \sqrt{(-0.3303+0.8808)^2 + (-1.4769-0.6950)^2 + (1.0709+1.3089)^2 + (1.4832-0.1348)^2} = 3.5358$$
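The distance between newspapers A and B can be reproduced with a short check. This is a sketch that assumes the four terms of the formula are in the order READER, ARTICLE, ADVERT, COMMUN; the standardised scores are read directly off the worked formula.

```python
import math

# Standardised scores taken from the worked formula
# (assumed order: READER, ARTICLE, ADVERT, COMMUN).
A = (-0.3303, -1.4769, 1.0709, 1.4832)
B = (-0.8808, 0.6950, -1.3089, 0.1348)

# Euclidean distance: square root of the sum of squared differences.
d_ab = math.dist(A, B)
print(round(d_ab, 4))  # 3.5358
```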
ITERATION #1
           READER  ARTICLE   ADVERT   COMMUN
C          1.3212   0.6950   0.7735  -1.2136
E          0.7707   0.6950   0.1785  -0.5394
Average    1.0459   0.6950   0.4760  -0.8765
        A       B       CE      D
A       0.0000  3.5358  3.5402  2.4621
B       3.5358  0.0000  2.8144  1.4325
CE      3.5402  2.8144  0.0000  2.8016
D       2.4621  1.4325  2.8016  0.0000
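The revised matrix follows from replacing C and E by their average (centroid) and recomputing Euclidean distances to it. A minimal sketch of that step is given here; the scores for A and B are those appearing in the worked distance formula, and `centroid` is an illustrative helper rather than anything defined in the tutorial.

```python
import math

# Standardised scores (READER, ARTICLE, ADVERT, COMMUN).
C = (1.3212, 0.6950, 0.7735, -1.2136)
E = (0.7707, 0.6950, 0.1785, -0.5394)
A = (-0.3303, -1.4769, 1.0709, 1.4832)  # from the worked distance formula
B = (-0.8808, 0.6950, -1.3089, 0.1348)  # from the worked distance formula


def centroid(*points):
    """Average each attribute over the given points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


CE = centroid(C, E)
print([round(x, 4) for x in CE])
print(round(math.dist(A, CE), 4))  # 3.5402
print(round(math.dist(B, CE), 4))  # 2.8144
```

Only the distances to the merged cluster change; entries between untouched newspapers (such as A to B) carry over from the previous matrix.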
ITERATION #2
        A       BD      CE
A       0.0000  2.9612  3.5402
BD      2.9612  0.0000  2.7151
CE      3.5402  2.7151  0.0000
ITERATION #3
        A       BDCE
A       0.0000  2.9678
BDCE    2.9678  0.0000
ITERATION #4
AMALGAMATION SCHEDULE:
DENDROGRAM:
There appear to be 3 distinct clusters (A, BD and CE), suggesting the cut-off point should be
around 2.
To profile clusters, average the attribute values within each cluster. In this case it is possible to use
either the RAW or STANDARDISED data. Using the standardised data is suggested, as you’ve
already worked out the relevant numbers.
Profiles:
Cluster BD is characterised by a small readership and very few adverts. It offers a moderate degree
of quality in its articles and community news.
Cluster CE is characterised by a large readership, high standard of articles, but poor community
news. It has an average amount of advertising.
Cluster A has moderate-to-low readership, many adverts, poor articles, but excellent community
news.
QUESTION 3: AGRICULTURAL CONTROL BOARDS study
2. Dendrogram
[Figure: Tree diagram for 6 cases (single linkage, Euclidean distances). Cases listed: Meat, Sugar, Dec, Citrus, Maize, Wheat. Axis: linkage distance, from 0.2 to 1.6.]
3. Discussion of findings. If the cut-off is set at 1.0, then three clusters emerge.
Cluster 1
Control Board     Budget  Net export  Membership
Meat              1.52    1.38        0.68
Sugar             1.45    1.90        1.06
Cluster mean      1.485   1.64        0.87
Description: Control boards have a low budget, modest exports and very few members.
Cluster 2
Control Board     Budget  Net export  Membership
Deciduous fruit   1.79    1.89        2.10
Citrus fruit      1.95    1.56        2.90
Cluster mean      1.87    1.725       2.50
Description: Control boards have a very limited budget, modest exports, but a large
membership base.
Cluster 3
Control Board     Budget  Net export  Membership
Maize             2.22    3.32        1.20
Wheat             2.48    3.04        1.20
Cluster mean      2.35    3.18        1.20
Description: Control boards have large budgets, undertake large scale exports, but with few
members.
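The three cluster profiles are simply column averages over each cluster's members. A minimal sketch of that calculation, using the figures from the tables above (`BOARDS`, `CLUSTERS` and `profile` are illustrative names):

```python
# Scores per control board: (Budget, Net export, Membership).
BOARDS = {
    "Meat": (1.52, 1.38, 0.68),
    "Sugar": (1.45, 1.90, 1.06),
    "Deciduous fruit": (1.79, 1.89, 2.10),
    "Citrus fruit": (1.95, 1.56, 2.90),
    "Maize": (2.22, 3.32, 1.20),
    "Wheat": (2.48, 3.04, 1.20),
}

# Cluster membership at a cut-off of 1.0.
CLUSTERS = {
    1: ["Meat", "Sugar"],
    2: ["Deciduous fruit", "Citrus fruit"],
    3: ["Maize", "Wheat"],
}


def profile(members):
    """Mean of each variable over the boards in one cluster."""
    rows = [BOARDS[m] for m in members]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


for number, members in CLUSTERS.items():
    budget, export, membership = profile(members)
    print(f"Cluster {number}: {budget:.3f} {export:.3f} {membership:.3f}")
```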
If the cut-off were set at 1.2, then clusters 1 and 2 would merge. Profile-wise, they are both “small”
in terms of budget size and level of exports, but differ on membership size: the
“Deciduous/Citrus” boards have a larger number of members than the “Meat/Sugar” boards.
4. Standardisation issues
Standardisation removes the influence of scale on the calculation of the distance matrix. This is
important because a failure to do so can result in variables that are measured with a small natural
unit (e.g. grams) dominating the distance calculations at the expense of those variables measured
with a large natural unit (e.g. tons). In this case, all variables have already been informally
standardised so that the numbers in the table are of similar magnitudes (around 1). No statistical
standardisation (subtract mean, divide by standard deviation) is required.
If we want to find out what the standardised profiles would be for each cluster, we need to know
the mean and standard deviation of each variable. These are (you need to know how to calculate
these – see Stats 1!)
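As a rough illustration of the calculation, the sketch below computes each variable's mean and standard deviation over the six control boards and then standardises a cluster's mean profile. It assumes the sample standard deviation (n - 1 denominator); if the course uses the population version, swap `statistics.stdev` for `statistics.pstdev`. All names here are illustrative.

```python
import statistics

VARIABLES = ("Budget", "Net export", "Membership")
ROWS = {
    "Meat": (1.52, 1.38, 0.68),
    "Sugar": (1.45, 1.90, 1.06),
    "Deciduous fruit": (1.79, 1.89, 2.10),
    "Citrus fruit": (1.95, 1.56, 2.90),
    "Maize": (2.22, 3.32, 1.20),
    "Wheat": (2.48, 3.04, 1.20),
}

columns = list(zip(*ROWS.values()))
MEANS = {v: statistics.mean(col) for v, col in zip(VARIABLES, columns)}
# Assumption: sample standard deviation (n - 1 denominator).
SDS = {v: statistics.stdev(col) for v, col in zip(VARIABLES, columns)}


def standardised_profile(members):
    """Average the members' raw scores, then convert each mean to a z-score."""
    member_columns = zip(*(ROWS[m] for m in members))
    return {
        v: (statistics.mean(col) - MEANS[v]) / SDS[v]
        for v, col in zip(VARIABLES, member_columns)
    }


for v in VARIABLES:
    print(f"{v:<11} mean={MEANS[v]:.3f} sd={SDS[v]:.3f}")
print(standardised_profile(["Maize", "Wheat"]))
```

A positive standardised value means the cluster sits above the overall average on that variable, and a negative value means it sits below.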
QUESTION 4: MAGAZINE CONTENT AND PREFERENCE study
Fstat > Fcrit, so reject H0 and conclude that the DECOR attribute average value is not the
same between clusters. DECOR is therefore able to significantly discriminate between the
two clusters.
Fstat > Fcrit, so reject H0 and conclude that the GARDEN attribute average value is not the
same across the 3 clusters. GARDEN is therefore able to significantly discriminate
between the three clusters.
Respondents in cluster 1 buy considerably more lifestyle magazines than those in other
clusters, and are also willing to pay more for a magazine than those in other clusters. They
have a slightly above average interest in décor and a slightly below average interest in
gardening features, but these are secondary effects. Essentially, this is a cluster which
contains strong fans of lifestyle magazines.
(v) Noting from Figures 4.5 and 4.6 that each cluster has 30 observations in it, one can
obtain the completed table.
H0: There is no significant association between clusters and the perceived need for a
new lifestyle magazine.
H1: There is a significant association between clusters and the perceived need for a new
lifestyle magazine.
Use the Pearson Chi-squared statistic given in Figure 4.9 of 0.617, and compare this to a
critical Chi-squared statistic (at the 5% level, with 1 degree of freedom) of 3.84. Since the
test statistic is less than the critical value, one cannot reject the null hypothesis of no
association at the 5% level. Cluster membership appears to have no significant association
with perceived need for a new magazine.
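The decision rule can be checked numerically without statistical tables. This sketch relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so the critical value and p-value follow from the normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

chi2_stat = 0.617  # Pearson chi-squared statistic quoted from Figure 4.9
alpha = 0.05       # significance level

# With 1 degree of freedom, chi-squared = Z**2 for a standard normal Z,
# so the upper-tail critical value is the squared two-sided normal quantile.
z = NormalDist()
critical = z.inv_cdf(1 - alpha / 2) ** 2
p_value = 2 * (1 - z.cdf(chi2_stat ** 0.5))

reject_h0 = chi2_stat > critical
print(f"critical = {critical:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

This shortcut only applies to 1 degree of freedom; for larger tables a chi-squared table or a statistics package would be needed.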
(vi) It is up to you to provide your own interpretation. Both the 2-cluster and 3-cluster models
are able to pick up the cluster of uninterested readers (Cluster 2 in both solutions). The 3-cluster
model has the seeming advantage of distinguishing between those readers who want more
décor and more gardening articles (cluster 3), and those who are more or less happy with
the current format (cluster 1). Fortunately for the magazine publishers, those that are more
or less happy with the current format are those that tend to buy more magazines and pay a
premium. Those that desire a different format (one with more décor and gardening features)
tend not to buy a large number of magazines, and those that they do buy, they buy at an
average price.
On the basis of the fact that a 3-cluster solution picks up these important groupings, it
would seem to be preferable to a 2-cluster solution.
(vii) The question doesn’t specify whether you should use the 2-cluster or 3-cluster solution, so
we’ll look at both. For the 2-cluster solution:
(viii) Again, the question hasn’t specified whether you should use the 2-cluster or 3-cluster
solution. This solution is just for the 2-cluster solution (the 3-cluster solution is left to you).
Euclidean distance of respondent 10 to Cluster 1 = 1.10
Euclidean distance of respondent 10 to Cluster 2 = 1.28
So, the new respondent is closer to cluster 1 than cluster 2 and should be included in cluster
1. To compute how this changes the centroid, just note that
• The previous cluster 1 centroid (without the new respondent) was given by (0.55,
0.36, 0.20, 0.81) and had 30 members.
• An average is calculated as sum of all observations divided by number of
observations.
• The old sum for each attribute was therefore 30*(0.55, 0.36, 0.20, 0.81) = (16.58,
11.00, 6.12, 24.31)
• The new sum is (16.58+0.2, 11.0-0.4, 6.12+0.3, 24.31+0.1) = (16.77, 10.60, 6.43,
24.41)
• Finally the new average/centroid is (1/31)* (16.77, 10.60, 6.43, 24.41) = (0.54,
0.34, 0.21, 0.79)
• Note how the new cluster 1 centroid is very close to the old one. This is expected…
one new observation shouldn’t change a centroid very much.
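The same update can be sketched in code. Two assumptions are made here: the new respondent's scores (0.2, -0.4, 0.3, 0.1) are inferred from the amounts added in the "new sum" step, and the old centroid is used as quoted to two decimals, so results may differ slightly in the last decimal from figures worked out with unrounded values.

```python
import math

old_centroid = (0.55, 0.36, 0.20, 0.81)  # cluster 1 centroid, rounded as quoted
n_old = 30                               # members before the new respondent
respondent = (0.2, -0.4, 0.3, 0.1)       # inferred from the "new sum" step

# Distance from the new respondent to the existing cluster 1 centroid.
distance = math.dist(respondent, old_centroid)

# Old sum = n * centroid; add the new scores, then divide by n + 1.
new_centroid = tuple(
    (n_old * c + x) / (n_old + 1) for c, x in zip(old_centroid, respondent)
)

print(round(distance, 2))  # 1.1
print([round(c, 2) for c in new_centroid])
```

Because the weight on the new observation is only 1/31, each coordinate moves by at most a few hundredths.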