
RESEARCH AND SURVEY STATISTICS – STA3022F

SOLUTION TO TUTORIAL #4
Week 5 2007
CLUSTER ANALYSIS

QUESTION 1: HOTEL PROFILE ANALYSIS study

ITERATION 1: Merge C and D at a distance of 1.21


Revised distance matrix:

          A      B     CD      E      F      G      H      I      J      K
A      0.00   3.97   3.82   6.83   4.07   3.52   4.40   2.99   4.30   4.75
B             0.00   1.39   5.33   3.49   2.07   2.45   2.59   2.32   2.83
CD                   0.00   5.08   3.11   1.73   3.01   1.85   2.78   3.54
E                           0.00   4.44   6.48   7.10   5.82   6.09   5.33
F                                  0.00   3.36   4.65   2.77   3.51   3.59
G                                         0.00   2.37   1.50   2.50   3.82
H                                                0.00   2.91   1.62   2.90
I                                                       0.00   2.68   3.62
J                                                              0.00   1.88
K                                                                     0.00

ITERATION 2: Merge CD and B at a distance of 1.39


Revised distance matrix:

           A    CDB      E      F      G      H      I      J      K
A       0.00   3.82   6.83   4.07   3.52   4.40   2.99   4.30   4.75
CDB            0.00   5.08   3.11   1.73   2.45   1.85   2.32   2.83
E                     0.00   4.44   6.48   7.10   5.82   6.09   5.33
F                            0.00   3.36   4.65   2.77   3.51   3.59
G                                   0.00   2.37   1.50   2.50   3.82
H                                          0.00   2.91   1.62   2.90
I                                                 0.00   2.68   3.62
J                                                        0.00   1.88
K                                                               0.00

ITERATION 3: Merge G and I at a distance of 1.50


Revised distance matrix:

           A    CDB      E      F     GI      H      J      K
A       0.00   3.82   6.83   4.07   2.99   4.40   4.30   4.75
CDB            0.00   5.08   3.11   1.73   2.45   2.32   2.83
E                     0.00   4.44   5.82   7.10   6.09   5.33
F                            0.00   2.77   4.65   3.51   3.59
GI                                  0.00   2.37   2.50   3.62
H                                          0.00   1.62   2.90
J                                                 0.00   1.88
K                                                        0.00

ITERATION 4: Merge H and J at a distance of 1.62


Revised distance matrix:

           A    CDB      E      F     GI     HJ      K
A       0.00   3.82   6.83   4.07   2.99   4.30   4.75
CDB            0.00   5.08   3.11   1.73   2.32   2.83
E                     0.00   4.44   5.82   6.09   5.33
F                            0.00   2.77   3.51   3.59
GI                                  0.00   2.37   3.62
HJ                                         0.00   1.88
K                                                 0.00

ITERATION 5: Merge CDB and GI at a distance of 1.73


Revised distance matrix:

             A  CDBGI      E      F     HJ      K
A         0.00   2.99   6.83   4.07   4.30   4.75
CDBGI            0.00   5.08   2.77   2.32   2.83
E                       0.00   4.44   6.09   5.33
F                              0.00   3.51   3.59
HJ                                    0.00   1.88
K                                            0.00

ITERATION 6: Merge HJ and K at a distance of 1.88


Revised distance matrix:

             A  CDBGI      E      F    HJK
A         0.00   2.99   6.83   4.07   4.30
CDBGI            0.00   5.08   2.77   2.32
E                       0.00   4.44   5.33
F                              0.00   3.51
HJK                                   0.00

ITERATION 7: Merge CDBGI and HJK at a distance of 2.32


Revised distance matrix:

                   A  CDBGIHJK         E         F
A               0.00      2.99      6.83      4.07
CDBGIHJK                  0.00      5.08      2.77
E                                   0.00      4.44
F                                             0.00

ITERATION 8: Merge CDBGIHJK and F at a distance of 2.77


Revised distance matrix:

                     A  CDBGIHJKF          E
A                 0.00       2.99       6.83
CDBGIHJKF                    0.00       4.44
E                                       0.00

ITERATION 9: Merge CDBGIHJKF and A at a distance of 2.99

Revised distance matrix:

              CDBGIHJKFA           E
CDBGIHJKFA          0.00        4.44
E                               0.00

Under single linkage, the distance from the merged cluster to E is the smaller of the two old distances: min(6.83, 4.44) = 4.44.

ITERATION 10: Merge CDBGIHJKFA and E at a distance of 4.44

Amalgamation schedule:
Iteration   Merged objects   Distance
1           C,D              1.21
2           CD,B             1.39
3           G,I              1.50
4           H,J              1.62
5           CDB,GI           1.73
6           HJ,K             1.88
7           CDBGI,HJK        2.32
8           CDBGIHJK,F       2.77
9           CDBGIHJKF,A      2.99
10          CDBGIHJKFA,E     4.44

Dendrogram:

Brief comment:
Five clusters of hotels emerge at a cut-off of around 2.00. Hotels B, C, D,
G, and I form cluster 1, and hotels H, J, and K form cluster 2. Of the remaining hotels, hotels F and
A are somewhat dissimilar to the rest, and form their own clusters. Hotel E is completely dissimilar
to the rest, and also forms its own cluster. You need to examine attribute evaluations (cluster
profiles) to establish reasons for similarities and differences.
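
The single-linkage updating rule used in the iterations above (the distance from a merged cluster to any other cluster is the smaller of the two old distances) can be sketched in Python. This is only an illustrative sketch: it starts from the revised matrix after iteration 1, where C and D are already merged, and the names `single_linkage`, `LABELS` and `UPPER`, as well as the way merged labels are concatenated, are assumptions made for this example.

```python
# Upper triangle of the revised distance matrix after iteration 1
# (row i holds the distances from LABELS[i] to the labels that follow it).
LABELS = ["A", "B", "CD", "E", "F", "G", "H", "I", "J", "K"]
UPPER = [
    [3.97, 3.82, 6.83, 4.07, 3.52, 4.40, 2.99, 4.30, 4.75],  # A
    [1.39, 5.33, 3.49, 2.07, 2.45, 2.59, 2.32, 2.83],        # B
    [5.08, 3.11, 1.73, 3.01, 1.85, 2.78, 3.54],              # CD
    [4.44, 6.48, 7.10, 5.82, 6.09, 5.33],                    # E
    [3.36, 4.65, 2.77, 3.51, 3.59],                          # F
    [2.37, 1.50, 2.50, 3.82],                                # G
    [2.91, 1.62, 2.90],                                      # H
    [2.68, 3.62],                                            # I
    [1.88],                                                  # J
]


def single_linkage(labels, upper):
    """Return the amalgamation schedule as (cluster_a, cluster_b, distance)."""
    # Store each pairwise distance under an unordered pair of cluster names.
    dist = {}
    for i, row in enumerate(upper):
        for offset, d in enumerate(row, start=1):
            dist[frozenset((labels[i], labels[i + offset]))] = d

    clusters = list(labels)
    schedule = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        a, b = sorted(pair, key=clusters.index)   # keep a stable label order
        schedule.append((a, b, dist[pair]))
        merged = a + b
        others = [c for c in clusters if c not in pair]
        # Single linkage: new distance is the minimum of the two old ones.
        new_dist = {
            frozenset((merged, c)): min(dist[frozenset((a, c))],
                                        dist[frozenset((b, c))])
            for c in others
        }
        # Drop every distance that involved either of the merged clusters.
        dist = {k: v for k, v in dist.items() if not (k & pair)}
        dist.update(new_dist)
        clusters = others + [merged]
    return schedule


for a, b, d in single_linkage(LABELS, UPPER):
    print(f"merge {a} and {b} at {d:.2f}")
```

Because each merged distance is just a minimum over values already in the matrix, every merge height printed is one of the original pairwise distances.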

QUESTION 2: NEWSPAPERS SEGMENTATION ANALYSIS study

Key Point: In this example you MUST standardise the data before doing the cluster
analysis; otherwise the READER (readership) variable, which is of a much higher order of
magnitude than the other variables, will dominate the distance calculations unfairly.

Unstandardised Data:

            READER   ARTICLE   ADVERT   COMMUN
A             3000         2       10        6
B             2000         7        2        4
C             6000         7        9        2
D             2000         4        4        4
E             5000         7        7        3
Mean          3600       5.4      6.4      3.8
Std. Dev.  1816.59      2.30     3.36     1.48

Standardised Data:

      READER   ARTICLE    ADVERT    COMMUN
A    -0.3303   -1.4769    1.0709    1.4832
B    -0.8808    0.6950   -1.3089    0.1348
C     1.3212    0.6950    0.7735   -1.2136
D    -0.8808   -0.6081   -0.7140    0.1348
E     0.7707    0.6950    0.1785   -0.5394

ITERATION #0: Initial Distance Matrix

A B C D E
A 0.0000 3.5358 3.8478 2.4621 3.2888
B 3.5358 0.0000 3.3171 1.4325 2.3225
C 3.8478 3.3171 0.0000 3.2523 1.0543
D 2.4621 1.4325 3.2523 0.0000 2.3825
E 3.2888 2.3225 1.0543 2.3825 0.0000

e.g.) Distance between A and B:

\[
D_{AB} = \sqrt{(-0.3303+0.8808)^2 + (-1.4769-0.6950)^2 + (1.0709+1.3089)^2 + (1.4832-0.1348)^2} = 3.5358
\]
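
The standardisation and the distance calculation above can be checked with a short Python sketch. It assumes the sample standard deviation (dividing by n - 1), which is what reproduces the 1816.59, 2.30, 3.36 and 1.48 in the table; the names `RAW`, `standardise` and `euclidean` are chosen for this illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# Raw newspaper data from the sheet: READER, ARTICLE, ADVERT, COMMUN
RAW = {
    "A": (3000, 2, 10, 6),
    "B": (2000, 7, 2, 4),
    "C": (6000, 7, 9, 2),
    "D": (2000, 4, 4, 4),
    "E": (5000, 7, 7, 3),
}


def standardise(data):
    """Convert each column to z-scores using the sample standard deviation."""
    columns = list(zip(*data.values()))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # stdev divides by n - 1
    return {
        name: [(x - m) / s for x, m, s in zip(row, means, sds)]
        for name, row in data.items()
    }


def euclidean(p, q):
    """Straight-line distance between two attribute profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


Z = standardise(RAW)
# Compare with the 3.5358 worked out by hand for A and B.
print(round(euclidean(Z["A"], Z["B"]), 4))
```

Running the same function on the unstandardised rows shows the problem described in the Key Point: the READER differences in the thousands swamp every other attribute.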

ITERATION #1

STEP 1: Merge C and E at distance 1.05.

STEP 2: Revise distance matrix.

2.1) Compute the average attribute profile for the two merged objects.

          READER   ARTICLE   ADVERT    COMMUN
C         1.3212    0.6950   0.7735   -1.2136
E         0.7707    0.6950   0.1785   -0.5394
Average   1.0459    0.6950   0.4760   -0.8765

2.2) Revise the data matrix.

READER ARTICLE ADVERT COMMUN


A -0.3303 -1.4769 1.0709 1.4832
B -0.8808 0.6950 -1.3089 0.1348
CE 1.0459 0.6950 0.4760 -0.8765
D -0.8808 -0.6081 -0.7140 0.1348

2.3) Recompute the distance matrix.

A B CE D
A 0.0000 3.5358 3.5402 2.4621
B 3.5358 0.0000 2.8144 1.4325
CE 3.5402 2.8144 0.0000 2.8016
D 2.4621 1.4325 2.8016 0.0000

ITERATION #2

STEP 1: Merge B and D at distance 1.4325.

STEP 2: Revise distance matrix.

2.1) Compute the average attribute profile for the two merged objects.

READER ARTICLE ADVERT COMMUN


B -0.8808 0.6950 -1.3089 0.1348
D -0.8808 -0.6081 -0.7140 0.1348
Average -0.8808 0.0434 -1.0114 0.1348

2.2) Revise the data matrix.

READER ARTICLE ADVERT COMMUN


A -0.3303 -1.4769 1.0709 1.4832
BD -0.8808 0.0434 -1.0114 0.1348
CE 1.0459 0.6950 0.4760 -0.8765

2.3) Recompute the distance matrix.

A BD CE
A 0.0000 2.9612 3.5402
BD 2.9612 0.0000 2.7151
CE 3.5402 2.7151 0.0000

ITERATION #3

STEP 1: Merge BD and CE at distance 2.7151.

STEP 2: Revise distance matrix.

2.1) Compute the average attribute profile for the two merged objects.

READER ARTICLE ADVERT COMMUN


BD -0.8808 0.0434 -1.0114 0.1348
CE 1.0459 0.6950 0.4760 -0.8765
Average 0.0826 0.3692 -0.2677 -0.3708

2.2) Revise the data matrix.

READER ARTICLE ADVERT COMMUN


A -0.3303 -1.4769 1.0709 1.4832
BDCE 0.0826 0.3692 -0.2677 -0.3708

2.3) Recompute the distance matrix.

A BDCE
A 0.0000 2.9678
BDCE 2.9678 0.0000

ITERATION #4

At the final step, merge A and BDCE at a distance of 2.9678.

AMALGAMATION SCHEDULE:

ITERATION DISTANCE MERGED OBJECTS


1 1.0543 C, E
2 1.4325 B, D
3 2.7151 BD, CE
4 2.9678 A, BDCE
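
The merge-and-average procedure traced in the four iterations above can be sketched as follows. This is a hedged illustration rather than the sheet's own method: it weights each average by cluster size, which coincides with the simple average used above because every merge in this example joins clusters of equal size until the last step. The names `PROFILES` and `centroid_clustering` are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Standardised profiles from the sheet: READER, ARTICLE, ADVERT, COMMUN
PROFILES = {
    "A": (-0.3303, -1.4769, 1.0709, 1.4832),
    "B": (-0.8808, 0.6950, -1.3089, 0.1348),
    "C": (1.3212, 0.6950, 0.7735, -1.2136),
    "D": (-0.8808, -0.6081, -0.7140, 0.1348),
    "E": (0.7707, 0.6950, 0.1785, -0.5394),
}


def centroid_clustering(profiles):
    """Repeatedly merge the two closest profiles and average them."""
    # Each cluster is stored as (centroid, number of original objects).
    clusters = {name: (list(p), 1) for name, p in profiles.items()}
    schedule = []
    while len(clusters) > 1:
        names = list(clusters)
        # Find the pair of clusters whose centroids are closest.
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: dist(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (pa, na), (pb, nb) = clusters.pop(a), clusters.pop(b)
        schedule.append((a, b, dist(pa, pb)))
        # Size-weighted average of the two centroids.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(pa, pb)]
        clusters[a + b] = (centroid, na + nb)
    return schedule


for a, b, d in centroid_clustering(PROFILES):
    print(f"merge {a} and {b} at {d:.4f}")
```

The printed distances can be compared line by line with the amalgamation schedule; small differences in the fourth decimal place are possible because the sheet rounds each averaged profile before recomputing distances.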

DENDROGRAM:

There appear to be 3 distinct clusters (A, BD and CE), suggesting the cut-off point should be
around 2.

To profile clusters, average the attribute values within each cluster. In this case it is possible to use
either the RAW or STANDARDISED data. Using the standardised data is suggested, as you’ve
already worked out the relevant numbers.

Profiles:

CLUSTER READER ARTICLE ADVERT COMMUN


A -0.3303 -1.4769 1.0709 1.4832
BD -0.8808 0.0434 -1.0114 0.1348
CE 1.0459 0.6950 0.4760 -0.8765

Brief suggested Interpretation:

Cluster BD is characterised by a small readership and very few adverts. It offers a moderate degree
of quality in its articles and community news.

Cluster CE is characterised by a large readership, high standard of articles, but poor community
news. It has an average amount of advertising.

Cluster A has moderate-to-low readership, many adverts, poor articles, but excellent community
news.

QUESTION 3: AGRICULTURAL CONTROL BOARDS study

1. Completed Euclidean Distance Matrix

Euclidean distances (Cluster Analysis Q4)


Case No. Meat Maize Dec Citrus Sugar Wheat
Meat 0.00 2.13 1.53 2.27 0.65 1.99
Maize 2.13 0.00 1.74 2.46 1.62 0.38
Dec 1.53 1.74 0.00 0.88 1.09 1.62
Citrus 2.27 2.46 0.88 0.00 1.94 2.32
Sugar 0.65 1.62 1.09 1.94 0.00 1.54
Wheat 1.99 0.38 1.62 2.32 1.54 0.00

2. Dendrogram

Amalgamation Schedule (Cluster Analysis Q4)
Single Linkage, Euclidean distances

Linkage distance   Objects in merged cluster
0.3820994          Maize, Wheat
0.6478426          Meat, Sugar
0.8800568          Dec, Citrus
1.094212           Meat, Sugar, Dec, Citrus
1.542757           Meat, Sugar, Dec, Citrus, Maize, Wheat

[Figure: Tree Diagram for 6 Cases, Single Linkage, Euclidean distances. Cases in order: Meat, Sugar, Dec, Citrus, Maize, Wheat. Axis: Linkage Distance, 0.2 to 1.6.]

3. Discussion of findings. If the cut-off is set at 1.0, three clusters emerge:

Cluster 1
Control Board Budget Net export Membership
Meat 1.52 1.38 0.68
Sugar 1.45 1.9 1.06
Cluster mean 1.485 1.64 0.87
Description: Control boards have a low budget, modest exports and very few members

Cluster 2
Control Board Budget Net export Membership
Deciduous fruit 1.79 1.89 2.1
Citrus fruit 1.95 1.56 2.9
Cluster mean 1.87 1.725 2.5
Description: Control boards have a very limited budget, modest exports, but a large
membership base

Cluster 3
Control Board Budget Net export Membership
Maize 2.22 3.32 1.2
Wheat 2.48 3.04 1.2
Cluster mean 2.35 3.18 1.2
Description: Control boards have large budgets, undertake large scale exports, but with few
members.

If the cut-off were set at 1.2, clusters 1 and 2 would merge. Profile-wise, both are “small”
in terms of budget size and level of exports, but they differ on membership size: the
“Deciduous/Citrus” boards have a larger number of members than the “Meat/Sugar” boards.
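
The effect of moving the cut-off can be illustrated with a small Python sketch. Cutting a single-linkage dendrogram at height h gives the same groups as linking every pair of objects whose distance is at most h and reading off the connected groups, which is what the code does. The board values are taken from the cluster tables above; the names `BOARDS` and `clusters_at_cutoff` are assumptions made for this example.

```python
from itertools import combinations
from math import dist  # Python 3.8+

# Budget, Net export, Membership for each control board (from the tables above)
BOARDS = {
    "Meat": (1.52, 1.38, 0.68),
    "Sugar": (1.45, 1.90, 1.06),
    "Dec": (1.79, 1.89, 2.10),
    "Citrus": (1.95, 1.56, 2.90),
    "Maize": (2.22, 3.32, 1.20),
    "Wheat": (2.48, 3.04, 1.20),
}


def clusters_at_cutoff(points, cutoff):
    """Group objects that are chained together by distances <= cutoff."""
    parent = {name: name for name in points}

    def find(x):
        # Follow parent links until the representative of the group is found.
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in combinations(points, 2):
        if dist(points[a], points[b]) <= cutoff:
            parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for name in points:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


print(clusters_at_cutoff(BOARDS, 1.0))
print(clusters_at_cutoff(BOARDS, 1.2))
```

At a cut-off of 1.0 only the three closest pairs are linked; raising it to 1.2 admits the Sugar to Dec link at 1.09, which joins the first two groups.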

4. Standardisation issues

Standardisation removes the influence of scale on the calculation of the distance matrix. This is
important because a failure to do so can result in variables that are measured with a small natural
unit (e.g. grams) dominating the distance calculations at the expense of those variables measured
with a large natural unit (e.g. tons). In this case, all variables have already been informally
standardised so that the numbers in the table are of similar magnitudes (around 1). No statistical
standardisation (subtract mean, divide by standard deviation) is required.

If we want to find out what the standardised profiles would be for each cluster, we need to know
the mean and standard deviation of each variable. These are (you need to know how to calculate
these – see Stats 1!)

Control Board Budget Net export Membership


Mean 1.90 2.18 1.52
Std dev 0.40 0.80 0.82

Standardised scores can then be calculated in the usual way:


Std score = (Unstd score – Mean)/StdDev

              CLUSTER 1              CLUSTER 2              CLUSTER 3
              Unstdised   Stdised    Unstdised   Stdised    Unstdised   Stdised
              cluster     cluster    cluster     cluster    cluster     cluster
Variable      mean        mean       mean        mean       mean        mean
Budget        1.49        -1.04      1.87        -0.08      2.35         1.12
Net export    1.64        -0.67      1.73        -0.57      3.18         1.24
Membership    0.87        -0.80      2.50         1.19      1.20        -0.39
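
The standardised cluster means in the table can be reproduced with the sketch below. It assumes the sample standard deviation (n - 1 in the denominator) computed over all six boards, which matches the 0.40, 0.80 and 0.82 quoted above; the names `BOARDS`, `CLUSTERS` and `standardised_cluster_means` are illustrative assumptions.

```python
from statistics import mean, stdev

VARIABLES = ("Budget", "Net export", "Membership")

# Raw values for the six control boards, in the order of VARIABLES
BOARDS = {
    "Meat": (1.52, 1.38, 0.68),
    "Sugar": (1.45, 1.90, 1.06),
    "Dec": (1.79, 1.89, 2.10),
    "Citrus": (1.95, 1.56, 2.90),
    "Maize": (2.22, 3.32, 1.20),
    "Wheat": (2.48, 3.04, 1.20),
}

# Three-cluster solution at a cut-off of 1.0
CLUSTERS = {1: ("Meat", "Sugar"), 2: ("Dec", "Citrus"), 3: ("Maize", "Wheat")}


def standardised_cluster_means(boards, clusters):
    """Return {cluster: [z-score of the cluster mean for each variable]}."""
    columns = list(zip(*boards.values()))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample standard deviation
    result = {}
    for label, members in clusters.items():
        # Unstandardised cluster mean for each variable.
        raw = [mean(boards[m][j] for m in members) for j in range(len(columns))]
        # Std score = (Unstd score - Mean) / StdDev
        result[label] = [(r - mu) / sd for r, mu, sd in zip(raw, means, sds)]
    return result


for label, scores in standardised_cluster_means(BOARDS, CLUSTERS).items():
    formatted = ", ".join(f"{v}: {s:+.2f}" for v, s in zip(VARIABLES, scores))
    print(f"Cluster {label}: {formatted}")
```

Because standardisation is a linear transformation, standardising the cluster mean gives the same result as averaging the standardised scores of the cluster members.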

QUESTION 4: MAGAZINE CONTENT AND PREFERENCE study

(i) $H_0: \mu_{\text{Cluster 1}}(\text{Décor}) = \mu_{\text{Cluster 2}}(\text{Décor})$

    $H_1: \mu_{\text{Cluster 1}}(\text{Décor}) \neq \mu_{\text{Cluster 2}}(\text{Décor})$

MSTreatment = Between SS / df = 8.06 / 1 = 8.06


MSError = Within SS / df = 50.93 / 58 = 0.87
F = MST / MSE = 8.06 / 0.87 = 9.26
Fcrit = F(1,58) (0.05) ≈ 4.00

Fstat > Fcrit, so reject H0 and conclude that the DECOR attribute average value is not the
same between clusters. DECOR is therefore able to significantly discriminate between the
two clusters.

(ii) $H_0: \mu_{\text{Cluster 1}}(\text{Garden}) = \mu_{\text{Cluster 2}}(\text{Garden}) = \mu_{\text{Cluster 3}}(\text{Garden})$

    $H_1$: At least one mean differs from the rest

MST = 12.54 / 2 = 6.27


MSE = 46.45 / 57 = 0.81
F = 6.27 / 0.81 = 7.74
Fcrit = F(2,57) (0.05) ≈ 3.15

Fstat > Fcrit, so reject H0 and conclude that the GARDEN attribute average value is not the
same across the 3 clusters. GARDEN is therefore able to significantly discriminate
between the three clusters.

(iii) For the BUY attribute:

MST = 33.16 / 2 = 16.58
MSE = 25.83 / 57 = 0.45
F = 16.58 / 0.45 = 36.84
Fcrit = F(2,57) (0.05) ≈ 3.15

Fstat > Fcrit, so BUY is clearly significant.

The significant predictor variables are BUY, DÉCOR, GARDEN and PAY.
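
The three F tests above follow the same arithmetic, which can be sketched in Python. The sums of squares, degrees of freedom and critical values are those quoted in parts (i) to (iii); the names `f_statistic` and `TESTS` are assumptions for this illustration. Note that the sheet rounds MSE to two decimals before dividing, so the unrounded F values printed here differ slightly from the hand calculations, although the conclusions are unchanged.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F ratio: mean square treatment over mean square error."""
    mst = ss_between / df_between
    mse = ss_within / df_within
    return mst / mse


# attribute: (Between SS, df, Within SS, df, critical F at the 5% level)
TESTS = {
    "DECOR": (8.06, 1, 50.93, 58, 4.00),
    "GARDEN": (12.54, 2, 46.45, 57, 3.15),
    "BUY": (33.16, 2, 25.83, 57, 3.15),
}

for name, (ssb, dfb, ssw, dfw, f_crit) in TESTS.items():
    f = f_statistic(ssb, dfb, ssw, dfw)
    verdict = "reject H0" if f > f_crit else "do not reject H0"
    print(f"{name}: F = {f:.2f}, Fcrit = {f_crit:.2f}, {verdict}")
```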

(iv) Cluster profiles for Cluster 1 on significant variables only:

From TABLE 10:


CLUSTER 1
BUY 1.07
DECOR 0.23
GARDEN -0.37
PAY 0.78

Respondents in cluster 1 buy considerably more lifestyle magazines than those in other
clusters, and are also willing to pay more for a magazine than those in other clusters. They
have a slightly above average interest in décor and a slightly below average interest in
gardening features, but these are secondary effects. Essentially, this is a cluster which
contains strong fans of lifestyle magazines.

(v) Noting from Figures 4.5 and 4.6 that each cluster has 30 observations in it, the table
can be completed as follows:

NEED = Yes NEED = No Row Totals


Cluster 1 19 11 30
Cluster 2 16 14 30
Totals 35 25 60

CLUSTER 1: 19 “YES”es out of 30 = 63%


CLUSTER 2: 16 “YES”es out of 30 = 53%

H0: There is no significant association between clusters and the perceived need for a
new lifestyle magazine.
H1: There is a significant association between clusters and the perceived need for a new
lifestyle magazine.

Use the Pearson Chi-squared statistic given in Figure 4.9 of 0.617, and compare this to a
critical Chi-squared statistic (at the 5% level, with 1 degree of freedom) of 3.84. Since the
test statistic is less than the critical value, one cannot reject the null hypothesis of no
association at the 5% level. Cluster membership appears to have no significant association
with perceived need for a new magazine.
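
The Pearson chi-squared statistic quoted from Figure 4.9 can be verified from the completed table with a few lines of Python. This is a minimal sketch; the names `pearson_chi_squared` and `OBSERVED` are assumptions, and no continuity correction is applied.

```python
def pearson_chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n  # under no association
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: Cluster 1, Cluster 2. Columns: NEED = Yes, NEED = No.
OBSERVED = [[19, 11], [16, 14]]

chi2 = pearson_chi_squared(OBSERVED)
print(f"chi-squared = {chi2:.3f}")  # compare with 0.617 and the critical 3.84
```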

(vi) Up to you to provide your own interpretation. Both 2-cluster and 3-cluster models are able
to pick up the clusters of uninterested readers (Cluster 2 in both solutions). The 3-cluster
model has the seeming advantage of distinguishing between those readers who want more
décor and more gardening articles (cluster 3), and those who are more or less happy with
the current format (cluster 1). Fortunately for the magazine publishers, those that are more
or less happy with the current format are those that tend to buy more magazines and pay a
premium. Those that desire a different format (one with more décor and gardening features)
tend not to buy a large number of magazines, and those that they do buy, they buy at an
average price.

On the basis of the fact that a 3-cluster solution picks up these important groupings, it
would seem to be preferable to a 2-cluster solution.

(vii) Question doesn’t specify whether you should use the 2-cluster or 3-cluster solution, so
we’ll look at both. For the 2-cluster solution

Euclidean distance of respondent 10 to Cluster 1 = 1.55


Euclidean distance of respondent 10 to Cluster 2 = 3.51

So, respondent 10 is closest to cluster 1 and should be included in that group.

For the 3-cluster solution:

Euclidean distance of respondent 10 to Cluster 1 = 1.73


Euclidean distance of respondent 10 to Cluster 2 = 3.80
Euclidean distance of respondent 10 to Cluster 3 = 2.26

So, respondent 10 is closest to cluster 1 and should be included in that group

(viii) Again, the question hasn’t specified whether you should use the 2-cluster or 3-cluster
solution. This solution is just for the 2-cluster solution (the 3-cluster solution is left to you)

Euclidean distance of new respondent to Cluster 1 = 1.10
Euclidean distance of new respondent to Cluster 2 = 1.28

So, the new respondent is closer to cluster 1 than cluster 2 and should be included in cluster
1. To compute how this changes the centroid, just note that
• The previous cluster 1 centroid (without the new respondent) was given by (0.55,
0.36, 0.20, 0.81) and had 30 members.
• An average is calculated as sum of all observations divided by number of
observations.
• The old sum for each attribute was therefore 30*(0.55, 0.36, 0.20, 0.81) = (16.58,
11.00, 6.12, 24.31)
• The new sum is (16.58+0.2, 11.0-0.4, 6.12+0.3, 24.31+0.1) = (16.77, 10.60, 6.43,
24.41)
• Finally the new average/centroid is (1/31)* (16.77, 10.60, 6.43, 24.41) = (0.54,
0.34, 0.21, 0.79)
• Note how the new cluster 1 centroid is very close to the old one. This is expected…
one new observation shouldn’t change a centroid very much.
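
The centroid update described in the bullets can be sketched as below. The old sums and the new respondent's scores (0.2, -0.4, 0.3, 0.1) are read off the bullets above; the function name `add_to_centroid` is an assumption. Because the sketch starts from the two-decimal sums printed on the sheet, its intermediate sums can differ from the sheet's in the last digit.

```python
def add_to_centroid(old_sum, n, new_point):
    """Return the centroid after adding one observation to a cluster of size n."""
    new_sum = [s + x for s, x in zip(old_sum, new_point)]
    return [s / (n + 1) for s in new_sum]


OLD_SUM = (16.58, 11.00, 6.12, 24.31)   # 30 times the old cluster 1 centroid
NEW_RESPONDENT = (0.2, -0.4, 0.3, 0.1)  # scores of the new respondent

new_centroid = add_to_centroid(OLD_SUM, 30, NEW_RESPONDENT)
print([round(c, 2) for c in new_centroid])
```

Equivalently, the new centroid is (30 × old centroid + new point) / 31, so a single observation can shift each coordinate by at most 1/31 of its distance from the old centroid.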
