
Market Research and Analysis

A Note on Cluster Analysis

By Professor V. Sriram

Cluster Analysis is a technique that enables us to classify individuals or objects into a priori unknown
groups. This distinguishes it from discriminant analysis and logistic regression where the groups are
known a priori and we use the known groups to understand what characteristics define the groups. In
contrast, in cluster analysis we look at the patterns of similarities and dissimilarities between the values
of variables associated with the individuals/objects and then group the most similar objects into the
same clusters.
Cluster analysis is used in a variety of marketing applications. Examples include:
(i) Identifying the competitive set of products: We could group products that are similar in their
attributes to figure out which products are in the same competitive set.
(ii) Identifying consumer segments: We could use it to group customers based on their demographics,
preference, attitudes, interests, opinions and the benefits they seek etc. into segments.
(iii) Identifying test market cities: We could identify cities that are similar in their demographic
or ethnic composition, and then pick representative cities from different clusters for a test market.
(iv) International Segmentation: We could cluster countries based on how products have diffused
through them in the past, in order to figure out how to develop differentiated global marketing
(v) Geodemographic Segmentation: One of the most widely known applications of cluster analysis has
been in geodemographic segmentation, where zip codes around the country have been classified into
64 PRIZM clusters based on their demographic and psychographic characteristics. Visit the Claritas
website for more details. The clusters have colorful names such as
Bohemian Mix (covering Greenwich Village in New York, Lincoln Park in Chicago, etc.), Blue Blooded
Estates and Winners Circle.
The Basic Concept of Cluster Analysis
Consider a set of observations that take different values on the variables X1 and X2. From the
scatter plot in Figure 1, it is obvious that this set of observations is clustered into four distinct segments.
Cluster analysis is a general technique that looks at the values taken by the observations and then
classifies them (they could be product, consumer, city or country characteristics) into
distinct segments. In practice, however, the boundaries of segments are not as clean and
distinct as in Figure 1, so there could be some differences in how consumers are classified depending on
the criteria used for clustering. In this note, we discuss the issues that need to be
addressed in performing cluster analysis.

Steps in Cluster Analysis

In performing cluster analysis, we go through the following steps, each of which involves a
decision:
1. What variables will we use for clustering the observations?
2. How will we define similarity or dissimilarity between observations in order to perform the clustering?
3. What will be the decision rule (clustering procedure) that we use to decide which observations
to cluster?
4. How many clusters to have?
5. What do the clusters mean (profiling the cluster) and how do we interpret them?
6. Assess the validity of the clustering
Illustrating Cluster Analysis: Clustering Cities
In this note, we will illustrate cluster analysis using a somewhat obvious problem: clustering
cities based on their geographic positions. The results will be easy to visualize, so the
impact of the decisions made at each step becomes transparent. We will cluster
cities into groups based on their distances from each other.
Step 1: What variables will we use for clustering?
Since we have decided to cluster the cities based on their geographic positions, we can use their
latitudes and longitudes to measure their position. The data are given below.

Here, a positive latitude means the city is north of the equator, a negative latitude means the city is
south of the equator. A positive longitude means the city is east of Greenwich near London (through
which the zeroth longitude passes) and a negative longitude means the city is west of Greenwich. Note
London is approximately at zero longitude. Most of the cities in this data are in the northern hemisphere
(north of equator, positive latitude), except the South American cities of Rio de Janeiro and Buenos Aires
which are in the southern hemisphere.
Had we wanted to cluster cities around the world to develop an international market
segmentation strategy, the relevant variables might instead have been economic and demographic
characteristics of these cities, measures of cultural attitudes, etc. The choice of variables to
use for clustering is critically important and has a big impact on the clustering solution,
so we need to think carefully about it. Sound exploratory
research that gives us a clear sense of which variables distinguish people, products, cities or
countries for the marketing decision at hand is critical.
Step 2: How will we define the distance or similarity between observations?

For the above problem, the Euclidean distance (the straight-line distance) between two cities is
an appropriate measure. Let $x_{ik}$ be the value of the ith observation (city) on the kth variable (latitude or
longitude). The Euclidean distance between two cities i and j is

$$d_{ij} = \sqrt{\sum_{k} (x_{ik} - x_{jk})^2}$$

A common measure in marketing research is the squared Euclidean distance,

$$d_{ij}^2 = \sum_{k} (x_{ik} - x_{jk})^2$$

In the above measures the differences are squared, so observations that differ greatly on a variable
appear disproportionately more dissimilar than observations that differ only slightly.
This is usually a good property, because it helps us discriminate between observations.
When this emphasis on large differences is not appropriate, we may use the city-block (Manhattan) distance, which sums the absolute differences instead.
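
The three distance measures discussed above can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the two sample observations are hypothetical and are not taken from the city data.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Sum of squared differences; large differences weigh more heavily."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def city_block(a, b):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical observations measured on two variables
p = (0.0, 0.0)
q = (3.0, 4.0)

print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(city_block(p, q))         # 7.0
```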

Another important decision at this stage is whether to standardize the variables.
If the variables are on very different scales, standardization is clearly needed. For example,
if we classify business schools based on recruiter ratings (say 1-100), student ratings
(1-100) and starting salaries (ranging from 50,000 to 150,000) without standardizing the variables,
the effect of starting salaries will swamp the other two. We should therefore standardize the
variables in this case. However, in situations where the scales are comparable, standardization
may be a bad idea. For example, suppose students rate their schools on various factors on a 1-7 scale,
and the variability is minimal on one variable (say IT infrastructure) but very high on
satisfaction with job placement. Standardization will then shrink the large, real differences in placement
satisfaction and magnify the small differences in IT infrastructure satisfaction. Hence standardization is
not always a good idea, and you should use it judiciously.
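
The business school example can be made concrete with a small sketch. The ratings and salaries below are invented purely for illustration, and standardization is assumed to mean conversion to z-scores (subtract the mean, divide by the standard deviation).

```python
import statistics

def standardize(values):
    """Convert a variable to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical data for three business schools
recruiter = [60, 80, 70]          # recruiter rating, 1-100
student = [75, 65, 85]            # student rating, 1-100
salary = [60000, 120000, 90000]   # starting salary

raw = list(zip(recruiter, student, salary))
std = list(zip(standardize(recruiter), standardize(student), standardize(salary)))

# Without standardization the salary term dominates the distance.
print(squared_euclidean(raw[0], raw[1]))
# After standardization each variable contributes on a comparable scale.
print(squared_euclidean(std[0], std[1]))
```

On the raw data, almost all of the distance between the first two schools comes from the salary difference; after standardization the two ratings contribute on the same footing as salary.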

Based on Euclidean distances, we show below the inter-city distances. We treat one degree of latitude and one degree of longitude as a unit
of distance in this computation. This is not strictly correct, since a degree of longitude spans a shorter distance
near the poles than near the equator, but we will use it as an approximation for our example.

In Figure 2a, we plot the positions of the cities based on their longitudes and latitudes, and in Figure 2b, we plot the
positions based on their standardized latitudes and longitudes. You can see the effect of standardization: the
north-south distances are expanded while the east-west distances are compressed. This is clearly
not appropriate in the case of distances, where the actual values really reflect the true distances between the cities.

Step 3: Selecting the Clustering Procedure

Broadly speaking, there are two types of clustering procedures: hierarchical and non-hierarchical.
At this stage I will discuss only the hierarchical clustering procedures. We will discuss non-hierarchical
clustering procedures later, in the context of checking the robustness of the results obtained from hierarchical clustering.
Hierarchical Clustering Procedures:
Let us consider one example of hierarchical clustering based on the single linkage method. The single linkage
method is based on the nearest neighbor rule. The first two objects clustered are those that have the shortest
distance between them (in the city-clustering example, Frankfurt and London, observations 7 and 8). The next
shortest distance is then identified, and either a third object is clustered with the first two or a new two-object cluster
is formed. At every stage the distance between two clusters is defined as the distance between their two closest points,
and the two clusters with the smallest such distance are merged. This process
is continued until all objects are in one cluster.
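
The single linkage procedure described above can be sketched as follows. This is a simple, unoptimized illustration under stated assumptions: the four points are hypothetical rather than the city data, and the function names are made up for this example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link_distance(c1, c2, points):
    """Distance between two clusters = distance between their two closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def single_linkage(points):
    """Merge the two nearest clusters repeatedly until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Four hypothetical observations on two variables
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]
history = single_linkage(points)
for left, right, dist in history:
    print(left, right, dist)
```

Reading the merge history from top to bottom shows the order in which objects join clusters and the distance at which each merge happens.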

The complete linkage method is similar to the single linkage method, except that the distance between
two clusters is the maximum distance, that is, the distance between their two farthest points. The average linkage method uses
the average distance over all pairs of points, one from each cluster, in determining the closest cluster.
Of the three approaches, average linkage usually leads to the most robust clustering, since it is less affected by individual extreme points.
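
To see how the three linkage rules differ, the sketch below computes the single, complete and average linkage distances between the same two clusters. The clusters and the function name are hypothetical and serve only as an illustration.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage_distances(cluster_a, cluster_b):
    """Return the inter-cluster distance under each of the three linkage rules."""
    # All pairwise distances with one point taken from each cluster
    pairwise = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # two closest points
        "complete": max(pairwise),                  # two farthest points
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs
    }

# Two hypothetical clusters of two observations each
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]

result = linkage_distances(a, b)
print(result)  # {'single': 3.0, 'complete': 6.0, 'average': 4.5}
```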
Another commonly used procedure for hierarchical clustering is Ward's procedure. Here, for each
cluster, the means of all the variables are computed. Then, for each object, the squared Euclidean
distance to its cluster means is computed, and these squared Euclidean distances are
summed over all objects. When two clusters are merged, the sum of the squared Euclidean distances with respect to
the new mean will increase. Ward's procedure merges those clusters that lead to the
least increase in the sum of squared Euclidean distances.
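
Ward's merging criterion can be illustrated with a small sketch. It assumes three hypothetical clusters measured on two variables and picks the pair whose merger increases the within-cluster sum of squares the least; the function names are made up for this example.

```python
from itertools import combinations

def centroid(cluster):
    """Mean of each variable over the objects in the cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def within_ss(cluster):
    """Sum of squared Euclidean distances from each object to the cluster mean."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_increase(a, b):
    """Increase in the total within-cluster sum of squares if a and b are merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

# Three hypothetical clusters
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(3.0, 0.0)],
    [(10.0, 0.0), (12.0, 0.0)],
]

# Ward's rule: merge the pair with the smallest increase.
best = min(
    combinations(range(len(clusters)), 2),
    key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
)
print(best)  # (0, 1)
```

Here the first two clusters are merged, because joining them raises the sum of squares far less than either merger involving the distant third cluster.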