Multiple Correspondence
Analysis
CHAPTER JANUARY 2014
DOWNLOADS
VIEWS
28
17
2 AUTHORS, INCLUDING:
Francois Husson
Agrocampus Ouest
63 PUBLICATIONS 739 CITATIONS
SEE PROFILE
11
Multiple Correspondence Analysis
Franois Husson and Julie Josse
Contents
11.1 Studying Individuals.................................................................................. 164
11.1.1 Cloud of Individuals and Distance between Individuals......... 164
11.1.2 Fitting the Cloud of Individuals................................................... 166
11.1.3 Interpreting the Cloud of Individuals Using the Variables...... 167
11.1.4 Interpreting the Cloud of Individuals from the Categories..... 168
11.2 Studying Categories................................................................................... 171
11.2.1 The Cloud of Categories and Distance between Categories.... 171
11.2.2 Fitting the Cloud............................................................................. 172
11.3 Link between the Study of the Two CloudsDuality.......................... 174
11.4 CA on the Burt Table.................................................................................. 175
11.5 Example........................................................................................................ 176
11.6 Discussion.................................................................................................... 181
163
164
1
2
3
4
5
6
7
8
xa
a
a
A
a
A
a
A
a
xb
b
b
b
B
B
B
B
B
xc
c
c
c
c
c
C
C
xd
d
d
d
D
D
Label
abcd1
abcd2
Abcd
aBcD
ABcD
aBC
ABC
aB
a A b
1 0 1
1 0 1
0 1 1
1 0 0
0 1 0
1 0 0
0 1 0
1 0 0
N1 N2 N3
= = =
5 3 3
B
0
0
0
1
1
1
1
1
N4
=
5
c
1
1
1
1
1
0
0
0
N5
=
5
C
0
0
0
0
0
1
1
0
N6
=
2
0
0
0
0
0
0
0
1
N7
=
1
d
1
1
1
0
0
0
0
0
N8
=
3
D
0
0
0
1
1
0
0
0
N9
=
2
0
0
0
0
0
1
1
0
N10
=
2
0
0
0
0
0
0
0
1
N11
=
1
Q=4
Q=4
Q=4
Q=4
Q=4
Q=4
Q=4
Q=4
Figure 11.1
Example of converting raw data into indicator matrix. The response patterns are used as
respondent labels of the indicator matrix.
To illustrate the method let us consider a toy data set containing N=8 individuals described by Q=4 categorical variables (see left table in Figure11.1)
with J1=2, J2=2, J3=3, and J4=4 categories, respectively. For example, the
first two individuals give responses a for the first variable, b for the second, c
for the third, and d for the fourth.
The first step in MCA is to recode the data. Usually, data are coded using
the indicator matrix of dummy variables, denoted Z of size NJ, with
J=
Q
q =1
Jq ,
11.1Studying Individuals
11.1.1 Cloud of Individuals and Distance between Individuals
Studying the individuals involves looking at the data table row by row. Each
row of the indicator matrix has J coordinates and defines a vector in J . It may
thus be represented as a point in J , and the set of rows is the cloud of individuals. From a geometric point of view, studying the similarities between
165
individuals can be seen as studying the shape of this cloud of points. Are the
individuals divided into groups? Is there a main direction in the cloud that
opposes some specific individuals with others? The distance between points
representing the individuals is thus crucial. The chi-square distance is used
since MCA is presented as a CA of the indicator matrix. The square of the
chi-square distance (see also Chapter9 by Cazes) between two rows is
J
j =1
1 pij pij
c j ri
ri
In much the same way as the MCAs contingency table is the indicator matrix
Z, the correspondence matrix P is equal to Z/(NQ), Dr is the diagonal matrix
of row masses
ri =
1
Q
= , i = 1,.., N ,
NQ N
Nj
, j = 1,.., J .
NQ
d22 ( i , i) =
j =1
N
=
Q
1/ N
N j /( NQ) 1/ N
N1 (z
j =1
ij
zij ) 2
Two individuals are close to each other if they answer a relatively large number of questions in the same way, and are farther apart if they have very
different response profiles. The categories with low frequencies make higher
contributions to the distances between those (few) individuals that give that
response and the large majority that do not.
The cloud of points defined by these distances has a centre of gravity GN
(with coordinates cj, j=1,...,J) and a total inertia that is equal to
1
Total inertia =
N
d (i, G
2
i =1
)=
J Q J
= 1
Q
Q
166
1 T
Z ZD1v k = k v k
Q
where D is the diagonal matrix of the column sums of Z, Nj, j=1,...,J. This
is the eigenequation of a nonsymmetric matrix, and the normalization of the
eigenvectors vk, gathered as columns of the matrix V, is VTDV=I. The eigenvectors in V define the principal axes, and the coordinates of the individuals
are then the projections of the row profiles (1/Q)Z onto these principal axes.
Figure11.2 provides the best representation of the cloud of individuals in
two dimensions. We can see that points abcd1 and abcd2 are superimposed
(as they have the same response categories) and are close to point Abcd (as
they have three out of four categories in common). On the other hand, point
aB (which has categories with low frequencies in the sample) is relatively
far from both the centre of gravity and the other individuals.
The quality of the clouds fit on the two-dimensional space is measured
by the percentage of inertia, i.e.,the ratio between the variability explained
by the map and the total variability. The variability explained by an axis is
equal to the variance of the coordinates of the individuals on this axis, i.e.,to
the eigenvalue associated with that axis. The percentage of inertia associated
with the two-dimensional map is therefore
( 1 + 2 )/
J Q
k =1
k = 68.0%.
In this case, the percentage is high, as it is a small data set with strong links
between variables. Often in MCA, however, the percentages of inertia associAU: Please
ated with each axis are weak (Benzcri, 1977).
indicate 1977a,
1977b, or 1977c
for Benzcri.
167
1.5
aB
Dim 2 (30.8%)
1.0
0.5
0.0
abcd2
abcd1
aBcD
Abcd
ABcD
0.5
aBC
ABC
1.0
1.0
0.5
0.0
0.5
Dim 1 (37.1%)
0.1
1.5
Figure 11.2
Representation of the individuals on the first two axes.
168
Variables Representation
1.0
Xc
Xd
Dim 2 (30.8%)
0.8
0.6
0.4
Xa
0.2
Xb
0.0
0.0
0.2
0.4
0.6
Dim 1 (37.1%)
0.8
1.0
Figure 11.3
Square correlation ratios of the variables on the two dimensions.
carrying different categories for variable xb). Variables xc and x d, on the other
hand, can be used to explain oppositions between individuals on the first
and second axes.
Note: The vector fk of the coordinates of the N individuals on an axis k is
a vector of N , which can be considered a quantitative variable. The set of
these new variables, known as principal components, verifies the following
property:
f k = arg max
fk
(f , x )
2
q =1
under the constraint that fk is orthogonal to fk for all k<k. Thus, as with the
other methods of analyse des donnes, the first principal components of the
MCA are orthogonal variables that are the most closely linked to the set of
variables, with the relationship measured by the squared correlation ratio.
11.1.4 Interpreting the Cloud of Individuals from the Categories
The information provided by the variables is general and thus insufficient
for interpreting the similarities and differences between individuals. It is
therefore natural to make a more detailed analysis using the categories of
these variables. One instinctive way of representing a category on the graph
of individuals is to position the category at the barycentre (i.e.,average) of
the individuals who have it in their response (left graph, Figure11.4 for the
1.0
0.5
0.0
0.5
1.0
1.0
Abcd
abcd2
abcd1
0.5
0.0
0.5
Dim 1 (37.1%)
ABcD
aBcD
Abc
abc
1.0
1.0
0.5
0.0
0.5
1.0
1.5
1.0
Abcd
abcd2
abcd1
0.5
0.0
0.5
Dim 1 (37.1%)
ABcD
a
aBcD
ABC
aBC
1.0
aB
Dim 2 (33.0%)
Dim 2 (30.8%)
1.0
0.5
0.0
0.5
1.0
1.5
1.0
0.5
0.0
0.5
Dim 1 (47.8%)
a
B
1.0
Figure 11.4
Representation of categories in Figure 11.2. Left: Representation of the categories of variable x d; individuals are coloured (using grey levels) according
to their category. Middle: Superimposed representation of the individuals and categories of all the variables. Right: Representation of the categories
without the individuals. The different percentages of inertia will be explained Section 11.4.
Dim 2 (30.8%)
1.5
aB
170
categories of variable x d). Generally the categories of all the variables are represented on a single graph (middle graph, Figure11.4).
Two categories are close if, overall, the individuals carrying them are close,
which means that their responses are similar over all the variables. Proximity
between two categories is interpreted as proximity between two groups of
individuals. Here, for example, the categories C and are superimposed as
they are chosen by the same individuals.
Another way of constructing this barycentric representation is, for each
category, to add up the rows of the indicator matrix relating to the individuals carrying that category and then to project these new vectors of row sums
as supplementary elements on the graph of individuals (using the projection of supplementary rows from CA; see Chapter9 by Cazes). The table of
these row sums, represented in the lower part of Table11.1, is the Burt table.
Presenting categories as new rows in this way justifies superimposing representations on the same graph of individuals and categories, which thus lie
in the same space.
Often in MCA, we examine the representation of the cloud of individuals
to have an idea of how the individuals are distributed, and we then focus
on interpreting the representation of the cloud of barycentres alone (right
graph, Figure11.4). Indeed, we are often not interested in the details of the
Table11.1
Burt Table as Supplementary Rows of the Indicator Matrix
abcd1
abcd2
Abcd
aBcD
ABcD
aBC
ABC
aB
a
A
b
B
c
C
d
D
1
1
0
1
0
1
0
1
5
0
2
3
3
1
1
2
1
1
1
0
0
1
0
1
0
1
0
0
3
1
2
2
1
0
1
1
1
0
1
1
1
0
0
0
0
0
2
1
3
0
3
0
0
3
0
0
0
0
0
0
1
1
1
1
1
3
2
0
5
2
2
1
0
2
2
1
1
1
1
1
1
0
0
0
3
2
3
2
5
0
0
3
2
0
0
0
0
0
0
0
1
1
0
1
1
0
2
0
2
0
0
0
2
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
1
0
0
0
1
1
1
1
0
0
0
0
0
2
1
3
0
3
0
0
3
0
0
0
0
0
0
1
1
0
0
0
1
1
0
2
2
0
0
0
2
0
0
0
0
0
0
0
1
1
0
1
1
0
2
0
2
0
0
0
2
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
1
0
0
0
1
171
11.2Studying Categories
11.2.1 The Cloud of Categories and Distance between Categories
The representation of the categories obtained previously was constructed
from the optimal representation of individuals. It is also possible to construct
a representation of the categories by analysing the cloud of the columns from
the indicator matrix. From a geometric point of view, the category study
can be seen as studying the shape of a cloud of points in N . The distance
between two columns is the chi-square distance, as defined above:
N
i =1
1 pij pij
ri c j c j
d ( j , j ) =
2
2
i =1
1/ N N j /( NQ) N j /( NQ)
=N
i =1
zij
zij
N N
j
j
The distance between two categories decreases when most of the individuals
carrying one category also carry the other. The distance between two categories depends on their frequencies. It must be noted that the distance between
two categories does not depend on the other variables, unlike the distance
between two categories when they are considered as groups of individuals, as
described above. The chi-square distance between two categories in the indicator matrix is thus more difficult to justify (Greenacre and Blasius, 2006, p.60).
The squared distance of a category j from the centre of gravity GK of the
cloud of categories (of coordinates ri=1/N for all the i) is
172
d ( j , GK ) = N
2
i =1
zij
1
N
N N = N 1
j
The smaller the frequency of a given category, the farther it is from the centre
of gravity. The squared distance weighted by Nj/NQ (the weight of category
j) is the inertia of category j:
Inertia( j) =
Nj
Nj
1
d 2 ( j , GK ) = 1
NQ
Q
N
The less frequently a category occurs, the higher its inertia. It is often recommended to avoid categories with low frequencies that might too heavily
influence the overall analysis of the results. However, it must be noted that
the inertias between a low-frequency category, for example, carried by 1%
of individuals, and one carried by 10% are not that different (0.99 compared
with 0.9 divided by the number of variables). Generally speaking, in MCA,
the impact of categories with low frequencies is often overestimated. This
tends to stem from their great distance from the centre of gravity (and thus
their extreme position on graphs). However, in calculating a categorys inertia, its weight counterbalances the squared distance.
A variables inertia is defined by the sum of inertias of its categories:
Jq
Inertia( q) =
Nj
Q1 1 N =
j =1
Jq 1
Q
(11.1)
The higher the number of categories a variable has, the greater its inertia. By
adding together the inertias of all the variables, we obtain the total inertia of
the cloud of categories:
Q
Inertia =
q =1
Jq 1 J
= 1
Q
Q
As expected, the total inertia of the cloud of categories is the same as the one
of the cloud of individuals. This is explained by the duality between the two
analyses of rows and columns. Duality is an intrinsic property of methods
of analyse des donnes.
11.2.2 Fitting the Cloud
As for the individuals, the cloud of categories lies into a high-dimensional
space, and we want to find a subspace to represent the cloud of points as best
A
o
as possible. Within this space, the eigenvectors uk, associated with the eigenvalues k, are obtained from the eigenvalue decomposition of PDc1P TDr1 , thus
1
ZD1ZTu k = k u k
Q
Dim 2 (30.8%)
173
1
a
0
b d
c
D
A
1
C
1
0
Dim 1 (37.1%)
Figure 11.5
Representation of the categories produced by the CA of the indicator matrix.
AU: Please
provide specific
chapter number.
174
g k ( j) =
1
k
1
k
Jq
q =1 j =1
N
i =1
zij
g k ( j)
Q
zij
fk ( i)
Nj
with fk(i) and g k(j) the coordinates of the individual i and category j on dimension k. An individual is, up to the multiplicative factor 1/ k , at the barycentre of the categories that it carries. And a category is, up to the multiplicative
factor 1/ k , at the barycentre of the individuals that carry it (Figure11.5).
These relationships are often used to construct a pseudobarycentric graphical representation, also called the symmetric map.The 1/ k coefficient
ensures that both barycentric properties are represented simultaneously on
the same graph (Figure11.6).
However, this pseudobarycentric representation is difficult to read, as
the distances between an individual and a category cannot be interpreted.
Indeed, the individuals and the categories do not lie in the same space ( J
for the individuals and N for the categories). Note, for example, that the
individual aB, which is the unique individual who has chosen categories
and , is not superimposed with these two categories. We therefore recommend not to construct this representation, but rather prefer a barycentric representation, such as Figure11.4, where a category is at the exact barycentre of
the individuals who have chosen it.
175
Dim 2 (30.8%)
aB
abcd2
abcd1
c
b d
Abcd
a
aBcD
D
ABcD
A
1
1
0
Dim 1 (37.1%)
B
aBC
ABC
C
Figure 11.6
Pseudobarycentric representation.
176
matrix
Burt
= Indicator
k
k
11.5Example
The example studied here is inspired by the work of the sociologist Bourdieu,
who initially made MCA popular in the humanities. Here we examine the
results of a survey on the construction of identity called Histoire de vie
(Life Story). This survey, in which 8,403 people aged 15 or over were questioned, was conducted in 2003 by the French National Institute of Statistics
(Insee: http://www.insee.fr/). Identity describes the way in which individuals construct a place for themselves within a society that will both enable
integration and influence the affirmation of ones own identity. The aim
of the survey is to describe the different types of social relationships that
enable individuals to fit in to French society at the start of the 21st century.
Here we will examine an extract of this database relating to the hobbies of
the French, available in the R package FactoMineR under the name hobbies. The data set contains 8,403 individuals and 24 variables divided into
three groups. The first 18 variables relate to different hobbies. The question
format was as follows: Have you done or been involved in the following
hobby in the past 12 months, without ever having been obliged to do it?
with possible answers yes and no. The questions were related to going to
the cinema; reading; listening to music; going to a show (theatre, concert,
dance, circus, etc.); visiting an exhibition, museum, or historical monument;
using a computer or game console; doing sport or physical activity; walking
or hiking; travelling or visiting; playing music, painting, or other artistic
177
activity (such as dance, drama, writing, photography); collecting; volunteering; doing manual work (mechanics, DIY, or decorating); gardening; knitting, embroidery, or sewing; cooking (as a hobby); fishing or hunting; and
the number of hours spent watching television on average every day (zero,
01 h, 12 h, 23 h, 4h, or more).
The following four variables were used to label the individuals: sex (male,
female), age (1525, 2635, 3645, 4655, 5665, 6675, 7685, 86 or more),
marital status (single, married, widowed, divorced, remarried), and profession (manual labourer, unskilled worker, technician, foreman, senior management, employee, other). And finally, a quantitative variable indicates the
number of hobbies practiced out of the 18 possible choices.
The 18 categorical variables corresponding to the hobbies are considered
active variables (the distances between individuals are calculated from these
variables), and the sociodemographic variables are considered supplementary (or illustrative). These will be used to assist in interpreting the dimensions of the MCA. In this presentation, we will limit ourselves to interpreting
the first two dimensions.
The representation of individuals (not presented here) is homogeneous,
without any distinct groups of individuals. The square correlation ratios of
variables on the two dimensions (Figure11.7) indicate that the variables that
contribute the most to the construction of the first dimension are exhibition,
show, and cinema, and that which contributes the most to the construction of
the second dimension is gardening. The supplementary variables, in italics on
0.5
Gardening
Dim 2 (7.28%)
0.4
0.3
0.2
0.1
0.0
Knitting
Fishing
Collecting
Sex
0.0
Cooking Mechanic
Age
Walking
Cinema
Marital status
TV
Profession
Computer Sport
Listening music
Show
Playing music
Travelling Exhibition
Volunteering
0.1
Reading
0.2
0.3
Dim 1 (43.74%)
Figure 11.7
Map of categorical variables. Illustrative variables in italics.
0.4
0.5
178
the graph, are represented in the same way as the active variables. They are
all somewhat linked to the first two dimensions (except the variable sex).
The representation of active categories (Figure 11.8) obtained by the CA
of the Burt table shows an opposition on the first dimension between all
the yes categories and the no categories. This dimension thus opposes those
individuals who tend to have more activities with those who have none, or
only very few. The second dimension separates different activities, with
those hobbies that could be described as young (cinema, computing, sport,
shows, playing music) at the bottom, and the calm activities (gardening,
knitting, fishing) at the top of the graph. The proximity between the categories fishing, gardening, and knitting means that individuals taking
these categories have the same profile of responses. More precisely, it covers
two situations: cases where the individuals are the same and cases where the
individuals are different but have the same pattern of responses for the other
categories. The former situation may correspond to the proximity between
the barycentres of the individuals who fish and those who do gardening,
indicating that most of those that do fishing also do gardening. The latter
case may correspond to the proximity between knitting and the two other
categories. Indeed, we can suppose that it is not necessarily the same individuals who do both knitting and fishing. However, these individuals may
have the same profile: they are not involved in any young activities. Using
the graphs, it is possible to appreciate the main tendencies of the analysis.
The interpretation can be refined by using different indicators (contribution,
quality of representation) or even by returning to the raw data (by conducting cross-tabulations, for example). Here we can check that 55.9% of people
0.3
Knitting
Fishing
Gardening
Dim 2 (7.3%)
0.2
0.1
Listening music_no
Cinema_no
Mechanic
Cooking
Walking
Collecting
TV_3
Volunteering
Sport_no
TV_2
Show_no
Playing music_no
Reading
Exhibition
Travelling
Travelling_no
Volunteering_no
Collecting_no
TV_1 Listening music
Exhibition_no
Playing
music
TV_4
Fishing_no
Computer
Mechanic_no
Knitting_no
Walking_no
Show
Sport
TV_no
Cooking_no
Cinema
Gardening_no
Computer_no
0.0
Reading_no
0.1
0.2
0.4
0.2
0.0
Dim 1 (43.7%)
0.2
0.4
Figure 11.8
Map of active categories obtained from the Burt table. The graph of individuals is represented
in the top left-hand corner.
179
who do fishing also do gardening. We should also remark that the contribution of the category gardening to the construction of the second dimension
is greater than the contribution of fishing (18.7 versus 9.5). This is also suggested by the graph of variables rather than the coordinate of the categories
on the second dimension. This is due to the fact that the category fishing
is less common than the category gardening (only 945 respondents compared with 3,356). In addition, we can check that the fishermen are not necessarily practicing knitting too!
To make the results easier to interpret, we can use the supplementary
elements. The graph of supplementary categories (Figure 11.9) shows a
Guttman effect (or arch effect) for the categories of the variable age, which
are perfectly sequenced. Young people, who typically have a lot of activities, are opposed on the first dimension with older people with fewer or no
activities. The second dimension opposes the classes of extreme age with
the classes of middle age. More specifically, the age categories of between 45
and 75 years tend to do more gardening, knitting, and fishing than the other
age categories. In the same way, young people aged 1535 tend to prefer
activities relating to sport, cinema, computing, and shows, when compared
with other age categories. These results can be substantiated by constructing tables with row or column percentages, for example, confronting the
variables computer and age or gardening and age (Table11.2).
On the graph, the different socioprofessional categories are confronted
on the first dimension with, from left to right, manual labourers, unskilled
workers, employees, technicians, foremen, and senior management. Overall,
managers have more hobbies (by returning to the raw data, we can see that
Dim 2 (7.3%)
5665
6675
Manual labourer
Widower
Married
4655
Remarried
FOther
Foreman
3645
M
Technician
Management
7685
Unskilled worker
Employee
Divorcee
Profession.NA
0.0
86+
26
35
Single
0.2
1625
0.4
0.2
0.0
Dim 1 (43.7%)
Figure 11.9
Map of supplementary categories.
0.2
180
Table11.2
Column Percentages for the Cross-Tabulations Computer by Age and Gardening
by Age
Computer no
Computer yes
Gardening no
Gardening yes
1525
2635
3645
4655
5665
6675
7685
86+
0.25
0.75
0.86
0.14
0.43
0.57
0.69
0.31
0.50
0.50
0.57
0.43
0.66
0.34
0.52
0.48
0.82
0.18
0.51
0.49
0.92
0.08
0.54
0.46
0.96
0.04
0.65
0.35
1.00
0.00
0.78
0.22
0.5
Dim 2
Nb activities
0.0
0.5
1.0
1.0
0.5
0.0
Dim 1
0.5
Figure 11.10
Map of the quantitative supplementary variable number of activities.
1.0
A
i
o
G
P
AU: Please
indicate 2006a
or 2006b for
Greenacre and
Pardo.
181
11.6 Discussion
Let us end this introductory chapter to MCA with a number of important
additional aspects: