
AIMS Review Course

Statistical Data Mining


Wiesner Vos & Ludger Evers

January 29, 2006


Commons Deed

You are free:


• to copy, distribute, display, and perform the work
• to make derivative works

Under the following conditions:


Attribution.
You must give the original author credit.

Noncommercial.
You may not use this work for commercial purposes.

Share Alike.
If you alter, transform, or build upon this work, you may distribute the resulting work only under a
license identical to this one.

For any reuse or distribution, you must make clear to others the license terms of this work.

Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.

This is a human-readable summary of the Legal Code (the full license). It can be read online at the following address:

http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode

Note: The Commons Deed is not a license. It is simply a handy reference for understanding the Legal Code (the full license) —
it is a human-readable expression of some of its key terms. Think of it as the user-friendly interface to the Legal Code beneath.
This Deed itself has no legal value, and its contents do not appear in the actual license.
Contents

1 Introduction 7
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Data mining as a step in the KDD process . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Tasks in statistical data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Useful properties of data mining algorithms . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Resilience to the curse of dimensionality . . . . . . . . . . . . . . . . . . . . . . 12
1.4.4 Automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.5 Simplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Datasets used in the lecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7.1 The iris dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7.2 The European economies dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7.3 The quarter-circle dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7.4 The breast tumour dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Exploratory data analysis 17


2.1 Visualising data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Visualising continuous data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Visualising nominal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Graphical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Biplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.4 Exploratory Projection Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Clustering 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Dissimilarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Gaussian mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


3.3.3 k-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.4 k-Medoids (Partitioning around medoids) . . . . . . . . . . . . . . . . . . . . . 55
3.3.5 Vector quantisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.6 Self-organising maps (SOM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Comments on clustering in a microarray context . . . . . . . . . . . . . . . . . . . . . . 63
3.4.1 Two-way clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Discussion: Clustering gene expression data . . . . . . . . . . . . . . . . . . . . 63
3.4.3 Alternatives to clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5 Multidimensional scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5.1 Example 1 : MDS and prostate cancer gene expression data . . . . . . . . . . . 67
3.5.2 Example 2 : MDS and virus data . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4 Classification: Theoretical backgrounds and discriminant analysis 71


4.1 Classification as supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Some decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Linear discriminant analysis (LDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Quadratic discriminant analysis (QDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Fisher’s approach to LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Non-parametric estimation of the classwise densities . . . . . . . . . . . . . . . . . . . 79

5 Trees and rules 81


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Growing a tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Pruning a tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 From trees to rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Frequent item sets and association rules . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 Generalised Linear Models 99


6.1 The Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.2 Linear Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Generalised Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Basic GLM setup and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.3 Goodness of fit and comparing models . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.4 Applications of Generalised Linear Models . . . . . . . . . . . . . . . . . . . . . 106
6.2.5 Binary Data and Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2.6 Count Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2.7 Survival Analysis and Repeated Measures . . . . . . . . . . . . . . . . . . . . . 119

7 Non-parametric classification tools 121


7.1 k-nearest neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Learning vector quantisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.3 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4 Support vector machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

8 Classifier assessment, bagging, and boosting 137


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2 Bias, Variance and Error Rate Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.3 Log-probability scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.4 ROC curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2.5 Example: Toxicogenomic Gene Expression Data . . . . . . . . . . . . . . . . . . 141
8.3 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

9 Expert Systems 147


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.2 Certainty factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.3 Dempster-Shafer theory of evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.4 Fuzzy logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
9.5 Probabilistic reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

10 Graphical models 159


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
10.2 Directed graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
10.3 Undirected graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
10.4 From directed to undirected graphical models . . . . . . . . . . . . . . . . . . . . . . . 165
10.5 Computing probabilities on graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.5.1 The message passing algorithm for exact inference . . . . . . . . . . . . . . . . 166
10.5.2 Approximate inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Chapter 1

Introduction

1.1 Background
We are drowning in information and starving for knowledge.
– Rutherford D. Roger

In the last two decades the amount of data available for (statistical) analysis has experienced unprecedented growth and has given birth to the new field of data mining. Whether in the context of biology and medicine (notably genomics) or in the context of marketing, much attention has focused on data mining.
Data mining is a broad and interdisciplinary subject; it involves aspects of statistics, engineering and computer science. According to a marketing brochure of the SAS Institute — one of the major manufacturers of data analysis software — "Data mining is the process of data selection, exploration and building models using vast data stores to uncover previously unknown patterns." This could — save perhaps the "vast data stores" — serve equally well as an explanation of what statistics is about.
Inferring information from data is not something new; for decades it has been, and still is, the core domain of statistics. Much of the attention data mining received was due to the information technology hype of the last couple of years. Or, in more cynical words (Witten and Frank, 1999):

What’s the difference between machine learning and statistics? Cynics, looking wryly at the
explosion of commercial interest (and hype) in this area equate data mining to statistics plus
marketing.

But something has changed over the years. In the first half of the 20th century statisticians mostly analysed systematically planned experiments designed to answer a carefully formulated scientific question. These experiments led to a small amount of high-quality data. Under these controlled conditions one could often even derive an optimal way of collecting and analysing the data and (mathematically) prove this optimality.
At the beginning of the 21st century the situation is somewhat more chaotic. The scale of the datasets has changed. Due to the advance of information technology (computer-based acquisition of data, cheap mass data storage solutions, etc.), datasets have been growing in both dimensions: they not only consist of more and more observations, they also contain more and more variables. Often these data are not directly sampled, but are merely a byproduct of other activities (credit card transactions, etc.). As such, they do not necessarily stem from a good experimental design, and some variables might contain no interesting information at all. The data thus contains more and more "noise". And the questions to which statistical analysis has to respond have become somewhat more diffuse ("find something interesting"). The data analysed in genetics (especially microarray data) shares many of the above difficulties, although the data does result from systematic experiments. But a huge number of genes is scanned in the hope of finding a "good" one.
This development — and the question of how to react to it — was (over-)emphasised in Leo Breiman's article "Statistical modeling: the two cultures" and provoked some discussion amongst statisticians (Breiman, 2001).
Historically, however, the term data mining had a different meaning: it was used amongst statisticians to describe the improper analysis of data that leads to merely coincidental findings. As early as 1940 Fred Schwed wrote in his hilarious and cynical portrait of everyday Wall Street: "There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it." (Schwed, 1940) This misuse of statistics is today referred to as data dredging.
When literally mining (for minerals) most of the material extracted from the pit is worthless. A
good mining technology is characterised by efficiently extracting the small proportion of good material.
A similar notion applies to data mining. We do not want to find patterns that are only characteristic of the dataset currently examined; rather, we want to identify the structure behind the data. We want generalisation, i.e. we should be able to find the structure inferred from the dataset in future examples as well.

1.2 Data mining as a step in the KDD process


Data mining is very closely related to knowledge discovery in databases (KDD). Frawley et al. (1992) defined KDD as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. . . . Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data."
The KDD process broadly consists of the following steps:

Understanding the problem: Developing an understanding of the problem and the application do-
main is essential in order to be able to carry out the subsequent steps. This involves enquiring
into the nature of the problem at hand as well as into which data is available, how the data was
collected, how the different variables were measured and defined, etc.

Creating a target data set: In most commercial applications of data mining the data has to be gathered from different sources, mostly transactional databases. Transactional databases usually store their data in third normal form (3NF): this type of data storage tries to minimise redundancy (and thus storage space) and to prevent insert/delete/update anomalies (see e.g. Date, 1981). However, this causes relevant information to be spread over many different tables, so gathering the necessary information can be quite cumbersome. Once the information is extracted it is often stored in a data warehouse or several smaller data marts. In a data warehouse the data is not stored in third normal form but in a denormalised, subject-specific form (e.g. using the star schema). This leads to redundant storage of information, but allows quick response times to queries by data analysts. The redundancy is, however, not a serious problem, as a data warehouse is only used for data analysis and not for current, "live" data.

Data preprocessing and data exploration: The data extracted needs to be preprocessed and explored before it can be used for data mining. Especially if the data stems from different sources, one has to make sure that the data is coherent (is information coded in the same way?). Erroneous data has to be disposed of; however, it first has to be detected.
It is no less important to get a first idea of what the data is like; this knowledge is needed when
choosing appropriate data mining algorithms. Depending on the data mining method used, it
might be necessary to transform variables with skewed distributions. If the data set contains too
many variables it might be necessary to summarise the data.
This phase of the KDD process consists mainly of analysing summary statistics and visualising
the data. Chapter 2 of the lecture focuses on exploratory data mining techniques.
Depending on the quality and the nature of the data, this can be one of the most time consuming
steps of the KDD process. It is essential to carry out this step properly; even the best data mining
algorithm can’t compensate for poor data.

Data mining: After having extracted and prepared the data we can now apply different data mining
algorithms such as logistic regression or support vector machines. This is the focus of the lecture,
which tries to give an overview over (statistical) data mining techniques.

Interpretation: Even though it sounds trivial, the correct interpretation of the results obtained from the data mining algorithms is an essential step in the KDD process; the KDD process would be pointless without this final step. This implies that the data mining algorithms have to provide information which is easy to interpret. Furthermore, discussing the results with "experts" in the subject matter might reveal aspects that were not considered when the model for the data analysis was chosen.

It is important to see data mining as an interactive and especially iterative process. Sometimes it is
even necessary to go back one or two steps. If the data mining algorithm fails or produces unexpected
results it might be necessary to explore the data again and maybe apply further transformations.

1.3 Tasks in statistical data mining


The tasks in statistical data mining can be roughly divided into two groups:

Supervised learning in which the examples are known to be grouped in advance and in which the
objective is to infer how to classify future observations.1
Examples: Predicting from the genes whether a certain therapy will be effective or not. Classifi-
cation of credit card transactions as fraudulent and non-fraudulent.

Unsupervised learning involves detecting previously unknown groups of “similar” cases in the data
or finding “patterns” in the data.
Example: Grouping predators with respect to their prey. Identifying genes with a similar biolog-
ical function.
1 Classification predicts a nominal variable whereas regression generally predicts a continuous variable, although there are regression techniques for nominal variables as well (see the chapter on logistic regression).

Supervised learning problems are generally easier to solve than unsupervised learning problems
— at least they are based on a better defined problem. In supervised learning we want to learn the
link between an outcome variable Y and a set of input variables X1 , . . . , Xp . If we look at the joint
distribution p(y, x1 , . . . , xp ) of the random variables and factor it into

p(y, x1 , . . . , xp ) = p(y|x1 , . . . , xp ) · p(x1 , . . . , xp )

we can see that we "just" have to get a good estimate of p(y|x1 , . . . , xp ); the other factor is largely irrelevant, as only p̂(y|x1 , . . . , xp ) matters for the prediction. In unsupervised learning we are in a more "symmetrical" situation: we possess data X1 , . . . , Xp and there is no outcome variable we focus on. Usually we want to identify interesting patterns, which usually translates to regions of high probability (clusters etc.). For this we have to infer reliable information on the whole joint distribution p(x1 , . . . , xp ); this is a lot more difficult.
Furthermore, it is straightforward to define goodness criteria for supervised learning methods: an algorithm which delivers predictions closer to the observed response is clearly to be preferred. For unsupervised learning problems it is a lot more difficult to come up with sensible goodness criteria. Most algorithms in unsupervised learning are thus based on heuristics, and their results should therefore be interpreted with a certain amount of scepticism.

1.4 Useful properties of data mining algorithms


Robustness and scalability are two properties that data mining algorithms should possess.

1.4.1 Robustness
The data used for data mining generally does not stem from systematic and controlled experiments and thus very often contains a high level of noise and, especially, outliers ("unusual" data values). As most of the data sets in data mining are rather big, it can be very hard to identify outliers using data visualisation techniques. In practice it will thus hardly ever be possible to entirely remove this additional noise in the preprocessing step.
It is therefore advantageous if the data mining algorithm does not break down under these conditions, i.e. if it is "robust". Robustness is an important and desirable property of data analysis methods. The term robustness, however, has a rather vague definition. The Encyclopedia of Statistical Sciences defines it as performing "well not only under ideal conditions, the model assumptions that have been postulated, but also under departures from this ideal" (Kotz et al., 1988, p. 157).

Example for robustness: Estimating the location of a distribution

If we want to estimate the location of a symmetric distribution, we can use the following two statistics:

• The arithmetic mean ("average"):
  $$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

[Figure: two panels ("Regression line without outlier", "Regression line with outlier") showing y plotted against x together with the fitted least-squares regression line in each case.]

Figure 1.1: Least-squares regression line before and after adding an outlier (red dot)

• The median is the 50% percentile of the data, i.e. if we order the data $x_1 \le x_2 \le \dots \le x_n$, then the $\frac{n+1}{2}$-th value is the median. (If $\frac{n+1}{2}$ is not an integer, any value between $x_{n/2}$ and $x_{n/2+1}$ can be used as the median; however, often $\frac{x_{n/2} + x_{n/2+1}}{2}$ is used as the median.)

Suppose we look at a (fictitious) data set consisting of the values 3, 4, 5, 6 and 7. Its mean is 5, and so is its median. If we changed the 7 to 70, the mean would be 17.6 and thus completely misleading. The median, however, would still be 5. So the median is — contrary to the mean — a robust estimator of the location of a distribution.
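This is easy to verify numerically; the following lines of base R reproduce the calculation above.

    x <- c(3, 4, 5, 6, 7)          # the original (fictitious) data
    mean(x); median(x)             # both are 5

    x.out <- c(3, 4, 5, 6, 70)     # the same data with 7 replaced by 70
    mean(x.out)                    # 17.6, heavily affected by the outlier
    median(x.out)                  # still 5: the median is robust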

The mean is not the only non-robust statistic; ordinary least-squares regression is not robust either, as figure 1.1 shows. The left-hand side shows a couple of data points with the corresponding regression line. Adding one outlier completely changes the regression line. Both the mean and least-squares regression are — as the name suggests — based on a quadratic loss function ("L2 loss"). Observations which depart a lot from the model are penalised super-proportionally, i.e. they get a big weight and can thus do a lot of harm. Robust methods are generally based on linear loss functions ("L1 loss"), which put smaller weights on these outlying observations and are thus less affected by them. For a more detailed description of robust statistics see e.g. Huber (1981).
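As a small illustration, the sketch below simulates data in the spirit of figure 1.1 and compares an ordinary least-squares fit with a robust fit; the simulated data and the use of rlm() from the MASS package are our own choices and not part of the original example.

    library(MASS)                        # provides rlm(), a robust regression based on M-estimation

    set.seed(1)
    x <- 1:6
    y <- 0.5 * x + rnorm(6, sd = 0.2)    # roughly linear data
    x.out <- c(x, 6); y.out <- c(y, -1)  # the same data plus one gross outlier

    coef(lm(y ~ x))                      # least-squares fit on the clean data
    coef(lm(y.out ~ x.out))              # least-squares fit is pulled towards the outlier
    coef(rlm(y.out ~ x.out))             # the robust fit is far less affected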

Note that in practice it is very difficult to distinguish between an outlier and an observation that corresponds to a pattern in the data. Even if an observation is classified as an outlier by a data analysis technique like linear regression, it does not automatically follow that it is an outlier that can simply be disposed of. This outlier might contain important information about the structure behind the data. So blindly removing outliers is definitely the wrong approach — unless you know they are due to measurement errors or typing errors during data entry.

1.4.2 Scalability
One of the key differences between "classical" statistics and data mining is the scale of the problems. In order to be able to handle huge data sets we need algorithms which do not scale too badly. The complexity of an algorithm is usually given in Landau notation. An algorithm is said to be of order n (or n²) if the number of iterations necessary to analyse a data set of n observations is, for big n, at worst proportional to n (or n²). This is then denoted as O(n) (or O(n²)).2 Finding the minimum of a variable is of order O(n), as we have to go through the data only once. Ordering a data set is O(n log(n)) at best; primitive sorting algorithms need O(n²) iterations.
The complexity of an algorithm is largely irrelevant when analysing a small-scale problem. It does matter, however, for large-scale problems. A complexity of O(n) or O(n · log(n)) is usually considered very good to good. O(n²) is acceptable; everything beyond it is problematic. Assume you have two algorithms, one of order O(n) and the other of order O(n³). Both take about one second to analyse a data set of 1,000 observations. How long would they take to analyse a data set of 1,000,000 observations (1,000 times bigger)? Roughly speaking, the first would take about 1,000 seconds, whereas the second would take 1,000³ seconds, i.e. more than 30 years!
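The last figure is easy to check: 1,000³ seconds is roughly 31.7 years, as the following one-liner (plain R arithmetic) confirms.

    1000^3 / (60 * 60 * 24 * 365)   # 1,000^3 seconds expressed in years: about 31.7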
Most of the data sets in data mining not only consist of a huge number of observations; the number of variables is often equally large. So increasing the number of variables should not affect the algorithm too badly either, i.e. its complexity as a function of the number of variables p should not be too high. In some cases the number of variables grows as fast as the number of observations, or even faster (see section 1.4.3 as well).
However, scalability is not only about time (or memory); it is about reliability as well. Computers use floating point arithmetic. Despite its benefits it is of finite precision, so rounding has to take place. For a small data set the rounding error is usually irrelevant, but not so for huge data sets. Large-scale algorithms have to be implemented in a way that avoids these numerical problems.

1.4.3 Resilience to the curse of dimensionality


The colourful term "curse of dimensionality" was initially coined by Richard Bellman in the 1960s and referred to the problems encountered when evaluating a high-dimensional function on a discrete grid of, say, spacing 1/10 on the unit hypercube. If the hypercube is 10-dimensional we have to evaluate the function 10¹⁰ times; if it is 20-dimensional we even need 10²⁰ evaluations (Bellman, 1961).

In statistics the term “curse of dimensionality” is used for the problems that occur when the number
of covariates keeps growing. If the number of variables grows as fast as the number of observations,
most statistical techniques will not be consistent (i.e. not converge to the true values) and thus
not deliver meaningful results. A very sketchy explanation of this phenomenon is as follows: If the
number of variables keeps growing then the estimation algorithm cannot focus on the relevant part of
the covariate space and thus cannot achieve higher accuracy as it “wastes” almost all its resources to
represent irrelevant portions of the space. It is thus essential to keep the dimensionality of the problem low (e.g. by summarising the data beforehand) or to use regularisation techniques ("penalisation").
One class of methods that suffers a lot from the curse of dimensionality is nearest neighbour methods. Nearest neighbour methods effectively average the value of the response variable over the nearest neighbours of a point. Assume that the neighbourhood should consist of 5% of the data and that the covariates are uniformly distributed on the 10-dimensional unit hypercube. Then the neighbours would be spread over 78% of the range of each single covariate. For 100 covariates this proportion amounts to 97%! The method would not be local at all any more!

2 The exact definition is as follows: a function f(x) is said to be O(g(x)) if there is a finite x0 and a finite M such that f(x) ≤ M · g(x) for all x > x0. Now let i(n) be the number of iterations the algorithm needs to analyse a dataset of n observations. Assume i(n) = n(n−1)/2. Then i(n) = n(n−1)/2 ≤ n² for all n > 0, so i(n) ∈ O(n²).
However, there seem to be "blessings of dimensionality" as well. One of them is known to mathematicians as the "concentration of measure". In its basic and highly simplified form it says that a suitably smooth function will be constant on most of the space, so only a very small part of the space actually matters (see e.g. Donoho, 2000).

1.4.4 Automation
Data mining involves the repeated analysis of huge data sets. It is essential that the computationally intensive (and thus time-consuming) part of the analysis can be carried out without intervention by the user. This does not conflict with the interactive nature of KDD; it just means that the data mining step itself (e.g. fitting the classifier) should be highly automated. The presentation of the results and the choice of the methods and models used should be highly interactive.

1.4.5 Simplicity
A final and not unimportant criterion is simplicity. The data mining algorithm should be easy to understand, and its results should be no more difficult to interpret. Even the best data mining algorithms will sometimes deliver absolutely useless results. If the data mining algorithm is a "black box" we cannot do much more than blindly trust it, or — if the results are too obscure — distrust it. However, if we understand how the algorithm works and which assumptions it is based on, we are able to find out whether we should trust the results, and if not, where things went wrong. This might allow us to fix the problem and obtain meaningful results.

1.5 Missing values


Sometimes some measurements for one or more observations are missing. How can we deal with this problem? First we have to try to find out why they are missing: are they just missing at random or are they missing for a special reason? In the latter case the fact that we "observed" a missing value does contain relevant information, so we cannot deal with it as if it were missing at random. Consider the case of a clinical study. The patients for whom the proposed treatment is not effective might decide to drop out of the study. Assuming that they were "normal" patients and imputing the average values would clearly bias the results of the study.
If the missing values occur at random and independently of the other variables studied, we can use one of the various imputation techniques. Most of these methods use the information of all or some of the other observations to find an appropriate value to fill in. The easiest, but clearly not the best, approach would be just to use the mean of the other observations. A more sophisticated technique, and a better choice for gene expression data, is k-nearest neighbour imputation (Hastie et al., 1999), which averages over the neighbouring data points.3

3 The neighbouring data points are determined using the non-missing variables.
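To make the idea concrete, here is a minimal R sketch of k-nearest-neighbour imputation; it is our own illustration (the function name, the Euclidean distance and the choice of k are ours) and not the implementation of Hastie et al. (1999). It fills in a single missing entry X[i, j] by averaging that variable over the k complete rows that are closest on the remaining variables.

    knn_impute_one <- function(X, i, j, k = 5) {
      # fill in the missing entry X[i, j]; assumes row i has no other missing values
      keep <- setdiff(seq_len(ncol(X)), j)               # distances use the non-missing variables only
      complete <- which(!is.na(X[, j]) &
                        rowSums(is.na(X[, keep, drop = FALSE])) == 0)
      d <- apply(X[complete, keep, drop = FALSE], 1,
                 function(row) sqrt(sum((row - X[i, keep])^2)))  # Euclidean distance to row i
      neighbours <- complete[order(d)[seq_len(min(k, length(d)))]]
      mean(X[neighbours, j])                             # impute by the neighbours' average
    }

    # small example with a single missing value
    X <- matrix(rnorm(50), nrow = 10)
    X[3, 2] <- NA
    X[3, 2] <- knn_impute_one(X, i = 3, j = 2, k = 3)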

1.6 Notation

In data mining we usually analyse datasets with more than one variable (multivariate data). We gener-
ally use the letter “x” to designate variables. We denote then with xij the j-th measurement (variable)
on the i-th individual (observation). We could imagine the data to be stored in a spreadsheet, where
the columns correspond to the variables and the rows to the observations.

Variable 1 Variable 2 ... Variable j ... Variable p


Observation 1 x11 x12 ... x1j ... x1p
Observation 2 x21 x22 ... x2j ... x2p
... ... ... ... ... ... ...
Observation i xi1 xi2 ... xij ... xip
... ... ... ... ... ... ...
Observation n xn1 xn2 ... xnj ... xnp

More mathematically speaking, we can see our data as a matrix X with n rows and p columns, i.e.

$$X = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1j} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2j} & \dots & x_{2p} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
x_{i1} & x_{i2} & \dots & x_{ij} & \dots & x_{ip} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nj} & \dots & x_{np}
\end{pmatrix}$$

Note that some statistical methods (especially clustering) can be carried out in the observation space as well as in the variable space. "In the variable space" means that the transposed data matrix is used, i.e. the data matrix is reflected: the observations are seen as variables and the variables are seen as observations. Microarray data is an example of this. We might want to group the samples (i.e. find clusters in the observation space), but we might as well want to group the genes (i.e. find clusters in the variable space). To that extent, the decision what is chosen to be an "observation" and what is chosen to be a "variable" can sometimes be arbitrary.
For microarray data often the transposed data matrix is used, i.e. the rows correspond to the genes
(the variables) and the columns correspond to the samples (the observations). The reason for this is
mainly that microarray data usually contains far more genes than samples and it is easier to visualise
a table with more rows than columns.
Some statistical methods were initially designed for continuous variables. Sometimes these methods can handle nominal variables if they are coded appropriately. One possibility is dummy coding, which is nothing else than an indicator ("bitwise") representation of the discrete variable. A discrete variable with three levels (say "A", "B" and "C"), for example, is replaced by three "dummies". The first one is 1 if the discrete variable is "A" and 0 otherwise; the second one is 1 iff the discrete variable is "B"; and the third one is 1 iff the discrete variable is "C". The following table gives an example:

Discrete variable Dummy A Dummy B Dummy C


A 1 0 0
B 0 1 0
C 0 0 1
B 0 1 0
C 0 0 1
... ... ... ...

Note that Dummy C can be computed from the first two as 1 − Dummy A − Dummy B; the data matrix (including the intercept term) is thus not of full rank. This poses a problem for certain statistical methods. That is why the last column of the dummy coding is often omitted; the last level is then the reference class.
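In R, dummy coding is produced by model.matrix(); the brief sketch below is our own illustration. It shows both the full set of indicators and R's default coding, which drops one column. Note that R drops the first level by default, whereas the text above drops the last; the principle is the same.

    f <- factor(c("A", "B", "C", "B", "C"))

    model.matrix(~ f - 1)   # full dummy coding: one indicator column per level
    model.matrix(~ f)       # default coding: intercept plus dummies for all but the reference level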

1.7 Datasets used in the lecture


This section presents datasets that are used more than once. Datasets that are used only once are described when they are analysed.

1.7.1 The iris dataset

The famous iris dataset was collected by E. Anderson (Anderson, 1935) and analysed by R. A. Fisher
(Fisher, 1936). It gives the measurements in centimetres of the variables sepal length and width and
petal length and width, respectively, for 50 flowers from each of three species of iris. The species
studied are iris setosa, iris versicolor, and iris virginica. In the chapters on classification we will
however only use the measurements on the petals.

1.7.2 The European economies dataset

The European economies dataset contains information on the composition of the gross value added
for 25 European countries (all EU countries apart from Ireland and Malta as well as Romania and
Bulgaria). It gives for 14 industries the percentage of the gross value added by that industry. This
dataset can be obtained online from Eurostat.

1.7.3 The quarter-circle dataset

This is a simulated dataset that contains a noisy quarter circle. The two covariates x1 and x2 were independently sampled from a uniform distribution on [0, 1]. The observations were classified into the first class if
$$\sqrt{x_{i1}^2 + x_{i2}^2} + \varepsilon_i < \sqrt{\frac{2}{\pi}},$$
where $\varepsilon_i$ is a Gaussian error term with a standard deviation of 0.15.
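Since the dataset is fully specified above, a version of it can be regenerated in a few lines of R; the sample size and the seed below are our own choices.

    set.seed(42)
    n   <- 200
    x1  <- runif(n); x2 <- runif(n)                       # covariates, uniform on [0, 1]
    eps <- rnorm(n, sd = 0.15)                            # Gaussian noise with standard deviation 0.15
    class <- ifelse(sqrt(x1^2 + x2^2) + eps < sqrt(2 / pi), 1, 2)
    quarter.circle <- data.frame(x1, x2, class = factor(class))
    plot(x1, x2, col = class, pch = class)                # the noisy quarter-circle boundary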

1.7.4 The breast tumour dataset


This dataset (van ’t Veer et al., 2002) is from a study on primary breast tumours using DNA microarray
data. The subjects were 78 sporadic lymph-node-negative female patients. The aim of the study was
to search for a prognostic signature in their gene expression profiles. 44 patients remained free of
disease after their initial diagnosis for an interval of at least 5 years (good prognosis group) and 34
patients developed distant metastases within 5 years (poor prognosis group). The aim is to discrimi-
nate between good and poor prognosis groups using the gene expression profiles of the patients. The
original data includes expression profiles on 24,189 genes. We will use two versions of this dataset. One will only contain the 30 "best" genes, the other one will contain the 4,000 "best" genes. Both datasets are divided into a training set (40 patients) and a test set (36 patients).4,5

4 The "best" genes were obtained by ordering the genes by the absolute value of the t-statistic for the comparison between the two groups.
5 The original dataset contained some missing values that have been filled in using KNNimpute, i.e. the missing value was replaced by the mean of the nearest (complete) neighbours. Two observations have been deleted due to too many missing values.
Chapter 2

Exploratory data analysis

2.1 Visualising data

2.1.1 Visualising continuous data

Visualising one-dimensional data

Exploring the data visually is usually the first step in analysing data. It allows us to get an idea of what the dataset is like and to spot outliers and other anomalies. To get a first idea of a one-dimensional data set, one can look at a boxplot (see figure 2.1) or a kernel-based estimate of its density (see figure 2.2). When looking at data describing different groups, like the iris data, groupwise boxplots (or, if enough data is available, groupwise density estimates) can help unveil differences between the groups (see figure 2.3).
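All of these one-dimensional displays are available in base R; a minimal sketch using the built-in iris data (the choice of variable is ours, for illustration):

    data(iris)

    boxplot(iris$Sepal.Length)                      # boxplot of a single variable (cf. figure 2.1)
    plot(density(iris$Sepal.Length))                # kernel density estimate (cf. figure 2.2)
    boxplot(Sepal.Length ~ Species, data = iris)    # groupwise boxplots (cf. figure 2.3)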

Visualising two-dimensional data

For analysing two-dimensional data a scatter plot is probably the best choice. A scatter plot is nothing
else than a plot of one variable against the other. A scatter plot matrix (“pairs plot”), in which every
variable is plotted against every other variable, is usually a good way to visualise a dataset with a few
continuous variables. Colours and the plotting symbol can be used to represent categorical variables.
Figure 2.4 shows a pairs plot of the iris dataset.
Sometimes it is useful not to plot the whole dataset but only a certain subset. One might for
example want to look at one plot per level of a nominal variable. Instead of a nominal variable, one
can create a separate plot of the data for different ranges of a continuous covariate. Figure 2.5 gives
an example.
Conditional pairs plots can be done as well when conditioning on a continuous variable. In this
case we simply condition on the conditioning variable being within a certain range of values. Figure
2.6 gives an example. The boxes at the top indicate which values of petal.width are considered in
the different plots. For example, the bottom left hand plot corresponds to flowers with a petal.width
between 0 and 0.45.
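Scatter plot matrices and conditioning plots of this kind can be produced with pairs() and coplot() in R; a brief sketch, again on the iris data:

    pairs(iris[, 1:4], col = as.numeric(iris$Species),
          pch = as.numeric(iris$Species))           # scatter plot matrix (cf. figure 2.4)

    # sepal measurements conditioned on overlapping ranges of Petal.Width (cf. figure 2.6)
    coplot(Sepal.Width ~ Sepal.Length | Petal.Width, data = iris)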


[Figure: a single boxplot, annotated with the median, the lower and upper quartiles, the lower and upper adjacent values, and the outside values.]

Figure 2.1: Boxplot of the sepal length (iris data)

[Figure: kernel density estimate titled "Sepal length", x-axis roughly 2.0 to 4.5, y-axis "Density"; N = 150, bandwidth = 0.1233.]

Figure 2.2: Kernel density estimation of the sepal length (iris data)

[Figure: side-by-side boxplots, one each for setosa, versicolor and virginica.]

Figure 2.3: Boxplots of the sepal length for the different species (iris data)

[Figure: scatter plot matrix ("Scatter Plot Matrix") of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with the species Iris setosa, Iris versicolor and Iris virginica distinguished by plotting symbol.]

Figure 2.4: Pairs plot of the iris dataset



[Figure: three scatter plot matrices of the four iris measurements, one for each species (setosa, versicolor, virginica).]

Figure 2.5: Conditional pairs plot of the iris dataset (by species)

[Figure: conditioning plot of Sepal.Width against Sepal.Length, one panel per range of the conditioning variable ("Given : Petal.Width").]

Figure 2.6: Conditional pairs plot of the sepal measurements of the iris dataset (by petal.width)

[Figure: three-dimensional scatter plot with axes Sepal.Length, Sepal.Width and Petal.Width.]

Figure 2.7: Three-dimensional scatter plot of the iris dataset

Higher-dimensional visualisation
Sometimes it is not enough to visualise higher-dimensional data using scatter plots only. Three-dimensional scatter plots (as in figure 2.7) seem to be a solution. Note, however, that these plots
are in fact two-dimensional as well — they fit on a (flat) sheet of paper. Like scatter plots they
are a projection of the data; the projection is just not parallel to the axes any more. But different
observations can still be projected to the same point, so the points are not uniquely represented.
How can we overcome this problem? There are several approaches to it:
• The easiest is looking not only at one but at a range of projections. This is in fact the idea behind
scatter plots.
Note however that it can be difficult to find out which projections are interesting and one might
have to look at a high number of projections.
• Interactive and animated plots provide a way of getting around the problem. When the data
points are spun we are actually able to capture their three-dimensional structure. There are
several interactive tools to visualise high dimensional data like S-Plus’s “brush and spin” and the
open source tools XGobi and GGobi. Figure 2.8 shows a screenshot of XGobi.
• If the data can be projected onto a (flat) sheet of paper anyway, why not do it in a meaningful way, so that similar data points end up close to each other? This is the idea of biplots and MDS (see sections 2.3.2 and 3.5).

However, it might be most meaningful to do something other than projecting the data. High-dimensional data can be displayed in a variety of other ways, too.

Figure 2.8: Screenshot of XGobi

• "Heatmaps" are another way of displaying multivariate datasets, especially microarray data. However, they can be rather misleading, as the impression they generate depends highly on the ordering of the variables and observations. Unordered heatmaps (figure 2.9(a)) look rather chaotic and are thus difficult to interpret. Therefore one generally applies a clustering algorithm to the rows and columns and displays the variables and observations according to the dendrogram. This approach, however, has the problem that the heatmap "inherits" the instability of clustering. Figures 2.9(b) and 2.9(c) show heatmaps for the same dataset using two different clustering algorithms. Figure 2.9(b) suggests at first sight that there are two functional groups of genes, whereas figure 2.9(c) suggests there are three.
• Parallel coordinate plots are an alternative to heatmaps which does not suffer from this problem; however, they may look less "fancy". In a parallel coordinate plot each vertical line corresponds to a variable, whereas the connecting lines correspond to the observations. Figure 2.10 shows a parallel coordinate plot of the European economies dataset.1

• Star plots (see figure 2.11) are another possibility to visualise such data; a short code sketch of these displays follows below.
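The displays just mentioned are all available in R. A minimal sketch on the iris measurements (the choice of data, and the default clustering used by heatmap(), are our own, for illustration only):

    library(MASS)                                  # provides parcoord()

    X <- as.matrix(iris[, 1:4])

    heatmap(X)                                     # heatmap; rows and columns reordered by hierarchical clustering
    parcoord(X, col = as.numeric(iris$Species))    # parallel coordinate plot, one line per observation
    stars(X[1:20, ])                               # star plot, one star per observation (first 20 shown)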

1 Sometimes, the variables are rescaled in order to use the full range of the plotting region — in our case however this does not seem very useful.
[Figure: three heatmaps of the same observations and genes, with rows corresponding to genes and columns to samples.]

(a) Unordered heatmap (b) Heatmap (using complete linkage clustering) (c) Heatmap (using average linkage clustering)

Figure 2.9: Heatmaps for the same observations and genes for the "best" 100 genes of the breast tumour data

[Figure: parallel coordinate plot with one vertical axis per industry (Agriculture, Chemical, Construction, Education, Electricity, Electronics, Financial, Hotel.Rest, Machinery, Mining.Oil, Public, RealEst.etc, Transport, Wholesale) and one line per country.]

Figure 2.10: Parallel coordinate plot of the European economies dataset

[Figure: one star per country, the segments of each star corresponding to the fourteen industries.]

Figure 2.11: Star plot of the European economies dataset


2.1.2 Visualising nominal data

[Figure: a bar plot ("Death penalty: Influence of the defendant's race") showing the number of yes/no death penalty verdicts for white and black defendants, and a mosaic plot ("Death penalty: Influence of the defendant's race and the victim's race") showing the verdicts broken down by the defendant's race and the victim's race.]
Figure 2.12: Bar plot and mosaic plot of the proportion of death penalty verdicts

Nominal data is usually displayed using bar plots showing the frequency of each class. The left-hand side of figure 2.12 shows the frequency of death penalty verdicts for homicide in Florida in 1976 and 1977 — separately for white and black defendants (from Agresti, 1990). The plot suggests that there is no evidence of racism.
For displaying the joint distribution of several nominal variables, mosaic plots are a good choice. In a mosaic plot the area of each rectangle is proportional to the corresponding probability (relative frequency). The right-hand side of figure 2.12 gives an example. It shows the proportion of death penalty verdicts by the race of the defendant and by the race of the victim. This plot, however, leads to the opposite conclusion: if the victim is black the defendant is less likely to obtain the death penalty, and black defendants generally seem to have a higher risk of being sentenced to death. So there is some evidence of racism amongst the jury members.
We could not see this in the bar plot, as we did not use the (important) information on the victim's race. The fact that whites are less likely to be sentenced to death was masked by the fact that the defendant's and the victim's race are not independent.
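Bar plots and mosaic plots of this kind are easily produced in R with barplot() and mosaicplot(). The sketch below uses a small hypothetical three-way table; the counts are invented for illustration and are not the Florida data discussed above.

    # hypothetical counts: defendant's race x victim's race x death penalty verdict
    counts <- array(c(20, 5, 1, 10,        # verdict "yes"
                      400, 60, 20, 100),   # verdict "no"
                    dim = c(2, 2, 2),
                    dimnames = list(defendant = c("white", "black"),
                                    victim    = c("white", "black"),
                                    verdict   = c("yes", "no")))

    barplot(margin.table(counts, c(3, 1)), beside = TRUE,
            legend.text = TRUE)                        # verdicts by defendant's race only
    mosaicplot(counts, main = "Hypothetical example")  # full three-way mosaic plot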

2.2 Descriptive statistics


Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the sample mean is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. Another example is the variance, which provides a measure of dispersion. In this course we encounter some descriptive statistics that measure location, variation and linear association. The formal definitions of these quantities follow.
Let $x_{11}, x_{21}, \dots, x_{N1}$ be the N measurements on the first variable. The sample mean of these measurements, denoted $\bar{x}_1$, is given by
$$\bar{x}_1 = \frac{1}{N} \sum_{i=1}^{N} x_{i1}.$$

A measure of spread is provided by the sample variance, defined for N measurements on the first variable as
$$s_1^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{i1} - \bar{x}_1)^2.$$
In general, for p variables, we have
$$s_j^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2, \qquad j = 1, 2, \dots, p.$$

The sample covariance
$$s_{jk} = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \qquad j, k = 1, 2, \dots, p,$$
measures the association between the jth and kth variables. Note that the covariance reduces to the sample variance when j = k. Moreover, $s_{jk} = s_{kj}$ for all j and k. The final descriptive statistic considered here is the sample correlation coefficient (sometimes referred to as Pearson's correlation coefficient). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient, for the ith and kth variables, is defined as
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{N} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{N} (x_{ji} - \bar{x}_i)^2}\;\sqrt{\sum_{j=1}^{N} (x_{jk} - \bar{x}_k)^2}}$$

for i = 1, 2, . . . , p; k = 1, 2, . . . , p. Covariance and correlation provide measures of linear association.


Their values are less informative for other kinds of association. These quantities can also be very
sensitive to outliers. Suspect observations must be accounted for by correcting obvious recording
mistakes. An alternative is to make use of robust alternatives for the above descriptive statistics that
are not affected by spurious observations. The descriptive statistics computed from n measurements

on p variables can also be organised into arrays.


 
Sample mean vector:
$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}$$

Sample covariance matrix:
$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$

Sample correlation matrix:
$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$$
The above sample mean vector and sample covariance matrix can also be expressed in matrix notation as follows:
$$\bar{x} = \frac{1}{N} X'\mathbf{1}$$
$$S = \frac{1}{N}\left(X'X - \frac{1}{N} X'\mathbf{1}\mathbf{1}'X\right) = \frac{1}{N} X'X - \bar{x}\bar{x}'$$
where $\mathbf{1}$ is a column vector of N ones.
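In R these summaries are available directly; a quick sketch on the four iris measurements (our own choice of data):

    X <- iris[, 1:4]

    colMeans(X)   # sample mean vector
    cov(X)        # sample covariance matrix (R divides by N - 1, not N as in the formulas above)
    cor(X)        # sample correlation matrix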

2.3 Graphical Methods


2.3.1 Principal Component Analysis
Principal component analysis (PCA) is concerned with explaining the variance-covariance structure in
data through a few linear combinations of the original variables. PCA is from a class of methods called
projection methods. These methods choose one or more linear combinations of the original variables
to maximise some measure of “interestingness”. In the case of PCA this measure of “interestingness” is
the variance. Its general objectives are (1) data reduction and (2) interpretation. PCA is particularly
useful when one has a large, correlated set of variables and would like to reduce them to a manageable
few. For example, in gene expression arrays we have many correlated genes being co-expressed (under
or over), often in response to the same biological phenomenon. PCA can be used to identify small
subsets of genes that behave very similarly, and account for large variances across samples. In the
lectures we will consider an example of such an application.
In a full PCA, the set of response variables is re-expressed in terms of a set of an equal number of principal component variables. Whereas the response variables are intercorrelated, the principal component variables are not. Therefore, the variance-covariance structure of the principal components is described fully by their variances. The total variance – that is, the sum of their variances – is the same as the total variance of the original variables. For this reason, the variances of principal components are usually expressed as percentages of the total variation.

Deriving principal components

We describe principal components first for a set of real valued random variables X1 , . . . , Xp . To obtain
the first principal component, we seek a derived variable Z1 = a11 X1 + a21 X2 + · · · + ap1 Xp such
that var(Z1) is maximised. Here the ai1 are real-valued coefficients. If left unconstrained we could make the variance of Z1 arbitrarily large, so we constrain a1 = (ai1) to have unit Euclidean norm, a′1 a1 = ‖a1‖² = 1. This first principal component attempts to capture the common variation in a set of variables via a single derived variable. The second principal component is then a derived variable, Z2, uncorrelated with the first, with the largest variance, and so on.
Suppose that S is the p × p covariance matrix of the data. The eigendecomposition of S is given as

S = AΛA′

where Λ is a diagonal matrix containing the eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and A is a p × p orthogonal matrix whose columns consist of the p eigenvectors of S. The eigenvalues are real and non-negative since the covariance matrix S is symmetric and non-negative definite. It is easy to show that the loading vector for the first principal component a1 is the eigenvector of S corresponding to the largest eigenvalue, and so on. So the ith principal component is then the ith linear combination
picked by this procedure. It is customary to advocate using the eigendecomposition of the correlation
matrix rather than the covariance matrix, which is the same procedure as rescaling the variables to
unit variance before calculating the covariance matrix. Below we give a summary of the above results
concerning principal components. We also show how to establish which proportion of the total variance is explained by the kth principal component. If most (say 80 to 90%) of the total population variance, for large p, can be explained by the first few components, then these components can “replace” the original p variables without much loss of information.
Note that another interpretation of principal components is that we seek new variables Zi, rotations of the old variables, that best explain the variation in the dataset. Suppose we consider the first k principal components. The subspace spanned by these new variables represents the “best” k-dimensional view of the data: it approximates the original points best in the sense of minimising the sum of squared distances from the points to their projections.

If S = (sik ) is the p × p sample covariance matrix with eigenvalue-eigenvector pairs


(λ1 , a1 ), (λ2 , a2 ), . . . , (λp , ap ), the ith sample principal component is given by

zi = a′i x = a1i x1 + a2i x2 + · · · + api xp , i = 1, 2, . . . , p

where λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and x is any observation on the variables X1, X2, . . . , Xp. Also

Sample covariance(zi, zk) = 0,   i ≠ k

Sample variance(zk) = λk,   k = 1, 2, . . . , p
In addition,

Total sample variance = s11 + s22 + · · · + spp = λ1 + λ2 + · · · + λp

The proportion of the total variance explained by the kth principal component is

λk / (λ1 + λ2 + · · · + λp),   k = 1, 2, . . . , p
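In R the quantities above can be obtained directly from the eigendecomposition of the sample covariance matrix, or more conveniently with the built-in function prcomp. The following minimal sketch uses the four iris measurements purely for illustration.

## Principal components via the eigendecomposition of the covariance matrix S = A Λ A'
X <- as.matrix(iris[, 1:4])
S <- cov(X)
eig <- eigen(S)
lambda <- eig$values                           # eigenvalues = variances of the PCs
A <- eig$vectors                               # columns are the loading vectors a1, ..., ap

## Scores z = a'x for the centred observations
Z <- scale(X, center = TRUE, scale = FALSE) %*% A

## Proportion of the total variance explained by each component
round(lambda / sum(lambda), 3)

## The same analysis with prcomp(); scale. = TRUE corresponds to using the
## correlation matrix instead of the covariance matrix
summary(prcomp(X, scale. = TRUE))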

PCA usage

Note that the emphasis on variance with PCA creates some problems. The principal components
depend on the units in which the variables are measured. PCA works best in situations where all
variables are measurements of the same quantity, but at different times or locations. Unless we have good a priori reasons to regard the variances of the variables as comparable, we would normally make them equal by rescaling all the variables to unit variance before calculating the covariance matrix. The fact that the extraction of principal components is based on variances also means that
the procedure is sensitive to the presence of outliers. This is particularly important for large noisy
datasets such as gene expression data. It is not straightforward to identify outliers in high-dimensional spaces, so it is desirable to use a method of extracting principal components that is less sensitive to outliers. This can be achieved by taking an eigendecomposition of a robustly estimated covariance or correlation matrix (Devlin et al., 1981). An alternative approach is to find a projection maximising a robust measure of variance in q dimensions, as mentioned by Ripley (1996, page 293). Although
principal components are uncorrelated, their scatterplots sometimes reveal important structures in
the data other than linear correlation. We will consider examples of such plots in the lectures and
practicals. The first few principal components are often useful to reveal structure in the data. However,
principal components corresponding to smaller eigenvalues can also be of interest. In the practical we
will also consider “Gene Shaving”, an adaptation of PCA that looks for small subsets of genes that
behave similarly, and account for large variances across samples.

Example 1: Yeast Cell Cycle Data

Here we present examples of PCA applied to 3 real-life datasets. First we consider gene expression
data on the cell cycle of yeast (Cho et al., 1998). They identify a subset of genes that can be categorised
into five different phases of the cell-cycle. Yeung and Ruzzo (2001) conduct a PCA on 384 of these
genes. Changes in expression for the genes are measured over two cell cycles (17 time points). The data were normalised so that the expression values for each gene have mean zero and unit variance across the cell cycles. Figure 2.13 is a visualisation of the 384 genes used by Yeung and Ruzzo (2001) in the space of the first two principal components. These two components account for 60% of the variation
in the data (extracted from the correlation matrix). Genes corresponding to each of the five different
phases of the cell cycle are plotted in different symbols. We see that despite some overlap the five
classes are reasonably well separated.

Example 2: Morphometrics of Ordovician trilobites

Although distinctive and often common in Ordovician faunas, illaenid trilobites have relatively fea-
tureless, smooth exoskeletons; taxonomic discrimination within the group is therefore difficult. While a number of authors have attempted a qualitative approach to illaenid systematics, Bruton and Owen


Figure 2.13: The yeast cell cycle data plotted in the subspace of the first two PCs


Figure 2.14: First two principal components for the paleontological data

(1988) described the Scandinavian Upper Ordovician members of the family in terms of measurements
and statistics. This study seeks to discriminate between at least two different species of the illaenid
trilobite Stenopareia. Four measurements were made on the cranidia of the illaenid trilobites, taken on Stenopareia glaber from the Ashgill of Norway together with S. linnarssoni from both Norway and Sweden. In figure 2.14 we plot these data in the space of the first two principal components, distinguishing between the three groups of measurements. The two species Stenopareia glaber and S. linnarssoni appear well separated, while there also appears to be a reasonably clear separation between the Swedish and Norwegian observations of S. linnarssoni.

Example 3: Crabs Data


Figure 2.15: First three principal components for the crabs data

Campbell and Mahon (1974) studied rock crabs of the genus Leptograpsus. One species, L. variegatus, had been split into two new species, previously grouped by colour form, orange and blue. Preserved specimens lose their colour, so it was hoped that morphological differences would enable museum material to be classified.
Data are available on 50 specimens of each sex of each species, collected on sight at Fremantle,
Western Australia. Each specimen has measurements of the width of the frontal lip FL, the rear width RW, the length along the midline CL, the maximum width CW of the carapace, and the body depth BD, all in mm.
We plot the first three principal components in figure 2.15. What conclusions do you draw with regard to the structure revealed by each of the principal components? Do all the components in the plot reveal interesting structure? We discuss this in more detail in the lecture.
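A plot along the lines of figure 2.15 can be produced as sketched below; the crabs data are part of the R package MASS. Taking logarithms of the measurements is our own (common, but not necessarily the original) choice of preprocessing.

library(MASS)                               # provides the crabs data
library(lattice)                            # splom() draws a scatterplot matrix

X <- log(crabs[, c("FL", "RW", "CL", "CW", "BD")])
pca <- princomp(X)
group <- paste(crabs$sp, crabs$sex)         # blue/orange x male/female

## Scatterplot matrix of the first three principal components
splom(~ pca$scores[, 1:3], groups = group, auto.key = TRUE)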

2.3.2 Biplots

Singular value decomposition of a real matrix: X = U ΛV ′ , where Λ is a diagonal matrix with


decreasing non-negative entries, U is an n × p matrix with orthonormal columns, and V is a p × p
orthogonal matrix. The principal components are given by the columns of XV .

Let X̃ = U Λq V ′ be the matrix formed by setting all but the first q (q ≤ p) singular values (diagonal
elements of Λ) to zero. Then X̃ is the best possible rank-q approximation to X in a least squares sense
(i.e. the q-dimensional subspace has the smallest sum of squared distances between the original
points and their projections.)

Consider n arbitrary cases (or items), each consisting of measurements on p variables summarised
in a centered data matrix X : n × p (the data matrix can be centered by subtracting the column
means from each column). These cases have a geometrical representation in a p-dimensional space
ℜp . The biplot (Gabriel, 1971) is a method for representing both the cases and variables. The biplot
represents X by two sets of vectors of dimensions n and p producing a rank-2 approximation to X.
The best rank-q (q < p) approximation in a least squares sense can be obtained from the singular value
decomposition (SVD) of X (see above). The best fitting q-dimensional subspace (in a least squares sense) is spanned by the columns of the matrix consisting of the first q columns of the matrix V in the SVD, say Vq. They define a natural set of orthogonal coordinate axes for the q-dimensional subspace. Therefore, XVq gives the linear projection of X into the “best” q-dimensional subspace.
So we can obtain the best 2-dimensional view of the data by setting λ3, . . . , λp to zero in the SVD of X, i.e.

X ≈ X̃ = (u1, u2) diag(λ1, λ2) (v1, v2)′ = GH′

where the diagonal scaling factors can be absorbed into G or H in various ways. The biplot then consists of plotting the n + p 2-dimensional vectors which form the rows of G and H. The rows of G specify the coordinates of the observations (items) in the 2-dimensional subspace, while the rows of H relate to the variables and are typically represented as arrows. With the scaling factors absorbed into H, the inner products between the rows of H approximate the covariances between the variables; the lengths of the arrows therefore represent the standard deviations of the variables. In figure 2.16 we present an example for the trilobite data
from example 2.
Gower and Hand (1996) devote a whole book to biplots and argue that these coordinate axes should just be considered as scaffolding, to be removed for display purposes. They recommend instead drawing representations of the original axes, representing the variables, in the q-dimensional subspace. These representations are referred to as biplot axes and can be provided with suitable scales.
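In R a principal component biplot is produced by the function biplot. The trilobite data of figure 2.16 are not generally available, so the short sketch below uses the iris measurements instead.

## Rows of G are drawn as points, rows of H as arrows
pca <- princomp(iris[, 1:4])
biplot(pca)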

2.3.3 Independent Component Analysis


Independent component analysis (ICA) is becoming a very popular visualisation tool for uncovering
structure in multivariate datasets. It is a method of decomposing a multi-dimensional dataset into
a set of statistically independent non-gaussian variables, and is strongly connected to the technique
of projection pursuit in the statistical literature (Friedman and Tukey, 1974). ICA assumes that the


Figure 2.16: Principal component biplot of the trilobite data. Observations 1−43 represent Stenopareia
glaber, while observations 44 − 53 and 54 − 60 represent the two different groups of S. linnarssoni
observations from Norway and Sweden, respectively

dataset is made up by mixing an unknown set of independent components or variables, and attempts
to recover the hidden sources and unknown mixing process. Another way to consider this is that
ICA looks for rotations of sphered data that have approximately independent coordinates. Suppose there are k ≤ p independent sources in a matrix S, and we observe the p linear combinations X = SA with mixing matrix A. The “unmixing” problem is to recover S. ICA attempts to unmix the data by estimating an unmixing matrix W, where XW = S. The ICA model involves a fundamental ambiguity in the scale and order of the recovered sources: the sources can only be recovered up to an arbitrary permutation and scaling. Therefore, we may as well assume that the signals have unit variances. The above model assumes that all randomness comes from the signals and is known as the “noise-free” ICA model. Below we define both the “noise-free” and “noisy” ICA models, but we will only consider the former. ICA is often very useful in uncovering structure hidden
within high-dimensional datasets and is routinely used as an exploratory tool in areas such as medical
imaging (Beckmann and Smith, 2002) and gene expression experiments (Liebermeister, 2002; Hori
et al., 2001; Lee and Batzoglou, 2003).

Definition 1 (Noise-free ICA model)


ICA of a n × p data matrix X consists of estimating the following generative model for the data:

X = SA

where the latent variables (components) are the columns of the matrix S and are assumed to be
independent. The matrix A is the mixing matrix.
Definition 2 (Noisy ICA model)
ICA of a n × p data matrix X consists of estimating the following generative model for the data:

X = SA + N

where S and A are the same as above and N is a noise matrix.

The starting point for ICA is the simple assumption that the components are statistically indepen-
dent. Under the above generative model, the measured signals in X will tend to be more Gaussian than
the source components (in S) due to the Central Limit Theorem. Thus, in order to extract the indepen-
dent components/sources we search for an unmixing matrix W that maximises the non-gaussianity (independence) of the sources. The FastICA algorithm of Hyvärinen (1999) provides a very efficient method of maximisation suited to this task. It measures non-gaussianity using fast-to-compute approximations to negentropy (J), which are more robust than kurtosis-based measures.
This algorithm is implemented in the R package fastICA (http://www.r-project.org).
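The sketch below shows how a dataset can be decomposed into a small number of independent components with fastICA; it anticipates the crabs example that follows and simply uses the package defaults, which are not necessarily the settings used for figure 2.17.

library(fastICA)
library(MASS)                                  # crabs data

X <- as.matrix(crabs[, c("FL", "RW", "CL", "CW", "BD")])
ica <- fastICA(X, n.comp = 4)                  # estimate four independent components

## ica$S holds the estimated sources (one column per component),
## ica$A the estimated mixing matrix
pairs(ica$S, col = as.numeric(crabs$sp), pch = as.numeric(crabs$sex))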

Example 1: Crabs Data

We illustrate ICA on the crabs data that we have considered before. In figure 2.17 we present a scat-
terplot matrix of the four independent components. It seems as though the 2nd and 3rd components
picked out the sexes and colour forms, respectively. However, note that Venables and Ripley (2002) remark on the arbitrariness of ICA in terms of the number of components we select. In the crabs data we would perhaps anticipate two components; however, a two-component model yields much less impressive results.


Figure 2.17: Scatterplot matrix of the four independent components for the crabs data

Example 2: Ordovician trilobite data

We also apply ICA to the trilobite data introduced earlier. In figure 2.18 we present boxplots of the
recovered components for these data. It is clear that the first signal is reasonably successful in picking out the three different subgroups of trilobites.

2.3.4 Exploratory Projection Pursuit


Friedman and Tukey (1974) proposed exploratory projection pursuit as a graphical technique for
visualising high-dimensional data. Projection pursuit methods seek a q-dimensional projection of the
data that maximises some measure of “interestingness” (unlike with PCA this measure would not be
the variance, and it would be scale-free). The value of q is usually chosen to be 1 or 2 in order to
facilitate visualisation. Diaconis and Freedman (1984) show that a randomly selected projection of
a high-dimensional dataset will appear similar to a sample from a multivariate normal distribution.
Therefore, most one or two dimensional projections of high-dimensional data will look Gaussian.
Interesting structure, such as clusters or long tails, would be revealed by non-Gaussian projections. For
this reason the above authors stress that “interestingness” has to mean departures from multivariate
normality.
All so-called elliptically symmetric distributions (like the normal distribution) should be uninter-
esting. So projection pursuit is based on computing projection pursuit indices that measure the non-
normality of a distribution by focussing on different types of departures from Gaussianity. A wide variety of indices have been proposed and there is disagreement over their relative merits. Two important considerations arise when deciding upon a specific index: (1) the computational time required to compute it (or some approximation of it); (2) its sensitivity to


Figure 2.18: Boxplots of the “signals” recovered for the trilobite data split by the three subgroups

outliers. Once an index is chosen, a projection is chosen by numerically maximising the index over
the choice of projection. Most measures do not depend on correlations in the data (i.e. they are affine
invariant). Affine invariance can be achieved by “sphering” the data before computing the projection
index. A spherically symmetric point distribution has a covariance matrix proportional to the identity
matrix. So data can be transformed to have an identity covariance matrix. A simple way to achieve
this is to transform the data to the principal components and then rescaling each component to unit
variance.
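Sphering can be carried out in a few lines of R, for example as in the following sketch (the function name and the use of the iris data are our own illustrative choices).

## "Sphere" a data matrix: rotate to the principal components and rescale
## each component to unit variance, so the covariance matrix becomes the identity
sphere <- function(X) {
  pca <- prcomp(X, center = TRUE)
  keep <- pca$sdev > 1e-8                      # drop components with (near-)zero variance
  scale(pca$x[, keep], center = FALSE, scale = pca$sdev[keep])
}

Z <- sphere(as.matrix(iris[, 1:4]))
round(cov(Z), 10)                              # identity matrix up to rounding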
Ripley (1996) comments that once an interesting projection is found, it is important to remove
the structure it reveals to allow other interesting views. Interested readers are referred to Ripley
(1996) for a more detailed discussion of exploratory projection pursuit and projection indices. XGobi and GGobi are freely available visualisation tools implementing projection pursuit methods. We will consider some applications in the lectures. Despite their different origins, ICA and exploratory projection pursuit are quite similar.
Chapter 3

Clustering

3.1 Introduction
Clustering methods divide a collection of observations into subsets or “clusters” so that those within
each cluster are more closely related to one another than objects assigned to different clusters. Group-
ing is done on the basis of similarities or distances (dissimilarities). The inputs required are sim-
ilarity measures or data from which similarities can be computed. The number of groups may be
pre-assigned, or it may be decided by the algorithm. Clustering can be used to discover interesting
groupings in the data, or to verify suitability of predefined classes. There are various examples of both
these applications in the microarray literature (Alon et al., 1999; Eisen et al., 1998; Alizadeh et al.,
2000; Yeung et al., 2001a; Hamadeh et al., 2002). Clustering is an unsupervised method, however,
sometimes clustering is even used to “classify” (Sørlie et al., 2001). Ripley (1996) warns of the dan-
gers of using clustering to classify observations as the unsupervised classes may reflect a completely
different classification of the data.
Even without a precise notion of a natural grouping, we are often able to cluster objects in two- or three-dimensional scatterplots by eye. To take advantage of the mind’s ability to group similar objects and to pick out outlying observations, we can make use of several visualisation methods such as multi-dimensional scaling and the projection pursuit methods. In fact, Venables and Ripley (2002) observe that we should not assume that ‘clustering’ methods are the best way of discovering interesting groupings in the data. They state that in their experience the visualisation methods are far
more effective. Kaufman and Rousseeuw (1990) provide a good introduction to cluster analysis and
their methods are available in the cluster package in R.

3.2 Dissimilarity
Clustering algorithms typically operate on either of two structures. The first of these structures is
simply the n × p data matrix:

    x11  · · ·  x1p
     ⋮            ⋮
    xn1  · · ·  xnp
Many clustering methods are based on a measure of similarity (or dissimilarity). A second structure
which clustering algorithms may accept is a table consisting of pairwise proximities for the objects:

    d(1, 1)  d(1, 2)  d(1, 3)  · · ·  d(1, n)
    d(2, 1)  d(2, 2)  d(2, 3)  · · ·  d(2, n)
       ⋮                                  ⋮
    d(n, 1)  d(n, 2)  d(n, 3)  · · ·  d(n, n)

where d(i, j) measures the dissimilarity between items i and j. The within-cluster dissimilarity can
form the basis for a loss function that the clustering algorithm seeks to minimise. However, all of
the clustering methods are just algorithms and are not guaranteed to find the global optimum. Dis-
similarities are computed in several ways that are dependent on the type of the original variables.
Dissimilarities often only satisfy the first three of the following axioms of a metric:

Dissimilarity metric axioms

1. d(i, i) = 0

2. d(i, j) ≥ 0

3. d(i, j) = d(j, i)

4. d(i, j) ≤ d(i, h) + d(h, j)

One can further distinguish between metric and ultrametric dissimilarities. The former satisfy the fourth axiom above, while ultrametric dissimilarities satisfy d(i, j) ≤ max{d(i, k), d(j, k)}.
Ultrametric dissimilarities have the appealing property that they can be represented by a dendrogram.
Dendrograms will be discussed when we look at hierarchical clustering algorithms.

Common measures for continuous data include the Euclidean distance and the Manhattan (L1) distance. Consider two observations x and y involving measurements on p variables. These distance measures between the p-dimensional observations are defined as follows:

Euclidean distance

    d(i, j) = √( Σ_{k=1}^{p} (xik − yjk)² )

Manhattan distance

    d(i, j) = Σ_{k=1}^{p} |xik − yjk|

For continuous variables the issue of standardisation arises. The choice of measurement units
has a strong effect on the clustering structure. According to Kaufman and Rousseeuw a change in

the measurement units may even lead to a very different clustering structure. The variable with the
largest dispersion will have the largest effect on the clustering. Standardising the variables makes
the clustering less dependent on the measurement units of the variables. Standardisation may not be
beneficial in all clustering applications. The variables may for example sometimes have an absolute
meaning that will be distorted by standardisation. Standardisation may also lead to an unwanted
dampening of the clustering structure.

Standardised measurements

    zik = (xik − mk) / sk

In the above standardisation mk and sk represent measures of location (e.g. the sample mean) and dispersion (e.g. the sample standard deviation) for the kth variable. The disadvantage of using the standard deviation is its sensitivity to outliers. More robust measures, such as the median absolute deviation (MAD) sk = median_{i} |xik − mk|, may be more suitable for noisy datasets. The choice of dissimilarity is best motivated by subject-matter considerations. Chipman et al. (2003) describe the “1-correlation” distance, a popular choice for microarray data. This dissimilarity can be related to the more familiar Euclidean distance. Changes in the average measurement level or range of measurement from one sample to the next are effectively removed by this dissimilarity.

“1-correlation” distance

d(i, j) = 1 − rxy
where rxy is the sample correlation between the observations x and y. If

    x̃ = (x − x̄) / √( Σ_{i}(xi − x̄)²/p )   and   ỹ = (y − ȳ) / √( Σ_{i}(yi − ȳ)²/p )

then

    ‖x̃ − ỹ‖² = 2p(1 − rxy).

That is, the squared Euclidean distance between the standardised objects is proportional to the “1-correlation” distance between the original objects.
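In R these dissimilarities can be computed as follows; the use of the (standardised) iris measurements is purely illustrative.

X <- scale(iris[, 1:4])                        # standardised measurements

d.euclidean <- dist(X, method = "euclidean")
d.manhattan <- dist(X, method = "manhattan")

## "1-correlation" distance between observations (rows of X)
d.cor <- as.dist(1 - cor(t(X)))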

In this course we only consider continuous variables. Dissimilarities for other types of variables
are discussed in Kaufman and Rousseeuw. In the following section we will discuss a few well-known
clustering methods.

3.3 Clustering Methods


3.3.1 Hierarchical Methods
Hierarchical clustering techniques proceed by either a series of successive mergers or a series of suc-
cessive divisions. Hierarchical methods result in a nested sequence of clusters which can be graphically
represented with a tree, called a dendrogram. We can distinguish between two main types of hierar-
chical clustering algorithms: agglomerative algorithms and divisive algorithms.

Agglomerative algorithms (Bottom-up methods)

Agglomerative methods start with the individual objects. This means that they initially consider each object as belonging to a different cluster. The most similar objects are grouped first, and these initial groups are then merged according to their similarities. Eventually, all subgroups are fused into a single cluster. In order to merge clusters we need to define the similarity between a merged cluster and all other clusters. There are several methods to measure the similarity between clusters; these methods are sometimes referred to as linkage methods, and there does not seem to be consensus as to which is best. Below we define several widely used rules for joining clusters. In the lecture we will explain each method using a simple example.
The function hclust in R implements these approaches through its method argument. The function agnes in the R package cluster implements the agglomerative hierarchical algorithm of Kaufman and Rousseeuw with the same linkage methods. Eisen et al. (1998) is a well-known application of mean-linkage agglomerative hierarchical clustering to gene expression data. Their Cluster software implements their algorithm.
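The sketch below illustrates both functions on the (standardised) iris measurements; the choice of data and of average linkage is ours and purely illustrative.

library(cluster)

d <- dist(scale(iris[, 1:4]))                  # Euclidean dissimilarities
hc <- hclust(d, method = "average")            # also "single", "complete", ...
plot(hc, labels = FALSE)                       # dendrogram

ag <- agnes(d, method = "average")             # AGNES of Kaufman and Rousseeuw
plot(ag, which.plots = 2)                      # 2 = dendrogram

## Cut the tree into three clusters and compare with the species labels
table(cutree(hc, k = 3), iris$Species)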

Divisive algorithms (Top-down methods)

Divisive hierarchical methods work in the opposite direction. They are not as well known and not used
as much as agglomerative methods. An advantage of these methods over the previous ones is that
they can give better reflections of the structure near the top of the tree. Agglomerative methods can
perform poorly in reflecting this part of the tree because many joins have been made at this stage. Each
join depends on all previous joins, so some questionable early joins cannot be later undone. When
division into a few large clusters is of interest the divisive methods can be a more attractive option,
however, these methods are less likely to give good groupings after many splits have been made.
There are several variations on divisive hierarchical clustering, each offering a different approach
to the combinatorial problem of subdividing a group of objects into two subgroups. Unlike the ag-
glomerative methods, where the best join can be identified at each step, the best partition cannot
usually be found and methods attempt to find a local optimum. One such method is the algorithm
developed by Macnaughton-Smith et al. (1964). In this algorithm we select the single example whose average dissimilarity to the remaining ones is greatest, and transfer it to a second cluster (say B). For all the remaining examples of cluster A we compare the average dissimilarity to B with that to the remainder of A. If any examples in A are on average nearer to B, we transfer to B the one for which the difference in average dissimilarity is greatest. If there are no such examples we stop. This process splits a single cluster. We can subsequently split each cluster that is created, and repeat the process as far as required. The R function diana from the cluster package performs divisive clustering.
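A minimal sketch of a divisive analysis with diana, again using the iris measurements as stand-in data:

library(cluster)

dv <- diana(dist(scale(iris[, 1:4])))          # divisive hierarchical clustering
plot(dv, which.plots = 2)                      # dendrogram
table(cutree(as.hclust(dv), k = 3), iris$Species)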

Linkage methods

a) Single linkage. Clusters are joined based on the minimum distance between items in different
clusters. This method will tend to produce long and loosely connected clusters.

b) Complete linkage. Clusters are joined based on the maximum distance between items in differ-
ent clusters. This means that two clusters will be joined only if all members in one cluster are
close to the other cluster. This method will tend to produce “compact” clusters so that relatively
similar objects can remain separated up to quite high levels in the dendrogram.

c) Average linkage. Clusters are joined based on the average of all distances between the points in
the two clusters.
Example 1: Agglomerative hierarchical clustering of prostate cancer expression data

Figure 3.1: Agglomerative hierarchical clustering of the prostate cancer gene expression data of Singh
et al. (2002) using different links
Singh et al. (2002) make use of the Affymetrix U95Av2 chip set and derive expression profiles on 52 prostate tumours and on 50 nontumour (normal) prostate samples. Each array contains approximately 12 600 genes and ESTs. The overall aim is to predict the clinical behaviour of prostate cancer based upon gene expression analysis of primary tumours; here we try to discriminate between the normal and tumour samples based on their expression profiles. In figure 3.1 we produce the dendrogram of an agglomerative hierarchical clustering that we perform using the AGNES algorithm proposed by Kaufman and Rousseeuw (1990). This algorithm is implemented in the function agnes in the R package cluster. We perform the clustering using both complete and average link functions. Typically, clustering is performed after a filtering step that aims to identify the genes that discriminate best between the different classes. In this application we use the 100 genes with the largest absolute t-values between the two classes. However, it must be noted that the filtering method and the number of genes included in the clustering will both affect the performance of the clustering algorithm and potentially bias the final result. In figure 3.2 we repeat the clustering
based on the top 100 genes and all genes, respectively. We see that there is a reasonable difference in the separation achieved by these two sets of genes. Note that we can also perform an agglomerative hierarchical clustering in R using the function hclust.
Figure 3.2: Agglomerative hierarchical clustering of the prostate cancer gene expression data of Singh
et al. (2002) using the same link (average) and different numbers of genes
Example 2: Heat maps of two gene expression datasets
Heat maps are graphical tools that have become popular in the gene expression literature. Similarities between the expression levels of genes and samples are of interest. These representations are most effective if the rows and columns are ordered; hierarchical clustering is typically used to order the genes and samples according to similarities in their expression patterns. Figure 3.3 is such a representation for the prostate cancer data of Singh et al. (2002) that we clustered in the previous example. This heat map was constructed in R using the function heatmap. There is a clear separation between the normal and tumour samples, and a small number of groups of co-regulated genes appear to be responsible for the good discrimination between the groups. For illustration we only select a subset of the 50 most discriminating genes here, in terms of the absolute values of their t-statistics.
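A heat map like figure 3.3 can be drawn with the built-in function heatmap, as sketched below. The object expr is assumed to be a genes × samples expression matrix (the Singh et al. data are not distributed with these notes), so the call is a template rather than reproducible code.

## expr: matrix of expression values, rows = genes, columns = samples
heatmap(expr,
        distfun   = dist,                      # dissimilarity used for the clustering
        hclustfun = hclust,                    # agglomerative hierarchical clustering
        scale     = "row",                     # standardise each gene
        col       = heat.colors(64))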
Figure 3.3: Heat map of prostate cancer gene expression data (Singh et al., 2002)

Example 3: Divisive hierarchical clustering of trilobite data


Figure 3.4: Divisive hierarchical clustering of the trilobite data

In figure 3.4 we present the dendrogram of a divisive hierarchical clustering performed using the algorithm of Kaufman and Rousseeuw (1990) implemented in the function diana included in the cluster package. We see that the clustering shows a reasonable separation between groups 1 and 3, but there appears to be some overlap between groups 1 and 2.


Figure 3.5: Estimated density of the petal width for all species confounded (above) and by species
(below)

3.3.2 Gaussian mixture models

The clustering algorithms presented so far were all based on heuristic concepts of proximity. In this
chapter we will examine Gaussian mixture models, a model-based approach to clustering. The idea
of model-based clustering is that the data is an independent sample from different group populations,
but the group labels have been “lost”.

An introductory example

Consider the measurements of the petal width in the iris dataset. Figure 3.5 shows a plot of the
(estimated) densities of the petal width for the three different species and a plot of the density for
the whole dataset. The plots suggest that it is reasonable to assume that the petal width in each
group is from a Gaussian distribution. Denote with Xi the petal width and with Si the species name
of the i-th flower. Then Xi is (assumed to be) Gaussian in each group and the mean and the variance
can be estimated by the empirical mean and variance in the different populations:
(Xi | Si = se) ∼ N(µse, σse²)   for iris setosa       (estimates: µ̂se = 0.246, σ̂se² = 0.01133)
(Xi | Si = ve) ∼ N(µve, σve²)   for iris versicolor   (estimates: µ̂ve = 1.326, σ̂ve² = 0.03990)
(Xi | Si = vi) ∼ N(µvi, σvi²)   for iris virginica    (estimates: µ̂vi = 2.026, σ̂vi² = 0.07697)
The density of X (species confounded) is then

f(x) = πse fµse,σse²(x) + πve fµve,σve²(x) + πvi fµvi,σvi²(x)

The density of X is thus a mixture of three normal distributions. The πse , πve and πvi denote
the probability that a randomly picked iris is of the corresponding species. These probabilities are

sometimes referred to as mixing weights.


Our objective is however clustering, not classification. Thus we assume the species labels were lost
and we only know the petal widths. Nonetheless, we can stick to our idea that the density of X is a
mixture of Gaussians. Thus we assume

f(x) = π1 fµ1,σ1²(x) + π2 fµ2,σ2²(x) + π3 fµ3,σ3²(x).

Note that we had to replace the species names with indexes, because we don’t possess the species
information any more. This makes the estimation of the means µ1, µ2, µ3 and variances σ1², σ2², σ3² far
more difficult. The species name Si is now a latent variable, as we are unable to observe it.

The idea behind Gaussian mixture models is like the situation in the example. We assume that
the data comes from K clusters.1 The probability that an observation is from cluster k is πk . We
assume that the data from cluster k is from a (multivariate) Gaussian distribution with mean µk and
covariance matrix Σk . Thus the density of x in cluster k is
fµk,Σk(x) = 1 / √( (2π)^p · |Σk| ) · exp( −½ (x − µk)′ Σk⁻¹ (x − µk) )

Often one assumes that all clusters have the same orientation and spread, i.e. Σk = Σ, or that all clusters are spherical, i.e. Σ = σ²I. Another popular assumption is that the covariance matrices are diagonal, i.e. Σk = diag( (σ1^(k))², . . . , (σp^(k))² ). In this case the clusters are ellipses with axes parallel to the coordinate axes. Figure 3.6 visualises these different assumptions.
The density of x (all clusters confounded) is the mixture of the distributions in the clusters:
    f(x) = Σ_{k=1}^{K} πk fµk,Σk(x).

This yields the likelihood


    ℓ(π1, . . . , πK, µ1, . . . , µK, Σ1, . . . , ΣK) = Π_{i=1}^{n} ( Σ_{k=1}^{K} πk fµk,Σk(xi) ).    (3.1)

Gaussian mixture models are an elegant model for clustering, however they suffer from one draw-
back. Fitting Gaussian mixture models is rather difficult. The density of the mixture distribution
involves a sum, and the derivative of the logarithm of the likelihood (3.1) does not lead to the usual
simplifications. The most popular method for fitting Gaussian mixture models is the expectation-
maximisation (EM) algorithm of Dempster et al. (1977). The basic idea of the EM algorithm is to
introduce a latent (hidden) variable Si , such that its knowledge would simplify the maximisation. In
our case Si indicates — like in the example — to which cluster the i-th observation belongs.
The EM algorithm now consists of the following iterative process. For h = 1, 2, . . . iterate the E-Step
and the M-Step:
1 This parameter needs to be fixed in advance.

Figure 3.6: Examples for the different assumptions on the covariance structure: different covariances; different but diagonal covariances; identical covariances; identical and spherical covariances

Recall: Multivariate Gaussian distribution

A p-dimensional random variable

    x = (x1, . . . , xp)

is from a multivariate Gaussian distribution with mean vector

    µ = (µ1, . . . , µp)

and covariance matrix

    Σ = ( σ1²   σ12   · · ·   σ1p
          σ12   σ2²   · · ·   σ2p
           ⋮      ⋮     ⋱      ⋮
          σ1p   σ2p   · · ·   σp² )

iff it has the density

    fµ,Σ(x) = 1 / √( (2π)^p · |Σ| ) · exp( −½ (x − µ)′ Σ⁻¹ (x − µ) ).

Note that the marginal distribution of xj is Gaussian with mean µj and variance σj². The correlation between Xj and Xk is ρjk = σjk / √( σj² σk² ).

If µ = (µ, . . . , µ) and Σ = σ²·I (the diagonal matrix with σ² in every diagonal entry), then x consists of p independent realisations of a N(µ, σ²) distribution.

[Illustration: the density of a bivariate Gaussian with mean µ = (0, 0) and covariance matrix (1 0; 0 1), and the density of a bivariate Gaussian with mean µ = (0, 0) and covariance matrix (2 1; 1 1).]

E-Step  Estimate the distribution of Si given the data and the previous values of the parameters π̂k^(h−1), µ̂k^(h−1) and Σ̂k^(h−1) (k = 1, . . . , K). This yields the “responsibilities” wik^(h):

    wik^(h) := π̂k^(h−1) fµ̂k^(h−1),Σ̂k^(h−1)(xi) / Σ_{ζ=1}^{K} π̂ζ^(h−1) fµ̂ζ^(h−1),Σ̂ζ^(h−1)(xi)

M-Step  Compute the updated π̂k^(h), µ̂k^(h) and Σ̂k^(h) (k = 1, . . . , K) by estimating them from the data, using the vector wk^(h) = (w1k^(h), . . . , wnk^(h)) of responsibilities as weights. This leads to the updated estimators

    π̂k^(h) = (1/n) Σ_{i=1}^{n} wik^(h),

    µ̂k^(h) := ( Σ_{i=1}^{n} wik^(h) xi ) / ( Σ_{i=1}^{n} wik^(h) ),

and

    Σ̂k^(h) := ( Σ_{i=1}^{n} wik^(h) (xi − µ̂k^(h))(xi − µ̂k^(h))′ ) / ( Σ_{i=1}^{n} wik^(h) )

for the covariances Σk (assuming different covariances in the classes).
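The E-step and M-step translate directly into R. The code below is a minimal sketch for a one-dimensional mixture of K = 3 Gaussians fitted to the petal widths of the iris data from the introductory example; the initial values and the fixed number of iterations are ad-hoc choices.

x <- iris$Petal.Width
n <- length(x); K <- 3

## Ad-hoc initial values
pi.hat     <- rep(1 / K, K)
mu.hat     <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
sigma2.hat <- rep(var(x), K)

for (h in 1:100) {
  ## E-step: responsibilities w[i, k]
  w <- sapply(1:K, function(k)
    pi.hat[k] * dnorm(x, mean = mu.hat[k], sd = sqrt(sigma2.hat[k])))
  w <- w / rowSums(w)

  ## M-step: weighted updates of the mixing weights, means and variances
  pi.hat     <- colMeans(w)
  mu.hat     <- colSums(w * x) / colSums(w)
  sigma2.hat <- colSums(w * outer(x, mu.hat, "-")^2) / colSums(w)
}

## Hard assignment: the cluster with the largest responsibility
table(max.col(w), iris$Species)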

Some mathematical details

The EM algorithm focuses on the likelihood that is obtained when assuming the class labels Si are
known. In this case the likelihood is
    Π_{i=1}^{n} πSi fµSi,ΣSi(xi)

Note that the sum has now disappeared. This is due to the fact that we know the class labels now, so
we just multiply the densities of the distribution in the corresponding clusters instead of multiplying
the unpleasant mixture density. For simplification of the notation denote with θ the whole parameter
vector, i.e.
θ := (π1 , . . . , πK , µ1 , . . . , µK , Σ1 , . . . , ΣK )
The log-likelihood is thus
    L(θ) = Σ_{i=1}^{n} ( log πSi + log fµSi,ΣSi(xi) )    (3.2)

In reality however, the Si are unknown. So the EM-algorithm uses the trick of taking the (conditional)
expectation of the likelihood.
In the E-step we have to compute the conditional expectation of the log-likelihood (3.2) given the data, using the previous values of the parameters θ̂^(h−1). Thus we have to compute the conditional

distribution of Si (using the Bayes formula), yielding the responsibilities

    wik^(h) := P_θ̂^(h−1)(Si = k) = π̂k^(h−1) fµ̂k^(h−1),Σ̂k^(h−1)(xi) / Σ_{ζ=1}^{K} π̂ζ^(h−1) fµ̂ζ^(h−1),Σ̂ζ^(h−1)(xi)

The conditional expectation is thus

    E_θ̂^(h−1)( L(θ) | x1, . . . , xn ) = Σ_{i=1}^{n} Σ_{k=1}^{K} ( wik^(h) log πk + wik^(h) log fµk,Σk(xi) ).    (3.3)

In the M-step we have to find θ̂^(h) by maximising the conditional expectation (3.3) over θ. One can easily see that the updated parameter estimates π̂k^(h), µ̂k^(h) and Σ̂k^(h) maximise (3.3).

Figure 3.7 visualises the flow of the algorithm for an easy toy example. Despite being theoretically elegant, the EM algorithm has major drawbacks: it is very slow, as the convergence is only linear. Furthermore, it only converges to a stationary point of the log-likelihood (practically speaking, a local maximum). The EM algorithm is very sensitive to the initial values π̂k^(0), µ̂k^(0) and Σ̂k^(0) (k = 1, . . . , K); it is thus essential to choose appropriate initial values. In practice, it is often more robust and faster to maximise the likelihood (3.1) directly using an appropriate optimisation technique like a (trust region) Newton method or a (line search) quasi-Newton approach.
There is another class of models which is often confused with mixture models: likelihood-based partitioning methods assume as well that the data come from different clusters, in which the data are Gaussian with mean µk and covariance matrix Σk.2 The partitioning Ω = Ω1 ∪ . . . ∪ ΩK is however not modelled using a latent variable, but is seen as a model parameter. If one assumes that all clusters have identical and spherical covariance matrices, this approach leads to the k-means algorithm presented in the next section.

3.3.3 k-Means
In Gaussian mixture models the means µ̂k were updated by weighted averages of the xi . The weights
wik were the conditional probabilities that the i-th observation was from cluster k (“responsibilities”).
This probability is typically high for observations that are close to the cluster centre µ̂k . This idea can
be used to construct a simple, yet powerful clustering algorithm. The k-means algorithm (McQueen,
1967) can be seen as a simplification of the EM-algorithm for fitting Gaussian mixture models. Instead
of calculating the responsibilities, it simply assigns each observation to the closest cluster, i.e. the cluster with the closest centre µ̂k, which is then updated by taking the arithmetic mean of all observations
in the cluster.
Using suitably chosen initial values µ̂1^(0), . . . , µ̂K^(0), the k-means algorithm thus iterates the following two steps until convergence (h = 1, 2, . . .):
i. Determine for each observation xi the cluster centre µ̂k^(h−1) that is closest and allocate the i-th observation to this cluster.
2 As for mixture models, it is possible to use other probability distributions as well.


Figure 3.7: First 6 iterations of the EM algorithm for fitting a Gaussian mixture model to a toy ex-
ample. The colouring of the data points corresponds to the responsibilities, the dotted lines show the
isodensities of the Gaussians, and the arrow indicates where the means of the clusters are moved.

ii. Update the class centres by computing the mean in each cluster:

    µ̂k^(h) := (1 / size of cluster k) · Σ_{xi in cluster k} xi

Figure 3.8 visualises the flow of the k-means algorithm for an easy example.
The k-means algorithm determines the clusters and updates the means in a “hard” way. The EM algorithm computed the (conditional) probabilities that an observation is from a given class (“soft partitioning”);3 the k-means algorithm simply assigns each observation to the closest cluster (“hard partitioning”). The partitions created by the k-means algorithm are called Voronoi regions.
The k-means algorithm minimises the (trace of the) within-class variance. Thus the sum of squares, which is n times the trace of the within-class variance,

    Σ_{k=1}^{K} Σ_{xi in cluster k} ‖xi − µ̂k‖²    (3.4)

can be used as a criterion to determine the goodness of a partitioning.


Bottou and Bengio (1995) show that the k-means algorithm is a gradient descent algorithm and
that the update of the cluster centres (step ii.) is equivalent to a Newton step. So one can expect the
k-means algorithm to converge a lot faster than the EM-algorithm for mixture models. Thus it is often
used to find the starting values for the latter.
As the k-means algorithm is based on Euclidean distance, it is essential that the data is appropri-
ately scaled before the algorithm is used.
The k-means algorithm depends highly on the initial values chosen. Thus it is essential to run the algorithm using different starting values and to compare the results. Many implementations use randomly picked observations as starting values, so all you have to do is run the k-means clustering several times (without resetting the random number generator). If the algorithm always leads to (clearly) different partitionings, then you have to assume that the dataset is not clustered at all! This is even more the case if the different partitionings always have approximately the same goodness criterion (3.4).4 The dependence of the k-means algorithm on the initial values makes it very easy to manipulate its results. When started often enough with different starting values that are close to the desired solution, the algorithm will eventually converge to the desired solution and thus “prove” the desired hypothesis.
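In R the function kmeans implements the algorithm. The sketch below uses the scaled iris measurements and several random starts, as recommended above; the plot corresponds to the kind of display discussed in the next paragraph.

X <- scale(iris[, 1:4])

km <- kmeans(X, centers = 3, nstart = 25)      # 25 random initialisations
km$tot.withinss                                # sum of squares criterion (3.4)
table(km$cluster, iris$Species)

## Observations and cluster centres in the plane of the first two principal components
pca <- prcomp(X)
plot(pca$x[, 1:2], col = km$cluster)
points(predict(pca, km$centers)[, 1:2], pch = 4, cex = 2)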
One way of analysing the solution of the k-means algorithm is to look at a couple of plots. One possibility is plotting the cluster centres in the plane of the first two principal components and colouring the observations according to their clusters. A parallel coordinate plot comparing the cluster centres and clusterwise parallel coordinate plots can be helpful as well. Figures 3.9 to 3.11 show the corresponding plots for the result obtained by applying the k-means algorithm to the European economies dataset (K = 5 clusters). The clustering corresponds to what one might have expected: Luxembourg’s banking sector makes it stand out, and Romania and Bulgaria, which are not yet part of the EU, have an economy still highly dependent on agriculture. The composition of the other three clusters seems
3 Thus the EM algorithm for fitting Gaussian mixture models is sometimes referred to as “soft k-means”.
4 It is possible that the algorithm sometimes gets “stuck” in a suboptimal solution, so even if there is a clear clustering it might happen that the k-means algorithm does not find it. In this case the sum of squares of the “good” solution should be clearly less than the sum of squares of the “bad” solutions.


Figure 3.8: First 4 iterations of the k-means algorithm applied to a toy example. The colouring of the
data points corresponds to the cluster allocation, the “x” designates the cluster centres and the arrow
indicates where they are moved. The grey lines are the borders of the corresponding Voronoi regions

at first sight reasonable as well: Western European industrial countries, Eastern European countries and the remaining countries. Even though the clusters have a nice interpretation, there is not enough evidence for them in the data — the latter three clusters are not different enough to be
clearly distinguished. When rerunning the k-means algorithm with another initial configuration we
would see that the composition of these three clusters changes all the time.
A “silhouette” plot of the distances between the observations and the corresponding cluster centres
can give some evidence on the goodness of the partitioning. Figure 3.12 shows a silhouette plot for
the European economies dataset.

3.3.4 k-Medoids (Partitioning around medoids)


As stated in the preceding chapter the k-means algorithm is based on the squared Euclidean dissim-
ilarity. This makes the k-means algorithm rather prone to outliers. The k-medoid algorithm (used
with appropriate dissimilarity measures) does not have this flaw. Furthermore, like the hierarchical
clustering algorithms, it can be carried out using any dissimilarity measure dij , where dij is the dis-
similarity between the i-th and the j-th observation. The k-medoids algorithm can thus be seen as
a generalisation of the k-means algorithm. However, one often chooses dij = ‖xi − xj‖. One usually restricts the cluster centres to be one of the observations in each cluster, thus µ̂k = xck for some index ck.
The k-medoids algorithm then consists of the iteration of the following two steps until the cluster
assignments do not change any more.

i. For all observations determine the cluster centre that is closest, i.e. assign xi to the class k whose
centre µ̂k = xck is closest, i.e. has the lowest dissimilarity dick .

ii. For the k-th cluster (k = 1, . . . , K) determine the observation (from cluster k) that minimises the
sum of dissimilarities to the other observations, i.e. choose the index ck such that

        Σ_{xi in cluster k} dick

    is minimal.

The k-medoids algorithm is usually initialised at random, i.e. the indices c1, . . . , cK of the cluster centres are chosen at random from {1, . . . , n} (without replacement).
In analogy to the k-means algorithm, the above k-medoids algorithm seeks to minimise

    Σ_{k=1}^{K} Σ_{xi in cluster k} dick    (3.5)

However, the above ad-hoc algorithm gives no guarantee of converging to this minimum (nor does the
k-means algorithm). Kaufman and Rousseeuw (1990) and Massart et al. (1983) propose alternative
and less heuristic algorithms to minimise (3.5). The k-means algorithm can be seen as a more efficient implementation of this algorithm using the dissimilarity dij = ‖xi − xj‖² and without the restriction on the cluster centres.
The solutions of the k-medoids algorithm should be handled with the same care (and scepticism) as the solutions of the k-means algorithm. Thus it is advisable to run the algorithm using different starting configurations and to compare the results obtained.
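The function pam in the cluster package implements partitioning around medoids following Kaufman and Rousseeuw (1990); it accepts either a data matrix or a dissimilarity object. The sketch below uses Manhattan dissimilarities of the scaled iris measurements, an arbitrary illustrative choice.

library(cluster)

d  <- dist(scale(iris[, 1:4]), method = "manhattan")
pm <- pam(d, k = 3)

pm$id.med                                      # indices of the observations used as medoids
table(pm$clustering, iris$Species)
plot(pm)                                       # includes a silhouette plot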


Figure 3.9: Plot of the cluster structure in the first two principal components of the European economies
dataset. Note that the first two principal components only account for 41% of the variance.

Figure 3.10: Parallel coordinate plot of the clusters found in the European economies dataset. The solid lines correspond to the cluster centres, whereas the dotted lines correspond to the minimal/maximal values in the clusters.
Figure 3.11: Clusterwise parallel coordinate plot for the European economies dataset.


Figure 3.12: Silhouette plot of the distances to the cluster centres for the European economies dataset

Figure 3.13: Illustration of block vector quantisation: Each block of 3 × 3 pixels is regarded as one
observation with 27 variables (3 rows × 3 columns × 3 colour channels (red, green, blue))

3.3.5 Vector quantisation


Vector quantisation is a tool in (lossy) signal compression. A sophisticated version of vector quantisa-
tion is e.g. part of the MPEG-4 audio and video compression standard. The idea of vector quantisation
is to replace all signals by a small number of “representative signals”, the prototypes (codewords). So instead of saving the multidimensional signal (e.g. a block of pixels in an image) itself, we just have to store the index of the corresponding prototype in the codebook. This requires far less space. Note that this is a lossy compression technique, as we are not able to reconstruct the original signal exactly. However, we hope that the prototypes are representative and thus the quality does not degrade too much.
The k-means algorithm5 can now be used to find the codebook. One simply uses the cluster centres
µ̂k as the prototypes. The idea is that the cluster centres are good representatives of the clusters. So
all signals (i.e. data points) in the Voronoi region corresponding to one cluster are represented by the
same prototype. The number of clusters controls how much the signals are compressed as we need
log2 (k) bits to save the index of the prototype.
We will illustrate vector quantisation with an example image of a sheep. We will apply a 3 × 3 block
vector quantisation, i.e. we see each block of 3 × 3 pixels as one observation, leading to 27 variables;
figure 3.13 illustrates this. Figure 3.14(a) shows the original image; figures 3.14(b) to 3.14(d) show
the results for different codebook lengths. Already the first compressed image leads to a file size of 88
kB (including the codebook), thus only 6.3% of the size of the original uncompressed image.
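As an illustration of the idea (not the code used to produce figure 3.14), the R sketch below performs 3 × 3 block vector quantisation on a generic colour image stored as an array img of dimension height × width × 3 with values in [0, 1]; the object name, the codebook length and the assumption that the image dimensions are multiples of 3 are ours.

   block.size <- 3
   k <- 128                                  # codebook length

   h <- dim(img)[1]; w <- dim(img)[2]        # assumed to be multiples of 3
   ## unfold each 3 x 3 block (3 colour channels) into a row vector of length 27
   blocks <- matrix(NA, nrow = (h / block.size) * (w / block.size), ncol = 27)
   idx <- 1
   for (i in seq(1, h, by = block.size))
     for (j in seq(1, w, by = block.size)) {
       blocks[idx, ] <- as.vector(img[i:(i + 2), j:(j + 2), ])
       idx <- idx + 1
     }

   ## k-means provides the codebook: the cluster centres are the prototypes
   vq <- kmeans(blocks, centers = k, iter.max = 50)
   codebook <- vq$centers                    # k prototypes of length 27
   reconstructed <- codebook[vq$cluster, ]   # every block replaced by its prototype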

3.3.6 Self-organising maps (SOM)


The clustering techniques studied so far lead to clusters that are topologically unrelated; clusters that
are for example close to each other in the dendrogram are not necessarily more similar than other
clusters. Self-organising maps (Kohonen, 1995) is a clustering algorithm that leads to clusters that
5
The k-means algorithm is sometimes called Lloyd’s algorithm in this context.
Figure 3.14: Original image and results of 3 × 3 block vector quantisation for different codebook lengths: (a) original image (24 bits/pixel, uncompressed size 1,402 kB); (b) codebook length 1024 (1.11 bits/pixel, total size 88 kB); (c) codebook length 128 (0.78 bits/pixel, total size 50 kB); (d) codebook length 16 (0.44 bits/pixel, total size 27 kB).
Figure 3.15: SOM map of the European economies dataset

have a topological similarity structure. It assumes that the clusters lie on a rectangular (or hexagonal)
grid; on this grid, neighbouring clusters are more similar than distant ones.
Like in k-means clustering an observation is assigned to the cluster whose “centre” is closest to
it. In k-means clustering an observation only influences the closest cluster centre. In self-organising
maps an observation influences neighbouring clusters as well; however, the influence on more distant
clusters is smaller. Kohonen proposed an online algorithm, in which the observations xi are presented
repeatedly and one after another. For each observation xi presented the algorithm updates all cluster
centres µ̂k (k = 1, . . . K), which are called representatives in this context, by setting

µ̂k ← µ̂k + αt · h(dik )(xi − µ̂k ).

dik is hereby the distance (on the grid, not in the observation space) between the cluster containing
the i-th observation and the k-th cluster. h : [0, ∞) → [0, ∞) is a decreasing function like exp(−d²/2)
(“Gaussian neighbourhood”). If one sets h(d) := 1_{0}(d) (the indicator of d = 0), then only the cluster
containing the i-th observation (and no other cluster) gets updated; in this case the SOM algorithm is
a very inefficient variant of the k-means algorithm.
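A bare-bones sketch of this online update for a one-dimensional grid is given below; the function and object names are ours and the learning-rate schedule is only illustrative. In practice one would rather use a dedicated implementation, e.g. the SOM function in the class package.

   h <- function(d) exp(-d^2 / 2)            # "Gaussian neighbourhood" on the grid

   som.online <- function(X, M, n.iter = 100, alpha = 0.05) {
     K <- nrow(M)                            # M: K representatives (rows), one per cluster
     grid <- 1:K                             # cluster positions on a one-dimensional grid
     for (t in 1:n.iter) {
       a <- alpha * (n.iter - t + 1) / n.iter      # decreasing learning rate alpha_t
       for (i in sample(nrow(X))) {          # present the observations one after another
         winner <- which.min(colSums((t(M) - X[i, ])^2))   # closest representative
         d <- abs(grid - grid[winner])       # distance on the grid, not in feature space
         M <- M + (a * h(d)) * (matrix(X[i, ], K, ncol(M), byrow = TRUE) - M)
       }
     }
     M
   }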
The SOM algorithm is highly dependent on the initial configuration — to an even greater extent
than the k-means algorithm. When running the SOM algorithm with randomly picked initial cluster
centres, one generally never gets the same result twice. Therefore, the SOM algorithm is usually initialised by
putting the cluster centres on a grid in the plane of the first two principal components. So the SOM
solution is highly dependent on the MDS solution (see section 3.5).
Note that we have to choose the number of clusters and the topology beforehand. One usually
chooses a higher number of clusters than one would choose for the k-means algorithm. The reason for
this is the topological constraint on the clusters, which makes them less flexible. It is even common that
the SOM algorithm leads to one or more empty clusters.
Figures 3.15 and 3.16 show the results obtained when applying the SOM algorithm to the European
economies dataset. Figure 3.16 shows the grid of clusters in the principal component plane after the
initialisation, and after 1, 10 and 100 iterations. Figure 3.15 gives the resulting self-organising map.
Figure 3.16: Plot of the grid of the SOM for the European economies dataset after initialisation and
after 1, 10, and 100 iterations.

3.4 Comments on clustering in a microarray context


3.4.1 Two-way clustering
In microarray studies samples can be clustered independently of genes, and vice versa. Some cluster-
ing applications use both samples and genes simultaneously to extract joint information about both
of them. This approach has been motivated by the fact that the overall clustering of samples may be
dominated by a set of genes involved in a particular cellular process, which may mask another inter-
esting group supported by another group of genes. For this reason, two-way clustering aims to extract
joint information on the samples and genes.
One approach to the above clustering problem is to systematically uncover multiple sample clus-
terings based on different sets of genes. Getz et al. (2000) propose such an approach. Their method
aims to find subsets of genes that result in stable clusterings of a subset of samples. They find pairs
(Ok , Fk ), k = 1, 2, . . . , K where Ok is a subset of genes and Fk is a subset of the samples, so that when
genes Ok are used to cluster samples Fk , the clustering yields stable and significant partitions. This
will be useful when the overall clustering of samples based on all genes is dominated by some subset
of genes, for example genes related to metabolism. The above authors do appear to be reasonably suc-
cessful in their stated goal – to find stable clustering pairs (Ok , Fk ) – however, it is difficult to assess
the biological relevance of their results and to compare them to more standard hierarchical clustering
approaches.
Block clustering, due to Hartigan (1972) and named by Duffy and Quiroz (1991), is another ap-
proach for obtaining a divisive hierarchical row-and-column clustering of a data matrix. This approach
reorders the rows (genes) and columns (samples) to produce homogeneous blocks of gene expression.
Block clustering only provides one reorganisation of the samples and genes, and cannot discover mul-
tiple groupings. In some sense, the plaid model (Lazzeroni and Owen, 1992) is a further extension of
the ideas behind block clustering and gene shaving.

3.4.2 Discussion: Clustering gene expression data


Clustering as an exploratory tool

Cluster analysis has caught on as a technique for revealing common patterns of gene expression across
different samples in the microarray literature. The idea is that genes with similar profiles of activity
will have related functions. However, clustering is often misused and the results over-interpreted.
Some statisticians argue that clustering is best suited to determining relationships between a small
number of variables, rather than deriving patterns involving thousands of genes across a very large
dataset. As we have mentioned before, certain hierarchical methods can be unreliable for obtaining
good high-level clusters, for example. Some people do not even regard clustering as statistics, but
rather regard it as an arbitrary manipulation of data that is not principled.
Biologists can be sent down blind alleys if they see clusters that reinforce their own assumptions
about the relationships between individual genes. F. H. C. Marriott made the following comment in
the context of cluster analysis on p.89 of Marriott (1974):

“If the results disagree with informed opinion, do not admit a simple logical interpretation,
and do not show up clearly in a graphical presentation, they are probably wrong. There
is no magic about numerical methods, and many ways in which they can break down.

They are a valuable aid to the interpretation of data, not sausage machines automatically
transforming bodies of numbers into packages of scientific fact.”

Caution should be exercised when interpreting and assessing clustering results. As an example
we consider a study by Alizadeh et al. (2000). They used hierarchical clustering to group B-cell
lymphomas into two distinct classes. These correlated with differences in patient survival, and seemed
from their gene expression profiles to be related to the stage of development of the cell from which the
cancer originated. However, another group, using a different clustering method, later failed to find
this association (Shipp et al., 2002). Choosing a clustering method is still very much dependent on
the data, the goals, and the investigator’s personal preferences. This often leads to a lot of subjective
judgement when applying clustering in a microarray context. In many cases where you have a good
idea what you are looking for you will probably find it!
The primary goals of clustering should be data exploration, pattern-discovery and dimension re-
duction. Clustering should not be used to classify. If the intent is to distinguish between currently
defined sample classes (e.g. tumour and normal tissues), the supervised methods that we will discuss
in later lectures should be used. Unsupervised groupings may reflect a completely different classifica-
tion of the data as mentioned on page 288 of Ripley (1996).

Evaluating clustering performance


A major problem encountered when using cluster analysis is the difficulty in interpreting and evaluat-
ing the reliability of the clustering results and various clustering algorithms. The boundaries between
clusters are often very noisy and many clusters can arise by chance. Most clustering algorithms do not
formally account for sources of error and variation. This means that these clustering methods will be
sensitive to outliers. Therefore, assessing the clustering results and interpreting the clusters found are
as important as generating the clusters. In the absence of a well-defined statistical model, it is often
difficult to define what a good clustering algorithm or the correct number of clusters is.
In hierarchical clustering there is no provision for a reallocation of objects that may have been
“incorrectly” grouped at an early stage. If clustering is chosen as the method of choice it is advisable
to examine the final configuration of clusters to see if it is sensible. For a particular application, it is a
good idea to try several clustering methods, and within a given method, a couple of different ways of
assigning similarities. If the outcomes from several methods are roughly consistent with one another
it perhaps strengthens the case for a “natural” grouping.
The stability of a hierarchical solution can sometimes be checked by applying the clustering algorithm
before and after small errors (perturbations) have been added to the data. If the groups are
fairly well distinguished, the clusterings before and after perturbation should agree. Kerr and Churchill
(2001), for example, propose the use of bootstrapping to assess the stability of results from cluster
analysis in a microarray context. This involves sampling the noise distribution (with replacement). They
emphasise that irrespective of the clustering algorithm that is chosen, it is imperative to assess whether
the results are statistically reliable relative to the level of noise in the data. Yeung et al. (2001b) also
propose an approach for validating clustering results from microarray experiments. Their approach
involves leaving out one experiment at a time and then comparing various clustering algorithms using
the left-out experiment. McShane et al. (2002) propose a noise model to assess the reproducibility
of clustering results in the analysis of microarray data.
In most gene expression studies there will typically only be a reasonably small percentage of the
total number of genes (typically thousands) that will show any significant change across treatment

levels or experimental conditions. Many genes will not exhibit interesting changes in expression and
will typically only add noise to the data. A gene-filtering step to extract the “interesting” genes is
often advisable before applying a clustering algorithm, as it may lead to a reduction in noise and
a subsequent improvement in the performance of the clustering algorithm. However, extreme caution
must be exercised, as the results can depend strongly on the set of genes used. For example, use of
all available genes or of genes chosen by some filtering method can produce very different
hierarchical clusterings of the samples.

Choosing the number of clusters


Choosing the number of clusters is crucial for many hierarchical and partitioning methods, especially
in gene expression studies. Some approaches to address this problem have been suggested. Ben-Hur
et al. (2002) use a sampling approach to determine the number of clusters that provides the most
stable solution. Another approach introduces the so-called Gap-statistic (Tibshirani et al., 2001). This
approach compares a measure of within-cluster distances from the cluster centroids to its expecta-
tion under an appropriate reference distribution, and chooses the number of clusters maximising the
difference.
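To illustrate the idea behind the Gap statistic, here is a rough sketch for k-means clustering. It is a simplified version: the reference distribution is taken to be uniform over the range of each variable, no standard-error adjustment is made, and all names (and the numeric data matrix X) are our own assumptions.

   gap.statistic <- function(X, K.max = 10, B = 20) {
     logW <- function(data, k)                       # log within-cluster dispersion
       log(sum(kmeans(data, centers = k, nstart = 10)$withinss))
     gap <- numeric(K.max)
     for (k in 1:K.max) {
       obs <- logW(X, k)                             # observed data
       ref <- replicate(B, {                         # B uniform reference datasets
         Xref <- apply(X, 2, function(x) runif(length(x), min(x), max(x)))
         logW(Xref, k)
       })
       gap[k] <- mean(ref) - obs                     # gap between expected and observed
     }
     gap     # choose the number of clusters with the largest gap
   }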
Some studies propose a mixture model approach (i.e. model-based clustering) for the analysis of
gene expression data (Ghosh and Chinnaiyan, 2002; Yeung et al., 2001a; Barash and Friedman, 2001;
McLachlan et al., 2002). An attractive feature of the mixture-model approach is that it is based on
a statistical model. This provides one with a probabilistic framework for selecting the best clustering
structure and the correct number of clusters, by reducing it to a model selection problem. The above
studies focus on Gaussian mixtures. Yeung et al. (2001a) propose various data transformations in order
to transform the data to be nearly Gaussian. However, it is notoriously difficult to test for multivariate
Gaussian distributions in higher dimensions. Furthermore, the problem of providing protection against
outliers in multivariate data is difficult: it increases in difficulty as the dimension of the data increases.
The fitting of multivariate t distributions, as proposed in McLachlan and Peel (1998) and McLach-
lan and Peel (2000) provides a longer-tailed alternative to the normal distribution and hence added
robustness. For many applied problems (including gene expression studies), the tails of the normal
distribution are often shorter than required. The strongest criticism against the use of model-based
clustering methods for gene expression studies is the strong parametric assumptions these methods
make about the distribution of gene expression data, which may not always be realistic.

3.4.3 Alternatives to clustering


PCA and ICA are often used as alternatives to clustering when the aim is to identify small subsets of
“interesting” genes. As we have discussed in earlier lectures, PCA and ICA differ in the measure of
“interestingness” that they use. Gene shaving is an adaptation of PCA that looks for small subsets of
genes that behave very similarly, and account for large variance across samples. Yeung and Ruzzo
(2001) also propose a PCA-based approach to clustering.
ICA is emerging as another strong candidate to visualise and find structure in expression data.
Liebermeister (2002), Lee and Batzoglou (2003) and Hori et al. (2001) apply ICA to gene expression
data and find that ICA is successful in revealing biologically relevant groups in the data. Liebermeister
motivates the application of ICA to gene expression data, as ICA relies on the idea of combinatorial
control, by describing the expression levels of genes as linear functions of common hidden variables.
The author refers to these hidden variables as “expression modes”. Expression modes may be related to

distinct biological causes of variation, such as responses to experimental treatments. ICA differs from
clustering in that clustering identifies groups of genes responding to different variables, but it does not
represent the combinatorial structure itself. A small number of “effective” regulators, each acting on
a large set of genes and varying between distinct biological situations may describe the co-regulation
of genes. Liebermeister proposes a method for ordering independent components according to their
ability to discriminate between different groups of genes. However, the above study does not address
the important issue of choosing the number of components. Similar to the problem of choosing the
number of clusters in cluster analysis, ICA also suffers from determining the number of independent
components that are “hidden” within the data. Venables and Ripley (2002) make the following remark
on page 314:

“There is a lot of arbitrariness in the use of ICA, in particular in choosing the number of
signals.”

Lee (2003) proposes the SONIC statistic for choosing the number of independent components. SONIC
is based on the idea behind the Gap statistic for estimating the number of clusters in cluster anal-
ysis. SONIC compares the ordered negentropy values of estimated independent components to its
expectation under a null reference distribution of the data.

3.5 Multidimensional scaling


Multidimensional scaling (MDS) is part of a wider class of methods sometimes referred to as distance
methods. These methods attempt to represent a high-dimensional cloud of points in a lower dimen-
sional space (typically 2 or 3 dimensional) by preserving the inter-point distances as well as possible.
Suppose that x1 , . . . , xn are observations in ℜp . The first step is to define a measure of dissimilarity
dij between observations i and j, for example the Euclidean distance kxi − xj k. The function daisy
in the R package cluster provides a general way to compute dissimilarities that extends to variables
that are not measured on an interval scale (e.g. ordinal and binary variables). Multidimensional scaling
seeks a representation z1 , . . . , zn ∈ ℜk so that the distances between the points best match those in
the dissimilarity matrix. Here the best representation is the one which minimises a stress function.
There are different forms of MDS that use different stress functions. Two popular choices of the stress
function are given below.

Stress functions

1. Classical MDS:

   S_classical(d, d̃) = Σ_{i≠j} ( d_ij² − d̃_ij² ) / Σ_{i≠j} d_ij²

2. Sammon’s nonlinear mapping:

   S_Sammon(d, d̃) = ( 1 / Σ_{i≠j} d_ij ) · Σ_{i≠j} ( d_ij − d̃_ij )² / d_ij

3. Kruskal-Shepard nonmetric scaling:

   STRESS² = Σ_{i≠j} ( θ(d_ij) − d̃_ij )² / Σ_{i≠j} d̃_ij²

   (θ denotes an arbitrary monotone increasing transformation of the dissimilarities),

where d represents dissimilarity in the original p-dimensional space and d̃ represents dissimilarity in
the reduced k-dimensional space.

Note that using classical MDS with Euclidean distance is equivalent to plotting the first k principal
components (without rescaling to correlations). Sammon mapping places more emphasis on representing
smaller pairwise distances accurately. In R classical MDS, Sammon’s mapping, and Kruskal-Shepard
nonmetric MDS can be performed using the functions cmdscale, sammon and isoMDS. The last two
functions are available in the MASS package.
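As a brief illustration (with a generic numeric data matrix X standing in for the datasets used below; the scaling step and the target dimension k = 2 are illustrative choices), the three representations can be computed as follows:

   library(MASS)                       # provides sammon() and isoMDS()

   d <- dist(scale(X))                 # Euclidean distances on the scaled variables
   ## sammon() and isoMDS() require all off-diagonal distances to be positive

   z.classical <- cmdscale(d, k = 2)        # classical (metric) scaling
   z.sammon    <- sammon(d, k = 2)$points   # Sammon's nonlinear mapping
   z.kruskal   <- isoMDS(d, k = 2)$points   # Kruskal-Shepard nonmetric scaling

   par(mfrow = c(1, 3))
   plot(z.classical, main = "Metric scaling")
   plot(z.sammon, main = "Sammon mapping")
   plot(z.kruskal, main = "Kruskal's MDS")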

3.5.1 Example 1 : MDS and prostate cancer gene expression data


In figure 3.17 we present three distance-based representations of the prostate cancer data that we
have considered before. The tumour (t) and normal (n) samples seem to be well separated in all
three plots. There appears to be one tumour sample that is more similar to the normal samples and
one normal sample that is more similar to the tumour samples (based on the 200 genes we used).
In later lectures we shall return to this data to see how well some classifiers fare in distinguishing
between these groups. The MDS plots suggest that there is a clear separation between the two classes
and we expect the classifiers to be reasonably successful in distinguishing between the groups.

3.5.2 Example 2 : MDS and virus data


In this example we consider a dataset on 61 viruses with rod-shaped particles affecting various crops
(tobacco, tomato, cucumber and others) described by Fauquet et al. (1988) and analysed by Eslava-
Gómez (1989). There are 18 measurements on each virus, the number of amino acid residues per
molecule of coat protein; the data come from a total of 26 sources. There is an existing classi-
fication by the number of RNA molecules and mode of transmission, into 39 Tobamoviruses with
monopartite genomes spread by contact, 6 Tobraviruses with bipartite genomes spread by nematodes,
3 Hordeiviruses with tripartite genomes (transmission mode unknown), and 13 Furoviruses, 12 of which
are known to be spread fungally. The question of interest to Fauquet et al. was whether the furoviruses
form a distinct group, and they performed various multivariate analyses. Here we only consider the
Tobamoviruses to investigate if there are further subgroups within this group of viruses. In figure 3.18
we use the same distance-based representations as for the previous example to visualise the data.
There are two viruses with identical scores, of which only one is included in the analyses. (No analysis
of these data could differentiate between the two).
Figure 3.18 reveals some clear subgroups within the Tobamoviruses, which are perhaps most clearly
visible in the Kruskal MDS plot. Viruses 1 (frangipani mosaic virus), 7 (cucumber green mottle mosaic
virus) and 21 (pepper mild mottle virus) have been clearly separated from the other viruses in the
Kruskal MDS plot, which is not the case in the other two plots. Ripley (1996) states that the non-
metric MDS plot shows interpretable groupings. The upper right is the cucumber green mottle virus,

Figure 3.17: Distance-based representations of the prostate cancer data (200 genes). The top left
plot is by classical multidimensional scaling, the top right by Kruskal’s isotonic multidimensional scaling,
and the bottom left by Sammon’s nonlinear mapping.

Figure 3.18: Distance-based representations of the Tobamovirus group of viruses. The variables were
scaled before Euclidean distance was used. The points are labelled by the index number of the virus.

the upper left is the ribgrass mosaic virus and the two others. The one group of viruses at the bottom,
namely 8,9,19,20,23,24, are the tobacco mild green mosaic and odontoglossum ringspot viruses.
Chapter 4

Classification: Theoretical backgrounds and discriminant analysis

4.1 Classification as supervised learning


The objective of the clustering methods presented so far was to infer how the data might be partitioned
into different classes (unsupervised learning ). The task in supervised learning is different. We know
to which classes the observations xi belong. The objective is to “learn” from these examples so that we
are able to predict the class label for future observations. Examples of tasks in supervised learning
are diagnosing diseases, identifying “good” customers, reading handwritten digits, etc.

4.2 Some decision theory


This chapter gives a very brief overview of the statistical decision theory for classification.
Assume the observations come from L different classes and let the random variable C be the class
label. For each object we possess measurements xi , the feature vector. We assume that the feature
vectors xi are from a feature space X, typically a subset of Rp .
Our task is to find a classification rule

ĉ : X → {1, . . . , L},  x ↦ ĉ(x).

The interpretation of the map is as follows: If we observe the feature vector x, then we classify the
observation into the class ĉ(x).1
For now, we make the unrealistic assumption that the classwise densities f (x|l) and the prior
probabilities πl are known. For an observation from class l, the probability of misclassification of the
decision rule ĉ(·) is then
pmc(l) = P(ĉ(x) ≠ l | C = l)

For assessing the overall performance of a classifier, we need to specify how “bad” wrong decisions
are. We usually assume that all wrong decisions are equally bad, but this need not be the case
1
Sometimes additional decisions like “outlier” or “doubt” are considered. See e.g. Ripley (1996).


under all circumstances: In medicine the cost of failing to detect a disease is much higher than the
cost of a false positive result. We use a loss function to model these costs:

L(l, m) = Loss incurred when classifying an observation from class l into class m

If we make the above assumption that all wrong decisions are equally bad, we obtain

   L(l, m) = 0 if l = m,  and  L(l, m) = 1 if l ≠ m.            (4.1)

Using the loss function we now can compute the expected loss when using the classification rule
ĉ(·). This expected loss is called total risk.

R(ĉ) = E (L(C, ĉ(X)))

When using the usual loss function (4.1) one obtains


   R(ĉ) = Σ_{l=1}^{L} π_l · pmc(l).            (4.2)

Assuming we know the prior probabilities πl and the classwise densities f(x|l), we can give a best
decision rule: one can show that the classification rule
“Classify x into the class with highest posterior probability p(l|x)”
minimises the total risk when using the loss function (4.1). We can compute the posterior probability
using Bayes’ formula

   p(l|x) = π_l f(x|l) / Σ_{λ=1}^{L} π_λ f(x|λ).            (4.3)
Therefore this best decision rule is often called Bayes rule and its total risk is called Bayes risk. As the
Bayes rule is the best decision rule, the Bayes risk is a lower bound for the total risk for all possible
classification rules. Thus it is often used as a benchmark.
Figure 4.1 illustrates the Bayes rule for a simple two-class example, where the prior probabilities
are π1 = 0.4 and π2 = 0.6 and the classwise distributions are Gaussians with means µ1 = −1 and
µ2 = 1 and standard deviation σ = 1. For all x on the left hand side π1 f (x|1) is greater than π2 f (x|2),
so for these x we have ĉ(x) = 1.
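For this example the Bayes rule can be evaluated explicitly. The following small illustration (not part of the original figure code) computes the posterior probability of class 1 and the resulting decision boundary:

   pi1 <- 0.4; pi2 <- 0.6

   ## posterior probability p(1|x) via Bayes' formula (4.3)
   post1 <- function(x)
     pi1 * dnorm(x, mean = -1) / (pi1 * dnorm(x, mean = -1) + pi2 * dnorm(x, mean = 1))

   ## Bayes rule: classify into class 1 iff p(1|x) > 1/2
   bayes.rule <- function(x) ifelse(post1(x) > 0.5, 1, 2)

   ## decision boundary: pi1 f(x|1) = pi2 f(x|2), here x = log(pi1/pi2)/2 = -0.2027
   uniroot(function(x) post1(x) - 0.5, interval = c(-5, 5))$root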
In reality however, the classwise densities f (x|l), the classwise probabilities πl , and thus the pos-
terior probabilities p(l|x) are unknown; we have to estimate them from the data.
Logistic regression (see chapter 6) estimates the posterior probability p(l|x) — which is nothing
else than the conditional distribution of the class labels given the feature vector x — directly from the
data.
Alternatively, we can estimate the classwise densities f (x|l) (and the prior probabilities πl ) from
the data and plug them into (4.3) yielding

   p̂(l|x) = π̂_l f̂(x|l) / Σ_{λ=1}^{L} π̂_λ f̂(x|λ).

The resulting classifier is called the plug-in classifier. For estimating f(x|l) we have two possibilities.
We can make a parametric assumption about the classwise distribution of x and only estimate the

Figure 4.1: Example for the decision boundary generated by the Bayes rule

parameters. We could for example assume that the classwise distributions are all Gaussians. This leads
to linear and quadratic discriminant analysis (see sections 4.3 and 4.4). We could as well estimate the
densities f (x|l) non-parametrically (see section 4.6).
One problem of the plug-in approach is that it does not account for the uncertainty in fˆ(x|l).
The more Bayesian approach of predictive classification takes this into account. However it needs an
assumption on the prior distributions of the model parameters of f(x|l) (see e.g. Ripley, 1996, ch. 2.4).

4.3 Linear discriminant analysis (LDA)


Linear discriminant analysis assumes that the classwise densities are multivariate Gaussian distribu-
tions with means µ1 , . . . , µL and common covariance matrix Σ, i.e.
   f(x|l) = f_{µ_l,Σ}(x) = (2π)^{−p/2} |Σ|^{−1/2} exp( −½ (x − µ_l)′ Σ^{−1} (x − µ_l) ).
The (log) posterior probability then becomes
   log p(l|x) = −½ (x − µ_l)′ Σ^{−1} (x − µ_l) + log π_l + const                            (4.4)
              = −½ x′Σ^{−1}x + x′Σ^{−1}µ_l − ½ µ_l′Σ^{−1}µ_l + log π_l + const
              = −½ x′Σ^{−1}x + x′a_l + b_l   (the term −½ x′Σ^{−1}x is independent of l),
with appropriately defined a_l and b_l. As the part depending on l is linear in x, the decision boundaries
will be (linear) hyperplanes.2
2
The decision boundary between the l-th and the m-th class is the set of points for which p(l|x) = p(m|x). We can take
the logarithm on both sides yielding log p(l|x) = log p(m|x), using (4.4) we obtain x′ al + bl = x′ am + bm , an equation
linear in x.

In practice the means µ_1, . . . , µ_L and the covariance Σ are unknown. But using the plug-in approach
all we have to do is to replace them by their estimators. The maximum likelihood estimators for the
parameters are

   µ̂_l := x̄_l = (1/n_l) Σ_{x_i in class l} x_i

for the mean and for the covariance


   Σ̂ := (1/n) Σ_{l=1}^{L} Σ_{x_i in class l} (x_i − µ̂_l)(x_i − µ̂_l)′.

However, Σ is often estimated by the within-class covariance matrix

   W := (1/(n − L)) Σ_{l=1}^{L} Σ_{x_i in class l} (x_i − µ̂_l)(x_i − µ̂_l)′,            (4.5)

which has a different denominator.


In order to be able to compute the value of the Gaussian density we need to calculate the inverse
of the covariance matrix. This is only possible if it is of full rank. This can only be the case if the
dataset has at least as many rows as columns3 — especially in genetics this might not be the case.
Figure 4.2 shows the decision boundary obtained by applying linear discriminant analysis to the iris
dataset (only using the covariates petal width and petal length). Note that all estimated classwise
densities have the same orientation.
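A fit like the one underlying figure 4.2 can be obtained with the lda function from the MASS package. This is a minimal sketch; the code for drawing the decision boundaries and the classwise densities is omitted.

   library(MASS)
   data(iris)

   fit.lda <- lda(Species ~ Petal.Width + Petal.Length, data = iris)
   fit.lda$means                       # estimated classwise means
   pred <- predict(fit.lda)            # class labels and posterior probabilities
   table(iris$Species, pred$class)     # resubstitution confusion matrix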

4.4 Quadratic discriminant analysis (QDA)


Unlike linear discriminant analysis, quadratic discriminant analysis assumes that the Gaussians in each
class have different covariance matrices Σ_l. In this case the (log) posterior becomes

   log p(l|x) = −½ log |Σ_l| − ½ (x − µ_l)′ Σ_l^{−1} (x − µ_l) + log π_l + const
              = −½ x′Σ_l^{−1}x + x′Σ_l^{−1}µ_l − ½ µ_l′Σ_l^{−1}µ_l − ½ log |Σ_l| + log π_l + const            (4.6)
              = x′A_l x + x′b_l + c_l

with appropriately defined Al , bl , and cl . As (4.6) is quadratic in x the decision boundaries are (parts
of) ellipses.
Using the plug-in approach we estimate µl by the mean x̄l in class l and Σl by the covariance
in class l. Note that each class must have more observations than covariates, otherwise one of the
classwise covariance matrices is not of full rank.
Figure 4.3 shows the decision boundaries obtained by quadratic discriminant analysis for the iris
data. Note that the classwise Gaussian distributions now have different orientations and spread.
3
Note that this is just a necessary condition, not a sufficient one.

Figure 4.2: Decision boundaries and classwise Gaussian distributions for an LDA fit to the iris dataset
Figure 4.3: Decision boundaries and classwise Gaussian distributions for a QDA fit to the iris dataset
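The corresponding quadratic fit (figure 4.3) can be obtained with qda from the MASS package; again, a minimal sketch without the plotting code:

   library(MASS)

   fit.qda <- qda(Species ~ Petal.Width + Petal.Length, data = iris)
   table(iris$Species, predict(fit.qda)$class)   # resubstitution confusion matrix

Regularised discriminant analysis in the sense of Friedman (1989), discussed next, is available e.g. through the rda function in the klaR package.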

Sometimes however, quadratic discriminant analysis is too flexible and yields an overfit to the data.
In this case one can use regularised discriminant analysis (Friedman, 1989), a “compromise” between
LDA and QDA. Regularised discriminant analysis replaces the estimator for Σl by
   Σ̂_l^{RDA}(α) = α Σ̂_l + (1 − α) Σ̂;

α is hereby from [0, 1]. For α = 0 we obtain LDA, for α = 1 we obtain QDA.

Couldn’t we perform regression as well?

The log posterior (4.4) in LDA is a linear function in x. So one might ask why it would not be easier
to estimate the parameters a_l and b_l directly using some sort of regression model. This would mean
that we have to estimate fewer parameters.
Let’s consider only the two-class case. It is enough to know the ratio p(1|x)/p(2|x), as we classify an
observation to class 1 if the ratio is greater than 1. Now using (4.4) we obtain

   log [ p(1|x) / (1 − p(1|x)) ] = log [ p(1|x) / p(2|x) ] = log p(1|x) − log p(2|x)
      = −½ x′Σ^{−1}x + x′a_1 + b_1 + ½ x′Σ^{−1}x − x′a_2 − b_2
      = x′(a_1 − a_2) + (b_1 − b_2) =: x′β + α

This is the assumption of logistic regression (see chapter 6). We can even see quadratic discrimi-
nant analysis as a logistic model; however, we must then also include all interactions and all squared
covariates.
The difference between the two approaches is that logistic regression makes no assumption
about the classwise distribution of xi as it directly models the conditional likelihood given the xi .

4.5 Fisher’s approach to LDA


Fisher’s approach to LDA is not at all based on the decision theory from section 4.2; however he obtains
the same results as in section 4.3 by looking for projections that separate the data best.
Consider first the case of two classes. We want to project the data onto a line, such that the classes
are as separated as possible. A projection onto a line is mathematically defined over a vector a and
the projected values are proportional to the linear combination x′ a.
But how to characterise a good projection? Fisher proposed to use the projection that maximises
the (squared) difference between the projected class means, however relative to the variances in the
projected classes. Figure 4.4 visualises this idea.
We thus want to maximise
   (distance between the projected class means) / (variance of the projected data) = a′Ba / a′Wa,            (4.7)

Figure 4.4: Maximising the distance between the projected means (left panel) is not always the best way
to separate the data; the covariance structure has to be considered as well (right panel: best separation).

where
   B := (1/(L − 1)) Σ_{l=1}^{L} (x̄_l − x̄)(x̄_l − x̄)′

is the between-class variance matrix, which measures the dispersion of the class means. When the
data is spherical (i.e. the covariance matrix is proportional to the identity matrix), then we can omit
the denominator. This leads to an equivalent approach to LDA: We first rescale the data so that the
within-class covariance matrix is spherical, i.e. the data of each class forms a ball. Then we project
the data such that the projected class means have as much spread as possible.4 Once we have found the
projection direction a, we can give the LDA classification rule: classify a new observation x_new into
the class whose mean is closest to x_new in the projected space. We thus compare the distances between
x′_new a and the x̄′_l a.
For problems with more than two classes, only one projection might not be enough to separate
the data. We can define additional projection directions a2 , . . . , aL−1 by maximising (4.7) over all a
that are orthogonal in the metric implied by W to the previous projection directions. One can show
that the approach of section 4.3 based on the normal distribution (and equal priors) is equivalent to
Fisher’s LDA if all L − 1 projection directions are used.5 However, it can be useful in practice not to
use all projection directions, as the projection directions of higher order can contain more noise than
useful information about the classes. This type of LDA is sometimes referred to as reduced rank LDA.
Figure 4.5 shows the decision boundaries obtained by fitting a reduced rank LDA with rank 1 to the
iris dataset.
One can show that Fisher’s approach is equivalent to canonical correlation analysis and optimal
scoring (see e.g. Mardia et al., 1979). In optimal scoring one performs a regression from X to
an artificial variable y, where y_i is set to a value θ_l determined by the class l the i-th observation
belongs to. So not only the regression parameter β, but also the class parameters θ_1, . . . , θ_L have to be
estimated. This view allows two generalisations of linear discriminant analysis:
4
Note that LDA requires the within-class covariance matrix W to be of full rank.
5
If the number of covariates is less than L − 1, then it is of course enough to consider as many projections as covariates.

Figure 4.5: Reduced rank LDA (rank 1) fitted to the iris dataset. The dashed line is the projection line.
The coloured solid lines are the projections of the class means, and the solid black lines are the decision
boundaries.

• When using penalised (ridge) regression6 instead of linear regression one obtains penalised
discriminant analysis (PDA) (Hastie et al., 1995). PDA corresponds to using a regularised within-
class covariance matrix
W_PDA = W + λΩ,

where Ω is a suitable penalisation matrix. The within-class covariance matrix thus does not need
to be of full rank any more. This version of discriminant analysis is very useful when dealing
with a huge amount of (highly correlated) covariates. PDA has been successfully applied to digit
recognition (the initial problem of Hastie et al.), medical imaging (see e.g. Kustra and Strother,
2001) and microarray analysis (see e.g. Kari et al., 2003).
Note that “plain” LDA is rather unsuited for microarray analysis.

• When using a non-parametric regression technique7 one obtains flexible discriminant analysis
(FDA) (Hastie et al., 1994). FDA can be used when linear decision boundaries are too rigid and
a more flexible discrimination rule is needed.

Despite being a quite simple and maybe old-fashioned technique, linear discriminant analysis is still
very powerful. In the STATLOG project (Michie et al., 1994), which compared different classification
techniques, LDA was amongst the top three classifiers for 7 of the 22 studied datasets.
6
Ridge regression adds a penalisation term to the least-squares criterion and obtains thus a shrunken solution. It is mainly
used when the predictors are (nearly) collinear.
7
Nonparametric regression fits a smooth curve through the data, whereas linear regression fits a straight line.

4.6 Non-parametric estimation of the classwise densities


We have seen in section 4.2 that the (Bayes) optimal decision rule classifies xi into the class l that
maximises the posterior probability

   p(l|x_i) = π_l f(x_i|l) / Σ_{λ=1}^{L} π_λ f(x_i|λ).            (4.8)

For linear and quadratic discriminant analysis we assumed that these class-wise densities f (xi |l) were
Gaussians, once with common covariance matrix, once with different covariance matrices, and we
estimated their means and covariances. However we can also estimate the densities f (xi |l) non-
parametrically, i.e. without assuming a certain type of distribution. (see e.g. Hand, 1981)
Non-parametric density estimation can be carried out in a variety of ways. Here, we will only
consider kernel density estimation. Kernel density estimation estimates the density at x ∈ Rp by
 
   f̂(x|l) = (1 / (n_l h^p)) Σ_{x_i in class l} K( (x − x_i)/h ).

K is hereby any bounded function with integral 1 and a peak at 0. A sensible choice for K is the
density of the p-variate standard normal distribution. The bandwidth h controls how smooth the
estimated density is: the smaller the chosen h, the wigglier the estimated density gets.
Now we can estimate p(l|x) by plugging the estimates f̂(x|λ) into (4.8):

   p̂(l|x) = π̂_l f̂(x|l) / Σ_{λ=1}^{L} π̂_λ f̂(x|λ)
          = [ (π_l/n_l) Σ_{x_i in class l} K((x − x_i)/h) ] / [ Σ_{λ=1}^{L} (π_λ/n_λ) Σ_{x_i in class λ} K((x − x_i)/h) ].            (4.9)

When estimating the prior probabilities by π̂_l = n_l/n, expression (4.9) simplifies to

   p̂(l|x) = Σ_{x_i in class l} K((x − x_i)/h) / Σ_{i=1}^{n} K((x − x_i)/h).

We classify a new observation xnew into the class l with highest posterior probability p̂(l|xnew ).
Figure 4.6 shows the decision boundaries when estimating the classwise densities in the iris dataset
non-parametrically.
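A naive sketch of such a plug-in classifier with a Gaussian kernel and a single bandwidth h is given below; the function and argument names are ours, and the choice of h would in practice need some care.

   ## naive kernel density classifier: Gaussian kernel, one bandwidth h for all variables
   kde.classify <- function(x.new, X, y, h = 1) {       # y: factor of class labels
     K <- function(u) exp(-sum(u^2) / 2)                # unnormalised Gaussian kernel
     fhat <- sapply(levels(y), function(l) {
       Xl <- X[y == l, , drop = FALSE]
       sum(apply(Xl, 1, function(xi) K((x.new - xi) / h)))
     })
     ## with pi.hat_l = n_l/n all normalising constants cancel (see the simplification above)
     names(which.max(fhat))
   }

   ## e.g. kde.classify(c(5, 1.7), as.matrix(iris[, c("Petal.Length", "Petal.Width")]),
   ##                   iris$Species, h = 0.5)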
Note that this technique is only useful for datasets of low dimension, as it is highly susceptible to
the “curse of dimensionality”.


Figure 4.6: Classwise densities and decision boundaries obtained when estimating the classwise densities
nonparametrically.
Chapter 5

Trees and rules

5.1 Introduction
Decision trees are hierarchical classifiers based on “If . . . then . . . else . . . ” rules. This makes them
extremely easy to interpret and has made decision trees very popular tools in medical diagnosis and other
fields. Decision trees have been commercialised successfully under the acronyms CART (Breiman et al.,
1984) and C4.5 (Quinlan, 1993). Figure 5.1 gives an example of a decision tree; it is taken from the
NHS Direct self-help guide. The left hand side shows the original page from the NHS guide and the
right hand side shows the decision tree in a more usual way. At every node there is one question that
has to be answered with either “yes” or “no”; thus the tree is binary. The top node is called the root,
the terminal nodes are called leaves.1
Mathematically speaking decision trees do nothing else than cutting the feature space into smaller
and smaller hypercubes. Each hypercube (a rectangle in the case of R2 ) corresponds to a terminal
node of the tree and vice versa. Figure 5.2 visualises this for a decision tree for the iris dataset. The
splits are hereby carried out in a way that the partitions get as pure as possible; ideally there should
be only observations of one class in a partition (terminal node).

5.2 Growing a tree


The trees are grown from the root node using the “divide and conquer” device: first the root node is
split, then the two just created nodes are split, etc.2 When carrying out a split, we have to make two
decisions: Which variable to use? And which values to assign to the left branch and which to assign
to the right one?
As mentioned above we want to reduce the impurity of the partitions, which is assessed by a
measure of impurity. Common measures of impurity are the Gini coefficient
   i(p) = Σ_{l≠m} p_l p_m = 1 − Σ_l p_l²

and the entropy

   i(p) = − Σ_l p_l log p_l ,
1
Note that decision trees are grown “upside-down”.
2
We will only consider binary trees in this chapter.

81

Figure 5.1: Page taken from the NHS Direct self-help guide (left) and corresponding decision tree
(right)

where p = (p1 , . . . , pL ) denotes the empirical distribution of the class labels in the partition.3 Figure
5.3 displays the Gini coefficient and the entropy for the two-class case. If the partition consists of only
one class (frequency p1 either 0 or 1), the impurity is 0. Are both classes equally present (frequency
p1 = 0.5), then both impurity measures are maximal.
When splitting a node with empirical distribution p into two nodes with empirical distributions pl
(left node) and pr (right node) the decrease in impurity is

i(p) − (πl i(pl ) + πr i(pr )) ,

where πl is the proportion of observations that is allocated to the left node and πr = 1 − πl is the
proportion of observations allocated to the right node.
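These quantities are easy to compute; the small helper functions below (our own names, mirroring the definitions above) return the Gini coefficient, the entropy and the decrease in impurity achieved by a candidate split.

   gini <- function(p) 1 - sum(p^2)                       # p: vector of class frequencies
   entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))   # with 0 log 0 := 0

   ## decrease in impurity when a node with class labels y is split;
   ## 'left' is a logical vector indicating which observations go to the left node
   impurity.decrease <- function(y, left, impurity = gini) {
     prop <- function(z) table(z) / length(z)             # empirical class distribution
     pi.l <- mean(left)                                   # proportion sent to the left node
     impurity(prop(y)) -
       (pi.l * impurity(prop(y[left])) + (1 - pi.l) * impurity(prop(y[!left])))
   }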
We now can use the decrease in impurity to “grow” the tree. Starting with one partition (i.e. the
root node), we repeatedly split all terminal nodes such that each time the decrease in impurity is
maximal. We can repeat this until no more decrease is possible. Figure 5.4 shows the decision tree
for the Pima Indians dataset. The Pima Indians dataset was collected by the US National Institute of
Diabetes and Digestive and Kidney Diseases. The subjects were women who were at least 21 years old,
of Pima Indian heritage and living near Phoenix, Arizona. They were tested for diabetes according to
World Health Organisation criteria. The variables measured were the number of pregnancies (npreg),
the plasma glucose concentration in an oral glucose tolerance test (glu), the diastolic blood pressure
3
It might occur that pl = 0, in this case we define 0 log 0 := 0.

[Left panel: decision tree with splits Petal.Length < 2.45 (leaf setosa) and Petal.Width < 1.75 (leaves versicolor and virginica). Right panel: the induced partitioning of the (Petal.Length, Petal.Width) plane.]

Figure 5.2: Decision tree for the iris dataset and corresponding partitioning of the feature space

Figure 5.3: Gini coefficient and entropy for a two-class problem



Toy Example: Constructing a decision tree

Consider the following toy example. The aim is to predict using information on the sex, the blood
pressure and the smoking status whether treatment A or treatment B is to be preferred.

# sex blood pressure smoker best treatment


1 male low yes A
2 male low no A
3 male low yes A
4 male low no A
5 male high yes B
6 male high no B
7 female low yes B
8 female low no B
9 female low yes B
10 female low no A

First we have to split the root node. The impurity of the root node (using the Gini coefficient) is
i(5/10, 5/10) = 1 − (0.5² + 0.5²) = 0.5.

Examine variable sex:
   male: i(4/6, 2/6) = 0.444,  female: i(1/4, 3/4) = 0.375
   Impurity after splitting: 0.6 · 0.444 + 0.4 · 0.375 = 0.417
   Reduction: 0.5 − 0.417 = 0.083

Examine variable blood pressure:
   low: i(5/8, 3/8) = 0.469,  high: i(0/2, 2/2) = 0
   Impurity after splitting: 0.8 · 0.469 + 0.2 · 0 = 0.375
   Reduction: 0.5 − 0.375 = 0.125

Examine variable smoker:
   yes: i(2/5, 3/5) = 0.48,  no: i(3/5, 2/5) = 0.48
   Impurity after splitting: 0.5 · 0.48 + 0.5 · 0.48 = 0.48
   Reduction: 0.5 − 0.48 = 0.02

The variable blood pressure yields the biggest reduction in impurity so it is used. In a next step the

left hand node has to be split again, the right hand node is already pure.

Examine variable sex:
   Impurity before splitting: i(5/8, 3/8) = 0.469
   male: i(4/4, 0/4) = 0,  female: i(1/4, 3/4) = 0.375
   Impurity after splitting: 0.5 · 0 + 0.5 · 0.375 = 0.188
   Reduction: 0.469 − 0.188 = 0.281

Examine variable smoker:
   Impurity before splitting: i(5/8, 3/8) = 0.469
   yes: i(2/4, 2/4) = 0.5,  no: i(3/4, 1/4) = 0.375
   Impurity after splitting: 0.5 · 0.5 + 0.5 · 0.375 = 0.438
   Reduction: 0.469 − 0.438 = 0.031

Thus the variable sex is used for splitting. The obtained tree is then:

in mm Hg (bp), the triceps skin fold thickness in mm (skin), the body mass index (bmi), the diabetes
pedigree function (ped), and the age (age).
Until now we have tacitly assumed that all covariates are binary variables. How can other covariates
be included into the tree? Simply by transforming them into binary variables. A nominal variable
which has k different levels is transformed into 2^{k−1} − 1 binary variables. One binary variable
is created for each different combination of levels. For example, a nominal variable taking the values
“red”, “green” and “blue” would be transformed into three binary variables: The first taking the values
“red” and “green or blue”, the second taking the values “red or green” and “blue”, and the third taking
the values “red or blue” and “green”.4 It is a little less tedious to include ordinal or metrical covariates;
only k − 1 different splits have to be considered as we just have to find an appropriate threshold.5 Once
the values are ordered, one only has to consider one threshold per pair of neighbouring values, usually
(x_(j) + x_(j+1))/2.
Decision trees are classifiers that are easy to interpret. Decision trees are however often outper-
formed by other methods when it comes to the accuracy of the prediction; they have a rather high
variance and are thus unstable classifiers. For this reason one should not over-interpret which splits
come first or later. A slight change in the dataset sometimes causes the whole tree to “topple over.”
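Trees like the one in figure 5.4 can be grown in R with the rpart package; the Pima Indians data are available, for example, as Pima.tr in the MASS package. The control settings below are only illustrative.

   library(rpart)
   library(MASS)                     # contains the Pima.tr data frame

   ## grow a large tree; cp = 0 switches off the built-in complexity penalty
   fit <- rpart(type ~ ., data = Pima.tr, method = "class",
                control = rpart.control(cp = 0, minsplit = 10))
   plot(fit); text(fit)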

5.3 Pruning a tree


Growing the tree until no more decrease in impurity is possible often leads to an overfit to the training
data. At their early stages, some decision tree algorithms used “early stopping” rules. Quinlan (1986)
proposed to halt the splitting process as soon as the proposed split does not yield a satisfactory parti-
tioning any more.6 These stopping criteria however did not always perform well in practice. It turned
out to be better to grow the tree and then prune it.
The most popular pruning approach is the one proposed by Breiman et al. (1984). The idea behind
this approach is that too big trees yield an overfit. Thus too big trees should be penalised. Denote
with R(T ) a measure of fit for the tree; this can be the misclassification rate on the training set or the
entropy of the partitioning. Instead of minimising the fit criterion R(T ) itself, we now minimise the
penalised fitting criterion
R(T ) + α · size(T ),
where size(T) is the number of leaves and α controls the amount of penalisation. If we choose α = 0,
there will be no pruning; if we choose α = +∞ all nodes but the root node are removed. Breiman
et al. (1984) showed that there is a nested sequence of subtrees of the fitted tree such that each is
optimal for a range of α. So all we have to do is to pick one of the trees of this sequence. Figure
5.5 shows this for the Pima Indians data.
If we have a validation set at hand, we can pick the subtree yielding the lowest error rate in the
validation set. Otherwise one generally uses cross-validation to pick the optimal subtree. Figure 5.6
shows the error (relative to a tree with the root node only) for the different subtrees for the Pima
Indians data. The error rate was estimated using cross validation and the vertical lines indicate the
4
When the response has only two classes one can order the attributes by the proportion of one of the classes, and then one
just has to find a cut-off point.
5
Note however that for most continuous covariates all values are different (i.e. k = n), so k can be quite large.
6
Quinlan used a χ2 -test of independence between the variable used to split and the response variable using all observa-
tions in the current node. If the test rejected the assumption of independence, the split was carried out.

Figure 5.4: Unpruned decision tree for the Pima Indians dataset

standard error of the estimates. One can now pick the subtree leading to the lowest error in cross-
validation. Breiman et al. (1984) however propose to choose the smallest tree that is not more than
one standard error worse than the best tree (“one SE rule”). The idea behind this is that smaller trees
are generally preferable and the tree picked that way is only “insignificantly” worse than the best one.
In our example the “one SE rule” would pick a tree of size 5. Figure 5.7 shows the decision tree after
pruning (i.e. the optimal subtree) for the Pima Indians dataset.
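With rpart the cost-complexity sequence and the cross-validated errors shown in figures 5.5 and 5.6 can be inspected, and the tree pruned, along the following lines (continuing the hypothetical fit from section 5.2; the cp value is only illustrative).

   printcp(fit)        # table of cp values, tree sizes and cross-validated errors
   plotcp(fit)         # cross-validated error against cp, with the one-SE line

   ## prune back to the subtree for a chosen complexity parameter, e.g. the cp value
   ## picked by the "one SE rule" from the printcp/plotcp output
   fit.pruned <- prune(fit, cp = 0.042)      # illustrative value, cf. figure 5.6
   plot(fit.pruned); text(fit.pruned)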
Quinlan’s approach to pruning (Quinlan, 1988) is more simplistic, but theoretically less elegant
than Breiman’s. The goal of the pruning step is to find the subtree with the lowest error rate. If
the true error rates were known, the following approach would suggest itself: If the replacement of
a subtree by the most abundant leaf or branch yields a reduction in the error rate, prune the tree
accordingly. As explained above, the training error rate is a very bad estimate of the true error rate
especially at the bottom of the tree, where the training error rate often underestimates the true error
rate. Instead of comparing the training error rates directly one can compare a pessimistic estimate
of the true error rate. Quinlan proposed to use the upper confidence limit for the error rate using
a binomial model for the number of errors which he called the pessimistic error rate. The upper
confidence limit can be obtained directly from the training error rate and the number of observations
in the different nodes.7 Note however that strictly speaking the binomial model is not applicable here;
the binomial model would only hold if the error rate were estimated using unseen examples and not
using observations from the training data. Heuristically, the upper confidence limit used by Quinlan
can be seen as some sort of penalised estimator of the error rate where too small nodes and thus too
fine-grained trees get penalised.
7
Let N be the number of observations in the current node or subtree and E be the number of errors. Then E can be seen
as approximately having the binomial distribution B(N, πerr ) where πerr is the probability of an error.
   E/N + z_γ · √( E · (N − E) / N³ )
is a γ-upper confidence limit for the probability of an error π_err.

Figure 5.5: Optimal decision trees for the Pima Indians data for different ranges of the complexity parameter cp = α / R(T_root)

[Plot: cross-validated relative error against the complexity parameter cp (∞, 0.19, 0.11, 0.066, 0.042, 0.021, 0.012) and the corresponding tree sizes (1, 2, 3, 4, 5, 6, 18).]

Figure 5.6: Error rate estimated by cross-validation for the different subtrees. The vertical lines show
the standard error, and the horizontal dashed line is one standard error worse than the best subtree.
Using the “one SE rule” we would pick the first value (from the left) of the tree size for which the
curve is below the dashed line.


Figure 5.7: Pruned decision tree for the Pima Indians dataset.

The tree corresponds to the rules

(R1) If X1 and X2 then the outcome is “yes”.
(R2) If X1, not X2, X3 and X4, then the outcome is “yes”.
(R3) If X1, not X2, X3 and not X4, then the outcome is “no”.
(R4) If X1, not X2, and not X3, then the outcome is “no”.
(R5) If not X1, X3 and X4, then the outcome is “yes”.
(R6) If not X1, X3 and not X4, then the outcome is “no”.
(R7) If not X1 and not X3, then the outcome is “no”.

This can be simplified to:

(R1) If X1 and X2 then the outcome is “yes”.
(R2') If X3 and X4 then the outcome is “yes”.
(R3') In all other cases (i.e. if either not X1 or not X2, and either not X3 or not X4) the outcome is “no”.

Figure 5.8: Example tree for X1 = X2 = TRUE or X3 = X4 = TRUE and the corresponding rules

5.4 From trees to rules


Trees are not the only way to represent a set of rules. Note that only certain sets of rules can be
displayed by trees; the first condition used at the root node — or its negation — for example has to
be part of every rule. It thus might make sense to represent the rule set directly as “If . . . then . . . ”
rules. The conversion is straightforward as every terminal node of a decision tree corresponds to one
rule.
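In R, one way to read off the rule belonging to each terminal node of an rpart tree is path.rpart(); the sketch below uses the iris data purely as a stand-in example.

library(rpart)
fit <- rpart(Species ~ ., data = iris)

## node numbers of the terminal nodes (leaves)
leaves <- as.numeric(rownames(fit$frame))[fit$frame$var == "<leaf>"]

## print the conditions along the path from the root to each leaf,
## i.e. one "if ... then ..." rule per terminal node
path.rpart(fit, nodes = leaves)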
In order to maintain the interpretability of the rules one has to make sure that the rules are simple and that the number of rules is small. The rule set thus has to be simplified (“generalised”). As the rules no longer have to comply with the tree structure, it is often possible to simplify the set of rules without affecting their accuracy by removing superfluous conditions. Figure 5.8 shows a decision tree, the corresponding rule set and how it can be simplified.
However, even after this simplification the rule set might still be too complex. In this case the rule set has to be pruned like a decision tree. Quinlan (1993) proposes an approach similar to his pruning algorithm. For every rule one computes the pessimistic error rate once for the rule with all conditions and once with one or more of the conditions omitted. If the pessimistic error rate can be reduced, the corresponding parts of the antecedent are dropped.8 Rules for which the pessimistic error rate exceeds a certain threshold (e.g. 0.5) are dropped completely. Note that there is a difference between pruning trees and pruning rules. When pruning rules, the conditions of the different rules can be pruned independently of each other. When pruning a tree, the tree structure has to be maintained:
8
For complex rules this can be quite tedious. Quinlan proposed to use a “greedy search” strategy instead of trying all
combinations. The condition that gives the highest decrease in the pessimistic error rate is removed first, then the next one
is picked from the remaining conditions, etc. until no more decrease in the pessimistic error rate is possible.

when pruning at a certain node, all the descending nodes are affected.

Toy example: Pruning a rule set

Consider the following data set, which indicates, depending on the weather, whether a person plays tennis or not (data taken from Quinlan, 1993):
# outlook temperature humidity wind play.tennis
1 sunny hot high weak no
2 sunny hot high strong no
3 overcast hot high weak yes
4 rain mild high weak yes
5 rain cool normal weak yes
6 rain cool normal strong no
7 overcast cool normal strong yes
8 sunny mild high weak no
9 sunny cool normal weak yes
10 rain mild normal weak yes
11 sunny mild normal strong yes
12 overcast mild high strong yes
13 overcast hot normal weak yes
14 rain mild high strong no

The corresponding pruned decision tree first splits on the outlook and then on the humidity:

outlook = rain or sunny:
    humidity = high:    No (does not play)
    humidity = normal:  Yes (plays)
outlook = overcast:     Yes (plays)
The tree corresponds to the rules:

(R1) If the outlook is rain or sunny and the humidity is high, then he/she will not play.

(R2) If the outlook is rain or sunny and the humidity is normal, then he/she will play.

(R3) If the outlook is overcast, then he/she will play.

We now have to compute the pessimistic error rate for all rules once with all conditions and
once with different conditions omitted. (We’ll use a confidence level of 90% and drop rules with a
pessimistic error rate of more than 50%.)

rule                                    # errors   # observations   pessimistic error rate

(R1)                                        1             5              0.4293
(R1) without “humidity=high”                5            10              0.7026
(R1) without “outlook=rain/sunny”           3             8              0.5944
(R2)                                        1             5              0.4293
(R2) without “humidity=normal”              5            10              0.7026
(R2) without “outlook=rain/sunny”           1             7              0.3124

If we remove the condition on the humidity from rule (R2) we can decrease the pessimistic error rate
from 0.4293 to 0.3124, so we should simplify rule (R2) accordingly. The pessimistic error rate for rule
(R1) exceeds 0.5, so it has to be dropped.
Thus we obtain the following rule set:

(R2’) If the humidity is normal, then he/she will play.

(R3) If the outlook is overcast, then he/she will play.

Note that generalising the rules in the above way leads to two problems:
• As we might have removed certain parts of the conditions of some of the rules, the rules do not
need to be exclusive any more. We thus need a mechanism to resolve conflicts. Quinlan (1993)
proposes to rank the rules according to their “quality”9 and then use only the prediction of the
“best” rule which is applicable.
• If we remove certain rules because their pessimistic error rate is too high, it can happen that there is no rule which covers certain cases. In this case one has to define a default prediction to fall back on. Quinlan (1993) proposes to simply use the most abundant class amongst all items to which no rule applies. In the above example the default class would be “He won’t play tennis.”

5.5 Frequent item sets and association rules


Obtaining rules from a decision tree only provides us with rules that predict one variable — the one
used for training the decision tree. It is thus a supervised learning problem. Sometimes however we
are interested in rules concerning all the variables. A retailer might be interested in which items are
generally purchased at the same time, so that these products can be marketed together. One (in)famous example is the supposed correlation between the sales of beer and nappies: customers who buy nappies tend to buy beer as well.10 This type of analysis is called market basket analysis and is one of the most popular techniques in commercial data mining projects. Note that extracting general
9
Quinlan proposes to use the minimum description length (MDL) to assess the “quality” of a rule.
10
Although you can find this example in practically every lecture or textbook on data mining or database marketing, it is more of an urban legend than a fact. The legend says that a very high correlation between the purchases of beer and the purchases of nappies was found and that the retailer therefore placed some beer cans next to the nappies — resulting in increased sales of both. There was even an explanation for this: it was theorised that fathers were stopping off at Wal-Mart to buy nappies for their babies, and since they could no longer go down to the pub as often, they would buy beer as well. Although this story has been attributed to Wal-Mart and other US retailers, it belongs to the realm of fiction. Forbes Magazine of June 4th 1998 noted: “It appears to have come from one Thomas Blischok, now CEO of Decisioneering Group in Scottsdale (AZ). As vice president of industry consulting for NCR, he was doing a study for American Stores’ Osco Drugs in 1992 when he discovered dozens of correlations, including one connecting beer and diapers in transactions between 5 p.m. and 7 p.m.” The rest is a legend; Osco never rearranged its beer or nappy shelves.

rules from data is an unsupervised learning problem and thus similar to cluster analysis; its results therefore have to be handled with the same care and scepticism as results from clustering algorithms.
In their basic versions the algorithms for identifying association rules work with binary (“0/1”) variables only.11 In the marketing example, an observation (i.e. a row) corresponds to a transaction, while a variable (i.e. a column) indicates whether a certain product has been purchased or not.
The first step in finding association rules is finding frequent item sets. In our marketing example,
frequent item sets correspond to products which are usually purchased together. An item set consisting
of the variables {v1 , . . . , vs } is considered frequent if the probability that all of its values are one,
P({v1 = 1} ∩ . . . ∩ {vs = 1})

is high. We can estimate this probability by the corresponding count in the data set, i.e.

P̂({v1 = 1} ∩ . . . ∩ {vs = 1}) = (# observations with v1 = 1, . . . , and vs = 1) / (# observations).

This is often called the support.
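For a data set stored as a 0/1 matrix, the support of an item set is exactly this relative frequency; a minimal sketch:

## support of an item set (given as column names or indices) in a 0/1 matrix x
support <- function(x, items) {
  mean(rowSums(x[, items, drop = FALSE]) == length(items))
}
## e.g. for the toy data of figure 5.9 (given further below), support(x, c("v1", "v2"))
## would return 0.4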
In order to obtain the numerator we have to count for every possible combination of variables
(products) how often all of these variables are 1 (i.e. how often these products were purchased
together). However, if there are p variables in total, there are 2^p − 1 item sets, which all have to be counted. In other words, the number of item sets grows exponentially. It is thus virtually impossible to determine the support for every possible item set. Note that for a data set with 100 variables we would already have to analyse 2^100 − 1 ≈ 1.27 × 10^30 item sets. We thus have to come up with an
intelligent way of searching frequent item sets.
The search process can be visualised using a prefix tree / subset lattice. Figure 5.9 shows such a
tree for an easy example: every node corresponds to one item set. Lines (dotted or solid) indicate that
the item set is a subset of the item set below. Recall that the objective is to search the tree for frequent
item sets.
Luckily the search can be simplified quite a lot:
• When we search through the lattice, it is sufficient to descend the solid lines. This prevents us from visiting the same node twice. A line is solid if the index of the added variable is greater than the indices of the variables already in the set.12

• If we observe that the support for a node is too small, we do not have to consider any of its descendants (reached via dotted or solid lines). The support of a descendant cannot be greater than the support of the node itself. In other words: every subset of a frequent item set is a frequent item set as well. This is often called the apriori property.

• Usually one is interested in identifying simple association rules and frequent item sets, so one does not need to consider frequent item sets with too many elements. We thus only have to search the lattice up to a certain depth.

11
Note however that when using dummies other variables (with finite support) can be converted to binary ones.
12
This requires the variables (not the observations) to be ordered. This is generally no problem, as the variables are in a table and thus ordered from left to right. If one wants to generalise the algorithm, finding a suitable order is far from trivial.

The data set underlying the example consists of ten observations of four binary variables:

v1 v2 v3 v4
 1  1  0  0
 1  0  0  1
 0  1  1  0
 1  1  1  0
 0  1  1  0
 1  1  1  0
 1  1  1  0
 0  1  1  0
 1  0  1  0
 0  1  0  0

The counts and supports of the item sets in the lattice are:

{v1}: # 6 / S 0.6    {v2}: # 8 / S 0.8    {v3}: # 7 / S 0.7    {v4}: # 1 / S 0.1
{v1,v2}: # 4 / S 0.4    {v1,v3}: # 5 / S 0.5    {v1,v4}: # 1 / S 0.1
{v2,v3}: # 6 / S 0.6    {v2,v4}: # 0 / S 0    {v3,v4}: # 0 / S 0
{v1,v2,v3}: # 3 / S 0.3    {v1,v2,v4}: # 0 / S 0    {v1,v3,v4}: # 0 / S 0    {v2,v3,v4}: # 0 / S 0
{v1,v2,v3,v4}: # 0 / S 0

Figure 5.9: Prefix tree / subset lattice for an easy toy example. (# denotes the count, S the support)

These ideas are the basic ingredients of the apriori algorithm of Agrawal et al. (1995) for finding frequent item sets. The apriori algorithm searches the prefix tree line-wise, i.e. all item sets of size h are examined in the h-th step. One could equally search the prefix tree “column-wise”, i.e. examine in the h-th step all item sets containing vh; the latter approach is used by the eclat algorithm of Zaki et al. (1997).

Toy example: Apriori algorithm

We will carry out the apriori algorithm for the toy data set from figure 5.9. The minimum support
should be 0.3.
We start with the first row of the lattice and see that we can drop {v4 }, as its support is too small.

{v1}: # 6 / S 0.6     {v2}: # 8 / S 0.8     {v3}: # 7 / S 0.7     {v4}: # 1 / S 0.1  × (dropped)

As we have dropped {v4 }, we don’t even have to look at {v1 , v4 }, {v2 , v4 }, and {v3 , v4 }. We only
have to compute the support for {v1 , v2 }, {v1 , v3 }, and {v2 , v3 } — which turns out to be sufficiently
high.

{v1}: # 6 / S 0.6     {v2}: # 8 / S 0.8     {v3}: # 7 / S 0.7     {v4}: × (dropped)
{v1,v2}: # 4 / S 0.4     {v1,v3}: # 5 / S 0.5     {v2,v3}: # 6 / S 0.6     {v1,v4}, {v2,v4}, {v3,v4}: × (not examined)

Once again we don’t even have to look at {v1 , v2 , v4 }, {v1 , v3 , v4 } and {v2 , v3 , v4 }.

{v1}: # 6 / S 0.6     {v2}: # 8 / S 0.8     {v3}: # 7 / S 0.7     {v4}: × (dropped)
{v1,v2}: # 4 / S 0.4     {v1,v3}: # 5 / S 0.5     {v2,v3}: # 6 / S 0.6     {v1,v4}, {v2,v4}, {v3,v4}: × (not examined)
{v1,v2,v3}: # 3 / S 0.3     {v1,v2,v4}, {v1,v3,v4}, {v2,v3,v4}: × (not examined)

We do not have to consider {v1 , v2 , v3 , v4 } either.


The extracted frequent item sets are thus {v1}, {v2}, {v3}, {v1, v2}, {v1, v3}, {v2, v3}, and {v1, v2, v3}.
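The same result can be reproduced in R with, for instance, the arules package (assumed here), whose apriori() and eclat() functions implement the two algorithms mentioned above. The 0/1 matrix below is the toy data set of figure 5.9.

library(arules)   # assumed; provides apriori() and eclat()

x <- matrix(c(1,1,0,0, 1,0,0,1, 0,1,1,0, 1,1,1,0, 0,1,1,0,
              1,1,1,0, 1,1,1,0, 0,1,1,0, 1,0,1,0, 0,1,0,0),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))

trans <- as(x == 1, "transactions")        # logical matrix -> transactions

freq <- apriori(trans,
                parameter = list(supp = 0.3, target = "frequent itemsets"))
inspect(freq)                              # the seven frequent item sets found above

## association rules derived from the frequent item sets
## (support, confidence and lift are discussed in the next paragraphs)
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.6))
inspect(rules)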

Once we have extracted the frequent item sets, we have to convert them to association rules.
We have to split the frequent item set into a part used in the antecedent and a part used in the
consequent. Denote with A the set of variables used in the antecedent and with C the ones used in
the consequent. It seems — at first sight at least — useful to find a rule such that the conditional

probability P(C|A) = P(A ∩ C)/P(A) is high. This conditional probability is estimated by

P̂(C|A) = P̂(A ∩ C)/P̂(A) = (# observations with A and C) / (# observations with A),

usually called the confidence.


There has been some criticism of support and confidence. Aggarwal and Yu (1998) gave the following example. Among 5,000 students, 3,000 play basketball, 3,750 eat cereals, and 2,000 both play basketball and eat cereals:

                 basketball   no basketball   total
  cereals            2000          1750        3750
  no cereals         1000           250        1250
  total              3000          2000        5000

Most of the students eat cereals, but the proportion of students eating cereals is lower amongst students playing basketball (66.7%) than amongst students not playing basketball (87.5%). Thus playing basketball makes students less likely to eat cereals. However, the rule “play basketball → eat cereals” has a support of 2000/5000 = 40% and a confidence of 2000/3000 = 66.7%, whereas the rule “play basketball → not eat cereals” has a support of 1000/5000 = 20% and a confidence of 1000/3000 = 33.3%. Based on confidence we would thus conclude “play basketball → eat cereals”. The reason for this counterintuitive result is that most of the students eat cereals, and although students playing basketball are less likely to eat cereals, the majority of them still do so.
In order to reach the correct conclusion we should not only focus on the confidence, but use another measure which allows us to detect whether the consequent becomes more or less likely relative to the whole population. In other words, it is not important whether P(C|A) is big, but whether P(C|A) is greater than P(C). This is where the so-called lift comes in. It is defined as

P̂(C|A) / P̂(C) = P̂(A ∩ C) / (P̂(A) · P̂(C)) = ((# observations with A and C) · (# observations)) / ((# observations with A) · (# observations with C)).

If A and C are independent, then P(A ∩ C) = P(A) · P(C), so P̂(A ∩ C) ≈ P̂(A) · P̂(C). If the lift takes a value close to 1, this indicates that the antecedent and the consequent are independent. If it is less than 1, then the consequent is, given the antecedent (and relative to the population), less likely. If it is greater than 1, then the consequent is, given the antecedent (and relative to the population), more likely. Thus, in order to be valid, a rule should have a lift greater than 1.
In the above example, the lift for “play basketball → eat cereals” is 0.889 and the one for “play
basketball → not eat cereals” is 1.333 — so the lift favours the correct conclusion. Note that it is (like
in the example) quite frequent that the confidence and the lift favour opposing rules!
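The numbers in Aggarwal and Yu's example can be checked directly; a quick sketch:

## "play basketball -> eat cereals" in the example above
n <- 5000; nA <- 3000; nC <- 3750; nAC <- 2000
support    <- nAC / n                 # 0.40
confidence <- nAC / nA                # 0.667
lift       <- (nAC * n) / (nA * nC)   # 0.889 < 1: the rule is misleading

## "play basketball -> not eat cereals"
(1000 * n) / (nA * (n - nC))          # lift 1.333 > 1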
The apriori algorithm can not only be used to mine marketing databases. Creighton and Hanash (2003) use association rules to analyse gene expression data. They convert the continuous gene expression data into two types of dummies: “gene highly expressed” and “gene highly repressed”. Icev et al. (2003) propose an algorithm (distance-based association rule mining, DARM) that also takes the distance between the different genes in the item set into account. Whilst this seems rather useful from a biological point of view, it makes searching for frequent item sets more difficult as it breaks the apriori property. Furthermore, a slightly modified first step of the eclat algorithm can be used to find frequent fragments in complex molecules (see e.g. Borgelt and Berthold, 2002).

When using association rule mining techniques one should always bear in mind that they are nothing more than (sometimes useful) heuristics. If in doubt, it might be more fruitful (though probably more tedious as well) to fit a graphical model to the data. This allows similar conclusions whilst being based on sound statistical methods and not being plagued by the above-mentioned problems.
Chapter 6

Generalised Linear Models

6.1 The Linear Model


6.1.1 Overview
We consider the situation where there is one response variable, and it is continuous. Further, we
will assume that the response variable, or some function of it, is linearly related (we will discuss
later the definition of linearly related) to the explanatory variable(s). With this setup, it is common
to use a Linear Model or Linear Regression to model the relationship between the response and
explanatory variables. If there is only one explanatory variable, then the modelling process is referred
to as simple linear regression; when there are two or more explanatory variables, then the modelling
process is referred to as multiple linear regression. Generally, there is no restriction placed on the
explanatory variables: they can be continuous, ordinal, or nominal. However, in this introduction, we
will temporarily assume that they are continuous for the purpose of illustration.

6.1.2 Linear Model Setup


First, we simply write down what it means for a set of explanatory variables to be linearly related to a
response variable:
Y = β0 + β1 X1 + β2 X2 + . . . + βp Xp + ε
This means that a linear combination of the X’s can be used as a predictor of the response variable,
Y . The term β0 in this linear combination is often referred to as the intercept, whereas β1 , . . . , βp
are often referred to as the slopes; together, the βs are referred to as the regression coefficients or
regression parameters. As we will see below, the regression parameters allow us to make quantitative
statements about the effect that the Xs have on Y , particularly about how a change in each X variable
will affect Y. The term ε is often referred to as the error term and relates to the error which was
discussed above; if ε were 0, then the straight line would be uniquely and exactly determined, but as
indicated, this is unlikely to be the case with actual data. Here, we will assume that the only error in
the model has to do with the response variable: we will assume that the X variables are measured
without error (Assumption 1).
Note that, in the above equation, Y may actually be a (known) function (e.g., log) of the outcome
variable in its original form; similarly, any of the explanatory variables may be a (known) function
of one or more of the design variables or covariates. However, even if Y is actually the log of the


original outcome variable or if X1 is actually the square of one of the original covariates, we still refer
to this equation as linear. This is because linear refers to the coefficients rather than to the variables
in the above equation: for an equation to be linear in the coefficients, each coefficient must be in its
original form (not squared, cubed, etc...) and no additive term can involve more than one coefficient.
Further, even if the response and/or explanatory variables are defined as functions of the data set’s
original variables, Equation (1) is linear in terms of these newly defined variables (although it may
not be linear in terms of the original response variable and original covariates and design variables).
The relationship between these newly defined variables might be summarised by the relationships
illustrated in the figure below

Figure 6.1: Two linear relationships between a response variable and an explanatory variable

but not by either of the following relationships

Figure 6.2: Two non-linear relationships between a response variable and an explanatory variable

If we were to make a scatterplot of the newly defined response variable vs. the newly defined
explanatory variables, then the scatter of points should appear linear. For instance, suppose that Y in
the above is actually the log of the original outcome variable (called Y ′ ) and that there is only one
explanatory variable, X1 . Well, a plot of Y = log(Y ′ ) versus X1 should appear linear, although a plot
of the outcome variable on its original scale, Y ′ , versus X1 would not.
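A short simulated sketch of this situation in R: the response is linear in X1 only after a log transformation.

set.seed(1)
x1     <- runif(100, 0, 10)
y.orig <- exp(1 + 0.3 * x1 + rnorm(100, sd = 0.2))   # Y' on its original scale

fit <- lm(log(y.orig) ~ x1)       # linear model for Y = log(Y')
summary(fit)

par(mfrow = c(1, 2))
plot(x1, y.orig)                      # curved: not linear on the original scale
plot(x1, log(y.orig)); abline(fit)    # roughly linear on the log scale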
We will make the assumption that the regression equation correctly specifies the relationship between Y and X (i.e., that the relationship between Y and the explanatory variables is truly linear and that all explanatory variables that have an effect on Y have been included in the equation). Thus, we
assume that ε has mean zero (Assumption 2); further, we assume that ε is normally distributed (As-
sumption 3) with a constant (but unknown) variance, σ 2 (Assumption 4). Because of our assumption
that ε has a zero mean, the conditional mean of Y (i.e., conditional on X) can be written as a linear
combination of the X variables:

E[Y |X] = β0 + β1 X1 + β2 X2 + . . . + βp Xp
Further, as a result of our constant variance and normality assumptions, we can say that, condi-
tional on X, Y is normally distributed with a variance of σ 2 (for this reason, we will often refer to
σ 2 as the conditional variance). In other words, these three assumptions about ε are equivalent to
saying that there is a normally distributed subpopulation of responses for each combination of the
explanatory variables, that the means of the subpopulation fall on a straight-line (or a straight-plane)
function of the explanatory variables, and that the subpopulation variances are all equal to σ 2 .
We will state a fifth assumption, which is that the units are independent of each other: more
specifically, we assume that the error terms for the units are independent of each other (in a statistical
sense)(Assumption 5).

6.2 Generalised Linear Model


6.2.1 Motivation
In this lecture we extend the ideas of linear regression to the more general idea of a generalised linear
model (GLM). The essence of linear models is that the response variable is continuous and normally
distributed: here we relax these assumptions and consider cases where the response variable is non-
normal and in particular has a discrete distribution. Although these models are more general than
linear models, nearly all of the techniques for testing hypotheses regarding the regression coefficients,
and checking the assumptions of the model apply directly to a GLM. In addition, a linear model is
simply a special type of a generalised linear model and thus all of the discussion below applies equally
to linear models.
A typical statistical model can be expressed as an equation that equates the mean(s) of the response
variable(s) to some function of a linear combination of the explanatory variables:

E[Y |X = x] = η(β0 + β1 X1 + β2 X2 + . . . + βp Xp ) = η[LC(X; β)],


In the above equation, the form of the function η() is known, as are Y and X (the latter for a
particular choice of explanatory variables). However, the parameters of the model, β0 , β1 , . . . , βp , are
not known and must be estimated. The simple linear model is a special case of the above in which the
function η() is the identity function.
Generalised linear models extend the ideas underlying the simple linear model

E[Y |X] = β0 + β1 X1 + β2 X2 + . . . + βp Xp ,    Y |X ∼ N (µ, σ²)


where the Yi are independent, to the following more general situations:

1. Response variables can follow distributions other than the Normal distribution. They can be
discrete, or logical (one of two categories).

2. The relationship between the response and the explanatory variables does not have to take on
the simple linear form above.
GLMs are based on the exponential family of distributions that shares many of the desirable statis-
tical properties of the Normal distribution. Consider a single random variable Y whose probability
distribution depends only on a single parameter θ. The distribution of Y belongs to the exponential
family if it can be written in the form

f (y, θ) = exp[{yθ − b(θ)}/a(φ) + c(y, φ)]


where a, b and c are known functions. Furthermore, φ is a scale parameter that is treated as a
nuisance parameter if it is unknown. Some well-known distributions that belong to the exponential
family of distributions include the Normal, exponential, Poisson and binomial distributions. For ex-
ample, consider the discrete random variable Y that follows a Poisson distribution with parameter λ.
The probability function for Y is

f (y, λ) = λ^y e^(−λ) / y!

where y takes the values 0, 1, 2, . . .. This can be rewritten as

f (y, λ) = exp(y log λ − λ − log y!) = exp(yθ − exp(θ) − log y!) = f (y, θ).

This is exactly the form of the general exponential family with θ = log(λ), φ = 1, a(φ) = 1, b(θ) = exp(θ) and c(y, φ) = − log y!. Similarly, all other distributions in the exponential family can be written in this form.
In the case of GLMs we require an extension of numerical estimation methods to estimate the
parameters β in the linear model, to a more general situation where there is some non-linear function,
g, relating E(Yi ) = µi to the linear component xi ′β , that is

g(µi ) = xi ′β
where g is called the link function. The estimation of the parameters is typically based on maxi-
mum likelihood. Although explicit mathematical expressions can be found for the estimators in some
special cases, numerical optimisation methods are usually needed. These methods are included in
most modern statistical software packages. It is not the aim of this course to go into any detail on
estimation, since we will focus on the application of these models rather than their estimation.
GLMs are extensively used in the analysis of binary data (e.g. logistic regression) and count data
(e.g. Poisson and log-linear models). We will consider some of these applications. First we introduce
the basic GLM setup and assumptions.

6.2.2 Basic GLM setup and assumptions


We will define the generalised linear model in terms of a set of independent random variables Y1 , Y2 , . . . , YN
each with a distribution from the exponential family of distributions. The generalised linear model
has three components:
1. The random component: the response variables, Y1 , Y2 , . . . , YN , are assumed to share the same distribution from the exponential family of distributions introduced in the previous section, with E(Yi ) = µi . This part describes how the response variable is distributed.

2. The systematic component: covariates (explanatory variables) x1 , x2 , . . . , xp produce a linear predictor η given by

η = x1 β1 + x2 β2 + · · · + xp βp

3. The link function, g, between the random and systematic components. It describes how the covariates are related to the random component, i.e. ηi = g(µi ), where µi = E(Yi ); the link g can be any monotone differentiable function.

An important aspect of generalised linear models that we need to keep in mind is that they assume
independent (or at least uncorrelated) observations. A second important assumption is that there is
a single error term in the model. This corresponds to Assumption 1 for the linear model, namely, that
the only error in the model has to do with the response variable: we will assume that the X variables
are measured without error. In the case of generalised linear models we no longer assume constant
variance of the residuals, although we still have to know how the variance depends on the mean.
The variance function V (µi ) relates the variance of Yi to its mean µi . The form of this function is
determined by the distribution that is assumed.
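In R the link and variance functions belonging to a distribution are bundled in a family object; a small sketch of what such objects contain:

binomial()$linkfun(0.25)    # logit link applied to mu = 0.25: log(0.25/0.75)
binomial()$variance(0.25)   # V(mu) = mu * (1 - mu) = 0.1875
poisson()$variance(3)       # V(mu) = mu = 3
Gamma()$variance(3)         # V(mu) = mu^2 = 9
gaussian()$variance(3)      # V(mu) = 1: constant variance, as in the linear model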

6.2.3 Goodness of fit and comparing models


Overview
A very important aspect of generalised models, and indeed all statistical models, is to evaluate the
relevance of our model for our data and how well it fits the data. In statistical terms this is referred to
as goodness of fit. We are also interested in comparing different models and selecting a model that is
reasonably simple, but that provides a good fit. This involves finding a balance between the model fit
and complexity: we want to improve the model fit without increasing the complexity excessively. In a
statistical modelling framework we perform hypothesis tests to compare how well two related models
fit the data. In the generalised linear model framework, the two models we compare should have
the same probability distribution and link function, but they can differ with regards to the number
of parameters in the systematic component of the model. The simpler model relating to the null
hypothesis H0 is therefore a special case of the other more general model. If the simpler model fits the
data as well as the more general model it is retained on the grounds of parsimony and H0 cannot be
rejected. If the more general model provides a significantly better fit than the simple model, then H0
is rejected in favour of the alternative hypothesis H1 , which corresponds to the more general model.
In order to make these comparisons we need goodness of fit statistics to describe how well the models
fit. These statistics can be based on any of a number of criteria such as the maximum value of the log-
likelihood function, the minimum value of the sum of squares criterion or a composite statistic based
on the residuals. If fY (y, θ) is the density function for a random variable Y given the parameter θ,
then the log-likelihood based on a set of independent observations of Y , y1 , y2 , . . . , yn , is then defined
as
ℓ(θ, y) = Σi log f (yi , θi )

where θ = (θ1 , θ2 , . . . , θn ). It is important to note the subtle shift in emphasis from the density function. In the density function, f (y, θ) is considered as a function of y for fixed θ, whereas the

log-likelihood is primarily considered as a function of θ for the particular data observed y1 , y2 , . . . , yn .


In order to test the hypotheses above, sampling distributions of the goodness of fit statistics are
required. In the following subsection we consider one such goodness of fit criterion, the deviance, in
a bit more detail.
Finally, we will say something about the choice of scale for the analysis. It is an important aspect
of the model selection process, although scaling problems are considerably reduced in the generalised
linear model setup. The normality and constant variance assumption of the linear regression model is
for instance no longer a requirement. The choice of scale is largely dependent on the purpose for which
the scale will be used. It is also important to keep in mind that no single scale will simultaneously
produce all the desired properties.

The Deviance

One way of assessing the adequacy of a model is to compare it with a more general model that has the maximum number of parameters that can be estimated; this is referred to as the saturated model. In the
saturated model there is basically one parameter per observation. The deviance assesses the goodness
of fit for the model by looking at the difference between the log-likelihood functions of the saturated
model and the model under investigation, i.e. ℓ(bsat , y) − ℓ(b, y). Here bsat denotes the maximum
likelihood estimator of the parameter vector of the saturated model, β sat , and b is the maximum
likelihood estimator of the parameters of the model under investigation, β. The maximum likelihood
estimator is the estimator that maximises the likelihood function.
The deviance is defined as
D = 2{ℓ(bsat , y) − ℓ(b, y)}.

A small deviance implies a good fit. The sampling distribution of the deviance is approximately
χ2 (m − p, ν), where ν is the non-centrality parameter. The deviance has an exact χ2 distribution if
the response variables Yi are normally distributed. In this case, however, D depends on var(Yi ) = σ 2
which, in practice, is usually unknown. This prevents the direct use of the deviance as a goodness
of fit statistic in this case. For other distributions of the Yi , the deviance may only be approximately
chi-square. It must be noted that this approximation can be very poor for limited amounts of data. In
the case of the binomial and Poisson distributions, for example, D can be calculated and used directly
as a goodness of fit statistic. If the scale parameter φ is unknown or known to have a value other than one, we use a scaled version of the deviance,

D/φ,

which we call the scaled deviance.


The deviance forms the basis for most hypothesis testing for generalised linear models. Suppose
we are interested in comparing the fit of two models. These models need to have the same probability
distribution and the same link function. The models also need to be hierarchical, which means that
the systematic component of the simpler model M0 is a special case of the linear component of the
more general model M1 .
Consider the null hypothesis

H0 : β = β0 = (β1 , β2 , . . . , βq )′

that corresponds to model M0 , and a more general hypothesis

H1 : β = β1 = (β1 , β2 , . . . , βp )′

that corresponds to model M1 , with q < p < N .
We can test H0 against H1 using the difference of the deviance statistics

∆D = D0 − D1 = 2{ℓ(bsat , y) − ℓ(b0 , y)} − 2{ℓ(bsat , y) − ℓ(b1 , y)} = 2{ℓ(b1 , y) − ℓ(b0 , y)}

If both models describe the data well, then D0 follows a χ2 (N − q) distribution and D1 follows a
χ2 (N −p) distribution. It then follows that ∆D has a χ2 (p−q) distribution under certain independence
assumptions. If the value of ∆D is consistent with the χ2 (p−q) distribution we would generally choose
the model M0 corresponding to H0 because it is simpler. If the value of ∆D is in the critical region of
the χ2 (p − q) distribution we would reject H0 in favour of H1 on the grounds that model M1 provides
a significantly better description of the data. It must be noted that this model too may not fit the data
particularly well.
If the deviance can be calculated from the data, then ∆D provides a good method for hypothesis
testing. The sampling distribution of ∆D is usually better approximated by the chi-squared distribu-
tion than is the sampling distribution of a single deviance.
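A sketch of this test in R on simulated data: anova() with test = "Chisq" compares the deviances of two nested GLMs.

set.seed(1)
dat   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- rbinom(200, 1, plogis(-0.5 + dat$x1))          # x2 has no effect here

m0 <- glm(y ~ x1,      family = binomial, data = dat)   # simpler model M0
m1 <- glm(y ~ x1 + x2, family = binomial, data = dat)   # more general model M1

m0$deviance - m1$deviance        # Delta D
anova(m0, m1, test = "Chisq")    # compares Delta D with a chi-squared(p - q)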

Model Checking
As is the case for the simple linear model, we should perform model checking after we have fitted
a particular generalised linear model. As before we should look at whether the model is reasonable
and we should investigate whether the various assumptions we make when we fit and draw inference
using the model are satisfied. If the checks and investigations do reveal that there is a problem, then
there are a number of different solutions available to us.
There are various graphical techniques available for trying to detect systematic departures from our generalised linear model that may, for example, be the result of an incorrectly specified link function or variance function, or of a misspecification of the explanatory variables in the systematic component of the model. As was the case for the linear model, many of these graphical techniques entail using the residual values from our fitted model. For generalised linear models we require an extended definition of residuals, applicable to distributions other than the Normal distribution. Many of these residuals are discussed in detail in McCullagh and Nelder (1989). They can be used to assess the various aspects of the adequacy of a fitted generalised linear model mentioned above. The residuals should be unrelated to the explanatory variables, and they can also be used to identify any unusual values that may require further investigation. Various plots of the residuals can be used to assess these properties. Systematic patterns in the residual plots can, for example, be indicative of an unsuitable link function, a wrong scale for one or more of the predictors, or the omission of a quadratic term in a predictor. Examples of extended definitions of residuals that are widely used in model checking for GLMs include the Pearson and deviance residuals.

6.2.4 Applications of Generalised Linear Models


Overview

We will focus on some common applications of GLMs for different forms and scales of the response
and explanatory variables. We start by considering methods for analysing binary response data. We
will focus on the most common method to analyse binary response data, namely logistic regression. Lo-
gistic regression is used to model relationships between the response variable and several explanatory
variables, which may be discrete or continuous. We also discuss generalisations of logistic regression
for dichotomous responses to model nominal or ordinal responses with more than two categories.
The nominal logistic regression model will be introduced for situations where there is no natural order
among the response categories and the ordinal logistic regression model will be introduced for situ-
ations where the response categories are ordinal. We also look at the analysis of count data. Here
we are interested in the number of times an event occurs. We will distinguish between two different
situations. In the first situation, the events relate to varying amounts of ‘exposure’, which need to be
taken into account when modelling the rate of events. Poisson regression is used in this case. The other
explanatory variables (in addition to ‘exposure’) may be continuous or categorical. The counts may for
example be the number of traffic accidents, which need to be analysed in relation to some exposure
variable such as the number of registered motor vehicles. In the second situation, ‘exposure’ is con-
stant (and therefore not relevant to the model) and the explanatory variables are usually categorical.
If there are only a few explanatory variables the data are summarised in a cross-classified table. The
response is the frequency or count in each cell of the table. The variables used to define the table are
all treated as explanatory variables. The study design may mean that there are certain constraints on
the cell frequencies (for example, the totals of the frequencies in each row of the table may be equal)
and this need to be taken into account in the modelling. Log-linear models are appropriate for this
situation. We will consider an example relating to each of the above situations. Finally, we will give a
short overview of further applications of generalised linear models.

6.2.5 Binary Data and Logistic Regression


Introduction

Statistical regression models for binary response variables are of great importance in many fields of
application. The categories for binary response variables are typically coded zero or one. For example,
let Y be a response that indicates whether or not a subject developed lung cancer. Subjects who developed lung cancer will be coded with a one and the remainder with a zero. The mean of Y in this case is the probability of developing lung cancer. Regression models for binary
responses are used to describe probabilities as functions of explanatory variables.
Linear regression models are not suited to binary responses as they allow responses outside the interval [0, 1], which is not permitted for probabilities. Logistic regression provides a satisfactory alternative for such problems. It describes a function of the mean (which is a probability here) as a function of the explanatory variables. The function of the mean it uses is the logit function, i.e. the logarithm of the odds.
The estimated coefficients from a logistic regression fit are interpreted in terms of odds and odds
ratios, where these coefficients are estimated using maximum likelihood methods.
A logistic regression model can be extended to problems where we wish to model the posterior
probabilities of K classes via linear functions in the explanatory variables. We will define the model
for both the binary and K class setup. However, we will mainly focus on binary responses here and
only consider the K class setup as a direct extension.

The logistic regression model

The natural link for a binary response variable is the logit (or log-odds) function. In this case it is conventional to use π to signify the mean response, so that it is emphasised that it is a probability. The logit link is

g(π) = logit(π) = log[ π / (1 − π) ]
so that the logistic regression model is

logit(π) = β0 + β1 X1 + · · · + βp Xp

The inverse of this function is called the logistic function. So if logit(π) = η, then

π = exp(η)/[1 + exp(η)]

Furthermore, note that in ordinary regression the variance, Var{Y |X1 , . . . , Xp } = σ 2 , is constant
and does not depend on the values of the explanatory variables. However, the variance of a population
of binary response variables with mean π is π(1 − π).

Logistic Regression Model

µ = µ{Y |X1 , . . . , Xp } = π; Var{Y |X1 , . . . , Xp } = π(1 − π)


logit(π) = β0 + β1 X1 + · · · + βp Xp

The logistic regression model for K classes

When we have K classes we can specify the logistic regression model in terms of K − 1 logit trans-
formations. Although the model uses the last class as the denominator or baseline in the definition
below, the choice of denominator is arbitrary in that the estimates are equivariant under this choice.

Logistic Regression Model for K classes



log[ P(G = 1 | X1 , . . . , Xp ) / P(G = K | X1 , . . . , Xp ) ] = β10 + β11 X1 + · · · + β1p Xp

log[ P(G = 2 | X1 , . . . , Xp ) / P(G = K | X1 , . . . , Xp ) ] = β20 + β21 X1 + · · · + β2p Xp

...

log[ P(G = K − 1 | X1 , . . . , Xp ) / P(G = K | X1 , . . . , Xp ) ] = βK−1,0 + βK−1,1 X1 + · · · + βK−1,p Xp
When K = 2 we have the simple binary logistic regression model defined before.
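One way to fit such a K-class model in R is multinom() from the nnet package (assumed here); the iris data serve only as a stand-in example with K = 3 classes.

library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, Hess = TRUE)
coef(fit)       # one row of coefficients per logit; the first level of
                # Species ("setosa") is used as the baseline class
summary(fit)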

Interpretation and estimation of coefficients

In the logistic regression model the odds for a positive response (i.e. Y = 1) given X1 , . . . , Xp are

Interpreting the coefficients

Odds that Y = 1 : ω = exp(β0 + β1 X1 + · · · + βp Xp )

Odds ratio

The ratio of the odds at X1 = A relative to the odds at X1 = B, for fixed values of the other X’s, is

ωA /ωB = exp[β1 (A − B)]

Thus, if X1 increases by 1 unit, the odds that Y = 1 will change by a multiplicative factor of exp(β1 ),
other variables being the same.
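A sketch in R with simulated data, showing how fitted coefficients are turned into odds ratios:

set.seed(1)
x1 <- rnorm(300)
x2 <- rbinom(300, 1, 0.5)
y  <- rbinom(300, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))

fit <- glm(y ~ x1 + x2, family = binomial)
coef(fit)                      # estimates on the logit scale
exp(coef(fit))                 # multiplicative change in the odds per unit increase
exp(confint.default(fit))      # Wald confidence intervals on the odds scale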

In generalised linear models the method of maximum likelihood is used to estimate the parameters.
When the values of the parameters are given, the logistic regression model specifies how to calculate
the probability that a specific outcome will occur (e.g. y1 = 1, y2 = 0, y3 = 1 etc.). Suppose that there
are n independent responses, with πi the mean response for observation i, then

P (Y1 = y1 , . . . , Yn = yn ) = ∏ᵢ πi^yi (1 − πi)^(1−yi)    (product over i = 1, . . . , n)

for the outcomes y1 , . . . , yn . This depends on the parameters through π, so we can compute the
probability of the observed outcome under various possible choices for the β’s. This is known as the
likelihood function. Below we give a simple description of maximum likelihood estimation without
going into technical details.
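To make this concrete, the sketch below writes down the log of the likelihood above and evaluates it for different candidate values of β (simulated data, logit link):

loglik <- function(beta, y, X) {
  p <- plogis(X %*% beta)                    # pi_i for each observation
  sum(y * log(p) + (1 - y) * log(1 - p))     # log of the product above
}

set.seed(1)
X <- cbind(1, rnorm(100))                    # intercept and one covariate
y <- rbinom(100, 1, plogis(X %*% c(-0.5, 1)))

loglik(c(-0.5, 1), y, X)    # close to the truth: relatively large value
loglik(c(2, -2),   y, X)    # a poor candidate: typically much smaller
coef(glm(y ~ X - 1, family = binomial))      # the values maximising loglik()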

Likelihood

The likelihood of any specified values for the parameters is the probability of the actual outcome,
calculated with those parameter values.

Maximum Likelihood Estimates

The estimates of the parameters that assign the highest probability to the observed outcome (which
maximise the likelihood) are the maximum likelihood estimates.

Properties of Maximum Likelihood Estimates

If the model is correct and the sample size is large enough, then:

1. The maximum likelihood estimators are essentially unbiased

2. Standard deviations of the sampling distributions of the estimators can easily be obtained

3. The estimators are about as precise as any nearly unbiased estimators that can be obtained

4. The sampling distributions are nearly normal.

The properties of the estimates imply that each estimated coefficient βj in the logistic regression
has an approximately normal sampling distribution.

Z-ratio = (β̂j − βj )/SE(β̂j ) ∼ Standard Normal (approx)

This can be used to construct confidence intervals or to obtain p-values for tests about the individ-
ual coefficients. A test based on the above approximate normality is referred to as Wald’s test.

The Drop-in-Deviance Test


It is often of interest to judge the adequacy of a reduced model relative to the full model. The reduced model is obtained from the full model by excluding some of the explanatory variables from the full set of explanatory variables. This corresponds to the hypothesis that several of the coefficients in the full model equal zero. We now need a measure for comparing such a reduced model to the full model including all the explanatory variables. We can use a likelihood ratio test, sometimes known as the drop-in-deviance test, for this purpose.
Above we mentioned that the maximum likelihood parameter estimates for a logistic regression model maximise the corresponding likelihood for the model. If we now represent the maximum of the likelihood for the full model by LMAXfull and the maximum of the likelihood of the reduced model by LMAXreduced, then the general form of the likelihood ratio test statistic is

LRT = 2 log(LMAXfull) − 2 log(LMAXreduced)

Under the null hypothesis (i.e. if the reduced model is correct), the LRT has approximately a chi-square distribution with degrees of freedom equal to the difference between the numbers of parameters in the full and reduced models. Inadequacies of the reduced model tend to inflate the LRT statistic, so that we will reject the reduced model. Note that the LRT depends on the mathematical form of the likelihood function, which we will not present here. A quantity called the deviance is often reported by software packages, rather than the maximum likelihood, where

deviance = constant − 2 log(LMAX)

and the constant is the same for the full and reduced models. It follows that

LRT = deviancereduced − deviancefull

This represents the drop in the deviance due to the additional parameters included in the full model. For this reason the likelihood ratio test is often referred to as the drop-in-deviance test. This test can, for example, be used to test whether all coefficients apart from the constant term are zero, taking the reduced model as the one with no explanatory variables. Similarly, we can test for the significance of individual coefficients by taking as the reduced model the full model minus the single term that we are testing for (note that this is not the same as Wald’s test).

Some General Comments

Graphical methods
One can construct preliminary logistic regression models that include polynomial terms and/or in-
teraction terms to judge whether the desired model needs expansion to compensate for very specific
problems.

Model selection
The range of possible models can often be very large. An Information Criterion (AIC) is a model selec-
tion device that emphasises parsimony by penalising models for having large numbers of parameters.
This can be used as a guide for choosing an appropriate model.

Discriminant Analysis and Logistic Regression

Logistic regression might be used to predict future binary responses by fitting a logistic regression model to a training set. This model can then be applied to a set of explanatory variables for which the binary response is unknown. The estimated probability π obtained from the logistic regression model might be used as an estimate of the probability of a “positive” response.
Logistic regression prediction problems are similar in structure to problems in discriminant analysis. In fact, it can be shown that they can be written in exactly the same form (see Hastie et al. (2001), p. 103−104). The difference between the approaches lies in the way the coefficients are estimated. The logistic regression model is more general in that it makes fewer assumptions. In discriminant analysis a

parametric form is assumed for the marginal density of X. Observations in each group are assumed to follow a Gaussian distribution. In the case of logistic regression we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using the empirical distribution function which places a mass of 1/N at each observation.
What difference do the additional assumptions make? The additional model assumptions of discriminant analysis have the advantage that they give us more information about the parameters, which means that we can estimate them more efficiently (with lower variance). However, discriminant analysis can also be badly affected by outliers, and the distributional assumptions may not hold. Logistic regression prediction is perhaps preferable to discriminant analysis when the explanatory variables have nonnormal distributions (e.g. when they are categorical). There is no general answer as to which method is to be preferred. Logistic regression requires fewer parameters (and is thus less likely to be unstable) and the coefficients are nicely interpretable (see above). However, if the assumption that the classwise densities are Gaussian is true, then discriminant analysis is more efficient, i.e. fewer observations are necessary to obtain the same precision as with logistic regression (Efron, 1975). Furthermore, discriminant analysis can handle completely separated datasets. Logistic regression is nearly as efficient as discriminant analysis when the explanatory variables do follow normal distributions. Therefore, when discrimination between two groups is the goal, logistic regression should always be considered as an appropriate tool.
Hastie et al. (2001) report that it is generally felt that logistic regression is a safer, more robust approach than discriminant analysis. In their experience the two models give very similar results though, even when discriminant analysis is used inappropriately (e.g. with categorical explanatory variables). For a more detailed discussion of logistic discrimination refer to Ripley (1996) and Hastie et al. (2001).

Example: Birdkeeping and lung cancer

Background
Ramsey and Schafer (2002) introduce data based on that of Holst et al. (1988). These data are from a study that investigated birdkeeping as a risk factor for developing lung cancer. The study is based on a case-control study of patients in 1985 at four hospitals in The Hague, Netherlands. The researchers identified 49 cases of lung cancer among patients who were registered with a general practice, who were aged 65 or younger, and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure. Data were gathered on the following variables for each subject:

FM = Sex (1=F, 0=M)
AG = Age, in years
SS = Socioeconomic status (1=High, 0=Low), based on wages
     of the principal wage earner in the household
YR = Years of smoking prior to diagnosis or examination
CD = Average rate of smoking, in cigarettes per day
BK = Indicator of birdkeeping (caged birds in the home for more than
     6 consecutive months from 5 to 14 years before diagnosis (cases) or
     examination (controls))

It is known that smoking and age are both associated with lung cancer incidence. The research
question of interest is the following: after age, socioeconomic status and smoking have been controlled
for, is an additional risk associated with birdkeeping?

Data Exploration
In figure 6.3 we present a coded scatterplot of the number of years that a subject smoked, versus
the individual’s age. From the plot it appears as though for individuals of similar ages, the persons
with lung cancer are more likely to have been relatively long-time smokers. To investigate the effect
of birdkeeping we can compare subjects with similar smoking history and ask whether more of the
birdkeepers amongst them tend to have cancer. Do you observe any trends here?
We construct a further exploratory plot by plotting the sample logits, log[π̂/(1 − π̂)], against the number of years of smoking. In this plot π̂ is the proportion of lung cancer patients among subjects
grouped into similar years of smoking (0, 1 − 20, 21 − 30, 31 − 40, 41 − 50). The midpoint of the interval
was used to represent the years of smoking for the members of the group. The resulting plot is given
in figure 6.4. Does the logit increase linearly with years of smoking? How can we test this more
formally?

Figure 6.3: Coded scatterplot of years of smoking versus age of subject

Finding a suitable model


We approach the problem by first fitting a fairly rich model to the data. We include years of smoking,
birdkeeping, cigarettes per day, and squares and interaction of these in the model. One approach to
obtain the simplest possible model that fits the data well is to perform a backward elimination of terms
based on the AIC. Below we present the output from the full model.

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2776698 2.3858069 0.116 0.907348


Figure 6.4: Plot of sample logits versus years smoking for data grouped by intervals (see text) of years
of smoking

SEXMALE 0.6155883 0.5450581 1.129 0.258729


SSLOW 0.1865263 0.4746561 0.393 0.694341
AG 0.0381615 0.0398243 0.958 0.337938
YR -0.0283392 0.0961924 -0.295 0.768292
CD -0.1500281 0.1058075 -1.418 0.156210
BKNOBIRD 1.3795187 0.4123760 3.345 0.000822
I(YR^2) -0.0009586 0.0019860 -0.483 0.629310
I(CD^2) 0.0020067 0.0021790 0.921 0.357094
YR:CD 0.0011864 0.0024376 0.487 0.626455

After a stepwise model selection based on the AIC we obtain a much simplified model. The R output for this model selection procedure is given below. We see that all terms can be dropped except for the YR and BK terms.

Stepwise Model Path


Analysis of Deviance Table

Initial Model:
LC ~ SEX + SS + AG + YR + CD + BK + I(YR^2) + I(CD^2) + YR *
CD

Final Model:
LC ~ YR + BK

Step Df Deviance Resid. Df Resid. Dev AIC


1 137 152.5932 172.5932
2 - SS 1 0.1540821 138 152.7473 170.7473
3 - YR:CD 1 0.2366595 139 152.9839 168.9839
4 - I(YR^2) 1 0.0724717 140 153.0564 167.0564
5 - AG 1 0.7160968 141 153.7725 165.7725
6 - SEX 1 1.0987324 142 154.8712 164.8712
7 - I(CD^2) 1 1.8761359 143 156.7474 164.7474
8 - CD 1 1.3670333 144 158.1144 164.1144

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.70460392 0.56266766 3.029504 0.0024495567
YR -0.05824972 0.01684527 -3.457927 0.0005443496
BKNOBIRD 1.47555151 0.39587626 3.727305 0.0001935383

Interpretation
The estimate for the coefficient of birdkeeping is 1.4756, so the odds of lung cancer for birdkeepers
are estimated to be exp(1.4756) = 4.37 times as large as the corresponding odds for non-birdkeepers.
This is after the other explanatory variables are accounted for. A 95% confidence interval for the odds
of lung cancer for birdkeepers relative to the odds of lung cancer for non-birdkeepers is [2.01, 9.50].

6.2.6 Count Data

Introduction

This section focuses on the analysis of count data. Here we are interested in the number of times
an event occurs. We will distinguish between two different situations, as we have mentioned earlier.
In the first situation, the events relate to varying amounts of ‘exposure’, which need to be taken
into account when modelling the rate of events. Poisson regression is used in this case. The other
explanatory variables (in addition to ‘exposure’) may be continuous or categorical. The counts may
for example be the number of traffic accidents, which need to be analysed in relation to some exposure
variable such as the number of registered motor vehicles. In the second situation, ‘exposure’ is constant
(and therefore not relevant to the model) and the explanatory variables are usually categorical. If there
are only a few explanatory variables the data are summarised in a cross-classified table. The response
is the frequency or count in each cell of the table. The variables used to define the table are all treated
as explanatory variables. The study design may mean that there are certain constraints on the cell
frequencies (for example, the totals of the frequencies in each row of the table may be equal) and this
needs to be taken into account in the modelling. Log-linear models are appropriate for this situation.
We will consider an example relating to each of the above situations.

Poisson Regression
Let Y1 , . . . , YN be independent random variables with Yi denoting the number of events observed from
exposure ni for the ith covariate class. The expected value of Yi can be written as

E(Yi ) = µi = ni θi .

For example, suppose Yi is the number of insurance claims for a particular make and model of
car. This will depend on the number of cars of this type that are insured, ni , and other variables that
affect θi , such as the age of the cars and the location where they are used. The subscript i is used to
denote the different combinations of make, model, age, location and so on. The dependence of θi on
the explanatory variables is usually modelled by

θi = eβ0 +β1 x1 +···+βp xp .

This can be modelled within the generalised linear model set-up as

E(Yi ) = µi = ni eβ0 +β1 x1 +···+βp xp ; Yi ∼ Poisson(µi ).

The natural link function is the logarithmic function

log µi = log ni + β0 + β1 x1 + · · · + βp xp .

The inclusion of the term log ni differs from the usual specification of the linear component. This
term is called the offset and is a known constant that is incorporated into the estimation of the param-
eters β0 , . . . , βp .
The parameter estimates are often interpreted on the exponential scale eβ in terms of ratios of
rates. For a binary explanatory variable denoted by an indicator variable, xj = 0 if the factor is absent
or xj = 1 if it is present, the rate ratio, RR, for presence versus absence is defined as

RR = E(Yi | present) / E(Yi | absent) = eβj .

Similarly, for a continuous explanatory variable xk , a one-unit increase will result in a multiplica-
tive effect of eβk on the rate µ. Hypotheses about the parameters can be tested using the Wald, score
or likelihood ratio statistics. Confidence intervals can also be constructed similarly. Alternatively, hy-
pothesis testing can be performed by comparing the goodness of fit of appropriately defined nested
models.

Goodness of fit
Pearson and deviance residuals can be calculated for Poisson regression models to evaluate the fit.
The deviance and X 2 statistics can be used directly as measures of goodness of fit, as they can be
calculated from the data and the fitted model (there are no nuisance parameters such as σ 2 for the
Normal distribution). Two other summary statistics that are often provided in software packages
for Poisson regression models are the likelihood ratio chi-square statistic and the pseudo-R2 . These
statistics are based on comparisons between the maximum value of the log-likelihood function for a
minimal model with no covariates and the maximum value of the log-likelihood function for a model
with p parameters.

Table 6.1: Smoke Data

                       Smokers                   Non-Smokers
Age Group        Deaths   Person-years      Deaths   Person-years
35–44                32          52407           2          18790
45–54               104          43248          12          10673
55–64               206          28612          28           5710
65–74               186          12663          28           2585
75–84               102           5317          31           1462

Example
The data in Table 6.1 are from a study conducted in 1951, in which all British doctors were sent a brief
questionnaire about their smoking status. Since then, information on their deaths has been collected.
Table 6.1 shows the numbers of deaths from coronary heart disease among male doctors 10 years after
the survey. It also shows the total number of person-years of observation at the time of the analysis.
This is the ‘exposure’ that was mentioned above.

The questions of interest were:

1. Is the death rate higher for smokers than for non-smokers?

2. If so, by how much?

3. Is the differential effect related to age?

As an illustration we will consider one Poisson regression model that describes the data well (there
is more than one). The model we will consider is

log(deathsi ) = log(populationi ) + β1 + β2 smokei + β3 agecati + β4 agesqi + β5 smkagei .

The subscript i denotes the ith subgroup defined by age group and smoking status (i = 1, . . . , 5
for ages 35–44, . . . , 75–84 for smokers and i = 6, . . . , 10 for the corresponding age groups for
non-smokers). The term deathsi denotes the expected number of deaths and populationi denotes
the number of doctors at risk in group i. smokei is an indicator variable for smoking status that
takes the value one for smokers and zero for non-smokers; agecati takes values 1, . . . , 5 for age
groups 35–44, . . . , 75–84; agesqi is the square of agecati , included to allow for a non-linear increase
of the death rate with age; smkagei is an extra age term that is equal to agecati for
smokers and zero for non-smokers, thus describing different rates of increase in the number of deaths
for smokers and non-smokers. The model output is given below.
The Wald statistics to test βj = 0 all have very small p-values and the confidence intervals for eβj
do not contain unity, showing that all the terms are needed in the model. The estimates show that
the risk of coronary deaths was, on average, about four times higher for smokers than non-smokers
(based on the rate ratio for smoke), after the effect of age is taken into account. The goodness of fit
measures X 2 = 1.550 and D = 1.635 show that the fit for the model is good, when compared to the
appropriate χ2 distribution.
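A minimal R sketch of how a model of this form could be fitted to the data in Table 6.1 follows; the data frame and variable names are assumptions, not part of the original analysis.

## data from Table 6.1 (deaths from coronary heart disease among British male doctors)
doctors <- data.frame(
  deaths = c(32, 104, 206, 186, 102,  2, 12, 28, 28, 31),
  py     = c(52407, 43248, 28612, 12663, 5317, 18790, 10673, 5710, 2585, 1462),
  smoke  = rep(c(1, 0), each = 5),      # 1 = smoker, 0 = non-smoker
  agecat = rep(1:5, times = 2)          # age groups 35-44, ..., 75-84
)
doctors$agesq  <- doctors$agecat^2               # allows a non-linear increase with age
doctors$smkage <- doctors$agecat * doctors$smoke # different age slope for smokers

## Poisson regression with log(person-years) as offset
fit <- glm(deaths ~ smoke + agecat + agesq + smkage,
           offset = log(py), family = poisson, data = doctors)
summary(fit)
exp(coef(fit))   # parameter estimates on the rate-ratio scale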

Log-linear Models
In the case of log-linear models there is no single variable that clearly stands out as a response, thus
all the variables are treated alike. Typical examples involve counts in a Poisson or Poisson-like process.
These counts are treated as independent Poisson variables. If there are only a few variables the data
are summarised in a cross-classified table, called a contingency table. We define a contingency table as
follows: divide each variable into categories (if necessary) and count the number of cases falling into
each of the possible combinations of categories. It is these counts that we wish to model. So if Yi is the
random variable representing the count in the ith cell of the contingency table, then Yi ∼ Poisson(µi ),
where E(Yi ) = µi and var(Yi ) = µi .
The log-link is therefore the natural choice for the link function in a GLM set-up: the log of the
mean of each cell is equal to the linear predictor, i.e.

log(µi ) = ηi = β0 + β1 x1 + . . . + βp xp .

The term log-linear model is used to describe this relationship between the means of the cell counts
and the linear predictor through the log-link. For contingency tables the main questions almost always
relate to associations between variables. Therefore, in log-linear models, the terms of primary interest
are the interactions involving two or more variables. We see that there is no offset term as was the
case for Poisson regression. In this case ‘exposure’ is constant and therefore not relevant to the model.
This follows from certain constraints on the cell frequencies (for example, the totals of the frequencies
in each row of the table may be equal).

Goodness of fit
The adequacy of fit of a log-linear model can be assessed through the D and X 2 goodness of fit
statistics as before. More insight into model adequacy can often be obtained by examining the Pearson
or deviance residuals. Hypothesis tests can be conducted by comparing the difference in goodness of
fit statistics between a general model corresponding to an alternative hypothesis and a nested, simpler
model corresponding to a null hypothesis. We will illustrate some of these methods in the example
below.

Example
We consider an example involving a 2 × 2 × 2 contingency table that contains data from a study of the
effects of racial characteristics on whether individuals convicted of homicide receive the death penalty
(Agresti, 1990).
If we fit all models from the simplest model, containing just the main effects, to the most complex
model containing all interactions between the three variables, it follows from goodness of fit tests (in
this case the likelihood ratio and Pearson chi-square tests) that only two models fit adequately at a
0.05 level of significance. The model containing interaction terms between the victim’s race and the
penalty and between the victim’s race and the defendant’s race is the simplest model that provides a
good fit. According to this model, the death penalty verdict is independent of the defendant’s race, if
we are given the victim’s race. This model can be specified as

log(µijk ) = α + λ^D_i + λ^V_j + λ^P_k + λ^{DV}_{ij} + λ^{VP}_{jk} ,     i, j, k = 1, 2;

The results for the SPSS fit of the above model are given below.

Goodness-of-fit Statistics

Chi-Square DF Sig.

Likelihood Ratio 1.8819 2 .3903


Pearson 1.4313 2 .4889
--------------------------------
GENERAL LOGLINEAR ANALYSIS
--------------------------------

Parameter Estimates

Asymptotic 95% CI
Parameter Estimate SE Z-value Lower Upper

1 4.5797 .1011 45.31 4.38 4.78


2 -2.8717 .4196 -6.84 -3.69 -2.05
3 .0000 . . . .
4 -2.4375 .3476 -7.01 -3.12 -1.76
5 .0000 . . . .
6 -.5876 .1639 -3.59 -.91 -.27
7 .0000 . . . .
8 1.0579 .4635 2.28 .15 1.97

9 .0000 . . . .
10 .0000 . . . .
11 .0000 . . . .
12 3.3116 .3786 8.75 2.57 4.05
13 .0000 . . . .
14 .0000 . . . .
15 .0000 . . . .

The Pearson chi-square and Likelihood ratio goodness of fit statistics both show that the model
provides a good fit to the data. The parameter estimates for our model follow from the above as:

α = 4.5797,  λ^D_1 = −2.4375,  λ^V_1 = −0.5876,  λ^P_1 = −2.8717,  λ^{DV}_{11} = 3.3116,  λ^{VP}_{11} = 1.0579.

The parameters are computed under the constraints that λ^D_2 = λ^V_2 = λ^P_2 = 0 and
λ^{DV}_{12} = λ^{DV}_{21} = λ^{DV}_{22} = λ^{VP}_{12} = λ^{VP}_{21} = λ^{VP}_{22} = 0.
Now the model can be interpreted by computing odds ratios. We will not go into this
analysis, but interested readers are referred to Agresti (1990).
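In R, the same log-linear model could be fitted as a Poisson GLM. The sketch below assumes a data frame penalty with factors D (defendant's race), V (victim's race), P (death penalty verdict) and the cell counts in a column count; the counts themselves are not reproduced here and would have to be taken from Agresti (1990).

## log-linear model with the DV and VP interactions
fit <- glm(count ~ D + V + P + D:V + V:P, family = poisson, data = penalty)
summary(fit)

## deviance and Pearson goodness-of-fit statistics against the saturated model
c(deviance = deviance(fit),
  pearson  = sum(residuals(fit, type = "pearson")^2))
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)   # p-value for the deviance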

6.2.7 Survival Analysis and Repeated Measures


There are two important types of data that we have not discussed, which can also be modelled with
GLMs (and other methods). We will not discuss these types of data in detail here. We will merely
mention these as types of data that require their own approaches to modelling. The first of these is
survival data: the time from some well-defined starting point until some event, called ‘failure’, occurs.
These times are non-negative and typically have skewed distributions with long tails.
The analysis of survival data is the topic of numerous books and most statistical software packages have
procedures to implement appropriate statistical methods for this data. The exponential distribution
plays a central role in the analysis of survival data.
In all the models that we have considered so far we have assumed that the outcomes Yi , i = 1, . . . , n
are independent. There are two common situations where this assumption is implausible. In one
situation the outcomes are repeated measures over time of the same subjects. This is an example of
longitudinal data. Longitudinal data for a group of subjects are likely to exhibit correlation between
successive measurements. The other situation in which data are likely to be correlated is where
they are measurements on related subjects; for example, the weights of samples of women aged 40
years selected from specific locations in different countries. This kind of data is sometimes referred
to as clustered data. The term repeated measures is used to describe both longitudinal and clustered
data. Methods such as repeated measures ANOVA and generalised estimating equations are examples of
methods to deal with these kinds of data.
Chapter 7

Non-parametric classification tools

7.1 k-nearest neighbours


A very simple but rather powerful approach to classification is the nearest-neighbour rule. It classifies
a new observation xnew into the class of its nearest neighbour. This rule is, however, very sensitive
to outliers, but it can easily be generalised to the k-nearest neighbour (k-nn) rule. The k-nn
rule determines the k nearest neighbours1,2 and performs a majority vote amongst them. Sometimes
it is useful to include a “doubt” option: if the proportion of the winning class is less than a specified
threshold, no prediction is made at that location.
There is another way of seeing the k-nn algorithm. For every point we estimate the posterior
probability p(l|x) that xi is in class l by the proportion of the observations of class l amongst the
neighbours of xi . We then classify xi into the class with the highest posterior probability.
The idea behind this is that the neighbours are close to xi and thus have similar posterior probabil-
ities. Increasing the value of k thus reduces the variance of the classifier, as we use more observations
to calculate the posterior probability (law of large numbers!). Increasing the value of k however in-
creases the bias as we use examples that are more and more distant from xi and thus might have
different posterior probabilities. We have to make a bias-variance trade-off. Figure 7.1 shows the deci-
sion boundaries obtained by applying the k-nn algorithm with different values of k to the quarter-circle
dataset.
Normally, the Euclidean distance is used to determine the nearest neighbours. Note that this means
that the variables have to be rescaled appropriately before the k-nn algorithm is applied. However,
k-nn works well with other metrics as well. For digit recognition invariant metrics are used. Rota-
tion invariant metrics remove the effect of rotation (handwriting can be more or less slanted) before
distances are calculated. (see e.g. Simard et al. 1993).
Note that the k-nn rule is susceptible to the curse of dimensionality : for high-dimensional data, the
“nearest” neighbours can be rather far away. This causes the bias to increase, as we average over wide
areas. One way to deal with this problem is to use adaptive metrics like the discriminant adaptive
nearest-neighbour (DANN) (Hastie & Tibshirani, 1996) rule. The basic idea behind this approach is
to stretch out the neighbourhood into directions for which the class probabilities don’t change much.
1. The k nearest neighbours are the k training points x_{i_j} with the smallest distance ‖x_{i_j} − xnew ‖.
2. In case of a tie in the distance, one can either pick k points at random or use all points not further away than the k-th nearest neighbour.


Figure 7.1: Decision boundaries obtained using the k-nn rule once with k = 1 neighbour and once
with k = 5 neighbours

Note that k-nn requires the whole training dataset to be stored in order to be able to classify new
observations. This can pose quite a serious problem, as training datasets can exceed several gigabytes
in size. Various data editing techniques have been proposed to deal with this matter. Their aim is
to find a subset of training examples that suffices for prediction and to discard the remaining part of
the training set (see e.g. Devijver and Kittler, 1982).
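As an aside, a minimal sketch of how the k-nn rule can be applied in R with the class package; the data objects used below are assumptions.

library(class)   # knn()
## assumed: matrices X.train, X.test of (appropriately rescaled) covariates and
## factors y.train, y.test with the class labels
y.hat <- knn(train = X.train, test = X.test, cl = y.train, k = 5)
table(y.hat, y.test)     # confusion matrix on the test set
mean(y.hat != y.test)    # test set error rate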
Table 7.1 gives the error rates of the k-nn algorithm applied to the breast tumour dataset, once
using just the “best” 30 genes and once using the “best” 4,000 genes.

(a) 30 genes
                          Test set error rate
k = 1 neighbour                         0.33
k = 10 neighbours                       0.22
k = 20 neighbours                       0.25

(b) 4,000 genes
                          Test set error rate
k = 1 neighbour                         0.33
k = 20 neighbours                       0.17
k = 30 neighbours                       0.44

Table 7.1: Test set error rate for the k-nn algorithm with different numbers of neighbours applied to the breast tumour dataset

7.2 Learning vector quantisation

For the k-nn classifier we need to store the whole training dataset in order to be able to predict (unless
we use data editing techniques). The idea of learning vector quantisation (LVQ) (Kohonen, 1988)

is to store only “typical examples” of each class. Like in vector quantisation we look for an optimal
codebook, but now we look for optimal prototypes for each class.
Consider once again a problem with L classes. We want to find R prototypes per class. Let ξ lr
denote the r-th prototype of the l-th class.
The LVQ algorithm is an online algorithm, i.e. the training data is presented repeatedly on a one
by one basis.3 Every time a training point xi is presented, one determines the closest prototype ξ lr . If
this prototype is of the same class as xi then it is moved closer to xi by updating

ξ lr → ξ lr − α · (ξ lr − xi ).

If the prototype ξ lr is not of the same class as xi , then it is moved away from xi by updating

ξ lr → ξ lr + α · (ξ lr − xi ).

The idea of the LVQ algorithm is that prototypes of the correct class are “attracted”, whereas the
wrong prototypes are “repelled”. Here α is the learning rate; it governs how much the prototypes
are moved. During the iteration process the learning rate α is decreased to 0. Figure 7.2 illustrates the
run of the algorithm.
In practice it is useful to use different and adaptive learning rates for each prototype and to compute
the current learning rate αlr(h) for the r-th prototype of the l-th class from its last value αlr(h − 1) by
setting4

    αlr(h) = αlr(h−1) / (1 + αlr(h−1))   if the current training point is correctly classified,     (7.1)
    αlr(h) = αlr(h−1) / (1 − αlr(h−1))   if the current training point is incorrectly classified.

The idea behind this approach is that the learning rate should be increased if there are still data points
of the wrong class close to the prototype. This causes the algorithm to converge faster away from
these “wrong” points. On the other hand, if the example is correctly classified one can assume that the
prototype is in a “good” neighbourhood, so the learning rate can be decreased.
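A minimal sketch of fitting LVQ in R with the class package is given below; the data objects are assumptions, and olvq1() implements the adaptive-learning-rate variant mentioned in footnote 4.

library(class)   # lvqinit(), olvq1(), lvqtest()
## assumed: matrix X.train of covariates, factor y.train of class labels,
## and corresponding X.test, y.test
set.seed(1)
codebook <- lvqinit(X.train, y.train, size = 20)   # initial codebook of 20 prototypes
codebook <- olvq1(X.train, y.train, codebook)      # OLVQ1: adaptive learning rates
y.hat    <- lvqtest(codebook, X.test)              # classify by the nearest prototype
mean(y.hat != y.test)                              # test set error rate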
Table 7.2 gives the results of applying the learning vector quantisation algorithm to the breast
tumour dataset. Adaptive learning rates as in (7.1) were used.

(a) 30 genes
                               Error rate
                        Training set   Test set
5 prototypes per class         0.075       0.25
10 prototypes per class        0.075       0.17
15 prototypes per class        0.075       0.25

(b) 4,000 genes
                               Error rate
                        Training set   Test set
5 prototypes per class         0.35        0.42
10 prototypes per class        0.125       0.19
15 prototypes per class        0.1         0.25

Table 7.2: Training and test error rate for the LVQ algorithm with different codebook lengths applied to the breast tumour dataset

3. The LVQ algorithm described here is the basic LVQ1 algorithm from Kohonen. For more sophisticated versions like LVQ2.1 and LVQ3 see Ripley (1996).
4. This variant of the LVQ1 algorithm is often called OLVQ1.
Figure 7.2: Iterations 1, 2, 5, 10, 100 and 500 of the LVQ algorithm for the quarter-circle dataset. The empty points represent the data, the big empty point the data point currently presented. The filled dots correspond to the codebook. The closest code vector is linked to the data point presented by a grey line and an arrow shows how this prototype is changed. The dotted line is the decision boundary

Figure 7.3: Network diagram of a neural network (right) and corresponding diagram for logistic
regression (left)

7.3 Neural networks


Neural networks were initially proposed in the 1960s (Widrow and Hoff, 1960; Rosenblatt, 1962) and
received a lot of attention in the mid-1980s (see e.g. Rumelhart and McClelland, 1986). Statistically
speaking, neural networks are two-level nonlinear models. In this section we will only cover feed-
forward neural networks with one hidden layer, the most popular “vanilla” neural network.
Recall the logistic regression model from chapter 6: a new observation was classified to one group
if
    φ(β0 + β1 xi1 + . . . + βp xip ) > 1/2,
where φ was the logistic function. The left hand side of figure 7.3 gives a visualisation of this model.
Neural networks introduce an additional layer (see the right hand side of that figure) that allows
more complex decision functions to be modelled. Logistic regression leads to a decision boundary that is a
hyperplane: neural networks can lead to more sophisticated decision boundaries (see figure 7.4).
Neural networks are typically represented by a network diagram as in figure 7.3. In feed-
forward neural networks “signals” are only passed towards the output layer and not backwards. The
input layer corresponds to the covariates in the regression context. These input signals are then
“transmitted” to the units in the hidden layer. At each of these units all incoming signals are summed
up using weights wjk and after adding a constant αk they are transformed by a function φk (“activation
function”). The output of the hidden layer units is then used as input for the output layer, leading to
the transformation

    ŷil = φl ( αl + Σ_{k→l} wkl φk ( αk + Σ_{j→k} wjk xij ) )                (7.2)

The most common activation function for the hidden layer is the logistic function e^t / (1 + e^t). For the out-
put units mostly either no transformation (“linear unit”), the logistic function, the threshold function5
or the “softmax” function (see (7.5) below) is used.
Note that neural networks can have more than one output node. For two-class classification prob-
lems however one output node is sufficient. In all other cases an output unit per class is used. A formal
definition of a neural network can be found in the glossary of Ripley (1996).
5. The threshold function is 1 if the argument is positive, otherwise it is 0.

(a) Fitted values and decision boundary for a logistic regression fit to the quarter-circle data. (b) Output of the hidden layer, (transformed) output of the output layer (fitted values) and decision boundary for a neural network fit to the quarter-circle data.

Figure 7.4: Logistic regression (above) and neural network (below) compared

The weights wjk (and wkl )6 are chosen such that the output ŷil from (7.2) is “closest” to the
observed value yil . For a two-class classification problem yi = 1 if the i-th observation is in one of the
classes, and yi = 0 otherwise.7 If there are more than two classes one has to use one output unit per
class, i.e. yi = (yi1 , . . . , yiL ), where yil = 1 if the i-th observation is in class l and 0 otherwise.
The weights wjk are thus chosen to minimise a fitting criterion R like
    R(w) = Σ_{i=1}^{n} Σ_{l=1}^{L} (yil − ŷil )²,                (7.3)

which corresponds to fitting the weights by least squares, a fitting technique mainly used for regression
settings.8
For a two-class problem one can use the (conditional) log-likelihood
    R(w) = Σ_{i=1}^{n} [ yi log( yi / ŷi ) + (1 − yi ) log( (1 − yi ) / (1 − ŷi ) ) ];                (7.4)

in this case the model is a logistic regression model in the hidden units and the weights are estimated
by maximum-likelihood. For more than two classes one mostly uses the “softmax” activation function,
which is given by
    e^{tl} / Σ_{λ=1}^{L} e^{tλ} ,                (7.5)
where tl is the input to the current output unit, and tλ (λ = 1, . . . , L) is the input to the other output
units. As a fitting criterion one then uses the cross-entropy
    R(w) = − Σ_{i=1}^{n} Σ_{l=1}^{L} yil log(ŷil ).                (7.6)

Once again this leads to a model which is a multiple logistic regression model in the hidden units.9
When using the fitting criteria (7.4) or (7.6) one can interpret the output ŷil as the predicted proba-
bility that the i-th observation is from class l.
The fitting criterion is usually minimised by gradient descent, i.e.
    wjk → wjk − η ∂R/∂wjk                (7.7)
These derivatives are calculated from the output nodes to the input nodes across the network (chain
rule!). This gave the algorithm the name of back-propagation (details can be found in Ripley, 1996).
Contrary to logistic regression, we are not guaranteed that the algorithm converges to a global mini-
mum. Depending on the initial choice of the weights the algorithm can get “stuck” in a local minimum.
Figure 7.5 shows the results of fitting a neural network to the artificial quarter-circle dataset with dif-
ferent starting values. One can see that neural networks are very flexible tools10 , in fact they are too
flexible in many cases. The “optimal” solution in our example corresponds to an overfit to the data;
compared to the other solutions it leads to a higher error rate in the test set (despite leading to
the lowest error rate in the training set).
6. In the following “wjk ” refers to all weights.
7. The second index l of yil can be dropped as there is only one output unit (always l = 1).
8. Note that this is a sum both over observations and over output units.
9. For a two-class problem this approach (using two output nodes) is equivalent to the criterion (7.4).
10. In fact, any continuous decision boundary can be modelled by a neural network, if enough hidden units are used.


                    Fitting criterion   Training error   Test error
Global solution               16.3520             0.05         0.14
Local minimum 1               18.2518             0.08         0.13
Local minimum 2               23.1583             0.10         0.10
Local minimum 3               27.2862             0.12         0.12

Figure 7.5: Optimal solution and local minima (decision boundaries) for the quarter-circle data (3
hidden units, no weight decay)


Figure 7.6: Solution (decision boundary) for the quarter-circle data using a weight decay of 0.01

The solution which is optimal with respect to the fitting criteria (7.3), (7.4) or (7.6) is often not
what we are looking for, as it can lead to a tremendous overfit to the data. This can be solved by
minimising a regularised version of the fitting criteria by adding a penalisation term like in ridge
regression:

    Rreg (w) = R(w) + λ Σ_{j,k} wjk²

This changes the update rule (7.7) to

    wjk → wjk − η ∂R/∂wjk − 2ηλ wjk

Figure 7.6 shows the result when using a weight decay of 0.01 for the quarter-circle data. Apart from
avoiding an overfit, using the weight decay leads to several other benefits: the regularised fitting
criterion generally has fewer local minima, making it easier to find the global solution. Furthermore it
generally speeds up the convergence of the fitting algorithm (Ripley, 1996).
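A hedged sketch of how such a network could be fitted in R with the nnet package; the quarter-circle data frame qc with covariates x1, x2 and a two-level factor y is an assumption.

library(nnet)    # feed-forward neural networks with one hidden layer
set.seed(1)      # the fit depends on the random starting weights
fit <- nnet(y ~ x1 + x2, data = qc,
            size  = 3,      # 3 hidden units
            decay = 0.01,   # weight decay (regularisation)
            maxit = 1000)
p.hat <- predict(fit, newdata = qc, type = "raw")     # predicted probabilities
y.hat <- predict(fit, newdata = qc, type = "class")   # predicted classes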
Table 7.3 gives the results of fitting different neural networks to the breast tumour data.

(a) 30 genes
                                                   Error rate
                                           Training set   Test set
1 hidden unit, no weight decay                    0.025       0.22
2 hidden units, no weight decay                   0.025       0.33
2 hidden units, weight decay λ = 0.001            0.025       0.17

(b) 4,000 genes
                                                   Error rate
                                           Training set   Test set
1 hidden unit, no weight decay                    0.025       0.28
1 hidden unit, weight decay λ = 0.006             0.025       0.14
2 hidden units, weight decay λ = 0.006            0.025       0.25

Table 7.3: Training and test error rate for neural network fits to the breast tumour data

7.4 Support vector machines (SVM)

The history of support vector machines reaches back to the 1960s. The “Generalised Portrait” algo-
rithm, which constructs a separating hyperplane with maximal margin, was originally proposed by

Vapnik (Vapnik and Lerner, 1963; Vapnik and Chervonenkis, 1964). In the last few years, support
vector machines have become an increasingly popular learning algorithm.
There are support vector machines for classification and for regression (Vapnik, 1995). We will
only cover support vector classification machines. The vanilla type of support vector machine can only
deal with two-class classification problems.11
The basic idea of support vector classification is to maximise the margin between the two classes.
Figure 7.7 compares this to the idea of linear discriminant analysis (LDA). As one can see from the
figure, the solution depends only on the observations that lie on the margin. This makes the solution
vector of the support vector machine extremely sparse.
Thus we look for a separating hyperplane {x : ⟨x, w⟩ = −b} maximising the margin.12 For the
sake of uniqueness we have to restrict the norm of w to 1. Mathematically it is however easier to fix
the margin to 1 and look for the hyperplane with minimal ‖w‖²/2. Graphically this corresponds to
zooming the data by a factor of 1/ρ. We thus want to minimise

    (1/2) ‖w‖²

Obviously maximising the margin in the above sense is only possible if the dataset is separable. In
practice however, most datasets are not separable.13 In this case, slack variables ξi must be introduced
so that the constraints can be met. The value of these slack variables indicates how far the data point
lies outside the margin (“error”). Figure 7.8 visualises this idea. These “errors” are considered with a
cost factor C > 0 in the objective function, which then becomes
n
1 X
kwk2 + C ξi .
2
i=1

11. Most implementations deal with L > 2 classes by fitting L support vector machines: the l-th machine is trained on the dataset “class l vs. the rest”.
12. w is its normal vector and governs its orientation, whereas b controls the offset of the hyperplane.
13. Another way of dealing with non-separable data is mapping it into a high-dimensional feature space (see below).


Figure 7.7: Decision boundary for linear discriminant analysis (left) and support vector classification
(right) for a two-dimensional toy example: LDA maximises (after rescaling the data) the distance
between the projected class means (filled symbols), whereas SVC maximises the margin between the
two classes.

The solution is thus a trade off between maximising the margin and not having too many points
dropping outside the “margin”. This optimisation problem can be solved using the standard tools of
semi-definite quadratic programming (e.g. interior point algorithms).

Mathematical background of support vector machines

We seek the hyperplane {x : ⟨x, w◦⟩ = −b◦ } (‖w◦‖ = 1) which maximises the margin ρ. The margin
is the largest number ρ satisfying

    ⟨xi , w◦⟩ + b◦ ≥ ρ  for yi = 1,        ⟨xi , w◦⟩ + b◦ ≤ −ρ  for yi = −1.

Equivalently (divide w◦ and b◦ by ρ), we can fix the margin to 1 and minimise the norm of w subject to

    ⟨xi , w⟩ + b ≥ 1  for yi = 1,        ⟨xi , w⟩ + b ≤ −1  for yi = −1.                (7.8)

The corresponding Lagrangian is

    L(w, b) = (1/2) ‖w‖² − Σ_{i=1}^{n} αi [ yi (⟨xi , w⟩ + b) − 1 ]

Figure 7.8: Slack variables ξi used in support vector classification for non-separable data

with Lagrange multipliers αi ≥ 0. The dual problem is to maximise

    D(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i,j} αi αj yi yj ⟨xi , xj⟩                (7.9)

over all αi ≥ 0 with Σ_{i=1}^{n} αi yi = 0. w can be obtained from the αi via w = Σ_{i=1}^{n} αi yi xi . b can be
computed using any observation x_isv with α_isv > 0 via y_isv (⟨x_isv , w⟩ + b) = 1.
For “soft margin” support vector machines (non-separable datasets) we must change (7.8) to

    ⟨xi , w⟩ + b ≥ 1 − ξi  for yi = 1,        ⟨xi , w⟩ + b ≤ −1 + ξi  for yi = −1                (7.10)

with ξi ≥ 0. The objective function is now

    (1/2) ‖w‖² + C Σ_{i=1}^{n} ξi .

The corresponding dual problem is once again (7.9), now to be maximised over all 0 ≤ αi ≤ C with
Σ_{i=1}^{n} αi yi = 0. We can compute w and b from the αi : w = Σ_{i=1}^{n} αi yi xi , and b can be computed using
any observation x_isv with 0 < α_isv < C using equations (7.10) with an “=” sign instead of the ≤ and ≥.
Note that ξi > 0 iff αi = C.


Figure 7.9: One-dimensional data xi and mapped data (xi , xi²). Contrary to the (unmapped) source
data, the mapped data is separable by a hyperplane.

The prediction formula for a new observation with covariate xnew is

    ŷnew = sign( ⟨xnew , w⟩ + b ) = sign( Σ_{i=1}^{n} αi yi ⟨xi , xnew⟩ + b ).

So far, we can only generate separating hyperplanes. Many problems in real life are however not
linearly separable. Consider the situation at the left hand side of figure 7.9: Observations from one
class (green circles) are on both sides whereas observations from the other class (red boxes) are in
between. We cannot separate the data with one hyperplane. But instead of looking only at the xi , we
can consider the pair (xi , x2i ). In the (x, x2 )-space, we are now able to separate the two classes. Not
only in this toy example it is often necessary to map the data into a higher-dimensional feature space
before trying to separate the data. Thus we transform the data using a feature map

φ : Rp 7→ H, x 7→ φ(x)

that transforms our data into a high-dimensional space H, that can even be of infinite dimension.
Support vector machines have a very important mathematical property: They depend on the data
only through dot products ⟨xi , xj⟩ or, in the case of a feature map, ⟨φ(xi ), φ(xj )⟩. This allows the
so called “kernel trick”: instead of specifying the feature map, we just specify the kernel k(xi , xj ) =
⟨φ(xi ), φ(xj )⟩ (i.e. the dot product in the induced feature space). Examples of kernel functions are
the usual dot product in R^p

    k(x, x′) := x^T x′,

the homogeneous polynomial kernel (q ∈ N)

    k(x, x′) := (x^T x′)^q,

as well as the Gaussian kernel (γ > 0)

    k(x, x′) := exp( −γ ‖x − x′‖² ).

Note that the polynomial kernel and the Gaussian kernel lead to non-linear decision boundaries in the
(unmapped) data space. The decision boundary in the feature space is still linear. The feature space
however is a black box14 which we can airily ignore.
The performance of a support vector machine depends crucially on the hyperparameters chosen:
big values of C generally yield an overfit to the data. The bigger the degree q of the polynomial and the
larger the parameter γ of the Gaussian kernel (i.e. the smaller its “width”), the more wiggly the decision
boundary will be and the more likely the fit will result in an overfit. Figure 7.10 visualises the effect of the hy-
perparameters. Historically the Vapnik-Chervonenkis-(VC)-bound has been used to determine optimal
values for the hyperparameters (Vapnik and Chervonenkis, 1968, 1971). The VC bound is a (loose)
upper bound for the actual risk. One just picks the value of the hyperparameter that leads to the lowest
VC bound. The idea of minimising the VC bound is the central idea of the structural risk minimisation
(SRM) principle. However, there seems to be some empirical evidence that the VC bound is not suit-
able for choosing optimal hyperparameters (see e.g. Duan et al., 2003). Simulation studies suggest
that it is best to determine the hyperparameters using either a validation set or cross-validation (see
section 8.2.1 below).
Support vector machines are not scale-invariant, so it is necessary to scale the data beforehand.
However most implementations of SVM (like the one used in R) perform the standardisation automat-
ically.
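For instance, a hedged sketch using the svm() function from the e1071 package; the quarter-circle data frame qc with covariates x1, x2 and a two-level factor y is an assumption.

library(e1071)   # svm(), an interface to libsvm
fit <- svm(y ~ x1 + x2, data = qc,
           kernel = "radial",   # Gaussian kernel exp(-gamma * |x - x'|^2)
           gamma  = 0.1,        # kernel parameter
           cost   = 10,         # cost C for the slack variables
           scale  = TRUE)       # standardise the covariates (the default)
y.hat <- predict(fit, newdata = qc)

## cost and gamma can be chosen by cross-validation, e.g. with tune()
tuned <- tune(svm, y ~ x1 + x2, data = qc,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))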
Support vector machines have been successfully trained as classifiers on huge amounts of data (such
as speech or image (digit) recognition problems). This applies to the number of observations as well as to the
number of covariates.15 Support vector machines are among the most competitive methods in the area
of two-class classification (see e.g. Schölkopf and Smola, 2002). The result obtained from fitting a
support vector machine to the breast tumour data is given in table 7.4.

14. For the Gaussian kernel it is even completely unknown what the feature space looks like.
15. One can even fit support vector machines to datasets with more covariates than observations — conditions under which other algorithms (e.g. logistic regression) fail.

Error rate
Training set Test set
Cost C = 0.1, kernel parameter γ = 0.01 0.15 0.15
Cost C = 0.1, kernel parameter γ = 0.1 0.09 0.12
Cost C = 10, kernel parameter γ = 0.1 0.11 0.19
Cost C = 10, kernel parameter γ = 25 0.01 0.25

Figure 7.10: SVM fits (decision boundary) and error rates for different values of the cost C and the
parameter γ of the Gaussian kernel

(a) 30 genes
Error rate
Training set Test set
Linear kernel Cost C = 0.001 0.40 0.54
Cost C = 0.01 0 0.17
Cost C = 0.1 0 0.17
Gaussian kernel Cost C = 0.35, kernel parameter γ = 0.04 0.05 0.17
Cost C = 1, kernel parameter γ = 0.01 0.025 0.27

(b) 4, 000 genes


Error rate
Training set Test set
Linear kernel Cost C = 0.0001 0 0.20
Cost C = 0.0009 0 0.17
Cost C = 0.005 0 0.20

Table 7.4: Training and test error rate for SVM fits to the breast tumour data
Chapter 8

Classifier assessment, bagging, and boosting

8.1 Introduction
Different classifiers can have very different accuracies (i.e. different misclassification rates). Therefore,
it is important to assess the performance of different learning methods. The performance assessment
can be split into determining whether we have a good estimate of the posterior probabilities of class mem-
bership (p(c|x)), and direct assessment of performance.
Reliable estimates of the classification error or other measures of performance are important for
(i) training the classifier (e.g. selecting the appropriate predictor variables and parameters), and
(ii) estimating the generalisation error of the classifier. The generalisation performance of a learning
method relates to its prediction capability on independent test data. Hastie et al. (2001) refer to the
above two goals as model selection and model assessment.

8.2 Bias, Variance and Error Rate Estimation


In Figure 8.1 we illustrate graphically the ability of a learning method to generalise. We can explain
the aforementioned figure by distinguishing between test error or generalisation error and training error,
as in Hastie et al. (2001). Test error is the expected prediction error over an independent test sample,
while training error is the average loss over the training sample.
First we consider the test error for our estimated model. As the model becomes more and more
complex, it is able to adapt to more complicated structures in the data (decrease in bias), but the
estimation error increases as we have to estimate more parameters (increase in variance). There is
typically a trade-off between bias and variance that gives minimum test error.
We see that the training error shows very different behaviour to the test error, and is therefore
not a good estimate of the test error. The training error consistently decreases with model complexity.
We can make the training error arbitrarily small by just increasing the model complexity enough.
However, such a model will overfit the training data and will generalise poorly.
We will consider a few methods for estimating the test error for a specific model. If large amounts
of data are available a popular approach to model selection and model assessment is to divide the
dataset into three parts: a training set, a validation set and a test set. The training set is used to fit the


Figure 8.1: Changes in test sample and training sample error as the model complexity is varied

models; the validation set is used to estimate prediction error for model selection; the test set is used
for assessment of the generalisation error of the final chosen model. Note that the test set should not
be available until we assess the final model. If we use the test set repeatedly to find the model with
the lowest test set error, we risk underestimating the true test error.
There is no general rule on how to choose the number of observations in each of the three parts.
This will typically depend on the signal-to-noise ratio and the size of the training sample. Hastie et al.
(2001) suggest that a typical split might be 50% for training, and 25% each for validation and testing.
A limitation of this approach is that it reduces the effective size of the training set. This is a problem
for microarray datasets that have a very limited number of observations. Below we discuss methods
that are designed for situations where there is insufficient data to split it into three parts. The reader
is referred to Section 2.7 in Ripley (1996) and Chapter 7 in Hastie et al. (2001) for more detailed
discussions.

Ripley (1996) makes the following remarks on page 76 in the context of studies for comparing error
rates:

• It is an experiment and should be designed and analysed as such.

• Comparisons between classifiers will normally be more precise than assessment of the true risk.

• We often need a very large test set to measure subtle differences in performance, as standard
errors of estimates decrease slowly with an increase in sample size.

8.2.1 Cross-validation

Cross-validation is one approach to get around the wasteful approach based on separate training,
validation and test sets. In V -fold cross-validation (CV) (Breiman et al., 1984), the data set L is
randomly divided into V sets Lν , ν = 1, . . . , V , of as nearly equal size as possible. Classifiers are built
on the training sets L − Lν , error rates are computed from the validation sets Lν , and averaged over
ν.
A question that arises is how large V should be. A common form of CV is leave-one-out cross-
validation (LOOCV) (Stone, 1974), where V = n. LOOCV can result in estimators of the classification
error with smaller bias but higher variance. Therefore, it is clear that there is a bias-variance trade-off in the selection
of V . LOOCV can also be expensive in terms of computational time. Breiman (1996b) shows that
LOOCV provides good estimators of the generalisation error for stable (low variance) classifiers such
as k-NN. Choices of V = 5 or V = 10 are often reasonable.
Cross-validation is commonly used to choose a suitable model by estimating a measure of per-
formance (e.g. the error rate) or a measure of model fit (e.g. deviance). However, we can also
use cross-validation for parameter estimation, by choosing the parameter value which minimises the
cross-validated measure of performance. If we then want to assess the performance of the resulting
classifier, we have to do so by a double layer of cross-validation.
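A minimal R sketch of V-fold cross-validation for the k-nn classifier; the matrix X of covariates and the factor y of class labels are assumptions.

library(class)   # knn()
cv.error <- function(X, y, k = 5, V = 10) {
  n <- nrow(X)
  fold <- sample(rep(1:V, length.out = n))   # random fold assignment
  errors <- numeric(V)
  for (v in 1:V) {
    test <- fold == v
    pred <- knn(train = X[!test, , drop = FALSE],
                test  = X[test,  , drop = FALSE],
                cl    = y[!test], k = k)
    errors[v] <- mean(pred != y[test])       # error rate on the held-out fold
  }
  mean(errors)                               # cross-validated error estimate
}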

8.2.2 The Bootstrap

Cross-validation is used to estimate error rates on a test set by constructing a pseudo test set from the
training set. A different approach is to accept that the training set error is biased, and then to try and
estimate that bias, and correct for it by using this estimate.
Suppose we take a sample of size n by sampling with replacement from our original sample (known
as a bootstrap sample), and then calculate an estimate θ ∗ from this sample. Suppose that θ̂ is the
estimate computed from the original data and θ is the true value of the quantity that we want to
estimate. Then the idea is that the variability of θ ∗ − θ̂ should mimic that of θ̂ − θ. In particular, the
mean of θ ∗ − θ̂ should estimate the bias of θ̂. This can be estimated by resampling B times and
averaging.
We can use the above ideas of the bootstrap to estimate the difference (bias) between a classifier’s
apparent error and true error. We can obtain a bootstrap estimate of the bias by computing the average
(over B bootstrap samples) of the error rate on the bootstrap training set L∗ and subtracting the
estimate computed on the larger set L. (L∗ contains some of the members of L more than once, and
usually some not at all, so as a set L∗ is smaller). A problem is that the example being predicted may
be in the bootstrap training set. In this case we may sometimes predict it too well, leading to biased
estimates. Efron (1983) proposes the “.632” bootstrap estimate, which considers only the predictions of
those members of L not in L∗ . By using a similar approach as above, we can also use the bootstrap to
directly estimate the precision of the apparent error rate.
The jackknife (Quenouille, 1949) is another approach to estimate the bias and is loosely related
to the bootstrap. These methods are sometimes considered to be improvements on cross-validation;
however, Ripley (1996) warns on page 75 that this is perhaps optimistic, since the full power of the
bootstrap seems not to have been fully tested.

8.2.3 Log-probability scoring

The counting of errors on a test set is one approach to assess the performance of a classifier. However,
this approach often has high variance. Sometimes other approaches may perform better. One good
approach is log-probability scoring:

    − Σ_{test set} log p̂(ci |xi ).

This has lower variance than error counting. Note that the above measure is reliant on the fact that
p̂(c|x) is a good estimate of p(c|x). We can use calibration plots to evaluate this. If the predicted
probability p under our model for p(c|x) is close to the observed fraction of class-c cases among
observations with predicted probability near p, the predicted probabilities are well-calibrated. A
calibration plot plots the predicted probabilities against these observed fractions. A relatively straight
line indicates that the predicted probabilities are well-calibrated. However,
often extreme probabilities are estimated to be too extreme. This is the result of the fact that we
are not using p(c|x) but p(c|x, θ̂), a plug-in estimate in which we assume that the fitted parameters θ̂
represent the true parameters. So the uncertainty in θ̂ is ignored.
Calibration plots can be used to re-calibrate our probabilities, but most often deviations from a
straight line in the plot are indicative of more fundamental problems with the model.

8.2.4 ROC curves


Receiver operating characteristic (ROC) curves are another summary of classifier performance (see Hand
(1997), Chapter 7). Consider a two-class medical problem where the outcomes are “normal” or “dis-
eased”. The specificity is defined as

specificity = p(predict normal|is normal)


and sensitivity is defined as

sensitivity = p(predict diseased|is diseased).


If our classifier provides us with estimates of p(diseased|x), we can declare a subject as
diseased if p(diseased|x) > α. By varying α ∈ [0, 1] we get a family of classifiers. An ROC curve
plots sensitivity against 1−specificity. For equal losses we want to maximise the sum of sensitivity and
specificity, i.e. sensitivity − (1−specificity). The best choice of α corresponds to the point where the ROC
curve has slope 1.

8.2.5 Example: Toxicogenomic Gene Expression Data


In the lecture we will consider an example involving toxicogenomic gene expression data.

8.3 Bagging
Many classifiers like decision trees or neural networks with many hidden units are unstable pre-
dictors with high variance. Bagging is a technique to reduce the variance of an unstable predictor.
Although bagging is usually presented in the context of decision trees, it can be applied to other
predictors (neural nets etc.) as well.
Imagine we had B training datasets, each of size n: L1 , . . . , LB . Using these data we could obtain B
classifiers p̂(b) (x) = ( p̂1(b) (x), . . . , p̂L(b) (x) ). Here p̂l(b) (x) denotes the probability predicted by the b-th
classifier that a new observation with covariates x belongs to the l-th class. To form a final classifier,
we could aggregate the classifiers by taking the arithmetic mean of the predicted probabilities
    p̂average (x) = (1/B) Σ_{b=1}^{B} p̂(b) (x)                (8.1)

We then classify x into the class l with the highest predicted probability p̂l_average.1
Taking the arithmetic mean over several training datasets should reduce the variance significantly.2
In practice however we can seldom afford the luxury of obtaining more than one learning dataset of
size n. We now use the technique of bootstrapping: we draw B bootstrap samples of size n from
our training set (with replacement). This technique is called bagging, the acronym for bootstrap
aggregating (Breiman, 1996a). The classifier defined in (8.1) is called the bagging classifier.3
1. Instead of using the predicted probabilities, we could as well use the predictions. We would then classify a new observation into the class that gets the most “votes”.
2. Recall that the variance of the arithmetic mean of a sample Z1 , . . . , Zn is σ²/n, if the Zi are uncorrelated random variables with variance σ².
3. Strictly speaking, (8.1) is not the bagging classifier, but a Monte-Carlo approximation to it.

(a) 30 genes
Error rate
Training set Test set (out of bag estimate)
Decision tree 0.025 0.36
Bagged decision tree 0 0.19 (0.20)
Random forest 0 0.22 (0.25)
(b) 2, 000 genes
Error rate
Training set Test set (out of bag estimate)
Decision tree 0.025 0.36
Bagged decision trees 0 0.28 (0.275)
Random forest 0 0.25 (0.23)

Table 8.1: Training and test error rate obtained when fitting a decision tree, bagged decision trees and
a random forest to the breast tumour dataset

Note that bagging only works with unstable predictors (like decision trees or regressions after
variable selections). When using stable predictors, bagging generally makes the variance worse. Or in
Breiman’s words: “Bagging goes a ways toward making a silk purse out of a sow’s ear, especially if the
ear is twitchy.” (Breiman, 1996a).
As a byproduct bagging furthermore leads to an unbiased estimate of the error rate. As we use a
randomly drawn sample for constructing the trees, there are (nearly) always a few observations that
were not used; we can use these to get an unbiased estimate of the error rate (“out-of-bag” estimation).
The idea of bagging can be generalised. Random forests (Breiman, 1999) do not only use a random
subset of the observations, they also use only a small random subset of the variables in each itera-
tion. This does not only offer theoretical advantages, it also makes fitting the decision trees computationally
a lot easier, as we only have to consider a possibly small subset of covariates.
Table 8.1 gives the training set and test set error rate for a plain decision tree, bagged decision trees
and random forests for the breast tumour data set. Note that decision trees are generally unsuited for
microarray data, as they concentrate on a very small number of genes. The out-of-bag estimates are
rather close to the actual test set performance.
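A hedged sketch of fitting bagged trees in the form of a random forest with the randomForest package in R; the data frames and variable names are assumptions.

library(randomForest)
## assumed: data frames `train` and `test` with a factor response `class`
## and the gene expression values as covariates
rf <- randomForest(class ~ ., data = train,
                   ntree = 500,                            # number of bootstrap samples / trees
                   mtry  = floor(sqrt(ncol(train) - 1)))   # random subset of variables per split
rf$err.rate[500, "OOB"]                 # out-of-bag estimate of the error rate
pred <- predict(rf, newdata = test)     # predictions for the test set
mean(pred != test$class)                # test set error rate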

8.4 Boosting
Bagging’s cousin, boosting, has been developed to deal with the converse situation: it aggregates stable
classifiers with a poor performance (high bias), so-called weak learners. An example of this type of
classifier is the stump: a decision tree with only two terminal nodes. The idea behind
boosting is to re-weight the observations in every iteration, such that the weak learner is forced to
concentrate on the “difficult” observations. The weak learners can then be seen as a “committee”,
which “votes” by majority to get the final prediction. Statistically speaking, boosting is a greedy
gradient-descent algorithm for fitting an additive model, although “committee” and “majority vote”
sound more glamorous.

The Adaboost algorithm of Freund and Schapire (1995)

We will only consider the two-class case. The two classes will be coded with ±1, i.e. either yi = 1 or
yi = −1. Denote with fˆ(h) the predictions of the h-th weak learner. We start with a weighting vector
of w(1) = (1/n, . . . , 1/n).4 Iterate for h = 1, 2, . . . , H:

i. Train the weak learner fˆ(h) (predicting ±1) using the weights w(h) .
ii. Compute the weighted training error ε(h) = Σ_{i: fˆ(h)(xi) ≠ yi} wi(h) and set α(h) = (1/2) log( (1 − ε(h)) / ε(h) ).

iii. Update the weights (i = 1, . . . , n)

        wi(h+1) = wi(h) · exp(−α(h))   if fˆ(h)(xi) = yi
        wi(h+1) = wi(h) · exp(α(h))    if fˆ(h)(xi) ≠ yi

    and normalise them, so that they sum to 1.


The final predictor is then F̂(H)(x) = Σ_{h=1}^{H} α(h) fˆ(h)(x). We predict the first class for a new observation
xnew if F̂(H)(xnew ) > 0, otherwise we predict the second.
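A minimal R sketch of the algorithm just described, using stumps from the rpart package as weak learners; the function and object names are assumptions, and no safeguards (e.g. for a zero weighted error) are included.

library(rpart)   # decision trees; maxdepth = 1 gives stumps

## X: data frame of covariates, y: class labels coded -1 / +1
adaboost <- function(X, y, H = 100) {
  n <- nrow(X)
  w <- rep(1 / n, n)                       # initial weights w(1)
  learners <- vector("list", H)
  alpha <- numeric(H)
  dat <- data.frame(X, y = factor(y))
  for (h in 1:H) {
    ## i.  train the weak learner on the weighted data
    fit <- rpart(y ~ ., data = dat, weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
    ## ii. weighted training error and alpha(h)
    eps <- sum(w[pred != y])
    alpha[h] <- 0.5 * log((1 - eps) / eps)
    ## iii. re-weight the observations and normalise
    w <- w * exp(-alpha[h] * y * pred)
    w <- w / sum(w)
    learners[[h]] <- fit
  }
  list(learners = learners, alpha = alpha)
}

## final predictor: sign of the weighted "votes" of the weak learners
adaboost.predict <- function(model, Xnew) {
  votes <- sapply(seq_along(model$alpha), function(h) {
    p <- ifelse(predict(model$learners[[h]], Xnew, type = "class") == "1", 1, -1)
    model$alpha[h] * p
  })
  sign(rowSums(votes))
}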

Step iii. of the Adaboost algorithm is responsible for “focusing” on the difficult examples: if an
observation is misclassified by the current weak learner fˆ(h) then the weight of this observation is
increased, otherwise it is decreased. We can rewrite wi(h+1) as

    wi(h+1) = exp( −yi Σ_{η=1}^{h} α(η) fˆ(η)(xi) ) = exp( −yi F̂(h)(xi) ).

The weights are thus a function of yi F̂ (h) (xi ), which is nothing else than the “margin” of the final
predictor up to the h-th step. The margin yi F̂ (h) (xi ) indicates how far the predictions are away from
the decision boundary, i.e. how “confident” the final predictor is. Boosting can thus be seen as a
technique that maximises the margin.
Friedman et al. (1998) showed that boosting is a greedy gradient descent algorithm for fitting an
additive logistic model. However contrary to logistic regression (cf. section 6) Adaboost does not
maximise the likelihood, but minimises
    E e^{−y f(x)} = |ỹ − p(x)| / √( p(x) (1 − p(x)) ),

where ỹ := 1 if y = 1 and ỹ := 0 if y = −1 (thus just y in 0/1 coding instead of −1/+1 coding) and p(x) := P(y = 1 | x).
Freund and Schapire (1995) motivated this expression as a differentiable upper bound to the
misclassification error. One can also see it as a χ2 -type approximation to the log-likelihood.
However, one can use other loss functions as well. Friedman et al. (1998) propose using the log-
likelihood as a criterion (“Logit-Boost”); Friedman (1999) proposed to use the squared loss function
(“L2 -Boost”).

A main theoretical property of boosting is its ability to reduce the training error: it uses the weak
learners to form a strong learner. Already if the weak learner performs slightly better than random
guessing (i.e. has a training error rate of slightly less than 1/2), the training error of the Adaboost
algorithm drops exponentially fast. This is however not the most important practical property of
boosting.
In practice, boosting seems to be rather resistant to overfitting. This is the main reason for the
popularity of boosting. This is mainly an empirical finding and the theory behind it is not completely
understood. Freund and Schapire (1995) argue that boosting leads to larger margins and thus reduces
the generalisation error in the test set. They thus connect boosting to support vector machines, which
have a similar property. Bühlmann and Yu (2003) argue that boosting leads to a more favourable
bias-variance trade-off: “When the iteration b increases by 1, one more term is added in the fitted
procedure, but due to the dependence of this new term on the previous terms, the ‘complexity’ of
the fitted procedure is not increased by a constant amount as we got used to in linear regression, but
an exponentially diminishing amount as b gets large.” Breiman (1999) however argues that the only
reason why boosting performs that well is that it emulates the behaviour of random forests.
Boosting reduces the bias of a classifier, whereas bagging reduces its variance. The two methods
therefore call for very different types of base classifiers. Unstable predictors (e.g. unpruned decision
trees) are unsuitable for boosting. Sometimes even stumps have too high a variance; in this case
shrunken stumps (trees) are used. Shrunken stumps use the prediction of the tree, but multiply it by
a shrinkage factor ν ∈ (0, 1).
When fitting boosted stumps (with a shrinkage factor of ν = 0.5) to the breast tumour data (30
genes), one obtains a training error of 0 and a test set error of 0.22. Figure 8.2 shows a plot of the
error rates during the fitting process. It shows two important characteristics of boosting. There is
virtually no overfitting even when using all 500 weak learners. Although the training error reaches 0
quite early, the test set error is still reduced further. The reason for this is that boosting not only
minimises the training error, but also aims to maximise the margin (see figure 8.3).

[Figure: training and test error (y-axis: error, 0–0.4) plotted against the number of boosting iterations (0–500).]

Figure 8.2: Error rates for the Adaboost fit to the breast tumour dataset
[Figure: distribution of the margin after 1, 5, 10, 50, 100, 250 and 500 boosting iterations.]

Figure 8.3: Distribution of the margin (Adaboost fit to the breast tumour dataset)
Chapter 9

Expert Systems

9.1 Introduction
In data mining information is induced from data. In some situations however, one does not possess
enough quantitative information for data mining techniques to produce meaningful results. In these
cases one can try to infer information from experts. This is the domain of expert systems.
An expert system consists of a knowledge base elicited from experts and an inference engine.
There are many (potential) domains of application for expert systems. Generally these are situations
where collecting enough data is impossible or not feasible for economic or ethical reasons. One
example of such a setting is the assessment of the safety of a nuclear power plant.
There have been many attempts to define the term “expert system” but most of the definitions are
quite vague. Goodall (1985) defines an expert system as “a computer system that operates by applying
an inference mechanism to a body of specialist expertise represented in the form of ‘knowledge’ ”. The
boundary between expert systems and data mining is rather blurred. The CHILD network (Spiegel-
halter et al., 1993) — a diagnosis system for telephone referrals of newborn babies with (possible)
congenital heart disease — for example was trained both by experts and by data.
Rule-based systems have played and still play an important role in the domain of expert systems.
Rule-based systems represent knowledge as a set of “If . . . then . . . ” rules, which are particularly easy
to elicit and to understand. However deterministic rule-based systems are quite problematic. As Pearl
(1988) remarks, “reasoning about any realistic domain always requires that some simplifications be
made. The very fact of preparing knowledge to support reasoning requires that we leave many facts
unknown, unsaid, or crudely summarized." In other words, rules are quite imprecise and, as the
saying goes, there is no rule without exceptions. Classic monotone logic however does not allow for ex-
ceptions, and conclusions cannot be retracted. Any inconsistency in the rule set — and simplifications
often entail some inconsistencies — can cause the system to collapse, so they have to be avoided at
all costs. When DEC's XCON system, which was used for automatically checking orders and product
configurations, reached 10,000 rules, it was a full-time job for 150 people to maintain it. Nonetheless
XCON was said to have saved Digital around forty million dollars a year.
In his 1982 presidential address to the AAAI, Marvin Minsky gave a famous example for the prob-
lems encountered when using monotone logic in expert systems:

H UMAN : “All ducks can fly. Charlie is a duck.”


E XPERT S YSTEM : “Then Charlie can fly.”


H UMAN : “But Charlie is dead.”


E XPERT S YSTEM : “Oh!”

Taking uncertainty into account makes it possible to resolve such conflicts. However, adding uncertainty to rule-
based systems is quite tricky. In monotone logical systems, we can use a conclusion regardless of how
we obtained it: as we have been able to prove it, it must be true. Under uncertainty we have to keep
track of how we obtained the evidence in order to avoid counting the same piece of evidence several
times. For example, being told the same rumour twice only constitutes greater evidence if the reports stem
from independent sources. Furthermore, the truth of a complex logical proposition can be derived
from the truth of its parts. For example, if a and b are true, then we know that a ∧ b is true as
well.1 However we cannot establish the uncertainty of a complex proposition from the uncertainty of
its parts, unless we make stringent independence assumptions. This is what many non-deterministic rule-
based systems do, but this can cause problems if these independence assumptions are not justified.
When using rule-based systems with stringent independence assumptions one has to decide whether
the system should reason diagnostically (symptoms → causes) or causally (causes → symptoms). Mix-
ing the two ways causes the system to break down. Consider the much used sprinkler example. If the
grass is wet it is either wet because it rained or it is wet because the sprinkler was turned on. So we
need the rules “grass wet → rain” and “grass wet → sprinkler” for diagnosis and “rain → grass wet”
and “sprinkler → grass wet” for causal reasoning. If we use all four rules, we cannot help obtaining
“sprinkler → rain” and “rain → sprinkler” — which is ridiculous.

9.2 Certainty factors


It seems natural to statisticians to use probabilities to account for uncertainty. Historically however,
many expert systems were not based on probabilistic reasoning.
One of the first expert systems was the MYCIN system developed at Stanford by Edward Short-
liffe and Bruce Buchanan (Shortliffe, 1976). The system was designed to diagnose and recommend
treatment for certain blood infections. Although 35% of its conclusions were unacceptable, it still
outperformed its human counterparts from Stanford medical school. Nonetheless it was never used in
practice due to ethical and legal issues.
The MYCIN system used certainty factors (CF) instead of probabilities. The certainty factor is a
number in the closed interval [−1, 1]. A value of 1 indicates logical implication, whereas a value of −1
indicates logical exclusion. A value of 0 indicates that there is no conclusive evidence at all.
Although certainty factors were not defined or elicited that way, Shortliffe and Buchanan gave them
a “probabilistic interpretation”. Using the probability P (hypothesis) and the conditional probability
P (hypothesis|evidence) the interpretation of the certainty factor is:
     CF = (P(Hypothesis|evidence) − P(Hypothesis)) / (1 − P(Hypothesis))   if P(Hypothesis|evidence) ≥ P(Hypothesis)
     CF = −(P(Hypothesis) − P(Hypothesis|evidence)) / P(Hypothesis)        if P(Hypothesis|evidence) ≤ P(Hypothesis)

Given the (deterministic) evidence the MYCIN system first picked the relevant rules (i.e. the rules
that contain the evidence and the subsequent ones) and computed the belief in the hypothesis using
1 The binary logical operator "∧" denotes "and", the binary logical operator "∨" denotes "or" and the unary logical operator "¬" denotes negation.

the following combination rules:2

Sequential combination:
The rules A → B (with certainty factor CF(A, B)) and B → C (with certainty factor CF(B, C)) can be
combined to the rule A → C (diagram: A → B → C) with certainty factor

     CF(A, C) = CF(A, B) · CF(B, C)   if CF(A, B) ≥ 0
     CF(A, C) = 0                     if CF(A, B) ≤ 0

Parallel combination:
If two rules share the same conclusion, i.e. rule A → C with certainty factor CF(A, C) and rule B → C
with certainty factor CF(B, C) (diagram: A → C ← B), then MYCIN uses, if A and B are part of the
evidence, the combined belief

     CF(A ∩ B, C) = CF(A, C) + CF(B, C) − CF(A, C) · CF(B, C)
                                          if CF(A, C) ≥ 0 and CF(B, C) ≥ 0
     CF(A ∩ B, C) = (CF(A, C) + CF(B, C)) / (1 − min{|CF(A, C)|, |CF(B, C)|})
                                          if one of CF(A, C), CF(B, C) is ≥ 0 and the other is < 0
     CF(A ∩ B, C) = CF(A, C) + CF(B, C) + CF(A, C) · CF(B, C)
                                          if CF(A, C) < 0 and CF(B, C) < 0

(The parallel combination rule is associative and commutative, i.e. if we have more than two rules
leading to the same conclusion, we can combine the rules in any order.)

Toy example: Certainty factors

A patient has been diagnosed with a certain type of rash which might suggest that he suffers
from Lyme disease. A blood test for antibodies gave a positive result. Using the rules:

• A rash of this type is Erythema migrans. (“R → E”, CF (R, E) = 0.4)

• If Erythema migrans is present, the patient suffers from Lyme disease.


(“E → L”, CF (E, L) = 0.9)

• If the blood test is positive, then the patient suffers from Lyme disease.
(“B → L”, CF (B, L) = 0.7)

Graphically we can represent these rules as follows:


[Diagram: R → E (CF 0.4), E → L (CF 0.9) and B → L (CF 0.7).]

2 This is not an exhaustive enumeration of MYCIN's rules, but merely a summary of the most important ones.

Using the rule for sequential combination, we obtain the certainty factor:

CF (R, L) = CF (R, E) · CF (E, L) = 0.4 · 0.9 = 0.36

The simplified graph is:


[Diagram: R → L (CF 0.36) and B → L (CF 0.7).]

Using the rule for parallel combination, we obtain the certainty factor:

CF (R ∩ B, L) = CF (R, L) + CF (B, L) − CF (R, L) · CF (B, L) = 0.36 + 0.7 − 0.36 · 0.7 = 0.808.

So given that we observe a rash and the blood test is positive, the belief that this was caused by a
Lyme disease infection is 0.808.
[Diagram: R ∩ B → L (CF 0.808).]
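The two combination rules are easy to turn into code. The following Python sketch (an illustration, not MYCIN itself) reproduces the numbers of the toy example:

def sequential(cf_ab, cf_bc):
    # rules A -> B and B -> C combined to A -> C
    return cf_ab * cf_bc if cf_ab >= 0 else 0.0

def parallel(cf1, cf2):
    # two rules with the same conclusion
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 - cf1 * cf2
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))   # opposite signs

cf_RL = sequential(0.4, 0.9)        # rash -> Erythema migrans -> Lyme disease
cf_combined = parallel(cf_RL, 0.7)  # ... combined with the blood test rule
print(cf_RL, cf_combined)           # 0.36  0.808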

Certainty factors and especially their probabilistic interpretation came quite soon under sharp criti-
cism. The combination rules mentioned above are not compatible with the probabilistic interpretation
of certainty factors, unless one makes strict independence assumptions. Adams (1976) showed that
using both the probabilistic interpretation and the rule for parallel combination of evidence corre-
sponds to assuming that the two pieces of evidence pointing at the same conclusion are statistically
independent, both marginally and conditionally given the hypothesis and given its negation. These
assumptions are far stricter than the independence assumptions made by any other method! Further-
more assuming the probabilistic interpretation and using the rule for parallel combination can limit
the range of possible values for the certainty factors to a small sub-interval — otherwise some of the
probabilities in the probabilistic interpretation would be outside the interval [0, 1].
Shortliffe and Buchanan (1984) finally abandoned the probabilistic interpretation. Today at least one of
the two authors no longer recommends this method (according to Russell and Norvig, 2003).

9.3 Dempster-Shafer theory of evidence


If we wanted to use a probabilistic model for reasoning, we would have to fully specify a complete
probabilistic model. Often we do not have enough knowledge to do so without making approxima-
tions. The Dempster-Shafer theory (see e.g. Shafer, 1976) allows to sidestep this problem by com-
puting probabilities of provability instead of probabilities of truth. It provides two probabilities: the
belief, which is the probability that we can prove the hypothesis, and the plausibility, which is the
probability that we cannot prove that the hypothesis is wrong.
We will illustrate this type of reasoning using an example: a suspect is charged with a crime.
Imagine for now that there is no evidence at all. The belief that the suspect is guilty (i.e. committed
the crime) is 0, as we can prove under no circumstances that he committed the crime. The plausibility
that the suspect is guilty is 1; we cannot show that it is impossible that he committed the crime, so it
is plausible that he is guilty.
Now assume a witness says he has seen the suspect running away from the scene of the crime. The
probability that the witness told the truth is say 40%. Let’s first assume that the witness did not tell

[Figure: the two states of the random switch — switch open (probability 0.6): neither "guilty" nor
"not guilty" can be shown; switch closed (probability 0.4): "guilty" can be shown, "not guilty" cannot.]

Figure 9.1: States of the switch for the model with one piece of evidence
[Figure: the four joint states of the two random switches — scenario #1 (probability 0.33): neither
"guilty" nor "not guilty" can be shown; scenario #2 (probability 0.27): only "not guilty" can be shown;
scenario #3 (probability 0.22): only "guilty" can be shown; scenario #4 (probability 0.18): conflict.]

Figure 9.2: States of the switch for the model with two pieces of evidence

the truth. In this case we can neither prove that the suspect committed the crime, nor can we prove
that he is innocent. On the other hand, if we assume the witness told the truth we can prove that the
suspect committed the crime. As the probability that the witness told the truth is 40%, the probability
that we can prove that the suspect is guilty is 40% as well.
Pearl (1988) proposed to illustrate this using a “random-switch”-model as illustrated in figure 9.1.
The probability that the switch is in the closed position (i.e. that the witness told the truth) is 40%.
Thus we can prove with a probability of 40% that the suspect is guilty.
How could we incorporate another piece of information? This can be handled using Dempster’s
rule of combination.3 Let’s go back to our example and assume that there is another witness who has
seen the suspect in another area of the town at the time of the crime — thus providing the suspect
with a perfect alibi. Let’s assume that this second witness told the truth with a probability of 45%.
Dempster’s rule of combination tells us to assume that the random switches are independent. Now
we can construct the following table (see figure 9.2 for the positions of the corresponding switches):
scenario probability witness report #1 witness report #2 guilty not guilty
#1 0.6 · 0.55 = 0.33 false false not provable not provable
#2 0.6 · 0.45 = 0.27 false true not provable provable
#3 0.4 · 0.55 = 0.22 true false provable not provable
#4 0.4 · 0.45 = 0.18 true true conflict

Note that in scenario #4 we obtain a conflict: we can prove both that the suspect committed the
crime and that he is innocent. Dempster’s rule of combination now tells us to discard such scenarios.
In order to obtain the belief that the suspect committed the crime, we have to add up the probabilities
of the scenarios under which we could prove that he committed the crime. However we discarded
3 Note that Dempster's rule of combination is an assumption, and is not a rule that we have to (or are able to) prove.

scenario #4, so we have to renormalise by dividing by the sum of the probabilities of the scenarios
without conflicts. Thus the belief that he committed the crime is
     Bel({“guilty”}) = 0.22 / (0.33 + 0.27 + 0.22) = 0.268
The plausibility of the suspect having committed the crime is the sum of the probabilities of all scenar-
ios under which we cannot prove that he did not commit the crime. Once again we have to renormalise
this probability:
     P l({“guilty”}) = (0.33 + 0.22) / (0.33 + 0.27 + 0.22) = 0.671
From a more mathematical point of view, the Dempster-Shafer theory can be seen as follows. Let Ω
be the set of mutually exclusive states, exactly one of which is true. We then define a basic probability
assignment function
     m : Ω ⊃ A ↦ m(A) ∈ [0, 1],

such that Σ_{A⊂Ω} m(A) = 1.
In our example (using both pieces of evidence) we would have

m({“guilty”}) = 0.268
m({“not guilty”}) = 0.329
m({“guilty”,“not guilty”}) = 0.403

The belief Bel in a hypothesis H is then defined as

     Bel(H) = Σ_{A⊂H} m(A),

its plausibility P l as

     P l(H) = Σ_{A∩H≠∅} m(A) = 1 − Bel(¬H).

Using the above example

     Bel({“guilty”}) = Σ_{A⊂{“guilty”}} m(A) = m({“guilty”}) = 0.268

and

     P l({“guilty”}) = Σ_{A∩{“guilty”}≠∅} m(A) = m({“guilty”}) + m({“guilty”,“not guilty”}) = 0.268 + 0.403 = 0.671,

as visualised in figure 9.3.
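The whole two-witness example can be reproduced in a few lines of Python. The sketch below (an illustration only; the frozensets represent the subsets of Ω = {guilty, not guilty} carrying mass) implements Dempster's rule of combination and the definitions of Bel and P l:

GUILTY, NOT_GUILTY = frozenset({"guilty"}), frozenset({"not guilty"})
OMEGA = GUILTY | NOT_GUILTY

# witness 1 (seen fleeing, reliability 0.4) supports "guilty",
# witness 2 (alibi, reliability 0.45) supports "not guilty"
m1 = {GUILTY: 0.4, OMEGA: 0.6}
m2 = {NOT_GUILTY: 0.45, OMEGA: 0.55}

def combine(m1, m2):
    m, conflict = {}, 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            c = a & b
            if c:                              # compatible scenario
                m[c] = m.get(c, 0.0) + pa * pb
            else:                              # conflicting scenario, to be discarded
                conflict += pa * pb
    return {a: p / (1 - conflict) for a, p in m.items()}   # renormalise

m = combine(m1, m2)
bel = sum(p for a, p in m.items() if a <= GUILTY)   # mass on subsets of {guilty}
pl = sum(p for a, p in m.items() if a & GUILTY)     # mass on sets intersecting {guilty}
print(round(bel, 3), round(pl, 3))                  # 0.268 0.671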


The Dempster-Shafer approach was very popular in the late 1980s and early 1990s, and there was
a lot of active research in this area. Its proponents argued that unlike probabilities belief functions
are able to distinguish between uncertainty and ignorance. Furthermore they are — like probability
theory — able to handle conflicting rules; however, it is much disputed whether they do so in a sensible
manner (see box below).

[Figure: the world Ω = {“guilty”, “not guilty”}. The mass 0.268 on {“guilty”} constitutes
Bel({“guilty”}) = 0.268; together with the mass 0.403 on {“guilty”, “not guilty”} it constitutes
P l({“guilty”}) = 0.671; the remaining mass 0.329 sits on {“not guilty”}.]

Figure 9.3: World Ω, belief Bel and plausibility P l

It is important to note that Dempster’s rule of combination implies that the propositions used as
evidence are independent. This has to be ensured when constructing the expert system and can limit
the scope of such systems quite a lot.
Belief and plausibility are often misinterpreted as a confidence interval for the true probability
and their difference misinterpreted as a degree of certainty. This is not the case — mainly due to
Dempster’s rule of combination (see e.g. Pearl, 1988, chapter 9.1.4).

The penguin triangle

Consider the following propositions (from Pearl, 1988):

p: “Tweety is a penguin.”
f : “Tweety can fly.”

Assume we have the rule:

r1 : p → ¬f (with belief 1 − ε1 ) — “Penguins normally don’t fly.”

Now consider that we want to assess whether Tweety can fly. If we know that Tweety is a penguin
(p), then our belief that she can fly is 0, because we can prove under no circumstances that Tweety
can fly (f ). The plausibility of Tweety being able to fly is ε1 ; if rule r1 is not active, we cannot prove
the converse.
Now we know as well that penguins are all birds, and that birds normally fly:

r2 : p → b (with belief 1) — “Penguins are birds.”


r3 : b → f (with belief 1 − ε2 ) — “Birds normally fly.”

Does this change our belief that Tweety can fly? Once again we have to consider all possible
154 CHAPTER 9. EXPERT SYSTEMS

positions of the two random switches:

Probability r1 r3 f ¬f
ε1 ε2 inactive inactive not provable not provable
(1 − ε1 )ε2 active inactive not provable provable
ε1 (1 − ε2 ) inactive active provable not provable
(1 − ε1 )(1 − ε2 ) active active conflicting

Using Dempster’s rule of combination we obtain (for small ε1 , ε2 ):

     Bel(f ) = ε1 (1 − ε2 ) / (1 − (1 − ε1 )(1 − ε2 )) = (ε1 − ε1 ε2 ) / (ε1 + ε2 − ε1 ε2 ) > 0

Tweety’s birdness seems to endow her with extra flying power. This result is highly counterin-
tuitive. A penguin is a special kind of bird and thus we would expect that the properties Tweety
gets from being a penguin override the ones she gets from being a bird. The probabilistic approach
which will be presented in section 9.5 does not suffer from this problem.

9.4 Fuzzy logic


Fuzzy logic is based on the theory of fuzzy sets (Zadeh, 1965). In classical set theory an element ω of
the base set Ω is either a member of a subset S ⊂ Ω (ω ∈ S) or it is not (ω ∉ S). One can think of this
as assigning 0 or 1 to every ω ∈ Ω; 0 if ω ∉ S and 1 if ω ∈ S. A fuzzy set F is defined by assigning to
each ω ∈ Ω a degree of membership. The degree of membership is a real number between 0 and 1.
Now consider Ω to be the set of all propositions and T the set of true propositions. Using classical
set theory one has to decide for all propositions ω ∈ Ω whether they are true or not. This however
might be difficult for imprecise propositions like “person X is cute” or “the temperature is high”. Using
fuzzy sets one can assign to each proposition a degree of membership to the set of true propositions,
thus a degree of truth. A degree of truth of 0 corresponds to absolute falsity and one of 1 corresponds
to absolute truth.
The key advantage of fuzzy logic is that it allows us to keep the set of propositions considered
relatively simple and manageable without too much unrealistic abstraction. In the early 1990s fuzzy
logic got very popular in the area of control theory and subsequently it has been used successfully in
simple control systems such as automatic gear boxes or auto-focus cameras.
In boolean logic it is very easy to determine whether composite propositions are true. “ω1 and
ω2 ” is true if and only if ω1 is true and ω2 is true. In fuzzy logic we have to give rules how to obtain
the degree of truth of composite propositions. There are many different flavours of fuzzy logic, but
the most common system of axioms uses the following rules: If t(ω) denotes the degree of truth of a
proposition ω, then

t(ω1 ∧ ω2 ) = min {t(ω1 ), t(ω2 )} (9.1)


t(ω1 ∨ ω2 ) = max {t(ω1 ), t(ω2 )} (9.2)
t(¬ω) = 1 − t(ω) (9.3)

In practise the input to a fuzzy logic system is often crisp, i.e. not fuzzy. We have to carry out

[Figure: membership functions for the fuzzy states “cold”, “medium” and “hot” plotted against the
temperature in °C (y-axis: degree of membership (truth), from 0 to 1).]

Figure 9.4: Example for a defuzzification

a fuzzification step. Usually a continuous measurement has to be transformed into one or more
fuzzy states; triangular or trapezoid membership functions are typically used for this purpose. Figure
9.4 gives an example of this: a measured temperature is transformed into the three fuzzy states “cold”,
“medium” and “hot”. The output from an expert system is often required to be crisp and not fuzzy, i.e.
at the end we have to carry out a defuzzification step. The easiest approach is to pick the hypothesis
which has the highest degree of truth (“maximum defuzzification technique”).
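A small Python sketch of the fuzzification/defuzzification idea; the breakpoints of the membership functions are invented for illustration and are not taken from figure 9.4:

def tri(x, a, b, c):
    """Triangular membership function rising from a to b and falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(temp):
    # assumed breakpoints for the three fuzzy states
    return {
        "cold":   tri(temp, -15, -5, 15),
        "medium": tri(temp, 5, 20, 35),
        "hot":    tri(temp, 25, 40, 55),
    }

t_and, t_or = min, max          # connectives (9.1) and (9.2)
def t_not(t): return 1.0 - t    # negation (9.3)

states = fuzzify(28.0)
print(states)
print(max(states, key=states.get))   # maximum defuzzification: pick the "truest" state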
Note that fuzzy logic has been and still is a highly contested subject. In his controversial paper
Elkan (1994) proved under the additional axiom

t(ω1 ) = t(ω2 ) if ω1 and ω2 are logically equivalent (in Boolean logic) (9.4)

that fuzzy logic collapses to boolean logic.4 In their responses the proponents of fuzzy logic argued
that Elkan’s assumption (9.4) is not needed in fuzzy logic. In particular, the law of the excluded middle,
stating that ω ∨ ¬ω is a tautology, is not valid in fuzzy logic. Of course, Elkan’s argument misses the point of
fuzzy logic, as fuzzy logic is a replacement for boolean logic. It is not that astonishing that forcing
boolean logic into fuzzy logic leads to boolean logic again. Nonetheless this shows us how difficult
it is to handle fuzzy logic correctly. Generally people reason in boolean logic and we expect that
propositions that are logically equivalent should be equally true. The degree of truth of a proposition
can depend on how we formulate it; two versions that are equivalent in (boolean) logic can have
different degrees of truth.
It is no less controversial whether fuzzy logic is a suitable tool for dealing with uncertainty. The
usual perception is that fuzzy logic is supposed to deal with imprecision, not with uncertainty. Figure
4 A simplified version of his proof is as follows: ω ∨ ¬ω is logically equivalent to a tautology, thus t(ω ∨ ¬ω) = 1 (using
(9.4)). Using (9.2) and (9.3), t(ω ∨ ¬ω) = max {t(ω), 1 − t(ω)}, thus max {t(ω), 1 − t(ω)} = 1 and t(ω) ∈ {0, 1}.

[Figure: uncertainty, P(“Apple is red”) = 0.5 and P(“Apple is green”) = 0.5, contrasted with
imprecision, t(“Apple is red”) = 0.5 and t(“Apple is green”) = 0.5.]

Figure 9.5: Uncertainty and imprecision

9.5 illustrates this idea. Russell and Norvig (2003, p. 463) for example state “most authors say
that fuzzy set theory is not a method for uncertainty reasoning at all”. The FAQ of the newsgroup
comp.ai.fuzzy however states that “fuzzy sets and logic must be viewed as a formal mathematical
theory for the representation of uncertainty”. This is to be doubted. Assume you want to meet up
with a friend and both of you tend to be late. Using fuzzy logic to model the uncertainty we could set
t(“I will be on time”) = t(“My friend will be on time”) = 0.5. Now

t(“Both of us will be on time”) = t (“I will be on time” ∧ “My friend will be on time”)
     = min {t(“I will be on time”), t(“My friend will be on time”)}   (using (9.1))
     = min{0.5, 0.5} = 0.5.

This result is highly counterintuitive, as the uncertainty of you being too late is as big as the uncertainty
of both you and your friend being late. In probability theory this would only happen if either both of
you are late or both of you are on time, which is far from being a realistic model!
A more detailed critique of fuzzy logic can be found in Laviolette et al. (1995), who essentially
show that nearly everything fuzzy logic has been used for can be done at least equally well using
probability theory (using continuous random variables).5

9.5 Probabilistic reasoning


On the surface it is far from obvious that beliefs should behave like frequencies and thus it is not
obvious that probability theory is the best tool for describing beliefs. For a long time the usage of
probability theory to model beliefs was refuted in the AI community. McCarthy and Hayes (1969)
5 In the apple example you can introduce three continuous random variables red, green and blue with support [0, 1]. This
would allow us to model the colour exactly.

 
[Figure: the hypothesis node H in the centre with arrows pointing to the evidence nodes E1, E2, E3 and E4.]

Figure 9.6: Graphical representation of the Naive Bayes model

found probability theory to be “epistemologically inadequate” for these purposes. Indeed, at first sight
there seem to be some reasons for not choosing probabilities: If we for example know “If a, then b”,
then “If not b, then not a” holds as well, as both are logically equivalent. However if P(b|a) = p, then
we do not in general have that P(¬a|¬b) = p.
However according to Pearl (1988, p.15f) there seems to be “a fortunate match between human
intuition and the laws of probability. . . . It came about because beliefs are formed not in a vacuum
but rather as a distillation of sensory experiences. For reasons of storage economy and generality
we forget the actual experiences and retain their mental impressions in form of averages, weights, or
(more vividly) abstract qualitative relationships that help us determine future actions. . . . With these
considerations in mind it is hard to envision how a calculus of beliefs can evolve that is substantially
different from the calculus of proportions and frequencies, namely probability.” Furthermore one
has to note that other ways of modelling uncertainty have quite a number of shortcomings and that
probability theory comes with a huge toolkit of concepts, methods and theorems that might come in
handy when using probabilities to model beliefs.
The simplest probabilistic model is the naive Bayes model which assumes that the pieces of evi-
dence Ej are conditionally independent given the (true) hypothesis:

     P(E1, . . . , Ek |H) = ∏_{j=1}^{k} P(Ej |H)

Thus we obtain for the conditional probability of H given the evidence:

     P(H|E1, . . . , Ek ) ∝ P(H) · ∏_{j=1}^{k} P(Ej |H)

This conditional probability measures the strength of the evidence in favour of the hypothesis H.
Figure 9.6 gives a graphical representation of the independence assumptions made. Probabilistic
reasoning can be used in far more complex situations as well. Then the dependency structure is usually
modeled using a Bayesian Network (see chapter 10).
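The naive Bayes computation is easily sketched in Python; the prior and the conditional probabilities below are invented numbers, used only to illustrate the formula:

# Naive Bayes: P(H | E1,...,Ek) proportional to P(H) * prod_j P(Ej | H).
prior = {"disease": 0.01, "healthy": 0.99}          # assumed prior over hypotheses
likelihood = {                                      # assumed P(evidence | hypothesis)
    "disease": {"fever": 0.80, "cough": 0.70},
    "healthy": {"fever": 0.05, "cough": 0.20},
}

def posterior(evidence):
    unnorm = {}
    for h, p in prior.items():
        for e in evidence:
            p *= likelihood[h][e]
        unnorm[h] = p
    z = sum(unnorm.values())                        # normalisation constant
    return {h: p / z for h, p in unnorm.items()}

print(posterior(["fever", "cough"]))   # strength of evidence for each hypothesis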
Chapter 10

Graphical models

10.1 Introduction
In the editorial preface to his book Jordan (1998) gave a good description of what graphical models
are about:
“Graphical models are a marriage between probability theory and graph theory. They
provide a natural tool for dealing with two problems that occur throughout applied math-
ematics and engineering – uncertainty and complexity – and in particular they are playing
an increasingly important role in the design and analysis of machine learning algorithms.
Fundamental to the idea of a graphical model is the notion of modularity – a complex
system is built by combining simpler parts. Probability theory provides the glue whereby
the parts are combined, ensuring that the system as a whole is consistent, and providing
ways to interface models to data. The graph theoretic side of graphical models provides
both an intuitively appealing interface by which humans can model highly-interacting sets
of variables as well as a data structure that lends itself naturally to the design of efficient
general-purpose algorithms.”
Essentially graphical models provide an easy way of modeling and visualising the structure of
complex data. They do so by modelling the marginal and conditional independence structure. Making
assumptions on (conditional) independences can simplify inference a lot. For example parameterising
a five-dimensional distribution of 0/1 indicators requires 2^5 − 1 = 31 parameters. When assuming a
graphical model as in figure 10.1 we can reduce the number of parameters to 10.

Marginal and conditional independence

Two events A and B are called (marginally) independent if

     P(A ∩ B) = P(A) · P(B).
Two events A and B are called conditionally independent given an event C if

P(A ∩ B|C) = P(A|C) · P(B|C).

159
160 CHAPTER 10. GRAPHICAL MODELS

P(A|C) hereby denotes the conditional probability of A given C: P(A|C) = P(A ∩ C) / P(C).
Example: The height and the IQ of a child are clearly correlated. But the height and the IQ are
conditionally independent given the age.
The (conditional) independence of random variables is defined in a similar way: Let X, Y and
Z be three random variables and f be their (probability) density function. Then X and Y are called
marginally independent if

f(X,Y)(x, y) = fX(x) fY(y) for all possible values x and y.

X and Y are called conditionally independent given Z if

f(X,Y |Z)(x, y|z) = f(X|Z) (x|z)f(Y |Z)(y|z) for all possible values x, y, and z.

In the following we will use a more sloppy notation that is used in most of the literature on
graphical models. The letter p can either denote a probability measure or a (probability) density
function. If A, B, and C are events then for example p(A, B|C) = P(A ∩ B|C). If X, Y , and Z are
random variables, then p(X, Y |Z) = f(X,Y |Z)(x, y|z).

In graphical models nodes correspond to variables and edges correspond to “links” between the vari-
ables.1 Not all of the nodes have to correspond to observable variables; some nodes might correspond
to hidden (i.e. unobservable) variables. There are two main types of graphical models:
• In directed graphical models (“Bayesian networks”) the edges are “arrows” and — very infor-
mally speaking — correspond to causation. Bayesian networks are an extremely useful tool for
specifying the structure of a probabilistic expert system.
• In undirected graphical models (“Markov graph”, “Markov random field”) the edges are — as
the name suggests — undirected. Missing edges correspond to conditional independences. Undi-
rected graphical models are generally more suitable for data mining.
However directed and undirected models can be combined resulting in so-called “chain graphs”; this
is however beyond the scope of this lecture.

10.2 Directed graphical models


A directed graph describes the factorisation properties of the underlying probability distribution. We
will restrict our attention to directed acyclic graphs (DAGs), which are graphs without directed circles,
i.e. when following the arrows one can never go through a single node twice. As the edges have
directions we can define the notion of parents. The parents of a node N are all the nodes which have
edges (“arrows”) directly pointing at N . Recall figure 9.6: it is in fact a directed graphical model
describing the naive Bayes rule. The parent of any piece of evidence Ej is the hypothesis H. Recall
further that the naive Bayes rule corresponds to
     p(E1, . . . , Ek , H) = p(H) · ∏_{j=1}^{k} p(Ej |H)
1 Note that in factor graphs nodes can represent both variables and factorisation properties.

We can rewrite this as:

     p(E1, . . . , Ek , H) = p(H) · ∏_{j=1}^{k} p(Ej |parents of Ej)

In the general case the link between the directed graph and the factorisation of the probability
distribution is similar. A DAG with nodes X1 , . . . , Xk corresponds to the following factorisation:

     p(x1, . . . , xk ) = ∏_{j=1}^{k} p(xj |parents of Xj)

This factorisation can be interpreted as: “different values of Xj are caused by its parents.” It is a
little more difficult to determine the conditional independences implied by the DAGs.

Conditional independence by examples (in directed graphs)

Example 1: Consider the following model for the link between the age A, the height H and the IQ I
of a child.
This model corresponds to the factorisation: p(A, H, I) = p(A)·p(I|A)·p(H|A)
[Diagram: A → H and A → I.] In this model we have p(H, I) ≠ p(H) · p(I), however

     p(H, I|A) = p(A, H, I) / p(A) = p(I|A) · p(H|A).
So height and intelligence are not marginally independent, but they are conditionally independent
given the age. The age is thus the “only link” between height and intelligence.
Note that the arrows at A are tail-to-tail.

Example 2: Consider the following model describing the smoke alarm in your house. S indicates
whether you smoke in your room, F whether the fire alarm goes off and R whether people run in
panic out of the building.
[Diagram: S → F → R.] The model corresponds to the factorisation: p(F, R, S) = p(S) · p(F |S) · p(R|F ).
Once again p(S, R) ≠ p(S) · p(R), but

     p(R, S|F ) = p(F, R, S) / p(F ) = p(R|F ) · p(F |S) · p(S) / p(F ) = p(R|F ) · p(S|F ),

since p(F |S) · p(S) / p(F ) = p(F, S) / p(F ) = p(S|F ).
So you smoking in your room is not independent of people running out of the building. They are
however conditionally independent given the fire alarm. Interpretation: The reason why people
run in panic out of the building when you smoke is that you set off the smoke alarm!
Note that the arrows at F are tail-to-head.

Example 3: Consider the following model for a burglar alarm (A) that is either set off by a burglar
in your house (B) or by an earthquake (E).
162 CHAPTER 10. GRAPHICAL MODELS

[Diagram: B → A ← E.] The model corresponds to the factorisation:

     p(A, B, E) = p(B) · p(E) · p(A|B, E).

Now we have that p(B, E) = p(B) · p(E), but

     p(B, E|A) = p(A, B, E) / p(A) = p(B) · p(E) · p(A|B, E) / p(A) ≠ p(B|A) · p(E|A)
So getting burgled (B) and an earthquake (E) are marginally independent. However they are not
conditionally independent given that the alarm went off (A). At first sight this seems a little bit
puzzling, but there is an easy explanation for it: If the alarm went off, we know that it was caused
either by an earthquake or by a burglar. If there was no earthquake, then we know that there is
a burglar in the house. So once the alarm goes off, B and E get (conditionally) dependent. This
phenomenon is often referred to as “explaining away”.
Note that the arrows at A are head-to-head.
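The "explaining away" effect can be checked numerically. In the Python sketch below the conditional probability tables are invented; the point is only that p(B, E) = p(B) · p(E) holds while p(B, E|A) ≠ p(B|A) · p(E|A):

# Burglar/earthquake/alarm: joint p(A,B,E) = p(B) p(E) p(A|B,E), all binary.
from itertools import product

pB, pE = 0.01, 0.02                                        # assumed marginals
pA_given = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.30, (0, 0): 0.01}   # assumed p(A=1|B,E)

def joint(a, b, e):
    pb = pB if b else 1 - pB
    pe = pE if e else 1 - pE
    pa = pA_given[(b, e)] if a else 1 - pA_given[(b, e)]
    return pb * pe * pa

def prob(pred):
    return sum(joint(a, b, e) for a, b, e in product([0, 1], repeat=3) if pred(a, b, e))

# marginally independent: P(B=1, E=1) equals P(B=1) * P(E=1)
print(prob(lambda a, b, e: b and e), prob(lambda a, b, e: b) * prob(lambda a, b, e: e))

# but not conditionally independent given A = 1:
pA = prob(lambda a, b, e: a)
lhs = prob(lambda a, b, e: a and b and e) / pA
rhs = (prob(lambda a, b, e: a and b) / pA) * (prob(lambda a, b, e: a and e) / pA)
print(lhs, rhs)   # these differ: the earthquake "explains away" the burglar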

In order to generalise the above findings on conditional independence we need to introduce the
concept of d-separation. Consider three sets of nodes (variables) A, B and C. A is then called d-
separated from B by C when every trail between A and B is blocked. A trail is a path from any
member of A to any member of B following links in any direction (i.e. even in the opposite direction).
Whether a trail is blocked at an intermediate node N depends on whether the arrows of the path at
N are head-to-head or not.

(i) If the two arrows at N are not head-to-head: Then the trail is blocked at N if N ∈ C.
(This corresponds to the examples 1 and 2.)

(ii) If the two arrows at N are head-to-head: Then the trail is blocked at N if neither N nor any of its
descendants are in C.
(This rule deals with the “explaining away” phenomenon and corresponds to example 3.)

The set of variables A is conditionally independent of B given C if A is d-separated from B by C.


Figure 10.1 shows an example of a more complex directed graph. In this graph A is conditionally
independent of D given B as we condition upon B and the arrows at B are tail-to-tail, so the only
path from A to D is blocked (using rule (i)). However A is not conditionally independent of B given
E: the arrows at C are head-to-head but E is a descendant of C and we have conditioned on E, so
using rule (ii) the path is not blocked.
Directed graphical models can be used in a huge variety of contexts:
They are a very good tool for specifying the structure of a probabilistic expert system. The CHILD
network (Spiegelhalter et al., 1993) is such an example; its DAG is displayed in figure 10.2.
Many statistical models are in fact directed graphical models. For example, the mixture of Gaussians
discussed in chapter 3.3.2 can be seen as a directed graphical model. In the one-dimensional case,
we observe only one variable, namely X. Like in the EM algorithm we can assume that there is an
unobserved variable S that indicates which cluster we are in. Obviously different values of S cause X
to have different distributions, so it is a DAG with a directed edge from S to X (see figure 10.3). Many
other classical statistical models are directed graphical models as well: factor analysis, probabilistic
principal component analysis, independent component analysis, hierarchical mixtures of experts, etc.
Although undirected graphs are more popular in data mining, directed graphical models can be used
there as well. Heckerman (1996) gives an overview.
10.2. DIRECTED GRAPHICAL MODELS 163

[Figure: a directed acyclic graph on the nodes A, B, C, D and E.]

Figure 10.1: Example for a directed acyclic graph (DAG)

Figure 10.2: DAG used in the CHILD network

[Figure: S (cluster indicator) → X (observed variable).]

Figure 10.3: Mixture of Gaussians as a DAG



[Figure: undirected graph on the nodes A, B, C, D and E with cliques {A, B, C}, {C, D} and {D, E}.]

Figure 10.4: Example of an undirected graph with the corresponding cliques (dotted boxes)

[Figure: three undirected graphs on T, M and S — panel i.: only the edge M − S; panel ii.: the chain
T − M − S; panel iii.: all three edges T − M, M − S and T − S.]

Figure 10.5: Different undirected graphs for the links between treatment T , messenger substance M ,
and the health state S of a patient

10.3 Undirected graphical models


Determining conditional independences in DAGs can be quite cumbersome. In undirected graphical
models this is a lot easier: missing edges correspond to conditional independences. More precisely:
a set of nodes A is conditionally independent of B given C if all paths from any member of A to any
member of B go through a member of C (“C blocks all paths from A to B”). If there is no path at all
between A and B then A and B are (marginally) independent.
Figure 10.4 shows an easy example of an undirected graph. A is conditionally independent of
E given C, however the conditional independence does not hold if we condition on B, as the path
A − C − D − E is not blocked.
Figure 10.5 shows three different undirected graphical models that describe the effect of a treat-
ment T on the health state of a patient S. This effect might be via a messenger substance M .

Panel i. The graphical model on the left-hand side corresponds to a (placebo) treatment that does
not affect the messenger substance or the health state, as T is independent of M and S.
M and S are dependent, so the messenger substance does have an influence on the health
state.

Panel ii. The graphical model in the middle corresponds to a treatment which influences the health
state only by means of the messenger substance: T is conditionally independent of S given
M.

Panel iii. The graphical model on the right hand side corresponds to a treatment which influences the
health state both directly and via the messenger substance.

In the general case the Markov property induced by the graph is that

p(xi |all other variables) = p(xi |all neighbouring nodes of Xi ).



From this property we can derive the following rule: In p(A|B) we can omit a node from the
conditioning set B if all the paths from that node to members of A are blocked by another node in B.
For the distribution corresponding to the graph in figure 10.4 we obtain:

     p(A, B, C, D, E) = p(A) · p(B|A) · p(C|A, B) · p(D|A, B, C) · p(E|A, B, C, D)
                      = [p(A) · p(B|A) · p(C|A, B)] · [p(D|C)] · [p(E|D)]
                      =: ψ1(A, B, C) · ψ2(C, D) · ψ3(D, E)

Thus we can factor the distribution into so-called potential functions. The first one depends only on
A, B, and C, the second one only on C and D, and the third one only on D and E.
What do these sets {A, B, C}, {C, D}, and {D, E} correspond to? They correspond to the cliques of
the graph. A clique is a maximal subgraph in which all nodes are connected.2 This can be generalised:
The joint distribution on an undirected graphical model can be written as
     p(x1, . . . , xk ) ∝ ∏_{C∈C} ψC(xC)                    (10.1)

where ψC is a non-negative function (that is not everywhere 0), called the potential function. C is the
set of all cliques and xC denotes all variables in the clique C. The converse is true as well: any set
of potential functions with the above properties defines a probability distribution on the undirected
graphical model with the corresponding cliques. Note the “∝” in the potential representation. If we
actually want to compute (numerical) probabilities we have to compute the value of the normalisation
constant, which is

     Σ_{all possible values} ∏_{C∈C} ψC(xC).

Note that this can be quite cumbersome to compute. For a graph with 30 binary variables this amounts
to summing up 2^30 = 1,073,741,824 numbers. If we want to use graphical models in practice we
have to find an alternative to the above formula.
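For a small graph the normalisation constant can still be computed by brute force. The Python sketch below uses invented potentials on the cliques {A, B, C}, {C, D} and {D, E} of figure 10.4 and sums over all 2^5 configurations — with 30 binary variables the same loop would have 2^30 terms, which is exactly the problem described above:

from itertools import product

def psi1(a, b, c): return 1.0 + 2.0 * (a == b) + (b == c)   # clique {A,B,C}, invented values
def psi2(c, d):    return 2.0 if c == d else 0.5            # clique {C,D}
def psi3(d, e):    return 1.5 if d == e else 1.0            # clique {D,E}

Z = sum(psi1(a, b, c) * psi2(c, d) * psi3(d, e)
        for a, b, c, d, e in product([0, 1], repeat=5))     # normalisation constant

def p(a, b, c, d, e):
    return psi1(a, b, c) * psi2(c, d) * psi3(d, e) / Z

print(Z, sum(p(*x) for x in product([0, 1], repeat=5)))     # the probabilities sum to 1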

10.4 From directed to undirected graphical models


Note that not every set of conditional independence assumptions corresponds to a graphical model.3
Furthermore certain independence assumptions only correspond to a directed graph whereas others
only correspond to an undirected graph. Figure 10.6 shows a directed graphical model that does not
correspond to an undirected model and an undirected model that does not correspond to a directed
one.4
2 In this context “maximal” means that we cannot find another node which we could add to the clique without destroying
its clique property.
3 By saying “a set of conditional independence assumptions corresponds to a graph” we mean that the set of conditional
independence assumptions and the set of conditional independences implied by the graph are the same. However one
can always find a graphical model that represents a (maximal) subset of these conditional independence assumptions. The
rest of the conditional independences can be modeled by choosing a corresponding distribution (parametrisation).
4 One can show that every DAG can be represented as an undirected graph if and only if all parents are married. An
undirected graph can be represented as a DAG if and only if it is chordal (for a definition of chordal see section 10.5.1).

[Figure, left: directed graph B → A ← E; right: undirected four-cycle A − B − C − D − A without a chord.]

Figure 10.6: Directed graph with no equivalent undirected graph (left hand side) and undirected graph
with no equivalent directed graph (right hand side)

[Figure: a DAG on the nodes A, B, C, D and E (left) and the undirected graph obtained from it by
moralisation (right).]

Figure 10.7: Directed graph (left hand side) and corresponding undirected graph obtained by
moralisation (right hand side)

For inference purposes we often have to convert a directed graph to an undirected one. As we
have just seen there might not be an equivalent undirected model. How could we proceed? Can we
simply convert the directed edges to undirected ones, i.e. remove the arrowheads? The answer is
unfortunately no. The reason for this is the “explaining away” phenomenon. In the graphical model
on the left hand side B and E are not conditionally independent given A. However if we dropped
the arrows, B and E would be conditionally independent given A. We would thus introduce a new
conditional independence, i.e. make our model more restrictive and possibly unrealistic. The only
way around this problem is to add an edge between B and E.
This step is called moralisation. In the general case a directed graph is moralised by replacing all
directed edges by undirected ones and by adding edges joining the parents of every node. The price
we have to pay for the moralisation is that we lose some conditional independences, i.e. our model
gets broader.

10.5 Computing probabilities on graphs


10.5.1 The message passing algorithm for exact inference
If we are interested in the marginal distribution of a single variable Xj we have to sum over all possible
combinations of all other variables, as5
     p(xj) = Σ_{x1,...,xj−1,xj+1,...,xk} p(x1, . . . , xk ) ∝ Σ_{x1,...,xj−1,xj+1,...,xk} ∏_{C∈C} ψC(xC).

5 If the random variables are continuous we have to replace the sum by an integral.
[Figure: undirected graph on the nodes A, B, C, D and E with cliques {A, B}, {B, C, D} and {C, D, E}.]

Figure 10.8: Example for a chordal graph

The number of summands is thus exponential in the number of variables, making this summation
NP-hard.
Now consider the graph in figure 10.8. In order to compute p(xA) we need to sum over xB, xC,
xD and xE:

     p(xA) = (1/C) Σ_{xB,xC,xD,xE} φAB(xA, xB) · φBCD(xB, xC, xD) · φCDE(xC, xD, xE)
           = (1/C) Σ_{xB,xC,xD} φAB(xA, xB) · φBCD(xB, xC, xD) · [Σ_{xE} φCDE(xC, xD, xE)]       (10.2)
           = (1/C) Σ_{xB} φAB(xA, xB) · [Σ_{xC,xD} φ*BCD(xB, xC, xD)]
           = (1/C) Σ_{xB} φ*AB(xA, xB),

where ψ*CD(xC, xD) := Σ_{xE} φCDE(xC, xD, xE), φ*BCD(xB, xC, xD) := φBCD(xB, xC, xD) · ψ*CD(xC, xD),
ψ*B(xB) := Σ_{xC,xD} φ*BCD(xB, xC, xD) and φ*AB(xA, xB) := φAB(xA, xB) · ψ*B(xB).

If we group the potentials as above, the summation simplifies a lot. This simplification corre-
sponds to finding a so-called elimination order. Even though finding a good elimination order is rather
trivial in our example, it is NP-hard in general. Furthermore one has to find an elimination order per
marginal.
Is there an easier way of obtaining the marginal probabilities? The general answer to this problem
is unfortunately no, but in some cases the calculations can be simplified a lot.
Let’s look again at the graph in figure 10.8. The corresponding distribution can be written as

     p(A, B, C, D, E) = p(A) · p(B|A) · p(C|A, B) · p(D|A, B, C) · p(E|A, B, C, D)
                      = p(A) · p(B|A) · p(C|B) · p(D|B, C) · p(E|C, D)
                      = p(A, B) · [p(B, C) / p(B)] · [p(B, C, D) / p(B, C)] · [p(C, D, E) / p(C, D)]
                      = ( p(A, B) · p(B, C, D) · p(C, D, E) ) / ( p(B) · p(C, D) )
Once again the factorisation of the numerator corresponds to the cliques. What does the factorisation
of the denominator correspond to? {B} and {C, D} are the separators. A separator is the set of nodes
which two neighbouring cliques have in common.
168 CHAPTER 10. GRAPHICAL MODELS

This can be generalised, however not to all graphs: if a graph is chordal then its distribution can
be factorised as follows:
     p(x1, . . . , xk ) = ∏_{C∈C} ψC(xC) / ∏_{S∈S} φS(xS),                    (10.3)

where S is the set of all separators.


A graph is called chordal (or triangulated) if it has no cycles of four or more nodes without a
chord. A chord is an edge joining two non-consecutive nodes. The right-hand graph in figure 10.6 is
an example for a graph that is not chordal. The cycle A − B − C − D − A consists of four nodes, but
has no chord. If we added an edge between A and C (or between B and D) it would be chordal.
Even though not every graph is chordal there exists a technique to convert any graph into a chordal
graph; once again this is done by adding additional edges to it. One important byproduct of the so-
called triangulation is the join tree. The join tree is a tree of all cliques having the running intersection
property : If a node appears in any two cliques it is as well in all cliques that lie on the join tree between
the two cliques.

Triangulating a graph

The following algorithm triangulates the graph (steps 1 & 2) and constructs the join tree (steps 3 &
4). Consider as an example the graph below; clearly it is not chordal.

[Figure: undirected graph with edges A − B, B − C, B − D, C − E and D − E; the cycle B − C − E − D − B
has no chord.]

1. Order the vertexes by a maximum cardinality search : choose any vertex and number it “1”.
Then number the other vertexes as follows: choose one of the vertexes with the largest number
of already numbered neighbours and number it next. Repeat this process until all vertexes are
numbered.

[Figure: the same graph with the vertexes numbered E = 1, D = 2, C = 3, B = 4, A = 5.]

2. Using the reverse ordering obtained in step 1 (i.e. starting with the highest numbered vertex)
carry out the following step for each vertex: For a given vertex ensure by adding missing edges
that all its lower-numbered neighbours are themselves neighbours; if two lower-numbered
neighbours are not already joined by an edge, add the missing edge.

[Figure: the triangulated graph, in which the fill-in edge C − D has been added.]

3. Identify all cliques and order them by the highest-numbered vertex in the clique.

1. CDE (3) / 2. BCD (4) / 3. AB (5)

4. Starting with the first clique (in the ordering obtained in step 3), add the cliques one by one
   and connect each clique to a predecessor with which it shares the most vertexes.

     CDE — BCD — AB

Note that there are often many different ways of triangulating a graph.
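The chordality check and the cliques of the triangulated graph can be verified with the networkx package (an assumption — the lecture does not prescribe any software; the edge list is read off the worked example above):

import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")])
print(nx.is_chordal(G))            # False: the cycle B-C-E-D-B has no chord

G.add_edge("C", "D")               # the fill-in edge added in step 2
print(nx.is_chordal(G))            # True
print(list(nx.find_cliques(G)))    # cliques AB, BCD and CDE (in some order)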

Why is (10.3) so important and what is its advantage over (10.1)? Using (10.3) there is an easy
way to modify (10.1), such that the clique potentials correspond to the joint distribution, i.e. we can
obtain that φ**C(xC) = p(xC).
In order to convert clique potentials to joint distributions we need to achieve consistency. Consider
the graph of the above example. Obviously
     Σ_{xB} pBCD(xB, xC, xD) = pCD(xC, xD) = Σ_{xE} pCDE(xC, xD, xE).

In other words: We can obtain the marginal on the separator {C, D} from both the joint distribution
of the clique {B, C, D} and the joint distribution of the clique {C, D, E}. The joint distributions on the
cliques are locally (and globally ) consistent. However not all potential functions imply local or global
consistency; we could have
     Σ_{xB} φBCD(xB, xC, xD) ≠ Σ_{xE} φCDE(xC, xD, xE).

In analogy to the joint distributions two potentials φC1 and φC2 with separator S = C1 ∩ C2 are called
locally consistent if
     Σ_{all possible values of C1\S} φC1(xC1) = Σ_{all possible values of C2\S} φC2(xC2).

If the graph is triangulated (i.e. if the cliques can be arranged on a join tree) then there is an easy way
to obtain local and global consistency: message passing. This algorithm was independently discovered
by Pearl (1982) and Lauritzen and Spiegelhalter (1988).6
6 Pearl's original algorithm was designed for polytrees only. Lauritzen and Spiegelhalter's algorithm works with any
chordal graph, using the trick that the cliques can be arranged on a tree.

Recall equation (10.2). It can be seen as passing a message (ψ*CD) from the clique {C, D, E} to
the clique {B, C, D}. Then the potential for the clique {B, C, D} is updated. Next a message (ψ*B)

is passed from the clique {B, C, D} to the clique {A, B} and its clique potential is updated. On the
corresponding join tree this corresponds to passing a message from the left to the right, which allowed
us to compute the marginal of the right-hand clique {A, B}.
Obtaining local consistency can be achieved by message passing: One node is chosen as root node.
First messages are passed in towards the root node (starting with the cliques furthest away) and then
messages are passed out from the root node. For convenience we initialise ψS (·) = 1 for all separators
S and incorporate the given probabilities into suitable clique potentials φC . Consider two cliques C1
and C2 with separator S, where C2 is closer to the root node. So we first have to pass a message from
C1 to C2 . The message is7
     ψ*S(xS) = Σ_{C1\S} φ*C1(xC1)

and thus we update

     φ*C2(xC2) = φC2(xC2) · ψ*S(xS) / ψS(xS).

When passing messages out from the root node, we pass a message

     ψ**S(xS) = Σ_{C2\S} φ**C2(xC2)

from C2 to C1 yielding the update

     φ**C1(xC1) = φ*C1(xC1) · ψ**S(xS) / ψ*S(xS).

Note that now


 
     Σ_{C1\S} φ**C1(xC1) = Σ_{C1\S} φ*C1(xC1) · ψ**S(xS) / ψ*S(xS)
                        = (ψ**S(xS) / ψ*S(xS)) · Σ_{C1\S} φ*C1(xC1)
                        = ψ**S(xS) = Σ_{C2\S} φ**C2(xC2),

since Σ_{C1\S} φ*C1(xC1) = ψ*S(xS).

Thus we have obtained local consistency. But more than that: one can even show that if the graph
is chordal, message passing implies global consistency, in particular φ**C = pC, i.e. the clique potentials
correspond to the marginals on the cliques. One can show this by using (10.3) and by stepwise
reducing the join tree to one clique. In our example (figure 10.8) this corresponds to (using the clique

7 In the inward pass we do not need to update the cliques from which we start passing a message, i.e. for such a clique E
we have φ*E = φE. In the outward pass we do not have to update the root clique R, i.e. φ**R = φ*R.

{A, B} as an example):

     pABCD(xA, xB, xC, xD) = Σ_{xE} pABCDE(xA, xB, xC, xD, xE)
        = Σ_{xE} φ**AB(xA, xB) · φ**BCD(xB, xC, xD) · φ**CDE(xC, xD, xE) / (ψ**B(xB) · ψ**CD(xC, xD))
        = [φ**AB(xA, xB) · φ**BCD(xB, xC, xD) / (ψ**B(xB) · ψ**CD(xC, xD))] · Σ_{xE} φ**CDE(xC, xD, xE)
        = φ**AB(xA, xB) · φ**BCD(xB, xC, xD) / ψ**B(xB),

since Σ_{xE} φ**CDE(xC, xD, xE) = ψ**CD(xC, xD). Similarly

     pAB(xA, xB) = Σ_{xC,xD} pABCD(xA, xB, xC, xD)
        = Σ_{xC,xD} φ**AB(xA, xB) · φ**BCD(xB, xC, xD) / ψ**B(xB)
        = [φ**AB(xA, xB) / ψ**B(xB)] · Σ_{xC,xD} φ**BCD(xB, xC, xD)
        = φ**AB(xA, xB),

since Σ_{xC,xD} φ**BCD(xB, xC, xD) = ψ**B(xB).

What have we gained by all this? Initially we had to work out the marginals from the joint distri-
bution of all variables (of which there might be several hundred or more); now we can work them out from
the joint distribution of the cliques (which hopefully consist of fewer variables). Initially the problem was
NP-hard in the number of variables, now it is “only” NP-hard in the maximal number of variables in
any single clique.
Note that one can always obtain local consistency between two cliques by carrying out the message
passing — even if the cliques cannot be arranged on a join tree. However there is no guarantee that we
can obtain the local consistency between two cliques without breaking another one, so convergence is
not guaranteed. This is often called loopy belief propagation (see e.g. turbo decoding).8
The above algorithm can be easily modified to obtain conditional probabilities p(x|B) by setting
the probabilities of all events that are incompatible with B to 0. The probabilities then do not sum
to 1 anymore, but the message passing algorithm can then be carried out as normal, just the final
probabilities need to be renormalised (see example below).

Toy example: Message passing

Consider the following over-simplified Bayesian expert system for the diagnosis of lung cancer
and tuberculosis. The variables are smoking status S, whether the patient has lung cancer L, whether
the patient has been to a Tb region R, whether the patient is infected with tuberculosis T , and whether
the patient suffers from breathing problems B.

8 Initially Pearl proposed the message passing algorithm only for polytrees, i.e. graphs without cycles (“loops”).

[Figure: DAG with edges S → L, R → T , L → B and T → B.]

The corresponding moralised undirected graph is already triangulated

[Figure: undirected graph with edges S − L, R − T , L − B, T − B and the moralisation edge L − T .]

and the corresponding join tree is:


     SL — LBT — TR

Consider the following conditional probabilities: p(T |R) = 0.1, p(T |R̄) = 0.01, p(L|S) = 0.3,
p(L|S̄) = 0.05, p(B|L, T ) = 0.9, p(B|L, T̄ ) = 0.5, p(B|L̄, T ) = 0.8, and p(B|L̄, T̄ ) = 0.3.
Using this information we now can compute the probability that a smoker who suffers from
breathing problems and has not been to a Tb region is infected with Tb. As we are after a conditional
probability we set the potential for all events which are not compatible with the condition (smoker,
no visit to a Tb region, breathing problems) to 0.
We will incorporate the probability p(T |R̄) = 0.01 into the clique potential φ{R,T } of the clique
{R, T }. As R is not compatible with the evidence we are conditioning on we set φ{R,T } (R, ·) = 0.

φ{R,T } T T̄
R 0 0
R̄ 0.01 0.99

The probability p(L|S) = 0.3 will be incorporated into the clique potential φ{S,L} of the clique
{S, L}.

φ{S,L} L L̄
S 0.3 0.7
S̄ 0 0

The remaining probabilities will be incorporated into the potential φ_{L,B,T} of the clique {L, B, T}.

φ{L,B,T } T T̄
L L̄ L L̄
B 0.9 0.8 0.5 0.3
B̄ 0 0 0 0

For the message passing we select the clique {L, B, T} as the root, so we first have to pass a
message from {R, T} to the clique {L, B, T}. In order to do that we compute the marginal of φ_{R,T}

w.r.t. T, i.e.

ψ_{T}    T       T̄
         0.01    0.99

The second message of the inward pass is from {S, L} to {L, B, T}; for this we have to compute
the marginal of φ_{S,L} w.r.t. L, i.e.

ψ_{L}    L      L̄
         0.3    0.7

We can now update the clique potential of the clique {L, B, T}:


φ*_{L,B,T}                  T                                                      T̄
                 L                           L̄                           L                           L̄
B     0.9 · 0.01 · 0.3 = 0.0027   0.8 · 0.01 · 0.7 = 0.0056   0.5 · 0.99 · 0.3 = 0.1485   0.3 · 0.99 · 0.7 = 0.2079
B̄     0 · 0.01 · 0.3 = 0          0 · 0.01 · 0.7 = 0          0 · 0.99 · 0.3 = 0          0 · 0.99 · 0.7 = 0

Next we would have to pass messages from {L, B, T} back to {R, T} and to {S, L} (the outward pass). However,
as our answer can be found using φ*_{L,B,T} alone, we do not have to do this here. The probability that the
patient is infected with Tb can now be read off the above table:

\[
p(T \mid B, S, \bar{R}) = \frac{0.0027 + 0.0056}{0.0027 + 0.0056 + 0.1485 + 0.2079} = \frac{0.0083}{0.3647} \approx 0.0228 .
\]
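To make the bookkeeping explicit, the following short sketch (Python with numpy; the array layout and variable names are our own and not taken from any graphical-models package) reproduces the calculation above: it stores the three clique potentials as arrays, zeroes the entries that are incompatible with the evidence, passes the two inward messages to the root clique {L, B, T} and reads off the conditional probability.

```python
import numpy as np

# Index 0 = "yes"/"true", index 1 = "no"/"false" for every binary variable.
YES, NO = 0, 1

# Clique potential of {R, T}: rows indexed by R, columns by T.
phi_RT = np.array([[0.10, 0.90],    # R:     p(T|R) = 0.1,     p(not T|R) = 0.9
                   [0.01, 0.99]])   # not R: p(T|not R) = 0.01, p(not T|not R) = 0.99
phi_RT[YES, :] = 0.0                # evidence: the patient has NOT been to a Tb region

# Clique potential of {S, L}: rows indexed by S, columns by L.
phi_SL = np.array([[0.30, 0.70],    # S:     p(L|S) = 0.3,      p(not L|S) = 0.7
                   [0.05, 0.95]])   # not S: p(L|not S) = 0.05, p(not L|not S) = 0.95
phi_SL[NO, :] = 0.0                 # evidence: the patient is a smoker

# Clique potential of {L, B, T}, indexed [L, B, T]; the "no breathing problems"
# slice stays zero because of the evidence B.
phi_LBT = np.zeros((2, 2, 2))
phi_LBT[YES, YES, YES] = 0.9        # p(B | L, T)
phi_LBT[NO,  YES, YES] = 0.8        # p(B | not L, T)
phi_LBT[YES, YES, NO]  = 0.5        # p(B | L, not T)
phi_LBT[NO,  YES, NO]  = 0.3        # p(B | not L, not T)

# Inward pass: marginalise each leaf potential onto its separator ({T} resp. {L})
# and multiply the messages into the root potential.
psi_T = phi_RT.sum(axis=0)
psi_L = phi_SL.sum(axis=0)
phi_LBT_star = phi_LBT * psi_L[:, None, None] * psi_T[None, None, :]

# Conditional probability of a Tb infection given the evidence.
p_T = phi_LBT_star[:, :, YES].sum() / phi_LBT_star.sum()
print(round(p_T, 4))                # approx. 0.0228, as in the calculation above
```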

10.5.2 Approximate inference


Monte Carlo methods

Whilst computing a marginal (or conditional) probability p(x_j) as described in the previous section can
be difficult or even infeasible in practice, it might well be possible to draw a random sample from
p(x_j). Once we have drawn random samples, we can estimate p(x_j) by

\[
\hat{p}(x_j) = \frac{\#\text{ samples with } X_j = x_j}{\#\text{ samples}} .
\]

More generally, we can use this device to estimate any expected value E f(X).
The easiest way of sampling from a DAG is forward sampling. Recall that for a DAG

\[
p(x_1, \ldots, x_k) = \prod_{j=1}^{k} p(x_j \mid \text{parents of } X_j).
\]

Due to its Markov properties and the fact that the graph is acyclic, we can group all variables
which correspond to nodes without parents in a set V0 . We can group all the other nodes into sets
V1 , . . . , Vr such that all variables in Vk are conditionally independent given the variables in V0 ∪ . . . ∪
Vk−1 . For the graph in figure 10.1, we could set V0 = {A, B}, V1 = {D, E}, and V2 = {C}. Now
we can first sample from the variables in V0 , then from the variables in V1 given the values of the
variables in V0 etc. In the example we would first draw a random number from p(xA ) and p(xB ), then
from p(xE |xA ) and p(xD |xB ), and finally from p(xC |xE ). We can even use this method to compute

conditional probabilities p(xj |B) by discarding samples that are not compatible with B. This is however
a very inefficient way of sampling from p(xj |B), especially if p(B) is small.
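As an illustration, here is a minimal forward-sampling sketch (Python with numpy) for the lung-cancer toy network of the previous section. The conditional probabilities are the ones given there; the marginal probabilities p(S) = 0.3 and p(R) = 0.05 are purely hypothetical, as the example never specifies them (they cancel in the exact calculation anyway). Samples that are incompatible with the evidence (smoker, no visit to a Tb region, breathing problems) are discarded, and the fraction of the remaining samples with T = true estimates p(T | B, S, R̄).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginals (not given in the notes; they cancel in the exact answer).
p_S, p_R = 0.3, 0.05
# Conditional probabilities from the toy example.
p_L_given_S = {True: 0.30, False: 0.05}
p_T_given_R = {True: 0.10, False: 0.01}
p_B_given_LT = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.8, (False, False): 0.3}

def forward_sample():
    """Draw one joint realisation, sampling parents before children."""
    S = rng.random() < p_S
    R = rng.random() < p_R
    L = rng.random() < p_L_given_S[S]
    T = rng.random() < p_T_given_R[R]
    B = rng.random() < p_B_given_LT[(L, T)]
    return S, R, L, T, B

kept, hits = 0, 0
for _ in range(200_000):
    S, R, L, T, B = forward_sample()
    if S and (not R) and B:      # discard samples incompatible with the evidence
        kept += 1
        hits += T
print(hits / kept)               # close to the exact value of approx. 0.0228
print(kept)                      # number of samples that survived the rejection step
```

The second printed number illustrates the inefficiency mentioned above: most of the generated samples are thrown away, and the waste gets worse the smaller p(B) is.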
Gibbs sampling is a more general method of obtaining random samples from an undirected or
directed graph. For this, however, we need the concept of the Markov blanket. For a given vari-
able (node) xj the Markov blanket MXj is the minimal set of variables, such that p(xj |MXj ) =
p(xj |all other variables), i.e. given the variables in the Markov blanket xj is conditionally independent
of all the other variables. For undirected graphs the Markov blanket of a variable consists of all neigh-
bouring nodes. For directed graphs it consists of all parents, children and the coparents9 of the node.
In figure 10.1 the Markov blanket of B is the set of nodes {A, E, D}.
The Gibbs sampler proceeds as follows. In each step a variable X_j is chosen and a new value of
X_j is drawn at random from

\[
p(x_j \mid \text{all other variables}) = p(x_j \mid M_{X_j}) = \frac{\prod_{C \in C_{X_j}} \phi_C(x_C)}{\sum_{x_j} \prod_{C \in C_{X_j}} \phi_C(x_C)}, \qquad (10.4)
\]

where C_{X_j} denotes the set of cliques to which the node X_j belongs.10
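For instance, in the lung-cancer toy network above the cliques containing T are {R, T} and {L, B, T}. Using the conditional probability tables of that example as clique potentials (any factor not involving T cancels in the ratio), the full conditional of T given, say, R̄, L̄, B and S is

\[
p(T \mid \bar{R}, \bar{L}, B, S) = \frac{p(T \mid \bar{R})\, p(B \mid \bar{L}, T)}{p(T \mid \bar{R})\, p(B \mid \bar{L}, T) + p(\bar{T} \mid \bar{R})\, p(B \mid \bar{L}, \bar{T})} = \frac{0.01 \cdot 0.8}{0.01 \cdot 0.8 + 0.99 \cdot 0.3} \approx 0.026 .
\]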


More mathematically, the Gibbs sampler yields a Markov chain, as the new realisation (X_1^{(h)}, ..., X_k^{(h)})
depends on the past only through the last realisation (X_1^{(h-1)}, ..., X_k^{(h-1)}) (by construction). If
(X_1^{(h-1)}, ..., X_k^{(h-1)}) has the desired distribution p(·), then so has (X_1^{(h)}, ..., X_k^{(h)});
p(·) is thus the invariant distribution of the Markov chain formed by the Gibbs sampler.
In other words, if we can manage that our first random realisation (X_1^{(1)}, ..., X_k^{(1)}) is from p(·), then
all subsequent realisations (X_1^{(h)}, ..., X_k^{(h)}) will be as well. So far we have not gained much.
What makes Gibbs sampling work is that under some regularity conditions11 a Markov chain con-
verges to its invariant distribution. In other words, regardless of how we choose the initial values,
the distribution of (X_1^{(h)}, ..., X_k^{(h)}) will tend under these regularity conditions for h → +∞ to the de-
sired distribution p(·). If we want to compute a conditional probability p(· | ∩_l {X_{j_l} = x_{j_l}}), we simply
do not choose the X_{j_l}^{(h)} at random, but set X_{j_l} ≡ x_{j_l}.
In practice, the Gibbs sampler is run with essentially arbitrary starting values chosen at random. At
the beginning, i.e. during the so-called burn-in period, the generated random samples are discarded
and only the "late" ones are used. The idea behind this approach is that one waits for the Markov chain
to get close to the invariant distribution.
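As a concrete, if tiny, illustration, the sketch below (Python with numpy; our own minimal implementation, not taken from any library) runs a Gibbs sampler on the lung-cancer toy network. The evidence variables S, R and B are clamped to their observed values, and in each sweep L and T are redrawn from their full conditionals, which by (10.4) involve only the cliques containing them. After the burn-in period the relative frequency of T estimates p(T | B, S, R̄), whose exact value was computed above as approximately 0.0228.

```python
import numpy as np

rng = np.random.default_rng(1)

# Conditional probability tables from the toy example.
p_L_given_S = {True: 0.30, False: 0.05}
p_T_given_R = {True: 0.10, False: 0.01}
p_B_given_LT = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.8, (False, False): 0.3}

# Evidence: smoker, no visit to a Tb region, breathing problems.
S, R, B = True, False, True

def draw_L(T):
    """Full conditional of L: proportional to phi_{S,L}(S, L) * phi_{L,B,T}(L, B, T)."""
    w_true = p_L_given_S[S] * p_B_given_LT[(True, T)]
    w_false = (1 - p_L_given_S[S]) * p_B_given_LT[(False, T)]
    return rng.random() < w_true / (w_true + w_false)

def draw_T(L):
    """Full conditional of T: proportional to phi_{R,T}(R, T) * phi_{L,B,T}(L, B, T)."""
    w_true = p_T_given_R[R] * p_B_given_LT[(L, True)]
    w_false = (1 - p_T_given_R[R]) * p_B_given_LT[(L, False)]
    return rng.random() < w_true / (w_true + w_false)

# Arbitrary starting values, a burn-in period, then averaging over the kept sweeps.
L, T = True, True
burn_in, n_sweeps, t_count = 1_000, 50_000, 0
for sweep in range(burn_in + n_sweeps):
    L = draw_L(T)
    T = draw_T(L)
    if sweep >= burn_in:
        t_count += T
print(t_count / n_sweeps)   # estimate of p(T | B, S, not R), approx. 0.0228
```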
The Metropolis-Hastings (MH) algorithm is another popular Markov chain Monte Carlo (MCMC)
algorithm. Its advantage over the Gibbs sampler is twofold: we do not have to be able to draw
at random from p(x_j | M_{X_j}), and we only need to be able to compute unnormalised
densities, i.e. we do not have to compute the denominator in (10.4). The MH algorithm is based on
a proposal distribution q_j(y | x_j^{(h-1)}) from which we draw a random proposal Y_j^{(h)}. We then
"accept" y_j^{(h)} with probability

\[
A_j\bigl(y_j^{(h)} \mid x_j^{(h-1)}\bigr) = \min\left\{ 1, \; \frac{p\bigl(y_j^{(h)} \mid M_{X_j}^{(h-1)}\bigr)}{p\bigl(x_j^{(h-1)} \mid M_{X_j}^{(h-1)}\bigr)} \cdot \frac{q_j\bigl(x_j^{(h-1)} \mid y_j^{(h)}\bigr)}{q_j\bigl(y_j^{(h)} \mid x_j^{(h-1)}\bigr)} \right\}.
\]
9 The coparents of a node are the other parents of its children.
10 The last equality sign is only meaningful for undirected graphs.
11 These are for example fulfilled if one can get from any state to any other state with positive probability (not necessarily
in one step) and if the chain is aperiodic.

If we accept y_j^{(h)} we set x_j^{(h)} = y_j^{(h)}, otherwise we set x_j^{(h)} = x_j^{(h-1)}. Note that we only need to
know the unnormalised densities, as only their ratio enters the acceptance probability. Once again one can show that the desired
distribution p(·) is the invariant distribution of the resulting Markov chain of random realisations.
When the proposal density is symmetric (which is the case for a Gaussian random-walk proposal), the proposal ratio
q_j(x_j^{(h-1)} | y_j^{(h)}) / q_j(y_j^{(h)} | x_j^{(h-1)}) equals 1 and we only need the ratio of the densities
p(y_j^{(h)} | M_{X_j}^{(h-1)}) / p(x_j^{(h-1)} | M_{X_j}^{(h-1)}). This is the original version of the
algorithm (Metropolis et al., 1953). The Gibbs sampler is a special case of the MH algorithm which uses the full
conditional distribution of the variable given the others as the proposal distribution.
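As a generic illustration that is not tied to the graphical-model setting, the sketch below implements a random-walk Metropolis sampler in Python. The target density (here an unnormalised standard normal, chosen purely for illustration) is specified only up to its normalising constant; because the Gaussian random-walk proposal is symmetric, only the ratio of the unnormalised densities enters the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(2)

def unnorm_density(x):
    """Unnormalised target density (here proportional to a standard normal)."""
    return np.exp(-0.5 * x ** 2)

def metropolis(n_samples, x0=0.0, proposal_sd=1.0):
    x = x0
    samples = np.empty(n_samples)
    for h in range(n_samples):
        y = x + proposal_sd * rng.normal()   # symmetric random-walk proposal
        accept_prob = min(1.0, unnorm_density(y) / unnorm_density(x))
        if rng.random() < accept_prob:
            x = y                            # accept the proposal ...
        samples[h] = x                       # ... otherwise keep the current value
    return samples

draws = metropolis(50_000)
print(draws[10_000:].mean(), draws[10_000:].std())   # roughly 0 and 1 after a burn-in
```

In the graphical-model setting one would replace unnorm_density by the product of the clique potentials of the cliques containing X_j, exactly as in the numerator of (10.4).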
MCMC methods have become very popular tools over the last ten years, and they have unfortunately often
been used without the necessary care. When proposing an MCMC algorithm for a model one has to make sure
that the relevant regularity conditions are met, which can be quite cumbersome; otherwise one risks
that the Markov chain does not converge to its invariant distribution. Furthermore it is very difficult
to tell when the Markov chain has actually reached the invariant distribution, i.e. from what time on
one can consider the generated random values to stem from the desired distribution. There are no
generally valid rules of thumb for how long the burn-in period should be. This is especially dangerous
if the underlying model contains many strong links, i.e. if the variables, and thus the subsequent random
realisations, are highly dependent. Robert and Casella (2004) give a more detailed overview of MCMC
methods.
Bibliography

Adams, J. B. (1976) A probability model of medical reasoning and the mycin model. Mathematical
Biosciences, 32, 177–186.

Aggarwal, C. C. and Yu, P. S. (1998) A new framework for itemset generation. In Proceedings of the
Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3,
1998, Seattle, Washington, 18–24. ACM Press.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995) Fast discovery of associ-
ation rules. In Advances in Knowledge Discovery and Data Mining. Cambridge (MA): MIT Press.

Agresti, A. (1990) Categorical Data Analysis. New York: Wiley.

Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu,
X., Powell, J., Yang, L., Marti, G., Moore, T., Hudson, J., Lu, L., Lewis, D., Tibshirani, R., Sherlock,
G., Chan, W., Greiner, T., Weisenburger, D., Armitage, J., Warnke, R., Levy, R., Wilson, W., Grever,
M., Byrd, J., Botstein, D., Brown, P. and Staudt, L. (2000) Different types of diffuse large B-cell
lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999) Broad
patterns of gene expression revealed by cluster analysis of tumor and normal colon tissues probed
by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745–6750.

Anderson, E. (1935) The irises of the Gaspé peninsula. Bulletin of the American Iris Society, 59, 2–5.

Barash, Y. and Friedman, N. (2001) Context specific bayesian clustering for gene expression data.
In Fifth Annual International Conference on Computational Biology (ed. T. Lengauer). Montreal,
Canada: RECOMB 2001.

Beckmann, C. and Smith, S. (2002) Probabilistic independent component analysis for FMRI. Tech.
rep., FMRIB, University of Oxford, URL http://www.fmri.ox.ac.uk/analysis/techrep/.

Bellman, R. (1961) Adaptive Control Processes: A Guided Tour. Princeton University Press.

Ben-Hur, A., Elisseeff, A. and Guyon, I. (2002) A stability based method for discovering structure in
clustered data. Pac Symp Biocomput. 2002, 6–7.

Borgelt, C. and Berthold, M. R. (2002) Mining molecular fragments: Finding relevant substructures
of molecules. In ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining
(ICDM’02), 51. Washington (DC): IEEE Computer Society.


Bottou, L. and Bengio, Y. (1995) Convergence properties of the K-means algorithms. In Advances in
Neural Information Processing Systems (eds. G. Tesauro, D. Touretzky and T. Leen), vol. 7, 585–592.
The MIT Press.

Breiman, L. (1996a) Bagging predictors. Machine Learning, 24, 123–140.

— (1996b) Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6),
2350–2383.

— (1999) Random forests. Tech. rep., University of California, Berkeley. URL http://www.stat.
berkeley.edu/tech-reports/567.ps.Z.

— (2001) Statistical modelling: The two cultures. Statistical Science, 16.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984) Classification and Regression Trees. Mon-
terey, CA: Wadsworth and Brooks/Cole.

Bruton, D. and Owen, A. (1988) The Norwegian Upper Ordovician illaenid trilobites. Norsk Geologisk
Tidsskrift, 68, 241–258.

Bühlmann, P. and Yu, B. (2003) Boosting with the l2 -loss: Regression and classification. Journal of the
American Statistical Association, 98, 324–339.

Campbell, N. and Mahon, R. (1974) A multivariate study of variation in two species of rock crab of
genus Leptograpsus. Australian Journal of Zoology, 22, 417–425.

Chipman, H., Hastie, T. and Tibshirani, R. (2003) Clustering microarray data. In Statistical analysis of
gene expression microarray data (ed. T. Speed), chap. 4. New York: Chapman and Hall.

Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian,
A., Landsman, D., Lockhart, D. and Davis, R. (1998) A genome-wide transcriptional analysis of the
mitotic cell cycle. Molecular Cell, 2(1), 65–73.

Creighton, C. and Hanash, S. (2003) Mining gene expression databases for association rules. Bioinfor-
matics, 19, 79–86.

Date, C. J. (1981) An Introduction to Database Systems. Addison-Wesley.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Devijver, P. and Kittler, J. (1982) Pattern Recognition: A Statistical Approach. Prentice Hall.

Devlin, S., Gnanadesikan, R. and Kettenring, J. (1981) Robust estimation of dispersion matrices and
principal components. Journal of the American Statistical Association, 76, 354–362.

Diaconis, P. and Freedman, D. (1984) Asymptotics of graphical projection pursuit. Annals of Statistics,
12, 793–815.

Donoho, D. (2000) High-dimensional data analysis: The curses and blessings of dimensionality. Con-
ference “Mathematical Challenges of the 21st Century”: AMS.

Duan, K., Keerthi, S. and Poo, A. N. (2003) Evaluation of simple performance measures for tuning
SVM hyperparameters. Neurocomputing, 51, 41–59.

Duffy, D. and Quiroz, A. (1991) A permutation-based algorithm for block clustering. Journal of Classi-
fication, 8, 65–91.

Efron, B. (1975) The efficiency of logistic regression compared to normal discriminant analysis. Jour-
nal of the American Statistical Association, 70, 892–898.

— (1983) Estimating the error rate of a prediction rule. Improvements on cross-validation. JASA, 78,
316–331.

Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998) Cluster analysis and display of genome-wide
expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.

Elkan, C. (1994) The paradoxical success of fuzzy logic. IEEE Expert, 9, 3–8.

Eslava-Gómez, G. (1989) Projection Pursuit and Other Graphical Methods for Multivariate Data. D. Phil.
thesis, University of Oxford.

Fauquet, C., Desbois, D., Fargette, D. and Vidal, G. (1988) Classification of furoviruses based on the
amino acid composition of their coat proteins. In Viruses with Fungal Vectors (eds. J. I. Cooper and
M. J. C. Asher), 19–38. Edinburgh: Association of Applied Biologists.

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7,
179–188.

Frawley, W. J., Piatetsky-Shapiro, G. and Matheus, C. J. (1992) Knowledge discovery in databases: an
overview. AI Magazine, 13, 57–70.

Freund, Y. and Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an
application to boosting. In European Conference on Computational Learning Theory, 23–37.

Friedman, J. (1999) Greedy function approximation: a gradient boosting machine. Tech. rep., Stanford
University. URL stat.stanford.edu/∼jhf/ftp/trebst.ps.

Friedman, J., Hastie, T. and Tibshirani, R. (1998) Additive logistic regression: a statistical view of
boosting. Tech. rep., Stanford University. URL http://www-stat.stanford.edu/∼trevor/Papers/
boost.ps.

Friedman, J. and Tukey, J. (1974) A projection pursuit algorithm for exploratory data analysis. IEEE
Transactions on Computers, 23, 881–890.

Friedman, J. H. (1989) Regularized discriminant analysis. Journal of the American Statistical Associa-
tion, 84.

Getz, G., Levine, E. and Domany, E. (2000) Coupled two-way clustering analysis of gene microarray
data. Proc. Natl. Acad. Sci., 97(22), 12079–12084.

Ghosh, D. and Chinnaiyan, A. (2002) Mixture modelling of gene expression data from microarray
experiments. Bioinformatics, 18, 275–286.

Goodall, A. (1985) The guide to expert systems. Oxford: Learned Information.

Gower, J. and Hand, D. (1996) Biplots. London: Chapman & Hall.

Hamadeh, H., Bushel, P., Jayadev, S., DiSorbo, O., Sieber, S., Bennett, L., Li, L., Tennant, R., Stoll,
R., Barrett, J., Blanchard, K., Paules, R. and Afshari, C. (2002) Gene Expression Analysis Reveals
Chemical-Specific Profiles. Toxicological Sciences, 67, 219–231.

Hand, D. J. (1981) Discrimination and Classification. New York: John Wiley & Sons.

— (1997) Construction and assessment of classification rules. Wiley: Chichester.

Hartigan, J. (1972) Direct clustering of a data matrix. Journal of the American Statistical Association,
6, 123–129.

Hastie, T., Buja, A. and Tibshirani, R. (1995) Penalized discriminant analysis. Annals of Statistics, 23,
73–102.

Hastie, T., Tibshirani, R. and Buja, A. (1994) Flexible discriminant analysis by optimal scoring. Journal
of the American Statistical Association, 89, 1255–1270.

Hastie, T., Tibshirani, R. and Friedman, J. (2001) The elements of statistical learning: data mining,
inference and prediction. New York: Springer.

Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P. and Botstein, D. (1999) Imputing missing
data for gene expression arrays. Tech. rep., Stanford University Statistics Department.

Heckerman, D. (1996) A tutorial on learning with Bayesian networks. Tech. Rep. MSR-TR-95-06, Mi-
crosoft Research.

Holst, P. A., Kromhout, D. and Brand, R. (1988) For debate: pet birds as an independent risk factor
for lung cancer. British Medical Journal, 297, 13–21.

Hori, G., Inoue, M., Nishimura, S. and Nakahara, H. (2001) Blind gene classification based on ICA of
microarray data. In Proc. 3rd Int, Workshop on Independent Component Analysis and Blind Separation,
332–336. ICA2001.

Huber, P. J. (1981) Robust statistics. New York: John Wiley & Sons.

Hyvärinen, A. (1999) Survey on independent component analysis. Neural computing surveys, 2, 94–
128.

Icev, A., Ruiz, C. and Ryder, E. F. (2003) Distance-enhanced association rules for gene expression. In
BIOKDD, 34–40.

Jordan, M. (ed.) (1998) Learning in graphical models. Kluwer.

Kari, L., Loboda, A., Nebozhyn, M., Rook, A. H., Vonderheid, E. C., Nichols, C., Virok, D., Chang,
C., Horng, W., Johnston, J., Wysocka, M., Showe, M. K. and Showe, L. C. (2003) Classification and
prediction of survival in patients with the leukemic phase of cutaneous T cell lymphoma. Journal of
Experimental Medicine, 197, 1477–88.

Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. New
York: Wiley.

Kerr, M. and Churchill, G. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions
from microarray experiments. Proc. Natl. Acad. Sci., 98(16), 8961–8965.

Kohonen, T. (1988) Learning vector quantization. Neural Networks, 1, 303.

— (1995) Self-Organizing Maps. New York: Springer.

Kotz, S., Johnson, N. L. and Read, C. B. (eds.) (1988) Encyclopedia of Statistical Science, vol. 6. New
York: Wiley Interscience.

Kustra, R. and Strother, S. (2001) Penalized discriminant analysis of [15O]-water PET brain images with
prediction error selection of smoothness and regularization hyperparameters. IEEE Transactions on
Medical Imaging, 20, 376–387.

Lauritzen, S. L. and Spiegelhalter, D. (1988) Local computations with probabilities on graphical struc-
tures and their application to expert systems (with discussion). Journal of the Royal Statistical
Society, Series B, 50, 157–224.

Laviolette, M., Seaman, J. W., Jr., Barrett, J. D. and Woodall, W. H. (1995) A probabilistic and statistical
view of fuzzy methods. Technometrics, 37, 249–261.

Lazzeroni, L. and Owen, A. (2002) Plaid models for gene expression data. Statistica Sinica, 12, 61–86.

Lee, S. and Batzoglou, S. (2003) Application of independent component analysis to microarrays.
Genome Biology, 4:R76, 658–666.

Lee, S.-M. (2003) Estimating the number of independent components via the SONIC statistic. Master’s
thesis, University of Oxford, United Kingdom.

Liebermeister, W. (2002) Linear modes of gene expression determined by independent component
analysis. Bioinformatics, 18, 51–60.

Macnaughton-Smith, P., Williams, W., Dale, M. and Mockett, L. (1964) Dissimilarity analysis: a new
technique of hierarchical sub-division. Nature, 202, 1034–1035.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate analysis. London: Academic Press.

Marriott, F. (1974) The interpretation of multiple observations. London: Academic Press.

Massart, D. L., Kaufman, L. and Plastria, F. (1983) Non-hierarchical clustering with masloc. Pattern
Recognition, 16, 507–516.

McCarthy, J. and Hayes, P. (1969) Some philosophical problems from the standpoint of artificial in-
telligence. In Machine Intelligence (eds. B. Meltzer and D. Michie), vol. 4, 463–502. Edinburgh
University Press.

McCullagh, P. and Nelder, J. (1989) Generalized Linear Models. London: Chapman and Hall, 2nd edn.

McLachlan, G., Bean, R. and Peel, D. (2002) A mixture model-based approach to the clustering of
microarray expression data. Bioinformatics, 18, 413–422.

McLachlan, G. and Peel, D. (1998) Robust cluster analysis via mixtures of multivariate t distributions,
658–666. Berlin: Springer-Verlag.

— (2000) Finite Mixture Models. New York: Wiley.

McQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In
Proceedings of 5th Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1.

McShane, L., Radmacher, M., Freidlin, B., Yu, R., Li, M. and Simon, R. (2002) Methods for assessing
reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics, 18,
1462–1469.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equation of
state calculations by fast computing machines. J. Chem. Phys., 21, 1087–1091.

Michie, D., Spiegelhalter, D. J. and Taylor, C. C. (1994) Machine Learning, Neural and Statistical Clas-
sification. Chichester: Ellis Horwood.

Pearl, J. (1982) Reverend bayes on inference engines: A distributed hierarchical approach. In AAAI,
133–136.

— (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco,
(CA): Morgan Kaufmann.

Quenouille, M. (1949) Approximate tests of correlation in time series. Journal of the Royal Statistical
Society series B, 11, 68–84.

Quinlan, J. R. (1986) The effect of noise on concept learning. In Machine Learning: An Artificial
Intelligence Approach (eds. R. S. Michalski, J. G. Carbonell and T. M. Mitchell), vol. 2. San Mateo
(CA): Morgan Kaufmann.

— (1988) Simplifying decision trees. In Knowledge Acquisition for Knowledge-Based Systems (eds.
B. Gaines and J. Boose), 239–252. London: Academic Press.

— (1993) C4.5: Programs for machine learning. San Mateo (CA): Morgan Kaufmann.

Ramsey, F. L. and Schafer, D. W. (2002) The statistical sleuth: A course in methods of data analysis.
Pacific Grove, CA: Duxbury, 2 edn.

Ripley, B. D. (1996) Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Robert, C. and Casella, G. (2004) Monte Carlo Statistical Methods. New York: Springer, 2nd edn.

Rosenblatt, F. (1962) Principles of Neurodynamics. Washington D.C.: Spartan Books.

Rumelhart, D. E. and McClelland, J. L. (1986) Parallel Distributed Processing: Explorations in the Mi-
crostructure of Cognition, vol. 2. Cambridge: MIT Press.

Russell, S. and Norvig, P. (2003) Artificial Intelligence: A Modern Approach. Englewood Cliffs (NJ):
Prentice-Hall.

Schölkopf, B. and Smola, A. (2002) Learning with kernels: Support vector machines, regularization,
optimization and beyond. The MIT Press, Cambridge (MA).

Schwed, F. (1940) Where Are the Customers' Yachts? A Good Hard Look at Wall Street. New York: Simon
and Schuster.

Shafer, G. (1976) A Mathematical Theory of Evidence. Princeton (NJ): Princeton University Press.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich,
M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J., Neuberg, D., Lander, E.,
Aster, J. and Golub, T. (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression
profiling and supervised machine learning. Nature Medicine, 8, 68–74.

Shortliffe, E. (1976) Computer-based Medical Consultations: MYCIN. Amsterdam: Elsevier.

Shortliffe, E. and Buchanan, B. (1984) Rule-Based Expert Systems. Reading (MA): Addison-Wesley.

Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico,
A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. and Sellers, R. (2002) Gene expression
correlates of clinical prostate cancer behaviour. Cancer Cell, 1(2), 203–209.

Sørlie, T., Perou, C., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M., van de Rijn,
M., Jeffrey, S., Thorsen, T., Quist, H., Matese, J., Brown, P., Botstein, D., Eystein, L. and Borresen-
Dale, A. (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with
clinical implications. Proc. Natl. Acad. Sci., 98(19), 10869–10874.

Spiegelhalter, D., Dawid, A., Lauritzen, S. and Cowell, R. (1993) Bayesian analysis in expert systems.
Statistical Science, 8, 219–247.

Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with discussion).
Journal of the Royal Statistical Society, Series B, 36, 111–147.

Tibshirani, R., Walther, G. and Hastie, T. (2001) Estimating the number of clusters in a data set via
the gap statistic. Journal of the Royal Statistical Society B, 63(2), 411–423.

Vapnik, V. (1995) The Nature of Statistical Learning Theory. New York: Springer.

Vapnik, V. and Chervonenkis, A. J. (1964) On the one class of the algorithms of pattern recognition.
Automation and Remote Control, 25.

— (1968) On the uniform convergence of relative frequencies of events to their probabilities. Sov.
Math. Dokl., 781–787.

— (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory
Probab. Apl., 16, 264–280.

Vapnik, V. and Lerner, A. J. (1963) Generalized portrait method for pattern recognition. Automation
and Remote Control, 24.

van ’t Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K.,
Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R. and
Friend, S. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415,
530–536.

Venables, W. and Ripley, B. (2002) Modern Applied Statistics with S. New York: Springer-Verlag, 4th
edn.

Widrow, B. and Hoff, M. E. (1960) Adaptive switching circuits. In 1960 IRE WESCON Convention
Record, vol. 4, 96–104. New York: IRE.

Witten, I. H. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with
JAVA Implementations. San Francisco: Morgan Kaufmann.

Yeung, K., Fraley, C., Murua, A., Raftery, A. and Ruzzo, W. (2001a) Model-based clustering and data
transformations for gene expression data. Bioinformatics, 17, 977–987.

Yeung, K., Haynor, D. and Ruzzo, W. (2001b) Validating clustering for gene expression data. Bioinfor-
matics, 17, 309–318.

Yeung, K. and Ruzzo, W. (2001) Principal component analysis for clustering gene expression data.
Bioinformatics, 17, 763–774.

Zadeh, L. (1965) Fuzzy Sets. Information and Control, 3, 338–353.

Zaki, M. J., Parthasarathy, S., Ogihara, M. and Li, W. (1997) New algorithms for fast discovery of
association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining (eds. D. Heckerman,
H. Mannila, D. Pregibon and R. Uthurusamy), 283–296. Menlo Park (CA): AAAI Press.
