
Proceedings of the Third European Workshop on

Probabilistic Graphical Models (PGM'06)

Prague, Czech Republic


September 12-15, 2006

Edited by Milan Studený and Jiří Vomlel

Action M Agency
Prague, 2006
Typeset in LaTeX.
The cover page and PGM logo: Jiří Vomlel
Graphic art: Karel Solark

ISBN 80-86742-14-8
Preface

These are the proceedings of the third European workshop on Probabilistic Graphical Models
(PGM06) to be held in Prague, Czech Republic, September 12-15, 2006. The aim of this series
of workshops on probabilistic graphical models is to provide a discussion forum for researchers
interested in this topic. The first European PGM workshop (PGM02) was held in Cuenca, Spain
in November 2002. It was a successful workshop and several of its participants expressed interest in
having a biennial European workshop devoted particularly to probabilistic graphical models. The
second PGM workshop (PGM04) was held in Leiden, the Netherlands in October 2004. It was
also a success; more emphasis was put on collaborative work, and the participants also discussed
how to foster cooperation between European research groups.
There are two trends which can be observed in connection with PGM workshops. First, each
workshop is held during an earlier month than the preceding one. Indeed, PGM02 was held in
November, PGM04 in October and PGM06 will be held in September. Nevertheless, I think this
is a coincidence. The second trend is the increasing number of contributions (to be) presented. I
would like to believe that this is not a coincidence, but an indication of increasing research interest
in probabilistic graphical models.
A total of 60 papers were submitted to PGM06 and, after the reviewing and post-reviewing
phases, 41 of them were accepted for presentation at the workshop (21 talks, 20 posters) and
appear in these proceedings. The authors of these papers come from 17 different countries, mainly
European ones.
To handle the reviewing process (from May 15, 2006 to June 28, 2006) the PGM Program
Committee was considerably extended. This made it possible that every submitted paper was
reviewed by 3 independent reviewers. The reviews were handled electronically using the START
conference management system.
Most of the PC members reviewed around 7 papers. Some of them also helped in the post-
reviewing phase, if revisions to some of the accepted papers were desired. Therefore, I would like
to express my sincere thanks here to all of the PC members for all of the work they have done
towards the success of PGM06. I think PC members helped very much to improve the quality of
the contributions. Of course, my thanks are also addressed to the authors.
Further thanks are devoted to the sponsor, the DAR UTIA research centre, who covered the
expenses related to the START system. I am also indebted to the members of the organizing
committee for their help, in particular, to my co-editor, Jirka Vomlel.
My wish is for the participants in the PGM06 workshop to appreciate both the beauty of
Prague and the scientific program of the workshop. I believe there will be a fruitful discussion
during the workshop and I hope that the tradition of PGM workshops will continue.

Prague, June 26, 2006


Milan Studený
PGM06 PC Chair
Workshop Organization
Program Committee
Chair:
Milan Studený Czech Republic
Committee Members:
Concha Bielza Technical University of Madrid, Spain
Luis M. de Campos University of Granada, Spain
Francisco J. Díez UNED Madrid, Spain
Marek J. Druzdzel University of Pittsburgh, USA
Ad Feelders Utrecht University, Netherlands
Linda van der Gaag Utrecht University, Netherlands
José A. Gámez University of Castilla La Mancha, Spain
Juan F. Huete University of Granada, Spain
Manfred Jaeger Aalborg University, Denmark
Finn V. Jensen Aalborg University, Denmark
Radim Jiroušek Academy of Sciences, Czech Republic
Pedro Larrañaga University of the Basque Country, Spain
Peter Lucas Radboud University Nijmegen, Netherlands
Fero Matúš Academy of Sciences, Czech Republic
Serafín Moral University of Granada, Spain
Thomas D. Nielsen Aalborg University, Denmark
Kristian G. Olesen Aalborg University, Denmark
José M. Peña Linköping University, Sweden
José M. Puerta University of Castilla La Mancha, Spain
Silja Renooij Utrecht University, Netherlands
David Ríos Insua Rey Juan Carlos University, Spain
Antonio Salmerón University of Almería, Spain
James Q. Smith University of Warwick, UK
Marco Valtorta University of South Carolina, USA
Jiří Vomlel Academy of Sciences, Czech Republic
Marco Zaffalon IDSIA Lugano, Switzerland

Additional Reviewers
Alessandro Antonucci IDSIA Lugano, Switzerland
Janneke Bolt Utrecht University, Netherlands
Julia Flores University of Castilla La Mancha, Spain
Marcel van Gerven Radboud University Nijmegen, Netherlands
Yimin Huang University of South Carolina, USA
Søren H. Nielsen Aalborg University, Denmark
Jana Novovičová Academy of Sciences, Czech Republic
Valerie Sessions University of South Carolina, USA
Petr Šimeček Academy of Sciences, Czech Republic
Marta Vomlelová Charles University, Czech Republic
Peter de Waal Utrecht University, Netherlands
Changhe Yuan University of Pittsburgh, USA
Organizing Committee
Chair:
Radim Jiroušek Academy of Sciences, Czech Republic
Committee members:
Milena Zeithamlová Action M Agency, Czech Republic
Marta Vomlelová Charles University, Czech Republic
Radim Lnenička Academy of Sciences, Czech Republic
Petr Šimeček Academy of Sciences, Czech Republic
Jiří Vomlel Academy of Sciences, Czech Republic
Contents
Some Variations on the PC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Joaquín Abellán, Manuel Gómez-Olmedo, and Serafín Moral

Learning Complex Bayesian Network Features for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


Péter Antal, András Gézsi, Gábor Hullám, and András Millinghoffer

Literature Mining using Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


Péter Antal and András Millinghoffer

Locally specified credal networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


Alessandro Antonucci and Marco Zaffalon

A Bayesian Network Framework for the Construction of Virtual Agents


with Human-like Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Olav Bangsø, Nicolaj Søndberg-Madsen, and Finn V. Jensen

Loopy Propagation: the Convergence Error in Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


Janneke H. Bolt

Preprocessing the MAP Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


Janneke H. Bolt and Linda C. van der Gaag

Quartet-Based Learning of Shallow Latent Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59


Tao Chen and Nevin L. Zhang

Continuous Decision MTE Influence Diagrams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67


Barry R. Cobb

Discriminative Scoring of Bayesian Network Classifiers: a Comparative Study . . . . . . . . . . . . . . . . . 75


Ad Feelders and Jevgenijs Ivanovs

The Independency tree model: a new approach for clustering and factorisation . . . . . . . . . . . . . . . . . 83
M. Julia Flores, José A. Gámez, and Serafín Moral

Learning the Tree Augmented Naive Bayes Classifier from incomplete datasets . . . . . . . . . . . . . . . . 91
Olivier C.H. Francois and Philippe Leray

Lattices for Studying Monotonicity of Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


Linda C. van der Gaag, Silja Renooij, and Petra L. Geenen

Multi-dimensional Bayesian Network Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


Linda C. van der Gaag and Peter R. de Waal

Dependency networks based classifiers: learning models by using independence tests . . . . . . . . . . 115
José A. Gámez, Juan L. Mateo, and José M. Puerta

Unsupervised naive Bayes for data clustering with mixtures of truncated exponentials . . . . . . . . 123
José A. Gámez, Rafael Rumí, and Antonio Salmerón

Selecting Strategies for Infinite-Horizon Dynamic LIMIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131


Marcel A. J. van Gerven and Francisco J. Díez
Sensitivity analysis of extreme inaccuracies in Gaussian Bayesian Networks . . . . . . . . . . . . . . . . . . . 139
Miguel A. Gómez-Villegas, Paloma Maín, and Rosario Susi

Learning Bayesian Networks Structure using Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147


Christophe Gonzales and Nicolas Jouve

Estimation of linear, non-gaussian causal models in the presence of confounding


latent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Patrik O. Hoyer, Shohei Shimizu, and Antti J. Kerminen

Symmetric Causal Independence Models for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163


Rasa Jurgelenaite and Tom Heskes

Complexity Results for Enhanced Qualitative Probabilistic Networks . . . . . . . . . . . . . . . . . . . . . . . . . 171


Johan Kwisthout and Gerard Tel

Decision analysis with influence diagrams using Elviras explanation facilities . . . . . . . . . . . . . . . . . 179
Manuel Luque and Francisco J. Díez

Dynamic importance sampling in Bayesian networks using factorisation of probability trees . . . 187
Irene Martínez, Carmelo Rodríguez, and Antonio Salmerón

Learning Semi-Markovian Causal Models using Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195


Stijn Meganck, Sam Maes, Philippe Leray, and Bernard Manderick

Geometry of rank tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207


Jason Morton, Lior Pachter, Anne Shiu, Bernd Sturmfels, and Oliver Wienand

An Empirical Study of Efficiency and Accuracy of Probabilistic Graphical Models . . . . . . . . . . . . 215


Jens D. Nielsen and Manfred Jaeger

Adapting Bayes Network Structures to Non-stationary Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


Søren H. Nielsen and Thomas D. Nielsen

Diagnosing Lyme disease - Tailoring patient specific Bayesian networks


for temporal reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Kristian G. Olesen, Ole K. Hejlesen, Ram Dessau, Ivan Beltoft, and Michael Trangeled

Predictive Maintenance using Dynamic Probabilistic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


Demet Özgür-Ünlüakin and Taner Bilgiç

Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid


that Satisfies Weak Transitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
José M. Peña, Roland Nilsson, Johan Björkegren, and Jesper Tegnér

Evidence and Scenario Sensitivities in Naive Bayesian Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255


Silja Renooij and Linda van der Gaag

Abstraction and Refinement for Solving Continuous Markov Decision Processes. . . . . . . . . . . . . . .263
Alberto Reyes, Pablo Ibargüengoytia, L. Enrique Sucar, and Eduardo Morales

Bayesian Model Averaging of TAN Models for Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .271


Guzmán Santafé, José A. Lozano, and Pedro Larrañaga
Dynamic Weighting A* Search-based MAP Algorithm for Bayesian Networks . . . . . . . . . . . . . . . . . 279
Xiaoxun Sun, Marek J. Druzdzel, and Changhe Yuan

A Short Note on Discrete Representability of Independence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 287


Petr Šimeček

Evaluating Causal effects using Chain Event Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293


Peter Thwaites and Jim Smith

Severity of Local Maxima for the EM Algorithm: Experiences with Hierarchical


Latent Class Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Yi Wang and Nevin L. Zhang

Optimal Design with Design Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309


Yang Xiang

Hybrid Loopy Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317


Changhe Yuan and Marek J. Druzdzel

Probabilistic Independence of Causal Influences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325


Adam Zagorecki and Marek J. Druzdzel
Some Variations on the PC Algorithm
J. Abellán, M. Gómez-Olmedo, and S. Moral
Department of Computer Science and Artificial Intelligence
University of Granada
18071 - Granada, Spain

Abstract
This paper proposes some possible modifications on the PC basic learning algorithm and
makes some experiments to study their behaviour. The variations are: to determine
minimum size cut sets between two nodes to study the deletion of a link, to make statistical
decisions taking into account a Bayesian score instead of a classical Chi-square test, to
study the refinement of the learned network by a greedy optimization of a Bayesian score,
and to solve link ambiguities taking into account a measure of their strength. It will be
shown that some of these modifications can improve PC performance, depending on the
objective of the learning task: discovering the causal structure or approximating the joint
probability distribution for the problem variables.

1 Introduction

There are two main approaches to learning Bayesian networks from data. One is based on scoring and searching (Cooper and Herskovits, 1992; Heckerman, 1995; Buntine, 1991). Its main idea is to define a global measure (score) which evaluates a given Bayesian network model as a function of the data. The problem is solved by searching in the space of possible Bayesian network models, trying to find the network with optimal score. The other approach (constraint learning) is based on carrying out several independence tests on the database and building a Bayesian network in agreement with the test results. The main example of this approach is the PC algorithm (Spirtes et al., 1993). It can be applied to any source providing information about whether a given conditional independence relation is verified.

In the past years, searching and scoring procedures have received more attention, due to some clear advantages (Heckerman et al., 1999). One is that constraint based learning makes categorical decisions from the very beginning. These decisions are based on statistical tests that may be erroneous, and these errors will affect all the future behaviour of the algorithm. Another one is that scoring and search procedures allow to compare very different models by a score that can be interpreted as the probability of being the true model. As a consequence, we can also follow a Bayesian approach considering several alternative models, each one of them with its corresponding probability, and using them to determine posterior decisions (model averaging). Finally, in scoring and searching approaches different combinatorial optimization techniques (de Campos et al., 2002; Blanco et al., 2003) can be applied to maximize the evaluation of the learned network. On the other hand, the PC algorithm has some advantages. One of them is that it has an intuitive basis and under some ideal conditions it has a guarantee of recovering a graph equivalent to the one being a true model for the data. It can be considered as a smart selection and ordering of the questions that have to be asked in order to recover a causal structure.

The basic point of this paper is that the PC algorithm provides a set of strategies that can be combined with other ideas to produce good learning algorithms which can be adapted to different situations. An example of this is when van Dijk et al. (2003) propose a combination of the order 0 and 1 tests of the PC algorithm with a scoring and searching procedure.
Here, we propose several variations on the original PC algorithm. The first one will be a generalization of the necessary path condition (Steck and Tresp, 1999); the second will be to change the statistical tests for independence by considering decisions based on a Bayesian score; the third will be to allow the possibility of refining the network learned with PC by applying a greedy optimization of a Bayesian score; and finally the last proposal will be to delete edges from triangles in the graph following an order given by a Bayesian score (removing weaker edges first). We will show the intuitive basis for all of them and we will make some experiments showing their performance when learning the Alarm network (Beinlich et al., 1989). The quality of the learned networks will be measured by the number of missing-added links and by the Kullback-Leibler distance of the probability distribution associated to the learned network to the original one.

The paper is organized as follows: Section 2 is devoted to describing the fundamentals of the PC algorithm; Section 3 introduces the four variations of the PC algorithm; in Section 4 the results of the experiments are reported and discussed; Section 5 is devoted to the conclusions.

2 The PC Algorithm

Assume that we have a set of variables X = (X1, ..., Xn) with a global probability distribution P about them. By an uppercase bold letter A we will represent a subset of variables of X. By I(A, B|C) we will denote that sets A and B are conditionally independent given C.

The PC algorithm assumes faithfulness. This means that there is a directed acyclic graph, G, such that the independence relationships among the variables in X are exactly those represented by G by means of the d-separation criterion (Pearl, 1988). The PC algorithm is based on the existence of a procedure which is able to say whether I(A, B|C) is verified in graph G. It first tries to find the skeleton (underlying undirected graph) and in a posterior step makes the orientation of the edges. Our variations will be mainly applied to the first part (determining the skeleton), so we shall describe it with some detail:

1. Start with a complete undirected graph G0
2. i = 0
3. Repeat
4.   For each X in X
5.     For each Y in ADJ_X
6.       Test whether there is a set S, with S a subset of ADJ_X \ {Y}, |S| = i and I(X, Y|S)
7.       If this set exists
8.         Make S_XY = S
9.         Remove the X-Y link from G0
10.  i = i + 1
11. Until |ADJ_X| <= i for every X

In this algorithm, ADJ_X is the set of nodes adjacent to X in graph G0. The basis is that if the set of independencies is faithful to a graph, then there is not a link between X and Y if and only if there is a subset S of the adjacent nodes of X such that I(X, Y|S). For each pair of variables, S_XY will contain such a set, if it is found. This set will be used in the posterior orientation stage.

The orientation step will proceed by looking for sets of three variables {X, Y, Z} such that the edges X-Z and Y-Z are in the graph but not the edge X-Y. Then, if Z is not in S_XY, it orients the edges from X to Z and from Y to Z creating a v-structure: X -> Z <- Y. Once these orientations are done, it tries to orient the rest of the edges following two basic principles: not to create cycles and not to create new v-structures. It is possible that the orientation of some of the edges has to be arbitrarily selected.

If the set of independencies is faithful to a graph and we have a perfect way of determining whether I(X, Y|S), then the algorithm has a guarantee of producing a graph equivalent (representing the same set of independencies) to the original one.
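As an illustration of the skeleton phase just described, the following minimal Python sketch (not part of the original paper) assumes a generic conditional independence oracle indep(X, Y, S, data); any of the tests discussed below (chi-square or score based) could play that role.

    from itertools import combinations

    def pc_skeleton(variables, indep, data):
        """Sketch of the PC skeleton phase: shrink adjacency sets for growing order i."""
        adj = {X: set(variables) - {X} for X in variables}   # start from the complete graph
        sepset = {}
        i = 0
        while any(len(adj[X]) > i for X in variables):
            for X in variables:
                for Y in list(adj[X]):
                    # candidate conditioning sets of size i drawn from ADJ_X \ {Y}
                    for S in combinations(adj[X] - {Y}, i):
                        if indep(X, Y, set(S), data):
                            sepset[(X, Y)] = sepset[(Y, X)] = set(S)
                            adj[X].discard(Y)                # remove the X-Y link
                            adj[Y].discard(X)
                            break
            i += 1
        return adj, sepset

The returned separating sets sepset correspond to the sets S_XY used later in the orientation stage.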



However, in practice none of these conditions is verified. Independencies are decided in the light of statistical independence tests based on a set of data D. The usual way of doing these tests is by means of a chi-square test based on the cross entropy statistic measured in the sample (Spirtes et al., 1993). Statistical tests have errors and then, even if the faithfulness hypothesis is verified, it is possible that we do not recover the original graph. The number of errors of statistical tests increases when the sample is small or the cardinality of the conditioning set S is large (Spirtes et al., 1993, p. 116). In both cases, due to the nature of frequentist statistical tests, there is a tendency to always decide independence (Cohen, 1988). This is one reason for doing statistical tests in increasing order of the cardinality of the sets to which we are conditioning.

Apart from not recovering the original graph, we can have other effects, such as the possibility of finding cycles when orienting v-structures. In our implementation, we have always avoided cycles by reversing the arrows if necessary.

3 The Variations

3.1 Necessary Path Condition

In the PC algorithm it is possible that we delete the link between X and Y by testing the independence I(X, Y|S), when S is a set containing nodes that do not appear in a path (without cycles) from X to Y. The inclusion of these nodes is not theoretically wrong, but statistical tests make more errors when the size of the conditioning set increases, so it can be a source of problems in practice. For this reason, Steck and Tresp (1999) proposed to reduce ADJ_X \ {Y} in Step 6 by removing all the nodes that are not in a path from X to Y. In this paper, we go a step further by considering any subset CUT_{X,Y} disconnecting X and Y in the graph in which the link X-Y has been deleted, playing the role of ADJ_X \ {Y}.

Consider that in the skeleton we want to see whether the link X-Y can be deleted; then we first remove it, and if the situation is the one in Figure 1, we could consider CUT_{X,Y} = {Z}. However, in the actual algorithm (even with the necessary path condition) we consider the set ADJ_X \ {Y}, which is larger, and therefore with an increased possibility of error.

Figure 1: A small cut set

Our proposal is to apply the PC algorithm, but considering in step 6 a cut set of minimum size in the graph without the X-Y link, as Acid and de Campos (2001) did in a different context. The computation of this set will need some extra time, but it can be done in polynomial time with a modification of the Ford-Fulkerson algorithm (Acid and de Campos, 1996).



3.2 Bayesian Statistical Tests

The PC algorithm performs a chi-square statistical test to decide about independence. However, as shown by Moral (2004), sometimes statistical tests make too many errors. They try to keep the Type I error (deciding dependence when there is independence) constant at the significance level. However, if the sample is large enough this error can be much lower by using a different decision procedure, without an important increase in the Type II error (deciding independence when there is dependence). Margaritis (2003) has proposed to make statistical tests of independence for continuous variables by using a Bayesian score after discretizing them. Previously, Cooper (1997) proposed a different independence test based on a Bayesian score, but only when conditioning on 0 or 1 variables. Here we propose to do all the statistical tests by using a Bayesian Dirichlet score¹ (Heckerman, 1995) with a global sample size s equal to 1.0. The test I(X, Y|S) is carried out by comparing the scores of X with S as parents and of X with S ∪ {Y} as parents. If the former is larger than the latter, the variables are considered independent; in the other case, they are considered dependent. The score of X with a set of parents Pa(X) = Z is the logarithm of:

    ∏_z [ Γ(s') / Γ(N_z + s') ] ∏_x [ Γ(N_{z,x} + s'') / Γ(s'') ]

where N_z is the number of occurrences of [Z = z] in the sample, N_{z,x} is the number of occurrences of [Z = z, X = x] in the sample, s' is s divided by the number of possible values of Z, and s'' is equal to s' divided by the number of values of X.

¹We have chosen this score instead of the original K2 score (Cooper and Herskovits, 1992) because it is considered more correct from a theoretical point of view (Heckerman et al., 1995).

3.3 Refinement

If the statistical tests do not make errors and the faithfulness hypothesis is verified, then the PC algorithm will recover a graph equivalent to the original one, but this can never be assured with finite samples. Also, even if we recover the original graph, when our objective is to approximate the joint distribution for all the variables, then depending on the sample size, it can be more convenient to use a simpler graph than the true one. Imagine that the variables follow the graph of Figure 2. This graph can be recovered by the PC algorithm by doing only statistical independence tests of order 0 and 1 (conditioning on none or one variable). However, when we are going to estimate the parameters of the network we have to estimate a high number of probability values. This can be a too complex model (too many parameters) if the database is not large enough. In this situation, it can be reasonable to try to refine this network, taking into account the actual orientation and the size of the model. In this sense, the result of the PC algorithm can be used as a starting point for a greedy search algorithm to optimize a concrete metric.

Figure 2: A too complex network.

In particular, our proposal is based on the following steps:

1. Obtain an order compatible with the graph learned by the PC algorithm.

2. For each node, try to delete each one of its parents or to add some of the non-parent preceding nodes as parents, measuring the resulting Bayesian Dirichlet score. We make the movement with the highest score difference while this is positive.

Refinement can also solve some of the problems associated with the non-verification of the faithfulness hypothesis. Assume, for example, that we have a problem with 3 variables, X, Y, Z, and that the set of independencies is given by the independencies of the graph in Figure 3 plus the independence I(Y, Z|∅). The PC algorithm will estimate a network where the link between Y and Z is lost. Even if the sample is large we will estimate a too simple network which is not an I-map (Pearl, 1988). If we orient the link X -> Z in the PC algorithm, refinement can produce the network in Figure 3, by checking that the Bayesian score is increased (as should be the case if I(Z, Y|X) is not verified).

Figure 3: A simple network

The idea of refining a learned Bayesian network by means of a greedy optimization of a Bayesian score has been used in a different context by Dash and Druzdzel (1999).
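To make the score concrete, the following sketch (ours, not the paper's Elvira implementation) computes the logarithm of the Bayesian Dirichlet score above with scipy's gammaln, and uses it both for the score-based independence decision of Section 3.2 and as the quantity a greedy refinement pass would maximize. It assumes data is a list of rows indexable by variable name (e.g. dictionaries) and discrete values.

    import numpy as np
    from scipy.special import gammaln
    from itertools import product

    def bd_score(data, X, parents, s=1.0):
        """log of  prod_z Gamma(s')/Gamma(N_z+s') * prod_x Gamma(N_zx+s'')/Gamma(s'')."""
        x_vals = sorted(set(row[X] for row in data))
        parent_vals = [sorted(set(row[P] for row in data)) for P in parents]
        s1 = s / int(np.prod([len(v) for v in parent_vals]))   # s' = s / #values of Z
        s2 = s1 / len(x_vals)                                  # s'' = s' / #values of X
        score = 0.0
        for z in product(*parent_vals):                        # each parent configuration z
            rows = [r for r in data if all(r[P] == v for P, v in zip(parents, z))]
            N_z = len(rows)
            score += gammaln(s1) - gammaln(N_z + s1)
            for x in x_vals:
                N_zx = sum(1 for r in rows if r[X] == x)
                score += gammaln(N_zx + s2) - gammaln(s2)
        return score

    def indep_bayes(X, Y, S, data):
        """I(X, Y | S): independent iff score of X with parents S beats S + {Y}."""
        S = list(S)
        return bd_score(data, X, S) > bd_score(data, X, S + [Y])

indep_bayes has the signature expected by the skeleton sketch of Section 2, so the same loop can be run with either the chi-square or the score-based decision.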



3.4 Triangles Resolution

Imagine that we have 3 variables, X, Y, Z, and that no independence relationship involving them is verified: each pair of variables is dependent and also conditionally dependent given the third one. As there is a tendency to decide for independence when the size of the conditioning set is larger, it is possible that all order 0 tests produce dependence, but when we test the independence of two variables with respect to a third one, we obtain independence. In this situation, the result of the PC algorithm will depend on the order in which the tests are carried out. For example, if we ask first for the independence I(X, Y|Z), then the link X-Y is deleted, but not the other two links, which will be oriented in a posterior step without creating a v-structure. If we test first I(X, Z|Y), then the deleted link will be X-Z, but not the other two.

It seems reasonable that if one of the links is going to be removed, we should choose the weakest one. In this paper, for each 3 variables that form a triangle (the graph contains the 3 links) after the order 0 tests, we measure the strength of the link X-Y as the Bayesian score of X with Y, Z as parents minus the Bayesian score of X with Z as parent. For each triangle we delete the link with the lowest strength (if this value is lower than 0). This is done as an intermediate step, between the order 0 and order 1 conditional independence tests.

In this paper, it has been implemented only in the case in which the independence tests are based on a Bayesian score, but it could also be considered in the case of chi-square tests by taking the strength of a link equal to the p-value of the statistical independence test.

A deeper study of this type of interdependencies between the deletion of links (the presence of a link depends on the absence of another one, and vice versa) has been carried out by Steck and Tresp (1999), but the resolution of these ambiguities is not done there. The Hugin system (Madsen et al., 2003) allows one to decide between the different possibilities by asking the user. Our procedure could be extended to this more general setting, but at this stage the implementation has been limited to triangles, as it is, at the same time, the most usual and simplest situation.

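A simplified rendering of this intermediate step (ours, reusing the hypothetical bd_score helper from the previous sketch; the choice of scoring each link from one of its endpoints follows the X-based convention in the text):

    def resolve_triangles(adj, data):
        """Delete the weakest link of every triangle left after the order 0 tests."""
        nodes = list(adj)
        for X in nodes:
            for Y in list(adj[X]):
                for Z in list(adj[X] & adj[Y]):      # X, Y, Z form a triangle
                    strengths = {
                        (X, Y): bd_score(data, X, [Y, Z]) - bd_score(data, X, [Z]),
                        (X, Z): bd_score(data, X, [Z, Y]) - bd_score(data, X, [Y]),
                        (Y, Z): bd_score(data, Y, [Z, X]) - bd_score(data, Y, [X]),
                    }
                    (a, b), weakest = min(strengths.items(), key=lambda kv: kv[1])
                    if weakest < 0 and b in adj[a]:  # remove only if strength is negative
                        adj[a].discard(b)
                        adj[b].discard(a)
                    if Y not in adj[X]:
                        break
        return adj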


4 Experiments

We have done some experiments with the Alarm network (Beinlich et al., 1989) for testing the PC variations. In all of them, we have started with the original network and we have generated samples of different sizes by logic sampling. Then, we have tried to recover the original network from the samples by using the different variations of the PC algorithm, including the orientation step. We have considered the following measures of error in this process: the number of missing links, the number of added links, and the Kullback-Leibler distance (Kullback, 1968) of the learned probability distribution to the original one². The Kullback-Leibler distance is a more appropriate measure of error when the objective is to approximate the joint probability distribution for all the variables, and the measures of the number of differences in links are more appropriate when our objective is to recover the causal structure of the problem. We do not consider the number of wrong orientations, as our variations are mainly focused on the skeleton discovery phase of the PC algorithm. The numbers of added or deleted links only depend on the first part of the learning algorithm (selection of the skeleton).

²The parameters are estimated with a Bayesian Dirichlet approach with a global sample size of 2.

Experiments have been carried out in the Elvira environment (Consortium, 2002), where a local computation of the Kullback-Leibler distance is implemented. The different sample sizes we have used are: 100, 500, 1000, 5000, 10000, and for each sample size we have repeated the experiment 100 times. The combinations of algorithms we have tested are the following:

Alg1 This is the algorithm with minimal separating sets, score based tests, no refinement, and triangle resolution.

Alg2 Algorithm with minimal separating sets, score based tests, refinement, and triangle resolution.

Alg3 Algorithm with adjacent nodes as separating sets, score based tests, no refinement, and triangle resolution.

Alg4 Algorithm with minimal separating sets, chi-square tests, no refinement and no resolution of triangles.

Alg5 This is the algorithm with minimal separating sets, score based tests, no refinement, and no resolution of triangles.

These combinations are designed in this way as we consider Alg1 our basic algorithm to recover the graph structure, and then we want to study the effect of the application of each one of the variations to it.

Table 1 contains the average number of missing links, Table 2 the average number of added links, Table 3 the average Kullback-Leibler distance, and finally Table 4 contains the average running times of the different algorithms.

  Sample size   100     500    1000    5000   10000
  Alg1         17.94    8.76    6.02    3.25    2.56
  Alg2         16.29    8.16    5.59    3.7     3.28
  Alg3         26.54   18.47   14.87    8.18    7.08
  Alg4         29.07   11.49    8.42    3.53    2.14
  Alg5         17.87    8.83    6.08    3.03    2.57

Table 1: Average number of missing links

  Sample size   100     500    1000    5000   10000
  Alg1         10.42    4.96    2.87    2.2     1.98
  Alg2         26.13   17.52   16.32   16.01   15.38
  Alg3          3.06    0.63    0.14    0.03    0.01
  Alg4         14.73    5.96    5.68    4.78    4.79
  Alg5         10.46    4.99    3.21    1.92    1.9

Table 2: Average number of added links

  Sample size   100     500    1000    5000   10000
  Alg1          4.15    2.46    1.81    0.98    0.99
  Alg2          2.91    0.96    0.56    0.19    0.11
  Alg3          4.98    3.27    2.58    1.11    0.77
  Alg4          5.96    2.19    1.49    1.05    0.91
  Alg5          4.15    2.36    1.86    1.11    0.98

Table 3: Average Kullback-Leibler distance

  Sample size   100     500    1000    5000   10000
  Alg1          2.16    4.1     5.88   21.98   42
  Alg2          2.25    4.12    5.89   21.98   42.1
  Alg3          0.33    1.56    3.34   20.73   44.14
  Alg4          2.45    8.9    13.89   39.42   68.85
  Alg5          2.21    4.15    6.02   22.95   44.4

Table 4: Average time

In these results we highlight the following facts. Refinement (Alg2) increases the number of errors in the recovery of the causal structure (mainly more added links), but decreases the Kullback-Leibler distance to the original distribution. So its application will depend on our objective: approximating the joint distribution or recovering the causal structure. Refinement is fast and it does not add a significant amount of extra time.

When comparing Alg1 with Alg3 (minimum size cut set vs. set of adjacent nodes) we observe that with adjacent nodes fewer links are added and more are missing. The total amount of errors is in favour of Alg1 (minimum size cut set). This is due to the fact that Alg3 makes more conditional independence tests and with larger conditioning sets of variables, which makes it more likely to delete links. The Kullback-Leibler distance is better for Alg1 except for the largest sample size. A possible explanation is that with this large sample size we really do not miss any important link of the network, and added links can be more dangerous than deleted ones (when we delete links we are averaging distributions). With smaller sample sizes, Alg3 had a worse Kullback-Leibler distance as it can be missing some important links. We do not have any possible explanation for the fact that Alg1 does not improve the Kullback-Leibler distance when increasing the sample size from 5000 to 10000. When comparing the time of both algorithms, we see that Alg1 needs more time (to compute minimum size cut sets); however, when the sample size is large this extra time is compensated by the lower number of statistical tests, the total time for size 10000 being lower in the case of Alg1 (with minimum size cut sets).

When comparing Alg1 and Alg4 (score test and triangle resolution vs. chi-square tests and no triangle resolution) we observe that Alg4 always adds more links and misses more links (except for the largest sample size). The total number of errors is lower for Alg1.



It is meaningful that the number of added links does not decrease when going from a sample of 5000 to a sample of 10000. This is due to the fact that the probability of considering dependence when there is independence is fixed (the significance level) for large sample sizes. So all the extra information of the larger sample is devoted to decreasing the number of missing links (2.14 in Alg4 against 2.56 in Alg1), but the difference in added links is 4.79 in Alg4 against 1.98 in Alg1. So the small decrease in missing links is at the cost of a more important error in the number of added links. Bayesian score tests are more balanced in the two types of errors. When considering the Kullback-Leibler distance, we observe again the same situation as when comparing Alg1 and Alg2: a greater number of errors in the structure does not always imply a greater Kullback-Leibler distance. The time is always greater for Alg4.

The differences between Alg1 and Alg4 are not due to the triangles resolution in Alg1. As we will see now, triangles resolution does not really imply important changes in Alg1 performance. In fact, the effect of chi-square tests against Bayesian tests, without any other additional factor, can be seen by comparing Alg5 and Alg4. In this case, we can observe the same differences as when comparing Alg1 and Alg4.

When comparing Alg1 and Alg5 (no resolution of triangles) we see that there are no important differences in performance (errors and time) when resolution of triangles is applied. It seems that the total number of errors is decreased for intermediate sample sizes (500-1000) and there are no important differences for the other sample sizes, but more experiments are necessary. Triangles resolution does not really add a meaningful extra time. Applying this step needs some time, but the graph is simplified and posterior steps can be faster.

5 Conclusions

In this paper we have proposed four variations of the PC algorithm and we have tested them when learning the Alarm network. Our final recommendation would be to use the PC algorithm with score based tests, minimum size cut sets, and triangle resolution. The application of the refinement step would depend on the final aim: if we want to learn the causal structure, then refinement should not be applied, but if we want to approximate the joint probability distribution, then refinement should be applied. We recognize that more extensive experiments are necessary to evaluate the application of these modifications, especially the triangle resolution. But we feel that this modification is intuitively supported, and that it could have a more important role in other situations, especially if the faithfulness hypothesis is not verified.

Other combinations could be appropriate if the objective is different; for example, if we want to minimize the number of added links, then Alg3 (with adjacent nodes as cut set) could be considered.

In the future we plan to make more extensive experiments testing different networks and different combinations of these modifications. At the same time, we will consider other possible variations, as for example an algorithm mixing the skeleton and orientation steps. It is possible that some of the independencies are tested conditional on some sets that, after the orientation, do not separate the two links. We also plan to study alternative scores and to study the effect of using different sample sizes. Also, partial orientations can help to make the separating sets even smaller, as there can be some paths which are not active without observations. This can make the algorithms faster and more accurate. Finally, we think that the use of PC and its variations as starting points for greedy searching algorithms needs further research effort.

Acknowledgments

This work has been supported by the Spanish Ministry of Science and Technology under the Algra project (TIN2004-06204-C03-02).

References

S. Acid and L.M. de Campos. 1996. Finding minimum d-separating sets in belief networks. In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 3-10, Portland, Oregon.



S. Acid and L.M. de Campos. 2001. A hybrid methodology for learning belief networks: BENEDICT. International Journal of Approximate Reasoning, 27:235-262.

I.A. Beinlich, H.J. Suermondt, R.M. Chavez, and G.F. Cooper. 1989. The alarm monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the Second European Conference on Artificial Intelligence in Medicine, pages 247-256. Springer-Verlag.

R. Blanco, I. Inza, and P. Larrañaga. 2003. Learning Bayesian networks in the space of structures by estimation of distribution algorithms. International Journal of Intelligent Systems, 18:205-220.

W. Buntine. 1991. Theory refinement in Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60. Morgan Kaufmann, San Francisco, CA.

J. Cohen. 1988. Statistical power analysis for the behavioral sciences (2nd edition). Erlbaum, Hillsdale, NJ.

Elvira Consortium. 2002. Elvira: An environment for probabilistic graphical models. In J.A. Gámez and A. Salmerón, editors, Proceedings of the 1st European Workshop on Probabilistic Graphical Models, pages 222-230.

G.F. Cooper and E.A. Herskovits. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

G.F. Cooper. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.

D. Dash and M.J. Druzdzel. 1999. A hybrid anytime algorithm for the construction of causal models from sparse data. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 142-149. Morgan Kaufmann.

L.M. de Campos, J.M. Fernández-Luna, J.A. Gámez, and J.M. Puerta. 2002. Ant colony optimization for learning Bayesian networks. International Journal of Approximate Reasoning, 31:511-549.

D. Heckerman, D. Geiger, and D.M. Chickering. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

D. Heckerman, C. Meek, and G. Cooper. 1999. A Bayesian approach to causal discovery. In C. Glymour and G.F. Cooper, editors, Computation, Causation, and Discovery, pages 141-165. AAAI Press.

D. Heckerman. 1995. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.

S. Kullback. 1968. Information Theory and Statistics. Dover, New York.

A.L. Madsen, M. Land, U.F. Kjærulff, and F. Jensen. 2003. The Hugin tool for learning networks. In T.D. Nielsen and N.L. Zhang, editors, Proceedings of ECSQARU 2003, pages 594-605. Springer-Verlag.

D. Margaritis. 2003. Learning Bayesian Model Structure from Data. Ph.D. thesis, School of Computer Science, Carnegie Mellon University.

S. Moral. 2004. An empirical comparison of score measures for independence. In Proceedings of the Tenth International Conference IPMU 2004, Vol. 2, pages 1307-1314.

J. Pearl. 1988. Probabilistic Reasoning with Intelligent Systems. Morgan & Kaufman, San Mateo.

P. Spirtes, C. Glymour, and R. Scheines. 1993. Causation, Prediction and Search. Springer-Verlag, Berlin.

H. Steck and V. Tresp. 1999. Bayesian belief networks for data mining. In Proceedings of the 2nd Workshop on Data Mining und Data Warehousing als Grundlage moderner entscheidungsunterstuetzender Systeme, pages 145-154.

S. van Dijk, L.C. van der Gaag, and D. Thierens. 2003. A skeleton-based approach to learning Bayesian networks from data. In Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, and Hendrik Blockeel, editors, Proceedings of the Seventh Conference on Principles and Practice of Knowledge Discovery in Databases, pages 132-143. Springer-Verlag.



Learning Complex Bayesian Network Features for Classification
Péter Antal, András Gézsi, Gábor Hullám and András Millinghoffer
Department of Measurement and Information Systems
Budapest University of Technology and Economics

Abstract
The increasing complexity of the models, the abundant electronic literature and the
relative scarcity of the data make it necessary to use the Bayesian approach to complex
queries based on prior knowledge and structural models. In the paper we discuss the
probabilistic semantics of such statements, the computational challenges and possible
solutions of Bayesian inference over complex Bayesian network features, particularly over
features relevant in the conditional analysis. We introduce a special feature called Markov
Blanket Graph. Next we present an application of the ordering-based Monte Carlo method
over Markov Blanket Graphs and Markov Blanket sets.

In the Bayesian approach to a structural feature F with values F(G) ∈ {f_i}, i = 1, ..., R, we are interested in the feature posterior induced by the model posterior given the observations D_N, where G denotes the structure of the Bayesian network (BN):

    p(f_i | D_N) = Σ_G 1(F(G) = f_i) p(G | D_N)        (1)
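Operationally, Eq. (1) is typically approximated by averaging the indicator over DAGs sampled from p(G | D_N), for example with the MCMC schemes discussed later in the paper. A minimal sketch (ours, with hypothetical helpers sample_dag_posterior and a feature function F) is:

    from collections import Counter

    def feature_posterior(sample_dag_posterior, F, data, n_samples=10000):
        """Monte Carlo estimate of p(F(G) = f | D_N), cf. Eq. (1).

        sample_dag_posterior(data, n) is assumed to yield n DAGs drawn
        (approximately) from p(G | D_N); F(G) maps a DAG to a feature value,
        e.g. the Markov Blanket set or Markov Blanket Graph of a target node.
        """
        counts = Counter(F(G) for G in sample_dag_posterior(data, n_samples))
        return {f: c / n_samples for f, c in counts.items()}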
The importance of such inference results from (1) the frequently impractically high sample and computational complexity of the complete model, (2) a subsequent Bayesian decision-theoretic phase, (3) the availability of stochastic methods for estimating such posteriors, and (4) the focusedness of the data and the prior on certain aspects (e.g. by the pattern of missing values or better understood parts of the model). Correspondingly, there is a general expectation that for a small amount of data some properties of complex models can be inferred with high confidence and relatively low computation cost, preserving a model-based foundation.

The irregularity of the posterior over the discrete model space of Directed Acyclic Graphs (DAGs) poses serious challenges when such feature posteriors are to be estimated. This induced the research on the application of Markov Chain Monte Carlo (MCMC) methods for elementary features (Madigan et al., 1996; Friedman and Koller, 2000). This paper extends these results by investigating Bayesian inference about BN features with high cardinality, relevant in classification. In Section 1 we present a unified view of BN features enriched with free-text annotations as a probabilistic knowledge base (pKB) and discuss the corresponding probabilistic semantics. In Section 2 we overview the earlier approaches to feature learning. In Section 3 we discuss structural BN features and introduce a special feature called Markov Blanket Graph or Mechanism Boundary Graph. Section 4 discusses its relevance in conditional modeling. In Section 5 we report an algorithm using ordering-based MCMC methods to perform inference over Markov Blanket Graphs and Markov Blanket sets. Section 6 presents results for the ovarian tumor domain.

1 BN features in pKBs

Probabilistic and causal interpretations of BNs ensure that structural features can express a wide range of relevant concepts based on conditional independence statements and causal assertions (Pearl, 1988; Pearl, 2000; Spirtes et al., 2001). To enrich this approach with subjective domain knowledge via free-text annotations, we introduce the concept of a Probabilistic Annotated Bayesian Network knowledge base.
Definition 1. A Probabilistic Annotated membership and pairwise precedence was inves-
Bayesian Network knowledge base K for a fixed tigated in (Friedman et al., 1999).
set V of discrete random variables is a first- On the contrary, the Bayesian framework of-
order logical knowledge base including standard fers many advantages such as the normative,
graph, string and BN related predicates, rela- model-based combination of prior and data al-
tions and functions. Let Gw represent a target lowing unconstrained application in the small
DAG structure including all the target random sample region. Furthermore the feature poste-
variables. It includes free-text descriptions for riors can be embedded in a probabilistic knowl-
the subgraphs and for their subsets. We assume edge base and they can be used to induce pri-
that the models M of the knowledge base vary ors for other model spaces and for a subse-
only in Gw (i.e. there is a mapping G M ) quent learning. In (Buntine, 1991) proposed
and a distribution p(Gw |) is available. the concept of a posterior knowledge base con-
A sentence is any well-formed first-order ditioned on a given ordering for the analysis
formula in K, the probability of which is defined of BN models. In (Cooper and Herskovits,
as the expectation of its truth 1992) discussed the general use of the poste-
rior for BN structures to compute the poste-
X rior of arbitrary features. In (Madigan et al.,
Ep(M|K)[M ] = M(G) p(G|K).
1996) proposed an MCMC scheme over the
G
space of DAGs and orderings of the variables
where M(G) denotes its truth-value in the to approximate Bayesian inference. In (Heck-
model M(G). This hybrid approach defines a ermann et al., 1997) considered the applica-
distribution over models by combining a logi- tion of the full Bayesian approach to causal
cal knowledge base with a probabilistic model. BNs. Another MCMC scheme, the ordering-
The logical knowledge base describes the cer- based MCMC method utilizing the ordering of
tain knowledge in the domain defining a set the variables were reported in (Friedman and
of models (legal worlds) and the probabilistic Koller, 2000). They developed and used a
part (p(Gw |)) expresses the uncertain knowl- closed form for the order conditional posteri-
edge over these worlds. ors of Markov blanket membership, beside the
earlier closed form of the parental sets.
2 Earlier works
To avoid the statistical and computational bur- 3 BN features
den of identifying complete models, related lo-
cal algorithms for identifying causal relations The prevailing interpretation of BN feature
were reported in (Silverstein et al., 2000) and in learning assumes that the feature set is sig-
(Glymour and Cooper, 1999; Mani and Cooper, nificantly simpler than the complete domain
2001). The majority of feature learning algo- model providing an overall characterization as
rithms targets the learning of relevant variables marginals and that the number of features and
for the conditional modeling of a central vari- their values is tractable (e.g linear or quadratic
able, i.e. they target the so-called feature sub- in the number of variables). Another interpre-
set selection (FSS) problem (Kohavi and John, tation is to identify high-scoring arbitrary sub-
1997). Such examples are the Markov Blanket graphs or parental sets, Markov blanket sub-
Approximating Algorithm (Koller and Sahami, sets and estimate their posteriors. A set of sim-
1996) and the Incremential Association Markov ple features means a fragmentary representation
Blanket algorithm (Tsamardinos and Aliferis, for the distribution over the complete domain
2003). The subgraphs of a BN as features were model from multiple, though simplified aspects,
targeted in (Peer et al., 2001). The bootstrap whereas using a given complex feature means a
approach inducing confidence measures for fea- focused representation from a single, but com-
tures such as compelled edges, Markov blanket plex point of view. A feature F is complex if



the number of its values is exponential in the feature is not a unique representative of a condi-
number of domain variables. First we cite a tionally equivalent class of BNs. From a causal
central concept and a theorem about relevance point of view, this feature uniquely represents
for variables V = {X1 , . . . , Xn } (Pearl, 1988). the minimal set of mechanisms including Y . In
Definition 2. A set of variables M B(Xi ) short, under the conditions mentioned above,
is called the Markov Blanket of Xi w.r.t. this structural feature of the causal BN domain
the distribution P (V ), if (Xi V \ model is necessary and sufficient to support the
M B(Xi )|M B(Xi )). A minimal Markov blanket manual exploration and automated construc-
is called a Markov boundary. tion of a conditional dependency model.
There is no closed formula for the posterior
Theorem 1. If a distribution P (V ) factorizes p(M BG(Y, G)), which excludes the direct use
w.r.t DAG G, then of the MBG space in optimization or in Monte
Carlo methods. However, there exist a formula
i = 1, . . . , n : (Xi
V \ bd(Xi )|bd(Xi , G))P ,
for the order conditional posterior with polyno-
where bd(Xi , G) denotes the set of parents, chil- mial time complexity if the size of the parental
dren and the childrens other parents for Xi . sets is bounded by k

So the set bd(Xi , G) is a Markov blanket of p(M BG(Y, G) = mbg|DN ) =


Xi . So we will also refer to bd(Xi , G) as a
p(pa(Y, mbg)|DN )
Markov blanket of Xi in G using the notation Y
M B(Xi , G) implicitly assuming that P factor- p(pa(Xi , mbg)|DN )
izes w.r.t. G. Y Xi (3)
Y pa(Xi ,mbg)
The induced, symmetric pairwise rela- Y X
tion is the Markov Blanket Membership p(pa(Xi )|DN ).
M BM (Xi , Xj , G) w.r.t. G between Xi and Xj Y Xi Y pa(X
/ i)
Y pa(X
/ i ,mbg)
(Friedman et al., 1999)
The cardinality of the M BG(Y ) space is still
M BM (Xi , Xj , G) Xj bd(Xi , G) (2)
super-exponential (even if the number of par-
Finally, we define the Markov Blanket Graph. ents is bounded by k). Consider an order-
ing of the variables such that Y is the first
Definition 3. A subgraph of G is called the and all the other variables are children of it,
Markov Blanket Graph or Mechanism Bound- then the parental sets can be selected indepen-
ary Graph M BG(Xi , G) of a variable Xi if it dently, so the number of alternatives is in the
includes the nodes from M B(Xi , G) and the in- 2
order of (n 1)n (or (n 1)(k1)(n1) ). How-
coming edges into Xi and into its children. ever, if Y is the last in the ordering, then the
It is easy to show, that the characteristic number of alternatives is of the order 2n1 or
property of the MBG feature is that it com- (n 1)(k) ). In case of M BG(Y, G), variable
pletely defines the distribution P (Y |V \ Y ) by Xi can be (1) non-occurring in the MBG, (2)
the local dependency models of Y and its chil- a parent of Y (Xi pa(Y, G)), (3) a child of
dren in a BN model G, in case of point pa- Y (Xi ch(Y, G)) and (4) (pure) other par-
rameters (G, ) and of parameter priors satisfy- ent ((Xi / pa(Y, G) (Xi pa(ch(Y )j )))).
ing global parameter independence (Spiegelhal- These types correspond to the irrelevant (1) and
ter and Lauritzen, 1990) and parameter modu- strongly relevant (2,3,4) categories (see, Def. 4).
larity (Heckerman et al., 1995). This property The number of DAG models G(n) compatible
offers two interpretations for the MBG feature. with a given MBG and ordering can be com-
From a probabilistic point of view the M BG(G) puted as follows: the contribution of the vari-
feature defines an equivalence relation over the ables Xi Y without any constraint and the
DAGs w.r.t. P (Y |V \ Y ), but clearly the MBG contribution of the variables Y Xi that are



not children of Y, which is still 2^O(k n log(n)) (note certain sparse graphs are compatible with many orderings).

4 Features in conditional modeling

In the conditional Bayesian approach the relevance of predictor variables (features in this context) can be defined in an asymptotic, algorithm-, model- and loss-free way as follows.

Definition 4. A feature X_i is strongly relevant iff there exists some x_i, y and s_i = x_1, ..., x_{i-1}, x_{i+1}, ..., x_n for which p(x_i, s_i) > 0 such that p(y | x_i, s_i) ≠ p(y | s_i). A feature X_i is weakly relevant iff it is not strongly relevant, and there exists a subset of features S'_i of S_i for which there exists some x_i, y and s'_i for which p(x_i, s'_i) > 0 such that p(y | x_i, s'_i) ≠ p(y | s'_i). A feature is relevant if it is either weakly or strongly relevant; otherwise it is irrelevant (Kohavi and John, 1997).

In the so-called filter approach to feature selection we have to select a minimal subset X' which fully determines the conditional distribution of the target (p(Y | X) = p(Y | X')). If the conditional modeling is not applicable and a domain model-based approach is necessary, then the Markov boundary property (feature) seems to be an ideal candidate for identifying relevance. The following theorem gives a sufficient condition for uniqueness and minimality (Tsamardinos and Aliferis, 2003).

Theorem 2. If the distribution P is stable w.r.t. the DAG G, then the variables bd(Y, G) form a unique and minimal Markov blanket of Y, MB(Y). Furthermore, X_i ∈ MB(Y) iff X_i is strongly relevant.

However, the MBG feature provides a more detailed description of relevancies. As an example, consider that a logistic regression (LR) model without interaction terms and a Naive BN model can be made conditionally equivalent using a local and transparent parameter transformation. If the distribution contains additional dependencies, then the induced conditional distribution has to be represented by an LR model with interaction terms.

5 Estimating complex features

The basic task is the estimation of the expectation of a given random variable over the space of DAGs with a specified confidence level in Eq. 1. We assume complete data, discrete domain variables, multinomial local conditional distributions and Dirichlet parameter priors. This ensures efficiently computable closed formulas for the (unnormalized) posteriors of DAGs. As this posterior cannot be sampled directly and the construction of an approximating distribution is frequently not feasible, the standard approach is to use MCMC methods such as Metropolis-Hastings over the DAG space (see e.g. (Gamerman, 1997; Gelman et al., 1995)). The DAG-based MCMC method for estimating a given expectation is generally applicable, but for certain types of features, such as Markov blanket membership, an improved method, the so-called ordering-based MCMC method, can be applied, which utilizes closed forms of the order conditional feature posteriors computable in O(n^(k+1)) time, where k denotes the maximum number of parents (Friedman and Koller, 2000).

In these approaches the problem is simplified to the estimation of separate posteriors. However, the number of target features can be as high as 10^4-10^6 even for a given type of pairwise features and moderate domain complexity. This calls for a decision-theoretic report of the selection and estimation of the features, but here we use a simplified approach targeting the selection-estimation of the K most probable feature values. Because of the exponential number of feature values a search method has to be applied, either iteratively or in an integrated fashion. The first approach requires the offline storage of orderings and corresponding common factors, so we investigated the latter option. The integrated feature selection-estimation is particularly relevant for the ordering-based MC methods, because it does not generate implicitly high-scoring features, and features that are not part of the solution cause extra computational costs in estimation.

The goal of the search within the MC cycle at step l is the generation of MBGs with high
order conditional posterior, potentially using the already generated MBGs and the posteriors p(MBG | ≺_l, D_N). To facilitate the search we define an order conditional MBG state space based on the observation that the order conditionally MAP MBG can be found in O(n^(k+1)) time with a negligible constant increase only. An MBG state is represented by an n̄-dimensional vector s, where n̄ is the number of variables not preceding the target variable Y in the ordering ≺:

    n̄ = Σ_{i=1}^{n} 1(Y ≺ X_i)    (4)

The range of the values are integers s_i = 0, ..., r_i representing either separate parental sets or (in case of X_i where Y ≺_l X_i) a special set of parental sets not including the target variable. The product of the order conditional posteriors of the represented sets of parental sets gives the order conditional posterior of the represented MBG state in Eq. 3. We ensure that the conditional posteriors of the represented sets of parental sets are monotone decreasing w.r.t. their indices:

    ∀ s_i < s'_i : p(s_i | D_N, ≺) ≥ p(s'_i | D_N, ≺)    (5)

which can be constructed in O(n^(k+1) log(max_i r_i)) time, where k is the maximum parental set size.

This MBG space allows the application of efficient search methods. We experimented with direct sampling, top-sampling and a deterministic search. The direct sampling was used as a baseline, because it does not need the MBG space. The top-sampling method is biased towards sampling MBGs with high order conditional posterior, by sampling only from the L most probable sets of parental sets for each Y ≺ X_i. The uniform-cost search constructs the MBG space, then performs a uniform-cost search to a maximum number of MBGs or to a threshold p(MBG^{MAP,≺} | ≺, D_N)/ε_S.

The pseudo code of searching and estimating MAP values for the MBG feature of a given variable is shown in Alg. 1 (for simplicity the estimation of simple classification features such as edge and MBM relations, and the estimation of the MB features of a given variable using the estimated MAP MBG collection, is not shown).

Algorithm 1 Search and estimation of classification features using the MBG-ordering spaces
Require: p(≺), p(pa(X_i) | ≺), k, R, ε, L_S, ε_S, L_T, M;
Ensure: the K MAP feature values with estimates
  Cache order-free parental posteriors {∀i, |pa(X_i)| ≤ k : p(pa(X_i) | D_N)};
  Initialize MCMC, the MBG-tree T, the MBM and edge posterior matrices R, E;
  Insert a priori specified MBGs in T;
  for l = 0 to M do {the sampling cycle}
    Draw next ordering ≺_l;
    Cache order specific common factors for |pa(X_i)| ≤ k:
      p(pa(X_i) | ≺_l) for all X_i,
      p(Y ∉ pa(X_i) | ≺_l) for Y ≺_l X_i;
    Compute p(≺_l | D_N);
    Construct order conditional MBG-Subspace(≺_l, R, ε);
    S^S = Search(≺_l, L_S, ε_S);
    for all mbg ∈ S^S do
      if mbg ∉ T then
        Insert(T, mbg);
        if L_T < |T| then
          T = PruneToHPD(T, L_T);
    for all mbg ∈ T do
      p(mbg | D_N) += p(mbg | ≺_l, D_N);
  Report the K MAP MBGs from T;
  Report the K MAP MBs using the MBGs in T;

The parameters R, ε allow the restriction of the MBG subspace separately for each dimension to less than R values by requiring that the corresponding posteriors are above the exp(−ε) ratio of the respective MAP value. The uniform-cost search starts from the order conditional MAP MBG, and stops after the expansion of L_S number of states or if the most probable MBG in its search list drops below the exp(−ε_S) ratio of the order conditional posterior of the starting MBG. Generally, the expansion phase has high computational costs, but for large L_T the update of the MBGs in T is high as well.
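To make the state representation around Eq. 4-5 concrete, the following Python sketch (not from the paper; the variable names and posterior values are made up) keeps, for each variable not preceding Y, a list of candidate parental-set posteriors sorted in decreasing order, so the monotonicity condition of Eq. 5 holds by construction; the order conditional posterior of an MBG state is then the product of the selected entries, the MAP state is the all-zero index vector, and direct sampling serves as the baseline mentioned above.

    import random
    from itertools import product

    # Toy order-conditional posteriors p(pa(X_i) | ordering, D_N) for the variables
    # not preceding the target Y; each list is sorted in decreasing order, so the
    # monotonicity requirement of Eq. 5 holds by index. (Hypothetical numbers.)
    parental_post = {
        "X1": [0.6, 0.3, 0.1],
        "X2": [0.5, 0.25, 0.15, 0.1],
        "X3": [0.8, 0.2],
    }

    def mbg_state_posterior(state, post=parental_post):
        """Order-conditional posterior of an MBG state s = (s_1, ..., s_nbar):
        the product of the selected parental-set posteriors."""
        p = 1.0
        for var, idx in state.items():
            p *= post[var][idx]
        return p

    def map_state(post=parental_post):
        """The order-conditional MAP MBG: index 0 everywhere, because every
        list is sorted in decreasing order."""
        return {var: 0 for var in post}

    def direct_sample(post=parental_post):
        """Baseline direct sampling: draw each s_i independently from its posterior."""
        return {var: random.choices(range(len(probs)), weights=probs, k=1)[0]
                for var, probs in post.items()}

    if __name__ == "__main__":
        print("MAP state:", map_state(), mbg_state_posterior(map_state()))
        # Enumerate all states to check that the MAP state really is the best one.
        best = max((dict(zip(parental_post, idxs))
                    for idxs in product(*(range(len(v)) for v in parental_post.values()))),
                   key=mbg_state_posterior)
        assert best == map_state()
        print("a sampled state:", direct_sample())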

In order to maintain tractability the usage of more refined methods such as partial updating is required. Within the explored OC domain, however, the full, exact update has acceptable costs if the size of the estimated MBG set is L_T ∈ [10^5, 10^6]. This L_T ensures that the newly inserted MBGs are not pruned before their estimates can reliably indicate their high-scoring potential and still allows an exact update. In larger domains this balance can be different and the analysis of this question in general is left for future research.

The analysis of the MBG space showed that the conditional posteriors of the ranked parental sets after rank 10 are negligible, so subsequently we will report results using the values R = 20, ε = 4 and L_S = 10^4, ε_S = 10^6. Note that the expansion with the L_S conditionally most probable MBGs in each step does not guarantee that the L_S most probable MBGs are estimated, not even the MAP MBG.

6 Results

We used a data set consisting of 35 discrete variables and 782 complete cases related to the preoperative diagnosis of ovarian cancer (see (Antal et al., 2004)).

First we report the estimation-selection of MB features for the central variable Pathology. We applied the heuristic deterministic search-estimation method in the inner cycle of the MCMC method. The length of the burn-in and MCMC simulation was 10000, the probability of the pairwise replace operator was 0.8, the parameter prior was the BDeu and the structure prior was uniform for the parental set sizes (Friedman and Koller, 2000). The maximum number of parents was 4 (the posteriors of larger sets are insignificant). For preselected high-scoring MB values after 10000 burn-in the single-chain convergence test from Geweke comparing averages has a z-score of approximately 0.5, and the R value of the multiple-chain method of Gelman-Rubin with 5 chains drops below 1.05 (Gamerman, 1997; Gelman et al., 1995). The variances of the MCMC estimates of these preselected test feature values drop below 10^-2. We also applied the deterministic search-estimation method for a single ordering, because a total ordering of the variables was available from an expert. Fig. 1 reports the estimated posteriors of the MAP MB sets for Pathology with their MBM-based approximated values assuming the independence of the MBM values, and Table 1 shows the members of the MB sets. Note that the two monotone decreasing curves correspond to independent rankings, one for the expert's total ordering and one for the unconstrained case. It also reports the MB set spanned by a prior BN specified by the expert (E), the MB set spanned by the MAP BN (BN) and the set spanned by the MAP MBG (MBG) (see Eq. 6). Furthermore we generated another reference set (LR) from a conditional standpoint using the logistic regression model class and the SPSS 14.0 software with the default setting for forward model construction (Hosmer and Lemeshow, 2000). MB_p reports the result of the deterministic select-estimate method using the total ordering of the expert, and MB_1, MB_2, MB_3 report the results of the unconstrained ordering-based MCMC with the deterministic select-estimate method. The variables FamHist, CycleDay, HormTherapy, Hysterectomy, Parity and PMenoAge are never selected and the variables Volume, Ascites, Papillation, PapFlow, CA125 and WallRegularity are always selected, so they are not reported.

The MBM-based approximation performs relatively well, particularly w.r.t. ranking in the case of the expert's ordering ≺_0, but it performs poorly in the unconstrained case both w.r.t. estimations and ranks (see the difference of the MB_p set to MB_1 w.r.t. the variables Age, Meno, PI, TAMX and Solid).
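As a small illustration of the MBM-based approximation used in Fig. 1, the sketch below (not the authors' code; the membership posteriors are invented) scores a candidate Markov blanket set as a product of independent Markov Blanket Membership terms.

    # Hypothetical MBM posteriors p(X_i in MB(Pathology) | D_N).
    mbm_posterior = {"Meno": 0.9, "Bilateral": 0.95, "Septum": 0.92,
                     "Age": 0.4, "Fluid": 0.3}

    def mb_set_approx(mb_set, mbm=mbm_posterior):
        """Approximate p(MB(Y) = mb_set | D_N) assuming independent memberships:
        product of p(membership) for members and 1 - p(membership) otherwise."""
        p = 1.0
        for var, p_in in mbm.items():
            p *= p_in if var in mb_set else (1.0 - p_in)
        return p

    print(mb_set_approx({"Meno", "Bilateral", "Septum"}))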

We compared the MBG(Y, G^{MAP}) and MB(Y, G^{MAP}) feature values defined by the MAP BN structure G^{MAP} against the MAP MBG feature value MBG(Y)^{MAP} and the MAP MB feature value MB(Y)^{MAP}, including the MB feature value defined by the MAP MBG feature value MB(Y, MBG(Y)^{MAP}):

    G^{MAP} = arg max_G p(G | D_N)    (6)
    MBG(Y)^{MAP} = arg max_{mbg(Y)} p(mbg(Y) | D_N)
    MB(Y)^{MAP} = arg max_{mb(Y)} p(mb(Y) | D_N)

We performed the comparison using the best BN structure found in the MCMC simulation. The MAP MBG feature value MBG(Y)^{MAP} differed significantly from the MAP domain model, because of the additional Age and Fluid variables in the domain model. The MAP MB feature value MB(Y)^{MAP} similarly differs from the MB sets defined by the MAP domain models, for example w.r.t. the vascularization variables such as PI. Interestingly, the MAP MB feature value also differs from the MB feature value defined by the MAP MBG feature value MB(Y, MBG(Y)^{MAP}), for example w.r.t. the TAMX and Solid variables. In conclusion these results, together with the comparison against the simple feature-based analysis such as the MBM-based analysis reported in Fig. 1, show the relevance of the complex feature-based analysis.

We also constructed an offline probabilistic knowledge base containing 10^4 MAP MBGs. It is connected with the annotated BN knowledge base defined in Def. 1, which allows an offline exploration of the domain from the point of view of conditional modeling. The histogram of the number of parameters and inputs for the MBGs using only the fourteen most relevant variables is reported in Fig. 2.

Table 1: Markov blanket sets of Pathology among the thirty-five variables.

               E  LR  BN  MBG  MB_p  MB_1  MB_2  MB_3
    Age        1   0   1   0     1     0     0     0
    Meno       1   0   1   1     0     1     1     1
    PMenoY     0   1   0   0     0     0     0     0
    PillUse    1   0   0   0     0     0     0     0
    Bilateral  1   0   1   1     1     1     1     1
    Pain       0   1   0   0     0     0     0     0
    Fluid      1   0   1   0     0     0     0     0
    Septum     1   0   1   1     1     1     1     1
    ISeptum    0   0   1   1     1     1     1     1
    PSmooth    1   0   0   0     0     0     0     0
    Loc.       1   1   0   0     0     1     0     1
    Shadows    1   0   1   1     1     1     1     1
    Echog.     1   0   1   1     1     1     1     1
    ColScore   1   0   1   1     1     1     1     1
    PI         1   1   0   0     1     0     0     0
    RI         1   0   1   1     1     1     1     1
    PSV        1   0   1   1     1     1     1     1
    TAMX       1   1   0   1     1     0     0     0
    Solid      1   1   1   1     1     0     1     0
    FHBrCa     0   0   0   0     0     0     0     1
    FHOvCa     0   0   0   1     0     0     0     0

Figure 1: The ranked posteriors and their MBM-based approximations of the 20 most probable MB(Pathology) sets for the single/unconstrained orderings. (Curves: P(MB | ≺, D), P(MB | ≺, D, MBM), P(MB | D) and P(MB | D, MBM).)

Figure 2: The histogram of the number of parameters (|Params|) and inputs (|Inputs|) of the MAP MBGs.
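The three maximizations in Eq. 6 act on different posteriors (over DAGs, MBG values and MB sets); the following sketch shows, with assumed toy estimates standing in for the accumulated values produced by Algorithm 1, how the MAP values and the MB spanned by the MAP MBG would be read off and compared.

    # Stand-ins for accumulated posterior estimates (hypothetical values).
    p_dag = {"G_a": 0.7, "G_b": 0.3}                      # p(G | D_N)
    p_mbg = {("pa:Meno", "ch:Septum"): 0.5,               # p(mbg(Y) | D_N)
             ("pa:Meno", "ch:RI"): 0.4}
    p_mb  = {frozenset({"Meno", "Septum"}): 0.55,         # p(mb(Y) | D_N)
             frozenset({"Meno", "RI"}): 0.45}

    def argmax(d):
        return max(d, key=d.get)

    def mb_of_mbg(mbg):
        """Hypothetical helper: the MB spanned by an MBG representation,
        here simply the variables named in its parent/child entries."""
        return frozenset(entry.split(":")[1] for entry in mbg)

    G_map, mbg_map, mb_map = argmax(p_dag), argmax(p_mbg), argmax(p_mb)   # Eq. 6
    # MB(Y, MBG(Y)_MAP) need not coincide with the MAP MB feature value,
    # which is exactly the comparison discussed above.
    print(G_map, mbg_map, mb_map, mb_of_mbg(mbg_map) == mb_map)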

7 Conclusion

In the paper we presented a Bayesian approach for complex BN features, for the so-called Markov Blanket set and the Markov Blanket Graph features. We developed and applied a select-estimate algorithm using ordering-based MCMC, which uses the efficiently computable order conditional posterior of the MBG feature and the proposed MBG space. The comparison of the most probable MB and MBG feature values with simple feature based approximations and with complete domain modeling showed the separate significance of the analysis based on these complex BN features in the investigated medical domain. The proposed algorithm and the offline knowledge base in the introduced probabilistic annotated BN knowledge base context allow new types of analysis and fusion of expertise, data and literature.

Acknowledgements

We thank T. Dobrowiecki for his helpful comments. Our research was supported by grants from Research Council KUL: IDO (IOTA Oncology, Genetic networks).

References

P. Antal, G. Fannes, Y. Moreau, D. Timmerman, and B. De Moor. 2004. Using literature and data to learn Bayesian networks as clinical models of ovarian tumors. AI in Med., 30:257-281. Special issue on Bayesian Models in Med.

W. L. Buntine. 1991. Theory refinement of Bayesian networks. In Proc. of the 7th Conf. on Uncertainty in Artificial Intelligence, pages 52-60. Morgan Kaufmann.

G. F. Cooper and E. Herskovits. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

N. Friedman and D. Koller. 2000. Being Bayesian about network structure. In Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence, pages 201-211. Morgan Kaufmann.

N. Friedman, M. Goldszmidt, and A. Wyner. 1999. Data analysis with Bayesian networks: A bootstrap approach. In Proc. of the 15th Conf. on Uncertainty in Artificial Intelligence, pages 196-205. Morgan Kaufmann.

D. Gamerman. 1997. Markov Chain Monte Carlo. Chapman & Hall, London.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. 1995. Bayesian Data Analysis. Chapman & Hall, London.

C. Glymour and G. F. Cooper. 1999. Computation, Causation, and Discovery. AAAI Press.

D. Heckerman, D. Geiger, and D. Chickering. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

D. Heckerman, C. Meek, and G. Cooper. 1997. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05.

D. W. Hosmer and S. Lemeshow. 2000. Applied Logistic Regression. Wiley & Sons, Chichester.

R. Kohavi and G. H. John. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97:273-324.

D. Koller and M. Sahami. 1996. Toward optimal feature selection. In International Conference on Machine Learning, pages 284-292.

D. Madigan, S. A. Andersson, M. Perlman, and C. T. Volinsky. 1996. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods, 25:2493-2520.

S. Mani and G. F. Cooper. 2001. A simulation study of three related causal data mining algorithms. In International Workshop on Artificial Intelligence and Statistics, pages 73-80. Morgan Kaufmann, San Francisco, CA.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA.

J. Pearl. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press.

D. Pe'er, A. Regev, G. Elidan, and N. Friedman. 2001. Inferring subnetworks from perturbed expression profiles. Bioinformatics, Proceedings of ISMB 2001, 17(Suppl. 1):215-224.

C. Silverstein, S. Brin, R. Motwani, and J. D. Ullman. 2000. Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2/3):163-192.

D. J. Spiegelhalter and S. L. Lauritzen. 1990. Sequential updating of conditional probabilities on directed acyclic graphical structures. Networks, 20:579-605.

P. Spirtes, C. Glymour, and R. Scheines. 2001. Causation, Prediction, and Search. MIT Press.

I. Tsamardinos and C. Aliferis. 2003. Towards principled feature selection: Relevancy, filters, and wrappers. In Proc. of Artificial Intelligence and Statistics, pages 334-342.



Literature Mining using Bayesian Networks
Péter Antal and András Millinghoffer
Department of Measurement and Information Systems
Budapest University of Technology and Economics

Abstract
In biomedical domains, free text electronic literature is an important resource for knowl-
edge discovery and acquisition, particularly to provide a priori components for evaluating
or learning domain models. Aiming at the automated extraction of this prior knowledge
we discuss the types of uncertainties in a domain with respect to causal mechanisms,
formulate assumptions about their report in scientific papers and derive generative prob-
abilistic models for the occurrences of biomedical concepts in papers. These results allow
the discovery and extraction of latent causal dependency relations from the domain lit-
erature using minimal linguistic support. Contrary to the currently prevailing methods,
which assume that relations are sufficiently formulated for linguistic methods, our ap-
proach assumes only the report of causally associated entities without their tentative
status or relations, and can discover new relations and prune redundancies by providing
a domain-wide model. Therefore the proposed Bayesian network based text mining is an
important complement to the linguistic approaches.

1 Introduction

Rapid accumulation of biological data and the corresponding knowledge posed a new challenge of making this voluminous, uncertain and frequently inconsistent knowledge accessible. Despite recent trends to broaden the scope of formal knowledge bases in biomedical domains, free text electronic literature is still the central repository of the domain knowledge. This central role will probably be retained in the near future, because of the rapidly expanding frontiers. The extraction of explicitly stated or the discovery of implicitly present latent knowledge requires various techniques ranging from purely linguistic approaches to machine learning methods. In the paper we investigate a domain-model based approach to statistical inference about dependence and causal relations given the literature using minimal linguistic preprocessing. We use Bayesian Networks (BNs) as causal domain models to introduce generative models of publication, i.e. we examine the relation of domain models and generative models of the corresponding literature.

In a wider sense our work provides support to statistical inference about the structure of the domain model. This is a two-step process, which consists of the reconstruction of the beliefs in mechanisms from the literature by model learning and their usage in a subsequent learning phase. Here, the Bayesian framework is an obvious choice. Earlier applications of text mining provided results for the domain experts or data analysts, whereas our aim is to go one step further and use the results directly in the statistical learning of the domain models.

The paper is organized as follows. Section 2 presents a unified view of the literature, the data and their models. In Section 3 we review the types of uncertainties in biomedical domains from a causal, mechanism-oriented point of view. In Section 4 we summarize recent approaches to information extraction and literature mining based on natural language processing (NLP) and local analysis of occurrence patterns. In Section 5 we propose generative probabilistic models for the occurrences of biomedical concepts in scientific papers. Section 6 presents textual aspects of the application
domain, the diagnosis of ovarian cancer. Section 7 reports results on learning BNs given the literature.

2 Fusion of literature and data

The relation of experimental data D_N, probabilistic causal domain models formalized as BNs (G, θ), domain literature D^L_{N'} and models of publication (G^L, θ^L) can be approached at different levels. For the moment, let us assume that probabilistic models are available describing the generation of observations P(D_N | (G, θ)) and literature P(D^L_{N'} | (G^L, θ^L)). This latter may include stochastic grammars for modeling the linguistic aspects of the publication; however, we will assume that the literature has a simplified agrammatical representation and the corresponding generative model can be formalized as a BN (G^L, θ^L) as well.

The main question is the relation of P(G, θ) and P(G^L, θ^L). In the most general approach the hypothetical posteriors P(G, θ | D_N, ξ_i) expressing personal beliefs over the domain models conditional on the experiments and the personal background knowledge ξ_i determine or at least influence the parameters of the model (G^L, θ^L) in P(D^L_{N'} | (G^L, θ^L), ξ_i).

The construction or the learning of a full-fledged decision theoretic model of publication is currently not feasible regarding the state of quantitative modeling of scientific research and publication policies, not to mention the cognitive and even stylistic aspects of explanation, understanding and learning (Rosenberg, 2000). In a severely restricted approach we will focus only on the effect of the belief in domain models P(G, θ) on that in publication models P(G^L, θ^L). We will assume that this transformation is local, i.e. there is a simple probabilistic link between the model spaces, specifically between the structure of the domain model and the structure and parameters of the publication model p(G^L, θ^L | G). Probabilistically linked model spaces allow the computation of the posterior over domain models given the literature(!) data as:

    P(G | D^L_{N'}) = [P(G) / P(D^L_{N'})] Σ_{G^L} P(D^L_{N'} | G^L) P(G^L | G).

The formalization (D_N ← G → G^L → D^L_{N'}) also allows the computation of the posterior over the domain models given both the clinical and the literature(!) data as:

    P(G | D_N, D^L_{N'}) = P(G) [P(D^L_{N'} | G) / P(D^L_{N'})] [P(D_N | G) / P(D_N | D^L_{N'})]
                         ∝ P(G) P(D_N | G) Σ_{G^L} P(D^L_{N'} | G^L) P(G^L | G).

The order of the factors shows that the prior is first updated by the literature data, then by the clinical data. A considerable advantage of this approach is the integration of literature and clinical data at the lowest level and not through feature posteriors, i.e. by using literature posteriors in feature-based priors for the (clinical) data analysis (Antal et al., 2004).

We will assume that a bijective relation exists between the domain model structures G and the publication model structures G^L (T(G) = G^L), whereas the parameters θ^L may encode additional aspects of publication policies and explanation. We will focus on the logical link between the structures, where the posterior given the literature and possibly the clinical data is:

    P(G | D_N, D^L_{N'}, ξ) ∝ P(G | ξ) P(D_N | G) P(D^L_{N'} | T(G)).    (1)

This shows the equal status of the literature and the clinical data. In integrated learning from heterogeneous sources, however, the scaling of the sources is advisable to express our confidence in them.
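A small numerical sketch of the factorization above (all probabilities are invented for illustration): with a few candidate domain structures G and literature structures G^L, the combined posterior P(G | D_N, D^L_{N'}) is proportional to P(G) P(D_N | G) Σ_{G^L} P(D^L_{N'} | G^L) P(G^L | G).

    structures = ["G1", "G2"]
    lit_structures = ["GL1", "GL2"]

    prior = {"G1": 0.5, "G2": 0.5}                   # P(G)
    lik_clinical = {"G1": 0.02, "G2": 0.01}          # P(D_N | G)
    lik_literature = {"GL1": 0.03, "GL2": 0.005}     # P(D^L_N' | G^L)
    link = {("GL1", "G1"): 0.9, ("GL2", "G1"): 0.1,  # P(G^L | G)
            ("GL1", "G2"): 0.2, ("GL2", "G2"): 0.8}

    def posterior():
        """P(G | D_N, D^L_N'), literature and clinical data combined, normalised."""
        unnorm = {}
        for g in structures:
            lit_term = sum(lik_literature[gl] * link[(gl, g)] for gl in lit_structures)
            unnorm[g] = prior[g] * lik_clinical[g] * lit_term
        z = sum(unnorm.values())
        return {g: v / z for g, v in unnorm.items()}

    print(posterior())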
3 Concepts, associations, causation

Frequently a biomedical domain can be characterized by a dominant type of uncertainty w.r.t. the causal mechanisms. Such types of uncertainty show a certain sequentiality described below, related to the development of biomedical knowledge, though a strictly sequential view is clearly an oversimplification.

(1) Conceptual phase: Uncertainty over the domain ontology, i.e. the relevant entities.
(2) Associative phase: Uncertainty over the association of entities, reported in the literature as indirect, associative hypotheses, frequently as clusters of entities. Though we accept the general view of causal relations behind associations, we assume that the exact causal functions and direct relations are unknown.
(3) Causal relevance phase: (Existential) uncertainty over causal relations (i.e. over mechanisms). Typically, direct causal relations are theoretized as processes and mechanisms.
(4) Causal effect phase: Uncertainty over the strength of the autonomous mechanisms embodying the causal relations.

In this paper we assume that the target domain is already in the Associative or Causal phase, i.e. that the entities are more or less agreed, but their causal relations are mostly in the discovery phase. This holds in many biomedical domains, particularly in those linking biological and clinical levels. There the Associative phase is a crucial but lengthy knowledge accumulation process, where a wide range of research methods is used to report associated pairs or clusters of the domain entities. These methods admittedly produce causally oriented associative relations which are partial, biased and noisy.

4 Literature mining

Literature mining methods can be classified into bottom-up (pairwise) and top-down (domain model based) methods. Bottom-up methods attempt to identify individual relationships and the integration is left to the domain expert. Linguistic approaches assume that the individual relations are sufficiently known, formulated and reported for automated detection methods. On the contrary, top-down methods concentrate on identifying consistent domain models by analyzing jointly the domain literature. They assume that mainly causally associated entities are reported, with or without tentative relations and direct structural knowledge. Their linguistic formulation is highly variable, not conforming to simple grammatical characterization. Consequently top-down methods typically use agrammatical text representations and minimal linguistic support. They autonomously prune the redundant, inconsistent, indirect relations by evaluating consistent domain models and can deliver results in domains already in the Associative phase.

Until recently mainly bottom-up methods have been analyzed in the literature: linguistic approaches extract explicitly stated relations, possibly with qualitative ratings (Proux et al., 2000; Hirschman et al., 2002); co-occurrence analysis quantifies the pairwise relations of variables by their relative frequency (Stapley and Benoit, 2000; Jenssen et al., 2001); kernel similarity analysis uses the textual descriptions or the occurrence patterns of variables in publications to quantify their relation (Shatkay et al., 2002); Swanson and Smalheiser (1997) discover relationships through the heuristic pattern analysis of citations and co-occurrences; in (Cooper, 1997) and (Mani and Cooper, 2000) local constraints were applied to cope with possible hidden confounders, to support the discovery of causal relations; the joint statistical analysis in (Krauthammer et al., 2002) fits a generative model to the temporal pattern of corroborations, refutations and citations of individual relations to identify true statements. The top-down method of the joint statistical analysis of de Campos (1998) learns a restricted BN thesaurus from the occurrence patterns of words in the literature. Our approach is closest to this and those of Krauthammer et al. and Mani.

The reconstruction of informative and faithful priors over domain mechanisms or models from research papers is further complicated by the multiple aspects of uncertainty about the existence, scope (conditions of validity), strength, causality (direction), robustness for perturbation and relevance of mechanisms, and by the incompleteness of reported relations, because they are assumed to be well-known parts of common sense knowledge or of the paradigmatic already reported knowledge of the community.



5 BN models of publications

Considering (biomedical) abstracts, we adopt the central role of causal understanding and explanation in scientific research and publication (Thagard, 1998). Furthermore, we assume that the contemporary (collective) uncertainty over mechanisms is an important factor influencing the publications. According to this causal stance, we accept the causal relevance interpretation, more specifically the explained (explanandum) and explanatory (explanans) status; in addition, we allow the described status. This is appealing, because in the assumed causal publications both the name occurrence and the preprocessing kernel similarity method (see Section 6) express the presence or relevance of the concept corresponding to the respective variable. This implicitly means that we assume that publications contain either descriptions of the domain concepts without considering their relations or the occurrences of entities participating in known or latent causal relations. We assume that there is only one causal mechanism for each parental set, so we will equate a given parental set and the mechanism based on it.

Furthermore, we assume that mainly positive statements are reported and we treat negation and refutation as noise, and that exclusive hypotheses are reported, i.e. we treat alternatives as one aggregated hypothesis. Additionally, we presume that the dominant type of publications is causally (forward) oriented. We attempt to model the transitive nature of causal explanation over mechanisms, e.g. that causal mechanisms with a common cause or with a common effect are surveyed in an article, or that subsequent causal mechanisms are tracked to demonstrate a causal chain. On the other hand, we also have to model the lack of transitivity, i.e. the incompleteness of causal explanations, e.g. that certain variables are assumed as explanatory, others as potentially explained, except for survey articles that describe an overall domain model. Finally, we assume that the reports of the causal mechanisms and the univariate descriptions are independent of each other.

5.1 The intransitive publication model

The first generative model is a two-layer BN. The upper-layer variables represent the pragmatic functions (described or explanandum) of the corresponding concepts, while lower-layer variables represent their observable occurrences (described, explanatory or explained). Upper-layer variables can be interpreted as the intentions of the authors or as the property of the given experimental technique. We assume that lower-layer variables are influenced only by the upper-layer ones denoting the corresponding mechanisms, and not by any other external quantities, e.g. by the number of the reported entities in the paper. A further assumption is that the belief in a compound mechanism is the product of the beliefs in the pairwise dependencies. Consequently we use noisy-OR canonic distributions for the children in the lower layer. In a noisy-OR local dependency (Pearl, 1988), the edges can be labeled with a parameter inhibiting the OR function, which can also be interpreted structurally as the probability of an implicative edge.

This model extends the atomistic, individual-mechanism oriented information extraction methods by supporting the joint learning of all the mechanisms, i.e. by the search for a domain-wide coherent model. However it still cannot model the dependencies between the reported associations, and the presence of hidden variables considerably increases the computational complexity of parameter and structure learning.
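To make the noisy-OR assumption concrete, the sketch below (not the authors' implementation) computes the probability that a lower-layer variable occurs given its upper-layer parents, with one inhibition parameter per edge; one minus the inhibition can be read structurally as the probability of the implicative edge.

    def noisy_or(parent_states, inhibition, leak=0.0):
        """P(X = 1 | parents) under a noisy-OR local dependency (Pearl, 1988).

        parent_states: dict parent -> 0/1 occurrence in the paper
        inhibition:    dict parent -> probability that this parent's influence
                       is inhibited (1 - inhibition is the 'edge' probability)
        leak:          optional probability that X occurs with no active parent
        """
        p_not_activated = 1.0 - leak
        for parent, state in parent_states.items():
            if state == 1:
                p_not_activated *= inhibition[parent]
        return 1.0 - p_not_activated

    # Hypothetical example: occurrence of a lower-layer entity given two
    # upper-layer pragmatic-function variables.
    print(noisy_or({"mech_A": 1, "mech_B": 0},
                   inhibition={"mech_A": 0.2, "mech_B": 0.4}))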
5.2 The transitive publication model

To devise a more advanced model, we relax the assumption of the independence between the variables in the upper layer representing the pragmatic functions, and we adapt the models to the bag-of-words representation of publications (see Section 6). Consequently we analyze the possible pragmatic functions corresponding to the domain variables, which could be represented by hidden variables. We assume here that the explanatory roles of a variable are not differentiated, and that if a variable is explained, then it can be explanatory for any other
variable. We assume also full observability of causal relevance, i.e. that the lack of occurrence of an entity in a paper means causal irrelevance w.r.t. the mechanisms and variables in the paper and not a neutral omission. These assumptions allow the merging of the explanatory, explained and described status with the observable reported status, i.e. we can represent them jointly with a single binary variable. Note that these assumptions remain tenable in case of reports of experiments, where the pattern of relevancies has a transitive-causal bias.

These would imply that we can model only full survey papers, but the general, unconstrained multinomial dependency model used in the transitive BNs provides enough freedom to avoid this. A possible semantics of the parameters of a binary, transitive literature BN can be derived from a causal stance that the presence of an entity X_i is influenced only by the presence of its potential explanatory entities, i.e. its parents. Consequently, P(X_i = 1 | Pa_{X_i} = pa_{X_i}) can be interpreted as the belief that the present parental variables can explain the entity X_i (Pa_{X_i} denotes the parents of X_i and Pa_{X_i} → X_i denotes the parental substructure). In that way the parameters of a complete network can represent the priors for parental sets compatible with the implied ordering:

    P(X_i = 1 | Pa_{X_i} = pa_{X_i}) = P(Pa_{X_i} = pa_{X_i})    (2)

where for notational simplicity pa(X_i) denotes both the parental set and a corresponding binary representation.

The multinomial model allows entity specific modifications at each node, combined into the parameters of the conditional probability model, that are independent of other variables (i.e. unstructured noise). This permits the modeling of the description of the entities (P(X_i^D)), the beginning of the transitive scheme of causal explanation (P(X_i^B)) and the reverse effect of interrupting the transitive scheme (P(X_i^I)). These auxiliary variables model simplistic interventions, i.e. the authors' intentions about publishing an observational model. Note that a backward model corresponding to an effect-to-cause or diagnostic interpretation and explanation method has a different structure with opposite edge directions.

In the Bayesian framework, there is a structural uncertainty also, i.e. uncertainty over the structure of the generative models (literature BNs) themselves. So to compute the probability of a parental set Pa_{X_i} = pa_{X_i} given a literature data set D^L_{N'}, we have to average over the structures using the posterior given the literature data:

    P(Pa_{X_i} = pa_{X_i} | D^L_{N'})    (3)
      = Σ_{G : (pa_{X_i} → X_i) ⊆ G} P(X_i = 1 | pa_{X_i}, G) P(G | D^L_{N'})
      ≈ Σ_G 1((pa_{X_i} → X_i) ⊆ G) P(G | D^L_{N'})    (4)

Consequently, the result of learning BNs from the literature can be multiple, e.g. using a maximum a posteriori (MAP) structure and the corresponding parameters, or the posterior over the structures (Eq. 3). In the first case, the parameters can be interpreted structurally and converted into a prior for a subsequent learning. In the latter case, we neglect the parametric information, focusing on the structural constraints, and transform the posterior over the literature network structures into a prior over the structures of the real-world BNs (see Eq. 1).

6 The literature data sets

For our research we used the same collection of abstracts as that described in (Antal et al., 2004), which was a preliminary work using pairwise methods. The collection contains 2256 abstracts about ovarian cancer, mostly between 1980 and 2002. Also a name, a list of synonyms and a text kernel is available for each domain variable. The presence of the name (and synonyms) of a variable in documents is denoted with a binary value. Another binary representation of the publications is based on the kernel documents:

    R^K_{ij} = 1 if 0.1 < sim(k_j, d_i), and 0 otherwise,    (5)

which expresses the relevance of kernel document k_j to document d_i using the term
frequency-inverse document frequency (TF-IDF) vector representation and the cosine similarity metric (Baeza-Yates and Ribeiro-Neto, 1999). We use the term literature data to denote both binary representations of the relevance of concepts in publications, usually denoted with D^L_{N'} (containing N' publications).
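Eq. 5 thresholds the cosine similarity of TF-IDF vectors at 0.1. The sketch below uses scikit-learn as one possible implementation of the TF-IDF weighting of (Baeza-Yates and Ribeiro-Neto, 1999); the kernel documents and abstracts shown are placeholders, not the actual collection.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder kernel documents (one per domain variable) and abstracts.
    kernels = {"CA125": "serum ca125 tumor marker level",
               "Ascites": "free fluid ascites in the abdomen"}
    abstracts = ["elevated ca125 level observed in patients with ovarian tumor",
                 "ultrasound showed ascites and a multilocular cyst"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(list(kernels.values()) + abstracts)
    k = len(kernels)
    sim = cosine_similarity(tfidf[:k], tfidf[k:])   # sim(k_j, d_i)

    # Eq. 5: binary relevance R^K_ij = 1 iff 0.1 < sim(k_j, d_i).
    R = (sim > 0.1).astype(int)
    for j, name in enumerate(kernels):
        print(name, R[j])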
7 Results

The structure learning of the transitive model is achieved by an exhaustive evaluation of parental sets up to 4 variables followed by the K2 greedy heuristics using the BDeu score (Heckerman et al., 1995) and an ordering of the variables from an expert, in order to be compatible with the learning of the intransitive model. The structure learning of the two-layer model has a higher computational cost, because the evaluation of a structure requires the optimization of parameters, which can be performed e.g. by gradient-descent algorithms. Because of the use of the forward explanation scheme, only those variables in the upper layer can be the parents of an external variable that succeed it in the causal order. Note that beside the optional parental edges for the external variables, we always force a deterministic edge from the corresponding non-external variable. During the parameter learning of a fixed network structure the non-zero inhibitory parameters of the lower layer variables are adjusted according to a gradient descent method to maximize the likelihood of the data (see (Russell et al., 1995)). After having found the best structure, according to its semantics, it is converted into a flat, real-world structure without hidden variables. This conversion involves the merging of the corresponding pairs of nodes of the two layers, and then reverting the edges (since in the explanatory interpretation effects precede causes).

We compared the trained models to the expert model using a quantitative score based on the comparison of the pairwise relations in the model, which are defined w.r.t. the causal interpretation as follows (Cooper and Yoo, 1999; Wu et al., 2001). Causal edge (E): an edge between the nodes. Causal path (P): a directed path linking the nodes. (Pure) Confounded (C): the two nodes have a common ancestor; the relation is pure if there is no edge or path between the nodes. Independent (I): none of the previous (i.e. there is no causal connection).

The difference between two model structures can be represented in a matrix containing the number of relations of a given type in the expert model and in the trained model (the type of the relation in the expert model is the row index and the type in the trained model is the column index). These matrices (i.e. the comparison of the transitive and the intransitive models to the expert's) are shown in Table 1.

Table 1: Causal comparison of the intransitive and the transitive domain models (columns with i and t in the subscript, respectively) to the expert model (rows).

         I_i   C_i   P_i   E_i   I_t   C_t   P_t   E_t
    I     12     0     0     0     0     4     2     6
    C    106    20     2     4     4    90    26    12
    P    756    72    80    18   188   460   216    62
    E     70     6     8    36     6    38    24    52

Scalar scores can be derived from this matrix to evaluate the goodness of the trained model; the standard choice is to sum the elements with different weights (Cooper and Yoo, 1999; Wu et al., 2001). One possibility, for example, is to take the sum of the diagonal elements as a measure of similarity. By this comparison the intransitive model achieves 148 points, while the transitive achieves 358, so the transitive model reconstructs the underlying structure more faithfully. Particularly important is the (E, E) element, according to which 52 of the 120 edges of the expert model remain in the transitive model; on the contrary, the intransitive model preserves only 36 edges. Similarly, the independent relations of the expert model are well respected by both models.

Another score, which penalizes only the incorrect identification of independence (i.e. those and only those weights have a value of 1 which belong to the elements (I, .) or (., I), the others are 0), gives a score of 210 and 932 for the transitive model and the intransitive model, respectively.

This demonstrates that the intransitive model is extremely conservative in comparison with both the other learning method and with the knowledge of the expert: it is only capable of detecting the most important edges; note that the proportion of its false positive predictions regarding the edges is only 38%, while in the transitive model it is 61%.

Figure 1: The expert-provided (dotted), the MAP transitive (dashed) and the intransitive (solid) BNs compatible with the expert's total ordering of the thirty-five variables using the literature data set (PMRCOREL3), the K2 noninformative parameter priors, and noninformative structure priors.
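The scalar scores above are weighted sums over the relation-type matrix of Table 1; the sketch below recomputes the diagonal-sum similarity (the 148 and 358 points reported above) from those counts. Only the simple diagonal weighting is shown; (Cooper and Yoo, 1999; Wu et al., 2001) discuss more general weight choices.

    types = ["I", "C", "P", "E"]   # independent, confounded, path, edge

    # Rows: relation type in the expert model; columns: type in the trained model.
    intransitive = [[12, 0, 0, 0], [106, 20, 2, 4], [756, 72, 80, 18], [70, 6, 8, 36]]
    transitive   = [[0, 4, 2, 6],  [4, 90, 26, 12], [188, 460, 216, 62], [6, 38, 24, 52]]

    def weighted_score(matrix, weights):
        return sum(weights[r][c] * matrix[r][c]
                   for r in range(len(types)) for c in range(len(types)))

    # Diagonal weights: count the relations whose type is identified correctly.
    diag = [[1 if r == c else 0 for c in range(4)] for r in range(4)]

    print(weighted_score(intransitive, diag))   # 148
    print(weighted_score(transitive, diag))     # 358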


Furthermore, we investigated the Bayesian learning of BN features, particularly using the temporal sequence of the literature data sets. An important feature indicating relevance between two variables is the so-called Markov Blanket Membership (Friedman and Koller, 2000). We have examined the temporal characteristics of the posterior of this relation between a target variable Pathology and the other ones using the approximation in Eq. 4. This feature is a good representative for the diagnostic importance of variables according to the community. We have found four types of variables: the posterior of the relevance increasing in time fast or slowly, decreasing slowly, or fluctuating. Figure 2 shows examples for variables with a slow rise in time.

Figure 2: The probability of the relation Markov Blanket Membership between Pathology and the variables with a slow rise. (Variables shown: Septum, PI, ColScore, TAMX, PapSmooth, RI, FamHistBrCa, Fluid, Volume, WallRegularity, Bilateral and PSV; horizontal axis: 1980-2005.)

8 Conclusion

In the paper we proposed generative BN models of scientific publication to support the construction of real-world models from free-text literature. The advantage of this approach is its
domain model based foundation, hence it is capable of constructing coherent models by autonomously pruning redundant or inconsistent relations. The preliminary results support this expectation. In the future we plan to use the evaluation methodology applied there, including rank based performance metrics, and to investigate the issue of negation and refutation, particularly through time.

References

P. Antal, G. Fannes, Y. Moreau, D. Timmerman, and B. De Moor. 2004. Using literature and data to learn Bayesian networks as clinical models of ovarian tumors. Artificial Intelligence in Medicine, 30:257-281. Special issue on Bayesian Models in Medicine.

R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press, New York.

G. F. Cooper and C. Yoo. 1999. Causal discovery from a mixture of experimental and observational data. In Proc. of the 15th Conf. on Uncertainty in Artificial Intelligence (UAI-1999), pages 116-125. Morgan Kaufmann.

G. Cooper. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2:203-224.

L. M. de Campos, J. M. Fernandez, and J. F. Huete. 1998. Query expansion in information retrieval systems using a Bayesian network-based thesaurus. In Gregory Cooper and Serafin Moral, editors, Proc. of the 14th Conf. on Uncertainty in Artificial Intelligence (UAI-1998), pages 53-60. Morgan Kaufmann.

N. Friedman and D. Koller. 2000. Being Bayesian about network structure. In Craig Boutilier and Moises Goldszmidt, editors, Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence (UAI-2000), pages 201-211. Morgan Kaufmann.

D. Geiger and D. Heckerman. 1996. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74.

D. Heckerman, D. Geiger, and D. Chickering. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

L. Hirschman, J. C. Park, J. Tsujii, L. Wong, and C. H. Wu. 2002. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18:1553-1561.

T. K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig. 2001. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28:21-28.

M. Krauthammer, P. Kra, I. Iossifov, S. M. Gomez, G. Hripcsak, V. Hatzivassiloglou, C. Friedman, and A. Rzhetstky. 2002. Of truth and pathways: chasing bits of information through myriads of articles. Bioinformatics, 18:249-257.

S. Mani and G. F. Cooper. 2000. Causal discovery from medical textual data. In AMIA Annual Symposium.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA.

D. Proux, F. Rechenmann, and L. Julliard. 2000. A pragmatic information extraction strategy for gathering data on genetic interactions. In Proc. of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB2000), La Jolla, California, pages 279-285.

A. Rosenberg. 2000. Philosophy of Science: A Contemporary Introduction. Routledge.

S. J. Russell, J. Binder, D. Koller, and K. Kanazawa. 1995. Local learning in probabilistic networks with hidden variables. In IJCAI, pages 1146-1152.

H. Shatkay, S. Edwards, and M. Boguski. 2002. Information retrieval meets gene analysis. IEEE Intelligent Systems, 17(2):45-53.

B. Stapley and G. Benoit. 2000. Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. In Proc. of Pacific Symposium on Biocomputing (PSB00), volume 5, pages 529-540.

D. R. Swanson and N. R. Smalheiser. 1997. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence, 91:183-203.

P. Thagard. 1998. Explaining disease: Correlations, causes, and mechanisms. Minds and Machines, 8:61-78.

X. Wu, P. Lucas, S. Kerr, and R. Dijkhuizen. 2001. Learning Bayesian-network topologies in realistic medical domains. In Medical Data Analysis: Second International Symposium, ISMDA, pages 302-308. Springer-Verlag, Berlin.

[The opening pages of the next contribution, by Alessandro Antonucci and Marco Zaffalon (IDSIA, Manno (Lugano), Switzerland) on locally and non-separately specified credal networks, follow here; their text is not recoverable from this copy.]

Locally specified credal networks 27


G l /1G v/O#Y 7l51 4 <.#),09-G(L(6 #)v3/+ + <LQ(*)+ 62 -A6);U/<*),+;8#66 #)
H . 6 1<L++),:B6 <*37.h)12 3xP<L01)-/.-G(*6);2|.Q-A6 <*G)J-/.10 v3/<.76^<L+h%#);6 1);2^3G2.13G6^-G(L(6 #),+).#);6h3G2 #+%1-VG)
G);6!),#81<*F-G(*)V.7601);X.1<*6 <*37.=3G2 Z S%+h^<*6 =2 ),+ ),:B663 6 #)+ -/)K''&gv-G+2 ),#81<*2 ),09K%);Xv.1<*6 <*37. b O3
K);X.1<*6 <*37. b gPh%1<L:<L+<.1+ Y<*2 ),0496 #)%3G2|4-G(L<L+ + 13Vh6 1<L+h)=.1);),06 #)3/(L(*3Vh^<.1U1
3G 9 , TGD;/ x dC1-/.#U);6!-G(TLgQ aGa 9fOP<L-%6 #) 2 x M 4 3 O 6 5Q N:O-M P8 R 7 ^A; *,|A Q ,
ZZ >62 -/.Q+3G2-A6 <*37.d Z -/.#34);6^-G(TLgO aGa #f  " UW$  d9   
f  d X  f ;Z : 7;/= <?> 7 $
^F
HJILKMN N:OB M R ' (*3#:;-G(L(*+ v),:;<*XQ),0:B2 ),0Y-G(.#);6 ] ;,,TG JA @#+^xGT7B 
;G  : ;G /B @
h3G2 3xG);2+
<L+D-62 <LQ(*);6(UW$  d   
f  d  YX f Z  C]* 
D   A @#^BGLG |1;GTG#D^F; $ EGF I H
+ 8Q:|6 1-A6;dT<f=$$<L+-K''&3xG);2 0    
A
|GQ ,5: A @//GA @##G;Q D C A @
dT<L< f  <L+^-4:B3/(L(*),:B6 <*37.3G!+ );6 +8 , .10 ,/.10gQ37.#)3G2 AA @1J @P LG;@ D CLKMF N H,_^FA @#YV7|G|
),-G:`#)J*  -/.Q0 +?4)J* <=)dT<L<L<f X <L+-+);6 #GYG  D CPO
3G:B37.10Q<*6 <*37.Q-G(O-G+ +8Q.1:B6 <*37.1+V6d ) D ? ) f g37.#)3G2
),-G:+#)2*+
-/.Q0+?4)*A,/> 
)<.76)V.1063+|#3Vh6 1-A6K);X.Q<*6 <*37.+ v),:;<*XQ),+
- Z S 3xG);26 #)A-A2 <L-GQ(*),+<. 
6 #).#3#01),+:B3G2 ]
2 ),+ 37.Q0Q<.#U63]
-A2 )6 1);2 );3G2 ):;-G(L(*),0 pP|;G
-/.10Dh^<L(L(v)501),Q<L:B6),094:;<*2 :;(*),+;gQh1<L(*)56 #3/+):B3G2 ] <*U78#2 )A!^#)K''& $
2 );6 812|.#),07O2 -/.1+3G2|- ]
2 ),+ 37.Q0Q<.#U63A  -A2 )4+ -G<L0 7 VB/Y,7B -/.10 6 <*37.U/<*G)V.-4(*3#:;-G(L(*+ v),:;<*XQ),0 Z Sh#3/+)=K^''&
h^<L(L(%v)01),Q<L:B6),0$7+ #81-A2 ),+,eR);6D81+-G+ +3#:;<L-A6) <L+6 1-A6<. <*U78#2 ) b d3G2-G(L+ 3 <*U78#2 )%3G2 <*U78#2 )\#f
),-G:01),:;<L+ <*37..#3#01)+#) *   h^<*6 -+ 3 ] :;-G(L(*),0
7 , TGqp#Y,0TG  . 0! ,/>-0  ,/. 0R2 );6 8#2|.Q<.#U-/. <*U78#2 )Ag2 ),v3G2 6 +-/.)B1-/Q(*)=3GO2 -/.1+3G2|- ]
),(*)V)V./63G , .10 3G2^),-G:A?)2*9,/>B0 Z -G(L( G G 6 <*37. A@^1)K''& $
2 );6 812|.#),0J9O2 -/.1+3G2|- ]
-/.-A2 2|-,3G01),:;<L+ <*37.8Q.1:B6 <*37.1+,g37.#)3G24),-G: 6 <*37.<L+^:B37.1+ <L0Q);2 ),096 #)3/(L(*3xh^<.#U1
#);*  %)D0Q)V.#3G6)-G+#, 6 #)D+);63G-G(L(6 #) 2RQ I O I 5 PT R S @#G GYAOB/ 
;*/ ^A
v3/+ + <LY(*)+62 -A6);U/<*),+; UW$ YX  ;Z : O O A @#A ! p#YVG 6 Fd
f qp1J @DA @#G
-G:|+ 62 -A6);UG * , 01);6);2|<.1),+- S 3xG);2
 #<L-K%);Xv.1<*6 <*37.1'$:B37.10Q<*6 <*37.1-G(Y4-G+ +!8Q.1:B6 <*37. 6 
d:a
f1 U 6 Vd:a   a
f  d f
64d#)ED ?4)f3G2),-G:|8Q.Q:B);2 6 -G<..#3P0Q)##)/* 
-/.10 V  6W*X 
? ) *A, >-0 <L+-G(*2 ),-G01+ v),:;<*XQ),07 X !O301);6);2<.#)
- Sh)Q-,G)6 #)V. 63+ <Y(*2 ),12 ),+)V.7601),:;< ] ;G /B@ a
* , e " : ;/,/|ZYVBAA@1  GQF
+ <*37.Q+^8Q.1:B6 <*37.1+9@-G+ +^T8Y.1:B6 <*37.1+;3G25),-G:|01) ] qp#Y,TG " UW$
YX
 Z _^F, 
: @#; $

:;<L+ <*37..#3#01)8#)*J -/.10]?)1*9,/>-0g#h):B37.Q+ <L01);2 A @#? <[> 7 7,G G $ ; S /#;G/G]\ O
6 #):B37.10Q<*6 <*37.1-G(14-G+ +8Q.1:B6 <*37.1+16 xd#)[D ?4)`f-G+ + <*U7. ] 2 37 h%Q<L:|gY:B37.Q+ <L01);2 <.#U46 #)= S+ UW$

 3G2
<.#U$-G(L(6 #)4-G+ +@63 6 #)A-G(8#)  .10 d?4)`f gh%#);2 ) ),-G:+62 -A6);UG *A[, Yg1<*6<L+v3/+ + <LY(*)63:B37Y.QX :;(Z 8101)G
 .10<L+@6 #)01),:;<L+ <*37.?8Q.1:B6 <*37.?:B3G2 2 ),+ v37.10Y<.#U63
^1)TS 3/16 -G<.#),0?<.6 1<L+h\-,h^<L(L(v)01) ] ^ O 0O _`_ Y4 a PR > L,|A*41 ,  " F <TB
.#3G6),0-G+ UW$ YX  Z g=h%1<L(*)J3G26 #):B3G2 2 ),+ v37.10Q<.#U bv|1;|*9TB " ^F; 
F;
QGT
3/<./6-G+ +4T8Y.1:B6 <*37.g\h):;(*),-A2 (* Q-,G)Gg3G24),-G: GA @#c <?> 7 $
;Wp# , S G1;G|GT: Gd \O
a9d:a   a
f5*A,Ve5 H 6<L+h3G2 6 763.#3G6)6 Q-A6-/./ Z S01);Xv.#),0-G+
<.K);X.Q<*6 <*37. b :;-/.v)2 );3G2|81(L-A6),0-G+<.K);X.1< ]
6 Fd:a   a
f b 6 xd:3  D ? T!f b 64d:3-) D ?)f
 6 <*37.D1g194+ <Y(*-G0Q0Q<.#U-+ <.#U/(*)01),:;<L+ <*37..#3#01)Gg
.Ce .10 Ce#" h%Q<L:|<L+5Q-A2 )V./6%3G-G(L(R6 #)3G6 #);2.#3#01),+=dT+);) <*U ]
d#f 8#2 ) b f
%#)@.#)B#64+6),$<L+46 1)V.3/9P<*3781+,@h)h\-/./6D63 %#)@:B37.10Y<*6 <*37.1-G(-G+ +4T8Q.Q:B6 <*37.1+:B3G2 2 ),+ v37.10 ]
01);Xv.#)=- Z S 9),-/.1+%3G6 #)+);653G S+01);6);2 ] <.#U63@0Q<Ev);2 )V./6A-G(8#),+3G6 #)01),:;<L+ <*37.J.#3#01)-A2 )
$&%'(' 0 `|G| xF ;x x *)#*xx -G+ +|8Q),0=63)6 #3/+)+ ),:;<*XY),096 #):B374Q-A6 <LQ(*)
R, + 3A /F 8x. -0/1Ox@ x -; S+;%1<L+),-/.1+6 Q-A6;g/<* e01)V.#3G6),+O6 #)0Q),:;<L+ <*37.

28 A. Antonucci and M. Zaffalon


> 3G2 )UG)V.#);2|-G(L(*Gg:B37.1+ 62 -G<./6 +3G26 1)+ v),:;<*XY:;- ]

6 <*37.1+43G:B37.10Y<*6 <*37.1-G(%4-G+ +T8Y.1:B6 <*37.1+2 ),(L-A6 <*G)@63
0Q<E);2 )V.76O.#3#01),+-A2 )+ <4<L(L-A2 (*2 ),12 ),+)V.76),09501),:;< ]
+ <*37..13P01),+!h1<L:|-A2 )6 1)Q-A2 )V.76 +3G6 1),+)^.#3#01),+;
<.1-G(L(*Gg#63(*3P:;-G(L(*4+ ),:;<*Gg#-G+2 ),#81<*2 ),07K); ]
<.1<*6 <*37.$1g5-+),Q-A2 -A6),(*$+ v),:;<*XQ),0 Z Sg<*6h3781(L0
*< U78#2 ) b OeR3P:;-G(9+ v),:;<*XY:;-A6 <*37.=3GY-%.#37. ] +),Q-A2 -A6),(* + 8GF:B)632 );3G2|81(L-A6)6 #)J+),Q-A2|-A6),(*?+ v),:;<*XQ),0
+ v),:;<*XQ),0 Z S 3VG);26 #)K''& <. <*U7812 )A (^) ] Z S -G+-/.)B#6)V.1+ <*G) Z S h%#3/+ )6 -GY(*),+-A2 )3/ ]
)Vv);26 1-A6:;<*2 :;(*),+01)V.#3G6)8Q.1:B);2 6 -G<. .#3#01),+;g 6 -G<.#),0:B37.Q+ <L01);2 <.#U-G(L(\6 1):B37=Y<.1-A6 <*37.1+3G6 #)
h%1<L(*)6 1)+ #81-A2 )<L+^81+ ),03G2^6 #)01),:;<L+ <*37.@.#3#01)G G);2 6 <L:B),+43G%6 #)+ ),Q-A2 -A6),(*+ v),:;<*XQ),0:B37.10Y<*6 <*37.1-G(
:B2 ),0Q-G(+);6 +=3G6 #)D+ -/)A-A2 <L-GQ(*)G);6;g6 1<L+-G ]
12 3/-G:D+ 8PE);2 +\-/.43/9#<*3781+)B#v37.#)V.76 <L-G(v)B#Q(*3/+ <*37.
.#3#01)GgA6 #)\+6 -A6),+3G e <.Q01)B6 #)\:B37Q-A6 <LY(*) S+;g 3G!6 #).P8Q);2\3G!6 -GQ(*),+%<.6 #)<.1Y816+ <*W;)G
/- .10 64d#) D ?)  f  6C7d#)ED ?4)f gh%#);2 ) 6C7d#)ED ?4)f '3G2 ))BEv),:B6 <*G)12 3P:B),0Y812 ):B37.1+ <L+6 +<.-G0Y0Q<.#U
-A2 )6 #):B37.10Q<*6 <*37.Q-G(-G+ +%T8Y.1:B6 <*37.1+5+ v),:;<*XQ),07 -01),:;<L+ <*37..#3#01)<.v);6h);)V.),-G: .13P01)-/.Q0<*6 +
6 #) ] 6 J:B374Q-A6 <LQ(*)4 S3G2),-G: #);* 
-/.10 Q-A2 )V.76 +;g2 );U/-A2 0Q<.#U6 #).#3P0Q),+3G@6 #)3G2|<*U/<.1-G(
?4)5* ,/>B0-/.Q0 * , %^1<L+^3G2|81(L-A6 <*37.gYh%1<L: 3#01),(!-G+8Q.1:B);2 6 -G<..#3#01),+d <*U78#2 )#f O3:B37 ]
<L+R-/.)B#-/4Q(*)3G16 #) Z^Z > 62|-/.1+3G2|-A6 <*37.d Z -/.#3 Q(*);6)6 #)(*3#:;-G(+ v),:;<*XY:;-A6 <*37.12 3P:B);),0-G+3/(L(*3Vh%+;
);6-G(TLg aGa #f g<L+37.1(*D+);)V<.#U/(*D(*3P:;-G(TgQv),:;-/81+ )3G 3G2),-G:|D8Q.1:B);2 6 -G<..#3P0Q)V#)g76 #)^+6 -A6),+ )2*A,  0
6 #)-A2 :;+:B37.Q.#),:B6 <.#U6 #)01),:;<L+ <*37..13P01)h^<*6 -G(L( 3G56 #):B3G2 2 ),+ 37.Q0Q<.#U01),:;<L+ <*37. .#3P0Q) e#)5-A2 )-G+ ]
6 #)8Q.1:B);2 6 -G<..#3#01),+; H .6 #)42 )V-G<.1<.#U@Q-A2 63G + 8Q),063<.Q01)B 6 #)G);2 6 <L:B),+3G A 6 1):B37.10Q< ]
6 1<L+\+),:B6 <*37.g1h)+ #3xh13Vh0Q<E);2 )V.76 Z S+\+ v),:;<*X ] 6 <*37.1-G(O:B2 ),0Q-G(+);6 + d#)ED ?)fT* % gRh^<*6 +?4)2*A,/> 
:;-A6 <*37.1+\:;-/.Dv)2 );3G2|81(L-A6),0-G+2 ),#81<*2 ),07K); ] H .6 1<L+4h-VGg3G2),-G:|8Q.1:B);2 6 -G<. .#3#01)+ ) g\<*6D<L+
<.1<*6 <*37.1 v3/+ + <LQ(*)63+);6@6 1):B37.10Y<*6 <*37.1-G(-G+ +T8Y.1:B6 <*37.
3G2\)B1-/Q(*)Gg#h):;-/.2 ),Q2 ),+)V.76)B#6)V.1+ <*G) Z S+ 64d#)[D )f63v)6 #)G);2 6)B3G46 #):B37.10Y<*6 <*37.1-G(
9@<.762 3P081:;<.#U-01),:;<L+ <*37.Q-A2 )V.763G2),-G:J.#3#01) :B2 ),0Q-G(%+);6 d#)ED ?4)f-G+ +3#:;<L-A6),0 63 )g3G2),-G:
3G!6 #)3G2 <*U/<.1-G( Z Sd <*U7812 )=9f )g* ,  0 ();U/-A2 0Q<.1U01),:;<L+ <*37..#3#01),+;g3G2),-G:
01),:;<L+ <*37.D.#3#01[) e ) -/.1046 #)^2 ),(L-A6),04A-G(8#)=0 ? ) 3G6 #)
Q-A2 )V.76 +;gh)@+ <Q(*+);66 #)@+ 81Q+);6 , 0 ,  0
63Jv)+ 81:6 1-A6 i 64d#) D )f u C 0 6W 0 0 -A2 )6 1)G);2 ]
6 <L:B),+3G d ) D ? ) f ^1<L+-GQ12 3/-G:|gh%Q<L:| <L+
:;(*),-A2 (*(L<.#),-A2\<.6 1)^<.1Y8#6+ <*W;)Gg96 -AG),+\<.1+ Y<*2 -A6 <*37.
2 37 v7 7; 2 ),12 ),+ )V./6 -A6 <*37.1+\d Z -/.#3-/.10
<*U78#2 )P!e3#:;-G(1+ v),:;<*XY:;-A6 <*37.3G-/.)B#6)V.1+ <*G) Z S >3G2 -G(Tg bAG/b f
3VG);26 #)K''& <. <*U78#2 )A
^1):B37.10Q<*6 <*37.1-G(-G+ +T8Q.Q:B6 <*37.1+3G!6 #)=8Y.1:B);2 ]
6 -G<..#3P0Q),+:B3G2 2 ),+ v37.10Y<.#U63@0Q<E);2 )V.76A-G(8#),+3G
6 #)2 ),(L-A6),00Q),:;<L+ <*37..#3#01),+-A2 )-G+ + 8Q),063v)
6 #3/+)+ v),:;<*XQ),096 #)0Q<E);2 )V.766 -GQ(*),+^<.46 1)%)B ]
6)V.1+ <*G)\+ v),:;<*XY:;-A6 <*37.=3GQ6 #)\8Q.1:B);2 6 -G<..#3#01)G!^1<L+
),-/.Q+@6 Q-A6;g<*]#)<. -/.8Y.1:B);2 6 -G<..#3#01)-/.10
e ) 6 #):B3G2 2 ),+ v37.10Y<.#UD0Q),:;<L+ <*37.J.#3#01)Gg6 #)+ 6 -A6),+
) *Q,  0<.101)B6 #)6 -GQ(*),+=6 C 0 d#)[D <=)f3G6 #))B ]
6)V.1+ <*G)+ ),:;<*X:;-A6 <*37.3G2 #)g-/.10Q64d#)[D )  ?4)`f<L+
6 #)3/<.76\4-G+ +8Q.1:B6 <*37.]6 C 0 d#)[D ?4)`f-G+ +3#:;<L-A6),0D63 *< U78#2 )1 e 3#:;-G(+ v),:;<*XY:;-A6 <*37. G3 D-+),Q-A2 -A6),(*
6 #) ) ] 6 6 -GQ(*)3G6 #))B#6)V.1+ <*G)+ v),:;<*XY:;-A6 <*37. + v),:;<*XQ),0 Z S?x3 G);2%6 #)K^''& <. *< U78#2 )4A

Locally specified credal networks 29


QN 8Q44-A2 <*W,<.#U1g#h):;-/.3/16 -G<.@<.)!F:;<*)V./66 <) 6 #)<. 8#)V.1:B),+!3G26 1)+#.1);2 U/<*),+v);6h);)V.6 #)A-A2 < ]
-(*3#:;-G(P+ v),:;<*XY:;-A6 <*37.3G1- Z SgF7+ <4Q(*%:B37.1+ <L01);2 ] -GQ(*),+, H h)2 );U/-A2 0$#81-G(L<*6 -A6 <*G)J.#);6 +-G+:B2 ),0Q-G(
<.#U37.#)53G6 #)562 -/.Q+3G2-A6 <*37.1+01),+ :B2 <Lv),0<.6 1<L+ .#);6 +,gh)+ );)6 1-A6.#3G6-G(L(6`Pv),+3G2 ),(L-A6 <*37.1+=:;-/.
+),:B6 <*37. H 6<L+\h3G2 6 .#3G6 <.#U6 1-A6(*3#:;-G(+ v),:;<*XY:;- ] v)52 ),12 ),+ )V./6),09+),Q-A2 -A6)+ ),:;<*X:;-A6 <*37.1+3G6 #)
6 <*37.1+3G Z S+Q-G+ ),0437.6 #)%K''&+\2 ),+ v),:B6 <*G),(*4<. :B37.10Y<*6 <*37.1-G(#:B2 ),0Q-G(P+);6 +;^1<L+<L+;gA3G2<.1+ 6 -/.1:B)GgG6 #)
<*U78#2 )1g <*U78#2 )-/.Q0 <*U78#2 ) b :B3G2 2 ),+ v37.1063 :;-G+)\3GOdTv3/+ <*6 <*G)Af  p1AG:^F^ TpQ;| gGh1<L:|2 ) ]
Z S+RY-G+),037.6 #)\K''&$<. <*U7812 )%h%1<L:-A2 )2 ) ] #81<*2 ),+;gP3G26h33P3/(*),-/.F-A2|<L-GQ(*),+ -/.10 Dg#6 1-A6
+ v),:B6 <*G),(*+),Y-A2 -A6),(*+ v),:;<*XQ),0gF)B#6)V.1+ <*G),(*+ v),: ]
<*XQ),0-/.10.#37. ] +),Q-A2 -A6),(*dT-/.10.#37. ] )B#6)V.1+ <*G),(*f 64"d !7D #Vf $64"d !7&D %'#Vf d_ f
+ v),:;<*XQ),0 <L:B)@G);2 + -#g<*64<L+),-G+ 63:|#),: 6 1-A6
O2 -/.1+3G2|4-A6 <*37.4\:;-/.v)2 );U/-A2 01),0-G+!6 #)<.7G);2 +) ^1)#81-G(L<*6 -A6 <*G)=<. 8#)V.1:B)5v);6h);)V(. ?-/.1(0  :;-/.
3G6 #),+)6 12 );)62 -/.1+ 3G2|4-A6 <*37.1+; 6 #);2 );3G2 )v)D3P01),(*),07J2 ),P8Q<*2 <.#U96)d #D #Vf-/.10
64)d g&D %'#Vf634v),(*37.#U63:B2 ),0Q-G(+);6 +;gYh%1<L:|:;-/.Q.#3G6
 Q4 xO O q C 5  v -lW v)+ ),Q-A2 -A6),(*+ v),:;<*XQ),0),:;-/8Q+)3G6 #)\:B37.1+62 -G<.76
 /Y1 TY #Y l5 Q_ <.#81-A6 <*37.d _ f '.@)B#6)V.1+ <*G)+ v),:;<*XY:;-A6 <*37.3G2
eR);6\81+<L(L(81+62|-A6)594-);h )B1-/Q(*),+6 1-A66 #).#) ]  + #3781(L06 #);2 );3G2 )v):B37.1+ <L0Q);2 ),0633#01),(O6 #)
:B),+ + <*63G.#37. ] +),Y-A2 -A6),(*+ v),:;<*XQ),0:B2 ),0Q-G(.#);6 + v3/+ <*6 <*G)<. 8#)V.1:B)3G  d Z 3GWV-/.);6%-G(TLg bAG #f
-A2 <L+ ),+^.1-A6 8#2 -G(L(*<.-F-A2 <*);6`3GQ2 3/Q(*)V+, *,+ N # _:I M . -vV0 / Q 3 O ^21@ ()V)V);2
^ OBM I  11 N 
I M 3 I IGM  I _:I ^#) |G#;,| 6 1-A6K''&5+2 ),12 ),+)V.76<.101),v)V.101)V.1:;<*),+v);6h);)V.
^FG ^A;,;;| pP Z H (f<L+@- .#);h 2|81(*)3G2
d A-A2 <L-GQ(*),+-G:;:B3G2|0Q<.#U636 #)>@-A2 G3V :B37.Q0Q<*6 <*37.
810Q-A6 <.1Uv),(L<*);T+h^<*6 <.1:B37Y(*);6)3/Y+);2 A-A6 <*37.1+ K5<E);2 )V./6@K''&5+0Q),+ :B2 <LQ<.#U6 1)+ -/)<.101),v)V. ]
dC-FE-G(*37.g bAG/ f Z H (3P01),(L+6 #)5:;-G+)53G4<P),0 01)V.Q:;<*),+D-A2 )+ -G<L0$63v)  p# ^FA;Q 3d );2|--/.10
1.#3Vh%(*),01UG) -Gv378#66 #)<.1:B374Q(*);6)V.#),+ +12 3#:B),+ +;g '),-A2 (Tg aGaA f ^98Q+;g- S :;-/.v)2 );3G2|81(L-A6),0
h%Q<L:|<L+O-G+ + 8Q),0563^v).#),-A2 (*8Q.#1.#3Vh.53G2O+37) 81+ <.#U-/.D),#81<*A-G(*)V./6K^''&^1)+ -/)513/(L0Q+\h^<*6
A-A2 <L-GQ(*),+-/.108Q.1+),(*),:B6 <*G)@3G246 #)3G6 #);2 +,^1<L+ Z S+;g%h%#)V. dT-G+<Q(L<L:;<*6 (*$0137.#)<. 6 1<L+Q-Gv);2;f
(*),-G0Q+63-/.<12 ),:;<L+)Q2 3/Q-GQ<L(L<*6`@3#01),(Tgvh%#);2 ) BG $Y9`1,Y7;| 2 ),Q(L-G:B),+@+ 6 -/.10Q-A2 012 3/Q- ]
-G(L(6 1)v3/+ + <LQ(*):B374Q(*);6 <*37.1+3G56 #)<.1:B374Q(*);6) Q<L(L<L+ 6 <L:<.101),v)V.10Q)V.1:B)<.6 1)>@-A2 G3V:B37.Q0Q<*6 <*37.
3/Q+ );2 F-A6 <*37.Q+3G=6 1)XQ2 +66#v)3GF-A2 <L-GY(*),+-A2 ) d>3G2 -G(-/.10 Z -/.#31g bAG/b f
:B37.1+ <L01);2 ),0 H .-2 ),:B)V./6h3G2 g810Q-A6 <.#U@),(L<*);T+ Z 37.1+ <L01);2,gR3G2)B1-/Y(*)G4g  5 -/.16 0 879Dg
h^<*6 Z H (1-G+v);)V.:B37.1+ <L01);2 ),0 3G2 S%+dT'5. ] h%Q<L:|-A2 ):;(*),-A2 (*),#81<*F-G(*)V.76K''&5+,; :.#)12 3/ ]
637.P81:;:;<-/.10C-FE-G(*37.g bAG/_ f ^#)12 3/Q(*)V <L+ (*)V h%<*6 +),Q-A2 -A6),(*+ ),:;<*XY),0 Z S%+<L+6 1-A66 #);
12 3VG),063Jv)),P8Q<*F-G(*)V.7663-62 -G0Q<*6 <*37.Q-G(v),(L<*); -A2 ).#3G64:;(*3/+),08Q.Q01);26 1<L+4P<.103Gd),P81<*A-G(*)V.76Bf
810Q-A6 <.1U12 3/Q(*)V<.- Z S=6 #)),#81<*F-G(*)V.Q:B)@<L+ +6281:B6 8#2 )$:|1-/.#UG),+, <*h)01);Xv.#)-+),Q-A2 -A6),(*
4-G01)3/+ + <LQ(*)9.137. ] +),Q-A2 -A6),(*+ v),:;<*XQ),0:B2 ),0Q-G( + v),:;<*XQ),0 Z S 3G<2  5DgO-/.Q06 #)V.2 );G);2 + )6 #)
.#);6 +, -A2 :Ag6 #)2 ),+|81(*6 <.#U.#);6h^<L(L(4.#3G6v)+),Q-A2 -A6),(*
+ v),:;<*XQ),04<.UG)V.#);2 -G(T Z 37.1+ <L01);26 1)3/(L(*3xh^<.#U+), ]
 _WN V1 N  I M I   O    8Q-G(L<*6 -A6 <*G) 12 3/Q- ] -A2 -A6)+ v),:;<*XY:;-A6 <*37.1+R3GP6 1):B37.10Q<*6 <*37.1-G(P:B2 ),0Q-G(7+);6 +
Q<L(L<L+ 6 <L:.#);6`h3G2 P+d),(L(4-/.gD aGaA f-A2 )-/.-G ] 3G2^- Z S3VG);=2  >D
+62|-G:B6 <*37.3GTS+;g/h#);2 )6 #)12 3/Q-GQ<L(L<L+6 <L:-G+ + ),+ + ]
)V./6 +D-A2 )2 ),Q(L-G:B),09#81-G(L<*6 -A6 <*G)2 ),(L-A6 <*37.1+3G )d +D !1f  ? a ! A @ )d +&D %'!Qf B  ? c ! b @
 R@ x=F  *5 G||;5 2G 8 A )d f  Z\[ i ?  ! C @  ? ! @ u 
= F R`|A B% B` 6 3 2G  B *
|4 DF|T 2G 3V  3x |A|Ox 2 h%1);2 ) 6h3 ] 0Q<)V.1+ <*37.1-G(J#3G2 <*W;37.76 -G(-A2 2|-,#+-A2 )
 | ||4 %L 3A F| 2F| x / / = 81+ ),0630Q)V.#3G6)-G+ +8Q.1:B6 <*37.1+\3G2v393/(*),-/.DA-A2 < ]
|x *|!  - 1PA v`xx ` 6 34 vF B 
%x  F  ;x 8v ':,F 8 3F * 8F -GQ(*),+,N181:-+ ),Q-A2 -A6),(*+ v),:;<*XQ),0 Z SQ-G+6h3
Fx;v! 3xY \FG|@ 2x 6 O0 3 2G :B374Q-A6 <LQ(*) S+;g37.#)3G2),-G:|G);2 6)BJ3G )d f
 RT * F x|F xF,| 6GA 
x; 5 Q\` F|9F ; 2 37 6 1)3/<.764-G+ +T8Q.Q:B6 <*37.1+\:B3G2 2 ),+ v37.10Q<.1U63

30 A. Antonucci and M. Zaffalon


6 #),+)TS+;g+ -, 6 d)  f%-/.10 6 d)  4f gh)3/ ] 8Q.1:B6 <*37.3/16 -G<.#),0<.6 1<L+%h-VGgh%1<L:|@01);X.1)6 #)
6 -G<.6 1):B37.10Q<*6 <*37.1-G(-G+ +\8Q.1:B6 <*37.1+3G26 1):B3G2 ] + -/):B37.10Q<*6 <*37.1-G(R4-G+ +8Q.1:B6 <*37.1+3G2,
2 ),+ v37.10Y<.#UTS+3xG);2 >= 6 )d f 6 )d f2B  ?  bG @
6 )d fB  ? c  ! ,C @ 6 )d fB  ? cG ! @ 6 d)+D&%'!Qf1 6 d)JD&% !1f1!B?  A@
6 )d #D #Vf B  ?  b ! bGc @ 6 )d #D #Vf B ? Gb ! L @ 6
d ]&D %'#Vf1 6
d ]&D %'#Vf1B  ?*  @
6 )d #&D % #,f2B ? ! _ @ 6 )d #&D % #,f2B  ? G ! _  @
-/.100Y<Ev);2 )V.76^:B37.10Y<*6 <*37.1-G(R-G+ +T8Y.1:B6 <*37.1+\3G2,
':;:B3G2 0Q<.#U$63$K);X.Q<*6 <*37. b g=6 #),+)6h30Y<L+6 <.1:B6 6 )d +D !1f ?*  @ 6 )d +D !QfB
 ? _  ! GC @
+ v),:;<*XY:;-A6 <*37.1+01);X.1)- Z S 3VG);2   gh%1<L: 6 d
 D #VfB? _  ! GC@ 6 d
]D #VfB? ! @
:;-/.Q.#3G6!v)+),Q-A2|-A6),(*=+ v),:;<*XQ),0=-G+<.=K);X.1<*6 <*37.P
O3$+);)6 1<L+;g.#3G6)3G2)B#-/4Q(*)J6 Q-A6@6 #)+ v),: ] )Q-,G)6 1);2 );3G2 )3/16 -G<.#),06h3D S+3VG);,2  
<*XY:;-A6 <*37. 64"d !7D #,f  Gb gQ6"d !B&D %'#Vf  G -/.10   gh%1<L::;-/.@v)2 );U/-A2 0Q),0-G+6 #)=:B374Q-A6 ]
64"d #Vf
 cG g%h%1<L:<L+:;(*),-A2 (*$-v3/+ + <LY(*)@+ v),:;< ] <LQ(*)TS+D3G- Z SN181:|- Z S <L+:;(*),-A2 (*$.#37. ]
XY:;-A6 <*37.<*6 1):B37.10Y<*6 <*37.1-G(:B2 ),0Q-G(v+);6 +\h);2 )+ ),Q- ] +),Q-A2|-A6),(*+ v),:;<*XQ),0g1v),:;-/81+)6 #)6h3TS++ v),: ]
2 -A6),(*4+ ),:;<*XY),0g9h3781(L0D(*),-G0D63=6 #)8Q.1-G:;:B),Q6 -GQ(*) <*=0Q<E);2 )V.76:B37.10Q<*6 <*37.1-G(Y4-G+ +T8Y.1:B6 <*37.1+!3G23G2 )
-G+ +\T8Q.Q:B6 <*37.J64)d fB  ? a ! A @ * )d f 6 1-/.-F-A2 <L-GY(*)G
H 6<L+ 8Q+);T8Q(63 3/Q+);2 G)?6 1-A6 UG)V.#);2 -G(Tg.#37. ]
+),Q-A2|-A6),(*4+ v),:;<*XQ),0g Z S%+013.13G6+ 8PE);23G26 1),+) /  lG 7l lWG C 5  v -lW
12 3/Y(*)V+8Q+6v),:;-/81+)6 #); -A2 ):;(*3/+),0 8Y.101);2 /1 5#Y l5 Q_
),#81<*F-G(*)V.76:1-/.#UG),+<.6 #)^+6281:B6 8#2 )3Gv6 #)^K''&
TI Y MNM - 3 O 5 NM O 5 / _:I I  1V &<*G)V. H .6 1<L+5+),:B6 <*37.gvh)12 3xG)=6 1-A6-/.7(*3#:;-G(L(*@+ v),: ]
6 #2 );)v393/(*),-/. 2 -/.Q0137 F-A2 <L-GY(*),(+ g  -/.1 0 g <*XQ),0 Z S3VG);2g
:;-/.),P8Q<*F-G(*)V.76 (*v)2 );U/-A2|01),0
(*);656 #)K''&     )B#12 ),+ +5<.101),v)V.101)V. ] -G+-+),Q-A2 -A6),(*+ v),:;<*XQ),0 Z S 3xG);2T^#)62|-/.1+ ]
:;<*),+);6`h);)V.46 #)VO)h-/.7663(*),-A2|.46 #)%3#01),( 3G2|-A6 <*37.<L+6),:Q.1<L:;-G(L(*+62 -G<*U7763G2 h-A2|0<*6<L+
12 3/Y-GQ<L(L<*6 <*),+3G2+|81:|-K''& 2 37 6 1)<.Q:B37 ] Q-G+),037.2 ),12 ),+)V.76 <.#U0Q),:;<L+ <*37..#3P0Q),+^98Y.1:B);2 ]
Q(*);6)0Q-A6 -+ );6\<.-GQ(*)AgQ-G+ + 8Y<.#U.#3<.#3G2|- ] 6 -G<.4.13P01),+h^<*6 A-G:,8#3781+:B37.10Q<*6 <*37.1-G(v:B2 ),0Q-G(Q+ );6 +;g
6 <*37.-Gv378#656 #)12 3P:B),+ +4-AP<.1UD6 1)3/Y+);2 A-A6 <*37. -G+3G2|4-G(L<*W;),0v),(*3Vh
3G <L+ + <.1U4<.@6 #)(L-G+ 62 ),:B3G2 03G6 #)0Q-A6 -+);6; 2 V M 3 O 6 51 N:O-M  8 R 7 ^A; *,|A1 ,
^#)3/+6^:B37.Q+);2 A-A6 <*G)-GQQ2 3/-G:|@<L+6 #);2 );3G2 )=63  " UW$  d9   
f  d  YX f Z _^A; 
: 7;G,`v
(*),-A2|.6h30Q<L+6 <.1:B6 S%+2 37 6 #)6`h3:B374Q(*);6) GGB*J1 ,  " UW$ &%8Z _^F,  :  @#,A @1
0Q-A6 -+);6 +4:B3G2 2 ),+ v37.10Q<.1U63J6 #)3/+ + <LQ(*)A-G(8#),+ /YGGAvV /AQ,; % GOBG /B @ #)2*+
3G6 1)4<L+ + <.#UD3/Q+);2 A-A6 <*37.-/.10:B37.1+ <L0Q);25<.Q01);),0 GY ?)1*9,/>-0 E
6 #) Z S4-G01)53G6 1),+):B37Q-A6 <LY(*) S%+;  64d#)ED ?4)f <*4#)2*J

' Z d#) D ?)f    d 7f


! #  W/4 0 0 d#)`f<*4#)2*J   
%'! % # 
! # %  @#, 6d#)ED ?)f ^A@#5F|R p#YV/41 V \ 0
!   X GY W/ 0 0 d#)f A@# A-G:,8#3781+ , 7AQ,;9BG , . 0 O
-GQ(*)@A'0Q-A6 -+);6=-Gv378#66 12 );)v3P3/(*),-/.A-A2 < ] <*U78#2 ) 2 ),v3G2 6 +=-/.)B1-/Y(*)3GR2|-/.1+3G2|- ]
-GQ(*),+; g =01)V.#3G6),+%-<L+ + <.1U3/Q+);2 A-A6 <*37. 6 <*37. b gh%1<L:|@<L+5:;(*),-A2 (*(L<.#),-A2<.@6 #)<.1Y8#6+ <*W;)G
O3-AG)6 1<.#U/+\+ <Y(*)^h)5:B374Y8#6)%6 #)12 3/ ] ^1)4dT+62 37.#U#f2 ),(L-A6 <*37.v);6h);)V.-(*3#:;-G(L(*+ v),: ]
-GQ<L(L<*6 <*),+\3G26 1)3/<./6\+6 -A6),+94),-/.1+3G6 #)%2 ),( ] <*XQ),0 Z S UW$  d   
f  d   X f Z 3VG);2]
-/.106 #)
-A6 <*G)2 ),P81)V.1:;<*),+<.6 #):B37Q(*);6)0Q-A6 -+);6 +;eR);6 +),Q-A2|-A6),(*+ v),:;<*XQ),0 Z S UW$ &% Z 2 );6 8#2.#),07
6 )d     f^-/.Q0(6 )d     f%v)6 #)53/<./6=4-G+ + O2 -/.1+3G2|-A6 <*37. b <L+378#6 (L<.#),07D6 #)3/(L(*3Vh^<.1U1

Locally specified credal networks 31


Z H (810Q-A6 <.#U@Q2 3/Q(*)V0Q),+ :B2 <Lv),0<.NP),:B6 <*37. _ g
h%Q<L:|<L+-.#3G6 -GQ(*)-/.102 ),-G01 ] 63 ] 81+)2 ),+|81(*6;
 kRl AR
 Rl! n2
)1-,G)01);X.#),0 -.#);hUG2 -GY1<L:;-G((L-/.#U781-AUG)63
3G2|81(L-A6)^-/.76#v)^3G:B2 ),0Q-G(v.#);6h3G2 vg13G6 4+), ]
-A2 -A6),(*-/.10.#37. ] +),Q-A2 -A6),(*J+ ),:;<*XY),0)1-VG)
-G(L+3@+ #3Vh),06 Q-A6-/./.#);62 ),12 ),+)V.76),0h%<*6 6 #)
.#);h(L-/.#U781-AUG)5:;-/.Dv)^),-G+ <L(*62 -/.1+3G2|),04<./63-/.
),#81<*A-G(*)V./6+),Q-A2 -A6),(*+ ),:;<*XY),0:B2 ),0Q-G(.#);6;\^1<L+
<*U78#2 ) ^1)K^''& -G+ +3#:;<L-A6),0 636 1)+),Q- ] <4Q(L<*),+;gx<.Q-A2 6 <L:,81(L-A2VgF6 Q-A6.#37. ] + ),Q-A2 -A6),(*+ v),:;< ]
2 -A6),(*+ v),:;<*XQ),0 Z S 2 );6 8#2|.#),0 7O2 -/.1+3G2|- ] XQ),0.#);6 +1-VG)%-/.),P81<*A-G(*)V.76+),Y-A2 -A6),(*+ v),:;<*XQ),0
6 <*37. b g2 37 6 1)(*3#:;-G(L(*+ v),:;<*XQ),0 Z SQ-G+ ),037. 2 ),Q2 ),+)V.76 -A6 <*37.g%3G2h%1<L:+3/(8#6 <*37.1+-G(*UG3G2 <*6 Y+
6 #) K''& <. <*U11 ^#):B37.Q0Q<*6 <*37.1-G(D:B2 ),0Q-G( -A2 )-VF-G<L(L-GY(*)<.6 1)(L<*6);2 -A6 8#2 )G
+);6 +43G6 #)h1<*6)@.#3#01),+@dT:B3G2 2 ),+ v37.10Q<.#U636 #) %#)62 -/.1+3G2|4-A6 <*37.12 3/v3/+),04-G(L+3+|#3Vh^+6 1-A6
3G2 <*U/<.Q-G(#8Y.1:B);2 6 -G<..#3#01),+Bf-A2 )12 ),:;<L+),(*+ v),:;<*XQ),0g -+ 8Qv:;(L-G+ +3GD+ ),Q-A2 -A6),(* + v),:;<*XQ),0:B2 ),0Y-G(4.#);6 ]
h%Q<L(*)6 #)UG2 );.#3#01),+dT<T)GLg.#);h8Q.Q:B);2 6 -G<..#3#01),+ h3G2 #+:;-/.)D81+),063+3/(*G)<.#);2 )V.1:B)12 3/Q(*)V+
:B3G2 2 ),+ v37.10Q<.#U\636 #)3G2|);201),:;<L+ <*37.=.#3P0Q),+BfQ2 ), ] 3G2-A2 Y<*62 -A2 + v),:;<*XQ),0 :B2 ),0Q-G(%.1);6 +;6 1<L+<L+46 #)
2 ),+ )V./6A-A2 <L-GQ(*),+@h%#3/+ )J:B37.10Q<*6 <*37.1-G(=:B2 ),0Q-G(=+);6 + :;(L-G+ +3G.#);6 +5<.h%1<L:6 #):B2 ),0Q-G(+);6 +-A2 )),<*6 #);2
-A2 )A-G:,8#3781+, A-G:,8#3781+3G212 ),:;<L+)G H 6<L+h3G2 6 .13G6 <.#U=6 Q-A6-=2 ) ]
:B)V.7601);G),(*3/)V.763G6 #)^-GQQ2 3,1<-A6)^e b -G(*UG3 ]
2RQ I O I 5 7R ;  d
f |A@# GGYA 2 <*6 Y dT'.7637.98Q:;:;<);6^-G(TLg bAG/_ f+ );)V+\634v)Q-A2 ]
;G 
A@# B/ [o/;1TG  df 6 <L:,81(L-A2|(*+ 81<*6),081+ 63G2\+ 81:4-=:;(L-G+ +;g#-/.10D+ #3781(L0
UW$ &%8Z GY d
f A@#G Eo7;#B/ 6 #);2 );3G2 )v):B37.1+ <L01);2 ),0<.8#6 8#2 )h3G2 v
UW$  d  
f  d  YX f Z O S @#, E <.1-G(L(*Gg6 #)+62 37.#U$:B37.Q.#),:B6 <*37. v);6h);)V.6 #)
(L-/.#U78Q-AUG)3G2%:B2 ),0Q-G(R.#);6`h3G2 P+%<./62 3#0Y81:B),0<.6 1<L+
d
f1
  d
f dc f Q-Gv);24-/.Q06 #)3G2|4-G(L<L+ 3G01),:;<L+ <*37.$.#);6h3G2 #+
dT<.1:;(8Q0Q<.#U5<. 8#)V.1:B)\0Q<L-AUG2 -/4+Bf g/+);)V4+O63v)\Q-A2 ]
2 37^#);3G2 )V b g7<*6<L++62 -G<*U7763G2 h-A2|063:B37. ] 6 <L:,81(L-A2|(*h3G2 6 )B1Q(*3G2 <.#U3G2:B2 3/+ + ] );2 6 <L(L<*W,-A6 <*37.
:;(810Q)6 #)3/(L(*3Vh^<.#U1 v);6h);)V.6 1)6h34XQ),(L0Q+,
^ O 0O _`_ Y4 a  R >5Q;,;;Y=v7B; G$*G  n!m l Y    YO
|A*1 V \  " G|D  p#:^xA;QT*;A ^F
A @#;#GGB*1 V \ $ " ;Wp# ; S G# ^Q<L+2 ),+),-A2 :h\-G+Y-A2 6 <L-G(L(*+ 81Qv3G2 6),096 #)
;G/G O NPh%<L+ +/S5N UG2|-/./6 bAGG/bAF] /aGbGaG
A
eR);6D81++62 ),+ +46 1-A6R2 -/.Q+3G2-A6 <*37. b <L+4G);2
+ <4Q(*)Gg-/.10<*6<L++ 8#2 Q2 <L+ <.#U6 1-A6<*6<L+12 ),+)V.76),0
 /!! 
#);2 )3G2D6 1)@XQ2|+646 <)Gg^-G+<*6D<L+D2 ),-G(L(* 6 #)@G); ^1)3/7#<*3781+D12 3P3G+43G Z 3G2 3/(L(L-A2 -/.10 Z 3G2 3/( ]
63I+ ),Q-A2 -A6),M6 1):B2 ),0Q-G(+);6 +3G.137. ] +),Q-A2 -A6),(* (L-A2 b -A2 )37<*66),0
+ v),:;<*XQ),0 .#);6 +;@<.-G:B6;gU/<*G)V. -.137. ] +),Q-A2 -A6),(* , S @1 G; \ O\e);6^8Q++6 -A2 66 #)4-A2 U/<.1-G( ]
+ v),:;<*XQ),0 Z S=gV37.#)\:;-/.(*3#:;-G(L(*=+ ),:;<*56 #) Z SgA81+ ] <*W,-A6 <*37.<.#81-A6 <*37. d f42 37 -0Q),:;<L+ <*37.?.#3#01)
<.#U6 #)=Q2 ),+ :B2 <L16 <*37.Q+^3G6 #)+),:B37.10Q-A2 6%3GNP),: ]   *Q  %':;:B3G2 0Q<.#UD63DP8Q-A6 <*37.d#f g3G2%),-G:
6 <*37. g-/.10-GQQ(*O2 -/.1+3G2|4-A6 <*37. b 633/16 -G<. a(*A,Ve5
-+),Q-A2|-A6),(*+ v),:;<*XQ),0 Z S='%:;:B3G2 0Y<.#U63 Z 3G2 3/( ]
(L-A2 b g16 #)V.g#-/./D<.#);2 )V.1:B)12 3/Q(*)V 37.6 #)53G2 <*U ] 
<.1-G( Z S:;-/.D),#81<*A-G(*)V./6 (*D)%2 ),12 ),+ )V./6),0D37.6 1<L+ U 6Vd:a!f U b 6Vd:3YD ? T f! b 6d:3)ED ?4)f
.#);h+ ),Q-A2 -A6),(*+ v),:;<*XQ),0 Z S=GR3-AG)-/.)B1-/ ]  
V LW*X  
V 6W*X .   e  .10` e "
Q(*)Gg\6 1<L+D12 3#:B),0Y8#2 ):;-/.<),0Q<L-A6),(*+3/(*G)6 #) da f

32 A. Antonucci and M. Zaffalon


^ P81+;g^3V#<.#U378163G6 1)+ 8Y6 #):B37.10Y<*6 <*37.1-G( , |1F/ \ O)Q2 3VG)6 1)J12 3/v3/+ < ]
12 3/Y-GQ<L(L<*6 <*),+h%1<L:013.#3G62 ););2636 #)+ 6 -A6),+ 6 <*37.9- _p1,T/7qp#_p# g-G+ + 8Y<.#U6 1-A6
3G5  dh%1<L:|-A2 )Q2 <*) 01)V.13G6),07Df g#81- ] -A6(*),-G+6-G);2 6)B 64 d@f3G  df<L+.13G63/Q6 -G<.#),0
6 <*37.Jd a f),:B37),+; 9 -(*3#:;-G(^:B37Q<.Q-A6 <*37.3G5G);2 6 <L:B),+3G6 1):B37. ]
0Q<*6 <*37.1-G(4:B2 ),0Q-G(+);6 +<. ^1<L+),-/.Q+6 1-A6;g

6xd:3  D ?  f b 64d:3_D 3   ?  Ff  
3G2),-G:| a*,/e5g 64 d:a!f% T-G:B63G2 <*W;),+-G+<.#81- ]
U
.   /  6 <*37.Jd,9f gvY8#6-A6^(*),-G+ 6%-:B37.10Y<*6 <*37.1-G(12 3/Y-GQ<L(L<*6
  LW /  <.6 1<L+Q2 3P081:B6:B37),+52 37 -:B37.10Q<*6 <*37.1-G(4-G+ +
d f 8Q.1:B6 <*37. h%1<L:<L+.#3G6-G);2 6)B 3G=6 #)@2 ),(L-A6 <*G)
h%#);2 )
.  01)V.#3G6),+6 #)4:1<L(L012 )V.3G/  -/.10g3G2 :B37.10Q<*6 <*37.Q-G(:B2 ),0Q-G(+);6; ^1<L+:B37.10Q<*6 <*37.Q-G(54-G+ +
),-G:|  *
.  g <  D-A2 )6 #)Q-A2 )V.76 +3G=  01) ] 8Q.1:B6 <*37.g+ -, 64d  D ?f g:;-/.v))B112 ),+ + ),0-G+4-
12 <*G),03G8  %#);2 );3G2 )Gg:B37.Q+ <L01);2 <.#U6 Q-A66 #) :B37.7G)B:B37Q<.1-A6 <*37.3GG);2 6 <L:B),+3G d  D ?  f g7<T)GLg
-G+ +8Q.1:B6 <*37.6  d D ? f-G+ + <*U7.1+\-G(L(v6 #)-G+ +63 64d  D ?f#   6  d  D ? f gh%<*6    
6 #)A-G(8#)  .  d?  f *  ,/ .  gRh#);2 )  .  <L+6 1)401) ] -/.10gY3G2%),-G:|g   $ -/.10+6  d  D ?`f\<L+^-4G);2 ]
:;<L+ <*37.T8Y.1:B6 <*37.-G+ +3P:;<L-A6),0$63 g#81-A6 <*37.d f 6)B3G d  D ?f ^P81+;g13G2),-G:|Aa(*A,/e5g
2 );h2|<*6),+-G+
b 64d:3 _D  .  d?  f  ?  Ff dGFf 6 d:a!f2 U    6  d:3 D ? f b 46 d:3 D ? f  dB#f
.  /  ) c  
H 6<L+6 #);2 );3G2 )+ 8GF4:;<*)V.7663+ );6@<
  < 
<   g h%1<L::;-/.$v)),-G+ <L(*2 );3G28Q(L-A6),0 -G+-J:B37.7G)B
-/.10 :B37Q<.1-A6 <*37.%981+,g 64 d@f<L+\-:B37./G)B:B37Q<.1- ]
6 <*37.@3G),(*)V)V.76 +3G6 #)+ 62 37.#U4)B#6)V.1+ <*37.  d@f
6 
d  D ? 
f  64d D  .  d?  f  ?  xf  d b f ^1<L+RP<*3/(L-A6),+!6 1)-G+ + 8Y16 <*37.56 Q-A6 64 d@f<L+O-^G);2 ]
6)B3G  d@f
632 );U/-A2 0#81-A6 <*37.dGFf-G+\6 #)\3/<.76^-G+ +\8Q.1: ]
6 <*37.3G-4 S 3VG);2; i   u Q-G+ ),037.6 #)=K''& , S @# G;  O\'%:;:B3G2|0Q<.#U 63 ^#) ]
2 );6 8#2.#),09R2|-/.1+3G2|-A6 <*37.4:B37.1+ <L01);2 ),0=3G26 #) 3G2 )V Ag 6 #) + 62 37.#U )B#6)V.1+ <*37. d
f 3G
+ <.#U/(*)01),:;<L+ <*37..#3#01)  *+ ^^#)6 #),+ <L+6 1);2 ) ] UW$ d   
f d  f :;-/.v)2 );U/-A2 01),0 -G+$6 #)
3G2 )3/(L(*3Vh^+2 37 -4+ <Q(*)<*6);2 -A6 <*37.@3xG);25-G(L(R6 #) -A 2 U/<.1-G (3G2= 
E3GX Z
  *A9  %
d@f1 Z\[ i 6 Fdf u  6W!  d f
^1)%3/(L(*3Vh^<.1Uh),(L( ] 1.#3Vh.-/.10d2 ),(L-A6 <*G),(*Yf\<. ]
6 81<*6 <*G)12 3/v3/+ <*6 <*37.J<L+2 ),#81<*2 ),063@3/16 -G<.^#);3 ] h%#);2 )3G2),-G:| * , g-6 xd@f<L+%6 #)^3/<.7654-G+ +
2 )V b g-/.10h^<L(L(v)12 3xG),01);2 )v),:;-/81+)3G6 #) 8Q.1:B6 <*37.-G+ +3#:;<L-A6),063UW$ YX  Z 3G2),-G: D CQ*
+);)V4<.#U/(*(L-G: @3G<*6 +3G2|4-G(R12 393G<.@6 #)=(L<*6);2|- ]  %g6 #):B37.10Q<*6 <*37.Q-G(!4-G+ +8Q.1:B6 <*37.Q+=6 xdD CLD ? CFf g

6 8#2 )G + v),:;<*XQ),03G2),-G:| ? C%* ,/> "<. X ,g-G+ + <*U7.-G" (L(
 O / O N N:OBM PcR S @#V^F,|T|B i 6   d@f u  c \A @1 6 #)4-G+ +" 63 6 #)J+ <.#U/(*)+6 -A6)  . " d? CFf * , .#" g
G [o/;1TG  d@f ;#GGB*1 V h%#);2 )  #. " -A2 )$6 #)$01),:;<L+ <*37. 8Q.1:B6 <*37.1+-G+ +3#:;< ]
4 " UW$ & % Z G  GQF qpP,TG#\9;G -A6),0 63?6 #)+62 -A6);UG * ,  g-/.10 2 ),12 ),+)V.76
;DA@#4 G;YGTG@^A;=A@#;`1G/B* 6 #);2 );3G2 )-G);" 2 6)B3G6 #)JA-G:,8#3781+:B37.10Y<*6 <*37.1-G(
Q , |GYG/YA=V /A5;, : O O : ;G /B @ :B2 ),0Q-G(=+);6 W /$" dD CAf * % + v),:;<*XQ),0?<.?#81- ]
a(*A,/e E " 6 <*37.d 7f ^P81+;g 6  d@f<L+-G);2 6)B3G  d@f^v) ]
6   d:a!f 6   d:3-) D ?)`f 
b d,9f :;-/81+)3G '2 3/v3/+ <*6 <*37.AJ'+ A-A2 <*),+<. , g-G(L(
)dc 6 #)G);2 6 <L:B),+3G  d@f-A2 )3/Q6 -G<.#),04-/.106 #);2 );3G2 )
;G /J @  !!!h :  @#; : ;G /B @ fA  d@f# d@f g2 37 h%1<L:6 #)6 #),+ <L+3/(L(*3Vh%+
!!!hEi GY ?4)/* ,/>B0 : 6   d#)[D ?4)`f 9^F,|[o@ -A2 U/<.1-G(L<*W,<.#U1
d#)[D ?4)`f* % O

Locally specified credal networks 33


QYQG bb)+*Dm

"w $$
D)+U 
 
, )+?$ 6h~
 
   

!"
"#$!%& '(*)+, ) &U 
 <*'
)h?$;  -  - *96/ * $)/.0
; v6+2KY`Q M+Q[^(aR_
- )/.0)+)21354
)7689:
 +;<)7$3$)+ 6=
>2 $?@(8A Y\N+aSajQTN+R@MON+p$
@[ Oy
7 "
$
*$BC?$;  - *)+DEGFHJI0K<L7MON NOPQSRUTVWL8X>Y[ZQSK PCQSR,Y\NK]R^(_ (s*

"
"$u  ( *68<9 +
*v~k)+; )+ +)~o<}$),*'
)
Y`Q[L(R^(abMOL(RXN+K<N+R@MONcL(R2dL8X]YfegN+Y[Z L7P5VhQSR:I3K<Li ^iQSajQkY`l +;<)7$ +*6<6<)+;7FHgI0K<L7MON NOPQSRUTV=L/XY[Z NdsN MOLR@PR_
^R@Pmd@Y\^Y`Q9V]Y`Q[MVOnom?$?@)
; Y\N+K]R@^Y`Q[L(R^(a,d,l(vL(V]QSvJL(R|],K<NOM+Q9V+NI0K<LUi<^Ui]QSaQkY[Q[NV
^R@PsZvNQSK0
@ajQ[MO^Y`Q[L(RVOp,?B)6b(v
$Uuv})+;7
U<
   qpr(s*
pst uF/v)pgwyx=oz
{+A
Dm
h
"
"
#|10
;<*{<*
*B
;<<}$Dm6f~k;?$?;<5 A ,s(s*
""vz
,68);<'(( '); $*)6~k
;3?$; )v9 ] ')
*Dm( ) $?$<*$BE ; )$
$)O6+FHI0K<L7MONON PQSRUTVL/X *v~k);<) ).b }* D=?)+<)$(O$FHEI0K LMONONOP(QSRUTVL/X
Y[Z NhY[ZQSK<PE0vK<L]vNO^RdYq^KY[QSR Th=N]V+NO^(K<MZ N+Kmd@l(_ Y[Z N0rLvKYkZ]R@YqN+K]R@^
Y[Q[LR@^(a d@l($L5V]QSvL(R],K<NOM+Q9VN
$L5V]QSv=nrm
?$?,)7;7 I3K<Li ^iQSajQkY`Q[N]Vc^(RP>Z N+QSK3
@ajQ[MO^Y`Q[L(R$V]p?,B
)76"#5
,7$vuvF/n
z0
$=uE
;Oqo""$r6<B?$;  -  - */4f<; )+)76
  D?$ v<)Dm; B
*96m.b<}*D?$; ) 96<)g?$;  -  - *A }$B,psr`ps,g )7

 D=? v A
 )76+R,Y\v3
@K<L+ NO^(VLRQSR Tp,
85O*] #  
<}$)
; 4~v)7 968*
)/.0
; 6r]R@Yq$U3
@K<L+
NO^5V+L(R,QSRUT
ps
q] 5@5
z0
$pft$z0$,p,uE; 
`

z
 '
)+
6<)O6
~?$;< -  - * )76?$; 
?,B<*
 - 46<*D $9( )m$A
)**$B<; )+)~ +9&U $)6FHI3K<LMONONOP(QSR T(VL/X
YkZ]R@YqN+K]R@^
Y[Q[LR@^(aL(RXN+K<N+R@MONL(RWI0K<L7M NVOV]QSRUT^(RP
eg^R@^+TUN+N+R,YL8X!R@MONKY\^(QSR@Y[lQSRgRL(3aNOP+TUN+_`^(VNOP
d@l5VYqN+V]pv?
B
)76 
w0x=vz
{+Dm
pUzv)z0D?@6pUtuF/v)p
|t$ z
w)+; ;<);O=$|bv O}0"",; 
?@6< s
E;<)*A
 1054)6<*
)/.0
; 6
6 68v 9( ).b<}*D=?;<)+A
+*6<)
!&U * ( ')?$;  -  - *96/ * 
6 68)76<6<D)+U 6FH
I3K<LMONONOP(QSR T(V0L8XY[Z N0 Y[ZfRR^a$L(R7XNK<N+R@MONLRoR_
MON+KYq^QSR,Y`lQSR|KY[Q 0MQ[^aU]R@YqN+aSajQTUNRMONp ?
B
)637"@


bF3; )6 6+
w0x=z
{D
:""
"$z; )$
)/.0
; 6KY[Q 0MQ[^a
R,Y\NaSaQTN+R@MON+p7
"$*
(v

$
w0yx=z
{+Dm

"
"$oxf;O?$}$9 +
 D=vv)*6~k;D?$; ) +*6<)
?;< -  - ** )76+]R,Y\$0,K<LUN ^(V+L(RQSR Tp`(A
O*# @7,
tzrw)+; ;<);OE$cbv O}c,w0x=z
{D
h
"
"Uv
FH$~k)+; )+ +).b<}6<)+?,;O(<)4E6<?,)7 )!6<)O6f~?$;  - A
- * )76 +;<)7$$)+/.;<v6FH:I0K<L7M NONOP(QSR T(VL8X=Y[Z N

Y[Z>LRXN+K<NRMONhLRR@MON+KYq^QSR,Y`lcQSRKY`Q M+Q[^(aR_
Y\N+aSajQTN+R@MON+p$?
B
)63U
"(U  E
; B
 $~kD
$
F]r)+' `c7
"$:sZvNm0R,Y\NK\@K]Q9V+NL8XRL(3aNOP+TUN+hEF/n
;<)76<6pvv
r
u$E; 
s,z0""vuU<; 
B= +
v<*
,$A
$)+?@)+v) )~k; ; )$
68)+ 60RR^(ayVL/Xe!^
YkZvNm^Y`_
Q[MV^(RPKY`Q M+Q[^(a,R,Y\NaSajQTUN+R@MONp@/A\ ]
5 $

n$);<Dm,Et$)
;<q3
"$%& '(*)+, )h6<4U$A
 }$)6<96c~| +
6 Dvv)+96FHI0K LMONONOP(QSRUTV2L/XY[Z N
d@QUYkZR,R^(afL(R7XNK<N+R@MONL(RR@MON+KYq^QSR,Y`lQSR2K]_
Y`Q M+Q[^(a@]R,Y\N+aSajQTN+R@MON+p?
B
)763
5v"$ %968)' );
s*)4
E$dYq^
Y[Q9VY`Q[M ^aoNO^5V+LRQSRUTgQkY[ZE],K N_
M+Q9V+NI0K Li ^iQSajQkY`Q[N]V]3z},?$Dmh
qp$).W
; 

34 A. Antonucci and M. Zaffalon
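Because the pages above are not recoverable, the following is only a generic, minimal illustration of the credal-network idea named in the paper's title, not the authors' locally specified construction: a Bayesian network whose conditional mass functions are constrained to lie in finitely generated credal sets, with lower and upper probabilities obtained by enumerating every combination of the vertices of those sets. The network, the variable names, and all numbers below are invented for the example.

```python
from itertools import product

# A two-node credal network X -> Y over binary variables, separately specified:
# each conditional mass function is only known to lie in a credal set, given
# here by the list of its extreme points (vertices). All numbers are invented.
P_X = [
    {"x0": 0.2, "x1": 0.8},
    {"x0": 0.3, "x1": 0.7},
]
P_Y_given_X = {
    "x0": [{"y0": 0.1, "y1": 0.9}, {"y0": 0.4, "y1": 0.6}],
    "x1": [{"y0": 0.5, "y1": 0.5}, {"y0": 0.7, "y1": 0.3}],
}

def joint(px, py_x):
    """Joint mass function of one Bayesian network compatible with the credal net."""
    return {(x, y): px[x] * py_x[x][y] for x in px for y in py_x[x]}

# Lower/upper probability of the event "Y = y1": since the bounds of a linear
# functional over a convex set are attained at its vertices, it suffices to
# enumerate every combination of vertices of the local credal sets.
event = {("x0", "y1"), ("x1", "y1")}
values = []
for px, py_x0, py_x1 in product(P_X, P_Y_given_X["x0"], P_Y_given_X["x1"]):
    j = joint(px, {"x0": py_x0, "x1": py_x1})
    values.append(sum(p for xy, p in j.items() if xy in event))

print("lower P(Y=y1) =", min(values))
print("upper P(Y=y1) =", max(values))
```

This brute-force enumeration is exponential in the number of local credal sets, which is why inference algorithms tailored to separately specified credal networks are of interest; the sketch is only meant to fix ideas.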
A Bayesian Network Framework for the Construction of Virtual Agents with Human-like Behaviour

Olav Bangsø, Nicolaj Søndberg-Madsen, and Finn Verner Jensen
Department of Computer Science, Aalborg University, Denmark

[Pages 35-39: the abstract and body of this paper could not be recovered from the source encoding; only the title block above and the running page headers survived. The legible fragments indicate that these pages describe object-oriented Bayesian network models that drive the activities of virtual game agents, illustrated by a farming example with spare-time, farming, and home activities, and that candidate activities are compared through expected-utility-style scores.]


w B RbZ+P s+YAdRiRrJ wa\FivBH\FIrIrBH] IK\%\^J J2G<OdwyDK\FOQJyBSBHP^w2G+OdIKD\^Odw2] OQ\%\%XJqM
Odw~+OQJ9OdAQ\fxyRSBf] \v BSAdP^O tKO \^AdJ2i w SRJ \%G+<Adi?sY\^RSGYGYP%\}\ P RHYk.RSv+OQP%AC\}BHGYRHJqk M] J \%R?J RSRSs+GY] P%P%\ \w2BSYP%\<s+OQJ2Odw2OQOQRSJ}GPqUBHSAQw2P^Ys+\|ACBHZw \%\%XJ2w M
AdBHBHO IKAQtKRS\^\^GYG<Adi\rwpyw ROdA \yP ] Y\^\qRG<BSRrDP^J OQw;\J2OQw RSRBSGP^+w2BOdOdDIrJ2OdYw2i[\^OQJ2\%](w J\^BHxG+IKY\^\^]Y] \Z<\^] i0w2\YZ\q\\^BSOdP G+BHIKIF\^BHxyG<IKw\^RSG<] OQ\wJ }J \%\z<sYPqBH\^GYG+P%GY\rRSp wZ\WJ2s+] \Ww2 BHww2+OQJ}BSJOdGkjBSP^w}w2Y\WZ\%J2w EV

\^G<BSw2J]BHBHGYG P%\?}kyRS]zrl\qBSSP bS \^xy.KRSw2%OQjRSSGbBSSJ 2J %RjSP^%OCBH w \%XOdw2bOdw2 \qBSBHP G J2wBHG<R&w2AdiB%DKIrRS] OQRqX5Odw2G+YIY\M|Dr}BH\AdsY\OdA RHkx\^BHxyOdG<RSww2BHOQRSOdGGYJbBZAd\^OQJ2Odw~G+I6RHk P%w2RSYG[\_
.

BSBSP^P^w2w2OdOdDDOdOdwawaiKiKp
MWx?Ys+\ Adw2Odv+P^Adw2iOQRSw2GY
\] X+RSRSvw7Rr_jJ v+\^] ]|RX[sYOdA P^Yww2YRH\^k GykRS]|\qBHBSGYP X

EF
P%P%RSRSG<G<w2w2]2]2OdOdZ+Z+s+s+w2w2OQOQRSRSGYGJWkjBSw RX+\~RqDK\^M];BHGYw2OdXFxyAQ\r\^pyw Vw2YRS\xyDr\~BH\^AdsYDK\%\^JG<w RHJkx\qBSB%P iK M

YP%RSRqx~}\^v+DKAQ\^\^w ]q\^MyAdiZ\] \^J xyRRqDKv\%RqX}\^kl]7] klRSs+x Ayw2 BHww2w2Y]\^BHi1s+xBHBS] Jn\nM GY\^YDKRq\^ ]


EP

lSSOdw2+S Yw2Y%\ \n[v \%P^w \%X WR iOQ\^AQX[OdG+I0w2Y\f}yS


w2RS+IrOQiKJM9RBHP%GYP^X s+] JOdwbOdGOQJ0<Zs+\^xOdG+BHIGYJOdG<OQDKJ\%J2Bw2OdxI<BHBHw w2\%w X5\^]OdG5RHk;w2vYYJ2\i[P .YRSOA__


EP
5 
EF EP


Adw2O YIr\<w~] \%v+P%] RSRHAdQT2\%\%P^P^w2wqOQpRSG{|RHs+k]2] v+\^] G<\^wDOQ\^RSDKsY\^JWG<\^w DKJ~\^xG<w B%JqiMIrBHOdAQDJ OdR0G+Iw2]2w2OdYIrIK\^x\^]


EV

] w2+\^GYOdG+\^I}\%w2X BHJ2wzw2] }\^G+\9IrYw2RSvOd\?G w RAQRMRStFw2+klOQs+J?]2OQw2JqYM\^RS]zGYOdP%G<\w RBHI<YBHRqOd1G.M


w RB

BHwW]BHRWJ2GYOdB%G+X+DKIRSRSx OQRSX?G+sYv+AdiJ2] Od\%G+w2X[+I~OQP^OQJw2wYBHZ+\%J OdOdA\;A OdwaDri] BH\%}AdJ2sYs+\\%AdJwOdBSOdA GJ<v+}w2OQY\^P t?Od\bIrBH<BHG(w IKJq\^BSp G<P^w2w OdJFDOdBHwaAi _
EV = QoL EP EF

OdGYP^AdsYX+\rp
EP

OdG+B%Ii[w JR;vZ\^\^]7w2kw RS\^]2]x~w2OdYG+\^IOd]}J RSJ2Odxyw2s \BHw2BSOQP^RSw2GyOdDZ<Odwai9iKMw2]2w2i<OdsYG+J?IGYw R?\^DKBS\^P%]<s+w2]2Odi<] \ _   +  Kj 

xyBSJ RSJ \%] \~J J] w2\%YJ \|RSs+X[] O P%\%P^Jqs+p Adwa iJzRH}k\] \UBH7AdBS] P%\q<BSs+X[ibOd]2OdGYG+\%IW\%] X\%J xyRSs+R] X+P%\^\%AQJ.Jkw RSR ]
0 ) ),2 2

w2v+ G ] RKBHw2wBS+P OQGYJ\%J w \%\%RXP^w2Zw OQRS\?RGfOdZx~}\0v+\zAQJ2\^vxy\%OdAP^\^OG< IKw \%R9\%XXw2+kP%RS] RSRS]x~s+w2Irv+Y0AQ\0\^w BHRS\^AdAds+iKw2w2p?YAdO \zGY\%w2\~X6+OdBHG+BHOdIKv[x J_


w2\^YAQJ~\k
RS]DrBHX+Ads \^w BH\^w ]2RSx~]qMYOdG+}Od\;G+Ix~OdIrY<\^wzw2BSYJ\^]}w2\^YAd\sYBHJ \zIK\^w2G<Yw \%JJ \;J2YxyRSRs+XAQX _


w2G+]2OQi\^GYw P%R~\(BSRHP%k|<\qs+BSOd] P \zb] v+\%ACJ BHRSGbs+] PqP%BH\%GJqp


Z\?YJ2\;s+Z+\n[w2]vBS\%P^P^w w \%\%XXklOd] GYRSP%x*RSG<w2DKY\n\_ BHBHww2OdG+X+I\^DK@;\^AQ@WRSv+EOdG+N IxyB(RJ2X+i[\^J2AQw J\^xkl] RSkx=RS]w2BHYs+\w RSJ2xv\%BHP^w2OOQ PqPqBHBHAdw2iOQRSIKGY\^JqGYp \^]7_
\ n[BSvJ|\%ZP^\%w \^\%GX BHWw2w R \^x~v+RHkzw \%w2XYp\fJ2+Odw2OQs J}BHDrw2BHOQRSAdsYG\OQJ|Yw2\^Y] \^\FGw2sYYJ \F\%X~v+ACkBHRSG ]

< H <
wk2RSY] \BSPqP%BH<AQs+P^s+Od]2ACOdBHG+w2IOQRS] G6\%J RHRSk;s+] w2P%Y\%\ J PqBHGfw2RHYk;\^Gfw2YZ\\;v+sYACBHJ \%G.XFp6Z<
iACBHw2YGY\J    
BHIKY\^\G<w ] J\%J BHRSGYs+X?] P%w2Y\%J\^Od]"w2 vBHRrwJ J2x~OdZ+OdAQIr\|<DrwFBHAdZsY\0\%JqB%p
DrVBHRSOdACxyBHZ+\|AQ\0] \%J kRSRS]s+] w2P%Y\%\J

NWP^RSw2w OQ\WRSGw2
BHw] RSw2v+RrOQJJ \^BH]?v+v+OdG] RKw2BSYP \FJOQBHJxyx?\~iKRSv+B%OQiPrM+BSZ+J(s+wBSP^Pqw2BHOdDGFOdw2\qOQBS\%J7Jq_ p Bw2H]2] Od\Z+s+IrAQw RS\%Z JWBHRHAk
Drw2BHYAdsY\(\%BHJqIKM\^\rG<p IYw Jqp9M w2\rYp IY\ p J P \q BSBHJ RS]2OQG.J2xMRSB[w2M Y\^] +JOdAQBH\?] \zRSw2BH[w7__
EV

Od<A sYi?\^ZG<\w
\nv+[ACBHv GYBHJqGYp"X+\%X~+OQw JqRWMSYOdGYRqP^}AdsY\^X+DK\^\ ]qBHMG<P%i(RSxyG<s+\%Jx?BHZw
\^w2]|YRH\kP%J2Rrs+J2ZYw
J RH\n_k
Od\^G] +JOdw2BHAQY\] \0\xyRSI<RSZ[BHGYT2xy\%\^iP^\nw _j}J"x~RSRSOd]Ir]2AQ<BHXwWZYMzJ2Z\rw2\]p IYBSBHpP^Gbw2BOQRSBHGYv+ZYJ
AQJ2RSw2w2s+] BSIrBHP^w
w2OQPqRSOQJFBHG.G9p}BHZGNW\|R~RSkZ[RSxs+T2\%BHGYP^w7X w_
P%BHB RSGyIrx?] \n\%Z+[\%OdvX[G RSi9BHGYw2BH\^OQRSv+G<Gbw2v+OC] BHRHRKA+k
BSIrv+P ] ~ACRqBHxGYw2JB%yi?w RyOdG~Z\^\w2DrYBHBH\}w2Adw s G<\^BHs+x~w x?\rv+p Zw \% \^X]|AdMKw RH\^k]2YG v\^BHRr] w2J \}OdJ2DKOdw2Z+\^YAdAQi \\ w w2Y\^]\bDr BHAdBHsYw\%w2JY\^kl] ifRSxBH] \rw2M+Y}\b\;I<GYBH\%xy\%X\rp=w R~tYGY\ Rq&W}R \^] \;RHk~w R9\qBSIKP \^ w
Y+] OdIrJ2Ywz\%J2J2w w\^v0\n[OQvJW\%w P^Ryw \%YX GYWX0R w2Y\(BUv+klACw BH\^G.]MJ2RSs+]zZ+w2v+]ACBSBHP^GYw2JqOdG+MI9w2Odw2Y0\Ww2\nY\ _ ] J2\%vJ \%RSP^s+O ] P%\%\XMZBH\^AQOdRSG+G+IbIbOdGOdB0w2J2vB\%P^J2Ow BHPGYXYJ2BHwBH] Xw \v+x?] \nsYkJ2\^wy] \^BHGYAQP%J \RkZRS\]
 

vv\%\%P^P^w w \%\%XXOdJ2GYOdw2P%s RSBHG<w2DKOQ\^RSG+GOQ\^BUGYklP%w \~\^]yRHk|PqBHw2]2Y]2\?iOdv+G+ACIBHGRSns+Mw~w2Yw2Y\^G0\w2v+YACBH\(G\nOQJ_



w2P^YAdsY\fX+\9] \%w2J YRSs+\9] J2P%w\rBHp GYXY BHwy] xXbB%]iBHG+BHIKAQ\yJ RBHGZ\fB%DK\n\^[]vBHIK\%X[\OQ\^BHG<IKw\^G<w w;ROdG[OdA _
sYJ RbJ \%RSXfG.w M
Rys+YG<GYw2OdXAGYw2YR0\I<ZBH\%OdGJ2w Odv+GACBH\nG0[vw R\%P^\nw +\%X \%P^s+Ww R \?6GY\nOQJ([wqvMRrBHJ GYJ2OX _ Odw
B%X+DKR\ \%RHJ"k.GY\qRSBSw"P x] BH\%tKJ \RSs+J \^] GYP%\rJ \|MKkRSY]
\^BW] \] \%OdwJ RSxs+BH] P%tK\}\%J}AdO J tK\^\GYJ J \q\rBSMKJ \rRSp G.IYpM
Z+wBHAQ\rIKp\RHkY\9IrOdIrD] OdG+\%\%IX[ibBBHZv+\^v+w2w ] \^RK] BSP WR 6OdA\%"J2w2 OdB%xDKBH\9w w2\rYM
\~Z+s+BSX[w9DrBHG[OdA _ 
Z+s+wOdwX+R\%JkRS]OdG<w \^Ad OdIK\^GYP%\rp
w<BHsYtK\^\zGYP%AQRS\rG+MKIK\q\^BS]P ~w R~vRrPqJ BHJ2AQOdP^Z+s+AQAC\|BHw GY\r\nM[[kw
RSJ2]w \q\^BSv9P GY\%J2\%w X+\^J"vFw RWOdGFZw2\Y\^\;DrJ BH\nA__
$ 
      ! "$#%&' ( )+*," ! !!-($&.($&'
s BHw \%Xpc+s+]2w2Y\^]2xyRS] \w2+OQJ9BHv+v+] RKBSP OdA }OdGYP^AdsYX+\ !!/!0&' "$&.+#213( 5476)&!89:&<;=5>?';@(    
+*," ! !!A&+0&'  &'  B

A Bayesian Network Framework for the Construction of Virtual Agents


with Human-like Behaviour 41
    9 r C <
9r C <          ] wW\%J OQRSJWs+sY] P%J2s \%J?BHAdIri0OdDKBy\^GIKRw R0RXBHGOQX+\qBHBIK\^w G<Rw?xkjBHBHAdtK\9J2Odw2s++] Od\?Gw2 BH wzBHw?w2YOQ\J
vvRrRrJ J J2J2OdOdZ+Z+AQAQ\(\~kw RSRf]Rqw2DK \^BH]2w(]2BHOdIKw \^\ G<w?waiJv\~OdGw RFw2Y \B%BHDKIK\r\^p G<w(w;J2OQvJ?\%BHP^AQO+J R _


OdJ GbRSYs+\q\] BSP%BSP \%P^Jw2w2OdBSD P^OdBHw2waOdwibDOdOdG[wawaiiYvsYwa\%\^iJ(GYvP%BH\r\ AQpWRSw2G+cYYIfRS\]zxy\qOdBSRSw2P w2OdDrBHBHBSAdw2P^OQw2RSw2OdYGD\OdkwaRSiKBS]MP^w2w2w2OdYYD\\?Odw2BS] OQ\%\nPnJ__



PqBHw2OQRSG.p
EF

w2OdG[OdDYOdsYwaiK\^M.GYP%\f+OQBHP AQRSG+X[IbOd] \%P^Odw2w2OQRSG.w2M"Y\BHGYJXBHxyw2Y\\~kRSJ2w2](] \^OdG[G+YIrsYw2\^GYRHP%k\Fw2RSYG \ 2


N YOdtK\RSACBHBUs+T w2Y RSi] AQJyX[Od}IRSks+RSAQ]XDrOdBHGAds v BHBHZ+]2AQw2\OQP^s+X[ACOQBHJ P^]sYAdJ O J2tKOQ\FRSGYw JRRSw2G BH+G+OQt J
- 4+ 0 */  0 32

w2w YR\ OdGYWP^RAdsYX+\RHkBw2YJ2\wBHBSGYP^XYw2BHOdD] OdXwaiKv+p ] \n wk\^OQ] J9\^GYBHP%AQ\0J RkRSBF]IKZRRSRw2XOQX+w2Y\qB\




BJ2SYP^w2RSOds+DAQOdXwai0BHwaAQJ iRv\%ZJ\(BHIrGYOdX0DK\^w2GY\yByBSkjBSP^P^w2w OdDRSOd]zw2OQk\%RSJq];p(\q|BSBSP P \^xyBSP^RSw2w2OdDOQRSOdwaG.i M OQxX+\qBHBSw2OdJ9DK\~RSGBHG+w2IrYAQ\\rp;`YV\~BHs++w2OQYP RS] JW}\}RS s+B%AQDKXb\AdIrO tKOdDK\(\^w GRBfw2 GYBHRSG+]7t _
vsYBHIKJ Rr\%\^J X9G<\^]qwkp RSJ2s+]}wx1PqxBHAQw P^B%Rzs+iw2ACYBHZw2\\zOdG+JIKBHI9RxyRw2YX\}\ DrOQX+BHAd\qsYB9\rM<w MKR9Z<J i~RzAQ\^w2w2w YBHBH\ Adw" YP^GYw2\nOQRS_jw2Gks+RSG+
] Od] G+BHRHGI _ @;s+@WIrEOdGNWJqkRSp.]v+\|] Rq}DRSOQs+X[AQOdXyG+IBHAQBJ Rw AdRO tKRS\|Aw kR RS]w2 J2BHvG+\%tP^Ow2 YPq\BHw2BHOQGYRSRSGG<i<RH_k
w2v+Y] \\nkZ\^] \^\^ GYB%P%D\%OQJqRSp ]}RHkBHIK\^G<w J}PqBHGZ\X+RSGY\ J RSAQ\^Adi~RSGyw2Y\
EV

EF xyRSsYJ] \^DOQ\^}\^] JkRS]DrBHAds BHZ+AQ\P%RSx~xy\^G<w Jqp
w2Y\
ACJBHBHGYxyJ~\kRSBS]yJfBSBSP%<P^w2s+OdOdD]2OdOdw2G+OQI\%JqMW] \%J RSOdw2s+&] P%\%OdGYJP%J2RSYG<RSDKs+\^G+AQXOQ\^GYOdGYP%P^\AdsY] X+\n\ _  0  0 30 Y | 0 2

v+J2YACBSRSP^s+OdAQG+XI OdGYP^WAdsYR X+\|p] \%c+J s+RS]2s+w2] YP%\^\%]2Jxyw2 RS] BH\wOdG[w2YYsY\$\^GYJ2P%v\|\%w2P^YO \|Pqv+BH] w2RSOQRSZ[G _
 !#"%$'&)(+*-,/.1024365879,;:=<?>83651@BADCE:-51FG3652H@BI6I 36J
7%@B51<?@BKL ML$'"N O%P=P-QRO%S=&


OkdBHG[\%Z+P^YOdw AsY\%Odwa\^XiGYP%Z<RH\yik?w2w2YBHY\~w2\w v+\^v+x~] ACRSBHv+Z Gw2BHOdZ+G+ZOdI1A\^OdOdwaG+w2iKYIM\YJ2RqsYv+P%ACP%BH\%] G.J \%MfJ7J klRSs+BHs+GYAj] MXP%Y\%J(RqY!RqBH ] \w2Y] BU\^\nki __
T)U  VV#W'?MYX+%Z[\Z?W' ]^[%Z#`_'&=&=&a3b:=cBI *ed\dD

T  f-ghW'[=_'&=&%_i@Bj-@BADkl3652H@BAmin3 7o>82bpD

J RSRSs+s+w] J2P%sY\%JP%P%BH\%J ] J7\0kls+BUAdiK\%p P^w \%XZ<iw2Y\fv+ACBHGZ\^OdG+IPqBH]2]2OQ\%X q r/s U M _'&=&%$tu9v`JHw8x=x=y=z-w=w%{'JHw=1g| U  =}%Z)~+f'[#DZ


g)f=YMfB]^[%Z#

 <y q
`m\Zf=8s= U f=[=W'MG1f U U  #"==+1>@Y(+*7'J
5362436j-@v`24AD0<B240A@*t),;*-243b*-5pD)W']M=[/ =[\Q

cYRHkRS]fw2Y\q\9BSP BH&ZRqwaDKi\rvM\YRHk\^] \?IKw2\^YG<\9wfBHw2IKY\^\G<wzv+wa] i\nvk\^\(] \^X[GYOP%\%\^Jf] JRHklky] RSBHx Ad
       \ ZsY~+[#

m T W'\Z[#?1/_'&=&%_mdD52H@R7'A:-24365872>@(h(uE*#F=@BI*

v+BHG&] Od\nA .kB%\^ZDK] \z\^\^]GYsYBHP%J IK\|\%\XOQJ.pBHs+IKGY\^YJ2G<v\(wq\%MWBHP^GYIKO \%\^\%G<\%X?X+wJw2waYiw v\|R\J2wZPqBH\bBHGYG0XYJ2vBHBH] \%AQXJ P^R~Ov+ Od] \%GY\nXP^kp\^AdsY] \^X+ GYk9\?P%BB\
t),;*-243b*-5p365t),;c?*#F-3b@?F(>:-A:=<B2H@BA?pDH~+fB[[#M8Q
 %f'|Z}[G f=\}f=f=m \ZW U f=8=[?W-Z f=W U
}W'?W=DZ[?N/ U W-Z f='KL[BZ}fM'W'M|[#\[#W'??}
}W U U [=[#KL[ U `f=[=

J2RHwk}BHw2GY XYBHBHw] Xywai]vBH\9G+IK\OdAkRS]X[Ow2Y\^\]z] kl\%] J RSRSxs+] P%w2Y\%J\B%YDK\^\^]] BH\WIKBH\GBHBHIKIK\^\^G<G<wqw p  T W'%+~! r+  U U []^ |_'&=&=&^*.1J\F=*-kl5(+*-5J

w2 wYIK\|J2\^w2B%G<OdDKAw\^wax]iBHvBHIKtK\%\J
\%BHJxIKJ \^B%\^G<iGYwqJ MrBH\z\rAQJ pw IYRzR9pq w2AQ\^YB%wDK\
\ Z+s+AdJi;J
J2was+w2i x=vBH\
w|w xyR(X[O\^w2G<Y\^w2\;]
OQRSJklGYBH] RS\%xyXx \ p
pD24AD0<B243b*-5:-51Fm@R.@B24362436j-@v`24AD0<B240A@Dpm@R.1A@Dp@B52H:-J
243b*-5365 m:-C=@DpD3b:-5 i@B24k*-Aop/H~+fB[[#M %mf'lZ}[
h} \Z[[%Z}YH%Z[W-Z f=W U`U f=MW/\Z `BW U H%Z[ U U  Q
EF =[B[8fB [BZs9f=[[B[=W'=[#|_'%_oQ_'=S

DrZBH\qAdBHsYw2\rOdG+M.If\rp IYvp\%RSw2v+YAQ\9\~Z+s+s+vAd i0OdwaA i|vX[\~] RSBHvIK\^xyG<w RS] J\~OdGYw2 P^AdBHO GG BHw2w2YOQRS\GB%w D<R _

EF


\^]BH|IKBS\;P v\^BH] IKJ \^RSG<G w


JqGYM[\%\%YX+\^J|G~B T2waRqiiv\]2OQBHJ \%GYJqXyp BzJ2v\%P^O PqBHw2OQRSGyRHk
1

w2BHYIK\z\^G<v+w] \nRHkk\^w2] \^GYBHwP%\%waJiv\rYM\^GY] \;RSw Od\Www2X[ OBHw\^] v+J] kl\n] kRS\^x*] \^GYB9P%\%J2J}wBHkGYRSXY]BHBS] PnX _
w2 k}OdDJ OdRSwai xyU\9v+AC] BH\%GJ RSwas+i] vP%\%\9JGYBH\%] \%\ X++J;+w \%RXZOdGY\(J2OQ+X++\z\%XBHGFBHBHw?IKB\^G<J2wvwa\%iP^vO \rP p


Dr\^]BHBHAdsYIK\r\FMBHRSIK]|\^OdG<GYwyJ2OQRHX+kW\zw2B; ]BHBHw9G+waIKi\ vw2\r MBHOdww9X[x?OsY\^J2] w9JZkl] \RSxJ2vw2\%YP^O\z B%\%D<X_ p


42 O. Bangs, N. Sndberg-Madsen, and F. V. Jensen
Loopy Propagation: the Convergence Error in Markov Networks
Janneke H. Bolt
Institute of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
janneke@cs.uu.nl

Abstract
Loopy propagation provides for approximate reasoning with Bayesian networks. In pre-
vious research, we distinguished between two different types of error in the probabilities
yielded by the algorithm; the cycling error and the convergence error. Other researchers
analysed an equivalent algorithm for pairwise Markov networks. For such networks with
just a simple loop, a relationship between the exact and the approximate probabilities was
established. In their research, there appeared to be no equivalent for the convergence er-
ror, however. In this paper, we indicate that the convergence error in a Bayesian network
is converted to a cycling error in the equivalent Markov network. Furthermore, we show
that the prior convergence error in Markov networks is characterised by the fact that the
previously mentioned relationship between the exact and the approximate probabilities
cannot be derived for the loop node in which this error occurs.

1 Introduction

A Bayesian network uniquely defines a joint probability distribution and as such provides for computing any probability of interest over its variables. Reasoning with a Bayesian network, more specifically, amounts to computing (posterior) probability distributions for the variables involved. For networks without any topological restrictions, this reasoning task is known to be NP-hard (Cooper 1990). For networks with specific restricted topologies, however, efficient algorithms are available, such as Pearl's propagation algorithm for singly connected networks. Also the task of computing approximate probabilities with guaranteed error bounds is NP-hard in general (Dagum and Luby 1993). Although their results are not guaranteed to lie within given error bounds, various approximation algorithms are available that yield good results on many real-life networks. One of these algorithms is the loopy-propagation algorithm. The basic idea of this algorithm is to apply Pearl's propagation algorithm to a Bayesian network regardless of its topological structure. While the algorithm results in exact probability distributions for a singly connected network, it yields approximate probabilities for the variables of a multiply connected network. Good approximation performance has been reported for this algorithm (Murphy et al 1999).

In (Bolt and van der Gaag 2004), we studied the performance of the loopy-propagation algorithm from a theoretical point of view and argued that two types of error may arise in the approximate probabilities yielded by the algorithm: the cycling error and the convergence error. A cycling error arises when messages are being passed on within a loop repetitively and old information is mistaken for new by the variables involved. A convergence error arises when messages that originate from dependent variables are combined as if they were independent.

Many other researchers have addressed the performance of the loopy-propagation algorithm. Weiss and his co-workers, more specifically, investigated its performance by studying the application of an equivalent algorithm on pairwise Markov networks (Weiss 2000, and Weiss and Freeman 2001). Their use of Markov networks for this purpose was motivated by the relatively easier analysis of these networks and
justified by the observation that any Bayesian network can be converted into an equivalent pairwise Markov network. Weiss (2000) derived an analytical relationship between the exact and the computed probabilities for the loop nodes in a network including a single loop. In the analysis of loopy propagation in Markov networks, however, no distinction between different error types was made, and on first sight there is no equivalent for the convergence error. In this paper we investigate this difference in results; we do so by constructing the simplest situation in which a convergence error may occur, and analysing the equivalent Markov network. We find that the convergence error in the Bayesian network is converted to a cycling error in the equivalent Markov network. Furthermore, we find that the prior convergence error in Markov networks is characterised by the fact that the relationship between the exact and the approximate probabilities, as established by Weiss, cannot be derived for the loop node in which this error occurs.

2 Bayesian Networks

A Bayesian network is a model of a joint probability distribution Pr over a set of stochastic variables V, consisting of a directed acyclic graph and a set of conditional probability distributions¹. Each variable A is represented by a node A in the network's digraph². (Conditional) independency between the variables is captured by the digraph's set of arcs according to the d-separation criterion (Pearl 1988). The strength of the probabilistic relationships between the variables is captured by the conditional probability distributions Pr(A | p(A)), where p(A) denotes the instantiations of the parents of A. The joint probability distribution is presented by

  Pr(V) = ∏_{A∈V} Pr(A | p(A))

For the scope of this paper we assume all variables of a Bayesian network to be binary. We will often write a for A = a1 and ā for A = a2. Fig. 1 depicts a small binary Bayesian network.

[Figure 1: An example Bayesian network, with node A having Pr(a) = x, node B having Pr(b | a) = p and Pr(b | ā) = q, and node C having Pr(c | ab) = r, Pr(c | ab̄) = s, Pr(c | āb) = t and Pr(c | āb̄) = u.]

¹ Variables are denoted by upper-case letters (A), and their values by indexed lower-case letters (ai); sets of variables by bold-face upper-case letters (A) and their instantiations by bold-face lower-case letters (a). The upper-case letter is also used to indicate the whole range of values of a variable or a set of variables.
² The terms node and variable will be used interchangeably.

A multiply connected network includes one or more loops. We say that a loop is simple if none of its nodes are shared by another loop. A node that has two or more incoming arcs on a loop will be called a convergence node of this loop. Node C is the only convergence node in the network from Fig. 1. Pearl's propagation algorithm (Pearl 1988) was designed for exact inference with singly connected Bayesian networks. The term loopy propagation used throughout the literature, refers to the application of this algorithm to networks with loops.

3 The Convergence Error in Bayesian Networks

When applied to a singly connected Bayesian network, Pearl's propagation algorithm results in exact probabilities. When applied to a multiply connected network, however, the computed probabilities may include errors. In previous work we distinguished between two different types of error (Bolt and van der Gaag 2004). The first type of error arises when messages are being passed on in a loop repetitively and old information is mistaken for new by the variables involved. The error that thus arises will be termed a cycling error. A cycling error can only occur if for each convergence node of a loop either the node itself or one of its descendents is observed. The second type of error originates from the combination of causal messages by the
convergence node of a loop. A convergence node combines the messages from its parents as if the parents are independent. They may be dependent, however, and by assuming independence, a convergence error may be introduced. A convergence error may already emerge in a network in its prior state. In the sequel we will denote the probabilities that result upon loopy propagation with Pr̃ to distinguish them from the exact probabilities which are denoted by Pr.

Moreover, we studied the prior convergence error. Below, we apply our analysis to the example network from Figure 1. For the network in its prior state, the loopy-propagation algorithm establishes

  Pr̃(c) = Σ_{A,B} Pr(c | AB) Pr(A) Pr(B)

as probability for node C. Nodes A and B, however, may be dependent and the exact probability Pr(c) equals

  Pr(c) = Σ_{A,B} Pr(c | AB) Pr(B | A) Pr(A)

The difference between the exact and approximate probabilities is

  Pr(c) - Pr̃(c) = x · y · z

where

  x = Pr(c | ab) - Pr(c | ab̄) - Pr(c | āb) + Pr(c | āb̄)
  y = Pr(b | a) - Pr(b | ā)
  z = Pr(a) - Pr(a)²

[Figure 2: The probability of c as a function of Pr(a) and Pr(b), assuming independence of the parents A and B of C (surface), and as a function of Pr(a) (line segment).]

The factors that govern the size of the prior convergence error in the network from Figure 1, are illustrated in Figure 2; for the construction of this figure we used the following probabilities: r = 1, s = 0, t = 0, u = 1, p = 0.4, q = 0.1 and x = 0.5. The line segment captures the exact probability Pr(c) as a function of Pr(a); note that each specific Pr(a) corresponds with a specific Pr(b). The surface captures Pr̃(c) as a function of Pr(a) and Pr(b). The convergence error equals the distance between the point on the line segment that matches the probability Pr(a) from the network and its orthogonal projection on the surface. For Pr(a) = 0.5, more specifically, the difference between Pr̃(c) and Pr(c) is indicated by the vertical dotted line segment and equals 0.65 - 0.5 = 0.15. Informally speaking:

- the more curved the surface is, the larger the distance between a point on the line segment and its projection on the surface can be; the curvature of the surface is reflected by the factor x;
- the distance between a point on the segment and its projection on the surface depends on the orientation of the line segment; the orientation of the line segment is reflected by the factor y;
- the distance between a point on the line segment and its projection on the surface depends on its position on the line segment; this position is reflected by the factor z.

We recall that the convergence error originates from combining messages from dependent nodes as if they were independent. The factors y and z now in essence capture the degree of dependence between the nodes A and B; the factor x indicates to which extent this dependence can affect the computed probabilities.
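As an illustration, the error just discussed can be reproduced with a few lines of Python. The fragment below is only a sketch that encodes the probabilities given above (r = 1, s = 0, t = 0, u = 1, p = 0.4, q = 0.1 and Pr(a) = 0.5); it is not code from the study itself.

# prior convergence error for the network of Figure 1
r, s, t, u = 1.0, 0.0, 0.0, 1.0      # Pr(c|ab), Pr(c|ab-), Pr(c|a-b), Pr(c|a-b-)
p, q = 0.4, 0.1                      # Pr(b|a), Pr(b|a-)
pa = 0.5                             # Pr(a)

pr_c = {(1, 1): r, (1, 0): s, (0, 1): t, (0, 0): u}
pr_a = {1: pa, 0: 1.0 - pa}
pr_b_given_a = {1: {1: p, 0: 1.0 - p}, 0: {1: q, 0: 1.0 - q}}
pb = p * pa + q * (1.0 - pa)         # marginal Pr(b)
pr_b = {1: pb, 0: 1.0 - pb}

# loopy propagation combines the parents' marginals as if A and B were independent
approx_c = sum(pr_c[a, b] * pr_a[a] * pr_b[b] for a in (0, 1) for b in (0, 1))
# the exact computation respects the dependence of B on A
exact_c = sum(pr_c[a, b] * pr_b_given_a[a][b] * pr_a[a] for a in (0, 1) for b in (0, 1))

x = r - s - t + u                    # curvature of the surface
y = p - q                            # orientation of the line segment
z = pa - pa ** 2                     # position on the line segment
print(exact_c, approx_c, exact_c - approx_c, x * y * z)   # 0.65  0.5  0.15  0.15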


4 Markov Networks

Like a Bayesian network, a Markov network uniquely defines a joint probability distribution over a set of statistical variables V. The variables are represented by the nodes of an undirected graph and (conditional) independency between the variables is captured by the graph's set of edges; a variable is (conditionally) independent of every other variable given its Markov blanket. The strength of the probabilistic relationships is captured by clique potentials. Cliques C are subsets of nodes that are completely connected; ∪C = V. For each clique, a potential function ψ_C(A_C) is given that assigns a non-negative real number to each configuration of the nodes A_C of C. The joint probability is presented by:

  Pr(V) = 1/Z ∏_C ψ_C(A_C)

where Z = Σ_V ∏_C ψ_C(A_C) is a normalising factor, ensuring that Σ_V Pr(V) = 1. A pairwise Markov network is a Markov network with cliques of maximal two nodes.

For pairwise Markov networks, an algorithm can be specified that is functionally equivalent to Pearl's propagation algorithm (Weiss 2000). In this algorithm, in each time step, every node sends a probability vector to each of its neighbours. The probability distribution of a node is obtained by combining the steady state values of the messages from its neighbours.

In a pairwise Markov network, the transition matrices M^AB and M^BA can be associated with any edge between nodes A and B:

  M^AB_ji = ψ(A = ai, B = bj)

Note that matrix M^BA equals (M^AB)^T.

Example 1 Suppose we have a Markov network with two binary nodes A and B, and suppose that for this network the potentials ψ(ab) = p, ψ(ab̄) = q, ψ(āb) = r and ψ(āb̄) = s are specified, as in Fig. 3. We then associate the transition matrix

  M^AB = ( p  r )
         ( q  s )

with the link from A to B and its transpose

  M^BA = ( p  q )
         ( r  s )

with the link from B to A. □

[Figure 3: An example pairwise Markov network and its transition matrices.]

The propagation algorithm now is defined as follows. The message from A to B equals M^AB v after normalisation, where v is the vector that results from the component wise multiplication of all message vectors sent to A except for the message vector sent by B. The procedure is initialised with all message vectors set to (1,1,...,1). Observed nodes do not receive messages and they always transmit a vector with 1 for the observed value and zero for all other values. The probability distribution for a node is obtained by combining all incoming messages, again by component wise multiplication and normalisation.
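The scheme just described is easily written down in code. The following sketch is only an illustration of the scheme as stated above; the function name and its arguments are ad hoc for this fragment, NumPy is the only assumption, and a fixed number of iterations stands in for running until the steady state is reached.

import numpy as np

def loopy_propagation(psi, card, observed=None, iterations=50):
    # psi[(A, B)][i, j] = psi(A = a_i, B = b_j); card[X] = number of values of X;
    # observed[X] = index of the observed value of X, if any.
    observed = observed or {}
    edges = {(A, B) for (A, B) in psi} | {(B, A) for (A, B) in psi}
    def M(A, B):                                    # matrix for the message from A to B
        return psi[(A, B)].T if (A, B) in psi else psi[(B, A)]
    msg = {(A, B): np.ones(card[B]) for (A, B) in edges}
    for _ in range(iterations):
        new = {}
        for A, B in edges:
            if A in observed:                       # observed nodes send an indicator vector
                v = np.zeros(card[A]); v[observed[A]] = 1.0
            else:                                   # combine all other incoming messages
                v = np.ones(card[A])
                for (C, D) in edges:
                    if D == A and C != B:
                        v = v * msg[(C, A)]
            out = M(A, B) @ v
            new[(A, B)] = out / out.sum()           # normalise
        msg = new
    beliefs = {}                                    # beliefs of observed nodes are not meaningful here
    for A in card:
        b = np.ones(card[A])
        for (C, D) in edges:
            if D == A:
                b = b * msg[(C, A)]
        beliefs[A] = b / b.sum()
    return beliefs

# the two-node network of Figure 3, with psi(ab)=0.2, psi(ab-)=0.3, psi(a-b)=0.4, psi(a-b-)=0.1
psi = {("A", "B"): np.array([[0.2, 0.3], [0.4, 0.1]])}
print(loopy_propagation(psi, {"A": 2, "B": 2}))

On this two-node network the result is exact, since there is no loop; on networks with loops the same updates yield the approximate, possibly erroneous, steady-state probabilities studied below.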
5 Converting a Bayesian Network into a Pairwise Markov Network

In this section, the conversion of a Bayesian network into an equivalent pairwise Markov network is described (Weiss 2000). In the conversion of the Bayesian network into a Markov network, for any node with multiple parents, an auxiliary node is constructed into which the common parents are clustered. This auxiliary node is connected to the child and its parents and the original arcs between child and parents are removed. Furthermore, all arc directions in the network are dropped. The clusters are all pairs of connected nodes. For a cluster with an auxiliary node and a former parent node, the potential is set to 1 if the nodes have a similar value for the former parent node and to 0 otherwise. For the other clusters, the potentials are equal to the conditional probabilities of the former child given the former parent. Furthermore, the prior probability of a former root node is incorporated by multiplication into one of the potentials of the clusters in which it takes part.
[Figure 4: A pairwise Markov network that represents the same joint probability distribution as the Bayesian network from Figure 1. For the cluster AB the potentials are ψ(ab) = px, ψ(ab̄) = (1-p)x, ψ(āb) = q(1-x) and ψ(āb̄) = (1-q)(1-x); for the clusters AX and BX the potential equals 1 if the value of A, respectively B, agrees with the corresponding component of the value of X, and 0 otherwise; for the cluster XC the potentials are ψ(a'b', c) = r, ψ(a'b̄', c) = s, ψ(ā'b', c) = t, ψ(ā'b̄', c) = u, with ψ(X, c̄) = 1 minus the corresponding value for c.]

Example 2 The Bayesian network from Figure 1 can be converted into the pairwise Markov network from Figure 4 with clusters AB, AX, BX and XC. Node X is composed of A' and B' and has the values a'b', a'b̄', ā'b' and ā'b̄'. Given that the prior probability of root node A is incorporated in the potential of cluster AB, the network has the following potentials: ψ(AB) = Pr(B | A) Pr(A); ψ(XC) = Pr(C | AB); ψ(AX) = 1 if A' = A and 0 otherwise; and ψ(BX) = 1 if B' = B and 0 otherwise.

6 The Analysis of Loopy Propagation in Markov Networks

Weiss (2000) analysed the performance of the loopy-propagation algorithm for Markov networks with a single loop and related the approximate probabilities found for the nodes in the loop to their exact probabilities. He noted that in the application of the algorithm messages will cycle in the loop and errors will emerge as a result of the double counting of information. The main idea of his analysis is that for a node in the loop, two reflexive matrices can be derived; one for the messages cycling clockwise and one for the messages cycling counterclockwise. The probability distribution computed by the loopy-propagation algorithm for the loop node in the steady state, now can be inferred from the principal eigenvectors of the reflexive matrices plus the other incoming vectors. Subsequently, he showed that the reflexive matrices also include the exact probability distribution and used those two observations to derive an analytical relationship between the approximated and the exact probabilities.

[Figure 5: An example Markov network with just one loop: loop nodes L1, ..., Ln, each with an attached observed node O^1, ..., O^n.]

More in detail, Weiss considered a Markov network with a single loop with n nodes L1 ... Ln and with, connected to each node in the loop, an observed node O^1 ... O^n, as shown in Figure 5. During propagation, a node O^i will constantly send the same message into the loop. This vector is one of the columns of the transition matrix M^{OiLi}. In order to enable the incorporation of this message into the reflexive matrices, this vector is transformed into a diagonal matrix D^i, with the vector elements on the diagonal. For example, suppose that

  M^{OiLi} = ( p  r )
             ( q  s )

and suppose that the observation O^i = o^i_1 is made, then

  D^i = ( p  0 )
        ( 0  q )

Furthermore, M^1 is the transition matrix for the message from L1 to L2 and M^1T the transition matrix for the message from L2 to L1, etc. The reflexive matrix C for the transition of a counterclockwise message from node L1 back to itself is defined as M^1T D^2 ... M^(n-1)T D^n M^nT D^1. The message that L2 sends to L1 in the steady state now is in the direction of the principal eigenvector of C. The reflexive matrix C2 for the transition of a clockwise message from node L1 back to itself is defined as M^n D^n M^(n-1) D^(n-1) ... M^1 D^1. The message that node Ln sends to L1 in the steady state is in the direction of the principal eigenvector of C2. Component wise multiplication of


the two principal eigenvectors and the message from O^1 to L1, and normalisation of the resulting vector, yields a vector of which the components equal the approximated values for L1 in the steady state. Furthermore, Weiss proved that the elements on the diagonals of the reflexive matrices equal the correct probabilities of the relevant value of L1 and the evidence, for example, C_{1,1} equals Pr(l^1_1, o). Subsequently, he related the exact probabilities for a node A in the loop to its approximate probabilities by

  Pr(a_i) = ( λ_1 Pr̃(a_i) + Σ_{j≥2} P_ij λ_j (P^{-1})_ji ) / Σ_j λ_j        (1)

in which P is a matrix that is composed of the eigenvectors of C, with the principal eigenvector in the first column, and λ_1 ... λ_j are the eigenvalues of the reflexive matrices, with λ_1 the maximum eigenvalue. We note that from this formula it follows that correct probabilities will be found if λ_1 equals 1 and all other eigenvalues equal 0.

In the above analysis, all nodes O^i are considered to be observed. Note that given unobserved nodes outside the loop, the analysis is essentially the same. In that case a transition matrix

  M^{OiLi} = ( p  r )
             ( q  s )

will result in the diagonal matrix

  D^i = ( p+r   0  )
        (  0   q+s )
Networks sages respectively:

As discussed in Section 3 in Bayesian networks,


x 1x
a distinction could be made between the cycling M A = M XA DCX M BX M AB =
x 1x
error and the convergence error. In the previ-
ous section it appeared that for Markov network
such a distinction does not exist. All errors re- with eigenvalues 1 and 0 and principal eigenvec-
sult from the cycling of information and, on first tor (1,1) and
sight, there is no equivalent for the convergence
error. However, any Bayesian network can be x x
M A = M BA M XB DCX M AX =
1x 1x
converted into an equivalent pairwise Markov
network on which an algorithm equivalent to
the loopy-propagation algorithm can be used. with eigenvalues 1 and 0 and principal eigenvec-
In this section, we investigate this apparent in- tor (x, 1x). Note that the correct probabilities
compatibility of results and indicate how the for node A indeed are found on the diagonal of
convergence error yet is embedded in the anal- the reflexive matrices. Furthermore, 1 = 1 and
ysis of loopy propagation in Markov networks. 2 = 0 and therefore correct approximations are

48 J. H. Bolt
expected. We indeed find that the approximations (1 · x, 1 · (1 - x)) equal the exact probabilities. Note also that, as expected, the messages from node A back to itself do not change any more after the first cycle. As in the Bayesian network, for node A no cycling of information occurs in the Markov network. For node B a similar evaluation can be made.

We now turn to the convergence node C. In the Bayesian network in its prior state a convergence error may emerge in this node. In the conversion of the Bayesian network into the Markov network, the convergence node C is placed outside the loop and the auxiliary node X is added. For X, the following reflexive matrices are computed for the clockwise and counterclockwise messages from node X back to itself respectively:

  M^X = M^BX M^AB M^XA D^CX =

  (   px       px      q(1-x)      q(1-x)   )
  ( (1-p)x   (1-p)x  (1-q)(1-x)  (1-q)(1-x) )
  (   px       px      q(1-x)      q(1-x)   )
  ( (1-p)x   (1-p)x  (1-q)(1-x)  (1-q)(1-x) )

with eigenvalues 1, 0, 0, 0; principal eigenvector ((px+q(1-x))/((1-p)x+(1-q)(1-x)), 1, (px+q(1-x))/((1-p)x+(1-q)(1-x)), 1) and other eigenvectors (0, 0, 1, -1), (1, -1, 0, 0) and (0, 0, 0, 0).

  M^X = M^AX M^BA M^XB D^CX =

  (   px     (1-p)x      px      (1-p)x    )
  (   px     (1-p)x      px      (1-p)x    )
  ( q(1-x)  (1-q)(1-x)  q(1-x)  (1-q)(1-x) )
  ( q(1-x)  (1-q)(1-x)  q(1-x)  (1-q)(1-x) )

with eigenvalues 1, 0, 0, 0; principal eigenvector (x/(1-x), x/(1-x), 1, 1) and other eigenvectors (0, 1, 0, -1), (1, 0, -1, 0) and (0, 0, 0, 0).

On the diagonal of the reflexive matrices of X we find the probabilities Pr(AB). As the correct probabilities for a loop node are found on the diagonal of its reflexive matrices, these probabilities can be considered to be the exact probabilities for node X. The normalised vector of the component wise multiplication of the principal eigenvectors of the two reflexive matrices of X equals the vector with the normalised probabilities Pr(A)Pr(B). Likewise, these probabilities can be considered to be the approximate probabilities for node X.

A first observation is that λ_1 equals 1 and the other eigenvalues equal 0, but the exact and the approximate probabilities of node X may differ. This is not consistent with Equation 1. The explanation is that for node X, the matrix P is singular and therefore, the matrix P^{-1}, which is needed in the derivation of the relationship between the exact and approximate probabilities, does not exist. Equation 1, thus, isn't valid for the auxiliary node X. We note furthermore that the messages from node X back to itself may still change after the first cycle. We therefore find that, although in the Bayesian network there is no cycling of information, in the Markov network, for node X information may cycle, resulting in errors computed for its probabilities.

The probabilities computed by the loopy-propagation algorithm for node C equal the normalised product M^XC v, where v is the vector with the approximate probabilities found at node X. It can easily be seen that these approximate probabilities equal the approximate probabilities found in the equivalent Bayesian network. Furthermore we observe that if node X would send its exact probabilities, that is, Pr(AB), exact probabilities for node C would be computed. In the Markov network we thus may consider the convergence error to be founded in the cycling of information for the auxiliary node X.

In Section 3, a formula for the size of the prior convergence error in the network from Figure 1 is given. We there argued that this size is determined by the factors y and z that capture the degree of dependency between the parents of the convergence node and the factor x, that indicates to which extent the dependence between nodes A and B can affect the computed probabilities. In this formula, x is composed of the conditional probabilities of node C. In the analysis in the Markov network we have a similar finding. The effect of the degree of dependence between A and B is reflected in the difference between the exact and the approximate probabilities found for node X. The effect of the conditional probabilities at node C emerges in the transition of the message vector from X to C.
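The observations above are easily checked numerically. The following sketch, again with r = 1, s = 0, t = 0, u = 1, p = 0.4, q = 0.1 and x = 0.5 and not taken from the original analysis, builds the transition matrices of the converted network and recovers both the exact probabilities Pr(AB) and the approximate probabilities Pr(A)Pr(B) for node X, and from these the exact value 0.65 and the approximate value 0.5 for Pr(c).

import numpy as np

r, s, t, u, p, q, x = 1.0, 0.0, 0.0, 1.0, 0.4, 0.1, 0.5

M_XA = np.array([[1., 1., 0., 0.], [0., 0., 1., 1.]])
M_XB = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.]])
M_XC = np.array([[r, s, t, u], [1 - r, 1 - s, 1 - t, 1 - u]])
M_AB = np.array([[p * x, q * (1 - x)], [(1 - p) * x, (1 - q) * (1 - x)]])
D_CX = np.eye(4)                               # C sends (1,1,1,1) in the prior state

M_X_cw = M_XB.T @ M_AB @ M_XA @ D_CX           # clockwise reflexive matrix of X
M_X_ccw = M_XA.T @ M_AB.T @ M_XB @ D_CX        # counterclockwise reflexive matrix of X

def principal(A):
    vals, vecs = np.linalg.eig(A)
    v = np.abs(np.real(vecs[:, np.argmax(np.abs(vals))]))
    return v / v.sum()

approx_X = principal(M_X_cw) * principal(M_X_ccw)
approx_X /= approx_X.sum()                     # Pr(A)Pr(B) = (0.125, 0.375, 0.125, 0.375)
exact_X = np.diag(M_X_cw)                      # Pr(AB)     = (0.2,   0.3,   0.05,  0.45)

print(M_XC @ approx_X)                         # (0.5, 0.5): the loopy result for node C
print(M_XC @ exact_X)                          # (0.65, 0.35): the exact distribution for node C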


We just considered the small example network from Figure 1. Note, however, that for any prior binary Bayesian networks with just simple loops, the situation for any loop can be summarised to the situation in Figure 1 by marginalisation over the relevant variables. The results with respect to the manifestation of the convergence error by the cycling of information and the invalidity of Equation 1 for the auxiliary node, found for the network from Figure 1, therefore, apply to any prior binary Bayesian networks with just simple loops.³,⁴

³ Given a loop with multiple convergence nodes, in the prior state of the network, the parents of the convergence nodes are independent and effectively no loop is present.
⁴ Two loops in sequence may result in incorrect probabilities entering the second loop. The reflexive matrices, however, will have a similar structure as the reflexive matrices derived in this section.

8 Discussion

Loopy propagation refers to the application of Pearl's propagation algorithm for exact reasoning with singly connected Bayesian networks to networks with loops. In previous research we identified two different types of error that may arise in the probabilities computed by the algorithm. Cycling errors result from the cycling of information and arise in loop nodes as soon as for each convergence node of the loop, either the node itself, or one of its descendents is observed. Convergence errors result from combining information from dependent nodes as if they were independent and may arise at convergence nodes. This second error type is found both in a network's prior and posterior state. Loopy propagation has also been studied by the analysis of the performance of an equivalent algorithm in pairwise Markov networks with just a simple loop. According to this analysis all errors result from the cycling of information and on first sight there is no equivalent for the convergence error. We investigated how the convergence error yet is embedded in the analysis of loopy propagation in Markov networks. We did so by constructing the simplest situation in which a convergence error may occur, and analysing this situation in the equivalent Markov network. We found that the convergence error in the Bayesian network is converted to a cycling error in the equivalent Markov network. Furthermore, we found that the prior convergence error is characterised by the fact that the relationship between the exact probabilities and the approximate probabilities yielded by loopy propagation, as established by Weiss, can not be derived for the loop node in which this error occurs. We then argued that these results are valid for binary Bayesian networks with just simple loops in general.

Acknowledgements

This research was partly supported by the Netherlands Organisation for Scientific Research (NWO).

References

J.H. Bolt, L.C. van der Gaag. 2004. The convergence error in loopy propagation. Paper presented at the International Conference on Advances in Intelligent Systems: Theory and Applications.

G.F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393-405.

P. Dagum, M. Luby. 1993. Approximate inference in Bayesian networks is NP hard. Artificial Intelligence, 60:141-153.

K. Murphy, Y. Weiss, M. Jordan. 1999. Loopy belief propagation for approximate inference: an empirical study. In 15th Conference on Uncertainty in Artificial Intelligence, pages 467-475.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Palo Alto.

Y. Weiss. 2000. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12:1-41.

Y. Weiss, W.T. Freeman. 2001. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13:2173-2200.
Preprocessing the MAP problem
Janneke H. Bolt and Linda C. van der Gaag
Department of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
{janneke,linda}@cs.uu.nl

Abstract
The MAP problem for Bayesian networks is the problem of finding for a set of variables an
instantiation of highest posterior probability given the available evidence. The problem is
known to be computationally infeasible in general. In this paper, we present a method for
preprocessing the MAP problem with the aim of reducing the runtime requirements for
its solution. Our method exploits the concepts of Markov and MAP blanket for deriving
partial information about a solution to the problem. We investigate the practicability of
our preprocessing method in combination with an exact algorithm for solving the MAP
problem for some real Bayesian networks.

1 Introduction

Upon reasoning with a Bayesian network, often a best explanation is sought for a given set of observations. Given the available evidence, such an explanation is an instantiation of highest probability for some subset of the network's variables. The problem of finding such an instantiation has an unfavourable computational complexity. If the subset of variables for which a most likely instantiation is to be found includes just a single variable, then the problem, which is known as the Pr problem, is NP-hard in general. A similar observation holds for the MPE problem in which an instantiation of highest probability is sought for all unobserved variables. These two problems can be solved in polynomial time, however, for networks of bounded treewidth. If the subset of interest is a non-singleton proper subset of the set of unobserved variables, on the other hand, the problem, which is then known as the MAP problem, remains NP-hard even for networks for which the other two problems can be feasibly solved (Park and Darwiche, 2002).

By performing inference in a Bayesian network under study and establishing the most likely value for each variable of interest separately, an estimate for a solution to the MAP problem may be obtained. There is no guarantee in general, however, that the values in the resulting joint instantiation indeed correspond to the values of the variables in a solution to the MAP problem. In this paper, we now show that, for some of the variables of interest, the computation of marginal posterior probabilities may in fact provide exact information about their value in a solution to the MAP problem. We show more specifically that, by building upon the concept of Markov blanket, some of the variables may be fixed to a particular value; for some of the other variables of interest, moreover, values may be excluded from further consideration. We further introduce the concept of MAP blanket that serves to provide similar information.

Deriving partial information about a solution to the MAP problem by building upon the concepts of Markov and MAP blanket, can be exploited as a preprocessing step before the problem is actually solved with any available algorithm. The derived information in essence serves to reduce the search space for the problem and thereby reduces the algorithm's runtime requirements. We performed an initial study of the practicability of our preprocessing method by solving MAP problems for real networks using an exact branch-and-bound
algorithm, and found that preprocessing can be profitable.

The paper is organised as follows. In Section 2, we provide some preliminaries on the MAP problem. In Section 3, we present two propositions that constitute the basis of our preprocessing method. In Section 4, we provide some preliminary results about the practicability of our preprocessing method. The paper is ended in Section 5 with our concluding observations.

2 The MAP problem

Before reviewing the MAP problem, we introduce our notational conventions. A Bayesian network is a model of a joint probability distribution Pr over a set of stochastic variables, consisting of a directed acyclic graph and a set of conditional probability distributions. We denote variables by upper-case letters (A) and their values by (indexed) lower-case letters (a_i); sets of variables are indicated by bold-face upper-case letters (A) and their instantiations by bold-face lower-case letters (a). Each variable is represented by a node in the digraph; (conditional) independence between the variables is encoded by the digraph's set of arcs according to the d-separation criterion (Pearl, 1988). The Markov blanket B of a variable A consists of its neighbours in the digraph plus the parents of its children. Given its Markov blanket, the variable is independent of all other variables in the network. The strengths of the probabilistic relationships between the variables are captured by conditional probability tables that encode for each variable A the conditional distributions Pr(A | p(A)) given its parents p(A).

Upon reasoning with a Bayesian network, often a best explanation is sought for a given set of observations. Given evidence o for a subset of variables O, such an explanation is an instantiation of highest probability for some subset M of the network's variables. The set M is called the MAP set for the problem; its elements are called the MAP variables. An instantiation m of highest probability to the set M is termed a MAP solution; the value that is assigned to a MAP variable in a solution m is called its MAP value. Dependent upon the size of the MAP set, we distinguish between three different types of problem. If the MAP set includes just a single variable, the problem of finding the best explanation for a set of observations reduces to establishing the most likely value for this variable from its marginal posterior probability distribution. This problem is called the Pr problem as it essentially amounts to performing standard inference (Park and Darwiche, 2001). In the second type of problem, the MAP set includes all non-observed variables. This problem is known as the most probable explanation or MPE problem. In this paper, we are interested in the third type of problem, called the MAP problem, in which the MAP set is a non-singleton proper subset of the set of non-observed variables of the network under study. This problem amounts to finding an instantiation of highest probability for a designated set of variables of interest.

We would like to note that the MAP problem is more complex in essence than the other two problems. The Pr problem and the MPE problem both are NP-hard in general and are solvable in polynomial time for Bayesian networks of bounded treewidth. The MAP problem is NP^PP-hard in general and remains NP-hard for these restricted networks (Park, 2002).
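To fix the terminology, the three problem types can be read off directly from the definitions above. The fragment below is a naive, exponential-time illustration over an explicitly given joint distribution; it is in no way the algorithm employed later in this paper, and the function and argument names are ad hoc.

from itertools import product

def brute_force_map(joint, variables, map_set, evidence):
    # joint: dict mapping full instantiations (tuples ordered as in `variables`) to probabilities;
    # returns a MAP solution and its (unnormalised) probability Pr(m, o).
    domains = {V: sorted({inst[variables.index(V)] for inst in joint}) for V in variables}
    best, best_pr = None, -1.0
    for values in product(*(domains[V] for V in map_set)):
        m = dict(zip(map_set, values))
        fixed = {**m, **evidence}
        pr = sum(prob for inst, prob in joint.items()
                 if all(inst[variables.index(V)] == v for V, v in fixed.items()))
        if pr > best_pr:
            best, best_pr = m, pr
    return best, best_pr

With the MAP set equal to all non-observed variables no summation remains and the routine computes an MPE; with a singleton MAP set it reduces to ordinary marginal inference, mirroring the three problem types distinguished above.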
3 Fixing MAP values

By performing inference in a Bayesian network and solving the Pr problem for each MAP variable separately, an estimate for a MAP solution may be obtained. There is no guarantee in general, however, that the value with highest marginal probability for a variable corresponds with its value in a MAP solution. We now show that, for some variables, the computation of marginal probabilities may in fact provide exact information about their MAP values.

The first property that we will exploit in the sequel, builds upon the concept of Markov blanket. We consider a MAP variable H and its associated Markov blanket. If a specific value hi of H has highest probability in the marginal distribution over H for all possible


instantiations of the blanket, then hi will be the value of H in a MAP solution for any MAP problem including H. Alternatively, if some value hj never has highest marginal probability, then this value cannot be included in any solution.

Proposition 1. Let H be a MAP variable in a Bayesian network and let B be its Markov blanket. Let hi be a specific value of H.

1. If Pr(hi | b) ≥ Pr(hk | b) for all values hk of H and all instantiations b of B, then hi is the value of H in a MAP solution for any MAP problem that includes H.

2. If there exist values hk of H with Pr(hi | b) < Pr(hk | b) for all instantiations b of B, then hi is not the MAP value of H in any solution to a MAP problem that includes H.

Proof. We prove the first property stated in the proposition; the proof of the second property builds upon similar arguments.

We consider an arbitrary MAP problem with the MAP set {H} ∪ M and the evidence o for the observed variables O. Finding a solution to the problem amounts to finding an instantiation to the MAP set that maximises the posterior probability Pr(H, M | o). We have that

  Pr(hi, M | o) = Σ_b Pr(hi | b, o) Pr(M | b, o) Pr(b | o)

For the posterior probability Pr(hk, M | o) an analogous expression is found. Now suppose that for the value hi of H we have that Pr(hi | b) ≥ Pr(hk | b) for all values hk of H and all instantiations b of B. Since B is the Markov blanket of H, we have that Pr(hi | b) ≥ Pr(hk | b) implies Pr(hi | b, o) ≥ Pr(hk | b, o) for all values hk of H and all instantiations b of B. We conclude that Pr(hi, M | o) ≥ Pr(hk, M | o) for all hk. The value hi of H thus is included in a solution to the MAP problem under study. Since the above considerations are algebraically independent of the MAP variables M and of the evidence o, this property holds for any MAP problem that includes the variable H. □

[Figure 1: An example directed acyclic graph over the nodes I, A, B, H, C, D, E, F and G.]

The above proposition provides for preprocessing a MAP problem. Prior to actually solving the problem, some of the MAP variables may be fixed to a particular value using the first property. With the second property, moreover, various values of the MAP variables may be excluded from further consideration. By building upon the proposition, therefore, the search space for the MAP problem is effectively reduced. We illustrate this with an example.

Example 1. We consider a Bayesian network with the graphical structure from Figure 1. Suppose that H is a ternary MAP variable with the values h1, h2 and h3. The Markov blanket B of H consists of the three variables A, C and D. Now, if for any instantiation b of these variables we have that Pr(h1 | b) ≥ Pr(h2 | b) and Pr(h1 | b) ≥ Pr(h3 | b), then h1 occurs in a MAP solution for any MAP problem that includes H. By fixing the variable H to the value h1, the search space of any such problem is reduced by a factor 3. We consider, as an example, the MAP set {H, E, G, I} of ternary variables. Without preprocessing, the search space includes 3^4 = 81 possible instantiations. By fixing the variable H to h1, the search space of the problem reduces to 3^3 = 27 instantiations. □
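The check that Proposition 1 calls for is easily organised around whatever inference routine is at hand. The sketch below assumes a hypothetical function marginal(b) that returns the distribution Pr(H | b) for an instantiation b of the Markov blanket (in practice a restricted inward propagation to H); it returns which value, if any, can be fixed and which values can be excluded. The same routine serves for Proposition 2 when marginal is taken to return Pr(H | k, o) for instantiations k of the MAP blanket.

from itertools import product

def fix_or_exclude(H_values, blanket_domains, marginal):
    # blanket_domains: dict mapping each blanket variable to its list of values;
    # marginal(b): Pr(H | b) as a dict {h: probability}, a stand-in for the inference engine.
    dists = [marginal(dict(zip(blanket_domains, combo)))
             for combo in product(*blanket_domains.values())]
    # property 1: h_i has highest probability for every instantiation of the blanket
    fixable = [hi for hi in H_values
               if all(d[hi] >= d[hk] for d in dists for hk in H_values)]
    # property 2: some h_k is strictly better than h_i for every instantiation
    excluded = {hi for hi in H_values
                if any(all(d[hi] < d[hk] for d in dists) for hk in H_values)}
    return (fixable[0] if fixable else None), excluded

For the variable H of Example 1, blanket_domains would hold the values of A, C and D, and a returned fixed value h1 licenses the reduction of the search space from 81 to 27 instantiations described above.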
Establishing whether or not the properties from Proposition 1 can be used for a specific MAP variable, requires a number of computations that is exponential in the size of the variable's Markov blanket. The computations required, however, are highly local. A single restricted inward propagation for each instantiation of the Markov blanket to the variable of interest suffices. Since the proposition moreover holds for


any MAP problem that includes the variable, the computational burden involved is amortised over all future MAP computations.

The second proposition that we will exploit for preprocessing a MAP problem, builds upon the new concept of MAP blanket. We consider a problem with the MAP variables {H} ∪ M. A MAP blanket K of H now is a minimal subset K ⊆ M such that K d-separates H from M \ K given the available evidence. Now, if a specific value hi of H has highest probability in the marginal distribution over H given the evidence for all possible instantiations of K, then hi will be the MAP value of H in a solution to the MAP problem under study. Alternatively, if some value hj never has highest marginal probability, then this value cannot be a MAP value in any of the problem's solutions.

Proposition 2. Let {H} ∪ M be the MAP set of a given MAP problem for a Bayesian network and let o be the evidence that is available for the observed variables O. Let K be the MAP blanket for the variable H given o, and let hi be a specific value of H.

1. If Pr(hi | k, o) ≥ Pr(hk | k, o) for all values hk of H and all instantiations k to K, then hi is the MAP value of H in a solution to the given MAP problem.

2. If there exist values hk of H with Pr(hi | k, o) < Pr(hk | k, o) for all instantiations k to K, then hi is not the MAP value of H in any solution to the given MAP problem.

The proof of the proposition is relatively straightforward, building upon similar arguments as the proof of Proposition 1.

The above proposition again provides for preprocessing a MAP problem. Prior to actually solving the problem, the values of some variables may be fixed and other values may be excluded for further consideration. The proposition therefore again serves to effectively reduce the search space for the problem under study. While the information derived from Proposition 1 holds for any MAP problem including H, however, Proposition 2 provides information for any problem in which H has a subset of K for its MAP blanket and with matching evidence for the observed variables that are not d-separated from H by K. The information derived from Proposition 2, therefore, is more restricted in scope than that from Proposition 1.

We illustrate the application of Proposition 2 for our example network.

Example 2. We consider again the MAP set {H, E, G, I} for the Bayesian network from Example 1. In the absence of any evidence, the MAP blanket of the variable H includes just the variable I. Now, if for each value i of I we have that Pr(h1 | i) ≥ Pr(h2 | i) and Pr(h1 | i) ≥ Pr(h3 | i), then the value h1 occurs in a solution to the given MAP problem. The search space for actually solving the problem thus again is reduced from 81 to 27. □

Establishing whether or not the properties from Proposition 2 can be used for a specific MAP variable, requires a number of computations that is exponential in the size of the variable's MAP blanket. The size of this blanket is strongly dependent on the network's connectivity and on the location of the various MAP variables and observed variables in the network. The MAP blanket can in fact be larger in size than the Markov blanket of the variable. The computations required, moreover, are less local than those required for Proposition 1 and can involve full inward propagations to the variable of interest. Since the proposition in addition applies to just a restricted class of MAP problems, the computational burden involved in its verification can be amortised over other MAP computations to a lesser extent than that involved in the verification of Proposition 1.

The general idea underlying the two propositions stated above is the same. The idea is to verify whether or not a particular value of H can be fixed or excluded as a MAP value by investigating H's marginal probability distributions given all possible instantiations of a collection of variables surrounding H. Proposition 1


While the information derived from Proposition 1 holds for any MAP problem including H, the information derived from Proposition 2 does not. In addition, verifying the conditions of Proposition 2 requires a number of computations that is exponential in the size of the variable's MAP blanket. The size of this blanket is strongly dependent on the network's connectivity and on the location of the various MAP variables and observed variables in the network. The MAP blanket can in fact be larger in size than the Markov blanket of the variable. The computations required, moreover, are less local than those required for Proposition 1 and can involve full inward propagations to the variable of interest. Since the proposition in addition applies to just a restricted class of MAP problems, the computational burden involved in its verification can be amortised over other MAP computations to a lesser extent than that involved in the verification of Proposition 1.

The general idea underlying the two propositions stated above is the same. The idea is to verify whether or not a particular value of H can be fixed or excluded as a MAP value by investigating H's marginal probability distributions given all possible instantiations of a collection of variables surrounding H. Proposition 1 uses for this purpose the Markov blanket of H. By building upon the Markov blanket, which in essence is independent of the MAP set and of the entered evidence, generally applicable statements about the values of H are found. A disadvantage of building upon the Markov blanket, however, is that maximally different distributions for H are examined, which decreases the chances of fixing or excluding values.

By taking a blanket-like collection of variables at a larger distance from the MAP variable H, the marginal distributions examined for H are likely to be less divergent, which serves to increase the chances of fixing or excluding values of H as MAP values. A major disadvantage of such a blanket, however, is that its size tends to grow with the distance from H, which will result in an infeasibly large number of instantiations to be studied. For any blanket-like collection, we observe that the MAP variables of the problem have to either be in the blanket or be d-separated from H by the blanket. Proposition 2 builds upon this observation explicitly and considers only the instantiations of the MAP blanket of the variable. The proposition thereby reduces the computations involved in its application yet retains and even further exploits the advantage of examining less divergent marginal distributions over H. Note that the values that can be fixed or excluded based on the first proposition will also be fixed or excluded based on the second proposition. It may nevertheless still be worthwhile to exploit the first proposition because, as stated before, with this proposition values can be fixed or excluded in general, and the computations required are more restricted and possibly fewer.

So far we have argued that application of the two propositions serves to reduce the search space for a MAP problem by fixing variables to particular values and by excluding other values from further consideration. We would like to mention that by fixing variables the graphical structure of the Bayesian network under study may fall apart into unconnected components, for which the MAP problem can be solved separately. We illustrate the basic idea with our running example.

Example 3. We consider again the MAP set {H, E, G, I} for the Bayesian network from Figure 1. Now suppose that the variable H can be fixed to a particular value. Then, by performing evidence absorption of this value, the graphical structure of the network falls apart into the two components {A, I, H} and {B, C, D, E, F, G}, respectively. The MAP problem then decomposes into the problem with the MAP set {I} for the first component and the problem with the MAP set {E, G} for the second component; both these problems now include the value of H as further evidence. The search space thus is further reduced from 27 to 3 + 9 = 12 instantiations to be studied.
4 Experiments

In the previous section, we have introduced a method for preprocessing the MAP problem for Bayesian networks. In this section, we perform a preliminary study of the practicability of our method by solving MAP problems for real networks using an exact algorithm. In Section 4.1 we describe the set-up of the experiments; we review the results in Section 4.2.

4.1 The Experimental Set-up

In our experiments, we study the effects of our preprocessing method on three real Bayesian networks. We first report the percentages of values that are fixed or excluded by exploiting Proposition 1. We then compare the numbers of fixed variables as well as the numbers of network propagations with and without preprocessing, upon solving various MAP problems with a state-of-the-art exact algorithm.

In our experiments, we use three real Bayesian networks with a relatively high connectivity; Table 1 reports the numbers of variables and values for these networks. The Wilson's disease network (WD) is a small network in medicine, developed for the diagnosis of Wilson's liver disease (Korver and Lucas, 1993). The classical swine fever network (CSF) is a network in veterinary science, currently under development, for the early detection of outbreaks of classical swine fever in pig herds (Geenen and Van der Gaag, 2005). The extended oesophageal cancer network (OESO+) is a moderately-sized network in medicine, which has been developed for the prediction of response to treatment of oesophageal cancer (Aleman et al., 2000). For each network, we compute MAP solutions for randomly generated MAP sets with 25% and 50% of the network's variables, respectively; for each size, five sets are generated. We did not set any evidence.

For solving the various MAP problems in our experiments, we use a basic implementation of the exact branch-and-bound algorithm available from Park and Darwiche (2003). This algorithm solves the MAP problem exactly for most networks for which the Pr and MPE problems are feasible. The algorithm constructs a depth-first search tree by choosing values for subsequent MAP variables, cutting off branches using an upper bound. Since our preprocessing method reduces the search space by fixing variables and excluding values, it essentially serves to decrease the depth of the tree and to diminish its branching factor.
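To make the effect of the preprocessing concrete, the following schematic sketch shows a depth-first branch-and-bound of the kind described above; it is not the implementation of Park and Darwiche (2003). The callbacks joint and upper_bound are hypothetical placeholders for exact evaluation of a complete assignment and for an admissible upper bound on a partial one. Preprocessing enters only through the domains argument: a fixed variable contributes a single value and excluded values are removed, which shrinks the branching factor of the search tree and, for fixed variables, effectively its depth.

def map_branch_and_bound(map_vars, domains, joint, upper_bound):
    """Schematic depth-first branch-and-bound over the MAP variables.
    `joint` scores a complete assignment; `upper_bound` bounds a partial one."""
    best = {"score": float("-inf"), "assignment": None}

    def dfs(index, partial):
        if index == len(map_vars):
            score = joint(partial)
            if score > best["score"]:
                best["score"], best["assignment"] = score, dict(partial)
            return
        if upper_bound(partial) <= best["score"]:
            return                              # prune: the branch cannot beat the incumbent
        var = map_vars[index]
        for value in domains[var]:              # fewer remaining values -> smaller branching factor
            partial[var] = value
            dfs(index + 1, partial)
            del partial[var]

    dfs(0, {})
    return best["assignment"], best["score"]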
4.2 Experimental results

In the first experiment, we established for each network the number of values that could be fixed or excluded by applying Proposition 1, that is, by studying the marginal distributions per variable given its Markov blanket. For computational reasons, we decided not to investigate variables for which the associated blanket had more than 45 000 different instantiations; this number is arbitrarily chosen. In the WD, CSF and OESO+ networks, there were 0, 8 and 15 of such variables, respectively.

The results of the first experiment are presented in Table 1.

Table 1: The numbers of fixed and excluded values.

network   #vars.   #vals.   #vars. f.     #vals. f.+e.
WD        21       56       4 (19.0%)     13 (23.2%)
CSF       41       98       7 (17.1%)     15 (15.3%)
OESO+     67       175      12 (17.9%)    27 (15.4%)

The table reports, for each network, the total number of variables, the total number of values, the number of variables for which a value could be fixed, and the number of values that could be fixed or excluded; note that if, for example, for a ternary variable a value can be fixed, then also two values can be excluded. We observe that 17.1% to 19.0% of the variables could be fixed to a particular value. The number of values that could be fixed or excluded ranges between 15.3% and 23.2%. We would like to stress that whenever a variable can be fixed to a particular value, this result is valid for any MAP problem that includes this variable. The computations involved, therefore, have to be performed only once.

In the second experiment, we compared for each network the numbers of variables that could be fixed by the two different preprocessing steps; for Proposition 2, we restricted the number of network propagations per MAP variable to four because of the limited applicability of the resulting information. We further established the numbers of network propagations of the exact branch-and-bound algorithm that were forestalled by the preprocessing.

The results of the second experiment are presented in Table 2. The table reports in the two leftmost columns, for each network, the sizes of the MAP sets used and the average number of network propagations performed without any preprocessing. In the subsequent two columns, it reports the average number of variables that could be fixed to a particular value by using Proposition 1 and the average number of network propagations performed by the branch-and-bound algorithm after this preprocessing step. In the fifth and sixth columns, the table reports the numbers obtained with Proposition 2; the sixth column in addition shows, between parentheses, the average number of propagations that are required for the application of Proposition 2. In the final two columns of the table, results for the two preprocessing steps combined are reported: the final but one column again mentions the number of variables that could be fixed to a particular value by Proposition 1; it moreover mentions the average number of variables that could be fixed to a particular value by Proposition 2 after Proposition 1 had been used. The rightmost column reports the average number of network propagations required by the branch-and-bound algorithm. We would like to note that the additional computations required for Proposition 2 are restricted network propagations; although the worst-case complexity of these propagations is the same as that of the propagations performed by the branch-and-bound algorithm, their runtime requirements may be considerably less.

From Table 2, we observe that the number of network propagations performed by the branch-and-bound algorithm grows with the number of MAP variables, as expected. For the two smaller networks, we observe in fact that the number of propagations without preprocessing equals the number of MAP variables plus one. For the larger OESO+ network, the number of network propagations performed by the algorithm is much larger than the number of MAP variables. This finding is not unexpected since the MAP problem has a high computational complexity. For the smaller networks, we further observe that each variable that is fixed by one of the preprocessing steps translates directly into a reduction of the number of propagations by one. For the OESO+ network, we find a larger reduction in the number of network propagations per fixed variable. Our experimental results thus indicate that the number of propagations required by the branch-and-bound algorithm indeed is decreased by fixing variables to their MAP value.

With respect to using Proposition 1, we observe that in all networks under study a reasonable number of variables could be fixed to a MAP value. We did not take the number of local propagations required for this proposition into consideration because the computational burden involved is amortised over future MAP computations. With respect to using Proposition 2, we observe that for all networks and all MAP sets the additional computations involved outweigh the number of network computations that are forestalled for the algorithm. This observation applies to using just Proposition 2 as well as to using the proposition after Proposition 1 has been applied. We also observe that fewer variables are fixed in the step based on Proposition 2 than in the step based on Proposition 1. This can be attributed to the limited number of propagations used in the step based on Proposition 2. We conclude that, for the networks under study, it has been quite worthwhile to use Proposition 1 as a preprocessing step before actually solving the various MAP problems. Because of the higher computational burden involved and its relative lack of additional value, the use of Proposition 2 has not been worthwhile for our networks and associated problems.

To conclude, in Section 3 we observed that fixing variables to their MAP value could serve to partition a MAP problem into smaller problems. Such a partition did not occur in our experiments. We would like to note, however, that we studied MAP problems without evidence only. We expect that in the presence of evidence MAP problems will more readily be partitioned into smaller problems.

5 Conclusions and discussion

The MAP problem for Bayesian networks is the problem of finding for a set of variables an instantiation of highest posterior probability given the available evidence. The problem has an unfavourable computational complexity, being NP^PP-hard in general. In this paper, we showed that computation of the marginal posterior probabilities of a variable H given its Markov blanket may provide exact information about its value in a MAP solution. This information is valid for any MAP problem that includes the variable H. We further showed that computation of the marginal probabilities of H given its MAP blanket may also provide exact information about its value. This information is valid, however, for a more restricted class of MAP problems. We argued that these results can be exploited for preprocessing MAP problems before they are actually solved using any state-of-the-art algorithm for this purpose.

We performed a preliminary experimental study of the practicability of the preprocessing steps by solving MAP problems for three different Bayesian networks using an exact branch-and-bound algorithm.
Table 2: The number of network propagations without and with preprocessing using Propositions 1 and 2.

                            Prop. 1                Prop. 2                       Prop. 1+2
network  #MAP  #props.      #vars. f.  #props.     #vars. f.  #props. (add.)     #vars. f.    #props. (add.)
WD       5     6.0          1.2        4.8         0.0        6.0 (0.6)          1.2 + 0.0    4.8 (0.6)
         10    11.0         2.4        8.6         1.0        10.0 (6.4)         2.4 + 0.2    8.4 (4.0)
CSF      10    11.0         2.0        9.0         1.0        10.0 (1.6)         2.0 + 0.4    8.6 (0.4)
         20    21.0         3.4        17.6        1.6        19.4 (8.2)         3.4 + 0.6    17.0 (5.0)
OESO+    17    24.8         3.4        18.4        0.8        22.0 (4.2)         3.4 + 0.0    18.4 (2.2)
         34    49.8         5.8        40.8        1.8        47.4 (7.8)         5.8 + 0.4    40.0 (4.6)
As expected, the number of network propagations required by the algorithm is effectively decreased by fixing MAP variables to their appropriate values. We found that by building upon the concept of Markov blanket for 17.1% to 19.0% of the variables a MAP value could be fixed. Since the results of this preprocessing step are applicable to all future MAP computations and the computational burden involved thus is amortised, we considered it worthwhile to perform this step for the investigated networks. We would like to add that, because the computations involved are highly local, the step may also be feasibly applied to networks that are too large for the exact MAP algorithm used in the experiments. With respect to building upon the concept of MAP blanket, we found that for the investigated networks and associated MAP problems, the computations involved outweighed the reduction in network propagations that was achieved. Since the networks in our study were comparable with respect to size and connectivity, further experiments are necessary before any definitive conclusion can be drawn with respect to this preprocessing step.

In our further research, we will expand the experiments to networks with different numbers of variables, different cardinality and diverging connectivity. More specifically, we will investigate for which types of network preprocessing is most profitable. In our future experiments we will also take the effect of evidence into account. We will further investigate if the class of MAP problems that can be feasibly solved can be extended by our preprocessing method. We hope to report further results in the near future.

Acknowledgements

This research was partly supported by the Netherlands Organisation for Scientific Research (NWO).

References

B.M.P. Aleman, L.C. van der Gaag and B.G. Taal. 2000. A Decision Model for Oesophageal Cancer: Definition Document (internal publication, in Dutch).

P.L. Geenen and L.C. van der Gaag. 2005. Developing a Bayesian network for clinical diagnosis in veterinary medicine: from the individual to the herd. In 3rd Bayesian Modelling Applications Workshop, held in conjunction with the 21st Conference on Uncertainty in Artificial Intelligence.

M. Korver and P.J.F. Lucas. 1993. Converting a rule-based expert system into a belief network. Medical Informatics, 18:219-241.

J. Park and A. Darwiche. 2001. Approximating MAP using local search. In 17th Conference on Uncertainty in Artificial Intelligence, pages 403-410.

J. Park. 2002. MAP complexity results and approximation methods. In 18th Conference on Uncertainty in Artificial Intelligence, pages 388-396.

J. Park and A. Darwiche. 2003. Solving MAP exactly using systematic search. In 19th Conference on Uncertainty in Artificial Intelligence, pages 459-468.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Palo Alto.
Quartet-Based Learning of Shallow Latent Variables
Tao Chen and Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Hong Kong, China
{csct,lzhang}@cse.ust.hk

Abstract
Hierarchical latent class (HLC) models are tree-structured Bayesian networks where leaf nodes are observed while internal nodes are hidden. We explore the following two-stage approach for learning HLC models: one first identifies the shallow latent variables, i.e. the latent variables adjacent to observed variables, and then determines the structure among the shallow and possibly some other deep latent variables. This paper is concerned with the first stage. In earlier work, we have shown how shallow latent variables can be correctly identified from quartet submodels if one could learn them without errors. In reality, one does make errors when learning quartet submodels. In this paper, we study the probability of such errors and propose a method that can reliably identify shallow latent variables despite the errors.

1 Introduction

Hierarchical latent class (HLC) models (Zhang, 2004) are tree-structured Bayesian networks where variables at leaf nodes are observed and hence are called manifest variables (nodes), while variables at internal nodes are hidden and hence are called latent variables (nodes). HLC models were first identified by Pearl (1988) as a potentially useful class of models and were first systematically studied by Zhang (2004) as a framework to alleviate the disadvantages of LC models for clustering. As a tool for cluster analysis, HLC models produce more meaningful clusters than latent class models and they allow multi-way clustering at the same time. As a tool for probabilistic modeling, they can model high-order interactions among observed variables and help one to reveal interesting latent structures behind data. They also facilitate unsupervised profiling.

Several algorithms for learning HLC models have been proposed. Among them, the heuristic single hill-climbing (HSHC) algorithm developed by Zhang and Kocka (2004) is currently the most efficient. HSHC has been used to successfully analyze, among others, the CoIL Challenge 2000 data set (van der Putten and van Someren, 2004), which consists of 42 manifest variables and 5,822 records, and a data set about traditional Chinese medicine (TCM), which consists of 35 manifest variables and 2,600 records.

In terms of running time, HSHC took 98 hours to analyze the aforementioned TCM data set on a top-end PC, and 121 hours to analyze the CoIL Challenge 2000 data set. It is clear that HSHC will not be able to analyze data sets with hundreds of manifest variables.

Aimed at developing algorithms more efficient than currently available, we explore a two-stage approach where one (1) identifies the shallow latent variables, i.e. latent variables adjacent to observed variables, and (2) determines the structure among those shallow, and possibly some other deep, latent variables. This paper is concerned with the first stage.

In earlier work (Chen and Zhang, 2005), we have shown how shallow latent variables can be correctly identified from quartet submodels if one could learn them without errors. In reality, one does make errors when learning quartet submodels. In this paper, we study the probability of such errors and propose a method that can reliably identify shallow latent variables despite the errors.

Figure 1: An example HLC model and the corresponding unrooted HLC model. The Xi's are latent nodes and the Yj's are manifest nodes.

Figure 2: Four possible quartet substructures for a quartet Q = {U, V, T, W}. The fork in (a) is denoted by [UVTW]; the dogbones in (b), (c), and (d) are denoted by [UV|TW], [UT|VW], and [UW|VT], respectively.

2 HLC Models and Shallow Latent Variables

Figure 1 (a) shows an example HLC model. Zhang (2004) has proved that it is impossible to determine, from data, the orientation of edges in an HLC model. One can learn only unrooted HLC models, i.e. HLC models with all directions on the edges dropped. Figure 1 (b) shows an example unrooted HLC model. An unrooted HLC model represents a class of HLC models. Members of the class are obtained by rooting the model at various nodes. Semantically it is a Markov random field on an undirected tree. In the rest of this paper, we are concerned only with unrooted HLC models.

In this paper, we will use the term HLC structure to refer to the set of nodes in an HLC model and the connections among them. An HLC structure is regular if it does not contain latent nodes of degree 2. Starting from an irregular HLC structure, we can obtain a regular structure by connecting the two neighbors of each latent node of degree 2 and then removing that node. This process is known as regularization. In this paper, we only consider regular HLC structures.

In an HLC model, a shallow latent variable (SLV) is one that is adjacent to at least one manifest variable. Two manifest variables are siblings if they are adjacent to the same (shallow) latent variable. For a given shallow latent variable X, all manifest variables adjacent to X constitute a sibling cluster. In the HLC structure shown in Figure 1 (b), there are 3 sibling clusters, namely {Y1}, {Y2, Y3, Y4}, and {Y5, Y6, Y7}. They correspond to the three latent variables in the model respectively.

3 Quartet-Based SLV Discovery: The Principle

A shallow latent node is defined by its relationship with its manifest neighbors. Hence to identify the shallow latent nodes means to identify sibling clusters. To identify sibling clusters, we need to determine, for each pair of manifest variables (U, V), whether U and V are siblings. In this section, we explain how to answer this question by inspecting quartet submodels.

A quartet is a set of four manifest variables, e.g., Q = {U, V, T, W}. The restriction of an HLC model structure S onto Q is obtained from S by deleting all the nodes and edges not in the paths between any pair of variables in Q. Applying regularization to the resulting HLC structure, we obtain the quartet substructure for Q, which we denote by S|Q. As shown in Figure 2, S|Q is either the fork [UVTW], or one of the dogbones [UV|TW], [UT|VW], and [UW|VT].

Consider the HLC structure in Figure 1 (b). The quartet substructure for {Y1, Y2, Y3, Y4} is the fork [Y1Y2Y3Y4], while that for {Y1, Y2, Y4, Y5} is the dogbone [Y1Y5|Y2Y4], and that for {Y1, Y2, Y5, Y6} is the dogbone [Y1Y2|Y5Y6].
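When the generative structure S is known, the quartet substructure S|Q can be read off from path lengths using the four-point condition on tree metrics. The sketch below is not taken from the paper; it assumes the structure is given as an adjacency dictionary of an undirected tree with unit-length edges, and classifies S|Q as the fork or as one of the three dogbones.

from collections import deque

def path_length(adj, source, target):
    """Breadth-first search distance between two nodes of an undirected tree."""
    seen, frontier = {source: 0}, deque([source])
    while frontier:
        node = frontier.popleft()
        if node == target:
            return seen[node]
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                frontier.append(nxt)
    raise ValueError("target not reachable")

def quartet_substructure(adj, u, v, t, w):
    """Classify S|Q for Q = {u, v, t, w}: in a tree, the pairing with the strictly
    smallest sum of path lengths is the split of the dogbone; if no pairing is
    strictly smallest, all three sums tie and S|Q is the fork."""
    pairings = [
        (path_length(adj, u, v) + path_length(adj, t, w), ((u, v), (t, w))),
        (path_length(adj, u, t) + path_length(adj, v, w), ((u, t), (v, w))),
        (path_length(adj, u, w) + path_length(adj, v, t), ((u, w), (v, t))),
    ]
    pairings.sort(key=lambda p: p[0])
    if pairings[0][0] == pairings[1][0]:      # no strict minimum -> fork [uvtw]
        return ("fork", (u, v, t, w))
    return ("dogbone", pairings[0][1])        # e.g. ((u, v), (t, w)) stands for [uv|tw]

With this helper one can, for instance, verify the statements of Theorems 1 and 2 below on small hand-built structures such as the one in Figure 1 (b).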
It is obvious that if two manifest variables U and V are siblings in the structure S, then they must be siblings in any quartet substructure that involves both of them. Chen and Zhang (2005) has proved the converse. So, we have:

Theorem 1 Suppose S is a regular HLC structure. Let (U, V) be a pair of manifest variables. Then U and V are not siblings in S iff there exist two other manifest variables T and W such that S|{U,V,T,W} is a dogbone where U and V are not siblings.

Theorem 1 indicates that we can determine whether U and V are siblings by examining all possible quartets involving both U and V. There are (n-2)(n-3)/2 such quartets, where n is the number of manifest variables. We next present a result that allows us to do the same by examining only n-3 quartets.

Let (U, V) be a pair of manifest variables and T be a third one. We use Q_UV|T to denote the following collection of quartets:

  Q_UV|T := {{U, V, T, W} | W ∈ Y\{U, V, T}},

where Y is the set of all manifest variables. T appears in every quartet in Q_UV|T and is thus called a standing member of Q_UV|T. It is clear that Q_UV|T consists of n-3 quartets. Chen and Zhang (2005) has also proved the following:

Theorem 2 Suppose S is a regular HLC structure. Let (U, V) be a pair of manifest variables and T be a third one. U and V are not siblings in S iff there exists a quartet Q ∈ Q_UV|T such that S|Q is a dogbone where U and V are not siblings.

In learning tasks, we do not know the structure S of the generative model. Where do we obtain the quartet substructures? The answer is to learn them from data. Let M be an HLC model with a regular structure S. Suppose that D is a collection of i.i.d. samples drawn from M. Each record in D contains values for the manifest variables, but not for the latent variables. Let QSL(D, Q) be a routine that takes data D and a quartet Q as inputs, and returns an HLC structure on the quartet Q. One can use the HSHC algorithm to implement QSL, and one can first project the data D onto the quartet Q when learning the substructure for Q.

Suppose QSL is error-free, i.e. QSL(D, Q) = S|Q for any quartet Q. By Theorem 2, we can determine whether two manifest variables U and V are siblings (in the generative model) as follows:

- Pick a third manifest variable T.
- For each Q ∈ Q_UV|T, call QSL(D, Q).
- If U and V are not siblings in one of the resulting substructures, then conclude that they are not siblings (in the generative model).
- If U and V are siblings in all the resulting substructures, then conclude that they are siblings (in the generative model).

We can run the above procedure on each pair of manifest variables to determine whether they are siblings. Afterwards, we can summarize all the results using a sibling graph. The sibling graph is an undirected graph over the manifest variables where two variables U and V are adjacent iff they are determined as siblings.

If QSL is error-free, then each connected component of the sibling graph should be completely connected and correspond to one latent variable. For example, if the structure of the generative model is as Figure 3 (a), then the sibling graph that we obtain will be as Figure 3 (b). There are four completely connected components, namely {Y1, Y2, Y3}, {Y4, Y5, Y6}, {Y7, Y8, Y9}, {Y10, Y11, Y12}, which respectively correspond to the four latent variables in the generative structure.
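The procedure above translates almost directly into code. The following sketch is illustrative only; qsl(data, quartet) is a hypothetical stand-in for the QSL routine (for instance HSHC run on the data projected onto the four variables), assumed to return an object whose are_siblings method reports the sibling relation in the learned quartet substructure.

def siblings_in_generative_model(data, u, v, t, manifest_vars, qsl):
    """Decide whether manifest variables u and v are siblings, using the n-3
    quartets of Q_uv|t with t as the standing member (cf. Theorem 2)."""
    for w in manifest_vars:
        if w in (u, v, t):
            continue
        substructure = qsl(data, {u, v, t, w})
        if not substructure.are_siblings(u, v):
            return False        # one dogbone separating u and v suffices
    return True                 # u and v are siblings in every learned substructure

Running this test for every pair of manifest variables and adding an edge whenever it returns True yields exactly the sibling graph described above.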
4 Probability of Learning Quartet Submodels Correctly

In the previous section, we assumed that QSL is error-free. In reality, one does make mistakes when learning quartet submodels. We have empirically studied the probability of such errors.

For our experiments, QSL was implemented using the HSHC algorithm. For model selection, we tried each of the scoring functions, namely BIC (Schwarz, 1978), BICe (Kocka and Zhang, 2002), AIC (Akaike, 1974), and the Cheeseman-Stutz (CS) score (Cheeseman and Stutz, 1995).

We randomly generated around 20,000 quartet models. About half of them are fork-structured, while the rest are dogbone-structured. The cardinalities of the variables range from 2 to 5. From each of the models, we sampled data sets of size 500, 1,000, 2,500, 5,000. The QSL was then used to analyze the data sets. In all the experiments, QSL produced either forks or dogbones. Consequently, there are only three classes of errors:

F2D: The generative model was a fork, and QSL produced a dogbone.

D2F: The generative model was a dogbone, and QSL produced a fork.

D2D: The generative model was a dogbone, and QSL produced a different dogbone.

The statistics are shown in Table 1. To understand the meaning of the numbers, consider the number 0.83% at the upper-left corner of the top table. It means that, when the sample size was 500, QSL returned dogbones in 0.83%, or 83, of the 10,011 cases where the generative models were forks. In all the other cases, QSL returned the correct fork structure.

Table 1: Percentage of times that QSL produced the wrong quartet structure. The table on the top is for the fork-structured generative models, while the table at the bottom is for the dogbone-structured generative models.

        500 (F2D)   1000 (F2D)   2500 (F2D)   5000 (F2D)
BIC     0.83%       1.87%        3.70%        3.93%
BICe    1.42%       3.16%        6.58%        7.86%
AIC     12.8%       10.2%        6.81%        4.85%
CS      6.20%       5.05%        6.19%        6.41%
Total = 10,011

            500              1000             2500             5000
        D2F      D2D     D2F      D2D     D2F      D2D     D2F      D2D
BIC     95.3%    0.00%   87.0%    0.00%   68.2%    0.00%   51.5%    0.00%
BICe    95.0%    0.05%   86.8%    0.00%   67.8%    0.04%   50.6%    0.04%
AIC     71.2%    2.71%   60.5%    1.58%   46.7%    0.54%   39.0%    0.38%
CS      88.5%    1.93%   83.4%    0.74%   67.0%    0.30%   51.5%    0.13%
Total = 10,023

It is clear from the tables that the probability of D2F errors is large, the probability of F2D errors is small, and the probability of D2D errors is very small, especially when BIC or BICe are used for model selection. Also note that the probability of D2F errors decreases with sample size, but that of F2D errors does not. In the next section, we will use those observations when designing an algorithm for identifying shallow latent variables.

In terms of comparison among the scoring functions, BIC and BICe are clearly preferred over the other two as far as F2D and D2D errors are concerned. It is interesting to observe that BICe, although proposed as an improvement to BIC, is not as good as BIC when it comes to learning quartet models. For the rest of this paper, we use BIC.

5 Quartet-Based SLV Discovery: An Algorithm

The quartet-based approach to SLV discovery consists of three steps: (1) learn quartet submodels, (2) determine sibling relationships among manifest variables and hence obtain a sibling graph, and (3) introduce SLVs based on the sibling graph. Three questions ensue:

i) Which quartets should we use in Step 1?

ii) How do we determine sibling relationships in Step 2 based on results from Step 1?

iii) How do we introduce SLVs in Step 3 based on the sibling graph constructed in Step 2?

Our answer to the second question is simple: two manifest variables are regarded as non-siblings if they are not siblings in one of the quartet submodels. In the next two subsections, we discuss the other two questions.

5.1 SLV Introduction

As seen in Section 3, when QSL is error-free, the sibling graph one obtains has a nice property: every connected component is a fully connected subgraph. In this case, the rule for SLV introduction is obvious:

  Introduce one latent variable for each connected component.

When QSL is error-prone, the sibling graph one obtains no longer has the aforementioned property. Suppose data are sampled from a model with the structure shown in Figure 3 (a). Then what one obtains might be the graphs (c), (d), or (e) instead of (b).

Nonetheless, we still use the above SLV introduction rule for the general case. We choose it for its simplicity, and for the lack of better alternatives. This choice also endows the SLV discovery algorithm being developed with error tolerance abilities.
Figure 3: Generative model (a), sibling graphs (b, c, d, e), and shallow latent variables (f).

There are three types of mistakes that one can make when introducing SLVs, namely latent omission, latent commission, and misclustering. In the example shown in Figure 3, if we introduce SLVs based on the sibling graph (c), then we will introduce three latent variables. Two of them correspond respectively to the latent variables X1 and X2 in the generative model, while the third corresponds to a merge of X3 and X4. So, one latent variable is omitted.

If we introduce SLVs based on the sibling graph (d), then we will introduce five latent variables. Three of them correspond respectively to X1, X3, and X4, while the other two are both related to X2. So, one latent variable is commissioned.

If we introduce SLVs based on the sibling graph (e), then we will introduce four latent variables. Two of them correspond respectively to X1 and X4. The other two are related to X2 and X3, but there is no clear correspondence. This is a case of misclustering.

We next turn to Question 1. There, the most important concern is how to minimize errors.

5.2 Quartet Selection

To determine whether two manifest variables U and V are siblings, we can consider all quartets in Q_UV|T, i.e. all the quartets with a third variable T as a standing member. This selection of quartets will be referred to as the parsimonious selection. There are only n-3 quartets in the selection.

If QSL were error-free, one could use Q_UV|T to correctly determine whether U and V are siblings. As an example, suppose data are sampled from a model with the structure shown in Figure 3 (a), and we want to determine whether Y9 and Y11 are siblings based on data. Further suppose Y4 is picked as the standing member. If QSL is error-free, we get 4 dogbones where Y9 and Y11 are not siblings, namely [Y9Y7|Y11Y4], [Y9Y8|Y11Y4], [Y9Y4|Y11Y10], [Y9Y4|Y11Y12], and hence conclude that Y9 and Y11 are not siblings.

In reality, QSL does make mistakes. According to Section 4, the probability of D2F errors is quite high. There is therefore a good chance that, instead of the aforementioned 4 dogbones, we get 4 forks. In that case, Y9 and Y11 will be regarded as siblings, resulting in a fake edge in the sibling graph.

Q_UV|T represents one extreme when it comes to quartet selection. The other extreme is to use all the quartets that involve both U and V. This selection of quartets will be referred to as the generous selection. Suppose U and V are non-siblings in the generative model. This generous choice will reduce the effects of D2F errors. As a matter of fact, there are now many more quartets, when compared with the case of parsimonious quartet selection, for which the true structures are dogbones with U and V being non-siblings. If we learn one of those structures correctly, we will be able to correctly identify U and V as non-siblings.

The generous selection also comes with a drawback. In our running example, consider the task of determining whether Y1 and Y2 are siblings based on data. There are 36 quartets
that involve Y1 and Y2 and for which the true structures are forks. Those include [Y1Y2Y4Y7], [Y1Y2Y4Y10], and so on. According to Section 4, the probability for QSL to make an F2D mistake on any one of those quartets is small. But there are 36 of them. There is a good chance for QSL to make an F2D mistake on one of them. If the structure QSL learned for [Y1Y2Y4Y7], for instance, turns out to be the dogbone [Y1Y4|Y2Y7], then Y1 and Y2 will be regarded as non-siblings, and hence the edge between Y1 and Y2 will be missing from the sibling graph.

Those discussions point to the middle ground between parsimonious and generous quartet selection. One natural way to explore this middle ground is to use several standing members instead of one.

5.3 The Algorithm

Figure 4 shows an algorithm for discovering shallow latent variables, namely DiscoverSLVs. The algorithm first calls a subroutine ConstructSGraph to construct a sibling graph, and then finds all the connected components of the graph. It is understood that one latent variable is introduced for each of the connected components.

The subroutine ConstructSGraph starts from the complete graph. For each pair of manifest variables U and V, it considers all the quartets that involve U, V, and one of the m standing members Ti (i = 1, 2, ..., m). QSL is called to learn a submodel structure for each of the quartets. If U and V are not siblings in one of the quartet substructures, the edge between U and V is deleted.

Algorithm DiscoverSLVs(D, m):
1. G <- ConstructSGraph(D, m).
2. return the list of the connected components of G.

Algorithm ConstructSGraph(D, m):
1. G <- complete graph over manifest nodes Y.
2. for each edge (U, V) of G,
3.   pick {T1, ..., Tm} from Y\{U, V}
4.   for each Q in the union of Q_UV|Ti, i = 1, ..., m,
5.     if QSL(D, Q) is a dogbone in which U and V are not siblings,
6.       delete edge (U, V) from G; break.
7.   endFor.
8. endFor.
9. return G.

Figure 4: An algorithm for learning SLVs.

DiscoverSLVs has an error tolerance mechanism naturally built in. This is mainly because it regards connected components in the sibling graph as sibling clusters. Let C be a sibling cluster in the generative model. The vertices in C will be placed in one cluster by DiscoverSLVs if they are in the same connected component in the sibling graph G produced by ConstructSGraph. It is not required for variables in C to be pairwise connected in G. Therefore, a few mistakes by ConstructSGraph when determining sibling relationships are tolerated. When the set C is not very small, it takes a few or more mistakes in the right combination to break up C.
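For concreteness, the pseudocode of Figure 4 can be written out as follows. This is a sketch under the same assumptions as before: qsl(data, quartet) is a hypothetical quartet-structure learner with an are_siblings method, the choice of the m standing members is left arbitrary here, and the actual implementation used in the experiments below is in Java.

from itertools import islice

def construct_sibling_graph(data, manifest_vars, m, qsl):
    """Start from the complete graph and delete an edge (u, v) as soon as one
    learned quartet substructure is a dogbone separating u and v (Figure 4)."""
    edges = {frozenset((u, v)) for i, u in enumerate(manifest_vars)
             for v in manifest_vars[i + 1:]}
    for edge in list(edges):
        u, v = tuple(edge)
        # Arbitrary choice of the m standing members T1, ..., Tm.
        standing = list(islice((t for t in manifest_vars if t not in edge), m))
        quartets = {frozenset((u, v, t, w))
                    for t in standing
                    for w in manifest_vars if w not in (u, v, t)}
        for quartet in quartets:
            if not qsl(data, quartet).are_siblings(u, v):
                edges.discard(edge)            # a separating dogbone was found
                break
    return edges

def discover_slvs(data, manifest_vars, m, qsl):
    """Return the sibling clusters: one connected component of the sibling graph
    per discovered shallow latent variable."""
    edges = construct_sibling_graph(data, manifest_vars, m, qsl)
    parent = {v: v for v in manifest_vars}     # union-find over the manifest nodes
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in map(tuple, edges):
        parent[find(u)] = find(v)
    clusters = {}
    for v in manifest_vars:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())

Deleting an edge as soon as one separating dogbone is found mirrors line 6 of ConstructSGraph; the union-find pass at the end returns one cluster, and hence one shallow latent variable, per connected component.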
6 Empirical Evaluation

We have carried out simulation experiments to evaluate the ability of DiscoverSLVs in discovering latent variables. This section describes the setup of the experiments and reports our findings.

The generative models in the experiments share the same structure. The structure consists of 39 manifest variables and 13 latent variables. The latent variables form a complete 3-ary tree of height two. Each latent variable in the structure is connected to 3 manifest variables. Hence all latent variables are shallow. The cardinalities of all variables were set at 3.

We created 10 generative models from the structure by randomly assigning parameter values. From each of the 10 generative models, we sampled 5 data sets of 500, 1,000, 2,500, 5,000 and 10,000 records. DiscoverSLVs was run on each of the data sets three times, with the number of standing members m set at 1, 3 and 5 respectively. The algorithms were implemented in Java and all experiments were run on a Pentium 4 PC with a clock rate of 3.2 GHz.

The performance statistics are summarized in Table 2. They consist of errors at three different levels of the algorithm: the errors made when learning quartet substructures (QSL-level), the errors made when determining sibling relationships between manifest variables (edge-level), and the errors made when introducing SLVs (SLV-level). Each number in the table is an average over the 10 generative models. The corresponding standard deviations are given in parentheses.

Table 2: Performance statistics of our SLV discovery algorithm.

     QSL-level                                   Edge-level               SLV-level
m    D2F           D2D            F2D            ME         FE           LO          LCO        MC
Sample size = 500
1    77.0% (4.5%)  0.06% (0.05%)  0.40% (0.13%)  1.0 (1.1)  88.2 (9.2)   11 (3.0)    0 (0)      0 (0)
3    71.8% (4.0%)  0.05% (0.03%)  0.36% (0.14%)  2.9 (1.3)  28.0 (8.3)   9.3 (3.1)   0 (0)      0.1 (0.3)
5    68.8% (3.0%)  0.05% (0.02%)  0.30% (0.14%)  3.3 (1.6)  14.7 (6.0)   5.1 (1.9)   0.2 (0.4)  0 (0)
Sample size = 1000
1    65.3% (6.2%)  0.01% (0.03%)  0.17% (0.14%)  0.3 (0.6)  55.3 (15.7)  11.2 (0.7)  0 (0)      0 (0)
3    58.3% (2.7%)  0.02% (0.02%)  0.10% (0.07%)  0.7 (0.9)  10.1 (6.5)   4.8 (2.9)   0 (0)      0 (0)
5    56.6% (3.2%)  0.01% (0.01%)  0.10% (0.08%)  1.1 (0.7)  6.2 (3.8)    2.3 (1.3)   0.1 (0.3)  0 (0)
Sample size = 2500
1    50.1% (2.3%)  0.00% (0.00%)  0.04% (0.07%)  0.2 (0.4)  20.4 (8.9)   8.6 (2.3)   0 (0)      0 (0)
3    38.0% (4.6%)  0.00% (0.00%)  0.02% (0.04%)  0.2 (0.4)  3.1 (3.7)    1.4 (1.4)   0 (0)      0 (0)
5    37.3% (3.8%)  0.00% (0.01%)  0.07% (0.08%)  1.1 (1.4)  1.3 (2.5)    1.5 (0.9)   0.1 (0.3)  0 (0)
Sample size = 5000
1    32.6% (5.5%)  0.00% (0.00%)  0.03% (0.09%)  0 (0)      7.7 (6.0)    3.9 (2.4)   0 (0)      0 (0)
3    25.2% (4.4%)  0.01% (0.01%)  0.04% (0.06%)  0.3 (0.5)  1.5 (2.7)    0.5 (0.7)   0 (0)      0 (0)
5    25.7% (3.9%)  0.00% (0.01%)  0.10% (0.11%)  0.9 (0.9)  0.9 (2.7)    0.1 (0.3)   0 (0)      0 (0)
Sample size = 10000
1    21.8% (5.9%)  0.00% (0.00%)  0.02% (0.07%)  0 (0)      2.0 (3.3)    1.2 (1.8)   0 (0)      0 (0)
3    17.5% (3.9%)  0.00% (0.00%)  0.05% (0.06%)  0.2 (0.4)  0.4 (1.2)    0.1 (0.3)   0 (0)      0 (0)
5    17.4% (4.1%)  0.00% (0.01%)  0.03% (0.04%)  0.6 (0.8)  0.1 (0.3)    0.1 (0.3)   0 (0)      0 (0)

QSL-level errors: We see that the probabilities of the QSL-level errors are significantly smaller than those reported in Section 4. This is because we deal with strong dependency models here, while the numbers in Section 4 are about general models. This indicates that the strong dependency assumption does make learning easier. On the other hand, the trends remain the same: the probability of D2F errors is large, that of F2D errors is small, and that of D2D errors is very small. Moreover, the probability of D2F errors decreases with sample size.

Edge-level errors: For the edge-level, the numbers of missing edges (ME) and the number of fake edges (FE) are reported. We see that the number of missing edges is always small, and in general it increases with the number of standing members m. This is expected since the larger m is, the more quartets one examines, and hence the more likely one makes F2D errors.

The number of fake edges is large when the sample size is small and m is small. In general, it decreases with sample size and m. It dropped to 1.3 for the case of sample size 2,500 and m = 5. This is also expected. As m increases, the number of quartets examined also increases. For two manifest variables U and V that are not siblings in the generative model, the probability of obtaining a dogbone (despite D2F errors) where U and V are not siblings also increases. The number of fake edges decreases with m because as m increases, the probability of D2F errors decreases.

SLV-level errors: We now turn to SLV-level errors. Because there were not many missing edges, true sibling clusters of the generative models were almost never broken up. There are only five exceptions. The first exception happened for one generative model in the case of sample size 500 and m = 3. In that case, a manifest variable from one true cluster was placed into another, resulting in one misclustering error (MC).

The other four exceptions happened for the following combinations of sample size and m: (500, 5), (1000, 5), (2500, 5). In those cases, one true sibling cluster was broken into two clusters, resulting in four latent commission errors (LCO).

Fake edges cause clusters to merge and hence lead to latent omission errors. In our experiments, the true clusters were almost never broken up. Hence a good way to measure latent omission errors is to use the total number of shallow latent variables, i.e. 13, minus the number of clusters returned by DiscoverSLVs. We call this the number of LO errors.
In Table 2, we see that the number of LO errors decreases with sample size and the number of standing members m. It dropped to 1.4 for the case of sample size 2,500 and m = 3. When the sample size was increased to 10,000 and m set to 3 or 5, LO errors occurred only once for one of the 10 generative models.

Running time: The running times of DiscoverSLVs are summarized in the following table (in hours). For the sake of comparison, we also include the times HSHC took attempting to reconstruct the generative models based on the same data as used by DiscoverSLVs. We see that DiscoverSLVs took only a small fraction of the time that HSHC took, especially in the cases with large samples. This indicates that the two-stage approach that we are exploring can result in algorithms significantly more efficient than HSHC.

Running time (hrs)   500    1000   2500   5000   10000
m=1                  1.05   1.06   0.92   0.89   0.84
m=3                  1.57   1.52   1.49   1.79   1.91
m=5                  1.85   2.07   2.17   2.72   3.02
HSHC                 4.65   8.69   23.7   43.0   118.6

The numbers of quartets examined by DiscoverSLVs are summarized in the following table. We see that DiscoverSLVs examined only a small fraction of all the C(39, 4) = 82,251 possible quartets. We also see that doubling m does not imply doubling the number of quartets examined, nor doubling the running time. Moreover, the number of quartets examined decreases with sample size. This is because the probability of D2F errors decreases with sample size.

                     500    1000   2500   5000   10000
m=1                  5552   4191   2861   2131   1816
m=3                  8326   5939   4659   4292   4112
m=5                  9801   8178   6776   6536   6496
Total: 82,251

7 Related Work

Linear latent variable graphs (LLVGs) are a special class of structural equation models. Variables in such models are continuous. Some are observed, while others are latent. Silva et al. (2003) have studied the problem of identifying SLVs in LLVGs. Their approach is based on the Tetrad constraints (Spirtes et al., 2000), and its complexity is O(n^6).

Phylogenetic trees (PT) (St. John et al., 2003) can be viewed as a special class of HLC models. The quartet-based method for PT reconstruction first learns submodels for all C(n, 4) possible quartets and then uses them to build the overall tree (St. John et al., 2003).

8 Conclusions

We have empirically studied the probability of making errors when learning quartet HLC models and have observed some interesting regularities. In particular, we observed that the probability of D2F errors is high and decreases with sample size, while the probability of F2D errors is low. Based on those observations, we have developed an algorithm for discovering shallow latent variables in HLC models reliably and efficiently.

Acknowledgements

Research on this work was supported by Hong Kong Grants Council Grant #622105.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control.

Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining.

Chen, T. and Zhang, N.L. (2006). Quartet-Based Learning of HLC Models: Discovery of Shallow Latent Variables. In AIMath-06.

Kocka, T. and Zhang, N.L. (2002). Dimension correction for hierarchical latent class models. In UAI-02.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics.

Silva, R., Scheines, R., Glymour, C. and Spirtes, P. (2003). Learning measurement models for unobserved variables. In UAI-03.

Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction and Search.

St. John, K., Warnow, T., Moret, B.M.E. and Vawter, L. (2003). Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. Journal of Algorithms.

van der Putten, P. and van Someren, M. (2004). A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000. Machine Learning.

Zhang, N.L. (2004). Hierarchical latent class models for cluster analysis. JMLR.

Zhang, N.L. and Kocka, T. (2004). Efficient learning of hierarchical latent class models. In ICTAI-04.
Continuous Decision MTE Influence Diagrams

Barry R. Cobb
Department of Economics and Business
Virginia Military Institute
Lexington, VA, USA

Abstract

Mixtures of truncated exponentials (MTE) influence diagrams can represent decision problems with discrete decision variables without limitations on the distributions of continuous chance variables or the nature of the utility functions. This paper presents an extension of MTE influence diagrams that allows both discrete and continuous decision variables. This new model, the continuous decision MTE influence diagram, develops decision rules for continuous decision variables as a function of their continuous parents. In this model, a continuous decision node is marginalized by replacing it with a deterministic chance node and applying an operation derived from the method of convolutions in probability theory. In experiments, continuous decision MTE influence diagrams provide an increase in maximum expected utility when compared to existing methods.

1 Introduction

An influence diagram is a compact graphical representation for a decision problem under uncertainty. Initially, influence diagrams were proposed by Howard and Matheson (1984) as a front end for decision trees. Subsequently, Olmsted (1983) and Shachter (1986) developed methods for evaluating an influence diagram directly without converting it to a decision tree. These methods assume that all uncertain variables in the model are represented by discrete probability mass functions (PMFs) and that decision variables have discrete state spaces.

Influence diagram models that permit continuous chance and decision variables include Gaussian influence diagrams (Shachter and Kenley, 1989), mixtures of Gaussians influence diagrams (Poland and Shachter, 1993), and linear-quadratic conditional Gaussian influence diagrams (Madsen and Jensen, 2005). Each of these models exploits multivariate normal probability theory to provide the decision strategy and expected utility that solves the influence diagram.

Cobb and Shenoy (2004) introduce mixtures of truncated exponentials (MTE) influence diagrams, which are influence diagrams where probability distributions and utility functions are represented by MTE potentials. Decision variables in MTE influence diagrams must have discrete state spaces; however, many decision problems in practice have continuous decision variables.

Consider the following decision problem from economics (see Figure 1):

  A monopolist faces uncertain demand dependent on favorable or unfavorable market conditions. Demand is reflected in the price (P) it receives for its output. The monopolist cannot directly observe demand, but rather relies on the results (R) of a market survey to gauge demand, and thus predict price (P). Based on the survey results (R), the monopolist must determine its optimal output (Q).

Figure 1: An influence diagram for the monopolist's decision problem.
7@WOB.-Z\7>U>R^gUfd7>fda Zd,9UdZ\Ob/,/. ^`,W8ojOQa[86OQ^gU]k .O?S_7?WOQc ,9Udc Q* ) D  { ) _} . yW| D }x3
,9Udca fSmS_,/R^j7>U7?WOB.ecw^`aZ\.OBRrOZd,9UdZ\O{,/.<^`,W8ojOQa bdO .O?S_7?/,Wo7WV  VI.7>S R ]O Z\7>S8w^gUd,/R^j7>U
86OQ^gU]k. O?ST7?WOQc3 NOBR,W^`o`a7WVqR ]OQarOR-779P6OB. ,{ 7WV V7W7 . #(?  ,9Udc"R dO[l VI7W.  .O\
R^j7>Uda-ZB,9Um8ObV7>fUc^gU5-7988,9Uc2d]O?Ud7?0Wvdy/}W}/z]3 a fojRa^gUR ]OVI79o`oj7?b^gUdkcdOBRrOB.<S_^gU^`arR^`ZP67WRrO?U>R^`,WoK
R dOB.Z\7>S8^gUd,/R^j7>U,9UdcS_,/.k9^gUd,Wo`^jQ,/R^j7>U79P6OB.r ?_} . | ) y D }x _} . !| ) _} . yW| D }x >3
,/R^j7>Uda_.OQf^j.OQcV7W.5ENln)X p Na_,/. OncdOQaZ\. ^`86OQc ]O)/,Wogf]OQa VI7W.-R ]OcdOBRrOB.S_^gUd^`arR^`ZP67WRrO?U9R^`,Wo6,/.O
86OQoj7?43 -OQ^jk>9RrOQcb^jR R dOePd.798,W8^`o`^jR0{,WogfdOQafdP67>U.O\
S_7?/,Wo7WVR dO c^`aZ\.OBRrO /,/. ^`,W8ojOW3
o ADC  F c H I  H FI Fprm q b I s o Al@BA @  FJI H I n F nt +$ I  i'& w/ HIwh gji
S i>o i cfH I HIB HI p F i I HI
gI p F c m q b I s S i i]o c H I HIQ HI
stOBR) dfe g5  86O2,9Ul)XPd.798,W8w^`o`^jR0P67WRrO?U>R^`,Wo p F i I HI gI
7>U U P(0 Q0 S,9UcojOBR dfe g5  }xb86O l,/.k9^gUd,Wo`^jQ,/R^j7>U7WV4,Z\7>U9R^gUf]7>fda2/,/. ^`,W8ojO `
,cdOBRrOB.<S^gUd^`arR^`ZqP67WRrO?U>R^`,WoE7>U U P 0 Q0 Sv VI.7>SR ]OZ\7>S8w^gUd,/R^j7>Um7WV,9UlnbXP7WRrO?U>R^`,Wo,9Udc
dOB.%O Q$ Q 0 Q  3nb]O2Z\7>S8^gUd,/R^j7>U7WVu)ndfe g ,cdOBRrOB.<S_^gU^`arR^`ZmP67WRrO?U>R^`,Wo[^`a a f8arR^jR f]R^j7>U7WV&R ]O
,9Udc dfe gq^`a),_P67WRrO?U>R^`,Wo TVI7W. UcdO XUdOQcq,Wa
^gU>WOB. arOq7WViR ]OqOQ]fd,/R^j7>UKa\^gUR ]OecdOBRrOB.<S^gUd^`arR^`Z

70 B. R. Cobb
6P 7WRrO?U9R^`,Wo ^gU9Rr7qR ]Ol)XP67WRrO?U>R^`,WoK3Tbd^`ai79P6OB. ,{ ]RrOQPe/!5E.OQ,/RrO, cw^`aZ\.OBRrOb,WPP.7Qud^gS_,/R^j7>U Rr7R ]O
R^j7>U2^`a[cOB. ^jWOQc2V.7>SR ]OSTOBR ]7]cT7WVtZ\7>U>W79ogf]R^j7>Uda Z\7>U>R^gUf]7>fdabcOQZB^`a^j7>Uq/,/. ^`,W8ojO4,9Uce,WPPoj02R ]O
^gUPd.798,W8w^`o`^jR04R dOB7W.0Wv>,WaO\u]Pwo`,W^gU]OQcT^gU5-7988 ,9Udc P.7Z\OQcf].O^gUnOQZ\R^j7>U3 y3 3
d]O?Ud7?0Wvy/}W}9|3 ]RrOQPy~a^gU]kojOQ,WarRa fd,/. OQa.OBkW.OQaa ^j7>UvZ\. OQ,/RrO
stOBR) dfe g5  86O_,9Ul)XP67WRrO?U9R^`,Wo 7>U U  ,qcOQZB^`a^j7>U.<fojOVI7W. R ]O_Z\7>U>R^gUf]7>fda cdOQZB^`a ^j7>U
P,0 Q 0 S ,9UdcojOBR dfe g5  i"}x-86O,ncOBRrOB.r
S_^gU^`arR^`ZP7WRrO?U>R^`,Wo)7>U U  0 P 0 Q 0 S3xa /,/. ^`,W8ojO),Wa[,iVKfwUdZ\R^j7>U7WV6^jRa[Z\7>U9R^gUf]7>fda P,/.r
a fS_O Q" QR0 Q  v U U !;0 U  v&,9UdcR d,/R O?U>R?Ka\3
OQ,WZ< dfe gfe 5  [}xv /  . . . Tv^`ai^gU9WOB. R^`8ojO ]RrOQP5E7>UdarRr.<fdZ\R,cdOBRrOB.S_^gUd^`arR^`ZP67WRrO?U>R^`,Wo
^gU4 H Q  Q 3bdO4S_,/.k9^gUd,Wo7W?V x) dfe Wg  dfe Jg  VI.7>S R dOcdOQZB^`a^j7>U.<fdojOcdOBWOQoj79P6OQc^gURrOQP
VI7W.-,` arOBR 7WV/,/. ^`,W 8ojOQa Usd P#0m Qv  ` c0 S U yvZ\7>U>WOB.RR dOZ\7>U>R^gUf]7>facdOQZB^`a ^j7>U /,/. ^
^`a)Z\7>SPwf]RrOQce,Wa ,W8wojO-Rr7,icdOBRrOB.<S^gUd^`arR^`Z-Z<d,9UZ\O&{,/. ^`,W8wojOWv>,9Udc
S,/.k9^gUd,Wo`^jBOR ]O/,/. ^`,W8ojOfa^gU]kR ]OPd.7]Z\O\
)ndfe g dfe g6d?Ze g  5 s cf].O^gUnOQZ\R^j7>U3 y3 y3
b^`ab79P6OB. ,/R^j7>Unb^`o`o8OcdOQaZ\. ^`86OQcnV7W.R ]O ZB,WarO
9     )ndfe g
W dfe gfe 5 s 5 s  ]OB. O,cdOQZB^`a ^j7>U/,/. ^`,W8ojOd,Wae7>UdOZ\7>U>R^gUf]7>fa
: P,/.O?U>RBvw8wfdR&S fdojR^j/,/. ^`,/RrO . OBkW.OQaa^j7>UnZB,9Uq86O4fdarOQc
VI7W.b,Wo`Bo V s GH Y Z  ]OB.O55 s  D ` v 5G-5 s  D ` v Rr7O\u]RrO?UdcR ]O79P6OB. ,/R^j7>UV7W.qR dOZB,WaO]OB.O,
5  5 s  D ` vw,9Udc cdOQZB^`a^j7>U/,/. ^`,W8ojO,WamS fdojR^`PojOnZ\7>U>R^gUf]7>fda2P,/.r
O?U>RaB3fdPP679arO2[OqE,9U9RTRr7OQo`^gS^gUd,/RrOe,Z\7>U>R^gU
Wndfe gfe 5 s Y ,  D  f]7>fdacdOQZB^`a^j7>U/,/. ^`,W8ojO  b^jR Z\7>U>R^gUfd7>fda P,/.r
, e `_  D `_  ;, e ` /  D ` /  , D  > f/ , ` O?U>RrVI.7>S"R ]O^gUdf]O?UdZ\Oc^`,/kW. ,9Sfda^gU]kmR ]O4Vf
. a^j7>U,WojkW7W.<^jR Sq3stOBR 86Oe,9Ul)Xf]R^`o`^jR0P67/
b]OZ\7>UdarR,9U>R  ^`a , `  ^jVk) dfe g^`a,9UlnbX RrO?U>R^`,WoVI7W. x G 93bY^j. arRBv, 9P679^gU9Rc^`aZ\.OBRrO ,WP]
Pd.7?u]^gS,/R^j7>U^`aZ\.OQ,/RrOQcV7W.3X ,WZfd,Wo`^jR,/R^jWO
Pd.798w,W8^`o`^jR0P7WRrO?U>R^`,Wo&,9Udc2^jVN) dfe g^`aT,9UlnbX arR,/RrO7WVR ]Oic^`aZ\.OBRrO,WPPd.7?u]^gS,/R^j7>UZ\7W.. OQaP67>Udca
f]R^`o`^jR0P67WRrO?U9R^`,WoKvt]OB.#O ,A ` .OQPd.OQarO?U>RaR ]O Z\7OBV b^jR ,.OQ,WobUfS8OB. OQc/,Wogf]On^gUR ]OeZ\7>U>R^gUf]7>fa
XwZB^jO?U>R)7>U/,/. ^`,W8ojO$ ` ^gU dfe gfe 5  3 arR,/RrOaP,WZ\Om7WV&R ]O/,/. ^`,W8ojOW v Y  $  " " ` 
 93b]O2.OQ,Wo&UfwS86OB.OQc/,Wogf]OQaT,Waar7]ZB^
oBA @BAo S HIQ>o i>o i S i HIo H FJI & w/ HIw gji ,/" RrOQce"7 ^jR q R ]O4]fd,Wo`^jR,/R^jWO arR,/RrOQa,/.O4cdOBRrOB.<S7 ^gU]OQc
l,/.k9^gUd,Wo`^jQ,/R^j7>Umb^jR . OQaP6OQZ\R!Rr74, c^`aZ\. OBRrObcdO\ ,Wa "  " ` j * _} . |? " " `  G! VI7W.
ZB^`a^j7>U/,/. ^`,W8ojOq^`a7>Udoj0cdO X6U]OQcVI7W._lnbXf]R^`o`^jR0 * / . . .7 w3d7W.a^gSPo`^`ZB^jR
arR,/RrOQab^`o`o,Wo`ar728Oi.OBVIOB..OQceRr77 ,Wa  " +  . 7 . .o " " 3
P67WRrO?U9R^`,Wo`aQ3 stOBR 86O,9Ul)X*f]R^`o`^jR0P67WRrO?U 0WvTR dOfd,Wo`^jR,/R^jWO
R^`,WoiVI7W. U" P1G0 QM0 Sv)dOB.O H S3b]O l,/.k9^gUd,Wo`^jQ,/R^j7>U7WV VI.7>SR ]O_f]R^`o`^jR0P67WRrO?U
S_,/. k9^gUd,Wo7WV VI7W.,marOBR7WV!/,/. ^`,W8ojOQa U A T^`a R^`,Wo ^gU>W79ojWOQa XUdc^gU]kR dO){,WogfdOb7WV "  R ,/R-S_,{ud^
,9Uel)XfdR^`o`^jR0eP67WRrO?U9R^`,WoZ\7>S_PwfdRrOQcq,Wa S_^jBOQa V7W.!OQ,WZ_P679^gU>RY^gUR dO)Z\7>U9R^gUf]7>fdac7>S_,W^gU
7WV43b]OeVI79o`oj7?^gU]kYP^jOQZ\OZ\7>S_P67>U]O?U>RlnbX
 @ Z   B _^ 5= s s Y S_ z,{u _^ 5= s > P67WRrO?U9R^`,Wo-VI7W. #^`a cdOB. ^jWOQcV. 7>S*R ]Oc^`aZ\. OBR^jBOQc
Z\7>U>R^gUf]7>fa&cdOQZB^`a^j7>U2/,/. ^`,W8ojOr fda^gUdk4R ]OPd.7]Z\O\
VI7W.,Wo`o_T^ a5  s s tH Y Z   ]OB.O s s s  " 3 cwf]. Oi^gUOQZ\R^j7>U3 y3
l,/.k9^gUd,Wo`^jQ,/R^j7>U7WV&cdOQZB^`a^j7>U/,/. ^`,W8ojOQa.OQa fdojRa^gU $%% ? D ^jV (   D ( 
,9UlnbXP67WRrO?U9R^`,WoK3d7W._cdOBR,W^`o`aBv-arOBOn5-7988,9Udc
d]O?Ud7?0y/}W}/z]3 D ! & %% 33  D ^jV (   D ( 
7# ' O] D ^jV (O   D  (MO
oBA @BA  FI / H I n F nt S i HIo H FJI & w/ HIw gji
X[o`^gS_^gUd,/R^gU]k,Z\7>U9R^gUf]7>fdacdOQZB^`a^j7>U"{,/.<^`,W8ojO ]OB. O ( = ,9Udc ( = ,/.OR ]Ooj7@[OB.,9Udc#fdPP6OB.
VI.7>S ,m5ENl)XpN^`ab, R ]. OBO\arRrOQPnPd. 7Z\OQaaQ 867>fUdca 7WViR ]Ocd7>S_,W^gU7WVR dOe{,/.<^`,W8ojOZ^gULW

Continuous Decision MTE Influence Diagrams 71


R Pw^jOQZ\O_7WVER ]OlnbXf]R^`o`^jR0P7WRrO?U>R^`,Wo vt,Wa
cdOBRrOB.S_^gU]OQc80R ]OPd.7]Z\OQcwf].Oq^gUOQZ\R^j7>U7  3 y3 3 1.75

bdOQarO4867>fUdca&,/. O4cdO XU]OQcna fdZ<eR d,/R ( `_  O( ` 1.5

( O=  O( = i12eV7W.,Wo`o -I  L/  . . . XY6vmI w\ L,9Udc


1.25

aPw= ,W: Z\Om7W= Vr  ^`=ak9^jWO?UD 7 ,W` a) YRD }#7     D O D `  D 


 (  O(  vYdOB.O_R ]O2arR,/RrO
1

0.75

DRr7W.m7W9V3 . OQ,WstoOBRUfTS 86D OB.OQ c/ ,W" ogf]@ OQB a  . . .  " 7W@ ViB R ]7 On86OcOQ,ZB^`Wa^jOQ7>ZU
0.5

0.25

/,/. ^` ,W8ojO R d,/R2Z\7W..OQaP67>Udcb"  ^jR +  . . .o O^gU


0
0 0.2 0.4 0.6 0.8 1

Y^jk>f].Ozd)b]O lnbXP7WRrO?U>R^`,WoV.<,/k>STO?U>Rab^`Z<
v^K3 OW3 ^`abR dOcdOQZB^`a ^j7>Un.fdojO4798dR,W^gUdOQcfdP67>U Z\7>UdaR^jR f]RrO R ]Ol)XP7WRrO?U>R^`,Wo ] V7W. ?#  9
93
R ]7 OT . O?ST7?/,Wo!7WVn3_]O_cdOQZB^`a^j7>U.fdojO_^`a4,VfUdZ
R^j7>UdOB.OT D b " =KB ^jV ( =   D 9( = VI7W. ,Wo`o % '& )6 (+*  :
L /  . . .oXY63 @
]OmcdOQZB^`a ^j7>U.<fdojO "^`a Rr. ,9UarV7W.STOQcRr7,cdO\ b^`amarOQZ\R^j7>UP.OQarO?U>RamR dO5ENl)XpNa79ogf]R^j7>U
ZB^`a^j7>U.fdoj O  R d,/R^`aarR,/RrOQc,Wa,VKfUZ\R^j7>U7WV Rr7R ]OST7>U]79P679o`^`arRBacdOQZB^`a^j7>UPd.798wojO?Sq3xcc^
43#~a^gUdkojOQ,WarRna]fd,/.OQa.OBkW. OQaa^j7>Uvi-O798dR,W^gU R^j7>Ud,WoYUfSTOB. ^`ZB,WocdOBR,W^`o`a7WV[l)XP7WRrO?U>R^`,Wo`aiZB,9U
,o`^gU]OQ,/.OQ]fd,/R^j7>U7WVER ]OVI7W.<S % D i > P >  D 3 86OiV7>fwUdcq^gU5-7988vy/}W}93
bdOq{,Wogf] OQ a >   > ~,/.OZB,Wo`ZQfdo`,/RrOQc,Wa

  ]OB.O  "  B  "  B  . . .  " O B ADC , i_e i i I ?d H FJI
R
,9Udc @ @ @ bdOlnbXP67WRrO?U>R^`,WoV7W.ik9^jWO?U @7 < T^`a,y@
  /   ! P^jOQZ\O2ln)X,WPPd. 7Qud^gS_,/R^j7>URr7R ]OmU]7W.S_,Wo! Ni
  /  5-798w8e,9UdcddO?U]7?0qy/}W}9/,&b^jR . -|/}W}W} {} 
 ,9Ud0c /W4#B}W} k9^jWO?U }]v,9Ud1c -B}W}W}W}
33  33 . |/} ,9Ud+c /  B}W}W}  k9^jWO?U /3b]OP67/
   /  " RrO?U>R^`,WRo ^VI7W!. ?  7 < Td,WabP67WRrO?U>R^`,WotVI. ,/k>STO?U>Ra
 ^Y R }[,9Ud% c ^Y R {3
bdOicOQZB^`a^j7>Uq.<fojO4^`a&R ]O?UqaR,/RrOQcn,Wa ]Ol)X,WPPd.7?u]^gS,/R^j7>UeRr72R ]OfdR^`o`^jR0eVfUdZ
R^j7>U  V7W[. ? < ^`a[cdO?U]7WRrOQc_80  ,9Udc_^`a cdOBRrOB.r
$%% " ` ^jV > 6 >  D  " ` S^gU]OQcfda^gU]kqR ]OmPd.7]Z\OQcwf].OTcdOQaZ\.<^`8OQc^gU5-7988
,9Udcd]O?U]7@0Wvy/}W}/z]3 ln)X,WPP.7Qud^gS_,/R^j7>UdaRr7
T D Y & %%' > 67 >  ^jD V " `  6  D  7 " R ]O86OBR, Ni-aV7W.ik9^jWO?U ,/. OZ\7>UdarRr.fdZ\RrOQc
7 ^jV >  > >  D > # " 7   fda ^gU]kR ]OPd.7]Z\OQcwf].O^gU*5-798w8 vy/}W}93
" bdOQarOl)XP67WRrO?U>R^`,WowVI. ,/k>STO?U>Ra JM d ^`Z< `H TZ\7>UarR^
dOB.OR dO X. #7arRq ,9UdcR d^j. c. OBk9^j7>Udaq7W$V 7 _   D ,/.O R f]RrOiR ]O4l)XP67WRrO?U>R^`,Wo ] VI7W. ?#  9 \,/. O4c^`a
Po`,?0WOQcekW. ,WPd^`ZB,Wo`oj0e^gUqY^jk>f].O z7?WOB. o`,?0WOQc7>UR ]O
,WccOQcRr7O?Ua f].O  R,/WOQa7>U,n{,WogfdOb^jR d^gU^jRa Z\7W.. OQaP67>Udc^gU]k86OBR,2 Ni-aB3]O-lnV7W. ^`a
arR,/RrOTaPw,WZ\OW3bdOcdOQZB^`a^j7>U.fdojO,W867?WOT^`afdarOQcRr7 .OQP.OQarO?U>RrOQc80R dOP67WRrO?U>R^`,W!o \dOB.;O \[K}
Z\7>UdaRr.<fdZ\R!,cdOBRrOB.<S^gUd^`arR^`ZEP67WRrO?U9R^`,Wo  }x3
h,/. ^`,W8wojO d,Wa86OBO?UZ\7>U>WOB.RrOQcRr7,DcOB" RrOB.<S_^gU }Y_} . zm,9UdGc \-{Ym{Y_} . 3
xojR ]7>f]k>2^`a[,Z\7>U9R^gUf]7>fda-cdOQZB^`a^j7>Um/,/. ^`,W8ojOWv
^`arR^`ZZd,9UdZ\O/,/. ^`,W8ojO ,9Udcq^`aEOQo`^gS^gUd,/RrOQcqVI.7>SR ]O -Om^`o`o ^gUd^jR^`,Wo`oj0fdarO,nR ]. OBO\P79^gU>Rq " c^`a
S_7cdOQo80efda^gU]kR ]O479P6OB. ,/R^j7>UcdOQaZ\. ^`86OQce^gUOQZ Z\.OBRrO,WPPd.7?u]^gS,/R^j7>Ue^jR G xEW/ . W9v   W|v
R^j7>Ue3 y3 y3 ,9Udc  k |W . WWRr7,WPPoj0R dOar79ogf]R^j7>UPd.7]Z\O\
]OTZBo`,Waa 7WV&lnbXP67WRrO?U9R^`,Wo`a^`a ZBoj79arOQcfUdcdOB. cwfd.OW3
OQ,WZ 7WVdR ]O[79P6OB. ,/R^j7>Uda.OQf^j.OQc Rr7ar79ojWOEl)X^gU
fdO?UdZ\O c^`,/kW. ,9Sa45-7988,9Udcd]O?Ud7?0Wvy/}W}/z]&,9Udc Al@ 2WF gDn/ H FI
R ]O[Z\7>U9W79ogf]R^j7>U79P6OB. ,/R^j7>UfdarOQcRr7OQo`^gS_^gUd,/RrO-Z\7>U R
R^gUf]7>fdaecOQZB^`a^j7>U/,/. ^`,W8ojOQa5-7988,9Udcd]O?Ud7?0Wv bdO 5ENl)XpNa79ogf]R^j7>UPd.7]Z\OBOQcai8>0eOQo`^gS^gUd,/R
y/}W}9|3 ^gU]kqR dO_{,/.<^`,W8ojOQa ^gUR ]OmarOQfdO?UdZ\OTvvYTv 3

72 B. R. Cobb
,W8ojO/NOQZB^`a ^j7>UarRr. ,/RrOBkW0VI7W.,9U5&Nil)X p N
200000

150000
ar79ogf]R^j7>Ueb^jR OQ^jk>>RarR,/RrOQabV7W.i 3
OQarRb1)OQa<fdojRa4K fd,9U>R^jR0e .7]cwfdZ\OQc
100000

}  _} .gQW|W| 4W3 9/|


50000

_} .gQW|W |  _} . yWWW| 4z>3gQyW|


0.2 0.4 0.6 0.8 1
-50000

_} . yWWW |  _} . W9/| 4|W3 9/|


-100000

Y^jk>f].O|b]Oif]R^`o`^jR0P67WRrO?U>R^`,Wo6V. ,/k>S_O?U9Ra[d^`Z _} . W9/


|   4W|3 WyW|
Z\7>UdarR^jR fdRrO4R ]O4f]R^`o`^jR0eP67WRrO?U9R^`,Wo dk 3
x 5ENl)XpN ar79ogf]R^j7>U"b^jR OQ^jk>>Rc^`a Z\.OBRrO
7.O?ST7@WO)Tv9-O)ZB,Wo`ZQfdo`,/RrO   ^t  e   3 arR,/RrOQaV7W.bR ]O cdOQZB^`a^j7>Ue{,/. ^`,W8wojO0]^jOQo`ca&R dO4cdO\
b]O/,Wogf]OQabV7W.  ,/.O ZB,Wo`ZQfdo`,/RrOQc,Wa ZB^`a^j7>U.<fojO^gU,W8wojOKar7>S_OarR,/RrOQa7WVq,/.O
y z @ R>R ^ R } " 
U]7WR79PdR^gS_,Wo V7W. ,9U90/,Wogf]OQa7WV) 3eNO XU].O 
   }!  W . 9/| rz> .gQyW|  |9 . >/=|  W_| . 9yW|x ,9Udc
   / 
    !  _} .}9Wy9/| !
   {! y z @ R>R ^ R { "  .  
   / 
  }_. yWy9/|
   
7 . O?ST7?WO me   v [O ZB,Wo`ZQfdo`,/RrO k 

 /  
k  _} . {z>W| .
 \  ]   3 b]OfdR^`o`^jR0 P67WRrO?U>R^`,Wo  
k  /  "  _} . WW9/| "
VI. ,/ k>STO?U>Ra k < x\v dk <   v,9Udc 
b]O .OQa<fdojR^gU]kqcdOQZB^`a^j7>U.<fojO ^`a8w,WarOQc7>UR ]O_o`^gU
dk <  k ,/. Oa ]7@UkW. ,WPd^`ZB,Wo`oj0^gUY^jk/ OQ,/.TOQ]fd,/R^j7>U % z/ . z9}9 W . |/}de,9UdcR ]O
f].O |3 1bO?ST7@^gU]k ^gU9W79ojWOQa X. aR X6Udc^gU]k cdOBRrOB.<S^gUd^`arR^`Z_P67WRrO?U>R^`,Wo >4 }x[VI7W). {  
l,{Ju  k <   L  k <   L  k < d,WaTVI. ,/k>S_O?U9Ra  9_*}x)7W.  z/ . z9}9
 - ,/ROQ,WZ<P79^gU>R^gUR ]O_cd7>S,W^gUn7WV- 34prUR d^`a
ZB,Wk arOWv-O XUdc dk <     k <  k W . |/}d}x!^jV}  _} .//}9,9Uc @ >-}x
,/R _} . y/}9W|3 O\u]RBvqR ]OcOQZB^`a^j7>U.<fdojO^`a 7W.  } {}}x^jVY_} .//}9  /3
cdOBWOQoj79P6OQcn80qfda^gU]kTR d^`a){,Wogf]OW3 A @BAl@  FI i? H I " $ i S i HIo H FJI & w/ HIw gji
A @BADC  o id/ H I " S i HIo H FJI , nhgji p F  F S i> i] cfH I HIB HI %$ I  i
 FI / H I n F nt S i HIo H FJI & w/ HIw gji & w/ HIwh gji
~a^gU]kR ]O_Pd. 7Z\OQcwfd.O^gU]OQZ\R^j7>U3 y3 vt-O_cdO\ 7 OQo`^gS_^gUd,/RrOiR ]OicdOQZB^`a^j7>U2/,/. ^`,W8ojOTv-OZ\7>U
RrOB.<S^gU]OR ]O 79PdR^gS,WoarRr. ,/RrOBkW0q7WV!Pd.7]cwfdZB^gUdk_ a^`cdOB.iR ]O .OB]^`arOQc^gU]6f]O?UdZ\O c^`,/kW. ,9S^gUY^jk>f].Oy
 W|^jV }  _} . y/}9W|v,9Udc  |W . WW R d,/R.OQPo`,WZ\OQaYR ]OEcdOQZB^`a^j7>U U]7]cdO[b^jR ,cdOBRrOB.S_^gU
^jV) _} . y/}9W|  "/3n~a^gU]kojOQ,WarR af,/k.OQa .OBkW. OQa ^`arR^`Z-Pd.798,W8w^`o`^jR04U]7]cdOW3]O P67WRrO?U9R^`,Wo`a.O?S,W^gUd^gU]k
a^j7>UKarOBO]OQZ\R^j7>U3 y3 z]vER ]OqV79o`oj7@b^gU]kcdOQZB^`a ^j7>U ,/VIRrOB.m^`aTZ\7>U9WOB.RrOQcRr7,cdOBRrOB.S_^gUd^`arR^`Z2Z<,9UdZ\O
.<fdojO ^`a)798dR,W^gU]OQc /,/. ^`,W8ojO,/.O dk V7W. {  ,9Uc VI7W. {#  937
S_,/. k9^gUd,Wo`^jBOq,Z\7>U>R^gUfd7>fdaTZ<d,9UZ\O2/,/. ^`,W8ojO2VI.7>S
 Y R ]OZ\7>S8^gUd,/R^j7>U7WV,_f]R^`o`^jR0qP67WRrO?U>R^`,Wo,9Udcq,TcdO\
 z> . W9! P /_} .gWQy ^jV}  _} . W|{z RrOB.<S^gUd^`arR^`ZP67WRrO?U9R^`,WoKv[[Oea f8arR^jR f]RrO,9UO\u]P.OQa
{} ^jV_} . W|{z   . a^j7>U798dR,W^gUdOQcV.7>S R dOcOBRrOB.<S_^gUd^`aR^`ZmP67WRrO?U>R^`,Wo
VI7W.R ]O2/,/. ^`,W8ojORr786Om.O?ST7@WOQc^gU>Rr7R ]Oqf]R^`o`^jR0
7iZ\7>U>WOB.R[Rr7,cdOBRrOB.S_^gUd^`arR^`Z-Z<,9UdZ\O&{,/.<^`,W8ojOWv P67WRrO?U9R^`,WoKv-,WaTcdOBR,W^`ojOQc^gUOQZ\R^j7>U3 y3 y3prUR d^`a
-OmfdarO  bRr7cdO XU]OTR dO_cdOBRrOB.<S_^gU^`arR^`Z_P67WRrO?U ZB,WarOWv6,_U]OBl)Xf]R^`o`^jR0eP67WRrO?U9R^`,Wo^`aB
R^`,Wo VI7W. {#  4,Wa&^gUX!ud,9S_PwojOiy7WVY]OQZ\R^j7>Uey3 z & $% dk Iz> . W9! R /_} .gWQy 
IR ]Ocd7>S,W^gU7WV P.OQZBogfdcdOQaR ]OP679aa^`8w^`o`^jR07WV  %' ^jVY}  _} . W|{z .
Pd.7]cwfdZB^gUdkBOB. 7mfUd^jRa\3 dk {_}  ^jVY_} . W|{z  

Continuous Decision MTE Influence Diagrams 73


prU9RrOBkW.<,/R^gU]k  7@WOB.TR ]O2c7>S_,W^gU7WV"k9^jWOQaR ]O 
} }<%&*Eg Eg}~ !$%& XO=  <j PQ%,Y[!$36!$%&-$!'(%
XU,Wo]O\u]P6OQZ\RrOQcf]R^`o`^jR0T7WVtW9/yW3-5E.OQ,/R^gU]k4R dObcdO\ O 36'(*<xO !80 '(<%%&!/MS]y 36C05&02'(% )>?'DM 5 36!80
=Y}M63 5 %
ZB^`a^j7>Uq.<fdojOaP6OQZB^2XOQcq80R ]OVfUdZ\R^j7>U T !0]^jOQo`ca -8<XM6!8*!/,HR % !$%jM6'(<7(08 @ /Rc/m=XmX& =,XSyR
,9U^gUdZ\.OQ,WarO2^gUO\udPOQZ\RrOQc/,Wogf]O7WV)Wy]B}7?WOB.R ]O  $ =cKx$XRj  &jj&j8,&
cdOQZB^`a ^j7>U.fdojOZ\. OQ,/RrOQcfda^gU]k,9Uln)X ^gUdf]O?UdZ\O 
  (,~, !$% XO  Eg Egj<%&*"5 > f TWH
c^`,/kW.<,9S*b^jR ,nc^`a Z\.OBRrOmcdOQZB^`a ^j7>U/,/. ^`,W8ojOW3e~a H&3 x,'(>?<=M '(% )H 36  &< '(7D'DMSOB* !$%&0 'NMSOBY[5 %R-M '( %R0'(%Oj
^gU]kiR ]O&a,9S_O-P.7Z\OQcf].OWv/R ]O)S_,{ud^gS fSO\udP6OQZ\RrOQc &3 '1*9<xO!80 '1<=%% !/MS]y 36C,0]"'NM6Z>?'NM 5&3 !.0U =Y?M63 5 %
f]R^`o`^jR0VI7W.R ]OST7>U]79P679o`^`arRfda^gU]k,5&Nil)X p N -8<XM6!8*;!/,HR % !$%jM6'(<7(08 @F {1/XG=K  

b^jR qOQ^jk>9Ric^`aZ\.OBRrOarR,/RrOQaV7W.^`aWWWWv^`Z<
.&j x, 

^`ai/}9|Wm^jk>]OB.R ,9Un,9UlnbX^gUdf]O?UdZ\O4cw^`,/kW. ,9S K X]y<36*}  Tj<%&*& K L <=M  !.02 %S.f2PQ%,:R5 !$%&-$!
b^jR TOQ^jk>9REarR,/RrOQa[V7W.-T3 5-7988ny/}W}9!k9^jWOQa-, cdO\ * '(<)3<=>0$ @'(% TW X]<=3*<=%&*& W L <XM &!80 %
R,W^`ojOQcZ\7>S_Pw,/. ^`ar7>UT7WVR dOi5ENl)XpNar79ogf]R^j7>UmRr7 !8* 08 K6 XX=ym/ DJXmBRN6=
X&2$1XmXx1  .x&X  ~jM 3<XM6!$)'1-
a^gS^`o`,/.blnbX,9Udcqcw^`aZ\.OBRrO4^gU]6f]O?UdZ\O c^`,/kW. ,9SaB3 !8-$'(0 '( %&0W36 5 H L !$% 7( EG<=36Cm yT

mt[  >/t > L < * 0 !$%T  <%&*yj!$%&0 !$%U{=j 6~ 7DV'(% )7('D% !.<=3
5&< *,3<XM '1--/ %&*,'DM '( %&<7<=5&0602'1<=%'(%,:&5 !8%&-/!* '(<=
5-7>U>R^gUf]7>fdacdOQZB^`a ^j7>Ul)X^gU]6f]O?UdZ\O)c^`,/kW. ,9Sa ) 36<>0$ @ {$==,=,=SR6$ ?{
^`Z<cdOBRrOB.<S_^gUdO-,cdOQZB^`a^j7>U.<fojO[VI7W.,iZ\7>U>R^gUfd7>fda Kx$XRj  &j&=X, ,
cdOQZB^`a ^j7>U2{,/. ^`,W8wojO ,Wa),VKfUZ\R^j7>U27WV^jRa)Z\7>U>R^gUfd7>fda L < * 0 !$%,T R<=%&*y IR=!8%&0 !$%28 jg <$O!8Vx<7D5
P,/. O?U9Ra a ]7?P67WRrO?U9R^`,WoRr7^gS_P.7?WOcOQZB^`a^j7>U <=M '( %U Yy0 O>>?!$M 36'(-<xO!80 '1<=%U*,!.-/'102'( %H 3 7(!$>0$ @
S,/^gU]k ,/R&, oj7?-OB.bZ\7>S_PwfdR,/R^j7>Ud,WoZ\79arRER d,9Uc^`a '(%? f<0 C!$O<=%&*}E3<*,![!8*&0$ fm//cX
Z\.OBRrOq^gU]6f]O?UdZ\Omcw^`,/kW. ,9S_a7W._lnbX^gU]f]O?UdZ\Oc^ / y/={$Nj/m x, .,  
,/kW. ,9SaB3". OB^j7>fdaSTOBR ]7]caVI7W.d,9Udcwo`^gU]kZ\7>U 7(>0SM6!8*}~m L RS. jf%?36!$H 36!80 !$%jM '(% )<%&*02 7DV'(% )
R^gUf]7>fdaEcdOQZB^`a^j7>Um/,/. ^`,W8ojOQa&^gU2^gU]fdO?UdZ\Oc^`,/kW.<,9S_a * !8-/'10 'D %H 36  7(!$>08 @E & !.02'10$ !$HR<=3 M >?!$%jM" =Y
.OQ]fd^j.OnOQ^jR ]OB.Ud7W.<S_,Wo)cw^`arRr. ^`8wf]R^j7>Ua_VI7W.mZ\7>U>R^gU % )'(% !8!$36'D% )=
-/ %& >?'(-g~O,02M !$>08~jM6<%,Y[ 3*K% 'DV !$3
f]7>faZ<d,9UdZ\O{,/. ^`,W8wojOQa4,9Udc 3@7W.c^`aZ\.OBRrO,WPwPd.7Qud^ 0 'DMSO&~jM<=%,Y[ 36*}RyT

S,/R^j7>Uda4Rr7Z\7>U>R^gUfd7>fda cdOQZB^`a^j7>U{,/.<^`,W8ojOQaB3b]O Eg 7(<%&*},A9 <=%&* ~&< -jM !$3S8  L 'NM65 3 !.0
Z\7>U>R^gUfd7>fdabcdOQZB^`a ^j7>Unl)X^gUdf]O?UdZ\O c^`,/kW.<,9S",Wo Y<=5R0 0 '(<%&0y<=%&*>'(% '(>5 >3 !87(<=M '(V!K!8%jM 36 HOM !8-
oj7@ba ,4S_7W.O&Pd.OQZB^`arOcdOQZB^`a^j7>UT.fdojO),9UdcTR ]O,W8^`o`^jR0 %&' 5 !.0yY[ 3W>? ,*,!$7('(% )-/ % M6'D%5 5&05 %&-$!$3 M6<'D%jM '(!808 @'(%
Rr7_O\udPoj79^jR)R dO Z\7>U9R^gUf]7>fdaUd,/R f].O7WVRrOQarRb. OQa fdojRaB3 mW!8-C !$36><=%U<=%R*UW  L <=>* <% '[!8*&0$ fm/
cX?/ y/=R{$Nj/m6 R&}8 x.&
2  IRF gjis"Jc;i I ? "5&> ,&<%&*TR~,<7D>?!83R %j g6Eg!$% % '(7(!8060
H&3 H,
bdOn,9f]R ]7W.^`akW.<,/RrOBVKfdobV7W.2R ]O^gUda^jk>>RrVKfdoZ\7>S <) <=M '( %]"'DM >'DM 5 36!80 Y M 365 %&-8<XM !.*!$H %&!$%jM '1<=710$ @
S_O?U9Ra&7WV!R d.OBO4,9U]7>U>0]S_7>fda). OB^jOB-OB. ad^`Z<na^jk/ '(%m ,*, B[!8*g X ==B&X[c=xR
 .//cWWx$=&j,m$$/{=R x!8-/M 5 36!
Ud2^ XZB,9U9Roj0q^gS_P.7?WOQcR ]OP,WP6OB.Q3 =M6!80'(%TK32M6'N#-/'1<=7}PQ%jM !87D7('D) !$%&-$! j  ,x =&
~R<-jM !838K 28 j VX<7D5&<=M '(% )'(%,:&5 !8%&-/!* '(<=
 _:d>
 : : : ) 36<>0$ @U / = W/$X / &Rj /j,


  "!$#&% '(% )+*,!.-/'102'( %4365 7(!809'(% ~R<-jM !838m <%&*B !8% 7D!8OS.6<5&0602'1<=%
'(%,:&5&!$%&-$!;*,'1<=) 36<>?08 @+AB 36C'D% )FEG<=H!$3.JIK'(3 ) 'D% '1< '(%,:&5&!$%&-$!*,'1<=) 36<>0$ @? Xm$j// /$ , { 
L 'D7('NM<=36OFPQ%R0SM6'NM65,M ! UTWVX<='(7(< 7(!ZY[ 3\*, X]"% 7( < *^<XM._  ., 
```"acbXdfegaShjikRl.dmhji&e=n,lh&o.pk l oqXpp&l r,s=tuRvxwya{z,ij|  ~&!$% XO mEg EgGS. T;% !8]>?!/M6 ,*UY[ 336!$H&3 !.02!8% M

} }<%&*UEg Egm~&!$% XOJy KO 3 '1*U'(%,:&5 '(% )B<=%&*02 7DV'(% )B<xO !80 '(<%*,!8-$'(0 'D %H 36  7(!$>08 @B'(%
!8%&-/!*,'(<)3<=>05&0 'D%&)>?'NM65 3 !.0W YM63 5&%&-$<=M !8*!$,HR   gK<%&*[!.*} /
$XG{$Nj/m6} =R/
%&!$%jM '1<=710$ @'(% L 
&'(-C !$36'D% )<%&* jK<7DH!$36%U[!.* 0$ c=1/ _ WX c=1//" 2 
&<=H&>?<%<%&*
$/{=R/ y/=R{$Nj/m R  j.  <=7(7 %&* %} 8xm8 


 <%&*Eg Eg~ !8% XO=  j^ KO 3 '1*
<xO !80 '(<%% !$MS]
3 C,0]"'DM 7('D% !.<=3K* !/M !83 >?'(% '10SM6'(-VX<3 'D
< 7(!808 @'(%yy< -$-5&0<%&*,< <=CC 7(<?!8* 08 f&
$/{=R?/
$X/Rc/NN $ R, }8X8&

74 B. R. Cobb
Discriminative Scoring of Bayesian Network Classifiers:
a Comparative Study
Ad Feelders and Jevgenijs Ivanovs
Department of Information and Computing Science
Universiteit Utrecht, The Netherlands

Abstract
We consider the problem of scoring Bayesian Network Classifiers (BNCs) on the basis of the
conditional loglikelihood (CLL). Currently, optimization is usually performed in BN parameter
space, but for perfect graphs (such as Naive Bayes, TANs and FANs) a mapping to an equivalent
Logistic Regression (LR) model is possible, and optimization can be performed in LR parameter
space. We perform an empirical comparison of the efficiency of scoring in BN parameter space,
and in LR parameter space using two different mappings. For each parameterization, we study two
popular optimization methods: conjugate gradient, and BFGS. Efficiency of scoring is compared
on simulated data and data sets from the UCI Machine Learning repository.

1 Introduction

Discriminative learning of Bayesian Network Classifiers (BNCs) has received considerable attention recently (Greiner et al., 2005; Pernkopf and Bilmes, 2005; Roos et al., 2005; Santafé et al., 2005). In discriminative learning, one chooses the parameter values that maximize the conditional likelihood of the class label given the attributes, rather than the joint likelihood of the class label and the attributes. It is well known that conditional loglikelihood (CLL) optimization, although arguably more appropriate in a classification setting, is computationally more expensive because there is no closed-form solution for the ML estimates and therefore numerical optimization techniques have to be applied. Since in structure learning of BNCs many models have to be scored, the efficiency of scoring a single model is of considerable interest.

For BNCs with perfect independence graphs (such as Naive Bayes, TANs, FANs) a mapping to an equivalent Logistic Regression (LR) model is possible, and optimization can be performed in LR parameter space. We consider two such mappings: one proposed in (Roos et al., 2005), and a different mapping that, although relatively straightforward, has to our knowledge not been proposed before in discriminative learning of BNCs. We conjecture that scoring models in LR space using our proposed mapping is more efficient than scoring in BN space, because the logistic regression model has fewer parameters than its BNC counterpart and because the LR model is known to have a strictly concave loglikelihood function. To test this hypothesis we perform experiments to compare the efficiency of model fitting with both LR parameterizations, and the more commonly used BN parameterization.

This paper is structured as follows. In section 2 we introduce the required notation and basic concepts. Next, in section 3 we describe two mappings from BNCs with perfect graphs to equivalent LR models. In section 4 we give a short description of the optimization methods used in the experiments, and motivate their choice. Subsequently, we compare the efficiency of discriminative learning in LR parameter space and BN parameter space, using the optimization methods discussed. Finally, we present the conclusions in section 7.
2 Preliminaries

2.1 Bayesian Networks

We use uppercase letters for random variables and lowercase for their values. Vectors are written in boldface. A Bayesian network (BN) (X, G = (V, E), Θ) consists of a discrete random vector X = (X0, . . . , Xn), a directed acyclic graph (DAG) G representing the directed independence graph of X, and a set of conditional probabilities (parameters) Θ. V = {0, 1, . . . , n} is the set of nodes of G, and E the set of directed edges. Node i in G corresponds to random variable Xi. With pa(i) (ch(i)) we denote the set of parents (children) of node i in G. We write X_S, S ⊆ {0, . . . , n}, to denote the projection of random vector X on components with index in S. The parameter set Θ consists of the conditional probabilities

θ_{x_i|x_pa(i)} = P(X_i = x_i | X_pa(i) = x_pa(i)),   0 ≤ i ≤ n

We use X_i = {0, . . . , d_i − 1} to denote the set of possible values of X_i, 0 ≤ i ≤ n. The set of possible values of random vector X_S is denoted X_S = ×_{i∈S} X_i. We also use X_i^* = X_i \ {0}, and likewise X_S^* = ×_{i∈S} X_i^*.

In a BN classifier there is one distinguished variable called the class variable; the remaining variables are called attributes. We use X0 to denote the class variable; X1, . . . , Xn are the attributes. To denote the attributes, we also write X_A, where A = {1, . . . , n}. We define π(i) = pa(i)\{0}, the non-class parents of node i, and π*(i) = {i} ∪ π(i). Finally, we recall the definition of a perfect graph: a directed graph in which all nodes that have a common child are connected is called perfect.

2.2 Logistic Regression

The basic assumption of logistic regression (Anderson, 1982) for binary class variable X0 ∈ {0, 1} is

ln [ P(X0 = 1 | Z) / P(X0 = 0 | Z) ] = w_0 + Σ_{i=1}^{k} w_i Z_i,   (1)

where the predictors Z_i (i = 1, . . . , k) can be single attributes from X_A, but also functions of one or more attributes from X_A. In words: the log posterior odds are linear in the parameters, not necessarily in the basic attributes.

Generalization to a non-binary class variable X0 ∈ X_0 gives

ln [ P(X0 = x0 | Z) / P(X0 = 0 | Z) ] = w_0^{(x0)} + Σ_{i=1}^{k} w_i^{(x0)} Z_i,   (2)

for all x0 ∈ X_0^*. This model is often referred to as the multinomial logit model or polychotomous logistic regression model.


It is well known that the loglikelihood function of the logistic regression model is concave and has a unique maximum (provided the data matrix Z is of full column rank) attained for finite w except in two special circumstances described in (Anderson, 1982).

2.3 Log-linear models

Let G = (V, E) be the (undirected) independence graph of random vector X, that is, E is the set of edges (i, j) such that whenever (i, j) is not in E, the variables X_i and X_j are independent conditionally on the rest. The log-linear expansion of a graphical log-linear model is

ln P(x) = Σ_{C⊆V} u_C(x_C)

where the sum is taken over all complete subgraphs C of G, and all x_C ∈ X_C^*, that is, u_C(x_C) = 0 if x_i = 0 for some i ∈ C (to avoid overparameterization). The u-term u_∅(x_∅) is just a constant. It is well known that for BNs with a perfect directed independence graph, an equivalent graphical log-linear model is obtained by simply dropping the direction of the edges.

3 Mapping to Logistic Regression

In this section we discuss two different mappings from BNCs with a perfect independence graph to equivalent logistic regression models. Equivalent here means that, assuming P(X) > 0, the BNC and corresponding LR model represent the same set of conditional distributions of the class variable.

3.1 Mapping of Roos et al.

Roos et al. (Roos et al., 2005) define a mapping from BNCs whose canonical form is a perfect graph, to equivalent LR models. The canonical form is obtained by (1) taking the Markov blanket of X0, (2) marrying any unmarried parents of X0. This operation does clearly not change the conditional distribution of X0. They show that if this canonical form is a perfect graph, then the BNC can be mapped to an equivalent LR model. Their mapping creates an LR model with predictors (and corresponding parameters) as follows

1. Z_{x_pa(0)} = I(X_pa(0) = x_pa(0)) with parameter w^{(x0)}_{x_pa(0)}, for x_pa(0) ∈ X_pa(0).

2. Z_{x_π*(i)} = I(X_π*(i) = x_π*(i)) with parameter w^{(x0)}_{x_π*(i)}, for i ∈ ch(0) and x_π*(i) ∈ X_π*(i).

For a given BNC with parameter value Θ an equivalent LR model is obtained by putting

w^{(x0)}_{x_pa(0)} = ln θ_{x0|x_pa(0)},   w^{(x0)}_{x_π*(i)} = ln θ_{xi|x_pa(i)}

[Figure 1: Example BNC (left); undirected graph with same conditional distribution of class (right). Both graphs have class node 0 and attribute nodes 1-5.]

3.2 Proposed mapping

Like in the previous section, we start from the canonical graph, which is assumed to be perfect. Hence, we obtain an equivalent graphical log-linear model by simply dropping the direction of the edges. We then have

ln [ P(X0 = x0 | X_A) / P(X0 = 0 | X_A) ]
  = ln [ (P(X0 = x0, X_A)/P(X_A)) / (P(X0 = 0, X_A)/P(X_A)) ]
  = ln P(X0 = x0, X_A) − ln P(X0 = 0, X_A),

for x0 ∈ X_0^*. Filling in the log-linear expansion for ln P(X0 = x0, X_A) and ln P(X0 = 0, X_A), we see immediately that u-terms that do not contain X0 cancel, and furthermore that u-terms with X0 = 0 are constrained to be zero by our identification restrictions. Hence we get

ln P(X0 = x0, x_A) − ln P(X0 = 0, x_A) = u_{{0}}(X0 = x0) + Σ_C u_C(X0 = x0, x_C) ≡ w^{(x0)} + Σ_C w^{(x0)}_{x_C}

where C is any complete subgraph of G not containing X0. Hence, to map to a LR model, we create variables

I(X_C = x_C),   x_C ∈ X_C^*

to obtain the LR specification

ln [ P(X0 = x0 | Z) / P(X0 = 0 | Z) ] = w^{(x0)} + Σ_C w^{(x0)}_{x_C} I(X_C = x_C)

This LR specification models the same set of conditional distributions of X0 as the corresponding undirected graphical model, see for example (Sutton and McCallum, 2006).

3.3 Example

Consider the BNC depicted in figure 1. Assuming all variables are binary, and using this fact to simplify notation, this maps to the equivalent LR model (with 9 parameters)

ln [ P(X0 = 1 | Z) / P(X0 = 0 | Z) ] = w + w_{1} X1 + w_{2} X2 + w_{3} X3 + w_{4} X4 + w_{5} X5 + w_{1,2} X1 X2 + w_{1,3} X1 X3 + w_{3,4} X3 X4

In the parameterization of (Roos et al., 2005), we map to the equivalent LR Roos model (with 14 parameters)

ln [ P(X0 = 1 | Z) / P(X0 = 0 | Z) ] = w_{{1,2}=(0,0)} Z_{{1,2}=(0,0)} + w_{{1,2}=(0,1)} Z_{{1,2}=(0,1)} + w_{{1,2}=(1,0)} Z_{{1,2}=(1,0)} + w_{{1,2}=(1,1)} Z_{{1,2}=(1,1)} + w_{{1,3}=(0,0)} Z_{{1,3}=(0,0)} + . . . + w_{{3,4}=(1,1)} Z_{{3,4}=(1,1)} + w_{{5}=(0)} Z_{{5}=(0)} + w_{{5}=(1)} Z_{{5}=(1)}

In both cases we set w^{(0)} = 0.
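As an illustration of the proposed mapping, the sketch below (our own code; the list of complete subgraphs is written out by hand for the example of figure 1, and all attributes are assumed binary) builds the indicator predictors I(X_C = x_C) for a single attribute configuration:

```python
import numpy as np

# Complete subgraphs C of the undirected graph in figure 1 that do not contain X0:
# the singletons and the connected pairs (hand-coded for this example).
CLIQUES = [(1,), (2,), (3,), (4,), (5,), (1, 2), (1, 3), (3, 4)]

def predictors_proposed(x, cliques=CLIQUES):
    """Indicators I(X_C = x_C) for x_C in X_C* (all components non-zero).

    With binary attributes each clique contributes a single predictor, which is
    simply the product of the attributes in the clique.
    x : dict mapping attribute index -> observed value in {0, 1}.
    """
    z = [1.0]  # intercept position, corresponding to w^(x0)
    for C in cliques:
        z.append(float(all(x[i] == 1 for i in C)))
    return np.array(z)

# Attribute configuration X1..X5 = (1, 0, 1, 1, 0)
x = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0}
print(predictors_proposed(x))  # [1, X1, X2, X3, X4, X5, X1*X2, X1*X3, X3*X4]
```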



4 Optimization methods

In the experiments, we use two optimization methods: conjugate gradient (CG) and variable metric (BFGS) algorithms (Nash, 1990). Conjugate gradient is commonly used for discriminative learning of BN parameters, see for example (Greiner et al., 2005; Pernkopf and Bilmes, 2005). In a study of Minka (Minka, 2001), it was shown that CG and BFGS are efficient optimization methods for the logistic regression task.

BFGS is a Hessian-based algorithm, which updates an approximate inverse Hessian matrix of size r² at each step, where r is the number of parameters. CG on the other hand works with a vector of size r. This obviously makes each iteration less costly; however, in general CG exhibits slower convergence in terms of iterations.

Both algorithms at step k compute an update direction u(k), followed by a line search. This is a one-dimensional search, which looks for a step size α maximizing f(α) = CLL(w(k) + α u(k)), where w(k) is the vector of parameter values at step k. It is argued in (Nash, 1990) that neither algorithm benefits from too large a step being taken. Even more, for BFGS it is not desirable that the increase in the function value is different in magnitude from the one determined by the gradient value, (w(k+1) − w(k))^T g(k). Therefore, simple acceptable point search is suggested for both methods.

In case of CG an additional step is made. Once an acceptable point has been found, we have sufficient information to fit a parabola to the projection of the function on the search direction. The parabola requires three pieces of information: the function value at the end of the last iteration (or the initial point), the projection of the gradient at this point onto the search direction, and the new function value at the acceptable point. If the CLL value at the maximum of the parabola is larger than the CLL value at the acceptable point, then the former becomes the starting point for the next iteration. Another approach to CG line search is Brent's line search method (Press et al., 1992). It iteratively fits a parabola to 3 points. The main difference with the previous approach is that we do find an optimum in the given update direction. This, however, requires more function evaluations at each iteration.

In our experiments we use the implementation of the CG and BFGS methods of the optim function of R (Venables and Ripley, 2002), which is based on the source code from (Nash, 1990). In addition we implemented CG with Brent's line search (with relative precision for Brent's method set to 0.0002). We refer to this algorithm as CGB. The conjugate gradient method may use different heuristic formulas for computing the update direction. Our preliminary study showed that the difference in performance between these heuristics is small. We use the Polak-Ribiere formula, suggested in (Greiner et al., 2005), for optimization in the BN parameter space.

In our study we compare the rates of convergence for 9 optimization techniques: 3 optimization algorithms (CG, CGB, BFGS) used in 3 parameter spaces (LR, LR Roos, BN). We compare convergence in terms of (1) iterations of the update direction calculation, and (2) floating point operations (flops). The number of flops provides a fair comparison of the performance of the different methods, but the number of iterations gives us additional insight into the behavior of the methods.

Let costCLL and costgrad denote the costs in flops of CLL and gradient evaluations respectively, and let countCLL be the number of times the CLL is evaluated in a particular iteration. We estimate the cost of one particular iteration of BFGS as 12r² + 8r + (4r + costCLL)·countCLL + costgrad; the cost of one CG and CGB iteration is 10r + (4r + costCLL)·countCLL + costgrad. These estimates are obtained by inspecting the source code in (Nash, 1990).
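These per-iteration costs are plain arithmetic; a small helper of our own (purely illustrative) makes the BFGS/CG comparison explicit:

```python
def bfgs_iteration_flops(r, cost_cll, cost_grad, count_cll):
    # 12r^2 + 8r + (4r + costCLL) * countCLL + costgrad
    return 12 * r**2 + 8 * r + (4 * r + cost_cll) * count_cll + cost_grad

def cg_iteration_flops(r, cost_cll, cost_grad, count_cll):
    # 10r + (4r + costCLL) * countCLL + costgrad (same estimate for CG and CGB)
    return 10 * r + (4 * r + cost_cll) * count_cll + cost_grad

# Example with made-up costs: r = 100 parameters, two CLL evaluations per line search.
print(bfgs_iteration_flops(100, cost_cll=5000, cost_grad=7000, count_cll=2))
print(cg_iteration_flops(100, cost_cll=5000, cost_grad=7000, count_cll=2))
```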
5 Parameter learning

5.1 Logistic regression

Equation 2 can be rewritten as follows

P(X0 = x0 | Z) = exp( w^{(x0) T} Z ) / Σ_{x0'=0}^{d0−1} exp( w^{(x0') T} Z ),

where we put Z_0 = 1, which corresponds to the intercept, and fix w^{(0)} = 0. w^T denotes the transpose of w. The conditional loglikelihood of the parameters given data is

CLL(w) = log Π_{i=1}^{N} P(x0^{(i)} | z^{(i)}) = Σ_{i=1}^{N} ( w^{(x0^{(i)}) T} z^{(i)} − log Σ_{x0'=0}^{d0−1} exp( w^{(x0') T} z^{(i)} ) ),

where N is the number of observations in the data set. The gradient of the CLL is given by

∂CLL(w) / ∂w_k^{(x0)} = Σ_{i=1}^{N} ( 1_{{x0^{(i)} = x0}} z_k^{(i)} − exp( w^{(x0) T} z^{(i)} ) z_k^{(i)} / Σ_{x0'=0}^{d0−1} exp( w^{(x0') T} z^{(i)} ) )

where x0 ∈ X_0^*.



We note here that the data matrix Z is a sparse matrix of indicators with 1s in non-zero positions, thus the CLL and gradient functions can be implemented very efficiently. We estimate costs (in flops) according to the above formulas: costCLL = |Z|(d0 − 1) + N(d0 + 2), costgrad = |Z| d0 + 2N d0, where |Z| denotes the number of 1s in the data matrix. Here we used the fact that the multiplication of two vectors w^{(x0) T} z^{(i)} requires |z^{(i)}| − 1 flops. In case of LR Roos the matrix Z always contains exactly 1 + n − |pa(0)| 1s irrespective of graph complexity and dimension of attributes. For our mapping |Z| depends on the sample. In the experiments we used the following heuristic to reduce |Z|: for each attribute Xi we code the most frequent value as 0. We note that |Z| is smaller in case of our mapping compared to the LR Roos mapping when the structure is not very complex (with regard to the number of parents and the domain size of the attributes), and becomes bigger for more complex structures.
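A minimal NumPy sketch of the CLL and its gradient in this parameterization is given below (dense matrices for readability, whereas an efficient implementation would exploit the sparsity of Z as noted above; the names are ours, not the paper's):

```python
import numpy as np

def cll_and_gradient(W, Z, y):
    """CLL and gradient for the multinomial logit parameterization.

    W : (d0, r) parameter matrix; row 0 (reference class) stays at zero.
    Z : (N, r) indicator design matrix with Z[:, 0] = 1 (intercept).
    y : (N,) observed class labels in {0, ..., d0 - 1}.
    """
    scores = Z @ W.T                                # (N, d0): entries w^(x0)^T z^(i)
    log_norm = np.logaddexp.reduce(scores, axis=1)  # log sum_{x0'} exp(w^(x0')^T z^(i))
    cll = np.sum(scores[np.arange(len(y)), y] - log_norm)

    post = np.exp(scores - log_norm[:, None])       # P(X0 = x0 | z^(i))
    onehot = np.zeros_like(post)
    onehot[np.arange(len(y)), y] = 1.0
    grad = (onehot - post).T @ Z                    # (d0, r) matrix of partial derivatives
    grad[0] = 0.0                                   # w^(0) is fixed at zero
    return cll, grad

# Tiny example: 3 observations, d0 = 2 classes, r = 2 predictors (intercept + one indicator).
Z = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, 0, 1])
print(cll_and_gradient(np.zeros((2, 2)), Z, y))
```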
LR and is by a factor 2 + 2d larger in case of BN.
5.2 Bayesian Network Classifiers This suggests that CGB might be better than CG for
Here we follow the approach taken in (Greiner et BN, but it is very unlikely that it will be better also
al., 2005; Pernkopf and Bilmes, 2005). We write for LR and LR Roos parameter spaces. Note that
j
costCLL (BN) is very close to costCLL (LR Roos),
i|k = P (xj = i|xpa(j) = k). which strongly supports the fairness of our cost es-
j Pdj 1 j timates.
We have constraints i|k 0 and i=0 i|k = 1.
We reparameterize to incorporate the constraints on 5.3 Convergence issues
j j
i|k and use different parameters i|k as follows
It is well known that LR has a concave loglikeli-
j hood function. Since BNCs whose canonical form
j
exp i|k
i|k = Pd 1 is a perfect graph are equivalent to the LR models
j j
l=0 exp l|k obtained by either mapping, it follows from the con-
tinuity of the mapping from to w, that the CLL
The CLL is given by function in the standard BN parameterization also
N
X 0 1
dX has only global optima (Roos et al., 2005). This im-
(t)
CLL() = (log P (x ) log P (x(t) )) plies that all our algorithms should converge to the
t=1 (t)
x0 =0 maximal CLL value.
It is important to pick good initial values for the
Further expansion may be obtained using the factor- parameters. The common approach to initialize BN
j
ization of P (x) and plugging in expressions for i|k . parameters is to use the (generative) ML estimates.
It is easy to see that It is crucial for all three parameter spaces to avoid
ij0 |k zero probabilities, therefore we use Laplaces rule,
j
= ij0 |k (1{i=i0 } i|k
j
), and add one to each frequency. After the initial val-
i|k ues of are obtained, we can derive starting val-
ues for w as well. We simply apply the mappings
thus
to obtain the initial values for the LR Roos and LR
P (x) j parameters. This procedure guarantees that all algo-
j
= 1{xpa(j) =k} P (x)(1{xj =i} i|k )
i|k rithms have the same initial CLL value.



6 Experiments

For the experiments we used artificially generated data sets and data sets from the UCI Machine Learning Repository (Blake and Merz, 1998). Artificial data were generated from Bayesian networks, where the parameter values were obtained by sampling each parameter from a Dirichlet distribution with α = 1. We generated data from both simple and complex structures in order to evaluate performance under different scenarios. We generated perfect graphs in the following way: for each attribute we select i parents: the class node and i − 1 previous nodes (whenever possible) according to indexing, where i ∈ {1, 2, 3}.

Table 1 lists the UCI data sets used in the experiments, and their properties. To discretize numeric variables, we used the algorithm described in (Fayyad and Irani, 1993). The "?" values in the votes data set were treated as a separate value, not as a missing value.

Table 1: Data sets and their properties

              #Samples  #Attr.  #Class  Max attr. dim.
  Glass          214       9       6         4
  Pima           768       8       2         4
  Satimage      4435      36       7         2
  Tic-tac-toe    958       9       2         3
  Voting         435      16       2         3

The fitted BNC structures for the UCI data were obtained by (1) connecting the class to all attributes, (2) using a greedy generative structure learning algorithm to add successive arcs. Step (2) was not applied for the Satimage data set, for which we fitted the Naive Bayes model. All structures happened to be perfect graphs, thus there was no need for adjustment. In figure 2 we show the structure fitted to the Pima Indians data set. The other structures were of similar complexity.

[Figure 2: Structure fitted to Pima Indians data (class node 0, attribute nodes 1-8). The structure was constructed using a hill climber on the (penalized) generative likelihood.]

Figure 3 depicts convergence curves for the 9 optimization algorithms on the Pima Indians data.

[Figure 3: Optimization curves on the Pima Indians data; CLL (about -336.2 to -337.0) against flops (0 to 1e+07) for LR, LR_Roos and BN, each with CG, CG_full.search and BFGS.]

Table 2 depicts statistics for different UCI and artificial data sets. For each data set i we have a starting CLL value CLL^i_ML and a maximal (over all algorithms) CLL value CLL^i_max. We define a threshold t_i = CLL^i_max − 5%(CLL^i_max − CLL^i_ML). For every algorithm we computed the number of iterations (flops) needed to reach the threshold. For each data set these numbers are scaled dividing by the smallest value. The bound of 5% was selected heuristically. This bound is big enough, so all algorithms were able to reach it before 200 iterations and before satisfying the stopping criterion. The bound is small enough, so algorithms make in general quite a few iterations before achieving it. There are some data sets, however, on which all algorithms converge fast to a very high CLL value, and a clear distinction in performance is visible using a very small bound. Pima Indians is an example (the 5% threshold is set to -337.037). We still see that table 2 contains a quite fair comparison, except that BFGS methods are underestimated for smaller bounds.
ery algorithm we computed the number of iterations It is very interesting to notice that convergence



Data Set LR LR Roos BN
CG CGB BFGS CG CGB BFGS CG CGB BFGS
Glass 2.1 1.9 1.0 3.0 2.0 1.1 2.4 1.9 1.0
Pima 1.3 1.5 1.8 1.5 1.2 1.8 1.0 1.0 1.8
Satimage 8.2 2.3 1.0 3.3 2.0 1.1 3.0 1.0 1.3
Tic-tac-toe 3.9 2.2 1.5 2.1 1.1 1.2 2.2 1.0 1.3
Voting 3.2 1.8 1.2 1.9 1.3 1.0 1.5 1.0 1.1
1.5.2-3.100 2.8 1.4 1.2 2.3 1.3 1.1 2.0 1.0 1.2
2.5.2-3.100 2.6 2.1 1.4 1.8 1.6 1.3 1.4 1.0 1.0
3.5.2-3.100 2.7 1.8 1.3 2.2 1.0 1.2 1.3 1.2 1.2
1.5.2-3.1000 2.2 2.2 2.2 1.5 1.5 2.5 1.0 1.0 1.5
2.5.2-3.1000 2.2 2.0 1.8 1.2 1.2 1.8 1.0 1.0 2.5
3.5.2-3.1000 3.6 2.2 1.0 1.8 1.6 1.0 1.3 1.3 1.0
1.30.2-3.1000 2.0 1.2 1.0 1.5 1.2 1.5 1.2 1.0 1.2
2.30.2-3.1000 2.0 1.8 2.2 1.6 1.2 1.4 1.4 1.0 1.2
3.30.2-3.1000 2.2 1.2 2.0 1.6 1.0 1.6 1.4 1.0 1.2
1.5.5.1000 3.1 3.2 1.8 2.3 2.3 1.5 1.1 1.1 1.0
2.5.5.1000 6.7 6.9 2.3 3.1 3.0 1.4 1.5 1.5 1.0
3.5.5.1000 4.8 4.2 1.8 2.3 2.2 1.1 1.7 1.7 1.0
mean 3.3 2.4 1.6 2.1 1.6 1.4 1.6 1.2 1.3
Data Set LR LR Roos BN
CG CGB BFGS CG CGB BFGS CG CGB BFGS
Glass 1.0 2.4 3.3 2.6 4.1 8.3 5.9 8.1 12.6
Pima 1.0 2.4 1.6 1.8 2.5 3.1 4.0 6.1 11.0
Satimage 4.6 3.0 1.0 4.6 7.1 3.3 11.5 6.1 7.6
Tic-tac-toe 2.0 3.2 1.0 1.3 1.7 1.3 6.1 4.0 6.0
Voting 1.2 1.9 1.9 1.0 1.8 3.5 3.6 4.2 15.8
1.5.2-3.100 1.0 1.7 1.0 1.2 2.1 1.8 2.9 2.9 5.1
2.5.2-3.100 1.2 2.8 3.5 1.0 2.6 7.4 2.1 3.2 13.2
3.5.2-3.100 1.0 2.4 2.3 1.1 1.4 5.3 2.1 4.4 21.2
1.5.2-3.1000 1.0 2.0 1.7 1.0 1.9 3.9 1.5 2.4 5.5
2.5.2-3.1000 1.5 3.2 2.2 1.0 2.0 4.6 2.3 3.6 25.0
3.5.2-3.1000 1.7 2.6 1.4 1.0 2.2 3.5 2.2 4.3 13.5
1.30.2-3.1000 1.9 3.7 1.0 3.4 8.4 4.3 12.1 17.4 16.6
2.30.2-3.1000 1.0 3.1 2.2 1.2 3.2 3.3 4.9 7.4 11.8
3.30.2-3.1000 1.2 2.3 5.8 1.0 2.4 12.5 4.1 7.6 36.6
1.5.5.1000 1.1 2.8 1.5 1.0 2.4 1.9 1.6 2.4 2.8
2.5.5.1000 2.3 5.6 11.1 1.0 2.2 10.1 1.7 2.4 11.6
3.5.5.1000 2.6 5.4 101.5 1.0 2.3 98.1 2.4 3.8 137.5
mean 1.6 3.0 8.5 1.5 3.0 10.4 4.2 5.3 20.8

Table 2: Convergence speed in terms of iterations (top) and flops (bottom). Artificial data sets are named
in the following way: [number of parents of each attribute].[number of attributes].[domain size of each
attribute].[sample size]. Domain size denoted as 2-3 is randomly chosen between 2 and 3 with equal proba-
bilities.



It is very interesting to notice that convergence with regard to iterations is faster in the BN and LR Roos parameter spaces. It seems that overparameterization is advantageous in this respect. This might be explained by the fact that there is only a single global optimum in the LR case, whereas there are many global optima in case of BN and LR Roos. Thus, in the latter case one can quickly converge to the closest optimum. Overparameterization, however, is also an additional burden, as witnessed by the flops count; this is especially true for the BFGS method.

We observe that BFGS is generally winning, with CGB being close, in terms of iterations. Considering flops, CGB is definitely losing to CG in all parameterizations. CG seems to be the best technique, though it might be very advantageous to use BFGS for simple (with respect to the number of parents and the domain size of attributes) structures, in combination with our LR mapping.

We have both theoretical and practical evidence that our LR parameterization is the best for relatively simple structures. So the general conclusion is to use LR + BFGS, LR + CG and LR Roos + CG in order of growing structure complexity. The size of the domain and the number of parents mainly influence our choice. On the basis of flop counts, the BN parameterization is never preferred in our experiments. Finally, we note that our LR parameterization was the best on 4 out of 5 UCI data sets.

7 Conclusion

We have studied the efficiency of discriminative scoring of BNCs using alternative parameterizations of the conditional distribution of the class variable. In case the canonical form of the BNC is a perfect graph, there is a choice between at least three parameterizations. We found out that it is wise to exploit perfectness by optimizing in the LR or LR Roos spaces. Based on the experiments we have performed, we would suggest to use LR + BFGS, LR + CG and LR Roos + CG in order of growing structure complexity. If only one method is to be selected, we suggest LR Roos + CG. It works pretty well on any type of data set, plus the mapping is very straightforward, which makes the initialization step easy to implement.

References

J. A. Anderson. 1982. Logistic discrimination. In P. R. Krishnaiah and L. N. Kanal, editors, Classification, Pattern Recognition and Reduction of Dimensionality, volume 2 of Handbook of Statistics, pages 169-191. North-Holland.

C.L. Blake and C.J. Merz. 1998. UCI repository of machine learning databases [http://www.ics.uci.edu/mlearn/mlrepository.html].

U. Fayyad and K. Irani. 1993. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of IJCAI-93 (volume 2), pages 1022-1027. Morgan Kaufmann.

R. Greiner, S. Xiaoyuan, B. Shen, and W. Zhou. 2005. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59:297-322.

T. Minka. 2001. Algorithms for maximum-likelihood logistic regression. Technical Report Statistics 758, Carnegie Mellon University.

J. C. Nash. 1990. Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation (2nd ed.). Hilger.

F. Pernkopf and J. Bilmes. 2005. Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 657-664, New York. ACM Press.

W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1992. Numerical Recipes in C: the art of scientific computing. Cambridge.

T. Roos, H. Wettig, P. Grünwald, P. Myllymäki, and H. Tirri. 2005. On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59:267-296.

G. Santafé, J. A. Lozano, and P. Larrañaga. 2005. Discriminative learning of Bayesian network classifiers via the TM algorithm. In L. Godo, editor, ECSQARU 2005, volume 3571 of Lecture Notes in Computer Science, pages 148-160. Springer.

C. Sutton and A. McCallum. 2006. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press. To appear.

W.N. Venables and B.D. Ripley. 2002. Modern Applied Statistics with S (fourth edition). Springer, New York.
step easy to implement.



The Independency tree model:
a new approach for clustering and factorisation
M. Julia Flores and José A. Gámez
Computing Systems Department / SIMD (i3A)
University of Castilla-La Mancha
Albacete, 02071, Spain

Serafín Moral
Departamento de Ciencias de la Computación e I. A.
Universidad de Granada
Granada, 18071, Spain

Abstract
Taking as an inspiration the so-called Explanation Tree for abductive inference in Bayesian
networks, we have developed a new clustering approach. It is based on exploiting the
variable independencies with the aim of building a tree structure such that in each leaf all
the variables are independent. In this work we produce a structure called Independency
tree. This structure can be seen as an extended probability tree, introducing a new
and very important element: a list of probabilistic single potentials associated to every
node. In the paper we will show that the model can be used to approximate a joint
probability distribution and, at the same time, as a hierarchical clustering procedure.
The Independency tree can be learned from data and it allows a fast computation of
conditional probabilities.

1 Introduction

In the last years the relevance of unsupervised classification within data mining processing has been remarkable. When dealing with a large number of cases in real applications, the identification of common features that allows the formation of groups/clusters of cases seems to be a powerful capability that both simplifies the data processing and also allows the user to understand better the trend(s) followed in the registered cases.

In the data mining community this descriptive task is known as cluster analysis (Anderberg, 1973; Duda et al., 2001; Kaufman and Rousseeuw, 1990; Jain et al., 1999), that is, getting a decomposition or partition of a data set into groups in such a way that the objects in one group are similar to each other but as different as possible from the objects in other groups. In fact, as pointed out in (Hand et al., 2001, pg. 293), we could distinguish two different objectives in this descriptive task: (a) segmentation, in which the aim is simply to partition the data in a convenient way, probably using only a small number of the available variables; and (b) decomposition, in which the aim is to see whether the data is composed (or not) of natural subclasses, i.e., to discover whether the overall population is heterogeneous. Strictly speaking, cluster analysis is devoted to the second goal although, in general, the term is widely used to describe both segmentation and cluster analysis problems.

In this work we only deal with categorical or discrete variables and we are closer to segmentation than (strictly speaking) to cluster analysis. Our proposal is a method that in our opinion has several good properties: (1) it produces a tree-like graphical structure that allows us to visually describe each cluster by means of a configuration of the relevant variables for that segment of the population; (2) it takes advantage of the identification of contextual independencies in order to build a decomposition of the joint probability distribution; (3) it stores a joint probability distribution about all the variables, in a simple way, allowing an efficient computation of conditional probabilities. An example of an independence tree is given in figure 1, which will be thoroughly described in the following sections.

The paper is structured as follows: first, in Section 2 we give some preliminaries in relation with our proposed method. Section 3 describes the kind of model we aim to look for, while in Section 4 we propose the algorithm we have designed to discover it from data. Section 5 describes the experiments carried out. Finally, Section 6 is devoted to the conclusions and to describe some possible lines in order to continue our research.

2 Preliminaries

Different types of clustering algorithms can be found in the literature, differing in the type of approach they follow. Probably the three main approaches are: partition-based clustering, hierarchical clustering, and probabilistic model-based clustering. From them, the first two approaches yield a hard clustering in the sense that clusters are exclusive, while the third one yields a soft clustering, that is, an object can belong to more than one cluster following a probability distribution. Because our approach is somewhat related to hierarchical and probabilistic model-based clustering, we briefly comment on these types of clustering.

Hierarchical clustering (Sneath and Sokal, 1973) returns a tree-like structure called dendrogram. This structure reflects the way in which the objects in the data set have been merged from single points to the whole set or split from the whole set to single points. Thus, there are two distinct types of hierarchical methods: agglomerative (by merging two clusters) and divisive (by splitting a cluster). Although our method does not exactly belong to hierarchical clustering, it is closer to hierarchical divisive methods than to any other clustering approach (known by us). In concrete, it works as monothetic divisive clustering algorithms (Kaufman and Rousseeuw, 1990), that split clusters using one variable at a time, but differs from classical divisive approaches in the use of a Bayesian score to decide which variable is chosen at each step, and because branches are not completely developed.

With respect to probabilistic clustering, although our method produces a hard clustering, it is somehow related to it because probability distributions are used to complete the information about the discovered model. Probabilistic model-based clustering is usually modelled as a mixture of models (see e.g. (Duda et al., 2001)). Thus, a hidden random variable is added to the original observed variables and its states correspond with the components of the mixture (the number of clusters). In this way we move to a problem of learning from unlabelled data, and usually the EM algorithm (Dempster et al., 1977) is used to carry out the learning task when the graphical structure is fixed, and structural EM (Friedman, 1998) when the graphical structure has also to be discovered (Peña et al., 2000). Iterative approaches have been described in the literature (Cheeseman and Stutz, 1996) in order to discover also the number of clusters (components of the mixture). Notice that although this is not the goal of the learning task, the structures discovered can be used for approximate inference (Lowd and Domingos, 2005), having the advantage over general Bayesian networks (Jensen, 2001) that the learned graphical structure is, in general, simple (i.e. naive Bayes) and so inference is extremely fast.

3 Independency tree model

If we have the set of variables X = {X1, . . . , Xm} regarding a certain domain, an independent probability tree for this set of variables is a tree such that (e.g. figure 1):

- Each inner node N is labelled by a variable Var(N), and this node has a child for each one of the possible values (states) of Var(N).


- When a variable is represented by a node in the tree it implies that this variable partitions the space from now on (from this point to the farther branches in the tree) depending on its (nominal) value. So, a variable cannot appear again (deeper) in the tree.

- Each node N will have an associated list of potentials, List(N), with each of the potentials in the list storing a marginal probability distribution for one of the variables in X, including always a potential for the variable Var(N). This list appears in Figure 1 framed into a dashed box.

- For any path from the root to a leaf, each one of the variables in X appears uniquely and exactly once in the lists associated to the nodes along the path. This potential will determine the conditional probability of the variables given the values of the variables on the path from the root to the list containing the potential.

A configuration is a subset of variables Y ⊆ {X1, . . . , Xm} together with a concrete value Yj = yj for each one of the variables Yj ∈ Y. Each node N has an associated configuration determined for the variables in the path from the root to node N (excluding Var(N)) with the values corresponding to the children we have to follow to reach N. This configuration will be denoted as Conf(N).

An independent probability tree represents a joint probability distribution, p, about the variables in X. If x = (x1, . . . , xm), then

p(x) = Π_{i=1}^{m} P_{x(i)}(x_i)

where P_{x(i)} is the potential for variable X_i which is in the path from the root to a leaf determined by configuration x (following in each inner node N with variable Var(N) = X_j the child corresponding to the value of X_j in x).

This decomposition is based on a set of independencies among the variables {X1, . . . , Xm}. Assume that VL(N) is the set of variables of the potentials in List(N), and that DVL(N)¹ is the union of the sets VL(N'), where N' is a node in a path from N to a leaf; then the independencies are generated from the following statement: each variable X in VL(N) - Var(N) is independent of the variables (VL(N) \ {X}) ∪ DVL(N) given the configuration Conf(N); i.e. each variable in the list of a node is independent of the other variables in that list and of the variables in its descendants, given the configuration associated to the node.

An independent probability tree also defines a partition of the set of possible values of the variables in X. The number of clusters is the number of leaves. If N is a leaf with associated configuration Conf(N), then this group is given by all the values x that are compatible with Conf(N) (they have the same value for all the variables in Conf(N)). For example, in figure 1 the configuration X = {0, 1, 0, 1, 0} would fall on the second leaf since X1=0 and X2=1. It is assumed that the probability distributions of the other variables in the cluster are given by the potentials in the path defined by the configuration, for example, P(X3=0)=1.

In this way, with an independence tree we are able to accomplish two goals in one: (1) the variables are partitioned in a hierarchical way that gives us at the same time a clustering result; (2) the probability value of every configuration of the variables is available.

[Figure 1: Illustrative independence tree structure learned from the exclusive dataset. The root node X1 carries the list X1: "0" [0.67] "1" [0.33]; X4: "0" [0.5] "1" [0.5]; X5: "0" [0.5] "1" [0.5]. Its "0" child is a node X2 with list X2: "0" [0.5] "1" [0.5], whose "0" leaf has X3: "0" [0] "1" [1] and whose "1" leaf has X3: "0" [1] "1" [0]; the "1" child of the root is a leaf with X2: "0" [1] "1" [0] and X3: "0" [1] "1" [0].]

¹ D stands for Descendants.
The Independency tree model: a new approach for clustering and factorisation 85
Our structure seeks to keep in leaves those variables that remain independent. At the same time,
if the distribution of one variable is shared by several leaves, we try to store it in their common
ascendant, to avoid repeating it in all the leaves. For that reason, when one variable appears in a
list for a node Nj it means that this distribution is common for all levels from here to a leaf. For
example, in figure 1, the binary variable X4 has a uniform (1/2,1/2) distribution for all leaf cases
(also for intermediate ones), since it is associated to the root node. On the other hand, we can see
how the X2 distribution varies depending on the branch (left or right) we take from the root, that
is, when X1 = 0 the behaviour of variable X2 is uniform, whereas when the value of X1 is 1, X2 is
determined to be 0.
The intuition underlying this model is based on the idea that inside each cluster the variables
are independent. When we have a set of data, groups are defined by common values in certain
variables, with the other variables showing random variations. Imagine that we have a database
with characteristics of different animals including mammals and birds. The presence of these two
groups is based on the existence of dependencies between the variables (having two legs is related
to having wings and feathers). Once these variables are fixed, there can be other variables (size,
colour) that can have random variations inside each group, but that do not define new
subcategories.
Of course, there are some other possible alternatives to define clustering. Ours is based on the
idea that when all the variables are independent, then subdividing the population is not useful,
as we have a simple method to describe the joint behaviour of the variables. However, if some of
the variables are dependent, then the values of some basic variables could help to determine the
values of the other variables, and then it can be useful to divide the population into groups
according to the values of these basic variables.
Another way of seeing it is that having independent variables is the simplest model we can
have. So, we determine a set of categories such that each one is described in a very simple way
(independent variables). This is exploited by other clustering algorithms such as the EM-based
AutoClass clustering algorithm (Cheeseman and Stutz, 1996).
An important fact about this model is that it is a generalisation of some usual models in
classification: the Naive Bayes model (a tree with only one inner node, associated to the class, in
whose children all the rest of the variables are independent), and the classification tree (a tree in
which the list of each inner node only contains one potential and the potential associated to the
class variable always appears in the leaves). This generalisation means that our model is able to
represent the same conditional probability distribution of the class variable with respect to the
rest of the variables, taking as basis the same set of associated independencies.
From this model one may want to apply two kinds of operations:
The production of clusters and their characterisation. A simple depth-first traversal of the tree
from the root to every leaf will give us the corresponding clusters, and the potential lists associated
to the path nodes will indicate the behaviour of the other variables not in the path.
The computation of the probability for a certain complete configuration with a value for each
variable: x = {x1 , x2 , . . . , xm }. This algorithm is recursive and works as follows:

getProb(Configuration {x1 , x2 , . . . , xm }, Node N )
  Prob <- 1
  1    For all Potential Pj in List(N )
  1.1.   Xj <- Var(Pj )
  1.2.   Prob <- Prob * Pj (Xj = xj )
  2    If N is a leaf then return Prob
  3    Else
  3.1.   XN <- Var(N )
  3.2.   Next Node N' <- Branch child(XN = xN )
  3.3.   return Prob * getProb({x1 , x2 , . . . , xm }, N')

If we have a certain configuration we should go through the tree from the root to the leaves,
taking the corresponding branches. Every time we reach a node, we have to use the single
potentials in the associated list to multiply the value of probability by the values of the potentials
corresponding to this configuration.
It is also possible to compute the probabil-
ity for an interest variable Z conditioned to

86 M. J. Flores, J. A. Gmez, and S. Moral
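For illustration, the following Python sketch mirrors the recursive getProb procedure and the tree
of Figure 1. The class layout and names (Node, potentials, children) are ours, chosen only to make
the example runnable; they are not part of the paper.

# Minimal, illustrative sketch of an independent probability tree and getProb.
class Node:
    def __init__(self, potentials, var=None, children=None):
        self.potentials = potentials      # dict: variable -> {value: probability}
        self.var = var                    # branching variable (None for a leaf)
        self.children = children or {}    # dict: value of var -> child Node

def get_prob(x, node):
    """Return p(x) by multiplying the potentials found along the path selected by x."""
    prob = 1.0
    for var, table in node.potentials.items():
        prob *= table[x[var]]
    if node.var is None:                  # leaf: nothing left to branch on
        return prob
    child = node.children[x[node.var]]    # follow the branch chosen by x
    return prob * get_prob(x, child)

# The tree of Figure 1 (exclusive dataset): the root lists X1, X4 and X5 and
# branches on X1; the X1=0 branch lists X2 and branches on it.
leaf_00 = Node({'X3': {0: 0.0, 1: 1.0}})
leaf_01 = Node({'X3': {0: 1.0, 1: 0.0}})
leaf_1  = Node({'X2': {0: 1.0, 1: 0.0}, 'X3': {0: 1.0, 1: 0.0}})
node_x2 = Node({'X2': {0: 0.5, 1: 0.5}}, var='X2', children={0: leaf_00, 1: leaf_01})
root    = Node({'X1': {0: 0.67, 1: 0.33}, 'X4': {0: 0.5, 1: 0.5}, 'X5': {0: 0.5, 1: 0.5}},
               var='X1', children={0: node_x2, 1: leaf_1})

print(get_prob({'X1': 0, 'X2': 1, 'X3': 0, 'X4': 1, 'X5': 0}, root))  # 0.67*0.5*0.5*0.5*1.0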


a generic configuration (a set of observations): values of the leaves that are compatible with
Y = y. This can be done in two steps: 1) this value (compatibility means that to follow
Transform the independent probability tree into this path we do not have to assume Z = z' with
a probability tree (Salmeron et al., 2000); 2) z' \ne z), which is linear too.
Make a marginalisation of the probability tree
by adding in all the variables except in Z as 4 Our clustering algorithm
described in (Salmeron et al., 2000).
In the following we describe the first step. In this section we are going to describe how
It is a recursive procedure that visits all nodes an independent probability tree can be learned
from root to leaves. Each node can pass to its from a database D, with values for all the vari-
children a float, P rob (1 in the root) and a po- ables in X = {X1 , . . . , Xm }.
tential P which depends of variable Z (empty The basics of the algorithm are simple. It
in the root). Each time a node is visited, the tries to determine for each node, the variable
following operations are carried out: with a strongest degree of dependence with the
All the potentials in List(N ) are examined and rest of remaining variables. This variable will
removed. For any potential, depending on its be assigned to this node, repeating the process
variable Xj we proceed as follows: with its children until all the variables are inde-
- If Xj 6= Z, then if Xj = Yk is in the obser- pendent.
vations configuration, P rob is multiplied by the For this, we need a measure of the degree
value of this potential for Yk = yk . of dependence of two variables Xi and Xj in
- If Xj 6= Z and Xj does not appear in the database D. The measure should be centered
observations configuration, then the potential is around 0, in such a way that the variables are
ignored. considered dependent if and only if the measure
- If Xj = Z and Xj is the one associated is greater than 0. In this paper, we consider the
to node N , then we transform each one of the K2 score (Cooper and Herskovits, 1992), mea-
children of Z, but multiplying P rob by the value suring the degree of dependence as the differ-
of potential in Z = z, before transforming the ence in the logarithm of the K2 score of Xj
child corresponding to this value. conditioned to Xi minus the logarithm of K2
- If Xj = Z and Xj is not the one associated score of marginal Xj , i.e. the difference be-
to node N , then the potential of Xj is stored in tween the logarithms of the K2 scores of two
P. networks with two variables: one in which Xi
After examining the list of potentials, we pro- is a parent of Xj and other in which the two
ceed: variables are not connected. Let us call this
- If N is not a leaf node, then we transform degree of dependence Dep(Xi , Xj |D). This is
its children. a non-symmetrical measure and should be read
- If N is a leaf node and P = , we assign the as the influence of Xi on Xj , however in prac-
value P rob to this node. tice the differences between Dep(Xi , Xj |D) and
- If N is a leaf node and P = 6 , we make N Dep(Xj , Xi |D) are not important.
an inner node with variable Z: For each value In any moment, given a variable Xi and a
Z = z, build a node Nz which is a leaf node database D, we can estimate a potential Pi (D)
with a value equal to P rob P(z). Make Nz a for this variable in the database. This potential
child of N . is the estimation of the marginal probability of
This procedure is fast: it is linear in the Xi . Here we assume that this is done by count-
size of the independent probability tree (consid- ing the absolute frequencies of each one of the
ering the number of values of Z constant). The values in the database.
marginalisation in the second step has the same The algorithm starts with a list of variables
time complexity. In fact, this marginalization L which is initially equal to {X1 , . . . , Xm } and
can be done by adding for each value Z = z the a database equal to the original one D, then it

The Independency tree model: a new approach for clustering and factorisation 87
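As a small illustration of the K2-based dependence degree Dep(Xi , Xj |D) described above, the
following Python sketch computes log K2(Xj with parent Xi) minus log K2(Xj with no parent)
from counts. The data representation (a list of dicts) and the function names are ours and are
only meant as an assumption for the example.

from math import lgamma
from collections import Counter, defaultdict

def log_k2(data, target, parent=None):
    """log K2 score of 'target' with an optional single discrete 'parent'."""
    states = sorted({row[target] for row in data})
    r = len(states)
    counts = defaultdict(Counter)            # parent value -> Counter over target values
    for row in data:
        counts[row[parent] if parent else None][row[target]] += 1
    score = 0.0
    for c in counts.values():
        n_q = sum(c.values())
        score += lgamma(r) - lgamma(n_q + r)          # (r-1)! / (N_q + r - 1)!
        score += sum(lgamma(c[s] + 1) for s in states)  # product of N_qk!
    return score

def dep(data, xi, xj):
    """Positive values indicate dependence of Xj on Xi under the K2 criterion."""
    return log_k2(data, xj, parent=xi) - log_k2(data, xj)

# Tiny example: X2 is a copy of X1, X3 is unrelated.
data = [{'X1': a, 'X2': a, 'X3': b} for a in (0, 1) for b in (0, 1) for _ in range(5)]
print(dep(data, 'X1', 'X2') > 0, dep(data, 'X1', 'X3') > 0)   # True False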
determines the root node N and its children in et al., 1993), where a link between two nodes is
a recursive way. For that, for any variable Xi deleted if these nodes are marginally indepen-
in the list, it computes dent.
Dep(X_i |D) = \sum_{X_j \in L} Dep(X_i , X_j |D)        Another approximation is that it is assumed
that all the variables are independent in a leaf
when Dep(X_k |D) \le 0, even if some of the terms
Then, the variable Xk with maximum value we are adding are positive. We have found that
of Dep(Xi |D) is considered. this is a good compromise criterion to limit the
If Dep(Xk |D) > 0, then we assign variable complexity of learned models.
Xk to node N and we add potential Pk (D) to
the list List(N ), removing Xk from L. For all 5 Experiments
the remaining variables Xi in L, we compute
Dep(Xi , Xj |D) for Xj \in L (j \ne i), and Xj = To make an initial evaluation of the Indepen-
Xk . If all these values are less than or equal to dence Tree (IndepT) model, we decided to com-
0, then we add potential Pi (D) to List(N ) and pare it with other well known unsupervised
remove Xi from L; i.e. we keep in this node classification techniques that are also based on
the variables which are independent of the rest Probabilistic Graphical Models: learning of a
of variables, including the variable in the node. Bayesian network (by a standard algorithm as
Finally, we build a child of N for each one of PC) and also Expectation-Maximisation with
the values Xk = xk . This is done by calling a Naive Bayes structure, which uses cross-
recursively to the same procedure, but with the validation to decide the number of clusters. Be-
new list of nodes, and changing database D to cause we produce a probabilistic description of
D[Xk = xk ], where D[Xk = xk ] is the subset the dataset, we use the log-likelihood (logL) of
of D given by those cases in which variable Xk the data given the model to score a given clus-
takes the value xk . tering. By using the logL as goodness measure
If Dep(Xk |D) \le 0, then the process is rithms for probabilistic model-based unsuper-
stopped and this is a leaf node. We build rithms for probabilistic model-based unsuper-
List(N ), by adding potential Pi (D) for any vari- vised learning: probabilistic model-based clus-
able Xi in L. tering and Bayesian networks. This is a direct
In this algorithm, the complexity of each node evaluation of the procedures as methods to en-
computation is limited by O(m^2 \cdot n) where m is code a complex joint probability distribution.
the number of variables and n is the database At the same time, it is also an evaluation of our
size. The number of leaves is limited by the method from the clustering point of view, show-
database size n multiplied by the maximum ing whether the proposed segmentation is useful
number of cases of a variable, as a leaf with only for a simple description of the population.
one case of the database is never branched. But Then, the basic steps we have followed
usually the number of nodes is much lower. for the three procedures [IndepT,PC-Learn,EM-
In the algorithm, we make some approxima- Naive] are:
tions with respect to the independencies repre- 1. Divide the data cases into a training or data
sented by the model. First, we only look at one- set (SD ) and a test set (ST ), using (2/3,1/3).
to-one dependencies, and not to joint dependen- 2. Build the corresponding model for the cases
cies. It can be the case that Xi is indepen- in SD .
dent of Xj and Xk and it is not independent of 3. Compute the log-likelihood of the obtained
(Xj , Xk ). However, testing these independencies model over the data in ST .
is more costly and we do not have the possibility Among the tested cases, apart from easy syn-
of a direct representation of this in the model. thetic databases created to check the ex-
This assumption is made by other Bayesian net- pected and right clusters, we have looked for
works learning algorithms such as PC (Spirtes real sets of cases. Some of them have been taken

88 M. J. Flores, J. A. Gmez, and S. Moral
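The recursive construction just described can be summarised by the following schematic Python
sketch. It assumes a pairwise dependence function dep(data, xi, xj), such as the K2-based one
sketched earlier; the data layout and all names are illustrative, not taken from the paper.

from collections import Counter

def marginal(data, var):
    c = Counter(row[var] for row in data)
    n = len(data)
    return {v: c[v] / n for v in c}

def learn_indep_tree(data, variables, dep):
    node = {'potentials': {}, 'var': None, 'children': {}}
    remaining = list(variables)
    # Total dependence of each candidate with the rest of the list.
    totals = {xi: sum(dep(data, xi, xj) for xj in remaining if xj != xi)
              for xi in remaining}
    best = max(totals, key=totals.get) if totals else None
    if best is None or totals[best] <= 0 or len(data) <= 1:
        # Leaf: store the marginal of every remaining variable.
        for xi in remaining:
            node['potentials'][xi] = marginal(data, xi)
        return node
    node['var'] = best
    node['potentials'][best] = marginal(data, best)
    remaining.remove(best)
    # Keep in this node the variables independent of the rest and of the branching variable.
    for xi in list(remaining):
        if all(dep(data, xi, xj) <= 0 for xj in remaining + [best] if xj != xi):
            node['potentials'][xi] = marginal(data, xi)
            remaining.remove(xi)
    # One child per value of the branching variable, learned from the restricted database.
    for value in sorted({row[best] for row in data}):
        subset = [row for row in data if row[best] == value]
        node['children'][value] = learn_indep_tree(subset, remaining, dep)
    return node

The returned nested dictionary plays the role of the independent probability tree, with one entry
per node holding its potentials, its branching variable and its children.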


case        IndepT        PC-Learn       EM-Naive
1           -61.33        -62.88         -119.67
2           -1073.94      -2947.85       -3535.11
3A          -3016.92      -2644.38       -4297.21
3B          -2100.40      -3584.06       -6266.71
3C          -4956.04      -7630.98       -15145.15
3D          -1340.04      -2770.15       -4223.45
4A          -6853.12      -18074.69      -32213.29
4B          -6802.94      -17357.80      -31244.28
5           -465790.73    -349631.35     -794415.07

Table 1: Comparison in terms of the log-likelihood value for all cases (datasets).

from the UCI datasets repository (Newman et al., 1998) and others from real applications related
to our current research environment.
We will indicate the main remarks about all the evaluated cases:
Case 1: exclusive: This is a very simple dataset with five binary variables, from X1 to X5 .
The first three variables have an exclusive behaviour, that is, if one of them is 1 the other two
will be 0. On the other hand, X4 and X5 are independent of the first three and
also of each other. This example is interest-
ing not especially for the LogL comparison, but
mainly to see the interpretation of the tree given
in figure 1. the tree learned in the exclusive case. As it
Case 2: tic-tac-toe: Taken from UCI reposi- can be observed, the tree really captures what
tory and it encodes the complete set of possible is happening in the problem: X4 and X5 are
board configurations at the end of tic-tac-toe independent and placed in the list of root node
games. It presents 958 cases and 9 attributes and then one of the other variables is looked up,
(one per each game square). if its value is 1, then the values of the other two
Cases 3: greenhouses: Data cases taken variables is completely determined and we stop.
from real greenhouses located in Almera If its value is 0, then another variable has to be
(Spain). There are four distinct sets of examined.
cases, here we indicate them with denotation With respect to the capacity of approximat-
(case)={num cases, num attributes}: (3A)= ing the joint probability distribution, we can
{1240,8}, (3B)= {1465,17}, (3C)= {1318,33}, see in table 1 that our method always provides
(3D)= {1465,6}. greater values of logL than EM-Naive in all the
Cases 4: sheep: Datasets taken from the datasets. With respect to PC algorithm, the
work developed in (Flores and Gamez, 2005) independent tree wins in all the situations ex-
where the historical data of the different ani- cept in two of them (cases 3A and 5). This is
mals were registered and used to analyse their a remarkable result as our model has some lim-
genetic merit for milk production directed to itations in representing sets of independencies
Manchego cheese. There are two datasets with that can be easily represented by a Bayesian
3087 cases, 4A with 24 variables/attributes and network (for example a Markov chain), however
4B, where the attribute breeding value (that its behaviour is usually better. We think that
could be interpreted as class attribute) has been this is due to two main aspects: it can represent
removed. asymmetrical independencies and it is a simple
Case 5: connect4: Also downloaded from the model (which is always a virtue).
UCI repository and it contains all legal 8-play
positions in the game of connect-4 in which 6 Concluding remarks and future
neither player has won yet, and in which the work
next move is not forced. There are 67557 cases
and 42 attributes, each corresponding to one In this paper we have proposed a new model for
connect-4 square2 . representing a joint probability distribution in a
The results of the experiments can be found compact way, which can be used for fast compu-
in figure 1 and in table 1. Figure 1 presents tation of conditional probability distributions.
This model is based on a partition of the state
2
Actually, only the half of the cases have been used. space, in such a way that in each group all the

The Independency tree model: a new approach for clustering and factorisation 89
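The evaluation protocol used above (a (2/3, 1/3) split followed by the held-out log-likelihood of
the learned model) can be sketched as follows. The functions fit and prob stand for any model
offering these two operations, for instance the independence tree sketched earlier; everything here
is an illustrative assumption, not code from the paper.

import math
import random

def holdout_loglik(data, fit, prob, seed=0):
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    cut = (2 * len(rows)) // 3
    train, test = rows[:cut], rows[cut:]
    model = fit(train)
    # A real implementation would smooth zero probabilities before taking logs.
    return sum(math.log(max(prob(model, row), 1e-300)) for row in test)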
variables are independent. In this sense, it is at A. P. Dempster, N. M. Laird, and D.B. Rubin. 1977.
the same time a clustering algorithm. Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical
In the experiments with real and syntheti-
Society, 1:138.
cal data we have shown its good behaviour as
a method of approximating a joint probability, R.O. Duda, P.E. Hart, and D.J. Stork. 2001. Pat-
providing results that are better (except for two tern Recognition. Wiley.
cases for the PC algorithm) to the factorisa- M.J. Flores and J. A. Gamez. 2005. Breeding value
tion provided by an standard Bayesian networks classification in manchego sheep: A study of at-
learning algorithm (PC) and by EM clustering tribute selection and construction. In Kowledge-
Based Intelligent Information and Enginnering
algorithm. In any case, more extensive experi-
Systems. LNAI, volume 3682, pages 13381346.
ments are necessary in order to compare it with Springer Verlag.
another clustering and factorisation procedures.
We think that this model can be improved N. Friedman. 1998. The Bayesian structural EM al-
gorithm. In Proceedings of the 14th Conference on
and exploited in several ways: (1) Some of the Uncertainty in Artificial Intelligence (UAI-98),
clusters can have very few cases or can be very pages 129138. Morgan Kaufmann.
similar in the distributions of the variables. We
D. Hand, H. Mannila, and P. Smyth. 2001. Princi-
could devise a procedure to joint these clusters; ples of Data Mining. MIT Press.
(2) we feel that the different number of values of
the variables can have some undesirable effects. A.K. Jain, M.M. Murty, and P.J. Flynn. 1999. Data
This problem could be solved by considering bi- clustering: a review. ACM Computing Surveys,
31:264323.
nary trees in which the branching is determined
by a partition of the set of possible values of a F. V. Jensen. 2001. Bayesian Networks and Deci-
variable in two parts. This could allow to extend sion Graphs. Springer Verlag.
the model to continuous variables; (3) we could L. Kaufman and P. Rousseeuw. 1990. Finding
determine different stepping branching rules de- Groups in Data. John Wiley and Sons.
pending of the final objective: to approximate a
D. Lowd and P. Domingos. 2005. Naive Bayes mod-
joint probability or to provide a simple (though els for probability estimation. In Proceedings of
not necessarily exhaustive) partition of the state the 22nd ICML conference, pages 529536. ACM
space that could help us to have an idea of how Press.
the variables take their values; (4) to use this
D.J. Newman, S. Hettich, C.L. Blake, and C.J.
model in classification problems. Merz. 1998. UCI repository of machine learning
databases.
Acknowledgments
This work has been supported by Spanish MEC J.M. Pena, J.A. Lozano, and P. Larranaga. 2000.
An improved Bayesian structural EM algorithm
under project TIN2004-06204-C03-{02,03}. for learning Bayesian networks for clustering.
Pattern Recognition Letters, 21:779786.

References A. Salmeron, A. Cano, and S. Moral. 2000. Impor-


tance sampling in Bayesian networks using prob-
M.R. Anderberg. 1973. Cluster Analysis for Appli- ability trees. Computational Statistics and Data
cations. Academic Press. Analysis, 34:387413.
P. Cheeseman and J. Stutz. 1996. Bayesian classi- P.H. Sneath and R.R. Sokal. 1973. Numerical Tax-
fication (AUTOCLASS): Theory and results. In onomy. Freeman.
U. M. Fayyad, G. Piatetsky-Shapiro, P Smyth,
and R. Uthurusamy, editors, Advances in Knowl- P. Spirtes, C. Glymour, and R. Scheines. 1993. Cau-
edge Discovery and Data Mining, pages 153180. sation, Prediction and Search. Springer Verlag.
AAAI Press/MIT Press.
G.F. Cooper and E.A. Herskovits. 1992. A Bayesian
method for the induction of probabilistic networks
from data. Machine Learning, 9:309347.

90 M. J. Flores, J. A. Gmez, and S. Moral


Learning the Tree Augmented Naive Bayes Classifier from
incomplete datasets
Olivier C.H. Francois and Philippe Leray
LITIS Lab., INSA de Rouen, BP 08, av. de l'Université
76801 Saint-Etienne-Du-Rouvray, France.

Abstract
The Bayesian network formalism is becoming increasingly popular in many areas such
as decision aid or diagnosis, in particular thanks to its inference capabilities, even when
data are incomplete. For classification tasks, Naive Bayes and Augmented Naive Bayes
classifiers have shown excellent performances. Learning a Naive Bayes classifier from
incomplete datasets is not difficult as only parameter learning has to be performed. But
there are not many methods to efficiently learn Tree Augmented Naive Bayes classifiers
from incomplete datasets. In this paper, we take up the structural em algorithm principle
introduced by (Friedman, 1997) to propose an algorithm to answer this question.

1 Introduction which often give better performances than the


Naive Bayes classifier, require structure learn-
Bayesian networks are a formalism for proba- ing. Only a few methods of structural learning
bilistic reasoning increasingly used in decision deal with incomplete data.
aid, diagnosis and complex systems control. Let
X = {X1 , . . . , Xn } be a set of discrete random vari- We introduce in this paper a method to learn
ables. A Bayesian network B =< G, > is de- Tree Augmented Naive Bayes (tan) classifiers
fined by a directed acyclic graph G =< N, U > based on the expectation-maximization (em)
where N represents the set of nodes (one node principle. Some previous work by (Cohen et
for each variable) and U the set of edges, and al., 2004) also deals with tan classifiers and em
parameters = {ijk }16i6n,16j6qi ,16k6rithe set principle for partially unlabeled data. In there
of conditional probability tables of each node work, only the variable corresponding to the
Xi knowing its parents state P i (with ri and class can be partially missing whereas any vari-
qi as respective cardinalities of Xi and Pi ). able can be partially missing in the approach we
If G and are known, many inference algo- propose here.
rithms can be used to compute the probability We will therefore first recall the issues relat-
of any variable that has not been measured con- ing to structural learning, and review the vari-
ditionally to the values of measured variables. ous ways of dealing with incomplete data, pri-
Bayesian networks are therefore a tool of choice marily for parameter estimation, and also for
for reasoning in uncertainty, based on incom- structure determination. We will then exam-
plete data, which is often the case in real appli- ine the structural em algorithm principle, before
cations. proposing and testing a few ideas for improve-
It is possible to use this formalism for clas- ment based on the extension of the Maximum
sification tasks. For instance, the Naive Bayes Weight Spanning Tree algorithm to deal with
classifier has shown excellent performance. This incomplete data. Then, we will show how to
model is simple and only need parameter learn- use the introduced method to learn the well-
ing that can be performed with incomplete known Tree Augmented Naive Bayes classifier
datasets. Augmented Naive Bayes classifiers from incomplete datasets and we will give some
(with trees, forests or Bayesian networks), experiments on real data.
2 Preliminary remarks 2.2 Bayesian classifiers
2.1 Structural learning Bayesian classifiers as Naive Bayes have shown
Because of the super-exponential size of the excellent performances on many datasets. Even
search space, exhaustive search for the best if the Naive Bayes classifier has underlying
structure is impossible. Many heuristic meth- heavy independence assumptions, (Domingos
ods have been proposed to determine the struc- and Pazzani, 1997) have shown that it is op-
ture of a Bayesian network. Some of them rely timal for conjunctive and disjunctive concepts.
on human expert knowledge, others use real They have also shown that the Naive Bayes clas-
data which need to be, most of the time, com- sifier does not require attribute independence to
pletely observed. be optimal under Zero-One loss.
Here, we are more specifically interested in Augmented Naive Bayes classifier appear as
score-based methods. Primarily, greedy search a natural extension to the Naive Bayes classi-
algorithm adapted by (Chickering et al., 1995) fier. It allows to relax the assumption of inde-
and maximum weight spanning tree (mwst) pendence of attributes given the class variable.
proposed by (Chow and Liu, 1968) and ap- Many ways to find the best tree to augment
plied to Bayesian networks in (Heckerman et al., the Naive Bayes classifier have been studied.
1995). The greedy search carried out in directed These Tree Augmented Naive Bayes classifiers
acyclic graph (dag) space where the interest of (Geiger, 1992; Friedman et al., 1997) are a re-
each structure located near the current struc- stricted family of Bayesian Networks in which
ture is assessed by means of a bic/mdl type the class variable has no parent and each other
measurement (Eqn.1)1 or a Bayesian score like attribute has as parents the class variable and
bde (Heckerman et al., 1995). at most one other attribute. The bic score of
BIC(G, \Theta) = \log P(D|G, \Theta) - \frac{\log N}{2} Dim(G)     (1)

where Dim(G) is the number of parameters used for the Bayesian network representation and
N is the size of the dataset D.
The bic score is decomposable. It can be written as the sum of the local scores computed for
each node as BIC(G, \Theta) = \sum_i bic(X_i, P_i, \Theta_{X_i|P_i}), where

bic(X_i, P_i, \Theta_{X_i|P_i}) = \sum_{X_i=x_k} \sum_{P_i=pa_j} N_{ijk} \log \theta_{ijk} - \frac{\log N}{2} Dim(\Theta_{X_i|P_i})     (2)

with Nijk the occurrence number of {Xi = xk and Pi = paj } in D.
such a Bayesian network is given by Eqn.3:

BIC(TAN, \Theta) = bic(C, \emptyset, \Theta_{C|\emptyset}) + \sum_i bic(X_i, \{C, P_i\}, \Theta_{X_i|\{C,P_i\}})     (3)

where C stands for the class node and Pi could only be the empty set or a singleton {Xj },
Xj \notin {C, Xi }.
Forest Augmented Naive Bayes classifier (fan) is very close to the tan one. In this model,
the augmented structure is not a tree, but a set of disconnected trees in the attribute
space (Sacha, 1999).
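For a complete dataset, the decomposable score of Eqns. 1 and 2 can be computed from counts as
in the following Python sketch. The data layout, function names and toy example are assumptions
made only for illustration.

from math import log
from collections import Counter

def local_bic(data, var, parents):
    """bic(Xi, Pi, theta_{Xi|Pi}) with maximum-likelihood parameters."""
    joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * log(n / parent_counts[pa]) for (pa, _), n in joint.items())
    r = len({row[var] for row in data})
    q = max(len(parent_counts), 1)
    dim = (r - 1) * q                     # free parameters of the conditional table
    return loglik - 0.5 * log(len(data)) * dim

def bic(data, parent_sets):
    """BIC(G, theta) = sum of local scores; parent_sets maps each variable to its parents."""
    return sum(local_bic(data, x, pa) for x, pa in parent_sets.items())

# Example: score a Naive Bayes graph C -> X1, C -> X2 on toy data.
data = [{'C': c, 'X1': c, 'X2': 1 - c} for c in (0, 1) for _ in range(10)]
print(bic(data, {'C': [], 'X1': ['C'], 'X2': ['C']}))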
The principle of the mwst algorithm is rather
different. This algorithm determines the best 2.3 Dealing with incomplete data
tree that links all the variables, using a mutual
information measurement like in (Chow and 2.3.1 Practical issue
Liu, 1968) or the bic score variation when two Nowadays, more and more datasets are avail-
variables become linked as proposed by (Hecker- able, and most of them are incomplete. When
man et al., 1995). The aim is to find an optimal we want to build a model from an incomplete
solution, but in a space limited to trees. dataset, it is often possible to consider only the
1
As (Friedman, 1997), we consider that the bic/mdl complete samples in the dataset. But, in this
score is a function of the graph G and the parameters , case, we do not have a lot of data to learn the
generalizing the classical definition of the bic score which model. For instance, if we have a dataset with
is defined with our notation by BIC(G, ) where
is obtained by maximizing the likelihood or BIC(G, ) 2000 samples on 20 attributes with a probabil-
score for a given G. ity of 20% that a data is missing, then, only

92 O. C.H. Franois and P. Leray


23 samples (in average) are complete. General- Do where Xi and P a(Xi ) are measured, not only
izing from the example, we see that we cannot in Dco (where all Xi s are measured) as in the
ignore the problem of incomplete datasets. previous approach.
Many methods try to rely more on all the ob-
2.3.2 Nature of missing data
served data. Among them are sequential updat-
Let D = {Xil }16i6n,16l6N our dataset, with ing (Spiegelhalter and Lauritzen, 1990), Gibbs
Do the observed part of D, Dm the missing part
sampling (Geman and Geman, 1984), and ex-
and Dco the set of completely observed cases in pectation maximisation (EM) in (Dempster et
Do . Let also M = {Mil } with Mil = 1 if Xil is
al., 1977). Those algorithms use the missing
missing, 0 if not. We then have the following data mar properties. More recently, bound
relations: and collapse algorithm (Ramoni and Sebastiani,
Dm = {Xil / Mil = 1}16i6n,16l6N
1998) and robust Bayesian estimator (Ramoni
Do = {Xil / Mil = 0}16i6n,16l6N
and Sebastiani, 2000) try to resolve this task
Dco = {[X1l . . . Xnl ] / [M1l . . . Mnl ] = [0 . . . 0]}16l6N
whatever the nature of missing data.
Dealing with missing data depends on their EM has been adapted by (Lauritzen, 1995)
nature. (Rubin, 1976) identified several types to Bayesian network parameter learning when
of missing data: the structure is known. Let log P (D|) =
mcar (Missing Completly At Random):
log P (Do , Dm |) be the data log-likelihood.
P (M|D) = P (M), the probability for data
Dm being an unmeasured random variable, this
to be missing does not depend on D,
log-likelihood is also a random variable function
mar (Missing At Random): P (M|D) =
of Dm . By establishing a reference model , it
P (M|Do ), the probability for data to be
is possible to estimate the probability density of
missing depends on observed data,
nmar (Not Missing At Random): the prob- the missing data P (Dm | ) and therefore to cal-
ability for data to be missing depends on culate Q( : ), the expectation of the previous
both observed and missing data. log-likelihood:
mcar and mar situations are the easiest to Q( : ) = E [log P (Do , Dm |)] (4)
solve as observed data include all necessary in- So Q( : ) is the expectation of the likelihood
formation to estimate missing data distribution. of any set of parameters calculated using a
The case of nmar is trickier as outside informa- distribution of the missing data P (Dm | ). This
tion has to be used to model the missing data equation can be re-written as follows.
distribution. n
X X X
Q( : ) =
Nijk log ijk (5)
2.3.3 Learning with incomplete data i=1 Xi =xk Pi =paj

With mcar data, the first and simplest possi- where Nijk
= E [Nijk ] = N P (Xi = xk , Pi =
ble approach is the complete case analysis. This
paj | )is obtained by inference in the network
is a parameter estimation based on Dco , the set < G, > if the {Xi ,Pi } are not completely mea-
of completely observed cases in Do . When D is sured, or else only by mere counting.
mcar, the estimator based on Dco is unbiased. (Dempster et al., 1977) proved convergence
However, with a high number of variables the of the em algorithm, as the fact that it was not
probability for a case [X1l . . . Xnl ] to be completely necessary to find the global optimum i+1 of
measured is low and Dco may be empty. function Q( : i ) but simply a value which
One advantage of Bayesian networks is that, would increase function Q (Generalized em).
if only Xi and Pi = P a(Xi ) are measured, then
the corresponding conditional probability table 2.3.4 Learning G with incomplete
can be estimated. Another possible method dataset
with mcar cases is the available case analy- The main methods for structural learning
sis, i.e. using for the estimation of each con- with incomplete data use the em principle: Al-
ditional probability P (Xi |P a(Xi )) the cases in ternative Model Selection em (ams-em) pro-

Learning the Tree Augmented Naive Bayes Classifier from incomplete datasets 93
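As a toy illustration of the expected counts used by parametric em (cf. Eqn. 5), the following
Python sketch handles a single child X with one fully observed parent P, so that the required
inference reduces to looking up the current conditional table. Missing values of X are written as
None; the data layout and names are illustrative assumptions only.

from collections import defaultdict

def expected_counts(data, theta):
    """Return N*_{jk} = E_theta[N_{jk}] for the family P -> X."""
    n = defaultdict(float)
    for p, x in data:
        if x is None:                      # missing: spread the case over the values of X
            for k, pr in theta[p].items():
                n[(p, k)] += pr
        else:                              # observed: plain counting
            n[(p, x)] += 1.0
    return dict(n)

def m_step(counts):
    """Re-estimate theta_{jk} = N*_{jk} / sum_k N*_{jk}."""
    totals = defaultdict(float)
    for (p, _), v in counts.items():
        totals[p] += v
    return {p: {k: v / totals[p] for (q, k), v in counts.items() if q == p}
            for p in totals}

data = [(0, 0), (0, 0), (0, None), (1, 1), (1, None), (1, 0)]
theta = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # uniform starting point
for _ in range(10):                                   # a few EM iterations
    theta = m_step(expected_counts(data, theta))
print(theta)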
posed by (Friedman, 1997) or Bayesian Struc- Algorithm 2 : Detailed em for structural learning
tural em (bs-em) (Friedman, 1998). We can 1: Init: f inished = f alse, i = 0
also cite the Hybrid Independence Test pro- Random or heuristic choice of the initial
Bayesian network (G 0 , 0,0 )
posed in (Dash and Druzdzel, 2003) that can 2: repeat
use em to estimate the essential sufficient statis- 3: j=0
4: repeat
tics that are then used for an independence 5: i,j+1 = argmax Q(G i , : G i , i,j )
test in a constraint-based method. (Myers et 6: j =j+1

al., 1999) also proposes a structural learning 7: until convergence (i,j i,j )
o

o o
method based on genetic algorithm and mcmc. 8: if i = 0 or |Q(G i , i,j : G i1 , i1,j )
o o
We will now explain the structural em algorithm Q(G i1 , i1,j : G i1 , i1,j )| >  then
o
principle in details and see how we could adapt 9: G i+1 = arg max Q(G, : G i , i,j )
GVG i
it to learn a tan model. 10: i+1,0 = argmax Q(G i+1 , : G i , i,j )
o


11: i=i+1
3 Structural em algorithm 12: else
13: f inished = true
3.1 General principle 14: end if
The EM principle, which we have described 15: until f inished
above for parameter learning, applies more
generally to structural learning (Algorithm 1
a super-exponential space. However, with Gen-
as proposed by (Friedman, 1997; Friedman,
eralised em it is sufficient to look for a better
1998)).
solution rather than the best possible one, with-
Algorithm 1 : Generic em for structural learning out affecting the algorithm convergence proper-
1: Init: i = 0 ties. This search for a better solution can then
Random or heuristic choice of the initial be done in a limited space, like for example VG ,
Bayesian network (G 0 , 0 ) the set of the neigbours of graph G that have
2: repeat
3: i=i+1 been generated by removal, addition or inver-
4: (G i , i ) = argmax Q(G, : G i1 , i1 ) sion of an arc.
G,
5: until |Q(G i , i : G i1 , i1 ) Concerning the search in the space of the pa-
Q(G i1 , i1 : G i1 , i1 )| 6  rameters (Eqn.7), (Friedman, 1997) proposes
repeating the operation several times, using a
The maximization step in this algorithm (step clever initialisation. This step then amounts to
4) has to be performed in the joint space running the parametric em algorithm for each
{G, } which amounts to searching the best structure G i , starting with structure G 0 (steps 4
structure and the best parameters correspond- to 7 of Algorithm 2). The two structural em
ing to this structure. In practice, these two algorithms proposed by Friedman can therefore
steps are clearly distinct2 : be considered as greedy search algorithms, with
G i = argmax Q(G, : G i1 , i1 ) (6) EM parameter learning at each iteration.
G
i = argmax Q(G i , : G i1 , i1 ) (7)

3.2 Choice of function Q
where Q(G, : G , ) is the expectation of
the likelihood of any Bayesian network < We now have to choose the function Q that will
G, > computed using a distribution of the be used for structural learning. The likelihood
missing data P (Dm |G , ). used for parameter learning is not a good indi-
Note that the first search (Eqn.6) in the space cator to determine the best graph since it gives
of possible graphs takes us back to the initial more importance to strongly connected struc-
problem, i.e. the search for the best structure in tures. Moreover, it is impossible to compute
2
marginal likelihood when data are incomplete,
The notation Q(G, : . . . ) used in Eqn.6 stands for
E [Q(G, : . . . )] for Bayesian scores or Q(G, o : . . . ) so that it is necessary to rely on an efficient
where o is obtained by likelihood maximisation. approximation like those reviewed by (Chicker-

94 O. C.H. Franois and P. Leray


ing and Heckerman, 1996). In complete data There is a change from the regular structural
cases, the most frequently used measurements em algorithm in step 9, i.e. the search for a
are the bic/mdl score and the Bayesian bde better structure for the next iteration. With
score (see paragraph 2.1). When proposing the the previous structural em algorithms, we were
ms-em and mwst-em algorithms, (Friedman, looking for the best dag among the neighbours
1997) shows how to use the bic/mdl score with of the current graph. With mwst-em, we can
incomplete data, by applying the principle of directly get the best tree that maximises func-
Eqn.4 to the bic score (Eqn.1) instead of likeli- tion Q.
hood. Function QBIC is defined as the bic score In paragraph 2.1, we briefly recalled that the
expectation by using a certain probability den- mwst algorithm used a similarity function be-
sity on the missing data P (Dm |G , ) : tween two nodes which was based on the bic
QBIC (G, : G , ) = (8) score variation whether Xj is linked to Xi or
1 not. This function can be summed up in the
EG , [log P (Do , Dm |G, )] Dim(G) log N
2 following (symmetrical) matrix:
h i h i
As the bic score is decomposable, so is QBIC . Mij = bic(Xi , Xj , Xi|Xj ) bic(Xi , , Xi )
16i,j6n
BIC
X bic (11)
Q (G, : G , ) = Q (Xi , Pi , Xi|Pi : G , )
i where the local bic score is defined in Eqn.2.
where Qbic (Xi , Pi , Xi|Pi : G , ) = (9) Running maximum (weight) spanning algo-
X X log N rithms like Kruskals on matrix M enables us to
Nijk log ijk Dim(Xi|Pi ) (10)
X =x P =pa
2 obtain the best tree T that maximises the sum
i k i j

with Nijk
= EG , [Nijk ] = N P (Xi = xk , Pi =
of the local scores on all the nodes, i.e. function
BIC of Eqn.2.
paj |G , ) obtained by inference in the network
{G , } if {Xi ,Pi } are not completely measured,
By applying the principle we described in sec-
or else only by mere counting. With the same tion 3.2, we can then adapt mwst to incom-
reasoning, (Friedman, 1998) proposes the adap- plete data by replacing the local bic score of
tation of the bde score to incomplete data. Equn.11 with its expectation; to do so, we use a
certain probability density of the missing data
4 TAN-EM, a structural EM for P (Dm |T , ) :
h i h
classification Q
Mij = Qbic (Xi , Pi = {Xj }, Xi|Xj : T , )
i,j
i
(Leray and Francois, 2005) have introduced Qbic (Xi , Pi = , Xi : T , ) (12)
mwst-em an adaptation of mwst dealing with With the same reasoning, running a maximum
incomplete datasets. The approach we propose (weight) spanning tree algorithm on matrix
here is using the same principles in order to ef- M Q enables us to get the best tree T that max-
ficiently learn tan classifiers from incomplete imises the sum of the local scores on all the
datasets. nodes, i.e. function QBIC of Eqn.9.
4.1 MWST-EM, a structural EM in 4.2 TAN-EM, a structural EM for
the space of trees classification
Step 1 of Algorithm 2, like all the previous al- The score used to find the best tan structure
gorithms, deals with the choice of the initial is very similar to the one used in mwst, so we
structure. The choice of an oriented chain graph can adapt it to incomplete datasets by defining
linking all the variables proposed by (Friedman, the following score matrix:
1997) seems even more judicious here, since this h
Q
i h
Mij = Qbic (Xi , Pi = {C, Xj }, Xi|Xj C : T , )
chain graph also belongs to the tree space. Steps i,j
i
4 to 7 do not change. They deal with the run- Qbic (Xi , Pi = {C}, Xi|C : T , ) (13)
ning of the parametric em algorithm for each Using this new score matrix, we can use the
structure Bi , starting with structure B0 . approach previously proposed for mwst-em to

Learning the Tree Augmented Naive Bayes Classifier from incomplete datasets 95
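The tree-construction step shared by mwst-em and tan-em, namely filling a pairwise score matrix
and extracting a maximum weight spanning tree from it, can be sketched as follows. The score
function is a stand-in for the expected local score differences of Eqns. 12 and 13; the Kruskal-style
implementation and the toy matrix are our own illustrative assumptions.

def max_weight_spanning_tree(n, score):
    """Return the edges of a maximum weight spanning tree over nodes 0..n-1."""
    edges = sorted(((score(i, j), i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    parent = list(range(n))

    def find(a):                      # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Toy usage with a symmetric score matrix.
M = [[0, 5, 1, 2],
     [5, 0, 4, 1],
     [1, 4, 0, 3],
     [2, 1, 3, 0]]
print(max_weight_spanning_tree(4, lambda i, j: M[i][j]))

For tan-em, the resulting tree is then oriented and the class node is connected to every attribute,
as described above.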
get the best augmented tree, and connect the dedicated to classification tasks while the oth-
class node to all the other nodes to obtain the ers do not consider the class node as a specific
tan structure. We are currently using the same variable.
reasoning to find the best forest extension. We also give a confidence interval for each
classification rate, based on Eqn.14 proposed by
(Bennani and Bossaert, 1996):

I(\alpha, N) = \frac{T + \frac{Z_\alpha^2}{2N} \pm Z_\alpha \sqrt{\frac{T(1-T)}{N} + \frac{Z_\alpha^2}{4N^2}}}{1 + \frac{Z_\alpha^2}{N}}     (14)

4.3 Related works
(Meila-Predoviciu, 1999) applies mwst algo-
rithm and em principle, but in another frame-
work, learning mixtures of trees. In this work, where N is the number of samples in the dataset,
the data is complete, but a new variable is intro- T is the classification rate and Z_\alpha = 1.96 for \alpha =
duced in order to take into account the weight 95%.
of each tree in the mixture. This variable isn't
5.2 Results
measured so em is used to determine the corre-
sponding parameters. The results are summed up in Table 1. First, we
(Pena et al., 2002) propose a change inside could see that even if the Naive Bayes classifier
the framework of the sem algorithm resulting often gives good results, the other tested meth-
in an alternative approach for learning Bayes ods allow to obtain better classification rates.
Nets for clustering more efficiently. But, where all runnings of nb-em give the same
(Greiner and Zhou, 2002) propose maximiz- results, as em parameter learning only needs
ing conditional likelihood for BN parameter an initialisiation, the other methods do not al-
learning. They apply their method to mcar in- ways give the same results, and then, the same
complete data by using available case analysis classification rates. We have also noticed (not
in order to find the best tan classifier. reported here) that, excepting nb-em, tan-
em seems the most stable method concerning
(Cohen et al., 2004) deal with tan classifiers
the evaluated classification rate while mwst-em
and em principle for partially unlabeled data.
seems to be the less stable.
In there work, only the variable corresponding
The method mwst-em can obtain very good
to the class can be partially missing whereas any
structures with a good initialisation. Then, ini-
variable can be partially missing in our tan-em
tialising it with the results of mwst-em gives us
extension.
stabler results (see (Leray and Francois, 2005)
5 Experiments for a more specific study of this point).
In our tests, except for this house dataset,
5.1 Protocol tan-em always obtains a structure that lead
to better classification rates in comparison with
The experiment stage aims at evaluating the
the other structure learning methods.
Tree Augmented Naive Bayes classifier on
Surprisingly, we also remark that mwst-em
incomplete datasets from UCI repository3 :
can give good classification rates even if the
Hepatitis, Horse, House, Mushrooms and
class node is connected to a maximum of two
Thyroid.
other attributes.
The tan-em method we proposed here is
Regarding the log-likelihood reported in Ta-
compared to the Naive Bayes classifier with
ble 1, we see that the tan-em algorithm finds
em parameters learning. We also indicate the
structures that can also lead to a good approx-
classification rate obtained by three methods:
imation of the underlying probability distribu-
mwst-em, sem initialised with a random chain
tion of the data, even with a strong constraint
and sem initialised with the tree given by
on the graph structure.
mwst-em (sem+t). The first two methods are
Finally, the Table 1 illustrates that tan-
3
http://www.ics.uci.edu/mlearn/MLRepository. em and mwst-em have about the same com-
html plexity (regarding the computational time) and

96 O. C.H. Franois and P. Leray


Datasets N learn test #C %I NB-EM MWST-EM TAN-EM SEM SEM+T
Hepatitis 20 90 65 2 8.4 70.8 [58.8;80.5] 73.8 [62.0;83.0] 75.4 [63.6;84.2] 66.1 [54.0;76.5] 66.1 [54.0;76.5]
-1224.2 ; 29.5 -1147.6 ; 90.4 -1148.7 ; 88.5 -1211.5 ; 1213.1 -1207.9 ; 1478.5
Horse 28 300 300 2 88.0 75 [63.5;83.8] 77.9 [66.7;86.2] 80.9 [69.9;88.5] 66.2 [54.3;76.3] 66.2 [54.3;76.3]
-5589.1 ; 227.7 -5199.6 ; 656.1 -5354.4 ; 582.2 -5348.3 ; 31807 -5318.2 ; 10054
House 17 290 145 2 46.7 89.7 [83.6;93.7] 93.8 [88.6;96.7] 92.4 [86.9;95.8] 92.4 [86.9;95.8] 93.8 [88.6;96.7]
-2203.4 ; 110.3 -2518.0 ; 157.0 -2022.2 ; 180.7 -2524.4 ; 1732.4 -2195.8 ; 3327.2
Mushrooms 23 5416 2708 2 30.5 92.8 [91.7;93.8] 74.7 [73.0;73.4] 91.3 [90.2;92.4] 74.9 [73.2;76.5] 74.9 [73.2;76.5]
-97854 ; 2028.9 -108011 ; 6228.2 -87556 ; 5987.4 -111484 ; 70494 -110828 ; 59795
Thyroid 22 2800 972 2 29.9 95.3 [93.7;96.5] 93.8 [92.1;95.2] 96.2 [94.7;97.3] 93.8 [92.1;95.2] 93.8 [92.1;95.2]
-39348 ; 1305.6 -38881 ; 3173.0 -38350 ; 3471.4 -38303 ; 17197 -39749 ; 14482

Table 1: First line: best classification rate (on 10 runs, except Mushrooms on 5, in %) on test
dataset and its confidence interval, for the following learning algorithms: nb-em, mwst-em, sem,
tan-em and sem+t. Second line: log-likelihood estimated with test data and calculation time
(sec) for the network with the best classification rate. The first six columns give us the name of the
dataset and some of its properties : number of attributes, learning sample size, test sample size,
number of classes and percentage of incomplete samples.

are a good compromise between nb-em (clas- applying in (subspace of) dag space. (Chicker-
sical Naive Bayes with em parameter learning) ing and Meek, 2002) proposed an optimal search
and mwst-em (greedy search with incomplete algorithm (ges) which deals with Markov equiv-
data). alent space. Logically enough, the next step
in our research is to adapt ges to incomplete
6 Conclusions and prospects datasets. Then we could test results of this
Bayesian networks are a tool of choice for rea- method on classification tasks.
soning in uncertainty, with incomplete data.
However, most of the time, Bayesian network 7 Acknowledgement
structural learning only deal with complete This work was supported in part by the ist
data. We have proposed here an adaptation of Programme of the European Community, under
the learning process of Tree Augmented Naive the pascal Network of Excellence, ist-2002-
Bayes classifier from incomplete datasets (and 506778. This publication only reflects the au-
not only partially labelled data). This method thors views.
has been successfuly tested on some datasets.
We have seen that tan-em was a good clas-
sification tool compared to other Bayesian net- References
works we could obtained with structural em like
learning methods. Y. Bennani and F. Bossaert. 1996. Predictive neu-
ral networks for traffic disturbance detection in
Our method can easily be extended to un- the telephone network. In Proceedings of IMACS-
supervised classification tasks by adding a new CESA96, page xx, Lille, France.
step in order to determine the best cardinality
for the class variable. D. Chickering and D. Heckerman. 1996. Efficient
Approximation for the Marginal Likelihood of In-
Related future works are the adaptation of
complete Data given a Bayesian Network. In
some other Augmented Naive Bayes classifiers UAI96, pages 158168. Morgan Kaufmann.
for incomplete datasets (fan for instance), but
also the study of these methods with mar D. Chickering and C. Meek. 2002. Finding optimal
datasets. bayesian networks. In Adnan Darwiche and Nir
Friedman, editors, Proceedings of the 18th Con-
mwst-em, tan-em and sem methods are re- ference on Uncertainty in Artificial Intelligence
spective adaptations of mwst, tan and greedy (UAI-02), pages 94102, S.F., Cal. Morgan Kauf-
search to incomplete data. These algorithms are mann Publishers.

Learning the Tree Augmented Naive Bayes Classifier from incomplete datasets 97
D. Chickering, D. Geiger, and D. Heckerman. 1995. R. Greiner and W. Zhou. 2002. Structural extension
Learning bayesian networks: Search methods and to logistic regression. In Proceedings of the Eigh-
experimental results. In Proceedings of Fifth Con- teenth Annual National Conference on Artificial
ference on Artificial Intelligence and Statistics, Intelligence (AAI02), pages 167173, Edmonton,
pages 112128. Canada.
C.K. Chow and C.N. Liu. 1968. Approximating dis- D. Heckerman, D. Geiger, and M. Chickering. 1995.
crete probability distributions with dependence Learning Bayesian networks: The combination of
trees. IEEE Transactions on Information The- knowledge and statistical data. Machine Learn-
ory, 14(3):462467. ing, 20:197243.
I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, S. Lauritzen. 1995. The EM algorithm for graphical
and T. S. Huang. 2004. Semisupervised learning association models with missing data. Computa-
of classifiers: Theory, algorithms, and their ap- tional Statistics and Data Analysis, 19:191201.
plication to human-computer interaction. IEEE
Transactions on Pattern Analysis and Machine P. Leray and O. Francois. 2005. bayesian network
Intelligence, 26(12):15531568. structural learning and incomplete data. In Pro-
ceedings of the International and Interdisciplinary
D. Dash and M.J. Druzdzel. 2003. Robust inde- Conference on Adaptive Knowledge Representa-
pendence testing for constraint-based learning of tion and Reasoning (AKRR 2005), Espoo, Fin-
causal structure. In Proceedings of The Nine- land, pages 3340.
teenth Conference on Uncertainty in Artificial In-
telligence (UAI03), pages 167174. M. Meila-Predoviciu. 1999. Learning with Mixtures
of Trees. Ph.D. thesis, MIT.
A. Dempster, N. Laird, and D. Rubin. 1977. Maxi-
mum likelihood from incomplete data via the EM J.W. Myers, K.B. Laskey, and T.S. Lewitt. 1999.
algorithm. Journal of the Royal Statistical Soci- Learning bayesian network from incomplete data
ety, B 39:138. with stochatic search algorithms. In Proceedings
of the Fifteenth Conference on Uncertainty in Ar-
P. Domingos and M. Pazzani. 1997. On the optimal- tificial Intelligence (UAI99).
ity of the simple bayesian classifier under zero-one
loss. Machine Learning, 29:103130. J.M. Pena, J. Lozano, and P. Larranaga. 2002.
Learning recursive bayesian multinets for data
N. Friedman, D. Geiger, and M. Goldszmidt. 1997. clustering by means of constructive induction.
bayesian network classifiers. Machine Learning, Machine Learning, 47:1:6390.
29(2-3):131163.
M. Ramoni and P. Sebastiani. 1998. Parameter es-
N. Friedman. 1997. Learning belief networks in the timation in Bayesian networks from incomplete
presence of missing values and hidden variables. databases. Intelligent Data Analysis, 2:139160.
In Proceedings of the 14th International Confer-
ence on Machine Learning, pages 125133. Mor- M. Ramoni and P. Sebastiani. 2000. Robust learn-
gan Kaufmann. ing with missing data. Machine Learning, 45:147
170.
N. Friedman. 1998. The bayesian structural EM
algorithm. In Gregory F. Cooper and Serafn D.B. Rubin. 1976. Inference and missing data.
Moral, editors, Proceedings of the 14th Conference Biometrika, 63:581592.
on Uncertainty in Artificial Intelligence (UAI-
98), pages 129138, San Francisco, July. Morgan J.P. Sacha. 1999. New Synthesis of bayesian Net-
Kaufmann. work Classifiers and Cardiac SPECT Image In-
terpretation. Ph.D. thesis, The University of
D. Geiger. 1992. An entropy-based learning algo- Toledo.
rithm of bayesian conditional trees. In Uncer-
tainty in Artificial Intelligence: Proceedings of the D. J. Spiegelhalter and S. L. Lauritzen. 1990. Se-
Eighth Conference (UAI-1992), pages 9297, San quential updating of conditional probabilities on
Mateo, CA. Morgan Kaufmann Publishers. directed graphical structures. Networks, 20:579
605.
S. Geman and D. Geman. 1984. Stochastic re-
laxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 6(6):721
741, November.

98 O. C.H. Franois and P. Leray


Lattices for Studying Monotonicity of Bayesian Networks

Linda C. van der Gaag, Silja Renooij, and Petra L. Geenen
Department of Information and Computing Sciences, Utrecht University,
P.O. Box 80.089, 3508 TB Utrecht, the Netherlands

Abstract

In many real problem domains, the main variable of interest behaves monotonically in terms of
the observable variables, in the sense that higher values for the variable of interest become more
likely with higher-ordered observations. Unfortunately, establishing whether or not a Bayesian
network exhibits these monotonicity properties is highly intractable in general. In this paper, we
present a method that, by building upon the concept of assignment lattice, provides for
identifying any violations of the properties of (partial) monotonicity of the output and for
constructing minimal offending contexts. We illustrate the application of our method with a real
Bayesian network in veterinary science.
CRrL[CFE%V%hWj<;B:[<E%PL:<f<gRC:h&CF@BCFE ;BM!h&V%;IV&C\$C8V%hV%j<;IV PBh%CFE%@A;IYMdC?@A;IE 6;IMdCmh 6d_
h%f<HBHBCh&V&C:[<E%PB[CFE%V%69ChPB_U\$PQ8DPBV&PQ8<6gF6dV~lB> Ch&V%f<:L} S EsI qp n r 8S EsI sp n c
6dC:V%jDCh&C[<E PB[CFE V%6dChW698V%jDC8DCFV~PBE%f<h%698DHPQfDE n D n co
@BCFE 69gF;IV%6dPQ82\$CFV%jDPL:>C_cPQf<8<:2;h%\];BMM#8'f<\} _cPBE#;BM9MLNbPQ698V,@I;BM9fDC;Bh%h 6dHQ8<\$C8V%h n ! n c V&`P ]>`8D_cPBE&}
CFE?PB_@L6dPQM9;IV%69PQ8<h!PB_V%jDC$h%f<HBHBCh&V&C:*[<E PB[CFE V%6dChPB_ \];BMMdlh&[C;IL698DHDJC]j<;@BC]V%j;IV;8DCFV~PBE%69h69h&PI}
\$PQ8<PBV&PQ8<69gF6dVmlBJ7j<6g j[<E%P@BC:*V&PC]698<:<6gF;IV%6d@BCPB_ V&PQ8DC?698{:<69h&V&E 69fDV%6dPQ8]69_C8V&CFE 698<Hp;pj<6dHQjDCFE%}PBE :DCFE%C:
\$PL:DCM9M698DH68<;B:DC'f<;BgF6dChF> @I;BM9fDC;Bh%h%6dHQ8<\]C8^VV&PV%jDCPBh&CFE%@I;IMdC@A;IEk69;IMdCh
zjDC?[;I[CFE,69h"PBE%HQ;B8<6h&C:;Bh,_cPQM9MdPhF>0`8K'CgRV%6dPQ8 gF;B8<8<PBV\];IBCj<6dHQjDCFE%}PBE :DCFE%C:@A;BM9f<ChPB_,V%jDC{PQfDVb}
'JCE%CF@L6dCF V%jDCegRPQ8<gRCF[<VWPB_\]PQ8DPBV&PQ8<69gF6dVmlB>`b8 [f<V{@I;IE 69;IM9CMdCh hM6dBCMdlB>z,j<CgRPQ8gRCF[<VPB_;B8V%6}
K'CgRV%69PQ8*x'J7C$[<E%Ch&C8VPQfDE\$CFV%jDPL:_cPBEh&V%f<:DlL698DH V&PQ8<6gF6dV~lWj;BhV%jDC?E%CF@BCFE h&C698V&CFE%[<E CFV%;IV%6dPQ80V%jDC8DCFVb}
\$PQ8<PBV&PQ8<69gF6dVmlB>$CWE%CF[PBE%V?PQ8V%jDC{;I[<[M969gF;IV%69PQ8ePB_ PBE%69h,h%;B69:V&PWCWH GN JcI FHGp69b8 6d_
PQfDE?\$CFV%jDPL:68K'CgRV%69PQ8D>!z,jDC4[Y;I[CFE#C8<:<h#6dV%j
PQfDE,gRPQ8gFM9f<:<698<HpPBYh&CFE%@I;IV%6dPQ8<h,698KLCgRV%6dPQ8y'> n D n c o S EsI qp n r 'S EsI sp n c
  
7X  Xe /7XB7XmD  _cPBE;BMM<@I;BM9fDC;Bh%h 6dHQ8<\$C8V%h n ! n c V&tP ]> |#PBV&C?V%j<;IV6d_
;8DCFVmPBE 69h6h&PBV&PQ8DC]6986dV%h?PBYh&CFE%@I;IMdC@A;IEk69;IMdCh
i#[PQ8E%CF@L6dCF698<HV%jDCgRPQ8<gRCF[<V0PB_\$PQ8DPBV&PQ869gF6dV~lBJBC HQ6d@BC8V%jDCePBE :DCFE 698<HQu h 8 PQ8V%jDC69E{h&CFV%hPB_@I;BM9fDChFJ
;Bh%h f<\$CV%j<;IVT;q;lBCh%69;B88DCFV~PBE%2f<8<:DCFEh&V%f<:<l V%jDC8TV%jDC8DCFV~PBE%69h,;B8V%6dV&PQ8DC4HQ6d@BC8V%jDCE%CF@BCFEkh&C:
698gFM9f<:DCh;h 698DHQMdCPQfDV&[fDV@I;IE 69;IM9C  ;B8< : PBD} PBE :<CFE 698DHQhF> #M9V%jDPQfDHQj;B8^V%6dV&PQ869gF6dV~lV%j'f<h69hcE%CR}
h&CFE @A;IM9C@A;IEk69;IMdC h J  "!###$!% &J (' I @BCFE h%CMdlDC'f<6d@I;BMdC8^VV&P69h&PBV&PQ8<69gF69V~lBJ CCRrL[M96gF6dV%Mdl
698;B:<:<69V%6dPQ8J0V%jDC8DCFV~PBE%\];l698<gFM9f<:<CW;B8;IE%6} :<6h&V%698DHQf<6h%jCFVmCFC8V%jDCXVmP!V~l'[Ch7PB_D\$PQ8<PBV&PQ8<69g}
V&E ;IE l8'f<\CFEPB_$698V&CFE \$C:69;IV&C@A;IEk69;IMdChj69g j 6dVmlh 698<gRCT;*:DPQ\W;B698PB_?;I[<[M969gF;IV%69PQ8\];lCRrDj<6d6dV
;IE%C8DPBV*PBh&CFE @BC:698 [E ;BgRV%69gRCB* > )X;Bg j@I;IE 69;IMdC ;B8698V&E 69gF;IV&C$gRPQ\698<;IV%69PQ8PB_69h&PBV&PQ869gF6dV~l;B8<:;B8L}
+,]698+V%jDC8<CFV~PBE%;B:DPB[<V%hePQ8DCPB_];8<69V&Ch&CFV V%6dV&PQ869gF6dV~l_cPBE698V&CFE%E%CM9;IV&C:PBYh&CFE%@I;IMdC?@A;IEk69;IMdChF>
- . +,/   021 !###3!%054 Q J 6 '7IJPB_#@I;BM9fDCh>*C qf<6M9:<698DHfD[PQ8{V%j<C?;IP@BC4:DCF8<69V%6dPQ8<hFJCj<;s@BC
;Bh%h f<\$CV%j<;IVV%jDCFE%CCRrD69h&V%h;V&PBV%;BM PBEk:DCFE 698D9 H 8PQ8 V%j<;IV:DCgF69:698DHjDCFV%j<CFE{PBE8DPBV;q;lBCh 69;B88DCFVb}
V%j<6h"h&CFV,PB_0@A;BMfDChFD6dV%j<PQfDV"MdPQh%h"PB_0HBC8DCFE ;BM6dV~lBJ<C PBE%$69hX6h&PBV&PQ8DC#;B\]PQf<8^V%hV&P4@BCFEk6d_lL698DHV%j;IVC8V&CFE&}
;Bh%h f<\$CpV%j<;I:V 0<; 80>= jDC8DCF@BCF@E ?A8CB>pz,jDCpPBE&} 698<HT;B8^lj<6dHQj<CFE&}PBE :DCFE%C:*@A;BM9f<C];Bh%h%6dHQ8<\]C8^VV&P6dV%h
:DCFEk698DHQh[CFE@I;IE 69;IMdC ;IE%CV%;IBC84V&P#698<:<f<gRC ;,[;IE%V%69;BM PBh%CFE%@A;IYMdC@A;IE 6;IMdCh$E%Ch f<MdV%h$698;*h%V&P'gkj<;Bh&V%69gF;BM9M9l
PBE :<CFE 698D,H DPQ8WV%jDC#h%CFV PB_NbPQ698V @I;BM9fDC;Bh h%6dHQ8<\$C8V%h :DPQ\W698<;B8V[<E PB;I69M969V~l]:<69h%V&E 6dfDV%69PQ8]Ps@BCFEV%jDCPQfDVb}
V&PW;B8lh%fDh&CFV"PB_0V%jDC8<CFV~PBE%> vh#@I;IE 69;IMdCh> [f<V @A;IEk69;IMdCBv> )Xh%V%;IM969h%j698DH;B8V%6dV&PQ8<69gF6dVml];B\$PQf<8V%h
zjDCgRPQ8<gRCF[<V$PB_ E,FHG>FBI F"GJLKJN MOJ.GQP"JSRN TUJIVXWDN JNFHG V&P@BCFE 69_lL698DHV%j<;IVC8V&CFE 698DHT;B8lh%f<g j;Bh%h%6dHQ8<\]C8^V
8DPf69M9:<hfD[PQ8V%jDCX[PQh%V&CFE 6dPBE7[E%PB;I69M6dV~l:69h&V&E 6} E%Ch f<MdV%h"698;]:<PQ\]698<;IV&C::<6h&V&E 6df<V%6dPQ8>
f<V%6dPQ8<hPs@BCFE"V%jDC?PQfDV&[f<V@I;IE 69;IM9C#HQ6d@BC8V%jDC?@I;IE 6} zjDC2[<E%PBMdC\ PB_:DCgF69:<68DH\]PQ8DPBV&PQ8<69gF6dVml[<_c[PBE
PQf<hYN&PQ698V @A;BMfDC";Bh%h%6dHQ8<\]C8^V%h0V&PV%j<C,8DCFV~PBE% vh PBD} ;q;slBCh%69;B88DCFV~PBE%69h'8<Ps8V&PC*gRPQ|#S }
h&CFE @A;IM9Cp@I;IE 69;IYMdChWc ;B8:<CFEpG?;B;IH#I >dJ IuBuIk> gRPQ\$[YMdCFV&C698HBC8DCFE ;BMJ;B8<:698_;BgRV,E%C\];B68<h"gRPQ|#S }
z,j<CgRPQ8<gRCF[<VFJ\$PBE Ch&[CgF6dYgF;BM9MdlBJW69h:DCF8<C:698 gRPQ\$[YMdCFV&C_cPBE[PQM9lV&E%CFChc ;B8:DCFE*G?;B;IHFI >dJ

100 L. C. van der Gaag, S. Renooij, and P. L. Geenen


I uBuIk>`8@L6dCF PB_!V%jDCh&CgRPQ\$[MdCRrD6dVmlgRPQ8<h%6:DCFE ;A} 0(&#'(
V%6dPQ8<hJY ;B8:DCFEG?;B;IHF,I >:<Ch%6dHQ8DC:e;B8;I[[<E%Pr'}
69\];IV&C{;B8^l'V%69\$CW;BM9HBPBE 6dV%j<\_cPBE4@BCFEk6d_lL698DH#jDCFV%jDCFE 0(&#'% 0%&#'(
PBE]8DPBV];HQ6d@BC88DCFVmPBE 69h$\]PQ8DPBV&PQ8DCB>z,j<69hp;BM}
HBPBE 6dV%j\ h&V%f<:6dCh$V%jDCeE%CM9;IV%6dPQ8CFVmCFC8V%jDCPQfDVb} 0(&#>1 0%&#'% 01$#'(
[fDV@A;IEk69;IMdC;B8<:C;BgkjPBh&CFE%@I;IMdC{@A;IEk69;IMdCh&CF[D}
;IE ;IV&CMdlBJ698V&CFEk\]hpPB_,V%jDCh%6dHQ8PB_,V%jDCf<;BM6dV%;IV%6d@BC
698 YfDC8<gRC?CFV~CFC8TV%jDC\ cCM9M9\];B8J wBwIu^kfD[PQ8 0%&#>1 01$#'%

698<gRPQ8gFM9f<h%6d@BCE Ch%f<MdV%hFJ6dV&CFE ;IV%6d@BCM9l V%6dHQj^V&C8<C:8'fL}


\$CFE 6gF;BM0PQf8<:<h?;IE%C$Ch&V%;IM69h%jDC:PQ8V%jDC$E CMdCF@A;B8V 01$#>1
[<E%PBY;I69M96dV%69Chf<h%698<H!;B8;B8lV%69\]C\]CFV%jDP':$;s@A;B6M9;IMdC 0 6dHQfDE%C Ia!84CRrD;B\$[MdC;Bh%h%6dHQ8\$C8^VaM9;IV&V%69gRC_PBEUV~P
_cE%PQ\ 5a69f*;B8<:CM9M9\W;B8 wBwBtQk>i#8<_PBE%V%f8<;IV&CMdlBJ V&CFE 8<;IE l{PBh%CFE%@A;IYMdC@A;IEk69;IMdChF>
V%jDC*P@BCFE ;BM9M4E f<8V%69\$CE%C'f<6dE%C\]C8^V%hPB_V%jDC;BMdHBPI}
E 6dV%j\V&C8:V&P_PBE%Ch%V%;BM9MLf<h&C"698;?[<E ;BgRV%69gF;BMDh%CFV&V%698DHD>
aQX  m  /7XB7XmD  _cPBEV%jDC#V~PV&CFE 8<;IE%l$@I;IE 69;IM9Ch + ;B8<:*)W> DPBE_fDE&}
 V%jDCFE698D_cPBE \];IV%6dPQ8;IPQfDV{M9;IV&V%69gRChW68HBC8DCFE ;BMJC
n?fDE\$CFV%j<P': _cPBE h&V%f<:DlL698DH \$PQ8DPBV&PQ869gF6dV~l 698 E%CF_cCFE,V&PeG!,E ;I+ .V -FCFE\J w0/<sk>
q;lBCh%6;B8 8DCFV~PBE%Lh8<Ps f<69M9:hfD[PQ8V%jDCgRPQ8L} zaP:DCh%gRE 6dCTV%j<CCRCgRV%hPB_4V%j<Ce@I;IE 6dPQf<h{@I;BM9fDC
h&V&E fgRV!PB_;B8;Bh%h%6dHQ8<\]C8^VM9;IV&V%69gRCB>z,jDCM9;IV&V%69gRC$698L} ;Bh%h%69HQ8<\$C8V%h n PQ8V%jDC[<E%PB;I6M96dV~l2:<69h%V&E 6dfDV%69PQ8
gFM9f<:<Ch;BM9MN&PQ698^V@I;BM9fDC];Bh h%6dHQ8<\$C8V%hV&PV%jDCWh&CFVPB_ Ps@BCFEV%jDCPQfDV&[fDV@I;IE 69;IM9CBJ!V%j<C;Bh%h%69HQ8<\$C8VM9;IVb}
PBh&CFE @A;IM9C@I;IE 69;IYMdCh ;B8<:]69h0C8j<;B8<gRC:$69V%j[<E%PBD} V%69gRC69hC8j<;B8<gRC:6dV%j[E%PB;I69M69h&V%69g,68D_PBEk\];IV%6dPQ8>
;I69M69h&V%69g698D_cPBE \];IV%6dPQ8gRPQ\$[f<V&C:_E%PQ\ V%jDC8DCFVb} 26dV%jTC;BgkjCMdC\]C8^1V # n "PB_ V%jDCM;IV&V%69gRCBJV%jDCgRPQ8L}
PBE%f<8<:<CFEph&V%f:DlB><E%PQ\ V%jDCM9;IV&V%69gRCBJ ;BM9M@L6dPQM9;A} :<6dV%69PQ8<;BM7[E%PB;I69M6dV~lT:<6h&V&E 6df<V%6dPQ8SXEsI  p n #P@BCFE
V%6dPQ8<h0PB_V%jDC"[<E PB[CFE V%6dCh0PB_\]PQ8DPBV&PQ8<69gF6dVmlp;IE C,69:DC8L} V%jDCTPQf<V&[fDV@A;IEk69;IMdA C  6h];Bh%h&PLgF69;IV&C:>Ce8DPBV&C
V%6d<C:J,_E PQ\ #j<69g j\]69869\];BM,PIC8<:<68DHgRPQ8V&CRr'V%h V%j<;IVV%jDCh%C:<69h&V&Ek6dfDV%6dPQ8hW;IE%CE%C;B:69MdlgRPQ\$[YfDV&C:
;IE%C4gRPQ8h&V&E f<gRV&C:_cPBE_fDE%V%j<CFE,698^@BCh%V%6dHQ;IV%6dPQ8> _cE%PQ\ V%jDC*q;lBCh%6;B8+8DCFVmPBE%2f<8<:<CFETh&V%f<:DlB>C
     cD c  69MM E%CFV%fDEk8V&PV%j<CTgRPQ\$[MdCRrD6dV~lPB_V%jDCgRPQ\$[YfDV%;A}

V%6dPQ8<h698^@BPQM9@BC:T698K'CgRV%6dPQ8ex'> D>
z,jDC$ R RUJ GEWX G"dBN JLKkp_cPBE!V%jDCh&CFV PB_ PBh&CFE%@}
;IMdC@A;IE 6;IMdCh$PB_;q;slBCh%69;B828DCFV~PBE%gF;I[<V%f<E%Ch 32 46587 9 :*      cD c 
;BM9MQNbPQ68^V@A;BMfDC!;Bh h%6dHQ8<\$C8V%hV&P WJ<;BMdPQ8DHp6dV%j{V%jDC ql*f<6M9:<698DHfD[PQ8*V%j<C;Bh%h 6dHQ8<\$C8VM9;IV&V%6gRCBJ @BCFE 6}
[;IE%V%6;BMaPBEk:DCFE 698DHCFV~CFC8V%jDC\>DPBEC;Bgkj@I;BM9fDC _cl'698<H\]PQ8DPBV&PQ8<69gF6dVml68:<69h&V&E 69fDV%6dPQ8;B\$PQf<8V%h$V&P
;Bh%h%69HQ8<\$C8V n V&P ]J;B8CMdC\$C8V # n "69h,68<gFM9f<:DC: g j<Cg%L698DH$jDCFV%j<CFE[Y;IE%V%69gFf<M9;IE":DPQ\W698<;B8<gRC![<E%PB[CFE&}
698V%jDCM9;IV&V%69gRCB>zjDCePBV&V&PQ\ PB_V%jDCM9;IV&V%69gRCC8L} V%6dChUjDPQM9:;B\]PQ8DHV%jDC[<E%PB;IY69M96dVml#:<69h%V&E 6dfDV%69PQ8<h;Bhb}
gRPL:DCh!V%jDC$;Bh%h%6dHQ8<\]k C8^V n _PBE?j69g jeC$j<;s@BCV%j<;IV h&PLgF69;IV&C:T#6dV%jV%jDCM9;IV&V%69gRHC vhCMdC\$C8V%hF>
n D n c _PBE$;BM9M n c - N pkV%jDC{PBV&V&PQ\ V%j'f<hpC8L} CE%CgF;BM9MaV%j<;IV?;{q;slBCh%69;B88DCFVmPBE%e69h#6h&PBV&PQ8DC
gRPL:DCh0V%jDCMdPsCh&Vb}PBE :DCFE C:$@A;BMfDC;Bh h%6dHQ8<\$C8VUV&P W> 6986dV%hh&CFV$PB_#PBh&CFE%@I;IMdC@I;IE 69;IMdCdh  69_,C8^V&CFEk698DH
z,jDCV&PB[PB_V%jDCM9;IV&V%69gRCC8<gRPL:DChUV%jDC;Bh%h%6dHQ8\$C8^V n c c ;j<6dHQjDCFE%}PBE :DCFE%C:@A;BM9f<Cp;Bh%h 6dHQ8<\$C8V#V&bP  E%Ch f<MdV%h
_cPBE j<69gkjC,j;@BC,V%j<;IV n c D n c c _cPBE ;BM9M n c k - N k 698;h&V&PLg j<;Bh%V%69gF;BM9Mdl:DPQ\]698<;B8V![<E%PB;I6M96dV~lT:69h&V&E 6}
V%jDCpV&PB[V%jfh#C8gRP':DCh!V%jDCj<6dHQj<Ch&Vb}PBE :DCFE%C:e@I;BM9fDC fDV%69PQ8P@BCFEV%jDCpPQfDV&[fDV!@A;IEk69;IMd,C >C_cf<E%V%jDCFE
;Bh%h%69HQ8<\$C8V V&tP ]>0`b8$V%jDC,M9;IV&V%6gRCBJC,_fDE%V%jDCFE j<;s@BC E%CgF;BM9MV%j<;IVV%jDC[;IE%V%69;BMPBE :DCFEk698D H DPQ8V%jDC4NbPQ698V
V%j<;IVX;B8WCMdC\$C8V # n U[<E%CgRC:<ChX;B8]CM9C\$C8^V # n c c @I;BM9fDCT;Bh%h%69HQ8<\$C8V%h]V&O P  69h$C8<gRPL:DC::<6dE%CgRV%Mdl698
6d_ n D n c c C\$PBE%CFP@BCFE]h%;lV%j<;I!V # n [<E CgRC:DCh V%jDC;Bh%h%6dHQ8\$C8^V?M9;IV&V%69gRC_cPB:E ]>?C]8DPs gRPQ8<h 69:DCFE
# n c c :<6dE CgRV%MdlT6d_ V%jDCFE%C$69h!8DP;Bh%h%6dHQ8\$C8^V n c 6dV%j ;BM9MY[;B6dE hXPB_7gRPQ8<:<69V%6dPQ8<;BM<[E%PB;I69M6dV~l:<69h&V&Ek6dfDV%6dPQ8h
n " n c " n c c > z,jDC[;IE%V%6;BMPBEk:DCFE 698DH:DCFY8DC:plpV%jDC S EsI ap n ];B8<:SXEsI ap n c $;Bh%h&PLgF69;IV&C:69V%jCMdCR}
M9;IV&V%69gRC$V%j'f<hgRPQ68<gF69:DCh?69V%jeV%j<C$[;IE%V%69;BM0PBE :<CFE 698DH \$C8V%*h ! n ];B8<; : # n c $698V%jDCM9;IV&V%69gRCh f<g jV%j<;IV
DPQ8V%jDC#NbPQ68^V?@I;BM9fDC;Bh%h%6dHQ8<\]C8^V%h?V&b P ]>U69HQfDE%C # n [<E CgRC:DC* h ! n c k>2`m_?_cPBEWC;Bgkjh f<g j[;B6dE]C
:<CF[69gRV%hFJ';BhX;B8{CRrL;B\$[YMdCBJ^V%jDC#;Bh%h%6dHQ8<\]C8^VM9;IV&V%69gRC j<;s@BCpV%j;IV!SXEsI  p n 82S EI  p n c kJV%jDC8eV%jDCp8DCFVb}

Lattices for Studying Monotonicity of Bayesian Networks 101


PBE%6h 69h&PBV&PQ8DC,698dW>0zjDC"8DCFV~PBE%$69h ;B8V%6dV&PQ8DC,698 D8 PBV#h%jDPs698V%jDC4q;slBCh%69;B88DCFV~PBE%f<8<:DCFE#h&V%f<:DlB>
6d_U_PBE,C;Bgkjeh f<g j[;B6dE,PB_ :<69h%V&E 6dfDV%69PQ8<hXC4j<;s@BC CV%jDC8Th%;slV%j<;IVV%jDCh%C[<E%PB[CFE%V%6dCh";IE C4@L6dPQM9;IV&C:>
V%j<;IVS EsI sp n c r 8S EsI qp n k> C#8DPBV&C,V%j<;IVFJ'h%698<gRC PBE%C_cPBE \];BM9MdlBJLCh%;lV%j<;IV,;[;B69EPB_N&PQ698^V@I;BM9fDC
V%jDCp[<E%PB[CFE%VmlTPB_h%V&P'gkj<;Bh&V%69g:DPQ\]68<;B8<gRC69hV&Ek;B8<h%6} ;Bh%h 6dHQ8<\$C8V%h n ! n c 6dV%j n D n c JLFIdB~U RV%jDCp[<E%PB[D}
V%6d@BCBJCTj;@BCTV&P*h&V%f:Dl*V%jDCT:DPQ\W698<;B8<gRC[<E%PB[CFE&} CFE%Vml]PB_769h%PBV&PQ8<69gF6dVml]6d_C?j<;s@BC?SXEsI sp n  SXEsI sp
V%6dCh!PB_ V%j<C:69h&V&E 6dYfDV%6dPQ8<h"PB_:<6dE%CgRV%MdlTM698DBC:e[;B6dE h n c _PBE0V%jDC69EU;Bh%h%P'gF69;IV&C:[E%PB;I69M6dV~l?:<69h&V&Ek6dfDV%6dPQ8hF
PB_#CMdC\$C8V%h]698V%jDCM9;IV&V%69gRCPQ8<M9l*V&P*:DCgF69:DCfD[PQ8 @L6dPQM9;IV%6dPQ8PB_!V%jDC[<E%PB[CFE%VmlPB_?;B8^V%6dV&PQ869gF6dV~l69h:DCR}
69h%PBV&PQ8<69gF6dVml{PBE;B8V%6dV&PQ8<6gF6dV~lB> 8<C:Th%69\]6M9;IE MdlB> CPBh&CFE%@BCpV%j<;IV#_PBE!;B8^l@L6dPQM9;IVb}
!h4\]C8^V%6dPQ8<C:698K'CgRV%6dPQ8'J0;:DPQ\];B68PB_,;I[D} 698<Hp[Y;B6dE n ! n c #jDPQh&CCMdC\$C8V%h,;IE%C:<6dE%CgRV%Mdl{M9698DBC:
[M69gF;IV%6dPQ8\];lCRrDj<6d69V!;B8*68^V&E 6gF;IV&C]gRPQ\468<;IV%6dPQ8 698V%j<C;Bh%h%6dHQ8\$C8^VM9;IV&V%69gRCBJ^V%j<CFE%C69h /; W<G5JW<@I;IE 6}
PB_[<E%PB[CFE%V%6dCh!PB_69h&PBV&PQ8<6gF6dV~l;B8<:;B8V%6dV&PQ8<69gF6dVmle_cPBE ;IM9/C 698V%jDC]h&CFV4PB_PBYh&CFE%@I;IMdC$@I;IE 69;IMdC:h  V&P
698V&CFE%E%CM;IV&C:PBh&CFE%@I;IMdC@I;IE 69;IMdCh>p!h;B8CRrD;B\} j69g j n ;B8<: n c ;Bh%h 6dHQ8;T:<6CFE%C8V@A;BMfDCB>]5CFV n ;
[M9CBJAC"gRPQ8<h%69:<CFEUV%j<CVmP@I;IE 69;IYMdCh +;B8<: )+h%fg j CpV%jDC@A;BM9f<CpPB_V%j<69h!@I;IE 69;IMdCp6dV%j698V%j<C;Bh h%6dHQ8L}
V%j<;IVV%jDCPQf<V&[fDVX@I;IE 69;IYMdC,PB_698V&CFE%Ch&VCj;@BCh69h&PI} \$C8V n ;B8<:{M9CFV n = J BZ ?<J'C!6dV%h@A;BM9f<C#698 n c > z,jDC
V&PQ8<6gF;BM9Mdl$69/8 + ;B8:{;B8V%6dV&PQ8<69gF;BMMdl$69*8 ){>0C?_PLgFf<h @L6dPQM9;IV%6dPQ88DP69h?:DC8DPBV&C:'l n ! n c p ; = k>C
PQ8V%j<C@A;BM9f<C;Bh%h%6dHQ8<\]C8^V%h h%;slWV%j<;IVV%jDC?@L6dPQM9;IV%6dPQ8j<;Bh;g j;B8DHBCPB_7V%jDC?@I;BM9fDC

# ; 1 PB2_ @_cE%PQ\ n ; V&P n = _cPBE"6dV%:h F"TUJ JSG<JL,Ek6dV&V&C8  ; = >
n ;  0

0  1.# ;
z,j<Ce@I;BM9fDC;Bh%h%6dHQ8\$C8^V n V&Z P W3 mV%j<;IVT69h
n c
;


0  1.# ; 1
h%j;IE%C:l n ;B8<: n c J69h?gF;BM9MdC:V%jDC K F"G~ ^PB_V%jDC
n c c
;

@L6dPQM9;IV%6dPQ8>
V&PeV%jDCh&CW@I;IE 69;IYMdChF>|PBV&CV%j<;IV4_cPBE4V%j<CWV%jDE%CFC;Bhb} `b8HBC8DCFEk;BMJYf<h 698DH$V%jDC4;Bh h%6dHQ8<\$C8VM9;IV&V%6gRC4_cPBE#;
h%69HQ8<\$C8V%h,Cpj<;s@BCV%j<;IV n  D n c c ;B8<: n c D n c c q;lBCh 69;B88DCFV~PBE%\];leE%Ch%f<M9V#698;h&CFV +PB_X@L6dPI}
V%jDC;Bh%h%6dHQ8\$C8^V%h n  ; ;B8<: n ; c ; j;@B; C8DPPBEk; :DCFE 698DHD; > M9;IV%69PQ8<hPB_0V%jDC[<E%PB[CFE%V%6dChPB_0\$PQ8DPBV&PQ869gF6dV~lB>"K'PQ\$C
DPBEh&V%f<:DlL698DHV%jDC\]PQ8DPBV&PQ8<69gF6dVml[E%PB[CFE%V%6dCh698L} PB_V%jDCW69:<C8^V%6dC:e@L6dPQM9;IV%69PQ8<h\];lh%j<PsgRPQ8<h 69:DCFE&}
@BPQMd@BC:J^C8DPsj<;@BC#V&P4gRPQ\$[;IE C,V%jDC,[<E PB;I69M969V~l ;IM9C]E%CFHQf<M9;IEk6dV~lBJ0698*V%j<C{h&C8<h%C]V%j<;IV6d_";T@L6dPQM9;IV%6dPQ8
:<6h&V&E 6df<V%6dPQ8<h$P@BCFEV%jDCPQfDV&[YfDV@A;IE 6;IMdC  V%j<;IV PBE 69HQ698<;IV&Ch0_cE%PQ\;gkj<;B8DHBC,PB_@I;BM9fDC,698$;?[;IE V%69gFf<M9;IE
;IE%C!gRPQ\$[fDV&C:W_E PQ\V%jDC!q;lBCh%69;B88DCFVmPBE ]f<8<:DCFE gRPQ8V&CRr'VFJ0V%jDC8*V%j<69hh%;B\]CWg j<;B8<HBC{gF;Bf<h&Chp;@L6dPQM9;A}
h&V%f:DlBJHQ6d@BC8V%jDC@I;IE 6dPQf<h?;Bh%h%69HQ8<\$C8V%h?V&bP + ;B8<: V%6dPQ8W698$;BM9MDj<6dHQj<CFE&}PBE :DCFE%C:$gRPQ8V&CRrLV%h;BhXCM9M> KLfg j
){>`__cPBE$V%jDC[;B6dEPB_#:69h&V&E 6dYfDV%6dPQ8<hS EsI  p n 
;B8<:2SXEsI  p n c c; kJ#Cj<;@BCV%j;IVSXEI  p n c c; ; 69h @L6dPQM9;IV%6dPQ8h69M9MCV&CFE \$C: RN TUW KsN W TbI >,n!V%jDCFE@L6dPI}
h&V&PLg j;Bh&V%69gF;BM9Mdl:DPQ\]698<;B8V4P@BCFE]S EI  p n  ; kJ V%jDC8 M9;IV%69PQ8<h_cE%PQ\V%jDCh&CFV *:DP!8DPBVUj;@BCh%f<g jp;,E CFHQf<M9;IE
V%jDC8<CFV~PBE%{CRrDj<6dY6dV%hV%jDC;Bh%h&PLgF69;IV&C:[<E PB[CFE V~lWPB_ h&V&Ekf<gRV%fDE%C;B8<:\];lCgRPQ8<h%6:DCFE%C: JSG>KJNPX GI >0C
69h%PBV&PQ8<69gF6dVml6e 8 +>`__PBEV%jDC{:<6h&V&E 6df<V%6dPQ8<h!HQ6d@BC8 :<6h&V%698DHQf<6h%jCFV~CFC8V%j<C{VmPVml[ChpPB_,@L6dPQM9;IV%6dPQ8
B
; 
8 : 
C$j<;@BCV%j<;IVSXEI  p n c ; ,6h#h&V&PLg j;Bhb} V&P;BM9MdPs;{gRPQ\][;BgRV#E CF[<E%Ch&C8V%;IV%6dPQ8ePB_ V%jDC4C8V%6dE%C
n  c n  c c
V%69gF;BMMdl{:DPQ\W698<;B8V"Ps@BCFES EsI  p n c c; kJ<V%jDC8TV%j<C48DCFVb}
; ; h&CFVPB_U69:<C8^V%6dC:@'6dPQM;IV%6dPQ8<hF>
PBE%h%jDPhV%jDC;Bh%h&PLgF69;IV&C:[E%PB[CFE%V~l?PB_D;B8^V%69V&PQ8<69g} CgRPQ8<h%69:<CFE7V%jDCXh fDh&CFV N  ; = !PB_<;BM9MB@L6}
6dVmle698 ){>4z,jDC$8DCFVmPBE%eV%jfh?E%CF@BC;BM9h?PBV%j[<E%PB[D} PQM9;IV%69PQ8<h?PB_V%j<C$[<E%PB[CFE%V%6dCh?PB_\$PQ8DPBV&PQ8<6gF6dV~lV%j<;IV
CFE%V%69Ch,6d_ PBE 69HQ698<;IV&C,_cE%PQ\;4g j;B8DHBC?PB_V%jDC@I;BM9fDC#PB_V%jDC@I;IE 6}
S EI qp n  ; m 8S EsI  p n c c; m 8S EI qp n c ; ;IM9dC @_cE%PQ\ n ; V&P n = >z,jDC]gRPQ8V&CRr'V n  PB_h%fg j
;@'69PQM9;IV%6dPQ8+8<Ps 69hegF;BMMdC:; RN TUWKN W T&I K F"G~ ^
CFE 69_lL698DHpjDCFV%jDCFEPBE,8DPBV"V%j<C?8<CFV~PBE%6h6h&PBV&PQ8DC F#"uF%$!X G>Ke6d_CTj<;@BCeV%j<;IV]V%jDCFE CT69h$;@L6dPQM9;IV%6dPQ8
69`8 + ;B8<:;B8V%6dV&PQ8DC4698 )8DPs;B\$PQf<8V%h"V&PWh&V% f<:<l^} n ! n c p  ; = k N  ; = _PBE];BMM@A;BM9f<C;Bh h%6dHQ8L}
698<HV%j<C];IP@BC{698<Cf<;BM6dV%6dCh#_PBE4;BMM0@A;BM9f<C:h 0 ;B8<: \$C8V%h n V&`P  V%j;IV?68<gFM9f<:DCV%j69h#gRPQ8V&CRrLV n  PBE;
# ; PB_0V%jDCV~PW@I;IE 69;IYMdChF> j<69HQjDCFE&}PBE :DCFE C:]PQ8DCB> `_V%jDCFE C!69h8DP$MdPCFE&}PBEk:DCFE%C:
h&V&Ekf<gRV%fDE ;BMBgRPQ8V&CRr'VUPB_LPIC8gRCX_cPBE7V%jDCXh ;B\$CXg j;B8DHBC
8     :  9  
: 0 9   5 PB_L@I;BM9fD C  ; = JICh%;lV%j<;IVaV%jDCXgRPQ8V&CRr'V0PB_LPIC8<gRC
i#[PQ8Tf<h%698<HpV%j<C4;Bh%h%6dHQ8\$C8^V#M9;IV&V%69gRC4;Bh!:DCh%gRE 69C: 69h RRN T WKN W T&Ici MEJ.GJ.E]Io>+h&V&E fgRV%fDE ;BM9Mdl]\W698<69\];BM
;IPs@BCBJh&PQ\]C[<E PB[CFE V%6dCh*PB_T\$PQ8<PBV&PQ8<69gF6dVml\];l gRPQ8V&CRr'V"PB_aPIC8gRCV%jf<hgkj<;IE ;BgRV&CFE 69h%Ch";ph%CFV"PB_E%CR}

102 L. C. van der Gaag, S. Renooij, and P. L. Geenen


9M ;IV&C:@L6dPQM9;IV%6dPQ8<h> h&V&Ekf<gRV%fDE ;BM9M9l\W698<69\];BMgRPQ8L} 96 8{;@L6dPQM9;IV%6dPQ8 n ! n c p$ ; = k !> z,jDC[<E%PLgRC:<fDE%C
V&CRrLV n  _PBE0V%jDC@L6dPQM9;IV%6dPQ8<hUPBE 69HQ698<;IV%698DH#_E PQ\  ; = h%fDYh&Cf<C8^V%Mdl*6h&PQM9;IV&Chp_E PQ\ V%jDC;Bh h%6dHQ8<\$C8V$M9;IVb}
\$PBE%C$h&[CgF6dYgF;BM9MdlT:DCF8<Ch;h fDM9;IV&V%69gRCPB_V%jDC$;Bhb} V%69gRCV%jDC$h%fDYM9;IV&V%69gRCp6dV%jV%jDCCMdC\$C8V # n kJ76dV%j
h%6dHQ8\$C8^VM9;IV&V%69gRC698Wj<69gkjWC;BgkjCMdC\$C8V698gFM9f<:DCh n  n ;  n  JY_cPBE?6dV%h,PBV&V&PQ\V%jDCph%fDM9;IV&V%6gRC_cf<E%V%jDCFE
n ;  ;B8<:$698pj<69gkj_PBE C;Bgkj]CMdC\]C8^V n V%jDCFE%CCRrL6h&V%h ; 698<gFMf<:DChU;BM9MLCMdC\$C8V%h # n ; n  a#6dV%j n ; n  D n ; n  >
@L6dPQM9;IV%6dPQ8 n ! n c pj ; = 698TV%j<Ch&CFV ?N  ; = kV%jDC |#PsJ<6d_7C;Bg jCMdC\]C8^VPB_7V%j<C?h fDM9;IV&V%69gRC!PLgFgFfDE h698
gRPQ8V&CRr'V n 69hV%jDC4PBV&V&PQ\PB_aV%j69h"h%fDM;IV&V%69gRCB> ;@L6dPQM9;IV%6dPQ86dV%Z j  ; = _PBE{6dV%h$PBE 6dHQ68JXV%jDC8V%jDC
`b8;B:<:<6dV%6dPQ8{V&PV%jDC?@'69PQM9;IV%6dPQ8<hV%j<;IV";IE%CgRPs@BCFE%C: gRPQ8V&CRr'V n 69h,;h&V&E f<gRV%fDEk;BM9Mdl{\W698<69\];BMgRPQ8V&CRrLV!PB_
'lWh&V&E f<gRV%f<E ;BM9Mdl$\]69869\];BMgRPQ8V&CRrLV%hPB_7PIC8<gRCBJDV%jDC PIC8<gRCB,PBV%jDCFE%#69h&C6dV$69h698<gF69:<C8^V%;BM>z,jDC[<E%PLgRCR}
h&CFVpPB_,;BM9MX6:DC8^V%69<C:*@L6dPQM9;IV%6dPQ8h4\];sl698<gFM9f:DC$CMdCR} :<fDE C69h 69V&CFE ;IV%6d@BCMdlE%CF[C;IV&C:$_PBE ;BMMLlBCFV f<8<gRPs@BCFE C:
\$C8V%h,V%j<;IV#;IE%C48DPBVh%V&E f<gRV%fDE ;BMMdl]E%CM9;IV&C:>gRPQ8L} @L6dPQM9;IV%6dPQ8<h>
V&CRrLV n  69hh%;B69:V&PC;B8 J.G>KJLP^ GI: K F"G~ Q/ F#"  9 7   5  0 9 7   AD 9 7
F#$! G>Kk*6d_6dV{6h{8DPBV;h&V&E f<gRV%fDEk;BMgRPQ8V&CRr'VPB_PB_}
  

_cC8<gRCB>C$8DPBV&CV%j<;IV;B8^lh%CFV!PB_X@'6dPQM;IV%6dPQ8<h!69:DC8L} z,jDC Ekf<8^V%6\$C gRPQ\$[MdCRrD6dV~l PB_PQfDE\$CFV%jDPL:_cPBE


V%6d<C:_cE%PQ\ ;q;slBCh%69;B88<CFV~PBE%6hp8<Ps/g j<;IEk;Bg} h&V%f<:<l'698<H4\$PQ8DPBV&PQ8<6gF6dV~l]PB_U;q;slBCh%69;B88DCFVmPBE%69h
V&CFE 69h%C:lp;h&CFV PB_Yh&V&Ekf<gRV%fDE ;BM9M9l4\]698<6\];BM^gRPQ8V&CRr'V%h :DCFV&CFE \W698DC:$'lV%j<C#h%6 -FC"PB_V%j<C;Bh%h%6dHQ8\$C8^VM9;IV&V%69gRC
PB_0PIC8gRC4;B8<:T;]h%CFV"PB_068<gF69:DC8V%;BMgRPQ8^V&CRrLV%hF> f<h&C:>CPBh&CFE%@BCV%j<;IVpV%j<69hpM9;IV&V%69gRCC8<gRPL:DCh;B8
CRrL[PQ8<C8^V%69;BMY8f\4CFE PB_@I;BM9fDC;Bh%h 6dHQ8<\$C8V%hXV&PV%jDC
!h;B8CRrD;B\$[MdCBJaC]gRPQ8h%69:DCFEV%jDC{M9;IV&V%69gRC$_cE%PQ\ h&CFVPB_PBh&CFE @A;IM9C@I;IE 69;IYMdChF> =PQ8<h&V&E fgRV%698DHV%jDC
06dHQfDE%Z C _cPBEV%jDCVmPV&CFE 8<;IE%lPBh&CFE%@I;IMdC@I;IE 6} M9;IV&V%69gRC;B8<:gRPQ\$[fDV%68DH]V%jDCp[<E%PB;IY69M96dVml:<6h&V&E 6dfD}
;IMdCh + ;B8<: )W> KLfD[<[PQh&CV%j<;IVf<[PQ8@BCFEk6d_l} V%6dPQ8<h!V&PC;Bh%h%P'gF69;IV&C:#6dV%je69V%h#CM9C\$C8^V%hJV%j<CFE%CR}
698DH69h&PBV&PQ8<6gF6dV~l PB_V%jDCq;slBCh%69;B88DCFVmPBE% f<8L} _cPBE%CBJWV%;IBChCRrL[PQ8<C8^V%69;BM]V%69\$CB> TPBE%CFP@BCFEJV%jDC
:DCFEh&V%f:DlBJV%jDC2V%jDE%CFC@L6dPQM9;IV%6dPQ8h2.021.# %"!%0 % #'% p :DPQ\]68<;B8<gRC[<E%PB[CFE%V%6dCh0PB_;B8$CRrL[PQ8DC8V%69;BMD8f\4CFE
+ 1  % !s. 0 1 # ( !%0 % # ( p + 1  % ;B8<:. 0 % # % !%0 ( # % p PB_[;B6dE h?PB_[<E%PBY;I69M96dVml:<69h%V&E 6dfDV%69PQ8<hj<;s@BCWV&PTC
+ %  (s;IE%C69:DC8V%6d<C:> z,jDC*<Ekh&VTV~PPB_pV%j<Ch&C gRPQ\$[;IE C:>DPBE +698<;IE%lPBh&CFE%@I;IMdC@A;IE 6;IMdChFJ
@L6dPQM9;IV%6dPQ8<h{j<;@BCV%jDCh%;B\$CPBE 6dHQ68V%jDCeVmP@L6dPI} _cPBE,CRrL;B\][MdCBJ<;BMdE%C;B:<l
M9;IV%6dPQ8h;IEk69h&C6d_V%jDC@A;BMfDCePB_V%jDC@A;IEk69;IMdA C + 69h
g j;B8DHBC:_cE%PQ\ 01V&Z P 0%I>z,j<Ch&C@'6dPQM;IV%6dPQ8<h;IE%C 
 
.  eb
g j;IE ;BgRV&CFE 69h&C:'lV%j<C{h&V&E fgRV%fDE ;BM9Mdl\]698<6\];BM0gRPQ8L}
 

 
 
V&CRrLV7PB_'PIC8<gRC # %I68PBV%jV%jDC gRPQ8V&CRr'V # % ;B8:V%jDC gRPQ\$[;IEk69h&PQ8<h$;IE%CE Cf<69E%C:>DE PQ\ V%jDCh&CTgRPQ8h%69:L}
j<6dHQj<CFE&}PBE :DCFE%C:gRPQ8^V&CRrLV # ( JQgkj<;B8DHQ698<HV%jDC@A;BMfDCPB_ CFE ;IV%6dPQ8h]Cj<;s@BCeV%j;IVWPQfDE\$CFV%jDPL:j<;Bh;*@BCFE%l
+ _cE%PQ\ 021!V&b P 0 %HQ69@BChE 69h&CpV&PT;@L6dPQM9;IV%6dPQ8>z,jDC j<6dHQjTE f<8V%69\$CgRPQ\$[M9CRrL6dVmlB>"#MdV%j<PQfDHQj6dV%hE%C'f<6dE%CR}
V%j<6dEk:@'6dPQM;IV%6dPQ8T\$C8V%6dPQ8DC:e;IP@BC69h#8DPBV,lBCFV!gRPs@} \$C8V%h$gF;B8C{E C:<f<gRC:V&P*;IV]MdC;Bh&V$h&PQ\]CCRr'V&C8V
CFE%C:lV%jDC{6:DC8^V%69<C:\]698<69\W;BMUgRPQ8V&CRrLVph%68<gRC]6dV 'lCRrL[MdPQ69V%698DHV%jDCW698:DCF[C8<:DC8<gRCh\$PL:DCM9M9C:698*;
j<;Bh!;{:6CFE C8^V#PBE 6dHQ698>,zj<69h,@L6dPQM9;IV%69PQ8TPBE 6dHQ68<;IV&Ch q;lBCh%6;B8{8DCFVmPBE YJL_cPBEM9;IE%HBCFE8DCFV~PBE%Lh698gFM9f<:<698<H
_cE%PQ\ ;g j<;B8<HBCePB_V%jDC@I;BM9fDCTPB_V%jDC@I;IE 69;IMdu C + ;M9;IE%HBCe8'f<\CFEWPB_PBh%CFE%@A;IYMdCT@I;IE 69;IMdCh{h&V%f<:Dl}
_cE%PQ\ 0%#V&P 0 (;B8<:;IHQ;B698j<;Bh #'%,_cPBE"6dV%hPIC8:<698DH 698DH\$PQ8DPBV&PQ869gF6dV~l698DCF@L6dV%;IM9l$CgRPQ\]Ch"698D_cC;Bh%6dMdCB>
gRPQ8V&CRr'VF>z,jDCgRPQ8V&CRr'6V # % 8DP+69h"\$CFE CMdl68<gF69:DC8L} z,jDCf<8D_;@BPQf<E ;IMdCpgRPQ\$[fDV%;IV%6dPQ8;BMUgRPQ\][MdCRrD6dV~lTPB_
V%;BMh%698<gRC!8DP4@L6dPQM9;IV%6dPQ8{j<;BhCFC8{69:DC8V%6d<C:W_PBEV%jDC V%jDC{[<E%PBMdC\TJUjDPCF@BCFEsJ 69h4M969BCMdlV&P_cPBE%Ch&V%;BM9MV%jDC
[;B6dE"PB_U;Bh%h%6dHQ8\$C8^V%h 0% # (?;B8<: 0 (&# (!69V%j68<gFM9f<:DC :DCh%69HQ8PB_UCh h&C8^V%6;BM9Mdl{\]PBE%CC WgF6dC8V\$CFV%j<P':<h>
V%jDC4j6dHQjDCFE&}PBE :<CFE%C:gRPQ8^V&CRrL6V # ( > `b8@L6dCF PB_V%jDCj<6dHQjE f<8V%69\$CgRPQ\$[MdCRrD6dVml698L}
zaP]gRPQ8<gFM9f<:<CBJ'CPQf<M9:M96dBC!V&PW8DPBV&CV%j<;IV_cPBE; @BPQMd@BC:J^C[<E%PB[PQh&CV&Pf<h&CPQfDE \$CFV%jDPL:_cPBE h&V%f<:Dl}
HQ6d@BC8h&CFVPB_@'6dPQM;IV%6dPQ8<hFJV%jDC!h%V&E f<gRV%fDE ;BMMdlp\W698<69\];BM 698DH![<E%PB[CFE%V%69Ch7PB_ L" TRN JI> E/F"G>FBI F"G5JNKJcN MPQ8<MdlB>0z,jDC


gRPQ8V&CRr'V%hPB_$PIC8<gRC;IE CE%C;B:<6MdlCh%V%;IM969h%j<C:l gRPQ8<gRCF[<VaPB_D[;IE%V%69;BM^\$PQ8DPBV&PQ8<69gF69V~l;I[<[M969ChV&P#;#h%fDD}


V&E ;s@BCFE h%698DHV%jDC;Bh%h 6dHQ8<\$C8VWM9;IV&V%69gRCef<8<:DCFEWh%V%f<:DlB> h&CFV + PB_ V%jDCpPBh&CFE @A;IM9C4@I;IE 69;IMdCh#PB_X;8DCFV~PBE%>
K'V%;IE%V%68DH;IV4V%j<CWPBV&V&PQ\ PB_"V%j<C{M9;IV&V%69gRCBJUV%jDC{[<E%PI} !8;Bh%h%6dHQ8<\]C8^V]M9;IV&V%6gRCT69h$gRPQ8<h&V&E fgRV&C:_PBE$V%j<Ch&C
gRC:<fDE C8<:<h;B8TCMdC\$C86V ! n ,h%f<g jeV%j<;IV n 69hV%jDC @I;IE 69;IMdCh;Bh:DCh%gREk6dC: ;IP@BCB> z,jDC[<E PB;I69M}
MdPCh&Vb}PBEk:DCFE%C:4NbPQ68^V @I;BM9fDC!;Bh%h%6dHQ8\$C8^VXP'gFgFfDE E 698DH 6dVml:<69h%V&E 6dfDV%69PQ8<hW;Bh%h%P'gF69;IV&C:2#6dV%jV%jDCCMdC\$C8V%h

Lattices for Studying Monotonicity of Bayesian Networks 103


PB_V%jDCM9;IV&V%69gRC;IE%CgRPQ8<:<6dV%6dPQ8<C:PQ8; 'gP*NbPQ68^V ^@A;IEk69;IMdCha;IE%CPBYh&CFE%@I;IMdCB>a06dHQf<E%C#:DCF[69gRV%h7V%jDC
@I;BM9fDC!;Bh%h%6dHQ8<\]C8^V n  V&PpV%jDC!PBh&CFE%@I;IMdC#@A;IEk69;IMdCh HBE ;I[Yj<69gF;BMh&V&E fgRV%fDE%CPB_UV%jDC4gFfDE%E C8^V,8DCFVmPBE Y>
   ! + V%j<;IV";IE%C?8DPBV"698<gFM9f<:<C:]698{V%jDCh&V%f<:DlB>
26dV%jC;BgkjCM9C\$C8^*V #. 0$PB_?V%jDCM9;IV&V%69gRCeV%jfhW69h  2 9aFsaB:   cD, 
 

;Bh%h%P'gF69;IV&C:V%jDC$gRPQ8<:6dV%6dPQ8<;BMa[<E%PB;IY69M96dVml:<6h&V&E 6dfD} Z?fDE 68DHV%jDCCM96gF6dV%;IV%6dPQ8p698^V&CFE @'6dCF#hFJIPQfDE @BCFV&CFE 698<;IE l


V%6dPQ8S EAI  p<0 n  k>?zjDCpM;IV&V%69gRCp8<Ps [<E%P@'6:DCh_cPBE CRrL[CFE%V%h[<E P':<fgRC:@A;IEk6dPQf<hh&V%;IV&C\]C8^V%hpV%j<;IVh%fDHI}
h&V%f:Dl'68DH"V%jDC\$PQ8DPBV&PQ8<69gF69V~l[<E%PB[CFE%V%69ChPB_<V%jDC8DCFVb} HBCh&V&C:\$PQ8DPBV&PQ8<6gF6dV~lB> z,j<CFl698:<69gF;IV&C:J#_PBETCRr'}
PBE%_cPBEpV%j<Ch&CFV + HQ6d@BC8 n >z,jDC;Bh%h%6dHQ8<\]C8^V ;B\$[YMdCBJYV%j;IV#V%jDCpPQfDV&[f<V#@I;IE 69;IM9C "!2JST&^XEdJ
n  V%jDC869h{V&CFE \$C:V%jDZ C V K HT%F"W G>P" R R J HG5E] G h%j<PQf<M9:Cj<;s@BC69h%PBV&PQ8<69gF;BM9Mdl698V&CFE \]hpPB_,V%jDC{<@BC
_cPBEh&V%f<:<l'698<HV%jDC4[;IE%V%69;BM\$PQ8<PBV&PQ8<69gF6dVmlB> @I;IE 69;IM9Ch$#lJ"TUT h Fs%IJ&% HJIJ'7 XTJ)(Id"J RRJI;B8<:
CPQfM9:M96dBCV&P8DPBV&CV%j;IV*[;IE%V%69;BMW\$PQ8DPI} *$JSG h'^XE/F"TUT h' XRk>pC_PLgFf<h!PQ8V%jDCh&CPBh&CFE @^}
V&PQ8<6gF6dV~lPB_;*8DCFV~PBE%HQ6d@BC8;[;IE%V%69gFfM9;IE$;Bg%} ;IM9CX@I;IE 69;IMdChaV&P69M9M9fh&V&E ;IV&C V%jDC;I[<[M69gF;IV%6dPQ8PB_<PQfDE
HBE%PQf8<:;Bh%h%6dHQ8<\]C8^V n :DP'Ch8DPBV?HQf<;IE ;B8V&CFC][;IE&} \$CFV%j<P':_cPBE#h&V%f:Dl'68DH[Y;IE%V%69;BM\$PQ8DPBV&PQ869gF6dV~lB>
V%69;BM#\$PQ8DPBV&PQ8<6gF6dV~lHQ6d@BC8;B8^lPBV%j<CFE{;Bg HBE PQf<8<: <E%PQ\ V%jDC @BC PBh&CFE%@I;IMdC @I;IE 69;IM9Ch f<8<:DCFE
;Bh%h 6dHQ8<\$C8VF>U!M9h&PDJB@L6dPQM9;IV%69PQ8<hUPB_YV%jDC[<E PB[CFE V%6dCh7PB_ h&V%f:DlBJC]gRPQ8<h&V&E f<gRV&C:;B8*;Bh%h%6dHQ8\$C8^VM;IV&V%69gRC];Bh
\$PQ8<PBV&PQ8<69gF6dVmlV%j<;IVU;IE C69:<C8^V%6dC:HQ6d@BC8p;[Y;IE%V%69gFfL} :DCh gRE 6dC:698V%jDC[E%CF@'69PQf<h*h&CgRV%6dPQ8> KL698gRC;BM9M
M9;IE Y;Bg%'HBE%PQf<8<:];Bh h%6dHQ8<\$C8V \];sl8<PBV P'gFgFfDEHQ6d@BC8 @I;IE 69;IM9Ch,698@BPQMd@BC:CFE%C698<;IE%lBJ;B:DPB[<V%698<H]PQ8DC4PB_
;B8DPBV%j<CFEph%fg j;Bh h%6dHQ8<\$C8VF>z,jDC{;Bg%'HBE%PQf<8<:;Bhb} V%jDC@A;BMfDChNTUW;B8<: "FIiRFFJ4Cf<h%C:+;h M96dHQjV%Mdl
h%69HQ8<\$C8V;IHQ;B698<h&V!j<6g jT[Y;IE%V%69;BM7\]PQ8DPBV&PQ8<69gF6dVml69h \$PBE CgRPQ8<gF69h&CeC8<gRPL:<698DHPB_V%jDC69EN&PQ698^VW@A;BM9f<C;Bhb}
V&PCh&V%f<:<6dC:h%jDPQfM9:V%j<CFE%CF_PBE CCg jDPQh%C86dV%j h%69HQ8<\$C8V%hF>#DPBEC;Bgkj;Bh%h%6dHQ8\$C8^V n V&PV%jDCh&CFV#PB_
gF;IE%C#;B8<:$C";Bh&C:]f<[PQ8$gRPQ8h%69:DCFE ;IV%69PQ8<h0_E PQ\ V%jDC PBh%CFE%@A;IYMdC @A;IEk69;IMdCk hWJCM9CFV # n C V%jDCXh fDh&CFV
:DPQ\W;B698PB_U;I[[M969gF;IV%6dPQ8> PB_5+h%f<g jpV%j<;IVv@ ! n a6d_;B8:4PQ8<M9l46d_2NTUW
  
 9  ~m
B~7 PLgFgFfDE h68 n >Wz,j<C$CMdC\$C8V%hPB_V%jDC]E%Ch%fMdV%698DHM9;IVb}
V%69gRCV%jf<hW;IE%Ch%fDYh&CFV%h$PB_:],V%j<CTPBV&V&PQ\ PB_?V%jDC
C;I[<[M969C:PQfDE0\$CFV%jDPL:4_cPBEUh%V%f<:DlL698DH\$PQ8<PBV&PQ8<69g} M9;IV&V%6gRC6h#V%j<CpC\][<V~lh&CFV;B8<:V%jDCV&PB[C'f<;BM9h!V%jDC
6dVml$V&P$;4E%C;BMq;lBCh%6;B88DCFV~PBE%W698W@BCFV&CFE 698<;IE lWh%gF6} C8V%6dE%Ch&CFVW>z,jDC{E%Ch%f<M9V%698DHTM9;IV&V%69gRC_cPBEV%jDC{<@BC
C8<gRCB>XC<E 6dC l]698^V&E P':<fgRC!V%j<C8DCFV~PBE%;B8<::DCR} @I;IE 69;IM9Chf<8<:DCFEh&V%f:Dle69hh jDPs869806dHQfDE%CWx'>]`V
h%gREk6dC#V%jDC4@'6dPQM;IV%6dPQ8<hV%j<;IV,C469:<C8^V%6dC:> 698gFM9f<:DCh+@xBCMdC\$C8V%haV&P!gF;I[<V%f<E%C;BM9M^[PQh%h%6dM9C
&N PQ698VU@I;BM9fDC";Bh%h 6dHQ8<\$C8V%h0V&PV%jDC"@I;IE 69;IMdCh0;B8<:p_fDE&}
 :    9 9Y{ c<s cQ * :      %V jDCFE698<gFM9f:DCh tIu4:<69E%CgRVXh&CFVb}~698gFM9f<h%6dPQ8Wh&V%;IV&C\$C8V%hF>
`b8gFMdPQh&CgRPQM9M9;IPBE ;IV%69PQ8#6dV%jV~PCRr'[CFE%V%h_cE%PQ\ q CF_cPBE%CV%jDCM9;IV&V%69gRCgRPQfM9: CC8<j<;B8<gRC:6dV%j
V%jDC=C8V&E ;BM!`b8<h&V%6dV%fDV&CPB_p#8<6\];BM#Z?69h%C;Bh&C=PQ8L} gRPQ8<: 6dV%6dPQ8<;BMD[E%PB;I69M6dV%6dCh0PB_V%jDC#[<E%Ch&C8<gRC#PB_7;@L6}
V&E%PQMp698+V%jDC|#CFV%jDCFE M;B8<:<hFJC;IE%C:DCF@BCMdPB[698<H; E ;IC\W96 ;?PB_gFM;Bh%h%69gF;BM<h%698DC_cCF@BCFEJ^Cj<;B:$V&P:DCgF69:DC
q;lBCh 69;B88<CFV~PBE%_PBEV%jDC{:<CFV&CgRV%6dPQ8PB_"gFM9;Bh h%69gF;BM fD[PQ84; Y;Bg%'HBE%PQf<8<:;Bh%h 6dHQ8<\$C8V_cPBEV%jDC?PBV%jDCFElw
h&#698DC_CF@BCFEs>="M9;Bh%h%69gF;BM"h%698DC_cCF@BCFE6h];B8698<_Cg} PBh%CFE%@AI; YMdC]@I;IE 69;IMdCh;IHQ;B698<h&Vj<69gkj*V%jDC[<E%PB[CFE&}
V%6dPQfhW:<69h%C;Bh&CPB_[69HQhFJ"j<69gkjj<;Bhh&CFE 6dPQfhWh&PLgF6dPI} V%6dCh]PB_#[ ;IE V%69;BM"\$PQ8DPBV&PQ8<6gF6dV~lPQfM9:C@BCFEk6d<C:>
CgRPQ8DPQ\W69gF;BMYgRPQ8h&Cf<C8<gRChf<[PQ8W;B8WPQfDV&E%C;IY> !h C:DCgF96 :<C:V&PV%;IBC_cPBEpV%j69h4[YfDE%[PQh&CWV%j<C{@I;BM9fDC
V%jDC,:69h&C;Bh&C"j<;Bh ;?[PBV&C8V%69;BMD_cPBE E ;I[69:h&[<E C;B:JQ6dV 69h ;Bh%h 6dHQ8<\$C 8V$698j<69gkj;BMMPBV%jDCFE$PBh&CFE%@I;IMdC@I;IE 6}
69\][CFEk;IV%6d@BCV%j;IV 6dV%h0P'gFgFf<E%E%C8<gRC,69h0:DCFV&CgRV&C:{698pV%jDC ;IM9ChPB_LV%Dj C 8DCFV~PBE%j<;B:4;B:DPB[<V&C:V%j<C @A;BMfDC "RB RR>
C;IE M9lph&V%;IHBCh> z,jDC,8DCFVmPBE%$f<8:DCFE gRPQ8<h&V&E fgRV%6dPQ8]69h Cg jDPQh&Ce%V j<69h$[Y;IE%V%69gFf<M9;IE];Bh h%6dHQ8<\$C8VWh%68<gRCV%jDC
;B69\]C:;IVh%fD[<[PBE%V%68DH@BCFV&CFEk698<;IE%l[<E ;BgRV%6dV%69PQ8DCFE h"698 @I;IE 6dPQf<h!gFM969896 gF;BMah 6dHQ8<h!j<;@BC];E ;IV%jDCFEh \];BM9Ma[<E%PBD}
V%jDC:<69;IHQ8DPQh 69h]PB_V%j<C:<6h&C;Bh&C#jDC8@L69h%6dV%68DH[6dH ;I6M96dV~lPB_P'Fg gFfDE E%C8<gRC;B8<:6dV6hj6dHQj<Mdlf<8<M6dBCMdl
_;IE \]h6dV%j:<6h&C;Bh&C?[<E%PBM9C\]hPB_af<8DL8DP8gF;Bf<h&CB> V&P8<:*;M9;IE%BH C8'f<\4CFEPB_"V%jDCh&Ch%6dHQ8<h698;h%698L}
n?f<E8<CFV~PBE%gFfDE%E%C8V%MdlW698gFM9f<:DCh^p@I;IE 69;IMdCh_cPBE HQMdC!M96d@BC![6dHD>XG?d6 @BC8{V%j<69h;Bg HBE%PQf8<:{;Bh h%6dHQ8<\$C8VFJ
j69g jPs@BCFEAQuBu[;IEk;B\$CFV&CFE[<E PB;I69M969V%6dChWj<;s@BC CWgRPQ\$[f<V&C:%V jDC]@I;IE 6dPQf<hgRPQ8:<6dV%6dPQ8<;BM0[E%PB;I69Md}
CFC8;Bh%h&Ch%h%C:>z,jDCp@A;IEk69;IMdCh?\$PL:DCMaV%jDC[;IV%jDPI} 6dV%69ChV&PeCW;Bh h&P'Fg 6;IV&C:69V%jV%jDCWCM9C\$C8^V%hPB_V%jDC
HBC8DCh 69hUPB_V%jDC":<69h%C;Bh&C";Bh0CM9ML;Bh0V%jDC"gFM968<69gF;BM'h%6dHQ8<h ;Bh%h 6dHQ8<\$C8V,M9;IV&V%69Rg CB>
PBh%CFE%@BC:68698:<6d@L69:<f<;BMU[6dHQh>APB_,V%jDCV&PBV%;BMPB_ <PBEXC;Bg j[;B69E PB_ :<6dE%CgRV%MdlM9698<BC:CMdC\$C8V%hX_cE%PQ\

104 L. C. van der Gaag, S. Renooij, and P. L. Geenen


06dHQf<E%C4'Xz,jDCHBEk;I[j<69gF;BMh&V&Ekf<gRV%fDE%CPB_UPQfDEq;lBCh 69;B8e8<CFV~PBE%_PBEgFM;Bh%h%69gF;BMh%698DC?_cCF@BCFE#698[69HQhF>
V%jDC;Bh%h%6dHQ8\$C8^VM9;IV&V%69gRCBJ,CgRPQ\$[;IE%C:V%jDCgRPQ8L} V&PHQ6d@BCE 69h%CV&PV%j<6h!gRPQ\698;IV%6dPQ8PB_h%69HQ8<hF>z,jDCFl
:<6dV%69PQ8<;BM [<E PB;I69M969V%6dChPB_;@L6dE ;IC\]69;PB_#gFM9;Bh h%69gF;BM V%j'f<h4698:<69gF;IV&C:V%j<;IVV%jDCW8DCFVmPBE *h%jDPQf<M9:68<:DCFC:
h&68DCe_cCF@BCFE> C*_cPQf<8<:2_cPQfDE@L6dPQM9;IV%6dPQ8<hPB_pV%jDC j<;s@BCCFC869h&PBV&PQ8<C698V%jDC<@BC*@I;IE 69;IMdChf8<:DCFE
[<E%PB[CFE%V%69ChPB_{[;IE%V%69;BM69h&PBV&PQ869gF6dV~l+HQ6d@BC8V%jDCh&CR} h&V%f<:<lHQ6d@BC8WV%jDC#;IYh&C8<gRC,PB_;B8l$PBV%jDCFEh%69HQ8<hF>0z,jDC
MdCgRV&C:;Bg HBE PQf<8<:;Bh%h%6dHQ8\$C8^VFV%j<Ch&C@'6dPQM;IV%6dPQ8<h _cPQfDE69:<C8^V%6dC:$@'69PQM9;IV%6dPQ8<hXV%jf<hCFE%C#698:<69gF;IV%6d@BC,PB_
;IE%Ch jDPs8]l:<;Bh%jDC:WM9698DCh 698$06dHQf<E%Cx'> z,jDC"_cPQfDE \$PL:DCM9M968DHp68<;B:DC'f<;BgF6dCh"698PQfDE#gFfDE%E%C8V,8DCFV~PBE%>
@L6dPQM9;IV%6dPQ8<hU;BM9M'PBE 6dHQ68<;IV&C:p_E%PQ\;B:<:698DHV%j<CgFM698<69gF;BM
h%6dHQ8WPB_7:<69;IE E jDP'C;V&PV%j<C!gRPQ\698<;IV%69PQ8WPB_8:<698DHQh
{7X*m Ia 
PB_ ;IV%;ArL6;];B8<:T\];BM;B69h&CB>Xz,j<C_PQfDE@'6dPQM;IV%6dPQ8<hV%j'f<h `b8V%j<69h?[;I[CFEJCWj<;s@BC][<E%Ch&C8V&C:*;\$CFV%j<P':_cPBE
CFE%C;BM9MgRPs@BCFE%C:'l;h 698DHQMdCXh%V&E f<gRV%fDE ;BMMdl!\W698<69\];BM h&V%f<:<l'698<H\$PQ8DPBV&PQ8<6gF6dV~lPB_q;lBCh%69;B8$8<CFV~PBE%'h>U`b8
gRPQ8V&CRr'V#PB_UPIC8<gRCB> @L6dCF*PB_'V%jDCXf<8D_;@BPQf<E ;IMdC gRPQ\$[M9CRrL6dVml?PB_V%jDCX[<E%PBD}
C[<E%Ch&C8V&C:V%jDC[Y;B6dE hPB_@L6dPQM9;IV%698DH;Bh%h 6dHQ8L} MdC\ 698HBC8DCFE ;BMJV%jDC\$CFV%jDPL:_PLgFf<h&ChPQ8N%f<h&V;
\$C8V%h]V&PV~P@BCFV&CFE 698;IE 69;B8<hF>qC698DHgRPQ8D_cE%PQ8V&C: h%fDYh&CFV#PB_XV%jDCPBh&CFE%@I;IMdCp@A;IEk69;IMdCh#PB_X;8DCFV~PBE%
6dV%jV%jDC _cPQfDE@L6dPQM9;IV%6dPQ8<hJsV%jDCFl698<:<CF[C8:DC8^V%M9l;B8<: ;B8<:f<69M9:hfD[PQ8*;M9;IV&V%6gRC{PB_,;BM9MNbPQ698V4@I;BM9fDC{;Bhb}
6dV%jgRPQ8@L69gRV%6dPQ868<:<69gF;IV&C:V%j<;IV{V%jDC[<E%PBY;I69M96dVml h%6dHQ8\$C8^V%h#V&PV%jDCh&Cp@A;IE 6;IMdChF>zjDC4M9;IV&V%69gRC69h#C8L}
PB_ ;]@L6dE ;IC\W69;PB_ gFM9;Bh%h%6gF;BMh&698<C_CF@BCFE?h%jDPQf<M:T698L} j<;B8<gRC:6dV%j698D_cPBE \];IV%6dPQ8;IPQfDVV%jDCCRCgRV%hPB_
gRE%C;Bh&CfD[PQ88:<698DHV%jDCp;B:<:<6dV%6dPQ8;BMh%6dHQ8PB_ :<69;IE&} V%jDCh&Ce;Bh%h%6dHQ8<\]C8^V%h$PQ8V%jDCT[<E PB;I69M969V~l:<6h&V&E 6dfD}
E jDP'C;L>*qPBV%j@BCFV&CFE 698<;IEk69;B8<h\$C8V%6dPQ8DC:V%j<;IVV%jDC V%6dPQ8Ps@BCFEV%jDCe8<CFV~PBE%> vh\];B68PQfDV&[f<V]@A;IEk69;IMdCB>
gRPQ\698<;IV%6dPQ8PB_X;IV%;ArD69;;B8<::<6;IE%E jDP'C;]Ch&[CgF69;BMMdl z,jDCC8<j<;B8<gRC:pM9;IV&V%69gRCV%jDC8p69h7fh&C:4_cPBE069:DC8V%6d_lL698DH
[PQ698V&C:V&P#gFM9;Bh h%69gF;BMQh&#698DC _cCF@BCFEB6dV%j<68?V%jDCXh%gRPB[C ;BM9MU@L6dPQM9;IV%6dPQ8<h?PB_V%jDC$[<E%PB[CFE%V%69Ch#PB_\$PQ8DPBV&PQ869gF6dV~l
PB_"V%jDCWq;slBCh%69;B88DCFV~PBE%J0V%jDCFl*gRPQf<M9:8DPBVV%j<698< ;B8<:p_cPBE0gRPQ8h&V&E f<gRV%698<H!\]68<69\];BMBPIC8<:<68DH#gRPQ8V&CRr'V%h
PB_$;B8DPBV%jDCFE:<69h&C;Bh%C*V%j<;IVTPQf<M9:C\]PBE%CM96dBCM9l _cPBE,_cf<E%V%jDCFE,gRPQ8<h%6:DCFE ;IV%6dPQ8>

Lattices for Studying Monotonicity of Bayesian Networks 105


"!" #$
%&(
 '
(
 "(
 **(
   

 **(  "(  ( -  (  **(
 "( "!" #
%*&,(  '
(    ( "!" #$
%&(
)  
"(  '
"(    +( "! #$
%&( ,  '+(
,     ** "! #$
%&$   
  

 '
"(  *(  "(   "(  "(  "!" #
%*&,(     
"(  "( 
 "!" #
%&( 
 !" #$
%&(
 "( ,   +( )  
"(  **( "!" #
%&(     
"( "!" #
%*&,(  *(  "(     +(
,
 **  '
 ,      '  '
  **
    
 "!" #
%&$    '
  +*

 '
"( )  
"(    (  ( "!" #$
%&(  +*( ,  '+(  "( "!" #
&(    (
   ** )  
    '
 ,  '   "!" #
%&$  ** "!" #
%*&

  
         "! #$
%&

06dHQfDE Cx'!8;Bh%h%6dHQ8<\]C8^VM9;IV&V%69gRC_cPBEPQfDEq;lBCh%6;B8*8DCFV~PBE%_PBEgFM;Bh%h%69gF;BM h&#698DCp_CF@BCFE7V%jDC$@'69PQM9;IV%6dPQ8<h


PB_UV%jDC[<E PB[CFE V%6dChPB_a[Y;IE%V%69;BM\$PQ8DPBV&PQ869gF6dV~l;IE%C68<:<69gF;IV&C:'l:<;Bh%jDC:M968DChF>
CPQf<M9:eM96dBC4V&P8DPBV&CpV%j<;IVFJ;Bh?;h%fD[<[M9C\$C8^V
/   < YX  
V&PPQfDET\]CFV%jDP':2_cPBETh&V%f<:<l'698<H\$PQ8<PBV&PQ8<69gF6dVmlBJ!C
:DCh 6dHQ8DC:W;ph&[CgF69;BM}[YfDE%[PQh&C"CM969gF69V%;IV%6dPQ8WV&Cgkj<8<69'fDC 0214351768 9:;8%9%1<%=?>;@1BA2CDEC*FHG"CF IDEJLKNMIFHG"F OEPRQTS-MOEU"VWDEPYX
Z[DEV;MG"F DEP]\[PTD^J_V`G"FHGaTbEc-de8%d)1fa2gih9+jkc:?8 9,lm8 9nko;:-1
V%j<;IV;BMMdPsh!_cPBE:<69h%gFfh%h%698DH{6dV%j:DPQ\];B68eCRrL[CFE%V%h
j<CFV%jDCFEPBE8DPBV4V%jDC]69:DC8V%6d<C:@L6dPQM9;IV%6dPQ8h698<:DCFC: p 1 p 9`oE
q r+s%8 9t1u<%=wvi<;1yxzD^CC*F IM{QYSMOEU"V;|~}BF'UG"C]LOEPYIMYC G

gF;B8C?gRPQ8<h&V&E fDC:;Bh@L6dPQM9;IV%6dPQ8hPB_UgRPQ\]\$PQ8<M9lW;Bg} D^PTX5KFHG"C*U"F iC*F'`MuxDEC*CF IMGai1 1T98 8%o;c)1

L8DPs#MdC:DHBC:[;IV&V&CFE 8hPB_\]PQ8DPBV&PQ8<69gF6dVmlc ;B8:DCFE 1 B1Nzjfo^c-d1 71N8 nknfo^cz1e<t=;=?>1c- 9+8%8 cwro;n

G?;B;IHF{Bo>dJ4IuBu Qk> z,j<69hV&Cg j8<69'fDC*j<;BhCFC8


.
r9o?di8 ;9+8t?nfr+jk;cjfcw-o;c?rjroErjf?8h9+?-o^-jfnkjkr+jH

:DCh 6dHQ8DC:Wh%[CgF69gF;BM9Mdl$h&P;BhV&Pp;Bh%{M96dV&V%M9C,V%69\$C?;B8<: c-8r;9%1)cU+OtIMMXEF'PEGOC'S-M-C'SLOEPtMUMPYIM

M969V&V%MdC gRPBHQ8<6dV%6d@BCCRPBE V7_cE%PQ\V%jDCCRr'[CFE%V%ha698V%jDC@BCFE&} O^PPTIM U"CD^F'P-C*VF'P\[U"C*F $I F DEJ)"P2CM J'J_FwMPYIMa)?9+:wo^c


o;i o^cczaih-o^:?8%,;?>;wi1
6dYgF;IV%6dPQ8PB_0V%jDC69:DC8V%6d<C:@L6dPQM9;IV%6dPQ8hF> B1 1%`o;cdi8 9 p o;o;:-a%1 B1t6idinHo^8%c-di8 9o^c-d1%8%8 nHdi8 9%1
zjDCWE%Ch f<MdV%hV%j<;IVCPB<V%;B698<C:_cE%PQ\ ;I[<[M9l'698<H b;@;@;-1^?c^r;cjHjfrjkc6,o`;8%+jHo^cNc8r;9i 1EcUOE

PQfDE"\$CFV%j<P':_cPBEh%V%f<:DlL698DH\]PQ8DPBV&PQ8<69gF6dVmlWV&P$;pE%C;BM IMMXEF'PEGOC'S-MyEC SmO^P%M U+M PTIMuOEPPTIM U"CD^F'P-C*VuF'P

8DCFVmPBE 68@BCFV&CFEk698<;IE%lh%gF6dC8<gRCBJ698<:69gF;IV&CV%j<;IVW6dV \[U"C*F $I F DEJT"P-CM J'J_F?M PTIM a-"a-h-o;:;8t,;;=?vE-1

[<E Ch&C8^V%hX;f<h&CF_f<ML\$CFV%jDPL:$_cPBEXh&V%f<:DlL698DH?E%C;Bh%PQ8<698DH B1 1`o;cdi8%9 p o;o;:-a71 L1 p 8 8%c8 co;c-d1 021 1


[;IV&V&CFEk8<h68q;lBCh 69;B88DCFV~PBE%LhF>z,jDC{8<CRr'Vh&V&CF[ o^2o;c8twgi-j 1Bb;@;@;-1mlm8 9j jkc:W;c-^r+?cjHjfr

8DP69h0V&PCRrLV&C8<:]PQf<E \$CFV%jDPL:lpV&Cg j<869f<Ch V%j<;IV ;e6,o`;8tjHo^cc8r;9ijfr+di?o^jkc8ihT8 9+r%1c

CRrL[MdPQ6dV#V%jDC$gRPQ8<h&V&E fgRV&C:\]698<69\W;BMPIC8:<698DHgRPQ8L} ,U+O%IMMXEF'PEGOC'S-M[-C'SZD^V;M"G"F D^POtX;M J'J_F'Pw\$;YJ_F'

V&CRrLV%h_PBE"69:DC8V%6d_cl'68DHV%jDC\$PL:DCM9M9698<H698<;B:DC'f<;BgF6dCh
ID^CF O^PGO^U+`G+S-OTa2h-o^:?8%=]<t1

698;8DCFV~PBE%V%j<;IV$gF;Bf<h&CV%jDC@A;IEk6dPQf<hp@'69PQM9;IV%6dPQ8<h 1 71^8 nknfo;c)1-<t=;=;@-1Ec-d-o^8 cwro;n-?c-8%hirL^?2o^nf

PB_0\$PQ8DPBV&PQ8<6gF6dV~lB> jfro^r+jk;8h9;-o;jknfjHrjkc-8r;9%1\[U"CF IF D^J"P2CM J'J_F'


wM PTIMai?-Lb;wv?@;-1

106 L. C. van der Gaag, S. Renooij, and P. L. Geenen
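The method summarised in this abstract rests on comparing, for every pair of directly linked joint value assignments in a lattice over the observable variables, the posterior distributions over the ordered output variable in terms of first-order stochastic dominance. The sketch below illustrates only this dominance check and is not the authors' implementation; the toy domains, the posterior function, and all names in it are assumptions made for the illustration.

    # A minimal sketch of a first-order stochastic dominance check over an
    # assignment lattice; it is NOT the authors' implementation. The "posterior"
    # function below is a stand-in for inference in an actual Bayesian network.
    from itertools import product

    def covering_pairs(domains):
        """Yield (a, a') where a' raises exactly one variable of a by one step."""
        names = sorted(domains)
        for a in product(*(range(len(domains[n])) for n in names)):
            for i, n in enumerate(names):
                if a[i] + 1 < len(domains[n]):
                    a_hi = a[:i] + (a[i] + 1,) + a[i + 1:]
                    yield dict(zip(names, a)), dict(zip(names, a_hi))

    def dominates(p_hi, p_lo):
        """First-order stochastic dominance: cumulative F_hi(c) <= F_lo(c) everywhere."""
        f_hi = f_lo = 0.0
        for hi, lo in zip(p_hi, p_lo):
            f_hi, f_lo = f_hi + hi, f_lo + lo
            if f_hi > f_lo + 1e-12:
                return False
        return True

    def monotonicity_violations(domains, posterior):
        """Directly linked assignment pairs whose posteriors violate isotonicity."""
        return [(a_lo, a_hi) for a_lo, a_hi in covering_pairs(domains)
                if not dominates(posterior(a_hi), posterior(a_lo))]

    # Toy example: two ternary observable variables and a binary output variable.
    domains = {"A": [0, 1, 2], "B": [0, 1, 2]}
    def posterior(a):                      # illustrative stand-in for network inference
        p1 = min(0.95, 0.1 + 0.2 * a["A"] + 0.25 * a["B"])
        return [1.0 - p1, p1]

    print(monotonicity_violations(domains, posterior))   # [] -> isotone on this toy model

Because first-order stochastic dominance is transitive, it suffices to compare assignments that differ by a single step in a single variable; the number of such directly linked pairs is nevertheless exponential in the number of observable variables, in line with the intractability noted above.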

Multi-dimensional Bayesian Network Classifiers
Linda C. van der Gaag and Peter R. de Waal
Department of Information and Computing Sciences, Utrecht University,
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

Abstract

We introduce the family of multi-dimensional Bayesian network classifiers. These classifiers include one or more class variables and multiple feature variables, which need not be modelled as being dependent on every class variable. Our family of multi-dimensional classifiers includes as special cases the well-known naive Bayesian and tree-augmented classifiers, yet offers better modelling capabilities than families of models with a single class variable. We describe the learning problem for a subfamily of multi-dimensional classifiers and show that the complexity of the solution algorithm is polynomial in the number of variables involved. We further present some preliminary experimental results to illustrate the benefits of the multi-dimensionality of our classifiers.
HI?S5 Ri]6OYH 1
; 3   8:6 9 5> Zx^S 6/1 43  578:6 9 5> 6 '
5614R ?S36a 1YTZ367=O;_AO7=a a 1Y6?A@Xa)1YH dH d6?);l&?S5 U?S7<H ];@ ?aI?`
i % i % OY?S_`H 1YTZ37<@ _aI?AH ) b1aE7_AO47=a a 146?A@U@ TZR !$# % &
.0/
H d67<H)R7yl&14R14aI?Sa
)d;?A@ ? d -1a.H d;?K?SRQ1Y@ 14_A7=O5614a HI@ 1YL];H 14TZ3EU@ TZR nb
 
 + t
5 8 5 # 6?! 1 V db5Y 6 3OYT= D db5Y 614aiH d;??S3HI@ T=Q
e   8:6 9 5  x ^c  6<; $8:6 9 5>  x ^ r   ^   6 (
 
T=U7@ 7=3656TZR :<7<@ 147<LOY? # )1YH d-5L14aIHI@ 1Y];H 1YTZ3 db i % i %
5 8 5 # 7?6c! 1  V  db5Y ' 6 3ZOYT=n
 D db5Y  65;?`  M?3;TQ6@ T&_`?A?S5rea d;T)143;DH d67<HH d;?O4?S7<@ 36143;D
3;T=HI?SaH d6?\_`TZ3L561YH 1YTZ367=O
?S3
HI@ T=Q
eiT=U # DZ1Y:=?S?3 7Eb7=365 Q6@ T=OY?SR _A7=3?E5;?S_`TZRQTZaI?S513HITnHfkTaI?AQ7<@ 7<HI?
 8 5 # x 7-6! db5Y ' 6/3SOYT=D db5d Y 65Y 3$'db 6 5  6

T=Q6H 14R14a 7<H 1YTZ3Q6@ T=LOY?SRa+)d614_Xd_A7=3iT=H dK?a TZOY:=?S5
143QTZOYe&3;TZR 17=O$H 14R?=9uTnH d614aC?S365bk?6@ a HK_`TZ3&
 
V
a 1456?A@H d;?\R]6H ]67=Of143;UT=@ R 7<H 14TZ3KHI?A@ RQ?A@ H 7=143613;D.HIT
5;?S3;T=HI?Sa)H d;?KRi];H ]67=O1436UWT=@ R7<H 1YTZ3T=U # 7=365 7E9 H d;?_AO47=a ak:<7<@ 147<LOY?SaS92M?T=LaI?A@ :=?H dL7<HAb&a 1436_`?_AO7=a a
ud6?HfkT)?S3
HI@ T=QreCHI?A@ R a 5 6 56+7=3L5 576 5> 6 :<7<@ 147<LOY?Sa)TZ36OYed67|:=?_AO7=a aQL7<@ ?S3H aAbH d614aHI?A@ R 5;?`
143H d;?7<T:=??`l&Q6@ ?Sa a 1YTZ3UT=8:@ 9 S5>I f&8:6>9 _`TZ3&    Q?S3656aTZ3H d;?aI?AHT=U;7<@ _Aa T=U&H d;?_AO7=a aa ]66D=@ 7<QLd
)
_`?A@ 3iR 7<@ DZ143L7=O5614a HI@ 1YL];H 14TZ36a?SaIH 7<LO41a d;?S5KU@ TZRH d;? TZ36OYe=9M'1YH d@ ?SaIQ?S_`HHITH d;?_`TZ36561YH 1YTZ3L7=OR]6H ]67=O
7S:<7=14O47<OY?>567<H 7&9-ud;?SaI?[HI?A@ R a H d6?A@ ?AUWT=@ ?n5;?AQ?S365 143;UT=@ R 7<H 14TZ3HI?A@ R 7<T|:=?=bk?nT=La ?A@ :=?H d67<HH d614a
TZ36OYeTZ3H d;?>?SR QL1Y@ 14_A7=O\561aIHI@ 1YL]6H 1YTZ37=3653;T=HTZ3 HI?A@ R 5;?AQ?S3656a TZ3 H d6?[U?S7<H ];@ ?aI?SOY?S_`H 1YTZ37<@ _aI?AH
H d;?D=@ 7<QLd61_A7=O)aIHI@ ]L_`H ];@ ?[T=UH d;?n_AO47=a a 1Y6?A@S9ud614a ) b)dL14_ d'14a;l&?S5+b7=365TZ3 rkbL];H3;T=HTZ3
)
T=LaI?A@ :y7<H 1YTZ314RQO41Y?SaH d67<H\7=37=56R 14a a 1YLOY?._AO47=a a 1Y6?A@
.0/
H d;?aI?AH 9 ud;?SaI?_`TZ36a 1456?A@ 7<H 1YTZ36a14RQOYeH d67<H
)
H d67<H$R 7yl;14R 14aI?SaH d;?\OYT=D<fO41Y=?SO41d;T
T&5DZ1Y:=?S3iH d;?k567<H 7 H d;?HTHI?A@ R a7<@ ?14365;?AQ?S365;?S3
H$7=365 _A7=3 ?R 7ylr
14a\7_AO47=a a 146?A@\H d67<H.R 7yl&1R 14aI?SakH d;?Ca ]6R T=UG1YH a\HfkT 14R 1aI?S5aI?AQL7<@ 7<HI?SOYe=9
Ri];H ]67=Of1436UWT=@ R7<H 1YTZ3HI?A@ R aA9 ud6?Ri];H ]67=OYf143;UT=@ R 7<H 1YTZ3HI?A@ R"Q?A@ H 7=136143;DHIT
M?_`TZ36a 1456?A@H d;? R];H ]L7=Of143;UT=@ R 7<H 1YTZ3 HI?A@ R H d;?_AO47=a a:y7<@X147<LOY?Sai_A7=3?R 7yl;14R 14aI?S5
eN]La 143;D
$8:6 9 5> x ^ 613aITZR ?\R T=@ ?.5;?AH 7=14O92M?3;T=HI?\H d67<H H d;?Q6@ T&_`?S56];@ ?nU@ TZR 8\d;T 7=365'014M ] 5S<q =m 6UT=@
H d;?.aI?AH$T=ULQ7<@ ?S3H a ^ \T=U7=3
eU?S7<H ];@ ?\:<7<@ 147<LOY?   _`TZ36aIHI@X]6_`H 143;D R 7yl;14Ri]6RfO414=?SO414d;TrTr5HI@ ?A?Sa)
14aQL7<@ H 1YH 1YTZ36?S5 143
HITH d;?)a ?AJH ^qG T=U+_AO47=a aQL7<@ ?S3
H a
7=365H d6?KaI?AH ^rG CT=U$U?S7<H ];@ ?CQL7<@ ?S3H aA9ca 13;DH d;? <9\8\TZ36aIHI@ ]6_`H7 _`TZRQLOY?AHI?#]6365L1Y@ ?S_`HI?S5 D=@X7<QLd
_ dL7=143@ ]LOY?>UT=@ERi];H ]67=O)13;UWT=@XR 7<H 1YTZ3 58kT:=?A@>7=365 T:=?A@H d6?KaI?AHT=U_AO47=a a.:<7<@ 147<OY?Sa Z9
ud;TZR7=aAbSq=q&* 6Xb;H d6?HI?A@ R $8:6 9 5> =x ^ 6_A7=33;T r9a a 14DZ37Nk?S1YDZd
 H $8 6 9 5 ' C6HIT?S7=_ d ?S5;D=?
?Ct @ 1YHIHI?S37=a ?AH?A?S 5 C=b !
3 G7=36_  ;9
 
$8:6 9 5> x ^qu  6 ; $8:6 9 5> Zx ^Sr  ^qC 6 ' 
sr9j\]61O45>7R7yl&14Ri]6Rk?S1YDZd
HI?S5na QL7=36361436DHI@ ?A?=b
Pi %
UT=@ ?`l;7=RQLOY? ]6a 143;D C@ ]6a y7=O pa 7=OYD=T=@X1YH d6R

)d;?A@ ?  8 5 # x7  q614aH d;?_`TZ365L1YH 1YTZ367=O\Ri];H ]67=O
 5Sq=<
t 6X9
143;UT=@ R 7<H 14TZ3T=U # =7 365 7 DZ1Y:=?S3 )1YH d ;9u2@ 7=36a UWT=@ R H d;?]L36561Y@ ?S_`HI?S5HI@ ?A?143
HIT 7 561
 8 5 # x 7  q6!  @ ?S_`HI?S5TZ3;?=b$reN_ d;TrTZa 1436D[7=37<@ L1YHI@ 7<@ e:<7<@ 1
db5Y '  6 7<OY?EUT=@1YH ai@ TrT=H7=3L5a ?AHIH 143;D7=O4O\7<@ _561Y@ ?S_
db5Y ' ' 6$3SOYT=D

!
V

   db5Y  6/3$d 5   6
(
  H 14TZ36a.U@ TZRH d;?K@ TrT=HTZ];Hk7<@X5+9
^&143L_`?H d6?CUW?S7<H ];@ ?aI?SOY?S_`H 1YTZ3[7<@ _CaI?AH 14a\;l&?S5+b ) .0/ ;T=@H d;?i_`TZ36561YH 1YTZ3L7=OR]6H ]67=Of143;UT=@ R 7<H 14TZ3HI?A@ R
H d;?.aI?AH^   T=U_AO47=a a$QL7<@ ?S3
H aGT=ULH d;?\U?S7<H ];@ ?\:<7<@ 1 Q?A@ H 7=143613;DHIT>H d;? U?S7<H ];@ ?E:y7<@ 17<LOY?SaAbk? T=La ?A@ :=?
7<LOYU?  14aH d;?ia 7=R?UWT=@?A:=?A@ e7=56R 14a a 1YLOY?_AO7=a a 1 H d67<H1YH[14anR 7yl&1R 14aI?S5-re'L365613;D7R 7yl;14Ri]6R
6?A@S9M?_`TZ36_AO4]656?H d67<HCH d6?Ri];H ]67=OYf143;UT=@ R 7<H 1YTZ3 O41Y=?SO14d;TrTr5561Y@ ?S_`HI?S5aIQL7=3L36143;D>HI@ ?A?T|:=?A@ H d;?U?S7y
HI?A@  R $8:6 9 5> =x ^qC  614aKH d;?a 7=R?UWT=@i7=O4ORT&5;?SO4a H ];@ ?:<7<@ 147<LOY?SaS9 ^&]L_ d7HI@ ?A?14a_`TZ36a HI@ ]6_`HI?S5
e
143H d;?aI?AHT=U$7=56R 14a a 1YLOY?_AO47=a a 1YL?A@ a ! # % & 9 H d;?KUTZO4OYT)143;DiQ6@ T&_`?S56];@ <?

Multi-dimensional Bayesian Network Classifiers 111


<9\8\TZ36aIHI@ ]6_`H7_`TZRQLOY?AHI?561Y@ ?S_`HI?S5D=@ 7<QdT:=?A@ Ri];H ]67=Of13;UWT=@XR 7<H 1YTZ3HI?A@ R#UT=@H d;?K_AO47=a a.:y7<@X147<LOY?Sa
H d6?aI?AHT=UG:y7<@X147<LOY?Sa Zrk9 d67=aHIT ?KR 7yl&1R 14aI?S5]6a 13;DH d;?K6@ a HkQL@ Tr_`?S5L];@ ?=9
r9a a 14DZ3E7ik?S1YDZdH $6 5>x   ^qC  6HIT?S7=_Xd  3Z  
; GZLrZ
T   8:b 9  !  ;9

7<@X_CUW@ TZR   HIb V3 H d;? Q6@ ?A:&1YTZ]6a-a ?S_`H 1YTZ3+b? d67|:=? 7=5656@ ?Sa aI?S5
sr9j\]61O45>7R 7yl&1R]6R?S1YDZd
HI?S5561Y@ ?S_`HI?S5naIQL7=3& H d;?nOY?S7<@ 361436DQ6@ T=LOY?SR UWT=@U]LO4OYeHI@ ?A?`f7=];DZR?S3
HI?S5
3L143;D HI@ ?A?=bUT=@?`l;7=RQLOY?i]6a 1436DH d;?i7=O4D=T=@ 1YH d6R Ri]6OYH 1f5L14R?S36a 14TZ367=O_AO7=a a 1Y6?A@Xa+)1YH dK7);l&?S5KU?S7<H ];@ ?
T=U8.d
]7=3650214] 5S<q =m 6T=f@
56RTZ3656a 7=OYD=T<
3 aI?SO4?S_`H 1YTZ3>a ];LD=@ 7<QLd+9$M?36T|6@X1Y?A6eE5L14a _A]6a a\U?S7y
@X1YH d6R 5S<q <6X9 H ];@ ?Ka ];LaI?AHa ?SOY?S_`H 1YTZ3UT=@TZ];@)_AO7=a a 1Y6?A@XaA9
VH14aK?SO4Or3;T)3H d67<HAb$1YU\RT=@ ?ET=@OY?Sa a@ ?S56]63&
M?kTZ]6O45O41Y=?HIT3;T=HI?H d67<HUT=@R 7yl&1R 14a 1436D 567=3
HKU?S7<H ];@ ?Sai7<@ ?E136_AO4]65;?S51437>567<H 7naI?AHAb$H d;?SaI?
H d;?\_`TZ365614H 1YTZ367=ORi];H ]67=OYf143;UT=@ R 7<H 1YTZ3HI?A@ RUWT=@GH d;? U?S7<H ];@ ?SaR 7|eNL147=aH d;?_AO7=a a 1Y6?A@iH d67<H14aO4?S7<@ 3;?S5
U?S7<H ];@ ?:y7<@X147<LOY?SaAb?d67|:=?HIT-_`TZ36aIHI@ ]6_`H7'561 U@ TZR H d6? 567<H 7&b)d614_Xd143 H ];@ 3R7Se @ ?Sa ]LOYH143
@ ?S_`HI?S5aIQL7=363613;DHI@ ?A?=b)d614OY?UWT=@R 7yl&1R 14a 1436D H d;? 7-@ ?SO47<H 1Y:=?SOYeQT
T=@N_AO47=a a 1YL_A7<H 1YTZ37=_A_A];@ 7=_`e=9 je
Ri];H ]67=Of13;UWT=@XR 7<H 1YTZ3HI?A@ R#UT=@H d;?K_AO47=a a.:y7<@X147<LOY?Sa _`TZ36a HI@ ]6_`H 143;DH d;? _AO7=a a 1Y6?A@T:=?A@ F ]6aIHC7a ];La ?AHT=U
k?>_`TZRQL];HI?>7=3]636561Y@ ?S_`HI?S5TZ3;?=9ud;?3;?A?S5UT=@ H d;?nUW?S7<H ]6@ ?n:<7<@ 147<OY?SaAb7OY?Sa a_`TZRQLOY?`l_AO47=a a 1Y6?A@
H d61a561Y?A@ ?S3L_`?7<@X14aI?SaUW@ TZR H d6?iT=aI?A@ :<7<H 1YTZ3H d67<H 14aer14?SO45;?S5H dL7<HEHI?S365LaEHITd67|:=?7?AHIHI?A@EQ?A@ UT=@I
$6 5P C 6! $6 5CZP 6+UWT=@GH d;?\_AO47=a a2:y7<@X147<LOY?Sa R 7=3L_` ? 507=3;DZO4?A;e k)mq24pob\Sq=<q 6X9[$1436561436D7nR 14361Y
)8:dL9 14OY? $8:6 9 5>ux   8:9 Z^qG 6!  8:6 9 5> =x u 3^qC6
  Ri]6R a ];aI?AHCT=UU?S7<H ];@ ?Saa ]6_XdH d67<HCH d;?a ?SOY?S_`H 1Y:=?
UT=@H d;?CU?S7<H ];@ ?K:<7<@ 147<OY?SaA9 _AO47=a a 1Y6?A@_`TZ36a HI@ ]6_`HI?S5nT|:=?A@H d614aa ]6LaI?AH)d67=ad61YDZd&
gC];@7=O4D=T=@ 1YH d6R UT=@a TZOY:r13;DH d;?OY?S7<@X36143;DEQ6@ T=; ?SaIH7=_A_A];@ 7=_`e14a&3;T)3n7=aH d;?iU?S7<H ];@ ?a ];LaI?AHaI?`
OY?SR UWT=@U]6OOYeHI@ ?A?`f7=]6DZR?S3HI?S5Ri]6OYH 1f5L14R?S36a 14TZ367=O OY?S_`H 14TZ3>Q6@ T=OY?SR9u)d;?KU?S7<H ];@ ?a ?SOY?S_`H 1YTZ3>QL@ T=LOY?SR
_AO47=a a 1Y6?A@ aN)1YH d7;l&?S5 U?S7<H ];@ ?'aI?SOY?S_`H 1YTZ3 a ];; ]636UWT=@ H ]L367<HI?SOYe14aK&3;T|3NHIT[?vFfdL7<@ 513D=?S3&
D=@ 7<Qd 14a _`TZRQTZaI?S5 T=U H d6?HfkT Q6@ T&_`?S56];@ ?Sa ?A@ 7=O 5ua 7=R 7<@ 5L143;TZa.7=365O41YU?A@ 14aA3b <o=oZs 6X9
5;?Sa _`@ 1Y?S5 7<T|:=?=9 ^rTZOY:&143;D H d6? QL@ T=LOY?SR 6T=@ jk7|e=?Sa 147=3 3;?AHT=@ _AO47=a a 1Y6?A@ aAb5L1?A@ ?S3H
H dr]6a 7=RTZ]L3H a HIT_`TZRQL]6H 143;D 7=3 ]63L561Y@ ?S_`HI?S5 d;?S]6@ 14aIH 14_7<Q6QL@ TZ7=_ d;?Sa)HITU?S7<H ];@ ?a ];LaI?AH)a ?SOY?S_`H 1YTZ3
R 7yl;14Ri]6Rk?S1YDZd
HI?S5aIQL7=363613;DHI@ ?A?.T|:=?A@H d6?_AO47=a a d67|:=??A?S3Q6@ T=QTZa ?S5+9g3;?iT=UH d;?Sa ?1a)H d;?&z+2  74
:<7<@ 147<LO4?Sa7=365 7 561Y@ ?S_`HI?S5 R 7yl&1R]6R?S1YDZd
HI?S5 4 74 4 5k)zg2 z+$2 5 CTZd67S:&1&7=36
 5 =TZd63+b;Sq=q <6Xb13)dL14_ d
aIQ7=3636143;DHI@ ?A?nT|:=?A@H d;?>U?S7<H ];@ ?>:<7<@ 147<LO4?SaA9ud;? H d;?aI?SOY?S_`H 1YTZ3ET=U+U?S7<H ];@ ?):<7<@ 147<OY?Sa14aR?A@ D=?S5E)1YH d
_`TZRQ];H 7<H 1YTZ3T=UH d;??S14DZdH aiUT=@iH d6?]63L561Y@ ?S_`HI?S5 H d;?O4?S7<@ 36143;D-7=OYD=T=@X1YH d6R9 M? 36T| 7<@ DZ]6?H d67<H
HI@ ?A?d67=a7_`TZR QL];H 7<H 1YTZ367=O
_`TZRQLOY?`l;1YHeT=U  5  6Xb , H d;?a 7=R?7<Q6Q6@ TZ7=_ d_A7=3?]6aI?S5UT=@TZ];@Ri]6OYH 1
)dL14OY?H d;?C_`TZ36a HI@ ]6_`H 1YTZ3T=UH d;?HI@ ?A?C1YH a ?SOYU+@ $? r]61Y@ ?Sa 561R?S36a 1YTZ3L7=Oj\7Se=?Sa 17=33;?AHT=@ _AO47=a a 1Y6?A@ aA9ud;?
, ,  5 O4T=D u6H 14R_
 ? 5 C@ ]6aI<7=ObSq=<t 6X9ud;?E_`TZRQL]& @ ?Sa ]6OYH 143;DiQ6@ T&_`?S56];@ ?1a7=a.UTZO4OYT|)a
H 7<H 1YTZ3T=U2H d;?Ck?S1YDZdH a\UT=@.H d;?K5614@ ?S_`HI?S5HI@ ?A?Kd67=a.7 <9\8.d;T
TZa ?H d;??SRQ6HeUW?S7<H ];@ ?aI?SOY?S_`H 1YTZ3a ];;
_`TZRQOY?`l&14Hfe[T=U  5v  6Xbd614OY?iH d;? _`TZ36aIHI@X]6_`H 1YTZ3 D=@X7<QLdUT=@H d;?143L1YH 147=O_A];@ @ ?S3
Ha ];LD=@ 7<QLd+9
T=U+H d;?HI@ ?A?14H aI?SOYUH 7<=?Sa  5v*6$H 1R?=9^&1436_`?)7HferQ;
14_A7=O.5L7<H 7aI?AH a 7<H 14aIL?S?a  OYT=D -7=365  vb
, r96@ TZR H d6?_A];@ @ ?S3
Ha ];LD=@ 7<QLdD=?S3;?A@ 7<HI?7=O4O
k?)d67|:=?H d67<HH d;?)T:=?A@ 7=O4O_`TZRQLOY?`l;1YHfeT=U+TZ];@k7=OYD=T< QTZa a 1YLOY?UW?S7<H ]6@ ?aI?SOY?S_`H 14TZ3 a ];6D=@X7<QLd6aH d67<H
@ 14H d6R 14aQTZOYe&3;TZR 147=O13>H d;?3
]LR?A@T=U:y7<@X147<LOY?Sa 7<@ ?CT=6H 7=143;?S5reE7=565L143;D7=37<@ _UW@ TZR#7_AO47=a a
143
:=TZOY:=?S5+9 :<7<@ 147<OY?CHIT7U?S7<H ];@ ?K:<7<@ 147<LO4?=9
M?_`TZ36_AO4]L5;?[H d61aaI?S_`H 1YTZ3reT=LaI?A@ :r1436DNH d67<H sr96T=@>?S7=_ dD=?S3;?A@ 7<HI?S5-U?S7<H ];@ ?aI?SOY?S_`H 14TZ3-a ];;
H d;?OY?S7<@ 3613;DCQ6@ T=LO4?SR _A7=3E7=OaITK?)UT=@ Ri]6O47<HI?S5 UT=@ D=@X7<QLd+b_`TZR QL];HI?H d;?7=_A_A];@ 7=_`eT=UH d;??SaIH
_AO47=a a 1Y6?A@ aE143 )d614_XdH d;?aI?AH T=@H d;?aI?AH r
) ) _AO7=a a 1Y6?A@DZ1Y:=?S3KH d61a+a ];6D=@ 7<QdH d67<H214a+O4?S7<@ 3;?S5
14aK?SRQ6Hfe=9ngC];@7=OYD=T=@ 14H d6R14aK@ ?S7=5614OYe7=567<Q6HI?S5HIT ]La 143;DH d;?7=OYD=T=@ 1YH d6R UW@ TZR H d;?KQ6@ ?A:r1YTZ]LaaI?S_
H d;?Sa ?KQ6@ T=LO4?SR aA9M'1YH d l! ibLTZ36O4eEH d;?_`TZ36561
) : H 14TZ3+9
H 1YTZ3L7=OR];H ]L7=Of143;UT=@ R 7<H 1YTZ3HI?A@XR UWT=@H d;? U?S7<H ];@ ? ;9\^&?SOY?S_`H$H d;?\?Sa HGD=?S36?A@ 7<HI?S5a ];6D=@X7<QLd+b<H d67<H14aAb
:<7<@ 147<LO4?Sad67=aHIT?NR 7yl;14R 1aI?S5-]6a 1436DH d6?aI?S_ H d6? UW?S7<H ]6@ ?aI?SOY?S_`H 14TZ3Na ];6D=@X7<QLdT=UkH d6? _AO47=a
TZ365QL@ Tr_`?S5L];@ ?7<T:=?=9M1YH d r;! ibTZ36O4e>H d;?
) : a 1Y6?A@.T=Ud61YDZd;?Sa H7=_A_A];@ 7=_`e=9

112 L. C. van der Gaag and P. R. de Waal


tr9VUH d;?7=_A_A];@ 7=_`e T=U+H d;?_AO47=a a 1Y6?A@)14H d H d;?aI?` a 1 A?CT=U567<H 7 a ?AH))Ao=o

O4?S_`HI?S5a ];6D=@ 7<Qd1adL1YDZd;?A@H d67=3H d67<HCT=UkH d;? $4p2PP} kzfm> 3


4 k 2P*o
4{2z)o
_AO7=a a 1Y6?A@1YH dH d;?>_A];@ @ ?S3H a ]66D=@ 7<QLd+bH d;?S3 8kTZR QTZ]L365367=1Y:=? o&9  =q=t
56?S3;T=HI?H d6?\a ?SOY?S_`HI?S5a ];6D=@ 7<Qd7=aGH d;?\_A];@ @ ?S3
H n]6OYH 1f5L14R367=1Y:=? o&9 t
Ss<
a ];6D=@ 7<QLd7=365D=THIT^rHI?A1 Q r9-VU36T=HAb.H d;?S3 8kTZR QTZ]L365u2v o&9 s=t S m<=q
a HIT=Q[7=365Q6@ T=QTZa ?CH d;?K?SaIH_AO47=a a 1Y6?A@.UT=@H d;? n]6OYH 1f5L14RGuGv o&9 6  Zo
_A]6@ @ ?S3
Ha ];6D=@ 7<Qd7=a.H d;?KT:=?A@ 7=O4O?SaIHA9 a 1 A?CT=U567<H 7 a ?AH) <o=o
^rH 7<@ H 13;D )1YH d7=3?SRQLHfe D=@ 7<QLdL14_A7=ONaIHI@ ]L_`H ];@ ? $4p2PP} kzfm> 3k
4 2P*o
{2z)o
4

)1YH d6TZ];H)7=3
e>7<@ _AaAb7=a143[H d;?7<T|:=?iQ6@ T&_`?S56];@ ?=bL14a 8kTZR QTZ]L365367=1Y:=? o&9 <o <&
&3;T|)37=a w zn2z=f$k4kx*m>}>L9$OYHI?A@ 3L7<H 1Y:=?SOYe=b x2 n]6OYH 1f5L14R367=1Y:=? o&9 t=t=t * <q
n2z+-k4 }U}2<m>}~_A7=3?\]6aI?S5byd614_ daIH 7<@ H a$)1YH d 8kTZR QTZ]L365u2v o&9 s<oZt s<o <o
7KU]6O4OrD=@ 7<Qd614_A7=O6aIHI@X]6_`H ];@ ?U@ TZR)d614_Xd a 143;DZO4?7<@ _Aa n]6OYH 1f5L14RGuGv o&9 <t AoZ<q
7<@ ?K@ ?SRT|:=?S5n1437=3>1YHI?A@ 7<H 14:=?KU7=a d61YTZ3+9 a 1 A?CT=U567<H 7 a ?A)H  Zo=o
 6
  L2<    Z<n <  $4p2PP} kzfm> 3k
4 2P*o
{2z)o
4

8kTZR QTZ]L365367=1Y:=? o&9 t=t<o << s


V3H d614aaI?S_`H 1YTZ3#k?-Q6@ ?Sa ?S3HaITZR?-QL@ ?SO414R 1367<@ e n]6OYH 1f5L14R367=1Y:=? o&9 <oZt 
3r]6R?A@ 14_A7=O@ ?Sa ]6OYH aU@ TZR"TZ]6@>?`lrQ?A@ 1R?S3H a>HIT14O 8kTZR QTZ]L365u2v o&9 t<oZt <o
O4]6a HI@ 7<HI?H d;? ?S3;?A6H aT=U\Ri]6OYH 1f561R?S36a 1YTZ3L7=O41YHfeT=U n]6OYH 1f5L14RGuGv o&9 t=m=t s=<m 
j\7Se=?Sa 17=3n3;?AHT=@ _AO7=a a 1Y6?A@XaA9
^&136_`?H d;?cC8kV.@ ?AQTZa 1YHIT=@ eT=U?S3L_ d6R 7<@ >567<H 7 uG7<LOY?)G
l&Q?A@X14R?S3
H 7=O=@ ?Sa ]6O4H a+UWT=@$561?A@ ?S3
HHferQ?Sa
5;Tr?Sa>3;T=H[136_AO4]65;?7=3
e'5L7<H 7aI?AH a>1YH d-R]LOYH 1YQLOY? T=U_AO47=a a 1Y6?A@TZ35L7<H 7EaI?AH aT=U5L1?A@ ?S3Ha 1 A?D=?S36?A@I
_AO47=a ak:<7<@ 147<LO4?SaAbk?56?S_A145;?S5EHITiD=?S36?A@ 7<HI?CaITZR?C7<@I 7<HI?S5>U@ TZR#H d6?CT
?SaIT=Qd67<D=?S7=O2_A7=3L_`?A@3;?AHT=@ 9
H 1YL_A17=O+567<H 7aI?AH a)HITHI?SaIHTZ];@)OY?S7<@X36143;D 7=OYD=T=@X1YH d6R9
ud;?Sa ?567<H 7a ?AH a[k?A@ ?D=?S36?A@ 7<HI?S5UW@ TZR H d;?NTr?`
aIT=QLdL7<D=?S7=O_A7=3L_`?A@3;?AHfkT=@ _ 5 7=356?A@BC7=7< D k)m2<4pob UTZO45 _`@ TZa aI:y7=O41567<H 1YTZ3+9$;T=@kH d;?)Ri]6OYH 1f561R?S36a 1YTZ3L7=O
<o= o 6X9udL14a3;?AHT=@ UT=@H d6?)aIH 7<DZ143;DT=U+_A7=36_`?A@T=U _AO47=a a 1Y6?A@ aAb?5;?AL3;?S5H d;?S1Y@k7=_A_A];@ 7=_`e7=akH d;?Q6@ T<
H d;?T
?Sa T=QLd67<DZ]6a\1436_AO]65;?Sa @ 7=3L5;TZR:<7<@ 147<OY?SaT=U QT=@ H 1YTZ3T=Ua 7=RQLOY?SaH d67<Hk?A@ ?_AO7=a a 1Y6?S5E_`T=@ @ ?S_`H OYe
)d61_ 8d =t7<@ ?T=LaI?A@ :<7<LOY?U?S7<H ];@ ?:y7<@ 17<LOY?SaA9u)d;@ ?A? UT=@)7=O4O+_AO47=a a):y7<@X147<LOY?Sa.143
:=TZOY:=?S5+9
T=UH d;?n36?AHfkT=@ paE:<7<@ 147<LOY?Sa143?Sa aI?S3L_`?n7<@ ?[_AO7=a a ud6?n@ ?Sa ]6OYH aEUW@ TZR TZ];@?`l&Q?A@ 14R?S3
H a7<@ ?a ]6R
:<7<@ 147<LOY?SaSb)dL14_ dK13KH d;?k_A];@ @ ?S3
H23;?AHT=@ 7<@ ?\a ]6R R 7<@ 1aI?S513uG7<LOY?<9ud;?7=_A_A];@ 7=_`eT=U)H d;??SaIH
R 7<@ 1aI?S5n13n7a 1436DZOY?TZ];HIQ];H:<7<@ 147<OY?=9M?iD=?S36?A@I OY?S7<@ 36?S5_AO47=a a 1YL?A@\1aDZ14:=?S3>143H d;?CaI?S_`TZ365>_`TZO]6R 3+
7<HI?S5H d;@ ?A?567<H 7na ?AH aT=UKAo=o&nb <o=o7=36l 5 Zo=oa 7=R H d;?>H dL1Y@ 5_`TZO4]LR 3DZ1Y:=?SaH d;?n3r]6R?A@T=UQ7<@ 7=R
QLOY?SaSb@ ?SaIQ?S_`H 1Y:=?SO4e=bk]6a 143;DOYT=DZ14_a 7=R QLO4143;D;9;@ TZR ?AHI?A@Q6@ T=L7<L1O41YH 1Y?SaH d67<Hk?A@ ?n?SaIH 14R 7<HI?S5UWT=@EH d614a
H d;?D=?S36?A@ 7<HI?S5Ea 7=RQLOY?Sak?@ ?SR T|:=?S5EH d;?):<7=O4];?SaT=U _AO47=a a 1Y6?A@S9 6@ TZR H d;? H 7<OY?k?R7Se _`TZ3L_AO4]65;?
7=O4O3;TZ3&T=aI?A@ :<7<LOY?:y7<@ 17<LOY?SaAbr?`l&_`?AQLH.UWT=@\H d;TZaI?CT=U H d67<HEH d;?nRi]6OYH 1f5L14R?S36a 14TZ367=O_AO7=a a 1Y6?A@XaAb1YH d;TZ];H
H d;?KH d6@ ?A?K_AO47=a a.:<7<@ 147<OY?SaA9 ?`l;_`?AQ6H 1YTZ3+b
TZ];HIQ?A@ UWT=@XR H d;?S1Y@_`TZR QTZ]L365_`TZ]63
HI?A@I
;@ TZR H d;? H d;@ ?A? 5L7<H 7 aI?AH a-k?_`TZ36aIHI@X]6_`HI?S5 QL7<@ H a143 HI?A@XR aT=U7=_A_A];@X7=_`e=9OaITH d;?3
]LR?A@ aT=U
U]6O4OYe 367=1Y:=? 7=365U]LO4OYe HI@ ?A?`f7=];DZR?S3
HI?S5 Ri]6OYH 1 ?SaIH 14R7<HI?S5Q7<@ 7=R?AHI?A@ a7<@ ?_`TZ36a 15;?A@ 7<LOYea R7=O4OY?A@
5614R ?S36a 1YTZ367=O_AO47=a a 1YL?A@ aA9 6T=@H d614anQL];@ QTZaI?=bCk? UT=@iH d6?Ri]6OYH 1Yf5614R?S36a 1YTZ367=O_AO47=a a 1Y6?A@ aA9g37S:=?A@
]6aI?S5H d;?O4?S7<@ 36143;D7=OYD=T=@ 1YH dLR 5;?Sa _`@ 1Y?S5143H d;? 7<D=?=bZH d;?.OY?S7<@X3;?S5Ri]6OYH 1Yf5614R?S36a 1YTZ367=O=_AO47=a a 1Y6?A@ a2@ ?`
Q6@ ?A:&1YTZ]6aa ?S_`H 1YTZ36aA9 ;T=@_`TZR QL7<@ 14aITZ3 QL];@ QTZaI?SaAb r]61Y@ ?KTZ36?`H d61Y@ 5[T=UH d;?3r]6R?A@T=UQL7<@ 7=R?AHI?A@XaT=U
k?U];@ H d6?A@nOY?S7<@ 36?S53L7=1Y:=?7=3L5HI@ ?A?`f7=]6DZR?S3HI?S5 H d;?S1Y@>_`TZR QTZ]L365-_`TZ]63
HI?A@ QL7<@ H aA9 u)d;?561?A@ ?S36_`?
j\7Se=?Sa 17=3 3;?AHT=@ _AO47=a a 1YL?A@ a)1YH d7_`TZRQTZ]6365 14aQL7<@ H 14_A]6O47<@ O4eNaIHI@ 1Y&143;DUWT=@H d6?>367=1Y:=?>_AO47=a a 1Y6?A@ a
_AO47=a a:<7<@ 147<LO4?KUW@ TZR H d;?567<H 7&9C;T=@K7=O4O2_AO47=a a 146?A@ aAb OY?S7<@ 36?S5U@ TZRH d;?Ao=oyfa 7=RQLOY?567<H 7'aI?AHAbd;?A@ ?
k?]6a ?S57nUWT=@ k7<@ 5;faI?SOY?S_`H 1YTZ3@ 7<QLQ?A@i7<Q6Q6@ TZ7=_Xd H d;?KRi]6OYH 1f5L14R?S36a 14TZ367=O_AO7=a a 1Y6?A@.3;?A?S5La.TZ36OYeETZ3;?`
)1YH dnH d;?OY?S7<@ 3613;DE7=OYD=T=@X1YH d6R9ud;?7=_A_A];@ 7=_A1Y?SaT=U 6UH dT=UH d;?3
]6R?A@T=UQL7<@ 7=R?AHI?A@XaT=UH d;?_`TZR
H d;?:<7<@ 1YTZ]La_AO47=a a 146?A@ a$k?A@ ?)_A7=O4_A]6O7<HI?S5]6a 13;DHI?S3& QTZ]6365TZ36?=9ud;?)a R7=O4OY?A@3r]6R?A@XaT=U+QL7<@ 7=R ?AHI?A@ a

Multi-dimensional Bayesian Network Classifiers 113


@ ?$r]61Y@ ?S5_`TZ3LaIH 1YH ];HI?7_`TZ3La 145;?A@ 7<OY?C7=5;:y7=3
H 7<D=?T=U v .)N"3 m%5T::;F=0>@)ST) :T>JST>"@
H d;?Ri]6OYH 1Yf5614R?S36a 1YTZ367=O_AO7=a a 1Y6?A@Xa2T:=?A@H d6?S1Y@_`TZR Q JSJAFAF *L#%GJAF>@4:JC)+P_Fcz`dixczs#cedcz_as!
QTZ]6365_`TZ]L3HI?A@ Q7<@ H aa 136_`?.H d;?SaI?QL7<@ 7=R?AHI?A@ aHferQ; Z+:L:9:9
14_A7=OOYeK3;?A?S5HIT??SaIH 1R 7<HI?S5UW@ TZR@ ?SO47<H 1Y:=?SOYea R 7=O4O 3 K#. !:L", O+) BI:S 5T:u;F o {G`D`di8g^(h
567<H 7aI?AH aA9 |FdhF^"]<Ds"i8cz^"d6k`a^!]<!%GH
OYH d;TZ]6DZd TZ];@QL@ ?SO414R 1367<@ e-?`l&Q?A@X14R?S3
H 7=O@ ?` P$ BD S\5(u";<=0C>J%GBBQa"Aa)%G4S\^!q]<d6s!{
a ]LOYH anO4T
T=Q6@ TZR 14a 143;D;bKk?7<@ ?7S\7<@ ?H d67<HU];@I ^Th`Fg`as!]@_akN^Thmik`js!ixcz^!d6s!{q]J`Js"q^(h3is"db:s!]@b!ga
H d;?A@?`l&Q?A@X14R?S3
H 7<H 1YTZ3n1a)3;?S_`?Sa a 7<@ e>HITa ];LaIH 7=3
H 1 
jZ7":Lp7!9
7<HI?C7=3
eE_AO47=14Rak7<TZ];Hk?AHIHI?A@\Q?A@ UWT=@ R7=36_`?T=U2TZ];@ @%BI1j%4"?K#1j USTBD%>u5T:";F
Ri]6OYH 1f5L14R?S36a 14TZ367=O<jk7|e=?Sa 147=33;?AHT=@ K_AO7=a a 1Y6?A@XaA9
+H:S@%F>(W J'AFUSJST%GaSs_akced6``as!]<dcedf"
7Z+2Nu
& E fn <fn &+ Z    Z3 
$ 3 :4)"K pMV%xN5(:;< "J%4
V3-H d614a>QL7<Q?A@>?143
HI@ Tr5L]6_`?S573;?A U7=R14OYeT=U 4BD:>J
+HuST%U"AFUSJS@%~JSZAF :BvC"J%S@ 
j\7Se=?Sa 147=33;?AHfkT=@ _AO47=a a 146?A@ aH d67<H.1436_AO4]656?)TZ3;?T=@ * %US(>J@%Q>J%G :xQS@NA:S@S@%~A!>J%G :QS@"C
RT=@ ?E_AO47=a a:y7<@X147<LOY?Sa7=365Ri]6OYH 1YQOY?UW?S7<H ]6@ ?:<7<@ 1 C@ Aa)uS"XY Z-[+]@^_a`a`ab!cedf!gm^(hjik`DiekD|FdiY`F]<d6s!ixcz^!ds"{
7<LO4?SaH d67<H3;?A?S53;T=H?nRTr56?SO4OY?S57=a?S13;DN5;?` ^!]@!g@k^ar^"dw2|s!db?iYs!ixcUg<i8cz_Fg<C"4:S27:7L79
Q?S365;?S3
HK];QTZ3?A:=?A@ eN_AO7=a aK:<7<@ 147<LO4?=9M?UWT=@ Ri]& L3 )%?1v :: ) 5(u";<a"CCaS\* :V*"
O47<HI?S5H d;?OY?S7<@ 361436D>Q6@ T=LO4?SR UWT=@iH d614aU7=R 14OYe7=365 >JJS@QS@F>NS@A<>J%G :0w0]<ixc y_czs!{2|<di`{e{/c~f`Fd6_J`mZ
Q6@ ?SaI?S3HI?S57aITZO4];H 14TZ37=OYD=T=@X1YH d6RH d67<H14aiQTZO4er3;T< 7!m:7!
R 17=O143H d;?3r]6R?A@KT=U\:y7<@ 17<LOY?SaK143
:=TZOY:=?S5+9g];@ 
j3@ST'!"x65(";<=0I>@)LS@) @>@uS(>+STC"%4vSTQ
Q6@ ?SO414R 143L7<@ e?`lrQ?A@ 14R ?S3H 7=O&@ ?Sa ]6OYH a$aI?A@ :=?S5HITK14O4O4]6aI >J@3 "*V4:JC)"p>@)3>@a%G4DSJ"S@BDpC@ :Q
HI@ 7<HI?H d;??S36?A6H aT=U+H d;?R]LOYH 1f5614R ?S36a 1YTZ367=O1YHfeKT=U B2[ ]@^u_a`J`ab"cedf"g,^(hiek`Nw0D`]<cz_as!ds!ik`Ds"i8cz_as"{
TZ];@j\7|e=?Sa 147=33;?AHfkT=@ _AO47=a a 146?A@ aA9 P^_cz`i8!Z\:p"9
V3-H d;?3;?S7<@[U];H ];@ ??13HI?S365HITQ?A@ UT=@ R 7 M\\ 4H:\XYQ"D+) :BDCST :+5(::7:;<O
RT=@ ??`l&HI?S36a 14:=??`lrQ?A@ 14R ?S3H 7<H 1YTZ3 a H ]65;eT=UTZ];@ "HST%US "*I
+HS@%U"A:S@S@%~aS+XYZ[+]@^_a`a`ab!cedf!g
OY?S7<@X36143;D7=O4D=T=@ 1YH d6Rb]6a 143;DT=H d;?A@>567<H 7aI?AH a7=365 ^ThDiek`pl:!ikt-^!duhF`F]J`Fd6_J`^"d,w0]<ixc y_czs!{|FdiY`F{e{/c~f`d_a`F
C"4uS77"mp77
T=H d;?A@7<Q6Q6@ TZ7=_ d;?SaHITUW?S7<H ];@ ?a ]6LaI?AHaI?SO4?S_`H 1YTZ3+9
M?U];@ H d;?A@)14a dHIT143:=?Sa H 1YDZ7<HI?T=H d;?A@HferQ?SaET=U M\    AS5x7"99;F ST>@J%UA<>@u
H:S@%F>(W J'
RT&5;?SO[UW@ TZR TZ];@'U7=R 14OYeT=URi]6OYH 1f5L14R?S36a 14TZ367=O ST>@JAF>@J@%G4XYZ\w0b!!s!d6_J`FgcedD0s!`Fg<czs!dI`i8}
_AO47=a a 1Y6?A@ aS9FGTZa a 14LOY?E7=O4HI?A@ 367<H 1Y:=?Sa143L_AO4]65;?_AO7=a a 1 ^"]@g-ixqb!cz`FgcedPq:aced`Fgagvs!d6bp6^ThaiLtV^"2r6qi8cedf"
: x6:CJ%G4@VJ4C"4:S7!7:7
6?A@Xa21YH d . f5;?AQ?S365;?S36_`?kQTZOYe
HI@ ?A?Sa2T:=?A@H d;?S14@GU?S7y
H ];@ ?:y7<@ 17<LOY?SaA9^;1436_`?H d;?3r]6R?A@CT=U\_AO7=a aK:<7<@ 1  MV@'6j587"9:9:7:;<+KNOMAF BDCFE%G>(H@uST~>aSp"C
7<LO4?Sak]La ]67=O4OYe 1a@X7<H d;?A@a R 7=O4Obrk?C7=O4aITkTZ]6O45O41Y=? C@ E%BD">@% IBDF>J) SXY Z-[+]@^_a`a`ab!cedf!gm^(hjik`ml"ik
t-^"dhF`]@`Fd6_a`^"dd_a`]<is"cedixcedw0]<ixc y_czs!{3|<di`{e{/ce}
HIT13:=?SaIH 14DZ7<HI?[H d;?>U?S7=a 1Y14O41YHeNT=UC_AO47=a a 1YL?A@ a)1YH d f`d_a`FKN J4:"L"*BIC4uS :mp:
a O1YDZdH O4eRT=@ ?_`TZR QLOY?`l_AO47=a aa ]66D=@ 7<QLd6aS9
K#v")BD%85T::;Fmu"J%G4R%GBD%G>@uCA

+H:S@%mAFUSJST%GaSXYZV[ ]@^u_a`J`ab"cedf"g2^Th+iek`2:db2|Fd}
iY`]<ds!ixcz^!d6s!{tV^"dhF`]@`d_a`#^!dd6^ {G`abf`cUgF_a^"`]<
=6{Z s"dbs"isIcedcedfOO0O2XM-JSJSC"4uS+:m:
  
!"##$ % &(') *+,-/.0!"
21346587"9:9:7:;<=0?>@)3A BDCFE%G>(HI "*>J)LKNMPO X<-SJ"BI"a% :Sv".0  -O2%~*@%US25x7"9:9:;FV !W+"aS
C@ :QGBR%C@ :Q"Q%%ST>@%UAV>(W :@'SFXYZ\[+]@^_a`a`ab!cedf!g C@%AF%C*">@JSTGuA<>@%  Z J!"A!~>JaS
^Thjiek`mln"iekpo q]@^Jr`as!d#t-^!duhF`F]J`Fd6_J`v^!dpw0]<ixc y_czs!{|Fd} Wa"CCJSXY ZV[ ]J^_a`a`ab!cedf"g^(h+ik`0"ikm|FdiY`F]<d6s!ixcz^!ds"{
iY`{e{/c~f:`d_a`X@=3M-JSJSC4uS "m" ^!]@!g@k^ar^!dDw0]<i8c y _Fczs"{|<diY`{e{/c~f:`d_a`js!d6bjiYs!ixcUg<i8cz_Fg<

 DN.) !W".0 %#5(:;<pOCCJ E%GBI!>@  .0D134V62  % &F .0   K%G>T>JBI"


%4I%US@A@>@3CJ QQ%G%G>(H?%US(>J@%Q>J%G :SW%G>@)C
j K# M\O2BI"N
j 1v:"x 587997;<M-J QQ%GG
A>@JuS|Jo-o-o]@s"dgs_ixcz^!dg^"d|FdhF^"]<Ds"i8cz^"d %G>@%S* ,C@ :Q"Q%%ST>@%UA#>(W :@'6ZAS@ST>@H
6k`J^"]<! Z\:73?: % S@ C)"4u"mA"Au3w0]<ixc y_czs!{j|FdiY`F{e{~c~f:`d_a`ced
`ab!cz_ced`7:Z+u7LN:

114 L. C. van der Gaag and P. R. de Waal


Dependency networks based classifiers: learning models by using
independence tests.
José A. Gámez, Juan L. Mateo, and José M. Puerta
Computing Systems Department
Intelligent Systems and Data Mining Group i3A
University of Castilla-La Mancha
Albacete, 02071, Spain

Abstract
In this paper we propose using dependency networks (Heckerman et al., 2000), a
probabilistic graphical model similar to Bayesian networks, to model classifiers. The main
difference between the two models is that dependency networks allow cycles, which makes
the automatic learning process much easier and readily parallelizable. These properties make
dependency networks a valuable model, especially when large databases have to be handled.
Because of these promising characteristics we analyse the usefulness of dependency-network-based
Bayesian classifiers. We present an approach that uses independence tests based on the chi-square
distribution in order to find relationships between the predictive variables. We show that this
algorithm is as good as some state-of-the-art Bayesian classifiers, like TAN and an implementation
of the BAN model, and has, in addition, other interesting properties such as scalability and good
quality for visualizing relationships.

1 Introduction

Classification is basically the task of assigning a label to an instance formed by a set of attributes. The automatic approach to this task is one of the main parts of machine learning. Thus, the problem consists in learning classifiers from datasets in which each instance has been previously labelled. Many methods have been used for solving this problem, based on various representations such as decision trees, neural networks, rule-based systems and support vector machines, among others. However, a probabilistic approach can also be used for this purpose: Bayesian classifiers (Duda and Hart, 1973; Friedman et al., 1997). Bayesian classifiers make use of a probabilistic graphical model such as Bayesian networks, which have been a very popular way of representing probabilistic relationships between variables in a domain. Thus a Bayesian classifier takes advantage of probability theory, especially the Bayes rule, in conjunction with a graphical representation for qualitative knowledge. The classification task can be expressed as determining the probability for the class variable knowing the value for all other variables:

P(C | X1 = x1, X2 = x2, ..., Xn = xn).

After this computation, the class variable's value with the highest probability is returned as result. Obviously, computing this probability distribution for the class variable can be very difficult and inefficient if it is carried out directly, but based on the independencies stated by the selected model this problem can be simplified enormously.
In this paper we present a new algorithm for the automatic learning of Bayesian classifiers from data, but using dependency network classifiers instead of Bayesian networks. Dependency networks (DNs) were proposed by Heckerman et al. (2000) as a new probabilistic graphical model similar to BNs, but with a key difference: the graph that encodes the model structure does not have to be acyclic. As in BNs, each node in a DN has a conditional probability distribution given its parents in the graph. Although the presence of cycles can be seen as the capability to represent richer models, the price we have to pay is that the usual BN inference algorithms cannot be applied and Gibbs sampling has to be used in order to recover the joint probability distribution (see (Heckerman et al., 2000)). An initial approach for introducing dependency networks in automatic classification can be found in (Mateo, 2006).
Learning a DN from data is easier than learning a BN, especially because restrictions about introducing cycles are not taken into account. In (Heckerman et al., 2000) a method was presented for learning DNs based on learning the conditional probability distribution for each node independently, by using any classification or regression algorithm to discover the parents of each node. A feature selection scheme can also be used in conjunction. It is clear that this learning approach produces scalable algorithms, because any learning algorithm with this approach can be easily parallelized just by distributing groups of variables between all the nodes available in a multi-computer system.
This paper is organized as follows. In section 2 dependency networks are briefly described. In section 3 we review some notes about how the classification task is done with probabilistic graphical models. In section 4 we describe the algorithm to learn dependency network classifiers from data by using independence tests. In section 5 we show the experimental results for this algorithm and compare them with state-of-the-art BN classifiers, like the TAN algorithm and an implementation of the BAN model. In section 6 we present our conclusions and some open research lines for the future.

2 Preliminaries

A Bayesian network (Jensen, 2001) B over the domain X = {X1, X2, ..., Xn} can be defined as a tuple (G,P) where G is a directed acyclic graph and P is a joint probability distribution. Due to the conditional independence constraints encoded in the model, each variable is independent of its non-descendants given its parents. Thus, P can be factorized as follows:

P(X) = ∏_{i=1}^{n} P(Xi | Pai)    (1)

A dependency network is similar to a BN, but the former can have directed cycles in its associated graph. In this model, based on its conditional independence constraints, each variable is independent of all other variables given its parents. A DN D can be represented by (G,P) where G is a directed graph, not necessarily acyclic, and as in a BN, P = {P(Xi | Pai)} is the set of local probability distributions, one for each variable. In (Heckerman et al., 2000) it is shown that inference over DNs, i.e. recovering the joint probability distribution of X from the local probability distributions, is done by means of Gibbs sampling instead of the traditional inference algorithms used for BNs, due to the potential existence of cycles.
A dependency network is said to be consistent if P(X) can be obtained from P via its factorization from the graph (equation 1). By this definition, it can be shown that there exists an equivalence between a consistent DN and a Markov network. This fact suggests an approach for learning DNs from a Markov network learned from data. The problem with this approach is that it can be computationally inefficient in many cases.
Due to this inconvenience, consistent DNs become very restrictive, so in (Heckerman et al., 2000) the authors defined general DNs, in which consistency of the joint probability distribution is not required. General DNs are interesting for automatic learning due to the fact that each local probability distribution can be learned independently, although this local way of processing can lead to inconsistencies in the joint probability distribution. Furthermore, structural inconsistencies (Xi ∈ pa(Xj) but not Xj ∈ pa(Xi)) can be found in general DNs. Nonetheless, in (Heckerman et al., 2000) the authors argued that these inconsistencies can disappear, or at least be reduced to the minimum, when a reasonably large dataset is used to learn from. In this work we only consider general DNs.
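As an aside, the following minimal Python sketch (ours, purely illustrative; the variable names and the tiny example network are not taken from the paper) shows one way to store a general DN as parent sets plus local conditional distributions, and how a joint configuration could be scored by multiplying the local terms as in equation (1). For a general DN this product is only an approximation, since consistency of the joint distribution is not guaranteed.

# A tiny general DN over binary variables; note the X1 <-> X2 cycle.
# Each local table stores P(X = 1 | parent assignment).
parents = {"C": [], "X1": ["C", "X2"], "X2": ["C", "X1"]}
cpd = {
    "C":  {(): 0.4},
    "X1": {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
    "X2": {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.8},
}

def local_prob(var, value, assignment):
    # P(var = value | parents of var), read from the local table
    key = tuple(assignment[p] for p in parents[var])
    p_one = cpd[var][key]
    return p_one if value == 1 else 1.0 - p_one

def pseudo_joint(assignment):
    # Product of the local terms, mimicking equation (1); for a general
    # (possibly inconsistent) DN this is only an approximation.
    result = 1.0
    for var in parents:
        result *= local_prob(var, assignment[var], assignment)
    return result

print(pseudo_joint({"C": 1, "X1": 1, "X2": 0}))  # -> 0.048 for this toy model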



An important concept in probabilistic graphical models is the Markov blanket. This concept is defined, for any variable, as the set of variables which makes it independent from the rest of the variables in the graph. In a BN the Markov blanket is formed by the set of parents, children and parents of the children of every variable. However, in a DN, the Markov blanket of a variable is exactly its parent set. That is the main reason why DNs are useful for visualizing relationships among variables, because these relationships are explicitly shown.

3 Probabilistic graphical models in classification

As has been said before, we can use Bayesian models in order to perform a classification task. Within this area, the most classical Bayesian classifier is the one called naive (Duda and Hart, 1973; Langley et al., 1992). This model considers all predictive attributes independent given the class variable. It has been shown that this classifier is one of the most effective and competitive, in spite of its very restrictive model, since it does not allow dependencies between predictive variables. However, this model is used in some applications and is employed as the initial point for other Bayesian classifiers. In the augmented naive Bayes family of models, the learning algorithm begins with the structure proposed by the naive Bayesian classifier and then tries to find dependencies among variables using a variety of strategies. One of these strategies is looking for relationships between predictive variables represented by a simple structure like a tree, as is done in the TAN (Tree Augmented Naive Bayes) algorithm (Friedman et al., 1997). Another is to adapt a learning algorithm employed in machine learning with general Bayesian networks. In this case the learning method is modified in order to take into account the relationship between every variable and the class.
Apart from this approach to learning Bayesian classifiers, another one based on the Markov blanket concept must be mentioned. These methods focus their efforts on searching for the set of variables that will be the class variable's Markov blanket. Nonetheless, it has been shown that methods based on extending the Naive Bayes classifier, in general, outperform the ones based on the Markov blanket.
When the classifier model has been learned, it is time to use inference in order to find what class value has to be assigned for the values of a given instance. To accomplish this task the MAP (maximum a posteriori) hypothesis is used, which returns as result the value for the class variable that maximizes the probability given the values of the predictive variables. This is represented by the following expression:

cMAP = arg max_{c∈C} p(c | x1, ..., xn)
     = arg max_{c∈C} p(x1, ..., xn | c) p(c) / p(x1, ..., xn)
     = arg max_{c∈C} p(c, x1, ..., xn)

Thus, the classification problem is reduced to computing the joint probability for every value of the class variable with the values for the predictive variables given by the instance to be classified.
When we deal with a classifier based on a Bayesian network we can use the joint probability factorization shown in equation 1 in order to recover these probabilities. Initially, inference with dependency networks must be done by means of Gibbs sampling, but due to the fact that we focus on classification, and under the assumption of complete data (i.e. no missing data either in the training set or in the test set), we can avoid Gibbs sampling. This result comes from the fact that we can use modified ordered Gibbs sampling as designed in (Heckerman et al., 2000). This way of doing inference over dependency networks is based on decomposing the inference task into a set of inference tasks on single variables. For example, if we consider a simple classifier based on a dependency network with two predictive variables which show dependency between each other, so with the graphical structure X1 → X2, X2 → X1, C → X1, C → X2, its probability model will be:

P(C, X1, X2) = P(C) P(X1 | C, X2) P(X2 | C, X1).
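To make this term-by-term computation concrete, here is a small illustrative Python sketch (our own; the probability values are invented) that classifies an instance with this two-feature model by evaluating P(c)·P(x1|c,x2)·P(x2|c,x1) for every class value and taking the argmax, with no Gibbs sampling involved.

# Toy CPTs for the cyclic model above (all variables binary).
# p_c[c] = P(C=c); p_x1[(c, x2)] = P(X1=1 | c, x2); p_x2[(c, x1)] = P(X2=1 | c, x1).
p_c = {0: 0.6, 1: 0.4}
p_x1 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}
p_x2 = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.8}

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def classify(x1, x2):
    # Score each class value with P(c) * P(x1|c,x2) * P(x2|c,x1);
    # all conditioning values are known, so every term is a table lookup.
    scores = {c: p_c[c] * bern(p_x1[(c, x2)], x1) * bern(p_x2[(c, x1)], x2)
              for c in p_c}
    return max(scores, key=scores.get), scores

print(classify(x1=1, x2=0))  # -> (0, {0: 0.072, 1: 0.056}) for this toy model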

With this method the joint probability can be obtained by computing each term separately. The determination of the first term does not require Gibbs sampling. The second and third terms should be determined by using Gibbs sampling, but in this case we know all the values for their conditioning variables, so they too can be determined directly by looking at their conditional probability tables. If we assume that the class variable will be determined before the other variables (as is the case), by this procedure we can compute any joint probability avoiding the Gibbs sampling process.

4 Finding dependencies

In this section we present our algorithm to build a classifier model from data based on dependency networks. Our idea is to use independence tests in order to find dependencies between predictive variables Xi and Xj, but taking into account the class variable, and in this way extend the Naive Bayes structure. It is known that the statistic

2 · Nc · I(Xi; Xj | C = c)

follows a χ² distribution with (|Xi| − 1)·(|Xj| − 1) degrees of freedom under the null hypothesis of independence, where Nc is the number of input instances in which C = c and I(Xi; Xj | C = c) is the mutual information between variables Xi and Xj when the class variable C is equal to c. Using this expression, for any variable all the other variables that are (in)dependent given the class can be found.
But it is not this set that we are looking for; we need a set of variables which makes that variable independent from the others, that is, its Markov blanket. So, we try to discover this set by carrying out an iterative process, for every predictive variable independently, in which in each step the variable not yet chosen which shows most dependency with the variable under study (Xi) is selected as a new parent. This selection is done considering all the parents previously chosen. Thus the statistic actually used is

2 · Nc · I(Xi; Xj | C = c, Pai),

where Pai is the current set of Xi's parents minus C. We assess the goodness of every candidate parent by the difference between the percentile of the χ² distribution (with α = 0.025) and the statistic value. Here the degrees of freedom must be

df = (|Xi| − 1) · (|Xj| − 1) · ∏_{Pa ∈ Pai} |Pa|.

Obviously this process finishes when all the candidate parents not yet selected show independence with Xi according to this test, and in order to make the search more efficient, every candidate parent that appears independent at any step of the algorithm is rejected and is not taken into account in the next iterations. The reason for rejecting these variables is just efficiency, although we know that any of these rejected variables could become dependent in posterior iterations.
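A sketch of this test in Python is given below (our own illustration, not the authors' code; the function names are ours and scipy is used only for the χ² percentile). It computes the class- and parent-conditional mutual-information statistic from counts, using the standard G-statistic form as one plausible reading of the expression above, and returns the percentile-minus-statistic assessment just described: a non-positive value indicates dependence strong enough to keep the candidate.

from collections import Counter
from math import log
from scipy.stats import chi2

def cond_mutual_info_stat(xi, xj, cond):
    # xi, xj: lists of discrete values; cond: list of tuples holding the values
    # of the conditioning variables (here: the class plus the current parents).
    # Sums 2 * N_z * I(Xi; Xj | z) over conditioning configurations z.
    n_z   = Counter(cond)
    n_iz  = Counter(zip(xi, cond))
    n_jz  = Counter(zip(xj, cond))
    n_ijz = Counter(zip(xi, xj, cond))
    stat = 0.0
    for (a, b, z), n in n_ijz.items():
        expected = n_iz[(a, z)] * n_jz[(b, z)] / n_z[z]
        stat += 2.0 * n * log(n / expected)
    return stat

def assess_candidate(xi, xj, cond, df, alpha=0.025):
    # Percentile of the chi-square distribution minus the statistic, as in the text:
    # a non-positive value means Xj is accepted as dependent on Xi.
    # With df == 0 the threshold is 0, so any positive statistic indicates dependence,
    # matching the zero-degrees-of-freedom policy described in the next section.
    threshold = chi2.ppf(1.0 - alpha, df) if df > 0 else 0.0
    return threshold - cond_mutual_info_stat(xi, xj, cond)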
Figure 1 shows the pseudo-code of this algorithm, called ChiSqDN.

Initialize structure to Naive Bayes
For each variable Xi
    Pai = ∅
    Cand = X \ {Xi}   // candidate parents
    While Cand ≠ ∅
        For each variable Xj ∈ Cand
            Let val be the assessment for Xj by the χ² test
            If val ≥ 0 Then   // Xj appears independent of Xi
                Remove Xj from Cand
        Let Xmax be the best variable found
        Pai = Pai ∪ {Xmax}
        Remove Xmax from Cand
    For each variable Pa ∈ Pai
        Make new link Pa → Xi

Figure 1: Pseudo-code for the ChiSqDN algorithm.

It is clear that determining the degrees of freedom is critical in order to perform this test properly. The scheme outlined before is the general way of assessing this parameter; nonetheless, in some cases, especially for deterministic or almost deterministic relationships, the tests carried out using degrees of freedom computed in this way can fail.



Deterministic relationships are pretty common in automatic learning when the input cases are relatively few. Examining the datasets used in the experiments (table 1), it can be seen that some of them have a low number of instances. So, to make the tests in our algorithm properly, we use a method for computing the degrees of freedom and handling this kind of relationships proposed in (Yilmaz et al., 2002). Basically, considering discrete variables and their conditional probability tables (CPTs), this proposal reduces the degrees of freedom based on the level of determinism, and this level is determined by the number of columns or rows which are full of zeros.
For example, if we wanted to compute the degrees of freedom for variables Xi and Xj whose probability distribution is shown in figure 2, we should use (3 − 1)·(3 − 1) = 4 as the value for this parameter, as both variables have 3 states. Nonetheless, it is clear that the relation between these two variables is not probabilistic but deterministic, because it is similar to considering that variable Xi has only one state. Thus, if we use this new way of computing degrees of freedom we have to take into account that this table has two rows and a column full of zeros. Then the degrees of freedom will be (3 − 2 − 1)·(3 − 1 − 1) = 0.

            Xj             MT
       13    16     0      29
Xi      0     0     0       0
        0     0     0       0
MT     13    16     0

Figure 2: Probability table for variables Xi and Xj.

In these cases, i.e. when the degrees of freedom are zero, it does not make sense to perform the test because in this case it always holds. In (Yilmaz et al., 2002), the authors propose skipping the test and marking the relation in order to be handled by human experts. In our case, when this algorithm yields degrees of freedom equal to zero we prefer an automatic approach, that is, to accept dependence between the tested variables if the statistic value is greater than zero.
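The following short Python function is our own sketch of the adjustment just described (it is not code from (Yilmaz et al., 2002)); it computes the reduced degrees of freedom from a contingency table by discounting rows and columns that are entirely zero, so that the example above yields 0.

def adjusted_dof(table):
    # table: list of rows of counts for Xi (rows) versus Xj (columns).
    n_rows, n_cols = len(table), len(table[0])
    zero_rows = sum(1 for row in table if all(v == 0 for v in row))
    zero_cols = sum(1 for j in range(n_cols) if all(row[j] == 0 for row in table))
    # Discount fully-zero rows/columns before the usual (r-1)(c-1) formula.
    return max(0, n_rows - zero_rows - 1) * max(0, n_cols - zero_cols - 1)

example = [[13, 16, 0],
           [ 0,  0, 0],
           [ 0,  0, 0]]
print(adjusted_dof(example))  # -> (3-2-1) * (3-1-1) = 0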

5 Experimental results

In order to evaluate our proposed algorithm and to measure its quality, we have selected a set of datasets from the UCI repository (Newman et al., 1998). These datasets are described in table 1. We have preprocessed these datasets in order to remove missing values and to discretize continuous variables. Missing values for each variable have been replaced by the mean or the mode, depending on whether it is a continuous or a discrete variable respectively. For discretization we have used the algorithm proposed in (Fayyad and Irani, 1993), which is especially suggested for classification.

Datasets     inst.  attrib.  |C|  cont.?  missing?
australian     690       15    2     yes        no
heart          270       14    2     yes        no
hepatitis      155       20    2     yes       yes
iris           150        5    3     yes        no
lung            32       57    3      no       yes
pima           768        9    2     yes        no
post-op         90        9    3     yes       yes
segment       2310       20    7     yes        no
soybean        683       36   19      no       yes
vehicle        846       19    4     yes        no
vote           435       17    2      no       yes

Table 1: Description of the datasets used in the experiments.

We have chosen a pair of algorithms from the state-of-the-art in Bayesian classification which allow relationships between predictive variables. One of these algorithms is the one known as TAN (Tree Augmented Naive Bayes). The other employs a general Bayesian network in order to represent relationships between variables and is based on the BAN model (Bayesian network Augmented Naive Bayes (Friedman et al., 1997)). In our experiments we use the B algorithm (Buntine, 1994) as the learning algorithm, with the BIC (Schwarz, 1978) metric to guide the search. Both algorithms use mutual information as the base statistic in order to determine relationships, directly in the TAN algorithm and in a penalized version in the BIC metric.
The experimentation process consists of running each algorithm for each database in a 5x2 cross validation.

This kind of validation is an extension of the well-known k-fold cross validation in which a 2-fold cv is performed five times and in each repetition a different partition is used. With this process we have the validation power of the 10-fold cv but with less correlation between the individual results (Dietterich, 1998). Once we have performed our experiments in that way, we have the results for each classifier in table 2.

              TAN       BAN+BIC   ChiSqDN
australian    0.838841  0.866957  0.843478
heart         0.819259  0.826667  0.823704
hepatitis     0.852930  0.872311  0.865801
iris          0.948000  0.962667  0.962667
lung          0.525000  0.525000  0.525000
pima          0.772135  0.773177  0.744531
post-op       0.653333  0.673333  0.673333
segment       0.936970  0.909437  0.931169
soybean       0.898661  0.906893  0.903950
vehicle       0.719385  0.697163  0.698345
vote          0.945299  0.944840  0.931045

Table 2: Classification accuracy estimated for the algorithms tested.

5.1 Analysis of the results

The analysis done for the previous results consists in carrying out a statistical test. The test selected is the paired Wilcoxon signed-ranks test, a non-parametric test which examines two samples in order to decide whether the differences shown by them can be assumed to be not significant. In our case, if this test holds, then we can conclude that the classifiers which yield the results analysed have similar classification power.
Thus, this test is done with a level of significance of 0.05 (α = 0.05) and it shows that both reference classifiers do not differ significantly from our classifier learned with the ChiSqDN algorithm. When we compare our algorithm with TAN we obtain a p-value of 0.9188, and for the comparison with the BAN model the p-value is 0.1415. Therefore the ChiSqDN algorithm is as good as the Bayesian classifiers considered, in spite of the approximation introduced in the factorization of the joint probability distribution due to the use of general dependency networks. Nonetheless, also as a consequence of the use of dependency networks, we can take advantage of the ease of learning and the possibility of doing this process in parallel without introducing an extra workload.
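For reference, a minimal Python sketch of the evaluation protocol just described is shown below (our own illustration; learn_classifier and accuracy are hypothetical placeholders for the learning and scoring routines, and scipy's Wilcoxon signed-rank test is used for the paired comparison of two classifiers over their per-dataset accuracies).

import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import StratifiedKFold

def five_by_two_cv(X, y, learn_classifier, accuracy, seed=0):
    # 5x2 cv: five repetitions of 2-fold cross validation, each with a
    # different random partition (Dietterich, 1998). X, y are numpy arrays.
    scores = []
    for rep in range(5):
        folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in folds.split(X, y):
            model = learn_classifier(X[train_idx], y[train_idx])
            scores.append(accuracy(model, X[test_idx], y[test_idx]))
    return np.array(scores)  # 10 accuracy values per dataset

# Paired comparison of two classifiers over the same datasets (values here are
# rounded from Table 2, TAN vs ChiSqDN): a p-value above 0.05 means the
# difference is not considered significant.
acc_tan = np.array([0.839, 0.819, 0.853, 0.948, 0.525, 0.772, 0.653, 0.937, 0.899, 0.719, 0.945])
acc_dn  = np.array([0.843, 0.824, 0.866, 0.963, 0.525, 0.745, 0.673, 0.931, 0.904, 0.698, 0.931])
print(wilcoxon(acc_tan, acc_dn))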
Apart from these characteristics that can be exploited from a dependency network, we can focus on its descriptive capability, i.e. how good the graph of each model is at showing relations over the domain. Taking into account the pima dataset, figures 3 and 4 show the graphs yielded by the algorithms BAN+BIC and ChiSqDN respectively.
Obviously, for every variable in these graphs, those that form its Markov blanket are the most informative or most related. The difference is that for a Bayesian network graph the Markov blanket set is formed by the parents, children and parents of children of every variable, but for a dependency network graph the Markov blanket set is simply all the variables directly connected. Thus in tables 3 and 4 we show the Markov blanket extracted from the corresponding graph for every variable. Evidently the sets do not match, due to the automatic learning process, but they are quite similar. Apart from that, it is clear that discovering these relationships from the dependency network graph is easier, especially for people not used to dealing with Bayesian networks, than doing it from a Bayesian network. We think that this point is an important feature of dependency networks because it can lead us to a tuning process helped by human experts after the automatic learning.

Variable   Markov blanket
mass       skin, class
skin       age, mass, insu, class
insu       skin, plas, class
plas       insu, class
age        pres, preg, skin, class
pres       age, class
preg       age, class
pedi       class

Table 3: Markov blanket for each predictive variable from the BAN+BIC classifier.



Figure 3: Network yielded by the BAN algorithm with BIC metric for the pima dataset. [The graph itself, over the nodes class, plas, pedi, mass, pres, preg, insu, skin and age, is not reproducible in this text version.]

Figure 4: Network yielded by the ChiSqDN algorithm for the pima dataset. [The graph itself, over the same nodes, is not reproducible in this text version.]

Variable   Markov blanket
mass       skin, pres, class
skin       insu, mass, age, class
insu       plas, skin, class
plas       insu, class
age        pres, preg, skin, class
pres       age, mass, preg, class
preg       age, pres, class
pedi       class

Table 4: Markov blanket for each predictive variable from the ChiSqDN classifier.

6 Conclusions and Future Work

We have shown how dependency networks can be used as a model for Bayesian classifiers beyond their initial purpose. We have presented a new learning algorithm which uses independence tests in order to find the structural model for a classifier based on dependency networks. By means of the analysis developed in this paper, we have shown that the proposed classifier model based on dependency networks can be as good as some classifiers based on Bayesian networks. Besides this important feature, the classifiers obtained by our algorithm are visually easier to understand. We mean that the relationships automatically learned by the algorithm are easily understood if the classifier is modelled with a dependency network. This characteristic is especially valuable because it is often difficult for people to discover all the relationships inside a Bayesian network if they do not know the underlying theory; however, these relationships are clearly shown in a dependency network. Besides, we cannot forget the main contributions of dependency networks: ease of learning and the possibility to perform this learning in parallel.

The ChiSqDN algorithm shown here learns the structural information for every variable independently, so we plan to study a parallelized version in the future. Also, we will try to improve this algorithm, in terms of complexity, by using simpler statistics. Our idea is to use an approximation for the mutual information over multiple variables using a decomposition into simpler terms (Roure Alcobé, 2004).

Acknowledgments

This work has been partially supported by the Spanish Ministerio de Ciencia y Tecnología, the Junta de Comunidades de Castilla-La Mancha and the European Social Fund under projects TIN2004-06204-C03-03 and PBI-05-022.

References

Wray L. Buntine. 1994. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159-225.

Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1923.

R. O. Duda and P. E. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley and Sons.

U. Fayyad and K. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1022-1029. Morgan Kaufmann.

Nir Friedman, Dan Geiger, and Moises Goldszmidt. 1997. Bayesian Network Classifiers. Machine Learning, 29(2-3):131-163.

David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite, and Carl Kadie. 2000. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, 1:49-75.

Finn V. Jensen. 2001. Bayesian Networks and Decision Graphs. Springer.

P. Langley, W. Iba, and K. Thompson. 1992. An Analysis of Bayesian Classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, pages 223-228.

Juan L. Mateo. 2006. Dependency Networks: Use in Automatic Classification and Estimation of Distribution Algorithms (in Spanish). University of Castilla-La Mancha.

D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. 1998. UCI Repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html.

Josep Roure Alcobé. 2004. An approximated MDL score for learning Bayesian networks. In Proceedings of the Second European Workshop on Probabilistic Graphical Models (PGM'04).

Gideon Schwarz. 1978. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464.

Yusuf Kenan Yilmaz, Ethem Alpaydin, Levent Akin, and Taner Bilgiç. 2002. Handling of Deterministic Relationships in Constraint-based Causal Discovery. In First European Workshop on Probabilistic Graphical Models (PGM'02), pages 204-211.



Unsupervised naive Bayes for data clustering with mixtures of
truncated exponentials
José A. Gámez
Computing Systems Department
Intelligent Systems and Data Mining Group i3A
University of Castilla-La Mancha
Albacete, 02071, Spain

Rafael Rumí and Antonio Salmerón
Department of Statistics and Applied Mathematics
Data Analysis Group
University of Almería
Almería, 04120, Spain

Abstract
In this paper we propose a naive Bayes model for unsupervised data clustering, where
the class variable is hidden. The feature variables can be discrete or continuous, as the
conditional distributions are represented as mixtures of truncated exponentials (MTEs).
The number of classes is determined using the data augmentation algorithm. The proposed
model is compared with the conditional Gaussian model for some real world and synthetic
databases.

1 Introduction

Unsupervised classification, frequently known as clustering, is a descriptive task with many applications (pattern recognition, ...) that can also be used as a preprocessing task in the so-called knowledge discovery from databases (KDD) process (population segmentation and outlier identification).
Cluster analysis or clustering (Anderberg, 1973; Jain et al., 1999) is understood as a decomposition or partition of a data set into groups in such a way that the objects in one group are similar to each other but as different as possible from the objects in other groups. Thus, the main goal of cluster analysis is to detect whether or not the general population is heterogeneous, that is, whether the data fall into distinct groups.
Different types of clustering algorithms can be found in the literature depending on the type of approach they follow. Probably the three main approaches are: partition-based clustering, hierarchical clustering, and probabilistic model-based clustering. Of these, the first two approaches yield a hard clustering, in the sense that clusters are exclusive, while the third one yields a soft clustering, that is, an object can belong to more than one cluster following a probability distribution.
In this paper we focus on probabilistic clustering, and from this point of view the data clustering problem can be defined as the inference of a probability distribution for the given database. In this work we allow nominal and numeric variables in the data set, and the novelty of the approach lies in the use of MTE (Mixture of Truncated Exponentials) (Moral et al., 2001) densities to model the numeric variables. This model has been shown to be a clear alternative to the use of Gaussian models in different tasks (inference, classification and learning), but to our knowledge its applicability to clustering remains to be studied. This is precisely the goal of this paper: to do an initial study of applying the MTE model to the data clustering problem.
The paper is structured as follows: first, Sections 2 and 3 give, respectively, some preliminaries about probabilistic clustering and mixtures of truncated exponentials. Section 4 describes the method we have developed in order to obtain a clustering from data by using mixtures of truncated exponentials to model the numerical variables. Experiments over several datasets are described in Section 5. Finally, and bearing in mind that this is a first approach, in Section 6 we discuss some improvements and applications of our method, and our conclusions are also presented.

2 Probabilistic model-based clustering

The usual way of modeling data clustering in a probabilistic approach is to add a hidden random variable to the data set, i.e., a variable whose value has been missed in all the records. This hidden variable, normally referred to as the class variable, will reflect the cluster membership for every case in the data set.

[Figure 1: Graphical structure of the model. Y1, ..., Yn represent discrete/nominal variables while Z1, ..., Zm represent numeric variables.]

From the previous setting, probabilistic model-based clustering is modeled as a mixture of models (see e.g. (Duda et al., 2001)), where the states of the hidden class variable correspond to the components of the mixture (the number of clusters), and the multinomial distribution is used to model discrete variables while the Gaussian distribution is used to model numeric variables. In this way we move to a problem of learning from unlabeled data, and usually the EM algorithm (Dempster et al., 1977) is used to carry out the learning task when the graphical structure is fixed, and structural EM (Friedman, 1998) when the graphical structure also has to be discovered (Peña et al., 2000).
In this paper we focus on the simplest model with fixed structure, the so-called Naive Bayes structure (fig. 1), where the class is the only root variable and all the attributes are conditionally independent given the class.
Once we decide that our graphical model is fixed, the clustering problem reduces to taking a dataset of instances and a previously specified number of clusters (k), and working out each cluster's distribution (multinomial or Gaussian) and the population distribution between the clusters. To obtain these parameters the EM (expectation-maximization) algorithm is used. The EM algorithm works iteratively by carrying out the following two steps at each iteration:

Expectation.- Estimate the distribution of the hidden variable C, conditioned on the current setting of the parameter vector θ^(k) (i.e. the parameters needed to specify the distributions of the non-hidden variables).

Maximization.- Taking into account the new distribution of C, use maximum likelihood to obtain a new set of parameters θ^(k+1) from the observed data.

The algorithm starts from a given initial starting point (e.g. random parameters for C or θ) and under fairly general conditions it is guaranteed to converge to (at least) a local maximum of the log-likelihood function.
lem of learning from unlabeled data and usu- Iterative approaches have been described in
ally the EM algorithm (Dempster et al., 1977) the literature (Cheeseman and Stutz, 1996) in
is used to carry out the learning task when the order to discover also the number of clusters
graphical structure is fixed and structural EM (components of the mixture). The idea is to
(Friedman, 1998) when the graphical structure start with k = 2 and use EM to fit a model, then
also has to be discovered (Pena et al., 2000). the algorithm tries with k = 3, 4, ... while the
In this paper we focus on the simplest model log-likelihood improves by adding a new com-
with fixed structure, the so-called Naive Bayes ponent (cluster) to the mixture. Of course some
structure (fig. 1) where the class is the only root criteria has to be used in order to prevent over-
variable and all the attributes are conditionally fitting.
independent given the class. To finish with this section we discuss about
Once we decide that our graphical model is how to evaluate the obtained clustering?. Obvi-
fixed, the clustering problem reduces to take ously, a clustering is useful if it produces some
a dataset of instances and a previously speci- interesting insight in the problem we are analyz-
fied number of clusters (k), and work out each ing. However, in practice and specially in order

to test new algorithms, we have to use a different way of measuring the quality of a clustering. In this paper, and because we produce a probabilistic description of the dataset, we use the log-likelihood (logL) of the dataset (D) given the model to score a given clustering:

   logL = \sum_{x \in D} log p(x | {θ, C}).

3 Mixtures of Truncated Exponentials

The MTE model is an alternative to Gaussian models, and can be seen as a generalization of discretization models. It is defined by its corresponding potential and density as follows (Moral et al., 2001):

Definition 1. (MTE potential) Let X be a mixed n-dimensional random vector. Let Y = (Y1, ..., Yd) and Z = (Z1, ..., Zc) be the discrete and continuous parts of X, respectively, with c + d = n. A function f : Ω_X → R_0^+ is a Mixture of Truncated Exponentials potential (MTE potential) if one of the next conditions holds:

i. Y = ∅ and f can be written as

      f(x) = f(z) = a_0 + \sum_{i=1}^{m} a_i exp{ \sum_{j=1}^{c} b_i^{(j)} z_j }     (1)

   for all z ∈ Ω_Z, where a_i, i = 0, ..., m and b_i^{(j)}, i = 1, ..., m, j = 1, ..., c are real numbers.

ii. Y = ∅ and there is a partition D_1, ..., D_k of Ω_Z into hypercubes such that f is defined as

      f(x) = f(z) = f_i(z)  if z ∈ D_i,

   where each f_i, i = 1, ..., k, can be written in the form of equation (1).

iii. Y ≠ ∅ and for each fixed value y ∈ Ω_Y, f_y(z) = f(y, z) can be defined as in ii.

Definition 2. (MTE density) An MTE potential f is an MTE density if

      \sum_{y \in Ω_Y} \int_{Ω_Z} f(y, z) dz = 1.

A univariate MTE density f(x), for a continuous variable X, can be learnt from a sample as follows (Romero et al., 2006):

1. Select the maximum number, s, of intervals to split the domain.
2. Select the maximum number, t, of exponential terms on each interval.
3. A Gaussian kernel density is fitted to the data.
4. The domain of X is split into k ≤ s pieces, according to changes in concavity/convexity or increase/decrease in the kernel density.
5. In each piece, an MTE potential with m ≤ t exponential terms is fitted to the kernel density by iterative least squares estimation.
6. Update the MTE coefficients to integrate up to one.

When learning an MTE density f(y) from a sample, if Y is discrete, the procedure is:

1. For each state y_i of Y, compute the Maximum Likelihood Estimator of p(y_i), which is #y_i / (sample size).

A conditional MTE density can be specified by dividing the domain of the conditioning variables and specifying an MTE density for the conditioned variable for each configuration of splits of the conditioning variables (Romero et al., 2006).

In this paper, the conditional MTE densities needed are f(x|y), where Y is the class variable, which is discrete, and X is either a discrete or continuous variable, so the potential is composed of an MTE potential defined for X, for each state of the conditioning variable Y.
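As a rough illustration of how such densities can be handled computationally, the following sketch (not part of the original method; the piece boundaries and coefficients are made-up illustrative values) represents a univariate MTE potential as a list of pieces, evaluates it, and rescales its coefficients so that it integrates to one, in the spirit of Definition 2 and step 6 of the learning procedure above.

import math

# Each piece covers [lo, hi) and stores (a0, [(a_i, b_i), ...]) so that on the
# piece f(x) = a0 + sum_i a_i * exp(b_i * x).  Values below are illustrative only.
pieces = [
    (0.0, 2.0, (0.2, [(1.0, -1.0)])),
    (2.0, 3.0, (0.1, [(0.5, -2.0)])),
]

def mte_eval(x, pieces):
    """Evaluate the MTE potential at a point x."""
    for lo, hi, (a0, terms) in pieces:
        if lo <= x < hi:
            return a0 + sum(a * math.exp(b * x) for a, b in terms)
    return 0.0

def mte_integral(pieces):
    """Closed-form integral of the potential over its whole domain."""
    total = 0.0
    for lo, hi, (a0, terms) in pieces:
        total += a0 * (hi - lo)
        for a, b in terms:
            total += a * (math.exp(b * hi) - math.exp(b * lo)) / b
    return total

def normalise(pieces):
    """Rescale all coefficients so the potential becomes a density."""
    z = mte_integral(pieces)
    return [(lo, hi, (a0 / z, [(a / z, b) for a, b in terms]))
            for lo, hi, (a0, terms) in pieces]

density = normalise(pieces)
print(round(mte_integral(density), 6))   # 1.0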

Example 1. The function f defined as

   f(x|y) =
      5 + e^{2x} + e^{x},   y = 0,  0 < x < 2,
      1 + e^{x},            y = 0,  2 ≤ x < 3,
      1/5 + e^{2x},         y = 1,  0 < x < 2,
      4 e^{3x},             y = 1,  2 ≤ x < 3

is an MTE potential for a conditional x|y relation, with X continuous, x ∈ (0, 3]. Observe that this potential is not actually a conditional density. It must be normalised to integrate up to one.

In general, an MTE conditional density f(x|y) with one conditioning discrete variable, with states y1, ..., ys, is learnt from a database as follows:

1. Divide the database into s sets, according to the states of variable Y.
2. For each set, learn a marginal density for X, as shown above, with k and m constant.

4 Probabilistic clustering by using MTEs

In the clustering algorithm we propose here, the conditional distribution for each variable given the class is modeled by an MTE density. In the MTE framework, the domain of the variables is split into pieces and in each resulting interval an MTE potential is fitted to the data. In this work we will use the so-called five-parameter MTE, which means that in each split there are at most five parameters to be estimated from data:

   f(x) = a_0 + a_1 e^{a_2 x} + a_3 e^{a_4 x}.     (2)

The choice of the five-parameter MTE is motivated by its low complexity and high fitting power (Cobb et al., 2006). The maximum number of splits of the domain of each variable has been set to four, again according to the results in the cited reference.

We start the algorithm with two clusters with the same probability, i.e., the hidden class variable has two states and the marginal probability of each state is 1/2. The conditional distribution for each feature variable given each of the two possible values of the class is initially the same, and is computed as the marginal MTE density estimated from the train database according to the estimation algorithm described in Section 3.

The initial model is refined using the data augmentation method (Tanner and Wong, 1987):

1. First of all, we divide the train database into two parts, one of them properly for training and the other one for testing the intermediate models.
2. For each record in the test and train databases, the distribution of the class variable is computed.
3. According to the obtained distribution, a value for the class variable is simulated and inserted in the corresponding cell of the database.
4. When a value for the class variable has been simulated for all the records, the conditional distributions of the feature variables are re-estimated using the method described in Section 3, using the train database as sample.
5. The log-likelihood of the new model is computed from the test database. If it is higher than the log-likelihood of the initial model, the process is repeated. Otherwise, the process is stopped, returning the best model found for two clusters.

In order to avoid long cycles, we have limited the number of iterations in the procedure above to 100. After obtaining the best model for two clusters, the algorithm continues increasing the number of clusters and repeating the procedure above for three clusters, and so on. The

algorithm continues to increase the number of clusters while the log-likelihood of the model, computed from the test database, is increased.

A model for n clusters is converted into a model for n + 1 clusters as follows:

1. Let c_1, ..., c_n be the states of the class variable.
2. Add a new state, c_{n+1}, to the class variable.
3. Update the probability distribution of the class variable by re-computing the probability of c_n and c_{n+1} (the probability of the other values remains unchanged):
   p(c_n) := p(c_n)/2,
   p(c_{n+1}) := p(c_n)/2.
4. For each feature variable X,
   f(x | c_{n+1}) := f(x | c_n).

At this point, we can formulate the clustering algorithm as follows. The algorithm receives as argument a database with variables X_1, ..., X_n and returns a naive Bayes network with variables X_1, ..., X_n, C, which can be used to predict the class of any object characterised by features X_1, ..., X_n.

Algorithm CLUSTER(D)

1. Divide the train database D into two parts: one with 80% of the records in D selected at random, for estimating the parameters (D_p), and another one with the remaining 20% (D_t) for testing the current model.
2. Let net be the initial model obtained from D_p as described above.
3. bestNet := net.
4. L0 := log-likelihood(net, D_t).
5. Refine net by data augmentation from database D_p.
6. L1 := log-likelihood(net, D_t).
7. WHILE (L1 > L0)
   (a) bestNet := net.
   (b) L0 := L1.
   (c) Add a cluster to net.
   (d) Refine net by data augmentation.
   (e) L1 := log-likelihood(net, D_t).
8. RETURN(bestNet)

5 Experimental evaluation

In order to analyse the performance of the proposed algorithm in the data clustering task, we have selected a set of databases(1) (see Table 1) and the following experimental procedure has been carried out:

- For comparison we have used the EM algorithm implemented in the WEKA data mining suite (Witten and Frank, 2005). This is a classical implementation of the EM algorithm in which discrete and numerical variables are allowed as input attributes. Numerical attributes are modeled by means of Gaussian distributions. In this implementation the user can specify, beforehand, the number of clusters, or the EM can decide how many clusters to create by cross validation. In the latter case, the number of clusters is initially set to 1, the EM algorithm is performed by using a 10-fold cross validation and the log-likelihood (averaged over the 10 folds) is computed. Then, the number of clusters is increased by one and the same process is carried out. If the log-likelihood with the new number of clusters has increased with respect to the previous one, the execution continues; otherwise the clustering obtained in the previous iteration is returned.

- We have divided all the data sets into train (80% of the instances) and test (20% of the instances). Then the training set has been used to learn the model, but the test set is used to assess the quality of the final model, i.e., to compute the log-likelihood.

(1) In those datasets usually used for classification, the class variable has been removed.

DataSet #instances #attributes discrete? classes source
returns 113 5 no (Shenoy and Shenoy, 1999)
pima 768 8 no 2 UCI(Newman et al., 1998)
bupa 345 6 no 2 UCI(Newman et al., 1998)
abalone 4177 9 2 UCI(Newman et al., 1998)
waveform 5000 21 no 3 UCI(Newman et al., 1998)
waveform-n 5000 40 no 3 UCI(Newman et al., 1998)
net15 10000 15 2 (Romero et al., 2006)
sheeps 3087 23 5 real farming data (Gamez, 2005)

Table 1: Brief description of the data sets used in the experiments. The column 'discrete?' indicates whether the dataset contains discrete/nominal variables (if so, the number of discrete variables is given). The column 'classes' shows the number of class labels when the dataset comes from a classification task.

Two independent runs have been carried out for each algorithm and data set:

- The algorithm proposed in this paper is executed over the given training set. Let #mte clusters be the number of clusters it has decided. Then, the EM algorithm is executed receiving as inputs the same training set and a number of clusters equal to #mte clusters.

- The WEKA EM algorithm is executed over the given training set. Let #em clusters be the number of clusters it has decided. Then, our algorithm is executed receiving as inputs the same training set and a number of clusters equal to #em clusters.

In this way we try to compare the performance of both algorithms over the same number of clusters.

Table 2 shows the results obtained. Out of the 8 databases used in the experiments, 7 refer to real world problems, while the other one (Net15) is a randomly generated network with joint MTE distribution (Romero et al., 2006). A rough analysis of the results suggests a better performance of the EM clustering algorithm: in 5 out of the 8 problems, it obtains a model with higher log-likelihood than the MTE-based algorithm. However, it must be pointed out that the number of clusters contained in the best models generated by the MTE-based algorithm is significantly lower than the number of clusters produced by the EM. In some cases, a model with many clusters, all of them with low probability, is not as useful as a model with fewer clusters but with higher probability, even if the latter has lower likelihood.

Another fact that must be pointed out is that, in the four problems where the exact number of classes is known (pima, bupa, waveform and waveform-n), the number of clusters returned by the MTE-based algorithm is more accurate. Besides, forcing the number of clusters in the EM algorithm to be equal to the number of clusters generated by the MTE method, the differences turn in favor of the MTE-based algorithm.

After this analysis, we consider the MTE-based clustering algorithm a promising method. It must be taken into account that the EM algorithm implemented in Weka is very optimised and sophisticated, while the MTE-based method presented here is a first version in which many aspects are still to be improved. Even in these conditions, our method is competitive in some problems. Models which are away from normality (as Net15) are more appropriate for applying the MTE-based algorithm rather than EM.

6 Discussion and concluding remarks

In this paper we have introduced a novel unsupervised clustering algorithm which is based on the MTE model. The model has been tested

DataSet #mte clusters MTE EM #em clusters MTE EM
returns 2 212 129
Pima 3 -4298 -4431 7 -4299 -3628
bupa 5 -1488 -1501 3 -1491 -1489
abalone 11 6662 7832 15 6817 8090
waveform 3 -38118 -33084 11 -38153 -31892
Waveform-n 3 -65200 -59990 11 -65227 -58829
Net15 4 -3418 -6014 23 -3723 -5087
Sheeps 3 -34214 -35102 7 -34215 -34145

Table 2: Results obtained. Columns 2-4 give the number of clusters discovered by the proposed MTE-based method, its result, and the result obtained by EM when using that number of clusters. Columns 5-7 give the number of clusters discovered by the EM algorithm, its result, and the result obtained by the MTE-based algorithm when using that number of clusters.

over 8 databases, 7 of them corresponding to real-world problems and some of them commonly used for benchmarks. We consider that the results obtained, compared with the EM algorithm, are at least promising. The algorithm we have used in this paper can still be improved. For instance, during the model construction we have used a train-test approach for estimating the parameters and validating the intermediate models obtained. However, the EM algorithm uses cross validation. We plan to include 10-fold cross validation in a forthcoming version of the algorithm.

Another aspect that can be improved is the way in which the model is updated when the number of clusters is increased. We have followed a very simplistic approach: for the feature variables, we duplicate the same density function conditional on the previous last cluster, while for the class variable, we divide the probability of the former last cluster with the new one. We think that a more sophisticated approach, based on component splitting and parameter reusing similar to the techniques proposed in (Karciauskas, 2005), could improve the performance of our algorithm.

The algorithm proposed in this paper can be applied not only to clustering problems, but also to construct an algorithm for approximate probability propagation in hybrid Bayesian networks. This idea was firstly proposed in (Lowd and Domingos, 2005), and is based on the simplicity of probability propagation in naive Bayes structures. A naive Bayes model with discrete class variable actually represents the conditional distribution of each variable given the rest of the variables as a mixture of marginal distributions, the number of components in the mixture being equal to the number of states of the class variable. So, instead of constructing a general Bayesian network from a database, if the goal is probability propagation, this approach can be competitive. The idea is specially interesting for the MTE model, since probability propagation has a high complexity for this model.

Acknowledgements

This work has been supported by Spanish MEC under project TIN2004-06204-C03-{01,03}.

References

M.R. Anderberg. 1973. Cluster Analysis for Applications. Academic Press.

P. Cheeseman and J. Stutz. 1996. Bayesian classification (AUTOCLASS): Theory and results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI Press/MIT Press.

B.R. Cobb, P.P. Shenoy, and R. Rumí. 2006. Approximating probability density functions with mixtures of truncated exponentials. Statistics and Computing, in press.

A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1(39):1-38.

R.O. Duda, P.E. Hart, and D.J. Stork. 2001. Pattern Recognition. Wiley.

N. Friedman. 1998. The Bayesian structural EM algorithm. In Gregory F. Cooper and Serafín Moral, editors, Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 129-138. Morgan Kaufmann.

J.A. Gámez. 2005. Mining the ESROM: A study of breeding value prediction in manchego sheep by means of classification techniques plus attribute selection and construction. Technical Report DIAB-05-01-3, Computer Systems Department, University of Castilla-La Mancha.

A.K. Jain, M.M. Murty, and P.J. Flynn. 1999. Data clustering: a review. ACM Computing Surveys, 31(3):264-323.

G. Karciauskas. 2005. Learning with hidden variables: A parameter reusing approach for tree-structured Bayesian networks. Ph.D. thesis, Department of Computer Science, Aalborg University.

D. Lowd and P. Domingos. 2005. Naive Bayes models for probability estimation. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 529-536. ACM Press.

S. Moral, R. Rumí, and A. Salmerón. 2001. Mixtures of truncated exponentials in hybrid Bayesian networks. In Lecture Notes in Artificial Intelligence, volume 2143, pages 135-143.

D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz. 1998. UCI Repository of machine learning databases. www.ics.uci.edu/mlearn/MLRepository.html.

J.M. Peña, J.A. Lozano, and P. Larrañaga. 2000. An improved Bayesian structural EM algorithm for learning Bayesian networks for clustering. Pattern Recognition Letters, 21:779-786.

V. Romero, R. Rumí, and A. Salmerón. 2006. Learning hybrid Bayesian networks using mixtures of truncated exponentials. International Journal of Approximate Reasoning, 42:54-68.

C. Shenoy and P.P. Shenoy. 1999. Bayesian network models of portfolio risk and return. In Y.S. Abu-Mostafa, B. LeBaron, A.W. Lo, and A.S. Weigend, editors, Computational Finance, pages 85-104. MIT Press, Cambridge, MA.

M.A. Tanner and W.H. Wong. 1987. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528-550.

I.H. Witten and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Selecting Strategies for Infinite-Horizon Dynamic LIMIDs
Marcel A.J. van Gerven
Institute for Computing and Information Sciences
Radboud University Nijmegen
Toernooiveld 1
6525 ED Nijmegen, The Netherlands

Francisco J. Díez
Department of Artificial Intelligence, UNED
Juan del Rosal 16
28040 Madrid, Spain

Abstract
In previous work we have introduced dynamic limited-memory influence diagrams (DLIMIDs) as an extension of LIMIDs aimed at representing infinite-horizon decision processes. If a DLIMID respects the first-order Markov assumption then it can be represented by a 2TLIMID. Given that the treatment selection algorithm for LIMIDs, called single policy updating (SPU), can be infeasible even for small finite-horizon models, we propose two alternative algorithms for treatment selection with 2TLIMIDs. First, single rule updating (SRU) is a hill-climbing method inspired by SPU which need not iterate exhaustively over all possible policies at each decision node. Second, a simulated annealing algorithm can be used to avoid the local-maximum policies found by SPU and SRU.

1 Introduction strategy. Often however, there is no predefined


time at which the process stops (i.e. we have
An important goal in artificial intelligence is to an infinite-horizon decision process) and in that
create systems that make optimal decisions in case the LIMID would also become infinite in
situations characterized by uncertainty. One size. In previous work, we have introduced dy-
can think for instance of a robot that navigates namic LIMIDs (DLIMIDs) and two-stage tem-
based on its sensor readings in order to achieve poral LIMIDs (2TLIMIDs) as an extension of
goal states, or of a medical decision-support sys- standard LIMIDs that allow for the representa-
tem that chooses treatments based on patient tion of infinite-horizon decision processes (van
status in order to maximize life-expectancy. Gerven et al., 2006). However, the problem of
Limited-memory influence diagrams (LIM- finding acceptable strategies for DLIMIDs has
IDs) are a formalism for decision-making un- not yet been addressed. In this paper we discuss
der uncertainty (Lauritzen and Nilsson, 2001). a number of techniques to approximate the opti-
They generalize standard influence diagrams mal strategy for infinite-horizon dynamic LIM-
(Howard and Matheson, 1984) by relaxing the IDs. We demonstrate the performance of these
assumption that the whole observed history is algorithms on a non-trivial decision problem.
taken into account when making a decision, and
by dropping the requirement that a complete or- 2 Preliminaries
der is defined over decisions. This increases the
2.1 Bayesian Networks
size and variety of decision problems that can
be handled, although possibly at the expense Bayesian networks (Pearl, 1988) provide for a
of finding only approximations to the optimal compact factorization of a joint probability dis-
tribution over a set of random variables by ex- of being in a certain state as defined by config-
ploiting the notion of conditional independence. urations of chance and decision variables. G is
One way to represent conditional independence an acyclic directed graph (ADG) with vertices
is by means of an acyclic directed graph (ADG) V (G) corresponding to CDU, where we use
G where vertices V (G) correspond to random V to denote C D. Again, due to this corre-
variables X and the absence of arcs from the spondence, we will use nodes in V (G) and cor-
set of arcs A(G) represents conditional indepen- responding elements in C D U interchange-
dence. Due to this one-to-one correspondence ably. The meaning of an arc (X, Y ) A(G) is
we will use vertices v V (G) and random vari- determined by the type of Y . If Y C then
ables X X interchangeably. A Bayesian net- the conditional probability distribution associ-
work (BN) is defined as B = (X, G, P ), such ated with Y is conditioned by X. If Y D then
that the joint probability distribution P over X X represents information that is present to the
factorizes according to G: decision maker prior deciding upon Y . We call
Y (Y ) the informational predecessors of Y . The
P (X) = P (X | G (X)) order in which decisions are made in a LIMID
XX
should be compatible with the partial order in-
where G (X) = {X | (X , X) A(G)} denotes duced by the ADG and are based only on the
the parents of X. We drop the subscript G when parents (D) of a decision D. If Y U then
clear from context. We assume that (random) X takes part in the specification of the utility
variables X can take values x from a set X and function Y such that Y : (Y ) R. In this pa-
use x to denote an element in X = XX X per, it is assumed that utility nodes cannot have
for a set X of (random) variables. children and the joint utility functionPU is ad-
ditively decomposable such that U = U U U .
2.2 LIMIDs P specifies for each d D a distribution
Although Bayesian networks are a natural Y
P (C : d) = P (C | (C))
framework for probabilistic knowledge represen-
CC
tation and reasoning under uncertainty, they
that represents the distribution over C when
are not suited for decision-making under un-
we externally set D = d (Cowell et al., 1999).
certainty. Influence-diagrams (Howard and
Hence, C is not conditioned on D, but rather
Matheson, 1984) are graphical models that ex-
parameterized by D, and if D is unbound then
tend Bayesian networks to incorporate decision-
we write P (C : D).
making but are restricted to the representation
A stochastic policy for decisions D D is de-
of small decision problems. Limited-memory
fined as a distribution PD (D | (D)) that maps
influence diagrams (LIMIDs) (Lauritzen and
configurations of (D) to a distribution over al-
Nilsson, 2001) generalize standard influence-
ternatives for D. If PD is degenerate then we
diagrams by relaxing the no-forgetting assump-
say that the policy is deterministic. A strategy
tion, which states that, given a total ordering of
is a set of policies = {PD : D D} which
decisions, information present when making de-
induces the following joint distribution over the
cision D is also taken into account when making
variables in V:
decision D , if D precedes D in the ordering. A Y
LIMID is defined as a tuple L = (C, D, U, G, P ) P (V) = P (C : D) PD (D | (D)).
consisting of the following components. C rep- DD
resents a set of random variables, called chance Using this distribution we can compute the ex-
variables, D represents a set of decisions avail- pected
P utility of a strategy as E (U ) =
able to the decision maker, where a decision P
v (v)U (v). The aim of any rational deci-
D D is defined as a variable that can take on sion maker is then to maximize the expected
a value from a set of choices D , and U is a set utility by finding the optimal strategy
of utility functions, which represent the utility arg max E (U ).

[Figure 1: Chance nodes are shown as circles, decision nodes as squares and utility nodes as diamonds. The 2TLIMID (left, with prior model L0 and transition model Lt) can be unrolled into a DLIMID (right), where large brackets denote the boundary between subsequent times. Note that due to the definition of a 2TLIMID, the informational predecessors of a decision can only occur in the same, or the preceding, time-slice.]
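The unrolling idea in the caption can be sketched in a few lines of code. The structures below are hypothetical simplifications (ours, not the authors' implementation): a model is just a mapping from node names to parent lists, with a '-1' suffix marking a parent in the previous slice.

def unroll(prior, transition, horizon):
    """Unroll a two-slice specification into a graph over `horizon`+1 slices.

    prior      : dict, node -> list of parent nodes (slice 0)
    transition : dict, node -> parents, where a parent ending in '-1'
                 refers to the same node in the previous slice
    """
    graph = {f"{v}_0": [f"{p}_0" for p in ps] for v, ps in prior.items()}
    for t in range(1, horizon + 1):
        for v, ps in transition.items():
            parents = []
            for p in ps:
                if p.endswith("-1"):                  # inter-slice arc
                    parents.append(f"{p[:-2]}_{t-1}")
                else:                                 # intra-slice arc
                    parents.append(f"{p}_{t}")
            graph[f"{v}_{t}"] = parents
    return graph

# Toy two-slice model loosely inspired by Figure 1: C2 depends on C1 and D1,
# and C1 at time t depends on C1 at time t-1.
prior = {"C1": [], "D1": ["C1"], "C2": ["C1", "D1"]}
transition = {"C1": ["C1-1"], "D1": ["C1"], "C2": ["C1", "D1"]}
print(unroll(prior, transition, horizon=2))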

2.3 Dynamic LIMIDs and 2TLIMIDs for all V t1 Vt1:t in the transition model
it holds that Gt (V t1 ) = . The transition
Although LIMIDs can often represent finite- model is not yet bound to any specific t, but
horizon decision processes more compactly than if bound to some t 1 : N , then it is used
standard influence diagrams, they cannot rep- to represent P (Ct : Dt1:t ) and utility func-
resent infinite-horizon decision processes. Re- tions U Ut , where both G and P do not
cently, we introduced dynamic LIMIDs (DLIM- depend on t. The prior model is used to rep-
IDs) as an explicit representation of decision resent the initial distribution P 0 (C0 : D0 ) and
processes (van Gerven et al., 2006). To repre- utility functions U U0 . The interface of the
sent time, we use T N to denote a set of time transition model is the set It Vt1 such that
points, which we normally assume to be an in- (Vit1 , Vjt ) A(G) Vit1 It . Given a hori-
terval {u | t u t , {t, u, t } N}, also writ- zon N , we may unroll a 2TLIMID for n time-
ten as t : t . We assume that chance variables, slices in order to obtain a DLIMID with the
decision variables and utility functions are in- following joint distribution:
dexed by a superscript t T, and use CT , DT
Y
N
and UT to denote all chance variables, decision P (C, D) = P 0 (C0 : D0 ) P (Ct : Dt1:t ).
variables and utility functions at times t T, t=1
where we abbreviate CT DT with VT .
Let t = {PDt (D | G (D)) | D Dt } denote
A DLIMID is simply defined as a LIMID the strategy for a time-slice t and let the whole
(CT , DT , UT , G, P ) such that for all pairs of strategy be = 0 N . Given 0 , L0
variables X t , Y u VT UT it holds that if defines a distribution over the variables in V0 :
t < u then Y u cannot precede X t in the or- Y
P0 (V0 ) = P 0 (C0 : D0 ) PD (D | G0 (D))
dering induced by G. If T = 0 : N , where
DD0
N N is the (possibly infinite) horizon, then
we suppress T altogether, and we suppress in- and given a strategy t with t > 0, Lt defines
dices for individual chance variables, decision the following distribution over variables in Vt :
Y
variables and utility functions when clear from Pt (Vt | It ) = P (Ct : Du ) PD (D | G (D))
context. If a DLIMID respects the Markov DDt
assumption that the future is independent of
with u = t 1 : t. Combining both equations,
the past, given the present, then it can be
given a horizon N and strategy , a 2TLIMID
compactly represented by a 2TLIMID (see Fig.
induces a distribution over variables in V:
1), which is a pair T = (L0 , Lt ) with prior
model L0 = (C0 , D0 , U0 , G0 , P 0 ) and transition YN
P (V) = P0 (V0 ) Pt (Vt | It ). (1)
model Lt = (Ct1:t , Dt1:t , Ut , G, P ) such that t=1

Let U^0(V^0) = \sum_{U ∈ U^0} ψ_U(π_{G^0}(U)) denote the joint utility for time-slice 0 and let U^t(V^{t-1:t}) = \sum_{U ∈ U^t} ψ_U(π_G(U)) denote the joint utility for time-slice t > 0. We redefine the joint utility function for a dynamic LIMID as

   U(V) = U^0(V^0) + \sum_{t=1}^{N} γ^t U^t(V^{t-1:t})

where γ with 0 ≤ γ < 1 is a discount factor, representing the notion that early rewards are worth more than rewards earned at a later time. In this way, we can use DLIMIDs to represent infinite-horizon Markov decision processes.

[Figure 2: A DLIMID for patient treatment as specified by a 2TLIMID, with decision nodes L (laboratory test) and T (treatment), chance nodes D (disease), F (finding), M (memory) and S (survival), and utility nodes U1, U2, U3 in each time-slice.]
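Because γ < 1, the discounted sum above can be approximated arbitrarily well by truncating it after finitely many time-slices. The short sketch below (ours, with a made-up per-slice utility) picks the smallest horizon k with γ^k below a tolerance δ and compares the truncated sum with a much longer one.

import math

gamma, delta = 0.95, 0.01
k = math.ceil(math.log(delta) / math.log(gamma))   # smallest k with gamma^k < delta
print(k)                                           # 90 for these values

def discounted_sum(per_slice_utility, gamma, horizon):
    """U^0 + sum_{t=1..horizon} gamma^t * U^t, for a given per-slice utility."""
    return sum((gamma ** t) * per_slice_utility(t) for t in range(horizon + 1))

u = lambda t: 1.0                                  # illustrative constant reward per slice
print(round(discounted_sum(u, gamma, k), 3))
print(round(discounted_sum(u, gamma, 10 * k), 3))  # close: the tail is below delta/(1-gamma)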
2.4 Memory variables
Figure 1 makes clear that the informational pre-
survival S. Patient survival has an associated
decessors of a decision variable D t can only oc-
utility U1 . An initial strategy, as for instance
cur in time-slices t or t1 (viz. Eq. 1). Observa-
suggested by a physician, might be to always
tions made earlier in time will not be taken into
treat and never test.
account and as a result, states that are qualita-
There are various ways to define M and
tively different can appear the same to the deci-
the distributions P (M 0 | V 0 ) and P (M t |
sion maker, which leads to suboptimal policies.
V t , M t1 ) for a memory variable M . The opti-
This phenomenon is known as perceptual alias-
mal way to define our memory variables is prob-
ing (Whitehead and Ballard, 1991). We try to
lem dependent, and we assume that this defini-
avoid this problem by introducing memory vari-
tion is based on the available domain knowl-
ables that represent a summary of the observed
edge. For our running example, we choose
history. With each observable variable V V,
M = {a, n, p} {a, n, p} {a, n, p}, where
we associate a memory variable M C, such
a stands for the absence of a finding, n for a
that the parents of a memory variable are given
negative finding, and p for a positive finding,
by (M 0 ) = {V 0 } and (M t ) = {V t , M t1 }
which is evidence for the presence of the dis-
for t {1, . . . , N } and all children of M t , with
ease. M t = (z, y, x) then denotes the current
t {0, . . . , N }, are decision variables D Dt .
finding x, the finding in the previous time-slice
Figure 2 visualizes the concept of a memory
y, and the finding two time-slices ago z. If the
variable, and is used as the running example
new finding is F t+1 = f then M t+1 = (y, x, f ),
for this paper. It depicts a 2TLIMID for the
and since we have not yet observed any findings
treatment of patients that may or may not have
at t = 0, the initial memory is (a, a, a) if we did
a disease D. The disease can be identified by a
not test and (a, a, n) or (a, a, p) if the test was
finding F , which is the result of a laboratory test
performed.
L, having an associated cost that is captured by
the utility function U2 . The memory concern- 3 Improving Strategies in
ing findings is represented by the memory vari- Infinite-Horizon DLIMIDs
able M , and based on this memory, we decide
whether or not to perform treatment T , which Although DLIMIDs constructed from 2TLIM-
has an associated cost, captured by the utility IDs can represent infinite-horizon Markov deci-
function U3 . Memory concerning past findings sion processes, to date, the problem of strategy
will be used to decide whether or not to perform improvement using 2TLIMIDs has not been ad-
the laboratory test. If the patient has the dis- dressed. In this section, we explore techniques
ease then this decreases the chances of patient for finding strategies with high expected utility.

3.1 Single Policy Updating

One way to improve strategies in standard LIMIDs is to use an iterative procedure called single policy updating, or SPU for short (Lauritzen and Nilsson, 2001). Let σ_0 = {p_1, ..., p_n} be an ordered set representing the initial strategy, where p_j, 1 ≤ j ≤ n, stands for a (randomly initialized) policy P_{D_j} for a decision D_j. We say p_j is the local maximum policy for a strategy σ at decision D_j if E_σ(U) cannot be improved by changing p_j. Single policy updating proceeds by iterating over all decision variables (called a cycle) to find local maximum policies, and reiterates until no further improvement in expected utility can be achieved. SPU converges in a finite number of cycles to a local maximum strategy in which each p_j is a local maximum policy. Note that this local maximum strategy is not necessarily the global maximum strategy σ*.

To find local maximum policies in standard LIMIDs, Lauritzen and Nilsson (2001) use a message passing algorithm, optimized for single policy updating. In this paper, we resort to standard inference algorithms for finding strategies for (infinite-horizon) DLIMIDs. We make use of the fact that given σ, a LIMID L = (C, D, U, G, P) may be converted into a Bayesian network B = (X, G', P'). Since σ induces a distribution over variables in V (viz. Eq. 1), we can use σ to convert decision variables D ∈ D to random variables X ∈ X with parents π_G(D) such that P(X | π_G(X)) = P_D(D | π_G(D)). Additionally, utility functions U ∈ U may be converted into random variables X by means of Cooper's transformation (Cooper, 1988), which allows us to compute E_σ(U). We use B(L, σ) to denote this conversion of a LIMID into a Bayesian network.

Single policy updating cannot be applied directly to an infinite-horizon DLIMID since computing E_σ(U) would need an infinite number of steps. In order to approximate the expected utility given σ, we assume that the DLIMID can be represented as a 2TLIMID T = (L0, Lt) and σ can be expressed as a pair (σ^0, σ^t), such that σ^0 is the strategy at t = 0 and σ^t is a stationary strategy at t ∈ 1 : ∞. Note that the optimal strategy is deterministic and stationary for infinite-horizon and fully observable Markov decision processes (Ross, 1983). In the partially observable case, we can only expect to find approximations to the optimal strategy by using memory variables that represent part of the observational history (Meuleau et al., 1999).

We proceed by converting (L0, Lt) into (B0, Bt) with B0 = B(L0, σ^0) and Bt = B(Lt, σ^t), where (B0, Bt) is known as a two-stage temporal Bayes net (Dean and Kanazawa, 1989). We use inference algorithms that operate on (B0, Bt) in order to compute an approximation of the expected utility. In our work, we have used the interface algorithm (Murphy, 2002), for which it holds that the space and time taken to compute each P(X^t | X^{t-1}) does not depend on the number of time-slices. The approximation of E_σ(U) is made using a finite number of time-slices k, where k is such that γ^k < δ with δ > 0. The discount factor ensures that lim_{t→∞} E_σ(U^t) = 0. Let σ_0 = σ^0 ∪ σ^t be the initial strategy with σ^0 = {p_1, ..., p_m} and σ^t = {p_{m+1}, ..., p_n}, where m is the number of decision variables in L0 and n − m is the number of decision variables in Lt. Following (Lauritzen and Nilsson, 2001), we define σ(p'_j) as the strategy obtained by replacing p_j with p'_j in σ. SPU based on a 2TLIMID T with initial strategy σ_0 is then defined by the algorithm in Fig. 3.

SPU(T, σ_0, ε):
  σ = σ_0, euMax = E_{σ_0}(U)
  repeat
    euMaxOld = euMax
    for j = 1 to n do
      for all policies p'_j for σ at D_j do
        σ' = σ(p'_j)
        if E_{σ'}(U) > euMax + ε then
          σ = σ' and euMax = E_{σ'}(U)
        end if
      end for
    end for
  until euMax = euMaxOld
  return σ

Figure 3: Single policy updating for 2TLIMIDs.
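The SPU loop of Fig. 3 translates almost directly into code. In the sketch below (ours, not the authors' implementation), a strategy is a list of policies, candidate_policies(j) enumerates the alternative policies for decision j, and expected_utility(strategy) stands for the truncated-horizon inference step described above; all three names are hypothetical placeholders.

def spu(initial_strategy, candidate_policies, expected_utility, eps=0.01):
    """Single policy updating: cycle over decisions, keep the best policy each time."""
    strategy = list(initial_strategy)
    eu_max = expected_utility(strategy)
    while True:
        eu_max_old = eu_max
        for j in range(len(strategy)):
            for policy in candidate_policies(j):
                proposal = strategy[:j] + [policy] + strategy[j + 1:]
                eu = expected_utility(proposal)
                if eu > eu_max + eps:          # accept only clear improvements
                    strategy, eu_max = proposal, eu
        if eu_max == eu_max_old:               # a full cycle brought no gain
            return strategy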

3.2 Single Rule Updating

An obstacle for the use of SPU for strategy improvement is the fact that if the state-space Ω_{π(D)} for the informational predecessors π(D) of a decision variable D becomes large, then it becomes impossible in practice to iterate over all possible policies for D. The number of policies that needs to be evaluated at each decision variable D grows as k^(m^r), with k = |Ω_D|, assuming that |Ω_{V_j}| = m for all V_j ∈ π(D), and r is the number of informational predecessors of D. For example, looking back for two time-slices within our model leads to 2 policies for L0 and 2^27 policies for T0, L1 and T1, which is computationally infeasible, even for this small example.

For this reason, we introduce a hill-climbing search called single rule updating (SRU) that is inspired by single policy updating. A deterministic policy can be viewed as a mapping p_j : Ω_{π(D_j^t)} → Ω_{D_j^t}, describing for each configuration x ∈ Ω_{π(D_j^t)} an action a ∈ Ω_{D_j^t}. We call (x, a) ∈ p_j a decision rule. Instead of exhaustively searching over all possible policies for each decision variable, we try to increase the expected utility by local changes to the decision rules within the policy. I.e., at each step we change one decision rule within the policy, accepting the change when the expected utility increases. We use (x, a') → p_j to denote the replacement of (x, a) by (x, a') in p_j. Similar to SPU, we keep iterating until there is no further increase in the expected utility (Fig. 4).

SRU decreases the number of policies that need to be evaluated in each local cycle for a decision variable to k·m^r, where notation is as before. For our example, we only need to evaluate 2 policies for L0 and 54 policies for T0, L1 and T1 in each local cycle, albeit at the expense of replacing the exhaustive search by a hill-climbing strategy, increasing the risk of ending up in local maxima, and having to run local cycles until convergence.

SRU(T, σ_0, ε):
  σ = σ_0, euMax = E_{σ_0}(U)
  repeat
    euMaxOld = euMax
    for j = 1 to n do
      repeat
        euMaxLocal = euMax
        for all configurations x ∈ Ω_{π(D_j)} do
          for all actions a' ∈ Ω_{D_j} do
            p'_j = (x, a') → p_j
            σ' = σ(p'_j)
            if E_{σ'}(U) > euMax + ε then
              σ = σ' and euMax = E_{σ'}(U)
            end if
          end for
        end for
      until euMax = euMaxLocal
    end for
  until euMax = euMaxOld
  return σ

Figure 4: Single rule updating for 2TLIMIDs.

3.3 Simulated Annealing

SPU and SRU both find local maximum strategies, which may not be the optimal strategy σ*. To see this, consider the proposed strategy for our running example (Fig. 2) to never test and always treat. Suppose this is our initial strategy σ_0 for either the SPU or SRU algorithm. Trying to improve the policy for the laboratory test L, we find that performing the test will only decrease the expected utility, since the test has no informational value (we always treat) but does have an associated cost. Conversely, trying to improve the policy for treatment, we find that the test is never performed and therefore it is more safe to always treat. Hence, SPU and SRU will stop after one cycle, returning the proposed strategy as the local optimal strategy.

In order to improve upon the strategies found by SRU, we resort to simulated annealing (SA), which is a heuristic search method that tries to avoid getting trapped into the local maximum solutions that are found by hill-climbing techniques such as SRU (Kirkpatrick et al., 1983). SA chooses candidate solutions by looking at neighbors of the current solution as defined by a neighborhood function. Local maxima are avoided by sometimes accepting worse solutions according to an acceptance function.
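A minimal sketch (ours) of such an acceptance rule: a better strategy is always accepted, while a worse one is accepted with a probability that decays both with the utility loss and with the annealing temperature. The exponential form and the geometric cooling schedule below match the acceptance function and schedule described in the next section; the particular proposal being tested is made up.

import math
import random

def accept(eu_current, eu_proposed, temperature):
    """Always accept improvements; sometimes accept worse strategies."""
    if eu_proposed > eu_current:
        return True
    return random.random() < math.exp((eu_proposed - eu_current) / temperature)

# Geometric cooling: temperature(t+1) = lam * temperature(t).
temperature, lam, t_min = 0.5, 0.995, 3.33e-3
accepted = 0
while temperature >= t_min:
    if accept(10.0, 9.9, temperature):     # a slightly worse proposal
        accepted += 1
    temperature *= lam
print(accepted)   # worse moves are accepted often early on, rarely near the end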

SA(T, σ_0, ε, τ_0, τ):
  σ = σ_0, t = 0, eu = E_σ(U)
  repeat
    select a random decision variable D_j
    select a random decision rule (x, a) ∈ p_j
    select a random action a' ∈ Ω_{D_j}, a' ≠ a
    p'_j = (x, a') → p_j
    σ' = σ(p'_j)
    eu' = E_{σ'}(U)
    if P(a(σ') = yes | eu + ε, eu', t) then
      σ = σ' and eu = eu'
    end if
    t = t + 1
  until τ(t) < τ_0
  return SRU(T, σ, ε)

Figure 5: Simulated annealing for 2TLIMIDs.

In this paper, we have chosen the acceptance function P(a(σ') = yes | eu, eu', t) equal to 1 if eu' > eu and equal to exp(−(eu − eu')/τ(t)) otherwise, where a(σ') stands for the acceptance of the proposed strategy σ', eu' = E_{σ'}(U), eu = E_σ(U) for the current strategy σ, and τ is an annealing schedule that is defined as τ(t + 1) = λ·τ(t), where τ(0) = β with λ < 1 and β > 0. With respect to strategy finding in DLIMIDs, we propose the simulated annealing scheme shown in Fig. 5, where ε is repeatedly chosen uniformly at random between 0 and 1. Note that after the annealing phase we apply SRU in order to greedily find a local maximum solution.

4 Experimental Results

We have compared the solutions found by SRU and SA on our running example based on twenty randomly chosen initial strategies. Note that SPU was not feasible due to the large number of policies per decision variable. We have chosen a discounting factor γ = 0.95 and a stopping criterion ε = 0.01. After some initial experiments we have chosen λ = 0.995, β = 0.5 and τ_0 = 3.33 · 10^{-3} for the SA parameters. In order to reduce computational load, we assume that the parameters for P(T^0 | M^0) and P(T^t | M^t) are tied, such that we need only estimate three different policies: for P(L^0), P(L^t | M^{t-1}) and P(T^0 | M^0) = P(T^t | M^t).

For the twenty experiments, SRU found strategies with an average expected utility of E_σ(U) = 11.23 with σ = 0.43. This required the evaluation of 720 different strategies on average. SA found strategies with an average expected utility of E_σ(U) = 11.51 with σ = 0.14. This required the evaluation of 1546 different strategies on average. In 16 out of 20 experiments, SA found strategies with higher expected utility than SRU. Furthermore, due to the random behavior of the algorithm, it avoids the local maximum policies found by SRU. For instance, SRU finds a strategy with expected utility 10.29 three out of twenty times, which is equal to the expected utility of the proposed strategy to always treat and never test. The best strategy was found by simulated annealing and has an expected utility of 11.67. The subsequent values of E_σ(U), found during that experiment, are shown in Fig. 6.

[Figure 6: Change in E_σ(U) for one run of SA, plotted over 1000 cycles. The sudden increase at the end of the run is caused by the subsequent application of SRU.]

The found strategy can be represented by a policy graph (Meuleau et al., 1999): a finite state controller that depicts state transitions, where states represent actions and transitions are induced by observations. Figure 7 depicts the policy graph for the best found strategy. Starting at the left arrow, each time slice constitutes a test decision (circle) and a treatment decision (double circle). Shaded circles stand for positive decisions (i.e., L^t = yes and T^t = yes) and clear circles stand for negative decisions (i.e., L^t = no and T^t = no). If a test has two outgo-

ing arcs, then these stand for a negative finding (dashed arc) and a positive finding (solid arc), respectively. Most of the time, the policy graph associates a negative test result with no treatment and a positive test result with treatment.

[Figure 7: Policy graph for the best strategy found by simulated annealing.]

5 Conclusion

In this paper we have demonstrated that reasonable strategies can be found for infinite-horizon DLIMIDs by means of SRU. Although computationally more expensive, SA considerably improves the found strategies by avoiding local maxima. Both SRU and SA do not suffer from the intractability of SPU when the number of informational predecessors increases. The approach does require that good strategies can be found using a limited amount of memory, since otherwise, found strategies will fail to approximate the optimal strategy. This requirement should hold especially between time-slices, since the state-space of memory variables can become prohibitively large when a large part of the observed history is required for optimal decision-making. Although this restricts the types of decision problems that can be managed, DLIMIDs, constructed from a 2TLIMID, allow the representation of large, or even infinite-horizon, decision problems, something which standard influence diagrams cannot manage in principle. Hence, 2TLIMIDs are particularly useful in the case of problems that cannot be properly approximated by a short number of time slices.

References

G.F. Cooper. 1988. A method for using belief networks as influence diagrams. In Proceedings of the 4th Workshop on Uncertainty in AI, pages 55-63, University of Minnesota, Minneapolis.

R. Cowell, A.P. Dawid, S.L. Lauritzen, and D. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer.

T. Dean and K. Kanazawa. 1989. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142-150.

R.A. Howard and J.E. Matheson. 1984. Influence diagrams. In R.A. Howard and J.E. Matheson, editors, Readings in the Principles and Applications of Decision Analysis. Strategic Decisions Group, Menlo Park, CA.

S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. 1983. Optimization by simulated annealing. Science, 220:671-680.

S.L. Lauritzen and D. Nilsson. 2001. Representing and solving decision problems with limited information. Management Science, 47(9):1235-1251.

N. Meuleau, K.-E. Kim, L.P. Kaelbling, and A.R. Cassandra. 1999. Solving POMDPs by searching the space of finite policies. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pages 417-426, Stockholm, Sweden.

K.P. Murphy. 2002. Dynamic Bayesian Networks. Ph.D. thesis, UC Berkeley.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 2nd edition.

S. Ross. 1983. Introduction to Stochastic Dynamic Programming. Academic Press, New York.

M.A.J. van Gerven, F.J. Díez, B.G. Taal, and P.J.F. Lucas. 2006. Prognosis of high-grade carcinoid tumor patients using dynamic limited-memory influence diagrams. In Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) 2006. Accepted for publication.

S.D. Whitehead and D.H. Ballard. 1991. Learning to perceive and act by trial and error. Machine Learning, 7:45-83.
decision problems, something which standard

Sensitivity analysis of extreme inaccuracies in Gaussian Bayesian
Networks
Miguel A. Gómez-Villegas and Paloma Maín
ma gv@mat.ucm.es, pmain@mat.ucm.es
Departamento de Estadística e Investigación Operativa
Universidad Complutense de Madrid
Plaza de Ciencias 3,
28040 Madrid, Spain

Rosario Susi
rsusi@estad.ucm.es
Departamento de Estadística e Investigación Operativa III
Universidad Complutense de Madrid
Avda. Puerta de Hierro s/n,
28040 Madrid, Spain

Abstract
We present the behavior of a sensitivity measure defined to evaluate the impact of
model inaccuracies over the posterior marginal density of the variable of interest, after the
evidence propagation is executed, for extreme perturbations of parameters in Gaussian
Bayesian networks. This sensitivity measure is based on the Kullback-Leibler divergence
and yields different expressions depending on the type of parameter (mean, variance or
covariance) to be perturbed. This analysis is useful to know the extreme effect of uncer-
tainty about some of the initial parameters of the model in a Gaussian Bayesian network.
These concepts and methods are illustrated with some examples.
Keywords: Gaussian Bayesian Network, Sensitivity analysis, Kullback-Leibler diver-
gence.
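As a rough computational companion to the measure described here, the sketch below (ours, not the authors' code) conditions a univariate variable of interest on one evidential variable using the standard Gaussian update, perturbs one mean by δ, and evaluates the Kullback-Leibler divergence between the two resulting normal posteriors using its closed form; the function names are hypothetical and the illustrative numbers simply coincide with the 5-node example used later in the paper (variable of interest X5, evidence X2 = 4).

import math

def posterior(mu_i, mu_e, s_ii, s_ee, s_ie, e):
    """Mean and variance of Xi given evidence Xe = e in a Gaussian network."""
    mean = mu_i + (s_ie / s_ee) * (e - mu_e)
    var = s_ii - (s_ie ** 2) / s_ee
    return mean, var

def kl_normal(m0, v0, m1, v1):
    """KL divergence KL(N(m0, v0) || N(m1, v1)) for univariate normals."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

mu_i, mu_e, s_ii, s_ee, s_ie, e = 5.0, 3.0, 26.0, 2.0, 2.0, 4.0
delta = 0.5                       # perturbation added to the mean of Xi

m, v = posterior(mu_i, mu_e, s_ii, s_ee, s_ie, e)
m_pert, v_pert = posterior(mu_i + delta, mu_e, s_ii, s_ee, s_ie, e)
print(m, v)                                  # unperturbed posterior: (6.0, 24.0)
print(kl_normal(m, v, m_pert, v_pert))       # sensitivity of the posterior to delta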

1 Introduction computing the partial derivative of a posterior


marginal probability with respect to a given pa-
Bayesian network is a graphical probabilistic rameter, Coupe, et al. (2002) develop an effi-
model that provides a graphical framework for cient sensitivity analysis based on inference al-
complex domains with lots of inter-related vari- gorithms and Chan, et al. (2005) introduce a
ables. Among other authors, Bayesian networks sensitivity analysis based on a distance measure.
have been studied, by Pearl (1988), Lauritzen In Gaussian Bayesian networks Castillo, et al.
(1996), Heckerman (1998) and Jensen (2001). (2003) present a sensitivity analysis based on
A sensitivity analysis in a Bayesian network is symbolic propagation and Gomez-Villegas, et
necessary to study how sensitive is the networks al. (2006) develop a sensitivity measure, based
output to inaccuracies or imprecisions in the pa- on the Kullback-Leibler divergence, to perform
rameters of the initial network, and therefore to the sensitivity analysis.
evaluate the network robustness. In this paper, we study the behavior of the sensi-
In recent years, some sensitivity analysis tech- tivity measure presented by Gomez-Villegas, et
niques for Bayesian networks have been devel- al. (2006) for extreme inaccuracies (perturba-
oped. In Discrete Bayesian networks Laskey tions) of parameters that describe the Gaussian
(1995) presents a sensitivity analysis based on Bayesian network. To prove that this is a well-

defined measure we are interested in studying 1
= (2)n/2 ||1/2 exp (x )0 1 (x )
the sensitivity measure when one of the param- 2
where is the n-dimensional mean vector,
eters is different from the original value. More-
nn the covariance matrix and || the
over, we want to proof that if the value of one
determinant of .
parameter is similar to the real value then the
sensitivity measure is close to zero. The conditional density associated with
The paper is organized as follows. In Section 2 Xi for i = 1, ..., n in equation (1), is the
we briefly introduce definitions of Bayesian net- univariate normal distribution, with density

works and Gaussian Bayesian networks, review i1
X
how propagation in Gaussian Bayesian networks f (xi |pa(xi )) N i + ij (xj j ), i
j=1
can be performed, and present our working ex- where ij is the regression coefficient of Xj
ample. In Section 3, we present the sensitiv- in the regression of Xi on the parents of Xi ,
ity measure and develop the sensitivity analysis and i = ii iP a(xi ) 1
0
P a(xi) iP a(xi ) is the
proposed. In Section 4, we obtain the limits conditional variance of Xi given its parents.
of the sensitivity measure for extreme pertur-
bations of the parameters giving the behavior
Different algorithms have been proposed to
of the measure in the limit so as their interpre-
propagate the evidence of some nodes in Gaus-
tation. In Section 5, we perform the sensitiv-
sian Bayesian networks. We present an incre-
ity analysis with the working example for some
mental method, updating one evidential vari-
extreme imprecisions. Finally, the paper ends
able at a time (see Castillo, et al. 1997) based
with some conclusions.
on computing the conditional probability den-
sity of a multivariate normal distribution given
2 Gaussian Bayesian Networks and
the evidential variable Xe .
Evidence propagation
For the partition X = (Y, E), with Y the set of
Definition 1 (Bayesian network). A Bayesian non-evidential variables, where Xi Y is the
network is a pair (G,P ) where G is a directed variable of interest, and E is the evidence vari-
acyclic graph (DAG), which nodes correspond- able, then, the conditional probability distribu-
ing to random variables X={X1 , ..., Xn } and tion of Y, given the evidence E = {Xe = e}, is
edges representing probabilistic dependencies, multivariate normal with parameters
and P ={p(x1 |pa(x1 )), ..., p(xn |pa(xn ))} is a set
of conditional probability densities (one for each Y|E=e = Y + YE 1
EE (e E )
variable) where pa(xi ) is the set of parents of
Y|E=e = YY YE 1
EE EY
node Xi in G. The set P defines the associated
joint probability density as Therefore, the variable of interest Xi Y after
the evidence propagation is
n
Y
p(x) = p(xi |pa(xi )). (1) Y|E=e Y|E=e
i=1
Xi |E = e N (i , ii )
!
As usual we work with a variable of inter- ie 2
N i + (e e ), ii ie (2)
est, so the networks output is the information ee ee
about this variable of interest after the evidence where i and e are the means of Xi and Xe
propagation. respectively before the propagation, ii and
Definition 2 (Gaussian Bayesian net- ee the variances of Xi and Xe respectively
work). A Gaussian Bayesian network is a before propagating the evidence, and ie the
Bayesian network over X={X1 , ..., Xn } with covariance between Xi and Xe before the
a multivariate normal distribution N (, ), evidence propagation.
then the joint density is given by f (x) =

[Figure 1: DAG of the Gaussian Bayesian network.]

We illustrate the concept of a Gaussian Bayesian network and the evidence propagation method with the next example.

Example 1. Assume that we are interested in how a machine works. This machine is made up of 5 elements, the variables of the problem, connected as the network in Figure 1. Let us consider that the time each element is working has a normal distribution and we are interested in the last element of the machine (X5).

X = {X1, X2, X3, X4, X5} being normally distributed, X ~ N(μ, Σ), with parameters

   μ = (2, 3, 3, 4, 5)'

   Σ =
     ( 3  0   6  0   6 )
     ( 0  2   2  0   2 )
     ( 6  2  15  0  15 )
     ( 0  0   0  2   4 )
     ( 6  2  15  4  26 )

Considering the evidence E = {X2 = 4}, after evidence propagation we obtain that X5 | X2 = 4 ~ N(6, 24) and the joint distribution is normal with parameters

   μ^{Y|X2=4} = (2, 4, 4, 6)'

   Σ^{Y|X2=4} =
     ( 3   6  0   6 )
     ( 6  13  0  13 )
     ( 0   0  2   4 )
     ( 6  13  4  24 )

3 Sensitivity Analysis and non-influential parameters

The Sensitivity Analysis proposed is as follows: let us suppose the model before propagating the evidence is N(μ, Σ), with one evidential variable Xe whose value is known. After propagating this evidence we obtain the marginal density of interest f(xi | e). Next, we add a perturbation δ to one of the parameters of the model before propagating the evidence (this parameter is supposed to be inaccurate, thus δ reflects this inaccuracy) and perform the evidence propagation, to get f(xi | e, δ). In some cases, the perturbation δ has some restrictions in order to get admissible parameters.

The effect of adding the perturbation is computed by comparing those density functions by means of the sensitivity measure. That measure is based on the Kullback-Leibler divergence between the target marginal densities obtained after evidence propagation, considering the model with and without the perturbation.

Definition 3 (Sensitivity measure). Let (G, P) be a Gaussian Bayesian network N(μ, Σ). Let f(xi | e) be the marginal density of interest after evidence propagation and f(xi | e, δ) the same density when the perturbation δ is added to one parameter of the initial model. Then, the sensitivity measure is defined by

   S_{p_j}(f(xi | e), f(xi | e, δ)) = \int f(xi | e) ln [ f(xi | e) / f(xi | e, δ) ] dxi     (3)

where the subscript p_j is the inaccurate parameter and δ the proposed perturbation, the new value of the parameter being p_j^δ = p_j + δ.

For small values of the sensitivity measure we can conclude that our Bayesian network is robust for that perturbation.

3.1 Mean vector inaccuracy

Three different situations are possible depending on the element of μ to be perturbed, i.e., the perturbation can affect the mean of the variable of interest Xi ∈ Y, the mean of the

3.1 Mean vector inaccuracy

Three different situations are possible depending on the element of $\mu$ to be perturbed, i.e., the perturbation can affect the mean of the variable of interest $X_i \in \mathbf{Y}$, the mean of the evidence variable $X_e \in \mathbf{E}$, or the mean of any other variable $X_j \in \mathbf{Y}$ with $j \neq i$. Developing the sensitivity measure (3), we obtain the next propositions.

Proposition 1. For the perturbation $\delta \in \mathbb{R}$ in the mean vector $\mu$, the sensitivity measure is as follows:

When the perturbation is added to the mean of the variable of interest, $\mu_i^{\delta} = \mu_i + \delta$, and $X_i \mid E=e, \delta \sim N(\mu_i^{\mathbf{Y}|E=e} + \delta,\, \sigma_{ii}^{\mathbf{Y}|E=e})$,

$$S_{\mu_i}(f(x_i|e), f(x_i|e,\delta)) = \frac{\delta^2}{2\,\sigma_{ii}^{\mathbf{Y}|E=e}}.$$

If the perturbation is added to the mean of the evidential variable, $\mu_e^{\delta} = \mu_e + \delta$, the posterior density of the variable of interest, with the perturbation, is $X_i \mid E=e, \delta \sim N\!\left(\mu_i^{\mathbf{Y}|E=e} - \frac{\sigma_{ie}}{\sigma_{ee}}\delta,\, \sigma_{ii}^{\mathbf{Y}|E=e}\right)$, then

$$S_{\mu_e}(f(x_i|e), f(x_i|e,\delta)) = \frac{\delta^2\,\sigma_{ie}^2/\sigma_{ee}^2}{2\,\sigma_{ii}^{\mathbf{Y}|E=e}}.$$

The perturbation added to the mean of any other non-evidential variable, different from the variable of interest, has no influence over $X_i$, then $f(x_i|e,\delta) = f(x_i|e)$, and the sensitivity measure is zero.

3.2 Covariance matrix inaccuracy

There are six possible different situations, depending on the parameter of the covariance matrix to be changed; three if the perturbation is added to the variances (elements in the diagonal of $\Sigma$) and another three if the perturbation is added to the covariances of $\Sigma$. When the covariance matrix is perturbed, the structure of the network can change. Those changes are presented in the precision matrix of the perturbed network, that is, the inverse of the covariance matrix with perturbation. To guarantee the normality of the network it is necessary for $\Sigma^{\mathbf{Y}|E=e,\delta}$ to be a positive definite matrix in all cases presented in the next proposition.

Proposition 2. Adding the perturbation $\delta \in \mathbb{R}$ to the covariance matrix $\Sigma$, the sensitivity measure obtained is:

If the perturbation is added to the variance of the variable of interest, being $\sigma_{ii}^{\delta} = \sigma_{ii} + \delta$ with $\delta > -\sigma_{ii} + \frac{\sigma_{ie}^2}{\sigma_{ee}}$, then $X_i \mid E=e, \delta \sim N(\mu_i^{\mathbf{Y}|E=e},\, \sigma_{ii}^{\mathbf{Y}|E=e,\delta})$ where $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} = \sigma_{ii} + \delta - \frac{\sigma_{ie}^2}{\sigma_{ee}}$ and

$$S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\!\left(1 + \frac{\delta}{\sigma_{ii}^{\mathbf{Y}|E=e}}\right) + \frac{\sigma_{ii}^{\mathbf{Y}|E=e}}{\sigma_{ii}^{\mathbf{Y}|E=e,\delta}} - 1 \right].$$

When the perturbation is added to the variance of the evidential variable, being $\sigma_{ee}^{\delta} = \sigma_{ee} + \delta$ and $\delta > -\sigma_{ee}(1 - \max_{X_j \in \mathbf{Y}} \rho_{je}^2)$ with $\rho_{je}$ the corresponding correlation coefficient, the posterior density of interest is $X_i \mid E=e, \delta \sim N(\mu_i^{\mathbf{Y}|E=e,\delta},\, \sigma_{ii}^{\mathbf{Y}|E=e,\delta})$ with $\mu_i^{\mathbf{Y}|E=e,\delta} = \mu_i + \frac{\sigma_{ie}}{\sigma_{ee}+\delta}(e - \mu_e)$ and $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} = \sigma_{ii} - \frac{\sigma_{ie}^2}{\sigma_{ee}+\delta}$, therefore

$$S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\frac{\sigma_{ii}^{\mathbf{Y}|E=e,\delta}}{\sigma_{ii}^{\mathbf{Y}|E=e}} + \frac{\sigma_{ii}^{\mathbf{Y}|E=e} + \frac{\sigma_{ie}^2\,\delta^2}{\sigma_{ee}^2(\sigma_{ee}+\delta)^2}(e-\mu_e)^2}{\sigma_{ii}^{\mathbf{Y}|E=e,\delta}} - 1 \right].$$

The perturbation added to the variance of any other non-evidential variable $X_j \in \mathbf{Y}$ with $j \neq i$, $\sigma_{jj}^{\delta} = \sigma_{jj} + \delta$, has no influence over $X_i$, therefore $f(x_i|e,\delta) = f(x_i|e)$ and the sensitivity measure is zero.
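Since the sensitivity measure (3) is the Kullback-Leibler divergence between two univariate normal densities, the closed forms in Proposition 1 can be checked numerically. A short Python sketch (ours, not part of the paper):

```python
import math

def kl_normal(m1, v1, m2, v2):
    """KL( N(m1,v1) || N(m2,v2) ): the integral in Eq. (3) for normal densities."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Perturbing the mean of the variable of interest by delta leaves the posterior
# variance unchanged, so the measure reduces to delta**2 / (2*var) (Proposition 1).
var, delta = 24.0, -3.0
print(kl_normal(6.0, var, 6.0 + delta, var))   # 0.1875
print(delta ** 2 / (2 * var))                  # 0.1875
```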

When the covariance between the variable of interest $X_i$ and the evidential variable $X_e$ is perturbed, $\sigma_{ie}^{\delta} = \sigma_{ie} + \delta = \sigma_{ei}^{\delta}$ and $-\sigma_{ie} - \sqrt{\sigma_{ii}\sigma_{ee}} < \delta < -\sigma_{ie} + \sqrt{\sigma_{ii}\sigma_{ee}}$, then $X_i \mid E=e, \delta \sim N(\mu_i^{\mathbf{Y}|E=e,\delta},\, \sigma_{ii}^{\mathbf{Y}|E=e,\delta})$ with $\mu_i^{\mathbf{Y}|E=e,\delta} = \mu_i + \frac{\sigma_{ie}+\delta}{\sigma_{ee}}(e - \mu_e)$ and $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} = \sigma_{ii} - \frac{(\sigma_{ie}+\delta)^2}{\sigma_{ee}}$; the sensitivity measure obtained is

$$S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\!\left(1 - \frac{\delta^2 + 2\delta\sigma_{ie}}{\sigma_{ee}\,\sigma_{ii}^{\mathbf{Y}|E=e}}\right) + \frac{\sigma_{ii}^{\mathbf{Y}|E=e} + \frac{\delta^2}{\sigma_{ee}^2}(e-\mu_e)^2}{\sigma_{ii}^{\mathbf{Y}|E=e,\delta}} - 1 \right].$$

If we add the perturbation to any other covariance, i.e., between the variable of interest $X_i$ and any other non-evidential variable $X_j$, or between the evidence variable $X_e$ and $X_j \in \mathbf{Y}$ with $j \neq i$, the posterior probability density of the variable of interest $X_i$ is the same as without perturbation and therefore the sensitivity measure is zero.

4 Extreme behavior of the Sensitivity Measure

When using the sensitivity measure, that describes how sensitive the variable of interest is when a perturbation is added to an inaccurate parameter, it would be interesting to know how the sensitivity measure behaves when the perturbation $\delta \in \mathbb{R}$ is extreme. Then, we study the behavior of that measure for extreme perturbations, that is, the limit of the sensitivity measure. The next propositions present the results about the limits of the sensitivity measures in all cases given in Propositions 1 and 2.

Proposition 3. When the perturbation $\delta$ added to the mean vector $\mu$ is extreme, the sensitivity measure is as follows:

1. $\lim_{\delta \to \pm\infty} S_{\mu_i}(f(x_i|e), f(x_i|e,\delta)) = \infty$ and $\lim_{\delta \to 0} S_{\mu_i}(f(x_i|e), f(x_i|e,\delta)) = 0$

2. $\lim_{\delta \to \pm\infty} S_{\mu_e}(f(x_i|e), f(x_i|e,\delta)) = \infty$ and $\lim_{\delta \to 0} S_{\mu_e}(f(x_i|e), f(x_i|e,\delta)) = 0$.

Proof. It follows directly.

Proposition 4. When the extreme perturbation is added to the elements of the covariance matrix and the correlation coefficient of $X_i$ and $X_e$ is $0 < \rho_{ie}^2 < 1$, the results are:

1. $\lim_{\delta \to \infty} S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = \infty$ but $S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = o(\delta)$;
$\lim_{\delta \to M_{ii}} S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = \infty$ with $M_{ii} = -\sigma_{ii} + \frac{\sigma_{ie}^2}{\sigma_{ee}} = -\sigma_{ii}(1 - \rho_{ie}^2)$;
$\lim_{\delta \to 0} S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = 0$

2. $\lim_{\delta \to \infty} S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = -\frac{1}{2}\left[ \ln(1 - \rho_{ie}^2) + \rho_{ie}^2\left(1 - \frac{(e-\mu_e)^2}{\sigma_{ee}}\right) \right]$;
$\lim_{\delta \to M_{ee}} S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\frac{\bar{M}_{ee} - \rho_{ie}^2}{(1-\rho_{ie}^2)\,\bar{M}_{ee}} + \frac{\rho_{ie}^2(1-\bar{M}_{ee})}{\bar{M}_{ee} - \rho_{ie}^2}\left(1 + \frac{(e-\mu_e)^2}{\sigma_{ee}}\,\frac{1-\bar{M}_{ee}}{\bar{M}_{ee}}\right) \right]$ where $M_{ee} = -\sigma_{ee}(1 - \bar{M}_{ee})$ being $\bar{M}_{ee} = \max_{X_j \in \mathbf{Y}} \rho_{je}^2$;
$\lim_{\delta \to 0} S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = 0$

3. $\lim_{\delta \to M_{ie}^1} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \infty$ and $\lim_{\delta \to M_{ie}^2} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \infty$ with $M_{ie}^1 = -\sigma_{ie} - \sqrt{\sigma_{ii}\sigma_{ee}}$ and $M_{ie}^2 = -\sigma_{ie} + \sqrt{\sigma_{ii}\sigma_{ee}}$;
$\lim_{\delta \to 0} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = 0$.

Proof. 1. It follows directly. When $\sigma_{ii}^{\delta} = \sigma_{ii} + \delta$ the new variance of $X_i$ is $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} = \sigma_{ii}^{\mathbf{Y}|E=e} + \delta$. Being $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} > 0$ then $\delta > -\sigma_{ii}^{\mathbf{Y}|E=e}$. Naming $M_{ii} = -\sigma_{ii}^{\mathbf{Y}|E=e}$ and considering $x = \sigma_{ii}^{\mathbf{Y}|E=e} + \delta$ we have
$$\lim_{\delta \to M_{ii}} S_{\sigma_{ii}}(f(x_i|e), f(x_i|e,\delta)) = \lim_{x \to 0} \frac{1}{2x}\left[ x\ln(x) - x\ln(\sigma_{ii}^{\mathbf{Y}|E=e}) - x + \sigma_{ii}^{\mathbf{Y}|E=e} \right] = \infty.$$

It follows directly.

2. $\lim_{\delta \to \infty} S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\frac{\sigma_{ii}}{\sigma_{ii}^{\mathbf{Y}|E=e}} + \frac{\sigma_{ii}^{\mathbf{Y}|E=e} + \frac{\sigma_{ie}^2(e-\mu_e)^2}{\sigma_{ee}^2}}{\sigma_{ii}} - 1 \right]$; with $\sigma_{ii}^{\mathbf{Y}|E=e} = \sigma_{ii}(1 - \rho_{ie}^2)$ and $\rho_{ie}^2 = \frac{\sigma_{ie}^2}{\sigma_{ii}\sigma_{ee}}$ the limit is

$$= -\frac{1}{2}\left[ \ln(1 - \rho_{ie}^2) + \rho_{ie}^2\left(1 - \frac{(e-\mu_e)^2}{\sigma_{ee}}\right) \right].$$

When $\sigma_{ee}^{\delta} = \sigma_{ee} + \delta$, the new conditional variance for all non-evidential variables is $\sigma_{jj}^{\mathbf{Y}|E=e,\delta} = \sigma_{jj} - \frac{\sigma_{je}^2}{\sigma_{ee}+\delta}$ for all $X_j \in \mathbf{Y}$. If we impose $\sigma_{jj}^{\mathbf{Y}|E=e,\delta} > 0$ for all $X_j \in \mathbf{Y}$ then $\delta$ must satisfy the next condition $\delta > -\sigma_{ee}(1 - \max_{X_j \in \mathbf{Y}} \rho_{je}^2)$. Naming $\bar{M}_{ee} = \max_{X_j \in \mathbf{Y}} \rho_{je}^2$ and $M_{ee} = -\sigma_{ee}(1 - \bar{M}_{ee})$ then we have

$$\lim_{\delta \to M_{ee}} S_{\sigma_{ee}}(f(x_i|e), f(x_i|e,\delta)) = \lim_{\delta \to M_{ee}} \frac{1}{2}\left[ \ln\frac{\sigma_{ii} - \frac{\sigma_{ie}^2}{\sigma_{ee}+\delta}}{\sigma_{ii}^{\mathbf{Y}|E=e}} + \frac{\sigma_{ii}^{\mathbf{Y}|E=e} + \frac{\sigma_{ie}^2\,\delta^2(e-\mu_e)^2}{\sigma_{ee}^2(\sigma_{ee}+\delta)^2}}{\sigma_{ii} - \frac{\sigma_{ie}^2}{\sigma_{ee}+\delta}} - 1 \right]$$

with $\rho_{ie}^2 = \frac{\sigma_{ie}^2}{\sigma_{ii}\sigma_{ee}}$

$$= \frac{1}{2}\left[ \ln\frac{\bar{M}_{ee} - \rho_{ie}^2}{(1-\rho_{ie}^2)\,\bar{M}_{ee}} + \frac{\rho_{ie}^2(1-\bar{M}_{ee})}{\bar{M}_{ee} - \rho_{ie}^2}\left(1 + \frac{(e-\mu_e)^2}{\sigma_{ee}}\,\frac{1-\bar{M}_{ee}}{\bar{M}_{ee}}\right) \right].$$

It follows directly.

3. If we make $\sigma_{ie}^{\delta} = \sigma_{ie} + \delta$, the new conditional variance is $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} = \sigma_{ii}^{\mathbf{Y}|E=e} - \frac{\delta^2 + 2\delta\sigma_{ie}}{\sigma_{ee}}$. Then, if we impose $\sigma_{ii}^{\mathbf{Y}|E=e,\delta} > 0$, $\delta$ must satisfy the next condition $-\sigma_{ie} - \sqrt{\sigma_{ii}\sigma_{ee}} < \delta < -\sigma_{ie} + \sqrt{\sigma_{ii}\sigma_{ee}}$. First, naming $M_{ie}^2 = -\sigma_{ie} + \sqrt{\sigma_{ii}\sigma_{ee}}$, we calculate $\lim_{\delta \to M_{ie}^2} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta))$. But $\delta \to M_{ie}^2$ is equivalent to $\delta^2 + 2\delta\sigma_{ie} \to \sigma_{ee}\,\sigma_{ii}^{\mathbf{Y}|E=e}$ and given that

$$S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \frac{1}{2}\left[ \ln\frac{\sigma_{ee}\,\sigma_{ii}^{\mathbf{Y}|E=e} - (\delta^2 + 2\delta\sigma_{ie})}{\sigma_{ee}\,\sigma_{ii}^{\mathbf{Y}|E=e}} + \frac{\left(\sigma_{ii}^{\mathbf{Y}|E=e} + \frac{\delta^2}{\sigma_{ee}^2}(e-\mu_e)^2\right)\sigma_{ee}}{\sigma_{ee}\,\sigma_{ii}^{\mathbf{Y}|E=e} - (\delta^2 + 2\delta\sigma_{ie})} - 1 \right]$$

and as $\lim_{x \to 0} \left(\ln x + \frac{k}{x}\right) = \infty$ for every $k$, then we get $\lim_{\delta \to M_{ie}^2} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \infty$. Analogously, the other limit is also $\lim_{\delta \to M_{ie}^1} S_{\sigma_{ie}}(f(x_i|e), f(x_i|e,\delta)) = \infty$. It follows directly.

The behavior of the sensitivity measure is the expected one, except when the extreme perturbation is added to the evidential variance, because with known evidence, the posterior density of interest with the perturbation in the model $f(x_i|e,\delta)$ is not so different from the posterior density of interest without the perturbation $f(x_i|e)$; therefore, although an extreme perturbation added to the evidential variance can exist, the sensitivity measure tends to a finite value.

5 Experimental results

Example 2. Consider the Gaussian Bayesian network given in Example 1. Experts disagree with the definition of the variable of interest $X_5$: the mean could be $\mu_5^{\delta_1} = 2 = 5 + \delta_1$ (with $\delta_1 = -3$), the variance could be $\sigma_{55}^{\delta_2} = 24$ with $\delta_2 = -2$, and the covariance between $X_5$ and the evidential variable $X_2$ could be $\sigma_{52}^{\delta_3} = 3$ with $\delta_3 = 1$ (the same for $\sigma_{25}$); with the variance of the evidential variable being $\sigma_{22}^{\delta_4} = 4$ with $\delta_4 = 2$ and the covariance between $X_5$ and the other non-evidential variable $X_3$ being $\sigma_{53}^{\delta_5} = 13$ with $\delta_5 = -2$ (the same for $\sigma_{35}$); moreover, there are different opinions about $X_3$, because they suppose that $\mu_3$ could be $\mu_3^{\delta_6} = 7$ with $\delta_6 = 4$, that $\sigma_{33}$ could be $\sigma_{33}^{\delta_7} = 17$ with $\delta_7 = 2$, and that $\sigma_{32}$ could be $\sigma_{32}^{\delta_8} = 1$ with $\delta_8 = -1$ (the same for $\sigma_{23}$).

The sensitivity measure for those inaccurate parameters yields

$S_{\mu_5}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_1)) = 0.1875$
$S_{\sigma_{55}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_2)) = 0.00195$
$S_{\sigma_{52}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_3)) = 0.00895$
$S_{\sigma_{22}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_4)) = 0.00541$
$S_{\sigma_{53}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_5)) = 0$
$S_{\mu_3}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_6)) = 0$
$S_{\sigma_{33}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_7)) = 0$
$S_{\sigma_{32}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_8)) = 0$

As these values of the sensitivity measures are small, we can conclude that the perturbations presented do not affect the variable of interest and therefore the network can be considered robust. Also, the inaccuracies about the non-evidential variable $X_3$ do not disturb the posterior marginal density of interest, the sensitivity measure being zero in all cases. If experts think that the sensitivity measure obtained for the mean of the variable of interest is large enough, then they should review the information about this variable.

Moreover, we have implemented an algorithm (see Appendix 1) to compute all the sensitivity measures that can influence the variable of interest $X_i$; this algorithm compares those computed sensitivity measures with a threshold $s$ fixed by experts. Then, if a sensitivity measure is larger than the threshold, the corresponding parameter should be reviewed.

Figure 2: Sensitivity measures obtained in the example for any perturbation value

The extreme behavior of the sensitivity measure for some particular cases is given as follows. When $\delta_1 = 20$, the sensitivity measure is $S_{\mu_5}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_1)) = 8.33$, and with the perturbation $\delta_1 = 25$, $S_{\mu_5}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_1)) = 13.02$. If the perturbation is $\delta_2 = 1000$, $S_{\sigma_{55}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_2)) = 1.39$; therefore, as the perturbation increases to infinity the sensitivity measure grows very slowly, in fact $S_{\sigma_{55}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_2)) = o(\delta)$ as stated before. However, if $\delta_2 = -23$, the sensitivity measure is $S_{\sigma_{55}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_2)) = 9.91$. We do not present the sensitivity measure when $\delta_3$ is extreme, because $\delta_3$ must be in $(-2, 2)$ to keep the covariance matrix of the network with the perturbation $\delta_3$ positive definite. Finally, when $\delta_4 = 100$, the sensitivity measure is $S_{\sigma_{22}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_4)) = 0.02$ (where the limit of $S_{\sigma_{22}}$ when $\delta$ tends to infinity is 0.0208), and with $\delta_4 = -1.73$, $S_{\sigma_{22}}(f(x_5|X_2=4), f(x_5|X_2=4, \delta_4)) = 2.026$ (the limit when $\delta$ tends to $M_{ee}$ being 2.1212). In Figure 2 we observe the sensitivity measures, considered as a function of $\delta$; the graph shows the behavior of the measure when $\delta \in \mathbb{R}$.
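The screening step described for Appendix 1 reduces to comparing each computed sensitivity measure with an expert-fixed threshold. The authors' implementation is only available at the URL given in Appendix 1; the following is merely an illustrative sketch with hypothetical names (ours):

```python
def screen_parameters(sensitivities, threshold):
    """Flag the inaccurate parameters whose sensitivity measure exceeds the
    expert-fixed threshold; `sensitivities` maps a parameter name to its measure."""
    return [p for p, s in sensitivities.items() if s > threshold]

# Values reported in Example 2:
measures = {"mu_5": 0.1875, "sigma_55": 0.00195, "sigma_52": 0.00895,
            "sigma_22": 0.00541, "sigma_53": 0.0, "mu_3": 0.0,
            "sigma_33": 0.0, "sigma_32": 0.0}
print(screen_parameters(measures, threshold=0.1))   # ['mu_5']
```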

6 Conclusions

In this paper we study the behavior of the sensitivity measure, which compares the marginal density of interest when the model of the Gaussian Bayesian network is described with and without a perturbation $\delta \in \mathbb{R}$, when the perturbation is extreme. Considering a large perturbation, the sensitivity measure is large too, except when the extreme perturbation is added to the evidence variance. Therefore, even if the evidence variance were large and different from the variance in the original model, the sensitivity measure would be limited by a finite value; that is because the evidence about this variable explains the behavior of the variable of interest regardless of its inaccurate variance.

Moreover, in all possible cases of the sensitivity measure, if the perturbation added to a parameter tends to zero, the sensitivity measure is zero too.

The study of the behavior of the sensitivity measure is useful to prove that this is a well-defined measure to develop a sensitivity analysis in Gaussian Bayesian networks even if the proposed perturbation is extreme.

Future research is focused on perturbing more than one parameter simultaneously, as well as working with more than one variable of interest.

References

Castillo, E., Gutiérrez, J.M., Hadi, A.S. 1997. Expert Systems and Probabilistic Network Models. New York: Springer-Verlag.

Castillo, E., Kjærulff, U. 2003. Sensitivity analysis in Gaussian Bayesian networks using a symbolic-numerical technique. Reliability Engineering and System Safety, 79:139-148.

Chan, H., Darwiche, A. 2005. A distance Measure for Bounding Probabilistic Belief Change. International Journal of Approximate Reasoning, 38(2):149-174.

Coupé, V.M.H., van der Gaag, L.C. 2002. Properties of sensitivity analysis of Bayesian belief networks. Annals of Mathematics and Artificial Intelligence, 36:323-356.

Gómez-Villegas, M.A., Maín, P., Susi, R. 2006. Sensitivity analysis in Gaussian Bayesian networks using a divergence measure. Communications in Statistics - Theory and Methods, accepted for publication.

Heckerman, D. 1998. A tutorial on learning with Bayesian networks. In: Jordan, M. I., ed., Learning in Graphical Models. Cambridge, Massachusetts: MIT Press.

Jensen, F.V. 2001. Bayesian Networks and Decision Graphs. New York: Springer.

Kullback, S., Leibler, R.A. 1951. On Information and Sufficiency. Annals of Mathematical Statistics, 22:79-86.

Laskey, K.B. 1995. Sensitivity Analysis for Probability Assessments in Bayesian Networks. IEEE Transactions on Systems, Man, and Cybernetics, 25(6):901-909.

Lauritzen, S.L. 1996. Graphical Models. Oxford: Clarendon Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Networks of Plausible Inference, Morgan Kaufmann: Palo Alto.

Appendix 1

The algorithm that computes the sensitivity measures and determines the parameters in the network is available at the URL:
http://www.ucm.es/info/eue/pagina//APOYO/RosarioSusiGarcia/S algorithm.pdf

Acknowledgments

This research was supported by the MEC from Spain, Grant MTM2005-05462, and Universidad Complutense de Madrid-Comunidad de Madrid, Grant UCM2005-910395.
Learning Bayesian Networks Structure using Markov Networks
Christophe Gonzales and Nicolas Jouve
Laboratoire d'Informatique de Paris VI
8 rue du capitaine Scott 75015 Paris, France

Abstract
This paper addresses the problem of learning a Bayes net (BN) structure from a database. We advocate first searching the Markov networks (MNs) space to obtain an initial BN and, then, refining it into an optimal BN. More precisely, it can be shown that under classical assumptions our algorithm obtains the optimal BN's moral graph in polynomial time. This MN is thus optimal w.r.t. inference. The process is attractive in that, in addition to providing optimality guarantees, the MN space is substantially smaller than the traditional search spaces (those of BNs and equivalence classes (ECs)). In practice, we face the classical shortcoming of constraint-based methods, namely the unreliability of high-order conditional independence tests, and handle it using efficient conditioning-set computations based on graph triangulations. Our preliminary experiments are promising, both in terms of the quality of the produced solutions and in terms of time responses.

1 Introduction plex BNs. A more popular approach consists in


searching the BN structures space, optimizing some
Effective modeling of uncertainty is essential to score (BDeu, MDL, etc.) measuring how well the
most AI applications. For fifteen years probabilis- structure fits data (Heckerman et al., 1995; Lam and
tic graphical models like Bayesian nets (BN) and Bacchus, 1993). Unfortunately, the number of BNs
Markov nets (MN), a.k.a. Markov random fields, is super-exponential in the number of random vari-
(Cowell et al., 1999; Jensen, 1996; Pearl, 1988) ables (Robinson, 1973) and an exhaustive compar-
have proved to be well suited and computationally ison of all the structures is impossible. One way
efficient to deal with uncertainties in many practical out is to use local search algorithms parsing effi-
applications. Their key feature consists in exploit- ciently the BN structure space, moving from one
ing probabilistic conditional independences (CIs) to BN to the next one by performing simple graphical
decompose a joint probability distribution over a set modifications (Heckerman et al., 1995). However,
of random variables as a product of functions of these suffer from the existence of multiple BNs rep-
smaller sets, thus allowing a compact representation resenting the same set of CIs. This not only length-
of the joint probability as well as efficient inference ens the search but may also trap it into local op-
algorithms (Allen and Darwiche, 2003; Madsen and tima. To avoid these problems, the third class of
Jensen, 1999). These independences are encoded in approaches (Munteanu and Bendou, 2001; Chick-
the graphical structure of the model, which is di- ering, 2002) searches the space of BN equivalence
rected for BNs and undirected for MNs. In this pa- classes (EC). In this space, equivalent BNs are rep-
per, we propose an algorithm to learn a BN structure resented by a unique partially directed graph. In ad-
from a data sample of the distribution of interest. dition to speeding-up the search by avoiding use-
There exist three main classes of algorithms for less moves from one BN to an equivalent one, this
learning BN from data. The first one tries to de- class of algorithms possesses nice theoretical prop-
termine the set of all probabilistic CIs using sta- erties. For instance, under DAG-isormorphism, a
tistical independence tests (Verma and Pearl, 1990; classical hypothesis on the data generative distribu-
Spirtes et al., 2000). Such tests have been criti- tion, and in the limit of a large database, the GES
cized in the literature as they can only be applied algorithm is theoretically able to recover the opti-
with small conditioning sets, thus ruling out com- mal BN structure (Chickering, 2002). This property
is remarkable as very few search algorithms are able Finally Section 5 mentions some related work and
to guarantee the quality of the returned structure. presents some experimental results.
Although, in theory, the EC space seems more at-
tractive than the BN space, it suffers in practice from 2 Background
two problems: i) its neighborhood is exponential in
the number of nodes (Chickering, 2002) and ii) the Let V be a set of random variables with joint prob-
EC space size is roughly equal to that of the BN ability distribution P (V). Variables are denoted by
space (Gillispie and Perlman, 2001). It would thus capitalized letters (other than P ) and sets of vari-
be interesting to find a space preserving the optimal- ables (except V) by bold letters.
ity feature of the EC space exploited by GES while Markov and Bayes nets are both composed of:
avoiding the above problems. For this purpose, the i) a graphical structure whose nodes are the vari-
MN space seems a good candidate as, like the EC ables in V (hereafter, we indifferently refer to nodes
space, its equivalence classes are singletons. More- and their corresponding variables) and ii) a set of
over, it is exponentially smaller than the BN space numerical parameters. The structure encodes proba-
and, by its undirected nature, its neighborhood is bilistic CIs among variables and then defines a fam-
also exponentially smaller than that of EC. This sug- ily of probability distributions. The set of parame-
gests that searching the MN space instead of the EC ters, whose form depends on the structure, defines a
space can lead to significant improvements. unique distribution among this family and assesses
To support this claim, we propose a learning al- it numerically. Once a structure is learned from
gorithm that mainly performs its search in the MN data, its parameters are assessed, generally by max-
space. More precisely, it is divided into three dis- imizing data likelihood given the model.
tinct phases. In the first one, we use a local search A MN structure is an undirected graph while a
algorithm that finds an optimal MN searching the BN one is a directed acyclic graph (DAG). MNs
MN space. It is well-known that only chordal MNs graphically encode CIs by separation. In an undi-
are precisely representable by BNs (Pearl, 1988). rected graph G, two nodes X and Y are said to be
As it is unlikely that the MN we find at the end of separated by a disjoint set Z, denoted by X G Y |
the first phase is actually chordal, its transformation Z, if each chain between X and Y has a node in Z.
into a BN must come along with the loss of some BNs criterion, called d-separation, is defined simi-
CIs. Finding a set of CIs that can be dispensed with larly except that it distinguishes colliders on a chain,
to map the MN into a BN while fitting data as best i.e., nodes with their two adjacent arcs on the chain
as possible is not a trivial task. Phase 2 exploits the directed toward them. In a DAG, X and Y are said
relationship between BNs and their moral graphs to to be d-separated by Z, denoted by X B Y | Z, if
transform the MN into a BN whose moral graph is for each chain between X and Y there exists a node
not far from the MN obtained at the end of phase 1. S s.t. if S is a collider on the chain, then neither S
Then, in a third phase, this BN is refined into one nor any of its descendant is in Z, else S is in Z. Ex-
that better fits data. This last phase uses GES sec- tending Pearls terminology (Pearl, 1988), we will
ond step. The whole learning algorithm preserves say that a graph G, directed or not, is an I-map (I
GES theoretical optimality guarantee. Furthermore, standing for independency), implicitly relative to P ,
if X G Y | Z = X P Y | Z. We will also


the BN at the end of phase 2, which possesses inter-


esting theoretical properties, is obtained in polyno- say that a graph G is an I-map of another graph G 0
mial time. Phase 3 is exponential but has an anytime when X G Y | Z = X G 0 Y | Z. When two
property. Preliminary experimental results suggest structures are I-maps of one another, they represent
that our learning algorithm is faster than GES and the same family and are thus said to be equivalent.
produces better quality solutions. The complete graph, which exhibits no separation
The paper is organized as follows. Section 2 pro- assertion, is a structure containing all the possible
vides some background on MNs and BNs. Then, distributions and is, as such, always an I-map. Of
Section 3 describes how we obtain the optimal MN. course, in a learning prospect, our aim is to recover
Section 4 shows how it can be converted into a BN. from data an I-map as sparse as possible. More pre-

cisely, in both MN and BN frameworks, an I-map G Like most works aiming to recover an optimal
is said to be optimal (w.r.t. inclusion) relatively to P structure from data, we will assume that the under-
if there exists no other I-map G 0 such that i) G is an lying distribution P is DAG-isomorph, i.e. that there
I-map of G 0 and ii) G is not equivalent to G 0 . exists a DAG B encoding exactly the indepen-
dences of P (s.t. X B Y | Z X P Y | Z).


Though they are strongly related, BNs and MNs
are not able to represent precisely the same in- Under this assumption, B and equivalent BNs are
dependence shapes. MNs are mostly used in the obviously optimal. Using the axiomatic in (Pearl,
image processing and computer vision community 1988), it can be shown that, in the MN frame-
(Prez, 1998), while BNs are particularly used in work, the DAG-isomorphism hypothesis entails the
the AI community working on modeling and predic- so-called intersection property, leading to the exis-
tion tools like expert systems (Cowell et al., 1999). tence of a unique optimal MN, say G (Pearl, 1988).
In the latter prospect, BNs are mostly preferred for Moreover, G is the moral graph of B .
two reasons. First, the kind of independences they
can express using arcs orientations are considered 3 Markov network search
more interesting than the independence shapes only Searching the BN or the EC space is often per-
representable by MNs (Pearl, 1988). Secondly, as formed as an optimization process, that is, the algo-
BNs parameters are probabilities (while those of rithm looks for a structure optimizing some good-
MNs are non-negative potential functions without ness of fit measure, the latter being a decomposable
any real actual meaning), they are considered easier scoring function that involves maximum likelihood
to assess, to manipulate and to interpret. In BNs, estimates. For MNs, the computation of these esti-
the direction is exploited only through V-structures, mates is hard and requires time-expensive methods
that is, three-node chains whose central node is a (Murray and Ghahramani, 2004) (unless the MN is
collider the neighbors of which are not connected by triangulated, which is not frequent). Hence score-
an arc. This idea is expressed in a theorem (Pearl, based exploration strategies seem inappropriate for
1988) stating that two BNs are equivalent if and MN searches. Using statistical CI tests and com-
only if they have the same V-structures and the same bining them to reveal the MNs edges seems a more
skeleton, where the skeleton of a DAG is the undi- promising approach. This is the one we follow here.
rected graph resulting from the removal of its arcs
direction. Algorithm L EARN MN
Input : database
MNs are unable to encode V-structure informa- Output : moral graph G = (V, E2 )
tion since it is a directed notion. As a DAG with- 1. E1
out V-structure is chordal (i.e. triangulated), it is not 2. foreach edge (X, Y ) 6 E1 do
surprising that a result (Pearl et al., 1989) states that search, if necessary, a new SXY
s.t. X (V,E1 ) Y | SXY
independences of a family of distributions can be


3. if (X, Y ) 6 E1 s.t. X 6 Y |SXY then


represented by both MNs and BNs frameworks if 4. add edge (X, Y ) to E1 and go to line 2
and only if the associated structure is chordal. In 5. E2 E1
6. foreach edge (X, Y ) E2 do
the general case where the graph is not chordal, it is search, if necessary, a new SXY
possible to get an optimal MN from a BN by mor- s.t. X (V,E2 \{(X,Y )}) Y | SXY


alization. The moral graph of a DAG is the undi- 7. if (X, Y ) E2 s.t. X Y |SXY then
rected graph obtained by first adding edges between 8. remove edge (X, Y ) from E2 and go to line 6
9. return G = (V, E2 )
non adjacent nodes with a common child (i.e. the
extremities of V-structures) and then removing the End of algorithm
arcs orientations. It is easily seen that the moral Unlike most works where learning a MN amounts
graph of a BN is an optimal undirected I-map of this to independently learn the Markov blanket of all the
graph, even if some independences of the BN have variables, we construct a MN in a local search man-
been necessarily loosed in the transformation. The ner. Algorithm L EARN MN consists in two con-
converse, i.e. getting a directed I-map from a MN, secutive phases: the first one adds edges to the
is less easy and will be addressed in Section 4. empty graph until it converges toward a graph G 1 =

(V, E1 ). Then it removes from G1 as many edges as cuts a junction tree into distinct connected compo-
possible, hence resulting in a graph G 2 = (V, E2 ). nents. Hence, we propose a method, based on join
At each step of phase 1, we compute the depen- trees (Jensen, 1996), which computes the separators
dence of each pair of non-adjacent nodes (X, Y ) for all pairs at the same time in O(|V| 4 ), as com-
conditionally to any set SXY separating them in pared to O(|V|5 ) in (Tian et al., 1998). Although the
the current graph. We determine the pair with the separators we compute are not guaranteed to be op-
strongest dependence (using a normalized differ- timal, in practice they are most often small. Here is
ence to the 2 critical value) and we add the cor- the key idea: assume G0 is triangulated and a corre-
responding edge to the graph and update separators. sponding join tree J0 is computed. Then two cases
Lemma 1. Assuming DAG-isomorphism and sound can obtain: i) X and Y belong to the same clique, or
statistical tests, graph G1 resulting for the first ii) they belong to different cliques of J 0 . In the sec-
phase of L EARN MN is an I-map. ond case, let C0 , . . . , Ck be the smallest connected
set of cliques such that X C0 and Y Ck , i.e.,
Sketch of proof. At the end of phase 1, (X, Y ) 6 C0 and Ck are the nearest cliques containing X and
E1 , there exists SXY s.t. X G1 Y | SXY and Y . Then any separator on the path C0 , . . . , Ck cuts
Y | SXY . By DAG-isomorphism, it can be


X all the paths between X and Y in G0 and, thus, can


shown that G1 contains the skeleton of B , and, then, act as an admissible set SXY . Assuming the trian-
that it also contains its moralization edges.  gulation algorithm produces separators as small as
In the second phase, which takes an I-map G 1 as possible, selecting the smallest one should keep sets
input, L EARN MN detects the superfluous edges in SXY sufficiently small to allow independence tests.
G1 and iteratively removes them. As for the first case, there is no separator between X
Proposition 1. Assuming DAG-isomorphism and and Y , so the junction tree is helpless. However, we
sound statistical tests, the graph G 2 returned by do not need assigning to SXY all the neighbors of X
L EARN MN is the optimal MN G . or Y , but only those that belong to the same bicon-
nected component as X and Y (as only those can be
Sketch of proof. At the end of phase 2, (X, Y ) on the chains between X and Y ). Such components
G2 , there exists SXY s.t. X (V,E2 \{(X,Y )}) Y | can be computed quickly by depth first search algo-
SXY and X 6 Y | SXY . We show using DAG-


rithms. However, as we shall see, the triangulation


isomorphism that this phase cannot delete any edge algorithm we used determines them implicitly.
in G and, then, that it discards all other edges.  In order to produce triangulations in polynomial
It is well known that the main shortcoming of in- time (determining an optimal one is NP-hard), we
dependence test-based methods is the unreliability used the simplicial and almost-simplicial rules ad-
of high-order tests, so we must keep sets S XY as vocated by (Eijkhof et al., 2002) as well as some
small as possible. Finding small SXY requires a heuristics. These rules give high-quality triangula-
more sophisticated approach than simply using the tions but are time-consuming. So, to speed-up the
set of neighbors of one of X or Y . Actually, the best process, we used incremental triangulations as sug-
set is the smallest one that cuts all the paths between gested by (Flores et al., 2003). The idea is to update
X and Y . This can be computed using a min-cut in the triangulation only in the maximal prime sub-
a max-flow problem (Tian et al., 1998). Although graphs of G0 involved in the modifications resulting
the set computed is then optimal w.r.t. the quality from L EARN MNs lines 4 and 8. In addition to the
of the independence test, it has a major drawback: join tree, a join tree of max prime subgraphs is thus
computing a min-cut for all possible pairs of nodes maintained by the algorithm. It turns out that merg-
(X, Y ) is too prohibitive when the graph involves ing in this tree the adjacent cliques that are linked
numerous variables. Moreover, pairs of close nodes by a separator containing more than one node of V
require redundant computations. An attractive alter- precisely produces G0 s biconnected components.
native results from the similarities between min-cuts Once the join tree constructed, extracting for each
and junction trees: a min-cut separates the MN into pair of nodes (X, Y ) not belonging to the same
several connected components, just as a separator clique its minimal separator is achieved by the col-

lect/distribute algorithms below. In the latter, mes- Because we use a polynomial algorithm to tri-
sages Mij transmitted from a clique Cj to Ci are angulate the maximal prime subgraphs, the com-
vectors of pairs (X, S) such that the best condi- plexity of the whole process recovering the opti-
tioning sets between nodes Y in Ci \(Ci Cj ) mal MN is also polynomial. Hence, under DAG-
and X is S. An illustrative example is given on isomorphism, we obtain the moral graph of the op-
Figure 1: messages near solid arrows result from timal BN in polynomial time. As the first step of
C OLLECT({BCD},{BCD}) and those in dashed inference algorithms consists in moralizing the BN,
arrows from D ISTRIBUTE({BCD},{BCD},). and triangulating this moral graph to produce a sec-
ondary structure well-suited for efficient computa-
Algorithm C OLLECT tions, the MN we obtain is optimal w.r.t. inference.
Input : pair (Ci , Cj )
Output : message Mij
1. Mij an empty vector message 4 From the Markov Net to the Bayes Net
2. foreach neighbor Ck of Ci except Cj do
3. Mik Collect(Ck , Ci ) Once the MN is obtained, we transform it into a BN
4. foreach pair (X, S) in Mik do B0 using algorithm MN2BN.
5. foreach node Y in Ci \(Ci Ck ) do
6. if SXY 6= S then Algorithm MN2BN
7. SXY S Input : UG G = (V, E)
8. done Output : DAG B = (V 0 , A)
9. add (X, min(Ci Ck , S)) to Mij 1. A ; V 0 V
10. done 2. While |V| > 1 Do
11. done 3. Choose X V in a single clique of G
12. foreach node X in Ci \(Ci Cj ) do 4. E {Y V : (Y, X) E}
13. add (X, Ci Cj ) to Mij 5. For all Y E Do
14. return Mij 6. E E\{(Y, X)} and A A {(Y X)}
End of algorithm 7. V V\{X}
8. For all (Y, Z) E E Do
9. G D EMORALIZE(G, (Y, Z))
Algorithm D ISTRIBUTE 10. Done
Input : pair (Ci , Cj ), message Nij 11. Return B = (V 0 , A)
Output :
1. foreach pair (X, S) in Nij do End of algorithm
2. foreach node Y in Ci \(Ci Cj ) do
3. if SXY 6= S then Algorithm D EMORALIZE
4. SXY S Input : UG G = (V, E) ; Edge (X, Y ) E
5. done Output : UG G 0 = (V, E 0 )
6. done 1. E 0 E\{(X, Y )}
7. foreach neighbor Ck of Ci except Cj do 2. Search Z V s.t. X (V,E 0 ) Y | Z
8. N empty message vector
Y |Z Then Return G 0 = (V, E 0 )


3. If X
9. foreach pair (X, S) in Nij do
4. Else Return G 0 = (V, E)
10. add (X, min(Ci Ck , S)) to N
11. call Distribute (Ck , Ci , N) End of algorithm
12. N0 message Mik sent during collect
13. foreach pair (X, S) in N0 do Before stating the properties of MN2BN, let us
14. add (X, min(Ci Ck , S)) to Nij illustrate it with an example. Consider the graphs in
15. done Figure 2. Assuming a DAG-isomorph distribution,
End of algorithm B represents the optimal BN. Suppose we obtained
the optimal MN G : we now apply MN2BN to it. In
Figure 1: Messages sent during Collect/diffusion (a join tree with cliques such as AB, BCD, CE, BD, BDF and BG; messages such as (A,{B}), (E,{C}), (G,{B}) and (F,{B,D}) are exchanged along its edges).

G = G*, F and R belong to a single clique. Assume MN2BN chooses F. After saving its neighbors set in G (l.4), F is removed from G (l.7) as well as its adjacent edges (l.6), which results in a new current graph G = G1. The empty DAG B is altered by adding arcs from these neighbors toward F, hence resulting in the next current DAG B = B1. By directing F's adjacent arcs toward F, we may create

Figure 2: A MN2BN run example (the sequence of undirected graphs G, G1, ..., G6 over the nodes C, F, H, J, K, L, R and the corresponding DAGs B, B1, ..., B6 = B0).
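As a rough illustration of the MN2BN idea sketched in Figure 2, the following Python code is ours, not the authors' implementation: it omits the fill-in fallback used when no node lies in a single clique, and for brevity its demoralization step conditions on the remaining common neighbours rather than searching a separator as D EMORALIZE does; `ci_test` is an assumed user-supplied conditional-independence oracle.

```python
def neighbours(edges, n):
    return {v for e in edges if n in e for v in e if v != n}

def mn2bn(nodes, edges, ci_test):
    """Sketch: pick a node X lying in a single clique, direct its edges toward X,
    remove X, then try to delete candidate moralization edges between X's
    former neighbours using the supplied (hypothetical) CI oracle."""
    edges = {frozenset(e) for e in edges}
    remaining = set(nodes)
    arcs = []
    while len(remaining) > 1:
        x = next(n for n in remaining
                 if all(frozenset((a, b)) in edges
                        for a in neighbours(edges, n)
                        for b in neighbours(edges, n) if a != b))
        nb = sorted(neighbours(edges, x))
        for y in nb:
            edges.discard(frozenset((x, y)))
            arcs.append((y, x))                     # arcs Y -> X
        remaining.discard(x)
        for i, y in enumerate(nb):                  # candidate moralization edges
            for z in nb[i + 1:]:
                if frozenset((y, z)) in edges and ci_test(y, z, set(nb) - {y, z}):
                    edges.discard(frozenset((y, z)))  # demoralize: drop the edge
    return arcs

# Tiny usage with an always-dependent test (keeps every moralization edge):
print(mn2bn(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")],
            ci_test=lambda y, z, cond: False))
# e.g. [('B', 'A'), ('C', 'A'), ('C', 'B')] (order depends on the node picked first)
```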

V-structures, hence some of the edges remaining in DAG B, its moral graph G and a node X belonging
G between F s neighbors may correspond to mor- to a single clique of G, there exists a DAG B 0 s.t.: i)
alization edges and may thus be safely discarded. B 0 is an I-map of B, ii) G is the moral graph of B 0
To this end, for each candidate edge, D EMORALIZE and iii) X has no child in B 0 . Finally, we prove the
tests whether its deletion would introduce spurious proposition by induction on V. 
independence in the DAG by searching a separator
B0 encodes at least as many independences as G
in the current remaining undirected graph G. If not,
and possibly more if some moralization edges have
we discard it from G. In our example, there is only
been discarded. In the case where MN2BN selects
one candidate edge, (C, J). We therefore compute
nodes in the inverse topological order of B , B0 is
a set separating C from J in G1 without (C, J), say
actually B . However, this should not happen very
J | K. As this independence


K, and test if C
often and recovering B requires in general to refine
does not hold in a distribution exactly encoded by
B0 by identifying all remaining hidden V-structures.
B , the test fails and the edge is kept. For the second
Under DAG-isomorph distribution and sound statis-
iteration, only node R belongs to a single clique.
tical tests, the second phase of GES (Chickering,
The process is similar but, this time, moralization
2002) is known to transform an I-map of the genera-
edge (K, L) can be discarded. The algorithm then
tive distribution into B . Applied to B0 , it can there-
goes on similarly and finally returns B 6 = B0 . It
fore be used as the refinement phase of our algo-
is easily seen that B0 is a DAG which is an I-map
rithm. This one has an exponential complexity but
of B and that its moral graph is G . Note that by
benefits from an anytime property, in the sense that
mapping G into B6 , not all moralization edges have
it proceeds by constructing a sequence of BNs shar-
been identified. For instance, (C, F ) was not.
ing the same moral graph and the quality of which
Now, let us explain why MN2BN produces
converges monotonically from B0 to B . This last
DAGs closely related to the optimal BN we look for.
phase preserves the GES score-based optimality.
Proposition 2. Assuming a DAG-isomorph distri- With real-world databases, L EARN MN can fail
bution and sound statistical tests, if MN2BN is ap- to recover precisely G . In this case, departing from
plied to G , the returned graph B0 is a DAG that is our theoretical framework, we lose our guarantee
an I-map of B , and its moral graph is G . More- of accuracy. We also lose the guarantee to always
over, the algorithm runs in polynomial time. find a node belonging to a single clique. However,
Sketch of proof. By induction on V, we prove that, MN2BN can easily be modified to handle this situ-
at each iteration, G is the moral graph of a subgraph ation: if no node is found on line 3, just add edges to
of B . This entails the existence of a node in a single G so that a given node forms a clique with its neigh-
clique at each iteration. Then we prove that given a bors. Whatever the node chosen, B0 is guaranteed

to be an I-map of the distribution represented by the for GES phase 1 but, as the toolkit does not provide
input MN. However, the node should be carefully the network found at the end of phase 1, we used
chosen to avoid deviating too much from the MN the better one resulting from phase 2. For both GES
passed in argument to MN2BN, as each edge addi- and K2, we did not consider directly the output BN
tion hides conditional independences. but its moral graph, to compare it with the MN out-
put by LMN. Note that the GES toolkit was unable
5 Related works and experiments to handle some of the graphs. Both time and quality
results reported are means and standard deviations
In this paper, we have presented an algorithm that
for the 10 databases. The first table shows running
exploits attractive features of the MN space to
times in seconds for LMN and GES (on a 3.06GHz
quickly compute an undirected I-map close to the
PC). The second one shows for each algorithm the
optimal one. This graph is then directed and fed to
number of edges spuriously added (+) and missed (-
GES second phase to obtain an optimal BN. The key
) w.r.t. the original BN moral graph (whose number
idea is to learn as much as possible the polynomi-
of edges appears in the first row).
ally accessible information of the generative distri-
According to these preliminary tests, we consider
bution, namely its undirected part, and postpone to
our approach as promising. W.r.t GES, we find
the end the task concentrating the learning computa-
that LMN is not time-consuming, as GES running
tional complexity, namely the V-structure recovery.
time grows much faster than that of LMN when
Most algorithms of the constraint-based approach,
database size increases. As for networks quality, it
like IC (Verma and Pearl, 1990) or PC (Spirtes et al.,
is noticeable that, for all algorithms, the smaller the
2000), deal at some stage with undirected learning
database, the weaker the dependences represented
but they do not explicitly separate this stage from
in the database and hence the fewer the edges recov-
the directed one. That is, they generally search a
ered. LMN finds clearly more edges than GES and
power set to extract V-structures information before
K2, despite the latters ordering knowledge which
achieving a whole undirected search. In (Dijk et al.,
gives it a strong advantage. LMN adds fewer spu-
2003), the two phases are disjoint but the authors are
rious edges than GES but more than K2 (which ex-
more concerned with finding the skeleton (which is
ploits its ordering knowledge). We think that LMN
not an I-map) rather than the moral graph. As they
did not detect some of the unnecessary edges be-
only allow CI tests conditioned by a set of size no
cause nodes involved had so many neighbors that
greater than 1, the search is mostly delegated to a
2 tests were meaningless, despite our efforts. This
directed score-based search. (Cheng et al., 2002)
suggests alternating multiple phases of edge addi-
makes much stronger distribution assumptions.
tions and deletions, so as to avoid situations dete-
Like GES, under DAG-isomorphism and sound
riorating due to big clusters of neighbors. We also
CI test hypotheses, our method is able to find the
should try to tune the 2 confidence probability.
optimal BN it looks for. However, these hypothe-
ses probably do not hold in practice. To assess the
discrepancy between theory and real world, we per- References
formed a series of experiments on classical bench-
D. Allen and A. Darwiche. 2003. Optimal time-space
marks of the Bayes Net Repository1 . For each BN, tradeoff in probabilistic inference. In proceedings of
we generated by logic sampling 10 databases of size IJCAI, pages 969975.
500, 2000 and 20000. For each database, we ran
J. Cheng, R. Greiner, J. Kelly, D.A. Bell, and W. Liu.
L EARN MN, GES (the WinMine toolkit software) 2002. Learning Bayesian networks from data: An
and a K2 implemented with a BIC score and fed information-theory based approach. Artificial Intel-
with the original BN topological ordering. K2 is ligence, 137(12):4390.
thus already aware of some crucial learning infor-
D.M. Chickering. 2002. Optimal structure identification
mation that LMN and GES must discover. For GES, with greedy search. Journal of Machine Learning Re-
we recorded as time response only the time required search, 3:507554.
1
http://compbio.cs.huji.ac.il/Repository R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J.

alarm barley carpo hailfinder insurance mildew water
size Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev
20000 LMN 3.65 0.82 9.38 0.73 11.96 2.67 15.56 1.31 3.59 0.1 3.29 0.37 7.48 0.78
GES 22.36 1.72 75.91 4.19 16.74 4.22 18.36 3.91
2000 LMN 0.98 0.28 1.46 0.46 1.79 0.68 2.36 0.65 1.05 0.12 0.56 0.08 1.37 0.21
GES 1.36 0 7.27 0.52 1.5 0 1.55 0
500 LMN 0.47 0.15 0.31 0.01 0.71 0.27 1.24 0.39 0.63 0.2 0.28 0.02 0.44 0.12
GES 0.64 0 0.64 0 0.5 0 0.45 0

alarm (65) barley (126) carpo (102) hailfinder (99) insurance (70) mildew (80) water (123)
size Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev Avg. Stdev
20000 GES + 13.6 2.69 26.6 3.85 17.6 5.14 17.9 2.34
GES - 39.7 1.1 38.7 4.22 44.4 3.17 94.6 2.15
LMN + 2.1 1.3 15.7 1.1 9.4 1.28 9.7 3 0.4 0.49 3.2 0.75 0 0
LMN - 9.6 1.11 60.1 2.02 23.2 1.89 10.1 0.94 15.5 1.63 22.3 1.55 52.3 1.85
K2 + 3.2 0.75 3 0 3.1 1.58 0 0 0 0 1 0 0 0
K2 - 19.7 1.79 87.8 0.6 31.1 2.43 26 0 19.3 1.1 59 0 71.6 2.58
2000 GES + 14 0.63 11.8 3.6 11.8 2.32 3.9 1.14
GES - 62 0.77 66.4 2.2 59.2 1.47 110.3 1.73
LMN + 3.4 0.8 13.3 2.19 7.5 3.35 2.2 1.17 0.3 0.46 4.5 1.2 0.2 0.4
LMN - 21.3 2.33 79 1.73 33.3 2.61 15.4 1.28 24.9 2.12 40.4 1.2 75 3.87
K2 + 3.2 0.6 3.1 0.3 2.3 0.78 0 0 0 0 0 0 0 0
K2 - 34.9 1.37 98.3 0.78 45.7 3.72 50.2 1.83 38.2 1.47 64.4 1.28 98.2 2.48
500 GES + 3.9 1.64 10.3 2.28 5.3 1 0.5 0.5
GES - 65 0 89.1 2.07 65.1 1.14 118.2 1.25
LMN + 7.1 2.26 8.2 1.25 5.9 2.12 2 1.26 1.1 0.94 4.4 1.5 0.4 0.92
LMN - 31.5 2.01 90.2 1.08 54.2 2.52 33.6 3.07 34.7 1.68 56.8 0.98 88.3 2.24
K2 + 3.7 0.78 3.4 0.49 4.5 2.29 0.1 0.3 0.3 0.46 0 0 0 0
K2 - 42.7 1 105.2 0.98 60.8 2.44 65.5 1.2 47.2 0.98 74.8 0.75 101.6 1.2
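The added (+) and missed (-) counts above compare the undirected structure produced by each algorithm with the moral graph of the original BN. A minimal sketch of that comparison (ours, with hypothetical argument names):

```python
def edge_differences(learned_edges, true_moral_edges):
    """Return (#spuriously added, #missed) undirected edges w.r.t. the true moral graph."""
    learned = {frozenset(e) for e in learned_edges}
    truth = {frozenset(e) for e in true_moral_edges}
    return len(learned - truth), len(truth - learned)

# e.g. a learned graph that misses (B, C) and adds (A, D):
print(edge_differences([("A", "B"), ("A", "D")], [("A", "B"), ("B", "C")]))  # (1, 1)
```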

Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer-Verlag.

S. van Dijk, L. van der Gaag, and D. Thierens. 2003. A skeleton-based approach to learning Bayesian networks from data. In Proceedings of PKDD.

F. van den Eijkhof, H.L. Bodlaender, and A.M.C.A. Koster. 2002. Safe reduction rules for weighted treewidth. Technical Report UU-CS-2002-051, Utrecht University.

J. Flores, J. Gamez, and K. Olesen. 2003. Incremental compilation of Bayesian networks. In Proceedings of UAI, pages 233-240.

S.B. Gillispie and M.D. Perlman. 2001. Enumerating Markov equivalence classes of acyclic digraphs models. In Proceedings of UAI, pages 171-177.

D. Heckerman, D. Geiger, and D.M. Chickering. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243.

F.V. Jensen. 1996. An Introduction to Bayesian Networks. Springer-Verlag, North America.

W. Lam and F. Bacchus. 1993. Using causal information and local measures to learn Bayesian networks. In Proceedings of UAI, pages 243-250.

A.L. Madsen and F.V. Jensen. 1999. Lazy propagation: A junction tree inference algorithm based on lazy inference. Artificial Intelligence, 113:203-245.

P. Munteanu and M. Bendou. 2001. The EQ framework for learning equivalence classes of Bayesian networks. In Proceedings of the IEEE International Conference on Data Mining, pages 417-424.

I. Murray and Z. Ghahramani. 2004. Bayesian learning in undirected graphical models: Approximate MCMC algorithms. In Proceedings of UAI.

J. Pearl, D. Geiger, and T.S. Verma. 1989. The logic of influence diagrams. In R.M. Oliver and J.Q. Smith, editors, Influence Diagrams, Belief Networks and Decision Analysis. John Wiley and Sons.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufman.

P. Pérez. 1998. Markov random fields and images. CWI Quarterly, 11(4):413-437.

R.W. Robinson. 1973. Counting labeled acyclic digraphs. In F. Harary, editor, New directions in Graph Theory, pages 239-273. Academic Press.

P. Spirtes, C. Glymour, and R. Scheines. 2000. Causation, Prediction, and Search. Springer Verlag.

J. Tian, A. Paz, and J. Pearl. 1998. Finding minimal d-separators. Technical Report TR-254, Cognitive Systems Laboratory.

T.S. Verma and J. Pearl. 1990. Causal networks: Semantics and expressiveness. In Proceedings of UAI, pages 69-76.

Estimation of linear, non-gaussian causal models in the presence of confounding latent variables

P. O. Hoyer, S. Shimizu, and A. J. Kerminen
Symmetric Causal Independence Models
for Classification
Rasa Jurgelenaite and Tom Heskes
Institute for Computing and Information Sciences
Radboud University Nijmegen
Toernooiveld 1, 6525 ED Nijmegen, The Netherlands

Abstract
Causal independence modelling is a well-known method both for reducing the size of prob-
ability tables and for explaining the underlying mechanisms in Bayesian networks. In this
paper, we propose an application of an extended class of causal independence models,
causal independence models based on the symmetric Boolean function, for classification.
We present an EM algorithm to learn the parameters of these models, and study con-
vergence of the algorithm. Experimental results on the Reuters data collection show the
competitive classification performance of causal independence models based on the sym-
metric Boolean function in comparison with the noisy OR model and, consequently, with other
state-of-the-art classifiers.

1 Introduction through intermediate variables and interaction


function.
Bayesian networks (Pearl, 1988) are well- Causal independence assumptions are of-
established as a sound formalism for represent- ten used in practical Bayesian network mod-
ing and reasoning with probabilistic knowledge. els (Kappen and Neijt, 2002; Shwe et al.,
However, because the number of conditional 1991). However, most researchers restrict them-
probabilities for the node grows exponentially selves to using only the logical OR and log-
with the number of its parents, it is usually ical AND operators to define the interaction
unreliable if not infeasible to specify the con- among causes. The resulting probabilistic sub-
ditional probabilities for the node that has a models are called noisy OR and noisy AND;
large number of parents. The task of as- ence of either at least one cause or all causes
sessing conditional probability distributions be- ence of either at least one cause or all causes
comes even more complex if the model has to in- at the same time give rise to the effect. Sev-
tegrate expert knowledge. While learning algo- eral authors proposed to expand the space of in-
rithms can be forced to take into account an ex- teraction functions by other symmetric Boolean
perts view, for the best possible results the ex- functions: the idea was already mentioned but
perts must be willing to reconsider their ideas in not developed further in (Meek and Heckerman,
light of the models discovered structure. This 1997), analysis of the qualitative patterns was
requires a clear understanding of the model by presented in (Lucas, 2005), and assessment of
the domain expert. Causal independence mod- conditional probabilities was studied in (Jurge-
els (Dez, 1993; Heckerman and Breese, 1994; lenaite et al., 2006).
Srinivas, 1993; Zhang and Poole, 1996) can both Even though for some real-world problems
limit the number of conditional probabilities to the intermediate variables are observable (see
be assessed and provide the ability for models to Visscher et al. (2005)), in many problems these
be understood by domain experts in the field. variables are latent. Therefore, conditional
The main idea of causal independence models probability distributions depend on unknown
is that causes influence a given common effect parameters which must be estimated from data,
using maximum likelihood (ML) or maximum a C1 C2 ... Cn
posteriori (MAP). One of the most widespread
techniques for finding ML or MAP estimates is
H1 H2 ... Hn
the expectation-maximization (EM) algorithm.
Meek and Heckerman (1997) provided a gen-
eral scheme how to use the EM algorithm to E f
compute the maximum likelihood estimates of
the parameters in causal independence mod- Figure 1: Causal independence model
els assumed that each local distribution func-
tion is a collection of multinomial distributions. where there is a 1-1 correspondence between
Vomlel (2006) described the application of the the vertices V(G) and the random variables
EM algorithm to learn the parameters in the in V, and arcs A(G) represent the conditional
noisy OR model for classification. (in)dependencies between the variables; (2) a
The application of an extended class of causal quantitative part Pr consisting of local proba-
independence models, namely causal indepen- bility distributions Pr(V | (V )), for each vari-
dence models with a symmetric Boolean func- able V V given the parents (V ) of the corre-
tion as an interaction function, to classification sponding vertex (interpreted as variables). The
is the main topic of this paper. These mod- joint probability distribution Pr is factorised ac-
els will further be referred to as the symmetric cording to the structure of the graph, as follows:
causal independence models. We present an EM Y
algorithm to learn the parameters in symmet- Pr(V) = Pr(V | (V )).
ric causal independence models, and study con- V V
vergence of the algorithm. Experimental results
Each variable V V has a finite set of mutually
show the competitive classification performance
exclusive states. In this paper, we assume all
of the symmetric causal independence models
variables to be binary; as an abbreviation, we
in comparison with the noisy OR classifier and,
will often use v + to denote V = > (true) and
consequently, with other widely-used classifiers.
v to denote V = (false). We interpret >
The remainder of this paper is organised as
as 1 and as 0 in an arithmetic context. An
follows. In the following section, we review
expression such as
Bayesian networks and discuss the semantics of
symmetric causal independence models. In Sec-
X
g(H1 , . . . , Hn )
tion 3, we state the EM algorithm for finding (H1 ,...,Hn )=>
the parameters in symmetric causal indepen-
dence models. The maxima of the log-likelihood stands for summing g(H1 , . . . , Hn ) over all pos-
function for the symmetric causal independence sible values of the variables Hk for which the
models are examined in Section 4. Finally, Sec- constraint (H1 , . . . , Hn ) = > holds.
tion 5 presents the experimental results, and
conclusions are drawn in Section 6. 2.2 Semantics of Symmetric Causal
Independence Models
2 Symmetric Boolean Functions for Causal independence is a popular way to spec-
Modelling Causal Independence ify interactions among cause variables. The
global structure of a causal independence model
2.1 Bayesian Networks is shown in Figure 1; it expresses the idea that
A Bayesian network B = (G, Pr) represents a causes C1 , . . . , Cn influence a given common ef-
factorised joint probability distribution on a set fect E through hidden variables H1 , . . . , Hn and
of random variables V. It consists of two parts: a deterministic function f , called the interac-
(1) a qualitative part, represented as an acyclic tion function. The impact of each cause Ci
directed graph (ADG) G = (V(G), A(G)), on the common effect E is independent of each

164 R. Jurgelenaite and T. Heskes


other cause Cj , j 6= i. The hidden variable Hi the cause variables does not matter, the Boolean
is considered to be a contribution of the cause functions become symmetric (Wegener, 1987)
variable Ci to the common effect E. The func- and the number reduces to 2n+1 .
tion f represents in which way the hidden ef- An important symmetric Boolean function is
fects Hi , and indirectly also the causes Ci , in- the exact Boolean function l , which has func-
teract to yield the final effect E. Hence, the tion value true, i.e. l (H1 , . . . , Hn ) = >, if
Pn
function f is defined in such a way that when i=1 (Hi ) = l with (Hi ) equal to 1, if Hi
a relationship, as modelled by the function f , is equal to true and 0 otherwise. A symmetric
between Hi , i = 1, . . . , n, and E = > is sat- Boolean function can be decomposed in terms
isfied, then it holds that f (H1 , . . . , Hn ) = >. of the exact functions l as (Wegener, 1987):
It is assumed that Pr(e+ | H1 , . . . , Hn ) = 1 if
f (H1 , . . . , Hn ) = >, and Pr(e+ | H1 , . . . , Hn ) = n
_
0 if f (H1 , . . . , Hn ) = . f (H1 , . . . , Hn ) = i (H1 , . . . , Hn ) i (2)
i=0
A causal independence model is defined in
terms of the causal parameters Pr(Hi | Ci ), for where i are Boolean constants depending only
i = 1, . . . , n and the function f (H1 , . . . , Hn ). on the function f . For example, for the Boolean
Most papers on causal independence models as- function defined in terms of the OR operator we
sume that absent causes do not contribute to have 0 = and 1 = . . . = n = >.
the effect (Heckerman and Breese, 1994; Pearl, Another useful symmetric Boolean function is
1988). In terms of probability theory this im- the threshold function k , which simply checks

plies that it holds that Pr(h+ i | ci ) = 0; as a whether there are at least k trues among the ar-
consequence, it holds that Pr(hi | c
i ) = 1. In guments, i.e. k (I1 , . . . , In ) = >, if nj=1 (Ij )
P
this paper we make the same assumption. k with (Ij ) equal to 1, if Ij is equal to true
In situations in which the model does not cap- and 0 otherwise. To express it in the Boolean
ture all possible causes, it is useful to introduce constants we have: 0 = = k1 = and
a leaky cause which summarizes the unidentified k = = n = >. Causal independence model
causes contributing to the effect and is assumed based on the Boolean threshold function further
to be always present (Henrion, 1989). We model will be referred to as the noisy threshold models.
this leak term by adding an additional input
Cn+1 = 1 to the data; in an arithmetic context 2.3 The Poisson Binomial Distribution
the leaky cause is treated in the same way as
identified causes. Using the property of Equation (2) of the sym-
The conditional probability of the occurrence metric Boolean functions, the conditional prob-
of the effect E given the causes C1 , . . . , Cn , i.e., ability of the occurrence of the effect E given the
Pr(e+ | C1 , . . . , Cn ), can be obtained from the causes C1 , . . . , Cn can be decomposed in terms
causal parameters Pr(Hl | Cl ) as follows (Zhang of probabilities that exactly l hidden variables
and Poole, 1996): H1 , . . . , Hn are true as follows:

Pr(e+ | C1 , . . . , Cn ) Pr(e+ | C1 , . . . , Cn )
X n
Y X X n
Y
= Pr(Hi | Ci ). (1) = Pr(Hi | Ci ).
f (H1 ,...,Hn )=> i=1 0 l n l (H1 ,...,Hn ) i=1
l
In this paper, we assume that the function f in
Equation (1) is a Boolean function. However, Let l denote the number of successes in n
n
there are 22 different n-ary Boolean functions independent trials, where pi is a probability
(Enderton, 1972; Wegener, 1987); thus, the po- of success in the ith trial, i = 1, . . . , n; let
tential number of causal interaction models is p = (p1 , . . . , pn ), then B(l; p) denotes the Pois-
huge. However, if we assume that the order of son binomial distribution (Le Cam, 1960; Dar-

Symmetric Causal Independence Models for Classification 165


roch, 1964):

    B(l; p) = \prod_{i=1}^{n} (1 - p_i) \sum_{1 \le j_1 < \ldots < j_l \le n} \prod_{z=1}^{l} \frac{p_{j_z}}{1 - p_{j_z}} .

Let us define a vector of probabilistic parameters p(C_1, \ldots, C_n) = (p_1, \ldots, p_n) with p_i = \Pr(h_i^+ \mid C_i). Then the connection between the Poisson binomial distribution and the class of symmetric causal independence models is as follows.

Proposition 1. It holds that:

    \Pr(e^+ \mid C_1, \ldots, C_n) = \sum_{i=0}^{n} B(i; p(C_1, \ldots, C_n)) \, \gamma_i .

3 EM Algorithm

Let D = \{x^1, \ldots, x^N\} be a data set of independent and identically distributed settings of the observed variables in a symmetric causal independence model, where x^j = (c^j, e^j) = (c_1^j, \ldots, c_n^j, e^j). We assume that no additional information about the model is available. Therefore, to learn the parameters of the model we maximize the conditional log-likelihood

    CLL(\theta) = \sum_{j=1}^{N} \ln \Pr(e^j \mid c^j, \theta).

The expectation-maximization (EM) algorithm (Dempster et al., 1977) is a general method to find the maximum likelihood estimate of the parameters in probabilistic models, where the data is incomplete or the model has hidden variables.

Let \theta = (\theta_1, \ldots, \theta_n) be the parameters of the symmetric causal independence model where \theta_i = \Pr(h_i^+ \mid c_i^+). Then, after some calculations, the (z+1)-th iteration of the EM algorithm for symmetric causal independence models is given by:

Expectation step: For every data sample x^j = (c^j, e^j) with j = 1, \ldots, N, we form p^{(z,j)} = (p_1^{(z,j)}, \ldots, p_n^{(z,j)}) where p_i^{(z,j)} = \theta_i^{(z)} c_i^j. Let us define

    p_{\setminus k}^{(z,j)} = (p_1^{(z,j)}, \ldots, p_{k-1}^{(z,j)}, p_{k+1}^{(z,j)}, \ldots, p_n^{(z,j)}).

Subsequently, for all hidden variables H_k with k = 1, \ldots, n we compute the probability \Pr(h_k^+ \mid e^j, c^j, \theta^{(z)}), where

    \Pr(h_k^+ \mid e^j, c^j, \theta^{(z)}) = \frac{p_k^{(z,j)} \sum_{i=0}^{n-1} B(i; p_{\setminus k}^{(z,j)}) \, \gamma_{i+1}}{\sum_{i=0}^{n} B(i; p^{(z,j)}) \, \gamma_i}   if e^j = 1,

and

    \Pr(h_k^+ \mid e^j, c^j, \theta^{(z)}) = \frac{p_k^{(z,j)} \bigl( 1 - \sum_{i=0}^{n-1} B(i; p_{\setminus k}^{(z,j)}) \, \gamma_{i+1} \bigr)}{1 - \sum_{i=0}^{n} B(i; p^{(z,j)}) \, \gamma_i}   if e^j = 0.

Maximization step: Update the parameter estimates for all k = 1, \ldots, n:

    \theta_k = \frac{\sum_{1 \le j \le N} c_k^j \Pr(h_k^+ \mid e^j, c^j, \theta^{(z)})}{\sum_{1 \le j \le N} c_k^j}.

4 Analysis of the Maxima of the Log-likelihood Function

Generally, there is no guarantee that the EM algorithm will converge to a global maximum of the log-likelihood. In this section, we investigate the maxima of the conditional log-likelihood function for symmetric causal independence models.

4.1 Noisy OR and Noisy AND Models

In this section we will show that the conditional log-likelihood for the noisy OR and the noisy AND models has only one maximum. Since the conditional log-likelihood for these models is not necessarily concave, we will use a monotonic transformation to prove the absence of stationary points other than global maxima.

First, we establish a connection between the maxima of the log-likelihood function and the maxima of the corresponding composite function.
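To make Proposition 1 and the EM update above concrete, the following sketch computes the Poisson binomial distribution by dynamic programming and uses it to evaluate Pr(e+ | C_1, ..., C_n) for a threshold interaction function. It is an illustration under the paper's assumptions (absent causes cannot contribute, so p_i = theta_i c_i), not the authors' implementation, and the function names are ours; the same poisson_binomial routine, applied to p^(z,j) with the k-th entry removed, supplies the quantities needed in the expectation step.

    def poisson_binomial(p):
        # distribution of the number of successes in independent trials with
        # success probabilities p; returns [B(0; p), ..., B(n; p)]
        dist = [1.0]
        for pi in p:
            new = [0.0] * (len(dist) + 1)
            for l, prob in enumerate(dist):
                new[l] += prob * (1.0 - pi)      # trial fails
                new[l + 1] += prob * pi          # trial succeeds
            dist = new
        return dist

    def threshold_gamma(n, k):
        # Boolean constants gamma_i of the threshold function tau_k
        return [1.0 if i >= k else 0.0 for i in range(n + 1)]

    def effect_probability(theta, c, gamma):
        # Proposition 1: Pr(e+ | c, theta) = sum_i B(i; p) gamma_i with p_i = theta_i c_i
        p = [t * ci for t, ci in zip(theta, c)]
        return sum(b * g for b, g in zip(poisson_binomial(p), gamma))

    # three causes, threshold k = 1 (noisy OR): 1 - 0.2 * 0.5 = 0.9
    print(effect_probability([0.8, 0.5, 0.3], [1, 1, 0], threshold_gamma(3, 1)))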

166 R. Jurgelenaite and T. Heskes


Proposition 2. (Global optimality condi- it is a known result that the Hessian matrix of
tion for concave functions (Boyd and the log-likelihood function for logistic regression
Vandenberghe, 2004)) is negative-semidefinite, and hence the problem
Suppose h(q) : Q < is concave and differen- has no local optima, we will use transformations
tiable on Q. Then q Q is a global maximum that allow us to write the log-likelihood for the
if and only if noisy OR and noisy AND models in a similar
T form as that of the logistic regression model.
h(q ) h(q )

h(q ) = ,..., = 0. The conditional probability of the effect in a
q1 qn noisy OR model can be written:
Further we consider the function n
Pr(h
Y
Pr(e+ | c, ) = 1 i | ci )
CLL() = h(q()). i=1
n n
!
Y X
Let CLL() and h(q()) be twice differentiable =1 (1 i )ci = 1 exp ln(1 i )ci .
i=1 i=1
functions, and let q() be a differentiable, in-
jective function where (q) is its inverse. Then Let us choose a monotonic transformation qi =
the following relationship between the station- ln(1 i ), i = 1, . . . , n. Then the conditional
ary points of the functions CLL and h holds. probability of the effect in a noisy OR model
Lemma 1. Suppose, is a stationary point of equals
CLL(). Then there is a corresponding point T
Pr(e+ | c, q) = 1 eq c .
q( ), which is a stationary point of h(q()).
Let us define z j = qT cj and f (z j ) = Pr(e+ |
Proof. Since the function q() is differentiable cj , q), then the function h reads
(q1 ,...,qn )
and injective, its Jacobian matrix ( 1 ,...,n )
is N
X
positive definite. Therefore, from the chain h(q) = ej ln f (z j ) + (1 ej ) ln(1 f (z j )). (3)
rule it follows that if CLL( ) = 0, then j=1
h(q( )) = 0.
Since f 0 (z j ) = 1 f (z j ), the first derivative of
Proposition 3. If h(q()) is concave and h is
is a stationary point of CLL(), then is a N
h(q) X f 0 (z j )(ej f (z j ))
global maximum. = cj
q j=1
f (z j )(1 f (z j ))
N
Proof. If is a stationary point, then from
X ej f (z j )
= cj .
Lemma 1 it follows that q( ) is also station- j=1
f (z j )
ary. From the global optimality condition for
concave functions the stationary point q( ) is To prove that the function h is concave we need
a maximum of h(q()), thus from the definition to prove that its Hessian matrix is negative
of global maximum we get that for all semidefinite. The Hessian matrix of h reads
N
CLL() = h(q()) h(q( )) = CLL( ). 2 h(q) X 1 f (z j ) j j j T
= e c c .
qqT j=1
f (z j )2

As the Hessian matrix of h is negative semidefi-


Given Proposition 3 the absence of the lo- nite, the function h is concave. Therefore, from
cal optima can be proven by introducing such Proposition 3 it follows that every stationary
a monotonic transformation q() that the com- point of the log-likelihood function for the noisy
posite function h(q()) would be concave. As OR model is a global maximum.
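The transformation is easy to check numerically. The fragment below is an illustrative verification (not part of the paper) that the noisy OR probability 1 - prod_i (1 - theta_i)^(c_i) coincides with 1 - exp(q.c) for q_i = ln(1 - theta_i), which is the form whose composite log-likelihood h(q) is shown to be concave.

    import math

    def noisy_or(theta, c):
        no_effect = 1.0
        for t, ci in zip(theta, c):
            no_effect *= (1.0 - t) ** ci
        return 1.0 - no_effect

    def noisy_or_reparam(q, c):
        # Pr(e+ | c, q) = 1 - exp(q . c), with q_i = ln(1 - theta_i)
        return 1.0 - math.exp(sum(qi * ci for qi, ci in zip(q, c)))

    theta, c = [0.7, 0.4, 0.9], [1, 0, 1]
    q = [math.log(1.0 - t) for t in theta]
    print(noisy_or(theta, c), noisy_or_reparam(q, c))   # both 0.97 (up to rounding)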

Symmetric Causal Independence Models for Classification 167


The conditional probability of the effect in a learn the hidden parameters of the model de-
noisy AND model can be written: scribing this interaction we have to maximize
n
the conditional log-likelihood function
Y
Pr(ej+ | c, ) = Pr(h+
i | ci )
i=1 CLL() = ln[1 (1 2 )(1 3 )
n n +(1 1 )2 (1 3 ) + (1 1 )(1 2 )3 ]
!
ici = exp
Y X
= ln i ci .
+ ln[1 1 (1 3 ) (1 1 )3 ].
i=1 i=1

Let us choose a monotonic transformation qi = Depending on a choice for initial parameter


ln i , i = 1, . . . , n. Then the conditional proba- settings (0) , the EM algorithm for symmetric
bility of the effect in a noisy AND model equals causal independence models converges to one of
T the maxima:
Pr(ej+ | c, q) = eq c .
CLL()max
Let us define z j = qT cj and f (z j ) = Pr(e+ | (
0 at = (0,
1, 0),
cj , q). The function h is the same as for the =
noisy OR model in Equation (3). Combined 1.386 at { 1 , 0, 12 , 12 , 0, 3 }.
with f 0 (z j ) = f (z j ), it yields the first derivative
of h Obviously, only the point = (0, 1, 0) is a global
maximum of the log-likelihood function while
N
h(q) X f 0 (z j )(ej f (z j )) the other obtained points are local maxima.
= cj
q j=1
f (z j )(1 f (z j )) The discussed counterexample proves that in
N general case the EM algorithm for symmetric
X ej f (z j )
= cj causal independence models does not necessar-
1 f (z j )
j=1 ily converge to the global maximum.
and Hessian matrix 5 Experimental Results
N
2 h(q) X f (z j ) For our experiments we use Reuters data collec-
= (1 ej )cj cj T .
qqT j=1
(1 f (z j ))2 tion, which allows us to evaluate the classifica-
tion performance of large symmetric causal in-
Hence, the function h is concave, and the log- dependence models where the number of cause
likelihood for the noisy AND model has no other variables for some document classes is in the
stationary points than the global maxima. hundreds.
4.2 General Case 5.1 Evaluation Scheme
The EM algorithm is guaranteed to converge to Since we do not have an efficient algorithm to
the local maxima or saddle points. Thus, we perform a search in the space of symmetric
can only be sure that the global maximum, i.e. Boolean functions, we chose to model the in-
a point such that CLL( ) CLL() for all teraction among cause and effect variables by
6= , will be found if the log-likelihood has means of Boolean threshold functions, which
neither saddle points nor local maxima. How- seem to be the most probable interaction func-
ever, the log-likelihood function for a causal in- tions for the given domains.
dependence model with any symmetric Boolean Given the model parameters , the testing
function does not always fulfill this requirement data Dtest and the classification threshold 21 , the
as it is shown in the following counterexample. classifications and misclassifications for both
Example 1. Let us assume a data set D = classes are computed. Let tp (true positives)
{(1, 1, 1, 1), (1, 0, 1, 0)} and an interaction func- stand for the number of data samples (cj , ej+ )
tion 1 , i.e. 1 = 1 and 0 = 2 = 3 = 0. To Dtest for which Pr(e+ | cj , ) 12 and fp (false

168 R. Jurgelenaite and T. Heskes


positives) stand for the number of data samples
Table 1: Classification accuracy for symmetric
(cj , ej+ ) Dtest for which Pr(e+ | cj , ) < 21 .
causal independence models with the interac-
Likewise, tn (true negatives) is the number of
tion function k , k = 1, . . . , 4 for Reuters data
data samples (cj , ej ) Dtest for which Pr(e+ |
set; NClass is number of documents in the cor-
cj , ) < 21 and fp (false positives) is the num-
responding class.
ber of data samples (cj , ej ) Dtest for which
Pr(e+ | cj , ) 21 . To evaluate the classifica- Class NClass 1 2 3 4
tion performance we use accuracy, which is a earn 1087 96.3 97.2 97.2 96.8
measure of correctly classified cases, acq 719 93.1 93.2 93.2 93.0
tp + tn crude 189 98.1 98.1 97.6 97.7
= , money-fx 179 95.8 95.8 95.9 96.0
tp + tn + f n + f p
grain 149 99.2 99.0 98.2 97.9
and F-measure, which combines precision = interest 131 96.5 96.8 96.7 96.7
tp tp
tp+f p and recall = tp+f n , trade 117 96.6 97.0 97.3 97.3
2 ship 89 98.9 98.8 98.7 98.6
F = . wheat 71 99.5 99.2 98.8 98.5
+
corn 56 99.7 99.4 99.1 98.8
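The counts and measures of this section are straightforward to compute; the helper below is an illustrative sketch (not the authors' evaluation code) that applies the classification threshold 1/2 and counts positive samples with Pr(e+ | c^j, theta) < 1/2 as false negatives.

    def evaluate(probs, labels, threshold=0.5):
        tp = fp = tn = fn = 0
        for pr, e in zip(probs, labels):
            positive = pr >= threshold
            if positive and e == 1:
                tp += 1
            elif positive and e == 0:
                fp += 1
            elif not positive and e == 1:
                fn += 1
            else:
                tn += 1
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, f_measure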
5.2 Reuters Data Set
We used the Reuters-21578 text categorization
collection containing the Reuters new stories 6 Discussion
preprocessed by Karciauskas (2002). The train-
ing set contained 7769 documents and the test- In this paper, we discussed the application of
ing set contained 3018 documents. For every symmetric causal independence models for clas-
of the ten document classes the most informa- sification. We developed the EM algorithm to
tive features were selected using the expected learn the parameters in symmetric causal inde-
information gain as a feature selection criteria, pendence models and studied its convergence.
and each document class was classified sepa- The reported experimental results indicate that
rately against all other classes. We chose to it is unnecessary to restrict causal independence
use the same threshold for the expected infor- models to only two interaction functions, logical
mation gain as in (Vomlel, 2006), the number OR and logical AND. Competitive classification
of selected features varied from 23 for the corn performance of symmetric causal independence
document class to 307 for the earn document models present them as a potentially useful ad-
class. While learning the values of the hidden ditional tool to the set of classifiers.
parameters the EM algorithm was stopped af- The current study has only examined the
ter 50 iterations. The accuracy and F-measure problem of learning conditional probabilities of
for causal independence models with the thresh- hidden variables. The problem of learning an
old interaction function k = 1, . . . , 4 are given optimal interaction function has not been ad-
in tables 1 and 2. Even though the threshold dressed. Efficient search in symmetric Boolean
to select the relevant features was tuned for the function space is a possible direction for future
noisy OR classifier, for 5 document classes the research.
causal independence models with other interac-
tion function than logical OR provided better Acknowledgments
results. This research, carried out in the TimeBayes
The accuracy and F-measure of the noisy OR project, was supported by the Netherlands Or-
model and a few other classifiers on the Reuters ganization for Scientific Research (NWO) under
data collection reported in (Vomlel, 2006) show project number FN4556. The authors are grate-
the competitive performance of the noisy OR ful to Gytis Karciauskas for the preprocessed
model. Reuters data. We would also like to thank Jir

Symmetric Causal Independence Models for Classification 169


report ICISR06014, Radboud University Ni-
Table 2: Classification F-measure for symmet- jmegen.
ric causal independence models with the inter-
action function k , k = 1, . . . , 4 for Reuters data H.J. Kappen and J.P. Neijt. 2002. Promedas, a prob-
abilistic decision support system for medical diag-
set; NClass is number of documents in the cor-
nosis. Technical report, SNN - UMCU.
responding class.
Class NClass 1 2 3 4 G. Karciauskas. 2002. Text Categorization using Hi-
erarchical Bayesian Network Classifiers. Master
earn 1087 95.0 96.1 96.1 95.6 thesis, Aalborg University.
acq 719 85.3 84.3 84.5 83.8 L. Le Cam. 1960. An approximation theorem for the
crude 189 84.5 85.7 80.7 81.0 Poisson binomial distribution. Pacific Journal of
money-fx 179 60.9 62.1 62.6 62.7 Mathematics, 10:11811197.
grain 149 92.7 89.9 80.7 77.2
P.J.F. Lucas. 2005. Bayesian network modelling
interest 131 40.2 55.0 53.3 54.0 through qualitative patterns. Artificial Intelli-
trade 117 51.0 61.2 63.7 63.7 gence, 163:233263.
ship 89 79.5 77.7 74.5 71.5
C. Meek and D. Heckerman. 1997. Structure and
wheat 71 90.3 81.8 71.4 66.2 parameter learning for causal independence and
corn 56 91.8 83.6 72.5 61.5 causal interaction models. In Proceedings of the
Thirteenth Conference on Uncertainty in Artifi-
cial Intelligence, pages 366375.
Vomlel for sharing his code and insights. J. Pearl. 1988. Probabilistic Reasoning in Intelligent
Systems: Networks of Plausible Inference. Mor-
gan Kauffman Publishers.
References
M.A. Shwe, B. Middleton, D.E. Heckerman, M.
S. Boyd and L. Vandenberghe. 2004. Convex Opti- Henrion, E.J. Horvitz, H.P. Lehmann and G.F.
mization. Cambridge University Press. Cooper. 1991. Probabilistic diagnosis using a re-
formulation of the INTERNIST-1/QMR knowl-
J. Darroch. 1964. On the distribution of the number edge base, I the probabilistic model and in-
of successes in independent trials. The Annals of ference algorithms. Methods of Information in
Mathematical Statistics, 35:13171321. Medicine, 30:241255.
A.P. Dempster, N.M. Laird and D.B. Rubin. 1977. S. Srinivas. 1993. A generalization of the noisy-or
Maximum likelihood from incomplete data via the model. In Proceedings of the Ninth Conference on
EM algorithm. Journal of the Royal Statistical So- Uncertainty in Artificial Intelligence, pages 208
ciety, 39:138. 215.
F.J. Dez. 1993. Parameter adjustment in Bayes net-
S. Visscher, P.J.F. Lucas, M. Bonten and K.
works. The generalized noisy OR-gate. In Proceed-
Schurink. 2005. Improving the therapeutic perfor-
ings of the Ninth Conference on Uncertainty in
mance of a medical Bayesian network using noisy
Artificial Intelligence, pages 99105.
threshold models. In Proceedings of the Sixth In-
H.B. Enderton. 1972. A Mathematical Introduction ternational Symposium on Biological and Medical
to Logic. Academic Press, San Diego. Data Analysis, pages 161172.

D. Heckerman and J.S. Breese. 1994. A new look at J. Vomlel. 2006. Noisy-or classifier. International
causal independence. In Proceedings of the Tenth Journal of Intelligent Systems, 21:381398.
Conference on Uncertainty in Artificial Intelli-
I. Wegener. 1987. The Complexity of Boolean Func-
gence, pages 286292.
tions. John Wiley & Sons, New York.
M. Henrion. 1989. Some practical issues in con-
N.L. Zhang and D. Poole. 1996. Exploiting causal in-
structing belief networks. Uncertainty in Artificial
dependence in Bayesian networks inference. Jour-
Intelligence, 3:161173.
nal of Artificial Intelligence Research, 5:301328.
R. Jurgelenaite, T. Heskes and P.J.F. Lucas. 2006.
Noisy threshold functions for modelling causal
independence in Bayesian networks. Technical

170 R. Jurgelenaite and T. Heskes


Complexity Results for Enhanced Qualitative Probabilistic
Networks
Johan Kwisthout and Gerard Tel
Department of Information and Computer Sciences
University of Utrecht
Utrecht, The Netherlands

Abstract
While qualitative probabilistic networks (QPNs) allow the expert to state influences
between nodes in the network as influence signs, rather than conditional probabilities,
inference in these networks often leads to ambiguous results due to unresolved trade-offs
in the network. Various enhancements have been proposed that incorporate a notion
of strength of the influence, such as enhanced and rich enhanced operators. Although
inference in standard (i.e., not enhanced) QPNs can be done in time polynomial to the
length of the input, the computational complexity of inference in such enhanced networks
has not been determined yet. In this paper, we introduce relaxation schemes to relate
these enhancements to the more general case where continuous influence intervals are
used. We show that inference in networks with continuous influence intervals is NP -hard,
and remains NP -hard when the intervals are discretised and the interval [1, 1] is divided
into blocks with length of 14 . We discuss membership of NP, and show how these general
complexity results may be used to determine the complexity of specific enhancements to
QPNs. Furthermore, this might give more insight in the particular properties of feasible
and infeasible approaches to enhance QPNs.

1 Introduction (van der Gaag et al., 2006), or in applications


where the exact probability distribution is un-
While probabilistic networks (Pearl, 1988) are known or irrelevant (Wellman, 1990).
based on a intuitive notion of causality and
Nevertheless, this qualitative abstraction
uncertainty of knowledge, elicitating the re-
leads to ambiguity when influences with con-
quired probabilistic information from the ex-
trasting signs are combined. Enhanced QPNs
perts can be a difficult task. Qualitative prob-
have been proposed (Renooij and van der Gaag,
abilistic networks (Wellman, 1990), or QPNs,
1999) in order to allow for more flexibility in de-
have been proposed as a qualitative abstrac-
termining the influences (e.g., weakly or strongly
tion of probabilistic networks to overcome this
positive) and partially resolve conflicts when
problem. These QPNs summarise the condi-
combining influences. Also, mixed networks
tional probabilities between the variables in the
(Renooij and van der Gaag, 2002) have been
network by a sign, which denotes the direction
proposed, to facilitate stepwise quantification
of the effect. In contrast to quantitative net-
and allowing both qualitative and quantitative
works, where inference has been shown to be
influences to be modelled in the network.
NP -hard (Cooper, 1990), these networks have
efficient (i.e., polynomial-time) inference algo- Although inference in quantitative networks
rithms. QPNs are often used as an intermedi- is NP -hard, and polynomial-time algorithms are
ate step in the construction of a probabilistic known for inference in standard qualitative net-
network (Renooij and van der Gaag, 2002), as works, the computational complexity of infer-
a tool for verifying properties of such networks ence in enhanced networks has not been deter-
mined yet. In this paper we recall the defini- we define S (A, B, ti ) as the influence S , with
tion of QPNs in section 2, and we introduce a {+, , 0, ?}, from a node A on a node B
framework to relate various enhancements, such along trail ti , we can formalise these properties
as enhanced, rich enhanced, and interval-based as shown in table 1. The - and -operators
operators in section 3. In section 4 we show that that follow from the transitivity and composi-
inference in the general, interval-based case is tion properties are defined in table 2.
NP -hard. In section 5 we show that it remains
NP -hard if we use discrete - rather than contin-
uous - intervals. Furthermore, we argue that,
symmetry S (A, B, ti )
although hardness proofs might be nontrivial
to obtain, it is unlikely that there exist poly- S (B, A, t1
i )
0
nomial algorithms for less general variants of transitivity S (A, B, ti ) S (B, C, tj )
enhanced networks, such as the enhanced and 0
S (A, C, ti tj )
rich enhanced operators suggested by Renooij 0
composition S (A, B, ti ) S (A, B, tj )
and Van der Gaag (1999). Finally, we conclude 0

our paper in section 6. S (A, B, ti tj )


0
) 00 0
00 )
associativity S ( = S (
2 Qualitative Probabilistic Networks distribution S (
0
) 00
= S (
00
)( 0 00 )

In qualitative probabilistic networks, a directed


acyclic graph G = (V, A) is associated with a set Table 1: Properties of qualitative influences
of qualitative influences and synergies (Well-
man, 1990), where the influence of one node to
another is summarised by a sign1 . For example,
a positive influence of a node A on its succes- + 0 ? + 0 ?
sor B, denoted with S + (A, B), expresses that + + 0 ? + + ? + ?
higher values for A make higher values for B + 0 ? ? ?
more likely than lower values, regardless of in- 0 0 0 0 0 0 + 0 ?
fluences of other nodes on B. In binary cases, ? ? ? 0 ? ? ? ? ? ?
with a > a and b > b, this can be summarised
as Pr(b | ax) Pr(b | ax) 0 for any value of x Table 2: The - and -operator for combining signs
of other predecessors of B. Negative influences,
denoted by S , and zero influences, denoted by
S 0 , are defined analogously. If an influence is Using these properties, an efficient (poly-
not positive, negative, or zero, it is ambiguous, nomial time) inference algorithm can be con-
denoted by S ? . Influences can be direct (causal structed (Druzdzel and Henrion, 1993a) that
influence) or induced (inter-causal influence or propagates observed node values to other neigh-
product synergy). In the latter case, the value bouring nodes. The basic idea of the algorithm,
of one node influences the probabilities of values given in pseudo-code in figure 1, is as follows.
of another node, given a third node (Druzdzel When entering the procedure, a node I is in-
and Henrion, 1993b). stantiated with a + or a (i.e., trail = ,
Various properties hold for these qualitative from = to = I and msign = + or ). Then,
influences, namely symmetry, transitivity, com- this node sign is propagated through the net-
position, associativity and distribution (Well- work, following active trails and updating nodes
man, 1990; Renooij and van der Gaag, 1999). If when needed. Observe from table 2 that a node
can change at most two times: from 0 to +,
1
Note that the network Q = (G, ) represents in- , or ?, and then only to ?. This algorithm
finitely many quantitative probabilistic networks that re-
spect the restrictions on the conditional probabilities de- visits each node at most two times, and there-
noted by the signs of all arcs. fore halts after a polynomial amount of time.
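The two sign-combination operators of Table 2 (transitivity and composition) translate directly into code. The sketch below is illustrative, with assumed names sign_product and sign_sum rather than code from the paper; it also shows the ambiguity that arises when trails with opposite signs are composed, which is the trade-off problem the enhancements discussed next try to address.

    # signs are '+', '-', '0' and '?'
    def sign_product(a, b):          # transitivity: chaining influences along a trail
        if a == '0' or b == '0':
            return '0'
        if a == '?' or b == '?':
            return '?'
        return '+' if a == b else '-'

    def sign_sum(a, b):              # composition: parallel trails between the same nodes
        if a == '0':
            return b
        if b == '0':
            return a
        return a if a == b else '?'

    # combining a positive and a negative trail yields an unresolved trade-off
    print(sign_sum(sign_product('+', '+'), sign_product('+', '-')))   # '?'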

172 J. Kwisthout and G. Tel


procedure PropagateSign(trail, from, to, msign): than one, which is impossible. Since the indi-
sign[to] sign[to] msign;
trail trail { to }; vidual intervals might be estimated by experts,
for each active neighbour Vi of to this situation is not unthinkable, especially in
do lsign sign of influence between to and Vi ;
msign sign[to] lsign;
large networks. This property can be used to
if Vi 6 trail and sign[Vi ] 6= sign[Vi ] msign detect design errors in the network.
then PropagateSign(trail, to, Vi , msign). Note, that the symmetry, associativity, and
distribution property of qualitative networks do
Figure 1: The sign-propagation algorithm no longer apply in these enhancements. For ex-
ample, although a positive influence from a node
A to B along the direction of the arc also has a
3 Enhanced QPNs
positive influence in the opposite direction, the
These qualitative influences and synergies strength of this influence is unknown. Also, the
can of course be extended to preserve a outcome of the combination of a strongly pos-
larger amount of information in the abstrac- itive, weakly positive and weakly negative sign
tion (Renooij and van der Gaag, 1999). depends on the evaluation order.
For example, given a certain cut-off value
i [r, s]
, an influence can be strongly positive
(Pr(b | ax) Pr(b | ax) ) or weakly negative [p, q] [min X, max X],

( Pr(b | ax) Pr(b | ax) 0). The basic where X = {p r, p s, q r, q s}

+ and signs are enhanced with signs for


strong influences (++ and ) and aug- i [r, s]
mented with multiplication indices to handle [p, q] [p + r, q + s] [1, 1]
complex dependencies on as a result of tran-
sitive and compositional combinations. In addi- Table 3: The i - and i -operators for interval multipli-
cation and addition
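A direct transcription of these interval operators (an illustrative sketch with assumed names, not code from the paper) also makes the empty-set case discussed below explicit: when the summed lower bounds exceed one, no admissible influence remains.

    def interval_product(a, b):      # transitivity along a trail
        p, q = a
        r, s = b
        products = [p * r, p * s, q * r, q * s]
        return (min(products), max(products))

    def interval_sum(a, b):          # composition of parallel trails, intersected with [-1, 1]
        lo, hi = a[0] + b[0], a[1] + b[1]
        lo, hi = max(lo, -1.0), min(hi, 1.0)
        return None if lo > hi else (lo, hi)   # None: empty set, inconsistent influences

    print(interval_sum((0.5, 1.0), (0.75, 1.0)))   # None: total influence would exceed one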
tion, signs such as +? and ? are used to de-
note positive or negative influences of unknown
strength. Using this notion of strength, trade- 3.1 Relaxation schemes
offs in the network can be modelled by compo- If we take a closer look at the e , r , and
sitions of weak and strong opposite signs. e operators defined in (Renooij and van der
Furthermore, an interval network can be con- Gaag, 1999) and compare them with the
structed (Renooij and van der Gaag, 2002), interval operators i and i , we can see that
where each arc has an associated influence in- the interval results are sometimes somehow
terval rather than a sign. Such an influ- relaxed. We see that symbols representing
ence is denoted as F [p,q] (A, B), meaning that influences correspond to intervals, but after the
Pr(b | ax) Pr(b | ax) [p, q]. Note that, given application of any operation on these intervals,
this definition, S + (A, B) F [0,1] (A, B), the result is extended to an interval that can
and similar observations hold for S , S 0 and be represented by one of the available symbols.
S ? . We will denote the intervals [1, 0], [0, 1], For example, in the interval model we have
[0, 0] and [1, 1] as unit intervals, being special [, 1] i [1, 1] = [ 1, 1], but, while [, 1]
cases that correspond to the traditional quali- corresponds to ++ in the enhanced model and
tative networks. The - and -operator, de- [1, 1] corresponds to ?, + + e ? =? [1, 1].
noting transitivity and composition in interval The lower limit 1 is relaxed to 1, because
networks are defined in table 3. Note that it is the actually resulting interval [ 1, 1] does
possible that a result of a combination of two not correspond to any symbol. Therefore, to
trails leads to an empty set, for example when connect the (enhanced) qualitative and interval
combining [ 21 , 1] with [ 43 , 1], which would denote models, we will introduce relaxation schemes
that the total influence of a node on another that map the result of each operation to the
node, along multiple trails, would be greater minimal interval that can be represented by

Complexity Results for Enhanced Qualitative Probabilistic Networks 173


one of the available symbols: is that the -operator is no longer associative.
The result of inference now depends on the or-
der in which various influences are propagated
Definition 1. (Relaxation scheme) through the network.
Rx will be defined as a relaxation scheme,
denoted as Rx ([a, b]) = [c, d], if Rx maps the 3.2 Problem definition
outcome [a, b] of an or operation to an To decide on the complexity of inference of this
interval [c, d], where [a, b] [c, d]. general, interval-based enhancements of QPNs,
a decision problem needs to be determined. We
In standard QPNs, the relaxation scheme state this problem, denoted as iPieqnetd2 , as
(which I will denote RI or the unit scheme) is follows.
defined as in figure 2.
iPieqnetd

[0, 1] if a 0 b > 0 Instance: Qualitative Probabilistic Net-

[1, 0] if a < 0 b 0 work Q = (G, ) with an instantiation

RI (a, b) =



[0, 0] if a = b = 0 for A V (G) and a node B V \ {A}.
[1, 1] otherwise Question: Is there an ordering on the

combination of influences such that the
Figure 2: Relaxation scheme RI (a, b)
influence of A on B [1, 1]?
Similarly, the e , r , and e operators can
be denoted with the relaxation schemes in fig- 3.3 Probability representation
ure 3, in which m equals min(i, j) and is an
In this paper, we assume that the probabili-
arbitrary cut-off value. To improve readabil-
ties in the network are represented by fractions,
ity, in the remainder of this paper the - and
denoted by integer pairs, rather than by reals.
-operators, when used without index, denote
This has the advantage, that the length of the
operators on intervals as defined in table 3.
result of addition and multiplication of fractions
( is polynomial in the length of the original num-
[1, 1] if a < 0 b > 0
Re (a, b) = bers. We can efficiently code the fractions in
[a, b] otherwise
the network by rewriting them, using their least
[m , 1] if a = i + j b common denominator. Adding or multiplying






[1, m ] if b = (i + j ) a these fractions will not affect their denomina-
tors, whose length will not change during the

[0, 1] if a b = i + j







[1, 0] if a = (i + j ) b inference process.

[0, 1] if a = (i j )

Re (a, b) = 4 Complexity of the problems

and b 0 and i < j

[1, 0] if a = (i j ) We will prove the hardness of the inference







and b 0 and i < j problem iPieqnetd by a transformation from

[1, 1] if a 0 and b 0 3sat. We construct a network Q using clauses





[a, b] otherwise C and boolean variables U , and prove that,
( upon instantiation of a node I to [1, 1], there
[1, 1] if a < 0 b > 0
Rr (a, b) = is an ordering on the combination of influences
[a, b] otherwise
such that the influence of I on a given node
Figure 3: Relaxation schemes Re , Re , and Rr Y Q \ {I} is a true subset of [1, 1], if and
only if the corresponding 3sat instance is sat-
This notion of a relaxation scheme allows us isfiable. In the network, the influence of a node
to relate various operators in a uniform way. 2
An acronym for Interval-based Probabilistic Infer-
A common property of most of these schemes ence in Enhanced Qualitative Networks

174 J. Kwisthout and G. Tel


A on a node B along the arc (A, B) is given as I
an interval; when the interval equals [1, 1] (i.e.,
Pr(b | a) = 1 and Pr(b | a) = 0) then the in- Vg
terval is omitted for readability. Note, that the A
influence of B on A, against the direction of the [ 12 , 1] [1, 12 ]
arc (A, B), equals the unit interval of the influ-
ence associated with (A, B). B [ 12 , 12 ] C
As a running example, we will construct
a network for the following 3sat instance,
introduced in (Cooper, 1990): D

Example 1. (3satex )
U = {u1 , u2 , u3 , u4 }, and C = {(u1 u2 u3 ),
(u1 u2 u3 ), (u2 u3 u4 )} Figure 4: Variable gadget Vg

This instance is satisfiable, for example with that an -operation with [1, 0] will transform
the truth assignment u1 = T , u2 = F , u3 = F , a value of [1, 12 ] in [ 12 , 1] and vice versa, and
and u4 = T . [0, 1] will not change them. We can therefore
regard an influence F [1,0] as a negation of the
4.1 Construct for our proofs truth assignment for that influence. Note, that
For each variable in the 3sat instance, our net- [1, 1] will stay the same in both cases.
work contains a variable gadget as shown If we zoom in on the clause-network in figure
in figure 4. After the instantiation of node 6, we will see that the three incoming vari-
I with [1, 1], the influence at node D equals ables in a clause, that have a value of either
[ 12 , 1][ 12 , 12 ][1, 21 ], which is either [1, 12 ], [1, 12 ], [ 12 , 1], or [1, 1], are multiplied with
[ 12 , 1] or [1, 1], depending on the order of the arc influence Fi,j to form variables, and then
evaluation. We will use the non-associativity combined with the instantiation node (with a
of the -operator in this network as a non- value of [1, 1]), forming nodes wi . Note that
deterministic choice of assignment of truth val- [1, 12 ] [1, 1] = [1, 1] [1, 1] = [0, 1] and that
ues to variables. As we will see later, an eval- [ 12 , 1] [1, 1] = [ 12 , 1]. Since a [1, 1] result in
uation order that leads to [1, 1] can be safely the variable gadget does not change by multi-
dismissed (it will act as a falsum in the clauses, plying with the Fi,j influence, and leads to the
making both x and x false), so we will concen- same value as an F variable, such an assignment
trate on [ 21 , 1] (which will be our T assign- will never satisfy the 3sat instance.
ment) and [1, 12 ] (F assignment) as the two The influences associated with these nodes wi
possible choices. are multiplied by [ 12 , 1] and added together in
We construct literals ui from our 3sat in- the clause result Oj . At this point, Oj has a
stance, each with a variable gadget Vg as value of [ k4 , 1], where k equals the number of
input. Therefore, each variable can have a literals which are true in this clause. The con-
value of [1, 21 ] or [ 12 , 1] as influence, non- secutive adding to [ 41 , 1], multiplication with
deterministicly. Furthermore, we add clause- [0, 1] and adding to [1, 1] has the function of a
networks Clj for each clause in the instance logical or -operator, giving a value for Cj of [ 43 , 1]
and connect each variable ui to a clause-network if no literal in the clause was true, and [1, 1] if
Clj if the variable occurs in Cj . The influence one or more were true.
associated with this arc (ui , Clj ) is defined as We then combine the separate clauses Cj into
F [p,q] (ui , Clj ), where [p, q] equals [1, 0] if ui a variable Y , by adding edges from each clause
is in Cj , and [0, 1] if ui is in Cj (figure 5). Note to Y using intermediate variables D1 to Dn1 .

Complexity Results for Enhanced Qualitative Probabilistic Networks 175


I The influence interval in Y has a value between
k+1
[ 34 , 1] and [ 2 2k+11 , 1], where k is the number of
clauses, if one or more clauses had a value of
[0, 1]. Note that we can easily transform the
interval in Y to a subset of [1, 1] in Y 0 by con-
1
secutively adding it to [1, 1] and [1+ 2k+1 , 1].
Vg1 Vg2 Vg3 Vg4 This would result in a true subset of [1, 1] if
and only Y was equal to [1, 1].

C1 C2 Ci Cn
u1 u2 u3 u4
F1,2 F2,1 F2,3 F3,1
F3,3
F1,1 F2,2 F3,2 F4,3 [ 12 , 1] [ 12 , 1] [ 12 , 1] [ 12 , 1]

D1
Cl3
[ 12 , 1]
Cl1 Cl2

Figure 5: The literal-clause construction Di [ 12 , 1]

u1 u2 u3 I
F2,j F3,j Y
F1,j
w3
Figure 7: Connecting the clauses
w2
w1
4.2 NP-hardness proof
[ 12 , 1] [ 12 , 1] [ 12 , 1] Using the construct presented in the previous
section, the computational complexity of the
iPieqnetd can be established as follows.
Oj
Theorem 1. The iPieqnetd problem is NP-
hard.
Proof. To prove NP -hardness, we construct a
[ 14 , 1] transformation from the 3sat problem. Let
Oj0
(U, C) be an instance of this problem, and let
[0, 1] Q(U,C) be the interval-based qualitative proba-
bilistic network constructed from this instance,
Cj
as described in the previous section. When the
Clj node I Q is instantiated with [1, 1], then I
has an influence of [1, 1] on Y (and therefore an
Figure 6: The clause construction influence on Y 0 which is a true subset of [1, 1])
if and only if all nodes Cj have a value of [1, 1],
i.e. there exists an ordering of the operators
The use of these intermediate variables allows in the variable-gadget such that at least one
us to generalise these results to more restricted literal in C is true. We conclude that (U, C)
cases. The interval of these edges is [ 12 , 1], lead- has a solution with at least one true literal in
ing to a value of [1, 1] in Y if and only if all each clause, if and only if the iPieqnetd prob-
clauses Cj have a value of [1, 1] (see figure 7). lem has a solution for network Q(U,C) , instanti-

176 J. Kwisthout and G. Tel


ation I = [1, 1] and node Y 0 . Since Q(U,C) can of the arcs until all incoming trails of Y have
be computed from (U, C) in polynomial time, been calculated always yields better results than
we have a polynomial-time transformation from a propagation sequence that doesnt have this
3sat to iPieqnetd, which proves NP -hardness property. This makes it plausible that there ex-
of iPieqnetd. ist networks where an optimal solution (i.e., a
propagation sequence that leads to the small-
4.3 On the possible membership of NP est possible subset of [1, 1] at the node we are
Although iPieqnetd has been shown to be NP - interested in) cannot be described using such a
hard, membership of the NP -class (and, as a polynomial certificate.
consequence, NP -completeness) is not trivial to
prove. To prove membership of NP, one has to 5 Operator variants
prove that if the instance is solvable, then there
exists a certificate, polynomial in length with In order to be able to represent every possi-
respect to the input of the problem, that can ble 3sat instance, a relaxation scheme must be
be used to verify this claim. A trivial certifi- able to generate a variable gadget, and retain
cate could be a formula, using the - and - enough information to discriminate between the
operators, influences, and parentheses, describ- cases where zero, one, two or three literals in
ing how the influence of the a certain node can each clause are true. Furthermore, the relax-
be calculated from the instantiated node and ation scheme must be able to represent the in-
the characteristics of the network. Unfortu- stantiations [1, 1] (or >) and [1, 1] (or ),
nately, such a certificate can grow exponentially and the uninstantiated case [0, 0]. With a re-
large. laxation scheme that effectively divides the in-
In special cases, this certificate might also be terval [1, 1] in discrete blocks with size of a
described by an ordering of the nodes in the net- multitude of 14 , (such as R 14 (a, b) = [ b4ac d4be
4 , 4 ])
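Written out, such a discretising relaxation scheme is a one-liner; the snippet below is an illustrative rendering of R_{1/4} (the helper name is an assumption, not from the paper).

    import math

    def relax_quarter(interval):
        # widen the result of an interval operation to the nearest multiples of 1/4
        a, b = interval
        return (math.floor(4 * a) / 4.0, math.ceil(4 * b) / 4.0)

    print(relax_quarter((-0.3, 0.6)))   # (-0.5, 0.75)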
work and for each node an ordering of the in- the proof construction is essentially the same
puts, which would be a polynomial certificate. as in the general case discussed in section 3.
For example, the variable gadget from figure 4 This relaxation scheme does not have any effect
can be described with the ordering I, A, B, C, D on the intervals we used in the variable gadget
plus an ordering on the incoming trails in D. and the clause construction of Q(U,C) , the net-
All possible outcomes of the propagation algo- work constructed in the NP -hardness proof of
rithm can be calculated using this description. the general case used only intervals (a, b) for
Note, however, that there are other propagation which R 1 (a, b) = (a, b). Furthermore, when
4
sequences possible that cannot be described in connecting the clauses, the possible influences
this way. For example, the algorithm might first in Y are relaxed to [0, 1], [ 41 , 1], [ 12 , 1], [ 34 , 1], and
explore the trail A B D C, and after [1, 1], so we can onstruct Y 0 by consecutively
that the trail A C D B. Then, the adding the interval in Y to [1, 1] and [ 43 , 1].
influence in D is dependent on the information Thus, the problem - which we will denote as
in C, but C is visited by D following the first relaxed-Pieqnetd - remains NP -hard for re-
trail, and D is visited by C if the second trail laxation scheme R 1 .
4
is explored. Nevertheless, this cannot lead to The non-associativity of the e - and r -
other outcomes than [1, 21 ], [ 12 , 1], or [1, 1]. operators defined in (Renooij and van der Gaag,
However, it is doubtful that such a descrip- 1999) suggest hardness of the inference problem
tion exists for all possible networks. For exam- as well. Although e and r are not associa-
ple, even if we can make a topological sort of tive, they cannot produce results that can be
a certain network, starting with a certain in- regarded as opposites. For example, the expres-
stantiation node X and ending with another sion (+ + e + e ) can lead to a positive influ-
node Y , it is still not the case that a propa- ence of unknown strength (+? ) when evaluated
gation sequence that follows only the direction as ((+ + e +) e ) or an unknown influence

Complexity Results for Enhanced Qualitative Probabilistic Networks 177


(?) when evaluated as (+ + e (+ e )), but M.J. Druzdzel and M. Henrion. 1993a. Belief Prop-
never to a negative influence. A transformation agation in Qualitative Probabilistic Networks.
In N. Piera Carrete and M.G. Singh, editors,
from a 3sat variant might not succeed because
Qualitative Reasoning and Decision Technologies,
of this reason. However, it might be possible CIMNE: Barcelona.
to construct a transformation from relaxed-
Pieqnetd, which is subject of ongoing research. M.J. Druzdzel and M. Henrion. 1993b. Intercausal
reasoning with uninstantiated ancestor nodes. In
Proceedings of the Ninth Conference on Uncer-
6 Conclusion tainty in Artificial Intelligence, pages 317325.
In this paper, we addressed the computational J. Pearl. 1988. Probabilistic Reasoning in Intelligent
complexity of inference in enhanced Qualita- Systems: Networks of Plausible Inference. Mor-
tive Probabilistic Networks. As a first step, gan Kaufmann, Palo Alto.
we have embedded both standard and en- S. Renooij and L.C. van der Gaag. 1999. Enhancing
hanced QPNs in the interval-model using relax- QPNs for trade-off resolution. In Proceedings of
ation schemes, and we showed that inference in the Fifteenth Conference on Uncertainy in Artifi-
this general interval-model is NP -hard and re- cial Intelligence, pages 559566.
mains NP -hard for relaxation scheme R 1 (a, b). S. Renooij and L.C. van der Gaag. 2002. From
4
We believe, that the hardness of inference is due Qualitative to Quantitative Probabilistic Net-
to the fact that reasoning in QPNs is under- works. In Proceedings of the Eighteenth Con-
ference in Uncertainty in Artificial Intelligence,
defined: The outcome of the inference process pages 422429.
depends on choices during evaluation. Never-
theless, further research needs to be conducted L.C. van der Gaag, S. Renooij, and P.L. Gee-
in order to determine where exactly the NP/P nen. 2006. Lattices for verifying monotonicity of
Bayesian networks. In Proceedings of the Third
border lies, in other words: which enhance- European Workshop on Probabilistic Graphical
ments to the standard qualitative model allow Models (to appear).
for polynomial-time inference, and which en-
M.P. Wellman. 1990. Fundamental Concepts of
hancements lead to intractable results. Fur- Qualitative Probabilistic Networks. Artificial In-
thermore, a definition of transitive and com- telligence, 44(3):257303.
positional combinations of qualitative influences
in which the outcome is independent of the or-
der of the influence propagation might reduce
the computational complexity of inference and
facilitate the use of qualitative models to de-
sign, validate, analyse, and simulate probabilis-
tic networks.

Acknowledgements
This research has been (partly) supported by
the Netherlands Organisation for Scientific Re-
search (NWO).
The authors wish to thank Hans Bodlaender
and Silja Renooij for their insightful comments
on this subject.

References
G.F. Cooper. 1990. The Computational Complex-
ity of Probabilistic Inference using Bayesian Belief
Networks. Artificial Intelligence, 42(2):393405.

178 J. Kwisthout and G. Tel


Decision analysis with influence diagrams
using Elviras explanation facilities
Manuel Luque and Francisco J. Dez
Dept. Inteligencia Artificial, UNED
Juan del Rosal, 16, 28040 Madrid, Spain

Abstract
Explanation of reasoning in expert systems is necessary for debugging the knowledge base,
for facilitating their acceptance by human users, and for using them as tutoring systems.
Influence diagrams have proved to be effective tools for building decision-support systems,
but explanation of their reasoning is difficult, because inference in probabilistic graphical
models seems to have little relation with human thinking. The current paper describes
some explanation capabilities for influence diagrams and how they have been implemented
in Elvira, a public software tool.

1 Introduction

Influence diagrams (IDs) are a probabilistic graphical model for modelling decision problems. They constitute a simple graphical formalism that makes it possible to represent the three main components of decision problems: the uncertainty, due to the existence of variables not controlled by the decision maker, the decisions or actions under their direct control, and their preferences about the possible outcomes.

In the context of expert systems, either probabilistic or heuristic, the development of explanation facilities is important for three main reasons (Lacave and Díez, 2002). First, because the construction of those systems with the help of human experts is a difficult and time-consuming task, prone to errors and omissions. An explanation tool can help the experts and the knowledge engineers to debug the system when it does not yield the expected results and even before a malfunction occurs. Second, because human beings are reluctant to accept the advice offered by a machine if they are not able to understand how the system arrived at those recommendations; this reluctance is especially clear in medicine (Wallis and Shortliffe, 1984). And third, because an expert system used as an intelligent tutor must be able to communicate to the apprentice the knowledge it contains, the way in which the knowledge has been applied for arriving at a conclusion, and what would have happened if the user had introduced different pieces of evidence (what-if reasoning).

These reasons are especially relevant in the case of probabilistic expert systems, because the elicitation of probabilities is a difficult task that usually requires debugging and refinement, and because the algorithms for the computation of probabilities and utilities are, at least apparently, very different from human reasoning.

Unfortunately, most expert systems and commercial tools available today, either heuristic or probabilistic, have no explanation capabilities. In this paper we describe some explanation methods developed as a response to the needs that we have detected when building and debugging medical expert systems (Díez et al., 1997; Luque et al., 2005) and when teaching probabilistic graphical models to pre- and postgraduate students of computer science and medicine (Díez, 2004). These new methods have been implemented in Elvira, a public software tool.

1.1 Elvira

Elvira¹ is a tool for building and evaluating graphical probabilistic models (Elvira Consortium, 2002) developed as a joint project of several Spanish universities. It contains a graphical interface for editing networks, with specific options for canonical models, exact and approximate algorithms for both discrete and continuous variables, explanation facilities, learning methods for building networks from databases, etc. Although some of the algorithms can work with both discrete and continuous variables, most of the explanation capabilities assume that all the variables are discrete.

2 Influence diagrams

2.1 Definition of an ID

An influence diagram (ID) consists of a directed acyclic graph that contains three kinds of nodes: chance nodes VC, decision nodes VD and utility nodes VU (see Figure 1). Chance nodes represent random variables not controlled by the decision maker. Decision nodes correspond to actions under the direct control of the decision maker. Utility nodes represent the decision maker's preferences. Utility nodes cannot be parents of chance or decision nodes. Given that each node represents a variable, we will use the terms variable and node interchangeably.

In the extended framework proposed by Tatman and Shachter (1990) there are two kinds of utility nodes: ordinary utility nodes, whose parents are decision and/or chance nodes, and super-value nodes, whose parents are utility nodes, and can be in turn of two types, sum and product. We assume that there is a utility node U0, which is either the only utility node or a descendant of all the other utility nodes, and therefore has no children.²

There are three kinds of arcs in an ID, depending on the type of node they go into. Arcs into chance nodes represent probabilistic dependency. Arcs into decision nodes represent availability of information, i.e., an arc X → D means that the state of X is known when making decision D. Arcs into utility nodes represent functional dependence: for ordinary utility nodes, they represent the domain of the associated utility function; for a super-value node they indicate that the associated utility is a function (sum or product) of the utility functions of its parents.

Figure 1: ID with two decisions (ovals), two chance nodes and three utility nodes (diamonds). There is a directed path including all the decisions and node U0.

Standard IDs require that there is a directed path that includes all the decision nodes and indicates the order in which the decisions are made. This in turn induces a partition of VC such that for an ID having n decisions {D0, . . . , Dn−1}, the partition contains n + 1 subsets {C0, C1, . . . , Cn}, where Ci is the set of chance variables C such that there is a link C → Di and no link C → Dj with j < i; i.e., Ci represents the set of random variables known for Di and unknown for previous decisions. Cn is the set of variables having no link to any decision, i.e., the variables whose true value is never known directly.

¹ At http://www.ia.uned.es/elvira it is possible to obtain the source code and several technical documents about Elvira.
² An ID that does not fulfill this condition can be transformed by adding a super-value node U0 of type sum whose parents are the utility nodes that did not have descendants. The expected utility and the optimal strategy (both defined below) of the transformed diagram are the same as those of the original one.
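The partition {C0, . . . , Cn} can be computed directly from the graph. The following short sketch (our own illustration in Python, not code from Elvira; the ID is assumed to be given as parent lists and the decisions are assumed to be listed in the order fixed by the directed path through all of them) follows the definition literally.

```python
# Illustrative sketch (not from the paper): computing the partition
# {C0, ..., Cn} of the chance variables of an ID.
def chance_partition(chance_nodes, decisions, parents):
    """`parents` maps each node to the list of its parents;
    `decisions` lists the decision nodes in the order they are made."""
    partition = [set() for _ in range(len(decisions) + 1)]
    for c in chance_nodes:
        # indices of the decisions that have an informational link C -> D_i
        informed = [i for i, d in enumerate(decisions) if c in parents[d]]
        if informed:
            partition[min(informed)].add(c)   # C_i: known for D_i, unknown before
        else:
            partition[-1].add(c)              # C_n: never observed directly
    return partition
```

For the ID of Figure 1 this would give C0 = {}, C1 = {Y} and C2 = {X}, since Y is observed between deciding T and D, and X is never observed.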



Given a chance or decision variable V, two decisions Di and Dj such that i < j, and two links V → Di and V → Dj, the former link is said to be a no-forgetting link. In the above example, T → D would be a no-forgetting link.

The variables known to the decision maker when deciding on Di are called informational predecessors of Di and denoted by IPred(Di). Assuming the no-forgetting hypothesis, we have that IPred(Di) = IPred(Di−1) ∪ {Di−1} ∪ Ci = C0 ∪ {D0} ∪ C1 ∪ . . . ∪ {Di−1} ∪ Ci.

The quantitative information that defines an ID is given by assigning to each random node C a probability distribution P(c | pa(C)) for each configuration of its parents, pa(C), assigning to each ordinary utility node U a function ψU(pa(U)) that maps each configuration of its parents onto a real number, and assigning a utility-combination function to each super-value node. The domain of each function ψU is given by its functional predecessors, FPred(U). For an ordinary utility node, FPred(U) = Pa(U), and for a super-value node FPred(U) = ∪_{U′ ∈ Pa(U)} FPred(U′).

For instance, in the ID of Figure 1, we have that FPred(U1) = {X, D}, FPred(U2) = {T} and FPred(U0) = {X, D, T}.

In order to simplify our notation, we will sometimes assume without loss of generality that for any utility node U we have that FPred(U) = VC ∪ VD.

2.2 Policies and expected utilities

For each configuration vD of the decision variables VD we have a joint distribution over the set of random variables VC:

    P(vC : vD) = ∏_{C ∈ VC} P(c | pa(C))        (1)

which represents the probability of configuration vC when the decision variables are externally set to the values given by vD (Cowell et al., 1999).

A stochastic policy for a decision D is a probability distribution defined over D and conditioned on the set of its informational predecessors, PD(d | IPred(D)). If PD is degenerate (consisting of ones and zeros only) then we say the policy is deterministic.

A strategy for an ID is a set of policies, one for each decision, Δ = {PD | D ∈ VD}. A strategy Δ induces a joint distribution over VC ∪ VD defined by

    PΔ(vC, vD) = P(vC : vD) ∏_{D ∈ VD} PD(d | IPred(D))
               = ∏_{C ∈ VC} P(c | pa(C)) ∏_{D ∈ VD} PD(d | pa(D))        (2)

Let I be an ID, Δ a strategy for I and r a configuration defined over a set of variables R ⊆ VC ∪ VD such that PΔ(r) ≠ 0. The probability distribution induced by strategy Δ given the configuration r, defined over R′ = (VC ∪ VD) \ R, is given by:

    PΔ(r′ | r) = PΔ(r, r′) / PΔ(r) .        (3)

Using this distribution we can compute the expected utility of U under strategy Δ given the configuration r as:

    EUU(Δ, r) = Σ_{r′} PΔ(r′ | r) ψU(r, r′) .        (4)

For the terminal utility node U0, EUU0(Δ, r) is said to be the expected utility of strategy Δ given the configuration r, and denoted by EU(Δ, r).

We define the expected utility of U under strategy Δ as EUU(Δ) = EUU(Δ, ∅), where ∅ is the empty configuration. We also define the expected utility of strategy Δ as EU(Δ) = EUU0(Δ). We have that

    EUU(Δ) = Σ_{r} PΔ(r) EUU(Δ, r) .        (5)

An optimal strategy is a strategy Δopt that maximizes the expected utility:

    Δopt = arg max_{Δ ∈ Δ*} EU(Δ) ,        (6)

where Δ* is the set of all strategies for I. Each policy in an optimal strategy is said to be an optimal policy. The maximum expected utility (MEU) is

    MEU = EU(Δopt) = max_{Δ ∈ Δ*} EU(Δ) .        (7)
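To make these definitions concrete, the following minimal Python sketch (ours, not from the paper; it assumes binary variables and potentials stored as dictionaries keyed by configurations) evaluates a fixed strategy by brute-force enumeration, building PΔ(vC, vD) as in Equation (2) and computing EU(Δ) as in Equations (4)-(5).

```python
# Illustrative sketch (not from the paper): brute-force evaluation of a strategy.
from itertools import product

def joint_prob(config, chance_cpts, policies):
    """P_Delta(v_C, v_D) = prod_C P(c|pa(C)) * prod_D P_D(d|IPred(D))  (Eq. 2)."""
    p = 1.0
    for var, (parents, cpt) in chance_cpts.items():
        p *= cpt[(config[var],) + tuple(config[q] for q in parents)]
    for var, (parents, table) in policies.items():   # parents = IPred(D)
        p *= table[(config[var],) + tuple(config[q] for q in parents)]
    return p

def expected_utility(chance_cpts, policies, utility):
    """EU(Delta) = sum_r P_Delta(r) * psi_U0(r)  (Eqs. 4-5 with empty evidence)."""
    variables = list(chance_cpts) + list(policies)
    eu = 0.0
    for values in product([0, 1], repeat=len(variables)):
        config = dict(zip(variables, values))
        eu += joint_prob(config, chance_cpts, policies) * utility(config)
    return eu
```

For the ID of Figure 1, `utility` would implement ψU0(x, d, t) = U1(x, d) + U2(t); an optimal strategy could then be found, at exponential cost, by maximising `expected_utility` over all deterministic policies.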

The evaluation of an ID consists in finding the MEU and an optimal strategy. It can be proved (Cowell et al., 1999; Jensen, 2001) that

    MEU = Σ_{c0} max_{d0} · · · Σ_{cn−1} max_{dn−1} Σ_{cn} P(vC : vD) ψ(vC, vD) .        (8)

For instance, the MEU for the ID in Figure 1, assuming that U0 is of type sum, is

    MEU = max_{t} Σ_{y} max_{d} Σ_{x} P(x) P(y | t, x) [U1(x, d) + U2(t)] ,        (9)

where the bracketed sum is U0(x, d, t).

2.3 Cooper policy networks

When a strategy Δ = {PD | D ∈ VD} is defined for an ID, we can convert this into a Bayesian network, called Cooper policy network (CPN), as follows: each decision D is replaced by a chance node with probability potential PD and parents IPred(D), and each utility node U is converted into a chance node whose parents are its functional predecessors (cf. Sec. 2.1); the values of the new chance variables are {+u, ¬u} and their probability is PCPN(+u | FPred(U)) = normU(ψU(FPred(U))), where normU is a bijective linear transformation that maps the utilities onto the interval [0, 1] (Cooper, 1988).

For instance, the CPN for the ID in Figure 1 is displayed in Figure 2. Please note the addition of the no-forgetting link T → D and that the parents of node U0 are no longer U1 and U2 but T, X, and D, which were chance or decision nodes in the ID.

Figure 2: Cooper policy network (CPN) for the ID in Figure 1.

The joint distribution of the CPN is:

    PCPN(vC, vD, vU) = PΔ(vC, vD) ∏_{U ∈ VU} PU(u | pa(U)) .        (10)

For a configuration r defined over a set of variables R ⊆ VC ∪ VD and U a utility node, it is possible to prove that

    PCPN(r) = PΔ(r)        (11)

    PCPN(+u | r) = normU(EUU(Δ, r)) .        (12)

This equation allows us to compute EUU(Δ, r) as norm⁻¹U(PCPN(+u | r)); i.e., the expected utility for a node U in the ID can be computed from the marginal probability of the corresponding node in the CPN.
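As an illustration of Cooper's transformation (a hypothetical sketch, not code from Elvira; the variable names and the 0-100 utility scale below are our own assumptions), the following Python fragment shows how an ordinary utility node could be turned into a binary chance node and how an expected utility is recovered from the CPN via Equations (11)-(12).

```python
# Illustrative sketch of Cooper's transformation of a utility node.
def make_norm(u_min, u_max):
    """Bijective linear map from [u_min, u_max] onto [0, 1] and its inverse."""
    span = u_max - u_min
    return (lambda u: (u - u_min) / span), (lambda p: u_min + p * span)

def utility_to_cpt(psi, u_min, u_max):
    """Replace psi_U(fpred) by P_CPN(+u | fpred) = norm_U(psi_U(fpred))."""
    norm, _ = make_norm(u_min, u_max)
    return {fpred: norm(value) for fpred, value in psi.items()}

# Hypothetical values for U2(t) of Figure 1 on an assumed 0-100 scale.
psi_u2 = {("+t",): 90.0, ("-t",): 100.0}
cpt_u2 = utility_to_cpt(psi_u2, u_min=0.0, u_max=100.0)

# Eq. (12): after propagation in the CPN, the inverse map recovers the utility.
_, inv_norm = make_norm(0.0, 100.0)
eu_u2_given_t = inv_norm(cpt_u2[("+t",)])   # = 90.0 when the evidence fixes T = +t
```

The inference itself is ordinary Bayesian-network propagation on the CPN; the sketch only shows the normalisation step that links utilities and probabilities.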
3 Explanation of influence diagrams in Elvira

3.1 Explanation of the model

The explanation of IDs in Elvira is based, to a great extent, on the methods developed for explanation of Bayesian networks (Lacave et al., 2000; Lacave et al., 2006b). One of the methods that have proven to be most useful is the automatic coloring of links. The definitions in (Lacave et al., 2006b) for the sign of influence and magnitude of influence, inspired by (Wellman, 1990), have been adapted to incoming arcs to ordinary utility nodes.

Specifically, the magnitude of the influence gives a relative measure of how a variable is influencing an ordinary utility node (see (Lacave et al., 2006a) for further details). Then, the influence of a link A → U pointing to a utility node is positive when higher values of A lead to higher utilities. Cooper's transformation normU guarantees that the magnitude of the influence is normalized.

For instance, in Figure 1 the link X → Y is colored in red because it represents a positive influence: the presence of the disease increases the probability of a positive result of the test. The link X → U1 is colored in blue because it



represents a negative influence: the disease de- on the kind of node. Chance and decision
creases the expected quality of life. The link nodes present bars and numbers correspond-
D U1 is colored in purple because the influ- ing to posterior probabilities of their states,
ence that it represents is undefined: the treat- P (v|e), as given by Equation 3. This is the
ment is beneficial for patients suffering from X probability that a chance variable takes a cer-
but detrimental for healthy patients. tain value or the probability that the decision
Additionally, when a decision node has been maker chooses a certain option (Nilsson and
assigned a policy, either by optimization or im- Jensen, 1998)please note that in Equation 3
posed by the user (see Sec. 3.3), the correspond- there is no distinction between chance and de-
ing probability distribution PD can be used to cision nodes. Utility nodes show the expected
color the links pointing to that node, as shown utilities, EUU (, e), given by Equation 4. The
in Figure 1. guide bar indicates the range of the utilities.
The coloring of links has been very useful in
Elvira for debugging Bayesian networks (Lacave
et al., 2006b) and IDs, in particular for detect-
ing some cases in which the numerical probabil-
ities did not correspond to the qualitative rela-
tions determined by human experts, and also for
checking the effect that the informational pre-
decessors of a decision node have on its policy.

3.2 Explanation of reasoning:


cases of evidence
In Section 2.3 we have seen that, given a strat-
egy, an ID can be converted into a CPN, which
is a true Bayesian network. Consequently, all
the explanation capabilities for BNs, such as
Elviras ability to manage evidence cases are Figure 3: ID in Elvira with two evidence cases:
also available for IDs. (a) the prior case (no evidence); (b) a case with
A finding states with certainty the value the evidence e = {+y}.
taken on a chance or decision variable. A set of
findings is called evidence and corresponds to a Figure 3 shows the result of evaluating the
certain configuration e of a set of observed vari- ID in Figure 1. The link T D is drawn as
ables E. An evidence case is determined by an a discontinuous arrow to indicate that it has
evidence e and the posterior probabilities and been added by the evaluation of the diagram.
the expected utilities induced by it. In this example, as there was no policy imposed
A distinguishable feature of Elvira is its abil- by the user (see below), Elvira computed the
ity to manage several evidence cases simultane- optimal strategy. In this figure two evidence
ously. A special evidence case is the prior case, cases are displayed. The first one is the prior
which is the first case created and corresponds case, i.e., the case in which there is no evidence.
to the absence of evidence. The second evidence case is given by e = {+y};
One of the evidence cases is marked as the i.e., it displays the probabilities and utilities
current case. Its probabilities and utilities are of the subpopulation of patients in which the
displayed in the nodes. For the rest of the cases, test gives a positive result. Node Y is colored
the probabilities and utilities are displayed only in gray to highlight the fact that there is ev-
by bars, in order to visually compare how they idence about it. The probability of +x, rep-
vary when new evidence is introduced. resented by a red bar, is 0.70; the green bar
The information displayed for nodes depends close to it represents the probability of +x for

the prior case, i.e., the prevalence of the disease; of the test is irrelevant. The M EU for this de-
the red bar is longer than the green one because cision problem is U1 (+x, +d).
P (+x| + y) > P (+x). The global utility for the In contrast, the introduction of evidence in
second case is 81.05, which is smaller than the Elvira (which may include findings for deci-
green bar close to it (the expected utility for the sion variables as well) does not lead to a new de-
general population) because the presence of the cision scenario nor to a different strategy, since
symptom worsens the prognosis. The red bar the strategy is determined before introducing
for Treatment=yes is the probability that a pa- the evidence. Put another way, in Elvira we
tient having a positive result in the test receives adopt the point view of an external observer of a
the treatment; this probability is 100% because system that includes the decision maker as one
the optimal strategy determines that all symp- of its components. The probabilities and ex-
tomatic patients must be treated. pected utilities given by Equations 2 and 4 are
The possibility of introducing evidence in those corresponding to the subpopulation indi-
Elvira has been useful for building IDs in medi- cated by e when treated with strategy . For
cine: when we were interested in computing the instance, given the evidence {+x}, the probabil-
posterior probability of diagnoses given several ity P (+t| + x) shown by Elvira is the probabil-
sets of findings, we need to manually convert the ity that a patient suffering from X receives the
ID into a Bayesian network by removing deci- test, which is 100% (it was 0% in Ezawas sce-
sion and utility nodes, and each time the ID was nario), and P (+d| + x) is the probability that
modified we have to convert it into a Bayesian he receives the treatment; contrary to Ezawas
network to compute the probabilities. Now the scenario, this probability may differ from 100%
probabilities can be computed directly on the because of false negatives. The expected utility
ID. for a patient suffering from X is

3.2.1 Clarifying the concept of EU (, {+x}) =


evidence in influence diagrams X
= P (t, y, d| + x) (U1 (+x, d) + U2 (t)) .
In order to avoid confusions, we must men- | {z }
t,y,d
U0 (+x,d,t)
tion that Ezawas method (1998) for introduc-
ing evidence in IDs is very different from the where P (t, y, d| + x) = P (t) P (y|t, +x)
way that is introduced in Elvira. For Ezawa, P (d|t, y).
the introduction of evidence e leads to a differ- Finally, we must underlie that both ap-
ent decision problem in which the values of the proaches are not rivals. They correspond to
variables in E would be known with certainty different points of view when considering evi-
before making any decision. For instance, in- dence in IDs and can complement each other in
troducing evidence {+x} in the ID in Figure 1 order to perform a better decision analysis and
would imply that X would be known when mak- to explain the reasoning.3
ing decisions T and D. Therefore, the expected
utility of the new decision problem would be 3.3 What-if reasoning: analysis of
X non-optimal strategies
max max P (y|+x : t, d)(U1 (+x, d) + U2 (t)) . In Elvira it is possible to have a strategy in
t d | {z }
y
U0 (+x,d,t) which some of the policies are imposed by the
user and the others are computed by maximiza-
where P (y|+x : t, d) = P (+x, y : t, d)/P (+x) = tion. The way of imposing a policy consists
P (y| + x : t). In spite of the apparent similarity in setting a probability distribution PD for the
of this expression with Equation 9, the optimal corresponding decision D by means of Elviras
strategy changes significantly to always treat, 3
In the future, we will implement in Elvira Ezawas
without testing, because if we know with cer- method and the possibility of computing the expected
tainty that the disease X is present the result value of perfect information (EVPI).



GUI; the process is identical to editing the con- 3.5 Sensitivity analysis
ditional probability table of a chance node. In Recently Elvira has been endowed with some
fact, such a decision will be treated by Elvira as well-known sensitivity analysis tools, such as
it were a chance node, and the maximization is one-way sensitivity analysis, tornado diagrams,
performed only on the rest of the decision nodes. and spider diagrams, which can be combined
This way, in addition to computing the op- with the above-mentioned methods for the ex-
timal strategy (when the user has imposed no planation of reasoning. For instance, given the
policy), as any other software tool for influence ID in Figure 1, one-way sensitivity analysis on
diagrams, Elvira also permits to analyze how the prevalence of the disease X can be used to
the expected utilities and the rest of the policies determine the threshold treatment, and we can
would vary if the decision maker chose a non- later see whether the result of test Y makes the
optimal policy for some of the decisions (what-if probability of X cross that threshold. In the
reasoning). construction of more complex IDs this has been
useful for understanding why some tests are nec-
The reason for implementing this explanation
essary or not, and why sometimes the result of
facility is that when we were building a cer-
a test is irrelevant.
tain medical influence diagram (Luque et al.,
2005) our expert wondered why the model rec-
4 Conclusions
ommended not to perform a certain test. We
wished to compute the a posteriori probability The explanation of reasoning in expert systems
of the disease given a positive result in the test, is necessary for debugging the knowledge base,
but we could not introduce this evidence, be- for facilitating their acceptance by human users,
cause it was incompatible with the optimal pol- and for using them as tutoring systems. This is
icy (not to test). After we implemented the pos- especially important in the case of influence dia-
sibility of imposing non-optimal policies (in this grams, because inference in probabilistic graph-
case, performing the test) we could see that the ical models seems to have little relation with
posterior probability of the disease remained be- human thinking. Nevertheless, in general cur-
low the treatment threshold even after a posi- rent software tools for building and evaluating
tive result in the test, and given that the result decision trees and IDs offer very little support
of the test would be irrelevant, it is not worthy for analyzing the reasoning: usually they only
to do it. permit to compute the value of information, to
perform some kind of sensitivity analysis, or to
expand a decision tree, which can be hardly con-
3.4 Decision trees sidered as explanation capabilities.
In this paper we have described some facili-
Elvira can expand a decision tree corresponding ties implemented in Elvira that have been useful
to an ID, with the possibility of expanding and for understanding the knowledge contained in
contracting its branches to the desired level of a certain ID, and why its evaluation has led to
detail. This is especially useful when building certain results, i.e., the optimal strategy and the
IDs in medicine, because physicians are used to expected utility. They can analyze how the pos-
thinking in terms of scenarios and most of them terior probabilities, policies, and expected util-
are familiar with decision trees, while very few ities would vary if the decision maker applied a
have heard about influence diagrams. One dif- different (non-optimal) strategy. Most of these
ference of Elvira with respect to other software explanation facilities are based on the construc-
tools is the possibility of expanding the nodes in tion of a so-called Cooper policy network, which
the decision tree to explore the decomposition of is a true Bayesian network, and consequently all
its utility given by the structure of super-value the explanation options that were implemented
nodes of the ID. for Bayesian networks in Elvira are also avail-

able for IDs, such as the possibility of handling F. V. Jensen. 2001. Bayesian Networks and Deci-
several evidence cases simultaneously. Another sion Graphs. Springer-Verlag, New York.
novelty of Elvira with respect to most software C. Lacave and F. J. Dez. 2002. A review of explana-
tools for IDs is the ability to include super-value tion methods for Bayesian networks. Knowledge
nodes in the IDs, even when expanding the de- Engineering Review, 17:107127.
cision tree equivalent to an ID. C. Lacave, R. Atienza, and F. J. Dez. 2000. Graph-
We have also mentioned in this paper how ical explanation in Bayesian networks. In Pro-
these explanation facilities, which we have de- ceedings of the International Symposium on Med-
veloped as a response to the needs that we ical Data Analysis (ISMDA-2000), pages 122129,
Frankfurt, Germany. Springer-Verlag, Heidelberg.
have encountered when building medical mod-
els, have helped us to explain the IDs to the C. Lacave, M. Luque, and F. J. Dez. 2006a. Ex-
physicians collaborating with us and to debug planation of Bayesian networks and influence di-
those models (see (Lacave et al., 2006a) for fur- agrams in Elvira. Technical Report, CISIAD,
UNED, Madrid.
ther details).
C. Lacave, A. Onisko, and F. J. Dez. 2006b. Use of
Acknowledgments Elviras explanation facility for debugging prob-
Part of this work has been financed by the abilistic expert systems. Knowledge-Based Sys-
tems. In press.
Spanish CICYT under project TIC2001-2973-
C05. Manuel Luque has also been partially sup- M. Luque, F. J. Dez, and C. Disdier. 2005. In-
ported by Department of Education of the Co- fluence diagrams for medical decision problems:
Some limitations and proposed solutions. In J. H.
munidad de Madrid and the European Social Holmes and N. Peek, editors, Proceedings of the
Fund (ESF). Intelligent Data Analysis in Medicine and Phar-
macology, pages 8586.

References D. Nilsson and F. V. Jensen. 1998. Probabilities of


future decisions. In Proceedings from the Inter-
G. F. Cooper. 1988. A method for using belief net- national Conference on Informational Processing
works as influence diagrams. In Proceedings of the and Management of Uncertainty in knowledge-
4th Workshop on Uncertainty in AI, pages 5563, based Systems, pages 14541461.
University of Minnesota, Minneapolis.
J. A. Tatman and R. D. Shachter. 1990. Dynamic
R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and programming and influence diagrams. IEEE
D. J. Spiegelhalter. 1999. Probabilistic Networks Transactions on Systems, Man, and Cybernetics,
and Expert Systems. Springer-Verlag, New York. 20:365379.

F. J. Dez, J. Mira, E. Iturralde, and S. Zubillaga. J. W. Wallis and E. H. Shortliffe. 1984. Cus-
1997. DIAVAL, a Bayesian expert system for tomized explanations using causal knowledge. In
echocardiography. Artificial Intelligence in Medi- B. G. Buchanan and E. H. Shortliffe, editors,
cine, 10:5973. Rule-Based Expert Systems: The MYCIN Ex-
periments of the Stanford Heuristic Programming
F.J. Dez. 2004. Teaching probabilistic medical Project, chapter 20, pages 371388. Addison-
reasoning with the Elvira software. In R. Haux Wesley, Reading, MA.
and C. Kulikowski, editors, IMIA Yearbook of
Medical Informatics, pages 175180. Schattauer, M. P. Wellman. 1990. Fundamental concepts of
Sttutgart. qualitative probabilistic networks. Artificial In-
telligence, 44:257303.
The Elvira Consortium. 2002. Elvira: An environ-
ment for creating and using probabilistic graph-
ical models. In Proceedings of the First Euro-
pean Workshop on Probabilistic Graphical Models
(PGM02), pages 111, Cuenca, Spain.

K. Ezawa. 1998. Evidence propagation and value of


evidence on influence diagrams. Operations Re-
search, 46:7383.



Dynamic importance sampling in Bayesian networks using
factorisation of probability trees
Irene Martínez
Department of Languages and Computation
University of Almería
Almería, 04120, Spain

Carmelo Rodríguez and Antonio Salmerón
Department of Statistics and Applied Mathematics
University of Almería
Almería, 04120, Spain

Abstract
Factorisation of probability trees is a useful tool for inference in Bayesian networks. Prob-
abilistic potentials some of whose parts are proportional can be decomposed as a product
of smaller trees. Some algorithms, like lazy propagation, can take advantage of this fact.
Also, the factorisation can be used as a tool for approximating inference, if the decompo-
sition is carried out even if the proportionality is not completely reached. In this paper we
propose the use of approximate factorisation as a means of controlling the approximation
level in a dynamic importance sampling algorithm.

1 Introduction posteriori distribution is obtained. In the sec-


ond stage a sample is drawn using the approx-
In this paper we propose an algorithm for up- imate distribution and the probabilities are es-
dating probabilities in Bayesian networks. This timated according to the importance sampling
problem is known to be NP-hard even in the ap- methodology (Rubinstein, 1981). In the most
proximate case (Dagum and Luby, 1993). There recent algorithm (Moral and Salmeron, 2005) a
exist several deterministic approximate algo- dynamic approach is adopted, in which the sam-
rithms (Cano et al., 2000; Cano et al., 2002; pling distributions are updated according to the
Cano et al., 2003; Kjrulff, 1994) as well as information collected during the simulation.
algorithms based on Monte Carlo simulation. Recently, a new approach to construct ap-
The two main approaches are: Gibbs sam- proximate deterministic algorithms was pro-
pling (Jensen et al., 1995; Pearl, 1987) and im- posed in (Martnez et al., 2005), based on
portance sampling (Cheng and Druzdzel, 2000; the concept of approximate factorisation of the
Hernandez et al., 1998; Moral and Salmeron, probability trees used to represent the probabil-
2005; Salmeron et al., 2000; Shachter and Peot, ity functions.
1990). The goal of this paper is to incorporate the
A class of these simulation procedures is ideas contained in (Martnez et al., 2005) to the
composed by the importance sampling algo- dynamic algorithm introduced in (Moral and
rithms based on approximate pre-computation Salmeron, 2005), with the aim of showing that
(Hernandez et al., 1998; Moral and Salmeron, the approximation level can be controlled by the
2005; Salmeron et al., 2000). These methods factorisation alternatively to pruning the trees.
perform first a fast but non exact propagation, The rest of the paper is organised as fol-
consisting of a node removal process (Zhang and lows: in section 2 we analyse the main features
Poole, 1996). In this way, an approximate a of the dynamic importance sampling algorithm
in (Moral and Salmeron, 2005). Section 3 ex- ple from p(x) and then estimate p(xk , e) from
plain the concept of approximate factorisation it. However, p(x) is often unmanageable, due
of probability trees. Then, in section 4 we ex- to the large size of the state space of X, X .
plain how both ideas can be combined in a new Importance sampling tries to overcome this
algorithm. The performance of the new algo- problem by using a simplified probability func-
rithm is tested using three real-world Bayesian tion p (x) to obtain a sample of X . The es-
networks in section 5 and the paper ends with timation is carried out according to the next
the conclusions in section 6. procedure.

2 Dynamic importance sampling in Importance Sampling


Bayesian networks
1. FOR j := 1 to m (sample size)
Along this paper we will consider a Bayesian
network with variables X = {X1 , . . . , Xn } (a) Generate a configuration x(j) X
where each Xi is a discrete variable taking val- using p .
ues on a finite set Xi . By X we denote the (b) Compute a weight for the generated
state space of the n-dimensional random vari- configuration as:
able X.
p(x(j) )
Probabilistic reasoning in Bayesian networks wj := . (1)
requires the computation of the posterior prob- p (x(j) )
abilities p(xk |e), xk Xk for each variable of
2. For each xk Xk , estimate p(xk , e) as
interest Xk , given that some other variables E
p(xk , e) obtained as the sum of the weights
have been observed to take value e.
in formula (1) corresponding to configura-
The posterior probability mentioned above
tions containing xk divided by m.
can be expressed as
3. Normalise the values p(xk , e) in order
p(xk , e) to obtain p(xk |e), i.e., an estimation of
p(xk |e) = xk Xk ,
p(e) p(xk |e).
and, since p(e) is a constant value, the prob-
In (Salmeron et al., 2000; Moral and
lem of calculating the posterior probability of
Salmeron, 2005), a sampling distribution is
interest is equivalent to obtaining p(xk , e) and
computed for each variable, so that p is equal
normalising afterwards. If we denote by p(x)
to the product of the sampling distributions
the joint distribution for variables X, then it
for all the variables. The sampling distribu-
holds that
tion for each variable can be obtained through
X a process of variable elimination. Assume that
p(xk , e) = p(x) ,
the variables are eliminated following the or-
X\({Xk }E)
der X1 , . . . , Xn , and that, before eliminating the
where we assume that the k-th coordinate of first variable, H is the set of conditional dis-
x is equal to xk and the coordinates in x cor- tributions in the network. Then, the next al-
responding to observed variables are equal to gorithm obtains a sampling distribution for he
e. Therefore, the problem of probabilistic rea- i-th variable.
soning can be reduced to computing a sum, and
here is where the importance sampling technique Get Sampling Distribution(Xi ,H)
takes part.
Importance sampling is a well known method 1. Hi := {f H|f is defined for Xi }.
for computing integrals (Rubinstein, 1981) or Q
2. fi := f Hi f .
sums over multivariate functions. A straight-
3. fi := xX fi .
P
forward way to do that could be to draw a sam-
i



4. H := H \ Hi {fi }. Otherwise, potential fi is re-computed in order
to correct the detected discrepancy.
5. RETURN (fi ).
3 Factorisation of probability trees
Simulation is carried out in an order contrary As mentioned in section 2, the approximation of
to the one in which variables are deleted. Each the probability trees used to represent the prob-
variable Xi is simulated from its sampling distri- abilistic potentials consists of pruning some of
bution fi . This function is defined for variable their branches, namely those that lead to simi-
Xi and other variables already sampled. The lar leaves (similar probability values), that can
potential fi is restricted to the already obtained be substituted by the average, so that the error
values of the variables for which it is defined, of the approximation depends on the differences
except Xi , giving rise to a function which de- between the values of the leaves corresponding
pends only on Xi . Finally, a value for this vari- to the pruned branches. Another way of tak-
able is obtained with probability proportional ing advantage of the use of probability trees is
to the values of this potential. If all the compu- given by the possible presence of proportional-
tations are exact, it was proved in (Hernandez ity between different subtrees (Martnez et al.,
et al., 1998) that, following this procedure, we 2002). This fact is illustrated in figures 1 and
are really sampling with the optimal probability 2. In this case, the tree in figure 1 can be repre-
p (x) = p(x|e). However, the result of the com- sented, without loss of precision, by the product
binations in the process of obtaining the sam- of two smaller trees, shown in figure 2.
pling distributions may require a large amount Note that the second tree in figure 2 does
of space to be stored, and therefore approxima- not contain variable X. It means that, when
tions are usually employed, either using proba- computing the sampling distribution for vari-
bility tables (Hernandez et al., 1998) or proba- able X, the products necessary to obtain func-
bility trees (Salmeron et al., 2000) to represent tion fi would be done over smaller trees.
the distributions.
In (Moral and Salmeron, 2005) an alternative 3.1 Approximate Factorisation of
procedure to simulate each variable was used. Probability Trees
Instead of restricting fi to the values of the Factorisation of probability trees can be utilised
variables already sampled, all the functions in as a tool for developing approximate inference
Hi are restricted, resulting in a set of functions algorithms, when the proportionality condition
depending only on Xi . The sampling distribu- is relaxed. In this case, probability trees can be
tion is then computed by multiplying all these decomposed even if we find only almost propor-
functions. If the computations are exact, then tional, rather than proportional, subtrees. Ap-
both distributions are the same, as restriction proximate factorisation and its application to
and combination commute. This is the basis Lazy propagation was proposed in (Martnez et
for the dynamic updating procedure proposed al., 2005), and is stated as follows. Let T1 and
in (Moral and Salmeron, 2005). Probabilistic T2 be two sub-trees which are siblings for a given
potentials are represented by probability trees, context (i.e. both sub-trees are children of the
which are approximated by pruning some of its same node), such that both have the same size
branches when computing fi during the calcula- and their leaves contain only strictly positive
tion of the sampling distributions. This approx- numbers. The goal of the approximate factori-
imation is somehow corrected according to the sation is to find a tree T2 with the same struc-
information collected during the simulation: the ture than T2 , such that T2 and T1 become pro-
probability value of the configuration already portional, under the restriction that the poten-
simulated provided by the product of the poten- tial represented by T2 must be as close as possi-
tials in Hi must be the same as the probability ble to the one represented by T2 . Then, T2 can
value for the same configuration provided by fi . be replaced by T2 and the resulting tree that

Figure 1: A probability tree proportional below X for context (W = 0).

Figure 2: Decomposition of the tree in figure 1 with respect to variable X.
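The decomposition in Figure 2 can be checked numerically. The following short Python sketch (ours, not from the paper) encodes the W = 0 part of the tree in Figure 1 as a table, extracts the proportionality constants of the subtrees below X, and verifies that the product of the two resulting factors reproduces the original potential.

```python
# Leaf values read off Figure 1 for context W = 0: a potential over (X, Y).
phi = {
    (0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.2, (0, 3): 0.5,
    (1, 0): 0.2, (1, 1): 0.4, (1, 2): 0.4, (1, 3): 1.0,
    (2, 0): 0.4, (2, 1): 0.8, (2, 2): 0.8, (2, 3): 2.0,
}

# Factor defined for X: proportionality constant of each subtree below X with
# respect to the subtree for X = 0 (Figure 2, left factor: 1, 2, 4).
f1 = {x: phi[(x, 0)] / phi[(0, 0)] for x in (0, 1, 2)}

# Factor not defined for X: the reference subtree itself (Figure 2, right factor).
f2 = {y: phi[(0, y)] for y in (0, 1, 2, 3)}

# Exact factorisation: phi(x, y) == f1[x] * f2[y] for every configuration.
assert all(abs(phi[(x, y)] - f1[x] * f2[y]) < 1e-12 for (x, y) in phi)
```

Approximate factorisation proceeds in the same way when the subtrees are only nearly proportional; the divergence measure described next quantifies the error incurred by that replacement.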

contain T1 and T̃2 can be decomposed by factorisation. Formally, approximate factorisation is defined through the concept of ε-factorisability.

Definition 1. A probability tree T is ε-factorisable within context (XC = xC), with proportionality factors α, with respect to a divergence measure D if there is an xi ∈ ΩX and a set L ⊆ ΩX \ {xi} such that for every xj ∈ L there is an αj > 0 such that

    D(T^R(XC=xC, X=xj), αj · T^R(XC=xC, X=xi)) ≤ ε ,

where T^R(XC=xC, X=xi) denotes the subtree of T which is reached by the branch where variables XC take values xC and X takes value xi. Parameter ε > 0 is called the tolerance of the approximation.

Observe that if ε = 0, we have exact factorisation. In the definition above, D and α = (α1, . . . , αk) are related in such a way that α can be computed in order to minimise the value of the divergence measure D.

In (Martínez et al., 2005) several divergence measures are considered, giving the optimal α in each case. In this work we will consider only the measure that showed the best performance in (Martínez et al., 2005), which we describe now. Consider a probability tree T. Let T1 and T2 be sub-trees of T below a variable X, for a given context (XC = xC), with leaves P = {pi : i = 1, . . . , n ; pi ≠ 0} and Q = {qi : i = 1, . . . , n} respectively. As we described before, approximate factorisation is achieved by replacing T2 by another tree T̃2 such that T̃2 is proportional to T1. It means that the leaves of T̃2 will be Q̃ = {αpi : i = 1, . . . , n}, where α is the proportionality factor between T̃2 and T1. Let us denote by {θi = qi/pi , i = 1, . . . , n} the ratios between the leaves of T2 and T1.

The χ² divergence, defined as

    Dχ²(T2, T̃2) = Σ_{i=1}^{n} (qi − αpi)² / qi ,


is minimised for α equal to

    α = ( Σ_{i=1}^{n} pi ) / ( Σ_{i=1}^{n} pi/θi ) .

In this work we will use its normalised version

    D_N(T2, T̃2) = √( Dχ²(T2, T̃2) / (Dχ²(T2, T̃2) + n) ) ,        (2)

which takes values between 0 and 1 and is minimised for the same α.

4 Dynamic importance sampling combined with factorisation

As we mentioned in section 2, the complexity of the dynamic importance sampling algorithm relies on the computation of the sampling distribution. In the algorithm proposed in (Moral and Salmerón, 2005), this complexity is controlled by pruning the trees resulting from multiplications of other trees.

Here we propose to use approximate factorisation instead of tree pruning to control the complexity of the sampling distributions computation. Note that these two alternatives (approximate factorisation and tree pruning) are not exclusive. They can be used in a combined form. However, in order to have a clear idea of how approximate factorisation affects the accuracy of the results, we will not mix it with tree pruning in the algorithm proposed here.

The difference between the dynamic algorithm described in (Moral and Salmerón, 2005) and the method we propose in this paper is in the computation of the sampling distributions. The simulation phase and the update of the sampling distributions are carried out exactly in the same way, except that we have established a limit for the number of updates during the simulations equal to 1000 iterations. It means that, after iteration #1000 in the simulation, the sampling distributions are no longer updated. We have decided this with the aim of avoiding a high increase in space requirements during the simulation. Therefore, we only describe the part of the algorithm devoted to obtaining the sampling distribution for a given variable Xi.

Get Sampling Distribution(Xi, H, D, ε)

1. Hi := {f ∈ H | f is defined for Xi}.

2. FOR each f ∈ Hi:
   (a) IF there is a context XC for which the tree corresponding to f is ε-factorisable with respect to D:
       Decompose f as f1 · f2, where f1 is defined for Xi and f2 is not.
       Hi := (Hi \ {f}) ∪ {f1}.
       H := (H \ {f}) ∪ {f2}.
   (b) ELSE: H := H \ {f}.

3. fi := ∏_{f ∈ Hi} f.

4. f′i := Σ_{xi ∈ ΩXi} fi.

5. H := H ∪ {f′i}.

6. RETURN (fi).

According to the algorithm above, the sampling distribution for a variable Xi is computed by taking all the functions which are defined for it, but, unlike the method in (Moral and Salmerón, 2005), the product of all those functions is postponed until the possible factorisations of all the intervening functions have been tested. Therefore, the product in step 3 is carried out over functions with smaller domains than the product in step 2 of the algorithm described in section 2.

The accuracy of the sampling distribution is controlled by parameter ε, so that higher values of it result in worse approximations.

5 Experimental evaluation

In this section we describe a set of experiments carried out to show how approximate factorisation can be used to control the level of approximation in dynamic importance sampling. We have implemented the algorithm in Java, as a part of the Elvira system (Elvira Consortium, 2002). We have selected three real-world Bayesian networks borrowed from the Machine Intelligence group at Aalborg University (www.cs.aau.dk/research/MI/). The
three networks are called Link (Jensen et al.,
1995), Munin1 and Munin2 (Andreassen et al.,
1989). Table 1 displays, for each network, the
number of variables, number of observed vari-
ables (selected at random) and its size. By the
size we mean the sum of the clique sizes that re-
sulted when using HUGIN (Jensen et al., 1990)
for computing the exact posterior distributions
given the observed variables.
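Before turning to the results, the sampling-distribution computation of section 4 can be written out compactly. The Python sketch below is our own illustration under simplified data structures, not the Elvira implementation: `Potential` is assumed to support multiplication, marginalisation and a `variables` attribute, and `try_factorise` is a hypothetical helper returning (f1, f2), with f1 defined for the variable of interest and f2 not, or None if no context of f is ε-factorisable for the chosen divergence.

```python
# Illustrative sketch of the sampling-distribution step with factorisation.
from functools import reduce

def get_sampling_distribution(xi, H, try_factorise, epsilon):
    Hi = [f for f in H if xi in f.variables]
    for f in list(Hi):
        factorised = try_factorise(f, xi, epsilon)
        if factorised is not None:
            f1, f2 = factorised
            Hi.remove(f); Hi.append(f1)      # keep only the part mentioning Xi
            H.remove(f);  H.append(f2)       # the remainder goes back into H
        else:
            H.remove(f)                      # f will be consumed by the product below
    fi = reduce(lambda a, b: a * b, Hi)      # step 3: product of the (smaller) factors
    H.append(fi.marginalise_out(xi))         # steps 4-5: sum out Xi and store the result
    return fi                                # step 6: sampling distribution for Xi
```

Because f1 and f2 have smaller domains than f, the product over Hi is cheaper than multiplying the original potentials, which is where the method gains over computing the sampling distribution directly.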

Table 1: Statistics about the networks used in the experiments.

          Vars.   Obs. vars.   Size
Link      441     166          23,983,962
Munin1    189     8            83,735,758
Munin2    1003    15           2,049,942

Figure 3: Results for Link with ε = 0.1, 0.05, 0.01, 0.005 and 0.001.

We have run the importance sampling algorithm with tolerance values ε = 0.1, 0.05, 0.01, 0.005 and 0.001 and with different numbers of simulation iterations: 2000, 3000, 4000 and 5000. Given the random nature of this algorithm, we have run each experiment 20 times and the errors have been averaged. For one variable Xl, the error in the estimation of its posterior distribution is measured as (Fertig and Mann, 1980):

    G(Xl) = √( (1/|ΩXl|) Σ_{al ∈ ΩXl} (p′(al|e) − p(al|e))² / ( p(al|e) (1 − p(al|e)) ) )        (3)

where p(al|e) is the true a posteriori probability, p′(al|e) is the estimated value and |ΩXl| is the number of cases of variable Xl. For a set of variables {X1, . . . , Xn}, the error is:

    G({X1, . . . , Xn}) = √( Σ_{i=1}^{n} G(Xi)² )        (4)

This error measure emphasises the differences in small probability values, which means that it is more discriminant than other measures like the mean squared error.
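For reference, the error measure of Equations (3)-(4) can be written directly in a few lines of Python (our own sketch; `true_post` and `est_post` are assumed to map each variable to a dictionary of exact and estimated posterior probabilities).

```python
import math

def g_error(true_post, est_post):
    """Error measure of Eqs. (3)-(4): per-variable G(Xl) combined over all variables."""
    total = 0.0
    for var, exact in true_post.items():
        # assumes 0 < p < 1 for every exact posterior probability
        s = sum((est_post[var][a] - p) ** 2 / (p * (1.0 - p))
                for a, p in exact.items())
        total += s / len(exact)          # this term is G(var) squared
    return math.sqrt(total)
```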
Figure 4: Results for Munin1 with ε = 0.1, 0.05, 0.01, 0.005 and 0.001.

5.1 Results discussion

The results summarised in figures 3, 4 and 5 indicate that the error can be somehow controlled by means of the ε parameter. There are, however, irregularities in the graphs, probably due to the variance associated with the estimations, due to the stochastic nature of the importance sampling method.

The best estimations are achieved for the Munin1 network. This is due to the fact that few factorisations are actually carried out, and therefore the number of approximations is lower, which also decreases the variance of the estimations.

Factorisation or even approximate factorisation are difficult to carry out over very extreme distributions. In the approximate case, a high value for ε would be necessary. This difficulty



proximate factorisation and tree pruning should
be studied. We have carried out some experi-
ment in this direction and the first results are
very promising: the variance in the estimations
and the computing time seems to be signifi-
cantly reduced. The main problem of combin-
ing both methods is that the difficulty of con-
trolling the final error of the estimation through
the parameters of the algorithm increases. We
leave for an extended version of this paper a de-
tailed insight on this issue. The experiments we
are currently carrying out suggest that the most
Figure 5: Results for Munin2 with = sensible way of combining both approximation
0.1, 0.05, 0.01, 0.005 and 0.001. procedures is to use slight tree pruning and also
low values for the tolerance of the factorisation.
also arises when using tree pruning as approx- Otherwise, the error introduced by one of the
imation method, since this former method is approximation method can affect the other.
good to capture uniform regions. It seems difficult to establish a general rule
Regarding the high variance associated with for determining which approximation method
the estimations suggested by the graphs in fig- is preferable. Depending on the network and
ures 3 and 5, it can be caused by multiple fac- the propagation algorithm, the performance of
torisations of the same tree: A tree can be fac- both techniques may vary. For instance, ap-
torised for some variable and afterwards, one proximate factorisation is appropriate for al-
of the resulting factors, can be factorised again gorithms that handle factorised potentials, as
when deleting another variable. In this case, Lazy propagation or the dynamic importance
even if the local error is controlled by , there is sampling method used in this paper.
a global error due to the concatenations of fac- Finally, an efficiency comparison between
torisations that is difficult to handle. Perhaps the resulting algorithm and the dynamic im-
the solution is to forbid that trees that result portance sampling algorithm in (Moral and
from a factorisations be factorised again. Salmeron, 2005) will be one of the issues that
we plan to aim at.
6 Conclusions
Acknowledgments
In this paper we have introduced a new version This work has been supported by the Span-
of the dynamic importance sampling algorithm ish Ministry of Education and Science through
proposed in (Moral and Salmeron, 2005). The project TIN2004-06204-C03-01.
novelty consists of using the approximate fac-
torisation of probability trees to control the ap-
proximation of the sampling distributions and, References
in consequence, the final approximations. S. Andreassen, F.V. Jensen, S.K. Andersen,
The experimental results suggest that the B. Falck, U. Kjrulff, M. Woldbye, A.R. Sorensen,
method is valid but perhaps more difficult to A. Rosenfalck, and F. Jensen, 1989. Computer-
Aided Electromyography and Expert Systems,
control than tree pruning. chapter MUNIN - an expert EMG assistant, pages
Even though the goal of this paper was to 255277. Elsevier Science Publishers B.V., Ams-
analyse the performance of tree factorisation as terdam.
a stand-alone approximation method, one im- A. Cano, S. Moral, and A. Salmeron. 2000. Penni-
mediate conclusion that can be drawn from the less propagation in join trees. International Jour-
work presented here is that the joint use of ap- nal of Intelligent Systems, 15:10271059.

A. Cano, S. Moral, and A. Salmeron. 2002. Lazy I. Martnez, S. Moral, C. Rodrguez, and
evaluation in Penniless propagation over join A. Salmeron. 2005. Approximate factorisation of
trees. Networks, 39:175185. probability trees. ECSQARU05. Lecture Notes
in Artificial Intelligence, 3571:5162.
A. Cano, S. Moral, and A. Salmeron. 2003. Novel
strategies to approximate probability trees in Pen- S. Moral and A. Salmeron. 2005. Dynamic im-
niless propagation. International Journal of Intel- portance sampling in Bayesian networks based on
ligent Systems, 18:193203. probability trees. International Journal of Ap-
proximate Reasoning, 38(3):245261.
J. Cheng and M.J. Druzdzel. 2000. AIS-
BN: An adaptive importance sampling algorithm J. Pearl. 1987. Evidential reasoning using stochas-
for evidential reasoning in large Bayesian net- tic simulation of causal models. Artificial Intelli-
works. Journal of Artificial Intelligence Research, gence, 32:247257.
13:155188.
R.Y. Rubinstein. 1981. Simulation and the Monte
P. Dagum and M. Luby. 1993. Approximating prob- Carlo Method. Wiley (New York).
abilistic inference in Bayesian belief networks is
A. Salmeron, A. Cano, and S. Moral. 2000. Impor-
NP-hard. Artificial Intelligence, 60:141153.
tance sampling in Bayesian networks using prob-
Elvira Consortium. 2002. Elvira: An environ- ability trees. Computational Statistics and Data
ment for creating and using probabilistic graph- Analysis, 34:387413.
ical models. In J.A. Gmez and A. Salmern, edi- R.D. Shachter and M.A. Peot. 1990. Simulation ap-
tors, Proceedings of the First European Workshop proaches to general probabilistic inference on be-
on Probabilistic Graphical Models, pages 222230. lief networks. In M. Henrion, R.D. Shachter, L.N.
Kanal, and J.F. Lemmer, editors, Uncertainty in
K.W. Fertig and N.R. Mann. 1980. An accurate
Artificial Intelligence, volume 5, pages 221231.
approximation to the sampling distribution of the
North Holland (Amsterdam).
studentized extreme-valued statistic. Technomet-
rics, 22:8390. N.L. Zhang and D. Poole. 1996. Exploiting
causal independence in Bayesian network infer-
L.D. Hernandez, S. Moral, and A. Salmeron. 1998. ence. Journal of Artificial Intelligence Research,
A Monte Carlo algorithm for probabilistic prop- 5:301328.
agation in belief networks based on importance
sampling and stratified simulation techniques. In-
ternational Journal of Approximate Reasoning,
18:5391.

F.V. Jensen, S.L. Lauritzen, and K.G. Olesen.


1990. Bayesian updating in causal probabilistic
networks by local computation. Computational
Statistics Quarterly, 4:269282.

C.S. Jensen, A. Kong, and U. Kjrulff. 1995. Block-


ing Gibbs sampling in very large probabilistic ex-
pert systems. International Journal of Human-
Computer Studies, 42:647666.

U. Kjrulff. 1994. Reduction of computational com-


plexity in Bayesian networks through removal of
weak dependencies. In Proceedings of the 10th
Conference on Uncertainty in Artificial Intelli-
gence, pages 374382. Morgan Kaufmann, San
Francisco.

I. Martnez, S. Moral, C. Rodrguez, and


A. Salmeron. 2002. Factorisation of probability
trees and its application to inference in Bayesian
networks. In J.A. Gamez and A. Salmeron, edi-
tors, Proceedings of the First European Workshop
on Probabilistic Graphical Models, pages 127134.



Learning Semi-Markovian Causal Models using Experiments
Stijn Meganck¹, Sam Maes², Philippe Leray² and Bernard Manderick¹
1 CoMo

Vrije Universiteit Brussel


Brussels, Belgium
2 LITIS

INSA Rouen
St. Etienne du Rouvray, France

Abstract
Semi-Markovian causal models (SMCMs) are an extension of causal Bayesian networks for
modeling problems with latent variables. However, there is a big gap between the SMCMs
used in theoretical studies and the models that can be learned from observational data
alone. The result of standard algorithms for learning from observations, is a complete
partially ancestral graph (CPAG), representing the Markov equivalence class of maximal
ancestral graphs (MAGs). In MAGs not all edges can be interpreted as immediate causal
relationships. In order to apply state-of-the-art causal inference techniques we need to
completely orient the learned CPAG and to transform the result into a SMCM by removing
non-causal edges. In this paper we combine recent work on MAG structure learning from
observational data with causal learning from experiments in order to achieve that goal.
More specifically, we provide a set of rules that indicate which experiments are needed
in order to transform a CPAG to a completely oriented SMCM and how the results of
these experiments have to be processed. We will propose an alternative representation
for SMCMs that can easily be parametrised and where the parameters can be learned
with classical methods. Finally, we show how this parametrisation can be used to develop
methods to efficiently perform both probabilistic and causal inference.

1 Introduction from observational and experimental data unto


different types of efficient inference.
This paper discusses graphical models that can Semi-Markovian causal models (SMCMs)
handle latent variables without explicitly mod- (Pearl, 2000; Tian and Pearl, 2002) are specif-
eling them quantitatively. For such problem do- ically suited for performing quantitative causal
mains, several paradigms exist, such as semi- inference in the presence of latent variables.
Markovian causal models or maximal ances- However, at this time no efficient parametrisa-
tral graphs. Applying these techniques to a tion of such models is provided and there are no
problem domain consists of several steps, typi- techniques for performing efficient probabilistic
cally: structure learning from observational and inference. Furthermore there are no techniques
experimental data, parameter learning, proba- for learning these models from data issued from
bilistic inference, and, quantitative causal infer- observations, experiments or both.
ence. Maximal ancestral graphs (MAGs), devel-
A problem is that each existing approach only oped in (Richardson and Spirtes, 2002) are
focuses on one or a few of all the steps involved specifically suited for structure learning from
in the process of modeling a problem includ- observational data. In MAGs every edge de-
ing latent variables. The goal of this paper is picts an ancestral relationship. However, the
to investigate the integral process from learning techniques only learn up to Markov equivalence
and provide no clues on which additional ex- {V1 , . . . , Vn }, while corresponding lowercase let-
periments to perform in order to obtain the ters are used to represent their instantiations,
fully oriented causal graph. See (Eberhardt et i.e., v1 , v2 and v is an instantiation of all vi .
al., 2005; Meganck et al., 2006) for that type P (Vi ) is used to denote the probability distribu-
of results on Bayesian networks without latent tion over all possible values of variable V i , while
variables. Furthermore, no parametrisation for P (Vi = vi ) is used to denote the probability dis-
discrete variables is provided for MAGs (only tribution over the instantiation of variable V i to
one in terms of Gaussian distributions) and no value vi . Usually, P (vi ) is used as an abbrevia-
techniques for probabilistic and causal inference tion of P (Vi = vi ).
have been developed. The operators P a(Vi ), Anc(Vi ), N e(Vi ) de-
We have chosen to use SMCMs in this pa- note the observable parents, ancestors and
per, because they are the only formalism that neighbors respectively of variable V i in a graph
allows to perform causal inference while taking and P a(vi ) represents the values of the parents
into account the influence of latent variables. of Vi . Likewise, the operator LP a(Vi ) represents
However, we will combine existing techniques the latent parents of variable Vi . If Vi Vj
to learn MAGs with newly developed methods appears in a graph then we say that they are
to provide an integral approach that uses both spouses, i.e., Vi Sp(Vj ) and vice versa.
observational data and experiments in order to When two variables Vi , Vj are independent we
learn fully oriented semi-Markovian causal mod- denote it by (Vi Vj ), when they are dependent
els. by (Vi Vj ).
2

In this paper we also introduce an alterna-


tive representation for SMCMs together with a 2.2 Semi-Markovian Causal Models
parametrisation for this representation, where Consider the model in Figure 1(a), it is a prob-
the parameters can be learned from data with lem with observable variables V1 , . . . , V6 and la-
classical techniques. Finally, we discuss how tent variables L1 , L2 and it is represented by a
probabilistic and quantitative causal inference directed acyclic graph (DAG). As this DAG rep-
can be performed in these models. resents the actual problem, henceforth we will
The next section introduces the necessary no- refer to it as the underlying DAG.
tations and definitions. It also discusses the se- The central graphical modeling representa-
mantical and other differences between SMCMs tion that we use are the semi-Markovian causal
and MAGs. In section 3, we discuss structure models (Tian and Pearl, 2002).
learning for SMCMs. Then we introduce a new
Definition 1. A semi-Markovian causal
representation for SMCMs that can easily be
model (SMCM) is a representation of a causal
parametrised. We also show how both proba-
Bayesian network (CBN) with observable vari-
bilistic and causal inference can be performed
ables V = {V1 , . . . , Vn } and latent variables
with the help of this new representation.
L = {L1 , . . . , Ln0 }. Furthermore, every latent
variable has no parents (i.e., is a root node) and
2 Notation and Definitions
has exactly two children that are both observed.
We start this section by introducing notations See Figure 1(b) for an example SMCM repre-
and defining concepts necessary in the rest of senting the underlying DAG in (a).
this paper. We will also clarify the differences In a SMCM each directed edge represents an
and similarities between the semantics of SM- immediate autonomous causal relation between
CMs and MAGs. the corresponding variables. Our operational
definition of causality is as follows: a relation
2.1 Notation from variable C to variable E is causal in a cer-
In this work uppercase letters are used to rep- tain context, when a manipulation in the form
resent variables or sets of variables, i.e., V = of a randomised controlled experiment on vari-



Definition 3. Causal inference is the process of calculating the effect of manipulating some variables X on the probability distribution of some other variables Y ; this is denoted as P (Y = y|do(X = x)).

An example causal inference query in the SMCM of Figure 1(b) is P (V6 = v6 |do(V2 = v2 )).
Figure 1: (a) A problem domain represented by Tian and Pearl have introduced theoretical
a causal DAG model with observable and latent causal inference algorithms to perform causal
variables. (b) A semi-Markovian causal model inference in SMCMs (Pearl, 2000; Tian and
representation of (a). (c) A maximal ancestral Pearl, 2002). However these algorithms assume
graph representation of (a). that any distribution that can be obtained from
the JPD over the observable variables is avail-
able C, induces a change in the probability dis- able. We will show that their algorithm per-
tribution of variable E, in that specific context forms efficiently on our parametrisation.
(Neapolitan, 2003).
In a SMCM a bi-directed edge between two 2.3 Maximal Ancestral Graphs
variables represents a latent variable that is a Maximal ancestral graphs are another approach
common cause of these two variables. Although to modeling with latent variables (Richardson
this seems very restrictive, it has been shown and Spirtes, 2002). The main research focus in
that models with arbitrary latent variables can that area lies on learning the structure of these
be converted into SMCMs while preserving the models.
same independence relations between the ob- Ancestral graphs (AGs) are complete under
servable variables (Tian and Pearl, 2002). marginalisation and conditioning. We will only
The semantics of both directed and bi- discuss AGs without conditioning as is com-
directed edges imply that SMCMs are not max- monly done in recent work.
imal. This means that they do not represent all
Definition 4. An ancestral graph without
dependencies between variables with an edge.
conditioning is a graph containing directed
This is because in a SMCM an edge either rep-
and bi-directed edges, such that there is no
resents an immediate causal relation or a latent
bi-directed edge between two variables that are
common cause, and therefore dependencies due
connected by a directed path.
to a so called inducing path, will not be repre-
sented by an edge. Pearls d-separation criterion can be extended
A node is a collider on a path if both its im- to be applied to ancestral graphs, and is called
mediate neighbors on the path point into it. m-separation in this setting. M -separation
characterizes the independence relations repre-
Definition 2. An inducing path is a path in
sented by an ancestral graph.
a graph such that each observable non-endpoint
node of the path is a collider, and an ancestor Definition 5. An ancestral graph is said to
of at least one of the endpoints. be maximal if, for every pair of non-adjacent
Inducing paths have the property that their nodes (Vi , Vj ) there exists a set Z such that Vi
endpoints can not be separated by conditioning and Vj are independent conditional on Z.
on any subset of the observable variables. For A non-maximal AG can be transformed into
instance, in Figure 1(a), the path V1 V2 a MAG by adding some bi-directed edges (in-
L1 V6 is inducing. dicating confounding) to the model. See Figure
SMCMs are specifically suited for another 1(c) for an example MAG representing the same
type of inference, i.e., causal inference. model as the underlying DAG in (a).
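Definition 2 lends itself to a direct computational check. The following is a small illustrative sketch, not taken from the paper: the dictionary encoding of the DAG, the helper names and the example graph are all assumptions made here for illustration. It tests whether a given path is inducing by verifying that every observable non-endpoint is a collider on the path and an ancestor of at least one endpoint.

# Illustrative sketch only: a DAG is a dict mapping each node to its set of
# parents; nodes absent from the dict have no parents. The example graph is
# hypothetical, not the DAG of Figure 1(a).

def ancestors(parents, node):
    """Return all ancestors of `node` (excluding the node itself)."""
    found, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents.get(p, ()))
    return found

def is_inducing_path(parents, path, observed):
    """Definition 2: every observable non-endpoint of `path` is a collider
    on the path and an ancestor of at least one endpoint."""
    ends = {path[0], path[-1]}
    end_anc = set.union(*(ancestors(parents, e) for e in ends)) | ends
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if node not in observed:
            continue  # latent non-endpoints are not constrained
        collider = prev in parents.get(node, set()) and nxt in parents.get(node, set())
        if not (collider and node in end_anc):
            return False
    return True

# Hypothetical example: A -> B <- L -> D with B -> C -> D; L is latent.
parents = {"B": {"A", "L"}, "C": {"B"}, "D": {"L", "C"}}
print(is_inducing_path(parents, ["A", "B", "L", "D"],
                       observed={"A", "B", "C", "D"}))  # True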



In this setting a directed edge represents an edges , , oo, o, such that
ancestral relation in the underlying DAG with
latent variables. I.e., an edge from variable A 1. PG has the same adjacencies as G (and
to B represents that in the underlying causal hence any member of [G]) does;
DAG with latent variables, there is a directed
path between A and B. 2. A mark of arrowhead > is in PG if and only
Bi-directed edges represent a latent common if it is invariant in [G]; and
cause between the variables. However, if there
is a latent common cause between two variables 3. A mark of tail is in PG if and only if it
A and B, and there is also a directed path be- is invariant in [G].
tween A and B in the underlying DAG, then
in the MAG the ancestral relation takes prece- 4. A mark of o is in PG if not all members in
dence and a directed edge will be found between [G] have the same mark.
the variables. V2 V6 in Figure 1(c) is an ex-
ample of such an edge. 2.4 Assumptions
Furthermore, as MAGs are maximal, there As is customary in the graphical modeling re-
will also be edges between variables that have no search area, the SMCMs we take into account
immediate connection in the underlying DAG, in this paper are subject to some simplifying
but that are connected via an inducing path. assumptions:
The edge V1 V6 in Figure 1(c) is an example
of such an edge. 1. Stability, i.e., the independencies in the
These semantics of edges make causal infer- CBN with observed and latent variables
ence in MAGs virtually impossible. As stated that generates the data are structural and
by the Manipulation Theorem (Spirtes et al., not due to several influences exactly can-
2000), in order to calculate the causal effect of celing each other out (Pearl, 2000).
a variable A on another variable B, the immedi-
ate parents (i.e., the pre-intervention causes) of 2. Only a single immediate connection per
A have to be removed from the model. However, two variables in the underlying DAG. I.e.,
as opposed to SMCMs, in MAGs an edge does we do not take into account problems where
not necessarily represent an immediate causal two variables that are connected by an im-
relationship, but rather an ancestral relation- mediate causal edge are also confounded
ship and hence in general the modeler does not by a latent variable causing both variables.
know which are the real immediate causes of a Constraint based learning techniques such
manipulated variable. as IC* (Pearl, 2000) and FCI (Spirtes et al.,
An additional problem for finding the pre- 2000) also do not explicitly recognise mul-
intervention causes of a variable in MAGs is that tiple edges between variables. However, in
when there is an ancestral relation and a latent (Tian and Pearl, 2002) Tian presents an
common cause between variables, then the an- algorithm for performing causal inference
cestral relation takes precedence and the con- where such relations between variables are
founding is absorbed in the ancestral relation. taken into account.
Complete partial ancestral graphs (CPAGs)
3. No selection bias. Mimicking recent work,
are defined in (Zhang and Spirtes, 2005) in the
we do not take into account latent variables
following way.
that are conditioned upon, as can be the
Definition 6. Let [G] be the Markov equiva- consequence of selection effects.
lence class for an arbitrary MAG G. The com-
plete partial ancestral graph (CPAG) for 4. Discrete variables. All the variables in our
[G], PG , is a graph with possibly the following models are discrete.



3 Structure learning

Just as learning a graphical model in general, learning a SMCM consists of two parts: structure learning and parameter learning. Both can be done using data, expert knowledge and/or experiments. In this section we discuss only structure learning.

Figure 2: (a) A SMCM. (b) Result of FCI, with an i-false edge V3 ooV4 .
3.1 Without latent variables
Learning the structure of Bayesian networks
without latent variables has been studied by a and there is no subset of observable variables D
number of researchers (Pearl, 2000; Spirtes et such that (Vi
Vj |D). This does not necessarily
al., 2000). The results of applying one of those mean that Vi has an immediate causal influence
algorithms is a representative of the Markov on Vj : it may also be a result of an inducing
equivalence class. path between Vi and Vj . For instance in Fig-
In order to perform probabilistic or causal in- ure 1(c), the link between V1 and V6 is present
ference, we need a fully oriented structure. For due to the inducing path V1 , V2 , L1 , V6 shown in
probabilistic inference this can be any repre- Figure 1(a).
sentative of the Markov equivalence class, but Inducing paths may also introduce an edge of
for causal inference we need the correct causal type o or oo between two variables, indicating
graph that models the underlying system. In or- either a directed or bi-directed edge, although
der to achieve this, additional experiments have there is no immediate influence in the form of
to be performed. an immediate causal influence or latent common
In previous work (Meganck et al., 2006), we cause between the two variables. An example of
studied learning the completely oriented struc- such a link is V3 ooV4 in Figure 2.
ture for causal Bayesian networks without la- A consequence of these properties of MAGs
tent variables. We proposed a solution to min- and CPAGs is that they are not suited for causal
imise the total cost of the experiments needed inference, since the immediate causal parents of
by using elements from decision theory. The each observable variable are not available, as is
techniques used there could be extended to the necessary according to the Manipulation The-
results of this paper. orem. As we want to learn models that can
perform causal inference, we will discuss how to
3.2 With latent variables transform a CPAG into a SMCM in the next
In order to learn graphical models with la- sections. Before we start, we have to men-
tent variables from observational data, the Fast tion that we assume that the CPAG is correctly
Causal Inference (FCI) algorithm (Spirtes et learned from data with the FCI algorithm and
al., 2000) has been proposed. Recently this re- the extended tail augmentation rules, i.e., each
sult has been extended with the complete tail result that is found is not due to a sampling
augmentation rules introduced in (Zhang and error.
Spirtes, 2005). The results of this algorithm is
3.3 Transforming the CPAG
a CPAG, representing the Markov equivalence
class of MAGs consistent with the data. Our goal is to transform a given CPAG in or-
As mentioned above for MAGs, in a CPAG der to obtain a SMCM that corresponds to the
the directed edges have to be interpreted as correspondig DAG. Remember that in general
being ancestral instead of causal. This means there are three types of edges in a CPAG: ,
that there is a directed edge from Vi to Vj if o, oo, in which o means either a tail mark
Vi is an ancestor of Vj in the underlying DAG or a directed mark >. So one of the tasks to



obtain a valid SMCM is to disambiguate those In general if a variable Vi is experimented on
edges with at least one o as an endpoint. A and another variable Vj is affected by this ex-
second task will be to identify and remove the periment, we say that Vj varies with exp(Vi ),
edges that are created due to an inducing path. denoted by exp(Vi ) Vj . If there is no varia-
In the next section we will first discuss exactly tion in Vj we note exp(Vi ) 6 Vj .
which information we obtain from performing Although conditioning on a set of variables
an experiment. Then, we will discuss the two D might cause some variables to become proba-
possibilities o and oo. Finally, we will dis- bilistically dependent, conditioning will not in-
cuss how we can find edges that are created due fluence whether two variables vary with each
to inducing paths and how to remove these to other when performing an experiment. I.e., sup-
obtain the correct SMCM. pose the following structure is given V i D
Vj , then conditioning on D will make Vi depen-
3.3.1 Performing experiments
dent on Vj , but when we perform an experiment
The experiments discussed here play the role on Vi and check whether Vj varies with Vi then
of the manipulations that define a causal re- conditioning on D will make no difference.
lation (cf. Section 2.2). An experiment on a First we have to introduce the following defi-
variable Vi , i.e., a randomised controlled exper- nition:
iment, removes the influence of other variables
in the system on Vi . The experiment forces Definition 7. A potentially directed path
a distribution on Vi , and thereby changes the (p.d. path) in a CPAG is a path made only of
joint distribution of all variables in the system edges of types o and , with all arrowheads
that depend directly or indirectly on V i but in the same direction. A p.d. path from V i to
does not change the conditional distribution of Vj is denoted as Vi 99K Vj .
other variables given values of Vi . After the
3.3.2 Solving o
randomisation, the associations of the remain-
ing variables with Vi provide information about An overview of the different rules for solving
which variables are influenced by Vi (Neapoli- o is given in Table 1
tan, 2003). When we perform the experiment
we cut all influence of other variables on V i . Type 1(a)
Graphically this corresponds to removing all in- Given Ao B
coming edges into Vi from the underlying DAG. exp(A) 6 B
All parameters besides those for the variable Action AB
experimented on, (i.e., P (Vi |P a(Vi ))), remain Type 1(b)
the same. We then measure the influence of the Given Ao B
manipulation on variables of interest to get the exp(A) B
post-interventional distribution on these vari- 6 p.d. path (length 2)A 99K B
ables. Action AB
To analyse the results of the experiment we Type 1(c)
compare for each variable of interest V j the orig- Given Ao B
inal distribution P and the post-interventional exp(A) B
distribution PE , thus comparing P (Vj ) and p.d. path (length 2)A 99K B
PE (Vj ) = P (Vj |do(Vi = vi )). Action Block all p.d. paths by condi-
We denote performing an experiment at vari- tioning on blocking set D
able Vi or a set of variables W by exp(Vi ) or (a) exp(A)|D B: A B
exp(W ) respectively, and if we have to condition (b) exp(A)|D 6 B: A B
on some other set of variables D while perform-
ing the experiment, we denote it as exp(V i )|D Table 1: An overview of the different actions
and exp(W )|D. needed to complete edges of type o.
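A minimal sketch of how the rules of Table 1 could be mechanised is given below; the function and its argument names are assumptions made here for illustration and are not the authors' implementation. The decision for an edge A o-> B depends only on whether exp(A) affects B, whether a potentially directed path of length at least 2 exists, and, when it does, on the outcome of the experiment with all such paths blocked. The rules of Table 2 for edges A o-o B reduce to these same cases after a first experiment on A.

# Illustrative sketch only (not the authors' code): resolving A o-> B with the
# rules of Table 1, given the outcomes of the experiments on A.
def resolve_o_arrow(affected, has_pd_path=False, affected_given_block=None):
    """Return the oriented edge between A and B for an edge A o-> B.

    affected             -- did exp(A) change the distribution of B?
    has_pd_path          -- is there a potentially directed path A ~~> B of
                            length >= 2 in the CPAG?
    affected_given_block -- outcome of exp(A)|D with all such paths blocked by
                            a blocking set D (only relevant if has_pd_path)
    """
    if not affected:                   # Type 1(a): no influence of A on B
        return "A <-> B"
    if not has_pd_path:                # Type 1(b): influence must be immediate
        return "A -> B"
    if affected_given_block:           # Type 1(c), case (a)
        return "A -> B"
    return "A <-> B"                   # Type 1(c), case (b)

print(resolve_o_arrow(affected=False))                              # A <-> B
print(resolve_o_arrow(affected=True, has_pd_path=False))            # A -> B
print(resolve_o_arrow(affected=True, has_pd_path=True,
                      affected_given_block=False))                  # A <-> B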



For any edge Vi o Vj , there is no need to Type 2(a)
perform an experiment on Vj because we know Given AooB
that there can be no immediate influence of V j exp(A) 6 B
on Vi , so we will only perform an experiment on Action A oB(Type 1)
Vi . Type 2(b)
If exp(Vi ) 6 Vj , then there is no influence of Given AooB
Vi on Vj , so we know that there can be no di- exp(A) B
rected edge between Vi and Vj and thus the only 6 p.d. path (length 2)A 99K B
remaining possibility is Vi Vj (Type 1(a)). Action AB
If exp(Vi ) Vj , then we know for sure that Type 2(c)
there is an influence of Vi on Vj , we now need to Given AooB
discover whether this influence is immediate or exp(A) B
via some intermediate variables. Therefore we p.d. path (length 2)A 99K B
make a difference whether there is a potentially Action Block all p.d. paths by condi-
directed (p.d.) path between Vi and Vj of length tioning on blocking set D
2, or not. If no such path exists, then the (a) exp(A)|D B: A B
influence has to be immediate and the edge is (b) exp(A)|D 6 B: A oB
found Vi Vj (Type 1(b)). (Type 1)
If at least one p.d. path Vi 99K Vj exists,
we need to block the influence of those paths Table 2: An overview of the different actions
on Vj while performing the experiment, so we needed to complete edges of type oo.
try to find a blocking set D for all these paths.
If exp(Vi )|D Vj , then the influence has to
be immediate, because all paths of length 2 on Vj while performing the experiment, so we
are blocked, so Vi Vj . On the other hand if find a blocking set D like with Type 1(c). If
exp(Vi )|D 6 Vj , there is no immediate influ- exp(Vi )|D Vj , then the influence has to be
ence and the edge is Vi Vj (Type 1(c)). immediate, because all paths of length 2 are
blocked, so Vi Vj . On the other hand if
3.3.3 Solving oo exp(Vi )|D 6 Vj , there is no immediate influ-
An overview of the different rules for solving ence and the edge is of type: Vi oVj , which
oo is given in Table 2. again becomes a problem of Type 1.
For any edge Vi ooVj , we have no information
at all, so we might need to perform experiments 3.3.4 Removing inducing path edges
on both variables. An inducing path between two variables V i
If exp(Vi ) 6 Vj , then there is no influence and Vj might create an edge between these two
of Vi on Vj so we know that there can be no variables during learning because the two are
directed edge between Vi and Vj and thus the dependent conditional on any subset of observ-
edge is of the following form: Vi oVj , which able variables. As mentioned before, this type
then becomes a problem of Type 1. of edges is not present in SMCMs as it does not
If exp(Vi ) Vj , then we know for sure that represent an immediate causal influence or a la-
there is an influence of Vi on Vj , and as in the tent variable in the underlying DAG. We will
case of Type 1(b), we make a difference whether call such an edge an i-false edge.
there is a potentially directed path between V i For instance, in Figure 1(a) the path
and Vj of length 2, or not. If no such path V1 , V2 , L1 , V6 is an inducing path, which causes
exists, then the influence has to be immediate the FCI algorithm to find an i-false edge be-
and the edge must be turned into Vi Vj . tween V1 and V6 , see Figure 1(c). Another ex-
If at least one p.d. path Vi 99K Vj exists, ample is given in Figure 2 where the SMCM is
we need to block the influence of those paths given in (a) and the result of FCI in (b). The



V1 V1 V1
edge between V3 and V4 in (b) is a consequence
of the inducing path via the observable variables V2 V2 V2
V3 , V 1 , V 2 , V 4 .
In order to be able to apply a causal inference V3 V4 V3 V4 V3 V4

algorithm we need to remove all i-false edges


V5 V6 V5 V6 V5 V6
from the learned structure. We need to identify
(a) (b) (c)
the substructures that can indicate this type of
V1 V1 V1
edges. This is easily done by looking at any
two variables that are connected by an immedi- V2 V2 V2

ate connection, and when this edge is removed,


V3 V4 V3 V4 V3 V4
they have at least one inducing path between
them. To check whether the immediate con- V5 V6 V5 V6 V5 V6
nection needs to be present we have to block (d) (e) (f)

all inducing paths by performing one or more


experiments on an inducing path blocking set Figure 3: (a) The result of FCI on data of the
(i-blocking set) D ip and block all other paths underlying DAG of Figure 1(a). (b) Result of
by conditioning on a blocking set D. If V i and an experiment on V5 . (c) After experiment on
Vj are dependent, i.e., Vi Vj under these cir- V4 . (d) After experiment on V3 . (e) After ex-
2

cumstances the edge is correct and otherwise it periment on V2 while conditioning on V3 . (f)
can be removed. After resolving all problems of Type 1 and 2.
In the example of Figure 1(c), we can block
the inducing path by performing an experiment of the FCI algorithm can be seen in Figure 3(a).
on V2 , and hence can check that V1 and V6 We will first resolve problems of Type 1 and 2,
do not covary with each other in these circum- and then remove i-false edges. The result of
stances, so the edge can be removed. each step is indicated in Figure 3.
In Table 3 an overview of the actions to re-
solve i-false edges is given. exp(V5 )

Given A pair of connected variables V i , Vj V5 ooV4 :


A set of inducing paths Vi , . . . , Vj exp(V5 ) 6 V4 V4 o V5 (Type 2(a))
Action Block all inducing paths Vi , . . . , Vj V5 o V6 :
by conditioning on i-blocking set exp(V5 ) 6 V6 V5 V6 (Type 1(a))
D ip . exp(V4 )
Block all other paths between Vi
and Vj by conditioning on blocking V4 ooV2 :
set D. exp(V4 ) 6 V2 V2 o V4 (Type 2(a))
When performing all exp(D ip )|D: V4 ooV3 :
if Vi Vj : confounding is real exp(V4 ) 6 V3 V3 o V4 (Type 2(a))
2

else remove edge between Vi , Vj V4 o V5 :


exp(V4 ) V5 V4 V5 (Type 1(b))
Table 3: Removing inducing path edges.
V4 o V6 :
exp(V4 ) V6 V4 V6 (Type 1(b))
3.4 Example
exp(V3 )
We will demonstrate a number of steps to dis-
cover the completely oriented SMCM (Figure V3 ooV2 :
1(b)) based on the result of the FCI algorithm exp(V3 ) 6 V2 V2 o V3 (Type 2(a))
applied on observational data generated from V3 o V4 :
the underlying DAG in Figure 1(a). The result exp(V3 ) V4 V3 V4 (Type 1(b))



exp(V2 ) and exp(V2 )|V3 , because two p.d. L = {L1 , . . . , Ln0 }. Then the joint probabil-
paths between V2 and V4 ity distribution can be written as the following
mixture of products:
V2 ooV1 :
exp(V2 ) 6 V1 V1 o V2 (Type 2(a))
V2 o V3 :
exp(V2 ) V3 V2 V3 (Type 1(b))
V2 o V4 :
exp(V2 )|V3 V4 V2 V4 (Type 1(c))

P (v) = Σ_{lk : Lk ∈ L} ∏_{Vi ∈ V} P (vi | P a(vi ), LP a(vi )) · ∏_{Lj ∈ L} P (lj )    (1)

After resolving all problems of Type 1 and Taking into account that in a SMCM the la-
2 we end up with the SMCM structure shown tent variables are implicitly represented by bi-
in Figure 3(f). This representation is no longer directed edges, we introduce the following defi-
consistent with the MAG representation since nition.
there are bi-directed edges between two vari- Definition 8. In a SMCM, the set of observ-
ables on a directed path, i.e., V2 , V6 . There is able variables can be partitioned by assigning
a potentially i-false edge V1 V6 in the struc- two variables to the same group iff they are con-
ture with inducing path V1 , V2 , V6 , so we need nected by a bi-directed path. We call such a
to perform an experiment on V2 , blocking all group a c-component (from confounded com-
other paths between V1 and V6 (this is also done ponent) (Tian and Pearl, 2002).
by exp(V2 ) in this case). Given that the orig-
For instance, in Figure 1(b) variables
inal structure is as in Figure 1(a), performing
V2 , V5 , V6 belong to the same c-component.
exp(V2 ) shows that V1 and V6 are independent,
Then it can be readily seen that c-components
i.e., exp(V2 ) : (V1
V6 ). Thus the bi-directed
and their associated latent variables form re-
edge between V1 and V6 is removed, giving us
spective partitions of the observable and la-
the SMCM of Figure 1(b).
tent variables. Let Q[Si ] denote the contribu-
4 Parametrisation of SMCMs tion of a c-component with observable variables
Si V to the mixture of products in Equa-
As mentioned before, in their work on causal in- tion 1. Then Q we can rewrite the JPD as follows:
ference, Tian and Pearl provide an algorithm for P (v) = Q[Si ].
performing causal inference given knowledge of i{1,...,k}
the structure of an SMCM and the joint prob- Finally, (Tian and Pearl, 2002) proved that
ability distribution (JPD) over the observable each Q[S] could be calculated as follows. Let
variables. However, they do not provide a para- Vo1 < . . . < Von be a topological order over V ,
metrisation to efficiently store the JPD over the and let V (i) = Vo1 < . . . < Voi , i = 1, . . . , n, and
observables. V (0) = .
We start this section by discussing the fac- Y
Q[S] = P (vi |(Ti P a(Ti ))\{Vi }) (2)
torisation for SMCMs introduced in (Tian and
Vi S
Pearl, 2002). From that result we derive an ad-
ditional representation for SMCMs and a para- where Ti is the c-component of the SMCM G
metrisation that facilitates probabilistic and reduced to variables V (i) , that contains Vi . The
causal inference. We will also discuss how these SMCM G reduced to a set of variables V 0
parameters can be learned from data. V is the graph obtained by removing from the
graph all variables V \V 0 and the edges that are
4.1 Factorising with Latent Variables connected to them.
Consider an underlying DAG with observable In the rest of this section we will develop
variables V = {V1 , . . . , Vn } and latent variables a method for deriving a DAG from a SMCM.



We will show that the classical factorisation ∏ P (vi |P a(vi )) associated with this DAG, is the same as the one associated with the SMCM, as above.

4.2 Parametrised representation

Here we first introduce an additional representation for SMCMs, then we show how it can be parametrised and, finally we discuss how this new representation could be optimised.

Figure 4: (a) The PR-representation applied to the SMCM of Figure 1(b). (b) Junction tree representation of the DAG in (a).
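The c-component decomposition of Definition 8, on which Equations (1) and (2) rest, only requires the connected components of the bi-directed part of the SMCM. The sketch below assumes a simple encoding (a node list and a list of bi-directed edges); the example edge set is a guess chosen to reproduce the c-component {V2, V5, V6} mentioned above, not the exact graph of Figure 1(b).

# Illustrative sketch only: c-components are the connected components of the
# bi-directed part of an SMCM (Definition 8).
def c_components(nodes, bidirected):
    """Partition `nodes` into c-components, given bi-directed edges (a, b)."""
    comp = {v: {v} for v in nodes}
    for a, b in bidirected:
        if comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for v in merged:
                comp[v] = merged
    seen, result = [], []
    for s in comp.values():
        if s not in seen:
            seen.append(s)
            result.append(s)
    return result

nodes = ["V1", "V2", "V3", "V4", "V5", "V6"]
# Hypothetical bi-directed edges giving the c-component {V2, V5, V6}:
print(c_components(nodes, [("V2", "V6"), ("V5", "V6")]))
# e.g. [{'V1'}, {'V2', 'V5', 'V6'}, {'V3'}, {'V4'}]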
4.2.1 PR-representation
Consider Vo1 < . . . < Von to be a topological
order O over the observable variables V , only The resulting DAG is an I -map (Pearl, 1988),
considering the directed edges of the SMCM, over the observable variables of the indepen-
and let V (i) = Vo1 < . . . < Voi , i = 1, . . . , n, and dence model represented by the SMCM. This
V (0) = . Then Table 4 shows how the para- means that all the independencies that can be
metrised (PR-) representation can be obtained derived from the new graph must also be present
from the original SMCM structure. in the JPD over the observable variables. This
property can be more formally stated as the fol-
Given a SMCM G and a topological ordering
lowing theorem.
O, the PR-representation has these properties:
1. The nodes are V , i.e., the observable var-
Theorem 9. The PR-representation P R de-
iables of the SMCM.
rived from a SMCM S is an I-map of that
2. The directed edges are the same as in the
SMCM.
SMCM.
3. The confounded edges are replaced by a
Proof. Consider two variables A and B in P R
number of directed edges as follows:
that are not connected by an edge. This means
Add an edge from node Vi to node Vj iff:
that they are not independent, since a necessary
condition for two variables to be conditionally
a) Vi (Tj P a(Tj )), where Tj is the
independent in a stable DAG is that they are
c-component of G reduced to variables
not connected by an edge. P R is stable as we
V (j) that contains Vj ,
consider only stable problems (Assumption 1 in
b) and there was not already an edge
Section 2.4).
between nodes Vi and Vj .
Then, from the method for constructing P R
Table 4: Obtaining the PR-representation. we can conclude that S (i) contains no directed
edge between A and B, (ii) contains no bi-
This way each variable becomes a child of the directed edge between A and B, (iii) contains
variables it would condition on in the calcula- no inducing path between A and B. Property
tion of the contribution of its c-component as in (iii) holds because the only inducing paths that
Equation (2). are possible in a SMCM are those between a
Figure 4(a) shows the PR-representation of member of a c-component and the immediate
the SMCM in Figure 1(a). The topological or- parent of another variable of the c-component,
der that was used here is V1 < V2 < V3 < V4 < and in these cases the method for constructing
V5 < V6 and the directed edges that have been P R adds an edge between those variables. Be-
added are V1 V5 , V2 V5 , V1 V6 , V2 V6 , cause of (i),(ii), and (iii) we can conclude that
and, V5 V6 . A and B are independent in S.
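Table 4 translates into a short procedure. The sketch below is an assumed implementation written here for illustration, not the authors' code: it walks the observable variables in a topological order of the directed part, computes the c-component of the reduced SMCM containing the current variable, and adds a directed edge into that variable from every other member of the c-component and from every parent of the c-component, unless an edge is already present. As discussed later in Section 4.2.3, the chosen topological order influences how many edges are added.

# Illustrative sketch only (assumed encoding): build the directed edge set of
# the PR-representation from an SMCM, following Table 4.
def pr_representation(order, directed, bidirected):
    """order: observable variables in a topological order of the directed part;
    directed: set of edges (a, b) meaning a -> b;
    bidirected: iterable of bi-directed edges (a, b)."""
    bidir = {frozenset(e) for e in bidirected}
    pr_edges = set(directed)
    for j, vj in enumerate(order):
        prefix = set(order[:j + 1])
        # c-component T_j of the SMCM reduced to the first j+1 variables
        tj, frontier = {vj}, [vj]
        while frontier:
            x = frontier.pop()
            for y in prefix:
                if y not in tj and frozenset((x, y)) in bidir:
                    tj.add(y)
                    frontier.append(y)
        pa_tj = {a for (a, b) in directed if b in tj}   # parents of T_j
        for vi in (tj | pa_tj) - {vj}:
            if (vi, vj) not in pr_edges and (vj, vi) not in pr_edges:
                pr_edges.add((vi, vj))
    return pr_edges

# Hypothetical SMCM: A -> B -> C with B <-> D.
print(sorted(pr_representation(["A", "B", "C", "D"],
                               {("A", "B"), ("B", "C")}, [("B", "D")])))
# [('A', 'B'), ('A', 'D'), ('B', 'C'), ('B', 'D')]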



4.2.2 Parametrisation We have solved this problem by first trans-
For this DAG we can use the same para- forming the SMCM into its PR-representation,
metrisation as for classical BNs, i.e., learning which allows us to apply the junction tree in-
P (vi |P a(vi )) for each variable, where P a(vi ) ference algorithm. This is a consequence of
denotes the parents in the new DAG. In this the fact that, as previously mentioned, the PR-
way the JPD over the observable variables representation is an I -map over the observable
factorises as in a classical BN, i.e., P (v) = variables. And as the JT algorithm is based
Q only on independencies in the DAG, applying it
P (vi |P a(vi )). This follows immediately from
the definition of a c-component and from Equa- to an I -map of the problem gives correct results.
tion (2). See Figure 4(b) for the junction tree obtained
from the parametrised representation in Figure
4.2.3 Optimising the Parametrisation 4(a).
We have mentioned that the number of Note that any other classical probabilistic in-
edges added during the creation of the PR- ference technique that only uses conditional in-
representation depends on the topological order dependencies between variables could also be
of the SMCM. applied to the PR-representation.
As this order is not unique, choosing an or-
4.4 Causal inference
der where variables with a lesser amount of par-
ents have precedence, will cause less edges to Tian and Pearl (2002) developed an algorithm
be added to the DAG. This is because most of for performing causal inference, however as
the added edges go from parents of c-component mentioned before they have not provided an ef-
members to c-component members that are ficient parametrisation.
topological descendants. Richardson and Spirtes (2003) show causal
By choosing an optimal topological order, we inference in AGs on an example, but a detailed
can conserve more conditional independence re- approach is not provided and the problem of
lations of the SMCM and thus make the graph what to do when some of the parents of a vari-
more sparse, thus leading to a more efficient able are latent is not solved.
parametrisation. By definition in the PR-representation, the
parents of each variable are exactly those vari-
4.2.4 Learning parameters ables that have to be conditioned on in order to
As the PR-representation of SMCMs is a obtain the factor of that variable in the calcula-
DAG as in the classical Bayesian network for- tion of the c-component, see Table 4 and (Tian
malism, the parameters that have to be learned and Pearl, 2002). Thus, the PR-representation
are P (vi |P a(vi )). Therefore, techniques such provides all the necessary quantitative informa-
as ML and MAP estimation (Heckerman, 1995) tion, while the original structure of the SMCM
can be applied to perform this task. provides the necessary structural information,
for the algorithm by Tian and Pearl to be ap-
4.3 Probabilistic inference plied.
One of the most famous existing probabilistic
5 Conclusions and Perspectives
inference algorithm for models without latent
variables is the junction tree algorithm (JT) In this paper we have proposed a number of so-
(Lauritzen and Spiegelhalter, 1988). lutions to problems that arise when using SM-
These techniques cannot immediately be ap- CMs in practice.
plied to SMCMs for two reasons. First of all More precisely, we showed that there is a big
until now no efficient parametrisation for this gap between the models that can be learned
type of models was available, and secondly, it from data alone and the models that are used
is not clear how to handle the bi-directed edges in theory. We showed that it is important to re-
that are present in SMCMs. trieve the fully oriented structure of a SMCM,



and discussed how to obtain this from a given work of Excellence, IST-2002-506778.
CPAG by performing experiments.
For future work we would like to relax the
assumptions made in this paper. First of all References
we want to study the implications of allowing F. Eberhardt, C. Glymour, and R. Scheines. 2005.
two types of edges between two variables, i.e., On the number of experiments sufficient and in
the worst case necessary to identify all causal rela-
confounding as well as a immediate causal rela- tions among n variables. In Proc. of the 21st Con-
tionship. Another direction for possible future ference on Uncertainty in Artificial Intelligence
work would be to study the effect of allowing (UAI), pages 178183. Edinburgh.
multiple joint experiments in other cases than
D. Heckerman. 1995. A tutorial on learning with
when removing inducing path edges. Bayesian networks. Technical report, Microsoft
Furthermore, we believe that applying the Research.
orientation and tail augmentation rules of
S. L. Lauritzen and D. J. Spiegelhalter. 1988. Lo-
(Zhang and Spirtes, 2005) after each experi- cal computations with probabilities on graphical
ment, might help to reduce the number of exper- structures and their application to expert sys-
iments needed to fully orient the structure. In tems. Journal of the Royal Statistical Society, se-
this way we could extend to SMCMs our previ- ries B, 50:157244.
ous results (Meganck et al., 2006) on minimising S. Meganck, P. Leray, and B. Manderick. 2006.
the total number of experiments in causal mod- Learning causal Bayesian networks from observa-
els without latent variables. This would allow to tions and experiments: A decision theoretic ap-
proach. In Modeling Decisions in Artificial Intel-
compare empirical results with the theoretical
ligence, LNAI 3885, pages 58 69.
bounds developed in (Eberhardt et al., 2005).
Up until now SMCMs have not been para- R.E. Neapolitan. 2003. Learning Bayesian Net-
metrised in another way than by the entire joint works. Prentice Hall.
probability distribution. Another contribution J. Pearl. 1988. Probabilistic Reasoning in Intelligent
of our paper is that we showed that using an Systems. Morgan Kaufmann.
alternative representation, we can parametrise J. Pearl. 2000. Causality: Models, Reasoning and
SMCMs in order to perform probabilistic as well Inference. MIT Press.
as causal inference. Furthermore this new rep-
resentation allows to learn the parameters using T. Richardson and P. Spirtes. 2002. Ancestral graph
Markov models. Technical Report 375, Dept. of
classical methods. Statistics, University of Washington.
We have informally pointed out that the
choice of a topological order when creating the T. Richardson and P. Spirtes, 2003. Causal infer-
ence via ancestral graph models, chapter 3. Ox-
PR-representation, influences its size and thus ford Statistical Science Series: Highly Structured
its efficiency. We would like to investigate this Stochastic Systems. Oxford University Press.
property in a more formal manner. Finally, we
P. Spirtes, C. Glymour, and R. Scheines. 2000. Cau-
have started implementing the techniques intro-
sation, Prediction and Search. MIT Press.
duced in this paper into the structure learning
package (SLP)1 of the Bayesian networks tool- J. Tian and J. Pearl. 2002. On the identification
box (BNT)2 for MATLAB. of causal effects. Technical Report (R-290-L),
UCLA C.S. Lab.
Acknowledgments J. Zhang and P. Spirtes. 2005. A characterization of
Markov equivalence classes for ancestral graphical
This work was partially funded by an IWT- models. Technical Report 168, Dept. of Philoso-
scholarship and by the IST Programme of the phy, Carnegie-Mellon University.
European Community, under the PASCAL net-
1
http://banquiseasi.insa-rouen.fr/projects/bnt-slp/
2
http://bnt.sourceforge.net/



Geometry of Rank Tests
Jason Morton, Lior Pachter, Anne Shiu, Bernd Sturmfels, and Oliver Wienand
Department of Mathematics, UC Berkeley

Abstract
We study partitions of the symmetric group which have desirable geometric properties.
The statistical tests defined by such partitions involve counting all permutations in the
equivalence classes. These permutations are the linear extensions of partially ordered sets
specified by the data. Our methods refine rank tests of non-parametric statistics, such
as the sign test and the runs test, and are useful for the exploratory analysis of ordinal
data. Convex rank tests correspond to probabilistic conditional independence structures
known as semi-graphoids. Submodular rank tests are classified by the faces of the cone of
submodular functions, or by Minkowski summands of the permutohedron. We enumerate
all small instances of such rank tests. Graphical tests correspond to both graphical models
and to graph associahedra, and they have excellent statistical and algorithmic properties.

1 Introduction In Section 3 we describe the class of con-


vex rank tests which captures properties of tests
The non-parametric approach to statistics was used in practice. We work in the language of
introduced by (Pitman, 1937). The emergence algebraic combinatorics (Stanley, 1997). Con-
of microarray data in molecular biology has led vex rank tests are in bijection with polyhedral
to a number of new tests for identifying signifi- fans that coarsen the hyperplane arrangement
cant patterns in gene expression time series; see of Sn , and with conditional independence struc-
e.g. (Willbrand, 2005). This application moti- tures known as semi-graphoids (Studeny, 2005).
vated us to develop a mathematical theory of
rank tests. We propose that a rank test is a Section 4 is devoted to convex rank tests that
partition of Sn induced by a map : Sn T are induced by submodular functions. These
from the symmetric group of all permutations of submodular rank tests are in bijection with
[n] = {1, . . . , n} onto a set T of statistics. The Minkowski summands of the (n1)-dimensional
statistic () is the signature of the permuta- permutohedron and with structural imset mod-
tion Sn . Each rank test defines a partition els. Furthermore, these tests are at a suitable
of Sn into classes, where and 0 are in the level of generality for the biological applications
same class if and only if () = ( 0 ). We iden- that motivated us. We make the connections
tify T = image( ) with the set of all classes in to polytopes and independence models concrete
this partition of Sn . Assuming the uniform dis- by classifying all convex rank tests for n 5.
tribution on Sn , the probability of seeing a par-
In Section 5 we discuss the class of graphi-
ticular signature t T is 1/n! times | 1 (t)|.
cal tests. In mathematics, these correspond to
The computation of a p-value for a given per-
graph associahedra, and in statistics to graphi-
mutation Sn typically amounts to summing
cal models. The equivalence of these two struc-
1 tures is shown in Theorem 18. The implemen-
Pr( 0 ) | 1 ( 0 ) |

= (1) tation of convex rank tests requires the efficient
n!
enumeration of linear extensions of partially or-
over all permutations 0 with Pr( 0 ) < Pr(). dered sets (posets). A key ingredient is a highly
In Section 2 we explain how existing rank tests optimized method for computing distributive
can be understood from our point of view. lattices. Our software is discussed in Section 6.
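These definitions are easy to experiment with for small n. The sketch below is our own illustration, not the authors' software: the signature map is an assumed example of the up-down type, the classes of the rank test are recovered as the fibres of the map, and the p-value of Equation (1) is evaluated by brute-force enumeration of Sn.

# Illustrative sketch only: a rank test given by a signature map tau, its
# classes as fibres of tau, and the p-value of Equation (1) by enumeration.
from itertools import permutations
from math import factorial

def up_down_signature(rank_vector):
    """Assumed example signature: the up-down pattern of the rank vector,
    i.e. the signs of consecutive differences (cf. the up-down analysis test)."""
    return tuple(b > a for a, b in zip(rank_vector, rank_vector[1:]))

def p_value(sigma, tau, n):
    """Sum Pr(sigma') over all permutations sigma' with Pr(sigma') < Pr(sigma),
    where Pr(sigma') = |tau^{-1}(tau(sigma'))| / n!."""
    class_size = {}
    for pi in permutations(range(1, n + 1)):
        class_size[tau(pi)] = class_size.get(tau(pi), 0) + 1
    pr = {t: c / factorial(n) for t, c in class_size.items()}
    cutoff = pr[tau(sigma)]
    return sum(pr[tau(pi)] for pi in permutations(range(1, n + 1))
               if pr[tau(pi)] < cutoff)

# For n = 3 the up-down classes have sizes 1, 2, 2, 1; the class of (2, 1, 3)
# has probability 2/6, so the p-value sums the two singleton classes: 1/3.
print(p_value((2, 1, 3), up_down_signature, 3))  # 0.333...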
2 Rank tests and posets there is some 2 C such that (a, b) 2 ; thus,
0 L(1 2 ) by (PC). This shows 0 C.
A permutation in Sn is a total order on [n] = A pre-convex rank test is thus an unordered
{1, . . . , n}. This means that is a set of n2 collection of posets P1 , , . . . , Pk on [n] that sat-
ordered pairs of elements in [n]. If and 0 are isfies the property that Sn is the disjoint union
permutations then 0 is a partial order. of the subsets L(P1 ), . . . , L(Pk ). The posets Pi
In the applications we have in mind, the data that represent the classes in a pre-convex rank
are vectors u Rn with distinct coordinates. test capture the shapes of data vectors.
The permutation associated with u is the to-
Example 1 (The sign test for paired data).
tal order = { (i, j) [n] [n] : ui < uj }.
The sign test is performed on data that are
We shall employ two other ways of writing
paired as two vectors u = (u1 , u2 , . . . , um ) and
this permutation. The first is the rank vector
v = (v1 , v2 , . . . , vm ). The null hypothesis is that
= (1 , . . . , n ), whose defining properties are
the median of the differences ui vi is 0. The
{1 , . . . , n } = [n] and i < j if and only if
test statistic is the number of differences that
ui < uj . That is, the coordinate of the rank
are positive. This test is a rank test, because u
vector with value i is at the same position as the
and v can be transformed into the overall ranks
ith smallest coordinate of u. The second is the
of the n = 2m values, and the rank vector en-
descent vector = (1 , . . . , n ), defined by ui >
tries can then be compared. This test coarsens
ui+1 . The ith coordinate of the descent vector
the convex rank test which is the MSS of Section
is the position of the ith largest value of u. For
4 with K = {{1, m + 1}, {2, m + 2}, . . . }.
example, if u = (11, 7, 13) then its permutation
is represented by = {(2, 1), (1, 3), (2, 3)}, by Example 2 (Runs tests). A runs test can be
= (2, 1, 3), or by = (3, 1, 2). used when there is a natural ordering on the
A permutation is a linear extension of a data points, such as in a time series. The data
partial order P on [n] if P . We write are transformed into a sequence of pluses and
L(P ) Sn for the set of linear extensions of P . minuses, and the null hypothesis is that the
A partition of the symmetric group Sn is a pre- number of observed runs is no more than that
convex rank test if the following axiom holds: expected by chance. A runs test is a coarsening
of the convex rank test described in (Will-
If () = ( 0 ) and 00 L( 0 ) brand, 2005, Section 6.1.1) and in Example 4.
(P C)
then () = ( 0 ) = ( 00 ). These two examples suggest that many tests
from classical statistics have a natural refine-
Note that 00 L( 0 ) means 0 00 . ment by a pre-convex rank test. The term
For n = 3 the number of all rank tests is the Bell pre-convex refers to the following interpreta-
number B6 = 203. Of these 203 set partitions tion of the axiom (PC). Consider any two vec-
of S3 , only 40 satisfy the axiom (PC). tors u and u0 in Rn , and a convex combination
Each class C of a pre-convex rank test cor- u00 = u + (1 )u0 , with 0 < < 1. If
responds to a poset P on [n]; namely, P is , 0 , 00 are the permutations of u, u0 , u00 then
the intersection of all total orders in that class: 00 L( 0 ). Thus the regions in Rn speci-
T
P = C . The axiom (PC) ensures that C fied by a pre-convex rank test are convex cones.
coincides with the set L(P ) of all linear exten-
3 Convex rank tests
sions of P . The inclusion C L(P ) is clear. For
the reverse inclusion, note that from any permu- A fan in Rn is a finite collection F of polyhe-
tation in L(P ), we can obtain any other 0 in dral cones which satisfies the following proper-
L(P ) by a sequence of reversals (a, b) 7 (b, a), ties: (i) if C F and C 0 is a face of C, then
where each intermediate is also in L(P ). As- C 0 F, (ii) If C, C 0 F, then C C 0 is a face
sume 0 L(P ) and 1 C differ by one rever- of C. Two vectors u and v in Rn are permu-
sal (a, b) 0 , (b, a) 1 . Then (b, a) / P , so tation equivalent when ui < uj if and only if



vi < vj , and ui = uj if and only if vi = vj for all 123 11132
i, j [n]. The permutation equivalence classes 11
11 3|
1
(of which there are 13 for n = 3) induce a fan 1 11
which we call the Sn -fan. The maximal cones 213111 312
in the Sn -fan, which are the closures of the per- 11
11
1 3|{2}
mutation equivalence classes corresponding to 11
total orders, are indexed by permutations in
231 321
Sn . A coarsening of the Sn -fan is a fan F such
that every permutation equivalence class of Rn 1 3|
MMM
is fully contained in a cone C of F; F defines MMM qq
a partition of Sn because each maximal cone of MMM q q
qM
the Sn -fan is contained in some cone C F. q q q MMMMM
MMM
We define a convex rank test to be a partition qq
of Sn defined by a coarsening of the Sn -fan. We 1 3|{2}
identify the fan with that test.
Two maximal cones of the Sn -fan share a wall Figure 1: The permutohedron P3 and the S3 -
if there exists an index k such that k = k+1 0 ,
0 0
fan projected to the plane. Each permutation
k+1 = k and i = i for i 6= k, k + 1. That is, is represented by its descent vector = 1 2 3 .
the corresponding permutations and 0 differ
by an adjacent transposition. To such an un-
ordered pair {, 0 }, we associate the following and is visualized in Figure 1. Permutations are
conditional independence (CI) statement: in the same class if they are connected by a
solid edge; there are four classes. In the S3 -fan,
k k+1 | {1 , . . . , k1 }. (2) the two missing walls are labeled by conditional
independence statements as defined in (2).
This formula defines a map from the set of walls
of the Sn -fan onto the set of all CI statements Example 5 (Up-down analysis for n = 4). The
 test F in (Willbrand, 2005) is shown in Figure 2.
Tn = i j | K : K [n]\{i, j} . The double edges correspond to the 12 CI state-
The map from walls to CI statements is not in- ments in MF . There are 8 classes; e.g., the class
jective; there are (n k 1)!(k 1)! walls which {3412, 3142, 1342, 1324, 3124} consists of the 5
are labelled by the statement (2). permutations with up-down pattern (, +, ).
Any convex rank test F is characterized by Our proof of Theorem 3 rests on translating
the collection of walls {, 0 } that are removed the semi-graphoid axiom for a set of CI state-
when passing from the Sn -fan to F. So, from ments into geometric statements about the cor-
(2), any convex rank test F maps to a set MF responding set of edges of the permutohedron.
of CI statements corresponding to missing walls. The Sn -fan is the normal fan (Ziegler, 1995)
Recall from (Matus, 2004) and (Studeny, 2005) of the permutohedron Pn , which is the convex
that a subset M of Tn is a semi-graphoid if the hull of the vectors (1 , . . . , n ) Rn , where
following axiom holds: runs over all rank vectors of permutations in
Sn . The edges of Pn correspond to walls and
i j | K ` M and i ` | K M
are thus labeled with CI statements. A collec-
implies i j | K M and i ` | K j M. tion of parallel edges of Pn perpendicular to a
hyperplane xi = xj corresponds to the set of
Theorem 3. The map F 7 MF is a bijection CI statements i j|K, where K ranges over
between convex rank tests and semi-graphoids. all subsets of [n]\{i, j}. The two-dimensional
Example 4 (Up-down analysis for n = 3). The faces of Pn are squares and regular hexagons,
test in (Willbrand, 2005) is a convex rank test and two edges of Pn have the same label in Tn



fine a presentation of the group algebra of Sn :
4132xPP O4123
OOO
xxxx PPPP O771423
xxxxxxx  777 (BS) i,i+1 i+k+1,i+k+2 i+k+1,i+k+2 i,i+1 ,
xxxx 1432 777
xx x
x  7777 (BH) i,i+1 i+1,i+2 i,i+1 i+1,i+2 i,i+1 i+1,i+2 ,
4312xxxx  4213 7777
    7777 2
(BN ) i,i+1 1,
 1342  7777
 vv::::
) 1243
 4321 vvvvvvv :::::: 4231
v
 )) where suitable i and k vary over [n]. The first
3412+P vvvv
v :::: 2413 )))
++PPPP vvvvvvv ::::  ) two are the braid relations, and the last repre-
++ P9vv99314299 ::::  
9 :: 1234 )  sents the idempotency of each transposition.
++ 9999 x ) 2143

+9 9999
1324
xxxxxxx ))  Now, we regard these relations as proper-
34219999 9 9999 xxxx ))  
9999 9999 xxxxxxx 2431 )  ties of a set of edges of Pn , by identifying a
9999 9
9x(xxx x 2134x)
9
999
3124 (( xxxx
word and a permutation with the set of edges
9999 ( xxxxxx that comprise the corresponding path in Pn .
99LLL (( x
3241 LLLLL (( 2341 xxxxxxxx For example, a set satisfying (BS) is one such
LL x
xxx that, starting from any , the edges of the path
3214 2314
i,i+1 i+k+1,i+k+2 are in the set if and only if the
edges of the path i+k+1,i+k+2 i,i+1 are in the
Figure 2: The permutohedron P4 with vertices set. Note then, that (BS) is the square axiom,
marked by descent vectors . The test up- and (BH) is a weakening of the hexagon ax-
down analysis is indicated by the double edges. iom of semi-graphoids. That is, implications in
either direction hold in a semi-graphoid. How-
ever, (BN) holds only directionally in a semi-
if, but not only if, they are opposite edges of a graphoid: if an edge lies in the semi-graphoid,
square. A semi-graphoid M can be identified then its two vertices are in the same class; but
with the set M of edges with labels from M. the empty path at some vertex certainly does
The semi-graphoid axiom translates into a geo- not imply the presence of all incident edges in
metric condition on the hexagonal faces of Pn . the semi-graphoid. Thus, for a semi-graphoid,
we have (BS) and (BH), but must replace (BN)
with the directional version
Observation 6. A set M of edges of the per-
mutohedron Pn is a semi-graphoid if and only (BN 0 ) 2
i,i+1 1.
if M satisfies the following two axioms:
Square axiom: Whenever an edge of a square Consider a path p from to 0 in a semi-
is in M, then the opposite edge is also in M. graphoid. A result of (Tits, 1968) gives the fol-
Hexagon axiom: Whenever two adjacent edg- lowing lemma; see also (Brown, 1989, p. 49-51).
es of a hexagon are in M, then the two opposite Lemma 7. If M is a semi-graphoid, then if
edges of that hexagon are also in M. and 0 lie in the same class of M, then so do
all shortest paths on Pn between them.
Let M be the subgraph of the edge graph of We are now equipped to prove Theorem 3.
Pn defined by the statements in M. Then the Note that we have demonstrated that semi-
classes of the rank test defined by M are given graphoids and convex rank tests can be re-
by the permutations in the path-connected com- garded as sets of edges of Pn , so we will show
ponents of M. We regard a path from to that their axiom systems are equivalent. We
0 on Pn as a word (1) (l) in the free as- first show that a semi-graphoid satisfies (PC).
sociative algebra A generated by the adjacent Consider , 0 in the same class C of a semi-
transpositions of [n]. For example, the word graphoid, and let 00 L(, 0 ). Further, let p be
23 := (23) gives the path from to 0 = 23 = a shortest path from to 00 (so, p = 00 ), and
1 3 2 4 . . . n . The following relations in A de- let q be a shortest path from 00 to 0 . We claim



that qp is a shortest path from σ to σ′, and thus σ′′ ∈ C by Lemma 7. Suppose qp is not a shortest path. Then, we can obtain a shorter path in the semi-graphoid by some sequence of substitutions according to (BS), (BH), and (BN). Only (BN) decreases the length of a path, so the sequence must involve (BN). Therefore, there is some i, j in [n], such that their positions relative to each other are reversed twice in qp. But p and q are shortest paths, hence one reversal occurs in each p and q. Then σ and σ′ agree on whether i > j or j > i, but the reverse holds in σ′′, contradicting σ′′ ∈ L(σ, σ′). Thus every semi-graphoid is a pre-convex rank test.

Now, we show that a semi-graphoid corresponds to a fan. Consider the cone corresponding to a class C. We need only show that it meets any other cone in a shared face. Since C is a cone of a coarsening of the Sn-fan, each nonmaximal face of C lies in a hyperplane H = {xi = xj}. Suppose a face of C coincides with the hyperplane H and that i > j in C. A vertex σ borders H if i and j are adjacent in σ. We will show that if σ, σ′ ∈ C border H, then their reflections σ̄ = σ1 · · · ji · · · σn and σ̄′ = σ′1 · · · ji · · · σ′n both lie in some class C′. Consider a great circle path between σ and σ′ which stays closest to H: all vertices in the path have i and j separated by at most one position, and no two consecutive vertices have i and j nonadjacent. This is a shortest path, so it lies in C, by Lemma 7. Using the square and hexagon axioms (Observation 6), we see that the reflection of the path across H is a path in the semi-graphoid that connects σ̄ to σ̄′ (Figure 3). Thus a semigraphoid is a convex rank test.

Figure 3: Reflecting a path across a hyperplane.

Finally, if M is a set of edges of Pn, representing a convex rank test, then it is easy to show that M satisfies the square and hexagon axioms. This completes the proof of Theorem 3.

Remark 8. For n = 3 there are 40 pre-convex rank tests, but only 22 of them are convex rank tests. The corresponding CI models are shown in Figure 5.6 on page 108 in (Studeny, 2005).

4 The submodular cone

In this section we examine a subclass of the convex rank tests. Let 2^[n] denote the collection of all subsets of [n] = {1, 2, . . . , n}. Any real-valued function w : 2^[n] → R defines a convex polytope Qw of dimension n − 1 as follows:

Qw := { x ∈ R^n : x1 + x2 + · · · + xn = w([n]) and \sum_{i ∈ I} xi ≤ w(I) for all ∅ ≠ I ⊂ [n] }.

A function w : 2^[n] → R is called submodular if w(I) + w(J) ≥ w(I ∪ J) + w(I ∩ J) for I, J ⊆ [n].

Proposition 9. A function w : 2^[n] → R is submodular if and only if the normal fan of the polyhedron Qw is a coarsening of the Sn-fan.

This follows from greedy maximization as in (Lovasz, 1983). Note that the function w is submodular if and only if the optimal solution of

maximize u · x subject to x ∈ Qw

depends only on the permutation equivalence class of u. Thus, solving this linear programming problem constitutes a convex rank test. Any such test is called a submodular rank test.

A convex polytope is a (Minkowski) summand of another polytope if the normal fan of the latter refines the normal fan of the former. The polytope Qw that represents a submodular rank test is a summand of the permutohedron Pn.

Theorem 10. The following combinatorial objects are equivalent for any positive integer n:
1. submodular rank tests,
2. summands of the permutohedron Pn,
3. structural conditional independence models,
4. faces of the submodular cone Cn in R^{2^n}.

We have 1 ↔ 2 from Proposition 9, and 1 ↔ 3 follows from (Studeny, 2005). Further 3 ↔ 4 holds by definition.
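Proposition 9 also gives a direct way to evaluate a submodular rank test: by the greedy maximization of (Lovasz, 1983), the optimum of the linear program above is found by sorting the entries of the data vector. The following sketch is our illustration, not code from the paper; it assumes w is given as a Python function on frozensets and is submodular. Mapping u to the greedy vertex then realizes the rank test, since the vertex depends only on the ordering of the entries of u.

```python
def greedy_vertex(w, u):
    """Lovasz greedy maximizer of u.x over the polytope Q_w.

    w maps frozensets of {0,...,n-1} to reals and is assumed submodular;
    u is the data vector.  The returned vertex depends only on the ordering
    of the entries of u, so u -> greedy_vertex(w, u) is a submodular rank test.
    """
    n = len(u)
    order = sorted(range(n), key=lambda i: -u[i])   # decreasing order of u
    x, prefix = [0.0] * n, frozenset()
    for i in order:
        x[i] = w(prefix | {i}) - w(prefix)          # marginal value of adding i
        prefix = prefix | {i}
    return tuple(x)

# Example: w_K(I) = 1 if K meets I, else 0, with K = {0, 1} (an MSS-type function)
w_K = lambda I: 1.0 if I & {0, 1} else 0.0
print(greedy_vertex(w_K, [0.9, 0.1, 0.5]))   # (1.0, 0.0, 0.0) for every u with u0 > u2 > u1
```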



The submodular cone is the cone Cn of all submodular functions w : 2^[n] → R. Working modulo its lineality space Cn ∩ (−Cn), we regard Cn as a pointed cone of dimension 2^n − n − 1.

Remark 11. All 22 convex rank tests for n = 3 are submodular. The submodular cone C3 is a 4-dimensional cone whose base is a bipyramid. The polytopes Qw, as w ranges over the faces of C3, are all the Minkowski summands of P3.

Proposition 12. For n ≥ 4, there exist convex rank tests that are not submodular rank tests. Equivalently, there are fans that coarsen the Sn-fan but are not the normal fan of any polytope.

This result is stated in Section 2.2.4 of (Studeny, 2005) in the following form: There exist semi-graphoids that are not structural. We answered Question 4.5 posed in (Postnikov, 2006) by finding a non-submodular convex rank test in which all the posets Pi are trees:

M = { 2 ⊥ 3 | {1, 4},  1 ⊥ 4 | {2, 3},  1 ⊥ 2 | ∅,  3 ⊥ 4 | ∅ }.

Remark 13. For n = 4 there are 22108 submodular rank tests, one for each face of the 11-dimensional cone C4. The base of this submodular cone is a polytope with f-vector (1, 37, 356, 1596, 3985, 5980, 5560, 3212, 1128, 228, 24, 1).

Remark 14. For n = 5 there are 117978 coarsest submodular rank tests, in 1319 symmetry classes. We confirmed this result of (Studeny, 2000) with POLYMAKE (Gawrilow, 2000).

We now define a class of submodular rank tests, which we call Minkowski sum of simplices (MSS) tests. Note that each subset K of [n] defines a submodular function wK by setting wK(I) = 1 if K ∩ I is non-empty and wK(I) = 0 if K ∩ I is empty. The corresponding polytope QwK is the simplex ∆K = conv{ek : k ∈ K}. Now consider an arbitrary subset K = {K1, K2, . . . , Kr} of 2^[n]. It defines the submodular function wK = wK1 + wK2 + · · · + wKr. The corresponding polytope is the Minkowski sum

∆K = ∆K1 + ∆K2 + · · · + ∆Kr.

The associated MSS test τK is defined as follows. Given π ∈ Sn, we compute the number of indices j ∈ [r] such that max{πk : k ∈ Kj} = πi, for each i ∈ [n]. The signature τK(π) is the vector in N^n whose ith coordinate is that number. Few submodular rank tests are MSS tests:

Remark 15. For n = 3, among the 22 submodular rank tests, only 15 are MSS tests. For n = 4, among the 22108, only 1218 are MSS.

5 Graphical tests

Graphical models are fundamental in statistics, and they also lead to a useful class of rank tests. First we show how to associate a semi-graphoid to a family K. Let FwK be the normal fan of QwK. We write MK for the CI model derived from FwK using the bijection in Theorem 3.

Proposition 16. The semi-graphoid MK is the set of CI statements i ⊥ j | K which satisfy the following property: all sets containing {i, j} and contained in {i, j} ∪ [n]\K are not in K.

Let G be a graph with vertex set [n]. We define K(G) to be the collection of all subsets K of [n] such that the induced subgraph G|K is connected. Recall that the undirected graphical model (or Markov random field) derived from the graph G is the set MG of CI statements:

MG = { i ⊥ j | C : the restriction of G to [n]\C contains no path from i to j }.

The polytope ∆G = ∆K(G) is the graph associahedron, which is a well-studied object in combinatorics (Carr, 2004; Postnikov, 2005). The next theorem is derived from Proposition 16.

Theorem 17. The CI model induced by the graph associahedron coincides with the graphical model MG, i.e., MK(G) = MG.

There is a natural involution ∗ on the set of all CI statements which is defined as follows:

(i ⊥ j | C)∗ := i ⊥ j | [n]\(C ∪ {i, j}).

If M is any CI model, then the CI model M∗ is obtained by applying the involution ∗ to all the CI statements in the model M. Note that this involution was called duality in (Matus, 1992).
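Returning to the MSS tests of Section 4, the signature τK(π) can be computed directly from its definition. The following sketch is ours, not taken from the paper's software; positions are 1-based and the sets Kj are given as Python sets, which are our own conventions.

```python
def mss_signature(pi, family):
    """Signature of a permutation under the MSS test for a set family.

    pi is a tuple (pi_1, ..., pi_n) of the values 1..n; family is a list of
    subsets K_j of {1, ..., n}.  Coordinate i of the signature counts the
    sets K_j whose maximum under pi is attained at position i.
    """
    n = len(pi)
    sig = [0] * n
    for K in family:
        i_max = max(K, key=lambda k: pi[k - 1])   # position attaining max{pi_k : k in K_j}
        sig[i_max - 1] += 1
    return tuple(sig)

# Example: n = 3, family {{1,2}, {2,3}}, pi = (2,3,1): both maxima sit at position 2
print(mss_signature((2, 3, 1), [{1, 2}, {2, 3}]))   # (0, 2, 0)
```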



Figure 4: The permutohedron P4. Double edges indicate the test τK(G) when G is the path. Edges with large dots indicate the test τ∗K(G).

The graphical tubing rank test τ∗K(G) is the test associated with M∗K(G). It can be obtained by a construction similar to the MSS test τK, with the function wK defined differently and supermodular. The graphical model rank test τK(G) is the MSS test of the set family K(G).

We next relate τK(G) and τ∗K(G) to a known combinatorial characterization of the graph associahedron ∆G. Two subsets A, B ⊂ [n] are compatible for the graph G if one of the following conditions holds: A ⊂ B, B ⊂ A, or A ∩ B = ∅, and there is no edge between any node in A and B. A tubing of the graph G is a subset T of 2^[n] such that any two elements of T are compatible. Carr and Devadoss (2005) showed that ∆G is a simple polytope whose faces are in bijection with the tubings.

Theorem 18. The following four combinatorial objects are isomorphic for any graph G on [n]:
the graphical model rank test τK(G),
the graphical tubing rank test τ∗K(G),
the fan of the graph associahedron ∆G,
the simplicial complex of all tubings on G.

The maximal tubings of G correspond to vertices of the graph associahedron ∆G. When G is the path of length n, then ∆G is the associahedron, and when it is a cycle, ∆G is the cyclohedron. The number of classes in the tubing test τ∗K(G) is the G-Catalan number of (Postnikov, 2005). This number is \frac{1}{n+1}\binom{2n}{n} for the associahedron test and \binom{2n-2}{n-1} for the cyclohedron test.

6 Enumerating linear extensions

In this paper we introduced a hierarchy of rank tests, ranging from pre-convex to graphical. Rank tests are applied to data vectors u ∈ R^n, or permutations π ∈ Sn, and locate their cones. In order to determine the significance of a data vector, one needs to compute the quantity |τ^{-1}(τ(u))|, and possibly the probabilities of other maximal cones. These cones are indexed by posets P1, P2, . . . , Pk on [n], and the probability computations are equivalent to finding the cardinality of some of the sets L(Pi).

We now present our methods for computing linear extensions. If the rank test is a tubing test then this computation is done as follows. From the given permutation, we identify its signature (image under τ), which we may assume is its G-tree T (Postnikov, 2005). Suppose the root of the tree T has k children, each of which is a root of a subtree Ti for i = 1, . . . , k. Writing |Ti| for the number of nodes in Ti, we have

\left|\tau^{-1}(T)\right| = \binom{\sum_{i=1}^{k}|T_i|}{|T_1|,\ldots,|T_k|} \prod_{i=1}^{k}\left|\tau^{-1}(T_i)\right| .

This recursive formula can be translated into an efficient iterative algorithm. In (Willbrand, 2005) the analogous problem is raised for the test in Example 3. A determinantal formula for (1) appears in (Stanley, 1997, page 69).

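The recursion for |τ^{-1}(T)| translates directly into code. The following sketch is our illustration (the nested-list tree encoding is our own, not the paper's); it counts the permutations whose signature is a given G-tree.

```python
from math import factorial

def multinomial(parts):
    """Multinomial coefficient (sum(parts) choose parts[0], ..., parts[-1])."""
    out = factorial(sum(parts))
    for p in parts:
        out //= factorial(p)
    return out

def count_linear_extensions(tree):
    """Number of permutations mapping to a given G-tree signature.

    Implements |tau^-1(T)| = multinomial(|T_1|,...,|T_k|) * prod_i |tau^-1(T_i)|.
    A tree is encoded as a nested list of children: [] is a leaf, and
    [c1, c2, ...] is a node whose subtrees are the c_i.
    """
    def size(t):
        return 1 + sum(size(c) for c in t)
    if not tree:
        return 1
    sizes = [size(c) for c in tree]
    result = multinomial(sizes)
    for c in tree:
        result *= count_linear_extensions(c)
    return result

# Root with two subtrees: a single leaf and a two-node path -> 3!/(1! 2!) = 3
print(count_linear_extensions([[], [[]]]))   # 3
```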


For an arbitrary convex rank test we proceed as follows. The test is specified (implicitly or explicitly) by a collection of posets P1, . . . , Pk on [n]. From the given permutation, we first identify the unique poset Pi of which that permutation is a linear extension. We next construct the distributive lattice L(Pi) of all order ideals of Pi. Recall that an order ideal is a subset O of [n] such that if l ∈ O and (k, l) ∈ Pi then k ∈ O. The set of all order ideals is a lattice with meet and join operations given by set intersection O ∩ O′ and set union O ∪ O′. Knowledge of this distributive lattice L(Pi) solves our problem because the linear extensions of Pi are precisely the maximal chains of L(Pi). Computing the number of linear extensions is #P-complete (Brightwell, 1991). Therefore we developed efficient heuristics to build L(Pi).

The key algorithmic task is the following: given a poset Pi on [n], compute an efficient representation of the distributive lattice L(Pi). Our program for performing rank tests works as follows. The input is a permutation π and a rank test τ. The test τ can be specified either
by a list of posets P1, . . . , Pk (pre-convex),
or by a semigraphoid M (convex rank test),
or by a submodular function w : 2^[n] → R,
or by a collection K of subsets of [n] (MSS),
or by a graph G on [n] (graphical test).
The output of our program has two parts. First, it gives the number |L(Pi)| of linear extensions, where the poset Pi represents the equivalence class of Sn specified by the data π. It also gives a representation of the distributive lattice L(Pi), in a format that can be read by the maple package posets (Stembridge, 2004). Our software for the above rank tests is available at www.bio.math.berkeley.edu/ranktests/.

Acknowledgments

This paper originated in discussions with Olivier Pourquie and Mary-Lee Dequeant in the DARPA Fundamental Laws of Biology Program, which supported Jason Morton, Lior Pachter, and Bernd Sturmfels. Anne Shiu was supported by a Lucent Technologies Bell Labs Graduate Research Fellowship. Oliver Wienand was supported by the Wipprecht foundation.

References

G Brightwell and P Winkler. Counting linear extensions. Order, 8(3):225-242, 1991.

K Brown. Buildings. Springer, New York, 1989.

M Carr and S Devadoss. Coxeter complexes and graph associahedra. 2004. Available from http://arxiv.org/abs/math.QA/0407229.

E Gawrilow and M Joswig. Polymake: a framework for analyzing convex polytopes, in Polytopes Combinatorics and Computation, eds. G Kalai and G M Ziegler, Birkhauser, 2000, 43-74.

L Lovasz. Submodular functions and convexity, in Math Programming: The State of the Art, eds. A Bachem, M Groetschel, and B Korte, Springer, 1983, 235-257.

F Matus. Ascending and descending conditional independence relations, in Proceedings of the Eleventh Prague Conference on Inform. Theory, Stat. Dec. Functions and Random Proc., Academia, B, 1992, 189-200.

F Matus. Towards classification of semigraphoids. Discrete Mathematics, 277, 115-145, 2004.

EJG Pitman. Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society, 4(1):119-130, 1937.

A Postnikov. Permutohedra, associahedra, and beyond. 2005. Available from http://arxiv.org/abs/math/0507163.

A Postnikov, V Reiner, L Williams. Faces of Simple Generalized Permutohedra. Preprint, 2006.

RP Stanley. Enumerative Combinatorics Volume I, Cambridge University Press, Cambridge, 1997.

J Stembridge. Maple packages for symmetric functions, posets, root systems, and finite Coxeter groups. Available from www.math.lsa.umich.edu/jrs/maple.html.

M Studeny. Probabilistic conditional independence structures. Springer Series in Information Science and Statistics, Springer-Verlag, London, 2005.

M Studeny, RR Bouckaert, and T Kocka. Extreme supermodular set functions over five variables. Institute of Information Theory and Automation, Research report n. 1977, Prague, 2000.

J Tits. Le probleme des mots dans les groupes de Coxeter. Symposia Math., 1:175-185, 1968.

K Willbrand, F Radvanyi, JP Nadal, JP Thiery, and T Fink. Identifying genes from up-down properties of microarray expression series. Bioinformatics, 21(20):3859-3864, 2005.

G Ziegler. Lectures on polytopes. Vol. 152 of Graduate Texts in Mathematics. Springer-Verlag, 1995.



An Empirical Study of Efficiency and Accuracy of Probabilistic Graphical
Models
Jens Dalgaard Nielsen and Manfred Jaeger
Institut for Datalogi, Aalborg Universitet
Frederik Bajers Vej 7, 9220 Aalborg Ø, Denmark
{dalgaard,jaeger}@cs.aau.dk

Abstract
In this paper we compare Naïve Bayes (NB) models, general Bayes Net (BN) models and Proba-
bilistic Decision Graph (PDG) models w.r.t. accuracy and efficiency. As the basis for our analysis
we use graphs of size vs. likelihood that show the theoretical capabilities of the models. We also
measure accuracy and efficiency empirically by running exact inference algorithms on randomly
generated queries. Our analysis supports previous results by showing good accuracy for NB mod-
els compared to both BN and PDG models. However, our results also show that the advantage of
the low complexity inference provided by NB models is not as significant as assessed in a previous
study.

1 Introduction nalised likelihood metrics like BIC, AIC and MDL


are weighted sums of model accuracy and size.
Probabilistic graphical models (PGMs) have been When the learned model is to be used for general in-
applied extensively in machine learning and data ference, including a measure for inference complex-
mining research, and many studies have been dedi- ity into the metric is relevant. Neither BIC, AIC nor
cated to the development of algorithms for learning MDL explicitly takes inference complexity into ac-
PGMs from data. Automatically learned PGMs are count when assessing a given model. Recently, sev-
typically used for inference, and therefore efficiency eral authors have independently emphasised the im-
and accuracy of the PGM w.r.t. inference are of in- portance of considering inference complexity when
terest when evaluating a learned model. applying learning in a real domain.
Among some of the most commonly used PGMs
are the general Bayesian Network model (BN) and Beygelzimer and Rish (2003) investigate the
the Nave Bayes model (NB). The BN model ef- tradeoff between model accuracy and efficiency.
ficiently represents a joint probability distribution They only consider BN models for a given target
over a domain of discrete random variables by a fac- distribution (in a learning setting, the target distri-
torization into independent local distributions. The bution is the empirical distribution defined by the
NB model contains a number of components defined data; more generally, the target distribution could
by an unobserved latent variable and models each be any distribution one wants to represent). For BNs
discrete random variable as independent of all other treewidth is an adequate efficiency measure (defined
variables within each component. Exact inference as k 1, where k is the size of the largest clique in
has linear time complexity in the size of the model an optimal junction tree). Tradeoff curves that plot
when using NB models. treewidth against the best possible accuracy achiev-
For the general BN model both exact and approx- able with a given treewidth are introduced. These
imate inference are NP-hard (Cooper, 1987; Dagum tradeoff curves can be used to investigate the ap-
and Luby, 1993). proximability of a target distribution.
Model-selection algorithms for learning PGMs As an example for a distribution with poor ap-
typically use some conventional score-metric, proximability in this sense, Beygelzimer and Rish
searching for a model that optimises the metric. Pe- (2003) mention the parity distribution, which rep-
resents the parity function on n binary inputs. An Our present paper extends these previous works in
accurate representation of this distribution requires two ways. First, we conduct a comparative analysis
a BN of treewidth n 1, and any BN with a smaller of accuracy vs. efficiency tradeoffs for three type
treewidth can approximate the parity distribution of PGMs: BN, NB and PDG models. Our results
only as well as the empty network. show that in spite of theoretical differences BN,
The non-approximability of the parity distribu- NB and PDG models perform surprisingly similar
tion (and hence the impossibility of accurate models when learned from real data, and no single model
supporting efficient inference) only holds under the is consistently superior. Second, we investigate the
restriction to BN models with nodes corresponding theoretical and empirical efficiency of exact infer-
exactly to the n input bits. The use of other PGMs, ence for all models. This analysis somewhat differs
or the use of latent variables in a BN representation, from the analysis in (Lowd and Domingos, 2005),
can still lead to accurate and computationally effi- where only approximate inference was considered
cient representations of the parity distribution. for BNs. The latter approach can lead to somewhat
Motivated by some distributions refusal to be ef- unfavorable results for BNs, because approximate
ficiently approximated by BN models, the PGM lan- inference can be much slower than exact inference
guage of probabilistic decision graph (PDG) mod- for models still amenable to exact inference. Our re-
els was developed (Jaeger, 2004). In particular, the sults show that while NB models are still very com-
parity distribution is representable by a PDG that petitive w.r.t. accuracy, exact inference in BN mod-
has inference complexity linear in n. In a recent els is often tractable and differences in empirically
study an empirical analysis of the approximations measured run-times are typically not significant.
offered by BN and PDG models learned from real-
world data was conducted (Jaeger et al., 2006). Sim- 2 Probabilistic Graphical Models
ilar to the tradeoff curves of (Beygelzimer and Rish,
2003), Jaeger et al. (2006) used graphs showing In this section we introduce the three types of mod-
likelihood of data vs. size of the model for the anal- els that we will use in our experiments; the gen-
ysis of accuracy vs. complexity. The comparison of eral Bayesian Network (BN), the Nave Bayesian
PDGs vs. BNs did not produce a clear winner, and Network (NB) and the Probabilistic Decision Graph
the main lesson was that the models offer surpris- (PDG).
ingly similar tradeoffs when learned from real data.
2.1 Bayesian Network Models
Also motivated by considerations of model ac-
curacy and efficiency, Lowd and Domingos (2005) BNs (Jensen, 2001; Pearl, 1988) are a class of
in a recent study compared NB and BN models. probabilistic graphical models that represent a joint
NB models can potentially offer accuracy-efficiency probability distribution over a domain X of discrete
tradeoff behaviors that for some distributions differ random variables through a factorization of inde-
from those provided by standard BN representation pendent local distributions or factors. The structure
(although NBs do not include the latent class mod- of a BN is a directed acyclic graph (DAG) G =
els that allow an efficient representation of the par- (V, E) of nodes V and directed edges E. Each ran-
ity distribution). Lowd and Domingos (2005) deter- dom variable Xi X is represented
Q by a node Vi
mine inference complexity empirically by measur- V, and the factorization Xi X P (Xi |paG (Xi ))
ing inference times on randomly generated queries. (where paG (Xi ) is the set of random variables rep-
The inferences are computed exactly for NB mod- resented by parents of node Vi in DAG G) defines
els, but for BN models approximate methods were the full joint probability distribution P (X) repre-
used (Gibbs sampling and loopy belief propaga- sented by the BN model. By size of a BN we under-
tion). Lowd and Domingos (2005) conclude that stand the size of the representation, i.e. the number
NB models offer approximations that are as accu- of independent parameters. Exact inference is usu-
rate as those offered by BN models, but in terms of ally performed by first constructing a junction tree
inference complexity the NB models are reported to from the BN. Inference is then solvable in time lin-
be orders of magnitude faster than BN models. ear in the size of the junction tree (Lauritzen and



Spiegelhalter, 1988), which may be exponential in the size of the BN from which it was constructed.

2.2 Naïve Bayes Models

The NB model represents a joint probability distribution over a domain X of discrete random variables by introducing an unobserved, latent variable C. Each state of C is referred to as a component, and conditioned on C, each variable Xi ∈ X is assumed to be independent of all other variables in X. This yields the simple factorization: P(X, C) = P(C) ∏_{Xi ∈ X} P(Xi | C). Exact inference is computable in time linear in the representation size of the NB model.

2.3 Probabilistic Decision Graph Models

PDGs are a fairly new language for probabilistic graphical modeling (Jaeger, 2004; Bozga and Maler, 1999). As BNs and NBs, PDGs represent a joint probability distribution over a domain of discrete random variables X through a factorization of local distributions. However, the structure of the factorization defined by a PDG is not based on a variable level independence model but on a certain kind of context specific independencies among the variables. A PDG can be seen as a two-layer structure, 1) a forest of tree-structures over all members of X, and 2) a set of rooted DAG structures over parameter nodes, each holding a local distribution over one random variable. Figure 1(a) shows a forest F of tree-structures over binary variables X = {X0, X1, . . . , X5}, and figure 1(b) shows an example of a PDG structure based on F. For a complete semantics of the PDG model and algorithms for exact inference with linear complexity in the size of the model, the reader is referred to (Jaeger, 2004).

Figure 1: Example PDG. Subfigure (a) shows a forest-structure F over 6 binary variables, and (b) shows a full PDG structure based on F.

3 Elements of the Analysis

The goal of our analysis is to investigate the quality of PGMs learned from real data w.r.t. accuracy and inference efficiency. The appropriate notion of accuracy depends on the intended tasks for the model. Following (Jaeger et al., 2006; Lowd and Domingos, 2005; Beygelzimer and Rish, 2003) we use log-likelihood of the model given the data (L(M, D)) as a global measure of accuracy. Log-likelihood score is essentially equivalent to cross-entropy (CE) between the empirical distribution P^D and the distribution P^M represented in model M:

CE(P^D, P^M) = −H(P^D) − (1/|D|) L(M, D),   (1)

where H(·) is the entropy function. Observe that when CE(P^D, P^M) = 0 (when P^D and P^M are equal), then L(M, D) = −|D| H(P^D). Thus, data entropy is an upper bound on the log-likelihood.

3.1 Theoretical Complexity vs. Accuracy

We use SL-curves (Jaeger et al., 2006) in our analysis of the theoretical performance of each PGM language. SL-curves are plots of size vs. likelihood. The size of model here is the effective size, i.e. a model complexity parameter, such that inference has linear time complexity in this parameter. For NB and PDG models this is the size of the model itself. For BN models it is the size of the junction tree constructed for inference.

3.2 Empirical Complexity and Accuracy

The size measure used in the SL-curves described in section 3.1 measures inference complexity only up to a linear factor. Following Lowd and Domingos (2005), we estimate the complexity of exact inference also empirically by measuring execution times for random queries. A query for model M is solved by computing a conditional probability P^M(Q = q | E = e), where Q, E are disjoint subsets of X, and q, e are instantiations of Q, respectively E. Queries are randomly generated as follows: first a random pair ⟨Qi, Ei⟩ of subsets of variables is drawn from X. Then, an instance di is randomly drawn from the test data. The random query then is P^M(Q = di[Qi] | E = di[Ei]), where

An Empirical Study of Efficiency and Accuracy of Probabilistic Graphical Models 217


di [Qi ], di [Ei ] are the instantiations of Q, respec- a junction tree we were able to find using existing
tively E in di . The empirical complexity is sim- state-of-the-art learning and triangulation methods.
ply the average execution time for random queries.
The empirical accuracy is measured by averaging 4.2 Learning Nave Bayes Net Models
log(P M (Q = di [Qi ]|E = di [Ei ])) over the ran- For learning NB models we have implemented a
dom queries. Compared to the global accuracy mea- version of the NBE algorithm (Lowd and Domin-
sure L(M, D), this can be understood as a measure gos, 2005) for learning NB models. As the structure
for local accuracy, i.e. restricted to specific con- of NB models is fixed, the task reduces to learning
ditional and marginal distributions of P M . the number of states in the latent variable C, and
the parameters of the model. Learning in the pres-
4 Learning
ence of the latent components is done by standard
In this section we briefly describe the algorithms Expectation Maximization (EM) approach, follow-
we use for learning each type of PGM from data. ing (Lowd and Domingos, 2005; Karciauskas et al.,
For our analysis we need to learn a range of models 2004). Learning a range of models is done by incre-
with different efficiency vs. accuracy tradeoffs. For mentally increasing the number of states of C, and
score based learning a general -score will be used outputting the model learned for each cardinality. In
(Jaeger et al., 2006): this way we obtain a range of models that offer dif-
ferent complexity vs. accuracy tradeoffs. Note that
Sλ(M, D) = λ L(M, D) − (1 − λ)|M |,   (2)
ture is fixed.
where 0 < λ < 1, and |M | is the size of model
M . Equation (2) is a general score metric, as it be- 4.3 Learning Probabilistic Decision Graphs
comes equivalent to common metrics as BIC, AIC Learning of PDGs is done using the model-selection
and MDL for specific settings of λ.1 By optimizing ture is fixed.
scores with different settings of we get a range of local transformations as search operators, the algo-
models offering different tradeoffs between size and rithm performs a search for a structure that opti-
accuracy. Suitable ranges of -values were deter- mises -score.
mined experimentally for each type of model using
score score-based learning (BNs and PDGs). 5 Experiments
4.1 Learning Bayesian Networks We have produced SL-curves and empirically mea-
We use the KES algorithm for learning BN mod- sured inference times and accuracy, on 5 differ-
els (Nielsen et al., 2003). KES performs model- ent datasets (see table 1) from the UCI repository 2 .
selection in the space of equivalence classes of BN These five datasets are a representative sample of
structures using a semi-greedy heuristic. A param- the 50 datasets used in the extensive study by Lowd
eter k [0 . . . 1] controls the level of greediness, and Domingos (2005).
where a setting of 0 is maximally stochastic and 1 is We used the same versions of the datasets as used
maximally greedy. by Lowd and Domingos (2005). Specifically, the
The -score used in the KES algorithm uses the partitioning into training (90%) and test (10%) sets
size of the BN as the size parameter |M |, not the was the same, continuous variables were discretized
size of its junction tree. Clearly, it would be desir- into five equal frequency bins, and missing values
able to score BNs directly by the size of their junc- were interpreted as special states of the variables.
tion trees, but this appears computationally infeasi- For measuring the empirical efficiency and accu-
ble. Thus, our SL-curves for BNs do not show for a racy, we generated random queries as described in
given accuracy level the smallest possible size of a 3.2 consisting of 1 to 5 query variables Q and 0 to
junction tree achieving that accuracy, but the size of 5 evidence variables E.
1 E.g. (2) with λ = log|D| / (2 + log|D|) corresponds to BIC.
2 http://www.ics.uci.edu/mlearn



comparing exact inference in NB to approximate
Table 1: Datasets used for experiments.

Dataset               #Vars   training   test
Poisonous Mushroom      23      7337      787
King, Rook vs. King      7     25188     2868
Pageblocks              11      4482      574
Abalone                  9      3758      419
Image Segmentation      17      2047      263

methods in BNs. We do not observe this tendency when considering exact inference for both BNs and NBs. Our results show that we can learn BNs that are within reach of exact inference methods, and that the theoretical inference complexity as measured by effective model size mostly is similar for
a given accuracy level for all three PGM languages.
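The random queries of Section 3.2 and the empirical measurements reported here are easy to reproduce in outline. The sketch below is our illustration, not the code used in the paper; model.query(q, e) is a hypothetical stand-in for an exact inference engine (e.g. a junction-tree implementation) returning P(Q = q | E = e).

```python
import math
import random
import time

def random_queries(variables, test_data, n_queries=1000, n_q=4, n_e=3):
    """Draw random queries: n_q query and n_e evidence variables, instantiated
    from a randomly chosen test case (a dict mapping variable -> value)."""
    for _ in range(n_queries):
        chosen = random.sample(variables, n_q + n_e)
        d = random.choice(test_data)
        yield {v: d[v] for v in chosen[:n_q]}, {v: d[v] for v in chosen[n_q:]}

def empirical_accuracy_and_time(model, queries):
    """Average log P(Q = q | E = e) and average inference time per query.

    model.query(q, e) is an assumed interface returning the exact
    conditional probability of the query given the evidence.
    """
    logps, times = [], []
    for q, e in queries:
        t0 = time.time()
        p = model.query(q, e)
        times.append(time.time() - t0)
        logps.append(math.log(p))
    return sum(logps) / len(logps), sum(times) / len(times)
```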
Effective model size measures actual inference
For inference in BN and NB models we used time only up to a linear factor. In order to de-
the junction-tree algorithm implemented in the in- termine whether there possibly are huge (orders of
ference engine in Hugin3 through the Hugin Java magnitude) differences in these linear factors, we
API. For inference in PDGs, the method described measure the actual inference time on our random
in (Jaeger, 2004) was implemented in Java. BN and queries. The left column of table 3 shows the av-
NB experiments were performed on a standard lap- erage inference time for 1000 random queries with
top, 1.6GHz Pentium CPU with 512Mb RAM run- 4 query and 3 evidence variables (results for other
ning Linux. PDG experiments were performed on numbers of query and evidence variables were very
a Sun Fire280R, 900Mhz SPARC CPU with 4Gb similar). We observe that this empirical complex-
RAM running Solaris 9. ity behaves almost indistinguishably for BN and NB
models. This is not surprising, since both models
6 Results use the Hugin inference engine 4 . The results do
Table 2 shows SL-curves for each dataset and PGM, show, however, that the different structures of the
both for training (left column) and test sets (right junction trees for BN and NB models do not have
column). a significant impact on runtime. The linear factor
for PDG inference in these experiments is about 4
Lowd and Domingos (2005) based their compar-
times larger than that for BN/NB inference 5 . Seeing
ison on single BN and NB models. The NB models
that we use a proof-of-concept prototype Java im-
were selected by maximizing likelihood on a hold-
plementation for PDGs, and the commercial Hugin
out set. Crosses in the right column of table 2 in-
inference engine for BNs and NBs, this indicates
dicate the size-likelihood values obtained by the NB
that PDGs are competitive in practice, not only ac-
models reported in (Lowd and Domingos, 2005).
cording to theoretical complexity analyses.
The first observation we make from table 2 is
that no single model language consistently domi- The right column in table 3 shows the empirical
nates the others. The plots in the left column shows (local) accuracy obtained for 1000 random queries
that BN models have the lowest log-likelihood mea- with 4 query and 3 evidence variables. Overall, the
sured on training data consistently for models larger results are consistent with the global accuracy on
than some small threshold. For Abalone and Image the test data (table 2, right column). The differences
Segmentation, this characteristic is mitigated in the observed for the different PGMs in table 2 can also
plots for the test-data, where especially PDGs seem be seen in table 3, though the discrepancies tend
to overfit the training-data and accordingly receives 4
Zhang (1998) shows that variable elimination can be more
low log-likelihood score on the test-data. efficient than junction tree-based inference. However, his re-
The overall picture in table 2 is that in terms of ac- sults do not indicate that we would obtain substantially differ-
ent results if we used variable elimination in our experiments.
curacy, BNs and NBs are often quite similar (King, 5
This factor has to be viewed with caution, since PDG in-
Rook vs. King is the only exception). This is con- ference was run on a machine with a slower CPU but more
sistent with what Lowd and Domingos (2005) have main memory. When running PDG inference on the same ma-
chine as NB/BN inference, we observed overall a similar per-
found. However, Lowd and Domingos (2005) re- formance, but more deviations from a strictly linear behavior
ported big differences in inference complexity when (in table 3 still visible to some degree for the Mushroom and
Abalone data). These deviations seem mostly attributable to
3
http://www.hugin.com the memory management in the Java runtime environment.



Table 2: SL-curves for train-sets (left column) and test-sets (right column). The crosses marks the NB models reported by Lowd and Domingos (2005). −H(D) (minus data-entropy) is plotted as a horizontal line.

[SL-curve plots: log-likelihood vs. effective model size for BN, NB and PDG models on the Poisonous Mushroom, King Rook vs. King, Pageblocks, Abalone and Image Segmentation datasets, training and test sets.]



Table 3: Empirical efficiency (left column) and accuracy (right column) for 4 query and 3 evidence variables.

[Plots of empirical complexity (ms) and empirical accuracy (log-likelihood) vs. effective model size for BN, NB and PDG models on the Poisonous Mushroom, King Rook vs. King, Pageblocks, Abalone and Image Segmentation datasets.]



to become less pronounced on the random queries as on global likelihood (particularly for PDGs in the image segmentation data). One possible explanation for this is that low global likelihood scores are mostly due to a few test cases whose joint instantiation of the variables are given low probability by a model, and that these isolated low-probability configurations are seldom met with in the random queries.

7 Conclusion

Motivated by several previous, independent studies on the tradeoff between model accuracy and efficiency in different PGM languages, we have investigated the performance of BN, NB, and PDG models. Our main findings are: 1) In contrast to potentially widely different performance on artificial examples (e.g. the parity distribution), we observe a relatively uniform behavior of all three languages on real-life data. 2) Our results confirm the conclusions of Lowd and Domingos (2005) that the NB model is a viable alternative to the BN model for general purpose probabilistic modeling and inference. However, the order-of-magnitude advantages in inference complexity could not be confirmed when comparing exact inference methods for both types of models. 3) Previous theoretical complexity analyses for inference in PDG models now have been complemented with empirical results showing also the practical competitiveness of PDGs.

Acknowledgment

We thank Daniel Lowd for providing the pre-processed datasets we used in our experiments. We thank the AutonLab at CMU for making their efficient software libraries available to us.

References

Beygelzimer, A. and Rish, I.: 2003, Approximability of probability distributions, Advances in Neural Information Processing Systems 16, The MIT Press.

Bozga, M. and Maler, O.: 1999, On the representation of probabilities over structured domains, Proceedings of the 11th International Conference on Computer Aided Verification, Springer, pp. 261-273.

Cooper, G. F.: 1987, Probabilistic inference using belief networks is NP-hard, Technical report, Knowledge Systems Laboratory, Stanford University.

Dagum, P. and Luby, M.: 1993, Approximating probabilistic inference in Bayesian belief networks is NP-hard, Artificial Intelligence 60, 141-153.

Jaeger, M.: 2004, Probabilistic decision graphs - combining verification and ai techniques for probabilistic inference, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12, 19-42.

Jaeger, M., Nielsen, J. D. and Silander, T.: 2006, Learning probabilistic decision graphs, International Journal of Approximate Reasoning 42(1-2), 84-100.

Jensen, F. V.: 2001, Bayesian Networks and Decision Graphs, Springer.

Karciauskas, G., Kocka, T., Jensen, F. V., Larranaga, P. and Lozano, J. A.: 2004, Learning of latent class models by splitting and merging components, Proceedings of the Second European Workshop on Probabilistic Graphical Models, pp. 137-144.

Lauritzen, S. L. and Spiegelhalter, D. J.: 1988, Local computations with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society 50(2), 157-224.

Lowd, D. and Domingos, P.: 2005, Naive Bayes models for probability estimation, Proceedings of the Twentysecond International Conference on Machine Learning, pp. 529-536.

Nielsen, J. D., Kocka, T. and Pena, J. M.: 2003, On local optima in learning Bayesian networks, Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, pp. 435-442.

Pearl, J.: 1988, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann Publishers.

Zhang, N. L.: 1998, Computational properties of two exact algorithms for Bayesian networks, Applied Intelligence 9, 173-183.



Adapting Bayes Network Structures to Non-stationary Domains
Søren Holbech Nielsen and Thomas D. Nielsen
Department of Computer Science
Aalborg University
Fredrik Bajers Vej 7E
9220 Aalborg Ø, Denmark

Abstract
When an incremental structural learning method gradually modifies a Bayesian network (BN)
structure to fit observations, as they are read from a database, we call the process structural adap-
tation. Structural adaptation is useful when the learner is set to work in an unknown environment,
where a BN is to be gradually constructed as observations of the environment are made. Existing
algorithms for incremental learning assume that the samples in the database have been drawn from
a single underlying distribution. In this paper we relax this assumption, so that the underlying dis-
tribution can change during the sampling of the database. The method that we present can thus be
used in unknown environments, where it is not even known whether the dynamics of the environ-
ment are stable. We briefly state formal correctness results for our method, and demonstrate its
feasibility experimentally.

1 Introduction These papers all assume that the database of


observations has been produced by a stationary
Ever since Pearl (1988) published his seminal book stochastic process. That is, the ordering of the ob-
on Bayesian networks (BNs), the formalism has be- servations in the database is inconsequential. How-
come a widespread tool for representing, eliciting, ever, many real life observable processes cannot re-
and discovering probabilistic relationships for prob- ally be said to be invariant with respect to time:
lem domains defined over discrete variables. One Mechanical mechanisms may suddenly fail, for
area of research that has seen much activity is the instance, and non-observable effects may change
area of learning the structure of BNs, where prob- abruptly. When human decision makers are some-
abilistic relationships for variables are discovered, how involved in the data generating process, these
or inferred, from a database of observations of these are almost surely not fully describable by the ob-
variables (main papers in learning include (Heck- servables and may change their behaviour instanta-
erman, 1998)). One part of this research area fo- neously. A simple example of a situation, where it is
cuses on incremental structural learning, where ob- unrealistic to expect a stationary generating process,
servations are received sequentially, and a BN struc- is an industrial system, where some component is
ture is gradually constructed along the way with- exchanged for one of another make. Similarly, if
out keeping all observations in memory. A special the coach of a soccer team changes the strategy of
case of incremental structural learning is structural the team during a match, data on the play from af-
adaptation, where the incremental algorithm main- ter the chance would be distributed differently from
tains one or more candidate structures and applies that representing the time before.
changes to these structures as observations are re- In this work we relax the assumption on sta-
ceived. This particular area of research has received tionary data, opting instead for learning from data
very little attention, with the only results that we are which is only approximately stationary. More
aware of being (Buntine, 1991; Lam and Bacchus, concretely, we assume that the data generating pro-
1994; Lam, 1998; Friedman and Goldszmidt, 1997; cess is piecewise stationary, as in the examples
Roure, 2004). given above, and thus do not try to deal with data
where the data generating process changes gradu- in the distribution P . The d-separation criterion is
ally, as can happen when machinery is slowly being thus
worn down.1 Furthermore, we focus on domains X G Y | Z X PB Y | Z,
where the shifts in distribution from one stationary
for any BN B (G, ). The set of all conditional
period to the next is of a local nature (i.e. only a
independence statements that may be read from a
subset of the probabilistic relationships among vari-
graph in this manner, is referred to as that graphs
ables change as the shifts take place).
d-separation properties.
2 Preliminaries We refer to any two graphs over the same vari-
ables as being equivalent if they have the same
As a general notational rule we use bold font to de- d-separation properties. Equivalence is obviously
note sets and vectors (V , c, etc.) and calligraphic an equivalence relation. Verma and Pearl (1990)
font to denote mathematical structures and compo- proved that equivalent graphs necessarily have the
sitions (B, G, etc.). Moreover, we shall use upper same skeleton and the same v-structures.2 The
case letters to denote random variables or sets of equivalence class of graphs containing a specific
random variables (X, Y , V , etc.), and lower case graph G can then be uniquely represented by the
letters to denote specific states of these variables partially directed graph G obtained from the skele-
(x4 , y , c, etc.). is used to denote defined as. ton of G by directing links that participate in a v-
A BN B (G, ) over a set of discrete vari- structure in G in the direction dictated by G. G
ables V consists of an acyclic directed graph (tradi- is called the pattern of G. Any graph G , which is
tionally abbreviated DAG) G, whose nodes are the obtained from G by directing the remaining undi-
variables in V , and a set of conditional probabil- rected links, without creating a directed cycle or a
ity distributions (which we abbreviate CPTs for new v-structure, is then equivalent to G. We say
conditional probability table). A correspondence that G is a consistent extension of G . The partially
between G and is enforced by requiring that directed graph G obtained from G , by directing
consists of one CPT P (X|PAG (X)) for each vari- undirected links as they appear in G, whenever all
able X, specifying a conditional probability distri- consistent extensions of G agree on this direction,
bution for X given each possible instantiation of the is called the completed pattern of G. G is ob-
parents PAG (X) of X in G. A unique joint distri- viously a unique representation of Gs equivalence
bution PB over V is obtained by taking the product class as well.
of all the CPTs in . When it introduces no am- Given any joint distribution P over V it is possi-
biguity, we shall sometimes treat B as synonymous ble to construct a BN B such that P = PB (Pearl,
with its graph G. 1988). A distribution P for which there is a BN
Due to the construction of PB we are guaranteed BP (GP , P ) such that PBP = P and also
that all dependencies inherent in PB can be read di-
rectly from G by use of the d-separation criterion X P Y | Z X GP Y | Z
(Pearl, 1988). The d-separation criterion states that,
holds, is called DAG faithful, and BP (and some-
if X and Y are d-separated by Z, then it holds that
times just GP ) is called a perfect map. DAG faithful
X is conditionally independent of Y given Z in PB ,
distributions are important since, if a data generat-
or equivalently, if X is conditionally dependent of
ing process is known to be DAG faithful, then a per-
Y given Z in PB , then X and Y are not d-separated
fect map can, in principle, be inferred from the data
by Z in G. In the remainder of the text, we use
under the assumption that the data is representative
X G Y | Z to denote that X is d-separated from
of the distribution.
Y by Z in the DAG G, and X P Y | Z to denote
For any probability distribution P over variables
that X is conditionally independent of Y given Z
V and variable X V , we define a Markov bound-
1
The changes in distribution of such data is of a continous ary (Pearl, 1988) of X to be a set S V \ {X}
nature, and adaptation of networks would probably be better
2
accomplished by adjusting parameters in the net, rather than A triple of nodes (X, Y, Z) constitutes a v-structure iff X
the structure itself. and Z are non-adjacent and both are parents of Y .



such that X P V \ (S {X}) | S and this holds Pt be the distribution generating sample point t.
for no proper subset of S. It is easy to see that if Furthermore, let B1 , . . . , Bl be the BNs found by a
P is DAG faithful, the Markov boundary of X is structural adaptation method M , when receiving D.
uniquely defined, and consists of Xs parents, chil- Given a distance measure dist on BNs, we define the
dren, and childrens parents in a perfect map of P . deviance of M on D wrt. dist as
In the case of P being DAG faithful, we denote the
Markov boundary by MBP (X). 1X
l
dev(M, D) dist(BPi , Bi ).
l
3 The Adaptation Problem i=1

Before presenting our method for structural adapta- For a method M to adapt to a DAG faithful sample
tion, we describe the problem more precisely: sequence D wrt. dist then means that M seeks to
We say that a sequence is samples from a piece- minimize its deviance on D wrt. dist as possible.
wise DAG faithful distribution, if the sequence can
be partitioned into sets such that each set is a 4 A Structural Adaptation Method
database sampled from a single DAG faithful dis-
tribution, and the rank of the sequence is the size The method proposed here continuously monitors
of the smallest such partition. Formally, let D = the data stream D and evaluates whether the last,
(d1 , . . . , dl ) be a sequence of observations over say k, observations fit the current model. When this
variables V . We say that D is sampled from a piece- turns out not to be the case, we conclude that a shift
wise DAG faithful distribution (or simply that it is in D took place k observations ago. To adapt to the
a piecewise DAG faithful sequence), if there are in- change, an immediate approach could be to learn a
dices 1 = i1 < < im = l + 1, such that each new network from the last k cases. By following
of Dj = (dij , . . . , dij+1 1 ), for 1 j m 1, this approach, however, we will unfortunately loose
is a sequence of samples from a DAG faithful dis- all the knowledge gained from cases before the last
tribution. The rank of the sequence is defined as k observations. This is a problem if some parts of
minj ij+1 ij , and we say that m 1 is its size and the perfect maps, of the two distributions on each
l its length. A pair of consecutive samples, di and side of the shift, are the same, since in such situ-
di+1 , constitute a shift in D, if there is j such that ations we re-learn those parts from the new data,
di is in Dj and di+1 is in Dj+1 . Obviously, we can even though they have not changed. Not only is this
have any sequence of observations being indistin- a waste of computational effort, but it can also be the
guishable from a piecewise DAG faithful sequence, case that the last k observations, while not directly
by selecting the partitions small enough, so we re- contradicting these parts, do not enforce them ei-
strict our attention to sequences that are piecewise ther, and consequently they are altered erroneously.
DAG faithful of at least rank r. However, we do not Instead, we try to detect where in the perfect maps
assume that neither the actual rank nor size of the of the two distributions changes have taken place,
sequences are known, and specifically we do not as- and only learn these parts of the new perfect map.
sume that the indices i1 , . . . , im are known. This presents challenges, not only in detection, but
The learning task that we address in this paper also in learning the changed parts and having them
consists of incrementally learning a BN, while re- fit the non-changed parts seamlessly. Hence, the
ceiving a piecewise DAG faithful sequence of sam- method consists of two main mechanisms: One,
ples, and making sure that after each sample point monitoring the current BN while receiving obser-
the BN structure is as close as possible to the dis- vations and detecting when and where the model
tribution that generated this point. Throughout the should be changed, and two, re-learning the parts of
paper we assume that each sample is complete, so the model that conflicts with the observations, and
that no observations in the sequence have missing integrating the re-learned parts with the remaining
values. Formally, let D be a complete piecewise parts of the model. These two mechanisms are de-
DAG faithful sample sequence of length l, and let scribed below in Sections 4.1 and 4.2, respectively.



4.1 Detecting Changes bets are off):
The detection part of our method, shown in Al-
gorithm 1, continuously processes the cases it re-
log [ P_E(X_i = d_i) / P_B(X_i = d_i | X_j = d_j (j ≠ i)) ] > 0 .   (1)
ceives. For each observation d and node X, the
ceives. For each observation d and node X, the
method measures (using C ONFLICT M EASURE(B, Therefore, we let C ONFLICT M EASURE(B, Xi , d)
X, d)) how well d fits with the local structure of return the value given on the left-hand side of (1).
B around X. Based on the history of measurements We note that this is where the assumption of com-
for node X, cX , the method tests (using S HIFT I N - plete data comes into play: If d is not completely
S TREAM(cX , k)) whether a shift occurred k obser- observed, then (1) cannot be evaluated for all nodes
vations ago. k thus acts as the number of observa- Xi .
tions that are allowed to pass before the method Since a high value returned by C ONFLICT M EA -
should realize that a shift has taken place. We there- SURE() for a node X could be caused by a rare

fore call the parameter k the allowed delay of the case, we cannot use that value directly for deter-
method. When the actual detection has taken place, mining whether a shift has occurred. Rather, we
as a last step, the detection algorithm invokes the look at the block of values for the last k cases,
updating algorithm (U PDATE N ET()) with the set and compare these with those from before that. If
of nodes, for which S HIFTI N S TREAM() detected a there is a tendency towards higher values in the for-
change, together with the last k observations. mer, then we conclude that this cannot be caused
only by rare cases, and that a shift must have oc-
Algorithm 1 Algorithm for BN adaption. Takes as curred. Specifically, for each variable X, S HIFT I N -
input an initial network B, defined over variables S TREAM(cX , k) checks whether there is a signif-
V , a series of cases D, and an allowed delay k for icant increase in the values of the last k entries in
detecting shifts in D. cX relative to those before that. In our implementa-
1: procedure A DAPT(B, V , D, k) tion S HIFTI N S TREAM(cX , k) calculates the nega-
2: D [] tive of the second discrete cosine transform compo-
3: cX [] (X V ) nent (see e.g. (Press et al., 2002)) of the last 2k mea-
4: loop
5: d N EXT C ASE (D) sures in cX , and returns true if this statistic exceeds
6: A PPEND(D , (d)) a pre-specified threshold value. We are unaware of
7: S
8: for X V do
any previous work using this technique for change
9: c C ONFLICT M EASURE(B, X, d) point detection, but we chose to use this as it out-
10: A PPEND(cX , c) performed the more traditional methods of log-odds
11: if S HIFT I N S TREAM(cX, k) then
12: S S {X} ratios and t-tests in our setting.
13: D L AST KE NTRIES(D, k)
14: if S 6= then 4.2 Learning and Incorporating Changes
15: U PDATE N ET (B, S, D ) When a shift involving nodes S has been detected,
U PDATE N ET(B, S, D ) in Algorithm 2 adapts the
To monitor how well each observation d BN B around the nodes in S to fit the empirical dis-
(d1 , . . . , dm ) fit the current model B, and espe- tribution defined by the last k cases D read from
cially the connections between a node Xi and the re- D. Throughout the text, both the cases and the em-
maining nodes in B, we have followed the approach pirical distribution will be denoted D . Since we
of Jensen et al. (1991): If the current model is cor- want to reuse the knowledge encoded in B that has
rect, then we would in general expect that the prob- not been deemed outdated by the detection part of
ability for observing d dictated by B is higher than the method, we will update B to fit D based on the
or equal to that yielded by most other models. This assumption that only nodes in S need updating of
should especially be the case for the empty model E, their probabilistic bindings (i.e. the structure asso-
where all nodes are unconnected. That is, in B, we ciated with their Markov boundaries in BD ). Ig-
expect the individual attributes of d to be positively noring most details for the moment, the updating
correlated (unless d is a rare case, in which case all method in Algorithm 2 first runs through the nodes

226 S. H. Nielsen and T. D. Nielsen


Algorithm 2 Update Algorithm for BN. Takes as d D (X)) to find variables adjacent to X.
ables in MB
input the network to be updated B, a set of vari- The latter method uses a greedy search for itera-
ables whose structural bindings may be wrong S, tively growing and shrinking the estimated set of ad-
and data to learn from D . jacent variables until no further change takes place.
1: procedure U PDATE N ET (B, S, D ) Both of these methods need an independence ora-
2: for X S do cle ID . For this we have used a 2 test on D .
3: d D (X) M ARKOVB OUNDARY(X, D )
MB
4: GX A DJACENCIES(X, MB d D (X), D )
5: d D (X), D )
PARTIALLY D IRECT(X, GX , MB Algorithm 3 Uncovers the direction of some arcs
.
adjacent to a variable X as they would be in BD
6: G M ERGE F RAGMENTS({GX}XS )
7: G M ERGE C ONNECTIONS(B, G , S) NEGX (X) consists of the nodes connected to X by
8: (G , C) D IRECT(G , B, S)
9:
a link in GX .
10: for X V do 1: procedure PARTIALLY D IRECT(X, GX , MB d D (X), D )
11: if X C then 2: D IRECTA S I N OTHER F RAGMENTS(GX)
12: {PD (X|PAG (X))} 3: for Y MB d D (X) \ (NEG (X) PAG (X)) do
X X
13: else 4: for T ( MB d D (X) \ {X, Y } do
14: {PB (X|PAG (X))} 5: if ID (X, Y | T ) then
15: B (G , ) 6: for Z NEGX (X) \ T do
7: if ID (X, Y | T {Z}) then
8: L INK T OA RC(GX , X, Z)
in S and learns a partially directed graph fragment
GX for each node X (GX can roughly be thought of The method PARTIALLY D IRECT(X, GX ,
as a local completed pattern for X). When net- d D (X), D ) directs a number of links in the
MB
work fragments have been constructed for all nodes graph fragment GX in accordance with the direction
in S, these fragments are merged into a single graph of these in (the unknown) BD . With D IREC -

G , which is again merged with fragments from the TA S I N OTHER F RAGMENTS(GX ), the procedure
original graph of B. The merged graph is then di- first exchanges links for arcs, when previously
rected using four direction rules, which try to pre- directed graph fragments unanimously dictate this.
serve as much of Bs structure as possible, with- The procedure then runs through each variable Y in
out violating the newly uncovered knowledge rep- d D (X) not adjacent to X, finds a set of nodes
MB
resented by the learned graph fragments. Finally, in MBd D (X) that separate X from Y , and then
new CPTs are constructed for those nodes C that tries to re-establish connection to Y by repeatedly
have a new parent set in BD (nodes which, ideally, expanding the set of separating nodes by a single
should be a subset of S). node adjacent to X. If such a node can be found
The actual construction of GX is divided into it has to be a child of X in the completed pattern
three steps: First, an estimate MB d D (X) of BD , and no arc in the pattern B originating from
D
MBD (X) is computed, using M ARKOVB OUND - X is left as a link by the procedure (see (Nielsen
ARY(X, D ); second, nodes in MB d D (X) that are and Nielsen, 2006) for proofs). As before the
adjacent to X in BD are uncovered, using A DJA - independence oracle ID was implemented as a 2
CENCIES(X, MB d D (X), D ), and GX is initialized test in our experiments.
as a graph over X and these nodes, where X is con- In most constraint based learning methods, only
nected with links to these adjacent nodes; and third, the direction of arcs participating in v-structures
some links in GX that are arcs in BD are directed as
are uncovered using independence tests, and struc-
they would be in BD using PARTIALLY D IRECT(X,
tural rules are relied on for directing the remaining
GX , MBd D (X), D ). See (Nielsen and Nielsen, arcs afterwards. For the proposed method, how-
2006) for more elaboration on this. ever, more arcs from the completed pattern, than
In our experimental implementation, we used the just those of v-structures, are directed through in-
decision tree learning method of Frey et al. (2003) dependence tests. The reason is that traditional un-
to find MBd D (X), and the A LGORITHM PCD() covering of the direction of arcs in a v-structure
method in (Pena et al., 2005) (restricted to the vari- X Y Z relies not only on knowledge that X

Adapting Bayes Network Structures to Non-stationary Domains 227


and Y are adjacent, and that X and Z are not, but 4. If Rules 1 to 3 cannot be applied, chose a link
also on the knowledge that Y and Z are adjacent. At X Y at random, such that X, Y V \ S,
the point, where GX is learned, however, knowledge and direct it as in B.
of the connections among nodes adjacent to X is not
known (and may be dictated by D or may be dic- 5. If Rules 1 to 4 cannot be applied, chose a link
tated by B), so this traditional approach is not pos- at random, and direct it randomly.
sible. Of course these unknown connections could
Due to potentially flawed statistical tests, the resul-
be uncovered from D using a constraint based algo-
tant graph may contain cycles each involving at least
rithm, but the entire point of the method is to avoid
one node in S. These are eliminated by reversing
learning of the complete new network.
only arcs connecting to at least one node in S. The
When all graph fragments for nodes in S have
reversal process resembles the one used in (Margari-
been constructed, they are merged through a simple
tis and Thrun, 2000): We remove all arcs connecting
graph union in M ERGE F RAGMENTS(); no conflicts
to nodes in S that appears in at least one cycle. We
among orientations can happen due to the construc-
order the removed arcs according to how many cy-
tion of PARTIALLY D IRECT(). In M ERGE C ON -
cles they appear in, and then insert them back in the
NECTIONS(B, G , S) connections among nodes in
graph, starting with the arcs that appear in the least
V \ S are added according to the following rule: If
number of cycles, breaking ties arbitrarily. When
X, Y V \ S are adjacent in B, then add the link
at some point the insertion of an arc gives rise to a
X Y to G . The reason for this rule is that insepa-
cycle, we insert the arc as its reverse.
rable nodes, for which no change has been detected,
We have obtained a proof of the correctness
are assumed to be inseparable still. However, the di-
of the proposed method, but space restrictions pre-
rection of some of the arcs may have changed in G ,
vents us from bringing it here. Basically, we have
wherefore we cannot directly transfer the directions
shown that, given the set-up from Section 3, if the
in B to G .
method is started with a network equivalent to BP1 ,
Following the merge, D IRECT(G , B, S) directs then Bi will be equivalent to BPik for all i > k.
the remaining links in G according to the following This is what we refer to as correct behaviour, and
five rules: it means that once on the right track, the method
will continue to adapt to the underlying distribu-
1. If X V \ S, X Y is a link, X Y is an tion, with the delay k. The assumptions behind the
arc in B, and Y is a descendant of some node result, besides that each Pi is DAG faithful, are i)
Z in MBB (X) \ ADB (X), where ADB (X) the samples in D are representative of the distribu-
are nodes adjacent to X in B, through a path tions they are drawn from, ii) the rank of D is bigger
involving only children of X, then direct the than 2k, and iii) S HIFT I N S TREAM() returns true
link X Y as X Y .3 for variable X and sample j iff Pjk is not simi-
lar to Pjk1 around X (see (Nielsen and Nielsen,
2. If Rule 1 cannot be applied, and if X Y is a 2006) for formal definitions and detailed proofs).
link, Z X is an arc, and Z and Y are non-
adjacent, then direct the link XY as X Y . 5 Experiments and Results

3. If Rule 1 cannot be applied, and if X Y is a To investigate how our method behaves in practice,
link and there is a directed path from X to Y , we ran a series of experiments. We constructed
then direct the link X Y as X Y . 100 experiments, where each consisted of five ran-
domly generated BNs B1 , . . . , B5 over ten variables,
3
That Rule 1 is sensible is proved in (Nielsen and Nielsen, each having between two and five states. We made
2006). Intuitively, we try to identify a graph fragment for X in sure that Bi was structurally identical to Bi1 ex-
B, that can be merged with the graph fragments learned from cept for the connection between two randomly cho-
D . It turns out that the arcs directed by Rule 1 are exactly
those that would have been learned by PARTIALLY D IRECT(X, sen nodes. All CPTs in Bi were kept the same as
GX , MBB (X), PB ). in Bi1 , except for the nodes with a new parent

228 S. H. Nielsen and T. D. Nielsen


set. For these we employed four different meth- 600
A
ods for generating new distributions: A estimated B
C
500
the probabilities from the previous network with D

some added noise to ensure that no two distribu- 400


tions were the same. B, C, and D generated entirely
new CPTs, with B drawing distributions from a uni- 300
form distribution over distributions. C drew dis-
200
tributions from the same distribution, but rejected
those CPTs where there were no two parent con- 100
figurations, for which the listed distributions had a
KL-distance of more than 1. D was identical to C, 0
0 100 200 300 400 500 600
except for having a threshold of 5. The purpose of
the latter two methods is to ensure strong probabilis-
tic dependencies for at least one parent configura- Figure 1: Deviance measures FG-SA (X-axis) vs.
tion. For generation of the initial BNs we used the NN (Y-axis).
method of Ide et al. (). For each series of five BNs,
500
we sampled r cases from each network and concate- A
450 B
nated them into a piecewise DAG faithful sample C
400 D
sequence of rank r and length 5r, for r being 500,
350
1000, 5000, and 10000.
300
We fed our method (NN) with the generated
250
sequences, using different delays k (100, 500,
200
and 1000), and measured the deciance wrt. the
150
KL-distance on each. As mentioned we are un-
100
aware of other work geared towards non-stationary
50
distributions, but for base-line comparison pur-
0
poses, we implemented the structural adaptation 0 50 100 150 200 250 300 350 400 450 500
methods of Friedman and Goldszmidt (1997) (FG)
and Lam and Bacchus (1994) (LB). For the method Figure 2: Deviance measures FG-HC (X-axis) vs.
of Friedman and Goldszmidt (1997) we tried both NN (Y-axis).
simulated annealing (FG-SA) and a more time con-
suming hill-climbing (FG-HC) for the unspecified
search step of the algorithm. As these methods have FG-HC are presented in Figures 1 and 2. NN out-
not been developed to deal with non-stationary dis- performed FG-SA in 81 of the experiments, and
tributions, they have to be told the delay between FG-HC in 65 of the experiments. The deviance of
learning. For this we used the same value k, that the LB method was much worse than for either of
we use as delay for our own method, as this ensure these three. The one experiment, where both the
that all methods store only a maximum of k full FG methods outperformed the NN method substan-
cases at any one time. The chosen k values, also tially, had r equal to 10000 and k equal to 1000,
correspond to those found for the experimental re- and was thus the experiment closest to the assump-
sults reported in Friedman and Goldszmidt (1997) tion on stationary distributions of the FG and LB
and Lam (1998). The only other pre-specified pa- learners.
rameter required by our method, viz. a threshold for Studying the individual experiments more
the 2 -tests we set at a conventional 0.05. Each closely, it became apparent that NN is more sta-
method was given the correct initial network B1 to ble than FG: It does not alter the network as often
start its exploration. as FG, and when doing so, NN does not alter it as
Space does not permit us to present the results much as FG. This is positive, as besides preventing
in full, but the deviance of both NN, FG-SA, and unnecessary computations, it frees the user of

Adapting Bayes Network Structures to Non-stationary Domains 229


the learned nets from a range of false positives. References
Furthermore, we observed that the distance between W. Buntine. 1991. Theory refinement on Bayesian net-
the BN maintained by NN and the generating one works. In UAI 91, pages 5260. Morgan Kaufmann.
seems to stabilize after some time. This was not L. Frey, D. Fisher, I. Tsamardinos, C. F. Aliferis, and
always the case for FG. A. Statnikov. 2003. Identifying Markov blankets with
decision tree induction. In ICDM 03, pages 5966.
6 Discussion IEEE Computer Society Press.
N. Friedman and M. Goldszmidt. 1997. Sequential up-
The plotted experiments seem to indicate that our date of Bayesian network structure. In UAI 97, pages
method is superior to existing techniques for do- 165174. Morgan Kaufmann.
mains, where the underlying distribution is not sta-
D. Heckerman. 1998. A tutorial on learning with
tionary. This is especially underscored by our ex- Bayesian networks. In M. Jordan, editor, Learning
periments actually favouring the existing methods in Graphical Models, pages 301354. Kluwer.
through using only sequences whose rank is a multi- J. S. Ide, F. G. Cozman, and F. T. Ramos. Generat-
plum of the learners k value, which means that both ing random Bayesian networks with constraints on in-
FG and LB always learn from data from only one duced width. In ECAI 04, pages 323327. IOS Press.
partition of the sequence, unlike NN, which rarely F. V. Jensen, B. Chamberlain, T. Nordahl, and F. Jensen.
identifies the point of change completely accurately. 1991. Analysis in HUGIN of data conflict. In UAI 91.
Moreover, score based approaches are geared to- Elsevier.
wards getting small KL-scores, and thus the metric W. Lam and F. Bacchus. 1994. Using new data to re-
we have reported should favour FG and LB too. fine a Bayesian network. In UAI 94, pages 383390.
Of immediate interest to us, is investigation of Morgan Kaufmann.
how our method fares when the BN given to it at W. Lam. 1998. Bayesian network refinement via ma-
the beginning is not representative of the distribu- chine learning approach. IEEE Trans. Pattern Anal.
tion generating the first partition of the sample se- Mach. Intell., 20(3):240251.
quence. Also, it would be interesting to investigate D. Margaritis and S. Thrun. 2000. Bayesian network in-
the extreme cases of sequences of size 1 and those duction via local neighborhoods. In Advances in Neu-
with very low ranks (r 500). Obviously, the task ral Information Processing Systems 12, pages 505
511. MIT Press.
of testing other parameter choices and other imple-
mentation options for the helper functions need to S. H. Nielsen and T. D. Nielsen. 2006. Adapt-
be carried out too. ing bayes nets to non-stationary probability dis-
tributions. Technical report, Aalborg University.
Currently, we have some ideas for optimizing www.cs.aau.dk/holbech/nielsennielsen note.ps.
the precision of our method, including performing
J. Pearl. 1988. Probabilistic Reasoning in Intelligent
parameter adaptation of the CPTs associated with Systems. Morgan Kaufmann.
the maintained structure, and letting changes cas-
cade, by marking nodes adjacent to changed nodes J. M. Pena, J. Bjorkegren, and J. Tegner. 2005. Scalable,
efficient and correct learning of Markov boundaries
as changed themselves. under the faithfulness assumption. In ECSQARU 05,
In the future it would be interesting to see how volume 3571 of LNCS, pages 136147. Springer.
a score based approach to the local learning part of W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
our method would perform. The problem with tak- Flannery, editors. 2002. Numerical Recipes in C++:
ing this road is that it does not seem to have any The Art of Scientific Computing. Cambridge Univer-
formal underpinnings, as the measures score based sity Press, 2nd edition.
approaches optimize are all defined in terms of a J. Roure. 2004. Incremental hill-climbing search applied
single underlying distribution. A difficulty which to Bayesian network structure learning. In Proceed-
Friedman and Goldszmidt (1997) also allude to in ings of the 15th European Conference on Machine
Learning. Springer.
their efforts to justify learning from data collections
of varying size for local parts of the network. T. Verma and J. Pearl. 1990. Equivalence and synthesis
of causal models. In UAI 91, pages 220227. Elsevier.

230 S. H. Nielsen and T. D. Nielsen


Diagnosing Lyme disease - Tailoring patient specific Bayesian
networks for temporal reasoning
Kristian G. Olesen (kgo@cs.aau.dk)
Department of Computer Science, Aalborg University
Fredrik Bajersvej 7 E, DK-9220 Aalborg East, Denmark

Ole K. Hejlesen (okh@mi.aau.dk)


Department of Medical Informatics, Aalborg University
Fredrik Bajersvej 7 D, DK-9220 Aalborg East, Denmark

Ram Dessau (rde@cn.stam.dk)


Clinical Microbiology, Nstved Hospital
Denmark

Ivan Beltoft (ibe@netcompany.com) and Michael Trangeled (michael@trangeled.com)


Department of Medical Informatics, Aalborg University

Abstract
Lyme disease is an infection evolving in three stages. Lyme disease is characterised by a
number of symptoms whose manifestations evolve over time. In order to correctly classify
the disease it is important to include the clinical history of the patient. Consultations
are typically scattered at non-equidistant points in time and the probability of observing
symptoms depend on the time since the disease was inflicted on the patient.
A simple model of the evolution of symptoms over time forms the basis of a dynamically
tailored model that describes a specific patient. The time of infliction of the disease is
estimated by a model search that identifies the most probable model for the patient given
the pattern of symptom manifestations over time.

1 Introduction tibodies (immunglobulin M and G) in the blood.


Methods for direct detection of the bacteria in
Lyme disease, or Lyme Borreliosis, is the most the tissue or the blood are not available for rou-
commonly reported tick-borne infection in Eu- tine diagnosis. Thus it may be difficult to diag-
rope and North America. Lyme disease was nose Lyme disease as the symptoms may not be
named in 1977 when a cluster of children in and exclusive for the disease and testing for antibod-
around Lyme, Connecticut, displayed a similar ies may yield false positive or negative results.
disease pattern. Subsequent studies revealed In the next section we give an overview of
that the children were infected by a bacteria, Lyme disease and in section 3 a brief summary
which was named Borrelia burgdorferi. These of two earlier systems are given. In section 4 we
bacteria are transmitted to humans by the bite suggest an approach that overcome the difficul-
of infected ticks (Ixodus species). The clinical ties by modelling multiple consultations spread
manifestations of Lyme borreliosis are described out in non-equidistant points in time. This ap-
in three stages and may affect different organ proach assumes that the time of infection is
systems, most frequently the skin and nervous known, but this is rarely the case. We aim for
system. The diagnosis is based on the clinical an estimation of this fact; a number of hypoth-
findings and results of laboratory testing for an- esised models are generated and the best one is
Organ system Stage 1: Stage 2: Stage 3:
Incubation period Incubation period Incubation period
one week (few days one week to a few months to years
to one month) months
Skin Erythema migrans Borrelial lympocytoma Acrodermatitis chronica
athrophicans
Central nervous Early neuroborreliosis Chronic neuroborreliosis
system
Joints Lyme arthritis
Heart Lyme carditis
Average sentitivity of
antibody detection 50% 80% 100%
(IgG or IgM)

Table 1: Clinical manifestations of Lyme disease.

identified based on the available evidence. In other manifestations of Lyme disease is more
section 5 a preliminary evaluation of the ap- rare. The diagnosis of lyme arthritis is espe-
proach is described and finally we conclude in cially difficult as the symptoms are similar to
section 6. arthritis due to other causes and because the
disease is rare.
2 Lyme Disease Level

According to European case defintions IgM IgG

(http://www.oeghmp.at/eucalb/diagnosis case-
definition-outline.html) Lyme disease and its
manifestations are described in three stages as
3 weeks 4-6 weeks 4-6 months 2 years Time
Bite
shown in Table 1.
Erythema migrans is the most frequent man- Figure 1: IgM and IgG development after infec-
ifestation. The rash which migrates from the tion of Lyme disease.
site of the tickbite continuously grows in size.
Erythema migrans is fairly characteristic and
the diagnosis is clinical. Laboratory testing is The epidemiology is complex. For exam-
useless as only half of the patients are positive, ple different age groups are at different risks,
when the disease is localized to the skin only. there is a large seasonsal variation due to vari-
Neuroborreliosis is less frequent, in Denmark ations in tick activity (Figure 2), the incuba-
the incidence is around 3/100.000 per year. The tion period may vary from a few days to sev-
IgM or IgG may be positive in 80% of patients eral years and some clinical manifestations are
with neuroborreliosis, but if the duration of the very rare. There are large variations in in-
the clinical disease is more than two months, cubation time and clinical progression. Some
then 100% of patients are positive. Figure 1 patients may have the disease starting with a
shows that not only the probability of being rash (erythema migrans) and then progressing
positive, but also the level of antibodies vary as to ntaeuroborreliosis, but most patients with
a function of time. Thus the laboratory mea- neuroborreliosis are not aware of a preceeding
surement of IgM and IgG antibodies depends rash. The description of the disease and its pro-
both on the duration of the clinical disease and gression primarily based on (Smith et al., 2003)
the dissemination of the infection to other or- and (Gray et al., 2002).
gan systems than the skin. The incidence of the There are also large individual variations in

232 K. G. Olesen, O. K. Hejlesen, R. Dessau, I. Beltoft, and M. Trangeled


nosis is supported by incorporating clinical and
laboratory data into the model.
To capture the complex patterns of the dis-
ease, a model must incorporate the relevant
clinical evidence including estimation of the
temporal aspects of time since the tickbite, the
duration of clinical disease and the development
of antibody response.

3 Existing Models for Diagnosis of


Lyme Disease
We have knowledge of two models that
Figure 2: Comparison of seasonal occurrence of have been developed to assist the med-
erythema migrans (EM, n = 566) and neurobor- ical practitioner in the diagnosis of
reliosis (NB, n = 148) in northeastern Austria, Lyme disease (Dessau and Andersen, 2001;
1994-1995.(Gray et al., 2002). Fisker et al., 2002). Both models are based on
Bayesian networks (Jensen, 2001) and as the
latter is a further development of the former,
the behaviour of the patient and the physician, they share most of the variables.
about when to visit the doctor, when to take The models include a group of variables de-
the laboratory test ect. Possible repeated visits scribing general information and knowledge in-
to the doctor and repeated laboratory testing cluding age and gender of the patient, the pa-
are performed at variable intervals. Joint pain tients exposure to ticks (is the patient a forrest
and arthritis is a very common problem in the worker or orienteer), the month of the year and
general population, but is only very rarely due whether the patient recalls a tick bite. The in-
to Lyme borreliosis. However, many patients formation is used to establish whether or not
with arthritis are tested for Lyme borreliosis the conditions for a Lyme disease infection has
( 30% of all patients samples) and as the prior been present. This part of the model influence
probability of Lyme disease is low, the posterior the hypothesis variable, Borrelia.
probality of Lyme arthritis is also low in spite Another section of the models describe clin-
of a positive test result. Laboratories in some ical manifestations. The findings are influ-
countries attempt to improve the diagnosis of enced by the hypothesis variable, but may be
Lyme disease by using a two-step strategy in caused by other reasons. Similarly, a section
antibody detection, by setting a low cut-off on of variables describing laboratory findings may
the screening test and then performing a more be caused by either Lyme disease or by other
expensive and complex test to confirm or dis- disorders.
card the result of the primary test. As the sec- The structure of the model by Dessau and
ond test is using the same principle of indirect Andersen (2001) is shown in Figure 3. In this
testing for antibody reactions to different Borre- model the three stages of Lyme disease is explic-
lial antigens the two tests are highly correlated, itly represented as three copies of the hypothesis
and the information gain is limited. Thus, the and the findings sections. The conditional prob-
basic problem of false positive or false negative abilities reflect the temporal evolution of symp-
results is not solved. It was therefore found im- toms, such that e.g. neuroborreliosis is more
portant to develop a decision support system to probable in stage two than in stages one and
assist the clinician by calculating the posterior three. A problem with this approach is that the
probability of Lyme disease to guide the choice progression of the disease is uncertain, and con-
of treatment. An evidence based clinical diag- sequently it is unclear which copy of e.g. finding

Diagnosing Lyme disease - Tailoring patient specific Bayesian networks for temporal reasoning 233
Laboratory
findings

Immunology1 Immunology2 Immunology3

Background Borrelia1 Borrelia2 Borrelia3

Borrelia

Pathology1 Pathology2 Pathology3

Clinical
findings

Figure 3: Structure of the Bayesian network by Dessau and Andersen.

Old laboratory Laboratory


variables to use. This was resolved by combin- findings findings
ing variables from each stage in single variables
that describe the observable clinical manifesta-
tions and laboratory findings. Background Borrelia
The progress of Lyme disease was modelled
by a variable duration (not shown in the fig- Clinical
ure), indicating the time a patient has expe- findings

rienced the symptoms. It does not distinguish


Figure 4: Structure of the Bayesian network by
between the duration of the different symptoms,
Fisker et al.
but models the total duration of illness. If the
duration is short it indicates a stage one Lyme
disease and so forth. duced Temporal Nodes Bayesian Networks as
Dessau and Andersens model is a snapshot an alternative to dynamic Bayesian networks.
of a patient at a specific point in time and it The time intervals represent the time since
does not involve temporal aspects such as means the symptom was first observed, and both the
for entering of multiple evidence on the same length and the number of time intervals was
variables as a result of repeated consultations. tailored individually for each node. With this
The model by Fisker et al. (2002) aims to approach it became possible to enter old evi-
include these issues. The idea is to reflect the dence into the model. For example, if erythema
clinical practice to examine a patient more than migrans was observed on, or recalled by, the
once if the diagnosis is uncertain. This gives the patient ten weeks ago and lymphocytoma is a
medical practitioner the opportunity to observe current observation, the EM variable is instan-
if the development of the disease corresponds tiated to the state 8 weeks - 4 months and the
to the typical pattern. The model mainly in- lymphocy is instantiated to < 6 weeks.
cludes the same variables, but the structure of The model also incorporated the ability to
the network is different (see Figure 4). enter duplicate results of laboratory tests. This
Instead of triplicating the pathology for the option was included by repeating the laboratory
different stages of the disease the temporal as- findings and including a variable specifying the
pect was incorporated in the model by defin- time between the result of the old test and the
ing the states of manifestation variables as current test (not shown in the figure).
time intervals. This approach was inspired by The two models basically use the same fea-
(Arroyo-Figueroa and Sucar, 1999) that intro- tures but differ in the modelling of the progres-

234 K. G. Olesen, O. K. Hejlesen, R. Dessau, I. Beltoft, and M. Trangeled


FP_IgM FP_IgG
FP_IgM FP_IgG FP_IgM FP_IgG

IgMTest IgGTest
IgMTest IgGTest IgMTest IgGTest

...
Month of Bite Recall Bite Gender IgM IgG
IgM IgG
IgM IgG

Lyme disease
Lyme disease Lyme disease

Lyme
Tick Activity Tick Bite Tick Bite 2
Disease

EM Arthrit ENB Cardit Lymcyt CNB ACA


EM Arthrit ENB Cardit Lymcyt CNB ACA EM Arthrit ENB Cardit Lymcyt CNB ACA

Exposure Age Rash Arthritis Early Neuro Carditis Lymphocytom


Chronic
ACA Chronic
Chronic Neuro Rash Arthritis Early Neuro Carditis Lymphocytom ACA
Rash Arthritis Early Neuro Carditis Lymphocytom ACA Neuro
Neuro

OtherEM OtherMS OtherNeuro OtherCardit OtherLym OtherCNB OtherACA


OtherEM OtherMS OtherNeuro OtherCardit OtherLym OtherCNB OtherACA OtherEM OtherMS OtherNeuro OtherCardit OtherLym OtherCNB OtherACA

Consultation 1 Consultation 2 Consultation N

Figure 5: The structure of the tailored patient specific model is composed of background knowledge
and consultation modules.

sion of the disease. Further, the model by Fisker ules, where each module describes a consul-
et al. takes the use of data from previous con- tation. The problem is that consultations
sultations into consideration. may appear at random points in time and
Both models implement the possibility of that the conditional probabilities for the symp-
reading the probability of having Lyme disease toms at each consultation vary, depending
in one of the three stages, but with two dif- on the time that has passed since the tick
ferent approaches. The model by Fisker et al. bite. Therefore frameworks such as dynamic
models the stages as states in a single variable, Bayesian networks and Markov chains do not
whereas the model by Dessau and Andersen apply. We introduce continuous time inspired
models the stages explicitly as causally depen- by (Nodelman et al., 2002). The proposed ap-
dent variables. proach involve a model for the conditional prob-
The main problem in both models is the tem- abilities, but before reaching that point we de-
poral progression of Lyme disease. Dessau and termine the structure of the model.
Andersens model gives a static picture of the
current situation, that only takes the total du- 4.1 Structure
ration of the period of illness into consideration. The structure of the patient specific model is
Fisker et al. is more explicit in the modelling illustrated in Figure 5. At the first consulta-
of the progress of the disease by letting states tion the background knowledge module is linked
denote time intervals since a symptom was first to a generic consultation module consisting of
observed and by duplicating variables for labo- the disease node, clinical findings and labora-
ratory findings. This enables inclusion of find- tory findings. In order to keep focus on the
ings from the past, but is still limited to single overall modelling technique we deliberately kept
observations for clinical evidence and two en- the model simple. The consultation module is
tries for laboratory tests. a naive Bayes model, and subsequent consulta-
Besides the difficulties in determining the tions are included in the the model by extended-
structure of the models, a considerable effort ing it with a new consultation module. This
was required to specify the conditional proba- process can be repeated for an arbitrary num-
bilities. ber of consultations. The consultation modules
In the following section we propose a further are connected only through the disease nodes.
elaboration of the modelling by introduction of This is a quite strong assumption that may be
continuous time and an arbitrary number of debated, but it simplifies things and keep the
consultations. complexity of the resulting model linear in the
number of consultations. Less critical is that
4 Tailoring patient specific models
the disease nodes do not take the stage of the
We aim for a Bayesian network that incor- disease into account; we consider that the clas-
porates the full clinical history of individual sification into stages are mostly for descriptive
patients. The model is composed of mod- purposes, but it could be modeled as stages in

Diagnosing Lyme disease - Tailoring patient specific Bayesian networks for temporal reasoning 235
1
the disease node, although this would compli- Age: 020
Age: 2170

cate the quantitative specification slightly. 0.9 Age: >70

0.8

4.2 Conditional probability tables 0.7

As the symptoms of Lyme disease vary over

P( EM | Lyme disease)
0.6

time we assume a simple model for the condi- 0.5

tional probabilities in the model. We approxi- 0.4

mate the conditional probabilities for the symp-


0.3
toms by a continuous function, composed as a
0.2
mixture of two sigmoid functions. This is a
somewhat arbitrary choise; other models may 0.1

be investigated, but for the present purpose this 0


0 10 20 30 40 50 60 70 80 90 100
Days since infectio
n
simple model suffice. The functions describing
the conditional probabilities are based on em- Figure 6: The probability of experiencing ery-
pirical knowledge. From existing databases we thema migrans (EM) over time modelled as con-
extract the probabilities for the various symp- tinuous functions. The development of EM is
toms at different times since the infection was dependent on the age of the patient.
inflicted on the patient. The time of infliction is
usually not known, but around one third of the
1'st consultation 2'nd consultation nth consultation
patients recall a tick bit. In other cases there is t1
t2
indirect evidence for the time of the bite, such . t.n.

as limited periods of exposure. time


Time of
An example is shown in Figure 6, where the infection

resulting functions for the probability of show-


ing the symptom EM up to 100 days after the Figure 7: Time line that represents the consul-
infection are drawn. As can be seen, the age of tations that forms the clinical history.
the patient has been incorporporated as a pa-
rameter to the function. Thus, we can compute
the conditional probabilities for a consultation tient consults the medical practitioner the time
module directly, provided that the time since of the consultation is registered. Therefore, the
the infection is known. This is rarely the case. intervals between the consultations are known,
whereas the interval, t1 , from the infection to
4.3 Determining the time of infection the first consultation is unknown and must be
The purpose of the Bayesian model is to cal- estimated. It is often difficult to determine t1 ,
culate the probability of Lyme disease based on because a patient may not recollect a tick bite,
the clinical findings and possible laboratory evi- and because symptoms may not appear until
dence. As described in the previous section, the late in the progress of the disease.
conditional probabilities of the symptoms and As the conditional probabilities for the con-
the serology are determined by the time since sultation modules vary depending on t1 , differ-
the tick bite. Thus, in order to construct the ent values of t1 will result in Bayesian networks
complete model it is necessary to know the time with different quantitative specification. Hence,
of infliction. estimation of t1 is a crucial part of the construc-
The clinical history for a given patient is il- tion of the patient specific model. We tackle
lustrated in Figure 7. Different time intervals this problem by hypothesising a number of dif-
are shown in the figure. The interval t1 is the ferent models, and we identify those that best
time from the infection to the first consulta- matches the evidence. Each model includes all
tion, t2 is the interval between the first and the consultations for the patient. Thus, we have a
second consultation and so on. When the pa- fixed structure with different instantiations of

236 K. G. Olesen, O. K. Hejlesen, R. Dessau, I. Beltoft, and M. Trangeled


Most probable model
the conditional probabilities for different values P(M|e)
of t1 .
The probability of a model given the evidence
can be calculated by using Bayes rule:

P (e | M ) P (M ) Tstart t1 Tend Time since


P (M | e) = (1) infection
P (e)
P (e | M ) can be extracted from the hy- Figure 8: The figure illustrates the probabil-
pothesised Bayesian network as the normaliza- ity of the different models as a function of days
tion constant. The probability of the evidence, since infection.
P (e), is not known, but is a model independent
constant. The prior probability of the model,
P (M ), is assumed to be equal for all models, The estimation of the most probable model
and is therefore inversely proportional to the has been implemented by evaluating all models
number of models. Thus, P (M | e) can be ex- for t1 in the interval 0-800 days. The upper limit
pressed by of 800 days is based on the assumption that the
general development pattern of Lyme disease at
P (M | e) P (e | M ) (2) this time is stabilized and the symptoms are
chronical. The size of the interval Tstart - Tend
The probability of Lyme disease, P (Ld) can has, somewhat arbritrarily, been set to 25 days.
be read from each model and will, of course,
vary depending on the model M. By using 5 Preliminary evaluation
the fundamental rule, the joint probability of
In a preliminary test of the proposed approach
P (Ld, M | e) can be obtained as
the results were compared to the outcome of
the model by Dessau and Andersen. The eval-
P (Ld, M | e) = P (Ld | M, e) P (M | e) (3) uation was focused on the structural problems
in the models. The cases that were used for
where P (M | e) can be substituted by using the evaluation are based on a survey performed
equation 2: by Ram Dessau where 2414 questionnaires have
been collected from the Danish hospitals in Aal-
borg and Herlev. The overall picture of the com-
P (Ld, M | e) P (Ld | M, e) P (e | M ) (4)
parison is that the proposed model in general
The probability of Lyme disease given the ev- estimates a higher probability of Lyme disease
idence, can now be found by marginalizing over than the model by Dessau and Andersen.
M: In order to evaluate the effect of the incorpo-
ration of the clinical history in the time sliced
TX
model, a number of typical courses of Lyme dis-
end
P (Ld | e) = P (Ld | MTn , e)P (e | MTn ) ease have been evaluated as single consultations
Tn =Tstart without utilizing the clinical history and as con-
(5) sultations utilizing the previous history. At the
Tstart and Tend represent an interval that sur- same time, each of the consultations have been
rounds the most probable value of the time since evaluated in the model by Dessau and Ander-
the bite, t1 , as illustrated in Figure 8. sen in order to evaluate how it handles the later
The interval is chosen due to computational stage symptoms, when the clinical history is not
considerations, and because the assumption incorporated.
that all models are equally probable is obviously This informal study indicates that the time
not valid. Alternatively, we could simply choose sliced model behaves as intended when the clini-
the model with highest probability. cal history is incorporated. The estimated prob-

Diagnosing Lyme disease - Tailoring patient specific Bayesian networks for temporal reasoning 237
ability of Lyme disease is gradually increased gradually increased from consultation to consul-
for each added consultation, as the general de- tation, whereas it was reduced when the history
velopment pattern of the disease is confirmed in did not confirm the pattern.
typical courses. We conclude that the proposed method seems
In the estimates from the model by Dessau viable for temporal domains, where changing
and Andersen the probability of Lyme disease conditional probabilities can be modeled by con-
decreases as the later stage symptoms are ob- tinuous functions.
served. This reasoning does not seem appropri-
ate when it is known from earlier consultations
that the clinical history points toward a borre-
References
lial infection. The model by Dessau and Ander- [Arroyo-Figueroa and Sucar1999] Gustava Arroyo-
sen is not designed to incorporate the clinical Figueroa and Luis Enrique Sucar. 1999. A
Temporal Bayesian Network for Diagnosis and
history, but from the results it can be seen that Prediction. In Proceedings of the 15th Conference
this parameter is important in order to provide on Uncertainty in Artificial Intelligence, pages
reasonable estimates of the probability of Lyme 1320. ISBN: 1-55860-614-9.
disease. [Dessau and Andersen2001] Ram Dessau and Maj-
Britt Nordahl Andersen. 2001. Beslutningssttte
6 Conclusion med Bayesiansk netvrk - En model til tolkning
af laboratoriesvar. Technical report, Aalborg Uni-
Temporal reasoning is an integral part of medi- versitet. In danish.
cal expertise. Many decisions are based on prog- [Fisker et al.2002] Brian Fisker, Kristian Kirk,
noses or diagnoses where symptoms evolve over Jimmy Klitgaard, and Lars Reng. 2002. Decision
time and the effects of treatments typically be- Support System for Lyme Disease. Technical
come apparent only with some delay. A com- report, Aalborg Universitet.
mon scenario in the general practice is that a [Gray et al.2002] J.S. Gray, O. Kahl, R.S. Lane, and
patient attends the clinic at different times on G. Stanek. 2002. Lyme Borreliosis: Biology, Epi-
a non-regular basis. demiology and Control. Cabi Publishing. ISBN:
0-85199-632-9.
Lyme disease is an infection characterised by
a number of symptoms whose manifestations [Jensen2001] Finn V. Jensen. 2001. Bayesian Net-
evolve over time. In order to correctly classify works and Decision Graphs. Springer Verlag.
the disease it is important to include the clini- ISBN: 0-387-95259-4.
cal history of the patient. We have proposed a [Nodelman et al.2002] Uri Nodelman, Christian R.
method to include consultations scattered over Shelton, and Daphne Koller. 2002. Continuous
time at non-equidistant points. A description Time Bayesian Networks. In Proceedings of the
18th Conference on Uncertainty in Artificial In-
of the evolution of symptoms over time forms telligence, pages 378387. ISBN: 1-55860-897-4.
the basis of a dynamically tailored model that
describes a specific patient. The time of in- [Smith et al.2003] Martin Smith, Jeremy
fliction of the disease is estimated by a model Gray, Marta Granstrom, Crawford Re-
vie, and George Gettinby. 2003. Euro-
search that identifies the most probable model pean Concerted Action on Lyme Borreliosis.
for the patient given the pattern of symptoms http://vie.dis.strath.ac.uk/vie/LymeEU/.
over time. Read 04-12-2003.
Based on a preliminary evaluation it can be
concluded that the method proposed for han-
dling nonequivalent time intervals in order to in-
corporate the clinical history works satisfactory.
In cases where the clinical history confirmed
the general development pattern of Lyme dis-
ease the estimated probability the disease was

238 K. G. Olesen, O. K. Hejlesen, R. Dessau, I. Beltoft, and M. Trangeled


Predictive Maintenance using Dynamic Probabilistic Networks
Demet Ozgur-Unluakn
Taner Bilgic
Department of Industrial Engineering
Bogazici University
Istanbul, 34342, Turkey

Abstract
We study dynamic reliability of systems where system components age with a constant
failure rate and there is a budget constraint. We develop a methodology to eectively
prepare a predictive maintenance plan of such a system using dynamic Bayesian networks
(DBNs). DBN representation allows monitoring the system reliability in a given planning
horizon and predicting the system state under dierent replacement plans. When the
system reliability falls below a predetermined threshold value, component replacements
are planned such that maintenance budget is not exceeded. The decision of which com-
ponent(s) to replace is an important issue since it aects future system reliability and
consequently the next time to do replacement. Component marginal probabilities given
the system state are used to determine which component(s) to replace. Two approaches
are proposed in calculating the marginal probabilities of components. The rst is a myopic
approach where the only evidence belongs to the current planning time. The second is a
look-ahead approach where all the subsequent time intervals are included as evidence.

1 Introduction the context of Bayesian networks (Pearl, 1988).


A similar troubleshooting problem, where mul-
Maintenance can be performed either after the tiple but independent faults are allowed, is ad-
breakdown takes place or before the problem dressed in (Srinivas, 1995). More recent studies
arises. The former is reactive whereas the latter are mostly due to researchers from the SACSO
is proactive. Reactive maintenance is appropri- (Systems for Automated Customer Support Op-
ate for systems where the failure does not re- erations) project. By assuming a single fault,
sult in serious consequences. Decision-theoretic independent actions and constant costs, and
troubleshooting belongs to this category. Proac- making use of a simple model representation
tive or planned maintenance can be further clas- technique (Skaanning et al., 2000), they show
sied as preventive and predictive (Kothamasu that the simple eciency ordering yields an op-
et al., 2006). These dier in the scheduling timal sequence of actions (Jensen et al., 2001).
behaviour. Preventive maintenance performs Langseth and Jensen (2001) present two heuris-
maintenance on a xed schedule whereas in pre- tics for handling dependent actions and condi-
dictive maintenance, the schedule is adaptively tional costs. Langseth and Jensen (2003) pro-
determined. Reliability-centered maintenance vide a formalism that combines the methodolo-
is predictive maintenance where reliability esti- gies used in reliability analysis and decision-
mates of the system are used to develop a cost- theoretic troubleshooting. Koca and Bilgic
eective schedule. (2004) present a generic decision-theoretic trou-
Decision-theoretic troubleshooting, which bleshooter to handle troubleshooting tasks in-
balances costs and likelihoods for the best ac- corporating questions, dependent actions, con-
tion, is rst studied by (Kalagnanam and Hen- ditional costs, and any combinations of these.
rion, 1988). Heckerman et al.(1995) extend it to Decision-theoretic troubleshooting has always
been studied as a static problem and with an a methodology to eectively prepare a predic-
objective to reach a minimum-cost action plan. tive maintenance plan using DBNs. One can
On the reliability analysis side, fault diagnosis argue that the failure of the system and its as-
nds its roots in (Vesely, 1970). Kothamasu et sociated costs can be modelled using inuence
al.(2006) review the philosophies and techniques diagrams or limited memory inuence diagrams
that focus on improving reliability and reducing (LIMIDs). But we propose a way of represent-
unscheduled downtime by monitoring and pre- ing the problem as an optimization problem rst
dicting machine health. Torres-Toledano and and then use only DBNs for fast inference under
Sucar (1998) develop a general methodology for some simplifying assumptions.
reliability modelling of complex systems based The rest of the paper is organized as follows:
on Bayesian networks. Welch and Thelen (2000) In Section 2, problem is dened and in Sec-
apply DBNs to an example from the dynamic tion 3, dynamic Bayesian network based models
reliability community. Weber and Joue (2003, are proposed as a solution to the problem de-
2006) present a methodology that helps de- ned. Two approaches are presented in schedul-
veloping DBNs and Dynamic Object Oriented ing maintenance plans. Numerical results are
Bayesian Networks (DOOBNs) from available given in Section 4. Finally Section 5 gives con-
data to formalize reliability of complex dynamic clusions and points future work.
models. Muller et al. (2004) propose a method-
ology to design a prognosis process taking into 2 Problem Definition
account the behavior of environmental events.
They develop a probabilistic model based on the The problem we take up can be described as fol-
translation of a preliminary process model into lows: There is a system which consists of sev-
DBN. Bouillaut et al. (2004) use causal prob- eral components. We observe the system in dis-
abilistic networks for the improvement of the crete epochs and assume that system reliability
maintenance of railway infrastructure. Weber is observable. System reliability is a function
et al. (2004) use DBN to model dependabil- of the interactions of the system components
ity of systems taking into account degradations which are not directly observable. We presume
and failure modes governed by exogenous con- that the reliability of the system should be kept
straints. over a predetermined threshold value in all pe-
All of the studies related to reliability analysis riods. This is reasonable in mission critical sys-
using DBNs are descriptive. Dynamic problem tems where the failure of the system is a very
is represented with DBNs and the outcome of low probability event due to built in redundancy
the analysis is how system reliability behaves and other structural properties. Therefore, we
in time. The impact of maintenance of an el- do not explicitly model the case where the sys-
ement at a specic time on this behaviour is tem actually fails. Components age with a con-
also reported in some of them. However opti- stant rate and it is possible to replace compo-
mization of maintenance activities (i.e., nding nents in any period. Once replaced, the com-
a minimum cost plan) is not considered which ponents will work at their full capacity. There
is the main motivation of our paper. Main- is a given maintenance budget for each period,
tenance is expensive and critical in most sys- which the total replacement cost in that period
tems. Unexpected breakdowns are not tolera- cannot exceed. Our aim is to minimize total
ble. That is why planning maintenance activ- maintenance cost in a planning horizon such
ities intelligently is an important issue since it that reliability of the system never falls below
saves money, service time and also lost produc- the threshold and maintenance budget is not
tion time. In this study, we are trying to opti- exceeded. Furthermore we make the following
mize maintenance activities of a system where assumptions:
components age with a constant failure rate (i) Lifetime of any component in the system
and there is a budget constraint. We develop is exponentially distributed. That is failure rate

240 D. zgr-nlakin and T. Bilgi


of any component is constant for all periods. The objective function (1) aims to minimize
(ii) All other conditional probability distrib- the total component replacement cost. Con-
utions used in the representation are discrete. straint set (2) represents the budget constraints.
(iii) All components and the system have two Constraint set (3) guarantees that system relia-
states (1 is the working state, 0 is the fail- bility in each period should be greater than the
ure state). given threshold value. Constraints in (4) ensure
(iv) Components can be replaced at the be- that if components are replaced their reliability
ginning of each period. Once they are replaced, becomes 1, otherwise it will decrease with cor-
their working state probability (i.e., their relia- responding failure rates. System reliability in
bility) become 1 in that period. each period is calculated by constraint (5). Fi-
The problem can be expressed as a mathe- nally constraints (6) and (7) dene the bounds
matical optimization problem. on decision variables. In general, solving this
The model parameters are: problem may be quite dicult. The diculty
lies in constraint sets (4) and (5). Constraint
i : index of components
set (4) denes a non-linear relation of the deci-
t : index of time periods
sion variables whereas constraint set (5) is much
i : failure rate of component i
more generic. In fact, system reliability at time
Ri1 : initial reliability of component i
t can be a function of whole history of the sys-
cit : cost of replacing component i in period t
tem. It is this set of constraints and the implied
Bt : available maintenance budget in period t
relationships of constraint set (4) that we rep-
L : threshold value of system reliability
resent using a dynamic Bayesian network.
f () : function mapping component reliabilities
Further assumptions are imposed in order to
Rit to system reliability R0t
simplify the above problem:
The decision variables are: (v) Replacement costs of all components in
 any period are the same and they are all nor-
1 if component i is replaced in period t
Xit = malized at one.
0 otherwise
(vi) Available budget in any period is normal-
Rit : reliability of component i in period t ized at one.
R0t : reliability of system in period t These assumptions indicate that in any pe-
riod, only one replacement can be planned and
The Predictive Maintenance (PM) model can the objective function we are trying to minimize
be formulated mathematically as follows: becomes the total number of replacements in a
T n
planning horizon.

Z(P M ) = min cit Xit (1)
t=1 i=1 3 Proposed Solution
subject to The mathematical problem may be solved ana-
n lytically or numerically once the constraint set

cit Xit Bt , t (2) (5) is made explicit. However, as the causal rela-
i=1 tions (represented with constraint set (5) in the
problem formulation) between the components
R0t L, t (3)
and the system becomes more complex, it gets
Rit = (1 Xit )ei Ri,t1 + Xit , i, t (4) dicult to represent and solve it mathemati-
cally. We represent the constraint set (5) using
R0t = f (R1t , R2t , ...Rnt ), t (5) dynamic Bayesian Networks (DBNs) (Murphy,
Xit {0, 1}, i, t (6) 2002). A DBN is an extended Bayesian net-
work (BN) which includes a temporal dimen-
0 Rit 1, i, t (7) sion. BNs are a widely used formalism for repre-

Predictive Maintenance using Dynamic Probabilistic Networks 241


A DBN is an extended Bayesian network (BN) which includes a temporal dimension. BNs are a widely used formalism for representing uncertain knowledge. The main features of the formalism are a graphical encoding of a set of conditional independence relations and a compact way of representing a joint probability distribution between random variables (Pearl, 1988). BNs have the power to represent causal relations between components and the system using conditional probability distributions. It is possible to analyse the process over a large planning horizon with DBNs.

Figure 1: DBN representation of a system with 2 components (three time slices, with component nodes A1-A3 and B1-B3 and system nodes S1-S3).

Figure 1 illustrates a DBN representation of a system with 2 components, A and B. Solid arcs represent the causal relations between the components and the system node, whereas dashed arcs represent the temporal relations of the components between two consecutive time periods. Note that this representation is Markovian, whereas DBNs can represent more general transition structures. Temporal relations are the transition probabilities of components due to aging. Since the lifetime of any component in the system is exponentially distributed, the transition probabilities are constant because of the memoryless property of the exponential distribution, given that the time intervals are equal. The transition probability table for a component with two states (1 is the working state, 0 is the failure state) is given in Table 1.

Table 1: Transition probability for component i

                  Comp(t) = 1                  Comp(t) = 0
Comp(t+1) = 1     e^{-\lambda_i \Delta t}      0
Comp(t+1) = 0     1 - e^{-\lambda_i \Delta t}  1

The DBN representation allows monitoring the system reliability in a given planning horizon and predicting the system state under different replacement schedules. When the system reliability falls below a predetermined threshold value, a component replacement is planned. The decision of which component to replace is an important issue since it affects future system reliability and consequently the next time to do a replacement. Like in decision-theoretic troubleshooting (Heckerman et al., 1995), marginal probabilities of components given the system state are used as efficiency measures of components in each period when a replacement is planned. Let S_k denote the system state in period k and C_{ik} denote the state of component i in period k. The following algorithm summarizes our DBN approach:

(i) Initialize t = 1
(ii) Infer system reliability P(S_k = 1) for t <= k <= T
(iii) Check if P(S_k = 1) >= L
(iv) If P(S_k = 1) >= L for all k, then stop.
(v) Else prepare a replacement plan for period k
    (a) Calculate P_{ik} = P(C_{ik} = 0 | S_k = 0) for all i
    (b) i* = arg max_i {P_{ik}}
    (c) Update the reliability of i* in period k to 1: P(C_{i*k} = 1) <- 1
    (d) Update t = k + 1
(vi) If t > T, then stop.
(vii) Else continue with step (ii)

Note that P(S_t = 1) = R_{0t} and P(C_{it} = 1) = R_{it}. This is a myopic approach, since the only evidence used in calculating the marginal probabilities in (v.a) belongs to the system state at the current planning time. An alternative approach is to take into account future information, which can be transmitted by the transition probabilities of the components. This is done by entering evidence to the system node from k+1 to T as S_{k+1:T} = 0. We call this approach the look-ahead approach. The algorithm is the same as above except for step (v.a), which is replaced as follows in the look-ahead approach:

(v.a) Calculate P_{ik} = P(C_{ik} = 0 | S_{k+1:T} = 0) for all i, where S_{k+1:T} denotes S_{k+1}, ..., S_T.
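To make the procedure concrete, the following Python sketch runs the myopic variant on an assumed two-component series system. The paper's own implementation uses Matlab and the Bayes Net Toolbox; here the system function f, the failure rates and the closed-form expression for P(C_{ik} = 0 | S_k = 0) are illustrative assumptions valid only for an independent-component series system, not choices taken from the paper.

    import numpy as np

    T = 100                                   # planning horizon (periods)
    L = 0.50                                  # reliability threshold
    lam = {"A": 1 / 40, "B": 1 / 10}          # assumed failure rates (MTTF 40 and 10 periods)
    surv = {i: np.exp(-l) for i, l in lam.items()}   # per-period survival prob., Delta t = 1

    def plan_myopic():
        R = {"A": 1.0, "B": 1.0}              # component reliabilities R_it
        plan = []                             # chosen (period, component) replacements
        for t in range(1, T + 1):
            R = {i: surv[i] * R[i] for i in R}        # aging, eq. (4) with X_it = 0
            R0 = R["A"] * R["B"]                      # assumed series system: R_0t = R_1t * R_2t
            if R0 < L:                                # first period the threshold is violated
                # step (v.a): P(C_ik = 0 | S_k = 0); in a series system a failed
                # component implies a failed system, so the ratio has a closed form
                p = {i: (1.0 - R[i]) / (1.0 - R0) for i in R}
                i_star = max(p, key=p.get)            # step (v.b): arg max over components
                R[i_star] = 1.0                       # step (v.c): replace it, P(C_ik = 1) <- 1
                plan.append((t, i_star))
        return plan

    print(plan_myopic())                      # the resulting replacement plan

The look-ahead variant differs only in step (v.a): the conditioning evidence becomes S_{k+1:T} = 0, which no longer admits a closed form and requires inference in the DBN.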

4 Numerical Results

The DBN algorithm is coded in Matlab and uses the Bayes Net Toolbox (BNT) (Murphy, 2001) to represent the causal and temporal relations, and to infer the reliability of the system. The two approaches, myopic and look-ahead, are compared on a small example with two components, given in Figure 1. The planning horizon is taken as 100 periods and the threshold value is given as 0.50.

The first scenario is created by taking the mean time to failure (MTTF) (1/\lambda_i) of each component i equal, set at 40 periods. The same replacement plan, given in Figure 2, is generated by both approaches. This is because the components have equal MTTFs and hence equal transition probabilities. The system reliability of scenario 1 is illustrated in Figure 2, where the peak points are the periods in which a component is replaced. At each peak point, the component which is planned for replacement is indicated in the figure. 4 replacements are planned in both approaches. When a replacement occurs, the system reliability jumps to a higher value, and then gradually decreases as time evolves until the next replacement.

Figure 2: Scenario 1 - system reliability and replacement plan (system reliability versus time over 100 periods, threshold value = 0.50).

The second scenario is created by differentiating the MTTFs of the components. We decrease the MTTF of component B to 10 periods. The replacement plans in Figure 3 and Figure 4 are generated by the myopic and look-ahead approaches, respectively. The replacement plans of the two approaches differ since the components have different transition probabilities. Both approaches begin their replacement plan in period 12, by selecting the same component to replace. In the next replacement period (k = 21), different components are selected. The myopic approach selects component B, because this replacement will make the system reliability higher in the short term. The look-ahead approach selects component A, because this replacement will make the system reliability higher in the long run. This is further illustrated in Table 2. Although the system reliability in the myopic approach is higher at t = 21, it decreases faster than the system reliability under the look-ahead approach. Hence, the look-ahead approach plans its next replacement at t = 29 while the myopic approach plans its next replacement at t = 28. By selecting A instead of B, the look-ahead approach defers its next replacement time. So in total, in scenario 2, the look-ahead approach generates 10 replacements in 100 periods while the myopic approach generates 11.

Table 2: Scenario 2 - system reliability for 21 <= t <= 28

Period       21     22     23     24     25     26     27     28
Myopic      .7712  .7169  .6673  .6220  .5805  .5425  .5077  .4758
Look-ahead  .7036  .6718  .6421  .6144  .5884  .5642  .5414  .5200

The system reliability of scenario 2 is illustrated in Figures 3 and 4 for the myopic and look-ahead approaches, respectively. In Figure 3, there are 11 peak points, which means 11 replacements are planned. In Figure 4, there are 10 peak points, which means 10 replacements are planned.
Figure 3: Scenario 2 - system reliability and replacement plan with the myopic approach (system reliability versus time, threshold value = 0.50).

Figure 4: Scenario 2 - system reliability and replacement plan with the look-ahead approach (system reliability versus time, threshold value = 0.50).

A third scenario is also carried out by further decreasing the MTTF of component B to 5 periods. The discrepancy between the plans generated by the two approaches becomes more apparent. The myopic approach plans a total of 19 replacements while the look-ahead approach plans a total of 16 replacements.

When the threshold value increases, an interesting question arises: is it still worth accounting for future information in choosing which component to replace? Table 3 shows the number of replacements found by the myopic and the look-ahead approaches at various threshold (L) values for scenario 2. When L = 0.80, the myopic approach finds fewer replacements than the look-ahead approach. This is because, as the threshold increases, more frequent replacements will be planned; hence focusing on short-term reliability instead of future reliability may result in a smaller number of replacements.

Table 3: Number of replacements for the myopic and look-ahead approaches at various threshold values for T = 100

Threshold   Myopic   Look-ahead
0.50        11       10
0.60        15       15
0.70        23       22
0.80        33       35
0.90        67       67
0.95        100      100

In order to understand how good our methodology is, we enumerate all possible solutions with k replacements in a reasonable time horizon. Here, k refers to the upper bound on the minimum number of replacements given by our DBN algorithm. The total number of solutions in a horizon of T periods with k replacements is given as:

\binom{T}{k} 2^k    (8)

The first term is the total number of all possible size-k subsets of a size-T set. This corresponds to the total number of all possible time alternatives for k replacements in horizon T. The second term is the total number of all possible replacements for two components. Since this solution space becomes intractable with increasing T and k, a smaller part of the planning horizon is taken for the enumeration of both scenarios. We started working with T = 50 periods, where 2 repairs are proposed, and T = 30 periods, where 3 repairs are proposed by our algorithm for scenario 1 and scenario 2, respectively. The next replacements correspond to t = 62 (Figure 2) and t = 40 (Figure 4). Hence, we increased T to 61 and to 39, the periods just before the 3rd and 4th replacements given by our algorithm for scenarios 1 and 2. The numbers of feasible solutions found by enumeration are reported in Table 4.
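The size of the solution space in equation (8) can be checked directly; the Python sketch below evaluates it for the (T, k) pairs discussed in the text. Note that these counts are the sizes of the full enumeration spaces, not the numbers of feasible solutions reported in Table 4.

    from math import comb

    def num_solutions(T, k):
        # equation (8): choose the k replacement periods out of T, then pick
        # one of the two components at each chosen period
        return comb(T, k) * 2 ** k

    for T, k in [(50, 2), (30, 3), (61, 2), (39, 3)]:
        print(T, k, num_solutions(T, k))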

244 D. zgr-nlakin and T. Bilgi


two approaches may end up with dierent num-
Table 4: Enumeration results
Scen- Horizon Number of Feasible ber of replacements. This is because the look-
ario Replacements Solutions ahead approach takes future system reliability
into consideration while the myopic approach
1 50 2 324
focuses on the current planning time.
1 50 1 0
In this kind of predictive maintenance prob-
1 61 2 2
lem, there are two important decisions: One is
1 61 1 0
the time of replacement. Replacement should
2 30 3 1543
be done such that system reliability is always
2 30 2 0 guaranteed to be over a threshold. The other
2 39 3 1 decision is which component(s) to replace in
2 39 2 0 that period such that budget is not exceeded
and total replacement cost is minimized. Our
methodology is based on separating these deci-
spectively); no feasible solution is found. So, it sions under assumptions (v) and (vi). We give
is not possible to nd a plan with less replace- the former decision by monitoring the rst pe-
ments given by our algorithm for these cases. riod when system reliability just falls below a
Note also that, in scenario 1 when T = 61 and given threshold. In other words, we defer a re-
k = 2, enumeration nds two feasible solutions placement decision as far as the threshold per-
which are in fact symmetric solutions found also mits. As for the latter decision, we propose two
by our algorithm. Similarly in scenario 2 when approaches, myopic and look-ahead, to choose
T = 39 and k = 3, enumeration nds one fea- the component to replace. By enumerating fea-
sible solution which is the one found by our al- sible solutions in a reasonable horizon, we show
gorithm with the look-ahead approach. There that our method is eective for our simplied
are 1543 feasible solutions for scenario 2 with problem where the objective has become min-
T = 30 and k = 3. Our lookahead approach imizing total replacements in a given planning
nds one of these solutions and the solution horizon.
it nds has the maximum system reliability at In this paper, we outline a method that can
T = 30 among all solutions. The same observa- be used for nding a minimum-cost predictive
tion is also valid for scenario 1 with T = 50 and maintenance schedule such that the system reli-
k = 2. ability is always above a certain threshold. The
approach is normative in nature as opposed to
5 Conclusion
descriptive which is the case in most of the lit-
We study dynamic reliability of a system where erature that uses DBNs in reliability analysis.
components age with a constant failure rate and The problem becomes more complex by
there is a budget constraint. We develop a (i) dierentiating component costs (in time),
methodology to eectively prepare a good pre- (ii) dierentiating available budget in time,
dictive maintenance plan using DBNs to rep- (iii) dening a maintenance xed cost for each
resent the causal and temporal relations, and period which may or may not dier in time. In
to infer reliability values. We try to minimize these cases, separating the two decisions may
number of component replacements in a plan- not give a good solution. Studying such cases is
ning horizon by rst deciding the time and then left for future work.
the component to replace. Two approaches are
presented to choose the component and they Acknowledgments
are compared on three scenarios and various This work is supported by Bogazici University
threshold values. When failure rates of com- Research Fund under grant BAP06A301D.
ponents are equal, they nd the same replace- Demet Ozgur-Unluakn is supported by
ment plan. However, as failure rates dier, the Bogazici University Foundation to attend the

Acknowledgments

This work is supported by Bogazici University Research Fund under grant BAP06A301D. Demet Özgür-Ünlüakın is supported by Bogazici University Foundation to attend the PGM 06 workshop.

References

Laurent Bouillaut, Philippe Weber, A. Ben Salem and Patrice Aknin. 2004. Use of causal probabilistic networks for the improvement of the maintenance of railway infrastructure. In IEEE International Conference on Systems, Man and Cybernetics, pages 6243-6249.

David Heckerman, John S. Breese and Koos Rommelse. 1995. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49-57.

Finn V. Jensen, Uffe Kjærulff, Brian Kristiansen, Helge Langseth, Claus Skaanning, Jiri Vomlel and Marta Vomlelová. 2001. The SACSO methodology for troubleshooting complex systems. Artificial Intelligence for Engineering, Design, Analysis and Manufacturing, 15(4):321-333.

Jayant Kalagnanam and Max Henrion. 1988. A comparison of decision analysis and expert rules for sequential analysis. In 4th Conference on Uncertainty in Artificial Intelligence, pages 271-281.

Eylem Koca and Taner Bilgic. 2004. Troubleshooting with dependent actions, conditional costs, and questions. Technical Report FBE-IE-13/2004-18, Bogazici University.

Ranganath Kothamasu, Samuel H. Huang and William H. VerDuin. 2006. System health monitoring and prognostics - a review of current paradigms and practices. Int J Adv Manuf Technol, 28:1012-1024.

Helge Langseth and Finn V. Jensen. 2003. Decision theoretic troubleshooting of coherent systems. Reliability Engineering and System Safety, 80(1):49-62.

Helge Langseth and Finn V. Jensen. 2001. Heuristics for two extensions of basic troubleshooting. In 7th Scandinavian Conference on Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, pages 80-89.

Alexandre Muller, Philippe Weber and A. Ben Salem. 2004. Process model-based dynamic Bayesian networks for prognostic. In IEEE 4th International Conference on Intelligent Systems Design and Applications.

Kevin Patrick Murphy. 2002. Dynamic Bayesian networks: representation, inference and learning. Ph.D. Dissertation, University of California, Berkeley.

Kevin Patrick Murphy. 2001. The Bayes Net Toolbox for Matlab. Computing Science and Statistics: Proceedings of the Interface.

Judea Pearl. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers.

Sampath Srinivas. 1995. A polynomial algorithm for computing the optimal repair strategy in a system with independent component failures. In 11th Annual Conference on Uncertainty in Artificial Intelligence, pages 515-522.

Claus Skaanning, Finn V. Jensen and Uffe Kjærulff. 2000. Printer troubleshooting using Bayesian networks. In 13th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pages 367-379.

Jose Gerardo Torres-Toledano and Luis Enrique Sucar. 1998. Bayesian networks for reliability analysis of complex systems. In 6th Ibero-American Conference on AI: Progress in Artificial Intelligence, pages 195-206.

W. E. Vesely. 1970. A time-dependent methodology for fault tree evaluation. Nuclear Engineering and Design, 13:337-360.

Philippe Weber and Lionel Jouffe. 2006. Complex system reliability modeling with Dynamic Object Oriented Bayesian Networks (DOOBN). Reliability Engineering and System Safety, 91:149-162.

Philippe Weber, Paul Munteanu and Lionel Jouffe. 2004. Dynamic Bayesian networks modeling the dependability of systems with degradations and exogenous constraints. In 11th IFAC Symposium on Informational Control Problems in Manufacturing (INCOM04).

Philippe Weber and Lionel Jouffe. 2003. Reliability modeling with dynamic Bayesian networks. In 5th IFAC Symposium SAFEPROCESS03, pages 57-62.

Robert L. Welch and Travis V. Thelen. 2000. Dynamic reliability analysis in an operational context: the Bayesian network perspective. In Dynamic Reliability: Future Directions, pages 277-307.
Reading Dependencies from the Minimal Undirected
Independence Map of a Graphoid that Satisfies Weak Transitivity
Jose M. Pena Roland Nilsson Johan Bjorkegren Jesper Tegner
Linkoping University Linkoping University Karolinska Institutet Linkoping University
Linkoping, Sweden Linkoping, Sweden Stockholm, Sweden Linkoping, Sweden

Abstract
We present a sound and complete graphical criterion for reading dependencies from the
minimal undirected independence map of a graphoid that satisfies weak transitivity. We
argue that assuming weak transitivity is not too restrictive.

1 Introduction

A minimal undirected independence map G of an independence model p is used to read independencies that hold in p. Sometimes, however, G can also be used to read dependencies holding in p. For instance, if p is a graphoid that is faithful to G then, by definition, vertex separation is a sound and complete graphical criterion for reading dependencies from G. If p is simply a graphoid, then there also exists a sound and complete graphical criterion for reading dependencies from G (Bouckaert, 1995).

In this paper, we introduce a sound and complete graphical criterion for reading dependencies from G under the assumption that p is a graphoid that satisfies weak transitivity. Our criterion allows reading more dependencies than the criterion in (Bouckaert, 1995) at the cost of assuming weak transitivity. We argue that this assumption is not too restrictive. Specifically, we show that there exist important families of probability distributions that are graphoids and satisfy weak transitivity.

The rest of the paper is organized as follows. In Section 5, we present our criterion for reading dependencies from G. As will become clear later, it is important to first prove that vertex separation is sound and complete for reading independencies from G. We do so in Section 4. Equally important is to show that assuming that p satisfies weak transitivity is not too restrictive. We do so in Section 3. We start by reviewing some key concepts in Section 2 and close with some discussion in Section 6.

2 Preliminaries

The following definitions and results can be found in most books on probabilistic graphical models, e.g. (Pearl, 1988; Studeny, 2005). Let U denote a set of random variables. Unless otherwise stated, all the independence models and graphs in this paper are defined over U. Let X, Y, Z and W denote four mutually disjoint subsets of U. An independence model p is a set of independencies of the form X is independent of Y given Z. We represent that an independency is in p by X ⊥ Y|Z and that an independency is not in p by X ⊥̸ Y|Z. An independence model is a graphoid when it satisfies the following five properties: symmetry X ⊥ Y|Z ⇒ Y ⊥ X|Z, decomposition X ⊥ YW|Z ⇒ X ⊥ Y|Z, weak union X ⊥ YW|Z ⇒ X ⊥ Y|ZW, contraction X ⊥ Y|ZW ∧ X ⊥ W|Z ⇒ X ⊥ YW|Z, and intersection X ⊥ Y|ZW ∧ X ⊥ W|ZY ⇒ X ⊥ YW|Z. Any strictly positive probability distribution is a graphoid.

Let sep(X, Y|Z) denote that X is separated from Y given Z in a graph G. Specifically, sep(X, Y|Z) holds when every path in G between X and Y is blocked by Z. If G is an undirected graph (UG), then a path in G between X and Y is blocked by Z when there exists some Z ∈ Z in the path. If G is a directed and acyclic graph (DAG), then a path in G between X and Y is blocked by Z when there exists a node Z in the path such that either (i) Z does not have two parents in the path and Z ∈ Z, or (ii) Z has two parents in the path and neither Z nor any of its descendants in G is in Z. An independence model p is faithful to an UG or DAG G when X ⊥ Y|Z iff sep(X, Y|Z). Any independence model that is faithful to some UG or DAG is a graphoid. An UG G is an undirected independence map of an independence model p when X ⊥ Y|Z if sep(X, Y|Z). Moreover, G is a minimal undirected independence (MUI) map of p when removing any edge from G makes it cease to be an independence map of p. A Markov boundary of X ∈ U in an independence model p is any subset MB(X) of U \ X such that (i) X ⊥ U \ X \ MB(X) | MB(X), and (ii) no proper subset of MB(X) satisfies (i). If p is a graphoid, then (i) MB(X) is unique for all X, (ii) the MUI map G of p is unique, and (iii) two nodes X and Y are adjacent in G iff X ∈ MB(Y) iff Y ∈ MB(X) iff X ⊥̸ Y | U \ (XY).
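As an illustration of property (iii), the following Python sketch builds the MUI map of a small three-variable multinomial distribution by testing, for every pair of nodes, whether the pair is dependent given all the remaining variables. The particular distribution, a noisy parity of two independent binary causes with one biased cause, is an illustrative assumption and is not taken from the paper.

    import itertools
    import numpy as np

    # joint distribution over (X, Y, Z): X, Y independent binary causes,
    # Z a noisy parity of X and Y (illustrative choice)
    p = np.zeros((2, 2, 2))
    for x, y, z in itertools.product((0, 1), repeat=3):
        px = 0.5
        py = 0.7 if y == 1 else 0.3
        pz = 0.9 if z == (x ^ y) else 0.1
        p[x, y, z] = px * py * pz

    names = ("X", "Y", "Z")

    def dependent(i, j):
        # test whether variables i and j are dependent given the remaining variable,
        # by comparing each conditional joint with the product of its marginals
        rest = [k for k in range(3) if k not in (i, j)][0]
        for r in (0, 1):
            idx = [slice(None)] * 3
            idx[rest] = r
            block = p[tuple(idx)]              # 2x2 table over (variable i, variable j)
            block = block / block.sum()
            if not np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0))):
                return True
        return False

    edges = [(names[i], names[j])
             for i, j in itertools.combinations(range(3), 2) if dependent(i, j)]
    print(edges)   # all three pairs are adjacent, so the MUI map is the complete UG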
in the path such that either (i) Z does not have out and instantiating some others in a probabi-
two parents in the path and Z Z, or (ii) Z has lity distribution that is faithful to some DAG
two parents in the path and neither Z nor any of is a WT graphoid, although it may be neither
its descendants in G is in Z. An independence Gaussian nor faithful to any UG or DAG.
model p is faithful to an UG or DAG G when Theorem 1. Let p be a probability distribution
X Y|Z iff sep(X, Y|Z). Any independence that is a WT graphoid and let W U. Then,
model that is faithful to some UG or DAG is a p(U \ W) is a WT graphoid. If p(U \ W|W =
graphoid. An UG G is an undirected indepen- w) has the same independencies for all w, then
dence map of an independence model p when p(U \ W|W = w) for any w is a WT graphoid.
X Y|Z if sep(X, Y|Z). Moreover, G is a mi-
nimal undirected independence (MUI) map of p Proof. Let X, Y and Z denote three mutually
when removing any edge from G makes it cease disjoint subsets of U \ W. Then, X Y|Z in
to be an independence map of p. A Markov p(U \ W) iff X Y|Z in p and, thus, p(U \ W)
boundary of X U in an independence model satisfies the WT graphoid properties because p
p is any subset M B(X) of U \ X such that (i) satisfies them. If p(U \ W|W = w) has the
X U\X \M B(X)|M B(X), and (ii) no proper same independencies for all w then, for any w,
subset of M B(X) satisfies (i). If p is a graphoid, X Y|Z in p(U \ W|W = w) iff X Y|ZW in
then (i) M B(X) is unique for all X, (ii) the p. Then, p(U \ W|W = w) for any w satisfies
MUI map G of p is unique, and (iii) two nodes the WT graphoid properties because p satisfies
X and Y are adjacent in G iff X M B(Y ) iff them.
Y M B(X) iff X 6 Y |U \ (XY ). We now show that it is not too restrictive to
A Bayesian network (BN) is a pair (G, ) assume in the theorem above that p(U\W|W =
where G is a DAG and are parameters spe- w) has the same independencies for all w, be-
cifying a probability distribution for each X cause there exist important families of proba-
U given its parents in G, p(X|P a(X)). The bility distributions whose all or almost all the
BN
Q represents the probability distribution p = members satisfy such an assumption. For ins-
XU p(X|P a(X)). Then, G is an indepen- tance, if p is a Gaussian probability distribution,
dence map of a probability distribution p iff p then p(U \ W|W = w) has the same indepen-
can be represented by a BN with DAG G. dencies for all w, because the independencies in
p(U \ W|W = w) only depend on the variance-
3 WT Graphoids covariance matrix of p (Anderson, 1984). Let
us now consider all the multinomial probability
Let X, Y and Z denote three mutually disjoint distributions for which a DAG G is an indepen-
subsets of U. We call WT graphoid to any dence map and denote them by M (G). The
graphoid that satisfies weak transitivity X following theorem, which is inspired by (Meek,
Y|Z X Y|ZV X V |Z V
Y|Z with 1995), proves that the probability of randomly
V U \ (XYZ). We now argue that there exist drawing from M (G) a probability distribution
important families of probability distributions p such that p(U\ W|W = w) does not have the
that are WT graphoids and, thus, that WT gra- same independencies for all w is zero.
phoids are worth studying. For instance, any
Theorem 2. The probability distributions p in
probability distribution that is Gaussian or fai-
M (G) for which there exists some W U such
thful to some UG or DAG is a WT graphoid
that p(U \ W|W = w) does not have the same
(Pearl, 1988; Studeny, 2005). There also exist
independencies for all w have Lebesgue measure
probability distributions that are WT graphoids
zero wrt M (G).
although they are neither Gaussian nor faithful
to any UG or DAG. For instance, it follows from Proof. The proof basically proceeds in the same
the theorem below that the probability distribu- way as that of Theorem 7 in (Meek, 1995), so we
tion that results from marginalizing some nodes refer the reader to that paper for more details.

Let W ⊆ U and let X, Y and Z denote three disjoint subsets of U \ W. For a constraint such as X ⊥ Y|Z to be true in p(U \ W | W = w) but false in p(U \ W | W = w'), two conditions must be met. First, sep(X, Y|ZW) must not hold in G and, second, the following equations must be satisfied: p(X = x, Y = y, Z = z, W = w) p(Z = z, W = w) - p(X = x, Z = z, W = w) p(Y = y, Z = z, W = w) = 0 for all x, y and z. Each equation is a polynomial in the BN parameters corresponding to G, because each term p(V = v) in the equations is a summation of products of BN parameters (Meek, 1995). Furthermore, each polynomial is non-trivial, i.e. not all the values of the BN parameters corresponding to G are solutions to the polynomial. To see it, note that there exists a probability distribution q in M(G) that is faithful to G (Meek, 1995) and, thus, that X ⊥̸ Y|ZW in q because sep(X, Y|ZW) does not hold in G. Then, by permuting the states of the random variables, we can transform the BN parameter values corresponding to q into BN parameter values for p so that the polynomial does not hold. Let sol(x, y, z, w) denote the set of solutions to the polynomial for x, y and z. Then, sol(x, y, z, w) has Lebesgue measure zero wrt R^n, where n is the number of linearly independent BN parameters corresponding to G, because it consists of the solutions to a non-trivial polynomial (Okamoto, 1973). Let sol = ∪_{X,Y,Z,W} ∪_w ∩_{x,y,z} sol(x, y, z, w) and recall from above that the outer-most union only involves those cases for which sep(X, Y|ZW) does not hold in G. Then, sol has Lebesgue measure zero wrt R^n, because the finite union and intersection of sets of Lebesgue measure zero has Lebesgue measure zero too. Consequently, the probability distributions p in M(G) such that p(U \ W | W = w) does not have the same independencies for all w have Lebesgue measure zero wrt R^n because they are contained in sol. These probability distributions also have Lebesgue measure zero wrt M(G), because M(G) has positive Lebesgue measure wrt R^n (Meek, 1995).

Finally, we argue in Section 6 that it is not unrealistic to assume that the probability distribution underlying the learning data in most projects on gene expression data analysis, one of the hottest areas of research nowadays, is a WT graphoid.

4 Reading Independencies

By definition, sep is sound for reading independencies from the MUI map G of a WT graphoid p, i.e. it only identifies independencies in p. Now, we prove that sep in G is also complete in the sense that it identifies all the independencies in p that can be identified by studying G alone. Specifically, we prove that there exist multinomial and Gaussian probability distributions that are faithful to G. Such probability distributions have all and only the independencies that sep identifies from G. Moreover, such probability distributions must be WT graphoids because sep satisfies the WT graphoid properties (Pearl, 1988). The fact that sep in G is complete, in addition to being an important result in itself, is important for reading as many dependencies as possible from G (see Section 5).
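A vertex-separation test of the kind used throughout this section is easy to state in code. The sketch below assumes the Python networkx library and checks sep(X, Y|Z) in an undirected graph by removing Z and asking whether any node of X can still reach a node of Y.

    import networkx as nx

    def sep(G, X, Y, Z):
        # sep(X, Y | Z) for an undirected graph: every path between X and Y
        # contains a node of Z, i.e. X and Y are disconnected once Z is removed
        H = G.copy()
        H.remove_nodes_from(Z)
        reachable = set()
        for x in X:
            if x in H:
                reachable |= nx.node_connected_component(H, x)
        return reachable.isdisjoint(Y)

    G = nx.path_graph(["X", "Z", "Y"])         # the chain X - Z - Y
    print(sep(G, {"X"}, {"Y"}, {"Z"}))         # True: Z blocks the only path
    print(sep(G, {"X"}, {"Y"}, set()))         # False: the path X - Z - Y is unblocked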

Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies
Weak Transitivity 249
Theorem 3. Let G be an UG. There exist multinomial and Gaussian probability distributions that are faithful to G.

Proof. We first prove the theorem for multinomial probability distributions. Create a copy H of G and, then, replace every edge X - Y in H by X → W_{XY} ← Y, where W_{XY} ∉ U is an auxiliary node. Let W denote all the auxiliary nodes created. Then, H is a DAG over UW. Moreover, for any three mutually disjoint subsets X, Y and Z of U, sep(X, Y|ZW) in H iff sep(X, Y|Z) in G.

The probability distributions p(U, W) in M(H) that are faithful to H and satisfy that p(U | W = w) has the same independencies for all w have positive Lebesgue measure wrt M(H) because (i) M(H) has positive Lebesgue measure wrt R^n (Meek, 1995), (ii) the probability distributions in M(H) that are not faithful to H have Lebesgue measure zero wrt M(H) (Meek, 1995), (iii) the probability distributions p(U, W) in M(H) such that p(U | W = w) does not have the same independencies for all w have Lebesgue measure zero wrt M(H) by Theorem 2, and (iv) the union of the probability distributions in (ii) and (iii) has Lebesgue measure zero wrt M(H) because the finite union of sets of Lebesgue measure zero has Lebesgue measure zero.

Let p(U, W) denote any probability distribution in M(H) that is faithful to H and satisfies that p(U | W = w) has the same independencies for all w. As proven in the paragraph above, such a probability distribution exists. Fix any w and let X, Y and Z denote three mutually disjoint subsets of U. Then, X ⊥ Y|Z in p(U | W = w) iff X ⊥ Y|ZW in p(U, W) iff sep(X, Y|ZW) in H iff sep(X, Y|Z) in G. Then, p(U | W = w) is faithful to G.

The proof for Gaussian probability distributions is analogous. In this case, p(U, W) is a Gaussian probability distribution and thus, for any w, p(U | W = w) is Gaussian too (Anderson, 1984). Theorem 2 is not needed in the proof because, as discussed in Section 3, any Gaussian probability distribution p(U, W) satisfies that p(U | W = w) has the same independencies for all w.

The theorem above has previously been proven for multinomial probability distributions in (Geiger and Pearl, 1993), but that proof constrains the cardinality of U. Our proof does not constrain the cardinality of U and applies not only to multinomial but also to Gaussian probability distributions. It has been proven in (Frydenberg, 1990) that sep in an UG G is complete in the sense that it identifies all the independencies holding in every Gaussian probability distribution for which G is an independence map. Our result is stronger because it proves the existence of a Gaussian probability distribution with exactly these independencies. We learned from one of the reviewers, whom we thank for it, that a rather different proof of the theorem above for Gaussian probability distributions is reported in (Lnenicka, 2005).

The theorem above proves that sep in the MUI map G of a WT graphoid p is complete in the sense that it identifies all the independencies in p that can be identified by studying G alone. However, sep in G is not complete if being complete is understood as being able to identify all the independencies in p. Actually, no sound criterion for reading independencies from G alone is complete in the latter sense. An example follows.

Example 1. Let p be a multinomial (Gaussian) probability distribution that is faithful to the DAG X → Z ← Y. Such a probability distribution exists (Meek, 1995). Let G denote the MUI map of p, namely the complete UG. Note that p is not faithful to G. However, by Theorem 3, there exists a multinomial (Gaussian) probability distribution q that is faithful to G. As discussed in Section 3, p and q are WT graphoids. Let us assume that we are dealing with p. Then, no sound criterion can conclude X ⊥ Y | ∅ by just studying G because this independency does not hold in q, and it is impossible to know whether we are dealing with p or q on the sole basis of G.

250 J. M. Pea, R. Nilsson, J. Bjrkegren, and J. Tegnr


5 Reading Dependencies

In this section, we propose a sound and complete criterion for reading dependencies from the MUI map of a WT graphoid. We define the dependence base of an independence model p as the set of all the dependencies X ⊥̸ Y | U \ (XY) with X, Y ∈ U. If p is a WT graphoid, then additional dependencies in p can be derived from its dependence base via the WT graphoid properties. For this purpose, we rephrase the WT graphoid properties as follows. Let X, Y, Z and W denote four mutually disjoint subsets of U. Symmetry Y ⊥̸ X|Z ⇒ X ⊥̸ Y|Z. Decomposition X ⊥̸ Y|Z ⇒ X ⊥̸ YW|Z. Weak union X ⊥̸ Y|ZW ⇒ X ⊥̸ YW|Z. Contraction X ⊥̸ YW|Z ⇒ X ⊥̸ Y|ZW ∨ X ⊥̸ W|Z is problematic for deriving new dependencies because it contains a disjunction in the right-hand side and, thus, it should be split into two properties: contraction1 X ⊥̸ YW|Z ∧ X ⊥ Y|ZW ⇒ X ⊥̸ W|Z, and contraction2 X ⊥̸ YW|Z ∧ X ⊥ W|Z ⇒ X ⊥̸ Y|ZW. Likewise, intersection X ⊥̸ YW|Z ⇒ X ⊥̸ Y|ZW ∨ X ⊥̸ W|ZY gives rise to intersection1 X ⊥̸ YW|Z ∧ X ⊥ Y|ZW ⇒ X ⊥̸ W|ZY, and intersection2 X ⊥̸ YW|Z ∧ X ⊥ W|ZY ⇒ X ⊥̸ Y|ZW. Note that intersection1 and intersection2 are equivalent and, thus, we refer to them simply as intersection. Finally, weak transitivity X ⊥̸ V|Z ∧ V ⊥̸ Y|Z ⇒ X ⊥̸ Y|Z ∨ X ⊥̸ Y|ZV with V ∈ U \ (XYZ) gives rise to weak transitivity1 X ⊥̸ V|Z ∧ V ⊥̸ Y|Z ∧ X ⊥ Y|Z ⇒ X ⊥̸ Y|ZV, and weak transitivity2 X ⊥̸ V|Z ∧ V ⊥̸ Y|Z ∧ X ⊥ Y|ZV ⇒ X ⊥̸ Y|Z. The independency in the left-hand side of any of the properties above holds if the corresponding sep statement holds in the MUI map G of p. This is the best solution we can hope for because, as discussed in Section 4, sep in G is sound and complete. Moreover, this solution does not require more information about p than what is available, because G can be constructed from the dependence base of p. We call the WT graphoid closure of the dependence base of p the set of all the dependencies that are in the dependence base of p plus those that can be derived from it by applying the WT graphoid properties.

We now introduce our criterion for reading dependencies from the MUI map of a WT graphoid. Let X, Y and Z denote three mutually disjoint subsets of U. Then, con(X, Y|Z) denotes that X is connected to Y given Z in an UG G. Specifically, con(X, Y|Z) holds when there exist some X1 ∈ X and Xn ∈ Y such that there exists exactly one path in G between X1 and Xn that is not blocked by (X \ X1)(Y \ Xn)Z. Note that there may exist several unblocked paths in G between X and Y but only one between X1 and Xn. We now prove that con is sound for reading dependencies from the MUI map of a WT graphoid, i.e. it only identifies dependencies in the WT graphoid. Actually, it only identifies dependencies in the WT graphoid closure of the dependence base of p. Hereinafter, X1:n denotes a path X1, ..., Xn in an UG.
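The con criterion can be checked mechanically on the MUI map. The Python sketch below, which assumes the networkx library, counts for each pair X1 ∈ X and Xn ∈ Y the simple paths that survive after deleting the blocking set (X \ X1)(Y \ Xn)Z, and declares con(X, Y|Z) as soon as exactly one such path remains for some pair.

    import itertools
    import networkx as nx

    def con(G, X, Y, Z):
        # con(X, Y | Z) for an undirected graph G, following the definition above
        for x1, xn in itertools.product(X, Y):
            blockers = (set(X) - {x1}) | (set(Y) - {xn}) | set(Z)
            H = G.subgraph(set(G.nodes) - blockers)
            # generating at most two paths suffices to decide "exactly one"
            paths = list(itertools.islice(nx.all_simple_paths(H, x1, xn), 2))
            if len(paths) == 1:
                return True
        return False

    G = nx.complete_graph(["X", "Y", "Z"])      # the MUI map of Example 1
    print(con(G, {"X"}, {"Y"}, set()))          # False: two unblocked paths, X-Y and X-Z-Y
    print(con(G, {"X"}, {"Y"}, {"Z"}))          # True: only the edge X-Y remains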

Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies
Weak Transitivity 251
Theorem 4. Let p be a WT graphoid and G its MUI map. Then, con in G only identifies dependencies in the WT graphoid closure of the dependence base of p.

Proof. We first prove that if X1:n is the only path in G between X1 and Xn that is not blocked by Y ⊆ U \ X1:n, then X1 ⊥̸ Xn|Y. We prove it by induction over n. We first prove it for n = 2. Let W denote all the nodes in U \ X1:2 \ Y that are not separated from X1 given X2Y in G. Since X1 and X2 are adjacent in G, X1 ⊥̸ X2 | U \ X1:2 and, thus, X1W ⊥̸ X2(U \ X1:2 \ Y \ W)|Y due to weak union. This together with sep(X1W, U \ X1:2 \ Y \ W | X2Y), which follows from the definition of W, implies X1W ⊥̸ X2|Y due to contraction1. Note that if U \ X1:2 \ Y \ W = ∅, then X1W ⊥̸ X2(U \ X1:2 \ Y \ W)|Y directly implies X1W ⊥̸ X2|Y. In any case, this dependency together with sep(W, X2|X1Y), which holds because otherwise there exist several unblocked paths in G between X1 and X2, contradicting the definition of Y, implies X1 ⊥̸ X2|Y due to contraction1. Note that if W = ∅, then X1W ⊥̸ X2|Y directly implies X1 ⊥̸ X2|Y. Let us assume as induction hypothesis that the statement that we are proving holds for all n < m. We now prove it for n = m. Since the paths X1:2 and X2:m contain less than m nodes and Y blocks all the other paths in G between X1 and X2 and between X2 and Xm, because otherwise there exist several unblocked paths in G between X1 and Xm which contradicts the definition of Y, then X1 ⊥̸ X2|Y and X2 ⊥̸ Xm|Y due to the induction hypothesis. This together with sep(X1, Xm|YX2), which follows from the definition of X1:m and Y, implies X1 ⊥̸ Xm|Y due to weak transitivity2.

Let X, Y and Z denote three mutually disjoint subsets of U. If con(X, Y|Z) holds in G, then there exist some X1 ∈ X and Xn ∈ Y such that X1 ⊥̸ Xn | (X \ X1)(Y \ Xn)Z due to the paragraph above and, thus, X ⊥̸ Y|Z due to weak union. Then, every con statement in G corresponds to a dependency in p. Moreover, this dependency must be in the WT graphoid closure of the dependence base of p, because we have only used in the proof the dependence base of p and the WT graphoid properties.

We now prove that con is complete for reading dependencies from the MUI map of a WT graphoid p, in the sense that it identifies all the dependencies in p that follow from the information about p that is available, namely the dependence base of p and the fact that p is a WT graphoid.

Theorem 5. Let p be a WT graphoid and G its MUI map. Then, con in G identifies all the dependencies in the WT graphoid closure of the dependence base of p.

Proof. It suffices to prove (i) that all the dependencies in the dependence base of p are identified by con in G, and (ii) that con satisfies the WT graphoid properties. Since the first point is trivial, we only prove the second point. Let X, Y, Z and W denote four mutually disjoint subsets of U.

Symmetry con(Y, X|Z) ⇒ con(X, Y|Z). Trivial.

Decomposition con(X, Y|Z) ⇒ con(X, YW|Z). Trivial if W contains no node in the path X1:n in the left-hand side. If W contains some node in X1:n, then let Xm denote the closest node to X1 such that Xm ∈ X1:n ∩ W. Then, the path X1:m satisfies the right-hand side because (X \ X1)(YW \ Xm)Z blocks all the other paths in G between X1 and Xm, since (X \ X1)(Y \ Xn)Z blocks all the paths in G between X1 and Xm except X1:m, because otherwise there exist several unblocked paths in G between X1 and Xn, which contradicts the left-hand side.

Weak union con(X, Y|ZW) ⇒ con(X, YW|Z). Trivial because W contains no node in the path X1:n in the left-hand side.

Contraction1 con(X, YW|Z) ∧ sep(X, Y|ZW) ⇒ con(X, W|Z). Since ZW blocks all the paths in G between X and Y, then (i) the path X1:n in the left-hand side must be between X and W, and (ii) all the paths in G between X1 and Xn that are blocked by Y are also blocked by (W \ Xn)Z and, thus, Y is not needed to block all the paths in G between X1 and Xn except X1:n. Then, X1:n satisfies the right-hand side.

Contraction2 con(X, YW|Z) ∧ sep(X, W|Z) ⇒ con(X, Y|ZW). Since Z blocks all the paths in G between X and W, the path X1:n in the left-hand side must be between X and Y and, thus, it satisfies the right-hand side.

Intersection con(X, YW|Z) ∧ sep(X, Y|ZW) ⇒ con(X, W|ZY). Since ZW blocks all the paths in G between X and Y, the path X1:n in the left-hand side must be between X and W and, thus, it satisfies the right-hand side.

Weak transitivity2 con(X, Xm|Z) ∧ con(Xm, Y|Z) ∧ sep(X, Y|ZXm) ⇒ con(X, Y|Z) with Xm ∈ U \ (XYZ). Let X1:m and Xm:n denote the paths in the first and second con statements in the left-hand side, respectively. Let X1:m:n denote the path X1, ..., Xm, ..., Xn. Then, X1:m:n satisfies the right-hand side because (i) Z does not block X1:m:n, and (ii) Z blocks all the other paths in G between X1 and Xn, because otherwise there exist several unblocked paths in G between X1 and Xm or between Xm and Xn, which contradicts the left-hand side.

Weak transitivity1 con(X, Xm|Z) ∧ con(Xm, Y|Z) ∧ sep(X, Y|Z) ⇒ con(X, Y|ZXm) with Xm ∈ U \ (XYZ). This property never applies because, as seen in weak transitivity2, sep(X, Y|Z) never holds since Z does not block X1:m:n.
Note that the meaning of completeness in the theorem above differs from that in Theorem 3. It remains an open question whether con in G identifies all the dependencies in p that can be identified by studying G alone. Note also that con in G is not complete if being complete is understood as being able to identify all the dependencies in p. Actually, no sound criterion for reading dependencies from G alone is complete in this sense. Example 1 illustrates this point. Let us now assume that we are dealing with q instead of with p. Then, no sound criterion can conclude X ⊥̸ Y | ∅ by just studying G because this dependency does not hold in p, and it is impossible to know whether we are dealing with p or q on the sole basis of G.

We have defined the dependence base of a WT graphoid p as the set of all the dependencies X ⊥̸ Y | U \ (XY) with X, Y ∈ U. However, Theorems 4 and 5 remain valid if we redefine the dependence base of p as the set of all the dependencies X ⊥̸ Y | MB(X) \ Y with X, Y ∈ U. It suffices to prove that the WT graphoid closure is the same for both dependence bases of p. Specifically, we prove that the first dependence base is in the WT graphoid closure of the second dependence base and vice versa. If X ⊥̸ Y | U \ (XY), then X ⊥̸ Y(U \ (XY) \ (MB(X) \ Y)) | MB(X) \ Y due to weak union. This together with sep(X, U \ (XY) \ (MB(X) \ Y) | Y(MB(X) \ Y)) implies X ⊥̸ Y | MB(X) \ Y due to contraction1. On the other hand, if X ⊥̸ Y | MB(X) \ Y, then X ⊥̸ Y(U \ (XY) \ (MB(X) \ Y)) | MB(X) \ Y due to decomposition. This together with sep(X, U \ (XY) \ (MB(X) \ Y) | Y(MB(X) \ Y)) implies X ⊥̸ Y | U \ (XY) due to intersection.

In (Bouckaert, 1995), the following sound and complete (in the same sense as con) criterion for reading dependencies from the MUI map of a graphoid is introduced: let X, Y and Z denote three mutually disjoint subsets of U; then X ⊥̸ Y|Z when there exist some X1 ∈ X and X2 ∈ Y such that X1 ∈ MB(X2) and either MB(X1) \ X2 ⊆ (X \ X1)(Y \ X2)Z or MB(X2) \ X1 ⊆ (X \ X1)(Y \ X2)Z. Note that con(X, Y|Z) coincides with this criterion when n = 2 and either MB(X1) \ X2 ⊆ (X \ X1)(Y \ X2)Z or MB(X2) \ X1 ⊆ (X \ X1)(Y \ X2)Z. Therefore, con allows reading more dependencies than the criterion in (Bouckaert, 1995) at the cost of assuming weak transitivity which, as discussed in Section 3, is not a too restrictive assumption.
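For comparison, the criterion of (Bouckaert, 1995) recalled above can also be implemented on the MUI map, where by property (iii) of Section 2 the Markov boundary of a node is simply its set of neighbours. The Python sketch below assumes the networkx library and follows the statement of the criterion literally.

    import itertools
    import networkx as nx

    def bouckaert(G, X, Y, Z):
        # criterion of (Bouckaert, 1995) on the MUI map G, with MB(.) = neighbours
        for x1, x2 in itertools.product(X, Y):
            if not G.has_edge(x1, x2):                 # requires X1 in MB(X2)
                continue
            rest = (set(X) - {x1}) | (set(Y) - {x2}) | set(Z)
            mb1 = set(G[x1]) - {x2}
            mb2 = set(G[x2]) - {x1}
            if mb1 <= rest or mb2 <= rest:
                return True
        return False

    # on the chain A - B - C - D the criterion does not apply to A and C (they are
    # not adjacent), whereas the con criterion of Section 5 does read A and C as
    # dependent given the empty set, since exactly one unblocked path A-B-C connects them
    G = nx.path_graph(["A", "B", "C", "D"])
    print(bouckaert(G, {"A"}, {"C"}, set()))   # False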

Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies
Weak Transitivity 253
Finally, the soundness of con allows us to give an alternative proof of the following theorem, which was originally proven in (Becker et al., 2000).

Theorem 6. Let p be a WT graphoid and G its MUI map. If G is a forest, then p is faithful to it.

Proof. Any independency in p for which the corresponding separation statement does not hold in G contradicts Theorem 4.

6 Discussion

In this paper, we have introduced a sound and complete criterion for reading dependencies from the MUI map of a WT graphoid. In (Pena et al., 2006), we show how this helps to identify all the nodes that are relevant to compute all the conditional probability distributions for a given set of nodes without having to learn a BN first. We are currently working on a sound and complete criterion for reading dependencies from a minimal directed independence map of a WT graphoid.

Due to lack of time, we have not been able to address some of the questions posed by the reviewers. We plan to do so in an extended version of this paper. These questions were studying the relation between the new criterion and lack of vertex separation, studying the complexity of the new criterion, and studying the uniqueness and consistency of the WT graphoid closure.

Our end goal is to apply the results in this paper to our project on atherosclerosis gene expression data analysis in order to learn dependencies between genes. We believe that it is not unrealistic to assume that the probability distribution underlying our data satisfies strict positivity and weak transitivity and, thus, is a WT graphoid. The cell is the functional unit of all organisms and includes all the information necessary to regulate its function. This information is encoded in the DNA of the cell, which is divided into a set of genes, each coding for one or more proteins. Proteins are required for practically all the functions in the cell. The amount of protein produced depends on the expression level of the coding gene which, in turn, depends on the amounts of proteins produced by other genes. Therefore, a dynamic Bayesian network is a rather accurate model of the cell (Murphy and Mian, 1999): the nodes represent the genes and proteins, and the edges and parameters represent the causal relations between the gene expression levels and the protein amounts. It is important that the Bayesian network is dynamic because a gene can regulate some of its regulators and even itself with some time delay. Since the technology for measuring the state of the protein nodes is not widely available yet, the data in most projects on gene expression data analysis are a sample of the probability distribution represented by the dynamic Bayesian network after marginalizing the protein nodes out. The probability distribution with no node marginalized out is almost surely faithful to the dynamic Bayesian network (Meek, 1995) and, thus, it satisfies weak transitivity (see Section 3) and, thus, so does the probability distribution after marginalizing the protein nodes out (see Theorem 1). The assumption that the probability distribution sampled is strictly positive is justified because measuring the state of the gene nodes involves a series of complex wet-lab and computer-assisted steps that introduces noise in the measurements (Sebastiani et al., 2003).

Additional evidence supporting the claim that the results in this paper can be helpful for learning gene dependencies comes from the increasing attention that graphical Gaussian models of gene networks have been receiving from the bioinformatics community (Schafer and Strimmer, 2005). A graphical Gaussian model of a gene network is no more than the MUI map of the probability distribution underlying the gene network, which is assumed to be Gaussian, hence the name of the model. Then, this underlying probability distribution is a WT graphoid and, thus, the results in this paper apply.

Acknowledgments

We thank the anonymous referees for their comments. This work is funded by the Swedish Research Council (VR-621-2005-4202), the Ph.D. Programme in Medical Bioinformatics, the Swedish Foundation for Strategic Research, Clinical Gene Networks AB, and Linkoping University.

References

Theodore W. Anderson. 1984. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons.

Ann Becker, Dan Geiger and Christopher Meek. 2000. Perfect Tree-Like Markovian Distributions. In 16th Conference on UAI, pages 19-23.

Remco R. Bouckaert. 1995. Bayesian Belief Networks: From Construction to Inference. PhD Thesis, University of Utrecht.

Morten Frydenberg. 1990. Marginalization and Collapsibility in Graphical Interaction Models. Annals of Statistics, 18:790-805.

Dan Geiger and Judea Pearl. 1993. Logical and Algorithmic Properties of Conditional Independence and Graphical Models. The Annals of Statistics, 21:2001-2021.

Radim Lnenicka. 2005. On Gaussian Conditional Independence Structures. Technical Report 2005/14, Academy of Sciences of the Czech Republic.

Christopher Meek. 1995. Strong Completeness and Faithfulness in Bayesian Networks. In 11th Conference on UAI, pages 411-418.

Kevin Murphy and Saira Mian. 1999. Modelling Gene Expression Data Using Dynamic Bayesian Networks. Technical Report, University of California.

Masashi Okamoto. 1973. Distinctness of the Eigenvalues of a Quadratic Form in a Multivariate Sample. Annals of Statistics, 1:763-765.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Jose M. Pena, Roland Nilsson, Johan Bjorkegren and Jesper Tegner. 2006. Identifying the Relevant Nodes Without Learning the Model. In 22nd Conference on UAI, pages 367-374.

Juliane Schafer and Korbinian Strimmer. 2005. Learning Large-Scale Graphical Gaussian Models from Genomic Data. In Science of Complex Networks: From Biology to the Internet and WWW.

Paola Sebastiani, Emanuela Gussoni, Isaac S. Kohane and Marco F. Ramoni. 2003. Statistical Challenges in Functional Genomics (with Discussion). Statistical Science, 18:33-60.

Milan Studeny. 2005. Probabilistic Conditional Independence Structures. Springer.

Silja Renooij and Linda C. van der Gaag
Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands

[The title, abstract and body of this final paper, which begins on proceedings page 255, could not be recovered: the extracted text is font-encoding damaged beyond repair.]
] B > & ? , )^G  _ & , ) 2
X ? \
> By
J eB& @g
 > 

   D "  "  B

" >>  D "  _  A@    
%'&( )+*-, B .0/ ,  1324/S 5  J J h L
#"    _ OQ-  B 7" >  @     1"  6 QJN
kl:MWZ=BLe::<a%T8JS536ad>36;>PTd>:a%:<;>ae3[T3[H43[TgfW^>;S_`T3[=@;>aW=BL > "F" M :R>L=HB:TdI:R>L=BR:MLTgfaT8JT%:<C(36; TdI:
;>8B3[HB:k]8<fB:<a368B;;I:MT
=BLesaMbl:36;?T%L=4C>^>_`:a%=@U: R>L=BR=@a36T3[=@;WZ=BDL N R ,N QR~ITdI:R>L=s=BWEW=BDL N RH, N QR36a
;I=BT8JT3[=@;S8B5_`=@;?HB:<;?T3[=@;>aMG :_`=@;>a36CI:ML8;>8B3[HB: 8B;>8B5[=BPB=@^SaMG :O:MP@36;1sf1Le36T36;IPyTdI:&R>L=B8JS36563[Tf
k]8<fB:<a38B;K;I:MTgl=BLV3[Td;I=4CI:<7a 69 8 , z :< ;>= b h L 6 : N36;T%:ML!U1a=BWYTdI:;I:MT
=BL/qaR8JLe8BU:MT%:MLeaM
= , z =@? =7BS D b CFE sb+8B;SC8JL!_MHa G 9 8y ,
z : : =I K JM: LHA A , A: : A A A: C : 36a_M8B565[:<CTd>: M[ = h L 6 : N , OY N I J 6 0 O 6 0 OY N R J 6 :
 ? >  E` K=BWVTdI:;I:MTgl=BL8B;>C =7I 8JLe:T%:ML!U:<C3[Ta `xk] 

Evidence and Scenario Sensitivities in Naive Bayesian Classifiers 257


Vd>:ML: N I 36aTd>:HJ8B56^I:=BWTd>:HJ8JLe368JS56: =7I 36;TdI:
36;>RS^IT 36;>a%T8B;S_`:NBGvd>3aR>L=BS8JS3563[TgfL:<568JT%:<a
T%=yTdI: 1 1

RS8JL!8BU:MT%:ML ) , OYN QRJ 6 Q l8Ba

fc (x )

fc (x )
0.8 fc (x ) 0.8

h L 6 Q : N` ) , OY N I J 6 Q 0 OY 6 Q 0 )
0.6 0.6
fc (x )
0.4 0.4
`xk] 
W=BL 8B;>CK8Ba
0.2 0.2

6 , 6 Q 0 0

h L 6 : N` ) , YO N I J 6 0 YO 6
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

 3[P@^ILe:s mI8BURS5[:ya%:<;Sa3[T3[H43[TgfW^>;>_`T3[=@;SaWZ=BL+TdI:
x0 x x0 x

x RS8JL!8BU:MT%:ML^) , OYNQ R J 6 Q+3[TdR)  ,wp A sbO`Q,wp A Ib


W=BL 6 , 6 Q Gl2436U13568JLe5[fBb h LN_M8B;A:OLe3[T%T%:<;\8Ba Vd>:ML: N R , N QR  l=BL N Rc, N QR >  @B %!G
h LN, h L 6 Q : N 2 N  N h L 6 : N :  Le=@UTd>:&HJ8B56^I:<a
=BWTd>:&8Ba%f4UR>T%=BT%:<a]W=@^>;>C36;TdI:

Vd>:ML: h L 6 Q : N8JP@8B36;L :<568JT%:<a T%=TdI:RS8JL!8BU:MT%:ML R>Le=?=BW8J=HB:Bbl:\=BSa:MLHB:\Td>8JTWZ=BL N R , N QRoTdI:
)K8Ba36;SC>36_M8JT%:<C8J/=HB:y8B;>CTdI:=BTdI:MLT%:MLeU36a]_`=@;4
a%:<;Sa3[T3[H43[Tgf\W^>;S_`T3[=@Y; + UjV4X N Y []\  )&36a8WZL!8JP@U:<;~TO=BW
a%T8B;?TV36TdL:<a%R:<_`TT%=1TdI:RS8JLe8BU:MT%:ML<G  =BL 6 , 6 Q 8wX TdIg?^>8BC>Le8B;~Td?fsR/:MLe/=@58SLe8B;>_ed8B55 =BTdI:ML
l:y;I=S;>CW=BLTdI:ya%:<;Sa3[T3[H43[TgfW^S;>_`T3[=@;Td>8JT W^>;>_`T3[=@;S1a + UWV9X NZY []\  )V3[Td 6 , 6 Q
8JL:WLe8JP@U:<;?Ta
=BW]Xa%Tg?^S8BCILe8B;?Ty>Le8B;>_!dI:<aMG  =BL N R , N Q R l:1S;>C
XXeXLeCI8B;>CXX;>C4gs^>8BCILe8B;?T>L!8B;>_edI:<a<bBL:<a%R:<_`T3[HB:<5[fBG
+UWV9X N Y [l\  )- , h hL L6 Q N: `N`) ) , .X)70 25) 8 , )HG() & ? vd>:<a%:+RSL=BR:MLT3[:<al8JL:O365656^>aT%Le8JT%:<C3;  3[P@^ILe:OsGvE=
.10 36;?T^>3[T36HB:<5[f:`msRS58B36;KVd?fKTdI:W^>;>_`T3[=@; + UjV4X N Y []\  )
8B;>CW=BL 6 , 6 QTd>8JT d>8BaO8ACS3/:MLe:<;~TOad>8JR: Td>8B;TdI:=BTd>:MLyW^S;>_`T3[=@;>aMb
l:=BSa%:MLHB:Td>8JTyHJ8JLfs3;IPA8ARS8JLe8BU:MT%:ML OY N I J 6 Q
+ UWV9X NkY [l\  ) , h h LL 6 N: N``) ) , )34 N 258 , )HG%$ N & ? d>8BaV81C>3[L:<_`T:`:<_`TV=@;\TdI:OR>L=B8JS36563[Tf h L 6 Q J}N!b
.V0 Vd>:ML:<8BaTdI:1R>Le=BS8JS365636T3[:<a h L 6 JN!b 6 , 6 QJb8JL:
V36TdTdI:y_`=@;>a%T8B;?Ta =@;>56f36;SC>3[L:<_`T5[f8/:<_`T%:<CT%=o:<;>a^IL:Td>8JTTdI:C>36a
T%Le36S^IT3[=@;a^>U1a&T%=A=@;I:BG Q ^I:T%=ATdI:_`=@;>a%T%L!8B36;I:<C
, OY N I J 6 Q OY 6 Q W=BLeU)=BWITd>:<a%:
W^>;>_`T36=@;>aMbU1=BL:M=HB:ML<b?8B56A5 + UWV9X NkY []\  )!b
6 , 6 QbydS8<HB:TdI:a8BU:ad>8JR:BG(vVdI:W^>;>_`T3[=@;
. 0
`xk] 
4 N , Y Y
O N I J 6 O 6 +UWV9X N Y [l\  )]TdI:ML:MW=BL:    l:yCI:MHs38B;~TMG
0 :)35656^>a%T%L!8JT%:TdI:.W^>;>_`T36=@;>8B5WZ=BLeUaC>:MLe3[HB:<C
x
h L 8J=HB:V3[Td.8B;.:`mI8BURS5[:BGvVdI::`m48BU1RS5[:adI=Va
8 , 
N  N
6 : N U=BLe:a%R:<_M3[_M8B565[fTd>8JT8Ba 8\L:<ae^>5[T=BW+TdI:<3[L _`=@;4
a%T%L!8B36;I:<CW=BLeUAb]8B;~fa%:<;>ae3[T3[H43[TgfW^S;>_`T3[=@;_M8B;:


v d>:K_`=@;>a%T8B;?T Q& ? , GHL 36aTdI:KHB:MLT36_M8B5&8Ba%f4URI :<a%T8J5636adI:<CWL=@U$HB:MLfA563U13[T%:<C36;IW=BLeU18JT3[=@;G
T%=BT%:Td>8JT36a+aed>8JL:<CKsf\8B565W^S;>_`T3[=@;>aM/3[Ta+H8B56^>:36a
P
5 _-1 (`& :_`=@;>a36C>:ML8O;>8B3[HB:k]8<fB:<a368B;;I:MT
CI:MT%:ML!U136;I:<CWL=@U `
l=BLV3[TdA8_M568BaaHJ8JLe368JS56:yU=sC>:<565636;IPTdI:aT8JPB:<a
h L N G h L 6 Q : N X bXX bXXk&bXXX bX Y 8B;SCX k =BWV_M8B;>_`:ML =BWTdI:
& ? , G
h L 6 Q : N ZOY N R J 6 Q =s:<a%=BRSd>8JP@^SaMBTdI:lWZ:<8JT^ILe:
HJ8JLe368JS56:<a=BW>TdI:];I:MT
=BL
, )\G
) _M8JR>T^>L:OTdI:OL:<a^S5[Ta]WZLe=@UC>368JP@;I=@aT36_T%:<a%TaMG  =BL+8
O Q P@3[HB:<;wR8JT3[:<;~TMbTdI:8H8B3658JS5[:S;>C>3;IP@a8JLe:a^>U
X;\8BC>CS3[T3[=@;bTdI:W^>;>_`T36=@; + UWV9X N Y [l\ )d>8Ba8dI=BL% U18JL!36a%:<C.36;.Td>:36;IRS^>T36;Sa%T8B;>_`: N@bOP@3[HB:<;)VdS36_ed
3[M=@;?T8B58Ba%f4UR>T%=BT%: 8JT '\, PP ,JGOvdI :W^S;>_`T3[=@;>a TdI:W=@565[=V36;IPOR=@a%T%:ML!3[=BLC>36a%T%L!3[S^IT3[=@;36a_`=@U1RS^IT%:<C
+ UWV9X NkY []\  )]+3[Td 6 , 6 Q b8B565dS8<HB:V' , ,p4GvdI: =HB:MLVTdI:_M58BaaHJ8JLe368JS5[: l
_`=@;>aT8B;~T\a $ N ,/S P =BWTd>:s 58JT%T%:MLW^>;S_`T3[=@;>a]8JL:OCI:`
s P
T%:MLeU36;I:<CWL=@U O N , 
G
      
 y   ~Z| !#" !$!%" '&(!%" !)!%" !+*-,." /10 !%"

258 S. Renooij and L. van der Gaag


:W^ILTd>:MLK_`=@;>a36CI:ML\TdI:W:<8JT^IL:HJ8JLe368JS56: g+
" "JbBU=4CI:<56536;IPlTdI:=BSa%:MLeHB:<CyR>L:<a:<;>_`:=BL8JSa:<;>_`:
=BW5[=s_`=JLe:MP@3[=@;>8B5>5[f4URSdS8JT36_
U1:MT8Ba%T8Ba%:<aMGvVdI:;I:MT |df/dx (x0)|

l=BLA36;>_M56^>C>:<a
Td>:yWZ=@55[=V3;IP18Baa%:<aaeU:<;~TaVWZ=BL+TdI: 3

RS8JLe8BU1:MT%:MLeaW=BLTd>36a]HJ8JLe368J5[:B 2.5
2
1.5
      

         "!
1


!%" ! !%" ! !%" !#" ! "
% !%" * 0.5
0.8
1
!%" & !%" & ! "
% ! "
# ,." !%" * 00 0.6

+x =a^IRSR/=@a:Td>8JT l:8JLe:3;~T%:ML:<aT%:<C36;Td>::MWZ
0.2 0.4 P
0.4 0.2

3[P@^IL:ts vdI:a%:<;Sa3[T3[H43[TgfH8B5^I:]WZ=BL + UWV9X N Y [l\ )8Ba


x0 0.6

W:<_`Taw=BW36;>8B_M_M^ILe8B_M36:<a3;TdI:RS8JLe8BU:MT%:ML ) ,
0.8
1 0

OY g+ " "(, ;>= J T, X Y =@;TdI:oR/=@aT%:MLe3[=BL 8W^>;>_`T3[=@;=BW )O8B;SCO`QbIWZ=BL N R ,NQR@G
R>L=B8JS36563[T36:<a h L  JN+W=BLy=@^ILOR8JT3[:<;~TO+dI=Ad>8Ba
;I=:MHs3CI:<;>_`: =BW5[=4_`=JL:MP@3[=@;>8B5U:MT8BaT8Ba%:<aMGvVdI:<a%: k]8Ba%:<C^IR=@;KTd>:PB:<;I:ML!8B5EW=BLeU(=BW TdI: a%:<;>a3[T36H~
:`:<_`Ta8JLe:o_M8JR>T^ILe:<Csfae3mW^>;S_`T3[=@# $;> aV36TdTdI: 3[TfHJ8B56^I:CI:MLe36HB:<C.8J/=HB:Bb  3[P@^IL:tCI:MR36_`TaTdI:
HB:MLT36_M8B5+8Ba%f4UR>T%=BT%Y: & ? , p A uB G # % ? , GVp A tBtsG a%:<;>ae3[T3[H43[Tgf1HJ8B56^I:=B-W +UjV4X N Y []\  )lP@3[HB:<T; N R ,NQ R W=BL
w3[TdI=@^ITd>8Hs36;>PT%= R:MLW=BLeU8B;~fW^ILeTdI:MLl_`=@U1RS^4 C>3:ML:<;?T_`=@US36;S8JT3[=@;>a=BWHJ8B56^I:<aWZ=B7L )8B;>C O`QG
T8JT3[=@;>a<b>
:;I=);>CTd>8JT :=BSa%:MLeHB:Td>8JTd>3[P@da%:<;>a36T3[Hs36TgfyHJ8B56^I:<a_M8B;=@;>56f
+&('*) ) , )72) p tBt :WZ=@^>;SC3[WTdI:8Baa%:<aaeU:<;~TAW=BLATdI:RS8JLe8BU1:MT%:ML
^>;>C>:ML&a%T^SCIf\36aaU18B55E8B;>CTdI: =BLe36P@36;>8B5ER/=@aT%:MLe3[=BL
8B;>C A R>L=B8JS36563[TTf O Q 3a;I=@;4:`m4T%L:<U:B9 G 8\=BL:WZ=BL!U18B565[fBb
+,+ ) , h L  JN 0 )72p A nBp u tBt l:Kd>8<HB:ATd>8JT    )<  : A3[W+8B;>C=@;>5[f36W



o r
p t
q s u 
v w

W=BLy8B;~f  , X G:1W^ILTdIA :ML&S;>CTd>8JTOW=BLOTdI: )<;dO Q GHO Q Z p BusG  =BD L N RH , N QR@b>dS3[P@da%:<;>a3[T36H~


_`=@URS56:<U:<;~T=BWTdI:RS8JLe8BU:MT%:ML ) 8B565W^>;S_`T3[=@;>a 3[TfH8B56^>:<a_M8B;=@;>5[f:W=@^>;>C3[W]TdI:8Baa:<aaU:<;?T


A

d>8HB:TdI:<3[LHB:MLeT36_M8B58Ba%f4UR>T%=BT%:8JT & , A tBt A.- W=BLTdI:oRS8JLe8BU:MT%:ML36a1568JLePB:MLTd>8B;p A Ju8B;>CTdI:


=BLe3[P@3;>8B5SR=@a%T%:ML!3[=BL
RSL=BS8JS3653[Tgf O Q 36a
;>=@;4:`msT%Le:<U:BG
%'&0/ 1  ,  . .02.Z 3 _E /`_ ? .? x+=BT%:TdS8JTdS3[P@d\a%:<;Sa3[T3[H43[TgfHJ8B56^I:<a&_M8B;Td?^>a&=@;>56f
 =BL8;>8B36HB:wk]8fB:<a368B; ;>:MTgl=BLbo8B;~fa%:<;>a3[T36H~ :W=@^>;>CWZ=BL TdI:RS8JL!8BU:MT%:MLeaTd>8JTCI:<a_`Le36/:TdI:
3[TfW^>;>_`T36=@;.36aA:`mI8B_`T5[f)CI:MT%:MLeU13;I:<C.sfTdI:8Ba ` =B H8B56^>:A=BW8WZ:<8JT^ILe:\HJ8JLe368JS5[:AP@36HB:<;8
a%:<aaeU:<;~TW=BLTd>:RS8JL!8BU:MT%:ML/:<3;IPHJ8JLe3[:<C8B;>C RS 8J LT3_M ^>568JL  _M568BaeaHJ8B56^I:BG.vd>3aR>L=BR:MLTfl8Ba;I=J
TdI:=BLe3[P@36;>8B5HJ8B56^I:=BWTd>:R>L=B8JS36563[Tf=BW36;?T%:ML% T36_`:<C/:MW=BL:W=BLPB:<;I:ML!8B5Sk]8<fB:<a38B;;I:MT
=BLesa& 8B;
:<a%TMG  L=@UTdI:W^>;>_`T3[=@;O;>=8B;?fOa%:<;>a36T3[Hs36Tgf&R>L=BRI CI:ML&NO8B8JP 9V:<;I=?=@3 7`bSJpB?p >@!G
:MLTf1=BW36;?T%:ML:<a%T]_M8B;/:&_`=@URS^IT%:<CGY:Oa%T^SCIfTdI: X;H43[:M=BWV;>8B3[HB:k]8<fB:<a368B;_M58Baa3[>:ML!aMb
:=BI
R>L=BR:MLT36:<a=BWYa%:<;>a3[T36Hs3[TfH8B5^I:Bb?HB:MLeT%:`mR>Le=<mI36U13[Tf a%:MLHB:TdS8JT36;IRS^IT36;>aT8B;>_`:<aVd>36_!d=s_M_M^IL Td>:U=BL:
8B;>CK8BCSU136aa36S5[:+C>:MHs368JT36=@;G WL:<?^>:<;~T5[fP@3[HB:<;8RS8JLT36_M^S568JLY_M58Baa 8JL:8B56a%=O563[B:<56f
1  , .Z .2.Z 342I` 1 *  ,65 2
s  _Y / .x5 . 3 T%=1:OTdI:U1=BL:yR>L=B8JS5[:P@3[HB:<;KTdS8JTV_M568Baa3;ATdI:
 =BL
TdI:H8JL!3[=@^>aR=@aa3[5[:la:<;>a3[T3[H43[TfW^>;>_`T3[=@;>aYW=BL
;I:MT
=BL)^S;>CI:MLe5[f436;IPTdI:_M58Baa3[>:MLG 2436;S_`:d>3[P@d
8;S8B3[HB:+k]8<fB:<ae368B;;I:MTgl=BL/bsTd>:+a%:<;>ae3[T3[H43[Tgf HJ8B56^I:<a a%:<;>ae3[T3[H43[TgfH8B56^>:<aV_M8B;K=@;S5[f:OW=@^>;>CAW=BLVTd>:yRS8
8JL:&L:<8BC>3656f:<aT8JS5636ad>:<CEW=BL N R ,N QR
:&S;>CTd>8JT Le8BU:MT%:ML!aTd>8JTCI:<ae_`Le3[:]TdI:5[:<8Ba%T53[B:<5[fHJ8B56^I:]=BW8
W:<8JT^IL:HJ8JLe368JS5[:Bb~TdI:R=@a%T%:MLe3[=BL R>Le=BS8JS365636Tgfy=BW/TdI:
 7 +  )  ,  G O Q O Q EdO N O Q ,  7 +  )<  _`=BLL:<aR/=@;SC>36;IP_M568Baa&H8B5^I: WZ=BLOa^>_!d8B;36;IRS^IT&36;4
N N
 7 )
 7 ) a%T8B;>_`:O+36565S;>=BTl:&HB:MLfa:<;>a3[T3[HB:&T%=36;>8B_M_M^IL!8B_M3[:<a
 0 ) 0 )   A
  
 =B@ L N R , NQ R b/l:S;SC\8a36U13658JLL:<a^>5[T+?fALe:MRS568B_  36;oTdI:H8JLe36=@^>a&RS8JLe8BU:MT%:MLea<GOX;oW8B_`TMbTd>:HB:MLeT36_M8B5
  
36;IP )?fJ GU)Gx+=BT%:Td>8JTyW=BL8A36;>8JLf\_M58Baa 8Ba%f4UR>T%=BT%:=BW8B;8Baa%=4_M368JT%:<Ca%:<;>a36T3[Hs36Tgf W^S;>_`T3[=@;b
HJ8JLe368JS5[:BbTdI:1a:<;>a3[T3[H43[TfAH8B56^>:<aWZ=BLy3[TaT
=K_M58Baa 8B;>CdI:<;>_`: TdI7: )SH8B56^>:=BW
TdI: W^S;>_`T3[=@;qa&HB:MLT%:`mb
HJ8B56^I:<a8JLe:TdI:ae8BU:BG vd>36aR>Le=BR/:MLeTgfOdI=@56C>aWZ=BL8B;~f V3655:A?^S3[T%:AC>36a%T8B;?T WZL=@UTd>:ARS8JLe8BU:MT%:ML<qa8Ba
S36;S8JLf+=@^>T%RS^ITHJ8JLe368J5[:36;8B;~fyk]8<fB:<ae368B;;I:MTgl=BL/G a%:<aaeU:<;~TMbL:<ae^>5[T36;IPo36;8\Le8JTdI:MVL aS8JTW^>;>_`T3[=@;36;

Evidence and Scenario Sensitivities in Naive Bayesian Classifiers 259


Q " > N Rc, N QR B  B AE " _ 7    E _ b  B `
1 1

"   B " _ ]- B bB " =F>>  M"   )  ? _
[ J @
fc (x )

fc (x )
0.8 0.8 ) > D _1bE  C G ) !_ G ) h >  D  

> "F" ` vVdI:SLea%TR>L=BR:MLTgfWZ=@55[=VaWL=@U TdI:=BI


0.6 fc (x ) 0.6

0.4 0.4
a%:MLeH8JT3[=@;TdS8JT8B55EW^>;>_`T36=@;>a +UjV4X NkY [l\ V3[Td 6 , 6 Q
0.2 0.2
d>8HB:oTd>:\a8BU:dI=BLe3[M=@;?T8B5V8B;SCHB:MLT36_M8B5&8Ba%f4URI
T%=BT%:<aO8B;>CoTdI:ML:MW=BL:CI=;I=BT36;~T%:ML!a%:<_`TMG a8_`=@;4
fc (x )
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
a%:<s^I:<;>_`:Bbs:<3[TdI:ML 6 Q =BL 6  36aTd>:+U=@a%T]563[B:<56fHJ8B56^I:
 3[P@^ILe1 : I mI8BURS56:<a+=BWla%:<;>a36T3[Hs36TgfKW^>;>_`T3[=@;Sa+W=BL =BW : b
L:MP@8JLeC>5[:<aea =BW&Td>:AH8B56^>:=BJW )EGXW&TdI:ATgl=
x0 x x0 x

TdI:R8JLe8BU:MT%:ML ) , OYNQ R J 6 Q!b~+dI:ML:7N R ,N Q R 8B;>C W^>;>_`T3[=@;SJa + UWV9X N Y []\  )8B;>C +   )36;?T%:MLea%:<_`T8JRT ) b


6 Q 36a  "BlTdI:OU=@a%T5636B:<5[f1HJ8B56^I:BG 8JLe368JT36=@;:<3[TdI:ML TdI:<;l:1d>8HB:1Td>8JT )  p :
W=BL N R , N QR18B;>C
_M8B;U18JB: 6 QTdI: U=@a%T563[B:<56fHJ8B56^I:A  %=BLV36565 )  G  W=B3
L N Rc, NQ R Gvd>:Va%:<_`=@;>C8B;>CTd>3[LeC
;I=BT_!d>8B;IPB:TdI:yU=@a%TV53[B:<5[fHJ8B56^I: >  @B !G R>Le=BR/:MLeT3[ :<a : a%T8JT%:<C8J=HB:+;I=W=@565[=36U1U:<C>38JT%:<5[f
WL=@U$TdI:OPB:<;>:MLe8B5W=BLeU1a=BWTdI:OW^>;>_`T3[=@;SaMG
TdI:V>L=@8BCI:MLH436_M36;S3[Tgf=BW!)  GvVdI:U=@a%Tl563[B:<5[f_M568Baa :=BSa:MLHB:]WZL=@UTd>:]8J=HB:R>L=BR=@a3[T3[=@;Td>8JTMb@W=BL
HJ8B56^I:1+36565TdI:<;d>8JLeC>56fo:MHB:ML_edS8B;IPB:^>R/=@;aU18B565 8B;?fKRS8JLe8BU1:MT%:ML<b8B;8BC>U136aae3[S5[:y8BU=@^S;~T&=BWHJ8JLe368
adS3[WZTaY36; 8RS8JLe8BU1:MT%:ML<G vdI:8J=HB:8JLP@^>U1:<;~T Tds^>a T3[=@; C>3[/:ML:<;?TWL=@U =B<L 36aYR/=@aea3[S5[:=@;>5[fOT%=TdI:
_`=BLLe=B/=BL!8JT%:<aoTd>::<UR3[Le36_M8B5656f=BSa%:MLeHB:<CLe=BS^>a%T 5[:MWT=BLYT%=+Td>:Le3[P@d?T=BWITdI:
RS8JLe8BU:MT%:MLqaY8Baea%:<aaU:<;?T
;I:<aea=BWY_M58Baa3[>:ML!a]T%=1RS8JLe8BU:MT%:ML&36;>8B_M_M^IL!8B_M3[:<aMG S^>TE;I:MHB:ML 36;/=BTdC>3[Le:<_`T3[=@;>aMG  L=@UTd>36aE=BSa%:MLHJ8
5 5 .< . 1 5  2.I ./ , X;H43[:M=BW_M568Baea3[S_M8 T3[=@;\
:d>8<HB:Td>8JT^IR=@;H8JLf436;IPTdI:RS8JL!8BU:MT%:ML<b
T3[=@;bsTd>:OR>L=BR:MLTgf=BWY8BCSU136aa36S5[:+C>:MHs368JT36=@;A36a]:<a TdI:]U=@a%T563[B:<56f_M58BaaYH8B56^>:
_M8B; _ed>8B;>PB:l=@;S5[fO=@;>_`:BG
 3[P@^ILe:<a8B;>C ae^IR>R=BLTlTd>36a=BSa:MLH8JT36=@;G
R:<_M368B5656f=BWy36;?T%:ML:<a%TMGvdI:\8BCSU136aa36S5[:ACI:MH4368JT3[=@; 5 _-1 /& :_`=@;>a36CI:ML8JP@8B36;TdI:;>8B3[HB:
W=BL8RS8JLe8BU:MT%:MLP@3[HB:<a1TdI:A8BU=@^>;?T=BW+H8JL!368JT3[=@; `
k]8<fB:<ae368B;_M568Baea3[>:MLO8B;>CTd>:RS8JT3[:<;?Ty36;IW=BLeU18JT3[=@;
Td>8JT13a18B565[=
:<C:MW=BL:K8B;36;Sa%T8B;>_`:K36a_M568Baea3[>:<C WL=@U m48BUR5[:JG 24^>R>R=@a%:Td>8JTl:8JL:=@;>_`:
U=BL:
8Ba:<5[=@;IP@36;IPT%=8C>3:ML:<;?T_M568Baa<GvdI:oWZ=@5656= 36;?T%:ML:<a%T%:<C36;TdI::`:<_`Ta=BWI36;S8B_M_M^ILe8B_M3[:<a36;yTdI: RS8
36;>PR>Le=BR/=@ae3[T3[=@;;I=.P@36HB:<a8 PB:<;I:MLe8B5W=BLeUW=BLV8B565 Le8BU1:MT%:ML ), O g+ " "R,;I= J1U,.X Y ]=@;\TdI:
R=@aa36S5[:CI:MH4368JT3[=@;SaE36; 8;>8B3[HB:k]8<fB:<ae368B;;I:MTgl=BL/G R=@a%T%:MLe36=BLlRSL=BS8JS3653[T3[:<a h L  JNW=BLTdI:RS8JT3[:<;?TMG
vd>:+R>Le=BR/=@ae3[T3[=@;U1=BL:&aR/:<_M36S_M8B565[f1ad>=ValTd>8JT36; :Le:<_M8B565ITd>8JTTd>:H8B5^I:X _`=BLL:<a%R=@;>CSaT%=yTdI:
a^S_ed8;I:MT
=BLebTdI: U=@a%T563[B:<5[fK_M58BaaVHJ8B56^I:_M8B; U=@aT]53[B:<5[f1a%T8JPB:OW=BL]Td>36alRS8JT3[:<;?TMb4V3[Td8R>L=BS8
_!d>8B;IPB: + M" ` "  ^IR=@;HJ8JLfs3;IPy8yRS8JL!8BU:MT%:ML<G S3563[Tgf=BWp A >4JGOvdI:a%:<_`=@;>CU=@a%T563[B:<56faT8JPB:W=BL
^  /`_</S .Z ./ , /'& b < O  ,)U18m z O N J 6 , 6 Q h
M TdI:R8JT3[:<;~Ty3aOa%T8JPB:XX b/+3[TdR>L=BS8J36563[Tf\p A <rsG
6  E  J   "  :  " > ] B  B h L 6  JNM ,PO  h :y;I=)S;>CTd>8JTTdI:yU1=@a%T563[B:<5[f# $ a%T8JPB:_edS8B;IPB:<a
!_
 M
+   
) - , + UjV4X N  Y [l\  ) g*B  h WL=@UX Y T%=XX 8JT ) ,wp A <r 0 # % ? ,wp A >sGlvdI:
8BC>U36aa3[5[:1CI:MH4368JT3[=@;WZ=BL TdI:RS8JLe8BU:MT%:MLTds^>a 36a
 ]
B4 JJ + UWV9X NkY []\  ) Z+   )  " >BZI 6 , 6 Q p ,t > y!G x+=BT%:Td>8JTTdI:U=@a%TO563[B:<5[foa%T8JPB:1_M8B;4
;I=BA T_!: d>8B;IPB:K36;?T%=8B;~f=BTdI:MLHJ8B56^I:A^IR=@;HJ8JLf436;IP
Q " > N R ,N QR B _ V = EM _ b   " X " > B TdI:OR8JLe8BU:MT%:ML+^S;>CI:MLa%T^>C>fBG -
DI > 1<g > )5  m
X;Hs3[:M=BW;S8B3[HB:kl8fB:<a368B;1;>:MTgl=BL_M568Baea3[>:MLea<bl:
"  _  #"  _ 7    E` _ b   " 
d>8HB:]WZL=@UTdI:]8J=HB:]R>L=BR=@a36T3[=@;yTd>8JTl:_M8B;:`ms
O  ;dO Q  )\G ) y R:<_`TT%=S;>C568JLPB:+8BCSU136aa36S5[:CI:MH4368JT3[=@;>aWZ=BLlU136a%T
O  , O`Q p : : p~ 36;Sa%T8B;>_`:<aMG vE=a^>R>R=BLTTd>36a
:`m4R/:<_`T8JT36=@;b4
:_`=@;4
O  : O Q 9  ) G )<   ) ;
a3CI:ML81RS8JL!8BU:MT%:MCL )V3[Td N R , N QR@/a36U13568JL8JLP@^4
:
9  y "B B > ]- M
 U:<;?Ta dI=@56CW=B<L N RH, N Q R GXW O  ; O`Qb@l:]dS8<HB:WL=@U
: TdI:OW^>;>_`T36=@;>8B5/WZ=BLeUaTd>8JTTdI:8Baea%:<aaU:<;?\T )W=BL
] B > ) , O   TdI:lRS8JLe8BU1:MT%:ML3aE^>;S563[B:<5[f&T%=+:lHB:MLfaU8B565bJ8BaEW=BL
0

260 S. Renooij and L. van der Gaag


W^>;>_`T3[=@;>ayV3[Td8B;8Ba%f4UR>T%=BT%:T%=oTdI:5[:MWZT=BWTdI: >  D J @cg*B  hI " > B 6h ]
Bs J Bs
^>;>36TyV36;>C>=TdI:3;~T%:MLea:<_`T3[=@;=BWVTdI:a:<;>a3[T3[H43[Tf h L 6 JN , I ? h LN I J 6

W^>;>_`T3[=@;>a563[:<aT%=TdI:A5[:MWT=BW )GXMW O  : O Q b=@; h L 6 JN   N  I  ? h LN I J 6 0 h L 6 JN 
TdI:y=BTd>:MLVd>8B;>CbIl:y:`m4R:<_`T8B;\8BC>U13aa3[S56:+CI:MH4368
T3[=@;V3[Td>3;TdI:+^S;>3[TV3;>CI=K " >  WZ=BL]HB:MLf1aeU18B565 ] B > N I  O B 1  J  7"  =7I   = 
HJ8B56^I:<a=BW )GXWYTdI:36;>aT8B;>_`:<aTd>8JTV8JLe:U=4CI:<565[:<C > "F" M vd>:
R>Le=BR/:MLeTgfyW=@565[=Va36U1U:<C>38JT%:<5[fO?f8JRI
8BaTdI:U=BL:R>L=BS8J5[:\P@3[HB:<;w8RS8JLT3_M^>568JL_M58Baa RS5[f436;IPAk]8<fB:<a<YL!^>5[:18B;>C:`msRS56=@3[T36;IPATdI:36;>CI:MR:<;4
HJ8B56^I:Bb@=4_M_M^ILTd>:VU=BL:=BWT%:<;3; R>Le8B_`T36_`:BbsTdI:<;W=BL CI:<;>_`:<aTd>8JTVdI=@56CA3;TdI:;>8B3[HB:y;>:MTgl=BLG
TdI:<a%:36;>aT8B;>_`:<al:&:`msR:<_`T]Td>8JT )V3a;I=BT]8aeU18B565
HJ8B56^I:8B;>CTd>8JT O`Q : O  GX;RSLe8B_`T36_`:
:l=@^>56C vdI:K8J=HB:\R>L=BR=@a36T3[=@;;I=(8B565[=+a^>aT%=_`=@U
TdI:ML:MW=BL:O:`m4R:<_`TVU=@a%T36;Sa%T8B;>_`:<a]T%=1L:<ae^>5[T]36;A8BC4 RS^IT%:TdI: ;I:MR/=@aT%:MLe3[=BLVRSL=BS8JS3653[TgfAC>36aT%Le3[S^IT36=@;
U136aea3[S5[:&CI:MH4368JT3[=@;>a=BW9  : y!G =HB:ML&TdI:y_M568BaeaH8JL!368JS5[:OWL=@UTdI:yRSL:MHs36=@^>a]=@;I:BG

S /? 
J@   ` 5 _-1 %'& :OL:<_`=@;Sa36CI:MLlTdI:O;>8B3[HB:Ok]8<fB:<a38B;
;I:MT
=BL8B;>CTdI:RS8JT3[:<;?T36;IW=BLeU18JT3[=@;WL=@UTdI:
 =BL\_M58Baa3[S_M8JT36=@;RSL=BS5[:<U1a<bV3[T36aPB:<;I:MLe8B565[f8Ba R>L:MH43[=@^>ay:`m48BUR5[:<aMGX;8BCSC>3[T3[=@;T%=oTdI:1T%:<a%TaT%=
a^>U1:<C Td>8JT :MHs3CI:<;>_`:36a 8H8B3658JS5[:]WZ=BL:MHB:MLfae36;IP@5[: Vd>3_edTdI:RS8JT3[:<;?Td>8Ba 8B5[Le:<8BCIf:M:<;a^I47%:<_`T%:<Cb
W:<8JT^IL:VHJ8JLe368J5[:BGEX;1RSLe8B_`T36_M8B5/8JR>RS5636_M8JT36=@;>aMb~dI= 8Ka_M8B;=BW]TdI:563[HB:MLy_M8B;:R/:MLeWZ=BLeU1:<CG:;I=
:MHB:ML<bTd>36a8Baa^>U1R>T3[=@;U18f;I=BT/:L:<8B5636a%T3_JGX; 8JL:3;~T%:ML:<aT%:<CK36;ATdI:OR=@a%T%:ML!3[=BLVC>36aT%Le3[S^IT36=@;=HB:ML
TdI:KU:<CS36_M8B5]CI=@U18B36;bWZ=BL:`mI8BURS5[:Bbl8RS8JT3[:<;?T136a TdI:yHJ8JLe3[=@^SaVa%T8JPB:<a&3[WYTdI:L:<ae^>5[T=BWTd>36aT%:<aTVl:ML:
T%=:&_M58Baa3[>:<C36;?T%= =@;I:=BWE8 ;?^SU:MLl=BWEC>36a:<8Ba%:<a R=@a3[T3[HB:BG  =BLTdI:WZ:<8JT^ILe:HJ8JLe368JS56: g+ J > b/TdI:
V3[Td>=@^IT
:<36;>Pa^>47:<_`T%:<CKT%=:MHB:MLfR=@aae3[S5[:&C>368JPJ W=@565[=+36;IP RS8JLe8BU:MT%:ML&8Baa%:<aaeU:<;~Ta+8JL:ya%R:<_M3[>:<C
;I=@a%T3_+T%:<aTMGvdI:Os^I:<a%T3[=@;TdI:<;A8JL!36a%:<adI=)U^>_!d
36UR8B_`T8BC>C>36T3[=@;>8B5:MH436CI:<;>_`:_`=@^>56Cd>8<HB:=@;TdI:   
0         ! 
  $

R>L=B8JS36563[TfC>36a%T%Le36S^IT3[=@;=HB:ML1TdI:_M568BaaH8JL!368JS5[: *
!#" !
!#" &
!%" !
!%" &
!%" !
!%" &
!%" !
!%" &
!#" !
!#" &
,." /
!%"
8B;>CdI=wa:<;>a3[T3[HB:&Td>36al36UR8B_`Tl36a
T%=36;>8B_M_M^IL!8B_M3[:<a  L=@U TdI:T8JS5[:8B;>CTdI:RSL=BS8JS3653[Tgf)C>3a%T%Le3[S^I
36;TdI:1;I:MT
=BLeqaOR8JLe8BU:MT%:MLeaMGvd>:W=BLeU:MLy3aa^I: T3[=@; h L  J7NWL=@U mI8BURS56:JbVl:S;>CTd>8JT
36a1_M56=@a%:<5[fL:<568JT%:<CT%=Td>:\;I=BT3[=@;=BW&HJ8B56^I:A=BWO36;4  + h L g+ J > ,.fB:<a@J1 h L  J N ,)p <JpIG
W=BLeU18JT3[=@;GvVdI:O568JT%T%:ML+36aea^I:36;~HB=@56HB:<a8;I=BT3[=@;A=BW vdI:;I:MR=@a%T%:MLe36=BLR>L=BS8J36560 3[Tf=BW (, X k)A ;I=
a%:<;>ae3[T3[H43[TgfTd>8JTC>3:MLeaWL=@U TdI:+a%T8B;>C>8JLeC;>=BT3[=@; W=@565[=+aC>3[L:<_`T5[fWL=@U
^>a%:<C36;TdI:1R>Le:MHs3[=@^SaOa%:<_`T3[=@;>ay3;TdI:a%:<;>a:Td>8JT h LX k JN , p A >Br ,wp >Bt
3[T]R:MLT8B36;>a];I=BTT%=8<HJ8B36568JS5[::MH436CI:<;>_`:O^IT]T%=1a_`:` p B p <Jp  A
;>8JLe36=@a=BWR=@aae3[S5[f 8BC>C>3[T36=@;>8B5>:MH436CI:<;>_`:BG:L:MW:ML V3[Td>=@^IT+A Le:<?^>36Le36;IP8B;?A fo8BCSC>3[T3[=@;>8B5E_`=@URS^IT8JT36=@;>a
T%=Td>3a;I=BT3[=@;=BWOa%:<;>ae3[T3[H43[Tgf8BQa   > #" <    WL=@U TdI:;I:MT
=BLeGx+=BT%:Td>8JTTd>:;I:MT%:<a%TLe:<a^>5[T
   8B;SC^>a:TdI:T%:MLeU  _  H <     T%=
L:MW:MLVT%=TdI:U=BLe:a%T8B;>CS8JLeCK;I=BT3[=@;G 5[TdI=@^>P@dK3[T l=@^>56CA_edS8B;IPB:TdI:yU1=@a%TV563[B:<56fa%T8JPB:BG -
36a18JRSRS5636_M8JS56:AT%=k]8<fB:<a38B;;I:MT
=BLesa36;PB:<;>:MLe8B5b vdI:8J=HB:R>L=BR=@a36T3[=@;.8B565[=VaAW=BLK:<a%T8JS563ad>36;IP
36;Td>36a]a%:<_`T36=@;
:yC>3a_M^>aala_`:<;>8JLe36=1a%:<;>a3[T36Hs3[Tf136; TdI:y36U1RS8B_`T]=BWE8BC>C>36T3[=@;>8B5/:MHs3CI:<;>_`:=@;TdI:OR=@a%T%:`
TdI:_`=@;?T%:`m4TV=BW;>8B3[HB:k]8<fB:<ae368B;\_M568Baea3[>:MLea<G Le3[=BLRSL=BS8JS3653[TgfCS36a%T%Le3[^IT3[=@;1=HB:MLVTdI:_M568Baa]HJ8JLe3
vE=a%T^SCIfTdI:y:`:<_`T+=BW8BC>CS3[T3[=@;>8B5:MH436CI:<;S_`:y=@; 8JS5[:BGvE=\_M8JR>T^ILe:TdI:1a%:<;>a36T3[Hs36Tgf\=BWlTd>36aO3URS8B_`T
8B;K=@^>T%RS^ITR>Le=BS8JS365636TgfW=BLVTd>:_M568BaaVH8JL!368JS5[:BbIl: T%=&RS8JLe8BU:MT%:MLYH8JL!368JT3[=@;b
:]_`=@;>a3CI:MLTdI:
W^S;>_`T3[=@;
_`=@;>a3CI:MLTdI:&Le8JT3[= h L 6 JNY T%= h L 6 JN+dI:ML:  )VTd>8JTyCI:<a_`Le36/:<aVTdI:8J=HB:R>L=B8JS36563[TfLe8JT3[=
N  8B;S C N  CI:<;I=BT%:]TdI:l:MHs3CI:<;>_`:RSLe3[=BLYT%=&8B;>C8JWT%:ML 8BaV8 W^>;>_`T36=@;=BW81RS8JLe8BU:MT%:ML )Q, OY NQ R J 6 Q6!
L:<_`:<3[H436;IPTd>:y;I:M):MH436CI:<;>_`:OL:<aR/:<_`T36HB:<5[fBG ,"! hh LL 66 JJNN Y $ #  ) , hh LL 66 JJNN Y `` ))
 )
^  /`_M/ .Z ./ , %'&Db M =    _ =  E  <M X" -  
 > Y   >  E`   ]- B
=  =  =  !_ XWETdI:OR8JLe8BU:MT%:MCL )oR:MLT8B36;>alT%=8HJ8JLe368J5[:O36;TdI:
= UG =  , z = ? = h Z7ZfC b < N a%:MT = UG = blTdI:<;TdI:KCI:<;>=@U136;>8JT%=BL3a8_`=@;4
  _ N  E "     g        H"     _ =  h a%T8B;?T
+3[TdL:<a%R:<_`TT%V= )G vd>:W^S;>_`T3[=@;  ) TdI:<;
: A A A:
=

Evidence and Scenario Sensitivities in Naive Bayesian Classifiers 261


7^>a%T36a8% !J _ HB:MLea36=@;=BWTd>:a%:<;Sa3[T3[H43[TgfW^>;>_ Le8BU1:MT%:MLV36;>8B_M_M^IL!8B_M3[:<a36;H43[:M.=BWYae_`:<;>8JLe3[=@a]=BW8BC4
T3[=@R; + UWV9X NkY [  \ )!G NO3[HB:<;OTdI:RSL=BS8JS3653[TgfC>3a%T%Le3[S^I C>36T3[=@;>8B5:MH436CI:<;>_`:BG:ad>=l:<CTd>8JToWZ=BL;>8B3[HB:
T3[=@; h L : J`N+=HB:MLTdI:1_M58BaaH8JL!368JS5[:Bbl:1_M8B; k]8<fB:<ae368B;);I:MT
=BL4aKa_`:<;S8JLe3[=a%:<;>ae3[T3[H43[Tgf_M8B;.:
TdI:MLe:MWZ=BL:36U1U1:<C>368JT%:<5[f1CI:MT%:ML!U136;I:&TdI:a%:<;>a3[T36Hs3[Tf L:<8BCS365[fw:`msRSL:<aa%:<C36;T%:MLeU1ao=BW1TdI:U=BL:a%T8B;4
=BW&TdI:\36U1RS8B_`T=BWTdI:\8BC>C>36T3[=@;>8B5]:MH436CI:<;>_`:AWL=@U C>8JL!Ca%:<;Sa3[T3[H43[TgfW^>;>_`T36=@;>aMGX;TdI:A;I:<8JL W^IT^IL:Bb
TdI:1a:<;>a3[T3[H43[Tf\W^>;>_`T3[=@; + UWV9X NkY [  \  )!GxV=BT%:Td>8JT l:ol=@^>56C563[B:T%=a%T^>CIfae_`:<;>8JLe3[=a%:<;>a36T3[Hs36Tgf36;
36;A8;S8B3[HB:yk]8<fB:<a368B;;I:MT
=BLTdI:y58JT%T%:MLVW^>;>_`T3[=@; k]8<fB:<ae368B;\;I:MT
=BLesa+36;PB:<;I:MLe8B5G
36a4;I=V;OW=BL:<8B_edRS8JLe8BU:MT%:M^L )=@;>_`:Td>:R=@a%T%:MLe3[=BL
R>Le=BS8JS365636TgfC>3a%T%Le3[S^>T3[=@; h L : JN  ]3a8<HJ8B36568JS5[:BG 

>

I
` 5 _-1 & :+L:<_`=@;>a3CI:MLTdI:VR>L:MH43[=@^>a :`mI8BU


!#"$&')% (*(*')+ -,.0/1
23
#45.$6
87:99<;<

2 ' ,=*><@?-,?$A,B.$C*D (*'  'FE G? ' C-,!, ' @HI (J K

RS56:BG+:;I=8JLe: 3;~T%:ML:<aT%:<C3;\TdI::`:<_`Ta=BW
36;4 L*MNMNMPO$Q*RTSVU)RXW&Y6Z\[TSU#[TS^]_GU`Yba&c8UedgfhRTSdiR-Skj^lm_n*a)Q`o
8B_M_M^IL!8B_M3[:<a 36;TdI:RS8JLe8BU:MT%:M7L )[, OY g+  B > , Spa&Y6Z\W&Ueqr;<smt7Gquvt<q-w

fB:<a a JX kV=@;TdI:Le8JT3[= h LX k J NYT%= 41


i Ix,y-,.z/
m{5 ( H|CDex '
mq}}w
g~ ' ,V,VhX"$
Lh X k J<N  !Gw:L:<_M8B565TdS8JTTdI:\;>:M R=@a%T%:` E  GY? ' lN[-,1SK&, a)' Q@a&HISp We(aJ [-D&SC*Sk We')a)( Q`KYR-
TZbS,^Y6_QZ[:SW*ae5aejTQ`ZY6S<Z ITW)U Z\RT[@L&YVSa o
Le36=BLVR>L=B8JS36563[Tf=BW 5, X k)l=@^>56C\:p A >BtsTdI:
R>Le=BS8JS365636Tgf&P@3[HB:<;7^>a%TETdI:
8H8B3568JS5[:
:MHs36C>:<;>_`:]8Ba Yba)Z=Xa)SkWea) ( r-,^!"$v-,V,gV- ' #7K}<;#u7X7:$

p A BJG: ;I=:<a%T8J5636ad\TdI:a%:<;Sa3[T3[H43[TgfW^>;>_`T3[=@; 8

4
 I"' % -,.N
5
>T-,. 'K( X-
kq-}}rq$
V ( X$
+ UWV9X  Y [  \  )
,  #  ? 8B;>CS;>CTdS8JT 'KH(  * ' g)
A5 S' ,SkR-=*U><[@v@?fh-,R-Y?$a&cvgRTY6-Z\ W`UE GRT? Sk' j5,!Q`Y6 Z I' W) Z\' RTk, L&' So
Yba)(*Z=J Xa)SkWea)wsiwXq-w1uwXV

 ), p B 0 )72p >


)

 Le=@U m48BU1RS5[:&tl:&dS8BCTd>8JT]TdI:&R>L=B8JS36563[Tf=BW
A A m
{5,Vri-,.
ri +K+ -,
$7:99r;$
5,8*x ' X$v-
TdI:y_M58BaaHJ8B56^I:OX k.36;S_`L:<8Ba%:<CAWL=@Up A BT%=p A >Bt @?#-$x ' V 'IE G? ' , D)X ')( "V,. 'K(A+K')( -, '
^IR=@;8+R=@a36T3[HB:
5636HB:MLYa_M8B;b@TdI:ML:Msfy:<_`=@U136;IPu A X*K
fhRXW*$ZSkaAa*R-Q`SZS<-q9VsF7K}Xw!u7:w}V

T36U1:<a8BaE53[B:<5[fBG:]_M8B;;I=36;8BC>C>3[T3[=@;O_`=@;S_M56^>CI: 1
 ( "V,.
5
>T-,. ')( 
#q}}X}V
5h J
Td>8JT
3[WTd>:RS8JLe8BU:MT%:ML )3a H8JL!3[:<Cb@TdI:+_M568Baa HJ8B56^I: ,V ' ,=*><@?h-,-?<C!D&V"V*T*X,-? '& D& ' ,r:

X k_M8B;:<_`=@U:8JTyU=@a%T > A AT36U:<a8BaO5636B:<5[fK8Ba b,Q[:Weaea*j-ZSr-U[Ya X YlN[-SK&a)Qa&SpWea[TSASo


Wea)Q`YR-ZSY6_ZSh5Q`YZ W&Z\R-pL&SYba&=Z=Xa)SkWea)p ( X-,!-"V
V36TdI=@^IT]TdI:8BC>CS3[T3[=@;>8B5/:MH436CI:<;>_`:BG - v,V,$ ' |w7G; uwXqV

 "
O g Jg 
E
FA J' ?
#7K99r$
|2 ' ,><=@?,-?$C^ ( ( -
=@?* '  ' ,r*, E G? ' C-,, ' @H (*J )
LMMNM
X;Td>3a1RS8JR:ML<b]l:o^>a%:<CT%:<_ed>;S36?^>:<aWL=@U a:<;>a3 OVQR-SVU)RW)YZ\[-SVU8[-S]k_GU`Ya)c1UedNfhR-SVdRTSkjlm_nea&Q`Spa&Y6Z\W&Ue
T3[H43[Tf8B;>8B5[f4a36a T%=aT^>CIfTd>:K:`/:<_`Ta=BW+R8JLe8BU:` qX$si9}7u9X}9V

T%:ML3;>8B_M_M^ILe8B_M3[:<ay=@;8A;>8B3[HB:kl8fB:<a368B;;I:MTgl=BL/qa m
T"-
G ' V,.11
:"
q}}XV
:/r"-,r*=eT> ' ".$?
R=@a%T%:MLe36=BL+R>Le=BS8JS365636TgfKC>36aT%Le3[S^IT36=@;>aMGV:1aedI=l:<C *x '' ' D&N-p*,V!.V-*#,vD)X= ')( )
Tb,LMMNM
Td>8JTATd>:36;>CI:MR:<;>CI:<;S_`:8Baae^>UR>T3[=@;Sa=BWa^S_ed.8 FQ[KWeaeaejTZS<TU[YVa-YL`SYa)Q`SkR-YZ\[-SkR-8lN[TS:&a&Q*a&SpW*a
[-SlN[-cp$Ya)QBR-SkjL`S:&[TQ`cvRTY6Z\[TSOaeWe<Sp[T[*-_T3- ' 
;I:MT
=BLe _`=@;Sa%T%Le8B36; TdI:VW^>;S_`T3[=@;>8B5sWZ=BL!U =BWTdI:8Ba q!uwXwV

a%=4_M368JT%:<C.a%:<;Sa3[T3[H43[TgfW^>;S_`T3[=@;>aMTd>:<a%:W^S;>_`T3[=@;>a `
<~|Cx
$q-}X}V7X
/, ' V ( CD)-k@*".$?-3*x ' ,-> ' E G? ' 
8JL:OCI:MT%:ML!U136;I:<Ca%=@5[:<56f1?fTdI:O8Baa%:<aaeU:<;~TW=BL]TdI: D)X ')(
Kb,L&klgL[TQTUV[e[-SBMc kZQ`Z\WeRTfa)Y<o
RS8JL!8BU:MT%:ML^>;>C>:MLya%T^>CIf8B;>CTdI:1=BL!3[P@36;>8B5YR=@a%T%:` [:jGU!ZS5Q`YZ W&Z\R-L`SYa)Z=ra&SpW*a)V ' Ft75utXV

Le36=BLR>L=BS8JS3563[Tgf1CS36a%T%Le3[^IT3[=@;=HB:MLVTd>:y_M568BaaHJ8JLe3
5
>G,. ')( -,.2p
~ ' ,V< &
kq}}7
V/,-?<,V
8JS56:BG vVdI:a%:<;>a3[T36Hs3[TfKR>L=BR:MLT36:<aWZ=@5656=V36;>PWL=@U  ' ,><=@?.VTe ( X ( VC@*D, ' @H (*J K
|b,
TdI:W^>;>_`T3[=@;SaR>Le=H436CI:<Ca%=@U:OW^>;>C>8BU:<;?T8B5/8JLP@^4 FQ[KWeaeaejTZS<TU[YVa $ YlN[TS:&a&Qa)SkWea[TSASpWea&Q`o
YbRTZSY_ZS5Q`Y6Z IW)Z\RTvL`SYa)Z=ra&SpW*a) ( r-,!-"V
U:<;?TaVd>36_!d_`=BLLe=B/=BL!8JT%:<CTdI::<URS36Le36_M8B565[f=BI v,V,$ ' w}u^-wr;$

a%:MLeHB:<CL=BS^>a%T;>:<aal=BWY_M568Baea3[>:MLealT%=RS8JL!8BU:MT%:MLV36;4
8B_M_M^IL!8B_M3[:<aMG 8K=BL:L:<a%:<8JLe_!do3aL:<s^>3[L:<Cb>d>=l:MHB:ML<b
5
>T-,. ')( --,.2p
|~ ' ,V<X &
|q-}X}V
|5,x '
 ' ,><=@?I ( VC@*D8, ' @HI (J 5 (' C-=@?
T%=W^ILTd>:MLKa^ISa%T8B;?T368JT%:Td>:<a%:8JLP@^>U1:<;~TaMG(: Dex ( D` ')( CD)K
fh[KjXa&Q`S0L`S:&[TQ`cvRTY6Z\[TSQ[:Wea`UeU`ZSrr
W^ILTdI:ML36;?T%L=4C>^>_`:<CKTd>: ;>=HB:<5 ;I=BT3[=@;o=BW
ae_`:<;>8JLe3[= Q[-cOpVae[TQ`_Y[BIpZ\WeRTY6Z\[TSUek NC ' >< 'K( < ' FwXX
a%:<;Sa3[T3[H43[TgfBbVd>36_!dCI:<a_`Le3[:<a\TdI::`/:<_`Ta=BWRS8 utr}X$

Abstraction and Refinement for Solving Continuous Markov
Decision Processes
Alberto Reyes1 and Pablo Ibarguengoytia
Inst. de Inv. Electricas
Av. Reforma 113, Palmira, Cuernavaca, Mor., Mexico
{areyes,pibar}@iie.org.mx

L. Enrique Sucar and Eduardo Morales


INAOE
Luis Enrique Erro 1, Sta. Ma. Tonantzintla, Pue., Mexico
{esucar,emorales}@inaoep.mx

Abstract
We propose a novel approach for solving continuous and hybrid Markov Decision Processes
(MDPs) based on two phases. In the first phase, an initial approximate solution is obtained
by partitioning the state space based on the reward function, and solving the resulting
discrete MDP. In the second phase, the initial abstraction is refined and improved. States
with high variance in their value with respect to neighboring states are partitioned, and
the MDP is solved locally to improve the policy. In our approach, the reward function and
transition model are learned from a random exploration of the environment, and can work
with both, pure continuous spaces; or hybrid, with continuous and discrete variables. We
demonstrate empirically the method in several simulated robot navigation problems, with
different sizes and complexities. Our results show an approximate optimal solution with
an important reduction in state size and solution time compared to a fine discretization
of the space.

1 Ph.D. Student at ITESM Campus Cuernavaca, Av. Reforma 182-A, Lomas, Cuernavaca, Mor., Mexico

1 Introduction

Markov Decision Processes (MDPs) have developed as a standard method for decision-theoretic planning. Traditional MDP solution techniques have the drawback that they require an explicit state representation, limiting their applicability to real-world problems. Factored representations (Boutilier et al., 1999) address this drawback via compactly specifying the state-space in factored form by using dynamic Bayesian networks or decision diagrams. Such Factored MDPs can be used to represent in a more compact way exponentially large state spaces. The algorithms for planning using MDPs, however, still run in time polynomial in the size of the state space or exponential in the number of state-variables, and do not apply to continuous domains.

Many stochastic processes are more naturally defined using continuous state variables, which are solved as Continuous Markov Decision Processes (CMDPs) or general state-space MDPs by analogy with general state-space Markov chains. In this approach the optimal value function satisfies the Bellman fixed point equation:

V(s) = max_a [ R(s, a) + γ ∫_{s'} p(s'|s, a) V(s') ds' ],

where s represents the state, a a finite set of actions, R the immediate reward, V the expected value, p the transition function, and γ a discount factor.

The problem with CMDPs is that if the continuous space is discretized to find a solution, the discretization causes yet another level of exponential blow up. This curse of dimensionality has limited the use of the MDP framework, and overcoming it has become a relevant topic of research. Existing solutions attempt to replace the value function or the optimal policy with a finite approximation. Two recent methods to solve CMDPs are known as grid-based MDP discretizations and parametric approximations. The idea behind the grid-based MDPs is to discretize the state-space in a set of grid points and approximate value functions over such points. Unfortunately, classic grid algorithms scale up exponentially with the number of state variables (Bonet and Pearl, 2002). An alternative way is to approximate the optimal value function V(x) with an appropriate parametric function model (Bertsekas and Tsitsiklis, 1996). The parameters of the model are fitted iteratively by applying one-step Bellman backups to a finite set of state points arranged on a fixed grid or obtained through Monte Carlo sampling. A least squares criterion is used to fit the parameters of the model. In addition to parallel updates and optimizations, on-line update schemes based on gradient descent, e.g., (Bertsekas and Tsitsiklis, 1996), can be used to optimize the parameters. The disadvantages of these methods are their instability and possible divergence (Bertsekas, 1995).

Several authors, e.g., (Dean and Givan, 1997), use the notions of abstraction and aggregation to group states that are similar with respect to certain problem characteristics to further reduce the complexity of the representation or the solution. (Feng et al., 2004) proposes a state aggregation approach for exploiting structure in MDPs with continuous variables, where the state space is dynamically partitioned into regions in which the value function is the same throughout each region. The technique comes from POMDPs to represent and reason about linear surfaces effectively. (Hauskrecht and Kveton, 2003) show that approximate linear programming is able to solve not only MDPs with finite states, but can also be successfully applied to factored continuous MDPs. Similarly, (Guestrin et al., 2004) presents a framework that also exploits structure to model and solve factored MDPs, although he extends the technique to be applied to both discrete and continuous problems in a collaborative setting.

Our approach is related to these works; however, it differs on several aspects. First, it is based on qualitative models (Kuipers, 1986), which are particularly useful for domains with continuous state variables. It also differs in the way the abstraction is built. We use training data to learn a decision tree for the reward function, from which we deduce an abstraction called qualitative states. This initial abstraction is refined and improved via a local iterative process. States with high variance in their value with respect to neighboring states are partitioned, and the MDP is solved locally to improve the policy. At each stage in the refinement process, only one state is partitioned, and the process finishes when any potential partition does not change the policy. In our approach, the reward function and transition model are learned from a random exploration of the environment, and can work with both pure continuous spaces, or hybrid, with continuous and discrete variables.

We have tested our method in simulated robot navigation problems, in which the state space is continuous, with several scenarios with different sizes and complexities. We compare the solution of the models obtained with the initial abstraction and the final refinement, with the solution of a discrete MDP obtained with a fine discretization of the state space, in terms of differences in policy, value and complexity. Our results show an approximate optimal solution with an important reduction in state size and solution time compared to a fine discretization of the space.

The rest of the paper is organized as follows. The next section gives a brief introduction to MDPs. Section 3 develops the abstraction process and Section 4 the refinement stage. In Section 5 the empirical evaluation is described. We conclude with a summary and directions for future work.
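To make the discretization route discussed above concrete, the following minimal Python sketch (ours, not part of the original paper) runs value iteration over a uniform grid laid on a continuous state space. The reward and transition callables, the grid resolution k and the dimension d are hypothetical placeholders; the point is only that the k**d grid cells make every Bellman backup, and hence the whole solution, grow exponentially with the number of continuous dimensions.

import itertools

def value_iteration_on_grid(reward, transition, actions, k=10, d=2,
                            gamma=0.9, sweeps=100):
    # reward(s, a) -> float and transition(s, a) -> {next_state: probability}
    # are assumed to be supplied; a state s is a tuple of d grid indices.
    states = list(itertools.product(range(k), repeat=d))  # k**d cells: exponential in d
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # One synchronous Bellman backup per sweep:
        # V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        V = {s: max(reward(s, a)
                    + gamma * sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in actions)
             for s in states}
    return V

With k = 10 cells per axis, a 2-dimensional problem has 100 grid states, while a 6-dimensional one already has a million; this is the blow-up that the qualitative abstraction developed below is meant to sidestep.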



2 Markov Decision Processes

A Markov Decision Process (MDP) (Puterman, 1994) models a sequential decision problem, in which a system evolves in time and is controlled by an agent. The system dynamics is governed by a probabilistic transition function that maps states and actions to states. At each time, an agent receives a reward that depends on the current state and the applied action. Thus, the main problem is to find a control strategy or policy that maximizes the expected reward over time.

Formally, an MDP is a tuple M = <S, A, Φ, R>, where S is a finite set of states {s1, ..., sn}, and A is a finite set of actions for all states. Φ : A × S × S → [0, 1] is the state transition function specified as a probability distribution; the probability of reaching state s' by performing action a in state s is written as Φ(a, s, s'). R : S × A → ℜ is the reward function: R(s, a) is the reward that the agent receives if it takes action a in state s. A policy for an MDP is a mapping π : S → A that selects an action for each state. A solution to an MDP is a policy that maximizes its expected value. For the discounted infinite-horizon case with any given discount factor γ ∈ [0, 1), there is a policy π* that is optimal regardless of the starting state and satisfies the Bellman equation (Bellman, 1957):

V(s) = max_a { R(s, a) + γ Σ_{s'∈S} Φ(a, s, s') V(s') }

Two popular methods for solving this equation and finding an optimal policy for an MDP are: (a) value iteration and (b) policy iteration (Puterman, 1994).

2.1 Factored MDPs

A common problem with the MDP formalism is that the state space grows exponentially with the number of domain variables, and its inference methods grow in the number of actions. Thus, in large problems, MDPs become impractical and inefficient. Factored representations avoid enumerating the problem state space, producing a more concise representation that makes solving more complex problems tractable.

In a factored MDP, the set of states is described via a set of random variables X = {X1, ..., Xn}, where each Xi takes on values in some finite domain Dom(Xi). A state x defines a value xi ∈ Dom(Xi) for each variable Xi. Thus, the set of states S = ×_i Dom(Xi) is exponentially large, making it impractical to represent the transition model explicitly as matrices. Fortunately, the framework of dynamic Bayesian networks (DBN) (Dean and Kanazawa, 1989) gives us the tools to describe the transition model function concisely. In these representations, the post-action nodes (at the time t + 1) contain matrices with the probabilities of their values given their parents' values under the effects of an action.

3 Qualitative MDPs

3.1 Qualitative states

We define a qualitative state space Q as a set of states q1, q2, ..., qn that have different utility properties. These properties map the state space into a set of partitions, such that each partition corresponds to a group of continuous states with a similar reward value. In a qualitative MDP, a state partition qi is a region bounded by a set of constraints over the continuous dimensions in the state space. The relational operators used in this approach are < and ≥. For example, assuming that the immediate reward is a function of the linear position in a robot navigation domain, a qualitative state could be a region in an x0-x1 coordinate system bounded by the constraints x0 ≥ val(x0) and x1 ≥ val(x1), expressing that the current x0 coordinate is limited by the interval [val(x0), ∞], and the x1 coordinate by the interval [val(x1), ∞]. It is evident that a qualitative state can cover a large number of states (if we consider a fine discretization) with similar properties.

Similarly to the reward function in a factored MDP, the state space Q is represented by a decision tree (Q-tree). In our approach, the decision tree is automatically induced from data. Each leaf in the induced decision tree is labeled with a new qualitative state. Even for leaves with the

same reward value, we assign a different qualitative state value. This generates more states but at the same time creates more guidance that helps produce more adequate policies. States with similar reward are partitioned so each q-state is a continuous region. Figure 1 shows this tree transformation in a two dimensional domain.

Figure 1: Transformation of the reward decision tree into a Q-tree. Nodes in the tree represent continuous variables and edges evaluate whether this variable is less or greater than a particular bound.

Each branch in the Q-tree denotes a set of constraints for each partition qi. Figure 2 illustrates the constraints associated to the example presented above, and its representation in a 2-dimensional space.

Figure 2: In a Q-tree, branches are constraints and leaves are qualitative states. A graphical representation of the tree is also shown. Note that when an upper or lower variable bound is infinite, it must be understood as the upper or lower variable bound in the domain.

3.2 Qualitative MDP Model Specification

We can define a qualitative MDP as a factored MDP with a set of hybrid qualitative-discrete factors. The qualitative state space Q is an additional factor that concentrates all the continuous variables. The idea is to substitute all these variables by this abstraction to reduce the dimensionality of the state space. Initially, only the continuous variables involved in the reward function are considered, but, as described in Section 4, other continuous variables can be incorporated in the refinement stage. Thus, a Qualitative MDP state is described in a factored form as X = {X1, ..., Xn, Q}, where X1, ..., Xn are the discrete factors, and Q is a factor that represents the relevant continuous dimensions.

3.3 Learning Qualitative MDPs

The Qualitative MDP model is learned from data based on a random exploration of the environment with a 10% Gaussian noise introduced on the actions' outcomes. We assume that the agent can explore the state space, and for each state-action can receive some immediate reward. Based on this random exploration, an initial partition, Q0, of the continuous dimensions is obtained, and the reward function and transition functions are induced.

Given a set of state transitions represented as a set of random variables, Oj = {Xt, A, Xt+1}, for j = 1, 2, ..., M, for each state and action A executed by an agent, and a reward (or cost) Rj associated to each transition, we learn a qualitative factored MDP model:

1. From a set of examples {O, R} obtain a decision tree, RDT, that predicts the reward function R in terms of continuous and discrete state variables, X1, ..., Xk, Q. We used J48, a Java re-implementation of C4.5 (Quinlan, 1993) in Weka, to induce a pruned decision tree.

2. Obtain from the decision tree, RDT, the set of constraints for the continuous variables relevant to determine the qualitative states (q-states) in the form of a Q-tree. In terms of the domain variables, we obtain a new variable Q representing the reward-based qualitative state space whose values are the q-states.


3. Qualify data from the original sample in such a way that the new set of attributes are the Q variables, the remaining discrete state variables not included in the decision tree, and the action A. This transformed data set can be called the qualified data set.

4. Format the qualified data set in such a way that the attributes follow a temporal causal ordering. For example variable Qt must be set before Qt+1, X1t before X1t+1, and so on. The whole set of attributes should be the variable Q in time t, the remaining system variables in time t, the variable Q in time t+1, the remaining system variables in time t+1, and the action A.

5. Prepare data for the induction of a 2-stage dynamic Bayesian net. According to the action space dimension, split the qualified data set into |A| sets of samples for each action.

6. Induce the transition model for each action, Aj, using the K2 algorithm (Cooper and Herskovits, 1992).

This initial model represents a high-level abstraction of the continuous state space and can be solved using a standard solution technique, such as value iteration, to obtain the optimal policy. This approach has been successfully applied in some domains. However, in some cases, our abstraction can miss some relevant details of the domain and consequently produce suboptimal policies. We improve this initial partition through a refinement stage described in the next section.

4 Qualitative State Refinement

We have designed a value-based algorithm that recursively selects and partitions abstract states with high utility variance. Given an initial partition and a solution for the qualitative MDP, the algorithm proceeds as follows:

While there is an unmarked partition greater than the minimum size:

1. Select an unmarked partition (state) with the highest variance in its utility value with respect to its neighbours

2. Select the dimension (state variable) with the highest difference in utility value with its contiguous states

3. Bisect the dimension (divide the state in two equal-size parts)

4. Solve the new MDP

5. If the new MDP has the same policy as before, mark the original state before the partition and return to the previous MDP, otherwise, accept the refinement and continue.

The minimum size of a state is defined by the user and is domain dependent. For example, in the robot navigation domain, the minimum size depends on the smallest goal (area with certain reward) or on the robot step size. A graphical representation of this process is shown in Figure 3. Figure 3 (a) shows an initial partition for two qualitative variables where each qualitative state is a set of ground states with similar reward. Figure 3 (b) shows the refined two-dimension state space after applying the splitting state algorithm.

Figure 3: Qualitative refinement for a two-dimension state space. a) initial partition by reward. b) refined state space

5 Experimental Results

We tested our approach in a robot navigation domain using a simulated environment. In this setting goals are represented as light-colored square regions with positive immediate reward, while non-desirable regions are represented by

Figure 4: Abstraction and refinement process. Upper left: reward regions. Upper right: explo-
ration process. Lower left: initial qualitative states and their corresponding policies, where u=up,
d=down, r=right, and l=left. Lower right: refined partition.

dark-colored squares with negative reward. The remaining regions in the navigation area receive 0 reward (white). Experimentally, we express the size of a rewarded (non zero reward) region as a function of the navigation area. Rewarded regions are multivalued and can be distributed randomly over the navigation area. The number of rewarded squares is also variable. Since obstacles are not considered robot states, they are not included.

The robot sensor system included the x-y position, angular orientation, and navigation bounds detection. In a set of experiments the possible actions are discrete orthogonal movements to the right, left, up, and down. Figure 4 upper left shows an example of a navigation problem with 26 rewarded regions. The reward function can have six possible values. In this example, goals are represented as different-scale light colors. Similarly, negative rewards are represented with different-scale dark colors. The planning problem is to automatically obtain an optimal policy for the robot to achieve its goals avoiding negative rewarded regions.

The qualitative representation and refinement were tested with several problems of different sizes and complexities, and compared to a fine discretization of the environment in terms of precision and complexity. The precision is evaluated by comparing the policies and values per state. The policy precision is obtained by comparing the policies generated with respect to the policy obtained from a fine discretization. That is, we count the number of fine cells in which the policies are the same:

PP = (NEC / NTC) × 100,

where PP is the policy precision in percentage, NEC is the number of fine cells with the same policy, and NTC is the total number of fine cells. This measure is pessimistic because in some states it is possible for more than one action to have the same or similar value, and in this measure only one is considered correct.

The utility error is calculated as follows. The utility values of all the states in each representation are first normalized. The sum of the absolute differences of the utility values of the corre-



sponding states is evaluated and averaged over all the differences.

Table 1: Description of problems and comparison between a normal discretization and our qualitative discretization.

id | reward cells | reward size (% dim) | reward values | samples | Discrete: states | learning (ms) | iterations | inference (ms) | Qualitative: states | learning (ms) | iterations | inference (ms)
1  |  2 | 20  | 3 | 40,000 |  25 | 7,671 | 120 |     20 |  8 | 2,634 | 120 |  20
2  |  4 | 20  | 5 | 40,000 |  25 | 1,763 | 123 |     20 | 13 | 2,423 | 122 |  20
3  | 10 | 10  | 3 | 40,000 | 100 | 4,026 | 120 |     80 | 26 | 2,503 | 120 |  20
4  |  6 | 5   | 3 | 40,000 | 400 | 5,418 | 120 |  1,602 | 24 | 4,527 | 120 |  40
5  | 10 | 5   | 5 | 28,868 | 400 | 3,595 | 128 |  2,774 | 29 | 2,203 | 127 |  60
6  | 12 | 5   | 4 | 29,250 | 400 | 7,351 | 124 |  7,921 | 46 | 2,163 | 124 |  30
7  | 14 | 3.3 | 9 | 50,000 | 900 | 9,223 | 117 | 16,784 | 60 | 4,296 | 117 | 241

Figure 4 shows the abstraction and refinement process for the motion planning problem presented above. A color inversion and gray scale format is used for clarification. The upper left figure shows the rewarded regions. The upper right figure illustrates the exploration process. The lower left figure shows the initial qualitative states and their corresponding policies. The lower right figure shows the refined partition.

Table 1 presents a comparison between the behavior of seven problems solved with a simple discretization approach and our qualitative approach. Problems are identified with a number as shown in the first column. The first five columns describe the characteristics of each problem. For example, problem 1 (first row) has 2 reward cells with values different from zero that occupy 20% of the number of cells, the different number of reward values is 3 (e.g., -10, 0 and 10) and we generated 40,000 samples to build the MDP model.

Table 2 presents a comparison between the qualitative and the refined representation. The first three columns describe the characteristics of the qualitative model and the following describe the characteristics with our refinement process. They are compared in terms of utility error in %, the policy precision also in %, and the time spent in the refinement in minutes.

As can be seen from Table 1, there is a significant reduction in the complexity of the problems using our abstraction approach. This can be clearly appreciated from the number of states and processing time required to solve the problems. This is important since in complex domains, where it can be difficult to define an adequate abstraction or solve the resulting MDP problem, one option is to create abstractions and hope for suboptimal policies. To evaluate the quality of the results, Table 2 shows that the proposed abstraction produces on average only 9.17% error in the utility value when compared against the values obtained from the discretized problem. Finally, since an initial refinement can miss some relevant aspects of the domain, a refinement process may be necessary to obtain more adequate policies. Our refinement process is able to maintain or improve the utility values in all cases. The percentage of policy precision with respect to the initial qualitative model can sometimes decrease with the refinement process. This is due to our pessimistic measure.

Table 2: Comparative results between the initial abstraction and the proposed refinement process.

id | Qualitative: util. error (%) | policy precis. (%) | Refinement: util. error (%) | policy precis. (%) | refin. time (min)
1 |  7.38 | 80    |  7.38 | 80    |  0.33
2 |  9.03 | 64    |  9.03 | 64    |  0.247
3 | 10.68 | 64    |  9.45 | 74    |  6.3
4 | 12.65 | 52    |  8.82 | 54.5  |  6.13
5 |  7.13 | 35    |  5.79 | 36    |  1.23
6 | 11.56 | 47.2  | 11.32 | 46.72 | 10.31
7 |  5.78 | 44.78 |  5.45 | 43.89 | 23.34
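As a quick sanity check on the reported figures, the short Python snippet below (ours, not from the paper) recomputes the policy-precision measure PP = (NEC / NTC) × 100 defined above and verifies that the 9.17% average utility error quoted in the text is indeed the mean of the qualitative utility-error column of Table 2.

def policy_precision(nec, ntc):
    # PP = (NEC / NTC) * 100, as defined in Section 5.
    return 100.0 * nec / ntc

# Qualitative utility errors (%) from Table 2, problems 1-7.
qual_util_errors = [7.38, 9.03, 10.68, 12.65, 7.13, 11.56, 5.78]
avg_error = sum(qual_util_errors) / len(qual_util_errors)
print(round(avg_error, 2))  # -> 9.17, matching the value quoted in the text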

6 Conclusions and Future Work

In this paper, a new approach for solving continuous and hybrid infinite-horizon MDPs is described. In the first phase we use an exploration strategy of the environment and a machine learning approach to induce an initial state abstraction. We then follow a refinement process to improve on the utility value. Our approach creates significant reductions in space and time, allowing us to quickly solve relatively large problems. The utility values on our abstracted representation are reasonably close (less than 10%) to those obtained using a fine discretization of the domain. A new refinement process to improve the results of the initial proposed abstraction is also presented. It always improves or at least maintains the utility values obtained from the qualitative abstraction. Although tested on small solvable problems for comparison purposes, the approach can be applied to more complex domains where a simple discretization approach is not feasible.

As future research work we would like to improve our refinement strategy to select a better segmentation of the abstract states and use an alternative search strategy. We also plan to test our approach in other domains.

Acknowledgments

This work was supported in part by the Instituto de Investigaciones Electricas, Mexico and CONACYT Project No. 47968.

References

R.E. Bellman. 1957. Dynamic Programming. Princeton U. Press, Princeton, N.J.

D. P. Bertsekas and J.N. Tsitsiklis. 1996. Neuro-dynamic programming. Athena Scientific.

D. P. Bertsekas. 1995. A counter-example to temporal difference learning. Neural Computation, 7:270-279.

B. Bonet and J. Pearl. 2002. Qualitative MDPs and POMDPs: An order-of-magnitude approach. In Proceedings of the 18th Conf. on Uncertainty in AI, UAI-02, pages 61-68, Edmonton, Canada.

C. Boutilier, T. Dean, and S. Hanks. 1999. Decision-theoretic planning: structural assumptions and computational leverage. Journal of AI Research, 11:1-94.

G. F. Cooper and E. Herskovits. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning.

T. Dean and R. Givan. 1997. Model minimization in Markov Decision Processes. In Proc. of the 14th National Conf. on AI, pages 106-111. AAAI.

T. Dean and K. Kanazawa. 1989. A model for reasoning about persistence and causation. Computational Intelligence, 5:142-150.

Z. Feng, R. Dearden, N. Meuleau, and R. Washington. 2004. Dynamic programming for structured continuous Markov decision problems. In Proc. of the 20th Conf. on Uncertainty in AI (UAI-2004), Banff, Canada.

C. Guestrin, M. Hauskrecht, and B. Kveton. 2004. Solving factored MDPs with continuous and discrete variables. In Twentieth Conference on Uncertainty in Artificial Intelligence (UAI 2004), Banff, Canada.

M. Hauskrecht and B. Kveton. 2003. Linear program approximation for factored continuous-state Markov decision processes. In Advances in Neural Information Processing Systems NIPS(03), pages 895-902.

B. Kuipers. 1986. Qualitative simulation. AI, 29:289-338.

M.L. Puterman. 1994. Markov Decision Processes. Wiley, New York.

J.R. Quinlan. 1993. C4.5: Programs for machine learning. Morgan Kaufmann, San Francisco, Calif., USA.


Bayesian Model Averaging of TAN Models for Clustering
Guzman Santafe, Jose A. Lozano and Pedro Larranaga
Intelligent Systems Group, Computer Science and A. I. Dept.
University of the Basque Country, Spain

Abstract
Selecting a single model for clustering ignores the uncertainty left by finite data as to which is the correct model to describe the dataset. In fact, the fewer samples the dataset has, the higher the uncertainty is in model selection. In these cases, a Bayesian approach may be beneficial, but unfortunately this approach is usually computationally intractable and only approximations are feasible. For supervised classification problems, it has been demonstrated that model averaging calculations, under some restrictions, are feasible and efficient. In this paper, we extend the expectation model averaging (EMA) algorithm originally proposed in Santafe et al. (2006) to deal with model averaging of naive Bayes models for clustering. Thus, the extended algorithm, EMA-TAN, allows to perform an efficient approximation for a model averaging over the class of tree augmented naive Bayes (TAN) models for clustering. We also present some empirical results that show how the EMA algorithm based on TAN outperforms other clustering methods.

1 Introduction Friedman, 2002) or recursive Bayesian multinets


(Pena et al., 2002) which are trees that incor-
Unsupervised classication or clustering is the porate Bayesian multinets in the leaves.
process of grouping similar objects or data sam- The process of selecting a single model for
ples together into natural groups called clus- clustering ignores model uncertainty and it can
ters. This process generates a partition of lead to the selection of a model that does not
the objects to be classied. Bayesian networks properly describe the data. In fact, the fewer
(Jensen, 2001) are powerful probabilistic graph- samples the dataset has, the higher the uncer-
ical models that can be used for clustering tainty is in model selection. In these cases,
purposes. The naive Bayes is the most sim- a Bayesian approach may be benecial. The
ple Bayesian network model (Duda and Hart, Bayesian approach proposes an averaging over
1973). It assumes that the predictive vari- all models weighted by their posterior probabil-
ables in the model are independent given the ity given the data (Madigan and Raftery, 1994;
value of the cluster variable. Despite this be- Hoeting et al., 1999). The Bayesian model av-
ing a very simple model, it has been success- eraging has been seen by some authors as a
fully used in clustering problems (Cheeseman model combination technique (Domingos, 2000;
and Stutz, 1996). Other models have been pro- Clarke, 2003) and it does not always perform
posed in the literature in order to relax the as successfully as expected. However, other au-
heavy independence assumptions that the naive thors states that Bayesian model averaging is
Bayes model makes. For example, tree aug- not exactly a model ensemble technique but a
mented naive Bayes (TAN) models (Friedman method for soft model selection (Minka, 2002).
et al., 1997) allow the predictive variables to Although model averaging approach is nor-
form a tree and the class variable remains as mally preferred when there are a few data sam-
a parent of each predictive variable. On the ples because it deals with uncertainty in model
other hand, more complicated methods have selection, this approach is usually intractable
been proposed to allow the encoding of context- and only approximations are feasible. Typi-
specic (in)dependencies in clustering problems cally, the Bayesian model averaging for cluster-
by using, for instance, naive Bayes (Barash and ing is approximated by averaging over some of
(l) (l)
the models with the highest posterior probabili- {x1 , . . . , xn }, with l = 1, . . . , N .
ties (Friedman, 1998). However, ecient calcu- We dene a Bayesian network model as B =
lation of model averaging for supervised classi- S, , where S describes the structure of the
cation models under some constraints has been model and its parameter set. Using the clas-
proposed in the literature (Dash and Cooper, sical notation in Bayesian networks, ijk (with
2004; Cerquides and Lopez de Mantaras, 2005). k = 1, . . . , ri and ri being the number of states
Some of these proposals have also been extended for variable Xi ) represents the conditional prob-
to clustering problems. For example, Santafe ability of variable Xi taking its k-th value given
et al. (2006) extend the calculations of naive that its parents, P ai , takes its j-th congura-
Bayes models to approximate a Bayesian model tion. The conditional probability mass function
averaging for clustering. In that paper, the for Xi given the j-th conguration of its par-
authors introduce the expectation model aver- ents is designated as ij , with j = 1, . . . , qi ,
aging (EMA) algorithm, which is a variant of where qi is the number of dierent states of P ai .
the well-known EM algorithm (Dempster et al., Finally, i = ( i1 , . . . , iqi ) denotes the set of
1977) and allows to deal with the latent clus- parameters for variable Xi , and therefore, =
ter variable and then approximate a Bayesian ( C , 1 , . . . , n ), where C = (C1 , . . . , CrC )
model averaging of naive Bayes models for clus- is the set of parameters for the cluster variable,
tering. with rC the number of clusters xed in advance.
In this paper we use the structural features In order to use the decomposition proposed
estimation proposed by Friedman and Koller in Friedman and Koller (2003) for an ecient
(2003) and the model averaging calculations model averaging calculation, we need to con-
for supervised classiers presented in Dash and sider an ancestral order over the predictive
Cooper (2004) in order to extend the EMA algo- variables.
rithm to TAN models (EMA-TAN algorithm).
Definition 1. Class of TAN models (L ): T AN
In other words, we propose a method to obtain
given an ancestral order , a model B belongs
a single Bayesian network model for clustering
to LT AN if each predictive variable has up to
which approximates a Bayesian model averaging
two parents (another predictive variable and the
over all possible TAN models. This is possible
cluster variable) and the arcs between variables
by setting an ancestral order among the predic-
that are dened in the structure S are directed
tive variables. The result of the Bayesian model
down levels: Xj Xi S level (Xi ) <
averaging over TAN models is a single Bayesian
level (Xj ).
network model for clustering (Dash and Cooper,
2004). Note that, the classical conception of TAN
The rest of the paper is organized as follows. model allows the predictive variables to form a
Section 2 introduces the notation that is used tree and then, the cluster variable is set as a
throughout the paper as well as the assumptions parent of each predictive variable. However, we
that we make. Section 3 describes the EMA do not restrict LTAN to only these tree mod-
algorithm to approximate model averaging over els, but we also allow the predictive variables to
the class of TAN models. Section 4 shows some form a forest and also the class variable may or
experimental results with synthetic data that may not be set as a parent of each predictive
illustrate the behavior of the EMA algorithm. variable. Therefore L T AN also includes, among
Finally, section 5 presents the conclusions of the others, naive Bayes and selective naive Bayes
paper and future work. models.
For a given ordering and a particular vari-
2 Notation and Assumptions able Xi , we can enumerate all the possible
parent sets for Xi in the class of TAN mod-
In an unsupervised learning problem there is a els LT AN . In order to clarify calculations, we
set of predictive variables, X1 , . . . , Xn , and the superscript with v any quantity related to a
latent cluster variable, C. The dataset D = predictive variable Xi and thus, we are able
{x(1) , . . . , x(N ) } contains data samples x(l) = to identify the parent set of variable Xi that

272 G. Santaf, J. A. Lozano, and P. Larraaga


we are taking into consideration. For exam- Parameter independence assumes that the
ple, for a given order = X1 , X2 , X3  the prior on parameters ijk for a variable Xi de-
possible sets of parents for X3 in LTAN are: pends only on local structures. This is known as
P a13 = {}, P a23 = {X1 }, P a33 = {X2 }, P a43 = parameter modularity (Heckerman et al., 1995).
{C, X1 }, P a53 = {C, X2 }, P a63 = {C} with, in Therefore, we can state that for any two net-
this case, v = 1, . . . , 6. In general, we consider, work structures S1 and S2 , if Xi has the same
without loss of generality, = X1 , . . . , Xn  parent set in both structures then p(ijk |S1 ) =
and therefore, for a variable Xi , v = 1, . . . , 2i. p(ijk |S2 ). As a consequence, parameter cal-
Moreover, we use i to index any quantity related culations for a variable Xi will be the same in
to the i-th predictive variable, with i = 1, . . . , n. every model whose structure denes that the
Additionally, the following ve assumptions variable Xi has the same parent set.
are needed to perform an ecient approxima-
tion of model averaging over L T AN :
3 The EMA-TAN Algorithm
Multinomial variables: Each variable Xi is The EMA algorithm was originally introduced
discrete and can take ri states. The cluster vari- in Santafe et al. (2006) for dealing with model
able is also discrete and can take rC possible averaging calculations of naive Bayes models.
states, rC being the number of clusters xed in In this section we present the extension of this
advance. algorithm (EMA-TAN) in order to average over
Complete dataset: We assume that there are TAN models.
no missing values for the predictive variables in Theorem 1 (Dash and Cooper, 2004). There
the dataset. However, the cluster variable is exist, for supervised classification problems, a
latent; therefore, its values are always missing. single model B = S,  which defines a joint
Dirichlet priors: The parameters of every probability distribution p(c, x|B) equivalent to
model are assumed to follow a Dirichlet distri- the joint probability distribution produced by
bution. Thus, ijk is the Dirichlet hyperparam- model averaging over all TAN models. This
eter for parameter ijk from the network, and model B is a complete Bayesian network where
Cj is the hyperparameter for Cj . In fact, as the structure S defines the relationship between
we have to take into consideration each possible variables in such a way that, for a variable Xi ,
model in L T AN , the parameters of the models the parent set of Xi in the model B is Pai =
can be denoted as ijk v . Hence, we assume the v
v=1 P ai .
2i
existence of hyperparameters vijk .
This result can be extended to clustering
Parameter independence: The probability problems by means of the EMA-TAN algorithm.
of having the set of parameters for a given However, the latent cluster variable prevents the
structure S can be factorized as follows: exact calculation of the model averaging and
qi
n  therefore the model B obtained by the EMA-

p(|S) = p( C ) p( ij |S) (1) TAN algorithm is not an exact model averag-
i=1 j=1 ing over LTAN but an approximation. There-
fore, the EMA-TAN algorithm provides a pow-
Structure modularity: The prior probability erful tool that allows to learn a single unsuper-
p(S) can be decomposed in terms of each vised Bayesian network model which approxi-
variable and its parents: mates Bayesian model averaging over LTAN .
The unsupervised classier is obtained by
n
 learning the predictive probability, p(c, x|D),
p(S) pS (C) pS (Xi , P ai ) (2)
i=1
averaged over the maximum a posteriori (MAP)
parameter congurations for all the models in
where pS (Xi , P ai ) is the information con- LT AN . Then, we can obtain the unsupervised
tributed by variable Xi to p(S), and pS (C) is classier by using the conditional probability of
the information contributed by the cluster vari- the cluster variable given by Bayes rule.
able. The EMA-TAN algorithm is an adaption of

Bayesian Model Averaging of TAN Models for Clustering 273


the well-known EM algorithm. It uses the E sucient statistics for all the models in LTAN
step of the EM algorithm to deal with the miss- because some of these models share the same
ing values for the cluster variable. Then, it per- value for the expected sucient statistics.
forms a model averaging step (MA) to obtain Hence, it is only necessary to calculate the
p(c, x|D) and thus the unsupervised Bayesian expected sucient statistics with dierent
network model for clustering. parent sets. They can be obtained as follows:
The EMA-TAN, as well as the EM algorithm, N

is an iterative process where the two steps of v
E(Nijk |B (t) ) = p(xki , P avi = j|x(l) , B (t) ) (3)
the algorithm are repeated successively until a l=1
stopping criterion is met. At the t-th iteration
(t) where xki represents the k-th value of the i-
of the algorithm, a set of parameters, , for
th variable. The expected sucient statistic
the Bayesian network model B (t) is calculated. v |B (t) ) denotes, at iteration t, the ex-
E(Nijk
Note that, although we dierentiate between
pected number of cases in the dataset D where
the Bayesian network models among the itera-
variable Xi takes its k-th value, and the v-th
tions of the EMA-TAN algorithm, the structure
parent set of Xi takes its j-th conguration.
of the model, S, is constant and only the es-
Similarly, we can obtain the expected suf-
timated parameter set changes. The algorithm
cient statistics for the cluster variable. This
stops when the dierence between the sets of pa-
is a special case since for any model in LTAN
rameters learned in two consecutive iterations,
(t) (t+1) the parent set for C is the same (the cluster
and , is less than threshold , which variable does not have any parent). Therefore,
is xed in advance. we refuse the use of superindex v in those
In order to use the EMA-TAN algorithm, quantities related only to C.
we need to set an initial parameter congu-
(0) N

ration, , and the value of . The values E(NCj |B (t) ) = p(C = j|x(l) , B (t) ) (4)
(0)
for are usually taken at random and  is l=1
set at a small value. Note that, even though
Note that, some of the expected sucient
the obtained model is a single unsupervised v |B (t) ) do not depend on the
statistics E(Nijk
Bayesian network, its parameters are learned
value of C. Therefore, these values are constant
taking into account the MAP parameter con-
throughout the iterations of the algorithm and
guration for every model in LTAN . Thus,
it is necessary to calculate them only once.
the resulting unsupervised Bayesian network
will incorporate into its parameters information 3.2 MA Step (Model Averaging)
about the (in)dependencies between variables In this second step, the EMA algorithm per-
described by the dierent TAN models. forms the model averaging calculations which
3.1 E Step (Expectation) obtain a single Bayesian network model with
(t+1)
parameters . These parameters are ob-
Intuitively, we can see this step as a comple-
tained by calculating p(c, x|D(t) ) as an average
tion of the values for the cluster variable, which
over the MAP congurations for the models in
L
are missing. Actually, this step computes the
expected sucient statistics in the dataset for T AN .

variable Xi in every model in LTAN given the


In order to make the calculations clearer, we
rst show how we can obtain p(c, x|S, D (t) ) for
current model, B (t) . These expected sucient a xed structure S:
statistics are used in the next step of the al-
gorithm, MA, as if they were actual sucient 
(t)
statistics from a complete dataset. From now p(c, x|S, D ) = p(c, x|S, )p(|S, D (t) )d (5)
on, D (t) denotes the dataset after the E step at
the t-th iteration of the algorithm. The exact computation of the integral in
Note that, due to parameter modularity, we Equation 5 is intractable for clustering prob-
do not actually need to calculate the expected lems, therefore, an approximation is needed

274 G. Santaf, J. A. Lozano, and P. Larraaga


(Heckerman et al., 1995). However, assuming p(D(t) |S) eciently. In order to do so, we
parameter independence and Dirichlet priors, adapt the formula to calculate the marginal
and given that the expected sucient statis- likelihood with complete data (Cooper and
tics calculated in the previous E step can Herskovits, 1992) to our problem with missing
be used as an approximation to the actual values. Thus, we have an approximation to
sucient statistics in the complete dataset, we p(D (t) |S):
can approximate p(c, x|S, D (t) ) by the MAP 
rC
(C ) (Cj + E(NCj |B(t) ))
parameter conguration. This is the parameter p(D (t) |S)
(C + E(NC |B(t) )) (Cj )
conguration that maximizes p(|S, D (t) ) and j=1

can be described in terms of the expected 


n

qi 
ri i i
(iji ) (ijk + E(Nijk |B(t) ))
sucient statistics and the Dirichlet hyper- i
(iji + E(Niji |B(t) )) (ijk )
parameters (Heckerman et al., 1995; Cooper i=1 j=1 k=1

and Herskovits, 1992). Therefore, Equation 5


results: At this point, given structure modular-
ity assumption, we are able to approximate
Cj + E(NCj |B (t) ) p(c, x|D(t) ) with the following expression:
p(c, x|S, D (t) ) (6)
C + E(NC |B (t) ) n
  i
(t)
n i + E(N i |B (t) )
 n
 p(c, x|D ) Cj ijk (9)
ijk ijk i
= Cj ijk S i=1
i=1 iji + E(Niji |B (t) ) i=1
where is a constant and Cj and ijk
i
are
i dened in Equation 10.
where ijk is the MAP parameter congu-
i
ration for ijk (i denotes the parent index (C )
that corresponds to the parent set for Xi de- Cj = Cj pS (C)
i (C + E(NC |B(t) ))
scribed in S), iji = rk=1 ijk
i
, E(Nij |B (t) ) =
 ri 
rC
(Cj + E(NCj |B(t) ))
k=1 E(Nijk |B ) and similarly for the values
(t) (10)
(Cj )
related to C. j=1

Considering that the structure is not xed



qi
( i
ij )
a priori, we should average over all selective i
= i
ijk pS (Xi , P ai
i )
model structures in LTAN in the following way: (
ijk i
ij
i
+ E(Nij |B(t) ))
j=1


ri
( i
ijk + E(Nijk |B
i (t)
))
(t)
p(c, x|D ) = (7) ( i
ijk )
  k=1

p(c, x|S, )p(|S, D (t) )d p(S|D(t) )


S Since we are assuming parameter indepen-
dence, structure modularity and parameter
Therefore, the model averaging calculations
modularity, we can apply the dynamic pro-
require a summation over 2n n! terms, which are
gramming solution described in Friedman and
the models in LT AN . Koller (2003), and Dash and Cooper (2004).
Using the previous calculations for a xed
Thus Equation 9 can be written as follows:
structure, Equation 8 can be written as:
n 
 2i
 n
 i p(c, x|D(t) ) Cj vijk (11)
p(c, x|D (t) ) Cj ijk p(S|D(t) )
i=1 v=1
S i=1
 n
 i with being a constant.
Cj ijk p(D(t) |S) p(S) (8)
Note that the time complexity needed to cal-
culate the averaging over L
S i=1
T AN is the same as
Given the assumption of Dirichlet priors and that which is needed to learn the MAP param-
parameter independence, we can approximate eter conguration for B.

Bayesian Model Averaging of TAN Models for Clustering 275


We can see the similarity of the above- obtain dierent datasets of sizes 40, 80, 160, 320
described Equation 11 with the factorization and 640. In the experiments, we compare the
of a Bayesian network model. Indeed the joint multi-start EMA-TAN (called also EMA-TAN
probability distribution of the approximated for convenience) with three dierent algorithms:
Bayesian model averaging over LTAN for EM-TAN: a multi-start EM that learns a TAN
clustering is equivalent to a single Bayesian model by using, at each step of the EM algo-
network model. Therefore, the parameters rithm, the classical method proposed by Fried-
of the model for the next iteration of the man et al. (1997) adapted to be used within the
algorithm can be calculated as follows: EM algorithm.

2i
EM-BNET: a multi-start EM algorithm used
vijk
(t+1) (t+1)
Cj Cj , ijk (12)
to learn the MAP parameters of a complete
v=1
Bayesian network for a given order .
3.3 Multi-start EMA-TAN EMA: the multi-start model averaging of naive
The EMA-TAN is a greedy algorithm that is Bayes for clustering.
susceptible to be trapped in a local optima. The Note that, for a given order , both EMA-
results obtained by the algorithm depend on the TAN and EM-BNET models share the same
random initialization of the parameters. There- network structure but their parameter sets are
fore, we propose the use of a multi-start scheme calculated in a dierent way. The number of
where m dierent runs of the algorithm with dif- multi-start iterations for both multi-start EM
ferent random initializations are performed. In and multi-start EMA is m = 100.
Santaf et al. (2006) dierent criteria to obtain Since the datasets are synthetically gener-
the nal model from the multi-start process are ated, we are able to know the real ordering
proposed. In our case, we use the same criteria among variables. Nevertheless, we prefer to use
that the multi-start EM uses: the best model in a more general approach and assume that this
terms of likelihood among all the m calculated order is unknown. Therefore, we use a random
models is selected as the nal model. This is ordering among predictive variables for each
not a pure Bayesian approach to the model av- EMA-TAN model that we learn. As the EM-
eraging process but, in practice, it works as well BNET algorithm also needs an ancestral order
as other more complicated techniques. among predictive variables, in the experiments,
we use the same random ordering for any pair
4 Evaluation in Clustering Problems of models (EMA-TAN vs. EM-BNET) that we
It is not easy to validate clustering algorithms compare.
since clustering problems do not normally Every model is used to cluster the dataset
provide information about the true grouping of from which it has been learned. In the exper-
data samples. In general, it is quite common iments, we compare EMA-TAN vs. EM-TAN,
to use synthetic data because the true model EMA-TAN vs. EM-BNET and EMA-TAN vs.
that generated the dataset as well as the EMA. For each test, the winner model is ob-
underlying clustering structure of the data tained by comparing the data partitions ob-
are known. In order to illustrate the behavior tained by both models with the true partition of
of the EMA-TAN algorithm, we compare it the dataset. Thus, the model with the best es-
with the classical EM algorithm and with the timated accuracy (percentage of correctly clas-
EMA algorithm (Santafe et al., 2006). For sied instances) is the winner model. In Table
EMA-TAN evaluation, we obtain random TAN 1, the results from the experiments with ran-
models where the number of predictive vari- dom TAN models are shown. For each model
ables vary in {2, 4, 8, 10, 12, 14}, each predictive conguration, the table describes the number
variable can take up to three states and the of wins/draws/losses of the EMA-TAN models
number of clusters is set to two. For each TAN with respect to EM-TAN, EM-BNET or EMA
conguration we generate 100 random models models on basis of the estimated accuracy of
and each one of these models is sampled to each model. We also provide information about

276 G. Santaf, J. A. Lozano, and P. Larraaga


a Wilcoxon signed-rank test1 used to evaluate imates model averaging over the class of TAN
whether the accuracy estimated by two dier- models. Furthermore, this approximation can
ent models is dierent at the 1% and 10% levels. be performed eciently. This is possible by us-
We write the results shown in Table 1 in bold if ing the EMA-TAN algorithm. The EMA is an
the test is surpassed at the 10% level and inside algorithm originally proposed in Santafe et al.
a gray color box if the test is surpassed at the 1% (2006) for averaging over naive Bayes models in
level. It can be seen that, in general, the EMA- clustering problems. In this paper we extend
TAN models behave better than the others and the algorithm to deal with more complicated
the dierences between estimated accuracy are, models such as TAN (EMA-TAN algorithm).
in most of the cases, statistically signicant. We also present an empirical evaluation by com-
We can see that the compared models obtain paring the model averaging over TAN models
very similar results when they have a few predic- with the model averaging over naive Bayes mod-
tive variables. This is because the set of models els and with the EM algorithm to learn a sin-
that we are averaging over to obtain the EMA- gle TAN model and a Bayesian network model.
TAN model is very small. Therefore, it is quite These experiments conclude that, at least, when
possible that other algorithms such as EM-TAN the model that generated the dataset is included
select the correct model. Hence, in some exper- in the set of models that we average over, the av-
iments with the simplest models (models with 2 eraging over TAN models outperforms the other
predictive variables), EMA-TAN algorithm sig- methods
nicantly lost with the other algorithms. How- Probably, one of the limitations of the pro-
ever, when the number of predictive variables posed algorithm is that, because the learned
in the model increases and the dataset size is model which approximates model averaging is a
relatively big (the smallest datasets are not big complete Bayesian network, it is computation-
enough for a reliable estimation of the param- ally hard to learn the model for problems with
eters) the EMA-TAN considerably outperforms many predictive variables. In order to overcome
any other model in the test. this situation, we can restrict the nal model
The experimental results from this section re- to a maximum of k possible parents for each
inforce the idea that the results of the Bayesian predictive variable. Future work might include
model averaging outperform other methods a more exhaustive empirical evaluation of the
when the model that generated the data is in- EMA-TAN algorithm and the application of the
cluded in the set of models that we are aver- algorithm to real problems. Moreover, the algo-
aging over. Since we are averaging over a re- rithm can be extended to other Bayesian clas-
stricted class of models, this situation may not siers such as k-dependent naive Bayes (kDB),
be fullled when applying the EMA-TAN to real selective naive Bayes, etc. Another interesting
problems. Due to the lack of space, we do not future work may be the relaxation of complete
include in the paper more experimentation in dataset assumption.
progress, but we are aware that it would be
very interesting to check the performance of the Acknowledgments
EMA algorithm with other synthetic datasets This work was supported by the SAIOTEK-
generated by sampling naive Bayes and general Autoinmune (II) 2006 and Etortek research
Bayesian network models and also with dataset projects from the Basque Government, by the
from real problems. Spanish Ministerio de Educacion y Ciencia un-
5 Conclusions der grant TIN 2005-03824, and by the Govern-
ment of Navarre under a PhD grant.
We have shown that it is possible to obtain a sin-
gle unsupervised Bayesian network that approx-
1
References
Since we have checked by means of a Kolmogorov-
Smirnov test that not all the outcomes from the experi- Y. Barash and N. Friedman. 2002. Context-
ment can be considered to follow a normal distribution, specic Bayesian clustering for gene expression
we decided to use a Wilcoxon sign-rank test to compare data. Journal of Computational Biology, 9:169
the results 191.

Bayesian Model Averaging of TAN Models for Clustering 277


EMA-TAN vs. EM-TAN EMA-TAN vs. EM-BNET
#Var 40 80 160 320 640 #Var 40 80 160 320 640
2 11/71/18 9/71/20 15/68/17 18/70/12 15/63/22 2 24/51/25 29/35/36 33/34/33 28/35/37 41/29/30
4 37/24/39 33/17/50 50/8/42 48/5/47 46/3/51 4 54/11/35 51/8/41 59/8/33 49/6/45 57/1/42
8 51/6/43 57/4/39 65/6/29 74/4/22 71/1/28 8 59/8/33 64/5/31 81/1/18 83/0/17 91/0/9
10 48/5/47 59/4/37 55/10/35 64/6/30 67/4/29 10 67/7/26 76/0/24 82/2/16 84/0/16 94/0/6
12 54/4/42 65/3/32 65/4/31 69/4/27 83/2/15 12 66/5/29 86/1/13 80/0/20 91/0/9 89/0/11
14 59/8/33 68/6/26 75/2/23 76/4/20 76/6/18 14 74/2/24 83/1/16 92/1/7 92/0/8 96/0/4

EMA-TAN vs. EMA


#Var 40 80 160 320 640
2 21/53/26 24/41/35 24/48/28 23/45/32 25/45/30
4 47/14/39 49/6/45 49/10/41 50/5/45 56/3/41
8 52/10/38 58/4/38 80/4/16 90/0/10 89/0/11
10 48/13/39 55/8/37 69/6/25 81/1/18 88/1/11
12 55/8/37 78/6/16 79/1/20 88/2/10 88/0/12
14 55/11/34 68/7/25 81/3/16 84/3/13 92/0/8

Table 1: Comparison between EMA-TAN models and EM-TAN, EM-BNET and EMA learned from datasets
sampled from random TAN models

J. Cerquides and R. Lopez de Mantaras. 2005. TAN N. Friedman, D. Geiger and M. Goldszmidt 1997.
classiers based on decomposable distributions. Bayesian networks classiers. Machine Learning,
Machine Learning, 59(3):323354. 29:131163.
P. Cheeseman and J. Stutz. 1996. Bayesian classi- N. Friedman. 1998. The Bayesian structural EM
cation (Autoclass): Theory and results. In Ad- algorithm. In Proceedings of the Fourteenth Con-
vances in Knowledge Discovery and Data Mining, ference on Uncertainty in Artificial Intelligence,
pages 153180. AAAI Press. pages 129138.
B. Clarke. 2003. Comparing Bayes model averag- D. Heckerman, D. Geiger, and D. M. Chickering.
ing and stacking when model approximation error 1995. Learning Bayesian networks: The combina-
cannot be ignored. Journal of Machine Learning tion of knowledge and statistical data. Machine
Research, 4:683712. Learning, 20:197243.

G. F. Cooper and E. Herskovits. 1992. A Bayesian J. Hoeting, D. Madigan, A. E. Raftery, and C. Volin-
method for the induction of probabilistic networks sky. 1999. Bayesian model averaging. Statistical
from data. Machine Learning, 9:309347. Science, 14:382401.

D. Dash and G. F. Cooper. 2004. Model averaging F.V. Jensen. 2001. Bayesian Networks and Decision
for prediction with discrete Bayesian networks. Graphs. Springer Verlag.
Journal of Machine Learning Research, 5:1177 D. Madigan and A. E. Raftery. 1994. Model se-
1203. lection and accounting for model uncertainty in
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. graphical models using Occams window. Journal
Maximum likelihood from incomplete data via the of the American Statistical Association, 89:1535
EM algorithm (with discussion). Journal of the 1546.
Royal Statistical Society B, 39:138. T. Minka. 2002. Bayesian model averaging is not
model combination. MIT Media Lab note.
P. Domingos. 2000. Bayesian averaging of classiers
and the overtting problem. In Proceedings of the J. M. Pena, J. A. Lozano, and P. Larranaga. 2002.
Seventeenth International Conference on Machine Learning recursive Bayesian multinets for cluster-
Learning, pages 223230. ing by means of constructive induction. Machine
Learning, 47(1):63-90.
R. Duda and P. Hart. 1973. Pattern Classification
and Scene Analysis. John Wiley and Sons. G. Santafe, J. A. Lozano, and P. Larranaga. 2006.
Bayesian model averaging of naive Bayes for clus-
N. Friedman and D. Koller. 2003. Being Bayesian tering. IEEE Transactions on Systems, Man, and
about network structure: A Bayesian approach CyberneticsPart B. Accepted for publication.
to structure discovery in Bayesian networks. Ma-
chine Learning, 50:95126.

278 G. Santaf, J. A. Lozano, and P. Larraaga


Dynamic Weighting A* Search-based MAP Algorithm for
Bayesian Networks
Xiaoxun Sun Marek J. Druzdzel Changhe Yuan
Computer Science Department Decision Systems Laboratory Department of Computer Science
University of Southern California School of Information Sciences and Engineering
Los Angeles, CA 90089 & Intelligent Systems Program Mississippi State University
xiaoxuns@usc.edu University of Pittsburgh Mississippi State, MS 39762
Pittsburgh, PA 15260 cyuan@cse.msstate.edu
marek@sis.pitt.edu

Abstract
In this paper we introduce the Dynamic Weighting A* (DWA*) search algorithm for solv-
ing MAP. By exploiting asymmetries in the distribution of MAP variables, the algorithm
can greatly reduce the search space and yield MAP solutions with high quality.

1 Introduction ity space with the remaining assignments hav-


ing practically negligible probability (Druzdzel,
The Maximum A-Posteriori assignment (MAP) 1994). Also, the DWA* uses dynamic weight-
is the problem of finding the most probable in- ing based on greedy guess (Park and Darwiche,
stantiation of a set of variables given partial ev- 2001; Yuan et al., 2004) as the heuristic func-
idence on the remaining variables in a Bayesian tion. While it is theoretically not admissi-
network. A special case of MAP that has been ble (admissible heuristic should offer an upper
paid much attention is the Most Probable Ex- bound on the MAP), it offers -admissibility and
planation (MPE) problem. MPE is the problem excellent performance in practice (Pearl, 1988).
of finding the most probable state of the model
given that all its evidences have been observed. 2 MAP
MAP turns out to be a much harder problem
compared to MPE. Particularly, MPE is NP- The MAP problem is defined as follows: Let M
complete while the corresponding MAP prob- be the set of MAP variables (we are interested
lem is N P P P -complete (Park, 2002). MAP is in the most probable configuration of these vari-
more useful than MPE for providing explana- ables); E is the set of evidence, namely the
tions. For instance, in diagnosis, generally we variables whose states we have observed; The
are only interested in the configuration of fault remainder of the variables, denoted by S, are
variables given some observations. There may variables that are neither observed nor of inter-
be many other variables that have not been ob- est to us. Given an assignment e of variables
served and are outside the scope of our interest. E, the MAP problem is that of finding the as-
signment m of variables M which maximizes
In this paper, we introduce the Dynamic
the probability of P (m | e), while MPE is the
Weighting A* (DWA*) search algorithm for
special case of MAP with S being empty.
solving MAP that is generally more efficient
than existing algorithms. The algorithm ex- X
map = max p(M, S | E) . (1)
plores the asymmetries among all possible as- M
S
signments in the joint probability distribution
of the MAP variables. Typically, a small frac- In Bayesian networks, we use the Conditional
tion of all possible assignments can be expected Probability Table (CPT) as the potential over
to cover a large portion of the total probabil- a variable and its parents. A potential over all
the states of one variable after updating beliefs 3.1 A* search
is called marginal. The notation e stands for MAP can be solved by A* search in the proba-
the potential in which we have fixed the value bility tree that is composed of all the variables
of e E. Then the probability of MAP with in the MAP set. The nodes in the search tree
as its CPTs turns out to be a real number: represent partial assignments of the MAP vari-
XY ables M. The root node represents an empty
map = max e . (2) assignment. Each MAP variable will be instan-
M
S tiated in a certain order. If a variable x in the
set of MAP variables M is instantiated at the
In Equation 2, summation does not commute ith place using its jth state, it will be denoted
with maximization. Therefore, it is necessary as Mij . Leaves of the search tree correspond to
to do summation before the maximization. The the last MAP variable that has been instanti-
order is called the elimination order. The size ated. The vector of instantiated states of each
of the largest clique minus 1 in a jointree con- MAP variable is called an assignment or a sce-
structed based on an elimination order is called nario. We compute the probability of assign-
the induced width. The induced width of the ments while searching the whole probability tree
best elimination order is called the treewidth. using chain rule. For each inner node, the newly
However, for the MAP problems in which nei- instantiated node will be added to the evidence
ther the set S nor the set M are empty, the order set, i.e., the evidence set will be extended to
is constrained. Then the constrained elimina- Mij E. Then the probability of an assignment
tion order is known as the constrained treewidth. of n MAP variables can be computed as follows:
Generally, the constrained treewidth is much
P (M | E) = P (Mni | M1j , M2k , . . . M(n1)t , E)
larger than treewidth, leading the problem be-
yond the limits of feasibility. . . . P (M2k | M1j , E)P (M1j | E) .
Several researchers have proposed algorithms Suppose we are in the xth layer of the search tree
for solving MAP. A very efficient approximate and preparing for instantiating the xth MAP
search-based algorithm based on local search, variables. Then the function above can be
proposed by Park (2002), is capable of solving rewritten as follows:
MAP efficiently. An exact method, based on
branch-and-bound depth-first search, proposed
P (M | E) =
by Park and Darwiche (2003), performs quite
b
well when the search space is not too large. An- z }| {
P (Mni | M1j . . . M(n1)t , E) . . . P (M(x+1)z | Mxy . . . E)
other approximate scheme, proposed by Yuan et
P (Mxy | M1j , M2k . . . M(x1)q , E) . . . P (M1j | E) (3)
al. (2004), is a Reheated Annealing MAP algo- | {z }
rithm. It is somewhat slower on simple networks a

but it is capable of handling difficult cases that The general idea of DWA* is that in each in-
exact methods cannot tackle. ner node of the probability tree, we can compute
the value of item (a) in the function above ex-
3 Solving MAP using Dynamic actly. We can estimate the heuristic value of the
Weighting A* Search item (b) for the MAP variables that have not
been instantiated given the initial evidence set
We propose in this section an algorithm for and the MAP variables that have been instanti-
solving MAP using Dynamic Weighting A* ated as the new evidence. In order to fit the typ-
search, which incorporates the dynamic weight- ical format of the cost function of A* search, we
ing (Pearl, 1988) in the heuristic function, can take the logarithm of the equation above,
relevance reasoning (Druzdzel and Suermondt, which will not change its monotonicity. Then
1994), and dynamic ordering in the search tree. we get f (n) = g(n) + h(n), where g(n) and h(n)

280 X. Sun, M. J. Druzdzel, and C. Yuan


are obtained from the logarithmic transforma- The requirement of exhaustive independence
tion of items (a) and (b) respectively. g(n) gives is too strict for most MAP problems. However,
the exact cost from the start node to node in the simulation results show that in practice, even
nth layer of the search tree, and h(n) is the es- when this requirement is violated, the product
timated cost of the best search path from the is still extremely close to the MAP probability
nth layer to the leaf nodes of the search tree. (Yuan et al., 2004). This suggests using it as an
In order to guarantee the optimality of the so- -admissible heuristic function (Pearl, 1988).
lution, h(n) should be admissible, which in this The curve Greedy Guess Estimate in Figure 1
case means that it should be an upper-bound on shows that with the increase of the number of
the value of any assignment with the currently MAP variables, the ratio between the greedy
instantiated MAP variables as its elements. guess and the accurate estimate of the optimal
probability diverges from the ideal ratio one al-
3.2 Heuristic Function with Dynamic
though not always monotonically.
Weighting
Definition 1. A heuristic function h2 is said to 3.2.2 Dynamic Weighting
be more informed than h1 if both are admissible Since greedy guess is a tight lower bound on
and h2 is closer to the optimal cost. the optimal probability of MAP, it is possible
For the MAP problem, the probability of the to compensate for the error between the greedy
optimal assignment Popt < h2 < h1 . guess and the optimal probability. We can do
this by adding a weight to the greedy guess such
Theorem 1. If h2 is more informed than h1
that the product of them is equal to or larger
then A2 dominates A1 . (Pearl, 1988)
than the optimal probability for each inner node
The power of the heuristic function is mea- in the search tree. This, it turns out, yields an
sured by the amount of pruning induced by h(n) excellent -admissible heuristic function. This
and depends on the accuracy of this estimate. assumption can be represented as follows:
If h(n) estimates the completion cost precisely
(h(n) = Popt ), then A* will only expand nodes {PGreedyGuess (1 + ) Popt
on the optimal path. On the other hand, if no 0 0 0
 (PGreedyGuess (1 +  ) Popt )  <  } ,
heuristic at all is used, (for the MAP problem
this amounts to h(n) = 1), then a uniform-cost where  is the minimum weight that can guar-
search ensues, which is far less efficient. So it antee the heuristic function to be admissible.
is critical for us to find an admissible and tight Figure 1 shows that if we just keep  constant,
h(n) to get both accurate and efficient solutions. neglecting the changes of the estimate accuracy
3.2.1 Greedy Guess with the increase of the MAP variables, the es-
If each variable in the MAP set M is condi- timate function and the optimal probability can
tionally independent of all remaining MAP vari- be represented by the curve Constant Weight-
ables (this is called exhaustive independence), ing Heuristic. Obviously, the problem with this
then the MAP problem amounts to a simple idea is that it is less informed when the search
computation based on the greedy chain rule. progresses, as there are fewer MAP variables to
We instantiate the MAP variable in the current estimate.
search layer to the state with the largest proba- Dynamic Weighting (Pohl, 1973) is an effi-
bility and repeat this for each of the remaining cient tool for improving the efficiency of A*
MAP variables one by one. The probability of search. If applied properly, it will keep the
MAP is then heuristic function admissible while remaining
tight on the optimal probability. For MAP, in
n
P (M |E) =
Y
max P (Mij |M(i1)k . . . M1m , E) . the shallow layer of the search tree, we get more
i=1
j MAP variables than the deeper layer for esti-
(4) mate. Hence, the greedy estimate will be more

Dynamic Weighting A* Search-based MAP Algorithm for Bayesian Networks 281


likely to diverge from the optimal probability. will lead the weighted heuristic to be inadmis-
We propose the following Dynamic Weighting sible. Let us give this a closer look and analyze
Heuristic Function for the xth layer of the search the conditions under which the algorithm fails
tree of n MAP variables: to achieve optimality. Suppose there are two
n (x + 1) candidate assignments: s1 and s2 with proba-
h(x) = GreedyGuess (1 + ) bilities p1 and p2 respectively, among which s2
n
( ) . is the optimal assignment that the algorithm
fails to find. And s1 is now in the last step of
Rather than keeping the weight constant search which will lead to a suboptimal solution.
throughout the search, we dynamically change We skip the logarithm in the function for the
it so as to make it less heavy as the search goes sake of clarity here (then the cost function f is a
deeper. In the last step of the search (x = n1), product of transformed g and h instead of their
the weight will be zero, since the greedy guess sum).
for only one MAP variable is exact and then the
f1 = g1 h1 and f2 = g2 h2
cost function f(n-1) is equal to the probability
of the assignment. Figure 1 shows an empiri- The error introduced by a inadmissible h2 is
cal comparison of greedy guess, constant, and f1 > f2 . The algorithm will then find s1 in-
dynamic weighting heuristics against accurate stead of s2 , i.e.,
estimate of the probability. We see that the dy-
f1 > f 2 g 1 h 1 > g 2 h 2 .
namic weighting heuristic is more informed than
constant weighting. Since s1 is now in the last step of search, f1 = p1
(Section 3.2.2). Now, suppose that we have
0
1.6 an ideal heuristic function h2 , which leads to
0
1.4 p2 = g2 h2 . Then we have:
Q u a lity o f H (X ) (h (x )/o p tim a l)

1.2
g1 h1 g2 h2 p1 g2 h2 p1 h2
> 0 > 0 > 0 .
1
p2 g2 h2 p2 g2 h2 p2 h2
0.8

0.6
It is clear that only when the ratio between
0.4 JUHHG\JXHVV the probability of suboptimal assignment and
FRQVWDQWZHLJKWLQJ
0.2 G\QDPLFZHLJKWLQJ the optimal one is larger than the ratio be-
0
RSWLPDO tween the inadmissible heuristic function and
0 5 10 15 20 25 30 35 40 the ideal one, will the algorithm find a subop-
T h e n u m b e r o f M A P v a ria b le s
timal solution. Because of large asymmetries
Figure 1: An empirical comparison of the con- among probabilities that are further amplified
stant weighting and the dynamic weighting by their multiplicative combination (Druzdzel,
heuristics based on greedy guess. 1994), we can expect that for most cases, the
ratios between p1 and p2 are far less than 1.
Even though the heuristic function will some-
3.3 Searching with Inadmissible times break the rule of admissibility, if only the
Heuristics for MAP Problem greedy guess is not too divergent from the ideal
Since the minimum weight  that can guarantee estimate, the algorithm will still achieve opti-
the heuristic function to be admissible is un- mality. Our simulation results also confirm the
known before the MAP problem is solved, and robustness of the algorithm.
it may vary between cases, we normally set to
3.4 Improvements to the Algorithm
be a safe parameter that is supposed to be larger
than  (In our experiments, we set to be 1.0). There are two main techniques that we used to
However, if is accidentally smaller than , it improve the efficiency of the basic A* algorithm.

282 X. Sun, M. J. Druzdzel, and C. Yuan


3.4.1 Relevance Reasoning settings for all the three algorithms above dur-
The main problem faced by the decision- ing comparison, unless otherwise stated.
theoretic approach is the complexity of prob-
4.1 Experimental Design
abilistic reasoning. The critical factor in ex-
act inference schemes for Bayesian networks The Bayesian networks that we used in our
is the topology of the underlying graph and, experiments included Alarm (Beinlich et al.,
more specifically, its connectivity. The frame- 1989), Barley (Kristensen and Rasmussen,
work of relevance reasoning (Druzdzel and 2002), CPCS179 and CPCS360 (Pradhan et al.,
Suermondt (1994) provides an accessible sum- 1994), Diabetes (Andreassen et al., 1991), Hail-
mary of the relevant techniques) is based on finder (Abramson et al., 1996), Munin (An-
d-separation and other simple and computa- dreassen et al., 1989), Pathfinder (Heckerman,
tional efficient techniques for pruning irrelevant 1990), Andes, and Win95pts (Heckerman et al.,
parts of a Bayesian network and can yield sub- 1995). We also tested the algorithms on two
networks that are smaller and less densely con- large proprietary diagnostic networks built at
nected than the original network. Relevance the HRL Laboratories (HRL1 and HRL2). We
reasoning is an integral part of the SMILE li- divided the networks into three groups: (1)
brary (Druzdzel, 2005) on which the implemen- small and middle-sized, (2) large but tractable,
tation of our algorithm is based. and (3) hard networks.
For MAP, our focus is the set of variables For each network, we randomly generated 20
M and the evidence set E. Parts of the model cases. For each case, we randomly chose 20
that are probabilistically independent from the MAP variables from among the root nodes. We
nodes in M given the observed evidence E are chose the same number of evidence nodes from
computationally irrelevant to reasoning about among the leaf nodes. Following tests of MAP
the MAP problem. Removing them leads to algorithms published earlier in the literature, we
substantial savings in computation. set the search time limit to be 3, 000 seconds.

3.4.2 Dynamic Ordering 4.2 Results of First & Second Group


As the search tree is constructed dynamically, We firstly ran the P-Loc, P-Sys, An-
we have the freedom to order the variables in a nealedMAP and DWA* on all networks in the
way that will improve the efficiency of DWA*. first and second group. The P-Sys is an exact
Expanding nodes with the largest asymmetries algorithm. So Table 2 only reports the number
in marginal probability distribution leads to of MAP problems that were solved optimally by
early cut-off of less promising branches of the the rest three algorithms. DWA* found all opti-
search tree. We use the entropy of the marginal mal solutions. The P-Loc missed only one case
probability distributions as a measure of asym- on Andes and the AnnealedMAP missed one
metry. on Hailfinder and two cases on Andes.
Since both AnnealedMAP and P-Loc
4 Experimental Results
failed to find all optimal solutions in Andes, we
To test DWA*, we compared its performance studied the performance of the four algorithms
in real Bayesian networks with those of current as a function of the number of MAP variables
state of the art MAP algorithms: the P-Loc (we randomly generated 20 cases for each num-
and P-Sys algorithms (Park and Darwiche, ber of MAP variables) on it.
2001; Park and Darwiche, 2003) implemented Because P-Sys failed to generate any result
in SamIam, and AnnealedMAP (Yuan et al., when the number of MAP variables reached 40,
2004) in SMILE respectively. We implemented while DWA* found all largest probabilities, we
DWA* in C++ and performed our tests on a 3.0 subsequently compared all the other three al-
GHz Pentium D Windows XP computer with gorithms with DWA*. With the increase of
2GB RAM. We used the default parameters and the number of MAP variables, both P-Loc and

Dynamic Weighting A* Search-based MAP Algorithm for Bayesian Networks 283


Group Networks P-Loc A-MAP A* #MAP P-Loc<DWA* P-Loc>DWA*
1 Alarm 20 20 20 10 0 0
CPCS179 20 20 20 20 0 0
CPCS360 20 20 20 30 0 0
Hailfinder 20 19 20 40 1 0
Pathfinder 20 20 20 50 2 0
Andes 19 18 20 60 2 1
Win95pts 20 20 20 70 3 2
2 Munin 20 20 20 80 5 0
HRL1 20 20 20
HRL2 20 20 20 Table 3: The number of cases in which the P-
Loc algorithm found larger/smaller probabili-
Table 1: Number of cases solved optimally out ties than DWA* in network Andes when spend-
of 20 cases for the first and second group. ing a little bit more time than DWA*.

#MAP P-Sys P-Loc A-MAP


10 0 0 0 also compared the efficiency of the algorithms.
20 0 1 2 Table 4 reports the average running time of
30 0 1 0 the four algorithms on the first and the sec-
40 TimeOut 4 4 ond groups of networks. For the first group,
50 TimeOut 6 2
60 TimeOut 5 2 P-Sys P-Loc A-MAP A*
70 TimeOut 6 5 Alarm 0.017 0.020 0.042 0.005
80 TimeOut 6 1 CPCS179 0.031 0.117 0.257 0.024
CPCS360 0.045 75.20 0.427 0.072
Table 2: Number of cases in which DWA* Hailfinder 2.281 0.109 0.219 0.266
found more probable instantiation than the Pathfinder 0.052 0.056 0.098 0.005
other three algorithms (network Andes). Andes 14.49 1.250 4.283 2.406
Win95pts 0.035 0.041 0.328 0.032
AnnealedMAP turned out to be less accu- Munin 3.064 4.101 19.24 1.763
rate than DWA* on Andes. When the num- HRL1 0.493 51.18 2.831 0.193
ber of MAP variables was above 40, there were HRL2 0.092 3.011 2.041 0.169
about 25% cases of P-Loc and 15% cases in
Table 4: Running time (in seconds) of the four
which AnnealedMAP found smaller probabil-
algorithms on the first and second group.
ities than DWA*. We notice from Table 2 that
P-Loc spent less time than DWA* when using
its default settings for Andes, so we increased the AnnealedMAP, P-Loc and P-Sys algo-
the search steps of P-Loc such that it spent rithms showed similar efficiency on all except
the same amount of time as DWA* in order to the CPCS360 and Andes networks. DWA* gen-
make a fair comparison. However, in practice erated solutions with the shortest time on av-
the search time is not continuous in the number erage. Its smaller variance of the search time
of search steps, so we just chose parameters for indicates that DWA* is more stable across dif-
P-Loc such that it spent only a little bit more ferent networks.
time than DWA*. Table 3 shows the compar- For the second group, which consists of large
ison results. We can see that after increasing Bayesian networks, P-Sys, AnnealedMAP
the search steps of P-Loc, DWA* still main- and DWA* are all efficient. DWA* still spends
tains better accuracy. shortest time on average, while the P-Loc is
In addition to the precision of the results, we much slower on the HRL1 network.

284 X. Sun, M. J. Druzdzel, and C. Yuan


4.3 Results of Third Group leaf nodes. The running times are shown in Fig-
The third group consists of two complex ure 2. Typically, P-Sys and P-Loc need more
Bayesian networks: Barley and Diabetes, many running time in face of more complex problems,
nodes of which have more than 10 different while AnnealedMAP and DWA* seem more
states. Because the P-Sys algorithm did not robust in comparison.
produce results within the time limit, the only

available measure of accuracy was a relative one:
 P-Sys
which of the algorithms found an assignment 
P-L o c
A n n e a lin g M A P

R unning T ime (m)


with higher probability. Table 5 lists the num- 
D W A*

ber of cases that were solved differently between 



the P-Loc, AnnealedMAP, and DWA* algo-

rithms. PL , PA and P stand for the proba- 
bility of MAP solutions found by P-Loc, An- 
nealedMAP and DWA* respectively. 


Group3 P >PL /P <PL P >PA /P <PA Number of MAP Variables(100 Evidences)
Barley 3/2 5/3
Diabetes 5/0 4/0 Figure 2: Running time of the four algorithms
when increasing the number of MAP nodes on
Table 5: The numberof cases that are solved the Munin network.
differently from P-Loc, AnnealedMAP and
DWA*.

For Barley, the accuracy of the three algo- 5 Discussion


rithms was quite similar. However, for Diabetes
Solving MAP is hard. By exploiting asymme-
DWA* was more accurate: it found solutions
tries among the probabilities of possible ele-
with largest probabilities for all 20 cases, while
ments of the joint probability distributions of
P-Loc failed to find 5 and AnnealedMAP
MAP variables, DWA* is able to greatly reduce
failed to find 4 of them.
the search space and lead to efficient and accu-
Group3 P-Sys P-Loc A-MAP A* rate solutions of MAP problems. Our experi-
Barley TimeOut 68.63 31.95 122.1 mental results also show that generally, DWA*
Diabetes TimeOut 338.4 163.4 81.8
is more efficient than the existent algorithms.
Table 6: Running time (in seconds) of the four Especially for large and complex Bayesian net-
algorithms on the third group. works, when the exact algorithm fails to gen-
erate any result within a reasonable time, the
DWA* turns out to be slower than P-Loc DWA* can still provide accurate solutions effi-
and AnnealedMAP on Barley but more effi- ciently. Further extension of this research is to
cient on Diabetes (see Table 6). apply DWA* to the K-MAP problem, which is
to find k most probable assignments for MAP
4.4 Results of Incremental MAP Test variables. It is very convenient for DWA* to
Out last experiment focused on the robustness achieve that, since in the process of finding the
of the four algorithms to the number of MAP most probable assignment the algorithm keeps
variables. In this experiment, we set the num- all the candidate assignments in the search fron-
ber of evidence nodes to 100, generated MAP tier. We can expect that the additional search
problems with an increasing number of MAP time will be linear in k.
nodes. We chose the Munin network, because In sum, DWA* enriches the approaches for
it seems the hardest network among the group solving MAP problem and extends the scope of
1 & 2 and has sufficiently large sets of root and MAP problems that can be solved.

Dynamic Weighting A* Search-based MAP Algorithm for Bayesian Networks 285


Acknowledgements D. Heckerman, J. Breese, and K. Rommelse. 1995.
This research was supported by the Air Force Decision-theoretic troubleshooting. Communica-
tions of the ACM, 38:4957.
Office of Scientific Research grants F4962003
10187 and FA95500610243 and by Intel Re- D. Heckerman. 1990. Probabilistic similarity net-
search. We thank anonymous reviewers for sev- works. Networks, 20(5):607636, August.
eral insightful comments that led to improve- K. Kristensen and I.A. Rasmussen. 2002. The use
ments in the presentation of the paper. All of a Bayesian network in the design of a decision
experimental data have been obtained using support system for growing malting barley with-
out use of pesticides. Computers and Electronics
SMILE, a Bayesian inference engine developed
in Agriculture, 33:197217.
at the Decision Systems Laboratory and avail-
able at http://genie.sis.pitt.edu/. J. D. Park and A. Darwiche. 2001. Approximat-
ing MAP using local search. In Proceedings of
the 17th Conference on Uncertainty in Artificial
References Intelligence (UAI01), pages 403410, Morgan
Kaufmann Publishers San Francisco, California.
B. Abramson, J. Brown, W. Edwards, A. Murphy,
and R. Winkler. 1996. Hailfinder: A Bayesian J. D. Park and A. Darwiche. 2003. Solving MAP
system for forecasting severe weather. Interna- exactly using systematic search. In Proceedings
tional Journal of Forecasting, 12(1):5772. of the 19th Conference on Uncertainty in Artifi-
cial Intelligence (UAI03), pages 459468, Mor-
S. Andreassen, F. V. Jensen, S. K. Andersen, gan Kaufmann Publishers San Francisco, Califor-
B. Falck, U. Kjrulff, M. Woldbye, A. R. nia.
Srensen, A. Rosenfalck, and F. Jensen. 1989.
MUNIN an expert EMG assistant. In John E. J. D. Park. 2002. MAP complexity results and ap-
Desmedt, editor, Computer-Aided Electromyogra- proximation methods. In Proceedings of the 18th
phy and Expert Systems, chapter 21. Elsevier Sci- Conference on Uncertainty in Artificial Intelli-
ence Publishers, Amsterdam. gence (UAI02), pages 388396, Morgan Kauf-
mann Publishers San Francisco, California.
S. Andreassen, R. Hovorka, J. Benn, K. G. Olesen,
and E. R. Carson. 1991. A model-based ap- J. Pearl. 1988. Heuristics : intelligent search
proach to insulin adjustment. In M. Stefanelli, strategies for computer problem solving. Addison-
A. Hasman, M. Fieschi, and J. Talmon, editors, Wesley Publishing Company, Inc.
Proceedings of the Third Conference on Artificial
Intelligence in Medicine, pages 239248. Springer- I. Pohl. 1973. The avoidance of (relative) catas-
Verlag. trophe, heuristic competence, genuine dynamic
weighting and computational issues in heuristic
I. Beinlich, G. Suermondt, R. Chavez, and problem solving. In Proc. of the 3rd IJCAI, pages
G. Cooper. 1989. The ALARM monitoring sys- 1217, Stanford, MA.
tem: A case study with two probabilistic in-
ference techniques for belief networks. In In M. Pradhan, G. Provan, B. Middleton, and M. Hen-
Proc. 2nd European Conf. on AI and Medicine, rion. 1994. Knowledge engineering for large be-
pages 38:247256, Springer-Verlag, Berlin. lief networks. In Proceedings of the Tenth An-
nual Conference on Uncertainty in Artificial In-
Marek J. Druzdzel and Henri J. Suermondt. 1994. telligence (UAI94), pages 484490, San Mateo,
Relevance in probabilistic models: Backyards in CA. Morgan Kaufmann Publishers, Inc.
a small world. In Working notes of the AAAI
1994 Fall Symposium Series: Relevance, pages S. E. Shimony. 1994. Finding MAPs for belief net-
6063, New Orleans, LA (An extended version of works is NP-hard. Artificial Intelligence, 68:399
this paper is in preparation.), 46 November. 410.

M. J. Druzdzel. 1994. Some properties of joint prob- C. Yuan, T. Lu, and M. J. Druzdzel. 2004. Annealed
ability distributions. In Proceedings of the 10th MAP. In Proceedings of the 20th Annual Con-
Conference on Uncertainty in Artificial Intelli- ference on Uncertainty in Artificial Intelligence
gence (UAI94), pages 187194, Morgan Kauf- (UAI04), pages 628635, AUAI Press, Arling-
mann Publishers San Francisco, California. ton, Virginia.

M. J. Druzdzel. 2005. Intelligent decision support


systems based on smile. Software 2.0, 2:1233.

286 X. Sun, M. J. Druzdzel, and C. Yuan


A Short Note on Discrete Representability of Independence
Models
Petr Simecek
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
Pod Vodarenskou vez 4,
182 08 Prague, Czech Republic

Abstract
The paper discusses the problem to characterize collections of conditional independence
triples (independence model) that are representable by a discrete distribution. The known
results are summarized and the number of representable models over four elements set,
often mistakenly claimed to be 18300, is corrected. In the second part, the bounds for the
number of positively representable models over four elements set are derived.

1 Introduction 2 Independence models


Conditional independence relationships occur For the readers convenience, auxiliary results
naturally among components of highly struc- related to independence models are recalled in
tured stochastic systems. In a field of graphical this section.
Markov models, a graph (nodes connected by Throughout the paper, the singleton {a} will
edges) is used to represent the CI structure of a be shorten by a and the union of sets A B will
set of probability distributions. be written simply as juxtaposition AB. A ran-
Given the joint probability distribution of dom vector = (a )aN is a collection of ran-
a collection of random variables it is easy to con- dom variables indexed by a finite set N . For
struct the list of all CIs among them. On the A N , a subvector (a )aA is denoted by A ;
other hand, given the list (here called an inde- is presumed to be a constant. Analogously,
pendence model) of CIs an interesting question if x = (xa )aN is a constant vector then xA is
arises whether there exists a collection of dis- an appropriate coordinate projection.
crete random variables meeting these and only Provided A, B, C are pairwise disjoint subsets
these CIs, i.e. representing that model. of N , A B | C stands for a statement A
The problem of probabilistic representability and B are conditionally independent given C .
comes originally from J. Pearl, cf. (Pearl, 1998). In particular, unconditional independence (C =
It was proved by M. Studeny in (Studeny, 1992) ) is abbreviated as A B .
that there is no finite characterization (finite A random vector = (a )aN is called dis-
set of inference rules) of the set of representable crete if each a takes values in a state space
independence models. Xa such that 1 < |Xa | < . A discrete random
The interesting point is that for the fixed vector is called positive if for any appropriate
number of variables (or vertices) the number constant vector x
of representable independence models is much
higher than the number of graphs. Therefore, 0 < P ( = x) < 1.
even partial characterization of representable
In the case of discretely distributed random vec-
independence models may help to improve and
tor, variables a and b are independent1 given
understand the limits of learning of Bayesian
networks and Markov models. 1
The independence relation between random vectors
C iff for any appropriate constant vector xabC on N such that
P ( abC = xabC )P ( C = xC ) = hab|Ci I h(a)(b)|(C)i I ,
P ( aC = xaC )P ( bC = xbC ). where (C) stands for {(c); c C}. See Fig-
Let N be a finite set and TN denotes the set ure 2 for an example of three isomorphic models.
of all pairs hab|Ci such that ab is an (unordered) An equivalence class of independence models
couple of distinct elements of N and C N \ab. with respect to the isomorphic relation will be
Subsets of TN will be referred as formal in- referred as type.
dependence models over N . Independence
models and TN are called trivial.
The independence model I() induced by a
random vector indexed by N is the indepen-
dence model over N defined as follows
I() = {hab|Ci; a b |C } .
Figure 2: Example of three isomorphic models.
Let us emphasize that an independence model
I() uniquely determines also all other condi- An independence model I is said to be repre-
tional independences among subvectors of , cf. sentable2 if there exists a discretely distributed
(Matus, 1992). random vector such that I = I(). In addi-
Diagrams proposed by R. Lnenicka will be tion, a special attention will be devoted to pos-
used for a visualisation of independence model itive representations, i.e. representations by
I over N such that |N | 4. Each element of a positive discrete distribution.
N is plotted as a dot. If hab|i I then dots Let us note that isomorphic models are either
corresponding to a and b are joined by a line. If all representable or non-representable. Conse-
hab|ci I then we put a line between dots cor- quently, we can classify types as representable
responding to a and b and add small line in the and non-representable.
middle pointing in cdirection. If both hab|ci Lemma 1. If I = I() and I = I( ) are
and hab|di are elements of I, then only one line representable independence models then the in-
with two small lines in the middle is plotted. dependence model I I is also representable.
Finally, if hab|cdi I is visualised by a brace In particular, if they have positive representa-
between a and b. See example in Figure 1. tions then there exists a positive representation
1 2 of I I , too.
Q Q
Proof. Let X = Xa and X 0 = Xa0 be state
spaces of and 0 , respectively. The required
representation takes place in
Y
b= Xa Xa0
{

X
4 3 aN

Figure1: Diagram of the independence model and it is distributed as follows


I = h12|i, h23|1i, h23|4i, h34|12i, h14|i, P ( = (xa , x0a )aN ) =
h14|2i . P ( = (xa )aN ) P ( 0 = (x0a )aN ).

Independence models I and I over N will be See (Studeny and Vejnarova, 1998), pp. 5, for
called isomorphic if there exists a permutation more details.
2
A , B given C might be defined analogously. However, Of course, it is also possible to consider repre-
we will see that such relationships are uniquely deter- sentability in other distribution frameworks that the dis-
mined by the elementary ones (|A| = |B| = 1). crete distributions, cf. (Lnenicka, 2005).

288 P. imeek
Figure 3: Irreducible models over N = {1, 2, 3, 4}.

A Short Note on Discrete Representability of Independence Models 289


Lemma 2. Let a, b, c be distinct elements of N if there exists A C such that
and D N \ abc. If an independence models I \
over N is representable, then I= C.
CA

{hab|cDi, hac|Di} I  There are only only 13 types of irreducible
{hac|bDi, hab|Di} I . models, see Figure 3 or (Studeny and Bocek,
1994), pp. 277278. The problematic point is
Moreover, if I is positively representable, then
the total number of representable independence
 models over N . It has been believed that this
{hab|cDi, hac|bDi} I =  number is 18300, cf. (Studeny, 2002), (Lau-
{hab|Di, hac|Di} I .
ritzen and Richardson, 2002), (Robins et al.,
Proof. These are so called semigraphoid and 2003), (Simecek, 2006). . . However, working on
graphoid properties, cf. (Lauritzen, 1996) for this paper I have discovered that there actu-
the proof. ally exist 18478 different representable indepen-
dence models over N of 1098 types. The list of
3 Representability of Independence models has been checked by several programs
Models over N = {1, 2, 3, 4} including SG POKUS written by M. Studeny
and P. Bocek. The list can be downloaded from
For N consisting of three or less elements, all the web page
independence models not contradicting proper-
http://5r.matfyz.cz/skola/models
ties from Lemma 2 are representable, resp. pos-
itively representable, cf. (Studeny, 2005). That 3.2 Positive Representability
is why we focus on N = {1, 2, 3, 4} from now to Only a little is known about positive rep-
the end of the paper. resentability. This paper would like to be
The first subsection summarizes known re- the first step to the complete characterisation
sults related to (general) representability of in- of positively representable models over N =
dependence models. The second subsection is {1, 2, 3, 4}.
devoted to positive representability. Obviously, a set of positively representable
models is a subset of the set of (generally) repre-
3.1 General Representability
sentable models. In addition, positively repre-
The problem was solved in a brilliant series sentable model must fulfill properties following
of papers (Matus and Studeny, 1995), (Matus, the second part of Lemma 2 and Lemma 3 be-
1995) and (Matus, 1999) by F. Matus and low.
M. Studeny. The final conclusions are clearly
Lemma 3. Let a, b, c, d be distinct elements of
and comprehensibly presented in (Studeny and
N . If I is a positively representable indepen-
Bocek, 1994) and (Matus, 1997)3 .
dence model oven N such that
In brief, due to Lemma 1 an intersection of
two representable models is also representable. {hab|cdi, hcd|abi, hcd|ai} I,
Therefore, the class of all representable models
over N can be described by a set of so called then
irreducible models C, i.e. nontrivial repre- hcd|bi I hcd|i I.
sentable models that cannot be written as an Proof. See (Spohn, 1994), pp. 15.
intersection of two other representable models.
It is not difficult to evidence that a nontrivial in- There are 5547 (generally) representable
dependence model I is representable if and only models (356 types) meeting requirements on for
positively representable models from Lemma 2
3
To avoid confusion, note that (Matus, 1997) contains and Lemma 3. This is the upper bound to the
a minor typo in Figure 14 on pp. 21. Over the upper
line in the first two diagrams should be instead of . set of all positively representable models.

290 P. imeek
Figure 4: Undecided types.

A Short Note on Discrete Representability of Independence Models 291


Again, this class of models A is generated by R. Lnenicka. 2005. On gaussian conditional inde-
its subset C by the operation of intersection. pendence structures. Technical report, UTIA AV
The elements of C can be found by starting with CR.
an empty set C and in each step adding the el- F. Matus and M. Studeny. 1995. Conditional inde-
ement of A not yet generated by C with the pendences among four random variables I. Com-
greatest size. Actually, C contains (nontrivial) binatorics, Probability & Computing, 4:269278.
models of 23 types (may be downloaded from F. Matus. 1992. On equivalence of markov prop-
the mentioned web page). erties over undirected graphs. Journal of Applied
Probability, 29:745749.
To find some lower bound to the set of pos-
F. Matus. 1995. Conditional independences among
itively representable models, large amount of four random variables II. Combinatorics, Proba-
positive binary (sample space {0, 1}N ) ran- bility & Computing, 4:407417.
dom distributions have been randomly gener-
F. Matus. 1997. Conditional independence struc-
ated. The description of the generating pro- tures examined via minors. The Annals of Math-
cess will be omitted here4 , see the above men- ematics and Artificial Intelligence, 21:99128.
tioned web page for the list of corresponding F. Matus. 1999. Conditional independences among
independence models and their binary represen- four random variables III. Combinatorics, Proba-
tations. Using Lemma 1 we obtained 4555 (299 bility & Computing, 8:269276.
types) different positive representations of inde- J. Pearl. 1998. Probabilistic Reasoning in Intelligent
pendence models. The remaining problematic Systems. Morgan Kaufmann.
57 types are plotted in Figure 4.
J. M. Robins, R. Scheines, P. Spirtes, and L. Wasser-
man. 2003. Uniform consistency in causal infer-
ence. Biometrika, 90:491515.
3.3 Conclusion
P. Simecek. 2006. Gaussian representation of in-
Let us summarize the results into the conclud- dependence models over four random variables.
ing theorem. Accepted to COMPSTAT conference.
Theorem 1. There are 18478 different (gener- W. Spohn. 1994. On the properties of conditional
ally) representable independence models (1098 independence. In P. Humphreys, editor, Patrick
Suppes: Scientific Philosopher, volume 1, pages
types) over the set N = {1, 2, 3, 4}. There are
173194. Kluwer.
between 4555 and 5547 different positively rep-
resentable independence models (299356 types) M. Studeny and P. Bocek. 1994. CI-models arising
among 4 random variables. In Proceedings of the
over the set N = {1, 2, 3, 4}. 3rd workshop WUPES, pages 268282.
Acknowledgments M. Studeny and J. Vejnarova. 1998. The multiinfor-
mation function as a tool for measuring stochastic
This work was supported by the grants GA CR dependence. In M. I. Jordan, editor, Learning in
n. 201/04/0393 and GA CR n. 201/05/H007. Graphical Models, pages 261298. Kluwer.
M. Studeny. 1992. Conditional independence re-
lations have no finite complete characterization.
References In S. Kubk and J.A. Vsek, editors, Information
S. L. Lauritzen and T. S. Richardson. 2002. Chain Theory, Statistical Decision Functions and Ran-
graph models and their causal interpretations. dom Processes. Transactions of the 11th Prague
Journal of the Royal Statistical Society: Series B Conference, volume B, pages 377396. Kluwer.
(Statistical Methodology), 64:321361. M. Studeny. 2002. On stochastic conditional inde-
pendence: the problems of characterization and
S. L. Lauritzen. 1996. Graphical Models. Oxford description. Annals of Mathematics and Artifi-
University Press. cial Intelligence, 35:241323.
4
Briefly, the parametrization of joint distribution by M. Studeny. 2005. Probabilistic Conditional Inde-
moments has been used. Further, both systematic and pendence Structures. Springer.
random search through parameter space was performed.

292 P. imeek
Evaluating Causal e e ts using Chain Event Graphs
Peter Thwaites and Jim Smith
University of Warwi k Statisti s Department
Coventry, United Kingdom
Abstra t
The Chain Event Graph (CEG) is a oloured mixed graph used for the representation
of nite dis rete distributions. It an be derived from an Event Tree (ET) together
with a set of equivalen e statements relating to the probabilisti stru ture of the ET.
CEGs are espe ially useful for representing and analysing asymmetri pro esses, and
olle tions of implied onditional independen e statements over a variety of fun tions
an be read from their topology. The CEG is also a valuable framework for expressing
ausal hypotheses, and manipulated-probability expressions analogous to that given
by Pearl in his Ba k Door Theorem an be derived. The expression we derive here
is valid for a far larger set of interventions than an be analysed using Bayesian Net-
works (BNs), and also for models whi h have insu ient symmetry to be des ribed
adequately by a Bayesian Network.
1 Introdu tion this in ludes all the properties that an be
read from the equivalent BN if the CEG rep-
Bayesian Networks are good graphi al repre- resents a symmetri model and far more if the
sentations for many dis rete joint probabil- model is asymmetri .
ity distributions. However, many asymmet- In the next se tion we show how CEGs
ri models (by whi h we mean models with an be onstru ted. We then des ribe how we
non-symmetri sample spa e stru tures) an- an use CEGs to analyse the e e ts of Causal
not be fully des ribed by a BN. Su h pro- manipulation.
esses arise frequently in, for example, bio- Bayesian Networks are often extended to
logi al regulation, risk analysis and Bayesian apply also to a ontrol spa e. When it is valid
poli y analysis. to make this extension the BN is alled ausal.
In eli iting these models it is usually sen- Although there is debate (Pearl, 2000; Lau-
sible to start with an Event Tree (Shafer, ritzen, 2001; Dawid, 2002) about terminol-
1996), whi h is essentially a des ription of ogy, it is ertainly the ase that BNs are use-
how the pro ess unfolds rather than how the ful for analysing (in Pearl's notation) the ef-
system might appear to an observer. Work- fe ts of manipulations of the form Do X = x0
ing with an ET an be quite umbersome, but in symmetri models, where X is a variable
they do re e t any model asymmetry, both to be manipulated and x0 the setting that
in model development and in model sample this variable is to be manipulated to. This
spa e stru ture. type of intervention, whi h might be termed
The Chain Event Graph (Ri omagno & atomi , is a tually a rather oarse manipula-
Smith, 2005; Smith & Anderson, 2006; tion sin e we would need to extend the spa e
Thwaites & Smith, 2006a) is a graphi al to make predi tions of e e ts when X is ma-
stru ture designed for analysis of asymmetri nipulated to any value. Although there is
systems. It retains the advantages of the ET, a ase for only onsidering su h manipula-
whilst typi ally having far fewer edges and tions when a model is very symmetri , it is
verti es. Moreover, the CEG an be read for too oarse to apture many of the manipu-
a ri h olle tion of onditional independen e lations we might want to onsider in asym-
properties of the model. Unlike Jaeger's very metri environments. We use this paper to
useful Probabilisti De ision Graph (2002) show how CEGs an be used to analyse a far
more re ned singular manipulation in models (1 on the ET in Figure 1); A he ked, fault
whi h may have insu ient symmetry to be found, ma hine swit hed o (2 ); A he ked,
des ribed adequately by a Bayesian Network. no fault found, B he ked, fault found, ma-
hine swit hed o (3 ).
2 CEG onstru tion If A is found faulty it is repla ed and the
We an produ e a CEG from an Event Tree ma hine swit hed ba k on (vertex v1 ), and
whi h we believe represents the model (see for B is then he ked. B is then either found
example Figure 1). This ET is just a graph- not faulty (4 ), or faulty and the ma hine
i al des ription of how the pro ess unfolds, swit hed o (5 ).
and the set of atoms of the Event Spa e (or If B is found faulty by the monitoring sys-
path sigma algebra) of the tree is simply the tem, then it is removed and he ked (verti es
set of root to leaf paths within the tree. Any v2 and v3 ). There are then three possibili-
random variables de ned on the tree are mea- ties, whose probabilities are independent of
sureable with respe t to this path sigma alge- whether or not omponent A has been re-
bra. pla ed: B is not in fa t faulty, the ma hine
is reset and restarted (6 ); B is faulty, is su -
9 essfully repla ed and the ma hine restarted
v4 (7 ); B is faulty, is repla ed unsu essfully
1
1
0 and the ma hine is left o until the engineer
9 an see it (8 ).
4 v5 1 At the time of any monitoring y le, the
9
0
quality of the produ t produ ed (10 ) is un-
v6 a e ted by the repla ement of A unless B is
6 10 also repla ed. It is however dependent on the
v7 11 e e tiveness of B whi h depends on its age,
v0
2 5 v2 7
but also, if it is a new omponent, on the age
v1 8
12 of A; so:
v18 (good produ t j A and B repla ed) = 12

> (good produ t j only B repla ed) = 14
> (good produ t j B not repla ed) = 10
9
v8
v3 6 10

3
7 v9 13 An ET for this set-up is given in Figure 1 and
a derived CEG in Figure 2. Note that:
8
14
 The subtrees rooted in the verti es v4 ; v5 ;
v23 v6 and v8 of the ET are identi al (both
in physi al stru ture and in probability
distribution), so these verti es have been
Figure 1. ET for ma hine example. onjoined into the vertex (or position) w4
in the CEG.
Example 1.  The subtrees rooted in v2 and v3 are not
A ma hine in a produ tion line utilises two identi al (as 11 6= 13 ; 12 6= 14 ), but
repla eable omponents A and B. Faults in the edges leaving v2 and v3 arry identi al
these omponents do not automati ally ause probabilities. The equivalent positions in
the ma hine to fail, but do a e t the qual- the CEG w2 and w3 have been joined by
ity of the produ t, so the ma hine in orpo- an undire ted edge.
rates an automated monitoring system, whi h  All leaf-verti es of the ET have been on-
is ompletely reliable for nding faults in A, joined into one sink-vertex in the CEG,
but whi h an dete t a fault in B when it is labelled w1 .
fun tioning orre tly.
In any monitoring y le, omponent A is To omplete the transformation, note that
he ked rst, and there are three initial pos- vi ! wi for 0  i  3, v7 ! w5 and v9 ! w6 .
sibilities: A, B he ked and no faults found
294 P. Thwaites and J. Smith
w4

9

1
6

0

6

w5

4
1

2 5 w2
w0 1 winf
2
w1 8
13

w6 14
3

7
8

w3

Figure 2. CEG for ma hine example.

A formal des ription of the pro ess is as fol- (a) (e(v1 i ; v1 i+1 )) = e( (v1 i ); (v1 i+1 ))
lows: Consider the ET T = (V (T ); E (T )) = e(v2 i ; v2 i+1 ) for 0  i  m()
where ea h element of E (T ) has an asso i- (b) (v1 i+1 j v1 i ) = (v2 i+1 j v2 i )
ated edge probability. Let S (T )  V (T ) be where v1 i+1 and v2 i+1 label the same value
the set of non-leaf verti es of the ET. on the sample spa es X(vi 1 ) and X(vi 2 ) for
Let vi  vj indi ate that there is a path join- i  0.
ing verti es vi and vj in the ET, and that vi The set of equivalen e lasses indu ed by
pre edes vj on this path. the bije tion is denoted K (T ), and the ele-
Let X(v) be the sample spa e of X (v), the ments of K (T ) are alled positions.
random variable asso iated with the vertex v
(X(v) an be thought of as the set of edges De nition 2. For any v1 ; v2 2 S (T ), v1 and
leaving v, so in our example, X(v1 ) = v2 are termed stage-equivalent, i there is a
fB found not faulty; B found faultyg). bije tion  whi h maps the set of edges
E1 = fe1(v1 ; v1 0 ) j v1 0 2 X(v1 )g onto
For any v 2 S (T ), vl 2 V (T )nS (T ) su h that E2 = fe2 (v2 ; v2 0 ) j v2 0 2 X(v2 )g in su h
v  vl : a way that:
 Label v = v 0 (v1 0 j v1 ) = ((v1 0 ) j (v1 )) = (v2 0 j v2 )
 Let vi+1 be the vertex su h that where v1 0 and v2 0 label the same value on the
v i  v i+1  vl for whi h there is no sample spa es X(v1 ) and X(v2 ).
vertex v0 su h that v i  v0  v i+1 for The set of equivalen e lasses indu ed by
i0 the bije tion  is denoted L(T ), and the ele-
 Label vl = v m , where the path  onsists ments of L(T ) are alled stages.
of m edges of the form e(vi ; v i+1 )
A CEG C (T ) of our model is onstru ted as
De nition 1. For any v1 ; v2 2 S (T ), v1 and follows:
v2 are termed equivalent, i there is a bije - (1) V (C (T )) = K (T ) [ fw1g
tion whi h maps the set of paths (and om- (2) Ea h w; w0 2 K (T ) will orrespond to a
ponent edges) set of v; v0 2 S (T ). If, for su h v; v0 , 9
1 = f1 (v1 ; vl1 ) j vl1 2 V (T )nS (T )g onto a dire ted edge e(v; v0 ) 2 E (T ), then 9 a
2 = f2 (v2 ; vl2 ) j vl2 2 V (T )nS (T )g in su h dire ted edge e(w; w0 ) 2 E (C (T ))
a way that: (3) If 9 an edge e(v; vl ) 2 E (T ) st v 2 S (T )

Evaluating Causal effects using Chain Event Graphs 295


and vl 2 V (T )nS (T ), then 9 a dire ted Note that the ombination rules for path
edge e(w; w1 ) 2 E (C (T )) probabilities on CEGs (dire tly analogous to
(4) If two verti es v1 ; v2 2 S (T ) are stage- those for trees) give us that for any 3 posi-
equivalent, then 9 an undire ted edge tions w1 ; w2 ; w3 , with w1  w2  w3 , we have
e(w1 ; w2 ) 2 E (C (T )) that (w3 j w1 ; w2 ) = (w3 j w2 ); that is the
(5) If w1 and w2 are in the same stage (ie: if probability that we pass through position w3
v1 ; v2 are stage-equivalent in E (T )), and given that we have passed through positions
if (v1 0 j v1 ) = (v2 0 j v2 ) then the edges w1 and w2 is simply the probability that we
e(w1 ; w1 0 ) and e(w2; w2 0 ) have the same pass through position w3 given that we have
label or olour in E (C (T )). passed through position w2 .
Note that in our example, w2 and w3 are in
the same stage and that (v6 jv2 ) = (v8 jv3 ), 3 Manipulations of CEGs
(v7 jv2 ) = (v9 jv3 ), (v18 jv2 ) = (v23 jv3 ), so
the edges e(w2 ; w4 ) and e(w3 ; w4 ) are The simplest types of intervention are of the
oloured the same, as are the edges e(w2 ; w5 ) form Do X = x0 for some variable X and
and e(w3 ; w6 ) and as are e(w2 ; w1 ) and setting x0 , and these are really the only in-
e(w3 ; w1 ). terventions that an be satisfa torily analysed
More detail on CEG onstru tion an be using BNs. In this paper we onsider a mu h
found in Smith & Anderson (2006), as an more general intervention where not only the
a detailed des ription of how CEGs are read. setting of the manipulated variable, but the
We on lude se tion 2 of this paper by looking variable itself may be di erent depending on
at two ideas that will be used extensively in the settings of other variables within the sys-
the next se tion. tem.
Firstly, when we say that a position w We an model su h interventions by the
in our CEG or the set of edges leaving w produ tion of a manipulated CEG C^ in par-
have an asso iated variable, we are not refer- allel with our idle CEG C . In the interven-
ing to the measurement-variables of a BN- tion onsidered here every path in our CEG is
representation of the problem, ea h of whi h manipulated by having one omponent edge
must take a value for any atomi event, but given a probability of 1 or 0. All edges with
to a more exible onstru t de ned through zero probabilities and bran hes stemming
stage-equivalen e in the underlying tree. The from su h edges are removed (or pruned) from
exit-edges of a position are simply the olle - C^ (note that in this paper all edges on any
tion of possible immediate out omes in the CEG will have non-zero probabilities). We
next step of the pro ess given the history up will all su h an intervention a singular ma-
to that position. A setting (or value or level) nipulation, and denote it Do Int.
is then simply a possible realisation of a vari-
able in this olle tion. De nition 4. A subset WX of positions of
Se ondly, we use these edge probabilities C quali es as a singular manipulation set if:
to de ne the probabilities of omposite events (1) all root-to-sink paths in C pass through
in the path sigma eld of our CEG: exa tly one position in pa(WX ), where
w 2 pa(WX ) if w  w0 for some w0 2 WX
De nition 3. For two positions w; w0 with and there exists an edge e(w; w0 )
w  w0 , let  (w0 j w) be the probability as- (2) ea h position in pa(WX ) has exa tly one
so iated with the path (w; w0 ). Note that hild in WX , by whi h we mean that for
this will be a produ t of edge probabilities. w 2 pa(WX ), there exists exa tly one
w0 2 WX su h that there exists an edge
X e(w; w0 )
De ne (w0 j w) ,  (w0 j w) A singular manipulation is then an interven-
2 tion su h that:
(a) for ea h w 2 pa(WX ) and orresponding
where  is the set of all paths from w to w0 . w0 2 WX , ^ (w0 j w) = 1

296 P. Thwaites and J. Smith


(b) for any w 2 pa(WX ) and w0 2= WX su h possibly allowing us to sidestep identi ability
that w  w0 and there exists an edge problems asso iated with it.
e(w; w0 ), then ^ (w0 j w) = 0, and this Following Pearl, we have two onditions,
edge is removed (or pruned) in C ^ whi h if satis ed, are su ient for WZ to be
0
( ) for any w 2= pa(WX ) and w su h that onsidered a Ba k Door blo king set. We give
w  w0 and there exists an edge e(w; w0 ), the rst here, and the se ond following a few
then ^ (w0 j w) = (w0 j w) further de nitions.
where ^ is a probability in our manipulated
CEG C^ . (A) For all wj 2 WX , every wj w1 path in
C must pass through exa tly one position
Let WX = fwj g, pa(WX ) = fwj g. Ea h posi- k 2 WZ
i w
tion in pa(WX ) has exa tly one hild in WX so
elements of pa(WX ) an be intially identi ed The obvious notation for use with CEGs is a
by their hild in WX (ie by the path-based one. However most pra titioners
index j ). But a position in WX ould have will be more familiar with expressions su h
more than one parent in pa(WX ), so we dis- as (3.2), so we here develop a few ideas to
tinguish these parents by a se ond index i. allow us to express our ausal expression in a
For ea h pair (wji ; wj ) let Xji be the variable similar fashion. The rst step in this pro ess
asso iated with the edge e(wji ; wj ) and xij be is to note that any position w in a CEG has
the setting of this variable on this edge. a unique set q(w) asso iated with it, where:
If we also onsider a response variable Y  Q(w) is the minimum set of variables,
downstream from the set of positions WX , by spe ifying the settings (or values or
then we an show (using for example Pearl's levels) of whi h, we an des ribe the union
de nition of Do) that: of all w0 w paths
(y j Do Int)
 q(w) are the settings of Q(w) whi h fully
X h i des ribe the union of all w0 w paths
= (wji j w0 ) (y j wj ) (3:1) Formally this means that:
i;j
[
Pearl's own Ba k Door expression (below) is q(w) = q()
a simpli ation of the general manipulated- 2w
probability expression used with BNs. where q() are the settings on the
w0 w path , and w is the set of all w0 w
(y j Do x0 )
paths.
X
= (y j z; x0 ) (z ) (3:2) Letting Z (w) be the set of variables en oun-
z tered on edges upstream of w, X (w) be the set
Z here is a subset of the measurement- of variables en ountered on edges downstream
variables of the BN whi h obey ertain on- of w, and R(w) = Z (w)nQ(w), we note that
ditions. If Z is hosen arefully then we an the onditional independen e statement en-
al ulate (y j Do x0 ) without onditioning oded by the position w is of the form:
on the full set of measurement-variables. X (w) q R(w) j q(w)
In this paper we use the topology of the
CEG to produ e an analogous expression to In the CEG in Figure 2 for example,
(3.2) for our more general singular manipu- X (w4 ) q R(w4 ) j q(w4 ) tells us that produ t
lation, by using a set of positions WZ down- quality is independent of the monitoring sys-
stream from the intervention whi h an stand- tem responses, given that B is not repla ed.
in for the set of positions WX used in expres- Note that the de nition of q(w) means
sion (3.1). As with Pearl's expression, the that the variable-settings within it might not
use of su h a set WZ will redu e the omplex- always orrespond to simple values of the vari-
ity of the general expression (3.1) as well as ables within Q(w). None-the-less, we have

Evaluating Causal effects using Chain Event Graphs 297


found that q(w) is typi ally simpler than ea h  zji (wk ) are the settings of Zji (wk ) whi h
individual q(). (with the addition of Xji = xij ) fully de-
We now use the ideas outlined above to s ribe the union of all w0 wji wj wk
expand the expression (3.1). As the position paths
wk is uniquely de ned by q(wk ), we an write,
without ambiguity (y j wk ) = (y j q(wk )). De nition 5.SWe express this formally as:
Using ondition (A) we get: Let qji (wk ) = 2 q(), where q() are the
settings on the w0 wji wj wk path , and
(y j Do Int) (3:3)
Xh i X i  is the set of all w0 wji wj wk paths
= (wj j w0 ) (wk ; y j wj ) in C. Let Qij (wk ) be the set of variables
X
i;j k present in qji (wk ).
= (wji j w0 ) (y j wj ; wk ) (wk j wj ) De ne Zji (wk ) as Qij (wk )nXji . Let zji (wk ) be
i;j;k
X the settings of Zji (wk ) ompatible with qji (wk ).
= (wji j w0 ) (y j wk ) (wk j wj ) Our Xji ; xij ; Zji (wk ); zji (wk ) are dire tly
i;j;k
XhX i analogous to Pearl's X; x; Z and z in expres-
= (wji jw0 )(wk jwj ) (yjq(wk )) sion (3.2), and ful l similar roles in our nal
k i;j ausal expression.
The following rather te hni al de nitions
The equivalen e of (y j wj ; wk ) and (y j wk ) are only required for an understanding of the
is a onsequen e of the equivalen e of proof. De nition 7 deals with the idea of a
(w3 j w1 ; w2 ) and (w3 j w2 ) noted at the des endant whi h is very similar to the analo-
end of se tion 2, and proved in Thwaites & gous idea in BNs, and is needed for
Smith (2006b). ondition (B).
We now need a number of te hni al def-
initions before we an introdu e our se ond De nition 6.
ondition and pro eed to our ausal expres- De ne q(wji ) analogously with the de nition
sion. Examples illustrating these de nitions of q(w). Let Q(wji ) be the set of variables
an be found in Thwaites & Smith (2006b). present in q(wji ).
Now, the position wk an also be fully de- De ne Qj (wk ) as Zji (wk )nQ(wji ). Note that
s ribed by the union of disjoint events, ea h of Q(wji )  Zji (wk ). Let qj (wk ) be the settings
whi h is (by onditions (1) and (A)) a of Qj (wk ) ompatible with zji (wk ) (or qji (wk )).
w0 wji wk path for some wji 2 pa(WX ). We an therefore write:
These events divide into 2 distin t sets:
(1) w0 wji wj wk paths zji (wk ) = (q(wji ); qj (wk ))
(2) paths that do not utilise the xij edge when qji (wk ) = (xij ; zji (wk )) = (q(wji ); xij ; qj (wk ))
leaving wji (formally w0 wji w0 wk We an also ombine the events in set (2) into
paths where there exists an edge e(wji ; w0 ), C-paths, ea h of whi hS an be uniquely har-
but w0 2= WX ). a terised by rji (wk ) = 2M q(), where q()
We an ombine the events in set (1) into are the settings on the w0 wji w0 wk
omposite or C-paths so that ea h C-path path , and M is the set of all w0 wji w0 wk
passes through exa tly one wji and an be paths in C. We an therefore write:
uniquely hara terised by a pair
qji (wk ) = (xij ; zji (wk )), where zji is de ned as h[ i[h[ i
follows: q(wk ) = qji (wk ) rji (wk )
 Zji (wk ) is the minimum set of variables, i;j i;j
by spe ifying the settings of whi h, we
an des ribe the union of all w0 wji De nition 7. Consider variables A; D; fBm g
wj wk paths (with Xji ex luded from de ned on our CEG C . Then D is a de-
this set) s endant of A in C if there exists a sequen e

298 P. Thwaites and J. Smith


of (not ne essarily adja ent) edges e1 ; : : : en But this is simply the probability that
forming part of a w0 w1 path in C where Qj (wk ) = qj (wk ) given that Xji = xij in the
the edges e1 ; : : : en are labelled respe tively sub-CEG Cji .
b1 j (a; : : :); b2 j (b1 ; : : :); : : : bn 1 j (bn 2 ; : : :); Condition (B) implies that Xji q Qj (wk ) in Cji
d j (bn 1 ; : : :), or if there exists an edge form- sin e Xji has no parents in this CEG. So we
ing part of a w0 w1 path in C labelled get:
d j (a; : : :); where a; b1 ; b2 ; : : : bn 1 ; d are set-
tings of A; B1 ; B2 ; : : : Bn 1 ; D. (wk j wj ) = (qj (wk ) j q(wji ))
We are now in a position to state our 2nd Substituting this into expression (3.3), we get:
ondition. (y j Do Int)
XhX i
(B) In the sub-CEG Cji with wji as root-node, = (wji j w0 ) (qj (wk ) j q(wji ))
Qj (wk ) must ontain no des endants of k i;j
Xji for all i; j; for ea h position wk (y j q(iwk ))
XhX
Che king that ondition (B) is ful lled is a - = (q(wji )) (qj (wk ) j q(wji ))
tually straightforward on a CEG, espe ially k i;j
sin e we will know whi h manipulations we (y j q(wk ))
intend to investigate, and an usually on- XhX i
stru t our CEG so as to make Qj (wk ) as small = (zji (wk )) (y j q(wk )) 
as possible for all values of j; k. k i;j
It is possible to show that the expression
We an now repla e expression (3.3) by a Ba k (y j q(wk )) an be repla ed by a probability
Door expression for singular manipulations: onditioned on a single w0 wk path, and
Proposition 1.
moreover that even on that path Y may well
be independent of some of the variables en-
ountered given the path-settings of the oth-
(y j Do Int) (3:4) ers | for details see Thwaites & Smith
XhX i i (2006b). S
= (zj (wk )) (y j q(wk )) We also noted earlier that q(w) = q()
k i;j is typi ally simpler thanP ea h individual
q(). In most instan es i;j (zji (wk )) will
Proof. be the probability of a union of disjoint events
Consider whi h will also typi ally be simpler than an
(wk j wj ) = (wk j wji ; wj ) individual
P zj (wik ). We an dedu e that al u-
i
lating i;j (zj (wk )) is unlikely to be a om-
= (q(wk ) j q(wji ); xij ) plex task.
h [ i h[ i 
= qn (wk ) [
m rn (wk ) jq(wj ); xj
m i i
We on lude this se tion by demonstrating
m;n m;n
X m how expression (3.4) is related to Pearl's Ba k
i j
= (qn (wk ) q(wj ); xij ) Door expression (3.2):
m;n
X Consider the intervention Do X = x0 , and
j
+ (rnm (wk ) q(wji ); xij ) let Xji = X and xij = x0 for all i; j . Combine
m;n all
S ouri w0 wj wj wk C-paths, and write
i

sin e disjoint. i;j qj (wk ) = (x0 ; z (wk )). Rephrase ondi-


= (qji (wk )jq(wji ); xij ) + (rji (wk )jq(wji ); xij ) tions (2) and (B) as:
= (qji (wk ) j q(wji ); xij ) (2) ea h position in pa(WX ) has exa tly one
of its outward edges in C labelled x0 , and
= (q(wji ); xij ; qj (wk ) j q(wji ); xij ) this edge joins the position in pa(WX ) to
= (qj (wk ) j q(wji ); xij ) a position in WX

Evaluating Causal effects using Chain Event Graphs 299


(B) Z (wk ) must ontain no des endants of X vertex, instead provides a probability distri-
(where Z (Wk ) is de ned from z (wk ) in bution for the outgoing edges of that vertex.
the obvious manner) So not only are CEGs ideal representa-
Then with a little work, we an repla e ex- tions of asymmetri dis rete models, retaining
pression (3.4) by: the onvenien e of a tree-form for des ribing
how a pro ess unfolds but also expressing a
ri h olle tion of the model's onditional in-
(y j Do Int) dependen e properties, but their event-based
X ausal analysis has distin t advantages over
= (z (wk )) (y j x0 ; z (wk )) the variable-based analysis one performs on
k Bayesian Networks.
If Z (wk ) ontains the same variables for all k, Referen es
and z (wk ) runs through all settings of Z (wk )
as we run through all wk , then this expression A.P. Dawid. 2002. In uen e Diagrams for Causal
redu es to Pearl's expression (3.2). Modelling and Inferen e. International Statisti-
al Review 70.
4 Causal Analysis on CEGs and BNs
M. Jaeger. 2002. Probabilisti De ision Graphs
The prin ipal advantage that CEGs have over | Combining Veri ation and AI te hniques for
BNs when it omes to Causal analysis is their Probabilisti Inferen e. In PGM '02 Pro eedings
exibility. In a BN the kind of manipulations of the 1st European Workshop on Probabilisti
we an onsider are severely restri ted (see for Graphi al Models, pages 81-88.
example se tion 2.6 of Lauritzen (2001) where
he omments on Shafer (1996)), whereas us- S.L. Lauritzen. 2001. Causal Inferen e from
ing a CEG we an ta kle not only the on- Graphi al Models. In O.E. Barndor -Nielsen et
ventional Do X = x0 manipulations of sym- al (Eds) Complex Sto hasti Systems. Chapman
metri models, but also the analysis of inter- and Hall / CRC.
ventions on asymmetri models and manipu- J. Pearl. 2000. Causality: models, reasoning and
lations where both the manipulated-variable inferen e. Cambridge University Press.
and the manipulated-variable value an di er
for di erent settings of other variables. It is E. Ri omagno and J.Q. Smith. 2005. The Causal
also the ase that our blo king sets are sets Manipulation and Bayesian Estimation of Chain
of values or settings, and do not need to or- Event Graphs. CRiSM Resear h Report no. 05-
respond to any xed subset of the original 16, University of Warwi k.
problem random variables.
For simpli ity of exposition in this pa- G. Shafer. 1996. The Art of Causal Conje ture.
per we have not fully exploited the potential MIT Press.
exibility of the CEG, onsidering only for- J.Q. Smith and P.E. Anderson. 2006. Conditional
mulae asso iated with a blo king-set WZ of independen e and Chain Event Graphs. A -
positions downstream of WX . We an also epted, subje t to revision, by Journal of Ar-
onsider sets of stages upstream of WX , and ti ial Intelligen e.
ombinations of the two. Also, we have only
dis ussed one parti ular fairly oarse exam- P.A. Thwaites and J.Q. Smith. 2006a. Non-
ple of an intervention. There are often ir- symmetri models, Chain Event Graphs and
umstan es where some paths in our CEG propagation. In IPMU '06 Pro eedings of the
are not manipulated at all, for example in 11th International Conferen e on Information
a treatment regime where only patients with Pro essing and Management of Un ertainty in
ertain ombinations of symptoms (ie at er- Knowledge-Based Systems.
tain positions or stages) are treated. There
are also non-singular interventions where (for P.A. Thwaites and J.Q. Smith. 2006b. Singular
instan e) a manipulation, rather than for - manipulations on Chain Event Graphs. War-
ing a path to follow one spe i edge at some wi k Resear h Report, University of Warwi k.

300 P. Thwaites and J. Smith


Severity of Local Maxima for the EM Algorithm:
Experiences with Hierarchical Latent Class Models
Yi Wang and Nevin L. Zhang
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Hong Kong, China
{wangyi,lzhang}@cse.ust.hk

Abstract
It is common knowledge that the EM algorithm can be trapped at local maxima and
consequently fails to reach global maxima. We empirically investigate the severity of this
problem in the context of hierarchical latent class (HLC) models. Our experiments were
run on HLC models where dependency between neighboring variables is strong. (The
reason for focusing on this class of models will be made clear in the main text.) We
first ran EM from randomly generated single starting points, and observed that (1) the
probability of hitting global maxima is generally high, (2) it increases with the strength of
dependency and sample sizes, and (3) it decreases with the amount of extreme probability
values. We also observed that, at high dependence strength levels, local maxima are far
apart from global ones in terms of likelihoods. Those imply that local maxima can be
reliably avoided by running EM from a few starting points and hence are not a serious
issue. This is confirmed by our second set of experiments.

1 Introduction EM from multiple random starting points for


a number of steps, then pick the one with the
The EM algorithm (Dempster et al., 1977) is highest likelihood, and continue EM from the
a popular method for approximating maximum picked point until convergence. In addition to
likelihood estimate in the case of incomplete multiple restart EM, several more sophisticated
data. It is widely used for parameter learning strategies for avoiding local maxima have also
in models, such as mixture models (Everitt and been proposed (Fayyad et al., 1998; Ueda and
Hand, 1981) and latent class models (Lazarsfeld Nakano, 1998; Ueda et al., 2000; Elidan et al.,
and Henry, 1968), that contain latent variables. 2002; Karciauskas et al., 2004).
A well known problem associated with EM While there is abundant work on avoiding lo-
is that it can be trapped at local maxima and cal maxima, we are aware of few work on the
consequently fails to reach global maxima (Wu, severity of the local maxima issue. In this pa-
1983). One simple way to alleviate the problem per, we empirically investigate the severity of
is to run EM many times from randomly gener- local maxima for hierarchical latent class (HLC)
ated starting points, and take the highest like- models. Our experiments were run on HLC
lihood obtained as the global maximum. One models where dependency between neighboring
problem with this method is that it is compu- variables is strong. This class of models was
tationally expensive when the number of start- chosen because we use HLC models to discover
ing points is large because EM converges slowly. latent structures. It is our philosophical view
For this reason, researchers usually adopt the that one cannot expect to discover latent struc-
multiple restart strategy (Chickering and Heck- tures reliably unless observed variables strongly
erman, 1997; van de Pol et al., 1998; Uebersax, depend on latent variables (Zhang, 2004; Zhang
2000; Vermunt and Magidson, 2000): First run and Kocka, 2004).
In the first set of experiments, we ran EM X1

from randomly generated single starting points,


and observed that (1) the probability of hitting X2 Y1 X3
global maxima is generally high, (2) it increases
with the strength of dependency and sample Y2 Y3 Y4 Y5 Y6 Y7
sizes, and (3) it decreases with the amount of
extreme probability values. We also observed Figure 1: An example HLC model. X1 , X2 ,
that, at high dependence strength levels, local X3 are latent variables, and Y1 , Y2 , , Y7 are
maxima are far apart from global ones in terms manifest variables.
of likelihoods.
Those observations have immediate practi- We usually write an HLC model as a pair
cal implications. Earlier in this section, we M =(m, ). The first component m consists of
mentioned a simple local-maxima avoidance the model structure and cardinalities of the vari-
method. We pointed out one of its problems, ables. The second component is the collection
i.e. its high computational complexity, and said of parameters. Two HLC models M =(m, )
that multiple restart can alleviate this prob- and M =(m , ) are marginally equivalent if
lem. There is another problem with the method: they share the same set of manifest variables
there is no guidance on how many starting Y and P (Y|m, )=P (Y|m , ).
points to use in order to avoid local maxima We denote the cardinality of a variable X by
reliably. Multiple restart provides no solution |X|. For a latent variable X in an HLC model,
for this problem. denote the set of its neighbors by nb(X). An
Observations from our experiments suggest a HLC model is Qregular if for any latent vari-
|Z|
guideline for strong dependence HLC models. able X, |X| maxZnb(X) |Z| , and the inequality
Znb(X)
As a matter of fact, they imply that local max- strictly holds when X has only two neighbors,
ima can be reliably avoided by using multiple one of which being a latent variable. Zhang
restart and a few starting points. This is con- (2004) has shown that an irregular HLC model
firmed by our second set of experiments. can always be reduced to a marginally equiv-
The remainder of this paper is organized as alent HLC model that is regular and contains
follows. In Section 2, we review some basic con- fewer independent parameters. Henceforth, our
cepts about HLC models and the EM algorithm. discussions are restricted to regular HLC mod-
In Sections 3 and 4, we report our first and sec- els.
ond sets of experiments respectively. We con-
clude this paper and point out some potential 2.2 Strong Dependence HLC Models
future work in Section 5. The strength of dependency between two vari-
ables is usually measured by mutual informa-
2 Background tion or correlation (Cover and Thomas, 1991).
However, there is no general definition of strong
2.1 Hierarchical Latent Class Models
dependency for HLC models yet. In this study,
Hierarchical latent class (HLC) models (Zhang, we use the operational definition described in
2004) are tree-structured Bayesian networks the following paragraph.
where the leaf nodes are observed while the Consider a probability distribution. We call
internal nodes are hidden. An example HLC the component with the largest probability
model is shown in Figure 1. Following the con- mass the major component. We say that an
ventions in latent variable model literatures, we HLC model is a strong dependence model if
call the leaf nodes manifest variables and the The cardinality of each node is no smaller
internal nodes latent variables 1 . than that of its parent.
1
In this paper, we do not distinguish between nodes The major components in all conditional
and variables.
distributions are larger than 0.5, and

302 Y. Wang and N. L. Zhang


In each conditional probability table, the avoid local maxima is to run EM many times
major components of different rows are lo- with randomly generated starting points, and
cated in different columns. pick the instance with the highest likelihood as
In general, the larger the major components, the the final result. The more starting points it
higher the dependence strength (DS) level. In uses, the higher the chance it can reach the
the extreme case, when all major components global maximum. However, due to the slow
are equal to 1, the HLC model becomes deter- convergence of EM, this method is computa-
ministic. Strong dependence HLC models de- tionally expensive. A more feasible method,
fined in this way have been examined in our called multiple restart EM, is to run EM with
previous work on discovering latent structures multiple random starting points, and retain the
(Zhang, 2004; Zhang and Kocka, 2004). The one with the highest likelihood after a specified
results show that such models can be reliably number of initial steps. This method and its
recovered from data. variant are commonly used to learn latent vari-
able models in practice (Chickering and Heck-
2.3 EM Algorithm erman, 1997; van de Pol et al., 1998; Uebersax,
Latent variables can never be observed and 2000; Vermunt and Magidson, 2000). Other
their values are always missing in data. This work on escaping from poor local maxima in-
fact complicates the maximum likelihood esti- cludes (Fayyad et al., 1998; Ueda and Nakano,
mation problem, since we cannot compute suf- 1998; Ueda et al., 2000; Elidan et al., 2002; Kar-
ficient statistics from incomplete data records. ciauskas et al., 2004).
A common method to deal with such situations
is to use the expectation-maximization (EM) al- 3 Severity of Local Maxima
gorithm (Dempster et al., 1977). The EM al- Here is the strategy that we adopt to empiri-
gorithm starts with a randomly chosen estima- cally investigate the severity of local maxima in
tion to parameters, and iteratively improves this strong dependence HLC models: (1) create a set
estimation by increasing its loglikelihood. In of strong dependence models, (2) sample some
each EM step, the task of increasing loglikeli- data from each of the models, (3) learn model
hood is delegated to the maximization of the parameters from the data by running EM to
expected loglikelihood function. The latter is a convergence from a number of starting points,
lower bound of the loglikelihood function. It is (4) graph the final loglikelihoods obtained.
defined as The final loglikelihoods for different starting
points could be different due to local maxima.
Q(|D, t )
m X Hence, an inspection of their distribution would
give us a good idea about the severity of local
X
= P (Xl |Dl , t ) log P (Xl , Dl |),
l=1 Xl maxima.

where D={D1 , D2 , , Dm } denotes the col- 3.1 Experiment Setup


letion of data, Xl denotes the set of variables The structure of all models used in our exper-
whose values are missing in Dl , and t denotes iments was the ternary tree with height equals
the estimation at the t-th step. The EM algo- 3, as shown in Figure 2. The cardinalities of
rithm terminates when the increase in loglike- all variables were set at 3. Parameters were
lihood between two successive steps is smaller randomly generated subject to the strong de-
than a predefined stopping threshold. pendency condition. We examined 5 DS levels,
It is common knowledge that the EM algo- labeled from 1 to 5. They correspond to restrict-
rithm can be trapped at local maxima and con- ing the major components within the following 5
sequently fails to reach global maxima (Wu, intervals: [0.5, 0.6), [0.6, 0.7),[0.7, 0.8),[0.8, 0.9),
1983). The specific result depends on the choice and [0.9, 1.0]. For each DS level, 5 different pa-
of the starting point. A common method to rameterizations were generated. Consequently,

Severity of Local Maxima for the EM Algorithm: Experiences with Hierarchical Latent
Class Models 303
X1 The smaller it is, the higher the quality of the
parameters produced by the EM run. In partic-
X2 X3 X4 ular, the relative likelihood shortfall of the run
that produced the global maximum l is 0.
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 For a given DS level, there are 5 models and
hence 500 EM runs. We put the relative like-
Figure 2: The structure of generative models. lihood shortfalls of all those EM runs into one
set and let, for any nonnegative real number x,
we examined 25 models in total. F (x) be the percent of the elements in the set
From each of the 25 models, four training sets that is no larger than x. We call F (x) the dis-
of 100, 500, 1000, and 5000 records were sam- tribution function of (aggregated) relative likeli-
pled. On each training set, we ran EM indepen- hood shortfalls of EM runs, or simply distribu-
dently for 100 times, each time starting from a tion of EM relative likelihood shortfall, for the
randomly selected initial point. The stopping DS level.
threshold of EM was set at 0.0001. The high- The first 5 plots in Figure 4 depict the dis-
est loglikelihood obtained in the 100 runs is re- tributions of EM relative likelihood shortfall for
garded as the global maximum. the 5 DS levels. There are four curves in each
plot, each corresponding to a sample size3 .
3.2 Results
The results are summarized in Figures 3 and 3.2.1 Probability of Hitting Global
4. In Figure 3, there is a plot for each combi- Maxima
nation of DS level (row) and sample size (col- The most interesting question is how often
umn). As mentioned earlier, 5 models were cre- EM hits global maxima. To answer this ques-
ated for each DS level. The plot is for one of tion, we first look at the solid curves in Fig-
those 5 models. In the plot, there are three ure 3. Most of them are stair-shaped. In each
curves: solid, dashed, and dotted2 . The dashed curve, the x-position of the right most stair is
and dotted curves are for Section 4. The solid the global maximum, and the height of that
curve is for this section. The curve depicts a stair is the frequency of hitting the global max-
distribution function F (x), where x is loglikeli- imum.
hood and F (x) is the percent of EM runs that We see that the frequency of hitting global
achieved a loglikelihood no larger than x. maxima was generally high for high DS lev-
While a plot in Figure 3 is about one model els. In particular, for DS level 3 or above, EM
at a DS level, a plot in Figure 4 represents an reached global maxima more than half of the
aggregation of results about all 5 models at a DS time. For sample size 500 or above, the fre-
level. Loglikelihoods for different models can be quency was even greater than 0.7. On the other
in very different ranges. To aggregate them in hand, the frequency was low for DS level 1, es-
a meaningful way, we introduce the concept of pecially when the sample size was small.
relative likelihood shortfall of EM run. The first 5 plots in Figure 4 tell the same
Consider a particular model. We have run story. In those plots, the global maxima are
EM 100 times and hence obtained 100 loglike- represented by x=0. The heights of the curves
lihoods. The maximum is regarded as the op- at x=0 are the frequencies of hitting global max-
timum and is denoted by l . Suppose a partic- ima. We see that for DS level 3 or above, the fre-
ular EM run resulted in loglikelihood l. Then quency of hitting global maxima is larger than
the relative likelihood shortfall of that EM run is 0.5, except that for DS level 3 and sample size
defined as (ll )/l . This value is nonnegative. 100. And once again, the frequency was low for
2
DS level 1.
Note that in some plot (e.g., that for DS level 4 and
3
sample size 5000) the curves overlap and are indistin- Note that in Figure 4 (b) and (c) some curves are
guishable. close to the y-axes and are hardly distinguishable.

304 Y. Wang and N. L. Zhang


[Figure 3 here: a 5 x 4 grid of plots. Rows correspond to DS levels 1-5 and columns to sample sizes of 100, 500, 1000, and 5000 records. Each panel plots the distribution function against loglikelihood.]
Figure 3: Distributions of loglikelihoods obtained by different EM runs. Solid curves are for EM runs with single starting points. Dashed and dotted curves are for runs of multiple restart EM with settings 4×10 and 16×50, respectively (see Section 4). Each curve depicts a distribution function F(x), where x is loglikelihood and F(x) is the percent of runs that achieved a loglikelihood no larger than x.

[Figure 4 here: six panels, (a) DS level 1, (b) DS level 2, (c) DS level 3, (d) DS level 4, (e) DS level 5, and (f) DS level 3 with 15% zero components. Each panel plots the distribution function against the relative likelihood shortfall, with one curve per sample size (100, 500, 1000, and 5000 records).]
Figure 4: Distributions of EM relative likelihood shortfall for different DS levels.

3.2.2 DS Level and Local Maxima

To see how DS level influences the severity of local maxima, read each column of Figure 3 (for a given sample size) from top to bottom. Notice that, in general, the height of the rightmost stair increases with the DS level. It means that the frequency of hitting global maxima increases with DS level. The same conclusion can be read off from Figure 4. Compare the curves for a given sample size from the first 5 plots. In general, their values at x=0 rise as we move from DS level 1 to DS level 5.

3.2.3 Extreme Parameter Values and Local Maxima

There are some exceptions to the regularities mentioned in the previous paragraph. We see in both figures that the frequency of hitting global maxima decreases as the DS level changes from 4 to 5. By comparing Figure 4 (c) and (d), we also find that the frequency slightly drops when the DS level changes from 3 to 4, given that the sample size is large.

We conjecture that this is because, in the generative models for DS levels 4 and 5, there are many parameter values close to 0. It has been reported in the latent variable model literature that extreme parameter values cause local maxima (Uebersax, 2000; McCutcheon, 2002). To confirm our conjecture, we created another set of five models for DS level 3. The parameters were generated in the same way as before, except that we randomly set 15% of the non-major components in the probability distributions to 0. We then repeated the experiments for this set of models. The EM relative likelihood shortfall distributions are given in Figure 4 (f). We see that, as expected, the frequency of hitting global maxima did decrease considerably compared with Figure 4 (c).

3.2.4 Sample Size and Local Maxima

We also noticed the power of the sample size. By going through each row of Figure 3, we found



that the frequency of hitting global maxima increases with the sample size. The phenomenon is more apparent if we look at Figure 4, where curves for different sample sizes are plotted in the same picture. As the sample size goes up, the curve becomes steeper and its value at x=0 increases significantly.

3.2.5 Quality of Local Maxima

In addition to the frequency of hitting global maxima, another interesting question is how bad local maxima could be. Let us first examine the local maxima in Figure 3. They are marked by the x-positions of the inflection points on the curves. For DS level 1, the local maxima are not far from the global ones. The discrepancies are less than 15. Moreover, there are a lot of inflection points on the curves, namely, distinct local maximal solutions. They are distributed evenly within the interval between the worst local maxima and the global ones.

For the other DS levels, things are different. We first observed that the quality of local maxima can be extremely poor in some situations. The worst case is that of DS level 5 with sample size 5000. The loglikelihood of the local maximum can be lower than that of the global one by more than 4000. The second thing we observed is that the curves contain much fewer inflection points. In other words, there are only a few distinct local maximal solutions in such cases. Moreover, those local maxima stay far apart from the global ones since the steps of the staircases are fairly large. Those observations can be confirmed by studying the first 5 plots in Figure 4, where the curves and the inflection points can be interpreted similarly.

4 Performance of Multiple Restart EM

We emphasize two observations mentioned in the previous section: (1) the probability of hitting global maxima is generally high, and (2) likelihoods of local maxima are far apart from those of global maxima at high DS levels. We will see that these observations have immediate implications on the performance of multiple restart EM.

We say that a starting point is optimal if it converges to the global maximum. The first observation can be restated as follows: the probability for a randomly generated starting point to be optimal is generally high. Consequently, it is almost sure that there is an optimal starting point within a few randomly generated ones.

As is well known, the EM algorithm increases the likelihood quickly in its early stage and slows down when it is converging. In other words, the likelihood should become relatively stable after a few steps. Therefore, the second observation implies that we can easily separate the optimal starting point from the others after running a few EM steps on them.

A straightforward consequence of the above inference is that multiple restart EM with a few starting points and initial steps should reliably avoid local maxima for strong dependence HLC models. To confirm this conjecture, we ran multiple restart EM independently for 100 times on each training set that was presented in Figure 3. We tested two settings for multiple restart EM: (1) 4 random starting points with 10 initial steps (in short, 4×10), and (2) 16 random starting points with 50 initial steps (in short, 16×50). As before, we plotted the distributions of loglikelihoods in Figure 3. The dashed and the dotted curves denote the results for settings 4×10 and 16×50, respectively.

From Figure 3, we see that multiple restart EM with setting 16×50 can reliably avoid local maxima for DS level 2 or above. Actually, the dotted curves are parallel to the y-axes except for DS level 2 and sample size 100. It means that global maxima can always be reached in such cases. For setting 4×10, similar behaviors are observed for DS level 4 or 5 and sample size 500 or above. Note that the dashed and dotted curves overlap in those plots.
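The multiple restart EM procedure just described (of which 4×10 and 16×50 are two instances) can be sketched as follows. This is our illustration only, under the assumption that model-specific routines `random_start` (random parameter initialization), `em_step` (one EM iteration), and `loglikelihood` are available; none of these names come from the paper.

```python
def multiple_restart_em(data, num_starts, num_initial_steps,
                        random_start, em_step, loglikelihood, tol=1e-4):
    """Multiple restart EM: run a few EM steps from each random start,
    keep the most promising start, and run it to convergence.

    `random_start`, `em_step`, and `loglikelihood` are hypothetical
    placeholders for model-specific routines.
    """
    # Stage 1: short runs from `num_starts` random starting points.
    candidates = []
    for _ in range(num_starts):
        theta = random_start()
        for _ in range(num_initial_steps):
            theta = em_step(theta, data)
        candidates.append((loglikelihood(theta, data), theta))

    # Stage 2: continue only the best candidate until convergence.
    best_ll, theta = max(candidates, key=lambda c: c[0])
    while True:
        theta = em_step(theta, data)
        new_ll = loglikelihood(theta, data)
        if new_ll - best_ll < tol:     # stopping threshold, e.g. 0.0001
            return theta
        best_ll = new_ll

# The two settings studied here correspond to
# multiple_restart_em(data, 4, 10, ...) and multiple_restart_em(data, 16, 50, ...).
```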

Nonetheless, we also notice that, for DS level 1, multiple restart EM with both settings still cannot find global maxima reliably. This is consistent with our reasoning. As we have mentioned in Section 3.2.1, when we ran EM with randomly generated single starting points, the frequency of hitting global maxima is low for DS level 1. In other words, it is hard to hit an optimal starting point by chance. Moreover, due to the small discrepancy among local maxima (see Section 3.2.5), it demands a large number of initial steps to distinguish an optimal starting point from the others. Therefore, the effectiveness of multiple restart EM degenerates. In such situations, one can either increase the number of starting points and initial steps, or appeal to more sophisticated methods for avoiding local maxima.

5 Conclusions

We have empirically investigated the severity of local maxima for EM in the context of strong dependence HLC models. We have observed that (1) the probability of hitting global maxima is generally high, (2) it increases with the strength of dependency and the sample size, (3) it decreases with the amount of extreme probability values, and (4) likelihoods of local maxima are far apart from those of global maxima at high dependence strength levels. We have also empirically shown that the local maxima can be reliably avoided by using multiple restart EM with a few starting points and hence are not a serious issue.

Our discussion has been restricted to a specific class of HLC models. In particular, we have defined the strong dependency in an operational way. One can devise more formal definitions and carry on similar studies for generalized strong dependence models. Another direction for future work would be theoretical exploration to support our experiences.

Based on our observations, we have analyzed the performance of multiple restart EM. One can exploit those observations to analyze more sophisticated strategies for avoiding local maxima. We believe that those observations can also give some inspiration for developing new methods in this direction.

Acknowledgments

Research on this work was supported by Hong Kong Grants Council Grant #622105.

References

D. M. Chickering and D. Heckerman. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181–212.

T. M. Cover and J. A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons, Inc.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

G. Elidan, M. Ninio, N. Friedman, and D. Schuurmans. 2002. Data perturbation for escaping local maxima in learning. In AAAI-02, pages 132–139.

B. S. Everitt and D. J. Hand. 1981. Finite Mixture Distributions. Chapman and Hall.

U. M. Fayyad, C. A. Reina, and P. S. Bradley. 1998. Initialization of iterative refinement clustering algorithms. In KDD-98, pages 194–198.

G. Karciauskas, T. Kocka, F. V. Jensen, P. Larrañaga, and J. A. Lozano. 2004. Learning of latent class models by splitting and merging components. In PGM-04.

P. F. Lazarsfeld and N. W. Henry. 1968. Latent Structure Analysis. Houghton Mifflin.

A. L. McCutcheon. 2002. Basic concepts and procedures in single- and multiple-group latent class analysis. In Applied Latent Class Analysis, chapter 2, pages 56–85. Cambridge University Press.

J. S. Uebersax. 2000. A brief study of local maximum solutions in latent class analysis. http://ourworld.compuserve.com/homepages/jsuebersax/local.htm.

N. Ueda and R. Nakano. 1998. Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282.

N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. 2000. SMEM algorithm for mixture models. Neural Computation, 12(9):2109–2128.

F. van de Pol, R. Langeheine, and W. de Jong, 1998. PANMARK user manual, version 3. Netherlands Central Bureau of Statistics.

J. K. Vermunt and J. Magidson, 2000. Latent GOLD User's Guide. Statistical Innovations Inc.

C. F. J. Wu. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103.

N. L. Zhang and T. Kocka. 2004. Efficient learning of hierarchical latent class models. In ICTAI-04, pages 585–593.

N. L. Zhang. 2004. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5(6):697–723.

Optimal Design with Design Networks

Y. Xiang
University of Guelph, Canada

[Pages 309–316: the body of this paper is not legible in this copy because of a font-encoding problem in the extraction. Only the title, the author and affiliation, the running headers, and the node labels of the figures (e.g., d_io_controler, d_b_chipset, d_c_chipset, d_memory, d_c_voltage, d_12v, m_io_perf, m_cost, m_perf, m_stability, u_io_perf, u_cost, u_perf, u_stability, t_temperature, t_humidity, and d_a, d_b, d_c, d_d, m_g, m_h, m_i, m_j, t_p, t_q, u_v, u_w) are recoverable.]
Hybrid Loopy Belief Propagation
Changhe Yuan
Department of Computer Science and Engineering
Mississippi State University
Mississippi State, MS 39762
cyuan@cse.msstate.edu

Marek J. Druzdzel
Decision Systems Laboratory
School of Information Sciences
University of Pittsburgh
Pittsburgh, PA 15260
marek@sis.pitt.edu

Abstract
We propose an algorithm called Hybrid Loopy Belief Propagation (HLBP), which extends
the Loopy Belief Propagation (LBP) (Murphy et al., 1999) and Nonparametric Belief
Propagation (NBP) (Sudderth et al., 2003) algorithms to deal with general hybrid Bayesian
networks. The main idea is to represent the LBP messages with mixtures of Gaussians and
formulate their calculation as Monte Carlo integration problems. The new algorithm is
general enough to deal with hybrid models that may represent linear or nonlinear equations
and arbitrary probability distributions.

1 Introduction

Some real problems are more naturally modelled by hybrid Bayesian networks that contain mixtures of discrete and continuous variables. However, several factors make inference in hybrid models extremely hard. First, linear or nonlinear deterministic relations may exist in the models. Second, the models may contain arbitrary probability distributions. Third, the orderings among the discrete and continuous variables may be arbitrary. Since the general case is difficult, existing research often focuses on special instances of hybrid models, such as Conditional Linear Gaussians (CLG) (Lauritzen, 1992). However, one major assumption behind CLG is that discrete variables cannot have continuous parents. This limitation was later addressed by extending CLG with logistic and softmax functions (Lerner et al., 2001; Murphy, 1999). More recent research begins to develop methodologies for more general non-Gaussian models, such as Mixture of Truncated Exponentials (MTE) (Moral et al., 2001; Cobb and Shenoy, 2005), and the junction tree algorithm with approximate clique potentials (Koller et al., 1999). However, most of these approaches rely on the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). As Lerner et al. (Lerner et al., 2001) pointed out, it is important to have alternative solutions in case junction tree algorithm-based methods are not feasible.

In this paper, we propose the Hybrid Loopy Belief Propagation algorithm, which extends the Loopy Belief Propagation (LBP) (Murphy et al., 1999) and Nonparametric Belief Propagation (NBP) (Sudderth et al., 2003) algorithms to deal with general hybrid Bayesian networks. The main idea is to represent LBP messages as Mixtures of Gaussians (MG) and formulate their calculation as Monte Carlo integration problems. The extension is far from trivial due to the enormous complexity brought by deterministic equations and mixtures of discrete and continuous variables. Another advantage of the algorithm is that it approximates the true posterior probability distributions, unlike most existing approaches, which only produce their first two moments for CLG models.

2 Hybrid Loopy Belief Propagation

To extend LBP to hybrid Bayesian networks, we need to know how to calculate the LBP messages in hybrid models. There are two kinds of messages defined in LBP. Let X be a node in a hybrid model. Let $Y_j$ be X's children and $U_i$
be X's parents. The message $\lambda_X^{(t+1)}(u_i)$ that X sends to its parent $U_i$ is given by:

$$\lambda_X^{(t+1)}(u_i) = \int_x \lambda_X(x) \prod_j \lambda_{Y_j}^{(t)}(x) \int_{u_k : k \neq i} P(x|u) \prod_{k \neq i} \pi_X^{(t)}(u_k) \, , \qquad (1)$$

and the message $\pi_{Y_j}^{(t+1)}(x)$ that X sends to its child $Y_j$ is defined as:

$$\pi_{Y_j}^{(t+1)}(x) = \lambda_X(x) \prod_{k \neq j} \lambda_{Y_k}^{(t)}(x) \int_u P(x|u) \prod_k \pi_X^{(t)}(u_k) \, , \qquad (2)$$

where $\lambda_X(x)$ is a message that X sends to itself representing evidence. For simplicity, I use only integrations in the message definitions. It is evident that no closed-form solutions exist for the messages in general hybrid models. However, we observe that their calculations can be formulated as Monte Carlo integration problems.

First, let us look at the $\pi_{Y_j}^{(t+1)}(x)$ message defined in Equation 2. We can rearrange the equation in the following way:

$$\pi_{Y_j}^{(t+1)}(x) = \int_u \lambda_X(x) \prod_{k \neq j} \lambda_{Y_k}^{(t)}(x) \overbrace{P(x|u) \prod_k \pi_X^{(t)}(u_k)}^{P(x,u)} \, .$$

Essentially, we put all the integrations outside of the other operations. Given the new formula, we realize that we have a joint probability distribution over x and the $u_i$'s, and the task is to integrate all the $u_i$'s out and get the marginal probability distribution over x. Since $P(x,u)$ can be naturally decomposed into $P(x|u)P(u)$, the calculation of the message can be solved using a Monte Carlo sampling technique called the composition method (Tanner, 1993). The idea is to first draw samples for each of the $u_i$'s from $\pi_X^{(t)}(u_i)$, and then sample from the product of $\lambda_X(x) \prod_{k \neq j} \lambda_{Y_k}^{(t)}(x) P(x|u)$. We will discuss how to take the product in the next subsection. For now, let us assume that the computation is possible. To make life even easier, we make further modifications and get

$$\pi_{Y_j}^{(t+1)}(x) = \frac{\lambda_X(x) \prod_k \lambda_{Y_k}^{(t)}(x) \int_u P(x|u) \prod_k \pi_X^{(t)}(u_k)}{\lambda_{Y_j}^{(t)}(x)} \, .$$

Now, for the messages sent from X to its different children, we can share most of the calculation. We first get samples for the $u_i$'s, and then sample from the product of $\lambda_X(x) \prod_k \lambda_{Y_k}^{(t)}(x) P(x|u)$. For each different message $\pi_{Y_j}^{(t+1)}(x)$, we use the same sample x but assign it different weights $1/\lambda_{Y_j}^{(t)}(x)$.

Let us now consider how to calculate the $\lambda_X^{(t+1)}(u_i)$ message defined in Equation 1. First, we rearrange the equation analogously:

$$\lambda_X^{(t+1)}(u_i) = \int_{x, u_k : k \neq i} \lambda_X(x) \prod_j \lambda_{Y_j}^{(t)}(x) \overbrace{P(x|u) \prod_{k \neq i} \pi_X^{(t)}(u_k)}^{P(x, u_k : k \neq i \,|\, u_i)} \, . \qquad (3)$$

It turns out that here we are facing a quite different problem from the calculation of $\pi_{Y_j}^{(t+1)}(x)$. Note that now we have $P(x, u_k : k \neq i \,|\, u_i)$, a joint distribution over x and $u_k$ ($k \neq i$) conditional on $u_i$, so the whole expression is only a likelihood function of $u_i$, which is not guaranteed to be integrable. As in (Sudderth et al., 2003; Koller et al., 1999), we choose to restrict our attention to densities and assume that the ranges of all continuous variables are bounded (maybe large). The assumption only solves part of the problem. Another difficulty is that the composition method is no longer applicable here, because we have to draw samples for all parents $u_i$ before we can decide $P(x|u)$. We note that for any fixed $u_i$, Equation 3 is an integration over x and $u_k$, $k \neq i$, and can be estimated using Monte Carlo methods. That means that we can estimate $\lambda_X^{(t+1)}(u_i)$ up to a constant for any value $u_i$, although we do not know how to sample from it. This is exactly the time when importance sampling becomes handy. We can estimate the message by drawing a set of importance samples as follows: sample $u_i$ from a
318 C. Yuan and M. J. Druzdzel


(t+1) Algorithm: HLBP
chosen importance function, estimate X (ui )
using Monte Carlo integration, and assign the 1. Initialize the messages that evidence nodes send to
ratio between the estimated value and I(ui ) as themselves and their children as indicating messages
the weight for the sample. A simple choice for with fixed values, and initialize all other messages to
be uniform.
the importance function is the uniform distribu-
tion over the range of ui , but we can improve the 2. while (stopping criterion not satisfied)
accuracy of Monte Carlo integration by choos- Recompute all the messages using Monte Carlo
integration methods.
ing a more informed importance function, the
(t) Normalize discrete messages.
corresponding message X (ui ) from the last it-
eration. Because of the iterative nature of LBP, Approximate all continuous messages using MGs.
messages usually keep improving over each it- end while
eration, so they are clearly better importance 3. Calculate (x) and (x) messages for each variable.
functions.
4. Calculate the posterior probability distributions for
Note that the preceding discussion is quite all the variables by sampling from the product of
general. The variables under consideration can (x) and (x) messages.
be either discrete or continuous. We only need
to know how to sample from or evaluate the Figure 1: The Hybrid Loopy Belief Propagation
conditional relations. Therefore, we can use any algorithm.
representation to model the case of discrete vari-
ables with continuous parents. For instance, we samples using a regularized version of expecta-
can use general softmax functions (Lerner et al., tion maximization (EM) (Dempster et al., 1977;
2001) for the representation. Koller et al., 1999) algorithm.
We now know how to use Monte Carlo inte-
Given the above discussion, we outline the
gration methods to estimate the LBP messages,
final HLBP algorithm in Figure 1. We first ini-
represented as sets of weighted samples. To
tialize all the messages. Then, we iteratively
complete the algorithm, we need to figure out
recompute all the messages using the methods
how to propagate these messages. Belief prop-
described above. In the end, we calculate the
agation involves operations such as sampling,
messages (x) and (x) for each node X and
multiplication, and marginalization. Sampling
use them to estimate the posterior probability
from a message represented as a set of weighted
distribution over X.
samples may be easy to do; we can use resam-
pling technique. However, multiplying two such
3 Product of Mixtures of Gaussians
messages is not straightforward. Therefore, we
choose to use density estimation techniques to One question that remains to be answered in
approximate each continuous message using a the last section is how to sample from the prod-
mixture of Gaussians (MG) (Sudderth et al., Q (t)
uct of X (x) Yk (x)P (x|u). We address the
2003). A K component mixture of Gaussian k6=j
has the following form problem in this section. If P (x|u) is a contin-
uous probability distribution, we can approx-
X
K
imate P (x|u) with a MG. Then, the problem
M (x) = wi N (x; i , i ) , (4)
becomes how to compute the product of several
i=1
MGs. The approximation is often a reasonable
P
K thing to do because we can approximate any
where wi = 1. MG has several nice prop- continuous probability distribution reasonably
i=1
erties. First, we can approximate any contin- well using MG. Even when such approxima-
uous distribution reasonably well using a MG. tion is poor, we can approximate the product of
Second, it is closed under multiplication. Last, P (x|u) with an MG using another MG. One ex-
we can estimate a MG from a set of weighted ample is that the product of Gaussian distribu-

Hybrid Loopy Belief Propagation 319


tion and logistic function can be approximated Algorithm: Sample from a product of MGs M1 M2
... MK .
well with another Gaussian distribution (Mur-
phy, 1999). 1. Randomly pick a component, say j1 , form the first
Suppose we have messages M1 , M2 , ..., MK , message M1 according to its weights w11 , ..., w1J1 .
each represented as an MG. Sudderth et al. use 2. Initialize cumulative parameters as follows
Gibbs sampling algorithm to address the prob- 1 1j1 ; 1 1j1 .
lem (Sudderth et al., 2003). The shortcoming 3. i 2; wImportance w1j1 .
of the Gibbs sampler is its efficiency: We usu-
ally have to carry out several iterations of Gibbs 4. while (i K)
sampling in order to get one sample. Compute new parameters for each component of
ith message as follows
Here we propose a more efficient method
ik
((i1 )1 + (ik )1 )1 ,
based on the chain rule. If we treat the se- ik ( ik
ik
+

i1


)ik .
i1
lection of one component from each MG mes-
Compute new weights for ith message with any x
sage as a random variable, which we call a N (x;
i1 ,i1 )N (x;ik ,ik )
wik = wik wi1j .
label, our goal is to draw a sample from the i1 N (x; , )
ik ik

joint probability distribution of all the labels Calculate the normalized weights
wik .
P (L1 , L2 , ..., LK ). We note that the joint prob- Randomly pick a component, say ji , from the ith
ability distribution of the labels of all the mes- message using the normalized weights.
sages P (L1 , L2 , ..., LK ) can be factorized using i iji ; i ij

i
.
the chain rule as follows: i i + 1;

wImportance = wImportance wij .
Y
K i

P (L1 , L2 , ..., LK ) = P (L1 ) P (Li |L1 , ..., Li1 ) . end while


i=2
5. Sample from the Gaussian with mean K and vari-
(5)
ance K .
Therefore, the idea is to sample from the

labels sequentially based on the prior or con- 6. Assign the sample weight wKj K
/wImportance.
ditional probability distributions. Let wij be
the weight for the jth component of ith mes- Figure 2: Sample from a product of MGs.
sage and ij and ij be the components pa-
rameters. We can sample from the product of
In our preceding discussion, we approximate
messages M1 , ..., MK using the algorithm pre-
P (x|u) using MG if P (x|u) is a continuous prob-
sented in Figure 2. The main idea is to cal-
ability distribution. It is not the case if P (x|u)
culate the conditional probability distributions
is deterministic or if X is observed. We discuss
cumulatively. Due to the Gaussian densities,
the following several scenarios separately.
the method has to correct the bias introduced
during the sampling by assigning the samples Deterministic relation without evidence: We
weights. The method only needs to go over the simply evaluate P (x|u) to get the value x as a
messages once to obtain one importance sam- sample. Because we did not take into account
ple and is more efficient than the Gibbs sam- the messages, we need to correct our bias us-
pler in (Sudderth et al., 2003). Empirical re- ing weights. For the message Yi (x) sent from
sults show that the precision obtained by the X to its child Yi , we take x as the sample and
Q (t)
importance sampler is comparable to the Gibbs assign it weight X (x) Yk (x). For the mes-
k6=i
sampler given a reasonable number of samples.
sage X (ui ) sent from X to its parent Ui , we
take value ui as a sample for Ui and assign it
Q (t) (t)
4 Belief Propagation with Evidence weight X (x) Yk (x)/X (ui ).
k
Special care is needed for belief propaga- Stochastic relation with evidence: The mes-
tion with evidence and deterministic relations. sages Yj (x) sent from evidence node X to its

320 C. Yuan and M. J. Druzdzel


children are always indicating messages with proportional {N (2; 1, 1), N (1; 1, 1)}.
fixed values. The messages Yj (x) sent from For the message C (b) message sent from C
the children to X have no influence on X, so to B, since we know that B can only take two
we need not calculate them. We only need to values, we choose a distribution that puts equal
update the messages X (ui ), for which we take probabilities on these two values as the impor-
the value ui as the sample and assign it weight tance function, from which we sample for B.
Q (t) (t)
X (e) Yk (e)/X (ui ), where e is the observed The state of B will also determine A as follows:
k
value of X. when B = 2, we have A = a; when B = 1, we
Deterministic relation with evidence: This have A = a. We then assign weight 0.7 to the
case is the most difficult. To illustrate this case sample if B = 2 and 0.3 if B = 1. However, the
more clearly, we use a simple hybrid Bayesian magic of knowing feasible values for B is impos-
network with one discrete node A and two con- sible in practice. Instead, we first sample for A
tinuous nodes B and C (see Figure 3). from C (A), in this case, {0.7, 0.3}, given which
we can solve P (C|A, B) for B and assign each
sample weight 1.0. So C (b) have probabilities
a 0.7 B
proportional to {0.7, 0.3} on two values 2.0 and
a 0.3 N (B; 1, 1) 1.0.
In general, in order to calculate messages
P (C|A, B) a a sent out from a deterministic node with evi-
C C=B C =2B dence, we need to sample from all parents ex-
cept one, and then solve P (x|u) for that par-
ent. There are several issues here. First, since
 
A B
we want to use the values of other parents to
  solve for the chosen parent, we need an equa-
@
R
@ tion solver. We used an implementation of the
C
 Newtons method for solving nonlinear set of
equations (Kelley, 2003). However, not all equa-
Figure 3: A simple hybrid Bayesian network. tions are solvable by this equation solver or any
equation solver for that matter. We may want
Example 1. Let C be observed at state 2.0. to choose the parent that is easiest to solve.
Given the evidence, there are only two possible This can be tested by means of a preprocess-
values for B: 2.0 when A = a and 1.0 when A = ing step. In more difficult cases, we have to re-
a. We need to calculate messages C (a) and sort to users help and ask at the model build-
C (b). If we follow the routine and sample from ing stage for specifying which parent to solve
A and B first, it is extremely unlikely for us or even manually specify the inverse functions.
to hit a feasible sample; Almost all the samples When there are multiple choices, one heuristic
that we get would have weight 0.0. Clearly we that we find helpful is to choose the continuous
need a better way to do that. parent with the largest variance.
First, let us consider how to calculate the
message C (a) sent from C to A. Suppose we 5 Lazy LBP
choose uniform distribution as the importance
function, we first randomly pick a state for A. We can see that HLBP involves repeated den-
After the state of A is fixed, we note that we can sity estimation and Monte Carlo integration,
solve P (C|A, B) to get the state for B. There- which are both computationally intense. Effi-
fore, we need not sample for B from C (b), in ciency naturally becomes a concern for the al-
this case N (B; 1, 1). We then assign N (B; 1, 1) gorithm. To improve its efficiency, we propose
as weight for the sample. Since As two states a technique called Lazy LBP, which is also ap-
are equally likely, the message C (A) would be plicable to other extensions of LBP. The tech-

Hybrid Loopy Belief Propagation 321


nique serves as a summarization that contains 6 Experimental Results
both my original findings and some commonly
used optimization methods. We tested the HLBP algorithm on two bench-
After evidence is introduced in the network, mark hybrid Bayesian networks: emission net-
we can pre-propagate the evidence to reduce work (Lauritzen, 1992) and its extension aug-
computation in HLBP. First, we can plug in any mented emission network (Lerner et al., 2001)
evidence to the conditional relations of its chil- shown in Figure 4(a). Note that HLBP is ap-
dren, so we need not calculate the messages plicable to more general hybrid models; We
from the evidence to its children. We need not choose the networks only for comparison pur-
calculate the messages from its children to the pose. To evaluate how well HLBP performs, we
evidence either. Secondly, evidence may deter- discretized the ranges of continuous variables to
mine the value of its neighbors because of deter- 50 intervals and then calculated the Hellingers
ministic relations, in which case we can evaluate distance (Kokolakis and Nanopoulos, 2001) be-
the deterministic relations in advance so that we tween the results of HLBP and the exact solu-
need not calculate messages between them. tions obtained by a massive amount of computa-
Furthermore, from the definitions of the LBP tion (likelihood weighting with 100M samples)
messages, we can easily see that we do not have as the error for HLBP. All results reported are
to recompute the messages all the time. For ex- the average of 50 runs of the experiments.
ample, the (x) messages from the children of a
6.1 Parameter Selection
node with no evidence as descendant are always
uniform. Also, a message needs not be updated HLBP has several tunable parameters. We
if the sender of the message has received no new have number of samples for estimating messages
messages from neighbors other than the recipi- (number of message samples), number of sam-
ent. ples for the integration (number of integration
Based on Equation 1, messages should be samples) in Equation 3. The most dramatic
updated if incoming messages change. How- influence on precision comes from the number
ever, we have the following result. of message samples, shown as in Figure 4(b).
Theorem 1. The messages sent from a non- Counter intuitively, the number of integration
evidence node to its parents remain uniform be- samples does not have as big impact as we might
fore it receives any non-uniform messages from think (see Figure 4(c) with 1K message sam-
its children, even though there are new mes- ples). The reason we believe is that when we
sages coming from the parents. draw a lot of samples for messages, the preci-
sion of each sample becomes less critical. In
Proof. Since there are no non-uniform mes- our experiments, we set the number of message
sages coming in, Equation 1 simplifies to samples to 1, 000 and the number of integra-
Z Y tion samples to 12. For the EM algorithm for
(t+1) (t)
X (ui ) = P (x|u) X (uk ) estimating MGs, we typically set the regulariza-
k6=i
x,uk :k6=i
Z
tion constant for preventing over fitting to 0.8,
= P (x, uk : k 6= i|ui ) stopping likelihood threshold to 0.001, and the
number of components in mixtures of Gaussian
x,uk :k6=i
= . to 2.
We also compared the performance of two
Finally for HLBP, we may be able to calculate samplers for product of MGs: Gibbs sam-
some messages exactly. For example, suppose pler (Sudderth et al., 2003) and the importance
a discrete node has only discrete parents. We sampler in Section 3. As we can see from Fig-
can always calculate the messages sent from this ure 4(b,c), when the number of message sam-
node to its neighbors exactly. In this case, we ples is small, Gibbs sampler has slight advan-
should avoid using Monte Carlo sampling. tage over the importance sampler. As the num-

322 C. Yuan and M. J. Druzdzel


W as te F ilter B u rn in g
T ype S tate R eg im e

0.7 0.1 2 1.2


Im p o rta n c e S a m p le r
E s tim a te d P o s te rio r
0.6 G ib b s S a m p le r 0.1 1
Metal E x a c t P o s te rio r
E ffic ien c y C O 2

H e llin g e r's D is ta n c e
W as te N o rm a l A p p ro x im a tio n
0.5

H e llin g e r's D is ta n c e
E m is s io n 0.08 0.8

0.4

P ro b a b ility
0.06 0.6
D u st 0.3
Metal C O 2
E m is s io n 0.04 0.4
E m is s io n S en s o r 0.2

0.1 0.02 Im p o rta n c e S a m p le r 0.2


G ib b s S a m p le r
0
0 0
Metal D u st L ig h t 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 0.7 6 1.7 2 2 .6 8 3 .6 4 4 .6 5 .5 6 6 .5 2 7 .4 8 8 .4 4
S en s o r S en s o r P en etrab ility N u m b e r o f M e s s a g e S a m p le s (1 0* 2 ^ (i-1 )) N u m b e r o f M e s s a g e In te g ra tio n S a m p le s (2 ^ i) P o s te rio r D is trib u tio n o f D u s tE m is s io n

(a) (b) (c) (d)


0.3 5 4 0.3 5 3.5
H LB P H LB P H LB P H LB P
0.3 3.5 0.3 3
Lazy H LB P Lazy H LB P Lazy H LB P Lazy H LB P

3
H e llin g e r's D is ta n c e

H e llin g e r's D is ta n c e
0.2 5 0.2 5 2.5

R u n n in g T im e (s )

R u n n in g T im e (s )
2.5
0.2 0.2 2
2
0.1 5 0.1 5 1.5
1.5
0.1 0.1 1
1

0.05 0.05 0.5


0.5

0 0 0 0
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
P ro p a g a tio n L e n g th P ro p a g a tio n L e n g th P ro p a g a tio n L e n g th P ro p a g a tio n L e n g th

(e) (f ) (g) (h)

Figure 4: (a) Emission network (without dashed nodes) and augmented emission network (with
dashed nodes). (b,c) The influence of number of message samples and number of message inte-
gration samples on the precision of HLBP on augmented Emission network CO2Sensor and Dust-
Sensor both observed to be true and Penetrability to be 0.5. (d) Posterior probability distribution
of DustEmission when observing CO2Emission to be 1.3, Penetrability to be 0.5 and WasteType
to be 1 in Emission network. (e,f,g,h) Results of HLBP and Lazy HLBP: (e) error on emission,
(f) running time on emission, (g) error on augmented emission, (h) running time on augmented
emission.

ber of message samples increases, the difference by Lazy LBP) as a function of the propaga-
becomes negligible. Since the importance sam- tion length. We can see that HLBP needs only
pler is much more efficient, we use it in all our several steps to converge. Furthermore, HLBP
other experiments. achieves better precision than its lazy version,
but Lazy HLBP is much more efficient than
6.2 Results on Emission Networks
HLBP. Theoretically, Lazy LBP should not af-
We note that mean and variance alone provide fect the results of HLBP but only improve its
only limited information about the actual poste- efficiency if the messages are calcualted exactly.
rior probability distributions. Figure 4(d) shows However, we use importance sampling to esti-
the posterior probability distribution of node mate the messages. Since we use the messages
DustEmission when observing CO2Emission at from the last iteration as the importance func-
1.3, Penetrability at 0.5, and WasteType at tions, iterations will help improving the func-
1. We also plot in the same figure the corre- tions.
sponding normal approximation with mean 3.77
and variance 1.74. We see that the normal ap-
proximation does not reflect the true posterior.
While the actual posterior distribution has a We also tested the HLBP algorithm on the
multimodal shape, the normal approximation augmented Emission network (Lerner et al.,
does not tell us where the mass really is. We 2001) with CO2Sensor and DustSensor both ob-
also report the estimated posterior probability served to be true and Penetrability to be 0.5
distribution of DustEmission by HLBP. HLBP and report the results in Figures 4(g,h). We
seemed able to estimate the shape of the actual again observed that not many iterations are
distribution very accurately. needed for HLBP to converge. In this case,
In Figures 4(e,f), we plot the error curves Lazy HLBP provides comparable results while
of HLBP and Lazy HLBP (HLBP enhanced improving the efficiency of HLBP.

Hybrid Loopy Belief Propagation 323


7 Conclusion G. Kokolakis and P.H. Nanopoulos. 2001.
Bayesian multivariate micro-aggregation under
The contribution of this paper is two-fold. First, the Hellingers distance criterion. Research in of-
we propose the Hybrid Loopy Belief Propagation ficial statistics, 4(1):117126.
algorithm (HLBP). The algorithm is general D. Koller, U. Lerner, and D. Angelov. 1999. A gen-
enough to deal with general hybrid Bayesian eral algorithm for approximate inference and its
networks that contain mixtures of discrete and application to hybrid Bayes nets. In Proceedings
of the Fifteenth Annual Conference on Uncer-
continuous variables and may represent linear
tainty in Artificial Intelligence (UAI-99), pages
or nonlinear equations and arbitrary probability 324333. Morgan Kaufmann Publishers Inc.
distributions and naturally accommodate the
scenario where discrete variables have contin- S. L. Lauritzen and D. J. Spiegelhalter. 1988. Lo-
cal computations with probabilities on graphical
uous parents. Its another advantage is that it structures and their application to expert sys-
approximates the true posterior distributions. tems. Journal of the Royal Statistical Society, Se-
Second, we propose an importance sampler to ries B (Methodological), 50(2):157224.
sample from the product of MGs. Its accuracy is S.L. Lauritzen. 1992. Propagation of probabilities,
comparable to the Gibbs sampler in (Sudderth means and variances in mixed graphical associa-
et al., 2003) but much more efficient given the tion models. Journal of the American Statistical
same number of samples. We anticipate that, Association, 87:10981108.
just as LBP, HLBP will work well for many U. Lerner, E. Segal, and D. Koller. 2001. Ex-
practical models and can serve as a promising act inference in networks with discrete children of
approximate method for hybrid Bayesian net- continuous parents. In Proceedings of the Seven-
teenth Annual Conference on Uncertainty in Arti-
works. ficial Intelligence (UAI-01), pages 319328. Mor-
gan Kaufmann Publishers Inc.
Acknowledgements
S. Moral, R. Rumi, and A. Salmeron. 2001. Mix-
This research was supported by the Air Force tures of truncated exponentials in hybrid Bayesian
Office of Scientific Research grants F4962003 networks. In Proceedings of Fifth European Con-
ference on Symbolic and Quantitative Approaches
10187 and FA95500610243 and by Intel Re- to Reasoning with Uncertainty (ECSQARU-01),
search. We thank anonymous reviewers for sev- pages 156167.
eral insightful comments that led to improve-
K. Murphy, Y. Weiss, and M. Jordan. 1999. Loopy
ments in the presentation of the paper. All belief propagation for approximate inference: An
experimental data have been obtained using empirical study. In Proceedings of the Fifteenth
SMILE, a Bayesian inference engine developed Annual Conference on Uncertainty in Artificial
at the Decision Systems Laboratory and avail- Intelligence (UAI99), pages 467475, San Fran-
cisco, CA. Morgan Kaufmann Publishers.
able at http://genie.sis.pitt.edu/.
K. P. Murphy. 1999. A variational approximation
for Bayesian networks with discrete and contin-
References uous latent variables. In Proceedings of the Fif-
teenth Annual Conference on Uncertainty in Arti-
B. R. Cobb and P. P. Shenoy. 2005. Hybrid Bayesian ficial Intelligence (UAI-99), pages 457466. Mor-
networks with linear deterministic variables. In gan Kaufmann Publishers Inc.
Proceedings of the 21th Annual Conference on
Uncertainty in Artificial Intelligence (UAI05), E. Sudderth, A. Ihler, W. Freeman, and A. Will-
pages 136144, AUAI Press Corvallis, Oregon. sky. 2003. Nonparametric belief propagation. In
Proceedings of 2003 IEEE Computer Society Con-
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. ference on Computer Vision and Pattern Recog-
Maximum likelihood from incomplete data via nition (CVPR-03).
the EM algorithm. J. Royal Statistical Society, M. Tanner. 1993. Tools for Statistical Inference:
B39:138. Methods for Exploration of Posterior Distribu-
tions and Likelihood Functions. Springer, New
C. T. Kelley. 2003. Solving Nonlinear Equations York.
with Newtons Method.

324 C. Yuan and M. J. Druzdzel


Probabilistic Independence of Causal Influences
Adam Zagorecki Marek Druzdzel
Engineering Systems Department Decision Systems Laboratory
Defence Academy of the UK School of Information Sciences
Cranfield University University of Pittsburgh
Shrivenham, UK Pittsburgh, PA 15260, USA

Abstract
One practical problem with building large scale Bayesian network models is an expo-
nential growth of the number of numerical parameters in conditional probability tables.
Obtaining large number of probabilities from domain experts is too expensive and too time
demanding in practice. A widely accepted solution to this problem is the assumption of
independence of causal influences (ICI) which allows for parametric models that define
conditional probability distributions using only a number of parameters that is linear in
the number of causes. ICI models, such as the noisy-OR and the noisy-AND gates, have
been widely used by practitioners. In this paper we propose PICI, probabilistic ICI, an
extension of the ICI assumption that leads to more expressive parametric models. We
provide examples of three PICI models and demonstrate how they can cope with a com-
bination of positive and negative influences, something that is hard for noisy-OR and
noisy-AND gates.

1 INTRODUCTION manner. There are two qualitatively different


approaches to avoiding the problem of specify-
Bayesian networks (Pearl, 1988) have proved ing large CPTs. The first is to exploit inter-
their value as a modeling tool in many distinct nal structure within a CPT which basically
domains. One of the most serious bottlenecks in means to encode efficiently symmetries in CPTs
applying this modeling tool is the costly process (Boutilier et al., 1996; Boutilier et al., 1995).
of creating the model. Although practically ap- The other is to assume some model of interac-
plicable algorithms for learning BN from data tion among causes (parent influences) that de-
are available (Heckerman, 1999), still there are fines the effects (child nodes) CPT. The most
many applications that lack quality data and re- popular class of model in this category is based
quire using an experts knowledge to build mod- on the concept known as causal independence or
els. independence of causal influences (ICI) (Heck-
One of the most serious problems related to erman and Breese, 1994) which we describe in
building practical models is the large number Section 2. These two approaches should be
of conditional probability distributions needed viewed as complementary. It is because they
to be specified when a node in the graph has are able to capture two distinct types of inter-
large number of parent nodes. In the case of actions between causes.
discrete variables, which we assume in this pa- In practical applications, the noisy-OR
per, conditional probabilities are encoded in the (Good, 1961; Pearl, 1988) model together with
form of conditional probability tables (CPTs) its extension capable of handling multi-valued
that are indexed by all possible combinations variables, the noisy-MAX (Henrion, 1989),
of parent states. For example, 15 binary par- and the complementary models the noisy-
ent variables result in over 32,000 parameters, a AND/MIN (Dez and Druzdzel, 2002) are the
number that is impossible to specify in a direct most often applied ICI models. One of the ob-
vious limitations of these models is that they
capture only a small set of patterns of interac-
tions among causes, in particular they do not
allow for combining both positive and negative
influences. In this paper we introduce an ex-
tension to the independence of causal influences
which allows us to define models that capture
both positive and negative influences. We be-
lieve that the new models are of practical impor-
tance as practitioners with whom we have had
contact often express a need for conditional dis-
Figure 1: General form of independence of
tribution models that allow for a combination
causal interactions
of promoting and inhibiting causes.
The problem of insufficient expressive power
of the ICI models has been recognized by prac- however we introduce notation used throughout
titioners and we are aware of at least two at- the paper. We denote an effect (child) vari-
tempts to propose models that are based on the able as Y and its n parent variables as X =
ICI idea, but are not strictly ICI models. The {X1 , . . . , Xn }. We denote the states of a vari-
first of them is the recursive noisy-OR (Lemmer able with lower case, e.g. X = x. When it is
and Gossink, 2004) which allows the specifica- unambiguous we use x instead of X = x.
tion of interactions among parent causes, but In the ICI model, the interaction between
still is limited to only positive influences. The variables Xi and Y is defined by means of
other interesting proposal is the CAST logic (1) the mechanism variables Mi , introduced to
(Chang et al., 1994; Rosen and Smith, 1996) quantify the influence of each cause on the effect
which allows for combining both positive and separately, and (2) the deterministic function f
negative influences in a single model, however that maps the outputs of Mi into Y . Formally,
detaches from a probabilistic interpretation of the causal independence model is a model for
the parameters, and consequently leads to diffi- which two independence assertions hold: (1) for
culties with their interpretation and it can not any two mechanism variables Mi and Mj (i 6= j)
be exploited to speed-up inference. Mi is independent of Mj given X1 , . . . , Xn , and
The remainder of this paper is organized as (2) Mi and any other variable in the network
follows. In Section 2, we briefly describe in- that does not belong to the causal mechanism
dependence of causal influences. In Section 3 are independent given X1 , . . . , Xn and Y . An
we discuss the amechanistic property of the ICI ICI model is shown in Figure 1.
models, while in Section 4 we introduce an ex- The most popular example of an ICI model
tension of ICI a new family of models proba- is the noisy-OR model. The noisy-OR model
bilistic independence of causal influences. In the assumes that all variables involved in the inter-
following two Sections 5 and 6, we introduce two action are binary. The mechanism variables in
examples of models for local probability distri- the context of the noisy-OR are often referred
butions, that belong to the new family. Finally, to as inhibitors. The inhibitors have the same
we conclude our paper with discussion of our range as Y and their CPTs are defined as fol-
proposal in Section 7. lows:

2 INDEPENDENCE OF CAUSAL P (Mi = y|Xi = xi ) = pi


INFLUENCES (ICI) P (Mi = y|Xi = xi ) = 0 . (1)

In this section we briefly introduce the concept Function f that combines the individual influ-
of independence of causal influences (ICI). First ences is the deterministic OR. It is important

326 A. Zagorecki and M. J. Druzdzel


to note that the domain of the function defin- is to allow for easy elicitation of parameters of
ing the individual influences are the outcomes the intermediate nodes Mi , even though these
(states) of Y (each mechanism variable maps can not be observed directly. This is achieved
Range(Xi ) to Range(Y )). This means that f through a particular way of setting (control-
is of the form Y = f (M1 , . . . , Mn ), where typ- ling) the causes Xi . Assuming that all causes
ically all variables Mi and Y take values from except for a single cause Xi are in their dis-
the same set. In the case of the noisy-OR model tinguished states and Xi is in some state (not
it is {y, y}. The noisy-MAX model is an ex- distinguished), it is easy to determine the prob-
tension of the noisy-OR model to multi-valued ability distribution for the hidden variable Mi .
variables where the combination function is the Not surprisingly, the noisy-OR model is an
deterministic MAX defined over Y s outcomes. example of an amechanistic model. In this case,
the distinguished states are usually false or ab-
3 AMECHANISTIC PROPERTY sent states, because the effect variable is guar-
anteed to be in the distinguished state given
The amechanistic property of the causal interac- that all the causes are in their distinguished
tion models was explicated by Breese and Heck- states Xi = xi . Equation 1 reflects the amecha-
erman, originally under the name of atemporal nistic assumption and results in the fact that the
(Heckerman, 1993), although later the authors parameters of the mechanisms (inhibitors) can
changed the name to amechanistic (Heckerman be obtained directly as the conditional proba-
and Breese, 1996). The amechanistic property bilities: P (Y = y|x1 , . . . , xi1 , xi , xi+1 , . . . , xn ).
of ICI models relates to a major problem of Similarly, the noisy-MAX is an amechanistic
this proposal namely the problem of deter- model it assumes that each parent variable has
mining semantic meaning of causal mechanisms. a distinguished state (arbitrarily selected) and
In practice, it is often impossible to say any- the effect variable has a distinguished state. In
thing about the nature of causal mechanisms the case of the effect variable, the distinguished
(as they can often be simply artificial constructs state is assumed to be the lowest state (with re-
for the sake of modeling) and therefore they, or spect to the ordering relation imposed on the ef-
their parameters, can not be specified explicitly. fect variables states). We strongly believe that
Even though Heckerman and Breese proposed a the amechanistic property is highly significant
strict definition of the amechanistic property, from the point of view of knowledge acquisi-
in this paper we will broaden this definition tion. Even though we introduce a parametric
and assume that an ICI model is an amecha- model instead of a traditional CPT, the amech-
nistic model, when its parameterization can be anistic property causes the parametric model to
defined exclusively by means of (a subset) of be defined in terms of a conditional probability
conditional probabilities P (Y |X) without men- distribution P (Y |X) and, therefore, is concep-
tioning Mi s explicitly. This removes the burden tually consistent with the BN framework. We
of defining mechanisms directly. believe that the specification of a parametric
To achieve this goal it is assumed that one model in terms of probabilities has contributed
of the states of each cause Xi is a special state to the great popularity of the noisy-OR model.
(also referred to as as the distinguished state).
Usually such a state is the normal state, like 4 PROBABILISTIC ICI
ok in hardware diagnostic systems or absent for
a disease in a medical system, but such asso- The combination function in the ICI models is
ciation depends on the modeled domain. We defined as a mapping of mechanisms states into
will use * symbol to denote the distinguished the states of the effect variable Y . Therefore,
state. Given that all causes Xi are in their dis- it can be written as Y = f (M), where M is
tinguished states, the effect variable Y is guar- a vector of mechanism variables. Let Qi be
anteed to be in its distinguished state. The idea a set of parameters of CPT of node Mi , and

Probabilistic Independence of Causal Influences 327


Q = {Q1 , . . . , Qn } be a set of all parameters
of all mechanism variables. Now we define the
new family probabilistic independence of causal
interactions (PICI) for local probability distrib-
utions. A PICI model for the variable Y consists
of (1) a set of n mechanism variables Mi , where
each variable Mi corresponds to exactly one par-
ent Xi and has the same range as Y , and (2) a
combination function f that transforms a set of
probability distributions Qi into a single prob-
ability distribution over Y . The mechanisms
Figure 2: BN model for probabilistic indepen-
Mi in the PICI obey the same independence as-
dence of causal interactions, where P (Y |M) =
sumptions as in the ICI. The PICI family is de-
f (Q, M).
fined in a way similar to the ICI family, with
the exception of the combination function, that
is defined in the form P (Y ) = f (Q, M). The ative influences on the effect variable that has a
PICI family includes both ICI models, which distinguished state in the middle of the scale.
can be easily seen from its definition, as f (M) We assume that the parent nodes Xi are dis-
is a subset of f (Q, M), assuming Q is the empty crete (not necessarily binary, nor an ordering
set. relation between states is required), and each of
In other words, in the case of ICI for a given them has one distinguished state, that we de-
instantiation of the states of the mechanism note as xi . The distinguished state is not a
variables, the state of Y is a function of the property of a parent variable, but rather a part
states of the mechanism variables, while for the of a definition of a causal interaction model a
PICI the distribution over Y is a function of variable that is a parent in two causal indepen-
the states of the mechanism variables and some dence models may have different distinguished
parameters Q. states in each of these models. The effect vari-
Heckerman and Breese (Heckerman, 1993) able Y also has its distinguished state, and by
identified other forms (or rather properties) of analogy we will denote it by y . The range of
the ICI models that are interesting from the the mechanism variables Mi is the same as the
practical point of view. We would like to note range of Y . Unlike the noisy-MAX model, the
that those forms (decomposable, multiple de- distinguished state may be in the middle of the
composable, and temporal ICI) are related to scale.
properties of the function f , and can be applied In terms of parameterization of the mecha-
to the PICI models in the same way as they are nisms, the only constraint on the distribution
applied to the ICI models. of Mi conditional on Xi = xi is:

5 NOISY-AVERAGE MODEL P (Mi = mi |Xi = xi ) = 1


P (Mi 6= mi |Xi = xi ) = 0 , (2)
In this section, we propose a new local distrib-
ution model that is a PICI model. Our goal while the other parameters in the CPT of Mi
is to propose a model that (1) is convenient can take arbitrary values.
for knowledge elicitation from human experts The definition of the CPT for Y is a key el-
by providing a clear parameterization, and (2) ement of our proposal. In the ICI models, the
is able to express interactions that are impos- CPT for Y was by definition constrained to be
sible to capture by other widely used models a deterministic function, mapping states of Mi s
(like the noisy-MAX model). With this model to the states of Y . In our proposal, we define
we are interested in modeling positive and neg- the CPT of Y to be a function of probabilities

328 A. Zagorecki and M. J. Druzdzel


of the Mi s: (often referred to as leak ), that stands for all
( Q
n
unmodeled causes and is assumed to be always
for y = y
P (y|x) = i=1 P (Mi = y |xi ) in some state x0 . The leak probabilities are ob-
Pn
n i=1 P (Mi = y|xi ) 6 y
for y = tained using:
where is a normalizing constant discussed P (Y = y|x1 , . . . , xn ) = P (M0 = y) .
later. Let qij = qi . For simplicity of notation
However, this slightly complicates the schema
assume that qij = P (Mi = y j |xi ), qi = P (Mi =
for obtaining parameters P (Mi = y|xi ). In
y |xi ), and D = ni=1 P (Mi = y |xi ). Then we
Q
the case of the leaky model, the equality in
can write:
Equation 3 does not hold, since X0 acts as
my my n
X a regular parent variable that is in a non-
qij =
X X
P (yj |x) = D + distinguished state. Therefore, the parameters
j=1 j=1,j6=j
n i=1
for other mechanism variables should be ob-
X X
my n
X n tained using conditional probabilities P (Y =
=D+ qij = D + (1 qi ), y|x1 , . . . , xi , . . . , xn ), P (M0 = y|x0 ) and the
n j=1,j6=j i=1 n i=1
combination function. This implies that the
where my is the number of states of Y . Since acquired probabilities should fulfil some non-
the sum on the left must equal 1, as it defines trivial constraints. Because of space limitation,
the probability distribution P (Y |x), we can cal- we decided to skip the discussion of these con-
culate as: straints. In a nutshell, these constraints are sim-
ilar in nature to constraints for the leaky noisy-
n(1 D)
= Pn . MAX model. These constraints should not be a
i=1 (1 qi ) problem in practice, when P (M0 = y ) is large
The definition above does not define P (Y |M) (which implies that the leak cause has marginal
but rather P (Y |X). It is possible to calculate influence on non-distinguished states).
P (Y |M) from P (Y |X), though it is not needed Now we introduce an example of the applica-
to use the model. Now we discuss how to ob- tion of the new model. Imagine a simple di-
tain the probabilities P (Mi |Xi ). Using defini- agnostic model for an engine cooling system.
tion of P (y|x) and the amechanistic property, The pressure sensor reading (S) can be in three
this task amounts to obtaining the probabilities states high, normal, or low {hi, nr, lo}, that
of Y given that Xi is in its non-distinguished correspond to pressure in a hose. Two possible
state and all other causes are in their distin- faults included in our model are: pump failure
guished states (in a very similar way to how the (P) and crack (C). The pump can malfunction
noisy-OR parameters are obtained). P (y|x) in in two distinct ways: work non-stop instead of
this case takes the form: adjusting its speed, or simply fail and not work
at all. The states for pump failure are: {ns, fa,
P (Y = y|x1 , . . . , xi1 , xi , . . . , xi+1 , xn ) = ok }. For simplicity we assume that the crack on
the hose can be present or absent {pr,ab}. The
= P (Mi = y|xi ) , (3)
BN for this problem is presented in Figure 3.
and, therefore, defines an easy and intuitive way The noisy-MAX model is not appropriate here,
for parameterizing the model by just asking for because the distinguished state of the effect
conditional probabilities, in a very similar way variable (S) does not correspond to the lowest
to the noisy-OR model. It is easy to notice that value in the ordering relation. In other words,
P (Y = y |x1 , . . . , xi , . . . , xn ) = 1, which poses the neutral value is not one of the extremes,
a constraint that may be unacceptable from a but lies in the middle, which makes use of the
modeling point of view. We can address this MAX function over the states inappropriate. To
limitation in a very similar way to the noisy- apply the noisy-average model, first we should
OR model, by assuming a dummy variable X0 identify the distinguished states of the variables.

Probabilistic Independence of Causal Influences 329


model the CPT of Y can be decomposed in
pairwise relations (Figure 4) and such a decom-
position can be exploited for sake of improved
inference speed in the same way as for decom-
posable ICI models.

Figure 3: BN model for the pump example.

In our example, they will be: normal for sen-


sor reading, ok for pump failure and absent for
crack. The next step is to decide whether we Figure 4: Decomposition of a combination func-
should add an influence of non-modeled causes tion.
on the sensor (a leak probability). If such an
influence is not included, this would imply that It is important to note that the noisy-average
P (S = nr |P = ok, C = ab) = 1, otherwise model does not take into account the ordering
this probability distribution can take arbitrary of states (only the distinguished state is treated
values from the range (0, 1], but in practice it in a special manner). If a two causes have high
should always be close to 1. probability of high and low pressure, one should
not expect that combined effect will have prob-
Assuming that the influence of non-modeled
ability of normal state (the value in-between).
causes is not included, the acquisition of the
mechanism parameters is performed directly
6 AVERAGE MODEL
by asking for conditional probabilities of form
P (Y |x1 , . . . , xi , . . . , xn ). In that case, a typical Another example of a PICI model we want to
question asked of an expert would be: What present is the model that averages influences
is the probability of the sensor being in the low of mechanisms. This model highlights another
state, given that a crack was observed but the property of the PICI models that is important
pump is in state ok? However, if the unmod- in practice. If we look at the representation of
eled influences were significant, an adjustment a PICI model, we will see that the size of the
for the leak probability is needed. Having ob- CPT of node Y is exponential in the number of
tained all the mechanism parameters, the noisy- mechanisms (or causes). Hence, in general case
average model specifies a conditional probabil- it does not guarantee a low number of distrib-
ity in a CPT by means of the combination func- utions. One solution is to define a combination
tion. function that can be expressed explicitly in the
Intuitively, the noisy-average combines the form of a BN but in such a way that it has sig-
various influences by averaging probabilities. In nificantly fewer parameters. In the case of ICI
case where all active influences (the parents in models, the decomposability property (Hecker-
non-distinguished states) imply high probabil- man and Breese, 1996) served this purpose, and
ity of one value, this value will have a high can do too for in PICI models. This property
posterior probability, and the synergetic effect allows for significant speed-ups in inference.
will take place similarly to the noisy-OR/MAX In the average model, the probability distrib-
models. If the active parents will vote for dif- ution over Y given the mechanisms is basically
ferent effects states, the combined effect will be a ratio of the number of mechanisms that are
an average of the individual influences. More- in given state divided by the total number of
over, the noisy-average model is a decomposable mechanisms (by definition Y and M have the

330 A. Zagorecki and M. J. Druzdzel


same range): person in the vehicle, try to carry the attack
n
at rush hours or time when the security is less
1X strict, etc. Each of these behaviors is not nec-
P (Y = y|M1 , . . . , Mn ) = I(Mi = y) .
n i=1 essarily a strong indicator of terrorist activity,
(4) but several of them occurring at the same time
where I is the identity function. Basically, this may indicate possible threat.
combination function says that the probability The average model can be used to model this
of the effect being in state y is the ratio of mech- situation as follows: separately for each of sus-
anisms that result in state y to all mechanisms. picious activities (causes) a probability distribu-
Please note that the definition of how a cause Xi tion of terrorist presence given this activity can
results in the effect is defined in the probability be obtained which basically means specification
distribution P (Mi |Xi ). The pairwise decompo- of probability distribution of mechanisms. Then
sition can be done as follows: combination function for the average model acts
as popular voting to determine P (Y |X).
P (Yi = y|Yi1 = a, Mn = b)
The average model draws ideas from the lin-
i 1 ear models, but unlike the linear models, the lin-
= I(y = a) + I(y = b) , ear sum is done over probabilities (as it is PICI),
i+1 i+1
for Y2 , . . . , Yn . Y1 is defined as: and it has explicit hidden mechanism variables
that define influence of single cause on the ef-
P (Y1 = y|M1 = a, M2 = b) = fect.

1 1 7 CONCLUSIONS
= I(y = a) + I(y = b) .
2 2
The fact that the combination function is de- In this paper, we formally introduced a new
composable may be easily exploited by inference class of models for local probability distribu-
algorithms. Additionally, we showed that this tions that is called PICI. The new class is an
model presents benefits for learning from small extension of the widely accepted concept of in-
data sets (Zagorecki et al., 2006). dependence of causal influences. The basic idea
Theoretically, this model is amechanistic, be- is to relax the assumption that the combination
cause it is possible to obtain parameters of function should be deterministic. We claim that
this model (probability distributions over mech- such an assumption is not necessary either for
anism variables) by asking an expert only for clarity of the models and their parameters, nor
probabilities in the form of P (Y |X). For ex- for other aspects such as convenient decompo-
ample, assuming variables in the model are bi- sitions of the combination function that can be
nary, we have 2n parameters in the model. It exploited by inference algorithms.
would be enough to select 2n arbitrary proba- To support our claim, we presented two con-
bilities P (Y |X) out of 2n and create a set of 2n ceptually distinct models for local probability
linear equations applying Equation 4. Though distributions that address different limitations
in practice, one needs to ensure that the set of of existing models based on the ICI. These mod-
equations has exactly one solution what in gen- els have clear parameterizations that facilitate
eral case is not guaranteed. their use by human experts. The proposed mod-
As an example, let us assume that we want els can be directly exploited by inference al-
to model classification of a threat at a mili- gorithms due to fact that they can be explic-
tary checkpoint. There is an expected terrorist itly represented by means of a BN, and their
threat at that location and there are particular combination function can be decomposed into a
elements of behavior that can help spot a terror- chain of binary relationships. This property has
ist. We can expect that a terrorist can approach been recognized to provide significant inference
the checkpoint in a large vehicle, being the only speed-ups for the ICI models. Finally, because

Probabilistic Independence of Causal Influences 331


they can be represented in form of hidden vari- D. Heckerman and J. Breese. 1994. A new look at
ables, their parameters can be learned using the causal independence. In Proceedings of the Tenth
Annual Conference on Uncertainty in Artificial
EM algorithm.
Intelligence (UAI94), pages 286292, San Fran-
We believe that the concept of PICI may lead cisco, CA. Morgan Kaufmann Publishers.
to new models not described here. One remark
we shall make here: it is important that new D. Heckerman and J. Breese. 1996. Causal indepen-
dence for probability assessment and inference us-
models should be explicitly expressible in terms ing Bayesian networks. In IEEE, Systems, Man,
of a BN. If a model does not allow for com- and Cybernetics, pages 26:826831.
pact representation and needs to be specified as
D. Heckerman. 1993. Causal independence for
a CPT for inference purposes, it undermines a
knowledge acquisition and inference. In Proceed-
significant benefit of models for local probabil- ings of the Ninth Conference on Uncertainty in
ity distributions a way to avoid using large Artificial Intelligence (UAI93), pages 122127,
conditional probability tables. Washington, DC. Morgan Kaufmann Publishers.

Acknowledgements D. Heckerman. 1999. A tutorial on learning with


Bayesian networks. In Learning in Graphical
This research was supported by the Air Force Models, pages 301354, Cambridge, MA, USA.
Office of Scientific Research grants F4962003 MIT Press.
10187 and FA95500610243 and by Intel Re-
M. Henrion. 1989. Some practical issues in con-
search. While we take full responsibility for structing belief networks. In L.N. Kanal, T.S.
any errors in the paper, we would like to thank Levitt, and J.F. Lemmer, editors, Uncertainty in
anonymous reviewers for several insightful com- Artificial Intelligence 3, pages 161173. Elsevier
ments that led to improvements in the presen- Science Publishing Company, Inc., New York, N.
Y.
tation of the paper, Tsai-Ching Lu for inspi-
ration, invaluable discussions and critical com- J. F. Lemmer and D. E. Gossink. 2004. Recursive
ments, and Ken McNaught for useful comments noisy-OR: A rule for estimating complex prob-
on the draft. abilistic causal interactions. IEEE Transactions
on Systems, Man and Cybernetics, (34(6)):2252
2261.
References J. Pearl. 1988. Probabilistic Reasoning in Intelligent
C. Boutilier, R. Dearden, and M. Goldszmidt. 1995. Systems: Networks of Plausible Inference. Mor-
Exploiting structure in policy construction. In gan Kaufmann Publishers, Inc., San Mateo, CA.
Chris Mellish, editor, Proceedings of the Four-
J. A. Rosen and W. L. Smith. 1996. Influence
teenth International Joint Conference on Artifi-
net modeling and causal strengths: An evolution-
cial Intelligence, pages 11041111, San Francisco. ary approach. In Command and Control Research
Morgan Kaufmann.
and Technology Symposium.
C. Boutilier, N. Friedman, M. Goldszmidt, and
D. Koller. 1996. Context-specific independence A. Zagorecki, M. Voortman, and M. Druzdzel. 2006.
in Bayesian networks. In Proceedings of the 12th Decomposing local probability distributions in
Annual Conference on Uncertainty in Artificial Bayesian networks for improved inference and pa-
Intelligence (UAI-96), pages 115123, San Fran- rameter learning. In FLAIRS Conference.
cisco, CA. Morgan Kaufmann Publishers.
K.C. Chang, P.E. Lehner, A.H. Levis, Abbas K.
Zaidi, and X. Zhao. 1994. On causal influence
logic. Technical Report for Subcontract no. 26-
940079-80.
F. J. Dez and Marek J. Druzdzel. 2002. Canon-
ical probabilistic models for knowledge engineer-
ing. Forthcoming.
I. Good. 1961. A causal calculus (I). British Journal
of Philosophy of Science, 11:305318.

332 A. Zagorecki and M. J. Druzdzel


Author Index
Abellan, Joaqun, 1 Martnez, Irene, 187
Antal, Peter, 9, 17 Mateo, Juan L., 115
Antonucci, Alessandro, 25 Meganck, Stijn, 195
Bangs, Olav, 35 Millinghoffer, Andras, 9, 17
Beltoft, Ivan, 231 Moral, Serafn, 1, 83
Bilgic, Taner, 239 Morales, Eduardo, 263
Bjorkegren, Johan, 247 Morton, Jason, 207
Bolt, Janneke H., 43, 51 Nielsen, Jens D., 215
Chen, Tao, 59 Nielsen, Sren H., 223
Cobb, Barry R., 67 Nielsen, Thomas D., 223
Dessau, Ram, 231 Nilsson, Roland, 247
Dez, Francisco J., 131, 179 Olesen, Kristian G., 231
Druzdzel, Marek J., 279, 317, 325 Ozgur-Unluakin, Demet, 239
Feelders, Ad, 75 Pachter, Lior, 207
Flores, M. Julia, 83 Pena, Jose M., 247
Francois, Olivier C.H., 91 Puerta, Jose M., 115
van der Gaag, Linda C., 51, 99, 107, 255 Renooij, Silja, 99, 255
Gamez, Jose A., 83, 115, 123 Reyes, Alberto, 263
Geenen, Petra L., 99 Rodrguez, Carmelo, 187
van Gerven, Marcel A. J., 131 Rum, Rafael, 123
Gezsi, Andras, 9 Salmeron, Antonio, 123, 187
Gomez-Olmedo, Manuel, 1 Santafe, Guzman, 271
Gomez-Villegas, Miguel A., 139 Shimizu, Shohei, 155
Gonzales, Christophe, 147 Shiu, Anne, 207
Hejlesen, Ole K., 231 Smith, Jim, 293
Heskes, Tom, 163 Sndberg-Madsen, Nicolaj, 35
Hoyer, Patrik O., 155 Sturmfels, Bernd, 207
Hullam, Gabor, 9 Sucar, L. Enrique, 263
Ibarguengoytia, Pablo, 263 Sun, Xiaoxun, 279
Ivanovs, Jevgenijs, 75 Susi, Rosario, 139
Jaeger, Manfred, 215 Simecek, Petr, 287
Jensen, Finn V., 35 Tegner, Jesper, 247
Jouve, Nicolas, 147 Tel, Gerard, 171
Jurgelenaite, Rasa, 163 Thwaites, Peter, 293
Kerminen, Antti J., 155 Trangeled, Michael, 231
Kwisthout, Johan, 171 de Waal, Peter R., 107
Larranaga, Pedro, 271 Wang, Yi, 301
Leray, Philippe, 91, 195 Wienand, Oliver, 207
Lozano, Jose A., 271 Xiang, Yang, 309
Luque, Manuel, 179 Yuan, Changhe, 279, 317
Maes, Sam, 195 Zaffalon, Marco, 25
Man, Paloma, 139 Zagorecki, Adam, 325
Manderick, Bernard, 195 Zhang, Nevin L., 59, 301
Title: Proceedings of the Third European Workshop
on Probabilitic Graphical Models
Editors: Milan Studeny and Jir Vomlel
Published by: Zeithamlova Milena, Ing. - Agentura Action M
Vrsovicka 68, 101 00 Praha 10, Czech Republic
http://www.action-m.com
Number of Pages: 344
Number of Copies: 80
Year of Publication: 2006
Place of Publication: Prague
First Edition
Printed by: Reprostredisko UK MFF
Charles University in Prague - Faculty of Mathematics and Physics
Sokolovska 83, 186 75 Praha 8 - Karln, Czech Republic

ISBN 80-86742-14-8

Anda mungkin juga menyukai