
Cellular Automata, Dynamical Systems and Neural Networks

Mathematics and Its Applications

Managing Editor:

M. HAZEWINKEL
Centre for Mathematics and Computer Science, Amsterdam, The Netherlands

Volume 282
Cellular Automata,
Dynamical Systems
and Neural Networks

edited by

Eric Goles
and
Servet Martinez
Departamento de Ingeniería Matemática,
F.C.F.M.,
Universidad de Chile,
Santiago, Chile

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.


A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-90-481-4382-5 ISBN 978-94-017-1005-3 (eBook)


DOI 10.1007/978-94-017-1005-3

Printed on acid-free paper

All Rights Reserved


© 1994 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1994
No part of the material protected by this copyright notice may be reproduced or
utilized in any form or by any means, electronic or mechanical,
including photocopying, recording or by any information storage and
retrieval system, without written permission from the copyright owner.
CONTENTS

FOREWORD vii

CELLULAR AUTOMATA AND TRANSDUCERS. A TOPOLOGICAL VIEW


François Blanchard 1

AUTOMATA NETWORK MODELS OF INTERACTING POPULATIONS


Nino Boccara 23

ENTROPY, PRESSURE AND LARGE DEVIATION


Artur Lopes 79

FORMAL NEURAL NETWORKS: FROM SUPERVISED TO


UNSUPERVISED LEARNING
Jean-Pierre Nadal 147

STORAGE OF CORRELATED PATTERNS IN NEURAL NETWORKS


Patricio Perez 167
FOREWORD

This book contains the courses given at the Third School on Statistical Physics
and Cooperative Systems held at Santiago, Chile, from 14th to 18th December
1992. The main idea of this periodic school was to bring together scientists working on subjects related to recent trends in Statistical Physics; more precisely, nonlinear phenomena, dynamical systems, ergodic theory, cellular automata, symbolic dynamics, large deviation theory and neural networks. Scientists
working in these subjects come from several areas: mathematics, biology, physics,
computer science, electrical engineering and artificial intelligence. Recently, a very
important cross-fertilization has taken place with regard to the aforesaid scientific
and technological disciplines, so as to give a new approach to the research whose
common core remains in statistical physics.

Each contribution is devoted to one or more of the previous subjects. In most


cases they are structured as surveys, presenting at the same time an original point
of view about the topic and showing mostly new results.

The expository text of François Blanchard concerns the study of normal numbers and their preservation under some symbolic transformations. This work furnishes
the main concepts used in symbolic dynamics and automata theory. Some open
problems dealing with cellular automata are presented.

The paper of Nino Boccara tackles the investigation of models of interacting populations, for instance in epidemiology and ecology. He presents a discrete-space approach in the context of cellular automata as well as the more classical continuous models.

The survey paper of Artur Lopes is devoted to the study of the relations between the ergodic theory of dynamical systems and large deviation theory. In this paper the main concepts and basic results of ergodic theory are introduced: Birkhoff's theorem, entropy, pressure and the Ruelle-Perron-Frobenius operator; as well as


the formalism of large deviation theory. The main results connecting pressure and
free energy are established.

The work of Jean-Pierre Nadal presents different supervised and unsupervised learning strategies for artificial neural networks; a duality notion between neural network architectures is also developed, providing a tool for comparing different learning strategies.

The exposition of Patricio Perez deals with the storage capacities of artificial neural networks. He presents, in the framework of statistical mechanics, the storage of unbiased, biased and correlated patterns, as well as some numerical results concerning the storage capacity of two coupled Hopfield networks.

The editors are grateful to the participants of the School, as well as to the authors of the individual chapters. They are also indebted to the sponsors and supporters whose interest and help were essential for the success of the meeting: Fondecyt, Conicyt, French Cooperation, Departamento de Relaciones Internacionales and DTI of the Universidad de Chile, and Departamento de Ingeniería Matemática and CENET of the Facultad de Ciencias Físicas y Matemáticas.

Mrs. Gladys Cavallone deserves a special mention for her very fine and hard
work typing the book.

The Editors
CELLULAR AUTOMATA AND TRANSDUCERS.
A TOPOLOGICAL VIEW

FRANÇOIS BLANCHARD
C.N.R.S.
Laboratoire de Mathématiques Discrètes
Case 930 - 163 avenue de Luminy
13288 Marseille Cedex 9
France

ABSTRACT. In this article we deal with two of the numerous instances in which automata
play a part in Topological or Measurable Dynamics. The first is preservation of normality by
transducers; here we give a detailed account of the main proof in [2], that of normality preservation
under multiplication by rationals. The second instance is an introduction to the dynamical
properties of onto cellular automata - since there are but a few known results about them, we
mainly give definitions, point out some elementary properties and ask questions. These two
applications of automata in the field of Dynamics, though very different in spirit, are strongly
linked, because cellular automata are a particular class of transducers, because entropy and other,
mainly topological, notions from Dynamical Systems play an important part in both, and finally
because one of the underlying aims is to study the transformations of the interval associated to
some of these automata.

1. Introduction

There are not so many references to transducers in mathematical literature; this


paper partly arises from the belief that their importance will increase in the future.
Its purpose is not to introduce new results, but rather to illustrate two ideas: first,
that transducers are suitable tools for some questions of normality; and second,
that the dynamics of onto cellular automata (and also transducers, but this still
more difficult matter is not addressed here) is a huge open field with few known
results, having connections with number theory and formal languages, as well as
statistical physics.
Its practical aim is to emphasise two connected, though rather different aspects
of what can be done in Symbolic Dynamics and related fields by using this kind

E. Goles and S. Martinez (eds.), Cellular Automata, Dynamical Systems and Neural Networks, 1-22.
© 1994 Kluwer Academic Publishers.

of tools. In the first instance transducers are used as tools; we give an ergodic proof of the fact that normality is preserved under multiplication by rationals. Only the proof is original in the article it comes from [2] - the result was first obtained during the fifties by Harmonic Analysis - but there are some interesting side effects: the methods permit one to investigate what happens, under multiplication by rationals, to real numbers that are generic for some non-Lebesgue measures; they also allow one to prove that multiplication by rationals preserves "near" normality.
The second topic is the dynamics of onto cellular automata - which means
in this case their properties are described for their own sake. A possible physical
interpretation is the behaviour of some interacting or cooperative systems at equi-
librium. This field is wide and largely unexplored, and for this reason promising for
beginners. After recalling the definitions of topological conjugacy and entropy, we
introduce two notions for one-sided cellular automata, expansiveness and a "stan-
dard symbolic factor", which we feel sure will play an important part in future
developments, and then prove that the symbolic factor has the entropy of the cellular automaton, and that expansiveness is equivalent to the fact that the cellular
automaton and its symbolic factor are conjugate. Finally a few open questions are
posed.
We have made references to some of the papers we are aware of, which does not mean that authors of other papers in the field should feel undeservedly neglected; this
is decidedly not a survey, and all the less so in the domain of cellular automata,
where a huge literature already exists. As should be expected, there is infinitely
more in the ergodic theory of compact spaces than is to be found in this paper; only
a few useful ergodic definitions and results are recalled, mainly at the beginning of
the sections in which they are used. Motivated readers can consult [7] or [21] for a
deeper insight. Let us point out that in this article the point of view is rather that of
topological dynamics - considering primarily a compact metric space, such as [0, 1)
or the shift space, endowed with a continuous transformation, and then introducing
one or several invariant measures - than that of metric dynamics - considering a
probability space, and then a measurable measure-preserving transformation.
After defining transducers and introducing our two topics in Section 2, Section 3 is devoted to multiplication transducers and normality preservation, and Section 4 to the dynamics of onto cellular automata.


I want to thank Alejandro Maass, to whom I am greatly indebted for much information, many discussions and remarks.

2. Transducers

2.1. SOME PRELIMINARY DEFINITIONS

Let A be a finite set of symbols, or alphabet. A* denotes the set of all finite sequences on A, and A+ the set of all nonempty sequences. The sets of infinite sequences on A, A^ℕ and A^ℤ, will be constantly used in this article: their elements are denoted by x = (x_i, i ∈ ℕ or ℤ); we write x(i,j) = x_i x_{i+1} ... x_j. They are compact for the usual topology, and usually endowed with the shift transformation σ, defined by

(σx)_i = x_{i+1},

which is a continuous map. On A^ℤ it also is one-to-one. A subshift X is a closed σ-invariant subset of A^ℤ; it is completely described by the set L(X) of all words that are allowed in the coordinates of its elements. In A^ℕ the situation is the same: a closed shift-invariant subset of A^ℕ is the image under projection on the positive coordinates of a closed shift-invariant subset of A^ℤ.
Many families of subshifts have been studied for different purposes; the only one worth mentioning here is that of sofic systems: these are the subshifts whose associated language L(X) can be recognized by a finite automaton (see below). Soficity is preserved under factor maps. The full shift A^ℤ itself is sofic; less trivial examples are the subshifts defined (for input and output) by the transducer of Example 1 below.
A factor map is a homomorphism of subshifts, that is to say a continuous, onto, shift-commuting map from the subshift Y to the subshift X; in this situation X is said to be a factor of Y.
4

2.2. WHAT IS A TRANSDUCER?

A transducer may be described as a translating machine: when fed a sequence of symbols, it changes it into another. This is done sequentially, i.e. letter by letter - hence the shift-commuting property, which may also be viewed as spatial self-consistency; the output is not unique, it varies according to the state the machine is in at the start.
Here is a definition. When describing a transducer T one requires three finite alphabets A, A' and C, where A and A' are sets of symbols (they are identical in our examples) and C is the set of states of the machine. A transducer T is an oriented graph on C, in which each arc has two labels, one (the input label) in A, one (the output label) in A'. Suppose you want to transduce the word u = a_1 a_2 ... a_n: first choose an initial state c_1 in C, then look for an arc starting from c_1 with input label a_1. If such an arc does not exist one cannot transduce u; if there is one, then write down the output letter b_1 carried by the same arc, and replace c_1 by the state c_2 in which the arc ends. Then do the same again, first with a_2 and the new initial state c_2, until u has been entirely spelled. One thus obtains a transduced word u' = b_1 b_2 ... b_n. Remark that there may be as many words transduced from u as there are states in C.
Transducers thus described act (i.e. read and write words) from left to right; but they can also be made to do this in the opposite direction, by just reversing the reading order. We shall in fact have to do so when we want to represent multiplication by an integer as a transduction.

Example 1. For A = A' = {0, 1}, C = {a, b}, consider the following graph as that of a left-to-right transducer:

[Graph: from state a, an arc labelled 0/0 back to a and an arc labelled 1/1 to b; from state b, an arc labelled 0/1 to a and an arc labelled 1/0 back to b.]

With initial state a, the transduced image of u = 001101 is 001011. Remark that to each digit the transducer is adding the one to the left: the state a means that we add 0, the state b that we add 1 (mod 2), to the next input.
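The procedure just described can be sketched in a few lines. The dictionary encoding (state, input letter) → (output letter, next state) below is our own choice; it suffices here because the transducer of Example 1 is input-deterministic, whereas a general transducer would need a set of outgoing arcs for each pair.

```python
def transduce(arcs, state, word):
    """Transduce `word` from the initial state `state`; `arcs` maps
    (state, input letter) to (output letter, next state)."""
    out = []
    for a in word:
        if (state, a) not in arcs:
            return None          # the word cannot be transduced from this state
        b, state = arcs[(state, a)]
        out.append(b)
    return "".join(out)

# Example 1: state a adds 0, state b adds 1 (mod 2), to the next input.
example1 = {
    ("a", "0"): ("0", "a"), ("a", "1"): ("1", "b"),
    ("b", "0"): ("1", "a"), ("b", "1"): ("0", "b"),
}

print(transduce(example1, "a", "001101"))  # -> 001011, as in the text
```

Starting from state b instead of a yields a different transduced word, illustrating that u may have as many transduced images as there are states in C.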

Let us next introduce some relevant vocabulary. When one considers only one of the labels (the input or the output), the transducer T is reduced to what is usually called an automaton. Because of this, all usual terms in Automata Theory are applied to transducers, with the added qualification "input" or "output" when necessary for the sake of definiteness.
A transducer is said to be irreducible when its graph is strongly connected. A word u ∈ A* (resp. A'*) is accepted for input (resp. output) by T if there is a path with input (resp. output) label u in the graph. The set of all input (resp. output) words accepted by T is denoted by L_0(T) (resp. L_1(T)). These two languages define two subshifts. But the only case we are concerned with is when L_0(T) = L_1(T) = A*.
Suppose a transducer is assigned the (too hard) task of translating English into
Spanish. One would expect it to accept more or less correct English words, phrases
or sentences: there is no point in translating rubbish. Then the sets of "words" (in
the sense of Language Theory) accepted by this transducer for input and output
would be proper subshifts. On the other hand, a transducer performing multipli-
cation by 3 on the expansions of all real numbers (or even integers) to base 2 must
accept all words on {0, 1}, since all of them occur in some expansion. The cellular
automata we shall be dealing with possess the same property.
A transducer is said to be input-deterministic if, given c ∈ C, a ∈ A, there exists at most one arc from c with input a (same definition for output). For instance the transducer of Example 1 is input- and output-deterministic. This property is very useful but sometimes too strong. A convenient weaker one is the following: a non-ambiguous transducer is one such that for any u ∈ A+, c, c' ∈ C, there is at most one path in the graph from c to c' with label u. A deterministic transducer is obviously non-ambiguous. Non-ambiguity implies that for any u ∈ A+ there are at most #(C) paths with label u.
Two particular classes are especially considered in the sequel: cellular au-
tomata, which are a very simple kind of transducers, much easier to handle than
the general type since they are merely maps, and multiplication transducers.

2.3. CELLULAR AUTOMATA

Cellular automata are widely used in statistical physics in order to model the microscopic equilibrium or evolution of fluids or spin glasses. A cellular automaton is a map F from the configuration set A^ℤ or A^ℕ to itself, defined in the first case by

(F(x))_i = f(x(i-n, i+n))

for some given map f : A^{2n+1} → A, and in the second by

(F(x))_i = f(x(i, i+n))

with f : A^{n+1} → A; n is called the radius of the CA, and f is its rule. In both cases it is a continuous, shift-commuting but generally not onto map from the configuration set to itself. In other terms, a cellular automaton is just a factor map from A^ℤ or A^ℕ to itself (and onto some of its subshifts); a reason for introducing a special name for these objects is that in the theory of cellular automata the emphasis is set on the dynamics of the transformation F.
How does one represent a cellular automaton by a transducer? This is fairly easy. In A^ℤ, the two symbol sets are two different copies of the set A, and C = A^{2n}: a state is just a memory containing 2n coordinates. Whatever b ∈ A, from c = au, a ∈ A, u ∈ A^{2n-1}, there starts an arc to the vertex ub with input b and output f(aub). This transducer acts from left to right, but it is just as easy to describe a right-to-left one.
The transducer of Example 1 represents a cellular automaton (computing its
rule is an exercise left to the reader).
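A minimal sketch of the one-sided CA map (F(x))_i = f(x(i, i+n)), applied here to a spatially periodic configuration to avoid boundary issues; the radius-1 XOR rule is our own illustrative choice, closely related to the rule of Example 1's transducer.

```python
def ca_step(f, n, x):
    """One step of the one-sided CA with rule f of radius n,
    applied to the spatially periodic configuration x."""
    L = len(x)
    # cell i reads the window x_i, ..., x_{i+n} (indices taken mod L)
    return [f(*(x[(i + j) % L] for j in range(n + 1))) for i in range(L)]

xor = lambda a, b: a ^ b          # a radius-1 rule f : A^2 -> A
print(ca_step(xor, 1, [0, 0, 1, 1, 0, 1]))  # -> [0, 1, 0, 1, 1, 1]
```

Iterating `ca_step` gives the orbit of a periodic configuration under F, the dynamics this chapter is concerned with.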

2.4. MULTIPLICATION TRANSDUCERS

Not all transducers can be reduced to cellular automata. The ones performing
multiplication by k in base p generally cannot [3]. Multiplication transducers and
their properties are well-known to language theorists, but we do not know of any
book in which they are described; we are therefore referring the reader to [2, section
4.A] for proofs.

The transducer T_{k,p} multiplying by k in base p is just a representation of the usual algorithm. Each of the sets A and A' is equal to {0, 1, ..., p-1}. The state set C is the set of carries {0, 1, ..., k-1}. Like the algorithm, the transducer acts from right to left. Denote by [r] the integer part of the real number r. Suppose the initial carry is c, and one has to multiply the input a: then the output b and the new carry c' are given by the formulas

b = ka + c (mod p) (1)

and

c' = [(ka + c)/p]. (2)

From these formulas one easily deduces that the set of carries may be restricted to {0, 1, ..., k-1}. The graph of T_{k,p}, together with the input labels, can be deduced from Equation (2); Equation (1) gives the output labels. Equation (2) also testifies that no other carry need be added to the set C, and that T_{k,p} is always input-deterministic (this simply means that given the input and carry at some time, one can deduce from them the carry at the next time). It is slightly more difficult to check that no carry in C can be done without. The results we need are summed up in the following classical statement.

Proposition 1. T_{k,p} is always input-deterministic. When k and p are coprime, it is also output-deterministic. When k divides p, T_{k,p} is non-ambiguous for output.

Exercise. Construct the graphs of the transducers T_{3,2} and T_{4,2}.
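Formulas (1) and (2) translate directly into code: the carry is the state, and the base-p digits are processed from right to left, exactly as in the school algorithm. A sketch (the function name and the final carry-flushing loop, needed when multiplying a finite integer rather than an infinite expansion, are our own):

```python
def multiply(k, p, digits):
    """Multiply by k the base-p integer given by `digits`
    (most significant digit first), via the transducer T_{k,p}."""
    c, out = 0, []
    for a in reversed(digits):          # the transducer reads right to left
        b, c = (k * a + c) % p, (k * a + c) // p   # formulas (1) and (2)
        out.append(b)
    while c:                            # flush the remaining carry
        out.append(c % p)
        c //= p
    return list(reversed(out))

# 13 in base 2 is 1101; 3 * 13 = 39, i.e. 100111 in base 2.
print(multiply(3, 2, [1, 1, 0, 1]))  # -> [1, 0, 0, 1, 1, 1]
```

Along the way the carry never exceeds k-1, illustrating why the state set C can be restricted to {0, 1, ..., k-1}.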

3. Normality Preservation

As was announced in the introduction, we are going to show that multiplication by a rational (mod 1) preserves normality. Part of the motivation comes from the wide use of random sequences. Those that are actually used for applications are theoretically far from random, since they are periodic (of course with a huge period), and there is no theorem proving they are doing their job properly; in some instances they proved really unfit. Those which are known to be normal, i.e. random from a purely statistical point of view, do not behave satisfactorily at the "beginning" (say, for the 10^10 first digits!), at least for some sophisticated simulations. This is a strong motivation for developing theoretical research on normality.
Another one is more specific. Implicit in [9] is the question whether any non-atomic measure on the 1-torus, invariant under multiplication by 2 and multiplication by 3, is necessarily Lebesgue. Some progress was made towards a positive answer (see [13] for more on this subject) but the general result is still unknown. Transducers yield a new formulation of the problem - which does not mean they are pointing to a solution.
In [1], [5], [14], [20] the reader can find other aspects of what can be done
about normality with more general kinds of transducers.

3.1. INVARIANT MEASURES AND ENTROPY

Some further notions of ergodic theory, especially on compact spaces, are worth introducing at this point.
Consider a compact metric set X, endowed with an onto endomorphism T. It is well known that the set M(X) of probability measures on X is also a compact metric set for the topology of weak convergence of measures; so is the set I(X) of invariant probability measures on X. Assuming φ : Y → X to be continuous and onto, define the measure Φν to be the image measure of ν under φ. The map Φ is continuous, onto and shift-commuting from M(Y) to M(X).
To x ∈ X associate the family of probability measures (S_n(x), n ∈ ℕ) defined by

S_n(x) = (1/n) Σ_{i=0}^{n-1} δ_{T^i x},

where δ_x is the Dirac measure at the point x.

Definition. A measure μ is said to be associated to x ∈ X if it is the weak limit of some subsequence of (S_n(x)); x is said to be generic for μ if S_n(x) → μ when n → ∞.

Both definitions imply μ is T-invariant; when μ is also ergodic, Birkhoff's theorem states that μ-almost all points are μ-generic. The most interesting cases for this study are when X = [0,1) and T is multiplication by the integer k (mod 1), or X = A^ℕ and T is the shift; a point x ∈ [0,1), generic for the Lebesgue measure, is called normal, as well as its expansion, which is generic for the uniform measure.
We only give the definition of the entropy of a shift-invariant measure on X = A^ℕ (or some subshift). For u ∈ A^n, denote by [u] the cylinder set {x ∈ X | x(0, ..., n-1) = u}.

Definition. The entropy of a shift-invariant measure μ on X is the real number

h_μ = - lim_{n→∞} (1/n) Σ_{u∈A^n} μ([u]) log μ([u]).
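As a concrete check of this definition, the sum can be evaluated exactly for the uniform measure, for which every cylinder [u] with u ∈ A^n has mass (#A)^{-n}: the result is log #A for every n, not only in the limit. A minimal sketch (helper names are our own):

```python
import math
from itertools import product

def block_entropy(mu_of_cylinder, alphabet, n):
    """-(1/n) * sum over u in A^n of mu([u]) log mu([u])."""
    s = 0.0
    for u in product(alphabet, repeat=n):
        m = mu_of_cylinder(u)
        if m > 0:                      # 0 log 0 = 0 by convention
            s -= m * math.log(m)
    return s / n

p = 3
uniform = lambda u: p ** (-len(u))     # mu([u]) = p^(-n) for |u| = n
for n in (1, 2, 4):
    assert abs(block_entropy(uniform, range(p), n) - math.log(p)) < 1e-9
print("uniform measure on", p, "symbols has entropy log", p)
```

At the other extreme, a Dirac measure concentrated on a single fixed point gives entropy 0.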

The entropy is preserved under bounded-to-one factor maps; this is a consequence of a classical result of Ergodic Theory. A well-known theorem of Symbolic Dynamics establishes that for a transitive sofic subshift there is a unique measure with maximal entropy - in the case of A^ℤ, it is the uniform measure λ.

Proposition 2. Suppose φ is a factor map from the subshift Y to the subshift X, such that card{φ^{-1}(x)} is bounded on X; let ν be an invariant measure on Y and μ = Φν. Then h_μ = h_ν.
3.2 MORE ABOUT MULTIPLICATION TRANSDUCERS

All schoolchildren know (or rather should know) how to use multiplication algorithms. So why bother to represent them as transducers? The answer is that this presentation emphasises elementary properties of the associated graphs, and these turn out to be decisive for the proof that normality is preserved under multiplication by rationals, as well as for further results.
It is obvious how to use T_{k,p} to multiply an integer by k in base p, or to do the same mod 1 to a dyadic number. But is the second case really as obvious as it seems? It is if we only think of the canonical expansion of a dyadic number; but if, as is the case, we want the set of expansions to be closed, a dyadic number has got two expansions! In this paper we consider that a real number r ∈ [0,1) has a set of expansions E(r) ⊂ A^ℕ with one or two elements; conversely, a sequence x ∈ A^ℕ has one valuation:

V(x) = Σ_{i=1}^{∞} x_i p^{-i} (mod 1).

Consider all possible infinite paths on the graph of T_{k,p} with input label in E(r), then the set F(r) of their output labels. I claim F(r) = E(k·r (mod 1)), which implies it contains at most two elements. Indeed, since T_{k,p} is input-deterministic, finite paths in the graph having the input label s(0,n) for some s ∈ E(r) are entirely determined by the carry chosen for the "initial" time n. But changing the carry at time n corresponds to a difference of valuation with modulus less than k·p^{-n}. Letting n tend to infinity, this means the valuations of distinct elements of F(r) are identical.
Formally, what we have been doing is this: to a real number in [0,1) we substituted the set of its infinite expansions; then we considered all infinite sequences of C^ℕ × A^ℕ × A^ℕ representing a path in the graph of T_{k,p} together with its input label in E(r) and its output label in F(r); and finally, from these triple sequences we selected the sequences of output labels, all of them representing the same number k·r (mod 1). This situation is represented in the following commutative diagram,
            Y
         φ /  \ ψ
          X     X
        V |     | V
      [0,1) --→ [0,1)
         ×k (mod 1)

where Y is the closed subset of C^ℕ × A^ℕ × A^ℕ for which the sequence on C represents a path in the graph and the sequences on A are its input and output labels; φ is the projection of Y onto X corresponding to input labels, and ψ is the one corresponding to output labels.

3.3 THE RESULT

Proposition 3. Multiplication by a rational (mod. 1) preserves normality.

Proof. Let r ∈ [0,1), q ∈ ℚ, and x be the expansion of r to base p. The statement is equivalent to the following claim: for any integer k, the expansion x' of k·r (mod 1) is normal iff x also is. Indeed, assuming q = k/k', the "if" part of the former claim establishes the normality of k·r (mod 1) and the "only if" part, the normality of k·r/k' = q·r (mod 1); putting the two results together achieves the proof. Now it is sufficient to prove this for two simple cases: when k and p are coprime, and when k divides p. This is done by using the elementary properties of multiplication transducers (Proposition 1), together with classical results of Symbolic Dynamics.

Proposition 4. Suppose T is a transducer with input and output alphabet A, recognizing the language A* and non-ambiguous, both for input and output. Then the two following properties are equivalent:
(1) x ∈ A^ℕ is normal;
(2) any element in T(x) is normal.

Proof. The probability measure μ' on X is said to be a transduced image of μ if there is ν on Y such that Φ(ν) = μ and Ψ(ν) = μ'.
We claim that if x' is a transduced image of x for T (their common preimage in Y being y) and x is generic for μ, then any measure μ' associated to x' is a transduced image of μ: assume μ' to be associated to x'; by compactness of M(Y) there is some measure ν associated to y such that Ψ(ν) = μ'; but since φ(y) = x one must have Φ(ν) = μ, whence the result.
Now suppose μ = λ; non-ambiguity of T for input and output is equivalent to the fact that φ and ψ preserve entropy, so for any common preimage ν one must have

h_{μ'} = h_ν = h_λ;

since λ is the unique measure on X with maximal entropy, so is μ', which implies μ' = λ. ∎

Two other results in [2], generalising the latter, may be quoted here. The first states that given a rational q, the closer to normality r is, the closer q·r is. This may also be obtained by Harmonic Analysis, but of course the topological methods used in [2] are perfectly natural.
The second is the following: suppose the invariant measure μ on X is such that Φ^{-1}(μ) is a singleton, and x ∈ X is generic for μ. Then any transduced image x' of x is Ψ(Φ^{-1}(μ))-generic. Non-trivial examples of this situation exist.
One important fact about this result is the difference with normality preservation: one does not assume Ψ(Φ^{-1}(μ)) = μ; the transduced image of a generic point is still generic, but generally for another invariant measure. In what cases are the two measures identical? This question brings us back to the one asked by Furstenberg.

4. On the Dynamics of onto Cellular Automata

After the seminal paper of Hedlund [10], the dynamics of endomorphisms of the shift has drawn the attention of many mathematical physicists (who renamed them "cellular automata") and a few mathematicians. Motivated readers will find different aspects of this topic in [12] and [22].
One of the many interesting dynamical features of cellular automata is their limit set: given a cellular automaton F : X → X, its limit set is the subshift ∩_{n≥1} F^n(X); one may ask what is the limit set of a given CA, whether this limit set is in fact obtained as the image of finitely many iterations of the map, what subshifts may be obtained as limit sets of cellular automata. These questions are quite likely undecidable in general; some of them have nevertheless been solved in particular cases [18]. But they are out of the scope of this article: we are only considering onto cellular automata, which means the limit set is A^ℕ or A^ℤ itself!
Here we adopt the classical point of view of topological dynamics: X = A^ℤ (or A^ℕ) is a compact metric space, F is an onto endomorphism, and we would like to link some of the relevant properties of this dynamical system to properties of the rule defining F. There are still fewer known results along this line: for some of them see [6], [16], [17].
Apart from its purely dynamical interest, and its connections with statistical physics, this set of questions also has some arithmetical significance. Onto cellular automata acting on one-sided sequences of digits define onto transformations of the 1-torus - hence the emphasis on this particular class of automata in the last subsections of this text. The definition is ambiguous on p-adic numbers, except when the cellular automaton is just a power of the shift or with some similar restrictive hypotheses [3]; but the transformation is well defined and continuous at all other points, and it is of course measurable. The question of what continuous maps can be represented by a transducer or cellular automaton is addressed in [3], and some connected questions are treated in [4]. It is easily proved that any such map, continuous or not, preserves the Lebesgue measure (Proposition 8), and deduced from Proposition 4 that it preserves normality. Apart from these remarks there is not much known about such transformations, except that there is a good algorithm to compute their value at any given (non p-adic) point... and symbolic methods are likely to tell much about their dynamics. Advances will be welcome.
In this section we introduce some notions and remarks which we think are
basic for future investigations, and state some open questions.

4.1. TOPOLOGICAL CONJUGACY AND ENTROPY

Conjugacy is the natural equivalence between topological dynamical systems. A topological dynamical system (TDS for short) is a compact metric space endowed with a continuous, onto map. Two TDS, (X,T) and (Y,S), are said to be conjugate when there is a one-to-one factor map from the one to the other. There are various families of TDS which are stable under conjugacy. In the sections above we defined the class of subshifts, i.e. closed invariant subsets of the shift space; the set [0,1), endowed with multiplication by p (mod 1), can never, for topological reasons, be conjugate to a subshift, however close to the full shift on p letters it is.
What about the configuration space X, endowed, not as usual with the shift, but with the cellular automaton map F? F commutes with the shift without being a shift; indeed, there are instances in which (X,F) is conjugate to a subshift, and some in which it cannot be.
Some topological invariants introduce other distinctions; topological entropy
is one. Before we give its formal definition, some notions must be introduced.

Assume R is an open cover of the compact metric set X: we denote by H(R) the nonnegative real number inf{log card(R')}, where the infimum is taken over all finite subcovers R' of R. The cover R is said to be finer than S if for any U ∈ R there is V ∈ S with U ⊂ V; this property is denoted by S ≼ R. It implies H(S) ≤ H(R).
Denote by R ∨ S the cover made up of all intersections U ∩ V, U ∈ R, V ∈ S. For n ∈ ℕ write

R(n) = ⋁_{i=0}^{n-1} T^{-i}R.

The entropy of the cover R is the (well-defined) nonnegative number

h(R,T) = lim_{n→∞} (1/n) H(R(n)).

Whenever R ≼ S one has h(R,T) ≤ h(S,T).

Definition. The topological entropy of (X,T) is the nonnegative real number

h(X,T) = sup h(R,T),

where the sup is taken over all finite open covers R of X.


The following property will be used in the sequel. It is proved in [7].

Lemma 5. Suppose (R_n, n ∈ ℕ) is an increasing family of covers of X such that any finite open cover of X is coarser than some R_n. Then

lim_{n→∞} h(R_n, T) = h(X,T).

One proves easily that if (X,T) is a factor of (Y,S) then

h(X,T) ≤ h(Y,S).

Whenever X is a subshift, one has

h(X,T) = lim_{n→∞} (1/n) log #(L(X) ∩ A^n). (3)

A trivial example is the full shift on p letters, whose entropy is log p.
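Formula (3) makes the entropy of a subshift computable by brute-force word counting. As an illustration (our own example, not from the text), take the golden mean shift, the sofic subshift of binary sequences with no factor 11: its number of allowed words of length n is a Fibonacci number, so its entropy is log((1+√5)/2).

```python
import math
from itertools import product

def count_words(n, forbidden="11"):
    """#(L(X) ∩ A^n) for the subshift of {0,1}-sequences avoiding `forbidden`."""
    return sum(1 for u in product("01", repeat=n)
               if forbidden not in "".join(u))

golden = (1 + math.sqrt(5)) / 2
estimate = math.log(count_words(16)) / 16      # (1/n) log #(L(X) ∩ A^n)
print(estimate, "vs", math.log(golden))
```

The finite-n estimate converges to log((1+√5)/2) ≈ 0.4812 from above, though slowly; for the full shift on p letters the same count gives exactly log p at every n.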



We now state the Variational Principle. This fundamental theorem, obtained during the seventies by putting together results of several researchers, links the topological and measure-theoretic definitions of entropy. Its rather intricate proof can be found in [7] and [21].

Proposition 6. For any compact space X endowed with the continuous onto map
T, one has
h(X, T) = sup (hi').
I'EI(X)

Here are two classical properties concerning the topological entropy of symbolic
systems which we are going to use:
- factor maps between sofic systems preserve topological entropy if and only if
they are bounded-to-1, i.e. the preimages of points under the map have bounded
cardinality;
- any proper subshift of A^ℤ has entropy strictly less than log #(A).

4.2. ONTO CELLULAR AUTOMATA

In this subsection proofs are given for cellular automata acting on A^ℤ, but
they are strictly identical for those acting on A^ℕ.
Kari [15] proved that it is undecidable whether a 2-dimensional cellular au-
tomaton is onto. In the one-dimensional case, the same question (which is a key
issue in this context) is fortunately decidable. Since this fact seems not to be
widely known, we shall give a sketch of the proof.

Proposition 7. It is decidable whether a cellular automaton F is onto.

Proof. Let F be a cellular automaton on A^ℤ with rule f of radius n. There
exists a finite automaton A "simulating" F, which is the same as the transducer
described there except that we discard the input. Its set of states is A^{2n}; the arcs
are defined in the following way: if u and v belong to A^{2n}, there is an arc from
u to v iff u = aw and v = wb with a, b ∈ A (v may follow u if it is "shifted from"
u, i.e. iff there is x ∈ A^ℤ such that x(1, 2n) = u and x(2, 2n+1) = v); this arc
is labelled f(awb). This finite automaton obviously recognises the language
L(F(A^ℤ)).

Now, using classical symbolic or language-theoretic techniques, it is easy to
tell whether F(A^ℤ) = A^ℤ, or equivalently whether L(F(A^ℤ)) = A*. One way is to
check whether the automaton is non-ambiguous, or equivalently whether there are
no "diamonds" (two distinct paths with the same initial and final vertices and the
same label). Another is first to deduce from A some deterministic automaton
recognising L(F(A^ℤ)), and then to compute the maximum eigenvalue ρ of the
incidence matrix of this automaton. Since by [8] h(F(A^ℤ)) = log ρ, and the
entropy of any proper subshift of A^ℤ is strictly less than log #(A) (see [7]), a
necessary and sufficient condition for F(A^ℤ) = A^ℤ is ρ = #(A). ∎
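For binary radius-1 rules the decision procedure of Proposition 7 is short enough to implement. The sketch below is our illustration (the function name and the subset-automaton formulation are ours): it builds the word automaton with state set A^{2n} described in the proof and checks, by a subset construction, whether some word is missing from L(F(A^ℤ)); F fails to be onto exactly when the empty set of states is reachable.

```python
from itertools import product

def is_onto(rule, r=1):
    """Decide whether the binary radius-r CA with the given Wolfram rule
    number is onto.  States are words of length 2r; there is an arc
    u = aw -> v = wb labelled f(awb).  F is onto iff L(F(A^Z)) = A*,
    i.e. iff no input word drives the subset automaton, started from
    the set of all states, to the empty set."""
    def f(w):                      # local rule read off the rule number
        index = sum(s << (2 * r - i) for i, s in enumerate(w))
        return (rule >> index) & 1

    full = frozenset(product((0, 1), repeat=2 * r))
    seen, stack = {full}, [full]
    while stack:
        S = stack.pop()
        for letter in (0, 1):
            T = frozenset(u[1:] + (b,)
                          for u in S for b in (0, 1)
                          if f(u + (b,)) == letter)
            if not T:              # some word has no preimage: not onto
                return False
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return True
```

For instance rule 90 (addition of the two extreme neighbours mod 2, a rule of the form of Example 2 below) is onto, while rule 110 is not.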

Onto cellular automata have an important common property, which is one of
the motivations for studying them.

Proposition 8. Let F be onto on A^ℤ. Then the uniform measure λ is invariant
under F.

Proof. Denote by Fλ the image of λ under the map F. By our assumption F(A^ℤ) =
A^ℤ, therefore h(F(A^ℤ), σ) = h(A^ℤ, σ); since topological entropy is preserved by
bounded-to-1 factor maps only, this means F is bounded-to-1. Now a bounded-to-1
map preserves measurable entropy, so h(Fλ, σ) = h(λ, σ). But λ is the unique
measure on X having this entropy, hence Fλ = λ. ∎

Remark. Equivalents of Propositions 7 and 8 exist and are as easily proved
for transducers of a more general type. In fact an implicit proof of Proposition 8,
valid for a large class of transducers, is hidden in that of Proposition 4.
In the way of examples, we describe two simple, algebraically defined sets
of rules generating onto cellular automata; they will be examined from different
points of view in the following subsections. The proof that they are onto is left
to the reader; it is especially simple for f(x_0, x_1) = x_0 + x_1, which belongs to both
classes.

Example 2. Let p > 1 be an integer and let A = {0, ..., p−1} be endowed with
addition mod p. Define F by f(x_0, ..., x_r) = g(x_0, ..., x_{r−1}) + x_r, where g is some
map from A^r to A.
Example 3. [6] Let F be the cellular automaton defined on A^ℤ by

    f(x_0, ..., x_r) = x_0 + ∏_{i=1}^{r} (x_i + b_i),

where A is endowed with addition and multiplication mod p, and b = b_1 ... b_r ∈
A*.
Remark that these two sets of rules define cellular automata acting on A^ℕ
as well as on A^ℤ. Another, much more complicated set of rules, defining 1-to-1 onto
cellular automata acting on A^ℤ only, has been introduced in [17].

4.3. A NATURAL SYMBOLIC FACTOR OF (X, T)

From now on, denote by X the set A^ℕ of one-sided infinite sequences on A.
Every onto cellular automaton (X, F) has a symbolic factor (Z, σ) which plays a
primary role in its dynamics (it can also be defined for two-sided cellular automata,
though we do not do it here). It is sometimes conjugate to (X, F) (Example 2 above
and below) and sometimes not (F = Id, Example 3); it is not canonical, except of
course in the first case. It plays an outstanding part in the entropy calculations of
[6] and [17], though this is not pointed out in these articles.
We first introduce this factor map. Let π : X → (A^r)^ℕ be defined by πx =
(F^i(x)(0, r−1)), i ∈ ℕ. One may see πx as a set of r different infinite sequences
on which the shift acts simultaneously. The reader can check that π is continuous
and that π ∘ F = σ ∘ π: putting Z = π(X), π is a factor map from (X, F) to the
subshift (Z, σ). To put things heuristically, π shrinks x ∈ X to the sequence of
elements of the open cover ℛ = {[u] : u ∈ A^r} (in fact a clopen partition) to which
x, Fx, ..., F^n x, ... belong in their turn. Or else σ acts on Z exactly the way F
acts on ℛ and the sequence of its preimages.
In general, the topological entropy of a cellular automaton is undecidable [11].
It can nevertheless be computed for some restricted classes of automata, as shown
by Coven [6] and Lind [17]. In both papers the topological entropy is in fact
computed on the factor (Z, σ), or its equivalent for two-sided cellular automata,
which prompted A. Maass and the author to make the following observations.

Proposition 9. h(Z, σ) = h(X, F).

Proof. Since π is a factor map one has h(Z, σ) ≤ h(X, F).
In order to prove the converse inequality, for given n > 0 consider two other
families of factor maps defined on X, π_n and π′_n, with images Z_n and Z′_n, given by
π_n(x) = (F^i(x)(n, n+r−1)), i ∈ ℕ, and π′_n(x) = (F^i(x)(0, n+r−1)), i ∈ ℕ.
Remark first that Z_n is conjugate to Z (they are in fact copies of each other), so
h(Z, σ) = h(Z_n, σ). Now, to a given z ∈ Z_n there correspond at most (#(A))^n
points in Z′_n: once the coordinates x(0, n−1) have been chosen, given z all missing
coordinates are determined by the map F. Therefore, by Formula (3), h(Z′_n, σ) =
h(Z_n, σ). As n → ∞, h(Z′_n, σ) → h(X, F): this completes the proof. ∎

Remark. The subshifts Z_n are all conjugate to Z = Z_0, because they
are the same set of points endowed with the same transformation. But π_n is never
a conjugacy map for n ≠ 0, since no information at all about x(0, n−1) can be
recovered from its image.
Another property of onto cellular automata which we think to be promising
is the following.

Definition. An onto cellular automaton F is said to be expansive if there is
ε > 0 such that whenever x ≠ y both belong to X, there is some n such that
d(F^n(x), F^n(y)) > ε.

This is a classical definition in Topological Dynamics, where expansive systems
were introduced as a generalisation of symbolic systems: for x ≠ y in a symbolic
set, there must be some coordinate n at which they differ, hence d(T^n x, T^n y) = 1
for some suitable distance d on X. In the field of cellular automata expansiveness
takes on a particular significance because of the following observation:

Proposition 10. (Z, σ) is topologically conjugate to (X, F) iff F is expansive.

Proof. Suppose (Z, σ) is conjugate to (X, F). Then (X, F) is symbolic, therefore
expansive.
Conversely, assume F has radius r and is expansive. For x ≠ y, d(F^n x, F^n y) >
ε for some n: whatever the chosen distance d, this means there exists q ∈ ℕ+ such
that F^n x(0, q−1) ≠ F^n y(0, q−1), and since ε is universal q does not depend on
x and y. For any p ∈ ℕ+ consider the factor map ψ_p : X → (A^p)^ℕ defined by
ψ_p(x) = (F^i(x)(0, p−1)), i ∈ ℕ, and its image Y_p: the previous remark means
that ψ_q is 1-to-1, therefore (Y_q, σ) and (X, F) are conjugate and (Y_q, σ) is symbolic.
There remains to replace Y_q by Z = Y_r. If q ≤ r this is very easy: (X, F) is
conjugate to (Y_q, σ), which is a factor of (Y_r = Z, σ), which is in its turn a factor
of (X, F), so that (Z, σ) and (X, F) are conjugate.
Now call q the smallest possible value such that (Y_q, σ) and (X, F) are con-
jugate, and assume q is greater than r. As q is the smallest possible value for
conjugacy, one can find x ≠ y in X with ψ_{q−1}(x) = ψ_{q−1}(y), the last equality
implying that x(0, q−2) = y(0, q−2). Consider the two points x′ ≠ y′ of X defined
by σx′ = x, σy′ = y and x′(0) = y′(0) = a ∈ A. One has

    ψ_{q−1}(σx′) = ψ_{q−1}(x) = ψ_{q−1}(y) = ψ_{q−1}(σy′),

which identifies all but the first infinite sequences of symbols constituting ψ_q x′ and
ψ_q y′; also, since the first q letters of x′ and y′ are the same, this is also true by
induction for F^n x′ and F^n y′. So ψ_q x′ = ψ_q y′ whereas x′ ≠ y′, which contradicts
the minimality assumption on q. Hence the result. ∎

In fact we can perfectly do without the definition of expansiveness, but it
is interesting to link the property, for a cellular automaton, of having a symbolic
action with a classical notion of Topological Dynamics.
Let us now go back first to Example 2. Define F on {0, 1}^ℕ by f(x_0, x_1) =
x_0 + x_1. In this case one easily checks that x_0 and F(x)(0) entirely determine x_1;
using this remark inductively shows π to be invertible: (X, F) is conjugate by π to
(Z, σ) = {0, 1}^ℕ and has entropy log 2. For Example 2 in general, i.e. f(x_0, ..., x_r) =
g(x_0, ..., x_{r−1}) + x_r, it is hardly more difficult to check that π is invertible and that
(X, F) is conjugate to the full shift on the alphabet A^r, with entropy r log #(A).
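The inductive reconstruction argument for f(x_0, x_1) = x_0 + x_1 can be made concrete on finite words. The sketch below is our illustration: the trace z_i = F^i(x)(0) is the image πx (here r = 1), and since F(y)(j) = y(j) + y(j+1) mod 2 gives y(j+1) = F(y)(j) + y(j), the first m+1 letters of x are recovered from z_0, ..., z_m.

```python
def step(x):
    # one application of f(x0, x1) = x0 + x1 (mod 2); one cell is lost on the right
    return [(x[j] + x[j + 1]) % 2 for j in range(len(x) - 1)]

def trace(x):
    # pi(x): the sequence z_i = F^i(x)(0), as far as a finite word allows
    z = []
    while x:
        z.append(x[0])
        x = step(x)
    return z

def reconstruct(z):
    # invert pi: from y(j+1) = F(y)(j) + y(j) mod 2, fill in the triangle
    # t[i] = a prefix of F^i(x), starting from the known first column t[i][0] = z_i
    m = len(z)
    t = [[v] for v in z]
    for j in range(m - 1):
        for i in range(m - 1 - j):
            t[i].append((t[i + 1][j] + t[i][j]) % 2)
    return t[0]
```

`reconstruct(trace(x))` returns x for any finite binary word, which is the finite-word shadow of the invertibility of π.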
For Example 3 the situation is more complicated. For all rules in the class,
i.e. whenever

    f(x_0, ..., x_r) = x_0 + ∏_{i=1}^{r} (x_i + b_i),

F is onto [10]. But to compute the entropy, up to now it has been necessary to
make an extra assumption: b = b_1 ... b_r must be an aperiodic word, meaning there
is no p, 1 ≤ p ≤ r−1, such that b_i = b_{i+p} for 1 ≤ i ≤ r−p. In this case the
topological entropy of (X, F) is proved by Coven in [6] to be log 2.
Lind's examples [17] form a class of cellular automata which are easily proved
never to be expansive, in a sense closely connected to the one introduced above.
Their entropies form a dense set of numbers in ℝ+: this is the main motivation
for the article, since all cellular automata whose entropy had been computed
previously had an entropy equal to the logarithm of an integer.

4.4. SOME POSSIBLE MATTERS FOR INVESTIGATION

1. It would be interesting to describe families of onto cellular automata,
different from the one in Example 2 but maybe containing it, for which π is a
conjugacy map (perhaps first on A^ℕ, because it looks easier); and then to try to
learn more about their dynamics (which might sometimes be more complex than
that of Example 2).
2. Can one compute the entropy of the cellular automaton defined (for
A = {0, 1}, for instance) by the rule f(x_0, x_1, x_2) = x_0 + x_1 x_2? This childish rule
does not belong to the family studied in [6], because the word b is not aperiodic,
and has given headaches to several people!
3. Here is another question, maybe more ambitious than the first: can
one, for some families of rules defining onto cellular automata, find the ergodic
properties of the measure-theoretic dynamical system (X, F, λ), which on account
of Proposition 8 is always well defined?
4. Some readers may now be asking themselves why we did not bother to ask
questions similar to 1 and 3 for the action of transducers. The most obvious reason
is that transducers do not define actual maps on A^ℕ, which makes it certainly
more difficult to formulate such problems. But there is certainly something to
investigate in this direction, and some motivation for doing so in the fact that the
situation is the same for associated maps on the torus. So, here is a completely
open field for research...

References

[1] Blanchard, F., Non Literal Transducers and Some Problems of Normality, J.
Th. Nombres Bordeaux, to appear.
[2] Blanchard, F., Dumont, J.-M., Thomas, A., Generic Sequences, Transducers
and Multiplication of Normal Numbers, Israel J. Math., 80, 257-287 (1992).
[3] Blanchard, F., Host, B., Maass, A., Representation par Automates de Fonc-
tions Continues du Tore, preprint (1993).
[4] Botelho, F., Garzon, M., On Dynamical Properties of Neural Networks, Com-
plex Systems 5, 401-413 (1991).
[5] Broglio, A., Liardet, P., Prediction with Automata, in Symbolic Dynamics and
its Applications, P. Walters (ed.), Contemporary Math. 135, AMS, Provi-
dence (1985).
[6] Coven, E.M., Topological Entropy of Block Maps, Proc. Amer. Math. Soc.
78, 590-594 (1980).
[7] Denker, M., Grillenberger, C., Sigmund, K., Ergodic Theory on Compact
Spaces, Lecture Notes in Math. 527, Springer, Berlin (1976).
[8] Fischer, R., Sofic Systems and Graphs, Monats. Math. 80, 179-186 (1975).
[9] Furstenberg, H., Disjointness in Ergodic Theory, Minimal Sets, and a Problem
of Diophantine Approximation, Math. Systems Theory 1, 1-49 (1967).
[10] Hedlund, G.A., Endomorphisms and Automorphisms of the Shift Dynamical
System, Math. Systems Theory 3, 320-375 (1969).
[11] Hurd, L.P., Kari, J., Culik, K., The Topological Entropy of Cellular Automata
is Undecidable, Ergodic Th. Dynam. Sys. 12, 255-265 (1992).
[12] Goles, E., Martínez, S., Automata Networks, Dynamical Systems and Sta-
tistical Physics, Kluwer Academic Pub., Mathematics and its Applications
(1992).
[13] Johnson, A.S., Measures on the Circle Invariant under Multiplication by a
Nonlacunary Subsemigroup of the Integers, preprint (1991).
[14] Kamae, T., Weiss, B., Normal Numbers and Selection Rules, Israel J. Math.
21, 101-110 (1975).
[15] Kari, J., Decision Problems Concerning Cellular Automata, Thesis, University
of Turku, Finland (1990).
[16] Lind, D.A., Applications of Ergodic Theory and Sofic Systems to Cellular
Automata, Physica 10 D, 36-44 (1984).
[17] Lind, D.A., Entropies of Automorphisms of a Topological Markov Shift, Proc.
Amer. Math. Soc. 99, 589-595 (1987).
[18] Maass, A., On Sofic Limit Sets of Cellular Automata, preprint (1992).
[19] Parry, W., Topics in Ergodic Theory, Cambridge University Press, London
(1981).
[20] Thomas, A., Suites Normales et Transducteurs, preprint (1992).
[21] Walters, P., An Introduction to Ergodic Theory, Graduate Texts in Math. 79,
Springer, Berlin (1982).
AUTOMATA NETWORK MODELS OF INTERACTING
POPULATIONS

NINO BOCCARA
DRECAM-SPEC
CE-Saclay, France
Department of Physics, University of Illinois
Chicago, USA

1. Introduction

The first task that faces the theoretician who wants to interpret the time evolution
of a complex system is the construction of a model. In the actual system many
features are likely to be important. Not all of them, however, should be included in
the model. Only the few relevant features which are thought to play an essential
role in the interpretation of the observed phenomena should be retained. Such
simplified descriptions should not be criticized on the basis of their omissions and
oversimplifications. The investigation of a simple model is often very helpful in
developing the intuition necessary for the understanding of the behavior of complex
real systems. In many-body physics, for instance, models such as the van der
Waals model of a fluid, the Heisenberg model of ferromagnetism, the mass-and-
spring model of lattice vibrations, the Landau model of phase transitions, and the
Ising model of cooperative phenomena, to mention just a few, have played a major role.
A simple model, if it captures the key elements of a complex system, may elicit
highly relevant questions.
This series of lectures is devoted to the investigation of models of interacting
populations, such as susceptibles and infectives in epidemiology or competing
species in ecology.
Most models in population dynamics are formulated in terms of differential
equations, 1 the classical example being the predator-prey model proposed in the

1 For a rich and fascinating variety of models, refer to Murray (1989).

E. Goles and S. Martinez (eds.), Cellular Automata, Dynamical Systems and Neural Networks, 23-77.
© 1994 Kluwer Academic Publishers.

1920's, independently, by Lotka (1925) and Volterra (1926). 2

The different models to be discussed here are extensions of the so-called "gen-
eral epidemic model" (see Bailey 1975). In this model, infection spreads by con-
tact from infectives to susceptibles, and infectives are removed from circulation by
death or isolation. A simple model of this type was proposed by Kermack and
McKendrick (1927). A nice and simple discussion of their model can be found in
Waltman (1974). These authors assumed that infection and removal were governed
by the following rules:
(i) The rate of change in the susceptible population is proportional to the number
of contacts between susceptibles and infectives, where the number of contacts
is taken to be proportional to the product of the number of susceptibles S by
the number of infectives I.
(ii) Infectives are removed at a rate proportional to their number I.
(iii) The total number of individuals S +I+ R, where R is the number of removed
infectives, is constant, that is, the model ignores births, deaths by other causes,
immigration, emigration, etc.
If S, I and R are supposed to be real positive functions of time t, (i), (ii) and
(iii) yield

    dS/dt = −iSI,
    dI/dt = iSI − rI,
    dR/dt = rI,
where i and r are positive constants representing, respectively, the infection rate
and the removal rate. From the first equation, it is clear that S is a nonincreasing
function, whereas the second equation implies that I(t) increases with t if
S(t) > r/i and decreases otherwise. Therefore, if, at t = 0, the initial number of
susceptibles S(0) is less than r/i, since S(t) ≤ S(0), the infection dies out, that is,
no epidemic occurs. If, on the contrary, S(0) is greater than the critical value r/i,
the epidemic occurs, that is, the number of infectives first increases and then de-
creases when S(t) becomes less than r/i. 3 This "threshold phenomenon" shows
that an epidemic can occur if, and only if, the initial number of susceptibles is
greater than a threshold value.

2 Vito Volterra (1860-1940) was stimulated to study this problem by his future son-in-law,
Umberto D'Ancona, who, analyzing market statistics of the Adriatic fisheries, found that, during
the First World War, certain predaceous species increased when fishing was severely limited. A
year before, in 1925, Alfred James Lotka (1880-1949) had come up with an almost identical
solution to the predator-prey problem. His method was very general but, probably because of
that, his book, reprinted as Elements of Mathematical Biology (New York: Dover, 1956),
did not receive the attention it deserved.
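The threshold phenomenon is easy to reproduce numerically. The sketch below is our illustration (the parameter values are arbitrary): it integrates the Kermack-McKendrick equations with a crude Euler scheme; with i = r = 1 the threshold is r/i = 1, and the infectives grow only when the initial susceptible population exceeds it.

```python
def kermack_mckendrick(S0, I0, i_rate, r_rate, dt=0.01, steps=20000):
    """Euler integration of dS/dt = -iSI, dI/dt = iSI - rI, dR/dt = rI.
    Returns the final state and the peak number of infectives."""
    S, I, R = S0, I0, 0.0
    peak = I0
    for _ in range(steps):
        dS = -i_rate * S * I
        dI = i_rate * S * I - r_rate * I
        dR = r_rate * I
        S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
        peak = max(peak, I)
    return S, I, R, peak
```

With S(0) = 0.5 < r/i the infectives only decay (no epidemic); with S(0) = 3 > r/i they first rise well above I(0), and the final susceptible density falls below the threshold.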
The Kermack-McKendrick model assumes a homogeneous mixing of the pop-
ulation, that is, it neglects the local character of the infection process. This as-
sumption which is, in general, questionable may, however, be valid in some limit
cases to be discussed later. The model also neglects the motion of the individuals,
which is a factor that clearly affects the spread of the disease.
To take into account the motion of the individuals, it is usually assumed
that they disperse randomly. This hypothesis amounts to incorporating diffusion
terms in the equations. Models of this type help in understanding the spatial spread
of epidemics. Consider, for instance, rabies epidemic among foxes. Rabies is a
viral infection of the nervous central system. It is transmitted by contact and
is invariably fatal. If the virus enters the limbic system, that is, the part of the
brain thought to control behavior, the fox loses its sense of territory and wanders
in a more or less random way. To discuss the spatial spread of rabies among
foxes in Europe, Kallen et al (1985) added a diffusion term in rate equation of
the infectives in the Kermack-McKendrick model in order to take into account the
random dispersion of the rabid foxes. 4 We have then

dS = -iSI
dt
di . fj2 I
- = zSI -ri + D -2
dt ox
dR
dt = rl,

where D is the diffusion coefficient of the infected foxes. This system of equations

3 That is, by definition, an epidemic occurs if dI/dt is positive at t = 0.


4 See also Murray (1989) p. 659.

admits travelling wavefront solutions of the form S(x − ct), I(x − ct) and R(x −
ct), where c is the propagation speed of the epidemic wave. For the epidemic to
occur, the average initial susceptible population density, i.e., ahead of the epidemic
wave, has to be greater than the threshold value r/i; and, in this case, it is found
that c behaves as D^{1/2}. To explain the observed fluctuations in the susceptible
fox population density after the passage of the wavefront, Murray et al (1986)
have considered a less simple model taking into account fox reproduction and the
existence of a rather long incubation period (12 to 150 days).
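A crude discretization shows the wavefront and the D^{1/2} scaling of its speed. The sketch below is our illustration (grid, rates and thresholds are arbitrary choices): it integrates the system with an explicit finite-difference scheme on a line, starting from a localized focus of infection, and returns the time at which the epidemic reaches a distant probe site; quadrupling D should roughly halve that time.

```python
def arrival_time(D, i_rate=1.0, r_rate=0.2, S0=1.0,
                 L=200, dt=0.05, probe=150, max_steps=5000):
    """Explicit Euler/finite-difference integration of
    dS/dt = -iSI,  dI/dt = iSI - rI + D d2I/dx2   (dx = 1),
    returning the first time the infective density at `probe` exceeds 0.1."""
    S = [S0] * L
    I = [0.0] * L
    I[0] = 0.5                     # initial focus of infection at the left end
    for n in range(1, max_steps + 1):
        lap = [I[max(j - 1, 0)] - 2 * I[j] + I[min(j + 1, L - 1)]
               for j in range(L)]
        S, I = ([S[j] - dt * i_rate * S[j] * I[j] for j in range(L)],
                [I[j] + dt * (i_rate * S[j] * I[j] - r_rate * I[j] + D * lap[j])
                 for j in range(L)])
        if I[probe] > 0.1:
            return n * dt
    return None
```

With these parameters S0 = 1 > r/i = 0.2, so a wave propagates, and the arrival time for D = 4 is markedly shorter than for D = 1, consistent with c ∝ D^{1/2}.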
Although the models presented so far have unquestionably contributed to our
understanding of the spread of an infectious disease (e.g., Murray's model allows
for quantitative comparison with known data), the short-range character of the in-
fection process is not correctly taken into account. This will be manifest when we
discuss systems that exhibit bifurcations. In phase transition theory, for instance,
it is well known that in the vicinity of a bifurcation point, i.e., a second-order tran-
sition point, certain physical quantities have a singular behavior. 5 It is only above
a certain spatial dimensionality, known as the upper critical dimensionality, that
the behavior of the system is correctly described by a partial differential equation.
For instance, the spatial fluctuations of the order parameter close to a second-order
transition point are correctly described by the time-independent Landau-Ginzburg
equation above 4 dimensions. 6
One way to take correctly into account the short-range character of the infec-
tion process is to discretize space, and to represent the spread of an epidemic as
the growth of a random cluster on a lattice. A kinetic model of cluster growth may
be defined as follows (Grassberger, Cardy and Grassberger). Denote, as usual, by
Z² the two-dimensional square lattice. At a time t a site of Z² is either vacant
(healthy), occupied (infected) or immune. An immune site is one which has been
occupied in the past. At time t + 1 a vacant site becomes occupied with a proba-
bility p if at least one of its neighbors is occupied at time t. An occupied site at
time t becomes immune at time t + 1. However, immunisation is not perfect, and an
immune site may become reoccupied, with a probability q ≤ p, if, once again, one of its

5 See, e.g., Boccara (1976), pp 155-189.


6 Ibid. pp 227-274.
neighbors is occupied. More generally, one might assume that the probability that
a site becomes occupied depends on the number of occupied neighbors. If q = 0,
any bond can be tried only once, since at a second try one of the neighboring
sites is completely immune and no infection can pass. A similar model has been
studied by McKay and Jan (1984) to discuss forest fires. Vacant, occupied and
immune sites correspond, respectively, to sites occupied by unburnt, burning and
burnt trees. It is found that there is a critical probability p_c below which only a
finite number of sites are immune. In the vicinity of p_c the system exhibits a sec-
ond-order phase transition characterized by a set of critical exponents. The upper
critical dimensionality is equal to 6, and Cardy (1983) has calculated the critical
exponents to first order in ε = 6 − d. Cardy and Grassberger (1985) have shown
that these models are in the same universality class, i.e., have the same critical
exponents, as percolation cluster growth models. The relationship of the general
epidemic model to the percolation process was first noticed by Mollison (1977).
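The kinetic growth model just described is straightforward to simulate. The sketch below is our illustration (lattice size, step count and random seed are arbitrary): it runs the probabilistic rule on a periodic square lattice, starting from a single infected site, and returns the number of immune (once-occupied) sites. With q = 0 infection can cross each bond at most once, and a value of p well below the critical probability leaves only a small cluster.

```python
import random

def cluster_size(p, q, L=50, steps=100, seed=1):
    """0 = vacant (healthy), 1 = occupied (infected), 2 = immune.
    Synchronous update: an occupied site becomes immune; a vacant site
    with an occupied neighbour becomes occupied with probability p,
    an immune one with probability q.  Returns the final number of
    immune sites (the size of the epidemic cluster)."""
    random.seed(seed)
    g = [[0] * L for _ in range(L)]
    g[L // 2][L // 2] = 1
    for _ in range(steps):
        new = [row[:] for row in g]
        for x in range(L):
            for y in range(L):
                if g[x][y] == 1:
                    new[x][y] = 2
                    continue
                infected_nb = any(g[(x + dx) % L][(y + dy) % L] == 1
                                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))
                if infected_nb:
                    if g[x][y] == 0 and random.random() < p:
                        new[x][y] = 1
                    elif g[x][y] == 2 and random.random() < q:
                        new[x][y] = 1
        g = new
    return sum(c == 2 for row in g for c in row)
```

Well above threshold (p = 0.7, q = 0) the cluster invades a large part of the lattice, whereas well below it (p = 0.3, q = 0) growth stops quickly.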
The general epidemic model on a lattice may be viewed as a discrete dynamical
system, in space and time. More precisely, it may be defined as a probabilistic
automata network. In simple words, an automata network (Goles and Martinez
1991) consists of a graph where each site takes states in a finite set. The state of a
site changes in time according to a rule which takes into account only the states of
the neighboring sites in the graph. This is the point of view which will be adopted
in these lectures.
To conclude this already rather long introduction, it is probably worthwhile
to give a slightly more general definition of the spatial general epidemic model
since, after the review paper of Mollison (1977) and the introduction of random
graphs-which are graphs with randomly colored edges-by Gertsbakh (1977),
several papers have appeared in the mathematical literature on this topic. 7 Let V
be a set of sites (usually V = Z^d). At any time t ≥ 0 each site is either empty or has
a healthy or an infected individual. The number of sites with infected individuals
is initially finite. An infected individual emits germs in a Poisson process until
he is removed after a random lifetime. Each germ goes independently to another
site chosen according to a probability distribution attached to the parent site. If a
7 See, e.g., Kuulasmaa (1982), Kuulasmaa and Zachary (1984), and Cox and Durrett (1988).
germ meets an infected individual or goes to an empty site, nothing happens. After
an individual has been removed his site remains empty for ever. The infectives
all have the same emission rate and identical lifetime distribution.
All these different versions of the spatial general epidemic model still neglect
the motion of the individuals. The influence of this factor on the spread of the
epidemic is one of the main concerns of this series of lectures. Various models
will be discussed. All of them are site-exchange cellular automata, that is, automata
networks whose local rule consists of two subrules. The first one, applied
synchronously, models the interaction process between the individuals. It is a
probabilistic cellular-automaton rule. The second subrule, applied sequentially, models
be viewed as interacting particle systems. The interested mathematically-oriented
reader should refer to Liggett (1985).

2. Site-Exchange Cellular Automata

2.1. EVOLUTION OF CELLULAR AUTOMATA

Cellular automata provide simple models for a variety of complex systems
containing a large number of identical elements with local interactions (Farmer et al,
Wolfram, Manneville et al, Gutowitz, Boccara et al). A cellular automaton (CA)
consists of a lattice with a discrete variable at each site. The state of the CA is
specified by the values of the variables at each site. A CA evolves in discrete time
steps. At a given time, the value of the variable at one site is determined by the
values of the variables at the neighboring sites (and the neighborhood of a site
might include the site itself) at the previous time step. The evolution rule is syn-
chronous, that is, all sites are updated simultaneously. CAs are, therefore, discrete
(in space and time) dynamical systems. They may be more precisely defined as
follows. Let s: Z × N → {0, 1} be a function that satisfies the equation

    (∀i ∈ Z) (∀t ∈ N)  s(i, t+1) = f(s(i−r, t), s(i−r+1, t), ..., s(i+r, t))

and such that

    (∀i ∈ Z)  s(i, 0) = s_0(i),
where N is the set of nonnegative integers, Z the set of all integers, and
s_0: Z → {0, 1} a given function that specifies the initial condition. Such a sys-
tem is a one-dimensional CA. d-dimensional CAs may be defined in a similar way.
The mapping f: {0, 1}^{2r+1} → {0, 1} determines the dynamics. It is referred to as
the local rule of the CA. The positive integer r is the range (or the radius) of the
rule. The function s_t: i ↦ s(i, t) is the state of the CA at time t. S = {0, 1}^Z is the
state space. An element of the state space is called a configuration. Since the state
at time t + 1 is entirely determined by the state at time t and the rule f, f induces
a mapping f: S → S, called the global rule (or the evolution operator) such that
f(s_t) = s_{t+1}.
Given a rule f, its limit set Λ_f is defined by

    Λ_f = lim_{t→∞} f^t(S) = ⋂_{t≥0} f^t(S),

where, for any t ∈ N, f^{t+1} = f ∘ f^t with f^1 = f. Λ_f is clearly invariant, that is,
f(Λ_f) = Λ_f. Since any f-invariant subset belongs to Λ_f, the limit set of f is the
maximal f-invariant subset of S.
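The definitions above translate directly into a few lines of code. The sketch below is our illustration; it uses periodic boundary conditions on a finite ring, whereas the text works on Z, and reads the local rule off a Wolfram rule number (see footnote 8).

```python
def ca_step(rule, x, r=1):
    """One synchronous update of a binary radius-r CA on a ring:
    s(i, t+1) = f(s(i-r, t), ..., s(i+r, t)), the local rule f being
    read off the bits of the Wolfram rule number."""
    n = len(x)
    def f(i):
        index = sum(x[(i + j) % n] << (r - j) for j in range(-r, r + 1))
        return (rule >> index) & 1
    return [f(i) for i in range(n)]

def evolve(rule, x, steps, r=1):
    # iterate the global rule: x(t+1) = f(x(t))
    for _ in range(steps):
        x = ca_step(rule, x, r)
    return x
```

For instance rule 204 is the identity, while rule 90 grows the Pascal-triangle-mod-2 pattern from a single seed.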
Based on investigations of a large sample of CAs, Wolfram (1984) has shown
that, according to their asymptotic behavior, CA rules appear to fall into four
qualitative classes. Class-1 CAs evolve, from almost all initial states, to a unique
homogeneous state in which all sites have the same value. Class-2 CAs yield sep-
arated simple stable or periodic structures. Class-3 CAs exhibit chaotic patterns.
The statistical properties of these patterns are typically the same for almost all
initial states. In particular, the density of nonzero site variables tends to a fixed
value as time t tends to ∞. The evolution of class-4 CAs leads to complex localized
or propagating structures.
The evolution of a class-1 or class-2 CA is rather simple. By contrast, the
evolution of a class-4 CA seems very complex. Gallas and Herrmann (1990) have,
however, argued that class-4 CAs are actually either class-1 or class-2. The only
difference is that they reach their steady state, either homogeneous or periodic in
space, after a long transient.
As far as their statistical properties are concerned, class-3 CAs are somewhat
similar to systems studied in equilibrium statistical physics. The limit set of a class-3
CA contains a strange attractor. That is, after sufficiently many time steps, starting
from almost any initial configuration, the state of a class-3 CA evolves chaotically
on a Cantor-like subset of S. The asymptotic behavior, as the time t tends to ∞,
of, say, the density of nonzero site values is either of the form exp(−αt) or t^{−γ},
where α and γ are constants. As the range of the rule increases, the exponential
behavior becomes more and more frequent. Most range-1 class-3 rules have, on the
contrary, a power-law behavior. To illustrate these different asymptotic behaviors,
we shall briefly describe the evolution toward their attractor of range-1 Rules 18, 54
and 22. 8

2.1.1. Rule 18. It is defined by the following map:

    f(x_1, x_2, x_3) = 1 if (x_1, x_2, x_3) = (0, 0, 1) or (1, 0, 0),
                       0 otherwise.

For this rule, configurations belonging to the attractor consist of sequences of zeros
of odd lengths separated by isolated ones. The average number of sequences of
zeros of length 2n + 1 per site is equal to 1/2^{n+3} (Boccara et al 1990). 9 With
respect to this background, a sequence of two ones or a sequence of zeros of even
length is a "defect" or a "kink" (Grassberger 1983). Since two sequences of zeros of
odd lengths separated by two neighboring ones generate, at the next time step, a

8 Any rule may be specified by its rule number, which, following Wolfram (1983), is defined by

    N(f) = Σ_{x_1, x_2, ..., x_{2r+1}} f(x_1, x_2, ..., x_{2r+1}) 2^{e(x_1, x_2, ..., x_{2r+1})},

where

    e(x_1, x_2, ..., x_{2r+1}) = Σ_{j=0}^{2r} x_{j+1} 2^{2r−j}.

9 Different rules may have the same attractor (Fig. 2.1b). Due to different time correlations,
their spatiotemporal patterns are, however, different.
31

sequence of zeros of even length, configurations generated by Rule 18 may contain defects of one type only. During the evolution these defects move, and when they meet they annihilate pairwise (Fig. 2.1a). This process has been studied by Grassberger (1983) who found that, starting from a random initial configuration, the density of defects decreases as t^{−1/2}. 10 These defects may be viewed as particles. For Rule 18 all particles are of the same type, and a particle is its own antiparticle. This particularly simple picture is not general, as the discussion of Rule 54 will show.
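This annihilation picture can be observed in a few lines of simulation. The sketch below is our own illustrative code (the lattice size, seed, and sampling times are arbitrary choices): it evolves Rule 18 on a ring and counts the even-length runs of zeros, which play the role of the kinks; their number falls off slowly with time.

```python
import random

def step18(s):
    """One synchronous step of range-1 Rule 18 on a ring."""
    n = len(s)
    return [1 if (s[i - 1], s[i], s[(i + 1) % n]) in ((0, 0, 1), (1, 0, 0))
            else 0 for i in range(n)]

def even_zero_runs(s):
    """Number of maximal runs of zeros of even length (the 'kinks')."""
    if 1 not in s:
        return 0
    k = s.index(1)
    t = s[k:] + s[:k]                      # rotate so the ring starts with a 1
    runs = [len(r) for r in "".join(map(str, t)).split("1") if r]
    return sum(1 for r in runs if r % 2 == 0)

random.seed(1)
s = [random.randint(0, 1) for _ in range(4096)]
counts = {}
for t in range(1, 1001):
    s = step18(s)
    if t in (10, 1000):
        counts[t] = even_zero_runs(s)
print(counts)  # the kink count decays slowly, roughly as t**(-1/2)
```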

Figure 2.1. Spatiotemporal pattern generated by the evolution according to (a)


range-1 Rule 18, (b) range-2 Rule 2216773650.

10 See Bramson and Lebowitz (1991) for a rigorous study of asymptotic behaviors of densities
of pairwise annihilating particles executing random walks on a d-dimensional cubic lattice, for
any integral value of d.

2.1.2. Rule 54. Rule 54 is defined by

f(x_1, x_2, x_3) = 1 if (x_1, x_2, x_3) = (0, 0, 1), (0, 1, 0), (1, 0, 0) or (1, 0, 1),
                   0 otherwise.

Figure 2.2. Spatiotemporal patterns generated by the evolution according to range-1 Rule 54. (a) Evolution from a randomly generated initial configuration. (b) Particles g_e and g_o. (c) Interaction between two particles w with opposite velocities and interaction between a w and a g_o. (d) Interaction between a w and a g_e.

Here again the evolution toward the attractor may be viewed as annihilation processes of interacting particlelike structures (Boccara et al. 1991). The background is periodic in space and time, both periods being equal to 4. Three types of particles may be distinguished. Two of them are non propagating and periodic in time.

Figure 2.3. Rule 54. Pairwise annihilation processes of even gutters.

Their periods are equal to 4. They may be generated by sequences of zeros whose lengths are greater than 3. We shall denote them by g_e and g_o (g for gutter) according to whether they consist of sequences of zeros of even or odd length. There

exists also a propagating particle w (for wall), which may be generated by three zeros following three ones or the converse. This particle may propagate to the right or to the left. Its velocity is equal to 1. These particles have a rather rich variety of interactions (Fig. 2.2). As represented in Figure 2.2c, we have the reactions

w⃗ + w⃖ → g_o,   w⃗ + g_o → w⃖,   w⃖ + g_o → w⃗,

where the arrows over the symbol w indicate the direction of propagation of the corresponding particle. Many other similar reactions can be written down. Figure 2.2d illustrates the interaction of the particles w and g_e. A pair of even gutters g_e may also be annihilated. There are many multiparticle reactions in which a pair of g_e disappear. A few of them are illustrated in Figure 2.3. Such reactions are relatively rare because they involve at least four particles (two g_e close together and two or more w). More complex annihilation processes of a pair of g_e involving more particles, in particular the presence of one odd gutter g_o, also occur.
Simulations done on a 10^4-site lattice for a number of time steps larger than 10^8 show that the number of g_e-particles tends to zero, approximately, as t^{−0.15} over two decades from 3 × 10^6 to 3 × 10^8 time steps. To show that these results do not depend upon the lattice size, simulations have also been done on a 10^5-site lattice for a number of time steps of the order of 10^7. Since the time evolution toward the attractor is essentially governed by the pairwise annihilation processes of even gutters (Fig. 2.4), the value of the exponent γ may be qualitatively understood as follows. Space translations of even gutters are due to the impacts of walls (Fig. 2.2d). Since the number of walls is proportional to the number of even gutters n_{g_e} (Fig. 2.4), the mean velocity for the displacement of even gutters is proportional to n_{g_e}. Moreover, when two even gutters come close together, their pairwise annihilation requires the occurrence of, at least, two simultaneous events, i.e., the impact of at least two walls in very precise positions (Fig. 2.3) (through the impact of a single wall they would move away). If we assume that the probability of this particular event is proportional to n_{g_e}^2, it follows that the probability for two neighboring even gutters to annihilate is roughly proportional to n_{g_e}^2 · n_{g_e}^2 = n_{g_e}^4. Hence, taking the mean velocity into account, the mean time required for two neighboring even gutters to annihilate is roughly proportional to n_{g_e}^{−5}, or, equivalently, n_{g_e} behaves as t^{−γ},

where γ is not larger than 0.2, since less probable events also contribute to the pairwise annihilation of even gutters.

Figure 2.4. Rule 54. Remaining particles after 3 × 10^8 time steps. Note that, in the asymptotic regime, the number of walls is proportional to the number of even gutters. The figure shows the evolution of 512 lattice sites from a 10^4-site lattice during 512 time steps. The periodic background is eliminated through the mapping η(i, t) = Σ_{k=0}^{3} s(i + k, t) mod 2.

2.1.3. Rule 22. It is defined by the following map

f(x_1, x_2, x_3) = 1 if (x_1, x_2, x_3) = (0, 0, 1), (0, 1, 0) or (1, 0, 0),
                   0 otherwise.

For this rule, the evolution toward the attractor cannot be viewed as annihilation processes of interacting particlelike structures, and for large t the density of nonzero sites tends to its stationary value exponentially. This exponential behavior has been studied in detail by Zabolitzky (1988) who found that the constant a in the argument of the exponential actually depends on the initial density of nonzero sites c_22(0), and goes to zero as 0.44 c_22(0).
Since, in the infinite-time limit, the density of nonzero sites approaches its stationary value in the same way as the decreasing number of particles, we may put forward the
36

Figure 2.5. Rule 22. Spatiotemporal pattern. Random initial condition.

following conjecture: For a class-3 CA, the density of nonzero sites tends to its stationary value either as t^{−γ} or e^{−at}, where γ and a are positive. In the case of a one-dimensional CA, the power-law behavior is observed if, and only if, after a short transient, the spatiotemporal pattern generated by the evolution of the CA may be viewed as interacting particlelike structures evolving in a regular background.

2.2. DETERMINISTIC SITE-EXCHANGE CELLULAR AUTOMATA

Site-exchange CAs are automata networks whose rule consists of two subrules. The first one is a standard synchronous CA rule, whereas the second is a sequential site-exchange rule. This last rule, characterized by a parameter m, is defined as follows. A site, whose value is one, is selected at random and swapped with another site value (either zero or one) also selected at random. The second site is either a neighbor of the first one (local site-exchange) or any site of the lattice (nonlocal site-exchange). This operation is repeated m c_f(m, t) N times, where N is the total number of sites, c_f(m, t) the density of nonzero sites at time t, and f is the CA
37

rule. The parameter m is called the degree of mixing. It is important to note that this mixing process, which will be used to model either short- or long-range moves of interacting individuals, does not change the value of the density c_f(m, t).
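As a deliberately naive sketch of this site-exchange subrule (our own illustrative code, assuming a one-dimensional ring and a configuration stored as a list of 0s and 1s):

```python
import random

def site_exchange(s, m, local=True, rng=random):
    """Sequential site-exchange subrule: pick a nonzero site at random and
    swap it with a random site (a nearest neighbour if local, any site
    otherwise), repeated round(m * c * N) times, where c is the density of
    ones and N the number of sites.  The density is left unchanged."""
    n = len(s)
    c = sum(s) / n
    for _ in range(round(m * c * n)):
        ones = [i for i in range(n) if s[i] == 1]
        if not ones:
            break
        i = rng.choice(ones)
        j = (i + rng.choice((-1, 1))) % n if local else rng.randrange(n)
        s[i], s[j] = s[j], s[i]
    return s

rng = random.Random(0)
s = [rng.randint(0, 1) for _ in range(200)]
n_ones = sum(s)
site_exchange(s, 1.5, local=True, rng=rng)
assert sum(s) == n_ones      # the mixing conserves the density, as stated
```

In a full simulation this function would be called after each synchronous application of the CA rule f.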
If m = ∞, the correlations created by the application of the CA rule f are completely destroyed, and the value of the stationary density of nonzero sites c_f(∞, ∞) is then correctly predicted by a mean-field-type approximation in which it is assumed that the probability, at time t, for a site value to be equal to one is c_f(m, t). This approximation is incorrect when m is not sufficiently large.
In this section we will describe some results, obtained recently by Boccara
and Roger (1993), concerning the behavior of the stationary densities of nonzero
sites of the three one-dimensional range-one CAs whose evolution has been studied
in the preceding section.

2.2.1. Rule 18. The evolution of the density of nonzero sites according to the mean-field approximation is determined by

c_18(∞, t + 1) = 2 c_18(∞, t)(1 − c_18(∞, t))^2.

Therefore c_18(∞, ∞) = 1 − 1/√2 ≈ 0.29289, whereas the stationary density of nonzero sites in the absence of mixing c_18(0, ∞), which is exactly known, is equal to 1/4. Figure 2.6a represents on a logarithmic scale the variation of the stationary
density c_18(m, ∞) as a function of m when the mixing process results from a local site-exchange. Note that as m increases, c_18(m, ∞) does not tend to c_18(∞, ∞) monotonically. c_18(m, ∞) reaches a maximum, greater than the mean-field value, for m = m_0 of the order of unity. Below, but not too close to, the maximum, the spatiotemporal patterns are similar to those generated by the cellular automaton evolving according to Rule 18 (for m = 0) and exhibit the same defects, whereas above the maximum no particlelike structures may be identified and the patterns are completely different (see next section, Fig. 2.10).
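As a quick numerical check of the mean-field prediction, iterating the map c → 2c(1 − c)^2 converges to the stated fixed point 1 − 1/√2 (a two-line sketch; the starting value and iteration count are arbitrary):

```python
# Iterate the m = infinity mean-field map for Rule 18.
c = 0.5
for _ in range(200):
    c = 2 * c * (1 - c) ** 2
print(round(c, 5))  # -> 0.29289, i.e. 1 - 1/sqrt(2)
```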
In order to characterize the asymptotic behavior of c_18(m, ∞) for very small and very large m, define the following exponents:

a_0 = lim_{m→0+} log(c_18(m, ∞) − c_18(0, ∞)) / log m,
a_∞ = lim_{m→∞} log(c_18(m, ∞) − c_18(∞, ∞)) / log m.


Figure 2.6. Rule 18. Log-log plot of the stationary density of nonzero sites
as a function of the degree of mixing m. (a) local site-exchange. (b) nonlocal
site-exchange.

From the log-log plots we have obtained a_0 = 0.415 ± 0.005 (Fig. 2.6a) and a_∞ = 0.44 ± 0.01 (Fig. 2.6a inset). These results may be understood as follows.
If, starting from the attractor of the cellular automaton evolving according to Rule 18, we exchange a very small number of sites between two successive applications of Rule 18, we create defects at a rate which can be assumed to be proportional to m. Therefore, for small m, the number Δν_+(m, t) of defects created during a short time interval Δt is given by

Δν_+(m, t) ∝ m Δt.

On the other hand, for large t, due to the annihilation process described in the preceding section, the number of defects ν(m, t) decreases as t^{−γ} with γ = 1/2 for Rule 18. Hence, during Δt, the number of defects decreases by

Δν_−(m, t) ∝ t^{−γ−1} Δt.

The equilibrium is reached when Δν_+ = Δν_−. Therefore,

t^{−γ−1} = O(m),

thus, for large t and small m, assuming that Δc_18(m, t) = c_18(m, t) − c_18(0, t) is proportional to the number of defects, we have

Δc_18(m, ∞) ∝ m^{γ/(γ+1)}.

It should be stressed that this simple argument can, at best, give an order of magnitude. The assumption that defects are created at a rate proportional to m is the simplest but may not be exact. Moreover, it is assumed that the defects are created at random in the lattice. However, in a local site-exchange process, when we move a nonzero site to a neighboring zero site, we create two neighboring defects which have, therefore, a higher probability to annihilate than more distant defects. Hence, the exponent a_0 should be slightly greater than the value γ/(γ+1).
This is indeed the case.
When m is large, a given site has moved, on the average, a distance √m; thus, if the correlations created by the application of Rule 18 have a range that is small compared to this distance, the exponent a_∞ should be close to 1/2, since the local site-exchange process is a random walk in a random environment (De Masi et al. 1989).
If the mixing process is nonlocal, the argument explaining the behavior of c_18(m, ∞) for small m remains valid, but the motion of nonzero sites being non-diffusive, the behavior for large m is very different. Figure 2.6b shows that a_0 = 0.425 ± 0.005, and a_∞ = 3.7 ± 0.1 (inset).
2.2.2. Rule 54. For small and large m, the arguments given above are still valid. However, for this particular rule, the number of defects decreases so slowly (cf. preceding section) that we should consider values of m of the order of 10^{−7}, leading to prohibitive computation times. In the simulations of Boccara and Roger (1993), the smallest m is of the order of 10^{−3}. Around this value it is found that a_0 = 0.11 ± 0.01 (Fig. 2.7a), which is not so far from the m → 0 value γ/(γ+1) = 0.13, and a_∞ = 0.53 ± 0.01 (Fig. 2.7a inset). For nonlocal site-exchange it is found that a_0 = 0.19 ± 0.01, notably higher than γ/(γ+1), and a_∞ = 5.3 ± 0.1 (Fig. 2.7b).


Figure 2.7. Rule 54. Log-log plot of the stationary density of nonzero sites
as a function of the degree of mixing m. (a) local site-exchange. (b) nonlocal
site-exchange.

2.2.3. Rule 22. For this rule, we have seen that the evolution toward the attractor cannot be viewed as annihilation processes of interacting particlelike structures, and for large t the density of nonzero sites tends to its stationary value exponentially, that is, γ = ∞. For small m, the argument given for Rule 18 is still valid if, for large t, the number of defects ν(m, t) is replaced by the difference Δc_22(m, t) = c_22(m, t) − c_22(0, t). c_22(m, ∞) should, therefore, behave linearly for small m. This is indeed the case (Fig. 2.7a). For large m, we have found a_∞ = 0.55 ± 0.01 (Fig. 2.7a inset) for local site-exchange, in agreement with the argument given for Rule 18, whereas a_∞ = 4.4 ± 0.1 (Fig. 2.7b) for nonlocal site-exchange. Note that, in this case, for small m, Δc_22(m, ∞) is not exactly linear. We have a_0 = 0.86 ± 0.01 (Fig. 2.7b).

2.3. PROBABILISTIC SITE-EXCHANGE CELLULAR AUTOMATA

Probabilistic CAs with an absorbing state exhibit phase transitions. Directed percolation is a typical example. 11 In this section we study the influence of the mixing process defined in the preceding section on the critical properties of the following cellular-automaton rule

s(t + 1, i) = X if (s(t, i − 1), s(t, i), s(t, i + 1)) = (0, 0, 1) or (1, 0, 0),
              0 otherwise,

where X is a discrete random variable defined on {0, 1} such that

Pr(X = 0) = 1 − p and Pr(X = 1) = p.

When p = 1, it coincides with Rule 18.


If the degree of mixing m tends to infinity, the value of the stationary density of nonzero sites, for a fixed value of the probability p, c(∞, p, ∞) is correctly predicted by a mean-field-type approximation. The equation

c(∞, p, t + 1) = 2p c(∞, p, t)(1 − c(∞, p, t))^2

determines, therefore, the evolution of the density of nonzero sites within the mean-field approximation. This mapping has two fixed points: 0 and 1 − 1/√(2p). Since the density of nonzero sites is a nonnegative quantity not greater than 1, the second fixed point exists if, and only if, 2p ≥ 1. The stability of these two fixed points is easy to determine. We have:
(i) if p < 1/2, 0 is stable,
(ii) if p > 1/2, 0 is unstable and 1 − 1/√(2p) is stable.
At p = p_c(∞) = 1/2, the system exhibits a transcritical bifurcation similar to a second-order phase transition characterized by a nonnegative order parameter. In the neighborhood of the bifurcation point, the stationary density of nonzero sites c(∞, p, ∞) behaves as p − p_c(∞).
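The two regimes of this mean-field map are easy to see numerically on either side of the threshold (a sketch; the starting density and iteration count are arbitrary):

```python
def stationary_density(p, c0=0.3, iters=5000):
    """Iterate c -> 2p c (1-c)^2 and return the numerically stationary value."""
    c = c0
    for _ in range(iters):
        c = 2 * p * c * (1 - c) ** 2
    return c

print(stationary_density(0.4))  # below p_c = 1/2: the density dies out
print(stationary_density(0.8))  # above p_c: approaches 1 - 1/sqrt(2p) ~ 0.2094
```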

11 See, e.g., Kinzel (1983).



For a fixed finite value of the degree of mixing m, the system exhibits, at p = p_c(m), a transcritical bifurcation. The behavior of the stationary density of nonzero sites c(m, p, ∞) in the neighborhood of the bifurcation point may be characterized by a critical exponent β(m) defined by

β(m) = lim_{p→p_c(m)+0} log c(m, p, ∞) / log(p − p_c(m)).

Within the mean-field approximation β(∞) = 1.



Figure 2.8. Typical log-log plot of c(m, p, ∞) as a function of p − p_c(m), for m = 0.41. A least-square fit gives p_c(m) = 0.781 and β(m) = 0.275 (local site-exchange).

Figure 2.8 shows a typical variation of the stationary density c(m, p, ∞) as a function of p for a fixed value of m in the case of short-range moves (local site-exchange). p_c(m) is not equal to 1/2 and the exponent β(m) is clearly different from 1.

To determine the values of p_c(m) and β(m), we measure, for a fixed degree of mixing m, the stationary density of nonzero sites c(m, p, ∞) for different probabilities p. If p is close to p_c(m), the stationary density is of the form

c(m, p, ∞) = A(m)(p − p_c(m))^{β(m)}

and, using a least-square fit, it is possible to determine simultaneously A(m), p_c(m) and β(m). Since the state corresponding to all site-values equal to zero is absorbing, it is difficult to measure small density values and, therefore, to approach closely enough p_c(m). The value of the exponent β(m) being very sensitive to the value of p_c(m), it would be better to first determine p_c(m) with the highest possible precision, and then determine β(m). Boccara et al. (1993) used the following "extinction method" to obtain more precise results.
"extinction method" to obtain more precise results.
For p < p_c(m), the density of nonzero sites c(m, p, t) goes to zero as t increases. As a function of p, the number of time steps necessary to reach the absorbing state (the "extinction time") behaves as (p_c(m) − p)^{−ν}, where ν is a positive exponent, when p is close to p_c(m). Using a 10^4-site lattice and measuring, for each value of p, the average extinction time over 100 experiments, it is found that p_c(0) = 0.8086 ± 0.0002. From this rather precise value, it follows that A(0) = 0.44 and β(0) = 0.286 ± 0.005. This value for β(0) is the same as for directed percolation (Blease 1977). This result is in favor of a universal critical behavior for one-dimensional cellular automata with one absorbing state (Kinzel 1983), the universality class being characterized by the space dimensionality and the dependence of the mean-field map for small densities (Bidaux et al. 1989).
The extinction method being rather time consuming, it has been used only for m = 0 to check the precision of the other method. The orders of magnitude of the error bars on p_c(m) and β are, respectively, 0.005 and 0.02.
The variations of p_c(m) and β as functions of m are represented in Figure 2.9. Two regimes may be distinguished. In the small m regime (for m ≲ 10), p_c(m) and particularly β(m) are close to their m = 0 values, whereas, in the large m regime (for m ≳ 10), they are close to their mean-field values.
The existence of the two regimes may be understood as follows. We have seen that configurations generated by the evolution of some one-dimensional cellular automata may be viewed, after a short transient, as particlelike structures


Figure 2.9. p_c (a) and β (b) as functions of m. Local site-exchange.

evolving in a regular background (Section 1). If we start from a configuration in the attractor of deterministic Rule 18, moving a one to a neighboring lattice site creates a pair of neighboring defects. For small values of m we, therefore, create a small number of defects, and, close to p_c(m), the spatiotemporal pattern, as shown in Figure 2.10a, is only slightly perturbed compared to the spatiotemporal pattern for m = 0 (Figure 2.10b). That is, for small values of m, we may expect the behavior of the system not to change much.
In the case of long-range moves (nonlocal site-exchange), the variations of
Pc(m) and fJ as fonctions of m, represented in Figure 2.11, are strikingly different.
While local and nonlocal site-exchange mixing processes lead to similar asymptotic
behaviors for the stationary density c( m, p, oo) for small values of m when pis close
to one, this is no more the case when pis close to Pc· When the stationary density is
very small, a one moved into a "sea" of zeros is the seed of a Sierpinski-like triangle
characteristic of the spatia-temporal pattern of Rule 18. Close to Pc(m), this
structure is short-lived, and the resulting spatia-temporal pattern differs greatly,

Figure 2.10. Spatiotemporal pattern for p close to p_c(m) for short-range moves. (a) m = 0.1, p = 0.8, (b) m = 0, p = 0.81.

as shown in Figure 2.12, from that obtained for short-range moves.


For m = 0 the derivatives ∂p_c/∂m and ∂β/∂m are infinite. The behaviors of p_c(m) and β(m) for small m may, therefore, be characterized by the exponents a_p and a_β defined by

a_p = lim_{m→0+} log(p_c(m) − p_c(0)) / log m,
a_β = lim_{m→0+} log(β(m) − β(0)) / log m.

Simulations show that

a_p = 0.48 ± 0.01 and a_β = 0.33 ± 0.05.

To determine the exponent a_β, it is very important to determine β(m) with a sufficient precision, that is, the percolation threshold p_c(m) has to be measured with great precision.


Figure 2.11. p_c (a) and β (b) as functions of m. Nonlocal site-exchange.

Figure 2.12. Spatiotemporal pattern for p close to p_c(m) for long-range moves. Here m = 0.1, p = 0.68.

3. Interacting Populations

3.1. VARIOUS MODELS

We will essentially consider epidemic models and occasionally predator-prey models. In the simplest epidemic models, based on disease status, the individuals are divided into three disjoint groups:
divided into three disjoint groups:
(S) The susceptible group, i.e., those individuals who are not infected but who are
capable of contracting the disease and become infective.
(I) The infective group, i.e., those individuals who are capable of transmitting
the disease to susceptibles.
(R) The removed group, i.e., those individuals who have had the disease and are dead. If we are more optimistic and do not consider a fatal outcome, this group could consist of individuals that either have been isolated (e.g., in a hospital) or have recovered and are permanently immune; this group could also consist of infectives who recovered and are only partially immune.
In more realistic models many other groups should be considered. We could,
for instance, define new groups to take into account the existence of latent and
incubation periods. 12
The models discussed in this chapter are formulated in terms of automata
networks (Goles and Martinez 1991). Automata networks consist of a graph with a
discrete variable at each vertex. Each vertex variable evolves in discrete time steps
according to a definite rule involving the values of neighboring vertex variables.
The vertex variables may be updated sequentially or synchronously.
Automata networks are discrete dynamical systems, which may be defined
more formally as follows.
Let G = (V, E) be a graph, where V is a set of vertices and E a set of edges. Each edge joins two vertices, not necessarily distinct. An automata network, defined on V, is a triple (G, Q, {f_i | i ∈ V}), where G is a graph on V, Q a finite set of

12 During the latent period the individual who has been exposed to the disease is not yet

infectious, whereas during the incubation period, the individual does not present symptoms but
is infectious.
48

states and f_i : Q^{|U_i|} → Q a mapping, called the local transition rule associated to vertex i. U_i = {j ∈ V | {j, i} ∈ E} is the neighborhood of i, i.e., the set of vertices connected to i, and |U_i| denotes the number of vertices belonging to U_i. The graph G is assumed to be locally finite, i.e., for all i ∈ V, |U_i| < ∞.
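The triple (G, Q, {f_i}) translates directly into code. The sketch below (our own illustrative names) performs one synchronous update of all vertex variables on a small ring graph, with Q = {0, 1} and the sum-mod-2 rule at every vertex:

```python
def synchronous_update(state, neighborhood, rule):
    """One synchronous step of an automata network: every vertex i gets
    rule[i] applied to the current values on its neighborhood U_i."""
    return {i: rule[i](tuple(state[j] for j in neighborhood[i]))
            for i in state}

# Ring graph on 5 vertices, U_i = {i-1, i+1}, f_i = sum mod 2 for all i.
n = 5
neighborhood = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
rule = {i: (lambda vals: sum(vals) % 2) for i in range(n)}
state = {i: 1 if i == 0 else 0 for i in range(n)}
state = synchronous_update(state, neighborhood, rule)
print(state)  # -> {0: 0, 1: 1, 2: 0, 3: 0, 4: 1}
```

Sequential updating would instead visit the vertices one at a time, reusing already-updated values.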
In our models the set V is the two-dimensional torus Z_L^2, where Z_L is the set of integers modulo L. A vertex is either empty or occupied by an individual
belonging to one of the three groups. The spread of the disease is governed by the
following rules:
1. Susceptibles become infective by contact, i.e., a susceptible may become infective with a probability p_i if, and only if, it is in the neighborhood of an infective. More precisely, during one time step, the probability that a susceptible having z infected neighbors becomes infected is 1 − (1 − p_i)^z. This hypothesis neglects latent periods, i.e., an infected susceptible becomes immediately infective.
2. Infectives are removed with a probability d_i. That is, at each time step, an infected individual is either removed or not. The number of time steps T during which he remains infected is a random variable with a geometric distribution, i.e., the probability P(T = k) that T is equal to the positive integer k is equal to d_i(1 − d_i)^{k−1}, and we have

E(T) = 1/d_i,   Var(T) = (1 − d_i)/d_i^2,

where, as usual, E(T) and Var(T) denote, respectively, the mean and the variance of T. This assumption states that removal is equally likely among infectives. In particular, it does not take into account the length of time the individual has been infective.
3. The time unit is the time step. During one time step, first the two preceding
rules-the infection and removal rules-are applied and then, the individuals
move on the lattice according to a site-exchange rule.
4. As described in the preceding chapter, the site-exchange rules considered here are of two extreme types only. An individual selected at random may move to a vertex also chosen at random. If the chosen vertex is empty the individual will move, otherwise he will not move. The set in which the vertex is randomly

chosen depends on the range of the move. In a short-range move the chosen vertex is any one of the four nearest neighbors, whereas in a long-range move the chosen vertex is any vertex of the graph. Since individuals may only move to empty sites, the average number of times an individual is selected to perform a move during one time step is an average number of tentative moves during a unit of time. This parameter is denoted by m. Even when m > 1, some individuals do not move. For a given m and in the limit of an infinite number of individuals, the probability that s given individuals do not move, i.e., have not been selected to move, is e^{−sm}.
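The removal rule (Rule 2 above) makes the infectious period geometric; a quick Monte Carlo check of E(T) = 1/d_i and Var(T) = (1 − d_i)/d_i^2 (sample size and seed are arbitrary):

```python
import random

def infectious_period(d, rng):
    """Steps until removal when removal occurs with probability d per step."""
    t = 1
    while rng.random() >= d:
        t += 1
    return t

rng = random.Random(3)
d = 0.5
ts = [infectious_period(d, rng) for _ in range(100_000)]
mean = sum(ts) / len(ts)
var = sum((t - mean) ** 2 for t in ts) / len(ts)
print(mean, var)   # close to E(T) = 1/d = 2 and Var(T) = (1-d)/d**2 = 2
```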
This model is rather crude: it assumes that the system is closed; births, deaths by other causes, immigrations, and emigrations are ignored. We shall also study a slightly more general model in which susceptibles and infectives may both give birth to susceptibles at neighboring empty sites with respective probabilities b_s and b_i. Moreover, we may also allow susceptibles to be removed with a probability d_s.
These epidemic models could also be presented as predator-prey models. Infectives may in fact be viewed as predators preying on susceptibles. However, small changes should be made. When a prey is eaten by a predator, it is, of course, not immediately changed into a predator. One has to take into account the efficiency with which extra food is turned into extra predators, and predators give birth to predators and not to prey.
These models are automata networks with mixed transition rules. That is,
at each time step, the evolution results from the application of two subrules. The
first subrule is a probabilistic translation invariant synchronous CA rule, and the
second one is a sequential site-exchange rule.
Conventionally, we may represent a model by a graph in which the vertex i corresponds to the group G_i and the directed arc (i, j) is labeled by the probability p_ij to transform an individual belonging to G_i into an individual belonging to G_j. Such a graph is called a transfer diagram. For instance, the simple SIR model described above corresponds to the transfer diagram

S --p_i--> I --d_i--> R.

3.2. MEAN-FIELD APPROXIMATION

The mean-field approximation ignores space dependence and neglects correlations. The state of the system at time t is, therefore, characterized by the space-independent densities of the different groups. This approximation is equivalent to the assumption of homogeneous mixing. In the case of a physical system exhibiting a phase transition, the quantitative predictions of a mean-field approximation are not very good. For the models described in the preceding section, however, since the second subrule represents a process that destroys the correlations created by the first subrule, the mean-field approximation becomes exact if m tends to ∞.

3.2.1. SIR Model. The evolution equations read (Boccara and Cheong 1992)

S_MFA(t + 1) = C − I_MFA(t + 1) − R_MFA(t + 1)   (1)
R_MFA(t + 1) = R_MFA(t) + d_i I_MFA(t)   (2)
I_MFA(t + 1) = I_MFA(t) + S_MFA(t)(1 − (1 − p_i I_MFA(t))^z) − d_i I_MFA(t),   (3)

where z is the number of neighboring vertices of a given vertex. For the two-dimensional square lattice considered in our simulations, z = 4. Note that, within the framework of this approximation, the "incidence rate", represented by the term S_MFA(t)(1 − (1 − p_i I_MFA(t))^z), is not, in this model as in most models (Bailey 1975, Waltman 1974, Anderson and May 1991), bilinear. 13
From Equations (1)-(3), it follows that S_MFA(t) is positive nonincreasing whereas R_MFA(t) is positive nondecreasing. Therefore, the infinite-time limits
13 In statistical physics, the incidence rate of epidemiologists is viewed as a two-body interaction. The influence of incidence rates of the form S^a I^b on the dynamics of different epidemic models has been studied by various authors (see, for example, Hethcote and van den Driessche (1991) who also considered incidence rates of the form S g(I), where g is not linear). When a = 1, the qualitative dynamical behavior of the model is not altered, but for b > 1, multiple equilibria and limit cycles have been found (see subsection 3). In statistical physics, the existence of new phases resulting from nonbilinear two-body interactions has been known for a long time (see, for example, Boccara (1976)).

S_MFA(∞) and R_MFA(∞) exist. Since I_MFA(t) = C − S_MFA(t) − R_MFA(t), it follows also that I_MFA(∞) exists and satisfies the relation

R_MFA(∞) = R_MFA(∞) + d_i I_MFA(∞),

which shows that I_MFA(∞) = 0.


If the initial conditions are

R_MFA(0) = 0 and I_MFA(0) ≪ C,

I_MFA(1) is small, and we have

I_MFA(1) − I_MFA(0) = (z p_i S_MFA(0) − d_i) I_MFA(0) + O(I_MFA^2(0)).   (4)

Hence, according to the initial value of the density of susceptibles, we may distinguish two cases:
1. If S_MFA(0) < d_i/(z p_i) then I_MFA(1) < I_MFA(0). Since S_MFA(t) is a nonincreasing function of time, I_MFA(t) goes monotonically to zero as t tends to ∞. That is, no epidemic occurs.
2. If S_MFA(0) > d_i/(z p_i) then I_MFA(1) > I_MFA(0). The density I_MFA(t) of infectives increases as long as the density of susceptibles S_MFA(t) is greater than the threshold d_i/(z p_i) and then tends monotonically to zero.
This shows that the spread of the disease occurs only if the initial density of susceptibles is greater than a threshold value. This is exactly the threshold theorem of Kermack and McKendrick. Since I_MFA(t) is, in general, very small, Equation (3) is well approximated by

I_MFA(t + 1) = I_MFA(t) + z p_i S_MFA(t) I_MFA(t) − d_i I_MFA(t),   (3')

which shows that the mean-field approximation is equivalent to a time-discrete formulation of the Kermack-McKendrick model. Figure 3.1 shows two typical time evolutions of the density of infectives.
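Equations (1)-(3) iterate directly. The short sketch below (our own code; parameters taken from the caption of Figure 3.1) reproduces the two regimes of the threshold theorem:

```python
def sir_mfa(C, I0, p_i, d_i, z=4, steps=200):
    """Iterate the mean-field SIR equations (1)-(3); returns I_MFA(t)."""
    S, I, R = C - I0, I0, 0.0
    traj = [I]
    for _ in range(steps):
        I_next = I + S * (1 - (1 - p_i * I) ** z) - d_i * I   # Eq. (3)
        R = R + d_i * I                                       # Eq. (2)
        S = C - I_next - R                                    # Eq. (1)
        I = I_next
        traj.append(I)
    return traj

a = sir_mfa(0.6, 0.01, 0.3, 0.5)    # S(0) > d_i/(z p_i): epidemic
b = sir_mfa(0.6, 0.01, 0.3, 0.75)   # S(0) < d_i/(z p_i): no epidemic
print(max(a) > a[0], max(b) == b[0])  # -> True True
```

In case (a) the infective density rises, peaks, and decays to zero; in case (b) it decreases monotonically from the start.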
Figure 3.1. Time evolution of the density of infectives for the SIR model within the mean-field approximation. C = 0.6, I_MFA(0) = 0.01, z = 4, p_i = 0.3. (a) d_i = 0.5 (S_MFA(0) > d_i/(z p_i)). (b) d_i = 0.75 (S_MFA(0) < d_i/(z p_i)).

3.2.2. Two-Population Model. It is straightforward to extend the SIR model to more than one population. For instance, the heterosexual spread of a venereal disease involves the obligatory switching of infection back and forth between two distinct populations. In this case, the probability for a susceptible of Population 1 (resp. 2) to become infective by contact with an infective of Population 2 (resp. 1) is denoted by p_{1,i} (resp. p_{2,i}), and the probability for an infective of Population 1 (resp. 2) to be removed is denoted by d_{1,i} (resp. d_{2,i}). 14 We have then (Boccara

14 A less crude model for the heterosexual spread of a venereal disease may be obtained, for

example, by separating males and females into different age groups and assuming that a male
(resp. female) susceptible belonging to a given age group can catch the disease from a female
(resp. male) infective if, and only if, the infective belongs to neighboring age groups.

and Cheong 1992)

S^α_MFA(t + 1) = C^α − I^α_MFA(t + 1) − R^α_MFA(t + 1)   (5)
R^α_MFA(t + 1) = R^α_MFA(t) + d_{α,i} I^α_MFA(t)   (6)
I^α_MFA(t + 1) = I^α_MFA(t) + S^α_MFA(t)(1 − (1 − p_{α,i} I^β_MFA(t))^z) − d_{α,i} I^α_MFA(t),   (7)

where α and β are equal to 1 or 2 with α ≠ β.


As for the one-population model, it follows that, for α = 1, 2, S^α_MFA(t)
is positive nonincreasing whereas R^α_MFA(t) is positive nondecreasing. There-
fore, the infinite-time limits S^α_MFA(∞) and R^α_MFA(∞) exist. Since I^α_MFA(t) =
C^α - S^α_MFA(t) - R^α_MFA(t), it follows also that I^α_MFA(∞) exists and satisfies the
relation

R^α_MFA(∞) = R^α_MFA(∞) + d_{α,i} I^α_MFA(∞),

which shows that I^α_MFA(∞) = 0.
Due to the coupling between the two populations a wider variety of situations
may occur. For instance, if the initial densities of infectives I^1_MFA(0) and I^2_MFA(0)
are both small, then I^α_MFA(1) is small, and we have

I^α_MFA(1) - I^α_MFA(0) = z p_{α,i} S^α_MFA(0) I^β_MFA(0) - d_{α,i} I^α_MFA(0) + O((I^β_MFA(0))^2),   (8)

where α and β are equal to 1 or 2 with α ≠ β. Hence, according to the initial
values S^1_MFA(0), S^2_MFA(0), I^1_MFA(0) and I^2_MFA(0), we may observe the following
behaviors. Let Q_α = z p_{α,i} S^α_MFA(0) I^β_MFA(0) - d_{α,i} I^α_MFA(0), as in the
caption of Figure 3.2.
1. If Q_1 < 0 and Q_2 < 0, then I^α_MFA(1) < I^α_MFA(0) for α = 1, 2.

Since S^1_MFA and S^2_MFA are nonincreasing functions of time, I^1_MFA(t) and I^2_MFA(t)
go monotonically to zero as t tends to ∞. No epidemic occurs.
2. If Q_1 > 0 and Q_2 > 0, then I^α_MFA(1) > I^α_MFA(0) for α = 1, 2.
The densities of infectives in both populations increase as long as the relations

z p_{1,i} S^1_MFA(t) I^2_MFA(t) > d_{1,i} I^1_MFA(t)   and   z p_{2,i} S^2_MFA(t) I^1_MFA(t) > d_{2,i} I^2_MFA(t)

are satisfied. Since the densities of susceptibles decrease with time, the densities
of infectives, after having reached a maximum, tend monotonically to zero.
3. If Q_1 < 0 and Q_2 > 0, then I^1_MFA(1) < I^1_MFA(0) and I^2_MFA(1) > I^2_MFA(0);
but, since I^1_MFA(t+1) depends on I^2_MFA(t), the density of infectives in Population
1 does not necessarily go monotonically to zero. After having decreased for a few
time steps, due to the increase of the density of infectives in Population 2, it may
increase if

z p_{1,i} S^1_MFA(t) I^2_MFA(t) - d_{1,i} I^1_MFA(t)


Figure 3.2. Time evolution of the densities of infectives for the two-population
model using the mean-field approximation.
Q_1 = z p_{1,i} S^1_MFA(0) I^2_MFA(0) - d_{1,i} I^1_MFA(0),
Q_2 = z p_{2,i} S^2_MFA(0) I^1_MFA(0) - d_{2,i} I^2_MFA(0). z = 4,
S^1_MFA(0) = S^2_MFA(0) = 0.29, I^1_MFA(0) = I^2_MFA(0) = 0.01.
(a) Q_1 < 0 and Q_2 < 0, p_{1,i} = 0.37, p_{2,i} = 0.23, d_{1,i} = 0.6, d_{2,i} = 0.3.
(b) Q_1 > 0 and Q_2 > 0, p_{1,i} = 0.5, p_{2,i} = 0.8, d_{1,i} = 0.35, d_{2,i} = 0.25.
(c) Q_1 < 0 and Q_2 > 0, p_{1,i} = 0.13, p_{2,i} = 0.8, d_{1,i} = 0.27, d_{2,i} = 0.35.
(d) Q_1 < 0 and Q_2 > 0, p_{1,i} = 0.15, p_{2,i} = 0.6, d_{1,i} = 0.5, d_{2,i} = 0.3.

becomes positive. The spread of the disease in Population 2 may trigger the
epidemic in Population 1. If, however, the increase of the density of infectives in
Population 2 is not high enough, then the density of infectives in Population 1
will decrease monotonically whereas the density of infectives in Population 2 will
increase as long as

z p_{2,i} S^2_MFA(t) I^1_MFA(t) > d_{2,i} I^2_MFA(t),

and then tends monotonically to zero. The disease spreads only in Population 2
whereas no epidemic occurs in Population 1.
Figures 3.2a-2d show some typical time evolutions of the density of infectives
in both populations.
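The role of the first-order quantities Q_1 and Q_2 defined in the caption of Figure 3.2 can be checked by iterating Equations (5)-(7) directly. The sketch below is ours (names and organization are illustrative):

```python
def two_pop_sir(S0, I0, z, p, d, steps):
    """Iterate the two-population SIR mean-field map, Eqs. (5)-(7).

    p[a], d[a]: infection/removal probabilities of population a; infection
    of population a is driven by the infectives of the other population b.
    """
    S, I, R = list(S0), list(I0), [0.0, 0.0]
    history = [(tuple(S), tuple(I))]
    for _ in range(steps):
        nS, nI, nR = S[:], I[:], R[:]
        for a in range(2):
            b = 1 - a
            new_inf = S[a] * (1.0 - (1.0 - p[a] * I[b]) ** z)
            nS[a] = S[a] - new_inf
            nI[a] = I[a] + new_inf - d[a] * I[a]
            nR[a] = R[a] + d[a] * I[a]
        S, I, R = nS, nI, nR
        history.append((tuple(S), tuple(I)))
    return history

def Q(a, S0, I0, z, p, d):
    """Initial growth rate of I_a, Eq. (8) to first order."""
    b = 1 - a
    return z * p[a] * S0[a] * I0[b] - d[a] * I0[a]
```

With the parameter values of Figure 3.2a both Q's are negative and both densities of infectives decrease from the first step; with those of Figure 3.2b both Q's are positive and both densities grow initially.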

3.2.3. SIS Model. In this model, after recovery, infected individuals become sus-
ceptible to catch the disease again (as, e.g., with the common cold). This model is
interesting because it exhibits a transcritical bifurcation between an endemic state
and a disease-free state.
Here, the state of the system at time t is characterized by the densities
S_MFA(t) and I_MFA(t) of susceptibles and infectives, and the evolution equation
of the density of infectives is (Boccara and Cheong 1993)

I_MFA(t+1) = I_MFA(t) + S_MFA(t)(1 - (1 - p_i I_MFA(t))^z) - p_r I_MFA(t),   (9)

where p_r denotes the probability per unit time for an infective to recover.
Since the population is closed, the total density

C = S_MFA(t) + I_MFA(t)   (10)

is time-independent. Eliminating S_MFA(t) between (9) and (10) yields

I_MFA(t+1) = I_MFA(t) + (C - I_MFA(t))(1 - (1 - p_i I_MFA(t))^z) - p_r I_MFA(t).   (11)

In the infinite-time limit, the stationary density of infectives I_MFA(∞) is such that

(C - I_MFA(∞))(1 - (1 - p_i I_MFA(∞))^z) = p_r I_MFA(∞).   (12)

I_MFA(∞) = 0 is always a solution of Equation (12). This value characterizes the
disease-free state. It is a stable stationary state if, and only if, z C p_i - p_r ≤ 0. If
z C p_i - p_r > 0, the stable stationary state is given by the unique positive solution
of Equation (12). In this case, a nonzero fraction of the population is infected. The
system is in the endemic state. For z C p_i - p_r = 0 the system, within the framework
of the mean-field approximation, undergoes a transcritical bifurcation similar to
a second-order phase transition characterized by a nonnegative order parameter,
whose role is played, in this model, by the stationary density of infected individuals
I_MFA(∞). This threshold theorem is a well-known result for differential-equation
SIS models (Hethcote 1976).
It is easy to verify that, in the endemic state, when z C p_i - p_r tends to zero
from above, I_MFA(∞) goes continuously to zero as z C p_i - p_r. In the (p_i, p_r)
parameter plane,

z C p_i - p_r = 0   (13)

is the equation of the second-order phase transition line.
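The transcritical bifurcation is easy to observe numerically by iterating the map of Equation (11). In the sketch below (ours), the iteration converges to I_MFA(∞) = 0 when z C p_i - p_r ≤ 0 and to the positive endemic density when z C p_i - p_r > 0:

```python
def sis_mean_field(C, I0, z, p_i, p_r, steps):
    """Iterate Eq. (11):
    I(t+1) = I(t) + (C - I(t)) (1 - (1 - p_i I(t))**z) - p_r I(t)."""
    I = I0
    for _ in range(steps):
        I = I + (C - I) * (1.0 - (1.0 - p_i * I) ** z) - p_r * I
    return I

# With C = 0.6 and z = 4:
# p_i = 0.5, p_r = 0.5 gives z C p_i - p_r = 0.7 > 0  -> endemic state,
# p_i = 0.1, p_r = 0.5 gives z C p_i - p_r = -0.26 < 0 -> disease-free state.
endemic = sis_mean_field(0.6, 0.01, 4, 0.5, 0.5, 2000)
disease_free = sis_mean_field(0.6, 0.01, 4, 0.1, 0.5, 2000)
```

In the endemic case the limiting value is the unique positive root of Equation (12); in the disease-free case the density of infectives shrinks geometrically to zero.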

3.2.4. Generalized SIR Model. Let us assume that, more generally, susceptibles and
infectives may give birth to susceptibles at neighboring empty sites with respective
probabilities b_s and b_i, and that susceptibles may be removed with a probability
d_s. The evolution equations read

S_MFA(t+1) = (1 - d_s) S_MFA(t) + (1 - S_MFA(t) - I_MFA(t)) f(b_s S_MFA(t) + b_i I_MFA(t)) - (1 - d_s) S_MFA(t) f(p_i I_MFA(t)),   (14)
I_MFA(t+1) = (1 - d_i) I_MFA(t) + (1 - d_s) S_MFA(t) f(p_i I_MFA(t)),   (15)
R_MFA(t+1) = R_MFA(t) + d_s S_MFA(t) + d_i I_MFA(t),   (16)

where f(x) = 1 - (1 - x)^z.

Note that, in this model, the population is not closed. Due to the birth
processes, we obviously have

S_MFA(t+1) + I_MFA(t+1) + R_MFA(t+1) - S_MFA(t) - I_MFA(t) - R_MFA(t) = (1 - S_MFA(t) - I_MFA(t)) f(b_s S_MFA(t) + b_i I_MFA(t)).

In the (S, I) plane, the discrete dynamical system represented by (14) and (15)
has three fixed points: (0, 0), (S_0, 0) and (S*, I*), where S_0 is the solution of

-d_s S + (1 - S) f(b_s S) = 0,   (17)

and S* and I* satisfy

-d_s S + (1 - S - I) f(b_s S + b_i I) - (1 - d_s) S f(p_i I) = 0,   (18)

-d_i I + (1 - d_s) S f(p_i I) = 0.   (19)

1. (0, 0) is stable (not a very interesting situation, corresponding to the
removal of all the individuals) if the eigenvalues of the 2 × 2 Jacobian matrix J(0, 0)
have a norm (i.e., an absolute value) less than 1. Since

J_11(0, 0) = 1 - d_s + b_s f'(0),
J_12(0, 0) = b_i f'(0),
J_21(0, 0) = 0,
J_22(0, 0) = 1 - d_i,

(0, 0) is stable if (f'(0) = z)

z b_s < d_s,   (20a)
d_i > 0.   (20b)

2. (S_0, 0) exists if z b_s > d_s, that is if (0, 0) is unstable. Since

J_11(S_0, 0) = 1 - d_s - f(b_s S_0) + (1 - S_0) b_s f'(b_s S_0),
J_12(S_0, 0) = -f(b_s S_0) + (1 - S_0) b_i f'(b_s S_0) - (1 - d_s) S_0 p_i f'(0),
J_21(S_0, 0) = 0,
J_22(S_0, 0) = 1 - d_i + (1 - d_s) S_0 p_i f'(0),

(S_0, 0) is stable (which is the case for the simple SIR model discussed above) when

d_s + f(b_s S_0) - (1 - S_0) b_s f'(b_s S_0) > 0,   (21a)

d_i - (1 - d_s) S_0 p_i f'(0) > 0.   (21b)

Since, when it exists, S_0 satisfies (17), we may eliminate d_s in (21a)
(d_s = f(b_s S_0)(1 - S_0)/S_0), and write this stability condition as

f(b_s S_0) - S_0 (1 - S_0) b_s f'(b_s S_0) > 0.   (22)

For 0 ≤ S_0 ≤ 1, this condition is always satisfied. Therefore, (S_0, 0) is stable if,
and only if, (21b) is satisfied.
When (21b) is satisfied, the dynamical behavior of this generalized model is
similar to the behavior of the simple SIR model discussed above. In particular,
if, for all t, I(t) << 1, we have

I_MFA(t+1) - I_MFA(t) ≈ ((1 - d_s) z p_i S_MFA(t) - d_i) I_MFA(t).

As for the simple SIR model, there exists a threshold value for the density of
susceptibles above which the number of infectives increases.
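This threshold can be probed numerically with the map (14)-(15). In the sketch below (ours; the parameter values are illustrative, not taken from the text), we locate S_0 by iterating the disease-free dynamics and then perturb (S_0, 0) with a tiny density of infectives; the perturbation dies out when Condition (21b) holds and grows when it is violated:

```python
def f(x, z=4):
    """f(x) = 1 - (1 - x)**z, the function of Eqs. (14)-(16)."""
    return 1.0 - (1.0 - x) ** z

def step(S, I, bs, bi, ds, di, pi, z=4):
    """One iteration of the generalized SIR mean-field map, Eqs. (14)-(15)."""
    births = (1.0 - S - I) * f(bs * S + bi * I, z)   # birth term
    new_inf = (1.0 - ds) * S * f(pi * I, z)          # incidence term
    return (1.0 - ds) * S + births - new_inf, (1.0 - di) * I + new_inf

def find_S0(bs, ds, z=4, iterations=500):
    """Disease-free fixed point S0 of Eq. (17), found by iterating with I = 0."""
    S = 0.5
    for _ in range(iterations):
        S, _ = step(S, 0.0, bs, 0.0, ds, 0.0, 0.0, z)
    return S

def infective_trajectory(di, S0, bs=0.2, bi=0.1, ds=0.3, pi=0.5, steps=30):
    """Perturb (S0, 0) with a tiny density of infectives and record I(t)."""
    S, I = S0, 1e-4
    traj = [I]
    for _ in range(steps):
        S, I = step(S, I, bs, bi, ds, di, pi)
        traj.append(I)
    return traj
```

With b_s = 0.2, d_s = 0.3, p_i = 0.5 and z = 4, the threshold (1 - d_s) S_0 p_i z falls between 0.3 and 0.9, so d_i = 0.9 satisfies (21b) (the perturbation decays) while d_i = 0.3 violates it (the infectives grow toward the endemic state).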
3. The expression of the Jacobian matrix J(S*, I*) is rather complicated.
We shall discuss the stability of (S*, I*) and the bifurcation corresponding to the
coalescence of (S*, I*) and (S_0, 0) as follows. Let

F(S, I) = -d_s S + (1 - S - I) f(b_s S + b_i I) - (1 - d_s) S f(p_i I),   (23a)

G(S, I) = -d_i I + (1 - d_s) S f(p_i I).   (23b)

Then, S* and I* are such that F(S*, I*) = 0 and G(S*, I*) = 0. When, at the
bifurcation point, (S*, I*) and (S_0, 0) coalesce (the bifurcation is transcritical),
S* tends to S_0 and I* to zero. This bifurcation point is the analogue of a
second-order transition point from the endemic to the disease-free state (see above
subsection). Hence, in the vicinity of this point we have

F(S_0, 0) + (S* - S_0) ∂F/∂S (S_0, 0) + I* ∂F/∂I (S_0, 0) + ··· = 0,   (24a)

G(S_0, 0) + (S* - S_0) ∂G/∂S (S_0, 0) + I* ∂G/∂I (S_0, 0) +
(S* - S_0) I* ∂²G/∂S∂I (S_0, 0) + (1/2)(I*)² ∂²G/∂I² (S_0, 0) + ··· = 0,   (24b)

where

F(S_0, 0) = -d_s S_0 + (1 - S_0) f(b_s S_0),   (25a)

G(S_0, 0) = 0,   (25b)

∂F/∂S (S_0, 0) = -d_s - f(b_s S_0) + (1 - S_0) b_s f'(b_s S_0),   (25c)

∂F/∂I (S_0, 0) = -f(b_s S_0) + (1 - S_0) b_i f'(b_s S_0) - (1 - d_s) S_0 p_i f'(0),   (25d)

∂G/∂S (S_0, 0) = 0,   (25e)

∂G/∂I (S_0, 0) = -d_i + (1 - d_s) S_0 p_i f'(0),   (25f)

∂²G/∂S∂I (S_0, 0) = (1 - d_s) p_i f'(0),   (25g)

∂²G/∂I² (S_0, 0) = (1 - d_s) S_0 p_i² f''(0).   (25h)
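The partial derivatives (25c)-(25f) can be checked against the definitions (23a)-(23b) by central finite differences. The sketch below is ours (parameter values are illustrative; the formulas hold at any point (S, 0), so an approximate S_0 suffices):

```python
def f(x, z=4):
    """f(x) = 1 - (1 - x)**z."""
    return 1.0 - (1.0 - x) ** z

def fp(x, z=4):
    """f'(x) = z (1 - x)**(z - 1); in particular f'(0) = z."""
    return z * (1.0 - x) ** (z - 1)

def F(S, I, bs=0.2, bi=0.1, ds=0.3, pi=0.5):
    """Eq. (23a)."""
    return -ds * S + (1 - S - I) * f(bs * S + bi * I) - (1 - ds) * S * f(pi * I)

def G(S, I, ds=0.3, di=0.4, pi=0.5):
    """Eq. (23b)."""
    return -di * I + (1 - ds) * S * f(pi * I)

def numdiff(fun, x, h=1e-6):
    """Central finite difference."""
    return (fun(x + h) - fun(x - h)) / (2.0 * h)

# Illustrative parameter values (ours) and an approximate S0:
bs, bi, ds, di, pi = 0.2, 0.1, 0.3, 0.4, 0.5
S0 = 0.553  # approximate root of Eq. (17) for bs = 0.2, ds = 0.3, z = 4
dFdS = -ds - f(bs * S0) + (1 - S0) * bs * fp(bs * S0)                          # (25c)
dFdI = -f(bs * S0) + (1 - S0) * bi * fp(bs * S0) - (1 - ds) * S0 * pi * fp(0)  # (25d)
dGdI = -di + (1 - ds) * S0 * pi * fp(0)                                        # (25f)
```

The numerical derivatives of F and G agree with the closed forms, and ∂G/∂S vanishes identically along I = 0, as stated in (25e).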

Therefore, in the endemic state, when S* - S_0 and I* are small, we have

(S* - S_0) ∂F/∂S (S_0, 0) + I* ∂F/∂I (S_0, 0) = 0,   (26a)

(S* - S_0) ∂²G/∂S∂I (S_0, 0) + (1/2) I* ∂²G/∂I² (S_0, 0) = -∂G/∂I (S_0, 0).   (26b)

I*, which plays the role of the order parameter, goes to zero as

∂G/∂I (S_0, 0) = -d_i + (1 - d_s) S_0 p_i f'(0),

since there exist no values for the parameters b_s and d_s such that

F(S_0, 0) = 0,
∂F/∂S (S_0, 0) = 0,

that is,

-d_s S_0 + (1 - S_0) f(b_s S_0) = 0,   (27a)

-d_s - f(b_s S_0) + (1 - S_0) b_s f'(b_s S_0) = 0.   (27b)

In fact, combining (27a) and (27b) yields

f(b_s S_0) = S_0 (1 - S_0) b_s f'(b_s S_0),

which, for b_s ∈ [0, 1] and S_0 ∈ [0, 1], is satisfied only if b_s S_0 = 0.
Hence, if -d_i + (1 - d_s) S_0 p_i f'(0) > 0,^15 (S*, I*) exists and (S_0, 0) is no
longer stable.
4. This is not the end of the story. It may be verified that the eigenvalues
of the Jacobian matrix J(S*, I*) are complex. When it is stable, the fixed point
(S*, I*) which characterizes the endemic state is a spiral node. If tr J(S*, I*) and
det J(S*, I*) denote, respectively, the trace and the determinant of the Jacobian
matrix, the condition det J(S*, I*) = 1 may be satisfied. The system will, in this
case, exhibit a Hopf bifurcation and the densities S_MFA and I_MFA will be periodic
functions of time in a domain of the parameter space defined by

(tr J(S*, I*))² - 4 det J(S*, I*) < 0,   (28a)

det J(S*, I*) > 1.   (28b)

3.3. CONSTANT-INTERACTION MODELS

In a constant-interaction model, the neighborhood U_i of a given vertex i consists
of all the other vertices. That is, U_i = V - {i}. Hence, if |V| denotes the total
number of vertices, |U_i| = |V| - 1. Since the number of neighbors is very large, the
probability to transform an individual of one group into an individual of another
group has to be very small. More precisely, when |V| tends to infinity, this probabil-
ity should behave as 1/|V|. Therefore, the incidence rate, which, in the mean-field
approximation, is S(1 - (1 - p_i I)^z), becomes S(1 - (1 - p_i I/|V|)^{|V|-1}). In the limit
of an infinite number of vertices, this incidence rate tends to S(1 - exp(-p_i I)).
That is, for all the models we have considered, the evolution equations are identical
to the mean-field equations if we replace the function f by x ↦ 1 - exp(-x). Note
that, in this case, the parameters which, like b_s, b_i, and p_i, characterize two-body

15 i.e., when Condition (21b) is violated.



interaction, are no longer probabilities. That is, they may take any nonnegative
value.
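The limit invoked here is elementary to verify numerically: the finite-size incidence factor approaches 1 - exp(-p_i I) as |V| grows. A small sketch (ours):

```python
import math

def incidence_factor(p_i, I, n_vertices):
    """Finite-size incidence factor of the constant-interaction model:
    1 - (1 - p_i I / |V|)**(|V| - 1)."""
    return 1.0 - (1.0 - p_i * I / n_vertices) ** (n_vertices - 1)

def incidence_limit(p_i, I):
    """Its |V| -> infinity limit, 1 - exp(-p_i I)."""
    return 1.0 - math.exp(-p_i * I)
```

For |V| = 10^6 the two expressions agree to better than 10^-6, while for |V| = 10 the finite-size factor is still visibly different from the limit.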
In phase transition theory, it is well known that constant-interaction models
exhibit a mean-field behavior (Boccara 1976). In epidemiology, these models are,
however, much less artificial than in phase transition theory. For instance, in a
group of individuals in a vacation resort, the infection process is well approxi-
mated by a constant-interaction model, whereas in many-body physics, constant-
interaction models are not realistic.
We will not analyze constant-interaction models since their qualitative dynam-
ical behavior is identical to the behavior found using the mean-field approximation
(Boccara and Cheong 1992 and 1993).

3.4. SIMULATIONS

In all our simulations, the total density of individuals is above the site percolation
threshold for the square lattice, which is equal to 0.593 (Stauffer 1979), in order
to be able to observe cooperative effects when m = 0.
3.4.1. SIR Model. Figure 3.3a shows the influence of the parameter m on
the time evolution of an epidemic with permanent removal for short-range moves.
As m increases, the density of infectives as a function of time tends to the mean-
field result. Figure 3.3b shows that the convergence to the mean-field result is
much faster for long-range moves. Mixing is more effective with long-range moves.
If, instead of permanent removal, infectives recover with the probability p_r and
become permanently immune, the convergence to the mean-field result is slower
(Fig. 3.3c), since the presence of the inert immune population on the lattice lowers
the effective mixing.
Note that, since the initial configuration is random, for any type of move and
any value of m, the value of the density of infectives after the first time step is
correctly predicted by the mean-field approximation.
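For concreteness, this kind of simulation can be sketched as a toy program. The code below is our minimal illustration, not the authors' implementation: the exact update order, the treatment of removed individuals (kept here as inert markers), and the parameter values are our assumptions; it only reproduces the overall two-subrule structure, a synchronous infection/removal rule followed by sequential site exchanges:

```python
import random

def simulate_sir_lattice(L=20, C=0.6, I0=0.05, p_i=0.5, d_i=0.3, m=1.0,
                         steps=10, seed=0):
    """Toy lattice SIR automaton with short-range mixing (our sketch).

    Subrule 1 (synchronous): a susceptible with k infective von Neumann
    neighbours becomes infective with probability 1 - (1 - p_i)**k; an
    infective is removed with probability d_i.
    Subrule 2 (sequential): about m tentative nearest-neighbour site
    exchanges per individual, modelling short-range moves.
    """
    rng = random.Random(seed)
    grid = {(x, y): (('I' if rng.random() < I0 else 'S')
                     if rng.random() < C else 0)
            for x in range(L) for y in range(L)}
    nbrs = ((1, 0), (-1, 0), (0, 1), (0, -1))
    counts = []
    for _ in range(steps):
        new = dict(grid)
        for (x, y), state in grid.items():
            if state == 'S':
                k = sum(grid[(x + dx) % L, (y + dy) % L] == 'I'
                        for dx, dy in nbrs)
                if k and rng.random() < 1.0 - (1.0 - p_i) ** k:
                    new[(x, y)] = 'I'
            elif state == 'I' and rng.random() < d_i:
                new[(x, y)] = 'R'
        grid = new
        n_individuals = sum(1 for v in grid.values() if v != 0)
        for _ in range(int(m * n_individuals)):      # mixing subrule
            x, y = rng.randrange(L), rng.randrange(L)
            dx, dy = rng.choice(nbrs)
            a, b = (x, y), ((x + dx) % L, (y + dy) % L)
            grid[a], grid[b] = grid[b], grid[a]
        counts.append({s: sum(1 for v in grid.values() if v == s)
                       for s in 'SIR'})
    return counts
```

Whatever the random outcome, the rule conserves the total number of individuals, the number of susceptibles never increases, and the number of removed individuals never decreases.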
As shown by Kermack and McKendrick (1927), the spread of the disease does
not stop for lack of a susceptible population. As the time t tends to infinity, the
stationary density of susceptibles S(m, ∞) for a given value of m is positive. The

Figure 3.3. Time evolution of an epidemic for the SIR model for different values
of m. The dashed line corresponds to the mean-field approximation. C = 0.6,
I(0) = 0.01, p_i = 0.5, d_i = 0.3, 100 × 100 lattice. Each point represents the average
of 10 experiments. (a) Short-range moves and permanent removal. +: m = 0,
x: m = 5, o: m = 250. (b) Long-range moves and permanent removal. +: m = 0,
x: m = 0.2, o: m = 2. (c) Short-range moves and permanent recovery. C = 0.6,
I(0) = 0.01, p_i = 0.5, p_r = 0.3, 100 × 100 lattice. Each point represents the average
of 10 experiments. +: m = 0, x: m = 5, o: m = 250.

variation of S(m, ∞) as a function of m is represented in Figure 3.4 in the case
of permanent removal and short-range moves. As expected, S(m, ∞) tends to the
mean-field value as m tends to ∞. More precisely, Boccara and Cheong (1992)
have shown that S(m, ∞) tends to S(∞, ∞) as m^{-α_∞}, where α_∞ = 1.14 ± 0.11.

Figure 3.4. Stationary density of susceptibles for the SIR model as a function of
m in the case of permanent removal and short-range moves.

In the case of permanent recovery and short-range moves or permanent re-
moval and long-range moves, simulations show (Boccara and Cheong 1992) that
the exponent α_∞ is equal to 1.02 ± 0.11 in the first case whereas it is equal to
2.06 ± 0.13 in the second one. The value of the exponent α_∞ characterizes the
approach of the stationary density of susceptibles S(m, ∞) to its mean-field value.
α_∞ seems to depend upon the range of the move but not upon whether we
have permanent recovery or permanent removal. For short-range moves, which cor-
respond to diffusive motion, α_∞ is close to 1. This result has been explained in
Section 2.2. The approach of the stationary density of susceptibles to its mean-
field value is faster for long-range moves. This is reasonable since mixing is more

effective in this case.^16

3.4.2. SIS Model. Figure 3.5 represents the (p_i, p_r) phase diagram for different
values of m in the case of short-range moves. Figure 3.6 shows a typical variation
of the stationary density of infectives I(m, ∞) as a function of p_i for given values of
p_r and m. The slope at the critical point (i.e., the transcritical bifurcation point)
seems to be infinite. If this is indeed the case, the critical exponent β defined by

β = lim_{p_i - p_i^c → 0+} log I(m, ∞) / log(p_i - p_i^c),   (10)

Figure 3.5. (p_i, p_r) phase diagram in the case of short-range moves for m = 0 (o),
m = 2 (x), m = 8 (o). The dashed line represents the mean-field approximation.
Total density: C = 0.6. Lattice size: 100 × 100.

which is equal to 1 within the mean-field approximation, is less than 1. Figure
3.7 shows a log-log plot of I(m, ∞) as a function of p_i - p_i^c, where p_i^c is the
critical value of p_i, for p_r = 0.5 and m = 0.3. It is found that p_i^c = 0.302
and β = 0.6. It has been clearly established that the mean-field approximation,
16 As for deterministic cellular automata (Section 2.2), we have not found any simple argument
to explain the value of the exponent.

because it neglects correlations, which play an essential role in the neighborhood of
a second-order phase transition, cannot predict correctly the critical behavior of
short-range interaction systems (Boccara 1976). For standard probabilistic cellular
automata, this is also the case (Bidaux et al. 1989, Martins et al. 1991).
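Exponents such as β are extracted from log-log plots like Figure 3.7 by a linear fit of log I against log(p_i - p_i^c). A minimal least-squares sketch (ours), checked on synthetic power-law data with the quoted values β = 0.6 and p_i^c = 0.302:

```python
import math

def fit_exponent(ps, Is, pc):
    """Least-squares slope of ln I versus ln(p - pc): the critical exponent."""
    xs = [math.log(p - pc) for p in ps]
    ys = [math.log(i) for i in Is]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic power-law data I = (p - pc)**beta (illustrative, not measured):
pc, beta = 0.302, 0.6
ps = [pc + 0.01 * k for k in range(1, 11)]
Is = [(p - pc) ** beta for p in ps]
```

On real simulation data, the fit is of course only as good as the estimate of p_i^c, which is why the two regimes in m discussed below are delicate to resolve.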

Figure 3.6. Typical variation of I(m, ∞) as a function of p_i for given values of
p_r and m in the case of short-range moves. Here p_r = 0.5, m = 0.3. The critical
value of p_i is 0.3018. Total density: C = 0.6. Lattice size: 100 × 100.

For a given value of p_r, the variations of β and p_i^c as functions of m (Fig. 3.8)
exhibit two regimes. In the small-m regime, i.e., for m ≲ 10, p_i^c and particularly
β have their m = 0 values. In the large-m regime, i.e., for m ≳ 300, p_i^c and β
have their mean-field values.
Concerning the value of the critical exponent β, the following two points should
be stressed.
1. The fact that, for small values of m, the exponent β, approximately equal to
0.6, is much less than its mean-field value illustrates how wrong the assumption
of homogeneous mixing is. This model, which takes into account the fluctuations in

the number of contacts in space and time, neglects, however, all other causes of
heterogeneity.

Figure 3.7. Typical log-log plot of I(m, ∞) as a function of p_i - p_i^c for given
values of p_r and m in the case of short-range moves. Here p_r = 0.5, m = 0.3. The
critical value of p_i is 0.3018. Total density: C = 0.6. Lattice size: 100 × 100.

2. When m = 0, the value of β for this model is equal to the value of β for
two-dimensional directed percolation (Bease 1977). This result strongly suggests
that the critical properties of our model are universal, i.e., model-independent (see
Section 2.3).

For given values of p_i and p_r, the asymptotic behavior of the stationary density
of infectives, for both small and large values of m, has also been studied (Boccara
and Cheong 1993). When m tends to zero, I(m, ∞) - I(0, ∞) tends to zero as m^{α_0}
with α_0 = 0.177 ± 0.15. When m tends to ∞, I(m, ∞) tends to I(∞, ∞) as m^{-α_∞},
and, as for the SIR model, it is found that α_∞ is close to 1 (exactly 0.945 ± 0.065).


Figure 3.8. Variations of β and p_i^c as functions of m for short-range moves.
p_r = 0.5, C = 0.6, lattice size: 100 × 100. For β, typical error bars have been
represented.

The fact that α_0 is rather small shows the importance of motion in the spread
of a disease. The stationary number of infectives increases dramatically when
the individuals start to move. In other words, we may say that the response
∂I(m, ∞)/∂m of the stationary density of infectives to the motion of the individuals
tends to ∞ when m tends to 0. The asymptotic behavior of I(m, ∞) for small m
is related to the asymptotic behavior of I(0, t) for large t (Section 2.3).
For long-range moves, the variations of β and p_i^c as functions of m, for a fixed
value of p_r, are very different from those for short-range moves. Figure 3.9 shows
that β and p_i^c go very fast to their mean-field values. Whereas for short-range
moves, β and p_i^c do not vary in the small-m regime, here, on the contrary, the
derivatives of β and p_i^c with respect to m tend to ∞ as m tends to 0. For small
m, the asymptotic behaviors of β and p_i^c may, therefore, be characterized by an
exponent. These exponents are not easy to measure. It is found (Boccara and
Cheong 1993) that β(m) - β(0) and p_i^c(0) - p_i^c(m) both behave approximately as
m^{1/2}.


Figure 3.9. Variations of β and p_i^c as functions of m for long-range moves.
p_r = 0.5, C = 0.6, lattice size: 100 × 100. For β, typical error bars have been
represented.

3.4.3. Generalized SIR Model. Since the transition from the endemic state to
the disease-free state corresponding to the coalescence of (S*, I*) and (S_0, 0) is
the analogue of the transcritical bifurcation studied in the case of the simple SIS
model, we have determined, for m = 0, the value of the exponent β(0) defined by

β(0) = lim_{d_i^c - d_i → 0+} log I(0, ∞) / log(d_i^c - d_i),

for fixed values of all the other parameters.^17 The log-log plot represented in
Figure 3.10 shows that β(0) = 0.568 ± 0.050. This value is equal to the value of
17 For the simple SIS model, the exponent β was defined (Equation (10)) for a fixed value of
p_r (p_i variable). A similar definition for a fixed value of p_i (p_r variable) could have been given.
These two definitions lead to identical values for β since the value of the exponent does not
depend upon the direction along which the transition line is approached. Here, d_i has been
chosen as our variable parameter. Any other choice is equivalent.

β(0) obtained for the simple SIS model, strongly suggesting again that all these
models belong to the same universality class.


Figure 3.10. Log-log plot of I(0, ∞) as a function of d_i^c - d_i for given values of
all other parameters. Here b_s = 0.4, b_i = 0.1, d_s = 0.3, and p_i = 0.01. The critical
value of d_i is 0.01149. Lattice size: 200 × 200.

The most interesting feature of this model is the existence of the Hopf bifur-
cation. Since the emphasis of these lectures is on the importance of motion, we
have studied the influence of m on the stability of (S*, I*).
Our simulations indicate that motion favors cyclic behavior. This is shown
in Figures 3.11a-f. In the case of short-range moves, for m = 300 (Fig. 3.11b),
we observe a cyclic behavior, the cycle in the (S, I) plane being almost identical
to the cycle predicted by the mean-field approximation (Fig. 3.11a). When m is
decreased, the size of the cycle starts first to increase (Fig. 3.11c), then decreases
(Figs. 3.11d-e), to finally disappear completely (Fig. 3.11f).

Figure 3.11. Hopf bifurcation and motion. b_s = 0.143, b_i = 0.0001, d_s = 0.001,
d_i = 0.15 and p_i = 0.9. (a) Mean-field cycle. (b) m = 300, lattice size: 100 × 100.
(c) m = 50, lattice size: 200 × 200. (d) m = 20, lattice size: 500 × 500. (e) m = 10,
lattice size: 1000 × 1000. (f) m = 1, lattice size: 1000 × 1000.


Figure 3.12. Hopf bifurcation and motion. b_s = 0.143, b_i = 0.0001, d_s = 0.001,
d_i = 0.15 and p_i = 0.9. Lattice size: 1000 × 1000. (a) m = 3, (b) m = 0.1, (c)
m = 0.001, (d) m = 0.0001.
Similar results are observed for long-range moves. In this case, as usual, the
mean-field behavior is observed for a much smaller value of m (Figs. 3.12a-d).
Here also, we observe that starting, for m = 3, from a cycle identical to the cycle
predicted by the mean-field approximation (Fig. 3.12a), as m is decreased the size of
the cycle starts first to increase (Fig. 3.12b), then decreases (Fig. 3.12c), to finally
disappear completely (Fig. 3.12d). For long-range moves, m has to be very small

to observe a fixed-point behavior, in agreement with the general result, illustrated
in Section 2.3, that for short-range moves we have essentially two regimes, m = 0
and mean-field, whereas for long-range moves we have, except for very small values
of m, essentially one regime, namely, mean-field.

4. Conclusion

We have discussed various automata network models for the spread of infectious
diseases in populations of moving individuals. The local rules of the automaton
consist of two subrules. The first, which is synchronous, is a probabilistic cellular au-
tomaton rule. It models birth, death, infection and recovery. The second, applied
sequentially, is a site-exchange rule. It describes the different types of moves the
individuals may perform. The emphasis has been on the influence of motion, that
is, the degree of mixing which follows from the application of the second subrule.
The degree of mixing is measured by a parameter m representing the average num-
ber of tentative moves per individual. If m goes to ∞ then the time evolution of
the different models is exactly described by a mean-field-type approximation. The
approach of the stationary densities of the different populations to their mean-field
values is characterized by an exponent α_∞ close to 1 if the motion of the individuals is
diffusive, that is, for short-range moves. For long-range moves the approach to
the mean-field value is faster. The asymptotic behavior of these densities for small
values of m has also been studied. For the SIR model, the derivative with respect
to m of S(m, ∞) is negative and very large, showing that as soon as the individuals
start to move, the spread of the disease increases dramatically. This effect is even
more striking in the SIS model, the derivative of the density of infectives I(m, ∞)
being, in this case, infinite.
The SIS model exhibits a transcritical bifurcation similar to a second-order
phase transition. In the neighborhood of the phase transition the system exhibits
a critical behavior due, for any finite value of m, to the local character of the
first subrule modeling infection and recovery. For m = 0, the critical exponent
β has the value found for two-dimensional directed percolation, suggesting that
the critical behavior of the SIS model is universal. β depends, however, on the

degree of mixing m. This dependence is very different according to whether the
individuals perform short- or long-range moves. This difference appears to be very
general and independent of the particular system under consideration.
More general SIR models may be defined. If, for instance, we assume that
both susceptibles and infectives may give birth to susceptibles with respective
probabilities b_s and b_i and susceptibles may die with a probability d_s, we find that
this system exhibits in the phase space (S, I) three fixed points: (0, 0), (S_0, 0) and
(S*, I*), where S_0, S* and I* are nonzero values depending upon the parameters
of the system. This model generalizes the preceding ones. When (S_0, 0) is stable,
(S*, I*) does not exist and the evolution of the system is similar to the evolution of
the simple SIR model. The bifurcation from (S_0, 0) to (S*, I*) is transcritical and
similar to the one exhibited by the simple SIS model. In particular the exponent
β has the same value for m = 0. More interestingly, the system exhibits a Hopf
bifurcation when (S*, I*) becomes unstable. Stable periodic fluctuations of the
densities of the susceptibles and infectives may account for recurrent epidemics.

References

[I] Bailey, N.T.J., The Mathematical Theory of Infectious Diseases and its Ap-
plications, London, Charles Griffin (1975).
[2] Bease, J., Series Expansions for the Directed-Bond Percolation Problem, J.
Phys. C: Solid State Phys. 10, 917-924 (1977).
[3] Bidaux, R., N. Boccara, H. Chaté, Order of the Transition Versus Space
Dimensionality in a Family of Cellular Automata, Phys. Rev. A 39, 3094-
3105 (1989).
[4] Boccara, N., Symétries Brisées, Paris, Hermann (1976).
[5] Boccara, N., K. Cheong, Automata Network SIR Models for the Spread of
Infectious Diseases in Populations of Moving Individuals, J. Phys. A: Math.
Gen. 25, 2447-2461 (1992).
[6] Boccara, N., K. Cheong, Critical Behaviour of a Probabilistic Automata Net-
work SIS Model for the Spread of an Infectious Disease in a Population of
Moving Individuals, J. Phys. A: Math. Gen. 26, 3707-3717 (1993).

[7] Boccara, N., E. Goles, S. Martinez, P. Picco (eds.), Cellular Automata and
Cooperative Phenomena. Proc. of a Workshop, Les Houches, Dordrecht,
Kluwer (1993).
[8] Boccara, N., J. Nasser, M. Roger, Annihilation of Defects During the Evo-
lution of Some One-Dimensional Class-3 Deterministic Cellular Automata,
Europhys. Lett. 13, 489-494 (1990).
[9] Boccara, N., J. Nasser, M. Roger, Particlelike Structures and their Interactions
in Spatio-Temporal Patterns Generated by One-Dimensional Deterministic
Cellular-Automaton Rules, Phys. Rev. A 44, 866-875 (1991).
[10] Boccara, N., J. Nasser, M. Roger, Critical Behavior of a Probabilistic Local
and Nonlocal Site-Exchange Cellular Automaton (1993), to appear.
[11] Boccara, N., M. Roger, Some Properties of Local and Nonlocal Site-Exchange
Deterministic Cellular Automata, (1993), to appear.
[12] Bramson, M., J.L. Lebowitz, Asymptotic Behavior of Densities for Two-
Particle Annihilating Random Walks, J. Stat. Phys. 62, 297-372 (1991).
[13] Cardy, J.L., Field-Theoretic Formulation of an Epidemic Process with Immu-
nisation, J. Phys. A: Math. Gen. 16, L709-L712 (1983).
[14] Cardy, J.L., P. Grassberger, Epidemic Models and Percolation, J. Phys. A:
Math. Gen. 18, L267-L271 (1985).
[15] Cox, J.T., R. Durrett, Limit Theorems for the Spread of Epidemics and Forest
Fires, Stoch. Proc. Appl. 30, 171-191 (1988).
[16] DeMasi, A., P.A. Ferrari, S. Goldstein, W.D. Wick, An Invariance Principle for
Reversible Markov Processes. Applications to Random Motions in Random
Environments, J. Stat. Phys. 55, 787-855 (1989).
[17] Farmer, D., T. Toffoli, S. Wolfram (eds.), Cellular Automata: Proc. of an
Interdisciplinary Workshop, Los Alamos, Amsterdam, North-Holland (1984).
[18] Gallas, J.A.C., H. Herrmann, Investigating an Automaton of Class-4, Int. J.
Mod. Phys. C 1, 181-191 (1990).
[19] Gertsbakh, I.B., Epidemic Process on a Random Graph: Some Preliminary
Results. J. Appl. Prob. 14, 427-438 (1977).
[20] Goles, E., S. Martinez, Neural and Automata Networks, Dordrecht, Kluwer
(1991).

[21] Grassberger, P., New Mechanism for Deterministic Diffusion, Phys. Rev. A
28, 3666-3667 (1983).
[22] Grassberger, P., On the Critical Behavior of the General Epidemic Process
and Dynamical Percolation, Math. Biosci. 63, 157-172 (1983).
[23] Gutowitz, H. (ed.), Cellular Automata: Theory and Experiment, Proc. Work-
shop, Los Alamos, Amsterdam, North-Holland (1990).
[24] Hethcote, H.W., Qualitative Analyses of Communicable Disease Models,
Math. Biosci. 28, 335-356 (1976).
[25] Hethcote, H.W., P. van den Driessche, Some Epidemiological Models with
Nonlinear Incidence, J. Math. Biol. 29, 271-287 (1991).
[26] Kallen, A., P. Arcuri, J.D. Murray, A Simple Model for the Spatial Spread
and Control of Rabies, J. Theor. Biol. 116, 377-393 (1985).
[27] Kermack, W.O., A.G. McKendrick, A Contribution to the Mathematical The-
ory of Epidemics, Proc. Roy. Soc. A 115, 700-721 (1927).
[28] Kinzel, W., Directed Percolation, Ann. Israel Phys. Soc. 5, 425-445 (1983).
[29] Kinzel, W., Phase Transitions in Cellular Automata, Z. Phys. B 58, 229-244
(1985).
[30] Kuulasmaa, K., The Spatial General Epidemic and Locally Dependent Ran-
dom Graphs, J. Appl. Prob. 19, 745-758 (1982).
[31] Kuulasmaa, K., S. Zachary, On Spatial General Epidemic and Bond Percola-
tion Processes, J. Appl. Prob. 21, 911-914 (1984).
[32] Liggett, T.M., Interacting Particle Systems, Heidelberg, Springer-Verlag
(1985).
[33] Lotka, A.J., Elements of Physical Biology, Baltimore, Williams and Wilkins
(1925).
[34] McKay, G., N. Jan, Forest Fires as Critical Phenomena, J. Phys. A: Math.
Gen. 17, L757-L760 (1984).
[35] Manneville, P., N. Boccara, G. Vichniac, R. Bidaux (eds.), Cellular Automata
and Modeling of Complex Physical Systems. Proc. of a Workshop, Les
Houches, Heidelberg, Springer-Verlag (1989).
[36] Mollison, D., Spatial Contact Models for Ecological and Epidemic Spread,
J. R. Statist. Soc. B 39, 283-326 (1977).
[37] Murray, J.D., Mathematical Biology, Heidelberg, Springer-Verlag (1989).


[38] Murray, J.D., E.A. Stanley, D.L. Brown, On the Spatial Spread of Rabies
among Foxes, Proc. Roy. Soc. (London) B 229, 111-150 (1986).
[39] Stauffer, D., Scaling Theory of Percolation Clusters, Physics Reports 54, 1-74
(1979).
[40] Volterra, V., Variazioni e Fluttuazioni del Numero d'Individui in Specie Animali Conviventi, R. Acc. dei Lincei 6(2), 31-113 (1926).
[41] Waltman, P., Deterministic Threshold Models in the Theory of Epidemics,
Heidelberg, Springer-Verlag (1974).
[42] Wolfram, S., Theory and Applications of Cellular Automata, Singapore,
World Scientific (1986).
[43] Wolfram, S., Statistical Mechanics of Cellular Automata, Rev. Mod. Phys.
55, 601-644 (1983).
[44] Zabolitzky, J., Critical Properties of Rule 22 Elementary Cellular Automata,
J. Stat. Phys. 50, 1255-1262 (1988).
ENTROPY, PRESSURE AND LARGE DEVIATION

ARTUR LOPES
Instituto de Matemática
Universidade Federal do Rio Grande do Sul
91500 Porto Alegre RS
Brasil

ABSTRACT. We present a brief introduction to Ergodic Theory and to the equilibrium states of Thermodynamic Formalism. We also analyze large deviation properties of the equilibrium states defined in Thermodynamic Formalism. Several problems related to Statistical Mechanics are considered.

1. Introduction

Our purpose in the first paragraphs of this text is to present the basic concepts
of Ergodic Theory in the most simple way. We introduce the Ergodic Theorem of
Birkhoff and the concept of entropy and pressure. Our final goal is to analyze sev-
eral important problems related to Statistical Mechanics in the setting of Ergodic
Theory.
We hope to present some of the main ideas of Ergodic Theory without too
many technicalities. The relation between the concepts of pressure and entropy
with the free-energy of Large Deviation Theory will be explored in the last para-
graphs.
Given a space X, a probability P on X is a law that associates to each subset B of X a real value P(B). The value P(X) is assumed to be one. We also assume in the definition of probability that for any sequence B_n, n ∈ ℕ, of disjoint subsets of X (that is, B_n ∩ B_m = ∅ for m different from n), the union of such sets, ∪_{n∈ℕ} B_n, satisfies P(∪_{n∈ℕ} B_n) = Σ_{n=0}^∞ P(B_n). Finally we require that P(A − B) = P(A) − P(B) for any subsets A and B of X such that B is contained in A.
Unfortunately, in most cases one can not have all the above properties defined
for all subsets of X. Therefore we define the probability P on a smaller family of
E. Goles and S. Martínez (eds.), Cellular Automata, Dynamical Systems and Neural Networks, 79-146.
© 1994 Kluwer Academic Publishers.
subsets of X. In the present text this family of subsets is a σ-algebra A. We refer the reader to any book on Real Analysis [16] for the precise definition of σ-algebra. In all the situations we will face in this text, the subsets B of X for which we want to assign a value P(B) will be elements of the family A. Therefore we will not have problems with sets B whose probability P(B) is not well defined.
Ya. Sinai defines Ergodic Theory in the following way: "The basic problems in Ergodic Theory consist of the study of the statistical properties of the groups of motions of non-random objects". The group of motions we are interested in in this text is the set of iterates of a map T from a metric space X into itself, that is, T, T^2, T^3, ..., T^n, .... What properties can one expect for the iterates of a general point x; in other words, what results can be stated for the set {x, T(x), T^2(x), T^3(x), ..., T^n(x), ...}? We will suppose there exists a certain probability P involved in the problem and we will be interested in properties that are true for every x in X outside a negligible set A of probability zero (that is, P(A) = 0).

2. Birkhoff's Ergodic Theorem

Let Ω = {0,1}^ℕ be the set of sequences of 0's and 1's, that is, z ∈ Ω if z = (z_0, z_1, z_2, ..., z_n, ...) where z_i ∈ {0,1} for all i ∈ ℕ.
We call this set the Bernoulli space. We can think of this set as the set of
events of tossing a coin infinitely many times, in which we associate head with
0 and tail with 1. For example, (0,1,0,1,0,1,...) is the event in which we have alternately head and tail, beginning with a head at time 0, that is, z_0 = 0.
A cylinder (or a parallelepiped) A is a subset of Ω defined by a finite specification of elements; the set A = {(0,1,1,0,1,z_5,z_6,...,z_n,...) | z_i ∈ {0,1}, i ≥ 5}, for example, is a cylinder, which we denote by (0,1,1,0,1). In general a cylinder is given by

(a_0, a_1, ..., a_m) = {(a_0, a_1, ..., a_m, z_{m+1}, z_{m+2}, ..., z_{m+n}, ...) | z_i ∈ {0,1}, i ≥ m+1},

where m is fixed and a_0, a_1, ..., a_m belonging to {0,1} are also fixed. We should think of (0,1,1,0,1) as the event of tossing a coin and having successively head, tail, tail, head and tail, with no specification about the rest of the tossings.
We would like to define a probability on the set X = Ω. First we will assign values P(A) for the elementary subsets: the cylinders A. After that we will extend this probability to more complicated subsets B, such as countable unions and intersections of cylinders A, and then to more general and elaborate specifications. The family of subsets B for which we will be able to assign the value P(B) will be called later the σ-algebra A.
The probability of having in order head, tail, tail, head and tail when we toss
the coin depends on the probability of having head or tail at each time.
Suppose that p_0, p_1 are two numbers such that p_0, p_1 ≥ 0 and p_0 + p_1 = 1. Suppose that each time we toss the coin we have probability p_0 of having head (or 0) and probability p_1 of having tail (or 1). If we suppose the tossings are independent, the probability of having head, tail, tail, head and tail is p_0^2 p_1^3. Therefore it is natural to give probability p_0^2 p_1^3 to the set A = (0,1,1,0,1).
In the same way we can define P(a_0, a_1, ..., a_n) = p_0^q p_1^m, where q is the number of 0's in the sequence {a_0, a_1, ..., a_n} and m is the number of 1's in the sequence {a_0, a_1, ..., a_n}. In this way we obtain a well defined measure on any cylinder. We define cylinders more generally by a finite number of specifications, but perhaps not in sequence; for instance {(0, z_1, 1, z_3, 0, z_5, z_6, z_7, ...) | z_1 ∈ {0,1}, z_3 ∈ {0,1}, z_i ∈ {0,1}, i ≥ 5} is a cylinder. We will present the precise definition of the general cylinder later. Using well known ideas of measure theory one can extend this probability P to the σ-algebra generated by all cylinders (see [13]).
In this way, if we denote this σ-algebra by A and the probability by P, we have (Ω, A, P) as a well defined measure space. Note that P(Ω) = 1, because 1 = p_0 + p_1 = P((0)) + P((1)) = P(Ω). Remember that (0) = {(0, z_1, z_2, z_3, ...) | z_i ∈ {0,1} for i ≥ 1} and (1) = {(1, z_1, z_2, z_3, ...) | z_i ∈ {0,1} for i ≥ 1}. We say that the coin is fair if p_0 = 0.5 and p_1 = 0.5. It is a well known observable fact that if we toss the coin a very large number of times, like 200 times, we will obtain more or less half the times head and half the times tail. It is also reasonable to suppose that if the coin has probability p_0 of having head (or 0) and p_1 of having tail (or 1), then if we toss the coin 200 times, we will obtain more or less 200 p_0 times head and 200 p_1 times tail. In Probability Theory this is known as the Law of Large Numbers [1].
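The Law of Large Numbers is easy to observe numerically. The following hypothetical simulation (our construction, not the text's) tosses a p_0-biased coin and records the fraction of heads (symbol 0):

```python
import random

# Hypothetical simulation: toss a p_0-biased coin n times; by the Law of
# Large Numbers the fraction of heads (symbol 0) should approach p_0.
def head_frequency(p0, n, seed=0):
    rng = random.Random(seed)
    tosses = [0 if rng.random() < p0 else 1 for _ in range(n)]
    return tosses.count(0) / n

print(head_frequency(0.5, 200))     # roughly 0.5 for a fair coin
print(head_frequency(0.3, 100000))  # close to 0.3 for a biased coin
```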
The Ergodic Theorem of Birkhoff is a quite general theorem that will assure
that the above result is true. We explain now more carefully the meaning of the
Ergodic Theorem.
Note first that P depends on p_0 and p_1. The Birkhoff Ergodic Theorem (it will be formally stated later) claims that there exists a set A such that P(A) = 1, and such that for all z ∈ A, where z = (z_0, z_1, z_2, ..., z_n, ...), we have that

p_0 = lim_{n→∞} (1/n) (cardinal of heads among z_0, z_1, ..., z_{n-1})

and

p_1 = lim_{n→∞} (1/n) (cardinal of tails among z_0, z_1, ..., z_{n-1}).

The above result claims that the mean value of heads that appears in tossing
the coin n times converges to p 0 . Before we state the Birkhoff Ergodic Theorem in
precise mathematical terms we need to introduce the concepts of shift and invariant
measure.
The shift map σ from Ω to Ω is the map such that for

z = (z_0, z_1, z_2, ..., z_n, ...)

we have

σ(z) = (z_1, z_2, z_3, ..., z_n, ...).

Therefore we can express the number of tails we have tossing the coin n times (as expressed by z ∈ Ω) by

Σ_{j=0}^{n-1} I_A(σ^j(z)),

where I_A is the indicator of A = (1), that is, I_A(z) = 1 for z ∈ (1) and I_A(z) = 0 for z ∉ (1); in other terms, I_A(z) = 1 if z_0 = 1 and I_A(z) = 0 if z_0 = 0. In the same way,

Σ_{j=0}^{n-1} I_B(σ^j(z))

is the number of heads we have for the event z of tossing the coin n times; here I_B(z) is the indicator of the set B = (0), that is, I_B(z) = 0 if z ∉ (0) and I_B(z) = 1 if z ∈ (0).
In this way we can see that the shift helps us to formulate the number of
heads and tails in a simple expression.
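The expression above can be sketched in code (a finite-horizon sketch; we assume only the first n coordinates are ever inspected, so a long finite list stands in for a point of the Bernoulli space, and all names are ours):

```python
# Finite-horizon sketch of the shift map and the indicator sum.

def shift(z):
    """The shift map sigma: drop the first coordinate."""
    return z[1:]

def indicator_cylinder_1(z):
    """I_{(1)}(z): 1 if z_0 = 1 (tail), else 0."""
    return 1 if z[0] == 1 else 0

def count_tails(z, n):
    """Number of tails in the first n tosses: sum_{j<n} I_{(1)}(sigma^j(z))."""
    total, current = 0, list(z)
    for _ in range(n):
        total += indicator_cylinder_1(current)
        current = shift(current)
    return total

z = [0, 1, 0, 1, 1, 0, 1, 0]
print(count_tails(z, 8))  # 4 tails among the first 8 coordinates
```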

Definition 2.1. The set {z, σ(z), σ^2(z), ..., σ^n(z), ...} is called the orbit of z under the shift map σ. The element σ^n(z) is called the n-th iterate of z.

We will call the Borel σ-algebra of Ω the σ-algebra generated by the cylinders. The Borel σ-algebra of ℝ is the σ-algebra generated by the finite intervals (see [16]). We say that f from X to ℝ is measurable if for each set A in the Borel σ-algebra of ℝ, f^{-1}(A) is in the Borel σ-algebra of X.
Given a certain measurable map φ : Ω → ℝ, the mean value of φ on z up to the n-th iterate is

(1/n) Σ_{j=0}^{n-1} φ(σ^j(z)).

In this way, for φ = I_{(0)}, the mean value of I_{(0)} on z up to the n-th iterate is the mean value of times we obtain a head, tossing the coin n times. In the case of the fair coin, that is, p_0 = 0.5 = p_1, and φ = I_{(0)}, one should expect that the mean number of heads should converge to 0.5 when n goes to infinity.
We will be interested in obtaining the limit of these mean values as n goes to infinity, that is,

lim_{n→∞} (1/n) Σ_{j=0}^{n-1} φ(σ^j(z))

for P-almost all points z.


First we need to introduce the concept of invariant measure.

Definition 2.2. Given (X, A, μ), where X is a set, A is a σ-algebra on X and μ is a measure on this σ-algebra, we consider T a measurable map from X to X (that is, T^{-1}(A) ∈ A for all A ∈ A), and say that μ is invariant for T if for all measurable sets A ∈ A, μ(T^{-1}(A)) = μ(A).

Invariant measures appear very naturally in several areas of Mathematics as


for instance, in Hamiltonian Mechanics, Geometry and Number Theory.
We now show that the probability P (depending on p_0 and p_1) introduced before is invariant for the shift.

Proposition 2.1. The probability P is always invariant for the shift map σ : Ω → Ω.

Proof. It is enough to show that P(σ^{-1}(A)) = P(A) for the sets A that are generators (the cylinders) of the σ-algebra.
Consider A = (a_0, a_1, ..., a_n) a cylinder; then σ^{-1}(A) is the disjoint union (0, a_0, a_1, ..., a_n) ∪ (1, a_0, a_1, ..., a_n), and therefore

P(σ^{-1}(A)) = p_0 p_{a_0} ··· p_{a_n} + p_1 p_{a_0} ··· p_{a_n} = (p_0 + p_1) P(A) = P(A). ∎

Notation. We introduce the following notation: M(T) is the set of all invariant probabilities μ for the measurable map T : X → X.
Therefore M(σ) denotes the set of all invariant probabilities for σ. For each p_0, p_1 such that p_0 + p_1 = 1, p_0, p_1 ≥ 0, we have that the corresponding P = P(p_0, p_1) belongs to M(σ), as was shown in the proposition above. There exist of course other probabilities μ ∈ M(σ) that are not of the form P.
The set of probabilities M(T) is a convex simplex in the set of all measures on the σ-algebra A of the set X. It is well known in Convex Analysis that the points in the corners of a convex set play a very important role.

Definition 2.3. A point x in a convex set C is called extremal if x cannot be expressed as x = λy + (1−λ)z, where y and z are in C, x different from y and z, and 0 < λ < 1.

It is possible to show that the probability measures that are extremal points of the set of invariant probabilities C = M(T) are the ergodic probabilities.
We define ergodic measures however by a different property.

Definition 2.4. We say that μ ∈ M(T) is ergodic if for all A ∈ A such that T^{-1}(A) = A, either μ(A) = 0 or μ(A) = 1.

The above definition means that for an ergodic measure the action of the measurable map T on any non trivial set A ∈ A (a trivial set being equal to ∅ or X up to a set of μ-measure zero) is so random that it cannot leave the set A invariant; in other words, the set A has to spread around the set X under iteration of T.
Note that the empty set ∅ and the total set Ω are always invariant, but they have respectively measure 0 and 1.

Remark. It can be shown that the shift with the invariant probability P defined
above is ergodic [18].

In Ergodic Theory, most of the proofs of general results follow the recipe: first prove the result for ergodic measures and then use the ergodic decomposition theorem [13] to extend the result to other kinds of measures.

Notation. Given a probability μ on the set X, we will say that a property happens μ-almost everywhere if there exists a subset A contained in X such that μ(A) = 1 and the property is true for all z in the set A.

Notation. We will denote by L^1(μ) the set of measurable functions f from X to ℝ such that ∫ f(z) dμ(z) exists and is finite.

Now we can state Birkhoff's Ergodic Theorem.

Theorem 2.1. (Birkhoff) - Let (X, A, μ) be a probability space and T : X → X a measurable transformation that preserves μ, that is, μ ∈ M(T), and suppose that μ is ergodic. Then for any f ∈ L^1(μ),

lim_{n→∞} (1/n) Σ_{j=0}^{n-1} f(T^j(z)) = ∫ f(x) dμ(x)     (1)

for z ∈ X, μ-almost everywhere.

The above result essentially claims that for ergodic measures, the spatial mean (the right hand side of (1)) is equal to the temporal mean (the left hand side of (1)) for almost every point z. Therefore, in this case, in order to compute an integral, one has to estimate the value of a series. In several practical situations this property brings a simplification to the problem of estimating an integral.
When we consider T = σ, μ = P and X = Ω in the Bernoulli shift example we mentioned before, then considering f(x) = I_{(0)}(x), we get

lim_{n→∞} (1/n) Σ_{j=0}^{n-1} I_{(0)}(σ^j(z)) = ∫ I_{(0)}(x) dP(x) = P((0)) = p_0

(for P-almost every z), which we mentioned before in our reasoning. This theorem therefore is a very general result that can, as a particular case, assure the validity of the Strong Law of Large Numbers.
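This convergence of temporal means can be watched numerically. The sketch below is our own construction (names and sampling scheme are assumptions, not the text's): drawing the coordinates of z i.i.d. samples a P-typical point, and since applying σ^j only moves the observation window, the Birkhoff average of I_{(0)} reduces to a frequency count.

```python
import random

# Numerical sketch: for a P(p_0, p_1)-typical z, the Birkhoff average
# (1/n) sum_{j<n} I_{(0)}(sigma^j(z)) should approach integral I_{(0)} dP = p_0.
def birkhoff_average_I0(p0, n, seed=1):
    rng = random.Random(seed)
    # Drawing z_0, ..., z_{n-1} i.i.d. samples a P-typical point.
    z = [0 if rng.random() < p0 else 1 for _ in range(n)]
    return sum(1 for zj in z if zj == 0) / n

for n in (100, 10000, 1000000):
    print(n, birkhoff_average_I0(0.7, n))  # tends to p_0 = 0.7 as n grows
```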
In the case p_0 = 0.5 = p_1, the fair coin, the event of obtaining tail every time from 0 to infinity (that is, (1,1,1,1,1,...)) is rare (it has P-measure zero). For a set A of measure one the events (z_0, z_1, ..., z_n, ...) ∈ A are such that head and tail appear with the same frequency.
The questions that people in Probability and Ergodic Theory are concerned with are not of deterministic nature. The statements that are relevant and pertinent are the ones about events that happen with probability one; in other words, the statements about sets A such that μ(A) = 1. Sets of measure zero are considered negligible.
The Birkhoff Ergodic Theorem is one of the most celebrated theorems of
Mathematics and was inspired by Statistical Mechanics, more specifically by the
billiard ball model, which is a model for a particle reflecting on the walls of a closed
compartment [13].
We now state a more general version of Birkhoff's Ergodic Theorem, without
the assumption that the measure is ergodic.

Theorem 2.2. (Birkhoff) - Let (X, A, μ) be a probability space and μ ∈ M(T), where T is measurable, T : X → X. Then for any f ∈ L^1(μ) the limit

lim_{n→∞} (1/n) Σ_{j=0}^{n-1} f(T^j(z))

exists for z ∈ X, μ-almost everywhere. If the limit is denoted by

lim_{n→∞} (1/n) Σ_{j=0}^{n-1} f(T^j(x)) = f̄(x),

then it is also true that

∫ f̄(x) dμ(x) = ∫ f(x) dμ(x).

Note that the difference between the last result and the previous one is that in the case the measure is ergodic, f̄ is constant μ-almost everywhere.
The Bernoulli space Ω can be equipped with a distance d_θ : Ω × Ω → ℝ in the following way: for a fixed value θ with 0 < θ < 1, we define the metric d_θ(x, y) = θ^N (where N is the largest natural number such that x_i = y_i for all i < N) if x is different from y. When x is equal to y then we define the distance to be zero. If we define the open sets of Ω in the usual way (product topology), we have that the σ-algebra generated by the cylinders is the Borel σ-algebra, since the cylinders form a basis for the topology of Ω.
As an example consider θ = 0.3, z = (1,1,0,1,0,0,1,...) and ε = 0.0081 = 0.3^4; then it is easy to see that B(z, ε) (the open ball of center z and radius ε) is equal to the cylinder (1,1,0,1).
Note that the indicator function I_A is continuous if A is a cylinder.
In the rest of this text we will consider a certain fixed value θ and denote by d the metric associated with it.
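The metric d_θ is easy to compute on finite truncations. In the sketch below (our names; finite lists stand in for infinite sequences) we also check the expansion property discussed next, namely that the shift multiplies distances by θ^{-1}:

```python
# Sketch of the metric d_theta on sequence space.
def d_theta(x, y, theta=0.3):
    """theta^N, with N the first index where x and y differ; 0 if they agree."""
    for n, (xi, yi) in enumerate(zip(x, y)):
        if xi != yi:
            return theta ** n
    return 0.0

z = [1, 1, 0, 1, 0, 0, 1]
w = [1, 1, 0, 1, 1, 1, 0]   # agrees with z exactly on the first four symbols
print(d_theta(z, w))         # theta**4 = 0.0081: w lies in the cylinder (1,1,0,1)
print(d_theta(z[1:], w[1:])) # theta**3: the shift expands distances by theta**-1
```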

Definition 2.5. A map T from a metric space (X, d) into itself is expanding if there exists λ > 1 such that for any x there exists ε > 0 such that ∀y ∈ B(x, ε), d(T(x), T(y)) ≥ λ d(x, y).

Note that if d_θ(x, y) = α, x, y ∈ Ω, then d_θ(σ(x), σ(y)) = α θ^{-1} = θ^{-1} d_θ(x, y). Therefore the Bernoulli shift σ is expanding with the value λ = θ^{-1} in the notation of the above definition.
It is also necessary to introduce the two-sided Bernoulli shift on the set Ω = {0,1}^ℤ of elements of the form

z = (..., z_{-2}, z_{-1}, z_0, z_1, z_2, ...).

The shift σ : Ω → Ω is defined in the same way,

σ(z) = (z_{i+1}),

when z = (z_i). For example, for z = (z_i) where z_i = 1 for i even and z_i = 0 for i odd, σ(z) = (z_{i+1}) = (y_i) where y_i = 1 for i odd and y_i = 0 for i even. Note that σ^2(z) = z in this case.

Definition 2.6. For a general map T : X → X, the orbit of x is the set {x, T(x), T^2(x), ..., T^n(x), ...}. We say x is periodic of period n if n ≥ 1 is the smallest possible natural number such that T^n(x) = x.

Therefore in the example given above z is a periodic point of period 2. The orbit of z in this case is {z, T(z)}. Note that the shift on the one-sided Bernoulli space is not one-to-one, but the shift on the two-sided Bernoulli space is.
Consider a finite set (an alphabet) of k symbols {0, 1, ..., k−1} and a probability μ_0 on this finite set, that is,

μ_0(i) = p_i

and

Σ_{i=0}^{k-1} p_i = 1.

Consider also the set of sequences of these symbols, that is, the set of sequences z = (z_0, z_1, z_2, ..., z_n, ...) where z_i ∈ {0, 1, ..., k−1}. We will again denote by Ω the set of all these sequences. Sometimes we denote by z : ℕ → {0, 1, ..., k−1} an element of Ω and z(n) by z_n. The shift on Ω is defined in the same way as before: σ : Ω → Ω is such that for z = (z_0, z_1, z_2, ..., z_n, ...) ∈ Ω, σ(z) = (z_1, z_2, ..., z_n, ...) ∈ Ω.
Definition 2.7. Given finite subsets A_0, A_1, ..., A_m of {0, 1, ..., k−1} and j ∈ ℕ, we define the cylinder C(j, A_0, ..., A_m) by

C(j, A_0, ..., A_m) = {x ∈ Ω | x(j+i) ∈ A_i, 0 ≤ i ≤ m}.


Disjoint unions of cylinders form an algebra that generates a σ-algebra A on Ω. Moreover, given the probability μ_0 on {0, 1, ..., k−1}, there exists a unique probability P on the σ-algebra A (the product measure associated to μ_0) such that for every cylinder:

P(C(j, A_0, ..., A_m)) = Π_{i=0}^{m} μ_0(A_i).

The above definition is the precise definition of a general cylinder we promised before.
We define in Ω a metric in the same way as always: for a fixed θ, 0 < θ < 1, we define d_θ(x, y) = θ^N, where N is the largest N such that x_i = y_i for all 0 ≤ i ≤ N, for x different from y, and zero otherwise. It is easy to see that d_θ has all the properties of a metric.
These definitions, of course, extend the previous ones defined for the shift in two symbols. The system defined above is also called the one-sided Bernoulli shift on

B(p_0, p_1, ..., p_{k-1})

with probability P(p_0, p_1, ..., p_{k-1}) on Ω.


The two-sided shift is the set of all functions z : Z -+ {0, 1, ... , k- 1} and
in the same way as before a( x )( i) = x( i + 1) is by definition the shift map on
this space. The cylinders are defined in a similar way: given subsets Ao, ... , An of
{0, 1, ... , k - 1} and j E Z (remember that j E N in the one-sided shift case)

C(j, Ao, ... , Am) = { z En I z(j + i) E A;, 0 :s i :s m}.


In the same way as before we consider the a-algebra generated by the cylinders.
Moreover, given a probability flo on {0, 1, ... , k- 1} such that flo(i) = p;, i E
{O, ... ,k -1}, I:~,:~ p; =,1, then we define P(C(j,Ao, ... ,Am)) = II~oflo(A;). For
0 < θ < 1 fixed, the metric we will consider on Ω is d_θ(x, y) = θ^N, where N is the largest N such that x_i = y_i for all i such that |i| ≤ N, if x is different from y, and zero otherwise.
We will call such a system the two-sided Bernoulli shift on

B(p_0, p_1, ..., p_{k-1})

with probability P(p_0, p_1, ..., p_{k-1}) on Ω.


The main difference between the one-sided shift and the two-sided shift is that the latter is one-to-one. With the one-sided shift, any z ∈ Ω = B(p_0, p_1, ..., p_{k-1}) has k preimages, that is, if z = (z_0, z_1, ..., z_n, ...), then

x_0 = (0, z_0, z_1, ..., z_n, ...),
x_1 = (1, z_0, z_1, ..., z_n, ...),
...,
x_{k-1} = (k−1, z_0, z_1, ..., z_n, ...)

are such that σ(x_i) = z, i ∈ {0, ..., k−1}, that is, σ^{-1}(z) = {x_0, ..., x_{k-1}}.
More generally, for z = (z_0, z_1, ...), the set of solutions x of σ^n(x) = z is the set of points x of the form

x = (x_0, x_1, ..., x_{n-1}, z_0, z_1, ..., z_n, ...),

where x_0, x_1, ..., x_{n-1} ∈ {0, 1, ..., k−1} are arbitrary. Therefore the cardinality of the set of such solutions x is k^n.
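The count k^n can be checked by enumeration (a finite-truncation sketch; the function name is ours, and z is given only by a finite prefix):

```python
from itertools import product

# Sketch: the solutions x of sigma^n(x) = z are obtained by prepending every
# word of length n over the alphabet {0, ..., k-1}, so there are k^n of them.
def preimages(z, n, k):
    """All x with sigma^n(x) = z, for z given as a finite tuple."""
    return [word + tuple(z) for word in product(range(k), repeat=n)]

z = (0, 1, 1, 0)
pre = preimages(z, 3, k=2)
print(len(pre))   # 2**3 = 8
print(pre[0])     # (0, 0, 0, 0, 1, 1, 0)
```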

Notation. We call the set of such points the pre-images of z by σ.

Periodic orbits for σ are also easy to find. The set of all periodic orbits of period n is obtained in the following way: take z_0, z_1, ..., z_{n-1} in all possible ways such that z_i ∈ {0, 1, ..., k−1}, i ∈ {0, 1, ..., n−1}. For each one of these z_0, z_1, ..., z_{n-1} repeat the block infinitely many times, in order to obtain the set of all x such that σ^n(x) = x, where

x = (z_0, z_1, ..., z_{n-1}, z_0, z_1, ..., z_{n-1}, z_0, z_1, ..., z_{n-1}, ...).

Remark. Note that the cardinality of the set of solutions z of σ^n(z) = z and the cardinality of the set of solutions x of σ^n(x) = z is the same and equal to k^n. In fact, the procedure for finding the set of solutions is quite similar in both cases.

Proposition 2.2. The set of all periodic points for the shift is dense in Ω with the d_θ metric.

Proof. Given z = (z_i)_{i∈ℕ}, z_i ∈ {0, ..., k−1}, and ε > 0, take N such that θ^N < ε. Now define x as the successive repetition of the string (z_0, z_1, ..., z_N), that is,

x = (z_0, z_1, z_2, ..., z_N, z_0, z_1, z_2, ..., z_N, z_0, z_1, z_2, ..., z_N, ...).

Then d_θ(z, x) ≤ θ^N < ε and σ^{N+1}(x) = x, that is, x is a periodic point of period at most (N+1) and ε-close to z. This proves the proposition. ∎
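The construction in the proof can be sketched on finite truncations (our names and conventions; a long finite list stands in for an infinite sequence):

```python
# Sketch of the density argument: repeat the first N+1 symbols of z to get a
# periodic point that agrees with z on those symbols, hence lies within
# theta^N of z.
def periodic_approximation(z, N, length):
    block = list(z[:N + 1])
    x = (block * (length // len(block) + 1))[:length]
    return x

z = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
x = periodic_approximation(z, 3, len(z))
print(x[:4] == z[:4])   # True: x agrees with z on the first N+1 = 4 symbols
print(x[4:8] == x[:4])  # True: x repeats its first block, so its period is at most N+1
```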

Remark. A similar result for the preimages of a certain point z can be obtained (the proof is basically the same), that is: any y ∈ X can be approximated by preimages of z.

Note that the temporal mean f̄(z) of f (in Birkhoff's Theorem) at a point z belonging to a periodic orbit is the mean value of f on the orbit of z. Therefore, in most cases (but not all cases, as we can see below), the periodic orbits have to be excluded from the set A of measure one mentioned in Birkhoff's Theorem.
In an extensive number of cases in Dynamical Systems the periodic orbits are dense in the region where the dynamics is concentrated [6]. Periodic orbits are extremely important for understanding the dynamics and the ergodic properties of a measure μ even if they have μ-measure zero.
There exist invariant probabilities that are finite sums of Dirac measures in M(T), but they have to be concentrated on periodic orbits because of the invariance.
For example, the measure μ such that:

μ((001001001001...)) = 1/3

μ((010010010010...)) = 1/3

μ((100100100100...)) = 1/3

is invariant and has support on a periodic orbit of period 3.


The space X we consider in this text will always be a compact metric space with metric d. We also denote by C(X) the set of continuous functions on X taking values in ℝ. We will consider on C(X) the supremum norm, that is, ‖f‖ = sup {|f(x)| : x ∈ X}.

Notation. We will denote by M(X) the set of all probabilities on the Borel σ-algebra of X.

Notation. A law η such that for each set A in the Borel σ-algebra of X, η(A) is a real number (not necessarily positive) or is equal to ∞, and such that:

a) η(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} η(A_i)

when the A_i are disjoint (that is, A_i ∩ A_j = ∅ for i ≠ j),

b) η(∅) = 0

c) η(A − B) = η(A) − η(B)

when B ⊂ A, is called a signed measure. We denote by S(X) the set of all signed measures on the Borel σ-algebra of X.

Example. For the set X = ℝ, given a continuous function Φ(x) (not necessarily positive and not necessarily integrable), the law η(A) = ∫_A Φ(x) dx is a signed measure on X.
There exist signed measures on X = ℝ that are not of the above form.

Given a certain normed space V, the dual of V, denoted by V*, is the set of all continuous linear functionals on V, that is, the set of all functionals L : V → ℝ that are linear and continuous. The following theorem claims that the dual of the space C(X) is the space S(X) [16].

Theorem 2.3. (Riesz) - Let L : C(X) → ℝ be a continuous linear functional. Then there exists a unique ν ∈ S(X) such that L(f) = ∫ f dν for any f ∈ C(X).

Corollary 2.1. If L is positive (that is, for any f ∈ C(X), L(f) ≥ 0 if f ≥ 0) and if L(1) = 1, then there exists a unique probability μ ∈ M(X) such that L(f) = ∫ f dμ for any f ∈ C(X).

Definition 2.8. Given T : X → Y measurable and ν ∈ M(X), we define T*(ν) = w as the unique measure w ∈ M(Y) such that ∫ (f ∘ T)(x) dν(x) = ∫ f(y) dw(y) for any f ∈ C(Y).

The measure w always exists and is well defined by Riesz's Theorem applied to L(f) = ∫ (f ∘ T)(x) dν(x). The measure w is usually called the pull back of the measure ν by the map T.
It is easily obtained from well known properties about approximation of continuous functions by step functions (finite sums of indicators with different weights), and vice-versa [16], that 1) and 2) below are equivalent:
1) for any Borel set A,

ν(T^{-1}(A)) = ∫ (I_A ∘ T)(x) dν(x) = ∫ I_A(y) dw(y) = w(A).

2) for any f ∈ C(Y),

∫ (f ∘ T)(x) dν(x) = ∫ f(y) dw(y).

A particularly important case is when X = Y and T : X → X. In this case w = T*(ν) is also a measure on X.
From the above considerations we can state:

Proposition 2.3. - μ ∈ M(T) if and only if T*(μ) = μ.

One would like to say that a sequence of measures μ_n converges to μ if and only if, for any Borel set A, the sequence μ_n(A) converges to μ(A). This is almost true. One has to suppose that the boundary of the set A has μ-measure zero, and then the claim is true [16]. The more useful definition of convergence is in terms of the action of the measures on the continuous functions:

Definition 2.9. We say that a sequence μ_n ∈ M(X) converges weakly to a probability μ if for any continuous function f : X → ℝ we have that

lim_{n→∞} ∫ f(x) dμ_n(x) = ∫ f(x) dμ(x).

If X is a compact metric space, the space M(X) of all probabilities is weakly sequentially compact, that is, any sequence μ_n ∈ M(X) has a subsequence converging to an element μ ∈ M(X) [16]. The set M(T) is also weakly sequentially compact.

Definition 2.10. The Dirac Delta measure at the point z is by definition the probability measure that associates measure one to each Borel set that contains z and measure zero otherwise. We will denote such a probability by δ_z.

It is well known that for a continuous function f and z ∈ X, the value ∫ f(x) dδ_z(x) = f(z).
Given the above definitions, the Ergodic Theorem of Birkhoff can be stated
in the following way:

Theorem 2.4. Let (X, A, μ) be a probability space, T : X → X a measurable transformation that preserves μ, and suppose μ is ergodic. Then

μ = lim_{n→∞} (1/n) Σ_{j=0}^{n-1} δ_{T^j(z)}     (2)

for μ-almost every z.

Definition 2.11. The right hand side of the above equality is called the empirical measure [7].

Definition 2.12. The support of a measure μ defined on X is the set of points x ∈ X such that for any ε greater than zero the measure μ(B(x, ε)) of the ball of center x and radius ε is strictly positive.

Given a measure μ on X, in terms of Birkhoff's Theorem there is no important information outside the support of the measure.
The above result shows that the sets of full measure given by Theorem 2.4 for two different ergodic measures have to be disjoint.

3. Entropy

Let X be a compact metric space with a metric d : X × X → ℝ and T : X → X a transformation preserving the measure μ ∈ M(T) defined on the Borel σ-algebra of X.
The dynamic ball B(z, n, ε), for z ∈ X, n ∈ ℕ and ε > 0, is by definition the set B(z, n, ε) = {y ∈ X | d(T^j(z), T^j(y)) < ε for all 0 ≤ j ≤ n−1}. One could think that one has a microscope that is able to detect that two points x, y ∈ X are distinct if they are ε apart, that is, d(x, y) > ε. Therefore B(z, n, ε) is the set of points we are not able to distinguish from z performing n iterations. The value μ(B(z, n, ε)) gives the amount of indeterminacy after we perform n−1 iterations of the map T on the point z.
For z and ε fixed and increasing n, the sets B(z, n, ε) decrease, that is, for m ≥ n, B(z, n, ε) ⊃ B(z, m, ε). When n goes to infinity, B(z, n, ε) converges to the set {z} in the nice cases. In this case, if also μ({z}) = 0, then μ(B(z, n, ε)) will converge to zero when n goes to infinity. One would like in this case to express the exponential velocity of decrease in the form μ(B(z, n, ε)) ≈ λ^n for a certain value λ with 0 < λ < 1, when ε is very small. Writing λ as e^{−h(μ)}, h(μ) will be what we call later the entropy of μ. The entropy of a measure will therefore determine the exponential velocity of decrease of the indeterminacy of the system after iterations of the map T.
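For the Bernoulli shift this decay rate is easy to probe numerically, since for small ε the dynamic ball is a cylinder and its measure is a product of symbol probabilities. The sketch below is our own construction (names and sampling scheme are assumptions): along a P-typical z, the rate −(1/n) log P(B(z, n, ε)) should approach −Σ p_i log p_i, the entropy defined below.

```python
import math
import random

# Numerical sketch: for the Bernoulli shift, P(B(z, n, eps)) is a cylinder
# probability, so the decay rate -(1/n) log P(B(z, n, eps)) along a typical z
# approaches h = -sum_i p_i log p_i.
def dynamic_ball_decay_rate(p, n, seed=2):
    rng = random.Random(seed)
    z = rng.choices(range(len(p)), weights=p, k=n)   # a P-typical prefix of z
    log_prob = sum(math.log(p[s]) for s in z)        # log P(cylinder z_0 ... z_{n-1})
    return -log_prob / n

p = [0.5, 0.5]
print(dynamic_ball_decay_rate(p, 100000))  # log 2 (every symbol contributes log 2)
print(dynamic_ball_decay_rate([0.3, 0.7], 100000))  # near -0.3 log 0.3 - 0.7 log 0.7
```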

Theorem 3.1. (Brin-Katok) [4] - Suppose μ is ergodic for the transformation T on (X, A, μ) and consider d a metric on the compact set X. Then the two limits

lim_{ε→0} (− limsup_{n→∞} (1/n) log μ(B(z, n, ε))) = lim_{ε→0} (− liminf_{n→∞} (1/n) log μ(B(z, n, ε)))     (3)

exist and do not depend on z for μ-almost every point z in X.

Remark. By definition, given a sequence (a_n), n ∈ ℕ, of real numbers we call limsup_{n→∞} a_n the supremum of the set of limits of convergent subsequences of the sequence (a_n). The definition of liminf is analogous. The reason for introducing this definition is that not all sequences (a_n) converge (therefore lim_{n→∞} a_n has no meaning), but for bounded sequences the limsup and liminf always exist. A sequence (a_n) converges if and only if the limsup and the liminf are equal (and of course, they are equal to the limit). In the above theorem, the sequence a_n = (1/n) log μ(B(z, n, ε)) will not always converge. The limsup and liminf will exist in any case.

Definition 3.1. For an invariant ergodic measure μ ∈ M(T) we define the entropy of μ as the value

h(μ) = − lim_{ε→0} (limsup_{n→∞} (1/n) log μ(B(z, n, ε))),

where z was chosen in a set of measure one satisfying the above Theorem.

Note that we could alternatively define the entropy by the liminf (see Theorem 3.1).
Later on we will define the entropy of a measure μ ∈ M(T) when μ is not ergodic.
Note that the larger the entropy of the measure, the faster the decrease of the indeterminacy of the system. Therefore, the larger the entropy, the more chaotic the system is.
Example. A trivial example where we can compute the entropy is the following: consider a periodic point x of period n, and the probability μ = Σ_{j=0}^{n-1} (1/n) δ_{T^j(x)}. It is easy to see that this measure μ is ergodic and that the entropy h(μ) = 0.
The above example is in fact not exactly random or chaotic but, in some sense, totally deterministic.

Proposition 3.1. The entropy of the probability μ = P(p_0, p_1), with p_0, p_1 > 0 and p_0 + p_1 = 1, invariant for the shift on Ω = B(p_0, p_1), is

−p_0 log p_0 − p_1 log p_1.     (4)

Proof. As we mentioned before, it can be shown that the probability P(p_0, p_1) under
the action of the shift is ergodic (see Remark after Definition 2.4).
Consider z ∈ Ω in a set A of P-measure one satisfying the Birkhoff Ergodic
Theorem. That is, for any f ∈ C(X),

   lim_{n→∞} (1/n) Σ_{j=0}^{n-1} f(σ^j(z)) = ∫ f(x) dP(x).

The intersection of A with the set of full measure of Definition 3.1 will
also have measure one. Without loss of generality we can suppose that z is in this
intersection.
Fix ε > 0. Remember that we consider on Ω the metric

   d_θ(x, y) = θ^N,

where N is the largest integer such that x_i = y_i for any 0 ≤ i ≤ N. Let n_0 be the
smallest integer such that θ^{n_0} < ε. Then, for n > n_0 we have

   B(z,n,ε) = { y ∈ Ω | d_θ(σ^j(z), σ^j(y)) < ε, 0 ≤ j ≤ n-1 } = (z_0, z_1, z_2, ..., z_{n+n_0-1})

and therefore

   lim sup_{n→∞} (1/n) log P(B(z,n,ε))

   = lim sup_{n→∞} (1/n) log p_0^{Σ_{j=0}^{n+n_0-1} I_{(0)}(σ^j(z))} p_1^{Σ_{j=0}^{n+n_0-1} I_{(1)}(σ^j(z))}

   = lim_{n→∞} (1/n) Σ_{j=0}^{n+n_0-1} I_{(0)}(σ^j(z)) log p_0 + lim_{n→∞} (1/n) Σ_{j=0}^{n+n_0-1} I_{(1)}(σ^j(z)) log p_1.

The limits in the last expression exist because z was chosen satisfying
Birkhoff's Ergodic Theorem, and therefore

   lim_{n→∞} (1/(n+n_0-1)) Σ_{j=0}^{n+n_0-1} I_{(0)}(σ^j(z)) = ∫ I_{(0)}(x) dP(x) = p_0

and

   lim_{n→∞} (1/(n+n_0-1)) Σ_{j=0}^{n+n_0-1} I_{(1)}(σ^j(z)) = ∫ I_{(1)}(x) dP(x) = p_1.

Therefore

   lim sup_{n→∞} (1/n) log P(B(z,n,ε))
   = p_0 log p_0 lim_{n→∞} (n+n_0-1)/n + p_1 log p_1 lim_{n→∞} (n+n_0-1)/n
   = p_0 log p_0 + p_1 log p_1.

Finally,

   -lim_{ε→0} ( lim sup_{n→∞} (1/n) log P(B(z,n,ε)) ) = -p_0 log p_0 - p_1 log p_1,

and therefore h(P) = -p_0 log p_0 - p_1 log p_1. ∎

The next result can be obtained using an argument similar to the one used in
the proof of the last proposition:

Theorem 3.2. For the probability P(p_1, p_2, ..., p_n), invariant for the shift σ on n
symbols, the entropy is:

   h(P) = -Σ_{i=1}^{n} p_i log p_i.
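Theorem 3.2 can be illustrated numerically (this sketch is ours, not part of the original text): for a typical sequence sampled from the Bernoulli measure, -(1/n) log P of the observed cylinder of length n converges to -Σ p_i log p_i, exactly as in the proof of Proposition 3.1. All names below are our own.

```python
import math
import random

def bernoulli_entropy(p):
    """Entropy -sum p_i log p_i of the Bernoulli shift B(p_1, ..., p_n)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def empirical_cylinder_rate(p, n, seed=0):
    """Sample z_0 ... z_{n-1} i.i.d. with law p; return -(1/n) log P(cylinder)."""
    rng = random.Random(seed)
    symbols = rng.choices(range(len(p)), weights=p, k=n)
    log_prob = sum(math.log(p[s]) for s in symbols)
    return -log_prob / n

p = [0.3, 0.7]
h = bernoulli_entropy(p)                 # -0.3 log 0.3 - 0.7 log 0.7
rate = empirical_cylinder_rate(p, n=200_000)
print(h, rate)                           # the two values should be close
```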

Note that from the definition of entropy, in principle the value h(μ) could
depend on the metric d we are using. In fact the entropy h(μ) does not depend on
the metric, but we will not prove this result in the text (see [13]).

Theorem 3.3. (Brin-Katok) [4] - Suppose μ ∈ M(T) is an invariant measure,
not necessarily ergodic on X, and consider d a metric on the compact set X. Then
for μ-a.e. point z ∈ X the two limits below exist:

   h(z) = lim_{ε→0} ( -lim sup_{n→∞} (1/n) log μ(B(z,n,ε)) )
        = lim_{ε→0} ( -lim inf_{n→∞} (1/n) log μ(B(z,n,ε)) ).

The function h(z) is integrable.

The difference between this result and the previous one for ergodic measures
is the function h(z). When the measure is not ergodic the "lim sup" can change
from point to point, even on a set of full measure. When μ is ergodic, h(z) is
constant μ-almost everywhere.

Definition 3.2. The entropy of μ ∈ M(T) is the integral ∫ h(z) dμ(z), where h(z)
is defined in the above theorem.

This definition generalizes the previous one for ergodic measures.


Note that the concept of entropy was defined only for invariant probabilities in
M(T), and not for general probabilities in M(X). The entropy of an invariant
measure is always a non-negative number.

4. Topological Pressure

The entropy of a system (T, X, μ) measures the randomness of the system. The
larger the entropy, the more chaotic the system is.
The concept of entropy appears in Physics and is associated with the principle
that Nature tends to maximize entropy. That is, if one considers particles of a gas
concentrated at a corner of a closed box at an initial time t_0, then after some time
the particles will tend to an equilibrium where they are spread in a totally
random way. This means that after some time the gas will have a uniform density
in the box. As the velocity of the particles is very large, this is in fact the state
that will be observed. Therefore the state that occurs in Nature will be the one
that is most random among all possible states.

A system of particles is much more random (has more entropy) if it is uniformly
spread in the box than if it is concentrated at a corner of the box. Therefore
equilibrium is attained at maximum entropy arrangements.
The definition of entropy by Shannon was introduced in relation to Information
Theory. If one wants to transmit a message through a channel using an alphabet
with n symbols {1, 2, ..., n}, each one with a certain probability p_1, ..., p_n,
Σ_{i=1}^{n} p_i = 1, of being used, then the entropy of this system is the entropy
of the Bernoulli shift B(p_1, p_2, ..., p_n). The entropy in this case is a very important
piece of information of practical use (see [2]).
Historically, the concept of entropy in Physics was defined in a different way
than the one introduced much later, in 1948, by Shannon.
Our motivation here is associated with the more recent approach of Bowen,
Ruelle and Sinai, who proposed to use Shannon's entropy as a mathematical
tool for understanding Statistical Mechanics in one-dimensional lattices.
Soon we will show that this program includes the study of the Topological Pressure
(see definition below) for the shift. In fact these mathematicians proposed to study
a more general problem that includes not only the shift but also a larger class of
maps. This theory is known nowadays as the Thermodynamic Formalism [17]. The
Ruelle-Perron-Frobenius Operator (see next chapter) was introduced for a certain
class of maps (the expanding maps, in the case one considers one-dimensional
dynamics) in order to handle the problem of finding the measure of maximal pressure
(see [17]). Several important results in different areas of Mathematics, such as
Geometry, Number Theory, Dimension of Fractals, etc., were obtained using results
related to the above mentioned operator [17], which is a natural generalization (to
the space of continuous functions) of a Perron-Frobenius matrix acting on R^n (see
[17] or the example after Theorem 7.5). In the context of Physics the Ruelle-Perron-
Frobenius Operator corresponds to the Transfer Operator of Statistical Mechanics
[17].
Now we will follow the beautiful and simple motivation of the subject presented
in Bowen [3].
Consider a physical system with possible states 1, 2, ..., m and let the energies
of these states be E_1, E_2, ..., E_m, respectively. Suppose that our system is put in
contact with a much larger "heat source", which is at temperature T. Energy is
thereby allowed to pass between the original system and the heat source, and the
temperature T of the source remains constant, as it is so much larger than our
system. As the energy of our system is not considered fixed, any array of the states
can occur. The physical problem we are considering is not deterministic, and we
can only speak of the probability that a certain state, let's say j, occurs. That is,
if one performs a sequence of observations, let's say 1000, it will be observed that
for a certain proportion of these observations the state j occurs. The relevant
question is to know, for each j, the value of this proportion (probability) when
the number of observations goes to infinity. It has been known from Statistical
Mechanics for a long time that the probability p_j that the state j occurs is given
by the Gibbs distribution:

   p_j = e^{-B E_j} / Σ_{i=1}^{m} e^{-B E_i},  j ∈ {1, 2, ..., m},

where B = 1/(kT) and k is a physical constant.


A mathematical formulation of the above consideration in a variational way
can be obtained as follows: consider
m m

i=l i=l

defined over the simplex in R m given by

{(pi,p2, ... ,pm)\p;2:0, iE{1,2, ... ,m} and ~p;=l}.


Using Lagrange multipliers, it is easy to show that the maximum ofF in the
simplex is obtained at
e-BE;
Pj = Lm -BE·, j E {1,2, ... ,m},
i=I e 1

in accordance with the Pj above.


The quantity

   H(p_1, p_2, ..., p_m) = Σ_{i=1}^{m} -p_i log p_i

is called the entropy of the distribution (p_1, p_2, ..., p_m). Let Σ_{i=1}^{m} p_i E_i denote
the average energy E(p_1, p_2, ..., p_m).
Then we can say that the Gibbs distribution maximizes

   H(p_1, ..., p_m) - B E(p_1, ..., p_m).

The expression BE - H is called, in this context, free energy (in fact, there
exist several different concepts in Mathematics and Physics also called free energy).
Therefore we can say that Nature minimizes free energy. When the temperature
T = ∞, that is, B = 0, Nature maximizes entropy. In this case the Gibbs
state is the most random probability, namely p_j = 1/m, j ∈ {1, 2, ..., m}. Again,
using an analogy with Classical Mechanics, E plays the role of potential energy and
H plays the role of kinetic energy.
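The Lagrange-multiplier claim can be checked numerically. The following sketch (ours, not from the text; the energies and the value of B are arbitrary choices) compares the Gibbs distribution against random points of the simplex:

```python
import math
import random

def gibbs(energies, B):
    """Gibbs distribution p_j = e^{-B E_j} / sum_i e^{-B E_i}."""
    weights = [math.exp(-B * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def objective(p, energies, B):
    """H(p) - B E(p): entropy minus B times average energy."""
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    avg_energy = sum(q * e for q, e in zip(p, energies))
    return entropy - B * avg_energy

energies = [0.0, 1.0, 2.5]
B = 1.3
p_star = gibbs(energies, B)
best = objective(p_star, energies, B)

# no random point of the simplex should beat the Gibbs distribution
rng = random.Random(1)
for _ in range(10_000):
    raw = [rng.random() for _ in energies]
    s = sum(raw)
    p = [r / s for r in raw]
    assert objective(p, energies, B) <= best + 1e-12
print(p_star, best)
```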
Now, let us return to Gibbs measures. Generalizing the above considerations,
Ruelle proposed the following model: consider the one-dimensional lattice Z. Here
one has, for each integer, a physical system with possible states 1, 2, ..., m. A
configuration of the system consists of assigning an x_i ∈ {1, 2, ..., m} to each i ∈ Z.
Thus a configuration is a point x = {x_i}_{i∈Z} ∈ {1, 2, ..., m}^Z = Ω.
Considering now on the space Ω the shift map

   σ(x)_i = x_{i+1}, i ∈ Z,

and M(σ) the space of probabilities ν such that for any Borel set A

   ν(A) = ν(σ^{-1}(A)),

one obtains the well-known Bernoulli shift model.


A continuous function φ : Ω → R, in this setting, contains the information of
energy and temperature.
The problem here is to find a way to obtain the Gibbs distribution on the
infinite one-dimensional lattice in a way similar to how it was obtained above for
the finite case.
For instance, for spin-lattices, one can consider a positive spin + and a negative
spin - at each site of the one-dimensional lattice Z and consider a certain
probability p on the arrangements. In this case we have to consider the Bernoulli
space on two symbols Ω = {+, -}^Z and probabilities p on Ω.
Note that it is natural to consider just probabilities p ∈ M(σ), because there
is no natural reason to consider a certain distinguished point of the lattice as the
origin 0 in Z.
Given a certain continuous function φ : Ω → R (φ will contain the information
of temperature, energy, magnetic field, etc.), consider the following variational
problem:

Definition 4.1. For a continuous function φ consider

   sup_{p ∈ M(σ)} { h(p) + ∫ φ(z) dp(z) },

where h(p) is the entropy of the probability p. We call this supremum the
Topological Pressure (a better name would be Free Energy, but we follow here the
terminology of Ruelle) associated with φ, and denote it by P(φ).

Remark. There exists an analogous definition of Pressure for invariant measures
for T instead of σ.

Example. A good example to have in mind is the following: consider
Ω = {+, -}^N (+ is positive spin and - negative spin) and φ constant on each
of the four cylinders (+,+), (+,-), (-,-) and (-,+). Consider q_0, q_1 > 0,
q_0 + q_1 = 1, and define φ in the following way:
a) φ(z) = q_0, ∀z ∈ (+,+)
b) φ(z) = q_1, ∀z ∈ (+,-)
c) φ(z) = 1, ∀z ∈ (-,+) and
d) φ(z) = 0, ∀z ∈ (-,-).
In this case, we assume that in the lattice Z there is a probability q_0 of
having a + at the right of a + and a probability q_1 of having a - at the right of a
+. We also assume, by c) and d), that at the right of a - there is always a spin
+. One would like to find a probability μ, defined on the whole space Ω, such that
the above mentioned property happens. This probability μ will be called later the
equilibrium state associated to the potential φ. The equilibrium state μ will be
defined by means of a variational formula (see Definition 4.2). In the case of the
present example, the solution can be obtained by means of the theory of Markov
Chains and the Perron-Frobenius operator (note that we have introduced a stochastic
matrix), and this will be explained in section 7 (see example after Theorem 7.5).
The solution for the case of a general φ (not constant on cylinders) will require
a more sophisticated version of the Perron-Frobenius theorem, to be presented
in section 7.
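As a preview of the Markov-chain solution deferred to section 7 (this sketch is ours; the value of q_0 is an arbitrary choice), the transition probabilities a)-d) define a stochastic matrix on the two states + and -, whose stationary distribution can be found by power iteration:

```python
# Transition matrix on the states (+, -) implied by a)-d):
# from + go to + with prob q0, to - with prob q1; from - go to + with prob 1.
q0, q1 = 0.6, 0.4

P = [[q0, q1],
     [1.0, 0.0]]

# power iteration for the stationary distribution pi = pi P
pi = [0.5, 0.5]
for _ in range(200):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

# closed form: pi_+ = 1 / (1 + q1), pi_- = q1 / (1 + q1)
print(pi)  # ≈ [0.714285..., 0.285714...]
```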

Most of the time we will use the word pressure instead of topological pressure.
It is natural to ask which properties a probability p attaining this supremum has.

Definition 4.2. We will call a probability μ that attains the above supremum
(in the case there exists such a μ) the Gibbs state (or equilibrium probability for
φ) for the one-dimensional lattice with potential function φ. In other words:

   h(μ) + ∫ φ(z) dμ(z) = P(φ),

or

   h(μ) + ∫ φ(z) dμ(z) ≥ h(ν) + ∫ φ(z) dν(z)

for any ν ∈ M(T).

Notation. Sometimes we will denote this probability μ by μ_φ in order to express
the dependence of μ on φ.

For expanding systems the probability that attains the above supremum is
unique, and therefore equilibrium states do exist (see paragraph 7). Non-uniqueness
of the probability that attains the supremum is related to Phase Transitions of
spin-lattices [9], [10], [12]. D. Ruelle [17] was able to obtain a certain function φ
that represents interactions of a certain special kind, and such that the probability
that attains the above supremum P(φ) is exactly the "Gibbs state" in the lattice
Z that, by other procedures, was already known in Physics a long time ago.
Therefore the terminology of Gibbs state that we introduced above is quite proper.

The analogy of the above setting in the lattice Z with the finite case we
mentioned before is transparent.
If we assume a wall effect, then we have to consider the lattice N, that is, the
one-sided shift.
The setting we presented above is suitable for analyzing problems in Statistical
Mechanics of the one-dimensional lattice Z. For the two-dimensional case Z^2 (or
for the three-dimensional case Z^3), one should consider actions of Z^2 (or Z^3), and
the situation is much more complicated (see [17] for references).
Entropy is defined for measures and Pressure for continuous functions. The set
of measures and the set of continuous functions are dual to each other. In fact
these two concepts are related to each other by means of a Legendre Transform
[8]. Some of these properties will be considered in the last part (see section 7) of
these notes.
We refer the reader to [7], [8], [5], [11] for a complete description of the above
results.
When do two different φ and ψ determine the same equilibrium state μ? That
is, when is μ_φ = μ_ψ? This is an important question that will be analyzed more
carefully later. The following proposition is an easy consequence of the properties
of the probabilities ν ∈ M(T).

Proposition 4.1. Criterium of Homology - Suppose φ and ψ are two continuous
functions such that there exist a continuous function g and a constant k satisfying
φ - ψ = g∘T - g + k; then μ_φ = μ_ψ.

Proof. For any ν ∈ M(T), ∫ (g(T(z)) - g(z)) dν(z) = 0 by invariance of ν; therefore
h(ν) + ∫ φ(z) dν(z) = h(ν) + ∫ ψ(z) dν(z) + k for any ν ∈ M(T). Therefore P(φ) =
P(ψ) + k and μ_φ = μ_ψ.
Note that if k = 0, then P(φ) = P(ψ). ∎

Definition 4.3. In the case φ = 0, we have

   P(φ) = sup_{p ∈ M(T)} h(p),

and this value P(0) is called the topological entropy of T. We will denote this
value by h(T).

We refer the reader to [3], [15], [17], [18] for results about Pressure and
Thermodynamic Formalism.
In the case T = σ it can be shown that h(σ) = log d (see Definition 4.3) if
(σ, Ω) is the shift on d symbols.
More generally, if an expanding map T has the property that for any
a ∈ X, #{T^{-1}(a)} = d, then h(T) = log d.
From Theorem 3.2 the entropy of the shift σ on d symbols, under the probability
P(1/d, 1/d, ..., 1/d), is equal to log d. Therefore, in this case we can identify very
easily the equilibrium state for φ = 0: it is the probability p_0 = P(1/d, 1/d, ..., 1/d).
This measure will later be called the maximal entropy measure.
In paragraph 7 we will consider very precise results on the existence of
equilibrium states for expanding maps.

5. Large Deviation

In this paragraph and in the next one, we will consider T a continuous map from
a compact metric space (X, d) into itself, μ an ergodic invariant measure on (X, A)
and f a continuous function from X to R^m. Some of the proofs will be done for
m = 1 in order to simplify the notation.
The Ergodic Theorem of Birkhoff claims that for an ergodic measure μ ∈
M(T) and a continuous function f from X to R^m, for μ-almost every point z ∈ X,

   lim_{n→∞} (1/n) Σ_{j=0}^{n-1} f(T^j(z)) = ∫ f(x) dμ(x).

The typical example of application of the Ergodic Theorem, as we said before,
is the situation where we toss a fair coin 1000 times. One can observe that among
these 1000 tossings, heads appear more or less 500 times, and the same happens
for tails. The event of obtaining heads all 1000 times is possible, and has P-
probability (0.5)^1000. This number is very small, but it is not zero. This event is
a deviation from the general behaviour of the typical trajectory. It is very relevant,
in several problems in Probability, in Mathematics and in Physics, to understand
what happens with the trajectories that deviate from the mean. We will show later
mathematical examples of such phenomena.
For each time n the data (1/n) Σ_{j=0}^{n-1} I_0(σ^j(z)) are spread around the mean value
1/2, but when n goes to infinity, the data are more and more concentrated (in terms
of probability) around the mean value. The main question is: how to estimate
deviating behaviour? For the fair coin, the typical trajectory will produce, in the
limit as n goes to infinity, the temporal mean 1/2. Suppose we stipulate that a
mistake of ε = 0.01 is tolerable for the distance of the finite temporal mean to the
spatial mean, that is, for

   | (1/n) Σ_{j=0}^{n-1} I_0(σ^j(z)) - ∫ I_0(x) dP(x) |,

but not more than that.


For n=1000, there exists a set Bn(E) with small P=P(1/2,1/2) probability
such that the temporal mean of orbits has a temporal mean outside the tolerance
level. For example the cylinder with the first 1000 elements equal to 0 is contained
in Bn( E), because

1 999 . 1 1 1
- ""'I-(a 1 (z))-- = 1-- =- > 0.01.
n~ 0 2 2 2-
i=O

for z in the above mentioned cylinder.


We will be concerned here with the problem of estimating the velocity with
which P(B_n(ε)) goes to zero when n goes to infinity.
From a practical point of view, the Ergodic Theorem would not be very useful
if P(B_n(ε)) went to zero too slowly. For a given tolerance ε and a fixed n (any
practical experiment is finite), we choose at random a point z in X, according to
P(1/2, 1/2). If the velocity of convergence to zero of the sequence P(B_n(ε)) is very
slow, then there is a very large probability of choosing the point z in the bad set
B_n(ε).
The area of Mathematics where such kinds of problems are tackled is known
as Large Deviation Theory (see [7] for a very nice and general reference).
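For the fair coin, P(B_n(ε)) can be computed exactly as a binomial sum, and one can watch the exponential rate emerge. This sketch is ours (we use ε = 0.1 instead of 0.01 so the decay is visible at moderate n); the constant it approaches anticipates the deviation function I defined below:

```python
import math

def coin_deviation_prob(n, eps):
    """Exact P(1/2,1/2)-probability that the mean of n fair coin tosses
    deviates from 1/2 by at least eps."""
    total = sum(math.comb(n, k) for k in range(n + 1)
                if abs(k / n - 0.5) >= eps)
    return total / 2 ** n

eps = 0.1
# Cramer rate for the fair coin at v = 1/2 + eps:
# I(v) = v log(2v) + (1 - v) log(2(1 - v))
v = 0.5 + eps
rate = v * math.log(2 * v) + (1 - v) * math.log(2 * (1 - v))

for n in [100, 400, 1600]:
    q = coin_deviation_prob(n, eps)
    print(n, q, -math.log(q) / n)  # last column approaches rate ≈ 0.0201
```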

Let's return now to the general case of a measurable map T from X to X,
leaving invariant a measure μ. We will be more precise about what we want to
measure.

Definition 5.1. Given ε greater than zero and n ∈ N, by definition Q_n(ε) is
equal to:

   Q_n(ε) = μ{ z | | (1/n) Σ_{j=0}^{n-1} f(T^j(z)) - ∫ f(x) dμ(x) | ≥ ε }.

Proposition 5.1. Given ε,

   lim_{n→∞} Q_n(ε) = 0.

Proof. For a given value ε denote

   A_n = { z | | (1/n) Σ_{j=0}^{n-1} f(T^j(z)) - ∫ f(x) dμ(x) | ≥ ε }.

We will show that lim_{n→∞} μ(A_n) = 0.
Consider the set Y = ∩_{n=1}^{∞} ∪_{i≥n} A_i. For each z ∈ Y, the sequence a_n =
(1/n) Σ_{j=0}^{n-1} f(T^j(z)) has a subsequence with distance at least ε from ∫ f(x) dμ(x).
Therefore, for any z ∈ Y the above defined sequence a_n does not converge to
∫ f(x) dμ(x), and hence Y has measure zero by the Ergodic Theorem of Birkhoff.
As the sequence D_n = ∪_{i≥n} A_i is decreasing and μ(Y) = 0, then

   lim_{n→∞} μ(A_n) ≤ lim_{n→∞} μ(D_n) = μ(Y) = 0.

Therefore the proposition is proved. ∎

Corollary 5.1. Given ε > 0,

   lim_{n→∞} μ{ z | | (1/n) Σ_{j=0}^{n-1} f(T^j(z)) - ∫ f(x) dμ(x) | < ε } = 1.

One would like to be sure that the convergence to zero we considered above in
Proposition 5.1 is at least exponential, that is: for any ε, there exists a positive M
such that for every n

   μ{ z | | (1/n) Σ_{j=0}^{n-1} f(T^j(z)) - ∫ f(x) dμ(x) | ≥ ε } ≤ e^{-Mn}.

Under suitable assumptions we will show that this property holds (see
Prop. 6.8).
It is quite surprising that in the case μ is an equilibrium state (see Def. 4.2)
this result can be obtained using properties related to the Pressure (see paragraphs
7 and 8). We will return to this fact later, but first we need to explain some of the
basic properties of Large Deviation Theory.
The relevant question here is how fast, in logarithmic scale, the value Q_n(ε)
goes to zero, that is, how to find the value

   lim_{n→∞} (1/n) log Q_n(ε).

The above value is an important piece of information about the asymptotic value of
the μ-measure of the set of trajectories that deviate up to ε from the behaviour of
the typical trajectory given by the Theorem of Birkhoff.
More generally speaking, for a certain subset A of R^m one would like to know,
for a certain fixed value of n, when the values z are such that:

   (1/n) Σ_{j=0}^{n-1} f(T^j(z)) ∈ A.

In the situation we analyzed before (Corollary 5.1),

   A = { y ∈ R^m | | y - ∫ f(x) dμ(x) | ≥ ε }.


Definition 5.2. Given a subset A of R^m and n ∈ N we denote

   Q_n(A) = μ{ z | (1/n) Σ_{j=0}^{n-1} f(T^j(z)) ∈ A }.

In the same way as before, one would like to know the value

   lim_{n→∞} (1/n) log Q_n(A).

Remark. If the set A is an open interval that contains the mean value ∫ f(x) dμ(x),
then the above limit is zero, because lim_{n→∞} Q_n(A) = 1 (see Corollary 5.1).

First, we will try to give a general idea of how the solution of this problem is
obtained, and then later we will show the proofs of the results we state now.
There exists a magic function I(v), defined for v ∈ R^m (the set where the
function f takes its values), such that the above limit is determined by:

   lim_{n→∞} (1/n) log Q_n(A) = -inf_{v∈A} { I(v) },

when A is an interval.
The function I will be called the deviation function. The shape of I is
basically the shape of | v - ∫ f(x) dμ(x) |^2, v ∈ R^m; that is, I(v) is a non-negative
continuous function that attains a minimum equal to zero at the value ∫ f(x) dμ(x).
The properties we mentioned before are not always true for general T, μ
and f, but under reasonable assumptions the above mentioned properties will be
true. This will be explained very soon.
The natural question is: how can one obtain such a function I? The function
I(v), v ∈ R^m, is obtained as the Legendre Transform (we will present the general
definition later) of the free energy c(t), t ∈ R^m, to be defined below.

Definition 5.3. Given n ∈ N and t ∈ R^m we denote

   c_n(t) = (1/n) log ∫ e^{⟨t, f(x)+f(T(x))+f(T^2(x))+...+f(T^{n-1}(x))⟩} dμ(x).

Definition 5.4. Suppose that for each t ∈ R^m and n ∈ N the value c_n(t) is finite;
then we define c(t), the free energy, as the limit

   c(t) = lim_{n→∞} c_n(t),

in the case the above limit exists.

Remark. Note that c(0) = 0.

Remark. The function c(t) is also known in Probability as the moment generating
function. For people familiar with Probability Theory and Stochastic Processes,
we would like to point out that the random variables f(T^n(z)), n ∈ N, are not
independent in general.

Definition 5.5. A function g(t) is convex if for any s, t ∈ R^m and 0 < λ < 1,

   g(λs + (1-λ)t) ≤ λ g(s) + (1-λ) g(t).

We say g is strictly convex if for any 0 < λ < 1 the above expression is true with
< instead of ≤.

It is easy to see that a twice-differentiable function g(t) whose second derivative
satisfies g''(t) ≥ 0 for all t ∈ R is convex.

Proposition 5.2. The function c(t) is convex in t ∈ R^m.

Proof. The Hölder inequality [16] claims that

   ∫ | h(x) k(x) | dμ(x) ≤ ( ∫ | h(x) |^p dμ(x) )^{1/p} ( ∫ | k(x) |^q dμ(x) )^{1/q},

where h and k are respectively in L^p(μ) and L^q(μ), and p and q are such that
1/p + 1/q = 1.
Consider s, t ∈ R^m, h(x) = e^{⟨λs, f(x)+...+f(T^{n-1}(x))⟩},
k(x) = e^{⟨(1-λ)t, f(x)+...+f(T^{n-1}(x))⟩}, λ ∈ (0,1), and then define p = 1/λ and
q = 1/(1-λ). Now, using the Hölder inequality:

   ∫ e^{⟨λs+(1-λ)t, f(x)+f(T(x))+...+f(T^{n-1}(x))⟩} dμ(x) ≤

   ( ∫ e^{⟨s, f(x)+...+f(T^{n-1}(x))⟩} dμ(x) )^{λ} ( ∫ e^{⟨t, f(x)+...+f(T^{n-1}(x))⟩} dμ(x) )^{1-λ}.

Therefore, taking (1/n) log on each side of the above inequality, one obtains

   c_n(λs + (1-λ)t) ≤ λ c_n(s) + (1-λ) c_n(t),

and hence c(t) is convex, because it is the limit of convex functions. ∎

Definition 5.6. The deviation function I(v), v ∈ R^m, is by definition the Legendre
transform of the function c(t), t ∈ R^m, that is,

   I(v) = sup_{t∈R^m} { ⟨t, v⟩ - c(t) }.

The deviation function I is well defined in the case c(t) is strictly convex.


In order to simplify the argument, let's consider the one dimensional case
m=l. When c is differentiable, then it is easy to see that

I(v) =sup{ tv- c(t)} = tov- c(t 0 ),


tER

where t 0 is such that c' (to)= v (see proposition 6.1). Such a t 0 is well defined if c
is strictly convex and differentiable. In this case the deviation function I( v) is also
differentiable in v, as it is easy to see. If c(t) is piecewise differentiable (with left
and right derivatives), then I(z) has also this property.
In more precise mathematical terms one should say that the deviation function
I(v) of c(t), t E Rm, takes values v in the dual of Rm. The dual of Rm is Rm
itself, and therefore, in the finite dimensional case (m finite) there is no problem to
define the Legendre transform in the way we did above. We will need to consider
Legendre transforms in infinite dimensional vector spaces soon. This will require
some small changes in the definition of Legendre Transform. Before that, we will
consider the main properties that are true in the finite dimensional case. The
key property is the differentiability of the free energy c(t). Assuming piecewise
differentiability (with the existence of right and left derivatives for c(t), t E R),
most results we will state below will be true (Theorem 6.2 and Proposition 6.8
require that the free energy be differentiable).
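To make the Legendre-transform recipe concrete, here is an illustration of ours (not from the text): for i.i.d. fair coin tosses with f = I_0, the free energy is c(t) = log((e^t + 1)/2), and I(v) = sup_t (tv - c(t)) can be approximated by maximizing over a grid of t; it agrees with the explicit rate I(v) = v log(2v) + (1-v) log(2(1-v)).

```python
import math

def c(t):
    """Free energy of f = I_0 for i.i.d. fair coin tosses."""
    return math.log((math.exp(t) + 1.0) / 2.0)

def deviation(v, t_lo=-30.0, t_hi=30.0, steps=200_000):
    """Numerical Legendre transform I(v) = sup_t (t v - c(t)) on a grid."""
    best = -float("inf")
    for i in range(steps + 1):
        t = t_lo + (t_hi - t_lo) * i / steps
        best = max(best, t * v - c(t))
    return best

def explicit_rate(v):
    return v * math.log(2 * v) + (1 - v) * math.log(2 * (1 - v))

for v in [0.5, 0.6, 0.75]:
    print(v, deviation(v), explicit_rate(v))  # last two columns agree closely
```

Note that deviation(0.5) is (numerically) zero, matching the fact that I vanishes at the mean ∫ f dμ = 1/2.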

The main result we want to prove in the next paragraph is:

Theorem 5.1. Assume the free energy c(t), t ∈ R^m, is well defined and also that
c is differentiable; then for an open parallelepiped A contained in R^m,

   lim_{n→∞} (1/n) log Q_n(A) = -inf_{v∈A} { I(v) }.

The above result is true for much more general sets A contained in R^m, but
we will state and prove the general result later.
The main results for the finite dimensional case will be proved for m = 1. The
general case is not very different from the case m = 1. The infinite dimensional
case is, however, much more difficult than the finite dimensional case [7].

6. Free Energy and the Deviation Function

We will need to develop some elementary properties of Legendre Transforms in


order to prove the Theorem we stated above.

Definition 6.1. Given a convex piecewise differentiable map g(y), y ∈ R^m, the
Legendre transform of g, denoted by g*(p), p ∈ R^m, is by definition

   g*(p) = sup_{y∈R^m} { ⟨p, y⟩ - g(y) }.

Proposition 6.1. Suppose g(y) is defined for all y ∈ R and that its second
derivative is continuous. If there exists a > 0 such that g''(y) > a > 0, y ∈ R,
then g*(p) = p y_0 - g(y_0), where g'(y_0) = p.

Proof. In the case there exists a value y_0 such that g'(y_0) = p, then clearly
g*(p) = y_0 p - g(y_0). Therefore, all we have to show is that g' is a global
diffeomorphism from R to R.
Note that for a positive h, g'(x+h) - g'(x) = ∫_x^{x+h} g''(y) dy > ah. Therefore
the map g' is injective. The map g' is open (that is, the image g'(A) of each open
set A is open) because g'(x+h) - g'(x) > ah. The map g' is closed (that is, the
image g'(K) of each closed set K is closed) because it is continuous. We claim
that g' is surjective. This is easy to see: the image under g' of the open and closed
set R is an open and closed interval, and therefore equal to R. The conclusion is
that g' is bijective from R to itself. ∎

Proposition 6.2. Suppose g(y), defined for y ∈ R, satisfies g''(y) > 0 for all y ∈ R;
then g* satisfies g*''(p) > 0 for all p ∈ R.

Proof. We will use the following notation: for each value p denote by y(p) the
only value y such that g'(y(p)) = p. As we saw in the last proposition, g*(p) =
y(p) p - g(y(p)). Taking derivatives with respect to p,

   dg*/dp (p) = y'(p) p + y(p) - g'(y(p)) y'(p) = y'(p) p + y(p) - p y'(p) = y(p).

Hence g*''(p) = y'(p).
Now, as for any p, p = g'(y(p)), taking derivatives on both sides with respect
to p, 1 = g''(y(p)) y'(p) = g''(y(p)) g*''(p). Thus g*'' is positive if g'' is positive. ∎

Remark. We will assume that all maps g to which we apply the Legendre transform
satisfy the condition g''(y) > a, y ∈ R, for a certain fixed positive value a.
When we consider piecewise differentiable maps (with left and right derivatives),
then we will also suppose that the left and right derivatives satisfy the same
condition on a.

The geometric interpretation of the Legendre transform of g in terms of the
graph of g is shown in Figure 1.
Now we will prove a key result in the Theory of Legendre Transforms:

Proposition 6.3. Suppose f(x) and f*(x) are strictly convex and differentiable
for every x; then the Legendre Transform is an involution, that is, f** = f.

Figure 1.

Proof. We will show that if g denotes f*, then g* = f.
For a given p ∈ R denote by x(p) the value x at which sup_{x∈R} { px - f(x) }
attains its supremum. Since f* = g, then f'(x(p)) = p and g(p) = p x(p) - f(x(p)).
For a certain fixed value x_0 and for each x ∈ R, define ξ(x) as the height
obtained by intersecting the tangent line (y, z(y)) = (y, f(x) + f'(x)(y - x)) with
the line y = x_0 (see Figure 2). It is easy to see that (ξ(x) - f(x))/(x_0 - x) = f'(x),
and therefore

   ξ(x) = f(x) - x f'(x) + f'(x) x_0.

Given p, g(p) = p x(p) - f(x(p)), where x(p) is such that f'(x(p)) = p.
Therefore, if we write ξ in terms of p, then

   ξ(p) = ξ(x(p)) = f(x(p)) - x(p) p + p x_0 = -g(p) + p x_0.

Note that

   sup_{p∈R} ξ(p) = sup_{p∈R} { p x_0 - g(p) } = g*(x_0).

From Figure 2 one can easily see that sup ξ(p) is attained when p = f'(x_0), and
the supremum value of ξ is f(x_0). Therefore we conclude that g*(x_0) = f(x_0). ∎

Figure 2.

Definition 6.2. We say that f is conjugate to g if f* = g.

The last result shows that if f is conjugate to g, then g is also conjugate to f.
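The involution f** = f can be observed numerically. The sketch below (our own illustration; the choice f(x) = x², for which f*(p) = p²/4, and all grid bounds are ours) computes a grid Legendre transform twice and checks that it returns f; the bounds are chosen so that the maximizers stay in the interior of each grid.

```python
def legendre_on_grid(values, xs, ps):
    """Given g sampled as values[i] = g(xs[i]),
    return [max_i (p * xs[i] - values[i]) for p in ps]."""
    return [max(p * x - v for x, v in zip(xs, values)) for p in ps]

N = 1200
ys = [-4 + 8 * i / N for i in range(N + 1)]          # domain of f
ps = [-2 + 4 * i / N for i in range(N + 1)]          # dual grid
xs = [-0.9 + 1.8 * i / N for i in range(N + 1)]      # where we test f**

f_vals = [y * y for y in ys]                          # f(x) = x^2
f_star_vals = legendre_on_grid(f_vals, ys, ps)        # close to p^2 / 4
f_2star_vals = legendre_on_grid(f_star_vals, ps, xs)  # close to x^2 again

err = max(abs(fx - x * x) for fx, x in zip(f_2star_vals, xs))
print(err)  # small: the double transform recovers f
```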

Definition 6.3. Suppose g is a convex function on R^m. We say that y ∈ R^m is a
subdifferential of g at the value x if g(z) ≥ g(x) + ⟨y, z - x⟩ for any z ∈ R^m.
We denote the set of all subdifferentials of g at the value x by ∂g(x).

This definition allows one to deal with the case of c(t), t ∈ R, piecewise
differentiable (differentiable except at a finite set of points t_i, i ∈ {1, 2, ..., n}). At
the values t where c is differentiable there is a unique subdifferential c'(t) = ∂c(t),
but at the values t_i where c(t) has left and right derivatives (we assume in the
definition that this property is true), respectively equal to u_i and v_i, ∂c(t_i) is
the interval [u_i, v_i].
The next result shows a duality between the subdifferentials of conjugate
functions.

Proposition 6.4. y ∈ ∂g(x) if and only if x ∈ ∂g*(y).

Proof. By definition y ∈ ∂g(x) is equivalent to

   g(z) ≥ g(x) + ⟨y, z - x⟩

for all z ∈ R.
The last expression is equivalent to

   ⟨y, z⟩ - g(z) ≤ ⟨y, x⟩ - g(x)

for all z ∈ R.
Therefore y ∈ ∂g(x) is equivalent to saying that x realizes the supremum of
⟨y, z⟩ - g(z).
We also obtain from the above reasoning that y ∈ ∂g(x) is equivalent to
g*(y) = ⟨y, x⟩ - g(x), and thus equivalent to ⟨x, y⟩ = g*(y) + g(x).
Applying the same result for g = g*, and interchanging the roles of x and y,
we conclude that x ∈ ∂g*(y) is equivalent to ⟨y, x⟩ = g**(x) + g*(y). The last
expression is equivalent to ⟨y, x⟩ = g(x) + g*(y), because from the last proposition
g** = g.
Hence y ∈ ∂g(x) is equivalent to x ∈ ∂g*(y). ∎

Using this proposition one can show the following result:

Proposition 6.5. I(v) = 0 if and only if v ∈ ∂c(0). The function I is non-
negative and has minimum equal to zero on the set ∂c(0).

Proof. First note that, as I = c*, from the last proposition v ∈ ∂c(0) if and
only if 0 ∈ ∂I(v). In this case,

   I(z) ≥ I(v) + ⟨0, z - v⟩ = I(v)

for any z ∈ R. Therefore, I attains its infimum on the set ∂c(0).
Proposition 6.4 claims that ⟨t, v⟩ = c(t) + c*(v) = c(t) + I(v) if and
only if v ∈ ∂c(t). Now, using this proposition for the case t = 0, one obtains
I(v) = -c(0) = 0. The final conclusion is that I(z) ≥ I(v) = 0 for v ∈ ∂c(0) and
z ∈ R. ∎

The proof of the main Theorem 5.1 is done in two separate parts: the upper
large deviation inequality and the lower large deviation inequality. First we will
show the upper large deviation inequality. This inequality is true in a quite general
context, even without the hypothesis of full differentiability of c(t) [7]. In the
second inequality we will use differentiability of the free energy.

Proposition 6.6. (Upper large deviation inequality) Suppose c(t), t ∈ R, is a well
defined convex function. Then

lim sup_{n→∞} (1/n) log μ{x | (1/n) Σ_{j=0}^{n−1} f(T^j(x)) ∈ K} = lim sup_{n→∞} (1/n) log Q_n(K) ≤ − inf_{z∈K} I(z)   (5)

where K is a closed set in R.

Proof. Let us first recall Chebyshev's inequality: let g be a measurable function
from X to R and h from R to R a non-negative, nondecreasing function such that
∫ h(g(x)) dμ(x) is finite. In this case, for any value d such that h(d) is positive,

μ{x | g(x) ≥ d} ≤ ∫ h(g(x)) dμ(x) / h(d).

We refer the reader to [7] for the proof of Chebyshev's inequality.
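A quick numerical sanity check of this inequality (an illustration with assumed data, not from the text): take μ to be the uniform distribution on [0, 1] approximated by samples, g(x) = x, and h(y) = e^{ty} with t > 0, which is the exponential choice used below in the proof.

```python
import math
import random

random.seed(0)

t, d = 2.0, 0.7
h = lambda y: math.exp(t * y)               # non-negative and nondecreasing since t > 0

samples = [random.random() for _ in range(100_000)]
lhs = sum(1 for x in samples if x >= d) / len(samples)   # mu{x | g(x) >= d}
rhs = sum(h(x) for x in samples) / len(samples) / h(d)   # (integral of h(g) dmu) / h(d)
assert lhs <= rhs                           # Chebyshev's inequality holds
```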


Denote 8c(O) = [u 0 , v0 ] (it is very easy to see that Sc(O) is an interval).
We will show first the claim of the theorem for semi intervals [a, oo,) where
a is larger than the right derivative v 0 of c at t=O. For such a and any t > 0,
Tchebishev's inequality for

h(y) = enty, 1" .n-1

g(x) = - ~J(T1 (x)), d= a,


n j=O

(Remark- we require t > 0 in order h(y) being non-decreasing) implies that

Therefore taking limits when n goes to infinity, one concludes that

. 1
hm sup -log Qn([a, oo )) ~ -sup{ ta- c(t)}. (6)
n-oo n t~O

Now we need the following claim:

Claim. sup_{t≥0} {at − c(t)} = I(a) = sup_{t∈R} {at − c(t)}.

Proof of the Claim. c(t) is convex and c(0) = 0, hence u_0, the left derivative of c
at 0, satisfies c(t)/t ≤ u_0 for all t < 0. Therefore, for t < 0,

ta − c(t) = t(a − c(t)/t) ≤ t(a − u_0) ≤ 0.

The last term is non-positive because a ≥ v_0 ≥ u_0 and t < 0.
The conclusion is that I(a) = sup_{t∈R} {ta − c(t)} = sup_{t≥0} {ta − c(t)},
since the supremum over t < 0 does not exceed the value 0 attained at t = 0.
Hence the claim is proved.

Before we return to the proof of the theorem, we will need first to prove another
claim.

Claim. I(a) = inf_{z≥a} I(z).

Proof of the Claim. From Proposition 6.5, I(z) is equal to 0 on [u_0, v_0] = ∂c(0).
We claim that for z > v_0 the function I is monotone nondecreasing. This is so
because, if there existed two values z_1 < z_2 larger than v_0 such that I(z_1) = I(z_2),
then there would exist z ∈ [z_1, z_2] with 0 ∈ ∂I(z) (this follows at once from the
convexity and Definition 6.3, and does not require differentiability).
This would mean, by Proposition 6.5, that z ∈ ∂c(0), which is false because z is
not in [u_0, v_0]. Therefore I(a) = inf_{z≥a} I(z), and the second claim is also proved.

Now, from equation (6) and using the two claims stated above, we obtain the
desired conclusion

lim sup_{n→∞} (1/n) log Q_n(K) ≤ − inf_{z∈K} I(z)   (7)

when K = [a, ∞) with a larger than v_0, the right derivative of c at 0.
The proof for intervals K of the form (−∞, a], a < u_0, is similar.
Now we will prove the claim of the theorem for a general closed set K.
First note that if K intersects the set ∂c(0) = [u_0, v_0], then the claim is trivial
because inf_{z∈K} I(z) = 0 (remember that v ∈ ∂c(0) if and only if I(v) = 0, by
Proposition 6.5).
Hence we will suppose that K does not intersect the set [u_0, v_0].
Consider a, b two real values such that (−∞, a] ∪ [b, ∞) is the smallest possible
set such that K ⊂ (−∞, a] ∪ [b, ∞). As the set K is closed, (a = −∞ or a ∈ K)
and (b = ∞ or b ∈ K). Suppose, to simplify the notation, that a, b ∈ K
(the other cases can be easily handled by the reader). From the first part we know
that inf_{z∈(−∞,a]} I(z) = I(a) and inf_{z∈[b,∞)} I(z) = I(b). Therefore inf_{z∈K} I(z) =
min{I(a), I(b)}, because a, b ∈ K.
Finally, from the first part (7):

lim sup_{n→∞} (1/n) log Q_n(K) ≤ lim sup_{n→∞} (1/n) log(Q_n((−∞, a]) + Q_n([b, ∞)))
≤ max{−I(a), −I(b)} = −min{I(a), I(b)} = − inf_{z∈K} I(z).

Therefore the Proposition is proved. •

Proposition 6.7. If c(t) is differentiable at t = 0, then c'(0) = ∫ f(x) dμ(x).

Proof. We know from the last proposition that I(z) ≥ I(v) = 0 for z ∈ R and
v ∈ ∂c(0) = {c'(0)}.
Note that if c is differentiable at 0, we have uniqueness of the z such that
I(z) = 0, this value being equal to v = c'(0).
The proof will be done by contradiction. Suppose c'(0) is different from
∫ f(x) dμ(x). Given ε = |c'(0) − ∫ f(x) dμ(x)|/2 > 0, consider

K = (−∞, c'(0) − ε] ∪ [c'(0) + ε, ∞)

and M = inf_{z∈K} I(z) > 0. Proposition 6.6 assures that for sufficiently large n ∈ N:

μ({z | (1/n) Σ_{j=0}^{n−1} f(T^j(z)) ∈ K}) = μ({z | |(1/n) Σ_{j=0}^{n−1} f(T^j(z)) − c'(0)| ≥ ε}) ≤ e^{−nM}.   (8)

From the last inequality (by the Borel-Cantelli Lemma, since Σ_n e^{−nM} < ∞),
μ-almost every point z has the property that its temporal mean converges to c'(0),
and from the Theorem of Birkhoff this value c'(0) has to be the spatial mean
∫ f(x) dμ(x). Hence we obtain a contradiction and the proposition is proved. •

Definition 6.4. We say that the μ-integrable function f from X to R has the
exponential convergence property if for any ε > 0 there exists M > 0 such that

μ{y | |(1/n) Σ_{j=0}^{n−1} f(T^j(y)) − ∫ f(x) dμ(x)| ≥ ε} ≤ e^{−nM}

for n large enough.

Proposition 6.8. Suppose c is differentiable at t = 0; then f has the exponential
convergence property.

Proof. As we have just shown that c'(0) = ∫ f(x) dμ(x) and v = c'(0) is the only
value with I(v) = 0, then given ε there exists

M = inf_{z ∈ (−∞, ∫ f dμ − ε] ∪ [∫ f dμ + ε, ∞)} I(z) > 0

such that

μ{y | |(1/n) Σ_{j=0}^{n−1} f(T^j(y)) − ∫ f(x) dμ(x)| ≥ ε} ≤ e^{−nM}. •

We will need the very well known definition of distribution in order to simplify
the notation in the proof of the next theorem:

Definition 6.5. Given a μ-integrable function f : X → R (a random variable),
the measure μ_f defined on the real line R such that for any continuous
function g : R → R

∫ g ∘ f dμ(z) = ∫ g(x) dμ_f(x)

is called the distribution of the μ-integrable function f.

Such a measure μ_f always exists (using the notation of the first chapter, for f :
X → Y (or f : X → R), μ_f is the pull-back of the measure μ by the map
f, as introduced in Definition 2.8).

Remark. Note that for any interval (a, b) contained in R,

μ_f((a, b)) = μ{y | f(y) ∈ (a, b)}.

As a practical rule, remember that each time one wants to integrate
∫ g(x) dμ_f(x), one substitutes the variable x by f(z) and integrates with respect
to μ, that is: ∫ g(f(z)) dμ(z).
The proofs of all results we obtained before are quite general and can be easily
extended (the proofs being exactly the same) to the following case:

Theorem 6.1. For each value n ∈ N, let X_n be a μ-integrable function on X such
that X_n(z)/n ∈ R, z ∈ X, has ν_n as its distribution; that is, using the notation
introduced above for the distribution, ν_n = μ_{X_n/n}. Define the free energy of the
sequence X_n/n as

c(s) = lim_{n→∞} (1/n) log ∫ e^{s X_n(z)} dμ(z) = lim_{n→∞} (1/n) log ∫ e^{snx} dν_n(x).

Suppose c(s) is differentiable at s = 0; then there exists a positive M such that

μ({z | |X_n(z)/n − c'(0)| ≥ ε}) ≤ e^{−nM}   (9)

for n large enough.
The value M is obtained in the following way:

M = inf_{l ∈ (−∞, c'(0)−ε] ∪ [c'(0)+ε, ∞)} I(l),

where, for each value l, I(l) = sup_{s∈R} {sl − c(s)} is the Legendre transform of c(s).

Remark. Note that it follows from the above theorem that

lim_{n→∞} ν_n((−∞, c'(0) − ε] ∪ [c'(0) + ε, ∞)) = 0

and therefore that

lim_{n→∞} ν_n(B(c'(0), ε)) = 1   (10)

(see the last remark and the definition of the distribution).

The last theorem can be seen as a generalization of the results we obtained
before, by making the measurable function X_n(z) defined above play the role of
the function Σ_{j=0}^{n−1} f(T^j(z)) that we previously considered.
Now we will use this last result to prove the lower large deviation inequality:

Theorem 6.2. (Lower large deviation inequality) Suppose that the free energy
c(t) is differentiable for every t ∈ R. Then for any open set A:

lim inf_{n→∞} (1/n) log Q_n(A) ≥ − inf_{z∈A} I(z).

Proof. We will assume that for any real value z ∈ R there exists a value t such
that c'(t) = z. If we suppose that c''(t) > α > 0, then this assumption is satisfied,
as we saw in Proposition 6.1.
The above hypothesis is not necessary for the proof of the theorem, but in order
to avoid too many technicalities we will prove the result under this assumption.
Consider z in the open set A and r such that B(z, r) = (z − r, z + r) is contained
in A. Denote by t a value such that c'(t) = z (such a t exists by hypothesis).
Now we will need to use the concept of the distribution of a μ-measurable function
that we introduced before.
We will denote by μ_n the distribution on R such that μ_n = μ_{(1/n) Σ_{j=0}^{n−1} f(T^j(z))}
(see the notation introduced after Definition 6.5).


Therefore, given a set (a, b) ⊂ R,

∫_{(a,b)} dμ_n(x) = μ_n((a, b)) = μ{z | (1/n) Σ_{j=0}^{n−1} f(T^j(z)) ∈ (a, b)} = Q_n((a, b)).

Denote Z_n(t) = ∫ e^{tnx} dμ_n(x) = e^{n c_n(t)} (see Definition 5.3 and remember the
practical rule mentioned in the remark after Definition 6.5 of the distribution). The
reader familiar with Statistical Mechanics will recognize the Partition function in
the definition we introduced.

For each value t ∈ R and n ∈ N, we will now denote by μ_n^t the probability
on R given by

dμ_n^t(x) = (1/Z_n(t)) e^{ntx} dμ_n(x).   (11)

Note that, for fixed t and n,

∫ dμ_n^t(x) = (1/Z_n(t)) ∫ e^{ntx} dμ_n(x) = 1,

and therefore the term Z_n(t) = e^{n c_n(t)} appears only as a normalization term in the
definition of the probability μ_n^t (it does not depend on x).
This one-parameter family of probabilities μ_n^t, t ∈ R, will play a very impor-
tant role in the proof of the theorem.
One should think of the measure μ_n^t in the following way: for t = 0 the measure
μ_n^0 = μ_n. From the Theorem of Birkhoff, the measure μ_n = μ_n^0 focalizes on (or
has mean value) v = ∫ f(x) dμ(x) = c'(0), that is,

lim_{n→∞} μ_n^0(B(c'(0), r)) = 1 for any r > 0.

For the given value z ∈ A, we choose t such that c'(t) = z, and then the
measure μ_n^t will focalize on (or have mean value) z = c'(t), as will be shown:

Claim. Suppose c'(t) = z; then for any r:

lim_{n→∞} μ_n^t((z − r, z + r)) = 1.   (12)

Proof of the Claim. For the value t and n ∈ N, let X_n be a measurable function
such that X_n/n has distribution μ_n^t (such measurable functions always exist
by trivial arguments). Now we will use the last theorem and the fact that z = c'(t).
Define the new free energy

c_t(s) = lim_{n→∞} (1/n) log ∫ e^{snx} dμ_n^t(x),

as was done in the last theorem.
One can obtain c_t(s) from c(s) in the following way:

c_t(s) = lim_{n→∞} (1/n) log ∫ e^{snx} dμ_n^t(x) = lim_{n→∞} (1/n) log ∫ (e^{nx(s+t)} / e^{n c_n(t)}) dμ_n(x)
= lim_{n→∞} (1/n) log ∫ e^{nx(s+t)} dμ_n(x) − lim_{n→∞} (1/n) log e^{n c_n(t)}
= c(t + s) − c(t).

Hence, if c is differentiable at t, then c_t(s) is differentiable at s = 0 and
(dc/dt)(t) = (dc_t/ds)(0). Now, as the hypothesis of differentiability of the last theorem is
satisfied, the conclusion follows (see the remark after Theorem 6.1):

lim_{n→∞} μ_n^t(B(c_t'(0), r)) = 1.

Using the fact that we chose t in such a manner that c_t'(0) = c'(t) = z, we conclude
that

lim_{n→∞} μ_n^t(B(z, r)) = 1,

and the claim is proved.

Note that introducing the parameter t in our problem (defining the one-
parameter family of measures μ_n^t, n ∈ N) has the effect of translating by t the
free energy c(s) (in the parameter s), that is,

c_t(s) = c(t + s) − c(t).

In other words, we adapt the measure μ_n^t in such a way that this new measure
has mean value z.
Now we will return to the proof of the theorem.
For any point x ∈ B(z, r), −tz − |t| r ≤ −tx. Therefore:

Q_n(A) ≥ Q_n(B(z, r)) = ∫_{B(z,r)} dμ_n(x)
= Z_n(t) ∫_{B(z,r)} e^{−ntx} dμ_n^t(x) ≥ e^{n(c_n(t) − tz) − rn|t|} μ_n^t(B(z, r)).

Hence

lim inf_{n→∞} (1/n) log Q_n(A) ≥ c(t) − tz − r|t| + lim inf_{n→∞} (1/n) log μ_n^t(B(z, r)).

From the claim we know that the last term on the right hand side of the above
expression is zero. Hence, as c(t) − tz = −I(z) (because c'(t) = z), then

lim inf_{n→∞} (1/n) log Q_n(A) ≥ −I(z) − r|t|.

As r was arbitrary and positive, we conclude that

lim inf_{n→∞} (1/n) log Q_n(A) ≥ −I(z).

Now, as z was arbitrary in the open set A, we obtain that

lim inf_{n→∞} (1/n) log Q_n(A) ≥ − inf_{z∈A} I(z),

and this is the end of the proof of the theorem. •

As I(z) is continuous (because c(t) is assumed to be differentiable), the final
conclusion is:

Theorem 6.3. Suppose c(t) is differentiable in t; then for a given interval C (open
or closed)

lim_{n→∞} (1/n) log Q_n(C) = − inf_{z∈C} I(z).

Now we want to relate the results we obtained above with the Pressure
of Thermodynamic Formalism.
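Theorem 6.3 can be illustrated numerically (a sketch with assumed data, not from the text): for the Bernoulli shift with the (1/2, 1/2) measure and f the indicator of the first symbol, the free energy is c(t) = log((e^t + 1)/2), and its Legendre transform is I(z) = z log(2z) + (1 − z) log(2(1 − z)). A Monte Carlo estimate of (1/n) log Q_n([a, ∞)) is then close to −I(a).

```python
import math
import random

random.seed(1)

def I(z):
    """Legendre transform of c(t) = log((e^t + 1)/2), for 0 < z < 1."""
    return z * math.log(2 * z) + (1 - z) * math.log(2 * (1 - z))

n, trials, a = 100, 100_000, 0.65
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(n)) >= a * n
)
rate = math.log(hits / trials) / n          # estimates (1/n) log Q_n([a, oo))
# the decay rate is near -I(a); the finite-n correction biases it slightly downward
assert -I(a) - 0.03 < rate < -I(a) + 0.01
```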

7. The Ruelle Operator

In this chapter we will present several results related to the pressure of expanding
maps. For such a class of maps the Ruelle Operator will produce a complete solu-
tion for the problem of existence and uniqueness of equilibrium states. Theorem
7.2 will explain how to obtain the equilibrium states in a constructive way. We
point out that the Bernoulli shift is a very important case where the results we
will present can be applied. In this section we will consider only maps that have
the property that for each point z ∈ X, #{T^{−1}(z)} is equal to a fixed value d > 1,
independent of z. Therefore the results will apply directly to the one-sided shift
but not to the two-sided shift (see section 2 for definitions). The results presented
here can be extended to the two-sided shift, but this requires a certain proposition
that we will not present here (see [15]).
Recall the definition:

Definition 7.1. A map T from a compact metric space (X, d) to itself is expanding
if there exists λ > 1 such that for any x ∈ X there exists ε > 0 such that for all
y ∈ B(x, ε), d(T(x), T(y)) > λ d(x, y).

Example. Consider a_0 = 0 < a_1 < a_2 < a_3 < ... < a_{n−1} < a_n = 1 a sequence
of distinct numbers in the interval [0, 1]. Suppose T is a piecewise differentiable
(C^∞) map from [0, 1] to itself such that |T'(x)| > λ > 1 for all x different from
a_0, a_1, ..., a_n. Suppose also that for each i ∈ {0, 1, 2, ..., n − 1}, T([a_i, a_{i+1}]) = [0, 1].
We will also suppose that T has a C^∞ extension to the values a_i, i ∈ {0, 1, 2, ..., n},
with the same properties. This map is expanding and is one of the possible kinds
of maps to which the results we will present in this section apply. In fig. 3 we
show the graph of a map T where all the above properties hold.

Notation. We will use the following notation: for φ ∈ C(X) and ν ∈ M(X) (or
S(X)) we denote the value ∫ φ(x) dν(x) by < φ, ν >.

Definition 7.2. For a given operator L from C(X) to itself, the dual of L is
the operator L* defined from the dual space C(X)* = S(X) (the space of signed
measures) to itself in the following way: L* is the only operator from S(X)
to itself such that for any φ ∈ C(X) and ν ∈ S(X)

< L(φ), ν > = < φ, L*(ν) >.

Figure 3.

Remark. Such an operator L* is well defined by the Riesz Theorem. This is
so because, for a given fixed ν ∈ S(X), the operator H from C(X) to R given by
H(φ) = < L(φ), ν > = ∫ Lφ(x) dν(x) satisfies the hypothesis of the Riesz Theorem.
Therefore, there exists a signed measure μ such that ∫ Lφ(x) dν(x) = H(φ) =
∫ φ(x) dμ(x) = < φ, μ >. Hence, by definition, L*(ν) = μ.
We will assume in the next theorem that the map T has a fixed degree d, that
is, that for any a ∈ X, #{T^{−1}(a)} = d. For this kind of map h(T) = log d (see
Definition 4.3).

Definition 7.3. Define μ_n(x) ∈ M(X) by

μ_n(x) = (1/d^n) Σ_{T^n(y)=x} δ_y,

where d = #{T^{−1}(a)} is independent of a ∈ X.

Theorem 7.1. Let T : X → X be an expanding map of degree d. There exists
μ ∈ M(T) such that μ = lim_{n→∞} μ_n(x) for any x ∈ X. Moreover μ satisfies:

(1) μ is ergodic and positive on open sets;
(2) h(μ) = log d;
(3) h(η) < log d for any η ∈ M(T), η ≠ μ.

Remark. Remember that P(0) = log d = h(T) and therefore μ is the equilibrium
state for ψ = 0 (see Definition 4.3). The maximal measure for the one-sided
shift on d symbols can also be obtained as the probability P(1/d, 1/d, ..., 1/d) (see
Definition 2.7 and the remark at the end of section 4).

Definition 7.4. The above defined measure μ is called the maximal measure.

Definition 7.5. Suppose that T : X → X is a continuous map and ψ : X → R is a
continuous function. Remember that we denote by C(X) the space of continuous
functions on X. Define L_ψ : C(X) → C(X) by

L_ψ φ(x) = Σ_{y ∈ T^{−1}x} e^{ψ(y)} φ(y)

for any φ ∈ C(X) and x ∈ X. We call this operator the Ruelle-Perron-Frobenius
Operator (Ruelle Operator for short).

It is quite easy to see that:

L_ψ^n φ(x) = Σ_{y ∈ T^{−n}x} e^{ψ(y) + ψ(T(y)) + ψ(T²(y)) + ... + ψ(T^{n−1}(y))} φ(y).   (13)

A function ψ is called Hölder-continuous if there exist γ > 0 and C > 0 such that
for all x, y ∈ X, |ψ(x) − ψ(y)| ≤ C d(x, y)^γ. We will require in the next theorem
that the function ψ be Hölder, and without this hypothesis about ψ the results
stated in the theorem are not necessarily true (see [10] for a counter-example).
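For a potential depending only on the first two symbols of the one-sided 2-shift, the Ruelle operator reduces to a 2×2 positive matrix, and the convergence λ^{−n} L_ψ^n 1 → h of Theorem 7.2 (4) below becomes power iteration. A sketch with assumed numbers (the matrix entries e^{ψ} are hypothetical):

```python
import numpy as np

# M[i, j] = e^{psi on the cylinder (j, i)}, so (L_psi phi)(i) = sum_j M[i, j] phi(j),
# the sum running over the preimages y = (j, i, ...) of a point beginning with i.
M = np.array([[0.7, 0.4],
              [0.3, 0.6]])

phi, lam = np.ones(2), 1.0
for _ in range(200):                        # lambda^{-n} L^n(1) -> h
    new = M @ phi
    lam = new.max() / phi.max()
    phi = new / new.max()

assert abs(lam - np.linalg.eigvals(M).real.max()) < 1e-12   # lambda = leading eigenvalue
assert np.allclose(M @ phi, lam * phi)                      # L_psi h = lambda h
# The pressure of this potential is then P(psi) = log(lam).
```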
Now we will state a fundamental theorem in Thermodynamic Formalism.

Theorem 7.2. (see [3] for a proof) Let T : X → X be an expanding map and
ψ : X → R be Hölder-continuous. Then there exist h : X → R Hölder-continuous
and strictly positive, ν ∈ M(X) and λ > 0 such that:
(1) ∫ h dν = 1;
(2) L_ψ h = λ h;
(3) L_ψ* ν = λ ν;
(4) || λ^{−n} L_ψ^n φ − h ∫ φ dν ||_{C(X)} → 0 for any φ ∈ C(X);
(5) h is the unique positive eigenfunction of L_ψ, except for multiplication by
scalars;
(6) The probability μ_ψ = hν is T-invariant (that is, μ_ψ ∈ M(T)), ergodic,
has positive entropy, is positive on open sets and satisfies

log λ = h(μ_ψ) + ∫ ψ dμ_ψ;

(7) For any η ∈ M(T), η ≠ μ_ψ,

log λ > h(η) + ∫ ψ dη.

In order to explain how one can obtain the equilibrium state μ_ψ associated
to ψ in a more appropriate way, we will need to consider a series of remarks.

Remark. It follows from (6) and (7) of Theorem 7.2 that P(ψ) = log λ and that
μ_ψ is the unique equilibrium state for ψ. Therefore the pressure is equal to log λ,
where λ is an eigenvalue of the Ruelle Operator. In fact, it can be shown that λ is
the largest eigenvalue of the operator L_ψ [3] [15]. The remainder of the spectrum of
L_ψ is contained in a disc (in C) of radius strictly smaller than λ. The multiplicity
of the eigenvalue λ is one.
Note that μ_ψ ∈ M(T), but ν is not necessarily in this set.

Remark. The value P(ψ) can be computed in the following way: fix a certain
point x_0 ∈ X and consider φ constant and equal to 1 in (4) of Theorem 7.2. As h
is bounded (being continuous on a compact space), then from (4) of Theorem 7.2

lim_{n→∞} (1/n) log ( L_ψ^n 1(x_0) / λ^n ) = 0,

that is,

lim_{n→∞} (1/n) log L_ψ^n 1(x_0) = log λ = P(ψ)   (14)

or

lim_{n→∞} (1/n) log Σ_{T^n(y)=x_0} e^{ψ(y) + ψ(T(y)) + ... + ψ(T^{n−1}(y))} = P(ψ).   (15)

Remark. The eigenfunction h can be obtained by the following procedure:
consider φ constant equal to 1 in (4); then

h(x) = lim_{n→∞} L_ψ^n 1(x) / λ^n.   (16)

Remark. In order to obtain μ_ψ, we just need to obtain ν. The probability ν can
be obtained from Theorem 7.2 (4): consider a certain value x_0 and δ_{x_0}; then from
(4)

h(x_0) ν = lim_{n→∞} L_ψ^{n*}(δ_{x_0}) / λ^n = lim_{n→∞} Σ_{T^n(x)=x_0} ( e^{ψ(x) + ψ(T(x)) + ... + ψ(T^{n−1}(x))} / λ^n ) δ_x.   (17)

Therefore ν can be obtained in the above mentioned way.
In this way we can obtain ν by means of the limit of a sequence of finite sums
of Dirac measures on the preimages of the point x_0. In the case of the maximal
measure (ψ = 0, P(0) = log d, λ = d, h = 1, ν = μ = μ_ψ), the weights at the points
x such that T^n(x) = x_0 are evenly distributed and equal to d^{−n}. For a general
Hölder continuous ψ it is necessary to distribute the weights in a different
way. There is a more appropriate way to obtain directly the equilibrium
measure μ_ψ, which will be presented later.

Remark. If one is interested in finding an invariant measure μ for the map
T given in the example after Definition 7.1 that also has a density ρ with
respect to dx, that is dμ(x) = ρ(x) dx, then one should consider the potential
ψ(x) = −log |T'(x)|. In this case it is not difficult to check that Theorem 7.2
gives λ = 1 and h(x) = ρ(x) (see [13]). The equilibrium probability dμ (satisfying
(6) of Theorem 7.2) will in this case be the measure ρ(x) dx.

Let us see now how Theorem 7.1 follows from Theorem 7.2. Take ψ = 0 and
let λ, h and ν be given by Theorem 7.2. Then

L_0 1(x) = Σ_{y ∈ T^{−1}x} 1(y) = d · 1.

Because of part (5) of Theorem 7.2, λ = d and h ≡ 1. Also, part (4) of
Theorem 7.2 shows that

d^{−n} L_0^n φ(x) = < φ, μ_n(x) > → ∫ φ dν

for any φ ∈ C(X). This proves Theorem 7.1.

Definition 7.6. A continuous function J : X → R is the Jacobian of T with
respect to μ ∈ M(X) if

μ(T(A)) = ∫_A J dμ

for any Borel set A such that T|_A is injective.

It is easy to prove that such a J exists (use the Radon-Nikodym Theorem) and
is unique (in the appropriate sense). The Jacobian is the local rate of variation of
the measure μ under forward iteration of the map. Some ergodic properties
of μ can be analysed through J.

Theorem 7.3. Suppose that J (the Jacobian of an invariant measure μ) is Hölder-
continuous and strictly positive. Then
(a) h(μ) = ∫ log J dμ;
(b) μ is ergodic.

Consider now the question of finding a T-invariant probability with a given
Jacobian J > 1. It is easy to prove that every function J > 1 that is the Jacobian
of T with respect to some T-invariant probability must satisfy

Σ_{T(x)=y} 1/J(x) = 1   (18)

for any y ∈ X. This condition is also sufficient.
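Condition (18) can be checked concretely (an illustrative sketch with an assumed branch point, not from the text): for a piecewise linear map with two full branches on [0, p] and [p, 1], the Jacobian with respect to Lebesgue measure is 1/p on the first branch and 1/(1 − p) on the second, and every point has exactly one preimage per branch.

```python
p = 0.3                                     # assumed branch point

def J(x):
    """Jacobian |T'| of the map T(x) = x/p on [0,p), T(x) = (x-p)/(1-p) on [p,1]."""
    return 1 / p if x < p else 1 / (1 - p)

for y in [0.1, 0.5, 0.9]:
    preimages = [p * y, p + (1 - p) * y]    # one preimage in each branch
    total = sum(1 / J(x) for x in preimages)
    assert abs(total - 1.0) < 1e-12         # condition (18): sum of 1/J over preimages is 1
```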

Theorem 7.4. Let T : X → X be an expanding map and J : X → R strictly
positive and Hölder-continuous, the Jacobian of η ∈ M(T). Consider ψ = −log J;
then the equilibrium state μ_ψ = η, h is constant equal to 1 and P(−log J) = 0.

Proof. From (18) and condition (2) of Theorem 7.2, h ≡ 1 and λ = 1 in the last
theorem. Hence P(−log J) = 0. •

Theorem 7.5. Suppose ψ is Hölder continuous, μ_ψ is the equilibrium state asso-
ciated with ψ and h is the eigenfunction associated with λ in Theorem 7.2; then
the Jacobian J_ψ of the probability μ_ψ is given by:

J_ψ(x) = λ e^{−ψ(x)} h(T(x)) / h(x).   (19)

Remark. It follows from the last expression that

ψ(x) − (−log J_ψ(x)) = log(h(T(x))) − log h(x) + log λ.   (20)

That is, ψ and −log J_ψ satisfy the homology criterium (Proposition 4.1) and
therefore they determine the same equilibrium state, that is μ_ψ = μ_{−log J_ψ}. Re-
member that P(−log J_ψ) = log 1 = 0.
It follows from the last claims and from expression (17) applied to −log J_ψ
(with λ = 1 and h ≡ 1; see (4) in Theorem 7.2) that the equilibrium state μ_ψ can
be obtained in the following way:

μ_ψ = lim_{n→∞} Σ_{T^n(y)=x} e^{−log J_ψ(y) − log J_ψ(T(y)) − ... − log J_ψ(T^{n−1}(y))} δ_y   (21)

= lim_{n→∞} Σ_{T^n(y)=x} ( J_ψ(y) J_ψ(T(y)) ... J_ψ(T^{n−1}(y)) )^{−1} δ_y.   (22)

Hence from λ and h one can obtain μ_ψ as the limit of a sum of weights placed
at the preimages of a point x ∈ X (J_ψ is given by (19)).

Example. We will consider now the example mentioned in section 4, just after
Definition 4.1. In fact we can analyze a more general example where we will be
able to exhibit the equilibrium probability explicitly. Consider p(+,+), p(+,−),
p(−,+) and p(−,−) non-negative numbers such that p(+,+) + p(+,−) = 1 and p(−,+)
+ p(−,−) = 1. These numbers p(i,j), i, j ∈ {+,−}, express the probability of having
spin j at the right of spin i in the lattice Z.
Consider the matrix

A = ( p(+,+)  p(−,+) )
    ( p(+,−)  p(−,−) )

It can be shown [15] that this matrix A has the value 1 as its largest eigenvalue (this
result is known in the usual textbooks on Matrix Theory as the Perron-Frobenius
Theorem) and we will denote by (p(+), p(−)) the normalized eigenvector associated
to the eigenvalue 1, that is:

A(p(+), p(−)) = (p(+), p(−)),   p(+) + p(−) = 1.

Now we can define a measure μ on cylinders (and then extend it to the more
general class of Borel sets) by:

μ({z | z_0 = i_0, z_1 = i_1, ..., z_n = i_n}) = p(i_0) p(i_0, i_1) p(i_1, i_2) ... p(i_{n−1}, i_n),

n ∈ N, i_0, i_1, i_2, ..., i_n ∈ {+,−}. It is quite easy to see that, considering in Theorem
7.2 the potential ψ constant on each one of the four cylinders and given by:
a) ψ(z) = log p(+,+) ∀ z ∈ (+,+),
b) ψ(z) = log p(+,−) ∀ z ∈ (+,−),
c) ψ(z) = log p(−,+) ∀ z ∈ (−,+) and
d) ψ(z) = log p(−,−) ∀ z ∈ (−,−),
the eigenfunction h is constant equal to 1 and λ equal to 1. It is not difficult to
see that the measure μ given above satisfies equation (3) in Theorem 7.2 (see
also Definition 7.2), that is L_ψ* μ = μ (first show that L_ψ* μ(B) = μ(B) for the
cylinders B depending on the first two coordinates, then for those depending on three
coordinates, and so on). Therefore μ is the equilibrium state for the ψ given
above.
This example shows that the Ruelle Operator is in fact an extension of the
Perron-Frobenius Operator of Matrix Theory (finite dimension) to the infinite
dimensional space of functions.
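A numerical sketch of this example with hypothetical transition numbers p(+,+) = 0.8, p(+,−) = 0.2, p(−,+) = 0.3, p(−,−) = 0.7 (any choice satisfying the two stated normalizations would do):

```python
import numpy as np

A = np.array([[0.8, 0.3],                  # A = ( p(+,+)  p(-,+) )
              [0.2, 0.7]])                 #     ( p(+,-)  p(-,-) )

vals, vecs = np.linalg.eig(A)
k = vals.real.argmax()
assert abs(vals[k].real - 1.0) < 1e-12     # largest eigenvalue is 1 (Perron-Frobenius)

pv = vecs[:, k].real
pv = pv / pv.sum()                         # normalize so that p(+) + p(-) = 1
assert np.allclose(A @ pv, pv)             # A (p(+), p(-)) = (p(+), p(-))
assert pv.min() > 0                        # a genuine probability vector
```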

The Jacobian of the measure μ is piecewise constant, constant on each
cylinder (see Theorem 7.5):

J(z) = e^{−ψ(z)} = p(i,j)^{−1}, ∀ z ∈ (i,j), i, j ∈ {+,−}.

The above described example includes the one we mentioned before in section
4.

Theorem 7.6. Suppose T is a continuous map from X to X, X is a compact
metric space and h(T) is finite. Consider ν a finite signed measure on the Borel
σ-algebra of X. Then the following properties are equivalent:

(a) ν ∈ M(T)

and

(b) ∀ φ ∈ C(X), < φ, ν > ≤ P(φ).   (23)

Proof. (a) → (b)

By definition of Pressure, < φ, ν > ≤ P(φ), because ν ∈ M(T) and h(ν) ≥ 0.

(b) → (a)
Suppose ν satisfies (b); then we will show first that ν is a measure, that is,
for any non-negative continuous function φ, < φ, ν > ≥ 0.
Consider φ ∈ C(X) such that φ(x) ≥ 0, ∀ x ∈ X. Then, given n ∈ N and b > 0,

∫ (φ(x) + b) n dν(x) ≥ −P(−(φ + b) n)

by assumption (b). By definition of pressure and from the fact that φ is non-negative,

−P(−(φ + b) n) = − sup_{ρ∈M(T)} { h(ρ) − ∫ (φ(x) + b) n dρ(x) } ≥
−(h(T) − inf_{x∈X} {(φ(x) + b) n}) ≥ −h(T) + nb.

For large n the last expression is positive. As b was arbitrary, it follows that
∫ φ(x) dν(x) = < φ, ν > ≥ 0. Hence ν is a measure. Now we will show that ν is a
probability, that is, that ν(X) = 1.
For n ∈ Z, ∫ n dν(x) = n ν(X) ≤ P(n) = h(T) + n; therefore ν(X) ≤
h(T)/n + 1 if n > 0, and ν(X) ≥ h(T)/n + 1 if n < 0.
Now letting n go to ∞ in the first expression and n to −∞ in the second, we
conclude that ν(X) = 1.
This means that ν is a probability. Finally we will show that ν ∈ M(T), that
is, we will show that for any φ ∈ C(X), ∫ φ(x) dν(x) = ∫ φ(T(x)) dν(x). In other
words, we have to show that < φ ∘ T − φ, ν > = 0.
For a given n ∈ Z, n < φ ∘ T − φ, ν > ≤ P(n(φ ∘ T − φ)) by assumption (b). Now,
using the criterium of homology, we have that the last term is P(0) = h(T). Hence
< φ ∘ T − φ, ν > ≤ h(T)/n if n > 0, and < φ ∘ T − φ, ν > ≥ h(T)/n if n < 0.
Now letting n go to ∞ in the first expression and n to −∞ in the last expression,
we conclude that < φ ∘ T − φ, ν > = 0. Thus the Theorem is proved. •

The Pressure P(ψ) is a continuous function of ψ (see [18]); one could ask whether
the entropy h(ν) is continuous in ν ∈ M(T), that is, whether

w_n ∈ M(T)

converging weakly to ν (see Definition 2.9) implies lim_{n→∞} h(w_n) = h(ν).

An equilibrium state μ_ψ can be obtained as a limit of finite sums of Dirac
measures on periodic orbits of arbitrarily large period. We did not prove this
fact, but from the expression (17) in this paragraph (in fact expression (17) is for
preimages and not for periodic orbits) it is quite reasonable to believe that the
above claim is true (see the remark before Proposition 2.2).
Another reason supporting the above claim is the fact that for an expanding
map the periodic orbits are dense in the support of any invariant measure (see [13])
(see Proposition 2.2 for a proof in the case T is the shift).
The entropy of an invariant measure with support on a periodic orbit is zero
(see the example after Theorem 3.1); therefore, as the entropy of an equilibrium state
is positive (Theorem 7.2 (6)), one concludes that the entropy is not continuous.
The entropy can jump up in the limit.
The entropy, however, cannot jump down in the limit, as stated in Theorem
7.8. We first need to state more precisely what we mean by that.

Definition 7.7. A function F on a space M is upper-semicontinuous at v if for
any convergent sequence w_n ∈ M, n ∈ N, such that lim_{n→∞} w_n = v ∈ M,

lim sup_{n→∞} F(w_n) ≤ F(v).

Theorem 7.7. (see [18] for a proof) Suppose T is a continuous map from X to X,
where X is a compact metric space, and that h(T) = sup_{ν∈M(T)} {h(ν)} is finite.
For a given probability ν ∈ M(T) the following statements are equivalent:
(a) h(ν) = inf_{φ∈C(X)} {P(φ) − < φ, ν >};
(b) the entropy is upper-semicontinuous at ν.

Theorem 7.8. (see [18] for a proof) For expanding systems the entropy is upper-
semicontinuous at any probability ν ∈ M(T).

Remark. From the two results presented above one can conclude that a measure
ν is invariant for an expanding map T if and only if

h(ν) = inf_{φ∈C(X)} {P(φ) − < φ, ν >} = − sup_{φ∈C(X)} {< φ, ν > − P(φ)}.   (24)

Therefore the entropy is minus the Legendre Transform of the Pressure.

Remember that the dual of C(X) is S(X) and that Pressure is defined for
continuous functions and entropy for elements of M(T) ⊂ S(X).
Proposition 6.3 claims that in the finite dimensional case the Legendre trans-
form is an involution, that is, g** = g. Therefore, one could also expect that
the Legendre transform of minus the entropy should be the Pressure. This is so
because, by definition,

P(ψ) = sup_{ν∈M(T)} {< ψ, ν > − (−h(ν))}.

The disturbing point in the above expression is that we are taking the supremum
over the smaller set M(T) and not over the dual of C(X), that is, the set S(X). If we
define the entropy of a signed measure η by

h(η) = inf_{ψ∈C(X)} {P(ψ) − < ψ, η >}

as in (24), then h(η) < 0 for η ∈ S(X) − M(T) (see Theorem 7.6).
Hence we can finally state that:

P(ψ) = sup_{ν∈S(X)} {< ψ, ν > − (−h(ν))},   (25)

because the entropy of non-invariant measures does not interfere in the supremum,
and the analogy with the finite dimensional case is complete.

For results about Large Deviation properties in this setting (level-2 large de-
viations) we refer the reader to [8]. In the next paragraph we will consider large
deviation properties in another setting (level-1 large deviations). The ter-
minology of level-1 and level-2 is explained in more detail in [7]. The reference
[7] is an excellent source of results on large deviations, but does not consider the
entropy (Kolmogorov-Shannon entropy) and pressure as we are doing here.
We will repeat Definition 6.3, but now for the infinite dimensional case.

Definition 7.8. For a given convex function K from C(X) to R, we call a signed
measure μ ∈ S(X) (the dual of C(X)) a subdifferential of K at the value η, and
write μ ∈ ∂K(η), if the following is true: for any ψ ∈ C(X),

K(ψ) ≥ K(η) + < ψ − η, μ >.

Notation. As the pressure P(ψ) is convex in ψ, we can consider the above defini-
tion for the pressure, and we will denote the subset of signed measures μ that are
subdifferentials of P at the value η by t(η). In other words,

t(η) = ∂P(η) = {μ ∈ S(X) | P(ψ) ≥ P(η) + ∫ (ψ(x) − η(x)) dμ(x), ∀ ψ ∈ C(X)}.   (26)

Remember that for a continuous function ψ, the set of probabilities μ such
that P(ψ) = h(μ) + ∫ ψ(x) dμ(x) is called the set of equilibrium measures. The
main Theorem stated in the beginning of this section is that for an expanding map
T and a Hölder continuous function ψ, equilibrium states exist and are unique.

Theorem 7.9. (see [18]) Suppose T is an expanding map such that h(T) is finite.
If ψ is a continuous function on X, then t(ψ) is the set of equilibrium states for ψ.
The set t(ψ) is not empty.

The next result improves the claim that for expanding systems the subdiffer-
ential of the pressure P at .,P is J.L.p (that is, 6P( .,P) = J.L.p ).

Theorem 7.10. Suppose that T is an expanding map. Given f and g Holder


continuous functions, the function

p(t) = P(f + tg)


is convex and real analytic in t. The value p'(t) is equal to I g(x)dJ.LJ+tg(x).
Proof. We refer the reader to [15], [17] for the proof of the differentiability of $p(t)$. We will assume that $p$ is differentiable and we will show that $p'(t) = \int g(x)\, d\mu_{f+tg}$.

We will reduce the question to its simplest form in order to simplify the argument.

First note that it is enough to show that $\frac{d}{dt} P(f + tg)\big|_{t=0} = \int g\, d\mu_f$. For the general case consider $P((f + tg) + sg)$ and take the derivative at $s = 0$.

Another simplification is that we can substitute $f$ by $-\log J$, where $J$ is the Jacobian of $\mu_f$. In fact (see the Remark after Theorem 7.5)

$$(f + tg) - (-\log J + tg) = P(f) + \log(h \circ T) - \log h,$$

and therefore $f + tg$ and $-\log J + tg$ are homologous. Hence $\mu_{f+tg} = \mu_{-\log J + tg}$ and furthermore $P(f + tg) = P(-\log J + tg) + P(f)$. Taking the derivative with respect to $t$ on both sides of the last expression:

$$\frac{d}{dt} P(f + tg) = \frac{d}{dt} P(-\log J + tg).$$

Note that from (22), for any $\phi$ the integral

$$\int \phi\, d\mu_{-\log J} = \lim_{n \to \infty} \sum_{T^n(y) = x_0} \phi(y)\, e^{-\sum_{j=0}^{n-1} \log J(T^j(y))} \qquad (27)$$

$$= \lim_{n \to \infty} \mathcal{L}^n_{-\log J}\, \phi(x_0) \qquad (28)$$

where $x_0$ is a certain point in $X$.


We will use the above property very soon.

One of the Remarks after Theorem 7.2 states that (see (15))

$$P(-\log J + tg) = \lim_{n \to \infty} \frac{1}{n} \log \sum_{T^n(y)=x_0} e^{\sum_{j=0}^{n-1} (-\log J + tg)(T^j(y))},$$

hence, differentiating term by term (the fact that this is possible is a crucial step that will not be proved here [15], [17]) one obtains:

$$\frac{d}{dt} P(-\log J + tg) = \lim_{n \to \infty} \frac{1}{n}\, \frac{\sum_{T^n(y)=x_0} \sum_{j=0}^{n-1} g(T^j(y))\, e^{\sum_{j=0}^{n-1} (-\log J + tg)(T^j(y))}}{\sum_{T^n(y)=x_0} e^{\sum_{j=0}^{n-1} (-\log J + tg)(T^j(y))}}.$$

Now in the last expression considering $t = 0$ we obtain

$$\frac{d}{dt} P(-\log J + tg)\Big|_{t=0} = \lim_{n \to \infty} \frac{1}{n}\, \frac{\sum_{T^n(y)=x_0} \sum_{j=0}^{n-1} g(T^j(y))\, e^{-\sum_{j=0}^{n-1} \log J(T^j(y))}}{\sum_{T^n(y)=x_0} e^{-\sum_{j=0}^{n-1} \log J(T^j(y))}}. \qquad (29)$$

Claim. $\displaystyle\sum_{T^n(y)=x_0} e^{-\sum_{j=0}^{n-1} \log J(T^j(y))} = 1$, $\forall n \in \mathbb{N}$, $\forall x_0 \in X$.

Proof of the Claim. The proof is by induction. The claim is true for $n = 1$ by (18). Suppose the claim is true for $n$; then we will prove that the claim is true for $n + 1$.

In fact

$$\sum_{T^{n+1}(y)=x_0} e^{-\sum_{j=0}^{n} \log J(T^j(y))} = \sum_{T(z)=x_0} e^{-\log J(z)} \sum_{T^n(y)=z} e^{-\sum_{j=0}^{n-1} \log J(T^j(y))} = \sum_{T(z)=x_0} e^{-\log J(z)} \cdot 1 = 1.$$

In the last two equalities we used the fact that the claim is true for $n$ and for $1$. This is the end of the proof of the Claim.
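The normalization in the Claim is easy to check numerically. The sketch below (our own illustration, not from the text) takes the doubling map $T(x) = 2x \bmod 1$ and a branch-dependent Jacobian $J$ normalized as in (18), so that $\sum_{T(z)=x} e^{-\log J(z)} = 1$, and verifies that the weighted sum over the $n$-th preimages stays equal to 1, using the same recursion as the induction above.

```python
# Numerical check of the Claim for the doubling map T(x) = 2x mod 1 (d = 2).
# The weights P and 1 - P play the role of 1/J on the two inverse branches,
# so that the sum of 1/J over the first preimages equals 1, as in (18).
P = 0.3  # an arbitrary branch weight; any value in (0, 1) works

def inv_branches(x):
    # the two solutions z of T(z) = x, paired with their weights 1/J(z)
    return [(x / 2.0, P), ((x + 1.0) / 2.0, 1.0 - P)]

def claim_sum(x0, n):
    # sum over the 2^n solutions y of T^n(y) = x0 of
    #   exp(-sum_{j=0}^{n-1} log J(T^j(y))),
    # computed by the recursion S(x, n) = sum_{T(z)=x} (1/J(z)) S(z, n-1)
    if n == 0:
        return 1.0
    return sum(w * claim_sum(z, n - 1) for z, w in inv_branches(x0))
```

Up to floating-point rounding the sum is exactly $(P + (1-P))^n = 1$ for every base point and every $n$.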

Now, we return to the proof of the Theorem. It follows from the Claim and (29), (27), (28) (taking $\phi = g \circ T^j$) that:

$$\frac{d}{dt} P(-\log J + tg)\Big|_{t=0} = \lim_{n \to \infty} \frac{1}{n} \sum_{T^n(y)=x_0} \sum_{j=0}^{n-1} g(T^j(y))\, e^{-\sum_{j=0}^{n-1} \log J(T^j(y))} \qquad (30)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} \mathcal{L}^n_{-\log J}\big(g \circ T^j\big)(x_0).$$

As the convergence in Theorem 7.2 (4) is uniform (and the eigenfunction $h$ of Theorem 7.2 is constant equal to $1$ for $\psi = -\log J$, by Theorem 7.4), then for any $\epsilon$ there exists $N > 0$ such that for any $n \in \mathbb{N}$, $n > N$, and $z \in X$,

$$\Big| \mathcal{L}^n_{-\log J}\, g(z) - \int g(x)\, d\mu_{-\log J}(x) \Big| \le \epsilon.$$

Therefore, from (30), considering $z$ varying under the form $T^j(x_0)$,

$$\frac{d}{dt} P(-\log J + tg)\Big|_{t=0} = \int g(x)\, d\mu_{-\log J}(x).$$


Finally, we conclude that:

$$\frac{d}{dt} P(f + tg)\Big|_{t=0} = \frac{d}{dt} P(-\log J + tg)\Big|_{t=0} = \int g(x)\, d\mu_{-\log J}(x) = \int g(x)\, d\mu_f(x) \qquad (31)$$

and this is the end of the proof of the Theorem. •

Theorem 7.11. (see [18]) Suppose $T$ is an expanding map on $X$ and $h(T)$ is finite; then there exists a dense subset $B$ of $C(X)$ such that for $\psi$ in $B$ there exists just one equilibrium state for $\psi$, that is, the cardinality of $t(\psi)$ is 1.

8. Pressure and Large Deviation

In this paragraph we will show a result relating large deviations with pressure. It is possible to obtain very precise results about the deviation function for Hölder functions and the maximal measure of an expanding map.

Notation. Let $z_0$ be a point of $X$, and for each $n \in \mathbb{N}$ denote by $z(n, i, z_0)$, $i \in \{1, 2, 3, \ldots, d^n\}$, the $d^n$ solutions of the equation $T^n(z) = z_0$.

We know that the maximal entropy measure (see Theorem 7.1) $\mu$ can be obtained as

$$\mu = \lim_{n \to \infty} d^{-n} \sum_{i=1}^{d^n} \delta_{z(n, i, z_0)}.$$
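For a concrete instance one can take $T(x) = dx \bmod 1$, whose $n$-th preimages of $z_0$ are the points $(z_0 + k)/d^n$, $k = 0, \ldots, d^n - 1$, and whose maximal entropy measure is Lebesgue measure. The sketch below (an illustration of ours, not part of the text; the function names are hypothetical) approximates an integral against $\mu$ by the uniform average over the points $z(n, i, z_0)$.

```python
# Approximating the maximal entropy measure of T(x) = d*x mod 1 by the
# uniform average over the d^n preimages z(n, i, z0) of a base point z0.
def preimages(z0, n, d=2):
    dn = d ** n
    return [(z0 + k) / dn for k in range(dn)]

def integral_against_mu(f, z0=0.5, n=12, d=2):
    zs = preimages(z0, n, d)
    return sum(f(z) for z in zs) / len(zs)

# For the doubling map mu is Lebesgue measure, so the average should
# approach the ordinary integral of f over [0, 1]; here f(x) = x^2, so 1/3.
estimate = integral_against_mu(lambda x: x * x)
```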

Notation. In this section we will denote by $\mu$ the maximal entropy measure (see Theorem 7.1).

Given $0 < \gamma < 1$, denote by $C(\gamma)$ the space of Hölder-continuous real-valued functions on $X$ endowed with the metric

$$\| g \| = \| g \|_0 + \sup_{x \neq y} \frac{| g(x) - g(y) |}{| x - y |^\gamma}$$

where $\| g \|_0$ is the usual supremum norm.

Theorem 8.1. Let $T$ be an expanding map, and $g \in C(\gamma)$; then

$$\lim_{n \to \infty} \frac{1}{n} \log \int e^{\sum_{j=0}^{n-1} g(T^j(x))}\, d\mu(x) = P(g) - \log d,$$

where $\mu$ is the maximal measure.

Proof. Let $g$ be a Hölder-continuous function on the compact set $X$.

Let us consider a fixed $z_0 \in X$ and denote by $z(n, i)$ the $z(n, i, z_0)$, $n \in \mathbb{N}$ and $i \in \{1, 2, 3, \ldots, d^n\}$.

For a given $n \in \mathbb{N}$,

$$\int e^{\sum_{j=0}^{n-1} g(T^j(x))}\, d\mu(x) = \lim_{m \to \infty} d^{-m} \sum_{k=1}^{d^{m-n}} \sum_{i=1}^{d^n} \exp\Big( \sum_{j=0}^{n-1} g\big(T^j(z(n, i, z(m - n, k)))\big) \Big).$$

From [3] (at this moment the hypotheses of expansivity and Hölder continuity are essential), there exist constants $C_1$, $c_1$ such that for $n$ large enough

$$c_1 \sum_{i=1}^{d^n} \exp\Big( \sum_{j=0}^{n-1} g(T^j(z(n, i, z))) \Big) \le \sum_{i=1}^{d^n} \exp\Big( \sum_{j=0}^{n-1} g(T^j(z(n, i))) \Big) \le C_1 \sum_{i=1}^{d^n} \exp\Big( \sum_{j=0}^{n-1} g(T^j(z(n, i, z))) \Big) \qquad (I)$$



for any $z \in X$.


Therefore, there exist constants $c_2, C_2 > 0$ such that

$$c_2\, d^{-n} \sum_{i=1}^{d^n} e^{\sum_{j=0}^{n-1} g(T^j(z(n, i)))} \le \int e^{\sum_{j=0}^{n-1} g(T^j(x))}\, d\mu(x) \le C_2\, d^{-n} \sum_{i=1}^{d^n} e^{\sum_{j=0}^{n-1} g(T^j(z(n, i)))}.$$

From this, it follows that

$$\lim_{n \to \infty} \frac{1}{n} \log \int e^{\sum_{j=0}^{n-1} g(T^j(x))}\, d\mu(x) = -\log d + \lim_{n \to \infty} \frac{1}{n} \log \sum_{i=1}^{d^n} e^{\sum_{j=0}^{n-1} g(T^j(z(n, i)))}.$$

Now from the expression of the pressure that appears as a Remark after Theorem 7.2 (see expression 7.15) the claim of the theorem is proved. •

Remark. Consider the free energy $c(t)$ of a continuous function $g$ and the maximal measure $\mu$. Suppose $g$ is Hölder-continuous; then from Definition 5.3 of the free energy $c(t)$, $t \in \mathbb{R}$, one concludes from the last theorem that $P(tg) = c(t) + \log d$. Remember that the free energy depends on the function and on the measure we are considering.

Theorem 8.2. The free energy $c(t)$ for a Hölder-continuous function $g$ and the maximal measure $\mu$ satisfies

$$c(t) = P(tg) - \log d. \qquad (32)$$

Therefore $c(t)$ is differentiable and $g$ has the exponential convergence property.



Proof. If $c(t)$ is differentiable, then $g$ has the exponential convergence property for $\mu$ (see Proposition 6.8). Since $c(t) = P(tg) - \log d$ (from the last theorem) and $P(tg)$ is differentiable (Theorem 7.10), the result follows. •

It is quite natural to ask if one can obtain the deviation function

$$I(v) = \sup_{t \in \mathbb{R}} \{ tv - c(t) \}$$

from results of Thermodynamic Formalism. The next theorem answers this question.

Theorem 8.3. Suppose $g$ is Hölder-continuous, $\mu$ is the maximal measure and $p(t) = P(tg)$, $t \in \mathbb{R}$. Then the deviation function is

$$I(v) = \log d - h(\mu_{t_0 g}), \qquad (33)$$

where $\mu_{t_0 g} = \mu_\psi$ is the equilibrium state for $\psi = t_0 g$ and $t_0$ satisfies $p'(t_0) = v$.

Proof. By definition

$$I(v) = \sup_{t \in \mathbb{R}} \{ tv - c(t) \} = \sup_{t \in \mathbb{R}} \{ tv - (P(tg) - \log d) \} = \sup_{t \in \mathbb{R}} \{ tv - p(t) \} + \log d.$$

It is easy to see that $p(t)$ is convex, and from Theorem 7.10 $p(t)$ is also differentiable. Suppose $t_0$ is the unique value such that $p'(t_0) = v$; then from the last theorem and the definition of pressure

$$I(v) = \sup_{t \in \mathbb{R}} \{ tv - p(t) \} + \log d = t_0 v - p(t_0) + \log d$$

$$= t_0 v - h(\mu_{t_0 g}) - \int t_0\, g(x)\, d\mu_{t_0 g}(x) + \log d.$$

Now from Theorem 7.10, $v = p'(t_0) = \int g(x)\, d\mu_{t_0 g}(x)$, and the claim of the Theorem follows. •

In conclusion, for $g \in C(\gamma)$ and the maximal measure $\mu$ one can obtain the value of $I(v)$, $v \in \mathbb{R}$, by $I(v) = \log d - h(\mu_{t_0 g})$, where $t_0$ satisfies $p'(t_0) = v$.
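As a concrete illustration (ours, under simplifying assumptions not made in the text): for the doubling map ($d = 2$) with $g = +1$ on $[0, 1/2)$ and $g = -1$ on $[1/2, 1)$, Birkhoff sums under the maximal measure behave like sums of fair $\pm 1$ coin flips, and $c(t) = P(tg) - \log 2 = \log \cosh t$. The sketch below computes $I(v)$ by maximizing $tv - c(t)$ over a grid of $t$ values and compares it with the classical Cramér rate function for this case.

```python
import math

def c(t):
    # free energy for g = +-1 on the two branches of the doubling map:
    # c(t) = P(tg) - log 2 = log((e^t + e^-t)/2) = log cosh t
    return math.log(math.cosh(t))

def rate_function(v, t_max=20.0, steps=40000):
    # I(v) = sup_t { t v - c(t) }, approximated on a uniform grid of t values
    best = float("-inf")
    for i in range(steps + 1):
        t = -t_max + 2.0 * t_max * i / steps
        best = max(best, t * v - c(t))
    return best

def cramer(v):
    # closed-form rate function for fair +-1 coin flips, |v| < 1;
    # by (33) it also equals log 2 - h(mu_{t0 g}) with tanh(t0) = v
    return ((1 + v) / 2) * math.log(1 + v) + ((1 - v) / 2) * math.log(1 - v)
```

The numerical Legendre transform agrees with the closed form to the accuracy of the grid.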

Remark. More general results about large deviations and the free energy of Hölder functions $g$ and equilibrium states $\mu_g$ can be obtained, but we will not consider such questions here. We refer the reader to [5], [8], [9] for interesting results on this subject. Theorem 3 in [8] is not correctly stated, but it is not necessary for the proof of Theorem 7, the main result of [8].

References

[1] Billingsley, P., Probability and Measure, John Wiley (1979).


[2] Billingsley, P., Ergodic Theory and Information, John Wiley (1965).
[3] Bowen, R., Equilibrium States and Ergodic Theory of Anosov Diffeomor-
phisms, Lecture Notes in Math., Springer Verlag 470 (1975).
[4] Brin, M., A. Katok, On Local Entropy, Geometric Dynamics - Lecture Notes
in Math., Springer Verlag 1007, 30-38 (1983).
[5] Collet, P., J. Lebowitz, A. Porzio, The Dimension Spectrum of Some Dynam-
ical Systems, Journal of Statistical Physics 47, 609-644 (1987).
[6] Devaney, R., An Introduction to Chaotic Dynamical Systems, Benjamin
(1986).
[7] Ellis, R., Entropy, Large Deviations and Statistical Mechanics, Springer-Verlag
(1985).
[8] Lopes, A., Entropy and Large Deviation, Nonlinearity 3(2), 527-546 (1990).
[9] Lopes, A., Dimension Spectra and a Mathematical Model for Phase Transi-
tions, Advances in Applied Mathematics 11(4), 475-502 (1990).
[10] Lopes, A., A First Order Level-Two Phase Transition, Journal of Statistical
Physics 60(3/4), 395-411 (1990).
[11] Lopes, A., The Dimension Spectrum of the Maximal Measure, SIAM Journal
on Mathematical Analysis 20(5), 1243-1254 (1989).
[12] Lopes, A., The Zeta Function, Non-Differentiability of Pressure and the Crit-
ical Exponent of Transition, Advances in Mathematics, to appear, preprint
1990.
[13] Mañé, R., Ergodic Theory and Differentiable Dynamics, Springer-Verlag
(1987).

[14] Mañé, R., On the Hausdorff Dimension of the Invariant Probabilities of Ra-
tional Maps, Lecture Notes in Math. 1331, 86-116, Springer-Verlag (1990).
[15] Parry, W., M. Pollicott, Zeta Functions and the Periodic Orbit Structure of
Hyperbolic Dynamics, Asterisque 187-188 (1990).
[16] Rudin, W., Real and Complex Analysis, McGraw-Hill (1974).
[17] Ruelle, D., Thermodynamic Formalism, Addison-Wesley (1978).
[18] Walters, P., An Introduction to Ergodic Theory, Springer-Verlag (1981).
FORMAL NEURAL NETWORKS: FROM SUPERVISED TO
UNSUPERVISED LEARNING

JEAN-PIERRE NADAL
Laboratoire de Physique Statistique*
École Normale Supérieure,
24, rue Lhomond, F-75231 Paris Cedex 05
France

ABSTRACT. This lecture is on the study of formal neural networks. The emphasis will be put on the bridges that exist between the analyses of the main tasks and architectures that are usually considered: auto-associative learning by an attractor neural network, hetero-associative learning by a feedforward net, learning a rule by example, and unsupervised learning. In particular a duality between two architectures will be shown to provide a tool for comparing supervised and unsupervised learning.

1. Introduction

In the study of formal neural networks (for a general review see [21], [31]), one usually distinguishes two main types of learning paradigms, and two main types of architectures. For the learning tasks:
• Supervised learning (the desired output is given for a set of patterns). There
are two sub-families:
- learning by heart (that is realizing an associative memory)
- learning a rule by example: the set of input-output pairs to be learned
is a set of examples illustrating a rule. One expects the network to
generalize, that is to give a correct output when a new (unlearned) pattern
is presented.

* Laboratoire associé au C.N.R.S. (U.R.A. 1306) et aux universités Paris VI et Paris VII.
E. Goles and S. Martinez (eds.), Cellular Automata, Dynamical Systems and Neural Networks, 147-166.
© 1994 Kluwer Academic Publishers.

• Unsupervised learning (no desired output is specified). The network "self-organizes" as input patterns are sequentially presented.
and for the architectures:
• attractor neural networks (ANN), that is networks with a large amount of
feedback connections - possibly with every neuron receiving inputs from every
other neuron
• feedforward networks made of layers, each layer receiving inputs from and
only from the preceding layer. The simplest feedforward net is the perceptron
(one input layer, one output layer, no "hidden" layer).
There are intermediate architectures and intermediate learning schemes, but
it is convenient and useful to consider the above extreme cases. The aim of this
lecture is to point out the bridges that exist between the analyses of the different
tasks: in many cases the study of one given learning task on a given architecture is
related to the study of another task on the same or another (but precisely related)
architecture. In this game the use of information theoretic concepts will be shown
to be most useful.
The first 3 sections are devoted to supervised learning. In section 2 I will review the main results on the performance of a perceptron. I will also show why this very simple architecture tells us something about the behavior of more complicated nets, such as multilayer networks. Then in section 3 I recall how one can relate the study of an associative memory realized by an ANN to a hetero-associative task for a simple perceptron. Then in section 4 I review the perceptron-type algorithms that can be used either for an associative memory or for learning a rule by example, and I indicate how more complicated architectures can be generated with the use of such algorithms.

In section 5 I will exhibit a duality between two perceptrons, allowing one to relate unsupervised and supervised learning tasks. This last part is based on recent work done in collaboration with N. Parga [29], [30].

2. Supervised Learning: Feedforward Neural Networks

2.1 THE PERCEPTRON

Fifty years ago McCulloch and Pitts defined the formal neuron as a binary element [26]. What they showed is that, with such a caricature of the biological neuron, one can build a universal Turing machine. However this positive result says nothing on how to use (formal) neurons in order to learn a task. But the basic ideas on how learning could take place were proposed at about the same time: in 1949 the neuropsychologist D. O. Hebb published a book [20] where he formulated hypotheses explaining how associative learning might occur in the brain. In fact almost all neural network modeling has its roots in this pioneering work of Hebb.

In all these models one basic postulate is that the properties of the synapses might be modified during learning. This was exploited during the 60's in the study of the simplest possible neural networks, in particular the perceptron [28]. This network has an input layer directly connected to an output layer. The couplings (synaptic efficacies) between the two layers are adaptable elements (in the original design of the perceptron there is a preprocessing layer, but of fixed architecture and couplings: one can thus ignore it for all what follows). The simplest perceptron has only one output unit, as on figure 1.

Figure 1. The simplest perceptron: a single formal neuron.

Let me make my notation precise for the case of the perceptron with one binary output. Its state $\sigma$ takes, say, the values 0 or 1. There are $N$ input units, $N$ couplings $J = \{J_1, \ldots, J_N\}$ and a threshold $\theta$. Inputs may be continuous or discrete. In a supervised learning task, one is given a set $X$ of $p$ input patterns,

$$X = \{ \xi^\mu,\ \mu = 1, \ldots, p \} \qquad (1)$$

and the set of the desired outputs,

$$\mathcal{T} = ( \tau^\mu = 0, 1,\ \mu = 1, \ldots, p )$$

which have to be learned by the perceptron. For a given choice of the couplings, the output $\sigma^\mu$ when the $\mu$th pattern is presented is given by:

$$\sigma^\mu = \Theta\Big( \sum_{j=1}^{N} J_j \xi_j^\mu - \theta \Big) \qquad (2)$$

where $\Theta(h)$ is 1 for $h > 0$ and 0 otherwise. Learning thus means choosing (computing) the couplings and the threshold in such a way that the desired output configuration $\vec{\tau}$ is equal - or as close as possible - to the actual output $\vec{\sigma} = ( \sigma^\mu = 0, 1,\ \mu = 1, \ldots, p )$. In the next section I will consider the ability of the perceptron to learn.
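As a minimal illustration of (2) (our own sketch, not from the text; the function name is hypothetical), the following computes the output $\sigma$ of such a unit; with $J = (1, 1)$ and $\theta = 1.5$ it realizes the AND function of two binary inputs.

```python
def perceptron_output(J, theta, xi):
    # sigma = Theta(sum_j J_j xi_j - theta), with Theta(h) = 1 if h > 0, else 0
    h = sum(Jj * xj for Jj, xj in zip(J, xi)) - theta
    return 1 if h > 0 else 0

# with these couplings the unit computes the AND of its two inputs
and_outputs = [perceptron_output([1.0, 1.0], 1.5, xi)
               for xi in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```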

2.2. THE GEOMETRICAL APPROACH

During the 60's the storage capacity of the perceptron was obtained from geometrical arguments [13]. One considers the space of couplings ($\vec{J} = \{J_j, j = 1, \ldots, N\}$ being considered as a point in an $N$-dimensional space). Then each pattern $\mu$ defines a hyperplane, and the output $\sigma^\mu$ is 1 or 0 depending on which side of the hyperplane the point $\vec{J}$ lies. Hence the $p$ hyperplanes divide the space of couplings into domains (figure 2), each domain being associated to one specific set $\vec{\sigma} = \{ \sigma^1, \ldots, \sigma^p \}$ of outputs. Let us call $\Delta(X)$ the number of domains. Since each $\sigma^\mu$ is either 0 or 1, there are at most $2^p$ different output configurations $\vec{\sigma}$, that is

$$\Delta(X) \le 2^p. \qquad (3)$$

If the patterns are "in general position" (that is, every subset of at most $N$ patterns is linearly independent), then a remarkable result is that $\Delta(X)$ is in fact independent of $X$ and a function only of $p$ and $N$ [13]:

Figure 2. Partition of $J$ space into domains. Here $p = 3$ patterns in $N = 2$ dimensions define 7 domains. For each pattern the arrow points towards the half space of the $J$'s producing an output 1 for this pattern. The resulting code, that is the output configuration $\vec{\sigma} = (\sigma^\mu, \mu = 1, 2, 3)$, is given inside each domain. The output configuration $\vec{\sigma} = (0, 0, 1)$ is not realized.

$$\Delta = \sum_{k=0}^{\min(p, N)} C_p^k \qquad (4)$$

where $C_p^k = \frac{p!}{k!\,(p-k)!}$. In particular

$$\Delta = \begin{cases} 2^p & \text{if } p \le N \\ < 2^p & \text{if } p > N \end{cases} \qquad (5)$$

This means that $N$ is the "Vapnik-Chervonenkis dimension" [39] of the perceptron ($N + 1$ is the first value of $p$ for which $\Delta$ is smaller than $2^p$):

$$d_{VC} = N \qquad (6)$$
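Formula (4) is easy to evaluate numerically (the code below is our own check, not the author's): it reproduces the 7 domains of figure 2 ($p = 3$, $N = 2$), gives $\Delta = 2^p$ whenever $p \le N$, and locates $N + 1$ as the first value of $p$ with $\Delta < 2^p$, i.e. $d_{VC} = N$.

```python
from math import comb

def num_domains(p, N):
    # Delta = sum_{k=0}^{min(p, N)} C_p^k, formula (4)
    return sum(comb(p, k) for k in range(min(p, N) + 1))
```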

If the task is to learn a rule by example, the VC dimension plays a crucial role: generalization will occur if the number of examples $p$ is large compared to $d_{VC}$ [39]. Another important parameter is the asymptotic capacity. In the large $N$ limit, for a fixed ratio

$$\alpha = \frac{p}{N}, \qquad (7)$$
(7)

the fraction of output configurations which are not realized remains vanishingly small for $\alpha$ greater than 1, up to the "critical storage capacity" ([13], [18]) $\alpha_c$,

$$\alpha_c = 2. \qquad (8)$$

This is obtained by considering the asymptotic behavior of $C$,

$$C = \ln \Delta \qquad (9)$$

$$\lim_{N \to \infty} C/N = c(\alpha) = \begin{cases} \alpha & \text{if } \alpha \le 2 = \alpha_c \\ \alpha\, S(1/\alpha) & \text{if } \alpha > 2 \end{cases} \qquad (10)$$

Here (and in the following) logarithms are expressed in base 2, and $S(x)$ is the entropy function (measured in bits):

$$S(x) = -[x \ln x + (1 - x) \ln(1 - x)]. \qquad (11)$$

For large $\alpha$, the above formula gives the asymptotic behavior

$$c(\alpha) \underset{\alpha \to \infty}{\sim} \ln \alpha \qquad (12)$$

In fact $c$ has an important meaning: $c(\alpha)$ is the information capacity of the perceptron. Indeed, $C = \ln \Delta$ is the number of bits needed for specifying one domain out of $\Delta$, hence is the amount of information stored in the couplings when learning an association $(X, \vec{\tau})$, whenever this particular configuration $\vec{\tau}$ does correspond to an existing domain. This gives the obvious result that below $\alpha_c$ the amount of information stored (in bits per synapse) is equal to $\alpha$. But for $\alpha > \alpha_c$, with probability one (in the large $N$ limit) no domain exists for a configuration $\vec{\tau}$ chosen at random, and errors will result. However, it has been shown by G. Toulouse [12] that even above $\alpha_c$, $c(\alpha)$ remains the maximal amount of information that can be stored in the synapses.
One can understand this statement by considering the error rate. Below $\alpha_c$ it is possible to learn without any error. Above $\alpha_c$ errors will occur, and the minimal fraction of errors, $\epsilon$, that can be achieved can be computed by writing that the capacity per synapse $c(\alpha)$ is equal to the amount of information stored per synapse (when there are $p\epsilon$ errors), that is to $\alpha(1 - S(\epsilon))$:

$$\alpha\, S(\epsilon) = \alpha - c(\alpha) \qquad (13)$$

The above formula (13) can be seen as an application of Fano's inequality [10] giving the smallest possible error rate that can be achieved by a communication channel of (Shannon) capacity $C$: the r.h.s. of (13) is the number of bits (per synapse) that cannot be correctly processed, and the l.h.s. is the amount of information needed to specify where the errors are.
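Equations (10), (11) and (13) combine into a small numerical sketch (ours; the function names are not from the text) that solves $\alpha S(\epsilon) = \alpha - c(\alpha)$ for the minimal error fraction by bisection on $[0, 1/2]$, where $S$ is increasing.

```python
from math import log2

def S(x):
    # binary entropy in bits, eq. (11)
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * log2(x) + (1.0 - x) * log2(1.0 - x))

def c(alpha):
    # information capacity per synapse, eq. (10)
    return alpha if alpha <= 2.0 else alpha * S(1.0 / alpha)

def min_error_fraction(alpha):
    # smallest epsilon in [0, 1/2] with S(epsilon) = 1 - c(alpha)/alpha, eq. (13)
    target = 1.0 - c(alpha) / alpha
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if S(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At $\alpha = \alpha_c = 2$ the minimal error fraction vanishes, and it grows with the overload $\alpha$.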

2.3 MULTILAYER PERCEPTRONS: AN OPTIMAL UPPER BOUND

The preceding results for the perceptron appear to be also useful when considering more complex architectures - in fact any learning machine with a binary output. For a general learning machine, the VC dimension and the number of domains are defined as above: $\Delta(X)$ is the number of different possible output configurations $\vec{\sigma}$. In general it will depend on the choice of $X$ (and not only on $p$ and $N$ as for the perceptron). However one can consider its maximal value over all possible choices of $X$:

$$\Delta_m = \max_X \Delta(X) \qquad (14)$$

This maximal value $\Delta_m$ is equal to $2^p$ for $p$ up to some number called the VC (Vapnik-Chervonenkis) dimension, $d_{VC}$ (possibly infinite), and is strictly smaller above. As mentioned above for the perceptron, generalization is guaranteed for $p$ much larger than the VC dimension. Vapnik [39] has shown the remarkable result that $\Delta_m$ is bounded above by $\sum_{k=0}^{\min(p, d_{VC})} C_p^k$. That is, there is an upper bound which is precisely the number of domains of a perceptron having the same VC dimension (i.e. with a number of inputs equal to that value of $d_{VC}$, see (4)). Hence this upper bound is optimal (all learning machines with a given value of the VC dimension satisfy the bound, and equality is realized for at least one of these machines, the perceptron).

To conclude this short section, one sees that the results for the simple perceptron give us some insight on any learning machine if one replaces $N$, the number of couplings, by the VC dimension.

I now turn to ANN, relating their study to that of the perceptron.

3. Supervised Learning: Recurrent Networks

3.1. REMINDER ON ATTRACTOR NEURAL NETWORKS

Hebb also suggested that the associative behavior of the human memory might be the result of a collective behavior. The Attractor Neural Networks as introduced by J. J. Hopfield in 1982 [22] can be seen as a direct formalization of Hebb's ideas. In this model, every neuron is connected to every other neuron, as on figure 3. Each neuron is a linear threshold unit as above. With an asynchronous dynamics, the state of neuron $i$ is computed at time $t + \delta t$ according to

$$\sigma_i(t + \delta t) = \Theta\Big( \sum_{j=1}^{N} J_{ij} \sigma_j(t) - \theta_i \Big) \qquad (15)$$

Figure 3. An Attractor Neural Network. In bold: neuron 1 and the couplings controlling its activity.

In the above dynamics, synaptic noise can be incorporated by replacing the deterministic updating rule by a stochastic one, but I will restrict here to the noiseless

case. When the synaptic efficacies are symmetric, that is

$$J_{ij} = J_{ji} \quad \text{for all } i, j, \qquad (16)$$

then one can associate an "energy" to the dynamics (15) and show that from any initial configuration the network will evolve towards a (possibly local) minimum of the energy. This means that the network behaves like an associative memory: starting from some initial configuration - coding for a stimulus -, the network evolves until it settles down to a fixed point: the stable configuration that is reached is the response of the network to the stimulus; the presented pattern (initial configuration) has been recognized as being the fixed-point pattern. In this context learning is equivalent to imposing as fixed points a given set of patterns. In the Hopfield model, an empirical (Hebbian) rule fixing the couplings as a function of the patterns is chosen. This particular learning scheme leads to symmetric couplings. Using statistical mechanics tools (in particular thanks to the analogy with a spin-glass model), the Hopfield model has been studied [4], as well as many variants of it. Very soon it was recognized that the symmetry condition is not necessary, and that attractors other than fixed points can be considered [2]. One of the best known results is the storage capacity of the Hopfield model: in the large $N$ limit, the maximal number of patterns that can be stored is $p_{\max} = \alpha_c N$, with

$$\alpha_c \approx 0.14. \qquad (17)$$

This means that for $\alpha = p/N$ smaller than $\alpha_c$ the system does behave as an associative memory, with for each stored pattern the existence of a fixed point which is very close to (although not identical to) that pattern. Since 1982 many studies have been devoted to the Hopfield model and its variants [2] [31] [37], with as main result that they do provide associative memory devices, with a storage capacity proportional to the connectivity of the network (that is, to the typical number of neurons to which each neuron is connected; the connectivity is $N$ in the standard Hopfield model). Moreover it has been possible to modify the original model in order to take into account biological constraints, and to consider ANN with more realistic neurons and architectures [3] [31] in such a way that comparison with experiments is becoming possible.
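A minimal simulation of such an attractor network can be sketched as follows (our own illustration; it uses the equivalent $\pm 1$ formulation of the 0/1 neurons of (15), with the standard Hebbian couplings and zero thresholds). A few random patterns are stored well below the $0.14 N$ limit, a corrupted version of one of them is presented, and the deterministic dynamics is iterated until it settles close to the stored pattern.

```python
import random

def hopfield_retrieval(N=120, p=3, flips=12, sweeps=10, seed=0):
    rng = random.Random(seed)
    # p random patterns, stored with the Hebbian rule J_ij = (1/N) sum_mu xi^mu_i xi^mu_j
    patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(p)]
    J = [[0.0 if i == j else sum(xi[i] * xi[j] for xi in patterns) / N
          for j in range(N)] for i in range(N)]
    # present pattern 0 with `flips` bits inverted
    sigma = patterns[0][:]
    for i in rng.sample(range(N), flips):
        sigma[i] = -sigma[i]
    # asynchronous deterministic dynamics: sigma_i <- sign(sum_j J_ij sigma_j)
    for _ in range(sweeps):
        for i in range(N):
            h = sum(J[i][j] * sigma[j] for j in range(N))
            sigma[i] = 1 if h >= 0 else -1
    # overlap with the stored pattern: 1 means perfect retrieval
    return sum(s * x for s, x in zip(sigma, patterns[0])) / N

overlap = hopfield_retrieval()
```

With $p/N = 0.025 \ll \alpha_c$ and 10% of the bits corrupted, the dynamics flows back to (essentially) the stored pattern.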

However these studies do not tell us how good (or bad) the performances of such models are: are there better ways of computing the synaptic efficacies; under which conditions is it possible to learn a given set of patterns?

3.2. FROM AUTO-ASSOCIATIVE TO HETERO-ASSOCIATIVE LEARNING

A first answer to the preceding questions was given by E. Gardner in 1987 [17] [18] in a way I explain now. Instead of choosing a particular rule for computing the couplings, one may first ask whether there exists at least one set of couplings which stabilizes the patterns. A simple remark allows one to get the answer. Looking for a network that effectively memorizes a set of $p$ patterns, $\{ \xi^\mu = (\xi_j^\mu, j = 1, \ldots, N),\ \mu = 1, \ldots, p \}$ (where each $\xi_j^\mu$ is either 0 or 1), means looking for a set of couplings and thresholds that satisfy the $Np$ inequalities

$$\text{for each } i \text{ and each } \mu: \quad \Big(\xi_i^\mu - \tfrac{1}{2}\Big) \Big( \sum_{j=1}^{N} J_{ij} \xi_j^\mu - \theta_i \Big) > 0 \qquad (18)$$

where usually the self-coupling terms $J_{ii}$ are set to 0 (one wants to avoid the trivial solution $J_{ii} > 0$, $J_{ij} = 0$ for $i \neq j$, which does not give any associative property). However, if we do not impose any particular symmetry condition, so that the couplings $J_{ij}$ and $J_{ji}$ are independent parameters, one sees that the above inequalities decouple into $N$ independent sets of $p$ inequalities: for each neuron $i$, one has to solve the problem $P_i$ consisting of $p$ inequalities for which the unknowns are the couplings $\{ J_{ij}, j = 1, \ldots, N, j \neq i \}$ and the threshold $\theta_i$:

$$P_i: \quad \text{for each } \mu, \quad \Big(\xi_i^\mu - \tfrac{1}{2}\Big) \Big( \sum_{j=1,\, j \neq i}^{N} J_{ij} \xi_j^\mu - \theta_i \Big) > 0 \qquad (19)$$

The $N$ problems $\{ P_i, i = 1, \ldots, N \}$ can be solved in parallel. Furthermore, one sees that each problem $P_i$ is equivalent to solving a hetero-associative task for a simple perceptron (as on figure 1) having $N - 1$ inputs and a single output, the input-output pairs to be learned being $\{ (\xi_j^\mu, j = 1, \ldots, N, j \neq i),\ \xi_i^\mu \}$, $\mu = 1, \ldots, p$. Hence in order to study the ability of an ANN to learn, it is sufficient to study the case of the simplest perceptron. In particular, we have already from section

2 that as many as $2N$ patterns can be learned exactly (which is much more than the $0.14 N$ patterns imperfectly learned in the Hopfield model). Moreover, the perceptron algorithm ([28], see next section), applied to each neuron $i$ (that is, to each problem $P_i$), allows one to effectively compute a set of couplings.

But Elizabeth Gardner went much further by introducing a statistical physics approach to this theoretical study of learning [18]. She introduced a measure in the space of couplings, so that it is possible to ask for the number (or the fraction) of couplings that effectively learn a set of patterns. From that approach, using the techniques developed for the study of spin-glass models, one gets the storage capacity of the perceptron under various conditions (unbiased or biased patterns, continuous or discrete couplings, ...; the critical capacity $\alpha_c = 2$ corresponding to the particular case of continuous couplings and unbiased patterns). One gets also the typical behavior of a network taken at random among all the networks which have learned the same set of patterns. Moreover this approach has been adapted to the study of generalization, that is, to the learning of a rule by example [38]. I will not give more details here on these aspects, and I now consider the algorithmic problem.

4. Algorithms: the Perceptron and Beyond

4.1. LEARNING ALGORITHMS FOR THE PERCEPTRON

We know that a perceptron can learn at most $2N$ associations, but is it possible to find one set of couplings that realizes this learning? The perceptron algorithm proposed by Rosenblatt [28] [33] allows precisely to find a solution. A remarkable fact is that it is possible to prove that the algorithm will converge in a finite amount of time (whenever a solution does exist) [28]. This algorithm is very simple: it consists in taking a pattern at random and checking whether the current couplings give a correct output; if not, one performs a learning step with a Hebbian rule (if pattern $\mu$ is being tested, each coupling $J_j$ is increased if the input $\xi_j^\mu$ and the desired output $\tau^\mu$ have the same sign, and decreased otherwise). This procedure is repeated until convergence. In practice one has to let the algorithm run a given, arbitrarily chosen, amount of time, since one does not know in advance whether at least one solution exists.
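The procedure just described can be sketched in a few lines (our own implementation, with hypothetical names; 0/1 units as in section 2, the threshold being learned alongside the couplings). Here the targets are generated by a random "teacher" perceptron, so a solution is guaranteed to exist and the algorithm converges.

```python
import random

def train_perceptron(patterns, targets, max_sweeps=2000, seed=0):
    rng = random.Random(seed)
    N = len(patterns[0])
    J, theta = [0.0] * N, 0.0
    for _ in range(max_sweeps):
        errors = 0
        for mu in rng.sample(range(len(patterns)), len(patterns)):
            xi, tau = patterns[mu], targets[mu]
            sigma = 1 if sum(Jj * xj for Jj, xj in zip(J, xi)) - theta > 0 else 0
            if sigma != tau:              # learning step, Hebbian correction
                errors += 1
                delta = tau - sigma       # +1 or -1
                for j in range(N):
                    J[j] += delta * xi[j]
                theta -= delta
        if errors == 0:                   # every association is learned
            break
    return J, theta

# a linearly separable task: targets produced by a random teacher perceptron
rng = random.Random(1)
N, p = 20, 15
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]
patterns = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(p)]
targets = [1 if sum(w * x for w, x in zip(teacher, xi)) > 0 else 0
           for xi in patterns]
J, theta = train_perceptron(patterns, targets)
predictions = [1 if sum(Jj * xj for Jj, xj in zip(J, xi)) - theta > 0 else 0
               for xi in patterns]
```

Since the teacher guarantees a solution with nonzero margin, the convergence theorem applies and all $p$ associations end up correct.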
Since it has been realized that learning algorithms for the perceptron can be used in the context of ANN, as explained above, many variants of the basic algorithm have been proposed in order to find couplings having some specific properties [1]. In particular several algorithms (the "minover" [23], the "adatron" [5] and the "optimal margin classifier" [11]) allow one to find the synaptic efficacies which maximize the size of the basins of attraction.

But what if the desired associations are not learnable? There are various algorithms which tend to find couplings such that the number of errors will be as small as possible [15] [32] [40]. In particular, the "pocket" algorithm [15] is a variant of the perceptron algorithm which guarantees finding a solution with the smallest possible number of errors - provided one lets the algorithm run long enough...

4.2. CONSTRUCTIVE ALGORITHMS BASED ON PERCEPTRON LEARNING

In most practical applications, where one wants to find a rule hidden behind a set of examples, an architecture more complicated than that of the perceptron is required. The most standard approach is to choose a layered network with an a priori chosen number of hidden layers, and to let run the backpropagation algorithm [24] [35]. There exist however alternatives to this method: one can also "learn" the architecture. Since 1986 there exists a family of constructive algorithms, which add units until the desired result is obtained [9] [14] [16] [19] [27] [34] [36]. Most of these algorithms are based on perceptron learning. I give here one example, the "Neural Tree" algorithm [19] [36] (also called the "upstart" algorithm in the slightly different version of M. Frean [14]).

Given the "training set", a set of $p$ input patterns with their class 0 or 1 (their desired output $\tau$), one starts by trying a perceptron algorithm in order to learn the $p$ associations (pattern, class): in case these associations were learnable by a perceptron, the algorithm will give one solution, and the problem is solved. If not (in practice, if no solution has been found after some given amount of time), then one keeps the couplings given by the algorithm (or the pocket [15] solution, that

is the set of couplings with the least number of errors). These couplings define our first neuron. They define a hyperplane which cuts the input space into two domains (figure 4), and input patterns on one side have a $\sigma_1 = 1$ output, patterns on the other side have a $\sigma_1 = 0$ output. At least one of these domains contains a mixture of patterns of the two classes. We will say that such a domain is unpure, a pure domain being one which contains patterns of a same class. The goal of the algorithm is to end up with a partition of the input space into pure domains. One considers each unpure domain separately. For a given (unpure) domain, one lets run a perceptron algorithm trying to separate the patterns according to the class they belong to. This leads to a new unit, defining a hyperplane which cuts the domain into two new domains. This procedure is repeated until every domain is pure. On figure 4 five domains have been generated.
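A compact version of this constructive scheme can be sketched as follows (our own simplified implementation, with a crude pocket step; names and details are not from the original papers). On the XOR problem, which no single perceptron can learn, the tree grows extra units until every domain is pure.

```python
import random

def _output(J, theta, xi):
    return 1 if sum(a * b for a, b in zip(J, xi)) - theta > 0 else 0

def pocket_perceptron(data, sweeps=400, seed=0):
    # perceptron rule, keeping in the "pocket" the weights with fewest errors [15]
    rng = random.Random(seed)
    n_in = len(data[0][0])
    J, theta = [0.0] * n_in, 0.0
    best = (list(J), theta, len(data) + 1)
    for _ in range(sweeps):
        for xi, tau in rng.sample(data, len(data)):
            delta = tau - _output(J, theta, xi)
            if delta:
                J = [Jj + delta * xj for Jj, xj in zip(J, xi)]
                theta -= delta
                errors = sum(_output(J, theta, x) != t for x, t in data)
                if errors < best[2]:
                    best = (list(J), theta, errors)
                    if errors == 0:
                        return best[0], best[1]
    return best[0], best[1]

def grow_tree(data, seed=0):
    classes = sorted({tau for _, tau in data})
    if len(classes) == 1:                      # pure domain: stop
        return ('leaf', classes[0])
    J, theta = pocket_perceptron(data, seed=seed)
    sides = {0: [], 1: []}
    for xi, tau in data:
        sides[_output(J, theta, xi)].append((xi, tau))
    if not sides[0] or not sides[1]:
        # degenerate cut: stop with the majority class of this (unpure) domain
        majority = max(classes, key=lambda c: sum(t == c for _, t in data))
        return ('leaf', majority)
    return ('node', J, theta,
            grow_tree(sides[0], seed + 1), grow_tree(sides[1], seed + 2))

def classify(tree, xi):
    while tree[0] == 'node':
        _, J, theta, left, right = tree
        tree = right if _output(J, theta, xi) else left
    return tree[1]

# XOR: not linearly separable, so the tree must grow beyond its root unit
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
tree = grow_tree(xor)
accuracy = sum(classify(tree, xi) == tau for xi, tau in xor) / len(xor)
```

Each unpure domain is split by a new unit and recursion stops on pure (or degenerate) domains, so the construction always terminates.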


Figure 4. A Neural Tree. Above: partition of the input space by a Neural Tree.
Below: The functional tree architecture.

One should note that every neuron that has been built receives connections from (and only from) the input units. The tree is functional: consider for example the neural tree of figure 4; to read the class of a new pattern, one looks at the output of the first neuron. Depending on its value, 1 or 0, one reads the output of neuron 2 or 3. In the first case, the output of neuron 2 gives the class. In the second case, if the output of neuron 3 is 1, then one reads the output of neuron 4, which gives the class.

One should note also that the perceptron algorithm can be replaced by any learning algorithm (for the perceptron architecture) that one finds convenient. Most importantly, this algorithm can be easily adapted to multiclass problems [36], that is, when the desired output can take more than two values: in the final Neural Tree, each domain will contain patterns of a same class.

In many applications one has noisy data, so that the best performances on generalization may not be obtained when every example of the training set is correctly learned. But with a Neural Tree (as with most constructive algorithms) one can always add units until every output is equal to the desired output. Hence it is likely that the net will in fact "learn by heart" all the examples and will not generalize. Indeed, one has to stop the growth of the tree when generalization, as measured by the number of correct answers on a test set, starts to decrease. Such a strategy can be applied locally, that is, at each leaf of the current tree. This is an advantage of this algorithm: the input space is partitioned in a way that reflects the local density of data, so that one has a good control on the quality of generalization (one acquires more knowledge on the rule where there are more examples).

5. From Supervised to Unsupervised Learning

5.1. THE DUAL PERCEPTRONS

Let us now come back to the perceptron, and reconsider formula (2) giving the
output of the perceptron with a single output unit. One can say, as above, that
there are p input-output pairs realized by a perceptron with a single output unit,

whose couplings are the J's. But one can as well say that one has a perceptron
with p output units, where J is now an input pattern, and the $\xi^\mu$, $\mu = 1, \dots, p$, are
the p coupling vectors (figure 5). I will call A the initial perceptron with a unique
output, and A* the dual perceptron with p output units as just explained.
To avoid confusion when considering one of the dual perceptrons, whenever
considering A* I will append a "*" to each ambiguous word: in particular I will
write "pattern*" and "couplings*", the * being a reminder that for A* these
denominations refer to J and to the $\xi^\mu$, respectively.

Figure 5. The dual perceptron A*.

5.2. THE NUMBER OF DOMAINS: THE DUAL POINT OF VIEW

Now let us reconsider the geometrical argument from the point of view of the
dual perceptron A*. What we have seen in section 2.2 is that, for a given choice
of the couplings*, X, one explores all the possible different output states $\sigma$ that
can be obtained when the input pattern* J varies. If J represents, say, the light
intensities on a retina, $\sigma$ is the first neural representation of a visual scene in the
visual pathway. Since all visual scenes falling into a same domain are encoded with
the same neural representation, $\Delta(X)$ is the maximal number of visual scenes that
can be distinguished. This can be said in terms of transmission of information: to
specify one domain out of $\Delta(X)$ represents $\ln \Delta(X)$ bits of information. Hence

the maximal amount of information, C, that $\sigma$ can convey on the inputs* is

$C = \ln \Delta(X).$   (20)

In the language of information theory, C is the channel capacity of the perceptron


A* if used as a memoryless channel in a communication system [10]. Hence one can
use the term information capacity with its dual meaning, of information storage
capacity for the perceptron and of Shannon capacity for the perceptron*. From
(4) one sees that up to p = N each output neuron gives one bit of information
(C = p), and for p > N one gains less and less information by adding new units*.
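The counting behind this statement is easy to check numerically. Assuming (4) is Cover's counting function for the number of domains carved out by p hyperplanes in general position in N dimensions (the function name below is ours), one sees the capacity grow by one bit per unit up to p = N and then saturate:

```python
from math import comb, log2

def n_domains(p, N):
    """Cover's counting function: number of domains (distinct output words of
    the dual perceptron) created by p hyperplanes in general position in an
    N-dimensional input space: 2 * sum_{k=0}^{N-1} C(p-1, k)."""
    return 2 * sum(comb(p - 1, k) for k in range(N))

N = 10
for p in (5, 10, 20, 40):
    C_bits = log2(n_domains(p, N))
    # one full bit per output unit only while p <= N
    print(p, C_bits, C_bits / p)
```

For p ≤ N the sum equals 2^p, so every added unit* contributes exactly one bit; beyond p = N the per-unit gain decays.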
We are thus led to consider the dual perceptron as a device which associates
a neural representation (or codeword) to each input* signal, and whose perfor-
mance is evaluated with tools coming from information theory. This point of
view corresponds to an approach developed recently in particular for modeling
the sensory pathways in the brain ([6] [25]). In that context one wants the sys-
tem to perform an efficient coding, according to some cost function derived from
information theoretic concepts and general considerations on what type of coding
might be useful for the brain [7] [8]. The algorithmic counterpart, that is the
modification of the couplings* in order to minimize such a cost function, results in
unsupervised learning schemes: the cost function specifies an average quality of the
code, but not a desired output for a given input*.
The duality between the two perceptrons is thus a bridge between the study
of supervised and unsupervised learning tasks. What I have shown here is the
identity between the information capacities. In fact every relevant quantity for the
perceptron is related (but not necessarily identical) to a quantity relevant for the
dual perceptron, and furthermore the statistical physics approach as introduced by
E. Gardner for the perceptron can be used for the study of the typical properties
of the dual perceptron [30].

6. Conclusion

In this lecture I have given a quick overview of the bridges that exist between
the study of supervised as well as unsupervised learning tasks. I have shown the

remarkable fact that the study of the simplest architecture, the perceptron, can
be useful for understanding more complex architectures, such as fully connected
networks and multilayer networks. Moreover, complex architectures can be built
by using perceptron algorithms.
The duality between supervised and unsupervised learning needs to be further
exploited. One puzzling aspect is the discrepancy between the standard viewpoints
that come from the study of the two paradigms: in supervised learning
one insists on having distributed representations (the patterns should be made of
features distributed as randomly as possible), this in order to ensure good associa-
tive properties. In unsupervised learning one finds that efficient encoding produces
"grand-mother" type cells, each neuron learning to respond to a particular (set of)
feature(s). The duality presented above should help in analysing this problem.

Acknowledgements

I thank the organizers of FIESTA92 for inviting me. I thank Nestor Parga for a
fruitful ongoing collaboration on the study of unsupervised learning, on which part
of this talk is based.

References

[1] Abbott, L.F., Learning in Neural Network Memories, Network 1, 105-22 (1990).
[2] Amit, D.J., Modeling Brain Function, Cambridge University Press (1989).
[3] Amit, D.J., M.R. Evans, M. Abeles, Attractor Neural Networks with Biolog-
ical Probe Neurons, Network 2 (1991).
[4] Amit, D.J., H. Gutfreund, H. Sompolinsky, Storing an Infinite Number of
Patterns in a Spin-Glass Model of Neural Networks, Phys. Rev. Lett. 55,
1530-1533 (1985).
[5] Anlauf, J.K., M. Biehl, The Adatron: an Adaptive Perceptron Algorithm,
Europhys. Lett. 10, 687 (1989).
[6] Atick, J.J., Could Information Theory Provide an Ecological Theory of Sen-
sory Processing, Network 3, 213-251 (1992).

[7] Barlow, H.B., Possible Principles Underlying the Transformation of Sensory


Messages, Rosenblith W. (ed.), Sensory Communication, 217, M.I.T. Press,
Cambridge MA (1961).
[8] Barlow, H.B., Unsupervised Learning, Neural Comp. 1, 295-311 (1989).
[9] Bichsel, M., P. Seitz, Minimum Class Entropy: a Maximum Information Ap-
proach to Layered Networks, Neural Network 2, 133-41 (1989).
[10] Blahut, R.E., Principles and Practice of Information Theory, Addison-Wesley,
Cambridge MA (1988).
[11] Boser, B., I. Guyon, V. Vapnik, An Algorithm for Optimal Margin Classi-
fiers, Proceedings of the ACM Workshop on Computational Learning Theory,
Pittsburgh, July 1992 (1992).
[12] Brunel, N., J.-P. Nadal, G. Toulouse, Information Capacity of a Perceptron,
J. Phys. A: Math. and Gen. 25, 5017-5037 (1992).
[13] Cover, T.M., Geometrical and Statistical Properties of Systems of Linear In-
equalities with Applications in Pattern Recognition, IEEE Trans. Electron.
Comput. 14, 326 (1965).
[14] Frean, M., The Upstart Algorithm: a Method for Constructing and Training
Feedforward Neural Networks, Neural Comp. 2, 198-209 (1990).
[15] Gallant, S.I., Optimal Linear Discriminants, Proceedings of the 8th Int.
Conf. on Pattern Recognition, 849-52, Paris 27-31 October 1986 (1987). IEEE
Computer Soc. Press, Washington D.C.
[16] Gallant, S.I., Three Constructive Algorithms for Network Learning, Proc. 8th
Ann Conf of Cognitive Science Soc, 652-60, Amherst, MA 15-17 August 1986
(1986).
[17] Gardner, E., Maximum Storage Capacity in Neural Networks, J. Physique
(France) 48, 741-755 (1987).
[18] Gardner, E., The Space of Interactions in Neural Networks Models, J. Phys.
A: Math. and Gen. 21, 257 (1988).
[19] Golea, M., M. Marchand, A Growth Algorithm for Neural Network Decision
Trees, Europhys. Lett. 12, 105-10 (1990).
[20] Hebb, D.O., The Organization of Behavior: A Neurophysiological Study, J.
Wiley, New-York (1949).

[21] Hertz, J., A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Com-
putation, Addison-Wesley, Cambridge MA (1990).
[22] Hopfield, J.J., Neural Networks as Physical Systems with Emergent Compu-
tational Abilities, Proc. Natl. Acad. Sci. USA 79, 2554-58 (1982).
[23] Krauth, W., M. Mezard, Learning Algorithms with Optimal Stability in Neu-
ral Networks, J. Phys. A: Math. and Gen. 20, L745 (1987).
[24] Le Cun, Y., A Learning Scheme for Asymmetric Threshold Networks, Pro-
ceedings of Cognitiva 85, 599-604, Paris, France (1985). CESTA-AFCET.
[25] Linsker, R., Self-Organization in a Perceptual Network, Computer 21, 105-17
(1988).
[26] McCulloch, W.S., W.A. Pitts, A Logical Calculus of the Ideas Immanent in
Nervous Activity, Bull. of Math. Biophys. 5, 115-133 (1943).
[27] Mezard, M., J.-P. Nadal, Learning in Feedforward Layered Networks: the
Tiling Algorithm, J. Phys. A: Math. and Gen. 22, 2191-203 (1989).
[28] Minsky, M.L., S.A. Papert, Perceptrons, M.I.T. Press, Cambridge MA (1988).
[29] Nadal, J.-P., N. Parga, Duality between Learning Machines: a Bridge between
Supervised and Unsupervised Learning, LPSENS preprint, to appear in Neural
Computation (1993).
[30] Nadal, J.-P., N. Parga, Information Processing by a Perceptron in an Unsu-
pervised Learning Task, Network 4, 295-312 (1993).
[31] Peretto, P., An Introduction to the Modeling of Neural Networks, Cambridge
University Press (1992).
[32] Personnaz, L., I. Guyon, G. Dreyfus, Collective Computational Properties
of Neural Networks: New Learning Mechanisms, Phys. Rev. A34, 4217-28
(1986).
[33] Rosenblatt, F., Principles of Neurodynamics, Spartan Books, New York
(1962).
[34] Rujan, P., M. Marchand, Learning by Minimizing Resources in Neural Net-
works, Complex Systems 3, 229-42 (1989).
[35] Rumelhart, D.E., G.E. Hinton, and R.J. Williams, Learning Internal Repre-
sentations by Error Propagation, In McClelland J.L., Rumelhart D.E. and the
PDP research group (eds.), Parallel Distributed Processing: Explorations in
the Microstructure of Cognition Vol. I, 318-362, Bradford Books, Cambridge


MA (1986).
[36] Sirat, J.A., J.-P. Nadal, Neural Trees: a New Tool for Classification, Network
1, 423-38 (1990).
[37] Sompolinsky, H., Statistical Mechanics of Neural Networks, Physics Today
(December 1988).
[38] Tishby, N., E. Levin, S. Solla, Consistent Inference of Probabilities in Layered
Networks: Predictions and Generalization, Proceedings of the International
Joint Conference on Neural Networks, Washington D.C. (1989).
[39] Vapnik, V., Estimation of Dependences Based on Empirical Data, Springer
Series in Statistics, Springer, New-York (1982).
[40] Widrow, B., M.E. Hoff, Adaptive Switching Circuits, IRE WESCON Conv.
Record, Part 4, 96-104 (1960).
STORAGE OF CORRELATED PATTERNS IN NEURAL
NETWORKS

PATRICIO PEREZ
Departamento de Fisica
Universidad de Santiago de Chile
Casilla 307, Correo 2
Santiago
Chile

ABSTRACT. We describe here some ways of storing correlated patterns in neural networks of
two-state neurons. We begin with a calculation of the bounds for storage capacity in the case of
uncorrelated, unbiased patterns. We extend these results to the case of biased patterns, which is
a form of correlation. We then present some specific models that allow the storage of patterns
with different kinds of correlation. A model based on the segmentation into sub-nets is described
in more detail. By storing patterns in the sub-nets and varying the interaction between these,
we obtain an efficient way to store correlated patterns that can be related to the human ability
to memorize and retrieve words.

1. Introduction

An important class of models of neural networks is formed by the Ising spin type
of fully connected neurons. In these models, state $S_i = +1$ corresponds to a firing
neuron and $S_i = -1$ to a quiescent neuron. The potential at the membrane of
neuron i, at each instant of time, is assumed to correspond to the local field $h_i$,
which is given by:

$h_i = \sum_{j=1, j\neq i}^{N} J_{ij} S_j$   (1)

where $J_{ij}$ characterizes the synaptic efficacy for action potentials traveling from
neuron j to neuron i. The precise values of these matrix elements or weights
are determined by "learning" a set of patterns which represent the information
E. Goles and S. Martinez (eds.), Cellular Automata, Dynamical Systems and Neural Networks, 167-189.
© 1994 Kluwer Academic Publishers.

to be stored. Usually, the dynamics is defined through an asynchronous spin-flip
algorithm in which the updating obeys

$S_i(t+1) = \operatorname{sgn}(h_i(t)).$   (2)


Starting from an arbitrary configuration, repeated application of (2) will lead to a
stationary state $S^*$ that satisfies:

$S_i^* h_i = \sum_{j=1, j\neq i}^{N} J_{ij} S_i^* S_j^* > \kappa$   (3)

where $\kappa \geq 0$ is a measure of the basin of attraction or region around $S^*$ from which
it is reached.
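The dynamics (1)-(2) can be sketched in a few lines. Below, a minimal simulation with a one-pattern Hebbian matrix (our choice of example, anticipating eq. (46)): a state with a few flipped spins relaxes back to the stored pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

def relax(J, S):
    """Asynchronous spin-flip dynamics, eq. (2): sweep the neurons in random
    order, setting S_i <- sgn(h_i), until no spin changes (stationary state)."""
    S = S.copy()
    while True:
        changed = False
        for i in rng.permutation(len(S)):
            h = J[i] @ S                    # local field, eq. (1) (J has zero diagonal)
            new = 1 if h >= 0 else -1
            if new != S[i]:
                S[i], changed = new, True
        if not changed:
            return S

# store one pattern with a Hebbian matrix and recover it from a noisy state
N = 50
xi = rng.choice([-1, 1], size=N)
J = np.outer(xi, xi) / N
np.fill_diagonal(J, 0.0)
S0 = xi.copy()
S0[:5] *= -1                                # flip 5 spins
print(np.array_equal(relax(J, S0), xi))     # True: the pattern is recovered
```

With symmetric couplings and sequential updates an energy function decreases at each flip, which is why the loop always terminates in a stationary state.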
The network will have associative memory if these stationary states, or an
important fraction of them, correspond to the patterns used to build the $J_{ij}$'s
and the basins of attraction are not vanishingly small. The storage capacity of
the network ($\alpha_c$) is usually defined as the ratio between the maximum amount of
stationary states that it is possible to program in advance (p) and the total number
of neurons (N).
Several different models have been proposed for the construction of the synap-
tic coefficients. In some of them each pattern is memorized in a single learning
event [1,3,8,14], and in others each pattern is learned by repeated presentation of
it to the network in a sequence of learning steps [2,4,5,7,9]. Most of the learning
rules cited are local [1,2,3,4,5,7,8,9] but sometimes allowing nonlocality leads to
interesting properties [14]. Here locality means that the synapses between two
neurons depend only on the activity of them when the patterns to store are taken
into account. General results for the storage capacity and stability of the stored
patterns are due to E. Gardner [7], a calculation that we summarize in section
2. Storage depends strongly on whether the patterns are correlated. Two stored
patterns $\xi^\mu$ and $\xi^\nu$ are uncorrelated if they satisfy

$\langle \xi_i^\mu \xi_i^\nu \rangle = 0$   (4)

where the brackets mean average over the statistical distribution of stored patterns.
The Hopfield model is an example of a learning rule that allows the storage of only
uncorrelated patterns. In this case, the synaptic coefficients are given by:

$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu, \qquad J_{ii} = 0.$   (5)

The condition of stability for pattern $\xi^\nu$ and for $N \gg 1$ is obtained from equation
(3) with $\kappa = 0$:

$1 + \frac{1}{N} \sum_{\mu \neq \nu} \xi_i^\nu \xi_i^\mu \sum_j \xi_j^\mu \xi_j^\nu > 0.$   (6)


This condition may be written as $1 + R > 0$. If the stored patterns are uncorrelated,
R will average to zero but it can have deviations of the order of $\sqrt{p/N}$. In this
manner we can understand why with the Hopfield model we can store of the order
of N uncorrelated patterns. If correlation is present, in general R will not average
to zero and the storage capacity is drastically reduced.
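The scale of the crosstalk term R is easy to verify numerically. A small sketch (variable names are ours): for unbiased random patterns, R averages to zero and fluctuates on the scale $\sqrt{p/N}$, so almost every site of a stored pattern is stable when $p \ll N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 10
xi = rng.choice([-1.0, 1.0], size=(p, N))     # unbiased, uncorrelated patterns
J = (xi.T @ xi - p * np.eye(N)) / N           # Hebb rule (5), zero diagonal

# site-wise stability of pattern 0: signal (N - p)/N plus crosstalk R
h = J @ xi[0]
R = xi[0] * h - (N - p) / N
print(R.mean(), R.std())       # mean ~ 0, std on the scale sqrt(p/N) ~ 0.22
print(np.mean(xi[0] * h > 0))  # fraction of stable sites: ~1 for p << N
```

Raising p until $\sqrt{p/N}$ fluctuations of R overwhelm the O(1) signal reproduces the familiar breakdown at p of order N.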
Usually, the components of a prescribed pattern (a pattern we want to store)
are chosen at random with probability:

$P(\xi_i^\mu) = \frac{1}{2}(1+m)\,\delta(\xi_i^\mu - 1) + \frac{1}{2}(1-m)\,\delta(\xi_i^\mu + 1).$   (7)

If m = 0, we say that the patterns are unbiased: every neuron has the same
probability of being active or quiescent. That is the case for the Hopfield model.
If $m \neq 0$ the prescribed patterns are biased. The patterns then have
a mean correlation $\langle \xi_i^\mu \xi_i^\nu \rangle = m^2$. Thus, bias implies correlation, although
this is not the only type of correlation assumed in neural network models.
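Sampling from distribution (7) makes the bias-correlation link concrete (a minimal sketch, with our variable names): the single-site mean is m, and the product of two different patterns at the same site averages to $m^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, N = 0.6, 400, 1000

# components drawn from distribution (7): P(+1) = (1+m)/2, P(-1) = (1-m)/2
xi = np.where(rng.random((p, N)) < (1 + m) / 2, 1.0, -1.0)

print(xi.mean())                        # bias: <xi> ~ m = 0.6
corr = np.mean(xi[0:-1:2] * xi[1::2])   # two *different* patterns, same site
print(corr)                             # mean correlation ~ m^2 = 0.36
```

No explicit correlation was imposed between patterns; the $m^2$ overlap arises purely from the common bias, which is the point made in the text.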
The best known models to store correlated patterns are based on a non-local
synaptic matrix or an iterative learning algorithm. The former are problematic
because they involve the inversion of a very large matrix and are biologically unrealistic.
The latter may have convergence problems. In section 3 we discuss ways to store
correlated patterns in specific neural network models, including a novel approach
using a local and one-presentation learning rule. In section 4, some results of recent
numerical calculations using this last type of model are presented.

2. Bounds for Storage Capacity and Stability in Neural Networks

2.1. UNBIASED PATTERNS

We consider a multiconnected neural network as that defined in the previous sec-
tion, with an unspecified synaptic matrix $J_{ij}$ and a set of p random patterns $\xi^\mu$
that we want to store in it:

$\xi_i^\mu = \pm 1, \qquad \mu = 1, \dots, p; \quad i = 1, \dots, N.$   (8)

We will try to answer the following question: what is the maximum amount of
patterns with a given stability that we can store in a network with an optimum
synaptic matrix?
Since multiplying the $J_{ij}$ by any set of constants has no effect on the dynamics
expressed by equation (2), it is convenient to assume a normalization condition

$\sum_{j \neq i} J_{ij}^2 = N.$   (9)

The idea is to calculate the fractional volume of the space of solutions for the
synaptic coefficients. For a given stability $\kappa$, storage of p patterns will be possible
as long as this volume does not vanish. The maximum storage capacity is obtained
when, upon increasing p/N, the fractional volume goes to zero.
The fraction of phase space $V_T$ that satisfies conditions (3) and (9) for the
embedded patterns of equation (8) can be written as

$V_T = \frac{\int \prod_{j\neq i} dJ_{ij}\, \prod_{i,\mu} \theta\!\left(\xi_i^\mu \sum_{j\neq i} J_{ij} \xi_j^\mu/\sqrt{N} - \kappa\right) \prod_i \delta\!\left(\sum_{j\neq i} J_{ij}^2 - N\right)}{\int \prod_{j\neq i} dJ_{ij}\, \prod_i \delta\!\left(\sum_{j\neq i} J_{ij}^2 - N\right)}$   (10)

where $\theta(x)$ is the step function. If $V_i$ is the fractional volume for fixed i, we can
assume that

$V_T = \prod_{i=1}^{N} V_i.$   (11)

Since we are interested in the case of N large, we study the thermodynamic limit

$\lim_{N\to\infty} \frac{1}{N} \ln V_T = \lim_{N\to\infty} \frac{1}{N} \sum_i \ln V_i.$   (12)

When we take averages over an ensemble of random patterns $\xi_j^\mu$ the fractional
volume will be the same for all sites, so it is only necessary to calculate $\langle \ln V_i \rangle$.
We use the replica trick

$\langle \ln V \rangle = \lim_{n\to 0} \frac{\langle V^n \rangle - 1}{n}$   (13)

where the validity of the analytic continuation of n from positive
integers to zero is assumed. From (10) we see that

$\langle V^n \rangle = \left\langle \prod_{a=1}^{n} \frac{\int \prod_{j\neq i} dJ_{ij}^a\, \prod_\mu \theta\!\left(\xi_i^\mu \sum_{j\neq i} J_{ij}^a \xi_j^\mu/\sqrt{N} - \kappa\right) \delta\!\left(\sum_{j\neq i} (J_{ij}^a)^2 - N\right)}{\int \prod_{j\neq i} dJ_{ij}^a\, \delta\!\left(\sum_{j\neq i} (J_{ij}^a)^2 - N\right)} \right\rangle$   (14)

where $J_{ij}^a$ is the realization of the $J_{ij}$ for replica a.

It is convenient to introduce an integral representation for the step function
appearing in (14):

$\theta(z - \kappa) = \int_{\kappa}^{\infty} \frac{d\lambda}{2\pi} \int_{-\infty}^{\infty} dx\, e^{i x (\lambda - z)}.$   (15)

In calculating the average over the random unbiased patterns $\xi_j^\mu$ in (14), the
relevant term is:

R =< IT IT exp( -ix~ L Jij~r~: /VN) >


a=l p j#-i

j#i p a

(16)
= exp [~ L ln cos(L x~ Jij / VN)l
J¢• 1-' a

~ exp [-~ L L:x~x~(L J;jJfjfN)l


p a,b j#i

where in the last step, only the lowest order term in 1/N in the Taylor expansion
of ln cos x is kept. We introduce now the parameters

$q^{ab} = \frac{1}{N} \sum_{j\neq i} J_{ij}^a J_{ij}^b, \qquad a < b$   (17)

which correspond to the mutual overlaps between the couplings in the different
replica copies. We will also make use of auxiliary variables $F^{ab}$ and $E^a$, which
satisfy relations (18) and (19).

Then equation (14) takes the form



$\langle V^n \rangle = \int \prod_{a<b} dq^{ab}\, dF^{ab} \prod_a dE^a\,
\exp\!\left[N\left(\alpha G_1(q^{ab}) + G_2(F^{ab}, E^a) - \sum_{a<b} F^{ab} q^{ab} + \frac{1}{2}\sum_a E^a\right)\right]$

$\times \left(\int \prod_a dE^a\, \exp\!\left[N\left(G_2(0, E^a) + \frac{1}{2}\sum_a E^a\right)\right]\right)^{-1}$   (20)

with

$G_1(q^{ab}) = \ln \int \prod_{a=1}^{n} \int_{\kappa}^{\infty} \frac{d\lambda^a}{2\pi} \int_{-\infty}^{\infty} dx^a\,
\exp\!\left(i \sum_a x^a \lambda^a - \frac{1}{2} \sum_a (x^a)^2 - \sum_{a<b} q^{ab} x^a x^b\right)$   (21)

and

$G_2(F^{ab}, E^a) = \ln \prod_{a=1}^{n} \int dJ^a\, \exp\!\left(-\frac{1}{2} \sum_a E^a (J^a)^2 + \sum_{a<b} F^{ab} J^a J^b\right).$   (22)

The integrals in equation (20), in the limit $N \to \infty$, can be solved with the help
of the saddle point method and the replica-symmetric ansatz

$q^{ab} = q, \qquad F^{ab} = F, \qquad E^a = E.$   (23)

We need to maximize the function

$G(q, F, E) = \alpha G_1(q) + G_2(F, E) - \frac{1}{2} n(n-1)\, q F + \frac{n}{2}\, E$   (24)

where in the limit $n \to 0$, using the saddle point conditions, we can eliminate
F and E. Finally we are left with

$\langle V^n \rangle = \exp\!\left[N n \max_q G(q) + O(1/N)\right].$   (25)



In the thermodynamic limit we have

$G(q) = \alpha \int Dt\, \ln H\!\left(\frac{\sqrt{q}\, t + \kappa}{(1-q)^{1/2}}\right) + \frac{1}{2} \ln(1-q) + \frac{1}{2}\, \frac{q}{1-q}$   (26)

where

$Dt = \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2},$   (27)

$H(x) = \int_x^{\infty} Dz$   (28)

and $\alpha = p/N$ as usual. The condition for maximizing G(q) leads to

$q = (1-q)\, \frac{\alpha}{2\pi} \int Dt\, \left[H\!\left(\frac{\sqrt{q}\, t + \kappa}{(1-q)^{1/2}}\right)\right]^{-2} \exp\!\left(-\frac{(\sqrt{q}\, t + \kappa)^2}{1-q}\right).$   (29)
From the definition of q (eq. (17)) we see that q = 1 means that there is a unique
solution for the synaptic couplings. Then an upper limit for the storage capacity
($\alpha_c$) is obtained from (29) in the limit $q \to 1$. In this case the integral is dominated
by values of $t > -\kappa$ and then

$\alpha_c(\kappa) = \left[\int_{-\kappa}^{\infty} Dt\, (t + \kappa)^2\right]^{-1}.$   (30)

It is easy to see that in the limit $\kappa \to 0$, $\alpha_c = 2$, which is in agreement with previous
calculations [16]. For increasing $\kappa$, $\alpha_c$ decreases smoothly.
It is worth remarking that the results obtained here apply to unbiased random
patterns, with or without any other type of correlation.
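The Gaussian integral in (30) has the closed form $(1+\kappa^2)\Phi(\kappa) + \kappa\varphi(\kappa)$, with $\Phi$ and $\varphi$ the standard normal distribution and density, so $\alpha_c(\kappa)$ can be evaluated directly (a small sketch; the function name is ours):

```python
from math import erf, exp, pi, sqrt

def alpha_c(kappa):
    """Gardner bound (30): alpha_c = [ int_{-kappa}^inf Dt (t+kappa)^2 ]^{-1}.
    Using int_{-k}^inf Dt (t+k)^2 = (1+k^2)*Phi(k) + k*phi(k)."""
    Phi = 0.5 * (1 + erf(kappa / sqrt(2)))     # standard normal CDF
    phi = exp(-kappa ** 2 / 2) / sqrt(2 * pi)  # standard normal density
    return 1.0 / ((1 + kappa ** 2) * Phi + kappa * phi)

print(alpha_c(0.0))          # 2.0: the kappa = 0 Gardner limit
for k in (0.5, 1.0, 2.0):
    print(k, alpha_c(k))     # decreases smoothly as kappa grows
```

At $\kappa = 0$ the integral equals $\int_0^\infty t^2 Dt = 1/2$, recovering $\alpha_c = 2$; larger required stability shrinks the solution volume and hence the capacity.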

2.2. BIASED PATTERNS

The analysis of the previous section assumed that the stored patterns were random,
with the $\xi_i^\mu$ having equal probability of being +1 or -1. If the patterns are drawn
from a biased random distribution

$P(\xi_i^\mu) = \frac{1}{2}(1+m)\,\delta(\xi_i^\mu - 1) + \frac{1}{2}(1-m)\,\delta(\xi_i^\mu + 1)$   (31)

the calculation of the expectation value $\langle V^n \rangle$ that appears in eq. (14) leads to
a relevant term of the form

$\exp\left(\sum_{j\neq i} \ln\left[\frac{1}{2}(1+m)\exp\!\left(-\frac{i}{\sqrt{N}}\sum_{\mu,a} x_\mu^a J_{ij}^a\right) + \frac{1}{2}(1-m)\exp\!\left(\frac{i}{\sqrt{N}}\sum_{\mu,a} x_\mu^a J_{ij}^a\right)\right]\right).$   (32)

If we expand the logarithm up to second order in $\sum_a x_\mu^a J_{ij}^a/\sqrt{N}$ we get

$\exp\left[-i \sum_{\mu,a} m M^a x_\mu^a - \frac{1}{2}(1-m^2)\left(\sum_{\mu,a}(x_\mu^a)^2 + 2\sum_{\mu,\,a<b} q^{ab} x_\mu^a x_\mu^b\right)\right]$   (33)

where $q^{ab}$ is the same as before (equation (17)) and we have introduced

$M^a = \frac{1}{\sqrt{N}} \sum_{j\neq i} J_{ij}^a.$   (34)

Besides the auxiliary variables $F^{ab}$ and $E^a$ defined through equations (18) and
(19), we define $K^a$ from the identity

$\int_{-\infty}^{\infty} dM^a \int_{-\infty}^{\infty} (N/2\pi)\, dK^a\, \exp\left(i N K^a \left[M^a - \frac{1}{\sqrt{N}} \sum_{j\neq i} J_{ij}^a\right]\right) = 1.$   (35)

In the large N limit we will then have:

$\lim_{N\to\infty} \frac{1}{N} \ln \langle V^n \rangle = \lim_{N\to\infty} \frac{1}{N} \ln \left( \int \prod_{a=1}^{n} dM^a\, dE^a \prod_{a<b} dq^{ab}\, dF^{ab} \right.$

$\left. \times\, \exp\left(N\left[\alpha G_1'(q^{ab}, M^a) + G_2(F^{ab}, E^a) - \sum_{a<b} q^{ab} F^{ab} + \frac{1}{2}\sum_a E^a\right]\right)\right)$   (36)
a<b a

$G_2$ is given by equation (22) and $G_1'$ by:

$G_1' = \ln \left\langle \prod_{a=1}^{n} \int_{-\infty}^{\infty} dx^a \int_{\kappa}^{\infty} \frac{d\lambda^a}{2\pi}\,
\exp\left(i\sum_a x^a(\lambda^a - \xi\, m M^a) - \frac{1}{2}(1-m^2)\sum_a (x^a)^2 - (1-m^2)\sum_{a<b} q^{ab} x^a x^b\right) \right\rangle$   (37)

where the average is taken over the random variable $\xi$ with distribution (31). With
the replica symmetric ansatz and in the saddle point approximation we get

$\langle V^n \rangle = \exp\left[N n \left(\operatorname{ext}_{M,q} G(q, M) + O(1/N)\right)\right]$   (38)

where $\operatorname{ext}_{M,q}$ means maximum with respect to M and minimum with respect to q,
and G(q, M) comes from the function

$G(q, M, F, E) = \alpha G_1(q, M) + G_2(F, E) - \frac{n(n-1)}{2}\, q F + \frac{n}{2}\, E$   (39)

after eliminating F and E. The conditions for q and M, in the limit $q \to 1$, produce

$1 = \alpha_c(m, \kappa)\left[\frac{1}{2}(1+m)\int_{\gamma_1}^{\infty} Dt\left(\frac{\kappa - Mm}{(1-m^2)^{1/2}} + t\right)^2 + \frac{1}{2}(1-m)\int_{\gamma_2}^{\infty} Dt\left(\frac{\kappa + Mm}{(1-m^2)^{1/2}} + t\right)^2\right]$   (40)

together with a corresponding saddle point condition for M,   (41)

where $\gamma_1 = (Mm-\kappa)/(1-m^2)^{1/2}$ and $\gamma_2 = (-Mm-\kappa)/(1-m^2)^{1/2}$. For $\kappa = 0$


and for small values of m we get

$\alpha_c \simeq 2$   (42)

which is in agreement with the result for uncorrelated patterns. For m approaching
one, $\alpha_c$ goes to infinity as:

$\alpha_c = -\frac{1}{(1-m)\ln(1-m)}.$   (43)

For intermediate values of m we can solve equations (40) and (41) numerically.
It is interesting to notice that while the storage capacity increases with bias,
the information content does not. According to Shannon's formula, the information
is given by

$I = -\sum_{i,\mu} \sum_{\xi_i^\mu = \pm 1} P(\xi_i^\mu) \log_2 P(\xi_i^\mu).$   (44)

If we use the probability distribution (31) and sum over all sites and patterns we
obtain

$I = -N^2 \alpha_c\, \frac{1}{\ln 2}\left[\frac{1-m}{2}\ln\frac{1-m}{2} + \frac{1+m}{2}\ln\frac{1+m}{2}\right]$   (45)

which is a decreasing function of m. As $m \to 1$ we have $I \to N^2/\ln 2$, which is less
than the value $I = 2N^2$ corresponding to m = 0.
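The entropy factor in square brackets in (45) is just the Shannon entropy (in bits) of one biased component, so the competition between growing $\alpha_c$ and shrinking entropy per bit can be seen directly (a minimal sketch with our function name):

```python
from math import log2

def entropy_per_bit(m):
    """Shannon entropy, in bits, of a single component drawn from the biased
    distribution (31): H = -sum_p p*log2(p) with p = (1 +/- m)/2."""
    h = 0.0
    for p in ((1 + m) / 2, (1 - m) / 2):
        if p > 0:
            h -= p * log2(p)
    return h

for m in (0.0, 0.3, 0.6, 0.9, 0.99):
    print(m, entropy_per_bit(m))   # decreases from 1 bit toward 0 as bias grows
```

The total information in (45) is $\alpha_c N^2$ times this entropy: each stored component carries fewer bits as m grows, faster than $\alpha_c$ can compensate.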

3. Ways of Storing Correlated Patterns

The bounds for storage capacity calculated above are based on the assumption
of a second order synaptic matrix $J_{ij}$. We may think of the possibility of using
a higher order matrix $J_{ijk\dots}$, in which case the expression for the local fields
(Eq. (1)) should be modified accordingly. It has been shown that a higher order
generalization of the Hebb rule used in the Hopfield model [8] indeed increases
the maximum amount of stored patterns [6,12]. More generally, G.A. Kohring
[11] demonstrated that for an optimal synaptic matrix of order n, the number
of uncorrelated patterns that can be stored is of the order of $N^{n-1}$. However,
if we define information density as the number of stored bits per
synapse, it is found that it cannot exceed the value 2 of the second order case. For
biased patterns, again the storage capacity goes to infinity as the parameter m
approaches one, but if we look at the information density, this quantity decreases
as bias increases.
Returning to second order synaptic matrices, it is interesting to mention that
with the Hopfield model we can store only uncorrelated patterns [8,1]. For this
model we have:

$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu$   (46)

and its storage capacity for uncorrelated patterns is $\alpha_c = 0.144$. When we in-
troduce correlation in the way described by Eq. (31), the amount of patterns that
we can store is drastically reduced, due to the noise produced by the embedded
patterns over the pattern that we intend to retrieve [1].
To improve the mentioned weakness of the Hopfield model, when correlation
is introduced by imposing that the embedded patterns have a mean activity m, Amit
et al. [1] proposed a modification of rule (46) in which the synaptic efficacies are
given by:

$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} (\xi_i^\mu - m)(\xi_j^\mu - m).$   (47)

In this case a signal to noise analysis leads them to conclude that the storage
capacity decreases with m from its value at zero bias,

$\alpha_c(0) = 0.144.$   (48)

Besides, spurious states are observed to dominate the dynamics. However, a slight
variation of this learning rule allows nearly optimal storage capacity in the limit
of very low activities [3].
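The advantage of subtracting the mean activity is easy to see in a small simulation (our setup; the quantitative capacities quoted above are not reproduced here, only the qualitative gap between rules (46) and (47) for biased patterns):

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, m = 400, 10, 0.5
xi = np.where(rng.random((p, N)) < (1 + m) / 2, 1.0, -1.0)   # biased patterns, eq. (31)

J_hebb = (xi.T @ xi - p * np.eye(N)) / N                     # plain Hebb rule (46)
d = xi - m
J_cov = (d.T @ d - np.diag((d ** 2).sum(axis=0))) / N        # covariance rule (47)

def stable_fraction(J):
    """Fraction of sites with xi_i h_i > 0, averaged over the stored patterns."""
    return np.mean([np.mean(xi[mu] * (J @ xi[mu]) > 0) for mu in range(p)])

f_hebb, f_cov = stable_fraction(J_hebb), stable_fraction(J_cov)
print(f_hebb, f_cov)   # the covariance rule keeps biased patterns far more stable
```

With rule (46) the $m^2$ overlaps between biased patterns add a systematic noise term that destabilizes the minority sites; rule (47) removes the mean and restores near-perfect stability at this loading.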
The pseudo-inverse solution [14] has the following synaptic matrix:

$J_{ij} = \frac{1}{N} \sum_{\mu,\nu} \xi_i^\mu (Q^{-1})^{\mu\nu} \xi_j^\nu$   (49)

where

$Q^{\mu\nu} = \frac{1}{N} \sum_i \xi_i^\mu \xi_i^\nu.$   (50)

In this model the synaptic matrix permits the storage of a maximum of N patterns,
which can be correlated but must be linearly independent. A problem here is the
nonlocality of the learning rule, which may be biologically unrealistic.
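Equations (49)-(50) are compact enough to check directly: since $J\xi^\mu = \xi^\mu$ by construction, every linearly independent pattern, however correlated, is an exact fixed point of the dynamics (2). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 20
# correlated (biased) but, with probability ~1, linearly independent patterns
xi = np.where(rng.random((p, N)) < 0.7, 1.0, -1.0)

Q = xi @ xi.T / N                       # overlap matrix, eq. (50)
J = xi.T @ np.linalg.inv(Q) @ xi / N    # pseudo-inverse matrix, eq. (49)

# every stored pattern is an exact fixed point of S_i <- sgn(h_i)
for mu in range(p):
    assert np.all(np.sign(J @ xi[mu]) == xi[mu])
print("all", p, "correlated patterns are stable")
```

Note that the rule is non-local: every $J_{ij}$ depends, through $Q^{-1}$, on the activities of all neurons in all patterns — the biological objection raised in the text.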
Recently, several models have been proposed in which the synaptic
matrix is built by repeated Hebbian learning [2,4,5,7,9], where terms of the form

$\delta J_{ij} = \frac{1}{N}\, \xi_i^\mu \xi_j^\mu$   (51)

are iteratively added until every pattern $\xi^\mu$ has a desired stability. In this way
we can saturate the Gardner limit and store a large amount of uncorrelated or
correlated patterns.
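A perceptron-type sketch of this iterative scheme (one of several variants cited above; the stopping criterion and parameter values are our illustrative choices): increments (51) are added only on the rows of neurons whose stability is still below a target $\kappa$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, kappa = 100, 30, 0.3
xi = rng.choice([-1.0, 1.0], size=(p, N))

J = np.zeros((N, N))
for sweep in range(300):
    done = True
    for mu in range(p):
        norm = np.sqrt((J ** 2).sum(axis=1)).clip(min=1e-12)
        weak = xi[mu] * (J @ xi[mu]) < kappa * norm   # rows below target stability
        if weak.any():
            done = False
            # Hebbian increment (51), applied only to the unstable rows
            J[weak] += np.outer(xi[mu], xi[mu])[weak] / N
            np.fill_diagonal(J, 0.0)
    if done:
        break

stable = all(np.all(xi[mu] * (J @ xi[mu]) > 0) for mu in range(p))
print(sweep, stable)   # all patterns reach the target stability
```

Because a solution with finite stability exists well below the Gardner limit ($\alpha = 0.3$ here), the perceptron convergence theorem guarantees the updates terminate.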
For the pseudoinverse solution and the iterative models, correlation is not nec-
essarily introduced by assuming a given mean activity for the prescribed patterns,
but when correlation is due to any different mechanism the bounds for information
content do not change.

In the following lines we describe a type of network in which correlation is
introduced by allowing nonvanishing mutual overlaps between groups of prescribed
patterns.
Let us assume that we have a network with N Ising spins or two-state neurons
in which we decompose the states $\vec{S}$ as a sum of k terms or segments:

$\vec{S} = \vec{S}^{(1)} + \vec{S}^{(2)} + \dots + \vec{S}^{(k)}$   (52)

where each $\vec{S}^{(i)}$ has $N_i$ consecutive components different from zero, starting at
position $N_1 + N_2 + \dots + N_{i-1} + 1$. We can understand this as a partition of the
network in k sub-nets, each of them with $N_i$ neurons. The local field acting on
neuron i, which belongs to sub-net $i_1$, can be written as:

$h_i^{(i_1)} = \sum_{i_2=1}^{k} \sum_j J_{ij}^{(i_1 i_2)} S_j^{(i_2)}$   (53)

where $J_{ij}^{(i_1 i_2)} = J_{ij}$ if neuron j is in sub-net $i_2$ and neuron i is in sub-net $i_1$, and
zero otherwise, for a synaptic matrix $J_{ij}$ originally defined for the whole network.
We now modify these local fields by defining coupling parameters $\lambda_{i_1 i_2}$ which will
allow us to vary the degree of interaction between sub-nets.
We are interested in the properties of a network with a synaptic matrix $T_{ij}$
given by

$T_{ij} = \sum_{i_1=1}^{k} \sum_{i_2=1}^{k} \lambda_{i_1 i_2}\, J_{ij}^{(i_1 i_2)}$   (54)

where the $J_{ij}^{(i_1 i_2)}$ are defined as before and the $\lambda_{i_1 i_2}$ are real numbers. In the very
special case when $\lambda_{i_1 i_2} = \delta_{i_1 i_2}$ and with a local synaptic matrix $J_{ij}$, it is easy to see
that we can store and retrieve patterns in each sub-net independently. This is so
because the sub-nets are uncoupled. The storage capacity of the whole network will
be given by all possible combinations of stored segments in the sub-nets when each
of them is at its limiting capacity. Obviously, these patterns are correlated due to
non-negligible mutual overlaps between the N component vectors. However, the
case of no coupling between the sub-nets is not very interesting, since the extension
from what is known for the sub-nets is trivial and recognition of a pattern that is

noisy in only a few of the segments cannot be helped by the matching of the rest.
We may think of the possibility of building a synaptic matrix of the form given by
equation (54) which allows interactions between the different sub-nets. A simple
case is:

$\lambda_{i_1 i_2} = \delta_{i_1 i_2} + \epsilon\,(1 - \delta_{i_1 i_2})$   (55)

with $\epsilon$ varying between zero and one. As a further simplification we will assume
that all the sub-nets are of the same size, with $N_k$ neurons, so $k N_k = N$.
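A small two-sub-net simulation makes the role of $\epsilon$ concrete (our setup: Hebbian matrices within each sub-net, and a generic random inter-net block standing in for the off-diagonal part of $J_{ij}$):

```python
import numpy as np

rng = np.random.default_rng(5)
Nk, p = 60, 3                                   # two equal sub-nets, p "letters" each
xi1 = rng.choice([-1.0, 1.0], size=(p, Nk))
xi2 = rng.choice([-1.0, 1.0], size=(p, Nk))
J1 = (xi1.T @ xi1 - p * np.eye(Nk)) / Nk        # Hebb matrix within sub-net 1
J2 = (xi2.T @ xi2 - p * np.eye(Nk)) / Nk        # Hebb matrix within sub-net 2
Jx = rng.normal(0, 1 / np.sqrt(Nk), (Nk, Nk))   # generic inter-net block

def word_stability(eps, mu, nu):
    """Fraction of sites of the word (letter mu, letter nu) with xi_i h_i > 0,
    the local fields carrying the inter-net term weighted by eps."""
    h1 = J1 @ xi1[mu] + eps * Jx @ xi2[nu]
    h2 = J2 @ xi2[nu] + eps * Jx.T @ xi1[mu]
    word = np.concatenate([xi1[mu], xi2[nu]])
    return np.mean(word * np.concatenate([h1, h2]) > 0)

print(word_stability(0.0, 0, 1))    # uncoupled: every letter combination stable
print(word_stability(0.05, 0, 1))   # eps below the bound (60): still stable
print(word_stability(2.0, 0, 1))    # eps too large: segments are destabilized
```

As the text argues, the inter-net term acts as extra noise of size $\epsilon\sqrt{(k-1)N_k}$ relative to the within-net signal, so any letter combination survives as long as $\epsilon$ stays below a bound of the form (60).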
In this case the local fields (equation (53)) are of the form:

$h_i^{(i_1)} = \sum_{j=1, j\neq i}^{N} J_{ij}^{(i_1 i_1)} S_j^{(i_1)} + \epsilon \sum_{i_2 \neq i_1}^{k} \sum_{j=1}^{N} J_{ij}^{(i_1 i_2)} S_j^{(i_2)}.$   (56)

The condition of stability of a prescribed pattern $\vec{\xi} = \vec{\xi}^{(1)} + \vec{\xi}^{(2)} + \dots + \vec{\xi}^{(k)}$ is
that for all neurons in all sub-nets $h_i^{(i_1)} \xi_i^{(i_1)} > 0$. In order that a pattern stored
in a given sub-net not be destabilized by the coupling with other sub-nets, we
should store patterns having finite basins of attraction. This means

$\sum_{j=1, j\neq i}^{N} J_{ij}^{(i_1 i_1)} \xi_j^{(i_1)} \xi_i^{(i_1)} > c(p) > 0$   (57)

where, given that all sub-nets are of the same size, a given number of stored patterns
p per net is assumed to give the same bounds for stability for all of them. Then,
by combining equations (57) and (56), we see that the following condition ensures
stability for the $p^k$ combinations of segmented patterns:

$\epsilon < c(p) \left| \sum_{i_2 \neq i_1}^{k} \sum_j J_{ij}^{(i_1 i_2)} \xi_j^{(i_2)} \xi_i^{(i_1)} \right|^{-1}.$   (58)

If the original $J_{ij}$'s are of the order of unity, we can in addition, consistently with
that property, require that in all sub-nets

$\sum_{j=1, j\neq i}^{N_k} \left(J_{ij}^{(i_1 i_1)}\right)^2 = N_k.$   (59)

As a first approximation, we can assume that $J_{ij}^{(i_1 i_2)} \xi_j^{(i_2)}$ in (58) takes the values +1
and -1 with equal probability. Using the central limit theorem, we observe that
the upper bound for $\epsilon$, in order to store the $p^k$ patterns, will be approximately

$\epsilon \sim c(p)\big/\sqrt{(k-1) N_k}.$   (60)

If the patterns stored in each subnet are unbiased, we can simply relate c(p) to the
amount of stored states in them by using equation (30) of the previous section:

(61)

with solution

(62)

From this equation we can solve numerically for c for any p and replace it in (60).
So far we have shown that if the basins of attraction for the patterns stored
within the sub-nets are of a given size, a certain degree of interaction between the
sub-nets does not destabilize these patterns. Besides this, it would be desirable
that, due to the interaction between nets, not only are the stored segments not
weakened, but some selected combinations are preferentially recognized. This is
a property of a model introduced by U. Krey and G. Poppel [10], in which they
use the classical one-presentation Hebbian learning of the Hopfield model for
the interaction within the sub-nets. For the inter-net matrix elements they define
a coupling parameter $\epsilon_{\mu_{i_1}, \mu_{i_2}}$, which is different from zero only for some of the
combinations, and then we have:

(63)

where i falls within sub-net $i_1$ and j within sub-net $i_2$.


The patterns stored in each sub-net may be called letters and the preferred
combinations may be understood as preferred words. By using the replica method
they are able to derive some analytical properties of the model. For example, in
the case of two sub-nets (two-letter words) and T = 0, they find a phase diagram
(storage capacity $\alpha = p/N$ as a function of the magnitude of $\epsilon$) which shows regions
where only the preferred words are retrieved and others where non-preferred words
are also retrieved. Assuming that each letter in sub-net 1 forms a unique word with
a letter of sub-net 2, if we are in a region where only preferred words are retrieved,
presentation of a pattern where only one of the letters is distinguished should lead
to the retrieval of the complete word.
More interesting is the case of three-letter words. Here, in order to keep the
basic idea of sub-nets, to store and retrieve preferred words it is necessary to
introduce three-neuron or spin interactions. We divide a network with N neurons
into three sub-nets of the same size and in each of them we store a few patterns
(letters).
If we start from a fully connected network with three-spin interactions with
the intention of finding an expression for the local fields similar to (53), and then
extrapolate to something like (57), we expect to derive a complicated mixture of
terms and indices. Instead, we think that we can keep the basic ingredients of the
approach if simpler local fields are assumed. For example, for neurons in sub-net 1:

$$h_i^{(1)} = \sum_j J_{ij}^{(1)} S_j^{(1)} + \sum_{j,k} J_{ijk}^{(123)} S_j^{(2)} S_k^{(3)} \qquad (64)$$

where $J_{ij}^{(1)}$ is the usual Hopfield matrix (Eq. (46)) for patterns within subnet 1 and

$$J_{ijk}^{(123)} = \frac{\epsilon}{N^2} \sum_{(\mu_1,\mu_2,\mu_3)} \xi_i^{(1)\mu_1} \xi_j^{(2)\mu_2} \xi_k^{(3)\mu_3} \qquad (65)$$

where the sum runs over the preferred combinations (words), with i within subnet 1,
j within subnet 2 and k within subnet 3. Similar expressions
apply for subnets 2 and 3, after a cyclic rearrangement of indices. The first term in
Eq. (64) will stabilize the single letters in the subnet and the second will take care
of the collective aspects of the words. In the next section, numerical calculations
on the storage and retrieval of non-preferred combinations in two coupled Hopfield
nets, and of preferred three-letter words with a synaptic matrix including three-neuron
interactions, will be presented.
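As an illustration, the local field of Eq. (64) can be evaluated numerically as follows; this is a sketch under assumed array shapes, not the authors' code.

```python
import numpy as np

def local_field_subnet1(J1, J123, S1, S2, S3):
    """Local field of Eq. (64) for neurons in subnet 1.

    J1  : (N1, N1) Hopfield matrix within subnet 1
    J123: (N1, N2, N3) three-neuron couplings over the preferred words
    S1, S2, S3: current states (+1/-1) of the three subnets
    """
    pair_term = J1 @ S1                              # sum_j J_ij^(1) S_j^(1)
    triple_term = np.einsum('ijk,j,k->i', J123, S2, S3)
    return pair_term + triple_term
```

The `einsum` contraction implements the double sum over j and k in the second term of Eq. (64).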

4. Numerical Calculations

4.1. TWO COUPLED HOPFIELD NETS

In collaboration with G. Salini I have studied the storage capacity of two coupled
Hopfield nets, one with $N_1$ neurons and the other with $N_2 = N - N_1$ neurons.
The stored patterns are of the form:

$$\xi^{\mu\nu} = (\xi^{(1)\mu}, \xi^{(2)\nu}), \qquad \mu = 1,\ldots,p_1,\; \nu = 1,\ldots,p_2 \qquad (66)$$

In this case the local fields will be:

$$h_i^{(1)} = \sum_{j=1}^{N_1} J_{ij}^{(1)} S_j^{(1)} + \gamma_{12} \sum_{j=N_1+1}^{N} J_{ij}^{(12)} S_j^{(2)} \qquad (67)$$

and

$$h_i^{(2)} = \sum_{j=N_1+1}^{N} J_{ij}^{(2)} S_j^{(2)} + \gamma_{21} \sum_{j=1}^{N_1} J_{ij}^{(21)} S_j^{(1)} \qquad (68)$$

where we have allowed for asymmetric coupling between the sub-nets and the
synaptic matrices are of the form:

$$J_{ij}^{(1)} = \frac{1}{N} \sum_{\mu=1}^{p_1} \xi_i^{(1)\mu} \xi_j^{(1)\mu}, \qquad i,j = 1,\ldots,N_1 \qquad (69)$$

$$J_{ij}^{(12)} = \frac{1}{N} \sum_{\mu=1}^{p_1} \sum_{\nu=1}^{p_2} \xi_i^{(1)\mu} \xi_j^{(2)\nu}, \qquad i = 1,\ldots,N_1,\; j = N_1+1,\ldots,N \qquad (70)$$

with obvious extensions for $J_{ij}^{(2)}$ and $J_{ij}^{(21)}$.
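As an illustrative sketch (not the original simulation code; the 1/N normalization and the names g12, g21 for the asymmetric coupling strengths are assumptions), the full synaptic matrix of the coupled system can be assembled as:

```python
import numpy as np

def coupled_hopfield_matrix(xi1, xi2, g12, g21):
    """Synaptic matrix of two coupled Hopfield nets, in the spirit of
    Eqs. (69)-(70): Hebbian blocks within each subnet, plus a cross
    block storing all p1*p2 pattern combinations.

    xi1: (p1, N1) and xi2: (p2, N2) patterns (+1/-1);
    g12, g21: asymmetric coupling strengths (illustrative names).
    """
    N1, N2 = xi1.shape[1], xi2.shape[1]
    N = N1 + N2
    J = np.zeros((N, N))
    J[:N1, :N1] = xi1.T @ xi1 / N                    # J^(1)
    J[N1:, N1:] = xi2.T @ xi2 / N                    # J^(2)
    # J^(12)_ij = (1/N) sum_{mu,nu} xi1[mu,i] * xi2[nu,j]
    cross = np.outer(xi1.sum(axis=0), xi2.sum(axis=0)) / N
    J[:N1, N1:] = g12 * cross                        # effect of subnet 2 on 1
    J[N1:, :N1] = g21 * cross.T                      # effect of subnet 1 on 2
    np.fill_diagonal(J, 0.0)
    return J
```

Setting g12 ≠ g21 reproduces the asymmetric coupling between the sub-nets mentioned above; with g12 = g21 the matrix is symmetric.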


Figure 1.

Figure 2.

Figure 1 shows results of simulations done on a net with N = 80 neurons
divided into sub-nets with $N_1 = N_2 = 40$. We chose 5 uncorrelated patterns for
each sub-net and then looked for the possibility of storage of all the combinations
of segments in the complete net. Each point in the graph represents an average
over all the patterns and over several trials. Stability is estimated from the overlap
between the final and initial states when we start from one of the embedded patterns
and let the net reach a stationary state using asynchronous updating.
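The protocol just described — start from an embedded pattern, update asynchronously until a stationary state is reached, and measure the overlap between the final and initial states — can be sketched as follows; function names are illustrative.

```python
import numpy as np

def overlap(xi, S):
    """Overlap m = (1/N) sum_i xi_i S_i between a pattern and a state."""
    return float(xi @ S) / len(xi)

def run_async(J, S, sweeps=20, rng=None):
    """Zero-temperature asynchronous dynamics: one randomly chosen
    spin at a time aligns with its local field."""
    if rng is None:
        rng = np.random.default_rng()
    S = S.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(S)):
            S[i] = 1 if J[i] @ S >= 0 else -1
    return S
```

An embedded pattern counts as stable when the overlap between the stationary state and the initial pattern stays close to 1.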
We observe that there is an important range of couplings in which the network
is able to store a number of correlated patterns that is greater than the maximum
amount of uncorrelated patterns that a classical Hopfield net of the same size can
handle.
In Figure 2 we show the results of a calculation similar to that of Figure 1,
except that in this case the sub-nets have different numbers of neurons.
Here $N_1 = 10$, $N_2 = 70$, and the 9 combinations formed from 3 uncorrelated
patterns in sub-net 1 and 3 uncorrelated patterns in sub-net 2 are tested for storage.
What is interesting here is that storage is possible only when the effect of the large
sub-net over the small one is weakened significantly.
If in these nets we use one of the $p_1$ stored sub-patterns of sub-net 1 as a
cue, and if there is retrieval of a complete, two-component pattern, it can be any
of the $p_2$ combinations with sub-patterns of sub-net 2. This may be inconvenient
for the ability of the net to retrieve patterns from partial cues. To improve the
performance we could use the coupling parameter that appears in Equation (63),
but in that case we can store only min($p_1$, $p_2$) patterns.

4.2. THREE-LETTER WORDS [13]

Within the model described by Equations (64) and (65) we divided a net with
N = 60 neurons into three subnets of the same size. In each subnet we stored 3
patterns that represented letters of the alphabet: +1 corresponds to an 'x' and −1
to a blank in a two-dimensional array. The letters were chosen to be different
enough to have low mutual correlation.
In the first subnet the letters were 'U', 'C' and 'J', in the second, '0', 'K' and
'I' and in the third subnet, 'A', 'X' and 'S'. From the 27 possible combinations

we selected 7: 'UOA', 'UKX', 'UIS', 'COX', 'CKA', 'JOS' and 'JIA', which have
the property that any two of them differ in at least two letters. In order to test the
ability of the net to retrieve these words we did the following calculation: starting
from an initial state in which two of the letters of an embedded pattern were
present and the third segment was totally random, the fraction of times that the
complete word was retrieved was plotted against the magnitude of the coupling
parameter $\epsilon$.
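The retrieval test just described can be sketched in a simplified form where the two cue letters are clamped and only the third subnet is updated (in the calculation of the text all three subnets evolve; the $\epsilon/N^2$ normalization and all names are assumptions):

```python
import numpy as np

def retrieve_third_letter(x1, x2, x3, words, eps, word, sweeps=10, seed=0):
    """Clamp the first two letters of a preferred word as a cue and let
    the third subnet relax under Eq. (64)-style dynamics; return the
    overlap of the final state with the correct third letter.

    x1, x2, x3: (p, Ns) pattern arrays for the three equal subnets.
    words     : list of preferred (mu1, mu2, mu3) triples.
    """
    rng = np.random.default_rng(seed)
    Ns = x3.shape[1]
    N = 3 * Ns
    J3 = x3.T @ x3 / N                       # Hopfield couplings in subnet 3
    np.fill_diagonal(J3, 0.0)
    # Three-neuron couplings, summed over the preferred words
    T = np.zeros((Ns, Ns, Ns))
    for w1, w2, w3 in words:
        T += np.einsum('i,j,k->ijk', x3[w3], x1[w1], x2[w2])
    T *= eps / N**2                          # assumed normalization
    S1, S2 = x1[word[0]], x2[word[1]]        # clamped cue letters
    S3 = rng.choice([-1, 1], size=Ns)        # third segment random
    for _ in range(sweeps):
        for i in rng.permutation(Ns):
            h = J3[i] @ S3 + S1 @ T[i] @ S2
            S3[i] = 1 if h >= 0 else -1
    return float(x3[word[2]] @ S3) / Ns
```

With a sufficiently large $\epsilon$ the clamped letters act as a strong word-selecting field on the third subnet, which is the mechanism the three-neuron term is meant to provide.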

Figure 3. (Fraction of retrieved words versus coupling strength.)

The results displayed in Figure 3 show that when only the two-neuron interaction
is present ($\epsilon = 0$), there is no retrieval. For small values of $\epsilon$, the three-neuron
term has a positive effect, allowing very good recognition. However, when $\epsilon$ increases
beyond 0.15, retrieval ability is lowered, remaining stable at a value around 0.6.
Figure 4 shows what happens when, in the absence of noise, the net is initialized
in a combination of stored letters that does not correspond to a stored word and
allowed to evolve, as a function of coupling. We observe that coupling destabilizes
these patterns very rapidly.

Figure 4. (Stability of non-preferred combinations versus coupling strength.)

The results of both calculations suggest that in Equation (64) the two terms
on the right side are important and behave cooperatively. The two-neuron term
stabilizes the single letters and the three-neuron term gives meaning to the words,
but when the latter becomes too important in comparison with the former, the
correlations present become an obstacle for retrieval, as in the case of the Hopfield
model and its higher-order extensions.

Acknowledgements

This work has been supported by Fondo Nacional de Desarrollo Científico y Tecnológico
de Chile (FONDECYT) through projects 91-0443 and 1930106 and Departamento de
Investigaciones Científicas y Tecnológicas, Universidad de Santiago de Chile (DICYT).

References

[1] Amit, D.J., Gutfreund, H., Sompolinsky, H., Information Storage in Neural
Networks with Low Levels of Activity, Phys. Rev. A 35, 2293-2303 (1987).
[2] Blatt, M.G., Vergini, E.G., Neural Networks: a Local Learning Prescription
for Arbitrary Correlated Patterns, Phys. Rev. Lett. 66, 1793-1796 (1991).
[3] Buhmann, J., Divko, R., Schulten, K., Associative Memory with High Infor-
mation Content, Phys. Rev. A 39, 2689-2692 (1989).
[4] Diederich, S., Opper, M., Learning of Correlated Patterns in Spin Glass Net-
works by Local Learning Rules, Phys. Rev. Lett. 58, 949-952 (1987).
[5] Forrest, B.M., Content-addressability and Learning in Neural Networks, J.
Phys. A 21, 245-255 (1988).
[6] Gardner, E., Multiconnected Neural Network Models, J. Phys. A: Math. Gen.
20, 3453-3464 (1987).
[7] Gardner, E., The Phase Space of Interactions in Neural Network Models, J.
Phys. A: Math. Gen. 21, 257-270 (1988).
[8] Hopfield, J.J., Neural Networks and Physical Systems with Emergent Com-
putational Abilities, Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558 (1982).
[9] Krauth, W., Mézard, M., Learning Algorithms with Optimal Stability in Neu-
ral Networks, J. Phys. A 20, L745-L751 (1987).
[10] Krey, U., Pöppel, G., On the Thermodynamics of Associative Recall of Struc-
tured Patterns within a Given Context, Z. Phys. B 76, 513-520 (1989).
[11] Kohring, G.A., Neural Networks with Many-Neuron Interactions, J. Physique
51, 145-155 (1990).
[12] Matus, I.J., Perez, P., Generalized Learning Rule for High Order Neural Net-
works, Phys. Rev. A 43, 5683-5688 (1991).
[13] Perez, P., Salini, G., Storage of Words in a Neural Network, Phys. Lett. A
181, 61-66 (1993).
[14] Personnaz, L., Guyon, I., Dreyfus, G., Information Storage and Retrieval in
Spin-Like Neural Networks, J. Physique Lett. 46, L359-L365 (1985).
[15] Sherrington, D., Kirkpatrick, S., Solvable Model of a Spin-Glass, Phys. Rev.
Lett. 35, 1792-1796 (1975).
[16] Venkatesh, S., Epsilon Capacity of Neural Networks, Proc. Conf. on Neural
Networks for Computing, Snow Bird, UT, J. Denker (ed.), AIP, New York,
440-464 (1986).