
TIME SERIES

A.W. van der Vaart


Vrije Universiteit Amsterdam
PREFACE
These are lecture notes for the courses Tijdreeksen, Time Series and Financial Time Series. The material is more than can be treated in a one-semester course. See the next section for the exam requirements.
Parts marked by an asterisk * do not belong to the exam requirements.
Exercises marked by a single asterisk * are either hard or to be considered of
secondary importance. Exercises marked by a double asterisk ** are questions to which
I do not know the solution.
Amsterdam, 1995-2010 (revisions, extensions),
A.W. van der Vaart
OLD EXAM
The take-home exam consists of handing in solutions to the following problems listed in the text. (Note: the numbering of the problems may change over time. The numbers mentioned on this page refer to the numbering in this particular version. Don't use them with earlier or later versions of the lecture notes.)
1.11, 1.14, 1.29, 4.10, 5.2, 5.14, 6.8, 6.14, 6.17, 6.18, 8.12, 8.19, 8.26, 8.34,9.1, 9.8,
9.19, 10.16, 11.21, 13.5, 13.6.
If you find this too much work, remember that you can also do everything at one go at the end of the course.
LITERATURE
The following list is a small selection of books on time series analysis. Azencott/Dacunha-Castelle and Brockwell/Davis are close to the core material treated in these notes. The first book by Brockwell/Davis is a standard book for graduate courses for statisticians. Their second book is prettier, because it lacks the overload of formulas and computations of the first, but is of a lower level.

Chatfield is less mathematical, but perhaps of interest from a data-analysis point of view. Hannan and Deistler is tough reading, and on systems, which overlaps with time series analysis, but is not focused on statistics. Hamilton is a standard work used by econometricians; be aware, it has the existence results for ARMA processes wrong. Brillinger's book is old, but contains some material that is not covered in the later works. Rosenblatt's book is new, and also original in its choice of subjects. Harvey is a proponent of using system theory and the Kalman filter for a statistical time series analysis. His book is not very mathematical, and a good background to state space modelling.

Most books lack a treatment of developments of the last 10-15 years, such as GARCH models, stochastic volatility models, or cointegration. Mills and Gourieroux fill this gap to some extent. The first contains a lot of material, including examples fitting models to economic time series, but little mathematics. The second appears to be written for a more mathematical audience, but is not completely satisfying. For instance, its discussion of existence and stationarity of GARCH processes is incomplete, and the presentation is mathematically imprecise at many places. An alternative to these books are several review papers on volatility models, such as Bollerslev et al., Ghysels et al., and Shepard. Besides introductory discussion, also including empirical evidence, these have extensive lists of references for further reading.

The book by Taniguchi and Kakizawa is unique in its emphasis on asymptotic theory, including some results on local asymptotic normality. It is valuable as a resource.
[1] Azencott, R. and Dacunha-Castelle, D., (1984). Séries d'Observations Irrégulières. Masson, Paris.
[2] Brillinger, D.R., (1981). Time Series Analysis: Data Analysis and Theory. Holt, Rinehart & Winston.
[3] Bollerslev, T., Chou, Y.C. and Kroner, K., (1992). ARCH modelling in finance: a selective review of the theory and empirical evidence. J. Econometrics 52, 201-224.
[4] Bollerslev, T., Engle, R. and Nelson, D., (1994). ARCH models. Handbook of Econometrics IV (eds: R.F. Engle and D. McFadden). North Holland, Amsterdam.
[5] Brockwell, P.J. and Davis, R.A., (1991). Time Series: Theory and Methods. Springer.
[6] Chatfield, C., (1984). The Analysis of Time Series: An Introduction. Chapman and Hall.
[7] Gourieroux, C., (1997). ARCH Models and Financial Applications. Springer.
[8] Dedecker, J., Doukhan, P., Lang, G., Leon R., J.R., Louhichi, S. and Prieur, C., (2007). Weak Dependence. Springer.
[9] Gnedenko, B.V., (1962). Theory of Probability. Chelsea Publishing Company.
[10] Hall, P. and Heyde, C.C., (1980). Martingale Limit Theory and Its Applications. Academic Press, New York.
[11] Hamilton, J.D., (1994). Time Series Analysis. Princeton.
[12] Hannan, E.J. and Deistler, M., (1988). The Statistical Theory of Linear Systems. John Wiley, New York.
[13] Harvey, A.C., (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
[14] Mills, T.C., (1999). The Econometric Modelling of Financial Time Series. Cambridge University Press.
[15] Rio, A., (2000). Théorie Asymptotique des Processus Aléatoires Faiblement Dépendants. Springer Verlag, New York.
[16] Rosenblatt, M., (2000). Gaussian and Non-Gaussian Linear Time Series and Random Fields. Springer, New York.
[17] Rudin, W., (1974). Real and Complex Analysis. McGraw-Hill.
[18] Taniguchi, M. and Kakizawa, Y., (2000). Asymptotic Theory of Statistical Inference for Time Series. Springer.
[19] van der Vaart, A.W., (1998). Asymptotic Statistics. Cambridge University Press.
CONTENTS
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Transformations and Filters . . . . . . . . . . . . . . . . . . . . 12
1.3. Complex Random Variables . . . . . . . . . . . . . . . . . . . . 18
1.4. Multivariate Time Series . . . . . . . . . . . . . . . . . . . . . . 21
2. Hilbert Spaces and Prediction . . . . . . . . . . . . . . . . . . . . . 22
2.1. Hilbert Spaces and Projections . . . . . . . . . . . . . . . . . . . 22
2.2. Square-integrable Random Variables . . . . . . . . . . . . . . . . . 26
2.3. Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4. Innovation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 32
2.5. Nonlinear Prediction . . . . . . . . . . . . . . . . . . . . . . . 33
2.6. Partial Auto-Correlation . . . . . . . . . . . . . . . . . . . . . . 34
3. Stochastic Convergence . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1. Basic theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2. Convergence of Moments . . . . . . . . . . . . . . . . . . . . . . 40
3.3. Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4. Stochastic o and O symbols . . . . . . . . . . . . . . . . . . . . 42
3.5. Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6. Cramér-Wold Device . . . . . . . . . . . . . . . . . . . . . . . 44
3.7. Delta-method . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8. Lindeberg Central Limit Theorem . . . . . . . . . . . . . . . . . . 46
3.9. Minimum Contrast Estimators . . . . . . . . . . . . . . . . . . . 46
4. Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1. Finite Dependence . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2. Linear Processes . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3. Strong Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4. Uniform Mixing . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5. Martingale Differences . . . . . . . . . . . . . . . . . . . . . . . 58
4.6. Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7. Proof of Theorem 4.7 . . . . . . . . . . . . . . . . . . . . . . . 63
5. Nonparametric Estimation of Mean and Covariance . . . . . . . . . . . . 67
5.1. Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2. Auto Covariances . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3. Auto Correlations . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4. Partial Auto Correlations . . . . . . . . . . . . . . . . . . . . . 79
6. Spectral Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1. Spectral Measures . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2. Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3. Nonsummable Filters . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4. Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . 93
6.5. Multivariate Spectra . . . . . . . . . . . . . . . . . . . . . . 100
6.6. Prediction in the Frequency Domain . . . . . . . . . . . . . . . . 103
7. Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 104
7.1. Stationary Sequences . . . . . . . . . . . . . . . . . . . . . . 104
7.2. Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3. Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.4. Subadditive Ergodic Theorem . . . . . . . . . . . . . . . . . . 112
8. ARIMA Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.1. Backshift Calculus . . . . . . . . . . . . . . . . . . . . . . . 114
8.2. ARMA Processes . . . . . . . . . . . . . . . . . . . . . . . . 116
8.3. Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.4. Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.5. Auto Correlation and Spectrum . . . . . . . . . . . . . . . . . . 124
8.6. Existence of Causal and Invertible Solutions . . . . . . . . . . . . 127
8.7. Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.8. ARIMA Processes . . . . . . . . . . . . . . . . . . . . . . . . 130
8.9. VARMA Processes . . . . . . . . . . . . . . . . . . . . . . . 131
8.10. ARMA-X Processes . . . . . . . . . . . . . . . . . . . . . . . 133
9. GARCH Processes . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.1. Linear GARCH . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.2. Linear GARCH with Leverage and Power GARCH . . . . . . . . . 149
9.3. Exponential GARCH . . . . . . . . . . . . . . . . . . . . . . 149
9.4. GARCH in Mean . . . . . . . . . . . . . . . . . . . . . . . . 150
10. State Space Models . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.1. Hidden Markov Models and State Spaces . . . . . . . . . . . . . 153
10.2. Kalman Filtering . . . . . . . . . . . . . . . . . . . . . . . . 158
10.3. Nonlinear Filtering . . . . . . . . . . . . . . . . . . . . . . . 162
10.4. Stochastic Volatility Models . . . . . . . . . . . . . . . . . . . 165
11. Moment and Least Squares Estimators . . . . . . . . . . . . . . . . 170
11.1. Yule-Walker Estimators . . . . . . . . . . . . . . . . . . . . . 170
11.2. Moment Estimators . . . . . . . . . . . . . . . . . . . . . . . 177
11.3. Least Squares Estimators . . . . . . . . . . . . . . . . . . . . 188
12. Spectral Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 193
12.1. Finite Fourier Transform . . . . . . . . . . . . . . . . . . . . . 194
12.2. Periodogram . . . . . . . . . . . . . . . . . . . . . . . . . . 195
12.3. Estimating a Spectral Density . . . . . . . . . . . . . . . . . . 200
12.4. Estimating a Spectral Distribution . . . . . . . . . . . . . . . . 205
13. Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 210
13.1. General Likelihood . . . . . . . . . . . . . . . . . . . . . . . 211
13.2. Asymptotics under a Correct Model . . . . . . . . . . . . . . . . 213
13.3. Asymptotics under Misspecification . . . . . . . . . . . . . . . . 217
13.4. Gaussian Likelihood . . . . . . . . . . . . . . . . . . . . . . . 219
13.5. Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 227
1 Introduction
Oddly enough, a statistical time series is a mathematical sequence, not a series. In this book we understand a time series to be a doubly infinite sequence

$$\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$$

of random variables or random vectors. We refer to the index $t$ of $X_t$ as "time" and think of $X_t$ as the state or output of a stochastic system at time $t$. The interpretation of the index as "time" is unimportant for the mathematical theory, which is concerned with the joint distribution of the variables only, but the implied ordering of the variables is usually essential. Unless stated otherwise, the variable $X_t$ is assumed to be real valued, but we shall also consider sequences of random vectors, which sometimes have values in the complex numbers. We often write "the time series $X_t$" rather than use the correct $(X_t: t \in \mathbb{Z})$, and instead of "time series" we also speak of "process", "stochastic process", or "signal".
We shall be interested in the joint distribution of the variables $X_t$. The easiest way to ensure that these variables, and other variables that we may introduce, possess joint laws, is to assume that they are defined as measurable maps on a single underlying probability space. This is what we meant, but did not say, in the preceding definition. In general we make the underlying probability space formal only if confusion might otherwise arise, and then denote it by $(\Omega, \mathcal{U}, \mathrm{P})$, with $\omega$ denoting a typical element of $\Omega$.

Time series theory is a mixture of probabilistic and statistical concepts. The probabilistic part is to study and characterize probability distributions of sets of variables $X_t$ that will typically be dependent. The statistical problem is to determine the probability distribution of the time series given observations $X_1, \ldots, X_n$ at times $1, 2, \ldots, n$. The resulting stochastic model can be used in two ways:
- understanding the stochastic system;
- predicting the future, i.e. $X_{n+1}, X_{n+2}, \ldots$.
1.1 Stationarity
In order to have any chance of success at these tasks it is necessary to assume some a-priori structure of the time series. Indeed, if the $X_t$ could be completely arbitrary random variables, then $(X_1, \ldots, X_n)$ would constitute a single observation from a completely unknown distribution on $\mathbb{R}^n$. Conclusions about this distribution would be impossible, let alone about the distribution of the future values $X_{n+1}, X_{n+2}, \ldots$.
A basic type of structure is stationarity. This comes in two forms.
1.1 Definition. The time series $X_t$ is strictly stationary if the distribution (on $\mathbb{R}^{h+1}$) of the vector $(X_t, X_{t+1}, \ldots, X_{t+h})$ is independent of $t$, for every $h \in \mathbb{N}$.

1.2 Definition. The time series $X_t$ is stationary (or more precisely second order stationary) if $\mathrm{E}X_t$ and $\mathrm{E}X_{t+h}X_t$ exist and are finite and do not depend on $t$, for every $h \in \mathbb{N}$.
It is clear that a strictly stationary time series with finite second moments is also stationary. For a stationary time series the auto-covariance and auto-correlation at lag $h \in \mathbb{Z}$ are defined by

$$\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t), \qquad \rho_X(h) = \rho(X_{t+h}, X_t) = \frac{\gamma_X(h)}{\gamma_X(0)}.$$

The auto-covariance and auto-correlation are functions $\gamma_X: \mathbb{Z} \to \mathbb{R}$ and $\rho_X: \mathbb{Z} \to [-1, 1]$. Both functions are symmetric about 0. Together with the mean $\mu = \mathrm{E}X_t$, they determine the first and second moments of the stationary time series. Much of time series analysis (too much?) concerns this second order structure.

The auto-covariance and auto-correlation functions are measures of (linear) dependence between the variables at different time instants, except at lag 0, where $\gamma_X(0) = \mathrm{var}\,X_t$ gives the variance of (any) $X_t$, and $\rho_X(0) = 1$.
1.3 Example (White noise). A doubly infinite sequence of independent, identically distributed random variables $X_t$ is a strictly stationary time series. Its auto-covariance function is, with $\sigma^2 = \mathrm{var}\,X_t$,

$$\gamma_X(h) = \begin{cases}\sigma^2, & \text{if } h = 0,\\ 0, & \text{if } h \neq 0.\end{cases}$$

Any stationary time series $X_t$ with mean zero and covariance function of this type is called a white noise series. Thus any mean-zero i.i.d. sequence with finite variances is a white noise series. The converse is not true: there exist white noise series that are not strictly stationary.

The name "noise" should be intuitively clear. We shall see why it is called "white" when discussing the spectral theory of time series in Chapter 6.

White noise series are important building blocks to construct other series, but from the point of view of time series analysis they are not so interesting. More interesting are series of dependent random variables, so that, to a certain extent, the future can be predicted from the past.
Figure 1.1. Realization of a Gaussian white noise series of length 250.
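As a quick illustration (not part of the original notes), a realization such as the one in Figure 1.1 can be simulated in a few lines of Python; the seed and length are arbitrary choices.

```python
import numpy as np

# Simulate a Gaussian white noise series of length 250, as in Figure 1.1.
rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily
n = 250
X = rng.standard_normal(n)            # i.i.d. N(0,1): strictly stationary white noise

# The sample mean and sample variance should be close to 0 and 1.
print(X.mean(), X.var())
```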
1.4 EXERCISE. Construct a white noise sequence that is not strictly stationary.
1.5 Example (Deterministic trigonometric series). Let $A$ and $B$ be given, uncorrelated random variables with mean zero and variance $\sigma^2$, and let $\lambda$ be a given number. Then

$$X_t = A\cos(\lambda t) + B\sin(\lambda t)$$

defines a stationary time series. Indeed, $\mathrm{E}X_t = 0$ and

$$\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t) = \cos\bigl(\lambda(t+h)\bigr)\cos(\lambda t)\,\mathrm{var}\,A + \sin\bigl(\lambda(t+h)\bigr)\sin(\lambda t)\,\mathrm{var}\,B = \sigma^2\cos(\lambda h).$$

Even though $A$ and $B$ are random variables, this type of time series is called deterministic in time series theory. Once $A$ and $B$ have been determined (at time $-\infty$, say), the process behaves as a deterministic trigonometric function. This type of time series is an important building block to model cyclic events in a system, but it is not the typical example of a statistical time series that we study in this course. Predicting the future is too easy in this case.
1.6 Example (Moving average). Given a white noise series $Z_t$ with variance $\sigma^2$ and a number $\theta$, set

$$X_t = Z_t + \theta Z_{t-1}.$$

This is called a moving average of order 1. The series is stationary with $\mathrm{E}X_t = 0$ and

$$\gamma_X(h) = \mathrm{cov}(Z_{t+h} + \theta Z_{t+h-1}, Z_t + \theta Z_{t-1}) = \begin{cases}(1+\theta^2)\sigma^2, & \text{if } h = 0,\\ \theta\sigma^2, & \text{if } h = \pm 1,\\ 0, & \text{otherwise.}\end{cases}$$

Thus $X_s$ and $X_t$ are uncorrelated whenever $s$ and $t$ are two or more time instants apart. We speak of short range dependence and say that the time series has short memory. Figure 1.2 shows a realization of a moving average series.

If the $Z_t$ are an i.i.d. sequence, then the moving average is strictly stationary.

A natural generalization are higher order moving averages of the form $X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}$.
Figure 1.2. Realization of length 250 of the moving average series $X_t = Z_t - 0.5Z_{t-1}$ for Gaussian white noise $Z_t$.
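The following minimal sketch (not part of the original notes) simulates an MA(1) series and compares its empirical autocovariances with the formulas above. The helper `sample_autocov`, the seed, and the parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)         # arbitrary seed
n, theta, sigma = 100_000, -0.5, 1.0   # long series so sample moments are close to theory
Z = sigma * rng.standard_normal(n + 1)
X = Z[1:] + theta * Z[:-1]             # X_t = Z_t + theta * Z_{t-1}

def sample_autocov(x, h):
    """Empirical autocovariance: average of (x_{t+h} - xbar)(x_t - xbar)."""
    x = x - x.mean()
    return (x[h:] * x[:len(x) - h]).mean()

print(sample_autocov(X, 0), (1 + theta**2) * sigma**2)  # ~1.25
print(sample_autocov(X, 1), theta * sigma**2)           # ~-0.5
print(sample_autocov(X, 2))                             # ~0
```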
1.7 EXERCISE. Prove that the series $X_t$ in Example 1.6 is strictly stationary if $Z_t$ is a strictly stationary sequence.
1.8 Example (Autoregression). Given a white noise series $Z_t$ with variance $\sigma^2$ and a number $\phi$, consider the equations

$$X_t = \phi X_{t-1} + Z_t, \qquad t \in \mathbb{Z}.$$

To give this equation a clear meaning, we take the white noise series $Z_t$ as being defined on some probability space $(\Omega, \mathcal{U}, \mathrm{P})$, and we look for a solution $X_t$ to the equation: a time series defined on the same probability space that solves the equation pointwise in $\omega$ (at least for almost every $\omega$).

This equation does not define $X_t$, but in general has many solutions. Indeed, we can define the sequence $Z_t$ and the variable $X_0$ in some arbitrary way on the given probability space and next define the remaining variables $X_t$ for $t \in \mathbb{Z}\setminus\{0\}$ by the equation (if $\phi \neq 0$). However, suppose that we are only interested in stationary solutions. Then there is either no solution or a unique solution, depending on the value of $\phi$, as we shall now prove.

Suppose first that $|\phi| < 1$. By iteration we find that

$$X_t = \phi(\phi X_{t-2} + Z_{t-1}) + Z_t = \cdots = \phi^k X_{t-k} + \phi^{k-1}Z_{t-k+1} + \cdots + \phi Z_{t-1} + Z_t.$$

For a stationary sequence $X_t$ we have that $\mathrm{E}(\phi^k X_{t-k})^2 = \phi^{2k}\mathrm{E}X_0^2 \to 0$ as $k \to \infty$. This suggests that a solution of the equation is given by the infinite series

$$X_t = Z_t + \phi Z_{t-1} + \phi^2 Z_{t-2} + \cdots = \sum_{j=0}^{\infty}\phi^j Z_{t-j}.$$

We show below in Lemma 1.28 that the series on the right side converges almost surely, so that the preceding display indeed defines some random variable $X_t$. This is a moving average of infinite order. We can check directly, by substitution in the equation, that $X_t$ satisfies the auto-regressive relation. (For every $\omega$ for which the series converges; hence only almost surely. We consider this to be good enough.)

If we are allowed to interchange expectations and infinite sums, then we see that

$$\mathrm{E}X_t = \sum_{j=0}^{\infty}\phi^j\,\mathrm{E}Z_{t-j} = 0, \qquad \gamma_X(h) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\phi^i\phi^j\,\mathrm{E}Z_{t+h-i}Z_{t-j} = \sum_{j=0}^{\infty}\phi^{h+j}\phi^j\sigma^2 1_{h+j\ge 0} = \frac{\phi^{|h|}}{1-\phi^2}\,\sigma^2.$$

We prove the validity of these formulas in Lemma 1.28. It follows that $X_t$ is indeed a stationary time series. In this example $\gamma_X(h) \neq 0$ for every $h$, so that every pair of variables $X_s$ and $X_t$ is dependent. However, because $\gamma_X(h) \to 0$ at exponential speed as $h \to \infty$, this series is still considered to be short-range dependent. Note that $\gamma_X(h)$ oscillates if $\phi < 0$ and decreases monotonely if $\phi > 0$.

For $\phi = 1$ the situation is very different: no stationary solution exists. To see this, note that the equation obtained before by iteration now takes the form, for $k = t$,

$$X_t = X_0 + Z_1 + \cdots + Z_t.$$

This implies that $\mathrm{var}(X_t - X_0) = t\sigma^2 \to \infty$ as $t \to \infty$. However, by the triangle inequality we have that

$$\mathrm{sd}(X_t - X_0) \le \mathrm{sd}\,X_t + \mathrm{sd}\,X_0 = 2\,\mathrm{sd}\,X_0,$$

for a stationary sequence $X_t$. Hence no stationary solution exists. The situation for $\phi = 1$ is "explosive": the randomness increases significantly as $t \to \infty$ due to the introduction of a new $Z_t$ for every $t$. Figure 1.3 illustrates the nonstationarity of a random walk.

The cases $\phi = -1$ and $|\phi| > 1$ are left as exercises.

The auto-regressive time series of order one generalizes naturally to auto-regressive series of the form $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$. The existence of stationary solutions $X_t$ to this equation depends on the locations of the roots of the polynomial $z \mapsto 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p$, as is discussed in Chapter 8.
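A small sketch (not in the original notes) of simulating the stationary AR(1) solution for $|\phi| < 1$: the recursion with a burn-in period approximates the infinite moving average $\sum_j \phi^j Z_{t-j}$, since the influence of the arbitrary starting value decays like $\phi^t$. The seed, parameter values and helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
phi, sigma, n, burn = 0.8, 1.0, 100_000, 500

# Recursion X_t = phi*X_{t-1} + Z_t with burn-in, approximating the stationary solution.
Z = sigma * rng.standard_normal(n + burn)
X = np.zeros(n + burn)
for t in range(1, n + burn):
    X[t] = phi * X[t - 1] + Z[t]
X = X[burn:]

def gamma(h):
    x = X - X.mean()
    return (x[h:] * x[:len(x) - h]).mean()

for h in range(4):
    # Compare with the theoretical autocovariance phi^|h| * sigma^2 / (1 - phi^2).
    print(h, gamma(h), phi**h * sigma**2 / (1 - phi**2))
```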
1.9 EXERCISE. Consider the cases $\phi = -1$ and $|\phi| > 1$. Show that in the first case there is no stationary solution and in the second case there is a unique stationary solution. [For $|\phi| > 1$ mimic the argument for $|\phi| < 1$, but with time reversed: iterate $X_{t-1} = (1/\phi)X_t - Z_t/\phi$.]
1.10 Example (GARCH). A time series $X_t$ is called a GARCH(1, 1) process if, for given nonnegative constants $\alpha$, $\phi$ and $\theta$, and a given i.i.d. sequence $Z_t$ with mean zero and unit variance, it satisfies the system of equations

$$X_t = \sigma_t Z_t, \qquad \sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2.$$

The variable $\sigma_t$ is interpreted as the scale or volatility of the time series $X_t$ at time $t$. By the second equation this is modelled as dependent on the past. To make this structure explicit it is often included (or thought to be automatic) in the definition of a GARCH process that, for every $t$,
(i) $\sigma_t$ is a measurable function of $X_{t-1}, X_{t-2}, \ldots$;
(ii) $Z_t$ is independent of $X_{t-1}, X_{t-2}, \ldots$.
In view of the first GARCH equation, properties (i)-(ii) imply that

$$\mathrm{E}X_t = \mathrm{E}\sigma_t\,\mathrm{E}Z_t = 0, \qquad \mathrm{E}X_sX_t = \mathrm{E}(X_s\sigma_t)\,\mathrm{E}Z_t = 0, \quad (s < t).$$

Therefore, a GARCH process with the extra properties (i)-(ii) is a white noise process. Furthermore,

$$\mathrm{E}(X_t\,|\,X_{t-1}, X_{t-2}, \ldots) = \sigma_t\,\mathrm{E}Z_t = 0,$$
$$\mathrm{E}(X_t^2\,|\,X_{t-1}, X_{t-2}, \ldots) = \sigma_t^2\,\mathrm{E}Z_t^2 = \sigma_t^2.$$
Figure 1.3. Realization of a random walk $X_t = Z_t + \cdots + Z_0$ of length 250 for $Z_t$ Gaussian white noise.
The first equation shows that $X_t$ is a "martingale difference series": the past does not help to predict future values of the time series. The second equation exhibits $\sigma_t^2$ as the conditional variance of $X_t$ given the past. Because $\sigma_t^2$ is a nontrivial time series by the second GARCH equation (if $\phi + \theta \neq 0$), the time series $X_t$ is not an i.i.d. sequence.

Because the conditional mean of $X_t$ given the past is zero, a GARCH process will fluctuate around the value 0. A large deviation $|X_{t-1}|$ from 0 at time $t-1$ will cause a large conditional variance $\sigma_t^2 = \alpha + \theta X_{t-1}^2 + \phi\sigma_{t-1}^2$ at time $t$, and then the deviation of $X_t = \sigma_t Z_t$ from 0 will tend to be large as well. Similarly, small deviations from 0 will tend to be followed by other small deviations. Thus a GARCH process will alternate between periods of big fluctuations and periods of small fluctuations. This is also expressed by saying that a GARCH process exhibits volatility clustering. Volatility clustering is commonly observed in time series of stock returns. The GARCH(1, 1) process has become a popular model for such time series. Figure 1.5 shows a realization. The signs of the $X_t$ are equal to the signs of the $Z_t$ and hence will be independent over time.

The abbreviation GARCH is for "generalized auto-regressive conditional heteroscedasticity": the conditional variances are not i.i.d., and depend on the past through an auto-regressive scheme. Typically, the conditional variances $\sigma_t^2$ are not directly observed, but must be inferred from the observed sequence $X_t$.
Figure 1.4. Realization of length 250 of the stationary solution to the equation $X_t = 0.5X_{t-1} + 0.2X_{t-2} + Z_t$ for $Z_t$ Gaussian white noise.
Being a white noise process, a GARCH process can itself be used as input in another scheme, such as an auto-regressive or a moving average series, leading to e.g. AR-GARCH processes. There are also many generalizations of the GARCH structure. In a GARCH(p, q) process $\sigma_t^2$ is allowed to depend on $\sigma_{t-1}^2, \ldots, \sigma_{t-p}^2$ and $X_{t-1}^2, \ldots, X_{t-q}^2$. A GARCH(0, q) process is also called an ARCH process. The rationale of using the squares $X_t^2$ appears to be mostly that these are nonnegative and simple; there are many variations using other functions.

As with the auto-regressive relation, the defining GARCH equations are recursive, and it is not immediately clear that there exist solutions to the pair of defining equations, and, if so, whether the solution is unique and satisfies (i)-(ii). If we are only interested in the process for $t \ge 0$, then we might complement the equations with an initial value for $\sigma_0^2$, and solve the equations by iteration, taking a suitable sequence $Z_t$ as given. Alternatively, we may study the existence of stationary solutions $X_t$ to the equations.

We now show that there exists a stationary solution $X_t$ to the GARCH equations if $\phi + \theta < 1$, which then is unique and automatically possesses the properties (i)-(ii).
By iterating the GARCH relation we find that, for every $n \ge 0$,

$$\sigma_t^2 = \alpha + (\phi + \theta Z_{t-1}^2)\sigma_{t-1}^2 = \alpha + \alpha\sum_{j=1}^{n}(\phi + \theta Z_{t-1}^2)\cdots(\phi + \theta Z_{t-j}^2) + (\phi + \theta Z_{t-1}^2)\cdots(\phi + \theta Z_{t-n-1}^2)\,\sigma_{t-n-1}^2.$$

The sequence $\bigl((\phi + \theta Z_{t-1}^2)\cdots(\phi + \theta Z_{t-n-1}^2)\bigr)_{n=1}^{\infty}$, which consists of nonnegative variables with means $(\phi + \theta)^{n+1}$, converges in probability to zero if $\phi + \theta < 1$. If the time series $\sigma_t^2$ is bounded in probability as $t \to -\infty$, then the term on the far right converges to zero in probability as $n \to \infty$. Thus for a stationary solution $(X_t, \sigma_t)$ we must have

$$(1.1)\qquad \sigma_t^2 = \alpha + \alpha\sum_{j=1}^{\infty}(\phi + \theta Z_{t-1}^2)\cdots(\phi + \theta Z_{t-j}^2).$$

This series has positive terms, and hence is always well defined. It is easily checked that the series converges in mean if and only if $\phi + \theta < 1$ (cf. Lemma 1.26). Given an i.i.d. series $Z_t$, we can then define a process $X_t$ by first defining the conditional variances $\sigma_t^2$ by (1.1), and next setting $X_t = \sigma_t Z_t$. It can be verified by substitution that this process $X_t$ solves the GARCH relationship. Hence a stationary solution to the GARCH equations exists if $\phi + \theta < 1$.

Because $Z_t$ is independent of $Z_{t-1}, Z_{t-2}, \ldots$, by (1.1) it is also independent of $\sigma_{t-1}^2, \sigma_{t-2}^2, \ldots$, and hence also of $X_{t-1} = \sigma_{t-1}Z_{t-1}, X_{t-2} = \sigma_{t-2}Z_{t-2}, \ldots$. In addition it follows that the time series $X_t = \sigma_t Z_t$ is strictly stationary, being a fixed measurable transformation of $(Z_t, Z_{t-1}, \ldots)$ for every $t$.

By iterating the auto-regressive relation $\sigma_t^2 = \phi\sigma_{t-1}^2 + W_t$, with $W_t = \alpha + \theta X_{t-1}^2$, in the same way as in Example 1.8, we also find that for the stationary solution $\sigma_t^2 = \sum_{j=0}^{\infty}\phi^j W_{t-j}$. Hence $\sigma_t$ is $(X_{t-1}, X_{t-2}, \ldots)$-measurable.

An inspection of the preceding argument shows that a strictly stationary solution exists under a weaker condition than $\phi + \theta < 1$. This is because the infinite series (1.1) may converge almost surely, without converging in mean (see Exercise 1.14). We shall study this further in Chapter 9.
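The following sketch (not part of the original notes) simulates a GARCH(1, 1) process by the defining recursion, initializing $\sigma_0^2$ at its stationary mean $\alpha/(1-\phi-\theta)$. The parameter values are arbitrary choices satisfying $\phi + \theta < 1$ (and giving a finite fourth moment), and the seed is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, phi, theta = 0.1, 0.5, 0.3   # arbitrary, with phi + theta < 1
n, burn = 100_000, 1000

Z = rng.standard_normal(n + burn)
sigma2 = np.empty(n + burn)
X = np.empty(n + burn)
sigma2[0] = alpha / (1 - phi - theta)   # stationary mean of sigma_t^2
X[0] = np.sqrt(sigma2[0]) * Z[0]
for t in range(1, n + burn):
    sigma2[t] = alpha + phi * sigma2[t - 1] + theta * X[t - 1]**2
    X[t] = np.sqrt(sigma2[t]) * Z[t]
X = X[burn:]

# X_t is (approximately) white noise with E X_t^2 = alpha / (1 - phi - theta),
# but the squares are positively correlated: volatility clustering.
print(X.mean(), X.var(), alpha / (1 - phi - theta))
x2 = X**2 - (X**2).mean()
print((x2[1:] * x2[:-1]).mean() / (x2 * x2).mean())   # lag-1 autocorrelation of X_t^2
```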
1.11 EXERCISE. Let $\phi + \theta \in [0, 1)$ and $1 - \phi^2 - 2\phi\theta - \kappa\theta^2 > 0$, where $\kappa = \mathrm{E}Z_t^4$. Show that the second and fourth (marginal) moments of a stationary GARCH process are given by $\alpha/(1-\phi-\theta)$ and $\kappa\alpha^2(1+\phi+\theta)/(1-\phi^2-2\phi\theta-\kappa\theta^2)(1-\phi-\theta)$. From this compute the kurtosis of the GARCH process with standard normal $Z_t$. [You can use (1.1), but it is easier to use the GARCH relations.]
1.12 EXERCISE. Show that $\mathrm{E}X_t^4 = \infty$ if $1 - \phi^2 - 2\phi\theta - \kappa\theta^2 = 0$.
1.13 EXERCISE. Suppose that the process $X_t$ is square-integrable and satisfies the GARCH relation for an i.i.d. sequence $Z_t$ such that $Z_t$ is independent of $X_{t-1}, X_{t-2}, \ldots$ and such that $\sigma_t^2 = \mathrm{E}(X_t^2\,|\,X_{t-1}, X_{t-2}, \ldots)$, for every $t$, and some $\alpha, \phi, \theta > 0$. Show that $\phi + \theta < 1$. [Derive that $\mathrm{E}X_t^2 = \alpha + \alpha\sum_{j=1}^{n}(\phi+\theta)^j + (\phi+\theta)^{n+1}\mathrm{E}X_{t-n-1}^2$.]
Figure 1.5. Realization of the GARCH(1, 1) process with $\alpha = 0.1$, $\phi = 0$ and $\theta = 0.8$ of length 500 for $Z_t$ Gaussian white noise.
1.14 EXERCISE. Let $Z_t$ be an i.i.d. sequence with $\mathrm{E}\log(Z_t^2) < 0$. Show that $\sum_{j=0}^{\infty}Z_t^2 Z_{t-1}^2\cdots Z_{t-j}^2 < \infty$ almost surely. [Use the Law of Large Numbers.]
1.15 Example (Stochastic volatility). A general approach to obtain a time series with volatility clustering is to define $X_t = \sigma_t Z_t$ for an i.i.d. sequence $Z_t$ and a process $\sigma_t$ that depends "positively" on its past. A GARCH model fits this scheme, but a simpler construction is to let $\sigma_t$ depend only on its own past and independent noise. Because $\sigma_t$ is to have an interpretation as a scale parameter, we restrain it to be positive. One way to combine these requirements is to set

$$h_t = \phi h_{t-1} + W_t, \qquad \sigma_t^2 = e^{h_t}, \qquad X_t = \sigma_t Z_t.$$

Here $W_t$ is a white noise sequence, $h_t$ is a (stationary) solution to the auto-regressive equation, and the process $Z_t$ is i.i.d. and independent of the process $W_t$. If $\phi > 0$ and $\sigma_{t-1} = e^{h_{t-1}/2}$ is large, then $\sigma_t = e^{h_t/2}$ will tend to be large as well. Hence the process $X_t$ will exhibit volatility clustering.
The process $h_t$ will typically not be observed and for that reason is sometimes called latent. A stochastic volatility process of this type is an example of a (nonlinear) state space model, discussed in Chapter 10. Rather than defining $\sigma_t$ by an auto-regression in the exponent, we may choose a different scheme. For instance, an EGARCH(p, 0) model postulates the relationship

$$\log\sigma_t = \alpha + \sum_{j=1}^{p}\phi_j\log\sigma_{t-j}.$$

This is not a stochastic volatility model, because it does not include a random disturbance. The symmetric EGARCH(p, q) model repairs this by adding terms depending on the past of the observed series $X_t = \sigma_t Z_t$, giving

$$\log\sigma_t = \alpha + \sum_{j=1}^{q}\theta_j|Z_{t-j}| + \sum_{j=1}^{p}\phi_j\log\sigma_{t-j}.$$

In this sense GARCH processes and their variants are much related to stochastic volatility models. In view of the recursive nature of the definitions of $\sigma_t$ and $X_t$, they are perhaps more complicated.
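A minimal sketch (not in the original notes) of simulating the stochastic volatility model above with standard Gaussian $W_t$ and $Z_t$; the value $\phi = 0.8$ mirrors Figure 1.6, and the seed and burn-in length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
phi, n, burn = 0.8, 250, 500

W = rng.standard_normal(n + burn)
Z = rng.standard_normal(n + burn)
h = np.zeros(n + burn)
for t in range(1, n + burn):
    h[t] = phi * h[t - 1] + W[t]        # latent log-volatility, AR(1)
X = np.exp(h[burn:] / 2) * Z[burn:]     # observed series X_t = sigma_t Z_t, sigma_t = e^{h_t/2}
print(X[:10])
```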
1.2 Transformations and Filters

Many time series in real life are not stationary. Rather than modelling a nonstationary sequence, such a sequence is often transformed into a time series that is (assumed to be) stationary. The statistical analysis then focuses on the transformed series.

Two important deviations from stationarity are trend and seasonality. A trend is a long-term, steady increase or decrease in the general level of the time series. A seasonal component is a cyclic change in the level, the cycle length being for instance a year or a week. Even though Example 1.5 shows that a perfectly cyclic series can be modelled as a stationary series, it is often considered wise to remove such perfect cycles from a given series before applying statistical techniques.

There are many ways in which a given time series can be transformed into a series that is easier to analyse. Transforming individual variables $X_t$ into variables $f(X_t)$ by a fixed function $f$ (such as the logarithm) is a common technique to render the variables more stable. It does not transform a nonstationary series into a strictly stationary one, but it does change the auto-covariance function, which may be a more useful summary for the transformed series. Another simple technique is detrending by subtracting a best fitting polynomial in $t$ of some fixed degree. This is commonly found by the method of least squares: given a nonstationary time series $X_t$ we determine constants $\beta_0, \ldots, \beta_p$ by minimizing

$$(\beta_0, \ldots, \beta_p) \mapsto \sum_{t=1}^{n}\bigl(X_t - \beta_0 - \beta_1 t - \cdots - \beta_p t^p\bigr)^2.$$
Figure 1.6. Realization of length 250 of the stochastic volatility model $X_t = e^{h_t/2}Z_t$ for a standard Gaussian i.i.d. process $Z_t$ and a stationary auto-regressive process $h_t = 0.8h_{t-1} + W_t$ for a standard Gaussian i.i.d. process $W_t$.
Next the time series $X_t - \hat\beta_0 - \hat\beta_1 t - \cdots - \hat\beta_p t^p$ is assumed to be stationary.
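As a brief illustration (not part of the original notes), the least-squares detrending just described can be carried out with a polynomial fit; the degree, data and seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 2
t = np.arange(1, n + 1)
X = 1.0 * t + 0.05 * t**2 + rng.standard_normal(n)   # trend plus noise (cf. Figure 1.8)

beta = np.polyfit(t, X, deg=p)    # least-squares coefficients, highest degree first
trend = np.polyval(beta, t)
detrended = X - trend             # this residual series is then treated as stationary
print(beta)
```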
A standard transformation for financial time series is to (log) returns, given by

$$\log\frac{X_t}{X_{t-1}}, \qquad \text{or} \qquad \frac{X_t}{X_{t-1}} - 1.$$

If $X_t/X_{t-1}$ is close to unity for all $t$, then these transformations are similar, as $\log x \approx x - 1$ for $x \approx 1$. Because $\log(e^{ct}/e^{c(t-1)}) = c$, a log return can be intuitively interpreted as the exponent of exponential growth. Many financial time series exhibit an exponential trend, not always in the right direction for the owners of the corresponding assets.
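For concreteness (not in the original notes), log returns and simple returns of a price series can be computed as follows; the price values are made up for illustration.

```python
import numpy as np

prices = np.array([50.0, 51.0, 50.5, 52.0, 53.1])   # illustrative price series
log_returns = np.diff(np.log(prices))               # log(X_t / X_{t-1})
simple_returns = prices[1:] / prices[:-1] - 1        # X_t / X_{t-1} - 1
print(log_returns, simple_returns)                   # nearly equal when ratios are close to 1
```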
A general method to transform a nonstationary sequence into a stationary one, advocated with much success in a famous book by Box and Jenkins, is filtering.

1.16 Definition. The (linear) filter with filter coefficients $(\psi_j: j \in \mathbb{Z})$ is the operation that transforms a given time series $X_t$ into the time series $Y_t = \sum_{j\in\mathbb{Z}}\psi_j X_{t-j}$.

A linear filter is a moving average of infinite order. In Lemma 1.28 we give conditions for the infinite series to be well defined. All filters used in practice are finite filters: only finitely many coefficients $\psi_j$ are nonzero.
Figure 1.7. Prices of Hewlett Packard on the New York Stock Exchange and corresponding log returns.
Important examples are the difference filter

$$\nabla X_t = X_t - X_{t-1},$$

and its repetitions $\nabla^k X_t = \nabla\nabla^{k-1}X_t$, defined recursively for $k = 2, 3, \ldots$. Differencing with a bigger lag gives the seasonal difference filter $\nabla_k X_t = X_t - X_{t-k}$, for $k$ the length of the "season" or "period". It is intuitively clear that a time series may have stationary (seasonal) differences (or increments), but may itself be nonstationary.

1.17 Example (Polynomial trend). A linear trend model could take the form $X_t = at + Z_t$ for a strictly stationary time series $Z_t$. If $a \neq 0$, then the time series $X_t$ is not stationary in the mean. However, the differenced series $\nabla X_t = a + Z_t - Z_{t-1}$ is stationary. Thus differencing can be used to remove a linear trend. Similarly, a polynomial trend can be removed by repeated differencing: a polynomial trend of degree $k$ is removed by applying $\nabla^k$.
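As a quick check (not in the original notes) of the claim that repeated differencing removes a polynomial trend, the sketch below differences a series with a quadratic trend twice; the coefficients and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
t = np.arange(n)
Z = rng.standard_normal(n)            # strictly stationary "noise"
X = 2.0 * t + 0.1 * t**2 + Z          # linear plus quadratic trend

d1 = np.diff(X)                       # first difference: linear part removed, trend remains
d2 = np.diff(X, n=2)                  # second difference: trend gone, constant mean 2*b
print(d1[:5])
print(d2.mean())                      # approximately 2 * 0.1 = 0.2
```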
1.18 EXERCISE. Check this for a series of the form $X_t = at + bt^2 + Z_t$.

1.19 EXERCISE. Does a (repeated) seasonal filter also remove polynomial trend?
Figure 1.8. Realization of the time series $t + 0.05t^2 + X_t$ for the stationary auto-regressive process $X_t$ satisfying $X_t - 0.8X_{t-1} = Z_t$ for Gaussian white noise $Z_t$, and the same series after once and twice differencing.
1.20 Example (Random walk). A random walk is defined as the sequence of partial sums $X_t = Z_1 + Z_2 + \cdots + Z_t$ of an i.i.d. sequence $Z_t$ (with $X_0 = 0$). A random walk is not stationary, but the differenced series $\nabla X_t = Z_t$ certainly is.

1.21 Example (Monthly cycle). If $X_t$ is the value of a system in month $t$, then $\nabla_{12}X_t$ is the change in the system during the past year. For seasonal variables without trend this series might be modelled as stationary. For series that contain both yearly seasonality and polynomial trend, the series $\nabla^k\nabla_{12}X_t$ might be stationary.
1.22 Example (Weekly averages). If $X_t$ is the value of a system at day $t$, then $Y_t = (1/7)\sum_{j=0}^{6}X_{t-j}$ is the average value over the last week. This series might show trend, but should not show seasonality due to the day-of-the-week. We could study seasonality by considering the time series $X_t - Y_t$, which results from filtering the series $X_t$ with coefficients $(\psi_0, \ldots, \psi_6) = (6/7, -1/7, \ldots, -1/7)$.
1.23 Example (Exponential smoothing). An ad-hoc method for predicting the future is to equate the future to the present or, more generally, to the average of the last $k$ observed values of a time series. When averaging past values it is natural to give more weight to the most recent values. Exponentially decreasing weights appear to have some popularity. This corresponds to predicting a future value of a time series $X_t$ by the weighted average $\sum_{j=0}^{\infty}\theta^j(1-\theta)X_{t-j}$ for some $\theta \in (0, 1)$. The coefficients $\psi_j = \theta^j(1-\theta)$ of this filter decrease exponentially and satisfy $\sum_j\psi_j = 1$.
1.24 EXERCISE (Convolution). Show that the result of two filters with coefficients $\alpha_j$ and $\beta_j$ applied in turn (if well defined) is the filter with coefficients $\gamma_j$ given by $\gamma_k = \sum_j\alpha_j\beta_{k-j}$. This is called the convolution of the two filters. Infer that filtering is commutative.
1.25 Definition. A filter with coefficients $\psi_j$ is causal if $\psi_j = 0$ for every $j < 0$.

For a causal filter the variable $Y_t = \sum_j\psi_j X_{t-j}$ depends only on the values $X_t, X_{t-1}, \ldots$ of the original time series in the present and past, not the future. This is important for prediction. Given $X_t$ up to some time $t$, we can calculate $Y_t$ up to time $t$. If $Y_t$ is stationary, we can use results for stationary time series to predict the future value $Y_{t+1}$. Next we predict the future value $X_{t+1}$ by $X_{t+1} = \psi_0^{-1}\bigl(Y_{t+1} - \sum_{j>0}\psi_j X_{t+1-j}\bigr)$.
In order to derive conditions that guarantee that an infinite filter is well defined, we start with a lemma concerning series of random variables. Recall that a series $\sum_t x_t$ of nonnegative numbers is always well defined (although possibly $\infty$), where the order of summation is irrelevant. Furthermore, for general numbers $x_t$ the absolute convergence $\sum_t|x_t| < \infty$ implies that $\sum_t x_t$ exists as a finite number, where the order of summation is again irrelevant. We shall be concerned with series indexed by $t$ belonging to some countable set $T$, such as $\mathbb{N}$, $\mathbb{Z}$, or $\mathbb{Z}^2$. It follows from the preceding that $\sum_{t\in T}x_t$ is well defined as a limit as $n \to \infty$ of partial sums $\sum_{t\in T_n}x_t$, for any increasing sequence of finite subsets $T_n \subset T$ with union $T$, if either every $x_t$ is nonnegative or $\sum_t|x_t| < \infty$. For instance, in the case that the index set $T$ is equal to $\mathbb{Z}$, we can choose the sets $T_n = \{t \in \mathbb{Z}: |t| \le n\}$.
1.26 Lemma. Let $(X_t: t \in T)$ be an arbitrary countable set of random variables.
(i) If $X_t \ge 0$ for every $t$, then $\mathrm{E}\sum_t X_t = \sum_t\mathrm{E}X_t$ (possibly $+\infty$);
(ii) If $\sum_t\mathrm{E}|X_t| < \infty$, then the series $\sum_t X_t$ converges absolutely almost surely and $\mathrm{E}\sum_t X_t = \sum_t\mathrm{E}X_t$.

Proof. Suppose $T = \cup_j T_j$ for an increasing sequence $T_1 \subset T_2 \subset \cdots$ of finite subsets of $T$. Assertion (i) follows from the monotone convergence theorem applied to the variables $Y_j = \sum_{t\in T_j}X_t$. For the proof of (ii) we first note that $Z = \sum_t|X_t|$ is well defined, with finite mean $\mathrm{E}Z = \sum_t\mathrm{E}|X_t|$, by (i). Thus $\sum_t|X_t|$ converges almost surely to a finite limit, and hence $Y := \sum_t X_t$ converges almost surely as well. The variables $Y_j = \sum_{t\in T_j}X_t$ are dominated by $Z$ and converge to $Y$ as $j \to \infty$. Hence $\mathrm{E}Y_j \to \mathrm{E}Y$ by the dominated convergence theorem.
The dominated convergence theorem in the proof of (ii) actually gives a better result, namely: if $\sum_t\mathrm{E}|X_t| < \infty$, then

$$\mathrm{E}\Bigl|\sum_{t\in T}X_t - \sum_{t\in T_j}X_t\Bigr| \to 0, \qquad \text{if } T_1 \subset T_2 \subset \cdots \uparrow T.$$
This is called the convergence in mean of the series $\sum_t X_t$. The analogous convergence of the second moment is called the convergence in second mean. Alternatively, we speak of convergence in quadratic mean or convergence in $L_1$ or $L_2$.
1.27 EXERCISE. Suppose that $\mathrm{E}|X_n - X|^p \to 0$ and $\mathrm{E}|X|^p < \infty$ for some $p \ge 1$. Show that $\mathrm{E}X_n^k \to \mathrm{E}X^k$ for every $0 < k \le p$.
1.28 Lemma. Let $(Z_t: t \in \mathbb{Z})$ be an arbitrary time series and let $\sum_j|\psi_j| < \infty$.
(i) If $\sup_t\mathrm{E}|Z_t| < \infty$, then $\sum_j\psi_j Z_{t-j}$ converges absolutely, almost surely and in mean.
(ii) If $\sup_t\mathrm{E}|Z_t|^2 < \infty$, then $\sum_j\psi_j Z_{t-j}$ converges in second mean as well.
(iii) If the series $Z_t$ is stationary, then so is the series $X_t = \sum_j\psi_j Z_{t-j}$ and $\gamma_X(h) = \sum_j\sum_l\psi_j\psi_{j+l-h}\gamma_Z(l)$.

Proof. (i). Because $\sum_j\mathrm{E}|\psi_j Z_{t-j}| \le \sup_t\mathrm{E}|Z_t|\sum_j|\psi_j| < \infty$, it follows by (ii) of the preceding lemma that the series $\sum_j\psi_j Z_{t-j}$ is absolutely convergent, almost surely. The convergence in mean follows as in the remark following the lemma.

(ii). By (i) the series converges almost surely, and $\sum_j\psi_j Z_{t-j} - \sum_{|j|\le k}\psi_j Z_{t-j} = \sum_{|j|>k}\psi_j Z_{t-j}$. By the triangle inequality we have

$$\Bigl|\sum_{|j|>k}\psi_j Z_{t-j}\Bigr|^2 \le \Bigl(\sum_{|j|>k}|\psi_j Z_{t-j}|\Bigr)^2 = \sum_{|j|>k}\sum_{|i|>k}|\psi_j||\psi_i||Z_{t-j}||Z_{t-i}|.$$

By the Cauchy-Schwarz inequality $\mathrm{E}|Z_{t-j}||Z_{t-i}| \le \bigl(\mathrm{E}|Z_{t-j}|^2\,\mathrm{E}|Z_{t-i}|^2\bigr)^{1/2}$, which is bounded by $\sup_t\mathrm{E}|Z_t|^2$. Therefore, in view of (i) of the preceding lemma the expectation of the right side (and hence the left side) of the preceding display is bounded above by

$$\sum_{|j|>k}\sum_{|i|>k}|\psi_j||\psi_i|\sup_t\mathrm{E}|Z_t|^2 = \Bigl(\sum_{|j|>k}|\psi_j|\Bigr)^2\sup_t\mathrm{E}|Z_t|^2.$$

This converges to zero as $k \to \infty$.

(iii). By (i) the series $\sum_j\psi_j Z_{t-j}$ converges in mean. Therefore, $\mathrm{E}\sum_j\psi_j Z_{t-j} = \sum_j\psi_j\mathrm{E}Z_t$, which is independent of $t$. Using arguments as before, we see that we can also justify the interchange of the order of expectations (hidden in the covariance) and double sums in

$$\gamma_X(h) = \mathrm{cov}\Bigl(\sum_j\psi_j Z_{t+h-j}, \sum_i\psi_i Z_{t-i}\Bigr) = \sum_j\sum_i\psi_j\psi_i\,\mathrm{cov}(Z_{t+h-j}, Z_{t-i}) = \sum_j\sum_i\psi_j\psi_i\,\gamma_Z(h-j+i).$$

This can be written in the form given by the lemma by the change of variables $(j, i) \mapsto (j, j+l-h)$.
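The following sketch (not in the original notes) applies a finite causal filter by convolution and checks the autocovariance formula of Lemma 1.28(iii) numerically in the white noise case, where $\gamma_Z(l) = 1_{\{l=0\}}$ so that $\gamma_Y(h) = \sum_j\psi_j\psi_{j+h}$. The filter coefficients, seed and helper are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
psi = np.array([1.0, 0.5, 0.25, 0.125])     # a finite causal filter (illustrative)
n = 200_000
Z = rng.standard_normal(n)                  # white noise, gamma_Z(l) = 1{l=0}

# Y_t = sum_j psi_j Z_{t-j}; 'valid' mode keeps only fully overlapping terms.
Y = np.convolve(Z, psi, mode="valid")

def sample_autocov(x, h):
    x = x - x.mean()
    return (x[h:] * x[:len(x) - h]).mean()

for h in range(4):
    theory = sum(psi[j] * psi[j + h] for j in range(len(psi) - h))
    print(h, sample_autocov(Y, h), theory)
```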
1.29 EXERCISE. Suppose that the series $Z_t$ in (iii) is strictly stationary. Show that the series $X_t$ is strictly stationary whenever it is well defined. [You need measure theory to make the argument mathematically rigorous(?).]

* 1.30 EXERCISE. For a white noise series $Z_t$, part (ii) of the preceding lemma can be improved: Suppose that $Z_t$ is a white noise sequence and $\sum_j\psi_j^2 < \infty$. Show that $\sum_j\psi_j Z_{t-j}$ converges in second mean. [For this exercise you may want to use material from Chapter 2.]
1.3 Complex Random Variables

Even though no real-life time series is complex valued, the use of complex numbers is notationally convenient to develop the mathematical theory. In this section we discuss complex-valued random variables.

A complex random variable $Z$ is a map from some probability space into the field of complex numbers whose real and imaginary parts are random variables. For complex random variables $Z = X + iY$, $Z_1$ and $Z_2$, we define

$$\mathrm{E}Z = \mathrm{E}X + i\mathrm{E}Y, \qquad \mathrm{var}\,Z = \mathrm{E}|Z - \mathrm{E}Z|^2, \qquad \mathrm{cov}(Z_1, Z_2) = \mathrm{E}(Z_1 - \mathrm{E}Z_1)\overline{(Z_2 - \mathrm{E}Z_2)}.$$

Here the overbar means complex conjugation. Some simple properties are, for $\alpha, \beta \in \mathbb{C}$,

$$\mathrm{E}\overline{Z} = \overline{\mathrm{E}Z}, \qquad \mathrm{E}\alpha Z = \alpha\,\mathrm{E}Z,$$
$$\mathrm{var}\,Z = \mathrm{E}|Z|^2 - |\mathrm{E}Z|^2 = \mathrm{var}\,X + \mathrm{var}\,Y = \mathrm{cov}(Z, Z),$$
$$\mathrm{var}(\alpha Z) = |\alpha|^2\,\mathrm{var}\,Z,$$
$$\mathrm{cov}(\alpha Z_1, \beta Z_2) = \alpha\overline{\beta}\,\mathrm{cov}(Z_1, Z_2),$$
$$\mathrm{cov}(Z_1, Z_2) = \overline{\mathrm{cov}(Z_2, Z_1)} = \mathrm{E}Z_1\overline{Z_2} - \mathrm{E}Z_1\overline{\mathrm{E}Z_2}.$$

A complex random vector is of course a vector $Z = (Z_1, \ldots, Z_n)^T$ of complex random variables $Z_i$ defined on the same probability space. Its covariance matrix $\mathrm{Cov}(Z)$ is the $(n\times n)$ matrix of covariances $\mathrm{cov}(Z_i, Z_j)$.
1.31 EXERCISE. Prove the preceding identities.
1.32 EXERCISE. Prove that a covariance matrix $\Sigma = \mathrm{Cov}(Z)$ of a complex random vector $Z$ is Hermitian (i.e. $\Sigma = \overline{\Sigma}^T$) and nonnegative-definite (i.e. $\alpha^T\Sigma\overline{\alpha} \ge 0$ for every $\alpha \in \mathbb{C}^n$). Here $\overline{\Sigma}^T$ is the "reflection" of a matrix, defined to have $(i, j)$-element $\overline{\Sigma}_{j,i}$.
A complex time series is a (doubly infinite) sequence of complex-valued random variables $X_t$. The definitions of mean, covariance function, and (strict) stationarity given for real time series apply equally well to complex time series. In particular, the auto-covariance function of a stationary complex time series is the function $\gamma_X: \mathbb{Z} \to \mathbb{C}$ defined by $\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t)$. Lemma 1.28 also extends to complex time series $Z_t$, where in (iii) we must read $\gamma_X(h) = \sum_j\sum_l\psi_j\overline{\psi}_{j+l-h}\gamma_Z(l)$.
1.33 EXERCISE. Show that the auto-covariance function of a complex stationary time series $X_t$ is conjugate symmetric: $\gamma_X(-h) = \overline{\gamma_X(h)}$, for every $h \in \mathbb{Z}$. (The positions of $X_{t+h}$ and $X_t$ in the definition $\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t)$ matter!)
Useful as these definitions are, it should be noted that the second-order structure of a complex-valued time series is only partly described by its covariance function. A complex random variable is geometrically equivalent to the two-dimensional real vector of its real and imaginary parts. More generally, a complex random vector $Z = (Z_1, \ldots, Z_n)$ of dimension $n$ can be identified with the $2n$-dimensional real vector that combines all real and imaginary parts. The second order structure of the latter vector is determined by a $2n$-dimensional real covariance matrix, and this is not completely determined by the $(n\times n)$ complex covariance matrix of $Z$. This is clear from comparing the dimensions of these objects. For instance, for $n = 1$ the complex covariance matrix of $Z$ is simply the single positive number $\mathrm{var}\,Z$, whereas the covariance matrix of its pair of real and imaginary parts is a symmetric $(2\times 2)$-matrix. This discrepancy increases with $n$. The covariance matrix of a complex vector of dimension $n$ contains $n$ reals on the diagonal and $\frac{1}{2}n(n-1)$ complex numbers off the diagonal, giving $n + n(n-1) = n^2$ reals in total, whereas a real covariance matrix of dimension $2n$ varies over an open set of $\frac{1}{2}2n(2n+1)$ reals.
* 1.3.1 Complex Gaussian vectors

It is particularly important to keep this in mind when working with complex Gaussian vectors. A complex random vector $Z = (Z_1, \ldots, Z_n)$ is normally distributed or Gaussian if the $2n$-vector of all its real and imaginary parts is multivariate-normally distributed. The latter distribution is determined by the mean vector and the covariance matrix of the latter $2n$-vector. The mean vector is one-to-one related to the (complex) mean vector of $Z$, but the complex covariances $\mathrm{cov}(Z_i, Z_j)$ do not completely determine the covariance matrix of the combined real and imaginary parts. This is sometimes resolved by requiring that the covariance matrix of the $2n$-vector has the special form

$$(1.2)\qquad \frac{1}{2}\begin{pmatrix}\mathrm{Re}\,\Sigma & -\mathrm{Im}\,\Sigma\\ \mathrm{Im}\,\Sigma & \mathrm{Re}\,\Sigma\end{pmatrix}.$$

Here $\mathrm{Re}\,\Sigma$ and $\mathrm{Im}\,\Sigma$ are the real and imaginary parts of the complex covariance matrix $\Sigma = \mathrm{Re}\,\Sigma + i\,\mathrm{Im}\,\Sigma$ of $Z$. A Gaussian distribution with covariance matrix of this type is sometimes called circular complex normal. This distribution is completely described by the mean vector $\mu$ and covariance matrix $\Sigma$, and can be referred to by a code such as $N_n(\mu, \Sigma)$. However, the additional requirement (1.2) imposes a relationship between the vectors of real and imaginary parts of $Z$, which seems not natural for a general definition of a Gaussian complex vector. For instance, under the requirement (1.2) a complex Gaussian variable $Z$ (for $n = 1$) would be equivalent to a pair of independent real Gaussian variables each with variance equal to $\frac{1}{2}\mathrm{var}\,Z$.
On the other hand, the covariance matrix of any complex vector that can be constructed by linearly transforming "total complex white noise" is of the form (1.2). Here by total complex white noise we mean a complex vector of the form $U + iV$, with $(U^T, V^T)^T$ a $2n$-dimensional real vector with covariance matrix $\frac{1}{2}$ times the identity. The covariance matrix of the real and imaginary parts of the complex vector $A(U + iV)$ is then of the form (1.2), for any complex matrix $A$ (see Problem 1.35). Furthermore, the canonical construction of a complex vector $Z$ with prescribed covariance matrix $\Sigma$, using the spectral decomposition, also yields real and imaginary parts with covariance matrix (1.2) (see Problem 1.36), and the coordinates of a complex $N(\mu, \Sigma)$-vector are independent if and only if $\Sigma$ is diagonal.

Warning. If $X$ is a real Gaussian $n$-vector and $A$ a complex matrix, then $AX$ is not necessarily a complex Gaussian vector, if the definition of the latter concept includes the requirement that the covariance matrix of the vector of real and imaginary parts takes the form (1.2).
1.34 Example (Complex trigonometric series). The time series $X_t = Ae^{i\lambda t}$, for a complex-valued random variable $A$ and a fixed real number $\lambda$, defines a complex version of the trigonometric time series considered in Example 1.5. If $\mathrm{E}A = 0$ and $\mathrm{var}\,A = \sigma^2$ is finite, then this time series is stationary with mean 0 and auto-covariances $\gamma_X(h) = \sigma^2 e^{i\lambda h}$.

The real part $\mathrm{Re}\,X_t$ of this time series takes the form of the series in Example 1.5, with the variables $A$ and $B$ in that example taken equal to $\mathrm{Re}\,A$ and $-\mathrm{Im}\,A$. It follows that the real part is stationary if $\mathrm{Re}\,A$ and $\mathrm{Im}\,A$ have equal variance and are uncorrelated, i.e. if the covariance of $A$ has the form (1.2). If this is not the case, then the real part will typically not be stationary. For instance, in the case that the variances of $\mathrm{Re}\,A$ and $\mathrm{Im}\,A$ are equal, the extra term $-\mathrm{E}(\mathrm{Re}\,A\,\mathrm{Im}\,A)\cos\bigl(\lambda(t+h)\bigr)\sin(\lambda t)$ will appear in the formula for the covariance of $\mathrm{Re}\,X_{t+h}$ and $\mathrm{Re}\,X_t$, which depends on $h$ unless $\mathrm{E}(\mathrm{Re}\,A\,\mathrm{Im}\,A) = 0$.

Thus a stationary complex time series may have nonstationary real and imaginary parts!
1.35 EXERCISE. Let $U$ and $V$ be real $n$-vectors with $\mathrm{E}UU^T = \mathrm{E}VV^T = I$ and $\mathrm{E}UV^T = 0$. Show that for any complex $(n\times n)$-matrix $A$ the complex vector $A(U + iV)$ possesses a covariance matrix of the form (1.2).

1.36 EXERCISE. A nonnegative, Hermitian complex matrix $\Sigma$ can be decomposed as $\Sigma = UD\overline{U}^T$ for a unitary matrix $U$ and a real diagonal matrix $D$. Given uncorrelated real vectors $U$ and $V$, each with mean zero and covariance matrix $\frac{1}{2}I$, set $Z = U\sqrt{D}(U + iV)$.
(i) Show that $\mathrm{Cov}\,Z = \Sigma$.
(ii) Show that $(\mathrm{Re}\,Z, \mathrm{Im}\,Z)$ possesses covariance matrix (1.2).
(If $U\sqrt{D} = A + iB$, then $\Sigma = U\sqrt{D}\,\overline{(U\sqrt{D})}^T = (A + iB)(A^T - iB^T)$, whence $\mathrm{Re}\,\Sigma = AA^T + BB^T$ and $\mathrm{Im}\,\Sigma = BA^T - AB^T$. Also $Z = AU - BV + i(AV + BU)$.)
1.37 EXERCISE. Suppose that the $2n$-vector $(X^T, Y^T)^T$ possesses a covariance matrix as in (1.2). Show that the $n$-vector $Z = X + iY$ possesses covariance matrix $\Sigma$. Also show that $\mathrm{E}(Z - \mathrm{E}Z)(Z - \mathrm{E}Z)^T = 0$.
1.38 EXERCISE. Let $Z$ and $\tilde{Z}$ be independent complex random vectors with mean zero and covariance matrix $\Sigma$.
(i) Show that $\mathrm{E}(\mathrm{Re}\,Z)(\mathrm{Re}\,Z)^T + \mathrm{E}(\mathrm{Im}\,Z)(\mathrm{Im}\,Z)^T = \mathrm{Re}\,\Sigma$ and $\mathrm{E}(\mathrm{Im}\,Z)(\mathrm{Re}\,Z)^T - \mathrm{E}(\mathrm{Re}\,Z)(\mathrm{Im}\,Z)^T = \mathrm{Im}\,\Sigma$.
(ii) Show that the $2n$-vector $(X^T, Y^T)^T$ defined by $X = (\mathrm{Re}\,Z + \mathrm{Im}\,\tilde{Z})/\sqrt{2}$ and $Y = (-\mathrm{Re}\,\tilde{Z} + \mathrm{Im}\,Z)/\sqrt{2}$ possesses covariance matrix (1.2).
(iii) Show that the $2n$-vector $(\mathrm{Re}\,Z^T, \mathrm{Im}\,Z^T)^T$ does not necessarily have this property.
1.39 EXERCISE. Say that a complex $n$-vector $Z = X + iY$ is $N_n(\mu, \Sigma)$ distributed if the $2n$ real vector $(X^T, Y^T)^T$ is multivariate normally distributed with mean $(\mathrm{Re}\,\mu, \mathrm{Im}\,\mu)$ and covariance matrix (1.2). Show that:
(i) If $Z$ is $N(\mu, \Sigma)$-distributed, then $AZ + b$ is $N(A\mu + b, A\Sigma\overline{A}^T)$-distributed.
(ii) If $Z_1$ and $Z_2$ are independent and $N(\mu_i, \Sigma_i)$-distributed, then $Z_1 + Z_2$ is $N(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2)$-distributed.
(iii) If $X$ is a real vector and $N(\mu, \Sigma)$-distributed and $A$ is a complex matrix, then $AX$ is typically not complex normally distributed.
1.4 Multivariate Time Series

In many applications we are interested in the time evolution of several variables jointly. This can be modelled through vector-valued time series. The definition of a stationary time series applies without changes to vector-valued series $X_t = (X_{t,1}, \ldots, X_{t,d})^T$. Here the mean $\mathrm{E}X_t$ is understood to be the vector $(\mathrm{E}X_{t,1}, \ldots, \mathrm{E}X_{t,d})^T$ of means of the coordinates and the auto-covariance function is defined to be the matrix

$$\gamma_X(h) = \bigl(\mathrm{cov}(X_{t+h,i}, X_{t,j})\bigr)_{i,j=1,\ldots,d} = \mathrm{E}(X_{t+h} - \mathrm{E}X_{t+h})(X_t - \mathrm{E}X_t)^T.$$

The auto-correlation at lag $h$ is defined as

$$\rho_X(h) = \bigl(\rho(X_{t+h,i}, X_{t,j})\bigr)_{i,j=1,\ldots,d} = \mathrm{diag}\bigl(\sqrt{\gamma_X(0)}\bigr)^{-1}\,\gamma_X(h)\,\mathrm{diag}\bigl(\sqrt{\gamma_X(0)}\bigr)^{-1}.$$

The study of properties of multivariate time series can often be reduced to the study of univariate time series by taking linear combinations $\alpha^T X_t$ of the coordinates. The first and second moments satisfy, for every $\alpha \in \mathbb{C}^n$,

$$\mathrm{E}\alpha^T X_t = \alpha^T\mathrm{E}X_t, \qquad \gamma_{\alpha^T X}(h) = \alpha^T\gamma_X(h)\overline{\alpha}.$$
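For concreteness (not in the original notes), the lag-$h$ sample autocovariance matrix of a vector-valued series can be estimated as below; the bivariate example series, seed and helper name are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 50_000, 2
Z = rng.standard_normal((n + 1, d))
X = Z[1:] + 0.5 * Z[:-1]                 # a simple bivariate MA(1), for illustration

def sample_autocov_matrix(x, h):
    """Estimate gamma_X(h) = E (X_{t+h} - mu)(X_t - mu)^T for an (n, d) array."""
    xc = x - x.mean(axis=0)
    return xc[h:].T @ xc[:len(x) - h] / (len(x) - h)

print(sample_autocov_matrix(X, 0))       # approximately (1 + 0.25) * I
print(sample_autocov_matrix(X, 1))       # approximately 0.5 * I
```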
1.40 EXERCISE. Show that the auto-covariance function of a complex, multivariate time series $X$ satisfies $\gamma_X(-h)^T = \overline{\gamma_X(h)}$, for every $h \in \mathbb{Z}$. (The order of the two random variables in the definition $\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t)$ matters!)
2 Hilbert Spaces and Prediction

In this chapter we first recall definitions and basic facts concerning Hilbert spaces. In particular, we consider the Hilbert space of square-integrable random variables, with the covariance as the inner product. We apply the Hilbert space theory to solve the prediction problem: finding the best predictor of $X_{n+1}$, or other future variables, based on observations $X_1, \ldots, X_n$.
2.1 Hilbert Spaces and Projections

Given a measure space $(\Omega, \mathcal{U}, \mu)$, we define $L_2(\Omega, \mathcal{U}, \mu)$ (or $L_2(\mu)$ for short) as the set of all measurable functions $f: \Omega \to \mathbb{C}$ such that $\int|f|^2\,d\mu < \infty$. (Alternatively, all measurable functions with values in $\mathbb{R}$ with this property.) Here a complex-valued function is said to be measurable if both its real and imaginary parts $\mathrm{Re}\,f$ and $\mathrm{Im}\,f$ are measurable functions. Its integral is by definition

$$\int f\,d\mu = \int\mathrm{Re}\,f\,d\mu + i\int\mathrm{Im}\,f\,d\mu,$$

provided the two integrals on the right are defined and finite. We set

$$(2.1)\qquad \langle f_1, f_2\rangle = \int f_1\overline{f_2}\,d\mu, \qquad \|f\| = \sqrt{\int|f|^2\,d\mu}, \qquad d(f_1, f_2) = \|f_1 - f_2\| = \sqrt{\int|f_1 - f_2|^2\,d\mu}.$$
These define a semi-inner product, a semi-norm, and a semi-metric, respectively. The first is a semi-inner product in view of the properties:

$$\langle f_1 + f_2, f_3\rangle = \langle f_1, f_3\rangle + \langle f_2, f_3\rangle,$$
$$\langle\alpha f_1, f_2\rangle = \alpha\langle f_1, f_2\rangle,$$
$$\langle f_2, f_1\rangle = \overline{\langle f_1, f_2\rangle},$$
$$\langle f, f\rangle \ge 0, \text{ with equality iff } f = 0, \text{ a.e.}.$$

These equalities are immediate consequences of the definitions, and the linearity of the integral.

The second entity in (2.1) is a semi-norm because it has the properties:

$$\|f_1 + f_2\| \le \|f_1\| + \|f_2\|,$$
$$\|\alpha f\| = |\alpha|\,\|f\|,$$
$$\|f\| = 0 \text{ iff } f = 0, \text{ a.e.}.$$

Here the first line, the triangle inequality, is not immediate, but it can be proved with the help of the Cauchy-Schwarz inequality, given below. The other properties are more obvious. The third is a semi-distance, in view of the relations:

$$d(f_1, f_3) \le d(f_1, f_2) + d(f_2, f_3),$$
$$d(f_1, f_2) = d(f_2, f_1),$$
$$d(f_1, f_2) = 0 \text{ iff } f_1 = f_2, \text{ a.e.}.$$

The first property is again a triangle inequality, and follows from the corresponding inequality for the norm.

Immediate consequences of the definitions and the properties of the inner product are

$$(2.2)\qquad \|f + g\|^2 = \langle f + g, f + g\rangle = \|f\|^2 + \langle f, g\rangle + \langle g, f\rangle + \|g\|^2,$$
$$\|f + g\|^2 = \|f\|^2 + \|g\|^2, \qquad \text{if } \langle f, g\rangle = 0.$$

The last equality is Pythagoras' rule. In the complex case this is true, more generally, for functions $f, g$ with $\mathrm{Re}\langle f, g\rangle = 0$.
2.1 Lemma (Cauchy-Schwarz). Any pair $f, g$ in $L_2(\Omega, \mathcal{U}, \mu)$ satisfies $\bigl|\langle f, g\rangle\bigr| \le \|f\|\,\|g\|$.

Proof. This follows upon working out the inequality $\|f - \alpha g\|^2 \ge 0$ for $\alpha = \langle f, g\rangle/\|g\|^2$, using the decomposition (2.2).

Now the triangle inequality for the norm follows from the decomposition (2.2) and the Cauchy-Schwarz inequality, which, when combined, yield

$$\|f + g\|^2 \le \|f\|^2 + 2\|f\|\,\|g\| + \|g\|^2 = \bigl(\|f\| + \|g\|\bigr)^2.$$
Another consequence of the Cauchy-Schwarz inequality is the continuity of the inner product:

$$(2.3)\qquad f_n \to f, \quad g_n \to g \qquad \text{implies that} \qquad \langle f_n, g_n\rangle \to \langle f, g\rangle.$$
2.2 EXERCISE. Prove this.

2.3 EXERCISE. Prove that $\bigl|\|f\| - \|g\|\bigr| \le \|f - g\|$.

2.4 EXERCISE. Derive the parallelogram rule: $\|f + g\|^2 + \|f - g\|^2 = 2\|f\|^2 + 2\|g\|^2$.

2.5 EXERCISE. Prove that $\|f + ig\|^2 = \|f\|^2 + \|g\|^2$ for every pair $f, g$ of real functions in $L_2(\Omega, \mathcal{U}, \mu)$.

2.6 EXERCISE. Let $\Omega = \{1, 2, \ldots, k\}$, $\mathcal{U} = 2^{\Omega}$ the power set of $\Omega$ and $\mu$ the counting measure on $\Omega$. Show that $L_2(\Omega, \mathcal{U}, \mu)$ is exactly $\mathbb{C}^k$ (or $\mathbb{R}^k$ in the real case).
We attached the qualifier "semi" to the inner product, norm and distance defined previously, because in each of the three cases the third of the three properties involves a null set. For instance, $\|f\| = 0$ does not imply that $f = 0$, but only that $f = 0$ almost everywhere. If we think of two functions that are equal almost everywhere as the same "function", then we obtain a true inner product, norm and distance. We define $L_2(\Omega, \mathcal{U}, \mu)$ as the set of all equivalence classes of the functions defined above, under the equivalence relation "$f \equiv g$ if and only if $f = g$ almost everywhere". It is a common abuse of terminology, which we adopt as well, to refer to the equivalence classes as "functions".

2.7 Proposition. The metric space $L_2(\Omega, \mathcal{U}, \mu)$ is complete under the metric $d$.

We shall need this proposition only occasionally, and do not provide a proof. (See e.g. Rudin, Theorem 3.11.) The proposition asserts that for every sequence $f_n$ of functions in $L_2(\Omega, \mathcal{U}, \mu)$ such that $\int|f_n - f_m|^2\,d\mu \to 0$ as $m, n \to \infty$ (a Cauchy sequence), there exists a function $f \in L_2(\Omega, \mathcal{U}, \mu)$ such that $\int|f_n - f|^2\,d\mu \to 0$ as $n \to \infty$.

A Hilbert space is a general inner product space that is metrically complete. The space $L_2(\Omega, \mathcal{U}, \mu)$ is an example, and the only example we need. (In fact, this is not a great loss of generality, because it can be proved that any Hilbert space is (isometrically) isomorphic to a space $L_2(\Omega, \mathcal{U}, \mu)$ for some $(\Omega, \mathcal{U}, \mu)$.)
2.8 Definition. Two elements $f, g$ of $L_2(\Omega, \mathcal{U}, \mu)$ are orthogonal if $\langle f, g\rangle = 0$. This is denoted $f \perp g$. Two subsets $\mathcal{F}, \mathcal{G}$ of $L_2(\Omega, \mathcal{U}, \mu)$ are orthogonal if $f \perp g$ for every $f \in \mathcal{F}$ and $g \in \mathcal{G}$. This is denoted $\mathcal{F} \perp \mathcal{G}$.

2.9 EXERCISE. If $f \perp \mathcal{G}$ for some subset $\mathcal{G} \subset L_2(\Omega, \mathcal{U}, \mathrm{P})$, show that $f \perp \overline{\mathrm{lin}}\,\mathcal{G}$, where $\overline{\mathrm{lin}}\,\mathcal{G}$ is the closure of the linear span $\mathrm{lin}\,\mathcal{G}$ of $\mathcal{G}$. (The linear span of a set is the set of all finite linear combinations of elements of the set.)

2.10 Theorem (Projection theorem). Let $L \subset L_2(\Omega, \mathcal{U}, \mu)$ be a closed linear subspace. For every $f \in L_2(\Omega, \mathcal{U}, \mu)$ there exists a unique element $\Pi f \in L$ that minimizes $l \mapsto \|f - l\|^2$ over $l \in L$. This element is uniquely determined by the requirements $\Pi f \in L$ and $f - \Pi f \perp L$.
Proof. Let $d = \inf_{l\in L}\|f - l\|$ be the minimal distance of $f$ to $L$. This is finite, because $0 \in L$ and $\|f\| < \infty$. Let $l_n$ be a sequence in $L$ such that $\|f - l_n\|^2 \to d^2$. By the parallelogram law

$$\bigl\|(l_m - f) + (f - l_n)\bigr\|^2 = 2\|l_m - f\|^2 + 2\|f - l_n\|^2 - \bigl\|(l_m - f) - (f - l_n)\bigr\|^2$$
$$= 2\|l_m - f\|^2 + 2\|f - l_n\|^2 - 4\bigl\|\tfrac{1}{2}(l_m + l_n) - f\bigr\|^2.$$

Because $(l_m + l_n)/2 \in L$, the last term on the far right is bounded above by $-4d^2$. The first two terms on the far right both converge to $2d^2$ as $m, n \to \infty$. We conclude that the left side, which is $\|l_m - l_n\|^2$, is bounded above by $2d^2 + 2d^2 + o(1) - 4d^2$ and hence, being nonnegative, converges to zero. Thus $l_n$ is a Cauchy sequence, and has a limit $l$, by the completeness of $L_2(\Omega, \mathcal{U}, \mu)$. The limit is in $L$, because $L$ is closed. By the continuity of the norm $\|f - l\| = \lim\|f - l_n\| = d$. Thus the limit $l$ qualifies as the minimizing element $\Pi f$.

If both $\Pi_1 f$ and $\Pi_2 f$ are candidates for $\Pi f$, then we can take the sequence $l_1, l_2, l_3, \ldots$ in the preceding argument equal to the sequence $\Pi_1 f, \Pi_2 f, \Pi_1 f, \ldots$. It then follows that this sequence is a Cauchy sequence and hence converges to a limit. The latter is possible only if $\Pi_1 f = \Pi_2 f$.

Finally, we consider the orthogonality relation. For every real number $a$ and $l \in L$, we have

$$\bigl\|f - (\Pi f + al)\bigr\|^2 = \|f - \Pi f\|^2 - 2a\,\mathrm{Re}\langle f - \Pi f, l\rangle + a^2\|l\|^2.$$

By definition of $\Pi f$ this is minimal as a function of $a$ at the value $a = 0$, whence the given parabola (in $a$) must have its bottom at zero, which is the case if and only if $\mathrm{Re}\langle f - \Pi f, l\rangle = 0$. In the complex case we see by a similar argument with $ia$ instead of $a$, that $\mathrm{Im}\langle f - \Pi f, l\rangle = 0$ as well. Thus $f - \Pi f \perp L$.

Conversely, if $\langle f - \Pi f, l\rangle = 0$ for every $l \in L$ and $\Pi f \in L$, then $\Pi f - l \in L$ for every $l \in L$ and by Pythagoras' rule

$$\|f - l\|^2 = \bigl\|(f - \Pi f) + (\Pi f - l)\bigr\|^2 = \|f - \Pi f\|^2 + \|\Pi f - l\|^2 \ge \|f - \Pi f\|^2.$$

This proves that $\Pi f$ minimizes $l \mapsto \|f - l\|^2$ over $l \in L$.
The function $\Pi f$ given in the preceding theorem is called the (orthogonal) projection of $f$ onto $L$. A geometric representation of a projection in $\mathbb{R}^3$ is given in Figure 2.1.
From the orthogonality characterization of $\Pi f$, we can see that the map $f\mapsto\Pi f$ is linear and decreases norm:
$$\Pi(f+g)=\Pi f+\Pi g,\qquad\Pi(\lambda f)=\lambda\,\Pi f,\qquad\|\Pi f\|\le\|f\|.$$
A further important property relates to repeated projections. If $\Pi_Lf$ denotes the projection of $f$ onto $L$, and $L_1$ and $L_2$ are two closed linear subspaces, then
$$\Pi_{L_1}\Pi_{L_2}f=\Pi_{L_1}f,\qquad\text{if }L_1\subset L_2.$$
Thus we can find a projection in steps: first project $f$ onto the bigger space $L_2$, and then project the result $\Pi_{L_2}f$ a second time, onto the smaller space $L_1$. This, again, is best proved using the orthogonality relation.
[Figure 2.1. Projection of $f$ onto the linear space $L$. The remainder $f-\Pi f$ is orthogonal to $L$.]
2.11 EXERCISE. Prove the relations in the two preceding displays.
The projection $\Pi_{L_1+L_2}$ onto the sum $L_1+L_2=\{l_1+l_2\colon l_i\in L_i\}$ of two closed linear spaces is not necessarily the sum $\Pi_{L_1}+\Pi_{L_2}$ of the projections. (It is also not true that the sum of two closed linear subspaces is necessarily closed, so that $\Pi_{L_1+L_2}$ may not even be well defined.) However, this is true if the spaces $L_1$ and $L_2$ are orthogonal:
$$\Pi_{L_1+L_2}f=\Pi_{L_1}f+\Pi_{L_2}f,\qquad\text{if }L_1\perp L_2.$$
2.12 EXERCISE.
(i) Show by counterexample that the condition $L_1\perp L_2$ cannot be omitted.
(ii) Show that $L_1+L_2$ is closed if $L_1\perp L_2$ and both $L_1$ and $L_2$ are closed subspaces.
(iii) Show that $L_1\perp L_2$ implies that $\Pi_{L_1+L_2}=\Pi_{L_1}+\Pi_{L_2}$.
[Hint for (ii): It must be shown that if $z_n=x_n+y_n$ with $x_n\in L_1$, $y_n\in L_2$ for every $n$ and $z_n\to z$, then $z=x+y$ for some $x\in L_1$ and $y\in L_2$. How can you find $x_n$ and $y_n$ from $z_n$? Also remember that a projection is continuous.]
2.13 EXERCISE. Find the projection $\Pi_Lf$ of an element $f$ onto a one-dimensional space $L=\{\lambda l_0\colon\lambda\in\mathbb{C}\}$.

* 2.14 EXERCISE. Suppose that the set $L$ has the form $L=L_1+iL_2$ for two closed, linear spaces $L_1,L_2$ of real functions. Show that the minimizer of $l\mapsto\|f-l\|$ over $l\in L$ for a real function $f$ is the same as the minimizer of $l\mapsto\|f-l\|$ over $L_1$. Does this imply that $f-\Pi f\perp L_2$? Why does this not follow from the projection theorem?
2.2 Square-integrable Random Variables

For $(\Omega,\mathcal{U},P)$ a probability space the space $L_2(\Omega,\mathcal{U},P)$ is exactly the set of all complex (or real) random variables $X$ with finite second moment $\mathrm{E}|X|^2$. The inner product is the product expectation $\langle X,Y\rangle=\mathrm{E}X\overline Y$, and the inner product between centered variables is the covariance:
$$\langle X-\mathrm{E}X,\,Y-\mathrm{E}Y\rangle=\mathrm{cov}(X,Y).$$
The Cauchy-Schwarz inequality takes the form
$$|\mathrm{E}X\overline Y|^2\le\mathrm{E}|X|^2\,\mathrm{E}|Y|^2.$$
When combined the preceding displays imply that $\bigl|\mathrm{cov}(X,Y)\bigr|^2\le\mathrm{var}\,X\,\mathrm{var}\,Y$. Convergence $X_n\to X$ relative to the norm means that $\mathrm{E}|X_n-X|^2\to0$, and is referred to as convergence in second mean. This implies the convergence in mean $\mathrm{E}|X_n-X|\to0$, because $\mathrm{E}|X|\le\sqrt{\mathrm{E}|X|^2}$ by the Cauchy-Schwarz inequality. The continuity of the inner product (2.3) gives that
$$\mathrm{E}|X_n-X|^2\to0,\ \mathrm{E}|Y_n-Y|^2\to0\quad\text{implies}\quad\mathrm{cov}(X_n,Y_n)\to\mathrm{cov}(X,Y).$$
2.15 EXERCISE. How can you apply this rule to prove equalities of the type $\mathrm{cov}\bigl(\sum_j\phi_jX_{t-j},\sum_j\psi_jY_{t-j}\bigr)=\sum_i\sum_j\phi_i\psi_j\,\mathrm{cov}(X_{t-i},Y_{t-j})$, such as in Lemma 1.28?

2.16 EXERCISE. Show that $\mathrm{sd}(X+Y)\le\mathrm{sd}(X)+\mathrm{sd}(Y)$ for any pair of random variables $X$ and $Y$.
2.2.1 Conditional Expectation

Let $\mathcal{U}_0\subset\mathcal{U}$ be a sub-$\sigma$-field of the $\sigma$-field $\mathcal{U}$. The collection $L$ of all $\mathcal{U}_0$-measurable variables $Y\in L_2(\Omega,\mathcal{U},P)$ is a closed, linear subspace of $L_2(\Omega,\mathcal{U},P)$. (It can be identified with $L_2(\Omega,\mathcal{U}_0,P)$.) By the projection theorem every square-integrable random variable $X$ possesses a projection onto $L$. This particular projection is important enough to give it a name and study it in more detail.

2.17 Definition. The projection of $X\in L_2(\Omega,\mathcal{U},P)$ onto the set of all $\mathcal{U}_0$-measurable square-integrable random variables is called the conditional expectation of $X$ given $\mathcal{U}_0$. It is denoted by $\mathrm{E}(X\,|\,\mathcal{U}_0)$.

The name "conditional expectation" suggests that there exists another, more intuitive interpretation of this projection. An alternative definition of a conditional expectation is as follows.
2.18 Definition. The conditional expectation given $\mathcal{U}_0$ of a random variable $X$ which is either nonnegative or integrable is defined as a $\mathcal{U}_0$-measurable variable $X'$ such that $\mathrm{E}X1_A=\mathrm{E}X'1_A$ for every $A\in\mathcal{U}_0$.
It is clear from the definition that any other $\mathcal{U}_0$-measurable map $X''$ such that $X''=X'$ almost surely is also a conditional expectation. Apart from this indeterminacy on null sets, a conditional expectation as in the second definition can be shown to be unique; its existence can be proved using the Radon-Nikodym theorem. We do not give proofs of these facts here.

Because a variable $X\in L_2(\Omega,\mathcal{U},P)$ is automatically integrable, Definition 2.18 defines a conditional expectation for a larger class of variables than Definition 2.17. If $\mathrm{E}|X|^2<\infty$, so that both definitions apply, then the two definitions agree. To see this it suffices to show that a projection $\mathrm{E}(X\,|\,\mathcal{U}_0)$ as in the first definition is the conditional expectation $X'$ of the second definition. Now $\mathrm{E}(X\,|\,\mathcal{U}_0)$ is $\mathcal{U}_0$-measurable by definition and satisfies the equality $\mathrm{E}\bigl(X-\mathrm{E}(X\,|\,\mathcal{U}_0)\bigr)1_A=0$ for every $A\in\mathcal{U}_0$, by the orthogonality relationship of a projection. Thus $X'=\mathrm{E}(X\,|\,\mathcal{U}_0)$ satisfies the requirements of Definition 2.18.

Definition 2.18 shows that a conditional expectation has to do with expectations, but is still not very intuitive. Some examples help to gain more insight in conditional expectations.
2.19 Example (Ordinary expectation). The expectation $\mathrm{E}X$ of a random variable $X$ is a number, and as such can be viewed as a degenerate random variable. It is also the conditional expectation relative to the trivial $\sigma$-field $\{\emptyset,\Omega\}$: $\mathrm{E}\bigl(X\,|\,\{\emptyset,\Omega\}\bigr)=\mathrm{E}X$. More generally, we have that $\mathrm{E}(X\,|\,\mathcal{U}_0)=\mathrm{E}X$ if $X$ and $\mathcal{U}_0$ are independent. This is intuitively clear: an independent $\sigma$-field $\mathcal{U}_0$ gives no information about $X$ and hence the expectation given $\mathcal{U}_0$ is the unconditional expectation.
To derive this from the definition, note that $\mathrm{E}(\mathrm{E}X)1_A=\mathrm{E}X\,\mathrm{E}1_A=\mathrm{E}X1_A$ for every measurable set $A$ such that $X$ and $A$ are independent.

2.20 Example. At the other extreme we have that $\mathrm{E}(X\,|\,\mathcal{U}_0)=X$ if $X$ itself is $\mathcal{U}_0$-measurable. This is immediate from the definition. Given $\mathcal{U}_0$ we then know $X$ exactly.
A measurable map $Y\colon\Omega\to D$ with values in some measurable space $(D,\mathcal{D})$ generates a $\sigma$-field $\sigma(Y)=Y^{-1}(\mathcal{D})$. The notation $\mathrm{E}(X\,|\,Y)$ is used as an abbreviation of $\mathrm{E}\bigl(X\,|\,\sigma(Y)\bigr)$.
2.21 Example (Conditional density). Let $(X,Y)\colon\Omega\to\mathbb{R}\times\mathbb{R}^k$ be measurable and possess a density $f(x,y)$ relative to a $\sigma$-finite product measure $\mu\times\nu$ on $\mathbb{R}\times\mathbb{R}^k$ (for instance, the Lebesgue measure on $\mathbb{R}^{k+1}$). Then it is customary to define a conditional density of $X$ given $Y=y$ by
$$f(x\,|\,y)=\frac{f(x,y)}{\int f(x,y)\,d\mu(x)}.$$
This is well defined for every $y$ for which the denominator is positive. As the denominator is precisely the marginal density $f_Y$ of $Y$ evaluated at $y$, this is true for all $y$ in a set of measure one under the distribution of $Y$.

We now have that the conditional expectation is given by the usual formula
$$\mathrm{E}(X\,|\,Y)=\int x\,f(x\,|\,Y)\,d\mu(x).$$
Here we may define the right-hand side arbitrarily if the denominator of $f(x\,|\,Y)$ is zero. That this formula gives the conditional expectation according to Definition 2.18 follows by some applications of Fubini's theorem. To begin with, note that it is a part of the statement of this theorem that the right side of the preceding display is a measurable function of $Y$. Next we write $\mathrm{E}\,\mathrm{E}(X\,|\,Y)1_{Y\in B}$, for an arbitrary measurable set $B$, in the form $\int_B\int x\,f(x\,|\,y)\,d\mu(x)\,f_Y(y)\,d\nu(y)$. Because $f(x\,|\,y)f_Y(y)=f(x,y)$ for almost every $(x,y)$, the latter expression is equal to $\mathrm{E}X1_{Y\in B}$, by Fubini's theorem.
2.22 Lemma (Properties).
(i) $\mathrm{E}\,\mathrm{E}(X\,|\,\mathcal{U}_0)=\mathrm{E}X$.
(ii) If $Z$ is $\mathcal{U}_0$-measurable, then $\mathrm{E}(ZX\,|\,\mathcal{U}_0)=Z\,\mathrm{E}(X\,|\,\mathcal{U}_0)$ a.s. (Here we require that $X\in L_p(\Omega,\mathcal{U},P)$ and $Z\in L_q(\Omega,\mathcal{U},P)$ for $1\le p\le\infty$ and $p^{-1}+q^{-1}=1$.)
(iii) (linearity) $\mathrm{E}(\alpha X+\beta Y\,|\,\mathcal{U}_0)=\alpha\,\mathrm{E}(X\,|\,\mathcal{U}_0)+\beta\,\mathrm{E}(Y\,|\,\mathcal{U}_0)$ a.s.
(iv) (positivity) If $X\ge0$ a.s., then $\mathrm{E}(X\,|\,\mathcal{U}_0)\ge0$ a.s.
(v) (towering property) If $\mathcal{U}_0\subset\mathcal{U}_1\subset\mathcal{U}$, then $\mathrm{E}\bigl(\mathrm{E}(X\,|\,\mathcal{U}_1)\,|\,\mathcal{U}_0\bigr)=\mathrm{E}(X\,|\,\mathcal{U}_0)$ a.s.
The conditional expectation $\mathrm{E}(X\,|\,Y)$ given a random vector $Y$ is by definition a $\sigma(Y)$-measurable function. The following lemma shows that, for most $Y$, this means that it is a measurable function $g(Y)$ of $Y$. The value $g(y)$ is often denoted by $\mathrm{E}(X\,|\,Y=y)$.

Warning. Unless $P(Y=y)>0$ it is not right to give a meaning to $\mathrm{E}(X\,|\,Y=y)$ for a fixed, single $y$, even though the interpretation as "an expectation given that we know that $Y=y$" often makes this tempting (and often leads to a correct result). We may only think of a conditional expectation as a function $y\mapsto\mathrm{E}(X\,|\,Y=y)$, and this function is determined only up to null sets.
2.23 Lemma. Let $Y_\alpha\colon\Omega\to\mathbb{R}$, $\alpha\in A$, be random variables on $\Omega$ and let $X$ be a $\sigma(Y_\alpha\colon\alpha\in A)$-measurable random variable.
(i) If $A=\{1,2,\ldots,k\}$, then there exists a measurable map $g\colon\mathbb{R}^k\to\mathbb{R}$ such that $X=g(Y_1,\ldots,Y_k)$.
(ii) If $|A|=\infty$, then there exists a countable subset $\{\alpha_n\}_{n=1}^\infty\subset A$ and a measurable map $g\colon\mathbb{R}^\infty\to\mathbb{R}$ such that $X=g(Y_{\alpha_1},Y_{\alpha_2},\ldots)$.

Proof. For the proof of (i), see e.g. Dudley, Theorem 4.28.
2.3 Linear Prediction

Suppose that we observe the values $X_1,\ldots,X_n$ from a stationary, mean zero time series $X_t$. The linear prediction problem is to find the linear combination of these variables that best predicts future variables.

2.24 Definition. Given a mean zero time series $(X_t)$, the best linear predictor of $X_{n+1}$ is the linear combination $\phi_1X_n+\phi_2X_{n-1}+\cdots+\phi_nX_1$ that minimizes $\mathrm{E}|X_{n+1}-Y|^2$ over all linear combinations $Y$ of $X_1,\ldots,X_n$. The minimal value $\mathrm{E}|X_{n+1}-\phi_1X_n-\cdots-\phi_nX_1|^2$ is called the square prediction error.
In the terminology of the preceding section, the best linear predictor of $X_{n+1}$ is the projection of $X_{n+1}$ onto the linear subspace $\mathrm{lin}(X_1,\ldots,X_n)$ spanned by $X_1,\ldots,X_n$. A common notation is $\Pi_nX_{n+1}$, for $\Pi_n$ the projection onto $\mathrm{lin}(X_1,\ldots,X_n)$. Best linear predictors of other random variables are defined similarly.

Warning. The coefficients $\phi_1,\ldots,\phi_n$ in the formula $\Pi_nX_{n+1}=\phi_1X_n+\cdots+\phi_nX_1$ depend on $n$, even though we often suppress this dependence in the notation. The reversed ordering of the labels on the coefficients $\phi_i$ is for convenience.
By Theorem 2.10 the best linear predictor can be found from the prediction equations
$$\langle X_{n+1}-\phi_1X_n-\cdots-\phi_nX_1,\,X_t\rangle=0,\qquad t=1,\ldots,n,$$
where $\langle\cdot,\cdot\rangle$ is the inner product in $L_2(\Omega,\mathcal{U},P)$. For a stationary time series $X_t$ this system can be written in the form
$$(2.4)\qquad
\begin{pmatrix}
\gamma_X(0)&\gamma_X(1)&\cdots&\gamma_X(n-1)\\
\gamma_X(1)&\gamma_X(0)&\cdots&\gamma_X(n-2)\\
\vdots&\vdots&\ddots&\vdots\\
\gamma_X(n-1)&\gamma_X(n-2)&\cdots&\gamma_X(0)
\end{pmatrix}
\begin{pmatrix}\phi_1\\\vdots\\\phi_n\end{pmatrix}
=\begin{pmatrix}\gamma_X(1)\\\vdots\\\gamma_X(n)\end{pmatrix}.$$
If the $(n\times n)$-matrix on the left is nonsingular, then $\phi_1,\ldots,\phi_n$ can be solved uniquely. Otherwise there are multiple solutions for the vector $(\phi_1,\ldots,\phi_n)$, but any solution will give the best linear predictor $\Pi_nX_{n+1}=\phi_1X_n+\cdots+\phi_nX_1$, as this is uniquely determined by the projection theorem. The equations express $\phi_1,\ldots,\phi_n$ in the auto-covariance function $\gamma_X$. In practice, we do not know this function, but estimate it from the data, and use the corresponding estimates for $\phi_1,\ldots,\phi_n$ to calculate the predictor. (We consider the estimation problem in later chapters.)
The square prediction error can be expressed in the coefficients by Pythagoras' rule, which gives, for a stationary time series $X_t$,
$$(2.5)\qquad\mathrm{E}|X_{n+1}-\Pi_nX_{n+1}|^2=\mathrm{E}|X_{n+1}|^2-\mathrm{E}|\Pi_nX_{n+1}|^2=\gamma_X(0)-(\phi_1,\ldots,\phi_n)\Gamma_n(\phi_1,\ldots,\phi_n)^T,$$
for $\Gamma_n$ the covariance matrix of the vector $(X_1,\ldots,X_n)$, i.e. the matrix on the left side of (2.4).
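As a concrete illustration of (2.4) and (2.5), the following sketch (Python with numpy, not part of the original notes) solves the prediction equations numerically for a given autocovariance function. The autocovariance used at the bottom, $\gamma_X(h)=\theta^{|h|}/(1-\theta^2)$, is an assumed test case (it is the autocovariance of a stationary AR(1) series with unit innovation variance) and serves only to check the routine.

```python
import numpy as np

def best_linear_predictor(gamma, n):
    """Solve the prediction equations (2.4); return the coefficients
    (phi_1, ..., phi_n) and the square prediction error (2.5).

    gamma: function h -> gamma_X(h), the autocovariance of a mean-zero
           stationary series (assumed known here; in practice estimated).
    """
    # Covariance matrix Gamma_n of (X_1, ..., X_n): entry (i, j) is gamma(|i-j|).
    Gamma = np.array([[gamma(abs(i - j)) for j in range(n)] for i in range(n)])
    rhs = np.array([gamma(h) for h in range(1, n + 1)])
    phi = np.linalg.solve(Gamma, rhs)          # coefficients of X_n, ..., X_1
    mse = gamma(0) - phi @ Gamma @ phi         # formula (2.5)
    return phi, mse

# Assumed example: autocovariance of a stationary AR(1) with coefficient 0.6.
theta = 0.6
gamma = lambda h: theta ** abs(h) / (1 - theta ** 2)
phi, mse = best_linear_predictor(gamma, n=5)
print(np.round(phi, 6))   # approximately (0.6, 0, 0, 0, 0)
print(round(mse, 6))      # approximately 1, the innovation variance
```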
Similar arguments apply to predicting $X_{n+h}$ for $h>1$. If we wish to predict the future values at many time lags $h=1,2,\ldots$, then solving an $n$-dimensional linear system for every $h$ separately can be computer-intensive, as $n$ may be large. Several recursive algorithms use the predictions at earlier times to calculate the next prediction, and are computationally more efficient. We discuss one of these in Section 2.4.
2.25 Example (Autoregression). Prediction is extremely simple for the stationary auto-regressive time series satisfying $X_t=\theta X_{t-1}+Z_t$ for a white noise sequence $Z_t$ and $|\theta|<1$: the best linear predictor of $X_{n+1}$ given $X_1,\ldots,X_n$ is simply $\theta X_n$ (for $n\ge1$). Thus we predict $X_{n+1}=\theta X_n+Z_{n+1}$ by simply setting the unknown $Z_{n+1}$ equal to its mean, zero. The interpretation is that the $Z_t$ are external noise factors that are completely unpredictable based on the past. The square prediction error $\mathrm{E}|X_{n+1}-\theta X_n|^2=\mathrm{E}Z_{n+1}^2$ is equal to the variance of this innovation.

The claim is not obvious, as is proved by the fact that it is wrong in the case that $|\theta|>1$. To prove the claim we recall from Example 1.8 that the unique stationary solution to the auto-regressive equation in the case that $|\theta|<1$ is given by $X_t=\sum_{j=0}^\infty\theta^jZ_{t-j}$. Thus $X_t$ depends only on $Z_s$ from the past and the present. Because $Z_t$ is a white noise sequence, it follows that $X_t$ is uncorrelated with the variables $Z_{t+1},Z_{t+2},\ldots$. Therefore $\langle X_{n+1}-\theta X_n,X_t\rangle=\langle Z_{n+1},X_t\rangle=0$ for $t=1,2,\ldots,n$. This verifies the orthogonality relationship; it is obvious that $\theta X_n$ is contained in the linear span of $X_1,\ldots,X_n$.
2.26 EXERCISE. There is a hidden use of the continuity of the inner product in the
preceding example. Can you see where?
2.27 Example (Deterministic trigonometric series). For the process $X_t=A\cos(\lambda t)+B\sin(\lambda t)$, considered in Example 1.5, the best linear predictor of $X_{n+1}$ given $X_1,\ldots,X_n$ is given by $2(\cos\lambda)X_n-X_{n-1}$, for $n\ge2$. The prediction error is equal to 0! This underscores that this type of time series is deterministic in character: if we know it at two time instants, then we know the time series at all other time instants. The intuitive explanation is that the values $A$ and $B$ can be recovered from the values of $X_t$ at two time instants.

These assertions follow by explicit calculations, solving the prediction equations. It suffices to do this for $n=2$: if $X_3$ can be predicted without error by $2(\cos\lambda)X_2-X_1$, then, by stationarity, $X_{n+1}$ can be predicted without error by $2(\cos\lambda)X_n-X_{n-1}$.
2.28 EXERCISE.
(i) Prove the assertions in the preceding example.
(ii) Are the coefficients $2\cos\lambda,-1,0,\ldots,0$ in this example unique?
If a given time series $X_t$ is not centered at 0, then it is natural to allow a constant term in the predictor. Write 1 for the random variable that is equal to 1 almost surely.

2.29 Definition. The best linear predictor of $X_{n+1}$ based on $X_1,\ldots,X_n$ is the projection of $X_{n+1}$ onto the linear space spanned by $1,X_1,\ldots,X_n$.
If the time series $X_t$ does have mean zero, then the introduction of the constant term 1 does not help. Indeed, the relation $\mathrm{E}X_t=0$ is equivalent to $X_t\perp1$, which implies both that $1\perp\mathrm{lin}(X_1,\ldots,X_n)$ and that the projection of $X_{n+1}$ onto $\mathrm{lin}\,1$ is zero. By the orthogonality the projection of $X_{n+1}$ onto $\mathrm{lin}(1,X_1,\ldots,X_n)$ is the sum of the projections of $X_{n+1}$ onto $\mathrm{lin}\,1$ and $\mathrm{lin}(X_1,\ldots,X_n)$, which is the projection onto $\mathrm{lin}(X_1,\ldots,X_n)$, the first projection being 0.

If the mean of the time series is nonzero, then adding a constant to the predictor does cut the prediction error. By a similar argument as in the preceding paragraph we see that for a time series with mean $\mu=\mathrm{E}X_t$ possibly nonzero,
$$(2.6)\qquad\Pi_{\mathrm{lin}(1,X_1,\ldots,X_n)}X_{n+1}=\mu+\Pi_{\mathrm{lin}(X_1-\mu,\ldots,X_n-\mu)}(X_{n+1}-\mu).$$
Thus the recipe for prediction with uncentered time series is: subtract the mean from every $X_t$, calculate the projection for the centered time series $X_t-\mu$, and finally add the mean. Because the auto-covariance function $\gamma_X$ gives the inner products of the centered process, the coefficients $\phi_1,\ldots,\phi_n$ of $X_n-\mu,\ldots,X_1-\mu$ are still given by the prediction equations (2.4).
2.30 EXERCISE. Prove formula (2.6), noting that $\mathrm{E}X_t=\mu$ is equivalent to $X_t-\mu\perp1$.
* 2.4 Innovation Algorithm

The prediction error $X_n-\Pi_{n-1}X_n$ at time $n$ is called the innovation at time $n$. The linear span of the innovations $X_1-\Pi_0X_1,\ldots,X_n-\Pi_{n-1}X_n$ is the same as the linear span of $X_1,\ldots,X_n$, and hence the linear prediction of $X_{n+1}$ can be expressed in the innovations, as
$$\Pi_nX_{n+1}=\theta_{n,1}(X_n-\Pi_{n-1}X_n)+\theta_{n,2}(X_{n-1}-\Pi_{n-2}X_{n-1})+\cdots+\theta_{n,n}(X_1-\Pi_0X_1).$$
The innovations algorithm shows how the triangular array of coefficients $\theta_{n,1},\ldots,\theta_{n,n}$ can be efficiently computed. Once these coefficients are known, predictions further into the future can also be computed easily. Indeed, by the towering property of projections the projection $\Pi_kX_{n+1}$ for $k\le n$ can be obtained by applying $\Pi_k$ to both sides of the preceding display. Next the linearity of projection and the orthogonality of the innovations to the past yields that, for $k\le n$,
$$\Pi_kX_{n+1}=\theta_{n,n-k+1}(X_k-\Pi_{k-1}X_k)+\cdots+\theta_{n,n}(X_1-\Pi_0X_1).$$
We just drop the innovations that are future to time $k$.
The innovations algorithm computes the coefficients $\theta_{n,t}$ for $n=1,2,\ldots$, and for every $n$ backwards for $t=n,n-1,\ldots,1$. It also uses and computes the square norms of the innovations
$$v_{n-1}=\mathrm{E}(X_n-\Pi_{n-1}X_n)^2.$$
The algorithm is initialized at $n=1$ by setting $\Pi_0X_1=\mathrm{E}X_1$ and $v_0=\mathrm{var}\,X_1=\gamma_X(0)$. For each $n$ the last coefficient is computed as
$$\theta_{n,n}=\frac{\mathrm{cov}(X_{n+1},X_1)}{\mathrm{var}\,X_1}=\frac{\gamma_X(n)}{v_0}.$$
In particular, this yields the coefficient $\theta_{1,1}$ at level $n=1$. Given the coefficients at level $n$, we compute
$$v_n=\mathrm{var}\,X_{n+1}-\mathrm{var}\,\Pi_nX_{n+1}=\gamma_X(0)-\sum_{t=1}^n\theta_{n,t}^2\,v_{n-t}.$$
Next if the coefficients $(\theta_{j,t})$ are known for $j=1,\ldots,n-1$, then for $t=1,\ldots,n-1$,
$$\theta_{n,t}=\frac{\mathrm{cov}\bigl(X_{n+1},\,X_{n+1-t}-\Pi_{n-t}X_{n+1-t}\bigr)}{\mathrm{var}\bigl(X_{n+1-t}-\Pi_{n-t}X_{n+1-t}\bigr)}
=\frac{\gamma_X(t)-\theta_{n-t,1}\theta_{n,t+1}v_{n-t-1}-\theta_{n-t,2}\theta_{n,t+2}v_{n-t-2}-\cdots-\theta_{n-t,n-t}\theta_{n,n}v_0}{v_{n-t}}.$$
In the last step we have replaced $X_{n+1-t}-\Pi_{n-t}X_{n+1-t}$ by $X_{n+1-t}-\sum_{j=1}^{n-t}\theta_{n-t,j}(X_{n+1-t-j}-\Pi_{n-t-j}X_{n+1-t-j})$, and next used that the inner product of $X_{n+1}$ with $X_{n+1-t-j}-\Pi_{n-t-j}X_{n+1-t-j}$ is equal to the inner product of $\Pi_nX_{n+1}$ with the latter variable.
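The recursions above translate into a short routine. The sketch below (Python with numpy, an illustration under the recursion as written here rather than a definitive implementation) computes the triangular array $\theta_{n,t}$ and the innovation variances $v_n$ for a mean-zero stationary series with a given autocovariance function; the MA(1) autocovariance at the bottom is an assumed example.

```python
import numpy as np

def innovations_algorithm(gamma, N):
    """Innovations algorithm for a mean-zero stationary series.

    gamma : function h -> gamma_X(h) (autocovariance, assumed known).
    N     : number of levels n = 1, ..., N to compute.
    Returns (theta, v) with theta[n][t] = theta_{n,t} for 1 <= t <= n and
    v[n] = E(X_{n+1} - Pi_n X_{n+1})^2, where v[0] = gamma(0).
    """
    v = np.zeros(N + 1)
    v[0] = gamma(0)
    theta = {n: np.zeros(n + 1) for n in range(1, N + 1)}   # index t = 1..n
    for n in range(1, N + 1):
        theta[n][n] = gamma(n) / v[0]
        for t in range(n - 1, 0, -1):                       # t = n-1, ..., 1
            s = sum(theta[n - t][i] * theta[n][t + i] * v[n - t - i]
                    for i in range(1, n - t + 1))
            theta[n][t] = (gamma(t) - s) / v[n - t]
        v[n] = gamma(0) - sum(theta[n][t] ** 2 * v[n - t] for t in range(1, n + 1))
    return theta, v

# Assumed example: MA(1) autocovariance, X_t = Z_t + 0.5 Z_{t-1}, var Z_t = 1.
gamma = lambda h: 1.25 if h == 0 else (0.5 if abs(h) == 1 else 0.0)
theta, v = innovations_algorithm(gamma, N=10)
print(round(theta[10][1], 4), round(v[10], 4))   # approach 0.5 and 1.0
```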
2.5 Nonlinear Prediction

The method of linear prediction is commonly used in time series analysis. Its main advantage is simplicity: the linear predictor depends on the mean and auto-covariance function only, and in a simple fashion. On the other hand, utilization of general functions $f(X_1,\ldots,X_n)$ of the observations as predictors may decrease the prediction error.

2.31 Definition. The best predictor of $X_{n+1}$ based on $X_1,\ldots,X_n$ is the function $f_n(X_1,\ldots,X_n)$ that minimizes $\mathrm{E}\bigl|X_{n+1}-f(X_1,\ldots,X_n)\bigr|^2$ over all measurable functions $f\colon\mathbb{R}^n\to\mathbb{R}$.
In view of the discussion in Section 2.2.1 the best predictor is the conditional expectation $\mathrm{E}(X_{n+1}\,|\,X_1,\ldots,X_n)$ of $X_{n+1}$ given the variables $X_1,\ldots,X_n$. Best predictors of other variables are defined similarly as conditional expectations.

The difference between linear and nonlinear prediction can be substantial. In classical time series theory linear models with Gaussian errors were predominant, and for those models the two predictors coincide. However, for nonlinear models, or non-Gaussian distributions, nonlinear prediction should be the method of choice, if feasible.
2.32 Example (GARCH). In the GARCH model of Example 1.10 the variable $X_{n+1}$ is given as $\sigma_{n+1}Z_{n+1}$, where $\sigma_{n+1}$ is a function of $X_n,X_{n-1},\ldots$ and $Z_{n+1}$ is independent of these variables. It follows that the best predictor of $X_{n+1}$ given the infinite past $X_n,X_{n-1},\ldots$ is given by $\sigma_{n+1}\mathrm{E}(Z_{n+1}\,|\,X_n,X_{n-1},\ldots)=0$. We can find the best predictor given $X_n,\ldots,X_1$ by projecting this predictor further onto the space of all measurable functions of $X_n,\ldots,X_1$. By the linearity of the projection we again find 0.

We conclude that a GARCH model does not allow a true prediction of the future, if "true" refers to predicting the values of the time series itself.

On the other hand, we can predict other quantities of interest. For instance, the uncertainty of the value of $X_{n+1}$ is determined by the size of $\sigma_{n+1}$. If $\sigma_{n+1}$ is close to zero, then we may expect $X_{n+1}$ to be close to zero, and conversely. Given the infinite past $X_n,X_{n-1},\ldots$ the variable $\sigma_{n+1}$ is known completely, but in the more realistic situation that we know only $X_n,\ldots,X_1$ some chance component will be left.

For large $n$ the difference between these two situations is small. The dependence of $\sigma_{n+1}$ on $X_n,X_{n-1},\ldots$ is given in Example 1.10 as $\sigma^2_{n+1}=\sum_{j=0}^\infty\phi^j(\alpha+\theta X^2_{n-j})$ and is nonlinear. For large $n$ this is close to $\sum_{j=0}^{n-1}\phi^j(\alpha+\theta X^2_{n-j})$, which is a function of $X_1,\ldots,X_n$. By definition the best predictor $\widehat{\sigma^2_{n+1}}$ of $\sigma^2_{n+1}$ based on $1,X_1,\ldots,X_n$ is the closest function and hence it satisfies
$$\mathrm{E}\bigl|\widehat{\sigma^2_{n+1}}-\sigma^2_{n+1}\bigr|^2\le\mathrm{E}\Bigl|\sum_{j=0}^{n-1}\phi^j(\alpha+\theta X^2_{n-j})-\sigma^2_{n+1}\Bigr|^2=\mathrm{E}\Bigl|\sum_{j=n}^\infty\phi^j(\alpha+\theta X^2_{n-j})\Bigr|^2.$$
For small $\phi$ and large $n$ this will be small if the sequence $X_n$ is sufficiently integrable. Thus accurate nonlinear prediction of $\sigma^2_{n+1}$ is feasible.
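The truncated sum can be evaluated directly from the observed data. The following sketch (Python with numpy; the parameter values $\alpha$, $\phi$, $\theta$ and the simulation itself are assumptions made only for illustration) simulates a GARCH(1,1)-type series with $\sigma^2_t=\alpha+\phi\sigma^2_{t-1}+\theta X^2_{t-1}$ and compares the predictor $\sum_{j=0}^{n-1}\phi^j(\alpha+\theta X^2_{n-j})$ with the true $\sigma^2_{n+1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, phi, theta = 0.1, 0.8, 0.1          # assumed GARCH(1,1) parameters
n = 500

# Simulate sigma_t^2 = alpha + phi * sigma_{t-1}^2 + theta * X_{t-1}^2.
sigma2 = np.empty(n + 2)
X = np.empty(n + 1)
sigma2[0] = alpha / (1 - phi - theta)      # start at the stationary mean
X[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
for t in range(1, n + 1):
    sigma2[t] = alpha + phi * sigma2[t - 1] + theta * X[t - 1] ** 2
    X[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
sigma2[n + 1] = alpha + phi * sigma2[n] + theta * X[n] ** 2   # true sigma^2_{n+1}

# Predictor based on X_1, ..., X_n: truncated version of the infinite sum.
pred = sum(phi ** j * (alpha + theta * X[n - j] ** 2) for j in range(n))
print(pred, sigma2[n + 1])   # nearly identical: the difference is phi^n * sigma2[1]
```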
2.6 Partial Auto-Correlation

For a mean-zero stationary time series $X_t$ the partial auto-correlation at lag $h$ is defined as the correlation between $X_h-\Pi_{h-1}X_h$ and $X_0-\Pi_{h-1}X_0$, where $\Pi_{h-1}$ is the projection onto $\mathrm{lin}(X_1,\ldots,X_{h-1})$. This is the correlation between $X_h$ and $X_0$ with the correlation due to the intermediate variables $X_1,\ldots,X_{h-1}$ removed. We shall denote it by
$$\alpha_X(h)=\rho\bigl(X_h-\Pi_{h-1}X_h,\,X_0-\Pi_{h-1}X_0\bigr).$$
For an uncentered stationary time series we set the partial auto-correlation by definition equal to the partial auto-correlation of the centered series $X_t-\mathrm{E}X_t$. A convenient method to compute $\alpha_X$ is given by the prediction equations combined with the following lemma, which shows that $\alpha_X(h)$ is the coefficient of $X_1$ in the best linear predictor of $X_{h+1}$ based on $X_1,\ldots,X_h$.
2.33 Lemma. Suppose that $X_t$ is a mean-zero stationary time series. If $\phi_1X_h+\phi_2X_{h-1}+\cdots+\phi_hX_1$ is the best linear predictor of $X_{h+1}$ based on $X_1,\ldots,X_h$, then $\alpha_X(h)=\phi_h$.
Proof. Let $\beta_1X_h+\cdots+\beta_{h-1}X_2=:\Pi_{2,h}X_1$ be the best linear predictor of $X_1$ based on $X_2,\ldots,X_h$. The best linear predictor of $X_{h+1}$ based on $X_1,\ldots,X_h$ can be decomposed as
$$\Pi_hX_{h+1}=\phi_1X_h+\cdots+\phi_hX_1
=\bigl[(\phi_1+\phi_h\beta_1)X_h+\cdots+(\phi_{h-1}+\phi_h\beta_{h-1})X_2\bigr]+\bigl[\phi_h(X_1-\Pi_{2,h}X_1)\bigr].$$
The two random variables in square brackets are orthogonal, because $X_1-\Pi_{2,h}X_1\perp\mathrm{lin}(X_2,\ldots,X_h)$ by the projection theorem. Therefore, the second variable in square brackets is the projection of $\Pi_hX_{h+1}$ onto the one-dimensional subspace $\mathrm{lin}(X_1-\Pi_{2,h}X_1)$. It is also the projection of $X_{h+1}$ onto this one-dimensional subspace, because $\mathrm{lin}(X_1-\Pi_{2,h}X_1)\subset\mathrm{lin}(X_1,\ldots,X_h)$ and we can compute projections by first projecting onto a bigger subspace.

The projection of $X_{h+1}$ onto the one-dimensional subspace $\mathrm{lin}(X_1-\Pi_{2,h}X_1)$ is easy to compute directly. It is given by $\lambda(X_1-\Pi_{2,h}X_1)$ for $\lambda$ given by
$$\lambda=\frac{\langle X_{h+1},X_1-\Pi_{2,h}X_1\rangle}{\|X_1-\Pi_{2,h}X_1\|^2}
=\frac{\langle X_{h+1}-\Pi_{2,h}X_{h+1},X_1-\Pi_{2,h}X_1\rangle}{\|X_1-\Pi_{2,h}X_1\|^2}.$$
Because the linear prediction problem is symmetric in time, as it depends on the auto-covariance function only, $\|X_1-\Pi_{2,h}X_1\|=\|X_{h+1}-\Pi_{2,h}X_{h+1}\|$. Therefore, the right side is exactly $\alpha_X(h)$. In view of the preceding paragraph, we have $\lambda=\phi_h$ and the lemma is proved.
2.34 Example (Autoregression). According to Example 2.25, for the stationary auto-regressive process $X_t=\theta X_{t-1}+Z_t$ with $|\theta|<1$, the best linear predictor of $X_{n+1}$ based on $X_1,\ldots,X_n$ is $\theta X_n$, for $n\ge1$. Thus $\alpha_X(1)=\theta$ and the partial auto-correlations $\alpha_X(h)$ of lags $h>1$ are zero. This is often viewed as the dual of the property that for the moving average sequence of order 1, considered in Example 1.6, the auto-correlations of lags $h>1$ vanish.

In Chapter 8 we shall see that for higher order stationary auto-regressive processes $X_t=\phi_1X_{t-1}+\cdots+\phi_pX_{t-p}+Z_t$ the partial auto-correlations of lags $h>p$ are zero under the (standard) assumption that the time series is causal.
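Lemma 2.33 can be checked numerically: compute $\alpha_X(h)$ once as the correlation of the two projection residuals in its definition, and once as the last coefficient of the lag-$h$ prediction equations (2.4). The sketch below (Python with numpy, an illustration only; the MA(1) autocovariance is an assumed test case) does both for a single lag $h\ge2$ and confirms that the two numbers agree.

```python
import numpy as np

def pacf_two_ways(gamma, h):
    """Check Lemma 2.33 for one lag h >= 2: the correlation between the
    residuals X_h - Pi_{h-1} X_h and X_0 - Pi_{h-1} X_0 equals the last
    coefficient phi_h of the best linear predictor of X_{h+1} on X_1..X_h."""
    G = lambda m: np.array([[gamma(abs(i - j)) for j in range(m)] for i in range(m)])
    # (a) last coefficient of the lag-h prediction equations (2.4)
    phi = np.linalg.solve(G(h), [gamma(k) for k in range(1, h + 1)])
    # (b) correlation of the two projection residuals (the definition)
    Gam = G(h - 1)
    c = np.array([gamma(h - j) for j in range(1, h)])   # cov(X_h, X_j), j = 1..h-1
    d = np.array([gamma(j) for j in range(1, h)])       # cov(X_0, X_j), j = 1..h-1
    a = np.linalg.solve(Gam, c)                         # projection of X_h
    b = np.linalg.solve(Gam, d)                         # projection of X_0
    cov = gamma(h) - a @ d - b @ c + a @ Gam @ b
    corr = cov / np.sqrt((gamma(0) - a @ c) * (gamma(0) - b @ d))
    return phi[-1], corr

# Assumed test case: MA(1) autocovariance, X_t = Z_t + 0.5 Z_{t-1}, var Z_t = 1.
gamma = lambda k: 1.25 if k == 0 else (0.5 if abs(k) == 1 else 0.0)
print(pacf_two_ways(gamma, h=3))   # the two numbers coincide (about 0.0941)
```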
3 Stochastic Convergence

This chapter provides a review of modes of convergence of sequences of stochastic vectors, in particular convergence in distribution and in probability. Many proofs are omitted, but can be found in most standard probability books, and certainly in the book Asymptotic Statistics (A.W. van der Vaart, 1998).
3.1 Basic theory

A random vector in $\mathbb{R}^k$ is a vector $X=(X_1,\ldots,X_k)$ of real random variables. More formally it is a Borel measurable map from some probability space into $\mathbb{R}^k$. The distribution function of $X$ is the map $x\mapsto P(X\le x)$.
A sequence of random vectors $X_n$ is said to converge in distribution to $X$ if
$$P(X_n\le x)\to P(X\le x),$$
for every $x$ at which the distribution function $x\mapsto P(X\le x)$ is continuous. Alternative names are weak convergence and convergence in law. As the last name suggests, the convergence only depends on the induced laws of the vectors and not on the probability spaces on which they are defined. Weak convergence is denoted by $X_n\rightsquigarrow X$; if $X$ has distribution $L$, or a distribution with a standard code such as $N(0,1)$, then also by $X_n\rightsquigarrow L$ or $X_n\rightsquigarrow N(0,1)$.
Let $d(x,y)$ be any distance function on $\mathbb{R}^k$ that generates the usual topology. For instance
$$d(x,y)=\|x-y\|=\Bigl(\sum_{i=1}^k(x_i-y_i)^2\Bigr)^{1/2}.$$
A sequence of random variables $X_n$ is said to converge in probability to $X$ if for all $\varepsilon>0$
$$P\bigl(d(X_n,X)>\varepsilon\bigr)\to0.$$
This is denoted by $X_n\xrightarrow{P}X$. In this notation convergence in probability is the same as $d(X_n,X)\xrightarrow{P}0$.
As we shall see, convergence in probability is stronger than convergence in distribution. Even stronger modes of convergence are almost sure convergence and convergence in $p$th mean. The sequence $X_n$ is said to converge almost surely to $X$ if $d(X_n,X)\to0$ with probability one:
$$P\bigl(\lim d(X_n,X)=0\bigr)=1.$$
This is denoted by $X_n\xrightarrow{as}X$. The sequence $X_n$ is said to converge in $p$th mean to $X$ if
$$\mathrm{E}\,d(X_n,X)^p\to0.$$
This is denoted $X_n\xrightarrow{L_p}X$. We already encountered the special cases $p=1$ and $p=2$, which are referred to as convergence in mean and convergence in quadratic mean.

Convergence in probability, almost surely, or in mean only make sense if each $X_n$ and $X$ are defined on the same probability space. For convergence in distribution this is not necessary.
The portmanteau lemma gives a number of equivalent descriptions of weak convergence. Most of the characterizations are only useful in proofs. The last one also has intuitive value.

3.1 Lemma (Portmanteau). For any random vectors $X_n$ and $X$ the following statements are equivalent.
(i) $P(X_n\le x)\to P(X\le x)$ for all continuity points of $x\mapsto P(X\le x)$;
(ii) $\mathrm{E}f(X_n)\to\mathrm{E}f(X)$ for all bounded, continuous functions $f$;
(iii) $\mathrm{E}f(X_n)\to\mathrm{E}f(X)$ for all bounded, Lipschitz$^\dagger$ functions $f$;
(iv) $\liminf P(X_n\in G)\ge P(X\in G)$ for every open set $G$;
(v) $\limsup P(X_n\in F)\le P(X\in F)$ for every closed set $F$;
(vi) $P(X_n\in B)\to P(X\in B)$ for all Borel sets $B$ with $P(X\in\partial B)=0$, where $\partial B=\bar B\setminus\mathring B$ is the boundary of $B$.
The continuous mapping theorem is a simple result, but is extremely useful. If the sequence of random vectors $X_n$ converges to $X$ and $g$ is continuous, then $g(X_n)$ converges to $g(X)$. This is true without further conditions for three of our four modes of stochastic convergence.

3.2 Theorem (Continuous mapping). Let $g\colon\mathbb{R}^k\to\mathbb{R}^m$ be measurable and continuous at every point of a set $C$ such that $P(X\in C)=1$.
(i) If $X_n\rightsquigarrow X$, then $g(X_n)\rightsquigarrow g(X)$;
(ii) If $X_n\xrightarrow{P}X$, then $g(X_n)\xrightarrow{P}g(X)$;
(iii) If $X_n\xrightarrow{as}X$, then $g(X_n)\xrightarrow{as}g(X)$.
Any random vector $X$ is tight: for every $\varepsilon>0$ there exists a constant $M$ such that $P\bigl(\|X\|>M\bigr)<\varepsilon$. A set of random vectors $\{X_\alpha\colon\alpha\in A\}$ is called uniformly tight if $M$ can be chosen the same for every $X_\alpha$: for every $\varepsilon>0$ there exists a constant $M$ such that
$$\sup_\alpha P\bigl(\|X_\alpha\|>M\bigr)<\varepsilon.$$
Thus there exists a compact set to which all $X_\alpha$ give probability "almost one". Another name for uniformly tight is bounded in probability. It is not hard to see that every weakly converging sequence $X_n$ is uniformly tight. More surprisingly, the converse of this statement is almost true: according to Prohorov's theorem every uniformly tight sequence contains a weakly converging subsequence.

$^\dagger$ A function $f$ is called Lipschitz if there exists a number $L$ such that $|f(x)-f(y)|\le L\,d(x,y)$ for every $x$ and $y$. The least such number $L$ is denoted $\|f\|_{\mathrm{Lip}}$.
3.3 Theorem (Prohorov's theorem). Let $X_n$ be random vectors in $\mathbb{R}^k$.
(i) If $X_n\rightsquigarrow X$ for some $X$, then $\{X_n\colon n\in\mathbb{N}\}$ is uniformly tight;
(ii) If $X_n$ is uniformly tight, then there is a subsequence with $X_{n_j}\rightsquigarrow X$ as $j\to\infty$ for some $X$.

3.4 Example. A sequence $X_n$ of random variables with $\mathrm{E}|X_n|=O(1)$ is uniformly tight. This follows by Markov's inequality: $P\bigl(|X_n|>M\bigr)\le\mathrm{E}|X_n|/M$. This can be made arbitrarily small uniformly in $n$ by choosing $M$ sufficiently large.
The first absolute moment could of course be replaced by any other absolute moment.
Since the second moment is the sum of the variance and the square of the mean, an alternative sufficient condition for uniform tightness is: $\mathrm{E}X_n=O(1)$ and $\mathrm{var}\,X_n=O(1)$.
Consider some of the relationships between the modes of convergence. Convergence in distribution is weaker than convergence in probability, which is in turn weaker than almost sure convergence and convergence in $p$th mean.

3.5 Theorem. Let $X_n$, $X$ and $Y_n$ be random vectors. Then
(i) $X_n\xrightarrow{as}X$ implies $X_n\xrightarrow{P}X$;
(ii) $X_n\xrightarrow{L_p}X$ implies $X_n\xrightarrow{P}X$;
(iii) $X_n\xrightarrow{P}X$ implies $X_n\rightsquigarrow X$;
(iv) $X_n\xrightarrow{P}c$ for a constant $c$ if and only if $X_n\rightsquigarrow c$;
(v) if $X_n\rightsquigarrow X$ and $d(X_n,Y_n)\xrightarrow{P}0$, then $Y_n\rightsquigarrow X$;
(vi) if $X_n\rightsquigarrow X$ and $Y_n\xrightarrow{P}c$ for a constant $c$, then $(X_n,Y_n)\rightsquigarrow(X,c)$;
(vii) if $X_n\xrightarrow{P}X$ and $Y_n\xrightarrow{P}Y$, then $(X_n,Y_n)\xrightarrow{P}(X,Y)$.
Proof. (i). The sequence of sets $A_n=\cup_{m\ge n}\bigl\{d(X_m,X)>\varepsilon\bigr\}$ is decreasing for every $\varepsilon>0$ and decreases to the empty set if $X_n(\omega)\to X(\omega)$ for every $\omega$. If $X_n\xrightarrow{as}X$, then $P\bigl(d(X_n,X)>\varepsilon\bigr)\le P(A_n)\to0$.
(ii). This is an immediate consequence of Markov's inequality, according to which $P\bigl(d(X_n,X)>\varepsilon\bigr)\le\varepsilon^{-p}\,\mathrm{E}\,d(X_n,X)^p$ for every $\varepsilon>0$.
(v). For every bounded Lipschitz function $f$ and every $\varepsilon>0$ we have
$$\bigl|\mathrm{E}f(X_n)-\mathrm{E}f(Y_n)\bigr|\le\|f\|_{\mathrm{Lip}}\,\varepsilon\,\mathrm{E}1\bigl\{d(X_n,Y_n)\le\varepsilon\bigr\}+2\|f\|_\infty\,\mathrm{E}1\bigl\{d(X_n,Y_n)>\varepsilon\bigr\}.$$
The second term on the right converges to zero as $n\to\infty$. The first term can be made arbitrarily small by choice of $\varepsilon$. Conclude that the sequences $\mathrm{E}f(X_n)$ and $\mathrm{E}f(Y_n)$ have the same limit. The result follows from the portmanteau lemma.
(iii). Since $d(X_n,X)\xrightarrow{P}0$ and trivially $X\rightsquigarrow X$, it follows that $X_n\rightsquigarrow X$ by (v).
(iv). The "only if" part is a special case of (iii). For the converse let $\mathrm{ball}(c,\varepsilon)$ be the open ball of radius $\varepsilon$ around $c$. Then $P\bigl(d(X_n,c)\ge\varepsilon\bigr)=P\bigl(X_n\in\mathrm{ball}(c,\varepsilon)^c\bigr)$. If $X_n\rightsquigarrow c$, then the limsup of the last probability is bounded by $P\bigl(c\in\mathrm{ball}(c,\varepsilon)^c\bigr)=0$.
(vi). First note that $d\bigl((X_n,Y_n),(X_n,c)\bigr)=d(Y_n,c)\xrightarrow{P}0$. Thus according to (v) it suffices to show that $(X_n,c)\rightsquigarrow(X,c)$. For every continuous, bounded function $(x,y)\mapsto f(x,y)$, the function $x\mapsto f(x,c)$ is continuous and bounded. Thus $\mathrm{E}f(X_n,c)\to\mathrm{E}f(X,c)$ if $X_n\rightsquigarrow X$.
(vii). This follows from $d\bigl((x_1,y_1),(x_2,y_2)\bigr)\le d(x_1,x_2)+d(y_1,y_2)$.
According to the last assertion of the lemma, convergence in probability of a sequence of vectors $X_n=(X_{n,1},\ldots,X_{n,k})$ is equivalent to convergence of every one of the sequences of components $X_{n,i}$ separately. The analogous statement for convergence in distribution is false: convergence in distribution of the sequence $X_n$ is stronger than convergence of every one of the sequences of components $X_{n,i}$. The point is that the distribution of the components $X_{n,i}$ separately does not determine their joint distribution: they might be independent or dependent in many ways. One speaks of joint convergence in distribution versus marginal convergence.

The second-to-last assertion of the lemma has some useful consequences. If $X_n\rightsquigarrow X$ and $Y_n\rightsquigarrow c$, then $(X_n,Y_n)\rightsquigarrow(X,c)$. Consequently, by the continuous mapping theorem, $g(X_n,Y_n)\rightsquigarrow g(X,c)$ for every map $g$ that is continuous at the set $\mathbb{R}^k\times\{c\}$ where the vector $(X,c)$ takes its values. Thus this holds for every $g$ such that
$$\lim_{x\to x_0,\,y\to c}g(x,y)=g(x_0,c),\qquad\text{every }x_0.$$
Some particular applications of this principle are known as Slutsky's lemma.
3.6 Lemma (Slutsky). Let $X_n$, $X$ and $Y_n$ be random vectors or variables. If $X_n\rightsquigarrow X$ and $Y_n\rightsquigarrow c$ for a constant $c$, then
(i) $X_n+Y_n\rightsquigarrow X+c$;
(ii) $Y_nX_n\rightsquigarrow cX$;
(iii) $X_n/Y_n\rightsquigarrow X/c$ provided $c\ne0$.

In (i) the "constant" $c$ must be a vector of the same dimension as $X$, and in (ii) $c$ is probably initially understood to be a scalar. However, (ii) is also true if every $Y_n$ and $c$ are matrices (which can be identified with vectors, for instance by aligning rows, to give a meaning to the convergence $Y_n\rightsquigarrow c$), simply because matrix multiplication $(y,x)\mapsto yx$ is a continuous operation. Another true result in this case is that $X_nY_n\rightsquigarrow Xc$, if this statement is well defined. Even (iii) is valid for matrices $Y_n$ and $c$ and vectors $X_n$ provided $c\ne0$ is understood as $c$ being invertible and division is interpreted as (pre)multiplication by the inverse, because taking an inverse is also continuous.
3.7 Example. Let $T_n$ and $S_n$ be statistical estimators satisfying
$$\sqrt n(T_n-\theta)\rightsquigarrow N(0,\sigma^2),\qquad S_n^2\xrightarrow{P}\sigma^2,$$
for certain parameters $\theta$ and $\sigma^2$ depending on the underlying distribution, for every distribution in the model. Then $\theta=T_n\pm\xi_\alpha S_n/\sqrt n$, for $\xi_\alpha$ the upper $\alpha$-quantile of the standard normal distribution, is a confidence interval for $\theta$ of asymptotic level $1-2\alpha$.
This is a consequence of the fact that the sequence $\sqrt n(T_n-\theta)/S_n$ is asymptotically standard normal distributed.
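As a quick sanity check of this recipe, the sketch below (Python with numpy; the exponential model, sample size and number of replications are assumptions made only for illustration) builds the interval $T_n\pm\xi_\alpha S_n/\sqrt n$ for the mean of an exponential distribution, with $T_n$ the sample mean and $S_n$ the sample standard deviation, and estimates its coverage by simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0                        # true mean of the assumed exponential model
n, reps, xi = 200, 2000, 1.6449    # xi = upper 5% standard normal quantile

covered = 0
for _ in range(reps):
    x = rng.exponential(scale=theta, size=n)
    T, S = x.mean(), x.std(ddof=1)                       # T_n and S_n
    lo, hi = T - xi * S / np.sqrt(n), T + xi * S / np.sqrt(n)
    covered += (lo <= theta <= hi)
print(covered / reps)   # close to the nominal level 1 - 2*alpha = 0.90
```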
* 3.2 Convergence of Moments

By the portmanteau lemma, weak convergence $X_n\rightsquigarrow X$ implies that $\mathrm{E}f(X_n)\to\mathrm{E}f(X)$ for every continuous, bounded function $f$. The condition that $f$ be bounded is not superfluous: it is not difficult to find examples of a sequence $X_n\rightsquigarrow X$ and an unbounded, continuous function $f$ for which the convergence fails. In particular, in general convergence in distribution does not imply convergence $\mathrm{E}X_n^p\to\mathrm{E}X^p$ of moments. However, in many situations such convergence occurs, but it requires more effort to prove it.

A sequence of random variables $Y_n$ is called asymptotically uniformly integrable if
$$\lim_{M\to\infty}\limsup_{n\to\infty}\mathrm{E}|Y_n|1\bigl\{|Y_n|>M\bigr\}=0.$$
A simple sufficient condition for this is that for some $p>1$ the sequence $\mathrm{E}|Y_n|^p$ is bounded in $n$.
Uniform integrability is the missing link between convergence in distribution and convergence of moments.
3.8 Theorem. Let $f\colon\mathbb{R}^k\to\mathbb{R}$ be measurable and continuous at every point in a set $C$. Let $X_n\rightsquigarrow X$ where $X$ takes its values in $C$. Then $\mathrm{E}f(X_n)\to\mathrm{E}f(X)$ if and only if the sequence of random variables $f(X_n)$ is asymptotically uniformly integrable.

3.9 Example. Suppose $X_n$ is a sequence of random variables such that $X_n\rightsquigarrow X$ and $\limsup\mathrm{E}|X_n|^p<\infty$ for some $p$. Then all moments of order strictly less than $p$ converge also: $\mathrm{E}X_n^k\to\mathrm{E}X^k$ for every $k<p$.
By the preceding theorem, it suffices to prove that the sequence $X_n^k$ is asymptotically uniformly integrable. By Markov's inequality
$$\mathrm{E}|X_n|^k1\bigl\{|X_n|^k\ge M\bigr\}\le M^{1-p/k}\,\mathrm{E}|X_n|^p.$$
The limsup, as $n\to\infty$ followed by $M\to\infty$, of the right side is zero if $k<p$.
3.3 Arrays

Consider an infinite array $x_{n,l}$ of numbers, indexed by $(n,l)\in\mathbb{N}\times\mathbb{N}$, such that every column has a limit, and the limits $x_l$ themselves converge to a limit along the columns.
$$\begin{matrix}
x_{1,1}&x_{1,2}&x_{1,3}&x_{1,4}&\cdots\\
x_{2,1}&x_{2,2}&x_{2,3}&x_{2,4}&\cdots\\
x_{3,1}&x_{3,2}&x_{3,3}&x_{3,4}&\cdots\\
\vdots&\vdots&\vdots&\vdots&\\
\downarrow&\downarrow&\downarrow&\downarrow&\\
x_1&x_2&x_3&x_4&\cdots\;\to\;x
\end{matrix}$$
Then we can find a "path" $x_{n,l_n}$, indexed by $n\in\mathbb{N}$, through the array along which $x_{n,l_n}\to x$ as $n\to\infty$. (The point is to move to the right slowly in the array while going down, i.e. $l_n\to\infty$.) A similar property is valid for sequences of random vectors, where the convergence is taken as convergence in distribution.

3.10 Lemma. For $n,l\in\mathbb{N}$ let $X_{n,l}$ be random vectors such that $X_{n,l}\rightsquigarrow X_l$ as $n\to\infty$ for every fixed $l$, for random vectors such that $X_l\rightsquigarrow X$ as $l\to\infty$. Then there exists a sequence $l_n\to\infty$ such that $X_{n,l_n}\rightsquigarrow X$ as $n\to\infty$.
Proof. Let $D=\{d_1,d_2,\ldots\}$ be a countable set that is dense in $\mathbb{R}^k$ and that only contains points at which the distribution functions of the limits $X,X_1,X_2,\ldots$ are continuous. Then an arbitrary sequence of random variables $Y_n$ converges in distribution to one of the variables $Y\in\{X,X_1,X_2,\ldots\}$ if and only if $P(Y_n\le d_i)\to P(Y\le d_i)$ for every $d_i\in D$. We can prove this using the monotonicity and right-continuity of distribution functions. In turn $P(Y_n\le d_i)\to P(Y\le d_i)$ as $n\to\infty$ for every $d_i\in D$ if and only if
$$\sum_{i=1}^\infty\bigl|P(Y_n\le d_i)-P(Y\le d_i)\bigr|2^{-i}\to0.$$
Now define
$$p_{n,l}=\sum_{i=1}^\infty\bigl|P(X_{n,l}\le d_i)-P(X_l\le d_i)\bigr|2^{-i},\qquad
p_l=\sum_{i=1}^\infty\bigl|P(X_l\le d_i)-P(X\le d_i)\bigr|2^{-i}.$$
The assumptions entail that $p_{n,l}\to0$ as $n\to\infty$ for every fixed $l$, and that $p_l\to0$ as $l\to\infty$. This implies that there exists a sequence $l_n\to\infty$ such that $p_{n,l_n}\to0$. By the triangle inequality
$$\sum_{i=1}^\infty\bigl|P(X_{n,l_n}\le d_i)-P(X\le d_i)\bigr|2^{-i}\le p_{n,l_n}+p_{l_n}\to0.$$
This implies that $X_{n,l_n}\rightsquigarrow X$ as $n\to\infty$.
3.4 Stochastic o and O symbols

It is convenient to have short expressions for terms that converge in probability to zero or are uniformly tight. The notation $o_P(1)$ ("small oh-P-one") is short for a sequence of random vectors that converges to zero in probability. The expression $O_P(1)$ ("big oh-P-one") denotes a sequence that is bounded in probability. More generally, for a given sequence of random variables $R_n$,
$$X_n=o_P(R_n)\quad\text{means}\quad X_n=Y_nR_n\ \text{and}\ Y_n\xrightarrow{P}0;$$
$$X_n=O_P(R_n)\quad\text{means}\quad X_n=Y_nR_n\ \text{and}\ Y_n=O_P(1).$$
This expresses that the sequence $X_n$ converges in probability to zero or is bounded in probability at "rate" $R_n$. For deterministic sequences $X_n$ and $R_n$ the stochastic oh-symbols reduce to the usual $o$ and $O$ from calculus.
There are many rules of calculus with $o$ and $O$ symbols, which will be applied without comment. For instance,
$$o_P(1)+o_P(1)=o_P(1),\qquad o_P(1)+O_P(1)=O_P(1),\qquad O_P(1)o_P(1)=o_P(1),$$
$$\bigl(1+o_P(1)\bigr)^{-1}=O_P(1),\qquad o_P(R_n)=R_no_P(1),\qquad O_P(R_n)=R_nO_P(1),\qquad o_P\bigl(O_P(1)\bigr)=o_P(1).$$
To see the validity of these rules it suffices to restate them in terms of explicitly named vectors, where each $o_P(1)$ and $O_P(1)$ should be replaced by a different sequence of vectors that converges to zero or is bounded in probability. In this manner the first rule says: if $X_n\xrightarrow{P}0$ and $Y_n\xrightarrow{P}0$, then $Z_n=X_n+Y_n\xrightarrow{P}0$; this is an example of the continuous mapping theorem. The third rule is short for: if $X_n$ is bounded in probability and $Y_n\xrightarrow{P}0$, then $X_nY_n\xrightarrow{P}0$. If $X_n$ would also converge in distribution, then this would be statement (ii) of Slutsky's lemma (with $c=0$). But by Prohorov's theorem $X_n$ converges in distribution along subsequences if it is bounded in probability, so that the third rule can still be deduced from Slutsky's lemma by arguing along subsequences.
Note that both rules are in fact implications and should be read from left to right, even though they are stated with the help of the equality "=" sign. Similarly, while it is true that $o_P(1)+o_P(1)=2o_P(1)$, writing down this rule does not reflect understanding of the $o_P$-symbol.
Two more complicated rules are given by the following lemma.
3.11 Lemma. Let $R$ be a function defined on a neighbourhood of $0\in\mathbb{R}^k$ such that $R(0)=0$. Let $X_n$ be a sequence of random vectors that converges in probability to zero.
(i) if $R(h)=o(\|h\|)$ as $h\to0$, then $R(X_n)=o_P(\|X_n\|)$;
(ii) if $R(h)=O(\|h\|)$ as $h\to0$, then $R(X_n)=O_P(\|X_n\|)$.

Proof. Define $g(h)$ as $g(h)=R(h)/\|h\|$ for $h\ne0$ and $g(0)=0$. Then $R(X_n)=g(X_n)\|X_n\|$.
(i). Since the function $g$ is continuous at zero by assumption, $g(X_n)\xrightarrow{P}g(0)=0$ by the continuous mapping theorem.
(ii). By assumption there exist $M$ and $\delta>0$ such that $|g(h)|\le M$ whenever $\|h\|\le\delta$. Thus $P\bigl(|g(X_n)|>M\bigr)\le P\bigl(\|X_n\|>\delta\bigr)\to0$, and the sequence $g(X_n)$ is tight.

It should be noted that the rule expressed by the lemma is not a simple plug-in rule. For instance it is not true that $R(h)=o(\|h\|)$ implies that $R(X_n)=o_P(\|X_n\|)$ for every sequence of random vectors $X_n$.
3.5 Transforms

It is sometimes possible to show convergence in distribution of a sequence of random vectors directly from the definition. In other cases transforms of probability measures may help. The basic idea is that it suffices to show characterization (ii) of the portmanteau lemma for a small subset of functions $f$ only.
The most important transform is the characteristic function
$$t\mapsto\mathrm{E}e^{it^TX},\qquad t\in\mathbb{R}^k.$$
Each of the functions $x\mapsto e^{it^Tx}$ is continuous and bounded. Thus by the portmanteau lemma $\mathrm{E}e^{it^TX_n}\to\mathrm{E}e^{it^TX}$ for every $t$ if $X_n\rightsquigarrow X$. By Lévy's continuity theorem the converse is also true: pointwise convergence of characteristic functions is equivalent to weak convergence.

3.12 Theorem (Lévy's continuity theorem). Let $X_n$ and $X$ be random vectors in $\mathbb{R}^k$. Then $X_n\rightsquigarrow X$ if and only if $\mathrm{E}e^{it^TX_n}\to\mathrm{E}e^{it^TX}$ for every $t\in\mathbb{R}^k$. Moreover, if $\mathrm{E}e^{it^TX_n}$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic function of a random vector $X$ and $X_n\rightsquigarrow X$.
The following lemma, which gives a variation on Lévy's theorem, is less well known, but will be useful in Chapter 4.

3.13 Lemma. Let $X_n$ be a sequence of random variables such that $\mathrm{E}|X_n|^2=O(1)$ and such that $\mathrm{E}(iX_n+vt)e^{itX_n}\to0$ as $n\to\infty$, for every $t\in\mathbb{R}$ and some $v>0$. Then $X_n\rightsquigarrow N(0,v)$.

Proof. By Markov's inequality and the bound on the second moments, the sequence $X_n$ is uniformly tight. In view of Prohorov's theorem it suffices to show that $N(0,v)$ is the only weak limit point.
If $X_n\rightsquigarrow X$ along some sequence of $n$, then by the boundedness of the second moments and the continuity of the function $x\mapsto(ix+vt)e^{itx}$, we have $\mathrm{E}(iX_n+vt)e^{itX_n}\to\mathrm{E}(iX+vt)e^{itX}$ for every $t\in\mathbb{R}$. (Cf. Theorem 3.8.) Combining this with the assumption, we see that $\mathrm{E}(iX+vt)e^{itX}=0$. By Fatou's lemma $\mathrm{E}X^2\le\liminf\mathrm{E}X_n^2<\infty$ and hence we can differentiate the characteristic function $\phi(t)=\mathrm{E}e^{itX}$ under the expectation to find that $\phi'(t)=\mathrm{E}iXe^{itX}$. We conclude that $\phi'(t)=-vt\phi(t)$. This differential equation possesses $\phi(t)=e^{-vt^2/2}$ as the only solution within the class of characteristic functions. Thus $X$ is normally distributed with mean zero and variance $v$.
3.6 Cramér-Wold Device

The characteristic function $t\mapsto\mathrm{E}e^{it^TX}$ of a vector $X$ is determined by the set of all characteristic functions $u\mapsto\mathrm{E}e^{iu(t^TX)}$ of all linear combinations $t^TX$ of the components of $X$. Therefore the continuity theorem implies that weak convergence of vectors is equivalent to weak convergence of linear combinations:
$$X_n\rightsquigarrow X\quad\text{if and only if}\quad t^TX_n\rightsquigarrow t^TX\ \text{for all }t\in\mathbb{R}^k.$$
This is known as the Cramér-Wold device. It allows to reduce all higher-dimensional weak convergence problems to the one-dimensional case.
3.14 Example (Multivariate central limit theorem). Let $Y,Y_1,Y_2,\ldots$ be i.i.d. random vectors in $\mathbb{R}^k$ with mean vector $\mu=\mathrm{E}Y$ and covariance matrix $\Sigma=\mathrm{E}(Y-\mu)(Y-\mu)^T$. Then
$$\frac1{\sqrt n}\sum_{i=1}^n(Y_i-\mu)=\sqrt n(\overline Y_n-\mu)\rightsquigarrow N_k(0,\Sigma).$$
(The sum is taken coordinatewise.) By the Cramér-Wold device the problem can be reduced to finding the limit distribution of the sequences of real variables
$$t^T\Bigl(\frac1{\sqrt n}\sum_{i=1}^n(Y_i-\mu)\Bigr)=\frac1{\sqrt n}\sum_{i=1}^n(t^TY_i-t^T\mu).$$
Since the random variables $t^TY_1-t^T\mu,t^TY_2-t^T\mu,\ldots$ are i.i.d. with zero mean and variance $t^T\Sigma t$, this sequence is asymptotically $N_1(0,t^T\Sigma t)$-distributed by the univariate central limit theorem. This is exactly the distribution of $t^TX$ if $X$ possesses an $N_k(0,\Sigma)$ distribution.
3.7 Delta-method

Let $T_n$ be a sequence of random vectors with values in $\mathbb{R}^k$ and let $\phi\colon\mathbb{R}^k\to\mathbb{R}^m$ be a given function defined at least on the range of $T_n$ and a neighbourhood of a vector $\theta$. We shall assume that, for given constants $r_n\to\infty$, the sequence $r_n(T_n-\theta)$ converges in distribution, and wish to derive a similar result concerning the sequence $r_n\bigl(\phi(T_n)-\phi(\theta)\bigr)$.
Recall that $\phi$ is differentiable at $\theta$ if there exists a linear map (matrix) $\phi'_\theta\colon\mathbb{R}^k\to\mathbb{R}^m$ such that
$$\phi(\theta+h)-\phi(\theta)=\phi'_\theta(h)+o(\|h\|),\qquad h\to0.$$
All the expressions in this equation are vectors of length $m$ and $\|h\|$ is the Euclidean norm. The linear map $h\mapsto\phi'_\theta(h)$ is sometimes called a "total derivative", as opposed to partial derivatives. A sufficient condition for $\phi$ to be (totally) differentiable is that all partial derivatives $\partial\phi_j(x)/\partial x_i$ exist for $x$ in a neighbourhood of $\theta$ and are continuous at $\theta$. (Just existence of the partial derivatives is not enough.) In any case the total derivative is found from the partial derivatives. If $\phi$ is differentiable, then it is partially differentiable and the derivative map $h\mapsto\phi'_\theta(h)$ is matrix multiplication by the matrix
$$\phi'_\theta=\begin{pmatrix}\dfrac{\partial\phi_1}{\partial x_1}(\theta)&\cdots&\dfrac{\partial\phi_1}{\partial x_k}(\theta)\\ \vdots&&\vdots\\ \dfrac{\partial\phi_m}{\partial x_1}(\theta)&\cdots&\dfrac{\partial\phi_m}{\partial x_k}(\theta)\end{pmatrix}.$$
If the dependence of the derivative $\phi'_\theta$ on $\theta$ is continuous, then $\phi$ is called continuously differentiable.
3.15 Theorem. Let $\phi\colon\mathbb{R}^k\to\mathbb{R}^m$ be a measurable map defined on a subset of $\mathbb{R}^k$ and differentiable at $\theta$. Let $T_n$ be random vectors taking their values in the domain of $\phi$. If $r_n(T_n-\theta)\rightsquigarrow T$ for numbers $r_n\to\infty$, then $r_n\bigl(\phi(T_n)-\phi(\theta)\bigr)\rightsquigarrow\phi'_\theta(T)$. Moreover, the difference between $r_n\bigl(\phi(T_n)-\phi(\theta)\bigr)$ and $\phi'_\theta\bigl(r_n(T_n-\theta)\bigr)$ converges to zero in probability.
Proof. Because $r_n\to\infty$, we have by Slutsky's lemma $T_n-\theta=(1/r_n)r_n(T_n-\theta)\rightsquigarrow0\cdot T=0$ and hence $T_n-\theta$ converges to zero in probability. Define a function $g$ by
$$g(0)=0,\qquad g(h)=\frac{\phi(\theta+h)-\phi(\theta)-\phi'_\theta(h)}{\|h\|},\quad\text{if }h\ne0.$$
Then $g$ is continuous at $0$ by the differentiability of $\phi$. Therefore, by the continuous mapping theorem, $g(T_n-\theta)\xrightarrow{P}0$ and hence, by Slutsky's lemma and again the continuous mapping theorem, $r_n\|T_n-\theta\|\,g(T_n-\theta)\rightsquigarrow\|T\|\cdot0=0$. Consequently,
$$r_n\bigl(\phi(T_n)-\phi(\theta)-\phi'_\theta(T_n-\theta)\bigr)=r_n\|T_n-\theta\|\,g(T_n-\theta)\xrightarrow{P}0.$$
This yields the last statement of the theorem. Since matrix multiplication is continuous, $\phi'_\theta\bigl(r_n(T_n-\theta)\bigr)\rightsquigarrow\phi'_\theta(T)$ by the continuous mapping theorem. Finally, apply Slutsky's lemma to conclude that the sequence $r_n\bigl(\phi(T_n)-\phi(\theta)\bigr)$ has the same weak limit.
A common situation is that $\sqrt n(T_n-\theta)$ converges to a multivariate normal distribution $N_k(\mu,\Sigma)$. Then the conclusion of the theorem is that the sequence $\sqrt n\bigl(\phi(T_n)-\phi(\theta)\bigr)$ converges in law to the $N_m\bigl(\phi'_\theta\mu,\phi'_\theta\Sigma(\phi'_\theta)^T\bigr)$ distribution.
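A quick numerical illustration of this normal limit (Python with numpy; the choice $\phi(x)=x^2$ and the Poisson sampling model are assumptions made only for the example): for the sample mean $T_n=\overline Y_n$ of Poisson($\theta$) variables, $\sqrt n(T_n-\theta)\rightsquigarrow N(0,\theta)$, so the theorem predicts $\sqrt n(T_n^2-\theta^2)\rightsquigarrow N(0,4\theta^2\cdot\theta)$, i.e. a limiting standard deviation $2\theta\sqrt\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 400, 5000        # assumed parameter and sample sizes

T = rng.poisson(theta, size=(reps, n)).mean(axis=1)       # T_n = sample mean
lhs = np.sqrt(n) * (T ** 2 - theta ** 2)                   # r_n(phi(T_n) - phi(theta))
print(lhs.std(), 2 * theta * np.sqrt(theta))               # both approximately 10.39
```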
3.8 Lindeberg Central Limit Theorem

In this section we state, for later reference, a central limit theorem for independent, but not necessarily identically distributed random vectors.

3.16 Theorem (Lindeberg). For each $n\in\mathbb{N}$ let $Y_{n,1},\ldots,Y_{n,n}$ be independent random vectors with finite covariance matrices such that
$$\frac1n\sum_{i=1}^n\mathrm{Cov}\,Y_{n,i}\to\Sigma,\qquad
\frac1n\sum_{i=1}^n\mathrm{E}\|Y_{n,i}\|^21\bigl\{\|Y_{n,i}\|>\varepsilon\sqrt n\bigr\}\to0,\quad\text{for every }\varepsilon>0.$$
Then the sequence $n^{-1/2}\sum_{i=1}^n(Y_{n,i}-\mathrm{E}Y_{n,i})$ converges in distribution to the normal distribution with mean zero and covariance matrix $\Sigma$.
3.9 Minimum Contrast Estimators

Many estimators $\hat\theta_n$ of a parameter $\theta$ are defined as the point of minimum (or maximum) of a given stochastic process $\theta\mapsto M_n(\theta)$. In this section we state basic theorems that give the asymptotic behaviour of such minimum contrast estimators or M-estimators $\hat\theta_n$ in the case that the contrast function $M_n$ fluctuates around a deterministic, smooth function.
Let $M_n$ be a sequence of stochastic processes indexed by a subset $\Theta$ of $\mathbb{R}^d$, defined on given probability spaces, and let $\hat\theta_n$ be random vectors defined on the same probability spaces with values in $\Theta$ such that $M_n(\hat\theta_n)\le M_n(\theta)$ for every $\theta\in\Theta$. Typically it will be true that $M_n(\theta)\xrightarrow{P}M(\theta)$ for each $\theta$ and a given deterministic function $M$. Then we may expect that $\hat\theta_n\xrightarrow{P}\theta_0$ for $\theta_0$ a point of minimum of the map $\theta\mapsto M(\theta)$. The following theorem gives a sufficient condition for this. It applies to the more general situation that the "limit" function $M$ is actually a random process.
For a sequence of random variables $X_n$ we write $X_n\succ_P0$ if $X_n>0$ for every $n$ and $1/X_n=O_P(1)$.
3.17 Theorem. Let $M_n$ and $\overline M_n$ be stochastic processes indexed by a semi-metric space $\Theta$ such that, for some $\theta_0\in\Theta$,
$$\sup_{\theta\in\Theta}\bigl|M_n(\theta)-\overline M_n(\theta)\bigr|\xrightarrow{P}0,\qquad
\inf_{\theta:d(\theta,\theta_0)>\delta}\overline M_n(\theta)-\overline M_n(\theta_0)\succ_P0\quad\text{for every }\delta>0.$$
If $\hat\theta_n$ are random elements with values in $\Theta$ with $M_n(\hat\theta_n)\le M_n(\theta_0)+o_P(1)$, then $d(\hat\theta_n,\theta_0)\xrightarrow{P}0$.
Proof. By the uniform convergence to zero of $M_n-\overline M_n$ and the minimizing property of $\hat\theta_n$, we have $\overline M_n(\hat\theta_n)=M_n(\hat\theta_n)+o_P(1)\le M_n(\theta_0)+o_P(1)=\overline M_n(\theta_0)+o_P(1)$. Write $Z_n(\delta)$ for the left side of the second relation in the display of the theorem. Then $d(\hat\theta_n,\theta_0)>\delta$ implies that $\overline M_n(\hat\theta_n)-\overline M_n(\theta_0)\ge Z_n(\delta)$. Combined with the preceding this implies that $Z_n(\delta)\le o_P(1)$. By assumption the probability of this event tends to zero.
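A minimal numerical illustration of Theorem 3.17 (Python with numpy; the Laplace-type criterion, the sampling model and the grid minimization are assumptions made for the sake of the example): take $M_n(\theta)=n^{-1}\sum_i|X_i-\theta|$, whose deterministic limit $M(\theta)=\mathrm{E}|X-\theta|$ is minimized at the population median $\theta_0$, and the uniform convergence and well-separation conditions hold on compacts. The (near-)minimizer of $M_n$ concentrates around $\theta_0$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 1.0                                     # population median of the assumed model

for n in (50, 500, 5000):
    X = theta0 + rng.standard_t(df=3, size=n)    # heavy-tailed noise, median zero
    grid = np.linspace(-2, 4, 2001)
    M_n = np.abs(X[:, None] - grid[None, :]).mean(axis=0)   # M_n evaluated on a grid
    theta_hat = grid[np.argmin(M_n)]             # near-minimizer of M_n
    print(n, round(theta_hat, 3))                # approaches theta0 = 1.0
```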
If the limit criterion function $\theta\mapsto M(\theta)$ is smooth and takes its minimum at the point $\theta_0$, then its first derivative must vanish at $\theta_0$, and the second derivative $V$ must be positive definite. Thus it possesses a parabolic approximation $M(\theta)=M(\theta_0)+\frac12(\theta-\theta_0)^TV(\theta-\theta_0)$ around $\theta_0$. The random criterion function $M_n$ can be thought of as the limiting criterion function plus the random perturbation $M_n-M$ and possesses approximation
$$M_n(\theta)-M_n(\theta_0)\approx\tfrac12(\theta-\theta_0)^TV(\theta-\theta_0)+\bigl[(M_n-M)(\theta)-(M_n-M)(\theta_0)\bigr].$$
We shall assume that the term in square brackets possesses a linear approximation of the form $(\theta-\theta_0)^TZ_n/\sqrt n$. If we ignore all the remainder terms and minimize the quadratic form
$$\theta-\theta_0\mapsto\tfrac12(\theta-\theta_0)^TV(\theta-\theta_0)+(\theta-\theta_0)^TZ_n/\sqrt n$$
over $\theta-\theta_0$, then we find that the minimum is taken for $\theta-\theta_0=-V^{-1}Z_n/\sqrt n$. Thus we expect that the M-estimator $\hat\theta_n$ satisfies $\sqrt n(\hat\theta_n-\theta_0)=-V^{-1}Z_n+o_P(1)$. This derivation is made rigorous in the following theorem.
3.18 Theorem. Let $M_n$ be stochastic processes indexed by an open subset $\Theta$ of Euclidean space and let $M\colon\Theta\to\mathbb{R}$ be a deterministic function. Assume that $\theta\mapsto M(\theta)$ is twice continuously differentiable at a point of minimum $\theta_0$ with nonsingular second-derivative matrix $V$.$^\dagger$ Suppose that
$$r_n(M_n-M)(\tilde\theta_n)-r_n(M_n-M)(\theta_0)
=(\tilde\theta_n-\theta_0)^TZ_n+o_P^*\bigl(\|\tilde\theta_n-\theta_0\|+r_n\|\tilde\theta_n-\theta_0\|^2+r_n^{-1}\bigr),$$
for every random sequence $\tilde\theta_n=\theta_0+o_P^*(1)$ and a uniformly tight sequence of random vectors $Z_n$. If the sequence $\hat\theta_n$ converges in outer probability to $\theta_0$ and satisfies $M_n(\hat\theta_n)\le\inf_\theta M_n(\theta)+o_P(r_n^{-2})$ for every $n$, then
$$r_n(\hat\theta_n-\theta_0)=-V^{-1}Z_n+o_P^*(1).$$
If it is known that the sequence $r_n(\hat\theta_n-\theta_0)$ is uniformly tight, then the displayed condition needs to be verified for sequences $\tilde\theta_n=\theta_0+O_P^*(r_n^{-1})$ only.

$^\dagger$ It suffices that a two-term Taylor expansion is valid at $\theta_0$.
Proof. The stochastic differentiability condition of the theorem together with the two-times differentiability of the map $\theta\mapsto M(\theta)$ yields for every sequence $\tilde h_n=o_P^*(1)$
$$(3.1)\qquad M_n(\theta_0+\tilde h_n)-M_n(\theta_0)=\tfrac12\tilde h_n^TV\tilde h_n+r_n^{-1}\tilde h_n^TZ_n+o_P^*\bigl(\|\tilde h_n\|^2+r_n^{-1}\|\tilde h_n\|+r_n^{-2}\bigr).$$
For $\tilde h_n$ chosen equal to $\tilde h_n=\hat\theta_n-\theta_0$, the left side (and hence the right side) is at most $o_P(r_n^{-2})$ by the definition of $\hat\theta_n$. In the right side the term $\tilde h_n^TV\tilde h_n$ can be bounded below by $c\|\tilde h_n\|^2$ for a positive constant $c$, since the matrix $V$ is strictly positive definite. Conclude that
$$c\|\tilde h_n\|^2+r_n^{-1}\|\tilde h_n\|O_P(1)+o_P\bigl(\|\tilde h_n\|^2+r_n^{-2}\bigr)\le o_P(r_n^{-2}).$$
Complete the square to see that this implies that
$$\bigl(c+o_P(1)\bigr)\bigl(\|\tilde h_n\|-O_P(r_n^{-1})\bigr)^2\le O_P(r_n^{-2}).$$
This can be true only if $\|\tilde h_n\|=O_P^*(r_n^{-1})$.
For any sequence $\tilde h_n$ of the order $O_P^*(r_n^{-1})$, the three parts of the remainder term in (3.1) are of the order $o_P(r_n^{-2})$. Apply this with the choices $\tilde h_n$ and $-r_n^{-1}V^{-1}Z_n$ to conclude that
$$M_n(\theta_0+\tilde h_n)-M_n(\theta_0)=\tfrac12\tilde h_n^TV\tilde h_n+r_n^{-1}\tilde h_n^TZ_n+o_P^*(r_n^{-2}),$$
$$M_n(\theta_0-r_n^{-1}V^{-1}Z_n)-M_n(\theta_0)=-\tfrac12r_n^{-2}Z_n^TV^{-1}Z_n+o_P^*(r_n^{-2}).$$
The left-hand side of the first equation is smaller than that of the second, up to an $o_P^*(r_n^{-2})$-term. Subtract the second equation from the first to find that
$$\tfrac12\bigl(\tilde h_n+r_n^{-1}V^{-1}Z_n\bigr)^TV\bigl(\tilde h_n+r_n^{-1}V^{-1}Z_n\bigr)\le o_P(r_n^{-2}).$$
Since $V$ is strictly positive definite, this yields the first assertion of the theorem.
If it is known that the sequence $\hat\theta_n$ is $r_n$-consistent, then the middle part of the preceding proof is unnecessary and we can proceed to inserting $\tilde h_n$ and $-r_n^{-1}V^{-1}Z_n$ in (3.1) immediately. The latter equation is then needed for sequences $\tilde h_n=O_P^*(r_n^{-1})$ only.
4 Central Limit Theorem

The classical central limit theorem asserts that the mean of independent, identically distributed random variables with finite variance is asymptotically normally distributed. In this chapter we extend this to dependent variables.
Given a stationary time series $X_t$ let $\overline X_n=n^{-1}\sum_{t=1}^nX_t$ be the average of the variables $X_1,\ldots,X_n$. If $\mu$ and $\gamma_X$ are the mean and auto-covariance function of the time series, then, by the usual rules for expectation and variance,
$$\mathrm{E}\overline X_n=\mu,\qquad
\mathrm{var}(\sqrt n\,\overline X_n)=\frac1n\sum_{s=1}^n\sum_{t=1}^n\mathrm{cov}(X_s,X_t)=\sum_{h=-n}^n\Bigl(\frac{n-|h|}n\Bigr)\gamma_X(h).\qquad(4.1)$$
In the expression for the variance each of the terms $\bigl(n-|h|\bigr)/n$ is bounded by 1 and converges to 1 as $n\to\infty$. If $\sum_h\bigl|\gamma_X(h)\bigr|<\infty$, then we can apply the dominated convergence theorem and obtain that $\mathrm{var}(\sqrt n\,\overline X_n)\to\sum_h\gamma_X(h)$. In any case
$$(4.2)\qquad\mathrm{var}\,\sqrt n\,\overline X_n\le\sum_h\bigl|\gamma_X(h)\bigr|.$$
Hence absolute convergence of the series of auto-covariances implies that the sequence $\sqrt n(\overline X_n-\mu)$ is uniformly tight. The purpose of the chapter is to give conditions for this sequence to be asymptotically normally distributed with mean zero and variance $\sum_h\gamma_X(h)$.
Such conditions are of two types: martingale and mixing.
The martingale central limit theorem is concerned with time series such that the sums $\sum_{t=1}^nX_t$ form a martingale. It thus makes a structural assumption on the expected value of the increments $X_t$ of these sums given the past variables.
Mixing conditions, in a general sense, require that elements $X_t$ and $X_{t+h}$ at large time lags $h$ be approximately independent. Positive and negative values of the deviations $X_t-\mu$ at large time lags will then occur independently and partially cancel each other, which is the intuitive reason for normality of sums. Absolute convergence of the series $\sum_h\gamma_X(h)$, which is often called short-range dependence, can be viewed as a mixing condition, as it implies that covariances at large lags are small. However, it is not strong enough to imply a central limit theorem. Finitely dependent time series and linear processes are special examples of mixing time series that do satisfy the central limit theorem. We also discuss a variety of alternative mixing conditions.
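Formula (4.1) and its limit can be checked by simulation. The sketch below (Python with numpy; the MA(1) model, sample size and number of replications are assumed for the illustration) compares the empirical variance of $\sqrt n\,\overline X_n$ with $\sum_h\gamma_X(h)$, which for $X_t=Z_t+\theta Z_{t-1}$ with $\mathrm{var}\,Z_t=1$ equals $(1+\theta)^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.5, 1000, 4000          # assumed MA(1) coefficient and sizes

means = np.empty(reps)
for r in range(reps):
    Z = rng.standard_normal(n + 1)
    X = Z[1:] + theta * Z[:-1]            # X_t = Z_t + theta * Z_{t-1}
    means[r] = X.mean()
print(n * means.var())                    # empirical var(sqrt(n) * Xbar_n)
print((1 + theta) ** 2)                   # limit: sum_h gamma_X(h) = 2.25
```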
* 4.1 EXERCISE. Suppose that the series $v:=\sum_h\gamma_X(h)$ converges (not necessarily absolutely). Show that $\mathrm{var}\,\sqrt n\,\overline X_n\to v$. [Write $\mathrm{var}\,\sqrt n\,\overline X_n$ as $\overline v_n$, the Cesàro average of $v_1,\ldots,v_n$, for $v_h=\sum_{|j|<h}\gamma_X(j)$, and apply Cesàro's lemma: if $v_n\to v$, then $\overline v_n\to v$.]
4.1 Finite Dependence

A time series $X_t$ is called m-dependent if the random vectors $(\ldots,X_{t-1},X_t)$ and $(X_{t+m+1},X_{t+m+2},\ldots)$ are independent for every $t\in\mathbb{Z}$. In other words, past and future are independent if $m$ "present" variables are left out.

4.2 EXERCISE. Show that the moving average $X_t=Z_t+\theta Z_{t-1}$ considered in Example 1.6, with $Z_t$ an i.i.d. sequence, is 1-dependent.

4.3 EXERCISE. Show that 0-dependent is equivalent to independent.
4.4 Theorem. Let $X_t$ be a strictly stationary, m-dependent time series with mean zero and finite variance. Then the sequence $\sqrt n\,\overline X_n$ converges in distribution to a normal distribution with mean 0 and variance $\sum_h\gamma_X(h)$.
Proof. Choose a (large) integer $l$ and divide $X_1,\ldots,X_n$ into $r=\lfloor n/l\rfloor$ groups of size $l$ and a remainder group of size $n-rl<l$. Let $A_{1,l},\ldots,A_{r,l}$ and $B_{1,l},\ldots,B_{r,l}$ be the sums of the first $l-m$ and last $m$ of the variables $X_i$ in the $r$ groups. (Cf. Figure 4.1.) Then both $A_{1,l},\ldots,A_{r,l}$ and $B_{1,l},\ldots,B_{r,l}$ are sequences of independent identically distributed random variables (but the two sequences may be dependent) and
$$(4.3)\qquad\sum_{i=1}^nX_i=\sum_{j=1}^rA_{j,l}+\sum_{j=1}^rB_{j,l}+\sum_{i=rl+1}^nX_i.$$
We shall show that, for suitable $l=l_n\to\infty$, the second and third terms tend to zero, whereas the first tends to a normal distribution.
For fixed $l$ and $n\to\infty$ (hence $r\to\infty$) the classical central limit theorem applied to the variables $A_{j,l}$ yields
$$\frac1{\sqrt n}\sum_{j=1}^rA_{j,l}=\sqrt{\frac rn}\,\frac1{\sqrt r}\sum_{j=1}^rA_{j,l}\rightsquigarrow\frac1{\sqrt l}\,N(0,\mathrm{var}\,A_{1,l}).$$
Next the variables $n^{-1/2}\sum_{i=rl+1}^nX_i$ have mean zero, and by the triangle inequality, for fixed $l$ as $n\to\infty$,
$$\mathrm{sd}\Bigl(\frac1{\sqrt n}\sum_{i=rl+1}^nX_i\Bigr)\le\frac l{\sqrt n}\,\mathrm{sd}(X_1)\to0.$$
By Chebyshev's inequality it follows that the third term in (4.3) tends to zero in probability. We conclude by Slutsky's lemma that, as $n\to\infty$,
$$S_{n,l}:=\frac1{\sqrt n}\sum_{j=1}^rA_{j,l}+\frac1{\sqrt n}\sum_{i=rl+1}^nX_i\rightsquigarrow N\Bigl(0,\frac1l\,\mathrm{var}\,A_{1,l}\Bigr).$$
This is true for every fixed $l$. If $l\to\infty$, then
$$\frac1l\,\mathrm{var}\,A_{1,l}=\frac1l\,\mathrm{var}\sum_{i=1}^{l-m}X_i=\sum_{h=m-l}^{l-m}\frac{l-m-|h|}l\,\gamma_X(h)\to\sum_{h=-m}^m\gamma_X(h)=:v.$$
Therefore, by Lemma 3.10 there exists a sequence $l_n\to\infty$ such that $S_{n,l_n}\rightsquigarrow N(0,v)$. Let $r_n=\lfloor n/l_n\rfloor$ be the corresponding sequence of values of $r_n$, so that $r_n/n\to0$. By the strict stationarity of the series $X_t$ each $B_{j,l}$ is equal in distribution to $X_1+\cdots+X_m$. Hence $\mathrm{var}\,B_{j,l}$ is independent of $j$ and $l$ and, by the independence of $B_{1,l},\ldots,B_{r,l}$,
$$\mathrm{E}\Bigl(\frac1{\sqrt n}\sum_{j=1}^{r_n}B_{j,l_n}\Bigr)^2=\frac{r_n}n\,\mathrm{var}\,B_{1,l_n}\to0.$$
Thus the sequence of random variables in the left side converges to zero in probability, by Chebyshev's inequality.
In view of (4.3) another application of Slutsky's lemma gives the result.
[Figure 4.1. Blocking of observations in the proof of Theorem 4.4: each block of length $l$ is split into a first part of length $l-m$ (summing to $A_{j,l}$) and a last part of length $m$ (summing to $B_{j,l}$).]
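To see Theorem 4.4 at work numerically, the following Python sketch (an illustration under assumed parameter values, not part of the formal development) simulates the 1-dependent moving average $X_t=Z_t+\theta Z_{t-1}$ and compares the empirical variance of $\sqrt n\,\overline X_n$ with the limit $v=\sum_{|h|\le1}\gamma_X(h)=(1+\theta)^2\operatorname{var}Z_1$.

import numpy as np

# Monte Carlo check of Theorem 4.4 for the 1-dependent series
# X_t = Z_t + theta*Z_{t-1}; here v = sum_h gamma_X(h) = (1 + theta)^2 * var(Z).
rng = np.random.default_rng(0)
theta, n, reps = 0.8, 1000, 2000
v = (1 + theta) ** 2            # with var(Z) = 1

means = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n + 1)
    x = z[1:] + theta * z[:-1]          # strictly stationary, 1-dependent
    means[r] = np.sqrt(n) * x.mean()    # sqrt(n) * sample mean

print("empirical variance of sqrt(n)*mean:", means.var())
print("theoretical limit v              :", v)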
4.2 Linear Processes

A linear process is a time series that can be represented as an infinite moving average

(4.4) $X_t=\mu+\sum_{j=-\infty}^{\infty}\psi_jZ_{t-j},$

for a sequence $\ldots,Z_{-1},Z_0,Z_1,Z_2,\ldots$ of independent and identically distributed variables with $EZ_t=0$, a constant $\mu$, and constants $\psi_j$ with $\sum_j|\psi_j|<\infty$. This may seem special, but we shall see later that this includes, for instance, the rich class of all ARMA-processes.

By (iii) of Lemma 1.28 the covariance function of a linear process is given by $\gamma_X(h)=\sum_j\psi_j\psi_{j+h}\sigma^2$, where $\sigma^2=\operatorname{var}Z_t$, and hence the asymptotic variance of $\sqrt n\,\overline X_n$ is given by
$$v:=\sum_h\gamma_X(h)=\sigma^2\sum_h\sum_j\psi_j\psi_{j+h}=\sigma^2\Big(\sum_j\psi_j\Big)^2.$$
4.5 Theorem. Suppose that (4.4) holds for an i.i.d. sequence $Z_t$ with mean zero and finite variance and numbers $\psi_j$ with $\sum_j|\psi_j|<\infty$. Then the sequence $\sqrt n(\overline X_n-\mu)$ converges in distribution to a normal distribution with mean zero and variance $v$.
Proof. We can assume without loss of generality that $\mu=0$. For a fixed (large) integer $m$ define the time series
$$X^m_t=\sum_{|j|\le m}\psi_jZ_{t-j}=\sum_j\psi^m_jZ_{t-j},$$
where $\psi^m_j=\psi_j$ if $|j|\le m$ and 0 otherwise. Then $X^m_t$ is $(2m+1)$-dependent and strictly stationary. By Theorem 4.4, the sequence $\sqrt n\,\overline X{}^m_n$ converges as $n\to\infty$ in distribution to a normal distribution with mean zero and variance
$$v_m:=\sum_h\gamma_{X^m}(h)=\sigma^2\sum_h\sum_j\psi^m_j\psi^m_{j+h}=\sigma^2\Big(\sum_{|j|\le m}\psi_j\Big)^2.$$
As $m\to\infty$ this variance converges to $v$. Because $N(0,v_m)\rightsquigarrow N(0,v)$, Lemma 3.10 guarantees the existence of a sequence $m_n\to\infty$ such that $\sqrt n\,\overline X{}^{m_n}_n\rightsquigarrow N(0,v)$.

In view of Slutsky's lemma the proof will be complete once we have shown that $\sqrt n(\overline X_n-\overline X{}^{m_n}_n)\stackrel P\to0$. This concerns the average $\overline Y{}^{m_n}_n$ of the differences $Y^m_t=X_t-X^m_t=\sum_{|j|>m}\psi_jZ_{t-j}$. These satisfy
$$E\big(\sqrt n\,\overline X_n-\sqrt n\,\overline X{}^{m_n}_n\big)^2=\operatorname{var}\big(\sqrt n\,\overline Y{}^{m_n}_n\big)\le\sum_h\big|\gamma_{Y^{m_n}}(h)\big|\le\sigma^2\Big(\sum_{|j|>m_n}|\psi_j|\Big)^2.$$
The inequalities follow by (4.1) and Lemma 1.28(iii). The right side converges to zero as $m_n\to\infty$.
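The limit variance $v=\sigma^2(\sum_j\psi_j)^2$ is easy to evaluate for concrete coefficient sequences. The sketch below checks it numerically for the (assumed, purely illustrative) choice $\psi_j=\phi^j$, $j\ge0$, for which $\gamma_X(h)=\sigma^2\phi^{|h|}/(1-\phi^2)$ and both expressions equal $\sigma^2/(1-\phi)^2$.

import numpy as np

# Numerical check of v = sigma^2 * (sum_j psi_j)^2 for a linear process.
# Illustrative coefficients (an assumption): psi_j = phi**j for j >= 0,
# which give gamma_X(h) = sigma^2 * phi**|h| / (1 - phi**2).
phi, sigma2 = 0.6, 1.0
J = 200                                  # truncation of the infinite sums
psi = phi ** np.arange(J)

v_from_psi = sigma2 * psi.sum() ** 2     # sigma^2 * (sum psi_j)^2
h = np.arange(-J, J + 1)
gamma = sigma2 * phi ** np.abs(h) / (1 - phi ** 2)
v_from_gamma = gamma.sum()               # sum_h gamma_X(h)

print(v_from_psi, v_from_gamma, sigma2 / (1 - phi) ** 2)  # all three agree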
4.3 Strong Mixing

The $\alpha$-mixing coefficients (or strong mixing coefficients) of a time series $X_t$ are defined by $\alpha(0)=\tfrac12$ and, for $h\in\mathbb N$,
$$\alpha(h)=2\sup_t\ \sup_{A\in\sigma(\ldots,X_{t-1},X_t),\,B\in\sigma(X_{t+h},X_{t+h+1},\ldots)}\big|P(A\cap B)-P(A)P(B)\big|.^\sharp$$
The events $A$ and $B$ in this display depend on elements $X_t$ of the "past" and "future" that are $h$ time lags apart. Thus the $\alpha$-mixing coefficients measure the extent by which events $A$ and $B$ that are separated by $h$ time instants fail to satisfy the equality $P(A\cap B)=P(A)P(B)$, which is valid for independent events. If the series $X_t$ is strictly stationary, then the supremum over $t$ is unnecessary, and the mixing coefficient $\alpha(h)$ can be defined using the $\sigma$-fields $\sigma(\ldots,X_{-1},X_0)$ and $\sigma(X_h,X_{h+1},\ldots)$ only.

It is immediate from their definition that the coefficients $\alpha(1),\alpha(2),\ldots$ are decreasing and nonnegative. Furthermore, if the time series is $m$-dependent, then $\alpha(h)=0$ for $h>m$.

4.6 EXERCISE. Show that $\alpha(1)\le\tfrac12=\alpha(0)$. [Apply the inequality of Cauchy-Schwarz to $P(A\cap B)-P(A)P(B)=\operatorname{cov}(1_A,1_B)$.]

Warning. The mixing numbers $\alpha(h)$ are denoted by the same symbol as the partial auto-correlations $\alpha_X(h)$.

If $\alpha(h)\to0$ as $h\to\infty$, then the time series $X_t$ is called $\alpha$-mixing or strong mixing. Then events connected to time sets that are far apart are approximately independent. For a central limit theorem to hold, the convergence to 0 must take place at a sufficient speed, dependent on the "sizes" of the variables $X_t$.

A precise formulation can best be given in terms of the inverse function of the mixing coefficients. We can extend $\alpha$ to a function $\alpha\colon[0,\infty)\to[0,1]$ by defining it to be constant on the intervals $[h,h+1)$ for integers $h$. This yields a monotone function that decreases in steps from $\alpha(0)=\tfrac12$ to 0 at infinity in the case that the time series is mixing. The generalized inverse $\alpha^{-1}\colon[0,1]\to[0,\infty)$ is defined by
$$\alpha^{-1}(u)=\inf\big\{x\ge0\colon\alpha(x)\le u\big\}=\sum_{h=0}^{\infty}1_{u<\alpha(h)}.$$
Similarly, the quantile function $F^{-1}_X$ of a random variable $X$ is the generalized inverse of the distribution function $F_X$ of $X$, and is given by
$$F^{-1}_X(1-u)=\inf\big\{x\colon1-F_X(x)\le u\big\}.$$

$^\sharp$ We denote by $\sigma(X_t\colon t\in I)$ the $\sigma$-field generated by the set of random variables $\{X_t\colon t\in I\}$.
[Figure 4.2. Typical functions $\alpha$ (top panel) and $\alpha^{-1}$ (bottom panel).]
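The generalized inverse $\alpha^{-1}$ is a simple step function and can be computed directly from the mixing coefficients. The following minimal sketch assumes a geometrically decaying sequence $\alpha(h)$, purely for illustration.

import numpy as np

# Sketch of the generalized inverse alpha^{-1}(u) = sum_{h>=0} 1{u < alpha(h)},
# for an assumed geometrically decaying sequence of mixing coefficients.
alpha = 0.5 * 0.7 ** np.arange(50)       # alpha(0) = 1/2, decreasing to 0

def alpha_inverse(u, alpha=alpha):
    """Number of lags h with alpha(h) > u; a step function decreasing in u."""
    return np.sum(u < alpha)

for u in (0.4, 0.1, 0.01):
    print(u, alpha_inverse(u))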
4.7 Theorem. If $X_t$ is a strictly stationary time series with mean zero such that $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du<\infty$, then the series $v=\sum_h\gamma_X(h)$ converges absolutely and $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$.

At first sight the condition of the theorem is complicated. Finiteness of the integral requires that the mixing coefficients converge to zero fast enough, but the rate at which this must happen is also dependent on the tails of the variables $X_t$. To make this concrete we can derive finiteness of the integral under a combination of a mixing and a moment condition, as expressed in the following corollary.
4.8 Corollary (Ibragimov). If $X_t$ is a strictly stationary time series with mean zero such that $E|X_0|^r<\infty$ and $\sum_h\alpha(h)^{1-2/r}<\infty$ for some $r>2$, then the series $v=\sum_h\gamma_X(h)$ converges absolutely and $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$.

Proof. If $c^r:=E|X_0|^r<\infty$ for some $r>2$, then $1-F_{|X_0|}(x)\le c^r/x^r$ by Markov's inequality and hence $F^{-1}_{|X_0|}(1-u)\le c/u^{1/r}$. Then we obtain the bound
$$\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du\le\sum_{h=0}^{\infty}\int_0^1 1_{u<\alpha(h)}\frac{c^2}{u^{2/r}}\,du=\frac{c^2r}{r-2}\sum_{h=0}^{\infty}\alpha(h)^{1-2/r}.$$
The right side is finite by assumption, and hence the integral condition of Theorem 4.7 is satisfied.
The corollary assumes finiteness of the $r$th moment $E|X_0|^r$ and convergence of the series of the $(1-2/r)$th powers of the mixing coefficients. For larger values of $r$ the moment condition is more restrictive, but the mixing condition is weaker. The intuition is that heavy tailed distributions occasionally produce big values $|X_t|$. Combined with strong dependence, this may lead to clusters of big variables, which may influence the average $\overline X_n$ relatively strongly.
4.9 EXERCISE (Case $r=\infty$). Show that $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du$ is bounded above by $\|X_0\|_\infty^2\sum_{h=0}^{\infty}\alpha(h)$. [Note that $F^{-1}_{|X_0|}(1-U)$ is distributed as $|X_0|$ if $U$ is uniformly distributed and hence is bounded by $\|X_0\|_\infty$ almost surely.]

4.10 EXERCISE. Show that $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du\le(m+1)EX_0^2$ if the time series $X_t$ is $m$-dependent. Recover Theorem 4.4 from Theorem 4.7.

4.11 EXERCISE. Show that $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du<\infty$ implies that $E|X_0|^2<\infty$.
The key to the proof of Theorem 4.7 is a lemma that bounds covariances in terms of mixing coefficients. Let $\|X\|_p$ denote the $L_p$-norm of a random variable $X$, i.e.
$$\|X\|_p=\big(E|X|^p\big)^{1/p},\quad1\le p<\infty,\qquad\|X\|_\infty=\inf\{M\colon P(|X|\le M)=1\}.$$
Recall Hölder's inequality: for any pair of numbers $p,q>0$ (possibly infinite) with $p^{-1}+q^{-1}=1$ and random variables $X$ and $Y$
$$E|XY|\le\|X\|_p\,\|Y\|_q.$$
For $p=q=2$ this is precisely the inequality of Cauchy-Schwarz. The other case of interest to us is the combination $p=1$, $q=\infty$, for which the inequality is immediate. By repeated application the inequality can be extended to more than two random variables. For instance, for any numbers $p,q,r>0$ with $p^{-1}+q^{-1}+r^{-1}=1$ and random variables $X$, $Y$, and $Z$
$$E|XYZ|\le\|X\|_p\,\|Y\|_q\,\|Z\|_r.$$
4.12 Lemma (Covariance bound). Let $X_t$ be a time series with $\alpha$-mixing coefficients $\alpha(h)$ and let $Y$ and $Z$ be random variables that are measurable relative to $\sigma(\ldots,X_{-1},X_0)$ and $\sigma(X_h,X_{h+1},\ldots)$, respectively, for a given $h\ge0$. Then, for any $p,q,r>0$ such that $p^{-1}+q^{-1}+r^{-1}=1$,
$$\big|\operatorname{cov}(Y,Z)\big|\le2\int_0^{\alpha(h)}F^{-1}_{|Y|}(1-u)F^{-1}_{|Z|}(1-u)\,du\le2\alpha(h)^{1/p}\,\|Y\|_q\,\|Z\|_r.$$
Proof. By the definition of the mixing coefficients, we have, for every $y,z>0$,
$$\big|\operatorname{cov}(1_{Y^+>y},1_{Z^+>z})\big|\le\tfrac12\alpha(h).$$
The same inequality is valid with $Y^+$ and/or $Z^+$ replaced by $Y^-$ and/or $Z^-$. By bilinearity of the covariance and the triangle inequality,
$$\big|\operatorname{cov}(1_{Y^+>y}-1_{Y^->y},\,1_{Z^+>z}-1_{Z^->z})\big|\le2\alpha(h).$$
Because $|\operatorname{cov}(U,V)|\le2E|U|\,\|V\|_\infty$ for any pair of random variables $U,V$ (by the simplest Hölder inequality), we obtain that the covariance on the left side of the preceding display is also bounded by $2\big(P(Y^+>y)+P(Y^->y)\big)=2(1-F_{|Y|})(y)$. Yet another bound for the covariance is obtained by interchanging the roles of $Y$ and $Z$. Combining the three inequalities, we see that, for any $y,z>0$,
$$\big|\operatorname{cov}(1_{Y^+>y}-1_{Y^->y},\,1_{Z^+>z}-1_{Z^->z})\big|\le2\alpha(h)\wedge(1-F_{|Y|})(y)\wedge(1-F_{|Z|})(z)=2\int_0^{\alpha(h)}1_{1-F_{|Y|}(y)>u}\,1_{1-F_{|Z|}(z)>u}\,du.$$
Next we write $Y=Y^+-Y^-=\int_0^\infty(1_{Y^+>y}-1_{Y^->y})\,dy$ and similarly for $Z$, to obtain, by Fubini's theorem,
$$\big|\operatorname{cov}(Y,Z)\big|\le\int_0^\infty\!\!\int_0^\infty\big|\operatorname{cov}(1_{Y^+>y}-1_{Y^->y},\,1_{Z^+>z}-1_{Z^->z})\big|\,dy\,dz\le2\int_0^\infty\!\!\int_0^\infty\!\!\int_0^{\alpha(h)}1_{F_{|Y|}(y)<1-u}\,1_{F_{|Z|}(z)<1-u}\,du\,dy\,dz.$$
Any pair of a distribution and a quantile function satisfies $F_X(x)<u$ if and only if $x<F^{-1}_X(u)$, for every $x$ and $u$. Hence $\int_0^\infty1_{F_X(x)<1-u}\,dx=F^{-1}_X(1-u)$, and we conclude the proof of the first inequality of the lemma by another application of Fubini's theorem.

The second inequality follows upon noting that $F^{-1}_{|Y|}(1-U)$ is distributed as $|Y|$ if $U$ is uniformly distributed on $[0,1]$, and next applying Hölder's inequality.
Taking $Y$ and $Z$ in the lemma equal to $X_0$ and $X_h$, respectively, for a stationary time series $X_t$, and $(p,q,r)$ equal to $(r/(r-4),r/2,r/2)$ for some $r\ge4$, we see that
$$\big|\gamma_X(h)\big|\le2\int_0^{\alpha(h)}F^{-1}_{|X_0|}(1-u)^2\,du\le2\alpha(h)^{1-4/r}\,\|X_0\|_r^2.$$
Consequently, we find that
$$\sum_{h\ge0}\big|\gamma_X(h)\big|\le2\sum_{h\ge0}\int_0^{\alpha(h)}F^{-1}_{|X_0|}(1-u)^2\,du=2\int_0^1\alpha^{-1}(u)\,F^{-1}_{|X_0|}(1-u)^2\,du.$$
This proves the first assertion of Theorem 4.7. Furthermore, in view of (4.2) and the symmetry of the auto-covariance function,

(4.5) $\operatorname{var}\big(\sqrt n\,\overline X_n\big)\le4\int_0^1\alpha^{-1}(u)\,F^{-1}_{|X_0|}(1-u)^2\,du.$

We defer the remainder of the proof of Theorem 4.7 to the end of the chapter.
* 4.4 Uniform Mixing

There are several other types of mixing coefficients. The $\phi$-mixing coefficients or uniform mixing coefficients of a strictly stationary time series $X_t$ are defined by
$$\phi(h)=\sup_{A\in\sigma(\ldots,X_{-1},X_0),\,P(A)\neq0,\ B\in\sigma(X_h,X_{h+1},\ldots)}\big|P(B\,|\,A)-P(B)\big|,$$
$$\tilde\phi(h)=\sup_{A\in\sigma(\ldots,X_{-1},X_0),\ B\in\sigma(X_h,X_{h+1},\ldots),\,P(B)\neq0}\big|P(A\,|\,B)-P(A)\big|.$$
It is immediate from the definitions that $\alpha(h)\le2\big(\phi(h)\wedge\tilde\phi(h)\big)$. Thus a $\phi$-mixing time series is always $\alpha$-mixing. It appears that conditions in terms of $\phi$-mixing are often much more restrictive, even though there is no complete overlap.
4.13 Lemma (Covariance bound). Let $X_t$ be a strictly stationary time series with mixing coefficients $\phi(h)$ and $\tilde\phi(h)$ and let $Y$ and $Z$ be random variables that are measurable relative to $\sigma(\ldots,X_{-1},X_0)$ and $\sigma(X_h,X_{h+1},\ldots)$, respectively, for a given $h\ge0$. Then, for any $p,q>0$ with $p^{-1}+q^{-1}=1$,
$$\big|\operatorname{cov}(Y,Z)\big|\le2\phi(h)^{1/p}\,\tilde\phi(h)^{1/q}\,\|Y\|_p\,\|Z\|_q.$$
Proof. Let $Q$ be the measure $P_{Y,Z}-P_Y\otimes P_Z$ on $\mathbb R^2$, and let $|Q|$ be its absolute value. Then
$$\big|\operatorname{cov}(Y,Z)\big|=\Big|\int\!\!\int yz\,dQ(y,z)\Big|\le\Big(\int\!\!\int|y|^p\,d|Q|(y,z)\Big)^{1/p}\Big(\int\!\!\int|z|^q\,d|Q|(y,z)\Big)^{1/q},$$
by Hölder's inequality. It suffices to show that the first and second marginals of $|Q|$ are bounded above by the measures $2\phi(h)P_Y$ and $2\tilde\phi(h)P_Z$, respectively. By symmetry it suffices to consider the first marginal.

By definition we have that
$$|Q|(C)=\sup_D\Big(\big|Q(C\cap D)\big|+\big|Q(C\cap D^c)\big|\Big),$$
for the supremum taken over all Borel sets $D$ in $\mathbb R^2$. Equivalently, we can compute the supremum over any algebra that generates the Borel sets. In particular, we can use the algebra consisting of all finite unions of rectangles $A\times B$. Conclude from this that
$$|Q|(C)=\sup\sum_{i,j}\big|Q\big(C\cap(A_i\times B_j)\big)\big|,$$
for the supremum taken over all pairs of partitions $\mathbb R=\cup_iA_i$ and $\mathbb R=\cup_jB_j$. It follows that
$$|Q|(A\times\mathbb R)=\sup\sum_{i,j}\big|Q\big((A\cap A_i)\times B_j\big)\big|=\sup\sum_{i,j}\big|P_{Z|Y}(B_j\,|\,A\cap A_i)-P_Z(B_j)\big|\,P_Y(A\cap A_i).$$
If, for fixed $i$, $B_i^+$ consists of the union of all $B_j$ such that $P_{Z|Y}(B_j\,|\,A\cap A_i)-P_Z(B_j)>0$ and $B_i^-$ is the union of the remaining $B_j$, then the double sum can be rewritten
$$\sum_i\Big(\big(P_{Z|Y}(B_i^+\,|\,A\cap A_i)-P_Z(B_i^+)\big)-\big(P_{Z|Y}(B_i^-\,|\,A\cap A_i)-P_Z(B_i^-)\big)\Big)P_Y(A\cap A_i).$$
The sum between round brackets is bounded above by $2\phi(h)$, by the definition of $\phi$. Thus the display is bounded above by $2\phi(h)P_Y(A)$.
4.14 Theorem. If $X_t$ is a strictly stationary time series with mean zero such that $E|X_t|^{p\vee q}<\infty$ and $\sum_h\phi(h)^{1/p}\tilde\phi(h)^{1/q}<\infty$ for some $p,q>0$ with $p^{-1}+q^{-1}=1$, then the series $v=\sum_h\gamma_X(h)$ converges absolutely and $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$.
Proof. For a given $M>0$ let $X^M_t=X_t1\{|X_t|\le M\}$ and let $Y^M_t=X_t-X^M_t$. Because $X^M_t$ is a measurable transformation of $X_t$, it is immediate from the definition of the mixing coefficients that $Y^M_t$ is mixing with smaller mixing coefficients than $X_t$. Therefore, by (4.2) and Lemma 4.13,
$$\operatorname{var}\sqrt n\big(\overline X_n-\overline X{}^M_n\big)\le2\sum_h\phi(h)^{1/p}\tilde\phi(h)^{1/q}\,\|Y^M_0\|_p\,\|Y^M_0\|_q.$$
As $M\to\infty$, the right side converges to zero, and hence the left side converges to zero, uniformly in $n$. This means that we can reduce the problem to the case of uniformly bounded time series $X_t$, as in the proof of Theorem 4.7.

Because the $\alpha$-mixing coefficients are bounded above by the $\phi$-mixing coefficients, we have that $\sum_h\alpha(h)<\infty$. Therefore, the second part of the proof of Theorem 4.7 applies without changes.
4.5 Martingale Differences

The martingale central limit theorem applies to the special time series for which the partial sums $\sum_{t=1}^nX_t$ are a martingale (as a process in $n$), or equivalently the increments $X_t$ are martingale differences. In Chapter 13 score processes (the derivative of the log likelihood) will be an important example of application.

The martingale central limit theorem can be seen as another type of generalization of the ordinary central limit theorem. The partial sums of an i.i.d. sequence grow by increments $X_t$ that are independent from the past. The classical central limit theorem shows that this induces asymptotic normality, provided that the increments are centered and not too big (finite variance suffices). The mixing central limit theorems relax the independence to near independence of variables at large time lags, a condition that involves the whole distribution. In contrast, the martingale central limit theorem imposes conditions on the conditional first and second moments of the increments given the past, without directly involving other aspects of the distribution, and in this sense is closer to the ordinary central limit theorem. The first moments given the past are assumed zero; the second moments given the past must not be too big.

The "past" can be given by an arbitrary filtration. A filtration $\mathcal F_t$ is a nondecreasing collection of $\sigma$-fields $\cdots\subset\mathcal F_{-1}\subset\mathcal F_0\subset\mathcal F_1\subset\cdots$. The $\sigma$-field $\mathcal F_t$ can be thought of as the events that are known at time $t$. Often it will be the $\sigma$-field generated by the variables $X_t,X_{t-1},X_{t-2},\ldots$; the corresponding filtration is called the natural filtration of the time series $X_t$, or the filtration generated by this series. A martingale difference series relative to a given filtration is a time series $X_t$ such that, for every $t$:
(i) $X_t$ is $\mathcal F_t$-measurable;
(ii) $E(X_t\,|\,\mathcal F_{t-1})=0$.
The second requirement implicitly includes the assumption that $E|X_t|<\infty$, so that the conditional expectation is well defined; the identity is understood to be in the almost-sure sense.

4.15 EXERCISE. Show that a martingale difference series with finite variances is a white noise series.
4.16 Theorem. If $X_t$ is a martingale difference series relative to the filtration $\mathcal F_t$ such that $n^{-1}\sum_{t=1}^nE(X_t^2\,|\,\mathcal F_{t-1})\stackrel P\to v$ for a positive constant $v$, and such that $n^{-1}\sum_{t=1}^nE\big(X_t^21_{|X_t|>\varepsilon\sqrt n}\,|\,\mathcal F_{t-1}\big)\stackrel P\to0$ for every $\varepsilon>0$, then $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$.
Proof. For simplicity of notation, let $E_{t-1}$ denote conditional expectation given $\mathcal F_{t-1}$. Because the events $A_t=\{n^{-1}\sum_{j=1}^tE_{j-1}X_j^2\le2v\}$ are $\mathcal F_{t-1}$-measurable, the variables $X_{n,t}=n^{-1/2}X_t1_{A_t}$ are martingale differences relative to the filtration $\mathcal F_t$. They satisfy

(4.6) $\sum_{t=1}^nE_{t-1}X_{n,t}^2\le2v,\qquad\sum_{t=1}^nE_{t-1}X_{n,t}^2\stackrel P\to v,\qquad\sum_{t=1}^nE_{t-1}X_{n,t}^21_{|X_{n,t}|>\varepsilon}\stackrel P\to0,\ \text{every }\varepsilon>0.$

To see this note first that $E_{t-1}X_{n,t}^2=1_{A_t}n^{-1}E_{t-1}X_t^2$, and that the events $A_t$ are decreasing: $A_1\supset A_2\supset\cdots\supset A_n$. The first relation in the display follows from the definition of the events $A_t$, which make that the cumulative sums stop increasing before they cross the level $2v$. The second relation follows, because the probability of the event $\cap_{t=1}^nA_t=A_n$ tends to 1 by assumption, and the left side of this relation is equal to $n^{-1}\sum_{t=1}^nE_{t-1}X_t^2$ on $A_n$, which tends to $v$ by assumption. The third relation is immediate from the conditional Lindeberg assumption on the $X_t$ and the fact that $|X_{n,t}|\le|X_t|/\sqrt n$, for every $t$.

On the event $A_n$ we also have that $X_{n,t}=X_t/\sqrt n$ for every $t=1,\ldots,n$ and hence the theorem is proved once it has been established that $\sum_{t=1}^nX_{n,t}\rightsquigarrow N(0,v)$.

We have that $|e^{ix}-1-ix|\le x^2/2$ for every $x\in\mathbb R$. Furthermore, the function $R\colon\mathbb R\to\mathbb C$ defined by $e^{ix}-1-ix+x^2/2=x^2R(x)$ satisfies $|R(x)|\le1$ and $R(x)\to0$ as $x\to0$. If $\varepsilon,\eta>0$ are chosen such that $|R(x)|<\eta$ if $|x|\le\varepsilon$, then $|e^{ix}-1-ix+x^2/2|\le x^21_{|x|>\varepsilon}+\eta x^2$. For fixed $u\in\mathbb R$ define
$$R_{n,t}=E_{t-1}\big(e^{iuX_{n,t}}-1-iuX_{n,t}\big).$$
It follows that $|R_{n,t}|\le\tfrac12u^2E_{t-1}X_{n,t}^2$ and
$$\sum_{t=1}^n|R_{n,t}|\le\tfrac12u^2\sum_{t=1}^nE_{t-1}X_{n,t}^2\le u^2v,$$
$$\max_{1\le t\le n}|R_{n,t}|\le\tfrac12u^2\Big(\sum_{t=1}^nE_{t-1}X_{n,t}^21_{|X_{n,t}|>\varepsilon}+\varepsilon^2\Big),$$
$$\sum_{t=1}^n\big|R_{n,t}+\tfrac12u^2E_{t-1}X_{n,t}^2\big|\le u^2\sum_{t=1}^n\Big(E_{t-1}X_{n,t}^21_{|X_{n,t}|>\varepsilon}+\eta E_{t-1}X_{n,t}^2\Big).$$
The second and third inequalities together with (4.6) imply that the sequence $\max_{1\le t\le n}|R_{n,t}|$ tends to zero in probability and that the sequence $\sum_{t=1}^nR_{n,t}$ tends in probability to $-\tfrac12u^2v$.

The function $S\colon\mathbb R\to\mathbb R$ defined by $\log(1-x)=-x+xS(x)$ satisfies $S(x)\to0$ as $x\to0$. It follows that $\max_{1\le t\le n}|S(R_{n,t})|\stackrel P\to0$, and
$$\prod_{t=1}^n(1-R_{n,t})=e^{\sum_{t=1}^n\log(1-R_{n,t})}=e^{-\sum_{t=1}^nR_{n,t}+\sum_{t=1}^nR_{n,t}S(R_{n,t})}\stackrel P\to e^{u^2v/2}.$$
We also have that $\prod_{t=1}^k|1-R_{n,t}|\le\exp\sum_{t=1}^n|R_{n,t}|\le e^{u^2v}$, for every $k\le n$. Therefore, by the dominated convergence theorem, the convergence in the preceding display is also in absolute mean.

For every $t$,
$$E_{n-1}e^{iuX_{n,n}}(1-R_{n,n})=(1-R_{n,n})E_{n-1}\big(e^{iuX_{n,n}}-1-iuX_{n,n}+1\big)=(1-R_{n,n})(R_{n,n}+1)=1-R_{n,n}^2.$$
Therefore, by conditioning on $\mathcal F_{n-1}$,
$$E\prod_{t=1}^ne^{iuX_{n,t}}(1-R_{n,t})=E\prod_{t=1}^{n-1}e^{iuX_{n,t}}(1-R_{n,t})-E\prod_{t=1}^{n-1}e^{iuX_{n,t}}(1-R_{n,t})R_{n,n}^2.$$
By repeating this argument, we find that
$$\Big|E\prod_{t=1}^ne^{iuX_{n,t}}(1-R_{n,t})-1\Big|\le\sum_{k=1}^n\Big|E\prod_{t=1}^{k-1}e^{iuX_{n,t}}(1-R_{n,t})R_{n,k}^2\Big|\le e^{u^2v}E\sum_{t=1}^nR_{n,t}^2.$$
This tends to zero, because $\sum_{t=1}^n|R_{n,t}|$ is bounded above by a constant and $\max_{1\le t\le n}|R_{n,t}|$ tends to zero in probability.

We combine the results of the last two paragraphs to conclude that
$$\Big|E\prod_{t=1}^ne^{iuX_{n,t}}e^{u^2v/2}-1\Big|\le\Big|E\prod_{t=1}^ne^{iuX_{n,t}}\Big(e^{u^2v/2}-\prod_{t=1}^n(1-R_{n,t})\Big)\Big|+o(1)\to0.$$
The theorem follows from the continuity theorem for characteristic functions.
Apart from the structural condition that the sums $\sum_{t=1}^nX_t$ form a martingale, the martingale central limit theorem requires that the sequence of variables $Y_t=E(X_t^2\,|\,\mathcal F_{t-1})$ satisfies a law of large numbers and that the variables $Y_{t,\varepsilon,n}=E\big(X_t^21_{|X_t|>\varepsilon\sqrt n}\,|\,\mathcal F_{t-1}\big)$ satisfy a (conditional) Lindeberg-type condition. These conditions are immediate for ergodic sequences, which by definition are (strictly stationary) sequences for which any "running transformation" of the type $Y_t=g(X_t,X_{t-1},\ldots)$ satisfies the Law of Large Numbers. The concept of ergodicity is discussed in Section 7.2.
4.17 Corollary. If $X_t$ is a strictly stationary, ergodic martingale difference series relative to its natural filtration with mean zero and $v=EX_t^2<\infty$, then $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$.
Proof. By strict stationarity there exists a fixed measurable function $g\colon\mathbb R^\infty\to\mathbb R$ such that $E(X_t^2\,|\,X_{t-1},X_{t-2},\ldots)=g(X_{t-1},X_{t-2},\ldots)$ almost surely, for every $t$. The ergodicity of the series $X_t$ is inherited by the series $Y_t=g(X_{t-1},X_{t-2},\ldots)$ and hence $\overline Y_n\to EY_1=EX_1^2$ almost surely. By a similar argument the averages $n^{-1}\sum_{t=1}^nE\big(X_t^21_{|X_t|>M}\,|\,\mathcal F_{t-1}\big)$ converge almost surely to their expectation, for every fixed $M$. This expectation can be made arbitrarily small by choosing $M$ large. The sequence $n^{-1}\sum_{t=1}^nE\big(X_t^21\{|X_t|>\varepsilon\sqrt n\}\,|\,\mathcal F_{t-1}\big)$ is bounded by this sequence eventually, for any $M$, and hence converges almost surely to zero.
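As an illustration of Corollary 4.17, the following Python sketch simulates an ARCH(1)-type martingale difference series $X_t=Z_t\sqrt{w+aX_{t-1}^2}$ (an assumed example; for these parameter values the series is stationary and ergodic with $EX_t^2=w/(1-a)$) and compares the empirical variance of $\sqrt n\,\overline X_n$ with this limit.

import numpy as np

# Monte Carlo illustration of Corollary 4.17 for an ARCH(1)-type martingale
# difference series X_t = Z_t * sqrt(w + a*X_{t-1}^2)  (illustrative assumption).
rng = np.random.default_rng(1)
w, a, n, reps, burn = 0.5, 0.3, 1000, 500, 100
v = w / (1 - a)                          # EX_t^2 for this model

stats = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):         # burn-in to approximate stationarity
        x[t] = z[t] * np.sqrt(w + a * x[t - 1] ** 2)
    stats[r] = np.sqrt(n) * x[burn:].mean()

print("empirical variance:", stats.var(), " limit v:", v)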
* 4.6 Projections

Let $X_t$ be a centered time series and $\mathcal F_0=\sigma(X_0,X_{-1},\ldots)$. For a suitably mixing time series the covariance $E\big(X_nE(X_j\,|\,\mathcal F_0)\big)$ between $X_n$ and the best prediction of $X_j$ at time 0 should be small as $n\to\infty$. The following theorem gives a precise and remarkably simple sufficient condition for the central limit theorem in terms of these quantities.

4.18 Theorem. Let $X_t$ be a strictly stationary, mean zero, ergodic time series with $\sum_h|\gamma_X(h)|<\infty$ and, as $n\to\infty$,
$$\sum_{j=0}^{\infty}\Big|E\big(X_nE(X_j\,|\,\mathcal F_0)\big)\Big|\to0.$$
Then $\sqrt n\,\overline X_n\rightsquigarrow N(0,v)$, for $v=\sum_h\gamma_X(h)$.
Proof. For a fixed integer $m$ define a time series
$$Y_{t,m}=\sum_{j=t}^{t+m}\Big(E(X_j\,|\,\mathcal F_t)-E(X_j\,|\,\mathcal F_{t-1})\Big).$$
Then $Y_{t,m}$ is a strictly stationary martingale difference series. By the ergodicity of the series $X_t$, for fixed $m$ as $n\to\infty$,
$$\frac1n\sum_{t=1}^nE(Y_{t,m}^2\,|\,\mathcal F_{t-1})\to EY_{0,m}^2=:v_m,$$
almost surely and in mean. The number $v_m$ is finite, because the series $X_t$ is square-integrable by assumption. By the martingale central limit theorem, Theorem 4.16, we conclude that $\sqrt n\,\overline Y_{n,m}\rightsquigarrow N(0,v_m)$ as $n\to\infty$, for every fixed $m$.

Because $X_t=E(X_t\,|\,\mathcal F_t)$ we can write
$$\sum_{t=1}^n(Y_{t,m}-X_t)=\sum_{t=1}^n\sum_{j=t+1}^{t+m}E(X_j\,|\,\mathcal F_t)-\sum_{t=1}^n\sum_{j=t}^{t+m}E(X_j\,|\,\mathcal F_{t-1})=\sum_{j=n+1}^{n+m}E(X_j\,|\,\mathcal F_n)-\sum_{j=1}^mE(X_j\,|\,\mathcal F_0)-\sum_{t=1}^nE(X_{t+m}\,|\,\mathcal F_{t-1}).$$
Write the right side as $Z_{n,m}-Z_{0,m}-R_{n,m}$. Then the time series $Z_{t,m}$ is stationary with
$$EZ_{0,m}^2=\sum_{i=1}^m\sum_{j=1}^mE\Big(E(X_i\,|\,\mathcal F_0)E(X_j\,|\,\mathcal F_0)\Big)\le m^2EX_0^2.$$
The right side divided by $n$ converges to zero as $n\to\infty$, for every fixed $m$. Furthermore,
$$ER_{n,m}^2=\sum_{s=1}^n\sum_{t=1}^nE\Big(E(X_{s+m}\,|\,\mathcal F_{s-1})E(X_{t+m}\,|\,\mathcal F_{t-1})\Big)\le2\sum_{1\le s\le t\le n}\Big|E\Big(E(X_{s+m}\,|\,\mathcal F_{s-1})X_{t+m}\Big)\Big|\le2n\sum_{h=1}^{\infty}\Big|EE(X_{m+1}\,|\,\mathcal F_0)X_{h+m}\Big|=2n\sum_{h=m+1}^{\infty}\Big|EX_{m+1}E(X_h\,|\,\mathcal F_0)\Big|.$$
The right side divided by $n$ converges to zero as $m\to\infty$. Combining the three preceding displays we see that the sequence $\sqrt n(\overline Y_{n,m}-\overline X_n)=(Z_{n,m}-Z_{0,m}-R_{n,m})/\sqrt n$ converges to zero in second mean as $n\to\infty$ followed by $m\to\infty$.

Because $Y_{t,m}$ is a martingale difference series, the variables $Y_{t,m}$ are uncorrelated and hence
$$\operatorname{var}\sqrt n\,\overline Y_{n,m}=EY_{0,m}^2=v_m.$$
Because, as usual, $\operatorname{var}\sqrt n\,\overline X_n\to v$ as $n\to\infty$, combination with the preceding paragraph shows that $v_m\to v$ as $m\to\infty$. Consequently, by Lemma 3.10 there exists $m_n\to\infty$ such that $\sqrt n\,\overline Y_{n,m_n}\rightsquigarrow N(0,v)$ and $\sqrt n(\overline Y_{n,m_n}-\overline X_n)\to0$. This implies the theorem in view of Slutsky's lemma.
4.19 EXERCISE. Derive the martingale central limit theorem, Theorem 4.16, from Theorem 4.18. [The infinite sum is really equal to $|EX_nX_0|$!]
* 4.7 Proof of Theorem 4.7

In this section we present two proofs of Theorem 4.7, the first based on characteristic functions and Lemma 3.13, and the second based on Theorem 4.18.
Proof of Theorem 4.7 (first version). For a given $M>0$ let $X^M_t=X_t1\{|X_t|\le M\}$ and let $Y^M_t=X_t-X^M_t$. Because $X^M_t$ is a measurable transformation of $X_t$, it is immediate from the definition of the mixing coefficients that the series $Y^M_t$ is mixing with smaller mixing coefficients than the series $X_t$. Therefore, in view of (4.5)
$$\operatorname{var}\sqrt n\big(\overline X_n-\overline X{}^M_n\big)=\operatorname{var}\sqrt n\,\overline Y{}^M_n\le4\int_0^1\alpha^{-1}(u)\,F^{-1}_{|Y^M_0|}(1-u)^2\,du.$$
Because $Y^M_0=0$ whenever $|X_0|\le M$, it follows that $Y^M_0\to0$ as $M\to\infty$ and hence $F^{-1}_{|Y^M_0|}(u)\to0$ for every $u\in(0,1)$. Furthermore, because $|Y^M_0|\le|X_0|$, its quantile function is bounded above by the quantile function of $|X_0|$. By the dominated convergence theorem the integral in the preceding display converges to zero as $M\to\infty$, and hence the variance in the left side converges to zero as $M\to\infty$, uniformly in $n$. If we can show that $\sqrt n(\overline X{}^M_n-EX^M_0)\rightsquigarrow N(0,v_M)$ as $n\to\infty$ for $v_M=\lim\operatorname{var}\sqrt n\,\overline X{}^M_n$ and every fixed $M$, then it follows that $\sqrt n(\overline X_n-EX_0)\rightsquigarrow N(0,v)$ for $v=\lim v_M=\lim\operatorname{var}\sqrt n\,\overline X_n$, by Lemma 3.10, and the proof is complete.

Thus it suffices to prove the theorem for uniformly bounded variables $X_t$. Let $M$ be the uniform bound.

Fix some sequence $m_n\to\infty$ such that $\sqrt n\,\alpha(m_n)\to0$ and $m_n/\sqrt n\to0$. Such a sequence exists. To see this, first note that $\sqrt n\,\alpha(\lfloor\sqrt n/k\rfloor)\to0$ as $n\to\infty$, for every fixed $k$. (See Problem 4.20.) Thus by Lemma 3.10 there exists $k_n\to\infty$ such that $\sqrt n\,\alpha(\lfloor\sqrt n/k_n\rfloor)\to0$ as $k_n\to\infty$. Now set $m_n=\lfloor\sqrt n/k_n\rfloor$. For simplicity write $m$ for $m_n$. Also let it be silently understood that all summation indices are restricted to the integers $1,2,\ldots,n$, unless indicated otherwise.

Let $S_n=n^{-1/2}\sum_{t=1}^nX_t$ and, for every given $t$, set $S_n(t)=n^{-1/2}\sum_{|j-t|<m}X_j$.
Because $|e^{i\lambda}-1-i\lambda|\le\tfrac12\lambda^2$ for every $\lambda\in\mathbb R$, we have
$$\Big|E\Big(\frac1{\sqrt n}\sum_{t=1}^nX_te^{i\lambda S_n}\big(e^{-i\lambda S_n(t)}-1+i\lambda S_n(t)\big)\Big)\Big|\le\frac{\lambda^2M}{2\sqrt n}\sum_{t=1}^nES_n^2(t)=\frac{\lambda^2M}{2\sqrt n\,n}\sum_{t=1}^n\sum_{|i-t|<m}\sum_{|j-t|<m}\gamma_X(i-j)\le\frac{\lambda^2M}{\sqrt n}\,m\sum_h\big|\gamma_X(h)\big|\to0.$$
Furthermore, with $A_n(t)$ and $B_n(t)$ defined as $n^{-1/2}$ times the sum of the $X_j$ with $1\le j\le t-m$ and $t+m\le j\le n$, respectively, we have $S_n-S_n(t)=A_n(t)+B_n(t)$ and
$$\Big|E\Big(\frac1{\sqrt n}\sum_{t=1}^nX_te^{i\lambda S_n}e^{-i\lambda S_n(t)}\Big)\Big|\le\frac1{\sqrt n}\sum_{t=1}^n\Big(\big|\operatorname{cov}\big(X_te^{i\lambda A_n(t)},e^{i\lambda B_n(t)}\big)\big|+\big|\operatorname{cov}\big(X_t,e^{i\lambda A_n(t)}\big)\big|\,\big|Ee^{i\lambda B_n(t)}\big|\Big),$$
which is bounded above by a multiple of $\sqrt n\,M\alpha(m)$ and hence tends to zero, by the second inequality of Lemma 4.12, with $p=1$ and $q=r=\infty$. Combining the preceding pair of displays we see that
$$ES_ne^{i\lambda S_n}=E\Big(\frac1{\sqrt n}\sum_{t=1}^nX_te^{i\lambda S_n}\,i\lambda S_n(t)\Big)+o(1)=i\lambda\,E\Big(e^{i\lambda S_n}\frac1n\sum_{|s-t|<m}X_sX_t\Big)+o(1).$$
If we can show that $n^{-1}\sum_{|s-t|<m}X_sX_t$ converges in mean to $v$, then the right side of the last display is asymptotically equivalent to $i\lambda\,Ee^{i\lambda S_n}\,v$, and the theorem is proved in view of Lemma 3.13.

In fact, we show that $n^{-1}\sum_{|s-t|<m}X_sX_t\to v$ in second mean. First,
$$E\,\frac1n\sum_{|s-t|<m}X_sX_t=\sum_{|h|<m}\Big(\frac{n-|h|}{n}\Big)\gamma_X(h)\to v,$$
by the dominated convergence theorem, in view of (4.2). Second,
$$\operatorname{var}\Big(\frac1n\sum_{|s-t|<m}X_sX_t\Big)\le\frac1{n^2}\sum_{|s-t|<m}\sum_{|i-j|<m}\big|\operatorname{cov}(X_sX_t,X_iX_j)\big|.$$
The first double sum on the right can be split in the sums over the pairs $(s,t)$ with $s<t$ and $s\ge t$, respectively, and similarly for the second double sum relative to $(i,j)$. By symmetry the right side is bounded by
$$\frac4{n^2}\sum_{|s-t|<m,\,s\le t}\ \sum_{|i-j|<m,\,i\le j}\big|\operatorname{cov}(X_sX_t,X_iX_j)\big|\le\frac4{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m\big|\operatorname{cov}(X_sX_{s+t},X_iX_{i+j})\big|\le\frac8{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m\big|\operatorname{cov}(X_sX_{s+t},X_{s+i}X_{s+i+j})\big|,$$
by the same argument, this time splitting the sums over $s\le i$ and $s>i$ and using symmetry between $s$ and $i$. If $i\ge t$, then the covariance in the sum is bounded above by $2\alpha(i-t)M^4$, by Lemma 4.12, because there are $i-t$ time instants between $X_sX_{s+t}$ and $X_{s+i}X_{s+i+j}$. If $i<t$, then we rewrite the absolute covariance as
$$\big|\operatorname{cov}(X_s,X_{s+t}X_{s+i}X_{s+i+j})-\operatorname{cov}(X_s,X_{s+t})EX_{s+i}X_{s+i+j}\big|\le4\alpha(i)M^4.$$
Thus the four-fold sum is bounded above by
$$\frac{32}{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m\Big(\alpha(i-t)M^41_{i\ge t}+\alpha(i)M^41_{i<t}\Big)\le64M^4\,\frac{m^2}{n}\sum_{i\ge0}\alpha(i).$$
Because $F^{-1}_{|X_0|}$ is bounded away from zero in a neighbourhood of 0, finiteness of the integral $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_0|}(1-u)^2\,du$ implies that the series on the right converges. Because $m^2=m_n^2=o(n)$ by construction, the whole bound tends to zero. This concludes the proof.
* 4.20 EXERCISE. Suppose that $\alpha(h)$ is a decreasing sequence of nonnegative numbers ($h=1,2,\ldots$) with $\sum_h\alpha(h)<\infty$. Show that $h\alpha(h)\to0$ as $h\to\infty$. [First derive, using the monotonicity, that $\sum_h2^h\alpha(2^h)<\infty$ and conclude from this that $2^h\alpha(2^h)\to0$. Next use the monotonicity again to fill the gaps.]
Proof of Theorem 4.7 (second version). Let $\Delta F$ denote the jump sizes of a cumulative distribution function $F$. For $Y_n=E(X_n\,|\,\mathcal F_0)$ and $V$ a uniform variable independent of the other variables, set
$$1-U_n=F_{|Y_n|}\big(|Y_n|-\big)+V\,\Delta F_{|Y_n|}\big(|Y_n|\big).$$
This definition is an extended form of the probability integral transformation, allowing for jumps in the distribution function. The variable $U_n$ is uniformly distributed and $F^{-1}_{|Y_n|}(1-U_n)=|Y_n|$ almost surely. Because $Y_n$ is $\mathcal F_0$-measurable the covariance inequality, Lemma 4.12, gives
$$\Big|E\Big(E(X_n\,|\,\mathcal F_0)X_j\Big)\Big|\le2\int_0^{\alpha(j)}F^{-1}_{|Y_n|}(1-u)F^{-1}_{|X_j|}(1-u)\,du=2E\,Y_n\operatorname{sign}(Y_n)F^{-1}_{|X_j|}(1-U_n)1_{U_n<\alpha(j)}=2E\,X_n\operatorname{sign}(Y_n)F^{-1}_{|X_j|}(1-U_n)1_{U_n<\alpha(j)}\le2E|X_n|F^{-1}_{|X_j|}(1-U_n)1_{U_n<\alpha(j)}\le4\int_0^1F^{-1}_{|X_n|}(1-u)G^{-1}(1-u)\,du,$$
by a second application of Lemma 4.12, with $\alpha=1$ and $G$ the distribution function of the random variable $F^{-1}_{|X_j|}(1-U_n)1_{U_n<\alpha(j)}$. The corresponding quantile function $G^{-1}(1-u)$ vanishes off $[0,\alpha(j)]$ and is bounded above by the quantile function of $|X_j|$. Therefore, the expression is further bounded by $4\int_0^{\alpha(j)}F^{-1}_{|X_0|}(1-u)^2\,du$. We finish by summing up over $j$.
5  Nonparametric Estimation of Mean and Covariance

Suppose we observe the values $X_1,\ldots,X_n$ from the stationary time series $X_t$ with mean $\mu_X=EX_t$, covariance function $h\mapsto\gamma_X(h)$, and correlation function $h\mapsto\rho_X(h)$. If nothing is known about the distribution of the time series, besides that it is stationary, then obvious estimators for these parameters are
$$\hat\mu_n=\overline X_n=\frac1n\sum_{t=1}^nX_t,$$
$$\hat\gamma_n(h)=\frac1n\sum_{t=1}^{n-h}(X_{t+h}-\overline X_n)(X_t-\overline X_n),\qquad(0\le h<n),$$
$$\hat\rho_n(h)=\frac{\hat\gamma_n(h)}{\hat\gamma_n(0)}.$$
These estimators are called nonparametric, because they are not motivated by a statistical model that restricts the distribution of the time series. The advantage is that they work for (almost) every stationary time series. However, given a statistical model, it might be possible to find better estimators for $\mu_X$, $\gamma_X$ and $\rho_X$. We shall see examples of this when discussing ARMA-processes in Chapter 8.
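For concreteness, the three estimators can be computed as follows; this is a minimal Python sketch of the definitions above, with the factor $1/n$ (rather than $1/(n-h)$) in the auto-covariances.

import numpy as np

# A minimal sketch of the nonparametric estimators defined above,
# applied to an arbitrary observed series x = (X_1, ..., X_n).
def sample_mean(x):
    return np.mean(x)

def sample_autocov(x, h):
    """gamma_hat_n(h) = (1/n) sum_{t=1}^{n-h} (X_{t+h} - mean)(X_t - mean)."""
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:n - h]) / n   # note the factor 1/n, not 1/(n-h)

def sample_autocorr(x, h):
    return sample_autocov(x, h) / sample_autocov(x, 0)

x = np.random.default_rng(2).standard_normal(250)
print(sample_mean(x), sample_autocov(x, 1), sample_autocorr(x, 1))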
* 5.1 EXERCISE. The factor $1/n$ in the definition of $\hat\gamma_n(h)$ is sometimes replaced by $1/(n-h)$, because there are $n-h$ terms in the sum. Show that with the present definition of $\hat\gamma_n$ the corresponding estimator $\big(\hat\gamma_n(s-t)\big)_{s,t=1,\ldots,h}$ for the covariance matrix of $(X_1,\ldots,X_h)$ is nonnegative-definite. Show by example that this is not true if we use $1/(n-h)$. [Write the matrix as $QQ^T$ for a suitable $(n\times2n)$ matrix $Q$.]
The time series $X_t$ is called Gaussian if the joint distribution of any finite number of the variables $X_t$ is multivariate-normal. In that case $\overline X_n$ is normally distributed. The distributions of $\hat\gamma_n(h)$ and $\hat\rho_n(h)$ are complicated, even under normality. Distributional statements concerning these estimators are therefore usually asymptotic in nature, as $n\to\infty$.

[Figure 5.1. Realization of the sample auto-correlation function (n = 250) of the stationary time series satisfying $X_{t+1}=0.5X_t+Z_t$ for standard normal white noise $Z_t$.]

In this chapter we discuss conditions under which the three estimators are asymptotically normally distributed. This knowledge can be used to set approximate confidence intervals.
5.1 Mean

The asymptotic normality of the sample mean is the subject of Chapter 4. The statistical significance of the central limit theorem is that the sample mean is an asymptotically consistent estimator of the true mean $\mu_X$, with precision of the order $1/\sqrt n$. The central limit theorem can be used in a more precise way to derive an asymptotic confidence interval for $\mu_X$. This requires an estimator of the (asymptotic) variance of the sample mean, which we discuss in this section.

An approximate confidence interval for $\mu_X$ based on the sample mean typically takes the form
$$\Big(\overline X_n-\frac{\hat\sigma_n}{\sqrt n}\,1.96,\ \overline X_n+\frac{\hat\sigma_n}{\sqrt n}\,1.96\Big).$$
If $\sqrt n(\overline X_n-\mu_X)/\hat\sigma_n\rightsquigarrow N(0,1)$ as $n\to\infty$, then the confidence level of this interval converges to 95%. The problem is to find suitable estimators $\hat\sigma_n$.

If the sequence $\sqrt n(\overline X_n-\mu_X)$ is asymptotically normal, as it is under the conditions of the preceding chapter, the procedure works if the $\hat\sigma_n^2$ are consistent estimators of the (asymptotic) variance of $\sqrt n\,\overline X_n$. However, constructing such estimators is not a trivial matter. Unlike in the case of independent, identically distributed variables, the variance of the sample mean depends on characteristics of the joint distribution of $(X_1,\ldots,X_n)$, rather than only on the marginal distributions. (See (4.1).) The limiting variance $\sum_h\gamma_X(h)$ depends even on the joint distribution of the infinite sequence $(X_1,X_2,\ldots)$. With a sufficient number of observations it is possible to estimate the auto-covariances $\gamma_X(h)$ at smaller lags $h$, but, without further information, this is not true for larger lags $h$ close to $n$ (let alone $h\ge n$), unless we make special assumptions. Setting a confidence interval is therefore much harder than in the case of independent, identically distributed variables.

If a reliable model is available, expressed in a vector of parameters, then the problem can be solved by a model-based estimator. We express the variance of the sample mean in the parameters, and next replace these by estimators. If there are not too many parameters in the model this should be feasible. (Methods to estimate parameters are discussed in later chapters.)
5.2 EXERCISE.
(i) Calculate the asymptotic variance of the sample mean for the moving average $X_t=Z_t+\theta Z_{t-1}$.
(ii) Same question for the stationary solution of $X_t=\phi X_{t-1}+Z_t$, where $|\phi|<1$.
However, the use of a model-based estimator is at odds with the theme of this chapter: nonparametric estimation. It is possible to estimate the variance nonparametrically provided the time series is sufficiently mixing. We discuss several methods.

A commonly used method is the method of batched means. The total set of observations is split into $r$ blocks
$$[X_1,\ldots,X_l],\ [X_{l+1},\ldots,X_{2l}],\ \ldots,\ [X_{(r-1)l+1},\ldots,X_{rl}]$$
of $l$ observations each. (Assume that $n=rl$ for simplicity; drop a last batch of fewer than $l$ observations, for simplicity.) If $Y_1,\ldots,Y_r$ are the sample means of the $r$ blocks, then $\overline Y_r=\overline X_n$ and hence $\operatorname{var}\overline Y_r=\operatorname{var}\overline X_n$. The hope is that we can ignore the dependence between $Y_1,\ldots,Y_r$ and can simply estimate the variance $\operatorname{var}(\sqrt r\,\overline Y_r)=(r/n)\operatorname{var}(\sqrt n\,\overline X_n)$ by the sample variance $S^2_{r,Y}$ of $Y_1,\ldots,Y_r$. If $l$ is large enough and the original series $X_t$ is sufficiently mixing, then this actually works, to some extent.
Presumably, the method of batched means uses disjoint blocks of $X_t$ in order to achieve the approximate independence of the block means $Y_1,\ldots,Y_r$. In general these are still dependent. This does not cause much (additional) bias in the estimator of the variance, but it may have an effect on the precision. It turns out that it is better to use all blocks of $l$ consecutive $X_t$, even though these may be more dependent. Thus in our second method we consider all blocks
$$[X_1,\ldots,X_l],\ [X_2,\ldots,X_{l+1}],\ \ldots,\ [X_{n-l+1},\ldots,X_n]$$
of $l$ consecutive observations. We let $Z_1,Z_2,\ldots,Z_{n-l+1}$ be the sample means of the $n-l+1$ blocks, so that $l\operatorname{var}Z_i=\operatorname{var}(\sqrt l\,\overline X_l)\approx\operatorname{var}(\sqrt n\,\overline X_n)$, if $l$ is large. This suggests to estimate the variance of $\sqrt n\,\overline X_n$ by $lS^2_{n-l+1,Z}$. The following theorem shows that this method works under some conditions, provided that $l$ is chosen dependent on $n$ with $l_n\to\infty$ at a not too fast rate. Because in the theorem $l$ depends on $n$, so do the block means, and we denote them by $Z_{n,1},\ldots,Z_{n,n-l_n+1}$. The theorem considers both the sample variance of the (recentered) block means,
$$\hat S^2_{n,Z}=\frac1{n-l_n+1}\sum_{i=1}^{n-l_n+1}(Z_{n,i}-\overline X_n)^2,$$
and the centered empirical distribution function of the block means,
$$F_n(x)=\frac1{n-l_n+1}\sum_{i=1}^{n-l_n+1}1\Big\{\sqrt{l_n}\,(Z_{n,i}-\overline X_n)\le x\Big\}.$$
The function $F_n$ is the cumulative distribution function of the discrete distribution that puts mass $1/(n-l_n+1)$ at each centered and scaled block mean $\sqrt{l_n}(Z_{n,i}-\overline X_n)$. The second moment of this distribution is $l_n\hat S^2_{n,Z}$. The following theorem shows that the sequence $F_n$ tends weakly to the same limit as the sequence $\sqrt n(\overline X_n-\mu_X)$, and its second moments converge also.
5.3 Theorem. Suppose that the time series $X_t$ is strictly stationary and $\alpha$-mixing with mixing coefficients satisfying $\sum_h\alpha(h)<\infty$. Let $l_n\to\infty$ such that $l_n/n\to0$. Furthermore, suppose that $\sqrt n(\overline X_n-\mu_X)\rightsquigarrow N(0,v)$, for some number $v$. Then, for every $x$, the sequence $F_n(x)$ converges in probability to $\Phi(x/\sqrt v)$. Furthermore, if $v=\sum_h\gamma_X(h)$ and $\sum_h|\gamma_X(h)|<\infty$, then the second moments $l_n\hat S^2_{n,Z}$ of $F_n$ converge in probability to $v$.
Proof. The distribution function $G_n$ obtained by replacing the mean $\overline X_n$ in the definition of $F_n$ by $\mu_X$ relates to $F_n$ through
$$F_n(x)=G_n\Big(x+\sqrt{l_n}\,(\overline X_n-\mu_X)\Big).$$
The sequence $\sqrt{l_n}(\overline X_n-\mu_X)$ converges in probability to zero, by the assumptions that the sequence $\sqrt n(\overline X_n-\mu_X)$ converges weakly and that $l_n/n\to0$. In view of the monotonicity of the functions $F_n$ and $G_n$, the claimed convergence of $F_n(x)$ then follows if $G_n(x)\stackrel P\to\Phi(x/\sqrt v)$ for every $x$. Furthermore, by the triangle inequality the difference between the roots of the second moments of $F_n$ and $G_n$ is bounded by $\sqrt{l_n}\,|\overline X_n-\mu_X|$ and hence tends to zero in probability. Therefore it suffices to prove the claims for $G_n$ instead of $F_n$.

Fix some $x$ and define $Y_t=1\{\sqrt{l_n}(Z_{n,t}-\mu_X)\le x\}$. Then the time series $Y_t$ is strictly stationary and $G_n(x)=\overline Y_{n-l_n+1}$. By assumption
$$E\overline Y_{n-l_n+1}=P\Big(\sqrt{l_n}\,(\overline X_{l_n}-\mu_X)\le x\Big)\to\Phi(x/\sqrt v).$$
Because the variable $Y_t$ depends only on the variables $X_s$ with $t\le s<t+l_n$, the series $Y_t$ is $\alpha$-mixing with mixing coefficients bounded above by $\alpha(h-l_n)$ for $h>l_n$. Therefore, by (4.2) followed by Lemma 4.12 (with $q=r=\infty$),
$$\operatorname{var}\overline Y_{n-l_n+1}\le\frac1{n-l_n+1}\sum_h\big|\gamma_Y(h)\big|\le\frac4{n-l_n+1}\Big(\sum_{h\ge l_n}\alpha(h-l_n)+l_n\tfrac12\Big).$$
This converges to zero as $n\to\infty$. Thus $G_n(x)=\overline Y_{n-l_n+1}\to\Phi(x/\sqrt v)$ in probability by Chebyshev's inequality, and the first assertion of the theorem is proved.

By Theorem 3.8 the second moments of a sequence of distributions $G_n$ with $G_n\rightsquigarrow N(0,v)$ tend to $v$ if $\int_{|x|\ge M}x^2\,dG_n(x)\to0$ as $n\to\infty$ followed by $M\to\infty$. We can apply this to the present random distributions $G_n$, for instance by arguing along almost surely converging subsequences, to conclude that the second moments of the present $G_n$ converge to $v$ if and only if $\int_{|x|\ge M}x^2\,dG_n(x)\stackrel P\to0$, as $n\to\infty$ followed by $M\to\infty$. Now
$$E\int_{|x|\ge M}x^2\,dG_n(x)=E\,\frac1{n-l_n+1}\sum_{i=1}^{n-l_n+1}\big(\sqrt{l_n}(Z_{n,i}-\mu_X)\big)^21\Big\{\sqrt{l_n}\,|Z_{n,i}-\mu_X|\ge M\Big\}=E\big(\sqrt{l_n}(\overline X_{l_n}-\mu_X)\big)^21\Big\{\sqrt{l_n}\,|\overline X_{l_n}-\mu_X|\ge M\Big\}.$$
By assumption $\sqrt{l_n}(\overline X_{l_n}-\mu_X)\rightsquigarrow N(0,v)$, while $E\big(\sqrt{l_n}(\overline X_{l_n}-\mu_X)\big)^2\to v$ by (4.1). Thus we can apply Theorem 3.8 in the other direction to conclude that the right side of the display converges to zero as $n\to\infty$ followed by $M\to\infty$.
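In practice the quantities of Theorem 5.3 are easy to compute. The sketch below forms all overlapping block means $Z_{n,i}$, the variance estimator $l\,\hat S^2_{n,Z}$ of $\operatorname{var}(\sqrt n\,\overline X_n)$, and quantiles of $F_n$; the choice $l$ proportional to $n^{1/3}$ used here is only one possible rule of thumb (see the discussion further on).

import numpy as np

# Sketch of the overlapping-blocks variance estimator of Theorem 5.3:
# block means Z_{n,i}, the statistic l * S^2_{n,Z}, and quantiles of F_n.
def block_means(x, l):
    """Means of all n-l+1 consecutive blocks of length l."""
    c = np.cumsum(np.concatenate(([0.0], x)))
    return (c[l:] - c[:-l]) / l

def block_variance_estimate(x, l):
    """Estimate of var(sqrt(n)*mean): l times the sample variance of the block means."""
    z = block_means(x, l)
    return l * np.mean((z - x.mean()) ** 2)

def Fn_quantile(x, l, prob):
    """Quantile of the empirical distribution of sqrt(l)*(Z_{n,i} - mean)."""
    z = block_means(x, l)
    return np.quantile(np.sqrt(l) * (z - x.mean()), prob)

x = np.random.default_rng(3).standard_normal(1000)
l = int(len(x) ** (1 / 3))               # block length proportional to n^{1/3}
print(block_variance_estimate(x, l), Fn_quantile(x, l, 0.975))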
5.4 EXERCISE. Suppose that we center the block means $Z_{n,i}$ at $\overline Z_{n-l_n+1}$ instead of $\overline X_n$, and consider the empirical distribution of the variables $\sqrt{l_n}(Z_{n,i}-\overline Z_{n-l_n+1})$ and its variance $l_nS^2_{n,Z}$ instead of $F_n$ and $\hat S^2_{n,Z}$. Show that the preceding theorem is valid for these quantities provided $l_n/n^{2/3}\to0$. [The mean $\overline Z_{n-l_n+1}$ of the block means is almost the same as $\overline X_n$. Each $X_t$ with $t\in(l_n,n-l_n+1)$ occurs in exactly $l_n$ block means $Z_{n,i}$, and the $X_t$ with extreme indexes occur in between 1 and $l_n-1$ block means. This shows that $(n-l_n+1)\overline Z_{n-l_n+1}=n\overline X_n-A_n$, where $A_n$ is a weighted sum of the extreme $X_t$, the weights ranging from $(l_n-1)/l_n$ to $1/l_n$. It follows that $E|A_n|\le2l_nE|X_1|$.]
The usefulness of the estimator $F_n$ goes beyond its variance. Because the sequence $F_n$ tends to the same limit distribution as the sequence $\sqrt n(\overline X_n-\mu_X)$, we can think of it as an estimator of the distribution of the latter variable. In particular, we could use the quantiles of $F_n$ as estimators of the quantiles of $\sqrt n(\overline X_n-\mu_X)$ and use these to replace the normal quantiles and $\hat\sigma_n$ in the construction of a confidence interval. This gives the interval
$$\Big(\overline X_n-\frac{F_n^{-1}(0.975)}{\sqrt n},\ \overline X_n-\frac{F_n^{-1}(0.025)}{\sqrt n}\Big).$$
The preceding theorem shows that this interval has asymptotic confidence level 95% for covering $\mu_X$. Indeed, the theorem shows that $P\big(\sqrt n(\overline X_n-\mu_X)\le F_n^{-1}(\eta)\big)\to\eta$, for any $\eta\in(0,1)$, which implies that $\mu_X$ is contained in the confidence interval with probability tending to 95%.
Another, related method is the blockwise bootstrap. Assume that $n=lr$ for simplicity. Given the same blocks $[X_1,\ldots,X_l],[X_2,\ldots,X_{l+1}],\ldots,[X_{n-l+1},\ldots,X_n]$, we choose $r=n/l$ blocks at random with replacement and put the $r$ blocks in a row, in random order, but preserving the order of the $X_t$ within the $r$ blocks. We denote the row of $n=rl$ variables obtained in this way by $X^*_1,X^*_2,\ldots,X^*_n$ and let $\overline X{}^*_n$ be their average. The bootstrap estimator of the distribution of $\sqrt n(\overline X_n-\mu_X)$ is by definition the conditional distribution of $\sqrt n(\overline X{}^*_n-\overline X_n)$ given $X_1,\ldots,X_n$. The corresponding estimator of the variance of $\sqrt n(\overline X_n-\mu_X)$ is the variance of this conditional distribution.

An equivalent description of the bootstrap procedure is to choose a random sample with replacement from the block averages $Z_{n,1},\ldots,Z_{n,n-l+1}$. If this sample is denoted by $Z^*_1,\ldots,Z^*_r$, then the average $\overline X{}^*_n$ is also the average $\overline Z{}^*_r$. It follows that the bootstrap estimator of the variance of $\sqrt n\,\overline X_n$ is $n$ times the conditional variance of the mean of a random sample of size $r$ from the block averages, given the values $Z_{n,1},\ldots,Z_{n,n-l+1}$ of these averages. This is simply $(n/r)S^2_{n-l+1,Z}=lS^2_{n-l+1,Z}$, as before.

Other aspects of the bootstrap estimators of the distribution, for instance quantiles, are hard to calculate explicitly. In practice we perform computer simulation to obtain an approximation of the bootstrap estimator. By repeating the sampling procedure a large number of times (with the same values of $X_1,\ldots,X_n$), and taking the empirical distribution over the realizations, we can, in principle, obtain arbitrary precision.
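The following Python sketch implements the blockwise bootstrap described above (for simplicity it allows $n$ not to be an exact multiple of $l$ by truncating the glued series to length $n$).

import numpy as np

# A minimal sketch of the blockwise bootstrap for sqrt(n)*(mean - mu):
# resample overlapping blocks of length l with replacement and glue them together.
def block_bootstrap(x, l, B=1000, rng=np.random.default_rng(4)):
    n = len(x)
    r = -(-n // l)                                    # enough blocks to cover n
    blocks = np.array([x[i:i + l] for i in range(n - l + 1)])
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n - l + 1, size=r)      # r blocks at random
        xstar = blocks[idx].ravel()[:n]               # bootstrap series of length n
        stats[b] = np.sqrt(n) * (xstar.mean() - x.mean())
    return stats                                      # bootstrap draws

x = np.random.default_rng(5).standard_normal(1000)
draws = block_bootstrap(x, l=10)
print("bootstrap variance estimate:", draws.var())
print("bootstrap 95% quantiles    :", np.quantile(draws, [0.025, 0.975]))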
All three methods discussed previously are based on forming blocks of a certain length $l$. The proper choice of the block length is crucial for their success: the preceding theorem shows that (one of) the estimators will be consistent provided $l_n\to\infty$ such that $l_n/n\to0$. Additional calculations show that, under general conditions, the variances of the variance estimators are minimal if $l_n$ is proportional to $n^{1/3}$.$^\dagger$ Unfortunately, in practice, it is rarely known if these general conditions are satisfied, as an optimal block length must depend on the amount of mixing of the time series. Long blocks necessarily have much overlap and hence are positively dependent, which has a negative impact on the variance of the estimators. On the other hand, short blocks cannot properly reflect the dependence structure of the time series, with longer-range dependence needing longer blocks. An optimal block size must balance these conflicting demands.

$^\dagger$ See Künsch (1989), Annals of Statistics 17, p. 1217-1241.

5.5 EXERCISE. Extend the preceding theorem to the method of batched means. Show that the variance estimator is consistent.
5.2 Auto Covariances

Replacing a given time series $X_t$ by the centered time series $X_t-\mu_X$ does not change the auto-covariance function. Therefore, for the study of the asymptotic properties of the sample auto-covariance function $\hat\gamma_n(h)$, it is not a loss of generality to assume that $\mu_X=0$. The sample auto-covariance function can be written as
$$\hat\gamma_n(h)=\frac1n\sum_{t=1}^{n-h}X_{t+h}X_t-\overline X_n\Big(\frac1n\sum_{t=1}^{n-h}X_t\Big)-\Big(\frac1n\sum_{t=h+1}^nX_t\Big)\overline X_n+\frac{n-h}{n}\,(\overline X_n)^2.$$
Under the conditions of Chapter 4 and the assumption $\mu_X=0$, the sample mean $\overline X_n$ is of the order $O_P(1/\sqrt n)$ and hence the last term on the right is of the order $O_P(1/n)$. For fixed $h$ the second and third term are almost equivalent to $(\overline X_n)^2$ and are also of the order $O_P(1/n)$. Thus, under the assumption that $\mu_X=0$,
$$\hat\gamma_n(h)=\frac1n\sum_{t=1}^{n-h}X_{t+h}X_t+O_P\Big(\frac1n\Big).$$
It follows from this and Slutsky's lemma that the asymptotic behaviour of the sequence $\sqrt n\big(\hat\gamma_n(h)-\gamma_X(h)\big)$ depends only on $n^{-1}\sum_{t=1}^{n-h}X_{t+h}X_t$. Here a change of $n$ by $n-h$ (or $n-h$ by $n$) is asymptotically negligible, so that, for simplicity of notation, we can equivalently study the averages
$$\tilde\gamma_n(h)=\frac1n\sum_{t=1}^nX_{t+h}X_t.$$
These are unbiased estimators of $EX_{t+h}X_t=\gamma_X(h)$, under the condition that $\mu_X=0$. Their asymptotic distribution can be derived by applying a central limit theorem to the averages $\overline Y_n$ of the variables $Y_t=X_{t+h}X_t$.

If the time series $X_t$ is mixing with mixing coefficients $\alpha(k)$, then the time series $Y_t$ is mixing with mixing coefficients bounded above by $\alpha(k-h)$ for $k>h\ge0$. Because the conditions for a central limit theorem depend only on the speed at which the mixing coefficients converge to zero, this means that in most cases the mixing coefficients of the time series $X_t$ and $Y_t$ are equivalent. By the Cauchy-Schwarz inequality the series $Y_t$ has finite moments of order $k$ if the series $X_t$ has finite moments of order $2k$. This means that the mixing central limit theorems for the sample mean apply without further difficulties to proving the asymptotic normality of the sample auto-covariance function. The asymptotic variance takes the form $\sum_g\gamma_Y(g)$ and in general depends on fourth order moments of the type $EX_{t+g+h}X_{t+g}X_{t+h}X_t$ as well as on the auto-covariance function of the series $X_t$. In its generality, its precise form is not of much interest.
5.6 Theorem. If $X_t$ is a strictly stationary, mixing time series with $\alpha$-mixing coefficients such that $\int_0^1\alpha^{-1}(u)F^{-1}_{|X_hX_0|}(1-u)^2\,du<\infty$, then the sequence $\sqrt n\big(\hat\gamma_n(h)-\gamma_X(h)\big)$ converges in distribution to a normal distribution.
Another approach to central limit theorems is special to linear processes, of the form (4.4), for a sequence $\ldots,Z_{-1},Z_0,Z_1,Z_2,\ldots$ of independent and identically distributed variables with $EZ_t=0$, and constants $\psi_j$ satisfying $\sum_j|\psi_j|<\infty$. The sample auto-covariance function of a linear process is also asymptotically normal, but the proof of this requires additional work. This work is worth while mainly because the limit variance takes a simple form in this case.
Under (4.4) with $\mu=0$, the auto-covariance function of the series $Y_t=X_{t+h}X_t$ can be calculated as
$$\gamma_Y(g)=\operatorname{cov}\big(X_{t+g+h}X_{t+g},X_{t+h}X_t\big)=\sum_i\sum_j\sum_k\sum_l\psi_{t-i}\psi_{t+h-j}\psi_{t+g-k}\psi_{t+g+h-l}\operatorname{cov}(Z_iZ_j,Z_kZ_l).$$
Here $\operatorname{cov}(Z_iZ_j,Z_kZ_l)$ is zero whenever one of the indices $i,j,k,l$ occurs only once. For instance $EZ_1Z_2Z_{10}Z_2=EZ_1\,EZ_2^2\,EZ_{10}=0$. It also vanishes if $i=j\neq k=l$. The covariance is nonzero only if all four indices are the same, or if the indices occur in the pairs $i=k\neq j=l$ or $i=l\neq j=k$. Thus the preceding display can be rewritten as
$$\operatorname{cov}(Z_1^2,Z_1^2)\sum_i\psi_{t-i}\psi_{t+h-i}\psi_{t+g-i}\psi_{t+g+h-i}+\operatorname{cov}(Z_1Z_2,Z_1Z_2)\sum_{i\neq j}\psi_{t-i}\psi_{t+h-j}\psi_{t+g-i}\psi_{t+g+h-j}+\operatorname{cov}(Z_1Z_2,Z_2Z_1)\sum_{i\neq j}\psi_{t-i}\psi_{t+h-j}\psi_{t+g-j}\psi_{t+g+h-i}$$
$$=\Big(EZ_1^4-3(EZ_1^2)^2\Big)\sum_i\psi_i\psi_{i+h}\psi_{i+g}\psi_{i+g+h}+\gamma_X(g)^2+\gamma_X(g+h)\gamma_X(g-h).$$
In the last step we use Lemma 1.28(iii) twice, after first adding the diagonal terms $i=j$ to the double sums. Since $\operatorname{cov}(Z_1Z_2,Z_1Z_2)=(EZ_1^2)^2$, these diagonal terms account for 2 of the 3 times the sum in the first term. The variance of $\sqrt n\,\tilde\gamma_n(h)=\sqrt n\,\overline Y_n$ converges to the sum over $g$ of this expression. With $\kappa_4(Z)=EZ_1^4/(EZ_1^2)^2-3$ the fourth cumulant of $Z_t$ divided by its squared variance (equal to the kurtosis minus 3), this sum can be written as
$$V_{h,h}=\kappa_4(Z)\gamma_X(h)^2+\sum_g\gamma_X(g)^2+\sum_g\gamma_X(g+h)\gamma_X(g-h).$$
5.7 Theorem. If (4.4) holds for an i.i.d. sequence $Z_t$ with mean zero and $EZ_t^4<\infty$ and numbers $\psi_j$ with $\sum_j|\psi_j|<\infty$, then $\sqrt n\big(\hat\gamma_n(h)-\gamma_X(h)\big)\rightsquigarrow N(0,V_{h,h})$.
Proof. As explained in the discussion preceding the statement of the theorem, it suffices to show that the sequence $\sqrt n\big(\tilde\gamma_n(h)-\gamma_X(h)\big)$ has the given asymptotic distribution in the case that $\mu=0$. Define $Y_t=X_{t+h}X_t$ and, for fixed $m\in\mathbb N$,
$$Y^m_t=\sum_{|i|\le m}\psi_iZ_{t+h-i}\sum_{|j|\le m}\psi_jZ_{t-j}=:X^m_{t+h}X^m_t.$$
The time series $Y^m_t$ is $(2m+h+1)$-dependent and strictly stationary. By Theorem 4.4 the sequence $\sqrt n(\overline Y{}^m_n-E\overline Y{}^m_n)$ is asymptotically normal with mean zero and variance
$$\sigma^2_m=\sum_g\gamma_{Y^m}(g)=\kappa_4(Z)\gamma_{X^m}(h)^2+\sum_g\gamma_{X^m}(g)^2+\sum_g\gamma_{X^m}(g+h)\gamma_{X^m}(g-h),$$
where the second equality follows from the calculations preceding the theorem. For every $g$, as $m\to\infty$,
$$\gamma_{X^m}(g)=EZ_1^2\sum_{j\colon|j|\le m,\,|j+g|\le m}\psi_j\psi_{j+g}\to EZ_1^2\sum_j\psi_j\psi_{j+g}=\gamma_X(g).$$
Furthermore, the numbers on the left are bounded above by $EZ_1^2\sum_j|\psi_j\psi_{j+g}|$, and
$$\sum_g\Big(\sum_j|\psi_j\psi_{j+g}|\Big)^2=\sum_g\sum_i\sum_k|\psi_i\psi_{i+g}\psi_k\psi_{k+g}|\le\sup_j|\psi_j|\Big(\sum_j|\psi_j|\Big)^3<\infty.$$
Therefore, by the dominated convergence theorem $\sum_g\gamma_{X^m}(g)^2\to\sum_g\gamma_X(g)^2$ as $m\to\infty$. By a similar argument, we obtain the corresponding property for the third term in the expression defining $\sigma^2_m$, whence $\sigma^2_m\to V_{h,h}$ as $m\to\infty$.

We conclude by Lemma 3.10 that there exists a sequence $m_n\to\infty$ such that $\sqrt n(\overline Y{}^{m_n}_n-E\overline Y{}^{m_n}_n)\rightsquigarrow N(0,V_{h,h})$. The proof of the theorem is complete once we also have shown that the difference between the sequences $\sqrt n(\overline Y_n-E\overline Y_n)$ and $\sqrt n(\overline Y{}^{m_n}_n-E\overline Y{}^{m_n}_n)$ converges to zero in probability.

Both sequences are centered at mean zero. In view of Chebyshev's inequality it suffices to show that $n\operatorname{var}(\overline Y_n-\overline Y{}^{m_n}_n)\to0$. We can write
$$Y_t-Y^m_t=X_{t+h}X_t-X^m_{t+h}X^m_t=\sum_i\sum_j\psi^m_{t-i,t+h-j}Z_iZ_j,$$
where $\psi^m_{i,j}=\psi_i\psi_j$ if $|i|>m$ or $|j|>m$ and is 0 otherwise. The variables $\overline Y_n-\overline Y{}^m_n$ are the averages of these double sums and hence $n$ times their variance can be found as
$$\sum_{g=-n}^n\Big(\frac{n-|g|}{n}\Big)\gamma_{Y-Y^m}(g)=\sum_{g=-n}^n\Big(\frac{n-|g|}{n}\Big)\sum_i\sum_j\sum_k\sum_l\psi^m_{t-i,t+h-j}\psi^m_{t+g-k,t+g+h-l}\operatorname{cov}(Z_iZ_j,Z_kZ_l).$$
Most terms in this five-fold sum are zero and by similar arguments as before the whole expression can be bounded in absolute value by
$$\operatorname{cov}(Z_1^2,Z_1^2)\sum_g\sum_i\big|\psi^m_{i,i+h}\psi^m_{i+g,i+g+h}\big|+(EZ_1^2)^2\sum_g\sum_{i,j}\big|\psi^m_{i,j}\psi^m_{i+g,j+g}\big|+(EZ_1^2)^2\sum_g\sum_{i,j}\big|\psi^m_{i,j+h}\psi^m_{j+g,i+g+h}\big|.$$
We have that $\psi^m_{i,j}\to0$ as $m\to\infty$ for every fixed $(i,j)$, $|\psi^m_{i,j}|\le|\psi_i\psi_j|$, and $\sup_i|\psi_i|<\infty$. By the dominated convergence theorem the double and triple sums converge to zero as well.
By similar arguments we can also prove the joint asymptotic normality of the sample auto-covariances for a number of lags $h$ simultaneously. By the Cramér-Wold device a sequence of $k$-dimensional random vectors $X_n$ converges in distribution to a random vector $X$ if and only if $a^TX_n\rightsquigarrow a^TX$ for every $a\in\mathbb R^k$. A linear combination of sample auto-covariances can be written as an average, as before. These averages can be shown to be asymptotically normal by the same methods, with only the notation becoming more complex.
5.8 Theorem. Under the conditions of either Theorem 5.6 or 5.7, for every $h\in\mathbb N$ and some $(h+1)\times(h+1)$-matrix $V$,
$$\sqrt n\left(\begin{pmatrix}\hat\gamma_n(0)\\ \vdots\\ \hat\gamma_n(h)\end{pmatrix}-\begin{pmatrix}\gamma_X(0)\\ \vdots\\ \gamma_X(h)\end{pmatrix}\right)\rightsquigarrow N_{h+1}(0,V).$$
For a linear process $X_t$ the matrix $V$ has $(g,h)$-element
$$V_{g,h}=\kappa_4(Z)\gamma_X(g)\gamma_X(h)+\sum_k\gamma_X(k+g)\gamma_X(k+h)+\sum_k\gamma_X(k-g)\gamma_X(k+h).$$
5.3 Auto Correlations

The asymptotic distribution of the auto-correlations $\hat\rho_n(h)$ can be obtained from the asymptotic distribution of the auto-covariance function by the Delta-method (Theorem 3.15). We can write
$$\hat\rho_n(h)=\frac{\hat\gamma_n(h)}{\hat\gamma_n(0)}=\phi\big(\hat\gamma_n(0),\hat\gamma_n(h)\big),$$
for the function $\phi$ given by $\phi(u,v)=v/u$. This function has gradient $(-v/u^2,1/u)$. By the Delta-method,
$$\sqrt n\big(\hat\rho_n(h)-\rho_X(h)\big)=-\frac{\gamma_X(h)}{\gamma_X(0)^2}\sqrt n\big(\hat\gamma_n(0)-\gamma_X(0)\big)+\frac1{\gamma_X(0)}\sqrt n\big(\hat\gamma_n(h)-\gamma_X(h)\big)+o_P(1).$$
The limit distribution of the right side is the distribution of the random variable $-\gamma_X(h)/\gamma_X(0)^2\,Y_0+1/\gamma_X(0)\,Y_h$, for $Y_0$ and $Y_h$ the first and last coordinate of a random vector $Y$ with the $N_{h+1}(0,V)$-distribution given in Theorem 5.8. The joint limit distribution of a vector of auto-correlations is the joint distribution of the corresponding linear combinations of the $Y_h$. By linearity this is a Gaussian distribution; its mean is zero and its covariance matrix can be expressed in the matrix $V$ by linear algebra.
5.9 Theorem. Under the conditions of either Theorem 5.6 or 5.7, for every $h\in\mathbb N$ and some $h\times h$-matrix $W$,
$$\sqrt n\left(\begin{pmatrix}\hat\rho_n(1)\\ \vdots\\ \hat\rho_n(h)\end{pmatrix}-\begin{pmatrix}\rho_X(1)\\ \vdots\\ \rho_X(h)\end{pmatrix}\right)\rightsquigarrow N_h(0,W).$$
For a linear process $X_t$ the matrix $W$ has $(g,h)$-element
$$W_{g,h}=\sum_k\Big(\rho_X(k+g)\rho_X(k+h)+\rho_X(k-g)\rho_X(k+h)+2\rho_X(g)\rho_X(h)\rho_X(k)^2-2\rho_X(g)\rho_X(k)\rho_X(k+h)-2\rho_X(h)\rho_X(k)\rho_X(k+g)\Big).$$

The expression for the asymptotic covariance matrix $W$ of the auto-correlation coefficients in the case of a linear process is known as Bartlett's formula. An interesting fact is that $W$ depends on the auto-correlation function $\rho_X$ only, although the asymptotic covariance matrix $V$ of the sample auto-covariance coefficients depends also on the second and fourth moments of $Z_1$. We discuss two interesting examples of this formula.
5.10 Example (Iid sequence). For $\psi_0=1$ and $\psi_j=0$ for $j\neq0$, the linear process $X_t$ given by (4.4) is equal to the i.i.d. sequence $\mu+Z_t$. Then $\rho_X(h)=0$ for every $h\neq0$ and the matrix $W$ given by Bartlett's formula can be seen to reduce to the identity matrix. This means that for large $n$ the sample auto-correlations $\hat\rho_n(1),\ldots,\hat\rho_n(h)$ are approximately independent normal variables with mean zero and variance $1/n$.

This can be used to test whether a given sequence of random variables is independent. If the variables are independent and identically distributed, then approximately 95% of the computed auto-correlations should be in the interval $[-1.96/\sqrt n,1.96/\sqrt n]$. This is often verified graphically, from a plot of the auto-correlation function, on which the given interval is indicated by two horizontal lines. Note that, just as we should expect that 95% of the sample auto-correlations are inside the two bands in the plot, we should also expect that 5% of them are not! A more formal test would be to compare the sum of the squared sample auto-correlations to the appropriate chisquare table. The Ljung-Box statistic is defined by
$$\sum_{h=1}^k\frac{n(n+2)}{n-h}\,\hat\rho_n(h)^2.$$
By the preceding theorem, for fixed $k$, this sequence of statistics tends to the $\chi^2$ distribution with $k$ degrees of freedom, as $n\to\infty$. (The coefficients $n(n+2)/(n-h)$ are motivated by a calculation of moments for finite $n$ and are thought to improve the chisquare approximation, but are asymptotically equivalent to $(\sqrt n)^2$.)

The more auto-correlations we use in a procedure of this type, the more information we extract from the data and hence the better the result. However, the tests are based on the asymptotic distribution of the sample auto-correlations and this was derived under the assumption that the lag $h$ is fixed and $n\to\infty$. We should expect that the convergence to normality is slower for sample auto-correlations $\hat\rho_n(h)$ of larger lags $h$, since there are fewer terms in the sums defining them. Thus in practice we should not use sample auto-correlations of lags that are large relative to $n$.
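The diagnostics of this example are straightforward to implement; the sketch below computes the $\pm1.96/\sqrt n$ band and the Ljung-Box statistic with its asymptotic chisquare p-value for an i.i.d. test series.

import numpy as np
from scipy.stats import chi2

# Sketch of the white-noise diagnostics of Example 5.10: the +/- 1.96/sqrt(n)
# band for the sample auto-correlations and the Ljung-Box statistic.
def sample_autocorr(x, h):
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:len(x) - h]) / np.sum(xc ** 2)

def ljung_box(x, k):
    n = len(x)
    rho = np.array([sample_autocorr(x, h) for h in range(1, k + 1)])
    stat = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, k + 1)))
    return stat, chi2.sf(stat, df=k)      # statistic and asymptotic p-value

x = np.random.default_rng(6).standard_normal(250)
print("band half-width:", 1.96 / np.sqrt(len(x)))
print("Ljung-Box (k=20):", ljung_box(x, 20))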
[Figure 5.2. Realization of the sample auto-correlation function of a Gaussian white noise series of length 250. The horizontal bands are drawn at heights $\pm1.96/\sqrt{250}$.]
5.11 Example (Moving average). For a moving average $X_t=Z_t+\theta_1Z_{t-1}+\cdots+\theta_qZ_{t-q}$ of order $q$, the auto-correlations $\rho_X(h)$ of lags $h>q$ vanish. By the preceding theorem the sequence $\sqrt n\,\hat\rho_n(h)$ converges for $h>q$ in distribution to a normal distribution with variance
$$W_{h,h}=\sum_k\rho_X(k)^2=1+2\rho_X(1)^2+\cdots+2\rho_X(q)^2,\qquad h>q.$$
This can be used to test whether a moving average of a given order $q$ is an appropriate model for a given observed time series. A plot of the auto-correlation function shows nonzero auto-correlations for lags $1,\ldots,q$, and zero values for lags $h>q$. In practice we plot the sample auto-correlation function. Just as in the preceding example, we should expect that some sample auto-correlations of lags $h>q$ are significantly different from zero, due to the estimation error. The asymptotic variances $W_{h,h}$ are bigger than 1 and hence we should take the confidence bands a bit wider than the intervals $[-1.96/\sqrt n,1.96/\sqrt n]$ as in the preceding example. A proper interpretation is more complicated, because the sample auto-correlations are not asymptotically independent.
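A sketch of the widened bands for checking an MA($q$) model, with $W_{h,h}$ estimated by plugging the sample auto-correlations of lags $1,\ldots,q$ into the formula above (the simulated MA(2) series is only an illustration, chosen to match Figure 5.3):

import numpy as np

# Sketch of the widened confidence bands of Example 5.11 for testing an MA(q)
# model: for lags h > q the band half-width is 1.96*sqrt(W_hh/n) with
# W_hh = 1 + 2*rho(1)^2 + ... + 2*rho(q)^2, estimated from the data.
def sample_autocorr(x, h):
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:len(x) - h]) / np.sum(xc ** 2)

def ma_band(x, q):
    n = len(x)
    w = 1 + 2 * sum(sample_autocorr(x, h) ** 2 for h in range(1, q + 1))
    return 1.96 * np.sqrt(w / n)

rng = np.random.default_rng(7)
z = rng.standard_normal(253)
x = 0.5 * z[2:] + 0.2 * z[1:-1] + 0.5 * z[:-2]     # an MA(2) as in Figure 5.3
print("plain band   :", 1.96 / np.sqrt(len(x)))
print("widened band :", ma_band(x, q=2))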
[Figure 5.3. Realization (n = 250) of the sample auto-correlation function of the moving average process $X_t=0.5Z_t+0.2Z_{t-1}+0.5Z_{t-2}$ for a Gaussian white noise series $Z_t$. The horizontal bands are drawn at heights $\pm1.96/\sqrt{250}$.]
5.12 EXERCISE. Verify the formula for $W_{h,h}$ in the preceding example.

5.13 EXERCISE. Find $W_{1,1}$ as a function of $\theta$ for the process $X_t=Z_t+\theta Z_{t-1}$.

5.14 EXERCISE. Verify Bartlett's formula.
5.14 EXERCISE. Verify Bartletts formula.
5.4 Partial Auto Correlations
By Lemma 2.33 and the prediction equations the partial auto-correlation
X
(h) is the
solution
h
of the system of equations
_
_
_

X
(0)
X
(1)
X
(h 1)
.
.
.
.
.
.
.
.
.

X
(h 1)
X
(h 2)
X
(0)
_
_
_
_
_
_

1
.
.
.

h
_
_
_ =
_
_
_

X
(1)
.
.
.

X
(h)
_
_
_.
80 5: Nonparametric Estimation of Mean and Covariance
A nonparametric estimator
n
(h) of
X
(h) is obtained by replacing the auto-covariance
function in this linear system by the sample auto-covariance function
n
. This yields
estimators

1
, . . . ,

h
of the prediction coecients satisfying
_
_
_

n
(0)
n
(1)
n
(h 1)
.
.
.
.
.
.
.
.
.

n
(h 1)
n
(h 2)
n
(0)
_
_
_
_
_
_

1
.
.
.

h
_
_
_ =
_
_
_

n
(1)
.
.
.

n
(h)
_
_
_.
We dene a nonparametric estimator for
X
(h) by
n
(h) =

h
. The function h
n
(h)
is called the sample partial auto-covariance function.
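The sample partial auto-correlations can be computed by literally solving the displayed system, one linear system per lag $h$, with the sample auto-covariances in place of $\gamma_X$. A small sketch of my own (helper names are illustrative, not from the notes):

```python
import numpy as np
from scipy.linalg import toeplitz

def sample_acvf(x, max_lag):
    """Sample auto-covariances gamma_n(0), ..., gamma_n(max_lag)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.sum(x[h:] * x[:n - h]) / n for h in range(max_lag + 1)])

def sample_pacf(x, max_lag):
    """alpha_n(h): last coefficient of the solution of the h-th system above."""
    g = sample_acvf(x, max_lag)
    alpha = []
    for h in range(1, max_lag + 1):
        Gamma = toeplitz(g[:h])                  # matrix with entries gamma_n(i-j)
        phi = np.linalg.solve(Gamma, g[1:h + 1])
        alpha.append(phi[-1])                    # alpha_n(h) = phi_h
    return np.array(alpha)

x = np.random.default_rng(2).standard_normal(250)
print(sample_pacf(x, 10))
```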
If we write these two systems of equations as $\Gamma\phi = \gamma$ and $\hat\Gamma\hat\phi = \hat\gamma$, respectively, then we obtain that
$$\hat\phi - \phi = \hat\Gamma^{-1}\hat\gamma - \Gamma^{-1}\gamma = \hat\Gamma^{-1}(\hat\gamma - \gamma) - \hat\Gamma^{-1}(\hat\Gamma - \Gamma)\Gamma^{-1}\gamma.$$
The sequences $\sqrt n(\hat\gamma - \gamma)$ and $\sqrt n(\hat\Gamma - \Gamma)$ are jointly asymptotically normal by Theorem 5.8. With the help of Slutsky's lemma we readily obtain the asymptotic normality of the sequence $\sqrt n(\hat\phi - \phi)$ and hence of the sequence $\sqrt n\bigl(\alpha_n(h) - \alpha_X(h)\bigr)$. The asymptotic covariance matrix appears to be complicated, in general; we shall not derive it.
5.15 Example (Autoregression). For the stationary solution to $X_t = \phi X_{t-1} + Z_t$ and $|\phi| < 1$, the partial auto-correlations of lags $h\ge 2$ vanish, by Example 2.34. In Section 11.1.2 it is shown that the sequence $\sqrt n\,\alpha_n(h)$ is asymptotically standard normally distributed, for every $h\ge 2$.

This result extends to the causal solution of the $p$th order auto-regressive scheme $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$ and the partial auto-correlations of lags $h > p$. (The meaning of "causal" is explained in Chapter 8.) This property can be used to find an appropriate order $p$ when fitting an auto-regressive model to a given time series. The order is chosen such that most of the sample partial auto-correlations of lags bigger than $p$ are within the band $[-1.96/\sqrt n, 1.96/\sqrt n]$. A proper interpretation of "most" requires that the dependence of the $\alpha_n(h)$ is taken into consideration.
Figure 5.4. Realization (n = 250) of the sample partial auto-correlation function of the stationary solution to $X_t = 0.5X_{t-1} + 0.2X_{t-2} + Z_t$ for a Gaussian white noise series.
6 Spectral Theory

Let $X_t$ be a stationary, possibly complex, time series with auto-covariance function $\gamma_X$. If the series $\sum_h|\gamma_X(h)|$ is convergent, then the series
$$(6.1)\qquad f_X(\lambda) = \frac{1}{2\pi}\sum_{h\in\mathbb Z}\gamma_X(h)e^{-ih\lambda},$$
is absolutely convergent, uniformly in $\lambda\in\mathbb R$. This function is called the spectral density of the time series $X_t$. Because it is periodic with period $2\pi$ it suffices to consider it on an interval of length $2\pi$, which we shall take to be $(-\pi,\pi]$. In the present context the values $\lambda$ in this interval are often referred to as frequencies, for reasons that will become clear. By the uniform convergence, we can exchange the order of integral and sum when computing $\int_{-\pi}^{\pi}e^{ih\lambda}f_X(\lambda)\,d\lambda$ and we find that, for every $h\in\mathbb Z$,
$$(6.2)\qquad \gamma_X(h) = \int_{-\pi}^{\pi}e^{ih\lambda}f_X(\lambda)\,d\lambda.$$
Thus the spectral density $f_X$ determines the auto-covariance function, just as the auto-covariance function determines the spectral density.
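Both (6.1) and (6.2) can be checked numerically for a series with absolutely summable auto-covariances, by truncating the sum and discretizing the integral. The sketch below is my own illustration (not from the notes) for the auto-covariances of a moving average $X_t = Z_t + 0.5Z_{t-1}$ with unit noise variance.

```python
import numpy as np

def spectral_density(gamma, lam):
    """f_X(lam) = (2 pi)^{-1} sum_h gamma(h) exp(-i h lam); gamma[h] given for h = 0..H, real series."""
    H = len(gamma) - 1
    h = np.arange(-H, H + 1)
    g = np.concatenate([gamma[:0:-1], gamma])     # gamma(-H), ..., gamma(H), symmetric extension
    return np.real(np.sum(g[:, None] * np.exp(-1j * np.outer(h, lam)), axis=0)) / (2 * np.pi)

gamma = np.array([1.25, 0.5, 0.0])                # gamma(0), gamma(1), gamma(2) of X_t = Z_t + 0.5 Z_{t-1}
lam = np.linspace(-np.pi, np.pi, 2001)
f = spectral_density(gamma, lam)

# inversion formula (6.2) via the trapezoidal rule: should return roughly 1.25, 0.5, 0.0
for h in range(3):
    print(h, np.trapz(np.exp(1j * h * lam) * f, lam).real)
```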
6.1 EXERCISE. Verify the orthonormality of the trigonometric basis: $\int_{-\pi}^{\pi}e^{ih\lambda}\,d\lambda = 0$ for integers $h\ne 0$ and $\int_{-\pi}^{\pi}e^{ih\lambda}\,d\lambda = 2\pi$ for $h = 0$. Use this to prove the inversion formula (6.2), under the condition that $\sum_h|\gamma_X(h)| < \infty$.
In mathematical analysis the series $f_X$ is called a Fourier series and the numbers $\gamma_X(h)$ are called the Fourier coefficients of $f_X$. (The factor $1/(2\pi)$ is sometimes omitted or replaced by another number, and the Fourier series is often defined as $f_X(-\lambda)$ rather than $f_X(\lambda)$, but this is inessential.) A central result in Fourier analysis is that the series in (6.1) converges in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$, and (6.2) is true, if and only if $\sum_h|\gamma_X(h)|^2 < \infty$. (See Lemma 6.3 below.) The latter condition is weaker than the absolute summability of the autocovariances. Furthermore, the converse of this statement is also true: if complex numbers $\gamma_X(h)$ are defined by (6.2) from a given function $f_X\in L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$, then this function can be recovered from (6.1), where the convergence is in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$. This involves the completeness of the trigonometric basis, consisting of the functions $\lambda\mapsto(2\pi)^{-1/2}e^{ih\lambda}$, for $h\in\mathbb Z$.
6.1 Spectral Measures
The requirement that the series $\sum_h|\gamma_X(h)|^2$ is convergent means roughly that $\gamma_X(h)\to 0$ as $h\to\pm\infty$ at a sufficiently fast rate. In statistical terms it means that variables $X_t$ that are widely separated in time must be approximately uncorrelated. This is not true for every time series, and consequently not every time series possesses a spectral density. However, every stationary time series does have a spectral measure, by the following theorem.

6.2 Theorem (Herglotz). For every stationary time series $X_t$ there exists a unique finite measure $F_X$ on $(-\pi,\pi]$ such that
$$\gamma_X(h) = \int_{(-\pi,\pi]}e^{ih\lambda}\,dF_X(\lambda), \qquad h\in\mathbb Z.$$
Proof. Define $F_n$ as the measure on $[-\pi,\pi]$ with Lebesgue density equal to
$$f_n(\lambda) = \frac{1}{2\pi}\sum_{h=-n}^{n}\gamma_X(h)\Bigl(1 - \frac{|h|}{n}\Bigr)e^{-ih\lambda}.$$
It is not immediately clear that this is a real-valued, nonnegative function, but this follows from the fact that
$$0\le\frac{1}{2\pi n}\mathop{\rm var}\Bigl(\sum_{t=1}^{n}X_t e^{-it\lambda}\Bigr) = \frac{1}{2\pi n}\sum_{s=1}^{n}\sum_{t=1}^{n}\mathop{\rm cov}(X_s,X_t)e^{i(t-s)\lambda} = f_n(\lambda).$$
It is clear from the definition of $f_n$ that the numbers $\gamma_X(h)\bigl(1 - |h|/n\bigr)$ are the Fourier coefficients of $f_n$ for $|h|\le n$ (and the remaining Fourier coefficients of $f_n$ are zero). Thus, by the inversion formula,
$$\gamma_X(h)\Bigl(1 - \frac{|h|}{n}\Bigr) = \int_{-\pi}^{\pi}e^{ih\lambda}f_n(\lambda)\,d\lambda = \int_{-\pi}^{\pi}e^{ih\lambda}\,dF_n(\lambda), \qquad |h|\le n.$$
Setting $h = 0$ in this equation, we see that $F_n[-\pi,\pi] = \gamma_X(0)$ for every $n$. Thus, apart from multiplication by the constant $\gamma_X(0)$, the $F_n$ are probability distributions. Because the interval $[-\pi,\pi]$ is compact, the sequence $F_n$ is uniformly tight. By Prohorov's theorem there exists a subsequence $F_{n'}$ that converges weakly to a distribution $F$ on $[-\pi,\pi]$. Because $\lambda\mapsto e^{ih\lambda}$ is a continuous function, it follows by the portmanteau lemma that
$$\int_{[-\pi,\pi]}e^{ih\lambda}\,dF(\lambda) = \lim_{n'\to\infty}\int_{[-\pi,\pi]}e^{ih\lambda}\,dF_{n'}(\lambda) = \gamma_X(h),$$
by the preceding display. If $F$ puts a positive mass at $-\pi$, we can move this to the point $\pi$ without affecting this identity, since $e^{-ih\pi} = e^{ih\pi}$ for every $h\in\mathbb Z$. The resulting $F$ satisfies the requirements for $F_X$.

That this $F$ is unique can be proved using the fact that the linear span of the functions $\lambda\mapsto e^{ih\lambda}$ is uniformly dense in the set of continuous, periodic functions (see Rudin, Theorem 4.25; in fact, by Fejér's theorem, the Cesàro sums of the Fourier series of a continuous, periodic function converge uniformly). Using this fact, we see that equality $\int h\,dF = \int h\,dG$, for two measures $F$ and $G$ and every function $h$ of the form $\lambda\mapsto e^{ih\lambda}$, implies equality for any continuous, periodic function $h$. Using the monotone and/or dominated convergence theorems, this can be extended to any bounded measurable function. This implies that $F = G$.
The measure $F_X$ is called the spectral measure of the time series $X_t$. If the spectral measure $F_X$ admits a density $f_X$ relative to the Lebesgue measure, then the latter is called the spectral density. A sufficient condition for this is the square summability of the autocovariances. Then the spectral density is the Fourier series (6.1), with Fourier coefficients $\gamma_X(h)$.
6.3 Lemma. If $\sum_h|\gamma_X(h)|^2 < \infty$, then the series (6.1) converges in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$, and defines a density of $F_X$ relative to Lebesgue measure.

Proof. The orthonormality of the trigonometric basis shows that, for any $m\le n$,
$$\int_{-\pi}^{\pi}\Bigl|\sum_{m\le|h|\le n}\gamma_X(h)e^{-ih\lambda}\Bigr|^2\,d\lambda = \sum_{m\le|g|\le n}\sum_{m\le|h|\le n}\gamma_X(h)\overline{\gamma_X(g)}\int_{-\pi}^{\pi}e^{-i(h-g)\lambda}\,d\lambda = 2\pi\sum_{m\le|h|\le n}\bigl|\gamma_X(h)\bigr|^2.$$
Thus the convergence of the series $\sum_h|\gamma_X(h)|^2$ is equivalent to the convergence of the series (6.1) in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$.

Next the continuity of the inner product in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$ allows exchanging the order of integration and summation in $\int_{-\pi}^{\pi}e^{ig\lambda}\sum_h\gamma_X(h)e^{-ih\lambda}\,d\lambda$. Then (6.2) follows from the orthonormality of the trigonometric basis.
6.4 EXERCISE. Show that the spectral density of a real-valued time series with $\sum_h|\gamma_X(h)| < \infty$ is symmetric about zero.

* 6.5 EXERCISE. Show that the spectral measure of a real-valued time series is symmetric about zero, apart from a possible point mass at $\pi$. [Hint: Use the uniqueness of a spectral measure.]
6.6 Example (White noise). The covariance function of a white noise sequence $X_t$ is 0 for $h\ne 0$. Thus the Fourier series defining the spectral density has only one term and reduces to
$$f_X(\lambda) = \frac{1}{2\pi}\gamma_X(0).$$
The spectral measure is the uniform measure with total mass $\gamma_X(0)$. Hence a white noise series contains all possible frequencies in an equal amount.
6.7 Example (Deterministic trigonometric series). Let $X_t = A\cos(\lambda t) + B\sin(\lambda t)$ for mean-zero, uncorrelated variables $A$ and $B$ of variance $\sigma^2$, and $\lambda\in(0,\pi)$. By Example 1.5 the covariance function is given by
$$\gamma_X(h) = \sigma^2\cos(h\lambda) = \frac{\sigma^2}{2}\bigl(e^{ih\lambda} + e^{-ih\lambda}\bigr).$$
It follows that the spectral measure $F_X$ is the discrete 2-point measure with $F_X\{\lambda\} = F_X\{-\lambda\} = \sigma^2/2$.

Because the time series is real, the point mass at $-\lambda$ does not really count: because the spectral measure of a real time series is symmetric, the point $-\lambda$ must be there because $\lambda$ is there. The form of the spectral measure and the fact that the time series in this example is a trigonometric series of frequency $\lambda$, are good motivation for referring to the values $\lambda$ as frequencies.
6.8 EXERCISE.
(i) Show that the spectral measure of the sum $X_t + Y_t$ of two uncorrelated time series is the sum of the spectral measures of $X_t$ and $Y_t$.
(ii) Construct a time series with spectral measure equal to a symmetric discrete measure on the points $\pm\lambda_1,\pm\lambda_2,\ldots,\pm\lambda_k$ with $0 < \lambda_1 < \cdots < \lambda_k < \pi$.
(iii) Construct a time series with spectral measure the 1-point measure with $F_X\{0\} = \sigma^2$.
(iv) Same question, but now with $F_X\{\pi\} = \sigma^2$.

* 6.9 EXERCISE. Show that every finite measure on $(-\pi,\pi]$ is the spectral measure of some stationary time series.
The spectrum of a time series is an important theoretical concept, but it is also an important practical tool to gain insight in periodicities in the data. Inference using the spectrum is called spectral analysis or analysis in the "frequency domain", as opposed to ordinary analysis, which is in the "time domain". It depends on the subject field whether a spectral analysis is useful. If there are periodicities present in a time series, then a spectral analysis will reveal these. However, in general we should not have too great expectations of the insight offered by the spectrum. For instance, the interpretation of the spectrum of economic time series may be complicated, or even unclear, due to the fact that all possible frequencies are present to some nontrivial extent.

The idea of a spectral analysis is to view the consecutive values $\ldots,X_{-1},X_0,X_1,X_2,\ldots$ of a time series as a random function $X:\mathbb Z\times\Omega\to\mathbb R$, and to write this as a weighted sum (or integral) of trigonometric functions $t\mapsto\cos\lambda t$ or $t\mapsto\sin\lambda t$ of different frequencies $\lambda$. In simple cases finitely many frequencies suffice, whereas in other situations all frequencies $\lambda\in(-\pi,\pi]$ are needed to give a full description, and the "weighted sum" becomes an integral. Two extreme examples are provided by a deterministic trigonometric series (which incorporates a single frequency) and a white noise series (which has all frequencies in equal amounts). The spectral measure gives the weights of the different frequencies in the sum.

We shall derive the spectral decomposition, the theoretical basis for this interpretation, in Section 6.4. Another method to gain insight in the interpretation of a spectrum is to consider its transformation under filtering. The term "filtering" stems from the field of signal processing, where a filter takes the form of a physical device that filters out certain frequencies from a given electric current. For us, a (linear) filter will remain an infinite moving average as defined in Chapter 1. For a given filter with filter coefficients $\psi_j$ the function $\psi(\lambda) = \sum_j\psi_j e^{-ij\lambda}$ is called the transfer function of the filter.
6.10 Theorem. Let $X_t$ be a stationary time series with spectral measure $F_X$ and let $\sum_j|\psi_j| < \infty$. Then $Y_t = \sum_j\psi_j X_{t-j}$ has spectral measure $F_Y$ given by, with $\psi$ the transfer function,
$$dF_Y(\lambda) = \bigl|\psi(\lambda)\bigr|^2\,dF_X(\lambda).$$

Proof. According to Lemma 1.28(iii) (if necessary extended to complex-valued filters), the series $Y_t$ is stationary with auto-covariance function
$$\gamma_Y(h) = \sum_k\sum_l\psi_k\overline{\psi_l}\,\gamma_X(h-k+l) = \sum_k\sum_l\psi_k\overline{\psi_l}\int e^{i(h-k+l)\lambda}\,dF_X(\lambda).$$
By the dominated convergence theorem we are allowed to change the order of (double) summation and integration. Next we can rewrite the right side as
$$\int\bigl|\psi(\lambda)\bigr|^2 e^{ih\lambda}\,dF_X(\lambda).$$
This proves the theorem, in view of Theorem 6.2 and the uniqueness of the spectral measure.
6.11 Example (Moving average). A white noise process $Z_t$ has a constant spectral density $\sigma^2/(2\pi)$. By the preceding theorem the moving average $X_t = Z_t + \theta Z_{t-1}$ has spectral density
$$f_X(\lambda) = \bigl|1 + \theta e^{-i\lambda}\bigr|^2\frac{\sigma^2}{2\pi} = (1 + 2\theta\cos\lambda + \theta^2)\frac{\sigma^2}{2\pi}.$$
If $\theta > 0$, then the small frequencies dominate, whereas the bigger frequencies are more important if $\theta < 0$. This suggests that the sample paths of this time series will be more wiggly if $\theta < 0$. However, in both cases all frequencies are present in the signal.
Figure 6.1. Spectral density of the moving average $X_t = Z_t + 0.5Z_{t-1}$. (Vertical scale in decibels.)
6.12 Example. The complex-valued process $X_t = Ae^{i\lambda t}$ for a mean zero variable $A$ and $\lambda\in(-\pi,\pi]$ has covariance function
$$\gamma_X(h) = \mathop{\rm cov}\bigl(Ae^{i\lambda(t+h)}, Ae^{i\lambda t}\bigr) = e^{i\lambda h}\,\mathrm E|A|^2.$$
The corresponding spectral measure is the 1-point measure $F_X$ with $F_X\{\lambda\} = \mathrm E|A|^2$. Therefore, the filtered series $Y_t = \sum_j\psi_j X_{t-j}$ has spectral measure the 1-point measure with $F_Y\{\lambda\} = \bigl|\psi(\lambda)\bigr|^2\mathrm E|A|^2$. By direct calculation we find that
$$Y_t = \sum_j\psi_j Ae^{i\lambda(t-j)} = Ae^{i\lambda t}\psi(\lambda) = \psi(\lambda)X_t.$$
This suggests an interpretation for the term "transfer function". Filtering a "pure signal" $Ae^{i\lambda t}$ of a single frequency apparently yields another signal of the same single frequency, but the amplitude of the signal changes by multiplication with the factor $\psi(\lambda)$. If $\psi(\lambda) = 0$, then the frequency $\lambda$ is not transmitted, whereas values of $|\psi(\lambda)|$ bigger or smaller than 1 mean that the frequency $\lambda$ is amplified or weakened.
6.13 EXERCISE. Find the spectral measure of $X_t = Ae^{i\lambda t}$ for $\lambda$ not necessarily belonging to $(-\pi,\pi]$.
To give a further interpretation to the spectral measure consider a band pass filter. This is a filter with transfer function of the form
$$\psi(\lambda) = \begin{cases}0, & \text{if } |\lambda-\lambda_0| > \delta,\\ 1, & \text{if } |\lambda-\lambda_0|\le\delta,\end{cases}$$
for a fixed frequency $\lambda_0$ and fixed band width $2\delta$. According to Example 6.12 this filter "kills" all the signals $Ae^{i\lambda t}$ of frequencies $\lambda$ outside the interval $[\lambda_0-\delta,\lambda_0+\delta]$ and transmits all signals $Ae^{i\lambda t}$ for $\lambda$ inside this range unchanged. The spectral density of the filtered signal $Y_t = \sum_j\psi_j X_{t-j}$ relates to the spectral density of the original signal $X_t$ (if there exists one) as
$$f_Y(\lambda) = \bigl|\psi(\lambda)\bigr|^2 f_X(\lambda) = \begin{cases}0, & \text{if } |\lambda-\lambda_0| > \delta,\\ f_X(\lambda), & \text{if } |\lambda-\lambda_0|\le\delta.\end{cases}$$
Now think of $X_t$ as a signal composed of many frequencies. The band pass filter transmits only the subsignals of frequencies in the interval $[\lambda_0-\delta,\lambda_0+\delta]$. This explains that the spectral density of the filtered sequence $Y_t$ vanishes outside this interval. For small $\delta > 0$,
$$\mathop{\rm var}Y_t = \gamma_Y(0) = \int_{-\pi}^{\pi}f_Y(\lambda)\,d\lambda = \int_{\lambda_0-\delta}^{\lambda_0+\delta}f_X(\lambda)\,d\lambda\approx 2\delta f_X(\lambda_0).$$
We interpret this as saying that $f_X(\lambda_0)$ is proportional to the variance of the "subsignals" in $X_t$ of frequency $\lambda_0$. The total variance $\mathop{\rm var}X_t = \gamma_X(0) = \int_{-\pi}^{\pi}f_X(\lambda)\,d\lambda$ in the signal $X_t$ is the total area under the spectral density. This can be viewed as the sum of the variances of the subsignals of frequencies $\lambda$, the area under $f_X$ between $\lambda_0-\delta$ and $\lambda_0+\delta$ being the variance of the subsignals of frequencies in this interval.
A band pass filter is a theoretical filter: in practice it is not possible to filter out an exact range of frequencies. Electronic devices will not filter frequencies close to the borders of the band correctly. On a computer only finite filters can be implemented (the ones with only finitely many nonzero filter coefficients $\psi_j$). These have transfer functions that are a finite linear combination of sines and cosines, and hence are smooth and cannot jump, in contrast to the indicator function corresponding to a band pass filter.

The filter coefficients $\psi_j$ relate to the transfer function $\psi(\lambda)$ in the same way as the auto-covariances $\gamma_X(h)$ relate to the spectral density $f_X(\lambda)$, apart from a factor $2\pi$. Thus, to find the filter coefficients of a given transfer function $\psi$, it suffices to apply the Fourier inversion formula
$$\psi_j = \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{ij\lambda}\psi(\lambda)\,d\lambda.$$
6.14 EXERCISE. Show that the filter coefficients of a band pass filter with $(\lambda_0-\delta,\lambda_0+\delta)\subset(0,\pi)$ are given by $\psi_j = e^{i\lambda_0 j}\sin(\delta j)/(\pi j)$.
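As a numerical illustration of the preceding inversion formula and exercise (my own sketch, not from the notes), one can truncate the band pass coefficients at $|j|\le J$ and evaluate the resulting finite transfer function; the truncation smooths the jumps of the ideal indicator (the Gibbs phenomenon).

```python
import numpy as np

lam0, delta, J = 1.0, 0.2, 100        # centre frequency, half band width, truncation point

j = np.arange(-J, J + 1)
# psi_j = exp(i*lam0*j) * sin(delta*j) / (pi*j), with the j = 0 coefficient equal to delta/pi
psi = np.where(j == 0, delta / np.pi,
               np.exp(1j * lam0 * j) * np.sin(delta * j) / (np.pi * np.where(j == 0, 1, j)))

lam = np.linspace(-np.pi, np.pi, 1001)
transfer = np.array([np.sum(psi * np.exp(-1j * j * l)) for l in lam])

ideal = (np.abs(lam - lam0) <= delta).astype(float)
print(np.max(np.abs(np.abs(transfer) - ideal)))   # largest deviation, concentrated near the jumps
```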
6.15 Example (Low frequency and trend). An apparent trend in observed data $X_1,\ldots,X_n$ could be modelled as a real trend in a nonstationary time series, but could alternatively be viewed as the beginning of a long cycle. In practice, where we get to see only a finite stretch of a time series, low frequency cycles and slowly moving trends cannot be discriminated. It was seen in Chapter 1 that differencing $Y_t = X_t - X_{t-1}$ of a time series $X_t$ removes a linear trend, and repeated differencing removes higher order polynomial trends. In view of the preceding observation the differencing filter should remove, to a certain extent, low frequencies.

The differencing filter has transfer function
$$\psi(\lambda) = 1 - e^{-i\lambda} = 2ie^{-i\lambda/2}\sin\frac{\lambda}{2}.$$
The absolute value $|\psi(\lambda)|$ of this transfer function increases from 0 at 0 to its maximum value at $\pi$. Thus, indeed, it filters away low frequencies, albeit only with partial success.
Figure 6.2. Absolute value of the transfer function of the difference filter.
6.16 Example (Averaging). The averaging filter $Y_t = (2M+1)^{-1}\sum_{j=-M}^{M}X_{t-j}$ has transfer function
$$\psi(\lambda) = \frac{1}{2M+1}\sum_{j=-M}^{M}e^{-ij\lambda} = \frac{\sin\bigl((M+\frac12)\lambda\bigr)}{(2M+1)\sin(\frac12\lambda)}.$$
(The expression on the right is defined by continuity, as 1, at $\lambda = 0$.) This function is proportional to the Dirichlet kernel, which is the function obtained by replacing the factor $2M+1$ by $2\pi$. From a picture of this kernel we conclude that averaging removes high frequencies to a certain extent (and in an uneven manner depending on $M$), but retains low frequencies.
Figure 6.3. Dirichlet kernel of order M = 10.
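The closed-form transfer function of the averaging filter is easy to check against the finite sum that defines it. A small sketch of my own (not from the notes):

```python
import numpy as np

def averaging_transfer(lam, M):
    """psi(lambda) = sin((M+1/2)*lambda) / ((2M+1)*sin(lambda/2)), equal to 1 at lambda = 0."""
    lam = np.asarray(lam, dtype=float)
    num = np.sin((M + 0.5) * lam)
    den = (2 * M + 1) * np.sin(lam / 2)
    return np.where(np.abs(lam) < 1e-12, 1.0, num / np.where(den == 0, 1, den))

M = 10
lam = np.linspace(1e-6, np.pi, 5)
# direct evaluation of the finite sum (2M+1)^{-1} sum_{|j|<=M} exp(-i*j*lambda) as a check
direct = np.array([np.sum(np.exp(-1j * np.arange(-M, M + 1) * l)) / (2 * M + 1) for l in lam])
print(np.max(np.abs(averaging_transfer(lam, M) - direct.real)))   # agreement up to rounding
```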
6.17 EXERCISE. Express the variance of $Y_t$ in the preceding example in $\psi$ and the spectral density of the time series $X_t$ (assuming that there is one). What happens if $M\to\infty$? Which conclusion can you draw? Does this remain true if the series $X_t$ does not have a spectral density? [Verify that $|\psi(\lambda)|\le 1$ for every $\lambda$ and $M$, and that $\psi(\lambda)\to 0$ as $M\to\infty$ for every $\lambda\ne 0$.]

6.18 EXERCISE. Find the transfer function of the filter $Y_t = X_t - X_{t-12}$. Interpret the result.
Instead of in terms of frequencies we can also think in periods. A series of the form $t\mapsto e^{i\lambda t}$ repeats itself after $2\pi/\lambda$ instants of time. Therefore, the period is defined as
$$\text{period} = \frac{2\pi}{\text{frequency}}.$$
Most monthly time series (one observation per month) have a period effect of 12 months. If so, this will be visible as a peak in the spectrum at the frequency $2\pi/12 = \pi/6$.† Often the 12-month cycle is not completely regular. This may produce additional (but smaller) peaks at the harmonic frequencies $2\pi/6, 3\pi/6,\ldots$, or $\pi/12, \pi/18,\ldots$.

† That this is a complicated number is an inconvenient consequence of our convention to define the spectrum on the interval $(-\pi,\pi]$. This can be repaired. For instance, the Splus package produces spectral plots with the frequencies rescaled to the interval $(-\frac12,\frac12]$. Then a 12-month period gives a peak at $1/12$.
It is surprising at first that the highest possible frequency is $\pi$, the so-called Nyquist frequency. This is caused by the fact that the series is measured only at discrete time points. Very rapid fluctuations fall completely between the measurement times and hence cannot be observed. The Nyquist frequency corresponds to a period of $2\pi/\pi = 2$ time instants, and this is clearly the smallest period that is observable. The next section sheds more light on this.
* 6.2 Aliasing
In many situations in practice a time series arises by measuring an underlying continuous
time process, an operation called sampling. The choice of the time instants at which
measurements are taken is up to us, and we have to choose a sampling frequency. The
more we sample the better, of course, but practical considerations, such as computer
storage of the data, will impose restrictions. In this section we study the effect of sampling
on the spectrum of a time series.
Consider a time series $(X_t: t\in\mathbb Z)$ obtained by sampling a continuous time stochastic process $(Y_t: t\in\mathbb R)$ at times $\ldots,-\delta,0,\delta,2\delta,\ldots$ for a sampling interval $\delta > 0$: $X_t = Y_{t\delta}$. The continuous time process $Y_t$ is said to be second order stationary if its expectation is constant and the autocovariances $\mathop{\rm cov}(Y_{t+h},Y_t)$ do not depend on $t$, so that its (continuous time) autocovariance function can be defined by, for $h\in\mathbb R$,
$$\gamma_Y(h) = \mathop{\rm cov}(Y_{t+h},Y_t).$$
By Bochner's theorem, provided that it is continuous, such a covariance function can always be represented by a finite Borel measure $F_Y$ on $\mathbb R$ in the form
$$\gamma_Y(h) = \int_{-\infty}^{\infty}e^{ih\lambda}\,dF_Y(\lambda), \qquad h\in\mathbb R.$$
This is similar to Herglotz' theorem, the important difference being that the representation uses all frequencies in $\mathbb R$, not just an interval of length $2\pi$, and is valid for the continuum of lags $h\in\mathbb R$.‡
6.19 Lemma. The "skeleton" $(X_t: t\in\mathbb Z) = (Y_{t\delta}: t\in\mathbb Z)$ of a continuous time stationary process $(Y_t: t\in\mathbb R)$ with spectral measure $F_Y$ possesses spectral measure $F_X$ given by
$$F_X(B) = \sum_{k\in\mathbb Z}F_Y\Bigl(\frac{B + k2\pi}{\delta}\Bigr).$$

‡ For a proof of Bochner's theorem, along the lines of the proof of Herglotz' theorem, see pages 275-277 of Gnedenko (1962).
Proof. The covariance function of the time series $X_t$ can be written $\gamma_X(h) = \gamma_Y(h\delta) = \int e^{ih\delta\lambda}\,dF_Y(\lambda)$. The function $\lambda\mapsto e^{ih\delta\lambda}$ is periodic with period $2\pi/\delta$, for every $h\in\mathbb Z$. It is therefore identical on the intervals $\ldots,(-3\pi/\delta,-\pi/\delta], (-\pi/\delta,\pi/\delta], (\pi/\delta,3\pi/\delta],\ldots$ and the integral over $\mathbb R$ of the function relative to $F_Y$ can be written as the integral of the function over the interval $(-\pi/\delta,\pi/\delta]$ relative to the infinite sum of the translates of the restrictions of the measure $F_Y$ to the intervals. Finally we rescale $\lambda$ to $\lambda\delta$.
The preceding lemma shows that the spectral weights of the continuous time process $Y_t$ at a frequency $\lambda$ not in the interval $(-\pi/\delta,\pi/\delta]$ contribute to the spectral weights of the sampled process at the frequency $\lambda$ modulo $2\pi/\delta$. In particular, if $\delta = 1$ the frequencies bigger than the Nyquist frequency, which may be present in the signal $Y_t$, contribute to the spectrum of the sampled process at frequencies modulo $2\pi$. This effect, which is referred to as aliasing, is often considered undesirable, because it is impossible to separate the contributions of the frequencies that are really present (below the Nyquist frequency) from the aliased frequencies. Aliasing typically leads to an overall higher spectral density, which is equivalent to a bigger noise content of the signal.

For this reason, if it is possible, a signal should be low-pass filtered before sampling it. With continuous time electronic signals this filtering step is often achieved at the measuring time by passing the signal through physical, electronic devices, before sampling or digitizing it. If for some reason an already discretized signal must be down-sampled, then it can also be useful to low-pass filter the signal first, which can be achieved by just applying a linear filter with the appropriate transfer function.
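Aliasing is easy to see in a small simulation (my own sketch, not from the notes): a pure cosine of frequency above the Nyquist frequency of the sampled grid coincides, after sampling, with a cosine at the aliased frequency.

```python
import numpy as np

delta = 2.0                      # sampling interval
lam = 2.5                        # frequency of the continuous-time signal, above pi/delta
alias = (lam * delta + np.pi) % (2 * np.pi) - np.pi   # lam*delta folded into [-pi, pi)

t = np.arange(50)                # sample at times 0, delta, 2*delta, ...
sampled = np.cos(lam * delta * t)
folded = np.cos(alias * t)
print(alias, np.max(np.abs(sampled - folded)))        # identical up to rounding error
```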
* 6.3 Nonsummable Filters

For coefficients $\psi_j$ with $\sum_j|\psi_j| < \infty$, the filtered series $\sum_j\psi_j X_{t-j}$ is well defined for any stationary time series $X_t$. Unfortunately, not all filters have summable coefficients. An embarrassing example is the band pass filter considered previously, for which the filter coefficients tend to zero at the speed $|\psi_j|\sim 1/j$.

In fact, if a sequence of filter coefficients is summable, then the series $\sum_j\psi_j e^{-ij\lambda}$ converges uniformly, and hence the corresponding transfer function must be continuous. The transfer function $\psi(\lambda) = 1_{[\lambda_0-\delta,\lambda_0+\delta]}(\lambda)$ of the band pass filter has two jumps. In such a case, the series $\sum_j\psi_j e^{-ij\lambda}$ may still be well defined, for instance as a limit in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$, and define a valid transfer function. To handle such examples it is worthwhile to generalize Theorem 6.10 (and Lemma 1.28).
6.20 Theorem. Let $X_t$ be a stationary time series with spectral measure $F_X$, defined on the probability space $(\Omega,\mathcal U,\mathrm P)$. Then the series $\psi(\lambda) = \sum_j\psi_j e^{-ij\lambda}$ converges in $L_2(F_X)$ if and only if $Y_t = \sum_j\psi_j X_{t-j}$ converges in $L_2(\Omega,\mathcal U,\mathrm P)$ for some $t$ (and then for every $t\in\mathbb Z$) and in that case
$$dF_Y(\lambda) = \bigl|\psi(\lambda)\bigr|^2\,dF_X(\lambda).$$

Proof. For $0\le m\le n$ let $\psi_j^{m,n}$ be equal to $\psi_j$ for $m\le|j|\le n$ and be 0 otherwise, and define $Y_t^{m,n}$ as the series $X_t$ filtered by the coefficients $\psi_j^{m,n}$. Then certainly $\sum_j|\psi_j^{m,n}| < \infty$ for every fixed pair $(m,n)$ and hence we can apply Lemma 1.28 and Theorem 6.10 to the series $Y_t^{m,n}$. This yields
$$\mathrm E\Bigl|\sum_{m\le|j|\le n}\psi_j X_{t-j}\Bigr|^2 = \mathrm E|Y_t^{m,n}|^2 = \gamma_{Y^{m,n}}(0) = \int_{(-\pi,\pi]}dF_{Y^{m,n}} = \int_{(-\pi,\pi]}\Bigl|\sum_{m\le|j|\le n}\psi_j e^{-ij\lambda}\Bigr|^2\,dF_X(\lambda).$$
The left side converges to zero for $m,n\to\infty$ if and only if the partial sums of the series $Y_t = \sum_j\psi_j X_{t-j}$ form a Cauchy sequence in $L_2(\Omega,\mathcal U,\mathrm P)$. The right side converges to zero if and only if the partial sums of the sequence $\sum_j\psi_j e^{-ij\lambda}$ form a Cauchy sequence in $L_2(F_X)$. The first assertion of the theorem now follows, because both spaces are complete.

To prove the second assertion, we first note that, by Theorem 6.10,
$$\mathop{\rm cov}\Bigl(\sum_{|j|\le n}\psi_j X_{t+h-j},\sum_{|j|\le n}\psi_j X_{t-j}\Bigr) = \gamma_{Y^{0,n}}(h) = \int_{(-\pi,\pi]}\Bigl|\sum_{|j|\le n}\psi_j e^{-ij\lambda}\Bigr|^2 e^{ih\lambda}\,dF_X(\lambda).$$
We now take limits of the left and right sides as $n\to\infty$ to find that
$$\gamma_Y(h) = \int_{(-\pi,\pi]}\bigl|\psi(\lambda)\bigr|^2 e^{ih\lambda}\,dF_X(\lambda), \qquad\text{for every } h.$$
6.21 Example. If the filter coefficients satisfy $\sum_j|\psi_j|^2 < \infty$ (which is weaker than absolute convergence), then the series $\sum_j\psi_j e^{-ij\lambda}$ converges in $L_2\bigl((-\pi,\pi],\mathcal B,\lambda\bigr)$. Consequently, the series also converges in $L_2(F_X)$ for every spectral measure $F_X$ that possesses a bounded density.

Thus, in many cases a sequence of square-summable coefficients defines a valid filter. A particular example is a band pass filter, for which $|\psi_j| = O(1/|j|)$ as $j\to\pm\infty$.
* 6.4 Spectral Decomposition
In the preceding section we interpreted the mass $F_X(I)$ that the spectral distribution gives to an interval $I$ as the size of the contribution of the components of frequencies $\lambda\in I$ to the signal $t\mapsto X_t$. In this section we give a precise mathematical meaning to this idea. We show that a given stationary time series $X_t$ can be written as a randomly weighted sum of single frequency signals $e^{i\lambda t}$.

This decomposition is simple in the case of a discrete spectral measure. For given uncorrelated, zero-mean random variables $Z_1,\ldots,Z_k$ and numbers $\lambda_1,\ldots,\lambda_k\in(-\pi,\pi]$ the process
$$X_t = \sum_{j=1}^{k}Z_j e^{i\lambda_j t}$$
possesses as spectral measure $F_X$ the discrete measure with point masses of sizes $F_X\{\lambda_j\} = \mathrm E|Z_j|^2$ at the frequencies $\lambda_1,\ldots,\lambda_k$ (and no other mass). The series $X_t$ is the sum of uncorrelated, single-frequency signals of stochastic "amplitudes" $|Z_j|$. This is called the spectral decomposition of the series $X_t$. We prove below that this construction can be reversed: given a mean zero, stationary time series $X_t$ with discrete spectral measure as given, there exist mean zero uncorrelated (complex-valued) random variables $Z_1,\ldots,Z_k$ with variances $F_X\{\lambda_j\}$ such that the decomposition is valid.
This justifies the interpretation of the spectrum given in the preceding section. The possibility of the decomposition is surprising in that the spectral measure only involves the auto-covariance function of a time series, whereas the spectral decomposition is a decomposition of the sample paths of the time series: if the series $X_t$ is defined on a given probability space $(\Omega,\mathcal U,\mathrm P)$, then so are the random variables $Z_j$ and the preceding spectral decomposition may be understood as being valid for (almost) every $\omega\in\Omega$. This can be true, of course, only if the variables $Z_1,\ldots,Z_k$ also have other properties besides the ones described. The spectral theorem below does not give any information about these further properties. For instance, even though uncorrelated, the $Z_j$ need not be independent. This restricts the usefulness of the spectral decomposition, but we could not expect more. The spectrum only involves the second moment properties of the time series, and thus leaves most of the distribution of the series undescribed. An important exception to this rule is if the series $X_t$ is Gaussian. Then the first and second moments, and hence the mean and the spectral distribution, completely describe the distribution of the series $X_t$.
The spectral decomposition is not restricted to time series with discrete spectral measures. However, in general, the spectral decomposition involves a continuum of frequencies and the sum becomes an integral
$$X_t = \int_{(-\pi,\pi]}e^{i\lambda t}\,dZ(\lambda).$$
A technical complication is that such an integral, relative to a random measure $Z$, is not defined in ordinary measure theory. We must first give it a meaning.
6.22 Definition. A random measure with orthogonal increments $Z$ is a collection $\{Z(B): B\in\mathcal B\}$ of mean zero, complex random variables $Z(B)$ indexed by the Borel sets $B$ in $(-\pi,\pi]$ defined on some probability space $(\Omega,\mathcal U,\mathrm P)$ such that, for some finite Borel measure $\mu$ on $(-\pi,\pi]$,
$$\mathop{\rm cov}\bigl(Z(B_1),Z(B_2)\bigr) = \mu(B_1\cap B_2), \qquad\text{every } B_1,B_2\in\mathcal B.$$

This definition does not appear to include a basic requirement of a measure: that the measure of a countable union of disjoint sets is the sum of the measures of the individual sets. However, this is implied by the covariance property, which says that the linear extension of the map $1_B\mapsto Z(B)$ is an isometry from $L_2(\mu)$ into $L_2(\Omega,\mathcal U,\mathrm P)$.

6.23 EXERCISE. Let $Z$ be a random measure with orthogonal increments. Show that $Z(\cup_j B_j) = \sum_j Z(B_j)$ in mean square, whenever $B_1,B_2,\ldots$ is a sequence of pairwise disjoint Borel sets.

6.24 EXERCISE. Let $Z$ be a random measure with orthogonal increments and define $Z_\lambda = Z(-\pi,\lambda]$. Show that $(Z_\lambda:\lambda\in(-\pi,\pi])$ is a stochastic process with uncorrelated increments: for $\lambda_1 < \lambda_2\le\lambda_3 < \lambda_4$ the variables $Z_{\lambda_4} - Z_{\lambda_3}$ and $Z_{\lambda_2} - Z_{\lambda_1}$ are uncorrelated. This explains the phrase "with orthogonal increments".

* 6.25 EXERCISE. Suppose that $Z_\lambda$ is a mean zero stochastic process with finite second moments and uncorrelated increments. Show that this process corresponds to a random measure with orthogonal increments as in the preceding exercise. [This asks you to reconstruct the random measure $Z$ from the weights $Z_\lambda = Z(-\pi,\lambda]$ it gives to cells, similarly as an ordinary measure can be reconstructed from its distribution function. The construction below will immediately give us the complete random measure as in Definition 6.22. So it is not really helpful to go through the somewhat tedious details of this construction, although many authors start from "distribution functions" rather than measures.]
Next we define an integral $\int f\,dZ$ for given functions $f:(-\pi,\pi]\to\mathbb C$. For an indicator function $f = 1_B$ of a Borel set $B$, we define, in analogy with an ordinary integral, $\int 1_B\,dZ = Z(B)$. Because we wish the integral to be linear, we are led to the definition
$$\int\sum_j\alpha_j 1_{B_j}\,dZ = \sum_j\alpha_j Z(B_j),$$
for every finite collection of complex numbers $\alpha_j$ and Borel sets $B_j$. This determines the integral for many, but not all functions $f$. We extend its domain by continuity: we require that $\int f_n\,dZ\to\int f\,dZ$ whenever $f_n\to f$ in $L_2(\mu)$. The following lemma shows that these definitions and requirements can be consistently made, and serves as a definition of $\int f\,dZ$.
6.26 Lemma. For every random measure with orthogonal increments $Z$ there exists a unique map, denoted $f\mapsto\int f\,dZ$, from $L_2(\mu)$ into $L_2(\Omega,\mathcal U,\mathrm P)$ with the properties
(i) $\int 1_B\,dZ = Z(B)$;
(ii) $\int(\alpha f + \beta g)\,dZ = \alpha\int f\,dZ + \beta\int g\,dZ$;
(iii) $\mathrm E\bigl|\int f\,dZ\bigr|^2 = \int|f|^2\,d\mu$.
In other words, the map $f\mapsto\int f\,dZ$ is a linear isometry such that $1_B\mapsto Z(B)$.

Proof. By the defining property of $Z$, for any complex numbers $\alpha_i$ and Borel sets $B_i$,
$$\mathrm E\Bigl|\sum_i\alpha_i Z(B_i)\Bigr|^2 = \sum_i\sum_j\alpha_i\overline{\alpha_j}\mathop{\rm cov}\bigl(Z(B_i),Z(B_j)\bigr) = \int\Bigl|\sum_i\alpha_i 1_{B_i}\Bigr|^2\,d\mu.$$
For $f$ a simple function of the form $f = \sum_i\alpha_i 1_{B_i}$, we define $\int f\,dZ$ as $\sum_i\alpha_i Z(B_i)$. This is well defined, for, if $f$ also has the representation $f = \sum_j\beta_j 1_{D_j}$, then $\sum_i\alpha_i Z(B_i) = \sum_j\beta_j Z(D_j)$ almost surely. This follows by applying the preceding identity to $\sum_i\alpha_i Z(B_i) - \sum_j\beta_j Z(D_j)$.

The integral $\int f\,dZ$ that is now defined on the domain of all simple functions $f$ trivially satisfies (i) and (ii), while (iii) is exactly the identity in the preceding display. The proof is complete upon showing that the map $f\mapsto\int f\,dZ$ can be extended from the domain of simple functions to the domain $L_2(\mu)$, meanwhile retaining the properties (i)-(iii).

We extend the map by continuity. For every $f\in L_2(\mu)$ there exists a sequence of simple functions $f_n$ such that $\int|f_n - f|^2\,d\mu\to 0$. We define $\int f\,dZ$ as the limit of the sequence $\int f_n\,dZ$. This is well defined. First, the limit exists, because, by the linearity of the integral and the identity,
$$\mathrm E\Bigl|\int f_n\,dZ - \int f_m\,dZ\Bigr|^2 = \int|f_n - f_m|^2\,d\mu,$$
since $f_n - f_m$ is a simple function. Because $f_n$ is a Cauchy sequence in $L_2(\mu)$, the right side converges to zero as $m,n\to\infty$. We conclude that $\int f_n\,dZ$ is a Cauchy sequence in $L_2(\Omega,\mathcal U,\mathrm P)$ and hence it has a limit by the completeness of this space. Second, the definition of $\int f\,dZ$ does not depend on the particular sequence $f_n\to f$ we use. This follows, because given another sequence of simple functions $g_n\to f$, we have $\int|f_n - g_n|^2\,d\mu\to 0$ and hence $\mathrm E\bigl|\int f_n\,dZ - \int g_n\,dZ\bigr|^2\to 0$.

We conclude the proof by noting that the properties (i)-(iii) are retained under taking limits.
6.27 Definition (Stochastic integral). The stochastic integral $\int f\,dZ$ of a function $f\in L_2(\mu)$ relative to a random measure with orthogonal increments $Z$ as in Definition 6.22 is given by the map defined in Lemma 6.26.

Warning. This integral is constructed in a similar way as the Itô integral in stochastic analysis, but it is not the same. The process $Z_\lambda$ in the decomposition $Z(\lambda_0,\lambda] = Z_\lambda - Z_{\lambda_0}$ of a random measure with orthogonal increments is not necessarily a semimartingale. On the other hand, the integral $\int f\,dZ$ is defined here only for deterministic integrands.
6.28 EXERCISE. Show that a linear isometry $\Phi:H_1\to H_2$ between two Hilbert spaces $H_1$ and $H_2$ retains inner products, i.e. $\langle\Phi(f_1),\Phi(f_2)\rangle_2 = \langle f_1,f_2\rangle_1$. Conclude that
$$\mathop{\rm cov}\Bigl(\int f\,dZ,\int g\,dZ\Bigr) = \int f\bar g\,d\mu.$$
We are now ready to derive the spectral decomposition for a general stationary time series $X_t$. Let $L_2(X_t: t\in\mathbb Z)$ be the closed, linear span of the elements of the time series in $L_2(\Omega,\mathcal U,\mathrm P)$ (i.e. the closure of the linear span of the set $\{X_t: t\in\mathbb Z\}$).

6.29 Theorem. For any mean zero stationary time series $X_t$ with spectral distribution $F_X$ there exists a random measure $Z$ with orthogonal increments relative to the measure $F_X$ such that $\{Z(B): B\in\mathcal B\}\subset L_2(X_t: t\in\mathbb Z)$ and such that $X_t = \int e^{i\lambda t}\,dZ(\lambda)$ almost surely for every $t\in\mathbb Z$.
Proof. By the definition of the spectral measure $F_X$ we have, for every finite collection of complex numbers $\alpha_j$ and integers $t_j$,
$$\mathrm E\Bigl|\sum_j\alpha_j X_{t_j}\Bigr|^2 = \sum_i\sum_j\alpha_i\overline{\alpha_j}\,\gamma_X(t_i - t_j) = \int\Bigl|\sum_j\alpha_j e^{it_j\lambda}\Bigr|^2\,dF_X(\lambda).$$
Now define a map $\Phi: L_2(F_X)\to L_2(X_t: t\in\mathbb Z)$ as follows. For $f$ of the form $f = \sum_j\alpha_j e^{it_j\lambda}$ define $\Phi(f) = \sum_j\alpha_j X_{t_j}$. By the preceding identity this is well defined. (Check!) Furthermore, $\Phi$ is a linear isometry. By the same arguments as in the preceding lemma, it can be extended to a linear isometry on the closure of the space of all functions $\sum_j\alpha_j e^{it_j\lambda}$. By Fejér's theorem from Fourier theory, this closure contains at least all Lipschitz periodic functions. By measure theory this collection is dense in $L_2(F_X)$. Thus the closure is all of $L_2(F_X)$. In particular, it contains all indicator functions $1_B$ of Borel sets $B$. Define $Z(B) = \Phi(1_B)$. Because $\Phi$ is a linear isometry, it retains inner products and hence
$$\mathop{\rm cov}\bigl(Z(B_1),Z(B_2)\bigr) = \bigl\langle\Phi(1_{B_1}),\Phi(1_{B_2})\bigr\rangle = \int 1_{B_1}1_{B_2}\,dF_X.$$
This shows that $Z$ is a random measure with orthogonal increments. By definition
$$\int\sum_j\alpha_j 1_{B_j}\,dZ = \sum_j\alpha_j Z(B_j) = \sum_j\alpha_j\Phi(1_{B_j}) = \Phi\Bigl(\sum_j\alpha_j 1_{B_j}\Bigr).$$
Thus $\int f\,dZ = \Phi(f)$ for every simple function $f$. Both sides of this identity are linear isometries when seen as functions of $f\in L_2(F_X)$. Hence the identity extends to all $f\in L_2(F_X)$. In particular, we obtain $\int e^{i\lambda t}\,dZ(\lambda) = \Phi(e^{i\lambda t}) = X_t$ on choosing $f(\lambda) = e^{i\lambda t}$.
Thus we have managed to give a precise mathematical formulation to the spectral decomposition
$$X_t = \int_{(-\pi,\pi]}e^{i\lambda t}\,dZ(\lambda)$$
of a mean zero stationary time series $X_t$. The definition may seem a bit involved. (It may even seem that giving a meaning to the representation was harder than proving its validity.) An insightful interpretation is obtained by approximation through Riemann sums. Given a partition $-\pi = \lambda_{0,k} < \lambda_{1,k} < \cdots < \lambda_{k,k} = \pi$ and a fixed time $t\in\mathbb Z$, consider the function $f_k(\lambda)$ that is piecewise constant, and takes the value $e^{i\lambda_{j,k}t}$ on the interval $(\lambda_{j-1,k},\lambda_{j,k}]$. If the partitions are chosen such that the mesh width of the partitions converges to zero as $k\to\infty$, then $\bigl|f_k(\lambda) - e^{i\lambda t}\bigr|$ converges to zero, uniformly in $\lambda\in(-\pi,\pi]$, by the uniform continuity of the function $\lambda\mapsto e^{i\lambda t}$, and hence $f_k(\lambda)\to e^{i\lambda t}$ in $L_2(F_X)$. Because the stochastic integral $f\mapsto\int f\,dZ$ is linear, we have $\int f_k\,dZ = \sum_j e^{i\lambda_{j,k}t}Z(\lambda_{j-1,k},\lambda_{j,k}]$ and because it is an isometry, we find
$$\mathrm E\Bigl|X_t - \sum_{j=1}^{k}e^{i\lambda_{j,k}t}Z(\lambda_{j-1,k},\lambda_{j,k}]\Bigr|^2 = \int_{-\pi}^{\pi}\bigl|e^{i\lambda t} - f_k(\lambda)\bigr|^2\,dF_X(\lambda)\to 0.$$
Because the intervals $(\lambda_{j-1,k},\lambda_{j,k}]$ are pairwise disjoint, the random variables $Z_j := Z(\lambda_{j-1,k},\lambda_{j,k}]$ are uncorrelated, by the defining property of an orthogonal random measure. Thus the time series $X_t$ can be approximated by a time series of the form $\sum_j Z_j e^{i\lambda_j t}$, as in the introduction of this section. The spectral measure $F_X(\lambda_{j-1,k},\lambda_{j,k}]$ of the interval $(\lambda_{j-1,k},\lambda_{j,k}]$ is the variance of the random weight $Z_j$ in this decomposition.
6.30 Example. If the spectral measure $F_X$ is discrete with support points $\lambda_1,\ldots,\lambda_k$, then the integral on the right in the preceding display (with $\lambda_{j,k} = \lambda_j$) is identically zero. In that case $X_t = \sum_j Z_j e^{i\lambda_j t}$ almost surely for every $t$, where $Z_j = Z\{\lambda_j\}$. (Note that $Z(\lambda_{j-1},\lambda_j) = 0$.)
6.31 Example. If the time series $X_t$ is Gaussian, then all variables in the linear span of the $X_t$ are normally distributed (possibly degenerate) and hence all variables in $L_2(X_t: t\in\mathbb Z)$ are normally distributed. In that case the variables $Z(B)$ obtained from the random measure $Z$ of the spectral decomposition of $X_t$ are jointly normally distributed. The zero correlation of two variables $Z(B_1)$ and $Z(B_2)$ for disjoint sets $B_1$ and $B_2$ now implies independence of these variables.
Theorem 6.10 shows how a spectral measure changes under filtering. There is a corresponding result for the spectral decomposition.

6.32 Theorem. Let $X_t$ be a mean zero, stationary time series with spectral measure $F_X$ and associated random measure $Z_X$, defined on some probability space $(\Omega,\mathcal U,\mathrm P)$. If $\psi(\lambda) = \sum_j\psi_j e^{-ij\lambda}$ converges in $L_2(F_X)$, then $Y_t = \sum_j\psi_j X_{t-j}$ converges in $L_2(\Omega,\mathcal U,\mathrm P)$ and has spectral measure $F_Y$ and associated random measure $Z_Y$ such that, for every $f\in L_2(F_Y)$,
$$\int f\,dZ_Y = \int f\psi\,dZ_X.$$

Proof. The series $Y_t$ converges by Theorem 6.20, and the spectral measure $F_Y$ has density $\bigl|\psi(\lambda)\bigr|^2$ relative to $F_X$. By definition,
$$\int e^{i\lambda t}\,dZ_Y(\lambda) = Y_t = \sum_j\psi_j\int e^{i\lambda(t-j)}\,dZ_X(\lambda) = \int e^{i\lambda t}\psi(\lambda)\,dZ_X(\lambda),$$
where in the last step changing the order of integration and summation is justified by the convergence of the series $\sum_j\psi_j e^{i\lambda(t-j)}$ in $L_2(F_X)$ and the continuity of the stochastic integral $f\mapsto\int f\,dZ_X$. We conclude that the identity of the theorem is satisfied for every $f$ of the form $f(\lambda) = e^{i\lambda t}$. Both sides of the identity are linear in $f$ and isometries on the domain $f\in L_2(F_Y)$. Because the linear span of the functions $\lambda\mapsto e^{i\lambda t}$ for $t\in\mathbb Z$ is dense in $L_2(F_Y)$, the identity extends to all of $L_2(F_Y)$.
The spectral decomposition $X_t = \int e^{i\lambda t}\,dZ(\lambda)$ is valid for both real and complex time series. Because the representation is in terms of the complex functions $e^{i\lambda t}$, the spectral random measure $Z$ of a real time series is typically complex valued. The following lemma shows that it is complex-conjugate. Abbreviate $\int f\,dZ(\lambda)$ to $Z(f)$, and given a function $f:(-\pi,\pi]\to\mathbb C$ let $f^-$ be the reflection (of its periodic extension): $f^-(\lambda) = f(-\lambda)$ for $\lambda\in(-\pi,\pi)$ and $f^-(\pi) = f(\pi)$.
6.33 Lemma. The spectral random measure of a real time series $X_t$ satisfies $\overline{Z(f)} = Z(\bar f^-)$ almost surely, for every $f\in L_2(F_X)$. In particular $\overline{Z(B)} = Z(-B)$ almost surely, for every Borel set $B\subset(-\pi,\pi)$. Furthermore, for any $f,g\in L_2(F_X)$,
$$\mathrm E\,\mathrm{Re}\,Z(f)\,\mathrm{Re}\,Z(g) = \tfrac14\int(f+\bar f^-)\,\overline{(g+\bar g^-)}\,dF_X,$$
$$\mathrm E\,\mathrm{Re}\,Z(f)\,\mathrm{Im}\,Z(g) = \tfrac14\int(f+\bar f^-)\,\overline{\bigl((g-\bar g^-)/i\bigr)}\,dF_X,$$
$$\mathrm E\,\mathrm{Im}\,Z(f)\,\mathrm{Im}\,Z(g) = \tfrac14\int(f-\bar f^-)\,\overline{(g-\bar g^-)}\,dF_X.$$
In particular, if $f$ and $g$ are real and vanish outside $(0,\pi)$, then $\mathrm E\,\mathrm{Re}\,Z(f)\,\mathrm{Im}\,Z(g) = 0$ and $\mathrm E\,\mathrm{Re}\,Z(f)\,\mathrm{Re}\,Z(g) = \mathrm E\,\mathrm{Im}\,Z(f)\,\mathrm{Im}\,Z(g)$.
Proof. If the Fourier series $\sum_j f_j e^{ij\lambda}$, where $f_j = (2\pi)^{-1}\int e^{-ij\lambda}f(\lambda)\,d\lambda$, converges in $L_2(F_X)$ to the function $f$, then by the continuity of the random spectral measure
$$Z(f) = \sum_j f_j Z(e^{ij\cdot}) = \sum_j f_j X_j,$$
where the convergence of the infinite series is in $L_2(X_t: t\in\mathbb Z)$. Because the Fourier coefficients of $\bar f^-$ are equal to $\bar f_j$, it then follows that
$$\overline{Z(f)} = \sum_j\bar f_j X_j = Z(\bar f^-).$$
This proves the conjugate symmetry of $Z$ for functions $f$ of this type. The convergence (and hence the conjugate symmetry) is certainly true for every periodic, Lipschitz function $f$, for which the convergence of the Fourier series is uniform. A general $f$ can be approximated in $L_2(F_X)$ by a sequence $f_n$ of Lipschitz, periodic functions, and then the sequence $\bar f_n^-$ tends in $L_2(F_X)$ to $\bar f^-$, in view of the symmetry of $F_X$. Thus the conjugate symmetry extends to any such function.

The expressions for the second moments of the real and imaginary parts follow upon writing $\mathrm{Re}\,Z(f) = \frac12\bigl(Z(f) + \overline{Z(f)}\bigr) = \frac12 Z(f+\bar f^-)$ and $\mathrm{Im}\,Z(f) = \frac{1}{2i}\bigl(Z(f) - \overline{Z(f)}\bigr) = \frac{1}{2i}Z(f-\bar f^-)$. In the particular case that $f$ and $g$ are real and vanish outside $(0,\pi)$, the function $f+\bar f^-$ is symmetric, the function $g-\bar g^-$ is anti-symmetric, and $(f+\bar f^-)\overline{(g+\bar g^-)} = (f-\bar f^-)\overline{(g-\bar g^-)}$. Together with the symmetry of $F_X$ this yields the last assertion.
The last assertion of the preceding lemma shows that the random vectors $\bigl(\mathrm{Re}\,Z(f_1),\ldots,\mathrm{Re}\,Z(f_k)\bigr)$ and $\bigl(\mathrm{Im}\,Z(f_1),\ldots,\mathrm{Im}\,Z(f_k)\bigr)$ are uncorrelated and have equal covariance matrix, for any real functions $f_1,\ldots,f_k$ in $L_2(F_X)$ that vanish outside $(0,\pi)$. This implies that the vector of combined real and imaginary parts of the complex random vector $\bigl(Z(f_1),\ldots,Z(f_k)\bigr)$ possesses a covariance matrix of the form (1.2), for functions $f_1,\ldots,f_k$ of the special type considered, and then also for general functions that vanish at $0$ and $\pi$, by the linearity of $Z$ in $f$. In particular, if the time series $X_t$ is real and Gaussian, then the complex random vector $\bigl(Z(f_1),\ldots,Z(f_k)\bigr)$ possesses a complex normal distribution $N(0,\Sigma)$ for $\Sigma$ the matrix with $(i,j)$th element $\int f_i\bar f_j\,dF_X$.
6.34 EXERCISE. Show that for a real time series $Z\{0\}$ and $Z\{\pi\}$ are real. Conclude that these variables are complex normally distributed only if they are 0. [One proof can be based on the law of large numbers in Section 7.1, which exhibits $1_{\{0\}}$ as a limit of real linear combinations of functions $\lambda\mapsto e^{i\lambda t}$; the same linear combinations of the functions $\lambda\mapsto e^{it(\lambda-\pi)} = (-1)^t e^{it\lambda}$ tend to $1_{\{\pi\}}$.]
6.35 EXERCISE (Inversion formula). Show that, for any continuity points $0\le a < b$ of $F_X$,
$$\frac{1}{2\pi}\sum_{t=-n}^{n}X_t\int_a^b e^{-it\lambda}\,d\lambda\to Z(a,b], \qquad\text{in } L_2.$$
[Note that $(2\pi)^{-1}\int_a^b e^{-it\lambda}\,d\lambda$ are the Fourier coefficients of the function $1_{(a,b]}$.]
* 6.5 Multivariate Spectra
If spectral analysis of univariate time series is hard, spectral analysis of multivariate time series is an art. It concerns not only "frequencies present in a single signal", but also "dependencies between signals at given frequencies".

This difficulty concerns the interpretation only: the mathematical theory does not pose new challenges. The covariance function $\gamma_X$ of a vector-valued time series $X_t$ is matrix-valued. If the series $\sum_{h\in\mathbb Z}\bigl\|\gamma_X(h)\bigr\|$ is convergent, then the spectral density of the series $X_t$ can be defined by exactly the same formula as before:
$$f_X(\lambda) = \frac{1}{2\pi}\sum_{h\in\mathbb Z}\gamma_X(h)e^{-ih\lambda}.$$
The summation is now understood to be entry-wise, and hence $\lambda\mapsto f_X(\lambda)$ maps the interval $(-\pi,\pi]$ into the set of $(d\times d)$-matrices, for $d$ the dimension of the series $X_t$. Because the covariance function of the univariate series $\alpha^T X_t$ is given by $\gamma_{\alpha^T X} = \alpha^T\gamma_X\bar\alpha$, it follows that, for every $\alpha\in\mathbb C^d$,
$$\alpha^T f_X(\lambda)\bar\alpha = f_{\alpha^T X}(\lambda).$$
In particular, the matrix $f_X(\lambda)$ is nonnegative-definite, for every $\lambda$. From the identity $\gamma_X(h)^T = \overline{\gamma_X(-h)}$ it can also be ascertained that it is Hermitian: $f_X(\lambda)^T = \overline{f_X(\lambda)}$. The diagonal elements are nonnegative, but the off-diagonal elements of $f_X(\lambda)$ are complex valued, in general. If the time series $X_t$ is real, then the spectral density is also conjugate symmetric: $f_X(-\lambda) = \overline{f_X(\lambda)}$, for every $\lambda$.

Warning. The spectral density is Hermitian. The autocovariance function is not.

As in the case of univariate time series, not every vector-valued time series possesses a spectral density, but every such series does possess a spectral distribution. This "distribution" is a matrix-valued, complex measure. A complex Borel measure on $(-\pi,\pi]$ is a map $B\mapsto F(B)$ on the Borel sets that can be written as $F = F_1 - F_2 + i(F_3 - F_4)$ for finite Borel measures $F_1, F_2, F_3, F_4$. If the complex part $F_3 - F_4$ is identically zero, then $F$ is a signed measure. The spectral measure $F_X$ of a $d$-dimensional time series $X_t$ is a $(d\times d)$ matrix whose $d^2$ entries are complex Borel measures on $(-\pi,\pi]$. The diagonal elements are precisely the spectral measures of the coordinate time series and hence are ordinary measures, but the off-diagonal measures are typically signed or complex measures. The measure $F_X$ is Hermitian in the sense that $F_X(B)^T = \overline{F_X(B)}$ for every Borel set $B$.
6.36 Theorem (Herglotz). For every stationary vector-valued time series $X_t$ there exists a unique Hermitian-matrix-valued complex measure $F_X$ on $(-\pi,\pi]$ such that
$$\gamma_X(h) = \int_{(-\pi,\pi]}e^{ih\lambda}\,dF_X(\lambda), \qquad h\in\mathbb Z.$$

Proof. For every $\alpha\in\mathbb C^d$ the time series $\alpha^T X_t$ is univariate and possesses a spectral measure $F_{\alpha^T X}$. By Theorem 6.2, for every $h\in\mathbb Z$,
$$\alpha^T\gamma_X(h)\bar\alpha = \gamma_{\alpha^T X}(h) = \int_{(-\pi,\pi]}e^{ih\lambda}\,dF_{\alpha^T X}(\lambda).$$
We can express any entry of the matrix $\gamma_X(h)$ as a linear combination of the quadratic form on the left side, evaluated for different vectors $\alpha$. One possibility is to write, with $e_k$ the $k$th unit vector in $\mathbb C^d$,
$$2\gamma_X(h)_{k,l} = (e_k+e_l)^T\gamma_X(h)(e_k+e_l) + i(e_k+ie_l)^T\gamma_X(h)(e_k-ie_l) - (1+i)\bigl(e_k^T\gamma_X(h)e_k + e_l^T\gamma_X(h)e_l\bigr).$$
By expressing the right-hand side in the spectral matrices $F_{\alpha^T X}$, by using the first display, we obtain a representation $\gamma_X(h)_{k,l} = \int e^{ih\lambda}\,dF_{k,l}(\lambda)$, for a certain complex-valued measure $F_{k,l}$, for every $(k,l)$. Then the matrix-valued complex measure $F = (F_{k,l})$ satisfies $\gamma_X(h) = \int_{(-\pi,\pi]}e^{ih\lambda}\,dF(\lambda)$ and hence, for every $h\in\mathbb Z$,
$$\int_{(-\pi,\pi]}e^{ih\lambda}\,d\overline F^T(\lambda) = \overline{\gamma_X(-h)}^T = \gamma_X(h).$$
Thus the matrix-valued complex measure $\frac12(F + \overline F^T)$, which is Hermitian, also represents the autocovariance function.

If $F$ is any Hermitian-matrix-valued complex measure with the representing property, then $\alpha^T F\bar\alpha$ must be the spectral measure of the time series $\alpha^T X_t$ and hence is uniquely determined. This determines $\alpha^T F(B)\bar\alpha$ for every $\alpha\in\mathbb C^d$, and therefore the Hermitian matrix $F(B)$, for every Borel set $B$.
Consider in particular a bivariate time series, written as $(X_t,Y_t)^T$ for univariate time series $X_t$ and $Y_t$. The spectral density of $(X_t,Y_t)$, if it exists, is a $(2\times 2)$-matrix valued function. The diagonal elements are the spectral densities $f_X$ and $f_Y$ of the univariate series $X_t$ and $Y_t$. The off-diagonal elements are complex conjugates and thus define one function, say $f_{XY}$ for the $(1,2)$-element of the matrix, giving
$$f_{(X,Y)^T}(\lambda) = \begin{pmatrix}f_X(\lambda) & f_{XY}(\lambda)\\ \overline{f_{XY}(\lambda)} & f_Y(\lambda)\end{pmatrix}.$$
The following derived functions are often plotted:
$$\mathrm{Re}\,f_{XY}, \quad\text{co-spectrum},$$
$$\mathrm{Im}\,f_{XY}, \quad\text{quadrature},$$
$$\frac{|f_{XY}|^2}{f_X f_Y}, \quad\text{coherency},$$
$$|f_{XY}|, \quad\text{amplitude},$$
$$\arg f_{XY}, \quad\text{phase}.$$
It requires some experience to read the plots of these functions appropriately. The coherency is perhaps the easiest to interpret: evaluated at $\lambda\in(-\pi,\pi]$ it gives the "correlation" between the series $X$ and $Y$ at the frequency $\lambda$.
6.37 EXERCISE. Show that the stationary time series $X_t$ and $Y_t$ are "totally uncorrelated" (i.e. $\mathop{\rm cov}(X_s,Y_t) = 0$ for every $s$ and $t$) if and only if the spectral measure of the vector-valued process $(X_t,Y_t)^T$ is the "product" of the spectral measures $F_X$ and $F_Y$ of $X$ and $Y$. What does this imply for the measure $F_{XY}$, the coherency, amplitude and phase?
A given sequence of $(d\times d)$ matrices $\Psi_j$ with $\sum_j\|\Psi_j\| < \infty$ defines a multivariate linear filter that transforms a given $d$-dimensional time series $X_t$ into the time series $Y_t = \sum_j\Psi_j X_{t-j}$. The corresponding multivariate transfer function is the matrix-valued function $\psi(\lambda) := \sum_j\Psi_j e^{-ij\lambda}$. It plays the same role as the one-dimensional transfer function given in Theorem 6.10, whose proof goes through almost without changes.
6.38 Theorem. Let $X_t$ be a multivariate stationary time series with spectral measure $F_X$ and let $\sum_j\|\Psi_j\| < \infty$. Then $Y_t = \sum_j\Psi_j X_{t-j}$ has spectral measure $F_Y$ given by
$$dF_Y(\lambda) = \psi(\lambda)\,dF_X(\lambda)\,\overline{\psi(\lambda)}^T.$$

The right side of the display is to be understood as a product of three matrices. If the series $X_t$ possesses a spectral density $f_X$, then the right side is understood to be the matrix-valued measure with spectral density $\psi f_X\overline\psi^T$. In the general case the display means to express a relationship between the spectral measures $F_Y$ and $F_X$, with the notations $dF_Y$ and $dF_X$ being only formal identifiers. One way of making the relation precise is to use some density $\tilde f_X$ of $F_X$ relative to some scalar measure $\tilde\mu$ (not necessarily Lebesgue measure). Then $F_Y$ possesses density $\psi\tilde f_X\overline\psi^T$.
The spectral decomposition also extends to vector-valued time series. In particular, for a bivariate time series $(X_t,Y_t)^T$ there exist versions $Z_X$ and $Z_Y$ of their spectral measures (i.e. $X_t = \int e^{i\lambda t}\,dZ_X(\lambda)$ and $Y_t = \int e^{i\lambda t}\,dZ_Y(\lambda)$) such that, for every pair of measurable sets $B_1,B_2\subset(-\pi,\pi]$,
$$\mathop{\rm cov}\bigl(Z_X(B_1),Z_Y(B_2)\bigr) = F_{X,Y}(B_1\cap B_2).$$
* 6.6 Prediction in the Frequency Domain
As the spectrum encodes the complete second order structure of a stationary time series, it can be used to solve the (linear) prediction problem. The observed variables $X_1,\ldots,X_n$ are represented by the functions $e_1,\ldots,e_n$, given by $e_h(\lambda) = e^{ih\lambda}$, in the spectral domain. A random variable $Y$ in the closed linear span $L_2(X_t: t\in\mathbb Z)$ of the time series can be represented as $Y = \int f\,dZ_X$ for some function $f\in L_2(F_X)$. By the spectral isometry,
$$\mathrm E\Bigl|Y - \sum_{t=1}^{n}\alpha_t X_t\Bigr|^2 = \int\Bigl|f - \sum_{t=1}^{n}\alpha_t e_t\Bigr|^2\,dF_X.$$
Thus finding the best linear predictor of $Y$ is the same as projecting $f$ onto the linear span of $e_1,\ldots,e_n$ in $L_2(F_X)$.

If $Y$ is not contained in $L_2(X_t: t\in\mathbb Z)$, then we apply this argument to its projection onto this space, the orthogonal part of it being completely unpredictable.
7 Law of Large Numbers
The law of large numbers is concerned with the convergence of the sequence $\overline X_n$ of averages to its mean value. For statistical applications, the central limit theorem is of more interest, as it gives also expression to the size of the error when estimating the mean by the sample average. In fact, the central limit theorem implies the law of large numbers, as, by Slutsky's lemma, $\overline X_n\to\mu$ in probability if the sequence $\sqrt n(\overline X_n - \mu)$ is uniformly tight. However, the law of large numbers is valid under much weaker conditions. The weakening not only concerns moments, but also the dependence between the $X_t$.

We present separate laws of large numbers for second order stationary and for strictly stationary sequences, respectively. The weak law for second order stationary time series can be fully characterized using the spectrum of the time series. The strong law of large numbers for a strictly stationary time series is a central result of ergodic theory. Actually both results show that the sequence $\overline X_n$ always converges to a limit. However, in general this limit may be random. For nonstationary time series mixing conditions give alternative criteria.
* 7.1 Stationary Sequences
A mean zero, stationary time series $X_t$ can be expressed in its spectral random measure $Z_X$ through $X_t = \int e^{i\lambda t}\,dZ_X(\lambda)$. (See Section 6.4.) A simple calculation then gives the following law of large numbers.

7.1 Theorem. If $X_t$ is a mean zero, stationary time series, then $\overline X_n\to Z_X\{0\}$ in second mean, as $n\to\infty$. In particular, if $F_X\{0\} = 0$, then $\overline X_n\xrightarrow{\;P\;}0$.

Proof. For $Z_X$ the spectral measure of $X$, we write
$$\overline X_n = \frac{1}{n}\sum_{t=1}^{n}\int e^{i\lambda t}\,dZ_X(\lambda) = \int\frac{e^{i\lambda}(1-e^{i\lambda n})}{n(1-e^{i\lambda})}\,dZ_X(\lambda).$$
Here the integrand must be read as 1 if $\lambda = 0$. For all other $\lambda\in(-\pi,\pi]$ the integrand converges to zero as $n\to\infty$. It is bounded by 1 for every $\lambda$. Hence the integrand converges in second mean to $1_{\{0\}}$ in $L_2(F_X)$. By the continuity of the integral $f\mapsto\int f\,dZ_X$, we find that $\overline X_n$ converges in $L_2(\Omega,\mathcal U,\mathrm P)$ to $\int 1_{\{0\}}\,dZ_X = Z_X\{0\}$.

The mean of the variable $Z_X\{0\}$ is zero, and its variance is equal to $F_X\{0\}$. If the latter is zero, then $Z_X\{0\}$ is degenerate at zero.
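A stationary series with positive spectral mass at zero has a genuinely random limit in this law of large numbers. A quick illustration of my own (not from the notes): for $X_t = A + W_t$ with $A$ a fixed random level and $W_t$ white noise, $F_X\{0\} = \mathop{\rm var}A$ and the averages settle at the level $A$ of each path rather than at the overall mean 0.

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n = 5, 100000
A = rng.standard_normal(n_paths)           # one random level per path
W = rng.standard_normal((n_paths, n))      # white noise
X = A[:, None] + W
print(A)
print(X.mean(axis=1))                      # close to A path by path, not to 0
```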
7.2 Ergodic Theorem
Given a strictly stationary sequence $X_t$ defined on some probability space $(\Omega,\mathcal U,\mathrm P)$, with values in some measurable space $(\mathcal X,\mathcal A)$, the invariant $\sigma$-field, denoted $\mathcal U_{\rm inv}$, is the $\sigma$-field consisting of all sets $A$ such that $A = (\ldots,X_{t-1},X_t,X_{t+1},\ldots)^{-1}(B)$ for all $t$ and some measurable set $B\subset\mathcal X^\infty$. Throughout this section the product space $\mathcal X^\infty$ is equipped with the product $\sigma$-field $\mathcal A^\infty$.

Our notation in the definition of the invariant $\sigma$-field is awkward, if not unclear, because we look at two-sided infinite series. The triple $X_{t-1},X_t,X_{t+1}$ in the definition of $A$ is meant to be centered at a fixed position in $\mathbb Z$. We can write this down more precisely using the forward shift function $S:\mathcal X^\infty\to\mathcal X^\infty$ defined by $S(x)_i = x_{i+1}$. The two-sided sequence $(\ldots,X_{t-1},X_t,X_{t+1},\ldots)$ defines a map $X:\Omega\to\mathcal X^\infty$. The invariant sets $A$ are the sets such that $A = \{S^t X\in B\}$ for all $t$ and some measurable set $B\subset\mathcal X^\infty$. The strict stationarity of the sequence $X$ is identical to the invariance of its induced law $P^X$ on $\mathcal X^\infty$ under the shift $S$.

The inverse images $X^{-1}(B)$ of measurable sets $B\subset\mathcal X^\infty$ with $B = SB$ are clearly invariant. Conversely, it can be shown that, up to null sets, all invariant sets take this form. (See Exercise 7.3.) The symmetric events are special examples of invariant sets. They are the events that depend symmetrically on the variables $X_t$. For instance, $\cap_t X_t^{-1}(B)$ for some measurable set $B\subset\mathcal X$.
* 7.2 EXERCISE. Call a set $B\subset\mathcal X^\infty$ invariant under the shift $S:\mathcal X^\infty\to\mathcal X^\infty$ if $B = SB$. Call it almost invariant relative to a measure $P^X$ if $P^X(B\,\triangle\,SB) = 0$. Show that a set $B$ is almost invariant if and only if there exists an invariant set $\tilde B$ such that $P^X(B\,\triangle\,\tilde B) = 0$. [Try $\tilde B = \cap_t S^t B$.]

* 7.3 EXERCISE. Define the invariant $\sigma$-field $\mathcal B_{\rm inv}$ on $\mathcal X^\infty$ as the collection of measurable sets that are invariant under the shift operation, and let $\overline{\mathcal B}_{\rm inv}$ be its completion under the measure $P^X$. Show that $X^{-1}(\mathcal B_{\rm inv})\subset\mathcal U_{\rm inv}\subset\overline{X^{-1}(\mathcal B_{\rm inv})}$, where the long bar on the right denotes completion relative to $\mathrm P$. [Note that $\{X\in B\} = \{X\in SB\}$ implies that $P^X(B\,\triangle\,SB) = 0$. Use the preceding exercise to replace $B$ by an invariant set $\tilde B$.]
7.4 Theorem (Birkhoff). If $X_t$ is a strictly stationary time series with $\mathrm E|X_t| < \infty$, then $\overline X_n\to\mathrm E(X_0|\,\mathcal U_{\rm inv})$ almost surely and in mean.

Proof. For a given $\eta\in\mathbb R$ define a set $B = \bigl\{x\in\mathbb R^\infty:\limsup_n\bar x_n > \eta\bigr\}$. Because
$$\bar x_{n+1} = \frac{x_1}{n+1} + \frac{n}{n+1}\,\frac{1}{n}\sum_{t=2}^{n+1}x_t,$$
a point $x$ is contained in $B$ if and only if $\limsup_n n^{-1}\sum_{t=2}^{n+1}x_t > \eta$. Equivalently, $x\in B$ if and only if $Sx\in B$. Thus the set $B$ is invariant under the shift operation $S:\mathbb R^\infty\to\mathbb R^\infty$. We conclude from this that the variable $\limsup_n\overline X_n$ is measurable relative to the invariant $\sigma$-field.

Fix some measurable set $B\subset\mathbb R^\infty$. For every invariant set $A\in\mathcal U_{\rm inv}$ there exists a measurable set $C\subset\mathcal X^\infty$ such that $A = \{S^t X\in C\}$ for every $t$. By the strict stationarity of $X$,
$$\mathrm P\bigl(\{S^t X\in B\}\cap A\bigr) = \mathrm P(S^t X\in B, S^t X\in C) = \mathrm P(X\in B, X\in C) = \mathrm P\bigl(\{X\in B\}\cap A\bigr).$$
This shows that $\mathrm P(S^t X\in B|\,\mathcal U_{\rm inv}) = \mathrm P(X\in B|\,\mathcal U_{\rm inv})$ almost surely. We conclude that the conditional laws of $S^t X$ and $X$ given the invariant $\sigma$-field are identical.

In particular, the conditional means $\mathrm E(X_t|\,\mathcal U_{\rm inv}) = \mathrm E(X_1|\,\mathcal U_{\rm inv})$ are identical for every $t$, almost surely. It also follows that a time series $Z_t$ of the type $Z_t = (X_t, R)$ for $R$ a fixed $\mathcal U_{\rm inv}$-measurable variable (for instance with values in $\mathbb R^2$) is strictly stationary, the conditional law $B\mapsto\mathrm P(X\in B|\,R) = \mathrm E\bigl(\mathrm P(X\in B|\,\mathcal U_{\rm inv})|\,R\bigr)$ of its first marginal (on $\mathcal X^\infty$) being strictly stationary by the preceding paragraph, and the second marginal (on $\mathbb R^2$) being independent of $t$.

For the almost sure convergence of the sequence $\overline X_n$ it suffices to show that, for every $\varepsilon > 0$, the event
$$A = \bigl\{\limsup_n\overline X_n > \mathrm E(X_1|\,\mathcal U_{\rm inv}) + \varepsilon\bigr\}$$
and a corresponding event for the lower tail have probability zero. By the preceding the event $A$ is contained in the invariant $\sigma$-field. Furthermore, the time series $Y_t = \bigl(X_t - \mathrm E(X_1|\,\mathcal U_{\rm inv}) - \varepsilon\bigr)1_A$, being a fixed transformation of the time series $Z_t = \bigl(X_t, \mathrm E(X_1|\,\mathcal U_{\rm inv}), 1_A\bigr)$, is strictly stationary. We can write $A = \cup_n A_n$ for $A_n = \cup_{t=1}^n\{\overline Y_t > 0\}$. Then $\mathrm E Y_1 1_{A_n}\to\mathrm E Y_1 1_A$ by the dominated convergence theorem, in view of the assumption that $X_t$ is integrable. If we can show that $\mathrm E Y_1 1_{A_n}\ge 0$ for every $n$, then we can conclude that
$$0\le\mathrm E Y_1 1_A = \mathrm E\bigl(X_1 - \mathrm E(X_1|\,\mathcal U_{\rm inv})\bigr)1_A - \varepsilon\,\mathrm P(A) = -\varepsilon\,\mathrm P(A),$$
because $A\in\mathcal U_{\rm inv}$. This implies that $\mathrm P(A) = 0$, concluding the proof of almost sure convergence.

The $L_1$-convergence can next be proved by a truncation argument. We can first show, more generally, but by an identical argument, that $n^{-1}\sum_{t=1}^{n}f(X_t)\to\mathrm E\bigl(f(X_0)|\,\mathcal U_{\rm inv}\bigr)$ almost surely, for every measurable function $f:\mathcal X\to\mathbb R$ with $\mathrm E|f(X_t)| < \infty$. We can apply this to the functions $f(x) = x1_{\{|x|\le M\}}$ for given $M$.

We complete the proof by showing that $\mathrm E Y_1 1_{A_n}\ge 0$ for every strictly stationary time series $Y_t$ and every fixed $n$, and $A_n = \cup_{t=1}^n\{\overline Y_t > 0\}$. For every $2\le j\le n$,
$$Y_1 + \cdots + Y_j\le Y_1 + \max(Y_2, Y_2+Y_3,\ldots,Y_2+\cdots+Y_{n+1}).$$
If we add the number 0 in the maximum on the right, then this is also true for $j = 1$. We can rewrite the resulting $n$ inequalities as the single inequality
$$Y_1\ge\max(Y_1, Y_1+Y_2,\ldots,Y_1+\cdots+Y_n) - \max(0, Y_2, Y_2+Y_3,\ldots,Y_2+\cdots+Y_{n+1}).$$
The event $A_n$ is precisely the event that the first of the two maxima on the right is positive. Thus on this event the inequality remains true if we add also a zero to the first maximum. It follows that $\mathrm E Y_1 1_{A_n}$ is bounded below by
$$\mathrm E\Bigl(\max(0, Y_1, Y_1+Y_2,\ldots,Y_1+\cdots+Y_n) - \max(0, Y_2, Y_2+Y_3,\ldots,Y_2+\cdots+Y_{n+1})\Bigr)1_{A_n}.$$
Off the event $A_n$ the first maximum is zero, whereas the second maximum is always nonnegative. Thus the expression does not increase if we cancel the indicator $1_{A_n}$. The resulting expression is identically zero by the strict stationarity of the series $Y_t$.
Thus a strong law is valid for every integrable strictly stationary sequence, without any further conditions on possible dependence of the variables. However, the limit $E(X_0\,|\,\mathcal U_{\rm inv})$ in the preceding theorem will often be a true random variable. Only if the invariant $\sigma$-field is trivial can we be sure that the limit is degenerate. Here "trivial" may be taken to mean that the invariant $\sigma$-field consists of sets of probability 0 or 1 only. If this is the case, then the time series $X_t$ is called ergodic.
* 7.5 EXERCISE. Suppose that $X_t$ is strictly stationary. Show that $X_t$ is ergodic if and only if every sequence $Y_t = f(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ for a measurable map $f$ that is integrable satisfies the law of large numbers $\overline Y_n \to EY_1$ almost surely. [Given an invariant set $A = \{(\ldots, X_{-1}, X_0, X_1, \ldots) \in B\}$ consider $Y_t = 1_B(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$. Then $\overline Y_n = 1_A$.]
Checking that the invariant $\sigma$-field is trivial may be a nontrivial operation. There are other concepts that imply ergodicity and may be easier to verify. A time series $X_t$ is called mixing if, for any measurable sets $A$ and $B$, as $h \to \infty$,
$$P\bigl((\ldots, X_{h-1}, X_h, X_{h+1}, \ldots) \in A,\ (\ldots, X_{-1}, X_0, X_1, \ldots) \in B\bigr) - P\bigl((\ldots, X_{h-1}, X_h, X_{h+1}, \ldots) \in A\bigr)\,P\bigl((\ldots, X_{-1}, X_0, X_1, \ldots) \in B\bigr) \to 0.$$
Every mixing time series is ergodic. This follows because if we take $A = B$ equal to an invariant set, the preceding display reads $P^X(A) - P^X(A)P^X(A) \to 0$, for $P^X$ the law of the infinite series $X_t$, and hence $P^X(A)$ is 0 or 1.

The present type of mixing is related to the mixing concepts used to obtain central limit theorems, and is weaker.
7.6 Theorem. Any strictly stationary $\alpha$-mixing time series is mixing.

Proof. For $t$-dimensional cylinder sets $A$ and $B$ in $\mathcal X^\infty$ (i.e. sets that depend on finitely many coordinates only) the mixing condition becomes
$$P\bigl((X_h, \ldots, X_{t+h}) \in A,\ (X_0, \ldots, X_t) \in B\bigr) - P\bigl((X_h, \ldots, X_{t+h}) \in A\bigr)\,P\bigl((X_0, \ldots, X_t) \in B\bigr) \to 0.$$
For $h > t$ the absolute value of the difference of the two sides of the display is bounded by $\alpha(h - t)$ and hence converges to zero as $h \to \infty$, for each fixed $t$.

Thus the mixing condition is satisfied by the collection of all cylinder sets. This collection is intersection-stable, i.e. a $\pi$-system, and generates the product $\sigma$-field on $\mathcal X^\infty$. The proof is complete if we can show that the collections of sets $A$ and $B$ for which the mixing condition holds, for a given set $B$ or $A$, is a $\sigma$-field. By the $\pi$-$\lambda$ theorem it suffices to show that these collections of sets are a $\lambda$-system.

The mixing property can be written as $P^X(S^{-h}A \cap B) - P^X(A)P^X(B) \to 0$, as $h \to \infty$. Because $S$ is a bijection we have $S^{-h}(A_2 - A_1) = S^{-h}A_2 - S^{-h}A_1$. If $A_1 \subset A_2$, then
$$P^X\bigl(S^{-h}(A_2 - A_1) \cap B\bigr) = P^X(S^{-h}A_2 \cap B) - P^X(S^{-h}A_1 \cap B),$$
$$P^X(A_2 - A_1)P^X(B) = P^X(A_2)P^X(B) - P^X(A_1)P^X(B).$$
If, for a given set $B$, the sets $A_1$ and $A_2$ satisfy the mixing condition, then the right hand sides are asymptotically the same, as $h \to \infty$, and hence so are the left sides. Thus $A_2 - A_1$ satisfies the mixing condition. If $A_n \uparrow A$, then $S^{-h}A_n \uparrow S^{-h}A$ as $n \to \infty$ and hence
$$P^X(S^{-h}A_n \cap B) - P^X(A_n)P^X(B) \to P^X(S^{-h}A \cap B) - P^X(A)P^X(B).$$
The absolute difference of left and right sides is bounded above by $2\bigl|P^X(A_n) - P^X(A)\bigr|$. Hence the convergence in the display is uniform in $h$. If every one of the sets $A_n$ satisfies the mixing condition, for a given set $B$, then so does $A$. Thus the collection of all sets $A$ that satisfy the condition, for a given $B$, is a $\lambda$-system.

We can prove similarly, but more easily, that the collection of all sets $B$ is also a $\lambda$-system.
7.7 Theorem. Any strictly stationary time series $X_t$ with trivial tail $\sigma$-field is mixing.

Proof. The tail $\sigma$-field is defined as $\bigcap_{h\in\mathbb Z}\sigma(X_h, X_{h+1}, \ldots)$.

As in the proof of the preceding theorem we need to verify the mixing condition only for finite cylinder sets $A$ and $B$. We can write
$$\Bigl|E1_{\{(X_h,\ldots,X_{t+h})\in A\}}\bigl(1_{\{(X_0,\ldots,X_t)\in B\}} - P((X_0,\ldots,X_t)\in B)\bigr)\Bigr| = \Bigl|E1_{\{(X_h,\ldots,X_{t+h})\in A\}}\bigl(P\bigl((X_0,\ldots,X_t)\in B\,|\,X_h, X_{h+1},\ldots\bigr) - P((X_0,\ldots,X_t)\in B)\bigr)\Bigr| \le E\Bigl|P\bigl((X_0,\ldots,X_t)\in B\,|\,X_h, X_{h+1},\ldots\bigr) - P((X_0,\ldots,X_t)\in B)\Bigr|.$$
For every integrable variable $Y$ the sequence $E(Y\,|\,X_h, X_{h+1}, \ldots)$ converges in $L_1$ to the conditional expectation of $Y$ given the tail $\sigma$-field, as $h \to \infty$. Because the tail $\sigma$-field is trivial, in the present case this is $EY$. Thus the right side of the preceding display converges to zero as $h \to \infty$.
* 7.8 EXERCISE. Show that a strictly stationary time series $X_t$ is ergodic if and only if $n^{-1}\sum_{h=1}^n P^X(S^{-h}A \cap B) \to P^X(A)P^X(B)$, as $n \to \infty$, for every measurable subsets $A$ and $B$ of $\mathcal X^\infty$. [Use the ergodic theorem on the stationary time series $Y_t = 1_{\{S^tX \in A\}}$ to see that $n^{-1}\sum_t 1_{\{X \in S^{-t}A\}}1_B \to P^X(A)1_B$ for the proof in one direction.]
* 7.9 EXERCISE. Show that a strictly stationary time series $X_t$ is ergodic if and only if the one-sided time series $X_0, X_1, X_2, \ldots$ is ergodic, in the sense that the one-sided invariant $\sigma$-field, consisting of all sets $A$ such that $A = \{(X_t, X_{t+1}, \ldots) \in B\}$ for some measurable set $B$ and every $t \ge 0$, is trivial. [Use the preceding exercise.]
The preceding theorems can be used as starting points to construct ergodic sequences. For instance, every i.i.d. sequence is ergodic by the preceding theorems, because its tail $\sigma$-field is trivial by Kolmogorov's 0-1 law, or because it is $\alpha$-mixing. To construct more examples we can combine the theorems with the following stability property. From a given ergodic sequence $X_t$ we construct a process $Y_t$ by transforming the vector $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ with a given map $f$ from the product space $\mathcal X^\infty$ into some measurable space $(\mathcal Y, \mathcal B)$. As before, the $X_t$ in $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ is meant to be at a fixed 0th position in $\mathbb Z$, so that the different variables $Y_t$ are obtained by sliding the function $f$ along the sequence $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$.
7.10 Theorem. The sequence $Y_t = f(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ obtained by application of a measurable map $f: \mathcal X^\infty \to \mathcal Y$ to an ergodic sequence $X_t$ is ergodic.

Proof. Define $\underline f: \mathcal X^\infty \to \mathcal Y^\infty$ by $\underline f(x) = \bigl(\ldots, f(S^{-1}x), f(x), f(Sx), \ldots\bigr)$, for $S$ the forward shift on $\mathcal X^\infty$. Then $Y = \underline f(X)$ and $(S')^t\underline f(x) = \underline f(S^tx)$, if $S'$ is the forward shift on $\mathcal Y^\infty$. Consequently $(S')^tY = \underline f(S^tX)$. If $A = \{(S')^tY \in B\}$ is invariant for the series $Y_t$, then $A = \{\underline f(S^tX) \in B\} = \{S^tX \in \underline f^{-1}(B)\}$ for every $t$, and hence $A$ is also invariant for the series $X_t$.
7.11 EXERCISE. Let $Z_t$ be an i.i.d. sequence of integrable variables and let $X_t = \sum_j\psi_jZ_{t-j}$ for a sequence $\psi_j$ such that $\sum_j|\psi_j| < \infty$. Show that $X_t$ satisfies the law of large numbers (with degenerate limit).
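To illustrate the conclusion of the preceding exercise, the following minimal numerical sketch (assuming the NumPy library; all names and values are illustrative) simulates a linear process built from i.i.d. noise and watches its sample means settle down to the degenerate limit $EX_t = EZ_t\sum_j\psi_j = 0$.

```python
import numpy as np

# Numerical check of the law of large numbers for a linear process
# X_t = sum_j psi_j Z_{t-j} built from i.i.d. noise (cf. Exercise 7.11).
rng = np.random.default_rng(0)

psi = 0.8 ** np.arange(10)                  # absolutely summable filter coefficients
n = 100_000
z = rng.exponential(scale=1.0, size=n + len(psi) - 1) - 1.0   # i.i.d., mean zero

x = np.convolve(z, psi, mode="valid")       # linear filter of the noise, length n

running_mean = np.cumsum(x) / np.arange(1, n + 1)
print("sample means after 100, 10000, 100000 observations:",
      running_mean[[99, 9_999, n - 1]])
print("degenerate limit:", 0.0)
```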
7.12 EXERCISE. Show that the GARCH(1, 1) process defined in Example 1.10 is ergodic.
7.13 Example (Discrete Markov chains). Every stationary irreducible Markov chain on a countable state space is ergodic. Conversely, a stationary reducible Markov chain on a countable state space whose initial (or marginal) law is positive everywhere is nonergodic.

To prove the ergodicity note that a stationary irreducible Markov chain is (positively) recurrent (e.g. Durrett, p266). If $A$ is an invariant set of the form $A = (X_0, X_1, \ldots)^{-1}(B)$, then $A \in \sigma(X_h, X_{h+1}, \ldots)$ for all $h$ and hence
$$1_A = P(A\,|\,X_h, X_{h-1}, \ldots) = P\bigl((X_{h+1}, X_{h+2}, \ldots) \in B\,|\,X_h, X_{h-1}, \ldots\bigr) = P\bigl((X_{h+1}, X_{h+2}, \ldots) \in B\,|\,X_h\bigr).$$
We can write the right side as $g(X_h)$ for the function $g(x) = P(A\,|\,X_1 = x)$. By recurrence, for almost every $\omega$ in the underlying probability space, the right side runs infinitely often through every of the numbers $g(x)$ with $x$ in the state space. Because the left side is 0 or 1 for a fixed $\omega$, the function $g$ and hence $1_A$ must be constant. Thus every invariant set of this type is trivial, showing the ergodicity of the one-sided sequence $X_0, X_1, \ldots$. It can be shown that one-sided and two-sided ergodicity are the same. (Cf. Exercise 7.9.)

Conversely, if the Markov chain is reducible, then the state space can be split into two sets $\mathcal A_1$ and $\mathcal A_2$ such that the chain will remain in $\mathcal A_1$ or $\mathcal A_2$ once it enters there. If the initial distribution puts positive mass everywhere, then each of the two possibilities occurs with positive probability. The sets $A_i = \{X_0 \in \mathcal A_i\}$ are then invariant and nontrivial and hence the chain is not ergodic.

It can also be shown that a stationary irreducible Markov chain is mixing if and only if it is aperiodic. (See e.g. Durrett, p310.) Furthermore, the tail $\sigma$-field of any irreducible stationary aperiodic Markov chain is trivial. (See e.g. Durrett, p279.)
7.3 Mixing

In the preceding section it was seen that an $\alpha$-mixing, strictly stationary time series is ergodic and hence satisfies the law of large numbers if it is integrable. In this section we extend the law of large numbers to possibly nonstationary $\alpha$-mixing time series.

The key is the bound on the tails of the distribution of the sample mean given in the following lemma.
7.14 Lemma. For any mean-zero time series $X_t$ with $\alpha$-mixing numbers $\alpha(h)$, every $x > 0$ and every $h \in \mathbb N$, with $Q_t = F_{|X_t|}^{-1}$,
$$P(\overline X_n \ge 2x) \le \frac{2}{nx^2}\int_0^1\bigl(\alpha^{-1}(u)\wedge h\bigr)\,\frac1n\sum_{t=1}^nQ_t^2(1-u)\,du + \frac{2}{x}\int_0^{\alpha(h)}\frac1n\sum_{t=1}^nQ_t(1-u)\,du.$$
Proof. The quantile function of the variable $|X_t|/(xn)$ is equal to $u \mapsto F_{|X_t|}^{-1}(u)/(nx)$. Therefore, by a rescaling argument we can see that it suffices to bound the probability $P\bigl(\sum_{t=1}^nX_t \ge 2\bigr)$ by the right side of the lemma, but with the factors $2/(nx^2)$ and $2/x$ replaced by 2 and the factor $n^{-1}$ in front of $\sum Q_t^2$ and $\sum Q_t$ dropped. For ease of notation set $S_0 = 0$ and $S_n = \sum_{t=1}^nX_t$.

Define the function $g: \mathbb R \to \mathbb R$ to be 0 on the interval $(-\infty, 0]$, to be $x \mapsto \frac12x^2$ on $[0, 1]$, to be $x \mapsto 1 - \frac12(x - 2)^2$ on $[1, 2]$, and to be 1 on $[2, \infty)$. Then $g$ is continuously differentiable with uniformly Lipschitz derivative. By Taylor's theorem it follows that $\bigl|g(x) - g(y) - g'(y)(x - y)\bigr| \le \frac12|x - y|^2$ for every $x, y \in \mathbb R$. Because $1_{[2,\infty)} \le g$ and $S_t - S_{t-1} = X_t$,
$$P(S_n \ge 2) \le Eg(S_n) = \sum_{t=1}^nE\bigl(g(S_t) - g(S_{t-1})\bigr) \le \sum_{t=1}^nE\,g'(S_{t-1})X_t + \frac12\sum_{t=1}^nEX_t^2.$$
The last term on the right can be written $\frac12\sum_{t=1}^n\int_0^1Q_t^2(1-u)\,du$, which is bounded by $\sum_{t=1}^n\int_0^{\alpha(0)}Q_t^2(1-u)\,du$, because $\alpha(0) = \frac12$ and $u \mapsto Q_t(1-u)$ is decreasing.

For $i \ge 1$ the variable $g'(S_{t-i}) - g'(S_{t-i-1})$ is measurable relative to $\sigma(X_s: s \le t-i)$ and is bounded in absolute value by $|X_{t-i}|$. Therefore, Lemma 4.12 yields the inequality
$$\Bigl|E\bigl(g'(S_{t-i}) - g'(S_{t-i-1})\bigr)X_t\Bigr| \le 2\int_0^{\alpha(i)}Q_{t-i}(1-u)Q_t(1-u)\,du.$$
For $t \le h$ we can write $g'(S_{t-1}) = \sum_{i=1}^{t-1}\bigl(g'(S_{t-i}) - g'(S_{t-i-1})\bigr)$. Substituting this in the left side of the following display and applying the preceding display, we find that
$$\sum_{t=1}^hE\bigl|g'(S_{t-1})X_t\bigr| \le 2\sum_{t=1}^h\sum_{i=1}^{t-1}\int_0^{\alpha(i)}Q_{t-i}(1-u)Q_t(1-u)\,du.$$
For $t > h$ we can write $g'(S_{t-1}) = g'(S_{t-h}) + \sum_{i=1}^{h-1}\bigl(g'(S_{t-i}) - g'(S_{t-i-1})\bigr)$. By a similar argument, this time also using that the function $|g'|$ is uniformly bounded by 1, we find
$$\sum_{t=h+1}^nE\bigl|g'(S_{t-1})X_t\bigr| \le 2\sum_{t=h+1}^n\int_0^{\alpha(h)}Q_t(1-u)\,du + 2\sum_{t=h+1}^n\sum_{i=1}^{h-1}\int_0^{\alpha(i)}Q_{t-i}(1-u)Q_t(1-u)\,du.$$
Combining the preceding displays we obtain that $P(S_n \ge 2)$ is bounded above by
$$2\sum_{t=1}^n\int_0^{\alpha(h)}Q_t(1-u)\,du + 2\sum_{t=1}^n\sum_{i=1}^{t\wedge h-1}\int_0^{\alpha(i)}Q_{t-i}(1-u)Q_t(1-u)\,du + \frac12\sum_{t=1}^nEX_t^2.$$
In the second term we can bound $2Q_{t-i}(1-u)Q_t(1-u)$ by $Q_{t-i}^2(1-u) + Q_t^2(1-u)$ and next change the order of summation to $\sum_{i=1}^{h-1}\sum_{t=i+1}^n$. Because $\sum_{t=i+1}^n(Q_{t-i}^2 + Q_t^2) \le 2\sum_{t=1}^nQ_t^2$, this term is bounded by $2\sum_{i=1}^{h-1}\int_0^{\alpha(i)}\sum_{t=1}^nQ_t^2(1-u)\,du$. Together with the third term on the right this gives rise to the first sum on the right of the lemma, as $\sum_{i=0}^{h-1}1_{u\le\alpha(i)} = \alpha^{-1}(u)\wedge h$.
7.15 Theorem. For each $n$ let the time series $(X_{n,t}: t \in \mathbb Z)$ be $\alpha$-mixing with mixing coefficients $\alpha_n(h)$. If $\sup_n\alpha_n(h) \to 0$ as $h \to \infty$ and $(X_{n,t}: t \in \mathbb Z, n \in \mathbb N)$ is uniformly integrable, then the sequence $\overline X_n - E\overline X_n$ converges to zero in probability.

Proof. By the assumption of uniform integrability $n^{-1}\sum_{t=1}^nE|X_{n,t}|1_{|X_{n,t}|>M} \to 0$ as $M \to \infty$, uniformly in $n$. Therefore we can assume without loss of generality that $X_{n,t}$ is bounded in absolute value by a constant $M$. We can also assume that it is centered at mean zero.

Then the quantile function of $|X_{n,t}|$ is bounded by $M$ and the preceding lemma yields the bound
$$P\bigl(|\overline X_n| \ge 2\epsilon\bigr) \le \frac{4hM^2}{n\epsilon^2} + \frac{4M}{\epsilon}\sup_n\alpha_n(h).$$
This converges to zero as $n \to \infty$ followed by $h \to \infty$.
* 7.16 EXERCISE. Relax the conditions in the preceding theorem to, for every $\epsilon > 0$:
$$n^{-1}\sum_{t=1}^nE|X_{n,t}|1_{\bigl\{|X_{n,t}| > \epsilon nF_{|X_{n,t}|}^{-1}(1-\alpha_n(h))\bigr\}} \to 0.$$
[Hint: truncate at the level $\epsilon n$, and note that $EX1_{X>M} = \int_0^1Q(1-u)1_{Q(1-u)>M}\,du$ for $Q(u) = F_X^{-1}(u)$.]
7.4 Subadditive Ergodic Theorem

The subadditive theorem of Kingman can be considered an extension of the ergodic theorem that gives the almost sure convergence of more general functions of a strictly stationary sequence than the consecutive means. Given a strictly stationary time series $X_t$ with values in some measurable space $(\mathcal X, \mathcal A)$ and defined on some probability space $(\Omega, \mathcal U, P)$, write $X$ for the induced map $(\ldots, X_{-1}, X_0, X_1, \ldots): \Omega \to \mathcal X^\infty$, and let $S: \mathcal X^\infty \to \mathcal X^\infty$ be the forward shift function (all as before). A family $(T_n: n \in \mathbb N)$ of maps $T_n: \mathcal X^\infty \to \mathbb R$ is called subadditive if, for every $m, n \in \mathbb N$,
$$T_{m+n}(X) \le T_m(X) + T_n(S^mX).$$

7.17 Theorem (Kingman). If $X$ is strictly stationary with invariant $\sigma$-field $\mathcal U_{\rm inv}$ and the maps $(T_n: n \in \mathbb N)$ are subadditive with finite means $ET_n(X)$, then $T_n(X)/n \to \gamma := \inf_n n^{-1}E\bigl(T_n(X)\,|\,\mathcal U_{\rm inv}\bigr)$ almost surely. Furthermore, the limit satisfies $E\gamma > -\infty$ if and only if $\inf_nET_n(X)/n > -\infty$ and in that case the convergence $T_n(X)/n \to \gamma$ also takes place in mean.

Because the maps $T_n(X) = \sum_{t=1}^nX_t$ are subadditive, the ordinary ergodic theorem by Birkhoff is a special case of Kingman's theorem. If the time series $X_t$ is ergodic, then the limit in Kingman's theorem is equal to $\gamma = \inf_n n^{-1}ET_n(X)$.

7.18 EXERCISE. Show that the normalized means $n^{-1}ET_n(X)$ of a subadditive map are decreasing in $n$.

7.19 EXERCISE. Let $X_t$ be a time series with values in the collection of $(d \times d)$ matrices. Show that the maps defined by $T_n(X) = \log\|X_1\cdots X_n\|$ are subadditive.

7.20 EXERCISE. Show that Kingman's theorem remains true if the forward shift operator in the definition of subadditivity is replaced by the backward shift operator.
8 ARIMA Processes

For many years ARIMA processes were the work horses of time series analysis, the statistical analysis of time series being almost identical to fitting an appropriate ARIMA process. This important class of time series models is defined through linear relations between the observations and noise factors.
8.1 Backshift Calculus

To simplify notation we define the backshift operator $B$ through
$$BX_t = X_{t-1}, \qquad B^kX_t = X_{t-k}.$$
This is viewed as operating on a complete time series $X_t$, transforming this into a new series by a time shift. Even though we use the word "operator", we shall use $B$ only as a notational device. In particular, $BY_t = Y_{t-1}$ for any other time series $Y_t$.†

For a given polynomial $\psi(z) = \sum_j\psi_jz^j$ we also abbreviate
$$\psi(B)X_t = \sum_j\psi_jX_{t-j}.$$
If the series on the right is well defined, then we even use this notation for infinite Laurent series $\sum_{j=-\infty}^\infty\psi_jz^j$. Then $\psi(B)X_t$ is simply a short-hand notation for the (infinite) linear filters that we encountered before. By Lemma 1.28 the filtered time series $\psi(B)X_t$ is certainly well defined if $\sum_j|\psi_j| < \infty$ and $\sup_tE|X_t| < \infty$, in which case the series converges both almost surely and in mean.

† Be aware of the dangers of this notation. For instance, if $Y_t = X_{-t}$, then $BY_t = Y_{t-1} = X_{-(t-1)}$. This is probably the intended meaning. We could also argue that $BY_t = BX_{-t} = X_{-t-1}$. This is something else. Such inconsistencies can be avoided by defining $B$ as a true operator, for instance a linear operator acting on the linear span of a given time series. Then $B$ may be different for different time series.
If $\sum_j|\psi_j| < \infty$, then the Laurent series $\sum_j\psi_jz^j$ converges absolutely on the unit circle $S^1 = \{z \in \mathbb C: |z| = 1\}$ in the complex plane and hence defines a function $\psi: S^1 \to \mathbb C$. In turn, the values of this function determine the coefficients. (See the next exercise.) Given two of such series or functions $\psi_1(z) = \sum_j\psi_{1,j}z^j$ and $\psi_2(z) = \sum_j\psi_{2,j}z^j$, the product $\psi(z) = \psi_1(z)\psi_2(z)$ is a well-defined function on (at least) the unit circle. By changing the summation indices (which is permitted in view of absolute convergence) this can be written as
$$\psi(z) = \psi_1(z)\psi_2(z) = \sum_k\psi_kz^k, \qquad \psi_k = \sum_j\psi_{1,j}\psi_{2,k-j}.$$
The coefficients $\psi_k$ are called the convolutions of the coefficients $\psi_{1,j}$ and $\psi_{2,j}$. Under the condition that $\sum_j|\psi_{i,j}| < \infty$ for $i = 1, 2$, the Laurent series $\sum_k\psi_kz^k$ converges absolutely at least on the unit circle: $\sum_k|\psi_k| < \infty$.
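As a small illustration (assuming the NumPy library; the particular filters are arbitrary), the convolution of two finite filters and the bound of part (ii) of the next exercise can be checked as follows.

```python
import numpy as np

# Convolution of two finite filters: the coefficients of the product
# psi(z) = psi1(z) psi2(z) are psi_k = sum_j psi1_j psi2_{k-j}.
psi1 = np.array([1.0, 0.5, 0.25])           # psi1(z) = 1 + 0.5 z + 0.25 z^2
psi2 = np.array([1.0, -0.9])                # psi2(z) = 1 - 0.9 z

psi = np.convolve(psi1, psi2)               # coefficients of the product
print("product coefficients:", psi)

# The l1-norm of the convolution is at most the product of the l1-norms
# (cf. part (ii) of the next exercise).
print(np.abs(psi).sum() <= np.abs(psi1).sum() * np.abs(psi2).sum())
```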
8.1 EXERCISE. Show that:
(i) If $\psi(z) = \sum_j\psi_jz^j$, for an absolutely summable sequence $(\psi_j)$, then $\psi_j = (2\pi i)^{-1}\oint_{S^1}\psi(z)z^{-1-j}\,dz$.
(ii) The convolution $(\psi_k)$ of two sequences of filter coefficients satisfies $\sum_k|\psi_k| \le \sum_j|\psi_{1,j}|\sum_j|\psi_{2,j}|$.
Having defined the function $\psi(z)$ and verified that it has an absolutely convergent Laurent series representation on the unit circle, we can now also define the time series $\psi(B)X_t$. The following lemma shows that the convolution formula remains valid if $z$ is replaced by $B$, at least when applied to time series that are bounded in $L_1$.

8.2 Lemma. If both $\sum_j|\psi_{1,j}| < \infty$ and $\sum_j|\psi_{2,j}| < \infty$, then, for every time series $X_t$ with $\sup_tE|X_t| < \infty$,
$$\psi(B)X_t = \psi_1(B)\bigl[\psi_2(B)X_t\bigr], \qquad \text{a.s..}$$
Proof. The right side is to be read as $\psi_1(B)Y_t$ for $Y_t = \psi_2(B)X_t$. The variable $Y_t$ is well defined almost surely by Lemma 1.28, because $\sum_j|\psi_{2,j}| < \infty$ and $\sup_tE|X_t| < \infty$. Furthermore,
$$\sup_tE|Y_t| = \sup_tE\Bigl|\sum_j\psi_{2,j}X_{t-j}\Bigr| \le \sum_j|\psi_{2,j}|\,\sup_tE|X_t| < \infty.$$
Thus the time series $\psi_1(B)Y_t$ is also well defined by Lemma 1.28. Now
$$E\sum_i\sum_j|\psi_{1,i}||\psi_{2,j}||X_{t-i-j}| \le \sup_tE|X_t|\sum_i|\psi_{1,i}|\sum_j|\psi_{2,j}| < \infty.$$
This implies that the double series $\sum_i\sum_j\psi_{1,i}\psi_{2,j}X_{t-i-j}$ converges absolutely, almost surely, and hence unconditionally. The latter means that we may sum the terms in an arbitrary order. In particular, by the change of variables $(i, j) \mapsto (i = l, i + j = k)$,
$$\sum_i\psi_{1,i}\Bigl(\sum_j\psi_{2,j}X_{t-i-j}\Bigr) = \sum_k\Bigl(\sum_l\psi_{1,l}\psi_{2,k-l}\Bigr)X_{t-k}, \qquad \text{a.s..}$$
This is the assertion of the lemma, with $\psi_1(B)\bigl[\psi_2(B)X_t\bigr]$ on the left side, and $\psi(B)X_t$ on the right.
The lemma implies that the operators $\psi_1(B)$ and $\psi_2(B)$ commute. From now on we may omit the square brackets in $\psi_1(B)\bigl[\psi_2(B)X_t\bigr]$.

8.3 EXERCISE. Verify that the lemma remains valid for any sequences $\psi_1$ and $\psi_2$ with $\sum_j|\psi_{i,j}| < \infty$ and every process $X_t$ such that $\sum_i\sum_j|\psi_{1,i}||\psi_{2,j}||X_{t-i-j}| < \infty$ almost surely. In particular, conclude that $\psi_1(B)\psi_2(B)X_t = (\psi_1\psi_2)(B)X_t$ for any polynomials $\psi_1$ and $\psi_2$ and every time series $X_t$.
8.2 ARMA Processes

Linear regression models attempt to explain a variable by the sum of a linear function of explanatory variables and a noise variable. ARMA processes are a time series version of linear regression, where the explanatory variables are the past values of the time series itself and the added noise is a moving average process.

8.4 Definition. A time series $X_t$ is an ARMA(p, q)-process if there exist polynomials $\phi$ and $\theta$ of degrees $p$ and $q$, respectively, and a white noise series $Z_t$ such that $\phi(B)X_t = \theta(B)Z_t$.

The equation $\phi(B)X_t = \theta(B)Z_t$ is to be understood as "pointwise almost surely" on the underlying probability space: the random variables $X_t$ and $Z_t$ are defined on a probability space $(\Omega, \mathcal U, P)$ and satisfy $\phi(B)X_t(\omega) = \theta(B)Z_t(\omega)$ for almost every $\omega \in \Omega$.

The polynomials are often‡ written in the forms $\phi(z) = 1 - \phi_1z - \phi_2z^2 - \cdots - \phi_pz^p$ and $\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$. Then the equation $\phi(B)X_t = \theta(B)Z_t$ takes the form
$$X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + \cdots + \phi_pX_{t-p} + Z_t + \theta_1Z_{t-1} + \cdots + \theta_qZ_{t-q}.$$
In other words: the value of the time series $X_t$ at time $t$ is the sum of a linear regression on its own past and of a moving average. An ARMA(p, 0)-process is also called an auto-regressive process and denoted AR(p); an ARMA(0, q)-process is also called a moving average process and denoted MA(q). Thus an auto-regressive process is a solution $X_t$ to the equation $\phi(B)X_t = Z_t$, and a moving average process is explicitly given by $X_t = \theta(B)Z_t$.

‡ A notable exception is the Splus package. Its makers appear to have overdone the cleverness of including minus-signs in the coefficients of $\phi$ and have included them in the coefficients of $\theta$ also.
8.5 EXERCISE. Why is it not a loss of generality to assume $\phi_0 = \theta_0 = 1$?

We next investigate for which pairs of polynomials $\phi$ and $\theta$ there exists a corresponding stationary ARMA-process. For given polynomials $\phi$ and $\theta$ and a white noise series $Z_t$, there are always many time series $X_t$ that satisfy the ARMA equation, but none of these may be stationary. If there exists a stationary solution, then we are also interested in knowing whether this is uniquely determined by the pair $(\phi, \theta)$ and/or the white noise series $Z_t$, and in what way it depends on the series $Z_t$.

8.6 Example. The polynomial $\phi(z) = 1 - \phi z$ leads to the auto-regressive equation $X_t = \phi X_{t-1} + Z_t$. In Example 1.8 we have seen that a stationary solution exists if and only if $|\phi| \ne 1$.

8.7 EXERCISE. Let arbitrary polynomials $\phi$ and $\theta$, a white noise sequence $Z_t$ and variables $X_1, \ldots, X_p$ be given. Show that there exists a time series $X_t$ that satisfies the equation $\phi(B)X_t = \theta(B)Z_t$ and coincides with the given $X_1, \ldots, X_p$ at times $1, \ldots, p$. What does this imply about existence of solutions if only the $Z_t$ and the polynomials $\phi$ and $\theta$ are given?

In the following theorem we shall see that a stationary solution to the ARMA-equation exists if the polynomial $z \mapsto \phi(z)$ has no roots on the unit circle $S^1 = \{z \in \mathbb C: |z| = 1\}$. To prove this, we need some facts from complex analysis. The function
$$\psi(z) = \frac{\theta(z)}{\phi(z)}$$
is well defined and analytic on the region $\{z \in \mathbb C: \phi(z) \ne 0\}$. If $\phi$ has no roots on the unit circle $S^1$, then, since it has at most $p$ different roots, there is an annulus $\{z: r < |z| < R\}$ with $r < 1 < R$ on which it has no roots. On this annulus $\psi$ is an analytic function, and it has a Laurent series representation
$$\psi(z) = \sum_{j=-\infty}^\infty\psi_jz^j.$$
This series is uniformly and absolutely convergent on every compact subset of the annulus, and the coefficients $\psi_j$ are uniquely determined by the values of $\psi$ on the annulus. In particular, because the unit circle is inside the annulus, we obtain that $\sum_j|\psi_j| < \infty$.

Then we know from Lemma 1.28 that $\psi(B)Z_t$ is a well defined, stationary time series. By the following theorem it is the unique stationary solution to the ARMA-equation. Here the white noise series $Z_t$ and the probability space on which it is defined are considered given. (In analogy with the terminology of stochastic analysis, the latter expresses that there exists a unique "strong" solution.)
8.8 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the complex unit circle, and let $Z_t$ be a white noise process. Define $\psi = \theta/\phi$. Then $X_t = \psi(B)Z_t$ is the unique stationary solution to the equation $\phi(B)X_t = \theta(B)Z_t$. It is also the only solution that is bounded in $L_1$.

Proof. By the rules of calculus justified by Lemma 8.2, $\phi(B)\psi(B)Z_t = \theta(B)Z_t$, because $\phi(z)\psi(z) = \theta(z)$ on an annulus around the unit circle, the series $\sum_j|\phi_j|$ and $\sum_j|\psi_j|$ converge and the time series $Z_t$ is bounded in absolute mean. This proves that $\psi(B)Z_t$ is a solution to the ARMA-equation. It is stationary by Lemma 1.28.

Let $X_t$ be an arbitrary solution to the ARMA equation that is bounded in $L_1$, for instance a stationary solution. The function $\tilde\phi(z) = 1/\phi(z)$ is analytic on an annulus around the unit circle and hence possesses a unique Laurent series representation $\tilde\phi(z) = \sum_j\tilde\phi_jz^j$. Because $\sum_j|\tilde\phi_j| < \infty$, the infinite series $\tilde\phi(B)Y_t$ is well defined for every stationary time series $Y_t$ by Lemma 1.28. By the calculus of Lemma 8.2, $\tilde\phi(B)\phi(B)X_t = X_t$ almost surely, because $\tilde\phi(z)\phi(z) = 1$, the filter coefficients are summable and the time series $X_t$ is bounded in absolute mean. Therefore, the equation $\phi(B)X_t = \theta(B)Z_t$ implies, after multiplying by $\tilde\phi(B)$, that $X_t = \tilde\phi(B)\theta(B)Z_t = \psi(B)Z_t$, again by the calculus of Lemma 8.2, because $\tilde\phi(z)\theta(z) = \psi(z)$. This proves that $\psi(B)Z_t$ is the unique stationary solution to the ARMA-equation.

8.9 EXERCISE. It is certainly not true that $\psi(B)Z_t$ is the only solution to the ARMA-equation. Can you trace where in the preceding proof we use the required stationarity of the solution? Would you agree that the "calculus" of Lemma 8.2 is perhaps more subtle than it appeared to be at first?
Thus the condition that $\phi$ has no roots on the unit circle is sufficient for the existence of a stationary solution. It is almost necessary. The only point is that it is really the quotient $\theta/\phi$ that counts, not the function $\phi$ on its own. If $\phi$ has a zero on the unit circle of the same or smaller multiplicity as $\theta$, then this quotient is still a nice function. Once this possibility is excluded, there can be no stationary solution if $\phi(z) = 0$ for some $z$ with $|z| = 1$.

8.10 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has a root on the unit circle that is not a root of $\theta$, and let $Z_t$ be a white noise process. Then there exists no stationary solution $X_t$ to the equation $\phi(B)X_t = \theta(B)Z_t$.

Proof. Suppose that the contrary is true and let $X_t$ be a stationary solution. Then $X_t$ has a spectral distribution $F_X$, and hence so does the time series $\phi(B)X_t = \theta(B)Z_t$. By Theorem 6.10 and Example 6.6 we must have
$$\bigl|\phi(e^{-i\lambda})\bigr|^2\,dF_X(\lambda) = \bigl|\theta(e^{-i\lambda})\bigr|^2\,\frac{\sigma^2}{2\pi}\,d\lambda.$$
Now suppose that $\phi(e^{-i\lambda_0}) = 0$ and $\theta(e^{-i\lambda_0}) \ne 0$ for some $\lambda_0 \in (-\pi, \pi]$. The preceding display is just an equation between densities of measures and should not be interpreted as being valid for every $\lambda$, so we cannot immediately conclude that there is a contradiction. By differentiability of $\phi$ and continuity of $\theta$ there exist positive numbers $A$ and $B$ and a neighbourhood of $\lambda_0$ on which both $\bigl|\phi(e^{-i\lambda})\bigr| \le A|\lambda - \lambda_0|$ and $\bigl|\theta(e^{-i\lambda})\bigr| \ge B$. Combining this with the preceding display, we see that, for all sufficiently small $\epsilon > 0$,
$$\int_{\lambda_0-\epsilon}^{\lambda_0+\epsilon}A^2|\lambda - \lambda_0|^2\,dF_X(\lambda) \ge \int_{\lambda_0-\epsilon}^{\lambda_0+\epsilon}B^2\,\frac{\sigma^2}{2\pi}\,d\lambda.$$
The left side is bounded above by $A^2\epsilon^2F_X[\lambda_0 - \epsilon, \lambda_0 + \epsilon]$, whereas the right side is equal to $B^2\sigma^2\epsilon/\pi$. This shows that $F_X[\lambda_0 - \epsilon, \lambda_0 + \epsilon] \to \infty$ as $\epsilon \downarrow 0$ and contradicts the fact that $F_X$ is a finite measure.
8.11 Example. The AR(1)-equation $X_t = \phi X_{t-1} + Z_t$ corresponds to the polynomial $\phi(z) = 1 - \phi z$. This has root $\phi^{-1}$. Therefore a stationary solution exists if and only if $|\phi^{-1}| \ne 1$. In the latter case, the Laurent series expansion of $\psi(z) = 1/(1 - \phi z)$ around the unit circle is given by $\psi(z) = \sum_{j=0}^\infty\phi^jz^j$ for $|\phi| < 1$ and is given by $-\sum_{j=1}^\infty\phi^{-j}z^{-j}$ for $|\phi| > 1$. Consequently, the unique stationary solutions in these cases are given by
$$X_t = \begin{cases}\displaystyle\sum_{j=0}^\infty\phi^jZ_{t-j}, & \text{if } |\phi| < 1,\\ \displaystyle-\sum_{j=1}^\infty\frac{1}{\phi^j}Z_{t+j}, & \text{if } |\phi| > 1.\end{cases}$$
This is in agreement, of course, with Example 1.8.
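For the causal case $|\phi| < 1$ the series representation is easy to check numerically. The following sketch (assuming the NumPy library; the truncation lag and parameter values are illustrative) builds $X_t$ by truncating the filter and verifies that the AR(1) equation holds up to a geometrically small truncation error.

```python
import numpy as np

# Build the stationary solution X_t = sum_{j>=0} phi^j Z_{t-j} of the AR(1)
# equation X_t = phi X_{t-1} + Z_t (causal case |phi| < 1) by truncating the
# filter at a large lag; the truncation error is of the order phi^trunc.
rng = np.random.default_rng(1)
phi, n, trunc = 0.8, 1000, 200

z = rng.standard_normal(n + trunc)
psi = phi ** np.arange(trunc + 1)           # 1, phi, phi^2, ..., phi^trunc

# x[i] approximates X_t for t = trunc + i, using the noise Z_t, ..., Z_{t-trunc}.
x = np.array([psi @ z[t - trunc:t + 1][::-1] for t in range(trunc, n + trunc)])

# X_t - phi X_{t-1} - Z_t should vanish up to the truncation error.
residual = x[1:] - phi * x[:-1] - z[trunc + 1:]
print("max |X_t - phi X_{t-1} - Z_t| =", np.abs(residual).max())
```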
8.12 EXERCISE. Investigate the existence of stationary solutions to:
(i) $X_t = \frac12X_{t-1} + \frac12X_{t-2} + Z_t$;
(ii) $X_t = \frac12X_{t-1} + \frac14X_{t-2} + Z_t + \frac12Z_{t-1} + \frac14Z_{t-2}$.
Warning. Some authors require by definition that an ARMA process be stationary. Many authors occasionally forget to say explicitly that they are concerned with a stationary ARMA process. Some authors mistakenly believe that stationarity requires that $\phi$ has no roots inside the unit circle and may fail to recognize that the ARMA equation does not define a process without some sort of initialization.
If given time series $X_t$ and $Z_t$ satisfy the ARMA-equation $\phi(B)X_t = \theta(B)Z_t$, then they also satisfy $r(B)\phi(B)X_t = r(B)\theta(B)Z_t$, for any polynomial $r$. From observed data $X_t$ it is impossible to determine whether $(\phi, \theta)$ or $(r\phi, r\theta)$ are the "right" polynomials. To avoid this problem of indeterminacy, we assume from now on that the ARMA-model is always written in its simplest form. This is when $\phi$ and $\theta$ do not have common factors (are relatively prime in the algebraic sense), or equivalently, when $\phi$ and $\theta$ do not have common (complex) roots. Then, in view of the preceding theorems, a stationary solution $X_t$ to the ARMA-equation exists if and only if $\phi$ has no roots on the unit circle, and this is uniquely given by
$$X_t = \psi(B)Z_t = \sum_j\psi_jZ_{t-j}, \qquad \psi = \frac{\theta}{\phi}.$$
8.13 Definition. An ARMA-process $X_t$ is called causal if, in the preceding representation, the filter $\psi$ is causal: i.e. $\psi_j = 0$ for every $j < 0$.

Thus a causal ARMA-process $X_t$ depends on the present and past values $Z_t, Z_{t-1}, \ldots$ of the noise sequence only. Intuitively, this is a desirable situation, if time is really time and $Z_t$ is really attached to time $t$. We come back to this in Section 8.6.

A mathematically equivalent definition of causality is that the function $\psi(z)$ is analytic in a neighbourhood of the unit disc $\{z \in \mathbb C: |z| \le 1\}$. This follows, because the Laurent series $\sum_{j=-\infty}^\infty\psi_jz^j$ is analytic inside the unit disc if and only if the negative powers of $z$ do not occur. Still another description of causality is that all roots of $\phi$ are outside the unit circle, because only then is the function $\psi = \theta/\phi$ analytic on the unit disc.

The proof of Theorem 8.8 does not use that $Z_t$ is a white noise process, but only that the series $Z_t$ is bounded in $L_1$. Therefore, the same arguments can be used to invert the ARMA-equation in the other direction. If $\theta$ has no roots on the unit circle and $X_t$ is stationary, then $\phi(B)X_t = \theta(B)Z_t$ implies that
$$Z_t = \pi(B)X_t = \sum_j\pi_jX_{t-j}, \qquad \pi = \frac{\phi}{\theta}.$$

8.14 Definition. An ARMA-process $X_t$ is called invertible if, in the preceding representation, the filter $\pi$ is causal: i.e. $\pi_j = 0$ for every $j < 0$.

Equivalent mathematical definitions are that $\pi(z)$ is an analytic function on the unit disc or that $\theta$ has all its roots outside the unit circle. In the definition of invertibility we implicitly assume that $\theta$ has no roots on the unit circle. The general situation is more technical and is discussed in the next section.
* 8.3 Invertibility

In this section we discuss the proper definition of invertibility in the case that $\theta$ has roots on the unit circle. The intended meaning of "invertibility" is that every $Z_t$ can be written as a linear function of the $X_s$ that are prior or simultaneous to $t$ (i.e. $s \le t$). Two reasonable ways to make this precise are:
(i) $Z_t = \sum_{j=0}^\infty\pi_jX_{t-j}$ for a sequence $\pi_j$ such that $\sum_{j=0}^\infty|\pi_j| < \infty$.
(ii) $Z_t$ is contained in the closed linear span of $X_t, X_{t-1}, X_{t-2}, \ldots$ in $L_2(\Omega, \mathcal U, P)$.
In both cases we require that $Z_t$ depends linearly on the prior $X_s$, but the second requirement is weaker. It turns out that if $X_t$ is a stationary ARMA process relative to $Z_t$ and (i) holds, then the polynomial $\theta$ cannot have roots on the unit circle. In that case the definition of invertibility given in the preceding section is appropriate (and equivalent to (i)). However, the requirement (ii) does not exclude the possibility that $\theta$ has zeros on the unit circle. An ARMA process is invertible in the sense of (ii) as soon as $\theta$ does not have roots inside the unit circle.
8.15 Lemma. Let $X_t$ be a stationary ARMA process satisfying $\phi(B)X_t = \theta(B)Z_t$ for polynomials $\phi$ and $\theta$ that are relatively prime.
(i) Then $Z_t = \sum_{j=0}^\infty\pi_jX_{t-j}$ for a sequence $\pi_j$ such that $\sum_{j=0}^\infty|\pi_j| < \infty$ if and only if $\theta$ has no roots on or inside the unit circle.
(ii) If $\theta$ has no roots inside the unit circle, then $Z_t$ is contained in the closed linear span of $X_t, X_{t-1}, X_{t-2}, \ldots$.
Proof. (i). If $\theta$ has no roots on or inside the unit circle, then the ARMA process is invertible by the arguments given previously. We must argue the other direction. If $Z_t$ has the given representation, then consideration of the spectral measures gives
$$\frac{\sigma^2}{2\pi}\,d\lambda = dF_Z(\lambda) = \bigl|\pi(e^{-i\lambda})\bigr|^2\,dF_X(\lambda) = \bigl|\pi(e^{-i\lambda})\bigr|^2\,\Bigl|\frac{\theta(e^{-i\lambda})}{\phi(e^{-i\lambda})}\Bigr|^2\,\frac{\sigma^2}{2\pi}\,d\lambda.$$
Hence $\bigl|\pi(e^{-i\lambda})\theta(e^{-i\lambda})\bigr| = \bigl|\phi(e^{-i\lambda})\bigr|$ Lebesgue almost everywhere. If $\sum_j|\pi_j| < \infty$, then the function $\lambda \mapsto \pi(e^{-i\lambda})$ is continuous, as are the functions $\phi$ and $\theta$, and hence this equality must hold for every $\lambda$. Since $\phi(z)$ has no roots on the unit circle, nor can $\theta(z)$.

(ii). Suppose that $\eta^{-1}$ is a zero of $\theta$, so that $|\eta| \le 1$ and $\theta(z) = (1 - \eta z)\theta_1(z)$ for a polynomial $\theta_1$ of degree $q - 1$. Define $Y_t = \phi(B)X_t$ and $V_t = \theta_1(B)Z_t$, whence $Y_t = V_t - \eta V_{t-1}$. It follows that
$$\sum_{j=0}^{k-1}\eta^jY_{t-j} = \sum_{j=0}^{k-1}\eta^j(V_{t-j} - \eta V_{t-j-1}) = V_t - \eta^kV_{t-k}.$$
If $|\eta| < 1$, then the right side converges to $V_t$ in quadratic mean as $k \to \infty$ and hence it follows that $V_t$ is contained in the closed linear span of $Y_t, Y_{t-1}, \ldots$, which is clearly contained in the closed linear span of $X_t, X_{t-1}, \ldots$, because $Y_t = \phi(B)X_t$. If $q = 1$, then $V_t$ and $Z_t$ are equal up to a constant and the proof is complete. If $q > 1$, then we repeat the argument with $\theta_1$ instead of $\theta$ and $V_t$ in the place of $Y_t$, and we shall be finished after finitely many recursions.

If $|\eta| = 1$, then the right side of the preceding display still converges to $V_t$ as $k \to \infty$, but only in the weak sense that $E(V_t - \eta^kV_{t-k})W \to EV_tW$ for every square integrable variable $W$. This implies that $V_t$ is in the weak closure of $\mathop{\rm lin}(Y_t, Y_{t-1}, \ldots)$, but this is equal to the strong closure by an application of the Hahn-Banach theorem. Thus we arrive at the same conclusion.

To see the weak convergence, note first that the projection of $W$ onto the closed linear span of $(Z_t: t \in \mathbb Z)$ is given by $\sum_j\beta_jZ_j$ for some sequence $\beta_j$ with $\sum_j|\beta_j|^2 < \infty$. Because $V_{t-k} \in \mathop{\rm lin}(Z_s: s \le t - k)$, we have $|EV_{t-k}W| = \bigl|\sum_j\beta_jEV_{t-k}Z_j\bigr| \le \sum_{j\le t-k}|\beta_j|\,{\rm sd}\,V_0\,{\rm sd}\,Z_0 \to 0$ as $k \to \infty$.
8.16 Example. The moving average $X_t = Z_t - Z_{t-1}$ is invertible in the sense of (ii), but not in the sense of (i). The moving average $X_t = Z_t - 1.01Z_{t-1}$ is not invertible.

Thus $X_t = Z_t - Z_{t-1}$ implies that $Z_t \in \mathop{\rm lin}(X_t, X_{t-1}, \ldots)$. An unexpected phenomenon is that it is also true that $Z_t$ is contained in $\mathop{\rm lin}(X_{t+1}, X_{t+2}, \ldots)$. This follows by time reversal: define $U_t = X_{-t+1}$ and $W_t = -Z_{-t}$ and apply the preceding to the processes $U_t = W_t - W_{t-1}$. Thus it appears that the opposite of invertibility is true as well!
8.17 EXERCISE. Suppose that $X_t = \theta(B)Z_t$ for a polynomial $\theta$ of degree $q$ that has all its roots on the unit circle. Show that $Z_t \in \mathop{\rm lin}(X_{t+q}, X_{t+q+1}, \ldots)$. [As in (ii) of the preceding proof, it follows that $V_t = \eta^{-k}\bigl(V_{t+k} - \sum_{j=0}^{k-1}\eta^jX_{t+k-j}\bigr)$. Here the first term on the right side converges weakly to zero as $k \to \infty$.]
8.4 Prediction

As to be expected from their definitions, causality and invertibility are important for calculating predictions for ARMA processes. For a causal and invertible stationary ARMA process $X_t$ satisfying $\phi(B)X_t = \theta(B)Z_t$, we have
$$X_t \in \mathop{\rm lin}(Z_t, Z_{t-1}, \ldots), \qquad \text{(by causality)},$$
$$Z_t \in \mathop{\rm lin}(X_t, X_{t-1}, \ldots), \qquad \text{(by invertibility)}.$$
Here lin, the closed linear span, is the operation of first forming all (finite) linear combinations and next taking the metric closure in $L_2(\Omega, \mathcal U, P)$ of this linear span. Since $Z_t$ is a white noise process, the variable $Z_{t+1}$ is orthogonal to the linear span of $Z_t, Z_{t-1}, \ldots$. By the continuity of the inner product it is then also orthogonal to the closed linear span of $Z_t, Z_{t-1}, \ldots$ and hence, under causality, it is orthogonal to $X_s$ for every $s \le t$. This shows that the variable $Z_{t+1}$ is totally (linearly) unpredictable at time $t$ given the observations $X_1, \ldots, X_t$. This is often interpreted in the sense that the variable $Z_t$ is an external noise variable that is generated at time $t$ independently of the history of the system before time $t$.

8.18 EXERCISE. The preceding argument gives that $Z_{t+1}$ is uncorrelated with the system variables $X_t, X_{t-1}, \ldots$ of the past. Show that if the variables $Z_t$ are independent, then $Z_{t+1}$ is independent of the system up to time $t$, not just uncorrelated.
This general discussion readily gives the structure of the best linear predictor for stationary, causal, auto-regressive processes. Suppose that
$$X_{t+1} = \phi_1X_t + \cdots + \phi_pX_{t+1-p} + Z_{t+1}.$$
If $t \ge p$, then $X_t, \ldots, X_{t-p+1}$ are "perfectly predictable" based on the past variables $X_1, \ldots, X_t$: by themselves. If the series is causal, then $Z_{t+1}$ is totally unpredictable (its best prediction is zero), in view of the preceding discussion. Since a best linear predictor is a projection and projections are linear maps, the best linear predictor of $X_{t+1}$ based on $X_1, \ldots, X_t$ is given by
$$\Pi_tX_{t+1} = \phi_1X_t + \cdots + \phi_pX_{t+1-p}, \qquad (t \ge p).$$
We should be able to obtain this result also from the prediction equations (2.4) and the explicit form of the auto-covariance function, but that calculation would be more complicated.
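In code the one-step predictor is just an inner product of the last $p$ observations with the coefficients. The following sketch (assuming the NumPy library and that the coefficients $\phi_1, \ldots, \phi_p$ are known; the values are illustrative) also confirms that the empirical square prediction error matches the innovation variance $\sigma^2$, in line with the discussion later in this section.

```python
import numpy as np

# One-step prediction for a causal AR(p) process with known coefficients:
# the predictor of X_{t+1} is phi_1 X_t + ... + phi_p X_{t+1-p}, for t >= p.
rng = np.random.default_rng(2)
phi = np.array([0.5, 0.3])     # AR(2); roots of 1 - 0.5z - 0.3z^2 lie outside the unit circle
p, n, sigma = len(phi), 5000, 1.0

z = sigma * rng.standard_normal(n)
x = np.zeros(n)
for t in range(p, n):          # simulate recursively (zero initial values)
    x[t] = phi @ x[t - p:t][::-1] + z[t]

pred = np.array([phi @ x[t - p:t][::-1] for t in range(p, n)])
errors = x[p:] - pred          # these are exactly the innovations Z_t
print("empirical mean square prediction error:", np.mean(errors ** 2))
print("innovation variance sigma^2:           ", sigma ** 2)
```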
8.19 EXERCISE. Find a formula for the best linear predictor of $X_{t+2}$ based on $X_1, \ldots, X_t$, if $t \ge p - 1$.
For moving average and general ARMA processes the situation is more complicated. A similar argument works only for computing the best linear predictor $\Pi_{-\infty,t}X_{t+1}$ based on the infinite past $X_t, X_{t-1}, \ldots$ down to time $-\infty$. Assume that $X_t$ is a causal and invertible stationary ARMA process satisfying
$$X_{t+1} = \phi_1X_t + \cdots + \phi_pX_{t+1-p} + Z_{t+1} + \theta_1Z_t + \cdots + \theta_qZ_{t+1-q}.$$
By causality the variable $Z_{t+1}$ is completely unpredictable. By invertibility the variable $Z_s$ is perfectly predictable based on $X_s, X_{s-1}, \ldots$ and hence is perfectly predictable based on $X_t, X_{t-1}, \ldots$ for every $s \le t$. Therefore,
$$\Pi_{-\infty,t}X_{t+1} = \phi_1X_t + \cdots + \phi_pX_{t+1-p} + \theta_1Z_t + \cdots + \theta_qZ_{t+1-q}.$$
The practical importance of this formula is small, because we never observe the complete past. However, if we observe a long series $X_1, \ldots, X_t$, then the distant past $X_0, X_{-1}, \ldots$ will not give much additional information over the recent past $X_t, \ldots, X_1$, and $\Pi_{-\infty,t}X_{t+1}$ and $\Pi_tX_{t+1}$ will be close.
8.20 Lemma. For a stationary causal, invertible ARMA process $X_t$ there exist constants $C$ and $c < 1$ such that $E\bigl|\Pi_{-\infty,t}X_{t+1} - \Pi_tX_{t+1}\bigr|^2 \le Cc^t$ for every $t$.
Proof. By invertibility we can express $Z_s$ as $Z_s = \sum_{j=0}^\infty\pi_jX_{s-j}$, for $(\pi_j)$ the coefficients of the Taylor expansion of $\pi = \phi/\theta$ around 0. Because this power series has convergence radius bigger than 1, it follows that $\sum_j|\pi_j|R^j < \infty$ for some $R > 1$, and hence $|\pi_j| \le C_1c^j$, for some constants $C_1 > 0$, $c < 1$ and every $j$. The variables $X_{s-j}$ with $1 \le s - j \le t$ are perfectly predictable using $X_1, \ldots, X_t$. We conclude that for $s \le t$
$$Z_s - \Pi_tZ_s = \sum_{j\ge s}\pi_j(X_{s-j} - \Pi_tX_{s-j}).$$
It follows that $\|Z_s - \Pi_tZ_s\|_2 \le \sum_{j\ge s}C_1c^j\,2\|X_1\|_2 \le C_2c^s$, for every $s \le t$.

In view of the formula for $\Pi_{-\infty,t}X_{t+1}$ given preceding the lemma and the towering property of projections, the difference between $\Pi_{-\infty,t}X_{t+1}$ and $\Pi_tX_{t+1}$ is equal to $\sum_{j=1}^q\theta_j(Z_{t+1-j} - \Pi_tZ_{t+1-j})$. Thus the $L_2$-norm of this difference is bounded above by $\sum_{j=1}^q|\theta_j|\,\|Z_{t+1-j} - \Pi_tZ_{t+1-j}\|_2$. The lemma follows by inserting the bound obtained in the preceding paragraph.
Thus for causal stationary auto-regressive processes the error when predicting $X_{t+1}$ at time $t > p$ using $X_t, \ldots, X_1$ is equal to $Z_{t+1}$. For general stationary ARMA-processes this is approximately true for large $t$, and it is exactly true if we include the variables $X_0, X_{-1}, \ldots$ in the prediction. It follows that the square prediction error $E|X_{t+1} - \Pi_tX_{t+1}|^2$ is equal to the variance $\sigma^2 = EZ_1^2$ of the innovations, exactly in the auto-regressive case, and approximately for general stationary ARMA-processes.

If the innovations are large, the quality of the predictions will be small. Irrespective of the size of the innovations, the possibility of prediction of ARMA-processes into the distant future is limited. The following lemma shows that the predictors converge exponentially fast to the trivial predictor, the constant mean, 0.
8.21 Lemma. For a stationary causal, invertible ARMA process $X_t$ there exist constants $C$ and $c < 1$ such that $E|\Pi_tX_{t+s}|^2 \le Cc^s$ for every $s$ and $t$.
Proof. By causality, for $s > q$ the variable $\theta(B)Z_{t+s}$ is orthogonal to the variables $X_1, \ldots, X_t$, and hence cannot be predicted. Thus the linearity of prediction and the ARMA equation $X_{t+s} = \phi_1X_{t+s-1} + \cdots + \phi_pX_{t+s-p} + \theta(B)Z_{t+s}$ yield that the variables $Y_s = \Pi_tX_{t+s}$ satisfy $Y_s = \phi_1Y_{s-1} + \cdots + \phi_pY_{s-p}$, for $s > q$. Writing this equation in vector form and iterating, we obtain
$$\begin{pmatrix}Y_s\\Y_{s-1}\\\vdots\\Y_{s-p+1}\end{pmatrix} = \begin{pmatrix}\phi_1&\phi_2&\cdots&\phi_{p-1}&\phi_p\\1&0&\cdots&0&0\\\vdots&&\ddots&&\vdots\\0&0&\cdots&1&0\end{pmatrix}\begin{pmatrix}Y_{s-1}\\Y_{s-2}\\\vdots\\Y_{s-p}\end{pmatrix} = \Phi^{s-p}\begin{pmatrix}Y_p\\Y_{p-1}\\\vdots\\Y_1\end{pmatrix},$$
where $\Phi$ is the matrix in the middle expression. By causality the spectral radius of the matrix $\Phi$ is strictly smaller than 1, and this implies that $\|\Phi^s\| \le Cc^s$ for some constant $c < 1$. (See the proof of Theorem 8.32.) In particular, it follows that $|Y_s|^2 \le Cc^s\sum_{i=1}^pY_i^2$, for $s > q$. The inequality of the lemma follows by taking expectations. For $s \le q$ the inequality is trivially valid for sufficiently large $C$, because the left side is bounded.
8.5 Auto Correlation and Spectrum

The spectral density of an ARMA representation is immediate from the representation $X_t = \psi(B)Z_t$ and Theorem 6.10.

8.22 Theorem. The stationary ARMA process satisfying $\phi(B)X_t = \theta(B)Z_t$ possesses a spectral density given by
$$f_X(\lambda) = \Bigl|\frac{\theta(e^{-i\lambda})}{\phi(e^{-i\lambda})}\Bigr|^2\,\frac{\sigma^2}{2\pi}.$$

[Figure 8.1. Spectral density of the AR series satisfying $X_t - 1.5X_{t-1} + 0.9X_{t-2} - 0.2X_{t-3} + 0.1X_{t-9} = Z_t$. (Vertical axis in decibels, i.e. it gives the logarithm of the spectral density.)]
8.23 EXERCISE. Plot the spectral densities of the following time series:
(i) $X_t = Z_t + 0.9Z_{t-1}$;
(ii) $X_t = Z_t - 0.9Z_{t-1}$;
(iii) $X_t - 0.7X_{t-1} = Z_t$;
(iv) $X_t + 0.7X_{t-1} = Z_t$;
(v) $X_t - 1.5X_{t-1} + 0.9X_{t-2} - 0.2X_{t-3} + 0.1X_{t-9} = Z_t$.
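A possible way to produce such plots is to evaluate the formula of Theorem 8.22 directly. The sketch below (assuming NumPy and matplotlib; the helper function and grid size are illustrative) does this for series (v); the other series only require different coefficient vectors.

```python
import numpy as np
import matplotlib.pyplot as plt

def arma_spectral_density(phi, theta, sigma2=1.0, grid=500):
    """Spectral density of phi(B)X_t = theta(B)Z_t, by the formula of Theorem 8.22.

    phi and theta are coefficient arrays: phi(z) = phi[0] + phi[1] z + ...
    """
    lam = np.linspace(0, np.pi, grid)
    z = np.exp(-1j * lam)
    num = np.polyval(theta[::-1], z)        # theta(e^{-i lambda})
    den = np.polyval(phi[::-1], z)          # phi(e^{-i lambda})
    return lam, np.abs(num / den) ** 2 * sigma2 / (2 * np.pi)

# Series (v): phi(z) = 1 - 1.5z + 0.9z^2 - 0.2z^3 + 0.1z^9, theta(z) = 1.
phi = np.zeros(10)
phi[[0, 1, 2, 3, 9]] = [1, -1.5, 0.9, -0.2, 0.1]
lam, f = arma_spectral_density(phi, np.array([1.0]))

plt.plot(lam, 10 * np.log10(f))             # decibel scale, as in Figure 8.1
plt.xlabel("frequency")
plt.ylabel("spectral density (dB)")
plt.show()
```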
Finding a simple expression for the auto-covariance function is harder, except for the special case of moving average processes, for which the auto-covariances can be expressed in the parameters $\theta_1, \ldots, \theta_q$ by a direct computation (cf. Example 1.6 and Lemma 1.28). The auto-covariances of a general stationary ARMA process can be solved from a system of equations. In view of Lemma 1.28(iii), the equation $\phi(B)X_t = \theta(B)Z_t$ leads to the identities, with $\phi(z) = \sum_j\phi_jz^j$ and $\theta(z) = \sum_j\theta_jz^j$,
$$\sum_l\Bigl(\sum_j\phi_j\phi_{j+l-h}\Bigr)\gamma_X(l) = \sigma^2\sum_j\theta_j\theta_{j+h}, \qquad h \in \mathbb Z.$$
In principle this system of equations can be solved for the values $\gamma_X(l)$.

An alternative method to compute the auto-covariance function is to write $X_t = \psi(B)Z_t$ for $\psi = \theta/\phi$, whence, by Lemma 1.28(iii),
$$\gamma_X(h) = \sigma^2\sum_j\psi_j\psi_{j+h}.$$
This requires the computation of the coefficients $\psi_j$, which can be expressed in the coefficients of $\phi$ and $\theta$ by comparing coefficients in the power series equation $\phi(z)\psi(z) = \theta(z)$. Even in simple situations many $\psi_j$ must be taken into account.
8.24 Example. For the AR(1) series $X_t = \phi X_{t-1} + Z_t$ with $|\phi| < 1$ we obtain $\psi(z) = (1 - \phi z)^{-1} = \sum_{j=0}^\infty\phi^jz^j$. Therefore, $\gamma_X(h) = \sigma^2\sum_{j=0}^\infty\phi^{j+h}\phi^j = \sigma^2\phi^h/(1 - \phi^2)$ for $h \ge 0$.
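The formula can be double-checked against the general expression $\gamma_X(h) = \sigma^2\sum_j\psi_j\psi_{j+h}$ by truncating the sum, as in the following sketch (assuming the NumPy library; the parameter values are illustrative).

```python
import numpy as np

# Check gamma_X(h) = sigma^2 phi^h / (1 - phi^2) for the AR(1) process against
# the general formula gamma_X(h) = sigma^2 sum_j psi_j psi_{j+h}, psi_j = phi^j.
phi, sigma2 = 0.8, 2.0
psi = phi ** np.arange(200)                 # truncated psi-weights

for h in range(4):
    series = sigma2 * np.sum(psi[: len(psi) - h] * psi[h:])
    closed = sigma2 * phi ** h / (1 - phi ** 2)
    print(h, round(series, 6), round(closed, 6))
```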
8.25 EXERCISE. Find $\gamma_X(h)$ for the stationary ARMA(1, 1) series $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$ with $|\phi| < 1$.

* 8.26 EXERCISE. Show that the auto-covariance function of a stationary ARMA process decreases exponentially. Give an estimate of the constant in the exponent in terms of the distance of the zeros of $\phi$ to the unit circle.
A third method to express the auto-covariance function in the coefficients of the polynomials $\phi$ and $\theta$ uses the spectral representation
$$\gamma_X(h) = \int_{-\pi}^\pi e^{ih\lambda}f_X(\lambda)\,d\lambda = \frac{\sigma^2}{2\pi i}\oint_{|z|=1}z^{h-1}\frac{\theta(z)\theta(z^{-1})}{\phi(z)\phi(z^{-1})}\,dz.$$
The second integral is a contour integral along the positively oriented unit circle in the complex plane. We have assumed that the coefficients of the polynomials $\phi$ and $\theta$ are real, so that $\theta(z)\theta(z^{-1}) = \theta(z)\overline{\theta(z)} = |\theta(z)|^2$ for every $z$ on the unit circle, and similarly for $\phi$. The next step is to evaluate the contour integral with the help of the residue theorem from complex function theory.† The poles of the integrand are contained in the set consisting of the zeros $v_i$ and their inverses $v_i^{-1}$ of $\phi$ and possibly the point 0. The auto-covariance function can be written as a function of the residues at these points.

† Remember that the residue at a point $z_0$ of a meromorphic function $f$ is the coefficient $a_{-1}$ in its Laurent series representation $f(z) = \sum_ja_j(z - z_0)^j$ around $z_0$. If $f$ can be written as $h(z)/(z - z_0)^k$ for a function $h$ that is analytic on a neighbourhood of $z_0$, then $a_{-1}$ can be computed as $h^{(k-1)}(z_0)/(k - 1)!$. For $k = 1$ it is also $\lim_{z\to z_0}(z - z_0)f(z)$.
8.27 Example (ARMA(1, 1)). Consider the stationary ARMA(1, 1) series $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$ with $0 < |\phi| < 1$. The corresponding function $\phi(z)\phi(z^{-1})$ has zeros of multiplicity 1 at the points $\phi^{-1}$ and $\phi$. Both points yield a pole of first order for the integrand in the contour integral. The number $\phi^{-1}$ is outside the unit circle, so we only need to compute the residue at the second point. The function $\theta(z^{-1})/\phi(z^{-1}) = (z + \theta)/(z - \phi)$ is analytic on the unit disk except at the point $\phi$ and hence does not contribute other poles, but the term $z^{h-1}$ may contribute a pole at 0. For $h \ge 1$ the integrand has poles at $\phi$ and $\phi^{-1}$ only and hence
$$\gamma_X(h) = \sigma^2\mathop{\rm res}_{z=\phi}\,z^{h-1}\frac{(1 + \theta z)(1 + \theta z^{-1})}{(1 - \phi z)(1 - \phi z^{-1})} = \sigma^2\phi^h\frac{(1 + \theta\phi)(1 + \theta/\phi)}{1 - \phi^2}.$$
For $h = 0$ the integrand has an additional pole at $z = 0$ and the integral evaluates to the sum of the residues at the two poles at $z = 0$ and $z = \phi$. The first residue is equal to $-\theta/\phi$. Thus
$$\gamma_X(0) = \sigma^2\Bigl(\frac{(1 + \theta\phi)(1 + \theta/\phi)}{1 - \phi^2} - \frac{\theta}{\phi}\Bigr).$$
The values of $\gamma_X(h)$ for $h < 0$ follow by symmetry.
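The outcome of the residue calculation can be verified numerically through the $\psi$-weights of the ARMA(1, 1) process, which are $\psi_0 = 1$ and $\psi_j = \phi^{j-1}(\phi + \theta)$ for $j \ge 1$, combined with $\gamma_X(h) = \sigma^2\sum_j\psi_j\psi_{j+h}$. A minimal sketch (assuming the NumPy library; parameter values are illustrative):

```python
import numpy as np

# Numerical check of the ARMA(1,1) auto-covariances obtained by the residue
# theorem, using gamma_X(h) = sigma^2 sum_j psi_j psi_{j+h} with the
# psi-weights psi_0 = 1 and psi_j = phi^(j-1) (phi + theta) for j >= 1.
phi, theta, sigma2 = 0.6, 0.4, 1.0

J = 300
psi = np.empty(J)
psi[0] = 1.0
psi[1:] = phi ** np.arange(J - 1) * (phi + theta)

gamma0_series = sigma2 * np.sum(psi * psi)
gamma0_residue = sigma2 * ((1 + theta * phi) * (1 + theta / phi) / (1 - phi ** 2)
                           - theta / phi)
gamma1_series = sigma2 * np.sum(psi[:-1] * psi[1:])
gamma1_residue = sigma2 * phi * (1 + theta * phi) * (1 + theta / phi) / (1 - phi ** 2)

print(round(gamma0_series, 8), round(gamma0_residue, 8))
print(round(gamma1_series, 8), round(gamma1_residue, 8))
```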
8.28 EXERCISE. Find the auto-covariance function for an MA(q) process by using the residue theorem. (This is not easier than the direct derivation, but perhaps instructive.)

We do not present an additional method to compute the partial auto-correlation function of an ARMA process. However, we make the important observation that for a causal AR(p) process the partial auto-correlations $\alpha_X(h)$ of lags $h > p$ vanish.

8.29 Theorem. For a causal AR(p) process, the partial auto-correlations $\alpha_X(h)$ of lags $h > p$ are zero.

Proof. This follows by combining Lemma 2.33 and the expression for the best linear predictor found in the preceding section.
8.6 Existence of Causal and Invertible Solutions

In practice we never observe the white noise process $Z_t$ in the definition of an ARMA process. The $Z_t$ are "hidden variables", whose existence is hypothesized to explain the observed series $X_t$. From this point of view our earlier question of existence of a stationary solution to the ARMA equation is perhaps not the right question, as it took the sequence $Z_t$ as given. In this section we turn this question around and consider an ARMA(p, q) process $X_t$ as given. We shall see that there are at least $2^{p+q}$ white noise processes $Z_t$ such that $\phi(B)X_t = \theta(B)Z_t$ for certain polynomials $\phi$ and $\theta$ of degrees $p$ and $q$, respectively. (These polynomials depend on the choice of $Z_t$ and hence are not necessarily the ones that are initially given.) Thus the white noise process $Z_t$ is far from being uniquely determined by the observed series $X_t$. On the other hand, among the multitude of solutions, only one choice yields a representation of $X_t$ as a stationary ARMA process that is both causal and invertible.
8.30 Theorem. For every stationary ARMA process $X_t$ satisfying $\phi(B)X_t = \theta(B)Z_t$ for polynomials $\phi$ and $\theta$ such that $\theta$ has no roots on the unit circle, there exist polynomials $\phi^*$ and $\theta^*$ of the same or smaller degrees as $\phi$ and $\theta$ that have all roots outside the unit disc and a white noise process $Z_t^*$ such that $\phi^*(B)X_t = \theta^*(B)Z_t^*$, almost surely, for every $t \in \mathbb Z$.

Proof. The existence of the stationary ARMA process $X_t$ and our implicit assumption that $\phi$ and $\theta$ are relatively prime imply that $\phi$ has no roots on the unit circle. Thus all roots of $\phi$ and $\theta$ are either inside or outside the unit circle. We shall show that we can move the roots inside the unit circle to roots outside the unit circle by a filtering procedure. Suppose that
$$\phi(z) = \phi_p(z - v_1)\cdots(z - v_p), \qquad \theta(z) = \theta_q(z - w_1)\cdots(z - w_q).$$
Consider any zero $z_i$ of $\phi$ or $\theta$. If $|z_i| < 1$, then we replace the term $(z - z_i)$ in the above products by the term $(1 - \bar z_iz)$; otherwise we keep $(z - z_i)$. For $z_i = 0$, this means that we drop the term $z - z_i = z$ and the degree of the polynomial decreases; otherwise, the degree remains the same. We apply this procedure to all zeros $v_i$ and $w_i$ and denote the resulting polynomials by $\phi^*$ and $\theta^*$. Because $0 < |z_i| < 1$ implies that $|\bar z_i^{-1}| > 1$, the polynomials $\phi^*$ and $\theta^*$ have all zeros outside the unit circle. For $z$ in a neighbourhood of the unit circle,
$$\frac{\theta(z)}{\phi(z)} = \frac{\theta^*(z)}{\phi^*(z)}\,\kappa(z), \qquad \kappa(z) = \prod_{i:|v_i|<1}\frac{1 - \bar v_iz}{z - v_i}\ \prod_{i:|w_i|<1}\frac{z - w_i}{1 - \bar w_iz}.$$
Because $X_t = (\theta/\phi)(B)Z_t$ and we want that $X_t = (\theta^*/\phi^*)(B)Z_t^*$, we define the process $Z_t^*$ by $Z_t^* = \kappa(B)Z_t$. This is to be understood in the sense that we expand $\kappa(z)$ in its Laurent series $\kappa(z) = \sum_j\kappa_jz^j$ and apply the corresponding linear filter to $Z_t$.

By construction we now have that $\phi^*(B)X_t = \theta^*(B)Z_t^*$. If $|z| = 1$, then $|1 - \bar z_iz| = |z - z_i|$. In view of the definition of $\kappa$ this implies that $|\kappa(z)| = 1$ for every $z$ on the unit circle and hence $Z_t^*$ has spectral density
$$f_{Z^*}(\lambda) = \bigl|\kappa(e^{-i\lambda})\bigr|^2f_Z(\lambda) = 1\cdot\frac{\sigma^2}{2\pi}.$$
This shows that $Z_t^*$ is a white noise process, as desired.
As are many results in time series analysis, the preceding theorem is a result on second moments only. For instance, if $Z_t$ is an i.i.d. sequence, then the theorem does not guarantee that $Z_t^*$ is an i.i.d. sequence as well. Only first and second moments are preserved by the filtering procedure in the proof, in general. Nevertheless, the theorem is often interpreted as implying that not much is lost by assuming a-priori that $\phi$ and $\theta$ have all their roots outside the unit circle.

8.31 EXERCISE. Suppose that the time series $Z_t$ is Gaussian. Show that the series $Z_t^*$ constructed in the preceding proof is Gaussian and hence i.i.d..
8.7 Stability

Let $\phi$ and $\theta$ be polynomials of degrees $p \ge 1$ and $q \ge 0$, with $\phi$ having no roots on the unit circle. Given initial values $X_1, \ldots, X_p$ and a process $Z_t$, we can recursively define a solution to the ARMA equation $\phi(B)X_t = \theta(B)Z_t$ by
$$(8.1)\qquad\begin{aligned}X_t &= \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + \theta(B)Z_t, && t > p,\\ X_{t-p} &= \frac{1}{\phi_p}\bigl(X_t - \phi_1X_{t-1} - \cdots - \phi_{p-1}X_{t-p+1} - \theta(B)Z_t\bigr), && t - p < 1.\end{aligned}$$
Theorem 8.8 shows that the only solution $X_t$ that is bounded in $L_1$ is the (unique) stationary solution $X_t = \psi(B)Z_t$, for $\psi = \theta/\phi$. Hence the solution (8.1) can only be bounded if the initial values $X_1, \ldots, X_p$ are chosen randomly according to the stationary distribution. In particular, the process $X_t$ obtained from deterministic initial values must necessarily be unbounded. The unboundedness refers to the process $(X_t: t \in \mathbb Z)$ on the full time scale $\mathbb Z$.

In this section we show that in the causal situation, when $\phi$ has no zeros on the unit disc, the process $X_t$ tends to stationarity as $t \to \infty$, given arbitrary initial values. Hence in this case the unboundedness occurs as $t \to -\infty$. This is another reason to prefer the case that $\phi$ has no roots on the unit disc: in this case the effect of initializing the process wears off as time goes by.

Let $Z_t$ be a given white noise process and let $(X_1, \ldots, X_p)$ and $(\tilde X_1, \ldots, \tilde X_p)$ be two possible sets of initial values, consisting of random variables defined on the same probability space.
8.32 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the unit disc. Let $X_t$ and $\tilde X_t$ be the ARMA processes as defined in (8.1) with initial values $(X_1, \ldots, X_p)$ and $(\tilde X_1, \ldots, \tilde X_p)$, respectively. Then $X_t - \tilde X_t \to 0$ almost surely as $t \to \infty$.

8.33 Corollary. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the unit disc. If $X_t$ is an ARMA process with arbitrary initial values, then the vector $(X_t, \ldots, X_{t+k})$ converges in distribution to the distribution of the stationary solution to the ARMA equation, as $t \to \infty$, for every fixed $k$.

Proofs. For the corollary we take $(\tilde X_1, \ldots, \tilde X_p)$ equal to the values of the stationary solution. Then we can conclude that the difference between $X_t$ and the stationary solution converges to zero almost surely and hence also in distribution.
For the proof of the theorem we write the ARMA relationship in the "state space form", for $t > p$,
$$\begin{pmatrix}X_t\\X_{t-1}\\\vdots\\X_{t-p+1}\end{pmatrix} = \begin{pmatrix}\phi_1&\phi_2&\cdots&\phi_{p-1}&\phi_p\\1&0&\cdots&0&0\\\vdots&&\ddots&&\vdots\\0&0&\cdots&1&0\end{pmatrix}\begin{pmatrix}X_{t-1}\\X_{t-2}\\\vdots\\X_{t-p}\end{pmatrix} + \begin{pmatrix}\theta(B)Z_t\\0\\\vdots\\0\end{pmatrix}.$$
Denote this system by $Y_t = \Phi Y_{t-1} + B_t$. By some algebra it can be shown that
$$\det(\Phi - zI) = (-1)^pz^p\phi(z^{-1}), \qquad z \ne 0.$$
Thus the assumption that $\phi$ has no roots on the unit disc implies that the eigenvalues of $\Phi$ are all inside the unit circle. In other words, the spectral radius of $\Phi$, the maximum of the moduli of the eigenvalues, is strictly less than 1. Because the sequence $\|\Phi^n\|^{1/n}$ converges to the spectral radius as $n \to \infty$, we can conclude that $\|\Phi^n\|^{1/n}$ is strictly less than 1 for all sufficiently large $n$, and hence $\|\Phi^n\| \to 0$ as $n \to \infty$. (In fact, if $\|\Phi^{n_0}\| =: c < 1$, then $\|\Phi^{kn_0}\| \le c^k$ for every natural number $k$ and hence $\|\Phi^n\| \le Cc^{n/n_0}$ for every $n$, for $C = \max_{0\le j\le n_0}\|\Phi^j\|$, and the convergence to zero is exponentially fast.)

If $\tilde Y_t$ relates to $\tilde X_t$ as $Y_t$ relates to $X_t$, then $Y_t - \tilde Y_t = \Phi^{t-p}(Y_p - \tilde Y_p) \to 0$ almost surely as $t \to \infty$.
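The spectral radius argument is easy to check numerically: the companion matrix $\Phi$ built from $\phi_1, \ldots, \phi_p$ has all its eigenvalues inside the unit circle precisely when $\phi$ has all its roots outside the unit disc, and its powers then decay geometrically. A minimal sketch (assuming the NumPy library; the coefficients are those of Exercise 8.35 below):

```python
import numpy as np

def companion(phi):
    """Companion matrix of the AR polynomial 1 - phi_1 z - ... - phi_p z^p."""
    p = len(phi)
    m = np.zeros((p, p))
    m[0, :] = phi
    m[1:, :-1] = np.eye(p - 1)
    return m

# AR part of the process in Exercise 8.35: phi(z) = 1 - 1.3 z + 0.7 z^2.
m = companion([1.3, -0.7])
print("spectral radius:", np.max(np.abs(np.linalg.eigvals(m))))     # < 1

powers = [np.linalg.norm(np.linalg.matrix_power(m, k)) for k in (1, 5, 10, 20, 40)]
print("norms of powers (geometric decay):", np.round(powers, 4))
```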
8.34 EXERCISE. Suppose that $\phi(z)$ has no zeros on the unit circle and at least one zero inside the unit circle. Show that there exist initial values $(X_1, \ldots, X_p)$ such that the resulting process $X_t$ is not bounded in probability as $t \to \infty$. [Let $\tilde X_t$ be the stationary solution and let $X_t$ be the solution given initial values $(X_1, \ldots, X_p)$. Then, with notation as in the preceding proof, $Y_t - \tilde Y_t = \Phi^{t-p}(Y_p - \tilde Y_p)$. Choose an appropriate deterministic vector for $Y_p - \tilde Y_p$.]
8.35 EXERCISE. Simulate a series of length 200 of the ARMA process satisfying $X_t - 1.3X_{t-1} + 0.7X_{t-2} = Z_t + 0.7Z_{t-1}$. Plot the sample path, and the sample auto-correlation and sample partial auto-correlation functions. Vary the distribution of the innovations and simulate both stationary and nonstationary versions of the process.
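A possible starting point for this exercise is the following sketch (assuming NumPy and matplotlib; the burn-in length and the direct sample auto-correlation computation are illustrative choices, and the partial auto-correlations can be added via the Durbin-Levinson recursion or a statistical library).

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate X_t - 1.3 X_{t-1} + 0.7 X_{t-2} = Z_t + 0.7 Z_{t-1} recursively.
# Discarding a burn-in stretch gives an (approximately) stationary version;
# keeping the start (zero initial values) shows the effect of initialization.
rng = np.random.default_rng(3)
n, burn = 200, 500
z = rng.standard_normal(n + burn)           # vary this distribution as asked

x = np.zeros(n + burn)
for t in range(2, n + burn):
    x[t] = 1.3 * x[t - 1] - 0.7 * x[t - 2] + z[t] + 0.7 * z[t - 1]
x = x[burn:]                                # drop the burn-in

def sample_acf(x, maxlag=25):
    x = x - x.mean()
    g = np.array([np.sum(x[: len(x) - h] * x[h:]) for h in range(maxlag + 1)])
    return g / g[0]

fig, axes = plt.subplots(2, 1)
axes[0].plot(x)
axes[0].set_title("sample path")
axes[1].bar(range(26), sample_acf(x))
axes[1].set_title("sample auto-correlation function")
plt.tight_layout()
plt.show()
```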
8.8 ARIMA Processes

In Chapter 1 differencing was introduced as a method to transform a nonstationary time series into a stationary one. This method is particularly attractive in combination with ARMA modelling: in the notation of the present chapter the differencing filters can be written as
$$\nabla X_t = (1 - B)X_t, \qquad \nabla^dX_t = (1 - B)^dX_t, \qquad \nabla_kX_t = (1 - B^k)X_t.$$
Thus the differencing filters $\nabla$, $\nabla^d$ and $\nabla_k$ correspond to applying the operator $\phi(B)$ for the polynomials $\phi(z) = 1 - z$, $\phi(z) = (1 - z)^d$ and $\phi(z) = (1 - z^k)$, respectively. These polynomials have in common that all their roots are on the complex unit circle. Thus they were "forbidden" polynomials in our preceding discussion of ARMA processes.

By Theorem 8.10, for the three given polynomials the series $Y_t = \phi(B)X_t$ cannot be a stationary ARMA process relative to polynomials without zeros on the unit circle if $X_t$ is a stationary process. On the other hand, the series $Y_t = \phi(B)X_t$ can well be a stationary ARMA process if $X_t$ is a non-stationary time series. Thus we can use polynomials $\phi$ with roots on the unit circle to extend the domain of ARMA modelling to nonstationary time series.

8.36 Definition. A time series $X_t$ is an ARIMA(p, d, q) process if $\nabla^dX_t$ is a stationary ARMA(p, q) process.

In other words, the time series $X_t$ is an ARIMA(p, d, q) process if there exist polynomials $\phi$ and $\theta$ of degrees $p$ and $q$ and a white noise series $Z_t$ such that the time series $\nabla^dX_t$ is stationary and $\phi(B)\nabla^dX_t = \theta(B)Z_t$ almost surely. The additional "I" in ARIMA is for "integrated". If we view taking differences $\nabla^d$ as differentiating, then the definition requires that a derivative of $X_t$ is a stationary ARMA process, whence $X_t$ itself is an "integrated ARMA process".
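In code the differencing filters amount to repeated applications of numpy.diff, or an explicit lagged difference for $\nabla_k$; fitting an ARIMA(p, d, q) model then means fitting an ARMA(p, q) model to the differenced series. A minimal sketch (assuming the NumPy library; the random-walk example is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# ARIMA(0, 1, 0) example: a random walk X_t, whose first difference
# (1 - B)X_t = Z_t is white noise, hence a (trivial) stationary ARMA process.
z = rng.standard_normal(n)
x = np.cumsum(z)

d1 = np.diff(x, n=1)                        # (1 - B) X_t
d2 = np.diff(x, n=2)                        # (1 - B)^2 X_t
k = 12
dk = x[k:] - x[:-k]                         # (1 - B^k) X_t, the lag-k difference

print(np.allclose(d1, z[1:]))               # True: differencing recovers the noise
print(len(x), len(d1), len(d2), len(dk))
```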
The following definition goes a step further.

8.37 Definition. A time series $X_t$ is a SARIMA(p, d, q)(P, D, Q, per) process if there exist polynomials $\phi$, $\theta$, $\Phi$ and $\Theta$ of degrees $p$, $q$, $P$ and $Q$ and a white noise series $Z_t$ such that the time series $\nabla_{per}^D\nabla^dX_t$ is stationary and $\Phi(B^{per})\phi(B)\nabla_{per}^D\nabla^dX_t = \Theta(B^{per})\theta(B)Z_t$ almost surely.

The "S" in SARIMA is short for "seasonal". The idea of a seasonal model is that we might only want to use certain powers $B^{per}$ of the backshift operator in our model, because the series is thought to have a certain period. Including the terms $\Phi(B^{per})$ and $\Theta(B^{per})$ does not make the model more general (as these terms could be subsumed in $\phi(B)$ and $\theta(B)$), but reflects our a-priori idea that certain coefficients in the polynomials are zero. This a-priori knowledge will be important when estimating the coefficients from an observed time series.

Modelling an observed time series by an ARIMA, or SARIMA, model has become popular through an influential book by Box and Jenkins. The unified filtering paradigm of a Box-Jenkins analysis is indeed attractive. The popularity is probably also due to the compelling manner in which Box and Jenkins explain the reader how he or she must set up the analysis, going through a fixed number of steps. They thus provide the data-analyst with a clear algorithm to carry out an analysis that is intrinsically difficult. It is obvious that the results of such an analysis will not always be good, but an alternative is less obvious.
132 8: ARIMA Processes
* 8.9 VARMA Processes

A VARMA process is a vector-valued ARMA process. Given matrices $\Phi_j$ and $\Theta_j$ and a white noise sequence $Z_t$ of dimension $d$, a VARMA(p, q) process satisfies the relationship
$$X_t = \Phi_1X_{t-1} + \Phi_2X_{t-2} + \cdots + \Phi_pX_{t-p} + Z_t + \Theta_1Z_{t-1} + \cdots + \Theta_qZ_{t-q}.$$
The theory for VARMA processes closely resembles the theory for ARMA processes. The role of the polynomials $\phi$ and $\theta$ is taken over by the matrix-valued polynomials
$$\Phi(z) = 1 - \Phi_1z - \Phi_2z^2 - \cdots - \Phi_pz^p,$$
$$\Theta(z) = 1 + \Theta_1z + \Theta_2z^2 + \cdots + \Theta_qz^q.$$
These identities and sums are to be interpreted entry-wise and hence $\Phi$ and $\Theta$ are $(d\times d)$-matrices with entries that are polynomials in $z\in\mathbb{C}$.

Instead of looking at zeros of polynomials we must now look at the values of $z$ for which the matrices $\Phi(z)$ and $\Theta(z)$ are singular. Equivalently, we must look at the zeros of the complex functions $z\mapsto\det\Phi(z)$ and $z\mapsto\det\Theta(z)$. Apart from this difference, the conditions for existence of a stationary solution, causality and invertibility are the same.
8.38 Theorem. If the matrix-valued polynomial $\Phi(z)$ is invertible for every $z$ on the unit circle, then there exists a unique stationary solution $X_t$ to the VARMA equations. If the matrix-valued polynomial $\Phi(z)$ is invertible for every $z$ on the unit disc, then this can be written in the form $X_t = \sum_{j=0}^{\infty}\Psi_jZ_{t-j}$ for matrices $\Psi_j$ with $\sum_{j=0}^{\infty}\|\Psi_j\| < \infty$. If, moreover, the polynomial $\Theta(z)$ is invertible for every $z$ on the unit disc, then we also have that $Z_t = \sum_{j=0}^{\infty}\Pi_jX_{t-j}$ for matrices $\Pi_j$ with $\sum_{j=0}^{\infty}\|\Pi_j\| < \infty$.

The norm $\|\cdot\|$ in the preceding may be any matrix norm. The proof of this theorem is the same as the proofs of the corresponding results in the one-dimensional case, in view of the following observations.

A series of the type $\sum_{j=-\infty}^{\infty}\Psi_jZ_{t-j}$ for matrices $\Psi_j$ with $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$ and a vector-valued process $Z_t$ with $\sup_tE\|Z_t\| < \infty$ converges almost surely and in mean. We can define a matrix-valued function with domain at least the complex unit circle by $\Psi(z) = \sum_{j=-\infty}^{\infty}\Psi_jz^j$, and write the series as $\Psi(B)Z_t$, as usual.

Next, the analogue of Lemma 8.2 is true. The product $\Psi = \Psi_1\Psi_2$ of two matrix-valued functions $z\mapsto\Psi_i(z)$ is understood to be the function $z\mapsto\Psi(z)$ with $\Psi(z)$ equal to the matrix-product $\Psi_1(z)\Psi_2(z)$.
8.39 Lemma. If $(\Psi_{j,1})$ and $(\Psi_{j,2})$ are sequences of $(d\times d)$-matrices with $\sum_j\|\Psi_{j,i}\| < \infty$ for $i = 1, 2$, then the function $\Psi = \Psi_1\Psi_2$ can be expanded as $\Psi(z) = \sum_{j=-\infty}^{\infty}\Psi_jz^j$, at least for $z\in S^1$, for matrices $\Psi_j$ with $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$. Furthermore $\Psi(B)X_t = \Psi_1(B)\bigl(\Psi_2(B)X_t\bigr)$, almost surely, for every time series $X_t$ with $\sup_tE\|X_t\| < \infty$.
The functions $z\mapsto\det\Phi(z)$ and $z\mapsto\det\Theta(z)$ are polynomials. Hence if they are nonzero on the unit circle, then they are nonzero on an open annulus containing the unit circle, and the matrices $\Phi(z)$ and $\Theta(z)$ are invertible for every $z$ in this annulus. Cramer's rule, which expresses the solution of a system of linear equations in determinants, shows that the entries of the inverse matrices $\Phi(z)^{-1}$ and $\Theta(z)^{-1}$ are quotients of polynomials. The denominators are the determinants $\det\Phi(z)$ and $\det\Theta(z)$ and hence are nonzero in a neighbourhood of the unit circle. These matrices may thus be expanded (entrywise) in Laurent series
$$\Phi(z)^{-1} = \Bigl(\sum_{j=-\infty}^{\infty}(\Psi_j)_{k,l}z^j\Bigr)_{k,l=1,\ldots,d} = \sum_{j=-\infty}^{\infty}\Psi_jz^j,$$
where the $\Psi_j$ are matrices such that $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$, and similarly for $\Theta(z)^{-1}$. This allows the back-shift calculus that underlies the proof of Theorem 8.8. For instance, the matrix-valued function $\Psi = \Phi^{-1}\Theta$ is well-defined on the unit circle, and can be expanded as $\Psi(z) = \sum_{j=-\infty}^{\infty}\Psi_jz^j$ for matrices $\Psi_j$ with $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$. The process $\Psi(B)Z_t$ is the unique solution to the VARMA equation.
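A small numerical check (not from the notes) of the condition of Theorem 8.38 for a bivariate VAR(1) model $X_t = \Phi_1X_{t-1} + Z_t$, for which $\Phi(z) = I - \Phi_1z$; the matrix $\Phi_1$ below is an arbitrary assumption. For a VAR(1) the zeros of $\det\Phi(z)$ are the reciprocals of the eigenvalues of $\Phi_1$, so the condition can be checked either way.

```python
# Check whether det Phi(z) = det(I - Phi1 z) has all its zeros outside the unit circle.
import numpy as np

Phi1 = np.array([[0.5, 0.2],
                 [0.1, 0.4]])

# det(I - Phi1 z) = 1 - trace(Phi1) z + det(Phi1) z^2 for a 2x2 matrix
coefs = [np.linalg.det(Phi1), -np.trace(Phi1), 1.0]      # ordered from z^2 down to z^0
roots = np.roots(coefs)
print("zeros of det Phi(z):", roots)
print("all outside the unit circle:", np.all(np.abs(roots) > 1))
print("eigenvalues of Phi1 inside the unit circle:",
      np.all(np.abs(np.linalg.eigvals(Phi1)) < 1))
```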
* 8.10 ARMA-X Processes

The (V)ARMA-relation models a time series as an autonomous process, whose evolution is partially explained by its own past (or the past of the other coordinates in the vector-valued situation). It is often of interest to include additional explanatory variables into the model. The natural choice of including time-dependent, vector-valued variables $W_t$ linearly yields the ARMA-X model
$$X_t = \Phi_1X_{t-1} + \Phi_2X_{t-2} + \cdots + \Phi_pX_{t-p} + Z_t + \Theta_1Z_{t-1} + \cdots + \Theta_qZ_{t-q} + \beta^TW_t.$$
9 GARCH Processes
White noise processes are basic building blocks for time series models, but can also be of interest on their own. A sequence of i.i.d. variables is a white noise sequence that is not of great interest as a time series. On the other hand, many financial time series appear to be realizations of white noise series that have interesting time structure. This is possible, because the white noise property only concerns the second moments of the process, so that the variables of a white noise process may possess many types of dependence. GARCH processes are white noises that have been found useful for modelling financial time series.

Figure 9.1 shows a realization of a GARCH process. The striking feature is the "bursts of activity", which alternate with quiet periods of the series. The frequency of the movements of the series appears to be constant over time, but their amplitude changes, alternating between volatile periods (large amplitude) and quiet periods. This phenomenon is referred to as volatility clustering. A look at the auto-correlation function of the realization, Figure 9.2, shows that the alternations are not reflected in the second moments of the series: the series can be modelled as white noise, at least in the sense that the correlations are zero.
Recall that a white noise series is any mean-zero stationary time series whose auto-covariances at nonzero lags vanish. We shall speak of a heteroscedastic white noise if the auto-covariances at nonzero lags vanish, but the variances are possibly time-dependent. A related concept is that of a martingale difference series. Recall that a filtration $\mathcal{F}_t$ is a nondecreasing collection of $\sigma$-fields $\cdots\subset\mathcal{F}_{-1}\subset\mathcal{F}_0\subset\mathcal{F}_1\subset\cdots$. A martingale difference series relative to the filtration $\mathcal{F}_t$ is a time series $X_t$ such that $X_t$ is $\mathcal{F}_t$-measurable and $E(X_t\,|\,\mathcal{F}_{t-1}) = 0$, almost surely, for every $t$. (The latter includes the assumption that $E|X_t| < \infty$, so that the conditional expectation is well defined.)

Any martingale difference series $X_t$ with finite second moments is a (possibly heteroscedastic) white noise series. Indeed, the equality $E(X_t\,|\,\mathcal{F}_{t-1}) = 0$ is equivalent to $X_t$ being orthogonal to all random variables $Y\in\mathcal{F}_{t-1}$, and this includes the variables $X_s\in\mathcal{F}_s\subset\mathcal{F}_{t-1}$, for every $s < t$, so that $EX_tX_s = 0$ for every $s < t$. The converse is false: not every white noise is a martingale difference series (relative to a natural filtration). This is because $E(X\,|\,Y) = 0$ implies that $X$ is orthogonal to all measurable functions of $Y$, not just to linear functions.

Figure 9.1. Realization of length 500 of the stationary GARCH(1, 1) process with $\alpha = 0.15$, $\phi_1 = 0.4$, $\theta_1 = 0.4$ and standard normal variables $Z_t$.
9.1 EXERCISE. If $X_t$ is a martingale difference series, show that $E(X_{t+k}X_{t+l}\,|\,\mathcal{F}_t) = 0$ almost surely for every $k\neq l > 0$. Thus "future variables are uncorrelated given the present". Find a white noise series which lacks this property relative to its natural filtration.

A martingale difference sequence has zero first moment given the past. A natural step for further modelling is to postulate a specific form of the conditional second moment. GARCH models are examples, and in that sense are again concerned only with first and second moments of the time series, albeit conditional moments. They turn out to capture many features of observed time series, in particular those in finance, that are not captured by ARMA processes. Besides volatility clustering these "stylized facts" include leptokurtic (i.e. heavy) tailed marginal distributions and nonzero auto-correlations for the process $X_t^2$ of squares.
Figure 9.2. Sample auto-covariance function of the time series in Figure 9.1.
9.1 Linear GARCH

There are many types of GARCH processes, of which we discuss a selection in the following sections. Linear GARCH processes were the earliest GARCH processes to be studied, and may be viewed as "the" GARCH processes.

9.2 Definition. A GARCH(p, q) process is a martingale difference sequence $X_t$, relative to a given filtration $\mathcal{F}_t$, whose conditional variances $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$ satisfy, for every $t\in\mathbb{Z}$ and given nonnegative constants $\alpha > 0$, $\phi_1,\ldots,\phi_p$, $\theta_1,\ldots,\theta_q$,
$$(9.1)\qquad \sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1X_{t-1}^2 + \cdots + \theta_qX_{t-q}^2,\qquad a.s..$$

With the usual convention that $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$ and $\theta(z) = \theta_1z + \cdots + \theta_qz^q$, the equation for the conditional variance $\sigma_t^2 = \mathrm{var}(X_t\,|\,\mathcal{F}_{t-1})$ can be abbreviated to $\phi(B)\sigma_t^2 = \alpha + \theta(B)X_t^2$. Note that the polynomial $\theta$ is assumed to have zero intercept. If the coefficients $\phi_1,\ldots,\phi_p$ all vanish, then $\sigma_t^2$ is modelled as a linear function of $X_{t-1}^2,\ldots,X_{t-q}^2$. This is called an ARCH(q) model, from "auto-regressive conditional heteroscedastic". The additional "G" of GARCH is for the nondescript "generalized".

Since $\sigma_t > 0$ (as $\alpha > 0$), we can define $Z_t = X_t/\sigma_t$. The martingale difference property of $X_t = \sigma_tZ_t$ and the definition of $\sigma_t^2$ as the conditional variance imply
$$(9.2)\qquad E(Z_t\,|\,\mathcal{F}_{t-1}) = 0,\qquad E(Z_t^2\,|\,\mathcal{F}_{t-1}) = 1.$$
Thus the time series $Z_t$ is a scaled martingale difference series. By reversing this construction we can easily define GARCH processes on the time set $\mathbb{Z}_+$. Here we could start for instance with an i.i.d. sequence $Z_t$ of variables with mean zero and variance 1.
9.3 Lemma. Let $(Z_t\colon t\ge 1)$ be a martingale difference sequence that satisfies (9.2) relative to an arbitrary filtration $(\mathcal{F}_t\colon t\ge 0)$ and let $(\sigma_s^2\colon s\le 0)$ and $(X_s\colon s\le 0)$ be processes defined on the same probability space that are measurable in $\mathcal{F}_0$. Then the recursions, for $t\ge 1$,
$$\begin{cases}\sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1X_{t-1}^2 + \cdots + \theta_qX_{t-q}^2,\\ X_t = \sigma_tZ_t,\end{cases}$$
define a GARCH process $(X_t\colon t\ge 1)$ with conditional variance process $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$. Furthermore $\sigma_t\in\mathcal{F}_0\vee\sigma(X_s\colon 1\le s\le t-1) = \mathcal{F}_0\vee\sigma(Z_s\colon 1\le s\le t-1)$, for every $t\ge 1$.
Proof. The set of four assertions $\sigma_t\in\mathcal{F}_{t-1}$, $X_t\in\mathcal{F}_t$, $E(X_t\,|\,\mathcal{F}_{t-1}) = 0$, $E(X_t^2\,|\,\mathcal{F}_{t-1}) = \sigma_t^2$ follows for $t = 1$ from the assumptions and the recursion formulas for $t = 1$. By induction the set of assertions extends to every $t\ge 1$. Thus all requirements for a GARCH process are satisfied for $t\ge 1$.

The recursion formula for $\sigma_t^2$ shows that $\sigma_t\in\mathcal{F}_0\vee\sigma(X_s, Z_s\colon 1\le s\le t-1)$. The equality $\mathcal{F}_0\vee\sigma(Z_s\colon 1\le s\le t) = \mathcal{F}_0\vee\sigma(X_s\colon 1\le s\le t)$ is clearly valid for $t = 0$, and next follows by induction for every $t\ge 1$, since $X_t = \sigma_tZ_t$ and $Z_t = X_t/\sigma_t$, where $\sigma_t$ is a function of $\sigma_s$ and $X_s$ with $s < t$. This yields the final assertion of the lemma.
The lemma constructs the time series $X_t$ only for positive times, using initial values that are measurable in $\mathcal{F}_0$. Because the initial values can be chosen in many ways, the process is not unique. If desired, it can be extended in multiple ways to the full time set $\mathbb{Z}$. A simple way is to set $\sigma_t = X_t = 0$ for all $t\le 0$, with $\mathcal{F}_0$ the trivial $\sigma$-field. The construction of stationary GARCH processes is much more involved and taken up in Section 9.1.2. Next in Section 9.1.3 the effect of initialization is shown to wear off as $t\to\infty$: any process as constructed in the preceding lemma will tend to the stationary solution, if such a solution exists.

The final assertion of the lemma shows that at each time instance the randomness added into the system is generated by $Z_t$. This encourages to interpret this standardized process as the innovations of a GARCH process, although this term should now be understood in a multiplicative sense.
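A simulation sketch (not part of the notes) of the recipe of Lemma 9.3 for a GARCH(1, 1) process, initialized at $\sigma_0 = X_0 = 0$; the parameter values are those of Figure 9.1 and are used only for illustration.

```python
# Construction of Lemma 9.3 for a GARCH(1,1) process, initialized at sigma_0 = X_0 = 0
# (so the process is not stationary, but approaches stationarity; see Section 9.1.3).
import numpy as np

def simulate_garch11(n, alpha=0.15, phi1=0.4, theta1=0.4, rng=np.random.default_rng(2)):
    z = rng.standard_normal(n + 1)           # i.i.d. innovations with mean 0, variance 1
    sigma2 = np.zeros(n + 1)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        sigma2[t] = alpha + phi1 * sigma2[t-1] + theta1 * x[t-1] ** 2
        x[t] = np.sqrt(sigma2[t]) * z[t]
    return x[1:], sigma2[1:]

x, sigma2 = simulate_garch11(500)
print(x.mean(), x.var())        # approximately 0 and alpha / (1 - phi1 - theta1) = 0.75
```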
As any martingale difference series, a GARCH process is "constant in the mean". In fact, not only is its mean function zero, but also at each time its expected change is zero. However, a GARCH process has an interesting (conditional) variance structure: large absolute values $X_{t-1},\ldots,X_{t-q}$ at times $t-1,\ldots,t-q$ lead, through the GARCH equation, to a large conditional variance $\sigma_t^2$ at time $t$, and then the value $X_t = \sigma_tZ_t$ of the time series at time $t$ tends to deviate far from zero as well. Similarly, large conditional variances $\sigma_{t-1}^2,\ldots,\sigma_{t-p}^2$ tend to lead to large values of $|X_t|$. In this context (conditional) variance is often referred to as volatility, and strongly fluctuating time series are said to be volatile. Thus GARCH processes tend to alternate periods of high volatility with periods of low volatility. This volatility clustering is one of the stylized facts of financial time series. The phrase "stylized fact" is used for salient properties that seem to be shared by many financial time series, although "fact" is not to be taken literally.
A second stylized fact are the leptokurtic tails of the marginal distribution of a typical financial time series. A distribution on $\mathbb{R}$ is called leptokurtic if it has "fat tails", for instance fatter than normal tails. A quantitative measure of "fatness" of the tails of the distribution of a random variable $X$ is the kurtosis, defined as $\kappa_4(X) = E(X - EX)^4/(\mathrm{var}\,X)^2$. If $X_t = \sigma_tZ_t$, where $\sigma_t$ is $\mathcal{F}_{t-1}$-measurable and $Z_t$ is independent of $\mathcal{F}_{t-1}$ with mean zero and variance 1, then
$$EX_t^4 = E\sigma_t^4\,EZ_t^4 = \kappa_4(Z_t)\,E\bigl(E(X_t^2\,|\,\mathcal{F}_{t-1})\bigr)^2 \ge \kappa_4(Z_t)\,(E\sigma_t^2)^2 = \kappa_4(Z_t)\,(EX_t^2)^2.$$
Dividing the left and right sides by $(EX_t^2)^2$, we see that $\kappa_4(X_t)\ge\kappa_4(Z_t)$. The difference can be substantial if the variance of the random variable $E(X_t^2\,|\,\mathcal{F}_{t-1})$ is large. In fact, taking the difference of the left and right sides of the preceding display yields
$$\kappa_4(X_t) = \kappa_4(Z_t)\Bigl(1 + \frac{\mathrm{var}\,E(X_t^2\,|\,\mathcal{F}_{t-1})}{(EX_t^2)^2}\Bigr).$$
It follows that the GARCH structure is also able to capture some of the observed leptokurtosis of financial time series.

The kurtosis of a normal variable is equal to 3. Thus for a Gaussian driving process $Z_t$, the kurtosis of the observed series $X_t$ is always bigger than 3. It has been observed that this usually does not go far enough in explaining excess kurtosis over the normal distribution. The use of one of Student's t-distributions can often improve the fit of a GARCH process substantially.
If we substitute $\sigma_t^2 = X_t^2 - W_t$ in (9.1), then we find after rearranging the terms,
$$(9.3)\qquad X_t^2 = \alpha + (\phi_1+\theta_1)X_{t-1}^2 + \cdots + (\phi_r+\theta_r)X_{t-r}^2 + W_t - \phi_1W_{t-1} - \cdots - \phi_pW_{t-p},$$
where $r = p\vee q$ and the sequences $\phi_1,\ldots,\phi_p$ or $\theta_1,\ldots,\theta_q$ are padded with zeros to increase their lengths to $r$, if necessary. We can abbreviate this to
$$(\phi-\theta)(B)X_t^2 = \alpha + \phi(B)W_t,\qquad W_t = X_t^2 - E(X_t^2\,|\,\mathcal{F}_{t-1}).$$
This is the characterizing equation for an ARMA(r, r) process $X_t^2$ relative to the noise process $W_t$. The variable $W_t = X_t^2 - \sigma_t^2$ is the prediction error when predicting $X_t^2$ by its conditional expectation $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$ and hence $W_t$ is orthogonal to $\mathcal{F}_{t-1}$. Thus $W_t$ is a martingale difference series and a-fortiori a white noise sequence if its second moments exist and are independent of $t$.

Under these conditions the time series of squares $X_t^2$ is an ARMA process in the sense of Definition 8.4. This observation is useful to compute certain characteristics of the process. However, the variables $W_t$ in equation (9.3) are defined in terms of the process $X_t$ (their squares and conditional variances) and therefore do not have a simple interpretation as a noise process that drives the process $X_t^2$. This circularity makes that one should not apply results on ARMA processes unthinkingly to the process $X_t^2$. For instance, equation (9.3) seems to have little use for proving existence of solutions to the GARCH equation.
* 9.4 EXERCISE. Suppose that $X_t$ and $W_t$ are martingale difference series relative to a given filtration such that $\phi(B)X_t^2 = \theta(B)W_t$ for polynomials $\phi$ and $\theta$ of degrees $p$ and $q$. Show that $X_t$ is a GARCH process. Does strict stationarity of the time series $X_t^2$ or $W_t$ imply strict stationarity of the time series $X_t$?

9.5 EXERCISE. Write $\sigma_t^2$ as the solution to an ARMA($p\vee q$, $q-1$) equation by substituting $X_t^2 = \sigma_t^2 + W_t$ in (9.3).
9.1.1 Autocovariances of Squares

Even though the autocorrelation function of a GARCH process vanishes (except at lag zero), the variables are not independent. One way of seeing this is to consider the auto-correlations of the squares $X_t^2$ of the time series. Independence of the $X_t$ would imply independence of the $X_t^2$, and hence a zero autocorrelation function. However, in view of (9.3) the squares form an ARMA process, whence their autocorrelations are nonzero. The auto-correlations of the squares of financial time series are typically observed to be positive at all lags, another stylized fact of such series.

The auto-correlation function of the squares of a GARCH series will exist under appropriate additional conditions on the coefficients and the driving noise process $Z_t$. The ARMA relation (9.3) for the square process $X_t^2$ may be used to derive this function, from the formulas for the auto-correlation function of an ARMA process. Here we must not forget that the process $W_t$ in (9.3) is defined through $X_t$ and hence its variance depends on the parameters in the GARCH relation.

Actually, equation (9.3) is an ARMA equation "with intercept $\alpha$". Provided that $(\phi-\theta)(1)\neq 0$, we can rewrite it as $(\phi-\theta)(B)\bigl(X_t^2 - \alpha/(\phi-\theta)(1)\bigr) = \phi(B)W_t$, which shows that the time series $X_t^2$ is an ARMA process plus $\alpha/(\phi-\theta)(1)$. The autocorrelations are of course independent of this shift.
9.6 Example (GARCH(1, 1)). The conditional variances of a GARCH(1, 1) process satisfy $\sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2$. If we assume the process $X_t$ to be stationary, then $E\sigma_t^2 = EX_t^2$ is independent of $t$. Taking the expectation across the GARCH equation and rearranging then immediately gives
$$E\sigma_t^2 = EX_t^2 = \frac{\alpha}{1-\phi-\theta}.$$
To compute the auto-correlation function of the time series of squares $X_t^2$, we employ (9.3), which reveals this process as an ARMA(1, 1) process with the auto-regressive and moving average polynomials given as $1 - (\phi+\theta)z$ and $1 - \phi z$, respectively. The calculations in Example 8.27 yield that
$$\gamma_{X^2}(h) = \tau^2(\phi+\theta)^h\,\frac{\bigl(1-\phi(\phi+\theta)\bigr)\bigl(1-\phi/(\phi+\theta)\bigr)}{1-(\phi+\theta)^2},\qquad h > 0,$$
$$\gamma_{X^2}(0) = \tau^2\Bigl(\frac{\bigl(1-\phi(\phi+\theta)\bigr)\bigl(1-\phi/(\phi+\theta)\bigr)}{1-(\phi+\theta)^2} + \frac{\phi}{\phi+\theta}\Bigr).$$
Here $\tau^2$ is the variance of the process $W_t = X_t^2 - E(X_t^2\,|\,\mathcal{F}_{t-1})$, which is also dependent on the parameters $\phi$ and $\theta$. By squaring the GARCH equation we find
$$\sigma_t^4 = \alpha^2 + \phi^2\sigma_{t-1}^4 + \theta^2X_{t-1}^4 + 2\alpha\phi\sigma_{t-1}^2 + 2\alpha\theta X_{t-1}^2 + 2\phi\theta\sigma_{t-1}^2X_{t-1}^2.$$
If $Z_t$ is independent of $\mathcal{F}_{t-1}$, then $E\sigma_t^2X_t^2 = E\sigma_t^4$ and $EX_t^4 = \kappa_4(Z_t)E\sigma_t^4$. If we assume, moreover, that the moments exist and are independent of $t$, then we can take the expectation across the preceding display and rearrange to find that
$$E\sigma_t^4\bigl(1 - \phi^2 - 2\phi\theta - \kappa_4(Z)\theta^2\bigr) = \alpha^2 + (2\alpha\phi + 2\alpha\theta)E\sigma_t^2.$$
Together with the formulas obtained previously, this gives the variance of $W_t$, since $EW_t = 0$ and $EW_t^2 = EX_t^4 - E\sigma_t^4$, by Pythagoras' identity for projections.
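A numerical check (not from the notes) of these stationary moments by simulation; the parameter values $\alpha = 0.15$, $\phi = 0.4$, $\theta = 0.3$ are assumptions chosen so that fourth moments exist, and the theoretical lag-one auto-correlation of the squares is computed from the ARMA(1, 1) formulas above. Because of the heavy tails of the squares the simulated value only fluctuates around the theoretical one.

```python
# Simulated versus theoretical second moment and lag-1 auto-correlation of the squares
# for a stationary GARCH(1,1) process with Gaussian innovations.
import numpy as np

alpha, phi, theta = 0.15, 0.4, 0.3
rng = np.random.default_rng(3)
n, burnin = 200_000, 1_000
z = rng.standard_normal(n + burnin)
sigma2, x = np.zeros(n + burnin), np.zeros(n + burnin)
sigma2[0] = alpha / (1 - phi - theta)
for t in range(1, n + burnin):
    sigma2[t] = alpha + phi * sigma2[t-1] + theta * x[t-1] ** 2
    x[t] = np.sqrt(sigma2[t]) * z[t]
x = x[burnin:]

print("EX_t^2: simulated %.3f, theoretical %.3f"
      % (np.mean(x ** 2), alpha / (1 - phi - theta)))

x2 = x ** 2 - np.mean(x ** 2)
rho1_sim = np.dot(x2[:-1], x2[1:]) / np.dot(x2, x2)
psi = phi + theta
A = (1 - phi * psi) * (1 - phi / psi)
rho1_th = psi * A / (A + (1 - psi ** 2) * phi / psi)       # gamma(1)/gamma(0) from Example 9.6
print("rho_{X^2}(1): simulated %.3f, theoretical %.3f" % (rho1_sim, rho1_th))
```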
9.7 EXERCISE. Find the auto-covariance function of the process $\sigma_t^2$ for a GARCH(1, 1) process.

9.8 EXERCISE. Find an expression for the kurtosis of the marginal distribution in a stationary GARCH(1, 1) process as in the preceding example. Can this be made arbitrarily large?
9.1.2 Stationary Solutions

Because the GARCH relation is recursive and the variables $X_t$ enter it both explicitly and implicitly (through their conditional variances) existence of stationary GARCH processes is far from obvious. In this section we show that a (strictly) stationary solution exists only for certain parameter values $\phi_1,\ldots,\phi_p$, $\theta_1,\ldots,\theta_q$, and if it exists, then it is unique. A bit surprisingly, a strictly stationary solution exists for a larger set of parameter values than a second order stationary solution.

We start with investigating second order stationary solutions. Substituting $X_t = \sigma_tZ_t$ in the GARCH relation (9.1), we obtain
$$(9.4)\qquad \sigma_t^2 = \alpha + (\phi_1 + \theta_1Z_{t-1}^2)\sigma_{t-1}^2 + \cdots + (\phi_r + \theta_rZ_{t-r}^2)\sigma_{t-r}^2.$$
This exhibits the process $\sigma_t^2$ as an auto-regressive process with random coefficients (the variables $\phi_i + \theta_iZ_{t-i}^2$) and deterministic innovations (the constant $\alpha$). We shall construct and study the process $\sigma_t^2$ by considering this equation in a state space representation, with state vectors $(\sigma_{t-1}^2,\ldots,\sigma_{t-r}^2)$. Given the process $\sigma_t^2$, we next obtain a GARCH process as $X_t = \sigma_tZ_t$.

In the following theorem we consider given a martingale difference sequence $Z_t$ satisfying (9.2), defined on a fixed probability space. A stationary GARCH process exists if and only if $\sum_j(\phi_j + \theta_j) < 1$.
9.9 Theorem. Let $\alpha > 0$, let $\phi_1,\ldots,\phi_p$, $\theta_1,\ldots,\theta_q$ be nonnegative numbers, and let $Z_t$ be a martingale difference sequence satisfying (9.2) relative to an arbitrary filtration $\mathcal{F}_t$.
(i) There exists a stationary GARCH process $X_t$ such that $X_t = \sigma_tZ_t$, where $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$, if and only if $\sum_j(\phi_j + \theta_j) < 1$.
(ii) This process is unique among the GARCH processes $X_t$ with $X_t = \sigma_tZ_t$ that are bounded in $L_2$.
(iii) This process satisfies $\sigma(X_t, X_{t-1},\ldots) = \sigma(Z_t, Z_{t-1},\ldots)$ for every $t$, and $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$ is $\sigma(X_{t-1}, X_{t-2},\ldots)$-measurable.
Proof. The recursion equation (9.4) can be written in matrix form as
$$\begin{pmatrix}\sigma_t^2\\ \sigma_{t-1}^2\\ \vdots\\ \sigma_{t-r+1}^2\end{pmatrix}
= \begin{pmatrix}\phi_1+\theta_1Z_{t-1}^2 & \cdots & \phi_{r-1}+\theta_{r-1}Z_{t-r+1}^2 & \phi_r+\theta_rZ_{t-r}^2\\
1 & \cdots & 0 & 0\\
\vdots & \ddots & \vdots & \vdots\\
0 & \cdots & 1 & 0\end{pmatrix}
\begin{pmatrix}\sigma_{t-1}^2\\ \sigma_{t-2}^2\\ \vdots\\ \sigma_{t-r}^2\end{pmatrix}
+ \begin{pmatrix}\alpha\\ 0\\ \vdots\\ 0\end{pmatrix}.$$
Writing this system as $Y_t = A_tY_{t-1} + b$, and iterating we find, for every $n\ge 1$,
$$(9.5)\qquad Y_t = b + A_tb + A_tA_{t-1}b + \cdots + A_tA_{t-1}\cdots A_{t-n+1}b + A_tA_{t-1}\cdots A_{t-n}Y_{t-n-1}.$$
If the last term on the right tends to zero as $n\to\infty$, then this gives
$$(9.6)\qquad Y_t = b + \sum_{j=1}^{\infty}A_tA_{t-1}\cdots A_{t-j+1}b.$$
We shall prove that this representation is indeed valid for the process $Y_t$ corresponding to a GARCH process that is bounded in $L_2$. Because the right side of the display is a function of the $Z_t$ only, the uniqueness (ii) of such a GARCH process for a given process $Z_t$ is then clear. To prove existence, as in (i), we shall reverse this argument: we use equation (9.6) to define a process $Y_t$, and next define processes $\sigma_t$ and $X_t$ as functions of $Y_t$ and $Z_t$.

We start by studying the random matrix products $A_tA_{t-1}\cdots A_{t-n}$. The expectation $A = EA_t$ of a single matrix is obtained by replacing the variables $Z_s^2$ in the definition of $A_t$ by their expectation 1. The expectation of the product turns out to be the product of the expectations: $EA_tA_{t-1}\cdots A_{t-n} = A^{n+1}$. To see this, write $A_t = \Phi + \sum_{j=1}^{q}\theta_jZ_{t-j}^2L_j$, for $L_j$ the matrix with first row the $j$th unit vector and 0's elsewhere. Then $L_1L_j = L_j$ and $L_iL_j = 0$ for every $j\ge 1$ and $i\ge 2$, and hence the product $A_tA_{t-1}\cdots A_{t-n}$ can be written as a linear combination of deterministic matrices with random coefficients of the type $Z_{t_1}^2\cdots Z_{t_k}^2$, for $t_1 < t_2 < \cdots < t_k$. Noting that $Z_t = X_t/\sigma_t$ is $\mathcal{F}_t$-measurable and $E(Z_t^2\,|\,\mathcal{F}_{t-1}) = 1$ for every $t$, we can use the towering property of conditional expectations to see that $EZ_{t_1}^2\cdots Z_{t_k}^2 = 1$, for every $t_1 < t_2 < \cdots < t_k$. Hence the expected value $EA_tA_{t-1}\cdots A_{t-n}$ can also be computed by replacing every $Z_s^2$ by its expectation 1, yielding $A^{n+1}$ as claimed.
The characteristic polynomial of the matrix $A$ can (with some effort) be seen to be equal to
$$\det(A - zI) = (-1)^r\Bigl(z^r - \sum_{j=1}^{r}(\phi_j+\theta_j)z^{r-j}\Bigr).$$
If $\sum_j(\phi_j+\theta_j) < 1$, then the polynomial on the right has all its roots inside the unit circle. (See Exercise 9.10, with $z$ replaced by $1/z$.) The spectral radius (the maximum of the moduli of the eigenvalues) of the operator $A$ is then strictly smaller than 1. This implies that $\|A^n\|^{1/n}$ is smaller than 1 for all sufficiently large $n$ and hence $\sum_{n=0}^{\infty}\|A^n\| < \infty$.
Combining the results of the last two paragraphs we see that $EA_tA_{t-1}\cdots A_{t-n} = A^{n+1}\to 0$ as $n\to\infty$ if $\sum_j(\phi_j+\theta_j) < 1$. Because the matrices $A_t$ possess nonnegative entries, this implies that the sequence $A_tA_{t-1}\cdots A_{t-n}$ converges to zero in probability. If $X_t$ is a GARCH process that is bounded in $L_2$, then, in view of its definition, the corresponding process $Y_t$ is bounded in $L_1$. In that case $A_tA_{t-1}\cdots A_{t-n}Y_{t-n-1}\to 0$ in probability as $n\to\infty$, and hence $Y_t$ satisfies (9.6). This concludes the proof of the uniqueness (ii).

Under the assumption that $\sum_j(\phi_j+\theta_j) < 1$, the series on the right side of (9.6) converges in $L_1$. Thus we can use this equation to define a process $Y_t$ from the given martingale difference series $Z_t$. Simple algebra on the infinite series shows that $Y_t = A_tY_{t-1} + b$ for every $t$. Clearly the coordinates of $Y_t$ are nonnegative, whence we can define processes $\sigma_t$ and $X_t$ by
$$\sigma_t = \sqrt{Y_{t,1}},\qquad X_t = \sigma_tZ_t.$$
Because $\sigma_t$ is $\sigma(Z_{t-1}, Z_{t-2},\ldots)\subset\mathcal{F}_{t-1}$-measurable, we have that $E(X_t\,|\,\mathcal{F}_{t-1}) = \sigma_tE(Z_t\,|\,\mathcal{F}_{t-1}) = 0$ and $E(X_t^2\,|\,\mathcal{F}_{t-1}) = \sigma_t^2E(Z_t^2\,|\,\mathcal{F}_{t-1}) = \sigma_t^2$. Furthermore, the process $\sigma_t^2$ satisfies the GARCH relation (9.1). Indeed, the first row of the matrix equation $Y_t = A_tY_{t-1} + b$ expresses $\sigma_t^2$ into $\sigma_{t-1}^2$ and $Y_{t-1,2},\ldots,Y_{t-1,r}$, and the other rows permit to reexpress the variable $Y_{t-1,k}$ for $k > 1$ as $\sigma_{t-k}^2$ by recursive use of the relations $Y_{t,k} = Y_{t-1,k-1}$, and the definitions $Y_{t-k,1} = \sigma_{t-k}^2$. This concludes the proof that there exists a stationary solution as soon as $\sum_j(\phi_j+\theta_j) < 1$.
To see that the latter condition is necessary, we return to equation (9.5). If $X_t$ is a stationary solution, then its conditional variance process $\sigma_t^2$ is integrable, and hence so is the process $Y_t$ in (9.5). Taking the expectation of the left and right sides of this equation for $t = 0$ and remembering that all terms are nonnegative, we see that $\sum_{j=0}^{n}A^jb\le EY_0$, for every $n$. This implies that $A^nb\to 0$ as $n\to\infty$, or, equivalently $A^ne_1\to 0$, where $e_i$ is the $i$th unit vector. In view of the definition of $A$ we see, recursively, that
$$A^ne_r = A^{n-1}(\phi_r+\theta_r)e_1\to 0,$$
$$A^ne_{r-1} = A^{n-1}\bigl((\phi_{r-1}+\theta_{r-1})e_1 + e_r\bigr)\to 0,$$
$$\vdots$$
$$A^ne_2 = A^{n-1}\bigl((\phi_2+\theta_2)e_1 + e_3\bigr)\to 0.$$
Therefore, the sequence $A^n$ converges to zero. This can only happen if none of its eigenvalues is on or outside the unit disc. This was seen to be the case only if $\sum_{j=1}^{r}(\phi_j+\theta_j) < 1$.

This concludes the proof of (i) and (ii). For the proof of (iii) we first note that the matrices $A_t$, and hence the variables $Y_t$, are measurable functions of $(Z_{t-1}, Z_{t-2},\ldots)$. In particular, the variable $\sigma_t^2$, which is the first coordinate of $Y_t$, is $\sigma(Z_{t-1}, Z_{t-2},\ldots)$-measurable, so that the second assertion of (iii) follows from the first.
Because the variable $X_t = \sigma_tZ_t$ is a measurable transformation of $(Z_t, Z_{t-1},\ldots)$, it follows that $\sigma(X_t, X_{t-1},\ldots)\subset\sigma(Z_t, Z_{t-1},\ldots)$. To prove the converse, we note that the process $W_t = X_t^2 - \sigma_t^2$ is bounded in $L_1$ and satisfies the ARMA relation $(\phi-\theta)(B)X_t^2 = \alpha + \phi(B)W_t$, as in (9.3). Because $\phi$ has no roots on the unit disc, this relation is invertible, whence $W_t = (1/\phi)(B)\bigl((\phi-\theta)(B)X_t^2 - \alpha\bigr)$ is a measurable transformation of $X_t^2, X_{t-1}^2,\ldots$. Therefore $\sigma_t^2 = X_t^2 - W_t$ and hence $Z_t = X_t/\sigma_t$ is $\sigma(X_t, X_{t-1},\ldots)$-measurable.
9.10 EXERCISE. If $p_1,\ldots,p_r$ are nonnegative real numbers, then the polynomial $p(z) = 1 - \sum_{j=1}^{r}p_jz^j$ possesses no roots on the unit disc if and only if $p(1) > 0$. [Use that $p(0) = 1 > 0$; furthermore, use the triangle inequality.]
Theorem 9.9's condition $\sum_j(\phi_j+\theta_j) < 1$ is necessary for existence of a GARCH process with bounded second moments, but stronger than necessary if we are interested in a strictly stationary solution to the GARCH equations with possibly infinite second moments. We can see this from the proof, where the GARCH process is defined from the series in (9.6). If this series converges in an appropriate sense, then a strictly stationary GARCH process exists. The series involves products of random matrices $A_t$; its convergence depends on the value of their top Lyapounov exponent, defined by
$$\gamma = \inf_{n\in\mathbb{N}}\frac{1}{n}E\log\|A_1A_2\cdots A_n\|.$$
Here $\|\cdot\|$ may be any matrix norm (all matrix norms being equivalent). If the process $Z_t$ is ergodic, for instance i.i.d., then we can apply Kingman's subergodic theorem (e.g. Dudley (1987, Theorem 10.7.1)) to the process $\log\|A_1A_2\cdots A_n\|$ to see that
$$\frac{1}{n}\log\|A_1A_2\cdots A_n\|\to\gamma,\qquad a.s..$$
This implies that the sequence of matrices $A_1A_2\cdots A_n$ converges to zero almost surely as soon as $\gamma < 0$. The convergence is then exponentially fast and the series in (9.6) will converge.

Thus sufficient conditions for the existence of strictly stationary solutions to the GARCH equations can be given in terms of the top Lyapounov exponent of the random matrices $A_t$. This exponent is in general difficult to compute explicitly, but it can easily be estimated numerically for a given sequence $Z_t$.
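A numerical sketch (not from the notes) of such an estimate: for the companion matrices above, $\gamma$ can be approximated by $n^{-1}\log\|A_1A_2\cdots A_n\|$ along one long simulated path of innovations. The GARCH(2, 2) parameter values below are assumptions for illustration, and the running product is renormalized at every step to avoid overflow.

```python
# Monte Carlo estimate of the top Lyapounov exponent of the random companion matrices
# with first row (phi_j + theta_j Z_{t-j}^2) and an identity shifted below it.
import numpy as np

def top_lyapounov(phi, theta, n=50_000, rng=np.random.default_rng(4)):
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    r = max(len(phi), len(theta))
    phi = np.pad(phi, (0, r - len(phi)))
    theta = np.pad(theta, (0, r - len(theta)))
    z2 = rng.standard_normal(n + r) ** 2
    log_norm, prod = 0.0, np.eye(r)
    for t in range(r, n + r):
        A = np.zeros((r, r))
        A[0, :] = phi + theta * z2[t-r:t][::-1]   # first row: phi_j + theta_j Z_{t-j}^2
        if r > 1:
            A[1:, :-1] = np.eye(r - 1)
        prod = A @ prod
        norm = np.linalg.norm(prod)
        log_norm += np.log(norm)
        prod = prod / norm
    return log_norm / n

# sum of coefficients 0.8 < 1, so the estimate is negative
print(top_lyapounov([0.2, 0.1], [0.3, 0.2]))
# sum of coefficients 1.2; the sign of the estimate decides strict stationarity
print(top_lyapounov([0.3, 0.2], [0.4, 0.3]))
```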
To obtain conditions that are both sufficient and necessary the preceding proof must be adapted somewhat. The following theorem is in terms of the top Lyapounov exponent of the matrices
$$(9.7)\qquad A_t = \begin{pmatrix}
\phi_1+\theta_1Z_{t-1}^2 & \phi_2 & \cdots & \phi_{p-1} & \phi_p & \theta_2 & \cdots & \theta_{q-1} & \theta_q\\
1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0\\
\vdots & \ddots & \ddots & \vdots & \vdots & \vdots & & \vdots & \vdots\\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 0 & 0\\
Z_{t-1}^2 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}.$$
These matrices have the advantage of being independent and identically distributed if the process $Z_t$ is i.i.d.. They are motivated by the equation obtained by substituting $X_{t-1} = \sigma_{t-1}Z_{t-1}$ in the GARCH equation (9.1), leaving $X_{t-2},\ldots,X_{t-q}$ untouched:
$$\sigma_t^2 = \alpha + (\phi_1 + \theta_1Z_{t-1}^2)\sigma_{t-1}^2 + \phi_2\sigma_{t-2}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_2X_{t-2}^2 + \cdots + \theta_qX_{t-q}^2.$$
This equation gives rise to the system of equations $Y_t = A_tY_{t-1} + b$ for the random vectors $Y_t = (\sigma_t^2,\ldots,\sigma_{t-p+1}^2, X_{t-1}^2,\ldots,X_{t-q+1}^2)^T$ and the vector $b$ equal to $\alpha$ times the first unit vector.
9.11 Theorem. Let $\alpha > 0$, let $\phi_1,\ldots,\phi_p$, $\theta_1,\ldots,\theta_q$ be nonnegative numbers, and let $Z_t$ be an i.i.d. sequence with mean zero and unit variance. There exists a strictly stationary GARCH process $X_t$ such that $X_t = \sigma_tZ_t$, where $\sigma_t^2 = E(X_t^2\,|\,\mathcal{F}_{t-1})$ and $\mathcal{F}_t = \sigma(Z_t, Z_{t-1},\ldots)$, if and only if the top Lyapounov coefficient of the random matrices $A_t$ given by (9.7) is strictly negative. For this process $\sigma(X_t, X_{t-1},\ldots) = \sigma(Z_t, Z_{t-1},\ldots)$.
Proof. Let $b = \alpha e_1$, where $e_i$ is the $i$th unit vector in $\mathbb{R}^{p+q-1}$. If $\tilde\gamma$ is strictly larger than the top Lyapounov exponent $\gamma$, then $\|A_tA_{t-1}\cdots A_{t-n+1}\| < e^{\tilde\gamma n}$, eventually as $n\to\infty$, almost surely, and hence, eventually,
$$\bigl\|A_tA_{t-1}\cdots A_{t-n+1}b\bigr\| < e^{\tilde\gamma n}\|b\|.$$
If $\gamma < 0$, then we may choose $\tilde\gamma < 0$, and hence $\sum_n\|A_tA_{t-1}\cdots A_{t-n+1}b\| < \infty$ almost surely. Then the series on the right side of (9.6), but with the matrix $A_t$ defined as in (9.7), converges almost surely and defines a process $Y_t$. We can then define processes $\sigma_t$ and $X_t$ by setting $\sigma_t = \sqrt{Y_{t,1}}$ and $X_t = \sigma_tZ_t$. That these processes satisfy the GARCH relation follows from the relations $Y_t = A_tY_{t-1} + b$, as in the proof of Theorem 9.9. Being a fixed measurable transformation of $(Z_t, Z_{t-1},\ldots)$ for each $t$, the process $(\sigma_t, X_t)$ is strictly stationary.

By construction the variable $X_t$ is $\sigma(Z_t, Z_{t-1},\ldots)$-measurable for every $t$. To see that, conversely, $Z_t$ is $\sigma(X_t, X_{t-1},\ldots)$-measurable, we apply a similar argument as in the proof of Theorem 9.9, based on inverting the relation $(\phi-\theta)(B)X_t^2 = \alpha + \phi(B)W_t$, for $W_t = X_t^2 - \sigma_t^2$. Presently, the series $X_t^2$ and $W_t$ are not necessarily integrable, but Lemma 9.12 below still allows to conclude that $W_t$ is $\sigma(X_t^2, X_{t-1}^2,\ldots)$-measurable, provided that the polynomial $\phi$ has no zeros on the unit disc.
The matrix $B$ obtained by replacing the variables $Z_{t-1}$ and the numbers $\theta_j$ in the matrix $A_t$ by zero is bounded above by $A_t$ in a coordinatewise sense. By the nonnegativity of the entries this implies that $B^n\le A_0A_{-1}\cdots A_{-n+1}$ and hence $B^n\to 0$. This can happen only if all eigenvalues of $B$ are inside the unit circle. Indeed, if $z$ is an eigenvalue of $B$ with $|z|\ge 1$ and $c\neq 0$ a corresponding eigenvector, then $B^nc = z^nc$ does not converge to zero. Now
$$\det(B - zI) = (-z)^{p+q-1}\phi(1/z).$$
Thus $z$ is a zero of $\phi$ if and only if $z^{-1}$ is an eigenvalue of $B$. We conclude that $\phi$ has no zeros on the unit disc.

Finally, we show the necessity of the top Lyapounov exponent being negative. If there exists a strictly stationary solution to the GARCH equations, then, by (9.5) and the nonnegativity of the coefficients, $\sum_{j=1}^{n}A_0A_{-1}\cdots A_{-j+1}b\le Y_0$ for every $n$, and hence $A_0A_{-1}\cdots A_{-n+1}b\to 0$ as $n\to\infty$, almost surely. By the form of $b$ this is equivalent to $A_0A_{-1}\cdots A_{-n+1}e_1\to 0$. Using the structure of the matrices $A_t$ we next see that $A_0A_{-1}\cdots A_{-n+1}\to 0$ in probability as $n\to\infty$, by an argument similar as in the proof of Theorem 9.9. Because the matrices $A_t$ are independent and the event where $A_0A_{-1}\cdots A_{-n+1}\to 0$ is a tail event, this event must have probability one. It can be shown that this is possible only if the top Lyapounov exponent of the matrices $A_t$ is negative.†
9.12 Lemma. Let $\phi$ be a polynomial without roots on the unit disc and let $X_t$ be a time series that is bounded in probability. If $Z_t = \phi(B)X_t$ for every $t$, then $X_t$ is $\sigma(Z_t, Z_{t-1},\ldots)$-measurable.
Proof. Because $\phi(0)\neq 0$ by assumption, we can assume without loss of generality that $\phi$ possesses intercept 1. If $\phi$ is of degree 0, then $X_t = Z_t$ for every $t$ and the assertion is certainly true. We next proceed by induction on the degree of $\phi$. If $\phi$ is of degree $p\ge 1$, then we can write it as $\phi(z) = (1-\lambda z)\phi_{p-1}(z)$ for a polynomial $\phi_{p-1}$ of degree $p-1$ and a complex number $\lambda$ with $|\lambda| < 1$. The series $Y_t = (1-\lambda B)X_t$ is bounded in probability and $\phi_{p-1}(B)Y_t = Z_t$, whence $Y_t$ is $\sigma(Z_t, Z_{t-1},\ldots)$-measurable, by the induction hypothesis. By iterating the relation $X_t = \lambda X_{t-1} + Y_t$, we find that $X_t = \lambda^nX_{t-n} + \sum_{j=0}^{n-1}\lambda^jY_{t-j}$. Because the sequence $X_t$ is uniformly tight and $\lambda^n\to 0$, the sequence $\lambda^nX_{t-n}$ converges to zero in probability. Hence $X_t$ is the limit in probability of a sequence that is $\sigma(Y_t, Y_{t-1},\ldots)$-measurable and hence is $\sigma(Z_t, Z_{t-1},\ldots)$-measurable. This implies the result.
** 9.13 EXERCISE. In the preceding lemma the function $\psi(z) = 1/\phi(z)$ possesses a power series representation $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j$ on a neighbourhood of the unit disc. Is it true under the conditions of the lemma that $X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j}$, where the series converges (at least) in probability?
9.14 Example. For the GARCH(1, 1) process the random matrices $A_t$ given by (9.7) reduce to the random variables $\phi_1 + \theta_1Z_{t-1}^2$. The top Lyapounov exponent of these random $(1\times 1)$ matrices is equal to $E\log(\phi_1 + \theta_1Z_t^2)$. This number can be written as an integral relative to the distribution of $Z_t$, but in general is not easy to compute analytically. Figure 9.3 shows the points $(\phi_1, \theta_1)$ for which a strictly stationary GARCH(1, 1) process exists in the case of Gaussian innovations $Z_t$. It also shows the line $\phi_1 + \theta_1 = 1$, which is the boundary line for existence of a stationary solution.

† See Bougerol (), Lemma ?.
Figure 9.3. The shaded area gives all points $(\phi, \theta)$ where $E\log(\phi + \theta Z^2) < 0$ for a standard normal variable $Z$. For each point in this area a strictly stationary GARCH process with Gaussian innovations exists. This process is stationary if and only if $(\phi, \theta)$ falls under the line $\phi + \theta = 1$, which is also indicated.
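A sketch (not from the notes) reproducing the idea behind Figure 9.3: for Gaussian innovations, a point $(\phi, \theta)$ admits a strictly stationary GARCH(1, 1) process exactly when $E\log(\phi + \theta Z^2) < 0$, which is easy to check by a Monte Carlo average; the listed parameter pairs are arbitrary test points.

```python
# Check E log(phi + theta * Z^2) < 0 for standard normal Z by Monte Carlo.
import numpy as np

rng = np.random.default_rng(5)
z2 = rng.standard_normal(200_000) ** 2

def strictly_stationary(phi, theta):
    return np.mean(np.log(phi + theta * z2)) < 0

for phi, theta in [(0.5, 0.4), (0.5, 0.5), (0.9, 0.5), (0.0, 3.0), (0.0, 4.0)]:
    print((phi, theta), strictly_stationary(phi, theta))
# note that (0.5, 0.5), with phi + theta = 1, still lies in the strictly stationary region
```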
9.1.3 Stability, Persistence and Prediction

Lemma 9.3 gives a simple recipe for constructing a GARCH process, at least on the positive time set. In the preceding section it was seen that a (strictly) stationary solution exists only for certain parameter values, and then is unique. Clearly this stationary solution is obtained by the recipe of Lemma 9.3 only if the initial values in this lemma are chosen according to the stationary distribution.

In the following theorem we show that the effect of a nonstationary initialization wears off as $t\to\infty$ and the process will approach stationarity, provided that a stationary solution exists. This is true both for $L_2$-stationarity and strict stationarity, under the appropriate conditions on the coefficients.

9.15 Theorem. Let $\alpha > 0$, let $\phi_1,\ldots,\phi_p$, $\theta_1,\ldots,\theta_q$ be nonnegative numbers, and let $Z_t$ be an i.i.d. process with mean zero and unit variance such that $Z_t$ is independent of $\mathcal{F}_{t-1}$ for every $t\in\mathbb{Z}$.
(i) If $\sum_j(\phi_j+\theta_j) < 1$, then the difference $X_t - \tilde X_t$ of any two solutions $X_t = \sigma_tZ_t$ and $\tilde X_t = \tilde\sigma_tZ_t$ of the GARCH equations that are square-integrable converges to zero in $L_2$ as $t\to\infty$.
(ii) If the top Lyapounov exponent of the matrices $A_t$ in (9.7) is negative, then the difference $X_t - \tilde X_t$ of any two solutions $X_t = \sigma_tZ_t$ and $\tilde X_t = \tilde\sigma_tZ_t$ of the GARCH equations converges to zero in probability as $t\to\infty$.
Proof. From the two given GARCH processes $X_t$ and $\tilde X_t$ define processes $Y_t$ and $\tilde Y_t$ as indicated preceding the statement of Theorem 9.11. These processes satisfy (9.5) for the matrices $A_t$ given in (9.7). Choosing $n = t-1$ and taking differences we see that
$$Y_t - \tilde Y_t = A_tA_{t-1}\cdots A_1(Y_0 - \tilde Y_0).$$
If the top Lyapounov exponent of the matrices $A_t$ is negative, then the norm of the right side can be bounded, almost surely for sufficiently large $t$, by $e^{\tilde\gamma t}\|Y_0 - \tilde Y_0\|$ for some number $\tilde\gamma < 0$. This follows from the subergodic theorem, as before (even though this time the matrix product grows on its left side). This converges to zero as $t\to\infty$, implying that $\sigma_t - \tilde\sigma_t\to 0$ almost surely as $t\to\infty$. This in turn implies (ii).

Under the condition of (i), the spectral radius of the matrix $A = EA_t$ is strictly smaller than 1 and hence $\|A^n\|\to 0$. By the nonnegativity of the entries of the matrices $A_t$ the absolute values of the coordinates of the vectors $Y_t - \tilde Y_t$ are bounded above by the coordinates of the vector $A_tA_{t-1}\cdots A_1W_0$, for $W_0$ the vector obtained by replacing the coordinates of $Y_0 - \tilde Y_0$ by their absolute values. By the independence of the matrices $A_t$ and vector $W_0$, the expectation of $A_tA_{t-1}\cdots A_1W_0$ is bounded by $A^tEW_0$, which converges to zero. Because $\sigma_t^2 = Y_{t,1}$ and $X_t = \sigma_tZ_t$, this implies that, as $t\to\infty$,
$$E|X_t^2 - \tilde X_t^2| = E|\sigma_t^2 - \tilde\sigma_t^2|Z_t^2 = E|\sigma_t^2 - \tilde\sigma_t^2|\to 0.$$
For the stationary solution $X_t$ the sequence $(X_t^2)$ is uniformly integrable, because the variables $X_t^2$ possess a fixed marginal distribution with finite second moment. By the preceding display this is then also true for $\tilde X_t$, and hence also for a general $\tilde X_t$. The sequence $X_t - \tilde X_t$ is then uniformly square-integrable as well. Combining this with the fact that $X_t - \tilde X_t\to 0$ in probability, we see that $X_t - \tilde X_t$ converges to zero in second mean.
The preceding theorem may seem at odds with a common interpretation of a stationarity and stability condition as a condition for "persistence". The condition for $L_2$-stationarity of a GARCH process is stronger than the condition for strict stationarity, so that it appears as if we have found two different conditions for persistence. Whenever a strictly stationary solution exists, the influence of initial values wears off as time goes to infinity, and hence the initial values are not persistent. This is true independently of the validity of the condition $\sum_j(\phi_j+\theta_j) < 1$ for $L_2$-stationarity. However, the latter condition, if it holds, does ensure that the process approaches stationarity in the stronger $L_2$-sense.

The condition $\sum_j(\phi_j+\theta_j) < 1$ is necessary for the strictly stationary solution to have finite second moments. By an appropriate initialization we can ensure that a GARCH process has finite second moments for every $t$, even if this condition fails. (It will then not be stationary.) However, in this case the variances $EX_t^2$ must diverge to infinity as $t\to\infty$. This follows by a Fatou type argument, because the process will approach the strictly stationary solution and this has infinite variance.

The case that $\sum_j(\phi_j+\theta_j) = 1$ is often viewed as having particular interest and is referred to as integrated GARCH or IGARCH. Many financial time series yield GARCH fits that are close to IGARCH.
9.16 EXERCISE. Suppose that the time series $\tilde X_t$ is strictly stationary with infinite second moments and $X_t - \tilde X_t\to 0$ in probability as $t\to\infty$. Show that $EX_t^2\to\infty$.
9.1.4 Prediction

As it is a martingale difference process, a GARCH process does not allow nontrivial predictions of its mean values. However, it is of interest to predict the conditional variances $\sigma_t^2$, or equivalently the process of squares $X_t^2$. Predictions based on the infinite past $\mathcal{F}_t$ can be obtained using the auto-regressive representation from the proofs of Theorems 9.9 and 9.11. Let $A_t$ be the matrix given in (9.7) and let $Y_t = (\sigma_t^2,\ldots,\sigma_{t-p+1}^2, X_{t-1}^2,\ldots,X_{t-q+1}^2)^T$, so that $Y_t = A_tY_{t-1} + b$ for every $t$. The vector $Y_{t-1}$ is $\mathcal{F}_{t-2}$-measurable, and the matrix $A_t$ depends on $Z_{t-1}$ only, with $A = E(A_t\,|\,\mathcal{F}_{t-2})$ independent of $t$. It follows that
$$E(Y_t\,|\,\mathcal{F}_{t-2}) = E(A_t\,|\,\mathcal{F}_{t-2})Y_{t-1} + b = AY_{t-1} + b.$$
By iterating this equation we find that, for $h > 1$,
$$E(Y_t\,|\,\mathcal{F}_{t-h}) = A^{h-1}Y_{t-h+1} + \sum_{j=0}^{h-2}A^jb.$$
As $\sigma_t^2$ is the first coordinate of $Y_t$, this equation gives in particular the predictions for the volatility, based on the infinite past. If $\sum_j(\phi_j+\theta_j) < 1$, then the spectral radius of the matrix $A$ is strictly smaller than 1; the first term on the right, which carries the current information, then converges to zero at an exponential rate as $h\to\infty$, so that the predictions converge to the unconditional second moments (cf. Exercise 9.18). In this case the potential of predicting the conditional variance process is limited to the very near future. The case that $\sum_j(\phi_j+\theta_j)\ge 1$ is more interesting, as is illustrated by the following example.
9.17 Example (GARCH(1, 1)). For a GARCH(1, 1) process the vector $Y_t$ is equal to $\sigma_t^2$ and the matrix $A$ reduces to the number $\phi_1 + \theta_1$. The general equation can be rewritten in the form
$$(9.8)\qquad E(X_{t+h}^2\,|\,\mathcal{F}_t) = E(\sigma_{t+h}^2\,|\,\mathcal{F}_t) = (\phi_1+\theta_1)^{h-1}\sigma_{t+1}^2 + \sum_{j=0}^{h-2}(\phi_1+\theta_1)^j\alpha.$$
For $\phi_1 + \theta_1 < 1$ the first term on the far right converges to zero as $h\to\infty$, indicating that information at the present time $t$ does not help to predict the conditional variance process in the infinite future. On the other hand, if $\phi_1 + \theta_1\ge 1$ and $\alpha > 0$ then both terms on the far right side contribute positively as $h\to\infty$. If $\phi_1 + \theta_1 = 1$, then the contribution of the term $(\phi_1+\theta_1)^{h-1}\sigma_{t+1}^2 = \sigma_{t+1}^2$ is constant, and hence tends to zero relative to the nonrandom term (which becomes $(h-1)\alpha$), as $h\to\infty$. If $\phi_1 + \theta_1 > 1$, then the contributions of the two terms are of the same order. In both cases the volatility $\sigma_t^2$ persists into the future, although in the first case its relative influence on the prediction dwindles. If $\phi_1 + \theta_1 > 1$, then the value $\sigma_t^2$ is particularly persistent; this happens even if the time series tends to (strict) stationarity in distribution.
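A small sketch (not from the notes) of the prediction formula (9.8) for a GARCH(1, 1) process; the current one-step-ahead variance and the parameter values are assumed numbers for illustration.

```python
# Volatility predictions E(X_{t+h}^2 | F_t) from equation (9.8).
import numpy as np

def predict_squares(sigma2_next, h, alpha, phi1, theta1):
    """E(X_{t+h}^2 | F_t) for h >= 1, given sigma_{t+1}^2 (which is F_t-measurable)."""
    a = phi1 + theta1
    return a ** (h - 1) * sigma2_next + alpha * sum(a ** j for j in range(h - 1))

alpha, phi1, theta1 = 0.15, 0.4, 0.4
sigma2_next = 2.0                    # an assumed current one-step-ahead variance
for h in (1, 2, 5, 20, 100):
    print(h, predict_squares(sigma2_next, h, alpha, phi1, theta1))
# for phi1 + theta1 < 1 the predictions tend to alpha / (1 - phi1 - theta1) as h grows
```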
9.18 EXERCISE. Suppose that $\sum_j(\phi_j+\theta_j) < 1$ and let $X_t$ be a stationary GARCH process. Show that $E(X_{t+h}^2\,|\,\mathcal{F}_t)\to EX_t^2$ as $h\to\infty$.
* 9.2 Linear GARCH with Leverage and Power GARCH

Fluctuations of foreign exchange rates tend to be symmetric, in view of the two-sided nature of the foreign exchange market. However, it is an empirical finding that for asset prices the current returns and future volatility are negatively correlated. For instance, a crash in the stock market will be followed by large volatility.

A linear GARCH model is not able to capture this type of asymmetric relationship, because it models the volatility as a function of the squares of the past returns. One attempt to allow for asymmetry is to replace the GARCH equation (9.1) by
$$\sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1\bigl(|X_{t-1}| + \gamma_1X_{t-1}\bigr)^2 + \cdots + \theta_q\bigl(|X_{t-q}| + \gamma_qX_{t-q}\bigr)^2.$$
This reduces to the ordinary GARCH equation if the leverage coefficients $\gamma_i$ are set equal to zero. If these coefficients are negative, then a positive deviation of the process $X_t$ contributes to lower volatility in the near future, and conversely.

A power GARCH model is obtained by replacing the squares in the preceding display by other powers.
* 9.3 Exponential GARCH

The exponential GARCH or EGARCH model is significantly different from the GARCH models described so far. It retains the basic set-up of a process of the form $X_t = \sigma_tZ_t$ for a martingale difference sequence $Z_t$ satisfying (9.2) and an $\mathcal{F}_{t-1}$-adapted process $\sigma_t$, but replaces the GARCH equation by
$$\log\sigma_t^2 = \alpha + \phi_1\log\sigma_{t-1}^2 + \cdots + \phi_p\log\sigma_{t-p}^2 + \theta_1\bigl(|Z_{t-1}| + \gamma_1Z_{t-1}\bigr) + \cdots + \theta_q\bigl(|Z_{t-q}| + \gamma_qZ_{t-q}\bigr).$$
Through the presence of both the variables $Z_t$ and their absolute values and the transformation to the logarithmic scale this can also capture the leverage effect. An advantage of modelling the logarithm of the volatility is that the parameters of the model need not be restricted to be positive.

Figure 9.4. The function $x\mapsto(|x| + \gamma x)^2$ for $\gamma = 0.2$.

Because the EGARCH model specifies the log volatility directly in terms of the noise process $Z_t$ and its own past, its definition is less recursive than the ordinary GARCH definition, and easier to handle. In particular, for fixed and identical leverage coefficients $\gamma_i = \gamma$ the EGARCH equation describes the log volatility process $\log\sigma_t^2$ as a regular ARMA process driven by the noise process $|Z_t| + \gamma Z_t$, and we may use the theory for ARMA processes to study its properties. In particular, if the roots of the polynomial $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$ are outside the unit circle, then there exists a stationary solution $\log\sigma_t^2$ that is measurable relative to the $\sigma$-field generated by the process $Z_{t-1}, Z_{t-2},\ldots$. If the process $Z_t$ is strictly stationary, then so is the stationary solution $\log\sigma_t^2$ and so is the EGARCH process $X_t = \sigma_tZ_t$.
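A simulation sketch (not from the notes) of an EGARCH(1, 1) process; all parameter values below are assumptions chosen only for illustration, with a negative leverage coefficient so that negative returns increase future volatility.

```python
# EGARCH(1,1): log sigma_t^2 = alpha + phi1*log sigma_{t-1}^2 + theta1*(|Z_{t-1}| + gamma1*Z_{t-1}).
import numpy as np

alpha, phi1, theta1, gamma1 = -0.1, 0.9, 0.3, -0.5
rng = np.random.default_rng(6)
n = 1000
z = rng.standard_normal(n)
logs2 = np.zeros(n)
x = np.zeros(n)
for t in range(1, n):
    logs2[t] = alpha + phi1 * logs2[t-1] + theta1 * (abs(z[t-1]) + gamma1 * z[t-1])
    x[t] = np.exp(0.5 * logs2[t]) * z[t]
# |phi1| < 1, so the log volatility is a stationary AR(1) driven by |Z_t| + gamma1*Z_t
```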
* 9.4 GARCH in Mean

A GARCH process by its definition is a white noise process, and thus it could be a useful candidate to drive another process. For instance, an observed process $Y_t$ could be assumed to satisfy the ARMA equation
$$\phi(B)Y_t = \theta(B)X_t,$$
for $X_t$ a GARCH process, relative to other polynomials $\phi$ and $\theta$ (which are unrelated to the polynomials in the GARCH equation for $X_t$). One then says that $Y_t$ is "ARMA in the mean" and "GARCH in the variance", or that $Y_t$ is an ARMA-GARCH series. Results on ARMA processes that hold for any driving white noise process will clearly also hold in the present case, where the white noise process is a GARCH process.
9.19 EXERCISE. Let $X_t$ be a stationary GARCH process relative to polynomials $\phi$ and $\theta$, and let the time series $Y_t$ be the unique stationary solution to the equation $\phi(B)Y_t = \theta(B)X_t$, for $\phi$ and $\theta$ polynomials that have all their roots outside the unit disc. Let $\mathcal{F}_t$ be the filtration generated by $Y_t$. Show that $\mathrm{var}(Y_t\,|\,\mathcal{F}_{t-1}) = \mathrm{var}(X_t\,|\,X_{t-1}, X_{t-2},\ldots)$ almost surely.
It has been found useful to go a step further and let also the conditional variance of the driving GARCH process appear in the mean model for the process $Y_t$. Thus given a GARCH process $X_t$ with conditional variance process $\sigma_t^2 = \mathrm{var}(X_t\,|\,\mathcal{F}_{t-1})$ it is assumed that $Y_t = f(\sigma_t, X_t)$ for a fixed function $f$. The function $f$ is assumed known up to a number of parameters. For instance,
$$\phi(B)Y_t = \sigma_t + \theta(B)X_t,$$
$$\phi(B)Y_t = \sigma_t^2 + \theta(B)X_t,$$
$$\phi(B)Y_t = \log\sigma_t^2 + \theta(B)X_t.$$
These models are known as GARCH-in-mean, or GARCH-M models.
10 State Space Models

A stochastic process $X_t$ is Markov if the conditional distribution of $X_{t+1}$ given the past values $X_t, X_{t-1},\ldots$ depends on the present value $X_t$ only, for every $t$. The evolution of a Markov process can be pictured as a sequence of moves in a "state space", where at each time $t$ the process chooses a new state $X_{t+1}$ based only on the current state $X_t$, and not on the past evolution. The current state thus contains all necessary information for the future.

The Markov structure obviously facilitates prediction, since only the current state is needed to predict the future. It also permits a simple factorization of the likelihood. Markov structure can be created in any time series by redefining the states of the series to incorporate enough past information. The present state should contain all information that is relevant for the next state. Specifically, given an arbitrary time series $X_t$, the process $\vec X_t = (X_t, X_{t-1},\ldots)$ is Markov. However, the high complexity of the state space of the process $\vec X_t$ offsets the advantages of the Markov property.

General state space models take the idea of states that contain the relevant information as a basis, but go beyond Markov models. They typically consist of a specification of a Markov process together with an output process. Although they come from a different background, hidden Markov models are almost identical structures.

The showpiece of state space modelling is the Kalman filter. This is an algorithm to compute linear predictions for certain linear state space models, under the assumption that the parameters of the system are known. Because the formulas for the predictors, which are functions of the parameters and the outputs, can in turn be used to set up estimating equations for the parameters, the Kalman filter is also important for statistical analysis. We start discussing parameter estimation in Chapter 11.
10.1 Example (AR(1)). A causal, stationary AR(1) process with i.i.d. innovations Z
t
is a Markov process: the future value X
t+1
= X
t
+ Z
t+1
given the past values
X
1
, . . . , X
t
depends on the present value X
t
only. Specically, the conditional density
of X
t+1
is given by
p
Xt+1]X1,...,Xt
(x) = p
Z
(x X
t
).
10.1: Hidden Markov Models and State Spaces 153
The assumption of causality ensures that Z
t+1
is independent of X
1
, . . . , X
t
. The Markov
structure allows to factorize the density of (X
1
, . . . , X
n
), as
p
X1,...,Xn
(X
1
, . . . , X
n
) =
n

t=2
p
Z
(X
t
X
t1
)p
X1
(X
1
).
This expression as a function of the unknown parameter is the likelihood for the model.
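As a small illustration, the factorization translates directly into a log likelihood. The sketch below assumes Gaussian innovations, $|\phi| < 1$ and the stationary distribution for the initial term (choices not spelled out in the text):

```python
import numpy as np
from scipy.stats import norm

def ar1_loglik(x, phi, sigma):
    """Log likelihood of an observed AR(1) path under Gaussian innovations,
    using the Markov factorization prod_t p_Z(X_t - phi*X_{t-1}) * p_{X_1}(X_1).
    The initial term uses the stationary N(0, sigma^2/(1-phi^2)) distribution."""
    x = np.asarray(x, dtype=float)
    stat_sd = sigma / np.sqrt(1 - phi ** 2)          # requires |phi| < 1
    ll = norm.logpdf(x[0], scale=stat_sd)            # p_{X_1}(X_1)
    ll += norm.logpdf(x[1:] - phi * x[:-1], scale=sigma).sum()
    return ll
```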
10.2 EXERCISE. How can a causal AR($p$) process be forced into Markov form? Give a precise argument. Is causality needed?
10.1 Hidden Markov Models and State Spaces

A hidden Markov model consists of a Markov chain, but rather than the state at time $t$, we observe an "output" variable, which is generated by a mechanism which depends on the state. This is illustrated in Figure 10.1, where the sequence $\ldots, X_1, X_2, \ldots$ depicts a Markov chain, and $\ldots, Y_1, Y_2, \ldots$ are the outputs at times $1, 2, \ldots$. It is assumed that given the state sequence $\ldots, X_1, X_2, \ldots$ the outputs $\ldots, Y_1, Y_2, \ldots$ are independent, and the conditional distribution of $Y_t$ given the states depends on $X_t$ only. Thus the arrows in the picture indicate dependence relationships.

[Figure 10.1. Hidden Markov model: the states $X_1, X_2, \ldots, X_n$ form a Markov chain, and each output $Y_t$ is generated from the corresponding state $X_t$ only.]

The probabilistic properties of a hidden Markov process are simple. For instance, the joint sequence $(X_t, Y_t)$ can be seen to be Markovian, and we can easily write down the joint density of the variables $(X_1, Y_1, X_2, Y_2, \ldots, X_n, Y_n)$ as
$$p(x_1)\,p(x_2\,|\,x_1)\cdots p(x_n\,|\,x_{n-1})\;p(y_1\,|\,x_1)\cdots p(y_n\,|\,x_n).$$
The difficulty is to do statistics in the case where only the outputs $Y_1, \ldots, Y_n$ are observed. Typical aims are to estimate parameters of the underlying distribution, and to compute best guesses of the unobserved states given the observed outputs. (In the preceding display, as later on in the chapter, we abuse notation by writing the conditional density of a variable $Y$ given a variable $X$ as $p(y\,|\,x)$ and a marginal density of $X$ as $p(x)$, thus using the symbol $p$ for any density function, leaving it to the arguments to reveal which density it is.)

Many state space models are hidden Markov models, although they are often described in a different manner. Given an initial state $X_0$, disturbances $V_1, W_1, V_2, W_2, \ldots$ and functions $f_t$ and $g_t$, processes $X_t$ and $Y_t$ are defined recursively by
$$(10.1)\qquad X_t = f_t(X_{t-1}, V_t),\qquad Y_t = g_t(X_t, W_t).$$
We refer to $X_t$ as the "state" at time $t$ and to $Y_t$ as the "output". The state process $X_t$ can be viewed as primary and evolving in the background, describing the consecutive states of a system in time. At each time $t$ the system is "measured", producing an output $Y_t$.

If the sequence $X_0, V_1, V_2, \ldots$ consists of independent variables, then the state process $X_t$ is a Markov chain. If, moreover, the variables $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent, then for every $t$ given the state $X_t$ the output $Y_t$ is conditionally independent of the states $X_0, X_1, \ldots$ and outputs $Y_1, \ldots, Y_{t-1}$. Then the state space model becomes a hidden Markov model. Conversely, any hidden Markov model arises in this form.
10.3 Lemma. If $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent random variables, then the pair of sequences $X_t$ and $Y_t$ defined in (10.1) forms a hidden Markov model: the sequence $X_1, X_2, \ldots$ forms a Markov chain, the variables $Y_1, Y_2, \ldots$ are conditionally independent given $X_0, X_1, X_2, \ldots$, and $Y_t$ is conditionally independent of $(X_s: s \ne t)$ given $X_t$, for every $t$. Conversely, every hidden Markov model with state and output variables $X_t$ and $Y_t$ in Polish spaces can be realized in the form (10.1) for some sequence of independent random variables $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ and measurable functions $f_t$ and $g_t$.

Typically, the state process $X_t$ in the state space model (10.1) is not observed, but at time $t$ we only observe the output $Y_t$. For this reason the process $Y_t$ is also referred to as the measurement process. The second equation in the display (10.1) is called the measurement equation, while the first is the state equation. Inference might be directed at estimating parameters attached to the functions $f_t$ or $g_t$, to the distribution of the errors or to the initial state, and/or on predicting or reconstructing the states $X_t$ from the observed outputs $Y_1, \ldots, Y_n$. Predicting or reconstructing the state sequence is referred to as "filtering" or "smoothing".

For linear functions $f_t$ and $g_t$ and vector-valued states and outputs the state space model can without loss of generality be written in the form
$$(10.2)\qquad X_t = F_t X_{t-1} + V_t,\qquad Y_t = G_t X_t + W_t.$$
The matrices $F_t$ and $G_t$ are often postulated to be independent of $t$. In this linear state space model the analysis usually concerns linear predictions, and then a common assumption is that the vectors $X_0, V_1, W_1, V_2, \ldots$ are uncorrelated. If $F_t$ is independent of $t$ and the vectors $V_t$ form a white noise process, then the series $X_t$ is a VAR(1) process.

Because state space models are easy to handle, it is of interest to represent a given observable time series $Y_t$ as the output of a state space model. This entails finding a state space, a state process $X_t$, and a corresponding state space model with the given series $Y_t$ as output. It is particularly attractive to find a linear state space model. Such a state space representation is definitely not unique. An important issue in systems theory is to find a (linear) state space representation of minimal dimension.
10.4 Example (State space representation ARMA). Let $Y_t$ be a stationary, causal ARMA($r+1$, $r$) process satisfying $\phi(B)Y_t = \theta(B)Z_t$ for an i.i.d. process $Z_t$. (The choice $p = q + 1$ can always be achieved by padding the set of coefficients of the polynomials $\phi$ or $\theta$ with zeros.) Then the AR($p$) process $X_t = (1/\phi)(B)Z_t$ is related to $Y_t$ through $Y_t = \theta(B)X_t$. Thus
$$Y_t = (\theta_0, \ldots, \theta_r)\begin{pmatrix}X_t\\ \vdots\\ X_{t-r}\end{pmatrix},\qquad
\begin{pmatrix}X_t\\ X_{t-1}\\ \vdots\\ X_{t-r}\end{pmatrix}
= \begin{pmatrix}\phi_1 & \phi_2 & \cdots & \phi_r & \phi_{r+1}\\ 1 & 0 & \cdots & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & 1 & 0\end{pmatrix}
\begin{pmatrix}X_{t-1}\\ X_{t-2}\\ \vdots\\ X_{t-r-1}\end{pmatrix}
+ \begin{pmatrix}Z_t\\ 0\\ \vdots\\ 0\end{pmatrix}.$$
This is a linear state space representation (10.2) with state vector $(X_t, \ldots, X_{t-r})$, and matrices $F_t$ and $G_t$ that are independent of $t$. Under causality the innovations $V_t = (Z_t, 0, \ldots, 0)$ are orthogonal to the past $X_t$ and $Y_t$; the innovations $W_t$ as in (10.2) are defined to be zero. The state vectors are typically unobserved, except when $\theta$ is of degree zero. (If the ARMA process is invertible and the coefficients of $\theta$ are known, then the states can be reconstructed from the infinite past through the relation $X_t = (1/\theta)(B)Y_t$.)

In the present representation the state dimension of the ARMA($p$, $q$) process is $r + 1 = \max(p, q + 1)$. By using a more complicated noise process it is possible to represent an ARMA($p$, $q$) process in dimension $\max(p, q)$, but this difference appears not to be very important. (See e.g. Brockwell and Davis, pp. 469-471.)
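A short sketch of how the companion-form matrices of this representation could be assembled numerically; the zero-padding convention and the function name are mine:

```python
import numpy as np

def arma_state_space(phi, theta):
    """Sketch of the companion-form state space representation of Example 10.4.
    phi = (phi_1, ..., phi_p), theta = (theta_0, ..., theta_q); both are padded
    with zeros so that the state dimension is r + 1 = max(p, q + 1)."""
    p, q = len(phi), len(theta) - 1
    r = max(p, q + 1) - 1
    phi_pad = np.concatenate([phi, np.zeros(r + 1 - p)])
    theta_pad = np.concatenate([theta, np.zeros(r + 1 - (q + 1))])
    F = np.zeros((r + 1, r + 1))
    F[0, :] = phi_pad                      # first row: phi_1, ..., phi_{r+1}
    F[1:, :-1] = np.eye(r)                 # shift the remaining coordinates
    G = theta_pad.reshape(1, -1)           # Y_t = (theta_0, ..., theta_r) X_t
    return F, G

F, G = arma_state_space(phi=[0.5, -0.2], theta=[1.0, 0.4])
```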

10.5 Example (State space representation ARIMA). Consider a time series $Z_t$ whose differences $Y_t = \nabla Z_t$ satisfy the linear state space model (10.2) for a state sequence $X_t$. Writing $Z_t = Y_t + Z_{t-1} = G_t X_t + W_t + Z_{t-1}$, we obtain that
$$\begin{pmatrix}X_t\\ Z_{t-1}\end{pmatrix} = \begin{pmatrix}F_t & 0\\ G_{t-1} & 1\end{pmatrix}\begin{pmatrix}X_{t-1}\\ Z_{t-2}\end{pmatrix} + \begin{pmatrix}V_t\\ W_{t-1}\end{pmatrix},\qquad
Z_t = (\,G_t\quad 1\,)\begin{pmatrix}X_t\\ Z_{t-1}\end{pmatrix} + W_t.$$
We conclude that the time series $Z_t$ possesses a linear state space representation, with states of one dimension higher than the states of the original series.

A drawback of the preceding representation is that the error vectors $(V_t, W_{t-1}, W_t)$ are not necessarily uncorrelated if the error vectors $(V_t, W_t)$ in the system with outputs $Y_t$ have this property. In the case that $Z_t$ is an ARIMA($p$, 1, $q$) process, we may use the state representation of the preceding example for the ARMA($p$, $q$) process $Y_t$, which has errors $W_t = 0$, and this disadvantage does not arise. Alternatively, we can avoid this problem by using another state space representation. For instance, we can write
$$\begin{pmatrix}X_t\\ Z_t\end{pmatrix} = \begin{pmatrix}F_t & 0\\ G_t F_t & 1\end{pmatrix}\begin{pmatrix}X_{t-1}\\ Z_{t-1}\end{pmatrix} + \begin{pmatrix}V_t\\ G_t V_t + W_t\end{pmatrix},\qquad
Z_t = (\,0\quad 1\,)\begin{pmatrix}X_t\\ Z_t\end{pmatrix}.$$
This illustrates that there may be multiple possibilities to represent a time series as the output of a (linear) state space model.

The preceding can be extended to general ARIMA($p$, $d$, $q$) models. If $Y_t = (1 - B)^d Z_t$, then $Z_t = Y_t - \sum_{j=1}^d\binom{d}{j}(-1)^j Z_{t-j}$. If the process $Y_t$ can be represented as the output of a state space model with state vectors $X_t$, then $Z_t$ can be represented as the output of a state space model with the extended states $(X_t, Z_{t-1}, \ldots, Z_{t-d})$, or, alternatively, $(X_t, Z_t, \ldots, Z_{t-d+1})$.
10.6 Example (Stochastic linear trend). A time series with a linear trend could be modelled as $Y_t = \alpha + \beta t + W_t$ for constants $\alpha$ and $\beta$, and a stationary process $W_t$ (for instance an ARMA process). This restricts the nonstationary part of the time series to a deterministic component, which may be unrealistic. An alternative is the stochastic linear trend model described by
$$\begin{pmatrix}A_t\\ B_t\end{pmatrix} = \begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}\begin{pmatrix}A_{t-1}\\ B_{t-1}\end{pmatrix} + V_t,\qquad Y_t = A_t + W_t.$$
The stochastic processes $(A_t, B_t)$ and noise processes $(V_t, W_t)$ are unobserved. This state space model contains the deterministic linear trend model as the degenerate case where $V_t \equiv 0$, so that $B_t \equiv B_0$ and $A_t \equiv A_0 + B_0 t$.

The state equations imply that $\nabla A_t = B_{t-1} + V_{t,1}$ and $\nabla B_t = V_{t,2}$, for $V_t = (V_{t,1}, V_{t,2})^T$. Taking differences on the output equation $Y_t = A_t + W_t$ twice, we find that
$$\nabla^2 Y_t = \nabla B_{t-1} + \nabla V_{t,1} + \nabla^2 W_t = V_{t-1,2} + \nabla V_{t,1} + \nabla^2 W_t.$$
If the process $(V_t, W_t)$ is a white noise process, then the auto-correlation function of the process on the right vanishes for lags bigger than 2 (the polynomial $\nabla^2 = (1 - B)^2$ being of degree 2). Thus the right side is an MA(2) process, whence the process $Y_t$ is an ARIMA(0,2,2) process.
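A minimal simulation sketch of this stochastic linear trend model (the noise scales are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_stochastic_trend(n, sd_level=0.5, sd_slope=0.05, sd_obs=1.0):
    """Sketch of the stochastic linear trend model of Example 10.6:
    (A_t, B_t) = [[1,1],[0,1]] (A_{t-1}, B_{t-1}) + V_t,  Y_t = A_t + W_t."""
    A, B = 0.0, 0.0
    Y = np.empty(n)
    for t in range(n):
        A = A + B + sd_level * rng.standard_normal()   # level picks up the slope
        B = B + sd_slope * rng.standard_normal()       # slope follows a random walk
        Y[t] = A + sd_obs * rng.standard_normal()
    return Y

Y = simulate_stochastic_trend(200)
d2Y = np.diff(Y, n=2)   # the second differences should behave like an MA(2) series
```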
10.7 Example (Structural models). Besides a trend we may suspect that a given time series shows a seasonal effect. One possible parametrization of a deterministic seasonal effect with $S$ seasons is the function
$$(10.3)\qquad t\mapsto \gamma_0 + \sum_{s=1}^{\lfloor S/2\rfloor}\bigl(\gamma_s\cos(\lambda_s t) + \delta_s\sin(\lambda_s t)\bigr),\qquad \lambda_s = \frac{2\pi s}{S},\ s = 1, \ldots, \lfloor S/2\rfloor.$$
By appropriate choice of the parameters $\gamma_s$ and $\delta_s$ this function is able to adapt to any periodic function on the integers with period $S$. (See Exercise 10.9. If $S$ is even, then the last sine function $t\mapsto\sin(\lambda_{S/2}t)$ vanishes, and the coefficient $\delta_{S/2}$ is irrelevant. There are always $S$ nontrivial terms in the sum.) We could add this deterministic function to a given time series model in order to account for seasonality. Again it may not be realistic to require the seasonality a priori to be deterministic. An alternative is to replace the fixed function $s\mapsto(\gamma_s, \delta_s)$ by the time series defined by
$$\begin{pmatrix}\gamma_{s,t}\\ \delta_{s,t}\end{pmatrix} = \begin{pmatrix}\cos\lambda_s & \sin\lambda_s\\ \sin\lambda_s & -\cos\lambda_s\end{pmatrix}\begin{pmatrix}\gamma_{s,t-1}\\ \delta_{s,t-1}\end{pmatrix} + V_{s,t}.$$
An observed time series may next have the form
$$Y_t = (\,1\ \ 1\ \cdots\ 1\,)\begin{pmatrix}\gamma_{1,t}\\ \gamma_{2,t}\\ \vdots\\ \gamma_{\lfloor S/2\rfloor,t}\end{pmatrix} + Z_t.$$
Together these equations again constitute a linear state space model. If $V_t \equiv 0$, then this reduces to the deterministic seasonal model. (Cf. Exercise 10.8.)

A model with both a stochastic linear trend and a stochastic seasonal component is known as a structural model.
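The seasonal state equations assemble into a block-diagonal transition matrix, one $2\times2$ block per frequency; a sketch (function name mine):

```python
import numpy as np

def seasonal_harmonic_transition(S):
    """Sketch: block-diagonal transition matrix for the stochastic seasonal
    component of Example 10.7, one 2x2 block per frequency
    lambda_s = 2*pi*s/S, s = 1, ..., floor(S/2)."""
    blocks = []
    for s in range(1, S // 2 + 1):
        lam = 2 * np.pi * s / S
        blocks.append(np.array([[np.cos(lam), np.sin(lam)],
                                [np.sin(lam), -np.cos(lam)]]))
    F = np.zeros((2 * len(blocks), 2 * len(blocks)))
    for i, b in enumerate(blocks):
        F[2 * i:2 * i + 2, 2 * i:2 * i + 2] = b
    # The output picks out the first coordinate of each block:
    G = np.tile([1.0, 0.0], len(blocks)).reshape(1, -1)
    return F, G

F, G = seasonal_harmonic_transition(S=12)
```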
10.8 EXERCISE. Consider the state space model with state equations $\gamma_t = \gamma_{t-1}\cos\lambda + \delta_{t-1}\sin\lambda + V_{t,1}$ and $\delta_t = \gamma_{t-1}\sin\lambda - \delta_{t-1}\cos\lambda + V_{t,2}$ and output equation $Y_t = \gamma_t + W_t$. What does this model reduce to if $V_t \equiv 0$?

10.9 EXERCISE.
(i) Show that the function (of $t\in\mathbb Z$) in (10.3) is periodic with period $S$.
(ii) Show that any periodic function $f:\mathbb Z\to\mathbb R$ with period $S$ can be written in the form (10.3).
[For (ii) it suffices to show that any vector $\bigl(f(1), \ldots, f(S)\bigr)$ can be represented as a linear combination of the vectors $a_s = \bigl(\cos\lambda_s, \cos(2\lambda_s), \ldots, \cos(S\lambda_s)\bigr)$ for $s = 0, 1, \ldots, \lfloor S/2\rfloor$ and the vectors $b_s = \bigl(\sin\lambda_s, \sin(2\lambda_s), \ldots, \sin(S\lambda_s)\bigr)$, for $s = 1, 2, \ldots, \lfloor S/2\rfloor$. To prove this note that the vectors $e_s = (e^{i\lambda_s}, e^{2i\lambda_s}, \ldots, e^{Si\lambda_s})^T$, for $\lambda_s$ running through the natural frequencies, form an orthogonal basis of $\mathbb C^S$. The vectors $a_s = \operatorname{Re}e_s$ and $b_s = \operatorname{Im}e_s$ can be seen to be orthogonal, apart from the trivial cases $b_0 = 0$ and $b_{S/2} = 0$ if $S$ is even.]

The variables $X_t$ and $Y_t$ in a state space model will typically be random vectors. For two random vectors $X$ and $Y$ of dimensions $m$ and $n$ the covariance or "cross-covariance" is the $(m\times n)$ matrix $\operatorname{Cov}(X, Y) = \mathrm E(X - \mathrm EX)(Y - \mathrm EY)^T$. The random vectors $X$ and $Y$ are called "uncorrelated" if $\operatorname{Cov}(X, Y) = 0$, or equivalently if $\operatorname{cov}(X_i, Y_j) = 0$ for every pair $(i, j)$. The linear span of a set of vectors is defined as the linear span of all their coordinates. Thus this is a space of (univariate) random variables, rather than random vectors! We shall also understand a projection operator $\Pi$, which is a map on the space of random variables, to act coordinatewise on vectors: if $X$ is a vector, then $\Pi X$ is the vector consisting of the projections of the coordinates of $X$. As a vector-valued operator a projection is still linear, in that $\Pi(FX + Y) = F\Pi X + \Pi Y$, for any matrix $F$ and random vectors $X$ and $Y$.
10.2 Kalman Filtering

The Kalman filter is a recursive algorithm to compute best linear predictions of the states $X_1, X_2, \ldots$ given observations $Y_1, Y_2, \ldots$ in the linear state space model (10.2). The core algorithm allows to compute predictions $\Pi_t X_t$ of the states $X_t$ given observed outputs $Y_1, \ldots, Y_t$. Here by "predictions" we mean Hilbert space projections. Given the time values involved, "reconstructions" would perhaps be more appropriate; "filtering" is the preferred term in systems theory. Given the reconstructions $\Pi_t X_t$, it is easy to compute predictions of future states and future outputs. A next step is "Kalman smoothing", which is the name for the reconstruction (again through projections) of the full state sequence $X_1, \ldots, X_n$ given the outputs $Y_1, \ldots, Y_n$.

In the simplest situation the vectors $X_0, V_1, W_1, V_2, W_2, \ldots$ are assumed uncorrelated. We shall first derive the filter under the more general assumption that the vectors $X_0, (V_1, W_1), (V_2, W_2), \ldots$ are uncorrelated, and in Section 10.2.4 we further relax this condition. The matrices $F_t$ and $G_t$ as well as the covariance matrices of the noise variables $(V_t, W_t)$ are assumed known.

By applying (10.2) recursively, we see that the vector $X_t$ is contained in the linear span of the variables $X_0, V_1, \ldots, V_t$. It is immediate from (10.2) that the vector $Y_t$ is contained in the linear span of $X_t$ and $W_t$. These facts are true for every $t\in\mathbb N$. It follows that under our conditions the noise variables $V_t$ and $W_t$ are uncorrelated with all vectors $X_s$ and $Y_s$ with $s < t$.

Let $H_0$ be a given closed linear subspace of $L_2(\Omega, \mathcal U, P)$ that contains the constants, and for $t \ge 0$ let $\Pi_t$ be the orthogonal projection onto the space $H_t = H_0 + \operatorname{lin}(Y_1, \ldots, Y_t)$. The space $H_0$ may be viewed as our "knowledge" at time 0; it may be $H_0 = \operatorname{lin}1$. We assume that the noise vectors $V_1, W_1, V_2, \ldots$ are orthogonal to $H_0$. Combined with the preceding this shows that the vector $(V_t, W_t)$ is orthogonal to the space $H_{t-1}$, for every $t \ge 1$.

The Kalman filter consists of the recursions
$$\begin{pmatrix}\Pi_{t-1}X_{t-1}\\ \operatorname{Cov}(\Pi_{t-1}X_{t-1})\\ \operatorname{Cov}(X_{t-1})\end{pmatrix}
\;\xrightarrow{\ (1)\ }\;
\begin{pmatrix}\Pi_{t-1}X_t\\ \operatorname{Cov}(\Pi_{t-1}X_t)\\ \operatorname{Cov}(X_t)\end{pmatrix}
\;\xrightarrow{\ (2)\ }\;
\begin{pmatrix}\Pi_t X_t\\ \operatorname{Cov}(\Pi_t X_t)\\ \operatorname{Cov}(X_t)\end{pmatrix}
\;\xrightarrow{\ (1)\ }\;\cdots$$
Thus the Kalman filter alternates between "updating the current state", step (1), and "updating the prediction space", step (2).
Step (1) is simple. Because $V_t\perp H_0, Y_1, \ldots, Y_{t-1}$ by assumption, we have $\Pi_{t-1}V_t = 0$. Applying $\Pi_{t-1}$ to the state equation $X_t = F_t X_{t-1} + V_t$ we find that, by the linearity of a projection,
$$\Pi_{t-1}X_t = F_t(\Pi_{t-1}X_{t-1}),$$
$$\operatorname{Cov}(\Pi_{t-1}X_t) = F_t\operatorname{Cov}(\Pi_{t-1}X_{t-1})F_t^T,$$
$$\operatorname{Cov}(X_t) = F_t\operatorname{Cov}(X_{t-1})F_t^T + \operatorname{Cov}(V_t).$$
This gives a complete description of step (1) of the algorithm.

Step (2) is more involved, but also comes down to simple matrix computations. The vector $\tilde W_t = Y_t - \Pi_{t-1}Y_t$ is known as the innovation at time $t$, because it is the part of $Y_t$ that is not explainable at time $t-1$. It is orthogonal to $H_{t-1}$, and together with this space spans $H_t$. It follows that $H_t$ can be orthogonally decomposed as $H_t = H_{t-1} + \operatorname{lin}\tilde W_t$ and hence the projection onto $H_t$ is the sum of the projections onto the spaces $H_{t-1}$ and $\operatorname{lin}\tilde W_t$. At the beginning of step (2) the vector $\tilde W_t$ is known, because we can write, using the measurement equation and the fact that $\Pi_{t-1}W_t = 0$,
$$(10.4)\qquad \tilde W_t = Y_t - \Pi_{t-1}Y_t = Y_t - G_t\Pi_{t-1}X_t = G_t(X_t - \Pi_{t-1}X_t) + W_t.$$
The middle expression is easy to compute from the current values at the beginning of step (2). Applying the orthogonal decomposition to the projection of the variable $X_t$, we find
$$(10.5)\qquad \Pi_t X_t = \Pi_{t-1}X_t + \Lambda_t\tilde W_t,\qquad \Lambda_t = \operatorname{Cov}(X_t, \tilde W_t)\operatorname{Cov}(\tilde W_t)^{-1}.$$
The matrix $\Lambda_t$ is chosen such that $\Lambda_t\tilde W_t$ is the projection of $X_t$ onto $\operatorname{lin}\tilde W_t$. Because $W_t\perp X_{t-1}$ the state equation yields $\operatorname{Cov}(X_t, W_t) = \operatorname{Cov}(V_t, W_t)$. By the orthogonality property of projections $\operatorname{Cov}(X_t, X_t - \Pi_{t-1}X_t) = \operatorname{Cov}(X_t - \Pi_{t-1}X_t)$. Combining this and the identity $\tilde W_t = G_t(X_t - \Pi_{t-1}X_t) + W_t$ from (10.4), we compute
$$(10.6)\qquad\begin{aligned}
\operatorname{Cov}(X_t, \tilde W_t) &= \operatorname{Cov}(X_t - \Pi_{t-1}X_t)G_t^T + \operatorname{Cov}(V_t, W_t),\\
\operatorname{Cov}(\tilde W_t) &= G_t\operatorname{Cov}(X_t - \Pi_{t-1}X_t)G_t^T + G_t\operatorname{Cov}(V_t, W_t) + \operatorname{Cov}(W_t, V_t)G_t^T + \operatorname{Cov}(W_t),\\
\operatorname{Cov}(X_t - \Pi_{t-1}X_t) &= \operatorname{Cov}(X_t) - \operatorname{Cov}(\Pi_{t-1}X_t).
\end{aligned}$$
The matrix $\operatorname{Cov}(X_t - \Pi_{t-1}X_t)$ is the prediction error matrix at time $t-1$ and the last equation follows by Pythagoras' rule. To complete the recursion of step (2) we compute from (10.5)
$$(10.7)\qquad \operatorname{Cov}(\Pi_t X_t) = \operatorname{Cov}(\Pi_{t-1}X_t) + \Lambda_t\operatorname{Cov}(\tilde W_t)\Lambda_t^T.$$
Equations (10.5)-(10.7) give a complete description of step (2) of the Kalman recursion.

The Kalman algorithm must be initialized in one of its two steps, for instance by providing $\Pi_0 X_1$ and its covariance matrix, so that the recursion can start with a step of type (2). It is here where the choice of $H_0$ plays a role. Choosing $H_0 = \operatorname{lin}(1)$ gives predictions using $Y_1, \ldots, Y_t$ as well as an intercept and requires that we know $\Pi_0 X_1 = \mathrm EX_1$. It may also be desired that $\Pi_{t-1}X_t$ is the projection onto $\operatorname{lin}(1, Y_{t-1}, Y_{t-2}, \ldots)$ for a stationary extension of $Y_t$ into the past. Then we set $\Pi_0 X_1$ equal to the projection of $X_1$ onto $H_0 = \operatorname{lin}(1, Y_0, Y_{-1}, \ldots)$.
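As a concrete aid, here is a minimal sketch of the two-step recursion in code. It assumes a time-invariant model with $\operatorname{Cov}(V_t) = Q$, $\operatorname{Cov}(W_t) = R$ and $\operatorname{Cov}(V_t, W_t) = 0$, initialized by $\Pi_0 X_1$ and the corresponding prediction error covariance; the bookkeeping tracks the prediction error covariance $\operatorname{Cov}(X_t - \Pi_{t-1}X_t)$ directly instead of $\operatorname{Cov}(\Pi_{t-1}X_t)$ and $\operatorname{Cov}(X_t)$ separately, which is equivalent by (10.6). All names and the closing example values are illustrative.

```python
import numpy as np

def kalman_filter(y, F, G, Q, R, x0, P0):
    """Sketch of the Kalman recursions (10.5)-(10.7) for the time-invariant
    model X_t = F X_{t-1} + V_t, Y_t = G X_t + W_t, with Cov(V_t) = Q,
    Cov(W_t) = R, Cov(V_t, W_t) = 0, x0 = Pi_0 X_1 and P0 = Cov(X_1 - Pi_0 X_1).
    Returns the filtered states Pi_t X_t and their error covariances."""
    x_pred, P_pred = np.asarray(x0, float), np.asarray(P0, float)
    filtered, covs = [], []
    for t in range(len(y)):
        # Step (2): update by the innovation W~_t = Y_t - G Pi_{t-1} X_t.
        innov = y[t] - G @ x_pred
        S = G @ P_pred @ G.T + R                  # Cov(W~_t)
        K = P_pred @ G.T @ np.linalg.inv(S)       # Lambda_t
        x_filt = x_pred + K @ innov
        P_filt = P_pred - K @ S @ K.T             # Cov(X_t - Pi_t X_t)
        filtered.append(x_filt)
        covs.append(P_filt)
        # Step (1): predict the next state.
        x_pred = F @ x_filt
        P_pred = F @ P_filt @ F.T + Q             # Cov(X_{t+1} - Pi_t X_{t+1})
    return np.array(filtered), np.array(covs)

# Example use with the stochastic linear trend model of Example 10.6:
F = np.array([[1.0, 1.0], [0.0, 1.0]]); G = np.array([[1.0, 0.0]])
Q = np.diag([0.25, 0.0025]); R = np.array([[1.0]])
y = np.cumsum(np.random.default_rng(2).standard_normal((100, 1)), axis=0)
xs, Ps = kalman_filter(y, F, G, Q, R, x0=np.zeros(2), P0=np.eye(2) * 10)
```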
10.2.1 Future States and Outputs

Predictions of future values of the state variable follow easily from $\Pi_t X_t$, because $\Pi_t X_{t+h} = F_{t+h}\Pi_t X_{t+h-1}$ for any $h \ge 1$. Given the predicted states, future outputs can be predicted from the measurement equation by $\Pi_t Y_{t+h} = G_{t+h}\Pi_t X_{t+h}$.
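A short sketch of these $h$-step predictions, continuing the time-invariant notation of the filter sketch above (the error covariance of the output prediction is included for convenience; names are mine):

```python
import numpy as np

def predict_ahead(x_filt, P_filt, F, G, Q, R, h):
    """Sketch of h-step-ahead prediction from the filtered state Pi_t X_t:
    iterate Pi_t X_{t+j} = F Pi_t X_{t+j-1} and propagate the error covariance;
    the output prediction is Pi_t Y_{t+h} = G Pi_t X_{t+h}."""
    x, P = np.asarray(x_filt, float), np.asarray(P_filt, float)
    for _ in range(h):
        x = F @ x
        P = F @ P @ F.T + Q
    y_pred = G @ x
    y_cov = G @ P @ G.T + R
    return y_pred, y_cov
```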
* 10.2.2 Missing Observations

A considerable attraction of the Kalman filter algorithm is the ease by which missing observations can be accommodated. This can be achieved by simply filling in the missing data points by external variables that are independent of the system. Suppose that $(X_t, Y_t)$ follows the linear state space model (10.2) and that we observe a subset $(Y_t)_{t\in T}$ of the variables $Y_1, \ldots, Y_n$. We define a new set of matrices $G_t^*$ and noise variables $W_t^*$ by
$$G_t^* = G_t,\quad W_t^* = W_t,\qquad t\in T,$$
$$G_t^* = 0,\quad W_t^* = \bar W_t,\qquad t\notin T,$$
for random vectors $\bar W_t$ that are independent of the vectors that are already in the system. The choice $\bar W_t = 0$ is permitted. Next we set
$$X_t = F_t X_{t-1} + V_t,\qquad Y_t^* = G_t^* X_t + W_t^*.$$
The variables $(X_t, Y_t^*)$ follow a state space model with the same state vectors $X_t$. For $t\in T$ the outputs $Y_t^* = Y_t$ are identical to the outputs in the original system, while for $t\notin T$ the output is $Y_t^* = \bar W_t$, which is pure noise by assumption. Because the noise variables $\bar W_t$ cannot contribute to the prediction of the hidden states $X_t$, best predictions of states based on the observed outputs $(Y_t)_{t\in T}$ or based on $Y_1^*, \ldots, Y_n^*$ are identical. We can compute the best predictions based on $Y_1^*, \ldots, Y_n^*$ by the Kalman recursions, but with the matrices $G_t^*$ and $\operatorname{Cov}(W_t^*)$ substituted for $G_t$ and $\operatorname{Cov}(W_t)$. Because the $Y_t^*$ with $t\notin T$ will not appear in the projection formula, we can just as well set their "observed values" equal to zero in the computations.
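In code the device amounts to replacing $G_t$ by zero (and the "observation" by an arbitrary value) at the missing times; the sketch below uses $\operatorname{Cov}(\bar W_t) = R$ for the fill-in noise, which is one of the permitted choices. Names are illustrative.

```python
import numpy as np

def kalman_filter_missing(y, observed, F, G, Q, R, x0, P0):
    """Sketch: Kalman filter with missing outputs, following Section 10.2.2.
    `observed` is a boolean array; at a missing time the measurement matrix is
    replaced by zero (and the 'observation' by zero), so the update step leaves
    the state prediction unchanged."""
    x_pred, P_pred = np.asarray(x0, float), np.asarray(P0, float)
    filtered = []
    for t in range(len(y)):
        Gt = G if observed[t] else np.zeros_like(G)
        yt = y[t] if observed[t] else np.zeros(G.shape[0])
        S = Gt @ P_pred @ Gt.T + R                 # Cov of the (possibly dummy) innovation
        K = P_pred @ Gt.T @ np.linalg.inv(S)       # zero gain at missing times
        x_filt = x_pred + K @ (yt - Gt @ x_pred)
        P_filt = P_pred - K @ S @ K.T
        filtered.append(x_filt)
        x_pred, P_pred = F @ x_filt, F @ P_filt @ F.T + Q
    return np.array(filtered)
```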
* 10.2.3 Kalman Smoothing

Besides predicting future states or outputs we may be interested in reconstructing the complete state sequence $X_0, X_1, \ldots, X_n$ from the outputs $Y_1, \ldots, Y_n$. The computation of $\Pi_n X_n$ is known as the filtering problem, and is step (2) of our description of the Kalman filter. The computation of $\Pi_n X_t$ for $t = 0, 1, \ldots, n-1$ is known as the smoothing problem. For a given $t$ it can be achieved through the recursions, with $\tilde W_n$ as given in (10.4),
$$\begin{pmatrix}\Pi_n X_t\\ \operatorname{Cov}(X_t, \tilde W_n)\\ \operatorname{Cov}(X_t, X_n - \Pi_{n-1}X_n)\end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix}\Pi_{n+1}X_t\\ \operatorname{Cov}(X_t, \tilde W_{n+1})\\ \operatorname{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1})\end{pmatrix},\qquad n = t, t+1, \ldots.$$
The initial value at $n = t$ of the recursions and the covariance matrices $\operatorname{Cov}(\tilde W_n)$ of the innovations $\tilde W_n$ are given by (10.6)-(10.7), and hence can be assumed known.

Because $H_{n+1}$ is the sum of the orthogonal spaces $H_n$ and $\operatorname{lin}\tilde W_{n+1}$, we have, as in (10.5),
$$\Pi_{n+1}X_t = \Pi_n X_t + \Lambda_{t,n+1}\tilde W_{n+1},\qquad \Lambda_{t,n+1} = \operatorname{Cov}(X_t, \tilde W_{n+1})\operatorname{Cov}(\tilde W_{n+1})^{-1}.$$
The recursion for the first coordinate $\Pi_n X_t$ follows from this and the recursions for the second and third coordinates, the covariance matrices $\operatorname{Cov}(X_t, \tilde W_{n+1})$ and $\operatorname{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1})$.

Using in turn the state equation and equation (10.5), we find
$$\operatorname{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1}) = \operatorname{Cov}\bigl(X_t, F_{n+1}(X_n - \Pi_n X_n) + V_{n+1}\bigr)
= \operatorname{Cov}\bigl(X_t, F_{n+1}(X_n - \Pi_{n-1}X_n - \Lambda_n\tilde W_n)\bigr).$$
This readily gives the recursion for the third component, the matrix $\Lambda_n$ being known from (10.5)-(10.6). Next using equation (10.4), we find
$$\operatorname{Cov}(X_t, \tilde W_{n+1}) = \operatorname{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1})G_{n+1}^T.$$
* 10.2.4 Lagged Correlations

In the preceding we have assumed that the vectors $X_0, (V_1, W_1), (V_2, W_2), \ldots$ are uncorrelated. An alternative assumption is that the vectors $X_0, V_1, (W_1, V_2), (W_2, V_3), \ldots$ are uncorrelated. (The awkward pairing of $W_t$ and $V_{t+1}$ can be avoided by writing the state equation as $X_t = F_t X_{t-1} + V_{t-1}$ and next making the assumption as before.) Under this condition the Kalman filter takes a slightly different form, where for economy of computation it can be useful to combine the steps (1) and (2).

Both possibilities are covered by the assumptions that
- the vectors $X_0, V_1, V_2, \ldots$ are orthogonal;
- the vectors $W_1, W_2, \ldots$ are orthogonal;
- the vectors $V_s$ and $W_t$ are orthogonal for all $(s, t)$ except possibly $s = t$ or $s = t + 1$;
- all vectors are orthogonal to $H_0$.

Under these assumptions step (2) of the Kalman filter remains valid as described. Step (1) must be adapted, because it is no longer true that $\Pi_{t-1}V_t = 0$.

Because $V_t\perp H_{t-2}$, we can compute $\Pi_{t-1}V_t$ from the innovation decomposition $H_{t-1} = H_{t-2} + \operatorname{lin}\tilde W_{t-1}$, as $\Pi_{t-1}V_t = K_{t-1}\tilde W_{t-1}$ for the matrix
$$K_{t-1} = \operatorname{Cov}(V_t, W_{t-1})\operatorname{Cov}(\tilde W_{t-1})^{-1}.$$
Note here that $\operatorname{Cov}(V_t, \tilde W_{t-1}) = \operatorname{Cov}(V_t, W_{t-1})$, in view of (10.4). We replace the calculations for step (1) by
$$\Pi_{t-1}X_t = F_t(\Pi_{t-1}X_{t-1}) + K_{t-1}\tilde W_{t-1},$$
$$\operatorname{Cov}(\Pi_{t-1}X_t) = F_t\operatorname{Cov}(\Pi_{t-1}X_{t-1})F_t^T + K_{t-1}\operatorname{Cov}(\tilde W_{t-1})K_{t-1}^T,$$
$$\operatorname{Cov}(X_t) = F_t\operatorname{Cov}(X_{t-1})F_t^T + \operatorname{Cov}(V_t).$$
This gives a complete description of step (1) of the algorithm, under the assumption that the vector $\tilde W_{t-1}$ and its covariance matrix are kept in memory after the preceding step (2).

The smoothing algorithm goes through as stated except for the recursion for the matrices $\operatorname{Cov}(X_t, X_n - \Pi_{n-1}X_n)$. Because $\Pi_n V_{n+1}$ may be nonzero, this becomes
$$\operatorname{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1}) = \operatorname{Cov}(X_t, X_n - \Pi_{n-1}X_n)F_{n+1}^T - \operatorname{Cov}(X_t, \tilde W_n)\Lambda_n^T F_{n+1}^T - \operatorname{Cov}(X_t, \tilde W_n)K_n^T.$$
* 10.3 Nonlinear Filtering

The simplicity of the Kalman filter results from the combined linearity of the state space model and the predictions. These lead to update formulas expressed in the terms of matrix algebra. The principle of recursive predictions can be applied more generally to compute nonlinear predictions in nonlinear state space models, provided the conditional densities of the variables in the system are available and certain integrals involving these densities can be evaluated, analytically, numerically, or by stochastic simulation.

Abusing notation we write a conditional density of a variable $X$ given another variable $Y$ as $p(x\,|\,y)$, and a marginal density of $X$ as $p(x)$. Consider the nonlinear state space model (10.1), where we assume that the vectors $X_0, V_1, W_1, V_2, \ldots$ are independent. Then the outputs $Y_1, \ldots, Y_n$ are conditionally independent given the state sequence $X_0, X_1, \ldots, X_n$, and the conditional law of a single output $Y_t$ given the state sequence depends on $X_t$ only. In principle the (conditional) densities $p(x_0), p(x_1\,|\,x_0), p(x_2\,|\,x_1), \ldots$ and the conditional densities $p(y_t\,|\,x_t)$ of the outputs are available from the form of the functions $f_t$ and $g_t$ and the distributions of the noise variables $(V_t, W_t)$. The joint density of states up till time $n+1$ and outputs up till time $n$ in this hidden Markov model can be expressed in these densities as
$$(10.8)\qquad p(x_0)p(x_1\,|\,x_0)\cdots p(x_{n+1}\,|\,x_n)\,p(y_1\,|\,x_1)p(y_2\,|\,x_2)\cdots p(y_n\,|\,x_n).$$
The marginal density of the outputs $(Y_1, \ldots, Y_n)$ is obtained by integrating this function relative to $(x_0, \ldots, x_{n+1})$. The conditional density of the state sequence $(X_0, \ldots, X_{n+1})$ given the outputs is proportional to the function in the display, the norming constant being the marginal density of the outputs. In principle, this allows the computation of all conditional expectations $\mathrm E(X_t\,|\,Y_1, \ldots, Y_n)$, the (nonlinear) "predictions" of the state. However, because this approach expresses these predictions as a quotient of $(n+1)$-dimensional integrals, and $n$ may be large, this is unattractive unless the integrals can be evaluated easily.

An alternative for finding predictions is a recursive scheme for calculating conditional densities, of the form
$$p(x_{t-1}\,|\,y_{t-1}, \ldots, y_1)\;\xrightarrow{\ (1)\ }\;p(x_t\,|\,y_{t-1}, \ldots, y_1)\;\xrightarrow{\ (2)\ }\;p(x_t\,|\,y_t, \ldots, y_1)\;\longrightarrow\;\cdots$$
This is completely analogous to the updates of the linear Kalman filter: the recursions alternate between updating the state, (1), and updating the prediction space, (2).

Step (1) can be summarized by the formula
$$p(x_t\,|\,y_{t-1}, \ldots, y_1) = \int p(x_t\,|\,x_{t-1}, y_{t-1}, \ldots, y_1)\,p(x_{t-1}\,|\,y_{t-1}, \ldots, y_1)\,d\mu_{t-1}(x_{t-1})
= \int p(x_t\,|\,x_{t-1})\,p(x_{t-1}\,|\,y_{t-1}, \ldots, y_1)\,d\mu_{t-1}(x_{t-1}).$$
The second equality follows from the conditional independence of the vectors $X_t$ and $Y_{t-1}, \ldots, Y_1$ given $X_{t-1}$. This is a consequence of the form of $X_t = f_t(X_{t-1}, V_t)$ and the independence of $V_t$ and the vectors $X_{t-1}, Y_{t-1}, \ldots, Y_1$ (which are functions of $X_0, V_1, \ldots, V_{t-1}, W_1, \ldots, W_{t-1}$).

To obtain a recursion for step (2) we apply Bayes' formula to the conditional density of the pair $(X_t, Y_t)$ given $Y_{t-1}, \ldots, Y_1$ to obtain
$$p(x_t\,|\,y_t, \ldots, y_1) = \frac{p(y_t\,|\,x_t, y_{t-1}, \ldots, y_1)\,p(x_t\,|\,y_{t-1}, \ldots, y_1)}{\int p(y_t\,|\,x_t, y_{t-1}, \ldots, y_1)\,p(x_t\,|\,y_{t-1}, \ldots, y_1)\,d\mu_t(x_t)}
= \frac{p(y_t\,|\,x_t)\,p(x_t\,|\,y_{t-1}, \ldots, y_1)}{p(y_t\,|\,y_{t-1}, \ldots, y_1)}.$$
The second equation is a consequence of the fact that $Y_t = g_t(X_t, W_t)$ is conditionally independent of $Y_{t-1}, \ldots, Y_1$ given $X_t$. The conditional density $p(y_t\,|\,y_{t-1}, \ldots, y_1)$ in the denominator is a nuisance, because it will rarely be available explicitly, but acts only as a norming constant.

The preceding formulas are useful only if the integrals can be evaluated. If analytical evaluation is impossible, then perhaps numerical methods or stochastic simulation could be of help.
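When the state is one-dimensional, the two integrals above can be approximated on a fixed grid, which gives a simple (if crude) nonlinear filter. The sketch below assumes the user supplies the densities $p(x_t\,|\,x_{t-1})$ and $p(y_t\,|\,x_t)$ and an initial density as functions that broadcast over numpy arrays; all names are illustrative, and the closing lines merely show the call on a linear-Gaussian toy model.

```python
import numpy as np
from scipy.stats import norm

def grid_filter(y, grid, trans_dens, obs_dens, prior):
    """Sketch of the recursive density filter of Section 10.3 on a fixed grid
    of state values: step (1) integrates the transition density against the
    current filtering density; step (2) multiplies by the output density and
    renormalizes.  All density functions are user supplied."""
    dx = grid[1] - grid[0]                       # grid spacing for the integrals
    p = prior(grid)
    p /= p.sum() * dx
    # Transition matrix: trans[i, j] = p(x_t = grid[i] | x_{t-1} = grid[j]).
    trans = trans_dens(grid[:, None], grid[None, :])
    filtered = []
    for yt in y:
        p = trans @ p * dx                       # step (1): predictive density
        p = obs_dens(yt, grid) * p               # step (2): multiply by p(y_t | x_t)
        p /= p.sum() * dx                        # normalize (the nuisance constant)
        filtered.append(p.copy())
    return np.array(filtered)

# Illustration with a linear-Gaussian model (so results could be checked
# against the Kalman filter); parameter values are arbitrary.
grid = np.linspace(-10, 10, 400)
flt = grid_filter(
    y=[0.5, 1.0, -0.3],
    grid=grid,
    trans_dens=lambda x, xprev: norm.pdf(x, loc=0.9 * xprev, scale=1.0),
    obs_dens=lambda yt, x: norm.pdf(yt, loc=x, scale=0.5),
    prior=lambda x: norm.pdf(x, scale=2.0),
)
```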
If stochastic simulation is the method of choice, then it may be attractive to apply Markov Chain Monte Carlo for direct evaluation of the joint law, without recursions. The idea is to simulate a sample from the conditional density $(x_0, \ldots, x_{n+1})\mapsto p(x_0, \ldots, x_{n+1}\,|\,y_1, \ldots, y_n)$ of the states given the outputs. The biggest challenge is the dimensionality of this conditional density. The Gibbs sampler overcomes this by simulating recursively from the marginal conditional densities $p(x_t\,|\,x_{-t}, y_1, \ldots, y_n)$ of the single variables $X_t$ given the outputs $Y_1, \ldots, Y_n$ and the vectors $X_{-t} = (X_0, \ldots, X_{t-1}, X_{t+1}, \ldots, X_{n+1})$ of remaining states. These iterations yield a Markov chain, with the target density $(x_0, \ldots, x_{n+1})\mapsto p(x_0, \ldots, x_{n+1}\,|\,y_1, \ldots, y_n)$ as a stationary density, and under some conditions the iterates of the chain can eventually be thought of as a (dependent) sample from (approximately) this target density. We refer to the literature for general discussion of the Gibbs sampler, but shall show that these marginal distributions are relatively easy to obtain for the general state space model (10.1).

Under independence of the vectors $X_0, V_1, W_1, V_2, \ldots$ the joint density of states and outputs takes the hidden Markov form (10.8). The conditional density of $X_t$ given the other vectors is proportional to this expression viewed as a function of $x_t$ only. Only three terms of the product depend on $x_t$ and hence we find
$$p(x_t\,|\,x_{-t}, y_1, \ldots, y_n)\;\propto\;p(x_t\,|\,x_{t-1})\,p(x_{t+1}\,|\,x_t)\,p(y_t\,|\,x_t).$$
The norming constant is a function of the conditioning variables $x_{-t}, y_1, \ldots, y_n$ only and can be recovered from the fact that the left side is a probability density as a function of $x_t$. A closer look will reveal that it is equal to $p(y_t\,|\,x_{t-1}, x_{t+1})\,p(x_{t+1}\,|\,x_{t-1})$. However, many simulation methods, in particular the popular Metropolis-Hastings algorithm, can be implemented without an explicit expression for the proportionality constant. The forms of the three densities on the right side should follow from the specification of the system.

The assumption that the variables $X_0, V_1, W_1, V_2, \ldots$ are independent may be too restrictive, although it is natural to try and construct the state variables so that it is satisfied. Somewhat more complicated formulas can be obtained under more general assumptions. Assumptions that are in the spirit of the preceding derivations in this chapter are:
(i) the vectors $X_0, X_1, X_2, \ldots$ form a Markov chain;
(ii) the vectors $Y_1, \ldots, Y_n$ are conditionally independent given $X_0, X_1, \ldots, X_{n+1}$;
(iii) for each $t\in\{1, \ldots, n\}$ the vector $Y_t$ is conditionally independent of the vector $(X_0, \ldots, X_{t-2}, X_{t+2}, \ldots, X_{n+1})$ given $(X_{t-1}, X_t, X_{t+1})$.
The first assumption is true if the vectors $X_0, V_1, V_2, \ldots$ are independent. The second and third assumptions are certainly satisfied if all noise vectors $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent. The exercises below give more general sufficient conditions for (i)-(iii) in terms of the noise variables.

In comparison to the hidden Markov situation considered previously not much changes. The joint density of states and outputs can be written in a product form similar to (10.8), the difference being that each conditional density $p(y_t\,|\,x_t)$ must be replaced by $p(y_t\,|\,x_{t-1}, x_t, x_{t+1})$. The variable $x_t$ then occurs in five terms of the product and hence we obtain
$$p(x_t\,|\,x_{-t}, y_1, \ldots, y_n)\;\propto\;p(x_{t+1}\,|\,x_t)\,p(x_t\,|\,x_{t-1})\,p(y_{t-1}\,|\,x_{t-2}, x_{t-1}, x_t)\,p(y_t\,|\,x_{t-1}, x_t, x_{t+1})\,p(y_{t+1}\,|\,x_t, x_{t+1}, x_{t+2}).$$
This formula is general enough to cover the case of the ARV model discussed in the next section.

10.10 EXERCISE. Suppose that $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent, and define states $X_t$ and outputs $Y_t$ by (10.1). Show that (i)-(iii) hold, where in (iii) the vector $Y_t$ is even conditionally independent of $(X_s: s\ne t)$ given $X_t$.

10.11 EXERCISE. Suppose that $X_0, V_1, V_2, \ldots, Z_1, Z_2, \ldots$ are independent, and define states $X_t$ and outputs $Y_t$ through (10.2) with $W_t = h_t(V_t, V_{t+1}, Z_t)$ for measurable functions $h_t$. Show that (i)-(iii) hold. [Under (10.2) there exists a measurable bijection between the vectors $(X_0, V_1, \ldots, V_t)$ and $(X_0, X_1, \ldots, X_t)$, and also between the vectors $(X_t, X_{t-1}, X_{t+1})$ and $(X_t, V_t, V_{t+1})$. Thus conditioning on $(X_0, X_1, \ldots, X_{n+1})$ is the same as conditioning on $(X_0, V_1, \ldots, V_{n+1})$ or on $(X_0, V_1, \ldots, V_n, X_{t-1}, X_t, X_{t+1})$.]

* 10.12 EXERCISE. Show that the condition in the preceding exercise that $W_t = h_t(V_t, V_{t+1}, Z_t)$ for $Z_t$ independent of the other variables is equivalent to the conditional independence of $W_t$ and $X_0, V_1, \ldots, V_n, \{W_s: s\ne t\}$ given $(V_t, V_{t+1})$.
10.4 Stochastic Volatility Models

The term "volatility", which we have used at multiple occasions to describe the movability of a time series, appears to have its origins in the theory of option pricing. The Black-Scholes model for pricing an option on a given asset with price $S_t$ is based on a diffusion equation of the type
$$dS_t = \mu_t S_t\,dt + \sigma_t S_t\,dB_t.$$
Here $B_t$ is a Brownian motion process and $\mu_t$ and $\sigma_t$ are stochastic processes, which are usually assumed to be adapted to the filtration generated by the process $S_t$. In the original Black-Scholes model the process $\sigma_t$ is assumed constant, and the constant is known as the volatility of the process $S_t$.

The Black-Scholes diffusion equation can also be written in the form
$$\log\frac{S_t}{S_0} = \int_0^t\bigl(\mu_s - \tfrac12\sigma_s^2\bigr)\,ds + \int_0^t\sigma_s\,dB_s.$$
If $\mu$ and $\sigma$ are deterministic processes this shows that the log returns $\log S_t/S_{t-1}$ over the intervals $(t-1, t]$ are independent, normally distributed variables ($t = 1, 2, \ldots$) with means $\int_{t-1}^t(\mu_s - \frac12\sigma_s^2)\,ds$ and variances $\int_{t-1}^t\sigma_s^2\,ds$. In other words, if these means and variances are denoted by $\mu_t$ and $\sigma_t^2$, then the variables
$$Z_t = \frac{\log S_t/S_{t-1} - \mu_t}{\sigma_t}$$
are an i.i.d. sample from the standard normal distribution. The standard deviation $\sigma_t$ can be viewed as an "average volatility" over the interval $(t-1, t]$. If the processes $\mu_t$ and $\sigma_t$ are not deterministic, then the process $Z_t$ is not necessarily Gaussian. However, if the unit of time is small, so that the intervals $(t-1, t]$ correspond to short time intervals in real time, then it is still believable that the variables $Z_t$ are approximately normally distributed. In that case it is also believable that the processes $\mu_t$ and $\sigma_t$ are approximately constant and hence these processes can replace the averages $\mu_t$ and $\sigma_t$. Usually, one even assumes that the process $\mu_t$ is constant in time. For simplicity of notation we shall take $\mu_t$ to be zero in the following, leading to a model of the form
$$\log S_t/S_{t-1} = \sigma_t Z_t,$$
for standard normal variables $Z_t$ and a "volatility" process $\sigma_t$. The choice of zero means, $\mu_t - \frac12\sigma_t^2 = 0$, corresponds to modelling under the "risk-free" martingale measure, but is made here only for convenience.

There is ample empirical evidence that models with constant volatility do not fit observed financial time series. In particular, this has been documented through a comparison of the option prices predicted by the Black-Scholes formula to the observed prices on the option market. Because the Black-Scholes price of an option on a given asset depends only on the volatility parameter of the asset price process, a single parameter volatility model would allow to calculate this parameter from the observed price of an option on this asset, by inversion of the Black-Scholes formula. Given a range of options written on a given asset, but with different maturities and/or different strike prices, this inversion process usually leads to a range of "implied volatilities", all connected to the same asset price process. These implied volatilities usually vary with the maturity and strike price. This discrepancy could be taken as proof of the failure of the reasoning behind the Black-Scholes formula, but the more common explanation is that volatility is a random process itself. One possible model for this process is a diffusion equation of the type
$$d\sigma_t = \lambda_t\sigma_t\,dt + \gamma_t\sigma_t\,dW_t,$$
where $W_t$ is another Brownian motion process. This leads to a "stochastic volatility model" in continuous time. Many different parametric forms for the processes $\lambda_t$ and $\gamma_t$ are suggested in the literature. One particular choice is to assume that $\log\sigma_t$ is an Ornstein-Uhlenbeck process, i.e. it satisfies
$$d\log\sigma_t = \alpha(\theta - \log\sigma_t)\,dt + \gamma\,dW_t.$$
(An application of Ito's formula shows that this corresponds to the choices $\lambda_t = \frac12\gamma^2 + \alpha(\theta - \log\sigma_t)$ and $\gamma_t = \gamma$.) The Brownian motions $B_t$ and $W_t$ are often assumed to be dependent, with quadratic variation $\langle B, W\rangle_t = \delta t$ for some parameter $\delta$.

A diffusion equation is a stochastic differential equation in continuous time, and does not fit well into our basic set-up, which considers the time variable $t$ to be integer-valued. One approach would be to use continuous time models, but assume that the continuous time processes are observed only at a grid of time points. In view of the importance of the option-pricing paradigm in finance it has also been useful to give a definition of "volatility" directly through discrete time models. These models are usually motivated by an analogy with the continuous time set-up. "Stochastic volatility models" in discrete time are specifically meant to parallel continuous time diffusion models.

The most popular stochastic volatility model in discrete time is the auto-regressive random variance model or ARV model. A discrete time analogue of the Ornstein-Uhlenbeck type volatility process $\sigma_t$ is the specification
$$(10.9)\qquad \log\sigma_t = \alpha + \phi\log\sigma_{t-1} + V_{t-1}.$$
For $|\phi| < 1$ and a white noise process $V_t$ this auto-regressive equation possesses a causal stationary solution $\log\sigma_t$. We select this solution in the following. The observed log return process $X_t$ is modelled as
$$(10.10)\qquad X_t = \sigma_t Z_t,$$
where it is assumed that the time series $(V_t, Z_t)$ is i.i.d. The latter implies that $Z_t$ is independent of $V_{t-1}, Z_{t-1}, V_{t-2}, Z_{t-2}, \ldots$ and hence of $X_{t-1}, X_{t-2}, \ldots$, but allows dependence between $V_t$ and $Z_t$. The volatility process $\sigma_t$ is not observed.

A dependence between $V_t$ and $Z_t$ allows for a "leverage effect", one of the stylized facts of financial time series. In particular, if $V_t$ and $Z_t$ are negatively correlated, then a small return $X_t$, which is indicative of a small value of $Z_t$, suggests a large value of $V_t$, and hence a large value of the log volatility $\log\sigma_{t+1}$ at the next time instant. (Note that the time index $t-1$ of $V_{t-1}$ in the auto-regressive equation (10.9) is unusual, because in other situations we would have written $V_t$. It is meant to support the idea that $\sigma_t$ is determined at time $t-1$.)

An ARV stochastic volatility process is a nonlinear state space model. It induces a linear state space model for the log volatilities and log absolute log returns of the form
$$\log\sigma_t = (\,\alpha\quad \phi\,)\begin{pmatrix}1\\ \log\sigma_{t-1}\end{pmatrix} + V_{t-1},\qquad \log|X_t| = \log\sigma_t + \log|Z_t|.$$
In order to take the logarithm of the observed series $X_t$ it was necessary to take the absolute value $|X_t|$ first. Usually this is not a serious loss of information, because the sign of $X_t$ is equal to the sign of $Z_t$, and this is a Bernoulli($\frac12$) series if $Z_t$ is symmetrically distributed.

The linear state space form allows the application of the Kalman filter to compute best linear projections of the unobserved log volatilities $\log\sigma_t$ based on the observed log absolute log returns $\log|X_t|$. Although this approach is computationally attractive, a disadvantage is that the best predictions of the volatilities $\sigma_t$ based on the log returns $X_t$ may be much better than the exponentials of the best linear predictions of the log volatilities $\log\sigma_t$ based on the log returns. Forcing the model in linear form is not entirely natural here. However, the computation of best nonlinear predictions is involved. Markov Chain Monte Carlo methods are perhaps the most promising technique, but are highly computer-intensive.
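A minimal simulation sketch of the ARV model (10.9)-(10.10), assuming bivariate Gaussian $(V_t, Z_t)$ with correlation $\delta$ (negative $\delta$ produces the leverage effect); parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_arv(n, alpha=-0.7, phi=0.95, sd_v=0.3, delta=-0.4, burn=500):
    """Sketch of the ARV model: log sigma_t = alpha + phi*log sigma_{t-1} + V_{t-1},
    X_t = sigma_t * Z_t, with (V_t, Z_t) i.i.d. bivariate normal with
    corr(V_t, Z_t) = delta.  Parameter values are illustrative only."""
    m = n + burn
    cov = np.array([[sd_v ** 2, delta * sd_v], [delta * sd_v, 1.0]])
    VZ = rng.multivariate_normal([0.0, 0.0], cov, size=m)
    log_sigma = np.empty(m)
    log_sigma[0] = alpha / (1 - phi)              # stationary mean of log sigma_t
    for t in range(1, m):
        log_sigma[t] = alpha + phi * log_sigma[t - 1] + VZ[t - 1, 0]
    X = np.exp(log_sigma) * VZ[:, 1]
    return X[burn:], np.exp(log_sigma[burn:])

X, sigma = simulate_arv(2000)
# log|X_t| = log sigma_t + log|Z_t| is the output of the linear state space model above.
```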
An ARV process $X_t$ is a martingale difference series relative to its natural filtration $\mathcal F_t = \sigma(X_t, X_{t-1}, \ldots)$. To see this we first note that by causality $\sigma_t\in\sigma(V_{t-1}, V_{t-2}, \ldots)$, whence $\mathcal F_t$ is contained in the filtration $\mathcal G_t = \sigma(V_s, Z_s: s\le t)$. The process $X_t$ is actually already a martingale difference relative to this bigger filtration, because by the assumed independence of $Z_t$ from $\mathcal G_{t-1}$,
$$\mathrm E(X_t\,|\,\mathcal G_{t-1}) = \sigma_t\,\mathrm E(Z_t\,|\,\mathcal G_{t-1}) = 0.$$
A fortiori the process $X_t$ is a martingale difference series relative to the filtration $\mathcal F_t$.

There is no correspondingly simple expression for the conditional variance process $\mathrm E(X_t^2\,|\,\mathcal F_{t-1})$ of an ARV series. By the same argument
$$\mathrm E(X_t^2\,|\,\mathcal G_{t-1}) = \sigma_t^2\,\mathrm EZ_t^2.$$
If $\mathrm EZ_t^2 = 1$ it follows that $\mathrm E(X_t^2\,|\,\mathcal F_{t-1}) = \mathrm E(\sigma_t^2\,|\,\mathcal F_{t-1})$, but this is intractable for further evaluation. In particular, the process $\sigma_t^2$ is not the conditional variance process, unlike in the situation of a GARCH process. Correspondingly, in the present context, in which $\sigma_t$ is considered the "volatility", the volatility and conditional variance processes do not coincide.

10.13 EXERCISE. One definition of a volatility process $\sigma_t$ of a time series $X_t$ is a process $\sigma_t$ such that $X_t/\sigma_t$ is an i.i.d. standard normal series. Suppose that $X_t = \sigma_t Z_t$ is a GARCH process with conditional variance process $\sigma_t^2$ and driven by an i.i.d. process $Z_t$. If $Z_t$ is standard normal, show that $\sigma_t$ qualifies as a volatility process. [Trivial.] If $Z_t$ is a $t_p$-process, show that there exists a process $S_t^2$ with a chi-square distribution with $p$ degrees of freedom such that $\sigma_t\sqrt p/S_t$ qualifies as a volatility process.

10.14 EXERCISE. In the ARV model, is $\sigma_t$ measurable relative to the $\sigma$-field generated by $X_{t-1}, X_{t-2}, \ldots$? Compare with GARCH models.

In view of the analogy with continuous time diffusion processes the assumption that the variables $(V_t, Z_t)$ in (10.9)-(10.10) are normally distributed could be natural. This assumption certainly helps to compute moments of the series. The stationary solution $\log\sigma_t$ of the auto-regressive equation (10.9) is given by (for $|\phi| < 1$)
$$\log\sigma_t = \sum_{j=0}^\infty\phi^j(V_{t-1-j} + \alpha) = \sum_{j=0}^\infty\phi^j V_{t-1-j} + \frac{\alpha}{1-\phi}.$$
If the time series $V_t$ is i.i.d. Gaussian with mean zero and variance $\sigma^2$, then it follows that the variable $\log\sigma_t$ is normally distributed with mean $\alpha/(1-\phi)$ and variance $\sigma^2/(1-\phi^2)$. The Laplace transform $\mathrm E\exp(aZ)$ of a standard normal variable $Z$ is given by $\exp(\frac12 a^2)$. Therefore, under the normality assumption on the process $V_t$ it is straightforward to compute that, for $p > 0$,
$$\mathrm E|X_t|^p = \mathrm E e^{p\log\sigma_t}\,\mathrm E|Z_t|^p = \exp\Bigl(\frac12\frac{\sigma^2 p^2}{1-\phi^2} + \frac{p\alpha}{1-\phi}\Bigr)\,\mathrm E|Z_t|^p.$$
Consequently, the kurtosis of the variables $X_t$ can be computed to be
$$\kappa_4(X) = e^{4\sigma^2/(1-\phi^2)}\,\kappa_4(Z).$$
It follows that the time series $X_t$ possesses a larger kurtosis than the series $Z_t$. This is true even for $\phi = 0$, but the effect is more pronounced for values of $\phi$ that are close to 1, which are commonly found in practice. Thus the ARV model is able to explain leptokurtic tails of an observed time series.
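A rough Monte Carlo check of the kurtosis formula, assuming Gaussian and mutually independent $V_t$ and $Z_t$ (so $\kappa_4(Z) = 3$); the sample kurtosis should be roughly comparable to the theoretical value, though kurtosis estimates converge slowly. Parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Check kappa_4(X) = exp(4*sigma^2/(1-phi^2)) * kappa_4(Z) by simulation.
alpha, phi, sd_v, n = -0.5, 0.9, 0.2, 200_000
V = sd_v * rng.standard_normal(n)
Z = rng.standard_normal(n)
log_sigma = np.empty(n)
log_sigma[0] = alpha / (1 - phi)
for t in range(1, n):
    log_sigma[t] = alpha + phi * log_sigma[t - 1] + V[t - 1]
X = np.exp(log_sigma) * Z
sample_kurtosis = np.mean(X ** 4) / np.mean(X ** 2) ** 2
theoretical = np.exp(4 * sd_v ** 2 / (1 - phi ** 2)) * 3
```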
Under the assumption that the variables $(V_t, Z_t)$ are i.i.d. and bivariate normally distributed, it is also possible to compute the auto-correlation function of the squared series $X_t^2$ explicitly. If $\delta = \rho(V_t, Z_t)$ is the correlation between the variables $V_t$ and $Z_t$, then the vectors $(\log\sigma_t, \log\sigma_{t+h}, Z_t)$ possess a three-dimensional normal distribution with covariance matrix
$$\begin{pmatrix}\tau^2 & \tau^2\phi^h & 0\\ \tau^2\phi^h & \tau^2 & \delta\sigma\phi^{h-1}\\ 0 & \delta\sigma\phi^{h-1} & 1\end{pmatrix},\qquad \tau^2 = \frac{\sigma^2}{1-\phi^2}.$$
Some calculations show that the auto-correlation function of the square process is given by
$$\rho_{X^2}(h) = \frac{(1 + 4\delta^2\sigma^2\phi^{2h-2})\,e^{4\sigma^2\phi^h/(1-\phi^2)} - 1}{3\,e^{4\sigma^2/(1-\phi^2)} - 1},\qquad h > 0.$$
The auto-correlation is positive at positive lags and decreases exponentially fast to zero, with a rate depending on the proximity of $\phi$ to 1. For values of $\phi$ close to 1, the decrease is relatively slow.

10.15 EXERCISE. Derive the formula for the auto-correlation function.

10.16 EXERCISE. Suppose that the variables $V_t$ and $Z_t$ are independent for every $t$, in addition to independence of the vectors $(V_t, Z_t)$, and assume that the variables $V_t$ (but not necessarily the variables $Z_t$) are normally distributed. Show that
$$\rho_{X^2}(h) = \frac{e^{4\sigma^2\phi^h/(1-\phi^2)} - 1}{\kappa_4(Z)\,e^{4\sigma^2/(1-\phi^2)} - 1},\qquad h > 0.$$
[Factorize $\mathrm E\sigma_{t+h}^2\sigma_t^2 Z_{t+h}^2 Z_t^2$ as $\mathrm E\sigma_{t+h}^2\sigma_t^2\,\mathrm EZ_{t+h}^2 Z_t^2$.]

The choice of the logarithmic function in the auto-regressive equation (10.9) has some arbitrariness, and other possibilities, such as a power function, have been explored.
11 Moment and Least Squares Estimators

Suppose that we observe realizations $X_1, \ldots, X_n$ from a time series $X_t$ whose distribution is (partly) described by a parameter $\theta\in\mathbb R^d$. For instance, an ARMA process with the parameter $(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)$, or a GARCH process with parameter $(\alpha, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)$, both ranging over a subset of $\mathbb R^{p+q+1}$. In this chapter we discuss two methods of estimation of the parameters, based on the observations $X_1, \ldots, X_n$: the "method of moments" and the "least squares method".

When applied in the standard form to auto-regressive processes, the two methods are essentially the same, but for other models the two methods may yield quite different estimators. Depending on the moments used and the underlying model, least squares estimators can be more efficient, although sometimes they are not usable at all. The "generalized method of moments" tries to bridge the efficiency gap, by increasing the number of moments employed.

Moment and least squares estimators are popular in time series analysis, but in general they are less efficient than maximum likelihood and Bayes estimators. The difference in efficiency depends on the model and the true distribution of the time series. Maximum likelihood estimation using a Gaussian model can be viewed as an extension of the method of least squares. We discuss the method of maximum likelihood in Chapter 13.

11.1 Yule-Walker Estimators

Suppose that the time series $X_t - \mu$ is a stationary auto-regressive process of known order $p$ and with unknown parameters $\phi_1, \ldots, \phi_p$ and $\sigma^2$. The mean $\mu = \mathrm EX_t$ of the series may also be unknown, but we assume that it is estimated by $\overline X_n$ and concentrate attention on estimating the remaining parameters.

From Chapter 8 we know that the parameters of an auto-regressive process are not uniquely determined by the series $X_t$, but can be replaced by others if the white noise process is changed appropriately as well. We shall aim at estimating the parameter under the assumption that the series is causal. This is equivalent to requiring that all roots of the polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ are outside the unit circle.

Under causality the best linear predictor of $X_{p+1}$ based on $1, X_p, \ldots, X_1$ is given by $\Pi_p X_{p+1} = \mu + \phi_1(X_p - \mu) + \cdots + \phi_p(X_1 - \mu)$. (See Section 8.4.) Alternatively, the best linear predictor can be obtained by solving the general prediction equations (2.4). This shows that the parameters $\phi_1, \ldots, \phi_p$ satisfy
$$\begin{pmatrix}\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(p-1)\\ \gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(p-2)\\ \vdots & \vdots & \ddots & \vdots\\ \gamma_X(p-1) & \gamma_X(p-2) & \cdots & \gamma_X(0)\end{pmatrix}
\begin{pmatrix}\phi_1\\ \phi_2\\ \vdots\\ \phi_p\end{pmatrix}
= \begin{pmatrix}\gamma_X(1)\\ \gamma_X(2)\\ \vdots\\ \gamma_X(p)\end{pmatrix}.$$
We abbreviate this system of equations by $\Gamma_p\phi_p = \gamma_p$. These equations, known as the Yule-Walker equations, express the parameters into second moments of the observations. The Yule-Walker estimators are defined by replacing the true auto-covariances $\gamma_X(h)$ by their sample versions $\hat\gamma_n(h)$ and next solving for $\phi_1, \ldots, \phi_p$. This leads to the estimators
$$\hat\phi_p := \begin{pmatrix}\hat\phi_1\\ \hat\phi_2\\ \vdots\\ \hat\phi_p\end{pmatrix}
= \begin{pmatrix}\hat\gamma_n(0) & \hat\gamma_n(1) & \cdots & \hat\gamma_n(p-1)\\ \hat\gamma_n(1) & \hat\gamma_n(0) & \cdots & \hat\gamma_n(p-2)\\ \vdots & \vdots & \ddots & \vdots\\ \hat\gamma_n(p-1) & \hat\gamma_n(p-2) & \cdots & \hat\gamma_n(0)\end{pmatrix}^{-1}
\begin{pmatrix}\hat\gamma_n(1)\\ \hat\gamma_n(2)\\ \vdots\\ \hat\gamma_n(p)\end{pmatrix}
=: \hat\Gamma_p^{-1}\hat\gamma_p.$$
The parameter $\sigma^2$ is by definition the variance of $Z_{p+1}$, which is the prediction error $X_{p+1} - \Pi_p X_{p+1}$ when predicting $X_{p+1}$ by the preceding observations, under the assumption that the time series is causal. By the orthogonality of the prediction error and the predictor $\Pi_p X_{p+1}$, and Pythagoras' rule,
$$(11.1)\qquad \sigma^2 = \mathrm E(X_{p+1} - \mu)^2 - \mathrm E(\Pi_p X_{p+1} - \mu)^2 = \gamma_X(0) - \phi_p^T\Gamma_p\phi_p.$$
We define an estimator $\hat\sigma^2$ by replacing all unknowns by their moment estimators, i.e.
$$\hat\sigma^2 = \hat\gamma_n(0) - \hat\phi_p^T\hat\gamma_p.$$
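The Yule-Walker recipe is easy to code directly from the sample auto-covariances; the following sketch (function name mine) returns $\hat\phi_p$ and $\hat\sigma^2$, and the closing lines fit an AR(2) model to a simulated AR(2) series:

```python
import numpy as np

def yule_walker(x, p):
    """Sketch of the Yule-Walker estimators: solve Gamma_p * phi = gamma_p with
    the sample auto-covariances, and estimate sigma^2 by gamma_n(0) - phi^T gamma_p."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([xc[:n - h] @ xc[h:] / n for h in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]
    return phi, sigma2

# Example: fit an AR(2) model to a simulated AR(2) series (parameters illustrative).
rng = np.random.default_rng(5)
z = rng.standard_normal(2000)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + z[t]
phi_hat, sigma2_hat = yule_walker(x, p=2)
```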
11.1 EXERCISE. An alternative method to derive the Yule-Walker equations is to work out the equations $\operatorname{cov}\bigl(\phi(B)(X_t - \mu), X_{t-k} - \mu\bigr) = \operatorname{cov}\bigl(Z_t, \sum_{j\ge 0}\psi_j Z_{t-j-k}\bigr)$ for $k = 0, \ldots, p$. Check this. Do you need causality? What if the time series would not be causal?

11.2 EXERCISE. Show that the matrix $\Gamma_p$ is invertible for every $p$. [Suggestion: write $\Gamma_p$ in terms of the spectral density.]

Another reasonable method to find estimators is to start from the fact that the true values of $\phi_1, \ldots, \phi_p$ minimize the expectation
$$(\phi_1, \ldots, \phi_p)\mapsto \mathrm E\bigl(X_t - \mu - \phi_1(X_{t-1} - \mu) - \cdots - \phi_p(X_{t-p} - \mu)\bigr)^2.$$
The least squares estimators are defined by replacing this criterion function by an "empirical" (i.e. observable) version of it and next minimizing this. Let $\tilde\phi_1, \ldots, \tilde\phi_p$ minimize the function
$$(11.2)\qquad (\phi_1, \ldots, \phi_p)\mapsto \frac1n\sum_{t=p+1}^n\bigl(X_t - \overline X_n - \phi_1(X_{t-1} - \overline X_n) - \cdots - \phi_p(X_{t-p} - \overline X_n)\bigr)^2.$$
The minimum value itself is a reasonable estimator of the minimum value of the expectation of this criterion function, which is $\mathrm EZ_t^2 = \sigma^2$. The least squares estimators $\tilde\phi_j$ obtained in this way are not identical to the Yule-Walker estimators, but the difference is small. To see this, we derive the least squares estimators as the solution of a system of equations. The right side of (11.2) is, up to the factor $1/n$, the square of the norm $\|Y_n - D_n\phi_p\|$ for $\phi_p = (\phi_1, \ldots, \phi_p)^T$ the vector of parameters and $Y_n$ and $D_n$ the vector and matrix given by
$$Y_n = \begin{pmatrix}X_n - \overline X_n\\ X_{n-1} - \overline X_n\\ \vdots\\ X_{p+1} - \overline X_n\end{pmatrix},\qquad
D_n = \begin{pmatrix}X_{n-1} - \overline X_n & X_{n-2} - \overline X_n & \cdots & X_{n-p} - \overline X_n\\ X_{n-2} - \overline X_n & X_{n-3} - \overline X_n & \cdots & X_{n-p-1} - \overline X_n\\ \vdots & \vdots & & \vdots\\ X_p - \overline X_n & X_{p-1} - \overline X_n & \cdots & X_1 - \overline X_n\end{pmatrix}.$$
The norm $\|Y_n - D_n\phi_p\|$ is minimized by the vector $\tilde\phi_p$ such that $D_n\tilde\phi_p$ is the projection of the vector $Y_n$ onto the range of the matrix $D_n$. By the projection theorem, Theorem 2.10, this is characterized by the relationship that the residual $Y_n - D_n\tilde\phi_p$ is orthogonal to the range of $D_n$, i.e. $D_n^T(Y_n - D_n\tilde\phi_p) = 0$. This normal equation can be solved for $\tilde\phi_p$ to yield that the minimizing vector is given by
$$\tilde\phi_p = \Bigl(\frac1n D_n^T D_n\Bigr)^{-1}\frac1n D_n^T(\vec X_n - \overline X_n).$$
At closer inspection this vector is nearly identical to the Yule-Walker estimators. Indeed, for every $s, t\in\{1, \ldots, p\}$,
$$\Bigl(\frac1n D_n^T D_n\Bigr)_{s,t} = \frac1n\sum_{j=p+1}^n(X_{j-s} - \overline X_n)(X_{j-t} - \overline X_n)\approx\hat\gamma_n(s-t) = (\hat\Gamma_p)_{s,t},$$
$$\Bigl(\frac1n D_n^T(\vec X_n - \overline X_n)\Bigr)_t = \frac1n\sum_{j=p+1}^n(X_{j-t} - \overline X_n)(X_j - \overline X_n)\approx\hat\gamma_n(t) = (\hat\gamma_p)_t.$$
Asymptotically the difference between the Yule-Walker and least squares estimators is negligible. They possess the same (normal) limit distribution.
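Equivalently, the least squares estimators can be computed by an ordinary regression on the centered lagged values, as in (11.2); a minimal sketch (assuming numpy's least squares routine):

```python
import numpy as np

def ar_least_squares(x, p):
    """Sketch of the least squares estimators: regress X_t - Xbar on the
    lagged, centered values (X_{t-1} - Xbar, ..., X_{t-p} - Xbar),
    for t = p+1, ..., n, cf. (11.2)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    Y = xc[p:]
    D = np.column_stack([xc[p - j:len(x) - j] for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(D, Y, rcond=None)
    sigma2 = np.mean((Y - D @ phi) ** 2)     # minimum value of the criterion
    return phi, sigma2
```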
11.3 Theorem. Let $(X_t)$ be a causal AR($p$) process relative to an i.i.d. sequence $Z_t$ with finite fourth moments. Then both the Yule-Walker and the least squares estimators satisfy, with $\Gamma_p$ the covariance matrix of $(X_1, \ldots, X_p)$,
$$\sqrt n(\hat\phi_p - \phi_p)\rightsquigarrow N(0, \sigma^2\Gamma_p^{-1}).$$

Proof. We can assume without loss of generality that $\mu = 0$. The AR equations $\phi(B)X_t = Z_t$ for $t = n, n-1, \ldots, p+1$ can be written in the matrix form
$$\begin{pmatrix}X_n\\ X_{n-1}\\ \vdots\\ X_{p+1}\end{pmatrix}
= \begin{pmatrix}X_{n-1} & X_{n-2} & \cdots & X_{n-p}\\ X_{n-2} & X_{n-3} & \cdots & X_{n-p-1}\\ \vdots & \vdots & & \vdots\\ X_p & X_{p-1} & \cdots & X_1\end{pmatrix}
\begin{pmatrix}\phi_1\\ \phi_2\\ \vdots\\ \phi_p\end{pmatrix}
+ \begin{pmatrix}Z_n\\ Z_{n-1}\\ \vdots\\ Z_{p+1}\end{pmatrix}
= D_n\phi_p + \vec Z_n,$$
for $\vec Z_n$ the vector with coordinates $Z_t + \overline X_n\sum_i\phi_i$, and $D_n$ the (centered) design matrix as before. We can solve $\phi_p$ from this as, writing $\vec X_n$ for the vector on the left,
$$\phi_p = (D_n^T D_n)^{-1}D_n^T(\vec X_n - \vec Z_n).$$
Combining this with the analogous representation $\tilde\phi_p = (D_n^T D_n)^{-1}D_n^T(\vec X_n - \overline X_n\vec 1)$ of the least squares estimators, for $\vec 1$ the vector of ones, we find
$$\sqrt n(\tilde\phi_p - \phi_p) = \Bigl(\frac1n D_n^T D_n\Bigr)^{-1}\frac1{\sqrt n}D_n^T\bigl(\vec Z_n - \overline X_n\vec 1\bigr),$$
where the vector $\vec Z_n - \overline X_n\vec 1$ has coordinates $Z_t - \overline X_n(1 - \sum_i\phi_i)$.

Because $X_t$ is an auto-regressive process, it possesses a representation $X_t = \sum_j\psi_j Z_{t-j}$ for a sequence $\psi_j$ with $\sum_j|\psi_j| < \infty$. Therefore, the results of Chapter 8 apply and show that $n^{-1}D_n^T D_n\xrightarrow{P}\Gamma_p$. (In view of Problem 11.2 this also shows that the matrix $D_n^T D_n$ is invertible, as was assumed implicitly in the preceding.)

In view of Slutsky's lemma it now suffices to show that
$$\frac1{\sqrt n}D_n^T\begin{pmatrix}Z_n\\ \vdots\\ Z_{p+1}\end{pmatrix}\rightsquigarrow N(0, \sigma^2\Gamma_p),\qquad
\frac1{\sqrt n}D_n^T\vec 1\,\overline X_n\xrightarrow{P}0.$$
A typical coordinate of the last vector is (for $h = 1, \ldots, p$)
$$\frac1{\sqrt n}\sum_{t=p+1}^n(X_{t-h} - \overline X_n)\overline X_n = \frac1{\sqrt n}\sum_{t=p+1}^n X_{t-h}\,\overline X_n - \frac{n-p}{\sqrt n}\,\overline X_n^2.$$
In view of Theorem 4.5 and the assumption that $\mu = 0$, the sequence $\sqrt n\,\overline X_n$ converges in distribution. Hence both terms on the right side are of the order $O_P(1/\sqrt n)$.

A typical coordinate of the first vector is (for $h = 1, \ldots, p$)
$$\frac1{\sqrt n}\sum_{t=p+1}^n(X_{t-h} - \overline X_n)Z_t = \frac1{\sqrt n}\sum_{t=1}^{n-p}Y_t + O_P\Bigl(\frac1{\sqrt n}\Bigr),$$
for $Y_t = X_{p-h+t}Z_{p+t}$. By causality of the series $X_t$ we have $Z_{p+t}\perp X_{p-s+t}$ for $s > 0$ and hence $\mathrm EY_t = \mathrm EX_{p-h+t}\,\mathrm EZ_{p+t} = 0$ for every $t$. The same type of arguments as in Chapter 5 will give us the asymptotic normality of the sequence $\sqrt n\,\overline Y_n$, with asymptotic variance
$$\sum_{g=-\infty}^\infty\gamma_Y(g) = \sum_{g=-\infty}^\infty\mathrm EX_{p-h+g}Z_{p+g}X_{p-h}Z_p.$$
In this series all terms with $g > 0$ vanish because $Z_{p+g}$ is independent of $(X_{p-h+g}, X_{p-h}, Z_p)$, by the assumption of causality and the fact that $Z_t$ is an i.i.d. sequence. All terms with $g < 0$ vanish by symmetry. Thus the series is equal to
$$\gamma_Y(0) = \mathrm EX_{p-h}^2 Z_p^2 = \gamma_X(0)\,\sigma^2,$$
which is the diagonal element of $\sigma^2\Gamma_p$. This concludes the proof of the convergence in distribution of all marginals of $n^{-1/2}D_n^T\vec Z_n$. The joint convergence is proved similarly, with the help of the Cramer-Wold device.

This concludes the proof of the asymptotic normality of the least squares estimators. The Yule-Walker estimators can be proved to be asymptotically equivalent to the least squares estimators, in that the difference is of the order $o_P(1/\sqrt n)$. Next we apply Slutsky's lemma.

11.4 EXERCISE. Show that the time series $Y_t$ in the preceding proof is strictly stationary.

* 11.5 EXERCISE. Give a complete proof of the asymptotic normality of $\sqrt n\,\overline Y_n$ as defined in the preceding proof, along the lines sketched, or using the fact that $n\overline Y_n$ is a martingale.

11.1.1 Order Selection

In the preceding derivation of the least squares and Yule-Walker estimators the order $p$ of the AR process is assumed known a priori. Theorem 11.3 is false if $X_t - \mu$ were in reality an AR($p_0$) process of order $p_0 > p$. In that case $\hat\phi_1, \ldots, \hat\phi_p$ are estimators of the coefficients of the best linear predictor based on $p$ observations, but need not converge to the $p_0$ coefficients $\phi_1, \ldots, \phi_{p_0}$. On the other hand, Theorem 11.3 remains valid if the series $X_t - \mu$ is an auto-regressive process of true order $p_0$ strictly smaller than the order $p$ used to define the estimators. This follows because for $p_0\le p$ an AR($p_0$) process is also an AR($p$) process, albeit that $\phi_{p_0+1}, \ldots, \phi_p$ are zero. Theorem 11.3 shows that overfitting (choosing too big an order) does not cause great harm: if $\hat\phi_1^{(p)}, \ldots, \hat\phi_p^{(p)}$ are the Yule-Walker estimators when fitting an AR($p$) model and the observations are an AR($p_0$) process with $p_0\le p$, then
$$\sqrt n\,\hat\phi_j^{(p)}\rightsquigarrow N\bigl(0, \sigma^2(\Gamma_p^{-1})_{j,j}\bigr),\qquad j = p_0+1, \ldots, p.$$
It is reassuring that the estimators of the unnecessary coefficients $\phi_{p_0+1}, \ldots, \phi_p$ converge to zero at rate $1/\sqrt n$. However, there is also a price to be paid by overfitting. By Theorem 11.3, if fitting an AR($p$) model, then the estimators of the first $p_0$ coefficients satisfy
$$\sqrt n\left(\begin{pmatrix}\hat\phi_1^{(p)}\\ \vdots\\ \hat\phi_{p_0}^{(p)}\end{pmatrix} - \begin{pmatrix}\phi_1\\ \vdots\\ \phi_{p_0}\end{pmatrix}\right)\rightsquigarrow N\Bigl(0, \sigma^2\bigl(\Gamma_p^{-1}\bigr)_{s,t=1,\ldots,p_0}\Bigr).$$
The covariance matrix in the right side, the $(p_0\times p_0)$ upper principal submatrix of the $(p\times p)$ matrix $\Gamma_p^{-1}$, is not equal to $\Gamma_{p_0}^{-1}$, which would be the asymptotic covariance matrix if we fitted an AR model of the correct order $p_0$. In fact, it is "bigger" in that
$$\bigl(\Gamma_p^{-1}\bigr)_{s,t=1,\ldots,p_0} - \Gamma_{p_0}^{-1}\ge 0.$$
Here $A\ge 0$ means that the matrix $A$ is nonnegative definite. In particular, the diagonal elements of these matrices, which are the differences of the asymptotic variances of the estimators $\hat\phi_j^{(p)}$ and the estimators $\hat\phi_j^{(p_0)}$, are nonnegative. Thus overfitting leads to more uncertainty in the estimators of both $\phi_1, \ldots, \phi_{p_0}$ and $\phi_{p_0+1}, \ldots, \phi_p$. Fitting an auto-regressive process of very high order $p$ increases the chance of having the model fit well to the data, but generally will result in poor estimates of the coefficients, which render the final outcome less useful.

* 11.6 EXERCISE. Prove the assertion that the given matrix is nonnegative definite.
In practice we do not know the correct order to use. A suitable order is often determined by a preliminary data-analysis, such as an inspection of the plot of the sample partial auto-correlation function. More formal methods are discussed within the general context of maximum likelihood estimation in Chapter 13.

11.7 Example. If we fit an AR(1) process to observations of an AR(1) series, then the asymptotic covariance of $\sqrt{n}(\hat\phi_1 - \phi_1)$ is equal to $\sigma^2\Gamma_1^{-1} = \sigma^2/\gamma_X(0)$. If to this same process we fit an AR(2) process, then we obtain estimators $(\hat\phi^{(2)}_1,\hat\phi^{(2)}_2)$ (not related to the earlier $\hat\phi_1$) such that $\sqrt{n}(\hat\phi^{(2)}_1 - \phi_1,\hat\phi^{(2)}_2)$ has asymptotic covariance matrix
$$\sigma^2\Gamma_2^{-1} = \sigma^2\begin{pmatrix}\gamma_X(0) & \gamma_X(1)\\ \gamma_X(1) & \gamma_X(0)\end{pmatrix}^{-1} = \frac{\sigma^2}{\gamma_X(0)^2-\gamma_X(1)^2}\begin{pmatrix}\gamma_X(0) & -\gamma_X(1)\\ -\gamma_X(1) & \gamma_X(0)\end{pmatrix}.$$
Thus the asymptotic variance of the sequence $\sqrt{n}(\hat\phi^{(2)}_1 - \phi_1)$ is equal to
$$\frac{\sigma^2\gamma_X(0)}{\gamma_X(0)^2-\gamma_X(1)^2} = \frac{\sigma^2}{\gamma_X(0)}\,\frac{1}{1-\rho_1^2}.$$
(Note that $\rho_1 = \gamma_X(1)/\gamma_X(0)$.) Thus overfitting by one degree leads to a loss in efficiency of $1-\rho_1^2$. This is particularly harmful if the true value of $|\phi_1|$ is close to 1, i.e. the time series is close to being a (nonstationary) random walk.
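The variance inflation in this example is easy to check by simulation. The following sketch is purely illustrative (numpy is assumed; the helper names sample_acov and yule_walker are hypothetical, not from the text): it fits AR(1) and AR(2) models by Yule-Walker to data simulated from an AR(1) process and compares the empirical standard deviations of the estimators of $\phi_1$ with the asymptotic values derived above.

```python
import numpy as np

def sample_acov(x, h):
    # sample auto-covariance at lag h (denominator n)
    n = len(x)
    xc = x - x.mean()
    return np.dot(xc[:n - h], xc[h:]) / n

def yule_walker(x, p):
    # solve the Yule-Walker system Gamma_p phi = (gamma(1), ..., gamma(p))
    g = np.array([sample_acov(x, h) for h in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, g[1:])

rng = np.random.default_rng(0)
phi, n, reps = 0.8, 500, 1000
est1, est2 = [], []
for _ in range(reps):
    z = rng.normal(size=n + 100)
    x = np.zeros(n + 100)
    for t in range(1, n + 100):
        x[t] = phi * x[t - 1] + z[t]       # AR(1) with sigma = 1; burn-in discarded below
    x = x[100:]
    est1.append(yule_walker(x, 1)[0])
    est2.append(yule_walker(x, 2)[0])

rho1 = phi                                  # for an AR(1) process rho_1 = phi_1
print(np.std(est1), np.sqrt((1 - phi**2) / n))                   # order 1: variance (1 - phi^2)/n
print(np.std(est2), np.sqrt((1 - phi**2) / (1 - rho1**2) / n))   # order 2: inflated by 1/(1 - rho_1^2)
```

For $\phi_1$ close to 1 the second empirical standard deviation is visibly larger than the first, in line with the example.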
11.1.2 Partial Auto-Correlations

Recall that the partial auto-correlation coefficient $\alpha_X(h)$ of a centered time series $X_t$ is the coefficient of $X_1$ in the formula $\beta_1X_h + \cdots + \beta_hX_1$ for the best linear predictor of $X_{h+1}$ based on $X_1,\ldots,X_h$. In particular, for the causal AR($p$) process satisfying $X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + Z_t$ we have $\alpha_X(p) = \phi_p$ and $\alpha_X(h) = 0$ for $h > p$. The sample partial auto-correlation coefficient is defined in Section 5.4 as the Yule-Walker estimator $\hat\phi_h$ when fitting an AR($h$) model. This connection provides an alternative method to derive the limit distribution in the special situation of auto-regressive processes. The simplicity of the result makes it worth the effort.
11.8 Corollary. Let $X_t$ be a causal stationary AR($p$) process relative to an i.i.d. sequence $Z_t$ with finite fourth moments. Then, for every $h > p$,
$$\sqrt{n}\,\hat\alpha_n(h) \rightsquigarrow N(0,1).$$

Proof. For $h > p$ the time series $X_t$ is also an AR($h$) process and hence we can apply Theorem 11.3 to find that the Yule-Walker estimators $\hat\phi^{(h)}_1,\ldots,\hat\phi^{(h)}_h$ when fitting an AR($h$) model satisfy
$$\sqrt{n}\bigl(\hat\phi^{(h)}_h - \phi^{(h)}_h\bigr) \rightsquigarrow N\bigl(0,\ \sigma^2(\Gamma_h^{-1})_{h,h}\bigr).$$
The left side is exactly $\sqrt{n}\,\hat\alpha_n(h)$. We show that the variance of the normal distribution on the right side is unity. By Cramér's rule the $(h,h)$-element of the matrix $\Gamma_h^{-1}$ can be found as $\det\Gamma_{h-1}/\det\Gamma_h$. By the prediction equations we have for $h \ge p$
$$\begin{pmatrix}\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(h-1)\\ \gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-2)\\ \vdots & \vdots & & \vdots\\ \gamma_X(h-1) & \gamma_X(h-2) & \cdots & \gamma_X(0)\end{pmatrix}\begin{pmatrix}\phi_1\\ \vdots\\ \phi_p\\ 0\\ \vdots\\ 0\end{pmatrix} = \begin{pmatrix}\gamma_X(1)\\ \gamma_X(2)\\ \vdots\\ \gamma_X(h)\end{pmatrix}.$$
This expresses the vector on the right as a linear combination of the first $p$ columns of the matrix $\Gamma_h$ on the left. We can use this to rewrite $\det\Gamma_{h+1}$ (by a sweeping operation) in the form
$$\begin{vmatrix}\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(h)\\ \gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-1)\\ \vdots & \vdots & & \vdots\\ \gamma_X(h) & \gamma_X(h-1) & \cdots & \gamma_X(0)\end{vmatrix} = \begin{vmatrix}\gamma_X(0)-\phi_1\gamma_X(1)-\cdots-\phi_p\gamma_X(p) & 0 & \cdots & 0\\ \gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-1)\\ \vdots & \vdots & & \vdots\\ \gamma_X(h) & \gamma_X(h-1) & \cdots & \gamma_X(0)\end{vmatrix}.$$
The $(1,1)$-element in the last determinant is equal to $\sigma^2$ by (11.1). Thus this determinant is equal to $\sigma^2\det\Gamma_h$ and the theorem follows.

This corollary can be used to determine a suitable order $p$ if fitting an auto-regressive model to a given observed time series. Because the true partial auto-correlation coefficients of lags higher than the true order $p$ are all zero, we should expect that the sample partial auto-correlation coefficients are inside a band of the type $(-2/\sqrt{n},\ 2/\sqrt{n})$. Thus we should not choose the order equal to $p$ if $\hat\alpha_n(p+k)$ is outside this band for too many $k \ge 1$. Here we should expect a fraction of 5% of the $\hat\alpha_n(p+k)$ for which we perform this test to be outside the band in any case.
To turn this procedure into a more formal statistical test we must also take the dependence between the different $\hat\alpha_n(p+k)$ into account, but this appears to be complicated.
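The informal band check described above is simple to automate. The sketch below is illustrative only (numpy assumed; the helper names are hypothetical): it computes the sample partial auto-correlations as the last Yule-Walker coefficient of AR($h$) fits and picks the smallest order beyond which (almost) all of them stay inside the band $(-2/\sqrt{n}, 2/\sqrt{n})$.

```python
import numpy as np

def sample_pacf(x, hmax):
    # alpha_n(h) = last Yule-Walker coefficient when fitting an AR(h) model, h = 1, ..., hmax
    n = len(x)
    xc = x - x.mean()
    g = np.array([np.dot(xc[:n - h], xc[h:]) / n for h in range(hmax + 1)])
    pacf = []
    for h in range(1, hmax + 1):
        Gamma = np.array([[g[abs(i - j)] for j in range(h)] for i in range(h)])
        pacf.append(np.linalg.solve(Gamma, g[1:h + 1])[-1])
    return np.array(pacf)

def suggest_ar_order(x, hmax=20, level=0.05):
    # smallest p such that at most a fraction `level` of alpha_n(h), h > p, leave the band
    n = len(x)
    outside = np.abs(sample_pacf(x, hmax)) > 2 / np.sqrt(n)
    for p in range(hmax):
        if outside[p:].mean() <= level:
            return p
    return hmax
```

On data simulated from a low-order AR model this rule typically recovers the true order, although, as noted above, it is not a formal test.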
* 11.9 EXERCISE. Find the asymptotic limit distribution of the sequence $\bigl(\hat\alpha_n(h),\hat\alpha_n(h+1)\bigr)$ for $h > p$, e.g. in the case that $p = 0$ and $h = 1$.

* 11.1.3 Indirect Estimation

The parameters $\phi_1,\ldots,\phi_p$ of a causal auto-regressive process are exactly the coefficients of the one-step ahead linear predictor using $p$ variables from the past. This makes application of the least squares method to obtain estimators for these parameters particularly straightforward. For an arbitrary stationary time series the best linear predictor of $X_{p+1}$ given $1, X_1,\ldots,X_p$ is the linear combination $\mu + \beta_1(X_p-\mu) + \cdots + \beta_p(X_1-\mu)$ whose coefficients satisfy the prediction equations (2.4). The Yule-Walker estimators are the solutions to these equations after replacing the true auto-covariances by the sample auto-covariances. It follows that the Yule-Walker estimators can be considered estimators for the prediction coefficients (using $p$ variables from the past) for any stationary time series. The case of auto-regressive processes is special only in that these prediction coefficients are exactly the parameters of the model.

Furthermore, it remains true that the Yule-Walker estimators are $\sqrt{n}$-consistent and asymptotically normal. This does not follow from Theorem 11.3, because this uses the auto-regressive structure explicitly, but it can be inferred from the asymptotic normality of the auto-covariances, given in Theorem 5.8. (The argument is the same as used in Section 5.4. The asymptotic covariance matrix will be different from the one in Theorem 11.3, and more complicated.)

If the prediction coefficients (using a fixed number of past variables) are not the parameters of main interest, then these remarks may seem of little use. However, if the parameter $\theta$ of interest is of dimension $d$, then we may hope that there exists a one-to-one relationship between $\theta$ and the prediction coefficients $\beta_1,\ldots,\beta_p$ if we choose $p = d$. (More generally, we can apply this to a subvector of $\theta$ and a matching number of $\beta_j$'s.) Then we can first estimate $\beta_1,\ldots,\beta_d$ by the Yule-Walker estimators and next employ the relationship between $\theta$ and $\beta_1,\ldots,\beta_d$ to infer an estimate of $\theta$. If the inverse map giving $\theta$ as a function of $\beta_1,\ldots,\beta_d$ is differentiable, then it follows by the Delta-method that the resulting estimator for $\theta$ is $\sqrt{n}$-consistent and asymptotically normal, and hence we obtain good estimators.

If the relationship between $\theta$ and $(\beta_1,\ldots,\beta_d)$ is complicated, then this idea may be hard to implement. One way out of this problem is to determine the prediction coefficients $\beta_1,\ldots,\beta_d$ for a grid of values of $\theta$, possibly through simulation. The value on the grid that yields the Yule-Walker estimators is the estimator for $\theta$ we are looking for.
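A crude version of this grid search could look as follows. The sketch is purely illustrative (numpy assumed); `simulate` is a hypothetical user-supplied function that generates a long realization of the model at a given parameter value, and the prediction coefficients are the Yule-Walker quantities described above.

```python
import numpy as np

def prediction_coefficients(x, d):
    # Yule-Walker estimates of the coefficients of the best linear predictor using d past values
    n = len(x)
    xc = x - x.mean()
    g = np.array([np.dot(xc[:n - h], xc[h:]) / n for h in range(d + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(d)] for i in range(d)])
    return np.linalg.solve(Gamma, g[1:])

def indirect_estimate(x, simulate, theta_grid, nsim=20000):
    # match the prediction coefficients of the data with those of simulated series
    d = len(np.atleast_1d(theta_grid[0]))
    target = prediction_coefficients(x, d)
    distances = [np.sum((prediction_coefficients(simulate(theta, nsim), d) - target) ** 2)
                 for theta in theta_grid]
    return theta_grid[int(np.argmin(distances))]
```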
11.10 EXERCISE. Indicate how you could obtain (approximate) values for $\beta_1,\ldots,\beta_p$ given $\theta$ using computer simulation, for instance for a stochastic volatility model.
11.2 Moment Estimators

The Yule-Walker estimators can be viewed as arising from a comparison of sample auto-covariances to true auto-covariances and therefore are examples of moment estimators. Moment estimators are defined in general by matching sample moments and population moments. Population moments of a time series $X_t$ are true expectations of functions of the variables $X_t$, for instance,
$$\mathrm{E}_\theta X_t,\qquad \mathrm{E}_\theta X_t^2,\qquad \mathrm{E}_\theta X_{t+h}X_t,\qquad \mathrm{E}_\theta X_{t+h}^2X_t^2.$$
In every case, the subscript $\theta$ indicates the dependence on the unknown parameter $\theta$: in principle, each of these moments is a function of $\theta$. The principle of the method of moments is to estimate $\theta$ by the value $\hat\theta_n$ for which the corresponding population moments coincide with a corresponding sample moment, for instance,
$$\frac1n\sum_{t=1}^nX_t,\qquad \frac1n\sum_{t=1}^nX_t^2,\qquad \frac1n\sum_{t=1}^nX_{t+h}X_t,\qquad \frac1n\sum_{t=1}^nX_{t+h}^2X_t^2.$$
From Chapter 5 we know that these sample moments converge, as $n\to\infty$, to the true moments, and hence it is believable that the sequence of moment estimators $\hat\theta_n$ also converges to the true parameter, under some conditions.

Rather than true moments it is often convenient to define moment estimators through derived moments such as an auto-covariance at a fixed lag, or an auto-correlation, which are both functions of moments of degree at most 2. These derived moments are then matched by the corresponding sample quantities.

The choice of moments to be used is crucial for the existence and consistency of the moment estimators, and also for their efficiency.

For existence we shall generally need to match as many moments as there are parameters in the model. With fewer moments we should expect a moment estimator to be not uniquely defined, while with more moments no solution to the moment equations may exist. Because in general the moments are highly nonlinear functions of the parameters, it is hard to make this statement precise, as it is hard to characterize solutions of systems of nonlinear equations in general. This is illustrated already in the case of moving average processes, where a characterization of the existence of solutions requires effort, and where conditions and restrictions are needed to ensure their uniqueness. (Cf. Section 11.2.2.)

To ensure consistency and improve efficiency it is necessary to use moments that can be estimated well from the data. Auto-covariances at high lags, or moments of high degree, should generally be avoided. Besides the quality of the initial estimates of the population moments, the efficiency of the moment estimators also depends on the inverse map giving the parameter as a function of the moments. To see this we may formalize the method of moments through the scheme
$$\phi(\theta) = \mathrm{E}_\theta f(X_t,\ldots,X_{t+h}),\qquad \phi(\hat\theta_n) = \frac1n\sum_{t=1}^nf(X_t,\ldots,X_{t+h}).$$
Here $f\colon\mathbb{R}^{h+1}\to\mathbb{R}^d$ is a given map, which defines the moments used. (For definiteness we allow it to depend on the joint distribution of at most $h+1$ consecutive observations.) We assume that the time series $t\mapsto f(X_t,\ldots,X_{t+h})$ is strictly stationary, so that the mean values $\phi(\theta)$ in the first line do not depend on $t$, and for simplicity of notation we assume that we observe $X_1,\ldots,X_{n+h}$, so that the right side of the second line is indeed an observable quantity. We shall assume that the map $\phi\colon\Theta\to\mathbb{R}^d$ is one-to-one, so that the second line uniquely defines the estimator $\hat\theta_n$ as the inverse
$$\hat\theta_n = \phi^{-1}\bigl(\hat f_n\bigr),\qquad \hat f_n = \frac1n\sum_{t=1}^nf(X_t,\ldots,X_{t+h}).$$
We shall generally construct $\hat f_n$ such that it converges in probability to its mean $\phi(\theta)$ as $n\to\infty$. If this is the case and $\phi^{-1}$ is continuous at $\phi(\theta)$, then we have that $\hat\theta_n\to\phi^{-1}\bigl(\phi(\theta)\bigr) = \theta$, in probability as $n\to\infty$, and hence the moment estimator is asymptotically consistent.

Many sample moments converge at $\sqrt{n}$-rate, with a normal limit distribution. This allows us to refine the consistency result, in view of the Delta-method, given by Theorem 3.15. If $\phi^{-1}$ is differentiable at $\phi(\theta)$ with derivative of full rank, and $\sqrt{n}\bigl(\hat f_n - \phi(\theta)\bigr)$ converges in distribution to a normal distribution with mean zero and covariance matrix $\Sigma_\theta$, then
$$\sqrt{n}(\hat\theta_n - \theta) \rightsquigarrow N\Bigl(0,\ (\phi'_\theta)^{-1}\,\Sigma_\theta\,\bigl((\phi'_\theta)^{-1}\bigr)^T\Bigr).$$
Here $(\phi'_\theta)^{-1}$ is the derivative of $\phi^{-1}$ at $\phi(\theta)$, which is the inverse of the derivative of $\phi$ at $\theta$. We conclude that, under these conditions, the moment estimators are $\sqrt{n}$-consistent with a normal limit distribution, a desirable property.

A closer look concerns the size of the asymptotic covariance matrix $(\phi'_\theta)^{-1}\Sigma_\theta\bigl((\phi'_\theta)^{-1}\bigr)^T$. It depends both on the accuracy by which the chosen moments can be estimated from the data (through the matrix $\Sigma_\theta$) and the smoothness of the inverse $\phi^{-1}$. If the inverse map has a large derivative, then extracting the moment estimator $\hat\theta_n$ from the sample moments $\hat f_n$ magnifies the error of $\hat f_n$ as an estimate of $\phi(\theta)$, and the moment estimator will be relatively inefficient. Unfortunately, it is hard to see how a particular implementation of the method of moments works out without doing (part of) the algebra leading to the asymptotic covariance matrix. Furthermore, the outcome may depend on the true value of the parameter, a given moment estimator being relatively efficient for some parameter values, but (very) inefficient for others.
11.2.1 Generalized Moment Estimators

Moment estimators are measurable functions of the sample moments $\hat f_n$ and hence cannot be better than the best estimator based on $\hat f_n$. In most cases summarizing the data through the sample moments $\hat f_n$ incurs a loss of information. Only if the sample moments are sufficient (in the statistical sense) can moment estimators be fully efficient. This is an exceptional situation. The loss of information can be controlled somewhat by working with the right type of moments, but is usually unavoidable through the restriction of using only as many moments as there are parameters. This is because the reduction of a sample of size $n$ to a sample of empirical moments of size $d$ usually entails a loss of information.

This observation motivates the generalized method of moments. The idea is to reduce the sample to more empirical moments than there are parameters. Given a function $f\colon\mathbb{R}^{h+1}\to\mathbb{R}^e$ for $e > d$ with corresponding mean function $\phi(\theta) = \mathrm{E}_\theta f(X_t,\ldots,X_{t+h})$, there is no hope, in general, to solve an estimator $\hat\theta_n$ from the system of equations $\phi(\theta) = \hat f_n$, because these are $e > d$ equations in $d$ unknowns. The generalized method of moments overcomes this by defining $\hat\theta_n$ as the minimizer of the quadratic form, for a given (possibly random) matrix $\hat V_n$,
$$(11.3)\qquad \bigl(\phi(\theta)-\hat f_n\bigr)^T\,\hat V_n\,\bigl(\phi(\theta)-\hat f_n\bigr).$$
Thus a generalized moment estimator tries to solve the system of equations $\phi(\theta) = \hat f_n$ as well as possible, where the discrepancy is measured through a certain quadratic form. The matrix $\hat V_n$ weighs the influence of the different components of $\hat f_n$ on the estimator $\hat\theta_n$, and is typically chosen dependent on the data to increase the efficiency of the estimator. We assume that $\hat V_n$ is symmetric and positive-definite.

As $n\to\infty$ the estimator $\hat f_n$ typically converges to its expectation under the true parameter, which we shall denote by $\theta_0$ for clarity. If we replace $\hat f_n$ in the criterion function by its expectation $\phi(\theta_0)$, then we can reduce the resulting quadratic form to zero by choosing $\theta$ equal to $\theta_0$. This is clearly the minimal value of the quadratic form, and the choice $\theta = \theta_0$ will be unique as soon as the map $\phi$ is one-to-one. This suggests that the generalized moment estimator $\hat\theta_n$ is asymptotically consistent. As for ordinary moment estimators, a rigorous justification of the consistency must take into account the properties of the function $\phi$.
The distributional limit properties of a generalized moment estimator can be understood by linearizing the function $\phi$ around the true parameter. Insertion of the first order Taylor expansion $\phi(\theta) = \phi(\theta_0) + \phi'_{\theta_0}(\theta-\theta_0)$ into the quadratic form yields the approximate criterion
$$\bigl(\hat f_n - \phi(\theta_0) - \phi'_{\theta_0}(\theta-\theta_0)\bigr)^T\hat V_n\bigl(\hat f_n - \phi(\theta_0) - \phi'_{\theta_0}(\theta-\theta_0)\bigr) = \frac1n\bigl(Z_n - \phi'_{\theta_0}\sqrt{n}(\theta-\theta_0)\bigr)^T\hat V_n\bigl(Z_n - \phi'_{\theta_0}\sqrt{n}(\theta-\theta_0)\bigr),$$
for $Z_n = \sqrt{n}\bigl(\hat f_n - \phi(\theta_0)\bigr)$. The sequence $Z_n$ is typically asymptotically normally distributed, with mean zero. Minimization of this approximate criterion over $h = \sqrt{n}(\theta-\theta_0)$ is equivalent to minimizing the quadratic form $h\mapsto(Z_n-\phi'_{\theta_0}h)^T\hat V_n(Z_n-\phi'_{\theta_0}h)$, or equivalently minimizing the norm of the vector $Z_n-\phi'_{\theta_0}h$ over $h$ in the Hilbert space $\mathbb{R}^e$ with inner product defined by $\langle x,y\rangle = x^T\hat V_ny$. This comes down to projecting the vector $Z_n$ onto the range of the linear map $\phi'_{\theta_0}$ and hence by the projection theorem, Theorem 2.10, the minimizer $\hat h = \sqrt{n}(\hat\theta_n-\theta_0)$ is characterized by the orthogonality of the vector $Z_n-\phi'_{\theta_0}\hat h$ to the range of $\phi'_{\theta_0}$. The algebraic expression of this orthogonality, $(\phi'_{\theta_0})^T\hat V_n(Z_n-\phi'_{\theta_0}\hat h) = 0$, can be written in the form
$$\sqrt{n}(\hat\theta_n-\theta_0) = \bigl((\phi'_{\theta_0})^T\hat V_n\phi'_{\theta_0}\bigr)^{-1}(\phi'_{\theta_0})^T\hat V_nZ_n.$$
11.2: Moment Estimators 181
This readily gives the asymptotic normality of the sequence

n(

0
), with mean zero
and a somewhat complicated covariance matrix depending on
t
0
,

V
n
and the asymptotic
covariance matrix of Z
n
.
The best nonrandom weight matrices

V
n
, in terms of minimizing the asymptotic co-
variance of

n(

n
), is the inverse of the covariance matrix of Z
n
. (Cf. Problem 11.11.)
For our present situation this suggests to choose the matrix

V
n
to be consistent for the
inverse of the asymptotic covariance matrix of the sequence Z
n
=

n
_

f
n
(
0
)
_
. With
this choice and the asymptotic covariance matrix denoted by
0
, we may expect that

n(

n

0
) N
_
0,
_
(
t
0
)
T

1
0

t
0
_
1
_
.
The argument shows that the generalized moment estimator can be viewed as a weighted
least squares estimators for regressing

n
_

f
n
(
0
)
_
onto
t
0
. With the optimal weight-
ing matrix it is the best such estimator. If we use more initial moments to dene

f
n
and
hence (), then we add observations and corresponding rows to the design matrix
t
0
,
but keep the same parameter

n(
0
). This suggests that the asymptotic eciency
of the optimally weigthed generalized moment estimator increases if we use a longer
vector of initial moments

f
n
. In particular, the optimally weigthed generalized moment
estimator is more ecient than an ordinary moment estimator based on a subset of d of
the initial moments. Thus, the generalized method of moments achieves the aim of using
more information contained in the observations.
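Minimizing the quadratic form (11.3) is a routine numerical task once $\phi$ and a weight matrix are available. The following two-step sketch is illustrative only (it assumes numpy and scipy; the function and argument names are hypothetical): a first fit with the identity weight matrix, then a refit with the weight matrix taken as the inverse of an estimate of the covariance matrix of the moments. The crude covariance estimate used here ignores serial dependence; in practice one would use an estimator that accounts for it.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_estimate(f_values, phi, theta_init):
    # f_values: (n, e) array with rows f(X_t, ..., X_{t+h}); phi: theta -> R^e mean function
    fbar = f_values.mean(axis=0)

    def criterion(theta, V):
        diff = phi(theta) - fbar
        return diff @ V @ diff                  # the quadratic form (11.3)

    theta1 = minimize(criterion, theta_init, args=(np.eye(len(fbar)),)).x
    S = np.cov(f_values, rowvar=False)          # naive estimate of the covariance of the moments
    return minimize(criterion, theta1, args=(np.linalg.inv(S),)).x
```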
11.11 EXERCISE. Let $\Sigma$ be a symmetric, positive-definite matrix and $A$ a given matrix. Show that the matrix $(A^TVA)^{-1}A^TV\Sigma VA(A^TVA)^{-1}$ is minimized over nonnegative-definite matrices $V$ (where we say that $V \le W$ if $W - V$ is nonnegative definite) for $V = \Sigma^{-1}$. [The given matrix is the covariance matrix of $\beta_A = (A^TVA)^{-1}A^TVZ$ for $Z$ a random vector with the normal distribution with covariance matrix $\Sigma$. Show that $\mathrm{Cov}(\beta_A - \beta_{\Sigma^{-1}},\ \beta_{\Sigma^{-1}}) = 0$.]
These arguments are based on asymptotic approximations. They are reasonably accurate for values of $n$ that are large relative to the values of $d$ and $e$, but should not be applied if $d$ or $e$ is large. In particular, it is illegal to push the preceding argument to its extreme and infer that it is necessarily right to use as many initial moments as possible. Increasing the dimension of the vector $\hat f_n$ indefinitely may contribute more variability to the criterion (and hence to the estimator) without increasing the information much, depending on the accuracy of the estimator $\hat V_n$.

The implementation of the (generalized) method of moments requires that the expectations $\phi(\theta) = \mathrm{E}_\theta f(X_t,\ldots,X_{t+h})$ are available as functions of $\theta$. In some models, such as AR or MA models, this causes no difficulty, but already in ARMA models the required analytical computations become complicated. Sometimes it is easy to simulate realizations of a time series, but hard to compute explicit formulas for moments. In this case the values $\phi(\theta)$ may be estimated stochastically at a grid of values of $\theta$ by simulating realizations of the given time series, taking in turn each of the grid points as the true parameter, and next computing the empirical moment for the simulated time series. If the grid is sufficiently dense and the simulations are sufficiently long, then the grid point for which the simulated empirical moment matches the empirical moment of the data is close to the moment estimator. Taking it to be the moment estimator is called the simulated method of moments.
In the following theorem we make the preceding informal derivation of the asymptotics of generalized moment estimators rigorous. The theorem is a corollary of Theorems 3.17 and 3.18 on the asymptotics of general minimum contrast estimators. Consider generalized moment estimators as previously, defined as the point of minimum of a quadratic form of the type (11.3). In most cases the function $\phi(\theta)$ will be the expected value of the random vectors $\hat f_n$ under the parameter $\theta$, but this is not necessary. The following theorem is applicable as soon as $\phi(\theta_0)$ gives a correct centering to ensure that the sequence $\sqrt{n}\bigl(\hat f_n - \phi(\theta_0)\bigr)$ converges to a limit distribution, and hence may also apply to nonstationary time series.
11.12 Theorem. Let $\hat V_n$ be random matrices such that $\hat V_n\stackrel{P}{\to}V_0$ for some matrix $V_0$. Assume that $\phi\colon\Theta\subset\mathbb{R}^d\to\mathbb{R}^e$ is differentiable at an inner point $\theta_0$ of $\Theta$ with derivative $\phi'_{\theta_0}$ such that the matrix $(\phi'_{\theta_0})^TV_0\phi'_{\theta_0}$ is nonsingular and satisfies, for every $\delta > 0$,
$$\inf_{\theta:\|\theta-\theta_0\|>\delta}\bigl(\phi(\theta)-\phi(\theta_0)\bigr)^TV_0\bigl(\phi(\theta)-\phi(\theta_0)\bigr) > 0.$$
Assume either that $V_0$ is invertible or that the set $\{\phi(\theta)\colon\theta\in\Theta\}$ is bounded. Finally, suppose that the sequence of random vectors $Z_n = \sqrt{n}\bigl(\hat f_n - \phi(\theta_0)\bigr)$ is uniformly tight. If $\hat\theta_n$ are random vectors that minimize the criterion (11.3), then
$$\sqrt{n}(\hat\theta_n-\theta_0) = \bigl((\phi'_{\theta_0})^TV_0\phi'_{\theta_0}\bigr)^{-1}(\phi'_{\theta_0})^TV_0Z_n + o_P(1).$$
Proof. We first prove that $\hat\theta_n\stackrel{P}{\to}\theta_0$ using Theorem 3.17, with the criterion functions
$$M_n(\theta) = \bigl\|\hat V_n^{1/2}\bigl(\hat f_n - \phi(\theta)\bigr)\bigr\|,\qquad \overline M_n(\theta) = \bigl\|\hat V_n^{1/2}\bigl(\phi(\theta)-\phi(\theta_0)\bigr)\bigr\|.$$
The squares of these functions are the criterion in (11.3) and the quadratic form in the display of the theorem, but with $V_0$ replaced by $\hat V_n$, respectively. By the triangle inequality $\bigl|M_n(\theta)-\overline M_n(\theta)\bigr| \le \bigl\|\hat V_n^{1/2}\bigl(\hat f_n - \phi(\theta_0)\bigr)\bigr\| \to 0$ in probability, uniformly in $\theta$. Thus the first condition of Theorem 3.17 is satisfied. The second condition, that $\inf\{\overline M_n(\theta)\colon\|\theta-\theta_0\|>\delta\}$ is stochastically bounded away from zero for every $\delta>0$, is satisfied by assumption in the case that $\hat V_n = V_0$ is fixed. Because $\hat V_n\stackrel{P}{\to}V_0$, where $V_0$ is invertible or the set $\{\phi(\theta)\colon\|\theta-\theta_0\|>\delta\}$ is bounded, it is also satisfied in the general case, in view of Exercise 11.13. This concludes the proof of consistency of $\hat\theta_n$.
For the proof of asymptotic normality we use Theorem 3.18 with the criterion functions $M_n$ and $\overline M_n$ redefined as the squares of the functions $M_n$ and $\overline M_n$ used in the consistency proof (so that $M_n(\theta)$ is the criterion function in (11.3)), and with the centering function $M$ defined by
$$M(\theta) = \bigl(\phi(\theta)-\phi(\theta_0)\bigr)^TV_0\bigl(\phi(\theta)-\phi(\theta_0)\bigr).$$
It follows that, for any random sequence $\tilde\theta_n\stackrel{P}{\to}\theta_0$,
$$\begin{aligned}
n(M_n-\overline M_n)(\tilde\theta_n) - n(M_n-\overline M_n)(\theta_0)
&= \bigl(Z_n - \sqrt{n}\bigl(\phi(\tilde\theta_n)-\phi(\theta_0)\bigr)\bigr)^T\hat V_n\bigl(Z_n - \sqrt{n}\bigl(\phi(\tilde\theta_n)-\phi(\theta_0)\bigr)\bigr)\\
&\qquad - n\bigl(\phi(\tilde\theta_n)-\phi(\theta_0)\bigr)^T\hat V_n\bigl(\phi(\tilde\theta_n)-\phi(\theta_0)\bigr) - Z_n^T\hat V_nZ_n\\
&= -2\sqrt{n}\bigl(\phi(\tilde\theta_n)-\phi(\theta_0)\bigr)^T\hat V_nZ_n\\
&= -2(\tilde\theta_n-\theta_0)^T(\phi'_{\theta_0})^T\hat V_nZ_n + o_P\bigl(\sqrt{n}\|\tilde\theta_n-\theta_0\|\bigr),
\end{aligned}$$
by the differentiability of $\phi$ at $\theta_0$. Together with the convergence of $\hat V_n$ to $V_0$, the differentiability of $\phi$ also gives that $\overline M_n(\tilde\theta_n) - M(\tilde\theta_n) = o_P\bigl(\|\tilde\theta_n-\theta_0\|^2\bigr)$ for any sequence $\tilde\theta_n\stackrel{P}{\to}\theta_0$. Therefore, we may replace $\overline M_n$ by $M$ in the left side of the preceding display, if we add an $o_P\bigl(\|\tilde\theta_n-\theta_0\|^2\bigr)$-term on the right. By a third application of the differentiability of $\phi$, the function $M$ permits the two-term Taylor expansion $M(\theta) = (\theta-\theta_0)^TW(\theta-\theta_0) + o\bigl(\|\theta-\theta_0\|^2\bigr)$, for $W = (\phi'_{\theta_0})^TV_0\phi'_{\theta_0}$. Thus the conditions of Theorem 3.18 are satisfied and the proof of asymptotic normality is complete.
11.13 EXERCISE. Let $V_n$ be a sequence of nonnegative-definite matrices such that $V_n\to V$ for a matrix $V$ such that $\inf\{x^TVx\colon x\in C\} > 0$ for some set $C$. Show that:
(i) If $V$ is invertible, then $\liminf\ \inf\{x^TV_nx\colon x\in C\} > 0$.
(ii) If $C$ is bounded, then $\liminf\ \inf\{x^TV_nx\colon x\in C\} > 0$.
(iii) The assertion of (i)-(ii) may fail without some additional assumption.
[Suppose that $x_n^TV_nx_n\to 0$. If $V$ is invertible, then it follows that $x_n\to 0$. If the sequence $x_n$ is bounded, then $x_n^TVx_n - x_n^TV_nx_n\to 0$. As a counterexample let $V_n$ be the matrices with eigenvectors proportional to $(n,1)$ and $(-1,n)$ and eigenvalues 1 and 0, let $C = \{x\colon|x_1| > \epsilon\}$ and let $x_n = (-1,n)$.]
11.2.2 Moving Average Processes

Suppose that $X_t = \sum_{j=0}^q\theta_jZ_{t-j}$ is a moving average process of order $q$. For simplicity of notation assume that $1 = \theta_0$, and define $\theta_j = 0$ for $j < 0$ or $j > q$. Then the auto-covariance function of $X_t$ can be written in the form
$$\gamma_X(h) = \sigma^2\sum_j\theta_j\theta_{j+h}.$$
Given observations $X_1,\ldots,X_n$ we can estimate $\gamma_X(h)$ by the sample auto-covariance function and next obtain estimators for $\sigma^2,\theta_1,\ldots,\theta_q$ by solving the system of equations
$$\hat\gamma_n(h) = \sigma^2\sum_j\theta_j\theta_{j+h},\qquad h = 0,1,\ldots,q.$$
A solution of this system, which has $q+1$ equations with $q+1$ unknowns, does not necessarily exist, or may be nonunique. It cannot be derived in closed form, but must be determined numerically by an iterative method. Thus applying the method of moments for moving average processes is considerably more involved than for auto-regressive processes. The real drawback of this method is, however, that the moment estimators are less efficient than the least squares estimators that we discuss later in this chapter. Moment estimators are therefore at best only used as starting points for numerical procedures to compute other estimators.
11.14 Example (MA(1)). For the moving average process $X_t = Z_t + \theta Z_{t-1}$ the moment equations are
$$\gamma_X(0) = \sigma^2(1+\theta^2),\qquad \gamma_X(1) = \sigma^2\theta.$$
Replacing $\gamma_X$ by $\hat\gamma_n$ and solving for $\sigma^2$ and $\theta$ yields the moment estimators
$$\hat\theta_n = \frac{1\pm\sqrt{1-4\hat\rho_n(1)^2}}{2\hat\rho_n(1)},\qquad \hat\sigma^2 = \frac{\hat\gamma_n(1)}{\hat\theta_n}.$$
We obtain a real solution for $\hat\theta_n$ only if $|\hat\rho_n(1)|\le 1/2$. Because the true auto-correlation $\rho_X(1)$ is contained in the interval $[-1/2,1/2]$, it is reasonable to truncate the sample auto-correlation $\hat\rho_n(1)$ to this interval and then we always have some solution. If $|\hat\rho_n(1)| < 1/2$, then there are two solutions for $\hat\theta_n$, corresponding to the $\pm$ sign. This situation will happen with probability tending to one if the true auto-correlation $\rho_X(1)$ is strictly contained in the interval $(-1/2,1/2)$. From the two solutions, one solution has $|\hat\theta_n| < 1$ and corresponds to an invertible moving average process; the other solution has $|\hat\theta_n| > 1$. The existence of multiple solutions was to be expected in view of Theorem 8.30.
Assume that the true value $|\theta| < 1$, so that $\rho_X(1)\in(-1/2,1/2)$ and
$$\theta = \frac{1-\sqrt{1-4\rho_X(1)^2}}{2\rho_X(1)}.$$
Of course, we use the estimator $\hat\theta_n$ defined by the minus sign. Then $\hat\theta_n - \theta$ can be written as $\psi\bigl(\hat\rho_n(1)\bigr) - \psi\bigl(\rho_X(1)\bigr)$ for the function $\psi$ given by
$$\psi(\rho) = \frac{1-\sqrt{1-4\rho^2}}{2\rho}.$$
This function is differentiable on the interval $(-1/2,1/2)$. By the Delta-method the limit distribution of the sequence $\sqrt{n}(\hat\theta_n-\theta)$ is the same as the limit distribution of the sequence $\psi'\bigl(\rho_X(1)\bigr)\sqrt{n}\bigl(\hat\rho_n(1)-\rho_X(1)\bigr)$. Using Theorem 5.9 we obtain, after a long calculation, that
$$\sqrt{n}(\hat\theta_n-\theta) \rightsquigarrow N\Bigl(0,\ \frac{1+\theta^2+4\theta^4+\theta^6+\theta^8}{(1-\theta^2)^2}\Bigr).$$
Thus, to a certain extent, the method of moments works: the moment estimator $\hat\theta_n$ converges at a rate of $1/\sqrt{n}$ to the true parameter. However, the asymptotic variance is large, in particular for $|\theta|$ close to 1. We shall see later that there exist estimators with asymptotic variance $1-\theta^2$, which is smaller for every $\theta$, and is particularly small for $|\theta|$ close to 1.
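For the MA(1) example the moment estimator is available in closed form, so it can be coded directly. The sketch below is illustrative only (numpy assumed; the function names are hypothetical): it computes the estimator on simulated data and compares it with the asymptotic standard deviation derived above.

```python
import numpy as np

def ma1_moment_estimator(x):
    # moment estimator of theta in X_t = Z_t + theta Z_{t-1}, invertible (minus-sign) solution
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    gamma1 = np.dot(xc[:-1], xc[1:]) / n
    rho1 = np.clip(gamma1 / gamma0, -0.5, 0.5)     # truncate to [-1/2, 1/2]
    if rho1 == 0:
        return 0.0
    return (1 - np.sqrt(max(1 - 4 * rho1**2, 0.0))) / (2 * rho1)

def ma1_asymptotic_sd(theta, n):
    # standard deviation implied by the asymptotic variance in the example
    var = (1 + theta**2 + 4 * theta**4 + theta**6 + theta**8) / (1 - theta**2)**2
    return np.sqrt(var / n)

rng = np.random.default_rng(1)
theta, n = 0.6, 2000
z = rng.normal(size=n + 1)
x = z[1:] + theta * z[:-1]
print(ma1_moment_estimator(x), ma1_asymptotic_sd(theta, n))
```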
[Figure 11.1. The function $\rho\mapsto(1-\sqrt{1-4\rho^2})/(2\rho)$.]

11.15 EXERCISE. Derive the formula for the asymptotic variance, or at least convince yourself that you know how to get it.
The asymptotic behaviour of the moment estimators for moving averages of order higher than 1 can be analysed, as in the preceding example, by the Delta-method as well. Define $\phi\colon\mathbb{R}^{q+1}\to\mathbb{R}^{q+1}$ by
$$(11.4)\qquad \phi(\sigma^2,\theta_1,\ldots,\theta_q) = \sigma^2\begin{pmatrix}\sum_j\theta_j^2\\ \sum_j\theta_j\theta_{j+1}\\ \vdots\\ \sum_j\theta_j\theta_{j+q}\end{pmatrix}.$$
Then the moment estimators and true parameters satisfy
$$\begin{pmatrix}\hat\sigma^2\\ \hat\theta_1\\ \vdots\\ \hat\theta_q\end{pmatrix} = \phi^{-1}\begin{pmatrix}\hat\gamma_n(0)\\ \hat\gamma_n(1)\\ \vdots\\ \hat\gamma_n(q)\end{pmatrix},\qquad \begin{pmatrix}\sigma^2\\ \theta_1\\ \vdots\\ \theta_q\end{pmatrix} = \phi^{-1}\begin{pmatrix}\gamma_X(0)\\ \gamma_X(1)\\ \vdots\\ \gamma_X(q)\end{pmatrix}.$$
The joint limit distribution of the sequences $\sqrt{n}\bigl(\hat\gamma_n(h)-\gamma_X(h)\bigr)$ is known from Theorem 5.8. Therefore, the limit distribution of the moment estimators $\hat\sigma^2,\hat\theta_1,\ldots,\hat\theta_q$ follows by the Delta-method, provided the map $\phi^{-1}$ is differentiable at $\bigl(\gamma_X(0),\ldots,\gamma_X(q)\bigr)$.
Practical and theoretical complications arise from the fact that the moment equations may have zero or multiple solutions, as illustrated in the preceding example. This difficulty disappears if we insist on an invertible representation of the moving average process, i.e. require that the polynomial $1+\theta_1z+\cdots+\theta_qz^q$ has no roots in the complex unit disc. This follows from the following lemma, whose proof also contains an algorithm to compute the moment estimators numerically.
11.16 Lemma. Let $\Theta\subset\mathbb{R}^q$ be the set of all vectors $(\theta_1,\ldots,\theta_q)$ such that all roots of $1+\theta_1z+\cdots+\theta_qz^q$ are outside the unit circle. Then the map $\phi\colon\mathbb{R}^+\times\Theta\to\mathbb{R}^{q+1}$ in (11.4) is one-to-one and continuously differentiable. Furthermore, the map $\phi^{-1}$ is differentiable at every point $\phi(\sigma^2,\theta_1,\ldots,\theta_q)$ for which the roots of $1+\theta_1z+\cdots+\theta_qz^q$ are distinct.

* Proof. Abbreviate $\gamma_h = \gamma_X(h)$. The system of equations $\sigma^2\sum_j\theta_j\theta_{j+h} = \gamma_h$ for $h = 0,\ldots,q$ implies that
$$\sum_{h=-q}^q\gamma_hz^h = \sigma^2\sum_h\sum_j\theta_j\theta_{j+h}z^h = \sigma^2\theta(z^{-1})\theta(z).$$
For any $h\ge 0$ the function $z^h+z^{-h}$ can be expressed as a polynomial of degree $h$ in $w = z+z^{-1}$. For instance, $z^2+z^{-2} = w^2-2$ and $z^3+z^{-3} = w^3-3w$. The case of general $h$ can be treated by induction, upon noting that by rearranging Newton's binomial formula
$$z^{h+1}+z^{-h-1} - w^{h+1} = -\binom{h+1}{(h+1)/2} - \sum_{j\neq 0}\binom{h+1}{(h+1-j)/2}\bigl(z^j+z^{-j}\bigr).$$
Thus the left side of the first display of the proof can be written in the form
$$\gamma_0 + \sum_{h=1}^q\gamma_h\bigl(z^h+z^{-h}\bigr) = a_0 + a_1w + \cdots + a_qw^q,$$
for certain coefficients $(a_0,\ldots,a_q)$. Let $w_1,\ldots,w_q$ be the zeros of the polynomial on the right, and for each $j$ let $\eta_j$ and $\eta_j^{-1}$ be the solutions of the quadratic equation $z+z^{-1} = w_j$. Choose $|\eta_j|\le 1$. Then we can rewrite the right side of the preceding display as
$$a_q\prod_{j=1}^q\bigl(z+z^{-1}-w_j\bigr) = a_q\prod_{j=1}^q(z-\eta_j)(\eta_j-z^{-1})\eta_j^{-1}.$$
On comparing this to the first display of the proof, we see that $\eta_1,\ldots,\eta_q$ are the zeros of the polynomial $\theta(z)$. This allows us to construct a map
$$(\gamma_0,\ldots,\gamma_q)\mapsto(a_0,\ldots,a_q)\mapsto(w_1,\ldots,w_q,a_q)\mapsto(\eta_1,\ldots,\eta_q,a_q)\mapsto(\theta_1,\ldots,\theta_q,\sigma^2).$$
If restricted to the range of $\phi$ this is exactly the map $\phi^{-1}$. It is not hard to see that the first and last steps in this decomposition of $\phi^{-1}$ are analytic functions. The two middle steps concern mapping coefficients of polynomials into their zeros.
For $\alpha = (\alpha_0,\ldots,\alpha_q)\in\mathbb{C}^{q+1}$ let $p_\alpha(w) = \alpha_0+\alpha_1w+\cdots+\alpha_qw^q$. By the implicit function theorem for functions of several complex variables we can show the following. If for some $\alpha$ the polynomial $p_\alpha$ has a root of order 1 at a point $w_\alpha$, then there exist neighbourhoods $U_\alpha$ and $V_\alpha$ of $\alpha$ and $w_\alpha$ such that for every $\beta\in U_\alpha$ the polynomial $p_\beta$ has exactly one zero $w_\beta\in V_\alpha$ and the map $\beta\mapsto w_\beta$ is analytic on $U_\alpha$. Thus, under the assumption that all roots are of multiplicity one, the roots can be viewed as analytic functions of the coefficients. If $\theta$ has distinct roots, then $\eta_1,\ldots,\eta_q$ are of multiplicity one and hence so are $w_1,\ldots,w_q$. In that case the map is analytic.
* 11.2.3 Moment Estimators for ARMA Processes

If $X_t$ is a stationary ARMA process satisfying $\phi(B)(X_t-\mu) = \theta(B)Z_t$, then
$$\mathrm{cov}\bigl(\phi(B)(X_t-\mu),\ X_{t-k}\bigr) = \mathrm{E}\bigl(\theta(B)Z_t\bigr)X_{t-k}.$$
If $X_t$ is a causal, stationary ARMA process, then the right side vanishes for $k > q$. Working out the left side, we obtain the equations
$$\gamma_X(k) - \phi_1\gamma_X(k-1) - \cdots - \phi_p\gamma_X(k-p) = 0,\qquad k > q.$$
For $k = q+1,\ldots,q+p$ this leads to the system
$$\begin{pmatrix}\gamma_X(q) & \gamma_X(q-1) & \cdots & \gamma_X(q-p+1)\\ \gamma_X(q+1) & \gamma_X(q) & \cdots & \gamma_X(q-p+2)\\ \vdots & \vdots & & \vdots\\ \gamma_X(q+p-1) & \gamma_X(q+p-2) & \cdots & \gamma_X(q)\end{pmatrix}\begin{pmatrix}\phi_1\\ \phi_2\\ \vdots\\ \phi_p\end{pmatrix} = \begin{pmatrix}\gamma_X(q+1)\\ \gamma_X(q+2)\\ \vdots\\ \gamma_X(q+p)\end{pmatrix}.$$
These are the Yule-Walker equations for general stationary ARMA processes and may be used to obtain estimators $\hat\phi_1,\ldots,\hat\phi_p$ of the auto-regressive parameters in the same way as for auto-regressive processes: we replace $\gamma_X$ by $\hat\gamma_n$ and solve for $\phi_1,\ldots,\phi_p$.
Next we apply the method of moments for moving averages to the time series $Y_t = \theta(B)Z_t$ to obtain estimators for the parameters $\sigma^2,\theta_1,\ldots,\theta_q$. Because also $Y_t = \phi(B)(X_t-\mu)$, we can write the covariance function $\gamma_Y$ in the form
$$\gamma_Y(h) = \sum_i\sum_j\varphi_i\varphi_j\gamma_X(h+i-j),\qquad\text{if }\phi(z) = \sum_j\varphi_jz^j.$$
Let $\hat\gamma_Y(h)$ be the estimators obtained by replacing the unknown parameters $\varphi_j$ and $\gamma_X(h)$ by their moment estimators and sample moments, respectively. Next we solve $\hat\sigma^2,\hat\theta_1,\ldots,\hat\theta_q$ from the system of equations
$$\hat\gamma_Y(h) = \sigma^2\sum_j\theta_j\theta_{j+h},\qquad h = 0,1,\ldots,q.$$
As is explained in the preceding section, if $X_t$ is invertible, then the solution is unique, with probability tending to one, if the coefficients $\theta_1,\ldots,\theta_q$ are restricted to give an invertible stationary ARMA process.
The resulting estimators $(\hat\sigma^2,\hat\theta_1,\ldots,\hat\theta_q,\hat\phi_1,\ldots,\hat\phi_p)$ can be written as a function of $\bigl(\hat\gamma_n(0),\ldots,\hat\gamma_n(q+p)\bigr)$. The true values of the parameters can be written as the same function of the vector $\bigl(\gamma_X(0),\ldots,\gamma_X(q+p)\bigr)$. In principle, under some conditions, the limit distribution of the estimators can be obtained by the Delta-method.
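For the auto-regressive part this recipe amounts to solving a $p$-dimensional linear system with sample auto-covariances at lags $q-p+1,\ldots,q+p$. A minimal sketch, assuming numpy (the function name is illustrative, not from the text):

```python
import numpy as np

def generalized_yule_walker(x, p, q):
    # estimate (phi_1, ..., phi_p) of an ARMA(p, q) process from the equations
    # gamma(k) - phi_1 gamma(k-1) - ... - phi_p gamma(k-p) = 0,  k = q+1, ..., q+p
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([np.dot(xc[:n - h], xc[h:]) / n for h in range(q + p + 1)])
    A = np.array([[gamma[abs(q + i - j)] for j in range(p)] for i in range(p)])
    b = gamma[q + 1:q + p + 1]
    return np.linalg.solve(A, b)
```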
* 11.2.4 Stochastic Volatility Models

In the stochastic volatility model discussed in Section 10.4 an observation $X_t$ is defined as $X_t = \sigma_tZ_t$ for $\log\sigma_t$ a stationary auto-regressive process satisfying $\log\sigma_t = \alpha + \phi\log\sigma_{t-1} + \sigma V_{t-1}$, and $(V_t,Z_t)$ an i.i.d. sequence of bivariate normal vectors with mean zero, unit variances and correlation $\delta$. Thus the model is parameterized by four parameters $\alpha,\phi,\sigma,\delta$.
The series $X_t$ is a white noise series and hence we cannot use the auto-covariances $\gamma_X(h)$ at lags $h\neq 0$ to construct moment estimators. Instead, we might use higher marginal moments or auto-covariances of powers of the series. In particular, it is computed in Section 10.4 that
$$\mathrm{E}|X_t| = \exp\Bigl(\tfrac12\frac{\sigma^2}{1-\phi^2} + \frac{\alpha}{1-\phi}\Bigr)\sqrt{\frac2\pi},$$
$$\mathrm{E}|X_t|^2 = \exp\Bigl(\tfrac12\frac{4\sigma^2}{1-\phi^2} + \frac{2\alpha}{1-\phi}\Bigr),$$
$$\mathrm{E}|X_t|^3 = \exp\Bigl(\tfrac12\frac{9\sigma^2}{1-\phi^2} + \frac{3\alpha}{1-\phi}\Bigr)\,2\sqrt{\frac2\pi},$$
$$\mathrm{E}X_t^4 = \exp\Bigl(\frac{8\sigma^2}{1-\phi^2} + \frac{4\alpha}{1-\phi}\Bigr)\,3,$$
$$\rho_{X^2}(1) = \frac{(1+4\delta^2\sigma^2)\,e^{4\sigma^2\phi/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1},$$
$$\rho_{X^2}(2) = \frac{(1+4\delta^2\sigma^2\phi^2)\,e^{4\sigma^2\phi^2/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1},$$
$$\rho_{X^2}(3) = \frac{(1+4\delta^2\sigma^2\phi^4)\,e^{4\sigma^2\phi^3/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1}.$$
We can use a selection of these moments to define moment estimators, or use some or all of them to define generalized moment estimators. Because the functions on the right side are complicated, this requires some effort, but it is feasible. (See Taylor (1986).)
11.3 Least Squares Estimators

For auto-regressive processes the method of least squares is directly suggested by the structural equation defining the model, but it can also be derived from the prediction problem. The second point of view is deeper and can be applied to general time series.
A least squares estimator is based on comparing the predicted value of an observation $X_t$ based on the preceding observations to the actually observed value $X_t$. Such a prediction $\Pi_{t-1}X_t$ will generally depend on the underlying parameter $\theta$ of the model, which we shall make visible in the notation by writing it as $\Pi_{t-1}X_t(\theta)$. The index $t-1$ of $\Pi_{t-1}$ indicates that $\Pi_{t-1}X_t(\theta)$ is a function of $X_1,\ldots,X_{t-1}$ (and the parameter) only. By convention we define $\Pi_0X_1 = 0$. A weighted least squares estimator, with inverse weights $w_t(\theta)$, is defined as the minimizer, if it exists, of the function
$$(11.5)\qquad \theta\mapsto\sum_{t=1}^n\frac{\bigl(X_t - \Pi_{t-1}X_t(\theta)\bigr)^2}{w_t(\theta)}.$$
This expression depends only on the observations $X_1,\ldots,X_n$ and the unknown parameter $\theta$, and hence is an observable criterion function. The idea is that using the true parameter $\theta$ should yield the best predictions. The weights $w_t(\theta)$ could be chosen equal to one, but are generally chosen to increase the efficiency of the resulting estimator.
This least squares principle is intuitively reasonable for any sense of prediction, in particular both for linear and nonlinear prediction. For nonlinear prediction we set $\Pi_{t-1}X_t(\theta)$ equal to the conditional expectation $\mathrm{E}_\theta(X_t\mid X_1,\ldots,X_{t-1})$, an expression that may or may not be easy to derive analytically.
For linear prediction, if we assume that the time series $X_t$ is centered at mean zero, we set $\Pi_{t-1}X_t(\theta)$ equal to the linear combination $\beta_1X_{t-1}+\cdots+\beta_{t-1}X_1$ that minimizes
$$(\beta_1,\ldots,\beta_{t-1})\mapsto\mathrm{E}_\theta\bigl(X_t - (\beta_1X_{t-1}+\cdots+\beta_{t-1}X_1)\bigr)^2,\qquad \beta_1,\ldots,\beta_{t-1}\in\mathbb{R}.$$
In Chapter 2 the coefficients of the best linear predictor are expressed in the auto-covariance function $\gamma_X$ by the prediction equations (2.4). Thus the coefficients $\beta_t$ depend on the parameter $\theta$ of the underlying model through the auto-covariance function. Hence the least squares estimators using linear predictors can also be viewed as moment estimators.
The difference $X_t - \Pi_{t-1}X_t(\theta)$ between the true value and its prediction is called the innovation. Its second moment
$$v_t(\theta) = \mathrm{E}_\theta\bigl(X_t - \Pi_{t-1}X_t(\theta)\bigr)^2$$
is called the (square) prediction error at time $t-1$. The weights $w_t(\theta)$ are often chosen equal to the prediction errors $v_t(\theta)$ in order to ensure that the terms of the sum of squares contribute equal amounts of information.
For both linear and nonlinear predictors the innovations $X_1-\Pi_0X_1(\theta), X_2-\Pi_1X_2(\theta),\ldots,X_n-\Pi_{n-1}X_n(\theta)$ are uncorrelated random variables. This orthogonality suggests that the terms of the sum contribute additive information to the criterion, which should be good. It also shows that there is usually no need to replace the sum of squares by a more general quadratic form, which would be the standard approach in ordinary least squares estimation.
Whether the sum of squares indeed possesses a (unique) point of minimum $\hat\theta$ and whether this constitutes a good estimator of the parameter $\theta$ depends on the statistical model for the time series. Moreover, this model determines the feasibility of computing the point of minimum given the data. Auto-regressive and GARCH processes provide a positive and a negative example.
11.17 Example (Autoregression). A mean-zero, causal, stationary, auto-regressive process of order $p$ is modelled through the parameter $\theta = (\sigma^2,\phi_1,\ldots,\phi_p)$. For $t > p$ the best linear predictor is given by $\Pi_{t-1}X_t = \phi_1X_{t-1}+\cdots+\phi_pX_{t-p}$ and the prediction error is $v_t = \mathrm{E}Z_t^2 = \sigma^2$. For $t\le p$ the formulas are more complicated, but could be obtained in principle.
The weighted sum of squares with weights $w_t = v_t$ reduces to
$$\sum_{t=1}^p\frac{\bigl(X_t - \Pi_{t-1}X_t(\phi_1,\ldots,\phi_p)\bigr)^2}{v_t(\sigma^2,\phi_1,\ldots,\phi_p)} + \sum_{t=p+1}^n\frac{\bigl(X_t - \phi_1X_{t-1}-\cdots-\phi_pX_{t-p}\bigr)^2}{\sigma^2}.$$
Because the first term, consisting of $p$ of the $n$ terms of the sum of squares, possesses a complicated form, it is often dropped from the sum of squares. Then we obtain exactly the sum of squares considered in Section 11.1, but with $\overline X_n$ replaced by 0 and divided by $\sigma^2$. For large $n$ the difference between the sums of squares, and hence between the two types of least squares estimators, should be negligible.
Another popular strategy to simplify the sum of squares is to act as if the observations $X_0,X_{-1},\ldots,X_{-p+1}$ are available and to redefine $\Pi_{t-1}X_t$ for $t = 1,\ldots,p$ accordingly. This is equivalent to dropping the first term and letting the sum in the second term start at $t = 1$ rather than at $t = p+1$. To implement the estimator we must now choose numerical values for the missing observations $X_0,X_{-1},\ldots,X_{-p+1}$; zero is a common choice.
The least squares estimators for $\phi_1,\ldots,\phi_p$, being (almost) identical to the Yule-Walker estimators, are $\sqrt{n}$-consistent and asymptotically normal. However, the least squares criterion does not lead to a useful estimator for $\sigma^2$: minimization over $\sigma^2$ leads to $\hat\sigma^2 = \infty$ and this is obviously not a good estimator. A more honest conclusion is that the least squares criterion as posed originally fails for auto-regressive processes, since minimization over the full parameter $\theta = (\sigma^2,\phi_1,\ldots,\phi_p)$ leads to a zero sum of squares for $\sigma^2 = \infty$ and arbitrary (finite) values of the remaining parameters. The method of least squares works only for the subparameter $(\phi_1,\ldots,\phi_p)$ if we first drop $\sigma^2$ from the sum of squares.
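The simplified criterion of this example is an ordinary linear regression and can be solved in one line of linear algebra. A minimal sketch, assuming numpy (the function name is illustrative; the residual-based estimate of $\sigma^2$ added at the end is not part of the least squares criterion itself, in line with the remark above):

```python
import numpy as np

def ar_conditional_least_squares(x, p):
    # minimize sum_{t=p+1}^n (X_t - phi_1 X_{t-1} - ... - phi_p X_{t-p})^2
    # by regressing X_t on its p predecessors
    y = x[p:]
    D = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(D, y, rcond=None)
    residuals = y - D @ phi
    sigma2 = np.mean(residuals**2)       # separate residual-based estimate of sigma^2
    return phi, sigma2
```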
11.18 Example (GARCH). A GARCH process is a martingale difference series and hence the one-step predictions $\Pi_{t-1}X_t(\theta)$ are identically zero. Consequently, the weighted least squares sum, with weights equal to the prediction errors, reduces to
$$\sum_{t=1}^n\frac{X_t^2}{v_t(\theta)}.$$
Minimizing this criterion over $\theta$ is equivalent to maximizing the prediction errors $v_t(\theta)$. It is intuitively clear that this does not lead to reasonable estimators.
One alternative is to apply the least squares method to the squared series $X_t^2$. This satisfies an ARMA equation in view of (9.3). (Note however that the innovations in that equation are also dependent on the parameter.)
The best fix of the least squares method is to augment the least squares criterion to the Gaussian likelihood, as discussed in Chapter 13.
So far the discussion in this section has assumed implicitly that the mean value $\mu = \mathrm{E}X_t$ of the time series is zero. If this is not the case, then we apply the preceding discussion to the time series $X_t-\mu$ instead of to $X_t$, assuming first that $\mu$ is known. Then the parameter $\mu$ will show up in the least squares criterion. To define estimators we can either replace the unknown value $\mu$ by the sample mean $\overline X_n$ and minimize the sum of squares with respect to the remaining parameters, or perform a joint minimization over all parameters.
Least squares estimators can rarely be written in closed form, the case of stationary auto-regressive processes being an exception, but iterative algorithms for the approximate calculation are implemented in many computer packages. Newton-type algorithms provide one possibility. The best linear predictions $\Pi_{t-1}X_t$ are often computed recursively in $t$ (for a grid of values $\theta$), for instance with the help of a state space representation of the time series and the Kalman filter.
The method of least squares is closely related to Gaussian likelihood, as discussed in Chapter 13. Gaussian likelihood is perhaps more fundamental than the method of least squares. For this reason we restrict further discussion of the method of least squares to ARMA processes.
11.3.1 ARMA Processes

The method of least squares works well for estimating the regression and moving average parameters $(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)$ of ARMA processes, if we perform the minimization for a fixed value of the parameter $\sigma^2$. In general, if some parameter, such as $\sigma^2$ for ARMA processes, enters the covariance function as a multiplicative factor, then the best linear predictor $\Pi_tX_{t+1}$ is free from this parameter, by the prediction equations (2.4). On the other hand, the prediction error $v_{t+1} = \gamma_X(0) - (\beta_1,\ldots,\beta_t)\Gamma_t(\beta_1,\ldots,\beta_t)^T$ (where $\beta_1,\ldots,\beta_t$ are the coefficients of the best linear predictor) contains such a parameter as a multiplicative factor. It follows that the inverse of the parameter will enter the least squares criterion as a multiplicative factor. Thus on the one hand the least squares method does not yield an estimator for this parameter; on the other hand, we can just omit the parameter and minimize the criterion over the remaining parameters. In particular, in the case of ARMA processes the least squares estimators for $(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)$ are defined as the minimizers of, for $\tilde v_t = \sigma^{-2}v_t$,
$$\sum_{t=1}^n\frac{\bigl(X_t - \Pi_{t-1}X_t(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)\bigr)^2}{\tilde v_t(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)}.$$
This is a complicated function of the parameters. However, for a fixed value of $(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)$ it can be computed using the state space representation of an ARMA process and the Kalman filter. A grid search or iteration method can do the rest.
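A common shortcut in practice, rather than the exact predictions via the Kalman filter, is the conditional sum of squares: residuals are computed recursively from the ARMA recursion with the unobserved initial values set to zero, and the sum of their squares is minimized numerically. The sketch below is illustrative only (numpy and scipy assumed; it presumes a causal and invertible parameter value and uses a generic optimizer), not the exact criterion above.

```python
import numpy as np
from scipy.optimize import minimize

def css(params, x, p, q):
    # conditional sum of squares for an ARMA(p, q):
    # Z_t = X_t - sum_i phi_i X_{t-i} - sum_j theta_j Z_{t-j}, with X_s = Z_s = 0 for s <= 0
    phi, theta = params[:p], params[p:p + q]
    n = len(x)
    z = np.zeros(n)
    for t in range(n):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * z[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        z[t] = x[t] - ar - ma
    return np.sum(z**2)

def arma_css_estimate(x, p, q):
    start = np.zeros(p + q)
    return minimize(css, start, args=(x, p, q), method="Nelder-Mead").x
```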
11.19 Theorem. Let $X_t$ be a causal and invertible stationary ARMA($p,q$) process relative to an i.i.d. sequence $Z_t$ with finite fourth moments. Then the least squares estimators satisfy
$$\sqrt{n}\Bigl(\bigl(\hat{\vec\phi}_p,\hat{\vec\theta}_q\bigr) - \bigl(\vec\phi_p,\vec\theta_q\bigr)\Bigr) \rightsquigarrow N\bigl(0,\ \sigma^2J^{-1}_{\vec\phi_p,\vec\theta_q}\bigr),$$
where $J_{\vec\phi_p,\vec\theta_q}$ is the covariance matrix of $(U_1,\ldots,U_p,V_1,\ldots,V_q)$ for stationary auto-regressive processes $U_t$ and $V_t$ satisfying $\phi(B)U_t = \theta(B)V_t = Z_t$.

Proof. The proof of this theorem is long and technical. See e.g. Brockwell and Davis (1991), pages 375-396, Theorem 10.8.2.
11.20 Example (MA(1)). The least squares estimator $\hat\theta_n$ for $\theta$ in the moving average process $X_t = Z_t + \theta Z_{t-1}$ with $|\theta| < 1$ possesses asymptotic variance equal to $\sigma^2/\mathrm{var}\,V_1$, where $V_t$ is the stationary solution to the equation $\theta(B)V_t = Z_t$. Note that $V_t$ is an auto-regressive process of order 1, not a moving average!
As we have seen before, the process $V_t$ possesses the representation $V_t = \sum_{j=0}^\infty(-\theta)^jZ_{t-j}$ and hence $\mathrm{var}\,V_t = \sigma^2/(1-\theta^2)$ for every $t$.
Thus the sequence $\sqrt{n}(\hat\theta_n-\theta)$ is asymptotically normally distributed with mean zero and variance equal to $1-\theta^2$. This should be compared to the asymptotic distribution of the moment estimator, obtained in Example 11.14.
11.21 EXERCISE. Find the asymptotic covariance matrix of the sequence $\sqrt{n}(\hat\phi_n-\phi,\hat\theta_n-\theta)$ for $(\hat\phi_n,\hat\theta_n)$ the least squares estimators for the stationary, causal, invertible ARMA process satisfying $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$.
12 Spectral Estimation

In this chapter we study nonparametric estimators of the spectral density and spectral distribution of a stationary time series. As in Chapter 5 "nonparametric" means that no a-priori structure of the series is assumed, apart from stationarity.
If a well-fitting model is available, then an alternative to the methods of this chapter are spectral estimators suited to this model. For instance, the spectral density of a stationary ARMA process can be expressed in the parameters $\sigma^2,\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q$ of the model (see Section 8.5) and hence can be estimated by plugging in estimators for the parameters. If the ARMA model is appropriate, this should lead to better estimators than the nonparametric estimators discussed in this chapter. We do not further discuss this type of estimator.
Let the observations $X_1,\ldots,X_n$ be the values at times $1,\ldots,n$ of a stationary time series $X_t$, and let $\hat\gamma_n$ be their sample auto-covariance function. In view of the definition of the spectral density $f_X(\lambda)$, a natural estimator is
$$(12.1)\qquad \hat f_{n,r}(\lambda) = \frac1{2\pi}\sum_{|h|<r}\hat\gamma_n(h)e^{-ih\lambda}.$$
Whereas $f_X(\lambda)$ is defined as an infinite series, the estimator $\hat f_{n,r}$ is truncated at its $r$th term. Because the estimators $\hat\gamma_n(h)$ are defined only for $|h| < n$ and there is no hope of estimating the auto-covariances $\gamma_X(h)$ for lags $|h|\ge n$, we must choose $r\le n$. Because the estimators $\hat\gamma_n(h)$ are unreliable for $|h|$ close to $n$, it may be wise to choose $r$ much smaller than $n$. We shall see that a good choice of $r$ depends on the smoothness of the spectral density and also on which aspect of the spectrum is of interest. For estimating $f_X(\lambda)$ at a point, values of $r$ such as $n^\alpha$ for some $\alpha\in(0,1)$ may be appropriate, whereas for estimating the spectral distribution function (i.e. areas under $f_X$) the choice $r = n$ works well.
In any case, since the covariances at lags $|h|\ge n$ can never be estimated from the data, nonparametric estimation of the spectrum is hopeless, unless one is willing to assume that expressions such as $\sum_{|h|\ge n}\bigl|\gamma_X(h)\bigr|$ are small. In Section 12.3 ahead we relate this tail series to the smoothness of the spectral density $\lambda\mapsto f_X(\lambda)$. If this function is several times differentiable, then the auto-covariance function decreases fast to zero, and nonparametric estimation is feasible.
12.1 Finite Fourier Transform

The finite Fourier transform is a useful tool in spectral analysis, both for theory and practice. The practical use comes from the fact that it can be computed efficiently by a clever algorithm, the Fast Fourier Transform (FFT). While a naive computation would require $O(n^2)$ multiplications and summations, the FFT needs only $O(n\log n)$ operations. (See e.g. Brockwell and Davis, Chapter 10, for a discussion.)
The finite Fourier transform of an arbitrary sequence $x_1,\ldots,x_n$ of complex numbers is defined as the function $d_x(\lambda)$ given by
$$d_x(\lambda) = \frac1{\sqrt{n}}\sum_{t=1}^nx_te^{-i\lambda t},\qquad \lambda\in(-\pi,\pi].$$
In other words, the function $\sqrt{n/(2\pi)}\,d_x(\lambda)$ is the Fourier series corresponding to the coefficients $\ldots,0,0,x_1,x_2,\ldots,x_n,0,0,\ldots$. The inversion formula (or a short calculation) shows that
$$x_t = \frac{\sqrt{n}}{2\pi}\int_{-\pi}^{\pi}e^{i\lambda t}d_x(\lambda)\,d\lambda,\qquad t = 1,2,\ldots,n.$$
Thus there is a one-to-one relationship between the numbers $x_1,\ldots,x_n$ and the function $d_x$. We may view the function $d_x$ as encoding the numbers $x_1,\ldots,x_n$.
Encoding $n$ numbers by a function on the interval $(-\pi,\pi]$ is rather inefficient. At closer inspection the numbers $x_1,\ldots,x_n$ can also be recovered from the values of $d_x$ on the grid
$$\ldots,-\frac{4\pi}n,-\frac{2\pi}n,0,\frac{2\pi}n,\frac{4\pi}n,\ldots\ \cap\ (-\pi,\pi].$$
These $n$ points are called the natural frequencies at time $n$. (The value 0 is always a natural frequency; for odd $n$ there are $(n-1)/2$ positive and negative natural frequencies, situated symmetrically about 0; for even $n$ the value $\pi$ is a natural frequency, and the remaining natural frequencies are $n/2-1$ points in $(0,\pi)$ and their reflections.)
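Numerically, $d_x$ at the natural frequencies is obtained from an FFT routine. The following sketch is illustrative (numpy assumed); the comments spell out the corrections needed because numpy's fft starts the time index at 0 and does not include the $1/\sqrt{n}$ factor.

```python
import numpy as np

def finite_fourier_transform(x):
    # d_x(lambda_j) = n^{-1/2} sum_{t=1}^n x_t exp(-i lambda_j t) at lambda_j = 2 pi j / n
    # numpy's fft returns sum_{k=0}^{n-1} x[k] exp(-2 pi i j k / n), i.e. it indexes time
    # from 0, so we multiply by exp(-i lambda_j) to shift to t = 1, ..., n and divide by sqrt(n)
    n = len(x)
    lam = 2 * np.pi * np.arange(n) / n
    d = np.fft.fft(x) * np.exp(-1j * lam) / np.sqrt(n)
    lam = np.where(lam > np.pi, lam - 2 * np.pi, lam)   # d_x is 2*pi-periodic, so map to (-pi, pi]
    return lam, d
```

The inversion over the natural frequencies (Lemma 12.1 below) can then be checked numerically by reconstructing $x_t = n^{-1/2}\sum_j d_x(\lambda_j)e^{it\lambda_j}$.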
12.1 Lemma. If $d_x$ is the finite Fourier transform of $x_1,\ldots,x_n\in\mathbb{C}$, then
$$x_t = \frac1{\sqrt{n}}\sum_jd_x(\lambda_j)e^{it\lambda_j},\qquad t = 1,2,\ldots,n,$$
where the sum is computed over the natural frequencies $\lambda_j\in(-\pi,\pi]$ at time $n$.

Proof. For each of the natural frequencies $\lambda_j$ define a vector
$$e_j = \frac1{\sqrt{n}}\bigl(e^{i\lambda_j},e^{i2\lambda_j},\ldots,e^{in\lambda_j}\bigr)^T.$$
It is straightforward to check that the $n$ vectors $e_j$ form an orthonormal set in $\mathbb{C}^n$ and hence a basis. Thus the vector $x = (x_1,\ldots,x_n)^T$ can be written as $x = \sum_j\langle x,e_j\rangle e_j$. Now $\langle x,e_j\rangle = d_x(\lambda_j)$ and the lemma follows.
The proof of the preceding lemma shows how the numbers $d_x(\lambda_j)$ can be interpreted. View the coordinates of the vector $x = (x_1,\ldots,x_n)$ as the values of a signal at the time instants $1,2,\ldots,n$. Similarly, view the coordinates of the vector $e_j$ as the values of the trigonometric signal $t\mapsto n^{-1/2}e^{it\lambda_j}$ of frequency $\lambda_j$ at these time instants. By the preceding lemma the signal $x$ can be written as a linear combination of the signals $e_j$. The value $\bigl|d_x(\lambda_j)\bigr|$ is the weight of signal $e_j$, and hence of frequency $\lambda_j$, in $x$.
12.2 EXERCISE. How is the weight of frequency 0 expressed in $(x_1,\ldots,x_n)$?

12.3 EXERCISE. Show that $d_{(\mu,\mu,\ldots,\mu)}(\lambda_j) = 0$ for every nonzero natural frequency $\lambda_j$ and every $\mu\in\mathbb{C}$. Conclude that $d_{x_1-\mu,\ldots,x_n-\mu}(\lambda_j) = d_x(\lambda_j)$ for every $\lambda_j\neq 0$. [Hint: you can compute this explicitly, or use that $d_x(\lambda_j) = \langle x,e_j\rangle$ as in the proof of the preceding lemma, for any $x$.]
12.2 Periodogram

The periodogram of a sequence of observations $X_1,\ldots,X_n$ is defined as the function $\lambda\mapsto I_n(\lambda)$ given by
$$I_n(\lambda) = \frac1{2\pi}\bigl|d_X(\lambda)\bigr|^2 = \frac1{2\pi n}\Bigl|\sum_{t=1}^nX_te^{-i\lambda t}\Bigr|^2.$$
We write $I_{n,X}$ if the dependence on $X_1,\ldots,X_n$ needs to be stressed.
In view of the interpretation of the finite Fourier transform in the preceding section, $I_n(\lambda)$ is the square of the weight of frequency $\lambda$ in the signal $X_1,\ldots,X_n$. Because the spectral density $f_X(\lambda)$ at $\lambda$ can be interpreted as the variance of the component of frequency $\lambda$ in the time series $X_t$, the function $I_n$ appears to be a reasonable estimator of the function $f_X$. This is true to a certain extent, but not quite true. While we shall show that the expected value of $I_n(\lambda)$ converges to $f_X(\lambda)$, we shall also show that there are much better estimators than the periodogram. Because these will be derived from the periodogram, it is still of interest to study its properties.
By evaluating the square in its definition and rearranging the resulting double sum, the periodogram can be rewritten in the form (if $x_1,\ldots,x_n$ are real)
$$(12.2)\qquad I_n(\lambda) = \frac1{2\pi}\sum_{|h|<n}\Bigl(\frac1n\sum_{t=1}^{n-|h|}X_{t+|h|}X_t\Bigr)e^{-ih\lambda}.$$
For natural frequencies $\lambda_j\neq 0$ we have that $d_{(\mu,\ldots,\mu)}(\lambda_j) = 0$ for every $\mu$, in particular for $\mu = \overline X_n$. This implies that $I_{n,X}(\lambda_j) = I_{n,X-\overline X_n}(\lambda_j)$ and hence
$$(12.3)\qquad I_n(\lambda_j) = \frac1{2\pi}\sum_{|h|<n}\hat\gamma_n(h)e^{-ih\lambda_j},\qquad \lambda_j\in\frac{2\pi}n\mathbb{Z}\setminus\{0\}.$$
This is exactly the estimator $\hat f_{n,n}(\lambda)$ given in (12.1), with $r = n$. As noted before, we should expect this estimator to be unreliable as an estimator of $f_X(\lambda)$, because of the imprecision of the estimators $\hat\gamma_n(h)$ for lags $|h|$ close to $n$.
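Formula (12.3) gives a convenient check on a periodogram implementation. The sketch below is illustrative only (numpy assumed; the helper names are hypothetical): the periodogram is computed from the FFT and compared, at a nonzero natural frequency, with the auto-covariance form on the right of (12.3).

```python
import numpy as np

def periodogram(x):
    # I_n(lambda_j) = |sum_t X_t exp(-i lambda_j t)|^2 / (2 pi n) at lambda_j = 2 pi j / n
    n = len(x)
    lam = 2 * np.pi * np.arange(n) / n
    return lam, np.abs(np.fft.fft(x))**2 / (2 * np.pi * n)

def periodogram_from_acov(x, lam):
    # right side of (12.3): (2 pi)^{-1} sum_{|h| < n} gamma_n(h) exp(-i h lam)
    n = len(x)
    xc = x - x.mean()
    g = np.array([np.dot(xc[:n - h], xc[h:]) / n for h in range(n)])
    h = np.arange(1, n)
    return (g[0] + 2 * np.sum(g[1:] * np.cos(h * lam))) / (2 * np.pi)

rng = np.random.default_rng(2)
x = rng.normal(size=128)
lam, I = periodogram(x)
print(np.isclose(I[5], periodogram_from_acov(x, lam[5])))   # lambda_5 = 2 pi 5/n is a nonzero natural frequency
```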
Under the assumption that the time series $X_t$ is stationary, we can compute the mean of the periodogram, for $\lambda\neq 0$, as
$$\begin{aligned}
\mathrm{E}I_n(\lambda) &= \frac1{2\pi}\mathrm{E}\bigl|d_X(\lambda)\bigr|^2 = \frac1{2\pi}\mathrm{var}\,d_X(\lambda) + \frac1{2\pi}\bigl|\mathrm{E}d_X(\lambda)\bigr|^2\\
&= \frac1{2\pi n}\sum_{s=1}^n\sum_{t=1}^n\mathrm{cov}(X_s,X_t)e^{-i\lambda(s-t)} + \frac1{2\pi n}\Bigl|\sum_{t=1}^n\mathrm{E}X_te^{-i\lambda t}\Bigr|^2\\
&= \frac1{2\pi}\sum_{|h|<n}\Bigl(1-\frac{|h|}n\Bigr)\gamma_X(h)e^{-ih\lambda} + \frac{\mu^2}{2\pi n}\Bigl|\frac{1-e^{-i\lambda n}}{1-e^{-i\lambda}}\Bigr|^2.
\end{aligned}$$
The second term on the far right is of the order $O(1/n)$ for every $\lambda\neq 0$ (and $\lambda\in(-\pi,\pi)$) and even vanishes for every natural frequency $\lambda_j\neq 0$. Under the condition that $\sum_h\bigl|\gamma_X(h)\bigr| < \infty$, the first term converges to $f_X(\lambda)$ as $n\to\infty$, by the dominated convergence theorem. We conclude that the periodogram is asymptotically unbiased for estimating the spectral density, in that $\mathrm{E}I_n(\lambda)\to f_X(\lambda)$. This is a good property.
However, the periodogram is not a consistent estimator for $f_X(\lambda)$: the following theorem shows that $I_n(\lambda)$ is asymptotically exponentially distributed with mean $f_X(\lambda)$, whence we do not have that $I_n(\lambda)\stackrel{P}{\to}f_X(\lambda)$. Using the periodogram as an estimator of $f_X(\lambda)$ is asymptotically equivalent to estimating $f_X(\lambda)$ using a single observation from an exponential distribution with mean $f_X(\lambda)$. This is disappointing, because we should hope that after observing the time series $X_t$ long enough, we are able to estimate its spectral density with arbitrary precision. The periodogram does not fulfill this hope, as it keeps fluctuating around the target value $f_X(\lambda)$. Apparently, it does not effectively use the information available in the observations $X_1,\ldots,X_n$.
12.4 Theorem. Let $X_t = \sum_j\psi_jZ_{t-j}$ for an i.i.d. sequence $Z_t$ with mean zero and finite second moment and coefficients $\psi_j$ with $\sum_j|\psi_j| < \infty$. Then for any values $0 < \lambda_1 < \cdots < \lambda_k < \pi$ the variables $I_n(\lambda_1),\ldots,I_n(\lambda_k)$ are asymptotically distributed as independent exponential variables with means $f_X(\lambda_1),\ldots,f_X(\lambda_k)$, respectively.
Proof. First consider the case that $X_t = Z_t$ for every $t$. Then the spectral density $f_X(\lambda)$ is the function $f_Z(\lambda) = \sigma^2/(2\pi)$, for $\sigma^2$ the variance of the white noise sequence. We can write
$$d_Z(\lambda) = \frac1{\sqrt{n}}\sum_{t=1}^nZ_t\cos(\lambda t) - i\frac1{\sqrt{n}}\sum_{t=1}^nZ_t\sin(\lambda t) =: A_n(\lambda) - iB_n(\lambda).$$
By straightforward calculus we find that, for any $\lambda,\mu\in(0,\pi)$,
$$\begin{aligned}
\mathrm{cov}\bigl(A_n(\lambda),A_n(\mu)\bigr) &= \frac{\sigma^2}n\sum_{t=1}^n\cos(\lambda t)\cos(\mu t)\to\begin{cases}\sigma^2/2 & \text{if }\lambda=\mu,\\ 0 & \text{if }\lambda\neq\mu,\end{cases}\\
\mathrm{cov}\bigl(B_n(\lambda),B_n(\mu)\bigr) &= \frac{\sigma^2}n\sum_{t=1}^n\sin(\lambda t)\sin(\mu t)\to\begin{cases}\sigma^2/2 & \text{if }\lambda=\mu,\\ 0 & \text{if }\lambda\neq\mu,\end{cases}\\
\mathrm{cov}\bigl(A_n(\lambda),B_n(\mu)\bigr) &= \frac{\sigma^2}n\sum_{t=1}^n\cos(\lambda t)\sin(\mu t)\to 0.
\end{aligned}$$
By the Lindeberg central limit theorem, Theorem 3.16, we now find that the sequence of vectors $\bigl(A_n(\lambda),B_n(\lambda),A_n(\mu),B_n(\mu)\bigr)$ converges in distribution to a vector $(G_1,G_2,G_3,G_4)$ with the $N_4\bigl(0,(\sigma^2/2)I\bigr)$ distribution. Consequently, by the continuous mapping theorem,
$$\bigl(I_n(\lambda),I_n(\mu)\bigr) = \frac1{2\pi}\bigl(A_n^2(\lambda)+B_n^2(\lambda),\ A_n^2(\mu)+B_n^2(\mu)\bigr) \rightsquigarrow \frac1{2\pi}\bigl(G_1^2+G_2^2,\ G_3^2+G_4^2\bigr).$$
The vector on the right is distributed as $\sigma^2/(4\pi)$ times a vector of two independent $\chi^2_2$ variables. Because the chi-square distribution with two degrees of freedom is identical to the exponential distribution with parameter 1/2, this is the same as a vector of two independent exponential variables with means $\sigma^2/(2\pi)$.
This concludes the proof in the special case that $X_t = Z_t$ and for two frequencies $\lambda$ and $\mu$. The case of $k$ different frequencies $\lambda_1,\ldots,\lambda_k$ can be treated in exactly the same way, but is notationally more involved.
The spectral density of a general time series of the form $X_t = \sum_j\psi_jZ_{t-j}$ satisfies $f_X(\lambda) = \bigl|\psi(\lambda)\bigr|^2f_Z(\lambda)$, for $\psi(\lambda) = \sum_j\psi_je^{-ij\lambda}$ the transfer function of the linear filter. We shall prove the theorem by showing that the periodograms $I_{n,X}$ and $I_{n,Z}$ approximately satisfy a similar relation. Indeed, rearranging sums we find
$$d_X(\lambda) = \frac1{\sqrt{n}}\sum_{t=1}^n\Bigl(\sum_j\psi_jZ_{t-j}\Bigr)e^{-i\lambda t} = \sum_j\psi_je^{-ij\lambda}\Bigl(\frac1{\sqrt{n}}\sum_{s=1-j}^{n-j}Z_se^{-i\lambda s}\Bigr).$$
If we replace the sum $\sum_{s=1-j}^{n-j}$ on the right side by the sum $\sum_{s=1}^n$, then the right side of the display becomes $\psi(\lambda)d_Z(\lambda)$. These two sums differ by $2(|j|\wedge n)$ terms, every one of the terms $Z_se^{-i\lambda s}$ having mean zero and variance bounded by $\sigma^2$, and the terms being independent. Thus
$$\mathrm{E}\Bigl|\frac1{\sqrt{n}}\sum_{s=1-j}^{n-j}Z_se^{-i\lambda s} - d_Z(\lambda)\Bigr|^2 \le 2\frac{|j|\wedge n}n\,\sigma^2.$$
In view of the inequality $\mathrm{E}|X|\le\bigl(\mathrm{E}X^2\bigr)^{1/2}$, we can drop the square on the left side if we take a root on the right side. Next combining the two preceding displays and applying the triangle inequality, we find
$$\mathrm{E}\bigl|d_X(\lambda) - \psi(\lambda)d_Z(\lambda)\bigr| \le \sum_j|\psi_j|\Bigl(2\frac{|j|\wedge n}n\Bigr)^{1/2}\sigma.$$
The $j$th term of the series is bounded by $|\psi_j|\bigl(2|j|/n\bigr)^{1/2}\sigma$ and hence converges to zero as $n\to\infty$, for every fixed $j$; it is also dominated by $|\psi_j|\sqrt2\,\sigma$. Therefore, the right side of the preceding display converges to zero as $n\to\infty$.
By Markov's and Slutsky's lemmas it follows that $d_X(\lambda)$ has the same limit distribution as $\psi(\lambda)d_Z(\lambda)$. Hence, by the continuous mapping theorem, the sequence $I_{n,X}(\lambda)$ has the same limit distribution as $\bigl|\psi(\lambda)\bigr|^2I_{n,Z}(\lambda)$. This is true for every fixed $\lambda$, but also for finite sets of $\lambda$ jointly. The proof is finished, because the variables $\bigl|\psi(\lambda)\bigr|^2I_{n,Z}(\lambda)$ are asymptotically distributed as independent exponential variables with means $\bigl|\psi(\lambda)\bigr|^2f_Z(\lambda)$, by the first part of the proof.
A remarkable aspect of the preceding theorem is that the periodogram values $I_n(\lambda)$ at different frequencies are asymptotically independent. This is well visible already for finite values of $n$ in plots of the periodogram, which typically have a wild and peaky appearance. The theorem says that for large $n$ such a plot should be similar to a plot of independent exponentially distributed variables $E_\lambda$ with means $f_X(\lambda)$ (on the $y$-axis) versus $\lambda$ (on the $x$-axis).
[Figure 12.1. Periodogram of a realization of the moving average $X_t = 0.5Z_t + 0.2Z_{t-1} + 0.5Z_{t-2}$ for a Gaussian white noise series. (Vertical scale in decibels, i.e. $\log_{10}$.)]
The following theorem shows that we even have independence of the periodogram
values at natural frequencies that converge to the same value.
12.5 Theorem. Let X_t = Σ_j ψ_j Z_{t−j} for an i.i.d. sequence Z_t with finite second moments and coefficients ψ_j with Σ_j |ψ_j| < ∞. Let λ_n = (2π/n) j_n for j_n ∈ ℤ be a sequence of natural frequencies such that λ_n → λ ∈ (0, π). Then for any k ∈ ℤ the variables I_n(λ_n − k2π/n), I_n(λ_n − (k−1)2π/n), ..., I_n(λ_n + k2π/n) are asymptotically distributed as independent exponential variables with mean f_X(λ).
Proof. The second part of the proof of Theorem 12.4 is valid uniformly in λ and hence applies to sequences of frequencies λ_n. For instance, by the continuity of ψ the argument there shows that E|d_X(λ_n) − ψ(λ_n)d_Z(λ_n)| → 0 for any sequence λ_n → λ. It therefore suffices to extend the first part of the proof, which concerns the special case that X_t = Z_t.
Here we apply the same method as in the proof of Theorem 12.4. The limits of the covariances are as before, where in the present case we use the fact that we are considering natural frequencies only. For instance,
$$
\mathrm{cov}\Bigl(A_n\bigl(k\tfrac{2\pi}{n}\bigr),\, B_n\bigl(l\tfrac{2\pi}{n}\bigr)\Bigr)
= \frac{\sigma^2}{n}\sum_{t=1}^{n}\cos\bigl(kt\tfrac{2\pi}{n}\bigr)\sin\bigl(lt\tfrac{2\pi}{n}\bigr) = 0,
$$
for all integers k, l such that (k + l)/n and (k − l)/n are not contained in ℤ. An application of the Lindeberg central limit theorem concludes the proof.
The sequences of frequencies λ_n + j(2π/n) considered in the preceding theorem all converge to the same value λ. That Theorem 12.4 remains valid (it does) if we replace the fixed frequencies λ_j in that theorem by sequences λ_{j,n} → λ_j is not very surprising. More surprising is the asymptotic independence of the periodogram values I_n(λ_{n,j}) even if every sequence λ_{n,j} converges to the same frequency λ. As the proof of the preceding theorem shows, this depends crucially on using natural frequencies λ_{n,j}.
The remarkable independence of the periodogram at frequencies that are very close together is a further explanation of the peaky appearance of the periodogram I_n(λ) as a function of λ. It is clear that this function is not a good estimator of the spectral density. However, the independence suggests ways of improving our estimator for f_X(λ). The values I_n(λ_n − k2π/n), I_n(λ_n − (k−1)2π/n), ..., I_n(λ_n + k2π/n) can be viewed as a sample of independent estimators of f_X(λ), for any k. Rather than one exponentially distributed variable, we therefore have many exponentially distributed variables, all with the same (asymptotic) mean. We exploit this in the next section.
In practice the periodogram may have one or a few extremely high peaks that completely dominate its graph. This indicates an important cyclic component in the time series at those frequencies. Cyclic components of smaller amplitude at other frequencies may then be hidden. It is practical wisdom that in such a case a fruitful spectral analysis at the other frequencies requires that the peak frequencies are first removed from the signal (by a filter with the appropriate transfer function). We next estimate the spectrum of the new time series and, if desired, transform this back to obtain the spectrum of the original series, using the formula given in Theorem 6.10. Because a spectrum without high peaks is similar to the uniform spectrum of a white noise series, this procedure is known as prewhitening of the data.
12.3 Estimating a Spectral Density
Given λ ∈ (0, π) and n, let λ_n be the natural frequency closest to λ. Then λ_n → λ as n → ∞ and Theorem 12.5 shows that for any k ∈ ℤ the variables I_n(λ_n + j2π/n) for j = −k, ..., k are asymptotically distributed as independent exponential variables with mean f_X(λ). This suggests estimating f_X(λ) by the average
$$
(12.4)\qquad \hat f_k(\lambda) = \frac{1}{2k+1}\sum_{|j|\le k} I_n\Bigl(\lambda_n + \frac{2\pi}{n}\,j\Bigr).
$$
As a consequence of Theorem 12.5, the variables (2k + 1)f̂_k(λ) are asymptotically (as n → ∞, for fixed k) distributed according to the Gamma distribution with shape parameter 2k + 1 and mean (2k + 1)f_X(λ). This suggests a confidence interval for f_X(λ) of the form, with χ²_{k,α} the α-quantile of the chisquare distribution with k degrees of freedom,
$$
\Bigl(\frac{(4k+2)\,\hat f_k(\lambda)}{\chi^2_{4k+2,\,1-\alpha}},\;\; \frac{(4k+2)\,\hat f_k(\lambda)}{\chi^2_{4k+2,\,\alpha}}\Bigr).
$$
12.6 EXERCISE. Show that, for every fixed k, this interval is asymptotically of confidence level 1 − 2α.
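As an illustration, here is a small Python sketch (our own code; it assumes scipy is available) of the estimator (12.4) together with the chisquare confidence interval just described.

```python
import numpy as np
from scipy.stats import chi2

def f_hat(x, lam, k, alpha=0.05):
    """Average (12.4) of 2k+1 periodogram values around the natural frequency
    closest to lam, and a (1 - 2*alpha) confidence interval for f_X(lam)."""
    n = len(x)
    I = np.abs(np.fft.fft(x)) ** 2 / (2 * np.pi * n)
    j0 = int(round(lam * n / (2 * np.pi)))        # index of the nearest natural frequency
    idx = (j0 + np.arange(-k, k + 1)) % n
    est = I[idx].mean()
    dof = 4 * k + 2                               # (4k+2) f_hat / f_X is approximately chi^2_{4k+2}
    lower = dof * est / chi2.ppf(1 - alpha, dof)
    upper = dof * est / chi2.ppf(alpha, dof)
    return est, (lower, upper)

# example: estimate f_X at lambda = 1.0 from a series x, averaging 21 periodogram values
# est, ci = f_hat(x, 1.0, k=10)
```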
Instead of a simple average we may prefer a weighted average. For given weights W_j such that Σ_j W_j = 1, we use
$$
(12.5)\qquad \hat f_k(\lambda) = \sum_j W_j\, I_n\Bigl(\lambda_n + \frac{2\pi}{n}\,j\Bigr).
$$
This allows us to give greater weight to frequencies λ_n + (2π/n)j that are closer to λ. A disadvantage is that the asymptotic distribution is relatively complicated: it is a weighted sum of independent exponential variables. Because tabulating these types of distributions is complicated, one often approximates them by a scaled chisquare distribution, where the scaling and the degrees of freedom are chosen to match the first two moments: the estimator c^{−1} f̂_k(λ) is approximately χ²_ν distributed, for c and ν solving the equations
$$
\text{asymptotic mean of } \hat f_k(\lambda) = f_X(\lambda) = c\,\nu,\qquad
\text{asymptotic variance of } \hat f_k(\lambda) = \sum_j W_j^2\, f_X^2(\lambda) = c^2\, 2\nu.
$$
This yields c proportional to f_X(λ) and ν = 2/Σ_j W_j² independent of f_X(λ), and thus confidence intervals based on this approximation can be derived as before. Rather than using the approximation, we could of course determine the desired quantiles by computer simulation.
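The moment matching takes one line in practice. The sketch below (our own code; the triangular weights in the comment are only an example choice) computes the equivalent degrees of freedom ν and the resulting confidence interval.

```python
import numpy as np
from scipy.stats import chi2

def scaled_chi2_ci(f_hat_value, weights, alpha=0.05):
    """Approximate f_hat ~ c * chi^2_nu by matching mean and variance:
    c*nu = f_X and 2*nu*c^2 = sum(W_j^2) * f_X^2, hence nu = 2 / sum(W_j^2)."""
    nu = 2.0 / np.sum(np.asarray(weights, dtype=float) ** 2)   # does not depend on f_X
    return (nu * f_hat_value / chi2.ppf(1 - alpha, nu),
            nu * f_hat_value / chi2.ppf(alpha, nu))

# e.g. triangular weights over 2k+1 frequencies:
# k = 10; w = k + 1 - np.abs(np.arange(-k, k + 1)); w = w / w.sum()
# ci = scaled_chi2_ci(f_hat_value, w)
```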
Because the periodogram is continuous as a function of λ, a discrete average (over natural frequencies) can be closely approximated by a continuous average of the form
$$
(12.6)\qquad \hat f_W(\lambda) = \int W(\omega)\, I_{n,\,X-\bar X_n}(\lambda - \omega)\, d\omega.
$$
[Figure 12.2. Smoothed periodogram of a realization of the moving average X_t = 0.5 Z_t + 0.2 Z_{t−1} + 0.5 Z_{t−2} for a Gaussian white noise series. Horizontal axis: frequency; vertical axis: spectrum in decibels, i.e. 10·log₁₀.]
Here the weight function W is to satisfy ∫ W(ω) dω = 1 and would typically concentrate its mass near zero, so that the average is computed over I_{n, X−X̄_n}(ω) for ω ≈ λ. We use the periodogram of the centered series X − X̄_n, because the average involves nonnatural frequencies. In view of (12.3) this estimator can be written in the form
$$
\hat f_W(\lambda) = \int W(\omega)\,\frac{1}{2\pi}\sum_{|h|<n}\hat\gamma_n(h)\,e^{-i(\lambda-\omega)h}\, d\omega
= \frac{1}{2\pi}\sum_{|h|<n} w(h)\,\hat\gamma_n(h)\,e^{-i\lambda h},
$$
where w(h) = ∫ e^{iωh} W(ω) dω are the Fourier coefficients of the weight function. Thus we have arrived at a generalization of the estimator (12.1). If we choose w(h) = 1 for |h| < r and w(h) = 0 otherwise, then the preceding display exactly gives (12.1). The more general form can be motivated by the same reasoning: the role of the coefficients w(h) is to diminish the influence of the relatively unreliable estimators γ̂_n(h) (for h close to n) when plugging in these sample estimators for the true auto-covariances in the expression for the spectral density. Thus, the weights w(h) are typically chosen to decrease in absolute value from |w(0)| = 1 to |w(n)| = 0 as h increases from 0 to n.
The function W is known as the spectral window; its Fourier coefficients w(h) are known as the lag window, tapering function or convergence factors. The last name comes from Fourier analysis, where convergence factors were introduced to improve the approximation properties of a Fourier series: it was noted that for suitably chosen weights w(h) the partial sums Σ_{|h|<n} w(h)γ_X(h)e^{−iλh} can be much closer to the full series Σ_h γ_X(h)e^{−iλh} than the same partial sums with w ≡ 1. In our statistical context this is even more the case, because we introduce additional approximation error by replacing the coefficients γ_X(h) by the estimators γ̂_n(h).
12.7 Example. The tapering function
$$
w(h) = \begin{cases} 1 & \text{if } |h| \le r,\\ 0 & \text{if } |h| > r,\end{cases}
$$
corresponds to the Dirichlet kernel
$$
W(\omega) = \frac{1}{2\pi}\sum_{|h|\le r} e^{i\omega h}
= \frac{1}{2\pi}\,\frac{\sin\bigl((r+\tfrac12)\omega\bigr)}{\sin\tfrac12\omega}.
$$
Therefore, the estimator (12.1) should be compared to the estimators (12.5) and (12.6) with weights W_j chosen according to the Dirichlet kernel.
12.8 Example. The uniform kernel
$$
W(\omega) = \begin{cases} r/(2\pi) & \text{if } |\omega| \le \pi/r,\\ 0 & \text{if } |\omega| > \pi/r,\end{cases}
$$
corresponds to the weight function w(h) = r sin(πh/r)/(πh). These choices of spectral and lag windows correspond to the estimator (12.4).
All estimators for the spectral density considered so far can be viewed as smoothed periodograms: the value f̂(λ) of the estimator at λ is an average or weighted average of values I_n(ω) for ω in a neighbourhood of λ. Thus irregularities in the periodogram are smoothed out. The amount of smoothing is crucial for the accuracy of the estimators. This amount, called the bandwidth, is determined by the parameter k in (12.4), the weights W_j in (12.5), the kernel W in (12.6), and, more hidden, by the parameter r in (12.1). For instance, a large value of k in (12.4) or a kernel W with a large variance in (12.6) results in a large amount of smoothing (large bandwidth). Oversmoothing, choosing a bandwidth that is too large, results in spectral estimators that are too flat and therefore inaccurate, whereas undersmoothing, choosing too small a bandwidth, yields spectral estimators that share the bad properties of the periodogram. In practice an optimal bandwidth is often determined by plotting the spectral estimators for a number of different bandwidths and then choosing the one that looks reasonable. An alternative is to use one of several methods of data-driven bandwidth choice, such as cross-validation or penalization. We omit a discussion.
Theoretical analysis of the choice of the bandwidth is almost exclusively asymptotic in nature. Given a number of observations tending to infinity, the optimal bandwidth decreases to zero. A main concern of an asymptotic analysis is to determine the rate at which the bandwidth should decrease as the number of observations tends to infinity. The key concept is the bias-variance trade-off. Because the periodogram is more or less unbiased, little smoothing gives an estimator with small bias. However, as we have seen, such an estimator has a large variance. Oversmoothing has the opposite effects. Because accurate estimation requires that both bias and variance are small, we need an intermediate value of the bandwidth.
We shall quantify this bias-variance trade-off for estimators of the type (12.1), where we consider r as the bandwidth parameter. As our objective we take minimization of the mean integrated square error
$$
(12.7)\qquad 2\pi\,\mathrm{E}\int \bigl|\hat f_{n,r}(\lambda) - f_X(\lambda)\bigr|^2\, d\lambda.
$$
The integrated square error is a global measure of the discrepancy between f̂_{n,r} and f_X. Because we are interested in f_X as a function, it is more relevant than the distance |f̂_{n,r}(λ) − f_X(λ)| at any fixed λ.
We shall use Parseval's identity, which says that the space L₂(−π, π] is isometric to the space ℓ₂.
12.9 Lemma (Parseval's identity). Let f: (−π, π] → ℂ be a measurable function such that ∫ |f|²(λ) dλ < ∞. Then its Fourier coefficients f_j = ∫_{−π}^{π} e^{ijλ} f(λ) dλ satisfy
$$
\int_{-\pi}^{\pi}\bigl|f(\lambda)\bigr|^2\, d\lambda = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}|f_j|^2.
$$
12.10 EXERCISE. Prove this identity. Also show that for a pair of square-integrable, measurable functions f, g: (−π, π] → ℂ we have ∫ f(λ)\overline{g(λ)} dλ = (1/2π) Σ_j f_j \overline{g_j}.
The function f̂_{n,r} − f_X possesses the Fourier coefficients γ̂_n(h) − γ_X(h) for |h| < r and −γ_X(h) for |h| ≥ r. Thus, Parseval's identity yields that the mean integrated square error (12.7) is equal to
$$
\mathrm{E}\sum_{|h|<r}\bigl|\hat\gamma_n(h)-\gamma_X(h)\bigr|^2 + \sum_{|h|\ge r}\bigl|\gamma_X(h)\bigr|^2.
$$
In a rough sense the two terms in this formula are the variance and the square bias term. A large value of r clearly decreases the second, square bias term, but increases the first, variance term. This variance term can itself be split into a bias and a variance term, and we can reexpress the mean integrated square error as
$$
\sum_{|h|<r}\operatorname{var}\hat\gamma_n(h)
+ \sum_{|h|<r}\bigl|\mathrm{E}\hat\gamma_n(h)-\gamma_X(h)\bigr|^2
+ \sum_{|h|\ge r}\bigl|\gamma_X(h)\bigr|^2.
$$
Assume for simplicity that EX_t = 0 and that X_t = Σ_j ψ_j Z_{t−j} for a summable sequence (ψ_j) and an i.i.d. sequence Z_t with finite fourth moments. Furthermore, assume that we use the estimator γ̂_n(h) = n^{−1} Σ_{t=1}^{n−h} X_{t+h}X_t rather than the true sample auto-covariance function. (The results for the general case are similar, but the calculations are more involved. Note that the difference between the present estimator and the usual one is approximately X̄_n², and Σ_{|h|<r} E X̄_n⁴ = O(r/n²) under appropriate moment conditions. This is negligible in the following.) Then the calculations in Chapter 5 show that the preceding display is equal to
$$
\sum_{|h|<r}\frac{1}{n^2}\sum_{|g|<n-|h|}\bigl(n-|h|-|g|\bigr)
\Bigl(\kappa_4\sigma^4\sum_i\psi_i\psi_{i+h}\psi_{i+g}\psi_{i+g+h}
+ \gamma_X^2(g) + \gamma_X(g+h)\gamma_X(g-h)\Bigr)
+ \sum_{|h|<r}\Bigl(\frac{|h|}{n}\gamma_X(h)\Bigr)^2
+ \sum_{|h|\ge r}\gamma_X^2(h)
$$
$$
\le \frac{|\kappa_4|\sigma^4}{n}\sum_h\sum_g\sum_i\bigl|\psi_i\psi_{i+h}\psi_{i+g}\psi_{i+g+h}\bigr|
+ \frac{2r}{n}\sum_g\gamma_X^2(g)
+ \frac{1}{n}\sum_h\sum_g\bigl|\gamma_X(g+h)\gamma_X(g-h)\bigr|
+ \frac{r^2}{n^2}\sum_h\gamma_X^2(h)
+ \sum_{|h|\ge r}\gamma_X^2(h).
$$
To ensure that the second and last terms on the right converge to zero as n → ∞ we must choose r = r_n such that r_n/n → 0 and r_n → ∞, respectively. The first and third terms are of the order O(1/n), and the fourth term is of the order O(r_n²/n²). Under the requirements r_n → ∞ and r_n/n → 0 these terms are dominated by the other terms, and the whole expression is of the order
$$
\frac{r_n}{n} + \sum_{|h|\ge r_n}\gamma_X^2(h).
$$
A first conclusion is that the sequence of estimators f̂_{n,r_n} is asymptotically consistent for estimating f_X relative to the L₂-distance whenever r_n → ∞ and r_n/n → 0. A wide range of sequences r_n satisfies these constraints. For an optimal choice we must make assumptions regarding the rate at which the bias term Σ_{|h|≥r} γ²_X(h) converges to zero as r → ∞. For any constant m we have that
$$
\frac{r_n}{n} + \sum_{|h|\ge r_n}\gamma_X(h)^2
\le \frac{r_n}{n} + \frac{1}{r_n^{2m}}\sum_h\gamma_X^2(h)\,h^{2m}.
$$
Suppose that the series on the far right converges; this means roughly that the auto-covariances γ_X(h) decrease faster than |h|^{−m−1/2} as |h| → ∞. Then we can make a bias-variance trade-off by balancing the terms r_n/n and 1/r_n^{2m}. These terms are of equal order for r_n ≍ n^{1/(2m+1)}; for this choice of r_n we find that
$$
\mathrm{E}\int\bigl|\hat f_{n,r_n}(\lambda) - f_X(\lambda)\bigr|^2\, d\lambda
= O\Bigl(\Bigl(\frac{1}{n}\Bigr)^{2m/(2m+1)}\Bigr).
$$
Large values of m yield the fastest rates of convergence. The rate n^{−m/(2m+1)} is always slower than n^{−1/2}, the rate obtained when using parametric spectral estimators, but approaches this rate as m → ∞.
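Spelled out, the balancing step behind this choice of r_n is the following elementary computation:
$$
\frac{r_n}{n}\asymp\frac{1}{r_n^{2m}}
\;\Longleftrightarrow\; r_n^{2m+1}\asymp n
\;\Longleftrightarrow\; r_n\asymp n^{1/(2m+1)},
\qquad\text{and then}\qquad
\frac{r_n}{n}\asymp n^{\frac{1}{2m+1}-1} = n^{-\frac{2m}{2m+1}}.
$$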
Unfortunately, in practice we do not know γ_X(h) and therefore cannot check whether the preceding derivation is valid. So-called cross-validation techniques may be used to determine a suitable constant m from the data.
The condition that Σ_h γ²_X(h) h^{2m} < ∞ can be interpreted in terms of the smoothness of the spectral density. It can be shown (see Exercise 12.11) that the Fourier coefficients of the mth derivative f_X^{(m)} of f_X are equal to γ_X(h)(−ih)^m. Consequently, by Parseval's identity
$$
\sum_h\gamma_X^2(h)\,h^{2m} = 2\pi\int\bigl|f_X^{(m)}(\lambda)\bigr|^2\, d\lambda.
$$
Thus the left side is finite if and only if the mth derivative of f_X exists and is square-integrable. Hence for time series with such an m-smooth spectral density, one can estimate the spectral density with an integrated square error of order O(n^{−2m/(2m+1)}). This rate is uniform over the set of all time series with spectral densities such that ∫ |f_X^{(m)}(λ)|² dλ is uniformly bounded, i.e. over balls in the Sobolev space of order m.
12.11 EXERCISE. Show that the Fourier coefficients ∫_{−π}^{π} e^{ijλ} f′(λ) dλ of the derivative of a continuously differentiable function f: [−π, π] → ℂ with f(−π) = f(π) are equal to −ij f_j, for f_j the Fourier coefficients of f. [Hint: apply partial integration.]
This conclusion is similar to the conclusion in the problem of estimating a density given a random sample from this density, where m-smooth densities can also be estimated with an integrated square error of order O(n^{−2m/(2m+1)}). The smoothing methods discussed previously (the estimator (12.6) in particular) are also related to the method of kernel smoothing for density estimation. It is interesting that historically the method of smoothing was first applied to the problem of estimating a spectral density. Here kernel smoothing of the periodogram was a natural extension of taking simple averages as in (12.4), which itself is motivated by the independence property of the periodogram. The method of kernel smoothing for the problem of density estimation based on a random sample from this density was invented later, even though this problem by itself appears to be simpler.
* 12.4 Estimating a Spectral Distribution
In the preceding section it was seen that nonparametric estimation of a spectral density requires smoothing and yields rates of convergence n^{−α} for values of α < 1/2. In contrast, a spectral distribution function can be estimated at the usual rate of convergence n^{−1/2}, and natural estimators are asymptotically normally distributed. We assume that X_t is a stationary time series with spectral density f_X.
The spectral distribution function F_X(λ₀) = ∫_{−π}^{λ₀} f_X(λ) dλ can be written in the form
$$
\int_{-\pi}^{\pi} a(\lambda)\, f_X(\lambda)\, d\lambda
$$
for a the indicator function of the interval (−π, λ₀]. We shall consider estimation of a general functional of this type by the estimator
$$
I_n(a) := \int_{-\pi}^{\pi} a(\lambda)\, I_n(\lambda)\, d\lambda.
$$
12.12 Theorem. Suppose that X_t = Σ_j ψ_j Z_{t−j} for an i.i.d. sequence Z_t with finite fourth cumulant κ₄ and constants ψ_j such that Σ_j |ψ_j| < ∞. Moreover, assume that Σ_h |h| γ²_X(h) < ∞. Then, for any symmetric function a such that ∫_{−π}^{π} a²(λ) dλ < ∞,
$$
\sqrt n\Bigl(I_n(a) - \int a f_X\, d\lambda\Bigr)
\;\rightsquigarrow\;
N\Bigl(0,\;\kappa_4\Bigl(\int a f_X\, d\lambda\Bigr)^2 + 4\pi\int a^2 f_X^2\, d\lambda\Bigr).
$$
Proof. We can expand a(λ) in its Fourier series a(λ) = Σ_j a_j e^{ijλ} (say). By Parseval's identity
$$
\int a f_X\, d\lambda = \sum_h \gamma_X(h)\, a_h.
$$
Similarly, by (12.2) and Parseval's identity,
$$
I_n(a) = \int a\, I_n\, d\lambda = \sum_{|h|<n}\hat\gamma_n(h)\, a_h.
$$
First suppose that a_h = 0 for |h| > m and some m. Then ∫ a I_n dλ is a linear combination of (γ̂_n(0), ..., γ̂_n(m)). By Theorem 5.8, as n → ∞,
$$
\sqrt n\Bigl(\sum_h\hat\gamma_n(h)a_h - \sum_h\gamma_X(h)a_h\Bigr)\;\rightsquigarrow\;\sum_h a_h Z_h,
$$
where (Z_{−m}, ..., Z_{−1}, Z_0, Z_1, ..., Z_m) is a mean zero normally distributed random vector such that (Z_0, ..., Z_m) has covariance matrix V as in Theorem 5.8 and Z_{−h} = Z_h for every h. Thus Σ_h a_h Z_h is normally distributed with mean zero and variance
$$
(12.8)\qquad \sum_g\sum_h V_{g,h}\,a_g a_h
= \kappa_4\Bigl(\sum_g a_g\gamma_X(g)\Bigr)^2
+ \sum_g\sum_h\sum_k\bigl(\gamma_X(k+h)\gamma_X(k+g) + \gamma_X(k+h)\gamma_X(k-g)\bigr)a_g a_h
= \kappa_4\Bigl(\int a f_X\, d\lambda\Bigr)^2 + 4\pi\int a^2 f_X^2\, d\lambda.
$$
The last equality follows after a short calculation, using that a_{−h} = a_h. (Note that we have used the expression for V_{g,h} given in Theorem 5.8 also for negative g or h, which is correct, because both cov(Z_g, Z_h) and the expression in Theorem 5.8 remain the same if g or h is replaced by −g or −h.)
This concludes the proof in the case that a_h = 0 for |h| > m, for some m. The general case is treated with the help of Lemma 3.10. Set a_m = Σ_{|j|≤m} a_j e^{ijλ} and apply the preceding argument to X_{n,m} := √n ∫ a_m (I_n − f_X) dλ to see that X_{n,m} ⇝ N(0, σ²_m) as n → ∞, for every fixed m. The asymptotic variance σ²_m is the expression given in the theorem with a_m instead of a. If m → ∞, then σ²_m converges to the expression in the theorem, by the dominated convergence theorem, because a is square-integrable and f_X is uniformly bounded. Therefore, by Lemma 3.10 it suffices to show that for every m_n → ∞
$$
(12.9)\qquad \sqrt n\int(a - a_{m_n})(I_n - f_X)\, d\lambda\;\xrightarrow{\;P\;}\;0.
$$
Set b = a − a_{m_n}. The variance of the random variable in (12.9) is the same as the variance of √n ∫(a − a_{m_n}) I_n dλ, and, in view of Parseval's identity, can be computed as
$$
\operatorname{var}\Bigl(\sqrt n\sum_{|h|<n}\hat\gamma_n(h)\,b_h\Bigr)
= n\sum_{|g|<n}\sum_{|h|<n}\operatorname{cov}\bigl(\hat\gamma_n(g),\hat\gamma_n(h)\bigr)\,b_g b_h
= n\sum_{|g|<n}\sum_{|h|<n}\frac{1}{n^2}\sum_{s=1}^{n-|g|}\sum_{t=1}^{n-|h|}
\operatorname{cov}\bigl(X_{s+g}X_s,\,X_{t+h}X_t\bigr)\,b_g b_h.
$$
Using the same approach as in Section 5.2, we can rewrite this as
$$
\frac{1}{n}\sum_{|g|<n}\sum_{|h|<n}\sum_{s=1}^{n-|g|}\sum_{t=1}^{n-|h|}
\Bigl(\kappa_4\sigma^4\sum_i\psi_i\psi_{i+g}\psi_{t-s+i}\psi_{t-s+h+i}
+ \gamma_X(t-s+h-g)\gamma_X(t-s) + \gamma_X(t-s-g)\gamma_X(t-s+h)\Bigr)\,b_g b_h.
$$
The absolute value of this expression can be bounded above by
$$
\sum_g\sum_h\sum_k\Bigl(|\kappa_4|\sigma^4\sum_i\bigl|\psi_i\psi_{i+g}\psi_{k+i}\psi_{k+h+i}\bigr|
+ \bigl|\gamma_X(k+h-g)\gamma_X(k)\bigr| + \bigl|\gamma_X(k-g)\gamma_X(k+h)\bigr|\Bigr)\,|b_g|\,|b_h|
= |\kappa_4|\Bigl(\int\bar b\,\bar f_X\, d\lambda\Bigr)^2 + 4\pi\int\bar b^2\,\tilde f_X^2\, d\lambda,
$$
by the same calculation as in (12.8), where we define
$$
\bar b(\lambda) = \sum_h|b_h|\,e^{ih\lambda},\qquad
\bar f_X(\lambda) = \frac{\sigma^2}{2\pi}\sum_h\Bigl(\sum_i|\psi_i\psi_{i+h}|\Bigr)e^{ih\lambda},\qquad
\tilde f_X(\lambda) = \frac{1}{2\pi}\sum_h\bigl|\gamma_X(h)\bigr|\,e^{ih\lambda}.
$$
Under our assumptions f̄_X and f̃_X are bounded functions. It follows that var(√n ∫ b_n I_n dλ) → 0 if ∫ b_n² dλ → 0. This is true in particular for b_n = a − a_{m_n}.
Next the mean of the left side of (12.9) can be computed as
$$
\sqrt n\Bigl(\sum_{|h|<n}\mathrm{E}\hat\gamma_n(h)\,b_h - \int b\,f_X\, d\lambda\Bigr)
= \sqrt n\Bigl(\sum_{|h|<n}\frac{n-|h|}{n}\gamma_X(h)b_h - \sum_h\gamma_X(h)b_h\Bigr)
= -\sum_h\frac{|h|\wedge n}{\sqrt n}\,\gamma_X(h)\,b_h.
$$
By the Cauchy-Schwarz inequality this is bounded in absolute value by the square root of
$$
\sum_h|b_h|^2\;\sum_h\frac{(|h|\wedge n)^2}{n}\,\gamma_X^2(h)
\;\le\;\sum_h|b_h|^2\;\sum_h|h|\,\gamma_X^2(h).
$$
Under our assumptions this converges to zero as ∫ b² dλ → 0.
The preceding theorem is restricted to symmetric functions a, but can easily be extended to general functions, because by the symmetry of the spectral density
$$
\int a(\lambda)\,f_X(\lambda)\, d\lambda = \int\frac{a(\lambda)+a(-\lambda)}{2}\,f_X(\lambda)\, d\lambda.
$$
12.13 EXERCISE. Show that for a possibly nonsymmetric function a the theorem remains valid, but with asymptotic variance
$$
\kappa_4\Bigl(\int a f_X\, d\lambda\Bigr)^2
+ 2\pi\int_{-\pi}^{\pi} a^2 f_X^2\, d\lambda
+ 2\pi\int_{-\pi}^{\pi} a(\lambda)a(-\lambda)\,f_X^2(\lambda)\, d\lambda.
$$
12.14 Example. To obtain the limit distribution of the estimator of the spectral distribution function at a point λ₀ ∈ [0, π], we apply the theorem with the symmetric function a = ½(1_{(−π,λ₀]} + 1_{[−λ₀,π)}), which satisfies ∫ a f_X dλ = F_X(λ₀) by the symmetry of f_X. The asymptotic variance is equal to
$$
\kappa_4 F_X(\lambda_0)^2
+ 4\pi\int_{-\lambda_0}^{\lambda_0} f_X^2\, d\lambda
+ 2\pi\int_{-\pi}^{-\lambda_0} f_X^2\, d\lambda,
\qquad\text{for }\lambda_0\in[0,\pi].
$$
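Numerically, the estimator I_n(a) is simply an integral of the periodogram against a. The Python sketch below (our own code) approximates this integral by a Riemann sum over the natural frequencies; the indicator in the comment illustrates the case of the spectral distribution function.

```python
import numpy as np

def I_n_of_a(x, a):
    """Approximate the integral of a(lambda) * I_n(lambda) over (-pi, pi]
    by a Riemann sum over the natural frequencies 2*pi*j/n."""
    n = len(x)
    I = np.abs(np.fft.fft(x)) ** 2 / (2 * np.pi * n)
    lams = 2 * np.pi * np.arange(n) / n
    lams = np.where(lams <= np.pi, lams, lams - 2 * np.pi)   # map frequencies to (-pi, pi]
    return np.sum(a(lams) * I) * (2 * np.pi / n)

# spectral distribution function at lambda_0 = 1.0, i.e. a = indicator of (-pi, 1.0]:
# F_hat = I_n_of_a(x, lambda lam: (lam <= 1.0).astype(float))
```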
12.15 Example. The choice a(λ) = cos(hλ) yields the estimator (for 0 ≤ h < n)
$$
\int\cos(h\lambda)\,I_n(\lambda)\, d\lambda
= \mathrm{Re}\int e^{ih\lambda} I_n(\lambda)\, d\lambda
= \frac{1}{n}\sum_{t=1}^{n-h}X_{t+h}X_t
$$
of the auto-covariance function γ_X(h) in the case that EX_t = 0. Thus the preceding theorem contains Theorem 5.8 as a special case. The present theorem shows how the asymptotic covariance of the sample auto-covariance function can be expressed in the spectral density.
12.16 EXERCISE. Show that the sequence of bivariate random vectors √n(∫ a(I_n − f_X) dλ, ∫ b(I_n − f_X) dλ) converges in distribution to a bivariate Gaussian vector (G_a, G_b) with mean zero and
$$
\mathrm{E}G_aG_b = \kappa_4\int a f_X\, d\lambda\int b f_X\, d\lambda + 4\pi\int ab\,f_X^2\, d\lambda.
$$
12.17 EXERCISE. Plot the periodogram of a white noise series of length 200. Does this
look like a plot of 200 independent exponential variables?
12.18 EXERCISE. Estimate the spectral density of the simulated time series given in the file ~sda/Cursusdata/sim2 by a smoothed periodogram. Compare this to the estimate obtained assuming that sim2 is an AR(3) series.
12.19 EXERCISE. Estimate the spectral density of the Wolfer sunspot numbers (the object sunspots in Splus) by
(i) a smoothed periodogram;
(ii) the spectral density of an appropriate AR-model.
Note: the mean is nonzero.
13 Maximum Likelihood
The method of maximum likelihood is one of the unifying principles of statistics, and applies equally well to replicated experiments as to time series. Given observations X_1, ..., X_n with a joint probability density (x_1, ..., x_n) ↦ p_{n,θ}(x_1, ..., x_n) that depends on a parameter θ, the likelihood function is the stochastic process
$$
\theta\mapsto p_{n,\theta}(X_1,\ldots,X_n).
$$
The maximum likelihood estimator for θ, if it exists, is the value of θ that maximizes the likelihood function.
The likelihood function corresponding to i.i.d. observations X_1, ..., X_n is the product of the likelihood functions of the individual observations, which makes likelihood inference relatively easy in this case. For time series models the likelihood function may be a more complicated function of the observations and the parameter. This complicates the practical implementation of likelihood inference. It also makes its theoretical analysis more involved, although in most situations the final results are not that different from the more familiar i.i.d. case. In particular, maximum likelihood estimators of smooth finite-dimensional parameters are typically √n-consistent and possess a normal limit distribution, with mean zero and covariance the inverse of the Fisher information matrix.
In this chapter we study the maximum likelihood estimator, and some approximations. We also consider the effect of model misspecification: using the likelihood for a model that does not contain the true distribution of the data. Such misspecification of the model may be unintended, but may also be deliberate. For instance, the likelihood under the assumption that X_1, ..., X_n are part of a stationary Gaussian time series X_t is popular for inference, even if one may not believe that the time series is Gaussian. The corresponding maximum likelihood estimator is closely related to the least squares estimators and turns out to perform well also for a wide range of non-Gaussian time series. Another example is to postulate that the innovations in a GARCH model are Gaussian. The resulting estimators again work well also for non-Gaussian innovations. A misspecified likelihood is also referred to as a quasi likelihood and the resulting estimators as quasi likelihood estimators. A misspecified maximum likelihood estimator also falls in the class of M-estimators, which are defined as the maximizers of a given stochastic process (a contrast function). We prefer the latter name.
In our heuristic discussion of the asymptotic properties of maximum likelihood estimators we use the term pseudo likelihood for a certain process that is close to the likelihood, but has a simpler structure.
13.1 General Likelihood
A convenient representation of a likelihood is obtained by repeated conditioning. To alleviate the notation we abuse notation by writing p(y | x) for a conditional density of a variable Y given that another variable X takes the value x, and denote the marginal density of X by p(x). Thus the argument(s) of the function must reveal its identity. In this notation the likelihood corresponding to the observations X_1, ..., X_n is p_θ(X_1, ..., X_n), and this can be decomposed as
$$
(13.1)\qquad p_\theta(x_1,\ldots,x_n)
= p_\theta(x_1)\,p_\theta(x_2\,|\,x_1)\cdots p_\theta(x_n\,|\,x_{n-1},\ldots,x_1).
$$
Clearly we must select appropriate versions of the (conditional) densities, but we shall not worry about technical details in this section.
The decomposition resembles the factorization of the likelihood of an i.i.d. sample of observations, but an important difference is that the n terms on the right may all be of a different form. Even if the time series X_t is strictly stationary, each further term entails conditioning on a bigger past and hence is potentially of a different character than the earlier terms. However, in many examples the present X_t is nearly independent of the distant past (X_s: s ≪ t) given the near past (X_s: s < t, s ≈ t). Then the likelihood does not change much if the conditioning in each term is limited to a fixed number of variables in the past, and most of the terms of the product will take almost a common form. Alternatively, we may augment the conditioning in each term to include the full infinite past, yielding the pseudo likelihood
$$
(13.2)\qquad p_\theta(x_1\,|\,x_0,x_{-1},\ldots)\,p_\theta(x_2\,|\,x_1,x_0,\ldots)\cdots p_\theta(x_n\,|\,x_{n-1},x_{n-2},\ldots).
$$
If the time series X_t is strictly stationary, then the tth term p_θ(x_t | x_{t−1}, x_{t−2}, ...) in this product is a fixed measurable function, independent of t, applied to the vector (x_t, x_{t−1}, ...). In particular, the terms of the product form a strictly stationary time series, which will be ergodic if the original time series X_t is ergodic. This is almost as good as the i.i.d. terms obtained in the case of an i.i.d. time series.
The pseudo likelihood (13.2) cannot be used in practice, because the variables X_0, X_{−1}, ... with nonpositive time index are not observed. However, the preceding discussion suggests that the maximum pseudo likelihood estimator, defined as the maximizer of the pseudo likelihood, may behave the same as the true maximum likelihood estimator. Moreover, if it is true that the past observations X_0, X_{−1}, ... do not play an important role in defining the pseudo likelihood, then we could also replace them by arbitrary values, for instance zero, and hence obtain an observable criterion function.
13.1 Example (Markov time series). If the time series X_t is Markov, then the conditioning in each term p_θ(x_t | x_{t−1}, ..., x_1) or p_θ(x_t | x_{t−1}, x_{t−2}, ...) can be restricted to the single variable x_{t−1}. In this case the likelihood and the pseudo likelihood differ only in their first terms, which are p_θ(x_1) and p_θ(x_1 | x_0), respectively. This difference should be negligible if n is large.
Similarly, if the time series is Markov of order p, i.e. p_θ(x_t | x_{t−1}, x_{t−2}, ...) depends only on x_t, x_{t−1}, ..., x_{t−p}, then the two likelihoods differ only in p terms. This should be negligible if n is large relative to p.
13.2 Example (Autoregression). A causal auto-regressive time series defined relative to an i.i.d. white noise series is an example of a pth order Markov series. Maximum likelihood estimators for auto-regressive processes are commonly defined by using the pseudo likelihood with X_0, ..., X_{−p+1} set equal to zero. Alternatively, we simply drop the first p terms of the likelihood, and work with the approximate likelihood
$$
(\sigma,\phi_1,\ldots,\phi_p)\mapsto
\prod_{t=p+1}^{n} p_{Z,\sigma}\bigl(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p}\bigr),
$$
for p_{Z,σ} the density of the innovations. This can also be considered a conditional likelihood, given the observations X_1, ..., X_p. The difference of this likelihood with the true likelihood is precisely the marginal density of the vector (X_1, ..., X_p), which can be complicated in general, but should have a noticeable effect on the maximum likelihood estimator only if p is large relative to n.
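For Gaussian innovations this conditional likelihood is easy to code. The sketch below (our own minimal illustration, not a full estimation routine) evaluates the conditional Gaussian log likelihood of an AR(p) model; maximizing it over (φ, σ²) gives the conditional maximum likelihood estimator.

```python
import numpy as np

def ar_conditional_loglik(x, phi, sigma2):
    """Conditional Gaussian log likelihood of an AR(p) model:
    log prod_{t=p+1}^n p_Z(X_t - phi_1 X_{t-1} - ... - phi_p X_{t-p}),
    with p_Z the N(0, sigma2) density."""
    x = np.asarray(x, dtype=float)
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    # residuals X_t - phi_1 X_{t-1} - ... - phi_p X_{t-p} for t = p+1, ..., n
    resid = x[p:] - sum(phi[j] * x[p - 1 - j : len(x) - 1 - j] for j in range(p))
    m = len(resid)
    return -0.5 * m * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2
```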
13.3 Example (GARCH). A strictly stationary GARCH process X_t relative to an i.i.d. series Z_t can be written as X_t = σ_tZ_t, for σ²_t = E(X²_t | F_{t−1}) and F_t the filtration generated by X_t, X_{t−1}, .... From Theorem 9.11 it is known that the filtration F_t is also the natural filtration of the process Z_t and hence the variable Z_t is independent of σ²_t, as the latter is measurable relative to F_{t−1}. Thus given X_{t−1}, X_{t−2}, ... the variable X_t = σ_tZ_t is obtained by first calculating σ²_t from X_{t−1}, X_{t−2}, ... and next multiplying σ_t by an independent variable Z_t. The conditional distribution of X_t given X_{t−1}, X_{t−2}, ... is the distribution of Z_t scaled by σ_t. If p_Z is the marginal density of the variables Z_t, then the pseudo likelihood (13.2) takes the form
$$
(13.3)\qquad \prod_{t=1}^{n}\frac{1}{\sigma_t}\,p_Z\Bigl(\frac{X_t}{\sigma_t}\Bigr).
$$
The parameters α, φ_1, ..., φ_p, θ_1, ..., θ_q are hidden in the process σ_t, through the GARCH relation (9.1). Formula (13.3) is not the true likelihood, because it depends on the unobservable variables X_0, X_{−1}, ... through the σ_t.
For an ARCH(q) process the conditional variances σ²_t depend only on the variables X_{t−1}, ..., X_{t−q}, in the simple form
$$
\sigma_t^2 = \alpha + \theta_1X_{t-1}^2 + \cdots + \theta_qX_{t-q}^2.
$$
In this case the true likelihood and the pseudo likelihood differ only in the first q of the n terms. This difference should be negligible. For practical purposes, if n is large relative to q, we could either drop those first q terms, giving a conditional likelihood, or act as if the unobserved variables X²_0, ..., X²_{−q+1} are zero.
For general GARCH processes the difference between the likelihoods is more substantial, because the conditional variance σ²_t depends on X²_{t−1}, ..., X²_{t−q} as well as on the previous σ²_s, causing a dependence on the full past X²_s with s < t of the process of squares. However, the dependence of σ²_t on lagged variables X²_s decreases exponentially fast as s → −∞. This suggests that we might again use the pseudo likelihood with the unobserved variables X²_s replaced by zero.
That the variables X²_s with s ≤ 0 do not play an important role in the definition of the (pseudo) likelihood is also suggested by Theorem 9.15, which shows that a GARCH series defined from arbitrary starting values converges to (strict) stationarity as t grows to infinity (provided that a strictly stationary GARCH process exists). (It is suggested only, as the stability assertion of Theorem 9.15 is about the time series itself and not about its likelihood function.)
A practical implementation is to define σ²_0, ..., σ²_{−p+1} and X²_0, ..., X²_{−q+1} to be zero, and next compute σ²_1, σ²_2, ... recursively, using the GARCH relation (9.1) and the observed values X_1, ..., X_n. By Theorem 9.15 these zero starting values cannot be the true values of the series if the series is strictly stationary, but any other initialization should yield approximately the same likelihood. Given σ²_1, σ²_2, ..., σ²_n and X_1, ..., X_n we can use (13.3) as a contrast function.
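For a GARCH(1, 1) process this recipe takes only a few lines. The Python sketch below (our own illustration; the parameter names mirror the conditional variance relation σ²_t = α + φσ²_{t−1} + θX²_{t−1}, and the zero initialization is the one just described) computes σ²_1, ..., σ²_n recursively and evaluates the Gaussian version of (13.3).

```python
import numpy as np

def garch11_quasi_loglik(x, alpha, phi, theta):
    """Gaussian quasi log likelihood (13.3) for a GARCH(1,1) model:
    sigma2_t = alpha + phi * sigma2_{t-1} + theta * X_{t-1}^2,
    started from sigma2_0 = X_0^2 = 0 as in the text."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma2 = np.empty(n)
    prev_s2, prev_x2 = 0.0, 0.0                  # zero starting values
    for t in range(n):
        sigma2[t] = alpha + phi * prev_s2 + theta * prev_x2
        prev_s2, prev_x2 = sigma2[t], x[t] ** 2
    # sum_t log( (1/sigma_t) * p_Z(X_t / sigma_t) ) with p_Z the standard normal density
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + x ** 2 / sigma2)

# the estimator maximizes this over (alpha, phi, theta), e.g. with scipy.optimize.minimize
```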
13.2 Asymptotics under a Correct Model
In this section we consider the asymptotic behaviour of the maximum likelihood estimator under the assumption that the observations are sampled from one of the distributions used to form the likelihood. We denote the parameter of the latter distribution by θ_0.
We adopt the working hypothesis that the maximum likelihood estimators have the same asymptotic properties as the corresponding maximum pseudo likelihood estimators. Furthermore, we assume that the time series X_t is strictly stationary and ergodic. These conditions are certainly too stringent, but they simplify the arguments. The conclusions typically apply to any time series that approaches stationarity as t → ∞ and for which averages converge to constants.
13.2.1 Consistency
Abbreviate (x_t, x_{t−1}, ...) to ⃗x_t. The maximum pseudo likelihood estimator maximizes the function
$$
(13.4)\qquad M_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\log p_\theta(X_t\,|\,\vec X_{t-1}).
$$
If the variables log p_θ(X_t | ⃗X_{t−1}) are integrable, as we assume, then, by the ergodic theorem, Theorem 7.4, the averages M_n(θ) converge to their expectation
$$
(13.5)\qquad M(\theta) = \mathrm{E}_{\theta_0}\log p_\theta(X_1\,|\,\vec X_0).
$$
The expectation is taken under the true parameter θ_0 governing the distribution of the time series X_t. The difference of the expected values M(θ_0) and M(θ) can be written as
$$
M(\theta_0) - M(\theta)
= \mathrm{E}_{\theta_0}\Bigl[\int\Bigl(\log\frac{p_{\theta_0}(x_1\,|\,\vec X_0)}{p_{\theta}(x_1\,|\,\vec X_0)}\Bigr)\,p_{\theta_0}(x_1\,|\,\vec X_0)\, d\mu(x_1)\Bigr].
$$
The integral inside the expectation is the Kullback-Leibler divergence between the (conditional) measures with densities p_θ(· | ⃗x_0) and p_{θ_0}(· | ⃗x_0). The Kullback-Leibler divergence ∫ log(p/q) dP between two probability measures P and Q (with densities p and q relative to some measure) is always nonnegative, and it is zero if and only if the two measures are the same. This yields the following lemma.
13.4 Lemma. The function M: Θ → ℝ defined in (13.5) satisfies M(θ) ≤ M(θ_0) for every θ, with equality if and only if the conditional distributions of X_1 given ⃗X_0 = ⃗x_0 under θ and θ_0 are identical, for almost every ⃗x_0 under the distribution of ⃗X_0 under θ_0.
Proof. The inequality log x ≤ 2(√x − 1), valid for x > 0, yields, for any two probability densities p and q relative to some measure μ,
$$
-\int\log\frac{q}{p}\,p\, d\mu \ge -2\int\Bigl(\sqrt{\frac{q}{p}}-1\Bigr)p\, d\mu
= \int\bigl(\sqrt p-\sqrt q\bigr)^2\, d\mu.
$$
It follows that ∫ log(p/q) p dμ is nonnegative, and equal to zero if and only if p = q almost everywhere under μ. We apply this with p and q equal to the densities x_1 ↦ p_{θ_0}(x_1 | ⃗X_0) and x_1 ↦ p_θ(x_1 | ⃗X_0), respectively, to the inner integral on the right side of the expression for M(θ_0) − M(θ) preceding the lemma, and conclude that this integral is nonnegative. Moreover, it is zero if and only if the two conditional laws are identical. The whole expression, including the outer expectation E_{θ_0}, is nonnegative, and vanishes only if the inner integral is zero for almost every ⃗x_0.
Under the reasonable assumption that each value of θ indexes a different underlying (conditional) distribution of the time series X_t (identifiability of θ), we conclude that the map θ ↦ M(θ) possesses a unique absolute maximum at θ = θ_0.
The convergence of the criterion functions M_n to M, and the definitions of θ̂_n and θ_0 as their points of maximum, suggest that θ̂_n converges to θ_0. In other words, we expect the maximum likelihood estimators to be consistent for the true value θ_0. This argument can be made mathematically rigorous, for instance by imposing additional conditions that guarantee the uniform convergence of M_n to M. See e.g. Theorem 3.17.
13.2.2 Asymptotic Normality
If it is true that θ̂_n → θ_0, then the question poses itself to characterize the rate at which the difference θ̂_n − θ_0 converges to zero and to find a possible limit distribution for the rescaled difference. We shall assume that the parameter set Θ is a subset of ℝ^d. Under smoothness conditions on the contrast function θ ↦ M_n(θ) we shall establish the asymptotic normality of the sequence √n(θ̂_n − θ_0).
We assume that the gradient Ṁ_n(θ) and second derivative matrix M̈_n(θ) of the map θ ↦ M_n(θ) exist and are continuous. Because θ̂_n is a point of maximum of M_n, it satisfies the stationary equation Ṁ_n(θ̂_n) = 0, if θ̂_n is an inner point of the parameter set. Because we assume consistency of θ̂_n, this is the case with probability tending to 1 if θ_0 is an inner point of Θ, as we shall assume. In the case that the parameter is one-dimensional, Taylor's theorem shows that there exists a point θ̃_n on the line segment between θ_0 and θ̂_n such that
$$
0 = \dot M_n(\hat\theta_n) = \dot M_n(\theta_0) + \ddot M_n(\tilde\theta_n)(\hat\theta_n-\theta_0).
$$
This can be rewritten as
$$
\sqrt n\,(\hat\theta_n-\theta_0) = -\bigl(\ddot M_n(\tilde\theta_n)\bigr)^{-1}\sqrt n\,\dot M_n(\theta_0).
$$
Taylor's theorem does not apply in the same way to higher-dimensional parameters, for which the function Ṁ_n is vector-valued. In that case we can apply the theorem to every coordinate function of Ṁ_n, giving the second last display (which is now vector-valued) for every coordinate, but possibly with a different point θ̃_n for every coordinate. Abusing notation we shall omit the dependence on the coordinate and write the resulting matrix of gradients again as M̈_n(θ̃_n). These matrices
$$
\ddot M_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}\log p_\theta(X_t\,|\,\vec X_{t-1})
$$
are averages and hence the ergodic theorem guarantees their convergence in probability to a fixed matrix under reasonable conditions. Because θ̃_n →^P θ_0, it is a reasonable assumption that the matrices M̈_n(θ̃_n) and M̈_n(θ_0) possess the same limit. If we can also show that the sequence √n Ṁ_n(θ_0) converges in distribution, then we can conclude that the sequence √n(θ̂_n − θ_0) converges in distribution, by Slutsky's lemma.
The convergence of the sequence √n Ṁ_n(θ_0) can be established by the martingale central limit theorem, Theorem 4.16. To see this, first differentiate the identity ∫ p_θ(x_1 | ⃗x_0) dμ(x_1) = 1 twice with respect to θ to verify that
$$
\int\dot p_\theta(x_1\,|\,\vec x_0)\, d\mu(x_1) = \int\ddot p_\theta(x_1\,|\,\vec x_0)\, d\mu(x_1) = 0.
$$
The function ℓ_θ = log p_θ possesses derivatives ℓ̇_θ = ṗ_θ/p_θ and ℓ̈_θ = p̈_θ/p_θ − ℓ̇_θℓ̇_θ^T. Combination with the preceding display yields conditional versions of the Bartlett identities: the expectation of the score function is zero, and the expectation of the second derivative of the log likelihood is minus the Fisher information:
$$
\mathrm{E}_\theta\bigl(\dot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr) = 0,
$$
$$
-\mathrm{E}_\theta\bigl(\ddot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr)
= \operatorname{Cov}_\theta\bigl(\dot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr)
= \mathrm{E}_\theta\bigl(\dot\ell_\theta\dot\ell_\theta^T(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr).
$$
The first identity shows that the sequence nṀ_n(θ) = Σ_{t=1}^n ℓ̇_θ(X_t | ⃗X_{t−1}) is a martingale under the true measure specified by the parameter θ. Under reasonable conditions the martingale central limit theorem yields that the sequence √n Ṁ_n(θ) is asymptotically normal with mean zero and covariance matrix
$$
I_\theta = \operatorname{Cov}_\theta\bigl(\dot\ell_\theta(X_1\,|\,\vec X_0)\bigr).
$$
By the second identity E_θ M̈_n(θ) = −I_θ and hence we may expect that
$$
\ddot M_n(\theta)\;\xrightarrow{\;P\;}\;-I_\theta,
$$
by the ergodic theorem. Combining this with Slutsky's lemma as indicated before, we find that, under the true parameter θ_0,
$$
\sqrt n\,(\hat\theta_n-\theta_0)\;\rightsquigarrow\;N\bigl(0,\,I_{\theta_0}^{-1}\bigr).
$$
The matrix I_θ is known as the Fisher information matrix (average per observation). Typically, it can also be found through the limits
$$
I_{n,\theta} := \frac{1}{n}\,\mathrm{E}_\theta\Bigl(\frac{\partial}{\partial\theta}\log p_\theta(X_1,\ldots,X_n)\Bigr)
\Bigl(\frac{\partial}{\partial\theta}\log p_\theta(X_1,\ldots,X_n)\Bigr)^T\;\rightarrow\;I_\theta,
$$
$$
-\frac{1}{n}\,\frac{\partial^2}{\partial\theta^2}\log p_\theta(X_1,\ldots,X_n)\Big|_{\theta=\hat\theta_n}
\;\xrightarrow{\;P\;}\;I_\theta.
$$
The expression on the left in the second line is minus the second derivative matrix of the log likelihood surface (times 1/n) at the maximum likelihood estimator, and is known as the observed information. As it gives an estimate for the inverse of the asymptotic covariance matrix of the sequence √n(θ̂_n − θ) under θ, a large observed information (a strongly curved likelihood surface) indicates that the maximum likelihood estimator has small asymptotic covariance.
The first line of the preceding display connects the matrix I_θ to the definition of the Fisher information for arbitrary observations, which appears in the Cramér-Rao bound for the variance of unbiased estimators. According to the Cramér-Rao theorem, the covariance matrix of any unbiased estimator T_n of θ satisfies
$$
\operatorname{Cov}_\theta T_n \ge \frac{1}{n}\,I_{n,\theta}^{-1}.
$$
The preceding informal derivation suggests that the asymptotic covariance matrix of the sequence √n(θ̂_n − θ) under θ is equal to I_θ^{-1}. We interpret this as saying that the maximum likelihood estimator is asymptotically of minimal variance, or asymptotically efficient.
For a rigorous proof of the asymptotic normality of the maximum likelihood estimator, and a precise formulation of its asymptotic efficiency, see ??
13.5 EXERCISE. Compute the conditional maximum likelihood estimator for (φ, σ²) in a stationary, causal AR(1) model X_t = φX_{t−1} + Z_t with Gaussian innovations Z_t. What is its limit distribution? Calculate the Fisher information matrix I_θ.
13.6 EXERCISE. Find the pair of (conditional) likelihood equations Ṁ_n(α, θ) = 0 for estimating the parameters (α, θ) in an ARCH(1) model. Verify the martingale property of nṀ_n(α, θ).
13.3 Asymptotics under Misspecification
Specification of a correct statistical model for a given time series is generally difficult, and it is typically hard to decide which of two given reasonable models is the better one. This observation is often taken as motivation for modelling a time series as a Gaussian series, Gaussianity being considered as good as any other specification and Gaussian likelihoods being relatively easy to handle. Meanwhile the validity of the Gaussian assumption may not really be accepted. It is therefore important, in time series analysis even more than in statistics for replicated experiments, to consider the behaviour of estimation procedures under misspecification of the model.
Consider an estimator θ̂_n defined as the point of maximum of a likelihood function of a model that possibly does not contain the true distribution of the observations. It is again easier to consider the pseudo likelihood (13.2) than the true likelihood. The misspecified maximum pseudo likelihood estimator is still the point of maximum of the map θ ↦ M_n(θ) defined in (13.4). For the asymptotic analysis of θ̂_n we again apply the ergodic theorem to see that M_n(θ) → M(θ) almost surely, for M(θ) the expectation of M_n(θ), now defined by
$$
(13.6)\qquad M(\theta) = \mathrm{E}\log p_\theta(X_1\,|\,\vec X_0)
= \mathrm{E}\int\log p_\theta(x_1\,|\,\vec X_0)\;p(x_1\,|\,\vec X_0)\, d\mu(x_1).
$$
The difference with the foregoing is that presently the expectation is taken under the true model for the series X_t, which may or may not be representable through one of the parameters θ. In the display this is indicated by writing a general expectation E (and not E_{θ_0}) and a generic density p (and not p_{θ_0}). However, the same reasoning suggests that θ̂_n converges in probability to a value θ_0 that maximizes the map θ ↦ M(θ). Without further specification of the model and the true distribution of the time series, there is little more we can say about this maximizing value than that it gives conditional densities p_{θ_0}(· | ⃗x_0) that are, on the average, closest to the true conditional densities p(· | ⃗x_0) of the time series in terms of the Kullback-Leibler divergence.
Having ascertained that the sequence θ̂_n ought to converge to a limit, most of the subsequent arguments to establish asymptotic normality of the sequence √n(θ̂_n − θ_0) go through, also under misspecification, provided that
$$
(13.7)\qquad \mathrm{E}\bigl(\dot\ell_{\theta_0}(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr) = 0,\quad\text{a.s.}
$$
Under this condition the sequence nṀ_n(θ_0) is still a martingale, and √n Ṁ_n(θ_0) may be expected to be asymptotically normal by the martingale central limit theorem. By the assumed ergodicity of the series X_t the sequence M̈_n(θ_0) will still converge to a fixed matrix, and the same may be expected to be true for the sequence of second derivatives M̈_n(θ̃_n) evaluated at a point θ̃_n between θ̂_n and θ_0. A difference is that the asymptotic covariance matrix Σ_{θ_0} of the sequence √n Ṁ_n(θ_0) and the limit R_{θ_0} of the sequence M̈_n(θ_0) may no longer be each other's negatives. The conclusion will therefore take the more complicated form
$$
\sqrt n\,(\hat\theta_n-\theta_0)\;\rightsquigarrow\;
N\bigl(0,\;R_{\theta_0}^{-1}\Sigma_{\theta_0}(R_{\theta_0}^{-1})^T\bigr).
$$
The covariance matrix of this normal limit distribution is referred to as the sandwich formula. Thus under (13.7) we may expect that the sequence θ̂_n will converge rapidly to a limit θ_0. Then fitting the wrong model will still be useful as long as the density p_{θ_0} (or an aspect of interest of it) is sufficiently close to the true density of the time series.
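In practice the sandwich covariance is estimated by plugging in empirical versions of R_{θ_0} and Σ_{θ_0}. The sketch below (our own generic code; it assumes the user supplies the per-observation score contributions ℓ̇_{θ̂}(X_t | ⃗X_{t−1}) and the Hessian of M_n at θ̂, however those are computed for the model at hand) is one standard way to do this.

```python
import numpy as np

def sandwich_covariance(scores, hessian):
    """Estimate of the asymptotic covariance R^{-1} Sigma (R^{-1})^T of sqrt(n)(theta_hat - theta_0).
    scores:  (n, d) array whose rows are d/dtheta log p_theta(X_t | past), evaluated at theta_hat
    hessian: (d, d) array, the second derivative matrix of M_n at theta_hat (an estimate of R)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    Sigma = scores.T @ scores / n                 # empirical covariance of the score terms
    R_inv = np.linalg.inv(np.asarray(hessian, dtype=float))
    return R_inv @ Sigma @ R_inv.T

# standard errors for theta_hat:
# se = np.sqrt(np.diag(sandwich_covariance(S, H)) / n)
```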
Condition (13.7) is odd, and not automatically satisfied. It is certainly satisfied if the point of maximum θ_0 of the map θ ↦ M(θ) is such that for almost every ⃗x_0 it is also a point of maximum of the map
$$
(13.8)\qquad \theta\mapsto\int\log p_\theta(x_1\,|\,\vec x_0)\;p(x_1\,|\,\vec x_0)\, d\mu(x_1).
$$
This is not necessarily the case, as the points of maximum of these functions may depend on the different values of ⃗x_0. The point θ_0 is by definition the point of maximum of the average of these functions over ⃗x_0, weighted by the true distribution of ⃗X_0. Failure of (13.7) does not necessarily mean that the sequence √n(θ̂_n − θ_0) is not asymptotically normally distributed, but it does mean that we cannot apply the martingale central limit theorem, as in the preceding argument.
In the next section we discuss a major example of possible misspecification: estimating a parameter by Gaussian maximum likelihood. The following example concerns GARCH processes, and illustrates that some misspecifications are harmless, whereas others may cause trouble.
13.7 Example (GARCH). As found in Example 13.3, the pseudo likelihood for a GARCH(p, q) process takes the form
$$
\prod_{t=1}^{n}\frac{1}{\sigma_t(\theta)}\,p_Z\Bigl(\frac{X_t}{\sigma_t(\theta)}\Bigr),
$$
where p_Z is the density of the innovations and θ = (α, φ_1, ..., φ_p, θ_1, ..., θ_q). In Chapter 9 it was noted that a t-density p_Z may be appropriate to explain the observed leptokurtic tails of financial time series. However, the Gaussian density p_Z(z) = e^{−z²/2}/√(2π) is more popular for likelihood based inference for GARCH processes. The corresponding Gaussian log pseudo likelihood is, up to additive and multiplicative constants, equal to
$$
-\frac{1}{n}\sum_{t=1}^{n}\log\sigma_t^2(\theta) - \frac{1}{n}\sum_{t=1}^{n}\frac{X_t^2}{\sigma_t^2(\theta)}.
$$
The expectation of this criterion function can be written as
$$
M(\theta) = -\mathrm{E}\Bigl(\log\sigma_1^2(\theta) + \frac{\mathrm{E}(X_1^2\,|\,\mathcal F_0)}{\sigma_1^2(\theta)}\Bigr).
$$
Both the expectation and the conditional expectation on the right side are taken relative to the true distribution of the time series. (The outermost expectation is relative to the vector ⃗X_0 that is hidden in σ²_1(θ) and in E(X²_1 | F_0).) The sequence θ̂_n may be expected to converge to the point of maximum of the map θ ↦ M(θ).
Suppose that the GARCH equation (9.1) for the conditional variances is correctly specified, even though the true density of the innovations may not be standard normal. In other words, E(X²_1 | F_0) = σ²_1(θ_0) for the true parameter θ_0, and hence
$$
(13.9)\qquad M(\theta) = -\mathrm{E}\Bigl(\log\sigma_1^2(\theta) + \frac{\sigma_1^2(\theta_0)}{\sigma_1^2(\theta)}\Bigr).
$$
For every fixed σ²_0, the map σ² ↦ log σ² + σ²_0/σ² assumes its minimal value on the domain (0, ∞) at σ² = σ²_0. It follows that the map θ ↦ M(θ) is maximized at θ = θ_0, no matter the distribution of ⃗X_0 that determines the expectation defining M(θ).
We conclude that the use of the Gaussian density for p_Z will lead to consistent estimators θ̂_n for the coefficients of the GARCH equation as long as the conditional variance model is correctly specified. In particular, the true density p_Z of the innovations need not be Gaussian. As shown in the preceding arguments, this pleasant fact is a consequence of the fact that the likelihood based on choosing the normal density for p_Z depends on the observations X_t only through a linear function of the squares X²_t. For another choice of density p_Z, such as a t-density, this is not true, and we cannot hope to be guarded against misspecification in that case.
The maximizing parameter θ_0 minimizes the expression log σ²_1(θ) + σ²_1(θ_0)/σ²_1(θ) inside the brackets in (13.9) over θ, for any fixed value of ⃗X_0. In the general notation this corresponds to maximizing the inner integral in (13.6), for every value of ⃗X_0. Therefore (13.7) is satisfied, and we expect the sequence √n(θ̂_n − θ_0) to be asymptotically normal, with asymptotic variance given by the sandwich formula.
These conclusions are all conditional on the assumption that the conditional variance model stipulated by the GARCH model is correct. Because the parameter θ has a meaning only relative to this model, this does not seem an important restriction.
13.4 Gaussian Likelihood
A Gaussian time series is a time series X_t such that the joint distribution of every finite subvector (X_{t_1}, ..., X_{t_n}) of the series possesses a multivariate normal distribution. In particular, the vector (X_1, ..., X_n) is multivariate normally distributed, and hence its distribution is completely specified by a mean vector μ_n ∈ ℝ^n and an (n × n) covariance matrix Γ_n. If the time series X_t is covariance stationary, then the matrix Γ_n has entries (Γ_n)_{s,t} = γ_X(s − t), for γ_X the auto-covariance function of X_t. We assume that both the mean μ_n and the covariance function γ_X can be expressed in a parameter θ of fixed dimension, so that we can write μ_n = μ_n(θ) and Γ_n = Γ_n(θ).
The likelihood function under the assumption that X_t is a Gaussian time series is the multivariate normal density viewed as a function of the parameter, and takes the form
$$
(13.10)\qquad \frac{1}{(2\pi)^{n/2}}\,\frac{1}{\sqrt{\det\Gamma_n(\theta)}}\,
e^{-\frac12\bigl(\vec X_n-\mu_n(\theta)\bigr)^T\Gamma_n(\theta)^{-1}\bigl(\vec X_n-\mu_n(\theta)\bigr)}.
$$
We refer to this function as the Gaussian likelihood, and to its point of maximum θ̂_n, if it exists, as the maximum Gaussian likelihood estimator. The Gaussian likelihood and the corresponding estimator are commonly used, also in the case that the time series X_t is non-Gaussian.
Maximum Gaussian likelihood is closely related to the method of least squares, described in Section 11.3. We can see this using the likelihood factorization (13.1). For a Gaussian process the conditional densities p_θ(x_t | X_{t−1}, ..., X_1) are univariate normal densities with means E_θ(X_t | X_{t−1}, ..., X_1) and variances v_t(θ) equal to the prediction errors. (Cf. Exercise 13.8.) Furthermore, the best nonlinear predictor E_θ(X_t | X_{t−1}, ..., X_1) is automatically a linear combination of the predicting variables and hence coincides with the best linear predictor Π_{t−1}X_t(θ). This shows that the factorization (13.1) reduces to
$$
\prod_{t=1}^{n}\frac{1}{\sqrt{v_t(\theta)}}\,
\phi\Bigl(\frac{X_t-\Pi_{t-1}X_t(\theta)}{\sqrt{v_t(\theta)}}\Bigr),
$$
for φ the standard normal density.
Maximizing this relative to θ is equivalent to maximizing its logarithm, which can be written in the form
$$
(13.11)\qquad -\frac{n}{2}\log(2\pi) - \frac12\sum_{t=1}^{n}\log v_t(\theta)
- \frac12\sum_{t=1}^{n}\frac{\bigl(X_t-\Pi_{t-1}X_t(\theta)\bigr)^2}{v_t(\theta)}.
$$
This function differs in form from the least squares criterion function (11.5) only in the presence of the term −½ Σ_{t=1}^n log v_t(θ). In situations where this term is almost constant, least squares and Gaussian maximum likelihood estimators are almost the same.
13.8 EXERCISE. Suppose that the vector (X_1, ..., X_t) possesses a multivariate normal distribution. Show that the conditional distribution of X_t given (X_1, ..., X_{t−1}) is normal with mean the best linear predictor Π_{t−1}X_t of X_t based on X_1, ..., X_{t−1} and variance the prediction error v_t = var(X_t − Π_{t−1}X_t). [Reduce to the case that the unconditional mean is 0. Let Y be the best linear predictor of X_t based on X_1, ..., X_{t−1}. Show that (X_t − Y, ⃗X_{t−1}) is multivariate normal with zero covariance between X_t − Y and ⃗X_{t−1}. Conclude that these variables are independent, and use this to deduce the conditional distribution of X_t = (X_t − Y) + Y.]
13.9 Example (Autoregression). For causal stationary auto-regressive processes of order p and t > p the best linear predictor of X_t is equal to φ_1X_{t−1} + ⋯ + φ_pX_{t−p}. Thus the innovations X_t − Π_{t−1}X_t are equal to the noise inputs Z_t, and the prediction errors v_t are equal to σ² = EZ²_t, for t > p. Thus the function −½ Σ_{t=1}^n log v_t(θ) in the formula for the Gaussian likelihood is approximately equal to −½ n log σ², if n is much bigger than p. The log Gaussian likelihood is approximately equal to
$$
-\frac{n}{2}\log(2\pi) - \frac12\,n\log\sigma^2
- \frac12\sum_{t=p+1}^{n}\frac{\bigl(X_t-\phi_1X_{t-1}-\cdots-\phi_pX_{t-p}\bigr)^2}{\sigma^2}.
$$
For fixed σ², maximization relative to φ_1, ..., φ_p is equivalent to minimization of the sum of squares and hence gives results identical to the method of least squares discussed in Sections 11.1 and 11.3. Maximization relative to σ² gives (almost) the Yule-Walker estimator discussed in Section 11.1.
13.10 Example (ARMA). In ARMA models the parameter σ² enters as a multiplicative factor in the covariance function (cf. Section 11.3). This implies that the log Gaussian likelihood function can be written in the form, with θ = (φ_1, ..., φ_p, θ_1, ..., θ_q),
$$
-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac12\sum_{t=1}^{n}\log v_t(\theta)
- \frac12\sum_{t=1}^{n}\frac{\bigl(X_t-\Pi_{t-1}X_t(\theta)\bigr)^2}{\sigma^2\,v_t(\theta)}.
$$
Differentiating this with respect to σ² we see that for every fixed θ, the Gaussian likelihood is maximized relative to σ² by
$$
\hat\sigma^2(\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\bigl(X_t-\Pi_{t-1}X_t(\theta)\bigr)^2}{v_t(\theta)}.
$$
Substituting this expression in the log Gaussian likelihood, we see that the maximum Gaussian likelihood estimator of θ maximizes the function
$$
-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat\sigma^2(\theta) - \frac12\sum_{t=1}^{n}\log v_t(\theta) - \frac{n}{2}.
$$
The latter function is called the profile likelihood for θ, and the elimination of the parameter σ² is referred to as concentrating out this parameter. We can drop the constant terms in the profile likelihood and conclude that the maximum Gaussian likelihood estimator θ̂ for θ minimizes
$$
(13.12)\qquad \log\frac{1}{n}\sum_{t=1}^{n}\frac{\bigl(X_t-\Pi_{t-1}X_t(\theta)\bigr)^2}{v_t(\theta)}
+ \frac{1}{n}\sum_{t=1}^{n}\log v_t(\theta).
$$
The maximum Gaussian likelihood estimator for σ² is σ̂²(θ̂).
For causal, invertible stationary ARMA processes the innovations X_t − Π_{t−1}X_t(θ) are for large t approximately equal to Z_t, whence v_t(θ) ≈ EZ²_t/σ² = 1. (Cf. the discussion in Section 8.4. In fact, it can be shown that |v_t − 1| ≤ c^t for some 0 < c < 1 and sufficiently large t.) This suggests that the criterion function (13.12) does not change much if we drop its second term and retain only the logarithm of the sum of squares. The corresponding approximate maximum Gaussian likelihood estimator is precisely the least squares estimator, discussed in Section 11.3.
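Given routines that compute the one-step prediction errors X_t − Π_{t−1}X_t(θ) and the scaled prediction variances v_t(θ), the criterion (13.12) is a one-liner. A sketch (our own code; `innovations` in the comment is a hypothetical helper standing in for whatever prediction-error recursion is used for the ARMA model at hand):

```python
import numpy as np

def gaussian_profile_criterion(innov, v):
    """Criterion (13.12): log( (1/n) sum_t e_t^2 / v_t ) + (1/n) sum_t log v_t,
    where e_t = X_t - Pi_{t-1} X_t(theta) and v_t = v_t(theta)."""
    innov = np.asarray(innov, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(innov)
    return np.log(np.sum(innov ** 2 / v) / n) + np.sum(np.log(v)) / n

# e.g.:  e, v = innovations(x, phi, theta)          # hypothetical prediction-error routine
#        crit = gaussian_profile_criterion(e, v)    # minimize this over (phi, theta)
#        sigma2_hat = np.sum(e**2 / v) / len(e)     # concentrated-out variance estimate
```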
13.11 Example (GARCH). The distribution of a GARCH process X_t = σ_tZ_t depends on the distribution of the innovations Z_t, but is rarely (or never?) Gaussian. Nevertheless we may try and apply the method of Gaussian likelihood.
Because a GARCH series is a white noise series, the mean vector of (X_1, ..., X_n) is zero and its covariance matrix is diagonal. For a stationary GARCH process the diagonal elements are identical and can be expressed in the parameters of the GARCH process. (For instance EX²_t = α/(1 − φ − θ) for the GARCH(1, 1) process.) As the Gaussian likelihood depends on the parameters of the model only through the variances v²_t = EX²_t, this likelihood can at best yield good estimators for functions of this variance. The GARCH parameters cannot be recovered from it. For instance, we cannot estimate the parameter (α, φ, θ) of a GARCH(1, 1) process from a criterion function that depends on these parameters only through α/(1 − φ − θ).
We conclude that the method of Gaussian likelihood is useless for GARCH processes. (We note that the Gaussian likelihood as discussed in this section is not the same as the GARCH likelihood under the assumption that the innovations Z_t are Gaussian (cf. Example 13.3). The Gaussian likelihood is similar in form, but with the conditional variances σ²_t in the latter replaced by their expectations.)
In the preceding examples we have seen that for AR and ARMA processes the Gaussian maximum likelihood estimators are, asymptotically as n → ∞, close to the least squares estimators. The following theorem shows that the asymptotic behaviour of these estimators is identical to that of the least squares estimators, which is given in Theorem 11.19.
13.12 Theorem. Let X_t be a causal, invertible stationary ARMA(p, q) process relative to an i.i.d. sequence Z_t. Then the maximum Gaussian likelihood estimator satisfies
$$
\sqrt n\Bigl(\begin{pmatrix}\hat\phi\\ \hat\theta\end{pmatrix}
- \begin{pmatrix}\phi\\ \theta\end{pmatrix}\Bigr)
\;\rightsquigarrow\; N\bigl(0,\;\sigma^2 J_{\phi,\theta}^{-1}\bigr),
$$
where J_{φ,θ} is the covariance matrix of (U_1, ..., U_p, V_1, ..., V_q) for stationary auto-regressive processes U_t and V_t satisfying φ(B)U_t = θ(B)V_t = Z_t.
Proof. The proof is long and technical. See Brockwell and Davis (1991), pages 375-396, Theorem 10.8.2.
The theorem does not assume that the time series X_t itself is Gaussian; it uses Gaussianity only as a working hypothesis to define the maximum likelihood estimators. Apparently, using the "wrong" likelihood still leads to reasonable estimators. This is plausible, because Gaussian maximum likelihood estimators are asymptotically equivalent to least squares estimators and the method of least squares can be motivated without reference to Gaussianity. Alternatively, it can be explained from consideration of the Kullback-Leibler divergence, as in Section 13.3.
On the other hand, in the case that the series X_t is not Gaussian the true maximum likelihood estimators (if the true model, i.e. the true distribution of the noise factors Z_t, were known) are likely to perform better than the least squares estimators. In this respect time series analysis is not different from the situation for replicated experiments. An important difference is that in practice non-Gaussianity may be difficult to detect, other plausible distributions may be difficult to motivate, and other likelihoods may yield computational problems. The Gaussian distribution is therefore frequently adopted as a working hypothesis.
13.4.1 Whittle Estimators
Because the Gaussian likelihood function of a mean zero time series depends on the autocovariance function only, it can be helpful to write it in terms of the spectral density. The covariance matrix of a vector (X_1, …, X_n) belonging to a stationary time series X_t with spectral density f_X can be written as Γ_n(f_X), for

Γ_n(f) = ( ∫_{−π}^{π} e^{iλ(s−t)} f(λ) dλ )_{s,t=1,…,n}.

Thus if the time series X_t has spectral density f_θ under the parameter θ and mean zero, then the log Gaussian likelihood can be written in the form

−(n/2) log(2π) − ½ log det Γ_n(f_θ) − ½ X⃗_n^T Γ_n(f_θ)^{-1} X⃗_n.
Maximizing this expression over θ is equivalent to maximizing the Gaussian likelihood as discussed previously, but gives a different perspective. For instance, to fit an ARMA process we would maximize this expression over all rational spectral densities of the form σ² |θ(e^{iλ})|² / (2π |φ(e^{iλ})|²).
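As an illustration of the two displays above, a minimal Python sketch (with the AR(1) spectral density f_θ(λ) = σ²/(2π|1 − φe^{iλ}|²) as example, and simple Riemann sums for the integrals) builds Γ_n(f_θ) from numerically computed autocovariances and evaluates the log Gaussian likelihood:

import numpy as np

def autocovariances(f, maxlag, num=8192):
    # gamma(h) = int_{-pi}^{pi} e^{i h lambda} f(lambda) d lambda, by a Riemann sum (f is even)
    lam = np.linspace(-np.pi, np.pi, num, endpoint=False)
    w = f(lam)
    dlam = 2 * np.pi / num
    return np.array([np.sum(np.cos(h * lam) * w) * dlam for h in range(maxlag + 1)])

def gaussian_loglik(x, f):
    # -(n/2) log(2 pi) - (1/2) log det Gamma_n(f) - (1/2) x' Gamma_n(f)^{-1} x
    n = len(x)
    gam = autocovariances(f, n - 1)
    Gamma = gam[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]
    _, logdet = np.linalg.slogdet(Gamma)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * (x @ np.linalg.solve(Gamma, x))

def f_ar1(phi, sigma2=1.0):
    # AR(1) spectral density sigma^2 / (2 pi |1 - phi e^{i lambda}|^2)
    return lambda lam: sigma2 / (2 * np.pi * np.abs(1 - phi * np.exp(1j * lam)) ** 2)

rng = np.random.default_rng(1)
x = np.zeros(200)
for t in range(1, 200):                     # simulate an AR(1) series with phi = 0.6
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
print(gaussian_loglik(x, f_ar1(0.6)), gaussian_loglik(x, f_ar1(0.2)))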
The true advantage of writing the likelihood in spectral notation is that it suggests a convenient approximation. The Whittle approximation is defined as

−(n/2) log(2π) − (n/4π) ∫_{−π}^{π} log f_θ(λ) dλ − (n/4π) ∫_{−π}^{π} (I_n(λ)/f_θ(λ)) dλ,

where I_n(λ) is the periodogram of the time series X_t, as defined in Section 12.2. This approximation results from the following approximations, for a sufficiently regular function f,

Γ_n(f)^{-1} ≈ (1/4π²) Γ_n(1/f),
(1/n) log det Γ_n(f) ≈ log(2π) + (1/2π) ∫_{−π}^{π} log f(λ) dλ,

combined with the identity

(1/n) X⃗_n^T Γ_n(f) X⃗_n = 2π ∫ I_n(λ) f(λ) dλ.

The approximations are made precise in Lemma ??, whereas the identity follows by some algebra.
13.13 EXERCISE. Verify the identity in the preceding display.
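The following minimal Python sketch (an illustration only; the periodogram is taken as I_n(λ) = |Σ_t X_t e^{−iλt}|²/(2πn), the innovation variance is held at its true value 1, and the integrals are Riemann sums) evaluates the Whittle approximation on a grid of AR(1) parameters and reports the maximizing value, which for a reasonable sample size should be close to the maximum Gaussian likelihood estimate:

import numpy as np

rng = np.random.default_rng(2)
n = 400
x = np.zeros(n)
for t in range(1, n):                       # simulate an AR(1) series with phi = 0.6
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

lam = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
dlam = 2 * np.pi / len(lam)
# periodogram I_n(lambda) = |sum_t x_t e^{-i lambda t}|^2 / (2 pi n)
dft = np.exp(-1j * np.outer(lam, np.arange(1, n + 1))) @ x
I_n = np.abs(dft) ** 2 / (2 * np.pi * n)

def whittle(phi, sigma2=1.0):
    # -(n/2) log(2 pi) - (n/4 pi) int log f_theta - (n/4 pi) int I_n / f_theta
    f = sigma2 / (2 * np.pi * np.abs(1 - phi * np.exp(1j * lam)) ** 2)
    return (-0.5 * n * np.log(2 * np.pi)
            - n / (4 * np.pi) * np.sum(np.log(f)) * dlam
            - n / (4 * np.pi) * np.sum(I_n / f) * dlam)

phis = np.linspace(-0.95, 0.95, 191)
print("Whittle estimate of phi:", phis[int(np.argmax([whittle(p) for p in phis]))])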
The Whittle approximation is both more convenient for numerical manipulation and more readily amenable to theoretical analysis. The point of maximum θ̂_n of the Whittle approximation, if it exists, is known as the Whittle estimator. Conceptually, this comes down to a search in the class of spectral densities f_θ defined through the model.
13.14 Example (Autoregression). For an auto-regressive time series of fixed order the Whittle estimators are identical to the Yule-Walker estimators, which are also (almost) identical to the maximum Gaussian likelihood estimators. This can be seen as follows.

The Whittle estimators are defined by maximizing the Whittle approximation over all spectral densities of the form f_θ(λ) = σ²/(2π|φ(e^{iλ})|²), for the auto-regressive polynomial φ(z) = 1 − φ_1 z − ··· − φ_p z^p. By the Kolmogorov-Szegő formula (See ??), or direct computation, ∫ log f_θ(λ) dλ = 2π log(σ²/2π) is independent of the parameters φ_1, …, φ_p. Thus the stationary equations for maximizing the Whittle approximation with respect to the parameters φ_k take the form

0 = (∂/∂φ_k) ∫ I_n(λ) σ²/(2π f_θ(λ)) dλ = (∂/∂φ_k) ∫ φ(e^{iλ}) φ(e^{−iλ}) I_n(λ) dλ
  = ∫ ( −e^{ikλ} φ(e^{−iλ}) − φ(e^{iλ}) e^{−ikλ} ) I_n(λ) dλ
  = −2 Re ∫ ( e^{ikλ} − φ_1 e^{i(k−1)λ} − ··· − φ_p e^{i(k−p)λ} ) I_n(λ) dλ
  = −2 Re ( γ̂_n(k) − φ_1 γ̂_n(k−1) − ··· − φ_p γ̂_n(k−p) ),

because γ̂_n(h) = n^{-1} Σ_{t=1}^{n−|h|} X_{t+|h|} X_t are the Fourier coefficients of the function I_n for |h| < n, by (12.2). Thus the stationary equations are the Yule-Walker equations, apart from the fact that the observations have been centered at mean zero, rather than at X̄_n.
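The equivalence can also be checked numerically; a minimal sketch (illustrative only, for data that are already centered at zero) solves the Yule-Walker equations built from the sample autocovariances:

import numpy as np

def sample_autocov(x, maxlag):
    # hat gamma_n(h) = n^{-1} sum_{t=1}^{n-h} x_{t+h} x_t
    n = len(x)
    return np.array([np.dot(x[h:], x[:n - h]) / n for h in range(maxlag + 1)])

def yule_walker(x, p):
    g = sample_autocov(x, p)
    G = g[np.abs(np.subtract.outer(np.arange(p), np.arange(p)))]   # Toeplitz matrix of gamma(0..p-1)
    return np.linalg.solve(G, g[1:])

rng = np.random.default_rng(2)
x = np.zeros(400)
for t in range(1, 400):                     # simulate an AR(1) series with phi = 0.6
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
print("Yule-Walker estimate:", yule_walker(x, 1))

For p = 1 the solution is simply γ̂_n(1)/γ̂_n(0), and it should agree with the Whittle maximizer of the preceding sketch up to grid resolution.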
13.15 EXERCISE. Derive the Whittle estimator for σ² for an autoregressive process.
If we write I_n(f) for ∫ I_n(λ) f(λ) dλ, then a Whittle estimator is a point of minimum of the map

θ ↦ M_n(θ) = ∫_{−π}^{π} log f_θ(λ) dλ + I_n(1/f_θ).
In Section 12.4 it is shown that the sequence √n ( I_n(f) − ∫ f f_X dλ ) is asymptotically normally distributed with mean zero and some covariance matrix Σ(f), under some conditions. This implies that the sequence M_n(θ) converges for every fixed θ in probability to

M(θ) = ∫ log f_θ(λ) dλ + ∫ (f_X/f_θ)(λ) dλ.

By reasoning as in Section 13.1 we expect that the Whittle estimators θ̂_n will be asymptotically consistent for the parameter θ_0 that minimizes the function θ ↦ M(θ).
If the true spectral density f_X takes the form f_X = f_{θ_0} for some parameter θ_0, then this parameter is the minimizing value. Indeed, by the inequality −log x + (x − 1) ≥ (√x − 1)², valid for every x ≥ 0 (it is equivalent to log y ≤ y − 1 with y = √x),

M(θ) − M(θ_0) = ∫ ( log (f_θ/f_{θ_0})(λ) + (f_{θ_0}/f_θ)(λ) − 1 ) dλ ≥ ∫ ( √(f_{θ_0}/f_θ)(λ) − 1 )² dλ.

This shows that the function M(θ) possesses a minimum value at θ = θ_0, and this point of minimum is unique as soon as the parameter θ is identifiable from the spectral density.
To derive the form of the limit distribution of the Whittle estimators we replace the derivative θ ↦ Ṁ_n(θ) by its linear approximation, as in Section 13.1, and obtain that

√n (θ̂_n − θ_0) = −( M̈_n(θ̃_n) )^{-1} √n Ṁ_n(θ_0),

for a point θ̃_n between θ̂_n and θ_0. Denoting the gradient and second order derivative matrix of the function θ ↦ log f_θ(λ) by ℓ̇_θ(λ) and ℓ̈_θ(λ), we can write

Ṁ_n(θ) = ∫ ℓ̇_θ(λ) dλ − I_n( ℓ̇_θ/f_θ ),
M̈_n(θ) = ∫ ℓ̈_θ(λ) dλ + I_n( (ℓ̇_θ ℓ̇_θ^T − ℓ̈_θ)/f_θ ).

By the results of Section 12.4 the sequence √n Ṁ_n(θ_0) converges in distribution to a normal distribution with mean zero and covariance matrix Σ(ℓ̇_{θ_0}/f_{θ_0}), under some conditions. Furthermore, the sequence M̈_n(θ_0) converges in probability to ∫ ℓ̇_{θ_0} ℓ̇_{θ_0}^T(λ) dλ =: J_{θ_0}. If both assertions hold, then we obtain that

√n (θ̂_n − θ_0) ⇝ N( 0, J_{θ_0}^{-1} Σ(ℓ̇_{θ_0}/f_{θ_0}) J_{θ_0}^{-1} ).

The asymptotic covariance is of the "sandwich" form.

If the time series X_t is Gaussian, then the Whittle likelihood is an approximation for the correctly specified likelihood, and the asymptotic covariance reduces to a simpler expression. In this case, since the kurtosis of the normal distribution is 0,

Σ(f) = 4π ∫ f f^T(λ) f_X²(λ) dλ.

It follows that in this case, and with f_X = f_{θ_0}, the asymptotic covariance Σ(ℓ̇_{θ_0}/f_{θ_0}) of the sequence √n Ṁ_n(θ_0) reduces to 4π ∫ ℓ̇_{θ_0} ℓ̇_{θ_0}^T(λ) dλ = 4π J_{θ_0}, and the sandwich covariance reduces to 4π J_{θ_0}^{-1}.
13.16 Example (ARMA). The log spectral density of a stationary, causal, invertible ARMA(p, q) process with parameter vector θ = (σ², φ_1, …, φ_p, θ_1, …, θ_q) can be written in the form

log f_θ(λ) = log σ² − log(2π) + log θ(e^{iλ}) + log θ(e^{−iλ}) − log φ(e^{iλ}) − log φ(e^{−iλ}).

Straightforward differentiation shows that the gradient of this function is equal to

ℓ̇_θ(λ) = ( σ^{−2},
            e^{ikλ}/φ(e^{iλ}) + e^{−ikλ}/φ(e^{−iλ}),
            e^{ilλ}/θ(e^{iλ}) + e^{−ilλ}/θ(e^{−iλ}) )_{k=1,…,p, l=1,…,q}.

Here the second and third lines of the vector on the right are abbreviations of vectors of length p and q, obtained by letting k and l range over the values 1, …, p and 1, …, q, respectively. The matrix J_θ = ∫ ℓ̇_θ ℓ̇_θ^T(λ) dλ takes the form

J_θ = ( 2π σ^{−4}    0          0
        0            AR         MAAR
        0            MAAR^T     MA ),

where AR, MA, and MAAR are matrices of dimensions (p × p), (q × q) and (p × q), respectively, which are described in more detail in the following. The zeros must be replicated to fulfil the dimension requirements, and result from calculations of the type, for k ≥ 1,

∫_{−π}^{π} ( e^{ikλ}/φ(e^{iλ}) ) dλ = (1/i) ∮_{|z|=1} ( z^{k−1}/φ(z) ) dz = 0,

by Cauchy's theorem, because the function z ↦ z^{k−1}/φ(z) is analytic on a neighbourhood of the unit disc, by the assumption of causality of the ARMA process.

Using the identity (f + f̄)(g + ḡ) = 2 Re(fg + f ḡ) we can compute the (l, l′)-element of the matrix MA as

2 Re ∫ ( (e^{ilλ}/θ(e^{iλ})) (e^{il′λ}/θ(e^{iλ})) + (e^{ilλ}/θ(e^{iλ})) (e^{−il′λ}/θ(e^{−iλ})) ) dλ
  = 2 Re ( 0 + ∫ e^{i(l−l′)λ}/|θ(e^{iλ})|² dλ ) = 4π γ_V(l − l′),

where V_t is a stationary auto-regressive process satisfying θ(B)V_t = Z_t for a white noise process Z_t of unit variance. The matrix AR can be expressed similarly as 4π times the covariance matrix of p consecutive elements of an auto-regressive process U_t satisfying φ(B)U_t = Z_t. The (k, l)-element of the matrix MAAR can be written in the form

2 Re ∫ ( (e^{ikλ}/φ(e^{iλ})) (e^{ilλ}/θ(e^{iλ})) + (e^{ikλ}/φ(e^{iλ})) (e^{−ilλ}/θ(e^{−iλ})) ) dλ = 2 Re ( 0 + 2π ∫ f_{UV}(λ) e^{i(k−l)λ} dλ ).

Here f_{UV}(λ) = 1/(2π φ(e^{iλ}) θ(e^{−iλ})) is the cross spectral density of the auto-regressive processes U_t and V_t defined previously (using the same white noise process Z_t)(?). Hence the integral on the far left is equal to 2π times the complex conjugate of the cross covariance γ_{UV}(k − l).

Taking this all together we see that deleting the first row and first column from the matrix J_θ/(4π) yields the matrix J_{φ_p,θ_q} that occurs in Theorem 13.12. Thus the Whittle estimators and maximum Gaussian likelihood estimators have asymptotically identical behaviour.

The Whittle estimator for σ² is asymptotically independent of the estimators of the remaining parameters.
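A numerical check of the identification of J_{φ_p,θ_q} is easy in the ARMA(1, 1) case. The sketch below (assuming the conventions φ(z) = 1 − φz and θ(z) = 1 + θz, for which the covariance matrix of (U_1, V_1) equals [[1/(1 − φ²), 1/(1 + φθ)], [1/(1 + φθ), 1/(1 − θ²)]]) compares this matrix with a quadrature approximation of the (φ, θ)-block of J_θ/(4π):

import numpy as np

phi, theta = 0.5, 0.3
lam = np.linspace(-np.pi, np.pi, 20000, endpoint=False)
dlam = 2 * np.pi / len(lam)

# gradient of log f_theta with respect to (phi_1, theta_1), as in the display above
dphi = 2 * np.real(np.exp(1j * lam) / (1 - phi * np.exp(1j * lam)))
dtheta = 2 * np.real(np.exp(1j * lam) / (1 + theta * np.exp(1j * lam)))

J = np.array([[np.sum(dphi * dphi), np.sum(dphi * dtheta)],
              [np.sum(dtheta * dphi), np.sum(dtheta * dtheta)]]) * dlam / (4 * np.pi)

expected = np.array([[1 / (1 - phi ** 2), 1 / (1 + phi * theta)],
                     [1 / (1 + phi * theta), 1 / (1 - theta ** 2)]])
print(np.round(J, 4))
print(np.round(expected, 4))

Both printed matrices should agree to several decimals, up to quadrature error.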
13.4.2 Gaussian Time Series
In this section we study the behaviour of the maximum likelihood estimators for general Gaussian time series in more detail. Thus θ̂_n is the point of maximum of (13.10) (or equivalently (13.11)), and we study the properties of the sequence √n(θ̂_n − θ) under the assumption that the true density of (X_1, …, X_n) possesses the form (13.10), for some θ. For simplicity we assume that the time series is centered at mean zero, so that the model is completely parametrized by the covariance matrix Γ_n(θ). Equivalently, it is determined by the spectral density f_θ, which is related to the covariance matrix by

( Γ_n(θ) )_{s,t} = ∫_{−π}^{π} e^{iλ(s−t)} f_θ(λ) dλ.

It is easier to express conditions and results in terms of the spectral density f_θ, which is fixed, than in terms of the sequence of matrices Γ_n(θ). The asymptotic Fisher information for θ is defined as

I_θ = (1/4π) ∫_{−π}^{π} (∂/∂θ log f_θ(λ)) (∂/∂θ log f_θ(λ))^T dλ.
13.17 Theorem. Suppose that X_t is a Gaussian time series with zero mean and spectral density f_θ such that the map θ ↦ f_θ is one-to-one and the map (θ, λ) ↦ f_θ(λ) is three times continuously differentiable and strictly positive. Suppose that θ ranges over a bounded, open subset of ℝ^d. Then the maximum likelihood estimator θ̂_n based on X_1, …, X_n satisfies

√n (θ̂_n − θ) ⇝ N(0, I_θ^{-1}).

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIII.
The theorem is similar in form to the theorem for maximum likelihood estimators based on replicated experiments. If p_{n,θ} is the density of X⃗_n = (X_1, …, X_n) (given in (13.10)), then it can be shown under the conditions of the theorem that

I_{n,θ} := (1/n) E_θ (∂/∂θ log p_{n,θ}(X⃗_n)) (∂/∂θ log p_{n,θ}(X⃗_n))^T → I_θ.

The left side of this display is the true (average) Fisher information (per observation) for θ based on X⃗_n, and this explains the name "asymptotic Fisher information" for I_θ. With this in mind the analogy with the situation for replicated experiments is perfect.
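The convergence in the last display can be observed numerically. The sketch below (using the standard expression (1/2) tr(Γ^{-1}(∂_iΓ)Γ^{-1}(∂_jΓ)) for the Fisher information of a mean-zero Gaussian vector, a central finite difference for ∂Γ_n(θ)/∂φ, and a Gaussian AR(1) model with unit innovation variance) computes the φ-entry of I_{n,θ} for increasing n and compares it with the corresponding entry 1/(1 − φ²) of I_θ:

import numpy as np

def ar1_cov(phi, n, sigma2=1.0):
    # Gamma_n for an AR(1) process: gamma(h) = sigma^2 phi^|h| / (1 - phi^2)
    h = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma2 * phi ** h / (1 - phi ** 2)

def fisher_phi_per_obs(phi, n, eps=1e-5):
    # (1/2n) tr( Gamma^{-1} dGamma Gamma^{-1} dGamma ), with dGamma by central differences
    Gamma = ar1_cov(phi, n)
    dGamma = (ar1_cov(phi + eps, n) - ar1_cov(phi - eps, n)) / (2 * eps)
    A = np.linalg.solve(Gamma, dGamma)
    return 0.5 * np.trace(A @ A) / n

phi = 0.6
print("limit 1/(1-phi^2):", 1 / (1 - phi ** 2))
for n in (10, 50, 200, 800):
    print(n, fisher_phi_per_obs(phi, n))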
13.5 Model Selection
In the preceding sections and chapters we have studied estimators for the parameters of ARMA or GARCH processes assuming that the orders p and q are known a priori. In practice reasonable values of p and q can be chosen from plots of the auto-correlation and the partial auto-correlation functions, followed by diagnostic checking after fitting a particular model. Alternatively (or in addition) we can estimate appropriate values of p and q and the corresponding parameters simultaneously from the data. The maximum likelihood method must then be augmented by penalization.

The value of the likelihood (13.1) depends on the dimension of the parameter θ. If models of different dimension are available, then we can make the dependence explicit by denoting the log likelihood as, with d the dimension of the model,

Λ_n(θ, d) = Σ_{t=1}^n log p_{θ,d}(X_t | X_{t−1}, …, X_1).

A first idea to select a reasonable dimension is to maximize the function Λ_n jointly over (θ, d). This rarely works. The models of interest are typically nested in that a model of dimension d is a submodel of a model of dimension d + 1. The maximum over (θ, d) is then taken for the largest possible dimension. To counter this preference for large dimension we can introduce a penalty function. Instead of Λ_n we maximize

(θ, d) ↦ Λ_n(θ, d) − φ_n(d),

where φ_n is a fixed function that takes large values for large values of its argument. Maximizing this function jointly over (θ, d) must strike a balance between maximizing Λ_n, which leads to big values of d, and minimizing φ_n, which leads to small values of d. The choice of penalty function is crucial for this balance to yield good results.

Several penalty functions are in use, each of them motivated by certain considerations. There is no general agreement as to which penalty function works best, partly because there are several reasonable criteria for "best". Three examples for models of dimension d are

AIC(d) = d,
AICC(d) = nd/(n − d − 1),
BIC(d) = ½ d log n.

The abbreviations are for Akaike's Information Criterion, Akaike's information corrected criterion, and Bayesian Information Criterion, respectively.
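As an illustration of how these penalties are used, the sketch below (which approximates the maximized Gaussian log likelihood of an AR(p) model by the conditional least squares value −(m/2)(log(2πσ̂_p²) + 1), with m the number of residuals and σ̂_p² their mean square, and which takes d = p + 1 as the model dimension) selects the order of an AR model fitted to a simulated AR(2) series:

import numpy as np

rng = np.random.default_rng(3)
n = 300
x = np.zeros(n)
for t in range(2, n):                       # simulate an AR(2) series
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

def conditional_loglik_ar(x, p):
    # maximized Gaussian log likelihood of AR(p), conditional on the first p observations
    if p == 0:
        resid = x
    else:
        X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
        y = x[p:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
    m = len(resid)
    s2 = np.mean(resid ** 2)
    return -0.5 * m * (np.log(2 * np.pi * s2) + 1)

penalties = {
    "AIC": lambda d: d,
    "AICC": lambda d: n * d / (n - d - 1),
    "BIC": lambda d: 0.5 * d * np.log(n),
}
for name, pen in penalties.items():
    scores = [conditional_loglik_ar(x, p) - pen(p + 1) for p in range(7)]  # d = p + 1
    print(name, "selects p =", int(np.argmax(scores)))

With a moderate sample size BIC typically selects p = 2, while AIC may select a somewhat larger order, in line with the remarks following Theorem 13.18 below.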
It seems reasonable to choose a penalty function such that as n → ∞ the value d̂_n that maximizes the penalized likelihood converges to the true value (in probability or almost surely). By the following theorem penalties such that φ_n(d) → ∞ faster than loglog n achieve this aim in the case of ARMA processes. Here an ARMA(p, q) process is understood to be exactly of orders p and q, i.e. the leading coefficients of the polynomials φ and θ of degrees p and q are nonzero.
13.18 Theorem. Let X_t be a Gaussian causal, invertible stationary ARMA(p_0, q_0) process and let (θ̂, p̂, q̂) maximize the penalized likelihood over ∪_{p+q≤d_0} (Θ_{p,q}, p, q), where for each (p, q) the set Θ_{p,q} is a compact subset of ℝ^{p+q+1} consisting of parameters of a causal, invertible stationary ARMA(p, q) process and d_0 ≥ p_0 + q_0 is fixed. If φ_n(d)/n → 0 and liminf_n φ_n(d)/loglog n is sufficiently large for every d ≤ d_0, then p̂ → p_0 and q̂ → q_0 almost surely.

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIV.
The condition on the penalty is met by the BIC penalty, but not by Akaike's penalty function. It is observed in practice that the use of Akaike's criterion overestimates the order of the model. The AICC criterion, which puts a slightly bigger penalty on big models, is an attempt to correct this.

However, choosing a model of the correct order is perhaps not the most relevant criterion for "good" estimation. A different criterion is the distance of the estimated model, specified by a pair of a dimension d̂ and a corresponding parameter θ̂, to the true law of the observations. Depending on the distance used, an "incorrect" estimate d̂ together with a good estimate θ̂ of that dimension may well yield a model that is closer than the estimated model of the correct (higher) dimension. For ARMA processes the AIC criterion performs well in this respect.

The AIC criterion is based on the Kullback-Leibler distance.
13.19 EXERCISE. Repeatedly simulate a MA(1) process with θ = 0.8 for n = 50 or n = 100.
(i) Compare the quality of the moment estimator and the maximum likelihood estimator.
(ii) Are the sampling distributions of the estimators approximately normal?
13.20 EXERCISE. Find "best fitting" AR and ARMA models for the Wolfer sunspot numbers (object sunspots in Splus).