
Preface

The theory and practice of the analysis of time series has followed two lines
almost since its inception. One of these proceeds from the Fourier trans-
formation of the data and the other from a parametric representation of the
temporal relationships. Of course, the two lines are interrelated. The frequency
analysis of data was surveyed in Volume 3 of the present Handbook of Statistics
series, subtitled, Time Series in the Frequency Domain, edited by D. R.
Brillinger and P. R. Krishnaiah. Time domain methods are dealt with in this
volume. The methods are old, going back at least to the ideas of Prony in the
eighteenth century, and owe a great deal to the work of Yule early this century.
Several different techniques for classes of nonstationary processes have been
developed by various analysts. By the very nature of the subject in these cases,
the work tends to be either predominantly data analysis oriented with scant
justifications, or mathematically oriented with inevitably advanced arguments.
This volume contains descriptions of both these approaches by strengthening
the former and minimizing the latter, and yet presenting the state-of-the-art in
the subject. A brief indication of the work included is as follows.
One of the successful parametric models is the classical autoregressive
scheme, going back to the pioneering work of G. U. Yule, early in this century.
The model is a difference equation with constant coefficients, and much of the
classical work is done if the roots of its characteristic equation are interior to
the unit circle. If the roots are of unit modulus, the analysis presents many
difficulties. The advances made in recent years in this area are described in W.
Fuller's article. An important development in the time domain area is the work
of R. Kalman. It led to the emphasis on a formalization of rational transfer
function systems as defined by an underlying state vector generated in a
Markovian manner and observed subject to noise. This representation is
connected with a rich structure theory whose understanding is central in the
subject. It is surveyed in the article by M. Deistler. The structure and analysis of
several classes of nonstationary time series that are not of autoregressive type
but for which the ideas of Fourier analysis extend is given in the article by M.
M. Rao; and the filtering and smoothing problems are discussed by D. K.
Chang. Related results on what may be termed "asymptotically stationary" and
allied time series have been surveyed in C. S. K. Bhagavan's paper.
The papers by L. Ljung, P. Young and G. C. Tiao relate to the estimation
problems in the dynamical modelling systems. Here Young's paper deals with
the on-line (real time) calculations. One of the uses of these models has been to
analyze the consequences of an intervention (such as the introduction of
exhaust emission laws) and another to consider the outlier detection problems.
These are discussed by Tiao and T. Ozaki. Though rational transfer function
models are parametric, it is seldom the case that the model set contains the
truth and the problem may better be viewed as one of selecting a structure
from an infinite set in some asymptotically optimal manner. This point of view
is explored by R. Shibata. Though least squares techniques, applied to the
prediction errors, have dominated, there is a need to modify these to obtain
estimators less influenced by discrepant observations. This is treated by Tiao
and, in an extensive discussion, by R. D. Martin and V. J. Yohai. The model
selection and unequally spaced data are natural problems in this area confront-
ing the experimenter, and these are discussed by R. H. Jones. Since the time
points may sometimes be under control of the experimenter, their optimal
choice must be considered. This problem is treated by S. Cambanis. The
modelling in the papers referred to above has been essentially linear. Ozaki
presents an approach to the difficult problem of nonlinear modelling.
The autoregressive models may have time varying parameters, and this is
considered by D. F. Nicholls and A. R. Pagan. Their paper has special
reference to econometric data as does also the paper by H. Theil and D. G.
Fiebig who treat the problem where the regressor vectors in a multivariate
system may be of a dimension higher than the number of time points for
observation. The final two papers on applications by M. A. Cameron, P. J.
Thomson and P. de Souza complement the areas covered by the preceding
ones. These are designed to show two special applications, namely in signal
attenuation estimation and speech recognition.
Thus several aspects of the time domain analysis and the current trends are
described in the different chapters of this volume. So they will be of interest
not only to the research workers in the area of time series, but also to data
analysts who use these techniques in their work.
We wish to express our sincere appreciation to the authors for their excellent
cooperation. We also thank the North-Holland Publishing Company for their
cooperation.

E. J. Hannan
P. R. Krishnaiah
M. M. Rao
Contributors

C. S. K. Bhagavan, Dept. of Statistics, Andhra University, Waltair, India 530003 (Ch. 11)
S. Cambanis, Dept. of Statistics, University of North Carolina, Chapel Hill, NC
27514, USA (Ch. 13)
M. A. Cameron, CSIRO, Division of Mathematics & Statistics, P.O. Box 218,
Lindfield, N.S.W., Australia 2070 (Ch. 14)
D. K. Chang, Dept. of Mathematics, California State University, Los Angeles,
CA 90023, USA (Ch. 12)
M. Deistler, Institute of Econometrics, Technical University of Vienna, Argen-
tinierstr. 8, A 1040 Vienna, Austria (Ch. 9)
P. de Souza, Dept. of Mathematics, Victoria University, Wellington, New
Zealand (Ch. 15)
D. G. Fiebig, University of Sydney, Sydney, N.S.W., Australia 2006 (Ch. 17)
W. A. Fuller, Dept. of Statistics, Iowa State University, Ames, IA 50011, USA
(Ch. 1)
R. H. Jones, Scientific Computing Center, University of Colorado Medical
Center, Box B-119, Denver, CO 80262, USA (Ch. 5)
L. Ljung, Dept. of Electrical Engineering, Linköping University, S-581 83
Linköping, Sweden (Ch. 7)
R. D. Martin, Dept. of Statistics, GN22, B313 Padelford Hall, University of
Washington, Seattle, WA 98195, USA (Ch. 4)
D. F. Nicholls, Statistics Dept., Australian National University, G.P.O. Box 4,
Canberra, A.C.T., Australia 2601 (Ch. 16)
T. Ozaki, The Institute of Statistical Mathematics, 4-6-7-Minami-Azabu,
Minato-Ku, Tokyo, Japan (Ch. 2)
A.R. Pagan, Statistics Dept., Australian National University, G.P.O. Box 4,
Canberra, A.C.T., Australia 2601 (Ch. 16)
M.M. Rao, Dept. of Mathematics, University of California, Riverside, CA
92521, USA (Ch. 10)
R. Shibata, Dept. of Mathematics, Keio University, 3-14-1 Hiyoshi, Kohoku,
Yokohama 223, Japan (Ch. 6)
H. Theil, College of Business Administration, Dept. of Economics, University of
Florida, Gainesville, FL 32611, USA (Ch. 17)


P. J. Thomson, Institute of Statistics and Operations Research, Victoria University,
Wellington, New Zealand (Ch. 14, 15)
G. C. Tiao, Graduate School of Business, University of Chicago, Chicago, IL
60637, USA (Ch. 3)
V. J. Yohai, Department of Mathematics, Piso 7, University of Buenos Aires,
Argentina (Ch. 4)
P. Young, Dept. of Environmental Sciences, University of Lancaster, Lancaster
LA1 4YQ, England (Ch. 8)
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 1-23

Nonstationary Autoregressive Time Series

Wayne A. Fuller

1. Introduction

A model often used to describe the behavior of a variable over time is the autoregressive model. In this model it is assumed that the current value can be expressed as a function of preceding values and a random error. If we let $Y_t$ denote the value of the variable at time $t$, the $p$th-order real valued autoregressive time series is assumed to satisfy

$$Y_t = g(t) + \sum_{i=1}^{p} \alpha_i Y_{t-i} + e_t, \qquad t = 1, 2, \ldots, \tag{1.1}$$

where the $e_t$, $t = 1, 2, \ldots$, are random variables and $g(t)$ is a real valued fixed function of time. We have chosen to define the autoregressive time series on the positive integers, but the time series might be defined on other domains.

The statistical behavior of the time series is determined by the initial values $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$, by the function $g(t)$, by the coefficients $(\alpha_1, \alpha_2, \ldots, \alpha_p)$, and by the stochastic properties of the $e_t$. We shall, henceforth, assume that the $e_t$ have zero mean and variance $\sigma^2$. At a minimum we assume the $e_t$ to be uncorrelated. Often we assume the $e_t$ to be independently and identically distributed.
Let the joint distribution function of a finite set $\{Y_{t_1}, Y_{t_2}, \ldots, Y_{t_n}\}$ of the $Y_t$ be denoted by

$$F_{Y_{t_1}, Y_{t_2}, \ldots, Y_{t_n}}(y_{t_1}, y_{t_2}, \ldots, y_{t_n}).$$

The time series is strictly stationary if

$$F_{Y_{t_1}, \ldots, Y_{t_n}}(y_{t_1}, \ldots, y_{t_n}) = F_{Y_{t_1+h}, \ldots, Y_{t_n+h}}(y_{t_1}, \ldots, y_{t_n})$$

for all possible sets of indices $t_1, t_2, \ldots, t_n$ and $t_1 + h, t_2 + h, \ldots, t_n + h$ in the set $\{1, 2, \ldots\}$. The time series is said to be covariance stationary if

$$E\{Y_t\} = \mu, \qquad t = 1, 2, \ldots,$$

and

$$E\{(Y_t - \mu)(Y_{t+h} - \mu)\} = \gamma(h), \qquad t = 1, 2, \ldots;\ h = 0, 1, \ldots,$$

where $\mu$ is a real number and $\gamma(h)$ is a real valued function of $h$.


To study the behavior of the time series $Y_t$ we solve the difference equation (1.1) and express $Y_t$ as a function of $(e_1, e_2, \ldots, e_t)$ and $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$. The difference equation

$$\omega_i = \sum_{j=1}^{p} \alpha_j \omega_{i-j} \tag{1.2}$$

with initial conditions

$$\omega_0 = 1, \qquad \omega_i = 0, \quad i = -1, -2, \ldots,$$

has solution of the form

$$\omega_i = \sum_{j=1}^{p} c_{ji} m_j^i, \tag{1.3}$$

where the $m_j$ are the roots of the characteristic equation

$$m^p - \sum_{j=1}^{p} \alpha_j m^{p-j} = 0, \tag{1.4}$$

the coefficients $c_{ji}$ are of the form

$$c_{ji} = b_j i^{k_j}, \tag{1.5}$$

and the $b_j$ are such that the initial conditions are satisfied. The exponent $k_j$ is zero if the root $m_j$ is a distinct root. A root with multiplicity $r$ has $r$ coefficients with $k_j = 0, 1, \ldots, r-1$.

Using the $\omega_i$, the time series $Y_t$ can be written as

$$Y_t = \sum_{i=0}^{t-1} \omega_i e_{t-i} + \sum_{i=0}^{p-1} \omega_{t+i} Y_{-i} + \sum_{i=0}^{t-1} \omega_i g(t-i). \tag{1.6}$$

The mean of $Y_t$ is

$$E\{Y_t\} = \sum_{i=0}^{t-1} \omega_i g(t-i) + \sum_{i=0}^{p-1} \omega_{t+i} E\{Y_{-i}\}. \tag{1.7}$$

Therefore, if $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$ is a fixed vector, the variance of $Y_t$ is a function of $t$ and $Y_t$ is not stationary.
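To make the recursion (1.2) concrete, the following sketch (our illustration, not part of the original text; the AR(2) coefficients are arbitrary) computes the $\omega_i$ and finds the roots of the characteristic equation (1.4) numerically:

```python
import numpy as np

def omega_weights(alpha, n_terms):
    """Solve the difference equation (1.2), omega_i = sum_j alpha_j * omega_{i-j},
    with initial conditions omega_0 = 1 and omega_i = 0 for i < 0."""
    omega = np.zeros(n_terms)
    omega[0] = 1.0
    for i in range(1, n_terms):
        omega[i] = sum(a * omega[i - j] for j, a in enumerate(alpha, start=1) if i - j >= 0)
    return omega

alpha = [1.2, -0.35]                           # illustrative AR(2) coefficients
print(omega_weights(alpha, 8))                 # the weights die out geometrically here
# Roots of the characteristic equation (1.4): m^p - alpha_1 m^{p-1} - ... - alpha_p = 0
print(np.roots([1.0] + [-a for a in alpha]))   # 0.7 and 0.5, both inside the unit circle
```

When all roots satisfy $|m_j| < 1$ the weights $\omega_i$ tend to zero, which is exactly the condition used in the stationarity discussion that follows.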

If the roots of (1.4) are less than one in absolute value, then $\omega_i$ goes to zero as $i$ goes to infinity. One common model is that in which $g(t) \equiv \alpha_0$. Assume that $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$ is a vector of random variables with common mean

$$\mu = \alpha_0\Bigl(1 - \sum_{i=1}^{p} \alpha_i\Bigr)^{-1}, \tag{1.8}$$

common variance

$$\gamma(0) = \sigma^2 \sum_{i=0}^{\infty} \omega_i^2, \tag{1.9}$$

and covariances

$$\mathrm{Cov}\{Y_t, Y_{t+h}\} = \sigma^2 \sum_{i=0}^{\infty} \omega_i \omega_{i+h}, \qquad t,\, t+h = 0, -1, \ldots, -p+1. \tag{1.10}$$

If $g(t) = \alpha_0$, if $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$ is independent of $(e_1, e_2, \ldots)$, and if the initial conditions satisfy (1.8), (1.9) and (1.10), then $Y_t$ is covariance stationary.

If the initial conditions do not satisfy (1.8), (1.9) and (1.10), the time series will display a different behavior for small $t$ than for large $t$. However, if $g(t) = \alpha_0$ and the roots of the characteristic equation are less than one in absolute value, the nonstationarity is transitory. In such a situation, the large-$t$ behavior is that of a stationary time series.

2. The first-order model

We begin our discussion with the first-order model

$$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + e_t, \qquad t = 1, 2, \ldots, \tag{2.1}$$

with $Y_t = Y_0$ at $t = 0$. Given $n$ observations on the process, several inference problems can be considered. One is the estimation of $\alpha_1$. Closely related to the estimation problem is the problem of testing hypotheses about $\alpha_1$, particularly the hypothesis that $\alpha_1 = 1$. Finally, one may be interested in predicting future observations.

A natural estimator for $(\alpha_0, \alpha_1)$ is the least squares estimator obtained by regressing $Y_t$ on $Y_{t-1}$, including an intercept in the regression. The estimators are

$$\hat\alpha_1 = \Bigl[\sum_{t=1}^{n} (Y_{t-1} - \bar{y}_{(-1)})^2\Bigr]^{-1} \sum_{t=1}^{n} (Y_{t-1} - \bar{y}_{(-1)})(Y_t - \bar{y}_{(0)}), \qquad \hat\alpha_0 = \bar{y}_{(0)} - \hat\alpha_1 \bar{y}_{(-1)}, \tag{2.2}$$

where

$$\bar{y}_{(-1)} = n^{-1}\sum_{t=1}^{n} Y_{t-1}, \qquad \bar{y}_{(0)} = n^{-1}\sum_{t=1}^{n} Y_t.$$

These estimators are the maximum likelihood estimators for normal $e_t$ and fixed $Y_0$. The distribution of $\hat\alpha_1$ depends upon the true value of $\alpha_1$, the initial conditions, and the distribution of the $e_t$. The error in the estimator of $\alpha_1$ can be written

$$\hat\alpha_1 - \alpha_1 = \Bigl[\sum_{t=1}^{n} (Y_{t-1} - \bar{y}_{(-1)})^2\Bigr]^{-1} \sum_{t=1}^{n} (Y_{t-1} - \bar{y}_{(-1)})(e_t - \bar{e}), \tag{2.3}$$

where $\bar{e} = n^{-1}\sum_{t=1}^{n} e_t$. Under the assumption that the $e_t$ are uncorrelated, the expected value of the numerator is zero. The limiting behavior of the estimator is determined by the joint behavior of the sample moments in the numerator and denominator of (2.3). The limiting distributions of $\hat\alpha_1$ are characterized in Table 2.1.
For a time series with $|\alpha_1| < 1$, the limiting distribution of $n^{1/2}(\hat\alpha_1 - \alpha_1)$ is normal under quite weak assumptions. The first proof of the limiting normal distribution was given by Mann and Wald (1943). There have been a number of extensions since that time. Because $\omega_i \to 0$ as $i \to \infty$, the initial value $Y_0$, for any real $Y_0$, will not influence the limiting distribution, though the influence for small samples could be large. The variance of the limiting distribution of $n^{1/2}(\hat\alpha_1 - \alpha_1)$ is $1 - \alpha_1^2$.

Table 2.1
Limiting properties of the least squares estimator of α_1

α_1         α_0        Initial value Y_0   Distribution of e_t   Standardizing function^a       Form^b
|α_1| < 1   any real   any real            IID(0, σ^2)           n^{1/2}(1 - α_1^2)^{-1/2}      N(0, 1)
|α_1| = 1   α_0 ≠ 0    any real            IID(0, σ^2)           n^{3/2}                        Normal
|α_1| = 1   α_0 = 0    any real            IID(0, σ^2)           n                              Tabulated
|α_1| > 1   α_0 = 0    Y_0 = 0             NID(0, σ^2)           (α_1^2 - 1)^{-1} α_1^n         Cauchy
|α_1| > 1   α_0 = 0    Y_0 = 0             IID(0, σ^2)           (α_1^2 - 1)^{-1} α_1^n         ?
|α_1| > 1   α_0 ≠ 0    Y_0 ≠ 0             NID(0, σ^2)           (α_1^2 - 1)^{-1} α_1^n         N(0, 1)/N(ξ, 1)

^a The standardizing function is a multiplier of (α̂_1 - α_1) that produces a nondegenerate limiting distribution.
^b The constant ξ = Y_0 + α_0(1 - α_1)^{-1}.
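The third row of Table 2.1 is easy to reproduce by simulation. The following sketch (ours, with arbitrary sample size and replication count) draws random walks and computes $n(\hat\alpha_1 - 1)$ for the intercept-included least squares estimator of (2.2); the empirical percentiles approximate the tabulated distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha1_hat(y):
    """Least squares estimator of (2.2): slope of the regression of Y_t on Y_{t-1}
    with an intercept included."""
    ylag, ycur = y[:-1], y[1:]
    d = ylag - ylag.mean()
    return np.sum(d * (ycur - ycur.mean())) / np.sum(d**2)

n, reps = 250, 5000
stats = np.empty(reps)
for r in range(reps):
    y = np.cumsum(rng.standard_normal(n + 1))    # alpha_1 = 1, alpha_0 = 0, Y_0 = 0
    stats[r] = n * (alpha1_hat(y) - 1.0)

print(np.percentile(stats, [1, 5, 10, 50, 90]))  # heavily skewed to the left, not normal
```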
The result of Table 2.1 is stated for independently and identically distributed random variables, but the limiting distribution of $n^{1/2}(\hat\alpha_1 - \alpha_1)$ is also normal for $e_t$ that are martingale differences. For example, see Hannan and Heyde (1972) and Crowder (1980).
If $|\alpha_1| = 1$ and $\alpha_0 = 0$, there is no simple closed form expression for the limiting distribution of $n(\hat\alpha_1 - \alpha_1)$. The limiting distribution of $n(\hat\alpha_1 - \alpha_1)$ is that of a function of three random variables,

$$n(\hat\alpha_1 - \alpha_1) \xrightarrow{L} [2(\Gamma - W^2)]^{-1}[(T^2 - 1) - 2TW], \tag{2.4}$$

where

$$(\Gamma, T, W) = \Bigl(\sum_{i=1}^{\infty} \gamma_i^2 Z_i^2,\ 2^{1/2}\sum_{i=1}^{\infty} \gamma_i Z_i,\ 2^{1/2}\sum_{i=1}^{\infty} \gamma_i^2 Z_i\Bigr),$$

$$\gamma_i = (-1)^{i+1} 2[(2i-1)\pi]^{-1},$$

and $\{Z_i\}$ is a sequence of NID(0, 1) random variables. Tables of the distribution are given in Fuller (1976) and the distribution has been discussed by Dickey and Fuller (1979). The estimator of $\alpha_1$ constructed under the knowledge that $\alpha_0 = 0$ has been studied by White (1958), Rao (1978a, 1978b), Dickey and Fuller (1979), and Evans and Savin (1981a). It is interesting that the normalization required to obtain a limiting distribution for $\hat\alpha_1$ when $|\alpha_1| = 1$ is $n$, not $n^{1/2}$. The basis for the normalization is partly explained by examining the sum of squares in the denominator of $\hat\alpha_1$. If $Y_t$ is stationary, $E\{Y_t^2\}$ is a constant for all $t$ and $\sum_{t=1}^{n}(Y_{t-1} - \bar{y}_{(-1)})^2$ is nearly a constant multiple of $n$. This remains true for $|\alpha_1| < 1$ and any fixed real initial conditions. If $|\alpha_1| = 1$ and $\alpha_0 = 0$,

$$E\{Y_t^2\} = t\sigma^2$$

and

$$E\Bigl\{\sum_{t=1}^{n} Y_{t-1}^2 - n\bar{y}_{(-1)}^2\Bigr\} = [2^{-1}n(n-1) - 6^{-1}(n-1)(2n-1)]\sigma^2 = 6^{-1}(n^2 - 1)\sigma^2.$$

If $\alpha_0 \neq 0$ and $\alpha_1 = 1$, then

$$Y_t = Y_0 + \alpha_0 t + \sum_{j=1}^{t} e_j$$

and it follows that

$$E\Bigl\{\sum_{t=1}^{n} (Y_t - \bar{Y})^2\Bigr\} = (12)^{-1}(n^2 - 1)(n\alpha_0^2 + 2\sigma^2).$$

Therefore, the normalization required to obtain a limiting distribution for the estimator in the first three situations of Table 2.1 is proportional to the square root of the expected value of the denominator of $\hat\alpha_1$.

If $|\alpha_1| = 1$ and $\alpha_0 \neq 0$, the term $\alpha_0 t$ will ultimately dominate the expression for $Y_t$, explaining the fact that there is a limiting normal distribution for the least squares estimator of $\alpha_1$ in the model where the true, but unknown, value of $\alpha_0$ is not zero.
If $|\alpha_1| > 1$, $\alpha_0 = 0$ and $Y_0 = 0$, then $Y_t$ can be written as

$$Y_t = \sum_{i=0}^{t-1} \alpha_1^i e_{t-i} = \alpha_1^t \sum_{j=1}^{t} \alpha_1^{-j} e_j = \alpha_1^t \Bigl(X - \sum_{j=t+1}^{\infty} \alpha_1^{-j} e_j\Bigr),$$

where

$$X = \sum_{j=1}^{\infty} \alpha_1^{-j} e_j.$$

Therefore, $\alpha_1^{-t} Y_t$ converges to the random variable $X$ as $t$ becomes large. It is also true that

$$\alpha_1^{-2n} \sum_{t=1}^{n} Y_t^2 \xrightarrow{P} (\alpha_1^2 - 1)^{-1} X^2.$$

The limiting properties of the estimator of $\alpha_1$ follow from these results. Because the sum of squares of $Y_t$ is increasing at the rate $\alpha_1^{2n}$, the least squares estimator of $\alpha_1$ converges to $\alpha_1$ very rapidly and it is necessary to multiply $\hat\alpha_1 - \alpha_1$ by $\alpha_1^n$ to obtain a limiting distribution. The limiting distribution of $\alpha_1^n(\hat\alpha_1 - \alpha_1)$ is that of the ratio of two random variables. The variable $X$ (or $X$ plus a constant) is in the denominator and the numerator variable is an independent random variable whose distribution is the limiting distribution of

$$\sum_{i=0}^{n-1} \alpha_1^{-i} e_{n-i}.$$

Therefore, if $\alpha_0 = 0$, $Y_0 = 0$ and the $e_t$ are normally distributed, the limiting distribution is that of a Cauchy random variable. This result was obtained by White (1958) and has been extended by Anderson (1959), Rao (1961), Venkataraman (1967), Narasimham (1969), and Hasza (1977). If $\alpha_0 \neq 0$ or $Y_0 \neq 0$, the denominator random variable has a nonzero mean (see Table 2.1). If the $e_t$ are not normally distributed, the form of the limit distribution depends upon the form of the distribution of the $e_t$.
To summarize, the least squares estimator of $\alpha_1$ has a limiting distribution for any value of $\alpha_1$, but the standardizing function of $n$ required to obtain a limiting distribution is a function of $\alpha_1$, $\alpha_0$ and $Y_0$. Also, the form of the distribution is a function of the same three parameters. An interesting aspect of the limiting distribution of the estimator of $\alpha_1$ is that $\sigma^2$ is not a parameter of the distribution. This is because the least squares estimator of $\alpha_1$ is invariant to changes in the scale of $Y_t$.
The case of $|\alpha_1| = 1$ is clearly a boundary case. Fuller (1979) has shown that slight changes in the definition of the estimator produce different limiting distributions. For example, if it is known that $|\alpha_1| \le 1$, and if one has observations $(Y_0, Y_1, \ldots, Y_n)$, one might use the estimator

$$\tilde\alpha_1 = \Bigl[(Y_0 - \bar{y})^2 + 2\sum_{t=1}^{n-1} (Y_t - \bar{y})^2 + (Y_n - \bar{y})^2\Bigr]^{-1} 2\sum_{t=1}^{n} (Y_{t-1} - \bar{y})(Y_t - \bar{y}), \tag{2.6}$$

where

$$\bar{y} = (n+1)^{-1} \sum_{t=0}^{n} Y_t.$$

This estimator is restricted to $[-1, 1]$ and is the estimator for the first-order process used in the maximum entropy method of spectral estimation described by Burg (1975) and Ulrych and Bishop (1975). If $\alpha_1 = 1$, then

$$n(\tilde\alpha_1 - \alpha_1) \xrightarrow{L} -2\Bigl[\sum_{i=1}^{\infty} \gamma_i^2 u_i^2\Bigr]^{-1}, \tag{2.7}$$

where $\{u_i\}$ is a sequence of NID(0, 1) random variables,

$$\gamma_{2i-1}^2 = (4i^2\pi^2)^{-1}, \qquad \gamma_{2i}^2 = (4Z_i^2)^{-1},$$

and $Z_i$ is the $i$th positive zero of the function $t^{-2}\sin t - t^{-1}\cos t$. The limiting distribution was obtained in a different context by Anderson and Darling (1952) and is discussed by MacNeil (1978). The distribution defined in (2.7) is much easier to tabulate than that of $\hat\alpha_1$, where $\hat\alpha_1$ is defined in (2.2), because the characteristic function for (2.7) may be obtained and inverted numerically. Statistics closely related to $\tilde\alpha_1$ have been discussed by Durbin (1973), Sargan and Bhargava (1983) and Bhargava (1983).
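The estimator (2.6) is simple to compute; a minimal sketch (our code, not from the original text) follows. Because twice the absolute cross-product term is bounded by the denominator, the estimate always falls in $[-1, 1]$.

```python
import numpy as np

def alpha_tilde(y):
    """Estimator (2.6): the Burg-type first-order estimator, y = (Y_0, ..., Y_n)."""
    d = y - y.mean()                       # ybar = (n+1)^{-1} sum_{t=0}^{n} Y_t
    num = 2.0 * np.sum(d[:-1] * d[1:])     # 2 * sum (Y_{t-1} - ybar)(Y_t - ybar)
    den = d[0]**2 + 2.0 * np.sum(d[1:-1]**2) + d[-1]**2
    return num / den

rng = np.random.default_rng(1)
print(alpha_tilde(np.cumsum(rng.standard_normal(201))))   # near, but never above, 1
```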
Lai and Siegmund (1983) consider a sampling scheme in which observations are taken from the time series until

$$\sum_{t=1}^{n_c} Y_{t-1}^2 > c\sigma^2,$$

where $c$ is a specified constant and $n_c$ is the smallest number such that the inequality holds. For this sampling scheme and the model with $\alpha_0 = 0$ known, they show that

$$\Bigl(\sum_{t=1}^{n_c} Y_{t-1}^2\Bigr)^{1/2} (\hat\alpha_1 - \alpha_1) \xrightarrow{L} N(0, \sigma^2)$$

as $c \to \infty$, uniformly for $-1 \le \alpha_1 \le 1$. Thus, for a particular kind of sampling, a limiting normal distribution is also obtained for the unit root case.
The least squares estimator of $\alpha_0$ given in (2.2) can be written as

$$\hat\alpha_0 = \bar{y}_{(0)} - \hat\alpha_1 \bar{y}_{(-1)}. \tag{2.8}$$

Therefore, the distribution of $\hat\alpha_0$ is intimately related to that of $\hat\alpha_1 - \alpha_1$. For the model with $|\alpha_1| < 1$, the limiting distribution of $n^{1/2}(\hat\alpha_0 - \alpha_0)$ is normal. For other situations, the limiting distribution is more complicated.
The fact that the distribution of $\hat\alpha_1$ does not depend on $\sigma^2$ permits one to use the distribution of Table 2.1 for inference about $\alpha_1$. Another statistic that is natural to use for inference purposes is the Studentized statistic

$$\hat\tau = [\hat{V}\{\hat\alpha_1\}]^{-1/2}(\hat\alpha_1 - 1), \tag{2.9}$$

where

$$\hat{V}\{\hat\alpha_1\} = \Bigl[\sum_{t=1}^{n} (Y_{t-1} - \bar{y}_{(-1)})^2\Bigr]^{-1} \hat\sigma^2,$$

$$\hat\sigma^2 = (n - 2)^{-1} \sum_{t=1}^{n} \bigl[Y_t - \bar{y}_{(0)} - \hat\alpha_1(Y_{t-1} - \bar{y}_{(-1)})\bigr]^2.$$

The limiting distribution of the statistic $\hat\tau$ also depends upon the true parameters of the model. The types of distributions are tabulated in Table 2.2. For those situations where the limiting distribution of the standardized least squares estimator of $\alpha_1$ is normal, the limiting distribution of the $\hat\tau$-statistic is N(0, 1). The distribution of $\hat\tau$ for $|\alpha_1| = 1$ is a ratio of quadratic forms and has been tabulated by Dickey (1976). See Fuller (1976). One of the more interesting results of Table 2.2 is the fact that the limiting distribution of the $\hat\tau$-statistic is N(0, 1) for $|\alpha_1| > 1$. This result emphasizes the unique place of $|\alpha_1| = 1$. The $\hat\tau$-statistic for estimator (2.6) has a limiting distribution that is a simple transformation of the limiting distribution of $\tilde\alpha_1$.
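The statistic (2.9) is ordinary regression output; a minimal sketch (ours, not from the original text):

```python
import numpy as np

def tau_hat(y):
    """Studentized statistic (2.9) for alpha_1 = 1 from the intercept-included
    regression of Y_t on Y_{t-1}."""
    ylag, ycur = y[:-1], y[1:]
    n = len(ycur)
    d = ylag - ylag.mean()
    a1 = np.sum(d * (ycur - ycur.mean())) / np.sum(d**2)   # alpha_1 hat of (2.2)
    a0 = ycur.mean() - a1 * ylag.mean()                    # alpha_0 hat of (2.8)
    resid = ycur - a0 - a1 * ylag
    s2 = np.sum(resid**2) / (n - 2)                        # sigma^2 hat
    return (a1 - 1.0) / np.sqrt(s2 / np.sum(d**2))

rng = np.random.default_rng(2)
print(tau_hat(np.cumsum(rng.standard_normal(200))))   # compare with Dickey's tabulation
```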
Table 2.2
Limiting properties of the least squares 't-statistic'

α_1         α_0        Initial value Y_0   Distribution of e_t   Limiting distribution
|α_1| < 1   any real   any real            IID(0, σ^2)           N(0, 1)
|α_1| = 1   α_0 ≠ 0    any real            IID(0, σ^2)           N(0, 1)
|α_1| = 1   α_0 = 0    any real            IID(0, σ^2)           Tabulated
|α_1| > 1   α_0 = 0    Y_0 = 0             NID(0, σ^2)           N(0, 1)
|α_1| > 1   α_0 = 0    Y_0 = 0             IID(0, σ^2)           ?
|α_1| > 1   α_0 ≠ 0    Y_0 = 0             NID(0, σ^2)           N(0, 1)

The properties of predictors for the first-order autoregressive process are

given in Table 2.3. Let $\hat{Y}_{n+j}$ denote the predictor constructed with known parameters. If the parameters are known and if the $e_t$ are independent, the best predictor of $Y_{n+j}$ given $(Y_0, Y_1, \ldots, Y_n)$ is the conditional expectation

$$\hat{Y}_{n+j} = E\{Y_{n+j} \mid Y_n\} = \alpha_0(1 + \alpha_1 + \cdots + \alpha_1^{j-1}) + \alpha_1^j Y_n.$$

The error in this predictor is

$$Y_{n+j} - \hat{Y}_{n+j} = e_{n+j} + \alpha_1 e_{n+j-1} + \cdots + \alpha_1^{j-1} e_{n+1},$$

and the variance of the prediction error is

$$V\{Y_{n+j} - \hat{Y}_{n+j}\} = \begin{cases} (1 - \alpha_1^2)^{-1}(1 - \alpha_1^{2j})\sigma^2, & \alpha_1 \neq 1, \\ j\sigma^2, & \alpha_1 = 1. \end{cases} \tag{2.10}$$

Therefore, with known parameters, the variance of the one-period prediction error is always $\sigma^2$. For $|\alpha_1| < 1$, the variance of the prediction error approaches the variance of the process as $j$ increases, where the variance of the process is $(1 - \alpha_1^2)^{-1}\sigma^2$. For $|\alpha_1| \ge 1$, the variance of the prediction error increases without bound as $j$ increases.
Estimation of the parameters increases the variance of the prediction error by a term that is order $n^{-1}$ if $|\alpha_1| \le 1$. The interesting result in Table 2.3 is the fact that when $\alpha_1$ is estimated and $|\alpha_1| > 1$, the one-period prediction error variance is $\alpha_1^2\sigma^2 + O(n^{-1})$. That is, the estimation of the parameter leads to an order one increase in the prediction error. This was first pointed out by Hasza (1977). The variance of the prediction error for explosive processes increases rapidly as the forecast period increases. Hasza (1977) has shown that if the parameters of the explosive process are estimated, the mean square error of the prediction for $s$
periods ahead is

$$\sigma^2\bigl\{s^2\alpha_1^{2(s-1)}(\alpha_1^2 - 1) + (\alpha_1^{2s} - 1)(\alpha_1^2 - 1)^{-1}\bigr\} + O(n^{-1}).$$

Table 2.3
Properties of predictors for the first-order autoregressive process

Perhaps the most useful result of Table 2.3 is the fact that the ordinary regression 't-statistic' has N(0, 1) as its limiting distribution for normal processes for all values of $\alpha_1$. Therefore, one can use an ordinary regression program to construct predictions and to construct confidence intervals for the predictions.

EXAMPLE 2.1. To illustrate some of the methods of this section, we use the example of Dickey and Fuller (1979) based upon the study of Gould and Nelson (1974). Gould and Nelson used the data from Friedman and Schwartz (1963) to reach the conclusion that the logarithm of the velocity of money had a behavior consistent with the hypothesis of a random walk. Two models are fitted to the data. The first is

$$Y_t - Y_1 = \alpha_1(Y_{t-1} - Y_1) + e_t \tag{2.11}$$

and the second is

$$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + e_t, \tag{2.12}$$

where $Y_t$ is the velocity of money and we assume the $e_t$ to be independently and identically distributed. The model estimates obtained by ordinary least squares are

$$\hat{Y}_t - Y_1 = \underset{(0.0094)}{1.0044}\,(Y_{t-1} - Y_1), \qquad \hat\sigma^2 = 0.0052,$$

and

$$\hat{Y}_t = \underset{(0.0175)}{0.0164} + \underset{(0.0195)}{0.9663}\,Y_{t-1}, \qquad \hat\sigma^2 = 0.0050,$$

where the numbers in parentheses are the standard errors computed by the ordinary regression formulas. There are 92 observations and 91 observations are included in the regression. For the first model, we have

$$n(\hat\alpha_1 - 1) = 91(0.0044) = 0.4004$$

and

$$\hat\tau = (0.0094)^{-1}(0.0044) = 0.4681.$$

If these numbers are compared to the tabular values in Table 8.5.1 and Table 8.5.2 of Fuller (1976), the hypothesis that $\alpha_1 = 1$ is accepted at the 0.10 level. In these tests the maintained model is that of (2.11).


If the model is that of (2.12), one might wish to test the hypothesis $H_0: (\alpha_0, \alpha_1) = (0, 1)$. Dickey and Fuller (1981) give tables for the null distribution of the likelihood ratio test, which takes the form of the least squares 'F-test'. For the model (2.12), the test statistic is

$$\frac{(0.47253 - 0.44664)/2}{0.44664/89} = 2.58,$$

where $0.47253 = \sum_{t=2}^{n}(Y_t - Y_{t-1})^2$ and 0.44664 is the residual sum of squares obtained in the regression of $Y_t - Y_{t-1}$ on $Y_{t-1}$ with an intercept included in the regression. From Table IV of Dickey and Fuller (1981) the value of this statistic will exceed 3.86 about 10 percent of the time. Therefore, the hypothesis that $(\alpha_0, \alpha_1) = (0, 1)$ is easily accepted.
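The arithmetic of the statistic, using the sums of squares quoted above, is easy to check (a small addition of ours):

```python
rss_null = 0.47253   # sum of (Y_t - Y_{t-1})^2 under (alpha_0, alpha_1) = (0, 1)
rss_fit = 0.44664    # residual SS of the regression of Y_t - Y_{t-1} on Y_{t-1}, with intercept
print(((rss_null - rss_fit) / 2) / (rss_fit / 89))   # 2.58, below the 10% point 3.86
```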
If one knew that the process is generated by the model with $(\alpha_0, \alpha_1) = (0, 1)$ and if one were given $Y_n = 0.5247$, the predictions for the next three periods would be

$$(\hat{Y}_{n+1}, \hat{Y}_{n+2}, \hat{Y}_{n+3}) = (0.5247, 0.5247, 0.5247).$$

The standard deviations of the three prediction errors computed with $\hat\sigma^2 = 0.0050$ are $(\hat\sigma, 2^{1/2}\hat\sigma, 3^{1/2}\hat\sigma) = (0.0707, 0.1000, 0.1225)$. If the parameters are unknown and estimated by ordinary least squares, the predictions for the next three periods are

$$(\hat{Y}_{n+1}, \hat{Y}_{n+2}, \hat{Y}_{n+3}) = (0.5234, 0.5223, 0.5211).$$

The standard deviations computed by the nonlinear least squares method described in Fuller (1980) are (0.0714, 0.1001, 0.1217). There is little difference between the short-term predictions computed under the hypothesis of a random walk and those computed with estimated parameters. Also, the estimated standard errors for the prediction errors based upon estimated parameters differ little from those computed under the assumption of known parameters.
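The known-parameter standard deviations quoted above are immediate from (2.10) with $\alpha_1 = 1$, where the $j$-period prediction error variance is $j\sigma^2$; for example:

```python
import math
sigma2 = 0.0050
print([round(math.sqrt(j * sigma2), 4) for j in (1, 2, 3)])   # [0.0707, 0.1, 0.1225]
```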

3. The pth-order model

Mann and Wald (1943) proved that the vector of least squares estimates for
the pth-order stationary time series converges in distribution to a vector
normal random variable. The following theorem is a generalization of their
result. See also Anderson (1959) and Crowder (1980).

THEOREM 3.1. Let $Y_t$ satisfy

$$Y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + e_t, \tag{3.1}$$

where the roots of the characteristic equation

$$m^p - \sum_{i=1}^{p} \alpha_i m^{p-i} = 0 \tag{3.2}$$

are less than one in absolute value. Let $(Y_0, Y_{-1}, Y_{-2}, \ldots, Y_{-p+1})$ be either fixed numbers or a vector of random variables with finite variances. Let $\{e_t\}$ be a sequence of independent identically distributed $(0, \sigma^2)$ random variables independent of $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$. Then, as $n \to \infty$,

$$n^{1/2}(\hat{\boldsymbol\alpha} - \boldsymbol\alpha) \xrightarrow{L} N(0, V_{\alpha\alpha}),$$

where $\boldsymbol\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$,

$$\hat{\boldsymbol\alpha} = \Bigl(\sum_{t=1}^{n} X_t' X_t\Bigr)^{-1} \sum_{t=1}^{n} X_t' Y_t = M_{xx}^{-1} M_{xy},$$

$$X_t = (1, Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}),$$

$$V_{\alpha\alpha} = \operatorname{plim} M_{xx}^{-1} \sigma^2, \tag{3.3}$$

$$M_{xx} = n^{-1} \sum_{t=1}^{n} X_t' X_t, \qquad M_{xy} = n^{-1} \sum_{t=1}^{n} X_t' Y_t.$$

COROLLARY 3.1. Let $T_n$ be a lower triangular matrix such that

$$T_n' \sum_{t=1}^{n} X_t' X_t T_n = I. \tag{3.4}$$

Then

$$T_n^{-1}(\hat{\boldsymbol\alpha} - \boldsymbol\alpha) \xrightarrow{L} N(0, I).$$

On the basis of these results the usual regression statistics can be used, in large samples, for inference purposes for stationary processes.
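A sketch (ours; the coefficients and sample size are arbitrary) of the practice Theorem 3.1 licenses: fit the $p$th-order model by ordinary least squares with $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})$ and read off the usual normal-theory standard errors.

```python
import numpy as np

def fit_ar_ols(y, p):
    """OLS fit of model (3.1); returns coefficient estimates (alpha_0, ..., alpha_p)
    and the usual regression standard errors."""
    n = len(y) - p
    Y = y[p:]
    X = np.column_stack([np.ones(n)] + [y[p - j: len(y) - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(3)
y = np.zeros(600)
e = rng.standard_normal(600)
for t in range(2, 600):                     # stationary AR(2): roots 0.7 and 0.5
    y[t] = 0.2 + 1.2 * y[t - 1] - 0.35 * y[t - 2] + e[t]
print(fit_ar_ols(y, 2))                     # estimates near (0.2, 1.2, -0.35)
```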
The extension of the results to include mean functions $g(t)$ that are not constant functions of time requires some care. This is particularly true if the conclusion of the theorem is to cover $g(t)$ of practical interest, such as $g(t)$ that are polynomial time trends. Because

$$\sum_{t=1}^{n} t^2, \qquad \sum_{t=1}^{n} t^4,$$

etc., increase faster than $n$, the estimated coefficients for $t$ and $t^2$ must be standardized by functions of $n$ that increase faster than $n^{1/2}$ in order to obtain a nondegenerate limiting distribution. We restrict our attention to $g$ functions that are linear in the parameters to be estimated. Let

$$Y_t = \sum_{i=1}^{q} \beta_i \psi_{ti} + \sum_{j=1}^{p} \alpha_j Y_{t-j} + e_t = X_t \theta + e_t, \tag{3.5}$$

where

$$X_t = (\psi_{t1}, \psi_{t2}, \ldots, \psi_{tq}, Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}),$$
$$\theta' = (\beta_1, \beta_2, \ldots, \beta_q, \alpha_1, \alpha_2, \ldots, \alpha_p),$$

and $(Y_0, Y_{-1}, \ldots, Y_{-p+1})$ is a fixed vector. We assume $\{\psi_{ti}\}$, $i = 1, 2, \ldots, q$, are fixed sequences. We can solve the difference equation (3.5) and define
$$Y_t = S_t + u_t, \tag{3.6}$$

where

$$S_t = \sum_{j=0}^{p-1} v_{t+j} Y_{-j} + \sum_{j=0}^{t-1} v_j \sum_{i=1}^{q} \beta_i \psi_{t-j,i}, \qquad u_t = \sum_{j=0}^{t-1} v_j e_{t-j},$$

and the $v_j$ satisfy the homogeneous difference equation

$$v_i = \sum_{j=1}^{p} \alpha_j v_{i-j}$$

with initial conditions $v_0 = 1$ and $v_j = 0$ for $j < 0$. The fixed part of $Y_t$ is $S_t$ and $u_t$ is the random part. Let

$$Z_{tn}' = H_n(\psi_t, S_t)', \tag{3.7}$$

where

$$(\psi_t, S_t) = (\psi_{t1}, \psi_{t2}, \ldots, \psi_{tq}, S_{t-1}, S_{t-2}, \ldots, S_{t-p})$$

and $H_n$ is the lower triangular matrix with ones on the diagonal such that

$$H_n' \sum_{t=1}^{n} (\psi_t, S_t)'(\psi_t, S_t) H_n \tag{3.8}$$

is a diagonal matrix. It is possible for some of the last $p$ diagonal elements of the matrix (3.8) to be zero.

Our theorem for the limiting distribution of the least squares estimator requires the regularity conditions

$$\lim_{n\to\infty} \sup_{1\le t\le n} \Bigl(\sum_{s=1}^{n} z_{sin}^2\Bigr)^{-1} z_{tin}^2 = 0, \qquad i = 1, 2, \ldots, q, \tag{3.9}$$

$$\lim_{n\to\infty} \sup_{1\le t\le n} \Bigl(n + \sum_{s=1}^{n} z_{sin}^2\Bigr)^{-1} z_{tin}^2 = 0, \qquad i = q+1, q+2, \ldots, q+p. \tag{3.10}$$

These conditions prohibit any element of the vector of observations on $\psi_t$ from dominating the behavior of the estimator. Conditions such as these are required for the application of the Lindeberg central limit theorem.
Theorem 3.2 follows from the results of Fuller, Hasza and Goebel (1981).

THEOREM 3.2. Let model (3.5) hold, let the roots of the characteristic equation (3.2) be less than one in absolute value, and let $\{e_t\}$ be a sequence of IID(0, $\sigma^2$) random variables. Let the $\{\psi_{ti}\}$ satisfy conditions (3.9) and (3.10). Let

$$Y_t = Z_{tn}\theta_n + e_t, \tag{3.11}$$

where

$$Z_{tn}' = H_n' X_t', \qquad \theta_n = H_n^{-1}(\beta_1, \beta_2, \ldots, \beta_q, \alpha_1, \alpha_2, \ldots, \alpha_p)',$$

and $H_n$ is defined by (3.8). Let

$$\hat\theta_n = A_{zzn}^{-1} \sum_{t=1}^{n} Z_{tn}' Y_t, \qquad \hat\sigma^2 = (n - q - p)^{-1} \sum_{t=1}^{n} (Y_t - Z_{tn}\hat\theta_n)^2, \tag{3.12}$$

where

$$A_{zzn} = \sum_{t=1}^{n} Z_{tn}' Z_{tn}.$$

Let $A_{zzn}^{1/2}$ be the positive square root of $A_{zzn}$. Then

$$\hat\sigma^{-1} A_{zzn}^{1/2}(\hat\theta_n - \theta_n) \xrightarrow{L} N(0, I).$$

It follows from this theorem that, in large samples, the usual regression statistics can be used for inference.
A special model of the form (1.1) of considerable interest is the model

$$Y_t = \sum_{i=1}^{q} \beta_i \phi_{ti} + u_t, \tag{3.13}$$

where $u_t$ is a stationary autoregressive process satisfying

$$u_t = \sum_{i=1}^{p} \alpha_i u_{t-i} + e_t. \tag{3.14}$$

The model (3.13), (3.14) can also be written as

$$Y_t = \sum_{i=1}^{q} \beta_i \Bigl(\phi_{ti} - \sum_{j=1}^{p} \alpha_j \phi_{t-j,i}\Bigr) + \sum_{j=1}^{p} \alpha_j Y_{t-j} + e_t. \tag{3.15}$$

That is, we can write model (3.15) as

$$Y_t = \sum_{i=1}^{qp+q} \delta_i \phi_{ti}^* + \sum_{j=1}^{p} \delta_{qp+q+j} Y_{t-j} + e_t, \tag{3.16}$$

where

$$\phi_{ti}^* = \phi_{t-r,j} \quad \text{for } i = qr + j,\ r = 0, 1, \ldots, p,\ j = 1, 2, \ldots, q,$$

and

$$\delta_i = \beta_i, \qquad i = 1, 2, \ldots, q,$$
$$\delta_i = -\alpha_r \beta_j, \qquad i = qr + j,\ r = 1, 2, \ldots, p,\ j = 1, 2, \ldots, q,$$
$$\delta_i = \alpha_j, \qquad i = qp + q + j,\ j = 1, 2, \ldots, p.$$

If the restrictions on the coefficients are ignored, the model (3.16) is of the same form as model (3.5). Because of the following theorem the results of Theorem 3.2 are applicable to the nonlinear least squares estimator of the parameters of model (3.15).

THEOaEM 3.3. Let 6 be a least squares estimator of the p-dimensional vector 6.


That is, 6 is the value of 6 that minimizes a function 0~(8). Assume

L
nl'2(g - ~) --~ N(0, % ) .

Let the elements of 6 satisfy the restrictions

JX~)--o, i= l , 2 , . . . , r ,

where the functions fi(8) are continuous and have continuous first and second
derivatives. Let ~ be the value of 6 that minimizes O,, (6) subject to the restrictions

f,(g) = f~(a) . . . . . fr(g) = O.


Then
L
n ~ ( g - o) o N ( 0 , &),
Nonstationary autoregressive time series 17

where
= -

(I~: Vaa ~ = (',dla,


' d;~ . . . . d,a)Ve~(l,d',a,
' dea,
' d' Y
Ve~ / . . . . , ,~j ,

= of,(n)
dis \ 061 ' 082 .... ' 06p "

Therefore, the estimates obtained by applying nonlinear least squares to model (3.15) are asymptotically normally distributed and the usual nonlinear least squares theory applies. Because of the nature of the model (3.13), it is possible to use a two-step procedure in place of nonlinear least squares. See, for example, Hatanaka (1974) and Fuller (1976, pp. 419ff). In practice, one may wish to weight the first p observations as described in Fuller (1976, pp. 419ff).

The limiting behavior of the least squares estimators for processes of order greater than one has been discussed by Anderson (1959), Rao (1961), Venkataraman (1967), Muench (1971), Narasimham (1969), and Hasza (1977) for roots greater than one in absolute value and by Muench (1971) and Hasza and Fuller (1979) for roots equal to one in absolute value.
The special case of seasonal time series with multiple roots of unit absolute
value has been discussed by Dickey and Fuller (1981), Hasza and Fuller (1982)
and Dickey, Hasza and Fuller (1984).
We consider model (3.1) in the special case where the characteristic equation
has only one root greater than or equal to one in absolute value.

THEOREM 3.4. Let model (3.1) hold and let the roots of (3.2) be $|m_1| > |m_2| \ge |m_3| \ge \cdots \ge |m_p|$, where $|m_1| \ge 1$ and $|m_i| < 1$ for $i = 2, 3, \ldots, p$. Let $(Y_0, Y_{-1}, Y_{-2}, \ldots, Y_{-p+1})$ be fixed and let $\{e_t\}$ be a sequence of independent identically distributed $(0, \sigma^2)$ random variables. Let

$$X_t = (1, Y_{t-1}, Y_{t-1} - m_1 Y_{t-2}, Y_{t-2} - m_1 Y_{t-3}, \ldots, Y_{t-p+1} - m_1 Y_{t-p})$$

so that model (3.1) becomes

$$Y_t = X_t \theta + e_t,$$

where $\theta_1 = \alpha_0$, $\theta_2 = m_1$, $\theta_i = \delta_{i-2}$, $i = 3, 4, \ldots, p+1$, and the $\delta_i$ are defined by

$$m^{p-1} - \sum_{i=1}^{p-1} \delta_i m^{p-1-i} = \prod_{j=2}^{p} (m - m_j).$$

Let

$$\hat\theta = \Bigl(\sum_{t=1}^{n} X_t' X_t\Bigr)^{-1} \sum_{t=1}^{n} X_t' Y_t$$

and let $\hat\tau_i$ be the statistic

$$\hat\tau_i = (a_{ii} s^2)^{-1/2}(\hat\theta_i - \theta_i),$$

where $a_{ii}$ is the $i$th diagonal element of $(\sum_{t=1}^{n} X_t' X_t)^{-1}$ and

$$s^2 = (n - p - 1)^{-1} \sum_{t=1}^{n} (Y_t - X_t\hat\theta)^2.$$

Then the limiting distribution of $\hat\tau_2$ is that of the $\hat\tau$-statistic for $\hat\alpha_1$ of Table 2.2 and $\hat\tau_i \xrightarrow{L} N(0, 1)$ for $i = 3, 4, \ldots, p+1$.

Theorem 3.4 defines a procedure where one can test hypotheses about the largest root in a pth-order autoregressive process where p - 1 roots are less than one in absolute value.
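A sketch (our code, not from the original) of this procedure for $p = 3$: regress $Y_t - m_1 Y_{t-1}$ on an intercept, $Y_{t-1}$ and the filtered lags, and Studentize the $Y_{t-1}$ coefficient. With $m_1 = 1$ this is the regression used in (3.18) of Example 3.1 below; for a maintained $m_1 > 1$ the statistic is asymptotically N(0, 1).

```python
import numpy as np

def largest_root_t_stat(y, m1):
    """t-statistic for the coefficient of Y_{t-1} in the regression of
    Y_t - m1*Y_{t-1} on (1, Y_{t-1}, Y_{t-1} - m1*Y_{t-2}, Y_{t-2} - m1*Y_{t-3}),
    following Theorem 3.4 with p = 3."""
    lhs = y[3:] - m1 * y[2:-1]
    X = np.column_stack([
        np.ones(len(lhs)),
        y[2:-1],                      # Y_{t-1}
        y[2:-1] - m1 * y[1:-2],       # Y_{t-1} - m1*Y_{t-2}
        y[1:-2] - m1 * y[:-3],        # Y_{t-2} - m1*Y_{t-3}
    ])
    b, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    r = lhs - X @ b
    s2 = r @ r / (len(lhs) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

rng = np.random.default_rng(4)
y = np.cumsum(rng.standard_normal(300))      # illustrative series with a unit root
print(largest_root_t_stat(y, 1.0))           # compare with Dickey's tables for m1 = 1
```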
Fountis and Dickey (1983) have extended the results for a $p$th-order process with a single root of unit absolute value to the multivariate process with a single root of unit absolute value. Let the multivariate process $\mathbf{Y}_t$ satisfy

$$\mathbf{Y}_t = \mathbf{Y}_{t-1} A + \mathbf{e}_t, \qquad t = 1, 2, \ldots,$$

where

$$\mathbf{Y}_t = [Y_{t1}, Y_{t2}, \ldots, Y_{tk}], \qquad \mathbf{e}_t = [e_{t1}, e_{t2}, \ldots, e_{tk}], \qquad \mathbf{Y}_0 = \mathbf{0},$$

and $\mathbf{e}_t$ is a sequence of independent identically distributed $N(\mathbf{0}, \Sigma)$ random vectors. Let one root of the equation

$$|A - \lambda I| = 0$$

be equal to one and let the remaining roots be less than one in absolute value. Let the least squares estimator of $A$, based on a sample of $n$ observations, be given by

$$\hat{A} = \Bigl(\sum_{t=1}^{n} \mathbf{Y}_{t-1}' \mathbf{Y}_{t-1}\Bigr)^{-1} \sum_{t=1}^{n} \mathbf{Y}_{t-1}' \mathbf{Y}_t$$

and let $\hat\lambda_i$, $i = 1, 2, \ldots, k$, be the roots of

$$|\hat{A} - \lambda I| = 0,$$

with $|\hat\lambda_1| \ge |\hat\lambda_2| \ge \cdots \ge |\hat\lambda_k|$. Fountis and Dickey (1983) show that

$$n(\hat\lambda_1 - \lambda_1) \xrightarrow{L} (2\Gamma)^{-1}(T^2 - 1),$$

where $\Gamma$ and $T$ are defined in (2.4).

Fuller, Hasza and Goebel (1981) have extended Theorem 3.3 to the model with one root of (3.2) greater than or equal to one in absolute value and the remaining roots less than one in absolute value. Also, the prediction results of Section 2 extend to the more complicated models of this section.

EXAMPLE 3.1. Engle and Kraft (1981) analyzed the logarithm of the implicit Price Deflator for Gross National Product as an autoregressive time series. We simplify the model of Engle and Kraft and use data for the period 1955 first quarter through 1980 third quarter. For the initial part of our analysis we assume that the process is a third-order autoregressive process. The least squares estimated autoregressive equation is

$$\hat{Y}_t = -0.021 + 1.429\,Y_{t-1} - 0.133\,Y_{t-2} - 0.290\,Y_{t-3} \tag{3.17}$$

and the residual mean square error is $\hat\sigma^2 = 1.1173 \times 10^{-5}$. There are a total of 103 observations and 100 observations are used in the regression. The largest root of the characteristic equation

$$m^3 - 1.429 m^2 + 0.133 m + 0.290 = 0$$
is 1.0178. Because the largest root is greater than one, the estimated model is explosive. We first test the hypothesis that the largest root is one. This is done by regressing the first differences on $Y_{t-1}$ and the lagged first differences. The estimated equation is

$$\hat{Y}_t - Y_{t-1} = -\underset{(0.0082)}{0.0211} + \underset{(0.0026)}{0.0054}\,Y_{t-1} + \underset{(0.098)}{0.423}\,(Y_{t-1} - Y_{t-2}) + \underset{(0.099)}{0.290}\,(Y_{t-2} - Y_{t-3}), \tag{3.18}$$

where the numbers in parentheses are the estimated standard errors obtained from the ordinary least squares regression program. By Theorem 3.4 the statistic

$$\hat\tau = (0.0026)^{-1}(0.0054) = 1.93$$

has the distribution tabulated by Dickey when the largest root is one. By Table 8.5.2 of Fuller (1976) the $\hat\tau$-statistic will exceed 0.63 about one percent of the time. Therefore, the hypothesis of a unit root is easily rejected. Because of the large positive autocorrelation of series such as the price deflator, numerical problems are often reduced by fitting the model in the form (3.18) instead of in the form (3.17).
To set confidence limits for the largest root, we again use Theorem 3.4. Let the coefficient of $Y_{t-1}$ in the regression of $Y_t - m_1 Y_{t-1}$ on $Y_{t-1}$, $Y_{t-1} - m_1 Y_{t-2}$ and $Y_{t-2} - m_1 Y_{t-3}$ be denoted by $\hat\beta$. If $m_1 > 1$ is the largest root of the characteristic equation and if all other roots are less than one in absolute value, then the limiting distribution of the statistic

$$\hat{t} = (\text{s.e. } \hat\beta)^{-1}\hat\beta,$$

where s.e. $\hat\beta$ is the ordinary least squares standard error, is that of a N(0, 1) random variable. Therefore, we can define a confidence interval for $m_1$ to be those $m_1$ such that the absolute value of the calculated statistic $\hat{t}$ is less than the tabular value of Student's $t$ for the desired confidence level. For our data

$$\hat{Y}_t - 1.0091\,Y_{t-1} = -\underset{(0.0082)}{0.0211} + \underset{(0.00139)}{0.00274}\,Y_{t-1} + \underset{(0.098)}{0.417}\,(Y_{t-1} - 1.0091\,Y_{t-2}) + \underset{(0.099)}{0.288}\,(Y_{t-2} - 1.0091\,Y_{t-3})$$

and

$$\hat{Y}_t - 1.0256\,Y_{t-1} = -\underset{(0.0082)}{0.0211} - \underset{(0.00128)}{0.00254}\,Y_{t-1} + \underset{(0.098)}{0.406}\,(Y_{t-1} - 1.0256\,Y_{t-2}) + \underset{(0.097)}{0.283}\,(Y_{t-2} - 1.0256\,Y_{t-3}).$$

It follows that a 95 percent confidence interval for $m_1$ based on the large sample theory is (1.0091, 1.0256).
In the preceding analysis we assumed the process to be a third-order autoregressive process. We can use Theorem 3.4 to test the hypothesis that the coefficient for $Y_{t-4}$ is zero. By that theorem the ordinary regression $t$-statistic for $Y_{t-3} - m_1 Y_{t-4}$ in the regression of $Y_t - m_1 Y_{t-1}$ on $Y_{t-1}$, $Y_{t-1} - m_1 Y_{t-2}$, $Y_{t-2} - m_1 Y_{t-3}$, and $Y_{t-3} - m_1 Y_{t-4}$ has a N(0, 1) distribution in the limit. Because the $t$-statistic for the hypothesis that the coefficient for $Y_{t-3} - m_1 Y_{t-4}$ is zero is identical (for any $m_1 \neq 0$) to the $t$-statistic for the coefficient of $Y_{t-4}$, we have a test of the hypothesis that the process is third order against the hypothesis that it is fourth order. We have, for example,

$$\hat{Y}_t - 1.0178\,Y_{t-1} = -\underset{(0.0086)}{0.0207} - \underset{(0.00114)}{0.00000}\,Y_{t-1} + \underset{(0.104)}{0.409}\,(Y_{t-1} - 1.0178\,Y_{t-2}) + \underset{(0.108)}{0.280}\,(Y_{t-2} - 1.0178\,Y_{t-3}) + \underset{(0.104)}{0.012}\,(Y_{t-3} - 1.0178\,Y_{t-4}).$$

Because the $t$-statistic for $Y_{t-4}$ is

$$t = (0.104)^{-1}(0.012) = 0.12$$

we easily accept the hypothesis that the process is third order. The argument extends to the use of an F-test with two degrees of freedom to test the hypothesis of third order against the alternative of a fifth-order process, etc.

Acknowledgements

This research was partly supported by Joint Statistical Agreement J.S.A. 82-6 with the U.S. Bureau of the Census. I thank David Dickey, David Hasza, V. A. Samaranayake, and Sastry Pantula for comments.

References

Amemiya, T. and Fuller, W. A. (1967). A comparative study of alternative estimators in a distributed lag model. Econometrica 35, 509-529.
Anderson, R. L. (1942). Distribution of the serial correlation coefficient. Ann. Math. Statist. 13, 1-13.
Anderson, T. W. (1959). On asymptotic distributions of estimates of parameters of stochastic difference equations. Ann. Math. Statist. 30, 676-687.
Anderson, T. W. and Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit" criteria
based on stochastic processes. Ann. Math. Statist. 23, 193-212.
Bhargava, A. (1983). On the theory of testing for unit roots in observed time series. London School
of Economics.
Burg, J. P. (1975). Maximum entropy spectral analysis. Unpublished Ph.D. thesis. Stanford
University, Stanford, CA.
Crowder, M. J. (1980). On the asymptotic properties of least squares estimators in autoregression.
Ann. Statist. 8, 132-146.
Davisson, L. D. (1965). The prediction error of stationary Gaussian time series of unknown
covariance. IEEE Trans. Inform. Theory, IT-11, 527-532.
Dickey, D. A. (1976). Estimation and hypothesis testing in nonstationary time series. Unpublished Ph.D. thesis, Iowa State University, Ames, Iowa.
Dickey, D. A. (1977). Distributions associated with the nonstationary autoregressive process. Paper
presented at the eastern regional meeting of the Institute of Mathematical Statistics, Chapel Hill,
NC (April 1977).
Dickey, D. A. and Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. J. Amer. Statist. Assoc. 74, 427-431.
Dickey, D. A. and Fuller, W. A. (1981). Likelihood ratio statistics for autoregressive time series
with a unit root. Econometrica 49, 1057-1072.
Dickey, D. A. and Said, S. E. (1982). Testing ARIMA(p, 1, q) versus ARMA(p + 1, q). In: O. D. Anderson, ed., Applied Time Series Analysis. North-Holland, Amsterdam.
Dickey, D. A., Hasza, D. P. and Fuller, W. A. (1984). Testing for unit roots in seasonal time series. J. Amer. Statist. Assoc. 79, 355-367.
Durbin, J. (1960). Estimation of parameters in time-series regression models. J. Roy. Statist. Soc. Ser. B 22, 139-153.
Durbin, J. (1973). Distribution theory for tests based on the sample distribution function. Regional
Conference Series in Applied Mathematics No. 9. SIAM, Philadelphia, Pennsylvania.
Engle, R. F. and Kraft, D. F. (1981). Multiperiod forecast error variances of inflation estimated from ARCH models. In: A. Zellner, ed., Proceedings of the ASA-Census-NBER Conference on Applied Time Series Analysis of Economic Data.
Evans, G. B. A. and Savin, N. E. (1981a). The calculation of the limiting distribution of the least
squares estimator of the parameter in a random walk model. Ann. Statist. 9, 1114-1118.
Evans, G. B. A. and Savin, N. E. (1981b). Testing for unit roots 1. Econometrica 49, 753-777.
Findley, D. F. (1980). Large sample behavior of the S-array of seasonally nonstationary ARMA
series. In: O. D. Anderson and M. R. Perryman, eds., Time Series Analysis, 163-170. North-
Holland, Amsterdam.
Fountis, N. G. (1983). Testing for unit roots in multivariate autoregressions. Unpublished Ph.D. thesis. North Carolina State University, Raleigh, NC.
Fountis, N. G. and Dickey, D. A. (1983). Testing for a unit root nonstationarity in multivariate
autoregressive time series. Paper presented at Statistics: An Appraisal, International Conference
to Mark the 50th Anniversary of the Iowa State University Statistical Laboratory, Ames, Iowa.
Friedman, M. and Schwartz, A. J. (1963). A Monetary History of the United States I867-1960.
Princeton University Press, Princeton, NJ.
Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
Fuller, W. A. (1979). Testing the autoregressive process for a unit root. Paper presented at the 42nd Session of the International Statistical Institute, Manila.
Fuller, W. A. (1980). The use of indicator variables in computing predictions. J. Econometrics 12,
231-243.
Fuller, W. A. and Hasza, D. P. (1980). Predictors for the first-order autoregressive process. J.
Econometrics 13, 139-157.
Fuller, W. A. and Hasza, D. P. (1981). Properties of predictors for autoregressive time series. J.
Amer. Statist. Assoc. 76, 155-161.
Fuller, W. A., Hasza, D. P. and Goebel, J. J. (1981). Estimation of the parameters of stochastic
difference equations. Ann. Statist. 9, 531-543.
Gould, J. P. and Nelson, C. R. (1974). The stochastic structure of the velocity of money. American
Economic Review 64, 405-417.
Grenander, U. (1954). On the estimation of regression coefficients in the case of an autocorrelated
disturbance. Ann. Math. Statist. 25, 252-272.
Hannan, E. J. (1965). The estimation of relationships involving distributed lags. Econometrica 33, 206-224.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
Hannan, E. J. (1979). The central limit theorem for time series regression. Stoch. Process. Appl. 9,
281-289.
Hannan, E. J., Dunsmuir, W. T. M. and Deistler, M. (1980). Estimation of vector ARMAX models. J. Multivariate Anal. 10, 275-295.
Hannan, E. J. and Heyde, C. C. (1972). On limit theorems for quadratic functions of discrete time series. Ann. Math. Statist. 43, 2058-2066.
Hannan, E. J. and Nicholls, D. F. (1972). The estimation of mixed regression, autoregression,
moving average and distributed lag models. Econometrica 40, 529-548.
Hasza, D. P. (1977). Estimation in nonstationary time series. Unpublished Ph.D. thesis. Iowa State
University, Ames, Iowa.
Hasza, D. P. and Fuller, W. A. (1979). Estimation for autoregressive processes with unit roots.
Ann. Statist. 7, 1106-1120.
Hasza, D. P. and Fuller, W. A. (1982). Testing for nonstationary parameter specifications in
seasonal time series models. Ann. Statist. 10, 1209-1216.
Hatanaka, M. (1974). An efficient two-step estimator for the dynamic adjustment model with
autoregressive errors. J. Econometrics 2, 199-220.
Kawashima, H. (1980). Parameter estimation of autoregressive integrated processes by least squares. Ann. Statist. 8, 423-435.
Koopmans, T. C., Rubin, H. and Leipnik, R. B. (1950). Measuring the equation systems of dynamic
economics. In: T. C. Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley,
New York.
Lai, T. L. and Siegmund, D. (1983). Fixed accuracy estimation of an autoregressive parameter.
Ann. Statist. 11, 478-485.
Lai, T. L. and Wei, C. Z. (1982). Asymptotic properties of projections with applications to
stochastic regression problems. J. Multivariate Anal. 12, 346-370.
MacNeil, I. B. (1978). Properties of sequences of partial sums of polynomial regression residuals with applications to tests for change of regression at unknown times. Ann. Statist. 6, 422-433.
Mann, H. B. and Wald, A. (1943). On the statistical treatment of linear stochastic difference equations. Econometrica 11, 173-220.
Muench, T. J. (1971). Consistency of least squares estimates of coefficients of stochastic difference
equations. Mimeograph, University of Minnesota, Minneapolis, MN.
Narasimham, G. V. L. (1969). Some properties of estimators occurring in the theory of linear stochastic processes. In: M. Beckman and H. P. Künzi, eds., Lecture Notes in Operations Research and Mathematical Economics. Springer, Berlin.
Nicholls, D. F. (1976). The efficient estimation of vector linear time series models. Biometrika 63, 381-390.
Orcutt, G. H. and Winokur, H. S. (1969). First order autoregression: Inference, estimation, and
prediction. Econometrica 37, 1-14.
Phillips, P. C. B. (1979). The sampling distribution of forecasts from a first-order autoregression. J. Econometrics 9, 241-262.
Rao, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. In: Proc. Fifth Berkeley Symp. Math. Statist. and Probability, Vol. 1, 355-372. University of California, Berkeley, CA.
Rao, M. M. (1961). Consistency and limit distributions of estimators of parameters in explosive
stochastic difference equations. Ann. Math. Statist. 32, 195-218.
Rao, M. M. (1978a). Asymptotic distribution of an estimator of the boundary parameter of an
unstable process. Ann. Statist. 6, 185-190. Correction (1980) 1403.
Rao, M. M. (1978b). Covariance analysis of nonstationary time series. In: P. R. Krishnaiah, ed.,
Developments in Statistics. Academic Press, New York.
Rubin, H. (1950). Consistency of maximum-likelihood estimates in the explosive case. In: T. C.
Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley, New York.
Samaranayake, V. A. and Hasza, D. P. (1983). The asymptotic properties of the sample autocorrela-
tions for a multiple autoregressive process with one unit root. University of Missouri, Rolla, MO.
Sargan, J. D. and Bhargava, A. (1983). Testing residuals from least squares regression for being
generated by the Gaussian random walk. Econometrica 51, 153-174.
Scott, D. J. (1973). Central limit theorems for martingales and for processes with stationary
increments using a Skorokhod representation approach. Adv. in Appl. Probab. 5, 119-137.
Stigum, B. P. (1974). Asymptotic properties of dynamic stochastic parameter estimates (III). J. Multivariate Anal. 4, 351-381.
Stigum, B. P. (1975). Asymptotic properties of autoregressive integrated moving average processes. Stoch. Process. Appl. 3, 315-344.
Stigum, B. P. (1976). Least squares and stochastic difference equations. J. Econometrics 4, 349-370.
Tiao, G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregressive parameters in ARMA models. Ann. Statist. 11, 856-871.
Ulrych, T. J. and Bishop, T. N. (1975). Maximum entropy spectral analysis and autoregressive decomposition. Rev. Geophys. Space Phys. 13, 183-200.
Venkataraman, K. N. (1967). A note on the least square estimators of the parameters of a second order linear stochastic difference equation. Calcutta Statist. Assoc. Bull. 16, 15-28.
Venkataraman, K. N. (1968). Some limit theorems on a linear stochastic difference equation with a constant term, and their statistical applications. Sankhyā Ser. A 30, 51-74.
Venkataraman, K. N. (1973). Some convergence theorems on a second order linear explosive stochastic difference equation with a constant term. J. Indian Statist. Assoc. 11, 47-69.
Wegman, E. J. (1974). Some results on nonstationary first order autoregression. Technometrics 16,
321-322.
White, J. S. (1958). The limiting distribution of the serial correlation in the explosive case. Ann.
Math. Statist. 29, 1188-1197.
White, J. S. (1959). The limiting distribution of the serial correlation coefficient in the explosive case II. Ann. Math. Statist. 30, 831-834.
Yamamoto, T. (1976). Asymptotic mean square prediction error for an autoregressive model with
estimated coefficients. J. Appl. Statist. 25, 123-127.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 25-83

Non-Linear Time Series Models and Dynamical Systems

Tohru Ozaki

1. Introduction

There are many examples of dynamic phenomena in nature which can be regarded as stochastic processes, e.g. a ship rolling in the sea, brain wave records, animal populations in ecology, sunspot numbers in astronomy and riverflow discharge in hydrology. Some of them are considered to be stochastic processes by virtue of their own mechanism. Some of them, such as hydrodynamic phenomena, may not be considered to be stochastic at the microscopic level, but may be considered to be so at the macroscopic level. By treating them as stochastic processes, meaningful results both in theory and applications may be obtained.

For the inference of the characteristics of these stochastic processes and for their forecasting and control, observation data, obtained by sampling the process at equally spaced intervals of time, are often used. Since many stochastic phenomena in the world can be considered to be approximately Gaussian processes, much of the effort of time series analysts has been devoted to providing methodologies for the statistical analysis of Gaussian time series. After the development of linear time series modelling for stationary Gaussian processes by Box and Jenkins (1970) and Akaike and Nakagawa (1972), time series analysts' attention has turned to non-Gaussian or non-stationary processes. For the analysis of these processes and for their forecasting and control, non-linear or non-stationary time series models are needed. It is also expected that non-linear time series models may be useful for the inference of the non-linear structure in the dynamics of stochastic processes. Several non-linear time series models have been introduced and used for the analysis of time series data. In other applications, diffusion process models have been considered to be a good approximate model for stochastic dynamic phenomena and have been used in many fields of science.

Our main purpose in this paper is to see how non-linear time series models and diffusion process models characterize the non-linear dynamics in the non-Gaussian stochastic process and to show that both are closely related by a time discretization scheme for differential equations. For this purpose, we study some examples of dynamic phenomena and their linear and non-linear time series models in Section 2, and diffusion processes and their time discretization schemes in Section 3. In Section 4 we discuss estimation methods for the models considered in the previous sections and show that non-linear time series models are useful for the inference of the non-linear structure of a process whose non-linearity is characterized by potential functions. In Section 5 the multivariate extension and the implications of the present scheme in time series analysis are discussed.

2. Amplitude-dependent autoregressive models

2.1. Autoregressive models for ship rolling


T h e m o v e m e n t of a ship on the ocean is complicated because of sustained
excitement caused by the ocean waves, whose dynamics are impossible to
describe by a deterministic differential equation. To describe the dynamics, a
stochastic approach was introduced and stochastic differential equation
methods and time series analysis methods were introduced (Yamanouchi,
1974). For example, the dynamics of ship rolling are supposed to have the
following mechanism. If the ship rolls by x degrees, then the centre of
buoyancy moves from B to B' in Fig. 2.1, and has a buoyancy force from the
water at B' while the ship's centre of gravity is G.
Then the gravity force G W and the buoyancy force B'M cause the righting
m o m e n t and it gives the ship a restoring force which is a function g(x) of the

/*

Fig. 2.1.
Non-linear time series models and dynamical systems 27

rolling angle x. At the same time, the sea water damps the ship's m o v e m e n t
and this damping force is supposed to be a function f(ic) of the velocity 2 of the
rolling movement. T h e ship is also considered to be under the continual
external force s~ of r a n d o m excitement by the ocean waves. Therefore, the
dynamics of the ship rolling can be described by the following stochastic
differential equation:

5~+ / ( x ) + g ( x ) : s~ . (2.1)

When x and x are not very large, f ( 2 ) and g ( x ) are usually approximated by
linear functions as

f(2) = a 2 , (2.2)
g(x) = bx. (2.3)

Then the dynamics of ship rolling is approximately described by the linear


stochastic differential equation

5i + a2 + bx = ~. (2.4)

where the parameter a is called the damping coefficient and the parameter b is called the restoring coefficient. Both a and b depend on the size and shape of the ship, and hence each ship has its own parameters. When the external force ξ is a Gaussian white noise and the sampling interval Δt is sufficiently small, (2.4) corresponds to an ARMA(2, 1) model

x_t = φ₁x_{t-1} + φ₂x_{t-2} + θ₁a_{t-1} + a_t ,  (2.5)

where a_t is a discrete time Gaussian white noise with variance σ², and φ₁, φ₂, θ₁ and σ² are uniquely determined if a, b and the variance of the white noise ξ are known (Pandit and Wu, 1975). If we use the backward shift operator B, which is such that Bx_t = x_{t-1} and B^i x_t = B^{i-1}x_{t-1}, (2.5) is rewritten as

(1 − φ₁B − φ₂B²)x_t = (1 + θ₁B)a_t ,

or as

{φ(B)/θ(B)} x_t = a_t ,  (2.6)

where φ(B) = 1 − φ₁B − φ₂B² and θ(B) = 1 + θ₁B. The x_t is considered to be an output of the system driven by the Gaussian white noise a_t. However, ξ of (2.1) is usually a non-Gaussian coloured noise, and so a_t of (2.6) is, in general, not a Gaussian white noise but a non-Gaussian coloured noise process with some peaks and valleys in its spectrum. A simple approximate model for such a coloured noise process a_t with rational spectrum is a Gaussian ARMA model.
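To make the correspondence concrete, the following minimal sketch (Python with numpy; the coefficients a, b and the interval Δt are illustrative, not taken from any real ship) computes the AR part of the sampled process from the roots of λ² + aλ + b = 0. The full Pandit–Wu correspondence also determines θ₁ and the innovation variance, which this sketch does not reproduce.

```python
import numpy as np

# Sketch only: the AR polynomial of the sampled process has roots
# exp(lambda_i * dt), where lambda_1, lambda_2 are the roots of the
# characteristic equation lambda^2 + a*lambda + b = 0 of (2.4).
a, b, dt = 0.3, 1.2, 0.1            # illustrative damping/restoring values
lam = np.roots([1.0, a, b])         # continuous-time characteristic roots
z = np.exp(lam * dt)                # discrete-time characteristic roots
phi1 = (z[0] + z[1]).real           # x_t = phi1*x_{t-1} + phi2*x_{t-2} + ...
phi2 = -(z[0] * z[1]).real
print(phi1, phi2)                   # e.g. phi1 ~ 1.959, phi2 ~ -0.970
```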

Fig. 2.2. Ship rolling data (time axis 0 to 1000).

Fig. 2.3 shows the spectrum, which was calculated using the Hanning window. When we fit AR models of order 0 to 20 to the data, the following AR(7) model:

x_t = 1.9500x_{t-1} − 0.8660x_{t-2} − 0.2430x_{t-3} + 0.1070x_{t-4}
      + 0.0120x_{t-5} − 0.0660x_{t-6} + 0.0690x_{t-7} + ε_t ,  (2.10)
σ̂² = 0.2100,

was adopted as the best model by AIC (Ozaki and Oda, 1978). Fig. 2.4 shows the spectrum of the fitted model (2.10), where the peak of the spectrum shows about 25 seconds periodicity of rolling, and the shape of the spectrum captures the essential characteristics of the spectrum of Fig. 2.3 calculated without the parametric model.

Fig. 2.3. Estimated spectrum by Hanning window (Lag = 100).

Fig. 2.6. One-step-ahead prediction (×: observed, ○: predicted).

A natural extension of (2.4) with a non-linear restoring force is the following stochastic differential equation model:

ẍ + aẋ + bx + cx³ = ξ ,  (2.11)

where the restoring force is approximated by (b + cx²)x. When c > 0 the system is called a hard spring type, and when c < 0 the system is called a soft spring type. A natural non-linear extension of the time series model for ship rolling may be the AR model with some non-linear terms, such as

x_t = φ₁x_{t-1} + ⋯ + φ_p x_{t-p} + Q(x_{t-1}, …, x_{t-p}) + ε_t ,  (2.12)

where Q(x_{t-1}, …, x_{t-p}) is a polynomial in the variables x_{t-1}, …, x_{t-p}. We call the non-linear AR model (2.12) a polynomial AR model. The validity of the polynomial AR model is checked by fitting both linear and polynomial AR models to part of the data (see Fig. 2.6) and by comparing the one-step-ahead prediction error variances obtained by applying both fitted models to the rest of the data. For example, we fitted the AR(7) model and an AR(7) model with a non-linear term x³_{t-1} (AR(7) + πx³_{t-1}) to the first 760 data points, x₁, x₂, …, x₇₆₀, of Fig. 2.2, and calculated the variances of the one-step-ahead prediction error for x₇₆₁, …, x₁₀₀₀. The prediction error variance of the AR(7) model was σ̂²_L = 0.1041, and the prediction error variance of the AR(7) + πx³_{t-1} model was σ̂²_NL = 0.1016. This means that the non-linear model slightly improves the prediction of the ship rolling.

Although the above polynomial AR model gives better predictions than the linear AR model, it has a fatal deficiency as a model for the dynamics of vibration systems. Simulations of fitted polynomial AR models almost always diverge, even though the original ship rolling process, or the process defined by the non-linear stochastic differential equation (2.11), is quite stable and non-divergent. Therefore, some other non-linear time series model which is not explosive is desired.
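The fitting-and-comparison procedure described above can be sketched as follows (Python with numpy). Since the ship rolling record of Fig. 2.2 is not reproduced here, a synthetic oscillatory series stands in for it, so the numbers will not match σ̂²_L = 0.1041 and σ̂²_NL = 0.1016.

```python
import numpy as np

def design(x, p, cubic=False):
    """Rows (x_{t-1}, ..., x_{t-p}[, x_{t-1}^3]) and targets x_t."""
    rows = [np.concatenate([x[t-p:t][::-1],
                            [x[t-1]**3] if cubic else []])
            for t in range(p, len(x))]
    return np.array(rows), x[p:]

# Stand-in data: any bounded oscillatory series; the actual ship
# rolling record of Fig. 2.2 is not available here.
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(2, 1000):
    x[t] = 1.6*x[t-1] - 0.9*x[t-2] + 0.3*rng.standard_normal()

p, n_fit = 7, 760
for cubic in (False, True):
    Xf, yf = design(x[:n_fit], p, cubic)        # fit on x_1, ..., x_760
    coef = np.linalg.lstsq(Xf, yf, rcond=None)[0]
    Xp, yp = design(x[n_fit - p:], p, cubic)    # predict x_761, ..., x_1000
    err = yp - Xp @ coef
    print("AR(7) + x^3" if cubic else "AR(7)", err.var())
```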

2.3. Exponential AR models

To see the reason why the polynomial AR models are explosive, let us consider some simple polynomial AR models of vibration systems. The simplest AR model which can exhibit a random vibration is the AR(2) model

x_t = φ₁x_{t-1} + φ₂x_{t-2} + ε_t ,  (2.13)

and the simplest polynomial AR model for a non-linear vibration system is

x_t = φ₁x_{t-1} + φ₂x_{t-2} + πx³_{t-1} + ε_t .  (2.14)

The spectrum of the process defined by (2.13) is

p(f) = σ² / |1 − φ₁e^{−i2πf} − φ₂e^{−i4πf}|² .  (2.15)

The peak of the spectrum, which characterizes the proper frequency of the vibration system, is attained at

f = (1/2π) tan⁻¹( √(−φ₁² − 4φ₂) / φ₁ ) ,

which is the argument of the roots of the characteristic equation

λ² − φ₁λ − φ₂ = 0 .  (2.16)

When φ₂ is fixed to be a constant, the proper frequency is characterized by φ₁ as in Fig. 2.7. The polynomial AR model (2.14) is represented as

x_t = (φ₁ + πx²_{t-1})x_{t-1} + φ₂x_{t-2} + ε_t ,  (2.17)

Fig. 2.7.

and is considered to have an amplitude-dependent first-order autoregressive coefficient (see Fig. 2.8).

In many vibration systems the value of x_t may stay in some finite region |x_t| < M, and the roots of the equation

λ² − (φ₁ + πx²_t)λ − φ₂ = 0  (2.18)

may stay inside the unit circle for such x_t's. However, the white noise ε_t is Gaussian distributed and may take a large value, so that the roots of (2.18) may lie outside the unit circle. Then the system begins to diverge at this stage. Since we are interested in the stochastic behaviour of x_t mostly for |x_t| < M, it may be reasonable to make the non-linear function approach a bound as x_{t-1} → ±∞, as in Fig. 2.9.

A time series model which can exhibit this characteristic is the following model:

x_t = (φ₁ + π e^{−x²_{t-1}})x_{t-1} + φ₂x_{t-2} + ε_t .  (2.19)

The model is called an exponential AR model (Ozaki and Oda, 1978). The roots of the equation

λ² − (φ₁ + π e^{−x²_{t-1}})λ − φ₂ = 0  (2.20)


always stay inside the unit circle for any x_{t-1} if φ₁, φ₂ and π satisfy the condition that the roots λ₀ and λ̄₀ of

λ² − (φ₁ + π)λ − φ₂ = 0 ,  (2.21)

and the roots λ∞ and λ̄∞ of

λ² − φ₁λ − φ₂ = 0  (2.22)

all lie inside the unit circle (see Fig. 2.10).

Fig. 2.8.
Fig. 2.9.

In the above example the second-order coefficient is fixed at φ₂, and the roots of both (2.21) and (2.22) all stay inside the unit circle. However, in a general vibration system the damping coefficient is not constant in general. One example is the following van der Pol equation:

ẍ − a(1 − x²)ẋ + bx = 0 ,  (2.23)

where for x² < 1 the system has a negative damping force and starts to oscillate and diverge, while for x² > 1 the system has a positive damping force and starts to damp out. The interplay of these two effects of opposite tendency produces a steady oscillation of a certain amplitude, which is called a limit cycle. When the system is disturbed by a white noise n, we have

ẍ − a(1 − x²)ẋ + bx = n ,  (2.24)

which produces a perturbed limit cycle process (see Fig. 2.11).


The exponential A R model (2.19) is easily extended and applied (Haggan

Fig. 2.10.
Non-linear time series models and dynamical systems 35

"" I ~ _ i~1111t. Wlgllllll i i BI~Ill I

x(t)

~(t)

n(t) ~

Fig~ 2.11. Analog simulation of (2.24).

and Ozaki, 1981; O z a k i 1982a) for this kind of non-linear d a m p i n g system by


m a k i n g the s e c o n d - o r d e r coefficient amplitude d e p e n d e n t as

Xt = (~1 q- 7/'1 e-X2-1)xt-1 q- (t~2 + 7r2 e-x2 l)xt 2 q- Ft " (2.25)

If the coefficients satisfy the condition (C1), which is such that

(C1) the roots λ₀ and λ̄₀ of

λ² − (φ₁ + π₁)λ − (φ₂ + π₂) = 0  (2.26)

lie outside the unit circle, then x_t starts to oscillate and diverge for small x_{t-1}, while if the coefficients satisfy the condition (C2), such that

(C2) the roots λ∞ and λ̄∞ of

λ² − φ₁λ − φ₂ = 0  (2.27)

lie inside the unit circle, then x_t starts to damp out when x_{t-1} becomes too large. The result of these two effects is expected to produce a self-sustained oscillation similar to that of (2.23) if we suppress the white noise ε_t of (2.25).

Fig. 2.12.

Fig. 2.12 shows the limit cycles obtained for the model

x_t = (1.95 + 0.23 e^{−x²_{t-1}})x_{t-1} − (0.96 + 0.24 e^{−x²_{t-1}})x_{t-2} + ε_t ,  (2.28)

where the coefficients satisfy the above conditions (C1) and (C2).
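A minimal simulation sketch of the noise-suppressed skeleton of (2.28) follows (Python with numpy; the initial values and the series length are arbitrary choices):

```python
import numpy as np

x = np.zeros(700)
x[0] = x[1] = 0.1                            # small initial values
for t in range(2, 700):
    e = np.exp(-x[t-1]**2)
    x[t] = (1.95 + 0.23*e)*x[t-1] - (0.96 + 0.24*e)*x[t-2]
print(np.round(x[-12:], 3))                  # a repeating pattern: the limit cycle
```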

2.4. Stationarity

The necessary and sufficient condition for the AR(2) model

x_t = φ₁x_{t-1} + φ₂x_{t-2} + ε_t  (2.29)

to be stationary is that the roots of the characteristic equation

λ² − φ₁λ − φ₂ = 0  (2.30)

all lie inside the unit circle. For checking the stationarity of the exponential AR model

x_t = (φ₁ + π₁ e^{−x²_{t-1}})x_{t-1} + (φ₂ + π₂ e^{−x²_{t-1}})x_{t-2} + ε_t ,  (2.31)

the following theorem about the ergodicity of a Markov chain on a norm space is useful.

THEOREM 2.1 (Tweedie, 1975). A Markov chain X_n on a norm space with transition law p(x, y) is ergodic if p(x, y) is strongly continuous, i.e. p(x, y) is continuous with respect to x when y is fixed, and if there exist a compact set K and a positive value c > 0 which satisfy the following conditions:

(i) E{‖X_{n+1}‖ − ‖X_n‖ | X_n = x} ≤ −c  for x ∉ K ,  (2.32)

(ii) E{‖X_{n+1}‖ − ‖X_n‖ | X_n = x} ≤ B < ∞  for x ∈ K ,  (2.33)

where E{·} denotes expectation, ‖·‖ is the norm of the state space and B is some finite number.

In the case of model (2.31) the condition (i) means that the Markov chain defined by

( x_t     )   ( φ₁ + π₁e^{−x²_{t-1}}   φ₂ + π₂e^{−x²_{t-1}} ) ( x_{t-1} )   ( ε_t )
(         ) = (                                             ) (         ) + (     )   (2.34)
( x_{t-1} )   (          1                       0          ) ( x_{t-2} )   (  0  )

shifts back to the origin, on average, when X_{t-1} = (x_{t-1}, x_{t-2})′ is outside some finite region. We call this property the 'shift back to centre' property. It is satisfied when the eigenvalues of the matrix

( φ₁  φ₂ )
(  1   0 ) ,

which are the characteristic roots of (2.30), lie inside the unit circle. The condition (ii) is satisfied by the exponential AR model (2.31) because the transition matrix of (2.34) is a continuous function of x_{t-1} and x_{t-2}. Therefore, the process defined by (2.31) is an ergodic Markov chain if the coefficients satisfy the condition (C2).
If the Markov chain is ergodic, we can define the stationary distribution by

lim_{n→∞} (1/n) Σ_{i=1}^{n} p^i(y, x) ,

which is independent of the initial value y (Tweedie, 1975). Theorem 2.1 suggests that the process defined by the exponential AR model (2.25) is stationary even though the roots of the equation

λ² − (φ₁ + π₁)λ − (φ₂ + π₂) = 0

lie outside the unit circle.

2.5. Singular points, limit cycles and stability


We have seen that some exponential AR models can exhibit a limit cycle when the white noise input is suppressed, and that if such a model is driven by a white noise it produces a stationary time series. Whether it produces a limit cycle or not is determined by the coefficients of the deterministic non-linear difference equation

x_t = (φ₁ + π₁ e^{−x²_{t-1}})x_{t-1} + (φ₂ + π₂ e^{−x²_{t-1}})x_{t-2} .  (2.35)

The geometric properties of the trajectories of this difference equation give us important information for the prediction of the long-term future behaviour of the process. We saw that model (2.28) satisfied the conditions (C1) and (C2) and produced a limit cycle. However, it must be noted that these are not sufficient conditions for the model to have a limit cycle. The model

x_t = (1.8 + 0.4 e^{−x²_{t-1}})x_{t-1} − (0.97 + 0.1 e^{−x²_{t-1}})x_{t-2}  (2.36)

satisfies conditions (C1) and (C2), but the trajectory of x_t converges to a point ξ (see Fig. 2.13) at which it holds that

ξ = (1.8 + 0.4 e^{−ξ²})ξ − (0.97 + 0.1 e^{−ξ²})ξ .

The point ξ is called a singular point (or fixed point) of the model. Singular points exist for the exponential AR model (2.35) if the coefficients satisfy

0 < (1 − Σ_{i=1}^{2} φᵢ) / Σ_{i=1}^{2} πᵢ < 1 ,  (2.37)

and are given by

ξ = ± √( −log{ (1 − Σ_{i=1}^{2} φᵢ) / Σ_{i=1}^{2} πᵢ } ) .  (2.38)

The model

x_t = (1.57 + 3 e^{−x²_{t-1}})x_{t-1} − (0.83 + 2 e^{−x²_{t-1}})x_{t-2}  (2.39)

satisfies condition (2.37) and has singular points

ξ = ±1.719… .

However, even though the process x_t has an initial value near one of the singular points, it starts to oscillate and tends to a limit cycle in simulation (see Fig. 2.14). This is because the singular points of model (2.39) are unstable singular points.

Fig. 2.13.

Fig. 2.14.

Whether the singular points are stable or not is easily checked by considering the variational equation of the difference equation, as follows. Let x_t be close to the singular point ξ and be represented as x_t = ξ + ξ_t, where ξ_t is small. By replacing x_{t-1} and x_{t-2} by ξ + ξ_{t-1} and ξ + ξ_{t-2} we have

ξ_t = h₁ξ_{t-1} + h₂ξ_{t-2} ,  (2.40)

where

h₁ = (π₁ + φ₁π₂ − φ₂π₁) / Σ_{i=1}^{2} πᵢ + 2(1 − Σ_{i=1}^{2} φᵢ) log{ (1 − Σ_{i=1}^{2} φᵢ) / Σ_{i=1}^{2} πᵢ }  (2.41)

and

h₂ = (π₂ + φ₂π₁ − φ₁π₂) / Σ_{i=1}^{2} πᵢ .  (2.42)

Since the necessary and sufficient condition for ξ_t to converge to zero is that the characteristic roots of (2.40) lie inside the unit circle, we have the following proposition:

PROPOSITION 2.1. The singular point ξ of (2.35), if it exists, is stable if and only if the roots of the equation

λ² − h₁λ − h₂ = 0

lie inside the unit circle, where h₁ and h₂ are given by (2.41) and (2.42).

Unlike the continuous-time differential equation case, for difference equations a first-order non-linear difference equation can give rise to a limit cycle. The following model

x_t = (0.5 + 10 e^{−x²_{t-1}})x_{t-1}  (2.43)

has singular points ξ = ±1.7308…, but they do not satisfy the condition in Proposition 2.1, and x_t tends to a limit cycle when it starts at a point near ξ in simulation (see Fig. 2.15).
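The computations behind this example can be sketched as follows (Python with numpy), using (2.38) with p = 1 for the singular point and the derivative of the map for the instability check:

```python
import numpy as np

phi1, pi1 = 0.5, 10.0                            # model (2.43)
xi = np.sqrt(-np.log((1 - phi1)/pi1))            # singular point via (2.38)
slope = phi1 + pi1*np.exp(-xi**2)*(1 - 2*xi**2)  # derivative of the map at xi
print(xi, slope)                                 # xi ~ 1.7308, |slope| ~ 2 > 1: unstable

x = 1.7                                          # start near the singular point
for _ in range(300):
    x = (phi1 + pi1*np.exp(-x**2))*x
print(x)                                         # lands on the limit cycle, not on xi
```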
Although the notions of singular points and limit cycles for discrete time non-linear difference equations seem obvious by analogy with the continuous time case (Minorsky, 1962), it may be useful for later discussions to give an explicit definition of singular points and limit cycles of a general discrete time difference equation,

x_t = f(x_{t-1}, …, x_{t-p}) .  (2.44)

DEFINITION 2.1. A singular point ξ of x_t = f(x_{t-1}, …, x_{t-p}) is defined as a point for which every trajectory of x_t = f(x_{t-1}, …, x_{t-p}) beginning sufficiently near ξ approaches ξ either for t → ∞ or for t → −∞. If it approaches ξ for t → ∞ we call ξ a stable singular point, and if it approaches ξ for t → −∞ we call ξ an unstable singular point.

DEFINITION 2.2. A limit cycle of x_t = f(x_{t-1}, …, x_{t-p}) is defined as an 'isolated' and 'closed' trajectory x_{t+1}, x_{t+2}, …, x_{t+q}, where q is a positive integer. 'Closed' means that if the initial value (x₁, …, x_p) belongs to the limit cycle, then (x_{1+kq}, …, x_{p+kq}) = (x₁, …, x_p) for any integer k. 'Isolated' means that every trajectory beginning sufficiently near the limit cycle approaches it either for t → ∞ or for t → −∞. If it approaches the limit cycle for t → ∞ we call it a stable limit cycle, and if it approaches the limit cycle for t → −∞ we call it an unstable limit cycle. The smallest integer q which satisfies Definition 2.2 is called the period of the limit cycle of x_t = f(x_{t-1}, …, x_{t-p}). A singular point can be considered to be a limit cycle of period 1, but we distinguish it because it has a significantly different physical meaning.
The stability condition for a limit cycle is given in the same way as the stability of singular points.

Fig. 2.15.

For example, when the model

x_t = (φ₁ + π₁ e^{−x²_{t-1}})x_{t-1}  (2.45)

is known to have a limit cycle y_t, y_{t+1}, …, y_{t+q-1}, y_{t+q}(= y_t), a point x_s near the limit cycle is represented as

x_s = y_s + ξ_s .  (2.46)

Then, replacing x_t and x_{t-1} in (2.45) by y_t + ξ_t and y_{t-1} + ξ_{t-1}, we have

ξ_t = {φ₁ + π₁(1 − 2y²_{t-1}) e^{−y²_{t-1}}}ξ_{t-1} + O(ξ²_{t-1}) .  (2.47)

From this we can see that when the solution of the difference equation

ξ_t = {φ₁ + π₁(1 − 2y²_{t-1}) e^{−y²_{t-1}}}ξ_{t-1}  (2.48)

converges to zero as t → ∞, the limit cycle is stable. The equation (2.48) is a linear difference equation with a periodic coefficient, which is difficult to solve analytically. What we want to know is whether ξ_t of (2.48) converges to zero or not, and this can be checked by seeing whether |ξ_{t+q}/ξ_t| is less than one or not. From (2.48) we have

ξ_{t+q} = {φ₁ + π₁(1 − 2y²_{t+q-1}) e^{−y²_{t+q-1}}}ξ_{t+q-1}
        = {φ₁ + π₁(1 − 2y²_{t+q-1}) e^{−y²_{t+q-1}}}{φ₁ + π₁(1 − 2y²_{t+q-2}) e^{−y²_{t+q-2}}}ξ_{t+q-2}
        = ⋯ = {φ₁ + π₁(1 − 2y²_{t+q-1}) e^{−y²_{t+q-1}}} ⋯ {φ₁ + π₁(1 − 2y²_t) e^{−y²_t}}ξ_t .

This gives rise to the following proposition:

PROPOSITION 2.2. A limit cycle of period q, ȳ_{t+1}, ȳ_{t+2}, …, ȳ_{t+q}, of the model (2.45) is stable if

|{φ₁ + π₁(1 − 2ȳ²_{t+q-1}) e^{−ȳ²_{t+q-1}}} ⋯ {φ₁ + π₁(1 − 2ȳ²_t) e^{−ȳ²_t}}| < 1 .  (2.49)

Using similar logic for the general order exponential AR model

x_t = (φ₁ + π₁ e^{−x²_{t-1}})x_{t-1} + ⋯ + (φ_p + π_p e^{−x²_{t-1}})x_{t-p} + ε_t  (2.50)

we have the following proposition:

PROPOSITION 2.3. A limit cycle of period q, ȳ_{t+1}, …, ȳ_{t+q}, of model (2.50) is

stable if all the eigenvalues of the matrix

A = A_q · A_{q-1} ⋯ A₁ ,  (2.51)

have absolute value less than one, where

A_i = ( a^{(i)}_{1,1}  a^{(i)}_{1,2}  ⋯  a^{(i)}_{1,p-1}  a^{(i)}_{1,p} )
      (      1              0        ⋯        0              0         )
      (      0              1        ⋯        0              0         )   (2.52)
      (      ⋮                        ⋱                       ⋮         )
      (      0              0        ⋯        1              0         )

with

a^{(i)}_{1,1} = φ₁ + π₁ e^{−ȳ²_{t+i-1}} − 2( Σ_{j=1}^{p} π_j ȳ_{t+i-j} ) ȳ_{t+i-1} e^{−ȳ²_{t+i-1}} ,

a^{(i)}_{1,k} = φ_k + π_k e^{−ȳ²_{t+i-1}} ,  k = 2, 3, …, p .
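A numerical check of Proposition 2.3 can be sketched as follows (Python with numpy), here applied to the second-order model (2.28); the transient length and the period-matching tolerance are arbitrary choices of the sketch.

```python
import numpy as np

def step(x1, x2):
    """Noise-free skeleton of the exponential AR model (2.28)."""
    e = np.exp(-x1**2)
    return (1.95 + 0.23*e)*x1 - (0.96 + 0.24*e)*x2

def A_i(x1, x2):
    """Companion matrix (2.52) for p = 2, evaluated on the cycle."""
    e = np.exp(-x1**2)
    a11 = 1.95 + 0.23*e - 2*(0.23*x1 - 0.24*x2)*x1*e
    a12 = -(0.96 + 0.24*e)
    return np.array([[a11, a12], [1.0, 0.0]])

# Iterate onto the cycle, then search for its period q.
x1, x2 = 0.1, 0.0
for _ in range(5000):
    x1, x2 = step(x1, x2), x1
orbit = [(x1, x2)]
for _ in range(300):
    x1, x2 = step(x1, x2), x1
    orbit.append((x1, x2))
q = next((k for k in range(1, 301)
          if abs(orbit[k][0] - orbit[0][0]) < 1e-9
          and abs(orbit[k][1] - orbit[0][1]) < 1e-9), None)

if q is not None:                      # if not found, lengthen the transient
    A = np.eye(2)
    for k in range(q):
        A = A_i(*orbit[k]) @ A         # A = A_q ... A_2 A_1 of (2.51)
    print(q, np.abs(np.linalg.eigvals(A)))  # stable if all moduli < 1
```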

The fact that a deterministic non-divergent exponential AR model

x_t = (φ₁ + π₁ e^{−x²_{t-1}})x_{t-1} + (φ₂ + π₂ e^{−x²_{t-1}})x_{t-2}  (2.53)

does not have any stable singular point does not necessarily mean that x_t goes to some stable limit cycle for t → ∞. There can be a model which has neither a stable singular point nor a stable limit cycle. For example, the model

x_t = (1 − 1.8 e^{−x²_{t-1}})x_{t-1} − (0.25 − 7.25 e^{−x²_{t-1}})x_{t-2}  (2.54)

is non-divergent because the roots of

λ² − λ + 0.25 = 0

lie inside the unit circle; however, it has neither a stable singular point nor a stable limit cycle (Ozaki, 1981a). Simulation of the model shows random fluctuation without any white noise input (see Fig. 2.16). A random process generated by a deterministic non-linear difference equation model is called

Fig. 2.16.

chaos, and has been studied by physicists, biologists and mathematicians (see, for example, Rössler, 1979).

2.6. Some other amplitude-dependent AR models


The exponential AR models are introduced from the idea of making the instantaneous characteristic roots of AR models amplitude dependent, as in

x_t = f₁(x_{t-1})x_{t-1} + ⋯ + f_p(x_{t-1})x_{t-p} + ε_t ,  (2.55)

where

f_i(x_{t-1}) = φ_i + π_i e^{−x²_{t-1}} .

We can introduce some other parametric non-linear amplitude-dependent AR models based on the same idea. For example, if we parametrize the f₁(x_{t-1}) of the first-order amplitude-dependent AR model (2.55) by a step function (see Fig. 2.17), we have a linear threshold AR model (Tong and Lim, 1980),

x_t = φ^{(1)}x_{t-1} + ε_t   for −∞ < x_{t-1} < T₁ ,
      φ^{(2)}x_{t-1} + ε_t   for T₁ ≤ x_{t-1} < T₂ ,  (2.56)
      ⋯
      φ^{(m)}x_{t-1} + ε_t   for T_{m-1} ≤ x_{t-1} < ∞ .

If we approximate f₁(x_{t-1}) by a polynomial (see Fig. 2.17) on a finite region of x_{t-1} and by constants outside the region, we have the following non-linear threshold AR model (Ozaki, 1981b),

x_t = π(T₁)x_{t-1} + ε_t        for x_{t-1} < T₁ ,
      π(x_{t-1})x_{t-1} + ε_t   for T₁ ≤ x_{t-1} < T₂ ,  (2.57)
      π(T₂)x_{t-1} + ε_t        for x_{t-1} ≥ T₂ ,

where π(x_{t-1}) = π₀ + π₁x_{t-1} + ⋯ + π_r x^r_{t-1}. If we approximate f₁(x_{t-1}) by a constant plus a Hermitian-type polynomial, φ₁ + (π₀ + π₁x_{t-1} + ⋯ + π_s x^s_{t-1}) e^{−x²_{t-1}}

Fig. 2.17. The φ function of (a) the linear threshold AR model, (b) the non-linear threshold AR model and (c) the exponential AR model.

Fig. 2.18.

(see Fig. 2.17), we have the following extended exponential AR model (Ozaki, 1981a):

x_t = {φ₁ + (π₀ + π₁x_{t-1} + ⋯ + π_s x^s_{t-1}) e^{−x²_{t-1}}}x_{t-1} + ε_t .  (2.58)

This model includes the exponential AR model as the special case s = 0.


It seems that non-linear models with continuous φ functions have a more versatile geometric structure than models with discontinuous step φ functions, such as linear threshold AR models. For example, if we design the φ function of

x_{t+1} = φ(x_t)x_t + ε_{t+1}

as in Fig. 2.18 by using a non-linear threshold AR model or an extended exponential AR model, then φ(x_t) = 1 at the four points x_t = ξ₁, ξ′₁, ξ₂ and ξ′₂ (see Fig. 2.18), and so the models have four non-zero stable or unstable singular points, to which x_t converges or from which x_t diverges when the white noise ε_t is suppressed. However, the linear threshold models do not have such a geometric structure, since the φ function of such a model is a discontinuous step function.

2.7. Threshold structure

We have used the threshold in some amplitude-dependent AR models to approximate the dynamics of the AR coefficients. The introduction of the threshold idea in such a situation may look somewhat ad hoc. However, there are often cases in nature, in physical or biological phenomena, where the threshold value has a significant physical meaning. The threshold structure does not necessarily mean that the system is switched from one linear system to another linear system depending on whether the x_t value crosses over the critical value. One example is the wave propagation of a nerve impulse or a heart beat, which is supposed to form a fixed wave pattern and propagate if an impulse is larger than a critical value, while if the impulse is less than the critical value the impulse wave dies out (see Fig. 2.19).

Fig. 2.19. (a) Impulse above the threshold; (b) impulse below the threshold.
Neurophysically, the wave propagation is realized by the flow of electrons along the axon, which is caused by the change of membrane potential; a mathematical model for this dynamic phenomenon, called the Hodgkin–Huxley equation, was presented by Hodgkin and Huxley (1952). Starting from this Hodgkin–Huxley equation, Fitzhugh (1969) obtained the following non-linear dynamical system model for the dynamics of the potential V:

dV/dt = a(V − E₀)³₊(E₂ − V) − b(V − E₁) ,  (2.59)

where (V − E₀)³₊ = (V − E₀)³ for V ≥ E₀ and (V − E₀)³₊ = 0 for V < E₀, E₀ < E₁ < E₂, and E₀, E₁ and E₂ are ionic equilibrium potentials determined by the sodium and potassium ions and some other ions. The coefficients a and b of (2.59) are values related to the sodium, potassium and some other ions. Since they vary very slowly compared with V, they can be considered locally constant. From (2.59) we know that dV/dt is zero at A, B and C (see Fig. 2.20). Reference to the sign of dV/dt in the neighbourhood of these points shows that A and C are stable singular points, while B is an unstable singular point. If V > B, V → C, but if V < B, V → A. Therefore, B is a 'threshold', separating two stable states which may be called the resting state A and the excited state C. This kind of threshold structure is realized by the discrete time non-linear difference equation

x_{t+1} = φ(x_t)x_t ,

designing φ(x_t) as in Fig. 2.18. One example is the following model:

x_{t+1} = (0.8 + 4x²_t e^{−x²_t})x_t ,  (2.60)
Fig. 2.20. Ionic currents b(V − E₁) and a(V − E₀)³₊(E₂ − V) as functions of V.

where ξ′₁ = 0.226… and ξ′₂ = −0.226… are unstable singular points, and ξ₁ = 2.1294…, ξ₂ = −2.1294… and ξ₀ = 0 are stable singular points. If we apply an impulse to model (2.60), then x_t goes to zero for t → ∞ if the magnitude of the impulse is less than the unstable singular point ξ′₁, but x_t goes to ξ₁ for t → ∞ if the magnitude of the impulse is larger than the threshold value ξ′₁ (see Fig. 2.21). If we have a white noise input to the model defined by (2.60), we have the following model:

x_{t+1} = (0.8 + 4x²_t e^{−x²_t})x_t + ε_{t+1} ,  (2.61)
where ε_{t+1} is a Gaussian white noise.

Fig. 2.21.

Fig. 2.22.

Fig. 2.22 shows the simulation of model (2.61), where x_t fluctuates around one of the stable singular points and sometimes moves from one stable singular point to another depending on the white noise input. By looking at the data (Fig. 2.22) of the above example (2.61), people may think of two linear models, one above the threshold and one below the threshold. However, the data are actually described by one non-linear model.
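The two behaviours can be reproduced with the following sketch (Python with numpy; the noise standard deviation 0.5 and the random seed are arbitrary choices):

```python
import numpy as np

def phi(x):                                   # amplitude-dependent coefficient of (2.60)
    return 0.8 + 4*x**2*np.exp(-x**2)

# Impulse response of the noise-free model (2.60):
for x0 in (0.2, 0.3):                         # below / above the threshold ~0.226
    x = x0
    for _ in range(200):
        x = phi(x)*x
    print(x0, "->", round(x, 3))              # 0.2 -> 0.0, 0.3 -> ~2.12

# Model (2.61): the same skeleton driven by white noise hops between
# the stable singular points depending on the noise path.
rng = np.random.default_rng(1)
x, path = 0.0, []
for _ in range(1000):
    x = phi(x)*x + 0.5*rng.standard_normal()
    path.append(x)
```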
A similar non-linear phenomenon is realized by a non-linear time series model with time-varying coefficients. For example, consider the following model:

x_{t+1} = φ(t, x_t)x_t + ε_{t+1} ,  (2.62)

where

φ(t, x_t) = 0.8 + 0.4τ_t e^{−x²_t} + 4(1 − τ_t)x²_t e^{−x²_t}

changes from

x_{t+1} = (0.8 + 4x²_t e^{−x²_t})x_t + ε_{t+1}  (2.63)

to

x_{t+1} = (0.8 + 0.4 e^{−x²_t})x_t + ε_{t+1}  (2.64)

Fig. 2.23.

Fig. 2.24.

as "rt increases monotonically from 0 to 1 as t increases. The model (2.63), as we


saw before, has three stable singular points ~:~, ~:z and so0= 0 and two unstable
singular points s~ and ~ , while the model (2.64) has two stable singular points
r t ~ = 0 . 8 3 . . . . and - q ~ = - 0 . 8 3 . . . , and one unstable singular point rl = 0 .
Therefore, the stable singular point ~:0= 0 changes into an unstable singular
point as time t passes and the process xt begins to move arouhd s% to one of the
other stable singular points as in Fig. 2.23.
The sudden change of an equilibrium point in the above example is con-
sidered to be a result of a smooth change of some potential function as in Fig.
2.24.
This kind of structural change of the process caused by a gradual change of
parameters is closely related with the topic treated in catastrophe theory (see,
for example, Zeeman, 1977).

2.8. Distributions

We have seen that a threshold structure is realized by the stationary non-linear time series model

x_{t+1} = (0.8 + 4x²_t e^{−x²_t})x_t + ε_{t+1} ,  (2.65)

where x_t moves around from one stable singular point to another depending on the white noise input. However, the process defined by (2.65) has one and the same equilibrium distribution on the whole. Fig. 2.26 shows the histogram of data generated by simulating the non-linear threshold AR model

x_{t+1} = (0.8 + 1.3x²_t − 1.3x⁴_t)x_t + ε_{t+1}   for |x_t| < 1.0 ,  (2.66)
          0.8x_t + ε_{t+1}                         for |x_t| ≥ 1.0 ,

which has the same structural property as (2.65). It has three stable singular points ξ₀ = 0, ξ₁ = 0.9 and ξ₂ = −0.9, and two unstable singular points ξ′₁ = 0.4358… and ξ′₂ = −0.4358…. Fig. 2.25 shows the histogram of the white noise used in the above simulation, where the number of data points is N = 8000.

Fig. 2.25.

Fig. 2.26.

It is obvious that the three peaks in Fig. 2.26 correspond to the three stable singular points ξ₀, ξ₁ and ξ₂, and the two valleys correspond to the two unstable singular points ξ′₁ and ξ′₂. These correspondences remind us of the
Fig. 2.27.

correspondence between the singular points of the dynamical system

ẋ = f(x)

and its potential function

V(x) = −∫^x f(y) dy .

For example, the dynamical system

ẋ = −4x + 5x³ − x⁵

has three stable singular points ξ₀ = 0, ξ₁ = 2, ξ₂ = −2 and two unstable singular points ξ′₁ = 1 and ξ′₂ = −1 (see Fig. 2.27).
The stable singular points correspond to the valleys of the potential, and the unstable singular points correspond to the peaks of the potential (see Fig. 2.28). Further, it is known that the equilibrium distribution W(x) of the diffusion process defined by the stochastic dynamical system

ẋ = f(x) + n(t)

is given by

W(x) = W₀ exp{−2V(x)/σ²} ,

where σ² is the variance of the white noise n(t) and W₀ is a normalizing constant. If we consider this structural correspondence between non-linear time series models and diffusion processes defined by stochastic dynamical systems, it is natural to study the diffusion process and its time discretization scheme, as we do in the succeeding section.
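A small numerical sketch of this correspondence (Python with numpy; the noise variance σ² = 2 is an arbitrary choice):

```python
import numpy as np

f = lambda x: -4*x + 5*x**3 - x**5
V = lambda x: 2*x**2 - 5*x**4/4 + x**6/6     # V(x) = -integral of f
sigma2 = 2.0                                 # illustrative noise variance

x = np.linspace(-3.0, 3.0, 1201)
W = np.exp(-2*V(x)/sigma2)
W /= W.sum()*(x[1] - x[0])                   # normalize to a density
# W peaks at the stable points 0 and +-2 (valleys of V) and has
# valleys at the unstable points +-1 (peaks of V).
print(x[np.argmax(W)])                       # one of the deepest wells, +-2
```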
Fig. 2.28.

3. Diffusion processes and their time discretizations

3.1. Stochastic dynamical systems

A stochastic dynamical system is defined by

ẋ = f(x) + ξ(t) ,  (3.1)

where ξ(t) is a Gaussian white noise with variance σ², and so it is also represented as

ẋ = f(x) + σn(t) ,  (3.2)

where n(t) is a unit Gaussian white noise whose variance is one. Since, for small τ > 0, it holds that

lim_{τ→0} E[Δx]/τ = f(x) ,
lim_{τ→0} E[(Δx)²]/τ = σ² ,
lim_{τ→0} E[(Δx)^k]/τ = 0  (k ≥ 3) ,

where Δx = x(t + τ) − x(t) = f(x)τ + σ∫_t^{t+τ} n(s) ds + o(τ), we have, for the process defined by (3.2), the following Fokker–Planck equation:

∂p/∂t = −(∂/∂x)[f(x)p] + ½(∂²/∂x²)[σ²p] ,  (3.3)

where p stands for the transition probability p(x | x₀, t), the probability density that the process takes the value x at time t given that it had the value x₀ at time t = 0. Thus the stochastic dynamical system uniquely defines a diffusion process with transition probability p(x | x₀, t) defined by the Fokker–Planck equation (3.3).

Conversely, the diffusion process defined by (3.3) obviously uniquely defines the stochastic dynamical system (3.2). However, the rate of growth of the variance,

lim_{τ→0} E[(Δx)²]/τ ,

of a general diffusion process is not a constant but a function of x. A general diffusion process is characterized by the following Fokker–Planck equation:

∂p/∂t = −(∂/∂x)[a(x)p] + ½(∂²/∂x²)[b(x)p] .  (3.4)

Then (3.4) uniquely defines the following stochastic differential equation (see, for example, Goel and Richter-Dyn, 1974):

ẋ = f(x) + g(x)n(t) ,  (3.5)

where

f(x) = a(x) ,  g(x) = √b(x) .

On the other hand, a stochastic differential equation

ẋ = f(x) + g(x)n(t)

uniquely defines a diffusion process whose Fokker–Planck equation is

∂p/∂t = −(∂/∂x)[f(x)p] + ½(∂²/∂x²)[g²(x)p] .

By the variable transformation

y = y(x) = ∫^x dξ/g(ξ) ,  (3.6)

we have, from the stochastic differential equation (3.5), the following stochastic dynamical system:

ẏ = α(y) + n(t) ,  (3.7)

where n(t) is a Gaussian white noise with unit variance. We call the process y the associated diffusion process of (3.4), and we call the dynamical system ẏ = α(y) the associated dynamical system of (3.4). By analogy with mechanics we define the potential function by

V(y) = −∫^y α(ȳ) dȳ .  (3.8)

We note that the potential function (3.8) is different from the potential function well known in Markov process theory (Blumenthal and Getoor, 1968), and we call V(y) of (3.8) the potential function associated with the diffusion process, or simply the associated potential function. The above discussion suggests that any diffusion process uniquely defines a variable transformation and a potential function with respect to the transformed variable.

3.2. Distribution systems

Since our interest is in non-linear stationary time series with given equilibrium distributions, let us confine ourselves to homogeneous diffusion processes which have unique equilibrium distributions. The equilibrium distribution W(x) of the diffusion process (3.4) is given by

W(x) = {C/b(x)} exp{2∫^x [a(ξ)/b(ξ)] dξ} ,  (3.9)

where C is the normalizing constant. Wong (1963) showed that for any probability distribution function W(x) defined by the Pearson system

dW(x)/dx = {(c₀ + c₁x)/(d₀ + d₁x + d₂x²)} W(x) ,  (3.10)

we can construct a diffusion process whose equilibrium distribution is W(x). Then the following proposition is obvious from a straightforward extension of Wong's logic:

PROPOSITION 3.1. For any distribution W(x) defined by the distribution system

dW(x)/dx = {c(x)/d(x)} W(x) ,  (3.11)

we can construct a diffusion process whose equilibrium distribution is W(x), as follows:

∂p/∂t = −(∂/∂x)[{c(x) + d′(x)}p] + ½(∂²/∂x²)[2d(x)p] ,  (3.12)

where c(x) and d(x) are analytic functions.



We call the distribution system (3.11) a generalized Pearson system. The system includes not only the distributions of the Pearson system but also all the analytic exponential families of distributions, which are defined as the set of distributions of the form

W(x) = a(θ)b(x) exp{β(θ)·t(x)} ,  (3.13)

where a and the βᵢ of β = (β₁, …, β_k) are real-valued functions of the parameter θ, and b(x) and t(x) = (t₁(x), …, t_k(x))′ are analytic functions of x (Barndorff-Nielsen, 1978). From the definition of the generalized Pearson system the following propositions are also easily obtained.

PROPOSITION 3.2. The generalized Pearson system of the equilibrium distribution of the diffusion process defined by the Fokker–Planck equation

∂p/∂t = −(∂/∂x)[a(x)p] + ½(∂²/∂x²)[b(x)p]  (3.14)

is

dW(x)/dx = {(2a(x) − b′(x))/b(x)} W(x) .  (3.15)

PROPOSITION 3.3. The generalized Pearson system of the equilibrium distribution of the diffusion process defined by the stochastic differential equation

ẋ = f(x) + g(x)n(t)  (3.16)

is

dW(x)/dx = {(2f(x) − g(x)g′(x))/g(x)²} W(x) .  (3.17)

PROPOSITION 3.4. The generalized Pearson system of the diffusion process y associated with the diffusion process x defined by (3.16) is

dW(y)/dy = 2α(y)W(y) ,  (3.18)

where

α(y) = α(y(x)) = f(x)/g(x) .  (3.19)

The above correspondence between the generalized Pearson system and the diffusion process in Proposition 3.1 is unique if we require that c(x) and d(x) of (3.11) be mutually irreducible.

3.3. Local linearization of ẏ = f(y) + n(t)

A well-known method for the time discretization of

ẏ = f(y) + n(t)  (3.20)

is to use the following Markov chain model:

y_{t+Δt} − y_t = Δt·f(y_t) + B_{t+Δt} − B_t ,  (3.21)

where B_{t+Δt} − B_t is an increment of a Brownian motion process and is distributed as a Gaussian distribution with variance Δt. The process y_t defined by (3.21) is known to converge uniformly, for Δt → 0, to the original diffusion process y defined by (3.20) on a finite interval of time (Gikhman and Skorohod, 1965). The deterministic part, y_{t+Δt} − y_t = Δt·f(y_t), of (3.21) is known as the Euler method of discretization of the dynamical system

ẏ = f(y) .  (3.22)

However, the Euler method is known to be unstable and explosive for any small Δt if the initial value of y is in some region. For example, the trajectory y(t) of

ẏ = −y³  (3.23)

is known to go to zero for any initial value of y. Its discretization by the Euler method is

y_{t+Δt} = y_t − Δt·y³_t ,  (3.24)

which is explosive, the trajectory going to infinity, if the initial value y₀ is in the region |y₀| > √(2/Δt). It is also known that, for any small Δt, the Markov chain (3.21) is non-stationary if f(y) is a non-linear function which goes to ±∞ for |y| → ∞ (Jones, 1978). The same can be said of some other more sophisticated discretization methods, such as the Heun method or the Runge–Kutta method (see, for example, Henrici, 1962).

For the estimation and simulation of diffusion processes by a digital computer, it is desirable to have a stationary Markov chain which converges, for Δt → 0, to the stationary diffusion process concerned. Our way of obtaining such a stationary Markov chain is based on the following local linearization idea. When f(y) of (3.22) is linear, as in

ẏ = −αy ,  (3.25)

its analytic solution is

y(t) = y₀ e^{−αt} .  (3.26)



Therefore, we can define the discrete time dynamical system

y_{t+Δt} = e^{−αΔt} y_t ,  (3.27)

which coincides with y(t) of (3.26) at t, t + Δt, t + 2Δt, …. Also, the Markov chain defined by

y_{t+Δt} = e^{−αΔt} y_t + √Δt e_{t+Δt}  (3.28)

is stationary if α > 0, and this Markov chain converges to the stationary diffusion process

ẏ = −αy + n(t) .

If we approximate e^{−αΔt} of (3.27) by a first-order Taylor approximation, (3.27) becomes equivalent to the Euler method, which does not even coincide with the analytic solution (3.26) at t, t + Δt, t + 2Δt, …. Other discretization methods, such as the Heun method and the Runge–Kutta method, are approximation methods which aim at higher-order (2nd and 4th, respectively) Taylor approximations of e^{−αΔt}. If we consider the general dynamical system (3.22) to be locally linear, i.e. linear over a small interval Δt, and if we use the analytic solution (3.26) on the small interval, we obtain a trajectory which coincides with the trajectory of the original dynamical system at least for linear f(y). This idea is realized by integrating, over [t, τ), t ≤ τ < t + Δt,

ÿ = (∂f/∂y) ẏ ,  (3.29)

which is obtained by differentiating (3.22), assuming that

J_t = ∂f/∂y ≠ 0  (3.30)

is constant on the interval, i.e. assuming that the system is linear on the interval. Then we have

ẏ(τ) = e^{J_t(τ−t)} ẏ(t) ,  (3.31)

from which we have, by integrating again over [t, t + Δt),

y(t + Δt) = y(t) + J_t^{−1}(e^{J_tΔt} − 1)f(y_t) .  (3.32)

For J_t = 0 we have

y(t + Δt) = y(t) + Δt·f(y_t) .  (3.33)


It is easily seen that the model defined by (3.32) and (3.33), which we call a locally linearized dynamical system, converges to ẏ = f(y) for Δt → 0. It is also easily checked (see, for example, Gikhman and Skorohod, 1965) that the Markov chain defined by

y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,  (3.34)

where

Φ(y_t) = y_t + J_t^{−1}(e^{J_tΔt} − 1)f(y_t)   for J_t ≠ 0 ,
         y_t + Δt·f(y_t)                       for J_t = 0 ,  (3.35)

and e_{t+Δt} is a Gaussian white noise with unit variance, converges to the diffusion process y(t) of (3.20). We call the model (3.34) the locally linearized Markov chain model of the stochastic dynamical system model (3.20). As we shall see later, the present local linearization method, unlike the Euler method and other discretization methods, yields non-explosive discrete time dynamical systems. If f(y) is specified, it is easy to check whether the locally linearized dynamical system is non-explosive or not. However, it may sometimes be useful to have sufficient conditions for the non-explosiveness of the locally linearized dynamical system for the general dynamical system

ẏ = f(y) .
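The local linearization step (3.32)–(3.35) can be implemented directly. The following sketch (Python with numpy) compares it with the Euler method on ẏ = −y³, starting from |y₀| > √(2/Δt), where Euler explodes:

```python
import numpy as np

def ll_step(y, f, J, dt):
    """One deterministic step of the locally linearized map (3.32)-(3.33)."""
    Jt = J(y)
    if Jt == 0.0:
        return y + dt*f(y)                   # branch (3.33)
    return y + np.expm1(Jt*dt)/Jt*f(y)       # y + J^{-1}(e^{J dt} - 1) f(y)

f = lambda y: -y**3
J = lambda y: -3*y**2
dt, y_ll, y_eu = 0.1, 10.0, 10.0             # |y0| > sqrt(2/dt) ~ 4.47
for _ in range(5):
    y_ll = ll_step(y_ll, f, J, dt)
    y_eu = y_eu + dt*f(y_eu)                 # Euler (3.24)
    print(round(y_ll, 4), y_eu)              # local linearization decays; Euler blows up

# The Markov chain (3.34) adds sqrt(dt)*N(0,1) innovations to each step:
rng = np.random.default_rng(0)
y = 10.0
for _ in range(1000):
    y = ll_step(y, f, J, dt) + np.sqrt(dt)*rng.standard_normal()
```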

The model (3.34) can be rewritten in the following way:

y_{t+Δt} = φ(y_t)y_t + √Δt e_{t+Δt} ,  (3.36)

where

φ(y_t) = 1 + (e^{J_tΔt} − 1)f(y_t)/(J_t·y_t)  (3.37)

for y_t ≠ 0 and J_t ≠ 0. For y_t to be non-explosive for t → ∞, the function f(y) has only to satisfy

|φ(y_t)| < 1

for large |y_t|. From (3.37) it is obvious that we have

(e^{J_tΔt} − 1)f(y_t)/(J_t·y_t) < 0 ,  (3.38)

and hence φ(y_t) < 1, for large |y_t|, if f(y) satisfies the following condition:

(A) f(y) < 0 for y → ∞, f(y) > 0 for y → −∞, and ∂f(y)/∂y (= J(y)) < 0 for |y| → ∞.

For the locally linearized dynamical system model (3.36) to be non-explosive, we also have to ensure that

φ(y) > −1

for |y| → ∞. This is equivalent to

(e^{J(y)Δt} − 1)f(y)/{J(y)y} > −2  (3.39)

for |y| → ∞. Whether f(y) satisfies (3.39) or not depends very much on the decreasing (or increasing) behaviour of the function f(y) for |y| → ∞. From now on we discuss the situation where y → ∞, because the same logic may be applied on the negative side. If

J(y) → 0  for y → ∞ ,

then we have

{(e^{J(y)Δt} − 1)/J(y)}·{f(y)/y} → Δt·f(y)/y > −2  for y → ∞ .  (3.40)

Therefore, φ(y) > −1 for y → ∞ if f(y) satisfies J(y) → 0 for y → ∞. If J(y) → c < 0 for y → ∞, we have, for sufficiently small Δt,

{(e^{J(y)Δt} − 1)/J(y)}·{f(y)/y} → {(e^{cΔt} − 1)/c}·{f(y)/y} > −2  for y → ∞ .  (3.41)

Therefore, a sufficient condition for φ(y) > −1 for y → ∞ is:

(B1) J(y) → c ≤ 0 for y → ∞ .

If J(y) → −∞ for y → ∞, we have

(e^{J(y)Δt} − 1)f(y)/{J(y)y} ≥ −f(y)/{J(y)y}  for y → ∞ .

Therefore, the following condition is also a sufficient condition for φ(y) > −1 for y → ∞:

(B2) J(y) → −∞ for y → ∞ and −f(y)/{J(y)y} > −2 for y → ∞ .

Examples of functions which satisfy the condition J(y) → 0 for y → ∞ are

f(y) = −tanh y and f(y) = −log y ,


which decrease more slowly than a linear function for y → ∞. Examples of functions which satisfy J(y) → c < 0 for y → ∞ are

f(y) = −cy + by e^{−y²} and f(y) = −cy − b e^{−y²} ,

which approach a linear function for y → ∞. The functions which satisfy the condition (B2) are those which decrease faster than linear functions. The following condition (B′₂) is stronger than (B2) but easier to use for checking φ(y) > −1, by comparing f(y) with linear functions.

(B′₂) J(y) → −∞ for y → ∞, there exists y₀ > 0 such that f(y) is concave, i.e. ∂²f(y)/∂y² < 0, for y > y₀, and for any c > 0 there exists y₁ ≥ y₀ such that f(y₁) < −cy₁.

The condition (B′₂) is introduced as follows. To have −f(y)/{J(y)y} > −2 for y → ∞, we need to have ψ(y) = f(y) − 2J(y)y > 0. If ψ(y₀) > 0 and ψ′(y) > 0 for y > y₀, then we have ψ(y) > 0 for y > y₀. To have ψ′(y) > 0 for y > y₀, concavity of f(y) is sufficient. To have ψ(y₀) ≥ 0 for some y₀, it is sufficient that for any c > 0 there exists y₁ ≥ y₀ such that f(y₁) < −cy₁. This is always satisfied if f(y) satisfies the condition (B′₂), and so (B′₂) is a sufficient condition for φ(y) > −1 for y → ∞. Examples of functions which satisfy (B′₂) are

f(y) = −y e^{y²} and f(y) = −y³ .

Similar conditions on f(y) for y → −∞ are obtained by the same logic, as follows:

(C1) J(y) → c ≤ 0 for y → −∞ ,

(C′₂) J(y) → −∞ for y → −∞, there exists y₀ < 0 such that f(y) is convex (i.e. ∂²f(y)/∂y² > 0) for y < y₀, and for any c > 0 there exists y₁ ≤ y₀ such that f(y₁) > −cy₁ .


From the above discussion we have the following theorem:

THEOREM 3.1. The locally linearized dynamical system (3.32) is non-explosive if the function f(y) of (3.22) satisfies the condition (A), one of the conditions (B1) or (B′₂), and one of the conditions (C1) or (C′₂).

The non-explosiveness of the locally linearized dynamical system (3.32) is closely related to the ergodicity of Markov chains on a continuous norm space. For the locally linearized Markov chain (3.34) to be ergodic, Theorem 2.1 requires Φ(y) to be a continuous function of y and to have the shift back to centre property, which is guaranteed by

|Φ(y)/y| = |φ(y)| < 1  for |y| → ∞ .

Therefore, we have the following theorem:

THEOREM 3.2. The locally linearized Markov chain (3.34) is ergodic if f(y) satisfies the condition (A), one of the conditions (B1) or (B′₂), and one of the conditions (C1) or (C′₂).

3.4. Some examples

Let us look at some examples of diffusion processes which have distributions of interest, and at their locally linearized Markov chain models.

EXAMPLE 1. Ornstein–Uhlenbeck process. The Ornstein–Uhlenbeck process is defined by

∂p/∂t = (∂/∂x)[axp] + ½(∂²/∂x²)[σ²p] ,  (3.42)

from which we have the following stochastic differential equation:

ẋ = −ax + σn(t) .  (3.43)

The associated dynamical system is

ẏ = −ay ,  (3.44)

where y = x/σ. We define the damping function z(y) of a dynamical system ẏ = f(y) by

z(y) = −f(y) .

Then the damping function of (3.44) is the linear function (see Fig. 3.1)

z(y) = ay ,  (3.45)

and the associated potential function (see Fig. 3.2) is

V(y) = (a/2) y² .  (3.46)

Fig. 3.1. Fig. 3.2.

The Pearson system of the equilibrium distribution of x of (3.42) is

dW(x)/dx = (−2ax/σ²) W(x) ,  (3.47)

and the distribution W(x) is the well-known Gaussian distribution (see Fig. 3.3)

W(x) = √(a/(πσ²)) exp(−ax²/σ²) .  (3.48)

The locally linearized Markov chain model is

x_t = σy_t ,
y_{t+Δt} = e^{−aΔt} y_t + √Δt e_{t+Δt} ,  (3.49)

which is an AR(1) model with a constant φ function (see Fig. 3.4).

Fig. 3.3. Fig. 3.4.

EXAMPLE 2. ẋ = −x³. The dynamical system

ẋ = −x³  (3.50)

has a non-linear cubic damping function, as in Fig. 3.5. If this dynamical system is



driven by a white noise of variance σ², we have

ẋ = −x³ + σn(t) .  (3.51)

The Fokker–Planck equation of the process x is

∂p/∂t = (∂/∂x)[x³p] + ½(∂²/∂x²)[σ²p] .  (3.52)

The associated dynamical system is obtained by employing the variable transformation

y = x/σ ,  (3.53)

giving

ẏ = −σ²y³ .  (3.54)

The associated potential function (see Fig. 3.6) is

V(y) = (σ²/4) y⁴ .  (3.55)

The distribution system of the equilibrium distribution of x is

dW(x)/dx = (−2x³/σ²) W(x) ,  (3.56)

so the distribution W(x) is given by (see Fig. 3.7)

W(x) = W₀ exp{−x⁴/(2σ²)} ,  (3.57)

where W₀ is a normalizing constant.

Fig. 3.5. Fig. 3.6.



Fig. 3.7. Fig. 3.8.

The locally linearized Markov chain model is

x_t = σy_t ,
y_{t+Δt} = φ(y_t)y_t + √Δt e_{t+Δt} ,  (3.58)

where

φ(y_t) = 2/3 + (1/3) exp(−3σ²Δt y²_t) .  (3.59)

The graph of the φ function is shown in Fig. 3.8.
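A quick numerical check (Python with numpy; σ², Δt and the test points are arbitrary) that the general map (3.35) reduces to the closed form (3.59) for this model:

```python
import numpy as np

sigma2, dt = 1.5, 0.05
for y in (0.3, 1.0, 2.5):
    f = -sigma2*y**3
    J = -3*sigma2*y**2
    phi_general = 1 + np.expm1(J*dt)*f/(J*y)        # phi of (3.37)
    phi_closed = 2/3 + np.exp(-3*sigma2*dt*y**2)/3  # phi of (3.59)
    print(y, phi_general, phi_closed)               # the two columns agree
```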

EXAMPLE 3. ẋ = −6x + 5.5x³ − x⁵. The damping function of the dynamical system

ẋ = −6x + 5.5x³ − x⁵  (3.60)

has five zero points, ξ₀ = 0, ξ′₁ = √1.5, ξ′₂ = −√1.5, ξ₁ = 2 and ξ₂ = −2 (see Fig. 3.9). They are called singular points of the dynamical system. If an initial value x₀ of (3.60) is one of the five singular points, then x(t) stays at x₀ for all t > 0. If the dynamical system is driven by a white noise σn(t), we have

ẋ = −6x + 5.5x³ − x⁵ + σn(t) .  (3.61)

The corresponding Fokker–Planck equation is

∂p/∂t = −(∂/∂x)[(−6x + 5.5x³ − x⁵)p] + ½(∂²/∂x²)[σ²p] .  (3.62)

The associated dynamical system is

ẏ = −6y + 5.5σ²y³ − σ⁴y⁵ ,

where y = x/σ. The associated potential function is (see Fig. 3.10)

V(y) = 3y² − (11σ²/8) y⁴ + (σ⁴/6) y⁶ .  (3.63)

The distribution system of x is

dW(x)/dx = {(−12x + 11x³ − 2x⁵)/σ²} W(x) ,  (3.64)

and the distribution W(x) is (see Fig. 3.11)

W(x) = W₀ exp{(−6x² + (11/4)x⁴ − (1/3)x⁶)/σ²} ,  (3.65)

where W₀ is a normalizing constant. The locally linearized Markov chain model is

x_t = σy_t ,
y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,

where

Φ(y_t) = y_t + {f(y_t)/J(y_t)}[exp{J(y_t)Δt} − 1]   for J(y_t) ≠ 0 ,
         y_t + Δt·f(y_t)                            for J(y_t) = 0 ,

f(y_t) = −6y_t + 5.5σ²y³_t − σ⁴y⁵_t ,

and

J(y_t) = −6 + 16.5σ²y²_t − 5σ⁴y⁴_t .

Since Φ(y_t)/y_t → e^{−6Δt} for |y_t| → 0, the φ function of the locally linearized Markov chain model is (see Fig. 3.12)

φ(y_t) = 1 + {f(y_t)/(J(y_t)y_t)}[exp{J(y_t)Δt} − 1]   for J(y_t)y_t ≠ 0 ,
         1 + (−6 + 5.5σ²y²_t − σ⁴y⁴_t)Δt               for J(y_t) = 0 ,
         e^{−6Δt}                                      for y_t = 0 .

Fig. 3.9. Fig. 3.10.

Fig. 3.11. Fig. 3.12.

EXAMPLE 4. Gamma-distributed process. The Gamma distribution is defined by

W(x) = x^{α−1} exp(−x/β) / {Γ(α)β^α} .  (3.66)

Its Pearson system is

dW(x)/dx = {((α − 1)β − x)/(βx)} W(x) ,  (3.67)

from which we have a diffusion process defined by the following Fokker–Planck equation:

∂p/∂t = −(∂/∂x)[(αβ − x)p] + ½(∂²/∂x²)[2βxp] .  (3.68)

The stochastic differential equation representation of the diffusion process is

ẋ = (α − ½)β − x + √(2βx)·n(t) .  (3.69)

By the variable transformation

y = √(2x/β)  (3.70)

we have the stochastic dynamical system

ẏ = (α − ½)/y − y/2 + n(t) .  (3.71)

The damping function z(y) of the associated dynamical system is

z(y) = y/2 − (α − ½)/y .  (3.72)

As is seen in Fig. 3.13, if α > ½ the damping function is negative for y < √(2α − 1), while if α < ½ the damping function is always positive. The associated potential function (see Fig. 3.14) is

V(y) = y²/4 − (α − ½) log y .  (3.73)

The shape of the Gamma distribution changes drastically at α = 1, while the critical value for the distribution of the associated process y(t) of (3.71) is α = ½. The equilibrium distribution of y(t) is given by (see Fig. 3.15)

W(y) = {1/(Γ(α)2^{α−1})} y^{2α−1} exp(−y²/2) .  (3.74)

When α = ½ the damping function is a linear function of y and the potential function is quadratic; therefore, the distribution of y is Gaussian for α = ½. The locally linearized Markov chain model for the diffusion process x(t) is

x_t = βy²_t/2 ,
y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,  (3.75)

where

Φ(y_t) = y_t + [exp{J(y_t)Δt} − 1]·f(y_t)/J(y_t)   for J(y_t) ≠ 0 ,
         y_t + Δt·f(y_t)                           for J(y_t) = 0 ,  (3.76)

and

f(y_t) = (α − ½)/y_t − y_t/2 ,  J(y_t) = −(α − ½)/y²_t − ½ .

Although |φ(y_t)| = |Φ(y_t)/y_t| < 1 for y_t → ∞, φ(y_t) is not bounded near the origin when α < ½ (see Fig. 3.16). Therefore, we can have a 'stationary' locally linearized Markov chain model for the Gamma-distributed process (3.68) only when α ≥ ½.
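A simulation sketch of the Gamma case (Python with numpy; α = 2, β = 1, the value of Δt and the crude reflection at the origin are choices of this sketch, not part of the model):

```python
import numpy as np

alpha, beta, dt = 2.0, 1.0, 0.01
f = lambda y: (alpha - 0.5)/y - y/2
J = lambda y: -(alpha - 0.5)/y**2 - 0.5

rng = np.random.default_rng(0)
y, xs = 1.0, []
for _ in range(100_000):
    y = y + np.expm1(J(y)*dt)/J(y)*f(y) + np.sqrt(dt)*rng.standard_normal()
    y = max(abs(y), 1e-6)           # crude reflection at the origin (sketch only)
    xs.append(beta*y**2/2)          # back-transform x = beta y^2 / 2
print(np.mean(xs))                  # ~ alpha*beta = 2 for the Gamma distribution
```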

Fig. 3.13. Fig. 3.14.

Fig. 3.15. Fig. 3.16.

EXAMPLE 5. Beta-distributed process. The Beta distribution over [−1, 1] is given by (see Fig. 3.19)

W(x) = {Γ(α + γ + 2)/(Γ(α + 1)Γ(γ + 1))}·(1 + x)^α(1 − x)^γ / 2^{α+γ+1} ,  α, γ > −1 .  (3.77)

The Pearson system of the distribution is

dW(x)/dx = {((α − γ) − (α + γ)x)/(1 − x²)} W(x) .  (3.78)

The Beta-distributed diffusion process is given by

∂p/∂t = −(∂/∂x)[{(α − γ) − (α + γ + 2)x}p] + ½(∂²/∂x²)[2(1 − x²)p] ,  (3.79)

from which we have the stochastic differential equation

ẋ = (α − γ) − (α + γ + 1)x + √(2(1 − x²))·n(t) .  (3.80)

Using the variable transformation

y = (1/√2) sin⁻¹ x ,  (3.81)

we have the stochastic dynamical system

ẏ = (α − γ)/(√2 cos √2y) − {(α + γ + 1)/√2} tan √2y + n(t) .  (3.82)

The damping function of the dynamical system is (see Fig. 3.17)

z(y) = {(α + γ + 1)/√2} tan √2y − (α − γ)/(√2 cos √2y) .  (3.83)

The associated potential function is then (see Fig. 3.18)

V(y) = −{(α − γ)/2} log(1 + sin √2y) − (γ + ½) log(cos √2y) .  (3.84)

The equilibrium distribution of the associated diffusion process is given by

W(y) = W₀ exp{−2V(y)}
     = {Γ(α + γ + 2)/(Γ(α + 1)Γ(γ + 1)2^{α+γ+1/2})} (1 + sin √2y)^{α−γ}(cos √2y)^{2γ+1} .  (3.85)

The locally linearized Markov chain model is

x_t = sin √2y_t ,  (3.86)
y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,  (3.87)

where

Φ(y_t) = y_t + [exp{J(y_t)Δt} − 1]·f(y_t)/J(y_t) ,

f(y_t) = (α − γ)/(√2 cos √2y_t) − {(α + γ + 1)/√2} tan √2y_t ,  (3.88)

J(y_t) = {(α − γ) sin √2y_t − (α + γ + 1)}/(cos √2y_t)² .

The points y_t = ±π/(2√2) are natural reflecting boundaries, and y_{t+1} = ±π/(2√2) ± ε is identified with y_{t+1} = ±π/(2√2) ∓ ε. For y_t ≠ 0 we can define the φ function as (see Fig. 3.20)

φ(y_t) = Φ(y_t)/y_t ;

Fig. 3.17. Fig. 3.18.

Fig. 3.19. Fig. 3.20.

however, since

Φ(0) = [1 − exp{−(α + γ + 1)Δt}](α − γ)/{√2(α + γ + 1)} ≠ 0  (3.89)

and |φ(y_t)| → ∞ for y_t → 0, the φ function is undefined for y_t = 0.

EXAMPLE 6. Cauchy-distributed process. The Cauchy distribution is the special case α = ½ of the distribution defined by (see Fig. 3.23)

W(x) = {Γ(α + ½)/(Γ(α)Γ(½))}(1 + x²)^{−(α+1/2)} .  (3.90)

The Pearson system of the distribution is

dW(x)/dx = {−(2α + 1)x/(1 + x²)} W(x) .  (3.91)

The Fokker–Planck equation of the diffusion process which corresponds to the Pearson system (3.91) is

∂p/∂t = −(∂/∂x)[(1 − 2α)xp] + ½(∂²/∂x²)[2(1 + x²)p] ,  (3.92)

from which we have the stochastic differential equation

ẋ = −2αx + √(2(1 + x²))·n(t) .  (3.93)

By the variable transformation

y = (1/√2) sinh⁻¹(x)  (3.94)

we have the stochastic dynamical system

ẏ = −√2 α tanh √2y + n(t) .  (3.95)

The damping function of the dynamical system is (see Fig. 3.21)

z(y) = √2 α tanh √2y ,  (3.96)

and the associated potential function is (see Fig. 3.22)

V(y) = α log(cosh √2y) .  (3.97)

The distribution of the associated process y(t) of (3.95) is

W(y) = {√2 Γ(α + ½)/(Γ(½)Γ(α))}(cosh √2y)^{−2α} .  (3.98)

The locally linearized Markov chain model of the diffusion process (3.92) is

x_t = sinh(√2y_t) ,
y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,  (3.99)

Φ(y_t) = y_t + [exp{−2αΔt/cosh²(√2y_t)} − 1]·(cosh √2y_t sinh √2y_t)/√2 .

Since Φ(y_t)/y_t → e^{−2αΔt} for y_t → 0, the φ function of the model is given by (see Fig. 3.24)

φ(y_t) = 1 + [exp{−2αΔt/cosh²(√2y_t)} − 1]·(cosh √2y_t sinh √2y_t)/(√2 y_t)   for y_t ≠ 0 ,
         e^{−2αΔt}                                                            for y_t = 0 .  (3.100)

Fig. 3.21. Fig. 3.22.

Fig. 3.23. Fig. 3.24.

EXAMPLE 7. The distribution

W(x) = {1/Γ(2α)} x^{−(2α+1)} e^{−1/x}  (3.101)

(see Fig. 3.27) was introduced by Wong (1963) from the Pearson system

dW(x)/dx = {(1 − (2α + 1)x)/x²} W(x) .  (3.102)

The corresponding Fokker–Planck equation is given by

∂p/∂t = −(∂/∂x)[{1 − (2α − 1)x}p] + ½(∂²/∂x²)[2x²p] ,  (3.103)

and the corresponding stochastic differential equation is

ẋ = 1 − 2αx + √2 x·n(t) .  (3.104)

Using the variable transformation

y = (1/√2) log x ,  (3.105)

we obtain the following stochastic dynamical system:

ẏ = (1/√2) e^{−√2y} − √2 α + n(t) .  (3.106)

The damping function of the dynamical system is (see Fig. 3.25)

z(y) = √2 α − (1/√2) e^{−√2y} .  (3.107)

The associated potential function is (see Fig. 3.26)

V(y) = √2 αy + ½ e^{−√2y} .  (3.108)

The distribution of the associated process y(t) of (3.106) is then given by

W(y) = {√2/Γ(2α)} exp{−2√2 αy − exp(−√2y)} .  (3.109)

The locally linearized Markov chain model of the diffusion process defined by (3.103) is given by

x_t = exp(√2 y_t) ,  (3.110)
y_{t+Δt} = Φ(y_t) + √Δt e_{t+Δt} ,  (3.111)

Φ(y_t) = y_t − [exp{−exp(−√2y_t)Δt} − 1]·{exp(−√2y_t) − 2α}/{√2 exp(−√2y_t)} .  (3.112)

For y_t ≠ 0 we can define the φ function by φ(y_t) = Φ(y_t)/y_t; however, since

Φ(0) = (1 − e^{−Δt})(1 − 2α)/√2 ≠ 0

and |φ(y_t)| → ∞ for y_t → 0, φ(y_t) is undefined for y_t = 0 (see Fig. 3.28). However, since Φ(y) is a continuous function of y for all y, and the function

f(y) = (1/√2) e^{−√2y} − √2 α

of (3.106) satisfies the conditions of Theorem 3.2, the locally linearized Markov chain (3.111) is ergodic.

Fig. 3.25. Fig. 3.26.

Fig. 3.27. Fig. 3.28.

4. Estimation

4.1. Model estimation

Suppose we have a time series observation record, x₁, …, x_N, sampled with sampling interval Δt from some stochastic dynamic phenomenon which is well approximated by a diffusion process. From the diffusion process model we have, as in the previous section, a locally linearized Markov chain model

x_t = h⁻¹(y_t | α) ,
y_{t+Δt} = Φ(y_t | β) + √Δt ε_{t+Δt} ,  (4.1)

where α is a parameter used in the variable transformation y = h(x), β = (β₁, …, β_r) is the parameter vector of the function Φ(y_t), and ε_t is a unit Gaussian white noise. The model is considered to be a linear or non-linear time series model in which y_{t+Δt} is denoted by y_{t+1}, and the model (4.1) is rewritten as

x_t = h⁻¹(y_t | α) ,
y_{t+1} = Φ(y_t | β) + √Δt ε_{t+1} .  (4.2)

The estimates of the parameters of the time series model (4.2) are obtained by the maximum likelihood method. The log-likelihood of the model is given by

log p(x₁, …, x_N | α, β) = log{ p(y₁, …, y_N | α, β) det(∂h(x_i | α)/∂x_j) } ,  (4.3)

where (∂h(x_i | α)/∂x_j) is the Jacobian matrix of the transformation from

(x_1, …, x_N) to (y_1, …, y_N). Since (y_2, …, y_N) is transformed again by

(1/√Δt){y_{t+1} - Φ(y_t | β)}

into an (N - 1)-dimensional Gaussian distribution (ε_2, …, ε_N) when y_1 is given, we have, from (4.3),

log p(x_1, …, x_N | α, β)
= log[ p(y_1 | α, β) (1/√(2πΔt))^{N-1} exp{-Σ_{t=2}^N ε_t²/(2Δt)} ∏_{j=1}^N det(∂h(x_j | α)/∂x_j) ]
= -Σ_{t=2}^N ε_t²/(2Δt) + Σ_{j=1}^N log|det(∂h(x_j | α)/∂x_j)| - [(N - 1)/2] log(2πΔt) + log p(y_1 | α, β). (4.4)

When N is sufficiently large, log p(y_1 | α, β) is small compared with the other terms of (4.4) and can be ignored. Then the (approximate) maximum likelihood estimates α̂ and β̂ are obtained by minimizing the following function l(α, β) with respect to the variables α and β = (β_1, …, β_r):

l(α, β) = Σ_{t=2}^N ε_t²/(2Δt) - Σ_{j=1}^N log|det(∂h(x_j | α)/∂x_j)|. (4.5)

The minimization of (4.5) is performed by some non-linear optimization procedure.
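As a sketch of such a procedure, the following Python fragment minimizes the criterion (4.5) for the model of Example 7, where y = h(x) = (1/√2) log x and Φ(y | α) is given by (3.112); the optimizer, starting value and step size are illustrative choices, and log|det(∂h/∂x)| = -log(√2 x) is used for the Jacobian term.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the approximate maximum likelihood fit (4.5) for the model of
# Example 7 (x > 0, y = log(x)/sqrt(2), drift parameter alpha); all names and
# starting values are illustrative.
def criterion(params, x, dt):
    alpha = params[0]
    y = np.log(x) / np.sqrt(2.0)                  # y = h(x)
    e = np.exp(-np.sqrt(2.0) * y[:-1])
    # Phi(y_t | alpha) of (3.112)
    phi = y[:-1] - (np.exp(-e * dt) - 1.0) * (e - 2.0 * alpha) / (np.sqrt(2.0) * e)
    eps = y[1:] - phi                             # residuals of (4.2)
    # criterion (4.5): sum eps^2/(2 dt) - sum log|det(dh/dx)|, where
    # log|dh/dx| = -log(sqrt(2) x)
    return np.sum(eps**2) / (2.0 * dt) + np.sum(np.log(np.sqrt(2.0) * x))

def fit(x, dt=0.05, alpha0=1.0):
    return minimize(criterion, np.array([alpha0]), args=(x, dt),
                    method="Nelder-Mead")
```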
The estimation of the amplitude-dependent AR models discussed in Section 2 is computationally much easier than the estimation of the above non-linear time series model obtained by the local linearization. For example, for the extended exponential AR model

y_{t+1} = φ(y_t)y_t + ε_{t+1}, (4.6)

where

φ(y_t) = φ_1 + π(y_t) e^{-y_t²} (4.7)

and

π(y_t) = π_0 + π_1 y_t + ⋯ + π_r y_t^r, (4.8)

we have

Y = XA + e, (4.9)

where Y = (y_2, …, y_N)', A = (φ_1, π_0, …, π_r)', e = (ε_2, …, ε_N)' and

X = [ y_1       y_1 e^{-y_1²}          y_1² e^{-y_1²}      ⋯  y_1^{r+1} e^{-y_1²}
      y_2       y_2 e^{-y_2²}          y_2² e^{-y_2²}      ⋯  y_2^{r+1} e^{-y_2²}
      ⋮                                                        ⋮
      y_{N-1}   y_{N-1} e^{-y_{N-1}²}  ⋯                      y_{N-1}^{r+1} e^{-y_{N-1}²} ].

Then the least squares estimate Â of the parameter A = (φ_1, π_0, …, π_r)' is obtained by solving the linear equation

Â = (X'X)^{-1}X'Y. (4.10)
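A direct computational sketch of (4.9)-(4.10) in Python follows; the order r = 3 is an arbitrary illustrative choice.

```python
import numpy as np

# Least squares fit (4.10) of the extended exponential AR model (4.6)-(4.8).
def fit_exp_ar(y, r=3):
    ylag = y[:-1]
    w = np.exp(-ylag**2)
    # columns of X: y_t, y_t e^{-y_t^2}, y_t^2 e^{-y_t^2}, ..., y_t^{r+1} e^{-y_t^2}
    X = np.column_stack([ylag] + [ylag**(i + 1) * w for i in range(r + 1)])
    A, *_ = np.linalg.lstsq(X, y[1:], rcond=None)  # A = (phi_1, pi_0, ..., pi_r)'
    return A
```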

We note that the contribution of the non-constant terms of φ(y_t) depends very much on the scaling unit of y_t. In practical applications of the model, the identification of the dynamics of φ(y_t) is more efficient if we introduce a scaling parameter γ as

y_{t+1} = {φ_1 + π(y_t) e^{-γy_t²}}y_t + ε_{t+1}. (4.11)

The estimation of model (4.11) is still much easier than that of the locally linearized non-linear time series model (see Haggan and Ozaki, 1981). The asymptotic behaviour of the maximum likelihood estimates of ARMA models and Markov chains is discussed in Hannan (1973) and Billingsley (1961).

4.2. Estimation of potentials


As we saw in Section 3, the associated potential function of the diffusion process contains information about the equilibrium distribution of the process as well as information about the dynamical system. Information about the global behaviour of the process is obtained from the equilibrium distribution, while information about the local behaviour of the process, which is useful for local, one-step-ahead or few-step-ahead prediction, is obtained from the dynamical system. Therefore, both the global and the local information about the behaviour of the process are contained in the associated potential function, and we can regard non-linear time series analysis as the estimation of the potential function associated with a diffusion process.
There are two methods for the estimation of the potential function. One is to estimate the parameters of the original diffusion process by the maximum likelihood method using the locally linearized non-linear time series model (4.2). The other is to estimate the φ function of the amplitude-dependent AR model

y_{t+1} = φ(y_t)y_t + ε_{t+1}, (4.12)

using first some parametric amplitude-dependent AR model, and then to obtain the potential function from the estimated φ function. We call the former method the direct method and the latter the indirect method (Ozaki, 1983a). The indirect method is based on the following relation in local linearization:

φ(y) = 1 + {exp([∂a(y)/∂y]Δt) - 1} a(y)/{[∂a(y)/∂y]y}, (4.13)

where

ẏ = a(y) + n(t) (4.14)

and

y_{t+1} = φ(y_t)y_t + ε_{t+1}. (4.15)

From (4.13) we have

a(y) = {φ(y) - 1}[∂a(y)/∂y]y / [exp{(∂a(y)/∂y)Δt} - 1]. (4.16)

It is difficult to solve (4.16) for a(y) for a given φ(y). One approximate method is to expand

exp{[∂a(y)/∂y]Δt}

in a Taylor series and ignore the terms of order Δt², Δt³, …. Then we have from (4.16)

a(y) ≈ (1/Δt){φ(y) - 1}y. (4.17)

Then we have the estimated potential function as follows:

V̂(y, Δt) = -(1/Δt) ∫ {φ̂(y) - 1}y dy. (4.18)

For the estimation of the φ function of (4.15), the following three parametric amplitude-dependent AR models studied in Section 2 are useful.

(i) Extended exponential AR model

The φ function of the model is flexible enough to approximate any function which goes to a constant for |y_t| → ∞. It is analytic and therefore useful for the estimation of potentials.

(ii) Non-linear threshold AR model

The φ function of the model is continuous but non-differentiable at the threshold. However, it may be useful for giving an approximate picture of the φ functions and potentials.

(iii) Linear threshold AR model

A linear threshold AR model employs a step function for the approximation of the φ function. Obviously, the function is not only non-differentiable but not even continuous at the thresholds. Moreover, too many parameters may be required to approximate the φ function of an ordinary diffusion process adequately by such step functions. However, there may be some diffusion processes which naturally have non-analytic or discontinuous potential functions, for which the linear or non-linear threshold models may be useful.

We also note that a non-parametric approach to the estimation of the φ function has been proposed by Priestley (1980) (see also Haggan et al., 1982), which may also be useful in obtaining an idea of the potential function. Incidentally, in the non-parametric method of Priestley, the differentiability of the φ function also plays an essential role.

5. Discussions

We have seen in Section 3 that for any diffusion process x(t) there is an associated diffusion process y(t) which is obtained by a variable transformation y = h(x) (see Fig. 5.1). Since the variable transformation is uniquely defined by (3.6), the associated diffusion process is unique. From the associated diffusion process we obtain, by the local linearization method, a Markov chain, which is a non-linear time series converging to the diffusion process for Δt → 0. However, we note that the correspondence between the distribution system and the diffusion process x(t) (see Fig. 5.1) is not unique if we do not restrict c(x) and d(x) to be mutually irreducible (Ozaki, 1982b). For example, if we consider the distribution system

dW(x)/dx = [{(α - 1)β - x}x/(βx²)] W(x), (5.1)

instead of the Pearson system

dW(x)/dx = [{(α - 1)β - x}/(βx)] W(x) (5.2)

for the Gamma distribution, we have another Gamma-distributed diffusion process defined by

∂p/∂t = -(∂/∂x)[{(α + 1)βx - x²}p] + ½(∂²/∂x²)[2βx²p], (5.3)

Fig. 5.1. The correspondence system: the stochastic differential equation ẋ = f(x) + g(x)n(t) and the stochastic dynamical system ẏ = a(y) + n(t); the Fokker-Planck equation ∂p/∂t = -(∂/∂x)[a(x)p] + ½(∂²/∂x²)[b(x)p] and the potential function V(y) = -∫a(y) dy; the distribution systems dW(x)/dx = [c(x)/d(x)]W(x) and dW(y)/dy = 2a(y)W(y); and the distributions W(x) and W(y) = W_0 exp{-2V(y)}.

from which we have a stochastic differential equation,

ẋ = αβx - x² + √(2β) x n(t). (5.4)

Using the variable transformation

y = (log x)/√(2β), (5.5)

we have a stochastic dynamical system

ẏ = αβ/√(2β) - exp(√(2β) y)/√(2β) + n(t). (5.6)

The damping function z(y) of the associated dynamical system is (see Fig. 5.2)

z(y) = {exp(√(2β) y) - αβ}/√(2β), (5.7)

and the associated potential function is (see Fig. 5.3)

V(y) = exp(√(2β) y)/(2β) - αβy/√(2β). (5.8)

The locally linearized Markov chain model is

x_t = exp(√(2β) y_t),
y_{t+Δt} = Φ(y_t) + √Δt ε_{t+Δt}, (5.9)
Φ(y_t) = y_t - [exp{-exp(√(2β) y_t)Δt} - 1] {αβ - exp(√(2β) y_t)}/{√(2β) exp(√(2β) y_t)}.

The φ function is given in Fig. 5.4, where φ(y) = Φ(y)/y for y ≠ 0 and φ(y) is undefined for y = 0, since

Φ(0) = (1 - e^{-Δt})(αβ - 1)/√(2β) ≠ 0 for αβ ≠ 1, (5.10)

and |φ(y)| → ∞ for y → 0.

Fig. 5.2. Fig. 5.3. Fig. 5.4.

From the viewpoint of time series analysis, the above example shows two methods for the whitening of a given Gamma-distributed time series. One is the square-root transformation plus non-linear time series modelling, and the other is the logarithmic transformation plus non-linear time series modelling. Data transformation methods have been used in many statistical methods with the objective of obtaining Gaussian or quasi-Gaussian data. This idea is still valid in the statistical analysis of non-Gaussian time series data. The above examples of Gamma-distributed processes and Example 5 of the Beta-distributed process in Section 3 suggest that not only non-linear time series modelling but also variable transformations, such as square-root, logarithmic and arcsine transformations, are justified from the viewpoint of specifying the transformation of the non-Gaussian process into a Gaussian or quasi-Gaussian white noise process. The problem of which transformation to choose among the several possible transformations can be solved by employing the Entropy Maximization Principle (Akaike, 1977).
We have seen in Section 2 that the dynamics of random vibration are modelled by a second-order stochastic differential equation

ẍ + f(ẋ) + g(x) = n(t). (5.11)

This is equivalent to the following two-variate dynamical system model:

Ẋ = F(X) + N(t), (5.12)

where X = (x, ẋ)', N(t) = (0, n(t))', F(X) = (ẋ, -f(ẋ) - g(x))' and n(t) is a unit Gaussian white noise. Although we do not have such nice relationships between multivariate potentials and multivariate diffusion processes as in the univariate case, the local linearization method is still valid in the multivariate case, and we have, from (5.12), the following locally linearized model (5.13), which can be considered as a non-linear extension of the Markovian representation model for the multivariate stochastic dynamical system (Akaike, 1974; Ozaki, 1983b):

X_{t+Δt} = X_t + J_t^{-1}{exp(ΔtJ_t) - I}F(X_t) + √Δt N_t. (5.13)

Here J_t is the Jacobian matrix ∂F/∂X|_{X=X_t}, which is, from (5.11), given by

J_t = ( 0            1
        -∂g(x)/∂x    -∂f(ẋ)/∂ẋ ) evaluated at X = X_t,

I is the unit matrix, exp(ΔtJ_t) is defined by

exp(ΔtJ_t) = I + ΔtJ_t + (1/2!)Δt²J_t² + (1/3!)Δt³J_t³ + ⋯,

and X_t = (x_t, ẋ_t)', x_t = x(t) and ẋ_t = ẋ(t).


A computational method for the estimation and simulation of the model (5.13) and its application to Lorenz chaos (Lorenz, 1963) and the moored vessel problem are discussed in Ozaki (1983b).
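A minimal sketch of one step of (5.13) in Python is as follows; the cubic damping used in the example call is purely illustrative, and the step assumes J_t is nonsingular.

```python
import numpy as np
from scipy.linalg import expm

# One step of the locally linearized model (5.13) for the system (5.11);
# f, g and their derivatives are user-supplied.
def ll_step(X, f, g, df, dg, dt, rng):
    x, xdot = X
    F = np.array([xdot, -f(xdot) - g(x)])
    J = np.array([[0.0, 1.0],
                  [-dg(x), -df(xdot)]])          # Jacobian of F at X_t
    incr = np.linalg.solve(J, (expm(dt * J) - np.eye(2)) @ F)
    noise = np.sqrt(dt) * np.array([0.0, rng.standard_normal()])
    return X + incr + noise

rng = np.random.default_rng(0)
X = np.array([0.1, 0.0])
for _ in range(1000):                            # illustrative cubic damping
    X = ll_step(X, f=lambda v: 0.5 * v**3, g=lambda x: x,
                df=lambda v: 1.5 * v**2, dg=lambda x: 1.0, dt=0.05, rng=rng)
```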
It must be noted that the locally linearized Markov chain model is an approximate model for stochastic dynamic phenomena in nature for which the diffusion process model also seems to be a good approximation. In the diffusion process model, the autocorrelation function of the noise process is a delta function, so that its spectral density p(ω) is uniform over the whole frequency band -∞ < ω < ∞ (see Fig. 5.5), the integral of the autocorrelation function being one. However, it is obvious that in nature there is no noise process whose autocorrelation is exactly the same as the delta function. If we make the sampling interval very small, any noise process in nature will fail to have a delta autocorrelation function. In other words, if we are concerned with the periodicity in a very high frequency band of the noise process, the whiteness assumption becomes inadequate.

Fig. 5.5. Spectral density p(ω): (a) real, (b) continuous-time, (c) discrete-time.
Nevertheless, diffusion process models seem to approximate well phenomena in many fields, such as genetics, statistical physics, ecology, mechanical engineering and electrical engineering. This is because we are concerned with a particular finite frequency band in the spectrum of each phenomenon, over which the white noise approximation together with a dynamical system approximation seems to work well. This consideration suggests that the discrete-time white noise approximation (see Fig. 5.5) together with discrete-time dynamical systems, such as the amplitude-dependent AR models in Section 2, may work well for the approximation of the same phenomena.

Acknowledgement

The author is grateful to Dr. Yamanouchi and Mr. Oda for drawing his interest to dynamical systems in nautical engineering.
He is also grateful to Dr. Shimizu and Dr. Akaike for many useful comments
and discussions.

References

[1] Akaike, H. and Nakagawa, T. (1972). Statistical Analysis and Control of Dynamic Systems. Saiensu-sha (in Japanese).
[2] Akaike, H. (1974). Markovian representation of stochastic processes and its application to the analysis of autoregressive and moving average processes. Ann. Inst. Statist. Math. 26, 363-387.
[3] Akaike, H. (1977). On entropy maximization principle. In: P. R. Krishnaiah, ed., Applications of Statistics. North-Holland, Amsterdam.

[4] Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
[5] Billingsley, P. (1961). Statistical Inference for Markov Processes. Holt, New York.
[6] Blumenthal, R. M. and Getoor, R. K. (1968). Markov Processes and Potential Theory.
Academic Press, New York.
[7] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control.
Holden-Day, San Francisco, CA.
[8] Fitzhugh, R. (1969). Mathematical models of excitation and propagation in nerve. In: H. P.
Schwan, ed., Biological Engineering, 1-85. McGraw-Hill, New York.
[9] Gikhman, I. I. and Skorohod, A. V. (1965). Introduction to the Theory of Random Processes (translated by Scripta Technica, Inc.). W. B. Saunders Company.
[10] Goel, N. S. and Richter-Dyn, N. (1974). Stochastic Models in Biology. Academic Press, New
York.
[11] Haggan, V. and Ozaki, T. (1981). Modelling nonlinear random vibrations using an amplitude-dependent autoregressive time series model. Biometrika 68, 189-196.
[12] Haggan, V., Heravi, S. M. and Priestley, M. B. (1982). A study of the application of state-dependent models in non-linear time series analysis. To appear in J. Time Ser. Anal.
[13] Hannan, E. J. (1973). The asymptotic theory of linear time series models. J. Appl. Probab. 10, 130-145.
[14] Henrici, P. (1962). Discrete Variable Methods in Ordinary Differential Equations. Wiley, New
York.
[15] Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane current and
its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
[16] Jones, D. A. (1978). Nonlinear autoregressive processes. Proc. Roy. Soc. London Ser. A 360,
71-95.
[17] Lorenz, E. N. (1963). Deterministic nonperiodic flow. J. Atmospheric Sci. 20, 130-141.
[18] Minorsky, N. (1974). Non-linear Oscillations, Krieger, New York.
[19] Ozaki, T. and Oda, H. (1978). Nonlinear time series model identification by Akaike's
information criterion. In: B. Dubuisson, ed., Information and Systems, 83-91. Pergamon Press,
Oxford-New York.
[20] Ozaki, T. (1981a). Non-linear phenomena and time series models. Invited paper 43rd Session
of the International Statistical Institute. Buenos Aires, Argentina.
[21] Ozaki, T. (1981b). Non-linear threshold autoregressive models for non-linear random vibra-
tions. J. Appl. Probab. 18, 443-451.
[22] Ozaki, T. (1982a). The statistical analysis of perturbed limit cycle processes using non-linear time series models. J. Time Ser. Anal. 3, 29-41.
[23] Ozaki, T. (1982b). On Gamma-distributed time series models. Research Memo. No. 238. Inst. of Stat. Maths., Tokyo.
[24] Ozaki, T. (1983a). Estimation of potentials associated with diffusion processes using non-linear
time series models. Research Memo No. 251. Inst. of Stat. Maths., Tokyo.
[25] Ozaki, T. (1983b). Non-linear Markovian representation models and multivariate stochastic
dynamical systems. Research Memo No. 252. Inst. of Stat. Maths., Tokyo.
[26] Pandit, S. M. and Wu, S. M. (1975). Unique estimates of the parameters of a continuous stationary stochastic process. Biometrika 62, 497-501.
[27] Priestley, M. B. (1980). State-dependent models: a general approach to non-linear time series
analysis. J. Time Ser. Anal. 1, 47-71.
[28] Rössler, O. E. (1979). Chaos. In: W. Güttinger and H. Eikemeier, eds., Structural Stability in Physics, 290-309. Springer, Berlin-New York.
[29] Tong, H. and Lim, K. S. (1980). Threshold autoregression, limit cycles and cyclical data. J. Roy. Statist. Soc. Ser. B 42, 245-292.
[30] Tweedie, R. L. (1975). Sufficient conditions for ergodicity and stationarity of Markov chains on a general state space. Stochastic Process. Appl. 3, 385-403.

[31] Wong, E. (1963). The construction of a class of stationary Markoff processes. Proc. Amer. Math. Soc. Symp. Appl. Math. 16, 264-276.
[32] Yamanouchi, Y. (1974). Ship's behaviour on ocean waves as a stochastic process. Proc. of the International Symposium on the Dynamics of Marine Vehicles and Structures in Waves. Institution of Mechanical Engineers, London.
[33] Zeeman, E. C. (1977). Catastrophe Theory: Selected Papers (1972-1977). Addison-Wesley,
Reading, MA.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 85-118

Autoregressive Moving Average Models,
Intervention Problems and Outlier Detection in
Time Series

G. C. Tiao

1. Introduction

Economic, engineering, environmental and other scientific data are often taken at roughly equally spaced time intervals; for example, hour, day, month, quarter or year. Because of the inertia of the system, such time series data are frequently serially dependent. For instance, the temperature of a given day tends to be correlated with the temperature of the previous day, the air pollution index at noon may be heavily influenced by the weather conditions and traffic level in the morning, and an individual's spending in a given month may be highly correlated with his income and spending of the previous month. Dynamic relationships in time series data can often be represented by a linear transfer function model. Consider first the simple case in which Y_t is the output and X_t the input of a system at time t. A linear dynamic model can be written as

Y_t = v_0 X_t + v_1 X_{t-1} + ⋯ + v_j X_{t-j} + ⋯
    = v(B)X_t, (1.1)

where B is the backshift operator such that BX_t = X_{t-1}, and the quantity v_j measures the effect of X_{t-j} on Y_t. The v_j's are known as the impulse responses and v(B) as the transfer function. Rather than dealing with a possibly infinite number of the v_j's, in practice a parsimonious representation takes the rational polynomial form

v(B) = [ω(B)/δ(B)] B^b, (1.2)

where

ω(B) = ω_0 - ω_1 B - ⋯ - ω_s B^s and δ(B) = 1 - δ_1 B - ⋯ - δ_r B^r

are polynomials in B of degrees s and r, respectively, and b measures the delay or dead time of the system.

1.1. Univariate ARMA models

In recent years, primarily due to the work of Box and Jenkins (1970), a class of models of the form

Φ(B)Z_t = C + θ(B)a_t, (1.3)

originally proposed by Yule (1927) and Slutsky (1937), has been found useful in representing the serially dependent relationships of many time series encountered in practice. In (1.3), {Z_t} is the observable time series, {a_t} is a sequence of white noise, identically and independently distributed as normal N(0, σ_a²), C is a constant, Φ(B) = 1 - Φ_1 B - ⋯ - Φ_p B^p is the autoregressive polynomial, θ(B) = 1 - θ_1 B - ⋯ - θ_q B^q is the moving average polynomial, and Φ(B) and θ(B) are assumed to have no common factor. The model (1.3) is of the form (1.2) with Y_t = Z_t, where the input X_t is now the white noise a_t.
The model (1.3) will be stationary if all the zeros of Φ(B) lie outside the unit circle, and in this case C = (1 - Φ_1 - ⋯ - Φ_p)μ, where μ is the mean of the series. Theoretically, stationarity means that the probability density functions of (Z_{t_1}, …, Z_{t_1+k}) and (Z_{t_2}, …, Z_{t_2+k}) are of identical form for any arbitrary choice of the integers (t_1, t_2, k). In practice, this says that the overall behavior of the observations remains the same over time. Now, real-world time series data often exhibit a drifting behavior. Such nonstationary series can be modeled by allowing some of the zeros of Φ(B) to be equal to one. Thus, writing Φ(B) = φ(B)(1 - B)^d, we have from (1.3) that

φ(B)(1 - B)^d Z_t = C + θ(B)a_t, (1.4)

where the zeros of φ(B) = 1 - φ_1 B - ⋯ - φ_{p-d} B^{p-d} all lie outside the unit circle. The model (1.4) is known as the autoregressive integrated moving average model of order (p - d, d, q), or ARIMA(p - d, d, q). Now time series data with evolving cyclical behavior can often be represented by Φ(B) having zeros on the unit circle. Thus, more generally, we can write (1.3) in the form

φ(B)U(B)Z_t = C + θ(B)a_t, (1.5)

where U(B) = 1 - U_1 B - ⋯ - U_d B^d has all its zeros on the unit circle. The special form U(B) = (1 - B)^{d_1}(1 - B^s)^{d_2}, where s is some positive integer, has been widely used in practice to model seasonal time series (s = 12 for monthly data and s = 4 for quarterly data).
In what follows, we list some special cases of the class in (1.5) which have been widely used in practice:

(1 - Φ_1 B)Z_t = C + a_t, (1.5a)

Z_t = C + (1 - θ_1 B)a_t, (1.5b)

(1 - B)Z_t = (1 - θ_1 B)a_t, (1.5c)

(1 - B^s)Z_t = (1 - θ_1 B)(1 - θ_2 B^s)a_t, (1.5d)

(1 - B)(1 - B^s)Z_t = (1 - θ_1 B)(1 - θ_2 B^s)a_t. (1.5e)

The model (1.5a) is an ARMA(1, 0) or AR(1) model, and it is stationary when |Φ_1| < 1. The second model (1.5b) is a stationary ARMA(0, 1) or MA(1) model. The third, (1.5c), is nonstationary and, for |θ_1| < 1, can be written in the alternative form (see the discussion of the π form in Subsection 2.1)

Z_t = (1 - θ_1)Z_{t-1} + θ_1(1 - θ_1)Z_{t-2} + θ_1²(1 - θ_1)Z_{t-3} + ⋯ + a_t,

showing that the dependence of Z_t on past values Z_{t-1}, Z_{t-2}, … decreases exponentially as we stretch into the past. This model is commonly called the 'exponential smoothing' model in the forecasting literature. Finally, (1.5d) and (1.5e) are the two models most extensively used for seasonal time series data.
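The exponential decay of these weights is easy to verify numerically; in the short Python sketch below, θ_1 = 0.6 is an arbitrary illustrative value.

```python
import numpy as np

# pi weights of (1.5c): the weight of Z_{t-j} is (1 - theta1) * theta1**(j-1).
theta1 = 0.6
j = np.arange(1, 9)
weights = (1.0 - theta1) * theta1**(j - 1)
print(weights)        # decays exponentially; the infinite sum equals one
```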

1.2. Transfer function and intervention analysis models

Models of the form (1.3) basically express the relationship of Z_t to its own past values Z_{t-1}, Z_{t-2}, …. In some situations, data on some of the factors influencing the output Y_t of a system are available and we can write

Y_t = C + Σ_{j=1}^J [ω_j(B)/δ_j(B)] B^{b_j} X_{jt} + N_t, (1.6)

where ω_j(B) and δ_j(B) are polynomials of the form in (1.2), X_{1t}, …, X_{Jt} are J input time series, and N_t is a noise term following an ARMA(p, q) model. In (1.6) the input series {X_{jt}} may be deterministic or stochastic, and in the latter case they are required to be independent of the noise series {N_t}. The effects of exogenous interventions, such as governmental policy changes, strikes, and promotional campaigns, on Y_t can frequently be modelled by making the X_{jt} in (1.6) indicator variables (see Box and Tiao, 1975).

1.3. Vector ARMA models

Transfer function models of the form (1.6) express a unidirectional relationship between an output Y_t and J inputs X_{jt} in the sense that the X_{jt}'s do not depend on the past values of Y_t. In other words, there is no feedback. Systems with feedback among interrelated series can be modeled by a vector generalization of the ARMA form (1.3),

Φ(B)Z_t = C + θ(B)a_t, (1.7)

where Z_t = (Z_{1t}, …, Z_{kt})' is a vector of k component time series, C is a k × 1 vector of constants and a_t = (a_{1t}, …, a_{kt})' is the white noise vector. In (1.7) the a_t's are assumed identically and independently distributed as k-dimensional multivariate normal N(0, Σ),

Φ(B) = I - Φ_1 B - ⋯ - Φ_p B^p and θ(B) = I - θ_1 B - ⋯ - θ_q B^q,

where the Φ's and θ's are k × k matrices, and the zeros of the determinantal polynomial |Φ(B)| are required to lie on or outside the unit circle. The vector series will be stationary if (i) all the zeros of |Φ(B)| lie outside the unit circle, or (ii) the factors of |Φ(B)| with zeros on the unit circle divide [Adj Φ(B)]θ(B).

1.4. Outline of the chapter

The main purpose of this chapter is to provide the reader with an expository account of the models in (1.3), (1.6) and (1.7) and a description of some useful modeling techniques for applying these models to real data. Section 2 considers the univariate ARMA models (1.3); Section 3 discusses the transfer function model (1.6) with deterministic inputs, placing emphasis on intervention analysis problems and the detection of outliers; and Section 4 presents two illustrative examples. Finally, in Section 5 some aspects of the vector ARMA model (1.7) are briefly discussed.

2. The univariate ARMA models

2.1. Some properties of the model

For simplicity in presenting the main results, until further notice we shall assume that C = 0. To discuss the properties of (1.3), it is convenient to begin by supposing that the series {Z_t} actually starts at some time point m. Suppose now that t - m + 1 observations Z_m, …, Z_t are available. From (1.3), we can then write

D_Φ Z = D_θ a + w, (2.1)

where D_Φ and D_θ are (t - m + 1) × (t - m + 1) lower-triangular band matrices with ones on the diagonal, the successive subdiagonals of D_Φ being -Φ_1, …, -Φ_p and those of D_θ being -θ_1, …, -θ_q, Z = (Z_m, …, Z_t)', a = (a_m, …, a_t)', w = (w_m, …, w_{m+r-1}, 0, …, 0)' and w_m, …, w_{m+r-1} are r = max(p, q) initial values. The w's can be deterministic or stochastic, and in this chapter we shall suppose they are normally distributed and independent of the a_t's.

The ψ and π weights

From (2.1), we have that

Z = D_Φ^{-1}D_θ a + D_Φ^{-1} w, (2.2)

where D_Φ^{-1}D_θ is lower triangular with unit diagonal and with ψ_1, ψ_2, … down the successive subdiagonals, D_Φ^{-1} is of the same form with elements ψ*_1, ψ*_2, …, and the ψ's can be obtained by equating coefficients of powers of B from the relation

Φ(B)ψ(B) = θ(B), (2.3)

where ψ(B) = 1 + ψ_1 B + ψ_2 B² + ⋯. It follows that, for l ≥ 0, the ψ's satisfy the difference equation

ψ_l = Φ_1 ψ_{l-1} + ⋯ + Φ_p ψ_{l-p} - θ_l, (2.4)

where ψ_0 = 1, ψ_j = 0 for j < 0 and θ_l = 0 for l > q. Thus for l ≥ r, the ψ_l's can be expressed in the form

ψ_l = A_1 α_1^l + ⋯ + A_{p_0} α_{p_0}^l, (2.5)

where p_0 ≤ p, A_1, …, A_{p_0} are polynomials in l and α_1^{-1}, …, α_{p_0}^{-1} are the p_0 distinct zeros of Φ(B). Analogous expressions can be obtained for the ψ*'s. From (2.2), we can write Z_t in the ψ form as

Z_t = a_t + Σ_{h=1}^{t-m} ψ_h a_{t-h} + Σ_{h=t-(m+r)+1}^{t-m} ψ*_h w_{t-h}. (2.6)

In a similar way, we have from (2.1) that

D_θ^{-1}D_Φ Z - D_θ^{-1}w = a, (2.7)

where D_θ^{-1}D_Φ is lower triangular with unit diagonal and with -π_1, -π_2, … down the successive subdiagonals, and the π's can be obtained from the relation θ(B)π(B) = Φ(B), where π(B) = 1 - π_1 B - π_2 B² - ⋯, so that the π's satisfy the difference equation

π_l = θ_1 π_{l-1} + ⋯ + θ_q π_{l-q} + Φ_l, (2.8)

where π_0 = -1, π_j = 0 for j < 0 and Φ_l = 0 for l > p. Thus, for l ≥ r, the π's can be written

π_l = G_1 β_1^l + ⋯ + G_{q_0} β_{q_0}^l, (2.9)

where q_0 ≤ q, G_1, …, G_{q_0} are polynomials in l and β_1^{-1}, …, β_{q_0}^{-1} are the q_0 distinct zeros of θ(B). Expressions for the π*'s can be obtained in a similar manner. The π form of Z_t is

Z_t = Σ_{h=1}^{t-m} π_h Z_{t-h} - Σ_{h=t-(m+r)+1}^{t-m} π*_h w_{t-h} + a_t. (2.10)

Note that by supposing the series to start at time m and introducing the initial values w_m, …, w_{m+r-1}, we have obtained two alternative forms of the ARMA model (1.3). The ψ form (2.6) shows how Z_t is affected by current and previous white noise shocks a_{t-j}, and the π form (2.10) indicates how Z_t relates to its own past values Z_{t-j}. In obtaining (2.6) and (2.10) no assumption is made about the nature of Φ(B) and θ(B).
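The recursions (2.4) and (2.8) are straightforward to implement; a short Python sketch follows, with phi and theta holding (Φ_1, …, Φ_p) and (θ_1, …, θ_q).

```python
import numpy as np

# psi weights from (2.4) and pi weights from (2.8).
def psi_weights(phi, theta, L):
    p, q = len(phi), len(theta)
    psi = np.zeros(L + 1)
    psi[0] = 1.0
    for l in range(1, L + 1):
        acc = sum(phi[h - 1] * psi[l - h] for h in range(1, min(p, l) + 1))
        psi[l] = acc - (theta[l - 1] if l <= q else 0.0)
    return psi

def pi_weights(phi, theta, L):
    p, q = len(phi), len(theta)
    pi = np.zeros(L + 1)
    pi[0] = -1.0                       # pi_0 = -1 as in (2.8)
    for l in range(1, L + 1):
        acc = sum(theta[h - 1] * pi[l - h] for h in range(1, min(q, l) + 1))
        pi[l] = acc + (phi[l - 1] if l <= p else 0.0)
    return pi

# e.g. for (1.5c) with theta_1 = 0.6: pi_weights([1.0], [0.6], 5)
```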

Stationarity condition and autocovariance structure of Z_t

Since the w's and the a_t's are assumed normally distributed, it follows from (2.6) that the Z_t's are also normally distributed. It is readily seen that, if in (2.5)

|α_j| < 1, j = 1, …, p_0, (2.11)

then as t - m → ∞ we have that

cov(Z_t, Z_{t+l}) → γ(l) = σ_a² Σ_{h=0}^∞ ψ_h ψ_{h+l}, (2.12)

so that Z_t will be stationary in this asymptotic sense. In what follows, we shall refer to (2.11) as the stationarity condition of the ARMA model, which is equivalent to requiring that the zeros of Φ(B) lie outside the unit circle. Let us denote by

γ(l) = cov(Z_t, Z_{t+l}) = cov(Z_t, Z_{t-l}) (2.13)

the lag l autocovariance of a stationary series. For an alternative expression

of γ(l) in (2.12), we proceed as follows. From (1.3), we have that

Z_{t-l}(Z_t - Φ_1 Z_{t-1} - ⋯ - Φ_p Z_{t-p}) = Z_{t-l}(a_t - θ_1 a_{t-1} - ⋯ - θ_q a_{t-q}).

Taking expectations on both sides and using (2.6), we have that, for l ≥ 0,

γ(l) = Σ_{h=1}^p Φ_h γ(l - h) + c_l, (2.14)

where

c_l = -σ_a² Σ_{h=0}^{q-l} ψ_h θ_{h+l} for l = 0, …, q, and c_l = 0 for l > q,

θ_0 = -1, and note that γ(l) = γ(-l).

Autocorrelation function
The autocorrelation function

ρ(l) = γ(l)/γ(0) (2.15)

can be obtained directly from (2.14). Note that if Φ(B) = 1, i.e. (1.3) is a moving average model of order q, MA(q), then

ρ(l) = -θ_q(1 + θ_1² + ⋯ + θ_q²)^{-1} for l = q, and ρ(l) = 0 for l > q. (2.16)

This is an important property which will prove useful in the model building process.

Partial autocorrelation function

Let us define, for k = 1, 2, … and m = 0, 1, 2, …,

γ(k, m) = (γ(m + 1), …, γ(m + k))'

and

G(k, m) = [ γ(m)         γ(m - 1)      ⋯  γ(m - k + 1)
            γ(m + 1)     γ(m)          ⋯  γ(m - k + 2)
            ⋮                               ⋮
            γ(m + k - 1) γ(m + k - 2)  ⋯  γ(m) ]. (2.17)

From (2.14), we have for l = 1, …, p a system of equations

γ(p, 0) = G(p, 0)Φ(p) + c, (2.18)

where Φ(p) = (Φ_1, …, Φ_p)' and c = (c_1, …, c_p)'. When θ(B) = 1, c = 0 and in this case we can express Φ_p as

Φ_p = g(p)|G(p, 0)/γ(0)|^{-1}, (2.19)

where g(p) is the determinant

g(p) = | 1          ρ(1)      ⋯  ρ(p - 2)  ρ(1)
         ρ(1)       1         ⋯  ρ(p - 3)  ρ(2)
         ⋮                                  ⋮
         ρ(p - 1)   ρ(p - 2)  ⋯  ρ(1)      ρ(p) |,

i.e. the determinant of the matrix obtained from G(p, 0)/γ(0) by replacing its last column with (ρ(1), …, ρ(p))'. This result then leads to defining the following function of the autocorrelation coefficients ρ(1), …, ρ(l):

φ(l) = ρ(1) for l = 1, and φ(l) = g(l)|G(l, 0)/γ(0)|^{-1} for l > 1, (2.20)

which is known as the partial autocorrelation function. It has the property that, for a stationary AR(p) model, i.e. θ(B) = 1,

φ(l) = Φ_p for l = p, and φ(l) = 0 for l > p. (2.21)

The important property of φ(l) is that it vanishes for l > p when the model is AR(p). This is akin to the property of the autocorrelation coefficients ρ(l) with respect to the MA(q) model, and will prove to be a useful tool in model building.

Extended autocorrelation function

For the ARMA(p, q) model, we see from (2.14) that for l > q, letting

Φ^{(l)}(p) = G(p, l)^{-1} γ(p, l), (2.22)

where Φ^{(l)}(p) = (Φ^{(l)}_{1(p)}, …, Φ^{(l)}_{p(p)})', and letting

W^{(l)}_{p,t} = (1 - Φ^{(l)}_{1(p)}B - ⋯ - Φ^{(l)}_{p(p)}B^p)Z_t, (2.23)

then, since Φ^{(l)}(p) = Φ(p), the transformed process {W^{(l)}_{p,t}} follows an MA(q) model. Thus, if we let ρ(p, l) be the lag l autocorrelation of W^{(l)}_{p,t}, we have that

ρ(p, l) = -θ_q(1 + θ_1² + ⋯ + θ_q²)^{-1} for l = q, and ρ(p, l) = 0 for l > q. (2.24)

In general, for k = 1, 2, 3, … and l = 1, 2, 3, …, let the k × 1 vector

Φ^{(l)}(k) = (Φ^{(l)}_{1(k)}, …, Φ^{(l)}_{k(k)})'

satisfy the equations

G(k, l)Φ^{(l)}(k) = γ(k, l), (2.25)

and let ρ(k, l) be the lag l autocorrelation of the transformed process {W^{(l)}_{k,t}}, where W^{(l)}_{k,t} = (1 - Φ^{(l)}_{1(k)}B - ⋯ - Φ^{(l)}_{k(k)}B^k)Z_t. That is,

ρ(k, l) = b'G(k + 1, l)b / b'G(k + 1, 0)b, (2.26)

where b' = (1, -Φ^{(l)}(k)'), and it is easily seen that ρ(k, l) is a function of the autocorrelations ρ(1), …, ρ(k + l).

Now, for k = p and l ≥ q, ρ(k, l) has the 'cutting off' property (2.24) for the ARMA(p, q) model, which is akin to the property of ρ(l) in (2.16) for the MA(q) model. Following the work of Tsay and Tiao (1984), we shall call ρ(k, l) the kth extended autocorrelation of lag l for Z_t. We shall also denote ρ(l) = ρ(0, l), so that ρ(k, l) is defined for k ≥ 0 and l ≥ 1. It can readily be shown that for a stationary ARMA(p, q) model, when k ≥ p,

ρ(k, l) = c for l = q + k - p, and ρ(k, l) = 0 for l > q + k - p, (2.27)

where |c| < 1. This property of ρ(k, l) will be exploited later in the model building process.

2.2. Prediction theory

In this section, we discuss the problem of forecasting future observations for the ARMA(p, q) model (1.3). We shall assume that the model is known, i.e. all the parameters Φ_1, …, Φ_p, θ_1, …, θ_q and σ_a² are given. In practice, these parameters will, of course, have to be estimated from the data. For a discussion of the effect of estimation errors on forecasts, see e.g. Yamamoto (1976).

Basically, the forecasting problem is as follows. Suppose that the {Z_t} series begins at time m and we have available observations up to time T, Z_m, …, Z_T. What statements can then be made about future observations Z_{T+l}, l = 1, 2, …, L? Clearly, all the information about Z_{T+1}, …, Z_{T+L} is contained in the conditional distribution p(Z_{T+1}, …, Z_{T+L} | Z_{(T)}), where Z_{(T)} = (Z_m, …, Z_T)'.

From the probabilistic structure assumed in (2.1), this conditional distribution is an L-dimensional multivariate normal distribution. In what follows, we obtain the mean vector and covariance matrix of this distribution and discuss their main properties. We shall denote by Ẑ_T(l) the conditional expectation

Ẑ_T(l) = E_T(Z_{T+l}) = E(Z_{T+l} | Z_{(T)}), (2.28)

which is the minimum mean squared error (m.m.s.e.) forecast of Z_{T+l}, and by e_T(l) the forecast error

e_T(l) = Z_{T+l} - Ẑ_T(l). (2.29)

From (1.3) with C = 0 and (2.3), we have that for l ≥ 1

Ẑ_T(l) = Φ_1 Ẑ_T(l - 1) + ⋯ + Φ_p Ẑ_T(l - p) + â_T(l) - θ_1 â_T(l - 1) - ⋯ - θ_q â_T(l - q), (2.30)

where

Ẑ_T(j) = Z_{T+j}, j ≤ 0, and â_T(i) = E(a_{T+i} | Z_{(T)}),

so that â_T(i) = 0 for i > 0. Thus, the Ẑ_T(l)'s can be recursively calculated from (2.30) once the expected values â_T(-j), j = 0, …, q - 1, are determined, and for l > q the Ẑ_T(l)'s satisfy the difference equation

Φ(B)Ẑ_T(l) = 0, (2.31)

where B now operates on l. To obtain â_T(-j), we have from (2.10) that

â_T(-j) = Z_{T-j} - Σ_{h=1}^{T-j-m} π_h Z_{T-j-h} + Σ_{h=T-j-(m+r)+1}^{T-j-m} π*_h E(w_{T-j-h} | Z_{(T)}). (2.32)

It can be shown that when all the zeros of θ(B) lie outside the unit circle, both π_h and π*_h approach zero as h → ∞, and for T - j ≫ m the third term on the right-hand side of (2.32) can be ignored, so that

â_T(-j) ≈ Z_{T-j} - Σ_{h=1}^{T-j-m} π_h Z_{T-j-h}. (2.32a)

Thus, approximately, â_T(-j) only depends on Z_{T-j}, …, Z_m. Note that the requirement that all zeros of θ(B) lie outside the unit circle is known as the 'invertibility condition' of the ARMA(p, q) model. For a discussion of noninvertible models, see e.g. Harvey (1981).
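A compact Python sketch of the recursions above is given below. It computes the in-sample shocks from a recursion of the form (2.59) with zero starting values, which corresponds to the approximation (2.32a), and so is appropriate only for an invertible model and a reasonably long series; the ψ weights needed for the error variance (2.39) are generated inline from (2.4).

```python
import numpy as np

# Forecast recursion (2.30) with C = 0, plus Var(e_T(l)) of (2.39).
def forecast(z, phi, theta, sigma2, L):
    z = np.asarray(z, float)
    p, q, n = len(phi), len(theta), len(z)
    a = np.zeros(n + L)                    # a_hat_T(i) = 0 for i > 0
    for t in range(p, n):                  # approximate in-sample shocks
        a[t] = (z[t]
                - sum(phi[i] * z[t - 1 - i] for i in range(p))
                + sum(theta[i] * a[t - 1 - i] for i in range(q)))
    zext = np.concatenate([z, np.zeros(L)])
    for t in range(n, n + L):              # recursion (2.30)
        zext[t] = (sum(phi[i] * zext[t - 1 - i] for i in range(p))
                   - sum(theta[i] * a[t - 1 - i] for i in range(q)))
    psi = np.zeros(L)                      # psi weights from (2.4)
    psi[0] = 1.0
    for l in range(1, L):
        psi[l] = (sum(phi[h] * psi[l - 1 - h] for h in range(min(p, l)))
                  - (theta[l - 1] if l <= q else 0.0))
    return zext[n:], sigma2 * np.cumsum(psi**2)   # forecasts, Var(e_T(l))
```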
It is of interest to study the behavior of the forecasts Ẑ_T(l) as a function of the lead time l. From (2.31), we can write

Ẑ_T(l) = A_1^{(T)} α_1^l + ⋯ + A_{p_0}^{(T)} α_{p_0}^l, (2.33)

where, as in (2.5), p_0 ≤ p, α_1^{-1}, …, α_{p_0}^{-1} are the p_0 distinct zeros of Φ(B), and A_1^{(T)}, …, A_{p_0}^{(T)} are polynomials in l whose coefficients are linear functions of Z_{(T)}. Thus, the forecast function Ẑ_T(l) will be a mixture of exponentials, sinusoids and polynomials, depending upon the nature of α_1, …, α_{p_0}. If all the α's are inside the unit circle, i.e. when the model is stationary, then as l → ∞, Ẑ_T(l) → 0, which is the mean level of Z_t (since C in (1.3) is taken as zero). On the other hand, if some of the α's are on the unit circle, then Ẑ_T(l) will be a mixture of polynomials and undamped sinusoidal functions of l. Note also that Ẑ_T(l) depends on the available observations Z_T, Z_{T-1}, …, so that it 'adapts' to new observations as the origin T is advanced.

Turning now to the forecast error e_T(l), we have from (2.30) that

e_T(l) = Φ_1 e_T(l - 1) + ⋯ + Φ_p e_T(l - p) + a_{T+l} - â_T(l) - θ_1[a_{T+l-1} - â_T(l - 1)] - ⋯ - θ_q[a_{T+l-q} - â_T(l - q)], (2.34)

where it is to be noted that

e_T(j) = 0 for j ≤ 0 and a_{T+i} - â_T(i) = a_{T+i} for i > 0.

For l = 1, …, L, L > max(p, q), we can write the vector of forecast errors e_{(L)} = (e_T(1), …, e_T(L))' in the form

D_Φ^{(L)} e_{(L)} = D_θ^{(L)} a_{(L)} - X^{(L)}[w - E(w | Z_{(T)})], (2.35)

where D_Φ^{(L)} and D_θ^{(L)} are L × L lower-triangular band matrices of the same form as D_Φ and D_θ in (2.1), with subdiagonal bands -Φ_1, …, -Φ_p and -θ_1, …, -θ_q respectively, X^{(L)} is an L × r matrix whose elements involve the Φ's and θ's, w' = (w_m, …, w_{m+r-1}) and a_{(L)} = (a_{T+1}, …, a_{T+L})'. Thus,

e_{(L)} = D_Φ^{(L)-1} D_θ^{(L)} a_{(L)} - D_Φ^{(L)-1} X^{(L)}[w - E(w | Z_{(T)})], (2.36)

where D_Φ^{(L)-1}D_θ^{(L)} is lower triangular with unit diagonal and with the ψ weights ψ_1, …, ψ_{L-1} down the successive subdiagonals, and conditional on Z_{(T)} the covariance matrix of e_{(L)} is

Cov(e_{(L)}) = σ_a² [D_Φ^{(L)-1}D_θ^{(L)}][D_Φ^{(L)-1}D_θ^{(L)}]' + R, (2.37)

where R = D_Φ^{(L)-1}X^{(L)} Cov(w | Z_{(T)}) (D_Φ^{(L)-1}X^{(L)})'. When the model is invertible, it is easily shown that, for T ≫ m, R ≈ 0 and, approximately,

Cov(e_{(L)}) ≈ σ_a² [D_Φ^{(L)-1}D_θ^{(L)}][D_Φ^{(L)-1}D_θ^{(L)}]'. (2.38)

In particular, the variance of the forecast error e_T(l) is

Var(e_T(l)) = σ_a² Σ_{j=0}^{l-1} ψ_j². (2.39)

For stationary models, Var(e_T(l)) → γ(0) as l → ∞, and for nonstationary models Var(e_T(l)) increases without bound as l increases.

It is important to remember that the m.m.s.e. forecasts Ẑ_T(l), l = 1, …, L, of Z_{T+l} and the covariance matrix Cov(e_{(L)}) are simply the mean vector and covariance matrix of the conditional distribution p(Z_{T+1}, …, Z_{T+L} | Z_{(T)}). While these are important features of the conditional distribution, they are by no means the only features of practical importance in forecasting. For example, in some situations one may be interested in the probability of future observations exceeding some specified levels, or in assessing the probability that a turning point occurs at a specific point of time. The solutions of these and other forecasting problems are all based on the joint density p(Z_{T+1}, …, Z_{T+L} | Z_{(T)}).

2.3. Model building strategy

The class of ARMA(p, q) models in (1.3) is extensive. In practice, guidelines are needed for selecting a member of the class to represent the time series data at hand. Box and Jenkins (1970) have proposed an iterative model building strategy which has been widely adopted by practitioners. The strategy consists of three main phases:
(i) Tentative specification or identification of a model.
(ii) Efficient estimation of model parameters.
(iii) Diagnostic checking of the fitted model for further improvement.

2.3.1. Tentative specification

The aim here is to employ statistics which (a) can be readily calculated from the data and (b) allow the user to tentatively select a model, i.e. determine (p, q) in (1.3) or (p - d, d, q) in (1.4). We shall discuss three methods: (i) the sample autocorrelation function (SACF), (ii) the sample partial autocorrelation function (SPACF) and (iii) the extended sample autocorrelation function (ESACF).

SACF
The sample autocorrelations of Z_t are defined as

r(l) = c(l)/c(0), l = 1, 2, …, (2.40)

where

c(j) = n^{-1} Σ_{t=1}^{n-j} (Z_t - Z̄)(Z_{t+j} - Z̄)

and Z̄ is the sample mean of the n available observations Z_1, …, Z_n. It is well known that for stationary models

r(l) → ρ(l) in probability. (2.41)

Also, if Φ(B) is of the form in (1.4), then

r(l) → 1 in probability. (2.42)

Thus, from (2.42), if the SACF (as a function of the lag l) of the original series is persistently close to one as l increases, one then forms the series of first differences W_t = (1 - B)Z_t and studies its SACF to determine whether further differencing is called for. Once stationarity is achieved, from (2.16) and (2.41), a 'cutting off' pattern after a certain lag, say q, in the SACF will then lead to tentative specification of an MA(q) model.

For stationary models, the r(l)'s are asymptotically normally distributed, but the covariance structure is rather complex in general (see Bartlett, 1946). For an MA(q) model and l > q, the asymptotic variance of r(l) is

Var(r(l)) ≈ n^{-1} [1 + 2 Σ_{j=1}^q ρ²(j)]. (2.43)

By substituting r(j) for the unknown ρ(j) in (2.43), the estimated variances of the r(l)'s are often used to help specify the order q of an MA model.
SPACF
The sample partial autocorrelations

φ̂(l), l = 1, 2, …, (2.44)

of Z_t are obtained by replacing the ρ(l)'s in (2.20) by their sample estimates r(l). For stationary models

φ̂(l) → φ(l) in probability, (2.45)

and the φ̂(l)'s are asymptotically normally distributed. Also, for a stationary AR(p) model,

Var(φ̂(l)) ≈ 1/n, l > p. (2.46)

The properties in (2.45) and (2.46) make the SPACF a convenient tool for specifying the order p of a stationary AR model in practice. For nonstationary models, i.e. when Φ(B) contains the factor U(B) in (1.5), the asymptotic behaviour of φ̂(l) is rather complex, however.
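Both statistics are easily computed; the Python sketch below obtains the sample autocorrelations (2.40) and then the sample partial autocorrelations by the Durbin-Levinson recursion, one standard way of evaluating (2.20) from the r(l)'s.

```python
import numpy as np

# Sample ACF (2.40) and sample PACF via the Durbin-Levinson recursion.
def sacf(z, max_lag):
    z = np.asarray(z, float) - np.mean(z)
    c0 = np.dot(z, z) / len(z)
    return np.array([np.dot(z[:-l], z[l:]) / len(z) / c0
                     for l in range(1, max_lag + 1)])

def spacf(r):
    """Partial autocorrelations phi_hat(l) from r = (r(1), ..., r(L))."""
    pacf = [r[0]]
    prev = np.array([r[0]])            # phi_{l-1, 1..l-1}
    for l in range(2, len(r) + 1):
        num = r[l - 1] - np.dot(prev, r[l - 2::-1])
        den = 1.0 - np.dot(prev, r[:l - 1])
        phi_ll = num / den
        prev = np.concatenate([prev - phi_ll * prev[::-1], [phi_ll]])
        pacf.append(phi_ll)
    return np.array(pacf)
```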
In the past, the SACF and SPACF have been the most commonly used statistical tools for tentative model specification. Specifically, a persistently high SACF signals the need for differencing; a moving average model is suggested by an SACF exhibiting a small number of large values at low lags; and an autoregressive model, by an SPACF showing a similar 'cutting off' pattern. Also, for series exhibiting a strong seasonal behavior of period s, a persistently high SACF at lags which are multiples of s signals the need to apply the 'seasonal differencing' operator 1 - B^s to the data, and so on. The weaknesses of these two methods are that (i) subjective judgement is often required to decide on the order of differencing, and (ii) for stationary mixed autoregressive moving average models, both the SACF and the SPACF tend to exhibit a gradual 'tapering off' behavior, making specification of the orders of the autoregressive and moving average parts difficult.
ESACF
Recently, several approaches have been proposed to handle the mixed model specification problem. These include the R- and S-array methods of Gray et al. (1978) and the generalized partial autocorrelations of Woodward and Gray (1981). In what follows, we discuss the procedure proposed by Tsay and Tiao (1984), using what they call the extended sample autocorrelation function (ESACF), for tentative specification of the order (p, q) of the general nonstationary or stationary ARMA model (1.3). The proposed procedure eliminates the need to difference, or in general transform, the series to achieve stationarity, and directly specifies the values of p and q.

For stationary ARMA models, estimates ρ̂(k, l) of the EACF ρ(k, l) as defined in (2.26) can be obtained upon replacing the ρ(l)'s in (2.26) by their sample counterparts r(l). In this case, the estimates ρ̂(k, l) will be consistent for the ρ(k, l)'s, and hence the property (2.27) can be exploited for model identification. However, for nonstationary models the ρ̂(k, l)'s will not in general have the asymptotic property given by the right-hand side of (2.27).

Now for ARMA(p, q) models, one can view the extended sample autocorrelation function approach as consisting of the following two steps. We first attempt to find consistent estimates of the autoregressive parameters in order to transform Z_t into a moving average process. We then make use of the 'cutting off' property of the autocorrelation function of the transformed process for model identification.

For estimating the autoregressive parameters, the following iterated regression approach has been proposed. First, let Φ̂^{(0)}_{1(k)}, …, Φ̂^{(0)}_{k(k)} be the ordinary least squares (OLS) estimates from fitting the AR(k) regression to the data,

Z_t = Φ^{(0)}_{1(k)} Z_{t-1} + ⋯ + Φ^{(0)}_{k(k)} Z_{t-k} + e^{(0)}_{k,t}, (2.47)

where e^{(0)}_{k,t} denotes the error term. The first iterated AR(k) regression is given by

Z_t = Φ^{(1)}_{1(k)} Z_{t-1} + ⋯ + Φ^{(1)}_{k(k)} Z_{t-k} + β^{(1)}_{1(k)} ê^{(0)}_{k,t-1} + e^{(1)}_{k,t}, (2.48)

where ê^{(0)}_{k,t} = (1 - Φ̂^{(0)}_{1(k)}B - ⋯ - Φ̂^{(0)}_{k(k)}B^k)Z_t is the residual from (2.47) and e^{(1)}_{k,t} denotes the error term. This yields a new set of OLS estimates Φ̂^{(1)}_{1(k)}, …, Φ̂^{(1)}_{k(k)}. In general, for l = 1, 2, …, the estimates Φ̂^{(l)}_{1(k)}, …, Φ̂^{(l)}_{k(k)} are obtained from the lth iterated AR(k) regression

Z_t = Φ^{(l)}_{1(k)} Z_{t-1} + ⋯ + Φ^{(l)}_{k(k)} Z_{t-k} + β^{(l)}_{1(k)} ê^{(l-1)}_{k,t-1} + ⋯ + β^{(l)}_{l(k)} ê^{(0)}_{k,t-l} + e^{(l)}_{k,t}, (2.49)

where

ê^{(i)}_{k,t} = (1 - Φ̂^{(i)}_{1(k)}B - ⋯ - Φ̂^{(i)}_{k(k)}B^k)Z_t - Σ_{h=1}^i β̂^{(i)}_{h(k)} ê^{(i-h)}_{k,t-h}

(i.e. the residuals from the ith iterated regression) and e^{(l)}_{k,t} is the error term. In practice, these iterated estimates Φ̂^{(l)}_{j(k)} can be obtained from the OLS estimates of the autoregressive coefficients of the fitted AR(k), …, AR(k + l) models for Z_t using the recursion

Φ̂^{(l)}_{j(k)} = Φ̂^{(l-1)}_{j(k+1)} - Φ̂^{(l-1)}_{j-1(k)} Φ̂^{(l-1)}_{k+1(k+1)} / Φ̂^{(l-1)}_{k(k)}, (2.50)

where Φ̂^{(l)}_{0(k)} = -1, j = 1, …, k, k ≥ 1 and l ≥ 1. Based on consistency results for OLS estimates of autoregressive parameters in nonstationary and stationary ARMA(p, q) models in Tiao and Tsay (1983), they show that for k = p,

Φ̂^{(l)}(p) → Φ(p) in probability, l ≥ q, (2.51)

where Φ̂^{(l)}(p) = (Φ̂^{(l)}_{1(p)}, …, Φ̂^{(l)}_{p(p)})'.

Now, analogous to (2.26), the extended sample autocorrelation function r(k, l) is defined as

r(k, l) = r_l(Ŵ^{(l)}_{k,t}), (2.52)

where r_l(Ŵ^{(l)}_{k,t}) is the lag l sample autocorrelation of the transformed series

Ŵ^{(l)}_{k,t} = (1 - Φ̂^{(l)}_{1(k)}B - ⋯ - Φ̂^{(l)}_{k(k)}B^k)Z_t. (2.53)

Also, we may denote r(0, l) = r(l) for the ordinary sample autocorrelations, and shall call r(k, l) the kth extended sample autocorrelation of lag l. Tsay and Tiao show that for the general ARMA(p, q) model in (1.3), stationary or nonstationary, when k ≥ p,

r(k, l) → c in probability for l = q + k - p, and r(k, l) → 0 in probability for l > q + k - p, (2.54)

where |c| < 1.


Tentative model specification via ESACF
The asymptotic property of the ESACF r(k, l) given by (2.54) can now be exploited to help tentatively identify ARMA(p, q) models in practice. For this purpose, it is useful to arrange the r(k, l)'s in a two-way table, as shown in Table 2.1, in which the first row gives the SACF, the second row gives the 1st ESACF, and so on. The rows are numbered 0, 1, 2, … to signify the AR order and the columns in a similar way for the MA order. To illustrate the use of the table, suppose the true model is an ARMA(1, 2). For the SACF, it is well known that asymptotically r(0, l) ≠ 0 for l ≥ 2. Now from (2.54) with p = 1 and q = 2, we see that (i) when k = 1, r(1, l) ≐ 0 for l ≥ 3, (ii) when k = 2, r(2, l) ≐ 0 for l ≥ 4, and so on. The full situation is shown in Table 2.2, where x denotes a nonzero value, 0 a zero value, and * a value between -1 and 1. The zero values are seen to form a triangle with boundaries given by the two lines k = 1 and l - k = 2. The row and column coordinates of the vertex correspond precisely to the AR and MA orders, respectively.

Table 2.1
The ESACF table

AR \ MA    0        1        2        3
0          r(0, 1)  r(0, 2)  r(0, 3)  r(0, 4)
1          r(1, 1)  r(1, 2)  r(1, 3)  r(1, 4)
2          r(2, 1)  r(2, 2)  r(2, 3)  r(2, 4)
3          r(3, 1)  r(3, 2)  r(3, 3)  r(3, 4)

Table 2.2
The asymptotic ESACF table for an ARMA(1, 2) model, where x denotes a nonzero value and * denotes a value between -1 and 1

AR \ MA  0  1  2  3  4  5  6  7
0        *  x  x  x  x  x  x  x
1        *  x  0  0  0  0  0  0
2        *  x  x  0  0  0  0  0
3        *  x  x  x  0  0  0  0
4        *  x  x  x  x  0  0  0

In general, we are thus led to search the ESACF table for the vertex of a triangle of asymptotic 'zero' values having boundary lines k = c_1 ≥ 0 and l - k = c_2 ≥ 0, and to tentatively identify p = c_1 and q = c_2 as the order of the ARMA model. In practice, for finite samples, the r(k, l)'s will not be exactly zero. The asymptotic variance of the r(k, l)'s can be approximately obtained by using Bartlett's formula. As a crude but simple approximation, we may use the value (n - k - l)^{-1}, on the hypothesis that the transformed series Ŵ^{(l)}_{k,t} is white noise, to estimate the variance of r(k, l). Of course, it is understood that this simple approximation might underestimate the variance of r(k, l), and a further study of this subject is needed in the future. As a preliminary but informative guide for model specification, the ESACF table may be supplemented by an analogous table consisting of indicator symbols, x denoting values greater or less than ±2 standard deviations and 0 denoting in-between values.
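As a rough computational sketch (one of several possible implementations, not necessarily that of Tsay and Tiao), the ESACF table can be assembled from OLS autoregressive fits combined through the recursion (2.50), each entry being the lag-l sample autocorrelation (2.52)-(2.53) of the correspondingly filtered series; the default orders below are illustrative.

```python
import numpy as np

# ESACF table (Table 2.1): base AR(k) fits by OLS (2.47), iterated estimates
# via the recursion (2.50), and entries r(k, l) from the filtered series (2.53).
def ols_ar(z, k):
    X = np.column_stack([z[k - i:len(z) - i] for i in range(1, k + 1)])
    return np.linalg.lstsq(X, z[k:], rcond=None)[0]

def acf_at(w, l):
    w = w - w.mean()
    return float(np.dot(w[:-l], w[l:]) / np.dot(w, w))

def esacf(z, kmax=4, lmax=6):
    z = np.asarray(z, float)
    coef = {(0, k): ols_ar(z, k) for k in range(1, kmax + lmax + 1)}
    for l in range(1, lmax + 1):
        for k in range(1, kmax + lmax - l + 1):
            big, small = coef[(l - 1, k + 1)], coef[(l - 1, k)]
            ratio = big[k] / small[k - 1]          # Phi_{k+1(k+1)} / Phi_{k(k)}
            prev = np.concatenate([[-1.0], small]) # Phi_{0(k)} = -1
            coef[(l, k)] = big[:k] - prev[:k] * ratio
    table = np.empty((kmax + 1, lmax))
    table[0] = [acf_at(z, l) for l in range(1, lmax + 1)]
    for k in range(1, kmax + 1):
        for l in range(1, lmax + 1):
            c = coef[(l, k)]
            w = z[k:] - sum(c[i] * z[k - 1 - i:len(z) - 1 - i] for i in range(k))
            table[k, l - 1] = acf_at(w, l)
    return table
```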

2.3.2. Estimation
Once the order (p, q) of the model (1.3) is tentatively specified, the parameters (C, Φ_1, …, Φ_p, θ_1, …, θ_q, σ_a²) can be estimated by maximizing the corresponding likelihood function. An extensive literature exists on the properties of the likelihood function, various simplifying approximations to this function, and the asymptotic properties of the associated maximum likelihood estimates (see e.g. Anderson, 1971; Newbold, 1974; Fuller, 1976; Ljung and Box, 1979). In what follows, we consider two useful approximations, the first of which has been called the 'conditional likelihood function', proposed by Box and Jenkins (1970), and the second the 'exact likelihood function', by Hillmer and Tiao (1979).

With n observations Z = (Z_1, …, Z_n)' from the model (1.3) and assuming m ≤ 1, consider the transformed vector W = (W_1, …, W_n)', where

W = D_Φ^{(n)} Z, (2.55)

with D_Φ^{(n)} an n × n matrix analogous to D_Φ^{(L)} in (2.35). Now partitioning W' =

(W'_{(1)}, W'_{(2)}), where W_{(1)} = (W_1, …, W_p)' and W_{(2)} = (W_{p+1}, …, W_n)', we can write the joint distribution of W as

p(W) = p(W_{(1)} | W_{(2)}) p(W_{(2)}). (2.56)

Both the 'conditional' and the 'exact' likelihood approaches are based on the distribution p(W_{(2)}), ignoring p(W_{(1)} | W_{(2)}); and it can in fact be shown that, for moderately large n, the parameter estimates are little affected by p(W_{(1)} | W_{(2)}). Now from (1.3) and (2.55), the probabilistic structure of W_{(2)} is given by

W_t = C - Σ_{i=1}^q θ_i a_{t-i} + a_t, t = p + 1, …, n. (2.57)

The 'conditional' approach assumes that a_p = a_{p-1} = ⋯ = a_{p-q+1} = 0. In this case, the likelihood function can be written as

l_0(C, Φ, θ, σ_a² | Z) ∝ σ_a^{-(n-p)} exp{ -(1/2σ_a²) Σ_{t=p+1}^n a_t² }, (2.58)

where for given parameter values of (C, Φ, θ) the a_t's are recursively calculated from

a_t = Z_t - C - Σ_{i=1}^p Φ_i Z_{t-i} + Σ_{i=1}^q θ_i a_{t-i}. (2.59)

Standard nonlinear least squares methods can now be employed to obtain estimates (Ĉ, Φ̂, θ̂) minimizing the sum of squares in the exponent of (2.58). That is,

S(Ĉ, Φ̂, θ̂) = min S(C, Φ, θ), (2.60)

where S(C, Φ, θ) = Σ_{t=p+1}^n a_t². Also, the corresponding maximum likelihood estimate of σ_a² is

σ̂_a² = (1/n) S(Ĉ, Φ̂, θ̂). (2.61)

In the 'exact' approach, the assumption a_p = ⋯ = a_{p-q+1} = 0 is not made, and after some algebraic reduction it can be shown that the likelihood function is

l(C, Φ, θ, σ_a² | Z) ∝ σ_a^{-(n-p)} |D̃|^{-1/2} exp{ -(1/2σ_a²) Σ_{t=p-q+1}^n d̃_t² }. (2.62)

In (2.62), for t = p + 1, …, n,

d̃_t = Z_t - C - Σ_{i=1}^p Φ_i Z_{t-i} + Σ_{i=1}^q θ_i d̃_{t-i}, (2.63)

and for t = p - q + 1, …, p the vector d̃_* = (d̃_{p-q+1}, …, d̃_p)' is given by

d̃_* = D̃^{-1} R'M'ã, (2.63a)

where D̃ = I_q + R'M'MR, n' = n - p, I_q is the q × q identity matrix, M is the n' × n' lower-triangular matrix with unit diagonal whose (i, j)th element, for i > j, is π̃_{i-j}, R is the n' × q matrix, with first row (θ_q, …, θ_1), whose remaining elements are built up from the θ's and the π̃'s, the π̃'s satisfy the relation

(1 - π̃_1 B - π̃_2 B² - ⋯)(1 - θ_1 B - ⋯ - θ_q B^q) = 1,

and ã = (ã_{p+1}, …, ã_n)', the elements of which are given by (2.59). For a detailed derivation of (2.62), see Hillmer and Tiao (1979).
To obtain the maximum likelihood estimates of the parameters in (2.62), we see that the concentrated likelihood of (C, Φ, θ) is

max_{σ_a} l(C, Φ, θ, σ_a² | Z) ∝ [ Σ_{t=p-q+1}^n b̃_t² ]^{-(n-p)/2}, (2.64)

where b̃_t = |D̃|^{1/2(n-p)} d̃_t. Thus standard nonlinear routines can be used to obtain estimates (Ĉ, Φ̂, θ̂) minimizing the sum of squares

S*(C, Φ, θ) = Σ_{t=p-q+1}^n b̃_t², (2.65)

and the corresponding maximum likelihood estimate of σ_a² is

σ̂_a² = (1/(n - p)) |D̃|^{-1/(n-p)} S*(Ĉ, Φ̂, θ̂). (2.66)

It is clear from (2.59), (2.63) and (2.63a) that the exact approach is computationally more burdensome, but it can appreciably reduce the biases in estimating the moving average parameters θ associated with the conditional approach, especially when some of the zeros of θ(B) are near or on the unit circle. In practice, one uses the conditional approach in the initial phases of the iterative modeling process and switches to the exact method towards the end.
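A minimal Python sketch of the conditional approach follows: the a_t's are generated by the recursion (2.59) with the presample a's set to zero, and the sum of squares (2.60) is minimized by a general-purpose optimizer; the zero starting values are an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize

# Conditional sum of squares (2.58)-(2.61) for an ARMA(p, q) model.
def css(params, z, p, q):
    c, phi, theta = params[0], params[1:1 + p], params[1 + p:1 + p + q]
    a = np.zeros(len(z))                 # presample a's set to zero
    for t in range(p, len(z)):           # recursion (2.59)
        a[t] = (z[t] - c
                - sum(phi[i] * z[t - 1 - i] for i in range(p))
                + sum(theta[i] * a[t - 1 - i] for i in range(q)))
    return np.sum(a[p:] ** 2)

def fit_arma(z, p, q):
    z = np.asarray(z, float)
    res = minimize(css, np.zeros(1 + p + q), args=(z, p, q),
                   method="Nelder-Mead")
    sigma2 = css(res.x, z, p, q) / len(z)    # (2.61)
    return res.x, sigma2
```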

2.3.3. Diagnostic checking

Once the parameters of the tentatively specified model are obtained, it is important to perform various diagnostic checks on the fitted model to determine whether it is indeed adequate for representing the time series being studied. Methods for detecting model inadequacies are primarily based on the residuals

â_t = Z_t - Ĉ - Σ_{i=1}^p Φ̂_i Z_{t-i} + Σ_{i=1}^q θ̂_i â_{t-i}, t = p + 1, …, n, (2.67)

from the fitted model. Useful tools include plotting the residuals against time to spot outliers (see the later discussion in Subsection 3.3) and changes in level and variability, and studying the sample autocorrelation function r_â(l) of the residuals to determine whether it is consonant with that of a white noise process. A 'portmanteau' criterion originally proposed by Box and Pierce (1970) and later modified by Ljung and Box (1978) is given by

Q = n(n + 2) Σ_{l=1}^m (n - l)^{-1} r_â²(l). (2.68)

On the hypothesis that the Z_t's are generated from a stationary ARMA(p, q) model, Q in (2.68) obtained from the residuals will be approximately distributed as χ² with m - (p + q) degrees of freedom. It should be noted that in practice, when serious inadequacy occurs, the patterns of the individual r_â(l)'s often provide useful information about directions in which to modify the tentatively specified model.
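The statistic (2.68) and its reference distribution are easily computed, as in the following sketch.

```python
import numpy as np
from scipy.stats import chi2

# Portmanteau statistic (2.68) for the residuals a_hat of a fitted ARMA(p, q).
def portmanteau(a, m, p, q):
    a = np.asarray(a, float) - np.mean(a)
    n = len(a)
    r = np.array([np.dot(a[:-l], a[l:]) / np.dot(a, a) for l in range(1, m + 1)])
    Q = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, m + 1)))
    return Q, chi2.sf(Q, df=m - (p + q))   # statistic and approximate p-value
```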

3. Transfer function models, intervention analysis and outlier detection

In this section, we discuss some properties of the transfer function model in (1.6), with special emphasis on its application to intervention analysis and outlier detection problems. In general, the input variables X_{jt} can be deterministic or stochastic. When the X_{jt}'s themselves are stochastic and follow Gaussian ARMA models, Box and Jenkins (1970) have proposed a modeling procedure which specifically deals with the case of one input variable. Although their procedure can in principle be extended to the case of several stochastically independent input variables, it becomes cumbersome to apply, and an alternative method via vector ARMA models has been suggested (see Tiao and Box, 1981). In what follows, we shall confine our discussion to deterministic inputs.

3.1. Intervention problems

In the analysis of economic and environmental time series data, it is frequently of interest to determine the effects of exogenous interventions, such as a change in fiscal policy or the implementation of certain pollution control measures, that occurred at some known time points. Standard statistical procedures, such as the t-test of the mean difference before and after the intervention, are often not appropriate because of (i) the dynamic characteristics of the intervention, and (ii) the existence of serial dependence in the observations. It is shown in Box and Tiao (1975) that a transfer function of the form (1.6) can be employed to study the effects of interventions. Specifically, suppose we wish to estimate simultaneously the effects of J interventions on an output series Y_t; we may make the X_{jt} indicator variables taking the values 1 and 0 to denote the occurrences and nonoccurrences of exogenous interventions, and use δ_j^{-1}(B)ω_j(B)B^{b_j} to model the dynamic effects on the output, where

δ_j(B) = 1 - δ_{1j}B - ⋯ - δ_{r_j j}B^{r_j}, ω_j(B) = ω_{0j} - ω_{1j}B - ⋯ - ω_{s_j j}B^{s_j} (3.1)

and b_j is a nonnegative integer representing the delay or 'dead time'. The variables X_{jt} can assume the form of a step function X_{jt} = S_t^{(T_j)} or a pulse function X_{jt} = P_t^{(T_j)}, where

S_t^{(T_j)} = 0 for t < T_j and 1 for t ≥ T_j, and P_t^{(T_j)} = 1 for t = T_j and 0 for t ≠ T_j, (3.2)

and note that (1 - B)S_t^{(T)} = P_t^{(T)}.


Fig. 3.1 shows the response to a step and a pulse input for various transfer
function models of practical interest. Specifically, for a step change in input,
(a) shows a step response with one-period delay; (b) shows the more common
situation of a 'first-order' dynamic response, in which the steady-state gain (eventual
effect) is measured by $\omega/(1 - \delta)$; and (c) represents the situation when $\delta = 1$, in
which the step change in the input produces a 'ramp' response or trend in the
output. For a pulse input, (d) shows the situation in which the pulse input (e.g.
a promotion campaign) has only a transient effect on the output (sales), with $\omega$
measuring the initial increase and $\delta$ the rate of decline; (e) represents the
situation in which, apart from the transient effect, the possibility is entertained that a
residual gain (or loss) $\omega_2$ in the output persists; and finally (f) shows the
situation of an immediate positive response followed by a decay and
possibly a permanent residual effect. The last figure might represent the
dynamic response of sales to a price increase. A positive $\omega_0$ would represent an
immediate rush of buying when a prospective price change was announced at
time $T$; the initial reduction in sales which occurred at time $T + 1$, when the
price increase took effect, would be measured by $\omega_1 + \omega_2$, and the final effect of
the price change would be represented by $\omega_2$.

Fig. 3.1. Responses to a step and a pulse input.

Obviously, these dynamic transfer function models may be readily extended to
represent many situations of potential interest, and interventions extending over
several time periods can be represented by indicator variables other than the
pulse or the step functions. A small simulation of the first-order responses is sketched below.
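To make the responses in Fig. 3.1 concrete, the following minimal Python sketch (the function name and parameter values are our own) passes a step and a pulse input through the first-order transfer function $\omega B/(1 - \delta B)$; the step response approaches the steady-state gain $\omega/(1 - \delta)$ as in case (b), while the pulse response decays geometrically at rate $\delta$ as in case (d).

    import numpy as np

    def first_order_response(x, omega, delta):
        """Output of omega*B/(1 - delta*B) applied to input x,
        i.e. U_t = delta*U_{t-1} + omega*x_{t-1}, zero initial condition."""
        U = np.zeros(len(x))
        for t in range(1, len(x)):
            U[t] = delta * U[t - 1] + omega * x[t - 1]
        return U

    n, T = 30, 10
    step = (np.arange(n) >= T).astype(float)    # S_t^{(T)} of (3.2)
    pulse = (np.arange(n) == T).astype(float)   # P_t^{(T)} = (1 - B) S_t^{(T)}

    resp = first_order_response(step, omega=1.0, delta=0.6)
    print(resp[-1], 1.0 / (1 - 0.6))            # both approximately 2.5
    # pulse response: omega, omega*delta, omega*delta^2, ... after time T
    print(first_order_response(pulse, 1.0, 0.6)[T + 1:T + 4])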

3.2. Model building


In practice, one needs to tentatively specify both the dynamic models
$\delta_j^{-1}(B)\omega_j(B)B^{b_j}$ and an ARMA($p, q$) model for the noise term $N_t$ in (1.6).
Parsimonious dynamic models are usually postulated to represent the expected
effects of interventions. For tentative specification of a model for the noise
term $N_t$, there are several possible alternatives. One may apply the
identification procedures discussed earlier in Subsection 2.3 to data prior to the
occurrences of the interventions if a sufficiently large number of such observations is available. One may apply these procedures to the entire data set
when the effects of the interventions are expected to be transient in nature.
Finally, one may first estimate the impulse responses $\nu_{lj}$, $l = 1, \ldots, m$, for a
suitably large $m$, where

$$\nu_j(B) = \nu_{0j} + \nu_{1j}B + \cdots + \nu_{mj}B^m \doteq \delta_j^{-1}(B)\omega_j(B)B^{b_j},$$

by ordinary least squares, and then apply the identification procedures to the
residuals $Y_t - \sum_{j=1}^{J}\hat\nu_j(B)X_{jt}$.
Once a model of the form (1.6) is tentatively specified, we can then estimate
the intervention parameters and the parameters in the noise model for $N_t$ simultaneously via maximum likelihood. Specifically, write

$$Y_t = C + \sum_{j=1}^{J} U_{jt} + \phi^{-1}(B)\theta(B)a_t, \eqno(3.3)$$

where $\delta_j(B)U_{jt} = \omega_j(B)B^{b_j}X_{jt}$, so that for given values of the parameters in
$\delta_j(B)$ and $\omega_j(B)$ the $U_{jt}$'s can be recursively calculated from the $X_{jt}$'s; we may
then compute the $a_t$'s recursively from $\phi(B)(Y_t - C - \sum_{j=1}^{J} U_{jt}) = \theta(B)a_t$ and
apply nonlinear least squares methods to estimate all the parameters involved
(a sketch of these recursions is given below). Finally, diagnostic checks can be performed on the residuals to assess the
adequacy of the model fit and to search for directions of improvement, if
needed.
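A hedged sketch of these two recursions, for a single first-order dynamic model and an ARMA(1, 1) noise model with zero initial conditions (the function names are ours), might look as follows in Python:

    import numpy as np

    def dynamic_effect(x, omega, delta, b):
        """U_t = delta*U_{t-1} + omega*x_{t-b}: first-order case of
        delta_j(B) U_jt = omega_j(B) B^{b_j} X_jt, zero initial conditions."""
        U = np.zeros(len(x))
        for t in range(len(x)):
            prev = U[t - 1] if t >= 1 else 0.0
            lagged = x[t - b] if t >= b else 0.0
            U[t] = delta * prev + omega * lagged
        return U

    def noise_residuals(y, U_sum, C, phi, theta):
        """a_t from phi(B)(Y_t - C - sum_j U_jt) = theta(B)a_t for ARMA(1,1),
        with phi(B) = 1 - phi*B and theta(B) = 1 - theta*B."""
        N = y - C - U_sum
        a = np.zeros(len(y))
        for t in range(1, len(y)):
            a[t] = N[t] - phi * N[t - 1] + theta * a[t - 1]
        return a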

3.3. Detection of outliers in time series


In the above application of the transfer function model (1.6), the time points
of occurrence of the interventions are supposed known. We now discuss a
variant of the methods for handling situations in which the timings $T_j$ of the
exogenous interventions are unknown and the effects lead to what may be
called aberrant observations or outliers. We summarize the results on outlier
detection in time series of Chang and Tiao (1983), following earlier work by
Fox (1972).

Additive and innovational outliers


Let $\{Y_t\}$ be the observable time series. We shall concentrate on two types of
outliers, additive and innovational. An additive outlier (AO) is defined as

$$Y_t = N_t + \omega\,\xi_t^{(t_0)}, \eqno(3.4)$$

while an innovational outlier (IO) is defined as

$$Y_t = N_t + \frac{\theta(B)}{\phi(B)}\,\omega\,\xi_t^{(t_0)}, \eqno(3.5)$$

where

$$\xi_t^{(t_0)} = \begin{cases} 1, & t = t_0, \\ 0, & t \ne t_0, \end{cases}$$

and $N_t$ follows the model (1.3). In terms of the $a_t$'s in (1.3) with $C = 0$, we have
that

$$\text{(AO)} \qquad Y_t = \frac{\theta(B)}{\phi(B)}\,a_t + \omega\,\xi_t^{(t_0)} \eqno(3.6)$$

and

$$\text{(IO)} \qquad Y_t = \frac{\theta(B)}{\phi(B)}\,(a_t + \omega\,\xi_t^{(t_0)}). \eqno(3.7)$$

Thus, the AO case may be called a 'gross error' model, since only the level of
the $t_0$th observation is affected. On the other hand, an IO represents an
extraordinary shock at $t_0$ influencing $Y_{t_0}, Y_{t_0+1}, \ldots$ through the memory of the
system described by $\theta(B)/\phi(B)$.

Estimation of $\omega$ when $t_0$ is known

To motivate the procedures for the detection of AO and IO, we discuss the
situation when $t_0$ and all time series parameters in the model (1.3) are
known. Defining the residuals $e_t = \pi(B)Y_t$, where $\pi(B) = \phi(B)/\theta(B) = 1 - \pi_1 B - \pi_2 B^2 - \cdots$, we have that

$$\text{(AO)} \qquad e_t = \omega\,\pi(B)\xi_t^{(t_0)} + a_t$$

and

$$\text{(IO)} \qquad e_t = \omega\,\xi_t^{(t_0)} + a_t. \eqno(3.8)$$

From least squares theory, estimators of the impact $\omega$ of the intervention and
the variances of these estimators are

$$\text{(AO)} \qquad \hat\omega_A = \rho^2\pi(F)e_{t_0}, \qquad \mathrm{Var}(\hat\omega_A) = \rho^2\sigma_a^2,$$

and

$$\text{(IO)} \qquad \hat\omega_I = e_{t_0}, \qquad \mathrm{Var}(\hat\omega_I) = \sigma_a^2, \eqno(3.9)$$

where $F = B^{-1}$ and $\rho^2 = (1 + \pi_1^2 + \pi_2^2 + \cdots)^{-1}$. Thus, the best estimate of the effect of
an IO at time $t_0$ is the residual $e_{t_0}$, while the best estimate of the effect of an AO is
a linear combination of $e_{t_0}, e_{t_0+1}, \ldots$ with weights depending on the structure of the
time series model. Note that the variance of $\hat\omega_A$ can be much smaller than $\sigma_a^2$.
If desired, one may perform various tests among the hypotheses

$H_0$: $Y_{t_0}$ is neither an IO nor an AO,
$H_1$: $Y_{t_0}$ is an IO,
$H_2$: $Y_{t_0}$ is an AO.

The likelihood ratio test statistics for IO and AO are

$$H_1 \text{ vs. } H_0: \quad \lambda_1 = \hat\omega_I/\sigma_a$$

and

$$H_2 \text{ vs. } H_0: \quad \lambda_2 = \hat\omega_A/(\rho\sigma_a).$$

On the null hypothesis $H_0$, $\lambda_1$ and $\lambda_2$ are both distributed as N(0, 1). A sketch of these computations is given below.
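For concreteness, a minimal Python sketch of (3.9) and the statistics $\lambda_1$, $\lambda_2$ follows; the function names are ours, and the $\pi$-weights are truncated at $m$ terms, so $\rho^2$ is only approximated.

    import numpy as np

    def pi_weights(phi, theta, m):
        """pi_k in pi(B) = phi(B)/theta(B) = 1 - pi_1 B - pi_2 B^2 - ...,
        with phi = (phi_1,...,phi_p), theta = (theta_1,...,theta_q)."""
        c = np.zeros(m + 1)
        c[0] = 1.0                            # pi(B) = sum c_k B^k, c_0 = 1
        for k in range(1, m + 1):
            val = -(phi[k - 1] if k <= len(phi) else 0.0)
            for j in range(1, min(k, len(theta)) + 1):
                val += theta[j - 1] * c[k - j]
            c[k] = val
        return -c[1:]                         # pi_1, ..., pi_m

    def outlier_stats(e, t0, pi_w, sigma_a):
        """omega-hats of (3.9) and statistics lambda_1, lambda_2 at a known t0."""
        w_I = e[t0]                                        # IO: residual itself
        tail = np.asarray(e[t0 + 1:])
        k = min(len(tail), len(pi_w))
        rho2 = 1.0 / (1.0 + np.sum(pi_w[:k] ** 2))         # truncated rho^2
        w_A = rho2 * (e[t0] - np.dot(pi_w[:k], tail[:k]))  # rho^2 * pi(F) e_{t0}
        return w_I, w_A, w_I / sigma_a, w_A / (np.sqrt(rho2) * sigma_a)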

Detection of outliers
In practice, $t_0$ as well as the time series parameters are all unknown. If only $t_0$
is unknown, one may proceed by calculating $\lambda_1$ and $\lambda_2$ for each $t$, denoted by $\lambda_{1t}$
and $\lambda_{2t}$, and then make decisions based on the sampling properties given above.
When the time series parameters (the $\phi$'s, $\theta$'s, and $\sigma_a$) are also unknown, it can be
shown that the estimates of these parameters can be seriously biased by the
existence of outliers. In particular, $\sigma_a$ will tend to be overestimated. These
considerations have led to the following iterative procedure to handle a
situation in which there may exist an unknown number of AO or IO outliers.
(i) Model the series $Y_t$ by supposing that there are no outliers (i.e.
$Y_t = N_t$), and from the estimated model compute the residuals

$$e_t = \hat\pi(B)Y_t.$$

Let

$$\hat\sigma_a^2 = n^{-1}\sum_{t=1}^{n} e_t^2$$

be the initial estimate of $\sigma_a^2$.
(ii) Compute $\hat\lambda_{it}$, $i = 1, 2$ and $t = 1, \ldots, n$, these being $\lambda_{1t}$ and $\lambda_{2t}$ with the
estimated model. Let $|\hat\lambda_{t_0}| = \max_t \max_i |\hat\lambda_{it}|$. If $|\hat\lambda_{t_0}| = |\hat\lambda_{1t_0}| > c$, where $c$ is a
predetermined positive constant usually taken to be some value between 3 and
4, then there is the possibility of an IO at $t_0$ and the best estimate of $\omega$ is $\hat\omega_{It_0}$.
Eliminate the effect of this possible IO by defining a new residual $\tilde e_{t_0} = e_{t_0} - \hat\omega_{It_0} = 0$. If, on the other hand, $|\hat\lambda_{t_0}| = |\hat\lambda_{2t_0}| > c$, then there is the possibility of an AO at
$t_0$, and the best estimate of its effect is $\hat\omega_{At_0}$. The effect of this AO can be
removed by defining the new residuals $\tilde e_t = e_t - \hat\omega_{At_0}\hat\pi(B)\xi_t^{(t_0)}$, $t \ge t_0$. A new
estimate $\tilde\sigma_a^2$ is computed from the modified residuals.
(iii) Recompute $\hat\lambda_{1t}$ and $\hat\lambda_{2t}$ based on the same initial parameter estimates of
the $\phi$'s and $\theta$'s but using the modified residuals and $\tilde\sigma_a^2$, and repeat the process
(ii).
(iv) W h e n no more outliers are found in (iii), suppose that J outliers (either
I O or A O ) have been tentatively identified at times t~. . . . . b. Treat these times
as if they are known, and estimate the outlier p a r a m e t e r s o21. . . . , ~oj and the
time series p a r a m e t e r s simultaneously using models of the form

J O(B)
Y, = • wjLj (B)~(,") + - - a, (3.10)
j=l 4)(B) '

where L j ( B ) = 1 for an A O and L j ( B ) = O(B)/CP(B) for an I O at t = tf The


The new residuals are

$$\hat e_t^{(1)} = \hat\pi^{(1)}(B)\Big[Y_t - \sum_{j=1}^{J} \hat\omega_j L_j(B)\xi_t^{(t_j)}\Big]. \eqno(3.11)$$

The entire process is repeated until all outliers are identified and their effects
simultaneously estimated.
The above procedure is easy to implement, since very few modifications to
existing software capable of dealing with ARMA and transfer function models
are needed to carry out the required computations. Based on simulation
studies, the performance of this procedure for estimating the autoregressive
coefficient of a simple AR(1) model compares favorably with the robust
estimation procedures proposed by Denby and Martin (1979) and Martin (1980).
While the latter procedures cover only the AR case, our iterative procedure
can be used for any ARMA model.

4. Illustrative examples

In this section, we illustrate the ARMA modeling, intervention analysis and


outlier detection procedures discussed in the preceding sections by two actual
examples.

4.1. Gas data


We here apply the ARMA modeling and outlier detection procedures to the
gas data given in Box and Jenkins (1970). The data consist of 296 observations, taken at 9-second intervals, on the input gas feed rate of a gas furnace.
Fig. 4.1 shows a plot of the series. The sample mean $\bar Z$ and sample variance
$s^2 = (n-1)^{-1}\sum(Z_t - \bar Z)^2$ are, respectively, $\bar Z = -0.0568$ and $s^2 = 1.147$.

Model specification
Tables 4.1a, 4.1b, 4.1c and 4.1d give, respectively, the SACF, SPACF,
ESACF and the simplified ESACF for this example. Note that (i) the estimated
standard errors of the SACF are computed using Bartlett's formula (2.43), (ii)
those for the SPACF are obtained by assuming that the series is white noise,
and (iii) the indicator symbol x is used in the simplified ESACF table when
$|r(k, l)| > 2(n - k - l)^{-1/2}$. The SPACF suggests that an AR(3) model might be
appropriate. On the other hand, an alternative ARMA(2, 3) model is suggested
by the ESACF. The AR(3) model was used by Box and Jenkins; but we have
found that the ARMA(2, 3) model gives a slightly better fit, and shall proceed
with this model.

Estimation and diagnostic checking


Fig. 4.1. Gas data.

Employing the 'exact' likelihood approach discussed in Subsection 2.3.2, the
estimation results corresponding to an ARMA(2, 3) model are

$$(1 - 1.29B + 0.43B^2)Z_t = -0.0082 + (1 + 0.63B + 0.50B^2 + 0.36B^3)a_t, \eqno(4.1)$$

with estimated standard errors (in the order of the parameters above) 0.10, 0.09, 0.03, 0.10, 0.09 and 0.07, and $\hat\sigma_a^2 = 0.0341$.
Table 4.2 gives the SACF of the residuals from the fitted model (4.1). The

Table 4.1a
Sample autocorrelation function -- gas data

l      1     2     3     4     5     6     7     8     9     10    11    12
r(l)   0.95  0.83  0.68  0.53  0.41  0.32  0.26  0.23  0.21  0.21  0.20  0.19
S.E.   0.06  0.10  0.12  0.13  0.14  0.14  0.15  0.15  0.15  0.15  0.15  0.15

Table 4.1b
Sample partial autocorrelation function -- gas data

l      1     2      3     4     5     6      7     8     9     10     11     12
p(l)   0.95  -0.79  0.34  0.12  0.06  -0.11  0.05  0.10  0.02  -0.07  -0.09  0.04
S.E.   0.06  0.06   0.06  0.06  0.06  0.06   0.06  0.06  0.06  0.06   0.06   0.06

Table 4.1c
Extended sample autocorrelation function -- gas data

AR\MA   0      1      2      3      4      5      6      7      8
0       0.95   0.83   0.68   0.53   0.41   0.32   0.26   0.23   0.21
1       0.78   0.50   0.26   0.07  -0.06  -0.14  -0.18  -0.18   0.10
2       0.40   0.31   0.23   0.09  -0.09  -0.08  -0.07  -0.10  -0.10
3      -0.32  -0.02   0.20  -0.20  -0.09   0.09   0.04   0.01  -0.11
4      -0.38  -0.03   0.14  -0.18  -0.17   0.07   0.00   0.01  -0.09
5       0.40   0.31  -0.07  -0.19   0.06   0.07  -0.00   0.07  -0.01
6       0.38   0.33  -0.23  -0.22   0.05   0.09  -0.07   0.09  -0.02

Table 4.1d
Simplified extended sample autocorrelation function -- gas data

AR\MA  0  1  2  3  4  5  6  7  8
0      x  x  x  x  x  x  x  x  x
1      x  x  x  0  0  x  x  x  0
2      x  x  x  0  0  0  0  0  0
3      x  0  x  x  0  0  0  0  0
4      x  0  x  x  x  0  0  0  0
5      x  x  0  x  0  0  0  0  0
6      x  x  x  x  0  0  0  0  0

associated values of the $Q$ statistic indicate that the model seems adequate.
However, inspection of the residuals themselves indicates the possible existence of a number of outliers. Specifically, applying the iterative outlier
detection procedure discussed in Subsection 3.3 with $c = 4.0$, the following
results are obtained:

$t_0$            43      55      113
$\hat\lambda$    6.62    -5.95   -4.23
Nature           AO      AO      AO

Simultaneous estimation of the effects of these three outliers and the time

Table 4.2
Sample autocorrelation function of residuals -- gas data

l       1     2      3      4     5      6     7     8      9      10    11    12
r_a(l)  0.02  -0.02  -0.02  0.02  -0.05  0.05  0.04  -0.02  -0.05  0.07  0.13  -0.06
S.E.    0.06  0.06   0.06   0.06  0.06   0.06  0.06  0.06   0.06   0.06  0.06  0.06
Q       0.1   0.2    0.2    0.4   1.2    2.1   2.6   2.7    3.5    4.9   10.2  11.4

series model parameters yields

$$Z_t = -0.0553 + 0.46\,\xi_t^{(43)} - 0.39\,\xi_t^{(55)} - 0.27\,\xi_t^{(113)} + N_t, \eqno(4.2)$$

with estimated standard errors 0.18, 0.05, 0.05 and 0.05 respectively, where

$$(1 - 1.41B + 0.53B^2)N_t = (1 + 0.81B + 0.45B^2 + 0.23B^3)a_t,$$

with estimated standard errors 0.09, 0.09, 0.10, 0.12 and 0.08,

and $\hat\sigma_a^2 = 0.0227$. Comparing (4.2) with (4.1), it is seen that a substantial
reduction in the estimated variance $\hat\sigma_a^2$ of the $a_t$'s occurs, from 0.0341 to 0.0227,
when the effects of these three AO's are taken into account. In addition, the changes
in the estimates of the autoregressive and moving average parameters are also
appreciable. We note here that if the critical value $c$ were set equal to 3, a few
additional AO's or IO's would be identified. Their effects on the parameter estimates are,
however, very slight, and hence they have not been included in the model.
Now it is readily verified that the zeros of the fitted autoregressive polynomial $(1 - 1.41B + 0.53B^2)$ in (4.2) are complex and lie outside the unit
circle. This implies that the series is stationary. The estimated mean of the series is
$-0.0553$, with an estimated standard error of 0.18, so that the mean is
essentially zero. The estimated moving average polynomial $(1 + 0.81B + 0.45B^2 + 0.23B^3)$ has one real zero and a pair of complex zeros, all lying outside
the unit circle. The complex zeros in the autoregressive and moving average
polynomials jointly explain the pseudo-periodic behavior exhibited by the
series.

Forecasts
Employing (4.2) as the final model and treating the parameter estimates as
the true values, Table 4.3 gives the forecasts $\hat Z_T(l)$ of future observations $Z_{T+l}$,

Table 4.3
Forecasts of future observations -- gas data (T = 296)

Lead time l    1       2       3       4       5
Z_T(l)         -0.248  -0.192  -0.122  -0.076  -0.049
S.E.(e_T(l))   0.151   0.367   0.588   0.775   0.905
Lead time l    6       7       8       9       10
Z_T(l)         -0.036  -0.031  -0.031  -0.035  -0.039
S.E.(e_T(l))   0.986   1.031   1.054   1.064   1.068
Lead time l    11      12      13      14      15
Z_T(l)         -0.043  -0.047  -0.050  -0.052  -0.054
S.E.(e_T(l))   1.070   1.070   1.070   1.070   1.070
Lead time l    16      17      18      19      20
Z_T(l)         -0.055  -0.055  -0.056  -0.056  -0.056
S.E.(e_T(l))   1.070   1.070   1.070   1.070   1.070

$l = 1, \ldots, 20$, made at $T = 296$, the end of the data period. It is seen that as $l$
increases, $Z_T(l)$ gradually approaches $-0.0553$, the estimated mean of the
series. Also, the estimated standard error of the forecast error $e_T(l)$ increases
from $0.151 = \hat\sigma_a$ for $l = 1$ to 1.070 for $l = 20$, which is essentially the estimated
standard deviation of the series. The seven-fold increase, from 0.151 to 1.070,
in the standard errors of the forecasts shows that, although the series is stationary,
a substantial improvement in the accuracy of short-term forecasts is possible
when past values of the series are utilized instead of relying solely on the mean
level of the series. A sketch of this standard-error computation is given below.
It is noted that all the computations involved in this example are performed
using the package developed by Liu et al. (1983).
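The standard errors in Table 4.3 can be approximately reproduced from the $\psi$-weights of the fitted noise model, since the variance of the $l$-step forecast error is $\sigma_a^2(1 + \psi_1^2 + \cdots + \psi_{l-1}^2)$. A minimal Python sketch (the function name is ours):

    import numpy as np

    def psi_weights(phi, theta, m):
        """psi_j in psi(B) = theta(B)/phi(B), with the conventions
        phi(B) = 1 - phi_1*B - ... and theta(B) = 1 - theta_1*B - ..."""
        psi = np.zeros(m + 1)
        psi[0] = 1.0
        for k in range(1, m + 1):
            val = -(theta[k - 1] if k <= len(theta) else 0.0)
            for j in range(1, min(k, len(phi)) + 1):
                val += phi[j - 1] * psi[k - j]
            psi[k] = val
        return psi

    # Noise model of (4.2): phi(B) = 1 - 1.41B + 0.53B^2,
    # theta(B) = 1 + 0.81B + 0.45B^2 + 0.23B^3 (i.e. theta_i = -0.81, -0.45, -0.23)
    psi = psi_weights([1.41, -0.53], [-0.81, -0.45, -0.23], 20)
    sigma_a = np.sqrt(0.0227)
    se = sigma_a * np.sqrt(np.cumsum(psi ** 2))   # S.E.(e_T(l)), l = 1, 2, ...
    print(np.round(se[:3], 3))

The printed values, approximately 0.151, 0.367 and 0.588, agree with the first three entries of Table 4.3.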

4.2. Ozone data


To illustrate the intervention analysis techniques, we turn to the
ozone data shown in Fig. 4.2, analyzed earlier by Tiao et al. (1975) and Box and
Tiao (1975). The data consist of monthly averages of the ozone level in downtown
Los Angeles from January 1955 to December 1972. Two interventions $I_1$ and $I_2$
of potential major importance are:
$I_1$: In early 1960, the opening of a new freeway in Los Angeles which altered
the traffic pattern, and the inception of a new law (Rule 63) which reduced
the proportion of reactive hydrocarbons in the gasoline sold locally.
$I_2$: From 1966 onward, regulations required engine design changes in new
automobiles which would be expected to reduce the emission of nitrogen
oxides and hydrocarbons, which are the primary components in the formation of ozone through photochemical reaction.

Fig. 4.2. Monthly averages of ozone at downtown Los Angeles (January 1955-December 1972).

The first intervention $I_1$ was expected to produce a step change in the ozone
level at the beginning of 1960. As for $I_2$, the engine changes were expected to
reduce the formation of ozone. Now, in the absence of information on the
proportion of cars with the new design changes in the car population over time, we
might represent the possible effect of $I_2$ as an annual trend reflecting the effect
of the increased proportion of 'new design vehicles' in the population. As
explained more fully in Tiao et al. (1975), because of the differences in
meteorological conditions between the summer months and the winter months,
the effect of $I_2$ would be different in these two seasons.
The above considerations have led to the following model for the monthly
ozone observations $Y_t$:

$$Y_t = \omega_{01}X_{1t} + \omega_{02}(1 - B^{12})^{-1}X_{2t} + \omega_{03}(1 - B^{12})^{-1}X_{3t} + N_t, \eqno(4.3)$$

where

$X_{1t} = S_t^{(T)}$, $T$ = January 1960,
$X_{2t} = 1$ for the 'summer' months June-October beginning 1966, and 0 otherwise,
$X_{3t} = 1$ for the 'winter' months November-May beginning 1966, and 0 otherwise,

and $N_t$ is the noise term. Inspection of the SACF of $Y_t$ and that of the
seasonally differenced series $(1 - B^{12})Y_t$ leads to the tentative model for the
noise term $N_t$:

$$(1 - B^{12})N_t = (1 - \theta_1 B)(1 - \theta_2 B^{12})a_t. \eqno(4.4)$$

The models (4.3) and (4.4) allow for (i) a step change in the level of ozone of
size $\omega_{01}$ associated with $I_1$, (ii) progressive yearly increments in the ozone level of
sizes $\omega_{02}$ and $\omega_{03}$, respectively, for the summer and the winter months associated with $I_2$, and (iii) seasonal and nonseasonal serial relationships in the data.
Indicator series of this form are easily constructed; a sketch is given below.
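A minimal Python sketch of the construction of these indicator series and of the seasonal accumulation $(1 - B^{12})^{-1}$ follows; the indexing conventions (e.g. the exact month in 1966 at which $X_{2t}$ and $X_{3t}$ begin) are our own assumptions based on the description above.

    import numpy as np

    # monthly index t = 0 is January 1955; 216 months through December 1972
    months = np.arange(216) % 12            # 0 = January, ..., 11 = December
    year = 1955 + np.arange(216) // 12

    X1 = (year >= 1960).astype(float)       # step S_t^{(T)}, T = January 1960
    summer = (months >= 5) & (months <= 9)  # June-October
    X2 = ((year >= 1966) & summer).astype(float)
    X3 = ((year >= 1966) & ~summer).astype(float)

    def seasonal_cumsum(x, s=12):
        """(1 - B^s)^{-1} x: adds a unit each year, so omega_02, omega_03
        in (4.3) act as yearly increments in the respective months."""
        out = x.copy()
        for t in range(s, len(x)):
            out[t] += out[t - s]
        return out

    X2_ramp, X3_ramp = seasonal_cumsum(X2), seasonal_cumsum(X3)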
Employing the estimation procedure described in Subsection 3.2, we have
obtained the following fitting results:

Parameter          Estimate   S.E.
$\omega_{01}$      -1.34      0.19
$\omega_{02}$      -0.24      0.06
$\omega_{03}$      -0.10      0.05
$\theta_1$         -0.27      0.07
$\theta_2$          0.78      0.04
$\sigma_a^2$        0.62

Examination of the residuals shows that the model seems adequate for this
data set. Thus, there is evidence to support the following:

(i) associated with $I_1$ is a step change of approximately $\hat\omega_{01} = -1.34$ in the
level of ozone;
(ii) associated with $I_2$ there is a progressive reduction in ozone. Over the
data period, the yearly increment is estimated at $\hat\omega_{02} = -0.24$ for the summer
months, but the increment in the winter months is slight.

5. Some aspects of vector ARMA models

Many of the properties of the univariate ARMA model (1.3) discussed in
Section 2 can be generalized to the vector model (1.7). In particular, following
the same development leading to the $\psi$ form in (2.6) and the $\pi$ form in (2.10),
and relabeling the $Z_t$'s, $w_t$'s and $a_t$'s as vectors and the $\theta$'s, $\phi$'s, $\psi$'s and $\pi$'s as
matrices, we can write the vector model alternatively in the forms

$$Z_t = a_t + \sum_{h=1}^{t-m} \psi_h a_{t-h} + \sum_{h=t-(m+r)+1}^{t-m} \psi_h^* w_{t-h} \eqno(5.1)$$

and

$$Z_t = \sum_{h=1}^{t-m} \pi_h Z_{t-h} - \sum_{h=t-(m+r)+1}^{t-m} \pi_h^* w_{t-h} + a_t. \eqno(5.2)$$

It is clear from (5.2) that every element of $Z_t$ is in general related to all the
elements of $Z_{t-j}$, $j = 1, 2, \ldots$, so that there can be feedback relationships among
all the $k$ component series $\{Z_{it}\}$, $i = 1, \ldots, k$. However, if the components of $Z_t$
can be arranged such that the matrices $\Phi$'s and $\Theta$'s are all lower triangular, then so
will be the $\pi$'s, and (1.7) will imply a unidirectional relationship among the series.
To illustrate, consider the vector ARMA(1, 0) model with $k = 2$ and $C = 0$.
Suppose $\Phi$ is lower triangular, so that we can write

$$\begin{bmatrix} 1 - \Phi_{11}B & 0 \\ -\Phi_{21}B & 1 - \Phi_{22}B \end{bmatrix}\begin{bmatrix} Z_{1t} \\ Z_{2t} \end{bmatrix} = \begin{bmatrix} a_{1t} \\ a_{2t} \end{bmatrix}. \eqno(5.3)$$

Letting $a_{2t} = \beta a_{1t} + \varepsilon_t$, where $\varepsilon_t$ and $a_{1t}$ are independent, we can express (5.3) as

$$(1 - \Phi_{11}B)Z_{1t} = a_{1t}, \eqno(5.4a)$$

$$Z_{2t} = \frac{\omega_0 - \omega_1 B}{1 - \Phi_{22}B}\,Z_{1t} + (1 - \Phi_{22}B)^{-1}\varepsilon_t, \eqno(5.4b)$$

where $\omega_0 = \beta$ and $\omega_1 = \beta\Phi_{11} - \Phi_{21}$. Thus $Z_{1t}$ will only depend on its own past,
but $Z_{2t}$ will depend on its own past as well as on the present and past of $Z_{1t}$. In this
case, $Z_{1t}$ can be regarded as the 'input' and $Z_{2t}$ the 'output'. Expression (5.4b)
is of the same form as (1.6) with a single stochastic input. More generally, a
unidirectional relationship may exist between subsets of the components

of $Z_t$, while feedbacks are allowed within each subset. This occurs when
the $\Phi$'s and $\Theta$'s are lower block triangular. The important thing to note is
that vector ARMA models cover both unidirectional and feedback relationships
(a small simulation is sketched below).
Model building procedures discussed in Subsection 2.3 can also be extended
to cover the vector case. For a discussion of the various modeling techniques,
see Quenouille (1957), Hannan (1970), Tiao and Box (1981), and Tiao and Tsay
(1983).
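The unidirectional structure in (5.3)-(5.4) is easily visualized by simulation; the following minimal Python sketch uses hypothetical coefficient values of our own choosing.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    Phi = np.array([[0.6, 0.0],    # lower triangular: no feedback from Z2 to Z1
                    [0.4, 0.5]])
    Z = np.zeros((n, 2))
    a = rng.standard_normal((n, 2))
    for t in range(1, n):
        Z[t] = Phi @ Z[t - 1] + a[t]

    # Z[:, 0] depends only on its own past (an AR(1) 'input'), while
    # Z[:, 1] depends on its own past and on current/past Z[:, 0], as in (5.4b).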

References

Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
Bartlett, M. S. (1946). On the theoretical specification and sampling properties of autocorrelated time series. J. Roy. Statist. Soc. Suppl. 8, 27-41.
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, CA.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
Box, G. E. P. and Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems. J. Amer. Statist. Assoc. 70, 70-79.
Chang, I. and Tiao, G. C. (1983). Estimation of time series parameters in the presence of outliers. Technical Report No. 8, Statistics Research Center, Graduate School of Business, University of Chicago (to appear in Technometrics).
Denby, L. and Martin, R. D. (1979). Robust estimation of the first-order autoregressive parameter. J. Amer. Statist. Assoc. 74, 140-146.
Fox, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Ser. B 34, 350-363.
Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
Gray, H. L., Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-77.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
Harvey, A. C. (1981). Finite sample prediction and overdifferencing. J. Time Ser. Anal. 2, 221-232.
Hillmer, S. C. and Tiao, G. C. (1979). Likelihood function of stationary multiple autoregressive moving average models. J. Amer. Statist. Assoc. 74, 652-660.
Liu, L. M., Hudak, G. B., Box, G. E. P., Muller, M. E. and Tiao, G. C. (1983). The SCA System for Univariate-Multivariate Time Series and General Statistical Analysis. Scientific Computing Associates, DeKalb, IL.
Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303.
Ljung, G. M. and Box, G. E. P. (1979). The likelihood function of stationary autoregressive-moving average models. Biometrika 66, 265-270.
Martin, R. D. (1980). Robust estimation of autoregressive models. In: D. R. Brillinger and G. C. Tiao, eds., Directions in Time Series. Institute of Mathematical Statistics, Hayward, CA.
Newbold, P. (1974). The exact likelihood function for a mixed autoregressive-moving average process. Biometrika 61, 423-426.
Quenouille, M. H. (1957). The Analysis of Multiple Time Series. Griffin, London.
Slutsky, E. (1937). The summation of random causes as the source of cyclic processes. Econometrica 5, 105-146.
Tiao, G. C. and Box, G. E. P. (1981). Modeling multiple time series with applications. J. Amer. Statist. Assoc. 76, 802-816.
Tiao, G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregressive parameters in ARMA models. Ann. Statist. 11, 856-871.
Tiao, G. C., Box, G. E. P. and Hamming, W. J. (1975). Analysis of Los Angeles photochemical smog data: a statistical overview. J. Air Pollution Control Assoc. 25, 260-265.
Tsay, R. S. and Tiao, G. C. (1984). Consistent estimates of autoregressive parameters and extended sample autocorrelation function for stationary and nonstationary ARMA models. J. Amer. Statist. Assoc. 79, 84-96.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 119-155

4

Robustness in Time Series and Estimating ARMA Models

R. Douglas Martin* and Victor J. Yohai†

*Research supported by the Office of Naval Research under contract N00014-82-0062, and by National Science Foundation Grant SES80-15570.
†Research supported by the Office of Naval Research under contract N00014-82-0062.

1. Robustness concepts

Three distinct probabilistic concepts of robustness have been developed in
the context of point estimation based on independent and identically distributed (i.i.d.) observations (or error terms). These concepts are, in historical
order of inception, efficiency robustness (Tukey, 1960), min-max robustness
(Huber, 1964), and qualitative robustness (Hampel, 1968, 1971), the contribution of these notions being due to the authors cited. In addition there is
Tukey's (1976) data-oriented counterpart of qualitative robustness known as
resistance. Taking relative importance as the criterion, we would list the
probability-based robustness concepts in the following order: qualitative
robustness, efficiency robustness and min-max robustness. Resistance is at the
same relative level as qualitative robustness, but on the data-based side of
things.
Since we regard resistance and qualitative robustness as the most important
concepts, the bulk of this section is devoted to qualitative robustness. A careful
definition of resistance forms the basis for a particularly transparent definition
of qualitative robustness which turns out to be equivalent to Hampel's (1968,
1971) definition in the classic i.i.d. setting.
In Subsections 1.1-1.3 below, we briefly define efficiency robustness, min-max
robustness and resistance. We note in advance that there is no conceptual
difficulty in applying these three concepts in the time-series setting. The
situation is quite different with regard to qualitative robustness, where new
technical issues arise in providing an adequate definition for time series, and
the relevant details are given in Subsection 1.4. Some summary comments are
provided in Subsection 1.5.

1.1. Efficiency robustness

Let $T_n = T_n(x_1, x_2, \ldots, x_n)$ be an estimate of the scalar parameter $\theta$ in the
distribution $P_\theta^n$ for $x^n = (x_1, x_2, \ldots, x_n)$, and let $\mathrm{EFF}(T_n, P_\theta^n)$ denote a suitably
defined efficiency of $T_n$ at $P_\theta^n$. For example, we might have

$$\mathrm{EFF}(T_n, P_\theta^n) = \frac{\mathrm{VAR}_{P_\theta^n}(\text{best known } T_n)}{\mathrm{VAR}_{P_\theta^n}(T_n)}, \eqno(1.1)$$

or we might have

$$\mathrm{EFF}(T_n, P_\theta^n) = \frac{V_{CR}(P_\theta^n)}{\mathrm{VAR}_{P_\theta^n}(T_n)}, \eqno(1.2)$$

where $V_{CR}(P_\theta^n)$ is the Cramér-Rao lower bound at $P_\theta^n$.


When the focus is on asymptotic efficiencies (as it often is), the estimate is
denoted $T$, the measure for the process $\{x_n\}_{n\ge 1}$ is denoted $P_\theta^\infty$, and the efficiency
of $T$ at $P_\theta^\infty$ is

$$\mathrm{EFF}(T, P_\theta^\infty) = \frac{V_{CR}(P_\theta^\infty)}{V_\infty(T)} = \frac{1}{i(P_\theta^\infty)V_\infty(T)}, \eqno(1.3)$$

where $V_\infty(T)$ is the asymptotic variance of $\sqrt{n}\,T_n$ at $P_\theta^\infty$, and $i(P_\theta^\infty) =
\lim_{n\to\infty} n^{-1}i(P_\theta^n)$ is the asymptotic Fisher information for $\theta$, $i(P_\theta^n)$ being the
finite-sample Fisher information for $\theta$.
Let $P_\theta$ be the nominal distribution for the data (typically $P_\theta$ is Gaussian),
and let $P_{\theta,1}, P_{\theta,2}, \ldots, P_{\theta,K}$ be a strategically selected set of distributions which
are in some sense 'near' $P_\theta$. Typically, the $P_{\theta,i}$ will have marginal or conditional
distributions which are heavy-tailed deviations from normality, and hence give
rise to outliers. Then an estimate $T$ (or $T_n$) is said to be efficiency robust if $T$ (or $T_n$)
has high efficiency at $P_\theta$, and also at $P_{\theta,1}, \ldots, P_{\theta,K}$. High efficiency at $P_\theta$ will
usually mean an efficiency in the range 90% to 95%. Of course, for estimates $T_n$ at
finite samples we require only the appropriate marginal distributions
$P_\theta^n, P_{\theta,1}^n, \ldots, P_{\theta,K}^n$. In the most frequently used situation where the $x_i$ are i.i.d., we
need only the one-dimensional marginal measures $P_\theta, P_{\theta,1}, \ldots, P_{\theta,K}$.
For estimating location in the i.i.d. setting, the sample mean is fully efficient
(i.e. has efficiency 100%) at a Gaussian $P_\theta$, but has low efficiencies at heavy-tailed alternatives ('low' can mean zero, e.g. at Cauchy-tailed distributions). On
the other hand, high efficiency robustness can be obtained through the use of
trimmed means (Tukey, 1960), or Huber's (1964) location M-estimates, defined
in (1.6) and (1.7) below, to name just two of many possibilities. A small Monte
Carlo illustration is sketched below.
Efficiency robustness can be similarly defined for vector parameters by using
an appropriate definition of multivariate efficiency.
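The following small Monte Carlo sketch in Python (function names, sample size and contamination parameters are our own choices) illustrates the point: under the nominal Gaussian model the sample mean is slightly more efficient than a 10% trimmed mean, but under 10% contamination by a scale-5 Gaussian component the ordering reverses.

    import numpy as np

    rng = np.random.default_rng(1)

    def trimmed_mean(x, alpha=0.1):
        """Drop the alpha fraction of smallest and of largest observations."""
        x = np.sort(x)
        k = int(alpha * len(x))
        return x[k:len(x) - k].mean()

    def mc_variance(estimator, eps, n=50, reps=4000, scale=5.0):
        """Monte Carlo variance under (1 - eps)N(0,1) + eps N(0, scale^2)."""
        vals = np.empty(reps)
        for r in range(reps):
            x = rng.standard_normal(n)
            x[rng.random(n) < eps] *= scale   # inflate a fraction eps of points
            vals[r] = estimator(x)
        return vals.var()

    for eps in (0.0, 0.1):
        ratio = mc_variance(trimmed_mean, eps) / mc_variance(np.mean, eps)
        # ratio > 1: mean more efficient; ratio < 1: trimmed mean more efficient
        print(eps, round(ratio, 2))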

1.2. Min-max robustness

Let $V(T, P^\infty)$ denote the asymptotic variance of an estimate $T$ at distribution
$P^\infty$, and let $\mathbf{T}$ denote a large family of estimates, while $\mathbf{P}^\infty$ denotes a large family
of distributions for the process $\{x_n\}_{n\ge 1}$. A min-max robust estimate $T_0$ solves
the problem

$$\inf_{T\in\mathbf{T}} \sup_{P^\infty\in\mathbf{P}^\infty} V(T, P^\infty). \eqno(1.4)$$

The solution to this problem is usually obtained by solving the saddle-point
problem

$$\sup_{P^\infty\in\mathbf{P}^\infty} \inf_{T\in\mathbf{T}} V(T, P^\infty) = V(T_0, P_0^\infty) = \inf_{T\in\mathbf{T}} \sup_{P^\infty\in\mathbf{P}^\infty} V(T, P^\infty). \eqno(1.5)$$

Of course, for the most frequently treated case of i.i.d. processes $\{x_n\}_{n\ge 1}$ with
marginal distribution $F$, one would replace $\mathbf{P}^\infty$ by a family $\mathbf{P}$ of univariate
distributions, and replace $P^\infty$ by a univariate distribution $P$ in the above
expressions.
Huber's (1964) seminal work showed that for estimating location in the i.i.d.
setting, the above problem is solved by a member of the class of M-estimates $\hat\mu$
defined by

$$\min_{\mu} \sum_{i=1}^{n} \rho(y_i - \mu), \eqno(1.6)$$

where $\rho$ is symmetric and convex. Equivalently, $\hat\mu$ is a solution of

$$\sum_{i=1}^{n} \psi(y_i - \hat\mu) = 0 \eqno(1.7)$$

with $\psi = \rho'$. The min-max estimate $T_0$ corresponds to a particular psi-function
$\psi_H$ which is now often called 'Huber's psi-function'. The definition of $\psi_H$ is
given by (2.22) in Section 2. A sketch of the computation of $\hat\mu$ is given below.
For more general min-max theory and results, see Huber (1981).
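A minimal sketch of the computation of a location M-estimate satisfying (1.7) with Huber's psi-function, using iterative reweighting with a preliminary MAD scale (the tuning constant $c = 1.345$ and the function name are our choices):

    import numpy as np

    def huber_location(y, c=1.345, max_iter=100, tol=1e-8):
        """Solve sum_i psi_H((y_i - mu)/s) = 0, cf. (1.7), by reweighting;
        s is a preliminary MAD scale; c = 1.345 gives ~95% Gaussian efficiency."""
        y = np.asarray(y, dtype=float)
        mu = np.median(y)
        s = 1.4826 * np.median(np.abs(y - mu))   # MAD, consistent at the normal
        if s == 0:
            return mu
        for _ in range(max_iter):
            r = (y - mu) / s
            w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))  # psi_H(r)/r
            mu_new = np.sum(w * y) / np.sum(w)
            if abs(mu_new - mu) < tol * s:
                break
            mu = mu_new
        return mu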

1.3. Resistance

To many statisticians a resistant estimate is one which is not unduly affected by a
few outliers (Tukey, 1976). This definition has been refined somewhat in the
following way (cf. Huber, 1981, Chap. 1.2):

An estimate $T_n$ is called resistant if 'small' changes in the data result in only small changes in
$T_n$, where 'small changes' in the data means (i) large changes in a small fraction of the data,
and/or (ii) small changes in all the data.

The large changes in (i) correspond to outliers, while the small changes in (ii)
correspond, for example, to rounding errors or grouping errors. The sample mean
lacks resistance, whereas estimates such as trimmed means (with the median as
a limiting case) and M-estimates are resistant.

We would remark that while resistance is a beautifully transparent notion,
which makes it eminently serviceable for applied scientists, it suffers from a
small defect, namely that the definition is not very precise. This defect is
remedied in the next subsection, where a precise definition of resistance is
given. This definition turns out to yield a very transparent and useful definition
of qualitative robustness. As a caveat, one should be aware of the fact that
even with a careful definition, it is not completely trivial to verify resistance for
implicitly defined estimates such as location and regression M-estimates.

1.4. Qualitative robustness

Let $x_1, \ldots, x_n, \ldots$ be i.i.d. observations with values in a Polish space, i.e. a
complete and separable metric space $(X, d)$. In most cases, $X$ is a Euclidean
space with the usual metric. The following notation will be used. Let $X^n$ and
$X^\infty$ be the Cartesian products of $n$ copies of $X$ and of countably many copies of $X$,
respectively. $\beta$ will denote the Borel $\sigma$-field on $X$, and $\beta^n$, $\beta^\infty$ the corresponding product $\sigma$-fields on $X^n$ and $X^\infty$. For any measurable space $(\Omega, A)$, let $P(\Omega)$
be the set of all the probability measures on $A$. If $\mu$ and $\nu$ are in $P(\Omega)$, then
$P(\mu, \nu)$ denotes the class of all the probabilities $P$ on $(\Omega \times \Omega, A \times A)$ with
marginals $\mu$ and $\nu$. Given a probability $P \in P(X)$, $P^n$ and $P^\infty$ will for the time being
denote the corresponding product probabilities in $P(X^n)$ and $P(X^\infty)$.
If $(X, d)$ is any metric space, the Prohorov distance $\pi_d$ between $\mu$ and $\nu$ in $P(X)$
is defined by

$$\pi_d(\mu, \nu) = \inf\{\varepsilon: \mu(B) \le \nu(V(B, \varepsilon, d)) + \varepsilon, \ \forall B \in \beta\}, \eqno(1.8)$$

where

$$V(B, \varepsilon, d) = \{x \in X: d(x, B) < \varepsilon\}. \eqno(1.9)$$

Strassen (1965) proved that if $(X, d)$ is a Polish space, then $\pi_d$ is alternatively
given by

$$\pi_d(\mu, \nu) = \inf\{\varepsilon: \exists P \in P(\mu, \nu) \text{ such that } P([d(x, x') \ge \varepsilon]) \le \varepsilon\}, \eqno(1.10)$$

where $[d(x, x') \ge \varepsilon] = \{(x, x'): d(x, x') \ge \varepsilon\}$.
Let $T_n: X^n \to \Gamma$, $n \ge n_0$, be a sequence of estimates which are invariant under
permutation of coordinates, where the parameter space $(\Gamma, \gamma)$ is also a Polish
space (in most cases $\Gamma$ is a Euclidean space). The reason for the appearance of $n_0$ is
that often a minimum number $n_0$ of observations is required in order to define
the estimate. Hampel (1968) introduced two definitions of qualitative robustness.
The first definition is as follows:

DEFINITION 1.1. The sequence $\{T_n\}_{n\ge n_0}$ is qualitatively robust at $P \in P(X)$ if,
given $\varepsilon > 0$, there exists $\delta > 0$ such that $\forall n \ge n_0$, $\forall Q \in P(X)$, $\pi_d(P, Q) < \delta \Rightarrow
\pi_\gamma(L(T_n, P^n), L(T_n, Q^n)) < \varepsilon$, where $L(T_n, P^n)$ denotes the law of $T_n$ under
$P^n$.

According to the Strassen characterization of the Prohorov distance, this
definition of qualitative robustness requires, uniformly in sample size $n$, that
the distributions of the estimates do not change too much when there is a
small change in the marginal distribution of the observations produced by one
or both of the following:
(a) A small fraction of observations with gross errors (outliers).
(b) Small errors in all the observations (e.g. rounding or grouping errors).
However, Definition 1.1 allows only for i.i.d. deviations from the central i.i.d.
model $P^n$. In order to at least partially cover non-i.i.d. deviations, Hampel
introduced the concept of qualitative $\pi$-robustness. We use the following
notation. Let $\tilde X^n$ be $X^n$ modulo a permutation of coordinates. Given $x^n =
(x_1, \ldots, x_n) \in X^n$, denote by $\mu[x^n]$ the empirical probability which assigns mass
$1/n$ to each point $x_i$, $1 \le i \le n$. Given $x^n$, $y^n$ in $X^n$, define $d_n^*(x^n, y^n) =
\pi_d(\mu[x^n], \mu[y^n])$. Finally, given $P^n \in P(X^n)$, let $\tilde P^n$ be the probability induced
on $\tilde X^n$.

DEFINITION 1.2. The sequence $\{T_n\}_{n\ge n_0}$ is qualitatively $\pi$-robust at $P \in P(X)$ if,
given $\varepsilon > 0$, there exists $\delta > 0$ such that

$$\forall n \ge n_0, \ \forall Q^n \in P(X^n), \quad \pi_{d_n^*}(\tilde P^n, \tilde Q^n) < \delta \Rightarrow \pi_\gamma(L(T_n, P^n), L(T_n, Q^n)) < \varepsilon.$$

Boente, Fraiman and Yohai (1982) proposed a new approach to qualitative
robustness, based on the concept of resistance (see Tukey, 1976; Mosteller and
Tukey, 1977). The basic idea is to require that the estimate change by only a small
amount when the sample is changed by replacing a small fraction of the observations
by arbitrarily large outliers or by perturbing all the observations with small errors
(e.g. round-off or grouping errors). This approach has the advantage that it may be
applied without special assumptions on the probability model for the observations; e.g., they may be dependent or non-identically distributed. Moreover, as
we will see below, the new definitions are based on quite simple and transparent
concepts.
First we define a new distance $d_n^+$ on $X^n$. Given $x^n = (x_1, \ldots, x_n)$, $y^n =
(y_1, \ldots, y_n)$ in $X^n$, define

$$d_n^+(x^n, y^n) = \inf\{\varepsilon: \#\{i: d(x_i, y_i) \ge \varepsilon\} \le n\varepsilon\}. \eqno(1.11)$$

Therefore, two points of $X^n$ are close in the metric $d_n^+$ if all the coordinates
except a small fraction are close. According to this notion of closeness, if the
original sample is modified by replacing a fraction no greater than $\varepsilon$ of the
observations by arbitrary outliers, or if all the observations are perturbed by
round-off errors smaller than $\varepsilon$, then the original and modified samples are at a
distance smaller than $\varepsilon$ (a small computational sketch is given below).
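For real-valued samples the metric (1.11) is easy to compute exactly, since the infimum is attained at one of finitely many candidate values of $\varepsilon$; a minimal Python sketch (the function name is ours):

    import numpy as np

    def d_plus(x, y):
        """d_n^+ of (1.11) for real samples, with d(x, y) = |x - y|: the
        smallest eps such that at most a fraction eps of the coordinates
        differ by eps or more."""
        dist = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        n = len(dist)
        # feasible eps form an upper interval; its left endpoint is either
        # one of the distances or a value of the form k/n
        candidates = np.union1d(dist, np.arange(n + 1) / n)
        for eps in np.sort(candidates):
            if np.sum(dist >= eps) <= n * eps:
                return eps
        return 1.0

    x = np.zeros(100)
    y = x.copy()
    y[:3] = 1000.0          # three gross outliers out of 100 points
    print(d_plus(x, y))     # 0.03: a small fraction changed, however wildly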

Given $x^n \in X^n$ and $\delta > 0$, let

$$A T_n(x^n, \delta) = \sup\{|T_n(y^n) - T_n(z^n)|: d_n^+(y^n, x^n) \le \delta,\ d_n^+(z^n, x^n) \le \delta\}. \eqno(1.12)$$

The following definition formalizes the notion of resistance:

DEFINITION 1.3. Let $x = (x_1, \ldots, x_n, \ldots) \in X^\infty$ and $x^n = (x_1, \ldots, x_n)$. Then
$\{T_n\}_{n\ge n_0}$ is resistant at $x$ if, given $\varepsilon > 0$, there exists $\delta > 0$ such that
$A T_n(x^n, \delta) < \varepsilon$, $\forall n \ge n_0$.

From now on $P^n$ will denote any probability in $P(X^n)$ (not just a product
probability) and similarly $P^\infty$ will denote any probability in $P(X^\infty)$, unless
otherwise noted.
The following definitions of strong and weak robustness were introduced by
Boente, Fraiman and Yohai (1982), and represent an alternative to Hampel's
definition of qualitative robustness.

DEFINITION 1.4. Let $P^\infty \in P(X^\infty)$. $\{T_n\}_{n\ge n_0}$ is strongly robust at $P^\infty$ if

$$P^\infty([\{T_n\}_{n\ge n_0} \text{ is resistant at } x]) = 1. \eqno(1.13)$$

DEFINITION 1.5. Let $P^\infty \in P(X^\infty)$. $\{T_n\}_{n\ge n_0}$ is weakly robust at $P^\infty$ if, given $\varepsilon > 0$,
there exists $\delta > 0$ such that

$$P^\infty([A T_n(x^n, \delta) \le \varepsilon]) \ge 1 - \varepsilon, \quad \forall n \ge n_0. \eqno(1.14)$$

Boente, Fraiman and Yohai (1982) proved the following relationships between (i) weak and strong robustness, and (ii) both weak and strong
robustness and Hampel's definition of qualitative $\pi$-robustness:

THEOREM 1.1. Let $\{T_n\}_{n\ge n_0}$ be a sequence of estimates and $P^\infty \in P(X^\infty)$. Then:
(i) Strong robustness implies weak robustness.
(ii) If $\{T_n\}_{n\ge n_0}$ are invariant under permutations of coordinates and $P^\infty$ corresponds to an i.i.d. process, then weak robustness, strong robustness and qualitative
$\pi$-robustness are equivalent.

Papantoni-Kazakos and Gray (1979), Bustos (1981) and Cox (1981) also gave
various definitions of qualitative robustness which hold for dependent processes and which are in the spirit of Hampel's approach. There are two such
definitions, which correspond to generalizations of Hampel's qualitative
robustness and qualitative $\pi$-robustness respectively.

DEFINITION 1.6. Let $\rho$ be a metric on $P(X^\infty)$, and $P^\infty \in P(X^\infty)$. $\{T_n\}_{n\ge n_0}$ is


qualitatively $\rho$-robust at $P^\infty$ if, given $\varepsilon > 0$, there exists $\delta > 0$ such that

$$\forall n \ge n_0, \ \forall Q^\infty \in P(X^\infty), \quad \rho(P^\infty, Q^\infty) < \delta \Rightarrow \pi_\gamma(L(T_n, P^n), L(T_n, Q^n)) \le \varepsilon,$$

where $P^n$ and $Q^n$ are the marginals corresponding to the first $n$ coordinates of
$P^\infty$ and $Q^\infty$, respectively.

DEFINITION 1.7. Let $\{\rho_n\}_{n\ge n_0}$ be a sequence of metrics on $P(X^n)$, $n \ge n_0$. Suppose that $P^\infty \in P(X^\infty)$ and that $P^n$ is the marginal measure for the first $n$
coordinates. $\{T_n\}_{n\ge n_0}$ is qualitatively $\rho_n$-robust at $P^\infty$ if, given $\varepsilon > 0$, there exists
$\delta > 0$ such that

$$\forall n \ge n_0, \ \forall Q^n \in P(X^n), \quad \rho_n(P^n, Q^n) \le \delta \Rightarrow \pi_\gamma(L(T_n, P^n), L(T_n, Q^n)) \le \varepsilon.$$

Boente, Fraiman and Yohai (1982) argue that the relevant definition for the
case of dependent processes is qualitative $\rho_n$-robustness, and not qualitative
$\rho$-robustness. They also propose using $\rho_n = \pi_{d_n^+}$, and show that if $(\Gamma, \gamma)$ is a
Euclidean space, qualitative $\pi_{d_n^+}$-robustness is equivalent to weak robustness.
Papantoni-Kazakos and Gray (1979) proposed the use of the Vasershtein
distance on $P(X^n)$. The Vasershtein distance utilizes a so-called 'per-letter'
metric $d = d(x_i, y_i)$ to measure the distance between coordinates $x_i$ and $y_i$ of $x^n$
and $y^n$. Boente, Fraiman and Yohai (1982) proved that if $d$ is bounded, then
the notion of qualitative $\rho_n$-robustness based on the Vasershtein metric is
equivalent to qualitative $\rho_n$-robustness with $\rho_n = \pi_{d_n^+}$. Other metrics on $P(X^n)$
were proposed by Cox (1981) and Bustos (1981).
In the sequel the term qualitative robustness, or simply robustness, without
qualifiers, will mean either strong or weak robustness as defined in this section,
and we will keep in mind the equivalence of strong robustness and Hampel's
qualitative $\pi$-robustness in the i.i.d. setting, given by Theorem 1.1(ii).

1.5. Comments

Qualitative robustness is a continuity property of estimates which is quite fundamental and basic. As such, it is a principle of statistics which should be
considered at the same level as the principles of sufficiency, unbiasedness, the
likelihood, and so on.
Of course, qualitative robustness alone is not enough, because some silly
estimates, such as the estimate $T_n \equiv c$ for any constant $c$, are qualitatively
robust. One needs a high degree of efficiency robustness as well. These two
kinds of robustness are paramount, and they are easily attainable in the i.i.d.
setting using location M-estimates, for example.

While min-max robustness is quite a nice property to have, and some
statisticians would elevate this property to a basic principle of statistics, we can
often have qualitative and efficiency robustness in situations where min-max
robust estimates may not exist. We should not worry too much about not
having a min-max robust estimate when we have both qualitative and efficiency
robustness in hand.
In many situations of both the time-series and non-time-series variety, one
now has the possibility of using a qualitatively robust estimate. Usually, this
should not be done to the exclusion of using a classical least squares or
Gaussian maximum-likelihood procedure. Except in certain situations (e.g.
some real-time settings), one will compute both classical and robust estimates.
When the two types of estimates are in reasonable agreement, outliers are not
likely to be having much effect on the classical procedure. When the estimates
differ substantially, one suspects that outliers may be a problem, because the
robust estimate is not much affected by outliers, whereas the classical procedure is.
In the latter case, one should inspect the data for outliers or other anomalies which
may have been overlooked in a preliminary analysis of the data.
Although outliers are sometimes easy to spot in relatively routine exploratory
analysis of the data, this is certainly not always the case, and robust procedures
can often highlight the difficulty through improved residuals analysis (see, for
example, Denby and Mallows, 1977). For a striking example of a time-series
data set containing outliers which are not visible to the eye, but which are clearly
revealed in the residuals of a robust fit, see Kleiner, Martin and Thompson (1979)
or Martin and Thompson (1982).
In any event, if one does not use a robust estimate, one should at least do
this with an awareness that one is electing not to invoke a basic principle of
statistics!

2. Estimates for perfectly observed ARMA models

2.1. Least-squares estimates

Let $y_1, \ldots, y_T$ be observations corresponding to a stationary and invertible
ARMA($p, q$) model:

$$(y_t - \mu) - \phi_1(y_{t-1} - \mu) - \cdots - \phi_p(y_{t-p} - \mu) = u_t - \theta_1 u_{t-1} - \cdots - \theta_q u_{t-q}, \eqno(2.1)$$

where the innovations $u_t$ are i.i.d. random variables with a symmetric distribution $F$. Put $\phi = (\phi_1, \ldots, \phi_p)'$, $\theta = (\theta_1, \ldots, \theta_q)'$ and $\lambda = (\phi', \theta', \mu)'$, where
the superscript primes denote transposes.
We can also write (2.1) as

$$\phi(B)(y_t - \mu) = \theta(B)u_t, \eqno(2.1')$$

where $B$ is the backward shift operator, and $\phi(B)$, $\theta(B)$ are the polynomial
operators

$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p, \qquad \theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q. \eqno(2.2)$$

If $F$ is normal, an asymptotically efficient estimate $\hat\lambda$ is obtained using any
one of several asymptotically equivalent versions of least-squares estimates
(LSE's), denoted here by $\hat\lambda_{LS}$. One popular variant of the LSE is the conditional
maximum-likelihood estimate, conditioned on $y_1, \ldots, y_p$ and $\hat u_p = \hat u_{p-1} = \cdots =
\hat u_{p-q+1} = 0$ (see Box and Jenkins, 1976). This LSE is the value $\hat\lambda$ which
minimizes

$$\sum_{t=p+1}^{T} \hat u_t^2(\lambda), \eqno(2.3)$$

where the residuals $\hat u_t(\lambda)$ are defined by

$$\hat u_t(\lambda) = \theta(B)^{-1}\phi(B)(y_t - \mu), \eqno(2.4)$$

putting $y_t = \mu$ for $t \le 0$. These residuals may be computed recursively:

$$\hat u_t(\lambda) = (y_t - \mu) - \phi_1(y_{t-1} - \mu) - \cdots - \phi_p(y_{t-p} - \mu) + \theta_1\hat u_{t-1}(\lambda) + \cdots + \theta_q\hat u_{t-q}(\lambda) \eqno(2.5)$$

with

$$\hat u_p(\lambda) = \hat u_{p-1}(\lambda) = \cdots = \hat u_{p-q+1}(\lambda) = 0. \eqno(2.5')$$

The following asymptotic result is well known (see, for example, Whittle, 1962, or
Box and Jenkins, 1976): if $\mathrm{var}(F) = E_F(u_t^2) < \infty$, then

$$\sqrt{T}(\hat\lambda_{LS} - \lambda) \stackrel{d}{\rightarrow} N(0, D(\phi, \theta, F)), \eqno(2.6)$$

where $\stackrel{d}{\rightarrow}$ denotes convergence in distribution, and $N(\xi, \Sigma)$ denotes the multivariate normal distribution with mean $\xi$ and covariance matrix $\Sigma$. The $(p+q+1) \times (p+q+1)$ covariance matrix $D(\phi, \theta, F)$ for the least-squares estimate is
given by

$$V_{LS} = D(\phi, \theta, F) = \begin{bmatrix} C^{-1}(\phi, \theta) & 0 \\ 0 & \mathrm{var}(F)H^2(\phi, \theta) \end{bmatrix}. \eqno(2.7)$$

$C(\phi, \theta)$ is the $(p+q) \times (p+q)$ symmetric matrix with elements

$$C_{i,j} = \sum_{k=0}^{\infty} s_k s_{k+j-i} \quad \text{if } i \le j \le p,$$
$$C_{i,p+j} = \sum_{k=0}^{\infty} t_k s_{k+j-i} \quad \text{if } i \le p,\ j \le q,\ i \le j,$$
$$C_{i,p+j} = \sum_{k=0}^{\infty} s_k t_{k+i-j} \quad \text{if } i \le p,\ j \le q,\ j \le i, \eqno(2.8)$$
$$C_{p+i,p+j} = \sum_{k=0}^{\infty} t_k t_{k+j-i} \quad \text{if } i \le j \le q,$$

where $s_i$ and $t_i$, $0 \le i < \infty$, are the coefficients in the series expansions of the
operators $\phi^{-1}(B)$ and $\theta^{-1}(B)$ respectively:

$$\phi^{-1}(B) = \sum_{i=0}^{\infty} s_i B^i, \qquad \theta^{-1}(B) = \sum_{i=0}^{\infty} t_i B^i. \eqno(2.9)$$

Also,

$$H(\phi, \theta) = \frac{1 - \theta_1 - \cdots - \theta_q}{1 - \phi_1 - \cdots - \phi_p}. \eqno(2.10)$$

2.2. The distribution-free property of least squares

Suppose that the observations $y_t$ satisfy an ARMA($p, q$) model given by
(2.1), but the distribution $F$ of the $u_t$'s, instead of being normal, has 'heavier
tails' than the normal distribution. Such distributions generate outliers which
have considerable adverse effects on least-squares estimates in ordinary linear
regression models, and are the source of lack of efficiency robustness in such
models. For example, $F$ might be a contaminated normal distribution

$$F = (1 - \varepsilon)N(0, \sigma^2) + \varepsilon G, \eqno(2.11)$$

where $\varepsilon$ is small and $G$ is an arbitrary distribution with dispersion greater than
$\sigma$, for example, $G = N(0, \bar\sigma^2)$ with $\bar\sigma^2 \gg \sigma^2$. (Strictly speaking, this is not a
heavy-tailed distribution since its tails are Gaussian, but it is an outlier-producing distribution and hence it is often called heavy-tailed anyway.) Any
perfectly observed ARMA model (2.1) where the $u_t$ have a heavy-tailed
distribution is called an innovation outliers (IO) model (see Denby and Martin,
1979; Martin, 1981).
Notice that the upper left-hand block of (2.7) gives the asymptotic covariance
matrix of $(\hat\phi', \hat\theta')'$. Thus the LS-estimate of $\phi$ and $\theta$ has an asymptotic covariance
matrix which is independent of $F$, and therefore is distribution free under the
restriction $\mathrm{var}(F) < \infty$! This result was first pointed out by Whittle (1962), who
used the term robust to describe this behavior of the LSE.

On the other hand, the lower right-hand block of (2.7) gives the asymptotic
variance of $\hat\mu$, and since this variance depends on $\mathrm{var}(F)$, the LSE of $\mu$ is very
sensitive to small deviations of $F$ from a nominal Gaussian distribution. For
example, if $F$ is given by (2.11), then $\mathrm{var}(F) = (1 - \varepsilon)\sigma^2 + \varepsilon\,\mathrm{var}(G)$, and hence
for $\varepsilon > 0$ and $\sigma^2$ fixed, $\mathrm{var}(F)$ may take arbitrarily large values, including $\infty$
(e.g. for Cauchy-tailed $G$). Thus $\hat\mu$ lacks efficiency robustness!

2.3. Infinite-variance autoregressions

Not much is known about the asymptotic distribution of the least-squares
estimates $\hat\phi_{LS}$, $\hat\theta_{LS}$ of $\phi$ and $\theta$ when $\mathrm{var}(F) = \infty$, even assuming that $\mu$ is known
and need not be estimated. However, in the case of pure autoregression with $\mu$
known, $\hat\phi_{LS}$ is known to be consistent in the infinite-variance case. For example,
Yohai and Maronna (1978) have shown that the LS-estimate is consistent for an
AR($p$) model with $\mu = 0$ and $F$ symmetric, even if $E_F(|u_t|) = \infty$. In this case a
sufficient condition for consistency of $\hat\phi_{LS}$ is that $E_F([\log|u_t|]^+) < \infty$, where $[x]^+$
denotes the positive part of $x$. This condition cannot be weakened, since it is
necessary for the existence of the stationary autoregressive process. Hannan
and Kanter (1977) have shown that if $F$ belongs to the domain of attraction of
a stable law of order $\alpha$, $0 < \alpha < 2$, then the LS-estimate of an AR($p$) process
(with $\mu = 0$) converges to the true value with rate $T^{1/\alpha}$, which is faster than the rate
$T^{1/2}$ for the finite-variance case.

2.4. Efficiency robust estimates

Despite the distribution-free asymptotic behavior of the LS-estimates of $\phi$, $\theta$
for the autoregressive and moving average coefficients, these LS-estimates may
be very inefficient for heavy-tailed $F$, even if it is close to the normal, e.g. if $F$
is given by (2.11) with small $\varepsilon$. Asymptotic efficiency can of course be obtained
by using the maximum-likelihood estimate (MLE). If $F$ has a differentiable
density $f$, then the MLE is asymptotically equivalent to the estimate $\hat\lambda$ obtained
by minimizing

$$\sum_{t=1}^{T} \rho_F(\hat u_t(\lambda)), \eqno(2.12)$$

where $\rho_F = -\ln f$.
The asymptotic covariance matrix $V_{MLE}$ of this estimate is under general
conditions equal to the inverse of the information matrix, and is given by

$$V_{MLE} = a(\Psi, F)\cdot D(\phi, \theta, F), \eqno(2.13)$$

where $\Psi = \rho_F' = -f'/f$, $D(\phi, \theta, F)$ is given by (2.7), and

$$a(\Psi, F) = \frac{E_F\Psi^2(u)}{E_F^2\Psi'(u)}\cdot\frac{1}{\mathrm{var}(F)} = \frac{V_{loc}(\Psi, F)}{\mathrm{var}(F)} = \frac{1}{i(F)\,\mathrm{var}(F)}. \eqno(2.14)$$

Here

$$V_{loc}(\Psi, F) = \frac{E_F\Psi^2(u)}{E_F^2\Psi'(u)} = \frac{1}{i(F)} \eqno(2.14')$$

is the asymptotic variance of the location MLE (Huber, 1964), and $i(F) =
[V_{loc}(\Psi, F)]^{-1}$ is the Fisher information for $F$. The expression (2.14'), with $\Psi$
replaced by a general $\psi$-function, is the asymptotic variance of Huber's (1964)
location M-estimates defined by (1.7).
Using the ratio of the trace of $V_{MLE}$ in (2.13) to the trace of $V_{LS}$ in (2.7) as
the measure of asymptotic efficiency shows that the efficiency of the LS-estimate is just $a(\Psi, F)$. It is well known (see Huber, 1964) that for any
$\varepsilon$-contaminated neighborhood of the $N(0, \sigma^2)$ distribution,

$$F_{\varepsilon,\sigma^2} = \{F: F = (1 - \varepsilon)N(0, \sigma^2) + \varepsilon G, \ G \text{ symmetric}\}, \eqno(2.15)$$

$a(\Psi, F)$ may be arbitrarily small:

$$\inf_{F\in F_{\varepsilon,\sigma^2}} a(\Psi, F) = 0.$$

Thus for the innovations outliers model the infimum of the asymptotic
efficiency of the LS-estimate in any neighborhood $F_{\varepsilon,\sigma^2}$ of the normal distribution is zero. The LS-estimates $\hat\mu$, $\hat\phi$, $\hat\theta$ all lack efficiency robustness for
perfectly observed ARMA models.
While the least-squares estimates $\hat\phi_{LS}$, $\hat\theta_{LS}$ have the asymptotic distribution-free property cited in Subsection 2.2, it should be noted from inspection of
(2.14), along with the fact that $i(F)$ is relatively stable as $F$ deviates from
Gaussianity in a sufficiently smooth way, that the lack of efficiency robustness
of $\hat\phi$ and $\hat\theta$ is associated with not taking advantage of the increased precision
attainable with heavy-tailed $F$. This is because innovations outliers result in
'good' leverage points for estimating $\phi$ and $\theta$ (see Belsley, Kuh and Welsch,
1980, or Cook and Weisberg, 1982, for definitions of leverage for ordinary
regression models).
The situation with regard to estimating $\mu$ is of course quite different in this
respect, since innovations outliers do not result in 'good' leverage points for
estimating $\mu$, and as a consequence the (asymptotic) variance of $\hat\mu_{LS}$ is quite
sensitive to heavy-tailed innovations distributions.
Since $F$ is not generally known, it is not possible to use the MLE in practice.
However, (2.12) suggests defining the following class of M-estimates, which are
a natural generalization of the M-estimates for location (Huber, 1964) or
regression (Huber, 1973). An M-estimate $\hat\lambda$ for an ARMA($p, q$) model minimizes

$$\sum_{t=p+1}^{T} \rho\Big(\frac{\hat u_t(\lambda)}{\sigma}\Big), \eqno(2.16)$$

where $\sigma$ is a scale parameter (cf. (1.6) and (1.7), where the nuisance parameter
$\sigma$ was omitted).
Let $\psi = \rho'$ be the derivative of $\rho$. Then differentiating (2.16) we get the
following system of equations:

$$\sum_{t=p+1}^{T} \psi\Big(\frac{\hat u_t(\lambda)}{\sigma}\Big) a_{t-j}(\lambda) = 0, \quad j = 1, \ldots, p, \eqno(2.17)$$

$$\sum_{t=p+1}^{T} \psi\Big(\frac{\hat u_t(\lambda)}{\sigma}\Big) b_{t-j}(\lambda) = 0, \quad j = 1, \ldots, q, \eqno(2.18)$$

$$\sum_{t=p+1}^{T} \psi\Big(\frac{\hat u_t(\lambda)}{\sigma}\Big) = 0, \eqno(2.19)$$

where

$$a_t(\lambda) = \theta^{-1}(B)(y_t - \mu) = \phi^{-1}(B)\hat u_t(\lambda) \eqno(2.20)$$

and

$$b_t(\lambda) = \theta^{-2}(B)\phi(B)(y_t - \mu) = \theta^{-1}(B)\hat u_t(\lambda). \eqno(2.21)$$

One good choice of $\psi$-function is Huber's favorite type

$$\psi_{H,c}(u) = \begin{cases} u, & |u| \le c, \\ c\,\mathrm{sgn}(u), & |u| > c. \end{cases} \eqno(2.22)$$

If one wants protection against extremely heavy-tailed distributions, e.g. the
Cauchy distribution, one can use, as Hampel (1974) proposed, a psi-function $\psi$
such that $\psi(u) = 0$ if $|u| > c$, for some finite $c$. The $\psi$-functions with this property
are called redescending. A frequently used type of redescending function is the
bisquare type proposed by Beaton and Tukey (1974). The bisquare $\psi$-function is
given by

$$\psi_{B,c}(u) = \begin{cases} u\big(1 - (u/c)^2\big)^2, & |u| \le c, \\ 0, & |u| > c. \end{cases} \eqno(2.23)$$

In most practical cases the scale parameter $\sigma$ is unknown and therefore must
be estimated. As was proposed by Huber (1964) for the location case, $\sigma$ may be
estimated simultaneously by an equation of the form

$$\sum_{t=p+1}^{T} \chi\Big(\frac{\hat u_t(\lambda)}{\sigma}\Big) = 0, \eqno(2.24)$$

where $\chi$ is a monotone nondecreasing and even function. Huber suggests
choosing $\chi$ in the family

$$\chi_{H,c}(u) = \psi_{H,c}^2(u) - b$$

with $b = E_\Phi\psi_{H,c}^2(u)$, where $\Phi$ is the N(0, 1) distribution. This choice of $b$ yields a $\hat\sigma$
which is consistent when the $u_t$ are $N(0, \sigma^2)$. Sketches of these $\psi$- and $\chi$-functions
are given below.
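A minimal Python sketch of (2.22)-(2.24) follows (function names are ours; the constant $b$ is obtained here by simple Monte Carlo rather than by exact integration):

    import numpy as np

    def psi_huber(u, c):
        """Huber psi-function (2.22)."""
        return np.clip(u, -c, c)

    def psi_bisquare(u, c):
        """Redescending bisquare psi-function (2.23): vanishes for |u| > c."""
        u = np.asarray(u, dtype=float)
        return np.where(np.abs(u) <= c, u * (1.0 - (u / c) ** 2) ** 2, 0.0)

    def chi_huber(u, c, b):
        """chi_{H,c}(u) = psi_{H,c}^2(u) - b, used in the scale equation (2.24)."""
        return psi_huber(u, c) ** 2 - b

    # b = E_Phi psi_{H,c}^2(u), here approximated by Monte Carlo
    rng = np.random.default_rng(0)
    c = 1.345
    b = np.mean(psi_huber(rng.standard_normal(10 ** 6), c) ** 2)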
In the case where $\psi$ is redescending, the system (2.17), (2.18), (2.19) and
(2.24) has more than one solution, and some of the solutions are not consistent
estimates of the true parameters (this is undoubtedly also the case with the
LSE when moving-average terms are present, since then the estimating equations are nonlinear). Therefore, a good choice of the initial estimate $\hat\lambda^{(0)}$ in the
iterative numerical algorithm used to solve these estimating equations is
essential.
As in the location case mentioned in Subsection 1.2, we will choose $\rho$ to be
symmetric and convex. Then $\psi = \rho'$ will be odd and nondecreasing. Under
general regularity conditions it may be proved that

$$\sqrt{T}(\hat\lambda - \lambda) \stackrel{d}{\rightarrow} N(0, a(\psi, F)D), \eqno(2.25)$$

where $D = D(\phi, \theta, F)$ is the asymptotic covariance matrix of the LSE given by
(2.7), and the form of $a(\psi, F)$ is given by (2.14), except that now the MLE
score-function $\Psi$ is replaced by a general psi-function $\psi$. Details may be found
in Lee and Martin (1982).
Again we take the trace of the asymptotic covariance matrix as a measure of
multivariate efficiency, and since $\psi$ specifies $\hat\lambda$, we denote the asymptotic
efficiency of $\hat\lambda$ by $\mathrm{EFF}(\psi, F)$. From (2.13), (2.14) and (2.25) we have

$$\mathrm{EFF}(\psi, F) = \frac{a(\Psi, F)}{a(\psi, F)} = \frac{1}{i(F)V_{loc}(\psi, F)}. \eqno(2.26)$$

But (2.26) is just the asymptotic efficiency expression for a location M-estimate,
and $\hat\lambda$ therefore has the same asymptotic efficiency-robustness properties as a
location M-estimate based on the same psi-function $\psi$. More details on the
efficiency robustness of autoregression M-estimates for the innovations outlier
model may be found in Martin (1982).
Thus, one can achieve efficiency robustness through the use of M-estimates
for ARMA models. However, we will see in the next section that these
M-estimates are not at all qualitatively robust!

3. Estimation of imperfectly observed ARMA models

3.1. Non-robustness of M-estimates


In this section we indicate the effect produced on the M-estimates (and
LSE's) when a small fraction of the observations of an ARMA model are
replaced by observations from a different process. Let $y_t$, $1 \le t \le T$, be the $t$th
observation, with the $y_t$'s obtained from the general replacement type contamination model

$$y_t = (1 - z_t)x_t + z_t w_t, \eqno(3.1)$$

where $x_t$ is a stationary ARMA($p, q$) process, $z_t$ is a 0-1 process which denotes
respectively absence-presence of outliers, and $w_t$ is the outlier-generating
process. The fraction of outliers is $\gamma = P(z_t = 1)$. Some especially interesting
cases are:
(i) The additive outliers model: This model corresponds to $w_t = x_t + v_t$ with
$v_t$ independent of $x_t$. This case was studied by Denby and Martin (1979) and
Bustos and Yohai (1983).
(ii) The substitution outliers model: In this case $w_t$ is a process independent
of $x_t$.
The most important thing to note about M-estimates is that even if ψ is bounded, the a_t and b_t in (2.17) and (2.18) are unbounded, and therefore it is clear that a small fraction of outliers may have a very large influence on the M-estimate. Thus while M-estimates can have high efficiency robustness for perfectly observed ARMA models, they are not resistant at any point, and hence they are not qualitatively robust! Outliers can cause not only inflated variability of M-estimates, but also considerable bias, even asymptotically. In essence the model (3.1) is a kind of errors-in-variables model, and we should not expect consistency when γ > 0.
In order to assess how sensitive the LS- and M-estimates are to small fractions of outliers, Bustos and Yohai (1983) performed the following Monte Carlo experiments. They considered 500 samples of size 100 from two models with additive outliers. In both models y_t = x_t + z_t v_t, where x_t, v_t and z_t are independent series, the z_t's are i.i.d. Bernoulli variables with P(z_t = 1) = γ, and the v_t's are i.i.d. with distribution N(0, τ²), where τ² = 9 var(x_t). The two models were as follows:
(i) x_t is a Gaussian AR(1) process, x_t = 0.8x_{t−1} + u_t, where the u_t's are i.i.d. with distribution N(0, 1) and γ = 0.1.
(ii) x_t is a Gaussian MA(1) process, x_t = 0.8u_{t−1} + u_t, where the u_t's are i.i.d. with distribution N(0, 1) and γ = 0.05.
For each of these models they consider three different estimates: the LS-estimate, an M-estimate based on the Huber family ψ_{H,c} given by (2.22) (MH-estimate) and an M-estimate based on the bisquare family given by (2.23) (MB-estimate). In the last two cases the tuning constants are chosen so that the efficiency under the perfectly observed AR(1) or MA(1) Gaussian model is 0.95.
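A condensed sketch of the AR(1) part of this experiment, in Python and for the LS-estimate only (the M-estimates would require the iterative schemes described elsewhere in this chapter; the seed and implementation details are our own assumptions):

import numpy as np

rng = np.random.default_rng(0)
phi, gamma, T, nrep = 0.8, 0.1, 100, 500
var_x = 1.0 / (1.0 - phi ** 2)        # stationary variance of the AR(1)

estimates = []
for _ in range(nrep):
    u = rng.normal(size=T)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + u[t]  # x_t = 0.8 x_{t-1} + u_t
    z = rng.random(T) < gamma         # i.i.d. Bernoulli outlier indicators
    v = rng.normal(scale=np.sqrt(9.0 * var_x), size=T)
    y = x + z * v                     # additive outliers model
    estimates.append(np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2))

print(np.mean(estimates))             # well below 0.8, as in Table 1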
Table 1 shows the sample mean of the 500 values of each estimate. The
resulting large biases show just how sensitive the LS- and M-estimates are
when the observations contain a small fraction of additive outliers. Denby and
Martin (1979) reported similar results for the AR(1) case.

Table 1
Sample means of LS- and M-estimates for the AR(1) and MA(1) models with additive outliers

Estimate    AR(1): φ = 0.8    MA(1): θ = −0.8

LS          0.422             −0.426
MH          0.507             −0.449
MB          0.583             −0.456

3.2. Robust alternatives to M-estimates

Several classes of estimates which are qualitatively robust have been proposed: general M-estimates (GM-estimates), estimates based on residual autocovariances (RA-estimates), truncated RA-estimates (TRA-estimates) and approximate-maximum-likelihood estimates (AM-estimates). These classes of estimates are described in the next three sections.

4. General M-estimates

4.1. GM-estimates and their asymptotics


The following notation will be used to describe general M-estimates (GM-estimates):

a_t(λ) = (a_t(λ), a_{t−1}(λ), …, a_{t−p+1}(λ))′,
b_t(λ) = (b_t(λ), b_{t−1}(λ), …, b_{t−q+1}(λ))′,
d_t(λ) = (a_t′(λ), b_t′(λ))′,                      (4.1)
d_t⁺(λ) = (d_t′(λ), 1)′.

Then according to (2.17), (2.18) and (2.19), the equations for the LSE's may be written

Σ_{t=p+1}^{T} û_t(λ)d⁺_{t−1}(λ) = 0.   (4.2)

The class of general M-estimates (GM-estimates) is defined by

Σ_{t=p+1}^{T} w[û_t(λ)/σ̂, d⁺_{t−1}(λ)′Ĉ⁻¹d⁺_{t−1}(λ)] û_t(λ)d⁺_{t−1}(λ) = 0,   (4.3)

where w is a weight function mapping R × R into R, Ĉ is an estimate of the covariance matrix C(λ) of d⁺_{t−1}(λ), and σ is a scale parameter for the u_t's. In order to obtain robustness, the summands in (4.3) should be bounded, and in order to obtain Fisher consistency we require that

E_λ w[u_t(λ)/σ, d⁺_{t−1}(λ)′C⁻¹(λ)d⁺_{t−1}(λ)] u_t(λ)d⁺_{t−1}(λ) = 0,   (4.3′)

where u_t(λ) is generated from (2.4) with the y_t starting in the remote past. If we assume that F is symmetric, a sufficient condition for (4.3′) is that w be even with respect to its first argument. The scale parameter σ may be estimated simultaneously by an equation similar to (2.24).
There are two main types of GM-estimates:
(a) Mallows type:

w(u, v) = ψ₁(u)ψ₂(v^{1/2}) / (u v^{1/2}).   (4.4)

(b) Hampel-Krasker-Welsch type:

w(u, v) = ψ(u v^{1/2}) / (u v^{1/2}).   (4.5)

The GM-estimates were first proposed for regression by Hampel (1975) and Mallows (1976) (see also Krasker, 1980; Krasker and Welsch, 1982; Maronna, Bustos and Yohai, 1979). GM-estimates were studied for autoregressive models by Denby and Martin (1979), Martin (1980), and Bustos (1982).
Call λ̂_w the estimate defined by (4.3); then a formal Taylor series expansion indicates that under suitable regularity conditions we will have

√T(λ̂_w − λ) →_d N(0, D_w(w, F, φ, θ)),   (4.6)

where

D_w(w, F, φ, θ) = σ²B⁻¹AB⁻¹,   (4.7)

with

A = E[η²(u_t/σ, d⁺_{t−1}, φ, θ) d_{t−1}d′_{t−1}],   (4.8)

B = E[(1/σ) η₁(u_t/σ, d⁺_{t−1}, φ, θ) d_{t−1}d′_{t−1}],   (4.9)

η(u, d⁺, φ, θ) = w[u, d⁺′C⁻¹(λ)d⁺] u   (4.10)

and

η₁(u, d⁺, φ, θ) = ∂η(u, d⁺, φ, θ)/∂u.   (4.11)

For a careful proof in the case of pure autoregressions see Bustos (1982). Künsch (1984) has shown that GM-estimates of the Hampel-Krasker type, with ψ in the Huber family given by (2.22), have an optimality property of the Hampel type for autoregressions: they minimize the trace of the asymptotic covariance matrix under a Gaussian AR(p) model, subject to a bound on Hampel's (1974) influence curve.

4.2. Computation of GM-estimates

The numerical computation of GM-estimates may be done by iterative weighted least squares. In each iteration the weight of the tth observation is given by

w[û_t(λ̂)/σ̂, d⁺_{t−1}(λ̂)′Ĉ⁻¹d⁺_{t−1}(λ̂)],

where λ̂ and σ̂ are the estimates from the preceding iteration. Ĉ may be computed in one of two ways:
(i) Ĉ may be expressed parametrically as Ĉ = C(λ̂), or
(ii) for autoregressions, Ĉ may be computed independently of λ̂; for an example of this approach, see Martin (1980).
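A minimal sketch of this iteration in Python, for a pure AR(p) and the Mallows type with ψ₁ = ψ₂ in the Huber family; the MAD scale estimate and the sample covariance used for Ĉ are illustrative choices, not those of the papers cited:

import numpy as np

def psi_h(t, c=1.345):
    return np.clip(t, -c, c)

def wgt(t, c=1.345):
    # psi(t)/t, with the limiting value 1 at t = 0
    t = np.asarray(t, dtype=float)
    out = np.ones_like(t)
    nz = t != 0
    out[nz] = psi_h(t[nz], c) / t[nz]
    return out

def gm_ar(y, p, n_iter=20):
    # Mallows-type GM-estimate of an AR(p) by iterated weighted least squares
    T = len(y)
    D = np.column_stack([y[p - 1 - k:T - 1 - k] for k in range(p)])  # rows d_{t-1}'
    yt = y[p:]
    phi = np.linalg.lstsq(D, yt, rcond=None)[0]     # LS starting value
    Cinv = np.linalg.inv(np.atleast_2d(np.cov(D, rowvar=False)))
    for _ in range(n_iter):
        u = yt - D @ phi                            # residuals u_t(lambda)
        sigma = 1.483 * np.median(np.abs(u))        # robust (MAD) scale
        m = np.einsum('ij,jk,ik->i', D, Cinv, D)    # d' C^{-1} d
        w = wgt(u / sigma) * wgt(np.sqrt(m))        # Mallows weights
        phi = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * yt))
    return phi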

4.3. Qualitative robustness of GM-estimates

In the case of a pure AR(p) model under suitable 'regularity' conditions, the GM-estimates are resistant and qualitatively robust if the function η defined in (4.10) is bounded and continuous. We give a heuristic argument supporting this statement. If η is bounded, a small proportion ε of outliers will affect only a small proportion ε* = ε(p + 1) of terms of (4.3). Moreover, this small proportion of terms will be bounded, and therefore their influence on the estimate will be small. On the other hand, if η is not bounded, the proportion ε* of bad terms in (4.3) may have an unbounded influence. When moving-average terms are present, the GM-estimates are neither resistant nor robust. We comment on the reason for this in Subsection 5.3.

5. Residual autocovariance estimates

5.1. RA-estimates and their asymptotics

Consider the system (2.17), (2.18) and (2.19) for the M-estimates, and replace a_t and b_t by their expressions as a function of the û_t(λ)'s given in (2.20) and (2.21). After interchanging the order of the summations we get the following systems of equations:

Σ_{h=0}^{T−j−p−1} s_h(λ)γ̂_{h+j}(λ) = 0,   1 ≤ j ≤ p,   (5.1)

Σ_{h=0}^{T−j−p−1} t_h(λ)γ̂_{h+j}(λ) = 0,   1 ≤ j ≤ q,   (5.2)

Σ_{t=p+1}^{T} ψ(û_t(λ)) = 0,   (5.3)

where

γ̂_i(λ) = Σ_{t=p+1}^{T−i} ψ(û_{t+i}(λ))û_t(λ)   (5.4)

is, except for a factor of 1/(T − i − p − 1) (or perhaps 1/(T − p − 1)), an estimate of the autocovariance of the residuals û_t(λ) at lag i. The s_h and t_h are as in (2.9), except we now indicate their dependence on λ. For simplicity we shall refer to the γ̂_i(λ), without the needed factors, as residual autocovariance estimates.
The lack of resistance and robustness of the M-estimates is due to the lack of resistance and robustness of the residual autocovariance estimates γ̂_i(λ): even if ψ is bounded, each summand ψ(û_{t+i}(λ))û_t(λ) is an unbounded function of the observations, and so a small fraction of outliers can ruin the estimate. The residual autocovariance estimates (RA-estimates) proposed by Bustos and Yohai (1983) are obtained by replacing γ̂_i(λ) in (5.4) by a robust estimate of the residuals autocovariance; the form of the robust estimate is

γ̂_i(λ) = Σ_{t=p+1}^{T−i} η(û_t(λ)/σ̂, û_{t+i}(λ)/σ̂),   (5.5)

where η: R² → R, and σ may be estimated simultaneously by an equation of the form (2.24).
If η is odd in both variables and F symmetric, the resulting RA-estimate is Fisher consistent, since in that case

E_λ η(u_t/σ, u_{t+i}/σ) = 0.   (5.6)

There are two standard ways of choosing η:
(i) Mallows type: η(u, v) = ψ₁(u)ψ₂(v);
(ii) Hampel type: η(u, v) = ψ₃(uv);
where ψ₁, ψ₂ and ψ₃ are typically bounded and continuous, odd functions, e.g. they may be monotone functions such as those of the Huber family given by (2.22), or redescending functions such as those of the bisquare family given by (2.23).
Let λ̂_R be the RA-estimate solution of (5.1)-(5.3), with the γ̂_i given by (5.5). Then under general regularity conditions on η and F it can be shown that

√T(λ̂_R − λ) →_d N(0, D*),

with

D* = ( a*(η, F)C⁻¹(φ, θ)                   0
       0                 a(ψ, F)var(F)H²(φ, θ) ),   (5.7)

where C⁻¹(φ, θ) is the asymptotic covariance matrix of (φ̂_LS, θ̂_LS), with C(φ, θ) given by (2.8), var(F)H²(φ, θ) is the asymptotic variance of μ̂_LS, with H(φ, θ) given by (2.10), the form of a(ψ, F) is given by (2.14), and

a*(η, F) = σ²E[η²(u_t/σ, u_{t−1}/σ)] / {E[η₁(u_t/σ, u_{t−1}/σ)u_{t−1}]}²,   (5.8)

where η₁(u, v) = ∂η(u, v)/∂u. A proof of these results may be found in Bustos, Fraiman and Yohai (1984).
It follows that the efficiency of (φ̂_R, θ̂_R) with respect to (φ̂_LS, θ̂_LS) is given by the inverse of a*(η, F), independently of the ARMA model parameters. Similarly, the relative efficiency of μ̂_R with respect to μ̂_LS is given by the inverse of a(ψ, F), as in the case of location M-estimates.
One of the advantages of the RA-estimates with respect to the GM-estimates is the relative simplicity of their covariance matrix. The structure of D* in (5.7) is much simpler than that of D_w in (4.7). Furthermore, it is clear from (5.7) that the choice of tuning constant in the function ψ to obtain a given efficiency (say 0.95) is independent of the ARMA model parameters. This is not the case in choosing tuning constants for the GM-estimates, since the expression (4.7) for the asymptotic covariance involves the ARMA model parameters in a complicated way.

5.2. Computing RA-estimates

In the case of the Mallows type RA-estimates with η(u, v) = ψ(u)ψ(v), there is a heuristically appealing iterative numerical algorithm for computing RA-estimates (a sketch of one pass in code follows step (v) below). Suppose that in the ith iteration we have estimates λ̂^{(i)}, σ̂^{(i)}; then the estimate for the (i + 1)th iteration is obtained as follows:
(i) Compute the residuals û_t(λ̂^{(i)}), p + 1 ≤ t ≤ T.
(ii) Modify the residuals by applying the ψ-function,

û*_t = σ̂^{(i)}ψ(û_t(λ̂^{(i)})/σ̂^{(i)}).   (5.9)

If ψ is in the Huber family, û*_t may be considered a Winsorized residual.
(iii) Construct a new 'pseudo-observations' process y*_t, using λ̂^{(i)} and the modified residuals û*_t, by putting

y*_t = μ̂ + [φ̂^{(i)}(B)]⁻¹θ̂^{(i)}(B)û*_t.   (5.10)

(iv) Compute λ̂^{(i+1)} as the LSE of λ for the pseudo-series y*_t.
(v) Compute σ̂^{(i+1)} as the square root of the usual residuals-variance estimate associated with the LSE in (iv).
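The following Python sketch carries out one pass of steps (i)-(v) for an ARMA(1, 1) with known mean μ = 0, using the Huber ψ; the use of scipy's lfilter and least_squares, and the conditional (zero-initialized) residuals, are our own implementation assumptions:

import numpy as np
from scipy.signal import lfilter
from scipy.optimize import least_squares

def resid(lam, y):
    # conditional residuals u_t(lambda) = [theta(B)]^{-1} phi(B) y_t
    phi, theta = lam
    return lfilter([1.0, -phi], [1.0, -theta], y)

def ra_step(lam, sigma, y, c=1.345):
    u = resid(lam, y)                                   # (i) residuals
    ustar = sigma * np.clip(u / sigma, -c, c)           # (ii) Winsorize via Huber psi
    phi, theta = lam
    ystar = lfilter([1.0, -theta], [1.0, -phi], ustar)  # (iii) pseudo-observations
    fit = least_squares(resid, lam, args=(ystar,))      # (iv) LSE on pseudo-series
    sigma_new = np.std(resid(fit.x, ystar))             # (v) new residual scale
    return fit.x, sigma_new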

5.3. Qualitative robustness of RA-estimates

For an AR(p) model, and under suitable regularity conditions, the RA-estimates are qualitatively robust if η is bounded. The situation is completely different for ARMA(p, q) models with q > 0. In this case both GM-estimates and RA-estimates are neither resistant nor qualitatively robust.
For example, let us consider the MA(1) model y_t = u_t − θu_{t−1}. The estimated residuals û_t(θ) are given by

û_t(θ) = y_t + θy_{t−1} + ⋯ + θ^{t−1}y₁,   (5.11)

and a single outlier y_t at time t has influence on all û_{t′}(θ) with t′ ≥ t. Thus, a small fraction of outliers may have a large effect on a large fraction of residuals. Just one large outlier in the first observation may have a large effect on all the observations. Therefore, since GM- and RA-estimates depend on the residuals û_t, they cannot be qualitatively robust. However, GM- and RA-estimates are less sensitive to outliers than LS- and M-estimates. A Monte Carlo study (see Bustos and Yohai, 1983) shows that for the MA(1) model with additive outliers, the RA-estimates of the Mallows and Hampel type are more robust than LS- or M-estimates. This is especially true when ψ is taken in the bisquare family given by (2.23). More theoretical support of the behavior of GM- or RA-estimates for the AR(1) and MA(1) models, using a proper definition of influence function for time series, may be found in Martin and Yohai (1984). The idea is briefly described in Section 7.
In the next subsection we present another class of estimates which are qualitatively robust for ARMA(p, q) models with q > 0.

5.4. Truncated RA-estimates

As we have seen in the preceding subsection, the failure of resistance and robustness for the RA-estimates of the MA(1) model is due to the fact that the residuals û_t(θ) given by (5.11) depend upon all the present and past data. By the same type of reasoning, RA-estimates lack robustness for any genuine ARMA model (i.e. one with a moving-average component).
In order to robustify these estimates we introduce the truncated residuals of order k. In the MA(1) case, these are

û_{t,k}(θ) = y_t + θy_{t−1} + ⋯ + θ^k y_{t−k}.

It is easy to see that if θ₀ is the true parameter, then

û_{t,k}(θ₀) = u_t − θ₀^{k+1}u_{t−k−1}.

Therefore, if F is symmetric and η(u, v) odd in each variable, we have

E_{θ₀}η(û_{t+j,k}(θ₀), û_{t,k}(θ₀)) = 0,   ∀j ≥ 1, j ≠ k + 1.   (5.12)


Recall that an RA-estimate for the MA(1) model with mean zero is obtained as a solution of

Σ_{j=1}^{T−1} θ^{j−1}γ̂_j(θ) = 0,   (5.13)

where γ̂_j is defined in (5.5). Define γ̂_{j,k}(θ) by

γ̂_{j,k}(θ) = Σ_{t=1}^{T−j} η(û_{t,k}(θ)/σ̂, û_{t+j,k}(θ)/σ̂).

Then the k-TRA-estimates, introduced by Bustos and Yohai (1983), are defined by replacing γ̂_j(θ) by γ̂_{j,k}(θ) for j ≠ k + 1, and γ̂_{k+1}(θ) by γ̂_{k+1,k−1}(θ), in (5.13). Equation (5.12) implies that if η is odd in each variable and F symmetric, the TRA-estimates are Fisher consistent.
The extension of the TRA-estimates for any A R M A model may be found in
Bustos and Yohai (1983). The k-TRA-estimates are asymptotically normal, but
their asymptotic covariance matrix expression is quite complicated, and can be
found in Bustos and Yohai (1983) and in Bustos, Fraiman and Yohai (1984).
Since the residuals in a TRA-estimate depend on only a finite number of
observations, a sufficient condition for resistance and qualitative robustness of
the TRA-estimate, under general regularity conditions, is that "0 be bounded.
As k increases, the corresponding TRA-estimate becomes more efficient under
the nominal Gaussian model without outliers, but it becomes less robust with
regard to bias and variability under a general contamination model of the type
(3.1). Of course, in large samples the former is dominant and so we often focus
on bias robustness. Therefore, the choice of k will depend on a trade-off
between efficiency under the model, and bias robustness under a general
contamination model. Monte Carlo results studying the performance trade-otis
of the TRA-estimates may be found in Bustos and Yohai (1983).

6. Approximate maximum-likelihood type estimates

One of several things learned from Huber's (1964) early work on robust estimation of location was that robust estimates can be obtained using maximum-likelihood estimates for suitably heavy-tailed distributions. Some caveats are in order here, e.g. densities whose MLE's for location are robust do not always produce robust estimates of scale, and we do not yet have an MLE rationale for the bounded-influence regression estimates studied by Krasker and Welsch (1982) and Huber (1983). Nonetheless, the non-Gaussian MLE rationale sometimes provides a convenient way of uncovering and understanding the structure of robust estimates.
We have already seen in Subsections 2.4 and 3.1 that while 'simple' M-estimates can provide efficiency robustness for perfectly observed ARMA models, they are not resistant or robust toward general contamination models of the type (3.1). In this section we describe a class of estimates of ARMA model parameters which are motivated by maximum-likelihood estimates for the additive outliers type of contamination model described in Subsection 3.1, and which are resistant and robust. We call these estimates approximate maximum-likelihood type estimates (AM-estimates) because of approximations involving the non-Gaussian MLE.

6.1. Definition of AM-estimates

As before, let the parameter vector α′ = (φ′, θ′, σ²) represent the parameters of the x_t process in the AO model,

y_t = x_t + v_t,   (6.1)

where x_t and v_t are assumed to be independent, and the v_t are i.i.d. with zero mean. Throughout, we shall presume that the y_t in (6.1) have mean μ = 0. When μ is unknown, it may be estimated robustly and the estimate μ̂ can be used to form centered observations. When μ̂ is consistent, estimators based on the centered data typically behave asymptotically as if μ were known and exactly centered observations were used. Alternatively, an intercept term can be included in some of the equations to follow.
The log likelihood for this model is

log h(y^T | α) = Σ_{t=1}^{T} log h(y_t | y^{t−1}, α),   (6.2)

where y^t = (y₁, y₂, …, y_t)′ is the vector of observations up to and including observation y_t. The observation-prediction density h(y_t | y^{t−1}, α) is the conditional density of the observation y_t given y^{t−1}, and h(y₁ | y⁰, α) denotes the unconditional density h(y₁ | α).
Since x_t and v_t are by assumption independent, we can write

h(y_t | y^{t−1}, α) = ∫ f_x(y_t − ξ | y^{t−1}, α) dF_v(ξ),   (6.3)

where F_v is the distribution function of the measurement error v_t and f_x is the conditional prediction density function of x_t given y^{t−1}. We shall refer to this density as the state-prediction density.
Let

x̂_t^{t−1} = E(x_t | y^{t−1})   (6.4)

and

m_t = E[(x_t − x̂_t^{t−1})² | y^{t−1}]   (6.5)

denote the conditional-mean predictor of x_t given y^{t−1}, and the conditional mean-square error of prediction, respectively. Because of the assumptions concerning (6.1), we also have

x̂_t^{t−1} = ŷ_t^{t−1} = E(y_t | y^{t−1}),   (6.6)

where ŷ_t^{t−1} is the conditional-mean predictor of y_t given y^{t−1}. Because of (6.6) we shall use x̂_t^{t−1} and ŷ_t^{t−1} interchangeably.
Since we cannot actually compute the exact conditional means x̂_t^{t−1} = ŷ_t^{t−1}, we shall only require that the x̂_t^{t−1} or ŷ_t^{t−1} appearing in the remainder of the discussion have the same structure as the approximate conditional-mean estimates described in the last part of Subsection 6.3. We make an important simplifying assumption that f_x may be well approximated by the form

f_x(x_t | y^{t−1}, α) ≈ (1/√m_t) f((x_t − x̂_t^{t−1})/√m_t),   (6.7)

for some fixed density f which is independent of the parameters α (for t = 1, the expectations are taken to be unconditional). Of course, m_t = m_t(α) and x̂_t^{t−1} = x̂_t^{t−1}(α).
Now, using (6.7) we can rewrite (6.3) as

h(y_t | y^{t−1}, α) = g_t(u_t),   (6.8)

where u_t = y_t − x̂_t^{t−1} and the subscript t on the function g_t indicates the dependence of g_t on y^{t−1}.
In practice, we very rarely know the noise distribution F_v in the tails with high accuracy. For the contaminated normal (CN) noise distribution

F_v = (1 − γ)N(0, σ₀²) + γN(0, σ₁²),   (6.9)

with σ₀² ≪ σ₁² and small γ > 0, Martin (1979) gave some motivation for approximating g_t by setting

g_t(u_t) = (1/s_t)g(u_t/s_t),   (6.10)

where s_t is defined below, and the density g is obtained by convolution,

g = f * F_v.   (6.11)

Although the functional forms assumed in (6.7) and (6.10) are not good approximations for general non-Gaussian F_v, we believe that the use of these forms when F_v is nearly normal involves an approximation error that is small enough to be relatively inconsequential.

The scale measure s_t in (6.10) represents the scale of the y-prediction residuals u_t = y_t − ŷ_t^{t−1}. Since the x-prediction residuals x_t − x̂_t^{t−1} have as scale measure the quantity √m_t, and since y_t = x_t + v_t, with v_t independent of x_t, it is reasonable to let

s_t = √(m_t + σ₀²)   (6.12)

when F_v at most deviates from a nominal N(0, σ₀²) distribution primarily in the tails, e.g. as in (6.9). Of course, when the errors v_t are zero most of the time, so that P(v_t = 0) = 1 − γ by virtue of having σ₀² = 0 in (6.9), with γ not too large, then we have

s_t = √m_t.   (6.13)

Using (6.10) and (6.11) we can rewrite (6.2) as

log h(y^T | α) = −Σ_{t=1}^{T} log s_t + Σ_{t=1}^{T} log g(u_t/s_t).   (6.14)

Now, it seems natural by analogy with Huber's (1964, 1981) M-estimates (maximum-likelihood type estimates) to replace −log g with a properly chosen symmetric function ρ. Thus, we propose to define approximate maximum-likelihood estimates (AM-estimates) as the value α̂ that minimizes the following robustified loss function:

L(α) = Σ_{t=1}^{T} log s_t(α) + Σ_{t=1}^{T} ρ(u_t(α)/s_t(α)),   (6.15)

with the residuals u_t = u_t(α) and scale values s_t = s_t(α) obtained from the approximate conditional-mean type filter cleaners described in Subsection 6.3. The parameter vector α is included in (6.15) to indicate explicitly the dependence of s_t(α) and u_t(α) on the parameter vector α′ = (φ′, θ′, σ²).
If ρ(t) = −log g(t) and the density g is normal, then minimization of L(α) yields the Gaussian maximum-likelihood estimate.
The choice of the function ρ is guided by the same qualitative robustness considerations as for Huber M-estimates for location and regression (see, for example, Huber, 1981; Hampel, 1974), and, for the ARMA model M-estimates of Section 2: ρ should have a bounded and continuous derivative ψ = ρ′.
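Given the residuals u_t and scales s_t produced by such a filter cleaner for a candidate α, the loss (6.15) is straightforward to evaluate; a sketch in Python with the Huber ρ, one admissible choice since its derivative ψ is bounded and continuous:

import numpy as np

def rho_huber(r, c=1.345):
    # quadratic center, linear tails; psi = rho' is bounded and continuous
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def am_loss(u, s, c=1.345):
    # robustified loss L(alpha) of (6.15) for given residuals u_t and scales s_t
    u, s = np.asarray(u, dtype=float), np.asarray(s, dtype=float)
    return np.sum(np.log(s)) + np.sum(rho_huber(u / s, c))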

6.2. State-variable representation of the ARMA model

To determine the parameter estimates which minimize the loss function L(α) defined by (6.15), we need to express ŷ_t^{t−1} and s_t as functions of φ, θ, σ², and y^{t−1}. In doing so it is convenient to write the AO ARMA model for the x_t in the state-variable form

X_t = ΦX_{t−1} + r u_t,   (6.16)

y_t = x_t + v_t,   (6.17)

where x_t is the first element of X_t and

Φ = ( φ_{(k−1)}   I_{(k−1)} )
    ( φ_k         0′        ),   (6.18)

where φ_{(k−1)} = (φ₁, …, φ_{k−1})′, I_{(k−1)} is a (k − 1) × (k − 1) identity matrix, and 0 is a (k − 1) column vector of zeros. The dimensionality of the square Φ matrix is k = max{p, q + 1}. If q ≥ p, the first column of Φ contains the autoregressive parameters φ₁, φ₂, …, φ_k, but with φ_i = 0 for i > p. Corresponding to this choice of Φ, the vector r in (6.16) is a k × 1 column vector defined as (1, −θ₁, −θ₂, …, −θ_{k−1})′, with θ_i = 0 for i > q in case p > q. For details, see Appendix A of Martin, Samarov and Vandaele (1983).
This state-variable representation is not unique. See Akaike (1974) for another possibility.
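A sketch in Python of the construction of Φ and r from (6.16) and (6.18), under the conventions just described:

import numpy as np

def arma_state_matrices(phi, theta):
    # Phi (k x k) and r (k x 1) for an ARMA(p, q), with k = max(p, q + 1)
    p, q = len(phi), len(theta)
    k = max(p, q + 1)
    Phi = np.zeros((k, k))
    Phi[:p, 0] = phi                    # AR coefficients down the first column
    Phi[:k - 1, 1:] = np.eye(k - 1)     # identity block, last row left as zeros
    r = np.zeros(k)
    r[0] = 1.0
    r[1:q + 1] = -np.asarray(theta)     # r = (1, -theta_1, ..., -theta_{k-1})'
    return Phi, r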

6.3. Robust filter cleaners

We now describe a class of robust filter cleaners which are used to obtain the one-step-ahead predictions ŷ_t^{t−1} = x̂_t^{t−1}, and thereby compute the prediction residuals u_t = y_t − ŷ_t^{t−1} appearing in the loss function (6.15). These filter cleaners are sometimes called approximate conditional-mean type (ACM) filter cleaners because of an approximate optimality result described at the end of this section. Here the term filter refers to an estimate x̂_t of x_t which is based on the present and past data y^t = (y₁, …, y_t)′. A smoother is an estimate x̂_t^T of x_t based on all the observed data y^T = (y₁, y₂, …, y_T)′. We discuss smoother cleaners in Subsection 6.5.
Under conditions to be described subsequently, the ŷ_t^{t−1} are approximate conditional-mean estimates for the non-Gaussian AO model, and it is in this case that (6.14) will be an approximation to the log-likelihood function (6.2) (the various approximations involved here seem difficult to avoid in non-Gaussian AO models). However, we shall not generally require that the conditions alluded to be in force, since good filter cleaners and associated parameter estimates α̂ can be obtained without such a requirement.
The filter cleaner computes robust estimates X̂_t of the vector X_t according to the following recursion:

X̂_t = ΦX̂_{t−1} + p_t ψ(u_t/s_t),   (6.19)
where p_t = m_t/s_t, with m_t being the first column of the k × k matrix M_t, which is computed recursively as

M_{t+1} = ΦP_tΦ′ + Q,   (6.20)

P_t = M_t − w(u_t/s_t)(m_t m_t′)/s_t².   (6.21)

The ψ is a robustifying psi-function, Q = σ²rr′, and w is a weight function described in (6.27) below. The time-varying scale s_t is defined by

s_t² = m_{11,t},   (6.22)

where m_{11,t} is the 1-1 element of M_t. The robust one-step-ahead predictors of y_t and x_t are

ŷ_t^{t−1} = x̂_t^{t−1} = (ΦX̂_{t−1})₁,   (6.23)

and the cleaned data at time t is

x̂_t = (X̂_t)₁.   (6.24)

With the scaling (6.22), we will have x̂_t = y_t a large fraction of the time when there are rather few outliers in the series. This is why we use the term filter cleaner.
Before proceeding, note that when ψ is the identity function, w is identically 1, and (6.22) is replaced by s_t² = m_{11,t} + σ₀² with σ₀² = var v_t in the additive-noise model, the above recursions are those of the Kalman filter. Correspondingly, M_t and P_t are the prediction and filtering error-covariance matrices. See, for example, Kalman (1960), Jazwinski (1970), Meditch (1969). Unfortunately, the Kalman filter is not robust; a single outlying observation y_t can spoil not only x̂_t, but also x̂_u, u > t. Use of a robust version is imperative in many situations. Our use of σ₀² = 0 in (6.22) corresponds to the assumption that v_t = 0 a large fraction of the time, e.g. as when a contaminated normal distribution with degenerate central component, i.e. σ₀² = 0 in (6.9), and γ small, provides a reasonable model for F_v.
The weight function w should have the same qualitative properties as a good robustifying ψ-function, namely: boundedness, continuity and perhaps compact support.
A common compact support for ψ and w results in the following desirable behavior of the filter cleaner: if an observation y_t deviates from its prediction ŷ_t^{t−1} by a sufficiently large amount, then x̂_t will be the pure prediction X̂_t = ΦX̂_{t−1}, and the filtering-error covariance is set equal to the one-step prediction-error covariance P_t = M_t.
The latter idea has often been implemented as a so-called hard-rejection rule: set X̂_t = ΦX̂_{t−1} and P_t = M_t if |u_t| > cs_t, replacing (6.22) by s_t² = m_{11,t} + σ₀² in the general noise case where there is a nonzero additive Gaussian noise component. Typically, c = 3 has been used according to a time-honored habit, and the procedure accordingly is termed a 3-sigma-edit rule. This corresponds to the choices

ψ_HR(t) = t if |t| < c,  0 if |t| ≥ c,   (6.25)

w_HR(t) = 1 if |t| < c,  0 if |t| ≥ c.   (6.26)

Our filter cleaners would differ from this simple rule by imposing continuity, as well as boundedness and compact support. The ψ and w functions should return smoothly to zero. One reasonable way to accomplish this is to impose continuity on ψ, and take w as

w(t) = ψ(t)/t.   (6.27)

The two-part redescending ψ-function

ψ_HA(t) = t                      if |t| ≤ a,
        = (a/(b − a))(b − t)     if a < t ≤ b,
        = −(a/(b − a))(b + t)    if −b ≤ t < −a,
        = 0                      if |t| > b,   (6.28)

along with w_HA(t) = ψ_HA(t)/t, appears to work well in practice (see, for example, Martin and Thomson, 1982; Martin, Samarov and Vandaele, 1983).
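For a pure AR(1) the recursions (6.19)-(6.24) collapse to scalars; the following Python sketch implements them with ψ_HA and w = ψ_HA(t)/t (the values of a and b, and the stationary initialization, are illustrative assumptions):

import numpy as np

def psi_ha(t, a=2.0, b=3.0):
    # two-part redescending psi-function (6.28)
    s, at = np.sign(t), np.abs(t)
    if at <= a:
        return t
    if at <= b:
        return s * a * (b - at) / (b - a)
    return 0.0

def acm_filter_ar1(y, phi, sigma2, a=2.0, b=3.0):
    # ACM filter cleaner for x_t = phi x_{t-1} + u_t observed with additive outliers
    xhat = np.empty(len(y))
    M = sigma2 / (1.0 - phi ** 2)                   # start M_1 at the stationary variance
    xprev = 0.0
    for t, yt in enumerate(y):
        ypred = phi * xprev                         # (6.23) one-step prediction
        s = np.sqrt(M)                              # (6.22) s_t^2 = m_{11,t}
        r = (yt - ypred) / s
        xhat[t] = ypred + s * psi_ha(r, a, b)       # (6.19), scalar case
        w = 1.0 if r == 0 else psi_ha(r, a, b) / r  # (6.27) weight
        P = M * (1.0 - w)                           # (6.21): m_t m_t'/s_t^2 = M here
        M = phi ** 2 * P + sigma2                   # (6.20) with Q = sigma^2
        xprev = xhat[t]
    return xhat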
It should be noted that the data-dependent scale s_t has the property that s_{t+1} > s_t if a gross outlier occurs at time t. This is a particularly important feature when using a redescending psi-function such as ψ_HA. For if a fixed scale s were used, the filter could lose track of the data, never to regain it for the duration of a fixed-length data set. Although this is also possible when using s_t, the properties of s_t suggest that the filter will tend to regain track more quickly than when a fixed scale is used. This aspect of the filter cleaner is illustrated in Section VII of Martin and Thomson (1982).

An approximate optimality result

There is an approximate non-Gaussian optimality result for the filter described above if we replace (6.22) by

s_t² = m_{11,t} + σ₀²   (6.29)

and (6.27) by

w(t) = ψ′(t) = (d/dt)ψ(t).   (6.30)

Namely, under a key assumption that the state-prediction density f_x(x_t | y^{t−1}) is Gaussian, and that ψ(t) = −(∂/∂y_t) log f(y_t | y^{t−1}), where f(y_t | y^{t−1}) is the observation-prediction density, the filter is the conditional-mean filter (Masreliez, 1975).
It should be pointed out that the key assumption just cited will never hold exactly under an AO model where v_t is non-Gaussian. However, there is some evidence that f_x(x_t | y^{t−1}) is nearly Gaussian and that the filter is a good approximation to the exact conditional-mean filter (Martin, 1979; Masreliez's Monte Carlo, 1975). Hence, the filter is referred to as an approximate conditional-mean (ACM) filter. Some clarifying details may be found in Martin (1979); the figures therein suggest that the use of ψ_HA is reasonable when the observation noise v_t has a contaminated-normal distribution. However, the w given by (6.30) is discontinuous when ψ = ψ_HA, and we prefer w specified by (6.27).
Further study is needed on the goodness of approximation of such ACM filters.

6.4. Iterative computation of AM-estimates

Differentiation of L(α) = L(φ, θ, σ²) in (6.15) with respect to the parameters (φ, θ) yields the following equation for the full maximum-likelihood-type estimates:

Σ_{t=1}^{T} ψ(u_t/√m_t)(1/√m_t) ∂u_t/∂(φ, θ)
   + (1/2) Σ_{t=1}^{T} (1/m_t)[1 − (u_t/√m_t)ψ(u_t/√m_t)] ∂m_t/∂(φ, θ) = 0,   (6.31)

where m_t = s_t² and ψ(r) = ρ′(r). We now assume that the first term in (6.31) dominates the second one. This will be the case, for example, if ∂m_t/∂(φ, θ) = 0 most of the time, which will be true if m_t = σ² most of the time. In turn, this will be the case for data sets for which a well-designed filter cleaner results in x̂_t = y_t most of the time. See Martin (1981) for a similar approximation. Under this assumption, (6.31) becomes

Σ_{t=1}^{T} (1/√m_t) ψ(u_t/√m_t) ∂ŷ_t^{t−1}/∂(φ, θ) = 0.   (6.32)

Notice that (6.32) has the same form as Huber's robust nonlinear regression equation (see Huber, 1981, pp. 179 ff) but with a time-varying scale. It is natural to solve (6.32) using an iterative algorithm which is a modification of the so-called H-algorithm due to Huber (1977, p. 39).
Here are the steps in such an algorithm:
(1) Compute initial estimates of φ, θ, and σ; one possible method of doing this robustly is discussed in Section 7 of Martin, Samarov and Vandaele (1983).
(2) Use a robust filter cleaner to obtain the 'cleaned' observations x̂_t^{(j)}, the predicted values ŷ_t^{t−1(j)}, and scales s_t^{(j)} = √(m_t^{(j)}), based on the current parameter estimate α̂^{(j)} = (φ̂^{(j)}, θ̂^{(j)}, σ̂²^{(j)}).
(3) Use the x̂_t^{(j)} as observations and the α̂^{(j)} as initial values in a standard ARMA nonlinear least-squares estimation procedure, and compute new estimates (φ̂^{(j+1)}, θ̂^{(j+1)}). The first component of the vector recursion (6.19) may be written as

x̂_t = ŷ_t^{t−1} + (m_{11,t}/s_t)ψ(u_t/s_t).

Thus we have

x̂^{(j)} = ŷ^{(j)} + d^{(j)},

where ŷ^{(j)} is the vector of one-step predictions ŷ_t^{t−1(j)}, d^{(j)} = {d_t^{(j)}} is a T-component vector with

d_t^{(j)} = √(m_t^{(j)}) ψ(u_t^{(j)}/√(m_t^{(j)})),   t = 1, …, T,

and C = C^{(j)} is a T × (p + q) matrix with elements

C_{tk}^{(j)} = ∂ŷ_t^{t−1(j)}/∂φ_k,       k = 1, …, p,   t = 1, …, T,
C_{tk}^{(j)} = ∂ŷ_t^{t−1(j)}/∂θ_{k−p},   k = p + 1, …, p + q.

(4) Next calculate σ̂^{(j+1)} as the standard deviation of the least-squares residuals

r_t^{(j+1)} = x̂_t^{(j+1)} − ŷ_t^{t−1(j+1)}.

(5) Iterate (2)-(4) until the estimates do not change by more than a pre-specified tolerance.
For examples of application of the above algorithm to real time-series data sets, see Martin, Samarov and Vandaele (1983).

6.5. Use of smoother cleaners

A smoother for the AO model (6.1) is an estimate x̂_t^T of x_t based on all the data y^T = (y₁, y₂, …, y_T). A smoother cleaner x̂_t^T is similar to a filter cleaner in that for data sets with small to moderate fractions of outliers, we will have x̂_t^T = y_t for all but a small to moderate fraction of the data (i.e. for the 'good' observations the smoother cleaner will not alter the data). However, whereas the filter cleaner will use a pure prediction (based on all the past cleaned data) at a gross outlier position, the smoother cleaner will use a pure interpolation (based on all the past and future data) at a gross outlier position. A class of smoother-cleaner algorithms is proposed in Martin (1979), where an approximate conditional-mean optimality result is established.
On intuitive grounds it would appear that better parameter estimates could be obtained using smoother cleaners instead of filter cleaners in an appropriate iterative algorithm. Some evidence of this shows in some examples in Martin and Thomson (1982), and also in some exploratory Monte Carlo experiments. Further study of this issue is needed.
Also, it should be noted that the approximate maximum-likelihood rationale described in the preceding subsections does not apply when one uses smoother cleaners in place of filter cleaners. As a point of information, the likelihood structure for Gaussian models of the AO form (6.1) is much more complicated when expressed in terms of (exact) conditional-mean smoothers than when expressed in terms of the conditional-mean (Kalman) filter (see, for example, Schweppe, 1973).

6.6. Robustness and lack of Fisher consistency

Robustness
The algorithm for computing AM-estimates described in Subsection 6.4 is very close to a simpler and more transparent scheme whereby one iterates between the filter cleaner and least-squares estimates of α.
In terms of the latter algorithm, it is easy to see at a heuristic level that the AM-estimates are resistant. The reason is that (i) the action of the filter cleaner is to replace a small to moderate fraction of gross outliers by one-sided interpolates based on 'good' data, at least when ψ is redescending, and (ii) the use of a continuous (as well as bounded) ψ and w means that small rounding errors will not result in large changes in the filter-cleaned values x̂_t. A careful proof of qualitative robustness remains to be given.

Lack of Fisher consistency
Whereas M-estimates, GM-estimates, RA-estimates and TRA-estimates are Fisher consistent for a Gaussian ARMA model, this is not the case with AM-estimates. It is easy to see why this is the case, at least for AM-estimates computed iteratively by applying least squares to the filter-cleaned data. For suppose the filter cleaner is based on the true parameters φ, θ, and the y_t are realizations of a Gaussian ARMA process (i.e. z_t ≡ 0 and x_t is Gaussian in (3.1)). Then the least-squares estimates will be computed from x̂_t values for which x̂_t = y_t for most t, but not all t. It will be the case that P(x̂_t ≠ y_t for some t) is positive, and this will result in an asymptotic bias. It is expected that the asymptotic bias will usually be quite small (see the discussion in Section VI-E of Martin and Thomson, 1982), but it is not zero.
A major open problem is that of modifying the AM-estimates so that they are Fisher consistent as well as robust.

7. Breakdown points and influence curves for time series

In addition to the basic notions of robustness described in Section 1, there are two important auxiliary concepts which have received considerable attention, namely the breakdown point and the influence curve. Both of these concepts are due to Hampel (1968, 1971, 1974), and until recently have been used almost exclusively in the i.i.d. setting.

7.1. Breakdown points

In heuristic terms the breakdown point of an estimator is the largest fraction of the data which can be moved to infinity without taking the value of the estimate to infinity, or to the boundary of the parameter space in case the latter is bounded, e.g. as when estimating the correlation coefficient. The sample mean has breakdown point zero, while the sample median has breakdown point approximately ½ for finite sample sizes, with the breakdown point tending to ½ as the sample size tends to infinity. The breakdown point is a global measure of performance of an estimator.
A careful mathematical definition of asymptotic breakdown point may be found in Hampel (1971). Huber (1981) gives a somewhat different asymptotic definition. The concept of finite-sample breakdown point was in essence first introduced for the special case of location estimates by Hodges (1967), who used the term tolerance instead of breakdown point. Recently, Donoho (1982) and Donoho and Huber (1983) have given a general definition of finite-sample breakdown points. They have presented interesting results on the finite-sample breakdown points of many kinds of estimates, particularly in the context of multivariate estimation of location and dispersion. For recent results on high breakdown point regression estimates, see Rousseeuw and Yohai (1984).
It is no small virtue that the finite-sample breakdown point is a transparent concept which may be easily communicated to scientists who are not highly trained in the mathematical aspects of statistics.
Although the basic idea of the breakdown point can be applied directly to time-series parameter estimates, relatively little has been done in the way of computing breakdown points in the time-series setting. First of all it should be noted that in the time-series setting, a fixed fraction of outliers will have a different impact on an estimate depending upon whether the outliers occur in isolation, or in patches. The reader can easily verify this claim by checking the

behavior of the lag-one correlation coefficient estimate

ρ̂ = Σ_{t=1}^{T−1} y_t y_{t+1} / Σ_{t=1}^{T} y_t²   (7.1)

when the data contain k outliers with common value ξ which gets arbitrarily large, in both isolated and contiguous (or patch) configurations of the k outliers. For the isolated configuration ρ̂ → 0 as ξ → ∞, while in the contiguous configuration ρ̂ → (k − 1)/k ≈ 1 for k large. Thus we should expect different breakdown points for time-series parameter estimates for different types of outlier configurations.
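The claim is easy to check numerically; a small Python sketch (the sample size, outlier value and placements are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=1000)
xi, k = 1.0e6, 10

iso, patch = y.copy(), y.copy()
iso[::100][:k] = xi          # k isolated outliers, 100 apart
patch[:k] = xi               # k contiguous (patch) outliers

def rho1(y):
    # lag-one correlation estimate (7.1)
    return np.sum(y[:-1] * y[1:]) / np.sum(y ** 2)

print(rho1(iso), rho1(patch))   # roughly 0 and (k - 1)/k = 0.9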
Some asymptotic breakdown point calculations for GM-estimates of the AR(1) parameter φ were carried out by Martin and Jong (1977). The Mallows type GM-estimates (4.3)-(4.4) were used, and both isolated and patch outlier configurations were used. This yielded a set of lower and upper breakdown points which correspond to the asymptotic value φ₀ of the estimate φ̂ being taken to 0 and 1, respectively.
In spite of the calculations cited, much more remains to be done with regard to breakdown points for time-series parameter estimates. For one thing, the definition used by Martin and Jong (1977) is not the only possibility, and it is not yet clear whether an alternative definition would be preferred. In any event, asymptotic breakdown points need to be computed for GM-, RA-, TRA- and AM-estimates (among others), at AR(p) and ARMA(p, q) models. Furthermore, it remains to be determined whether or not computable finite-sample breakdown points are a possibility for a variety of robust estimates of interest.

7.2. Influence curves

Suppose that T = T(F) is the functional representation of an estimate T_n, i.e. T_n = T(F_n), where F_n is the empirical distribution function for i.i.d. vector-valued observations y₁, …, y_n in R^p. Let F_γ = (1 − γ)F + γδ_ξ, where δ_ξ is the point mass located at ξ in R^p. Then Hampel's influence curve ICH(ξ) = ICH(ξ; T, F) is the directional derivative of T in the direction δ_ξ − F:

ICH(ξ) = lim_{γ→0} [T(F_γ) − T(F)]/γ.   (7.2)

In the linear regression model situation where y_i = x_i′β + ε_i, 1 ≤ i ≤ n, we would have y_i = (y_i, x_i) and ξ = (y, x) a point in R^{p+1}.
Several authors have suggested carrying over the definition ICH to the stationary time-series setting (Martin and Jong, 1977; Martin, 1980; Künsch, 1984). A natural way to do this would be to let {y_t} be a random process with stationary measure μ, and define a contamination process measure μ_γ = (1 − γ)μ + γδ_ξ, where now ξ = (ξ₁, ξ₀, ξ₋₁, …) is a contamination vector representing arbitrary contamination values ξ₁, ξ₀, ξ₋₁, … at times 1, 0, −1, …, and where δ_ξ is the point mass at ξ. We use only the semi-infinite contamination sequence because many estimators of interest in the time-series setting have functionals

T(μ) which are defined by a general M-estimate equation of the form

∫ ψ(y₁; T(μ)) dμ = 0,   (7.3)

where y₁ = (y₁, y₀, y₋₁, …). The corresponding estimating equation is

Σ_{t=1}^{T} ψ_t(y_t; T̂_T) = 0,

where ψ_t → ψ as t → ∞, i.e. the dependency of ψ_t on t vanishes asymptotically. For examples which clarify this point, see Martin and Yohai (1984). Then the time-series version of ICH is

ICH(ξ) = lim_{γ→0} [T(μ_γ) − T(μ)]/γ.   (7.4)

The trouble with this definition is that it is not very natural, from the following viewpoint among others. The contamination process measure μ_γ is a mixture which corresponds to obtaining a realization from the stationary measure μ with probability 1 − γ, and with probability γ obtaining a realization from the (nonstationary) process having marginal measure δ_ξ for (y₁, y₀, y₋₁, …). Such a mixture process does not correspond to any realistic contamination process occurring in practice! Further discussion of this point may be found in Martin and Yohai (1984), who propose a new definition of time-series influence curve IC as follows.
Let the 0-1 process in (3.1) satisfy P(z_t = 1) = γ + o(γ), let μ_x denote the measure for x_t, let μ_w denote the measure for the contaminating process w_t, and let μ_γ^y be the measure for y_t. We can get either isolated or patchy outliers depending upon how we specify the processes z_t and w_t. Assume that the estimate of interest is obtained from the functional T(μ_γ^y). Then the time-series influence curve IC(μ_w) = IC(μ_w; T, {μ_γ^y}) is the derivative at μ_x along the arc {μ_γ^y: 0 ≤ γ < 1} as γ → 0, and correspondingly μ_γ^y → μ_x:

IC(μ_w) = lim_{γ→0} [T(μ_γ^y) − T(μ_x)]/γ.   (7.5)

The argument of IC is the contamination measure μ_w, so in general IC is a curve on measure space. However, calculations of IC(μ_w) usually entail special forms for the contamination process w_t: in the case (Subsection 3.1(i)) of additive outliers, we let v_t ≡ ξ so that w_t = x_t + ξ, whence the additive outliers have constant amplitude ξ, and in the case (Subsection 3.1(ii)) of substitution outliers we let w_t ≡ ξ, so that the substitution outliers all have constant value ξ. When these special forms of w_t are used we replace the notation IC(μ_w) by IC(ξ), so IC is now a curve with domain the real line. This is in keeping with the spirit of ICH(ξ), whose argument is a fixed contamination value ξ in R^p, with p = 1 for

univariate problems (and we are dealing only with univariate time series in the present discussion).
Although the IC is similar in spirit to ICH, it coincides with ICH only in the special case where the estimate is permutation invariant and y_t is an i.i.d. substitution outliers model (Subsection 3.1(ii)), i.e. in the usual i.i.d. setup the two definitions coincide (Corollary 4.1 of Martin and Yohai, 1984). Although in general IC is different from ICH, there is a close relationship between the two which facilitates the calculation of IC. Namely, under regularity conditions

IC(μ_w) = lim_{γ→0} (1/γ) E_γ ICH(y₁),   (7.6)

where y₁ = (y₁, y₀, y₋₁, …) is governed by the measure μ_γ^y for the process y_t in (3.1), and E_γ denotes expectation with respect to μ_γ^y.
The above result is established in Martin and Yohai (1984), where several other results concerning IC's are presented: Conditions are established which aid in the computation of IC's and which ensure that an IC is bounded. IC's are computed for both least squares and a variety of robust estimates of first-order autoregressive and moving-average models. Distinctly different behaviors of the IC are exhibited for patchy versus isolated outliers. It is shown that bounded monotone ψ-functions do not yield bounded IC's for moving-average parameters, whereas redescending ψ-functions do yield bounded IC's. Finally, the IC is used to show that a class of generalized RA-estimates has a certain optimality property.

References

Akaike, H. (1974). Markovian representation of stochastic processes and its application to the
analysis of autoregressive moving average processes. Ann. Instit. Statist. Math. 26, 363-387.
Beaton, A.E. and Tukey, J.W. (1974). The fitting of power series, meaning polynomials, illustrated
on band spectroscopic data. Technometrics 16, 147-185.
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. Wiley, New York.
Boente, G., Fraiman, R. and Yohai, V. J. (1982). Qualitative robustness for general stochastic
processes. Technical Report No. 26. Department of Statistics. University of Washington, Seattle,
WA.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Bustos, O. H. (1981). Qualitative robustness for general processes. Informes de Matemática, Serie B-002/81. Instituto de Matemática Pura e Aplicada, Brazil.
Bustos, O. H. (1982). General M-estimates for contaminated p-th order autoregressive processes: consistency and asymptotic normality. Z. Wahrsch. Verw. Gebiete 59, 491-504.
Bustos, O. H. and Yohai, V. J. (1983). Robust estimates for ARMA models. Informes de Matemática, Serie B-12/83. Instituto de Matemática Pura e Aplicada, Brazil. To appear in J. Amer. Statist. Assoc.
Bustos, O., Fraiman, R. and Yohai, V. J. (1984). Asymptotics for RA-estimates of ARMA models. In: J. Franke, W. Härdle and D. Martin, eds., Robust and Nonlinear Time Series Analysis. Springer, Berlin.

Cook, D. and Weisberg, S. (1982) Residuals and Influence in Regression. Chapman and Hall,
New York.
Cox, D. (1981). Metrics on stochastic processes and qualitative robustness. Technical Report No. 3.
Department of Statistics, University of Washington, Seattle, WA.
Denby, L. and Mallows, C. L. (1977). Two diagnostic displays for robust regression analysis. Technometrics 19, 1-13.
Denby, L. and Martin, R. D. (1979). Robust estimation of the first-order autoregressive parameter. J. Amer. Statist. Assoc. 74, 140-146.
Donoho, D. L. (1982). Breakdown properties of multivariate location estimators. Unpublished manuscript. Harvard University Ph.D. qualifying paper.
Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In: P. J. Bickel, K. Doksum and J. L. Hodges, eds., Festschrift for Erich L. Lehmann. Wadsworth, Belmont, CA.
Hampel, F. R. (1968). Contributions to the theory of robust estimation. Ph.D. Thesis. University of
California, Berkeley, CA.
Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Stat. 42,
1887-1896.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69, 383-393.
Hampel, F. R. (1975). Beyond location parameters: Robust concepts and methods. Proc. 40th Session I.S.I., Warsaw 1975. Bull. Int. Statist. Inst. 46, Book 1, 375-382.
Hannan, E. J. and Kanter, M. (1977). Autoregressive processes with infinite variance. J. Appl. Probab. 14, 411-415.
Hodges, J. L. (1967). Efficiency in normal samples and tolerance of extreme values for some estimates
of location. Proc. Fifth Berkeley Symp. on Math. Statist. and Probab. Vol. 1, 163-186.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals Math. Statist. 35, 73-101.
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1,
799-821.
Huber, P. J. (1977). Robust Statistical Procedures. Regional Conference Series in Applied Mathematics No. 27. SIAM, Philadelphia, PA.
Huber, P. J. (1981). Robust Statistics. Wiley, New York.
Huber, P. J. (1983). Minimax aspects of bounded-influence regression. J. Amer. Statist. Assoc. 78,
66-80.
Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME Ser. D, J. Basic Eng. 82, 34-45.
Kleiner, B., Martin, R. D. and Thompson, D. J. (1979). Robust estimation of power spectra. J.
Roy. Statist. Soc. Ser. B 41, 313-351.
Krasker, W. S. (1980). Estimation in linear models with disparate points. Econometrica 48,
1833-1846.
Krasker, W. S. and Welsch, R. T. (1982). Efficient bounded-influence regression estimation. J,
Amer. Statist. Assoc. 77, 595-604.
Künsch, H. (1984). Infinitesimal robustness for autoregressive processes. Ann. Statist. 12, 843-863.
Lee, C. H. and Martin, R. D. (1982). M-estimates for A R M A processes. Technical Report No. 23.
Department of Statistics, University of Washington, Seattle, WA.
Mallows, C. L. (1976). On some topics in Robustness. Bell Labs. Tech. Memo, Murray Hill, N J.
Maronna, R., Bustos, O. H. and Yohai, V. J. (1979). Bias and efficiency robustness of general
M-estimates for regression with random carriers. In: T. Gasser and M. Rosenblatt, eds.,
Smoothing Techniques for Curve Estimation (Proceedings, Heidelberg, 1979). Springer, New
York.
Martin, R. D. (1979). Approximate conditional-mean type smoothers and interpolators. In: T.
Gasser and M. Rosenblatt, Smoothing Techniques for Curve Estimation, 117-143. Springer, Berlin.
Martin, R. D. (1980). Robust estimation in Autoregressive models. In: D. R. Brillinger and G. C.
Tiao, eds., Directions in Time Series, 228-254. Institute of Mathematical Statistics Publication,
Haywood, CA.

Martin, R. D. (1981). Robust methods for time series. In: D. F. Findley, ed., Applied Time Series H.
Academic Press, New York.
Martin, R. D. (1982). The Cramer-Rao bound and robust M-estimates for autoregressions.
Biometrika 69, 437-442.
Martin, R. D. and Jong, J. (1977). Asymptotic properties of robust generalized M-estimates for the
first-order autoregressive parameter. Bell Labs. Tech. Memo, Murray Hill, NJ.
Martin, R. D., Samarov, A. and Vandaele, W. (1983). Robust methods for A R I M A models. In: A.
Zellner, ed., Applied Time Series Analysis of Economic Data. Econ. Res. Report ER-5, Bureau of
the Census, Washington, DC.
Martin, R. D. and Thompson, D. J. (1982). Robust resistant spectrum estimation. IEEE Proceed-
ings 70(9), 1097-1115.
Martin, R. D. and Yohai, V. J. (1984). Influence function for time series. Tech. Report.
Department of Statistics, University of Washington, Seattle, WA.
Masreliez, C. J. (1975). Approximate non-Gaussian filtering with linear state and observation
relations. I E E E Trans. Automat. Control AC-20, 361-371.
Meditch, J. S. (1969). Stochastic Optimal Linear Estimation and Control. McGraw-Hill, New York.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading,
MA.
Papantoni-Kazakos, P. and Gray, R. M. (1979). Robustness of estimators on stationary observations. Ann. Probab. 7, 989-1002.
Rousseeuw, P. and Yohai, V. J. (1984). Robust regression by means of S-estimators. In: J. Franke, W. Härdle and D. Martin, eds., Robust and Nonlinear Time Series Analysis. Springer, Berlin.
Schweppe, F. C. (1973). Uncertain Dynamic Systems. Prentice-Hall, Englewood Cliffs, NJ.
Strassen, V. (1965). The existence of probability measures with given marginals. Ann. Math. Statist. 36, 423-439.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: I. Olkin, ed.,
Contributions to Probability and Statistics. Stanford University Press, Stanford, CA.
Tukey, J. W. (1976). Useable resistant/robust techniques of analysis. In: W. L. Nicholson and J. L. Harris, eds., Proc. First ERDA Statistics Symposium. Battelle Northwest Laboratories, Richland, WA.
Whittle, P. (1962). Gaussian estimation in stationary time series. Bull. Int. Statist. Inst. 39, 105-129.
Yohai, V. J. and Maronna, R. A. (1978). Asymptotic behavior of least squares estimates for
autoregressive processes with infinite variances. Ann. Statist. 5, 554-560.
E. J. Hannan, P. R. Krishnaiah and M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 157-177

Time Series Analysis with Unequally Spaced Data

Richard H. Jones

1. Introduction

Unequally spaced data can occur in two distinct ways. The data can be equally spaced with missing observations, or the data can be truly unequally spaced with no underlying sampling interval. For multivariate data, when several variables are recorded at each observation time, it is possible to have missing observations within the observation vector at a given time. In this case, the observation times may be equally or unequally spaced. The key to data analysis in these situations is to represent the structure of the process using a state-space representation. For Gaussian inputs and errors, this allows the calculation of the exact likelihood using the Kalman filter. Nonlinear optimization can then be used to obtain maximum likelihood estimates of the unknown parameters of the process.
These methods are easily extended to regression with stationary errors, including analysis of variance with serially correlated errors. Mixed models that include random effects fit naturally into the state-space formulation, since the random parameters can be included in the state vector and the variances estimated by maximum likelihood, even with unbalanced designs.

2. State space and the Kalman filter

Kalman (1960) developed an approach to filtering and prediction based on the concept of state and state transition. While the term prediction is clear to those working in the area of time series analysis, filtering simply means estimating the current state of a process given observations up to the present time. Here, the Kalman filter will be used as a recursive method of calculating −2 ln likelihood which easily handles missing or unequally spaced data. The concept of state is the least amount of data about the past and present of a process needed to predict the future. For a first-order autoregressive (AR(1)) process, this is simply the current observation. For autoregressive moving average (ARMA) processes, the state can be represented in many ways as a vector. This turns a univariate process into a vector Markov process involving a state transition matrix. It is this Markov property that allows −2 ln likelihood to be calculated recursively (Schweppe, 1965).
A discussion of the Kalman filter can be found in Gelb (1974). A general state-space model consists of two equations. The state equation defines the properties of the process in vector Markov form, and the observation equation defines what is actually observed. These equations are

X(t) = F(t; t − 1)X(t − 1) + G(t)u(t),   (2.1)
Y(t) = H(t)X(t) + v(t).

X(t) is an m by 1 column vector representing the state of the process at time t. F(t; t − 1) is an m by m state transition matrix defining how the process progresses from one time point to the next. u(t) is the random input to the state equation, sometimes referred to as the plant noise, which is a column vector of length m′, m′ ≤ m, assumed to have a multivariate normal distribution with zero mean vector and covariance matrix equal to the identity matrix. The u(t) are assumed to be independent at different times. G(t) is an m by m′ matrix defining how the random inputs are propagated into the state. H(t) is a d by m matrix defining linear combinations of the state that are observed at time t. Y(t) is a d by 1 vector of observations at time t, and v(t) is a d by 1 vector of random observational errors assumed to be normally distributed with zero mean vector and covariance matrix R(t). The v(t) are assumed to be independent at different times and independent of the random input u(t).
The Kalman filter produces estimates of the state vector, X(t) based on data
collected up to time t assuming that all the p a r a m e t e r s of the model are known,
i.e. F(t; t - 1), G(t), H(t) and R(t). The notation for this estimate is X(t] t),
and the estimate has covariance matrix

P(t [ t) = E{[X(t)- X(t [ t)l[X(t) - X(t I t)]'}, (2.2)

where ' denotes transpose. Similarly, X(t | t - 1) denotes the estimate of the state at time t given observations up to time t - 1, a one-step prediction, and its covariance matrix is denoted P(t | t - 1). To begin the recursion, it is necessary to specify an initial value of the state vector before the first observation is collected, X(0 | 0), and its covariance matrix P(0 | 0). The general step of the recursion starts with the information available at time t - 1, X(t - 1 | t - 1) and P(t - 1 | t - 1), and ends when this same information is available at time t. The recursion proceeds as follows (a minimal code sketch of one complete step is given after the list):

(1) Calculate a one-step prediction

X(t | t - 1) = F(t; t - 1)X(t - 1 | t - 1).

(2) Calculate the covariance matrix of this prediction

P(t | t - 1) = F(t; t - 1)P(t - 1 | t - 1)F'(t; t - 1) + G(t)G'(t).

(3) The prediction of the next observation vector is

Y(t | t - 1) = H(t)X(t | t - 1).

(4) The innovation vector is the difference between the observations and the
predicted observations

I(t) = Y(t) - Y(t | t - 1).

(5) The innovation covariance matrix is

V(t) = H(t)P(t | t - 1)H'(t) + R(t).

(6) The contribution to -2 ln likelihood for this step is

I'(t)V⁻¹(t)I(t) + ln |V(t)| ,

where | · | denotes the determinant of the matrix. The contribution for each step is summed over all steps.
(7) The Kalman gain matrix is

K(t) = P(t | t - 1)H'(t)V⁻¹(t).

(8) The update estimate of the state vector is

X(t | t) = X(t | t - 1) + K(t)I(t).
(9) Its covariance matrix is

P(t | t) = P(t | t - 1) - K(t)H(t)P(t | t - 1).
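
The following is a minimal sketch of one complete pass through steps (1)-(9), written for the time-invariant case; the function name and the use of Python with numpy are ours (the programs referenced in this chapter were in FORTRAN), so treat it as an illustration rather than the author's implementation.

```python
import numpy as np

def kalman_step(x, P, y, F, G, H, R):
    """One pass of steps (1)-(9); returns the updated state estimate,
    its covariance, and this step's contribution to -2 ln likelihood."""
    x_pred = F @ x                                   # (1) one-step prediction
    P_pred = F @ P @ F.T + G @ G.T                   # (2) its covariance
    y_pred = H @ x_pred                              # (3) predicted observation
    innov = y - y_pred                               # (4) innovation
    V = H @ P_pred @ H.T + R                         # (5) innovation covariance
    V_inv = np.linalg.inv(V)
    dev = innov @ V_inv @ innov + np.log(np.linalg.det(V))  # (6)
    K = P_pred @ H.T @ V_inv                         # (7) Kalman gain
    x_new = x_pred + K @ innov                       # (8) updated state
    P_new = P_pred - K @ H @ P_pred                  # (9) updated covariance
    return x_new, P_new, dev
```

Summing dev over all time points gives -2 ln likelihood for the whole series; for a missing observation the measurement update is skipped and (x_pred, P_pred, 0) is carried forward, as described below.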

For univariate time series, Y(t), the observation at time t, and R(t), the observational error variance, will be scalars. A process with time-invariant structure observed at equally spaced time intervals has parameters that do not depend on time, and the model can be written

X(t) = FX(t - 1) + Gu(t),
Y(t) = HX(t) + v(t),                                     (2.3)

with observational error covariance matrix R. A stationary time series has time-invariant structure, as do certain nonstationary processes such as autoregressive
integrated moving average (ARIMA) processes. ARIMA processes can be directly represented in the above form without differencing the data. Differencing can cause problems when there are missing observations. For unequally spaced time series, F(t; t - 1) and G(t) will usually depend on the length of the time step. For univariate time series at equal spacing with missing observations, when an observation is missing, the recursion skips steps (3)-(7), and the final two steps become simply

(8) Update estimate

X(t | t) = X(t | t - 1).                                 (2.4)

(9) Covariance matrix

P(t | t) = P(t | t - 1).                                 (2.5)

Note that these two equations require no calculation, since the values that are in memory are not changed. For multivariate time series with missing observations within the observation vector, it is only necessary to reduce the number of rows in the H(t) matrix to allow for these missing observations.
When the unknown model parameters have been estimated by maximum
likelihood, predictions can be made by running the recursion off the end of the
data using the missing data form of the recursion. Calculating V(t) from step
(5) of the recursion gives the variance or covariance matrix of the prediction.

3. A state-space representation for an ARMA(1, 1) process

There are several ways to represent an ARMA(1, 1) process in state-space form. One method that keeps all the equations in scalar form is to represent the process as a first-order autoregression with observational error,

x(t) = αx(t - 1) + σu(t),
y(t) = x(t) + v(t).                                      (3.1)

u(t) is assumed to have unit variance, and the observational error variance is R. The three parameters to be estimated by maximum likelihood are α, σ, and R. For any pass through the recursion, these parameters are assumed to be known and are varied by a nonlinear optimization routine between passes. In other words, one pass through the Kalman filter produces a value of -2 ln likelihood, which is one function evaluation for a nonlinear optimization routine. Multivariate extensions of this model are discussed in Jones (1984).
This special case has many practical applications. ARMA(1, 1) processes can be used to model serial correlation in many situations where data spans are not
too long and the process is not highly structured. In particular, in regression or analysis of variance, using an ARMA(1, 1) model for the error structure may be much better than the usual assumption of independent errors. Approximate modeling of serial correlation, when it exists, is better than not modeling it at all.
To begin the recursion, it is necessary to specify the initial parameters. Since x(t) is a zero mean AR(1) process, the variance of the process (lag zero covariance) is σ²/(1 - α²). For given values of the parameters, the initial conditions specify what is known before any data are collected,

x(0 | 0) = 0,   P(0 | 0) = σ²/(1 - α²).                  (3.2)

The recursion is now a special case of the general recursion given in the last section:

(1) Calculate a one-step prediction

x(t | t - 1) = αx(t - 1 | t - 1).

(2) Calculate its variance

P(t | t - 1) = P(t - 1 | t - 1)α² + σ².

(3) The prediction of the next observation is

y(t | t - 1) = x(t | t - 1).

(4) Calculate the innovation

I(t) = y(t) - y(t | t - 1).

(5) The innovation variance is

V(t) = P(t | t - 1) + R.

(6) The contribution to -2 ln likelihood is

I²(t)/V(t) + ln V(t).

(7) The Kalman gain is

K(t) = P(t | t - 1)/V(t).

(8) Update the estimate of the state

x(t | t) = x(t | t - 1) + K(t)I(t)
         = [Rx(t | t - 1) + P(t | t - 1)y(t)]/V(t).
(9) Update its variance

P(t | t) = P(t | t - 1) - K(t)P(t | t - 1)
         = RP(t | t - 1)/V(t).

The second form of this last equation is more numerically stable, since it avoids the subtraction of two nearly equal quantities.
It is possible to concentrate σ² out of the likelihood by differentiation, in which case the recursion takes a slightly different form. In this case, to calculate the -2 ln likelihood it is necessary to accumulate two terms, one associated with the weighted residual (or innovation) sum of squares, RSS, and the other associated with the determinant in the multivariate normal distribution, DET. A new variable is defined which is the ratio of the two variances,

c² = R/σ².                                               (3.3)

The initialization is

x(0 | 0) = 0,   P(0 | 0) = 1/(1 - α²),                   (3.4)
RSS = 0,   DET = 0.

The modified recursion is

(1) x(t | t - 1) = αx(t - 1 | t - 1),
(2) P(t | t - 1) = P(t - 1 | t - 1)α² + 1,
(3) y(t | t - 1) = x(t | t - 1),
(4) I(t) = y(t) - y(t | t - 1),
(5) V(t) = P(t | t - 1) + c²,
(6) RSS = RSS + I²(t)/V(t),   DET = DET + ln V(t).

Here the equal sign is used in the programming sense of "is replaced by".

(7) K(t) = P(t | t - 1)/V(t),
(8) x(t | t) = [c²x(t | t - 1) + P(t | t - 1)y(t)]/V(t),
(9) P(t | t) = c²P(t | t - 1)/V(t).

After completing the recursion with n observations present,

-2 ln likelihood = n ln RSS + DET.                       (3.5)

A nonlinear optimization search procedure can be used to find the minimum of -2 ln likelihood with respect to α and c². When this is completed, the maximum likelihood estimates of the two variances can be calculated from

σ² = RSS/n,   R = c²σ².                                  (3.6)



For missing observations, steps (3)-(7) are skipped and the last two steps are replaced by

(8) x(t | t) = x(t | t - 1),
(9) P(t | t) = P(t | t - 1).

The above recursions are easily modified for an AR(1) process with missing observations by setting R or c² = 0. In this case, the problem is nonlinear in only one parameter, α. Since, for a stationary process, this parameter must be in the range -1 < α < 1, and in most practical applications is in the range 0 ≤ α < 1, it is easy to search for the maximum likelihood estimate of α on a microcomputer. Note that in the case of an AR(1) process, the above steps simplify to:

(5) V(t) = P(t | t - 1),
(8) x(t | t) = y(t),
(9) P(t | t) = 0.

In other words, without observational error, the variance of the innovation is the prediction variance, the updated estimate of the state is the new observation, and it has variance zero.
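
The scalar recursion above translates almost line for line into code. The following is a minimal sketch, under our own naming and NaN-for-missing conventions, of the concentrated -2 ln likelihood of (3.4)-(3.5); one call is one function evaluation for a nonlinear optimizer.

```python
import numpy as np

def neg2_loglik_arma11(alpha, c2, y):
    """Concentrated -2 ln likelihood (3.5) for an AR(1) signal observed
    with error (an ARMA(1, 1)); np.nan in y marks a missing observation."""
    x, P = 0.0, 1.0 / (1.0 - alpha**2)        # initialization (3.4)
    rss, det = 0.0, 0.0
    for yt in y:
        x = alpha * x                          # (1) one-step prediction
        P = P * alpha**2 + 1.0                 # (2) its variance
        if np.isnan(yt):                       # missing: skip steps (3)-(7);
            continue                           # (8)-(9) leave x, P unchanged
        V = P + c2                             # (5) innovation variance
        I = yt - x                             # (4) innovation
        rss += I**2 / V                        # (6) accumulate RSS ...
        det += np.log(V)                       #     ... and DET
        x = (c2 * x + P * yt) / V              # (8) update state
        P = c2 * P / V                         # (9) update variance
    n = int(np.sum(~np.isnan(np.asarray(y))))
    return n * np.log(rss) + det               # (3.5)
```

Minimizing this over α in (-1, 1) and c² ≥ 0, then setting σ² = RSS/n and R = c²σ², recovers the estimates of (3.6); c² = 0 gives the pure AR(1) case.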

4. ARMA(p, q) processes

There are various equivalent state-space representations of ARMA(p, q) processes in the literature. These can be constructed using concepts well known to electrical engineers (see, for example, Wiberg, 1971). As a simple example, consider an autoregression of order p (AR(p)),

x(t) = α₁x(t - 1) + α₂x(t - 2) + ··· + α_p x(t - p) + e(t),    (4.1)

where e(t) has standard deviation σ. The state of this process can be defined as the p most recent values of the process, and a state-space representation is

$$
\begin{bmatrix} x(t) \\ x(t-1) \\ \vdots \\ x(t-p+1) \end{bmatrix}
=
\begin{bmatrix}
\alpha_1 & \alpha_2 & \cdots & \alpha_{p-1} & \alpha_p \\
1 & 0 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t-2) \\ \vdots \\ x(t-p) \end{bmatrix}
+
\begin{bmatrix} \sigma \\ 0 \\ \vdots \\ 0 \end{bmatrix} u(t),
\qquad (4.2)
$$

where u(t) has unit standard deviation.


A minimal state-space representation of an ARMA(p, q) process has a state vector of length m = max(p, q + 1). The inclusion of observational error in the observation equation can modify this. The addition of white noise to an
ARMA(p, q) process is discussed by Box and Jenkins (1976, p. 122). If p > q, and the process is observed with error, the resulting observed process is ARMA(p, p). The resulting 2p parameters are a function of the original p + q parameters plus the variance of the observational error. The inclusion of observational error in the model provides the opportunity to find a more parsimonious model than simply fitting ARMA processes. For example, if the process is actually pure autoregressive with observational error, it is only necessary to estimate p + 1 parameters rather than 2p parameters for an ARMA(p, p) process. Fitting an ARMA(p, p - 1) with observational error may be equivalent to fitting an ARMA(p, p) process without observational error. If p ≤ q, the addition of observational error produces an ARMA model of the same order, so the variance of the observational error will be confounded with the model parameters. The conclusion is that observational error can only be included in the model if p > q. If the model is ARMA(p, p), the state vector can sometimes be reduced to length p rather than length p + 1 by fitting an ARMA(p, p - 1) model with observational error.
The state vector of an ARMA(p, q) process is not simply values of the process at lagged times. The m elements must summarize the entire past and present for the purpose of prediction. Akaike's (1975) Markovian representation was used by Jones (1980). The elements of the state vector are the present value of the process and the 1, 2, …, m - 1 step predictions into the future. A j step prediction is denoted x(t + j | t), i.e. the prediction at time t + j given data up to and including time t. The state equation is

$$
\begin{bmatrix} x(t\,|\,t) \\ x(t+1\,|\,t) \\ \vdots \\ x(t+m-1\,|\,t) \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
\alpha_m & \alpha_{m-1} & \alpha_{m-2} & \cdots & \alpha_1
\end{bmatrix}
\begin{bmatrix} x(t-1\,|\,t-1) \\ x(t\,|\,t-1) \\ \vdots \\ x(t+m-2\,|\,t-1) \end{bmatrix}
+ \sigma \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_m \end{bmatrix} u(t),
\qquad (4.3)
$$

and the observation equation is

$$
y(t) = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}
\begin{bmatrix} x(t\,|\,t) \\ x(t+1\,|\,t) \\ \vdots \\ x(t+m-1\,|\,t) \end{bmatrix} + v(t).
\qquad (4.4)
$$

For the ARMA(p, q) model,

x(t) = α₁x(t - 1) + ··· + α_p x(t - p) + e(t) + β₁e(t - 1) + ··· + β_q e(t - q).    (4.5)

The g's in (4.3) are a function of the α's and β's and are generated by the recursion

g₁ = 1,   g_j = β_{j-1} + Σ_{k=1}^{j-1} α_k g_{j-k} ,    (4.6)

where α_k = 0 for k > p and β_j = 0 for j > q.
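
A minimal sketch of recursion (4.6); the function name and the zero-padding of the coefficient lists are our conventions:

```python
def g_coefficients(alphas, betas, m):
    """g_1, ..., g_m from (4.6); alpha_k = 0 for k > p and beta_j = 0
    for j > q are handled by padding."""
    a = list(alphas) + [0.0] * m               # a[k-1] is alpha_k
    b = list(betas) + [0.0] * m                # b[j-1] is beta_j
    g = [1.0]                                  # g_1 = 1
    for j in range(2, m + 1):
        g.append(b[j - 2] + sum(a[k - 1] * g[j - k - 1] for k in range(1, j)))
    return g
```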

Harvey (1981) uses a different but equivalent state-space representation where the G vector of equation (2.1) is made up of the β's. Other representations have the β's in the H vector. Whatever representation is used, for a stationary process, it is necessary to be able to calculate the initial covariance matrix of the state, P(0 | 0). For Akaike's Markovian representation, the necessary equations for calculating this matrix are given in Jones (1980).

5. Stationarity and invertibility

The ARMA(p, q) process will be stationary if the roots of

1 - Σ_{k=1}^{p} α_k z^k = 0                              (5.1)

are outside the unit circle, and for the moving average to be invertible, the roots of

1 + Σ_{k=1}^{q} β_k z^k = 0                              (5.2)

must be outside the unit circle. To ensure stationarity and invertibility, Jones (1980) reparameterized in terms of the partial autoregression and partial moving average coefficients, and constrained them to be in the interval (-1, 1) by a logistic type transformation. If a_k is a partial autoregressive coefficient,

a_k = [1 - exp(-u_k)]/[1 + exp(-u_k)] ,                  (5.3)

which has the inverse transformation

u_k = ln[(1 + a_k)/(1 - a_k)] .                          (5.4)

The u_k can vary from -∞ to ∞, and these are the variables that the nonlinear
optimization routine works with. For a given value of u_k, the corresponding a_k is calculated from (5.3), and the autoregressive coefficients are calculated from the Levinson (1947)-Durbin (1960) recursion. For j = 1, …, p,

α_j^{(j)} = a_j ,

and for j > 1,

α_k^{(j)} = α_k^{(j-1)} - a_j α_{j-k}^{(j-1)} ,   k = 1, 2, …, j - 1.    (5.5)

The α's are then used in the state-space representation along with the β's, which are transformed in a similar fashion, and a value of -2 ln likelihood is calculated. (A code sketch of this reparameterization follows below.)
A natural way to obtain initial guesses at the parameters for nonlinear optimization is to proceed in a stepwise fashion, adding a single parameter, or perhaps both an autoregressive and a moving average parameter, at each step. The initial values of the parameters can be the final values obtained from the previous step, with the new parameter or parameters set to zero. The optimization will then start from the best value of -2 ln likelihood found at the previous step and try to improve it.
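
The following is a minimal sketch combining the logistic map (5.3) with the recursion (5.5); the function name is ours, and the same routine applies unchanged to the moving average side.

```python
import numpy as np

def unconstrained_to_ar(u):
    """Map unconstrained u_1, ..., u_p to the coefficients of a
    stationary autoregression: (5.3) gives the partials, (5.5) the
    alpha's."""
    u = np.asarray(u, dtype=float)
    partials = (1 - np.exp(-u)) / (1 + np.exp(-u))    # (5.3)
    alpha = np.array([])
    for a_j in partials:
        # alpha holds the order-(j-1) coefficients; update them per (5.5),
        # then append the new partial as the leading-order coefficient
        alpha = np.append(alpha - a_j * alpha[::-1], a_j)
    return alpha
```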

6. ARIMA(p, d, q) processes

Differencing is usually used to reduce ARIMA processes to ARMA processes. When there are missing observations, this presents a problem. An alternative is to represent the integrated moving average process in state-space form so that it is possible to work with the original observations. The only problem is that for nonstationary processes, the initial covariance matrix cannot be expressed as a function of the process parameters. One possibility is to use the conditional likelihood, conditional on observing the first d available time points. Consider the following examples.
A random walk observed with error, in state-space form, is

x(t) = x(t - 1) + σu(t),
y(t) = x(t) + v(t),                                      (6.1)

where v(t) has variance R. The y(t) process is a special case of an ARIMA(0, 1, 1) and contains two unknown parameters to be estimated by maximum likelihood, σ and R. It is well known that the best estimate of the present of this process, or the best prediction of the future, is an exponentially weighted average of the past. The Kalman filter produces an exponentially weighted moving average in the limit for long data spans (Jones, 1966). It also produces optimal estimates near the beginning of a data span or in the presence of missing data once the parameters of the process are known. The
likelihood conditional on the first observation can be calculated using the following starting conditions,

x(1 | 1) = y(1),   P(1 | 1) = R.                         (6.2)

An ARIMA(1, 1, 0) process requires a state vector of length two to model directly in state-space form. One state-space representation corresponding to the representation (4.2) is

$$
\begin{bmatrix} x(t) \\ x(t-1) \end{bmatrix}
=
\begin{bmatrix} 1+\alpha & -\alpha \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t-2) \end{bmatrix}
+
\begin{bmatrix} \sigma \\ 0 \end{bmatrix} u(t),
\qquad
y(t) = \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix} x(t) \\ x(t-1) \end{bmatrix}.
\qquad (6.3)
$$

This is a nonstationary second-order autoregression with one root of the characteristic equation (5.1) equal to 1, and the other equal to 1/α. Since there is no observational error in this model, the initial conditions for calculating the likelihood conditional on the first observation are

$$
X(2\,|\,1) = \begin{bmatrix} y(1) \\ y(1) \end{bmatrix},
\qquad
P(2\,|\,1) = \begin{bmatrix} \sigma^2/(1-\alpha^2) & 0 \\ 0 & 0 \end{bmatrix}.
\qquad (6.4)
$$

Using this form of the initial conditions, in the form of a prediction, the recursion is entered at step (3).
The general ARIMA(p, d, q), represented directly in state-space form without differencing, requires an autoregressive part of order p + d. The stationary autoregressive part of order p can be represented as before in terms of partial autoregression coefficients, using the transformation (5.3) to ensure stationarity. The autoregression coefficients for the nonstationary process can be calculated from the corresponding powers of z in the generating function

(1 - z)^d (1 - α₁z - α₂z² - ··· - α_p z^p).              (6.5)

For example, if d = 1, the nonstationary α's are

α′₁ = α₁ + 1 ,
α′₂ = α₂ - α₁ ,
α′₃ = α₃ - α₂ ,
   ⋮                                                     (6.6)
α′_p = α_p - α_{p-1} ,
α′_{p+1} = -α_p .
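
A short sketch of the expansion (6.5); the function name is ours, and multiplication by (1 - z) is implemented as a polynomial convolution:

```python
import numpy as np

def nonstationary_ar(alphas, d):
    """Coefficients alpha'_1, ..., alpha'_{p+d} of the nonstationary AR
    operator obtained by expanding (6.5); d = 1 reproduces (6.6)."""
    # coefficients of 1 - alpha_1 z - ... - alpha_p z^p, constant term first
    poly = np.concatenate(([1.0], -np.asarray(alphas, dtype=float)))
    for _ in range(d):
        poly = np.convolve(poly, [1.0, -1.0])    # multiply by (1 - z)
    return -poly[1:]                              # read off 1 - sum alpha'_k z^k
```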

7. Continuous time models for unequally spaced data

When data are truly unequally spaced, not equally spaced with missing
observations, continuous time models are necessary to represent the process.
These processes are discussed by Doob (1953). Kalman and Bucy (1961)
develop the state-space filtering approach for continuous time processes, and
Wiberg (1971) gives an easy-to-read introduction to the subject. The use of
continuous time models allows the prediction and updating equations to be
developed for an arbitrary time interval so that the Kalman filter recursion
depends on the length of the step.
As an introduction, consider a continuous time first-order autoregression, referred to as a CAR(1) process. A zero mean CAR(1) process can be represented as a first-order linear differential equation driven by 'white noise'. The continuous time state-space representation is

dx(t) = -αx(t) dt + dW(t),                               (7.1)

where α > 0, and W(t) is a Wiener process, i.e. dW(t) is continuous time zero mean 'white noise'. Integrated white noise is a continuous time random walk or Brownian motion process which satisfies the differential equation

dz(t) = dW(t).                                           (7.2)

The variance of the change in the random walk over a finite time interval is proportional to the length of the interval, i.e. for b > a,

Var{z(b) - z(a)} = Var{∫_a^b dW(t)} = (b - a)Q.          (7.3)

Here Q will be referred to as the variance of the white noise process.
The process (7.1) is a continuous time Markov process, with covariance function at lag τ

C(τ) = Q exp(-α|τ|)/2α.                                  (7.4)

If the process is observed at a given time, this observation sums up the


information in the past for the purposes of predicting the future, i.e. the value
of the process at any time is the state of the process. If the process is observed
at time a, the prediction of the process at time b > a can be calculated by
solving the differential equation without the random input, dW(t), which is
unpredictable, and substituting in the initial condition

x(b) = x(a) exp[-α(b - a)].                              (7.5)


The random input over a finite time interval is an exponentially weighted integral of the white noise input (Gelb, 1974),

∫_a^b exp[-α(b - τ)] dW(τ),                              (7.6)

and has variance

Q{1 - exp[-2α(b - a)]}/2α.                               (7.7)

Note that as the time interval b - a becomes large, this prediction variance approaches the variance of the process.
If this CAR(1) process is observed at equally spaced time intervals with spacing h, the resulting discrete time process is AR(1) with autoregression coefficient

α₁ = exp(-αh)                                            (7.8)

and

σ² = Q[1 - exp(-2αh)]/2α.                                (7.9)

Assume that the process is observed at n unequally spaced time points,

t₁ < t₂ < ··· < t_n .

It does not matter how these time points are determined, by some random mechanism or selected in advance, as long as the time points do not depend on the values of the process being sampled. It is assumed that the sampling times are known. The continuous time state-space model (7.1) can now be represented as a discrete time state-space model at the sampling times,

x(t_i) = F(t_i; t_{i-1})x(t_{i-1}) + G(t_i)u(t_i),       (7.10)

where

F(t_i; t_{i-1}) = exp[-α(t_i - t_{i-1})] ,
G(t_i) = √(Q{1 - exp[-2α(t_i - t_{i-1})]}/2α) .

Now the state transition and the standard deviation of the random input depend on the length of the time step. The observation equation is

y(t_i) = x(t_i) + v(t_i),                                (7.11)

where v(t_i) is the observational error with variance R. Here the observational error would probably be truly observational error or numerical round-off error. In the equally spaced case, observational error is sometimes used as a convenient way to obtain a parameterization of the process with one less element in the state vector.
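
The discrete-time recursion implied by (7.10)-(7.11) is again scalar. A minimal sketch, with the function name, argument conventions and use of numpy ours:

```python
import numpy as np

def car1_neg2_loglik(alpha, Q, R, times, y):
    """-2 ln likelihood for a CAR(1) process observed with error at
    known, possibly unequally spaced times; the step-dependent transition
    and input variance follow (7.10)."""
    x, P = 0.0, Q / (2.0 * alpha)          # stationary prior variance, C(0) of (7.4)
    dev = 0.0
    t_prev = times[0]                      # first step then leaves the prior unchanged
    for t, yt in zip(times, y):
        F = np.exp(-alpha * (t - t_prev))  # state transition for this step
        x = F * x                          # prediction
        P = F * F * P + Q * (1.0 - F * F) / (2.0 * alpha)
        V = P + R                          # innovation variance
        I = yt - x                         # innovation
        dev += I * I / V + np.log(V)       # contribution to -2 ln likelihood
        K = P / V                          # gain and measurement update
        x += K * I
        P -= K * P
        t_prev = t
    return dev
```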

8. Continuous time AR(p) process with observational error, CAR(p)

A CAR(p) process with observational error, and FORTRAN code for calculating the exact -2 ln likelihood, are given in Jones (1981). Variable transformations are used to ensure that the estimated parameters generate a stationary process. The continuous time model can be written

d[x^{(p-1)}(t)] + [α_{p-1}x^{(p-1)}(t) + ··· + α₀x(t)] dt = dW(t),    (8.1)

where x^{(j)}(t) denotes the jth derivative with respect to time. For stationarity, it is necessary that the roots of

Σ_{j=0}^{p} α_j z^j = 0                                  (8.2)

have negative real parts (α_p = 1).


A state-space representation for this process uses the value of the process
and its first p - 1 derivatives as the state. It does not matter that the derivatives
are unobservable since the Kalman theory assumes that linear combinations of
the state are observed, and in this case the linear combination simply picks off
the first element of the state vector. For stationary processes, the unconditional
covariance matrix of this state vector is known (Doob, 1953), i.e. the covari-
ances between the process and its derivatives, so the initial state covariance
matrix can be calculated for given values of the parameters of the process
(Jones, 1981). An interesting byproduct of this analysis is that estimates are
obtained of the derivatives of the process as well as the process itself, and these
are often of interest. For example, the velocity and acceleration are estimated
if p 1> 3. If the best estimate of velocity and acceleration are required within the
data span using all the data, this can be calculated using the Kalman smoother
(Gelb, 1974).
This state-space representation can be written

$$
d\begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
-\alpha_0 & -\alpha_1 & -\alpha_2 & \cdots & -\alpha_{p-1}
\end{bmatrix}
\begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix} dt
+
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} dW(t),
\qquad (8.3)
$$

and the observation equation is

$$
y(t) = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}
\begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix} + v(t),
\qquad (8.4)
$$

at each time point where the process is observed.

9. CARMA(p, q) and CARIMA(p, d, q) processes

The above continuous time autoregression can be generalized to continuous time autoregressive moving average (CARMA) and continuous time autoregressive integrated moving average (CARIMA) processes. Doob (1953, p. 542) discusses continuous time processes with rational spectra, and Wiberg (1971, p. 19) gives state-space representations for these processes. If we formally represent the continuous time 'white noise' process as

ε(t) = dW(t),                                            (9.1)

a CARMA(p, q) process can be represented

x^{(p)}(t) + α_{p-1}x^{(p-1)}(t) + ··· + α₀x(t)
    = ε^{(q)}(t) + β_{q-1}ε^{(q-1)}(t) + ··· + β₀ε(t).   (9.2)

For stationarity, it is necessary that p > q and that the roots of (8.2) have negative real parts. The representation will be 'minimum phase' if the roots of

Σ_{j=0}^{q} β_j z^j = 0                                  (9.3)

have negative real parts. It is also assumed that (8.2) and (9.3) have no common roots.
For a CARIMA(p, d, q) process, d roots of (8.2) must be zero. This means that

α₀ = α₁ = ··· = α_{d-1} = 0.                             (9.4)

10. Regression with stationary errors

Harvey and Phillips (1979) used state-space representations to obtain exact maximum likelihood estimates of regression coefficients and the ARMA parameters for regression problems when the errors have a stationary ARMA structure. The regression parameters are included in the state vector of the process and concentrated out of the likelihood. There are some problems with the initial conditions when using this method.
An alternative method is to realize that the Kalman filter is simply a linear operation on the previous data. The prediction summarizes the information in the past and most recent observation. The innovation is the component of the next observation that is orthogonal to the past. The Kalman filter, therefore, simply transforms correlated data to uncorrelated data. Since the prediction variance is also part of the algorithm, the innovation can be divided by the square root of this variance, producing a sequence of uncorrelated random variables with constant variance. In reality, the parameters of the ARMA process are not known, so guesses are made and -2 ln likelihood is calculated. Embedding the procedure in a nonlinear optimization algorithm gives maximum likelihood estimates of the ARMA parameters and the regression coefficients. The regression coefficients are separated out of the equations, so that optimization is required only with respect to the nonlinear parameters.
Consider the usual regression equation,

y = Xβ + ε,                                              (10.1)

where y is an n by 1 vector of the response variable, X is an n by p matrix of the independent variables, β is a p by 1 vector of unknown regression coefficients, and ε is an n by 1 vector of errors. It is assumed that the errors have the structure of one of the models discussed in this paper, i.e. an ARIMA or CARIMA process with missing or unequally spaced data. The missing data are not within the y vector or X matrix; these are assumed to have no missing observations. The assumption is that the data are collected in time, and the time points are not equally spaced.
As is usual when discussing weighted least squares, premultiply the regression equation by a matrix K, which in this case represents the Kalman filter,

Ky = KXβ + Kε.                                           (10.2)

In a regression situation, the Kalman filter operates on the y vector and each column of the X matrix in order to transform the errors to be uncorrelated with constant variance. It is very easy to modify the Kalman filter algorithm so that it operates on a matrix rather than a vector. By forming an n by p + 1 matrix by augmenting X by y, the algorithm can operate on each column, replacing the entries by the innovations. The usual X'X matrix and X'y vector can be formed from the innovations. If y'y, the total sum of squares TSS, is also calculated, then the residual sum of squares is

RSS = TSS - y'X(X'X)⁻¹X'y.                               (10.3)



It is important that the determinant term be included in the likelihood, since the weight matrix changes at each iteration,

-2 ln likelihood = n ln RSS + DET,                       (10.4)

where DET is the natural log of the innovation variance summed over the time points.
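
A minimal sketch of this whitening-then-least-squares computation; kalman_whiten stands in for any of the recursions of Sections 2-9 run with fixed values of the nonlinear parameters, and the function names are ours:

```python
import numpy as np

def whiten_and_regress(kalman_whiten, X, y):
    """Run the Kalman filter down each column of [X | y], then solve
    ordinary least squares on the standardized innovations, per
    (10.2)-(10.3)."""
    Z = np.column_stack([X, y])
    W = np.column_stack([kalman_whiten(Z[:, j]) for j in range(Z.shape[1])])
    Xw, yw = W[:, :-1], W[:, -1]
    beta = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)    # X'X beta = X'y
    rss = yw @ yw - yw @ Xw @ beta                  # TSS - y'X(X'X)^{-1}X'y
    return beta, rss
```

With DET accumulated inside kalman_whiten, (10.4) is then n ln RSS + DET, one function evaluation for the optimizer.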

11. Variance component models

Duncan and Horn (1972) showed how random effects are naturally handled using the Kalman recursion. Random effects can be included in the state of the process, while linear fixed effects are concentrated out of the likelihood as in regression. The advantage of this approach is the ability to handle unbalanced designs, such as designs with missing observations, by exact likelihood methods. A second advantage is that serial correlation in repeated measures designs can be modeled, even when there are missing observations or the data are unequally spaced.
Consider a simple two-way repeated measures design,

y_ij = μ + τ_j + π_i + ε_ij,                             (11.1)

where i denotes the subject and j denotes the repeated measurements on each subject; μ is the fixed grand mean, τ_j the fixed treatment or time effect, π_i the random subject effect, and ε_ij the random error. It is assumed that the π_i are independent N(0, V_π), and that the ε_ij are independent N(0, V_ε) and independent of the π_i. These assumptions produce the compound symmetry correlation structure for observations on the same subject, i.e. constant correlation between any two observations. This intraclass correlation,

V_π/(V_π + V_ε),                                         (11.2)

is a result of the random subject effect, and is not serial correlation in the usual time series analysis sense.
If this model is balanced with no missing observations, the usual repeated measures analysis of variance is appropriate (Winer, 1971), and exact maximum likelihood estimates of the two variances can be expressed in closed form (Herbach, 1959). In the unbalanced case with missing observations, the exact likelihood can be calculated using a state-space model. Concentrating V_ε out of the likelihood as before, -2 ln likelihood is nonlinear in only one parameter, the ratio of the two variances,

c² = V_π/V_ε.                                            (11.3)

Since the fixed effects can be handled by regression as in Section 10, only the random terms need be represented in state-space form. Since subjects are independent, -2 ln likelihood can be calculated for each subject and summed over subjects. For subject i, the state equation is trivial, since π_i is constant for each subject. It is, however, random across subjects with variance V_π. The state equation is

π_i(j) = π_i(j - 1),                                     (11.4)

and the observation equation

y_ij - μ - τ_j = π_i(j) + ε_ij.                          (11.5)

The initial conditions are

π_i(0 | 0) = 0,   P_i(0 | 0) = c².                       (11.6)

This initial variance would be V_π if V_ε had not been concentrated out of the likelihood. ε_ij now plays the role of observational error. Concentrating V_ε out of the likelihood has the effect of dividing all variances in the recursion by V_ε; therefore, the observational error variance R for this model will be set equal to 1.
If serial correlation exists between the ε's within a subject, the ε's must be modeled as part of the state. Any of the models discussed in this chapter can be used to model this serial correlation. The random subject effect is simply tacked onto the end of the state vector. The observations can be equally or unequally spaced, and there may be missing observations. For example, if the ε's satisfy an AR(1) structure,

ε_ij = αε_{i,j-1} + u_j,                                 (11.7)

the state equation can be written

$$
\begin{bmatrix} \varepsilon_{ij} \\ \pi_i \end{bmatrix}
=
\begin{bmatrix} \alpha & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \varepsilon_{i,j-1} \\ \pi_i \end{bmatrix}
+
\begin{bmatrix} 1 \\ 0 \end{bmatrix} u_j,
\qquad (11.8)
$$

and the observation equation is

$$
y_{ij} - \mu - \tau_j = \begin{bmatrix} 1 & 1 \end{bmatrix}
\begin{bmatrix} \varepsilon_{ij} \\ \pi_i \end{bmatrix}.
\qquad (11.9)
$$

The initial state vector is [0 0]' with covariance matrix

$$
P(0\,|\,0) = \begin{bmatrix} 1/(1-\alpha^2) & 0 \\ 0 & c^2 \end{bmatrix}.
\qquad (11.10)
$$

Here c² = V_π/σ², and this model is nonlinear in two parameters, c and α.
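
A minimal sketch of the per-subject system matrices for (11.8)-(11.10); the function name is ours, and the result can be run through a Kalman recursion such as the kalman_step sketch of Section 2 with R = 0, since ε_ij is part of the state:

```python
import numpy as np

def subject_model(alpha, c2):
    """State-space matrices for one subject: AR(1) serial error plus a
    random subject effect, with sigma^2 concentrated out so the random
    input has unit variance."""
    F = np.array([[alpha, 0.0], [0.0, 1.0]])   # transition (11.8)
    G = np.array([[1.0], [0.0]])               # input enters the error only
    H = np.array([[1.0, 1.0]])                 # observation (11.9)
    x0 = np.zeros(2)                           # initial state
    P0 = np.array([[1.0 / (1.0 - alpha**2), 0.0],
                   [0.0, c2]])                 # initial covariance (11.10)
    return F, G, H, x0, P0
```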



The usual linear mixed model (Rao and Kleffe, 1980) is

y = Xβ + U₁θ₁ + ··· + U_p θ_p + ε ,                      (11.11)

where X, U₁, …, U_p are known matrices, β is a vector of fixed unknown parameters, and θ₁, …, θ_p are random vectors with zero means, uncorrelated, with covariance matrices

E{θ_i θ_i'} = V_i .                                      (11.12)

The methods presented here can be used to estimate β and the variance components for unbalanced designs, and the estimation is nonlinear in only p parameters, the variance components.

12. Nonlinear optimization

The nonlinear optimization routines used by the author are the quasi-Newton methods discussed by Dennis and Schnabel (1983), who give algorithms in the Appendix of their book. The art of nonlinear optimization is highly developed by computer scientists, and statisticians need only find good code. Supplying derivatives for the functions being minimized is a good idea if possible, but it is not necessary; gradients can be approximated by finite differences.

13. Conclusion

State-space representations and the Kalman filter provide a unified approach for calculating likelihoods for time series models. The state-space representation represents the process as a vector Markov process. At each time point, the Kalman filter calculates the component of the new observation that is orthogonal to the past. These innovations, together with their variances, are used to calculate the likelihood, assuming that the process has Gaussian errors and inputs. If observations are missing, predictions continue across the missing data, keeping track of the growing prediction variance. When the next observation is available, the innovation has a larger variance than when there are no missing observations, but the correct variance has been calculated to enter into the likelihood. If a large block of data is missing, so that there is no longer any information available from the past for prediction, the algorithm converges to a steady state, and the result is the same as if the algorithm starts again on a new realization of the process. It is also possible to use multiple realizations of a process to obtain a single value of the likelihood.
If data are unequally spaced, continuous time models can be used for the process. A continuous time state-space representation defines a discrete time representation at the sample points.
Embedding the recursion as a function evaluation in a nonlinear optimization routine provides maximum likelihood estimates of the process parameters. A calculation of -2 ln likelihood from a recursion through the data is one value of the function that the routine is attempting to minimize.
These procedures generalize to regression with correlated error structure,
including analysis of variance problems. Of particular interest are mixed linear
models since variance components can be estimated from unbalanced designs,
and in the presence of serially correlated errors.

References

Akaike, H. (1975). Markovian representation of stochastic processes by canonical variables. SIAM J. Control 13, 162-173.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and
Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.
Doob, J. L. (1953). Stochastic Processes. Wiley, New York.
Duncan, D. B. and Horn, S. D. (1972). Linear dynamic recursive estimation from the viewpoint of regression analysis. J. Amer. Statist. Assoc. 67, 815-821.
Durbin, J. (1960). The fitting of time series models. Review of the International Statistical Institute
28, 233-244.
Gelb, A. (1974). Applied Optimal Estimation. M.I.T. Press, Cambridge, MA.
Harvey, A. C. (1981). Time Series Models. Philip Allan, Deddington, Oxford, and John Wiley
(Halstead Press), New York.
Harvey, A. C. and Phillips, G. D. A. (1979). Maximum likelihood estimation of regression models
with autoregressive-moving average disturbances. Biometrika 66, 49-58.
Herbach, L. H. (1959). Properties of model II-type analysis of variance tests. Ann. Math. Statist. 30,
939-959.
Jones, R. H. (1966). Exponential smoothing for multivariate time series. J. Roy. Statist. Soc. Ser. B 28, 241-251.
Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics 22, 389-395.
Jones, R. H. (1981). Fitting a continuous time autoregression to discrete data. In: D. F. Findley, ed., Applied Time Series Analysis II, 651-682. Academic Press, New York.
Jones, R. H. (1984). Fitting multivariate models to unequally spaced data. In: E. Parzen, ed., Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statistics, Vol. 25, 158-188. Springer, Berlin-New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Trans. Part D (J. Basic Engineering) 82, 35-45.
Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. ASME Trans. Part D (J. Basic Engineering) 83, 95-108.
Levinson, N. (1947). The Wiener RMS error criterion in filter design and prediction. J. Math. Phys.
25, 261-278.
Rao, C. R. and Kleffe, J. (1980). Estimation of variance components. In: P. R. Krishnaiah, ed.,
Handbook of Statistics, Vol. 1, 1-40. North-Holland, Amsterdam.
Schweppe, F. C. (1965). Evaluation of likelihood functions for Gaussian signals. IEEE Trans. Inform. Theory IT-11, 61-70.

Solo, V. (1984). Some aspects of continuous-discrete time series modeling. In: E. Parzen, ed.,
Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statistics, Vol. 25, 325-345.
Springer, Berlin-New York.
Wecker, W. E. and Ansley, C. F. (1983). The signal extraction approach to nonlinear regression
and spline smoothing. J. Amer. Statist. Assoc. 78, 81-89.
Wiberg, D. M. (1971). Theory and Problems of State Space and Linear Systems. Schaum's Outline
Series. McGraw-Hill, New York.
Winer, B. J. (1971). Statistical Principles in Experimental Design, 2nd ed. McGraw-Hill, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 179-187

6

Various Model Selection Techniques in Time Series Analysis

Ritei Shibata

1. Introduction

This chapter aims to give a short review on various model selection techniques
which have been developed in the context of time series analysis. Our main
concern is on moving-average (MA), autoregressive (AR) or autoregressive-
moving average ( A R M A ) models. A similar problem is called model checking,
which aims at checking adequacy of a model. Since the test statistics employed
in model checking can be also used for constructing a model selection pro u
cedure, we first look up such statistics. We should, however, note that the
model selection is not a simple combination of model checkings.
The aim of the model selection is not only in checking adequacy of a model but
also in
(a) obtaining a good predictor, or
(b) describing a system, or identifying a system.
We consider a univariate ARMA(p, q) model,

φ(B)z_t = θ(B)e_t ,                                      (1.1)

where

φ(B) = 1 + φ₁B + φ₂B² + ··· + φ_p B^p

and

θ(B) = 1 + θ₁B + θ₂B² + ··· + θ_q B^q

are transfer functions in the backward shift operator B. For simplicity, we assume that {e_t} is a sequence of independent, identically distributed normal random variables with mean 0 and variance σ². The sequence of autocovariances is denoted by {γ_l}, and that of the sample autocovariances, based on n observations z₁, …, z_n, by {Ĉ_l}. The estimated noise sequence is then

ê_t = θ̂(B)⁻¹φ̂(B)z_t ,

where θ̂(B) and φ̂(B) are the maximum likelihood or quasi-maximum likelihood estimates of θ(B) and φ(B), respectively. In this chapter, we do not go


further into mathematical details; we assume any commonly used regularity conditions. In the next section, we consider test statistics that are specific to the type of model: MA, AR, or ARMA. In Section 3, some other statistics are considered, which are not specific to the type of model. Section 4 discusses how to construct a selection procedure based on such test statistics.

2. Statistics specific for each model

2.1. Moving-average models


A commonly used test statistic for MA(q) is based on the sample autocovariances [7]. The quadratic form of the sample autocovariances,

T = n Σ_{l,m=q+1}^{h} Ĉ_l Ĉ_m σ̂^{lm} ,

is asymptotically distributed as χ²_{h-q}(ψ) with the noncentrality

ψ = n Σ_{l,m=q+1}^{h} γ_l γ_m σ^{lm} ,

where σ^{lm} and σ̂^{lm} are the (l, m) elements of the inverse of the autocovariance matrix and of the sample autocovariance matrix, respectively. Therefore, by T we can check q-dependence, a specific property of MA(q), but we may fail in checking the linearity of the process. A remarkable fact is that this statistic is not equivalent to the maximum log likelihood in any sense.

2.2. Autoregressive models


For testing AR(p), the sequence of partial autocorrelations {φ_m}, which are zero for m > p, plays an important role. The sum of the squared sample partial autocorrelations {φ̂_m},

T = n Σ_{l=1}^{h} φ̂²_{p+l} ,

is asymptotically distributed as χ²_h(ψ) with the noncentrality

ψ = n Σ_{l=1}^{h} φ²_{p+l} .

Therefore, by T we can test the null hypothesis of {z_t} being an AR(p). As the sample partial autocorrelation φ̂_m, the commonly used definition is the last coordinate of the solution φ̂(m)' = (φ̂₁(m), …, φ̂_m(m)) of the mth-order Yule-Walker equation.

Historically, the following statistic was proposed earlier by Quenouille [19],

T = (n/σ̂⁴(p)) Σ_{l=1}^{h} ĥ²_{p+l} ,

where

ĥ_l = (1/n) Σ_{t=p+1}^{n} ( Σ_{i=0}^{p} φ̂_i(p) z_{t-i} ) ( Σ_{j=0}^{p} φ̂_j(p) z_{t-l+j} ) ,

and

σ̂²(p) = (1/(n - p)) Σ_{t=p+1}^{n} ( Σ_{l=0}^{p} φ̂_l(p) z_{t-l} )² ,

with z_t = 0 for t ≤ 0 and φ̂₀(p) = 1. The above ĥ_l can be thought of as an approximation to the covariance

h_l = (1/n) Σ_{t=p+1}^{n} e_t ẽ_{t-l}

between the noise e_t and its backward representation ẽ_t = Σ_{l=0}^{p} φ_l z_{t+l}. Hence, ĥ_l/σ̂²(p) might be more natural than the φ̂_l as an estimate of the partial autocorrelation.
It is well known that the above two statistics are asymptotically equivalent to
each other [2]. These statistics are also asymptotically equivalent to the
maximum log likelihood.

2.3. Autoregressive moving-average models


A specific property of the autoregressive moving-average model is that it is non-identifiable when overfitted. Since an ARMA model

η(B)φ(B)z_t = η(B)θ(B)e_t

has the same covariance structure as that of (1.1), the transfer functions θ(B) and φ(B) are not uniquely determined by the autocovariances of {z_t}.
The generalized partial autocorrelation φ_k(j) is defined as the last coordinate of the solution φ(j) of the equation

A(j, k)φ(j) = -γ(j),

where

$$
A(j, k) = \begin{bmatrix} \gamma_j & \cdots & \gamma_{j-k+1} \\ \vdots & & \vdots \\ \gamma_{j+k-1} & \cdots & \gamma_j \end{bmatrix}
$$

and γ(j)' = (γ_{j+1}, …, γ_{j+k}).

Because of the non-identifiability when overfitted, the equations which characterize ARMA(p, q),

γ_l = - Σ_{j=1}^{p} φ_j γ_{l-j}   for l > q,

imply that the matrix A(j, k) is nonsingular if and only if j ≤ q or k ≤ p. The generalized partial autocorrelations then have the following properties:

φ_k(j) = 0          if j = q, k > p,
φ_k(j) undefined    if j > q, k > p,
φ_k(j) ≠ 0          otherwise.

If j = 0, φ_k(j) reduces to an ordinary partial autocorrelation φ_k. Making use of this property, we can find the orders p and q as follows: in the estimated φ array, find the coordinates (q + 1, p + 1) which specify the north-west corner of the largest south-east sub-array whose elements are all unstable but with zeros on the north edge.
Based on a similar idea, the use of the A array,

A = [ |A(j, k)| ] ,

was proposed by Beguin, Gourieroux and Monfort [4]. Equivalent procedures were proposed by Chow [8], Graupe, Krause and Moore [10], and Woodside [29].

0 if j > q and k > p ,


]A(/, k)l= # 0 otherwise,

it is enough to find, in an estimated A array, the coordinates (q + 1, p + 1),


which specify the North-West corner of the largest South-East sub-array
whose all elements are zeros.
We can also construct a test statistic by considering the determinant of the
above South-East sub-array [4].
More complicated statistics, called the S array and R array, were proposed by Gray, Kelley and McIntire [11]. For example, the (j, k) element of the S array is defined as

S_k(j) = {(-1)^k S^{k+1,1}}⁻¹ ,

where S^{l,m} is the (l, m) element of the inverse of

$$
S = \begin{bmatrix} \gamma_j & \cdots & \gamma_{j+k} \\ \vdots & & \vdots \\ \gamma_{j+k} & \cdots & \gamma_{j+2k} \end{bmatrix}.
$$

The S array has the following properties:

S_p(j) = (-1)^p (1 + Σ_{l=1}^{p} φ_l)           for all j > q - p,
S_p(j) = (-1)^{p+1} (1 + Σ_{l=1}^{p} φ_l) φ_p   for all j < -q - p,
S_k(-q - p - 1) = ±∞                            for any k > p,
S_k(j) undefined                                for k > p, if j < -q - p or j > q - p.

Then, similarly as in the A array, we can find the coordinates (q + 1, p + 1) in the S array.


It is known that some elements of S array coincide with the partial a u t o c o r
relations (see Woodward and Gray [30]).
An advantage of a selection procedure based on such generalized partial
autocorrelations is that we can avoid unstable estimation of A R M A parameters
when overfitted. It is, of course, at the risk of underfitting. In other words, such
procedure might be good for the aim (b) in Section 1.

3. Other statistics not specific to the type of model


3.1. Based on likelihood
The dominant term of the maximum log likelihood is the one-step-ahead prediction error,

σ̂²_e = (1/n) Σ_{t=1}^{n} ê²_t .                          (3.1)

However, the non-identifiability of the ARMA model when overfitted causes a problem. If we want to estimate both transfer functions θ(B) and φ(B), these estimates are not only inconsistent but also unstable. For aim (a) in Section 1, such inconsistency does not, however, cause many problems, since the maximum likelihood estimate of the transfer function

k(B) = θ̂(B)⁻¹φ̂(B)

is not far from the true one, even when overfitted [14].
Another way might be to use the Lagrangian multiplier test statistic as is
demonstrated in Poskitt and Tremayne [18]. By modifying the Fisher information
matrix, we can avoid the problem of the singularity, but for doing this we have to
fix an alternative a priori. Therefore, such a statistic is not suitable for model
selection.

3.2. Portmanteau test statistic

The portmanteau test statistic, or Q statistic, is the sum of squares of the serial correlations r_l(ê) of the residual sequence {ê_t},

T = n Σ_{l=1}^{h} r²_l(ê) .

It is shown by Box and Pierce [6] that the above T is asymptotically distributed as χ²_{h-p-q} under the null hypothesis. To accelerate the convergence to the asymptotic distribution, Ljung and Box [16] proposed the correction

T = n(n + 2) Σ_{l=1}^{h} (n - l)⁻¹ r²_l(ê) .              (3.2)

Detailed analysis of the distribution under the null hypothesis or alternatives can be found in [16]. The above statistic is the most natural for checking the uncorrelatedness of residuals but, if our main concern is only in obtaining a good predictor in the sense of mean squared error, it might be checking too many things. In spite of the correction in (3.2), convergence is not so fast, since the statistic consists of fourth moments of the original process. A comparison with the Lagrangian multiplier test can be found in Godfrey [9].
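
A minimal sketch of (3.2), with the function name and interface ours:

```python
import numpy as np

def ljung_box(resid, h, p=0, q=0):
    """Corrected portmanteau statistic (3.2) computed from residuals;
    returns T and its asymptotic chi-square degrees of freedom h - p - q."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = e @ e
    r = np.array([e[l:] @ e[:-l] / denom for l in range(1, h + 1)])
    T = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, h + 1)))
    return T, h - p - q
```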

3.3. Cross-validation
This kind of statistic was proposed by Stone [27] in the context of multiple regression. A formal extension yields the statistic

T = Σ_{t=1}^{n} {z_t - ẑ_t(t-)}² ,

where ẑ_t(t-) is an interpolation of z_t, estimated from the observations except z_t. It generally requires laborious calculation. Not much is known about the behavior of this statistic, but it has a tendency toward overfitting, particularly when outliers exist.

4. Model selection

We can construct a selection procedure by using one of the test statistics introduced in the previous sections. However, it is not so good an idea to repeat such testing for various p and q. If we do so, we first have to choose the many significance levels required, and the resulting power is a complicated function of the levels, as well as of the order of the testings. It is hard to get good control even of the overall type I error. As an alternative we can consider the use of homogeneous testing, as in Krishnaiah [15]. By such testing, we can control the type I error well, but it still requires a lot of computation.
A better principle might be to find a model which balances overfitting risks and underfitting risks. A typical way of realizing such balancing behavior is to select the p and q which minimize

C(p, q) = T + a(p + q) .                                 (4.1)

Here T is one of the test statistics introduced in the previous sections. The second term in (4.1) can be considered as a penalty for the complexity of the model, and a term compensating for the random fluctuation of T. Since the expectation of T is p + q when T is distributed as χ² with p + q degrees of freedom, it is better to choose a greater than 1, so as to ensure a positive penalty for an increase in the degrees of freedom.
In AIC [1], BIC [20], or φ [12], which are called criterion procedures, all criteria are of the form (4.1) with

T = -2 log(maximum likelihood).

For such criteria, there has been much discussion. The most controversial point is how to choose a, which is 2 in AIC, log n in BIC, and c loglog n for some c > 2 in φ. The choice of a depends on the aim of the selection. If our main concern is prediction, a should be chosen so as to yield less prediction error. If it is to identify a stable system, consistency is more important than the amount of the prediction error. In Shibata [21], these two aspects of model selection are demonstrated for the case of AIC, in the context of nested AR model fitting. It is shown that selection by the minimum AIC procedure has a tendency toward overfitting and is not consistent, but the increase of the prediction error is not so great, only of the order O(1/n) uniformly in φ₁, …, φ_p. Similar discussions are given for general a by Bhansali and Downham [5] and Atkinson [3]. Their conclusions agree on the point that a should be greater than 2 even if the prediction error is our main concern. An answer to the optimality is given by Shibata [22] from the viewpoint of prediction error. He showed that the choice a = 2 is asymptotically optimal, under the assumption that the underlying process does not degenerate to a finite order AR process. This result, namely the 'asymptotic efficiency of the selection with a = 2', has also been applied to an autoregressive spectral

estimate [24]. Taniguchi [28] showed that Shibata's result holds true also for ARMA models. However, in the case of small samples, the above asymptotic theory does not work so well [23]. Recently Shibata [25] showed that the approximate minimax regret choice of a is 2.8. The regret means how much the prediction error increases when a selection procedure is applied, compared with the error when the true model is known. Further generalization of the AIC can be found in [26].
If we want to avoid overfitting in any case, a should be chosen to grow faster than 2 loglog n but more slowly than n. This is the result of Hannan and Quinn [12]. The term 2 loglog n follows from the fact that the range of the random fluctuation of T is at most 2 loglog n, by the law of the iterated logarithm. It is interesting to note that the choice a = log n in BIC, which is derived from a Bayesian viewpoint, satisfies the above condition.
Hannan and Rissanen [13] proposed a practical way of selecting the orders p and q of an ARMA model by using one of the above consistent criteria. Assuming p = q, find the m which minimizes C(m, m) in (4.1); this m is asymptotically equal to max(p₀, q₀) of the true orders p₀ and q₀. Next, assuming p = m or q = m, find the p and q which minimize C(p, q); then we can find p₀ and q₀ consistently. (A code sketch of this criterion minimization follows.)
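
A minimal sketch of criterion minimization over a grid of orders; neg2_loglik stands in for any routine returning -2 log(maximum likelihood) of a fitted ARMA(p, q) model (for example, the Kalman filter recursions of the preceding chapter), and the function name is ours:

```python
def select_order(neg2_loglik, P, Q, a=2.0):
    """Minimize C(p, q) = T + a(p + q) of (4.1) over 0 <= p <= P and
    0 <= q <= Q; a = 2 gives AIC, a = log n gives BIC."""
    grid = [(neg2_loglik(p, q) + a * (p + q), p, q)
            for p in range(P + 1) for q in range(Q + 1)]
    C, p, q = min(grid)
    return p, q, C
```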
A remaining problem in practice is how to choose the P and Q which specify the largest orders p and q. This is equivalent to the problem of how to choose the h of the statistics in Section 2. This problem has not been analyzed well, but an analysis by Shibata [26] gives a rough guideline: we can choose any large P and Q, as long as the tail probability P(F_{m+2, n-P-Q} > am/(m + 2)) is close enough to P(χ²_{m+2} > am) for m = 1, 2, 3, …, n - P - Q.
As a final remark, we note that if a is chosen bounded, then the actual penalty is seriously affected by small changes of T as well as by changes of initial conditions. We should choose a so as to compensate well for any such changes.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B. N. Petrov and F. Csáki, eds., Second International Symposium on Information Theory, 267-281. Akadémiai Kiadó, Budapest.
[2] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[3] Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model. Biometrika 67, 413-418.
[4] Beguin, J.-M., Gourieroux, C. and Monfort, A. (1980). Identification of a mixed autoregressive-moving average process: the corner method. In: O. D. Anderson, ed., Time Series, 423-435. North-Holland, Amsterdam.
[5] Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika 64, 547-551.
[6] Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
[7] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control.
Holden-Day, New York.
[8] Chow, J. C. (1972). On estimating the orders of an autoregressive moving-average process with uncertain observations. IEEE Trans. Automat. Control AC-17, 707-709.
Model selection techniques in time series analysis 187

[9] Godfrey, L. G. (1979). Testing the adequacy of a time series model. Biometrika 66, 67-72.
[10] Graupe, D., Krause, D. J. and Moore, J. B. (1975). Identification of autoregressive-moving average parameters of time series. IEEE Trans. Automat. Control AC-20, 104-107.
[11] Gray, H. L., Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-115.
[12] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195.
[13] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving
average order. Biometrika 69, 81-94.
[14] Hannan, E. J. (1982). Fitting multivariate ARMA models. In: G. Kallianpur, P. R. Krishnaiah, J. K. Ghosh, eds., Statistics and Probability: Essays in Honor of C. R. Rao, 307-316. North-Holland, Amsterdam.
[15] Krishnaiah, P. R. (1982). Selection of variables under univariate regression models. In: P. R.
Krishnaiah, ed., Handbook of Statistics--II. North-Holland, Amsterdam.
[16] Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303.
[17] Milhøj, A. (1981). A test of fit in time series models. Biometrika 68, 177-187.
[18] Poskitt, D. S. and Tremayne, A. R. (1981). An approach to testing linear time series models. Ann. Statist. 9, 974-986.
[19] Quenouille, M. H. (1947). A large-sample test for the goodness of fit of autoregressive
schemes. J. Roy. Statist. Soc. Ser. B 11, 123-129.
[20] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
[21] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
[22] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Ann. Statist. 8, 147-164.
[23] Shibata, R. (1980). Selection of the number of regression parameters in small sample cases. In:
Statistical Climatology, 137-148. Elsevier, Amsterdam.
[24] Shibata, R. (1981). An optimal autoregressive spectral estimate. Ann. Statist. 9, 300-306.
[25] Shibata, R. (1983). A theoretical view of the use of AIC. In: O. D. Anderson, ed., Time Series Analysis: Theory and Practice, Vol. 4, 237-244. Elsevier, Amsterdam.
[26] Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression
variables. Biometrika 71, 43-49.
[27] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36, 111-147.
[28] Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math. 32A, 401-419.
[29] Woodside, C. M. (1971). Estimation of the order of linear systems. Automatica 7, 727-733.
[30] Woodward, W. A. and Gray, H. L. (1981). On the relationship between the S array and the Box-Jenkins method of ARMA model identification. J. Amer. Statist. Assoc. 76, 579-587.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 189-211

7

Estimation of Parameters in Dynamical Systems

Lennart Ljung

1. Introduction

By a dynamical system we mean a relationship between an 'input signal' and an 'output signal', such that the current input also affects future outputs. Such systems are important in a number of areas in engineering, applied science and econometrics. We may think of systems like paper machines, communication channels, ecological systems, national economies, etc. Often it is found necessary to include also disturbances of various character in order to obtain reasonable descriptions of the systems. A special case is the situation when no input is present, and the system is driven by the disturbances only. Time series can be described in that way, and will thus form a subclass of the systems discussed in this chapter.
The nomenclature and the notation may differ between various application areas (in econometrics, for example, the input signal is usually called an 'exogenous variable'), and we shall here stick to notation that is common in the control theory area.
In the different areas various types of models have been developed for describing systems. Some of these have been specifically designed to serve certain purposes, like Bode plots for control synthesis. Others are of more general character, e.g. differential and/or difference equations for simulation, analysis and various decision-making tasks. We shall discuss models of dynamical systems in somewhat more detail in Section 2. Various engineering disciplines have been developed for solving certain design problems based on a given model of a system (like control theory, filter design, signal processing, etc.). The applicability of the theory is thus critically dependent on the availability of good models.
How does one construct good models of a given system? This question about
the interface between the real world and the world of mathematics thus
becomes crucial. The general answer is that we have to study the system
experimentally and make some inference from the observations. In practice there
are two main routes. One is to split up the system, figuratively speaking, into
subsystems, whose properties are well understood from previous experience.

This basically means that we rely upon 'laws of Nature' and other well-
established relationships that have their roots in earlier empirical work. These
subsystems are then joined together mathematically, and a model of the whole
system is obtained. This route is known as modelling, and does not necessarily
involve any experimentation on the actual system. When a model is required of
a yet unconstructed system (such as a projected aircraft), this is the only
possible approach.
The other route is based on experimentation. Input and output signals from
the system are recorded and are then subjected to data analysis in order to
infer a model of the system. This route is known as identification. It is often
advantageous to try to combine the approaches of modelling and identification in
order to maximize the information obtained from identification experiments
and to make the data analysis as sensible as possible.
In this chapter we shall discuss various techniques for the identification of
dynamical systems, focusing on methods that have been used, and to some
extent developed, in the control oriented community. We shall pay special
attention to sequential, or recursive methods, which refer to schemes that
process the measurements obtained from the system continually, as they
become available (Sections 7-11). We have found it suitable to present such
recursive schemes as a natural development of off-line or batch identification
methods, which assume that the whole data batch is available in each stage of
the data processing. Therefore, we will spend a major part of the chapter exposing and
explaining general ideas in identification (Sections 4-6). A particular problem
with parameter estimation in dynamical systems is the multitude of possible
models that are available. In Sections 2-3 we shall discuss a number of such
possibilities, and also point out a unified framework for how to handle them.
A very important problem, which is crucial for a successful application, is the
choice of a family of candidate models for describing the system. This problem
is quite application-dependent, and we consider it outside the scope of this
chapter to address the problem of choice of model set. This means that what
we discuss most of the time is actually how to estimate parameters in a given
model structure.
For further discussion of the topics treated here we may refer to Goodwin
and Payne (1977), Eykhoff (1974, 1981) and Ljung and Söderström (1983).

2. Time-domain models of dynamical systems

Describing dynamical systems in the time domain allows a considerable
amount of freedom. Usually, differential (partial or ordinary) equations are
used to describe the relationships between inputs and outputs. In discrete-time
(sampled-data systems), difference equations are used instead. The question of
how to describe properties of various disturbance signals also allows for several
different possibilities. Here we shall list a few typical choices, confining
ourselves to the case of linear, discrete-time models.
The word model is sometimes used ambiguously. It may mean a particular
description (with numerical values) of a given system. It may also refer to a
description with several coefficients or parameters that are not fixed. In the
latter case, it is more appropriate to talk about a model set: a set of models that
is obtained as the parameters range over a certain domain.

Linear difference equations
Let the relationship between the input sequence {u(t)} and the output
sequence {y(t)} be described by

y(t) + a_1 y(t-1) + ... + a_n y(t-n) = b_1 u(t-1) + ... + b_m u(t-m).   (2.1)

Here the coefficients a_i and b_i are adjustable parameters. (A multivariable
description would be quite analogous, with a_i and b_i as matrices.) We shall
generally denote the adjustable parameters by a vector θ:

θ = (a_1, ..., a_n, b_1, ..., b_m)^T .   (2.2)

If we introduce the vector of lagged inputs and outputs

φ(t) = (-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-m))^T ,   (2.3)

(2.1) can be rewritten in the more compact form

y(t) = θ^T φ(t).   (2.4)

In (2.1) or (2.4), the relationship between inputs and outputs is assumed to be
exact. This may not be realistic in a number of cases. Then we may add a term
v(t) to (2.1) or (2.4),

y(t) = θ^T φ(t) + v(t),   (2.5)

which accounts for various noise sources and disturbances that affect the
system, as well as for model inaccuracies. This term can be further modelled,
typically by describing it as a stochastic process with certain properties. The
simplest model of that kind is to assume {v(t)} to be white noise, i.e. a sequence
of independent random variables with zero mean values. However, many other
possibilities exist. Among the most common models is the following one.

ARMAX models
If the term {v(t)} in (2.5) is described as a moving average (MA) of white
noise {e(t)}, we have a model

y(t) + a_1 y(t-1) + ... + a_n y(t-n) = b_1 u(t-1) + ... + b_m u(t-m)
        + e(t) + c_1 e(t-1) + ... + c_n e(t-n).   (2.6)

Such a model is known as an ARMAX model.

Output error models
Instead of adding the disturbance v(t) to the equation as in (2.5), it can be
added as an output measurement error:

y(t) = x(t) + v(t),   (2.7a)

x(t) + f_1 x(t-1) + ... + f_n x(t-n) = b_1 u(t-1) + ... + b_m u(t-m).   (2.7b)

Such models are often called output error models. The 'noise-free output' x(t)
is here not available for measurement, but given (2.7b) it can be reconstructed
from the input. We denote by x(t, θ) the noise-free output that is constructed
using the model parameters

θ = (f_1, ..., f_n, b_1, ..., b_m)^T ,   (2.8)

i.e.

x(t, θ) + f_1 x(t-1, θ) + ... + f_n x(t-n, θ)
        = b_1 u(t-1) + ... + b_m u(t-m).   (2.9)

With

φ(t, θ) = (-x(t-1, θ), ..., -x(t-n, θ), u(t-1), ..., u(t-m))^T ,   (2.10)

(2.7) can be rewritten as

y(t) = θ^T φ(t, θ) + v(t).   (2.11)

Notice the formal similarity to (2.5) but the important computational
difference!

State-space models
A common way of describing stochastic, dynamical systems is to use state-space
models. Then the relationship between input and output is described by

x(t+1) = F(θ) x(t) + G(θ) u(t) + w(t),
y(t) = H(θ) x(t) + e(t),   (2.12)

where the noise sequences w and e are assumed to be independent at different
time instants and to have certain covariance matrices. Unknown, adjustable
parameters θ may enter the matrix elements in F, G and H in an arbitrary
manner. These may, for example, correspond to canonical parametrizations
(canonical forms) or to physical parameters in a time-continuous state-space
description, which has been sampled to yield (2.12).

3. Models and predictors

The list of potential models and model sets can be made long. For our
purposes it is useful to extract the basic features of models, so as to allow for a
treatment of model sets in general. First we introduce the following notation:

M(θ): a particular model, corresponding to the parameter value θ;

M: a set of models:

M = {M(θ) | θ ∈ D_M ⊂ R^d} ;

z^t: the set of measured input-output data up to time t:

z^t = {u(1), y(1), u(2), y(2), ..., u(t), y(t)}.

Similarly, u^t and y^t denote the input sequence and the output sequence,
respectively, up to time t.
The various models that can be used for dynamical systems all represent
different ways of thinking about and representing relationships between measured
signals. They have one feature in common, though. They all provide a rule for
computing the next output or a prediction (or 'guess') of the next output, given
previous observations. This rule is, at time t, a function from z^{t-1} to the space
where y(t) takes its values (R^p in general). It will also be parametrized in terms
of the model parameter θ. We shall use the notation

ŷ(t|θ) = g_M(θ; t, z^{t-1})   (3.1)

for this mapping. The actual form of (3.1) will of course depend on the
underlying model. For the linear difference equation (2.1) or (2.4), we will have

ŷ(t|θ) = θ^T φ(t).   (3.2)


The same prediction or guess of the output y(t) will be used for the model (2.5)
with disturbances, in case {v(t)} is considered as 'unpredictable' (like white
noise). For the state-space model (2.12) the predictor function is given by the
Kalman filter; then g_M is a linear function of past data.
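As a minimal illustration of this last point, the following sketch (the function name is ours, and a fixed, steady-state Kalman gain K is assumed instead of the time-varying gain) shows that the predictor for (2.12) is realized by a finite-dimensional linear filter driven by past data:

```python
import numpy as np

def kalman_one_step_predictions(F, G, H, K, u, y):
    """One-step-ahead predictions yhat(t|theta) for the state-space
    model (2.12), using a fixed (steady-state) Kalman gain K.
    The recursion
        xhat(t+1) = F xhat(t) + G u(t) + K (y(t) - H xhat(t))
    makes yhat(t|theta) = H xhat(t) a linear function of past data."""
    n = F.shape[0]
    xhat = np.zeros(n)                 # state estimate
    yhat = np.empty(len(y))
    for t in range(len(y)):
        yhat[t] = H @ xhat             # prediction before seeing y(t)
        xhat = F @ xhat + G * u[t] + K * (y[t] - yhat[t])
    return yhat
```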
For the ARMAX model (2.6) a natural predictor is computed as

ŷ(t|θ) + c_1 ŷ(t-1|θ) + ... + c_n ŷ(t-n|θ)
    = (c_1 - a_1) y(t-1) + ... + (c_n - a_n) y(t-n)
      + b_1 u(t-1) + ... + b_m u(t-m).   (3.3)

Notice that this can be rewritten as

ŷ(t|θ) = θ^T φ(t, θ),   (3.4a)
θ = (a_1, ..., a_n, b_1, ..., b_m, c_1, ..., c_n)^T ,   (3.4b)
φ(t, θ) = (-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-m),
           ε(t-1, θ), ..., ε(t-n, θ))^T ,   (3.4c)
ε(t, θ) = y(t) - ŷ(t|θ).   (3.4d)
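A minimal sketch of how (3.3)-(3.4d) generate predictions and prediction errors recursively, for the first-order case n = m = 1 with scalar signals (function name ours):

```python
import numpy as np

def armax_prediction_errors(theta, u, y):
    """Prediction errors eps(t, theta) for a first-order ARMAX model (2.6),
    theta = (a, b, c), via the predictor recursion (3.3):
        yhat(t) + c*yhat(t-1) = (c - a)*y(t-1) + b*u(t-1)."""
    a, b, c = theta
    eps = np.zeros(len(y))
    yhat_prev, y_prev, u_prev = 0.0, 0.0, 0.0
    for t in range(len(y)):
        yhat = (c - a) * y_prev + b * u_prev - c * yhat_prev
        eps[t] = y[t] - yhat                # (3.4d)
        yhat_prev, y_prev, u_prev = yhat, y[t], u[t]
    return eps
```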

For the models (2.7)-(2.11), a natural predictor is also given by (3.4a) with θ
and φ(t, θ) defined by (2.8)-(2.10). Notice that in this case the prediction is
formed from past inputs only. We then have, formally,

ŷ(t|θ) = g_M(θ; t, u^{t-1}).   (3.5)

Such a model we call an output error model or a simulation model.
We shall sometimes work with the general linear structure

ξ(t+1, θ) = A(θ) ξ(t, θ) + B(θ) z(t),
ŷ(t|θ) = C(θ) ξ(t, θ).   (3.6)

Here we simply assume that the prediction ŷ is a linear function of past data z,
and that this linear function can be realized with a finite-dimensional, time-invariant
filter.
Notice that the function g_M(θ; t, ·) in (3.1) is a deterministic function from the
observations z^{t-1} to the predicted output. All stochastic assumptions involved
in the model descriptions (e.g. white noises, covariance matrices, Gaussianity)
have only served as vehicles or 'alibis' to arrive at the predictor function.
The prediction ŷ(t|θ) is computed from z^{t-1} at time t-1. At time t the
output y(t) is received. We can then evaluate how good the prediction was by
computing

ε(t, θ) = y(t) - ŷ(t|θ).   (3.7)


We shall call ε(t, θ) the prediction error at time t, corresponding to model M(θ).
This term will be the generic name for general model sets. Depending on the
character of the particular model set, other names, for example, the (generalized)
equation error, may be used. For a simulation model (3.5) it is customary
to call the corresponding prediction error (3.7) the output error.
We can also adjoin an assumption about the stochastic properties of the
prediction error to the model M(θ):

M(θ): "Assume that the prediction error ε(t, θ) has the conditional
(given z^{t-1}) probability density function (p.d.f.) f(t, θ, x) [i.e.
P(ε(t, θ) ∈ B) = ∫_{x∈B} f(t, θ, x) dx]".   (3.8)

Notice that in (3.8) there is an implied assumption of independence of the
prediction errors, for different t, since the p.d.f. does not depend on z^{t-1}. A
predictor model (3.1) adjoined with a probabilistic assumption (3.8) we shall call a
probabilistic model.

4. Guiding principles behind identification methods

The problem now is to decide upon how to use the information contained in z^N
to select a proper member M(θ̂_N) in the model set that is capable of 'describing'
the data. Formally speaking, we have to determine a mapping from z^N to the set M:

z^N → M(θ̂_N).   (4.1)

Now, how can such a mapping be determined? We pointed out that the essence of a
model of a dynamical system is its prediction aspect. It is then natural to judge the
performance of a given model M(θ*) by evaluating the prediction errors ε(t, θ*)
given by (3.7). A guiding principle to form mappings (4.1) is thus the following
one:

"Based on z^t compute the prediction error ε(t, θ) using (3.1) and (3.7). At time
t = N, select θ̂_N so that the sequence of prediction errors ε(t, θ̂_N), t =
1, ..., N, becomes as small as possible".

The question is how to quantify what 'small' should mean. Two approaches
have been taken. These will be treated in the following two subsections.

4.1. Criterion minimization techniques

We introduce the scalar measure

l(t, θ, ε(t, θ))   (4.2)

to evaluate 'how large' the prediction error ε(t, θ) is. Here l is a mapping from
R × R^d × R^p to R, where d = dim θ, p = dim y.

After having recorded data up to time N, a natural criterion of the validity of
the model M(θ) is

V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} l(t, θ, ε(t, θ)).   (4.3)

This function is, for given z^N, a well-defined, scalar-valued function of the
model parameter θ. The estimate at time N, i.e. θ̂_N, is then determined by
minimization of the function V_N(θ, z^N). This gives us a large family of
well-known methods. Particular 'named' methods are obtained as special cases,
corresponding to specific choices of model sets and criterion functions l(t, θ, ε),
and sometimes particular ways of minimizing (4.3).

The least squares method
Choose l(t, θ, ε) = |ε|² and apply the criterion (4.3) to the difference equation
model (2.5). Since the prediction is given by (3.2), we have the prediction error

ε(t, θ) = y(t) - θ^T φ(t).

The criterion function (4.3) thus becomes

V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} |y(t) - θ^T φ(t)|² ,   (4.4)

which we recognize as the familiar least squares criterion (see, e.g., Strejc,
1980). This function is quadratic in θ, which is a consequence of the prediction
being linear in θ and the quadratic choice of criterion function. This means that
an explicit expression for the minimizing element θ̂_N can be given:

θ̂_N = [Σ_{t=1}^{N} φ(t) φ^T(t)]^{-1} Σ_{t=1}^{N} φ(t) y(t).   (4.5)
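In code, (4.5) is simply a linear least squares problem; a sketch (helper name ours, assuming n = m so that φ(t) in (2.3) has 2n entries, and solving via a least squares routine rather than forming an explicit inverse):

```python
import numpy as np

def least_squares_estimate(u, y, n):
    """Batch LS estimate (4.5) for the difference equation model (2.1),
    with n a-parameters and n b-parameters (i.e. m = n)."""
    N = len(y)
    Phi = np.zeros((N - n, 2 * n))
    for t in range(n, N):
        # phi(t) = (-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-n)), cf. (2.3)
        Phi[t - n] = np.r_[-y[t - n:t][::-1], u[t - n:t][::-1]]
    # minimizing |y - Phi theta|^2 is equivalent to solving (4.5)
    theta, *_ = np.linalg.lstsq(Phi, y[n:], rcond=None)
    return theta
```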
A quadratic criterion is a common ad hoc choice also for general models. For
multioutput systems this gives

l(t, θ, ε) = ½ ε^T Λ^{-1} ε .   (4.6)

To arrive at other specific functions l, we could invoke, for example, the
maximum likelihood idea:

The maximum likelihood method
For the probabilistic model (3.1), (3.8), the likelihood function can be
determined. Calculations show that

-(1/N) log P(y(N), y(N-1), ..., y(1)) = -(1/N) Σ_{t=1}^{N} log f(t, θ, ε(t, θ)).   (4.7)

Maximizing the likelihood function is thus the same as minimizing the criterion
(4.3) with

l(t, θ, ε) = -log f(t, θ, ε).   (4.8)

For Gaussian prediction errors

-log f(t, θ, ε) = const + ½ log det Λ_t(θ) + ½ ε^T Λ_t^{-1}(θ) ε ,   (4.9)

where Λ_t(θ) is the assumed covariance matrix for the prediction errors. If the
covariance matrix Λ_t is supposed to be known (independent of θ), then the first
two terms of (4.9) do not affect the minimization, and we have obtained a
quadratic criterion like (4.6). The maximum likelihood method was introduced
for ARMAX models in Åström and Bohlin (1965).

For the least squares case it was possible to give an explicit expression for
the parameter estimate. This is not the case in general. Then the criterion
function (4.3) must be minimized using numerical search procedures. We shall
comment more on this later.
We shall, following Ljung (1978), use the general term prediction error
identification methods for the procedures we described in this section (see also
Åström, 1980). When applied to the special simulation model (3.5), the term
output error methods might be preferred.

4.2. Correlation techniques

Another way of expressing that the sequence {ε(t, θ)} is small is to require
that it be uncorrelated with a given sequence {ζ(t)}. Let the vector ζ(t)
represent some information that is available at time t-1:

ζ(t) = ζ(t, z^{t-1}).   (4.10)

Sometimes, there is reason to consider a more sophisticated variant, where ζ
itself may depend on the parameter θ. (Some such cases will be discussed
below.)

ζ(t) = ζ(t, θ, z^{t-1}).   (4.11)

The rationale for requiring ε(t, θ) and ζ(t) to be uncorrelated is the following:
the predictors ŷ(t|θ) should ideally utilize all available information at time
t-1. Thus the prediction errors ε(t, θ) should be uncorrelated with such
information. (If they are not, more information can be squeezed out of z^{t-1}.)
We thus determine θ̂_N as the solution of

f_N(θ, z^N) = 0   (4.12a)

with

f_N(θ, z^N) = (1/N) Σ_{t=1}^{N} ε(t, θ) ζ^T(t) ,   (4.12b)

where, normally, the dimension of ζ is such that (4.12) gives a system of
equations that is compatible with the dimension of θ. When (4.12) is applied to
the model (3.2), the well-known instrumental variable method results. The vector
ζ is then known as the instruments or the instrumental variables. See Young
(1970) and Söderström and Stoica (1981) for a further discussion of this method.
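For the linear model (3.2), (4.12) is linear in θ and can be solved in closed form; a minimal sketch (function name ours; the construction of the instrument matrix, e.g. from delayed inputs as discussed below, is left to the user):

```python
import numpy as np

def iv_estimate(Phi, Zeta, y):
    """Instrumental variable estimate solving (4.12) for y(t) = theta^T phi(t):
        theta = [sum zeta(t) phi(t)^T]^{-1} sum zeta(t) y(t).
    Phi, Zeta: (N, d) arrays of regressors and instruments; y: (N,) outputs."""
    return np.linalg.solve(Zeta.T @ Phi, Zeta.T @ y)
```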

How to choose ζ(t)? A way to make the estimate θ̂ insensitive to the characteristics
of the noise that affects the system is to choose ζ to depend on past
inputs only,

ζ(t) = ζ(t, u^{t-1}).   (4.13)

Then that contribution to ε(t, θ) that has its origin in the noise will be
uncorrelated with ζ for all θ. Choices (4.13) are typical for the instrumental
variable method. It turns out that the choices that give the best accuracy of
the obtained estimates are obtained when u is filtered through filters associated
with the true system (see Söderström and Stoica, 1981). We then have

ζ(t) = ζ(t, θ, u^{t-1}).   (4.14)

For models that can be written as (3.4a) (like the ARMAX model (2.6) and the
output error model (2.7)-(2.11)), a natural choice is

ζ(t, θ, z^{t-1}) = φ(t, θ).   (4.15)

Notice also that if we choose

ζ(t, θ, z^{t-1}) = -ψ(t, θ) Λ^{-1} ,   (4.16)

where

ψ(t, θ) = -(d/dθ) ε(t, θ),   (4.17)

we find that (4.12) will define the stationary points of the criterion (4.3), (4.6).
The criterion minimization approach can thus be seen as a special case of
(4.12), from this point of view.

5. Asymptotic properties of the estimates

It is an important problem to investigate the properties of the estimates
defined by (4.3) and (4.12). Since the data z^t typically are described as realizations
of stochastic processes, the analysis has to be performed in a probabilistic
setting. It is a difficult problem to derive the finite sample properties of these
estimates, i.e. the properties for finite N. It is easier to establish what happens
asymptotically as N tends to infinity. Such analysis basically relies upon
(non-standard versions of) the law of large numbers and the central limit
theorem. For the current problem formulation, the analysis is carried out in
Ljung (1978) and Ljung and Caines (1979). The result is that

θ̂_N → θ* with probability one as N → ∞,   (5.1)

where

θ* = arg min_{θ∈D_M} V̄(θ),   (5.2)

V̄(θ) = lim_{N→∞} E V_N(θ, z^N)   (5.3)

and

√N (θ̂_N - θ*) ∈ AsN(0, P),   (5.4)

where

P = Q^{-1} H Q^{-1} ,   (5.5)

H = lim_{N→∞} E N V'_N(θ*, z^N) [V'_N(θ*, z^N)]^T ,   (5.6)

Q = V̄''(θ*).   (5.7)

Here θ̂_N is the estimate defined by the minimizing argument of V_N given by
(4.3). Prime and double prime denote differentiation, once and twice, respectively,
with respect to θ. Expectation E is over the stochastic process z^N.
Equation (5.4) means that the random variable √N (θ̂_N - θ*) converges in
distribution to the normal distribution with zero mean and covariance
matrix P.
An analogous result holds for the estimate defined by the solution of (4.12).

6. Numerical schemes for determining the estimates

Above we described two principles for identification methods, namely to
minimize V_N(θ, z^N) in (4.3) or to solve f_N(θ, z^N) = 0 in (4.12). In many cases
these functions may be fairly complex, and it is not obvious how to actually
obtain the estimate in practice. Such questions will be discussed in this section
(see also Åström and Bohlin, 1965; Gupta and Mehra, 1974).

6.1. General schemes

For the minimization of (4.3), the gradient of the criterion will play an
important role. Let us therefore introduce the notation

ψ(t, θ) = (d/dθ) ŷ(t|θ) = -(d/dθ) ε(t, θ)   (6.1)

(ψ is a d × p matrix) for the gradient of the prediction with respect to θ. Then

V'_N(θ, z^N) = (1/N) Σ_{t=1}^{N} [l'_θ(t, θ, ε(t, θ)) - ψ(t, θ) l'_ε(t, θ, ε(t, θ))]   (6.2)

(a d × 1 vector) and, in the quadratic case (4.6),

V'_N(θ, z^N) = -(1/N) Σ_{t=1}^{N} ψ(t, θ) Λ^{-1} ε(t, θ).   (6.3)

Standard search routines for numerical minimization of functions can now be
applied to (4.3). The general descent method is

θ̂_N^(i+1) = θ̂_N^(i) - μ_N^(i) [R_N^(i)]^{-1} V'_N(θ̂_N^(i), z^N) ,   (6.4)

where θ̂_N^(i) denotes the ith iterate when solving for the minimizing value θ̂_N. The
number μ is chosen so that

V_N(θ̂_N^(i+1)) < V_N(θ̂_N^(i)) .   (6.5)

The matrix R is chosen to (possibly) modify the gradient search direction.
Typical choices are

R_N^(i) = I   or   R_N^(i) = tr(V''_N(θ̂_N^(i), z^N)) I   (6.6a)

(gradient, unnormalized and normalized),

R_N^(i) = V''_N(θ̂_N^(i), z^N)   (6.6b)

(Newton algorithm),

R_N^(i) = (1/N) Σ_{t=1}^{N} ψ(t, θ̂_N^(i)) Λ^{-1} ψ^T(t, θ̂_N^(i))   (6.6c)

(Gauss-Newton for the choice (4.6)).
The iterations are terminated when the norm of V'_N is smaller than a given
value.
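The following sketch illustrates the damped scheme (6.4) with the Gauss-Newton choice (6.6c) for Λ = I and scalar output; pred_and_grad is a user-supplied routine (an assumption of ours) returning the stacked predictions ŷ(t|θ) and gradients ψ(t, θ), e.g. computed by the filters of Section 6.2:

```python
import numpy as np

def gauss_newton_minimize(theta0, pred_and_grad, y, n_iter=20, tol=1e-8):
    """Damped Gauss-Newton iterations (6.4)-(6.6c) for the quadratic
    criterion (4.3), (4.6) with Lambda = I and scalar output."""
    theta = np.asarray(theta0, dtype=float)
    N = len(y)
    for _ in range(n_iter):
        yhat, Psi = pred_and_grad(theta)     # Psi: (N, d), rows psi(t)^T
        eps = y - yhat
        grad = -Psi.T @ eps / N              # V'_N, cf. (6.3)
        R = Psi.T @ Psi / N                  # Gauss-Newton choice (6.6c)
        step = np.linalg.solve(R, grad)
        mu, V = 1.0, eps @ eps / N
        while mu > 1e-12:                    # halve mu until (6.5) holds
            yhat_new, _ = pred_and_grad(theta - mu * step)
            eps_new = y - yhat_new
            if eps_new @ eps_new / N < V:
                break
            mu /= 2
        theta = theta - mu * step
        if np.linalg.norm(grad) < tol:       # termination test on V'_N
            break
    return theta
```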

We now turn to the question of how to solve (4.12). Standard numerical
procedures are the substitution method

θ̂_N^(i) = θ̂_N^(i-1) + μ^(i) f_N(θ̂_N^(i-1), z^N),   (6.7)

and the Newton-Raphson method

θ̂_N^(i) = θ̂_N^(i-1) - μ^(i) [f'_N(θ̂_N^(i-1), z^N)]^{-1} f_N(θ̂_N^(i-1), z^N).   (6.8)

These are the natural counterparts of the gradient and the Newton methods,
respectively, for minimizing V_N(θ, z^N).

6.2. How to compute the gradient

In order to apply the schemes above for minimization of (4.3) we must compute
the gradient of the prediction. This could lead to more or less complex
calculations, depending on the character of the model. For the linear regression
model (3.2) we simply have

ψ(t, θ) = φ(t).   (6.9)

For the ARMAX model (2.6) we differentiate (3.3) with respect to the
parameters. This gives, for example,

(∂/∂a_k) ŷ(t|θ) + c_1 (∂/∂a_k) ŷ(t-1|θ) + ... + c_n (∂/∂a_k) ŷ(t-n|θ) = -y(t-k).
(6.10)

Repeating this for the other parameters gives

ψ(t, θ) + c_1 ψ(t-1, θ) + ... + c_n ψ(t-n, θ) = φ(t, θ),   (6.11)

where φ(t, θ) is given by (3.4c) and the c_i are the c-parameters corresponding
to θ. The calculation of ψ in this case thus corresponds to applying a simple
filtering operation to the vector φ, which has to be determined anyway. For the
linear model (3.6) we would obtain, conceptually, that

ξ(t+1, θ) = A(θ) ξ(t, θ) + B(θ) z(t),
(ŷ(t|θ) ; col ψ(t, θ)) = C(θ) ξ(t, θ),   (6.12)

where col means that the columns of ψ are stacked on top of each other.

6.3. Methods that utilize particular structures

Many particular identification methods have been developed that can be
interpreted as special schemes for minimizing (4.3) or solving (4.12). It would
lead beyond the space limitations of this chapter to give a comprehensive list of
such approaches. Let us however point out that the basic idea often is to treat
the problem as a sequence of least squares subproblems. This is possible, for
example, when the prediction ŷ(t|θ) has the structure

ŷ(t|θ) = θ^T φ(t, θ)   (6.13)

for some function φ(t, θ) of the data, as in (3.4a). With this structure, the
following approach to solving (4.12) is natural: determine the iterate θ̂^(i) by
solving

(1/N) Σ_{t=1}^{N} ζ(t, θ̂^(i-1)) (y(t) - θ^T φ(t, θ̂^(i-1))) = 0   (6.14)

for θ. This is a linear problem.
Such an iterative scheme for estimating θ in a model like (2.5) with {v(t)} having
rational spectral density has been suggested in Talmon and van den Boom (1973)
with ζ(t, θ) = φ(t, θ). Of course, a similar approach could be applied also to the
minimization problem.

7. Recursive identification methods

In many cases it is necessary or useful to have a model of a dynamical system
available on-line, as the system operates. The reason could be that the model is
needed in order to take some on-line decision, like choice of input signal
(adaptive control), tuning of a filter (e.g. adaptive equalization), fault detection
or monitoring in general. Such problems can be solved using recursive
identification (also called on-line or real-time identification, adaptive algorithms
and sequential parameter estimation), which means that measured input-output
data are processed recursively (sequentially) as they become available.
An off-line identification method can be seen as an arbitrary mapping from
z^t to θ̂_t:

θ̂_t = F(t, z^t),   (7.1)

where the function F may be implicitly defined, for example, as the minimizing
argument of some function. Such a general expression (7.1) cannot be used in a
recursive algorithm, since the evaluation of F may involve an unforeseen
amount of calculations, which perhaps may not be terminated by the next
sampling instant. Instead, a recursive algorithm must comply with the following
format:

X(t) = H(t, X(t-1), z(t)),
θ̂_t = h(X(t)).   (7.2)

Here X(t) is a vector of fixed dimension, which represents some 'information
state'. The functions H and h are explicit expressions that can be evaluated
with a fixed and a priori known amount of calculations. In that way it can be
secured that θ̂_t can be evaluated during a sampling interval.
Several different approaches for developing such recursive algorithms have
been pursued. One approach is stochastic approximation, which goes back to
the Robbins-Monro algorithm (Robbins and Monro, 1951) and has been
extensively pursued, among others, by Tsypkin (1971). Another approach is to
phrase the identification problem as a non-linear state-estimation (filtering)
problem, which also could be called a Bayesian approach. The extended
Kalman filter (EKF) is probably the best known representative of this
approach (see, e.g., Ljung, 1979). Adaptive observers represent a third approach
(see, e.g., Lüders and Narendra, 1974). We shall here follow a fourth
approach, namely to develop recursive algorithms from off-line counterparts. It
turns out that the resulting algorithms are very similar, regardless of the way
they are developed.

8. Recursive prediction error methods

It is clear that the general scheme (6.4) does in general not comply with the
structure (7.2) for a recursive algorithm. However, we can modify (6.4) to
achieve this. Conceptually, this is obtained by making one additional iteration
i-1 → i at the same time as the data record is advanced from N-1 to N. From
(6.4) we thus obtain

θ̂(t) = θ̂(t-1) - μ(t) R^{-1}(t) V'_t(θ̂(t-1), z^t),   (8.1)

where we have introduced the notation

θ̂(t) = θ̂_t^(t), etc.   (8.2)

Now, from (6.3) we see that (with Λ = I)

V'_t(θ, z^t) = ((t-1)/t) V'_{t-1}(θ, z^{t-1}) - (1/t) ψ(t, θ) ε(t, θ).   (8.3)

Now, introduce the approximating assumption that θ̂(t-1) actually minimized
V_{t-1}(θ, z^{t-1}), i.e.

V'_{t-1}(θ̂(t-1), z^{t-1}) = 0 .   (8.4)

This gives the algorithm

θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) ψ(t, θ̂(t-1)) ε(t, θ̂(t-1)),   (8.5)



where we introduced

γ(t) = μ(t)/t .   (8.6)

In case ψ(t, θ) and ε(t, θ) can be computed for any θ using only a fixed amount
(t-independent) of past data, (8.5) is indeed a recursive algorithm. This is
the case for the model (3.2), for which we have

ε(t, θ) = y(t) - θ^T φ(t),
ψ(t, θ) = φ(t).   (8.7)

With μ ≡ 1 and

R(t) = (1/t) Σ_{k=1}^{t} φ(k) φ^T(k) ,   (8.8)

(the Newton or the Gauss-Newton choice (6.6c)), (8.5)-(8.8) is the celebrated
recursive least squares algorithm. For the gradient choice (6.6a) we obtain the
LMS algorithm of Widrow, well known from many signal-processing
applications (see, e.g., Widrow et al., 1975).
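A sketch of (8.5)-(8.8) for the linear regression model (3.2), with γ(t) = 1/t (the loose initialization of R is a practical choice of ours, not part of the algorithm as stated):

```python
import numpy as np

def recursive_least_squares(Phi, y, delta=100.0):
    """Recursive LS in the form (8.5)-(8.8): gamma(t) = 1/t and
    R(t) = R(t-1) + gamma(t) [phi(t) phi(t)^T - R(t-1)].
    Phi: (N, d) array with rows phi(t)^T; delta*I initializes R loosely."""
    N, d = Phi.shape
    theta = np.zeros(d)
    R = np.eye(d) / delta
    for t in range(N):
        phi = Phi[t]
        gamma = 1.0 / (t + 1)
        eps = y[t] - theta @ phi                       # prediction error
        R = R + gamma * (np.outer(phi, phi) - R)       # averaged form of (8.8)
        theta = theta + gamma * np.linalg.solve(R, phi * eps)   # (8.5)
    return theta
```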
In general, it is not possible to compute ε(t, θ̂(t-1)) and ψ(t, θ̂(t-1)) using a
fixed amount of data. In such a case these quantities have to be replaced by
recursively computed approximations, which will be denoted by ε(t) and ψ(t). The
following simple example shows how to do this.
Consider the ARMAX model (2.6) for n = m = 1. From (3.3) and (6.11) we
easily find how to compute ε(t, θ) and ψ(t, θ) for any θ:

ε(t, θ) + c ε(t-1, θ) = y(t) + a y(t-1) - b u(t-1),   (8.9)

ψ(t, θ) + c ψ(t-1, θ) = (-y(t-1), u(t-1), ε(t-1, θ))^T .   (8.10)

Now, the obvious way of computing the approximations ε(t) and ψ(t) is to
replace the nominal value θ in (8.9) and (8.10) by the current estimate θ̂(t-1):

ε(t) + ĉ(t-1) ε(t-1) = y(t) + â(t-1) y(t-1) - b̂(t-1) u(t-1),   (8.11)

ψ(t) + ĉ(t-1) ψ(t-1) = (-y(t-1), u(t-1), ε(t-1))^T .   (8.12)

The complete algorithm for estimating a, b and c in the ARMAX model (2.6)
thus is (8.11) and (8.12) together with

θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) ψ(t) ε(t).   (8.13)

For the Gauss-Newton algorithm we take

R(t) = R(t-1) + γ(t) [ψ(t) ψ^T(t) - R(t-1)].   (8.14)
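Collecting (8.11)-(8.14), a sketch of the resulting recursive prediction error algorithm for the first-order ARMAX model (γ(t) = 1/t; the initialization and the crude clipping of ĉ, a stand-in for the stability projection of Section 11, are our own simplifications):

```python
import numpy as np

def rml_armax1(u, y):
    """Recursive Gauss-Newton prediction error estimation of (a, b, c)
    in the first-order ARMAX model, via (8.11)-(8.14)."""
    theta = np.zeros(3)                 # estimates of (a, b, c)
    R = 0.01 * np.eye(3)
    psi = np.zeros(3)
    eps_prev, y_prev, u_prev = 0.0, 0.0, 0.0
    for t in range(len(y)):
        a, b, c = theta
        # (8.11): eps(t) = y(t) + a*y(t-1) - b*u(t-1) - c*eps(t-1)
        eps = y[t] + a * y_prev - b * u_prev - c * eps_prev
        # (8.12): psi(t) = (-y(t-1), u(t-1), eps(t-1))^T - c*psi(t-1)
        psi = np.array([-y_prev, u_prev, eps_prev]) - c * psi
        gamma = 1.0 / (t + 1)
        R = R + gamma * (np.outer(psi, psi) - R)               # (8.14)
        theta = theta + gamma * np.linalg.solve(R, psi * eps)  # (8.13)
        theta[2] = np.clip(theta[2], -0.99, 0.99)  # keep the filters stable
        eps_prev, y_prev, u_prev = eps, y[t], u[t]
    return theta
```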

Obviously, the same idea can be applied to any model set:

"Derive the equations defining ŷ(t|θ), ε(t, θ) and ψ(t, θ) =
(d/dθ) ŷ(t|θ) for any nominal value θ. In order to obtain the approximations
ε(t) and ψ(t), replace at each time instant the nominal
value in these expressions by the latest available estimate".   (8.15)

Applied to the linear model (6.12) this gives a general recursive Gauss-Newton
algorithm

ε(t) = y(t) - ŷ(t),   (8.16a)
θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) ψ(t) ε(t),   (8.16b)
ξ(t+1) = A(θ̂(t)) ξ(t) + B(θ̂(t)) z(t),   (8.16c)
(ŷ(t+1) ; col ψ(t+1)) = C(θ̂(t)) ξ(t+1),   (8.16d)
R(t) = R(t-1) + γ(t) [ψ(t) ψ^T(t) - R(t-1)].   (8.16e)

For a general criterion function l(ε(t, θ)) instead of the quadratic one, the factor
ψ(t) ε(t) should be replaced by

ψ(t) l'(ε(t)) .

The algorithm (8.16) can be applied to arbitrary model sets, yielding a large
family of recursive identification methods. These are well known in special
cases. The 'recursive maximum likelihood method' (RML) is obtained when
(8.16) is applied to an ARMAX model, and the 'recursive generalized least
squares method' (RGLS) is obtained for models where the equation error is
described as an autoregression. The extended Kalman filter (EKF) is closely
related to (8.16) applied to the state-space model (2.12) (Ljung, 1979). See
Söderström et al. (1978) and Ljung and Söderström (1983) for a further
discussion.

9. Pseudolinear regressions

Let us now consider the correlation approach (4.12) applied to (3.4) with the
choice (4.15) for ζ. With the same approximations as those leading from (6.4) to
(8.5) we obtain

θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) φ(t, θ̂(t-1)) ε(t, θ̂(t-1)).   (9.1)

If ŷ(t|θ) and φ(t, θ) are subject to the structure (6.12) (with ψ = φ), and we use
the approximation (8.15), we arrive at the following algorithm:

ε(t) = y(t) - θ̂^T(t-1) φ(t),
θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) φ(t) ε(t),
ξ(t+1) = A(θ̂(t)) ξ(t) + B(θ̂(t)) z(t),   (9.2)
(ŷ(t+1) ; col φ(t+1)) = C(θ̂(t)) ξ(t+1),
R(t) = R(t-1) + γ(t) [φ(t) φ^T(t) - R(t-1)].

We call this scheme a pseudolinear regression (PLR) for the model (3.4), since it
is treated as though φ does not depend on θ, in which case it would be a linear
regression.

If this scheme is applied to the first-order ARMAX model, we obtain an
algorithm defined by (8.13), (8.14) and (8.11), where ψ(t) is replaced by

φ(t) = (-y(t-1), u(t-1), ε(t-1))^T .

PLR's are well known in a number of special cases. When applied to an
ARMAX model the PLR algorithm is known as extended least squares (ELS),
and when applied to the output error model (2.11), it is the scheme suggested
in Landau (1976).
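In code, the PLR/ELS variant differs from the rml_armax1 sketch given earlier only in how the regression vector is built; under the same assumptions:

```python
# PLR / extended least squares variant of the rml_armax1 loop body:
# use the raw regression vector phi(t) of (9.2) in place of the
# filtered gradient psi(t) of (8.12); everything else is unchanged.
psi = np.array([-y_prev, u_prev, eps_prev])   # no "- c * psi" filtering term
```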

Recursive instrumental variable methods

With a similar philosophy as in the previous two cases, we can transform the
general off-line expression (4.12), (4.13) into a recursive variant. This will look
like (8.16), with the essential difference that ψ is replaced by a recursively
computed vector ζ. Such an algorithm will be a recursive instrumental variable
method (RIV). See Söderström et al. (1978) and Young (1976) for a further
discussion.

10. Asymptotic properties

The recursively computed estimate θ̂(t) is a fairly complicated function of z^t.
Its statistical properties will thus be inherited from those of z^t in a complex
manner, and it is a difficult problem to compute them. Even if we confine
ourselves to the asymptotic properties of θ̂(t) as t tends to infinity, the problem
still is significant. While no detailed account of available methods and results
can be given here (see Ljung and Söderström, 1983, for that), some general
remarks will be made.
The convergence properties as t → ∞ of the general recursive schemes that
have been discussed here can be analysed using an associated ordinary
differential equation (ODE). Details of this approach are discussed in Ljung
(1977a), and the result can be summarized as follows.
Consider the algorithm (η is here a placeholder for ψ, φ or ζ, used in the
particular algorithms)

ε(t) = y(t) - ŷ(t),
θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) η(t) ε(t),
ξ(t+1) = A(θ̂(t)) ξ(t) + B(θ̂(t)) z(t),   (10.1)
(ŷ(t+1) ; col η(t+1)) = C(θ̂(t)) ξ(t+1),
R(t) = R(t-1) + γ(t) [η(t) η^T(t) - R(t-1)],

with the assumption

γ(t) → 0 as t → ∞.   (10.2)

Let ŷ(t|θ) and η(t, θ) be defined by (6.12) (with ψ = η) and let

ε(t, θ) = y(t) - ŷ(t|θ).

Let f and G be defined by

f(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} η(t, θ) ε(t, θ),   (10.3)

G(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} η(t, θ) η^T(t, θ),   (10.4)

with expectation over {z(t)}. Then the ODE associated with (10.1) is

(d/dτ) θ_D(τ) = R_D^{-1}(τ) f(θ_D(τ)) ,   (10.5a)

(d/dτ) R_D(τ) = G(θ_D(τ)) - R_D(τ) .   (10.5b)

Global asymptotic stability of a certain stationary point θ*, R* will then imply
that θ̂(t) as defined by (10.1) converges to θ* with probability one. In that way
the convergence properties of (10.1) can be investigated. The results can be
summarized as follows.

Recursive prediction error methods

The algorithm (8.16) will converge with probability one to θ*, such that θ* is
a (locally) minimizing value of V̄(θ), where

V̄(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} ½ ε²(t, θ).   (10.6)

This follows from the quoted convergence theory, in which the main step is
that f(θ), defined by (10.3), will be -V̄'(θ), the negative gradient of (10.6). The
function V̄(θ) can thus be used as a Lyapunov function to prove the required
stability of (10.5). It can also be shown that

√t (θ̂(t) - θ*) ∈ AsN(0, P),   (10.7)

meaning that the left-hand side converges in distribution to the normal distribution
with zero mean and covariance matrix P. This P is the same matrix as
for the corresponding off-line estimate given by (5.5)-(5.7); thus no loss of
accuracy is incurred asymptotically, despite the constraint of recursive calculations.
If θ* is such that {ε(t, θ*)} is a sequence of independent random variables with
zero mean values and covariance matrix I, then

P = [E ψ(t, θ*) ψ^T(t, θ*)]^{-1} .   (10.8)

These results are proven in Ljung and Söderström (1983) (see also Solo, 1978).

Pseudolinear regressions
For the PLR (9.2) less general results can be obtained. For special cases, like
the ARMAX case, it can be shown that the condition

|C_0(e^{iω}) - 1| < 1   for all ω   (10.9)

is sufficient for convergence, where

C_0(z) = z^n + c_1 z^{n-1} + ... + c_n

represents the true description of the noise characteristics in (2.6). For the
output error model (2.7) an analogous condition has to be imposed on the f
coefficients. These results are proven in Ljung (1977b) using the ODE method
and in Solo (1979) using a martingale approach.

The recursive instrumental variable approach

The asymptotic properties of recursively computed instrumental variable estimates
will mostly coincide with those of the corresponding off-line estimates.
See Söderström and Stoica (1981) for a detailed discussion.

11. Implementation aspects

The forms in which the algorithms have been given here, such as (10.1), are
well suited for discussion and analysis. However, they are not particularly well
prepared for numerical implementation. A common step is to define the
quantity

P(t) = γ(t) R^{-1}(t).

Simple algebra then shows that (10.1) can be rewritten as

ε(t) = y(t) - ŷ(t),   (11.1a)
S(t) = η^T(t) P(t-1) η(t) + λ(t),   (11.1b)
L(t) = P(t-1) η(t) S^{-1}(t),   (11.1c)
θ̂(t) = [θ̂(t-1) + L(t) ε(t)]_{D_s},   (11.1d)
P(t) = [P(t-1) - L(t) S(t) L^T(t)]/λ(t),   (11.1e)

λ(t) = (γ(t-1)/γ(t)) [1 - γ(t)].   (11.2)

The subscript D_s in (11.1d) indicates that the right-hand side should be
projected into the stability region

D_s = {θ | A(θ) has all eigenvalues strictly inside the unit circle}   (11.3)

(A(θ) is defined in (6.12)). For higher-order models, where the matrix P is of
high dimension, it may be required to resort to more sophisticated algorithms,
such as square-root and factorization methods. See Ljung and Söderström
(1983, Chapter 6) or Bierman (1977) for details.
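A sketch of the form (11.1) for the linear regression case (η(t) = φ(t); σ² is absorbed into P, the projection into D_s is omitted since no A(θ) arises for this model, a constant forgetting factor λ(t) ≡ λ is used, and names and initializations are ours):

```python
import numpy as np

def rls_forgetting(Phi, y, lam=0.99, p0=1e3):
    """Recursive LS in the numerically preferred form (11.1a)-(11.1e)
    with a constant forgetting factor lambda(t) = lam."""
    N, d = Phi.shape
    theta = np.zeros(d)
    P = p0 * np.eye(d)
    for t in range(N):
        phi = Phi[t]
        eps = y[t] - theta @ phi            # (11.1a)
        S = phi @ P @ phi + lam             # (11.1b)
        L = P @ phi / S                     # (11.1c)
        theta = theta + L * eps             # (11.1d), projection omitted
        P = (P - np.outer(L, L) * S) / lam  # (11.1e)
    return theta, P
```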

12. Some practical aspects

In order to achieve good performance of the algorithms discussed here, both
asymptotically and in the transient phases, there are a number of parameters to
be 'tuned'. Perhaps the most important of these is the gain sequence {γ(t)}, or
equivalently the sequence of 'forgetting factors' {λ(t)}, defined by (11.2).

Under ideal conditions, for a constant system, this gain should asymptotically
behave like

γ(t) = 1/t   [or λ(t) ≡ 1].   (12.1)

Even for constant, time-invariant systems it has been noted that much better
transient performance is achieved if t·γ(t) increases to 2-3 during the first
couple of hundred samples and then falls back to 1. This may be achieved by
the choice

λ(t) = λ_0 λ(t-1) + (1 - λ_0),
typically λ(0) ≈ 0.95, λ_0 ≈ 0.99.   (12.2)

Often, the real reason for using recursive identification is that the system's
properties change with time. Then γ(t) should not tend to zero but to a value
γ_0 > 0, chosen as a trade-off between tracking capability (large γ_0) and noise
insensitivity (small γ_0). Alternatively, λ(t) is chosen as a constant λ_0 < 1. The
rule of thumb is that with such a choice, the algorithm 'remembers' the last
1/(1 - λ_0) = 1/γ_0 data points. The search direction, defined by the matrix R in
the algorithms above, may have an important influence on the algorithms. The
Gauss-Newton direction (8.16e) normally gives the best behaviour, but is also
more demanding computationally.
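The gain schedule (12.2), together with the memory rule of thumb, as a small sketch (function name ours):

```python
def forgetting_schedule(N, lam0=0.99, lam_init=0.95):
    """Forgetting factors per (12.2):
    lambda(t) = lam0*lambda(t-1) + (1 - lam0), so lambda(t) -> 1
    and the gain behaves like 1/t asymptotically."""
    lam, out = lam_init, []
    for _ in range(N):
        lam = lam0 * lam + (1.0 - lam0)
        out.append(lam)
    return out

# With a constant lambda_0 < 1, the algorithm 'remembers' roughly
# 1/(1 - lambda_0) points, e.g. 100 points for lambda_0 = 0.99.
```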

13. Conclusions

We have in this chapter discussed different possibilities for estimating
parameters in models of dynamical systems. We have pointed out the role of the
predictor as a unifying concept in model descriptions. We have also stressed
how this concept, and the related prediction error, can be used to give a
coherent presentation of the different possible estimation methods.
Recursive algorithms have been of particular interest here. We have derived
such schemes by using and displaying their close relationship to the corresponding
off-line methods.

References

Åström, K. J. (1980). Maximum likelihood and prediction error methods. Automatica 16, 551-574.
Åström, K. J. and Bohlin, T. (1965). Numerical identification of linear dynamic systems from
normal operating records. IFAC Symposium on Self-Adaptive Systems, Teddington, England.
Also in P. H. Hammond, ed., Theory of Self-Adaptive Control Systems. Plenum Press, New York.
Bierman, G. J. (1977). Factorization Methods for Discrete Sequential Estimation. Academic Press,
New York.
Eykhoff, P. (1974). System Identification. Wiley, London.
Eykhoff, P., ed. (1981). Trends and Progress in System Identification. Pergamon Press, Oxford.
Goodwin, G. C. and Payne, R. L. (1977). Dynamic System Identification: Experiment Design and
Data Analysis. Academic Press, New York.
Gupta, N. K. and Mehra, R. K. (1974). Computational aspects of maximum likelihood estimation
and reduction in sensitivity function calculations. IEEE Trans. Automat. Control AC-19, 744-783.
Landau, I. D. (1976). Unbiased recursive identification using model reference techniques. IEEE
Trans. Automat. Control AC-21, 194-202.
Ljung, L. (1977a). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Control
AC-22, 551-575.
Ljung, L. (1977b). On positive real transfer functions and the convergence of some recursions.
IEEE Trans. Automat. Control AC-22, 539-551.
Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Trans.
Automat. Control AC-23, 770-783.
Ljung, L. (1979). Asymptotic behaviour of the extended Kalman filter as a parameter estimator for
linear systems. IEEE Trans. Automat. Control AC-24, 36-50.
Ljung, L. and Caines, P. (1979). Asymptotic normality of prediction error estimation for
approximate system models. Stochastics 3, 29-46.
Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identification.
M.I.T. Press, Cambridge, MA.
Lüders, G. and Narendra, K. S. (1974). Stable adaptive schemes for state estimation and
identification of linear systems. IEEE Trans. Automat. Control AC-19, 841-847.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22,
400-407.
Söderström, T. and Stoica, P. (1981). Comparison of instrumental variable methods - consistency
and accuracy aspects. Automatica 17, 101-115.
Söderström, T., Ljung, L. and Gustavsson, I. (1978). A theoretical analysis of recursive
identification methods. Automatica 14, 231-244.
Solo, V. (1978). Time series recursions and stochastic approximation. Ph.D. dissertation, The
Australian National University, Canberra, Australia.
Solo, V. (1979). The convergence of AML. IEEE Trans. Automat. Control AC-24, 958-963.
Strejc, V. (1980). Least squares parameter estimation. Automatica 16, 535-550.
Talmon, J. L. and van den Boom, A. J. W. (1973). On the estimation of transfer function
parameters of process and noise dynamics using a single-stage estimator. Proc. 3rd IFAC
Symposium on Identification and System Parameter Estimation, The Hague. North-Holland,
Amsterdam.
Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems. Academic Press, New York.
Widrow, B., Glover, J. R. Jr. et al. (1975). Adaptive noise cancelling: principles and applications.
Proc. IEEE 63, 1692-1716.
Young, P. C. (1970). An instrumental variable method for real-time identification of a noisy
process. Automatica 6, 271-287.
Young, P. C. (1976). Some observations on instrumental variable methods of time-series analysis.
Internat. J. Control 23, 593-612.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 213-255

8

Recursive Identification, Estimation and Control

Peter Young

1. Introduction

Conventional time-series analysis normally involves the batch processing of a
fixed block of data. This is achieved either in a single set of arithmetic
operations, such as the solution of the normal equations in ordinary least
squares estimation, or in some iterative manner, as when hill climbing, iterative
least squares or some other relaxation procedures are used to maximise a
likelihood function. In contrast to this, recursive analysis involves a sequential
technique in which estimates are updated continuously, whilst working serially
through the data in temporal order, as shown in Fig. 1.
This recursive procedure introduces an extra dimension to estimation: in
addition to the en bloc estimates based on the complete data set of N samples,
the analyst is also able to obtain estimates of the parameters for up to N
subsets of the data, in a computationally elegant and efficient manner. This not
only provides, as we shall see in this chapter, useful information on the nature
of both the time-series model and the estimates, but it also opens up the way
for the estimation of time-variable parameters and states in stochastic-dynamic
systems. In other words, it provides a basis for sophisticated new methods of

[Figure: schematic contrasting the types of data processing: (1) batch (en bloc) processing; (2) recursive processing; and iterative processing.]

Fig. 1. Recursive and iterative data processing.


time-series analysis which seem to have wide potential for application to
nonstationary and even nonlinear dynamic systems (e.g. Young, 1978, 1982a;
Priestley, 1980).
The recursive approach to estimation can be traced back to Gauss (1821-
1826), although it is also linked, quite properly, with the name of R. L. Plackett
who re-discovered the results of Gauss in 1950 and translated them into more
useful vector-matrix terms. However, it was R. E. Kalman who, in 1960,
initiated the most recent research on the subject with the publication of his
now famous paper on " A new approach to linear filtering and prediction
problems".
In providing a time-domain solution to the filtering problem posed earlier by
Wiener (1949), Kalman presents a perfectly general approach to the estimation
of the state variables in a linear stochastic, dynamic system which is observed in
the presence of noise. But, as Kalman himself admitted, it is not a complete
solution of the time-series analysis problem for such a system. For example, the
theory assumes perfect knowledge of the parameters in the model of the
system and Kalman pointed out that a more complete solution of the problem
would require the simultaneous estimation of the states and parameters of the
system model. As he said on page 40 of his paper, "This is a subtle and presently
largely unsolved problem in experimentation and data reduction . . . . To be
sure the two problems (parameter and state estimation) should be solved
jointly if possible; the author is not aware, however, of any study of the joint
optimization problem".
Since the publication of Kalman's paper, various methods for simultaneously
estimating the states and parameters of a linear stochastic, dynamic system
have been proposed, of which the most popular is the 'extended' or 're-
linearised' Kalman filter (see, e.g., Jazwinski, 1970; Young, 1974). In this
extended Kalman filter (EKF), a process of successive re-linearisation of the
nonlinear estimation equations, about the current recursive estimates, is util-
ised in order to allow for direct application of the Kalman recursive equations.
But even with certain apparent improvements suggested recently by Ljung
(1979b) the EKF is not a perfect solution to the problem: it can be statistically
inefficient and it may produce a misleading estimate of the error-covariance
matrix associated with the state-parameter estimates.
A more recent solution to the problem (Young, 1976, 1984) takes a rather
different approach. Rather than considering estimation from the standpoint of
nonstationary state-variable estimation, it is based on a more conventional
formulation, in which the primary problem is interpreted as the consistent and
asymptotically efficient estimation of the time-invariant parameters of the
system model. State-variable estimation is not explicitly attempted at all; rather
it emerges naturally as an implicit part of the recursive estimation procedure
for the time-invariant parameters. Having achieved optimal estimation of the
parameters, however, it is possible (if the analyst so desires) to recover the
state estimates as a linear combination of signals from certain adaptive
prefilters that are an inherent feature of the parameter estimation algorithm.


Whether the analyst will wish to generate estimates of state variables or not
will depend upon the objectives of his analysis. In automatic control terms, for
example, he may wish to implement a state-space designed control scheme
using explicit state-variable feedback control; in which case he will wish to use
this facility. As we shall see, however, one attraction of the proposed new
procedures is that they allow for methods of optimally controlling a linear
stochastic system completely in the observation space, without explicit resort to
state-variable estimation. As such, they open up the possibility of a new era in
time-domain, stochastic control system design which is carried out directly in
the observation space of the system under study. Here, the analyst will be able
to proceed directly from measured time-series data to control system design in
a straightforward and systematic series of computer-aided analytical steps.
This chapter first describes the recursive estimation procedures proposed by
the author and his co-workers for the well-known transfer function (TF or
Box-Jenkins) time-series model. It then shows how these algorithms can form
the basis for a general recursive approach to identification, estimation and
control for a stochastic system that can be adequately described by such a TF
model. As in the rest of this book, the aim is to introduce the reader to the
computational procedures and demonstrate their practical utility, rather than to
present a rigorous exposition on the subject. Indeed, although certain asymp-
totic theoretical analysis of the algorithms described here is available (Solo,
1978), more work will be required before a completely satisfactory theoretical
analysis is possible. At the moment, the primary justification for the procedures
is that they appear to work rather well in practice and have been evaluated
quite comprehensively by Monte Carlo simulation (Young and Jakeman, 1979,
1980, 1981, 1983). These stochastic simulation results, which are supported in
part by theoretical analysis (e.g. Jakeman and Young, 1983), also help to
demonstrate the robustness of the recursive techniques to incorrect assump-
tions about both the model form and the noise characteristics.
In order to emphasise the practical utility of the estimation algorithms, this
chapter will also outline the main features of a microcomputer program
package MICROCAPTAIN, that is based on recursive estimation and utilises
some of the algorithms described here. MICROCAPTAIN is programmed on
an APPLE microcomputer and has been used in numerous practical ap-
plications, some of which are mentioned in later sections of this chapter.

2. The transfer function model

The major time-series model form considered here is the transfer function
(TF) model, which came into prominence with the publication in 1970 of Box
and Jenkins' book Time Series Analysis: Forecasting and Control. This
model is shown diagrammatically in Fig. 2 and can be represented in the
216 ~ Young

[Figure: block diagram in which the input u_k passes through the transfer function B(L)/A(L), and the noise ξ_k, generated from e_k through D(L)/C(L), is added to form the output y_k.]

Fig. 2. The transfer function (TF) time-series model.

following operator form:

y_k = (B(L)/A(L)) u_k + (D(L)/C(L)) e_k ,   (1)

where u_k, y_k and e_k are, respectively, the assumed deterministic (noise-free)
input, the observed noisy output and a zero mean, serially uncorrelated
sequence of random variables ('white noise') with variance σ², which is the
source of all stochastic disturbances to the system. In equation (1), A(L), B(L),
C(L) and D(L) are finite-dimensional polynomials of the form

A(L) = 1 + a_1 L + ... + a_n L^n ,
B(L) = b_0 + b_1 L + ... + b_n L^n ,
C(L) = 1 + c_1 L + ... + c_n L^n ,   (2)
D(L) = 1 + d_1 L + ... + d_n L^n ,

where L is the backward shift operator (sometimes denoted by B or z^{-1}), i.e.
L y_k = y_{k-1}. Here, for notational convenience, all polynomials are considered to
be of order n, although different order polynomials can be accommodated by
attaching zero coefficients to appropriate terms.
An alternative version of equation (1), which demonstrates its 'errors-in-variables'
structure (Johnson, 1963; Jakeman and Young, 1983), is in the form
of the following three connected equations:

Deterministic system:  A(L) x_k = B(L) u_k ,   (3.i)
ARMA noise model:  C(L) ξ_k = D(L) e_k ,   (3.ii)
Observation or output equation:  y_k = x_k + ξ_k .   (3.iii)

Here x_k is the hypothetical 'noise-free' output of the system (i.e. that part of
the observed output y_k that is causally related to the input u_k) and ξ_k is the
'coloured noise' which is generated from e_k by the ARMA model (3.ii).
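A sketch of how data obeying (3.i)-(3.iii) can be simulated (first-order polynomials; the function name and the coefficient values are arbitrary illustrations of ours, not taken from the text):

```python
import numpy as np

def simulate_tf(u, a=(-0.8,), b=(0.5,), c=(-0.7,), d=(0.2,), sigma=0.1, seed=0):
    """Simulate y_k = x_k + xi_k with A(L)x_k = B(L)u_k (3.i) and
    C(L)xi_k = D(L)e_k (3.ii); A, C, D monic as in (2), b = (b_0, b_1, ...)."""
    rng = np.random.default_rng(seed)
    e = sigma * rng.standard_normal(len(u))     # white noise e_k
    x = np.zeros(len(u))                        # noise-free output x_k
    xi = np.zeros(len(u))                       # coloured noise xi_k
    for k in range(len(u)):
        x[k] = -sum(ai * x[k-1-i] for i, ai in enumerate(a) if k-1-i >= 0) \
               + sum(bj * u[k-j] for j, bj in enumerate(b) if k-j >= 0)
        xi[k] = e[k] \
                - sum(ci * xi[k-1-i] for i, ci in enumerate(c) if k-1-i >= 0) \
                + sum(dj * e[k-1-j] for j, dj in enumerate(d) if k-1-j >= 0)
    return x + xi
```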
Equation (1) is a perfectly general linear time-series model that seems widely
applicable to practical problems. And it can be related, quite simply, to other
well-known time-series model forms; for example, if C(L) ≡ A(L) then equation
(1) can be written

A(L) y_k = B(L) u_k + D(L) e_k ,   (4)

which is the popular ARMAX model (e.g. Hannan and Tanaka, 1976). In
certain practical applications, however, it may not be possible to guarantee that
the noise ξ_k has rational spectral density, as required by the ARMA assumption.
One advantage of the techniques described in this chapter is, as we shall
see, that they are based on an instrumental variable (IV) approach to estimation
(e.g. Kendall and Stuart, 1961; Young, 1976). As a result, the consistency
of the estimates is maintained, even if the noise does not possess rational
spectral density, provided only that the stochastic disturbances that contribute
to ξ_k are independent of u_k. The latter requirement could limit, to some
extent, the on-line (real-time) utility of the algorithms when used in regulatory
feedback control systems, where the control input signals are intentionally
made functions of the observed system outputs.
Finally, in relation to the TF model, it should be noted that all of the
techniques discussed in subsequent sections of this chapter can be applied to
the multivariable or vector form of equations (3), i.e.

A(L) x_k = B(L) u_k ,
C(L) ξ_k = D(L) e_k ,   (5)
y_k = x_k + ξ_k ,

where all the variables x_k, u_k, ξ_k, e_k and y_k are now vector quantities
(i.e. u_k and y_k are composed, respectively, of the measured inputs and outputs
of a multi-input, multi-output system), while the polynomials A(L), B(L), etc.
are appropriately dimensioned matrix polynomials in L (see Jakeman and
Young, 1979). However, the complexity of the estimation algorithms is much
increased in this vector case and the difficulty of applying them in practice
should not be underestimated.

3. Recursive algorithms for estimating the TF model parameters

Over the past twenty years, numerous recursive methods have been suggested
for estimating the parameters in time-series models. On the other hand,
if we assume here that the user demands an algorithm which can, if necessary,
generate estimates with the desirable statistical properties of consistency and
asymptotic efficiency, then the choice is much more limited. Currently, there are
only two classes of recursive estimation procedure which possess these properties:
the prediction error (PE) approach suggested by Ljung (1979) and others, and the
related optimal generalised equation error (OGEE) method of Young and

Jakeman (1979, 1980), which is asymptotically efficient when applied to the TF
model. Before discussing these sophisticated approaches, however, we shall
outline the simple recursive least squares algorithm, which is only strictly
applicable to a special form of model (1) with a limited noise structure.

3.1. The recursive least squares algorithm

If in equation (1) we assume that D(L) = 1 and C(L) ≡ A(L), then

A(L) y_k = B(L) u_k + e_k   (6)

or

y_k = -a_1 y_{k-1} - ... - a_n y_{k-n} + b_0 u_k + ... + b_n u_{k-n} + e_k .   (7)

Equation (7) can be written in the following vector form:

y_k = z_k^T a + e_k ,   (8)

where

z_k^T = [-y_{k-1}, ..., -y_{k-n}, u_k, ..., u_{k-n}]

and

a = [a_1, a_2, ..., a_n, b_0, ..., b_n]^T .

The recursive least squares (LS) algorithm for estimating the parameter vector
a is well known (see, e.g., Young, 1974 for a tutorial paper on this subject) and
takes the form

â_k = â_{k-1} + P_{k-1} z_k [σ² + z_k^T P_{k-1} z_k]^{-1} {y_k - z_k^T â_{k-1}} ,   I(i)

P_k = P_{k-1} - P_{k-1} z_k [σ² + z_k^T P_{k-1} z_k]^{-1} z_k^T P_{k-1} .   I(ii)

Alternatively, I(i) can be written in the form

â_k = â_{k-1} + (P_k / σ²) z_k e_k ,   I(iii)

where e_k = y_k - ŷ_{k|k-1} and ŷ_{k|k-1} = z_k^T â_{k-1}. The algorithm I(iii) and I(ii) can be
compared directly with the more sophisticated algorithms outlined in Subsections
3.2 and 3.3.

3.2. Prediction error (PE) approach

First note that, for model (1), e_k is given by

e_k = (C/D) [y_k - (B/A) u_k] ,   (9)

where the nomenclature for the polynomials has been simplified for convenience
by removal of the L operator. It is then possible to estimate


recursively the vector 0, composed of all the unknown coefficients in the
polynomials A, B, C and D, by recursively minimising the cost function
N
J=~e k, 0o)
k=l
where
(11)

and the 'hat' denotes estimated values.


The recursive PE algorithm for achieving this minimisation takes the form

θ̂_k = θ̂_{k-1} + (P_k/σ²) ψ_k ê_k ,                                                  II(i)

P_k = P_{k-1} - P_{k-1} ψ_k [σ² + ψ_k^T P_{k-1} ψ_k]^{-1} ψ_k^T P_{k-1} .            II(ii)

Here θ̂_k is the estimate of θ, where

θ = [a_1, ..., a_n, b_0, ..., b_n, c_1, ..., c_n, d_1, ..., d_n]^T                   (12)
and
ψ_k = -∇_θ̂(ê)_k ,                                                                   (13)

where ∇_θ̂(ê)_k denotes a vector composed of partial derivatives of ê with respect
to each element of θ̂ in turn, evaluated at the kth sampling instant. Note that
in algorithm II, P_k is a full (4n+1) × (4n+1) matrix and the generation of ψ_k will require
the use of adaptive 'prefilters'. The form and parameters of these prefilters will
depend upon the estimated parameters in θ̂ and will need to be updated
adaptively in some manner (e.g. Ljung, 1979).
The recursive PE algorithm for the TF model proposed by Ljung (1979) has
not been implemented in practice and so it is a relatively unknown quantity in
practical terms. It is, however, closely related to the equivalent OGEE
algorithm discussed below.

3.3. Optimal generalized equation error (OGEE) approach


Here, we note that an alternative expression for e_k from model (1) is

e_k = (C/DA)[A y_k - B u_k]                            (14)
or
e_k = A y*_k - B u*_k ,                                (15)

where the star indicates that the variables are filtered by a prefilter with a

transfer function C/DA. We can, therefore, attempt to estimate the coefficients
in the A, B, C and D polynomials by recursively minimising a cost function

J = Σ_{k=1}^{N} ê_k² ,                                 (16)

where

ê_k = Â y*_k - B̂ u*_k .                                (17)

The similarity of this approach to the PE method is obvious (Young, 1981b),
but it is interesting to note that, since ê_k in (17) is linear in the unknown
parameters of the Â and B̂ polynomials (assuming we know the prefilter
parameters), it is possible to decompose the estimation problem into two
coordinated sub-problems, each solved recursively. This decomposition is not
only convenient but can also be justified in theoretical terms for the TF model
(Young and Jakeman, 1979).
The first of the two coordinated recursive algorithms is based on an 'in-
strumental variable' modification of the recursive least squares equations and it
is used to estimate the coefficients of the A and B polynomials in model (3.i),
which represents the deterministic part of the system. In this refined in-
strumental variable (IV) algorithm, an estimate â of the parameter vector
a = [a_1, a_2, ..., a_n, b_0, b_1, ..., b_n]^T is obtained in the following recursive
manner:

â_k = â_{k-1} + (P̂_k/σ²) x̂_k e_k ,                                                   III(i)

P̂_k = P̂_{k-1} - P̂_{k-1} x̂_k [σ² + z*_k^T P̂_{k-1} x̂_k]^{-1} z*_k^T P̂_{k-1} ,        III(ii)

where

x̂_k = [-x̂*_{k-1}, ..., -x̂*_{k-n}, u*_k, ..., u*_{k-n}]^T ,                          (18)
z*_k = [-y*_{k-1}, ..., -y*_{k-n}, u*_k, ..., u*_{k-n}]^T ,                          (19)

and x̂_k is an estimate of the hypothetical 'noise-free' system output x_k in (3.i),
obtained from an adaptive auxiliary model (Young, 1976) of the form

Â(L)x̂_k = B̂(L)u_k .                                    (20)
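The following minimal Python sketch illustrates the IV update and the auxiliary model (20) for a first-order model, with the adaptive prefilters omitted (as in the basic implementation discussed in Section 6); all numerical values and names are illustrative, and a short least squares run is used to seed the auxiliary model, in the spirit of the recursive-iterative scheme.

    import numpy as np

    def iv_update(a_prev, P_prev, x_iv, z, y, sigma2=1.0):
        """One IV recursion: the instrument x_iv replaces z in the gain,
        while z^T is retained on the right (nonsymmetric version)."""
        e = y - z @ a_prev                           # recursive residual
        denom = sigma2 + z @ P_prev @ x_iv           # sigma^2 + z*'P x_hat
        gain = P_prev @ x_iv / denom
        a = a_prev + gain * e                        # III(i)
        P = P_prev - np.outer(gain, z @ P_prev)      # III(ii)
        return a, P

    # Illustrative TF data: A(L)x = B(L)u with a1 = -0.7, b0 = 1.0, observed
    # through coloured noise, y = x + xi (values assumed for the example).
    rng = np.random.default_rng(1)
    N = 400
    u = rng.choice([-1.0, 1.0], size=N)
    x = np.zeros(N); xi = np.zeros(N)
    for k in range(1, N):
        x[k] = 0.7 * x[k - 1] + 1.0 * u[k]
        xi[k] = 0.9 * xi[k - 1] + 0.3 * rng.standard_normal()
    y = x + xi

    a_hat, P = np.zeros(2), 1e3 * np.eye(2)
    x_aux = np.zeros(N)                              # auxiliary model output, (20)
    for k in range(1, N):
        x_aux[k] = -a_hat[0] * x_aux[k - 1] + a_hat[1] * u[k]
        z = np.array([-y[k - 1], u[k]])
        # Plain LS for the first 50 samples seeds the auxiliary model; the
        # auxiliary model output then supplies the instrument.
        x_iv = z if k < 50 else np.array([-x_aux[k - 1], u[k]])
        a_hat, P = iv_update(a_hat, P, x_iv, z, y[k])
    print(a_hat)                                     # converges towards [-0.7, 1.0]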

Note that the use of instrumental variables in the solution of this problem
arises naturally when the estimation problem for the TF model (1) is con-
sidered within the context of maximum likelihood (Young, 1976). However, it
is also justified intuitively: as we have seen, the TF model is of the errors-in-
variables type and the IV method is well known as the simplest approach to the
consistent estimation of parameters in such models (e.g. Kendall and Stuart,
1961). The advantage of the refined IV method for the TF model is that it is
'optimal' in the sense that the particular choice of IV vector in (18) results in
asymptotically efficient (i.e. minimum variance) as well as consistent estimates

(see also Stoica and Söderström, 1983). This optimal behaviour has been
discussed theoretically by Solo (1978) and demonstrated in practice, using
Monte Carlo stochastic simulation techniques, by Young and Jakeman (1979).
It should also be noted that a 'symmetric matrix' version of algorithm III is
possible, in which z*_k^T in III(ii) is replaced by x̂_k^T wherever it occurs. Both
algorithms perform similarly in asymptotic statistical terms (see, e.g., Solo,
1978) but the symmetric matrix version appears computationally less robust in
general use (Young and Jakeman, 1979). Nevertheless, the symmetric matrix
version of III(ii) is useful, subsequent to convergence of the more robust
nonsymmetric matrix version, for generating a good estimate of the error-
covariance matrix associated with the parameter estimates.
In order to allow for the adaptive prefiltering implicit in the star superscript
notation of (18) and (19), a second recursive estimator, the refined approximate
maximum-likelihood (AML) algorithm, is used to obtain recursive estimates of
the parameters in the C and D polynomials. If we denote the estimate of the
noise model parameter vector c = [c_1, ..., c_n, d_1, ..., d_n]^T by ĉ, then the
refined AML algorithm takes the form

ĉ_k = ĉ_{k-1} + (P^N_k/σ²) m_k e^N_k ,                                                   IV(i)

P^N_k = P^N_{k-1} - P^N_{k-1} m_k [σ² + m_k^T P^N_{k-1} m_k]^{-1} m_k^T P^N_{k-1} ,      IV(ii)

where e^N_k is obtained from an 'inverse noise model' of the form

e^N_k = ξ̂_k - m_k^T ĉ_{k-1}                            (21)

and

m_k = [-ξ̂**_{k-1}, ..., -ξ̂**_{k-n}, ê**_{k-1}, ..., ê**_{k-n}]^T .                      (22)

The variables ξ̂_k and ê_k are, respectively, estimates of the noise variables ξ_k and
e_k, obtained recursively as the estimation proceeds, while the double star
superscript indicates that the variables are adaptively prefiltered, in this case by
a prefilter of the form 1/D̂. This algorithm is similar to the recursive maximum-
likelihood (RML) procedure of Söderström (1973) but here it is applied in the
purely stochastic situation. Full details of these refined IVAML algorithms,
together with a description of their coordination, are given by Young and
Jakeman (1979). It should be noted, however, that in off-line applications, a
recursive-iterative solution is utilised: here the recursive estimates are obtained
after several iterative steps through the whole time-series data set, starting with
an initial recursive least squares evaluation as the first step (see, e.g., Young
and Jakeman, 1979). In this way, improved efficiency is obtained, especially for
small sample sizes. It is this recursive-iterative algorithm which is primarily
considered in this chapter.
The similarity between the refined IVAML and PE inspired algorithms for
the TF model becomes clear when we note that ψ_k in (13) is defined as

ψ_k = [x̂_k^T, m_k^T]^T ,

with x̂_k and m_k given by (18) and (22), respectively. In contrast to the PE


mechanisation, however, the refined I V A M L algorithms have much fewer
computational requirements because, the estimates of the parameters in the
system and noise models ((3.i) and (3.ii), respectively) are obtained by separate
but coordinated algorithms. In particular, the/5 k and P ~ are only of dimension
(2n + 1)x (2n + 1) and 2n x 2n, respectively, while Pk in II is of dimension
(4n + 1) x (4n + 1). As a result, there are a total of 2(2n + 1) × 2n less elements
to update in III(ii) and IV(ii) than in the equivalent II(ii) of the PE algorithm
(although, since Pk is symmetric, this only involves (2n + 1)x 2n additional
computations at each update). If the algorithmic decomposition used in refined
I V A M L is justified, therefore, it is particularly advantageous in practical terms.
Finally, it can be argued that both the refined IVAML and PE inspired
algorithms will converge under fairly general conditions and that, upon con-
vergence, the P̂_k, P^N_k and P_k matrices provide reasonable estimates of the
covariance matrices for the estimate vectors â_k, ĉ_k and θ̂_k, respectively (Solo,
1978; Young and Jakeman, 1979).¹ Furthermore, Pierce (1972) has shown that
the maximum-likelihood estimate ĉ of the parameter vector c is asymptotically
independent of the estimate â, thus justifying the decomposition used in the
design of the refined IVAML algorithms for large sample sets. Although
theoretical analysis is difficult in the small-medium sample-size situations,
comprehensive Monte Carlo simulation experiments (Young and Jakeman,
1979; Jakeman and Young, 1979) have indicated that similar results appear to
hold in these more realistic conditions. It should be noted, however, that all of
the stochastic simulation exercises have investigated the off-line recursive-
iterative version of the refined IVAML algorithm mentioned previously; the
on-line, fully recursive solution has not been considered in nearly so much
detail and its practical performance is not as well established. In general, it
must be assumed that the on-line version will not be as robust in practical
applications as the off-line alternative.

4. State-variable estimation

The best known approach to the estimation of state variables in linear,
stochastic, dynamic systems is the Kalman filter method proposed by Kalman
in 1960. In the single-input, single-output case under primary consideration
here, the filter generates an estimate x̂_k of the state vector x_k =
[x_1, x_2, ..., x_n]^T on the basis of noisy observations y_k of the system output and
assumed perfect knowledge of the input u_k. The analogy with the parameter
estimation problem discussed in previous sections is obvious.

¹Computationally, it is usual to set σ² = 1.0 in algorithms I to IV and then the covariance
matrices are obtained as σ̂²P̂_k, σ̂²P^N_k and σ̂²P_k (Young, 1984).
It is well known (e.g. Åström, 1970) that the Kalman filter equations
themselves constitute a description of the stochastic system, the so-called
'innovations representation'. This can be written in the following 'state-space'
form:

x̂_k = F x̂_{k-1} + g u_{k-1} + k_k e_{k-1} ,            (23.i)

y_k = h^T x̂_k + e_k ,                                  (23.ii)

where, without loss of generality, we can assume the following canonical form
for {F, g, h}:

        [ -a_1      1   0  ...  0 ]        [ b_1 ]
        [ -a_2      0   1  ...  0 ]        [ b_2 ]
    F = [   .       .   .   .   . ] ,  g = [  .  ] ,  h^T = [1, 0, ..., 0] .
        [ -a_{n-1}  0   0  ...  1 ]        [  .  ]
        [ -a_n      0   0  ...  0 ]        [ b_n ]

In (23), x̂_k is the 'conditional mean' state vector (the optimal estimate of the state
x_k); e_k is the 'innovations' sequence defined by e_k = y_k - ŷ_k, where ŷ_k = h^T x̂_k;
and k_k = [k_1, k_2, ..., k_n]^T is the Kalman gain vector.
If the system is stationary in statistical terms, then k_k eventually reaches a
steady-state, asymptotic value k = k_∞ and, in this condition, e_k constitutes a
zero mean, white noise sequence with constant variance σ². Furthermore, it is
easy to show, by converting (23) into L operator form, that in this asymptotic
situation, y_k is generated in the 'observation space' by the ARMAX model (4),
i.e.

y_k = (B(L)/A(L)) u_k + (D(L)/A(L)) e_k ,   with B(L) = b_1 L + ... + b_n L^n ,

where d_i = a_i + k_i, i = 1, 2, ..., n. The implications of this result are clear: if we
utilise a recursive algorithm to estimate the parameters in the ARMAX model
(4) (or its equivalent, the TF model (1)), then knowledge of these estimates will
enable us to synthesise the Kalman filter equations (23).
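A minimal sketch of this synthesis step follows: given estimated ARMAX parameters a_i, b_i and d_i, the steady-state Kalman gain follows from k_i = d_i - a_i and {F, g, h} is assembled in the canonical form of (23). The numerical values below are illustrative, not estimates from the text.

    import numpy as np

    def innovations_filter(a, b, d):
        """Assemble {F, g, h} in the canonical form of (23) and the
        steady-state Kalman gain k, using k_i = d_i - a_i."""
        a, b, d = (np.asarray(v, dtype=float) for v in (a, b, d))
        n = len(a)
        F = np.zeros((n, n))
        F[:, 0] = -a                    # first column carries -a_1, ..., -a_n
        F[:n - 1, 1:] = np.eye(n - 1)   # shifted identity block above the diagonal
        g = b.copy()                    # g = [b_1, ..., b_n]^T
        h = np.zeros(n); h[0] = 1.0     # h^T = [1, 0, ..., 0]
        k = d - a                       # steady-state Kalman gain
        return F, g, h, k

    # Illustrative second-order values (not taken from the chapter):
    F, g, h, k = innovations_filter(a=[-1.5, 0.7], b=[0.2, 0.0], d=[-0.8, 0.4])
    x_hat = np.zeros(2)
    u_prev, e_prev = 1.0, 0.05          # previous input and innovation
    x_hat = F @ x_hat + g * u_prev + k * e_prev      # one recursion of (23.i)
    y_pred = h @ x_hat                               # one-step prediction, (23.ii)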
On the other hand, we might ask if this process is really necessary: the
algorithms described in Section 3 all incorporate adaptive 'prefilters'; and we
might suspect that the outputs of these filters could be utilised directly to
generate the optimal state estimates. It can be shown (Young, 1979) that this is
indeed the case and that the 'asymptotic' state estimates can be obtained from

the following linear expression:

x̂_k = Z_k p ,                                          (24)

where Z_k is defined by

Z_k = [T_1 ζ_{1k}, T_2 ζ_{1k}, ..., T_n ζ_{1k}, T_1 ζ_{2k}, ..., T_n ζ_{2k}] .

In these equations,

p = [d^T - a^T : g^T]^T ,

d^T = [d_1, d_2, ..., d_n] ,   a^T = [a_1, a_2, ..., a_n] ,

ζ_{1k}^T = [y*_{k-1}, ..., y*_{k-n}] ,   ζ_{2k}^T = [u*_{k-1}, ..., u*_{k-n}] ,

while the matrices T_1 to T_n are composed of the numerator polynomial
coefficients of [I - ΛL]^{-1} δ_i L, where δ_i is the ith unit vector (i.e. with ith
element unity; all others zero) and Λ = [F - k h^T] is a matrix similar to F but
with a_i replaced by d_i, i = 1, 2, ..., n.
From these results, it is clear that state estimation is implicit in the optimal
recursive parameter estimation algorithms outlined in Section 3. Consequently,
if state estimates are required for applications such as optimal state-variable
feedback (SVF) control system design, then they can be generated from the
outputs of the adaptive prefilters by the use of (24). This allows the analyst to
proceed from experimental or monitored data right through to control system
synthesis in a systematic series of straightforward, data-based analytical steps.
And, in the process, he is able to avoid explicit Kalman filter design, with its
requirement for state-space model identification, specification of system and
observation noise covariance matrices, and the solution of the Kalman equa-
tions, including the nonlinear matrix Riccati equation (Kalman, 1960).
But we also see that such an approach is not, in itself, essential: since
information on the optimal state-variable estimates is implicit in the adaptive
prefilter outputs, it is clear that these outputs can themselves be used
directly in the specification of the optimal control law. This is clear if we note that the
optimal SVF control law for the linear, quadratic, Gaussian (LQG) problem
(see, for example, the special issue of the IEEE Transactions on Automatic
Control, Vol. AC-16, December 1971) is given by

u_k = -K^T x̂_k ,                                       (25)

where K = [K_1, K_2, ..., K_n]^T is the optimal SVF gain vector. From (24) and (25),
therefore, the control input is given by

u_k = -K^T Z_k p
or
u_k = -γ^T z*_k ,                                      (26)

where z*_k is the vector of prefiltered input and output signals, as generated and
used in the optimal parameter estimation algorithm III, and γ is a suitably
defined gain vector. For example, in the simplest, first-order case, γ is a
2-dimensional vector defined by

γ^T = [K_1(d_1 - a_1), K_1 b_1] ,

and the control law is of the form

u_k = -[K_1(d_1 - a_1) y*_{k-1} + K_1 b_1 u*_{k-1}] .
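For the first-order case just given, the control computation is a single line; in the sketch below, K1 and the model parameters are assumed, hypothetical values, and y_star_prev, u_star_prev stand for the prefiltered signals y*_{k-1} and u*_{k-1}.

    # Assumed first-order values: K1 is a hypothetical LQG gain; a1, b1, d1
    # are hypothetical model and noise parameters (none are from the text).
    K1, a1, b1, d1 = 0.6, -0.9, 0.5, -0.4
    gamma = (K1 * (d1 - a1), K1 * b1)    # gamma^T = [K1(d1 - a1), K1*b1]

    def control(y_star_prev, u_star_prev):
        """u_k = -[K1(d1 - a1) y*_{k-1} + K1 b1 u*_{k-1}], computed directly
        from the prefiltered signals already available in algorithm III."""
        return -(gamma[0] * y_star_prev + gamma[1] * u_star_prev)

    u_k = control(y_star_prev=0.8, u_star_prev=-0.2)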

From this analysis, we see that optimal stochastic control can be synthesised
and implemented directly in the observation space, without specific reference to
the state-variable estimation and the Kalman filter. It is interesting to note that
this approach has some similarities with that of Arthur (1978), who considers
the problem of LQG control for systems described by the related 'distributed-
lag' models, as used in econometric studies. The reader is directed to this paper
for a more complete discussion of the optimal control aspects, albeit limited to
the more restricted distributed-lag representation.

5. Other recursive algorithms and related topics

Time-series modelling involves several stages of analysis, from data pre-
processing, through model structure (order) identification and parameter esti-
mation, to model validation and use. Before describing such a procedure based
on recursive data processing, it is necessary to outline other recursive pro-
cedures and related topics that can be useful at various stages in the analysis.

5.1. Data pre-processing
Most pre-processing of data, prior to formal identification and estimation,
involves filtering of one kind or another. Such filtering operations are usually
aimed at removing nonstationarity, in the form of deterministic or stochastic
trends; or periodic behaviour, as in 'seasonal adjustment'. Given the filtering
interpretation that Kalman placed on recursive estimation, it is not surprising
that such procedures can be used very effectively in this pre-processing stage of
the analysis.
Here we will outline a very simple recursive approach which uses off-line,
recursive-smoothing versions of the Kalman filter, as suggested by Jakeman
and Young (1979). Other research work of a similar kind has been reported
previously (e.g. Pagan, 1975) but has not tended to emphasise the practical
advantages obtained by basing the algorithmic design on the simplest class of
nonstationary processes; namely the random walk family. Such an approach
also allows for the development of simple but effective algorithms for locally
smooth trend fitting, Stein-type shrunken estimators, ridge regression, numeri-
cal differentiation, and the localised regression of O'Hagan, all of which are
discussed by Akaike (1979) in relation to his nonrecursive 'Likelihood and
Bayes' procedures. In concept, the recursive-smoothing approach is almost
identical to that of Akaike but, by resorting to a recursive solution of an
equivalent problem, it is possible to develop algorithms that are computation-
ally simpler and more flexible. Also, the recursive setting seems more natural
for these kinds of problems.
The reader is referred to Jakeman and Young (1979) for a complete
description of the recursive-smoothing procedures and their wide range of
applicability. Here, we will restrict the discussion to the simplest realisations of
the algorithms, where the trend or periodic behaviour is assumed to be
described by one of three major random walk models: the simple random walk
(RW), the integrated random walk (IRW) and the smoothed random walk
(SRW). The first two of these models are the basis for the recursive-smoothing
algorithms RWSMOOTH and IRWSMOOTH that constitute the BEST
(Bayesian Estimation of Stochastic Trends) sub-program in the MICRO-
CAPTAIN package discussed in subsequent sections of this chapter.
It is assumed that the time-series data u_k, y_k (or both) may contain 'un-
wanted' components, i.e. components which are not seen as directly important
to the analysis or, by their presence, would introduce problems in the later
identification and estimation exercises. Of course, the dangers of removing
such components without very good reason cannot be overemphasised: if any
component can be retained and used in the later stages, then it is well to do so,
as we shall see in the sales data analysis discussed in Subsection 7.4. However,
it is certainly true that some advantage can be obtained, in certain circum-
stances, by such exercises as trend removal.
If we denote the unwanted component by U_k, then the series, say y_k for
illustration, can be defined as

y_k = U_k + e_k ,                                      (27)

where e_k is the residual obtained by removing U_k from y_k. In the BEST
approach, the U_k component is modelled as either:
(a) an RW process

U_k = U_{k-1} + μ_{k-1} ,                              (28)

(b) an IRW process

U_k = U_{k-1} + S_{k-1} ,   S_k = S_{k-1} + μ_{k-1} ,  (29)

or
(c) an SRW process

U_k = φ U_{k-1} + (1 - φ) S_{k-1} ,   S_k = S_{k-1} + μ_{k-1} ,  (30)


Recursive identification, estimation and control 227

where, in all cases, μ_k is a zero mean, white noise sequence with variance q;
while, in case (c), φ is a smoothing parameter (0.5 < φ < 1.0) introduced to
produce a model whose descriptive ability lies somewhere between (a) and (b),
depending upon its chosen value (Young, 1984).
At first sight, the choice of such simple models for U_k would appear to limit
the practical utility of the subsequent estimation algorithms. But this would not
appear to be the case, as we shall see in a number of practical examples
described in subsequent sections of this chapter.
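A minimal sketch generating realisations of the three models (28)-(30), with q and φ as assumed, user-chosen values:

    import numpy as np

    def random_walk_family(N, q=0.01, phi=0.9, seed=0):
        """Generate RW (28), IRW (29) and SRW (30) realisations of length N;
        q is the variance of mu_k and phi the SRW smoothing parameter."""
        rng = np.random.default_rng(seed)
        mu = rng.normal(scale=np.sqrt(q), size=N)
        rw = np.cumsum(mu)                  # (28): U_k = U_{k-1} + mu_{k-1}
        irw = np.cumsum(np.cumsum(mu))      # (29): doubly summed (integrated) noise
        srw = np.zeros(N); s = np.zeros(N)
        for k in range(1, N):               # (30), with 0.5 < phi < 1.0
            srw[k] = phi * srw[k - 1] + (1 - phi) * s[k - 1]
            s[k] = s[k - 1] + mu[k - 1]
        return rw, irw, srw

    rw, irw, srw = random_walk_family(200)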
In order to obtain the recursive-smoothing algorithms, it is necessary to use
the Kalman filter formulation (Kalman, 1960; Young, 1984) with (27) providing
the observation equation and one of the equations (28) to (30) defining the
state equations. This requires that we assume e_k is itself a zero mean white
noise sequence independent of μ_k. Once again, while this would seem a
restrictive assumption, it does not appear to affect the performance of the
algorithm very much in practical terms. Clearly, however, more complex
stochastic models than (28)-(30) could be utilised (see Young, 1984; Pagan,
1975), if this proves necessary. Also, more complex observation equations can
introduce still greater flexibility (Jakeman and Young, 1979).
All of the models (28)-(30) can be considered to be of the general Gauss-
Markov form

U_k = Φ U_{k-1} + Γ μ_{k-1} ,                          (31)

with U_k denoting a scalar U_k in (28) and a vector U_k = [U_k  S_k]^T in (29) and
(30). As a result, the Kalman recursive filtering equations take the following
form in the case of y_k (and are similar for u_k):

Û_k = Φ Û_{k-1} + P_k h {y_k - h^T Φ Û_{k-1}} ,                            V(i)

P_k = P_{k|k-1} - P_{k|k-1} h [h^T P_{k|k-1} h + 1]^{-1} h^T P_{k|k-1} ,   V(ii)

P_{k|k-1} = Φ P_{k-1} Φ^T + Γ q Γ^T .                                      V(iii)

Although superficially complex, these equations are really quite simple because
of the simplicity of Φ and Γ.
The 'filtering' algorithm V provides an estimate Û_k which is based on all of
the data up to and including the kth sample in the time series. To obtain the
smoothed estimate Û_{k|N}, i.e. the estimate at the kth sample based on all N
samples in the time series, it is necessary to process the data further as
discussed, for example, by Gelb et al. (1974). In contrast to the filtering
algorithm V, however, the smoothing algorithm has a number of forms, some
of which are inherently unstable in computational terms. Norton (1975) con-
cludes that the most useful form is that proposed by Bryson and Ho (1975).
Here the smoothed estimates are obtained in a backwards recursion of the
form

Û_{k|N} = Û_k - P_k Φ^T λ_k ,                          V(iv)

with λ_N = 0 and

λ_k = [I - P_{k+1} h h^T]^T [Φ^T λ_{k+1} - h(y_{k+1} - h^T Φ Û_k)]     V(v)

for k = N-1, N-2, ..., 1.

In practice, algorithm V is easy to use: the analyst simply chooses a value for q
(or q and φ if (30) is being used to model U_k) and observes the form of the
resulting Û_{k|N} estimate; he can then adjust q until satisfactory results are
obtained.² It is possible to conceive of more complex algorithms in which q is
selected adaptively in some manner (e.g. Davis, 1983), but practical experience
has shown that, at least in off-line applications, it is best to leave some freedom
for subjective judgement on the part of the analyst. After all, the human eye is
a very good filter of data, particularly in relation to the detection of long-term
trends.
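A minimal sketch of trend estimation in the spirit of algorithm V follows, for the scalar RW model (28) with the observation noise variance normalised to unity (footnote 2). Note that a standard Rauch-Tung-Striebel fixed-interval smoother is used here in place of the Bryson-Ho backward recursion quoted above; the two forms yield the same smoothed estimates.

    import numpy as np

    def rw_smooth(y, q):
        """Filter and smooth a scalar RW trend through the data y, with
        process noise variance q and unit observation noise variance."""
        N = len(y)
        u_f = np.zeros(N); p_f = np.zeros(N)      # filtered estimate, variance
        u, p = y[0], 1e3                          # diffuse initialisation
        for k in range(N):
            p = p + q                             # V(iii): prediction
            gain = p / (p + 1.0)                  # V(i)-(ii), scalar form
            u = u + gain * (y[k] - u)
            p = (1.0 - gain) * p
            u_f[k], p_f[k] = u, p
        u_s = u_f.copy()                          # backward (smoothing) pass
        for k in range(N - 2, -1, -1):
            c = p_f[k] / (p_f[k] + q)             # smoother gain
            u_s[k] = u_f[k] + c * (u_s[k + 1] - u_f[k])
        return u_s

    # The analyst tunes q by eye, as described above: a larger q tracks the
    # data more closely, a smaller q gives a smoother trend.
    y = np.cumsum(np.random.default_rng(2).normal(size=300)) + 5.0
    trend = rw_smooth(y, q=0.05)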

5.2. Model order identification

Box and Jenkins (1970) introduced a systematic method of identifying the
order of time-series models on the basis of correlation analysis. This approach
is now widely used in practice but it is known to have some limitations: for
example, it relies on the analyst's recognition of certain characteristic patterns
in the correlation functions; patterns which are not always easy to identify. An
alternative and more objective approach has been suggested recently by Young
et al. (1980). This is based on two simple statistics: a coefficient of deter-
mination R², and an error-variance-norm EVN.
Both of these statistics are only strictly applicable to the identification of the
order of the deterministic part of the model, i.e. the order of the A and B
polynomials in equation (1). However, as we shall see in Subsection 5.3, they
can be used for noise model order identification if an alternative to the refined
AML algorithm IV is used for noise model parameter estimation.
When applied to the identification of the deterministic model order, the two
statistics are computed for a whole range of different model orders. For a
model with 2n parameters in the deterministic model, the refined IVAML
results are obtained: R² and EVN are then computed as follows for each set of
results:

R² = 1 - Σ_{k=1}^{N} ξ̂_k² / Σ_{k=1}^{N} (y_k - ȳ)² ,   (32)

EVN = (1/2n) Σ_{i=1}^{2n} p̂_i .³                       (33)
²Note that, in algorithm V, the variance of e_k is normalised to unity for simplicity (cf. algorithm VI).
³Sometimes normalised measures may be preferable (Young et al., 1980, p. 284).

Here ξ̂_k is the estimate of ξ_k, i.e. ξ̂_k = y_k - x̂_k; ȳ is the mean value of y_k; and p̂_i
is the ith diagonal element of the P̂_N matrix obtained from algorithm III.
In the normal manner, R² is a measure of model fit (although it will be
noted that it is not based here on the usual least squares residuals); while EVN,
which is typically evaluated in logarithmic form (i.e. ln EVN), is an indication
of the overall (or average) variance of the parameter estimates in the 2nth
order model. As Young et al. (1980) show, the EVN appears to be a sensitive
indicator of over-parameterisation: if the model has too many parameters, then
it can be shown that the instrumental cross-product matrix⁴ Σ_{k=1}^{N} x̂_k z*_k^T tends to
singularity, and its inverse, P̂_N, as computed by the refined instrumental variable
(IV) algorithm III, tends to increase sharply in value, with a consequent
increase in the EVN. This increase can often be of several orders of magnitude.
Thus the analyst monitors both statistics and chooses the model which has the
best combination of R² and ln EVN: usually R² will have reached a 'plateau'
level, with little further increase for any increase in model order; while ln EVN
will have a very low value in relation to that obtained for higher order models.
There is a need for more statistical research on the EVN test, since it is
currently justified largely in numerical analysis terms. Nevertheless, it does
appear to work well in practical applications. For example, when applied to the
famous gas furnace data of Box and Jenkins (1970), it revealed rather obvious
over-parameterisation of the Box-Jenkins identified model and suggested a TF
representation with two parameters rather than four (see Young et al., 1980).
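A minimal sketch of the two statistics, assuming the noise estimate and the final P matrix from the IV algorithm are available:

    import numpy as np

    def identification_stats(y, x_hat, P_N):
        """Compute R^2 of (32) and ln EVN of (33) from the model output
        estimate x_hat and the final P matrix of the IV algorithm (scaled
        as in footnote 1)."""
        xi_hat = y - x_hat                                          # noise estimate
        R2 = 1.0 - np.sum(xi_hat**2) / np.sum((y - y.mean())**2)    # (32)
        ln_evn = np.log(np.mean(np.diag(P_N)))                      # (33), in logs
        return R2, ln_evn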

5.3. An alternative recursive approach to noise (univariate series) model estimation and identification
The weakest part of the refined IVAML approach to TF model estimation is
the refined AML algorithm IV (see the discussion in Young and Jakeman,
1979). A more complicated recursive alternative has been
proposed recently (Young, 1983). It is based on the idea of high order
autoregressive (AR) model fitting, as suggested by Hannan and Rissanen
(1982). The recursive least squares algorithm I is applied to the estimate ξ̂_k of the
noise sequence ξ_k in order to obtain a purely AR model of moderate to high
order. This model is then used to generate an estimate ê_k of the white noise
residuals e_k; and the ê_k so obtained is employed in TF model estimation, based on
the LS or IV algorithm, with ê_k representing the system input and ξ̂_k the
output.
Since the noise modelling problem has been converted, in this manner, to
one of TF model estimation, it is now possible to apply the model order
identification procedures discussed in Subsection 5.2. In simulation studies, this
approach has consistently worked well and, in practical applications, it has
yielded noise models with sensible characteristics, both as regards impulse
response definition and spectral properties. This is not too surprising since, in
⁴In the refined instrumental variable (IV) algorithm, this matrix approaches the equivalent
information matrix asymptotically.

effect, it is aimed at finding an ARMA model form which adequately ap-
proximates the high order AR model obtained at the first, least squares stage in
the analysis. And it is well known that high order AR modelling (or its
equivalent, maximum entropy spectrum estimation) produces very good results
in practice (see, e.g., Kirk et al., 1979).
As in all approaches to estimation which involve high order AR modelling, it
is necessary to identify the order of the AR process prior to estimation. In
practice, it has been found that the procedure proposed above is not too
sensitive to the selected AR order and even relatively low order models (see
Subsection 7.5) produce good results. However, it is probably safer to employ
one of the AR order identification procedures currently available, such as the
Final Prediction Error (FPE) and Information Criterion (AIC) of Akaike (1970,
1972) or the related procedures of Parzen (1977) and Rissanen (1976).
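A minimal sketch of this two-stage procedure follows, using ordinary (batch) least squares in place of the recursive algorithm I for compactness; the helper lagmat and the default orders are illustrative choices:

    import numpy as np

    def lagmat(v, lags, t0, N):
        """Matrix whose columns are v lagged by 1 ... lags, aligned at sample t0."""
        return np.column_stack([v[t0 - i: N - i] for i in range(1, lags + 1)])

    def ar_then_arma(xi, ar_order=7, p=2, q=2):
        N = len(xi)
        # Stage 1: high order AR fit by least squares; residuals estimate e_k.
        X = lagmat(xi, ar_order, ar_order, N)
        phi, *_ = np.linalg.lstsq(X, xi[ar_order:], rcond=None)
        e_hat = np.zeros(N)
        e_hat[ar_order:] = xi[ar_order:] - X @ phi
        # Stage 2: ARMA(p, q) by regressing xi on lagged xi and lagged e_hat,
        # with e_hat now treated as a known (estimated) input series.
        t0 = max(p, q, ar_order)
        Z = np.hstack([lagmat(xi, p, t0, N), lagmat(e_hat, q, t0, N)])
        theta, *_ = np.linalg.lstsq(Z, xi[t0:], rcond=None)
        return phi, theta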
Finally, it is clearly possible to synthesise an on-line equivalent of the
estimation procedure discussed in this section, in which parallel recursive AR
and TF estimation algorithms are utilised, with the current recursive estimate ê_k
obtained from the inverse AR model being used as the latest estimated input to
the TF estimation algorithm. However, such a procedure is more complicated
than the AML algorithm IV and it has not yet been evaluated in practical
terms.

5.4. Time-variable parameter estimation


One advantage with the recursive formulation of time-series analysis is the
potential it provides for time variable parameter estimation (Young and
Jakeman, 1980). Using an analogy with the Kalman filter, it is possible to
develop time-variable parameter estimation algorithms for the TF model by
assuming that the parameter variations can be represented by a Gauss-Markov
model similar to (31), but with Uk replaced by the parameter vector (ak, Ck or
Ok). Indeed, in practical terms, the most useful forms of this model are the
random walk family discussed in Section 5.1: the q) and F matrices are simply
defined for these models and their inherent nonstationarity allows for a wide
variety of parameter variations.
The time-variable estimation algorithm in the case of the â vector takes the
following form:

Prediction
between         (i)   â_{k|k-1} = Φ â_{k-1} ,
samples         (ii)  P̂_{k|k-1} = Φ P̂_{k-1} Φ^T + Γ Q Γ^T ,
                                                                                          VI
Correction
on receipt of   (iii) â_k = â_{k|k-1} + (P̂_k/σ²) x̂_k e_k ,
kth sample      (iv)  P̂_k = P̂_{k|k-1} - P̂_{k|k-1} x̂_k [σ² + z*_k^T P̂_{k|k-1} x̂_k]^{-1} z*_k^T P̂_{k|k-1} ,

where â_{k|k-1} and P̂_{k|k-1} are the a priori predictions of â_k and P̂_k, based on the
estimates at the (k-1)th sampling instant and knowledge of the assumed
Gauss-Markov model; while Q is the covariance matrix of the white noise
vector in the assumed Gauss-Markov model for the parameter variations. The
Bayesian nature of the recursive algorithm is clear from these equations, where
the a priori predictions are converted to a posteriori estimates on receipt of the
latest data; and where these a posteriori estimates provide the basis for a priori
predictions at the next recursion, as in the Kalman filter.
In the simplest RW case, the algorithm is extremely simple, with the
prediction equations reducing to

â_{k|k-1} = â_{k-1}   and   P̂_{k|k-1} = P̂_{k-1} + Q .


In other words, the only modification to algorithm III is the addition of the
matrix Q to P̂_k at each recursive step. In effect, this ensures that P̂_k (and,
therefore, the algorithmic gain) is maintained at levels defined by the chosen
value of Q for k → ∞, rather than being strictly decreasing for k → ∞, as in the
time-invariant case of algorithm III, where Q = 0 (see Young (1984) for a
detailed discussion of this aspect of the algorithmic behaviour). As a result, the
algorithm always takes note of changes in the recursive residual or 'in-
novations' measure e_k caused by parametric variations and is thus able to 'track'
such variations.
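A minimal sketch of one recursion of algorithm VI in the RW case; Q is the user-specified covariance matrix and the remaining names are illustrative:

    import numpy as np

    def tvp_rls_update(a_prev, P_prev, z, y, Q, sigma2=1.0):
        """One recursion of algorithm VI under the RW assumption: the
        prediction step reduces to adding Q to P, which keeps the gain from
        decaying and lets the estimates track slow parameter variations."""
        P_pred = P_prev + Q                          # VI(i)-(ii), RW case
        e = y - z @ a_prev                           # innovation
        denom = sigma2 + z @ P_pred @ z
        gain = P_pred @ z / denom
        a = a_prev + gain * e                        # VI(iii)
        P = P_pred - np.outer(gain, z @ P_pred)      # VI(iv)
        return a, P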
The simple time-variable parameter estimation algorithm obtained under the
RW assumption is the basis for the TVAR sub-program of the MICRO-
CAPTAIN package described in the next section of this chapter. Other more
complicated stochastic models for parameter variations, which can be used to
develop more complex time-variable parameter estimation algorithms, are
discussed by Young and Jakeman (1980) and, in more depth, by Young (1984).
However, their practical utility will tend to be problem dependent (Young,
1981a).

6. Recursive time-series analysis and the MICROCAPTAIN program package

A mainframe, interactive, computer program package CAPTAIN (Com-
puter Aided Program for Time-Series Analysis and Identification of Noisy
Systems) has been available for some time (e.g. Young et al., 1971; Jakeman et
al., 1982). More recently, a simpler microcomputer version of this package has
been developed by the present author for use on the Apple II computer. Most
of the results for the practical examples discussed in the next section have been
obtained with MICROCAPTAIN. It is an easy to use, interactive package
which makes considerable use of high resolution graphics and the EPSON
MX-82 printer. The package consists of a number of subprograms based on
simple versions of the recursive algorithms considered in previous sections of
this chapter. The data pre-processing sub-program BEST incorporates two


algorithms, RWSMOOTH and IRWSMOOTH, which are sub-optimal versions
of algorithm V, in which the backwards recursion involves a simple, fixed
parameter filter based on the optimal backwards smoothing equations V(iv)
and V(v). Identification and estimation of the deterministic model transfer
function B(L)/A(L) is accomplished with a basic version of algorithm III (see
Young and Jakeman, 1979), in which the adaptive prefilters are removed for
simplicity. The sub-program TVAR is based on a similarly modified version of
algorithm VI, which is used for time-variable parameter estimation. Finally,
noise model identification and estimation involves a sub-program ARMA
which implements the approach discussed in Subsection 5.3, with the recursive
least squares algorithm I used for AR model estimation and the basic version
of algorithm III employed for subsequent ARMA model identification and
estimation.
The overall approach to time-series analysis based on the MICROCAPTAIN
package can be summarised as follows:
1. Data pre-processing: Time-series data are read from disk file and processed
by the BEST program, using RWSMOOTH and IRWSMOOTH; the estimated
U_k component and the 'small perturbation' residuals about Û_k are then sent to
disk file for subsequent use.
2. Model order identification: TF model estimation is carried out for a
number of models of different specified orders (also incorporating any pure
time delay between u_k and y_k, if necessary); here the basic IV algorithm is used
and the statistics R² and ln EVN are computed for each model; in addition, the
recursive estimates are examined to see if there is evidence of over-
parameterisation (as indicated by high EVN values), or temporal variation (see
step 6 below).
3. Model parameter estimation: The model with the most satisfactory order
identification statistics and recursive estimates is re-estimated and plots of x̂_k, y_k
and ξ̂_k are obtained from the printer, on request; also plots of the recursive
parameter estimates and the model impulse response are available to the user.
4. Model validation: Using subsets of the data or separate data, re-estima-
tion is carried out to assess whether the model seems an adequate represen-
tation of the system; this is aided by evaluation of the recursive estimates, to
see if there is initial evidence of parametric nonstationarity; the noise estimate
ξ̂_k, to ensure that it is statistically independent of the input u_k; and the impulse
response, to assess whether the model is physically meaningful.
5. Noise model identification and estimation: If the model seems acceptable
on the basis of the results in step 4, then the noise estimate ξ̂_k = y_k - x̂_k is saved
on disk file, where it provides the input to the ARMA sub-program, which is
used to identify and estimate either an AR or ARMA noise model, as specified
by the user.
6. Time-variable parameter estimation: If there appears to be evidence of
parametric nonstationarity (or nonlinearity), then the TVAR program is used
to obtain initial estimates of these variations; the Q matrix required for this
analysis is restricted to diagonal form and its elements are specified by the user
on the basis of the diagonal elements of the final P̂_N matrix that are displayed
to him after an initial estimation run with Q = 0 (the user may need to adjust Q
further on the basis of the results obtained with his initially specified values for
the diagonal elements).
The results obtained in steps 1 to 6 often provide a satisfactory time-series
model for most practical purposes; in certain cases, however, it may be
necessary to proceed with further, more sophisticated analysis based on either
the optimal algorithms available in the mainframe CAPTAIN package or
independent and specialised nonlinear time-series analysis (e.g. Young, 1978,
1982a; Priestley, 1980).
In addition, another microcomputer program MCAPTREF is available
which provides a simple version of the refined IVAML approach to model
estimation. Having obtained estimates for the A, B, C and D polynomials
using MICROCAPTAIN, the prefiltered variables u*_k, y*_k and x̂*_k are generated
off-line and used for re-estimation of the A and B polynomials. In effect, these
are the estimates obtained after one iteration of the refined IVAML program
in CAPTAIN, and the statistical properties of the estimates so obtained are
usually quite close to those of the optimal estimates obtained with the main-
frame program. However, as we have stressed previously (Young and Jakeman,
1979), the basic IVAML estimates are often good enough for most practical
purposes and further refinement is usually only justified in special circum-
stances.

7. Practical experiences with recursive estimation

The CAPTAIN package was assembled in 1970 and has been used in many
practical applications during the ensuing thirteen years. MICROCAPTAIN
was written in 1982 and, although still in its final stages of development, has
proven extremely useful in teaching and research. In this section, we discuss a
number of examples which illustrate the use of both programs and demonstrate
the power of the recursive approach to time-series analysis.

7.1. Simulation results


Often practical examples do not expose the full potential of an analytical
procedure. Consequently, it is useful to test time-series programs with simu-
lated data which are chosen carefully to provide a reasonable challenge to the
estimation algorithms.
The algorithms III, IV and VI have been tested exhaustively using Monte
Carlo stochastic simulation and the results are discussed by Young and
Jakeman in a series of papers that have appeared in the International Journal of
Control (1979, 1980, 1981, 1983). Here we will test the MICROCAPTAIN
package using simulated data which have been chosen so that they expose the

practical robustness of the constituent algorithms when the noise on the data
does not conform to the usual statistical assumptions.
Fig. 3a shows the 100 sample set of output data for the simulation example.
These data are generated from the following model:

y_k = (0.2L³ / (1 - 1.5L + 0.7L²)) u_k + ξ_k ,         (34)

where the deterministic input u_k is a random binary (±1) signal and the noise ξ_k
is the sum of a pure sinusoidal component and a stochastic trend, in the form of
a positive random walk. The transfer function in (34) represents a second-order
oscillatory system with a pure time delay of 3 sampling periods, and the
sinusoidal component in the noise is chosen so that it interferes seriously with
the natural oscillations of the system.
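The data of Fig. 3a can be reproduced in outline by the following sketch; the amplitude of the sinusoid, its frequency (chosen near the natural frequency of the system) and the drift of the random walk are assumed values, since the exact settings are not given in the text:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100
    u = rng.choice([-1.0, 1.0], size=N)          # random binary input
    x = np.zeros(N)
    for k in range(3, N):                        # 0.2L^3 / (1 - 1.5L + 0.7L^2)
        x[k] = 1.5 * x[k - 1] - 0.7 * x[k - 2] + 0.2 * u[k - 3]
    trend = np.cumsum(np.abs(rng.normal(0.0, 0.05, size=N)))   # rising random walk
    sinusoid = np.sin(2 * np.pi * 0.075 * np.arange(N))        # near the natural frequency
    y = x + sinusoid + trend                     # noisy output, as in Fig. 3a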
The major problem posed by the data set shown in Fig. 3a lies in the nature
of the noise: not only does it contain a nonrandom sinusoid, chosen to confuse the
estimator, but the sinusoid is embedded in a nonstationary, stochastically rising
trend. The RWSMOOTH estimated trend is shown as a dashed line in Fig. 3a
and the de-trended output is given in Fig. 3b.
Estimates of the model parameters can be obtained directly from the data
shown in Fig. 3a, or from the data with the output de-trended, as in Fig. 3b.
Table 1 shows the estimates obtained in both cases: it is clear that the

Fig. 3. Output data for simulation example: (a) data with estimated trend (dashed); (b) de-trended data.

Table 1

                             Measured data                 De-trended data
Parameter   True value   Estimate   Estimated          Estimate   Estimated
                                    standard error                standard error

a_1         -1.5         -1.556     ±0.559             -1.559     ±0.197
a_2          0.7          0.738     ±0.559              0.736     ±0.198
b_0          0.2          0.195     ±0.196              0.204     ±0.068
estimated parameters are quite similar but the estimated standard errors are
much smaller when the output series is de-trended.
Fig. 4 shows the graphical results obtained from the analysis of the measured
data. Fig. 4a compares the estimated noise-free output x̂_k with the actual value
x_k; and it also shows the measured noisy output y_k. Fig. 4b compares the noise
estimate ξ̂_k = y_k - x̂_k with the actual noise ξ_k. We see that, although the noise is
particularly bad in this case, the estimation results are very good: on the basis
of only 100 samples it has been possible to obtain remarkably accurate
estimates of both the noise-free output and the noise.
The estimation results obtained from the de-trended data are shown in Figs. 5
and 6. In Fig. 5a, we see how the sinusoidal component of the noise is

Fig. 4. MICROCAPTAIN results for simulation example: (a) model output x̂_k vs. actual output x_k and noisy observation y_k; (b) noise estimate ξ̂_k vs. actual noise ξ_k.
Fig. 5. MICROCAPTAIN results for simulation example with de-trended output data: (a) model output vs. data; (b) estimated noise signal.

Fig. 6. MICROCAPTAIN results for simulation example with de-trended output data: (a) recursive estimate of a_1; (b) estimated and actual impulse response.

particularly problematical in the second half of the data, with the measured
output y_k at times completely out of phase with the noise-free output x_k (which
is virtually indistinguishable from the estimate x̂_k in this case). Fig. 6a shows
the recursive estimate of the a_1 parameter and, finally, Fig. 6b compares the
estimated and actual impulse response characteristics of the transfer function in
(34).
The robustness of the IV estimation algorithm is obvious from the
results obtained in this simulation example. Although the noise does not
conform to the usual nice statistical assumptions, the algorithm has no difficulty
in estimating the model from a comparatively small sample of 100 input-output
pairs.⁵ We also see that, provided the input signal u_k is independent of the
noise ξ_k, then we can achieve reasonable estimation results, even without the
removal of stochastic trends or bias components.
Of course, the whole topic of nonstationarity and the removal of low
frequency components is a complex one and it is difficult to generalise about
the wisdom or otherwise of such operations: normally the decision will be
problem dependent. Certainly, the ability to obtain reasonable results without
the removal of nonstationary behaviour can sometimes be of practical value.
For example, in hydrological modelling, a 'baseflow' component is often
present on flow data and conventional deterministic rainfall-flow analyses
usually requires its removal prior to model 'calibration'. 6 On the other hand,
when using M I C R O C A P T A I N , it is possible to analyse the data directly, with
the baseflow component treated as noise. The subsequent analysis then not
only provides an estimate of the transfer function between rainfall and flow,
but also an estimate of the noise ~k, from which the baseflow component can be
deduced (see Beer et al., 1982). We will see other examples later in Subsections
7.3 and 7.4, however, where trend removal is extremely helpful in identification
and estimation terms and where its retention would have caused problems.

7.2. Applications in environmental and ecological systems analysis

Typical applications of recursive time-series analysis in environmental and


ecological systems analysis are: the characterisation of pollutant transportation
and dispersion in river systems; the evaluation of translocation phenomena in
plants and animals; the modelling and forecasting of river flow (so-called 'flood
routing'); the representation of subsurface flow; and modelling estuarine
dynamics (Young, 1982a).
MICROCAPTAIN has been applied successfully to all these problems and
examples of the first two are described by Young (1982b). Here we will typify
the analysis by considering a flow modelling example, using data from the
River Wyre south of Lancaster. These data consist of 32 flow measurements
⁵Indeed, convergence is virtually complete after only 25 samples.
⁶'Calibration' is the normal term used for model parameter estimation in hydrological and water
resource applications; it usually involves some form of deterministic optimisation, or its equivalent.

Fig. 7. Data for flow modelling example (input and output flow series).

taken at two locations on the river over a period of 32 hours with a sampling
interval of one hour. The data are plotted in Fig. 7, where it should be noted
that both series are scaled differently by the automatic scaling procedures in
MICROCAPTAIN: in fact, the magnitude of the input series is about 0.28 of
the output series.

Fig. 8. MICROCAPTAIN results for flow modelling example: (a) model outputs vs. data; (b) estimated noise signals (best identified model; Muskingum model).

Fig. 8 (cont.). MICROCAPTAIN results for flow modelling example: (c) and (d) parameter estimates; (e) estimated impulse responses (best identified model; Muskingum model).

The best identified model is of first order, with two parameters (a_1, b_0) and a
pure time delay of 2 sampling intervals. Part of the MICROCAPTAIN results
for this model are shown in Fig. 8, where it will be seen that the model has an
R² = 0.999, indicating a very good explanation of the data; and a ln EVN =
-7.138, reflecting the low estimated standard errors on the parameter esti-
mates. These results are compared with those for other models in Table 2,
where it is clear that the first-order model is superior in all ways. The
considerably higher ln EVN values for the other models are indicative of the
less well-defined parameter estimates: for example, the recursive estimate of

Table 2

Model   No. of A      No. of B      Time delay   R²      ln EVN
        parameters    parameters

1       1             1             2            0.999   -7.138
2       1             2             2            0.999   -4.774
3       2             1             2            0.999   -4.227
4       2             2             2            0.999   -2.696
5       1             2             0            0.996   -3.706
the b_1 parameter in model 2 is shown dashed in Fig. 8c and we see that it shows
no clear convergence. This is also borne out in the estimated standard error for
this parameter (0.12), which is similar to its estimated value (0.13).
It is interesting to note the results for the final model, with a first-order A
polynomial, second-order B polynomial and no pure time delay. This is one of
the successful conventional representations used in flood routing studies, where
it is termed the 'Muskingum model'. We see, however, that it performs rather
badly in relation to the best identified first-order model. This is emphasised in
Fig. 8, where the Muskingum model results are shown in dashed lines. The
advantages of the time-series approach are clear from this analysis: the model
fit for the Muskingum model, although not so good as that for the best
identified model, could well be considered acceptable in hydrological terms;
but reference to the ln EVN results and comparison with other model performance
leaves no doubt that it is not an acceptable model in statistical terms.
In addition, the estimated Muskingum model in this case is of the form

y_k = ((-0.22 + 0.91L) / (1 - 0.81L)) u_k + ξ_k ,      (35)

and we see from the negative value for the b̂_0 coefficient that it exhibits
non-minimum phase (NMP) behaviour: in other words, the response of y_k to
increases in u_k is an immediate decrease followed by a subsequent increase.
This unacceptable physical behaviour of the model is characteristic of Muskin-
gum models and it arises from faults in the basic model formulation (see
Whitehead et al., 1979). The model can only be justified in numerical terms,
where it could be argued that the NMP properties reflect a first-order Padé
approximation (e.g. Truxal, 1955) to a pure time delay. However, this was not
the justification used originally in the model formulation and, in any case, it
seems more meaningful in physical terms to explicitly introduce the pure time
delay to reflect transportation processes in the river.
The above analysis shows the value of MICROCAPTAIN time-series analy-
sis: the analyst is able to objectively evaluate existing physically based
models and introduce new types of model if the conventional models are shown
to be inferior. Other examples of this are available in environmental systems

analysis. For example, Beer and Young (1983) have shown that simple aggre-
gated dead-zone (ADZ) models of dispersion in rivers, which are obtained by
time-series analysis, can be superior to the classical 'Fickian diffusion' model
that has been the standard hydrodynamic model for dispersion processes for
many years.

7.3. Economic data analysis


No one would argue that time-series analysis should be used indiscriminately
for the analysis of economic data. The economic system is poorly defined (see
Young, 1978, 1982a) and the inability to perform planned experiments means
that the information content in normally observed economic data may not be
sufficient to allow for the identification of a satisfactory dynamic model (i.e.
one which explains satisfactorily the economic mechanisms).
With these caveats in mind, let us analyse a set of economic data from the
U.K.: namely, the quarterly variations in the industrial production index (I_k)
and industrial employment (E_k) over the period 2nd qtr. 1971 to 4th qtr. 1981,
as shown in Fig. 9a. Since there are long-term trends in the data,
Fig. 9. Data for economic example: (a) measured data; (b) de-trended data (input; output).

RWSMOOTH was utilised to remove this nonstationarity and the resulting
small perturbation time series are shown in Fig. 9b: the existence of some form
of lagged relationship is clear from this plot.
Part of the MICROCAPTAIN results for the best identified first-order
model, with two parameters (a_1, b_0) and zero pure time delay, are shown in Fig.
10. From the error series ξ̂_k, it will be noted that the data are explained
reasonably well (R² = 0.79), but there is a distinctly large negative error in
the latter part of the data. This coincides with the election of the Conservative
Government in 1979 and its subsequent harsh monetarist policies, which have
led to a fall in the industrial index from a high point of about 108 in 1979 to
current levels of less than 90.
This behaviour and the nature of the recursive estimates suggest the pos-
sibility of changes in the dynamic relationship between I_k and E_k over the data
set. Fig. 11 shows one set of results obtained from the TVAR option in
MICROCAPTAIN, in which variations in the b̂_0 estimate are investigated,
under the assumption that the a_1 parameter is constant. It can be shown
quite simply that the 'steady-state gain' (b̂_0/(1 - â_1)) of the first-order model is
equivalent to the long-term multiplier (M) between I and E. The resultant
variations in M are tabulated in Table 3, where a large fluctuation can be
observed over the latter part of the data.

Fig. 10. MICROCAPTAIN results for economic example: (a) model output vs. data; (b) estimated noise signal.

Fig. 11. MICROCAPTAIN results for economic example, assuming time-variable b_0 parameter: (a) â_1; (b) b̂_0 (capital letters refer to similar time periods in Table 3).

The interpretation of these initial results is outside the scope of this chapter.
But the analysis does expose the possible value of recursive estimation in the
investigation of economic data. Provided the results are used with care and the
limitations of the analysis are acknowledged, we see that potentially useful
information can be obtained.

7.4. Business data analysis

In this example, we will look at a set of sales data analysed by Jenkins (1979).
The data are shown in Fig. 12: they represent the monthly sales of a com-
petitive group of products from two organisations; one set are the sales of the
organisation which commissioned Jenkins' study and the other are those of a rival
organisation.
Jenkins' analysis is based on the procedures described in his book (Box and
Jenkins, 1970) and it involves initial pre-processing in the form of first
differencing and differencing over lag 12: the first differencing is intended to
remove the obvious nonstationarity and the lag 12 differencing to remove the
equally noticeable seasonal behaviour. The resultant differenced data are
shown in Fig. 13, where it will be seen that a not surprising amplification of
noise has occurred because of the differencing process.
The mainframe C A P T A I N package has been used to perform an alternative

Table 3

Sample   Recursively    Recursive     M̂_s - 60   Possible reasons
number   estimated      RWSMOOTH                 for variation
         multiplier,    estimate,
         M̂              M̂_s

1        73             72            12
2        71             72            12
3        77             71            11
4        75             68            8
5        61             61            1
6        60             56            -5         (A)
7        32             45            -15        oil crisis?
8        27             42            -18        Common Market
9        41             47            -13        entry?
10       63             55            -5
11       61             59            -1
12       60             62
13       73             67
14       76             68
15       60             66
16       76             67
17       55             63
18       55             63
19       71             67
20       73             68
21       68             67
22       70             66
23       58             63
24       58             63
25       69             65
26       67             66
27       66             65
28       57             64
29       72             67
30       80             67
31       62             61
32       43             54            -6         (B)
33       57             54            -6         'Winter of
34       50             51            -9         discontent'?
35       41             49            -11
36       40             51            -9
37       53             59            -1
38       70             70            10         (C)
39       83             81            21         deflationary
40       92             91            31         policies?
41       101            100           40
42       128            110           50

Fig. 12. Data for business example (after Jenkins, 1979): Group 1 and Group 2 sales (logarithmic scale vs. number of samples), with the trend estimated as an SRW (smoothing parameter 0.99).

analysis. Here, the trend behaviour is estimated using the data preprocessing
procedures in CAPTAIN (the equivalent of RWSMOOTH in MICRO-
CAPTAIN) and the resultant small perturbation time series are shown in Fig.
14. In contrast to the results in Fig. 13, we see that noise amplification has been
avoided and a quite obvious lagged relationship appears to exist between the
two series.
It is possible to analyse the data in Fig. 14 on an input-output TF model
basis, as in other examples discussed in this chapter. However, we will use an
alternative approach in which the data are modelled as a purely stochastic
vector (multivariable) process. This can be justified in this case since the data
could well arise from a feedback process in which the changes in each
company's sales are, to some extent, dependent on those of its competitor. This
type of model is also consistent with the observed behaviour, which is similar to
that obtained from an oscillatory feedback system.
The identified and estimated model takes the following form:

y_k = Φ y_{k-1} + ξ_k ,                                (36a)

in which y_k = [y_1 y_2]^T is the vector of de-trended sales series and Φ the
estimated (2 × 2) coefficient matrix,

Fig. 13. Business data as pre-processed by Jenkins (key as for Fig. 12); Group 1 data shifted vertically for clarity.

where the noise vector ξ_k = [ξ_1 ξ_2]^T is the following first-order vector moving
average process:

ξ_k = e_k + Θ e_{k-1} ,                                (36b)

with Θ the estimated moving average coefficient matrix and the vector
e_k = [e_1 e_2]^T representing a white noise process with an instantaneous
covariance matrix

cov(e_1, e_2) = [  0.0018    -0.00005 ]                (36c)
               [ -0.00005    0.0012  ] .

A typical randomly selected stochastic simulation of this model (i.e. the
model is simulated with e_k as a white noise input vector with the required
covariance properties) is shown in Fig. 15. The general statistical characteristics
of the simulated data are quite similar to those of the real data in Fig. 14 and it
is clear that the model provides a reasonable, parametrically efficient represen-

[Figure: the detrended Group 1 and Group 2 series plotted against number of samples (0-96).]

Fig. 14. De-trended business data (key as for Fig. 12).

It is quite likely, therefore, that the model would provide a
good basis for forecasting, although this has not been attempted here.
The recursive estimates have also not been shown here, but they were useful
in identifying the model structure: in the finally identified model (36), the
estimates were sensibly stationary and well defined.
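A simulation of this kind is easy to sketch. In the fragment below, the innovation covariance is taken from (36c); of the MA(1) matrix in (36b) only the second row (-0.21, 0.68) is legible in the source, so the first row is a hypothetical placeholder; and the AR(1) matrix Phi is entirely illustrative, since the printed coefficients of (36a) are not recoverable here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 96

# Innovation covariance, from (36c).
Sigma = np.array([[0.0018, -0.00005],
                  [-0.00005, 0.0012]])

# MA(1) matrix of (36b): second row as printed, first row hypothetical.
Theta = np.array([[0.30, -0.10],
                  [-0.21, 0.68]])

# AR(1) matrix of (36a): illustrative values only.
Phi = np.array([[0.40, 0.53],
                [-0.30, 0.50]])

e = rng.multivariate_normal(np.zeros(2), Sigma, size=n + 1)
xi = e[1:] + e[:-1] @ Theta.T        # first-order vector MA noise (36b)

x = np.zeros((n, 2))
for k in range(1, n):
    x[k] = Phi @ x[k - 1] + xi[k]    # first-order vector AR (36a)
```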

7.5. Univariate time-series analysis


In this chapter, we have concentrated on the statistical modelling of input-
output data. In practice, however, the analyst is often confronted by a uni-
variate time series. As a final practical example, therefore, let us consider the
analysis of the 114 samples of air temperature data from Sussex (Lai, 1981),
shown in Fig. 16b, using the approach to univariate analysis outlined in
Subsection 5.3.
The initial AR model estimation in this case was based on a fairly low, 7th
order AR model, in order to save computational time in MICROCAPTAIN.
Reasonable simulation results have been obtained with similar low order AR
models, although higher order models than this are probably advisable in
general applications. The parameter vector for this AR model is estimated as

[Figure: simulated Group 1 and Group 2 series plotted against number of samples.]

Fig. 15. Typical stochastic simulation obtained from estimated business data model (key as for Fig. 12).

g" = [-1.45, 0.44, 0.11, -0.06, 0.19, -0.20, 0.11"]"r and the estimated impulse
response is shown in Fig. 16a.
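Given these estimates, the impulse response in Fig. 16a is generated by the standard AR recursion. A minimal sketch (the sign convention, with the model written as y_k + a₁y_{k−1} + ⋯ + a₇y_{k−7} = e_k, is an assumption):

```python
import numpy as np

# AR(7) parameter vector as quoted in the text.
a = np.array([-1.45, 0.44, 0.11, -0.06, 0.19, -0.20, 0.11])

def ar_impulse_response(a, n=40):
    """Impulse response of 1 / (1 + a1*L + ... + ap*L^p)."""
    h = np.zeros(n)
    h[0] = 1.0
    for t in range(1, n):
        m = min(t, len(a))
        # h_t = -(a1*h_{t-1} + ... + am*h_{t-m})
        h[t] = -np.dot(a[:m], h[t - m:t][::-1])
    return h

h = ar_impulse_response(a)
```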
The results of the identification analysis are summarised in Table 4. Two sets
of identification statistics are shown: the 'IV Results' were obtained using the
method described in Subsection 5.3; while for the 'LS Results' recursive least
squares, rather than IV estimation, was employed to estimate the ARMA
model parameters.
Both sets of results indicate that the best identified ARMA model has a
second-order C polynomial and either a first- or second-order D polynomial
(i.e. 2, 2 or 2, 3 models). However, closer examination of the results indicates
that both models have similar impulse response and spectral characteristics,
while the d̂₂ estimate in the higher order case (-0.04 ± 0.04) is not
significantly different from zero. Consequently, we can reject this model as
being over-parameterised and accept the 2, 2 model, which has the form

$$\xi_k = \frac{1 - 0.25L}{1 - 1.71L + 0.79L^2}\, e_k, \tag{37}$$

as being the parametrically most efficient ARMA representation.
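A stochastic simulation of (37), of the kind shown in Fig. 16c, is a three-term recursion; a sketch (unit innovation variance is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 114                      # same length as the temperature record
e = rng.standard_normal(n)

# xi_k = 1.71*xi_{k-1} - 0.79*xi_{k-2} + e_k - 0.25*e_{k-1}, from (37).
xi = np.zeros(n)
for k in range(n):
    xi[k] = e[k]
    if k >= 1:
        xi[k] += 1.71 * xi[k - 1] - 0.25 * e[k - 1]
    if k >= 2:
        xi[k] -= 0.79 * xi[k - 2]
```

The roots of the denominator of (37) are complex with modulus greater than one, so the simulated series is stationary but markedly oscillatory, as the temperature data are.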



Fig. 16. MICROCAPTAIN results for univariate temperature data analysis: (a) estimated impulse
responses (7th order AR model ——; ARMA (2, 2) model ---); (b) model output (——) vs. data
(---); (c) typical stochastic simulation from estimated ARMA (2, 2) model.

The slight ambiguity of the results in Table 4 is quite characteristic of this
kind of univariate analysis, particularly when low order AR models are utilised
in the initial phase of the analysis. The ambiguity does not seem problematic in
practice, however, since either closer examination reveals a superior model
(as in this case), or the models are virtually identical in response terms (Young,
1983). In the latter case, it is likely that any of the competing models would
prove satisfactory in practical terms.
The impulse response of the model (37) is shown as a dashed line in

Table 4
Temperature time series

          IV Results            LS Results
Model     R²       EVN          R²       EVN

1, 1      0.41     -3.08        0.39     -3.12
1, 2      0.62     -3.25        0.54     -3.11
2, 1      0.90     -4.56        0.90     -4.59
2, 2      0.98     -5.60        0.98     -5.37
2, 3      0.99     -6.10        0.98     -5.4
3, 2      0.99     -3.53        0.97     -2.48
3, 3      No convergence        0.99     -2.68
3, 1      0.97     -4.96        0.97     -5.05
4, 1      0.98     -5.31        0.98     -5.35
5, 1      0.99     -6.4         0.99     -6.4
6, 1      0.99     -5.46        0.99     -5.47

Fig. 16a and the model output is compared with the original time-series data in
Fig. 16b. All of the recursive estimates except d̂₁ show rapid convergence and
well-defined characteristics: d̂₁ shows some variation at around 70 samples and
this probably means that it would be wise to carry out further analysis on
another independent data set before accepting and using the model.
Finally, Fig. 16c shows a randomly selected stochastic simulation of the
model (37). By comparison of this time-series with the original time series in
Fig. 16b, it would appear that the analysis has captured the major statistical
patterns in the data and that the ARMA model (37) provides a reasonable
basis for applications such as forecasting, which are referred to in the next
section.

8. State-of-the-art of recursive estimation

This chapter has been concerned primarily with the recursive approach to
time-series analysis developed by the author and his colleagues over the past
fifteen years. Before completing the chapter, therefore, it is necessary to
outline certain aspects of other research on the subject which have led to
practically useful procedures.
Since the appearance of Kalman's seminal paper in 1960, there have been
many hundreds of papers in the control and systems literature on the subject
of recursive estimation, many of which have made practical contributions. It
would be impossible to review this work satisfactorily here, but the reader can

obtain some feeling for the large body of literature by consulting the review by
Young (1981b) and the Proceedings of the various IFAC Symposia on
Identification and System Parameter Estimation that have appeared in the last
few years (e.g. Eykhoff, 1973; Isermann, 1979).
One specific area of practical importance that should be mentioned,
however, is 'self-adaptive' and 'self-tuning' control, where a control system is
given the ability to update its control parameters on a continuing basis, in order
to achieve some specified performance. Much research on this topic was carried
out in the 1960s (e.g. Kopp and Orford, 1963; Lee, 1964; Young, 1965, 1966)
and the importance of recursive updating to the development of practical
adaptive control systems has become increasingly apparent since this time (see
Young, 1981a). More recent research (reviewed, for example, by Goodwin and
Sin, 1984) has demonstrated the practical potential of these methods in a wide
variety of application areas. There have been considerable practical developments,
for example, in the area of self-tuning control (Åström and Wittenmark,
1973), where several systems have already been successfully implemented (e.g.
Åström, 1980; Clarke and Gawthrop, 1981; Sanoff and Wellstead, 1982) and
commercial units are becoming available for use in industry.
Other important work on recursive estimation has appeared in the statistics
and economics literature, often within the context of the Kalman filter. Again it
would be impossible to adequately review this literature here, and we will
merely mention briefly certain developments which seem to be of practical
merit. It is worth remarking, however, that significant theoretical developments
have also been reported (e.g. Hannan and Tanaka, 1976; Solo, 1978) and these
have considerably increased the confidence of the practitioner in using recursive
methods.
In the early 1970s a number of papers attempted to demonstrate the rele-
vance of recursive estimation to statistical and econometric data analysis. In
general terms, however, they tended to parallel earlier research reported in the
systems literature. A rather more novel insight into the value of recursive
estimation was provided in an important paper by Brown et al. (1975), which
showed how recursive least squares methods could be exploited for the
detection of changes in regression relationships over time.
A quite different use of recursive estimation was proposed at about the same
time by Harrison and Stevens (1976), in a paper which has apparently had
considerable practical impact in the area of time-series forecasting. We have
not dealt with this aspect of recursive estimation in this chapter, since forecast-
ing per se is of less importance in control and systems analysis than in
economic and business data analysis. However, certain aspects of the Harrison
and Stevens methodology are closely related to the data pre-processing
procedures outlined in Subsection 5.1 of this chapter.
In effect, Harrison and Stevens specify simple statistical models (such as the
RW family) for trends and seasonality in the data and use a Kalman filter to
update estimates of these lower frequency components. The techniques are, in
fact, quite closely related to earlier, more conventional procedures, such as
252 ~ Young

those proposed by Brown (1963). This becomes clear if the steady-state,
asymptotic gain versions of the Kalman filter and smoothing algorithms are
considered (see Section 4). In this situation, the Kalman algorithms for models
such as the RW family become relatively conventional, fixed gain filters which
have an exponential smoothing-like effect on the time-series data (Young,
1984).
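As an illustration of that point, consider the scalar random-walk trend model: in the steady state the Kalman update collapses to a single fixed-gain recursion, which is exactly exponential smoothing. A minimal sketch (the gain value is hypothetical, not a CAPTAIN setting):

```python
import numpy as np

def fixed_gain_trend(y, g=0.1):
    """Steady-state (constant gain) filter for a random-walk trend:
    m_k = m_{k-1} + g * (y_k - m_{k-1}), i.e. exponential smoothing."""
    y = np.asarray(y, dtype=float)
    m = np.empty_like(y)
    m[0] = y[0]
    for k in range(1, len(y)):
        m[k] = m[k - 1] + g * (y[k] - m[k - 1])
    return m
```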
Harrison and Stevens' major contribution is the recognition that these simple
procedures can prove useful in practical forecasting terms and their techniques
have been applied to a wide variety of forecasting problems. This has been
aided by their development of an interactive computer program package
SHAFT (Stevens Harrison Adaptive Forecasting Technique). In these pro-
grams, the basic approach discussed above is embedded within a 'multi-process
model': here four Kalman filtering algorithms, each based on one of the family
of simple stochastic models used by Harrison and Stevens, are run in parallel
and each one is given a 'state probability' which is used to assign the relative
importance of each one in forecasting the series.
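The flavour of such a multi-process scheme can be conveyed in a few lines. The sketch below is a crude stand-in for the Harrison-Stevens algorithm, not their implementation: it runs a bank of fixed-gain trend filters in parallel (a simplifying substitute for their four Kalman filters) and weights the one-step forecasts by recursively updated model probabilities:

```python
import numpy as np

def multi_process_forecast(y, gains=(0.05, 0.2, 0.5, 0.8)):
    """Parallel fixed-gain filters weighted by model probabilities."""
    y = np.asarray(y, dtype=float)
    gains = np.asarray(gains)
    m = np.full(len(gains), y[0])              # one trend estimate per model
    p = np.full(len(gains), 1.0 / len(gains))  # prior model probabilities
    s2 = y.var() + 1e-9                        # fixed innovation variance (assumption)
    for yk in y[1:]:
        innov = yk - m                         # one-step-ahead forecast errors
        p *= np.exp(-0.5 * innov ** 2 / s2)    # Gaussian likelihood weighting
        p /= p.sum()                           # posterior 'state probabilities'
        m += gains * innov                     # per-model state update
    return float(p @ m), p                     # probability-weighted forecast
```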
The development of Harrison and Stevens' programs probably marks the
final acceptance by the statistical establishment of the practical utility of
recursive estimation in forecasting (see also papers such as Conrad and
Corrado, 1982). However, their acceptance as practically advantageous
methods of general time-series analysis, as described in this paper, is not yet
complete. Nevertheless, recent contributions such as those of Harvey and
Phillips (1979) and Harvey (1981) have heightened the classical time-series
analyst's consciousness in this regard. Hopefully, this will lead to a greater
realisation of the value of recursive methodology in the future. After all, many
nonrecursive time-series techniques can now be implemented in an alternative
recursive manner, often with some computational advantage. And in this
manner, the analyst is provided with the extra dimension of the recursive
estimation which, as we have seen in this chapter, can be so useful to the
interpretation of the analytical results, particularly when the time series may
have nonstationary statistical characteristics.

Acknowledgements

The economic data analysed in Subsection 7.3 was collected by Miss Rosemary
Prince. The CAPTAIN analysis in Subsection 7.4 was carried out with the
help of Miss Christina Sirakott. Both contributions are gratefully
acknowledged.

References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-217.
Akaike, H. (1972). Use of an information theoretical quantity for statistical model identification.
In: Proc. of 5th Hawaii Int. Conf. on Systems Science, 249-250. Western Periodicals, North
Hollywood.

Akaike, H. (1979). Likelihood and the Bayes procedure. Int. Mtg. on Bayesian Statistics, Valencia,
Spain.
Arthur, W. B. (1978). Stochastic control for linear discrete-time distributed lag models. Internat. J.
Control 28, 611-619.
Åström, K. J. (1970). Introduction to Stochastic Control Theory. Academic Press, New York.
Åström, K. J. (1980). Self-tuning regulators: design principles and applications. In: K. S. Narendra
and I. Monopoli, eds., Applications of Adaptive Control.
Åström, K. J. and Wittenmark, B. (1973). On self-tuning regulators. Automatica 9, 185-199.
Beer, T. and Young, P. C. (1983). Longitudinal dispersion in natural streams. ASCE J. Env. Eng.
109, 1049-1067.
Beer, T., Young, P. C., Humphries, R. B. and Burgess, J. S. (1982). Environmental Water Quality
(CRES Monograph). ANU Press, Canberra.
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control. Holden Day,
San Francisco, CA.
Brown, R. G. (1963). Smoothing, Forecasting and Prediction. Prentice-Hall, Englewood Cliffs, NJ.
Brown, R. L., Durbin, J. and Evans, J. M. (1975). Techniques for testing the constancy of
regression relationships over time. J. Roy. Statist. Soc. Ser. B 37, 149-192.
Bryson, A. E. and Ho, Y. C. (1969). Applied Optimal Control. Blaisdell, MA.
Clarke, D. W. and Gawthrop, P. J. (1981). Implementation and application of microprocessor-
based self tuners. Automatica 17, 233-244.
Conrad, W. and Corrado, C. (1982). Application of the Kalman filter to revisions in monthly sales
estimates. J. Econom. Dynamics Control 1, 177-198.
Davis, P. F. (1983). Time-series analysis applied to data smoothing for an investigation of automatic
adjustments of rates of fitted parameter adaption. National Inst. of Agricult. Eng., Divisional
Note DN/1168.
Eykhoff, P. (ed.) (1973). Identification and System Parameter Estimation. North-Holland, Am-
sterdam and American Elsevier, New York.
Gauss, K. F. (1821, 1823, 1826). Theoria combinationis observationum erroribus minimis obnoxiae,
Parts 1, 2 and supplement. Werke 4, 1-11/8.
Gelb, A. (ed.) (1974). Applied Optimal Estimation. MIT Press for The Analytic Sciences Cor-
poration, Cambridge, MA.
Goodwin, G. C. and Sin, K. S. (1984). Adaptive Filtering, Prediction and Control. Prentice-Hall,
Englewood Cliffs, NJ.
Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive moving average
order. Biometrika 69, 81-94.
Hannan, E. J. and Tanaka, K. (1976). ARMAX models and recursive calculations. In: H. Myoken,
ed., Proc. Conf. System Dynamics and Control in Quantitative Economics. Nagoya City University,
Nagoya City, Japan.
Harrison, P. J. and Stevens, C. F. (1976). Bayesian forecasting. J. Roy. Statist. Soc. Ser. B 38,
205-247.
Harvey, A. C. (1981). Time Series Models. Phillip Allan, Oxford.
Harvey, A. C. and Phillips, G. D. A. (1979). The estimation of regression models with ARMA
disturbances. Biometrika 66, 49-58.
Isermann, R. (ed.) (1979). Identification and System Parameter Estimation. Pergamon Press,
Oxford.
Jakeman, A. J. and Young, P. C. (1979). Recursive filtering and the inversion of ill-posed causal
problems. CRES, ANU, Report No. AS/R28/1979. Utilitas Math., to appear.
Jakeman, A. J. and Young, P. C. (1981). On the decoupling of system and noise model parameter
estimation in time-series analysis. Internat. J. Control 34, 423-431.
Jakeman, A. J. and Young, P. C. (1983). Advanced methods of recursive time-series analysis.
Internat. J. Control 37, 1291-1310.
Jakeman, A. J., Young, P. C. and Bayes, A. J. (1982). A computer program for general recursive
time-series analysis. In: G. Saridis, ed., Identification and System Parameter Estimation. Per-
gamon Press, Oxford.
Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York.

Jenkins, G. M. (1979). Practical experiences with modelling and forecasting time-series. In: O. D.
Anderson, ed., Forecasting. North-Holland, Amsterdam.
Johnston, J. (1963). Econometric Methods. McGraw-Hill, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME
Ser. D, J. Basic Eng. 82, 35-45.
Kendall, M. G. and Stuart, A. (1961). The Advanced Theory of Statistics, Vol. 2. Griffin, London.
Kirk, B. L., Rust, B. W. and Van Winkle, W. (1979). Time series analysis by the maximum
likelihood method. Env. Sciences Div., Public. No. 1220. Oak Ridge National Laboratory.
Kopp, R. E. and Orford, R. J. (1963). Linear regression applied to system identification for
adaptive control systems. AIAA J. 1, 2300-2306.
Lai, P. W. (1981). Transfer function modelling: relationship between time-series variables. Concepts
and Techniques in Modern Geography, No. 22.
Lee, R. C. K. (1964). Optimal Estimation, Identification and Control (Res. Monograph 28). MIT
Press, Cambridge, MA.
Ljung, L. (1979). Convergence of recursive estimators. In: R. Isermann, ed., Identification and
System Parameter Estimation, 131-144. Pergamon Press, Oxford.
Ljung, L. (1979b). Asymptotic behaviour of the Extended Kalman filter as a parameter estimator
for linear systems. IEEE Trans. Automat. Control. AC-24, 36-50.
Norton, J. P. (1975). Optimal smoothing in the identification of linear time-varying systems. Proc.
Inst. Electr. Engrs. 122, 663-668.
Pagan, A. (1975). A note on the extraction of components from time-series. Econometrica 43,
163-168.
Parzen, E. (1977). Multiple time-series: determining the order of approximating autoregressive
schemes. In: P. Krishnaiah, ed., Multivariate Analysis-IV. North-Holland, Amsterdam.
Pierce, D. A. (1972). Least squares estimation in dynamic disturbance time-series models.
Biometrika 59, 73-78.
Plackett, R. L. (1950). Some theorems in least squares. Biometrika 37, 149-157.
Priestley, M. B. (1980). State dependent models: a general approach to nonlinear time-series
analysis. J. Time Series Anal. 1, 47-72.
Rissanen, J. (1976). Minimax entropy estimation of models for vector processes. In: R. K. Mehra
and D. G. Lainiotis, eds., Systems Identification: Advances and Case Studies. Academic Press,
New York.
Sanoff, S. P. and Wellstead, P. E. (1982). Extended self-tuning: practical aspects. In: G. Saridis, ed.,
Identification and System Parameter Estimation. Pergamon, Oxford.
Söderström, T. (1973). An on-line algorithm for approximate maximum likelihood identification of
linear dynamic systems. Lund Inst. of Tech., Div. Auto. Control, Rep. No. 7308.
Solo, V. (1978). Time-series recursions and stochastic approximation. Ph.D. thesis. The Australian
National University, Canberra, Australia.
Stoica, P. and Söderström, T. (1983). Optimal instrumental variable estimation and approximate
implementations. IEEE Trans. Automat. Control AC-28.
Truxal, J. G. (1955). Control System Synthesis. McGraw-Hill, New York.
Whitehead, P. G., Young, P. C. and Hornberger, G. (1979). A systems model of stream flow and
water quality in the Bedford-Ouse River, I: Stream flow modelling. Water Res. 13, 1155-1169.
Wiener, N. (1949). The Extrapolation, Interpolation and Smoothing of Stationary Time-Series. Wiley,
New York.
Young, P. C. (1965). The determination of the parameters of a dynamic process. Radio Electron.
Engineer (J. Brit. IERE) 29, 345-362.
Young, P. C. (1966). Process parameter estimation and self adaptive control. Proc. IFAC Symp.
Teddington. Appears in: P. H. Hammond, ed., Theory of Self Adaptive Control Systems. Plenum
Press, New York.
Young, P. C. (1970). An instrumental variable method for real-time identification of a noisy
process. Automatica 6, 271-287.
Young, P. C. (1974). Recursive approaches to time-series analysis. Bull. Inst. Maths. Appl. 10,
209-224.

Young, P. C. (1976). Some observations on instrumental variable methods of time-series analysis.
Internat. J. Control 23, 593-612.
Young, P. C. (1978). A general theory of modeling for badly defined systems. In: G. C.
Vansteenkiste, ed., Modeling, Identification and Control in Environmental Systems. North-
Holland, Amsterdam.
Young, P. C. (1979). Self adaptive Kalman filter. Electron. Lett. 15, 358.
Young, P. C. (1981a). A second generation adaptive autostabilization system for airborne vehicles.
Automatica 17, 459-470.
Young, P. C. (1981b). Parameter estimation for continuous-time models - a survey. Automatica
17, 23-39.
Young, P. C. (1982a). The validity and credibility of models for badly defined systems. In: M. B.
Beck and G. van Straten, eds., Uncertainty and Forecasting of Water Quality. Springer, Berlin.
Young, P. C. (1982b). MICROCAPTAIN: A microcomputer aid in the modelling of environmental
systems. Report No. 9. Department of Environmental Sciences, University of Lancaster.
Young, P. C. (1983). An instrumental variable approach to ARMA model identification and
estimation. Report No. 11. Department of Environmental Sciences, University of Lancaster.
Young, P. C. (1984). Recursive Estimation and Time-Series Analysis. Springer, Berlin.
Young, P. C. and Jakeman, A. J. (1979-1980). Refined instrumental variable methods of recursive
time-series analysis; Parts I, II and III. Internat. J. Control 29, 1-30; 29, 621-644; 31, 741-764.
Young, P. C., Jakeman, A. J. and McMurtrie, R. (1980). An instrumental variable method for
model order identification. Automatica 16, 281-294.
Young, P. C., Shellswell, S. H. and Neethling, C. G. (1971). A recursive approach to time-series
analysis. Report. No. CUED/B-Control/TR16. Department of Engineering, University of Cam-
bridge, England.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 257-277

General Structure and Parametrization of ARMA
and State-Space Systems and its Relation
to Statistical Problems

M. Deistler*

1. Introduction

In the process of identification (in the sense of specification and estimation)
of linear dynamic systems, rather complicated problems connected with the
parametrization of the model occur. A certain part of these complications
arises only in the multivariable case, and this is one reason why identification of
multivariable systems (see [2, 14, 28, 29, 31, 32, 34, 47, 59]) is not yet a standard
task in applications.
We are concerned here with the parametrization of (discrete-time) linear, time-invariant
and finite-dimensional systems, both in (vector-) difference
equation and in state-space form, emphasizing the first case. For the results
presented it is not essential whether the inputs are unobserved white noise, or
whether there are in addition observed inputs (or, as a third case, whether the
system is deterministic). Thus, for the sake of notational convenience, we will
mainly discuss the case where the inputs are unobserved white noise.
An ARMA system is a system of linear difference equations of the form

$$\sum_{i=0}^{p} A(i)\, y(t-i) = \sum_{i=0}^{q} B(i)\, e(t-i), \tag{1.1}$$

where the unobserved inputs e(t) are white noise, i.e.

$$\mathrm{E}\, e(t) = 0, \qquad \mathrm{E}\, e(s)\, e'(t) = \delta_{st}\, \Sigma, \tag{1.2}$$


where y(t) are the (observed) outputs and where A(i), B(i) ∈ ℝ^{s×s} are
parameter matrices. Let

$$a(z) = \sum_{i=0}^{p} A(i)\, z^i, \qquad b(z) = \sum_{i=0}^{q} B(i)\, z^i.$$

*The author thanks Prof. E. J. Hannan and Dr. B. M. Pötscher for valuable comments.


Then we assume throughout that

det a(z) ≢ 0.  (1.3)

If we add a term Σᵢ₌₀ʳ D(i) z(t − i), D(i) ∈ ℝ^{s×m}, on the right-hand side of (1.1),
where z(t) are observed inputs (exogenous variables), we have an ARMAX
system.
Every (linear) regular (wide sense) stationary solution (y(t)), t ∈ ℤ, of an
ARMA system is called an ARMA process. One reason for the great practical
importance of ARMA models is that every regular stationary process can be
approximated with arbitrary accuracy by an ARMA process (this corresponds
to a rational approximation of the spectral density) and that only a finite
number of (real-valued) parameters is needed for the description of ARMA
systems.
Every ARMA process can be obtained from a state-space system

x(t + 1) = F x(t) + G e(t),  (1.4)
y(t) = H x(t) + e(t),  (1.5)

where x(t) is the (n-dimensional) state at time t and F ∈ ℝ^{n×n}, G ∈ ℝ^{n×s},
H ∈ ℝ^{s×n} are parameter matrices.
If (1.1) (or (1.4), (1.5)) generates an ARMA process, we call (1.1) (or (1.4),
(1.5)) the ARMA (or state-space) representation of this process or of the
corresponding transfer function k(z) = a⁻¹(z) b(z).
Adding a term, K z(t), K ∈ ℝ^{n×m}, on the right-hand side of (1.4) gives a
state-space representation corresponding to an ARMAX system.
The problem of identification can be stated in our context as follows (for a
more elaborate formulation see [42]): given are the (set of feasible) observations
and a model (class), i.e. a certain class of ARMA(X) or state-space systems.
Then we have to show how to compute from every fixed and finite part y(t),
t = 1, …, T, of the observations a system within the model class which can be
described in a 'convenient' way and which gives a 'good' representation of the
observations. In most cases it is assumed (as we do here throughout) that the
observations are generated by a linear, time-invariant system. Then one
criterion for a good representation of the observations is that for T → ∞ (almost
surely) the systems selected by the identification procedure will become arbitrarily
close in their input-output behavior (i.e. in their transfer functions) to
the true system (i.e. the system generating the data).
In identification of linear dynamic systems the following problems occur:
(i) In many cases the model class is so large (e.g. when it is the class of all
ARMA systems) that it has to be broken into pieces for the purpose of
convenient description; each piece (submodel) then is described by certain
integer-valued parameters as, for example, p and q in the ARMA or n in the
state-space case. We call this the dynamic specification of the submodel.

(ii) For the model or a submodel corresponding to given integer-valued
parameters a suitable, unique description of all corresponding transfer functions,
or of all equivalence classes of systems corresponding to these transfer
functions, by real-valued parameters has to be given. This is the problem of
parametrization in the narrow sense, whereas parametrization in the wide sense
also includes the characterization by the integer-valued parameters.
(iii) Now let us turn to the problem of inference, both for the integer-valued
parameters (if they are not known a priori) and for the real-valued parameters
(in the A(i), B(i) (or in F, G, H) and in Σ). The determination of the
integer-valued parameters from the data can be done either by data analytic
search procedures (see e.g. [6]) or by formal inference procedures (by estimation
procedures like minimum AIC or BIC or by test procedures) [2, 3, 30, 31,
34]. The most common estimation methods for the real-valued parameters, in
the case when the integer-valued parameters have already been specified, are
(Gaussian) maximum likelihood estimation and related methods. The Gaussian
likelihood function and related criteria depend on the parameters A(i),
B(i) or F, G, H only via the transfer function k(z).
For the sake of formal analysis, identification may be decomposed into three
steps: first, if necessary, inference for the integer-valued parameters, then
estimation of the transfer function k(z) (in the submodel) and of Σ, and then
finally the parametrization of the estimate of k(z) by the A(i) and B(i) (or by
the F, G, H). This shows the importance of the properties of the relation
between transfer functions and parameters for estimation (and numerical
calculation) as, for example, of uniqueness and of continuity properties of this
relation. The first two steps may be interpreted as an approximation of the
sample covariances (which, at least asymptotically, contain all information
from the data used in the common procedures) by an element of the set of second
moments corresponding to the (suitably chosen sub)model.

2. ARMA representations

It should be noted that it makes some difference whether our primary
interest is in systems identification, or in prediction, or in the modelling of time
series by spectral densities.
We will restrict ourselves to solutions corresponding to processes defined on ℤ
(rather than considering solutions starting from an initial value). For the
moment let the s(t) be general inputs, which are not necessarily white noise
nor stationary, to cover the ARMAX case as well.
Commencing from a given system, the restriction to causal solutions, i.e. to
solutions where the output y(t) does not depend on future inputs s(s), s > t, is
a priori justified in most cases. Therefore, we will require throughout that the
transfer function

k(z) = a⁻¹(z) b(z)  (2.1)



has a convergent power series expansion

$$k(z) = \sum_{i=0}^{\infty} K(i)\, z^i \tag{2.2}$$

in a suitable neighborhood of zero. This is a causality requirement in the sense
that then there exists a causal solution of the form

$$y(t) = \sum_{i=0}^{\infty} K(i)\, e(t-i), \tag{2.3}$$

provided that the inputs are such that the sum on the right-hand side of (2.3)
exists.
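Numerically, the power series coefficients follow from an MFD by the recursion Σⱼ A(j) K(i − j) = B(i) (with B(i) = 0 for i > q), solved forward for K(i). A minimal sketch:

```python
import numpy as np

def transfer_coefficients(A, B, n_terms=10):
    """K(0), K(1), ... of k(z) = a(z)^{-1} b(z), as in (2.2).

    A and B are lists of s x s coefficient matrices A(0..p), B(0..q);
    A[0] is assumed nonsingular."""
    s = A[0].shape[0]
    A0inv = np.linalg.inv(A[0])
    K = []
    for i in range(n_terms):
        rhs = B[i].copy() if i < len(B) else np.zeros((s, s))
        for j in range(1, min(i, len(A) - 1) + 1):
            rhs -= A[j] @ K[i - j]
        K.append(A0inv @ rhs)
    return K
```

With A(0) = B(0) = I this yields K(0) = I, consistent with the normalization (2.6) below.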
In many cases, in addition, the stability condition

det a(z) ≠ 0, |z| ≤ 1,  (2.4)

is required. In the ARMA case, the only causal and stationary solutions
excluded by (2.4) may occur in the case where Σ is singular and where k has a
pole for some |z| = 1. Under (2.4) the system (1.1) has a unique stationary
solution (2.3) whose spectral density is given by

$$f_y(\lambda) = (2\pi)^{-1}\, k(e^{-i\lambda})\, \Sigma\, k^*(e^{-i\lambda}), \tag{2.5}$$

where k* denotes the conjugate transpose of k. Since f_y is rational (in e^{-iλ}),
(y(t)) is regular (see e.g. [26, 57]).
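Given a truncation of (2.2) and Σ, the density (2.5) can be evaluated pointwise; a small sketch (truncation error is ignored):

```python
import numpy as np

def spectral_density(K, Sigma, lam):
    """f_y(lambda) of (2.5), from truncated coefficients K(0), K(1), ..."""
    k = sum(Ki * np.exp(-1j * lam * i) for i, Ki in enumerate(K))
    return (k @ Sigma @ k.conj().T) / (2.0 * np.pi)
```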
If we commence from an arbitrary stationary process (y(t)) with rational
spectral density, then a corresponding ARMA system (1.1) can always be found
[26, 57]. Without restriction of generality, every such f_y can be factorized such
that in (2.5)

k(0) = I,  (2.6)

the stability condition (2.4) and the miniphase condition

det b(z) ≠ 0, |z| < 1,  (2.7)

hold. Since under these assumptions the e(t) are the prediction errors of the
best linear (least squares) one-step-ahead prediction (and thus (linear) innovations)
of y(t) (see e.g. [26, 57]), these conditions are natural if we commence
from an ARMA process (rather than from a system).
Σ is uniquely determined under these assumptions. We will assume
throughout that

Σ > 0;  (2.8)

then also k(z) is uniquely determined from f_y. Equation (2.8) means that we
restrict ourselves to nonsingular (over the field of rational functions) spectral
density matrices f_y. Otherwise there exists an s̃ < s-dimensional process ỹ(t),
where (ỹ(t)) and (y(t)) are mutually obtained by finite linear transformations
from each other and where the spectral density of (ỹ(t)) is nonsingular. In this
sense, if (2.8) is violated, the problem can be suitably transformed to satisfy
(2.8). Also, a violation of (2.8) would be indicated from the data.
In the case of ARMAX systems, we have a second transfer function

l(z) = a⁻¹(z) d(z),
where
$$d(z) = \sum_{i=0}^{r} D(i)\, z^i.$$

l(z) can be obtained more directly (without solving the spectral factorization
problem) from the observations than k(z): for example, if (z(t)) is stationary
with spectral density f_z, then l is obtained from

$$l(e^{-i\lambda}) = f_{yz}(\lambda)\, f_z^{-1}(\lambda),$$

where f_yz is the cross spectrum between (y(t)) and (z(t)), provided that f_z(λ) is
nonsingular on a set of positive Lebesgue measure.
Of course for d(z) a miniphase assumption would not be justified in general.
Also unstable systems might be of interest. This is one reason (besides the
point of notational convenience) why we do not impose (2.4) and (2.7)
throughout (and only require that k(z) has a convergent power series expansion
in a neighborhood of zero, (2.6) and (2.8)). However, we will also discuss
the consequences of additionally assuming (2.4) and (2.7) later in this
contribution.
As already mentioned, the transfer function k and Σ are uniquely obtained
from (y(t)). The parametrization of Σ by the vector σ(Σ) of its on- and
above-diagonal elements is straightforward: we will assume that there are no
further restrictions on Σ and no joint restrictions between the elements of Σ
and those of the A(i), B(i). As can be easily seen, {σ(Σ) | Σ ∈ ℝ^{s×s}; Σ > 0} is an
open subset of ℝ^{s(s+1)/2}. Thus we can restrict ourselves to the problem of the
parametrization of the transfer functions.
First we consider the parameter space for all ARMA systems, where s is
fixed and where p and q are arbitrary. Let Θ_A denote the set of all parameter
matrices (A(0), A(1), …, B(1), B(2), …) satisfying our assumptions. A pair
(a, b) of polynomial matrices such that (2.1) holds is called a left matrix fraction
description (MFD) of k(z). We will identify (a, b) with the corresponding
parameter matrices (A(0), A(1), …, B(1), B(2), …). By π: Θ_A → U_A we
denote the mapping attaching to every (a, b) the corresponding transfer function
k = a⁻¹b, where U_A is just the image of Θ_A under π.

As far as the observations (at least in the regular stationary case) are
concerned, we cannot distinguish between two elements in Θ_A corresponding to
the same k. Thus from this point of view, if Θ ⊂ Θ_A is the parameter space
under consideration, it is natural to consider also the quotient set Θ/π of Θ by π
(restricted to Θ). π⁻¹(k) (∩ Θ) is called the k-equivalence class or the class
of all observationally equivalent MFD's corresponding to k (in Θ).
For most practical applications, parameter spaces that are subsets of Euclidean
spaces are used because they are more convenient in several respects, e.g.
for the optimization of the likelihood function. In addition we want a unique
description of the transfer functions by their parameters. A subclass of ARMA
systems (a subset Θ ⊂ Θ_A) is called identifiable if π restricted to Θ is injective, i.e.
if within this class the MFD's (a, b) are uniquely determined from the
corresponding transfer functions a⁻¹b.
If (a, b) ∈ Θ, then in general not all entries in the matrices A(i), B(i) will be
needed for the description of (a, b), as some entries may be a priori known to
be constant over Θ or to be functions of other entries. Let θ ∈ ℝ^d be a vector
of minimal dimension consisting of entries of the A(i), B(i) such that (a, b) is
uniquely determined by θ. We call θ a vector of free parameters, and we
identify θ with (a, b) ∈ Θ. If Θ is identifiable, then the function ψ: π(Θ) →
Θ: ψ(π(θ)) = θ is called an (ARMA) parametrization of U = π(Θ).
The polynomial matrix (a, b) is called (relatively) left prime if every (s × s
polynomial matrix) common left factor u (i.e. every u such that there exist
polynomial matrices (ā, b̄) with (a, b) = u(ā, b̄)) must be unimodular
(a polynomial matrix u is called unimodular if det u = const ≠ 0). (a, b) is (relatively)
left prime if and only if the degree of det a is minimal among all MFD's
of k, or if and only if (a(z), b(z)) has rank s for all z ∈ ℂ. (For these and
related results, see e.g. [4].) In order to avoid redundancy of description we
often restrict ourselves to irreducible, i.e. to left prime, MFD's (a, b). Then an MFD
(ā, b̄) is observationally equivalent to an irreducible MFD (a, b) if and only
if there exists a (nonsingular) polynomial matrix u such that

(ā, b̄) = u(a, b).  (2.9)

If (ā, b̄) is irreducible too, then u must be unimodular [25, 49, 56].


In many cases it turns out to be more convenient to consider the transfer
function

$$\tilde k(z) = k(z^{-1}) - I = \sum_{i=1}^{\infty} K(i)\, z^{-i} \tag{2.10}$$

rather than k(z) and the corresponding MFD's

$$\tilde k = \tilde a^{-1} \tilde b. \tag{2.11}$$

By (2.6), k is uniquely determined from k̃ (and in this sense k and k̃ can be



identified) and an MFD (a, b) of k is obtained from an MFD (ã, b̃) of k̃ in a
unique way from

$$(a(z), b(z)) = \{\operatorname{diag} z^{n_i}\}\big(\tilde a(z^{-1}),\; \tilde b(z^{-1}) + \tilde a(z^{-1})\big), \tag{2.12}$$

where n_i is the degree of the ith row of ã. Thus we may identify (ã, b̃) and
(a, b), and they even may be represented by the same vector θ of free
parameters. k̃ has the advantage of being strictly proper, i.e.

$$\lim_{z \to \infty} \tilde k(z) = 0,$$

and therefore the degree of the ith row of b̃ must be smaller than n_i.
By what was said above, the degree of det ã, n say, is an invariant of the set
of all irreducible MFD's of k̃, and is called the order of the system (or of k̃). By
M(n) we denote the set of all transfer functions k̃ of order n. To k̃ its (block)
Hankel matrix

$$H_{\tilde k} = \begin{pmatrix} K(1) & K(2) & K(3) & \cdots \\ K(2) & K(3) & K(4) & \cdots \\ K(3) & K(4) & K(5) & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix} \tag{2.13}$$

may be attached. H_k̃ has the following properties [43, 53]:
(i) The rank of H_k̃ is finite if and only if k̃ is rational. If k̃ has order n, then
H_k̃ has rank n.
(ii) If the ith row of H_k̃ is in the linear span of the rows in positions
i₁, …, i_K, then the (i + s)th row is in the linear span of the rows in positions
i₁ + s, …, i_K + s.
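Property (i) suggests a simple numerical route to the order: build a sufficiently large truncation of (2.13) and compute its rank. A sketch (for a finite truncation the rank is only a lower bound on n):

```python
import numpy as np

def block_hankel(K, rows, cols):
    """Truncation of (2.13); K[i] holds the s x s matrix K(i+1).

    Requires len(K) >= rows + cols - 1."""
    s = K[0].shape[0]
    H = np.zeros((rows * s, cols * s))
    for i in range(rows):
        for j in range(cols):
            H[i * s:(i + 1) * s, j * s:(j + 1) * s] = K[i + j]
    return H

# order estimate: n = np.linalg.matrix_rank(block_hankel(K, r, r))
```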
The natural topology for parameter spaces like Θ ⊂ ℝ^d is the relative
Euclidean topology. We here endow U_A with the topology corresponding to
the relative topology in the product space (ℝ^{s×s})^ℕ of the power series
coefficients (K(i))_{i∈ℕ}. As convergence of transfer functions then corresponds to
the pointwise convergence of all power series coefficients, we call this topology
the pointwise topology T_pt. If A is a set in a topological space, its closure is
denoted by Ā.

3. State-space representations

The general state-space system (in our context of linear, time-invariant and
finite-dimensional systems) is of the form

x(t + 1) = F x(t) + G u(t) (+ K z(t)),  (3.1)
y(t) = H x(t) + v(t),  (3.2)

where u(t) and v(t) are unobserved white noise inputs satisfying

E u(t) = 0,  E v(t) = 0,
E u(s) u'(t) = δ_st Σ₁,  E v(s) v'(t) = δ_st Σ₂,
E u(s) v'(t) = δ_st S,

where x(t) is the state, y(t) is the output (z(t) is the observed input) and
F ∈ ℝ^{n×n}, G ∈ ℝ^{n×s}, H ∈ ℝ^{s×n} (K ∈ ℝ^{n×m}) are parameter matrices. Equation
(3.1) is called the state equation and (3.2) the observational equation. Again we
restrict ourselves to the case of unobserved white noise inputs.
State-space systems have been widely studied, especially in control engineering
(see e.g. [5, 9, 39, 40, 43, 61]), and the majority of the results on the
parametrization of linear systems have first been obtained for state-space
representations.
The stability condition that all eigenvalues of F are of modulus less than
unity, which corresponds to (2.4), guarantees that there is a unique (causal)
stationary solution of (3.1) and (3.2). In many cases (3.1) and (3.2) come directly
from the formulation of the theory considered.
Every state-space representation (3.1) and (3.2) can be transformed to a
'normalized' state-space system, having the same output y(t), as follows: let
x̂(t + 1 | t) denote the best linear predictor of x(t + 1) from y(s), s ≤ t. Then we
have from (3.1) and (3.2):

x̂(t + 1 | t) = F x̂(t | t − 1) + G e(t),  (3.3)
y(t) = H x̂(t | t − 1) + e(t),  (3.4)

where the e(t) are the prediction errors of the best linear prediction of y(t)
from y(s), s < t. Equations (3.3) and (3.4) are called the prediction error
representation and we here restrict ourselves to this representation.
The transfer function of (3.3) and (3.4) is given by

$$\tilde k(z) = z^{-1} H (I - F z^{-1})^{-1} G. \tag{3.5}$$

Every ARMA system can be transformed into a state-space system with the
same transfer function, and conversely.
Analogous to the ARMA case, let Δ_A denote the set of all parameter
matrices (F, G, H), F ∈ ℝ^{n×n}, G ∈ ℝ^{n×s}, H ∈ ℝ^{s×n}, where s is fixed and where n
is arbitrary. By μ: Δ_A → U_A we denote the mapping attaching to every
(F, G, H) the corresponding transfer function (3.5). k-equivalence, free
parameters θ ∈ ℝ^d for, and identifiability of subsets Δ ⊂ Δ_A are defined
completely analogously to the ARMA case. If Δ is identifiable, then the function
φ: μ(Δ) → Δ: φ(μ(θ)) = θ is called a (state-space) parametrization of U = μ(Δ).

A state-space system (or (F, G, H)) is called minimal if the dimension n of the
state vector x(t) is minimal among all state-space systems corresponding to the
same transfer function. As is well known, (F, G, H) is minimal if and only if
(G, FG, F²G, …, F^{n−1}G) has rank n (in this case the system is completely
reachable) and (H′, F′H′, …, (F′)^{n−1}H′) has rank n (in this case the system is
completely observable).
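This rank characterization is immediate to check numerically; a minimal sketch:

```python
import numpy as np

def is_minimal(F, G, H):
    """Full-rank test of the reachability matrix (G, FG, ..., F^{n-1}G)
    and the observability matrix (H', F'H', ..., (F')^{n-1}H')."""
    n = F.shape[0]
    reach = np.hstack([np.linalg.matrix_power(F, i) @ G for i in range(n)])
    obs = np.vstack([H @ np.linalg.matrix_power(F, i) for i in range(n)])
    return (np.linalg.matrix_rank(reach) == n and
            np.linalg.matrix_rank(obs) == n)
```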
In order to avoid redundancy of description we often restrict ourselves to
minimal systems. Two minimal (F, G, H) and (F̄, Ḡ, H̄) are observationally
equivalent if and only if there is a nonsingular matrix T ∈ ℝ^{n×n} such that

F̄ = TFT⁻¹,  Ḡ = TG,  H̄ = HT⁻¹.  (3.6)

If (F, G, H) is minimal, then the dimension of x(t) is equal to the order of k̃(z)
and thus to the number of linearly independent rows in the (block) Hankel
matrix (2.13). Thus M(n) is also the set of all transfer functions corresponding
to all minimal (F, G, H) with state dimension n.

4. Canonical forms

As already stated, our problem is (eventually) to cover the model class by
subclasses Θ_α (or Δ_α), each of which is characterized by certain integer-valued
parameters α, and then to parametrize the quotient sets Θ_α/π (or Δ_α/μ) or the sets
U_α ⊂ U_A of transfer functions corresponding to the subclasses Θ_α ⊂ Θ_A (or Δ_α ⊂ Δ_A).
In most cases it is convenient to give this parametrization directly in terms of
ARMA or state-space system parameters. If there are no additional a priori
restrictions (e.g. coming from 'physical' a priori knowledge), we are free to
prescribe suitable sets Θ_α (or Δ_α) to cover the model class Θ_A (or Δ_A) and to
prescribe suitable representatives for the k-equivalence classes in Θ_α (in Δ_α)
(k ∈ U_α) to define the parametrization. In this case two main lines have been
followed: canonical forms and the overlapping parametrization of M(n).
A canonical ARMA (state-space) form for a subset Θ ⊂ Θ_A (Δ ⊂ Δ_A) is a
function c: Θ → Θ (c: Δ → Δ) attaching to every k-equivalence class in Θ (in Δ)
a unique representative. Also the values of c are called canonical forms. The
canonical forms introduced here are defined for Θ = Θ_A (or Δ = Δ_A); however,
for obtaining finite-dimensional parameter spaces, subsets have to be
considered.
Of course there are many ways to define canonical forms (see e.g. [1, 17, 18,
21-24, 38, 45, 46, 48, 61]). The usual canonical forms can be obtained from the
Hankel matrix H_k̃ of the transfer function k̃. We will first explain this for an
important special case, namely for Echelon forms (see [18, 20, 49]).
If k̃ is of order n, then we select the first (in natural order) n linearly
independent rows of H_k̃; these rows form a basis for the row space of H_k̃. By
the block Hankel structure of H_k̃, this selection has the property that if the
(i + s)th row of H_k̃ is in this basis, then so also is the ith row. Thus this selection

may be characterized by a multiindex α = (n₁, …, n_s) indicating that the
selected basis rows are (in H_k̃) in positions 1, 1 + s, …, 1 + (n₁ − 1)s, 2, 2 + s,
…, 2 + (n₂ − 1)s, …, s, 2s, …, n_s s. Then n₁, …, n_s are called the Kronecker
indices of the system (or of k̃). Of course n = n₁ + ⋯ + n_s.
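In exact arithmetic the Kronecker indices can be read off a sufficiently large truncation of H_k̃ by a greedy scan for the first linearly independent rows; a sketch (the rank test with a tolerance is a numerical crutch):

```python
import numpy as np

def kronecker_indices(H, s, tol=1e-9):
    """Select the first linearly independent rows of the (truncated)
    block Hankel matrix H and count, per output component, how many
    were selected: these counts are the Kronecker indices n_1..n_s."""
    basis, selected = [], []
    for i in range(H.shape[0]):
        trial = np.vstack(basis + [H[i]]) if basis else H[i][None, :]
        if np.linalg.matrix_rank(trial, tol=tol) > len(basis):
            basis.append(H[i])
            selected.append(i)
    # row i (0-based) belongs to output component (i mod s) + 1
    return [sum(1 for i in selected if i % s == j) for j in range(s)]
```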
Now from

$$\tilde b(z) = \tilde a(z)\, \tilde k(z), \tag{4.1}$$

we obtain

$$0 = (\tilde A(0), \ldots, \tilde A(n))\, H^{n+1}_{\tilde k}. \tag{4.2}$$

Here ã(z) = Σ Ã(i) zⁱ, b̃(z) = Σ B̃(i) zⁱ and H^r_k̃ denotes the submatrix of H_k̃
consisting of the first r block rows of H_k̃.
Now (4.2) is used to obtain (Ã(0), …, Ã(n)) (and thus ã(z)) from H^{n+1}_k̃ as
follows: we express the row of H^{n+1}_k̃ in position i + n_i s as a linear combination
of the preceding selected basis rows. Then the ith row of (Ã(0), …, Ã(n)) is
defined as the vector of coefficients corresponding to this linear combination,
namely containing a 'one' at its (i + n_i s)th position, containing the negative
coefficients of the respective basis rows in the linear combination in positions
corresponding to the positions of the preceding basis rows, and containing zeroes
elsewhere. Doing this for i = 1, …, s defines a unique ã(z) and via (4.1) a unique
MFD (ã, b̃) and thus, by (2.12), a unique MFD (a, b). The free parameters in ã are
the coordinates of the respective preceding basis rows; all other elements of the
Ã(i) are either zero or one. All parameters in b̃ that are not explicitly restricted to
zero are free. Both (ã, b̃) and (a, b) obtained in the way described above are called
Echelon (canonical ARMA) form.
Let ã_ij, b̃_ij denote the (i, j) element of ã and b̃ respectively, and let δ(p) denote
the degree of the polynomial p indicated. Then (ã, b̃) is in Echelon form if and
only if all of the following properties hold:

(ã, b̃) are left prime,
ã_ii are monic polynomials (i.e. the leading coefficient is equal to 1),
δ(ã_ij) ≤ δ(ã_ii), j ≤ i;  δ(ã_ij) < δ(ã_ii), j > i;
δ(ã_ji) < δ(ã_ii), j ≠ i;  δ(b̃_ij) < δ(ã_ii).

Of course δ(ã_ii) = n_i.
The corresponding MFD (a, b) is left prime too, and A(0) = B(0) is lower
triangular with diagonal elements equal to one and δ(a_ii) = n_i.
The corresponding state-space form is obtained as follows (see e.g. [10, 17,
21, 23, 50, 60]): let ŷ(t + i | t) denote the best linear predictor of y(t + i) from
y(t), y(t − 1), …, and let

Y′_t = (ŷ′(t + 1 | t), ŷ′(t + 2 | t), …),
e′_t = (e′(t), e′(t − 1), …).

Then

Y_t = H_k̃ e_t.  (4.3)

It is easily seen that we obtain a unique (F, G, H) from the following
equations:

$$F H_{(\alpha)} = H_{(\alpha^+)}, \tag{4.4}$$
$$H H_{(\alpha)} = H^1_{\tilde k} \tag{4.5}$$
and
$$G = H^1_{(\alpha)}, \tag{4.6}$$

where H_(α) (H_(α⁺)) is the submatrix of H_k̃ consisting of the rows in positions 1,
1 + s, …, 1 + (n₁ − 1)s, …, s, 2s, …, n_s s (in positions 1 + s, 1 + 2s, …, 1 + n₁s,
…, 2s, 3s, …, (n_s + 1)s; arranged in the same order as in H_k̃) and H¹_(α) is
the submatrix of H_(α) consisting of its first block column. The free parameters
of F are obtained by expressing the rows in positions i + n_i s, i = 1, …, s, as
linear combinations of their preceding basis rows (in H_(α)). Thus these free
parameters are identical with those in the corresponding ã(z) matrix (up to
sign change). The other elements of F are either zero or one.
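Equations (4.4)-(4.6) translate directly into a small realization routine. The sketch below uses a least-squares solve in place of the exact linear-combination step, and assumes the Hankel truncation is tall enough to contain the shifted rows:

```python
import numpy as np

def state_space_from_hankel(H, rows, s):
    """(F, G, H_obs) from a truncated block Hankel matrix.

    `rows` lists the positions (0-based) of the selected basis rows,
    i.e. the submatrix H_(alpha); H must have at least max(rows)+s+1 rows."""
    H_a = H[rows]                         # H_(alpha)
    H_a1 = H[[r + s for r in rows]]       # the same rows shifted one block down
    pinv = np.linalg.pinv(H_a)
    F = H_a1 @ pinv                       # (4.4): F H_(alpha) = H_(alpha+)
    H_obs = H[:s] @ pinv                  # (4.5): H H_(alpha) = first block row
    G = H_a[:, :s]                        # (4.6): first block column of H_(alpha)
    return F, G, H_obs
```

The Markov parameters are then recovered as K(i) = H_obs F^{i−1} G, which provides a quick consistency check of the realization.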
Another common canonical ARMA form is Hermite's canonical form [18,
27, 56], which is obtained analogously to Echelon form, with the only difference
that the basis rows in H_k̃ are now selected in the following way: if we attach
the index (i, j) to the ith row of the jth block row of H_k̃, then we select
the first basis rows of H_k̃ with respect to lexicographical order, i.e. we first take
all rows with indices (1, j), j = 1, …, (n̄₁ − 1) say, which are linearly
independent, next all rows with indices (2, j), j = 1, …, (n̄₂ − 1) say, which are
linearly independent of the preceding selected rows, and so on. This defines a
multiindex ᾱ, basis rows H_(ᾱ), and a unique MFD (ã, b̃) via (4.2) and (4.1) in
the same way as before.
Hermite's canonical form can also be obtained by bringing an arbitrary ã
from the equivalence class of left prime MFD's into Hermite's canonical form for
polynomial matrices [4] by left multiplication with a unimodular matrix, and
then applying the same unimodular matrix to b̃. The resulting ã is lower
triangular, has monic diagonal elements and δ(ã_ij) < δ(ã_ii) for j ≠ i. The
corresponding (a, b) obtained from (2.12) has one disadvantage compared with
(a, b) in Echelon form, namely it need not be left prime.
In the same way an analogous state-space form is defined (see e.g. [17]).
To be more general, ARMA and state-space canonical forms can be
obtained in an analogous way as before by specifying a certain selection rule
for the basis rows of H_k̃ (see e.g. [10, 17, 21, 38, 43, 50, 53]). Using the block
Hankel structure of H_k̃, we restrict ourselves to selection rules with the
property that if the (i + s)th row is in the selected basis, then so also is the ith row.
This defines a multiindex α and a corresponding matrix H_(α) and via (4.2) and
(4.1) (via (4.4)-(4.6)) a corresponding canonical ARMA (state-space) form.

From (4.3) it is clear that the linear dependence structure of the rows of H_k̃
is identical with that of the random variables in Y_t. Thus the canonical forms
above may alternatively be defined by selecting a basis for the predictor space,
i.e. the space spanned by the component variables in Y_t [1, 2].
We now investigate the natural parameter space for Echelon forms. Analogous
results hold for the more general case described above.
Let U_α^(1) denote the set of all transfer functions k̃ with Kronecker indices
α = (n₁, …, n_s), let Θ_α^(1) denote the set of all free parameters for (ã, b̃) in
Echelon form such that ã⁻¹b̃ ∈ U_α^(1), and finally let ψ_α^(1): U_α^(1) → Θ_α^(1) denote the
corresponding parametrization of U_α^(1).
By α* = (n₁*, …, n_s*) ≤ α = (n₁, …, n_s) we mean n_i* ≤ n_i, i = 1, …, s, and
α* < α is used if at least one of these inequalities is strict. Then we have [12, 15, 16,
34-37]:

THEOREM 4.1.
(i) Θ_α^(1) is an open and dense subset of ℝ^d, where d =
(n₁ + ⋯ + n_s)(s + 1) + Σ_{i,j: j<i} {min(n_i, n_j) + min(n_j, n_i + 1)}.
(ii) ψ_α^(1): U_α^(1) → Θ_α^(1) is a (T_pt-) homeomorphism.
(iii) {U_α^(1) | n₁ + ⋯ + n_s = n} is a (disjoint) partition of M(n).
(iv) π(Θ̄_α^(1)) = ∪_{β≤α} U_β^(1).
(v) For every k ∈ π(Θ̄_α^(1)) the k-equivalence class in Θ̄_α^(1) is an affine
subspace.
(vi) U_α^(1) is (T_pt-) open in Ū_α^(1).
(vii) π(Θ̄_α^(1)) ⊂ Ū_α^(1), and equality holds for s = 1.

Completely analogous results hold for the corresponding state-space canonical
form.

REMARK 1. By Theorem 4.1 above, the parameter space Θ_α^(1) ⊂ ℝ^d is 'almost
all' of the whole Euclidean space ℝ^d. The remaining set consists of all
k-equivalence classes corresponding to 'lower-dimensional' systems (in the
sense that the Kronecker indices β satisfy β < α).

Also, as a consequence of Theorem 4.1, we see that Θ_α^(1), Δ_α^(1) and U_α^(1) are
homeomorphic. The homeomorphisms are defined by (4.1), (4.2) and by
(4.4)-(4.6).

REMARK 2. In the case s = 1 we have M(n) = U_α^(1), α = (n); for s > 1, M(n) is
partitioned into the $\binom{n+s-1}{s-1}$ sets U_α^(1), Σ n_i = n, which are of different dimension.
There is one α, namely where n₁ = n₂ = ⋯ = n_k = n_{k+1} + 1 = ⋯ = n_s + 1, i.e.
where the first linearly independent basis rows of H_k̃ are just the first n rows, for
which Ū_α^(1) = M(n); the other sets are of lower dimension.

The partitioning of M(n) just described contains an element of arbitrariness, as
for other canonical forms different partitions are obtained.

REMARK 3. We now discuss the consequences of the assumptions (2.4) and
(2.7). Let Θ̃_α^(1) denote the set of all θ ∈ Θ_α^(1) where in addition these assumptions
have been imposed. As is easily seen, Θ̃_α^(1) will in general not be dense in ℝ^d. If,
however, (2.7) is strengthened to

det b(z) ≠ 0, |z| ≤ 1,

then the corresponding parameter space is open in ℝ^d. Let cl Θ̃_α^(1) denote the
closure of Θ̃_α^(1) in ℝ^d. Then cl Θ̃_α^(1) − Θ̃_α^(1) is the set of all θ ∈ ℝ^d that are either in Θ_α^(1),
where det a(z) ≠ 0, |z| < 1, and det a(z) = 0 for some |z| = 1 and where (2.7) is
fulfilled, or that are in ℝ^d − Θ_α^(1), where det a(z) ≠ 0, det b(z) ≠ 0, |z| < 1. The
equivalence classes in cl Θ̃_α^(1) are subsets of the affine subspaces described in (v)
above, containing a relatively open set.

5. The manifold structure of M(n)

If there are no additional a priori restrictions then, in a certain sense, the natural
object for our investigations is M(n). Clearly, (M(n))_{n∈ℤ₊} is a cover of U_A.
The results of the previous section may be interpreted as giving a parametrization
of M(n) by the sets Θ_α^(1), Σ n_i = n. We now discuss the overlapping
parametrization of M(n) as a manifold (see [10, 35, 41]). Let U_α^(2), α =
(n₁, …, n_s), denote the set of all k̃ ∈ M(n), n = n₁ + ⋯ + n_s, where the rows of
H_(α) are linearly independent. Here we do not postulate, however, that these
rows are the first linearly independent rows of H_k̃, and thus the α in general are not
the Kronecker indices. Clearly, U_α^(2) ⊃ U_α^(1), and equality holds in the case where
n₁ = n₂ = ⋯ = n_K = n_{K+1} + 1 = ⋯ = n_s + 1, i.e. when the first linearly independent
basis rows of H_k̃ are just the first n rows. Now, for every k̃ ∈ U_α^(2) a (within
U_α^(2)) unique MFD (ã, b̃) may be defined by (4.2) and (4.1) analogous to
Echelon form, with the only difference that here the respective (i + n_i s)th rows of
H_k̃ are expressed as linear combinations of all basis rows in H_(α) (not only of
the preceding ones).
An MFD defined in this way has the following properties:

(ã, b̃) are left prime, ã_ii are monic,
δ(ã_ji) < δ(ã_ii) = n_i, j ≠ i.  (5.1)

The corresponding state-space systems (see e.g. [10, 41, 45, 47, 53]) are defined
completely analogously by (4.4)-(4.6) where again, in (4.4), the respective
(i + n_i s)th rows of H_k̃ are expressed as linear combinations of all basis rows in H_(α).
Note that, in general, not all parameters of (ã, b̃) that are not restricted by
(5.1) will be free. This is a disadvantage of this ARMA parametrization compared
with the corresponding analogous state-space representation, where all restrictions
are of the zero-one type. The free parameters in the MFD (ã, b̃) may be

chosen as [16]:

ã_ij(u), u = 0, 1, …, n_j − 1, j, i = 1, …, s,
b̃_ij(u), u = 0, 1, …, n_i − 1, i, j = 1, …, s,  (5.2)

where ã_ij(u) and b̃_ij(u) denote the ij elements in Ã(u) and B̃(u), respectively.
Let ψ_α^(2) denote the mapping attaching to every k̃ ∈ U_α^(2) the vector θ ∈ ℝ^{2ns}
consisting of these free parameters and let Θ_α^(2) denote the image of U_α^(2) under ψ_α^(2).
We recall that a real analytic manifold of dimension d is a separable
Hausdorff space M together with a family (U_α, ψ_α)_{α∈I}, where (U_α)_{α∈I} is an open
cover of M and the ψ_α are homeomorphisms from U_α onto an open subset of
ℝ^d, such that, if U_α ∩ U_β ≠ ∅ (α, β ∈ I), then ψ_β ∘ ψ_α⁻¹: ψ_α(U_α ∩ U_β) →
ψ_β(U_α ∩ U_β) is a real analytic function. The (U_α, ψ_α) are called local coordinates
and ψ_β ∘ ψ_α⁻¹ is called a coordinate transformation. Then we have, analogous to
Theorem 4.1 (see [7, 8, 10, 16, 35, 37, 41]):

THEOREM 5.1.
(i) M(n) together with (U_α^(2), ψ_α^(2))_{Σnᵢ=n} is a real analytic manifold of dimension
2ns.
(ii) Θ_α^(2) is dense in ℝ^{2ns}.
(iii) Ū_α^(2) − M(n) = ∪_{n′<n} M(n′).
(iv) π(Θ̄_α^(2)) = ∪_{β≤α} U_β^(2).
(v) For every k ∈ π(Θ̄_α^(2)), the k-equivalence class in Θ̄_α^(2) is an affine
subspace of dimension s(n − n₀), where n₀ is the order of k̃.
(vi) U_α^(2) is (T_pt-) open in Ū_α^(2).
(vii) π(Θ̄_α^(2)) ⊂ Ū_α^(2), and equality holds for s = 1.

The number $\binom{n+s-1}{s-1}$ of local coordinates (U_α^(2), ψ_α^(2)) used here is not minimal in
general. However, for s > 1, a parametrization of M(n) by a single coordinate
system does not exist.
The statements in Remarks 1 and 3 below Theorem 4.1 directly carry over to
this case.
Since U_α^(2), Σ n_i = n, is open and dense in M(n), 'almost all' points of M(n)
can be described by a corresponding ψ_α^(2); thus once n is known, the choice of α
is not that crucial, compared with the case of the Echelon canonical form (or
canonical forms in general), where only for the set with U_α^(1) = U_α^(2), i.e. where
n₁ = n₂ = ⋯ = n_K = n_{K+1} + 1 = ⋯ = n_s + 1, we have openness and denseness in
M(n), whereas the other U_α^(1) are of dimension smaller than 2ns. Moreover,
since M̄(n) is the set of all transfer functions with order smaller than or equal
to n, every transfer function of order smaller than or equal to n can be
approximated with arbitrary accuracy in U_α^(2). However, there is a certain
tradeoff for this advantage of the overlapping description of M(n) as a
manifold, because parametrizing k̃ in a lower-dimensional U_α^(1), rather than in
U_α^(2), leads to a certain efficiency gain in estimation, as fewer free parameters
are used.
6. The case of additional a priori information on the parameters: Structural identifiability

In many situations the system parameters have direct 'physical' interpretations, and then we often have a priori information in the form of constraints on these parameters; e.g., we might know a priori the maximum lag for a particular variable in a certain equation, or that a particular input variable does not influence a certain output variable. In such a situation the question arises whether these a priori restrictions are sufficient to guarantee identifiability of the set Θ ⊂ Θ_A defined by these restrictions. In this case we use the terminology structural identifiability, to make a distinction from the cases described in the previous sections, where no a priori restrictions were available.
In applications, the case where the a priori restrictions are overidentifying, i.e. they not only guarantee identifiability but also reduce the dimension of the parameter space, is of great importance, since in the unrestricted case the dimension of the parameter space might be too great compared with the given sample size of the observations.
A simple identifiable class of ARMA systems is given by the set Θ_{p,q} of all (a, b) ∈ Θ_A which are left prime, where the maximum lag lengths p and q in (1.1) are a priori prescribed, where the end matrix (A(p), B(q)) has rank s, and where

A(0) = B(0) = I    (6.1)

(or an analogous norming condition holds) (see [25, 27]). This is easily seen, since multiplying such an (a, b) by a nonconstant matrix u in (2.9) would increase the lag lengths beyond the prescribed ones.
Of course the sets Θ_{p,q}, p, q ∈ Z_+, can also be taken as convenient subclasses, without any structural interpretation, of a given model class.
For s > 1 there is one shortcoming of this prescription, namely {π(Θ_{p,q}), p, q ∈ Z_+} is not a cover of U_A, i.e. there exist k ∈ U_A for which there are no p, q such that k ∈ π(Θ_{p,q}). This problem does not occur if we more generally prescribe the maximal degrees α = (p_1, ..., p_s, q_1, ..., q_s) of the columns a_1, ..., a_s, b_1, ..., b_s of (a, b), i.e. we prescribe the maximum lags for each variable. Let a_i(j) and b_i(j) denote the coefficients of z^j in a_i and b_i, respectively. Then the set Θ_α^{(3)} ⊂ R^d of all parameters

θ = vec(a_1(1), ..., a_1(p_1), ..., a_s(1), ..., a_s(p_s), b_1(1), ..., b_s(q_s)),
d = s(p_1 + ⋯ + p_s + q_1 + ⋯ + q_s),

corresponding to (a, b) which are left prime, with prescribed column degrees α, where the column-end matrix (a_1(p_1), ..., a_s(p_s), b_1(q_1), ..., b_s(q_s)) has rank s (and where (6.1) holds), is identifiable ([27]). Let U_α^{(3)} = π(Θ_α^{(3)}) and let ψ_α^{(3)}: U_α^{(3)} → Θ_α^{(3)} denote the corresponding parametrization.
Of course a disadvantage of this parametrization is that 2s integer-valued parameters p_1, ..., q_s have to be prescribed (compared with 2 for Θ_{p,q}). In addition, note that p_1, ..., q_s are, in general, not uniquely determined by k, i.e. there are k ∈ U_A such that k ∈ U_α^{(3)}, k ∈ U_β^{(3)}, β ≠ α.
For results on structural identifiability in the case of other or more general a priori restrictions, e.g. exclusion of variables in certain equations or affine restrictions on the ARMA parameters, see [11, 27]. Structural identifiability in the case of state-space forms is investigated in [22].
Analogously to the previous cases, we have [13, 15]:

THEOREM 6.1.
(i) Θ_α^{(3)} is an open and dense subset of R^d, where d = s(p_1 + ⋯ + p_s + q_1 + ⋯ + q_s).
(ii) ψ_α^{(3)}: U_α^{(3)} → Θ_α^{(3)} is a (T_pt-) homeomorphism.
(iii) …
(iv) For every k ∈ π(Θ̄_α^{(3)}), the k-equivalence class in Θ̄_α^{(3)} is an affine subspace.
(v) U_α^{(3)} is (T_pt-) open in Ū_α^{(3)}.
(vi) π(Θ̄_α^{(3)}) ⊂ Ū_α^{(3)} and equality holds for s = 1.

For more general a priori restrictions, results analogous to Theorem 6.1 are not
yet available.
Again results analogous to Remarks 1 and 3 after Theorem 4.1 hold.

7. The relation to estimation

We now discuss the implications of the preceding results, concerning the properties of the parametrizations, for the process of identification. Here we concentrate on ARMA representations, as the results for state-space representations are analogous.
First let us consider the case where an α determining one of the parameter spaces Θ_α^{(2)}, Θ_α^{(3)} or Θ_{p,q} is already given. The common estimation procedures in this case are the (Gaussian) maximum likelihood estimators (MLE) or related methods (e.g. prediction error estimation). We here discuss the MLE as the prototype procedure. Let T denote the sample size, let y_T′ = (y′(1), ..., y′(T)) be the observations, and let Γ_T(θ, σ(Σ)) denote the sT × sT covariance matrix given by

Γ_T(θ, σ(Σ)) = [ ∫_{−π}^{π} e^{i(r−s)λ} f_y(λ) dλ ]_{r,s = 1, ..., T},

where the spectral density matrix f_y is determined by the parameter vectors θ and σ(Σ). Then −2T^{−1} times the log of the (Gaussian) likelihood is given, up to a constant, by

L_T(θ, σ(Σ)) = T^{−1} log det Γ_T(θ, σ(Σ)) + T^{−1} y_T′ Γ_T^{−1}(θ, σ(Σ)) y_T .    (7.1)

Here θ is an element of either Θ_α^{(2)}, or Θ_α^{(3)}, or of Θ_{p,q}. We use Θ_α, for short, to
cover all three cases. (Even more generally, Θ_α could be an identifiable set of MFDs with bounded degrees and with an additional technical assumption imposed.) In this section, (2.4) and (2.7) are assumed throughout, without taking this into account in our notation. Let U_α = π(Θ_α) be the corresponding set of transfer functions, and let ψ_α: U_α → Θ_α denote the corresponding parametrization. As L_T(θ, σ(Σ)) depends on θ only via π(θ), a 'coordinate-free' likelihood depending on k (and on σ(Σ)) may be defined. In the process of optimization of the likelihood, the possibility that the optimum is attained at certain boundary points cannot be excluded, and this is one reason to define the coordinate-free likelihood L_T(k, σ(Σ)) as a function with domain Ū_α × {σ(Σ) | Σ > 0} (where again (2.4) and (2.7) have been imposed) rather than with domain U_α × {σ(Σ) | Σ > 0}. Note that this coordinate-free likelihood is introduced for mathematical convenience, as some statistical properties do not depend on the underlying parametrization; the actual optimization of the likelihood, however, has to be performed in suitable coordinates.
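As a concrete illustration of (7.1), the following minimal sketch evaluates the criterion for the scalar case s = 1. The helper acov, which returns the autocovariance implied by (θ, σ(Σ)), is a hypothetical stand-in for whatever model class is being fitted; the AR(1) instantiation is only an example.

```python
import numpy as np
from scipy.linalg import toeplitz

def neg_loglik(acov, y):
    """-2/T times the Gaussian log-likelihood, up to a constant, as in (7.1)."""
    T = len(y)
    gamma = toeplitz([acov(h) for h in range(T)])  # Gamma_T(theta, sigma(Sigma))
    sign, logdet = np.linalg.slogdet(gamma)        # log det Gamma_T
    quad = y @ np.linalg.solve(gamma, y)           # y_T' Gamma_T^{-1} y_T
    return (logdet + quad) / T

# Illustration with an AR(1): acov(h) = s2 * a^|h| / (1 - a^2).
a, s2 = 0.5, 1.0
acov = lambda h: s2 * a ** abs(h) / (1 - a ** 2)
y = np.random.default_rng(0).standard_normal(200)  # placeholder data
print(neg_loglik(acov, y))
```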
A reason for the introduction of the coordinate-free likelihood is the following consistency result [19, 33]: under the additional ergodicity requirement

lim_{T→∞} (1/T) Σ_{t=1}^{T} y(t + s) y′(t) = E y(s) y′(0)   (a.s.)

and if the true transfer function k_0 is in Ū_α, then the MLEs k̂_T, Σ̂_T for k_0 and Σ_0 (obtained by optimizing L_T over Ū_α × {σ(Σ) | Σ > 0}) are strongly consistent, i.e. k̂_T → k_0 (in T_pt) a.s. and Σ̂_T → Σ_0 (where Σ_0 is the true matrix) a.s.
This result, together with the properties of the parametrizations discussed in the previous sections, has the following implications for parameter estimation. Let k̂_T → k_0 (in T_pt), k̂_T, k_0 ∈ Ū_α (where k̂_T is not necessarily the MLE); then we can distinguish three different cases [13, 16]:
(i) If k_0 ∈ U_α, then, by the openness of U_α in Ū_α, k̂_T will be in U_α too, from a certain T_0 onwards. From this T_0 onwards, the parameter estimates ψ_α(k̂_T) = θ̂_T are uniquely defined, and by the continuity of ψ_α we have θ̂_T = ψ_α(k̂_T) → ψ_α(k_0) = θ_0; thus, for example, the MLEs θ̂_T are strongly consistent in this case.
(ii) Let k_0 ∈ π(Θ̄_α) − U_α. Then k_0 is represented by an equivalence class in Θ̄_α − Θ_α (along this equivalence class the likelihood defined on Θ̄_α, for fixed Σ, is constant). If in addition suitable prior bounds are imposed on the norm of the elements in Θ̄_α, then the (not necessarily unique) parameter estimates θ̂_T (i.e. π(θ̂_T) = k̂_T) will converge to the 'true' equivalence class. Whether the algorithm will search along this class, or whether the θ̂_T converge to a certain point in the equivalence class, depends on the actual estimation procedure used.
Of course, reparametrization with a suitable β < α, such that k_0 ∈ U_β, leads to the 'well-posed' situation described in (i).
(iii) The situation k_0 ∈ Ū_α − π(Θ̄_α) can only occur in the multivariable case (s > 1). In this case, k_0 corresponds to a 'point at infinity' of Θ_α, in the sense that, even if k̂_T ∈ U_α, T ∈ N, k̂_T → k_0 implies that the norm of the parameter estimates ψ_α(k̂_T) will tend to infinity. In the special case of the overlapping parametrization of M(n), when U_α = U_α^{(2)}, this situation occurs if either k_0 has order n and we have chosen the wrong local coordinates (i.e. k_0 ∉ U_α), or k_0 has order smaller than n but cannot be described in a Θ_β^{(2)} such that β < α. Also in this situation, a suitable reparametrization leads to case (i).
If k_0 ∈ U_α but 'near' to a point in Ū_α − U_α, similar problems (in finite samples) may arise. In this case the matrices (as e.g. H_∞) determining the parametrizations are ill conditioned, and thus ψ_α, although continuous, is very distorting, in the sense that a 'small' variation of the transfer functions causes a 'large' variation of the parameters θ.
The discussion in (ii) and (iii) may be considered as an analysis of the behavior of the parameter estimates in the case of a wrong dynamic specification. Of course there is also another case of wrong dynamic specification, namely when k_0 ∉ Ū_α, i.e. when the observations do not correspond to a system in the model class (underfitting). In this case, of course, we cannot have consistency of the estimates. However, the maximum likelihood type estimates still have an optimality property: they converge to the set (consisting of more than one element in general) in Ū_α corresponding to the best linear one-step-ahead predictors for the process generating the data [44].
Now let us turn to the problem of inference of the integer-valued parameters for the dynamic specification of the submodel.
There are two main inference principles in this case, namely information criteria like AIC or BIC, and criteria based on the inference of the linear dependence relations in H_∞.
We mainly consider the case of the overlapping parametrization of M(n). Here both the order n and appropriate local coordinates, given by α, have to be determined.
The reason why MLEs do not give reasonable results in order estimation (and in related problems) is as follows: since M̄(n_0) ⊂ M̄(n_1) for n_0 < n_1, and M(n_1) is 'almost all' of M̄(n_1), the MLE over M̄(n_1) will be attained 'almost surely' in M(n_1), even if n_0 is the true order.
One way to overcome this notorious tendency of the MLE to overestimate the true order (more precisely, to attain its value at the maximum prescribed order) is to add a penalty term taking into account the dimension of the parameter space. This leads to estimation criteria of the form

A_T(n) = log det Σ̂_T(n) + d C(T)/T,  n = 0, ..., N,

where Σ̂_T(n) is the MLE Σ̂_T over M̄(n) × {σ(Σ) | Σ > 0}, N is the maximum prescribed order, and d = 2ns is the dimension of the parameter space. C(T) has to be prescribed. If C(T) = 2, then A_T(n) is called AIC [2, 3]. If C(T) = c log T, then A_T(n) is called BIC [54]. The estimates n̂_T of the order are obtained by minimizing A_T(n), n = 0, ..., N. Consistency of the minimum BIC estimate n̂_T has been shown in [30, 31]. BIC, defined over U_α^{(1)} with d given by Theorem 4.1(i), also gives consistent estimates of the Kronecker indices [34]. Minimum AIC estimates of n are not consistent; AIC was designed to satisfy another optimality criterion [58]. Closely related to these estimation methods are likelihood ratio or Lagrange multiplier tests for the order [51, 52].
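The selection rule can be sketched schematically as follows; here fit_mle is only a placeholder for a genuine Gaussian maximum likelihood fit of an order-n model (its output below is illustrative, not an estimate), while the penalty choices C(T) = 2 and C(T) = log T correspond to AIC and BIC as described above.

```python
import numpy as np

def fit_mle(y, n, s):
    """Placeholder for an ML fit: pretend the fitted innovation covariance
    Sigma_hat(n) shrinks with the order n (illustrative stand-in only)."""
    resid_var = np.var(y) / (1.0 + n)
    return resid_var * np.eye(s)

def A_T(y, n, s, C):
    """The criterion A_T(n) = log det Sigma_hat(n) + d * C(T) / T."""
    T, d = len(y), 2 * n * s               # d = 2ns, dimension of the space
    sigma_hat = fit_mle(y, n, s)
    return np.log(np.linalg.det(sigma_hat)) + d * C(T) / T

aic = lambda T: 2.0                         # C(T) = 2 gives AIC
bic = lambda T: np.log(T)                   # C(T) = c log T (c = 1) gives BIC

y = np.random.default_rng(0).standard_normal(500)
n_hat = min(range(6), key=lambda n: A_T(y, n, 1, bic))  # minimize over n = 0..N
print(n_hat)
```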
For estimation of the local coordinates, measures of the conditioning of the estimate of H_∞ are used [47]. In principle, all the integer-valued parameters discussed here could be inferred from an investigation of the linear dependence relations in H_∞, where H_∞ is estimated, e.g., by a 'long' autoregression. However, in most practical applications this seems to be a fairly tedious procedure.
As has been pointed out in [34], in practical applications, for s > 3, both the large dimension of the parameter spaces and the large number of neighborhoods that have to be considered may cause great problems. Each optimization of the likelihood is itself a fairly costly procedure and, if N is the maximum prescribed order, we have to search over $\binom{N+s}{s}$ neighborhoods U_α^{(2)}, Σ n_i = n, 0 ≤ n ≤ N, or the same number of sets U_α^{(3)}. A practical procedure, where the search is performed only over suitably selected neighborhoods, is given in [34].

References

[1] Akaike, H. (1974). Stochastic theory of minimal realization. IEEE Trans. Automat. Control AC-19, 667-674.
[2] Akaike, H. (1976). Canonical correlation analysis of time series and the use of an information criterion. In: R. K. Mehra and D. G. Lainiotis, eds., Systems Identification: Advances and Case Studies, 27-96. Academic Press, New York.
[3] Akaike, H. (1977). On entropy maximisation principle. In: P. R. Krishnaiah, ed., Applications of Statistics. North-Holland, Amsterdam.
[4] Barnett, S. (1971). Matrices in Control Theory. Van Nostrand-Reinhold, London.
[5] Brockett, R. W. (1970). Finite Dimensional Linear Systems. Wiley, New York.
[6] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
[7] Byrnes, C. I. (1982). A brief tutorial on calculus on manifolds, with emphasis on applications to identification and control. Mimeo. Harvard University.
[8] Byrnes, C. I. and Martin, C. F. (1980). Geometrical Theory of Linear Systems. Reidel, Dordrecht.
[9] Casti, J. L. (1977). Dynamical Systems and their Applications: Linear Theory. Academic Press, New York.
[10] Clark, J. M. C. (1976). The consistent selection of parametrizations in systems identification. Paper presented at the Joint Automatic Control Conference, Purdue University.
[11] Deistler, M. (1978). The structural identifiability of linear models with autocorrelated errors in the case of cross-equation restrictions. J. Econometrics 8, 23-31.
[12] Deistler, M. (1983). The structure of ARMA systems and its relation to estimation. In: P. E. Caines and R. Hermann, eds., Geometry and Identification, 49-61. Math. Sci. Press, Brookline, MA.
[13] Deistler, M. (1983). The properties of the parametrization of ARMAX systems and their relevance for structural estimation. Econometrica 51, 1187-1207.
[14] Deistler, M. (1984). ARMA systems: Parametrization and estimation. In: J. P. Florens et al., eds., Alternative Approaches to Time Series Analysis, 143-160. Publications des Facultés Universitaires Saint-Louis, Bruxelles.
[15] Deistler, M., Dunsmuir, W. and Hannan, E. J. (1978). Vector linear time series models: Corrections and extensions. Adv. in Appl. Probab. 10, 360-372.
[16] Deistler, M. and Hannan, E. J. (1981). Some properties of the parametrization of ARMA systems with unknown order. J. Multivariate Anal. 11, 474-484.
[17] Denham, M. J. (1974). Canonical forms for the identification of multivariable linear systems. IEEE Trans. Automat. Control AC-19, 646-656.
[18] Dickinson, B. W., Kailath, T. and Morf, M. (1974). Canonical matrix fraction and state-space description for deterministic and stochastic linear systems. IEEE Trans. Automat. Control AC-19, 656-667.
[19] Dunsmuir, W. and Hannan, E. J. (1976). Vector linear time series models. Adv. in Appl. Probab. 8, 339-364.
[20] Forney, D. G. (1975). Minimal bases of rational vector spaces with applications to multivariable linear systems. SIAM J. Control 13, 493-520.
[21] Gevers, M. and Wertz, V. (1982). On the problem of structure selection for the identification of stationary stochastic processes. Paper presented at the 6th IFAC Symposium on Identification and System Parameter Estimation, Washington, DC.
[22] Glover, K. and Willems, J. C. (1974). Parametrizations of linear dynamical systems: Canonical forms and identifiability. IEEE Trans. Automat. Control AC-19, 640-646.
[23] Guidorzi, R. (1975). Canonical structures in the identification of multivariable systems. Automatica 11, 361-374.
[24] Guidorzi, R. (1981). Invariants and canonical forms for systems structural and parametric identification. Automatica 17, 117-133.
[25] Hannan, E. J. (1969). The identification of vector mixed autoregressive-moving average systems. Biometrika 56, 223-225.
[26] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[27] Hannan, E. J. (1971). The identification problem for multiple equation systems with moving average errors. Econometrica 39, 751-765.
[28] Hannan, E. J. (1976). The identification and parametrization of ARMAX and state space forms. Econometrica 44, 713-723.
[29] Hannan, E. J. (1979). The statistical theory of linear systems. In: P. R. Krishnaiah, ed., Developments in Statistics, Vol. 2, 83-121. Academic Press, New York.
[30] Hannan, E. J. (1980). The estimation of the order of an ARMA process. Ann. Statist. 8, 1071-1081.
[31] Hannan, E. J. (1981). Estimating the dimension of a linear system. J. Multivariate Anal. 11, 459-473.
[32] Hannan, E. J. (1981). System identification. In: M. Hazewinkel and J. C. Willems, eds., Stochastic Systems: The Mathematics of Filtering and Identification and Applications. Academic Press, New York.
[33] Hannan, E. J., Dunsmuir, W. and Deistler, M. (1980). Estimation of vector ARMAX models. J. Multivariate Anal. 10, 275-295.
[34] Hannan, E. J. and Kavalieris, L. (1984). Multivariate linear time series models. Adv. in Appl. Probab. 16, 492-561.
[35] Hazewinkel, M. (1977). Moduli and canonical forms for linear dynamical systems II: The topological case. Math. Systems Theory 10, 363-385.
[36] Hazewinkel, M. (1977). Moduli and canonical forms for linear dynamical systems III: The algebraic-geometric case. In: C. Martin and R. Hermann, eds., Geometric Control Theory, 291-336. Math. Sci. Press, Brookline, MA.
[37] Hazewinkel, M. and Kalman, R. E. (1975). On invariants, canonical forms, and moduli for linear constant, finite-dimensional dynamical systems. In: Lecture Notes in Economics and Mathematical Systems, Vol. 131, 48-60. Springer, Berlin.
[38] Ho, B. L. and Kalman, R. E. (1966). Effective construction of linear state variable models from input-output functions. Regelungstechnik 14, 545-548.
[39] Kalman, R. E. (1963). Mathematical description of linear dynamical systems. SIAM J. Control 1, 152-192.
[40] Kalman, R. E. (1965). Irreducible realizations and the degree of a rational matrix. SIAM J. Control 13, 520-544.
[41] Kalman, R. E. (1974). Algebraic geometric description of the class of linear systems of constant dimension. Eighth Annual Princeton Conf. on Information Sciences and Systems, Princeton, NJ.
[42] Kalman, R. E. (1982). Identifiability and problems of model selection in econometrics. In: W. Hildenbrand, ed., Advances in Econometrics. Cambridge University Press, Cambridge.
[43] Kalman, R. E., Falb, P. L. and Arbib, M. A. (1969). Topics in Mathematical System Theory. McGraw-Hill, New York.
[44] Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Trans. Automat. Control AC-23, 770-783.
[45] Ljung, L. and Rissanen, J. (1976). On canonical forms, parameter identifiability and the concept of complexity. In: Proceedings of the 4th IFAC Symposium on Identification and System Parameter Estimation, Tbilisi.
[46] Luenberger, D. G. (1967). Canonical forms for linear multivariable systems. IEEE Trans. Automat. Control AC-12, 290-293.
[47] van Overbeek, A. J. M. and Ljung, L. (1982). On-line structure selection for multivariable state space models. Automatica 18, 529-543.
[48] Picci, G. (1982). Some numerical aspects of multivariable systems identification. Math. Programming Study Vol. 18, 72-101. North-Holland, Amsterdam.
[49] Popov, V. M. (1969). Some properties of the control systems with irreducible matrix-transfer functions. In: Seminar on Differential Equations and Dynamical Systems II. Lecture Notes in Mathematics, Vol. 144. Springer, Berlin.
[50] Popov, V. M. (1972). Invariant description of linear, time-invariant controllable systems. SIAM J. Control 10, 252-264.
[51] Poskitt, D. S. and Tremayne, A. R. (1980). Testing the specification of a fitted autoregressive-moving average model. Biometrika 67, 359-363.
[52] Pötscher, B. M. (1983). Order estimation in ARMA-models by Lagrangian multiplier tests. Ann. Statist. 11, 872-885.
[53] Rissanen, J. (1974). Basis of invariants and canonical forms for linear dynamic systems. Automatica 10, 175.
[54] Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-471.
[55] Rissanen, J. and Ljung, L. (1975). Estimation of optimum structures and parameters for linear systems. In: Proc. Symp. Advanced School on Mathematical System Theory, Udine, Italy. Lecture Notes in Economics and Mathematical Systems, Vol. 131, 75-91. Springer, Berlin.
[56] Rosenbrock, H. H. (1970). State-Space and Multivariable Theory. Wiley, New York.
[57] Rozanov, Y. A. (1967). Stationary Random Processes. Holden-Day, San Francisco, CA.
[58] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147-164.
[59] Tiao, G. C. and Box, G. E. P. (1981). Modelling multiple time series with applications. J. Amer. Statist. Assoc. 76, 802-816.
[60] Wertz, V., Gevers, M. and Hannan, E. J. (1982). The determination of optimum structures for the state space representation of multivariate stochastic processes. IEEE Trans. Automat. Control AC-27, 1200-1211.
[61] Wolovich, W. A. (1974). Linear Multivariable Systems. Springer, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 279-310

Harmonizable, Cramér, and Karhunen Classes of Processes

M. M. Rao

1. Introduction

If {X_t, t ∈ T} is a family of random variables with zero means and finite variances, then it is termed a second-order centered process (or time series). Several subclasses of such processes and their analyses are discussed in this chapter. They are taken to be centered for convenience. The simplest and most well-understood class is the stationary one. This is a second-order process whose covariance r is a continuous function which is invariant under shifts of the time axis T, taken to be either the integers, T = Z (the discrete case), or the real line, T = R (the continuous case). Thus in either case one has

r(s, t) = E(X_s X̄_t) = ∫_Ω X_s X̄_t dP,  s, t ∈ T,    (1)

and r(s, t) = f(s − t), which depends on the difference of s and t. Writing r for f, it follows from the classical theorems of Herglotz and Bochner that r is expressible as

r(s − t) = ∫_T̂ e^{i(s−t)λ} F(dλ),  s, t ∈ T,    (2)

where T̂ = [0, 2π) if T = Z, and T̂ = R if T = R. Here F is a nonnegative, nondecreasing, bounded function on T̂, called the spectral function of the process. The importance of the stationary class in electrical and communication engineering is well known, and a good exposition may be found in Yaglom [34], where the pioneering works of Wiener and Kolmogorov are also discussed. Many statistical problems on these processes have been treated by Grenander and Rosenblatt [11], and by Parzen [22], who includes some related extensions.
In a number of other applications, stationarity is an unacceptable restriction. Since one is not certain about the fulfillment of the stationarity assumption, it is at least desirable to have some knowledge of how far the results obtained under this condition remain valid when it is relaxed. In other words, one wants to

know whether there is some kind of 'robustness' for this work. In response to such questions, and also to take into account some honestly nonstationary processes, the classes of harmonizable and Karhunen families were introduced, independently and simultaneously, by Loève [16] and Karhunen [14]. Only slightly later a common generalization of both these classes was formulated by Cramér [6]. An analysis and interrelations, together with some of their extensions, constitute the main theme of the present exposition. It turns out that harmonizable processes, properly generalized, have the 'robustness properties' alluded to above. Also, their study proceeds through Karhunen classes, since it is shown that the harmonizable family is an important subset of the Karhunen processes. A discussion of Cramér's class is included, since technically this illuminates the structure of the above two families and has independent interest.
There are two important classes of nonstationary processes that have been considered in recent studies on the subject. One is the class (KF), studied extensively by Kampé de Fériet and Frenkiel [13] and independently by Parzen [21], who termed it 'asymptotically stationary', and by Rozanov [31]. The second one is the Cramér-Hida class, which is based on the 'multiplicity theory', having been motivated by the classical work of Hellinger and Hahn on infinite matrices. It turns out that a stationary process has multiplicity one, while there is a harmonizable process of any given multiplicity, 1 ≤ N ≤ ∞ (cf. [7, 12, 5]). Even in the multiplicity-one case, there are several types of nonstationary processes useful in prediction and filtering problems. This indicates that classes of nonstationary processes can be studied using different techniques, frequently in the time domain. It will also be found that generalizations of spectral ideas have a role to play in this work. Let us thus turn to a detailed description of these statements.

2. Harmonizable processes

From the point of view of applications, one of the most useful extensions of stationarity is harmonizability. Since for nonstationary processes (of second order) the covariance function r, given by (1), must depend on both variables s, t, it is natural to consider those processes for which the analog of (2) still holds. This leads to the following concept, introduced by Loève [16]; such a process is called strongly harmonizable hereafter. Namely, the covariance r admits the representation

r(s, t) = ∫_T̂ ∫_T̂ exp(isλ − itλ′) F(dλ, dλ′),  s, t ∈ T,    (3)

where F is a covariance function of bounded variation on T̂ × T̂.


In contrast to the Bochner-Herglotz theorem, there is no usable characterization of such an r. But it is easily seen that strongly harmonizable processes
exist in abundance. Indeed, let f be any (Lebesgue) integrable scalar function on the line, and denote by f̂ its Fourier transform, i.e.

f̂(t) = ∫_R e^{itλ} f(λ) dλ,  t ∈ R.    (4)

If ξ is a random variable with mean zero and unit variance, consider {X_t = ξ f̂(t), t ∈ R}. Then X_t has mean zero, and the covariance r is given by

r(s, t) = E(X_s X̄_t) = f̂(s) f̂(t)‾ = ∫_R ∫_R exp(isλ − itλ′) f(λ) f(λ′)‾ dλ dλ′.

Taking

F(Λ, Λ′) = ∫_Λ ∫_{Λ′} f(x) f(y)‾ dx dy,

one verifies that F is positive definite and of bounded variation, so that the X_t process is strongly harmonizable. All finite linear combinations of such 'simple' processes constitute a large class, and if F in (2) is absolutely continuous, then these simple processes are even 'linearly dense' in a certain well-defined sense. On the other hand, if F of (3) concentrates on the diagonal λ = λ′, then clearly (3) reduces to (2). Thus stationary processes are properly extended.
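The construction just described is easy to check numerically. In the sketch below the Gaussian choice of f is arbitrary; the Monte Carlo average of X_s X̄_t is compared with the factorized covariance f̂(s) f̂(t)‾.

```python
import numpy as np

lam = np.linspace(-20, 20, 4001)                  # frequency grid
dlam = lam[1] - lam[0]
f = np.exp(-lam**2)                               # an integrable choice of f

def fhat(t):
    """Fourier transform (4), approximated by a simple Riemann sum."""
    return (np.exp(1j * t * lam) * f).sum() * dlam

rng = np.random.default_rng(1)
xi = rng.standard_normal(200_000)                 # E xi = 0, E xi^2 = 1
s, t = 0.3, 1.1
emp = np.mean((xi * fhat(s)) * np.conj(xi * fhat(t)))  # sample covariance of X_s, X_t
print(emp, fhat(s) * np.conj(fhat(t)))            # both approx. r(s, t)
```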
Even though the Loève extension of stationarity is useful, it does not go far enough to afford flexibility for linear operations on these processes. Since the latter operations appear naturally in filtering problems, a further extension is needed to take care of these applications. First let us consider an example, to understand how certain simple linear operations lead from stationarity to problems beyond the strongly harmonizable class. Thus let L_0^2(P) be the space of scalar random variables with zero means and finite variances, so that a second-order process considered in this chapter is a subset of L_0^2(P). Let the metric (or norm) be denoted by ‖f‖ = √(E(|f|²)), f ∈ L_0^2(P). If A is a bounded linear transformation on L_0^2(P), so that AX_t = Y_t ∈ L_0^2(P), consider a stationary (or strongly harmonizable) process {X_t, t ∈ T} and the transformed process {Y_t, t ∈ T}. If the range of A is finite dimensional, then the Y_t process is strongly harmonizable (it can be nonstationary), and if the range of A is infinite dimensional, then the Y_t process need not be strongly harmonizable. For instance, let T = Z, X_n = f_n, an orthonormal sequence (hence stationary) in L_0^2(P), and let A be the projection such that Y_n = AX_n = f_n for n > 0, = 0 for n ≤ 0. Then the Y_n sequence is a truncation of the original orthonormal sequence, but it is not strongly harmonizable. This fact, as well as the preceding general statement on the range of A, is not entirely simple. The details may be found in [28]. Since, as remarked earlier, linear operations are important for (filtering and other) practical problems, one should have an extension to include at least these questions. Fortunately this is possible, and it can be formulated as follows.
Let {X_t, t ∈ T} ⊂ L_0^2(P) be a process with r as its covariance. If r admits a representation of the form (3), with F(·,·) a covariance function which is not necessarily of bounded variation (as demanded in (3)) but satisfies only the weaker condition of finite Fréchet variation, then the process is termed weakly harmonizable. Recall that F has finite Fréchet variation on T̂² if

‖F‖(R, R) = sup{ | Σ_{i=1}^{n} Σ_{j=1}^{n} a_i ā_j ∫_{I_i} ∫_{I_j} F(dλ, dλ′) | : |a_i| ≤ 1, a_i ∈ C, {I_i}_1^n disjoint intervals in R, n ≥ 1 } < ∞.

In case a_i ā_j ∫_{I_i} ∫_{I_j} F(dλ, dλ′) is replaced in the above by (a_i ≡ 1 and) ∫_{I_i} ∫_{I_j} |F(dλ, dλ′)|, then one has the usual (Vitali) bounded variation. This small alteration makes an essential difference, since ‖F‖(R × R) < ∞ can hold when the usual variation on R is infinite. A simple example exhibiting this phenomenon is the truncated sequence of the preceding paragraph, whose verification, however, needs some work. Thus each strongly harmonizable process is weakly harmonizable but not conversely, and the latter is a strictly larger class. But one has to make concessions to a technical problem in this generalization. The integral in (3) is in the usual Lebesgue-Stieltjes sense when F is of bounded variation, but in the general case it must be defined in the sense of Morse and Transue [18]. The MT integral thus defined is somewhat weaker than the usual LS one, in that the generalization does not admit the Jordan-type decomposition and the standard Fubini-type theorem is not valid. However, enough usable properties are available to proceed with a substantial amount of the work for many applications. A systematic account of the structural properties of these extended processes may be found in [28] (cf. also [20]).
As a consequence, one deduces that if {X_t, t ∈ T} is weakly harmonizable and A is a bounded linear (or even some unbounded, such as a differential) transformation, Y_t = AX_t, then {Y_t, t ∈ T} is also weakly harmonizable. Thus the latter class is closed under such mappings. For these reasons, the positive definite function F in (3), of finite Fréchet variation, is often called the (generalized) spectral function of the harmonizable process, even though it can be complex valued. Such spectra also have important roles to play in applications, such as sampling the process, filtering, and even prediction problems.
The above definition may be given a different (but equivalent) form, due to Bochner [2]. For reference it is stated as follows:

THEOREM 2.1. A second-order process {X_t, t ∈ T} ⊂ L_0^2(P) is weakly harmonizable iff (= if and only if):
(i) E(|X_t|²) ≤ M < ∞, t ∈ T, for a constant M > 0;
(ii) the covariance r(·,·) is continuous on T × T;    (5)
(iii) sup{ E(|∫_T f(t) X(t) dt|²) : ‖f̂‖_u ≤ 1 } < ∞,
where f̂ is the Fourier transform, given by (4), of an integrable f on T, the integral in (5) is defined in a standard manner as a vector (or Bochner) integral, and ‖f̂‖_u is the uniform (= supremum) norm of f̂.
Even though both these harmonizability concepts are generalizations of stationarity, there is a deep reciprocal relationship between them. This is quite important for some applications. The following example gives an indication of this property and explains the underlying reasons more vividly.

EXAMPLE. Let {S(t), t ≥ 0} be a family of bounded linear mappings on L_0^2(P) such that:
(i) S(u + v) = S(u)S(v), u, v ≥ 0, S(0) = identity;
(ii) ‖S(u)f‖ ≤ ‖f‖, f ∈ L_0^2(P), where ‖f‖² = E(|f|²); and
(iii) ‖S(u)f − f‖ → 0 as u → 0+.
Such a family is usually called a continuous contraction semigroup on L_0^2(P). For any given X_0 ∈ L_0^2(P), define the process {Y_t, t ∈ T} as

Y(t) = S(t)X_0 if t ≥ 0,  Y(t) = S*(−t)X_0 if t < 0,

where S*(u) stands for the adjoint of S(u), so that it is a linear mapping satisfying the relation

E((S(u)f) ḡ) = E(f (S*(u)g)‾),  for all f, g in L_0^2(P).

Then the Y(t) process can be shown to be weakly harmonizable. This is not obvious. One shows that, on letting S(−u) = S*(u), u ≥ 0, the family {S(u), u ∈ T} is positive definite, in the sense that

Σ_{i=1}^{n} Σ_{j=1}^{n} E((S(u_i − u_j) h_i) h̄_j) ≥ 0

for each finite set {u_1, ..., u_n} ⊂ T and h_1, ..., h_n in L_0^2(P). This is easy if T = Z, and the case T = R is then reducible to the former. Then one applies a form of the next result, to deduce that there is a family of unitary transformations V_t (meaning V_t V_t* = V_t* V_t = identity, with (i) and (iii) holding) on a larger space L_0^2(P′) ⊃ L_0^2(P), such that S(t) = QV_t, t ∈ T. Here Q is the orthogonal projection of L_0^2(P′) onto L_0^2(P). It should be noted that if S(t) = V_t, so that L_0^2(P′) = L_0^2(P) and Q = identity, then Y_t = V_t X_0, t ∈ T, gives the classical representation of a stationary process. Thus the connection between these two classes, obtained by an enlargement of the underlying probability space, is an important and deep result.
The precise statement alluded to above is the following:

THEOREM 2.2. Let {Y_t, t ∈ T} ⊂ L_0^2(P) be a given (weakly or strongly) harmonizable process. Then there exist a possibly enlarged probability space, on which L_0^2(P̃) ⊃ L_0^2(P) is defined, an orthogonal projection Q of L_0^2(P̃) onto L_0^2(P), and a stationary process {X̃_t, t ∈ T} (⊂ L_0^2(P̃)) such that Y_t = QX̃_t, t ∈ T. (The stationary X̃_t process is termed a dilation of the harmonizable Y_t process.) In the opposite direction, each stationary process {X_t, t ∈ T} ⊂ L_0^2(P) and each
continuous linear transformation A on L_0^2(P) define {Y_t = AX_t, t ∈ T} as a weakly (but usually not strongly) harmonizable process in L_0^2(P).

The superspace L_0^2(P̃) is not generally unique, but one can find a minimal space with the desired properties. The result and its extended space may be obtained essentially 'constructively'. It is related to work of M. A. Naimark, B. Sz.-Nagy and others on Hilbert space operator theory. A detailed proof with related references is given in [28]. Based on this result, one can show that each weakly harmonizable process may be represented in terms of a (continuous) positive definite semigroup, as described in the preceding example.
The above theorem enables some extensions of well-known results from the stationary theory to the harmonizable case. For instance, the following inversion formula for F(·,·) of (3) can be obtained from the classical work at once.

PROPOSITION 2.3. Let r be a weakly harmonizable covariance function with F as its representing function, the spectral measure. If A = (a_1, a_2), B = (b_1, b_2) are two intervals such that a_i, b_i, i = 1, 2, are continuity points of F, then one has

F(A, B) = lim_{α,β→∞} (2π)^{−2} ∫_{−α}^{α} ∫_{−β}^{β} [(e^{−ia_2 s} − e^{−ia_1 s})/(−is)] [(e^{ib_2 t} − e^{ib_1 t})/(it)] r(s, t) ds dt.    (6)

The generally complex-valued spectral function F of the process plays a role in analyzing harmonizable processes somewhat similar to the one it has in the classical case (2). So it is desirable to estimate F and to investigate the asymptotic properties of such estimators. This problem, even in the strongly harmonizable case, is not yet solved. Other unresolved points will be recorded, for future work, as the exposition proceeds.
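Although the estimation problem is open, formula (6) itself can be exercised numerically. The following sketch applies it to the simple strongly harmonizable process constructed in the example above (with f(λ) = e^{−λ²}), for which F(A, B) factorizes and can be computed in closed form; the truncation points and grid sizes are arbitrary numerical choices, and the (2π)^{−2} normalization matches the convention in (3).

```python
import numpy as np
from scipy.special import erf

# For f(lam) = exp(-lam^2): r(s, t) = fhat(s) * fhat(t) = pi * exp(-(s^2 + t^2)/4)
# and F(A, B) = (int_A f)(int_B f), so the inversion (6) can be checked exactly.

def D(x, u1, u2):
    """(e^{-i u2 x} - e^{-i u1 x}) / (-i x), with the limit u2 - u1 at x = 0."""
    safe = np.where(x == 0, 1.0, x)
    return np.where(x == 0, u2 - u1,
                    (np.exp(-1j * u2 * x) - np.exp(-1j * u1 * x)) / (-1j * safe))

a1, a2, b1, b2 = -1.0, 0.5, -0.5, 1.0        # the intervals A and B
s = np.linspace(-30, 30, 1201)               # truncation alpha = beta = 30
ds = s[1] - s[0]
S, U = np.meshgrid(s, s, indexing="ij")
r = np.pi * np.exp(-(S**2 + U**2) / 4)       # covariance of the example process
F_AB = (D(S, a1, a2) * np.conj(D(U, b1, b2)) * r).sum() * ds**2 / (2 * np.pi)**2
exact = 0.25 * np.pi * (erf(a2) - erf(a1)) * (erf(b2) - erf(b1))
print(F_AB.real, exact)                      # the two values agree closely
```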
The strongly harmonizable case admits an extension in a slightly different direction. The covariance function r of (3) may be written as

r(s, t) = ∫_T̂ ∫_T̂ g(s, λ) g(t, λ′)‾ F(dλ, dλ′),  s, t ∈ T,    (7)

where g(t, λ) = exp(itλ), which is periodic in t for each λ ∈ T̂. Also, g is bounded and jointly continuous in the variables t, λ. The expression (7) is meaningful if g: T × D → C (D ⊂ T̂) is almost periodic on T for each compact subset D of T̂. More explicitly, a complex-valued continuous function g on T × D is almost periodic on T, uniformly relative to D, if for each compact subset K of D and each ε > 0 there is a number l_0 = l_0(ε, K) such that each interval I ⊂ T of length l_0 contains a number τ (called an ε-translation number of g) for which one has

|g(t + τ, λ) − g(t, λ)| < ε,  t ∈ T, λ ∈ K.    (8)

If D is a single point, then g is a classical almost periodic function; and in any case, for each λ ∈ K ⊂ D, g(·, λ) is bounded. With this concept, a second-order process {X_t, t ∈ T} ⊂ L_0^2(P) is termed almost harmonizable if its covariance r admits the representation (7) with respect to a family {g(·, λ), λ ∈ T̂} of almost periodic functions on T, uniformly relative to T̂, and a covariance function F of bounded variation. It will be seen in Section 6 below that this family inherits an important structural property of the strongly harmonizable processes, of which it is an extension.
From an applicational point of view, however, one should consider multivariate processes. Thus if X_t: Ω → C^n, t ∈ T, so that X_t = (X_t^1, ..., X_t^n), let X_t^i ∈ L_0^2(P), i = 1, ..., n; t ∈ T. Then the X_t process is termed multivariate strongly or weakly or almost harmonizable (relative to a fixed scalar almost periodic g-family in the last case) if for each vector α = (α^1, ..., α^n) ∈ C^n the scalar process

X_t^α = Σ_{i=1}^{n} α^i X_t^i,  t ∈ T,

is of the same type as defined in the preceding paragraphs. From this definition it follows, after an easy algebraic manipulation, that for each 1 ≤ j, k ≤ n and s, t in T, the (cross-)covariance function r_jk of the component processes X_t^j, X_t^k is also harmonizable, and that

r_jk(s, t) = ∫_T̂ ∫_T̂ e^{isλ − itλ′} F_jk(dλ, dλ′),  s, t ∈ T,    (9)

where F_jk(A × B) = F_kj(B × A)‾, and each F_jk is of finite usual (Vitali) or Fréchet variation, respectively, F_jj being positive definite (F_jk need not be). If r = (r_jk, 1 ≤ j, k ≤ n) and F = (F_jk, 1 ≤ j, k ≤ n) are n-by-n matrices, then the matrix covariance function r of X_t admits the representation

r(s, t) = ∫_T̂ ∫_T̂ g(s, λ) g(t, λ′)‾ F(dλ, dλ′),    (10)

with g(s, λ) = e^{isλ} in the (weak or strong) harmonizable case. The integrals here are defined componentwise. Again, F will be called the spectral matrix function of the vector process {X_t, t ∈ T}.
In all the above cases, F has the following important property, inherited from r:

0 ≤ trace( ∫_T̂ ∫_T̂ f(λ) F(dλ, dλ′) f*(λ′) ) < ∞    (11)

for any m-by-n matrix function f with bounded Borel entries. Here f* is the conjugate transpose of f. In the stationary case one has F(A, A′) = δ_{AA′} G(A), where δ_{AA′} is the Kronecker delta and G is a positive definite hermitian n-by-n matrix function. In the general (e.g. harmonizable) cases, the latter property is no longer present, because of the behavior of the off-diagonal entries of F noted earlier. Before considering the spectral properties of multivariate harmonizable processes, it will be necessary to discuss another extension of stationarity, due to Karhunen [14], and some of its ramifications. Let us introduce this.

3. Karhunen class

It will be useful to motivate the concept in the following way. Consider a stationary (scalar) covariance function r. By (2) it has a spectral function F which is positive, increasing, and bounded. Suppose that F admits a density f (relative to the Lebesgue measure). Then (T = R, so that T̂ = R also)

r(s, t) = ∫_R e^{i(s−t)λ} f(λ) dλ
        = ∫_R h(s + u) h(t + u)‾ du,  s, t ∈ T,    (12)

where h is the Fourier transform of √f (which exists), and the last equality then follows from the Parseval formula (since √f is square integrable). Note that if T = Z, then h is a polygonal function and the integral in (12) reduces to a (possibly infinite) sum. Thus a process whose covariance is representable by a formula of the type (12), relative to a Borel family {h(t, ·), t ∈ T} and a measure μ (here h(t, ·) = h(t + ·) and dμ(λ) = dλ), includes the stationary class and brings in considerable flexibility. It should also be observed, from (12) and (2), that even a stationary covariance can have different representations, and this remark will be pertinent later on. Let us thus present the desired general concept.
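The equality of the two forms in (12) can be verified numerically; the sketch below uses the unitary Fourier convention, under which the Parseval identity holds without stray 2π factors, and a Gaussian f chosen purely for illustration.

```python
import numpy as np

lam = np.linspace(-15, 15, 1501); dlam = lam[1] - lam[0]
f = np.exp(-lam**2)                                  # a spectral density (illustrative)

def h(v):
    """Unitary Fourier transform of sqrt(f), evaluated at the points v."""
    return (np.exp(1j * np.outer(v, lam)) * np.sqrt(f)).sum(axis=1) * dlam / np.sqrt(2 * np.pi)

u = np.linspace(-25, 25, 2001); du = u[1] - u[0]
s, t = 0.7, -0.2
lhs = (np.exp(1j * (s - t) * lam) * f).sum() * dlam  # spectral form of r(s, t)
rhs = (h(s + u) * np.conj(h(t + u))).sum() * du      # Parseval form in (12)
print(lhs.real, rhs.real)                            # both approx. 1.4476
```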

DEFINITION. A process {X_t, t ∈ T} ⊂ L_0^2(P) with covariance r is said to belong to the Karhunen class if there are an auxiliary measure space (S, 𝒮, ν) and a set of complex functions {g(t, ·), t ∈ T} ⊂ L²(S, 𝒮, ν) such that

r(t_1, t_2) = ∫_S g(t_1, λ) g(t_2, λ)‾ ν(dλ),  t_i ∈ T, i = 1, 2.    (13)

Here both S and T can be general sets, without any relation. In applications one usually has T = R, Z (as is assumed in this chapter), and then S = T̂ (= R, or [0, 2π)), C, or such others. Also, ν can be a nonfinite measure (dν = dλ, the Lebesgue measure on R in (12), is an example).
The Karhunen class is quite large. It was already noted that stationary processes are included in it. From the forms (3) and (13), it is not at all evident that there is any relationship between the harmonizable and Karhunen classes. It will now be shown that the former is also a subset of the latter. This fact could not be obtained until the availability of the dilation result (Theorem 2.2). It also depends on another classical fact (due to Cramér), that each stationary process {X_t, t ∈ T} is representable as

X_t = ∫_T̂ e^{itλ} Z(dλ),  t ∈ T,    (14)

where Z(A) ∈ L_0^2(P) and E(Z(A) Z(B)‾) = F(A ∩ B), for any Borel sets A, B ⊂ T̂, the measure F being the same as that of (2), which is related to the covariance function r of the X_t process. Such a Z(·) is called an orthogonally scattered measure by Masani [17]. With this setup, one has:

THEOREM 3.1. Each weakly harmonizable process {X_t, t ∈ T} is also a Karhunen process, relative to a finite positive measure ν on T̂ and a suitable Borel family of functions {f_t, t ∈ T} in L²(T̂, ν).

PROOF. A sketch of the argument follows, because it is not yet available in the literature and it is not long. Since the X_t process is weakly harmonizable, by Theorem 2.2 there exists a stationary dilation {Y_t, t ∈ T} ⊂ L_0^2(P̃) on a larger probability space, L_0^2(P) being identified with a subspace, such that X_t = QY_t, t ∈ T, where Q is the orthogonal projection from L_0^2(P̃) onto L_0^2(P). But by (14),

Y_t = ∫_T̂ e^{itλ} Z(dλ),  t ∈ T,

and Z is orthogonally scattered. Let ν(A) = E(|Z(A)|²). Then ν(·) is a finite positive measure on T̂. But by a classical theorem of Kolmogorov (see [32], p. 33, and also [17], Thm. 5.10), there exists an orthogonal projection Π of L²(T̂, ν) into itself, induced by Q, such that

X_t = QY_t = Q( ∫_T̂ e^{itλ} Z(dλ) ) = ∫_T̂ Π(e^{it(·)})(λ) Z(dλ),  t ∈ T.    (15)

If f(t, λ) = Π(e^{it(·)})(λ), λ ∈ T̂, then {f_t, t ∈ T} ⊂ L²(T̂, ν), and (15) further implies

r(s, t) = E(X_s X̄_t) = ∫_T̂ f(s, λ) f(t, λ)‾ ν(dλ).

This means that r has the representation (13) relative to {f(t, ·), t ∈ T} and ν, so that X_t is of Karhunen class, as asserted.

This result, which is a consequence of the preceding work, exhibits a type of inclusiveness of the Karhunen class, and it will be shown below to have a deeper impact on the structural analysis of (multivariate) harmonizable processes. It is, however, also useful to note another property of this family, regarding the existence of shift operators on a nonstationary subclass. This is significant, since harmonizable processes themselves actually do not admit such shifts, in contrast to the stationary class. Let us explain this in more detail, because it is not at all obvious.
Let {X_t, t ∈ T} ⊂ L_0^2(P) be a process. For each s ∈ T, define τ_s X_t = X_{t+s} and, if possible, extend τ_s as a linear transformation on L_0^2(P). The thus extended τ_s (also denoted by the same symbol) is called a shift operator on the process. If the X_t process is stationary, then it is well known that such a τ_s exists, and in fact

E((τ_s X_u)(τ_s X_v)‾) = E(X_{u+s} X̄_{v+s}) = r(u − v) = E(X_u X̄_v).

Thus τ_s preserves lengths (= norms), and it is unitary. However, for a harmonizable process such a τ_s need not exist. For instance, consider Y_t = X_t for t > 0, = 0 for t ≤ 0, where the X_t process is stationary. Then, by Theorem 2.2, {Y_t, t ∈ T} is weakly harmonizable; but if u, v < 0 and s ∈ R is such that u + s > 0, v + s > 0, then for all such s,

E(Y_u Ȳ_v) = 0 ≠ r(u − v) = E(Y_{u+s} Ȳ_{v+s}).

In fact, assuming that the X_t process is not identically zero, one has

‖Y_u‖² = E(|Y_u|²) = 0,  but  ‖τ_s Y_u‖² = E(|Y_{u+s}|²) = r(0) ≠ 0.

Hence τ_s cannot be linearly extended. Here are some simple (good) sufficient conditions for a shift operator τ_s to exist and be continuous. For each finite set t_1, ..., t_n of points from T and complex numbers a_1, ..., a_n, if U_n = Σ_{i=1}^{n} a_i X_{t_i}, then

U_n = 0 ⟹ τ_s U_n = Σ_{i=1}^{n} a_i X_{t_i + s} = 0.    (16)

Equivalently,

‖U_n‖² = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i ā_j r(t_i, t_j) = 0

implies

‖τ_s U_n‖² = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i ā_j r(t_i + s, t_j + s) = 0.
Then τ_s can be extended by linearity (unboundedly in general) onto the linear subspace ℋ of L_0^2(P) generated by {X_t, t ∈ T}. Such a τ_s will also be bounded if there exists a number c > 0 such that

‖τ_s U_n‖ ≤ c ‖U_n‖.    (17)

In the stationary case c = 1, and there is equality in (17). As an easy consequence of (16), one will have τ_s τ_u = τ_{s+u}, and hence on the relevant subspace ℋ, {τ_s, s ∈ T} should form a semigroup. Since this is true for the stationary case (with τ_s unitary), and since one wants to include some nonstationary processes, it is natural to look for the τ_s family, with some structure, at least as a normal operator semigroup, i.e. {τ_s, s ∈ T} should satisfy the commutativity relations τ_s τ_s* = τ_s* τ_s (τ_s* is the adjoint of τ_s). Let us find out the possible nonstationary processes admitted under such an assumption, since the stationary class is automatically included (because every unitary operator is normal). The mathematical detail will be minimized here.
Let {τ_s, s ≥ 0} be a bounded semigroup of normal shifts on {X_t, t ≥ 0} such that ‖τ_s X − X‖ → 0 as s → 0, for each X ∈ ℋ, the closed span of the X_t in L_0^2(P). In order to include the unitary (or, equivalently, the stationary) case, τ_s should not be assumed self-adjoint! Thus normality is the next reasonable generalization. [Also, the condition that ‖τ_s X − X‖ → 0 is known to be equivalent to the strong continuity of τ_s for s > 0, together with the boundedness of τ_s on 0 < s ≤ 1 and the density of ⋃_{s>0} τ_s(ℋ) in ℋ. This is thus a technical hypothesis.] Let A_h = (τ_h − I)/h, h > 0. Then A_h is a bounded normal transformation for each h. It is a consequence of the classical theory of such semigroups that, for each X ∈ ℋ, one has

τ_s X = lim_{h→0} e^{sA_h} X,    (18)

the limit existing in the metric of ℋ, uniformly in s on closed intervals [0, a], a > 0. On the other hand, for each h > 0, A_h is a bounded normal operator on the Hilbert space ℋ. Hence one can invoke the standard spectral theorem, according to which there exists a 'resolution of the identity' {E_h(A), A ⊂ C} such that

A_h X = ∫_C z E_h(dz) X,  X ∈ ℋ_0,    (19)

where the integral is a vector integral and μ_X(A) = E_h(A) X ∈ ℋ gives a vector measure. Here ℋ_0 ⊂ ℋ is the subspace for which the integral exists, i.e. z is μ_X-integrable for X ∈ ℋ_0. But from the same theory one can also deduce that

e^{sA_h} X = ∫_C e^{sz} E_h(dz) X,  X ∈ ℋ_1 ⊂ ℋ_0,    (20)



for which e^{sz} is μ_X-integrable. If y* ∈ ℋ*, then y* E_h(·) X is a signed measure in (20), and if y* = X (∈ ℋ* = ℋ), then it is a positive bounded measure for each h, so that one can invoke the Helly selection principle, and then the Helly-Bray theorem in one of its forms, to conclude that lim_{h→0} y* E_h(·) X converges to some ν_{X,y*}, a signed measure. This may be represented as y* F(·) X for an F(·) which has properties analogous to those of E_h(·). Here the argument, which is standard in spectral theory, needs much care and detail. With this, one can take limits in (20) as h → 0 and interchange the limit with the integral, to get

τ_s X = lim_{h→0} e^{sA_h} X = ∫_C e^{sz} F(dz) X.    (21)

Thus the measure F(·) X is orthogonally scattered and is supported by the intersection of the spectral sets of A_h, h > 0. It now follows that, if X_s = τ_s X_0, then by (21), with X = X_0 (∈ ℋ_1) there, one gets

X_s = ∫_C e^{sz} Z(dz),  s ≥ 0,    (22)

where Z(·) on C is an L_0^2(P)-valued orthogonally scattered measure. The covariance function r of this process is given by

r(s, t) = E(X_s X̄_t) = ∫_C exp(sz + t z̄) G(dz),    (23)

with G(A ∩ B) = E(Z(A) Z(B)‾). If S = C and ν = G in (13), one sees that {X_s, s ≥ 0} is a Karhunen process relative to f(s, ·), s ≥ 0, f(s, z) = e^{sz}, and the finite positive measure G, such that f(s, ·) ∈ L²(C, G), s ≥ 0. If C is replaced by its imaginary axis, and for s < 0 the process is extended with X_s = τ*_{−s} X_0, then the stationary case is recovered (cf. (2)). That (23) gives essentially the largest such subclass of Karhunen processes admitting shifts involves further analysis; this was shown by Getoor [9] in some detail.
and this was shown by G e t o o r [9] in some detail.
Thus the Karhunen class contains a subset of nonstationary processes which admit shift operations on them, and also a subset of nonstationary processes (namely the harmonizable class) which do not admit such transformations. Since the representing measures in (3) and (23) or (13) are of a different character (a complex 'bimeasure' in (3) and a regular signed measure in (13)), a study of Karhunen processes becomes advantageous for a structural analysis of various stochastic models. On the other hand, (3) shows a close relationship of some processes with a possibility of employing the finer Fourier analytic methods, giving perhaps a more detailed insight into their behavior. Thus both of these viewpoints are pertinent in understanding many nonstationary phenomena.

4. Cramér class

After seeing the work of the preceding two sections, it is natural to ask whether one can define a more inclusive nonstationary class, incorporating and extending the ideas of both Karhunen and Loève. Indeed, the answer is yes, and such a family was already introduced by Cramér in 1951 [6]; a brief description of it is in order. This also has an independent methodological interest, since it results quite simply under linear transformations of Karhunen classes, in much the same way that harmonizable families result under similar mappings from the stationary ones.
One says that a function F on T̂ × T̂ into C is locally of finite (Fréchet) variation if the restriction of F to each finite proper subrectangle I × I of T̂ × T̂ has finite (Fréchet) variation, I ⊂ T̂ being a finite interval. Let us now state the concept in:

DEFINITION. A second-order process {X_t, t ∈ T} ⊂ L_0^2(P) is of Cramér class (or class (C)) if its covariance function r is representable as

r(t_1, t_2) = ∫_S ∫_S g(t_1, λ) g(t_2, λ′)‾ ν(dλ, dλ′),  t_i ∈ T, i = 1, 2,    (24)

relative to a family {g(t, ·), t ∈ T} of Borel functions and a positive definite function ν of locally bounded variation on S × S, S being a subset of T̂ (or, more generally, a locally compact space), and each g satisfying the (Lebesgue) integrability condition:

0 ≤ ∫_S ∫_S g(t, λ) g(t, λ′)‾ ν(dλ, dλ′) < ∞,  t ∈ T.    (25)

If ν has locally finite Fréchet variation and the integrals in (24) and (25) are in the Morse-Transue sense, the corresponding concept is called weakly of class (C).

It should be noted that, in (24), ν being of locally finite variation means that ν determines a regular complex measure on S × S which is locally finite. In particular, if S = R, the variation measure of ν is σ-finite. In the Fréchet-variation case, ν does not determine such a measure; it is merely a 'C-bimeasure' which is locally finite. If, however, ν concentrates on the diagonal of S × S, then (24) reduces to (13), and the Karhunen class is thus included in class (C), which in turn is included in weakly of class (C). Here, if g(t, λ) = e^{itλ},
S = T̂, then necessarily ν will be of finite (Fréchet) variation, and the harmonizable class is realized. Thus the following hierarchy of nonstationary classes is obtained:

stationary ⊂ strongly harmonizable
           ⊂ weakly harmonizable
           ⊂ Karhunen class
           ⊂ Cramér class
           ⊂ weakly of class (C).

All these inclusions are proper. One key feature is that each member of these classes admits an integral representation analogous to that of (14), though the proofs in each case differ considerably.
Remembering the dilation of a harmonizable series into a stationary one, as given in Theorem 2.2, one might ask for a similar result between the Cramér and Karhunen classes. The methods and ideas of proof of that result extend to give only the following somewhat weaker statement.

THEOREM 4.1. If {X_t, t ∈ T} ⊂ L_0^2(P) is a Karhunen process and A is a bounded linear transformation on L_0^2(P) into itself, then {Y_t = AX_t, t ∈ T} is a process of class (C), whenever the representing measure ν of (13) is finite. The converse direction (on dilation) is not necessarily valid. However, if in (24) ν is of finite (Fréchet) variation and each g(t, ·), t ∈ T, is individually a bounded Borel function, then such a class (C) process can be dilated to a Karhunen process on a larger space L_0^2(P̃) containing L_0^2(P).

In this generalization, it is significant that the full dilation result does not obtain. Only an interesting subclass extends. Details and related references, with further extensions of the problem, can be found in [28]. It will appear in applications (cf. Section 8 below) that these classes arise naturally, especially as solutions of linear stochastic differential equations in filtering and signal extraction problems. Another important reason for a study of the Karhunen class will emerge in Section 7.

5. Multivariate harmonizable processes

An n-dimensional harmonizable process X_t = (X_t^1, ..., X_t^n), t ∈ T, as defined in Section 2 (cf. (9)-(11)), has its covariance matrix r representable as

r(s, t) = ∫_T̂ ∫_T̂ e^{isλ − itλ′} F(dλ, dλ′),  s, t ∈ T,

for a suitable n-by-n matrix F of C-bimeasures. A similar multivariate analog of Cramér and Karhunen processes can be given, with a corresponding formula
in the form of (10). Note that F is also hermitian positive definite for the Karhunen class, but not for the Cramér and harmonizable families. Since there are not many results available for the other classes, the more familiar (nonstationary) harmonizable case will be discussed here. For this, the spectrum plays a role somewhat analogous to that of the stationary case, and its spectral domain is given as follows.
In all the extensions of stationarity considered above, the covariance functions admit 'factorizable' kernels for their integrands (cf. (3), (13), (24)). This fact translates itself into integral representations of the sample paths, as in (22), or analogously:

X_t = ∫_S g(t, λ) Z(dλ),  t ∈ T,    (26)

where Z is a measure on S into L_0^2(P), which is orthogonally scattered in the Karhunen case (just as for stationary processes). It satisfies

E(Z(A) Z(B)‾) = F(A, B)

for the Cramér and harmonizable processes. Here F is generally only a bimeasure. The symbol in (26) is a suitable stochastic (or vector measure) integral. It is these representations that make up a study of the related spectral domain, and they inherit several properties of the time domain. Thus in these cases the spectral spaces are given as

L²(F) = { f: S → C | ∫_S ∫_S f(λ) f(λ′)‾ F(dλ, dλ′) = (f, f) exists }.    (27)

For the harmonizable case S = T̂, and in all cases 0 ≤ (f, f) < ∞, because of the special relationship between F and Z. The (·,·) gives a (semi-)inner product and a (semi-)norm: ‖f‖² = (f, f). For the multivariate case one has

L²(F) = { f: S → M | ∫_S ∫_S f(λ) F(dλ, dλ′) f*(λ′) = (f, f) exists },    (28)

and ‖f‖² = trace(f, f) defines a semi-norm, f* being the conjugate transpose of the matrix function f (∈ M, the space of complex m-by-n matrices). Again, S = T̂ for the harmonizable case. In order to carry out linear operations for problems such as filtering and prediction, it is necessary to know the structural properties of the space L²(F). This is nontrivial, and especially in the multivariate case it was open for some time (cf. [15]). The following key property, which was needed there, has recently been established for the harmonizable case in [30], and can be stated as:

THEOREM 5.1. If $\{X_t, t \in T\}$ is a multivariate weakly harmonizable process
with $F$ as its spectral matrix function and $\mathcal{L}^2(F)$, defined by (28), is its spectral
domain space, then $(\mathcal{L}^2(F), \|\cdot\|)$ is complete in the sense that it is a Hilbert
space of equivalence classes of matrices with inner product defined by $((\cdot, \cdot)) = \mathrm{trace}(\cdot, \cdot)$, where

$$(f, g) = \int_{\hat{T}} \int_{\hat{T}} f(\lambda) F(d\lambda, d\lambda') g^*(\lambda'), \qquad (29)$$

the 'star' denoting the conjugate transpose. Here $\mathcal{L}^2(F)$ is a linear space of
(complex) matrix functions on $\hat{T}$ with constant matrix coefficients for linear
combinations.

In order to assert a similar property for the class(C) or Karhunen class, it will
be necessary to restrict the family $\{g(t, \cdot), t \in T\}$ suitably.
The importance of the above property is better understood if one looks at an
application. The following is one such for signal extraction. A general signal
plus noise model is given by

$$X_t = S_t + N_t, \quad t \in T, \qquad (30)$$

where the $S_t$ and $N_t$ are (stochastic) signal and noise processes, both of which
are supposed to be weakly harmonizable so that the output process $X_t$ is also,
whenever the $S_t$ and $N_t$ are uncorrelated or harmonizably correlated. For
simplicity of exposition here, let us assume that they are uncorrelated. If
$F_X, F_S, F_N$ are the (known) spectral functions of these processes, let $h(A) = F_S(A, \hat{T}) + F_N(A, \hat{T})$ and $k(A) = F_S(A, \hat{T})$. The knowledge of these spectral func-
tions is assumed from prior considerations. The problem here is to estimate $S_a$
optimally, for any $a \in T$, based on the output $X_t$ (i.e. on a realization). Here
optimality refers to the least squares (or error mean square) criterion. A
solution of the problem, using Theorem 5.1, can be given as

THEOREM 5.2. Let $X_t = S_t + N_t$, $t \in T$, be a harmonizable output of the (uncor-
related harmonizable) signal plus noise model (30). Let $Z_X(\cdot)$ be the stochastic
representing measure of the $X_t$ process given by (14) or (26). Then the least
squares optimal estimator $\hat{S}_a$ of the signal $S_t$ at $t = a \in T$ is obtained as

$$\hat{S}_a = \int_{\hat{T}} G_a(\lambda) Z_X(d\lambda), \qquad (31)$$

where the 'signal characteristic' $G_a(\cdot)$ is an $n \times n$ matrix function which is a
unique solution of the (matrix) integral equation

$$\int_{\hat{T}} G_a(\lambda) h(d\lambda) = \int_{\hat{T}} e^{ia\lambda} k(d\lambda), \qquad (32)$$

the $h(\cdot)$ and $k(\cdot)$ being the 'marginal' measures of the spectral functions of the
$S_t$ and $N_t$ processes defined above. The variance-covariance matrix of the error
$S_a - \hat{S}_a$ is given by

$$\sigma_a^2 = \int_{\hat{T}} \int_{\hat{T}} e^{ia(\lambda - \lambda')} F_S(d\lambda, d\lambda') - \int_{\hat{T}} \int_{\hat{T}} G_a(\lambda) F_S(d\lambda, d\lambda') G_a^*(\lambda'). \qquad (33)$$

To use this result in specific applications, one has to compute $F_X$ and $Z_X$.
From the data, one can calculate the covariance $r$, and then $F_X$ is obtained by
means of formula (6). The stochastic measure $Z_X(\cdot)$ can also be obtained by
using the dilation theorem (cf. Theorem 2.2) and a known result from the
stationary theory (cf. [32], pp. 26-27). Thus for any interval $A = (a, b) \subset T$ for
which $Z(\{a\}) = 0 = Z(\{b\})$, one has (limits taken in the mean square sense)

$$Z(A) = \begin{cases} \dfrac{1}{2\pi}\Big[X_0 (b - a) + \lim_{n\to\infty} \displaystyle\sum_{0 < |k| \le n} \dfrac{e^{-ibk} - e^{-iak}}{-ik}\, X_k \Big], & \text{if } T = \mathbb{Z}, \\[3mm] \lim_{\alpha\to\infty} \dfrac{1}{2\pi} \displaystyle\int_{-\alpha}^{\alpha} \dfrac{e^{-ibt} - e^{-iat}}{-it}\, X_t \, dt, & \text{if } T = \mathbb{R}. \end{cases}$$

In the one-dimensional stationary case, if all spectral functions have densities
$f_X, f_S, f_N$, then (32) and (33) reduce to the well-known results where the $g_a$ and
$\sigma_a^2$ can now be given explicitly as

$$g_a(\lambda) = e^{ia\lambda} f_S(\lambda)/(f_S(\lambda) + f_N(\lambda)), \qquad \sigma_a^2 = E(|S_0|^2) - \int_{\hat{T}} |g_a(\lambda)|^2 f_X(\lambda)\, d\lambda.$$
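In this one-dimensional reduction the computations are elementary enough to carry out directly. The following is a minimal numerical sketch; the spectral densities $f_S$ (an AR(1)-type density) and $f_N$ (flat) are assumptions made purely for illustration, not taken from the text.

```python
import numpy as np

# A minimal numerical sketch of the one-dimensional reduction above:
# g_a(lam) = e^{i a lam} f_S/(f_S + f_N) and the corresponding mean
# square error.  The densities f_S and f_N are illustrative assumptions.

lam = np.linspace(-np.pi, np.pi, 4001)
dlam = lam[1] - lam[0]
f_S = 1.0 / (2 * np.pi * (1.25 - np.cos(lam)))   # assumed signal density
f_N = np.full_like(lam, 0.5 / (2 * np.pi))       # assumed white noise density
f_X = f_S + f_N                                  # output density (uncorrelated)

a = 1.0                                          # estimate the signal at t = a
g_a = np.exp(1j * a * lam) * f_S / f_X           # optimal signal characteristic

E_S0_sq = np.sum(f_S) * dlam                     # E|S_0|^2 = integral of f_S
mse = E_S0_sq - np.sum(np.abs(g_a) ** 2 * f_X) * dlam
print(f"signal power {E_S0_sq:.4f}, extraction MSE {mse:.4f}")
```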

6. Class(KF) and harmonizability

As noted in the preceding sections, many processes which are extensions of the
stationary ones with 'triangular covariances' admit integral representations
such as (26). However, there are other classes based on the behavior of
covariances at infinity. An important such family, motivated by certain sum-
mability methods, is the one introduced in the 1950's by J. Kampé de Fériet
and F. N. Frenkiel, with a detailed exposition later in [13]. It will be called
class(KF). This was also independently given, slightly later, by Yu. A. Rozanov
[31] and E. Parzen [21], the latter under the name 'asymptotic stationarity'. Let
us state the precise concept here.

DEFINITION. A process $\{X_t, t \in T\} \subset L^2_0(P)$ with a continuous covariance $r$ is
of class(KF) if for each $h \in T$ the following limit exists:

$$\tilde{r}(h) = \begin{cases} \lim_{\alpha\to\infty} \dfrac{1}{\alpha} \displaystyle\int_0^{\alpha - |h|} r(s, s + |h|)\, ds, & \text{if } T = \mathbb{R}, \\[3mm] \lim_{n\to\infty} \dfrac{1}{n} \displaystyle\sum_{k=0}^{n - |h| - 1} r(k, k + |h|), & \text{if } T = \mathbb{Z}. \end{cases} \qquad (34)$$

It may be verified that each of the quantities on the right of (34) before
taking the limit is positive definite so that, when the limits exist, $\tilde{r}(\cdot)$ is also.
The continuity of $\tilde{r}$ in the discrete case is trivial, and in the case that $T = \mathbb{R}$, $\tilde{r}$ is
measurable even when it is not continuous. In either case, as a consequence of
the classical Herglotz-Bochner-Riesz theorem on a characterization of such
functions (cf., e.g., [29], Sections 4.4 and 4.5) there is a unique positive
bounded nondecreasing $H(\cdot)$ such that

$$\tilde{r}(h) = \int_{\hat{T}} e^{ih\lambda} H(d\lambda), \quad \text{a.a. } h \in T, \qquad (35)$$

where a.a. $h$ refers to Lebesgue measure when $T = \mathbb{R}$, and all $h$ if $T = \mathbb{Z}$. In
analogy with (2), $H(\cdot)$ is termed an associated spectral function of the $X_t$
process. Here several examples of processes in class(KF), in addition to the
obvious stationary family for which $\tilde{r}(h) = r(|h|)$, will be noted.
Every strongly harmonizable process is in class(KF). This was first noted by
Rozanov in [31] and independently later analyzed in more detail by Bhagavan
[1]. In fact, somewhat more generally, almost harmonizable processes (cf.
Section 2) also belong to class(KF), as shown in [25]. It is an interesting fact
that, in the strongly harmonizable case, (35) holds for all $h \in T$, so that $\tilde{r}(\cdot)$ is
continuous. Another example is provided by the process $\{X_t, t \in \mathbb{Z}\}$ which is a
solution of

$$X_t = a X_{t-1} + \varepsilon_t, \quad t \in \mathbb{Z}, \qquad (36)$$

where the $\varepsilon_t$'s are uncorrelated random variables with zero means and unit
variances, and $|a| < 1$. It is easily seen that the limit (34) exists for this
sequence. One can also consider kth-order difference equations with constant
coefficients, extending (36), such that all the roots of the characteristic equa-
tions lie inside the unit circle, but the details will not be entered into here.
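For the scheme (36) the limit (34) can even be checked by hand; the sketch below does so numerically under the extra assumptions (not in the text) that $X_0 = 0$ and $a = 0.6$, for which the exact covariances are available in closed form.

```python
# A minimal check, assuming X_0 = 0 and unit-variance uncorrelated errors,
# that the averaged covariances (34) of the AR(1) scheme (36) converge.
# Here Var(X_t) = (1 - a^(2t))/(1 - a^2) and r(k, k+h) = a^h Var(X_k), h >= 0.

a, h = 0.6, 2

def cov(k, l):
    m = min(k, l)
    return a ** abs(k - l) * (1 - a ** (2 * m)) / (1 - a ** 2)

for n in (10, 100, 1000):
    r_n = sum(cov(k, k + h) for k in range(n)) / n
    print(n, r_n)
print("limit:", a ** h / (1 - a ** 2))   # the stationary value
```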
What about the weakly harmonizable class? In fact, this question was raised in
[31]. It turns out that neither class includes the other completely. The preceding
example already gives a nonharmonizable process of class(KF). The other
noninclusion will now be discussed since that provides a better insight into the
structure of both these classes.
Consider a weakly harmonizable process $\{X_t, t \in T\}$ with covariance $r$ and $F$
as its representing bimeasure (cf. (3)). Suppose that for this bimeasure the
dominated convergence theorem holds in the sense that if $|f_n| \le g$ a.e., $f_n \to f$
pointwise, and

$$\int_{\hat{T}} \int_{\hat{T}} g(\lambda) \bar{g}(\lambda') F(d\lambda, d\lambda') < \infty, \qquad (37)$$

then $\int_{\hat{T}} \int_{\hat{T}} f_n(\lambda) \bar{f}_n(\lambda') F(d\lambda, d\lambda') \to \int_{\hat{T}} \int_{\hat{T}} f(\lambda) \bar{f}(\lambda') F(d\lambda, d\lambda')$; under this
hypothesis the process can be shown to belong to class(KF).

If $F$ has finite variation, this is automatic, so that the strongly harmonizable
case is included. It also holds for some $F$ having only finite Fréchet variation.
For instance, this holds if $F$ is of finite variation locally but not on $\hat{T} \times \hat{T}$ itself.
(Such $F$'s appear in the Cramér classes. This statement appeared in ([28], Thm.
8.1) without such a precise qualification.) On the other hand, some restriction is
necessary since all weakly harmonizable processes are not in class(KF), as the
following example shows: Let $\{\varepsilon_n, -\infty < n < \infty\}$ be a complete orthonormal
sequence in a space $L^2_0(P)$ so that the underlying probability space is separable.
The sequence is clearly stationary. Let $\{a_n, -\infty < n < \infty\}$ be a bounded
sequence of numbers and define a mapping $A: \varepsilon_n \mapsto a_n \varepsilon_n$ in $L^2(P)$, and extend
it linearly onto all of $L^2(P)$. This is possible since the $\varepsilon_n$ form a basis. Thus $A$ is
a bounded linear operator in $L^2(P)$, and since $\{\varepsilon_n\}_{n \in \mathbb{Z}}$ is stationary, Theorem
2.2 implies that $X_n = A\varepsilon_n$, $n \in \mathbb{Z}$, is weakly harmonizable. It will be shown now
that for a suitable set of $a_n$'s, the $X_n$ sequence is not in class(KF), and this
will give the noninclusion statement mentioned above.
Let $a_n = a_{-n}$, $a_0 = 1$, and for $k > 0$, define

$$a_k = \sum_{n=0}^{\infty} [\chi_{C_n} + 2\chi_{D_n}](k),$$

where $C_n = [2^{2n}, 2^{2n+1})$ and $D_n = [2^{2n+1}, 2^{2n+2})$, the left closed and right open
intervals. The sets $C_n$ and $D_n$ are disjoint, and for each $k$ the series is finite
(only one nonzero term), $1 \le a_k \le 2$, $k > 0$. Then $A$ defined with this set of $a_k$'s
is clearly bounded. The covariance $r(k, l) = 0$ if $k \ne l$, and hence

$$r_n(h) = \frac{1}{n} \sum_{k=0}^{n-1} r(k, k + h) = \frac{1}{n} \sum_{k=0}^{n-1} a_k^2, \quad \text{if } h = 0.$$

So $\lim_{n\to\infty} r_n(h) = 0$ for $h \ne 0$, but

$$\lim_{m\to\infty} r_{2^{2m}-1}(0) = 3, \qquad \lim_{m\to\infty} r_{2^{2m+1}-1}(0) = 2, \qquad (38)$$

so that $\lim_{n\to\infty} r_n(0)$ does not exist. Consequently, $\{X_n, n \in \mathbb{Z}\} \notin$ class(KF).
This example is due to H. Niemi (personal communication).
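The oscillation in (38) is easy to confirm numerically; the following sketch recomputes the averages $r_n(0)$ directly from the block coding of the $a_k$ just defined.

```python
# A numerical check of (38): with a_k = 1 on the blocks C_n = [2^(2n), 2^(2n+1))
# and a_k = 2 on D_n = [2^(2n+1), 2^(2n+2)), the averages
# r_n(0) = (1/n) sum_{k<n} a_k^2 approach 3 along n = 2^(2m)-1 and 2 along
# n = 2^(2m+1)-1, so no overall limit exists.

def a_sq(k):
    if k == 0:
        return 1.0
    m = k.bit_length() - 1               # k lies in [2^m, 2^(m+1))
    return 1.0 if m % 2 == 0 else 4.0    # even m: C block; odd m: D block

for m in (3, 5, 7):
    for n in (2 ** (2 * m) - 1, 2 ** (2 * m + 1) - 1):
        r_n0 = sum(a_sq(k) for k in range(n)) / n
        print(f"n = {n:6d}: r_n(0) = {r_n0:.4f}")
```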
The preceding computation suggests an extension of class(KF). Since by
definition a process $\{X_n, n \in \mathbb{Z}\} \subset L^2(P)$ belongs to class(KF) provided that the
sequence of their averaged covariances converges, it is natural to consider a
wider class by looking at their higher-order averages. Thus one can say that a
process with covariance $r$ is in class(KF, p), $p = 1$ being the original definition
(cf. (34)), if the following limit exists for each $h \in \mathbb{Z}$:

$$\lim_{n\to\infty} r_n^{(p)}(h) = \tilde{r}(h), \quad p \ge 1, \qquad (39)$$

where

$$r_n^{(p)}(h) = \frac{1}{n} \sum_{k=1}^{n} r_k^{(p-1)}(h), \qquad r_n^{(1)}(h) = r_n(h).$$

The analog for the case that $T = \mathbb{R}$ can similarly be given. Since in (34) $r_n(\cdot)$
is positive definite, it is seen easily that $r_n^{(p)}(\cdot)$ is also positive definite. Hence
$\tilde{r}(\cdot)$ satisfies the same hypothesis and (35) holds, so that the representing $H(\cdot)$
may now be called a pth-order associated spectrum. The classical results
on summability imply that if $r_n^{(p)}(h) \to \tilde{r}(h)$, then $r_n^{(p+1)}(h) \to \tilde{r}(h)$ for each
integer $p \ge 1$, but the converse implication is false. Hence class(KF) $\subset$
class(KF, p) $\subset$ class(KF, p + 1) and the inclusions are proper. Thus one
has an increasing sequence of classes of nonstationary processes, each
having an associated spectrum. The computations given for (38) show that the
preceding example does not belong even to the class $\bigcup_{p \ge 1}$ class(KF, p). This
also indicates that weakly harmonizable processes form a much larger class
than the strongly harmonizable one, and this class is not included in the last union.
It should be remarked here that a further extension of the preceding class is
obtainable by considering the still weaker concept of Abel summability. The
consequences of such an extension are not yet known, and perhaps should be
investigated in the future.
The general idea behind the class(KF, p), $p \ge 1$, is that if the given process is
not stationary, then some averaging, which is a smoothing operation, may give
an insight into the structure by analyzing its associated spectrum. Moreover, if
$\{X_t, t \in \mathbb{R}\} \in$ class(KF), and $f$ is any Lebesgue integrable scalar function on $\mathbb{R}$,
then the convolution of $f$ and the $X_t$ process is again in class(KF) whenever the
function $\phi$ defined by $\phi(t) = [E(|X_t|^2)]^{1/2}$ is in $L^q(\mathbb{R})$ for some $1 \le q \le \infty$. Then

$$Y_t = (f * X)_t = \int_{\mathbb{R}} f(t - s) X_s \, ds, \quad t \in \mathbb{R}, \qquad (40)$$

where the integral is a vector (or Bochner) integral, gives $\{Y_t, t \in \mathbb{R}\} \in$
class(KF). Thus class(KF) itself is a large family. This example is a slight
extension of one indicated in [31].

7. The Cram6r-Hida approach and multiplicity

In the previous discussion of Karhunen and Cramér classes, it was noted that
each $\{X_t, t \in T\}$ admits an integral representation such as (26) relative to a
family $\{g(t, \cdot), t \in T\}$ and a stochastic measure $Z(\cdot)$ on the spectral set $S$ into
$L^2_0(P)$. Both $g(t, u)$ and $Z(du)$ can be given the following intuitive meaning,
leading to another aspect of the subject. Thus $X_t$ may be considered as the
intensity of an electrical circuit measured at time $t$, $Z(du)$ as a random
(orthogonal) impulse at $u$, and $g(t, u)$ as a response function at time $u$ but
measured at a later time $t$. So $X_t$ is regarded as the accumulated random
innovations up to $t$. This will be realistic provided the effects are additive and
$g(t, u) = 0$ if $u > t$. Hence (26) should be replaced by

$$X_t = \int_{-\infty}^{t} g(t, u) Z(du), \quad t \in T. \qquad (41)$$

Since in (26) the $g$ there need not satisfy this condition, that formula does not
generally reduce to (41). So one should seek conditions on a subclass of
Karhunen processes admitting a representation of the type (41), which clearly
has interesting applications. Such a class will be discussed together with some
illustrations.
First it is noted that each process $\{X_t, t \in T\} \subset L^2(P)$, assumed to be left
continuous with right limits (i.e. for each $t \in T$, $E(|X_t - X_{t-h}|^2) \to 0$ as $h \to 0^+$,
and there is an $\tilde{X}_t$ such that $E(|\tilde{X}_t - X_{t+h}|^2) \to 0$ as $h \to 0^+$, denoted $\tilde{X}_t = X_{t+0}$),
can be decomposed into a deterministic and a purely nondeterministic part
(defined below). The deterministic component does not change from the
remote past, so that it has no real interest for further stochastic analysis such as
in prediction and filtering problems. Thus only the second component has to be
analyzed for a possible representation (41). This was shown to be the case by
Cramér [7] and Hida [12] independently, and it will be presented here. ([7] has
the 1960 references to Cramér's papers.)
Let $\mathcal{H} = \overline{\mathrm{sp}}\{X_t, t \in T\} \subset L^2(P)$, and similarly $\mathcal{H}_t = \overline{\mathrm{sp}}\{X_s, s \le t\} \subset \mathcal{H}$ and
$\mathcal{H}_{-\infty} = \bigcap_{t \in T} \mathcal{H}_t$. Since $\mathcal{H}_{t_1} \subset \mathcal{H}_{t_2}$ for $t_1 < t_2$, one has $\mathcal{H}_{-\infty} \subset \mathcal{H}_t \subset \mathcal{H}$, and $\mathcal{H}_{-\infty}$
represents the remote past while $\mathcal{H}_t$ stands for the past and present. The $X_t$
process is deterministic if $\mathcal{H}_{-\infty} = \mathcal{H}$ and purely nondeterministic if $\mathcal{H}_{-\infty} = \{0\}$.
Thus the remote past generally contributes little to the experiment. The
separation of the remote past from the evolving part is achieved as follows. A
process $\{X_t, t \in T\}$ which is left continuous with right limits (and this is
automatic if $T = \mathbb{Z}$) can be uniquely decomposed as: $X_t = Y_t + Z_t$, $t \in T$,
where the $Y_t$ component is purely nondeterministic, the $Z_t$ is deterministic, and
the $Y_t$ and $Z_t$ processes are uncorrelated. (This is a special case of
Wold's decomposition.)
Since the deterministic part is uninteresting for the problems of stochastic
analysis, and can be separated by the above result, one can ignore it. Hence for
the rest of this section it will be assumed that our processes are purely non-
deterministic. The proofs of the following assertions may be completed from the
work of Cramér in [7] (cf. the references for his other papers there).
The approach here does not give much insight if $T = \mathbb{Z}$. However, $T = \mathbb{R}$ is
really the difficult case, and the present method is specifically designed for it.
The new element in this analysis is the concept of 'multiplicity', and it is always
one if $T = \mathbb{Z}$ while it can be any integer $N \ge 1$ if $T = \mathbb{R}$. (See [5], and the
references there, and also [7].) The basic idea is to 'break up' the continuous
parameter case, in the sense that each such process can be expressed as a direct
sum of mutually uncorrelated components of the type (41) so that each of the
latter elements can be analyzed with special methods. This relatively deep
result was obtained independently (cf. [7] and [12]) and can be given as follows:

THEOREM 7.1. Let $\{X_t, t \in \mathbb{R}\} \subset L^2(P)$ be a purely nondeterministic process
which is left continuous with right limits on $\mathbb{R}$. Then there exists a unique number
$N$, $1 \le N \le \infty$, called the multiplicity of the process, and a not necessarily unique
set of ordered pairs $\{(g_k(t, \cdot), F_k), 1 \le k \le N, t \in \mathbb{R}\}$ of the following description:
(i) $g_k(t, \cdot): \mathbb{R} \to \mathbb{C}$ is a Borel function, $1 \le k \le N$, $t \in \mathbb{R}$,
(ii) $F_k: \mathbb{R} \to \mathbb{R}$ is a non-decreasing (not necessarily bounded) left-continuous
function such that if $\nu_k(A) = \int_A F_k(d\lambda)$, $A \subset \mathbb{R}$ a Borel set, then $\nu_{k+1} \ll \nu_k$,
$1 \le k \le N$ (i.e. $\nu_{k+1}$ is dominated by $\nu_k$),
(iii) $\{g_k(s, \cdot), s \le t\} \subset L^2((-\infty, t), \nu_k)$ is (norm) dense; and if $r$ is the covari-
ance function of the $X_t$ process, then

$$r(s, t) = \sum_{k=1}^{N} \int_{-\infty}^{\min(s,t)} g_k(s, \lambda) \overline{g_k(t, \lambda)}\, F_k(d\lambda), \quad s, t \in \mathbb{R}, \qquad (42)$$

the series converging absolutely if $N = +\infty$.

Using the Lebesgue decomposition of measure theory, it can be verified that
if the $X_t$ process is stationary and nondeterministic, then in (2) the spectral
function is absolutely continuous relative to the Lebesgue measure on $\mathbb{R}$ with a
density and hence, as noted in (12), (42) becomes

$$r(s, t) = \int_{-\infty}^{\min(s,t)} g(s - \lambda) \overline{g(t - \lambda)}\, d\lambda. \qquad (43)$$

Thus such stationary processes always have multiplicity unity. The converse is
not true. There exist nonstationary (even strongly harmonizable) nondeter-
ministic processes of multiplicity $N$ for any given $N$, $1 \le N \le \infty$.
It should be noted that (42) can also be stated for the $X_t$ process using the
(stochastic) integral representations:

$$X_t = \sum_{k=1}^{N} \int_{-\infty}^{t} g_k(t, u) Z_k(du), \quad t \in \mathbb{R}, \qquad (44)$$

where each $Z_k(\cdot)$ is orthogonally scattered, $E(Z_k(A)\overline{Z_l(B)}) = 0$ if $k \ne l$, and
$= \nu_k(A \cap B) = \int_{A \cap B} F_k(d\lambda)$ if $k = l$. The pairs $\{(g_k(t, \cdot), F_k), 1 \le k \le N, t \in \mathbb{R}\}$
satisfy the previous conditions. Moreover, one has

$$\mathcal{H}_t = \bigoplus_{i=1}^{N} \mathcal{H}_{i,t}, \quad t \in \mathbb{R}, \qquad (45)$$

where $\mathcal{H}_t$ was defined before and $\mathcal{H}_{i,t} = \overline{\mathrm{sp}}\{Z_i((-\infty, s]): s \le t\} \subset L^2_0(P)$. Also in
case $g_k(t, t) > 0$, $t \in \mathbb{R}$, then writing $\tilde{g}_k(t, \lambda) = g_k(t, \lambda)/g_k(\lambda, \lambda)$ and $\tilde{Z}_k(d\lambda) = g_k(\lambda, \lambda) Z_k(d\lambda)$ in (42) or (44), one can assume that $g_k(\lambda, \lambda) = 1$, for con-
venience.
venience.
To get a better feeling for this somewhat complicated decomposition, let us
present a class of nonstationary processes of multiplicity one.

THEOREM 7.2. Suppose that $\{X_t, t \in \mathbb{R}\} \subset L^2(P)$ is a process which may be
represented as in (44) (or (42)). Suppose further that each $g_k(\cdot, \cdot)$ satisfies the
following conditions:
(i) $g_k(t, t) = 1$ (this is no restriction if $g_k(t, t) > 0$), $t \in \mathbb{R}$,
(ii) for each $\lambda \le t$, $g_k(t, \lambda)$ and $(\partial g_k/\partial t)(t, \lambda)$ (which exists) are continuous and
bounded,
(iii) each $F_k \not\equiv$ a constant and has a (Lebesgue) density $F'_k$ which has at most
a finite number of discontinuities on each finite subinterval of $\mathbb{R}$.
Then the multiplicity of the process $\{X_t, t \in \mathbb{R}\}$ is one, i.e., $N = 1$.

The case $N = 1$ already has interesting connections with other known classes.
For instance, let $g(t, \lambda) = p(t)/p(\lambda)$ in (42) with $N = 1$, and $p(\lambda) > 0$, $\lambda \in \mathbb{R}$.
Writing $f = F'$, (42) becomes

$$r(s, t) = p(s) p(t) \int_{-\infty}^{\min(s,t)} \frac{f(\lambda)}{(p(\lambda))^2}\, d\lambda. \qquad (46)$$

Hence for all $s < t < u$, if $\rho(s, t) = r(s, t)/r(s, s)$, $s \le t$, one gets

$$\rho(s, u) = \rho(s, t) \rho(t, u). \qquad (47)$$

This $\rho(\cdot, \cdot)$ is called a correlation characteristic, and the functional equation
(47) implies that $\{X_t, t \in \mathbb{R}\}$ is a wide-sense Markov process. This means, for
each $t_1 < t_2 < \cdots < t_n$, $n \ge 1$, $t_k \in \mathbb{R}$, the (orthogonal) projection of $X_{t_n}$ on the
linear span of $X_{t_1}, \ldots, X_{t_{n-1}}$ is the same as the projection of $X_{t_n}$ on the
one-dimensional span of $X_{t_{n-1}}$. (For a proof of this classical fact see [29], p. 145.)
It is of some interest to note that, in the special case of (46), if $\tilde{\rho}(s, t) = r(s, t)/[r(s, s) r(t, t)]^{1/2}$, the correlation coefficient, then $\tilde{\rho}$ also satisfies the relation
(47). If the $X_t$ process is normal and (46) holds, the above-noted projection
becomes the conditional expectation and the wide-sense property becomes the
usual (strict-sense) Markov property. As an example, one may consider $r(s, t) = \exp(-c|s - t|)$, $c > 0$.
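A quick numerical check of (47) for this exponential covariance (the constants below are arbitrary choices):

```python
import numpy as np

# A small check of the functional equation (47) for r(s, t) = exp(-c|s-t|):
# with rho(s, t) = r(s, t)/r(s, s), one finds rho(s, u) = rho(s, t) rho(t, u)
# for s < t < u, the wide-sense Markov property noted above.

c = 0.7
r = lambda s, t: np.exp(-c * abs(s - t))
rho = lambda s, t: r(s, t) / r(s, s)

s, t, u = 0.3, 1.1, 2.6
print(rho(s, u), rho(s, t) * rho(t, u))   # the two values agree
```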
A different example of a nonstationary (nonharmonizable) process of multi-
plicity one is the Brownian motion. Here $g(t, \lambda) = 1$, $F(u) = 0$ if $u < 0$, $= u$ if
$0 \le u < 1$, and $= 1$ if $u \ge 1$, with $N = 1$ in (42). If the process is not assumed normal
(Gaussian)$^1$, $g \equiv$ a constant, and $F$ is also a constant outside of a compact
interval, then each nondeterministic process of the form (44), which has
orthogonal increments, has multiplicity one. Thus each of these classes is large
in itself. Other useful applications will now be discussed in the final two
sections. (Cf. also [7].)

$^1$From here on, 'a process is normal' means it is Gaussian. It should not be confused with the
'normal operators' of Section 3 (cf. (17)-(23)).

8. Prediction and related questions

A linear least squares predictor of the $X_t$ process, by definition, is a linear
function of the past and present $\{X_u, u \le s\}$ which is closest to $X_t$, $t > s$, so that
if $\hat{X}_{t,s}$ is the desired element in $\mathcal{H}_s$, then

$$E(|X_t - \hat{X}_{t,s}|^2) = \inf\{E(|X_t - Y|^2): Y \in \mathcal{H}_s\}, \qquad (48)$$

where, as usual, $\mathcal{H}_s = \overline{\mathrm{sp}}\{X_u, u \le s\} \subset L^2(P)$. Consequently, $\hat{X}_{t,s} = P_s X_t$, with $P_s$ as
the orthogonal projection of $\mathcal{H} = \overline{\mathrm{sp}}\{X_t, t \in \mathbb{R}\} \subset L^2(P)$ onto $\mathcal{H}_s$.
For processes satisfying the hypothesis of Theorem 7.1, the predictor $\hat{X}_{t,s}$ is
obtained immediately. In fact, if $X_t$ is as above so that it admits a representation
given by (44), one has

$$\hat{X}_{t,s} = P_s X_t = \sum_{k=1}^{N} \int_{-\infty}^{s} g_k(t, \lambda) Z_k(d\lambda), \qquad (49)$$

since $\mathcal{H}_s = \bigoplus_{i=1}^{N} \mathcal{H}_{i,s}$ in (45). Moreover, the minimum mean square error of
prediction is obtained as

$$\sigma_{t,s}^2 = E(|X_t - \hat{X}_{t,s}|^2) = \sum_{k=1}^{N} \int_{s}^{t} |g_k(t, \lambda)|^2 F_k(d\lambda). \qquad (50)$$

This in principle furnishes the desired solution of the least squares linear
prediction problem for processes of the type (44). In general, however, there is
as yet no recipe for determining the multiplicity of a given continuous
parameter, purely nondeterministic, second-order, left-continuous with right
limits process. But results are available if one is willing to assume somewhat
more on the $g_k$'s, generalizing the stationary case.
Even when the $X_t$ process does not satisfy all the conditions of Theorem 7.1,
the least squares prediction problem can be formulated and solved differently.
To understand this aspect, let $\{X_t, t \in T\} \subset L^2_0(P)$ be a process and $\mathcal{H}_t = \overline{\mathrm{sp}}\{X_s:
s \le t\}$ as before. Suppose that $\mathcal{H} = \overline{\mathrm{sp}}\{X_t, t \in T\} \subset L^2_0(P)$ is separable, which
holds if the covariance $r(\cdot, \cdot)$ is continuous (e.g., $T = \mathbb{Z}$). For each $t_0 \in T$, the
best linear least squares predictor of $X_{t_0}$ based on the past $\{X_u, u \le s < t_0\}$ is
$\hat{X}_{t_0,s} = P_s X_{t_0}$ ($\in \mathcal{H}_s$), and it is the limit-in-mean of linear combinations of $X_u$,
$u \le s$. On the other hand, it is known that a nonlinear least squares predictor of
$X_{t_0}$ is given by the conditional expectation

$$Y_{t_0,s} = E(X_{t_0} \mid X_u, u \le s).$$

If the process is normal, then one can verify that $\hat{X}_{t_0,s} = Y_{t_0,s}$. Thus for normal
processes, with a continuous covariance, both these predictors coincide. (See
Yaglom [34], Chapters 4 and 6 for a lucid discussion of these problems.) Since
from a practical point of view it is not feasible to have a complete realization
$\{X_u, u \le s\}$ at our disposal, it is desirable to have some approximations to the
best predictor. A result on this can be described as follows. Let $T = \mathbb{Z}$ for
simplicity, and for $s < t_0 \in \mathbb{Z}$ define $\mathcal{G}_n = \overline{\mathrm{sp}}\{X_s, X_{s-1}, \ldots, X_{s-n}\}$, so that
$\lim_n \mathcal{G}_n = \overline{\mathrm{sp}}\{\bigcup_{n \ge 0} \mathcal{G}_n\} = \mathcal{H}_s$. If $\hat{X}_{t_0,n} = Q_n(X_{t_0})$, $Q_n$ being the orthogonal pro-
jection of $\mathcal{H}$ onto $\mathcal{G}_n$, then one can show, using the geometry of $\mathcal{H}$, that
$E(|\hat{X}_{t_0,n} - \hat{X}_{t_0,s}|^2) \to 0$ as $n \to \infty$. However, the pointwise convergence of $\hat{X}_{t_0,n}$
to $\hat{X}_{t_0,s}$ is much more difficult, and in fact the truth of the general statement is
not known. For a normal process, an affirmative answer can be obtained from
the following nonlinear case.
Let $Y_{t_0,n} = E(X_{t_0} \mid X_s, X_{s-1}, \ldots, X_{s-n})$ and $Y_{t_0,s}$ be as before. Then the
sequence $\{Y_{t_0,n}, n \ge 1\}$ is a square integrable martingale such that
$\sup_n E(|Y_{t_0,n}|^2) < \infty$. Hence the general martingale convergence theory implies
$Y_{t_0,n} \to Y_{t_0,s}$ both in the mean and with probability one, as $n \to \infty$. Since for
normal processes both the linear and nonlinear predictors coincide, the remark
at the end of the preceding paragraph follows. Thus predictors from finite but
large samples give good (asymptotic) approximations to the solutions $\hat{X}_{t_0,s}$ (or
$Y_{t_0,s}$), and this is important in practical cases. However, the error estimation in
these problems has received very little attention in the literature. In the case of
normal processes certain other methods (e.g., the Kalman filter, etc.) giving an
algorithm to compute the $\hat{X}_{t_0,n}$ sequence are available. But there is no such
procedure as yet for the general second-order processes.
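These finite-past approximations are easy to exhibit concretely: the best linear predictor from $X_s, \ldots, X_{s-n}$ solves the normal equations built from the covariance. The covariance used below is an assumed positive definite example, not one from the text.

```python
import numpy as np

# Finite-past linear prediction via the normal equations R c = r: the MSE
# of the predictor of X_{t0} from X_s, ..., X_{s-n} decreases to its
# full-past limit as n grows.  The covariance r(h) = a^|h| cos(bh)
# (a mixture of Poisson kernels, hence positive definite) is an assumption.

a, b = 0.9, 0.5
cov = lambda h: a ** abs(h) * np.cos(b * h)

s, t0 = 0, 2                                # predict X_2 from X_0, ..., X_{-n}
for n in (0, 1, 2, 5, 10, 40):
    obs = s - np.arange(n + 1)
    R = np.array([[cov(i - j) for j in obs] for i in obs])
    r = np.array([cov(t0 - i) for i in obs])
    c = np.linalg.solve(R, r)               # predictor coefficients
    print(f"n = {n:2d}: prediction MSE = {cov(0) - r @ c:.6f}")
```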
At this point it will be useful to present a class of nondeterministic processes,
belonging to a Karhunen class, which arise quite naturally as solutions of
certain stochastic differential equations. This will also illustrate the remark
made at the end of Section 4.
In some problems of physics, the motion $X_t$ of a simple harmonic oscillator,
subject to random disturbances, can be described by a formal stochastic
differential equation of the form (cf. [3]):

$$\frac{d^2 X(t)}{dt^2} + \beta \frac{dX(t)}{dt} + \omega_0^2 X(t) = A(t) \quad (X(t) = X_t), \qquad (51)$$

where $\beta$ is the friction coefficient and $\omega_0$ denotes the circular frequency of the
oscillator. Here $A(t)$ is the random fluctuation, assumed to be the white
noise, the symbolic (but really fictional) derivative of the Brownian motion. In
some cases, $\beta$ and $\omega_0$ may depend on time. To make (51) realistic, the symbolic
equation should be expressed as:

$$d\dot{X}(t) + a_1(t) \dot{X}(t)\, dt + a_2(t) X(t)\, dt = dB(t), \qquad (52)$$

where the $B(t)$ process is Brownian motion. Thus for each $t > 0$, $B(t)$ is normal
with mean zero and variance $\sigma^2 t$, denoted $N(0, \sigma^2 t)$, and if $0 < t_1 < t_2 < t_3$, then
$B(t_3) - B(t_2)$ and $B(t_2) - B(t_1)$ are independent normal random variables with
distributions $N(0, \sigma^2(t_3 - t_2))$ and $N(0, \sigma^2(t_2 - t_1))$ respectively. Also $\dot{X}(t) = dX(t)/dt$
is taken as a mean square derivative. Then (52) and (51) can be interpreted in the integrated
form, i.e. by definition,

$$\int_a^b f(t) A(t)\, dt = \int_a^b f(t)\, dB(t), \qquad (53)$$

the right side of (53) being a simple stochastic integral which is understood as in
Section 3 (since $B$ is also orthogonally scattered). Here $f$ is a nonstochastic
function. The integration theory if $f$ is stochastic needs a more subtle treat-
ment, and the $B(t)$ process can also be replaced by a 'semi-martingale'. (See,
e.g., [26], Chapters IV and V for details.) The point is that the following
statements have a satisfactory and rigorous justification. With Brownian
motion one can assert more, and, in fact, regarding the solution process of (52),
the following is true.

THEOREM 8.1. Let $J = [a_0, b_0] \subset \mathbb{R}^+$ be a bounded interval, and $\{B_t, t \in J\}$ be
the Brownian motion. If $a_i(\cdot)$, $i = 1, 2$, are real (Lebesgue) integrable functions
on $J$ such that equation (52) is valid, then there exists a unique solution process
$\{X_t, t \in J\}$ satisfying the initial conditions $X_{a_0} = C_1$, $\dot{X}_{a_0} = C_2$, where $C_1, C_2$ are
constants. In fact, the solution is defined by

$$X_t = \int_{a_0}^{t} G(t, u)\, dB(u) + C_1 V_1(t) + C_2 V_2(t), \quad t \in J, \qquad (54)$$

where $V_i(\cdot)$, $i = 1, 2$, are the unique solutions of the accompanying homo-
geneous differential equation:

$$\frac{d^2 f(t)}{dt^2} + a_1(t) \frac{df(t)}{dt} + a_2(t) f(t) = 0 \qquad (55)$$

with the initial conditions $f(a_0) = 1$, $\dot{f}(a_0) = 0$, and $f(a_0) = 0$, $\dot{f}(a_0) = 1$, respec-
tively. In (54), $G: J \times J \to \mathbb{C}$ is the Green function. This is a continuous function
such that $\partial G/\partial t$ is continuous in $(t, s)$ on $a_0 \le t \le s \le b_0$, and has a jump on the
diagonal, i.e.,

$$\frac{\partial G}{\partial t}(s + 0, s) - \frac{\partial G}{\partial t}(s - 0, s) = 1. \qquad (56)$$

Moreover, the $X_t$ process given by (54) is of Karhunen class and is purely
nondeterministic (since the spectral function of $B$ is the Lebesgue measure). Its
covariance $r$ is given by

$$r(s, t) = \int_{a_0}^{\min(s,t)} G(s, \lambda) \overline{G(t, \lambda)}\, d\lambda, \quad s, t \in J, \qquad (57)$$

and the process has multiplicity one.
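To make equation (52) concrete, here is a minimal Euler-Maruyama simulation sketch; the constant coefficients, initial data, and time grid are assumptions chosen purely for illustration, and the discretization is not part of Theorem 8.1.

```python
import numpy as np

# An Euler-Maruyama discretization of (52),
# d(Xdot) + a1 Xdot dt + a2 X dt = dB(t),
# with assumed constant coefficients and X(a0) = C1, Xdot(a0) = C2.

rng = np.random.default_rng(3)
a1, a2 = 0.5, 4.0                 # assumed friction and frequency terms
C1, C2 = 1.0, 0.0
a0, b0, n = 0.0, 10.0, 10000
dt = (b0 - a0) / n

x, v = C1, C2                     # the process and its mean square derivative
for _ in range(n):
    dB = rng.standard_normal() * np.sqrt(dt)
    x, v = x + v * dt, v - (a1 * v + a2 * x) * dt + dB

print("X(b0) along this sample path:", x)
```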



This result shows that the processes appearing as solutions of the (linear)
stochastic differential equations have interesting special properties. Further,
one can show that the vector process $\{(X_t, \dot{X}_t), t \in J\}$ is a (vector) Markov
normal process almost all of whose sample paths are continuous. (For details of
these assertions, see [25], Sec. 4.) Related results for the nth-order case with
continuous, $(n-1)$ times differentiable coefficients $a_i(\cdot)$ and initial conditions
$C_i = 0$ have been analyzed by Dolph and Woodbury [8]. The work exemplifies
the importance of nondeterministic processes of multiplicity one in applications
coming from both the physical sciences and communication theory. Let us now
turn to another type of application.
The general filtering problem can be presented, following Bochner [2], as
follows. Let $\mathcal{X}_T$ be the set of all second-order processes $X = \{X_t, t \in T\}$ with
zero means. Let $A$ be a linear operator on the linear space $\mathcal{X}_T$. Suppose
$X, Y \in \mathcal{X}_T$ and that $X \in$ domain($A$), and

$$AX = Y \quad \text{or} \quad (AX)_t = Y_t, \quad t \in T. \qquad (58)$$

As usual, $T = \mathbb{Z}$ or $= \mathbb{R}$. Typically the $Y$ process is the output and the $X$
process the input, and $A$ is termed a (linear) filter. The problem here is to find
conditions on $A$ such that if the output is known one can recover the input
process. As examples,

$$(AX)_t = \sum_{i=1}^{n} a_i X_{t-i}, \quad t \in \mathbb{Z}, \; a_i \in \mathbb{R},$$

$$(AX)_t = \int_{\mathbb{R}} X_{t-u} f(u)\, du, \quad t \in \mathbb{R}, \; \text{suitable } f.$$

The first one is called a difference or a polynomial or a moving average filter,
and the second one an integral filter; one can also have a difference-
differential or an integro-differential filter and the like. If both $X, Y$ are
stationary processes and $A$ is a polynomial or an integral filter, then precise
conditions for recovering the input were first obtained by Nagabhushanam [19].
His results were extended by Kelsh [15] to the case that $X, Y$ are strongly
harmonizable, and (using necessarily different methods) a further extension to
the weakly harmonizable case was given by Chang. For an exposition of these
results with a numerical illustration, one may refer to [4].
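A toy instance of the recovery problem may clarify the setting. The difference filter below is an assumed example (for the general recovery conditions one must consult [19] and [4]):

```python
import numpy as np

# The difference filter (A X)_t = X_t - a X_{t-1}, |a| < 1, is invertible:
# the input is recovered from the output by the recursion
# X_t = Y_t + a X_{t-1} (equivalently the geometric series sum_j a^j Y_{t-j}).

rng = np.random.default_rng(5)
a, n = 0.7, 200
x = rng.standard_normal(n)              # an arbitrary input realization

y = x.copy()
y[1:] -= a * x[:-1]                     # output Y = A X

x_rec = y.copy()
for t in range(1, n):                   # invert the filter recursively
    x_rec[t] = y[t] + a * x_rec[t - 1]

print("max recovery error:", np.max(np.abs(x_rec - x)))   # essentially 0
```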
The above-described filtering problem changes its character if only a finite
segment of the observations on $\{X_t, t \in T\}$ is available. Assuming a knowledge
of the covariance structure (from prior information) of the process, how can
one estimate an element of the process (prediction or interpolation) which is
not part of the observed ones? While a precise set of conditions is difficult to
obtain, good sufficient conditions can be given for its solution. This point will
be discussed by a specialization and adjustment of the work from [8].
Let $\{X_t, 0 \le t \le 1\} \subset L^2(P)$ be an observed process which is known to have a
linear time trend and a random disturbance. Thus the model is given as

$$X_t = a + bt + Y_t, \quad 0 \le t \le 1, \qquad (59)$$

where $a, b$ are real but unknown constants and $\{Y_t, 0 \le t \le 1\}$ is a noise process
which is assumed to be stationary with mean zero and covariance $r$, given by
$r(s, t) = \exp(-\beta|s - t|)$, $\beta > 0$. The problem is to find an unbiased linear esti-
mator of $X_{t_0}$, $t_0 > 1$, based on the output $\{X_t, 0 \le t \le 1\}$, using the least squares
criterion. This can be made more explicit as follows: it is desired to find a
weight function $w(\cdot)$ on $0 \le t \le 1$ which is of bounded variation such that if

$$\hat{X}_{t_0}(w) = \int_0^1 X_t\, dw(t) \qquad (60)$$

is the linear estimator, then $E(\hat{X}_{t_0}(w)) = a + bt_0$, $a, b \in \mathbb{R}$, and $E(|X_{t_0} - \hat{X}_{t_0}(w)|^2)$
is a minimum. Since $r$ and hence $\beta$ are known, let us take $\beta = 1$ for this
illustration. Then, subject to the unbiasedness constraint, one can minimize the
mean square error using the variational calculus (or Lagrange multipliers) and
show that there exists a weight $w(\cdot)$ having a density $w'$. Thus after a
calculation, one finds $w'$ and $\hat{X}_{t_0}(w)$ to be

$$2w'(t) = \frac{1}{57}\,[56 - 36t_0 + 36t(2t_0 - 1)] \qquad (61)$$

and

$$\hat{X}_{t_0}(w) = \frac{1}{57}\Big\{\int_0^1 X_t\,[28 - 18t_0 + 18t(2t_0 - 1)]\, dt - 2X_1(4 - 27t_0) - 2X_0(27t_0 - 23)\Big\}. \qquad (62)$$

The mean square error for this problem is then

$$\sigma^2 = E(|X_{t_0} - \hat{X}_{t_0}(w)|^2) = [72t_0^2 - 72t_0 + 56]/57. \qquad (63)$$

The actual details of the computations for (61)-(63) involve solving an integral
equation and thus are not entirely simple. It is however interesting to remark
that, in this calculation, one need not find estimators of $a, b$ separately from
the data, and the variational calculus enables a direct solution as indicated.
The problem of the minimum variance unbiased (linear) estimator of $a, b$ is also
important. For instance, if $b = 0$ in (59), then an estimator of the unknown
parameter $a$ (i.e. estimating the mean of the $X$ process without trend) can be
obtained by a similar method. The estimator $\hat{a}$ of the form

$$\hat{a}(w) = \int_0^1 X_t\, dw(t), \qquad E(\hat{a}(w)) = a, \quad a \in \mathbb{R},$$

has been considered by Grenander [10], and the result is

$$\hat{a} = \frac{1}{3}\Big[\int_0^1 X_t\, dt + X_1 + X_0\Big]. \qquad (64)$$

Other estimation methods and their properties are discussed for stationary
error processes in [11].
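A Monte Carlo sketch of (64) follows. The grid simulation of the noise and the trapezoidal integration are assumptions of the sketch, not part of Grenander's derivation; the output illustrates unbiasedness.

```python
import numpy as np

# Monte Carlo check of the estimator (64) for X_t = a + Y_t with
# stationary noise r(s, t) = exp(-|s - t|) on [0, 1]:
# a_hat = (1/3)[X_0 + X_1 + integral of X_t dt] is unbiased for a.

rng = np.random.default_rng(1)
a_true, n_grid, n_rep = 2.0, 500, 2000
dt = 1.0 / n_grid
rho = np.exp(-dt)                           # one-step correlation on the grid

y = np.empty((n_rep, n_grid + 1))
y[:, 0] = rng.standard_normal(n_rep)        # stationary start, unit variance
shocks = rng.standard_normal((n_rep, n_grid)) * np.sqrt(1 - rho ** 2)
for k in range(n_grid):
    y[:, k + 1] = rho * y[:, k] + shocks[:, k]

x = a_true + y
integral = dt * (x.sum(axis=1) - 0.5 * (x[:, 0] + x[:, -1]))  # trapezoid rule
a_hat = (x[:, 0] + x[:, -1] + integral) / 3.0
print("mean of a_hat:", a_hat.mean())       # close to a_true = 2.0
```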

9. Some inference problems with normal processes

In this final section some special inference questions, when the processes are
normal, are briefly discussed to supplement the preceding work.
Recall that a normal process $\{X_t, t \in T\}$ is a collection of random variables
such that each finite subset has a joint normal distribution. Now if $(\Omega, \Sigma, P_i)$,
$i = 1, 2$, are a pair of probability spaces with a common base space $\Omega$, then $P_1$
and $P_2$ are said to be mutually singular or perpendicular (written $P_1 \perp P_2$) if
there is an event $A_0 \in \Sigma$ such that $P_1(A_0) = 0$ and $P_2(A_0) = 1$, and mutually
absolutely continuous or equivalent (written $P_1 \sim P_2$) if both $P_1$ and $P_2$ vanish on
the same class of sets from $\Sigma$. For instance, if $\Omega = \mathbb{R}$, $\Sigma =$ the Borel $\sigma$-algebra,
$P_1 =$ normal and $P_2 =$ Cauchy, then $P_1 \sim P_2$. On the other hand, if $P_1$ is normal
and $P_2$ is Poisson, then $P_1 \perp P_2$. However, if $\Omega = \mathbb{R}^T$, $\Sigma =$ the cylinder $\sigma$-algebra,
and $X_t: \Omega \to \mathbb{R}$ is defined as $X_t(\omega) = \omega(t)$, i.e., the coordinate function, then the
problem of determining when $P_1 \sim P_2$, or $P_1 \perp P_2$, or neither, is not simple.
In the case that both $P_1, P_2$ are normal probability measures on $\Omega = \mathbb{R}^T$, only
the main dichotomy that $P_1 \sim P_2$ or $P_1 \perp P_2$ can occur. This was first established
independently by J. Feldman and J. Hájek in 1958, and later elementary proofs
of this theorem were presented by L. A. Shepp and others. A simplified but
still nontrivial proof of this result with complete details is given in ([27], pp.
212-217).
The statistical problem therefore is to decide, on the basis of a realization,
which one of $P_1, P_2$ is the correct probability governing the process. In the
singular case, this is somewhat easier, but in case $P_1 \sim P_2$, the problem is not
simple. A number of cases had been discussed in [10] before the dichotomy
result was known. The simplest usable condition in the general case is the
following:
Let $P_i$ have the mean and covariance functions $(m_i, r_i)$, written $P(m_i, r_i)$, $i =
1, 2$. Then $P_1 \sim P_2$ iff one has $P(0, r_1) \sim P(0, r_2)$ and $P(m_1, r_1) \sim P(m_2, r_1)$. Thus
$P(m_1, r_1) \sim P(m_2, r_2)$ iff $P(m_1, r_1) \sim P(m_2, r_1) \sim P(m_2, r_2)$. Some applications with
likelihood ratios appear in [25]. This equivalence criterion will now be illus-
trated on a purely nondeterministic normal process of multiplicity one.
trated on a purely nondeterministic normal process of multiplicity one.
If $\{X_t, t \in T\}$ is a normal process with mean zero and covariance $r$, let
$Z_t = m(t) + X_t$, where $m: T \to \mathbb{R}$ is a measurable nonstochastic function, so that
the $Z_t$ process has mean function $m$ and covariance $r$ and is also normal. Let $P$
and $P_m$ be the corresponding probabilities governing them. The mean $m(\cdot)$ is
called admissible if $P \sim P_m$. The set $M_P$ of all admissible means is an interesting
space in its own right. In fact, it is a linear space, carries an inner product, and
with it $M_P$ becomes a Hilbert space attached to the given normal process. (For
an analysis of $M_P$, and the following, see [24].) One shows that $m \in M_P$ iff there
is a unique $Y \in \mathcal{H} = \overline{\mathrm{sp}}\{X_t, t \in T\} \subset L^2(P)$ such that

$$m(t) = E(Y X_t), \quad t \in T, \qquad (65)$$

and then the likelihood ratio $dP_m/dP$ is given by

$$\frac{dP_m}{dP} = \exp\{Y - \tfrac{1}{2} E(|Y|^2)\}. \qquad (66)$$

Using now an abstract generalization of the classical Neyman-Pearson lemma
due to Grenander ([10], p. 210), one can test the hypothesis $H_0: m \equiv 0$ vs. $H_1:
m(t) \not\equiv 0$. The critical region for this problem can be shown to be

$$A_k = \{\omega \in \Omega: Y(\omega) \le k\}, \qquad (67)$$

where $k$ is chosen so that $P(A_k) = \alpha$, the prescribed size of the test (e.g.,
$\alpha = 0.05$ or 0.01). This general result was first obtained by Pitcher [23]. In the
case of nondeterministic processes of multiplicity one, the conditions on
admissible means can be simplified much further. This may be stated, following
Cramér [7], as follows:
Let $T = [a, b]$ and $X_t$ be purely nondeterministic, so that by (44) with
$N = 1$ one has

$$X_t = \int_a^t g(t, \lambda) Z(d\lambda), \quad t \in T, \qquad (68)$$

and that $\mathcal{H} = \overline{\mathrm{sp}}\{X_t, t \in T\} = \overline{\mathrm{sp}}\{Z(A): A \subset T, \text{ Borel}\}$. But $m \in M_P$ iff there
exists a $Y \in \mathcal{H}$ such that (65) holds. In this special case, therefore, $Y$ admits a
representation as

$$Y = \int_a^b h(\lambda) Z(d\lambda), \qquad (69)$$

for some $h \in L^2([a, b], F)$ where $F(A) = E(|Z(A)|^2)$. Suppose that the deriva-
tive $F'$ exists outside a set of Lebesgue measure zero. Since $Z(\cdot)$ has ortho-
gonal increments, (65), (68) and (69) imply

$$m(t) = \int_a^t h(\lambda) g(t, \lambda) F'(\lambda)\, d\lambda, \quad t \in T = [a, b]. \qquad (70)$$

This is the simplification noted above. If $\partial g/\partial t$ is assumed to exist, then (70)
implies that the derivative $m'(t)$ of $m(t)$ also exists. In particular, if the $X_t$ is
the Brownian motion, so that $g = 1$ and $F' = 1$, one gets $m'(t) = h(t)$ (a.e.) and
$h \in L^2([a, b], dt)$ in order that $P_m \sim P$.
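For Brownian motion the likelihood ratio (66) thus becomes concrete: $Y = \int h\, dB$ and $E(|Y|^2) = \int h^2\, dt$. The following Monte Carlo sketch (the grid and the particular $h$ are assumptions) verifies the identity $E_P(dP_m/dP) = 1$.

```python
import numpy as np

# A Monte Carlo sketch of (66) for Brownian motion on [0, 1]: with an
# admissible mean m having m' = h in L^2, Y = int h dB and
# E(|Y|^2) = int h(t)^2 dt, so dP_m/dP = exp{Y - (1/2) int h^2 dt}.

rng = np.random.default_rng(7)
n, n_rep = 500, 5000
dt = 1.0 / n
t = np.arange(n) * dt
h = np.sin(2 * np.pi * t)                   # an assumed square-integrable h

dB = rng.standard_normal((n_rep, n)) * np.sqrt(dt)   # Brownian increments
Y = dB @ h                                           # int h(t) dB(t)
ratio = np.exp(Y - 0.5 * np.sum(h ** 2) * dt)        # dP_m/dP path by path

print("E(dP_m/dP) =", ratio.mean())         # close to 1
```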
There is a corresponding result when $P_1 \sim P_2$ and the $P_i$ are normal but have
different covariances. However, this is more involved. A discussion of this case
from different points of view occurs in the works [35, 33, 7, 25]. (See also the
extensive bibliographies in these papers.) There is a great deal of specialized
analysis for normal processes in both the stationary and general cases. It is thus
clear how various types of techniques can be profitably employed for several
classes of nonstationary processes of second order. Many realistic problems
raised by the above work are of interest for future investigations.

Acknowledgement

This work is prepared with a partial support of ONR Contract No. N00014-84-K-0356.

References

[1] Bhagavan, C. S. K. (1974). Nonstationary Processes, Spectral and Some Ergodic Theorems.
Andhra University Press, Waltair, India.
[2] Bochner, S. (1954). Stationarity, boundedness, almost periodicity of random valued functions.
In: Proc. Third Berkeley Symp. Math. Statist. and Probability, Vol. 2, 7-27. University of
California, Berkeley, CA.
[3] Chandrasekhar, S. (1943). Stochastic problems in physics and astronomy. Rev. Modern Phys.
15, 1-89.
[4] Chang, D. K. (1983). Harmonizable filtering and sampling of time series. UCR Tech. Report
No. 8, 26 pp. (to appear in Handbook of Statistics, Vol. 5).
[5] Chi, G. Y. H. (1971). Multiplicity and representation theory of generalized random processes.
J. Multivariate Anal. 1, 412-432.
[6] Cramér, H. (1951). A contribution to the theory of stochastic processes. In: Proc. Second
Berkeley Symp. Math. Statist. and Probability, 329-339. University of California, Berkeley, CA.
[7] Cramér, H. (1971). Structural and Statistical Problems for a Class of Stochastic Processes. S. S.
Wilks Memorial Lecture, Princeton University Press, Princeton, NJ.
[8] Dolph, C. L. and Woodbury, M. A. (1952). On the relation between Green's functions and
covariances of certain stochastic processes and its application to unbiased linear predictions.
Trans. Amer. Math. Soc. 72, 519-550.
[9] Getoor, R. K. (1956). The shift operator for nonstationary stochastic processes. Duke Math. J.
23, 175-187.
[10] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277.
[11] Grenander, U. and Rosenblatt, M. (1975). Statistical Analysis of Stationary Time Series. Wiley,
New York.
[12] Hida, T. (1960). Canonical representation of Gaussian processes and their applications. Mem.
Coll. Sci. Kyoto Univ., Ser. A, 32, 109-155.
[13] Kampé de Fériet, J. and Frenkiel, F. N. (1962). Correlation and spectra of nonstationary
random functions. Math. Comp. 10, 1-21.
[14] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann.
Acad. Sci. Fenn. Ser. A I Math. 37, 3-79.
[15] Kelsh, J. P. (1978). Linear analysis of harmonizable time series. Ph.D. thesis, UCR Library.
[16] Loève, M. (1948). Fonctions aléatoires du second ordre. A note in P. Lévy's Processus
Stochastiques et Mouvement Brownien, 228-352. Gauthier-Villars, Paris.
[17] Masani, P. (1968). Orthogonally scattered measures. Adv. in Math. 2, 61-117.
[18] Morse, M. and Transue, W. (1956). C-bimeasures and their integral extensions. Ann. Math.
64, 480-504.
[19] Nagabhushanam, K. (1951). The primary process of a smoothing relation. Ark. Mat. 1,
421-488.
[20] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann.
Acad. Sci. Fenn. Ser. A I Math. 591, 1-47.
[21] Parzen, E. (1962). Spectral analysis of asymptotically stationary time series. Bull. Internat.
Statist. Inst. 39, 87-103.
[22] Parzen, E. (1962). Stochastic Processes. Holden-Day, San Francisco, CA.
[23] Pitcher, T. S. (1959). Likelihood ratios of Gaussian processes. Ark. Mat. 4, 35-44.
[24] Rao, M. M. (1975). Inference in stochastic processes--V: Admissible means. Sankhyā Ser. A
37, 538-549.
[25] Rao, M. M. (1978). Covariance analysis of nonstationary time series. In: Developments in
Statistics, Vol. 1, 171-225. Academic Press, New York.
[26] Rao, M. M. (1979). Stochastic Processes and Integration. Sijthoff and Noordhoff, Alphen aan
den Rijn, The Netherlands.
[27] Rao, M. M. (1981). Foundations of Stochastic Analysis. Academic Press, New York.
[28] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351.
[29] Rao, M. M. (1984). Probability Theory with Applications. Academic Press, New York.
[30] Rao, M. M. (1984). The spectral domain of multivariate harmonizable processes. Proc. Nat.
Acad. Sci. U.S.A. 81, 4611-4612.
[31] Rozanov, Yu. A. (1959). Spectral analysis of abstract functions. Theory Probab. Appl. 4,
271-287.
[32] Rozanov, Yu. A. (1967). Stationary Random Processes (English translation). Holden-Day, San
Francisco, CA.
[33] Rozanov, Yu. A. (1971). Infinite Dimensional Gaussian Distributions (English translation).
American Mathematical Society, Providence, RI.
[34] Yaglom, A. M. (1962). An Introduction to the Theory of Stationary Random Functions (English
translation). Prentice-Hall, Englewood Cliffs, NJ.
[35] Yaglom, A. M. (1963). On the equivalence and perpendicularity of two Gaussian probability
measures in function spaces. In: Proc. Symp. Time Series Analysis, 327-346. Wiley, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 311-320

On Non-Stationary Time Series

C. S. K. Bhagavan

1. Introduction

A set of (numerical) observations on a characteristic of interest, collected
over a few successive values of a progressively increasing indexing parameter
called time, is known as a time series. A single characteristic is usually con-
sidered, as it is simple for discussion.
The source for a time series is a stochastic process, an indexed set of random
variables, the index set being usually an infinite set. A time series is thus a
truncated realisation of a stochastic process.
From the point of view of collection of data, time series has the facility and
simplicity of arising from routine collection, incidental to, say, administrative
routine, without resorting to the methods of sampling. The time series now
takes the place of a sample.
Time series being part of stochastic processes, the terminology used for
processes will also be used for time series, indistinguishably.

2. Stationarity

Considerable progress has been made in the analysis of time series under the
basic assumption of stationarity. The stationarity considered is one of a
structural invariance under translations of time. The structures considered are
in two directions. One is regarding the probability setup and the other is
regarding the second moments. The former is known as strict stationarity,
while the latter is known as weak stationarity, the two coinciding over Gaussian
processes. Strict stationarity is occasionally used, while weak stationarity is the
frequently used one. A slight variant of weak stationarity, which requires that
the first moment also is time invariant, is known as wide sense stationarity, a
concept used in inferential problems. Unless otherwise stated, stationarity
normally means weak or wide sense stationarity. It can be formally defined as
follows:


Let $X(t)$ be a stochastic process possessing finite moments of the first and
second order: $m(t)$; $C(t, u)$. If

$$C(t, u) = E[(X(t) - m(t))\overline{(X(u) - m(u))}] = C(u - t),$$

a function of the time lag, where the bar over an expression denotes the
complex conjugate, then $X(t)$ is called a weakly stationary process. If, further,

$$m(t) = E(X(t)) = m,$$

a constant over time, $X(t)$ is referred to as a wide sense stationary (W.S.S.)
process. A non-stationary process means a process that is not necessarily
stationary. Thus the class of stationary processes will be expanded by including
classes of other processes as well. A time series from a non-stationary process
will be naturally called a non-stationary time series.

3. Spectrum

Time series in general are observed to have periodic tendencies. This is due
to the blend of harmonic terms in the characteristic considered. Thus one
well-known aim of time series analysis is the search for hidden periodicities.
The presence of harmonic terms can be detected by an instrument called the
spectrum of the process, the existence of which was established in the case of
stationary processes by Herglotz (1911) and Khintchine (1934) (see also
Grenander and Rosenblatt, 1957), through a result saying that the covariance
can be written as the Fourier-Stieltjes transform

$$C(u - t) = C(k) = \int_W e^{iks}\, dF(s),$$

where $W = [-\pi, \pi]$ or $(-\infty, \infty)$ according as the time parameter is discrete or
continuous, and $F(s)$ is a bounded, non-negative and non-decreasing function.
Here $F(s)$ is known as the spectral function or spectrum of the process. Thus
the concentration is now on the possible jumps of $F(s)$. The jumps of the
spectrum are noted to reflect the periodic nature of the time series.
Using a time series, the device to track the jumps of the spectrum is through
the well-known technique of periodogram analysis, where we plot the function
called the intensity function $I(s)$, or its modifications (see Anderson (1971) in
this regard), against various chosen trial values. The intensity function is of the
form

$$I(s) = \{A^2(s) + B^2(s)\}^{1/2},$$

where

$$A(s) = \frac{2}{n} \sum_{t=1}^{n} X(t) \cos\frac{2\pi t s}{n}, \qquad B(s) = \frac{2}{n} \sum_{t=1}^{n} X(t) \sin\frac{2\pi t s}{n}.$$

The nature of the periodogram is that it runs close to the X-axis except that
there are sudden peaks at points corresponding to the jumps of the spectrum.
Thus the periodogram analysis plays a vital role in time series analysis. Before
passing on, it is to be emphasised that all these considerations are conditioned by
the assumption of stationarity.
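The mechanics can be illustrated in a few lines; the simulated series and the hidden frequency below are assumptions made only for this sketch.

```python
import numpy as np

# Periodogram analysis in miniature: a hidden harmonic plus noise gives
# a sharp peak of the intensity function I(s) at the true frequency.
# A(s) and B(s) follow the definitions above.

rng = np.random.default_rng(0)
n = 512
t = np.arange(1, n + 1)
x = 2.0 * np.cos(2 * np.pi * 60 * t / n) + rng.standard_normal(n)

s = np.arange(1, n // 2)                                  # trial values
A = (2.0 / n) * (x @ np.cos(2 * np.pi * np.outer(t, s) / n))
B = (2.0 / n) * (x @ np.sin(2 * np.pi * np.outer(t, s) / n))
I = np.sqrt(A ** 2 + B ** 2)

print("peak at s =", s[np.argmax(I)])                     # prints 60
```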
Having recognised the importance of the spectrum, one is naturally led to
the question: What happens to these considerations if stationarity is absent? In
other words, one is led to the consideration of 'non-stationary situations'. The
immediate problem one faces here is to restore the concept of spectrum in this
case.
It may be remarked, even at this stage, that when the facility and simplicity
of stationarity is gone, the attempts become diverse and the related aspects
need much further developments. This chapter thus concentrates more on
these spectral aspects, presenting the details in the discrete parameter case,
putting the concept of spectrum thus obtained to the same usage as in the
stationary case.
The generalisation of the spectrum envisaged can fruitfully be achieved if we
first have a look at what we are expecting of the spectral function. Broadly
speaking, the following would be the requirements for a spectral function (see
Loynes, 1968; Nagabhushanam, 1970):
Non-negativity and additivity like mass or energy, unique determinability
from the auto-covariance function, relationship to a meaningful function by
Fourier-Stieltjes transformation, possibility of possessing a jump part; deter-
minability of the spectral transfer function when the process variates undergo a
simple linear transformation, estimability of the spectral density from a single
realisation of the process, and reduction to the usual spectral function when the
process is specialised to be a weakly stationary process.
Loynes (1968) has listed all the requirements for a spectral function of a process
and concluded that when a process is not stationary, there does not seem to exist a
spectral function satisfying all the requirements. Then what can be done seems to
be to define a spectrum of a type that will be suitable to the particular inquiry on
hand. These have broadly developed in two streams: one stream taking a start
from the covariance and the other taking a start from the process representation
(see Cramér and Leadbetter (1967) for process representation).

4. Spectra of non-stationary processes

We shall now review the various spectra considered for non-stationary


processes:

(a) Fano (1950) and Page (1952) have defined spectra based on con-
siderations of Fourier integrals. The spectrum defined by Fano cannot include
stationary processes in an essential way, and that of Page cannot be necessarily
non-negative.
(b) Cramér (1961) has defined

$$F(s) = \int_{-\pi}^{s} \int_{-\pi}^{s} |dh(u, v)|$$

as the spectrum of the harmonisable process of discrete parameter, where
$h(s, r)$ is a function of bounded variation in terms of which the auto-covariance
function of the process has the representation

$$C(t, u) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} e^{ist + iur}\, dh(s, r)$$

(see Loève, 1963).


The function F(s) is now a bounded measure function and thus additive like
mass. Further, when the process is stationary, it reduces to the spectrum as in
the stationary process. This spectrum has been shown to be useful for judging if
the process is purely non-deterministic or not and for linear prediction.
(c) Parzen (1967) has considered real processes for which

$$E(X(t)) = 0$$
and

$$R(k) = \lim_{T\to\infty} \frac{1}{T} \int_0^{T-k} E(X(t) X(t + k))\, dt \quad \text{for } k \ge 0$$

exist finitely for each $k$, and remarks that these may be termed asymptotically
weakly stationary processes, and that a time series $X(t)$, $t \ge 0$, for which there
exists a function $R(k)$ satisfying the above could be said to possess a covariance
function $R(k)$ and a spectrum. He establishes the existence of the spectrum
assuming that: (i) fourth moments of the process exist and are uniformly
bounded; (ii) $(1/T) \int_0^{T-k} X(t) X(t + k)\, dt$ converges in the mean square to $R(k)$
as $T \to \infty$; and (iii) $R(k)$ is continuous.
(d) Herbst (1964) has considered discrete parameter processes $X(t)$ of the
form

$$X(t) = \sum_{j=0}^{\infty} a_j c_{t-j} \varepsilon_{t-j}, \quad t = 0, \pm 1, \pm 2, \ldots,$$

where $\varepsilon(t)$ is a real Gaussian stationary process of identically and in-
dependently distributed random variables, the $a_j$'s being constants and the $c_t$'s being
such that

$$c_t^2 \le Q < \infty \quad \text{for each } t$$

and

$$\lim_{N\to\infty} \frac{1}{N} \sum_{t=1}^{N} c_t^2 = \sigma_c^2, \text{ say} = c_0^2.$$

For these processes, Herbst has defined the spectral density as

$$c_0^2 \,\Big| \sum_{j=0}^{\infty} a_j e^{ijs} \Big|^2.$$

(e) Karhunen (1947) has considered processes with a representation of the
form

$$X(t) - m(t) = \int_W g(t, s)\, dZ(s),$$

where

$$\int_W |g(t, s)|^2\, dF(s) < \infty,$$

$F(s)$ being the spectral measure corresponding to the orthogonal process $Z(s)$.
Priestley (1965) has studied a subclass of these processes for which $g(t, s)$ is a
function changing slowly with time. They are referred to by him as oscillatory
processes or processes with evolutionary spectra, and he has defined the spec-
trum as

$$dF_t(s) = |g(t, s)|^2\, dF(s).$$

Priestley's spectrum has been shown to be useful for linear prediction, filtering
and a test for stationarity, etc. (see Abdrabbo and Priestley (1967) and the
succeeding papers on the subject). Processes with slowly varying spectra have
also been considered by Granger and Hatanaka (1964).
(f) Nagabhushanam and/or Bhagavan (see Nagabhushanam and Bhagavan,
1968; Bhagavan, 1974) obtained a straightforward generalisation of the spec-
trum of the stationary process, by considering the sequence of functions

$$F_n(s) = \frac{1}{2\pi n}\Big[(\pi + s) \sum_{t=1}^{n} c(t, t) + \sum_{\substack{k=-n+1 \\ k \ne 0}}^{n-1} \frac{e^{-iks} - e^{ik\pi}}{-ik} \sum_{u-t=k} c(t, u)\Big]$$

(cf. Grenander and Rosenblatt, 1957); and obtaining the limit function F(s)
existing at every continuity point of the latter when the limits, called averaged
covariances,

$$\lim_{n\to\infty} \frac{1}{n} \sum_{\substack{t,u=1 \\ u-t=k}}^{n} c(t, u) = C(k),$$

exist finitely for each $k$. In the continuous parameter case also, a similar set of
functions and the limit exist. In this setup, it is shown that

$$C(k) = \int_{-\pi}^{\pi} e^{iks}\, dF(s),$$

where F(s) is non-negative, non-decreasing and bounded, and is thus rightfully


called spectrum of the process. F(s) has almost all the properties of a spectrum
listed in Section 3. This spectrum is shown to exist for some cases of branching
processes (Bhagavan, 1974) and some cases of Markov processes (Nagab-
hushanam, 1975, 1976). Further, it is pointed out that when we move from
stationarity, we have the interesting phenomenon, viz., differing directional
spectra, i.e. spectra obtained by the forward or the backward time movements,
as happens in the case of the process

$$X(t) = \begin{cases} X_1 e^{it\pi/2} + X_2 & \text{for } t \ge 0, \\ X_1 e^{-it\pi/2} + X_2 & \text{for } t < 0, \end{cases}$$

where $X_1$ and $X_2$ are random variables with zero means, variances each unity,
and covariance $\tfrac{1}{2}$. In this case, the forward spectrum consists of two jumps of
magnitude unity at $0$ and $\tfrac{1}{2}\pi$, while the backward spectrum consists of again
two jumps of magnitude unity at $0$ and $-\tfrac{1}{2}\pi$ (see Nagabhushanam, 1970).
When the directional spectra coincide, the common spectrum is called the
overall spectrum. The overall spectrum is shown to exist for the harmonisable
process (referred to in (b) above) and is given by

$$F(s) = h(\pi, s) - h(-\pi, -\pi) + J(s),$$

where

$$J(s) = \begin{cases} 0 & \text{for } s < \pi, \\ 2J & \text{for } s = \pi, \end{cases}$$

$J$ standing for the discrete mass of $h$ at $(-\pi, \pi)$.


(g) Kampé de Fériet and Frenkiel (see M. M. Rao, 1976) have considered a
class of non-stationary processes arising in the form of a signal plus noise,
where the signal is a time series with zero mean and periodic covariance and
the noise is a stationary series with zero mean, for which

$$R(k) = \lim_{T\to\infty} \frac{1}{T} \int_{|k|/2}^{T - |k|/2} R(t - k/2;\; t + k/2)\, dt$$

exists. The class of processes for which the above limit exists was termed the K-F
class of processes.
M. M. Rao (1976) studied a more general class of non-stationary processes,
called almost harmonisable processes, arising as

$$X(t) - m(t) = \int_W f(t, s)\, dZ(s),$$

where $f$ is an almost periodic function and the $Z$ process has a covariance of
bounded variation. This class of processes arises as solutions of some difference
and differential equations giving rise to the K-F class of processes. They
include harmonisable processes as a particular case and are closely related to
the oscillatory processes of Priestley (1965). Employing arguments of almost
periodic functions, they are shown to possess the spectrum of (f) above.
(h) Consider the class of processes of the form

$$X(t) - m(t) = \int_W f(t, s)\, dZ(s),$$

where the $Z$ process has covariance of bounded variation and, for each $s$, $f_s(t)$ has
the Fourier relation

$$f_s(t) = \int_W e^{itv}\, dH_s(v) \quad (H_s(v) = H(s, v)),$$

$H$ being a function of bounded variation. This is a general class of processes
including harmonisable processes and oscillatory processes. For this class of
processes, the overall spectrum (in (f)) is shown by Bhagavan and Somaraju
(1979) to exist in a form directly generalising that of the harmonisable proces-
ses mentioned in (f).
Thus the attempts of Parzen, Nagabhushanam and/or Bhagavan, Kampé de
Fériet, M. M. Rao, and Bhagavan and Somaraju go along the line of establish-
ing the existence of the spectrum, while the other attempts at the spectrum are
proposals only. The attempt of Herbst also comes in line with that of Nagab-
hushanam and Bhagavan. The attempt of Nagabhushanam and/or Bhagavan is a
straightforward generalisation of the stationary processes, serves as a common
binding factor and gives a time-free spectral function. Incidentally, it forms a
link between the two approaches mentioned. The common element in these
attempts, however, is to enlist and enlarge the classes of processes for which a
spectrum can be obtained by ensuring that the average second moments exist.

5. Spectrum and ergodic theorems

Another well-developed area in time series concerns the convergence of the sample mean

$$\frac{1}{n}\sum_{t=1}^{n} X(t)$$

in the case of a discrete parameter, and

$$\frac{1}{T}\int_0^T X(t)\, dt$$

in the continuous parameter case. The results on the convergence of these averages to constants are of particular interest. Further, the particular modes of convergence, mean square and almost sure, have received good attention as they give consistent estimation of parameters. Here we shall
review those that have a bearing on the spectrum.
Again starting with the stationary case, we have the well-known result that

$$\frac{1}{n}\sum_{t=1}^{n} \bigl(X(t) - m(t)\bigr)$$

converges in mean square to a random variable whose mean is zero and whose variance equals the possible jump of the spectrum at the origin. This result is known as the Statistical Ergodic Theorem (see Grenander and Rosenblatt, 1957).
Thus the lack of a spectral jump at the origin forms a necessary and sufficient
condition for the average considered to converge to zero. The form in which
this result could be readily extended to the general non-stationary processes by
Nagabhushanam and Bhagavan (1969) is that the lack of a spectral jump at the
origin is a sufficient condition for the mean square convergence of the averages
to zero. However, a direct generalisation of the Statistical Ergodic T h e o r e m in
respect of classes of processes having a process representation has been
achieved. These classes of processes are harmonizable processes (see Nagab-
hushanam, 1969; Bhagavan, 1974) and their further generalisation considered
in (11) of Section 4 (see Bhagavan and Somaraju, 1979). Incidentally, analogous
results are valid for oscillatory processes also, as they are a particular case of
the class of processes considered in (h).
Coming to the question of almost sure convergence of the averages considered, the result for stationary processes involving the spectrum says (see Loève, 1963) that the differentiability of the spectrum in a neighbourhood of the origin implies the almost sure convergence considered. This result has been extended by Bhagavan and Somaraju (1982), in respect of the class of non-stationary processes considered in (h), by requiring that the spectrum be differentiable and the derivative be a function of bounded variation in a neighbourhood of the origin.
When these results are to be interpreted as giving consistent estimators, the
parameters they are estimating need some clarification. This matter is trivial for wide sense stationary processes, as the parameter there is the constant mean. But in respect of other processes there is need to introduce a corresponding parameter, called the pooled mean, defined as

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} m(t) \qquad \Bigl(\text{or } \lim_{T\to\infty} \frac{1}{T}\int_0^T m(t)\, dt\Bigr).$$

This pooled mean has been shown to exist in respect of processes having
process representation, like stationary processes and harmonisable processes
(see Bhagavan, 1974). Thus when the pooled mean exists, the convergence of
the sample mean is to the pooled mean.
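As a simple illustration (a simulation sketch under assumed models: a periodic mean and stationary AR(1) noise, neither taken from the text), the sample mean of X(t) = m(t) + N(t) settles at the pooled mean of m:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
t = np.arange(1, n + 1)
m = 2.0 + np.cos(np.pi * t / 6)        # periodic mean; its pooled mean is 2.0

noise = np.empty(n)                    # stationary AR(1) noise, zero mean,
noise[0] = rng.standard_normal()       # hence no spectral jump at the origin
for k in range(1, n):
    noise[k] = 0.7 * noise[k - 1] + rng.standard_normal()

x = m + noise
print(np.mean(x))                      # close to the pooled mean 2.0
```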

6. Concluding remarks

The concept of spectrum of stationary processes having been generalised in a straightforward manner, further efforts have to be in the direction of enlisting more and more classes of processes, thus enlarging the presently known classes of processes possessing a spectrum. The various aspects of estimation of the spectrum of these processes form another area in which work can progress; the conventional methods of spectral estimation presumably carry over in a simple way. Progress can also be made on several aspects of the spectral representation, including a process representation.

References

[1] Abdrabbo, N. A. and Priestley, M. B. (1967). On the prediction of non-stationary processes. J. Roy. Statist. Soc. Ser. B 29, 570-585.
[2] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[3] Bhagavan, C. S. K. (1974). Non-Stationary Processes, Spectra and Some Ergodic Theorems. Andhra University Press, Waltair, India.
[4] Bhagavan, C. S. K. and Somaraju, P. (1978). A mean ergodic theorem for Karhunen type processes. Presented at the Seminar on Probability and Statistical Inference, Poona, India.
[5] Bhagavan, C. S. K. and Somaraju, P. (1979). Spectrum of a class of non-stationary processes. Presented at the Seminar on Probability and Statistical Inference, Poona, India.
[6] Bhagavan, C. S. K. and Somaraju, P. (1982). An individual ergodic theorem for a class of non-stationary processes. Presented at the Seminar on Probability and Statistical Inference, Poona, India.
[7] Cramér, H. (1961). On some classes of non-stationary stochastic processes. In: Proc. Fourth Berkeley Symp., Vol. 2, 57-78. University of California Press, Berkeley, CA.
[8] Cramér, H. and Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes. Wiley, New York.
[9] Fano, R. M. (1950). Short time auto-correlation functions and power spectra. J. Acoust. Soc. Amer. 22, 546-550.
[10] Granger, C. W. J. and Hatanaka, M. (1964). Spectral Analysis of Economic Time Series. Princeton University Press, Princeton, NJ.
[11] Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. Wiley, New York.
[12] Herbst, L. J. (1964). Spectral analysis in the presence of variance fluctuations. J. Roy. Statist. Soc. Ser. B 21, 354-360.
[13] Herglotz, G. (1911). Über Potenzreihen mit positivem reellem Teil im Einheitskreis. Ber. Verh. Kgl. Sächs. Ges. Wiss. Leipzig 63, 501-511.
[14] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fenn. Ser. A I 37 (Helsinki).
[15] Khintchine, A. I. (1934). Korrelationstheorie der stationären stochastischen Prozesse. Math. Ann. 109, 604-615.
[16] Loève, M. (1963). Probability Theory, 3rd ed. Van Nostrand, New York.
[17] Loynes, R. M. (1968). On the concept of the spectrum for non-stationary processes. J. Roy. Statist. Soc. Ser. B 30, 1-30.
[18] Nagabhushanam, K. (1969). Mean ergodic theorem for second moment stationary processes. Abstracts, Statistics Section, Proceedings of the Indian Science Congress.
[19] Nagabhushanam, K. (1970). Spectral concepts in non-stationary processes. Unpublished proceedings of the Second Mastech Conference on Probability and Statistics, Madras, India.
[20] Nagabhushanam, K. (1975). Mean ergodic theorem for an irreducible time homogeneous Markov chain with a finite number of states. Current Science.
[21] Nagabhushanam, K. (1976). Some ergodic theorems for non-stationary Markov processes. Lecture to the Symposium on Advanced Probability and Statistics at Annamalainagar, India.
[22] Nagabhushanam, K. and Bhagavan, C. S. K. (1968). Non-stationary processes and spectrum. Can. J. Math. 20, 1203-1206.
[23] Nagabhushanam, K. and Bhagavan, C. S. K. (1969). A mean ergodic theorem for a class of non-stationary processes. Sankhyā Ser. A 31, 421-424.
[24] Page, C. H. (1952). Instantaneous power spectra. J. Appl. Phys. 23, 103-106.
[25] Parzen, E. (1967). Time Series Analysis Papers, Papers 10, 11. Holden-Day, San Francisco, CA.
[26] Priestley, M. B. (1965). Evolutionary spectra and non-stationary processes. J. Roy. Statist. Soc. Ser. B 27, 204-237.
[27] Rao, M. M. (1976). Covariance analysis of some non-stationary time series. Technical Report AFFDL TR 76, 126, Aug. 1974-Aug. 1976. (Cf. Developments in Statistics, Vol. 1 (1978) 171-225, Academic Press, New York.)
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 321-336

12

Harmonizable Filtering and Sampling of Time Series

Derek K. Chang*

1. Introduction

In engineering, economics and some other fields, we usually face the following types of problems.
(1) With the information obtained from an observation of a phenomenon in the past and present, we wish to predict it at a future time with the best accuracy in some prescribed sense.
(2) The data one gets in observing a phenomenon are almost always an approximation of the real data, due to the error caused in the process of observation, for instance in the measurement of a quantity. It is then desired to estimate the original values by utilizing the observed data with the best accuracy in a certain sense.
(3) With a given input time series and a linear filter, such as an electric circuit, a unique output series can be generated. In case the filter is known, and only the output series is observed, it is desired to recover the input series, and to determine whether the solution to this inversion problem is unique, and whether the filter is physically realizable, that is, whether the input series at any time can be expressed solely in terms of the observations of the output series made in the past and present, but not the future.
To treat these problems, we need to establish some mathematical and statistical models of time series. Then we can describe these problems precisely, and provide some solutions.
Let the triple (Ω, Σ, P) denote a probability space, where Ω is a sample space, Σ is the σ-algebra of all events under consideration, and P is a probability measure on Σ. Let C stand for the complex numbers. For any complex valued random variable X: Ω → C, which is Σ-measurable by definition, the expectation of X is denoted by E(X) = ∫_Ω X dP. The Hilbert space H = L²(Ω, Σ, P) of all the (equivalence classes of) random variables with finite second moments can be constructed as usual, with an inner product defined by $(f, g) = E(f\bar g)$; the overbar denotes complex conjugation. Let ‖·‖₂ denote the

*Prepared under the ONR Contract No. N00014-79-C-0754 (Modification No. P00003).


corresponding norm of H. For convenience, let L₀²(P) stand for the subset of functions in L²(Ω, Σ, P) with means zero.
Let X = {X(t), t ∈ T} denote an indexed set of random variables. If T is the set Z of integers, X is called a discrete parameter time series. If T is the set R of real numbers, X is called a continuous parameter time series. If every X(t) has a finite second moment, X is called a second-order time series. If in addition X is of mean zero, i.e. E(X(t)) = 0 for all t ∈ T, the covariance function r: T × T → C is defined as $r(s, t) = E(X(s)\overline{X(t)})$ for s, t ∈ T.
In order to study the analytical properties of second-order time series with means zero, we first classify these series according to their covariance structure as follows. A list of more classes of time series can be found in [26] and [2].
as follows. A list of m o r e classes of time series can be f o u n d in [26] and [2].
1. If the value of the covariance function r of X d e p e n d s only on the
difference of its two arguments, i.e. if r(s, t) = r(s + h, t + h) for all s, t, h ~ T,
then X is called a weakly stationary time series. It is k n o w n that in this case we
have the representation

r(s, t) = fo ei(~-')" d/x ( u ) , (1)

w h e r e D = R for T = N, the reals, D - [ 0 , 2 7 ) for T = / / , the integers, and # is


a b o u n d e d , m o n o t o n e increasing n o n n e g a t i v e function on D, called the spectral
function of X. It is k n o w n that every weakly stationary time series X has a
stochastic integral r e p r e s e n t a t i o n

t"
X(t) = JD eitA dZ(A), t E T, (2)

w h e r e D is as above, Z: D ~ L2o(P) is a vector valued function with finite


semi-variation and with o r t h o g o n a l increments, i.e. for any a~ < a 2 < a 3 < a4,

E((Z(a2)- Z(al))(Z(a4)- Z(a3))) = O,

and the integral is in the sense of D u n f o r d and Schwartz (see [10], p. 323).
N o t e that without the restriction that the time series X is of m e a n s zero, we
have the slightly different concepts of weak and wide sense stationarity (see
[2]). T h e y agree u n d e r our hypotheses.
2. If X is not necessarily weakly stationary, but its covariance function admits a representation

$$r(s, t) = \iint_{D\times D} e^{i(su - tv)}\, d\mu(u, v), \qquad (3)$$

where μ is a complex valued, positive definite function of two variables with finite Vitali variation |μ|_V on D × D, defined as

$$|\mu|_V = \sup\Bigl\{\sum_{i,j=1}^{N} \bigl|\mu(t_i, t'_j) - \mu(t_{i-1}, t'_j) - \mu(t_i, t'_{j-1}) + \mu(t_{i-1}, t'_{j-1})\bigr|\Bigr\}, \qquad (4)$$

then X is called a strongly harmonizable time series [17]. It was proved by Loève [17] that every strongly harmonizable time series X also has a stochastic integral representation (2), where D and Z are as before, except that Z need not have orthogonal increments as in the weakly stationary case, but satisfies certain other, more general conditions.
It is easy to see that if μ concentrates on the diagonal, i.e. if μ(s, t) = 0 for s ≠ t, then r(s, t) depends only on the difference s − t, so that X is weakly stationary. Thus the class of strongly harmonizable time series is an extension of the class of weakly stationary time series. Again μ is called the spectral function of X.
3. If the function μ in (3) is not necessarily of finite Vitali variation, but is of finite Fréchet variation, defined as

$$|\mu|_F = \sup\Bigl\{\Bigl|\sum_{i,j=1}^{N} a_i \bar b_j \bigl[\mu(t_i, t'_j) - \mu(t_{i-1}, t'_j) - \mu(t_i, t'_{j-1}) + \mu(t_{i-1}, t'_{j-1})\bigr]\Bigr| : |a_i| \le 1,\ |b_j| \le 1,\ a_i, b_j \in C\Bigr\}, \qquad (5)$$

then X is called a weakly harmonizable time series with μ as its spectral function [26]. By the definitions, |μ|_F ≤ |μ|_V, and it is known that |μ|_F < ∞ does not imply that |μ|_V < ∞ [7], so that the class of weakly harmonizable time series is an extension of the class of strongly harmonizable time series. It was proved in [26] that every weakly harmonizable time series also has a stochastic integral representation (2), where Z: D → L₀²(P) is merely a vector valued function which, however, always has a finite semi-variation, defined in ([10], p. 321). Note that the right side of the integral representation (2) is actually the Fourier transform of a vector valued function Z. The minimum requirement on Z for the integral in (2) to exist in the sense of Dunford and Schwartz is that Z must be of finite semi-variation. Thus by relaxing the restrictions on Z, we extended weakly stationary series to strongly harmonizable and then to weakly harmonizable series, and no more extension is possible if the representation (2) is to hold.
We may also consider multidimensional time series. If X(t) = (X₁(t), …, X_p(t))^tr, t ∈ T, is a p-dimensional column vector of random variables, where tr indicates the transpose, then X is called a p-dimensional series. If for any p-vector w = (w₁, …, w_p) of complex numbers, the corresponding time series

$$Y(t) = \sum_{j=1}^{p} w_j X_j(t), \qquad t \in T,$$

is weakly stationary, then X is called a p-dimensional weakly stationary time


series. Similarly, we have p-dimensional strongly and weakly harmonizable
time series defined in the same way.
The concept of weakly stationary time series was introduced in the 1930s. Since the stationarity assumption is reasonably close to reality in many areas of practice, and since powerful and elegant mathematical tools are available for the study of weakly stationary time series, a considerable amount of important work has been done in this field. In the case of strongly harmonizable time series, the situation becomes more complicated. Not every result for weakly stationary time series has a corresponding counterpart for strongly harmonizable series. However, the standard analysis, such as measure theory and integration theory, is still of essential importance there. It is known that for a weakly stationary input series {X(t), t ∈ T} and a bounded linear projection L, the resulting output series Y(t) = LX(t) is not necessarily strongly harmonizable (see [13], p. 183 and [26], p. 301). This shows that the classes of weakly stationary as well as strongly harmonizable time series are not large enough to be closed under linear transformations such as arise in general filtering problems (see below). This is indeed a shortcoming of the structure of these two classes.
The weakly harmonizable time series were first studied under different names by Bochner [3] and Rozanov [28] in the 1950s. More work in this area can be found in [21], [19] and [26]. The class of weakly harmonizable time series is an enlargement of the classes mentioned above, and it is closed under bounded linear operators [26]. This closure property makes this class more attractive for some practical applications than the other two. However, we can no longer apply many known facts of standard analysis as in the previous two cases, since some new mathematical concepts are involved, and we still do not have a well-rounded theory of these processes for an analysis. It should be noticed that some results can be extended from the strongly to the weakly harmonizable case, but the same proofs cannot be used, and a different approach is often necessary.

2. The linear filtering problem

We now recall the definition of a linear filter for our work here. A linear filter L is a mapping L: X ↦ Y or LX = Y, where X = {X(t), t ∈ T} and Y = {Y(t), t ∈ T} are two time series with T = Z or R, such that: (i) for any a, b ∈ C and time series X and Y, L(aX + bY) = aL(X) + bL(Y); (ii) for any h ∈ T, with the time series X_h defined by X_h(t) = X(t + h), t ∈ T,

$$(LX_h)(t) = (LX)(t + h), \qquad t \in T.$$

Condition (i) says that L is linear, and condition (ii) says that L commutes with translations on the T axis. In the case that all the time series under consideration are in L₀²(P), a linear filter does not have to be bounded, i.e. ‖LX(t)‖₂/‖X(t)‖₂ need not be bounded for t ∈ T. For instance, one can have differential filters which need not be bounded. A study of the general concept of a linear filter on a second order homogeneous time series on a globally symmetric index set can be found in Yaglom [29] and Hannan [12], which extends the work of Masani [18] where the index set is the real line. However, they will not be considered here.
Next we discuss the problem mentioned at the beginning of Section 1. Let X = {X(n), n ∈ Z} be a time series, and let L be a polynomial filter, also called a moving average filter, as follows:

$$LX(n) = \sum_{j=0}^{N} a_j X(n - j), \qquad n \in Z, \qquad (6)$$

where a₀, …, a_N are constants. If Y = {Y(n), n ∈ Z} is such that Y(n) = LX(n) for all n ∈ Z, Y is called an output series, and X is called an input series. The inversion problem is to solve the equation Y = LX for the series X with given Y and L. Let f be the function defined by

$$f(t) = \sum_{j=0}^{N} a_j e^{ijt}, \qquad 0 \le t < 2\pi. \qquad (7)$$

f is called the spectral characteristic of the polynomial filter L defined by (6),


and it is also called the 'frequency response' in the engineering literature.
The following result is from [20]. Let Y be a given weakly stationary time series with a spectral function μ, and let L be a polynomial filter defined by (6). Let O denote the set of zeros of f in the interval W = [0, 2π), i.e. O = {t ∈ W: f(t) = 0}. Then there exists a weakly stationary time series X = {X(n), n ∈ Z} such that Y(n) = LX(n) for all n ∈ Z if and only if the following two conditions are satisfied:

$$\text{(i)}\quad \int_O d\mu(t) = 0, \qquad (8)$$

$$\text{(ii)}\quad \int_{W-O} |f(t)|^{-2}\, d\mu(t) < \infty. \qquad (9)$$

If both conditions are satisfied, then the solution series X to the inversion problem is unique if and only if O is empty. If O is not empty, there exists only one solution belonging to the closed span of all functions square integrable relative to μ under the L² norm and whose spectral function ν satisfies the condition ∫_O dν(t) = 0. Furthermore, if all the roots of the characteristic polynomial

$$P(t) = \sum_{j=0}^{N} a_j t^j$$

are outside the unit circle {t ∈ C: |t| = 1}, then the filter L is physically realizable. Actually, with

$$\frac{1}{P(t)} = \sum_{n=0}^{\infty} b_n t^n$$

being the Taylor series expansion, we have the expression

$$X(m) = \sum_{n=0}^{\infty} b_n Y(m - n) \quad \text{for all } m \in Z.$$

However, if some roots are inside and none is on the unit circle, then L is not physically realizable. In this case, the function 1/P(t) has a Laurent series expansion $\sum_{n=-\infty}^{\infty} b_n t^n$, and the future values of Y are clearly involved in the expression

$$X(m) = \sum_{n=-\infty}^{\infty} b_n Y(m - n) \quad \text{for all } m \in Z.$$
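To make the realizability mechanism concrete, the following minimal sketch (invented coefficients; the helper taylor_coeffs is hypothetical) computes the b_n by equating coefficients in P(t) · (1/P(t)) = 1 and checks that the root of P lies outside the unit circle:

```python
import numpy as np

def taylor_coeffs(a, n_terms):
    # From P(t) * (1/P(t)) = 1: a_0 b_0 = 1 and sum_j a_j b_{n-j} = 0, n >= 1.
    b = np.zeros(n_terms)
    b[0] = 1.0 / a[0]
    for n in range(1, n_terms):
        s = sum(a[j] * b[n - j] for j in range(1, min(n, len(a) - 1) + 1))
        b[n] = -s / a[0]
    return b

a = np.array([1.0, -0.5])                  # P(t) = 1 - 0.5 t, root t = 2
print(np.abs(np.roots(a[::-1])))           # modulus 2 > 1: realizable case
print(taylor_coeffs(a, 5))                 # 1, 0.5, 0.25, ... = 0.5**n
```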

In case both X and Y are of continuous parameter, the results described below are the analogs of the above.
One defines an integral filter L on a continuous parameter time series X = {X(t), t ∈ R} as follows:

$$LX(t) = \int_R g(u) X(t - u)\, du, \qquad t \in R, \qquad (10)$$

where g is a Lebesgue integrable weight function over R, and the integral on the right side of (10), with a vector integrand, is in the sense of Bochner [9]. The spectral characteristic F associated with L is now taken as the Fourier transform of g, i.e.

$$F(t) = \int_R e^{-itu} g(u)\, du, \qquad t \in R. \qquad (11)$$
With L and F just defined and with W = R, all the results listed above for polynomial filters still hold, except the last part concerning the physical realizability of L. For the unbounded linear filter defined by a difference-differential operator, and also for some more general filters, similar results have been obtained by the author in [5]. In the case when both input and output series X and Y are strongly harmonizable, the corresponding work was done by Kelsh [15].
Since the spectral function μ of the output series Y is now defined on R × R, the necessary and sufficient conditions (8) and (9) for the existence of a solution X, which is also strongly harmonizable, to the equation Y = LX with L defined by (6) should be replaced by

$$\text{(i)}'\quad |\mu|_t(O \times O) = 0, \qquad (12)$$

$$\text{(ii)}'\quad \iint_{O^c \times O^c} \frac{1}{|f(u)f(v)|}\, d|\mu|_t(u, v) < \infty, \qquad (13)$$

where |μ|_t(u, v) is the total variation of μ on the rectangle (−∞, u) × (−∞, v), and |μ|_t(O × O) is that on the set O × O. The uniqueness part is the same as in the weakly stationary case, except that the condition ∫_O dν(t) = 0 should be replaced by |ν|_t(O × O) = 0, where ν is the spectral function of X. Finally, the physical realizability part is identical with that in the previous case. A similar result holds for the integral filters defined by (10).
It is easy to see that if μ concentrates on the diagonal, then (i)' and (ii)' reduce to (i) and (ii). In the multidimensional case, the work becomes more involved. For the multidimensional polynomial filters, Kelsh [15] solved the problem for a subclass of multidimensional strongly harmonizable time series, the class of factorizable spectral measure series, which contains all the multidimensional weakly stationary series as a proper subset. Since the frequency characteristic matrix may be singular, the generalized inverse of a matrix introduced by Moore and Penrose [23] played an important role there.
In the case when Y is weakly harmonizable, the spectral function μ of Y is only of finite Fréchet variation. The above condition (ii)' can no longer be used, since |μ|_t(u, v) will generally be infinite. Instead, we use the following definition of a reproducing kernel Hilbert space associated with the spectral function (cf. [1]).
Let f(·,·): R × R → C be a positive definite function. Let H be the set of linear combinations of the functions f(a, ·): R → C, i.e. H = sp{f(a, ·), a ∈ R}. For any two elements

$$g = \sum_{j=1}^{n} b_j f(a_j, \cdot), \qquad h = \sum_{k=1}^{m} c_k f(a'_k, \cdot)$$

in H, define an inner product by

$$(g, h) = \sum_{j=1}^{n} \sum_{k=1}^{m} b_j \bar c_k f(a_j, a'_k).$$

Let H* be the closure of H under (·,·). Then H* is called the reproducing kernel Hilbert space associated with the function f, and f is called its reproducing kernel. Now we regard f as a vector valued function, a ↦ f(a, ·) ∈ H*, for a ∈ R. A theory of integration of a scalar function with respect to a vector integrator is available (see [10], p. 323). Let L¹(f, H*) be the set of all functions g: R → C which are integrable relative to f: R → H* in this sense.
Now let Y be a weakly harmonizable time series with spectral function μ. Let L be a linear filter, either a polynomial or an integral filter, with spectral characteristic f. The necessary and sufficient conditions for the existence of a weakly harmonizable time series X satisfying LX = Y are as follows:

(i)'' |μ|_F(O × O) = 0,
(ii)'' χ_{O^c}/f ∈ L¹(μ, H*),

where χ_{O^c} is the indicator of the complement of the zero set O. For the uniqueness and the physical realizability, we have the same situation as before. Similar results hold for certain multidimensional filters (cf. [5]). It should be noted that the above conditions, and hence the result for weakly harmonizable time series, are not the same as those for the strongly harmonizable time series. However, both agree when specialized to the stationary case.

3. Optimal signal estimation

Next we consider the problem of filtering a signal from noise. Let S = {S(t), t ∈ R} and N = {N(t), t ∈ R} be the signal and noise time series respectively, and let X = {X(t), t ∈ R} be the output series which consists of both signal and noise, i.e. X = S + N. We only consider the continuous parameter case here. If the signal S and the noise N are not observable, but the output X is observable, it is desired to estimate S by a best linear filter L operating on X in the mean square error sense, i.e. with the error ‖S(t) − LX(t)‖₂ minimized for t ∈ R. We assume that the series S and N are of the same type, and some conditions on their spectral functions are needed for obtaining the corresponding results.
In the case that the noise series N is Gaussian, work on this problem can be found in [14] and [22]. In the weakly stationary case, the problem was treated by Grenander [11], as follows. Let S, N and X be weakly stationary. Then X has the integral representation

$$X(t) = \int_R e^{it\lambda}\, dZ(\lambda),$$

where Z: R → L₀²(P) is a vector valued function with orthogonal increments. Let μ_S and μ_N be the spectral functions of the series S and N. If μ_S and μ_N are

absolutely continuous, i.e. if the derivatives f_S(t) = dμ_S(t)/dt and f_N(t) = dμ_N(t)/dt exist, and if S and N are uncorrelated, i.e.

$$r_{S,N}(u, v) = E(S(u)\overline{N(v)}) = 0, \qquad u, v \in R,$$

then the best linear estimate Ŝ in the mean square error sense, which is also called an optimal filter, is given by the formula

$$\hat S(t) = \int_R \frac{f_S(\lambda)}{f_S(\lambda) + f_N(\lambda)}\, e^{it\lambda}\, dZ(\lambda), \qquad t \in R.$$

If the integrand, which is called a response function or a filter function, can be approximated by an expansion

$$\frac{f_S(\lambda)}{f_S(\lambda) + f_N(\lambda)} = \lim_{M\to\infty} \sum_{k=1}^{M} c_k e^{i a_k \lambda},$$

then we have

$$\hat S(t) = \sum_{k=1}^{\infty} c_k X(t + a_k).$$

In general, if the signal and the noise are not necessarily uncorrelated, but are weakly stationarily correlated, i.e.

$$r_{S,N}(u, v) = r_{S,N}(u - v) = \int_R e^{i(u-v)\lambda}\, d\mu_{S,N}(\lambda), \qquad u, v \in R,$$

and if the function μ_{S,N}: R → C is absolutely continuous, so that one has

$$r_{S,N}(u - v) = \int_R e^{i(u-v)\lambda} f_{S,N}(\lambda)\, d\lambda,$$

the solution to the filter problem is given by the expression

$$\hat S(t) = \int_R \frac{f_S(\lambda) + f_{S,N}(\lambda)}{f_S(\lambda) + f_N(\lambda) + 2\,\mathrm{Re}(f_{S,N}(\lambda))}\, e^{it\lambda}\, dZ(\lambda), \qquad t \in R. \qquad (14)$$

Note that in the above results, one has to assume that all the spectral functions and the cross-spectral function are absolutely continuous. If this is not the case, the results become more complicated.
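The response function can be illustrated numerically. The sketch below (the signal, the noise level and the assumed spectral shapes are all invented for the example) applies a response of the form f_S/(f_S + f_N) through the discrete Fourier transform, as a rough approximation to the integral representation above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4096
t = np.arange(n)
signal = np.sin(0.05 * t) + 0.5 * np.sin(0.12 * t)   # low-frequency signal
x = signal + rng.normal(size=n)                      # X = S + N, white noise

freqs = np.fft.rfftfreq(n)                           # cycles per sample
f_s = np.where(freqs < 0.03, 10.0, 0.0)              # assumed signal spectrum
f_n = np.full(freqs.size, 0.5)                       # assumed noise level
response = f_s / (f_s + f_n)                         # f_S / (f_S + f_N)
s_hat = np.fft.irfft(response * np.fft.rfft(x), n)
print(np.mean((s_hat - signal) ** 2),                # filtered error is much
      np.mean((x - signal) ** 2))                    # smaller than the raw one
```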
When the series S, N and X are of Cramér class as defined in [8], which contains the class of all strongly harmonizable time series, and when S and N are uncorrelated, similar results were obtained in [27]. Without assuming that S and N are uncorrelated, Kelsh [15] considered the same problem for multidimensional Cramér class series, and obtained the corresponding result using the technique essentially due to Rao [27]. For one-dimensional strongly harmonizable series S, N and X, Kelsh's result can be stated as follows.
Let μ_S, μ_N: R × R → C be the spectral functions of S and N, μ_{S,N}: R × R → C be the cross-spectral function, and let μ*_{S,N}: R × R → C be defined by $\mu^*_{S,N}(u, v) = \overline{\mu_{S,N}(v, u)}$ for u, v ∈ R. Then the optimal filter is

$$\hat S(t) = \int_R F(\lambda)\, dZ(\lambda), \qquad t \in R, \qquad (15)$$

where F: R → C is a solution to the set of integral equations

$$\iint_{R\times R} F(u)\, e^{-isv}\, d(\mu_S + \mu_N + \mu_{S,N} + \mu^*_{S,N})(u, v) = \iint_{R\times R} e^{itu - isv}\, d(\mu_S + \mu_{S,N})(u, v)$$

for all s ∈ R. In general, it is not easy to solve this system of integral equations analytically. However, if the spectral functions μ_S, μ_N and μ_{S,N} are absolutely continuous, expression (15) can be reduced to an explicit form as in (14).

4. Sampling a harmonizable process

Next we discuss the sampling problem for continuous parameter time series. When we study a time series in practice, it is sometimes physically difficult or economically undesirable to observe the whole series. It is then required to sample it at only finitely many times, and to estimate the original series from the observed samples. Sampling theorems are very important in many fields in practice, such as communication and information theory. The following result is called the Kotel'nikov-Shannon formula, and is an abstraction of a classical (nonstochastic) result due to Cauchy [4].
If X = {X(t), t ∈ R} is a weakly stationary time series with spectral function μ which is supported by a bounded interval (−1/(2h), 1/(2h)), h > 0, i.e. μ is constant on (−∞, −1/(2h)] and on [1/(2h), ∞), then

$$X(t) = \operatorname*{l.i.m.}_{N\to\infty} \sum_{n=-N}^{N} X(nh)\, \frac{\sin[\pi(t - nh)/h]}{\pi(t - nh)/h}, \qquad t \in R, \qquad (16)$$

where the convergence on the right side of (16) is in the sense of mean square. This formula gives a periodic sampling theorem, where one observes the time series at the periodic points t = nh, −N ≤ n ≤ N, and the random variable X(t) at any time t can be estimated by (16) by taking N large enough.
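A small numerical sketch of (16) follows (the band-limited path is simulated here as a random sum of low-frequency cosines, an assumption made only for this illustration); the truncated series approaches X(t):

```python
import numpy as np

rng = np.random.default_rng(2)
h = 1.0                                    # sampling period; band (-1/2, 1/2)
lam = rng.uniform(-0.4, 0.4, size=5)       # frequencies inside the band
amp = rng.standard_normal(5)

def x(t):
    return np.sum(amp[:, None] * np.cos(2 * np.pi * lam[:, None] * t), axis=0)

t = np.linspace(-3.3, 3.3, 7)
n = np.arange(-2000, 2001)
sinc_sum = x(n * h) @ np.sinc((t[None, :] - n[:, None] * h) / h)
print(np.max(np.abs(sinc_sum - x(t))))     # small truncation error
```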
If X is weakly stationary but its spectral function μ is not necessarily supported by a compact set, then the following results of Lloyd [16] hold.
If μ has an open support S such that the sets {S + n/h}, n ∈ Z, are mutually disjoint, i.e. no two values in S differ by an integer multiple of 1/h, then

$$X(t) = \operatorname*{l.i.m.}_{N\to\infty} \sum_{n=-N}^{N} (1 - |n|/N)\, X(nh)\, K(t - nh), \qquad t \in R, \qquad (17)$$

where

$$K(t) = h \int_S e^{2\pi i \lambda t}\, d\lambda, \qquad t \in R,$$

and l.i.m. denotes convergence in mean square. If S is a finite union of intervals, or if sup_{−∞<t<∞} |tK(t)| < ∞, then

$$X(t) = \operatorname*{l.i.m.}_{N\to\infty} \sum_{n=-N}^{N} X(nh)\, K(t - nh), \qquad t \in R. \qquad (18)$$

The problem of sampling for strongly harmonizable time series has been studied by Rao, Piranashvili and Pourahmadi (cf. [27], [24] and [25]). The formula (16) was obtained by Piranashvili [24] for strongly harmonizable time series whose spectral function has a bounded support in R × R. The formulas (17) and (18) have been extended to strongly harmonizable time series by Rao [27] and Pourahmadi [25]. Let X be a strongly harmonizable time series with spectral function μ, and let ν be the function defined by ν(s) = μ((−∞, s) × R), s ∈ R. If ν has an open support S such that the sets {S + n/h}, n ∈ Z, are mutually disjoint, then (17) holds. If S is a finite union of intervals, or if sup_{−∞<t<∞} |tK(t)| < ∞, then (18) holds.
In the case where X = {X(t), t ∈ R} is weakly harmonizable with spectral function F, we have the following result due to Chang and Rao [6]. Given any ε > 0, if there exists a bounded Borel set A (= A_ε) ⊂ R such that

$$\iint_{B\times B} |dF(u, v)| \le \varepsilon \quad \text{for all } B \subset A^c \ (B \text{ a Borel set}),$$

and if σ₀ = the diameter of A, then for any h < π/σ₀ one has an N (= N_{t,ε}) such that

$$\Bigl\| X(t) - \sum_{n=-N}^{N} X(nh)\, \frac{\sin[\pi(t - nh)/h]}{\pi(t - nh)/h} \Bigr\| \le 2C(t)\bigl[(\pi - h\sigma_0)N\bigr]^{-1} + \varepsilon,$$
where 0 < C(t) < ∞ is bounded for t in bounded sets. If the spectral function F has a bounded support, then we can set ε = 0. The above formula actually gives an estimate of the error in the mean square sense.

5. A numerical illustration

In this section, we give a numerical example that deals with the data of a nonstationary time series. The data are from a tape that contains digitized acoustic data of the time series output for one beam of a multibeam sonar. The time series is primarily dominated by energy emitted by a transiting merchant ship, and it contains significant energy over a broad range of frequencies. The tape was very kindly provided by Dr. D. F. Gingras of the Naval Ocean System Center at San Diego, California. Since the data are used solely for the purpose of illustration, no specific details will be discussed.
A set of 4000 data points is read in from the tape and stored in a vector Y of real numbers. Let a₀, …, a₆ be the real coefficients to be determined. For each
8 ≤ n ≤ 4000, we define the 'error'

$$\varepsilon_n = Y(n) - \sum_{j=0}^{6} a_j Y(n - j - 1).$$

Minimizing the sum $\sum_{n=8}^{4000} \varepsilon_n^2$ with respect to the $a_j$'s, we obtain the values of the filter coefficients a₀, …, a₆. Correct to three decimal places, these are as follows:

a_0 = 0.852, a_1 = -0.463, a_2 = 0.227,
a_3 = -0.087, a_4 = 0.007, a_5 = 0.002, a_6 = -0.100.
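The least squares step may be sketched as follows (with synthetic stand-in data, since the sonar tape itself is not reproduced here, the fitted coefficients are of course near zero for white noise; the point is only the computation, with the lag structure of the error defined above):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(4000)              # stand-in for the tape data

# rows: n = 8..4000 (1-based); columns: Y(n-1), ..., Y(n-7)
Y_lags = np.column_stack([y[6 - j : 3999 - j] for j in range(7)])
target = y[7:4000]                         # Y(8), ..., Y(4000)
a, *_ = np.linalg.lstsq(Y_lags, target, rcond=None)
print(a)                                   # filter coefficients a_0, ..., a_6
```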

With the given sequence Y as the output series, and with the coefficients a_j obtained as above, we can consider the filtering problem

$$Y(n) = LX(n) = \sum_{j=0}^{6} a_j X(n - j),$$

for n ≥ 7, where X is an unknown input series. Note that if we assume that Y(n) = 0 for all n ≤ 0, the problem becomes quite simple. This is not assumed here.
The characteristic polynomial of the filter L is of the form $P(t) = \sum_{j=0}^{6} a_j t^j$. The roots t₁, …, t₆ of P can also be computed. These are as follows:

t_1 = 1.295, t_3, t_4 = 0.501 ∓ 1.357i,
t_2 = -1.746, t_5, t_6 = 0.739 ∓ 1.118i.



Table 1

X 56.94 46.50 17.34 37.26 51.16 62.02 63.24 49.35
Y 36.00 26.00 6.00 31.00 27.00 33.00 28.00 18.00

X 25.51 -24.86 -24.45 5.87 -9.74 -8.53 -25.60 -62.52
Y 4.00 -29.00 -13.00 -7.00 -15.00 -7.00 -22.00 -40.00

X -26.09 -12.47 -28.30 -26.92 -42.74 -53.19 -34.82 -19.68
Y 4.00 -10.00 -18.00 -10.00 -27.00 -23.00 -10.00 -8.00

X -21.52 -19.68 11.98 31.81 28.22 15.39 1.51 19.35
Y -10.00 -6.00 20.00 24.00 17.00 8.00 0.00 19.00

X 51.61 32.46 26.11 48.21 6.16 -25.86 7.92 16.36
Y 33.00 5.00 17.00 29.00 13.00 -18.00 11.00 1.00

X -14.99 -36.44 -21.63 -22.06 -19.53 12.98 3.82 -43.91
Y 19.00 -26.00 -7.00 -13.00 9.00 15.00 -4.00 -31.00

X 37.65 35.71 -27.98 28.84 0.38 41.05 42.43 -29.48
Y -10.00 -21.00 -10.00 -18.00 9.00 35.00 23.00 -32.00

X -54.60 -24.72 -25.92 -32.84 -2.06 1.89 -20.87 -56.99
Y -24.00 3.00 -20.00 -21.00 5.00 0.00 -11.00 -36.00

X -29.33 46.23 63.53 51.70 63.24 57.73 21.57 42.92
Y -1.00 45.00 31.00 27.00 42.00 32.00 5.00 30.00

X 45.59 16.25 8.15 42.99 -75.97 -60.67 43.61 -63.65
Y 13.00 -4.00 0.00 -46.00 -46.00 -31.00 -27.00 -43.00

X -68.65 -62.08 -52.70 -25.59 -5.88 -10.89 5.56 18.39
Y -35.00 -28.00 -19.00 0.00 4.00 -2.00 17.00 17.00

X -16.33 -54.96 13.74 29.98 42.14 49.10 43.92 24.88
Y -15.00 -33.00 9.00 22.00 23.00 28.00 23.00 14.00

X 34.20 24.15 2.77 -0.71 30.88 45.78 15.79 1.18
Y 25.00 4.00 -7.00 -4.00 21.00 22.00 -4.00 -1.00

X -8.23 -59.02 -86.41 -64.37 -0.74 9.18 9.82 -33.10
Y -8.00 -47.00 -51.00 -32.00 13.00 9.00 5.00 -23.00

X -26.27 20.79 32.36 29.50 21.57 10.95 -56.82 -70.72
Y 2.00 28.00 15.00 15.00 9.00 12.00 -38.00 -40.00

X -24.37 10.38 17.36 14.21 61.42 51.27 11.06 30.26
Y -3.00 6.00 8.00 9.00 54.00 24.00 1.00 26.00

X 50.84 32.36 10.57 36.31 58.69 22.91 -30.18 -50.69
Y 26.00 9.00 -3.00 24.00 32.00 -3.00 31.00 -32.00

X -16.49 -31.08 -45.90 -50.45 -40.44 -21.23 -3.85 -7.21
Y 0.00 -31.00 -30.00 -30.00 16.00 -2.00 3.00 -3.00

X -17.05 -16.91 12.31 38.81 41.67 42.31 37.32 20.33
Y -6.00 -3.00 19.00 27.00 22.00 25.00 20.00 8.00

X -8.89 24.73 77.68 80.61 21.77 -25.42 -46.07 -57.54
Y -13.00 23.00 47.00 35.00 -7.00 -22.00 -28.00 -37.00

X -66.98 -58.49 -45.03 22.81 12.46 -6.08 -33.97 -12.72
Y -46.00 -36.00 -24.00 -4.00 20.00 -7.00 -15.00 8.00

Table 1 (Continued)

X 45.63 85.60 95.99 66.43 29.03 -49.77 -48.37 21.58
Y 42.00 54.00 52.00 28.00 12.00 -47.00 -21.00 19.00

X 69.26 43.30 26.50 5.22 -28.48 -52.19 -13.25 -6.87
Y 33.00 7.00 13.00 1.00 -19.00 -34.00 -16.00 -5.00

X 20.25 13.12 -6.64 6.62 35.84 41.95 -8.41 -42.37
Y 15.00 2.00 -4.00 15.00 28.00 22.00 -21.00 27.00

X -54.06 -22.13 16.36 1.63 -36.77 -57.53 -57.13 -26.83
Y -31.00 -3.00 12.00 -11.00 -26.00 -29.00 -25.00 -4.00

Since all these roots lie outside the unit circle, the filter L is physically realizable. To compute the values of the sequence X, we need to expand the rational function 1/P using the Taylor series method. With the coefficients b₀, b₁, … thus determined, we can use the formula

$$X(m) = \sum_{n=0}^{\infty} b_n Y(m - n)$$

to obtain the input series X (see Fig. 1a, b).

Fig. 1a. Output Series Y.

Fig. 1b. Input Series X.

The first 24 b's, correct to three decimal places, are as follows:

b_0 = 1.177, b_6 = 0.136, b_12 = 0.018, b_18 = 0.003,
b_1 = 0.640, b_7 = 0.147, b_13 = 0.025, b_19 = 0.004,
b_2 = 0.033, b_8 = 0.049, b_14 = 0.015, b_20 = 0.003,
b_3 = -0.032, b_9 = -0.003, b_15 = 0.003, b_21 = 0.001,
b_4 = 0.030, b_10 = 0.003, b_16 = 0.000, b_22 = 0.000,
b_5 = 0.019, b_11 = 0.008, b_17 = 0.002, b_23 = 0.000.

A set of two hundred values (from the same data records) of X and Y, correct to two decimal places, is given in Table 1, and the graphs of both series X and Y with these values are plotted in Fig. 1a, b for comparison.

References

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68, 337-404.
[2] Bhagavan, C. S. K. On non-stationary time series. This volume, Chapter 11.
[3] Bochner, S. (1956). Stationarity, boundedness, almost periodicity of random valued functions. In: Proc. Third Berkeley Symp. Math. Statist. and Probability, Vol. 2, 7-27. University of California Press, Berkeley, CA.
[4] Cauchy, A.-L. (1841). Mémoire sur diverses formules d'analyse. C. R. Acad. Sci. Paris 12, 283-298.
[5] Chang, D. K. (1983). Bimeasures, harmonizable processes and filtering. Ph.D. Dissertation, University of California, Riverside, CA.
[6] Chang, D. K. and Rao, M. M. (1983). Bimeasures and sampling theorems for weakly harmonizable processes. Stochastic Anal. Appl. 1, 21-55.
[7] Clarkson, J. A. and Adams, C. R. (1933). On definitions of bounded variation of two variables. Trans. Amer. Math. Soc. 35, 824-854.
[8] Cramér, H. (1951). A contribution to the theory of stochastic processes. In: Proc. Second Berkeley Symp. Math. Statist. and Probability, 329-339. University of California Press, Berkeley, CA.
[9] Diestel, J. and Uhl, J. J. Jr. (1977). Vector Measures, Mathematical Surveys No. 15. American Mathematical Society, Providence, RI.
[10] Dunford, N. and Schwartz, J. T. (1958). Linear Operators, Part I: General Theory. Interscience, New York.
[11] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277.
[12] Hannan, E. J. (1967). The concept of a filter. Proc. Cambr. Phil. Soc. 63, 221-227.
[13] Helson, H. and Lowdenslager, D. (1958). Prediction theory and Fourier series in several variables. Acta Math. 99, 165-202.
[14] Kallianpur, G. (1959). A problem in optimum filtering with finite data. Ann. Math. Statist. 30, 659-669.
[15] Kelsh, J. P. (1978). Linear analysis of harmonizable time series. Ph.D. Dissertation, University of California, Riverside, CA.
[16] Lloyd, S. P. (1959). A sampling theorem for stationary (wide sense) stochastic processes. Trans. Amer. Math. Soc. 92, 1-12.
[17] Loève, M. (1963). Probability Theory, 3rd ed. Van Nostrand, New York.
[18] Masani, P. (1965). The normality of time-invariant, subordinative operators in Hilbert space. Bull. Amer. Math. Soc. 71, 546-550.
[19] Miamee, A. G. and Salehi, H. (1978). Harmonizability, V-boundedness, and stationary dilations of stochastic processes. Indiana Univ. Math. J. 27, 37-50.
[20] Nagabhushanam, K. (1950). The primary process of a smoothing relation. Ark. Mat. 1, 421-488.
[21] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann. Acad. Sci. Fenn. Ser. A I 591, 1-47 (Helsinki).
[22] Parzen, E. (1962). Extraction and detection problems and reproducing kernel Hilbert spaces. J. SIAM Control Ser. A 1, 35-62.
[23] Penrose, R. A. (1955). A generalized inverse for matrices. Proc. Cambr. Phil. Soc. 51, 400-413.
[24] Piranashvili, Z. A. (1967). On the problem of interpolation of stochastic processes. Theory Prob. Appl. 12, 647-657.
[25] Pourahmadi, M. (1980). On subordination, sampling theorem and 'past and future' of some classes of second-order processes. Ph.D. Dissertation, Michigan State University.
[26] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351.
[27] Rao, M. M. (1967). Inference in stochastic processes, III. Z. Wahrsch. Verw. Gebiete 8, 49-72.
[28] Rozanov, Yu. A. (1959). Spectral theory of abstract functions. Theory Prob. Appl. 4, 271-287.
[29] Yaglom, A. M. (1961). Second order homogeneous random fields. In: Proc. Fourth Berkeley Symp. Math. Statist. and Probability, Vol. 2, 593. University of California Press, Berkeley, CA.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 337-362

13

Sampling Designs for Time Series*

Stamatis Cambanis

1. Introduction

In practice, a time series (or more generally a random field) is observed only
at a finite number of appropriately chosen points, which constitute the samp-
ling design, and based on these observations an estimate or a statistic is formed
for use in the problem at hand. How to select the most appropriate choice of
sampling points is the problem of sampling design. The statistician may be free
to choose any desirable points, or else part of the sampling mechanism may be
imposed on the statistician who then controls only certain parameters, e.g.
periodic sampling is imposed where the period is controlled by the statistician,
or Poisson sampling (at the times of occurrence of a Poisson stream of events)
is imposed but the statistician has control over its rate. With such constraints,
i.e. within certain classes of sampling designs, or with no constraints, i.e. among
all sampling designs, how can the statistician choose the best design of a given
sample size, or how can the statistician determine the sample size of a certain
kind of design required to achieve a given performance? These questions will
be considered in the context of three specific problems of interest involving
time series: the estimation of a weighted average of a random quantity, the
estimation of regression coefficients, and the detection of signals in noise.
These three problems have a great deal in common, and specifically their
sampling design questions are essentially the same.
The setup here differs in two important ways from the classical setup. All observations are taken from a fixed (interval) region A and so, especially for large sample sizes, it is not realistic to assume lack of correlation; hence the observations form a correlated time series. Also, repeated sampling at the same point is not allowed, and only one realization of the time series is available; i.e. only simple designs are considered, in the terminology of Pázman (1977).
We consider both deterministic and random sampling designs, where either

*This research was supported under the Air Force Office of Scientific Research Grant No. F49620 82
C 0009.


optimal estimators and sufficient statistics are employed, or much simpler


estimators and statistics are employed instead.
Finding optimal designs of a given sample size turns out to be a very difficult
task, which can be accomplished only for certain specific covariance structures
on certain sufficiently simple sampling designs, such as simple random samp-
ling. Finding sampling designs, which for large sample size perform like the
best designs, is an easier task, which can be accomplished for broad classes of
covariance structures and different designs, such as unconstrained, median
and stratified, the latter two in fact using the simpler kind of estimators and
statistics.
There is a vast literature on designs with uncorrelated errors. In sharp
contrast, the literature on sampling designs for time series is rather limited.
Expressions for mean square errors for various kinds of deterministic and
random sampling designs with correlated errors or in a time series setup are
given in Cochran (1946), Quenouille (1949), Zubrzycki (1958), and Tubilla
(1975). The question of finding optimal, and asymptotically optimal, sampling
designs begins with the fundamental work of Sacks and Ylvisaker (1966, 1968,
1970a, 1970b) who resolved in a series of papers the case of deterministic
designs using optimal estimators. Their work was continued by Hfijek and
Kimeldorf (1974), Wahba (1971, 1974), and Eubank, Simth and Smith (1981,
1982a, 1982b). Median sampling and random sampling designs were con-
sidered by Schoenfelder (1978, 1982), Schoenfelder and Cambanis (1982), and
Cambanis and Masry (1983). While the picture is reasonably, but by no means
fully, complete for one-dimensional sampling, the case of multivariate sampling
designs is in its infancy as the work of Ylvisakar (1975) indicates.
Throughout, {X(t), t ∈ A} will be a time series defined over the time interval A of length |A|, with covariance function R(s, t), which is assumed continuous and strictly positive definite. When T = {t₁, …, t_n} ⊂ A we will write X_T for the vector (X(t₁), …, X(t_n)), R_T for the n × n matrix {R(t_i, t_j)}_{i,j=1}^n, and similarly f_T = (f(t₁), …, f(t_n)) for a function f(t) defined on A. We will consider a function f of the form

$$f(t) = \int_A R(t, s)\phi(s)\, ds, \qquad t \in A, \qquad (1.1)$$

where φ is a continuous function on A, and we will put

$$s^2 = \int_A \int_A R(s, t)\phi(s)\phi(t)\, ds\, dt. \qquad (1.2)$$

For simplicity, we will sometimes write double integrals of this form as ∬Rφφ. The centered process X(t) has a quadratic mean derivative on A if and only if its covariance function R(s, t) is differentiable on A × A (and similarly for higher order derivatives). Reference will be made to the reproducing kernel Hilbert space of a covariance function R, RKHS(R). The relevant facts can be found in Parzen (1967), but no essential knowledge is required here. For the reader's convenience, we mention a few relevant properties. Any function f of the form (1.1) (in fact with φ simply square integrable over A) belongs to RKHS(R), and the expression in (1.2) is its norm in RKHS(R). The reproducing kernel Hilbert space norm of f_T ∈ RKHS(R_T) is the familiar expression

$$\|f_T\|^2 = f'_T R_T^{-1} f_T, \qquad (1.3)$$

and of course RKHS(R_T) = Rⁿ. In fact, f ∈ RKHS(R) if and only if the supremum of (1.3) taken over all finite subsets T of A is finite, and the value of that supremum is the RKHS norm of f. When R(s, t) = min(s, t) and A = [0, b], the RKHS(R) consists of all functions f which vanish at zero and are absolutely continuous with square integrable derivative: f(t) = ∫₀ᵗ g(u) du, 0 ≤ t ≤ b, where ∫₀ᵇ g²(u) du < ∞; the RKHS norm is given by ‖f‖² = ∫₀ᵇ g²(u) du; and

$$f'_T R_T^{-1} f_T = \frac{f^2(t_1)}{t_1} + \sum_{k=1}^{n-1} \frac{[f(t_{k+1}) - f(t_k)]^2}{t_{k+1} - t_k}. \qquad (1.4)$$

Similar simple expressions are available in the more general case of factorable R, in the sense that R(s, t) = u(s)v(t) for s ≤ t, and also when R is stationary with rational spectral density (but with more complicated expressions).
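The identity (1.4) is easy to check numerically; in the sketch below the sample points and the function f are arbitrary choices made for the illustration:

```python
import numpy as np

t = np.array([0.2, 0.5, 0.7, 1.3, 2.0])            # sample points in A
f = np.sin(t)                                      # any function values f_T
R = np.minimum.outer(t, t)                         # R(s,t) = min(s,t)
quad = f @ np.linalg.solve(R, f)                   # f'_T R_T^{-1} f_T
tele = f[0] ** 2 / t[0] + np.sum(np.diff(f) ** 2 / np.diff(t))
print(quad, tele)                                  # the two agree
```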

2. The time series setup

We introduce three important problems involving time series, which we will follow throughout. Even though these problems look different on the surface, we will see that the design problems associated with them are essentially the same.

2.1. Estimation of integrals of random quantities

Here we want to estimate the random integral

$$I = \int_A X(t)\phi(t)\, dt \qquad (2.1)$$

from observations of the random quantity X(t) at n sample points T = {t_k}_{k=1}^n, and we assume that the nonrandom function φ is known. We restrict attention to linear estimates with weights c'_T = (c_{T,1}, …, c_{T,n}):

$$I_T = \sum_{k=1}^{n} c_{T,k} X(t_k) = c'_T X_T, \qquad (2.2)$$
whose mean square approximation error is

$$e_T^2 = E(I - I_T)^2 = s^2 - 2c'_T f_T + c'_T R_T c_T. \qquad (2.3)$$

If the random process X could be observed over the entire interval A, rather than at n sample points, then of course the random integral I could be evaluated precisely, i.e. e_A² = 0. Our goal therefore is to choose the sampling points T and the estimator weights c_T in such a way that the resulting mean square estimation error e_T² is as close to zero as possible.

2.2. Estimation of regression coefficients

Consider the linear regression model where an observation taken at t has the form

$$X(t) = \beta f(t) + N(t), \qquad t \in A, \qquad (2.4)$$

where f is a known regression function, β an unknown regression parameter, and the observation errors are correlated, in fact in such a manner that the random process N has mean zero and covariance function R(s, t). In contrast with the classical case, where observation errors are assumed uncorrelated and repeated observations at each t are allowed, here we assume correlated observation errors and we allow only one observation at each point t (corresponding to one realization of the observation process).
If observations over the entire interval A were available, then the minimum variance linear unbiased estimator β̂_A of β has variance

$$\operatorname{Var} \hat\beta_A = \frac{1}{\|f\|^2} = \frac{1}{s^2}, \qquad (2.5)$$

so that β can be estimated precisely if and only if the regression function f does not belong to the reproducing kernel Hilbert space of R. In other words, when the regression function is less smooth than the observation noise (f ∉ RKHS(R)), then the regression coefficient can be determined precisely from observations over the entire interval. When the regression function is of comparable smoothness with the observation noise, f ∈ RKHS(R), then the variance of β̂_A is positive. Here we consider the simpler case where the regression function f is even slightly smoother, namely of the form given in (1.1).
Our problem then is to estimate β linearly from observations at n sample points T = {t_k}_{k=1}^n, using appropriate weights c'_T = (c_{T,1}, …, c_{T,n}), by

$$\hat\beta_T = \sum_{k=1}^{n} c_{T,k} X(t_k) = c'_T X_T. \qquad (2.6)$$
The bias and variance of such an estimator are

$$\operatorname{Bias} \hat\beta_T = \beta(c'_T f_T - 1), \qquad \operatorname{Var} \hat\beta_T = c'_T R_T c_T, \qquad (2.7)$$

and our goal is therefore to choose the sampling points T and the estimator weights c_T in such a way that the resulting bias and variance are as close to 0 and to s⁻² as possible.

2.3. Detection of signals in noise

Consider the problem of detecting a known (nonrandom) signal f(t) in additive Gaussian noise N(t), in other words the problem of distinguishing between the two hypotheses H₀ and H₁, that the observation at each t consists of noise only or of signal plus noise:

$$H_0{:}\ X(t) = N(t), \qquad H_1{:}\ X(t) = f(t) + N(t), \qquad t \in A. \qquad (2.8)$$

The noise process is assumed to be Gaussian with mean zero and covariance function R(s, t).
If observations over the entire interval A were available, then the presence of the signal could be detected with probability one if and only if it is less smooth than the noise, in the sense that it does not belong to the reproducing kernel Hilbert space of R. Here we consider the more interesting case where the smoothness of the signal is comparable to that of the noise (f ∈ RKHS(R)), and in fact we consider the simpler case where the signal f is of the form (1.1). In this case the sufficient statistic is

$$S_A = \int_A X(t)\phi(t)\, dt, \qquad (2.9)$$

which under H₁ is N(s², s²) and under H₀ is N(0, s²), and the probability P_d of detection (i.e. of deciding H₁ when H₁ is true) at false alarm rate α (i.e. when the probability of deciding H₁ when H₀ is true is α) is

$$P_d(S_A) = \Phi[s - \Phi^{-1}(1 - \alpha)], \qquad (2.10)$$

where Φ is the standard normal distribution function.


When observations are available over n sample points T = {t_k}_{k=1}^n and a (linear) statistic is used with weights c'_T = (c_{T,1}, …, c_{T,n}):

$$S_T = \sum_{k=1}^{n} c_{T,k} X(t_k) = c'_T X_T, \qquad (2.11)$$

the resulting probability of detection at false alarm rate α is

$$P_d(S_T) = \Phi\Bigl[\frac{c'_T f_T}{(c'_T R_T c_T)^{1/2}} - \Phi^{-1}(1 - \alpha)\Bigr], \qquad (2.12)$$

which is of course smaller than P_d(S_A), and our goal is therefore to choose the sampling points T and the weights c_T in such a way that the resulting probability of detection P_d(S_T) is as close as possible to P_d(S_A), the probability of detection of the optimal detector which uses observations over the entire interval.
Another measure of performance of a test based on the statistic S_T is the generalized signal-to-noise ratio, defined by

$$\mathrm{SNR}(S_T) = \frac{\{E(S_T \mid H_1) - E(S_T \mid H_0)\}^2}{\operatorname{Var}(S_T \mid H_0)}. \qquad (2.13)$$

For S_A given by (2.9) we have

$$\mathrm{SNR}(S_A) = s^2,$$

and for S_T as in (2.11) we have

$$\mathrm{SNR}(S_T) = \frac{(c'_T f_T)^2}{c'_T R_T c_T}, \qquad (2.14)$$

so that

$$P_d(S_T) = \Phi[\mathrm{SNR}^{1/2}(S_T) - \Phi^{-1}(1 - \alpha)]. \qquad (2.15)$$

Thus, when the noise is Gaussian, there is a very simple relationship between the signal-to-noise ratio and the probability of detection. The signal-to-noise ratio is used as a performance criterion instead of the probability of detection whenever the latter cannot be easily computed, e.g. when the noise is not Gaussian, or when the noise is Gaussian but the test statistic is not, because random sampling is used.

3. The sampling designs

Two kinds of sampling schemes are considered, deterministic and random. In deterministic sampling, the sampling points are chosen according to a deterministic rule, such as periodic sampling. In random sampling, the sampling points are chosen according to a randomized rule. There are several reasons for considering random sampling schemes: they are easier to analyze, they provide bounds to the performance of the best deterministic designs, and in certain cases they are imposed by the very nature of the information gathering process, such as when an observation can be taken only at the time a certain event occurs, or when there are imperfections in deterministic sampling schemes, e.g. periodic sampling with jitter.

3.1. Deterministic sampling designs

The most commonly used deterministic sampling scheme is periodic sampling. For each n, a periodic sample of size n consists of appropriate percentiles of the uniform density over the interval A. Generalizing this to a, not necessarily uniform, density h positive over the interval A = [a, b], we obtain for each n a sample of size n consisting of all (n − 1)th percentiles of h:

$$T_n = \{t_{ni}\}_{i=1}^{n} \quad \text{where} \quad \int_a^{t_{ni}} h(t)\, dt = \frac{i-1}{n-1}, \qquad i = 1, 2, \ldots, n.$$

Such a sequence {T_n} of sampling designs T_n of size n is called a regular sequence of designs generated by the density h, or simply regular sampling. When h is the uniform density, regular sampling becomes periodic sampling. Using the medians of a regular sequence of designs generated by the density h, we obtain a sequence of sampling designs

$$T_n = \{m_{ni}\}_{i=1}^{n} \quad \text{where} \quad \int_a^{m_{ni}} h(t)\, dt = \frac{2i-1}{2n}, \qquad i = 1, 2, \ldots, n$$

(i.e. m_{ni} is the median of (t_{n+1,i}, t_{n+1,i+1}) with respect to h), called a sequence of median designs generated by the density h, or simply median sampling.

3.2. Random sampling designs


In random sampling schemes, the points T = { t l , . . . , t,} of a sample of size n
form a random vector whose distribution is determined by the scheme and
which is independent of the random observation process X.
Since in this case the sampling points are themselves random, the expressions
for e zT , Bias fiT, Var fiT, and Pd(Sr) in Section 2 are no longer applicable--but
represent conditional values given a realization of the design.
The simplest random sampling design is simple random sampling where for
each sample size n, the n sampling points t 1. . . . . tn are chosen independently at
random according to a density h (allowing the sampling density h to depend on
the sample size n turns out to offer no advantages).
In stratified sampling, for each sample size n, the interval A is first par-
titioned into n strata \{A_{nk}\}_{k=1}^n and sampling points \{t_{nk}\}_{k=1}^n are chosen in-
dependently, one from each stratum, so that each t_{nk} is chosen at random from
A_{nk} according to a density h_{nk}. An example of stratified sampling is periodic
sampling with jitter density h: for each sample size n, the interval A = [0, 1] (for
simplicity) is partitioned into the equispaced interval strata A_{nk} = [(k − 1)/n, k/n]
with midpoints τ_{nk} = (2k − 1)/(2n), and the sampling point t_{nk} is chosen from the
stratum A_{nk} according to the density h_{nk}(t) = n h[n(t − τ_{nk})], where h is the
normalized jitter density, defined on the interval (−1/2, 1/2) and symmetric
about 0; in other words, instead of the periodic sampling points τ_{nk}, the sampling
mechanism chooses the points τ_{nk} + ε_{nk}, where the errors ε_{nk} are independent and
their normalizations n ε_{nk} have density h.
In systematic sampling, the sampling point t_{n1} is chosen at random from
the stratum A_{n1} according to the density h_{n1}, and each t_{nk} (k = 2, \ldots, n) is the
point in A_{nk} homologous to t_{n1} in A_{n1}. Thus any one of the n sampling points in
systematic sampling determines all the others.
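
The two stratified schemes just described are equally simple to simulate. The following sketch is ours (numpy assumed; the function names and seed are illustrative), drawing a jittered periodic sample and a systematic sample of size n on A = [0, 1].

    import numpy as np
    rng = np.random.default_rng(0)

    def jittered_design(n, jitter):
        # Periodic sampling with jitter: midpoints tau_nk = (2k-1)/(2n) are
        # perturbed by e_nk, where the normalized errors n*e_nk have density
        # 'jitter', symmetric about 0 on (-1/2, 1/2).
        tau = (2 * np.arange(1, n + 1) - 1) / (2 * n)
        return tau + jitter(n) / n

    def systematic_design(n):
        # A single uniform draw in the first stratum fixes all n points.
        return rng.uniform(0, 1 / n) + np.arange(n) / n

    uniform_jitter = lambda n: rng.uniform(-0.5, 0.5, size=n)
    print(jittered_design(10, uniform_jitter))
    print(systematic_design(10))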

4. The estimators

The natural choice for the coefficients c_T of the estimators I_T and β̂_T and of
the test statistic S_T is that which renders their performance optimal for the
given sample T of size n. We are also interested in considering simpler
coefficients which result in performance comparable to that obtained with
optimal coefficients, but which are substantially easier to compute.

4.1. Optimal coefficients


The optimal coefficients c_T for a fixed sample T = \{t_k\}_{k=1}^n of size n are as
follows for each of the problems under consideration.

Estimation of integrals of random quantities


The optimal coefficients are those which minimize the mean square ap-
proximation error ε_T^2 of (2.3). They are given by c_T = R_T^{-1} f_T, and thus the
optimal estimator and its mean square approximation error are

Î_T = f_T' R_T^{-1} X_T   (4.1)

and

ε_T^2 = s^2 − f_T' R_T^{-1} f_T ≡ e_T^2 .   (4.2)

Estimation of regression coefficients


The minimum variance linear unbiased estimator β̂_T of β is

β̂_T = \frac{f_T' R_T^{-1} X_T}{f_T' R_T^{-1} f_T} = Î_T (f_T' R_T^{-1} f_T)^{-1}   (4.3)

with variance

Var β̂_T = \frac{1}{f_T' R_T^{-1} f_T}   (4.4)

which, compared with the variance s^{-2} of the estimator β̂_A (cf. (2.5)), gives

Var β̂_T − \frac{1}{s^2} = \frac{e_T^2}{s^2 f_T' R_T^{-1} f_T} .   (4.5)

Detection of signals in noise


The optimal coefficients for a given sample of size n, T = \{t_k\}_{k=1}^n, are those of
the sufficient statistic for the corresponding finite sample detection problem:

H_1: X(t_k) = f(t_k) + N(t_k),   k = 1, \ldots, n,
H_0: X(t_k) = N(t_k),   (4.6)

so that

S_T = f_T' R_T^{-1} X_T = Î_T .   (4.7)

Since under H_1, S_T is N(f_T' R_T^{-1} f_T, f_T' R_T^{-1} f_T) and under H_0 it is
N(0, f_T' R_T^{-1} f_T), its probability of detection at false alarm rate α is

P_d(S_T) = Φ[(f_T' R_T^{-1} f_T)^{1/2} − Φ^{-1}(1 − α)] .   (4.8)

Comparing this with the probability of detection of the test based on the
sufficient statistic S_A, we have

P_d(S_A) − P_d(S_T) = \int_{(f_T' R_T^{-1} f_T)^{1/2} − Φ^{-1}(1−α)}^{s − Φ^{-1}(1−α)} φ(u) \, du
= \int_{(s^2 − e_T^2)^{1/2} − Φ^{-1}(1−α)}^{s − Φ^{-1}(1−α)} φ(u) \, du .   (4.9)

Thus P_d(S_{T_n}) → P_d(S_A) if and only if e_{T_n}^2 → 0, and in either case

\frac{P_d(S_A) − P_d(S_{T_n})}{e_{T_n}^2} → \frac{1}{2s} φ[s − Φ^{-1}(1 − α)] .   (4.10)

We also have SNR(S_T) = f_T' R_T^{-1} f_T, so that

SNR(S_A) − SNR(S_T) = e_T^2 .   (4.11)
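
All three optimal-coefficient performance measures are thus driven by the single quadratic form f_T' R_T^{-1} f_T, so for a given design they are straightforward to evaluate numerically. The sketch below is ours, not from the text (numpy and scipy assumed); it uses the Wiener covariance R(s, t) = min(s, t) on A = [0, 1] with φ ≡ 1, for which (1.1) gives f(t) = t − t^2/2 and s^2 = 1/3.

    import numpy as np
    from scipy.stats import norm

    R = lambda s, t: np.minimum(s, t)    # Wiener covariance
    f = lambda t: t - t**2 / 2           # f(t) = int_0^1 min(t, s) ds
    s2 = 1.0 / 3.0                       # s^2 = int_0^1 f(t) dt

    def performance(T, alpha=0.05):
        # Returns e_T^2 of (4.2), Var beta_T of (4.4) and P_d(S_T) of (4.8).
        RT = R(T[:, None], T[None, :])
        q = f(T) @ np.linalg.solve(RT, f(T))   # f_T' R_T^{-1} f_T
        return s2 - q, 1.0 / q, norm.cdf(np.sqrt(q) - norm.ppf(1 - alpha))

    print(performance(np.arange(1, 11) / 10))  # periodic design of size 10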

4.2. Simpler coefficients


It is clear from the expressions for Î_T, β̂_T and S_T that the optimal choice of
coefficients requires inverting the n × n covariance matrix R_T and that the
coefficient c_{T,k} multiplying the observation X(t_k) depends on all sample points
\{t_k\}_{k=1}^n. Here we consider coefficients of a very simple form, namely for each
sample of size n, T_n = \{t_{nk}\}_{k=1}^n, we take

c_{T_n,k} = n^{-1} c(t_{nk})   or   c_{T_n,k} = n^{-1} c_n(t_{nk})   (4.12)

for some functions c(t) or c_n(t), so that the coefficient c_{T_n,k} multiplying the
observation X(t_{nk}) depends only on the sample point t_{nk}, and in fact via a
(global) function (which may or may not depend on the sample size). For such a
simplified choice of coefficients to be admissible, the functions c or {c_n} should
be chosen so that, as the sample size increases, the performance tends to
that attained using observations over the entire interval. This turns out to be
possible for the various sampling designs introduced earlier.

Estimation of integrals of random quantities


For a sequence {T_n} of median sampling designs generated by the density
h, it turns out that if c(t) satisfies the consistency condition

c(t) h(t) = φ(t),   t ∈ A,   (4.13)

so that

I_{m,T_n} = n^{-1} (φ/h)'_{T_n} X_{T_n} = n^{-1} \sum_{k=1}^n \frac{φ(t_{nk})}{h(t_{nk})} X(t_{nk}) ≡ I_{T_n},   (4.14)

then

e^2_{m,T_n} = s^2 − 2 m_n + s_n^2 = (s_n^2 − s^2) − 2(m_n − s^2) → 0,   (4.15)

where

m_n = n^{-1} (φ/h)'_{T_n} f_{T_n} → s^2,   (4.16)

s_n^2 = n^{-2} (φ/h)'_{T_n} R_{T_n} (φ/h)_{T_n} → s^2 .   (4.17)
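
A small Monte Carlo experiment makes the consistency statement (4.15) concrete. The sketch below is ours (numpy assumed): for Wiener paths with φ ≡ 1 and h uniform, the simple coefficients reduce to c_{T_n,k} = n^{-1}, and the estimated e^2_{m,T_n} decays at the n^{-2} rate anticipated by (5.12) below.

    import numpy as np
    rng = np.random.default_rng(1)

    m = 4000
    for n in (5, 20, 80):
        t = (2 * np.arange(1, n + 1) - 1) / (2 * n)   # median design, h uniform
        idx = (t * m).astype(int) - 1                 # positions of t on grid k/m
        e2 = 0.0
        for _ in range(2000):
            x = np.cumsum(rng.normal(0, m**-0.5, m))  # Wiener path on grid k/m
            # With phi = h = 1 the estimator (4.14) is the sample mean of the
            # X(t_k); I = int_0^1 X(t) dt is approximated by the grid average.
            e2 += (x[idx].mean() - x.mean())**2 / 2000
        print(n, e2)   # roughly (1/12) n^{-2}, cf. (5.12) with alpha_0 = 1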

Similarly, in simple random sampling with density h, if c(t) satisfies the
consistency condition (4.13), we have

I_{sr,T_n} = I_{T_n},   (4.18)

e^2_{sr,T_n} = n^{-1} \Big( \int_A \frac{σ^2(t) φ^2(t)}{h(t)} dt − s^2 \Big) → 0,   (4.19)

where σ^2(t) = R(t, t).


In stratified sampling, we can choose c_n so as to satisfy, for each n,

c_n(t) \bar h_n(t) = φ(t),   t ∈ A,   (4.20)

where \bar h_n is the averaged sampling density n^{-1} \sum_{k=1}^n h_{nk}, and then we have

I_{st,T_n} = n^{-1} (φ/\bar h_n)'_{T_n} X_{T_n} = \sum_{k=1}^n \frac{φ(t_{nk})}{h_{nk}(t_{nk})} X(t_{nk}),   (4.21)

e^2_{st,T_n} = \sum_{k=1}^n \Big\{ \int_{A_{nk}} \frac{σ^2(t) φ^2(t)}{h_{nk}(t)} dt − \int_{A_{nk}} \int_{A_{nk}} R(t, s) φ(t) φ(s) \, dt \, ds \Big\} → 0 .   (4.22)

Estimation of regression coefficients

For a sequence {T_n} of median sampling designs generated by the density
h, if c(t) satisfies the consistency condition

c(t) h(t) = s^{-2} φ(t),   t ∈ A,   (4.23)

we have

β̂_{m,T_n} = s^{-2} I_{T_n},   (4.24)

Bias β̂_{m,T_n} = β s^{-2} (m_n − s^2) → 0,   (4.25)

Var β̂_{m,T_n} − s^{-2} = s^{-4} (s_n^2 − s^2) → 0 .   (4.26)

In simple random sampling with density h, by choosing c(t) as in (4.23) we
have

β̂_{sr,T_n} = s^{-2} I_{T_n},   (4.27)

Bias β̂_{sr,T_n} = 0,   (4.28)

Var β̂_{sr,T_n} − s^{-2} = s^{-4} e^2_{sr,T_n} → 0 .   (4.29)

In stratified sampling, we choose c_n(t) \bar h_n(t) = s^{-2} φ(t) for each n and obtain

β̂_{st,T_n} = s^{-2} I_{st,T_n},   (4.30)

Bias β̂_{st,T_n} = 0,   (4.31)

Var β̂_{st,T_n} − s^{-2} = s^{-4} e^2_{st,T_n} → 0 .   (4.32)

Detection of signals in noise


For a sequence {T_n} of median sampling designs generated by the density
h, if c(t) satisfies the consistency condition (4.13), then

S_{m,T_n} = I_{T_n},   (4.33)

P_d(S_{m,T_n}) = Φ\Big[\frac{m_n}{s_n} − Φ^{-1}(1 − α)\Big],   (4.34)

and comparing it with the probability of detection of the optimal test based on
the entire interval we have

P_d(S_A) − P_d(S_{m,T_n}) = \int_{m_n/s_n − Φ^{-1}(1−α)}^{s − Φ^{-1}(1−α)} φ(u) \, du
= \int_{(s^2 − ε_n^2)^{1/2} − Φ^{-1}(1−α)}^{s − Φ^{-1}(1−α)} φ(u) \, du,   (4.35)

where

ε_n^2 = SNR(S_A) − SNR(S_{m,T_n}) = s^2 − m_n^2/s_n^2
= \frac{1}{s_n^2} \{ s^2 (s_n^2 − s^2) − (m_n − s^2)(m_n + s^2) \} → 0 .   (4.36)

In simple random sampling with density h, if c(t) satisfies the consistency
condition (4.13), so that

S_{sr,T_n} = I_{T_n},   (4.37)

then, while the statistics S_{sr,T_n} are no longer Gaussian, we have

P_d(S_{sr,T_n}) → P_d(S_A),   SNR(S_{sr,T_n}) → SNR(S_A),   (4.38)

and in fact the distributions of S_{sr,T_n} under each alternative hypothesis converge
weakly to those of S_A:

L(S_{sr,T_n} | H_1) ⇒ N(s^2, s^2),   L(S_{sr,T_n} | H_0) ⇒ N(0, s^2) .   (4.39)

In this case (as in any case of a random sampling design), probabilities of
detection are expressed in terms of the joint distribution of the random
variables m_n and s_n^2, which is not easy to compute. We thus focus only on
signal-to-noise ratios, whose expressions are much simpler. In this case,

SNR(S_{sr,T_n}) = \frac{s^4}{s^2 + e^2_{sr,T_n}} → s^2,   (4.40)

so that

s^2 − SNR(S_{sr,T_n}) = \frac{s^2 e^2_{sr,T_n}}{s^2 + e^2_{sr,T_n}} → 0 .   (4.41)

In stratified sampling, if the functions c_n(t) are chosen from the consistency
condition (4.20), so that

S_{st,T_n} = I_{st,T_n},   (4.42)

then the S_{st,T_n} have the desirable limiting properties (4.38) and (4.39), and again
concentrating on signal-to-noise ratios we have

SNR(S_{st,T_n}) = \frac{s^4}{s^2 + e^2_{st,T_n}} → s^2   (4.43)

and

s^2 − SNR(S_{st,T_n}) = \frac{s^2 e^2_{st,T_n}}{s^2 + e^2_{st,T_n}} → 0 .   (4.44)

4.3. Parametric versus nonparametric estimators

The simple-coefficient estimator (4.14) and statistic (4.33) require no knowledge
of the covariance R and are thus nonparametric in nature, while the estimator
(4.24) requires knowledge of s^2 only. In contrast, the optimal-coefficient estimators
(4.1), (4.3) and statistic (4.7) require precise knowledge of the covariance R(t, s).

5. Optimal fixed sample size designs and asymptotically optimal designs

Within a specified class of sampling designs 𝒟, we are interested in finding
the best sampling design of size n. For the specific problems we have been
considering here, a sampling design T of size n is optimal if it minimizes the
mean square approximation error ε_T^2, or the bias and variance of β̂_T, or its
mean square error (MSE = Var + (Bias)^2), or if it maximizes the probability of
detection or the signal-to-noise ratio of S_T, among all sampling designs in 𝒟 of
size n.
Finding optimal designs of fixed sample size turns out to be a difficult task.
We are therefore interested also in finding sequences {T_n^*} of sampling designs
T_n^* of size n which, while generally not optimal for any sample size n, are
nevertheless asymptotically optimal in the sense that, as the sample size tends to
infinity, their performance tends to that of the sequence of optimal sampling
designs. For the specific problems under consideration, this means that

\frac{ε^2_{T_n^*}}{inf ε_T^2} → 1,   \frac{Var β̂_{T_n^*}}{inf Var β̂_T} → 1,   \frac{MSE β̂_{T_n^*}}{inf MSE β̂_T} → 1,
\frac{P_d(S_{T_n^*})}{sup P_d(S_T)} → 1,   \frac{SNR(S_{T_n^*})}{sup SNR(S_T)} → 1,   (5.1)

where the infima and suprema are taken over all sampling designs of size n
in 𝒟.
It should be clear that for any random design there always exists a better
nonrandom design. Our main interest is therefore to find optimal or asymp-
totically optimal sampling designs within the class of all (deterministic) designs.
In the following we comment on the asymptotics of the performance of optimal
sampling designs, we show how in certain cases asymptotically optimal
sequences of designs can be found, and we consider the performance of
optimal fixed sample size simple random designs and of asymptotically optimal
stratified designs.

5.1. Optimal coefficients and regular sampling


When optimal coefficients are used, it is clear from expressions (4.2), (4.4) or
(4.5), and (4.8) or (4.9), that in all three problems under consideration the
optimal sampling design of size n maximizes

f_T' R_T^{-1} f_T = ||P_T f||^2   (5.2)

among all sampling designs in 𝒟_n of size n: T = \{t_1 < t_2 < \cdots < t_n\}, where P_T f
is the projection of f onto the subspace of the reproducing kernel Hilbert space of
R generated by \{R(\cdot, t), t ∈ T\}. Since the maximization is over the open subset
of A^n determined by the inequalities t_1 < t_2 < \cdots < t_n, an optimal sampling
design of size n does not necessarily exist. Such an optimal design exists when
R(s, t) = u(s)v(t) for s ≤ t, including the Wiener and Gauss-Markov cases
min(s, t) and exp(−|s − t|), but its existence becomes a very delicate question when
R is differentiable (on the diagonal of A × A). Even when an optimal design
exists, it is usually difficult to determine it by carrying out the maximization (an
algorithm for certain special cases is developed in Eubank, Smith and Smith
(1982a)). A very special case where the maximization is easily carried out is when
φ ≡ 1 and X has stationary independent increments: the optimal design of sample
size n is given, for A = [0, 1], by t_{ni} = 2i/(2n + 1), i = 1, \ldots, n, with corresponding

e^2_{T_n} = (σ^2/3)(2n + 1)^{-2},

where σ^2 = R(t, t)/t; this is derived in Samaniego (1976)
and in Cressie (1978). The optimal designs satisfy

sup_{T ∈ 𝒟_n} f_T' R_T^{-1} f_T = sup_{T ∈ 𝒟_n} ||P_T f||^2 ↗ ||f||^2 = s^2   (as n → ∞) .   (5.3)
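
The closed form for this optimal design is easily checked numerically. The sketch below is ours (numpy assumed); for the standard Wiener process (σ^2 = 1, φ ≡ 1, A = [0, 1], so f(t) = t − t^2/2 and s^2 = 1/3) it reproduces e^2_{T_n} = (σ^2/3)(2n + 1)^{-2} exactly.

    import numpy as np

    f = lambda t: t - t**2 / 2
    for n in (1, 5, 10):
        t = 2 * np.arange(1, n + 1) / (2 * n + 1)      # optimal design
        RT = np.minimum(t[:, None], t[None, :])
        e2 = 1/3 - f(t) @ np.linalg.solve(RT, f(t))
        print(n, e2, (1/3) / (2*n + 1)**2)             # the two columns agree

For n = 1, for instance, both expressions equal 1/27.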

We now turn our attention to sequences {T_n} of sampling designs T_n of size
n which are not optimal. If they form a regular sequence of designs generated
by a density h, then they satisfy

e^2_{T_n} = s^2 − f_{T_n}' R_{T_n}^{-1} f_{T_n} = ||f||^2 − ||P_{T_n} f||^2 = ||f − P_{T_n} f||^2 → 0 .   (5.4)

When R is smooth, upper bounds can be found on e^2_{T_n}. Specifically, if R(s, t) has
continuous (k, k) mixed partial derivative, then

e^2_{T_n} = o(n^{-2k}),   (5.5)

and if in addition the (k, k) mixed partial derivative of R(s, t) is smooth off the
diagonal of A × A, then

e^2_{T_n} = O(n^{-2k-2}) .   (5.6)

Thus the smoother R is, i.e. the more quadratic mean derivatives the centered
process X(t) has, the faster e^2_{T_n} tends to 0.

Precise rates of convergence and asymptotically optimal sequences of samp-
ling designs are known only in certain cases where the centered process X(t)
has exactly k quadratic mean derivatives, and the rate is n^{-2k-2}. Specifically,
under certain further regularity conditions,

n^{2k+2} e^2_{T_n} → C_k \int_A \frac{α_k(t) φ^2(t)}{h^{2k+2}(t)} dt .   (5.7)

We will not insist on the precise technical regularity conditions other than
giving the expression for the function α_k(t), assumed positive,

α_k(t) = R^{(k,k+1)}(t, t − 0) − R^{(k,k+1)}(t, t + 0),   (5.8)

and noting that stationary covariances R with rational spectral densities and
the right number of quadratic mean derivatives satisfy them, in fact with
α_k(t) ≡ α_k. The constant C_k is defined by C_k = |B_{2k+2}|/(2k + 2)!, where B_m is the
mth Bernoulli number, and C_0 = 1/12, C_1 = 1/720. This asymptotic result has
been established by Sacks and Ylvisaker (1970a, 1970b) for k = 0, 1, and for
covariances R satisfying specific regularity conditions, and by Eubank, Smith
and Smith (1981) for all k and a narrower class of covariances R, essentially
those of k-fold integrals of Brownian motion or bridge; the latter authors
conjecture its validity for the broader class of covariances considered by Sacks
and Ylvisaker.
By choosing the density h which minimizes the right-hand side of (5.7), one
obtains an asymptotically optimal sequence of designs. Specifically, the regular
sequence {T_n^*} of sampling designs generated by the density h^*(t) proportional
to [α_k(t) φ^2(t)]^{1/(2k+3)} is asymptotically optimal, and

n^{2k+2} e^2_{T_n^*} → C_k \Big\{ \int_A [α_k(t) φ^2(t)]^{1/(2k+3)} dt \Big\}^{2k+3},   (5.9)

i.e. inf_{T ∈ 𝒟_n} ε_T^2 = inf_{T ∈ 𝒟_n} ||f − P_T f||^2 has the same asymptotics (where 𝒟_n consists of
all sampling designs of size n).
The asymptotics of ε^2_{T_n^*}, Var β̂_{T_n^*} and P_d(S_{T_n^*}) follow immediately from (4.2),
(4.5) and (4.10). Periodic sampling is covered by (5.7) by taking h the uniform
density over A, and its asymptotic performance is then easily compared with
that of the asymptotically optimal sequence of designs; they both have the
same rate but different asymptotic constants: C_k |A|^{2k+2} \int_A α_k φ^2 for periodic
sampling and as in the right-hand side of (5.9) for the asymptotically optimal
sampling design, and the ratio of the latter to the former can take (for different
φ's) any value in (0, 1]. Thus substantial improvement in asymptotic per-
formance may be achieved by sampling according to h^*(t) rather than
periodically.
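
The comparison of the two limiting constants in (5.7) is easy to automate. The sketch below is ours (numpy assumed); only k = 0, 1 are supported because C_0 = 1/12 and C_1 = 1/720 are hard-coded, and the integrals are computed by the trapezoidal rule.

    import numpy as np

    def limit_constants(alpha_k, phi, k, a=0.0, b=1.0, grid=100001):
        # Limits of n^{2k+2} e_{T_n}^2 in (5.7): h uniform (periodic sampling)
        # versus the optimal density h* of (5.9).
        Ck = {0: 1/12, 1: 1/720}[k]
        t = np.linspace(a, b, grid)
        w = alpha_k(t) * phi(t)**2
        dx = (b - a) / (grid - 1)
        periodic = Ck * (b - a)**(2*k + 2) * np.trapz(w, dx=dx)
        optimal = Ck * np.trapz(w**(1 / (2*k + 3)), dx=dx)**(2*k + 3)
        return periodic, optimal

    # k = 0, alpha_0 = 1, phi(t) = t: here h* is proportional to t^{2/3}.
    p, o = limit_constants(lambda t: np.ones_like(t), lambda t: t, k=0)
    print(p, o, o / p)   # ratio about 0.65, inside the interval (0, 1]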
5.2. Simple coefficients and median sampling
For sequences of median sampling designs {T_n} generated by the density h
and using the simpler coefficients described in Section 4, it is clear from
expressions (4.15), (4.25), (4.26) and (4.34) that their asymptotic performance is
determined by the asymptotics of m_n → s^2 and of s_n^2 → s^2 of (4.16) and (4.17).
Here we describe the results for the case k = 0, i.e. the centered process X(t)
has no quadratic mean derivative, under regularity conditions similar to those
required for (5.7). Included are the cases where R is the covariance of the
Wiener process, the Gauss-Markov process, etc. The precise asymptotic
behavior is as follows:

n^2 (s_n^2 − s^2) → \frac{1}{8} \int_A \frac{α_0(t) φ^2(t)}{h^2(t)} dt,   (5.10)

n^2 (m_n − s^2) → \frac{1}{48} \int_A \frac{α_0(t) φ^2(t)}{h^2(t)} dt .   (5.11)

It then follows from (4.15), (4.25), (4.26), (4.36) and (4.35) that

n^2 e^2_{m,T_n} → \frac{1}{12} \int_A α_0 φ^2 h^{-2},   (5.12)

n^2 Bias β̂_{m,T_n} → \frac{1}{48} β s^{-2} \int_A α_0 φ^2 h^{-2},   (5.13)

n^2 (Var β̂_{m,T_n} − s^{-2}) → \frac{1}{8} s^{-4} \int_A α_0 φ^2 h^{-2},   (5.14)

n^2 (MSE β̂_{m,T_n} − s^{-2}) → \frac{1}{8} s^{-4} \int_A α_0 φ^2 h^{-2},   (5.14')

n^2 [s^2 − SNR(S_{m,T_n})] → \frac{1}{12} \int_A α_0 φ^2 h^{-2},   (5.15)

n^2 [P_d(S_A) − P_d(S_{m,T_n})] → \frac{φ[s − Φ^{-1}(1 − α)]}{24 s} \int_A α_0 φ^2 h^{-2} .   (5.16)

The density h^*(t) which minimizes the integral \int_A α_0 φ^2 h^{-2} is proportional to
[α_0(t) φ^2(t)]^{1/3}, and then the value of the integral becomes
\{ \int_A [α_0(t) φ^2(t)]^{1/3} dt \}^3,
and the corresponding sequence {T_n^*} of median sampling designs is asymp-
totically optimal for the integral approximation problem and for the signal
detection problem, as is seen from the equality of the asymptotic constants in
(5.7) and (5.12) for the former case, and from (5.7) and (5.16), (4.35) for the
latter. Thus in these cases, the median sampling design is both very simple (in
view of the very simple form of its coefficients) and asymptotically optimal. In
the regression problem the asymptotic constant in (5.14) is 50% larger than that
in (4.5) and (5.7), and thus median sampling is not asymptotically optimal; it
requires asymptotically about 22.5% more samples (since (3/2)^{1/2} ≈ 1.225) than
the optimal sampling design in order to achieve the same variance.
It is remarkable that the median sampling design, utilizing such a simple
(nonparametric) form of estimator coefficients, is asymptotically optimal for
integral approximation and signal detection, while for regression coefficient
estimation it has the same rate of convergence as the optimal sequence of designs
using (parametric) optimal coefficients, but with a larger asymptotic constant.
These results were obtained by Schoenfelder (1978) and complemented by
Cambanis and Masry (1983). Work in progress by Schoenfelder has extended
these results to k = 1, i.e. exactly one quadratic mean derivative for the
centered process X(t); and for k ≥ 2 it has produced rates of convergence
n^{-2k-2} (i.e. identical with those of the optimal sequence of designs using optimal
coefficients) by using, instead of the median of each interval (i.e. midpoint
sampling), k appropriate quantiles (i.e. quantile sampling).

5.3. Simple coefficients and simple random sampling


In this case, it is clear from expressions (4.19), (4.29) and (4.41) that e^2_{sr,T_n},
Var β̂_{sr,T_n} − s^{-2} and s^2 − SNR(S_{sr,T_n}) all tend to zero at the rate n^{-1}, with no
assumption whatsoever on the covariance R. This very simple result is also valid
for random fields (i.e. for multidimensional index sets A).
We can also find the optimal fixed sample size simple random design by
finding the density h which minimizes the integral \int σ^2 φ^2 h^{-1} in (4.19). This
optimal density h(t) is proportional to σ(t)|φ(t)|, and assuming σ(t) is bounded
away from 0, we have

I_{T_n} = \Big( \int_A σ(u)|φ(u)| \, du \Big) n^{-1} \sum_{k=1}^n \frac{sgn φ(t_k)}{σ(t_k)} X(t_k),   (5.18)

n e^2_{sr,T_n} = \Big( \int_A σ(t)|φ(t)| \, dt \Big)^2 − s^2,   (5.19)

and all other quantities are determined by these via (4.27), (4.29), (4.37) and
(4.41).
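
A sketch of this optimal simple random design is given below; it is ours, assumes numpy, takes user-supplied functions σ and φ (with σ bounded away from zero), and draws from h* by inverting a discretised distribution function.

    import numpy as np
    rng = np.random.default_rng(2)

    def draw_design(sigma, phi, n, a=0.0, b=1.0, grid=10001):
        # t_1, ..., t_n i.i.d. from h*(t) proportional to sigma(t)|phi(t)|.
        t = np.linspace(a, b, grid)
        H = np.cumsum(sigma(t) * np.abs(phi(t)))
        return np.interp(rng.uniform(size=n), H / H[-1], t)

    def estimate(x, t, sigma, phi, c):
        # Estimator (5.18); c = int_A sigma(u)|phi(u)| du is precomputed.
        return c * np.mean(np.sign(phi(t)) * x / sigma(t))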

5.4. Simple coefficients and stratified sampling


It is clear from expressions (4.22), (4.32) and (4.44) that only the convergence
to zero of e^2_{st,T_n} needs to be considered.
For periodic sampling with uniform jitter (worst case), we have

n e^2_{st,T_n} → |A| \int_A σ^2(t) φ^2(t) \, dt .   (5.20)

For each fixed partition \{A_{nk}\}_{k=1}^n of A, the sampling densities h_{nk}(t) which are
proportional to σ(t)|φ(t)| within each stratum A_{nk} minimize the right-hand side
of (4.22) term by term, producing a (partly optimal) stratified sampling design
with

I_{st,T_n} = \sum_{k=1}^n \Big( \int_{A_{nk}} σ(u)|φ(u)| \, du \Big) \frac{sgn φ(t_{nk})}{σ(t_{nk})} X(t_{nk})   (5.21)

and

e^2_{st,T_n} = \sum_{k=1}^n \Big\{ \Big( \int_{A_{nk}} σ(t)|φ(t)| \, dt \Big)^2 − \int_{A_{nk}} \int_{A_{nk}} R(t, s) φ(t) φ(s) \, dt \, ds \Big\} .   (5.22)

For regular sequences of partitions generated by a density h bounded away
from zero, we have

e^2_{st,T_n} = O(n^{-1}) .   (5.23)

Precise rates of convergence depend again on the quadratic mean differen-
tiability of the centered process X, and again require appropriate regularity
conditions. When k = 0, i.e. the centered process X has no quadratic mean
derivative, then

n^2 e^2_{st,T_n} → \frac{1}{6} \int_A \frac{α_0(t) φ^2(t)}{h^2(t)} dt,   (5.24)

and by choosing h^*(t) proportional to [α_0(t) φ^2(t)]^{1/3}, we obtain an asymptotic-
ally optimal sequence {T_n^*} of stratified sampling designs with

n^2 e^2_{st,T_n^*} → \frac{1}{6} \Big\{ \int_A [α_0(t) φ^2(t)]^{1/3} dt \Big\}^3 .   (5.25)

Comparing (5.25) with (5.9), we see that the asymptotically optimal sequence of
stratified sampling designs has the same rate as the sequence of optimal
sampling designs using optimal coefficients, and the asymptotically optimal
sequence of median sampling designs using simple coefficients, but an asymptotic
constant twice as large, thus requiring asymptotically 41.5% more samples
(since 2^{1/2} ≈ 1.414) for the same performance.
When k ≥ 1, i.e. when the centered process X has one or more quadratic mean
derivatives, then

n^3 e^2_{st,T_n} → \frac{1}{24} \int_A \frac{β(t) φ^2(t)}{h^3(t)} dt,   (5.26)

where β(t) = 2σ(t)σ''(t) − 2R^{(0,2)}(t, t) (= −2R''(0) in the stationary case) is ≥ 0, and
in fact > 0 on some small interval, so that the rate does not improve as the
centered process X has more than one quadratic mean derivative. Thus n^{-3} is the
ultimate rate achievable by stratified sampling designs. By choosing h^*(t)
proportional to [β(t) φ^2(t)]^{1/4} we obtain an asymptotically optimal sequence {T_n^*}
of stratified sampling designs with

n^3 e^2_{st,T_n^*} → \frac{1}{24} \Big\{ \int_A [β(t) φ^2(t)]^{1/4} dt \Big\}^4 .   (5.27)

6. Discussion and extension

6.1. References for Sections 4 and 5


The work described in Sections 4 and 5 began with a series of fundamental
papers by Sacks and Ylvisaker (1966, 1968, 1970a, 1970b), where the regres-
sion problem was considered using optimal coefficients, the notion of asymp-
totically optimal designs was introduced, and asymptotically optimal designs
using optimal coefficients were found for k = 0, 1. In the last paper, the
connection with the random integral estimation problem is also discussed. For
general k, but a more restricted class of covariances, asymptotically optimal
designs using optimal coefficients were developed by Eubank, Smith and Smith
(1981, 1982b) based on the related work of Barrow and Smith (1978, 1979).
Schoenfelder (1978) studied median and random designs for the integral
approximation problem, and Cambanis and Masry (1983) considered the signal
detection problem.

6.2. Sampling designs of fixed size


The existence of optimal fixed sample size designs using optimal coefficients,
and algorithms for their construction, are discussed, for k = 0, 1 and certain
covariances, by Eubank, Smith and Smith (1981, 1982a). Random designs of
fixed sample size are compared in Schoenfelder and Cambanis (1982): for every
simple random sampling design, there is a better stratified sampling design of
the same size; there are cases where systematic sampling outperforms stratified
sampling and vice versa--in fact, systematic sampling may be outperformed
even by random sampling in special cases.

6.3. Comparison of asymptotic performance


The asymptotic performance of the various sampling designs considered in
Section 5 is summarized in Table 1, where the exact rates of convergence are
shown. The parentheses indicate that, except for k = 1, the result is established
for special cases, and the double parentheses indicate anticipated results, not
yet available in the literature. The performance of simple random sampling, while
quite poor, is not affected by the smoothness of R or by the dimensionality of the
index set. Stratified sampling performs, as expected, better than simple random,
but is also not affected by the smoothness of R once one quadratic mean
derivative exists; and even when its rate is n^{-2} for k = 0, just as for the nonrandom
designs, its asymptotic constant is twice as large. The performance of all of these

Table 1
Exact rates of convergence

Exact no. of q.m.   Optimal coefficients,   Simple coefficients
derivatives         optimal designs         median ((quantile))   simple random   stratified

k = 0               n^{-2}                  n^{-2}                n^{-1}          n^{-2}
1 ≤ k < ∞           (n^{-2k-2})             ((n^{-2k-2}))         n^{-1}          n^{-3}

sampling designs for small sample sizes, 1 < n < 20, is considered in Cambanis and
Masry (1983) for the signal detection problem, using both the asymptotically
optimal densities h^* as well as a uniform density h, for R the covariance of the
Wiener process and of stationary first- and second-order Gauss-Markov processes:
the advantages of using h^* versus a uniform h are discussed in special cases, and it
is seen that the asymptotically optimal median sampling design with simple
coefficients may outperform the asymptotically optimal sampling design with
optimal coefficients.

6.4. Improvement of performance of random sampling


The optimal fixed sample size simple random and stratified designs con-
sidered in Section 5 make use of the variance R(t, t), but not of the full covariance
structure R(s, t). This is because the simple coefficients of the form (4.12) weigh
each observation at a sampling point by a weight depending only on that
sampling point. By allowing these weights to depend on two (rather than just
one) sampling points, improved random sampling designs would be obtained
which would use the full covariance structure R(s, t).

6.5. Robustness of sampling design


The question of robustness (at fixed sample size or asymptotically) of the
sampling designs is very important, as in practice the covariance R(s, t) may
not be known precisely, or may be estimated. Specifically, if the true covari-
ance is R while the designer uses the inaccurate covariance R', how much
performance is lost? Median, simple random and stratified designs using
simple coefficients are easily seen to be robust; for the integral estimation
problem some details are worked out in Schoenfelder (1978). For designs using
optimal coefficients, the question of robustness is more delicate because of the
inverse covariance matrix appearing in (4.2), (4.4) and (4.8).

6.6. The role of assumption (1.1)


While in the integral representation problem one starts with the function φ
and then naturally introduces f via (1.1), in the regression and signal detection
problems one starts with the function f, and one argues that for the problem to
be regular, f should belong to the reproducing kernel Hilbert space of the
covariance R. The representation (1.1) actually says that f belongs to the range
of the integral-type operator with kernel R, a smaller space. Sacks and
Ylvisaker (1970b) show that if f is not of the form (1.1), convergence can be
arbitrarily slow. In their series of papers, they also show that in fact their
asymptotic results remain valid if f is of the slightly more general form

f(t) = \int_A R(t, s) φ(s) \, ds + \sum_{m=1}^M a_m R(t, s_m)   (6.1)

(s_m ∈ A), by adjoining the points s_1, \ldots, s_M to the regular sequence of designs.

6.7. Multiple regression


The problem of estimating multiple regression coefficients has been con-
sidered by Sacks and Ylvisaker (1968, 1970a) and by Eubank, Smith and Smith
(1982a, 1982b) with similar results. In this case an observation taken at time t
has the form

X(t) = \sum_{j=1}^J β_j f_j(t) + N(t),   t ∈ A,   (6.2)

where each regression function f_j is known and of the form (1.1), and we want
to estimate the unknown regression coefficients β = (β_1, \ldots, β_J).

6.8. Designs for random fields


In many important applications the observation process is indexed by a
multidimensional set, so that the interval A, instead of being one-dimensional, is
d-dimensional; included here are kriging (cf. David, 1978), forestry land
surveying, factorial experiments, response surface regression, image processing,
sonar wave detection, etc. For this case of multidimensional sampling designs,
random designs have been studied, but a great deal of work remains to be done
for deterministic designs. As has already been mentioned, expressions (4.19),
(4.29), (4.41), as well as (5.19), remain valid for all d, and thus simple random
sampling designs always have rate n^{-1} for all d ≥ 1. This is the simplest result
currently available, requires no restrictive assumptions, and in higher dimen-
sions (d large) compares favorably with what is currently known for deter-
ministic sampling designs. Tubilla (1975) in the stationary case and Schoen-
felder (1982) in the general case show that the rate of convergence is n^{-1-1/d} for
k = 0 and n^{-1-2/d} for k = 1 for stratified sampling designs, and n^{-2/d} for
systematic sampling designs, independent of the number of quadratic mean
derivatives; Schoenfelder (1982) also finds asymptotically optimal stratified
sampling designs with rectangular partitions. Ylvisaker (1975) shows that when
k = 0, product sampling designs using optimal coefficients have rate of con-
vergence n^{-2/d} (product sampling designs are obtained by sampling each
coordinate separately). The results of Ylvisaker (1975) and especially Wahba
(1978) indicate, however, that superposition of nonrandom product sampling
designs gives a better rate of convergence. A special design problem in the
plane is considered in Dalenius, Hájek and Zubrzycki (1961).

7. Related topics

7.1. Using derivatives

When the observation process has k (≥ 1) quadratic mean derivatives, and
in forming the estimators I_T and β̂_T and the statistic S_T one uses not only the
value of the observation process X(t) at each sampling instant but also its
derivatives X^{(j)}(t), j = 1, \ldots, k, then naturally one can expect improved per-
formance. This situation is not of practical interest, as it requires measuring
derivatives at sampling points, but it provides bounds on performance and is of
independent theoretical interest. Sacks and Ylvisaker (1970a, 1970b) established
(5.7) in this case, and thus also (5.9) for the asymptotically optimal sequence of
designs, under the same regularity conditions but with a different constant C'_k
given by

C'_k = \frac{\{(k + 1)!\}^2}{(2k + 2)! \, (2k + 3)!} .

Naturally, C'_0 = C_0 (= 1/12); quite surprisingly C'_1 = C_1 (= 1/720), so that when
exactly one quadratic mean derivative exists, there is no advantage asymptotically
in using derivatives; however, for k > 1, C'_k < C_k and, as expected, use of deriva-
tives improves performance asymptotically (for instance, C'_2 = 0.3 C_2, C'_3 =
C_3/21, C'_4 = C_4/210, etc.). Wahba (1971) and Hájek and Kimeldorf (1974) treat
the case where the centered process X is autoregressive, and these results are
generalized to the vector-valued case by Wittwer (1976). Wahba (1974) also
treats the (more general) case where R is the covariance of a Gaussian process
equivalent to an autoregressive Gaussian process. Eubank, Smith and Smith
(1981, 1982a) give sufficient conditions on f under which there exist unique
optimal designs using optimal coefficients for each sample size or for all
sufficiently large sample sizes, and develop algorithms for finding them.
Product sampling designs for a certain two-dimensional random field are
considered by Wittwer (1978).
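
The constants C'_k, and their comparison with the C_k of Section 5.1, can be verified in exact rational arithmetic. The sketch below is ours; it assumes sympy is available for the Bernoulli numbers entering C_k = |B_{2k+2}|/(2k + 2)!.

    from math import factorial
    from sympy import bernoulli, Rational

    def Cp(k):
        # C'_k = ((k+1)!)^2 / ((2k+2)! (2k+3)!)
        return Rational(factorial(k + 1)**2,
                        factorial(2*k + 2) * factorial(2*k + 3))

    def C(k):
        # C_k = |B_{2k+2}| / (2k+2)!
        return abs(bernoulli(2*k + 2)) / factorial(2*k + 2)

    for k in range(5):
        print(k, Cp(k), C(k) / Cp(k))
    # ratios: 1, 1, 10/3, 21, 210 -- i.e. C'_0 = C_0, C'_1 = C_1,
    # C'_2 = 0.3 C_2, C'_3 = C_3/21, C'_4 = C_4/210, as stated above.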

7.2. Relationship to quadrature formulae


The approximation of the random integral (2.1), \int_A X(t) φ(t) \, dt, by the
random sum (2.2), c_T' X_T = \sum_{k=1}^n c_{T,k} X(t_k), of the values of the random integrand
X(t) at a finite number of points, is reminiscent of quadrature formulae in the
approximation of ordinary integrals, and their relationship is discussed in Sacks
and Ylvisaker (1970b). Consider approximating the ordinary integral
\int_A x(t) φ(t) \, dt by the quadrature formula c_T' x_T = \sum_{k=1}^n c_{T,k} x(t_k). When x belongs
to the reproducing kernel Hilbert space of R, the approximation error can be
written as

e_T(x, c_T) = \Big| \int_A x(t) φ(t) \, dt − \sum_{k=1}^n c_{T,k} x(t_k) \Big| = |⟨f − g_{T,c_T}, x⟩|,

where ⟨·, ·⟩ is the inner product of the RKHS(R) and g_{T,c_T}(t) = \sum_{k=1}^n c_{T,k} R(t_k, t).
For fixed T, the quadrature formula is called best in the sense of Sard (1963) if
the coefficients c_T minimize

sup_{||x|| ≤ 1} e_T(x, c_T) = sup_{||x|| ≤ 1} |⟨f − g_{T,c_T}, x⟩| = ||f − g_{T,c_T}|| .

But this is minimized when g_{T,c_T^*} = P_T f (cf. (5.2) and (5.4)). The quadrature
formula is called best in the sense of Sard if the finite set T^* of sampling points
(nodes) and the weights c_{T^*} satisfy

sup_{||x|| ≤ 1} e_{T^*}(x, c_{T^*}) = inf_T inf_{c_T} sup_{||x|| ≤ 1} e_T(x, c_T) (= inf_T ||f − P_T f||) .

The connection with the random integral approximation problem follows from
the relationship

E e_T^2(X(·), c_T) = E \Big| \int_A X(t) φ(t) \, dt − \sum_{k=1}^n c_{T,k} X(t_k) \Big|^2
= ||f − g_{T,c_T}||^2 = sup_{||x|| ≤ 1} e_T^2(x, c_T) .

Thus the best quadrature formula c_{T^*}' x_{T^*} in the sense of Sard for \int_A x(t) φ(t) \, dt,
x ∈ RKHS(R), determines the optimal sampling design T^* when optimal
coefficients c_T are used and the best quadratic mean approximation c_{T^*}' X_{T^*} of
\int_A X(t) φ(t) \, dt, and vice versa. Certain properties of the best quadrature
formula in the sense of Sard, and thus also of the optimal sampling design, are
established in Karlin (1976), and asymptotics are studied in Barrow and Smith
(1979).

7.3. Estimating random integrals with observation errors


In connection with the problem of estimating the random integral (2.1), it is
natural to consider the case where the values of the process X cannot be
measured with perfect accuracy, but the observation at each sampling instant t_i
is Y_i = X(t_i) + ε_i, where the observation errors ε_i are uncorrelated and have
zero means and common variance σ^2. This important case has been studied by
Jones (1948) when φ ≡ 1 and R(t, s) = exp(−|t − s|), where it is shown that, with
A = [0, 1], the best estimator is n^{-1} \sum_{i=1}^n Y_i and the optimal sampling design of
size n, \{t_i\}_{i=1}^n, is periodic and symmetrically located in [0, 1], and its period
along with the value t_1 = 1 − t_n are determined (implicitly). Kendall (1948)
showed that the mean square error e_n^2 satisfies n e_n^2 → σ^2, which, compared to
(5.9) with k = 0, shows a loss of one power in the rate of convergence due to
observation errors (n^{-1} instead of n^{-2} with no observation errors).

7.4. Multiple regression with negligible correlation

Bickel and Herzberg (1979) consider the multiple regression problem (6.2)
with error covariance R(t, s) = γ σ^2 ρ(t − s) for t ≠ s and R(t, t) = σ^2, where ρ is
a stationary covariance with ρ(0) = 1, and 0 ≤ γ ≤ 1; i.e. the error N(t) consists
of a stationary component with covariance γ σ^2 ρ(t − s) and an uncorrelated
white component with variance (1 − γ) σ^2. They make the critical assumption
that as the sample size n of the design increases, the error correlation becomes
negligible; specifically, they assume that ρ(t) depends on the sample size n as
follows: ρ_n(t) = r(nt), where r is a fixed stationary covariance with r(t) → 0 as
t → ∞. When the regression functions are powers, f_j(t) = t^{j−1}, and A = [−a, a],
they point out that asymptotic results when the covariance ρ is allowed to
depend on the sample size n as above can be translated to asymptotic results
for fixed covariance ρ but with the interval over which the samples are taken
depending on the sample size as follows: A_n = [−na, na]. They show that the
variance-covariance matrix of the least squares linear estimates of the regression
coefficients tends to zero with rate n^{-1}, and they determine asymptotically
optimal designs implicitly in the general case, and explicitly for location (J = 1,
f_1 ≡ 1), regression through the origin (J = 1, f_1(t) = t), and linear regression
(J = 2, f_1 ≡ 1, f_2(t) = t). In Bickel, Herzberg and Schilling (1981) the first-order
autoregressive case ρ(t) = exp(−|t|) is considered in detail for the location and
linear regression problems, and the performance of the uniform designs is
compared with that of the optimal and the asymptotically optimal designs. For
the cases treated in the latter paper, the asymptotic performances of the
variance-covariance matrices of the minimum variance linear unbiased estima-
tors and of the least squares linear estimates are identical.

7.5. Periodic sampling to discriminate processes with independent increments


Some aspects of the problem of finding the optimal period of a periodic
sampling design of fixed size, in discriminating between two processes with
independent increments are considered in Newman and Stuck (1979).

References

Barrow, D. L. and Smith, P. W. (1978). Asymptotic properties of best L_2[0, 1] approximation by
splines with variable knots. Quart. Appl. Math. 36, 293-304.
Barrow, D. L. and Smith, P. W. (1979). Asymptotic properties of optimal quadrature formula. In:
Hämmerlin, ed., Numerische Integration, 54-66. Birkhäuser, Basel, Switzerland.
Bickel, P. J. and Herzberg, A. M. (1979). Robustness of design against autocorrelation in time I:
Asymptotic theory, optimality for location and linear regression. Ann. Statist. 7, 77-95.
Bickel, P. J., Herzberg, A. M. and Schilling, M. F. (1981). Robustness of design against autocor-
relation in time II: Optimality, theoretical and numerical results for the first-order autoregression
process. J. Amer. Statist. Assoc. 76, 870-877.
Cambanis, S. and Masry, E. (1983). Sampling designs for the detection of signals in noise. IEEE
Trans. Inform. Theory IT-29, 83-104.
Cochran, W. (1946). Relative accuracy of systematic and stratified random samples for a certain class of
populations. Ann. Math. Statist. 17, 164-177.
Cressie, N. (1978). Estimation of the integral of a stochastic process. Bull. Austral. Math. Soc. 18,
83-93.
Dalenius, T., Hájek, J. and Zubrzycki, S. (1961). On plane sampling and related geometrical
problems. In: Proc. Fourth Berkeley Symp. Math. Statist. and Probability, Vol. 1, 125-150.
University of California Press, Berkeley, CA.
David, M. (1978). Sampling and estimation problems for three dimensional spatial stationary and
nonstationary stochastic processes as encountered in the mineral industry. J. Statist. Plann.
Inference 2, 211-244.
Eubank, R. L., Smith, P. L. and Smith, P. W. (1981). Uniqueness and eventual uniqueness of
optimal designs in some time series models. Ann. Statist. 9, 486-493.
Eubank, R. L., Smith, P. L. and Smith, P. W. (1982a). On the computation of optimal designs for
certain time series models with applications to optimal quantile selection for location or scale
parameter estimation. SIAM J. Sci. Statist. Comput. 3, 238-249.
Eubank, R. L., Smith, P. L. and Smith, P. W. (1982b). A note on optimal and asymptotically
optimal designs for certain time series models. Ann. Statist. 10, 1295-1301.
Hájek, J. and Kimeldorf, G. (1974). Regression designs in autoregressive stochastic processes. Ann.
Statist. 2, 520-527.
Jones, A. E. (1948). Systematic sampling of continuous parameter populations. Biometrika 35,
291-296.
Karlin, S. (1976). Best approximations, optimal quadrature and monosplines. In: S. Karlin, C. A.
Micchelli, A. Pinkus and I. J. Schoenberg, eds., Studies in Spline Functions and Approximation
Theory. Academic Press, New York.
Kendall, M. G. (1948). Continuation of Dr. Jones's paper. Biometrika 35, 291-296.
Newman, C. M. and Stuck, B. W. (1979). Optimal sampling of independent increment processes.
Stochastics 2, 213-225.
Parzen, E. (1967). Time Series Analysis Papers. Holden-Day, San Francisco, CA.
Pázman, A. (1977). A contribution to the time series design problems. In: Trans. Seventh Prague
Conference, 467-476. Reidel, Dordrecht.
Quenouille, M. (1949). Problems in plane sampling. Ann. Math. Statist. 20, 355-375.
Sacks, J. and Ylvisaker, D. (1966). Designs for regression problems with correlated errors. Ann.
Math. Statist. 37, 66-89.
Sacks, J. and Ylvisaker, D. (1968). Designs for regression problems with correlated errors; many
parameters. Ann. Math. Statist. 39, 49-69.
Sacks, J. and Ylvisaker, D. (1970a). Designs for regression problems with correlated errors III.
Ann. Math. Statist. 41, 2057-2074.
Sacks, J. and Ylvisaker, D. (1970b). Statistical designs and integral approximation. In: Proc. Twelfth
Biennial Seminar of the Canadian Mathematical Congress, 115-136. Canadian Mathematical
Congress, Montreal, Canada.
Samaniego, F. J. (1976). The optimal sampling design for estimating the integral of a process with
stationary independent increments. IEEE Trans. Inform. Theory IT-22, 375-376.
Sard, A. (1963). Linear Approximation. American Mathematical Society, Providence, RI.
Schoenfelder, C. (1978). Random designs for estimating integrals of stochastic processes. Institute
of Statistics Mimeo Series No. 1201. University of North Carolina, Chapel Hill, NC.
Schoenfelder, C. (1982). Random designs for estimating integrals of stochastic processes: Asymp-
totics. Center for Stochastic Processes Tech. Rept. No. 6. University of North Carolina, Chapel
Hill, NC.
Schoenfelder, C. and Cambanis, S. (1982). Random designs for estimating integrals of stochastic
processes. Ann. Statist. 10, 526-538.
Tubilla, A. (1975). Error convergence rates for estimates of multi-dimensional integrals of random
functions. Tech. Rept. No. 72. Department of Statistics, Stanford University, Stanford, CA.
Wahba, G. (1971). On the regression design problem of Sacks and Ylvisaker. Ann. Math. Statist.
42, 1035-1053.
Wahba, G. (1974). Regression design for some equivalence classes of kernels. Ann. Statist. 2,
925-934.
Wahba, G. (1978). Interpolating surfaces: High order convergence rates and their associated
designs with applications to X-ray image reconstruction. Tech. Rept. No. 523. Department of
Statistics, University of Wisconsin, Madison.
Wittwer, G. (1976). Versuchsplanung im Sinne von Sacks-Ylvisaker für Vektorprozesse. Math.
Operationsforsch. Statist. 7, 95-105.
Wittwer, G. (1978). Über asymptotisch optimale Versuchsplanung im Sinne von Sacks-Ylvisaker.
Math. Operationsforsch. Statist. Ser. Statist. 9, 61-71.
Ylvisaker, D. (1975). Designs on random fields. In: J. Srivastava, ed., A Survey of Statistical Design
and Linear Models, 593-607. North-Holland, Amsterdam.
Zubrzycki, S. (1958). Remarks on random, stratified, and systematic sampling in a plane. Colloq.
Math. 6, 251-264.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 363-387

17

Measuring Attenuation

M. A. Cameron and P. J. Thomson

1. Introduction

The analysis of the relationships between several time series may be per-
formed in a number of ways. In the time domain, vector models of ARMA
type may be fitted to the data and these give empirical descriptions of the
relationships between the series that can be used to generate forecasts. These
models may form the basis of further investigation of the data structure (see
Box and Tiao (1977) for an example). The alternative in the time domain is to
hypothesise a structure and to use that to formulate and fit an appropriate
model. A simple example of this is a time series regression (transfer function
model), but more generally the model will be a constrained vector ARMA
model.
In the frequency domain the methods developed in multivariate analysis for
the analysis of covariance matrices may be adapted to analyse complex
covariance matrices and applied to the estimated cross-spectral density matrices
at different frequencies. Brillinger (1975) gives examples of regression, prin-
cipal component and canonical variate analysis applied to the estimated cross-
spectral matrices. If the analysis is exploratory, then the estimated spectral
density for each of a number of frequencies across the frequency range of
interest is analysed separately and no attempt is made to combine, formally,
the information from different frequency bands. On the other hand, there may
be a parametric model for the dependence between the different series which is
formulated in either the time or the frequency domain. Then, if the analysis is
performed in the frequency domain, information from different frequency
bands should be combined formally, in either the time or the frequency
domain.
In this chapter, the problem of the estimation of the attenuation of a signal is
considered. Here a model for the observations may be written down in the time
domain, but the estimation of the time domain model may be performed in
either the time or the frequency domain. The model is a time series version of
a factor analysis model and so the methods and results are of wide relevance.
For example, Geweke and Singleton (1981) and Engle and Watson (1981)


describe the application of time series factor models to economic data and
Cameron (1983) uses the model for comparative calibration of time series
recorders. The methods are also applicable, for example, to a geophysical
problem described by Clay and Hinich (1981) and to the estimation of the
velocity and attenuation of a signal as it passes across an array of sensors. The
model will be developed and discussed in terms of the attenuation of a signal
across an array for definiteness.
In the next section, the model is described and some notation is introduced.
In Section 3, methods for estimating attenuation are given. These are for time
domain and for both narrow and broad band frequency domain estimation. In
Section 4 the methods are extended to the case where delays are present whilst
in Section 5 a discussion is given of how the methods could be used in practice
and the analysis of a simple set of data is described.

2. The model

Consider the situation where a number of recording devices are each
endeavouring to measure a common scalar stochastic signal. Typically the
observations made at these recorders will comprise a modified form of the
signal together with additive noise. A simple model of a situation such as this is
the (single) factor analysis model

y_{ij} = μ_j + α_j S_i + ε_{ij},   j = 1, \ldots, p .   (2.1)

Here the observed random variables y_{ij} have mean μ_j, the common signal S_i has
mean zero and variance σ^2, S_i and ε_{ij} are independent Gaussian random
variables, and each ε_{ij} has mean zero and variance σ_j^2. The α_j are attenuation
coefficients. However, for the case of the simple model (2.1), they are more
commonly known as factor loadings. This particular model has been extensively
used in many areas, in particular the social sciences. The properties of such
models together with associated estimation procedures have been extensively
discussed (see, for example, Joreskog, 1978). We wish to set this familiar
problem in the more general time series context of signal estimation, where the
signal now becomes a stochastic process over time.
signal now becomes a stochastic process over time.
The basic model considered is given by

y_j(t) = μ_j + \int_{-∞}^{∞} α_j(τ) S(t − τ) \, dτ + x_j(t),   −∞ < t < ∞,   (2.2)

where each of the observed processes y_j(t), j = 1, \ldots, p, comprises a mean μ_j, a
filtered form of the common signal S(t) and a noise process x_j(t). The S(t) and
x_j(t) processes are assumed to be zero mean, continuous time stationary
processes with spectral densities f_s(ω) and f_{x,j}(ω), respectively. More-
over, the x_j(t) will be assumed to be independent of one another and also
of S(t). The impulse response function α_j(τ) reflects the fact that the signal

will, in general, undergo modification prior to its arrival at any recorder and
that this modification will typically depend on the particular recorder concer-
ned. Suppose the spectral representation of S(t) is given by

S(t) = \int_{-∞}^{∞} e^{-itω} \, dZ(ω),   (2.3)

where the complex valued process Z(ω) has zero mean and orthogonal
increments, i.e.

E\{dZ(ω) \overline{dZ(ω')}\} = f_s(ω) \, dω   if ω = ω', and 0 otherwise.
Then (2.2) can be written as

y_j(t) = μ_j + \int_{-∞}^{∞} e^{-itω} a_j(ω) \, dZ(ω) + x_j(t),   (2.4)

where

a_j(ω) = \int_{-∞}^{∞} α_j(τ) e^{iτω} \, dτ

is the transfer function of the filter with impulse response function α_j(τ). It is
evidently a_j(ω) that modifies the component of S(t) at frequency ω.
In practice, the data will normally be sampled at equidistant time intervals,
either by virtue of the recording process adopted or as a consequence of the
fact that the data will invariably be analysed by a digital computer. Assuming
that the sampling interval has been chosen sufficiently small for aliasing effects
in either S(t) or the x_j(t) to be ignored, the continuous time model (2.2) is now
replaced by the discrete time process

y_j(n) = μ_j + \sum_k β_j(k) S(n − k) + x_j(n),   n = 0, ±1, \ldots,   (2.5)

where time has now been rescaled so that the sampling interval represents one
time unit and the β_j(k) satisfy

\sum_k β_j(k) e^{ikω} = a_j(ω) .

We assume that a_j(ω) is only non-zero in (−π, π]. Then the α_j(τ) can be
recovered from the β_j(k) via the formula

α_j(τ) = \sum_k β_j(k) \frac{\sin(k − τ)π}{(k − τ)π}
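
In code this reconstruction is a sinc interpolation of the β_j(k). A minimal sketch, ours and assuming numpy (whose np.sinc(x) is sin(πx)/(πx)):

    import numpy as np

    def alpha_at(tau, beta, lags):
        # alpha_j(tau) = sum_k beta_j(k) sin((k - tau)pi) / ((k - tau)pi)
        return np.sum(beta * np.sinc(lags - tau))

    beta = np.array([0.2, 1.0, 0.4])
    lags = np.arange(3)
    print(alpha_at(1.0, beta, lags))   # recovers beta_j(1) at integer tau
    print(alpha_at(0.5, beta, lags))   # interpolated value between lags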

due to Shannon. However, (2.5) may be considered as a model in its own right,
irrespective of whether or not it has been derived from some underlying
continuous time process. In any event, we are concerned with fitting models
such as (2.5) to data and, in particular, estimating the transfer functions a_j(ω).
From the latter, estimates of the β_j(k) and α_j(τ) can be derived, either
non-parametrically or via some suitably parametrised version of (2.5).
Consider the p-dimensional process y(n), n = 0, ±1, \ldots, with jth component
y_j(n) given by (2.5). If γ_s(n) and γ_{x,j}(n) are the serial covariance functions of
S(n) and x_j(n) respectively, then the matrix of cross-covariance functions of
y(n) has typical element

γ_{jk}(n) = \sum_l \sum_{l'} β_j(l) β_k(l') γ_s(n − l + l') + δ_{jk} γ_{x,j}(n),

where δ_{jk} is the Kronecker delta. Likewise, the spectral density matrix f(ω) of
y(n) has typical element

f_{jk}(ω) = a_j(ω) \overline{a_k(ω)} f_s(ω) + δ_{jk} f_{x,j}(ω) .


It is clear that the signal, noise and attenuation parameters are not uniquely
determined by the cross-covariances and spectral densities of the y(n) process
unless more is known or artificial constraints are imposed. If one of the y_j(n)
can be thought of as a reference or control series for which the corresponding
a_j(ω) is identically unity (i.e. β_j(n) = δ_{n0}), then this problem is resolved. Yet
another possibility would be to confound f_s(ω) with the a_j(ω) and measure only
the factors a_j(ω) f_s^{1/2}(ω). This is equivalent to taking the signal as unit variance
white noise, but with spectral components whose amplitude is unrelated to
those of the noise. Normalising the factors a_j(ω) f_s^{1/2}(ω) by the amplitude of the
corresponding noise process at frequency ω yields the spectral density

f_{jk}(ω) = (ν_j(ω) \overline{ν_k(ω)} + δ_{jk}) f_{x,j}^{1/2}(ω) f_{x,k}^{1/2}(ω),   (2.6)

where

ν_j(ω) = a_j(ω) \{f_s(ω)/f_{x,j}(ω)\}^{1/2} .   (2.7)

Apart from scaling the signal and noise to have the same amplitude, this
reformulation of f(ω) has the virtue that |ν_j(ω)|^2 admits a simple interpretation
as the signal-to-noise ratio at the jth recorder. The latter is an important
parameter in its own right.
In practice, it might be expected that the signal S(n) and the transfer
functions a_j(ω) would be relatively smooth, with the noise processes x_j(t) having
reasonably flat spectra. This would mean that the ν_j(ω) would tend to be sizeable
in the lower frequencies. As a consequence, it would seem that these spectral
quantities should typically vary smoothly over frequency. Thus it might be
expected that non-parametric estimation techniques based on the Fourier
transforms of the data over non-overlapping narrow frequency bands would prove
to be effective. Similarly, simple parametric forms for the ν_j(ω) and the f_{x,j}(ω)
should generally fit such data well.
The model (2.5) can be generalised in a number of ways. Of these, the most
obvious is to incorporate delays. Consider the situation where the signal is
received by an array of recorders. Because of the spatial configuration of the
array, the individual recorders will, at any instant of time, receive lagged or
delayed forms of the signal. In such circumstances, the model (2.5) becomes

y_j(n) = μ_j + \sum_k β_j(k) S(n − τ_j − k) + x_j(n),   (2.8)

where the τ_j are not integers in general. However, it may well be true that the
medium through which the signal is travelling is dispersive. This would mean
that the different frequency components that make up S(t) travel at different
speeds, resulting in frequency-dependent delays. This leads to a model of the
form

y_j(n) = μ_j + \int_{-π}^{π} e^{-i(n − τ_j(ω))ω} a_j(ω) \, dZ(ω) + x_j(n) .   (2.9)

Methods of taking account of delays will be given later in this chapter. Other
generalisations concern the cases where trend is present, where the signal is
transient, where the observations and the signal are no longer scalar but are
vector time series, and where there is more than one signal. This last case
includes the situation where signal and noise are not incoherent.

3. Estimation

This section addresses the problems of fitting models such as (2.5) to data.
As in most model fitting, the procedure used is composed of three parts: (a)
exploratory data analysis and model selection, (b) parameter estimation and (c)
diagnostic model checking.
As models become more complicated and involve more parameters, the
model selection phase becomes increasingly important since there will be many
plausible models. In time series these models must be fitted by numerically
maximising some function or, equivalently, by solving a set of non-linear
equations. This becomes more difficult as the number of parameters increases,
particularly if good initial estimates of the parameters are not available or if the
model is a poor description of the data.
Unfortunately, the model selection itself becomes more complicated. For
example, model selection in ARMA models is generally based solely on the
autocorrelations and partial autocorrelations, whilst in transfer functions it is
either a multistage procedure (Box and Jenkins, 1976) or involves calculating
the impulse response function and noise autocorrelations from spectrum esti-
mates (Cameron, 1981; Pukkila, 1982). For models of the form of (2.5), the
simplest procedure is to calculate estimates of ν_j(ω) and f_{x,j}(ω) for a number of
frequencies and then to apply an inverse Fourier transform to these and choose
appropriate models on the basis of these derived quantities.
The estimation of ν_j(ω) and f_{x,j}(ω) is described in Subsection 3.1. Although
we have introduced these estimates as being part of the model selection
procedure, they are of interest in their own right in many applications,
especially if there are delays between the series (see Section 4).
Subsection 3.2 addresses the problem of fitting (2.5) and its various
parametric forms over all frequencies or possibly over some chosen band of
frequencies. For this estimation a parametric form is required for ν_j(ω) but not
for f_{x,j}(ω), which may be approximated by the narrow band estimates obtained
using the methods of Subsection 3.1. Finally, in Subsection 3.3, methods of
fitting (2.5) in the time domain are considered. Here all components of the
model must be parametrised.
The estimation procedures described in Subsections 3.1, 3.2 and 3.3 thus
form a natural sequence in fitting models such as (2.5). The exploratory phase
suggests the models to be fitted; the frequency domain procedure of Subsection
3.2 allows the transfer functions or ν_j(ω)'s to be modelled without a model also
being fitted to the noise processes. The time domain procedure allows a full,
exact maximum likelihood estimation of all parameters simultaneously. This
will be most useful if there are few observations (which may be the case with
economic data, for example). It will often be the case, however, that there are
sufficient data for frequency domain methods to be used and that it is only the
transfer function that is of interest, so that the time domain estimation
procedure will not be required.

3.1. Estimation: Narrow band


We now consider the problem of estimating the ν_j(ω) and f_{x,j}(ω) from a
sample of observations y(1), \ldots, y(N) generated by (2.5). These are most
expeditiously determined using the finite Fourier transform of the data,

W(ω_k) = (2πN)^{-1/2} \sum_{n=1}^N y(n) e^{inω_k},   0 ≤ k ≤ [N/2],   ω_k = \frac{2πk}{N} .   (3.1)

Here [x] denotes the integral part of x. These quantities are important because
(2π/N)^{1/2} W(ω) is an approximation to the component of frequency ω in the
spectral representation of y(n) and, if N is highly composite, they are
extremely cheap (i.e. rapid) to compute. Moreover, under certain quite general
conditions, the W(ω_k) are asymptotically independently distributed, each with a
complex multivariate normal distribution with zero mean vector and covari-
ance matrix f(ω_k) (see Hannan, 1970). Now the situation we have in mind is
that where the vi(~o) and the fxj(o)) vary slowly with frequency ~o. This means
Measuringattenuation 369

that estimates of the ~,j(w) and the fxj(W) at any given frequency o) should be
able to be constructed from the W(Wk) evaluated over those wk nearest to o9.
For the sake of clarity let us denote the chosen frequency of interest as A and
suppose that our estimates of the v~(A) and fx.j(A) will be based on the m values
of wk closest to A. An obvious estimation technique in this situation is to use
the method of maximum likelihood where the likelihood is given by the
probability density function derived from the asymptotic distribution of the
W(w) for the m values of ~ok nearest to A.
The relevant log-likelihood is proportional to

/(A)=-m '~_~ {logdetf(wk)+trDWl(Wk)W(wk)W(Wk)*]}, (3.2)

where det(·) and tr(·) denote the matrix operations of determinant and trace
respectively and Σ_λ is the sum over the m values of ω_k concerned. Moreover,
in keeping with the assumption that f(ω) is not varying to any degree over this
narrow band of frequencies, for the ω_k near λ, we set

    f(ω_k) = f_x^{1/2} (I + νν*) f_x^{1/2}.  (3.3)

Here I is the p-row identity matrix, the asterisk denotes transposition com-
bined with conjugation, the diagonal matrix f_x has typical diagonal element
f_{x,j}(λ) and the p-dimensional vector ν has typical element ν_j(λ). Note that,
given f(ω) as specified by (3.3), we can only know the |ν_j(λ)| and the phase
differences φ_j(λ) − φ_k(λ), where |ν_j(λ)| and φ_j(λ) are the modulus and argument
respectively of ν_j(λ). Additional information is necessary in order to identify
the individual φ_j(λ). We shall assume that φ_1(λ) is zero. Thus the first recorder
is chosen as the recorder relative to which the phase differences φ_j(λ) will be
measured.
Now, with f(ω) given by (3.3), maximising (3.2) is equivalent to minimising

    Σ_{j=1}^{p} log f_{x,j} + log(1 + ν*ν) + Σ_{j=1}^{p} f̂_jj/f_{x,j} − (1 + ν*ν)^{-1} ν* f_x^{-1/2} f̂ f_x^{-1/2} ν,

where, for simplicity, the argument λ has been omitted and

    f̂ = m^{-1} Σ_λ W(ω_k) W(ω_k)*.
The parameters ν_j(λ) and f_{x,j}(λ) could be estimated by direct numerical
maximisation of l(λ), or else the derivatives of l(λ) with respect to the unknown
parameters may be calculated and the estimates found by solving the resulting
estimating equations. These are

    f̂_x^{-1/2} f̂ f̂_x^{-1/2} ν̂ = (1 + ν̂*ν̂) ν̂  (3.4)

and

    f̂_{x,j} = (1 + |ν̂_j|²)^{-1} f̂_jj,  j = 1, ..., p.  (3.5)

In the special case when the noise spectra at the different recorders are
assumed equal, these equations may be solved explicitly, yielding

    f̂_x = {tr f̂ − ν̂* f̂ ν̂/(1 + ν̂*ν̂)}/(p − 1),  (3.6)

and ν̂ is the eigenvector of f̂ corresponding to the maximum eigenvalue of f̂. If


there are only two recorders, such an assumption must be made. In the most
general situation, (3.4) and (3.5) are the complex analogues of the equations
that arise when fitting a conventional factor analysis model in the case where
there is only one factor. If we write f̂(λ) and ν̂ in terms of their real and
imaginary parts, i.e.

    f̂(λ) = C(λ) + iQ(λ),  ν̂ = ν̂_R + iν̂_I,

where C(λ) is symmetric and Q(λ) is skew-symmetric, then (3.4) and (3.5) become

    diag(f̂_x^{-1/2}, f̂_x^{-1/2}) [C(λ), −Q(λ); Q(λ), C(λ)] diag(f̂_x^{-1/2}, f̂_x^{-1/2}) (ν̂_Rᵀ, ν̂_Iᵀ)ᵀ
        = (1 + ν̂_Rᵀν̂_R + ν̂_Iᵀν̂_I) (ν̂_Rᵀ, ν̂_Iᵀ)ᵀ,  (3.7)

    f̂_{x,j}(λ) = (1 + ν̂²_{R,j} + ν̂²_{I,j})^{-1} C_jj(λ).  (3.8)
Equations (3.7) and (3.8) can now be solved using methods similar to con-
ventional factor analysis procedures. Clearly, if good a priori estimates of the
f_{x,j}(λ) were known, then (3.4) states that ν̂ is proportional to the
eigenvector of f̂_x^{-1/2} f̂(λ) f̂_x^{-1/2} associated with the largest eigenvalue. The constant
of proportionality is the square root of the difference between the eigenvalue
and unity. This observation and the simple form of the relationship (3.5)
suggest that one might solve (3.4) and (3.5) numerically by first fixing the f_{x,j}
and determining ν̂, then adjusting the f_{x,j} using (3.5) and recomputing ν̂,
iterating until convergence. Unfortunately, this algorithm is frequently very
slow to converge since it fails to take into account the covariation between
small changes in the ν̂_j and the f̂_{x,j}. This problem has been discussed by
Joreskog (1967) and Lawley (1967). It is thus preferable to maximise the
log-likelihood directly or, equivalently, to devise a Newton-Raphson algorithm
based on (3.4) and (3.5), which takes into account the covariation and the fact
that (3.2) is being maximised. Experience in standard factor analysis suggests
that algorithms based on an optimisation technique of Fletcher and Powell
(1963) converge reliably. This is discussed in greater detail by Joreskog (1978)
and by Geweke and Singleton (1981).
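For concreteness, here is a minimal sketch (in Python, with names of our own)
of the simple alternating scheme just described: (3.4) is solved by an
eigendecomposition for fixed f_{x,j}, and the f_{x,j} are then updated from (3.5). As
the text warns, this scheme can converge slowly, and a Newton-type algorithm
would normally be preferred in practice.

    import numpy as np

    def narrow_band_estimates(W, n_iter=500, tol=1e-9):
        # W: (m, p) array of Fourier coefficients W(omega_k) over the m
        # frequencies nearest lambda.  Returns (nu, fx).
        m, p = W.shape
        fhat = W.T @ W.conj() / m            # \hat f = m^{-1} sum W W*
        fx = np.diag(fhat).real.copy()       # start: all power to noise
        nu = np.zeros(p, dtype=complex)
        for _ in range(n_iter):
            # (3.4): nu is sqrt(eigenvalue - 1) times the leading
            # eigenvector of fx^{-1/2} fhat fx^{-1/2}
            S = fhat / np.sqrt(np.outer(fx, fx))
            w_eig, v_eig = np.linalg.eigh(S)
            lam, v = w_eig[-1], v_eig[:, -1]
            nu_new = np.sqrt(max(lam - 1.0, 0.0)) * v
            # identification phi_1(lambda) = 0: rotate so nu_1 >= 0 is real
            if abs(nu_new[0]) > 0:
                nu_new = nu_new * np.exp(-1j * np.angle(nu_new[0]))
            # (3.5): f_{x,j} = (1 + |nu_j|^2)^{-1} fhat_jj
            fx_new = np.diag(fhat).real / (1.0 + np.abs(nu_new) ** 2)
            done = np.allclose(fx_new, fx, rtol=0, atol=tol) and \
                   np.allclose(nu_new, nu, rtol=0, atol=tol)
            nu, fx = nu_new, fx_new
            if done:
                break
        return nu, fx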
Under relatively mild regularity conditions (see Thomson, 1982), it can be
shown that the resulting estimators of ν and the f_{x,j}(λ) are strongly consistent.
Moreover, suppose the vector α_0 is defined as

    α_0 = (f_{x,1}(λ), ..., f_{x,p}(λ), |ν_1(λ)|, ..., |ν_p(λ)|, φ_2(λ), ..., φ_p(λ))ᵀ,



where |ν_j(λ)| and φ_j(λ) are as before the modulus and argument respectively of
ν_j(λ). If α̂ is defined as the corresponding vector estimator of α_0 obtained from
(3.4) and (3.5), then for m and N large enough, m^{1/2}(α̂ − α_0) has an asymptotic
multivariate normal distribution with zero mean vector and covariance matrix
Γ. Here

    Γ = diag{ [J_1, J_12; J_12ᵀ, J_2]^{-1}, J_3^{-1} },

where the p × p matrices J_1, J_2, J_12 (= J_21ᵀ) and the (p − 1) × (p − 1) diagonal
matrix J_3 have typical elements

    J_1,jk = {1 + |ν_j|²(1 + ν*ν)^{-1}(ν*ν − |ν_j|²)}/f²_{x,j},        j = k,
           = −|ν_j|²|ν_k|²(1 + ν*ν)^{-1}/(f_{x,j} f_{x,k}),           j ≠ k,

    J_2,jk = 2(1 + ν*ν)^{-1}{ν*ν − |ν_j|²(1 + 3ν*ν)(1 + ν*ν)^{-1}},   j = k,
           = −2|ν_j||ν_k|(1 + 3ν*ν)(1 + ν*ν)^{-2},                    j ≠ k,

    J_12,jk = |ν_j|(2 + ν*ν − |ν_j|²){f_{x,j}(1 + ν*ν)}^{-1},         j = k,
            = −|ν_j||ν_k|²{f_{x,j}(1 + ν*ν)}^{-1},                    j ≠ k,

    J_3,jj = 2|ν_j|² ν*ν (1 + ν*ν)^{-1}.

3.2. Estimation: Wide band


The narrow band estimates provide a useful decomposition of the obser-
vations and are computed relatively easily from standard factor analysis soft-
ware. These estimates may be sufficient in many circumstances. It will often be
the case, however, that a model will be fitted to the attenuation over a broad
band of frequencies (or all frequencies) for comparison with theory, for
forecasting or perhaps just to obtain a smoother estimate. In this case the
narrow band estimates may be used to help choose models for the individual
components of (2.5) and to provide initial parameter estimates.
In this section it is assumed that the ν_j(ω)'s depend on a vector θ of
unknown parameters and they are written as ν_j(ω; θ) to indicate this. The form
of the parametrisation is not important here. However, stationary processes are
usually modelled by ARMA models, transfer functions are often modelled by
ratios of polynomials, and so models for ν_j(ω; θ) will, most commonly, be ratios
of trigonometric polynomials. Note that the noise spectra are not parametrised
here. That may be done as a straightforward extension of the methods used
here or by using the time domain procedures described in Subsection 3.3.
However, that first requires a good parametric model of the noise processes. A
good way of choosing a model for these processes is from the estimated noise
spectra derived in Subsection 3.1. These may be inverted using the discrete

Fourier transform to obtain estimates of the serial correlation functions and
other quantities derived from them.
In practice, estimates of all the unknown parameters will be calculated using
a numerical optimisation algorithm. If this is to find a global optimum
efficiently, and if there are a large number of parameters to be estimated, then
the models for the various components f_x(ω) and ν(ω) must be reasonable fits
to the data and the initial estimates should be close to the optimal ones. Since
the number of unknown parameters may grow very rapidly, the best strategy is
to use the methods of this section to estimate the parameters defining the
ν_j(ω)'s before attempting to model the noise processes. It will be shown below,
however, that if the noise spectra are assumed to be equal, then θ may be
estimated without explicitly estimating the noise spectra.
As in the narrow band case, the estimation of θ is performed by maximising
a likelihood derived from the asymptotic distribution of W(ω). The likelihood
is calculated assuming that the noise spectra are known, though in practice they
will not be known and will be replaced by estimates obtained in the manner
described in Subsection 3.1. The estimate of θ need not depend on all
frequencies available. Let B be a finite union of intervals in (−π, π) that is
symmetric about, but does not include, the origin. Then the vector, θ, of
parameters describing the ν_j's may be estimated by maximising the log-
likelihood, which is equivalent to minimising

    C_N(θ) = (m/2N) Σ_{λ∈B} [log{det f(λ)} + tr{f^{-1}(λ) f̂(λ)}]  (3.9)

with respect to the parameter θ. Except for a scaling factor, C_N(θ) is just the
sum of terms of the form (3.2) calculated over a number of non-overlapping
narrow bands. Differentiating C_N(θ) with respect to θ leads to the estimating
equations

    (m/2N) Σ_{λ∈B} tr[ f^{-1}(λ) {∂f(λ)/∂θ} f^{-1}(λ) {f̂(λ) − f(λ)} ] = 0,  (3.10)

where

    ∂f(λ)/∂θ = f_x^{1/2}(λ) { (∂ν(λ; θ)/∂θ) ν*(λ; θ) + ν(λ; θ) (∂ν*(λ; θ)/∂θ) } f_x^{1/2}(λ).
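In practice (3.9) would usually be minimised by a numerical optimiser rather
than by solving (3.10) directly. The following sketch (our own names; the
parametric form nu_model(lam, theta) and the band-by-band noise spectra fx
are assumed supplied, e.g. from the narrow band estimates of Subsection 3.1)
illustrates the computation:

    import numpy as np
    from scipy.optimize import minimize

    def c_n(theta, lams, fhat, fx, nu_model, m, N):
        # Criterion (3.9): (m/2N) * sum over bands of
        # log det f(lambda) + tr{f^{-1}(lambda) fhat(lambda)}.
        total = 0.0
        for lam, fh, fxl in zip(lams, fhat, fx):
            nu = nu_model(lam, theta)
            root = np.sqrt(fxl)
            f = np.outer(root, root) * (np.outer(nu, nu.conj())
                                        + np.eye(len(nu)))
            sign, logdet = np.linalg.slogdet(f)
            total += logdet + np.trace(np.linalg.solve(f, fh)).real
        return m * total / (2.0 * N)

    # theta_hat = minimize(c_n, theta0,
    #                      args=(lams, fhat, fx, nu_model, m, N)).x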
Suppose that θ̂ is the solution to (3.10) and that θ_0 is the true value of θ. It is
proved in Cameron (1983) that this estimator of θ is strongly consistent and
that N^{1/2}(θ̂ − θ_0) is asymptotically a Gaussian random vector with mean zero
and covariance matrix

    A^{-1} + A^{-1} K A^{-1},

where the (j, k) element of A is

    (4π)^{-1} ∫_B tr{ f^{-1}(λ) (∂f(λ)/∂θ_j) f^{-1}(λ) (∂f(λ)/∂θ_k) } dλ

and the (j, k) element of K is

    (2π)^{-2} Σ_{a,b,c,d} ∫_B ∫_B Φ(j)_{ab}(λ) Φ(k)_{cd}(λ′) F_{abcd}(λ, −λ, −λ′, λ′) dλ dλ′.

Here Φ(j)_{ab}(λ) is the (a, b) element of the matrix f^{-1}(λ)[∂f(λ)/∂θ_j]f^{-1}(λ) and
F_{abcd}(λ, −λ, −λ′, λ′) is the Fourier transform of the fourth cumulant function of
y(n). The integrand in A simplifies to

    tr{ [I + νν*]^{-1} (∂/∂θ_j)(νν*) [I + νν*]^{-1} (∂/∂θ_k)(νν*) },

where the arguments λ and θ of ν have been deleted, and

    Φ(j) = f_x^{-1/2} [I + νν*]^{-1} (∂/∂θ_j)(νν*) [I + νν*]^{-1} f_x^{-1/2}.

The matrix A is estimated consistently by the matrix of second derivatives of
C_N(θ). In the important case where the data are Gaussian, the matrix K is
null and N^{1/2}(θ̂ − θ_0) has an asymptotic Gaussian distribution with mean zero
and covariance matrix A^{-1}. In the non-Gaussian case, Taniguchi (1982) gives a
method of obtaining consistent estimates of K.
If, in addition to assuming that the noise spectra are constant over narrow
bands, it is also assumed that within each band the spectra of the noise are
equal, then the estimating equations may be simplified. For a given θ, ν(λ; θ)
may be evaluated for each frequency band and thus, using (3.6), estimates of the
common noise spectrum may be obtained. When this is substituted into C_N(θ),
the resulting expression simplifies and, omitting an additive constant, reduces
to

    C_N^{(1)}(θ) = m(2N)^{-1} Σ_{λ∈B} [(p − 1) log{tr f̂(λ) − R(λ; θ)} + log R(λ; θ)],

where

    R(λ; θ) = m^{-1} Σ_λ |ν(λ; θ)* W(ω_s)|² / {ν(λ; θ)* ν(λ; θ)}.

3.3. Time domain estimation
If parameters are to be estimated in the time domain, then not only must the
attenuation be appropriately parametrised but so also must the signal and
noise processes. In addition, constraints must be introduced in order that the
parameters be identifiable.
Suppose that the signal and noise processes are assumed to be ARMA
processes and that the transfer function a_j(ω) is a ratio of trigonometric
polynomials. Then the observed series y_j(n), j = 1, ..., p, form a vector
ARMA model from a particular parametric family, and so for given values of
the unknown parameters the exact likelihood can be calculated (see, for
example, Nicholls and Hall, 1979). An algorithm such as that of Fletcher and
Powell may then be used to find the parameter values maximising this likeli-
hood. Difficulties arise, however, in choosing the correct models and lags for
the signal and noise processes and the attenuation, and in obtaining good initial
estimates of the unknown parameters. Thus in all but the simplest cases the
estimation methods described in Subsections 3.1 and 3.2 should be used before
attempting a full time domain model.
A different approach to calculating the likelihood in the time domain, which
is in some ways more direct, arises by noting that (2.5) is similar in form to the
observation equation in a state-space model, except that the noise here may be
coloured. Since the noise processes may also be written in state-space form, the
model (2.5) may be written in standard state-space form by augmenting the
state to include not only the signal but also the state of the coloured noise
processes. Once the model has been written in state-space form, the Kalman
filter can be used to compute the innovations recursively and a Gaussian
likelihood may be evaluated. This likelihood may be maximised to obtain
estimates of the unknown parameters.
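The following is a minimal sketch of that computation for a generic,
already-augmented state-space form; the matrices, initial conditions and
function name are illustrative assumptions:

    import numpy as np

    def gaussian_loglik(y, T, Z, Q):
        # State:       s(n) = T s(n-1) + w(n),  Var{w(n)} = Q
        # Observation: y(n) = Z s(n), the coloured noise having been
        # absorbed into the augmented state as described in the text.
        # The filter is run in prediction form; v and F are the
        # innovations and their variances, giving the Gaussian likelihood.
        k = T.shape[0]
        p = y.shape[1]
        a = np.zeros(k)                  # predicted state mean
        P = 1e6 * np.eye(k)              # diffuse-ish initial variance
        loglik = 0.0
        for yn in y:
            v = yn - Z @ a               # innovation
            F = Z @ P @ Z.T              # innovation variance
            Fi = np.linalg.inv(F)
            loglik -= 0.5 * (p * np.log(2 * np.pi)
                             + np.log(np.linalg.det(F)) + v @ Fi @ v)
            K = T @ P @ Z.T @ Fi         # Kalman gain
            a = T @ a + K @ v
            P = T @ P @ T.T + Q - K @ F @ K.T
        return loglik

Maximising this function over the model parameters (which enter through T, Z
and Q) then gives the estimates.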
Engle and Watson (1981) use this procedure to estimate parameters in an
example where they have 25 observations of each of 5 series. In their case, the
signal follows a second-order autoregression and each of the noises is a
first-order autoregression. The impulse response of the transfer function in-
volves only the present value of the signal, so that the a_j(ω) are taken as
constant. The state is a vector of length seven, the components at time n being

    s(n), s(n − 1), x_1(n), ..., x_5(n).

Once the models for the various components have been chosen and the
Kalman filter appropriately parametrised, the likelihood is easily evaluated.
Again the likelihood must be maximised numerically, and Engle and Watson
suggest using an algorithm based on the method of scoring. Aasnaes and
Kailath (1973) provide a direct recursion for state-space estimation in coloured
noise without going through the augmentation step.
As mentioned earlier, a difficulty with time domain estimation lies in the
need to choose the correct orders of the models for all components. An

additional problem arises because all of the parameters must be estimated
simultaneously. If there are many parameters, a large amount of computation
is required unless good initial parameter estimates are available. If one works
solely in the time domain, it is difficult to uncouple the signal and noise
components, so that a sensible decision about the orders of each of the
components cannot be made. As we have shown in Subsection 3.1, however, it
is easy to produce band by band estimates of the ν_j(ω) and of the noise
spectrum, and therefore to calculate 'model-free' estimates of the impulse
response functions and of the autocorrelation functions of the noise processes.
From these the orders of the models may be chosen and initial parameter
estimates calculated.
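For instance, given narrow band estimates ν̂(ω) on an equispaced grid of band
centres (an assumption of this sketch, with names of our own), a 'model-free'
impulse response can be obtained by inverting the transform:

    import numpy as np

    def impulse_response(nu_hat):
        # nu_hat: estimates at frequencies omega_u = 2*pi*u/M,
        # u = 0, ..., M/2.  Extend to the full circle by conjugate
        # symmetry (the series are real), then invert the transform.
        full = np.concatenate([nu_hat, np.conj(nu_hat[-2:0:-1])])
        return np.fft.ifft(full).real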

4. Estimation in the presence of delays

We now consider the situation where the model incorporates delayed as well
as attenuated forms of the signal. If the delay is frequency dependent, then a
simple time domain model cannot be given and estimation of the delay must be
performed in the frequency domain. If the delay is not frequency dependent,
then it may be approximated by a time domain model (Chan, Riley and Plant,
1980). However, a simpler procedure is to estimate delay in the frequency
domain, realign the series and then, if time domain models are required, use
the methods of Subsection 3.3 to estimate the remaining parameters in the
model. Only frequency domain methods are considered in this section.
The model is as given by (2.9) and the spectral density matrix of the observed
process, expressed in terms of the ν_j(ω), f_{x,j}(ω) and the τ_j(ω), now has typical
element

    f_jk(ω) = (ν_j(ω) ν̄_k(ω) + δ_jk) f_{x,j}^{1/2}(ω) f_{x,k}^{1/2}(ω) e^{i(τ_j(ω)−τ_k(ω))ω}.  (4.1)

The effect of delay, if unaccounted for, will lead to biased estimators. Consider,
for example, the case where, over a narrow band of frequencies centred at
frequency λ, the ν_j(ω), f_{x,j}(ω) and τ_j(ω) are approximately constant. Then,
keeping the bandwidth fixed but allowing N to increase, the estimator of f(λ)
given by

    f̂(λ) = m^{-1} Σ_λ W(ω_k) W(ω_k)*

will converge to the matrix with typical element

    f_jk(λ) (2δ)^{-1} ∫_{λ−δ}^{λ+δ} e^{i(τ_j(λ)−τ_k(λ))(ω−λ)} dω
        = f_jk(λ) sin{(τ_j(λ) − τ_k(λ))δ} / {(τ_j(λ) − τ_k(λ))δ}.  (4.2)

Here the bandwidth is 2δ. Now, if the τ_j(λ) − τ_k(λ) are large, then (4.2) shows

that the off-diagonal cross-spectral estimates f̂_jk(λ), j ≠ k, will be biased down-
wards, but the estimates of the spectra f̂_jj(λ) will be unbiased. This will have the
effect of inflating the estimates of the noise spectra f_{x,j}(λ) and seriously
deflating the estimates of the attenuation function ν_j(λ). Hence it is important
to take account of any significant phase variation caused by delays.
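The size of this effect is easy to gauge from (4.2). For instance, in the sketch
below (function name ours), a delay difference of 10 time units across a band
of half-width δ = 0.1 attenuates |f_jk| by the factor sin(1)/1 ≈ 0.84, a 16 per
cent downward bias:

    import numpy as np

    def delay_bias_factor(tau_diff, delta):
        # Multiplicative bias sin(d)/d from (4.2), d = (tau_j - tau_k)*delta.
        d = tau_diff * delta
        return np.sinc(d / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)

    # delay_bias_factor(10.0, 0.1)  ->  sin(1)/1 = 0.841...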
We consider two situations: first, the case of correcting for significant phase
variation due to delays over a narrow band and, second, the case of correcting
for such variation over any arbitrary range of frequencies.

4.1. The narrow band case


We consider, as in Subsection 3.1, the narrow band of m frequencies
ω_k = 2πk/N closest to some chosen frequency λ. The relevant log-likelihood is
given by (3.2), but with f(ω) given by (4.1), i.e.

    f(ω) = Δ(ω; τ) f_x^{1/2} (I + νν*) f_x^{1/2} Δ*(ω; τ).  (4.3)

Here f_x and ν are as defined below (3.3) and Δ(ω; τ) is a diagonal matrix with
typical element exp{iτ_j(λ)(ω − λ)}. As before we are requiring that ν_j(ω), f_{x,j}(ω)
and τ_j(ω) be effectively constant over the narrow band of frequencies, and we
have chosen to describe the phase of f_jk(ω) near λ as

    φ_j(λ) − φ_k(λ) + (τ_j(λ) − τ_k(λ))(ω − λ).

Thus φ_j(λ), the argument of ν_j(λ), now incorporates τ_j(λ)λ, the phase at
frequency λ due to delays. To identify all the parameters we shall again
measure all phase differences relative to the first recorder and so

    φ_1(λ) = τ_1(λ) = 0.

Maximising this log-likelihood with respect to the parameters yields

    f̂_x^{-1/2} f̂(λ; τ̂) f̂_x^{-1/2} ν̂ = (1 + ν̂*ν̂) ν̂,  (4.4)

    f̂_{x,j}(λ) = (1 + |ν̂_j(λ)|²)^{-1} f̂_jj(λ; τ̂),  j = 1, ..., p,  (4.5)

    ν̂* f̂_x^{-1/2} {∂f̂(λ; τ̂)/∂τ_j} f̂_x^{-1/2} ν̂ = 0,  j = 2, ..., p,  (4.6)

where

    f̂(λ; τ) = m^{-1} Σ_λ Δ*(ω_s; τ) W(ω_s) W(ω_s)* Δ(ω_s; τ),

i.e.

    f̂_jk(λ; τ) = m^{-1} Σ_λ W_j(ω_s) W̄_k(ω_s) e^{−i(τ_j(λ)−τ_k(λ))(ω_s−λ)}.

Here the (p − 1)-dimensional vector τ has typical element τ_j(λ), j = 2, ..., p. In


principle, (4.4)-(4.6) can be solved by a numerical procedure such as the

Newton-Raphson procedure, with due account being taken of the fact that the
required solutions must maximise (3.2). However, it can be shown that,
appropriately normalised, the estimators τ̂_j are asymptotically independent of ν̂
and the f̂_{x,j}(λ). This suggests that the following two-stage iterative scheme
might have computational advantages. First, select initial estimates of the τ_j(λ)
and solve (4.4) and (4.5) using factor analysis techniques as before. Then, using
the resulting estimates of ν̂ and the f̂_{x,j}, solve (4.6). Note that, given ν and f_x,
(4.6) is equivalent to maximising

    ν̂* f̂_x^{-1/2} f̂(λ; τ) f̂_x^{-1/2} ν̂ = m^{-1} Σ_λ |ν̂* f̂_x^{-1/2} Δ*(ω_s; τ) W(ω_s)|².
With the new estimates of τ_j(λ), repeat the first step and so on. Schemes such
as this, based on first solving (4.4) and (4.5) and then (4.6), will be asymptotically
equivalent to a full Newton-Raphson procedure. However, as in any numerical
maximisation procedure, it is important to get good first estimates, especially of
the delays τ_j(λ).
If we had once and for all estimates of the delays τ_j(λ), we would then apply
the standard estimation procedure described in Subsection 3.1. Such a pro-
cedure would have computational advantages over the direct solution of
(4.4)-(4.6). In Hannan and Thomson (1973) a non-parametric estimate of the
(group) delay in the case of two recorders is given. A simple generalisation of
their technique leads us to estimate the τ_j(λ) by the values of τ_j(λ) that minimise
det f̂(λ; τ). Indeed, this criterion follows from maximising (3.2) with f(ω)
replaced by Δ(ω; τ) f(λ) Δ*(ω; τ) and taking the f_jk(λ) and the τ_j(λ) as the
parameters. To compare this estimator with that obtained by maximising the
original likelihood involving ν(λ), f_x(λ) and τ(λ), consider maximising the
likelihood over ν and the f_{x,j} for any given τ. The resulting maximised value of
the likelihood is
    Σ_{j=1}^{p} log(1 + |ν̂_j|²) − log(1 + ν̂*ν̂) = −log det Ẑ(λ; τ),

where Ẑ(λ; τ) has typical element

    Ẑ_jk(λ; τ) = f̃_jk(λ; τ)/{f̃_jj(λ; τ) f̃_kk(λ; τ)}^{1/2}  (4.7)

and

    f̃(λ; τ) = f̂_x^{1/2} (I + ν̂ν̂*) f̂_x^{1/2}.  (4.8)

Thus the maximum likelihood estimate of τ is the value of τ minimising
det Ẑ(λ; τ), where Ẑ(λ; τ) is the estimated (complex) coherence matrix. Now
minimising det f̂(λ; τ) is clearly equivalent to minimising det Z̃(λ; τ), where
Z̃(λ; τ) is formed from f̂(λ; τ) in the same way as Ẑ(λ; τ) was formed from
f̃(λ; τ). Thus we see that the two criteria, i.e. minimising det Ẑ(λ; τ) and
minimising det Z̃(λ; τ), are of essentially the same character. Indeed, they can

be shown to be asymptotically equivalent. These considerations lead us to
estimate the τ_j(λ) by minimising det f̂(λ; τ) and then to estimate ν and the
f_{x,j}(λ) using the standard estimation procedure described in Subsection 3.1 with
f̂(λ) replaced by f̂(λ; τ̂).
In terms of asymptotic properties it can be shown that, under suitable regularity
conditions, the τ̂_j(λ) are strongly consistent estimators of the τ_j(λ) and that the
estimators of ν and the f_{x,j}(λ) obtained from f̂(λ; τ̂) have the same properties
as before. Moreover, m^{3/2}N^{-1}(τ̂ − τ) is asymptotically distributed independently
of ν̂ and the f̂_{x,j}(λ) and has an asymptotic multivariate normal distribution with zero
mean and covariance matrix B^{-1}, where

    B_jk = (2π²/3) |ν_j|² (ν*ν − |ν_j|²)/(1 + ν*ν),  j = k,
         = −(2π²/3) |ν_j|² |ν_k|²/(1 + ν*ν),         j ≠ k.  (4.9)

For further details concerning the proof of these results, see Thomson (1982).
Note that the chosen criterion of minimising det f̂(λ; τ) has the following
interpretation. Let

    ζ(n) = m^{-1/2} Σ_λ W(ω_s) exp{−i2πns/m},  n = 1, ..., m,  (4.10)

and

    l_j = τ_j(λ) m/N,  j = 1, ..., p.  (4.11)

Then it can be shown that minimising det f̂(λ; τ) is almost the same as
minimising the determinant of the matrix with typical element

    m^{-1} Σ_{n=1}^{m} ζ_j(n + l_j) ζ̄_k(n + l_k),  (4.12)

i.e. the generalised variance of the ζ_j(n + l_j). Roughly speaking, ζ(n) is the
output of a band pass filter acting on y(n) which passes only the band of
frequencies in question. The ζ_j(n) are then lagged in the obvious way and the
lagging is optimised by minimising a (generalised) variance.
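A sketch of this interpretation, with the lags taken as given integers and
circular shifts used for simplicity (names and conventions are ours), is:

    import numpy as np

    def generalised_variance(W_band, lags):
        # W_band: (m, p) array of W(omega_s) over the band; lags: integer
        # shifts l_j.  Computes the determinant of the matrix (4.12) built
        # from the band-limited series zeta(n) of (4.10).
        m, p = W_band.shape
        zeta = np.fft.fft(W_band, axis=0) / np.sqrt(m)      # (4.10)
        Z = np.stack([np.roll(zeta[:, j], -lags[j]) for j in range(p)],
                     axis=1)
        S = Z.T @ Z.conj() / m                              # (4.12)
        return np.linalg.det(S).real

Minimising this determinant over the lags then aligns the band-limited series.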
Finally, having determined estimates of the delays τ_j(λ), we can now give
corrected estimates of the attenuation coefficients as |ν̂_j(λ)| exp[i{φ̂_j(λ) − τ̂_j(λ)λ}].

4.2. The broad band case


The model for the data may be considered as having three distinct com-
ponents: the phase effects embodied in the delay τ(λ) and the arguments of the
ν_j(λ); the signal-to-noise ratio |ν(λ)|; and the noise spectrum f_x(λ). Each
component may be estimated within each band, or else can be modelled by a finite
parameter model across a wide band of frequencies. The precise method to be
used depends on which components are to be modelled by a finite number of
parameters and which are to be estimated band by band. Of course, when all are
estimated band by band, the problem reverts to that considered in the previous
section. Here we consider only the case where the phase effects are modelled
across broad bands, but the other parameters remain specific to each narrow band.
Now, in contrast to the assumption in Subsection 4.1, it is assumed that the
variation of {τ_j(ω) − τ_k(ω)}ω over any narrow band of frequencies is small
compared with the variation across some given broad band of frequencies B. In
the broad band case it is the variation between narrow bands that is to be
modelled, whereas in Subsection 4.1 it was the within band variation that was
modelled. It will be supposed, in this section, that the phase differences depend
on a vector, φ, of unknown parameters. The unknown parameters may be
simply the relative delays of the signal at the different recorders or, if the
signals have been dispersed, they may be coefficients of polynomials or splines
used to model the frequency dependence of the delays.
As in Subsection 3.2 we consider B to be the union of non-overlapping
narrow bands B_u of m frequencies about central frequencies λ_u. For any ω in
the band B_u, we model the spectral density of the process as

    f(ω) = Δ(λ_u; φ) f_x^{1/2}(λ_u) {I + ρ(λ_u) ρ(λ_u)ᵀ} f_x^{1/2}(λ_u) Δ(λ_u; φ)*,  (4.13)

where f_x(λ_u) is a diagonal matrix with typical diagonal element f_{x,j}(λ_u), ρ(λ_u) is
a vector with typical element |ν_j(λ_u)| and Δ(λ_u; φ) is a diagonal matrix with
typical diagonal element exp{ig_j(λ_u; φ)}.
Here g_j(λ_u; φ) = φ_j(λ_u; φ) + τ_j(λ_u; φ)λ_u, where φ_j(λ_u; φ) and τ_j(λ_u; φ) are
appropriate parametric forms for φ_j(λ) and τ_j(λ) that model the phase variation
over B.
If the f_x(λ_u) and ρ(λ_u) were known, then maximising the approximate
log-likelihood (or, equivalently, minimising (3.9)) with respect to φ reduces to
minimising

    −Σ_u {1 + ν*ν}^{-1} Σ_{j,k} |ν_j ν_k| {f_{x,j} f_{x,k}}^{-1/2} f̂_jk Δ̄_j(φ) Δ_k(φ),  (4.14)

where the frequency λ_u, on which each of the quantities depends, has been
omitted for simplicity. Note that (4.14) can also be written as

    −m^{-1} Σ_u (1 + ν*ν)^{-1} Σ_{B_u} |ρᵀ f_x^{-1/2} Δ*(φ) W(ω_s)|².  (4.15)

Now the quantity

    {1 + ν*ν}^{-1} |ν_j| |ν_k| {f_{x,j} f_{x,k}}^{-1/2} |f_jk|  (4.16)

may be consistently estimated by ĉ_jk(λ_u) ĉ^{jk}(λ_u), where ĉ_jk(λ_u) is the estimated
coherence at frequency λ_u between y_j(n) and y_k(n), and ĉ^{jk}(λ_u) is the (j, k)
element of the inverse of the matrix of estimated coherences {ĉ_jk(λ_u)}. Using
this estimate, (4.14) reduces, after a little manipulation, to

    Q(φ) = Σ_{u>0} Σ_{j,k} 2 ĉ_jk(λ_u) ĉ^{jk}(λ_u) cos[θ̂_jk(λ_u) − {g_j(λ_u; φ) − g_k(λ_u; φ)}].

Here θ̂_jk(λ_u) is the argument of m^{-1} Σ_λ W_j(ω_s) W̄_k(ω_s). The properties of the
estimate of φ which is obtained by maximising Q(φ) have been discussed in
Hamon and Hannan (1974). Usually, ĉ_jk(λ_u) is the standard estimator of
coherence, namely

    ĉ_jk(λ_u) = |f̂_jk(λ_u)|/{f̂_jj(λ_u) f̂_kk(λ_u)}^{1/2}.

However, Hannan and Thomson (1981) suggest fitting a vector autoregression
to the data and then estimating the coherences from the autoregressive spectra.
Their simulations show that using the autoregression leads to improved esti-
mates of φ. However, the number of parameters fitted increases as p² and does
not take account of the hypothesised structure of the observations.
Alternatively, the structure of the underlying model can be more directly
utilised to estimate (4.16). Consider in (4.16) replacing ν by ν̂, f_x by f̂_x and |f_jk|
by the asymptotically equivalent form |ν̂_j ν̂_k| {f̂_{x,j} f̂_{x,k}}^{1/2}. Here ν̂ and f̂_x are the
estimators obtained using the narrow band techniques of Subsection 3.1. In this
case, (4.14) becomes

    Q_1(φ) = Σ_{u>0} (1 + ν̂*ν̂)^{-1} Σ_{j,k} |ν̂_j|² |ν̂_k|² cos[θ̂_jk(λ_u) − {g_j(λ_u; φ) − g_k(λ_u; φ)}].

In similar vein, if ρ̂ has typical element |ν̂_j|, (4.15) yields the criterion

    Q_2(φ) = m^{-1} Σ_u (1 + ν̂*ν̂)^{-1} Σ_{B_u} |ρ̂ᵀ f̂_x^{-1/2} Δ*(φ) W(ω_s)|².

The properties of the estimators that maximise these criteria are the same as
those of the estimator which maximises Q(φ).
Note that the methods of Subsection 3.2 could be used to estimate strongly
consistent, 'smooth' values of ν, and thus of the weight function (4.16). To
completely match the procedure of Hannan and Thomson, a finite parameter

model should also be fitted to the noise processes. In this case the number of
parameters increases only as p and the parameters are easily interpreted. The
cost is extra computation. At this stage no comparison of the finite sample
properties of this procedure with the others has been performed.

5. Applying the methods

Some of the practical problems that can arise in fitting models of the type
described earlier are discussed in this section. These questions fall into two
categories: (i) assuming the model is correct, how should the parameters be
estimated? and (ii) what can be done to check that the model fitted is
appropriate, and what is the effect of a poor model on the parameter esti-
mates?
The questions in the first category are of most importance when frequency
domain methods are to be used, and cover problems such as whether the data
should be prefiltered, what frequency bands should be used and how wide
these should be. Of necessity the discussion of such matters is more nebulous
than that presented in earlier sections, where the data are always stationary and
the observed records are assumed to be long. However, when sample sizes are
finite and spectra are changing rapidly, poor data analysis may result if the
practical problems are not treated thoughtfully.
The main problems that arise when estimating parameters using frequency
domain methods are the result of biases in the spectrum estimates caused by
rapid changes in power or phase across a narrow band of frequencies. This was
mentioned at the beginning of Section 4, where it was shown that if there is a
large delay between recorders, then the modulus of the estimated ν_j's may be
biased downwards. The delay may also be less efficiently estimated in such
circumstances. See, for example, the simulation results in Hannan and Thom-
son (1981). These biases may be substantially reduced by realigning the series
(possibly in a frequency dependent way) to reduce the phase changes, by
prefiltering the observations so that their spectra are approximately flat, and by
making the individual frequency bands narrower. Decreasing the width of the
individual frequency bands increases the variance of spectral quantities within
the band. However, if effective realignment and prewhitening are performed,
then the width of the bands will not be critical. Unless data are delayed by an
integral number of time units, the simplest way to delay a series by τ(λ) time
units is to compute

    x̃(n) = (2π/N)^{1/2} Σ_k e^{−iτ(ω_k)ω_k} W_x(ω_k) e^{−inω_k}.

That is, x(n) is Fourier transformed, the coefficient at frequency ω_k is multi-
plied by exp{−iτ(ω_k)ω_k} to introduce the delay, and then an inverse Fourier
transform is applied. If τ is an integer constant, then this moves the first τ

observations from the beginning of the series to the end. This may not always
be an appropriate transformation, so an improved procedure would be to
mean-correct the series, add zeroes to the ends and multiply by an appropriate
taper before performing the transform described above.
So that appropriate prewhitening and realigning can be performed, explora-
tory analyses of the data will be required. These preliminary analyses should
also be used to suggest good parametrisations of the different components in
the model and to suggest initial parameter estimates to be used in the iterative
maximisation of the likelihood. A general procedure for estimating attenuation
and delay should thus include the following steps:
(i) Estimate the spectrum of each observed series separately, using a few
different bandwidths and also an autoregressive estimator. Choose a filter
which will, approximately, whiten each of the observed series.
(ii) From the whitened series, estimate the phase difference functions,
compute approximate delays between the observed series and use these to
realign the observations, if necessary.
(iii) Compute the coherence functions for the prewhitened and realigned
series. This should show the frequency range in which the signal common to
the different recorders is discernible over the background noise.
(iv) For narrow bands over the frequency range of interest, perform the
narrow band analysis described in Subsection 3.1 or, if significant delays
between the observed series remain, perform the analysis in Subsection 4.1.
This analysis should suggest parametric forms for the delay and attenuation.
Initial estimates of these parameters as well as estimates of the spectra of the
noise processes should be obtained at this stage.
(v) A broad band analysis of the form of those described in Subsections 3.2
and 4.2 may now be performed.
(vi) If a time domain estimation is to be performed, the data should be
corrected for any remaining delays estimated in (v). Also noise models may be
chosen and preliminary parameter estimates obtained from the estimates of the
noise spectra obtained in (iv).
(vii) The final estimates should be modified to account for any preliminary
whitening or realignment that has been performed.
The steps outlined above may need to be performed more than once. For
example, if there is a large delay between two series, then, except at very low
frequencies, the phase difference between the two will change rapidly and so a
reasonable estimate of the delay between the two series will be difficult to
obtain initially. However, an approximate realignment will make the next
estimate much better.
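As a small illustration of step (i), the sketch below fits an autoregression by
least squares and filters the series with the estimated coefficients; the order
and the function name are illustrative assumptions:

    import numpy as np

    def prewhiten(y, order=3):
        # Regress y(t) on y(t-1), ..., y(t-order) and return the residual
        # series, whose spectrum should be approximately flat.
        X = np.column_stack([y[order - i - 1:len(y) - i - 1]
                             for i in range(order)])
        coef, *_ = np.linalg.lstsq(X, y[order:], rcond=None)
        return y[order:] - X @ coef, coef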
It remains to consider whether the model fitted is a good description of the
data. Difficulties may arise early in the estimation procedure if an inap-
propriate model is chosen. For example, if the noise at different recorders is
correlated (often called signal associated noise), then narrow band analysis will
yield noise spectrum estimates which are biased down and may be negative

unless the estimation is constrained. If, for a particular band, the noise
spectrum is biased downwards, then the weight attached to that band will be
inflated in a broad band analysis leading to biased estimates of the attenuation
and delay. Signal associated noise is not uncommon, particularly in geophysical
data where the same signal may follow more than one path. If one path is
substantially longer than the other, then the amplitude of the signal received
after travelling along the longer path will often be diminished so as to be of the
same magnitude as background noise. When signal associated noise is detected,
the only solution is to model it as a second signal. When there is more than one
signal, the basic ideas for parameter estimation are the same as those described
here, but greater problems of identifiability arise. This will not be discussed
here. To check if there is more than one signal present, a principal component
analysis of the cross-spectral matrix for each frequency band of interest could
be performed (see Brillinger, 1975). Alternatively, narrow band factor analysis
models could be fitted and the number of factors chosen using an AIC-type
criterion.
To illustrate the methods of this chapter, we consider a simple example of
just two anemometers from which instantaneous readings of wind speed were
obtained several times per second. A plot of the spectrum of the data shows a
rapid decrease in power with frequency. Autoregressions were fitted to the two
series and it was found that both series were adequately described by simple
third-order autoregressions, the coefficients of the two series being very similar.
The two series were prewhitened using a filter obtained using the coefficients of
one of the autoregressions. This substantially reduced the variation of the
spectra with frequency, though the observations were still not white noise. The
two filtered series were strongly coherent at low frequencies, though the
coherence was reduced at high frequencies (see Fig. 1). There was no evidence
of delay between the two series.
As mentioned earlier, when estimating the attenuation between just two
observed series, some further assumption must be made about the noise. The
simplest and most natural assumption is that the two noise series have the same
spectrum, and that assumption was made here.
The phase difference between the two series varied unsystematically about
zero, and so it was assumed that the ratio a_2(ω)/a_1(ω) [= ν_2(ω)/ν_1(ω)] was real,
and further estimation was based solely on the real part of the cross-spectrum
between the two observed series.
Narrow band estimates of the ratio a_2(ω)/a_1(ω) were calculated and these are
plotted in Fig. 2. Because we are dealing with real quantities, confidence
intervals may be calculated for these estimates using the method of Creasy
(1956). The approximate 95% intervals calculated using this method are plotted
in Fig. 2.
It can be seen that the estimates are approximately constant and generally
greater than one. There is one band at which the estimate is much greater than
those at other frequencies. However, the signal-to-noise ratio is much lower at
Fig. 1. Coherence between the two observed, prewhitened series.

Fig. 2. Narrow band estimates of |ν_2(ω)/ν_1(ω)| with approximate 95% confidence intervals.
Fig. 3. Narrow band estimates of |ν_1(ω)|, the signal-to-noise ratio for the first anemometer.

Fig. 4. Estimated noise spectrum.



the high frequencies, as can be seen from the widths of the confidence intervals
in Fig. 2 and from Fig. 3, where |ν̂_1(ω)| is plotted. The estimate of the spectrum
of the noise is plotted in Fig. 4.
The above analysis suggests that the ratio ν_2(ω)/ν_1(ω) is constant and, under
this assumption, the ratio was estimated using a wide frequency band estima-
tor. The estimated ratio was found to be 1.10 with an approximate standard
error of 0.02.
The estimation was repeated without prefiltering. The estimate of
ν_2(ω)/ν_1(ω) was essentially the same, but the estimated noise spectrum was
dominated by the power at low frequencies common to both signal and noise.
At this point, time domain models could be chosen for ν_1(ω) and f_x(ω) by
examination of their narrow band estimates, and an overall time domain
estimation performed using the methods of Engle and Watson. This could
improve the estimate of ν_2(ω)/ν_1(ω) if there were fewer observations, but
because there are a moderate number of observations and because the spectra
contain no sharp peaks, there would appear to be no benefit in this case. Note
that the noise models would probably be misspecified in the first phase of a
time domain modelling unless the data were prewhitened.

References

Aasnaes, H. B. and Kailath, T. (1973). An innovations approach to least-squares estimation, Part VII: Some applications of vector autoregressive-moving average models. IEEE Trans. Automat. Control AC-18, 601-607.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Box, G. E. P. and Tiao, G. C. (1977). A canonical analysis of multiple time series. Biometrika 64, 355-366.
Brillinger, D. R. (1975). Time Series, Data Analysis and Theory. Holt, Rinehart and Winston, New York.
Cameron, M. A. (1981). Estimation of noise correlations in transfer function models. Commun. Statist.-Simula. Computa. B10, 369-381.
Cameron, M. A. (1983). The comparison of time series recorders. Technometrics 25, 9-22.
Chan, Y. T., Riley, J. M. and Plant, J. B. (1980). A parameter estimation approach to time delay estimation and signal detection. IEEE Trans. Acoust. Speech Signal Process. ASSP-28, 8-16.
Clay, C. S. and Hinich, M. J. (1981). Estimating the earth's impedance function when there is noise in the electric and magnetic signals. In: D. F. Findley, ed., Applied Time Series Analysis II, 184-219. Academic Press, New York.
Creasy, M. A. (1956). Confidence limits for the gradient in the linear functional relationship. J. Roy. Statist. Soc. Ser. B 18, 65-69.
Engle, R. and Watson, M. (1981). A one-factor multivariate time series model of metropolitan wage rates. J. Amer. Statist. Assoc. 76, 774-781.
Fletcher, R. and Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. Computer J. 6, 163-168.
Geweke, J. F. and Singleton, K. J. (1981). Maximum likelihood "confirmatory" factor analysis of economic time series. Internat. Econom. Rev. 22, 37-54.
Hamon, B. V. and Hannan, E. J. (1974). Spectral estimation of time delay for dispersive and non-dispersive systems. Appl. Statist. 23, 134-142.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.

Hannan, E. J. (1983). Signal estimation. In: D. R. Brillinger and P. R. Krishnaiah, eds., Time Series in the Frequency Domain. North-Holland, Amsterdam.
Hannan, E. J. and Thomson, P. J. (1973). Estimating group delay. Biometrika 60, 241-253.
Hannan, E. J. and Thomson, P. J. (1981). Delay estimation and the estimation of coherence and phase. IEEE Trans. Acoust. Speech Signal Process. ASSP-29, 485-490.
Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics 22, 389-395.
Joreskog, K. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32, 443-482.
Joreskog, K. (1978). Structural analysis of covariance and correlation matrices. Psychometrika 43, 443-447.
Lawley, D. N. (1967). Some new results in maximum likelihood factor analysis. Proc. Roy. Soc. Edinburgh Ser. A 67, 256-264.
Nicholls, D. F. and Hall, A. D. (1979). The exact likelihood function of multivariate autoregressive-moving average models. Biometrika 66, 259-264.
Pukkila, T. (1982). On the identification of transfer function noise models with several correlated inputs. Scand. J. Statist. 9, 139-146.
Taniguchi, M. (1982). On estimation of the integrals of the fourth order cumulant spectral density. Biometrika 69, 117-122.
Thomson, P. J. (1982). Signal estimation using an array of recorders. Stochastic Process. Appl. 13, 201-214.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 389-412

Speech Recognition Using LPC Distance Measures

P. J. Thomson and P. de Souza

1. Introduction

Research into automatic recognition of speech is concerned with the problem


of designing a device which accepts speech as input and determines what words
were spoken. This is to be distinguished from the related task of speech
understanding where the goal is to design a device which reacts correctly to
spoken commands, and which may be able to do so without necessarily
recognising every word correctly. The task of a speech recogniser is simply to
transcribe the input utterance without responding to its meaning, as if it were
taking dictation. Indeed, dictation is one possible commercial application of a
speech recogniser.
Speech recognition can be subdivided into two categories: recognition of
isolated words where there are distinct pauses between the words; and
recognition of natural continuous speech where words are usually run together.
Recognition of isolated words is less difficult than recognition of continuous
speech and, in fact, isolated word recognisers handling vocabularies of up to
200 words have been in use commercially for several years [1]. Recently,
recognisers have become available which will accept vocabularies of up to 500
words [19].
Two reasons why isolated word recognition has been more successful than
continuous speech recognition are that: (1) pausing between words leads to
clearer speech and substantially reduces the effects of co-articulation caused by
the preceding and following words; (2) pauses can be identified fairly reliably
[2] and hence the end-points of isolated words can be determined more easily
than in continuous speech.
In continuous speech, words are pronounced less carefully and co-articu-
latory effects cause the pronunciation of words to vary according to their
context. For example, the pronunciation of the word "and" is usually much
clearer when spoken in isolation than when spoken naturally in a phrase like
"black and blue", which may be pronounced more like "black 'n' blue". Also,
in words such as "brand" the final consonant may or may not be pronounced
depending on its context: it is more likely to be pronounced in the phrase
"brand of product" than in "brand new product".

Unlike isolated words, it is not usually possible to identify word end-points in


continuous speech. As most successful isolated word recognisers rely heavily on
end-point information they cannot be converted easily to work on continuous
speech. Continuous speech, therefore, represents a significantly more complex
problem which remains far from being fully solved.
Typically, in isolated word recognition [3,20] a parameter vector is com-
puted every 10-20 ms over a windowed region of the input signal. Making use
of the pauses between words to determine the end-points of an utterance, a
feature matrix is then extracted consisting of the time-varying parameter
vectors over the interval between the end-points. The word is identified by
comparing the feature matrix with a set of stored templates derived from the
words in the vocabulary and selecting the template which gives the closest
match. This simple, template matching approach is adequate for small vocabu-
laries when the words are spoken in isolation, but problems are encountered
when it is applied to large vocabularies [4], and it is difficult to apply to
continuous speech although it can be done for very small vocabularies such as
the ten digits [17].
In order to implement an isolated word recogniser like the one above, it is
necessary to decide on the parameter set and distance measure to be used in
comparing an unidentified word with the reference words. The predominant
parametric representation in use today [5] is linear predictive coding (LPC)
which is applied to the speech signal after digitising it at a rate of between 6
and 20 kHz.
The sampling rate must be twice the desired bandwidth in order to avoid
spectral aliasing. In the case of voiced speech a minimum bandwidth of 3 kHz is
required for accurate estimation, while in the case of unvoiced fricative sounds
such as "s" a bandwidth of 8-10 kHz is necessary. Since the telephone has a
bandwidth of about 3 kHz only, a low sampling rate is usually adequate for
telephone quality speech, whereas a 20 kHz sampling rate is desirable for high
quality microphone speech [6].
The sampled speech is typically quantised to an accuracy of between 9 and
16 bits [5], again depending on the quality of the speech required and the
application.
The linear predictive model [21] which is applied to the digitised signal
basically assumes that a speech sample can be approximated by a linear
combination of the immediately preceding speech samples, but the foundations
of LPC can be traced to Fant's very successful linear speech production model
[7]. In Fant's model of acoustical speech behaviour, speech is considered to be
the output of a linear, time-varying system excited by periodic pulses during
voiced speech and random noise during unvoiced speech. Linear prediction
provides a robust and accurate means of estimating the parameters that
characterise this system [8, chapter 8], and is valid, therefore, to the extent that
the production model is valid. The success of LPC can be attributed to the
accuracy with which Fant's basic model applies to speech.
Further reasons for the importance of LPC lie in the accuracy with which the

speech parameters can be estimated and in the speed with which they can be
computed. As it happens, estimation of the linear prediction coefficients
reduces to a set of linear equations which are blessed with mathematical
properties that allow extremely efficient solution [6].
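One of those fast algorithms is the Levinson-Durbin recursion, which reappears
at the end of Section 2 below. A minimal sketch (the function name and array
conventions are ours) that solves the Yule-Walker equations in O(p²) rather
than O(p³) operations is:

    import numpy as np

    def levinson_durbin(r):
        # r: autocovariances r(0), ..., r(p) as a NumPy array.
        # Solves the Yule-Walker equations for the coefficients
        # a_1, ..., a_p of sum_{i=0}^{p} a_i x_{t-i} = e_t (a_0 = 1),
        # returning the coefficients and the innovation variance.
        p = len(r) - 1
        a = np.zeros(p + 1)
        a[0] = 1.0
        sigma2 = r[0]
        for k in range(1, p + 1):
            lam = -(r[k] + a[1:k] @ r[k - 1:0:-1]) / sigma2  # reflection
            a[1:k] = a[1:k] + lam * a[k - 1:0:-1]
            a[k] = lam
            sigma2 *= (1.0 - lam * lam)
        return a[1:], sigma2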
Additionally, LPC has the further advantage that the asymptotic distribution
of the linear prediction coefficients is known, tractable, and appears to provide
a workable approximation to the distribution of the coefficients as obtained
over finite length intervals typical of those used in speech recognition [9, 10].
The importance of knowing the approximate distribution of the coefficients is
that it provides the means of developing appropriate distance measures instead
of relying on empirically derived measures which would otherwise be the case.
Despite these advantages, Fant's model is known to be imperfect [6], and the
speech signal is not truly stationary as the LPC model assumes. Therefore, LPC
does not provide a perfect representation of speech and is not necessarily the
best parameter set for speech recognition purposes. Other parameters such as
the discrete Fourier transform spectrum are in use and good results have been
reported [11].
Once a parameter set has been selected, the isolated word recogniser
sketched earlier needs an appropriate distance measure to determine which of
the reference templates is closest to the feature matrix of an unidentified word.
Ideally, the distance measure should have the property that it maximises the
probability of selecting a template of the correct word. Since an unidentified
word will usually be of a different duration than its corresponding template(s),
it is necessary to perform some kind of time alignment between the uni-
dentified and the reference patterns. Thus, computing the distance between an
unidentified word and a reference template involves both time alignment and
the accumulation of some distance function after alignment.
It is well established that careful time alignment leads to a significant
reduction in the recognition error rate of isolated words, particularly when the
vocabulary contains polysyllabic words [12]. In practice this means that it is
inadequate to perform simple linear time alignment in which one of the
patterns is stretched or compressed linearly to match the other. Better results
are obtained by performing dynamic time warping in which the optimal
alignment is taken as being that one which minimises the total accumulated
distance between the unidentified and reference patterns. In dynamic time
warping the accumulated distance is taken to be the sum of the local distances
between the temporally aligned parameter vectors of the reference and uni-
dentified words. Efficient recursive procedures to find the required alignment
have been devised and a useful discussion on the subject is given by Myers et
al. [13].
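A minimal sketch of the basic dynamic programming recursion (without the
slope constraints and path weights used in practice, and with names of our
own) is:

    import numpy as np

    def dtw_distance(A, B, local_dist):
        # A: (n, d) and B: (m, d) feature matrices; local_dist is the
        # chosen frame-to-frame distance, e.g. an LPC distance measure.
        n, m = len(A), len(B)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = local_dist(A[i - 1], B[j - 1])
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1],
                                  D[i - 1, j - 1])
        return D[n, m]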
To complete the definition of the distance between an unidentified word and
a reference word, it is necessary to specify how the local distance between two
parameter vectors will be measured. In the case of parameter vectors which
have an unknown or intractable distribution, it is difficult to define an optimal
distance; it may be necessary, instead, to resort to measures such as the

Euclidean distance, the city-block metric or any one of several intuitive


distances [14] according to whichever was found to work best in practice. In
contrast, when the parameters have a known tractable distribution, as is the
case for LPC coefficients, this information can be used to derive distance
measures which are optimal, or nearly so, in some well-defined sense.
Probably the most common LPC distance measure in use today is Itakura's
so-called log likelihood ratio [15]. Ironically, this is neither the log likelihood
ratio statistic for comparing two estimated LPC vectors, nor is it statistically
optimal as was pointed out by de Souza and Thomson [9, 10]. Despite this, it
has been found to work better than most ad hoc distance measures and has the
advantages that it can be computed quickly, and its storage requirements are
small.
Nonetheless, when Itakura's distance measure is used, isolated word recog-
nisers of the type described here do not perform well with complex vocabu-
laries containing acoustically similar words [16]. There is a need, therefore, for
more powerful LPC distance measures than Itakura's in order to discriminate
better between similar sounding words. Several candidates for an improved
LPC distance measure were investigated by de Souza and Thomson [10] and
will be further discussed later in this chapter.
Because of the limitations inherent in isolated word recognisers using tem-
plate matching, other approaches to speech recognition have been investigated.
One of the more successful of these has been the maximum likelihood
approach in which speech is modelled as a Markov source [18]. In this
approach, word templates are replaced by Markov models, and each parameter
vector is replaced by a scalar indicating which of several reference vectors is
the nearest to the observed vector. The resulting sequence of scalars, or labels,
is analysed to find those words whose collective concatenated Markov models
maximise the likelihood of the observed labels. Good results using this tech-
nique have been reported for continuous speech [18] as well as for isolated
words [11].
As in the case of template matching, the Markov modelling approach
requires a distance measure defined in terms of the chosen parameter set. In
the latter case it is needed in order to find the closest reference vector to an
observed parameter vector during the labelling phase. Additionally, the
reference vectors should be chosen so as to minimise distortion in the quan-
tised, or labelled, speech. This process, known as vector quantisation, is
intrinsically related to the choice of distance measure used in labelling
[22, 23].
It can be seen, then, that the definition of distance measures is an important
aspect of speech recognition research. Given the predominance of LPC as the
choice of parameters, LPC distance measures are of particular importance, and
the remainder of this chapter is devoted to this subject.
We begin in Section 2 by reviewing the LPC model and some of its statistical
properties.

2. The LPC model - a review

The stationary stochastic process {x_t} is said to follow an LPC model if

    Σ_{i=0}^{p} a_i x_{t−i} = ε_t,  t = 0, ±1, ...,  (2.1)

where {ε_t} is a white noise process, i.e. a sequence of uncorrelated random
variables each with mean zero and variance σ². It is also assumed that a_0 = 1
and the z transform Σ a_i z^i is non-zero inside and on the unit circle. The latter
condition ensures that x_t is uncorrelated with future innovation terms ε_{t+s},
s > 0. Note that the LPC (linear predictive coding) model is nothing other than
the familiar autoregressive model used in almost all branches of time series.
Consider estimating α = (α_1, ..., α_p)ᵀ and σ² from a sample of N obser-
vations on the process {x_t}. Without any loss of generality we may take these
observations to be given by x = (x_1, ..., x_N)ᵀ. A natural method for estimating
α and σ² is by means of the least squares criterion

    N'^{-1} Σ_{t=p+1}^{N} (Σ_{i=0}^{p} a_i x_{t−i})² = aᵀ C a,  (2.2)

where the (p + 1) × (p + 1) matrix C has typical element

    c_ij = N'^{-1} Σ_{t=p+1}^{N} x_{t−i} x_{t−j},  i, j = 0, 1, ..., p,  (2.3)

with N' = N − p and a = (1, αᵀ)ᵀ. Minimising (2.2) with respect to α yields

    α̂ = −D^{-1} d,  (2.4)

where D is the p × p submatrix of C obtained by deleting row and column 0,
and d is the p-dimensional column vector obtained by deleting row zero from
the zeroth column of C. As is usual in least squares, an estimate of σ² can be found
by considering the mean squared residual
by considering the m e a n squared residual

~.2
=:

~
_Z
t- 1 i=0
OliXt_ i = ~'r6~. (2.5)
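A direct sketch of (2.3)-(2.5), computing C and the conditional least squares
estimates (the function name is ours), is:

    import numpy as np

    def lpc_conditional_ls(x, p):
        # Build C with c_ij = (1/N') * sum_{t=p+1}^{N} x_{t-i} x_{t-j},
        # then alpha_hat = -D^{-1} d as in (2.4) and sigma2 as in (2.5).
        x = np.asarray(x, dtype=float)
        N = len(x)
        Np = N - p
        lagged = np.column_stack([x[p - i:N - i] for i in range(p + 1)])
        C = lagged.T @ lagged / Np
        alpha = -np.linalg.solve(C[1:, 1:], C[1:, 0])
        a = np.concatenate(([1.0], alpha))
        return alpha, a @ C @ a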

It is well known (see, for example, [24-26]) that the distribution of √N'(α̂ −
α) approaches that of a multivariate Gaussian distribution as N' → ∞. The
limiting distribution has mean 0 and covariance matrix Σ, where an asymp-
totically unbiased estimator of Σ is given by

    Σ̂ = σ̂² D^{-1}.  (2.6)



Although σ̂² is asymptotically unbiased, it will be biased for finite N'.
Paralleling the usual procedure for estimating variance in least squares, con-
sider the estimator

    s² = (N' − p)^{-1} Σ_{t=p+1}^{N} (Σ_{i=0}^{p} α̂_i x_{t−i})² = N'/(N' − p) σ̂².  (2.7)

This estimator takes into account the p degrees of freedom lost by using the
estimated values of α_i in (2.2). Following the argument given in Fuller [26, p.
338], s² should prove to be a less biased estimator of σ² in small samples. Now,
since σ̂² and s² are asymptotically equivalent, we can define an alternative
asymptotically unbiased estimator of Σ as

    S = s² D^{-1}.  (2.8)


If the process {xt} were Gaussian, then the likelihood of xp< . . . . . XN,
conditioned on the first p observations x, . . . . . xp is given by

(o-'k/ 2Trr,
rr)-N' e x p { - ~J~22arrCa } . (2.9)

Maximising (2.9) with respect to α and σ² shows that α̂ and σ̂² are the
maximum likelihood estimates of α and σ².
Now the form of (2.9) as a function of α and σ is precisely analogous to that
of the Gaussian linear regression model in conventional statistics. (See Seber
[27] for example.) Indeed, if y = (y_1, ..., y_n)ᵀ follows a Gaussian linear regres-
sion model, then y has a multivariate Gaussian distribution with mean Xβ and
covariance matrix σ_y² I. Here the columns of X contain the regressors, β
contains the regression coefficients and I is the identity matrix. Hence the
likelihood for the linear regression model is

    (σ_y√(2π))^{−n} exp[−(1/2σ_y²)(y − Xβ)ᵀ(y − Xβ)]
        = (σ_y√(2π))^{−n} exp[−(n/2σ_y²) bᵀ B b],

where

    b = (1, −βᵀ)ᵀ  and  B = n^{-1} [yᵀy, yᵀX; Xᵀy, XᵀX].

Replacing the β_j by −α_j, σ_y by σ, n by N', y by (x_{p+1}, ..., x_N)ᵀ and the
columns of X by (x_{p−j+1}, ..., x_{N−j})ᵀ, j = 1, ..., p, yields (2.9). Note that this
establishes the correspondence between the likelihoods only and not the
respective distributions. The distribution of x_{p+1}, ..., x_N conditioned on

x_1, ..., x_p does not follow the Gaussian linear regression model. The parallel between the likelihoods proves to be useful in the development that follows.
The unconditional likelihood of x_1, ..., x_N is (2.9) multiplied by

(σ√(2π))^{−p} (det A)^{−1/2} exp{ −(1/(2σ²)) x_p^T A⁻¹ x_p } ,        (2.10)

where x_p = (x_1, ..., x_p)^T and A has typical element A_ij = cov(x_i, x_j)/σ².


A procedure for determining the A_ij as a function of α is given in McLeod [28]. Although there exist algorithms for computing the exact maximum likelihood estimates [29, 30], they are considerably more costly to compute than the conditional maximum likelihood estimates. For this reason the unconditional maximum likelihood estimates have not been seriously considered for speech recognition, where computational efficiency is a primary requirement. Note also that, if N is large by comparison with p, the multiplicative factor (2.10) contributes little to the likelihood.
Now, apart from the multiplicative factor (det A)^{−1/2}, the exact likelihood of x_1, ..., x_N is (2.9) with N' replaced by N and C replaced by C̃, where C̃ has typical element

c̃_ij = (1/N) Σ_{t=1}^{N−i−j} x_{t+i} x_{t+j} ,   i, j = 0, 1, ..., p .        (2.11)

This gives an approximation to the exact likelihood since (det A)^{−1/2} is independent of N and will, for moderate to large samples, contribute little to the likelihood. (See Box and Jenkins [31] for discussion of this approximate likelihood.) Maximising this approximate likelihood yields the estimate

α̃ = −D̃⁻¹ d̃ ,        (2.12)

where D̃ and d̃ are defined in terms of C̃ in exactly the same way that D and d were defined in terms of C. Moreover, the estimate of σ² is

σ̃² = ã^T C̃ ã ,        (2.13)

where ã = (1, α̃^T)^T. Paralleling (2.7) we also define the least squares estimator of σ² as

s̃² = (N/(N − p)) σ̃² .        (2.14)

These estimates should provide closer approximations to the (unconditional) maximum likelihood estimates than α̂ and σ̂².
Yet another way to estimate α is to solve the well-known Yule-Walker equations. These yield the estimate

ᾱ = −D̄⁻¹ d̄ ,        (2.15)

where, once again, D̄ and d̄ are defined in terms of C̄ in exactly the same way that D and d were defined in terms of C. Here C̄ has typical element

c̄_ij = ĉ(|i − j|) ,   i, j = 0, 1, ..., p ,

and

ĉ(n) = (1/N) Σ_{t=1}^{N−n} x_t x_{t+n} ,   n = 0, 1, ..., p .

The estimates of σ² analogous to (2.5) and (2.7) are

σ̄² = ā^T C̄ ā ,   s̄² = (N/(N − p)) σ̄² ,        (2.16)

where ā = (1, ᾱ^T)^T. The Yule-Walker estimates are also approximate


(unconditional) maximum likelihood estimates, where the approximate likelihood in question is as for the previous paragraph, but with C̃ approximated by C̄. These estimates, popular because of the ease with which they can be computed, have lost some of their appeal due to the fact that there are now fast algorithms for computing α̂ and σ̂² [6, 32]. Moreover, even in moderate size samples, simulation studies favour α̂ and σ̂² over ᾱ and σ̄² [33].
All the estimates of α and σ² considered are asymptotically equivalent. In practice, however, as intimated above, the estimates α̃ and σ̃² should normally provide the best estimates of α and σ², followed, in order, by α̂ and σ̂² and then ᾱ and σ̄². In terms of computation the estimates α̂ and σ̂² require a computational effort no greater than that for the Yule-Walker estimates ᾱ and σ̄² [32]. The approximate maximum likelihood estimates α̃, σ̃² can also be computed rapidly [34], but not quite as rapidly as α̂ and σ̂² or ᾱ and σ̄². The algorithms concerned are the Levinson-Durbin recursion and generalisations of this.
In the following section a number of tests will be derived using the likelihood ratio method and Gaussian likelihoods. The exact distribution of the resulting test statistics cannot easily be established and, as a consequence, only asymptotic distributions can be given. However, it can be shown that these asymptotic distributions will also hold under more general conditions where the white noise process {ε_t} satisfies the conditions following (2.1) together with additional mild regularity conditions. (See Hannan [25] and Fuller [26] in particular.) Thus, although the statistics are derived under Gaussian assumptions, they will continue to follow the stated asymptotic distribution in more general circumstances.
For the sake of definiteness, we shall now confine our attention, in the main, to the conditional likelihood estimates α̂ and σ̂². This is not a restriction. Since the estimates can all be obtained by maximising an appropriate version of the likelihood (2.9), the three estimates of α and σ² may be used interchangeably in the statistics and distance measures that follow.

3. Comparative tests for LPC models

In this section we use the formal theory of statistical hypothesis testing and, in particular, the likelihood ratio method to generate appropriate measures of the distance between sets of LPC coefficients. Much of the material that follows is drawn from [10].
Consider a sequence of observations x_T of length N_T corresponding to a stretch of voiced input that is to be coded. Assume that x_T is generated by a linear predictive process of order p with LPC coefficients α_T and innovation variance σ²_T. Moreover, we shall assume, for the moment, that x_T has a Gaussian distribution so that the likelihood of x_T, conditioned on the first p observations, is given by (2.9) with a = a_T = (1, α_T^T)^T and σ = σ_T.
In the simplest situation we might conceivably wish to test the hypothesis H: α_T = α_R, where α_R is some known fixed reference vector. Alternatively we may not know α_R, but know instead only a reference sequence of observations x_R of length N_R. In such cases we shall assume that x_R is generated independently of x_T by a linear predictive process of order p with LPC coefficients α_R and innovation variance σ²_R. This again leads to consideration of the test of the hypothesis H: α_T = α_R, where now both α_T and α_R are unknown and must be estimated from the data.
In this section we discuss the likelihood ratio tests of the hypothesis H: α_R = α_T in the various situations alluded to above. From these tests relevant distance measures are constructed. It should be noted in passing that these tests and distance measures are also of interest in their own right since they are applicable to problems in fields other than speech recognition. They can be seen as building on the work of Quenouille [35].

3.1. H: α_T = α_R; α_R known

Here the relevant likelihood ratio statistic to test H: α_T = α_R is obtained as the maximum of the likelihood (2.9) under H expressed as a ratio of the unconstrained maximum of the likelihood. By taking advantage of the correspondence between the likelihood (2.9) and that of the Gaussian linear regression model (see the discussion following (2.9)), the relevant likelihood ratio test statistic is a monotonic function of

ℱ(α̂_T, α_R) = N'_T (α̂_T − α_R)^T D_T (α̂_T − α_R) / σ̂²_T ,        (3.1)

where α̂_T, σ̂²_T, C_T and D_T are obtained from x_T using (2.3)-(2.5). Writing â_T = (1, α̂_T^T)^T and a_R = (1, α_R^T)^T, we note that (3.1) can also be written as

ℱ(α̂_T, α_R) = N'_T [ a_R^T C_T a_R / (â_T^T C_T â_T) − 1 ] ,        (3.2)

which is cheaper to compute than the form given by (3.1). Asymptotically ℱ has a χ²_p distribution when H: α_T = α_R is true.
Note that the logarithm of the likelihood ratio is proportional to

I(α̂_T, α_R) = log[ a_R^T C_T a_R / (â_T^T C_T â_T) ] = log(1 + ℱ(α̂_T, α_R)/N'_T) .        (3.3)

This statistic, with α̂_T and C_T replaced by the asymptotically equivalent Yule-Walker estimates ᾱ_T and C̄_T, is commonly known as Itakura's distance [15]. When H: α_T = α_R is true, N'_T I(α̂_T, α_R) also has an asymptotic χ²_p distribution. We have chosen the statistic ℱ in preference to I because of the former's more tractable distributional properties and because of its direct relationship to the conventional test statistics developed for the Gaussian linear regression model [27, 37].
A better approximation to the null distribution of ℱ can be determined. Note first that

N'_T (α̂_T − α_R)^T D_T (α̂_T − α_R) / σ²_T        (3.4)

has an asymptotic χ²_p distribution under H and, if x_t follows a Gaussian distribution, N'_T σ̂²_T/σ²_T is asymptotically equivalent to a χ²_{N'_T−p} random variable under H. Moreover, α̂_T and σ̂²_T are asymptotically independent. (See [31, p. 228] for example.) Hence, when H: α_T = α_R is true,

l(α̂_T, α_R) = ((N'_T − p)/(N'_T p)) ℱ(α̂_T, α_R) = N'_T (α̂_T − α_R)^T D_T (α̂_T − α_R) / (p s²_T)        (3.5)

has an (approximate) F_{p,N'_T−p} distribution. Here s²_T is obtained from σ̂²_T using (2.7).
It can be seen from (3.4) that approximating the null distribution of ℱ by a χ²_p distribution ignores the variability inherent in σ̂²_T. The modified distance measure l and its approximate F distribution go some way towards taking account of the variability of σ̂²_T. Note that (3.4), and hence ℱ, has an asymptotic χ²_p distribution under considerably more general circumstances than the Gaussian assumption given previously [25, 26]. However, the distribution of N'_T σ̂²_T/σ²_T will not necessarily be well approximated by a χ²_{N'_T−p} distribution if x_T is non-Gaussian.
It has been argued forcefully that, in many circumstances, no single reference vector α_R will successfully characterise any given speech segment. (See [9, 10, 36] for example.) Because of co-articulation, there can be marked differences between different realisations of the same speech segment. In this context it is interesting to note that between 25% and 50% of words in normal conversation are unintelligible when taken out of context and heard in isolation. In practice this implies that α_R is frequently not a fixed reference vector, but an estimated LPC vector with its own inherent variability. In these circumstances ℱ, l and Itakura's distance I are inappropriate. More appropriate measures are given in the remaining part of Section 3.

3.2. H: α_T = α_R; α_R unknown, σ²_R and σ²_T known to be equal

The joint likelihood of x_R and x_T, conditioned on the first p observations of each sequence, is

(σ√(2π))^{−(N'_R+N'_T)} exp{ −(1/(2σ²)) (N'_R a_R^T C_R a_R + N'_T a_T^T C_T a_T) } ,        (3.6)

where N'_R = N_R − p, a_R = (1, α_R^T)^T, C_R is obtained from x_R using (2.3) and σ² denotes the common value of σ²_R and σ²_T. We can again take advantage of the correspondence between (3.6) and the likelihood associated with two Gaussian linear regression models having the same variance; the corresponding test is that of testing for coincidence of two regression functions. (See Graybill [37, p. 190] for example.)
The relevant likelihood ratio test statistic is a monotonic function of

F(α̂_T, α̂_R) = { [(N'_R + N'_T) σ̂²_p − N'_R σ̂²_R − N'_T σ̂²_T]/p } / { [N'_R σ̂²_R + N'_T σ̂²_T]/(N'_R + N'_T − 2p) } ,        (3.7)

where σ̂²_R and σ̂²_T are the estimates of σ² obtained from x_R and x_T respectively using (2.5), and

σ̂²_p = â_p^T C_p â_p .        (3.8)

Moreover, C_p is the pooled covariance matrix given by

C_p = (N'_R C_R + N'_T C_T)/(N'_R + N'_T) ,        (3.9)

and â_p = (1, α̂_p^T)^T, where α̂_p is obtained from C_p and its corresponding submatrix D_p using (2.4). The null hypothesis H: α_T = α_R will be rejected when F is significantly large. This statistic is due to de Souza [9], who derived it by analogy to classical regression theory. It can be shown [27, 37] that (3.7) is the same as

F(α̂_T, α̂_R) = (α̂_R − α̂_T)^T [D_R⁻¹/N'_R + D_T⁻¹/N'_T]⁻¹ (α̂_R − α̂_T) / (p s²) ,        (3.10)



where

s² = (N'_R σ̂²_R + N'_T σ̂²_T)/(N'_R + N'_T − 2p)        (3.11)

estimates σ². Thus pF is just the squared distance between α̂_R and α̂_T standardised by an estimate of the covariance matrix of α̂_R − α̂_T. Note that, as a distance measure, F possesses the desirable property that it is symmetric in α̂_R and α̂_T.
If the null hypothesis H: α_T = α_R is true, then pF will have an asymptotic χ²_p distribution. This follows from the form of (3.10) and the stochastic properties of α̂_R, α̂_T, σ̂²_R and σ̂²_T. This result will hold under quite mild regularity conditions concerning the nature of the processes generating x_R and x_T. It is not restricted to Gaussian x_R and x_T. However, if x_R and x_T are Gaussian, (N'_R σ̂²_R + N'_T σ̂²_T)/σ² is asymptotically equivalent to a χ²_{N'_R+N'_T−2p} random variable. Since α̂_R and α̂_T are asymptotically independent of σ̂²_R and σ̂²_T it is evident that F is, in the case of Gaussian x_R and x_T, asymptotically equivalent to an F_{p,N'_R+N'_T−2p} random variable. Because the F distribution yields a more conservative testing procedure, we shall take the F_{p,N'_R+N'_T−2p} distribution as the approximate distribution of F under H even when x_R and x_T are not Gaussian.
When N'_T = N'_R = N', then

F(α̂_T, α̂_R) = (2(N' − p)/p) [ 2σ̂²_p/(σ̂²_R + σ̂²_T) − 1 ]        (3.12)

    = (2(N' − p)/p) (α̂_R − α̂_T)^T (D_R⁻¹ + D_T⁻¹)⁻¹ (α̂_R − α̂_T) / (σ̂²_R + σ̂²_T)        (3.13)

and F is asymptotically equivalent to an F_{p,2(N'−p)} random variable when α_R = α_T.

3.3. H: α_T = α_R; α_R unknown, σ²_R and σ²_T not necessarily equal

If σ²_R and σ²_T were known a priori, then it is easily shown that the likelihood ratio test statistic for testing H: α_T = α_R is a monotonic function of

(α̂_R − α̂_T)^T { σ²_R D_R⁻¹/N'_R + σ²_T D_T⁻¹/N'_T }⁻¹ (α̂_R − α̂_T) .        (3.14)

This has an asymptotic χ²_p distribution under H. Since σ²_R and σ²_T are unknown, the natural test statistic, or distance measure, to use is (3.14) with σ²_R and σ²_T replaced by their estimates σ̂²_R and σ̂²_T or s²_R and s²_T. (See (2.5) and (2.7).) This yields the test statistic

χ²(α̂_T, α̂_R) = (α̂_R − α̂_T)^T { s²_R D_R⁻¹/N'_R + s²_T D_T⁻¹/N'_T }⁻¹ (α̂_R − α̂_T) ,        (3.15)

and the null hypothesis H: α_R = α_T will be rejected when χ² is significantly large. As a distance measure, χ² is symmetric in α̂_R and α̂_T and is again the squared distance between α̂_R and α̂_T standardised by an estimate of the covariance matrix of α̂_R − α̂_T. Although not a function of the appropriate likelihood ratio test statistic, χ² is asymptotically equivalent to it. In this case the correct likelihood ratio test statistic possesses certain undesirable theoretical and computational properties. (See, for example, the literature on the Behrens-Fisher problem in classical statistics [38, 39].) Note that χ² is proportional to the F statistic of Subsection 3.2 computed for the rescaled sequences x̃_R = x_R/s_R and x̃_T = x_T/s_T. We have chosen s²_R and s²_T instead of σ̂²_R and σ̂²_T partly because they are less biased and partly because they make χ² a more conservative test statistic.
The χ² statistic has an asymptotic χ²_p distribution under the null hypothesis. If x_R and x_T are Gaussian, a slightly better approximation to the null distribution, which takes some account of the variability of the estimates s²_R and s²_T, is given by the following argument. Under H: α_R = α_T the matrices D_R and D_T converge to a common limit A, i.e.

lim_{N'_R→∞} D_R = lim_{N'_T→∞} D_T = A ,        (3.16)

which means that χ² is asymptotically equivalent to

[ (α̂_R − α̂_T)^T A (α̂_R − α̂_T) / (σ²_R/N'_R + σ²_T/N'_T) ] / [ (s²_R/N'_R + s²_T/N'_T) / (σ²_R/N'_R + σ²_T/N'_T) ] .        (3.17)

When H is true, the numerator of (3.17) has an asymptotic χ²_p distribution and is asymptotically independent of the denominator. In addition, using Satterthwaite's approximation [40], the denominator is asymptotically equivalent, under H, to a χ²_ν/ν random variable, where ν is estimated as

ν̂ = (s²_R/N'_R + s²_T/N'_T)² / [ (s²_R/N'_R)²/(N'_R − p) + (s²_T/N'_T)²/(N'_T − p) ] .        (3.18)

Thus the distribution of χ²/p under H: α_R = α_T can be approximated by an F_{p,ν̂} distribution. Note that (3.18) varies between the smaller of N'_R − p, N'_T − p and N'_R + N'_T − 2p, with the maximum occurring when

s²_R / (N'_R(N'_R − p)) = s²_T / (N'_T(N'_T − p)) .

For the special case N'_T = N'_R = N',

χ²(α̂_T, α̂_R) = N' (α̂_R − α̂_T)^T { s²_R D_R⁻¹ + s²_T D_T⁻¹ }⁻¹ (α̂_R − α̂_T)        (3.19)

    = 2(N' − p)(s²_p − 1) ,        (3.20)


where s²_p is obtained from (3.8) and (3.9), but with C_R and C_T replaced by C_R/s²_R and C_T/s²_T respectively. When α_R = α_T, the distribution of χ² is asymptotically χ²_p and the distribution of χ²/p can be approximated by an F_{p,ν̂} distribution, where

ν̂ = [ (s²_R + s²_T)² / (s⁴_R + s⁴_T) ] (N' − p) .

3.4. Alternative tests of H: α_R = α_T


The F and χ² statistics are more costly to compute than either ℱ or I. In order to meet this problem, approximations to F and χ² were introduced by de Souza and Thomson [10]. These are F_* and χ²_* respectively, where

F_*(α̂_T, α̂_R) = (N'_R⁻¹ + N'_T⁻¹)⁻² (α̂_R − α̂_T)^T (D_R/N'_R + D_T/N'_T) (α̂_R − α̂_T) / (p s²)        (3.21)

and

χ²_*(α̂_T, α̂_R) = (N'_R⁻¹ + N'_T⁻¹)⁻² (α̂_R − α̂_T)^T (s_R⁻² D_R/N'_R + s_T⁻² D_T/N'_T) (α̂_R − α̂_T) .        (3.22)

These approximations follow from the observation that F and χ² involve quadratic forms of the type

x^T (w_1 A_1⁻¹ + w_2 A_2⁻¹)⁻¹ x ,        (3.23)

where A_1 and A_2 denote positive definite symmetric matrices and w_1 and w_2 are non-negative weights that sum to unity. In the case of F the weights w_1 and w_2 are N'_R⁻¹/(N'_R⁻¹ + N'_T⁻¹) and N'_T⁻¹/(N'_R⁻¹ + N'_T⁻¹) respectively, and A_1 and A_2 are D_R and D_T respectively. The χ² statistic has the same weights as F, but A_1 and A_2 are now D_R/s²_R and D_T/s²_T. Observe that (w_1 A_1⁻¹ + w_2 A_2⁻¹)⁻¹ is just a harmonic average of A_1 and A_2. F_* and χ²_* are basically the F and χ² statistics with this harmonic average replaced by the analogous arithmetic average, i.e. w_1 A_1 + w_2 A_2.
The F_* and χ²_* statistics involve much the same computational cost as either ℱ or I. Moreover, when the null hypothesis is true they are asymptotically equivalent to F and χ² respectively. Thus the asymptotic approximations to the distributions of F and χ² under the null hypothesis also hold, respectively, for F_* and χ²_*. In theory, F_* and χ²_* are less powerful than F and χ². In practice, however, the loss in power may be small enough not to warrant the additional computational cost involved in computing the more powerful tests. The simulation results of de Souza and Thomson [10] give some guidelines here. In summary, the F_* and χ²_* statistics possess the advantage that they are naturally related to the optimal F and χ² statistics and they are relatively inexpensive to compute.
When N'_R = N'_T = N' and α_R = α_T,

F_*(α̂_T, α̂_R) = ((N' − p)/(2p)) (α̂_R − α̂_T)^T (D_R + D_T) (α̂_R − α̂_T) / (σ̂²_R + σ̂²_T)        (3.24)

has an approximate F_{p,2(N'−p)} distribution if x_R and x_T are Gaussian, and

χ²_*(α̂_T, α̂_R) = (N'/4) (α̂_R − α̂_T)^T (D_R/s²_R + D_T/s²_T) (α̂_R − α̂_T)        (3.25)

has an asymptotic χ²_p distribution. Using Satterthwaite's approximation, the null distribution of χ²_*/p is approximately F_{p,ν̂}, where ν̂ is given by (3.18).
Tribolet et al. [41] consider the test statistic

ℱ_*(α̂_T, α̂_R) = N' (α̂_T − α̂_R)^T D_T (α̂_T − α̂_R) / (2σ̂²_T)        (3.26)

in the situation where N'_T = N'_R = N' and it is assumed a priori that σ²_T = σ²_R. Observe that ℱ_*(α̂_T, α̂_R) = ℱ(α̂_T, α̂_R)/2. Given σ²_T = σ²_R, then, under H: α_R = α_T, ℱ_* has an asymptotic χ²_p distribution. As in the case of ℱ, a modification of ℱ_* yields

l_*(α̂_T, α̂_R) = ((N' − p)/(N' p)) ℱ_*(α̂_T, α̂_R) = N' (α̂_T − α̂_R)^T D_T (α̂_T − α̂_R) / (2p s²_T) ,        (3.27)

which is asymptotically equivalent to an F_{p,N'−p} distribution under H: α_R = α_T provided x_R and x_T are Gaussian.
The tests based on ℱ_* or l_* will not be as powerful as F and χ² or F_* and χ²_*. However, the principal disadvantage of ℱ_* and l_* is simply that, as distance measures, they are not symmetric functions of α̂_T and α̂_R; i.e. ℱ_*(α̂_T, α̂_R) ≠ ℱ_*(α̂_R, α̂_T) and l_*(α̂_T, α̂_R) ≠ l_*(α̂_R, α̂_T). Moreover, if σ²_R ≠ σ²_T, it is easily shown that when H: α_T = α_R is true, ℱ_* is asymptotically equivalent to a ½(1 + σ²_R/σ²_T)χ²_p random variable and l_* to a ½(1 + σ²_R/σ²_T)F_{p,N'−p} random variable. Hence the χ²_p or F_{p,N'−p} approximation will give spurious results whenever σ²_R differs significantly from σ²_T.
The inter-relationships between Itakura's distance measure I and the distance measures ℱ, ℱ_* and χ²_* are of interest. First observe from (3.3) that, when α_T = α_R, N'_T I(α̂_T, α̂_R) and ℱ(α̂_T, α̂_R) are asymptotically equivalent, and ℱ_* given by (3.26) is asymptotically equivalent to N'_T I(α̂_T, α̂_R)/2, irrespective of whether σ²_R = σ²_T or N'_R = N'_T. Thus, under H: α_R = α_T,

ℱ(α̂_T, α̂_R) ≐ N'_T I(α̂_T, α̂_R) ,   ℱ_*(α̂_T, α̂_R) ≐ N'_T I(α̂_T, α̂_R)/2 ,        (3.28)

where ≐ indicates that the expressions concerned approach equality as N'_R and N'_T tend to infinity in such a way that N'_T/N'_R remains fixed. Moreover, from (3.22),

χ²_*(α̂_T, α̂_R) = 2(N'_R⁻¹ + N'_T⁻¹)⁻² [ (1 − p/N'_R) N'_R⁻² ℱ_*(α̂_R, α̂_T) + (1 − p/N'_T) N'_T⁻² ℱ_*(α̂_T, α̂_R) ]        (3.29)

and, when the null hypothesis is true, χ²_* is asymptotically equivalent to

I_*(α̂_T, α̂_R) = (N'_R + N'_T)⁻² [ (N'_R − p) N'_T² I(α̂_R, α̂_T) + (N'_T − p) N'_R² I(α̂_T, α̂_R) ] .        (3.30)

Here (3.29) expresses the symmetric distance measure χ²_* as a linear combination of the two asymmetric distances ℱ_*(α̂_R, α̂_T) and ℱ_*(α̂_T, α̂_R). When α_R = α_T, (3.30) shows that χ²_* is asymptotically equivalent to the symmetric distance measure I_*, which is a linear combination of the two asymmetric Itakura distances I(α̂_R, α̂_T) and I(α̂_T, α̂_R). For N'_R = N'_T = N',

χ²_*(α̂_T, α̂_R) = ((N' − p)/(2N')) [ ℱ_*(α̂_R, α̂_T) + ℱ_*(α̂_T, α̂_R) ]        (3.31)

and

I_*(α̂_T, α̂_R) = ((N' − p)/4) [ I(α̂_R, α̂_T) + I(α̂_T, α̂_R) ] .        (3.32)

Forming a symmetric distance measure from two asymmetric distance measures in this way is an intuitively reasonable procedure. Such a procedure has been used previously by Rabiner et al. [42]. The above derivation also shows that the distance measure I_* is closely related to χ²_* and hence χ². The implication of (3.30) and (3.32) is that speech recognition systems relying on Itakura's distance might be improved by the trivial modification of replacing I by I_*. An experiment in which this was done is described in Section 6.

3.5. H: σ²_R = σ²_T; α_R and α_T unknown

The tests and distance measures considered so far compare only the LPC coefficients, and not the innovation variances. Tests for comparing the innovation variances are useful for detecting amplitude or energy changes. The resulting information, when combined appropriately with an LPC distance measure, can lead to greater recognition accuracy [3].
To compare two innovation variances we consider the likelihood ratio test of H: σ²_R = σ²_T. This can be shown to be a function of s²_T/s²_R or, equivalently,

F_R = s²_T/s²_R = [ N'_T â_T^T C_T â_T/(N'_T − p) ] / [ N'_R â_R^T C_R â_R/(N'_R − p) ] .        (3.33)

The statistic F_R is asymptotically equivalent to an F_{N'_T−p,N'_R−p} random variable when σ²_R = σ²_T and x_R and x_T are Gaussian. The test is two sided, with the critical points, in practice, being determined as the 100(½γ)% and 100(1 − ½γ)% percentiles of the F_{N'_T−p,N'_R−p} distribution, where γ denotes the level of significance of the test. These values are not optimal and can be slightly improved on. (See, for example, Roussas [43, p. 303].)
The problem of combining LPC and energy measures in one overall measure is considered in Subsection 3.6.

3.6. H: σ²_R = σ²_T, α_R = α_T

Here we wish to compare parameter vectors that comprise the p + 1 coefficients σ² and α. The likelihood ratio test statistic for testing H: σ²_R = σ²_T, α_R = α_T can be shown to be λ, where −2 log λ is given by

LLR = (N'_R + N'_T) log( 1 + p F/(N'_R + N'_T − 2p) ) + N'_R log(σ̂²_*/σ̂²_R) + N'_T log(σ̂²_*/σ̂²_T) .        (3.34)
OrR o-T

In the above, F is given by (3.10),

σ̂²_* = (N'_R + N'_T − 2p) s²/(N'_R + N'_T) ,

and s² is given by (3.11). When σ²_R = σ²_T and α_R = α_T, LLR has an asymptotic χ²_{p+1} distribution. Alternatively, the critical values of this statistic when H is true can be determined numerically from the joint distribution of p s² F/σ², (N'_T − p)s²_T/σ² and (N'_R − p)s²_R/σ² (σ² = σ²_R = σ²_T), which are asymptotically equivalent to independent χ² random variables with degrees of freedom p, (N'_T − p) and (N'_R − p) respectively. The null hypothesis will be rejected when LLR is sufficiently large.
Simpler asymptotically equivalent expressions for LLR when σ²_R = σ²_T and α_R = α_T are

LLR_* = pF + N'_R log(s²/s²_R) + N'_T log(s²/s²_T)        (3.35)

and

LLR_{**} = pF + ½ (N'_R⁻¹ + N'_T⁻¹)⁻¹ [ (s²_R − s²_T)/s² ]² .        (3.36)

Both LLR_* and LLR_{**} have asymptotic χ²_{p+1} distributions under H.

3.7. Tests based on alternative LPC likelihoods


The test statistics and distance measures constructed in Section 3 have all been derived using the conditional likelihood (2.9). However, as noted in Section 2, other likelihoods could be chosen. In particular, a good approximation to the exact likelihood is given by (2.9) with C replaced by C̃, and this likelihood can, in turn, be further approximated by (2.9) with C replaced by C̄. These likelihoods yield precisely the same test statistics as before, but with the conditional likelihood estimates (α̂_T, σ̂²_T, etc.) replaced by the approximate maximum likelihood estimates (α̃_T, σ̃²_T, etc.) for the likelihood based on C̃, or by the Yule-Walker estimates (ᾱ_T, σ̄²_T, etc.) for the likelihood based on C̄. Since √N C, √N C̃ and √N C̄ are asymptotically equivalent, the stated asymptotic distributional properties of the various test statistics and distance measures based on C are the same as for those based on C̃ or C̄.

4. Power functions of the tests

Power is an important consideration in the choice of an appropriate test or


distance measure. The more powerful the test, the greater the recognition
accuracy of the associated distance measure. In this section the asymptotic
distributions of many of the test statistics given in Section 3 are determined for
the case when the null hypothesis is false. From these distributions ap-
proximations to the power functions can be constructed.
For the situation described in Subsection 3.1 it can be shown that when α_R ≠ α_T both ℱ and l have distributions that are asymptotically equivalent to a non-central χ²_p distribution with non-centrality parameter

τ₁² = N'_T (α_T − α_R)^T Σ_T⁻¹ (α_T − α_R) .        (4.1)

Here Σ_T is the limiting covariance matrix of √N'_T(α̂_T − α_T). The precise form of Σ_T is given in Hannan [25, p. 332], where it can be seen that Σ_T depends only on α_T. In the Gaussian case l is asymptotically equivalent to a non-central F_{p,N'_T−p} distribution with non-centrality parameter τ₁². Note that the distribution of Itakura's distance I can be determined from ℱ using (3.3).
For the case where σ²_R = σ²_T (Subsection 3.2) it is readily established that pF has, in general, a distribution that is asymptotically equivalent to a non-central χ²_p distribution with non-centrality parameter

τ₂² = (α_R − α_T)^T (Σ_R/N'_R + Σ_T/N'_T)⁻¹ (α_R − α_T) .        (4.2)
In the Gaussian case, F is asymptotically equivalent to a non-central F_{p,N'_R+N'_T−2p} distribution with non-centrality parameter τ₂². When α_R ≠ α_T and σ²_R and σ²_T are not necessarily equal, χ² is asymptotically equivalent to a non-central χ²_p distribution with non-centrality parameter τ₂².
The asymptotic distributions of F when σ²_R ≠ σ²_T and, in general, of F_*, χ²_*, ℱ_* and l_* are more difficult to obtain when α_R ≠ α_T. They involve the distribution of quadratic forms of the type X^T A X, where X has a multivariate normal distribution with mean μ and covariance matrix I (the identity) and A is a positive definite symmetric matrix such that A ≠ I. Note, however, that for the case N'_R = N'_T, both F and F_* have asymptotic χ²_p/p distributions when α_R = α_T, irrespective of whether σ²_R = σ²_T or not.
In Subsection 3.4, F_* and χ²_* were derived by replacing a harmonic mean of the general form x^T(w_1 A_1⁻¹ + w_2 A_2⁻¹)⁻¹x by the arithmetic mean x^T(w_1 A_1 + w_2 A_2)x. Here x is a vector, A_1 and A_2 are positive definite symmetric matrices and w_1 and w_2 are non-negative weights satisfying w_1 + w_2 = 1. Using the fact that, on the set of positive definite Hermitian matrices, the function A⁻¹ is strictly matrix convex [44],

(w_1 A_1 + w_2 A_2)⁻¹ ≤ w_1 A_1⁻¹ + w_2 A_2⁻¹ ,        (4.3)

where U ≤ V means that V − U is positive semidefinite. Consequently,

(w_1 A_1⁻¹ + w_2 A_2⁻¹)⁻¹ ≤ w_1 A_1 + w_2 A_2        (4.4)

and so, for any arbitrary x,

x^T(w_1 A_1⁻¹ + w_2 A_2⁻¹)⁻¹x ≤ x^T(w_1 A_1 + w_2 A_2)x ,        (4.5)

with equality if and only if A_1 = A_2. This is, essentially, a generalisation to quadratic forms of the familiar inequality between the arithmetic and harmonic means of a finite set of positive numbers. Now, replacing x by α̂_R − α̂_T and the w_i, A_i (i = 1, 2) by the forms given in the discussion following (3.23), the inequality (4.5) shows that

F ≤ F_* ,   χ² ≤ χ²_* .        (4.6)

In particular, for any c ≥ 0,

P(F ≥ c) ≤ P(F_* ≥ c) ,   P(χ² ≥ c) ≤ P(χ²_* ≥ c) .        (4.7)

It would appear that F_* and χ²_* are the more powerful statistics, since F and χ² have the same respective asymptotic distributions under the null hypothesis as F_* and χ²_*. However, the asymptotic distributions of these statistics are only approximations to the true distributions for finite N'_R and N'_T. Hence it might be deduced from (4.6) that any apparent increase in power by F_* and χ²_* would be at the expense of a greater rate of Type I errors than that chosen. Evidence that this is indeed the case is provided in the simulation studies of de Souza and Thomson [10].
Since Itakura's distance I and its modification I_* are asymptotically equivalent to ℱ and χ²_* respectively under the null hypothesis, it might be expected that I and I_* would be equally as powerful as ℱ and χ²_* respectively in the important case of small departures from the null hypothesis. One way of showing this is to consider, under the alternative hypothesis, a sequence of values of α_R − α_T which decrease with N'_T and N'_R in such a way that τ₁² and τ₂² remain fixed. In this situation it is readily shown that I and I_* are asymptotically equivalent to ℱ and χ²_* respectively.
For the general case when σ²_R is not necessarily the same as σ²_T, the statistic F_R of Subsection 3.5 is asymptotically equivalent to a (σ²_T/σ²_R)F_{N'_T−p,N'_R−p} random variable.
The simulation studies of de Souza and Thomson [10] give some guidance as to the adequacy of the various asymptotic approximations in practice. In particular, the LPC distance measures χ² and χ²_* were found to be robust and powerful. However, of the two, χ² follows the χ²_p distribution more closely under the null hypothesis, especially in the upper tail of the distribution.

5. Computational costs of the tests

We first note that the conditional likelihood estimates α̂ and σ̂², the approximate maximum likelihood estimates α̃ and σ̃², and the Yule-Walker estimates ᾱ and σ̄² can all be computed efficiently. Indeed, Morf et al. [32] show that the number of multiplications necessary to compute α̂ and σ̂² or ᾱ and σ̄² is (N' − 1)(p + 1) + 7p² + O(p), whereas Dickinson [34] shows that α̃ and σ̃² can be computed in (N' − 1)(p + 1) + 7½p² + O(p) multiplications. In normal speech processing applications, however, the value of p chosen is such that these algorithms are no faster than the solution using Cholesky decomposition. Nevertheless, these algorithms do lead to reduced storage requirements.
Consider now the computation of the various test statistics given in Section 3 and, in particular, the case N'_R = N'_T. First observe that ℱ, l, I, ℱ_* and l_* are all functions of ℱ, which is computed more efficiently using (3.2) rather than (3.1). Moreover, F is computed more efficiently using (3.12) rather than (3.13), and χ² is more efficiently computed using (3.20) rather than (3.19). Here σ̂²_p (3.8) in F and s²_p in χ² (3.20) can be calculated using the algorithms of Friedlander et al. [45], which take advantage of the near-Toeplitz structure, measured in terms of displacement rank, of D_R and D_T. In particular, since D_R and D_T each have displacement rank 2, any linear combination c_1 D_R + c_2 D_T, where c_1 and c_2 are positive constants, has displacement rank 4. These algorithms, together with those for α̂ and σ̂², yield computationally efficient procedures for determining not only F and χ², but also all the other distance measures discussed in Section 3.
Once again, however, although the values of p used in most speech processing applications are such that these algorithms are no faster than Cholesky decomposition, they do lead to reduced storage requirements.
Using the number of multiplications as an estimate of computational complexity, the calculations for F require approximately (p + 1)(N'_T − 1) + p³/3 + 3p² + 8p/3 multiplications, whereas those for ℱ, l, I, ℱ_*, l_* and F_* each require (p + 1)(N'_T − 1) + p³/6 + 2p² + 17p/6 multiplications. χ² and χ²_* require ½(p + 1) multiplications in addition to those for F and F_* respectively, and I_* requires an additional ½p(p + 3) multiplications in addition to those for χ²_*. In arriving at these estimates it has been assumed that α̂_T, α̂_p, etc. were obtained using the Cholesky decomposition and that the number of multiplications that this entails is as given in [32]. Thus for N = 100, p = 10 the F and χ² statistics require approximately 20% more computations than any of ℱ, l, I, ℱ_*, l_* and F_*. For N = 300, p = 10 this figure reduces to approximately 8%. Therefore, except under stringent computational conditions, the choice of test statistic can be based on the properties of the test statistic concerned rather than computational cost.
Turning now to storage requirements, we note that the computation of F, χ², F_*, χ²_* and I_* requires the storage of α̂_R, σ̂²_R and D_R (p + 1 + ½p(p + 1) floating point numbers), whereas ℱ, l, I, ℱ_* and l_* require the storage of α̂_R only (p floating point numbers). Thus, for the typical case p = 10, the storage necessary to compute F, χ², F_*, χ²_* or I_* exceeds that for ℱ, l, I, ℱ_* or l_* by about 56 floating point numbers per reference template.

6. An isolated word recognition experiment

In this section we describe two versions of an isolated word recognition experiment performed on a 62-word vocabulary for the case in which N'_R = N'_T. In the first version the Itakura distance was used, and in the second it was replaced by the symmetric distance I_* (3.32) with ᾱ_R, ᾱ_T, σ̄²_R and σ̄²_T estimated using the Yule-Walker estimates (2.15) and (2.16).
The vocabulary used for this experiment comprised the letters, digits, and 26
other words consisting mainly of keyboard symbols (comma, period, asterisk,
slash, percent, dollar, etc.). Each word in the vocabulary was uttered 10 times
in random sequence by a male speaker in an ordinary laboratory environment
using a Shure SM12 headset microphone. The incoming speech was digitised by
a 14-bit A/D converter at a rate of 20 kHz. It was then pre-emphasised and a
14 pole selective autocorrelation LPC analysis was performed every 10 ms over
a 512 point Hamming window.
The end-points of the 620 words uttered were determined automatically and
corrected where necessary by hand. The first utterance of each word was used
as the reference or template, and the remaining 558 utterances were used as
test data for recognition.
Each of the 558 test utterances was recognised by finding the closest
matching template using dynamic time warping to obtain a good match. The
dynamic time warping algorithm, which has been widely used, imposed strict end-point constraints on the test and reference patterns by forcing the two sets of end-points to coincide. Under these conditions 30 recognition errors were made using the Itakura distance I (3.3).
The experiment was then repeated using the symmetric measure I_* as the local distance, and on this occasion the number of errors was 25.
Comparing these results, it can be seen that, in this experiment, the price paid for using the Itakura distance measure instead of a more powerful LPC distance measure is a 20% increase in the word recognition error rate. This is a consequence of discarding information about the variability of the reference vector ᾱ_R.
In summary, this experiment shows that speech recognition systems based on Itakura's distance I can be significantly improved by the trivial modification of replacing I by I_*, where, for the case N'_T = N'_R, I_* is directly proportional to the arithmetic mean of the Itakura distances I(ᾱ_R, ᾱ_T) and I(ᾱ_T, ᾱ_R).

Acknowledgements

The authors would like to thank S. Haltsonen for his assistance in perform-
ing the isolated word recognition experiments.

References

[1] Martin, T. B. (1977). One way to talk to computers. IEEE Spectrum 14(5), 35-39.
[2] de Souza, P. (1983). A statistical approach to the design of an adaptive self-normalising silence
detector. IEEE Trans. Acoust. Speech Signal Process. 31(3), 678-684.
[3] Brown, M. K. and Rabiner, L. R. (1982). On the use of energy in LPC-based recognition of
isolated words. Bell System Tech. J. 61(10), 2971-2987.
[4] Rabiner, L. R., Rosenberg, A. E., Wilpon, J. G. and Keilin, W. J. (1982). Isolated word
recognition for large vocabularies. Bell System Tech. J. 61(10), 2989-3005.
[5] Zue, V. W. and Schwartz, R. M. (1980). Acoustic processing and phonetic analysis. In: W. A.
Lea, ed., Trends in Speech Recognition 101-124. Prentice-Hall, Englewood Cliffs, NJ.
[6] Markel, J. D. and Gray, A. H. (1976). Linear Prediction of Speech. Springer, Berlin.
[7] Fant, G. C. M. (1960). Acoustic Theory of Speech Production. Mouton and Co., 's-Gravenhage,
The Netherlands.
[8] Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals. Prentice-Hall,
Englewood Cliffs, NJ.
[9] de Souza, P. (1977). Statistical tests and distance measures for LPC coefficients. IEEE Trans.
Acoust. Speech Signal Process. 25(6), 554-559.
[10] de Souza, P. and Thomson, P. J. (1982). LPC distance measures and statistical tests with
particular reference to the likelihood ratio. IEEE Trans. Acoust. Speech Signal Process. 30(2),
304-315.
[11] Bahl, L. R., Cole, A. G., Jelinek, F., Mercer, R. L., Nadas, A., Nahamoo, D. and Picheny, M.
A. (1983). Recognition of isolated-word sentences from a 5000-word vocabulary office cor-
respondence task. Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing,
1065-1067.
[12] White, G. M. and Neely, R. B. (1976). Speech recognition experiments with linear prediction,
bandpass filtering, and dynamic programming. IEEE Trans. Acoust. Speech Signal Process.
24(2), 183-188.
[13] Myers, C. S., Rabiner, L. R. and Rosenberg, A. E. (1980). Performance tradeoffs in dynamic
time warping algorithms for isolated word recognition. IEEE Trans. Acoust. Speech Signal
Process. 28(6), 622-633.
[14] Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy, Chap. 4. Freeman, San
Francisco, CA.
[15] Itakura, F. (1975). Minimum prediction residual principle applied to speech recognition. IEEE
Trans. Acoust. Speech Signal Process. 23(1), 67-72.
[16] Tribolet, J. M., Rabiner, L. R. and Wilpon, J. G. (1982). An improved model for isolated
word recognition. Bell System Tech. J. 61(9), 2289-2312.
[17] Sakoe, H. (1979). Two-level DP-matching--a dynamic programming-based pattern matching
algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 27(6),
588-595.
[18] Jelinek, F., Mercer, R. L. and Bahl, L. R. (1982). Continuous speech recognition: Statistical
methods. In: P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2, 549-573.
North-Holland, Amsterdam.
[19] Lea, W. A. (1983). Selecting the best speech recogniser for the job. Speech Technology 1(4),
10-29.
[20] Rabiner, L. R. and Levinson, S. E. (1981). Isolated and connected word recognition--theory
and selected applications. IEEE Trans. Commun. 29(5), 621-659.
[21] Makhoul, J. (1975). Linear prediction: a tutorial review. Proc. IEEE 63(4), 561-580.
[22] Buzo, A., Gray, R. M., Gray, A. H. and Markel, J. D. (1980). Speech coding based upon
vector quantization. IEEE Trans. Acoust. Speech Signal Process. 28(5), 562-574.
[23] Juang, B. H., Wong, D. Y. and Gray, A. H. (1982). Distortion performance of vector
quantization for LPC voice coding. IEEE Trans. Acoust. Speech Signal Process. 30(2),
294-304.
[24] Mann, H. B. and Wald, A. (1943). On the statistical treatment of linear stochastic difference
equations. Econometrica 11, 173-220.
[25] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[26] Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
[27] Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.
[28] McLeod, I. (1975). Derivation of theoretical autocovariance function of autoregressive-moving
average time series. Appl. Statist. 24(2), 255-256.
[29] Ansley, C. F. (1979). An algorithm for the exact likelihood of a mixed autoregressive-moving
average process. Biometrika 66, 59-65.
[30] Ljung, G. M. and Box, G. E. P. (1979). The likelihood function of stationary autoregressive-
moving average models. Biometrika 66, 265-270.
[31] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control (rev.
ed.). Holden-Day, San Francisco, CA.
[32] Morf, M., Dickinson, B., Kailath, T. and Vieira, A. (1977). Efficient solution of covariance
equations for linear prediction. IEEE Trans. Acoust. Speech Signal Process. 25, 429-433.
[33] Chandra, S. and Lin, W. C. (1974). Experimental comparison between stationary and
non-stationary formulations of linear prediction applied to voiced speech analysis. IEEE
Trans. Acoust. Speech Signal Process. 22(6), 403-415.
[34] Dickinson, B. W. (1978). Two recursive estimates of autoregressive models based on maxi-
mum likelihood. J. Statist. Comput. Simulation 7, 85-92.
[35] Quenouille, M. H. (1958). The comparison of correlations in time series. J. Roy. Statist. Soc.
Ser. B 20, 158-164.
[36] Sambur, M. R. and Rabiner, L. R. (1976). A statistical decision approach to the recognition of
connected digits. IEEE Trans. Acoust. Speech Signal Process. 24(6), 550-558.
[37] Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury, Belmont, CA.
[38] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New
York.
[39] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London.
[40] Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin 2, 110-114.
[41] Tribolet, J. M., Rabiner, L. R. and Sondhi, M. M. (1979). Statistical properties of an LPC
distance measure. IEEE Trans. Acoust. Speech Signal Process. 27(5), 550-558.
[42] Rabiner, L. R., Levinson, S. E., Rosenberg, A. E. and Wilpon, J. G. (1979). Speaker-
independent recognition of isolated words using clustering techniques. IEEE Trans. Acoust.
Speech Signal Process. 27, 336-349.
[43] Roussas, G. G. (1973). A First Course in Mathematical Statistics. Addison-Wesley, Reading,
MA.
[44] Bellman, R. (1970). Introduction to Matrix Analysis. McGraw-Hill, New York.
[45] Friedlander, B., Kailath, T., Morf, M. and Ljung, L. (1978). Extended Levinson and
Chandrasekhar equations for general discrete-time linear estimation problems. IEEE Trans.
Automat. Control 23, 653-659.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 413-449

Varying Coefficient Regression

D. F. Nicholls and A. R. Pagan

I. Introduction

Very early on in the development of methods for the analysis of time series and the relationships between time series, it was recognized that techniques based upon constant coefficient models might well be inadequate. Early examples of this position would be Rubin (1950) and Kendall (1953); the former allowed for some random variation in the coefficients whilst the latter restricted them to a deterministically evolving pattern. Despite these qualms, constant coefficient models have proven to be effective in empirical data analysis, so much so that only relatively recently have there appeared either theoretical papers detailing the methodology for dealing with the types of coefficient variation important in the analysis of time series or empirical studies providing applications of these techniques.¹
A number of surveys have been written in the last five years of the area that this chapter covers, a book by Raj and Ullah (1981) and a contribution by Chow (1983) to the Handbook of Econometrics being prominent examples. As well, there is an annotated bibliography by Johnson (1977, 1980). Combining these references would provide any reader with a fairly comprehensive list of papers on the topic. For this reason we do not attempt an exhaustive examination of all the work done. Our objective is best understood by considering what it was that made the work by Box and Jenkins (1976) so seminal. Though there were some advances in estimation and hypothesis testing documented in that book, it is arguable that most of the techniques used by them had been available for some period of time: the autocorrelation function had been routinely computed during spectral analysis, a non-linear least squares technique for fitting models with moving average errors can be found in Whittle (1954), and the analysis of residuals for diagnostic purposes was long a feature of applied research. What was pathbreaking in their material was the presen-
¹Because of the nature of this volume a large literature based on longitudinal data - which indexes responses by individual units to allow for variation in model coefficients across individuals - is ignored. Some of this literature is surveyed in Swamy (1971) and Engle and Watson (1979).

tation of an integrated approach to time series modelling, involving the specification/estimation/diagnostics cycle.² When approached in this disciplined
way it proved easy to both communicate and assimilate techniques that had
been in existence previously but had not been extensively used.
It seems likely therefore that, in a book concerned with time series analysis,
any discussion on varying coefficient models can be usefully structured in the
same way. As will be shown, the estimation phase has received the pre-
dominant attention to date and yet, just as in standard time series analysis, it is
probably the specification part of the cycle which is critical for practical work.
Consequently, some of this chapter is an attempt to remedy that deficiency,
although it will be apparent that much remains to be done.
The analogy with Box and Jenkins' approach can be pushed one step further.
In their research they recognized that the presence of seasonal factors in time
series led to a different class of models than would be appropriate if no
seasonal effects were present; the modelling cycle remained the same but
different models were likely to be required. It is also useful to make such a
distinction in discussing the varying coefficient regression (VCR) literature.
To clarify that contention, (1.1) represents the model examined in this paper:

y_t = x_t β_t + e_t .        (1.1)

Here e_t is a martingale difference process with E(e²_t | ℱ_{t−1}) = σ²_t < ∞ a.s., and ℱ_{t−1} is the sigma field composed of a set of events that includes {y_{t−j}}_{j=1}^∞, {x_{t−j}}_{j=1}^∞ and may include x_t if it is taken to be exogenous; x_t is a 1 × p vector of regressors; and β_t is specified as following a multivariate ARIMA process A(L)(β_t − β̄) = η_t, where A(L) is a (possibly rational) polynomial in the backward lag operator L. Although such a characterization is restrictive, on the basis of the success of ARIMA models in representing time series it is to be hoped that β_t could also be approximated in such a way. The noise driving β_t − β̄, η_t, is taken to be i.i.d.(0, Σ) and independent of {e_t}, while β̄ is the mean of β_t when the process generating β_t is stationary, but equals zero if that process is ARIMA.³
Equation (1.1) illustrates the three dimensions of any particular model: the nature of x_t, the nature of the process generating β_t, and the constancy of σ²_t. Table 1 lists the various assumptions employed about those three dimensions in this chapter.
Altogether there are some 18 possible combinations of these assumptions.
Some are discussed elsewhere in this volume, e.g. (X1, B1, V1) which

²Our preference is for the term 'specification' rather than 'identification' to describe the process of a preliminary screening of models, as this latter term also needs to be used when discussing whether unique estimates of the unknown parameters of a process can be obtained.
³In restricting β_t to have a constant mean at most we have ignored the possibility that β̄ might vary in a deterministic fashion with some variables z_t (say), i.e. β̄_t = E(β_t) = z_t δ. As will be evident from later analysis this modification merely induces extra regressors involving the cross product between z_t and x_t and does not change the essence of our proposals.

Table 1
Assumptions employed in models

(X1) x_t contains lagged values of y_t.
(X2) x_t is a strictly exogenous set of variables. They may be stochastic or non-stochastic with a uniform bound.
(X3) x_t contains endogenous variables, i.e. (1.1) is part of a set of simultaneous equations.

(B1) β_t = β̄.
(B2) β_t = β̄ + η_t, i.e. A(L) = 1. This will be referred to as random coefficient variation.
(B3) A(L) ≠ 1. This case will be referred to as evolving coefficient variation.

(V1) σ²_t = σ².
(V2) σ²_t is not constant.

represents constant coefficient autoregressive models. Some have never been formally examined within the literature, e.g. (X1, B3, V2) and, apart from some general comments later, will have to be ignored. In Table 2 the content of each of the later sections of the chapter is matched with the various demarcations of Table 1.

Table 2

Section   Models
2         (X1, B2, V1), (X2, B2, V1)
3         (X1, B3, V1), (X2, B3, V1)
4.1       (X3, B2, V1)
4.2       (X1, B1, V2), (X2, B1, V2)

A number of the combinations missing from Table 2, e.g. (X3, B3, V2), may well be too complex to solve, particularly in the light of the difficulties facing investigators with the simpler alternative (X3, B2, V1), and we therefore do not even attempt to analyse such classes of models here.

2. Random coefficient variation

In this section we shall be primarily interested in models of the form (X1, B2, V1) and shall outline known results relating to this particular class of model. These results should extend, using similar arguments, to the class (X2, B2, V1)
or even to mixtures of these two classes of models. When appropriate, reference will be made to where proofs of results for the wider class of models are available.
The model (X1, B2, V1) can be written in the form

y_t = Σ_{j=1}^{p} β_{t,j} y_{t−j} + e_t ,

with β_{t,j} the jth element of the p × 1 vector β_t = β̄ + η_t, so that this model becomes

y_t = Σ_{j=1}^{p} (β̄_j + η_{t,j}) y_{t−j} + e_t .        (2.1)

For models of this form Andel (1976) derived conditions for their second-order stationarity, while Nicholls and Quinn (1981), referring to such models as random coefficient autoregressions (RCA), have extended Andel's results to the case of multivariate RCAs. For simplicity we shall concentrate, in the remainder of this section, on scalar models, though most of the results extend in a natural way to the multivariate situation.
In the case of the model (2.1) we make the following assumptions:
(i) {e_t; t = 0, ±1, ±2, ...} is a sequence of i.i.d. random variables with zero mean and variance σ².
(ii) β̄' = (β̄_1, ..., β̄_p) is a vector of constants.
(iii) If η'_t = (η_{t,1}, ..., η_{t,p}), then {η_t; t = 1, ..., T} is a sequence of i.i.d. random vectors with zero mean and E(η_t η'_t) = Σ.
(iv) {η_t} and {e_t} are mutually independent.
If

M = [ 0     I_{p−1} ]
    [ β̄_p  ⋯  β̄_1  ]

with the (1,1) block being the (p − 1) × 1 null matrix, the (1,2) block the (p − 1) × (p − 1) identity matrix, and ℱ_t is the σ-field generated by {(e_s, η_s); s ≤ t}, then it is possible to show (see Nicholls and Quinn, 1982, p. 31) that, when Σ > 0, there exists a unique ℱ_t-measurable second-order stationary solution to (2.1) if and only if M has all its eigenvalues within the unit circle and (vec Σ)' vec W < 1, where vec W is the last column of the matrix (I − M ⊗ M)⁻¹. (The tensor or Kronecker product ⊗ together with associated definitions and useful results are given in the Appendix.)
To obtain asymptotic properties of the estimators of the parameters of (2.1), the boundedness of the second moments of {y_t} is required. If the two criteria required for the second-order stationarity of (2.1) are bounded away from unity, it follows that this moment condition will be satisfied. As a result the next assumption for (2.1) is
(v) The largest eigenvalue of M is less than or equal to (1 − δ_1) and (vec Σ)' vec W ≤ (1 − δ_2), where δ_1 > 0 and δ_2 > 0 are both arbitrarily small.
The parameters β̄_j, j = 1, ..., p, and Σ must be such that the solution {y_t} to (2.1) is strictly stationary and ergodic, these conditions being required to obtain asymptotic properties of the estimators. A sufficient condition for this strict stationarity and ergodicity is that a second-order stationary solution to (2.1) exists; a feature guaranteed by (v), together with the fact that {e_t} and {η_t} are strictly stationary, which follows immediately from (i) and (iii).
If z_t = K_p vec(Y_{t−1} Y'_{t−1}), where K_p is defined in the Appendix, the proofs of a number of the theorems to follow require that E{(z_t − E(z_t))(z_t − E(z_t))'} is positive definite. This follows from
(vi) There is no non-zero constant vector α such that α'(z_t − E(z_t)) = 0 almost everywhere.
The next assumption to be imposed on (2.1) is
(vii) The variance σ² of e_t is bounded below by δ_3 while the smallest eigenvalue of Σ is bounded below by δ_4, with δ_3 > 0 and δ_4 > 0 both arbitrarily small.
Imposing (vii) eliminates the possibility of the vector of parameters of (2.1) lying on the boundary of the parameter space. Such situations cause difficulties when obtaining asymptotic properties of estimators. We discuss this further in the next subsection.

2.1. Specification of the model


The first step in the specification stage is to determine the order of (2.1).
Rewriting this model in the form

y_t = Σ_{j=1}^{p} β̄_j y_{t−j} + u_t = Y'_{t−1} β̄ + u_t        (2.2)

with

u_t = Σ_{j=1}^{p} η_{t,j} y_{t−j} + e_t = Y'_{t−1} η_t + e_t ,        (2.3)

the ordinary least squares (OLS) estimates of β̄_p, p = 1, 2, ..., are just the partial correlation coefficients. Furthermore, as seen later in Theorem 2.2, these estimators are strongly consistent and asymptotically normally distributed. Consequently, in order to determine the order of the model (2.1) (or (2.2)), the partial autocorrelation coefficients and their standard errors are computed for orders 1, 2, .... Thereafter, the order at which the first (and subsequent) of these is not significantly different from zero is found in a similar fashion to that proposed in the Box-Jenkins procedure. If the first coefficient which is not significantly different from zero occurs at lag (p + 1), and all higher-order coefficients are not significantly different from zero, then the model is of order p.
Alternative order determination procedures, including Akaike's AIC, BIC or related criteria, could also be used. A number of these are discussed in Priestley (1981).
In practice, when considering models of the form (2.1), having determined the order of the model, the next question to be determined is whether or not a constant coefficient autoregressive model would fit the data just as well. The usual theory associated with tests based on maximum likelihood estimates will not hold in this context, since the vector of unknown coefficients under the null hypothesis lies on the boundary of the parameter space. Indeed, to demonstrate that the maximum likelihood estimators in (2.14)-(2.16) later are asymptotically normal, it is necessary to restrict the parameter space $\Theta$ in such a way that the parameters do not lie on the boundary of $\Theta$. If a boundary value were to be permitted, e.g. $\theta \ge 0$, the maximum likelihood estimator (MLE) of $\theta$ would need to solve $\{\max_\theta l(\theta)\ \text{s.t.}\ \theta \ge 0\}$, where $l(\theta)$ is the log likelihood. As Moran (1971), Chant (1974) and, more recently, Gourieroux et al. (1982) have shown, the estimator $\hat{\theta}_R$ that stems from this restricted problem has a very complex asymptotic distribution. From the results of those papers the likelihood ratio test is a mixture of $\chi^2$ and degenerate random variables, while Gourieroux et al. also demonstrate that this is true of the test statistic based upon the Kuhn-Tucker multipliers, the analogue of the Lagrange Multiplier test statistic.
Although the obvious way to test for coefficient constancy is to test the hypothesis $\Sigma = 0$, the fact that the MLE of $\Sigma$ has a complex distribution when the null is true makes this route unattractive. One potential solution is to base a test for $\Sigma = 0$ upon the scores $\partial l/\partial\theta$, thereby ignoring the constraint $\Sigma \ge 0$. Such an approach loses power compared to that based on the scores of the restricted likelihood, but it does at least have a tractable asymptotic distribution; essentially this represents the proposal by Moran that Neyman's (1959) $C(\alpha)$ statistic be used. In this sense, the test presented below can be regarded as a score or $C(\alpha)$ or Lagrange Multiplier test statistic.
In order to develop a test of the null hypothesis that $\Sigma = 0$ or, equivalently, $\gamma = \operatorname{vech}\Sigma = 0$, let $\hat{\beta}_T$ and $\hat{\sigma}^2_T$ be the maximum likelihood estimates of $\beta$ and $\sigma^2$ under that hypothesis. ($\hat{\beta}_T$ and $\hat{\sigma}^2_T$ are, of course, the usual maximum likelihood estimates for a fixed coefficient autoregression.) For a sample of size $T$, if

$$Y'_{t-1} = (y_{t-1}, \ldots, y_{t-p})\,, \qquad \hat{e}_t = y_t - \hat{\beta}'_T Y_{t-1}\,, \qquad \hat{\sigma}^2_T = T^{-1}\sum_{t=1}^{T}\hat{e}_t^{\,2}\,,$$

$$\bar{z} = T^{-1}\sum_{t=1}^{T}z_t\,, \qquad \hat{W}_T = T^{-1}\sum_{t=1}^{T}(z_t - \bar{z})(z_t - \bar{z})'\,, \qquad (2.4)$$

$$\hat{g}_T = T^{-1}\sum_{t=1}^{T}\Big(\frac{\hat{e}_t^{\,2}}{\hat{\sigma}^2_T} - 1\Big)(z_t - \bar{z})$$

and

$$\Phi = (T/2)\,\hat{g}'_T\,\hat{W}_T^{-1}\,\hat{g}_T\,,$$

then a test for $\gamma = 0$ may be based on Theorem 2.1, a proof of which appears in Nicholls and Quinn (1982, p. 103).

THEOREM 2.1. Under the assumptions (i)-(vii) and the conditions that $\sigma^2 > \delta_3 > 0$ and $\gamma = 0$, $\Phi$ given by (2.4) is asymptotically distributed as $\chi^2$ with $p(p+1)/2$ degrees of freedom.

As a result of this theorem, having obtained the maximum likelihood estimates of $\beta$ and $\sigma^2$ under the hypothesis that $\gamma = 0$, a test can be constructed for whether the model has fixed or random coefficients. A test which is more powerful than that based on Theorem 2.1 has been derived in Nicholls and Quinn (1982, pp. 105-108). Because of its more complicated nature and the detail involved, however, we will not present it here but refer the interested reader to that reference.
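A minimal sketch of this score-type test for the first-order case, assuming the form of $\Phi$ reconstructed in (2.4) above (all function names are ours; scipy is used only for the $\chi^2$ tail probability):

    import numpy as np
    from scipy import stats

    def coefficient_constancy_test(y):
        # Score test of gamma = 0 for p = 1, where z_t = y_{t-1}^2 and
        # Phi is asymptotically chi-squared with p(p+1)/2 = 1 degree of freedom.
        y = np.asarray(y, dtype=float)
        ylag, ycur = y[:-1], y[1:]
        beta_hat = (ylag @ ycur) / (ylag @ ylag)   # MLE of beta under the null
        e_hat = ycur - beta_hat * ylag
        sigma2_hat = np.mean(e_hat ** 2)
        z = ylag ** 2
        zc = z - z.mean()
        g = np.mean((e_hat ** 2 / sigma2_hat - 1.0) * zc)
        W = np.mean(zc ** 2)
        Phi = 0.5 * len(ycur) * g ** 2 / W
        return Phi, stats.chi2.sf(Phi, df=1)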

2.2. Estimation
We now examine the least squares and maximum likelihood estimation of
the parameters of (2.1), along with the asymptotic properties of these estimates.

2.2.1. Least squares estimation


For the model (2.1) the parameters to be estimated are the coefficients $\beta' = (\beta_1, \ldots, \beta_p)$, the variance $\sigma^2$ of $\{e_t\}$, and the covariance matrix $\Sigma$ of $\{\eta_t\}$. Since $\Sigma$ is symmetric, it will only be necessary to estimate $\gamma = \operatorname{vech}\Sigma$. Given a sample of size $(T+p)$, say $y_{1-p}, \ldots, y_0, y_1, \ldots, y_T$, a two-step estimation procedure is used to obtain the least squares estimates of these parameters. In contrast, the maximum likelihood method is an iterative process requiring initial estimates of the parameters to commence the iterations. The least squares estimates will be seen to be most suited for this purpose since they will be shown to be strongly consistent.
In order to obtain the least squares estimates it is more convenient to consider (2.1) in the form (2.2) and (2.3). Now $\mathcal{F}_t$ is the $\sigma$-field generated by $\{(e_s, \eta_s);\ s \le t\}$, $e_t$ and $\eta_t$ are independent of $\{(e_{t-1}, \eta_{t-1}), (e_{t-2}, \eta_{t-2}), \ldots\}$, and $Y_{t-1}$ is a measurable function of this set alone. Thus

$$E(u_t \mid \mathcal{F}_{t-1}) = Y'_{t-1}E(\eta_t) + E(e_t) = 0 \qquad (2.5)$$

and

$$E(u_t^2 \mid \mathcal{F}_{t-1}) = h_t = E\{e_t^2\} + 2E\{e_t\eta'_tY_{t-1} \mid \mathcal{F}_{t-1}\} + E\{Y'_{t-1}\eta_t\eta'_tY_{t-1} \mid \mathcal{F}_{t-1}\} = \sigma^2 + Y'_{t-1}\Sigma\,Y_{t-1}\,.$$

From the results given in the Appendix, it follows that

$$Y'_{t-1}\Sigma\,Y_{t-1} = (Y'_{t-1} \otimes Y'_{t-1})\operatorname{vec}\Sigma = (Y'_{t-1} \otimes Y'_{t-1})K'_p\operatorname{vech}\Sigma = (Y'_{t-1} \otimes Y'_{t-1})K'_p\gamma\,.$$

But $Y_{t-1}$ is a $p \times 1$ vector, so that

$$\operatorname{vec}(Y_{t-1}Y'_{t-1}) = (Y_{t-1} \otimes Y_{t-1})\operatorname{vec}(1) = Y_{t-1} \otimes Y_{t-1}$$

and since $(A \otimes B)' = A' \otimes B'$, it follows that

$$\{\operatorname{vec}(Y_{t-1}Y'_{t-1})\}' = Y'_{t-1} \otimes Y'_{t-1}\,.$$

Consequently,

$$h_t = \sigma^2 + \{\operatorname{vec}(Y_{t-1}Y'_{t-1})\}'K'_p\gamma = \sigma^2 + z'_t\gamma\,. \qquad (2.6)$$

From (2.2) and (2.6) least squares methods may now be used to obtain estimates of the unknown parameters, namely $\beta$, $\sigma^2$ and $\gamma$. For the sample of size $(T+p)$, the first step is to apply least squares to (2.2) to yield the estimate $\hat{\beta}$ of $\beta$ as

$$\hat{\beta} = \Big(\sum_{t=1}^{T}Y_{t-1}Y'_{t-1}\Big)^{-1}\sum_{t=1}^{T}Y_{t-1}\,y_t\,. \qquad (2.7)$$

The second step in the estimation procedure is as follows. First, use (2.7) to determine, for $t = 1, \ldots, T$,

$$\hat{u}_t = y_t - Y'_{t-1}\hat{\beta}\,.$$

Second, from (2.6), if we let

$$\hat{u}_t^{\,2} = \sigma^2 + z'_t\gamma + \nu_t\,, \qquad (2.8)$$

then the estimates $\hat{\sigma}^2$ and $\hat{\gamma}$ of $\sigma^2$ and $\gamma$ respectively are obtained by regressing $\hat{u}_t^{\,2}$ on 1 and $z_t$. Doing so gives

$$\hat{\gamma} = \Big\{\sum_{t=1}^{T}(z_t - \bar{z})(z_t - \bar{z})'\Big\}^{-1}\sum_{t=1}^{T}\hat{u}_t^{\,2}(z_t - \bar{z}) \qquad (2.9)$$

and

$$\hat{\sigma}^2 = T^{-1}\sum_{t=1}^{T}\hat{u}_t^{\,2} - \bar{z}'\hat{\gamma}\,. \qquad (2.10)$$

Expressions (2.7), (2.9) and (2.10) provide the required least squares estimates.
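The two-step procedure (2.7)-(2.10) is straightforward to implement. A minimal sketch for the first-order case, where $z_t = y_{t-1}^2$ (the function name is ours):

    import numpy as np

    def two_step_ls(y):
        # Two-step least squares for y_t = (beta + eta_t) y_{t-1} + e_t,
        # following (2.7)-(2.10); returns estimates of beta, gamma and sigma^2.
        y = np.asarray(y, dtype=float)
        ylag, ycur = y[:-1], y[1:]
        beta_hat = (ylag @ ycur) / (ylag @ ylag)          # (2.7)
        u2 = (ycur - beta_hat * ylag) ** 2
        z = ylag ** 2
        zc = z - z.mean()
        gamma_hat = (zc @ u2) / (zc @ zc)                 # (2.9)
        sigma2_hat = np.mean(u2) - z.mean() * gamma_hat   # (2.10)
        return beta_hat, gamma_hat, sigma2_hat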
The asymptotic properties of these least squares estimators are contained in:

THEOREM 2.2. For the process $\{y_t\}$ satisfying (2.1) under conditions (i)-(vi), and for $\hat{\beta}$, $\hat{\gamma}$ and $\hat{\sigma}^2$ defined by (2.7), (2.9) and (2.10) respectively, $\hat{\theta}' = (\hat{\beta}', \hat{\gamma}', \hat{\sigma}^2)$ converges almost surely to $\theta' = (\beta', \gamma', \sigma^2)$ if $E(y_t^4) < \infty$. When $E(y_t^8) < \infty$, $T^{1/2}(\hat{\theta} - \theta)$ has a distribution which converges to that of a normally distributed random vector with mean zero and covariance matrix $\Omega$.

The strong consistency of the estimators can be demonstrated using the ergodic theorem, while a central limit theorem can be derived using a martingale central limit theorem due to Billingsley (1961). A proof of Theorem 2.2 along these lines is given in Nicholls and Quinn (1982, pp. 46-50), while the form of the covariance matrix $\Omega$ is derived on pp. 57-58.

The eighth moment condition is only required for the existence of the covariance matrix $\Omega$. In practice, of course, such a condition may not be easy to check; even the analysis of the fourth moment condition may be complex, though for most of the simpler models which one would consider in practice this is quite straightforward.
As an example consider the model

$$y_t = \beta_ty_{t-1} + e_t\,, \qquad \beta_t = \beta + \eta_t\,, \qquad (2.11)$$

where the $e_t$ are n.i.d.$(0, \sigma^2)$ and the $\eta_t$ are n.i.d.$(0, \delta^2)$. Relating this model to (2.1), it follows that $M = \beta$, $\Sigma = \delta^2$ and $W = (1 - \beta^2)^{-1}$; for second-order stationarity, therefore, we must have the largest eigenvalue of $M$ within the unit circle, i.e. $|\beta| < 1$, and $(\operatorname{vec}\Sigma)'\operatorname{vec}W < 1$, which in this case becomes $\delta^2/(1 - \beta^2) < 1$. Thus for model (2.11) a stationary solution exists if $|\beta| < 1$ and $\delta^2 < (1 - \beta^2)$; that is, if and only if $\beta^2 + \delta^2 < 1$, since this latter condition implies that $|\beta| < 1$.
From Theorem 2.2, in order that the estimators of the parameters be strongly consistent, it is sufficient that $E(y_t^4) < \infty$. For the model (2.11), exploiting $e_t \sim$ n.i.d.$(0, \sigma^2)$,

$$E(y_t^4) = E(\beta_t^4)E(y_{t-1}^4) + 6E(\beta_t^2)E(y_{t-1}^2)\sigma^2 + 3\sigma^4\,.$$

Furthermore, from (2.11) it follows that

$$E(\beta_t^2) = \beta^2 + \delta^2 \quad\text{and}\quad E(y_t^4) = E(y_{t-1}^4)\,,$$

producing

$$E(y_t^4) = \{3\sigma^4 + 6\sigma^2(\beta^2 + \delta^2)E(y_{t-1}^2)\}/\{1 - E(\beta_t^4)\}\,.$$

Thus a necessary condition for $y_t$ to have a finite fourth moment is that $E(\beta_t^4) < 1$. Since $\eta_t \sim$ n.i.d.$(0, \delta^2)$, it is not hard to see that

$$E(\beta_t^4) = E(\beta + \eta_t)^4 = \beta^4 + 6\beta^2\delta^2 + 3\delta^4 < 1\,. \qquad (2.12)$$

From this it follows that (2.11) will be stationary if and only if $\beta^2 + \delta^2 < 1$, while (2.12) gives a sufficient condition for $\{y_t\}$ to have a finite fourth moment, a condition required for the strong consistency of the estimators.
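A quick numerical check of the two conditions, under illustrative parameter values of our own choosing:

    def rca_conditions(beta, delta2):
        # Stationarity and fourth-moment conditions for model (2.11),
        # with delta2 denoting delta^2.
        stationary = beta ** 2 + delta2 < 1.0                                # beta^2 + delta^2 < 1
        fourth = beta ** 4 + 6 * beta ** 2 * delta2 + 3 * delta2 ** 2 < 1.0  # (2.12)
        return stationary, fourth

    print(rca_conditions(0.5, 0.2))    # (True, True)
    print(rca_conditions(0.9, 0.15))   # (True, False): stationary, but E(y_t^4) is not finite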
It should also be noted that the results of Hannan and Kavalieris (1983)
could be applied to (2.1). For a general class of scalar stationary processes these
authors show that under very general conditions the least squares estimates of
the parameters, as well as the estimate of the order of the model using BIC,
converge almost surely to their theoretical values. We shall not present a
detailed discussion of their results here but refer the interested reader to that
paper.
Turning finally to least squares estimation of model (X2, B2, V1), the presence of exogenous variables rather than lagged values of $y_t$ does not change the strategy outlined above. Amemiya (1977) provides a formal proof of this proposition when the $x_t$'s are non-stochastic with an upper bound, while Hildreth and Houck (1968) made a very similar proposal, differing only in having $E(\hat{u}_t^2)$ as the right-hand side of (2.8) instead of $E(u_t^2) = \sigma^2 + z'_t\gamma$. Because $\sum z_t(\hat{u}_t^2 - u_t^2)$ is $o_p(T^{1/2})$, such a substitution does not affect the limiting distribution of $T^{1/2}(\hat{\theta} - \theta)$, but it might be superior in small samples.

2.2.2. Maximum likelihood estimation

In order to obtain the MLE, the likelihood function is set up as if $\{e_t\}$ and $\{\eta_t\}$ were jointly normal. The estimates obtained from maximizing this function can be shown to be strongly consistent and to satisfy a central limit theorem. As the likelihood function to be optimized will be found to be non-linear in the parameters to be estimated, it will be necessary to use an iterative procedure to obtain the estimator. In practice, it is desirable to commence the iterative procedure as close to the global maximum as possible, thereby seeking to avoid convergence to a local optimum. As a result, it is best if the iterations commence with the least squares estimates, which have appeal since they are strongly consistent.

The moment conditions in Theorem 2.2 will not be required in order to obtain the asymptotic properties of the MLE; all that will be required in this case are assumptions (i)-(vii), along with the assumption of the finiteness of the moments of $\{e_t\}$ and $\{\eta_t\}$.
For a sample $y_1, \ldots, y_T$ from a time series $\{y_t\}$ satisfying (2.1) under conditions (i)-(vii), we derive the likelihood function conditional on the pre-period values $y_{1-p}, \ldots, y_0$, acting as though $\{e_t\}$ and $\{\eta_t\}$ were jointly normal. From (2.2) it follows that the conditional expectation is

$$E(y_t \mid \mathcal{F}_{t-1}) = \beta'Y_{t-1}\,,$$

while from (2.2) and (2.6) the conditional variance is

$$\operatorname{var}(y_t \mid \mathcal{F}_{t-1}) = E(u_t^2 \mid \mathcal{F}_{t-1}) = \sigma^2 + \gamma'z_t\,.$$

Suppose $f_s(y_t, \ldots, y_{t-s+1} \mid \mathcal{F}_{t-s})$ is the conditional probability density function of $y_t, \ldots, y_{t-s+1}$ given the $\sigma$-field $\mathcal{F}_{t-s}$. Then the joint density of $y_1, \ldots, y_T$, conditional on the pre-period values $(y_0, \ldots, y_{1-p})$, is

$$f(y_1, \ldots, y_T \mid y_0, \ldots, y_{1-p}) = \prod_{t=1}^{T}f_1(y_t \mid \mathcal{F}_{t-1}) = (2\pi)^{-T/2}\prod_{t=1}^{T}(\sigma^2 + \gamma'z_t)^{-1/2}\exp\Big\{-\frac{(y_t - \beta'Y_{t-1})^2}{2(\sigma^2 + \gamma'z_t)}\Big\} = L_T(\beta, \gamma, \sigma^2)\,,\ \text{say},$$

which is the likelihood function conditional on $(y_0, \ldots, y_{1-p})$. Rather than maximize $L_T$, however, it is more convenient to minimize

$$l_T(\beta, \gamma, \sigma^2) = -\frac{2}{T}\ln L_T(\beta, \gamma, \sigma^2) - \ln(2\pi) = T^{-1}\sum_{t=1}^{T}\ln(\sigma^2 + \gamma'z_t) + T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})^2}{\sigma^2 + \gamma'z_t}\,.$$

As this function is non-linear in $\sigma^2$ and $\gamma$, there is no closed form expression for the estimates of $\beta$, $\gamma$ and $\sigma^2$ minimizing $l_T$. Defining $\lambda = \sigma^{-2}\gamma$, the optimization problem can be reduced to one involving $\lambda$ alone. This is achieved by using concentrated maximum likelihood procedures. Initially estimates of $\beta$ and $\sigma^2$ are found in terms of $\lambda$, and subsequently are substituted back into the likelihood function. This results in a likelihood which is a function of $\lambda$ alone, and which can be minimized with respect to $\lambda$ to obtain an estimate of that parameter.

Since $\lambda = \sigma^{-2}\gamma$, $l_T$ may be expressed in the form

$$l_T = l_T(\beta, \sigma^2, \lambda) = \ln\sigma^2 + T^{-1}\sum_{t=1}^{T}\ln(1 + \lambda'z_t) + \sigma^{-2}T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})^2}{1 + \lambda'z_t}\,.$$

The first step is to obtain estimates of $\beta$ and $\sigma^2$ in terms of $\lambda$. Differentiation of $l_T$ with respect to $\beta$ and $\sigma^2$ gives

$$\frac{\partial l_T}{\partial\beta} = -2\sigma^{-2}T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})Y_{t-1}}{1 + \lambda'z_t}\,,$$

$$\frac{\partial l_T}{\partial\sigma^2} = \sigma^{-2} - \sigma^{-4}T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})^2}{1 + \lambda'z_t}\,,$$

while $\partial l_T/\partial\beta = 0$ yields

$$\hat{\beta}(\lambda) = \Big\{\sum_{t=1}^{T}\frac{Y_{t-1}Y'_{t-1}}{1 + \lambda'z_t}\Big\}^{-1}\sum_{t=1}^{T}\frac{Y_{t-1}y_t}{1 + \lambda'z_t}\,.$$

Furthermore, when $\partial l_T/\partial(\beta', \sigma^2)' = 0$, we have

$$\hat{\sigma}^2(\lambda) = T^{-1}\sum_{t=1}^{T}\frac{(y_t - \hat{\beta}'(\lambda)Y_{t-1})^2}{1 + \lambda'z_t}\,.$$

Having obtained $\hat{\beta}(\lambda)$ and $\hat{\sigma}^2(\lambda)$, these values are now substituted into $l_T - 1$ to give

$$l^*_T(\lambda) = \ln\hat{\sigma}^2(\lambda) + T^{-1}\sum_{t=1}^{T}\ln(1 + \lambda'z_t)\,,$$

which is a function of $\lambda$ alone. Thus if $\hat{\lambda}_T$ minimizes $l^*_T$, the maximum likelihood estimates of the parameters are given by

$$\hat{\sigma}^2_T = \hat{\sigma}^2(\hat{\lambda})\,, \qquad \hat{\beta}_T = \hat{\beta}(\hat{\lambda})\,, \qquad \hat{\gamma}_T = \hat{\sigma}^2_T\hat{\lambda}\,. \qquad (2.13)$$

When an iterative procedure such as the Newton-Raphson algorithm is utilized to optimize $l^*_T(\lambda)$, the first and second derivatives of $l^*_T(\lambda)$ will be required, and these may be quite complicated. Accordingly, it is better to minimize

$$l_T(\beta, \lambda) = \inf_{\sigma^2}\,l_T(\beta, \sigma^2, \lambda) - 1\,.$$

Now, as observed earlier, $\partial l_T/\partial\sigma^2 = 0$ when

$$\hat{\sigma}^2(\lambda) = T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})^2}{1 + \lambda'z_t}$$

and so

$$l_T(\beta, \lambda) = T^{-1}\sum_{t=1}^{T}\ln(1 + \lambda'z_t) + \ln\Big\{T^{-1}\sum_{t=1}^{T}\frac{(y_t - \beta'Y_{t-1})^2}{1 + \lambda'z_t}\Big\}\,.$$

Both the first two derivatives of $l_T$ with respect to $\beta$ and $\lambda$ are easily obtained. This approach proves useful not only in the application of the Newton-Raphson algorithm to obtain the required estimates, but also when discussing the asymptotic properties of the estimators. Using such a method, the maximum likelihood estimates $\hat{\beta}_T$, $\hat{\sigma}^2_T$ and $\hat{\gamma}_T$ of $\beta$, $\sigma^2$ and $\gamma = \operatorname{vech}\Sigma$ are given by

$$l_T(\hat{\beta}_T, \hat{\lambda}_T) = \inf_{(\beta', \lambda')' \in \Theta}l_T(\beta, \lambda)\,, \qquad (2.14)$$

$$\hat{\sigma}^2_T = T^{-1}\sum_{t=1}^{T}\frac{(y_t - \hat{\beta}'_TY_{t-1})^2}{1 + \hat{\lambda}'_Tz_t}\,, \qquad (2.15)$$

and

$$\hat{\gamma}_T = \operatorname{vech}(\hat{\Sigma}) = \hat{\sigma}^2_T\hat{\lambda}_T\,. \qquad (2.16)$$

These two approaches do, of course, result in the same estimates of the parameters; that is, the estimates defined by (2.13) will be the same as the corresponding estimates given by (2.14)-(2.16).
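A minimal sketch of this concentrated-likelihood calculation for the first-order case (we substitute a derivative-free optimizer for the Newton-Raphson iterations discussed above; all names are ours):

    import numpy as np
    from scipy.optimize import minimize

    def rca1_mle(y):
        # Minimize l_T(beta, lambda) of (2.14) over (beta, lambda), then
        # recover sigma^2 and gamma via (2.15)-(2.16), for the model (2.11).
        y = np.asarray(y, dtype=float)
        ylag, ycur = y[:-1], y[1:]
        z = ylag ** 2

        def conc_loglik(par):
            beta, lam = par
            w = 1.0 + lam * z
            if np.any(w <= 0):
                return np.inf                      # keep 1 + lambda' z_t positive
            resid2 = (ycur - beta * ylag) ** 2
            return np.mean(np.log(w)) + np.log(np.mean(resid2 / w))

        beta0 = (ylag @ ycur) / (ylag @ ylag)      # least squares starting value
        res = minimize(conc_loglik, x0=[beta0, 0.0], method="Nelder-Mead")
        beta_hat, lam_hat = res.x
        sigma2_hat = np.mean((ycur - beta_hat * ylag) ** 2 / (1.0 + lam_hat * z))  # (2.15)
        gamma_hat = sigma2_hat * lam_hat                                            # (2.16)
        return beta_hat, gamma_hat, sigma2_hat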
The set $\Theta$ in (2.14) is a compact subset of $R^{p(p+3)/2}$ and is such that condition (v) is satisfied along with the condition

(viii) $\sigma^{-2}\Sigma$ has strictly positive eigenvalues, all of which are greater than or equal to $\delta_5 > 0$.

This form of $\Theta$ is required in order to prove the strong consistency of, and obtain a central limit theorem for, the MLE.

We now state without proof Theorem 2.3, which relates to the asymptotic properties of the maximum likelihood estimators. For a proof of the results contained in this theorem, along with definitions of $\mathcal{I}$ and $\mathcal{J}$ stated in the theorem, see Appendix 4.2 and Sections 4.3 and 4.4 of Nicholls and Quinn (1982).

THEOREM 2.3. For $\{y_t\}$ satisfying (2.1) and assumptions (i)-(viii), the estimators $\hat{\beta}_T$, $\hat{\gamma}_T$ and $\hat{\sigma}^2_T$ obtained by minimizing $l_T(\beta, \lambda)$ over $(\beta', \lambda')' \in \Theta$, where $\lambda = \sigma^{-2}\gamma$, are strongly consistent when the true values $(\beta', \lambda')' \in \operatorname{int}\Theta$. Furthermore, assuming finiteness of the fourth moments of $\{e_t\}$ and $\{\eta_t\}$, if $\hat{\theta}'_T = (\hat{\beta}'_T, \hat{\gamma}'_T, \hat{\sigma}^2_T)$, then $T^{1/2}(\hat{\theta}_T - \theta)$ has a limiting normal distribution with mean zero and covariance matrix $\mathcal{I}^{-1}\mathcal{J}\mathcal{I}^{-1}$. When $\{e_t\}$ and $\{\eta_t\}$ are jointly normal, this covariance matrix reduces to $2\mathcal{I}^{-1}$.

Under fairly general conditions, therefore, the maximum likelihood estimates for the model (2.1) can be obtained in a straightforward manner. They are strongly consistent and satisfy a central limit theorem. Theorem 2.3 also holds when the regressors are exogenous rather than involving lagged values of $y_t$, except that some of the conditions of the theorem are no longer relevant. Rubin (1950) proposed the MLE for such a model, (X2, B2, V1), and presents a comprehensive discussion of that case.

2.3. Diagnostic checks


Having specified and estimated the model, the final step in the modelling cycle is to check that the estimated model is acceptable. From (2.3) it follows that

$$E(u_tu_{t-k} \mid \mathcal{F}_{t-1}) = Y'_{t-1}E(\eta_t\eta'_{t-k})Y_{t-k-1} + E(e_te_{t-k}) = \delta_{0,k}(Y'_{t-1}\Sigma\,Y_{t-k-1} + \sigma^2)\,,$$

where $\delta_{0,k} = 1$ when $k = 0$ and zero otherwise. From this

$$E(u_tu_{t-k} \mid \mathcal{F}_{t-1}) = 0\,, \qquad k \ne 0\,,$$

and

$$E(u_t^2 \mid \mathcal{F}_{t-1}) = h_t = Y'_{t-1}\Sigma\,Y_{t-1} + \sigma^2 = z'_t\gamma + \sigma^2$$

in agreement with (2.6).


Defining $\varepsilon_t = h_t^{-1/2}u_t$, the $E(\varepsilon_t\varepsilon_{t-k})$ are zero for $k \ne 0$. This suggests that diagnostic tests be based upon the autocorrelation function of the $\hat{\varepsilon}_t = \hat{h}_t^{-1/2}\hat{u}_t$, where $\hat{u}_t = y_t - \sum_{j=1}^{p}\hat{\beta}_jy_{t-j}$ and $\hat{h}_t = \hat{\sigma}^2 + z'_t\hat{\gamma}$. The autocorrelation coefficients $r(k)$ are estimated by

$$\hat{r}(k) = \sum\hat{\varepsilon}_t\hat{\varepsilon}_{t-k}\Big/\sum\hat{\varepsilon}_t^{\,2}$$

and these might be adopted in forming analogues of the Box-Pierce (1970) test

$$Q = T\sum_{k=1}^{K}\hat{r}^2(k) \sim \chi^2_{K-p}$$

or modifications of it such as that of Ljung and Box (1978),

$$Q_1 = T(T+2)\sum_{k=1}^{K}(T-k)^{-1}\hat{r}^2(k)\,.$$

Other refinements are set out in Priestley (1981, p. 493).


Durbin (1970) pointed out that $T^{1/2}\hat{r}(k)$ was not distributed as a standard normal deviate under $H_0$: $r(k) = 0$ when the $\hat{\varepsilon}_t$ were residuals from an autoregression, because the asymptotic variance of $T^{1/2}\hat{r}(k)$ was actually less than unity. He gave a number of methods for obtaining a test statistic of the correct size. One of these, which has become known as Durbin's second method, involved regressing $\hat{\varepsilon}_t$ against $\hat{\varepsilon}_{t-k}$ and $x_t$ (in the case of (1.1)) and testing if the coefficient of $\hat{\varepsilon}_{t-k}$ in this regression was in fact zero. In Subsection 3.3.2, a more detailed discussion of diagnostic checking in VCR models is given, and it follows from the results presented there that Durbin's second method is an appropriate way of checking model adequacy in the current context.
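A minimal sketch of these diagnostics for a fitted first-order model, using the standardized residuals and the Ljung-Box form $Q_1$ (the estimator names follow our earlier sketches and are not from the original):

    import numpy as np
    from scipy import stats

    def ljung_box_rca(y, beta_hat, gamma_hat, sigma2_hat, K=10):
        # Q_1 statistic on the standardized residuals eps_t = h_t^{-1/2} u_t,
        # with h_t = sigma^2 + gamma * y_{t-1}^2 for the first-order model.
        y = np.asarray(y, dtype=float)
        ylag, ycur = y[:-1], y[1:]
        u = ycur - beta_hat * ylag
        eps = u / np.sqrt(sigma2_hat + gamma_hat * ylag ** 2)
        T = len(eps)
        denom = eps @ eps
        Q1 = 0.0
        for k in range(1, K + 1):
            r_k = (eps[k:] @ eps[:-k]) / denom
            Q1 += r_k ** 2 / (T - k)
        Q1 *= T * (T + 2)
        return Q1, stats.chi2.sf(Q1, df=K - 1)   # df = K - p with p = 1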

3. Evolving coefficient variation

The model analysed in this section will be (X2, B3, V1), with comments about extensions to (X1, B3, V1) at the end; generally a combination of the insights of Section 2 with the results of Section 3 suggests that any results obtained for (X2, B3, V1) also extend to (X1, B3, V1), although only a few formal proofs are yet available. We will also strengthen the assumptions regarding variation in $x_t$ to make $x_t$ non-stochastic and uniformly bounded, commenting upon the likelihood that such an assumption can be relaxed at the end of Subsection 3.2.
The general form $A(L)(\beta_t - \beta) = \eta_t$ given for evolving coefficient variation under (B3) in Table 1 highlights the fact that there are many more issues to be faced than when coefficient variation was purely random. Not only is it necessary to determine the order of $A(L)$ and whether elements of the covariance matrix of $\eta_t$ are zero, but decisions must also be reached concerning the nature of $A(L)$ (autoregressive and/or moving average) and whether or not it contains unit roots. Such diversity makes a succinct summary of past research an impossibility. Nevertheless, there is value in attempting to discuss such research in a fashion parallel to the account in Section 2. Accordingly, the following three subsections, 3.1, 3.2 and 3.3, treat specification, estimation and diagnostic testing in that order. Within each subsection a further attempt at classifying contributions according to major themes is made.

3.1. Specification
The multi-dimensional nature of the specification problem present in evolving coefficient models prompts the adoption of assumptions to narrow the range of alternatives needing to be investigated. As there are two elements constituting the evolutionary process, $A(L)$ and $\eta_t$, it is natural that research on specification has proceeded by ascribing particular properties to one of the elements, leaving the data to determine the other. Thus, in Subsection 3.1.1 it is $A(L)$ which is treated as known and hypotheses about the covariance matrix of $\eta_t$ are tested, while in Subsection 3.1.2 a prior division of coefficients into fixed and evolving (equivalent to a statement about the variances of $\eta_t$) is implicitly assumed and the characteristics of $A(L)$ are determined. Combinations are conceptually possible, e.g. $A(L)$ might be assumed known except for order, but we have been unable to locate any specific discussion of the matter. It might be mentioned, however, that the order of $A(L)$ could be chosen using Akaike's criterion.

3.1.1. Testing for coefficient evolution


Some authors have addressed the question of whether any coefficients of (1.1) exhibit variation, given that the evolutionary form, if present, is known, i.e. $A(L)$ is fixed and the covariance matrix of $\eta_t$ is tested for whether it is the null matrix. To discuss this work, write (1.1) as

$$y_t = x_t\beta + u_t\,, \qquad u_t = x_t(\beta_t - \beta) + e_t\,. \qquad (3.1)$$

In matrix terms (3.1) is

$$y = X\beta + u\,, \qquad (3.2)$$

where $y$ is a $T \times 1$ vector with $y_t$ as $t$th row, $X$ is a $(T \times p)$ matrix with $x_t$ as $t$th row, and $u$ is a $T \times 1$ vector with $u_t$ as $t$th row.
Let the covariance matrix of $u$ be $\Omega(\psi)$, where $\psi$ are parameters depending upon the specification of the problem, the first element of $\psi$ being the variance of $e_t$, $\sigma^2$. If coefficients are constant, the remaining elements in $\psi$, $\psi_2$, will be zero, i.e. $E(uu') = \sigma^2I_T$ when coefficients are constant. Because it is presumed that estimation has proceeded under the restriction of constant coefficients, it is important that any test statistic designed to detect evolutionary behaviour should utilize only information from the estimated constant coefficient model. This suggests that the Lagrange Multiplier (LM) or score-test statistic (Rao, 1947; Silvey, 1959), which has this very property, be adopted.
When this approach was selected in Subsection 2.1, mention was made of the fact that certain non-negativity conditions needed to be ignored if a tractable asymptotic distribution was to be obtained. This meant that the log likelihood $l$ was that corresponding to a model in which any parametric constraints were ignored. Under such circumstances, and with $\psi_2$ a $(q \times 1)$ vector, the Lagrange Multiplier (LM) test statistic is constructed from the quadratic form $(\partial l/\partial\psi_2)'\mathcal{I}^{22}(\partial l/\partial\psi_2)$, where $\mathcal{I}^{22}$ is the block of the inverse information matrix $\mathcal{I} = -E(\partial^2l/\partial\psi\,\partial\psi')$ corresponding to $\psi_2$, and all elements are evaluated at OLS estimates of $\beta$ and $\sigma^2$ from (3.2). Assuming that $e_t$ and $\eta_t$ are multivariate normal, the LM statistic follows from Breusch and Pagan (1980, p. 245) as

$$\mathrm{LM} = \tfrac{1}{2}[\operatorname{vec}(I - \hat{u}\hat{u}'/\hat{\sigma}^2)]'\Delta_2\mathcal{I}^{22}\Delta'_2[\operatorname{vec}(I - \hat{u}\hat{u}'/\hat{\sigma}^2)]\,, \qquad (3.3)$$

where $\mathcal{I} = \Delta'\Delta$ and $\Delta' = \partial\operatorname{vec}(\Omega)/\partial\psi$, $\Delta'_2 = \partial\operatorname{vec}(\Omega)/\partial\psi_2$, and $\hat{u}$ are the residuals from OLS applied to (3.2).
Inspection of (3.3) reveals that $\Delta_2$ and $\mathcal{I}^{22}$ depend upon the type of coefficient evolution specified as the alternative hypothesis, i.e. $A(L)$ (as reflected in $\Omega$). La Motte and McWhorter (1978) consider the case where the elements of $\beta_t$ followed ARIMA(0, 1, 0) processes and the covariance matrix of $\eta_t$ was known up to a specified constant $\sigma^2_\eta$, i.e. $E(uu') = \sigma^2_\eta V + \sigma^2I_T$ where $V$ was known. Therefore $\Delta_2 = \operatorname{vec}(V)$ and $\mathcal{I}^{22}$ is a known scalar. Consequently, rather than work with (3.3), an equivalent test statistic is based upon the stochastic elements in (3.3) alone, i.e. upon $[\operatorname{vec}(\hat{u}\hat{u}'/\hat{\sigma}^2)]'\Delta_2\Delta'_2[\operatorname{vec}(\hat{u}\hat{u}'/\hat{\sigma}^2)]$ or upon $\hat{\sigma}^{-2}\hat{u}'V\hat{u}$, which is the positive square root of the expression.

The numerator of $\hat{\sigma}^{-2}\hat{u}'V\hat{u}$ is $y'M_XVM_Xy$, where $M_X = I - X(X'X)^{-1}X'$. Defining $HH' = M_X$ gives $y'HH'VHH'y = z'HVH'z = z'(\sum_i\lambda_iP_iP'_i)z$ in their notation, where $\lambda_i$ are the eigenvalues and $P_i$ the eigenvectors of $HVH'$. Rather than finding $\Pr\{T(\hat{u}'V\hat{u}/\hat{u}'\hat{u}) < c\}$, they propose an $F$ statistic based upon the ratio of $S_g = z'(\sum_{i=1}^{g}\lambda_iP_iP'_i)z$ to $\hat{\sigma}^2$. It is not entirely clear why such a strategy is followed; since the eigenvalues $\lambda_i$ need to be obtained for their solution, one might as well utilize that information to compute the exact probability that $\hat{\sigma}^{-2}\hat{u}'V\hat{u} < c$, as this is a quadratic form in normal variables and Imhof's (1961) procedure can be applied. In fact, as King and Hillier (1980) have demonstrated, the LM test above is the locally best invariant test in this situation where testing is being done on a boundary, and La Motte and McWhorter explicitly promulgated invariance as a desirable characteristic of their test. Furthermore, a direct application of the LM principle avoids the need for an arbitrary selection of $g$.
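A sketch of the central computation, $\hat{\sigma}^{-2}\hat{u}'V\hat{u}$ for a known $V$, might look as follows; the construction of $V$ for a random-walk intercept (so that $V$ has $(t, s)$th element $\min(t, s)$) is our illustrative assumption:

    import numpy as np

    def lbi_statistic(y, X, V):
        # sigma^{-2} u'Vu computed from the OLS residuals of y = X beta + u,
        # for a known alternative covariance contribution V.
        beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        u = y - X @ beta_ols
        sigma2 = (u @ u) / len(y)
        return (u @ V @ u) / sigma2

    # Illustrative V when the intercept follows a random walk: u_t contains
    # the partial sum of eta up to t, so V[t, s] = min(t+1, s+1) (0-based t, s).
    T = 100
    idx = np.arange(1, T + 1)
    V = np.minimum.outer(idx, idx).astype(float)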
At first sight, the LM test seems to provide the desired test statistic. It extends to cover $\psi_2$ being a vector of unknown parameters and, even though there is no optimal property currently known for it in this wider context, its small sample distribution may be determined numerically through Monte Carlo means; see Breusch and Pagan (1980, p. 250). Furthermore, it is tempting to claim the standard limiting property that it is $\chi^2(q)$ under the null. Unfortunately, as Tanaka (1981) shows, this property does not extend to the current model, and this rather reduces the attractiveness of the procedure. Tanaka demonstrates this failure for a simple model through an expansion of the characteristic function, while an alternative viewpoint is offered in Pagan and Hall (1983). It should be emphasized that the failure of (3.3) to be $\chi^2(q)$ asymptotically under the null hypothesis has nothing to do with the fact that non-negativity constraints on $\psi_2$ were ignored in the construction of (3.3). Rather, it is a consequence of the fact that $T^{-1}\,\partial^2L/\partial\psi\,\partial\psi'$ does not converge to its expectation as the sample size grows.
Watson (1980) concentrates upon $\beta_t$ following an ARMA rather than ARIMA process, although in most of his analysis $\beta_t$ is a scalar and follows the AR(1) $\beta_t - \beta = \alpha(\beta_{t-1} - \beta) + \eta_t$. It is natural to consider testing if $\alpha = 0$, $\sigma^2_\eta = 0$ but, as setting $\sigma^2_\eta = 0$ gives $\Omega = \sigma^2I$, it is apparent that there is some redundancy here. For any given value of $\alpha$ it is possible to construct an LM test for $\sigma^2_\eta = 0$, but the problem of combining these different statistics remains. One approach, put forth by Davies (1977), is to base the decision on the maximum of the test statistics computed for various values of $\alpha$ in the range $-1 < \alpha < 1$, and Watson proceeds in this way. Even apart from the difficulties of computation, the idea is not especially attractive, in that it rests upon a knowledge of the order and form of $A(L)$. Moreover, it is interesting to observe a result from Watson's Monte Carlo studies that the LM test indexed on $\alpha = 0$ (which is effectively an LM test for a particular type of heteroscedasticity and so is one of the members of the class studied in Godfrey (1978) and Breusch and Pagan (1979)) exhibits very good performance in detecting coefficient variation, and the simplicity of that test statistic makes it a strong candidate for detecting any need to allow for evolving coefficients.
Finally, Garbade (1977), also restricting $\beta_t$ to an ARIMA(0, 1, 0) process and $\Sigma = E(\eta_t\eta'_t)$ to be diagonal, proposed that the likelihood ratio (LR) statistic $-2\log\{L_T(\hat{\psi}_1, 0)/L_T(\hat{\psi})\}$ be computed, where $L_T$ is the likelihood, $\hat{\psi}$ is the MLE of $\psi$ and $\hat{\psi}_1$ is the MLE of $\psi_1$ given the null hypothesis $H_0$: $\psi_2 = 0$. In standard cases this statistic is asymptotically equivalent to the LM statistic and is $\chi^2(q)$ under the null hypothesis that $\psi_2 = 0$, yet Garbade found in simulation studies that the distribution of LR was not in fact $\chi^2$ in this case. Such a finding could have arisen as a concomitant of imposing the restriction $\hat{\psi}_2 \ge 0$, but it might also reflect the problems with the LM test in this situation discovered by Tanaka. Given the close relationship that normally holds between the two test statistics, there would seem to be a need to examine the asymptotic properties of the LR test statistic in this context more closely.

3.1.2. Determining likely orders and forms of coefficient evolution

It is instructive to investigate problems in order/form determination by examining the case when the sole coefficient undergoing changes in (1.1) is that corresponding to the constant term, i.e. (1.1) would be

$$y_t = \beta_{1,t} + x^*_t\beta_{2,t} + e_t\,, \qquad (3.4)$$

$$\alpha(L)(\beta_{1,t} - \beta_1) = \eta_{1,t}\,, \qquad \beta_{2,t} = \beta_2\,, \qquad (3.5)$$

from which

$$y_t = \beta_1 + x^*_t\beta_2 + (\beta_{1,t} - \beta_1) + e_t = \beta_1 + x^*_t\beta_2 + u_t\,, \qquad (3.6)$$

where $u_t = \alpha(L)^{-1}\eta_{1,t} + e_t$ follows an ARMA process of the same type and order as $\beta_{1,t} - \beta_1$. From this equivalence, it follows that the autocorrelation function (a.c.f.) and partial autocorrelation function (p.a.c.f.) of $u_t$ will provide the information needed to infer the order and type of polynomials making up $\alpha(L)$. An estimator of the a.c.f. and p.a.c.f. is available from the least squares residuals $\hat{u}_t$.
In fact, the 'evolving constant' model is particularly simple to specify and estimate. An early version of it is the adaptive regression model associated with Cooley and Prescott (1973, 1976), in which the intercept evolves as an ARIMA(0, 1, 1) process, $\alpha(L) = (1 + \theta_1L)^{-1}(1 - L)$. That format implies a model in differenced variables with a composite disturbance term

$$(1 - L)y_t = (1 - L)x^*_t\beta_2 + (1 + \theta_1L)\eta_t + (1 - L)e_t\,. \qquad (3.7)$$

The estimation and detection of the parameters of an equation such as (3.7) has been considered by McDonald (1981), Pagano (1974), Pagan (1973a, 1973b) and Revankar (1980), while Reinsel (1979) explicitly proposed the above method for estimating the adaptive regression model.
The key to the above specification analysis was the ability to obtain an estimate of the autocovariance function of $\beta_{1,t} - \beta_1$, after which the Box-Jenkins methodology could be applied, and this directs our attention towards the possibility of extracting the same information from the data for each evolving coefficient. Returning to (3.1), the a.c.f. of $u_t$ will be found from

$$E(u_tu_{t-j}) = x_t\Gamma_jx'_{t-j} + \sigma^2\delta_{0,j} = z_{t,j}\operatorname{vech}\Gamma_j + \sigma^2\delta_{0,j} = z_{t,j}\gamma_j + \sigma^2\delta_{0,j}\,, \qquad (3.8)$$

where $\delta_{0,j} = 1$ if $j = 0$ and zero otherwise, $\Gamma_j$, $j = 0, 1, 2, \ldots$, is the $j$th autocovariance of $\beta_t - \beta$, and $z_{t,j} = (x_{t-j} \otimes x_t)K'_p$. In writing (3.8), it has been assumed that $\beta_t - \beta$ is a stationary process; later we will briefly comment upon ways of discriminating between stationary and non-stationary models for $\beta_t$.
Suppose that $u_t$ were known. Equation (3.8) could then be written

$$u_tu_{t-j} = z_{t,j}\gamma_j + \delta_{0,j}\sigma^2 + v_{j,t}\,, \qquad (3.9)$$

with $v_{j,t} = u_tu_{t-j} - E(u_tu_{t-j})$ having zero mean. Such a property alludes to the opportunity of finding estimates of $\gamma_j$ $(j > 0)$ by regressing $u_tu_{t-j}$ against $z_{t,j}$. Of course, $\beta$ is not known, so that it will be necessary to estimate $\beta$ by regressing $y_t$ against $x_t$. With the resulting estimate $\hat{\beta}$, the residuals $\hat{u}_t = y_t - x_t\hat{\beta}$ replace $u_t$ in (3.9). Such a strategy parallels that of Subsection 2.1, the expansion in the number of regressions being occasioned by the fact that only $\Gamma_0 = \Sigma$ is non-zero in the model (X1, B2, V1). Another pleasing outcome is that such a regression yields exactly the autocovariance function of the residuals when $x_t = 1$, since then $z_{t,j} = 1$ and $\hat{\gamma}_j = T^{-1}\sum\hat{u}_t\hat{u}_{t-j}$, demonstrating that the regression-based strategy to obtain $\hat{\gamma}_j$ is merely an extension of standard time series analysis.
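For a single regressor, the residual-product regression behind (3.9) with $j > 0$ reduces to regressing $\hat{u}_t\hat{u}_{t-j}$ on $x_tx_{t-j}$; a minimal sketch under that simplification (names are ours):

    import numpy as np

    def gamma_j_estimates(y, x, max_lag=5):
        # Estimate gamma_j, the autocovariances of beta_t - beta (scalar case),
        # by regressing u_t u_{t-j} on z_{t,j} = x_t x_{t-j}, j = 1, ..., max_lag.
        y, x = np.asarray(y, float), np.asarray(x, float)
        beta_hat = (x @ y) / (x @ x)           # OLS of y_t on x_t
        u = y - beta_hat * x
        gammas = []
        for j in range(1, max_lag + 1):
            uu = u[j:] * u[:-j]                # u_t u_{t-j}
            z = x[j:] * x[:-j]                 # z_{t,j}
            gammas.append((z @ uu) / (z @ z))  # OLS slope, no intercept for j > 0
        return beta_hat, np.array(gammas)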
Defining $\tilde{z}_{t,j}$ as $z_{t,j}$ for $j > 0$ and as $(1\ z_{t,j})$ for $j = 0$, the relations (3.9) written in terms of residuals are $\hat{u}_t\hat{u}_{t-j} = \tilde{z}_{t,j}\theta_{2,j} + v_{j,t}$ or, in matrix form, $U_{-j} = \tilde{Z}_j\theta_{2,j} + v_j$. The estimators considered are $\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{\theta}_{2,j} = (\tilde{Z}'_j\tilde{Z}_j)^{-1}\tilde{Z}'_jU_{-j}$ (the estimates of $\beta$ from the regression of $y_t$ against $x_t$ and of $\theta_{2,j}$ from the regression of $\hat{u}_t\hat{u}_{t-j}$ against $\tilde{z}_{t,j}$), and Theorem 3.1 describes their asymptotic properties.

THEOREM 3.1. Under conditions (i)-(iv) of Theorem 2.2 and with (v) $\beta_t - \beta$ a stationary invertible process, (vi) $E(v_jv'_j) = V_j > 0$, (vii) $x_t$ a sequence of non-stochastic regressors with uniform bound and $\lim_{T\to\infty}T^{-1}X'X = B > 0$, (viii)

$$\lim_{T\to\infty}T^{-1}\sum_t\tilde{z}'_{t,j}\tilde{z}_{t,j} = \lim_{T\to\infty}T^{-1}\tilde{Z}'_j\tilde{Z}_j = Q_j > 0\,,$$

the estimator $\hat{\theta}' = (\hat{\beta}'\ \hat{\theta}'_{2,j})$ has the properties

(A) $\hat{\theta}\xrightarrow{\text{a.s.}}\theta$,

(B) $T^{1/2}(\hat{\beta} - \beta)\xrightarrow{d}N\big(0,\ B^{-1}(\lim T^{-1}X'VX)B^{-1}\big)$,

(C) $T^{1/2}(\hat{\theta}_{2,j} - \theta_{2,j})\xrightarrow{d}N\big(0,\ Q_j^{-1}(\lim T^{-1}\tilde{Z}'_jV_j\tilde{Z}_j)Q_j^{-1}\big)$.

A somewhat cumbersome proof of this theorem was referred to in Pagan (1980), but a simpler one could be mounted along the following lines. When residuals are used in (3.9), the error term is $v_{j,t} = u_tu_{t-j} - E(u_tu_{t-j}) + \hat{u}_t\hat{u}_{t-j} - u_tu_{t-j}$. As $T^{-1/2}\sum\tilde{z}'_{t,j}(\hat{u}_t\hat{u}_{t-j} - u_tu_{t-j})$ can be shown to be $o_p(1)$ without much difficulty, a proof of Theorem 3.1 would need to establish the limiting distribution of an OLS estimator in a regression involving disturbances that are both non-stationary and dependent processes; non-stationarity owing to the presence of a non-constant $x_t$ in the definition of $u_t$ and dependence because of any dependence in $\beta_t - \beta$. Theorems 2.3 and 2.4 of Domowitz and White (1982) may be used for this purpose. Joint normality of $e_t$ and $\eta_t$ may be dispensed with through such an approach, being replaced by some finite moment assumptions.
Now $\hat{\gamma}_j$ provides an estimate of the a.c.f. of $\beta_t - \beta$, and Theorem 3.1 potentially enables a judgement to be made concerning whether any $\gamma_j = 0$. Unfortunately, computing the covariance matrix of $\hat{\gamma}_j$ is no easy task. To see this, suppose that there is only a single coefficient and it is evolving as $\alpha(L)(\beta_t - \beta) = \eta_t$. Specializing Theorem 3.1, if $\beta_t$ follows a $k$th-order linear process, the asymptotic variance of $T^{1/2}(\hat{\gamma}_j - \gamma_j)$ for $j > k$ is the probability limit of $T$ times

$$\phi_j^{-1}\Big(\sum_t x_t^2x_{t-j}^2\bar{v}_{t,0} + 2\sum_{m=1}^{k}\sum_{t=m+1+j}^{T}x_tx_{t-j}x_{t-m}x_{t-j-m}\bar{v}_{t,m}\Big)\phi_j^{-1}\,,$$

where $\phi_j = \sum_t x_t^2x_{t-j}^2$ and

$$\bar{v}_{t,m} = E\{(u_tu_{t-j} - E(u_tu_{t-j}))(u_{t-m}u_{t-j-m} - E(u_{t-m}u_{t-j-m}))\} = E(u_tu_{t-j}u_{t-m}u_{t-j-m}) = E(u_tu_{t-m})E(u_{t-j}u_{t-j-m}) = E(a_{m,t})E(a_{m,t-j})$$

under the null hypothesis, where $a_{m,t} = u_tu_{t-m}$. When $x_t = 1$ and $e_t \equiv 0$, $E(a_{m,t}) = E(a_{m,t-j}) = \gamma_m$ and the formula corresponds to that in Box and Jenkins (1976, p. 35, eq. (2.1.13)).
It is apparent that the variance of $T^{1/2}(\hat{\gamma}_j - \gamma_j)$ corresponds to that from a regression model in which the errors follow a 'moving average' of $k$th order with time dependent covariances. Under the conditions of Theorem 3.1 it follows from Domowitz and White (1982) that this variance may be consistently estimated by $T$ times

$$\phi_j^{-1}\Big(\sum_t x_t^2x_{t-j}^2\hat{v}_{j,t}^2 + 2\sum_{r=1}^{k}\sum_{t=r+1}^{T}\hat{v}_{j,t}\hat{v}_{j,t-r}x_tx_{t-j}x_{t-r}x_{t-j-r}\Big)\phi_j^{-1}$$

by using the OLS residuals $\hat{v}_{j,t}$ from (3.9). Estimating the asymptotic variance in this fashion seems a good deal simpler than the alternative of explicitly evaluating $\bar{v}_{t,m}$ and replacing any $\gamma_j$ appearing in them with $\hat{\gamma}_j$. As well, Domowitz and White's formula applies even if $\beta_t$ is not a scalar.
Many regression packages nowadays provide estimates of the variance of the OLS estimator adjusted for heteroscedasticity, as recommended in Eicker (1967) and White (1980). In the context of the regression in (3.9), these estimates would correspond to $T$ times $\phi_j^{-1}(\sum_t x_t^2x_{t-j}^2\hat{v}_{j,t}^2)\phi_j^{-1}$. When only a single coefficient evolves, the omitted term is strictly non-negative, so that any test statistic for $\gamma_j = 0$ based on the heteroscedasticity-adjusted variances would be a conservative one. Unfortunately, this directional result does not obviously extend to the case when more than one coefficient varies.
All of the above has been devoted to order determination. Regarding form, it is customary to examine the partial a.c.f. as well as the ordinary a.c.f. As the p.a.c.f. ordinates can be thought of as estimates of the parameters $\rho_j$ in a sequence of autoregressions fitted to $\beta_t$, they may be found from the $\hat{\gamma}_j$ by solving the multivariate equivalent of the Yule-Walker equations. More specifically, defining $\gamma' = (\gamma'_1, \ldots, \gamma'_K)$ and $\rho' = (\rho'_1, \ldots, \rho'_K)$, a linear relation of the form $A\rho = \gamma$ exists, where $A$ is a matrix constructed from $\gamma_0, \ldots, \gamma_{K-1}$. When $\beta_t$ is a scalar, the $(i, j)$th element of $A$ is $\gamma_{i-j}$. The linear relation between $\gamma$ and $\rho$ may be exploited to re-parameterize the regressions of Theorem 3.1, with $\rho_j$ replacing $\gamma_j$ as the unknown parameters. To illustrate, suppose $\beta_t$ is a scalar and $\rho_1$ and $\rho_2$ are to be found. For $K = 1$, $\gamma_0\rho_1 = \gamma_1$, so that the relation $\hat{u}_t\hat{u}_{t-1} = x_tx_{t-1}\gamma_1 + v_{1,t}$ is equivalently written as $\hat{u}_t\hat{u}_{t-1} = x_tx_{t-1}\gamma_0\rho_1 + v_{1,t}$. When $K = 2$,

$$\begin{pmatrix}\gamma_0 & \gamma_1\\ \gamma_1 & \gamma_0\end{pmatrix}\begin{pmatrix}\rho_1\\ \rho_2\end{pmatrix} = \begin{pmatrix}\gamma_1\\ \gamma_2\end{pmatrix}$$

provides

$$\hat{u}_t\hat{u}_{t-1} = (x_tx_{t-1}\gamma_0)\rho_1 + (x_tx_{t-1}\gamma_1)\rho_2 + v_{1,t}$$

and

$$\hat{u}_t\hat{u}_{t-2} = (x_tx_{t-2}\gamma_1)\rho_1 + (x_tx_{t-2}\gamma_0)\rho_2 + v_{2,t}\,.$$

As is evident, the presence of $\rho_1$ and $\rho_2$ in both equations shows that, if an efficient estimator of both parameters is desired, it would be necessary to estimate both equations jointly, imposing the cross-equation equality restrictions. Simpler alternatives would be to ignore the cross-equation restrictions or to add the equations together; unfortunately, if the last tactic were adopted, the error $v_{1,t} + v_{2,t}$ would generally be autocorrelated. Some research in this area would seem needed.
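The mapping from estimated autocovariances to the $\rho_j$ in the scalar case is just a Toeplitz solve; a minimal sketch:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def rho_from_gamma(gamma):
        # Solve the Yule-Walker system A rho = gamma for rho, where gamma is
        # (gamma_0, gamma_1, ..., gamma_K) and A has (i, j)th element gamma_{|i-j|}.
        gamma = np.asarray(gamma, dtype=float)
        # the first column of the symmetric Toeplitz A is gamma_0, ..., gamma_{K-1}
        return solve_toeplitz(gamma[:-1], gamma[1:])

    print(rho_from_gamma([1.0, 0.5, 0.4]))  # rho_1, rho_2 for K = 2: [0.4, 0.2]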
All of the above is predicated upon a knowledge of the $\gamma_j$. In fact these are unknown, and all that is available are the $\hat{\gamma}_j$. Replacing $\gamma_j$ by $\hat{\gamma}_j$ modifies the error term, e.g. in the scalar case when $K = 1$, $\hat{u}_t\hat{u}_{t-1} = (x_tx_{t-1}\hat{\gamma}_0)\rho_1 + v_{1,t} + x_tx_{t-1}(\gamma_0 - \hat{\gamma}_0)\rho_1$. As is easily verified, such a substitution does not affect the consistency of $\hat{\rho}_1$, but it does mean that the covariance matrix of $T^{1/2}(\hat{\rho}_1 - \rho_1)$ depends not only upon the limit of $T^{-1/2}\sum x_tx_{t-1}v_{1,t}$, but also upon

$$T^{-1/2}\sum x_tx_{t-1}\big(x_tx_{t-1}(\gamma_0 - \hat{\gamma}_0)\rho_1\big) = T^{1/2}(\gamma_0 - \hat{\gamma}_0)\,\rho_1\,T^{-1}\sum x_t^2x_{t-1}^2\,;$$

the second term clearly possesses a limit distribution. Adjustments must be performed to obtain the correct variance for $T^{1/2}(\hat{\rho}_1 - \rho_1)$, but the exact nature of these must remain an area for future research. It is worth noting that, asymptotically and for a single evolving coefficient, the computed OLS variance of $T^{1/2}(\hat{\rho}_j - \rho_j)$ understates the true variance, providing a conservative test statistic.
In all of the above analysis it was presumed that the process generating $\beta_t$ was stationary, yet there is no compelling reason why $A(L)$ should not contain unit roots. In standard time series analysis, such a happening is detected by successive differencing of the time series until the a.c.f. ordinates die out rapidly. Unfortunately, it does not seem easy to mimic that mode of operation here. Suppose $\beta_t$ was a scalar and $\beta_t = \beta_{t-1} + \eta_t$. Then $\beta = \beta_0$ and $u_t$ in (3.1) becomes $u_t = x_t\sum_{s=1}^{t}\eta_s + e_t$, demonstrating that

$$E(u_tu_{t-j}) = \sigma_\eta^2(t-j)x_tx_{t-j} + \sigma^2\delta_{j,0} \qquad (j = 0, 1, 2, \ldots)\,.$$

Regressing $\hat{u}_t\hat{u}_{t-j}$ against $(t-j)x_tx_{t-j}$ yields an estimate not of the $j$th autocovariance of $\eta_t$ but of its variance! A differencing-like test might be constructed by regressing $\hat{u}_t^2 - \hat{u}_t\hat{u}_{t-1}$ against a constant, $(x_t^2t - x_tx_{t-1}(t-1))$ and $x_t^2$; the last of these regressors should provide an insignificant contribution as

$$E(u_t^2 - u_tu_{t-1}) = \sigma_\eta^2(x_t^2t - x_tx_{t-1}(t-1)) + \sigma^2\,.$$

Probably a similar strategy might be devised for the detection of models where $A(L)$ is not solely composed of unit roots, e.g. $A(L) = (1 - A_1L)(1 - L)$, but it must remain a high-priority area for research.
Finally, application of Theorem 3.1 to (X1, B3, V1) is not at all straightforward. It is a feature of (3.9) that in this case the random variable $v_{j,t}$ will be autocorrelated. OLS applied to (3.9) will therefore yield inconsistent estimators of $\gamma_j$ whenever $x_t$ contains lagged values of $y_t$. This is a serious weakness, and to overcome it requires the use of some instrumental variables for the lagged values of $y_t$. Unfortunately, unless the autocorrelation in $v_{j,t}$ is of the MA type (and there is little reason to believe that $\beta_t - \beta$ would have this characteristic), it is not possible to exploit the past history of $y_t$ for instruments. Finding instruments may well be difficult, and unless they are of good quality it could be very hard to make any prior discrimination between models. Once again this is a topic that requires much more detailed attention.

3.2. Estimation
Having isolated a range of models that are to be entertained, the next phase
in the modelling cycle involves estimating the unknown parameters. In contrast
to the specification aspect, there has been a substantial amount of research
devoted to estimation. Much of this research represents an adaptation of the
techniques presented in Section 2. The recommended estimation technique of
that section was maximum likelihood, and Subsection 3.2.2 details the proper-
ties of this estimator in the evolving coefficient case. What differentiates the
two situations is that the likelihood can only be defined implicitly in the
evolving coefficient case, making it difficult to find analytic derivatives as was

done in Subsection 2.2.2. Consequently, resort is frequently had to numerical


algorithms for maximizing the likelihood, and a brief discussion of some of
these is presented later in Subsection 3.3.3.
Even though these algorithms have managed to handle quite complex
models, the computational burden can be quite heavy, and one might be
satisfied with a consistent estimator only. Subsection 3.2.1 deals with the most
popular variants to achieve this objective, all of which involve the construction
of estimators via regression analysis. These are essentially extensions of the
least squares estimators described in Theorem 2.2, which were proposed mainly
for the purpose of generating estimates to begin the iterations towards the
MLE. Some authors have suggested that the estimators of Subsection 3.2.1 be
employed to derive a 'two-step' estimator that has the same limiting dis-
tribution as the MLE, and that idea is described in Subsection 3.3.1.

3.2.1. Covariance estimators


As Theorem 3.1 showed, it is possible to find consistent estimators of $\beta$, $\gamma_j$ and $\sigma^2$ by regression. For $\gamma_j$ $(j > 0)$ the regression is of $\hat{u}_t\hat{u}_{t-j}$ against $z_{t,j} = (x_{t-j} \otimes x_t)K'_p$, where the matrix $K_p$ reflects the relation $\gamma_j = \operatorname{vech}(\Gamma_j)$. Although not rigorous, it is convenient in what follows to ignore the symmetry in $\Gamma_j$ and to define $z_{t,j} = x_{t-j} \otimes x_t$. In practice, this symmetry restriction is always imposed by the nature of the regression anyway.

Once $\hat{\gamma}_j$ is derived, $\hat{\alpha}$ (the vector of unknown parameters in $A(L)$) and $\hat{\Sigma}$ may be recovered from the autocovariances of the $\beta_t$ process, and the task is therefore one of factorizing the covariance function; see Wilson (1969) and (1973) for details on algorithms to accomplish this. Accordingly, such estimators might be termed 'covariance estimators'. Although this covariance estimator was suggested in Rosenberg (1973), it has not received a great deal of use until recently, possibly because Rosenberg did not provide any asymptotic properties for it.
There is, however, one variant of the covariance estimator which has been applied, that of Swamy and Tinsley (1980) (hereafter S-T), with applications in Havenner and Swamy (1981) and Swamy et al. (1982). S-T formulate $A(L)\beta_t = \eta_t$ in the linear system form $\xi_t = \Phi\xi_{t-1} + \psi_t$; when $\beta_t$ is an AR($p$), for example, $\xi'_t = (\beta'_t, \ldots, \beta'_{t-p+1})$, $\psi'_t = (\eta'_t\ 0)$ and $\alpha = \operatorname{vec}(\Phi)$. Their estimator then involves the regression of $\hat{u}_t\hat{u}_{t-j}$ against $\bar{z}_{t,j} = x_{t-j}\hat{\xi}_{t-j}\hat{\xi}'_{t-1} \otimes x_t$ (the intercept term, if there is one, is absorbed into $x_t$ in their formulation), where $\hat{\xi}_t$ is an estimate of $\xi_t$ generated by a formula given in S-T (1980, eq. (4.10)).
To appreciate the relation of this estimator to that set out in Theorem 3.1, it is important to observe that S-T estimate $\alpha$ and $\Sigma$ rather than the $\gamma_j$. For illustrative purposes let $x_t$ be a scalar and assume that $\beta_t$ follows an AR(1), $\beta_t = \alpha_1\beta_{t-1} + \eta_t$. Then $\gamma_1 = \alpha_1\gamma_0$ and the unknown coefficients are $\alpha_1$ and $\sigma_\eta^2$. The regression relation to generate $\gamma_1$ is

$$\hat{u}_t\hat{u}_{t-1} = x_tx_{t-1}\gamma_1 + v_{1,t}\,, \qquad (3.10)$$

that is

$$\hat{u}_t\hat{u}_{t-1} = x_tx_{t-1}\gamma_0\alpha_1 + v_{1,t}\,, \qquad (3.11)$$

and, just as in the p.a.c.f. computations, $\hat{\alpha}_1$ could be found by regressing $\hat{u}_t\hat{u}_{t-1}$ against $x_tx_{t-1}\hat{\gamma}_0$ ($\hat{\gamma}_0$ would be an output from the regression of $\hat{u}_t^2$ against $x_t^2$ and unity).
In contrast to this approach, the S-T regression would be

$$\hat{u}_t\hat{u}_{t-1} = x_tx_{t-1}\hat{\xi}^2_{t-1}\alpha_1 + v_{1,t}\,, \qquad (3.12)$$

demonstrating its close relation to the estimator of Theorem 3.1, the sole difference in this instance being the replacement of $\hat{\gamma}_0$ by $\hat{\xi}^2_{t-1}$. As might be expected, for $\hat{\alpha}_{1,S\text{-}T}$ to be consistent certain conditions must be satisfied by $\hat{\xi}_t$. In particular, for this scalar case, the sample moments of $\hat{\xi}_t$ should be consistent estimators of the population moments of $\xi_t$ (up to the fourth order). S-T's choice of $\hat{\xi}_t$ can in fact be shown to imply this for the scalar case, but it is much harder to see the equivalent necessary conditions for consistency in more general models. It would obviously be desirable that a proof of consistency of S-T's proposed estimator be available before extensive use is made of it. Havenner and Swamy (1981) show that $\hat{\beta}$ is consistent and asymptotically normal, but that is a comparatively simple task compared to establishing the limiting properties of estimators of $\alpha$ and $\Sigma$.
There are a number of other points that need to be made about S-T's approach. First, just as the insertion of $\hat{\gamma}_0$ in place of $\gamma_0$ invalidates the consistency of the OLS estimate of the covariance matrix of $\hat{\gamma}_j$ (and hence $\hat{\alpha}_1$), so too the covariance matrix of $\hat{\alpha}_{1,S\text{-}T}$ is not consistently estimated by the OLS variance formula. Applications made of the S-T estimator do not seem to have allowed for this. Second, it is not clear what is to be gained by moving from $\hat{\gamma}_0$ to $\hat{\xi}^2_{t-1}$, the computational load of the first estimator being much lower. Of course, the S-T estimator is iterative in that new $\hat{\xi}_t$ can be found with the updated $\hat{\alpha}$ and $\hat{\Sigma}$, and these may be exploited to give new estimates $\hat{\alpha}$ and $\hat{\Sigma}$, etc. One could iterate the estimator of Theorem 3.1 as well by exploiting the form of the covariance matrix of $v_{1,t}$, e.g. a weighted least squares regression to account for the heteroscedasticity in $v_{1,t}$, but iterations on covariance estimators seem a bit pointless as the computational burden in each iteration is much the same as in each step of an iterative scheme to get the maximum likelihood estimator (MLE). Furthermore, because S-T's estimator is a variant of that in Theorem 3.1, it shares with that estimator the problems posed whenever $x_t$ contains lagged values of $y_t$. Some applications of the S-T estimator have in fact been made to models such as (X1, B3, V1) without apparently realizing that the estimator will be inconsistent in such cases. Overall, this difficulty seriously reduces the appeal of covariance estimators for VCR's.

3.2.2. Maximum likelihood estimation

As would be familiar from ordinary time series analysis, covariance estimators tend to be fairly inefficient, their prime virtue being their simplicity and their ability to provide consistent estimators that possess a limiting distribution. To improve efficiency, most investigators interested in VCR models have followed Box and Jenkins and engaged in ML estimation.

Following the strategy of Section 2, the log likelihood is constructed as if $e_t$ and $\eta_t$ are jointly normal, being

$$\log L_T(\beta, \alpha, \Sigma) = -\frac{T}{2}\ln 2\pi - \frac{1}{2}\sum\ln h_t - \frac{1}{2}\sum h_t^{-1}(y_t - y_{t/t-1})^2 + \ln f(y_1)\,, \qquad (3.13)$$

where $y_{t/t-1}$ is the expectation of $y_t$ conditional upon the $\sigma$-field $\mathcal{F}_{t-1} = (y_1, \ldots, y_{t-1}, x_1, \ldots, x_t)$ and $h_t$ is the variance of the innovations $e_t = y_t - y_{t/t-1}$. The first author to exploit this decomposition may have been Schweppe (1965), and it has subsequently formed the cornerstone for ML estimation of VCR models.
Equation (3.13) is constructed with $e_t$ and $h_t$. For the model treated in Section 2, both quantities could be derived analytically, but that is not so for the model of this section. Fortunately, once the VCR system in (1.1) is placed in the state-space form (SSF)

$$y_t = x_t\beta + \tilde{x}_t\xi_t + e_t\,, \qquad (3.14a)$$

$$\xi_t = \Phi\xi_{t-1} + \psi_t\,, \qquad (3.14b)$$

where $\tilde{x}_t = (x_t\ \ 0)$ and $\xi_t$ has leading rows $(\beta_t - \beta)$ and thereafter is defined to reduce $A(L)(\beta_t - \beta)$ to first-order form, the Kalman Filter (KF) equations provide values of $h_t$ and $e_t$ for given $\alpha$ and $\Sigma$. This approach has been well documented elsewhere, e.g. Rosenberg (1973) and Harvey (1981), and interested readers can find the KF described in these and in a number of other references. Two items deserve some attention, however. First, the KF needs to be initialized by $E(\xi_1)$ and $E(\xi_1\xi'_1)$; because $\xi_t$ is composed from $A(L)(\beta_t - \beta) = \eta_t$, $E(\xi_1) = 0$ and $E(\xi_1\xi'_1)$ is a function solely of $\Sigma$ and $\alpha$. Accordingly, it is not necessary to treat the initial coefficient $\beta_1$ as fixed and unknown; if this were desirable, Rosenberg (1973) showed how to concentrate it out of the likelihood. Second, the term $\log f(y_1)$ in the log likelihood (3.13) needs examination. As $y_1 = x_1\beta + x_1(\beta_1 - \beta) + e_1$, $y_1$ will be normally distributed with mean $x_1\beta$ and variance $x_1\Gamma_0x'_1 + \sigma^2$, allowing $\log f(y_1)$ to be computed from $x_1$, $\alpha$ and $\Sigma$ (this derivation assuming $x_t$ to be non-stochastic).
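A compact sketch of the prediction-error decomposition (3.13) computed by the Kalman filter for the simplest case, a single coefficient whose deviation follows a stationary AR(1) as in (3.14); the stationary initialization follows the discussion above, and all names are ours:

    import numpy as np

    def kf_neg_loglik(y, x, beta, alpha, sig2_eta, sig2_e):
        # Negative Gaussian log likelihood corresponding to (3.13) for the SSF
        # y_t = x_t beta + x_t xi_t + e_t,  xi_t = alpha xi_{t-1} + eta_t,
        # with |alpha| < 1 so that a stationary initialization exists.
        m, P = 0.0, sig2_eta / (1.0 - alpha ** 2)   # E(xi_1) = 0, var(xi_1)
        nll = 0.0
        for t in range(len(y)):
            yhat = x[t] * beta + x[t] * m           # y_{t/t-1}
            h = x[t] ** 2 * P + sig2_e              # innovation variance h_t
            v = y[t] - yhat                         # innovation e_t
            nll += 0.5 * (np.log(2 * np.pi) + np.log(h) + v ** 2 / h)
            K = P * x[t] / h                        # Kalman gain
            m, P = m + K * v, P * (1.0 - K * x[t])  # measurement update
            m, P = alpha * m, alpha ** 2 * P + sig2_eta  # time update
        return nll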
What makes ML estimation desirable is the consistency of the resulting estimator and the fact that its covariance matrix is given by the inverse of the information matrix $\mathcal{I}_{\theta\theta} = -E(\partial^2L_T/\partial\theta\,\partial\theta')$, this latter quantity frequently being estimated by the inverse of the Hessian of the log likelihood. However, because VCR constitutes a non-standard problem, with observations $y_t$ being dependent and non-stationary, there is no certainty that these desirable properties can be invoked. Crowder (1976) and Basawa et al. (1976) have provided theorems for the MLE to be consistent and asymptotically normal when $y_t$ has the characteristics stemming from a VCR model, and the following theorem, which is a special case of Theorem 4 in Pagan (1980), was proven by verifying that the conditions set out by Crowder hold under the stated assumptions.

THEOREM 3.2. If
A. the model is asymptotically locally identified;
B. (i) $x_t$ is non-stochastic and uniformly bounded from above,
(ii) $\Theta$, the permissible parameter space, is a subset of $R^s$,
(iii) the eigenvalues of $\Phi$ in (3.14b) have modulus less than unity;
C. the errors $e_t$ and $\psi_t$ constitute a multivariate normal distribution with finite variances;
D. $\theta_0$, the $s \times 1$ vector of true parameter values, is an interior point of $\Theta$;
then

$$\hat{\theta}_{ML}\xrightarrow{p}\theta_0\,, \qquad \mathcal{I}_{\theta\theta}^{1/2}(\hat{\theta}_{ML} - \theta_0)\xrightarrow{d}N(0, I_s)\,.$$

If $\theta$ does not include $\operatorname{vec}(\Phi)$, i.e. the transition matrix is fixed a priori, condition B(iii) may be deleted.

Some comments can be made upon this theorem and its assumptions. The permissible parameter space is defined by the problem but would certainly require $\Sigma$ to be p.s.d. In some situations the conditions are not exclusive, e.g. B(i)-(iii) would be a sufficient condition for A, but it seems worthwhile leaving unspecified what is needed for A and concentrating upon the asymptotic theory of the ML estimator given that A holds. Then, even if B(iii) did not hold, provided no elements in $\Phi$ were estimated and the parameters in $\Sigma$ were asymptotically identified, consistency and asymptotic normality would follow. This then extends the range of the estimation theorem to non-stationary cases, provided a separate analysis of asymptotic identifiability can be given. If elements in $\Phi$ are to be estimated then, by analogy with the corresponding situation of estimating unstable AR's, it would be expected that normality would not hold when $\Phi$ had unit roots. Although in that literature asymptotic normality of $\mathcal{I}_{\theta\theta}^{1/2}(\hat{\theta} - \theta_0)$ does hold when the roots are greater than unity, it is very doubtful that such a result would be true for a VCR model, the reason being that the innovations in an AR have bounded variance regardless of the roots, whereas the variance of the innovations would tend rapidly to infinity if the roots of $\Phi$ were greater than unity in the VCR case.

Of the other assumptions of the theorem, normality could be dispensed with by providing bounds on the moments of $\eta_t$ and $e_t$. However, it would not seem possible to relax B(i) to allow non-stationary behaviour in $x_t$, as the outcome of such an alternative would be an unbounded variance of the innovations, and it is hard to see how the theorem could possibly hold. It is worthwhile noting that Amemiya (1977) also retained this assumption for the ordinary random coefficient case.
From the definition of the log likelihood in (3.13), the conditioning on past data ensures that it remains the same even when $x_t$ includes lagged values of $y_t$. Thus, ML estimates would be obtained in the same way regardless of the definition of $x_t$. However, Theorem 3.2 does not apply directly, although Weiss (1982) has considered the requisite extension. To do so demands the addition of various assumptions that serve to bound the moments of $y_t$; as might be expected, Weiss' methodology effectively combines Theorems 2.2 and 3.2. From Weiss' research it would seem that the properties of the MLE extend to the combination (X3, B3, V1) and, given the difficulties experienced by the covariance estimator under these circumstances, this establishes a strong case for its use.
3.2.3. Identifiability
Theorem 3.2 required that the model be asymptotically locally identified, or that $T^{-1}\mathcal{I}_{\theta\theta}$ be non-singular in the limit, this latter interpretation being provided by Rothenberg (1971). Since this assumption is very closely bound up with the existence of a consistent estimator of $\theta$, and Theorem 3.1 showed how such a consistent estimator might be found, it should come as no surprise that the conditions for the existence of the estimator of Theorem 3.1, viz. that $T^{-1}X'X$ and $T^{-1}\sum\tilde{z}'_{t,j}\tilde{z}_{t,j}$ have a non-singular probability limit, also appear as sufficient conditions for asymptotic identifiability. This is the result proven in Pagan (1980, p. 349) by decomposing the information matrix. Swamy and Tinsley (1980) give a similar requirement, but in terms of their $\bar{z}_{t,j}$'s rather than $z_{t,j}$. An unsatisfactory aspect of stating identification conditions in terms of $\bar{z}_{t,j}$ is the dependence upon estimates of $\xi_t$. There seems little to be gained by adopting their version. Deducing necessary conditions is much harder. S-T assert that their conditions are necessary, but no proof is actually given of this proposition; just because it is necessary for the existence of their estimator does not mean that it is a necessary condition for identifiability. A promising alternative approach has been set out by Solo (1982). By utilizing a variant of the KF equations, the Output Statistics Kalman Filter due to Son and Anderson (1971), and assuming that the $x_t$'s follow stationary processes, he has been able to slightly generalize the results in Pagan (1980).
All of the above papers relate to models in which $\beta_t - \beta$ follows stationary invertible processes. However, some applications have forced the $A(L)$ polynomial to have unit roots, e.g. the seasonal adjustment model in Hannan et al. (1970), treated as an evolving coefficient regression in Pagan (1973b). Under these circumstances only $\Sigma$ and $\sigma^2$ are unknown and, although Theorem 3.2 shows that the MLE retains its standard properties, it does so by assuming asymptotic identifiability. The most complete treatment of identifiability when $A(L) = I - L$ is contained in Hatanaka and Tanaka (1981). They demonstrate that asymptotic identifiability holds under the following assumptions.
A1 (a) $x_tx'_t < c_1 < \infty$ for all $t$, with $c_1$ a positive constant.
(b) There exists a positive integer $\tau$ and a positive real number $c_2$ such that for every pair of $k$-element vectors $h_1$ and $h_2$ with $h'_1h_1 = 1$, $h'_2h_2 = 1$, and for every nonnegative integer $m$ and $s$ $(0 \le s \le \tau - 1)$, $|x_th_1| > c_2$ and $|x_th_2| > c_2$ for some $t$ in the interval $[m\tau + s, (m+1)\tau + s]$.
A2. $x_tx'_t > c_3$ for all $t$ and for some positive number $c_3$.

Of the two assumptions A2 is the stronger, but they indicate that it can be eliminated at the expense of a more complex proof. Accordingly, in most circumstances the presence of unit roots in $A(L)$ would not invalidate the standard properties of the MLE expressed in Theorem 3.2. Unfortunately, when $x_t$ contains lagged values of $y_t$ nothing is yet available concerning identifiability, and this is an area that is in need of much more research.

3.3. Some miscellaneous topics

3.3.1. Two-step estimators


One way to find the MLE is to use the method of scoring, which involves the
iterative scheme

θ̂_(j) − θ̂_(j−1) = ℐ⁻¹_(j−1) ∂L_T/∂θ |_(j−1) ,    (3.15)

where (j) indicates values at the jth iteration. If θ̂_(0) is a consistent estimator
such that θ̂_(0) − θ is O_p(T^{-1/2}), it is well known that T^{1/2}(θ̂_(1) − θ) has the same
limiting distribution as T^{1/2}(θ̂_ML − θ). Many proofs of this proposition are
available, with a convenient statement being Rothenberg and Leenders (1964).

Generally, θ̂_(1) is not second-order efficient, but a further iteration will in fact
produce a θ̂_(2) which is (see Rothenberg, 1983). Consequently, it is possible to
derive asymptotically efficient estimators from (3.15) once a θ̂_(0) is available.
But the estimator of Theorem 3.1 satisfies these requirements for θ̂_(0), making
one step of the scoring algorithm from such estimates a means for deriving an
estimator that is as efficient as the MLE. It is this argument that justifies the
contention at the end of Subsection 3.2 that there was little point in iterating
covariance estimators.
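Computationally the one-step argument behind (3.15) is just a single Newton-type update. The Python sketch below is our own minimal rendering of that idea, not the authors' code; the score and information callables are hypothetical user-supplied functions.

    import numpy as np

    def one_step_scoring(theta0, score, information):
        """One scoring step: theta1 = theta0 + I(theta0)^{-1} s(theta0), cf. (3.15).
        With a root-T consistent theta0, theta1 is first-order as efficient as the MLE."""
        I0 = information(theta0)   # estimated information matrix at theta0
        s0 = score(theta0)         # gradient of the log-likelihood at theta0
        return theta0 + np.linalg.solve(I0, s0)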

3.3.2. Diagnostic checking


After estimation is complete, checks need to be made of model adequacy.
An alternative is to 'overfit' and to test if the surplus parameters are zero using
the asymptotic theory developed for the MLE. But, just as in ordinary time
series analysis, exercising this option can be computationally expensive, and
diagnosing inadequacy through residuals comes to the fore.
Suppose the true innovations e_t were available. As mentioned earlier, the
autocorrelation function might be found by regressing e_t against e_{t−j} (j =
1, 2, …), i.e. the relation

e_t = φ_j e_{t−j} + a_t    (3.16)

would be estimated. Because a_t does not generally have constant variance
(under H_0: φ_j = 0, a_t = e_t and so E(a²_t) = h_t), OLS is not the most efficient
estimator of φ_j in (3.16). To isolate an efficient test statistic for H_0: φ_j = 0 it is
weighted least squares rather than OLS which is the appropriate estimator.

Accordingly, the regression should feature h_t^{-1/2} e_t as regressand and h_t^{-1/2} e_{t−j} as
regressor.
But even this modification is not enough. In practice, e_t is not available and
the investigator must make do with the residuals ê_t. The relation (3.16) converts to

ê_t = φ_j ê_{t−j} + a_t + (ê_t − e_t) + φ_j(e_{t−j} − ê_{t−j}) .    (3.17)

Under H_0: φ_j = 0, the error term in this regression is a_t + ê_t − e_t. By the mean
value theorem,

ê_t = e_t + (∂e_t/∂θ)'(θ̂ − θ) + (θ̂ − θ*)'(∂²e_t/∂θ∂θ')(θ̂ − θ*) ,

where θ* lies between θ̂ and θ, allowing (3.17) to be rewritten as

ê_t = φ_j ê_{t−j} + a_t + (∂e_t/∂θ)'(θ̂ − θ) + o_p(T^{-1/2}) .    (3.18)

Examining the limiting distribution of T^{1/2}(φ̂_j − φ_j) from the regression of ê_t
against ê_{t−j}, it becomes apparent that the term (∂e_t/∂θ)'(θ̂ − θ) contributes unless

T^{-1} Σ_t ê_{t−j} ∂e_t/∂θ →_p 0 .

The likelihood that this moment is zero is remote: from Pagan (1980, p. 359),
∂e_t/∂θ is a linear combination of e_{t−k} (k = 1, 2, …). From this, the variance of
the jth ordinate of the a.c.f. of ê_t will be quite complex, and certainly not the usual T^{-1}.
Some way around this complication needs to be sought. Let ê*, ê*_{−j} and P be
matrices with ĥ_t^{-1/2} ê_t, ĥ_t^{-1/2} ê_{t−j} and ĥ_t^{-1/2} ∂e_t/∂θ as tth elements respectively. An
obvious matrix representation of ĥ_t^{-1/2} times (3.18) is

ê* = φ_j ê*_{−j} + P(θ̂ − θ) + a* + o_p(T^{1/2}) .    (3.19)

Defining M = I − P(P'P)^{-1}P', it is possible to annihilate the term P(θ̂ − θ) by
pre-multiplication of (3.19) by M:

Mê* = φ_j Mê*_{−j} + Ma* + o_p(T^{1/2}) .    (3.20)

Equation (3.20) is the basis for the proof of the following theorem concerning a
valid diagnostic test for H_0: φ_j = 0.

THEOREM 3.3. In the regression of ê* against ê*_{−j} and P, the 't statistic'
associated with the coefficient of ê*_{−j}, when treated as a standard normal deviate,
is asymptotically a valid test statistic for the null hypothesis that φ_j = 0.

PROOF. Regressing Mê* against Mê*_{−j} gives

φ̂_j = φ_j + (ê*'_{−j} M'M ê*_{−j})^{-1} ê*'_{−j} M'M a* + o_p(T^{-1/2})    (3.21)
    = φ_j + (ê*'_{−j} M ê*_{−j})^{-1} ê*'_{−j} M a* + o_p(T^{-1/2}) ,    (3.22)

using M'M = M. Since ĥ_t →_p h_t because of the consistency of θ̂, it is easily seen
that

T^{1/2}(φ̂_j − φ_j) →_d N(0, (plim_{T→∞} T^{-1} ê*'_{−j} M ê*_{−j})^{-1}) .

But (3.20) involves the regression of the residuals from the regression of ê*
against P versus the residuals from the regression of ê*_{−j} against P; as is well
known, an identical estimate of φ_j and its covariance matrix can be found by
the regression of ê* against ê*_{−j} and P.

It might be noted that, except for the division by ĥ_t^{1/2} to produce a constant
variance in the innovations, the regression in Theorem 3.3 is essentially that
recommended by Durbin (1970) for adjusting the a.c.f. of residuals from an auto-
regression, again establishing a link between ordinary time series analysis and
VCR methodology. Some other points of interest arise concerning Theorem
3.3. First, as ê*'_{−j} M'M ê*_{−j} = ê*'_{−j} M ê*_{−j} ≤ ê*'_{−j} ê*_{−j}, the t statistic from the regres-
sion of Theorem 3.3 exceeds that from a regression of ê* against ê*_{−j}, demon-
strating that any failure to allow for the use of residuals rather than innovations
creates a bias towards acceptance of the null hypothesis of serial independence.
From this feature emerges the conclusion that past studies, e.g. Engle and
Watson (1981), which have examined the a.c.f. of the estimated innovations,
even if the latter were normalized with ĥ_t^{-1/2}, would have tended to accept their
model as adequate when it may not have been. Second, the regression of
Theorem 3.3 is easily performed since the information is either available from
the ML estimation program, or could easily be extracted by numerical
differentiation.
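In code, the test of Theorem 3.3 is just an augmented regression. The sketch below is ours, under the assumption that the user already holds the ĥ_t^{-1/2}-scaled residuals, their jth lag and the scaled score derivatives (the arrays eps_star, eps_star_lag and P are hypothetical inputs); it simply reads off the t statistic on the lagged residual.

    import numpy as np

    def diagnostic_t(eps_star, eps_star_lag, P):
        """t statistic for phi_j = 0 in the regression of eps_star on
        (eps_star_lag, P); cf. Theorem 3.3. Treat the result as N(0,1)."""
        T = len(eps_star)
        X = np.column_stack([eps_star_lag, P])          # lagged residual plus P
        coef, *_ = np.linalg.lstsq(X, eps_star, rcond=None)
        resid = eps_star - X @ coef
        s2 = resid @ resid / (T - X.shape[1])           # residual variance
        cov = s2 * np.linalg.inv(X.T @ X)
        return coef[0] / np.sqrt(cov[0, 0])             # coefficient of the lag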

3.3.3. Algorithms
In Subsection 3.3.1 brief mention was made of the scoring algorithm for the
derivation of the MLE. A range of other algorithms has been suggested, e.g.
Newton-Raphson, in which ℐ is replaced by −∂²L/∂θ∂θ'; general purpose
optimization routines such as Davidon-Fletcher-Powell or Praxis; modified
versions of scoring in which ℐ is estimated by Σ_t (∂L_t/∂θ)(∂L_t/∂θ)' (see Berndt et
al., 1974); and, recently, the E-M algorithm of Dempster et al. (1977) (see
Watson and Engle (1982) for a full account of the application of the latter).
There is some evidence that scoring works very well provided good starting
values are available, but that the E-M algorithm is very much faster if starting
values are poor. A judicious combination of both algorithms, beginning with
E-M and switching to scoring after some progress has been made towards the
maximum, promises to be the best solution. Watson has indicated success
with this strategy when up to one hundred parameters need to be estimated.
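The Berndt et al. (1974) modification replaces ℐ by the outer product of the per-observation scores. A minimal sketch (ours, assuming a user-supplied T × k array whose tth row is ∂L_t/∂θ evaluated at the current θ):

    import numpy as np

    def bhhh_information(scores):
        """Outer-product estimate of the information matrix, sum_t s_t s_t'."""
        return scores.T @ scores

    def bhhh_step(theta, scores):
        """One modified-scoring step using the outer-product estimate;
        `scores` must be evaluated at `theta`."""
        return theta + np.linalg.solve(bhhh_information(scores), scores.sum(axis=0))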

4. Some special models

4.1. Endogenous variable regressors


The regressors x_t in (1.1) may be part of a larger system (see Hurwicz, 1950).
Very little research is available on this model (X3, B2, V1). Hurwicz set out the
likelihood for a very simple three-equation case, while Kelejian (1974) analysed
problems of identification. As Hurwicz's work demonstrates, derivation of the
MLE is an extremely complex process, with numerical integration being
needed. Inevitably, investigators will probably have to be content with con-
sistent estimators, and a brief examination of how these might be constructed
will now be provided.
When coefficients are constant and E(x_t e_t) ≠ 0, a common way of deriving
consistent estimators is to perform instrumental variable estimation with in-
struments w_t that possess the properties T^{-1} Σ w'_t x_t →_p Q > 0 and T^{-1} Σ w'_t e_t →_p 0.
Frequently, there are more instruments available than are needed, i.e. the
dimension of w_t exceeds that of x_t, and Sargan (1958) showed that an optimal
instrument would be the predictions from the regression of x_t against w_t,
denoted x̂_t in the following. Let us, therefore, rewrite (1.2) in terms of x̂_t as

y_t = x̂_t β̄ + (x_t − x̂_t)β̄ + x_t(β_t − β̄) + e_t .    (4.1)

If β_t − β̄ = η_t, the estimator obtained by regressing y_t against x̂_t will be

β̂ = β̄ + (Σ x̂'_t x̂_t)^{-1} Σ (x̂'_t x_t η_t + x̂'_t e_t)    (4.2)

using the properties of regression that Σ x̂'_t(x_t − x̂_t) = 0, i.e. Σ x̂'_t x_t = Σ x̂'_t x̂_t.


Examination of (4.2) reveals that β̂ is a consistent estimator of β̄ if
T^{-1} Σ x̂'_t x̂_t η_t →_p 0. Although this might appear a reasonable condition to impose,
it would be better if a more basic set of assumptions ensuring it could be stated.
When coefficients are constant the x̂_t are sometimes taken as predictions from
the reduced form of the system. Shiba and Tsurumi (1982) followed such a
strategy for the (X3, B2, V1) case, and their analysis shows some of the
difficulties that can arise. An important one is that, even if a reduced form
exists, the stochastic part may have an infinite variance. Consequently, regress-
ing x_t against w_t would not yield a consistent estimator of any relationship
between the two variables. Tsurumi and Shiba (1982) provide an example of just
such a situation in the context of a very simple macro-economic model.
Even if β̄ is consistently estimated, however, there is still the task of
estimating σ² and Σ = E(η_t η'_t). One possibility is to regress û²_t = (y_t − x_t β̂)²
against a constant and variables such as x_t ⊗ x_t; under certain conditions
involving the existence of reduced form moments this would provide a con-
sistent estimator of Σ, but not of σ², as E(x_t e_t) ≠ 0 means that the mean of û_t
will also not be zero. Needless to say, more research in this area is appropriate.
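The construction behind (4.1)-(4.2) is easy to mimic numerically. The sketch below is our own (hypothetical arrays y, X and an instrument matrix W): it forms the first-stage predictions x̂_t and then regresses y_t on them, which is the estimator whose consistency condition was just discussed.

    import numpy as np

    def optimal_instrument_estimate(y, X, W):
        """Regress each column of X on the instruments W to get Xhat,
        then regress y on Xhat; cf. (4.1)-(4.2)."""
        Xhat = W @ np.linalg.lstsq(W, X, rcond=None)[0]   # first-stage fitted values
        beta, *_ = np.linalg.lstsq(Xhat, y, rcond=None)   # second-stage regression
        return beta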

4.2. Non-constant variances
It is sometimes forgotten that a regression model is characterized by two
types of parameters: those connected with the mean behaviour of the data (β)
and those with the second moment (σ²). Nevertheless, there has been a steady
growth of interest in either allowing for the effects of a non-constant error
variance or in modelling any changes in it.
If σ² is not a constant, but rather is indexed by t, it is well known that the
variance of the OLS estimator β̂ = (Σ x'_t x_t)^{-1} Σ x'_t y_t is (Σ x'_t x_t)^{-1} Σ_t x'_t x_t σ²_t (Σ x'_t x_t)^{-1},
and Eicker (1967), Fuller (1975) and White (1980) have all proposed con-
sistently estimating this quantity by replacing σ²_t in the formula by the squared
OLS residuals û²_t. As the autocorrelation and partial autocorrelation functions
of a series y_t can be viewed as the estimated coefficients from a regression of y_t
against its various lags, if the variance in such a regression is not constant the
standard theory as in Box and Jenkins (1976) would not be applicable.
Nicholls and Pagan (1983) have shown that it is possible to use the same
adjustment as proposed by Eicker/Fuller/White even when x_t contains lagged
values of y_t, and this approach allows for test statistics based on the autocor-
relation function to be adjusted for non-stationarity in variances.
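A minimal rendering of the Eicker/Fuller/White adjustment (our sketch, with hypothetical arrays X of order T × k and y):

    import numpy as np

    def hc_covariance(X, y):
        """Heteroscedasticity-consistent covariance of the OLS estimator:
        (X'X)^{-1} (sum_t u_t^2 x_t x_t') (X'X)^{-1}."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        u = y - X @ beta                        # OLS residuals
        XtX_inv = np.linalg.inv(X.T @ X)
        meat = (X * u[:, None] ** 2).T @ X      # sum_t u_t^2 x_t x_t'
        return XtX_inv @ meat @ XtX_inv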
Rather than react passively to a non-constant variance for e_t, some have
suggested an active strategy of modelling it. Amemiya (1973) derived a three-
step procedure asymptotically equivalent to the MLE when σ²_t is assumed a
function of E(y_t). Harvey (1976) focussed upon an application of the scoring
algorithm to give a two-step estimator asymptotically equivalent to the MLE
for a general class of heteroscedastic errors. Amemiya (1977) considered
consistent estimators rather than efficient ones. As might be expected, all of
these proposals have been covered indirectly in Section 2. Any differences stem
from the special characteristics of the model of Section 2, that lagged values of
y_t appear in x_t and the fact that σ²_t under random coefficients is related to the
y_{t−j} rather than to a more arbitrary set of variables w_t. Nevertheless, the MLE
and consistent estimators are derived exactly as in that section.
A more recent development in this area has been the distinction drawn
between the conditional and unconditional variance of e_t. It may be that
the unconditional variance of e_t is a constant, but that the variance
conditional upon the sigma field ℱ_{t−1} = [y_{t−1}, …, y_1, x_t, …, x_1] is not; e.g. the
model y_t = α y_{t−1} + e_t with E(e²_t | ℱ_{t−1}) = σ² + δ y²_{t−1} possesses this property
(assuming that E(y²_t) is finite). There are obviously many ways in which such a
situation could arise. One example is the random coefficient autoregressions of
Section 2 or equivalent MA formulations (Robinson, 1977); another the
bilinear models of Granger and Andersen (1978). A third is the recent
development by Engle (1982) of what he terms autoregressive conditional
heteroscedasticity (ARCH) models, in which σ²_t = E(e²_t | ℱ_{t−1}) is a linear func-
tion of g(e_0, …, e_{t−1}), where g is some known function, e.g. σ²_t = σ² + δ e²_{t−1}. A
number of papers have reported such effects in time series data, although it is
by no means certain that the ARCH model is not proxying for a different type
of conditional heteroscedasticity, e.g. σ² + δ y²_{t−1}. It might be expected that this
will be an area which is likely to receive a good deal more attention in the next
few years.
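As a concrete illustration of the conditional/unconditional distinction, the sketch below (ours; the parameter values are chosen purely for illustration) simulates the ARCH(1) scheme σ²_t = σ² + δe²_{t−1}: the conditional variance moves with e²_{t−1}, yet the unconditional variance is constant at σ²/(1 − δ).

    import numpy as np

    def simulate_arch1(T, sigma2=1.0, delta=0.5, seed=0):
        """Simulate e_t with E(e_t^2 | F_{t-1}) = sigma2 + delta * e_{t-1}^2."""
        rng = np.random.default_rng(seed)
        e = np.zeros(T)
        for t in range(1, T):
            h_t = sigma2 + delta * e[t - 1] ** 2   # conditional variance
            e[t] = np.sqrt(h_t) * rng.standard_normal()
        return e

    e = simulate_arch1(100_000)
    print(e.var(), 1.0 / (1 - 0.5))   # sample variance vs sigma2/(1 - delta)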

5. Conclusion

Our aim has been to survey material on varying coefficient regression in the
context of a framework of crucial importance to the analysis of time series, viz.
the specification, estimation and diagnostic cycle. By adopting this systematic
approach, a better rapport between VCR research and traditional time series
analysis is established, and areas in which there is a serious research deficiency
can be more effectively isolated. As we commented in the introduction, there
are many cells in Table 1 that have not been studied at all, and our survey has
thrown up some gaps even for those models that have been studied. Two
aspects stand out from the review: considerable progress has been made in
enabling the standard fixed coefficient assumption to be relaxed but, at the
same time, much remains to be done before VCR models become as wide-
spread in their use as their fixed coefficient counterpart.

Appendix

Tensor notation and related results


If A and B are matrices of order m × n and p × q respectively, then
(i) the tensor or Kronecker product A ⊗ B is the mp × nq matrix whose
(i, j)th block is a_{ij}B, where a_{ij} is the (i, j)th element of A;
(ii) vec A denotes the mn × 1 vector obtained by stacking the columns of A
one on top of the other, in order, from left to right.
The vector with r(r + 1)/2 elements obtained by stacking those elements of
the columns of the r × r symmetric matrix D on and below the main diagonal,
one on top of the other, in order, from left to right, is denoted by vech D (the
vector half of D).
If the matrix product ABC is defined, then it can be shown (Henderson and
Searle, 1979) that

vec(ABC) = (C' ⊗ A) vec B .    (A.1)

For symmetric matrices it is possible to obtain linear relationships between
vec D and vech D. Indeed, Henderson and Searle show that for any r × r
symmetric matrix D there exist {r(r + 1)/2} × r² matrices K_r and H_r such that
H_r K'_r = I_{r(r+1)/2} and for which

vech D = H_r vec D   and   vec D = K'_r vech D .
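The identity (A.1) is easy to verify numerically; a small check in Python (ours, using column-major stacking for vec, as in the text) might read:

    import numpy as np

    def vec(M):
        """Stack the columns of M, left to right (column-major order)."""
        return M.reshape(-1, order="F")

    rng = np.random.default_rng(1)
    A, B, C = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))
    lhs = vec(A @ B @ C)
    rhs = np.kron(C.T, A) @ vec(B)
    print(np.allclose(lhs, rhs))   # True: vec(ABC) = (C' kron A) vec B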

References

Amemiya, T. (1973). Regression when the variance of the dependent variable is proportional to the
square of its expectation. J. Amer. Statist. Assoc. 68, 928-934.
Amemiya, T. (1977). A note on a heteroskedastic model. J. Econometrics 6, 365-370.
Andel, J. (1976). Autoregressive series with random parameters. Math. Operationsforsch. Statist. 7,
735-741.
Basawa, I. V., Feigin, P. D. and Heyde, C. C. (1976). Asymptotic properties of maximum likelihood
estimators for stochastic processes. Sankhyā 38, 259-270.
Berndt, E. K., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974). Estimation and inference in
nonlinear structural models. Ann. Econ. Soc. Meas. 4, 653-665.
Billingsley, P. (1961). The Lindeberg-Lévy theorem for martingales. Proc. Amer. Math. Soc. 12,
788-792.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control (revised
edition). Holden-Day, San Francisco, CA.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-
integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient
variation. Econometrica 47, 1287-1294.
Breusch, T. S. and Pagan, A. R. (1980). The Lagrange multiplier test and its applications to model
specification in econometrics. Rev. Econom. Stud. 47, 239-253.
Chant, D. (1974). On asymptotic tests of composite hypotheses in non-standard conditions.
Biometrika 61, 291-298.
Chow, G. C. (1983). Random and changing coefficient models. In: Z. Griliches and M. D.
Intriligator, eds., Handbook of Econometrics, Chap. 21. North-Holland, Amsterdam.
Cooley, T. and Prescott, E. (1973). An adaptive regression model. Internat. Econom. Rev. 14,
364-371.
Cooley, T. and Prescott, E. (1976). Estimation in the presence of sequential parameter variation.
Econometrica 44, 167-184.
Crowder, M. J. (1976). Maximum likelihood estimation for dependent observations. J. Roy. Statist.
Soc. Ser. B 38, 45-53.
Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the
alternative. Biometrika 64, 247-254.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1-39.
Domowitz, I. and White, H. (1982). Misspecified models with dependent observations. J.
Econometrics 20, 35-58.
Durbin, J. (1970). Testing for serial correlation in least squares regression when some of the
regressors are lagged dependent variables. Econometrica 38, 410-421.
Eicker, F. (1967). Limit theorems for regression with unequal dependent errors. In: L. Le Cam
and J. Neyman, eds., Proc. Fifth Berkeley Symposium, 59-82. University of California Press,
Berkeley, CA.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica 50, 987-1007.
Engle, R. F. and Watson, M. (1979). A time domain approach to dynamic factor and MIMIC
models. Discussion paper 79-41. University of California, San Diego.

Engle, R. F. and Watson, M. (1981). A one-factor multivariate time series model for metropolitan
wage rates. J. Amer. Statist. Assoc. 76, 774-781.
Fuller, W. A. (1975). Regression analysis for sample survey. Sankhyg~ 37, C, 117-132.
Garbade, K. (1977). Two methods for examining the stability of regression coefficients. J. Amer.
Statist. Assoc. 72, 54-63.
Godfrey, L. G. (1978). Testing for multiplicative heteroskedasticity. J. Econometrics 8, 227-236.
Gourieroux, C., Holly, A. and Monfort, A. (1982). Likelihood ratio test, Wald test, and Kuhn-
Tucker test in linear models with inequality constraints on the regression parameters.
Econometrica 50, 63-80.
Granger, C. W. J. and Andersen, A. (1978). An Introduction to Bilinear Time Series Models.
Vandenhoeck and Ruprecht, Göttingen.
Hannan, E. J. and Kavalieris, L. (1983). The convergence of autocorrelations and autoregressions.
Austral. J. Statist. 25, 287-297.
Hannan, E. J., Terrell, R. D. and Tuckwell, N. (1970). The seasonal adjustment of economic time
series. Internat. Econom. Rev. 11, 24-52.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity.
Econometrica 44, 461-466.
Harvey, A. C. (1981). Time Series Models. Phillip Allan, Oxford.
Hatanaka, M. and Tanaka, K. (1981). On the estimability of the covariance matrix in the
multivariate random walk representing the time changing parameters of regression models.
Mimeo. Osaka University.
Havenner, A. and Swamy, P. A. V. B. (1981). A random coefficient approach to seasonal
adjustment of economic time series. J. Econometrics 15, 177-210.
Henderson, H. V. and Searle, S. R. (1979). Vec and vech operators for matrices with some uses in
Jacobian and multivariate statistics. Canad. J. Statist. 7, 65-81.
Hildreth, C. and Houck, J. P. (1968). Some estimators for a linear model with random coefficients.
J. Amer. Statist. Assoc. 63, 584-595.
Hurwicz, L. (1950). Systems with non-additive disturbances. In: T. C. Koopmans, Ed., Statistical
Inference in Dynamic Economic Models, 410-418. Wiley, New York.
Imhof, J. P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika
48, 419-426.
Johnson, L. W. (1977). Stochastic parameter regression; an annotated bibliography. Internat.
Statist. Rev. 45, 257-272.
Johnson, L. W. (1980). Stochastic parameter regression: an additional annotated bibliography,
Internat. Statist. Rev. 48, 95-102.
Kendall, M. G. (1953). The analysis of economic time series--Part I: Prices. J. Roy. Statist. Soc.
Ser. A 106, 11-25.
Kelejian, H. H. (1974). Random parameters in a simultaneous equation framework: Identification
and estimation. Econometrica 42, 517-528.
King, M. L. and Hillier, G. (1980). A small sample power property of the Lagrange multiplier test.
Monash University discussion paper.
La Motte, L. R. and McWhorter, A. (1978). An exact test for the presence of random walk
coefficients in a linear regression model. J. Amer. Statist. Assoc. 73, 816-820.
Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models.
Biometrika 65, 297-303.
McDonald, J. (1981). Consistent estimation of models with composite moving average disturbance
terms: A survey. Flinders University Mimeo.
Moran, P. A. P. (1971). Maximum likelihood estimation in non-standard conditions. Proc. Camb. Phil.
Soc. 70, 441-445.
Neyman, J. (1959). Optimal asymptotic tests for composite statistical hypotheses. In: U. Grenan-
der, ed., Probability and Statistics, 213-234. Wiley, New York.
Nicholls, D. F. and Pagan, A. R. (1983). Heteroscedasticity in models with lagged dependent
variables. Econometrica 51, 1233-1242.
Nicholls, D. F. and Quinn, B. G. (1981). Multiple autoregressive models with random coefficients.
J. Multivariate Anal. 11, 185-198.

Nicholls, D. F. and Quinn, B. G. (1982). Random Coefficient Autoregressive Models: An Intro-
duction. Springer-Verlag, New York.
Pagan, A. R. (1973a). Efficient estimation of models with composite disturbance terms. J.
Econometrics 1, 329-340.
Pagan, A. R. (1973b). Estimation of an evolving seasonal pattern as an application of stochastically
varying parameter regression. Econometric Research Program Memo No. 153. Princeton Uni-
versity.
Pagan, A. R. (1980). Some identification and estimation results for regression models with
stochastically varying coefficients. J. Econometrics 13, 341-363.
Pagan, A. R. and Hall, A. D. (1983). Diagnostic tests as residual analysis. Econometric Reviews 2,
159-218.
Pagano, M. (1974). Estimation of models of autoregressive signal plus white noise. Ann. Statist. 2,
99-108.
Priestley, M. B. (1981). Spectral Analysis and Time Series, Volume 1: Univariate Series. Academic
Press, New York.
Raj, B. and Ullah, A. (1981). Econometrics, A Varying Coefficients Approach. Croom-Helm,
London.
Rao, C. R. (1974). Large sample tests of statistical hypotheses concerning several parameters with
application to problems of estimation. Proc. Camb. Phil. Soc. 44, 50-57.
Reinsel, G. (1979). A note on the estimation of the adaptive regression model. Internat. Econom.
Rev. 20, 193-202.
Revenkar, N. S. (1980). Analysis of regressions containing serially correlated and serially uncor-
related error components. Internat. Econom. Rev. 21, 185-200.
Robinson, P. M. (1977). The estimation of a non-linear moving average model. Stochastic Processes
Appl. 5, 81-90.
Rosenberg, B. (1973). The analysis of a cross-section of time series by stochastically convergent
parameter regression. Ann. Econ. Soc. Meas. 2, 399-428.
Rothenberg, T. J. (1971). Identification in parametric models. Econometrica 39, 577-592.
Rothenberg, T. J. (1983). Approximating the distributions of econometric estimators and test
statistics. In: Z. Griliches and M. D. Intriligator, eds. Handbook of Econometrics. North-Holland,
Amsterdam.
Rothenberg, T. J. and Leenders, C. T. (1964). Efficient estimation of simultaneous equation
systems. Econometrica 32, 57-76.
Rubin, H. (1950). Note on random coefficients. In: T. C. Koopmans, ed., Statistical Inference in
Dynamic Economic Models. Wiley, New York.
Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables.
Econometrica 26, 393-415.
Schweppe, F. C. (1965). Evaluation of likelihood functions for Gaussian Signals. IEEE Trans.
Inform. Theory IT-11, 61-70.
Shiba, T. and Tsurumi, H. (1982). Consistent estimation of the random coefficient model in a
simultaneous framework. Discussion paper 81-20, Rutgers University.
Silvey, S. D. (1959). The Lagrangian multiplier test. Ann. Math. Statist. 30, 389-407.
Solo, V. (1982). The output statistics Kalman Filter and varying parameter regression. Mimeo.
Harvard University.
Son, L. H. and Anderson, B. D. O. (1971). Design of Kalman filters using signal model output
statistics. Proc. IEEE 120, 312-318.
Swamy, P. A. V. B. (1971). Statistical Inference in Random Coefficient Regression Models.
Springer-Verlag, New York.
Swamy, P. A. V. B. and Tinsley, P. A. (1980). Linear prediction and estimation methods for
regression models with stationary stochastic coefficients. J. Econometrics 12, 103-142.
Swamy, P. A. V. B., Tinsley, P. A. and Moore, G. R. (1982). An autopsy of a conventional
macroeconomic relation: the case of money demand. Paper presented to the Society for
Economic Dynamics and Control Conference, Washington, D.C.

Tanaka, K. (1981). On the Lagrange multiplier test for the constancy of regression coefficients and
the asymptotic expansion. Mimeo, Kanazawa University.
Tsurumi, H. and Shiba, T. (1982). A Bayesian analysis of a random coefficient model in a simple
Keynesian system. J. Econometrics 18, 239-250.
Watson, M. (1980). Testing for varying coefficients when a parameter is unidentified. University of
California, San Diego discussion paper No. 80-8.
Watson, M. and Engle, R. F. (1982). The EM algorithm for dynamic factor and MIMIC models.
Harvard Institute of Economic Research discussion paper No. 879.
Weiss, A. A. (1982). The estimation of the dynamic regression model with stochastic coefficients.
University of California, San Diego discussion paper No. 82-11.
White, H. (1980). A heteroscedasticity-consistent covariance matrix estimator and a direct test
for heteroscedasticity. Econometrica 48, 817-838.
Whittle, P. (1954). Estimation and information in stationary time series. Ark. Mat. 2, 423-434.
Wilson, G. T. (1969). Factorization of the covariance generating function of a pure moving average.
S I A M J. Num. Anal. 6, 1-7.
Wilson, G. T. (1973). The estimation of parameters in multivariate time series models. J. Roy.
Statist. Soc. Ser. B 35, 76-85.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5    17
© Elsevier Science Publishers B.V. (1985) 451-480

Small Samples and Large Equation Systems*

Henri Theil and Denzil G. Fiebig

1. Introduction

In econometrics (and in several other areas of applied statistics) it happens
frequently that we face a system of equations rather than a single equation. For
example, let a consumer select the quantities of N goods which maximize his
utility function subject to his budget constraint. Then under appropriate
conditions a system of demand equations emerges, each describing the con-
sumption of one good in terms of income and all N prices. The number of
coefficients in such a system is on the order of N², which is a large number
(unless N is small) and which raises problems when we want to test hypotheses
about these coefficients (see Section 2). Another example is the estimation of
the coefficients of one equation which is part of a system of simultaneous
equations. Here a problem arises when the system contains a large number of
exogenous variables (see Section 3).
One way of solving such problems is by (1) recognizing that the sample is
drawn from a continuous distribution and (2) using this sample to fit a
continuous approximation to the parent distribution. When this is done under
the maximum entropy (ME) criterion subject to mass- and mean-preserving
constraints, a continuous ME distribution emerges which is superior to the
discrete sample distribution in a number of respects, particularly for small
samples (see Sections 4 and 5). Subsequent sections show how the ME
distribution can be used for the problems mentioned above.

2. How asymptotic tests can be misleading

Let t = 1, …, n refer to successive observations and i, j = 1, …, N to
consumer goods. We consider a linear demand system,

y_{it} = θ_i x_{0t} + Σ_{j=1}^{N} π_{ij} x_{jt} + ε_{it} ,    (2.1)

*Research supported in part by NSF Grant SES-8023555. The authors are indebted to Sartaj A.
Kidwai of the University of Florida for his research assistance.


where y_{it} is consumption of good i, x_{0t} is total consumption, x_{jt} is the price of
good j, ε_{it} is a random error, and θ_i and π_{ij} are parameters, the π_{ij}'s being
known as Slutsky coefficients. Here we shall be interested in two hypotheses,
viz., demand homogeneity,

Σ_{j=1}^{N} π_{ij} = 0 ,    i = 1, …, N ,    (2.2)

and Slutsky symmetry,

π_{ij} = π_{ji} ,    i, j = 1, …, N .    (2.3)

Details on these properties are provided in the Appendix.


Summation of (2.1) over i = 1, …, N yields x_{0t} = x_{0t} + Σ_i ε_{it} (because Σ_i θ_i =
1, Σ_i π_{ij} = 0), which implies that the ε_{it}'s are linearly dependent. This problem
can be solved by deleting one of the equations, say the Nth. We assume that
(ε_{1t}, …, ε_{N−1,t}) for t = 1, …, n are independently and normally distributed
with zero means and nonsingular covariance matrix Σ. Since (2.2) and (2.3) are
linear in the π_{ij}'s, the standard procedure for testing these hypotheses is an F
test if Σ is known. However, Σ is typically unknown, in which case it is usual to
replace Σ by S, the matrix of mean squares and products of LS residuals. Many
such tests have yielded unexpected negative results; see, e.g., Barten (1969),
Byron (1970), Christensen et al. (1975), Deaton (1974) and Lluch (1971).
Laitinen (1978) conducted a simulation experiment in order to explore this
problem. He constructed a model of the form (2.1) satisfying both (2.2) and
(2.3), with the ε_t's obtained as pseudo-normal variates with zero means and a
known covariance matrix Σ. He used n = 31 observations and considered
systems of N = 5, 8, 11 and 14 equations. Using the true Σ, he applied the F
test of the homogeneity hypothesis (2.2). The upper left part of Table 1 shows
the numbers of rejections out of 100 trials at the 5 and 1 percent significance
levels; these numbers are satisfactorily close to 5 and 1, respectively. Laitinen
also used the same samples to compute S and the associated test statistic; this
amounts to a χ² test which is asymptotically (n → ∞) valid. The numbers of
rejections shown in the upper middle part of Table 1 are much larger,
particularly for large N. The results for the corresponding exact χ² test based
on the true Σ (upper right part of Table 1) are far more satisfactory, thus
strongly suggesting that the use of S rather than Σ is mainly responsible for the
numerous rejections of homogeneity in the literature.
Meisner (1979) conducted a similar simulation experiment for testing the
symmetry hypothesis (2.3). His results, shown in the lower part of Table 1,
indicate an analogous increasing bias toward rejecting the null hypothesis as N
increases when S rather than Σ is used. The tests based on S fall under what is
frequently referred to as Wald tests. See Bera et al. (1981) for similar results
obtained with the asymptotically equivalent Lagrange multiplier and likelihood
ratio tests. Laitinen (1978) proved that the exact distribution of the homogeneity

Table 1
Rejections (out of 100 samples) of homogeneity and symmetry

                Exact F tests    Asymptotic χ²      Exact χ² tests
                based on Σ       tests based on S   based on Σ
                5%      1%       5%      1%         5%      1%

Rejections of homogeneity
5 goods          7       1       14       6          8       1
8 goods          8       2       30      16          5       2
11 goods         5       2       53      35          5       1
14 goods         4       2       87      81          6       1

Rejections of symmetry
5 goods          6       0        9       3          5       1
8 goods          5       2       26       8          5       1
11 goods         6       2       50      37          4       3
14 goods         4       2       96      91          6       0

test statistic based on S is Hotelling's T², which in this case is an F
ratio whose denominator has n − 2N + 1 degrees of freedom. This illustrates
the problem of homogeneity testing when N is not far below one-half the
sample size n. The exact distribution of the symmetry test statistic based on S is
a much more difficult issue because (2.3) is a cross-equation constraint. This is
also the reason why symmetry-constrained estimation of (2.1) presents a
problem when Σ is not known (see Section 7).
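To make the asymptotic test concrete: with identical regressors in every equation, LS applies equation by equation and the statistic testing (2.2) takes a simple form. The sketch below is ours, not Laitinen's code; it assumes hypothetical arrays Y (n × (N−1), the Nth equation deleted) and X (n × (N+1), total consumption in column 0 and the N prices after it), and defines S with divisor n. The statistic is referred to χ² with N − 1 degrees of freedom.

    import numpy as np

    def homogeneity_wald(Y, X):
        """Asymptotic chi-square statistic for demand homogeneity (2.2)."""
        n, k = X.shape
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)    # LS, equation by equation
        U = Y - X @ B                                # residual matrix
        S = U.T @ U / n                              # mean squares/products of residuals
        r = np.r_[0.0, np.ones(k - 1)]               # sums the price coefficients
        c = B.T @ r                                  # one coefficient sum per equation
        denom = r @ np.linalg.solve(X.T @ X, r)      # r'(X'X)^{-1} r
        return c @ np.linalg.solve(S, c) / denom     # refer to chi2(N - 1)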

3. Simultaneous equation estimation from undersized samples

Our objective is to estimate the parameter γ in

y_{1t} = γ y_{2t} + e_t ,    t = 1, …, n ,    (3.1)

which is one equation of a system that consists of several linear equations. The
y's are observations on two endogenous variables of the system. The other
equations contain certain endogenous variables in addition to these two, and
also p exogenous variables; the observations on the latter variables are written
x_{1t}, …, x_{pt}. The sample moment matrix of these variables and those in (3.1) is
thus of order (p + 2) × (p + 2),

[ m_11   m_12   m'_1p ]    y_{1t}
[ m_12   m_22   m'_2p ]    y_{2t}        (3.2)
[ m_1p   m_2p   M_p   ]    x_{1t}, …, x_{pt}

where m_1p, m_2p are p-element vectors and M_p is square (p × p). The LS

estimator of γ in (3.1) is then m_12/m_22; this estimator is biased and inconsistent
because y_{2t} and e_t are correlated. However, we can obtain consistent estimators
from the property that each exogenous variable is statistically orthogonal to the
errors in (3.1) in the sense that, for h = 1, …, p, (1/n) Σ_t x_{ht} e_t has zero prob-
ability limit as n → ∞.
The conditional sample moment matrix of the two endogenous variables
given the p exogenous variables is

[ m_{11.p}  m_{12.p} ]   [ m_11  m_12 ]   [ m'_1p ]
[ m_{12.p}  m_{22.p} ] = [ m_12  m_22 ] − [ m'_2p ] M_p^{-1} [ m_1p  m_2p ] .    (3.3)

The k-class estimator of γ is defined as [see, e.g., Theil (1971, Chap. 10)]

γ̂(k) = (m_12 − k m_{12.p}) / (m_22 − k m_{22.p}) ,    (3.4)

which includes the LS estimator m_12/m_22 as a special case (k = 0). It can be
shown that, under standard conditions, γ̂(k) is consistent if k − 1 has zero
probability limit and that n^{1/2}[γ̂(k) − γ] converges to a normal distribution with
zero mean if n^{1/2}(k − 1) has zero probability limit. These conditions are
obviously satisfied by k = 1, which is the case of two-stage least squares (2SLS).
In Section 6 we shall meet a k-class estimator with a random k.
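As a numerical sketch (ours, with hypothetical arrays y1, y2 and an n × p matrix X of exogenous variables), the moments in (3.2)-(3.3) and the k-class estimator (3.4) can be computed as follows; k = 0 gives LS and k = 1 gives 2SLS. Note the sketch requires M_p to be nonsingular, the very condition discussed next.

    import numpy as np

    def k_class(y1, y2, X, k):
        """k-class estimator (3.4) built from the moments in (3.2)-(3.3)."""
        n = len(y1)
        m12, m22 = y1 @ y2 / n, y2 @ y2 / n
        m1p, m2p, Mp = X.T @ y1 / n, X.T @ y2 / n, X.T @ X / n
        m12_p = m12 - m1p @ np.linalg.solve(Mp, m2p)   # conditional moments (3.3)
        m22_p = m22 - m2p @ np.linalg.solve(Mp, m2p)
        return (m12 - k * m12_p) / (m22 - k * m22_p)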
Equation (3.1) is quite special because it contains only two endogenous and
no exogenous variables. The extension to more variables is straightforward
[see, e.g., Theil (1971, Chaps. 9 and 10)], but it is not our main concern here.
Our problem is that the matrix (3.3) does not exist when there are more
exogenous variables than observations (p > n) because M_p is then singular. In
fact, all standard methods of consistently estimating γ fail for p > n because
they all require the inverse of M_p. Almost all present-day economy-wide
econometric models have more exogenous variables than observations. The
problem is even more pervasive due to the occurrence of lagged variables in
dynamic equation systems. It is standard practice to treat each lagged variable
in the same way as the exogenous variables are treated, which means that the
'dynamic' version of M_p is even more likely to be singular.
The irony of this problem is that we can reasonably argue that it should not
be a problem. We can estimate γ by 2SLS from n = 20 observations when
there are p = 10 exogenous variables in the system, but not when p = 30 (because
M_p is then singular for n = 20). In the former case there are 10 variables known to
be statistically orthogonal to the error vector (e_1, …, e_n) of (3.1), in the latter
there are 30 such variables; a priori one would expect that the estimation of (3.1) is
improved (at least, not hurt) when there are more orthogonality conditions on its
error vector. In Section 6 we shall consider this matter further.

4. The ME distribution of a univariate sample

The previous discussion illustrates difficulties with sample moment matrices:
S in Section 2, M_p in Section 3. Here and in Section 5 we shall seek a solution
under the condition that the relevant random variables are continuously
distributed. Our strategy will be to use the sample to fit a continuous dis-
tribution as an estimate of the parent distribution, and to compute moments
(and other characteristics) from this fitted distribution. Such moments are
population-moment estimators which are alternatives to the ordinary sample
moments.

4.1. The ME principle and the univariate ME distribution


The principle of maximum entropy (ME) states that, given some information
on the parent distribution of a random variable, the fitted distribution should
be most uninformative subject to the constraints imposed by the prior in-
formation. To do otherwise would imply the use of information that is not
available. The criterion of uninformativeness used in information theory is the
entropy, which is minus the expectation of the logarithm of the density
function. Specifically, the ME criterion maximizes

H = −∫_{−∞}^{∞} f(x) log f(x) dx    (4.1)

by varying the density function f(·) subject to certain constraints. If all that is
known is that the variable is continuous with a finite range (a, b), the ME
distribution is the uniform over this interval. If we know the mean of a positive
continuous random variable (but nothing else), the ME distribution is the
exponential with this mean.¹
These results were used by Theil and Laitinen (1980) to construct an
estimated distribution function from a sample (x_1, …, x_n).² They used order
statistics, written here with superscripts: x¹ < x² < ⋯ < xⁿ, and defined inter-
mediate points between successive order statistics,

ξ_i = ξ(x^i, x^{i+1}) ,    i = 1, …, n − 1 ,    (4.2)

where ξ(·) is a symmetric differentiable function of its two arguments whose
value is between these arguments. These ξ_i's define two open-ended intervals,
I_1 = (−∞, ξ_1) and I_n = (ξ_{n−1}, ∞), and n − 2 bounded intervals, I_2 = (ξ_1, ξ_2),
…, I_{n−1} = (ξ_{n−2}, ξ_{n−1}). Each I_i contains one order statistic x^i and, hence,

¹For other ME properties of the exponential family, see Kagan et al. (1973).
²See Theil and Fiebig (1984) for a survey containing many other results as well as proofs of the
statements which follow.

a fraction 1/n of the mass of the sample distribution. We impose on the density
function f(·) which will be fitted that it preserves these fractions,

∫_{I_i} f(x) dx = 1/n ,    i = 1, …, n ,    (4.3)

which is a mass-preserving constraint. We also impose an analogous mean-
preserving constraint, referring both to the overall mean (the sample mean x̄)
and to the means in each interval I_i.³ Thus, our constraints refer to moments of
order zero and one.
Subject to these constraints we seek the density f(·) which maximizes the
entropy (4.1). The solution is unique for n > 2; it implies that the intermediate
points (4.2) become midpoints between successive order statistics, ξ_i =
½(x^i + x^{i+1}), and that f(·) is constant in each bounded I_i and exponential in I_1
and I_n. Thus, the associated cdf is continuous and monotone increasing, and it
is piecewise linear around each x^i except around x¹ and xⁿ where it is
exponential. We shall refer to this fitted distribution as the ME distribution of a
univariate sample or, more briefly, as the univariate ME distribution.
It will be convenient to extend ξ_i = ½(x^i + x^{i+1}) to i = 0, 1, …, n, where
x⁰ = x¹, x^{n+1} = xⁿ so that ξ_0 = x¹, ξ_n = xⁿ. These ξ's are referred to as the
primary midpoints. The interval means of the ME distribution, written
x̃¹, …, x̃ⁿ, are given by

x̃^i = ½(ξ_{i−1} + ξ_i) ,    i = 1, …, n ,    (4.4)

which will be called the secondary midpoints.

4.2. Applications
Given that the density picture of the ME distribution is so simple (piecewise
constant or exponential), it is straightforward to evaluate its variance and
higher moments. For example, the variance of the ME distribution (the ME
variance) equals

(1/n) Σ_{k=1}^{n} (x^k − x̄)² − (1/4n) Σ_{i=1}^{n−1} (x^{i+1} − x^i)² − (1/24n) Σ_{i=2}^{n−1} (x^{i+1} − x^{i−1})² .    (4.5)

Since the first term is the sample variance and since the two others are
negative, the ME variance is thus subject to shrinkage relative to the sample
variance. Kidwai and Theil (1981) showed that, under normality, this shrinkage
is a random variable whose mean and standard deviation are both about
proportional to n^{-1.3}.

³Define the order statistics associated with each interval I_i as those which determine its end
points: x¹ and x² for I_1 = (−∞, ξ_1) [see (4.2)], xⁿ and x^{n−1} for I_n, and x^{i−1}, x^i and x^{i+1} for I_i with
1 < i < n. The mean-preserving constraint on I_i requires that f(x) for x ∈ I_i be constructed so that
the mean is a homogeneous linear function of the order statistics associated with I_i.
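A small Python rendering of the construction (ours, not the authors' code): it forms the primary and secondary midpoints of (4.2)-(4.4) and evaluates the ME variance (4.5), whose shrinkage relative to the sample variance is visible directly.

    import numpy as np

    def me_variance(sample):
        """ME variance of a univariate sample, cf. (4.5)."""
        x = np.sort(np.asarray(sample, dtype=float))   # order statistics x^1 < ... < x^n
        n = x.size
        s2 = x.var()                                   # sample variance (first term)
        shrink1 = np.sum(np.diff(x) ** 2) / (4 * n)    # successive order statistics
        shrink2 = np.sum((x[2:] - x[:-2]) ** 2) / (24 * n)  # i = 2, ..., n-1 term
        return s2 - shrink1 - shrink2

    def secondary_midpoints(sample):
        """Interval means (4.4) of the ME distribution."""
        x = np.sort(np.asarray(sample, dtype=float))
        xi = np.concatenate(([x[0]], (x[:-1] + x[1:]) / 2, [x[-1]]))  # primary midpoints
        return (xi[:-1] + xi[1:]) / 2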
Simulation experiments with pseudo-normal variates indicate that the ME
variance and third- and fourth-order moments about the mean are all more
accurate (in the mean-squared-error sense) than the corresponding estimators
derived from the discrete sample distribution. This difference reflects the
efficiency gain obtained by exploiting the knowledge that the parent dis-
tribution is continuous. However, the difference converges to zero as n → ∞,
implying that the efficiency gain is a small-sample gain. Fiebig (1982, Chap. 4)
extended the simulation experiment to the estimation of the variances of
fat-tailed mixtures of normal distributions. The fatter the tails for given n, the
larger is the efficiency gain of the ME variance over the sample variance.
Since the ME distribution is formulated in terms of order statistics, it is
natural to consider the quantiles of the ME distribution as estimators of the
parent quantiles. Let n be odd and write m = ½(n + 1). Then the sample median
is x^m, but the ME median is x̃^m, i.e. the median of the secondary midpoints. For
random samples from a normal population, the ME median has a smaller
expected squared sampling error than the sample median, but the relative
difference tends to zero as n → ∞. Let n + 1 be a multiple of 4 and write
q = ¼(n + 1). Then the sample quartiles are x^q and x^{3q}, whereas the ME
quartiles are Q_L = (1/8)x^{q−1} + (1/2)x^q + (3/8)x^{q+1} and Q_U = (3/8)x^{3q−1} + (1/2)x^{3q} + (1/8)x^{3q+1} if q > 1.⁴ For
random samples from a normal population, the ME quartiles have smaller
expected squared errors. Again, the relative difference tends to zero as n → ∞,
but this difference is still in excess of 10 percent for the interquartile distances
Q_U − Q_L and x^{3q} − x^q at n = 39. Also, the ME median and quartiles dominate
their sample distribution counterparts (under squared-error loss) in the
presence of an outlier with a different mean or a different variance; see Theil
and Fiebig (1984) for details.

4.3. Extensions

The ME distribution is easily extended to bounded random variables. If the
variable is positive, the only modification is that I_1 = (−∞, ξ_1) becomes (0, ξ_1)
and that the distribution over this interval becomes truncated exponential.
A different extension is in order when the parent distribution is known to be
symmetric. The difference x^i − x̄ then has a sampling distribution identical to
that of x̄ − x^{n+1−i} for each i. We define x̂^i − x̄ as the average of these
differences, i.e.

x̂^i = x̄ + ½(x^i − x^{n+1−i}) ,    i = 1, …, n .    (4.6)

Clearly, x̂¹, …, x̂ⁿ are 'symmetrized' order statistics located symmetrically

⁴Since the ME distribution has a continuous cdf, its median and quartiles are uniquely defined
for each n. This is in contrast to the sample quantiles whose definitions for certain values of n can
be made unique only by interpolation between order statistics.

around the sample mean x̄. (Since the ME procedure is mean-preserving, x̄ is a
natural point of symmetry.) The symmetric ME (SYME) distribution is then
constructed from the x̂^i's in the same way that the ME distribution is obtained
from the x^i's. An alternative justification of the definition (4.6) is that it
satisfies the LS criterion of minimizing Σ_i (x̂^i − x^i)² for variations in the x̂^i's
subject to the symmetry constraint x̂^i + x̂^{n+1−i} = 2x̄.⁵
SYME moments and quantiles can be used as estimators of the correspond-
ing population values if the population is symmetric. Doing so amounts to
exploiting the knowledge of symmetry in addition to continuity. For random
samples from a normal distribution, the SYME quartiles are asymptotically
more efficient than the ME and sample quartiles: as n → ∞, the sampling
variance of the former is about 13 percent below that of the latter. This shows
that there are situations in which the exploitation of symmetry yields a
large-sample gain. (Recall that the ME efficiency gain, based on the exploita-
tion of continuity, is a small-sample gain only.) Under normality, the SYME
variance provides no reduction in mean squared error beyond that of ME
(mainly because the SYME variance is subject to additional shrinkage), but
Fiebig (1982) did obtain such reductions for fat-tailed symmetric mixtures of
normal distributions.
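The symmetrization (4.6) is essentially a one-line operation; a minimal sketch (ours):

    import numpy as np

    def symmetrize(sample):
        """Symmetrized order statistics (4.6): xbar + (x^i - x^{n+1-i})/2."""
        x = np.sort(np.asarray(sample, dtype=float))
        return x.mean() + (x - x[::-1]) / 2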

5. The ME distribution of a multivariate sample

5.1. The bivariate and multivariate ME distributions


Let (x_k, y_k) for k = 1, …, n be a sample from a continuous bivariate
population. Our objective is to use this sample in the construction of the joint
density function which maximizes the bivariate entropy

H = −∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) log f(x, y) dx dy ,    (5.1)

subject to mass- and mean-preserving constraints.


As in the univariate case, we start with order statistics and the intermediate
points (4.2), but we do this now for both variables, yielding n intervals
I_1, …, I_n for x and n intervals J_1, …, J_n for y. In the plane of both variables,
we thus have n² rectangular cells, but since there are only n observations, n
cells contain one observation each and n² − n cells contain no observations.
The mass-preserving constraint states that the former cells are assigned mass
1/n and the latter zero mass. Maximizing (5.1) requires stochastic independence

⁵A different procedure for estimating a symmetric distribution, proposed by Schuster (1973,
1975), consists of 'doubling the sample'; i.e. associated with each sample element x_k is a value
2x̄ − x_k at equal distance from x̄ but on the opposite side, which yields an augmented sample of
size 2n (symmetric around x̄) when these associated values are merged with a sample of size n. In a
bivariate context, the value associated with (x_k, y_k) is (2x̄ − x_k, 2ȳ − y_k), yielding spherical sym-
metry. However, the simulation experiments by Theil, Kidwai, Yalnizoğlu and Yellé (1982) based
on pseudo-normal variates indicate that this alternate form of symmetrizing is not very promising.

within each cell with mass 1/n. Each such cell falls under one of three groups:
those which are bounded on all four sides, those which are open-ended on one
side, and those which are open-ended on two sides. For the first group, the ME
distribution within the cell is the bivariate uniform distribution; for the second,
it is the product of the exponential (for the open-ended variable) and the
uniform (for the other variable); for the third, it is the product of two
exponentials.
The extension to the p-variate ME distribution is straightforward. There are
then n^p cells, n of which contain one observation each and are assigned mass
1/n, while the n^p − n others are assigned zero mass. The ME distribution within
each cell with mass 1/n is the product of p univariate distributions, each
being either uniform or exponential. The cdf of this distribution is a continuous
and nondecreasing function of its p arguments, and it is piecewise linear except
for exponential tails.

5.2. The ME covariance matrix

The covariance of the bivariate ME distribution equals the covariance of the
secondary midpoints,

(1/n) Σ_{k=1}^{n} (x̃_k − x̄)(ỹ_k − ȳ) ,    (5.2)

where (x̃_k, ỹ_k) for k = 1, …, n are the secondary midpoint pairs rearranged in
the order of the original sample elements (x_k, y_k). This rearrangement is
indicated by the use of subscripts rather than superscripts [cf. (4.4)].
The ME variance was given in (4.5), but this variance can also be written in
the form

(1/n) Σ_{k=1}^{n} (x̃_k − x̄)² + (1/12n) Σ_{i=2}^{n−1} (ξ_i − ξ_{i−1})² + [(ξ_1 − ξ_0)² + (ξ_n − ξ_{n−1})²]/4n ,    (5.3)

where the first term is the variance of the secondary midpoints.⁶ The two other
terms are a weighted sum of squared differences between successive primary
midpoints which is always positive. On combining (5.2) and (5.3) we find that
the 2 × 2 ME covariance matrix takes the form C + D, where C is the covari-
ance matrix of the secondary midpoints and D is a diagonal matrix with
positive diagonal elements. This C + D formulation applies to the covariance
matrix of any p-variate ME distribution. The diagonal matrix D serves as the
ridge of the ME covariance matrix;⁷ this ridge ensures that the ME covariance
matrix is always positive definite even when p ≥ n.

⁶Expression (5.3) is nothing but the variance decomposition of the univariate ME distribution
between and within groups, the 'groups' being the intervals I_1, …, I_n.
⁷This ridge formulation has a superficial similarity to ridge regression. The major difference is
that the ridge of the ME covariance matrix is not subject to arbitrary choice but is uniquely
determined by the ME criterion subject to mass- and mean-preserving constraints.
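A sketch of the C + D construction for p variables (ours, for a hypothetical n × p data array without ties or missing values; those cases would require the extra ridges of Section 5.3): C is the covariance matrix of the secondary midpoints and D collects, per variable, the two positive terms of (5.3).

    import numpy as np

    def me_covariance(data):
        """p x p ME covariance matrix C + D of an n x p sample, cf. (5.2)-(5.3)."""
        data = np.asarray(data, dtype=float)
        n, p = data.shape
        mids = np.empty_like(data)
        ridge = np.empty(p)
        for j in range(p):
            order = np.argsort(data[:, j])
            x = data[order, j]                                 # order statistics
            xi = np.concatenate(([x[0]], (x[:-1] + x[1:]) / 2, [x[-1]]))
            sec = (xi[:-1] + xi[1:]) / 2                       # secondary midpoints
            mids[order, j] = sec                               # back to sample order
            d = np.diff(xi)                                    # primary midpoint gaps
            ridge[j] = d[1:-1] @ d[1:-1] / (12 * n) + (d[0] ** 2 + d[-1] ** 2) / (4 * n)
        C = np.cov(mids, rowvar=False, bias=True)              # divisor n, cf. (5.2)
        return C + np.diag(ridge)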

The ME correlation ρ̂ is obtained by dividing the ME covariance by the
square root of the product of the two corresponding ME variances. A simula-
tion experiment based on 10,000 pseudo-binormal variates with correlation ρ
indicates that ρ̂ has a smaller expected squared error than the sample cor-
relation r for |ρ| ≤ 0.4, but that the opposite holds for |ρ| ≥ 0.6. The less
satisfactory performance of ρ̂ for large |ρ| results from the ridge of the ME
covariance matrix which prevents |ρ̂| from being close to 1. However, the
picture is different when we evaluate the correlation estimators in terms of the
squared errors of their Fisher transforms; then ρ̂ is superior to r for |ρ| ≤ 0.95.⁸
Fiebig (1982) generated pseudo-normal vectors consisting of p equicorrelated
variates with zero mean and unit variance. He computed their ME and sample
covariance matrices and applied different loss functions to both. The ME
estimator has smaller expected loss than the sample estimator when p is not
small and ρ not close to 1, whereas the opposite holds for ρ = 0.99 and small p.
The latter result is again due to the ridge of the ME covariance matrix. Fiebig
also amended Haff's (1980) empirical Bayes estimator of the covariance matrix
by substituting the ME covariance matrix for the sample covariance matrix in
Haff's formula. Simulations indicate that this is an improvement except when
the population covariance matrix is close to singular or when the number of
variables is small.

5.3. Ties and missing values

Ties have zero probability when the sample is drawn from a continuous
distribution, but they can occur when the data are rounded. Let the ath and
bth observations on x after rounding share the tth and (t + 1)st positions in
ascending order:

x_a = x_b = x^t = x^{t+1} .    (5.4)

Here we consider the bivariate ME distribution of x and y under the assump-
tion that the y_k's are not tied and that x_a < x_b and x_a > x_b both have probability
½ before rounding. The appropriate procedure is to assign mass 1/2n to each of
the four cells associated with the tie. The ME covariance formula (5.2) remains
applicable if x̃_a and x̃_b are defined as

x̃_a = x̃_b = ½(x̃^t + x̃^{t+1}) ,    (5.5)

which means that the tie x_a = x_b is preserved in the form x̃_a = x̃_b.
The univariate ME distribution is not affected by the tie (5.4) so that we can

⁸Since the ridge of the ME covariance matrix tends to push ρ̂ toward zero, this difference mainly
results from the downward bias of r and the upward bias of the Fisher transform of r (for ρ > 0). In
Theil, Kidwai, Yalnizoğlu and Yellé (1982) the simulation experiment is extended to the SYME
correlation and also to the correlation of the spherically symmetric version mentioned in footnote
5. Only the last correlation estimator has some merits for particular values of ρ (around 0.95) under
squared-error loss of the Fisher transform.

use (4.5) for the ME variance.⁹ However, it is of interest to also consider the
effect of the tie on the variance formula (5.3) which contains x̃_k for k = a and
k = b. It can be shown that, under the definition (5.5), a term must be added to
(5.3) of the form (x^{t+2} − x^{t−1})²/32n, which amounts to an extra ridge (the 'tie
ridge') of the ME covariance matrix in the presence of a tie. See Theil and
Fiebig (1984) for further details.
Similar results hold for the multivariate ME distribution with missing values
as analyzed by Conway and Theil (1980). Consider n observations on two vari-
ables; let n_1 values be known for one variable (n − n_1 are missing at random)
and n_2 values for the other (n − n_2 are missing at random). The number of cells
is then reduced from n² to n_1 n_2. The result for the ME covariance is that (5.2) is
still applicable provided that x̃_k is interpreted as the sample mean x̄ when x_k is
missing (similarly for ỹ_k). This does not mean that we act as if the missing x_k
takes a particular value. No such value is assumed; the only thing needed for
the ME covariance is a specification of x̃_k for missing x_k, and this specification
is x̃_k = x̄, which follows directly from the ME principle subject to mass- and
mean-preserving constraints under the assumption that the values which are
missing are missing at random.
When we apply x̃_k = x̄ for missing x_k to the variance formula (5.3), we must
add an extra ridge (the missing-value ridge). This result is similar to that of the
tie ridge and it is not surprising. Both ties and missing values make the sample
less informative than it would be if there were no ties or missing values. Since
the ME distribution is obtained by maximizing the entropy subject to con-
straints implied by the sample, we should expect that both missing values and
ties yield an ME distribution closer to the independence case, and that is indeed
what is shown by its covariance matrix.

6. Experiments in simultaneous equation estimation

Here we return to (3.1) and consider the question of whether the ME
approach can be useful when the sample is undersized.

6.1. The LIML estimator

Suppose that e_t in (3.1) and the error terms in the other equations of the
system have a multinormal distribution. It is then possible to apply the
maximum likelihood method, which yields a k-class estimator known as
LIML.¹⁰ The LIML value of k is k = μ, where μ is the smallest root of a

⁹For t = 1 and t = n − 1, (5.4) is an extremal tie which implies that the exponential distribution
over I_1 or I_n collapses, all mass being concentrated at the tied point. This also holds for a multiple
tie, x_a = x_b = x_c = x^t = x^{t+1} = x^{t+2}. In both cases the ME distribution becomes mixed discrete/con-
tinuous, but the validity of the variance formula (4.5) is not affected.
¹⁰LIML = limited-information maximum likelihood. 'Limited information' refers to the fact that no
restrictions are incorporated on equations other than (3.1). 'Full information' and FIML use all
restrictions in the system; see, e.g., Theil (1971, Chap. 10).

polynomial which is quadratic in the case of (3.1). The solution is

$$\mu = \frac{B}{2A} - \frac{1}{2A}\sqrt{B^2 - 4A(m_{11}m_{22} - m_{12}^2)}\,, \qquad (6.1)$$

where $A = m_{11\cdot p}m_{22\cdot p} - m_{12\cdot p}^2$ and $B = m_{11}m_{22\cdot p} + m_{22}m_{11\cdot p} - 2m_{12}m_{12\cdot p}$, the $m_{ij\cdot p}$'s being obtained from (3.3). Note that μ is random.
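To make (6.1) concrete, the sketch below (Python with NumPy; the chapter itself contains no code, so the language and the function name `liml_mu` are our choices) computes the smallest root μ from the data. It assumes, as one reading of (3.3), that $m_{ij}$ is the second moment of $y_i$ and $y_j$ about zero and that $m_{ij\cdot p}$ is its counterpart computed from LS residuals after projecting on the p exogenous variables.

```python
import numpy as np

def liml_mu(y1, y2, X):
    """Smallest root mu of the quadratic in (6.1); a sketch in which
    m_ij are moments about zero and m_ij.p their residual counterparts
    (our reading of (3.3)). The LIML value of k equals this mu."""
    n = len(y1)
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # residual-maker for X
    m11, m22, m12 = y1 @ y1 / n, y2 @ y2 / n, y1 @ y2 / n
    m11p, m22p, m12p = y1 @ M @ y1 / n, y2 @ M @ y2 / n, y1 @ M @ y2 / n
    A = m11p * m22p - m12p**2
    B = m11 * m22p + m22 * m11p - 2.0 * m12 * m12p
    return (B - np.sqrt(B**2 - 4.0 * A * (m11 * m22 - m12**2))) / (2.0 * A)
```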
As $n \to \infty$, $n(\mu - 1)$ converges in distribution to a $\chi^2$ variate, so that $n^{1/2}(\mu - 1)$ converges in probability to zero. Therefore, the propositions stated in the discussion following (3.4) imply that $n^{1/2}[\hat{\gamma}(\mu) - \gamma]$ has the same asymptotic normal distribution as its 2SLS counterpart, $n^{1/2}[\hat{\gamma}(1) - \gamma]$.
A closer approximation to the sampling distributions of the 2SLS and LIML estimators may be described as follows.¹¹ We standardize these two estimators by subtracting the true value of γ and then dividing the difference by their common asymptotic standard deviation. The asymptotic distribution of these two standardized estimators is standard normal. This is a first-order approximation which can be improved upon by appropriate expansions. The second-order approximation yields cdfs of the form

$$\text{2SLS:}\quad \Phi(u) - n^{-1/2}\theta(u^2 - p + 1)\Phi'(u), \qquad (6.2)$$

$$\text{LIML:}\quad \Phi(u) - n^{-1/2}\theta u^2\Phi'(u), \qquad (6.3)$$

where $\Phi(u)$ and $\Phi'(u)$ are the standard normal cdf and density function, respectively, while θ is a constant determined by the parameters of the system which contains (3.1) as one of its equations. Since substitution of $u = 0$ into (6.3) yields $\Phi(0) - 0 = \tfrac{1}{2}$, we conclude that the approximate distribution of the standardized LIML estimator has zero median, whereas (6.2) shows that the standardized 2SLS estimator has this property only for $p = 1$. As p increases, the median of the latter approximate distribution moves away from zero.
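The two approximations are easy to evaluate numerically. A minimal sketch, taking the constant θ as given (in practice it depends on the unknown system parameters), which also makes the median property visible: the LIML approximation always returns 1/2 at u = 0, the 2SLS approximation only when p = 1.

```python
import numpy as np
from scipy.stats import norm

def cdf_2sls(u, n, p, theta):
    """Second-order approximation (6.2) for the standardized 2SLS estimator."""
    return norm.cdf(u) - theta * (u**2 - p + 1) * norm.pdf(u) / np.sqrt(n)

def cdf_liml(u, n, theta):
    """Second-order approximation (6.3) for the standardized LIML estimator."""
    return norm.cdf(u) - theta * u**2 * norm.pdf(u) / np.sqrt(n)

# At u = 0: cdf_liml(0, n, theta) == 0.5 for any theta (zero median),
# while cdf_2sls(0, n, p, theta) == 0.5 only when p == 1.
```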
It appears that to a large extent these properties also apply when the estimators are formulated in terms of ME rather than sample moments. Theil and Meisner (1980) performed a simulation experiment in which the 2SLS estimator is systematically formulated in terms of ME moments. This has the advantage that the estimator exists even when $p > n$ [because $M_p$ in (3.3) is then positive definite], but the estimator is badly biased for large p. We shall therefore pay no further attention to 2SLS-type estimators. On the other hand, the approximate median-unbiasedness of LIML which is implied by (6.3) appears to also apply when this estimator is formulated in terms of ME moments.

6.2. LIML estimators based on sample and on ME moments

We return again to (3.1) and specify that the two associated reduced-form

¹¹The results which follow are from Anderson and Sawa; a convenient summary is given by Malinvaud (1980, pp. 716-721).

equations are¹²

$$y_{1t} = \sum_{h=1}^{p} x_{ht} + \zeta_{1t}, \qquad y_{2t} = \sum_{h=1}^{p} x_{ht} + \zeta_{2t}, \qquad (6.4)$$

which agree with (3.1) if and only if $\gamma = 1$ and $e_t = \zeta_{1t} - \zeta_{2t}$. In the simulation experiment to be discussed, the $x_{ht}$'s and $\zeta_{jt}$'s are all generated as independent pseudo-normal variates,¹³ the distribution of each $x_{ht}$ being $N(0, V/p)$ and that of each $\zeta_{jt}$ being $N(0, \sigma_0^2)$. Therefore,

$$\sum_{h=1}^{p} x_{ht} \sim N(0, V), \qquad \zeta_{jt} \sim N(0, \sigma_0^2), \qquad e_t \sim N(0, \sigma^2), \qquad (6.5)$$

where $\sigma^2 = 2\sigma_0^2$. Note that the distribution of the exogenous component in the reduced form (6.4) is independent of p. The objective of the experiment is to analyze the behavior of LIML estimators as p increases beyond n.
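A minimal sketch of one trial of this design under the distributional assumptions (6.4) and (6.5); the seed and the function name are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_trial(n, p, V=1.0, s0sq=0.5):
    """One draw from (6.4)-(6.5): x_ht ~ N(0, V/p) and zeta_jt ~ N(0, s0sq),
    all independent, so the exogenous component sum_h x_ht is N(0, V)."""
    X = rng.normal(0.0, np.sqrt(V / p), size=(n, p))
    zeta = rng.normal(0.0, np.sqrt(s0sq), size=(n, 2))
    s = X.sum(axis=1)
    y1, y2 = s + zeta[:, 0], s + zeta[:, 1]   # (3.1) holds with gamma = 1
    return y1, y2, X
```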
Table 2 is based on 1000 trials for the specification $V = 1$, $\sigma_0^2 = \tfrac{1}{2}$. Columns (2) and (3) contain, for each selected pair $(p, n)$, the median of the LIML estimates over the 1000 trials. In column (2) we use the conventional LIML estimator based on sample moments (LIML/SA); in column (3) we have LIML/ME, obtained by interpreting the matrix (3.2) as consisting of ME moments. Since (3.1) contains no constant term, both the sample and the ME moments are interpreted as second-order moments measured from zero (rather than from the mean). Columns (4), (7), (10) and (13) will be discussed in the next subsection.

The medians in column (3) are all close to 1 and thus suggest that the LIML/ME estimator is approximately median-unbiased. Note that the medians in column (2) decline as p approaches n. This means that the conventional LIML/SA estimator loses its median-unbiasedness for large p. Also, the interquartile distance of LIML/SA in column (11) increases substantially as p approaches n. A comparison with the corresponding quartiles in columns (5) and (8) indicates that this increased dispersion results primarily from a declining lower quartile but also from an increasing upper quartile.¹⁴
The experiment underlying Table 2 uses uncorrelated exogenous variables. In Table 3 we extend this to equicorrelated variables. Let the components of the vector $(x_{1t}, \ldots, x_{pt})$ be pseudo-normal with zero mean, correlation ρ and

¹²The reduced form is obtained by solving the system for the endogenous variables. This requires the number of equations to be equal to the number of these variables.
¹³The x's are not constant in repeated trials. Making them constant would have implied that all entries in any given row of Table 2 are determined by the same set of n observations on the p exogenous variables.
¹⁴Since Mariano and Sawa (1972) have shown that the sampling distribution of the LIML/SA estimator does not possess finite moments of any order, we use medians and quartiles to measure location and dispersion.

Table 2
Quartiles of LIML estimators based on sample, ME and hybrid momentsᵃ

                                                              Interquartile
     Median           Lower quartile      Upper quartile      distance

p    SA   ME   HY     SA   ME   HY        SA   ME   HY        SA   ME   HY
(1)  (2)  (3)  (4)    (5)  (6)  (7)       (8)  (9)  (10)      (11) (12) (13)

n = 21 observations
10   1.00 1.00 1.00   0.84 0.84 0.85      1.18 1.19 1.19      0.34 0.35 0.34
15   0.99 1.00 1.00   0.80 0.81 0.83      1.23 1.23 1.20      0.43 0.42 0.37
20   0.87 1.00 0.99   0.48 0.78 0.83      1.33 1.25 1.19      0.84 0.47 0.36
25   b    0.98 0.98   b    0.81 0.84      b    1.18 1.17      b    0.37 0.33
30   b    1.01 1.00   b    0.84 0.85      b    1.19 1.18      b    0.35 0.33
35   b    1.01 1.01   b    0.85 0.86      b    1.19 1.16      b    0.34 0.31
40   b    0.99 1.00   b    0.85 0.86      b    1.15 1.13      b    0.30 0.27

n = 31 observations
10   1.00 1.00 1.00   0.88 0.88 0.88      1.13 1.12 1.12      0.25 0.24 0.24
15   1.00 1.00 1.00   0.87 0.87 0.87      1.15 1.14 1.14      0.28 0.27 0.27
20   0.99 0.99 0.99   0.85 0.85 0.86      1.15 1.15 1.14      0.31 0.30 0.28
25   1.00 1.00 1.00   0.83 0.84 0.87      1.20 1.19 1.17      0.37 0.35 0.29
30   0.92 0.99 1.00   0.57 0.83 0.87      1.28 1.20 1.16      0.72 0.37 0.30
35   b    0.99 1.00   b    0.83 0.85      b    1.19 1.16      b    0.35 0.30
40   b    0.99 1.00   b    0.86 0.87      b    1.15 1.13      b    0.29 0.27
45   b    0.98 0.98   b    0.86 0.87      b    1.14 1.13      b    0.28 0.27
50   b    1.01 1.01   b    0.88 0.88      b    1.16 1.15      b    0.29 0.27

n = 41 observations
10   1.01 1.01 1.01   0.89 0.90 0.90      1.14 1.13 1.13      0.24 0.24 0.24
15   1.01 1.01 1.01   0.91 0.91 0.91      1.12 1.12 1.12      0.21 0.21 0.21
20   1.00 1.00 1.00   0.89 0.90 0.90      1.13 1.13 1.12      0.24 0.23 0.22
25   1.00 1.00 0.99   0.86 0.87 0.87      1.14 1.13 1.13      0.27 0.27 0.26
30   1.00 1.01 1.01   0.87 0.87 0.88      1.16 1.16 1.15      0.28 0.29 0.27
35   0.99 1.00 1.00   0.82 0.83 0.86      1.22 1.21 1.17      0.40 0.38 0.31
40   0.91 1.00 1.00   0.47 0.84 0.88      1.35 1.21 1.16      0.88 0.38 0.28
45   b    1.01 1.01   b    0.87 0.89      b    1.18 1.16      b    0.31 0.27
50   b    1.00 1.00   b    0.88 0.88      b    1.14 1.13      b    0.26 0.25
55   b    1.00 1.00   b    0.88 0.89      b    1.15 1.14      b    0.27 0.25
60   b    1.01 1.01   b    0.90 0.90      b    1.15 1.15      b    0.25 0.25

ᵃBased on 1000 trials; see text.
ᵇThe LIML/SA estimator does not exist.

variance $V/[p + p(p-1)\rho]$, so that (6.5) is still applicable for any $(p, \rho)$. Let these vectors be independent for different values of t. Table 3 uses $V = 1$ and $\sigma_0^2 = \tfrac{1}{2}$ as before, and it is based on 1000 trials of size $n = 21$ for selected values of p and ρ. The results for LIML/SA and LIML/ME in this table are similar to the corresponding results in Table 2.
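The equicorrelated design is easy to generate through a common factor. A sketch (the construction is ours, chosen so that each $x_{ht}$ has the stated variance and the sum over h is again $N(0, V)$):

```python
import numpy as np

rng = np.random.default_rng(1)

def equicorrelated_x(n, p, rho, V=1.0):
    """n independent draws of (x_1t, ..., x_pt) with pairwise correlation
    rho and variance V / (p + p*(p-1)*rho), so that sum_h x_ht ~ N(0, V)."""
    z = np.sqrt(rho) * rng.standard_normal((n, 1)) \
        + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))  # unit variance, corr rho
    return np.sqrt(V / (p + p * (p - 1) * rho)) * z
```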

Table 3
Quartiles of LIML estimators based on correlated exogenous variablesᵃ

                                                              Interquartile
     Median           Lower quartile      Upper quartile      distance

p    SA   ME   HY     SA   ME   HY        SA   ME   HY        SA   ME   HY

ρ = 0
10   1.02 1.02 1.02   0.85 0.85 0.85      1.19 1.20 1.19      0.35 0.35 0.34
15   1.00 1.00 1.01   0.81 0.82 0.85      1.21 1.20 1.18      0.40 0.38 0.33
20   0.90 0.98 0.99   0.56 0.79 0.85      1.31 1.22 1.18      0.75 0.43 0.33
25   b    0.97 0.98   b    0.81 0.83      b    1.17 1.15      b    0.36 0.32
30   b    1.00 1.01   b    0.85 0.86      b    1.21 1.20      b    0.36 0.34
35   b    0.99 0.99   b    0.83 0.84      b    1.16 1.14      b    0.32 0.30
40   b    1.01 1.01   b    0.87 0.87      b    1.18 1.17      b    0.31 0.29

ρ = 0.3
10   1.02 1.01 1.01   0.84 0.84 0.85      1.18 1.18 1.18      0.34 0.34 0.33
15   1.01 1.01 1.01   0.82 0.83 0.86      1.23 1.22 1.19      0.41 0.39 0.33
20   0.91 0.99 0.99   0.49 0.81 0.85      1.37 1.24 1.17      0.88 0.42 0.32
25   b    0.98 0.99   b    0.83 0.85      b    1.16 1.14      b    0.33 0.29
30   b    1.00 1.00   b    0.85 0.86      b    1.16 1.15      b    0.31 0.29
35   b    1.00 1.01   b    0.87 0.88      b    1.16 1.16      b    0.30 0.29
40   b    1.00 1.00   b    0.87 0.88      b    1.16 1.15      b    0.29 0.27

ρ = 0.6
10   1.02 1.01 1.02   0.84 0.85 0.86      1.18 1.18 1.18      0.34 0.33 0.32
15   1.01 1.01 1.00   0.82 0.83 0.85      1.23 1.22 1.19      0.42 0.39 0.34
20   0.93 0.99 1.00   0.52 0.82 0.85      1.38 1.22 1.17      0.87 0.40 0.33
25   b    0.98 0.99   b    0.83 0.85      b    1.17 1.15      b    0.34 0.31
30   b    0.99 0.99   b    0.85 0.86      b    1.16 1.16      b    0.30 0.30
35   b    1.01 1.01   b    0.87 0.87      b    1.17 1.16      b    0.30 0.29
40   b    1.00 1.00   b    0.87 0.88      b    1.16 1.16      b    0.29 0.28

ρ = 0.9
10   1.02 1.02 1.02   0.85 0.85 0.85      1.17 1.17 1.17      0.32 0.32 0.31
15   1.01 1.01 1.01   0.81 0.84 0.86      1.24 1.20 1.19      0.43 0.36 0.32
20   0.94 1.00 0.99   0.53 0.83 0.85      1.38 1.20 1.18      0.85 0.37 0.33
25   b    0.99 0.99   b    0.83 0.85      b    1.17 1.17      b    0.33 0.32
30   b    0.99 0.99   b    0.84 0.86      b    1.17 1.16      b    0.33 0.30
35   b    1.01 1.01   b    0.87 0.88      b    1.18 1.17      b    0.31 0.30
40   b    0.99 0.99   b    0.86 0.86      b    1.17 1.17      b    0.32 0.31

ρ = 0.99
10   1.02 1.02 1.02   0.85 0.86 0.86      1.17 1.17 1.17      0.32 0.31 0.30
15   1.01 1.01 1.01   0.81 0.86 0.86      1.24 1.20 1.18      0.42 0.34 0.32
20   0.93 0.99 0.99   0.54 0.84 0.85      1.38 1.18 1.16      0.84 0.33 0.31
25   b    0.99 1.00   b    0.85 0.85      b    1.17 1.17      b    0.33 0.31
30   b    0.99 0.99   b    0.85 0.86      b    1.16 1.16      b    0.31 0.30
35   b    1.01 1.01   b    0.87 0.88      b    1.19 1.18      b    0.31 0.29
40   b    1.00 1.00   b    0.85 0.85      b    1.18 1.17      b    0.33 0.31

ᵃBased on 1000 trials; see text. ᵇThe LIML/SA estimator does not exist.



6.3. LIML estimators based on hybrid moments


Although Tables 2 and 3 indicate that the performance of LIML/ME is far better than that of LIML/SA, the interquartile distance of the former estimator shows a bulge around $p = n$.¹⁵ This bulge indicates that for fixed n and increasing p, the precision of the estimator deteriorates when p approaches n and then improves when p increases beyond n. Is it possible to eliminate this bulge?

One way of doing this is by adding a ridge to the ME moment matrix, in the same way that Haff's (1980) empirical Bayes estimator of the covariance matrix amounts to adding a ridge to the sample moment matrix. Specifically, let us interpret the $p + 2$ diagonal elements of the matrix (3.2) as sample moments and all off-diagonal elements as ME moments. We shall refer to (3.2) thus interpreted as the hybrid moment matrix of the $p + 2$ variables. Simulation experiments based on alternative risk functions have indicated that the hybrid moment matrix is an attractive alternative to the ME moment matrix, particularly when the objective is to estimate the inverse of a parent moment matrix; see Theil and Fiebig (1984).
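In code, the hybrid interpretation changes only the diagonal of the moment matrix. A sketch, assuming a hypothetical helper `me_moment(x, y)` that returns the ME second moment of two data vectors; its formula, (5.2), appears earlier in the chapter and is not reproduced here:

```python
import numpy as np

def hybrid_moment_matrix(Z, me_moment):
    """Hybrid moment matrix of the columns of Z: sample moments (about
    zero) on the diagonal, ME moments elsewhere. me_moment is assumed
    to implement the ME cross moment (5.2)."""
    n, k = Z.shape
    H = np.empty((k, k))
    for i in range(k):
        H[i, i] = Z[:, i] @ Z[:, i] / n          # sample second moment
        for j in range(i + 1, k):
            H[i, j] = H[j, i] = me_moment(Z[:, i], Z[:, j])
    return H
```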
The 1000 trials underlying each line of Table 2 have also been used to compute LIML/HY estimates, all obtained from the hybrid interpretation of the moment matrix (3.2). The medians of these estimates in column (4) are about as close to 1 as those of LIML/ME in column (3), but the interquartile distances of the former estimates in column (13) are systematically below those of the latter in column (12). Also, the interquartile distances in column (13) do not show the large bulge around $p = n$ which we find in column (12). The picture of the correlated case in Table 3 is about the same.

The evidence of Tables 2 and 3 suggests that the LIML approach can be rescued in the case of undersized samples by the simple device of replacing sample moments by hybrid moments. This simplicity is in agreement with the view (see Section 3, last paragraph) that the problem of undersized samples should not really be a problem. See Theil and Fiebig (1984) for additional evidence concerning equations with more than two variables. In Section 7 we shall apply hybrid moments to a problem of constrained estimation.

7. Canonical correlations and symmetry-constrained estimation

7.1. Error covariance matrices and canonical correlations

We return to the linear system (2.1), which we generalize to a system of q linear equations with q dependent variables on the left and, in each equation, the same set of p independent variables on the right. The errors in the equations form a vector $(e_{1t}, \ldots, e_{qt})$ with zero mean and covariance matrix Σ. In Section 2 we described some problems that arise when we replace Σ by the

¹⁵There is no clear evidence of such a bulge for large ρ. This exception reflects the fact that the p exogenous variables effectively behave as one variable when ρ is sufficiently close to 1.

estimate S consisting of mean squares and products of LS residuals; here we shall consider whether an ME approach yields more attractive results. The account which follows is a modified version of Meisner (1981).

We write the $(p+q) \times (p+q)$ covariance matrix of the dependent and the independent variables in partitioned form:

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \quad \begin{matrix} q \text{ dependent variables} \\ p \text{ independent variables} \end{matrix} \qquad (7.1)$$

If this matrix is interpreted as consisting of population variances and covariances, it is related to the error covariance matrix Σ by

$$\Sigma = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'. \qquad (7.2)$$

Let $\rho_1, \ldots, \rho_m$ be the canonical correlation coefficients of the dependent and the independent variables, where $m = \min(p, q)$. These $\rho_i$'s can be obtained from the determinantal equation

$$|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' - \rho_i^2\Sigma_{11}| = 0, \qquad (7.3)$$

so that (7.2) implies

$$|\Sigma - (1 - \rho_i^2)\Sigma_{11}| = 0, \qquad (7.4)$$

which provides a link between the error covariance matrix Σ and the canonical correlations of the q dependent and the p independent variables of the system: for $i = 1, \ldots, m$, one minus each squared canonical correlation coefficient is a latent root of Σ in the metric of the covariance matrix $\Sigma_{11}$ of the dependent variables.
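Since (7.3) is a generalized eigenvalue problem, the canonical correlations can be computed directly from the partitioned blocks of (7.1). A sketch, equally applicable when the blocks hold sample, ME or hybrid moments:

```python
import numpy as np
from scipy.linalg import eigh

def canonical_correlations(S11, S12, S22):
    """Solve (7.3): the rho_i^2 are the eigenvalues of S12 S22^{-1} S12'
    in the metric of S11; returned here in descending order."""
    A = S12 @ np.linalg.solve(S22, S12.T)
    rho2 = eigh(A, S11, eigvals_only=True)[::-1]
    return np.sqrt(np.clip(rho2, 0.0, 1.0))
```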

7.2. Estimation of canonical correlations

Given our interest in Σ, the result (7.4) suggests that it is worthwhile to consider the estimation of canonical correlations. Fiebig (1980) conducted a simulation experiment based on

$$Y_i = \rho_i X_i + (1 - \rho_i^2)^{1/2} V_i, \qquad i = 1, \ldots, 9, \qquad (7.5)$$

where the $X_i$'s and $V_i$'s are 18 independent standard pseudo-normal variates. Then the $Y_i$'s are also standard pseudo-normal, while $X_i$ and $Y_j$ are uncorrelated for $i \neq j$ and $(X_i, Y_i)$ has correlation $\rho_i$. Therefore, $\rho_1, \ldots, \rho_9$ are the canonical correlations of $(X_1, \ldots, X_9)$ and $(Y_1, \ldots, Y_9)$. The joint covariance matrix of the 18 variables (X's and Y's) takes the form (7.1) with $\Sigma_{11} = \Sigma_{22} = I$ and $\Sigma_{12}$ diagonal with $\rho_1, \ldots, \rho_9$ on the diagonal. Their true values are specified as

$$\rho_1 = 0.9, \quad \rho_2 = 0.8, \quad \ldots, \quad \rho_8 = 0.2, \quad \rho_9 = 0.1. \qquad (7.6)$$
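One replication of this design takes only a few lines; a sketch (the seed is ours):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = np.arange(9, 0, -1) / 10.0          # (7.6): 0.9, 0.8, ..., 0.1

def draw_sample(n):
    """n observations from (7.5): X_i, V_i independent standard normal,
    so (X_i, Y_i) has correlation rho_i and all other pairs are uncorrelated."""
    X = rng.standard_normal((n, 9))
    V = rng.standard_normal((n, 9))
    Y = rho * X + np.sqrt(1.0 - rho**2) * V
    return X, Y
```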

By interpreting (7.1) as consisting of either ME or sample moments computed for a sample of size n, and then solving the associated determinantal equation (7.3), we obtain nine ME or sample canonical correlations. This experiment was replicated 100 times and the results are summarized in Table 4 in terms of means and RMSEs around the true value. The upper part of the table concerns the largest canonical correlation (with true value $\rho_1 = 0.9$). Both the ME and the sample estimator are subject to a substantial upward bias which slowly declines as n increases,¹⁶ but the bias of the former estimator is smaller and this also holds for its RMSE. The middle part of Table 4 concerns the arithmetic average canonical correlation (true value 0.5) and the lower part deals with the sum of the squared canonical correlations (true value 2.85); this sum plays a role in Hooper's (1959) trace correlation coefficient. The results are similar to those in the upper part: there is an upward bias which slowly decreases as n increases, and both the bias and the RMSE are smaller when ME rather than sample moments are used.

Although these results are encouraging for the ME approach, it should be admitted that the upward bias is quite substantial. A comparison of the last four columns of Table 4 shows that this bias is typically close to the corresponding RMSE, suggesting that a bias correction is in order. Let $r_1 \geq r_2 \geq \cdots \geq r_m$ be the ME canonical correlations. The corrected coefficients are $\bar{r}_1, \ldots, \bar{r}_m$, obtained from

$$1 - \bar{r}_i^2 = (1 - r_i^2)^{n/(n+p+q-1)}, \qquad (7.7)$$

which is a correction in exponential form. To explain the exponent we note that each canonical variate involves $p - 1$ or $q - 1$ multiplicative coefficients (only the ratios of these coefficients matter). This yields $p + q - 2$ coefficients for a pair of canonical variates, to which we add 1 for the use of a constant term, yielding a total of $p + q - 1$ coefficients. (Both canonical variates have constant terms, but the covariance in the numerator of the canonical correlation is not affected when only one constant is used.)
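The correction itself is a one-liner; a minimal sketch:

```python
import numpy as np

def corrected_canonical(r, n, p, q):
    """Apply (7.7): 1 - rbar_i^2 = (1 - r_i^2)^(n/(n+p+q-1)).
    Always yields rbar_i in [0, 1), unlike the usual 1/n correction
    for multiple correlations, which can turn negative."""
    r = np.asarray(r, dtype=float)
    return np.sqrt(1.0 - (1.0 - r**2) ** (n / (n + p + q - 1)))
```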
Table 5 provides evidence on the correction (7.7) based on the experimental design (7.5) and (7.6) for both the ME and the hybrid canonical correlations.

¹⁶The upward bias of the sample estimator is not surprising, since canonical correlations are generalizations of multiple correlations. Let R be such a correlation, associated with a linear regression on p independent variables (including a constant term). A frequently used correction amounts to multiplying $1 - R^2$ by the ratio of $n - 1$ to $n - p - 1$. Both this correction and that shown in (7.7) below for canonical correlations are corrections to the order 1/n, but (7.7) has the advantage of never yielding a negative $\bar{r}_i^2$. See also Lawley (1956, 1959) for an asymptotic expansion of the expected sample canonical correlations; the implied correction is much more complicated than (7.7).

Table 4
ME and sample canonical correlation coefficients

Mean Estimated bias a RMSE

n ME Sample ME Sample ME Sample

Largest canonical correlation coefficient


10 0.991 b 0.091 b 0.091 b
15 0.992 b 0.092 b 0.092 b
20 0.986 0.994 0.086 0.094 0.087 0.094
25 0.973 0.978 0.073 0.078 0.073 0.079
30 0.963 0.967 0.063 0.067 0.064 0.068
40 0.945 0.949 0.045 0.049 0.050 0.052
50 0.936 0.939 0.036 0.039 0.040 0.042
100 0.919 0.920 0.019 0.020 0.026 0.028

Average canonical correlation coefficient


10 0.826 b 0.326 b 0.328 b
15 0.733 b 0.233 b 0.235 b
20 0.678 0.685 0.178 0.185 0.181 0.188
25 0.635 0.640 0.135 0.140 0.139 0.144
30 0.613 0.618 0.113 0.118 0.118 0.122
40 0.583 0.586 0.083 0.086 0.088 0.091
50 0.562 0.563 0.062 0.063 0.068 0.069
100 0.530 0.531 0.030 0.031 0.037 0.039

Sum of squared canonical correlation coefficients


10 6.74 b 3.89 b 3.91 b
15 5.67 b 2.82 b 2.84 b
20 4.95 5.06 2.10 2.21 2.14 2.23
25 4.46 4.52 1.61 1.67 1.63 1.69
30 4.19 4.25 1.34 1.40 1.38 1.43
40 3.83 3.86 0.98 1.01 1.00 1.04
50 3.60 3.62 0.75 0.77 0.79 0.81
100 3.23 3.23 0.38 0.38 0.41 0.42

ᵃMean minus true value.
ᵇNot computed. For n = 10 and 15, the largest sample canonical correlation coefficient is identically equal to 1.

The top row of the table shows the true value of each squared canonical correlation. The first eight rows contain means over 100 trials and, in parentheses, the RMSEs around the true value of the squared ME canonical correlations. The next eight lines provide analogous results for the hybrid estimates, obtained by interpreting (7.1) as the hybrid covariance matrix (with sample variances on the diagonal and ME covariances elsewhere). In the lower half of the table the correction (7.7) is applied to either the ME or the hybrid estimator. A comparison of means and RMSEs shows that for $n \geq 15$ the corrected hybrid estimator is superior except with respect to the largest canonical correlation.
[Table 5. Means and RMSEs of the squared ME and hybrid canonical correlations, with and without the correction (7.7); the full-page table could not be recovered from the source.]

7.3. A cross-country demand system

We return again to the demand system (2.1), which we now amend by adding a constant term to each equation:

$$y_{it} = \alpha_i + \beta_i x_{0t} + \sum_{j=1}^{N} \pi_{ij} x_{jt} + e_{it}. \qquad (7.8)$$

Our application of this system will not be to time series data but to per capita data for 15 countries ($t = 1, \ldots, n = 15$); see the Appendix for further details. The analysis of homogeneity and symmetry testing is beyond the scope of this chapter, because it would involve not only the frequency of rejections of the null hypothesis when this hypothesis is true but also the power of the test. Instead, we shall impose the homogeneity condition (2.2) by writing (7.8) in the form

$$y_{it} = \alpha_i + \beta_i x_{0t} + \sum_{j=1}^{N-1} \pi_{ij}(x_{jt} - x_{Nt}) + e_{it}, \qquad (7.9)$$

and we shall want to estimate this system subject to the symmetry constraint (2.3).
Since $e_{1t} + \cdots + e_{Nt} = 0$, we can confine the estimation of (7.9) to $i = 1, \ldots, N-1$. We write (7.9) for $t = 1, \ldots, 15$ as $y_i = X\delta_i + e_i$, where $\delta_i = (\alpha_i, \beta_i, \pi_{i1}, \ldots, \pi_{i,N-1})'$ and X is a $15 \times (N+1)$ matrix whose tth row equals $(1, x_{0t}, x_{1t} - x_{Nt}, \ldots, x_{N-1,t} - x_{Nt})$. Let $(e_{1t}, \ldots, e_{N-1,t})$ for $t = 1, \ldots, 15$ be independently and identically distributed with zero means and nonsingular covariance matrix Σ. Then $(X'X)^{-1}X'y_i$ is the LS estimator of $\delta_i$, which is unbiased if X is fixed, while S defined as

$$S = \frac{1}{15 - (N+1)}\, Y'[I - X(X'X)^{-1}X']\,Y, \qquad Y = [y_1, \ldots, y_{N-1}], \qquad (7.10)$$

is an unbiased estimator of Σ.
The LS estimator of $\delta_i$ does not satisfy the symmetry constraint (2.3). We can write (2.3) in the form $R\delta = 0$, where δ is a vector with $\delta_i$ as the ith subvector ($i = 1, \ldots, N-1$) and R is a matrix whose elements are all 0 or ±1, each row of R corresponding to $\pi_{ij} = \pi_{ji}$ for some $(i, j)$. The BLU estimator of δ constrained by (2.3) is

$$\hat{\delta}(\Sigma) = d - C(\Sigma)R'[RC(\Sigma)R']^{-1}Rd, \qquad (7.11)$$

and its covariance matrix is

$$C(\Sigma) - C(\Sigma)R'[RC(\Sigma)R']^{-1}RC(\Sigma), \qquad (7.12)$$

where $C(\Sigma) = \Sigma \otimes (X'X)^{-1}$ and d is a vector with $(X'X)^{-1}X'y_i$ as the ith subvector ($i = 1, \ldots, N-1$). For details on constrained linear estimation, see, e.g., Theil (1971, Sec. 6.8).
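A sketch of (7.11) and (7.12), assuming the unconstrained LS estimates are stacked in d and R encodes the symmetry restrictions:

```python
import numpy as np

def constrained_blu(d, Sigma, XtX_inv, R):
    """BLU estimator (7.11) under the restrictions R delta = 0, and its
    covariance matrix (7.12), with C(Sigma) = Sigma kron (X'X)^{-1}."""
    C = np.kron(Sigma, XtX_inv)
    K = C @ R.T                        # C R'
    W = np.linalg.inv(R @ K)           # [R C R']^{-1}
    delta = d - K @ W @ (R @ d)        # (7.11)
    cov = C - K @ W @ K.T              # (7.12)
    return delta, cov
```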
If Σ is known, we can compute (7.11) from the data. If Σ is not known, the standard procedure is to replace Σ in (7.11) by the estimator S of (7.10). Alternatively, we can use an estimator based on corrected canonical correlations of the type (7.7), but an adjustment must be made for the fact that Σ refers only to $N - 1$ equations.¹⁷ Here we retain all N equations (7.9) for $i = 1, \ldots, N$ by specifying $p = q = N$ in (7.1).¹⁸ Indicating by hats (circumflexes) that (7.1) has sample variances on the diagonal and ME covariances elsewhere, we obtain the (uncorrected) hybrid canonical correlations $r_1 > r_2 > \cdots > r_N$ from

$$(\hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{12}' - r_i^2\hat{\Sigma}_{11})z_i = 0, \qquad (7.13)$$

where $z_i$ is a characteristic vector associated with $r_i$, normalized so that $z_i'\hat{\Sigma}_{11}z_j = \delta_{ij}$ or, equivalently,

$$Z'\hat{\Sigma}_{11}Z = I, \qquad Z = [z_1, \ldots, z_N]. \qquad (7.14)$$

Let $\Sigma_N$ be the covariance matrix of $(e_{1t}, \ldots, e_{Nt})$, to be estimated from the $N \times N$ version of (7.4) with characteristic vectors (the $z_i$'s) added. Since $\Sigma_N$ has rank $N - 1$, we correct $r_1$ to 1 and use (7.7) with $p = q = N$ for $i = 2, \ldots, N$. Let Λ be the diagonal matrix with $0, 1 - \bar{r}_2^2, \ldots, 1 - \bar{r}_N^2$ on the diagonal. Then, from (7.4), (7.13) and (7.14), the corrected estimator of $\Sigma_N$ is $\hat{\Sigma}_N = (Z')^{-1}\Lambda Z^{-1} = \hat{\Sigma}_{11}Z\Lambda Z'\hat{\Sigma}_{11}$, so that

$$\hat{\Sigma}_N = \sum_{i=2}^{N} (1 - \bar{r}_i^2)\,\hat{\Sigma}_{11}z_i(\hat{\Sigma}_{11}z_i)', \qquad (7.15)$$

after which $\hat{\Sigma}$ is obtained by deleting the last row and column of $\hat{\Sigma}_N$. This $\hat{\Sigma}$ is an estimator of Σ in (7.11) that will be used below as an alternative to S of (7.10). Note that $\hat{\Sigma}$ does not involve the largest canonical correlation (see the end of the previous subsection).
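The whole construction (7.13)-(7.15) again reduces to one generalized eigenvalue problem. A sketch, relying on the fact that scipy's `eigh` normalizes the eigenvectors exactly as in (7.14):

```python
import numpy as np
from scipy.linalg import eigh

def corrected_error_cov(S11, S12, S22, n, N):
    """Corrected estimator (7.15) of the N x N error covariance matrix,
    computed from the hybrid blocks of (7.1) with p = q = N."""
    A = S12 @ np.linalg.solve(S22, S12.T)
    r2, Z = eigh(A, S11)                  # ascending eigenvalues; Z' S11 Z = I as in (7.14)
    r2 = np.clip(r2[::-1], 0.0, 1.0)      # descending: r_1^2 >= ... >= r_N^2
    Z = Z[:, ::-1]
    rbar2 = 1.0 - (1.0 - r2) ** (n / (n + 2 * N - 1))  # (7.7) with p = q = N
    rbar2[0] = 1.0                        # correct r_1 to 1: Sigma_N has rank N - 1
    B = S11 @ Z
    return (B * (1.0 - rbar2)) @ B.T      # (7.15); drop last row/column for Sigma-hat
```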

7.4. Discussion of numerical results


A simulation experiment was performed in order to compare the three symmetry-constrained estimators, with N = 8 goods: food; clothing; rent; furniture; medical care; transport and communication; recreation and education; other consumption expenditures.

¹⁷Deleting the first of N equations rather than the last amounts to a linear transformation of the dependent variables. Such a transformation affects the corrected ME error covariance matrix in a nontrivial way, since the rectangular cells in the second paragraph of Section 5 become parallelograms when the variables are linearly transformed.
¹⁸There are $p = N$ independent variables in (7.9); the constant terms $\alpha_i$ are handled by the use of variances and covariances rather than second moments around zero.
[Table 6. True values, estimated bias, RMSE and RMSSE of the three symmetry-constrained estimators over 500 trials; the full-page table could not be recovered from the source.]
The first column of Table 6 contains the true values of the parameters. For each of the three estimators, the columns labeled Bias and RMSE contain the estimated bias (mean minus true value) and the RMSE around the true value over 500 trials. Bias presents no problem; the estimated bias values are all small in magnitude relative to the corresponding RMSEs. Differences between the three estimators appear when we consider their RMSEs. The estimates based on S are markedly inferior to those which use the true Σ. When we use $\hat{\Sigma}$ rather than S, we obtain estimates which compare much more favorably to those based on the true Σ.

In order to facilitate these comparisons, we computed ratios of the RMSEs based on the true Σ to those based on S and on $\hat{\Sigma}$ for each of the 35 coefficients. These ratios are shown in the first two columns of Table 7, and the quartiles of these ratios (lower, median, upper) are shown below.

                  Lower   Median   Upper

Ratios for S       0.67     0.74    0.81
Ratios for Σ̂       0.92     0.95    0.99

It is evident from these figures that there is a substantial efficiency gain from using $\hat{\Sigma}$ rather than S in the symmetry-constrained estimation procedure, and that the efficiency loss from not knowing the true error covariance matrix is quite modest when $\hat{\Sigma}$ is used as its estimator.

Another matter of importance is whether the standard errors of the symmetry-constrained estimates provide an adequate picture of the variability of these estimates around the true parameter values. This problem is addressed by means of the RMSSEs of Table 6. These are obtained from the matrix (7.12), with Σ interpreted as either S or $\hat{\Sigma}$ or the true Σ, by averaging the diagonal elements of (7.12) over the 500 trials and then taking square roots of these averages. On comparing the RMSSEs based on S with the corresponding RMSEs we must conclude that the standard errors based on S tend to underestimate the variability of their coefficient estimates. Table 7 illustrates this more clearly by providing the ratio of the RMSSE to the corresponding RMSE for each estimator. The third column of this table shows the substantial understatement of the variability of the estimates based on S. The quartiles of the 35 ratios in each of the last three columns are as follows:

                       Lower   Median   Upper

Ratios for S            0.48     0.59    0.69
Ratios for Σ̂            0.89     0.94    1.06
Ratios for true Σ       0.99     1.01    1.02

When the true Σ is used, the ratios are tightly distributed around unity. Use of $\hat{\Sigma}$ yields ratios which are more widely dispersed around 1, but which represent a marked improvement over the use of S.

Table 7
Ratios of RMSEs and RMSSEs of symmetry-constrained estimates

                     Ratio of RMSE based on
                     true Σ to RMSE based on    Ratio of RMSSE to RMSE

                     S       Σ̂                 S       Σ̂       true Σ

Coefficients βᵢ
i = 1                0.70    0.92               0.56    1.39    1.01
i = 2                0.69    0.90               0.60    0.94    1.06
i = 3                0.89    0.99               0.86    1.08    1.02
i = 4                0.63    0.99               0.44    0.92    0.98
i = 5                0.79    0.90               0.67    1.25    0.98
i = 6                0.83    0.94               0.76    0.95    1.02
i = 7                0.67    0.94               0.48    0.94    0.97

Diagonal Slutsky coefficients πᵢᵢ
i = 1                0.72    0.94               0.57    1.52    1.03
i = 2                0.78    1.00               0.60    0.88    1.00
i = 3                0.92    0.97               0.84    0.95    1.00
i = 4                0.61    1.00               0.38    0.83    0.97
i = 5                0.79    0.93               0.67    1.35    1.04
i = 6                0.88    0.97               0.79    0.85    1.03
i = 7                0.65    0.92               0.44    0.87    0.96

Off-diagonal Slutsky coefficients πᵢⱼ
i = 1, j = 2         0.70    0.87               0.52    0.93    0.99
i = 1, j = 3         0.83    0.86               0.72    1.27    1.03
i = 1, j = 4         0.62    0.97               0.42    0.98    0.99
i = 1, j = 5         0.77    0.93               0.59    1.43    0.99
i = 1, j = 6         0.86    0.88               0.66    1.00    0.96
i = 1, j = 7         0.63    0.96               0.43    1.06    1.01
i = 2, j = 3         0.81    0.99               0.71    0.90    1.02
i = 2, j = 4         0.63    0.99               0.43    0.84    0.99
i = 2, j = 5         0.72    0.94               0.55    0.98    1.01
i = 2, j = 6         0.81    1.00               0.67    0.86    1.02
i = 2, j = 7         0.67    0.95               0.45    0.87    0.98
i = 3, j = 4         0.78    1.01               0.66    0.92    0.98
i = 3, j = 5         0.81    0.87               0.71    1.11    1.01
i = 3, j = 6         0.90    1.01               0.79    0.90    1.01
i = 3, j = 7         0.79    0.96               0.69    0.91    0.99
i = 4, j = 5         0.68    0.95               0.49    0.97    1.02
i = 4, j = 6         0.69    0.96               0.54    0.86    1.03
i = 4, j = 7         0.60    1.01               0.36    0.89    1.00
i = 5, j = 6         0.78    0.86               0.65    0.97    1.05
i = 5, j = 7         0.72    0.92               0.55    0.98    1.01
i = 6, j = 7         0.74    0.99               0.58    0.93    1.01

8. Conclusion

We have attempted to demonstrate an approach, based on the ME distribution, to problems that arise in large equation systems. Estimators of
various population parameters are generated from this distribution according
to the method of moments: whenever a standard procedure uses sample
moments, we use ME moments. For example, previous analyses have found
that the ME moment matrix leads to small-sample gains relative to the usual
sample moment matrix. On the basis of our experimentation, impressive results
were also achieved from a hybrid moment matrix whose diagonal elements are
sample moments and whose off-diagonal elements are ME moments.
We hope that the experiments presented have illustrated the effectiveness of the ME approach. Simulation experiments cannot be conclusive, though, and further work is needed to reinforce these initial
impressions. The simultaneous equation experiment could be extended in a
number of directions. The form of the equation isolated for attention is
extremely simple and the experiment could be extended to equations that
include more endogenous and/or exogenous variables. Also, no attempt was
made to test the validity of the asymptotic standard errors.
It is appropriate to note that there exist other problems associated with large
equation systems that have not been discussed here. In the context of simul-
taneous equation estimation, full information methods of estimation (such as
three-stage least squares) require the number of endogenous variables in the
system to be less than the number of observations. Without such a condition,
the usual sample estimator of the error covariance matrix is singular. Essen-
tially the same problem can arise in systems of demand equations or more
generally in any system of seemingly unrelated regression equations. For
example, in order to estimate a system of demand equations with 37 goods on
the basis of annual U.K. data for 17 years, Deaton (1975) used an a priori
specified covariance matrix. The ME approach provides a simple and elegant
solution in such situations.

Appendix

The demand systems (2.1) and (7.8) are obtained by differentiating an appropriately differentiable utility function subject to the budget constraint $\sum_i p_i q_i = M$, where $p_i$ and $q_i$ are the price and quantity of good i and M is total expenditure (or 'income'). The technique used amounts to deriving the first-order condition for a constrained maximum and then differentiating it with respect to M and the $p_j$'s. The result can be conveniently written in the differential form

$$w_i \, d(\log q_i) = \theta_i \, d(\log Q) + \sum_{j=1}^{N} \pi_{ij} \, d(\log p_j), \qquad (A1)$$
where $w_i$ is the budget share of good i and $d(\log Q)$ is the Divisia volume index,

$$w_i = \frac{p_i q_i}{M}, \qquad d(\log Q) = \sum_{i=1}^{N} w_i \, d(\log q_i), \qquad (A2)$$

while $\theta_i = \partial(p_i q_i)/\partial M$ is the marginal budget share of good i and the Slutsky coefficient $\pi_{ij}$ equals $(p_i p_j/M)\,\partial q_i/\partial p_j$, the derivative $\partial q_i/\partial p_j$ measuring the effect of $p_j$ on $q_i$ when real income remains constant. The homogeneity property (2.2) reflects that proportionate changes in all prices do not affect any $q_i$ when M also changes proportionately. The symmetry property (2.3) results from the assumed symmetry of the Hessian matrix of the utility function.
To apply (A1) to time series we write $Dx_t = \log(x_t/x_{t-1})$ for any positive variable x with value $x_t$ at time t. A finite-change approximation to (A1) is then

$$\bar{w}_{it} Dq_{it} = \theta_i DQ_t + \sum_{j=1}^{N} \pi_{ij} Dp_{jt}, \qquad (A3)$$

where $DQ_t = \sum_{i=1}^{N} \bar{w}_{it} Dq_{it}$ and $\bar{w}_{it}$ is the arithmetic average budget share of good i at $t - 1$ and t. Equation (A3) is equivalent to (2.1) for $y_{it} = \bar{w}_{it} Dq_{it}$, $x_{0t} = DQ_t$, $x_{jt} = Dp_{jt}$. For further details, see Theil (1980).
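As a sketch, the finite-change variables entering (A3) can be computed from $T \times N$ arrays of prices and quantities as follows (the function name is ours; strictly positive prices and quantities are assumed):

```python
import numpy as np

def divisia_variables(p, q):
    """Budget shares (A2), log-changes Dx_t = log(x_t / x_{t-1}), average
    shares wbar_it and the Divisia volume index DQ_t for (A3)."""
    w = p * q / (p * q).sum(axis=1, keepdims=True)
    Dq = np.diff(np.log(q), axis=0)
    Dp = np.diff(np.log(p), axis=0)
    wbar = 0.5 * (w[1:] + w[:-1])        # average share at t-1 and t
    DQ = (wbar * Dq).sum(axis=1)         # DQ_t = sum_i wbar_it Dq_it
    return wbar * Dq, DQ, Dp             # y_it, x_0t and x_jt of (2.1)
```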
The numerical results reported in Section 7 are based on the analysis of Theil and Suhm (1981) of data on 15 countries collected by Kravis et al. (1978). These countries are the U.S., Belgium, France, West Germany, U.K., The Netherlands, Japan, Italy, Hungary, Iran, Colombia, Malaysia, Philippines, South Korea, and India. Let $w_{it}$ be the per capita budget share of good i in country t. Working's (1943) model describes such a share as a linear function of the logarithm of income. To take into account that different countries have different relative prices, Working's model is postulated to hold at the geometric mean prices across countries, $\bar{p}_1, \ldots, \bar{p}_N$, where

$$\log \bar{p}_i = \frac{1}{15} \sum_{t=1}^{15} \log p_{it}, \qquad (A4)$$

which requires that a substitution term be added to the model. The result is that the demand system takes the form (7.8), with $x_{0t}$ per capita real income of country t, $x_{jt} = \log(p_{jt}/\bar{p}_j)$ and $y_{it}$ equal to $1 - x_{0t} + \sum_j w_{jt}x_{jt}$, multiplied by $w_{it}$. Then the sums over $i = 1, \ldots, N$ of $y_{it}$, $\alpha_i$, $\beta_i$ and $\pi_{ij}$ are equal to 1, 1, 0 and 0, respectively, implying that $e_{1t}, \ldots, e_{Nt}$ are linearly dependent.

References

Barten, A. P. (1969). Maximum likelihood estimation of a complete system of demand equations.


European Economic Review 1, 7-73.
Bera, A. K., Byron, R. P. and Jarque, C. M. (1981). Further evidence on asymptotic tests for
homogeneity and symmetry in large demand systems. Econom. Lett. 8, 101-105.
480 H. Theil and D. G. Fiebig

Byron, R. P. (1970). The restricted Aitken estimation of sets of demand equations. Econometrica
39, 816-830.
Christensen, L. R., Jorgenson, D. W. and Lau, L. J. (1975). Transcendental logarithmic utility
functions. American Economic Review 65, 367-383.
Conway, D. and Theil, H. (1980). The maximum entropy moment matrix with missing values.
Econom. Lett. 5, 319-322.
Deaton, A. S. (1974). The analysis of consumer demand in the United Kingdom. Econometrica 42,
341-367.
Deaton, A. S. (1975). Models and Projections of Demand in Post-War Britain. Chapman and Hall,
London.
Fiebig, D. G. (1980). Maximum entropy canonical correlations. Econom. Lett. 6, 345-348.
Fiebig, D. G. (1982). The maximum entropy distribution and its covariance matrix. Doctoral
dissertation. Department of Economics, University of Southern California.
Haff, L. R. (1980). Empirical Bayes estimation of the multivariate normal covariance matrix. Ann.
Statist. 8, 586-597.
Hooper, J. W. (1959). Simultaneous equations and canonical correlation theory. Econometrica 27,
245-256.
Kagan, A. M., Linnik, Y. V. and Rao, C. R. (1973). Characterization Problems in Mathematical
Statistics. Wiley, New York.
Kidwai, S. A. and Theil, H. (1981). Simulation evidence on the ridge and the shrinkage of the
maximum entropy variance. Econom. Lett. 8, 59-61.
Kravis, I. B., Heston, A. W. and Summers, R. (1978). International Comparisons of Real Product
and Purchasing Power. The Johns Hopkins University Press, Baltimore, MD.
Laitinen, K. (1978). Why is demand homogeneity so often rejected? Econom. Lett. 1, 187-191.
Lawley, D. N. (1956). Tests of significance for the latent roots of covariance and correlation
matrices. Biometrika 43, 128-136.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
Lluch, C. (1971). Consumer demand functions, Spain, 1958-1964. European Economic Review 2,
277-302.
Malinvaud, E. (1980). Statistical Methods of Econometrics, 3rd ed. North-Holland, Amsterdam.
Mariano, R. S. and Sawa, T. (1972). The exact finite-sample distribution of the limited-information
maximum likelihood estimator in the case of two included exogenous variables. J. Amer. Statist.
Assoc. 67, 159-165.
Meisner, J. F. (1979). The sad fate of the asymptotic Slutsky symmetry test for large systems.
Econom. Lett. 2, 231-233.
Meisner, J. F. (1981). Appendix to Theil and Suhm (1981).
Schuster, E. F. (1973). On the goodness-of-fit problem for continuous symmetric distributions. J.
Amer. Statist. Assoc. 68, 713-715.
Schuster, E. F. (1975). Estimating the distribution function of a symmetric distribution. Biometrika
62, 631-635.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
Theil, H. (1980). The System-Wide Approach to Microeconomics. The University of Chicago Press,
Chicago, IL.
Theil, H. and Fiebig, D. G. (1984). Exploiting Continuity: Maximum Entropy Estimation of
Continuous Distributions. Ballinger, Cambridge, MA.
Theil, H. and Laitinen, K. (1980). Singular moment matrices in applied econometrics. In: P. R.
Krishnaiah, ed., Multivariate Analysis--V, 629-649. North-Holland, Amsterdam.
Theil, H. and Meisner, J. F. (1980). Simultaneous equation estimation based on maximum entropy
moments. Econom. Lett. 5, 339-344.
Theil, H. and Suhm, F. E. (1981). International Consumption Comparisons: A System-Wide
Approach. North-Holland, Amsterdam.
Theil, H., Kidwai, S. A., Yalnizoğlu, M. A. and Yellé, K. A. (1982). Estimating characteristics of a
symmetric continuous distribution. CEDS Discussion Paper 74. College of Business Ad-
ministration, University of Florida.
Working, H. (1943). Statistical laws of family expenditure. J. Amer. Statist. Assoc. 38, 43-56.
Subject Index

ACM filter cleaners, 144 Canonical correlations, 466, 467


Adaptive algorithm, 202 Canonical echelon, 265, 266
Adaptive regression model, 430 Canonical form, 265-269
Additive outliers model, 133 Canonical state space, 265-269
Akaike's AIC, 274, 417 Catastrophe theory, 48
Akaike's Markovian representation, 165 Class (KF), 295
Almost harmonizable process, 285 Comparative calibration, 364
Amplitude-dependent autoregressive models, Computation of AM-estimates, 147
26, 43 Concentrated maximum likelihood, 423
Analysis of covariances matrices, 363 Consistency, 436, 437
Approximate maximum-likelihood type esti- Consistency condition, 346, 347, 348
mates, 141 Continuous time models, 168
ARMAX model, 191, 217, 223, 258 Correlation characteristic, 301
Asymptotically efficient estimators, 440 Covariance estimators, 435, 436
Asymptotically optimal, 349 Covariance matrix analysis, 363
Asymptotically optimal sequence of designs, Covariance stationary, 1
351 Cramér class, 291
Asymptotically stationary, 280 Cramér-Hida class, 280
Asymptotic breakdown point, 150 Cramér-Rao lower bound, 120
Asymptotic properties, 199, 378 Cross-validation, 184
Autocorrelation coefficients, 426
Autocorrelation function, 430, 433 Delta-array, 182
Autoregressive (AR) model, 1, 15, 75, 179, 180, Detection of outliers, 107, 109
358, 393, 445 Deterministic process, 299
Autoregressive moving average (ARMA) Deterministic sampling, 342
model, 27, 28, 75, 85, 86, 88, 106, 119, 155, 157, Diagnostic checks, 425, 440
179, 181, 257, 258, 265-269 Difference-differential filter, 305
Averaged covariances, 315, 316 Difference equation, 2, 191
Diffusion processes, 51, 53
Dilation of Cramér process, 292
Backward representation, 181 Dilation of harmonizable process, 283
Batch identification, 190 Distribution systems, 53
Bernoulli number, 351 Domain estimation, 374
Bivariate entropy, 458 Dynamical systems, 25, 189
Bochner-Herglotz theorem, 280 Dynamic time warping, 391, 409
Box and Jenkins' approach, 414
Breakdown points, 150 Echelon forms, 265, 266
Brownian motion, 303, 351 Ecological systems, 237
Business data analysis, 243 Efficiency of robust estimates, 129


Efficiency robustness, 119 Identifiability, 262, 439


E - M algorithm, 442 Identification, 179, 190, 258, 414
Endogenous variables, 415, 443, 454 Implementation, 209
Environmental systems, 237 Infinite-variance autoregression, 129
Ergodicity, 36, 417, 421 Influence curves, 151
Errors-in-variables, 133 Information criterion, 230
Estimating the transfer functions, 273, 366 Information matrix, 437
Estimation of narrow band, 368, 434 Initial estimates, 372
Estimation in presence of delays, 375 Innovation outliers, 128
Evolutionary behavior, 428 Innovations representation, 223
Evolving coefficient regression, 439 Input process, 305
Evolving coefficient variation, 415, 426 Instrumental variable estimation, 198, 206, 443
Evolving constant model, 430 Integral filter, 305, 326
Exogenous set of variables, 415, 454 Integrals of random quantities, 339, 346
Explosive autoregression, 6, 17 Integrated random walk, 226
Exponential A R models, 31, 33, 37, 74 Integro-differential filter, 305
Extended autocorrelation function, 92 Intensity function, 312
Extended exponential A R model, 44, 76 Intervention analysis, 87, 104, 105
Extended Kalman filter, 205 Isolated experimental study, 409, 410
Isolated word recognition, 389-392
Factorable, 339 Itakura distance, 392, 398, 403, 404
Factor analysis, 363, 364
False alarm rate, 341, 345 Jitter, 353
Filter, 143, 145, 305
Final prediction error, 230 Kalman filter, 157, 223, 225, 374, 437
Finite Fourier transform of the data, 368 Karhunen class, 286
Finite-sample breakdown points, 150 k-class estimator, 454
Fisher consistency, 149 Kotel'nikov-Shannon formula, 330
Fisher information, 120, 130 Kronecker estimation, 275
Fokker-Planck equation, 51, 52, 54, 63, 71 Kronecker indices, 266
Forgetting factors, 209 Kuhn-Tucker multipliers, 418
Fréchet variation, 282, 291, 323 Kuhn-Tucker multipliers, 418
Frequency dependent, 375 Lagged values, 439
Frequency-dependent delays, 367 Lagrange multiplier, 428
Frequency response, 325 Lagrange multiplier test, 184, 418, 428
Large equation systems, 451-480
Least-squares estimates, 3, 4, 126, 127, 128, 419
Gaussian noise, 341, 342 Least-squares prediction, linear, 302
Gaussian process, 301,358 Least-squares prediction, nonlinear, 302
Gauss-Markov, 350, 352, 356 Likelihood (Gaussian), 272
Gauss-Newton algorithm, 200 Likelihood ratio test, 418
Generalized partial autocorrelation, 181 Limiting distribution, A R process, 12
Generalized Pearson system, 54 LIML, 461
Generalized spectral function, 282 Linear filter, 324
Gradient algorithm, 200 Linearization, 55
Green's function, 304 Linear threshold AR model, 43, 77
Local coordinates for the manifold, 270
Hampel-Krasker-Welsch type, 135 Log likelihood, 437
Hankel matrix (block), 263 Longitudinal data, 413
Harmonizable process, 314 LPC (approximate) likelihood, 394-396
Harmonizable process, multivariate, 285 LPC comparative tests, 397-406
Harmonizable process, strongly, 280 LPC computational cost, 408, 409
Harmonizable process, weakly, 282 LPC distance measures, 389-412
Homogeneous testing, 185 LPC likelihood ratio, 397-402

LPC model, 393 Normal operator semigroup, 289


LPC of Gaussian linear model, 394 Normal process, 307
LPC power functions, 406-408 Numerical schemes, 201

Observation errors, 360


Manifold estimation, 275 Off-line identification, 190
Manifold of systems, 263, 269, 270 Optimal designs, 349, 352, 355, 356, 358, 359,
Markov chain, 36, 37, 55, 60, 75 360
Markov chain model, 57, 63, 66, 68, 72 Optimal filter, 329
Markov models, 392 Optimal generalized equation error (OGEE)
Martingale central limit theorem, 421 approach, 219
Martingale difference, 414 Order determination, 417, 433
Matrix fraction description, 261 Order of a system, 263
Maximum entropy, 451,455, 458, 459, 460, 462 Orders of the models, 375
Maximum likelihood estimation, 129, 272, 273, Ornstein-Uhlenbeck process, 60
274, 419, 422, 436 Orthogonally scattered measure, 287
Mean square approximation error, 344 Oscillatory processes, 315
Median sampling, 343, 346, 347, 351, 352, 354, Outliers, 104, 120, 126
355, 356 Output error identification, 197
M-estimates, 121, 130, 134 Output error models, 192
Method of scoring, 440 Ozone data, 114
Minimax regret choice, 186
Minimum variance unbiased estimator, 306 Parametrization, 259
Min-max robustness, 119, 120, 121 Parametrization, ARMA, 262
Missing observations, 157, 460 Parametrization of the manifold, 269
Model adequacy, 440 Parametrization of state space, 264
Model building strategy, 96, 106 Partial autocorrelations, 91, 180, 430, 433
Model checking, 179 Periodic, 360
Model order identification, 228, 232 Periodic sampling, 343, 351, 360
Model parameter estimation, 232 Periodic sampling with jitter, 343
Model selection, 185, 367, 375 Periodogram analysis, 312
Model validation, 232 Permissible parameter space, 438
Moving average filter, 305, 325 Phase across broad band, 379
Moving average (MA) model, 179, 180 Phase across narrow band, 381
Multidimensional time series, 323 Polynomial AR models, 30
Multiindex, 266 Polynomial filter, 325
Multiple regression, 357, 360 Portmanteau test statistic, 184
Multiplicity of a process, 300 Praxis, 442
Multivariate autoregression, 18 Prediction error, 9, 183, 217, 218
Prediction error identification, 197
Prediction error representation, 264
Narrow band case, 376 Prediction theory, 93
Newton algorithm, 200 Predictors, 193
Newton-Raphson method, 211, 424, 442 Prewhitening, 381, 382
Noise process, 294 Processes with independent increments, 360
Non-Gaussian colored noise, 27 Product sampling designs, 358
Non-identifiability, 182 Prohorov distance, 122
Non-linear difference equation, 40 Pseudolinear regressions, 205
Non-linear least squares, 413 Psi-array, 182
Non-linear model, 47 Purely nondeterministic process, 299
Non-linear optimization, 175
Non-linear time series, 25 Quadratic mean derivative, 338, 350, 351, 353,
Non-parametric estimation techniques, 366, 377 354
Non-stationarity, 312, 444 Quadrature formula, 359

Qualitative robustness, 119, 122, 123, 125, 136 Spectral matrix function, 285
Quantile sampling, 353, 356 Spectral representation, 365
Quenouille's test statistic, 181 Spectrum of the process, 312
Speech recognition, 389-412
Random coefficient autoregressions, 416, 444 State-space models, 192, 258, 263, 264, 374
Random coefficient variation, 415 State-space representation, 157
State-variable estimation, 222
Random fields, 357, 358
Random sampling, 342, 343, 355 State-variable feedback, 224
Stationarity condition, 90
Random vibrational system, 32
Stationary covariances, 351,360
Random walk, 226
Stationary independent increments, 350
R-array, 182
Stationary invertible processes, 439
Rate of convergence, 353-358, 360
Statistical ergodic theorem, 318
Realignment, 381, 382
Stochastic differential equation, 31
Real-time identification, 202
Stochastic dynamical systems, 51, 52, 67
Recursive identification, 202, 213
Strassen characterization, 123
Recursive methods, 190
Stratified sampling, 343, 346, 348, 353, 355, 356,
Recursive least squares algorithm, 218
358
Recursive prediction error methods, 202
Strict stationarity, 1, 417
Recursive time series, 231
Strong consistency, 370, 421
Regression with stationary errors, 171
Strongly harmonizable time series, 323
Regular sampling, 343, 349
Strong robustness, 124
Relatively smooth with noise, 366
Structural identifiability, 271,272
Relevant log-likelihood, 369
Sufficient statistic, 341,345
Reproducing kernel Hilbert space, 327, 338,
Systematic sampling, 344, 355
339, 340, 341,350, 359
Residual autocovariance estimates, 136, 137,
Tensor notation, 445
138-140
Testing for coefficient evolution, 427
Robust filter cleaners, 144
Three-step procedure, 444
Robustness, 123, 124, 125, 356
Time-series influence curve, 152
Runge-Kutta method, 56 Time-variable parameter estimation, 232
Transfer function models, 104, 215, 216, 217,
Sampling designs, 337, 342, 343 259
S-array, 182 Trigonometric polynomials, 371
Score-test statistic, 428 Two-step estimator(s), 440, 444
Seasonal adjustment model, 439 Typical diagonal element, 379
Seasonal factors, 414
Second-order efficient, 440 Unequally spaced data, 157
Second-order stationary, 417 Unit roots, 434, 439
Sequential parameter estimation, 202
Shift operator, 288 Variance component models, 173
Ship rolling, 26 Varying coefficient models, 413, 414
Signal associated noise, 383 Vector ARMA models, 87, 116
Signal characteristic, 294 Vibration systems, 33
Signal process, 294 Vitali variation, 323
Signals in noise, 341, 345, 347
Signal-to-noise ratio, 342, 348, 366 Weakly harmonizable time series, 323
Simple random sampling, 343, 346, 347, 348, Weakly stationary process, 312
353, 356, 357 Weakly stationary time series, 322
Simultaneous equation estimation, 453, 461 Wide band, 371
Small sample distribution, 429 Wiener, 350, 352, 356
Smoothed random walk, 226
Spectral characteristic, 325 Yule-Walker equation, 180
Spectral function, 322, 323 Yule-Walker estimates, 395, 396, 406, 409
Handbook of Statistics
Contents of Previous Volumes

Volume 1. Analysis of Variance
Edited by P. R. Krishnaiah
1980 xviii + 1002 pp.

1. Estimation of Variance Components by C. R. Rao and J. Kleffe


2. Multivariate Analysis of Variance of Repeated Measurements by N. H.
Timm
3. Growth Curve Analysis by S. Geisser
4. Bayesian Inference in MANOVA by S. J. Press
5. Graphical Methods for Internal Comparisons in A N O V A and MANOVA
by R. Gnanadesikan
6. Monotonicity and Unbiasedness Properties of ANOVA and MANOVA
Tests by S. Das Gupta
7. Robustness of ANOVA and MANOVA Test Procedures by P. K. Ito
8. Analysis of Variance and Problems under Time Series Models by D. R.
Brillinger
9. Tests of Univariate and Multivariate Normality by K. V. Mardia
10. Transformations to Normality by G. Kaskey, B. Kolman, P. R. Krishnaiah
and L. Steinberg
11. ANOVA and MANOVA: Models for Categorical Data by V. P. Bhapkar
12. Inference and the Structural Model for ANOVA and MANOVA by D. A.
S. Fraser
13. Inference Based on Conditionally Specified ANOVA Models Incorporat-
ing Preliminary Testing by T. A. Bancroft and C.-P. Han
14. Quadratic Forms in Normal Variables by C. G. Khatri
15. Generalized Inverse of Matrices and Applications to Linear Models by S.
K. Mitra
16. Likelihood Ratio Tests for Mean Vectors and Covariance Matrices by P.
R. Krishnaiah and J. C. Lee


17. Assessing Dimensionality in Multivariate Regression by A. J. Izenman


18. Parameter Estimation in Nonlinear Regression Models by H. Bunke
19. Early History of Multiple Comparison Tests by H. L. Harter
20. Representations of Simultaneous Pairwise Comparisons by A. R. Sampson
21. Simultaneous Test Procedures for Mean Vectors and Covariance Matrices
by P. R. Krishnaiah, G. S. Mudholkar and P. Subbaiah
22. Nonparametric Simultaneous Inference for Some MANOVA Models by
P. K. Sen
23. Comparison of Some Computer Programs for Univariate and Multivariate
Analysis of Variance by R. D. Bock and D. Brandt
24. Computations of Some Multivariate Distributions by P. R. Krishnaiah
25. Inference on the Structure of Interaction in Two-Way Classification Model
by P. R. Krishnaiah and M. Yochmowitz

Volume 2. Classification, Pattern Recognition and Reduction of
Dimensionality
Edited by P. R. Krishnaiah and L. N. Kanal
1982 xxii + 903 pp.

1. Discriminant Analysis for Time Series by R. H. Shumway


2. Optimum Rules for Classification into Two Multivariate Normal Popu-
lations with the Same Covariance Matrix by S. Das Gupta
3. Large Sample Approximations and Asymptotic Expansions of Classification
Statistics by M. Siotani
4. Bayesian Discrimination by S. Geisser
5. Classification of Growth Curves by J. C. Lee
6. Nonparametric Classification by J. D. Broffitt
7. Logistic Discrimination by J. A. Anderson
8. Nearest Neighbor Methods in Discrimination by L. Devroye and T. J.
Wagner
9. The Classification and Mixture Maximum Likelihood Approaches to Clus-
ter Analysis by G. J. McLachlan
10. Graphical Techniques for Multivariate Data and for Clustering by J. M.
Chambers and B. Kleiner
11. Cluster Analysis Software by R. K. Blashfield, M. S. Aldenderfer and L. C.
Morey
12. Single-link Clustering Algorithms by F. J. Rohlf
13. Theory of Multidimensional Scaling by J. de Leeuw and W. Heiser
14. Multidimensional Scaling and its Applications by M. Wish and J. D.
Carroll
15. Intrinsic Dimensionality Extraction by K. Fukunaga

16. Structural Methods in Image Analysis and Recognition by L. N. Kanal, B.


A. Lambird and D. Lavine
17. Image Models by N. Ahuja and A. Rosenfeld
18. Image Texture Survey by R. M. Haralick
19. Applications of Stochastic Languages by K. S. Fu
20. A Unifying Viewpoint on Pattern Recognition by J. C. Simon, E. Backer
and J. Sallentin
21. Logical Functions in the Problems of Empirical Prediction by G. S. Lbov
22. Inference and Data Tables and Missing Values by N. G. Zagoruiko and V.
N. Yolkina
23. Recognition of Electrocardiographic Patterns by J. H. van Bemmel
24. Waveform Parsing Systems by G. C. Stockman
25. Continuous Speech Recognition: Statistical Methods by F. Jelinek, R. L.
Mercer and L. R. Bahl
26. Applications of Pattern Recognition in Radar by A. A. Grometstein and
W. H. Schoendorf
27. White Blood Cell Recognition by E. S. Gelsema and G. H. Landweerd
28. Pattern Recognition Techniques for Remote Sensing Applications by P. H.
Swain
29. Optical Character Recognition--Theory and Practice by G, Nagy
30. Computer and Statistical Considerations for Oil Spill Identification by Y.
T. Chien and T. J. Killeen
31. Pattern Recognition in Chemistry by B, R. Kowalski and S. Wold
32. Covariance Matrix Representation and Object-Predicate Symmetry by T.
Kaminuma, S. Tomita and S. Watanabe
33. Multivariate Morphometrics by R. A. Reyment
34. Multivariate Analysis with Latent Variables by P. M. Bentler and D. G.
Weeks
35. Use of Distance Measures, Information Measures and Error Bounds in
Feature Evaluation by M. Ben-Bassat
36. Topics in Measurement Selection by J. M. Van Campenhout
37. Selection of Variables Under Univariate Regression Models by P. R.
Krishnaiah
38. On the Selection of Variables Under Regression Models Using Krish-
naiah's Finite Intersection Tests by J. L. Schmidhammer
39. Dimensionality and Sample Size Considerations in Pattern Recognition
Practice by A. K. Jain and B. Chandrasekaran
40. Selecting Variables in Discriminant Analysis for Improving upon Classical
Procedures by W. Schaafsma
41. Selection of Variables in Discriminant Analysis by P. R. Krishnaiah

Volume 3. Time Series in the Frequency Domain
Edited by D. R. Brillinger and P. R. Krishnaiah
1983 xiv + 485 pp.

1. Wiener Filtering (with emphasis on frequency-domain approaches) by R. J.
Bhansali and D. Karavellas
2. The Finite Fourier Transform of a Stationary Process by D. R. Brillinger
3. Seasonal and Calendar Adjustment by W. S. Cleveland
4. Optimal Inference in the Frequency Domain by R. B. Davies
5. Applications of Spectral Analysis in Econometrics by C. W. J. Granger and R.
Engle
6. Signal Estimation by E. J. Hannan
7. Complex Demodulation: Some Theory and Applications by T. Hasan
8. Estimating the Gain of a Linear Filter from Noisy Data by M. J. Hinich
9. A Spectral Analysis Primer by L. H. Koopmans
10. Robust-Resistant Spectral Analysis by R. D. Martin
11. Autoregressive Spectral Estimation by E. Parzen
12. Threshold Autoregression and Some Frequency-Domain Characteristics by
J. Pemberton and H. Tong
13. The Frequency-Domain Approach to the Analysis of Closed-Loop Systems
by M. B. Priestley
14. The Bispectral Analysis of Nonlinear Stationary Time Series with Reference
to Bilinear Time-Series Models by T. Subba Rao
15. Frequency-Domain Analysis of Multidimensional Time-Series Data by E. A.
Robinson
16. Review of Various Approaches to Power Spectrum Estimation by P. M.
Robinson
17. Cumulants and Cumulant Spectra by M. Rosenblatt
18. Replicated Time-Series Regression: An Approach to Signal Estimation and
Detection by R. H. Shumway
19. Computer Programming of Spectrum Estimation by T. Thrall
20. Likelihood Ratio Tests on Covariance Matrices and Mean Vectors of
Complex Multivariate Normal Populations and their Applications in Time
Series by P. R. Krishnaiah, J. C. Lee and T. C. Chang

Volume 4. Nonparametric Methods
Edited by P. R. Krishnaiah and P. K. Sen
1984 xx + 968 pp.

1. Randomization Procedures by C. B. Bell and P. K. Sen
2. Univariate and Multivariate Multisample Location and Scale Tests by V. P.
Bhapkar
3. Hypothesis of Symmetry by M. Hušková
4. Measures of Dependence by K. Joag-Dev
5. Tests of Randomness against Trend or Serial Correlations by G. K.
Bhattacharyya
6. Combination of Independent Tests by J. L. Folks
7. Combinatorics by L. Takács
8. Rank Statistics and Limit Theorems by M. Ghosh
9. Asymptotic Comparison of Tests - A Review by K. Singh
10. Nonparametric Methods in Two-Way Layouts by D. Quade
11. Rank Tests in Linear Models by J. N. Adichie
12. On the Use of Rank Tests and Estimates in the Linear Model by J. C.
Aubuchon and T. P. Hettmansperger
13. Nonparametric Preliminary Test Inference by A. K. Md. E. Saleh and P. K.
Sen
14. Paired Comparisons: Some Basic Procedures and Examples by R. A.
Bradley
15. Restricted Alternatives by S. K. Chatterjee
16. Adaptive Methods by M. Hušková
17. Order Statistics by J. Galambos
18. Induced Order Statistics: Theory and Applications by P. K. Bhattacharya
19. Empirical Distribution Function by E. Csáki
20. Invariance Principles for Empirical Processes by M. Csörgő
21. M-, L- and R-estimators by J. Jurečková
22. Nonparametric Sequential Estimation by P. K. Sen
23. Stochastic Approximation by V. Dupač
24. Density Estimation by P. Révész
25. Censored Data by A. P. Basu
26. Tests for Exponentiality by K. A. Doksum and B. S. Yandell
27. Nonparametric Concepts and Methods in Reliability by M. Hollander and
F. Proschan
28. Sequential Nonparametric Tests by U. Müller-Funk
29. Nonparametric Procedures for Some Miscellaneous Problems by P. K. Sen
30. Minimum Distance Procedures by R. Beran
31. Nonparametric Methods in Directional Data Analysis by S. R. Jam-
malamadaka
32. Application of Nonparametric Statistics to Cancer Data by H. S. Wieand
33. Nonparametric Frequentist Proposals for Monitoring Comparative Survival
Studies by M. Gail
34. Meteorological Applications of Permutation Techniques based on Distance
Functions by P. W. Mielke, Jr.
35. Categorical Data Problems Using Information Theoretic Approach by S.
Kullback and J. C. Keegel
36. Tables for Order Statistics by P. R. Krishnaiah and P. K. Sen
37. Selected Tables for Nonparametric Statistics by P. K. Sen and P. R.
Krishnaiah
