The theory and practice of the analysis of time series has followed two lines
almost since its inception. One of these proceeds from the Fourier trans-
formation of the data and the other from a parametric representation of the
temporal relationships. Of course, the two lines are interrelated. The frequency
analysis of data was surveyed in Volume 3 of the present Handbook of Statistics
series, subtitled Time Series in the Frequency Domain, edited by D. R.
Brillinger and P. R. Krishnaiah. Time domain methods are dealt with in this
volume. The methods are old, going back at least to the ideas of Prony in the
eighteenth century, and owe a great deal to the work of Yule early this century.
Several different techniques for classes of nonstationary processes have been
developed by various analysts. By the very nature of the subject in these cases,
the work tends to be either predominantly data analysis oriented with scant
justifications, or mathematically oriented with inevitably advanced arguments.
This volume contains descriptions of both approaches, strengthening the
former and minimizing the latter, while still presenting the state of the art in
the subject. A brief indication of the work included is as follows.
One of the successful parametric models is the classical autoregressive
scheme, going back to the pioneering work of G. U. Yule, early in this century.
The model is a difference equation with constant coefficients, and much of the
classical work is done if the roots of its characteristic equation are interior to
the unit circle. If the roots are of unit modulus, the analysis presents many
difficulties. The advances made in recent years in this area are described in W.
Fuller's article. An important development in the time domain area is the work
of R. Kalman. It led to the emphasis on a formalization of rational transfer
function systems as defined by an underlying state vector generated in a
Markovian manner and observed subject to noise. This representation is
connected with a rich structure theory whose understanding is central in the
subject. It is surveyed in the article by M. Deistler. The structure and analysis of
several classes of nonstationary time series that are not of autoregressive type
but for which the ideas of Fourier analysis extend is given in the article by M.
M. Rao; and the filtering and smoothing problems are discussed by D. K.
Chang. Related results on what may be termed "asymptotically stationary" and
allied time series have been surveyed in C. S. K. Bhagavan's paper.
The papers by L. Ljung, P. Young and G. C. Tiao relate to the estimation
problems in dynamical systems modelling. Here Young's paper deals with
the on-line (real time) calculations. One of the uses of these models has been to
analyze the consequences of an intervention (such as the introduction of
exhaust emission laws) and another to consider the outlier detection problems.
These are discussed by Tiao and T. Ozaki. Though rational transfer function
models are parametric, it is seldom the case that the model set contains the
truth and the problem may better be viewed as one of selecting a structure
from an infinite set in some asymptotically optimal manner. This point of view
is explored by R. Shibata. Though least squares techniques, applied to the
prediction errors, have dominated, there is a need to modify these to obtain
estimators less influenced by discrepant observations. This is treated by Tiao
and, in an extensive discussion, by R. D. Martin and V. J. Yohai. The model
selection and unequally spaced data are natural problems in this area confront-
ing the experimenter, and these are discussed by R. H. Jones. Since the time
points may sometimes be under control of the experimenter, their optimal
choice must be considered. This problem is treated by S. Cambanis. The
modelling in the papers referred to above has been essentially linear. Ozaki
presents an approach to the difficult problem of nonlinear modelling.
The autoregressive models may have time varying parameters, and this is
considered by D. F. Nicholls and A. R. Pagan. Their paper has special
reference to econometric data as does also the paper by H. Theil and D. G.
Fiebig who treat the problem where the regressor vectors in a multivariate
system may be of a dimension higher than the number of time points for
observation. The final two papers on applications by M. A. Cameron, P. J.
Thomson and P. de Souza complement the areas covered by the preceding
ones. These are designed to show two special applications, namely in signal
attenuation estimation and speech recognition.
Thus several aspects of time domain analysis and its current trends are
described in the different chapters of this volume. They will be of interest
not only to research workers in the area of time series, but also to data
analysts who use these techniques in their work.
We wish to express our sincere appreciation to the authors for their excellent
cooperation. We also thank the North-Holland Publishing Company for their
cooperation.
E. J. Hannan
P. R. Krishnaiah
M. M. Rao
Nonstationary autoregressive time series
Wayne A. Fuller
1. Introduction
The time series is strictly stationary if

F_{t_1, t_2, ..., t_n}(y_{t_1}, y_{t_2}, ..., y_{t_n}) = F_{t_1+h, t_2+h, ..., t_n+h}(y_{t_1}, y_{t_2}, ..., y_{t_n})

for all finite sets of indices (t_1, t_2, ..., t_n) and all h,
and
E{(Y_t - mu)(Y_{t+h} - mu)} = gamma(h),   t = 1, 2, ...;  h = 0, 1, ...,
omega_0 = 1,   omega_i = 0,   i = -1, -2, ...,
and the b_j are such that the initial conditions are satisfied. The exponent k_j is
zero if the root m_j is a distinct root. A root with multiplicity r has r coefficients
with k_j = 0, 1, ..., r - 1.
Using the omega_i, the time series Y_t can be written as

Y_t = Sum_{i=0}^{t-1} omega_i [g(t-i) + e_{t-i}] + Sum_{j=1}^{p} b_j t^{k_j} m_j^t .

The mean of Y_t is

E{Y_t} = Sum_{i=0}^{t-1} omega_i g(t-i) + Sum_{j=1}^{p} b_j t^{k_j} m_j^t .
If the roots of (1.4) are less than one in absolute value, then omega_i goes to zero
as i goes to infinity. One common model is that in which g(t) == a_0. Assume that
(Y_0, Y_{-1}, ..., Y_{-p+1}) is a vector of random variables with common mean

a_0 (1 - Sum_{i=1}^{p} alpha_i)^{-1} ,

common variance

sigma^2 Sum_{i=0}^{infinity} omega_i^2   (1.9)

and covariances
2. The first-order model
The least squares estimator of alpha_1 is

alpha-hat_1 = (Sum_{t=1}^{n} Y_{t-1}^2)^{-1} Sum_{t=1}^{n} Y_{t-1} Y_t ,   (2.2)

so that

alpha-hat_1 - alpha_1 = (Sum_{t=1}^{n} Y_{t-1}^2)^{-1} Sum_{t=1}^{n} Y_{t-1} e_t .   (2.3)
Under the assumption that the e_t are uncorrelated, the expected value of the
numerator is zero. The limiting behavior of the estimator is determined by the
joint behavior of the sample moments in the numerator and denominator of
(2.3). The limiting distributions of alpha-hat_1 are characterized in Table 2.1.
For a time series with |alpha_1| < 1, the limiting distribution of n^{1/2}(alpha-hat_1 - alpha_1) is
normal under quite weak assumptions. The first proof of the limiting normal
distribution was given by Mann and Wald (1943). There have been a number of
extensions since that time. Because omega_i -> 0 as i -> infinity, the initial value Y_0, for any
real Y_0, will not influence the limiting distribution, though the influence for
small samples could be large. The variance of the limiting distribution of
n^{1/2}(alpha-hat_1 - alpha_1) is 1 - alpha_1^2.
Table 2.1
Limiting properties of the least squares estimator of alpha_1

alpha_1        a_0        Y_0        Distribution of e_t   Standardizing factor           Limiting distribution
|alpha_1| < 1  any real   any real   IID(0, sigma^2)       n^{1/2}(1 - alpha_1^2)^{-1/2}  N(0, 1)
|alpha_1| = 1  a_0 /= 0   any real   IID(0, sigma^2)       n^{3/2}                        Normal
|alpha_1| = 1  a_0 = 0    any real   IID(0, sigma^2)       n                              Tabulated
|alpha_1| > 1  a_0 = 0    Y_0 = 0    NID(0, sigma^2)       (alpha_1^2 - 1)^{-1} alpha_1^n Cauchy
|alpha_1| > 1  a_0 = 0    Y_0 = 0    IID(0, sigma^2)       (alpha_1^2 - 1)^{-1} alpha_1^n Depends on distribution of e_t
|alpha_1| > 1  a_0 /= 0   Y_0 /= 0   NID(0, sigma^2)       (alpha_1^2 - 1)^{-1} alpha_1^n N(0, 1)/N(xi, 1)
The result of Table 2.1 is stated for independently and identically distributed
random variables, but the limiting distribution of n^{1/2}(alpha-hat_1 - alpha_1) is also normal for
e_t that are martingale differences. For example, see Hannan and Heyde (1972)
and Crowder (1980).
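The normal limiting behaviour for |alpha_1| < 1 can be illustrated by a small simulation; this is only a sketch, with arbitrary choices of alpha_1, sample size and replication count (none of which come from the text).

```python
import random
import statistics

def simulate_alpha_hat(alpha=0.5, n=2000, seed=0):
    """Least squares estimate of alpha_1 in Y_t = alpha_1 Y_{t-1} + e_t."""
    rng = random.Random(seed)
    y_prev, num, den = 0.0, 0.0, 0.0
    for _ in range(n):
        y = alpha * y_prev + rng.gauss(0.0, 1.0)
        num += y_prev * y          # numerator  sum Y_{t-1} Y_t
        den += y_prev * y_prev     # denominator sum Y_{t-1}^2
        y_prev = y
    return num / den

# Standardized estimates n^{1/2}(1 - alpha^2)^{-1/2}(alpha_hat - alpha)
# should be approximately N(0, 1) in large samples.
alpha, n = 0.5, 2000
z = [(n ** 0.5) * (1 - alpha ** 2) ** -0.5 * (simulate_alpha_hat(alpha, n, seed) - alpha)
     for seed in range(200)]
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2))
```

The sample mean of the standardized estimates should be near 0 and their standard deviation near 1.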
If |alpha_1| = 1 and a_0 = 0, there is no simple closed form expression for the
limiting distribution of n(alpha-hat_1 - alpha_1). The limiting distribution of n(alpha-hat_1 - alpha_1) is that
of a function of three random variables,

n(alpha-hat_1 - alpha_1) ->_L [2(Gamma - W^2)]^{-1} [(T^2 - 1) - 2TW] ,   (2.4)

where (Gamma, T, W) are weighted sums of squares and products of independent
N(0, 1) random variables Z_i, with weights

gamma_i = (-1)^{i+1} 2[(2i - 1) pi]^{-1} .
is nearly a constant multiple of n. This remains true for |alpha_1| < 1 and any fixed
real initial conditions. If |alpha_1| = 1 and a_0 = 0,

E{Y_t^2} = t sigma^2

and

E{Sum_{t=1}^{n} (Y_t - Y-bar)^2} = 6^{-1}(n^2 - 1) sigma^2 .
Y_t = Y_0 + a_0 t + Sum_{j=1}^{t} e_j
Y_t = Sum_{i=0}^{t-1} alpha_1^i e_{t-i} = alpha_1^t Sum_{j=1}^{t} alpha_1^{-j} e_j ,

so that

alpha_1^{-t} Y_t - X = - Sum_{j=t+1}^{infinity} alpha_1^{-j} e_j ,

where

X = Sum_{j=1}^{infinity} alpha_1^{-j} e_j .
The estimator of alpha_1 computed with the mean adjusted is

alpha-tilde_1 = [Sum_{t=1}^{n} (Y_{t-1} - Y-bar)^2]^{-1} Sum_{t=1}^{n} (Y_{t-1} - Y-bar)(Y_t - Y-bar) ,   (2.6)

where

Y-bar = (n + 1)^{-1} Sum_{t=0}^{n} Y_t .

For |alpha_1| = 1 and a_0 = 0,

n(alpha-tilde_1 - alpha_1) ->_L -2 [Sum_{i=1}^{infinity} Z_i^{-2} u_i^2]^{-1} ,   (2.7)

where the u_i are independent N(0, 1) random variables and Z_i is the ith positive
zero of the function t^{-2} sin t - t^{-1} cos t. The limiting distribution was obtained
in a different context by Anderson and Darling (1952) and is discussed by
MacNeill (1978). The distribution defined in (2.7) is much easier to tabulate
than that of alpha-hat_1, where alpha-hat_1 is defined in (2.2), because the characteristic
function for (2.7) may be obtained and inverted numerically. Statistics closely
related to alpha-hat_1 have been discussed by Durbin (1973), Sargan and Bhargava
(1983) and Bhargava (1983).
Lai and Siegmund (1983) consider a sampling scheme in which observations
are collected until

Sum_{t=1}^{n_c} Y_{t-1}^2 > c sigma^2 ,

where c is a specified constant and n_c is the smallest number such that the
inequality holds. For this sampling scheme and the model with a_0 = 0 known,
they show that

(Sum_{t=1}^{n_c} Y_{t-1}^2)^{1/2} (alpha-hat_1 - alpha_1) ->_L N(0, sigma^2) .   (2.8)

Therefore, the distribution of alpha-hat_0 is intimately related to that of alpha-hat_1 - alpha_1. For the
model with |alpha_1| < 1, the limiting distribution of n^{1/2}(alpha-hat_0 - alpha_0) is normal. For other
situations, the limiting distribution is more complicated.
The fact that the distribution of alpha-hat_1 does not depend on sigma^2 permits one to use
the distribution of Table 2.1 for inference about alpha_1. Another statistic that is
natural to use for inference purposes is the Studentized statistic based on

sigma-hat^2 = (n - 2)^{-1} Sum_{t=1}^{n} [Y_t - Y-bar_{(1)} - alpha-hat_1 (Y_{t-1} - Y-bar_{(0)})]^2 .

The limiting distribution of the statistic tau-hat also depends upon the true
parameters of the model. The types of distributions are tabulated in Table 2.2.
For those situations where the limiting distribution of the standardized least
squares estimator alpha-hat_1 is normal, the limiting distribution of the tau-hat-statistic is
N(0, 1). The distribution of tau-hat for |alpha_1| = 1 is a ratio of quadratic forms and has
been tabulated by Dickey (1976). See Fuller (1976). One of the more interesting
results of Table 2.2 is the fact that the limiting distribution of the tau-hat-statistic
is N(0, 1) for |alpha_1| > 1. This result emphasizes the unique place of |alpha_1| = 1. The
tau-hat-statistic for estimator (2.6) has a limiting distribution that is a simple trans-
formation of the limiting distribution of alpha-tilde_1.
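A minimal sketch of the Studentized statistic for testing H_0: alpha_1 = 1 in the regression of Y_t on (1, Y_{t-1}); the data here are simulated, and in practice the statistic must be referred to the tabulated Dickey-Fuller distribution, not to the normal table.

```python
import random

def tau_statistic(y):
    """OLS regression of Y_t on (1, Y_{t-1}); returns the t-type statistic
    (alpha_hat_1 - 1) / s.e.(alpha_hat_1) used to test H0: alpha_1 = 1."""
    x = y[:-1]          # Y_{t-1}
    z = y[1:]           # Y_t
    n = len(z)
    mx = sum(x) / n
    mz = sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    a1 = sxz / sxx
    a0 = mz - a1 * mx
    rss = sum((b - a0 - a1 * a) ** 2 for a, b in zip(x, z))
    s2 = rss / (n - 2)
    se = (s2 / sxx) ** 0.5
    return (a1 - 1.0) / se

rng = random.Random(1)
walk = [0.0]
for _ in range(500):                 # random walk: the true alpha_1 is 1
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))
tau = tau_statistic(walk)
print(tau)
```

Under the unit root hypothesis this statistic follows the nonstandard distribution tabulated by Dickey, not N(0, 1).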
Table 2.2
Limiting properties of the least squares 't-statistic'
(columns: parameters alpha_1 and a_0, initial value Y_0, distribution of e_t, limiting distribution)

The properties of predictors for the first-order autoregressive process are
given in Table 2.3. Let Y-hat_{n+j} denote the predictor constructed with known
parameters. If the parameters are known and if the e_t are independent, the best
predictor of Y_{n+j} given (Y_0, Y_1, ..., Y_n) is the conditional expectation, and the
prediction error is

Y_{n+j} - Y-hat_{n+j} = e_{n+j} + alpha_1 e_{n+j-1} + ... + alpha_1^{j-1} e_{n+1} .
Table 2.3
Properties of predictors for the first-order autoregressive process

To order O(n^{-1}), the mean square error of the predictor s periods ahead
depends on sigma^2, on the factor alpha_1^{2(s-1)}, and on terms in (alpha_1^2 - 1)^{-1}.
Perhaps the most useful result of Table 2.3 is the fact that the ordinary
regression 't-statistic' has N(0, 1) as its limiting distribution for normal
processes for all values of alpha_1. Therefore, one can use an ordinary regression
program to construct predictions and to construct confidence intervals for the
predictions.
EXAMPLE 2.1. To illustrate some of the methods of this section, we use the
example of Dickey and Fuller (1979) based upon the study of Gould and
Nelson (1974). Gould and Nelson used the data from Friedman and Schwartz
(1963) to reach the conclusion that the logarithm of the velocity of money had a
behavior consistent with the hypothesis of a random walk. Two models are
fitted to the data. The first is

Y_t = alpha_0 + alpha_1 Y_{t-1} + e_t ,   (2.12)

with fitted equation

Y-hat_t = 0.0164 + 0.9663 Y_{t-1} ,   sigma-hat^2 = 0.0050 ,
         (0.0175)  (0.0195)

where the numbers in parentheses are the standard errors computed by the
ordinary regression formulas. There are 92 observations and 91 observations
are included in the regression. If the statistics for the first model are compared
to the tabular values in Table 8.5.1 and Table 8.5.2 of Fuller (1976), the
hypothesis that alpha_1 = 1 is accepted at the 0.10 level.
If the model is that of (2.12), one might wish to test the hypothesis
H_0: (alpha_0, alpha_1) = (0, 1). Dickey and Fuller (1981) give tables for the null distribution
of the likelihood ratio test, which takes the form of the least squares 'F-test'.
For the model (2.12), the test statistic is

[(0.47253 - 0.44664)/2] / [(0.44664)/89] = 2.58 ,

where 0.47253 = Sum_{t=2}^{n} (Y_t - Y_{t-1})^2 and 0.44664 is the residual sum of
squares from the fitted model.
The standard deviations of the three prediction errors computed with sigma-hat^2 =
0.0050 are (sigma-hat, 2^{1/2} sigma-hat, 3^{1/2} sigma-hat) = (0.0707, 0.1000, 0.1225). If the parameters are
unknown and estimated by ordinary least squares, the predictions for the next
three periods are computed from the fitted equation.
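The arithmetic quoted in the example can be verified directly:

```python
# Likelihood-ratio-type 'F-test' statistic of H0: (alpha_0, alpha_1) = (0, 1),
# using the restricted and unrestricted sums of squares quoted in the text.
restricted, unrestricted, q, df = 0.47253, 0.44664, 2, 89
phi = ((restricted - unrestricted) / q) / (unrestricted / df)
print(round(phi, 2))  # 2.58

# Standard deviations of the 1-, 2- and 3-step prediction errors:
# for a random walk the j-step error variance is j * sigma^2.
sigma2 = 0.0050
sds = [round((j * sigma2) ** 0.5, 4) for j in (1, 2, 3)]
print(sds)  # [0.0707, 0.1, 0.1225]
```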
Mann and Wald (1943) proved that the vector of least squares estimates for
the pth-order stationary time series converges in distribution to a vector
normal random variable. The following theorem is a generalization of their
result. See also Anderson (1959) and Crowder (1980).
Let the roots of the characteristic equation

m^p - Sum_{i=1}^{p} alpha_i m^{p-i} = 0   (3.2)

be less than one in absolute value. Let (Y_0, Y_{-1}, Y_{-2}, ..., Y_{-p+1}) be either fixed
numbers or a vector of random variables with finite variances. Let {e_t} be a
sequence of independent identically distributed (0, sigma^2) random variables in-
dependent of (Y_0, Y_{-1}, ..., Y_{-p+1}). Then, as n -> infinity,

n^{1/2}(alpha-hat - alpha) ->_L N(0, V) ,

where

alpha-hat = M_{xx}^{-1} M_{xy} ,   M_{xx} = n^{-1} Sum_{t=1}^{n} X_t' X_t ,   M_{xy} = n^{-1} Sum_{t=1}^{n} X_t' Y_t ,

and X_t = (Y_{t-1}, Y_{t-2}, ..., Y_{t-p}). Let T_n be a matrix such that

T_n' Sum_{t=1}^{n} X_t' X_t T_n = I .   (3.4)

Then

sigma-hat^{-1} T_n^{-1} (alpha-hat - alpha) ->_L N(0, I) .
On the basis of these results, the usual regression statistics can be used, in
large samples, for inference purposes for stationary processes.
The extension of the results to include mean functions g(t) that are not
constant functions of time requires some care. This is particularly true if the
conclusion of the theorem is to cover g(t) of practical interest, such as g(t) that are
polynomial time trends, because sums such as Sum_{t=1}^{n} t and Sum_{t=1}^{n} t^2 grow at
different rates and the individual coefficients require different standardizations.
Consider the model

Y_t = Sum_{i=1}^{q} beta_i psi_{ti} + Sum_{j=1}^{p} alpha_j Y_{t-j} + e_t   (3.5)
    = X_t theta + e_t ,

where

X_t = (psi_{t1}, psi_{t2}, ..., psi_{tq}, Y_{t-1}, Y_{t-2}, ..., Y_{t-p}) .
The weights v_j are defined by

v_j - Sum_{i=1}^{p} alpha_i v_{j-i} = 0 ,

with initial conditions v_0 = 1 and v_j = 0 for j < 0. The fixed part of Y_t is S_t, and
u_t is the random part. Let

n^{-1} Sum_{t=1}^{n} (psi_t, S_t)' (psi_t, S_t)   (3.8)

denote the matrix of mean squares and products of the fixed components.
Our theorem for the limiting distribution of the least squares estimator
requires the regularity conditions (3.9) and (3.10). These are conditions of the
Grenander type: they require that the contribution of any single observation to
the sums of squares of the regressors becomes negligible as n -> infinity, for the trend
regressors (3.9) and for the components i = q+1, q+2, ..., q+p associated with
the autoregressive part (3.10).
THEOREM 3.2. Let model (3.5) hold, let the roots of the characteristic equation
(3.2) be less than one in absolute value, and let {e_t} be a sequence of IID(0, sigma^2)
random variables. Let the {psi_t} satisfy conditions (3.9) and (3.10). Let

Y_t = Z_{tn} theta_n + e_t ,   (3.11)

where

Z_{tn} = H_n X_t ,
theta_n = H_n^{-1} (beta_1, beta_2, ..., beta_q, alpha_1, alpha_2, ..., alpha_p)' ,

theta-hat_n = A_{zz,n}^{-1} Sum_{t=1}^{n} Z_{tn}' Y_t ,   sigma-hat^2 = (n - q - p)^{-1} Sum_{t=1}^{n} (Y_t - Z_{tn} theta-hat_n)^2 ,   (3.12)

where

A_{zz,n} = Sum_{t=1}^{n} Z_{tn}' Z_{tn} .

Then

sigma-hat^{-1} A_{zz,n}^{1/2} (theta-hat_n - theta_n) ->_L N(0, I) .
It follows from this theorem that, in large samples, the usual regression
statistics can be used for inference.
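As the theorems suggest, a stationary AR(p) can be fitted with any ordinary least squares routine. The following is a self-contained sketch (simulated data, no intercept, and illustrative parameter values that are not from the text):

```python
import random

def fit_ar_ols(y, p):
    """Fit Y_t = a_1 Y_{t-1} + ... + a_p Y_{t-p} + e_t by ordinary least
    squares, solving the normal equations by Gauss-Jordan elimination."""
    rows = [[y[t - i] for i in range(1, p + 1)] + [y[t]] for t in range(p, len(y))]
    # augmented normal equations [X'X | X'y]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p + 1)] for i in range(p)]
    for i in range(p):
        piv = m[i][i]
        m[i] = [v / piv for v in m[i]]
        for k in range(p):
            if k != i:
                f = m[k][i]
                m[k] = [v - f * w for v, w in zip(m[k], m[i])]
    return [m[i][p] for i in range(p)]

rng = random.Random(2)
y = [0.0, 0.0]
for _ in range(5000):                # AR(2) with alpha = (0.6, -0.2)
    y.append(0.6 * y[-1] - 0.2 * y[-2] + rng.gauss(0.0, 1.0))
a = fit_ar_ols(y, 2)
print([round(v, 2) for v in a])
```

With 5000 observations the estimates should be close to the true coefficients (0.6, -0.2), and by Theorem 3.1 the usual regression standard errors are asymptotically valid.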
A special model of the form (1.1) of considerable interest is the model in
which the error is autoregressive,

u_t = Sum_{i=1}^{q} alpha_i u_{t-i} + e_t .   (3.14)
If the restrictions on the coefficients are ignored, the model (3.16) is of the
same form as model (3.5). Because of the following theorem, the results of
Theorem 3.2 are applicable to the nonlinear least squares estimator of the
parameters of model (3.15).
THEOREM 3.3. Under suitable regularity conditions,

n^{1/2} (delta-hat - delta) ->_L N(0, V_delta) ,

where the estimator is computed subject to restrictions

f_i(delta) = 0 ,   i = 1, 2, ..., r ,

where the functions f_i(delta) are continuous and have continuous first and second
derivatives. Let delta-hat be the value of delta that minimizes Q_n(delta) subject to the restrictions,
where the vector of derivatives is

df_i/d delta = (df_i(delta)/d delta_1, df_i(delta)/d delta_2, ..., df_i(delta)/d delta_p) .
THEOREM 3.4. Let model (3.1) hold and let the roots of (3.2) be |m_1| > |m_2| >=
|m_3| >= ... >= |m_p|, where |m_1| > 1 and |m_i| < 1 for i = 2, 3, ..., p. Let
(Y_0, Y_{-1}, Y_{-2}, ..., Y_{-p+1}) be fixed and let {e_t} be a sequence of independent
identically distributed (0, sigma^2) random variables. Let

Y_t = X_t beta + e_t ,

where the transformed regressors are defined through the factorization

m^{p-1} + c_1 m^{p-2} + ... + c_{p-1} = Prod_{j=2}^{p} (m - m_j) .

Let

tau-hat_i = (a_{ii} s^2)^{-1/2} (theta-hat_i - theta_i) .

Then the limiting distribution of tau-hat_2 is that of the tau-hat-statistic of Table 2.2 and

tau-hat_i ->_L N(0, 1)   for i = 3, 4, ..., p + 1 .
Theorem 3.4 defines a procedure by which one can test hypotheses about the
largest root in a pth-order autoregressive process where p - 1 roots are less
than one in absolute value.
Fountis and Dickey (1983) have extended the results for a pth-order process
with a single root of unit absolute value to the multivariate process with a
single root of unit absolute value. Let the multivariate process Y_t satisfy a
vector autoregression with one root of

|A - lambda I| = 0

equal to one, and let the remaining roots be less than one in absolute value.
Let the least squares estimator of A, based on a sample of n observations, be
given by

A-hat = (Sum_{t=1}^{n} Y_{t-1}' Y_{t-1})^{-1} Sum_{t=1}^{n} Y_{t-1}' Y_t ,

and let the roots of

|A-hat - lambda-hat I| = 0

be ordered |lambda-hat_1| >= |lambda-hat_2| >= ... >= |lambda-hat_k|. Fountis and Dickey (1983) show that

n(lambda-hat_1 - lambda_1) ->_L lambda_1 (2 Gamma)^{-1} (T^2 - 1) ,

where Gamma and T are defined in (2.4).
Fuller, Hasza and Goebel (1981) have extended Theorem 3.3 to the model
with one root of (3.2) greater than or equal to one in absolute value and the
remaining roots less than one in absolute value. Also, the prediction results of
Section 2 extend to the more complicated models of this section.
EXAMPLE 3.1. Engle and Kraft (1981) analyzed the logarithm of the Implicit
Price Deflator for Gross National Product as an autoregressive time series. We
simplify the model of Engle and Kraft and use data for the period 1955 first
quarter through 1980 third quarter. For the initial part of our analysis we
assume that the process is a third-order autoregressive process. The least
squares autoregressive equation was estimated, and the residual mean square
error is sigma-hat^2 = 1.1173(10^{-5}). There are a total of 103 observations and 100
observations are used in the regression. The largest root of the estimated
characteristic equation is 1.0178. Because the largest root is greater than one,
the estimated model is explosive. We first test the hypothesis that the largest
root is one. This is done by regressing the first differences on Y_{t-1} and the
lagged first differences, with standard errors obtained from the ordinary least
squares regression program. By Theorem 3.4 the statistic has the distribution
tabulated by Dickey when the largest root is one. By Table 8.5.2 of Fuller
(1976) the tau-hat-statistic will exceed 0.63 about one percent of the time.
Therefore, the hypothesis of a unit root is easily rejected. Because of the large
positive autocorrelation of series such as the price deflator, numerical problems
are often reduced by fitting the model in the form (3.18) instead of in the
form (3.17).
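The difference-form regression described in the example can be sketched as follows; the series here is a simulated random walk, not the deflator data, and the helper solves the normal equations directly.

```python
import random

def diff_regression_coef(y, lags=2):
    """OLS regression of the first difference dY_t on Y_{t-1} and the lagged
    differences dY_{t-1}, ..., dY_{t-lags}; returns the coefficient of Y_{t-1}.
    Under a unit root this coefficient is near zero."""
    d = [b - a for a, b in zip(y, y[1:])]          # d[t] = y[t+1] - y[t]
    rows, target = [], []
    for t in range(lags, len(d)):
        rows.append([y[t]] + [d[t - j] for j in range(1, lags + 1)])
        target.append(d[t])
    k = lags + 1
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)] +
         [sum(r[i] * v for r, v in zip(rows, target))] for i in range(k)]
    for i in range(k):                             # Gauss-Jordan elimination
        piv = m[i][i]
        m[i] = [v / piv for v in m[i]]
        for q in range(k):
            if q != i:
                f = m[q][i]
                m[q] = [v - f * w for v, w in zip(m[q], m[i])]
    return m[0][k]

rng = random.Random(3)
y = [0.0]
for _ in range(3000):                              # random walk has a unit root
    y.append(y[-1] + rng.gauss(0.0, 1.0))
coef = diff_regression_coef(y)
print(coef)
```

For the simulated random walk the coefficient of the lagged level is close to zero; the associated t-type statistic would be referred to the Dickey tables.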
To set confidence limits for the largest root, we again use Theorem 3.4. Let
the coefficient of Y_{t-1} in the regression of Y_t - m_1 Y_{t-1} on Y_{t-1}, Y_{t-1} - m_1 Y_{t-2}
and Y_{t-2} - m_1 Y_{t-3} be denoted by beta-hat. If m_1 > 1 is the largest root of the
characteristic equation and if all other roots are less than one in absolute value,
the limiting distribution of

tau-hat = (s.e. beta-hat)^{-1} beta-hat ,

where s.e. beta-hat is the ordinary least squares standard error, is that of a N(0, 1)
random variable. Therefore, we can define a confidence interval for m_1 to be
those m_1 such that the absolute value of the calculated statistic tau-hat is less than
the tabular value of Student's t for the desired confidence level. For our data
the fitted regression contains the term

- 0.288 (Y_{t-2} - 1.0091 Y_{t-3}) ,
 (0.099)

and we easily accept the hypothesis that the process is third order. The argument
extends to the use of an F-test with two degrees of freedom to test the
hypothesis of third order against the alternative of a fifth-order process, etc.
References
Evans, G. B. A. and Savin, N. E. (1981b). Testing for unit roots 1. Econometrica 49, 753-777.
Findley, D. F. (1980). Large sample behavior of the S-array of seasonally nonstationary ARMA
series. In: O. D. Anderson and M. R. Perryman, eds., Time Series Analysis, 163-170. North-
Holland, Amsterdam.
Fountis, N. G. (1983). Testing for unit roots in multivariate autoregressions. Unpublished Ph.D.
thesis. North Carolina State University, Raleigh, NC.
Fountis, N. G. and Dickey, D. A. (1983). Testing for a unit root nonstationarity in multivariate
autoregressive time series. Paper presented at Statistics: An Appraisal, International Conference
to Mark the 50th Anniversary of the Iowa State University Statistical Laboratory, Ames, Iowa.
Friedman, M. and Schwartz, A. J. (1963). A Monetary History of the United States 1867-1960.
Princeton University Press, Princeton, NJ.
Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
Fuller, W. A. (1979). Testing the autoregressive process for a unit root. Paper presented at the
42nd Session of the International Statistical Institute, Manila.
Fuller, W. A. (1980). The use of indicator variables in computing predictions. J. Econometrics 12,
231-243.
Fuller, W. A. and Hasza, D. P. (1980). Predictors for the first-order autoregressive process. J.
Econometrics 13, 139-157.
Fuller, W. A. and Hasza, D. P. (1981). Properties of predictors for autoregressive time series. J.
Amer. Statist. Assoc. 76, 155-161.
Fuller, W. A., Hasza, D. P. and Goebel, J. J. (1981). Estimation of the parameters of stochastic
difference equations. Ann. Statist. 9, 531-543.
Gould, J. P. and Nelson, C. R. (1974). The stochastic structure of the velocity of money. American
Economic Review 64, 405-417.
Grenander, U. (1954). On the estimation of regression coefficients in the case of an autocorrelated
disturbance. Ann. Math. Statist. 25, 252-272.
Hannan, E. J. (1965). The estimation of relationships involving distributed lags. Econometrica 33,
206-224.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
Hannan, E. J. (1979). The central limit theorem for time series regression. Stoch. Process. Appl. 9,
281-289.
Hannan, E. J., Dunsmuir, W. T. M. and Deistler, M. (1980). Estimation of vector ARMAX
models. J. Multivariate Anal. 10, 275-295.
Hannan, E. J. and Heyde, C. C. (1972). On limit theorems for quadratic functions of discrete time
series. Ann. Math. Statist. 43, 2058-2066.
Hannan, E. J. and Nicholls, D. F. (1972). The estimation of mixed regression, autoregression,
moving average and distributed lag models. Econometrica 40, 529-548.
Hasza, D. P. (1977). Estimation in nonstationary time series. Unpublished Ph.D. thesis. Iowa State
University, Ames, Iowa.
Hasza, D. P. and Fuller, W. A. (1979). Estimation for autoregressive processes with unit roots.
Ann. Statist. 7, 1106-1120.
Hasza, D. P. and Fuller, W. A. (1982). Testing for nonstationary parameter specifications in
seasonal time series models. Ann. Statist. 10, 1209-1216.
Hatanaka, M. (1974). An efficient two-step estimator for the dynamic adjustment model with
autoregressive errors. J. Econometrics 2, 199-220.
Kawashima, H. (1980). Parameter estimation of autoregressive integrated processes by least
squares. Ann. Statist. 8, 423-435.
Koopmans, T. C., Rubin, H. and Leipnik, R. B. (1950). Measuring the equation systems of dynamic
economics. In: T. C. Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley,
New York.
Lai, T. L. and Siegmund, D. (1983). Fixed accuracy estimation of an autoregressive parameter.
Ann. Statist. 11, 478-485.
Lai, T. L. and Wei, C. Z. (1982). Asymptotic properties of projections with applications to
stochastic regression problems. J. Multivariate Anal. 12, 346-370.
Non-linear time series models and dynamical systems
Tohru Ozaki

1. Introduction
Fig. 2.1.
rolling angle x. At the same time, the sea water damps the ship's movement,
and this damping force is supposed to be a function f(x') of the velocity x' of the
rolling movement. The ship is also considered to be under the continual
external force xi of random excitement by the ocean waves. Therefore, the
dynamics of the ship rolling can be described by the following stochastic
differential equation:

x'' + f(x') + g(x) = xi .   (2.1)

When x' and x are not very large, f(x') and g(x) are usually approximated by
linear functions as

f(x') = a x' ,   (2.2)
g(x) = b x ,   (2.3)

giving

x'' + a x' + b x = xi .   (2.4)
The corresponding discrete time model (2.5) is of ARMA(2,1) form, where a_t is
a discrete time Gaussian white noise with variance sigma_a^2, and phi_1, phi_2, theta_1
and sigma_a^2 are uniquely determined if a, b and the variance of the white noise xi are
known (Pandit and Wu, 1975). If we use the backward shift operator B, which
is such that B x_t = x_{t-1} and B^i x_t = B^{i-1} x_{t-1}, (2.5) is rewritten as

(1 - phi_1 B - phi_2 B^2) x_t = (1 + theta_1 B) a_t ,

or as

x_t = [theta(B) / phi(B)] a_t .   (2.6)
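The correspondence between (a, b) and the discrete model parameters can be illustrated with a crude Euler discretization; this is only an approximation for small sampling intervals, not the exact correspondence derived by Pandit and Wu.

```python
# Euler discretization of x'' + a x' + b x = noise with step h:
#   x'' ~ (x_t - 2 x_{t-1} + x_{t-2}) / h**2,  x' ~ (x_{t-1} - x_{t-2}) / h,
# which rearranges to
#   x_t = phi1 * x_{t-1} + phi2 * x_{t-2} + (noise term),
# with phi1 = 2 - a*h - b*h**2 and phi2 = a*h - 1.
a, b, h = 0.2, 1.0, 0.1
phi1 = 2 - a * h - b * h ** 2        # approximately 1.97
phi2 = a * h - 1                     # approximately -0.98
disc = phi1 ** 2 + 4 * phi2          # negative: complex roots, oscillation
modulus = (-phi2) ** 0.5             # modulus of the complex characteristic roots
print(phi1, phi2, disc < 0, modulus < 1)
```

For a damped system (a > 0) the characteristic roots of lambda^2 - phi1*lambda - phi2 = 0 are complex with modulus below one, so the sampled process is a stationary, oscillatory AR(2)-type model.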
Fig. 2.2. (time plot of the data; horizontal axis: TIME, 0 to 1000)
Fig. 2.3 shows the spectrum, which was calculated using the Hanning window.
When we fit AR models of order 0 to 20 to the data, an AR(7) model (2.10)
was adopted as the best model by AIC (Ozaki and Oda, 1978). Fig. 2.4 shows the
spectrum of the fitted model (2.10), where the peak of the spectrum shows
about 25 seconds periodicity of rolling, and the spectrum of the fitted model
extracts the essential characteristics of the spectrum of Fig. 2.3 calculated
without the parametric model.
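Order selection by AIC, as used above (fit AR(0) to AR(k) and minimize n log of the residual variance plus twice the parameter count), can be sketched as follows; the data are simulated and the OLS fit omits the intercept.

```python
import math
import random

def ar_rss(y, p):
    """Residual sum of squares of the OLS AR(p) fit (no intercept)."""
    if p == 0:
        return sum(v * v for v in y)
    rows = [[y[t - i] for i in range(1, p + 1)] + [y[t]] for t in range(p, len(y))]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p + 1)] for i in range(p)]
    for i in range(p):                   # Gauss-Jordan on the normal equations
        piv = m[i][i]
        m[i] = [v / piv for v in m[i]]
        for q in range(p):
            if q != i:
                f = m[q][i]
                m[q] = [v - f * w for v, w in zip(m[q], m[i])]
    a = [m[i][p] for i in range(p)]
    return sum((r[p] - sum(c * x for c, x in zip(a, r))) ** 2 for r in rows)

def aic_order(y, max_p):
    """Return the AR order minimizing AIC = n*log(RSS/n) + 2*p."""
    n = len(y)
    scores = {p: n * math.log(ar_rss(y, p) / n) + 2 * p for p in range(max_p + 1)}
    return min(scores, key=scores.get)

rng = random.Random(4)
y = [0.0, 0.0]
for _ in range(2000):                    # true order is 2
    y.append(1.2 * y[-1] - 0.5 * y[-2] + rng.gauss(0.0, 1.0))
order = aic_order(y, 8)
print(order)
```

With a strong AR(2) signal the selected order is at least 2; AIC may occasionally choose a slightly larger order.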
Fig. 2.3. Estimated spectrum by Hanning window (Lag = 100).
x'' + a x' + b x + c x^3 = xi ,   (2.11)

where the restoring force is approximated by (b + c x^2) x. When c > 0 the system
is called a hard spring type, and when c < 0 the system is called a soft spring
type. A natural non-linear extension of the time series model for ship rolling
may be the AR model with some non-linear terms, such as polynomial terms in
the lagged values.
2.3. Exponential AR models
To see the reason why the polynomial AR models are explosive, let us first
look at the linear AR(2) model, whose spectrum is

P(f) = sigma^2 |1 - phi_1 e^{-i 2 pi f} - phi_2 e^{-i 4 pi f}|^{-2} ,   (2.15)

which has a peak at

f = (1/2 pi) tan^{-1} [(-phi_1^2 - 4 phi_2)^{1/2} / phi_1] ,

which is the argument of the roots of the characteristic equation.
A polynomial AR analogue of (2.11) is

x_t = (phi_1 + pi x_{t-1}^2) x_{t-1} + phi_2 x_{t-2} + e_t .   (2.17)

Fig. 2.7.

For moderate x_{t-1}, the roots of the characteristic equation

lambda^2 - (phi_1 + pi x_{t-1}^2) lambda - phi_2 = 0   (2.18)

may stay inside the unit circle for such x_t's. However, the white noise e_t is
Gaussian distributed and may have a large value, and the roots of (2.18) may lie
outside the unit circle. Then the system begins to diverge at this stage. Since we
are interested in the stochastic behaviour of x_t mostly for |x_t| < M, it may be
reasonable to make the non-linear coefficient function approach a bound as
x_{t-1} -> +-infinity as in Fig. 2.9.
A time series model which can exhibit this characteristic is the following
model:

x_t = (phi_1 + pi_1 e^{-x_{t-1}^2}) x_{t-1} + phi_2 x_{t-2} + e_t .   (2.19)

The model is called an exponential AR model (Ozaki and Oda, 1978). The
roots of the equation

lambda^2 - (phi_1 + pi_1 e^{-x_{t-1}^2}) lambda - phi_2 = 0

now depend on x_{t-1} only through the bounded factor e^{-x_{t-1}^2}.

Fig. 2.8.

Fig. 2.9.
In the above example, the second-order coefficient is fixed to phi_2, and the
roots of both (2.21) and (2.22) all stay inside the unit circle. However, in the
general vibration system the damping coefficient is not constant. One example
is the following van der Pol equation:

x'' - a(1 - x^2) x' + b x = 0 ,   (2.23)

where for x^2 < 1 the system has negative damping force and starts to oscillate
and diverge, but for x^2 > 1 the system has positive damping force and it starts
to damp out. The interplay of these two effects of opposite tendency produces
a steady oscillation of a certain amplitude, which is called a limit cycle. When
the system is disturbed by a white noise n, we have

x'' - a(1 - x^2) x' + b x = n .   (2.24)
Fig. 2.10.
lie outside the unit circle, then x_t starts to oscillate and diverge for small x_{t-1},
while if the coefficients satisfy the condition (C2) such that

(C2) the roots lambda_1 and lambda_2 of

lambda^2 - phi_1 lambda - phi_2 = 0

lie inside the unit circle, then x_t starts to damp out when x_{t-1} becomes too large.
The result of these two effects is expected to produce a similar sort of
self-sustained oscillation as (2.23) if we suppress the white noise e_t of (2.25).
Fig. 2.12 shows such a simulation, where the coefficients satisfy the above
conditions (C1) and (C2).
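The interplay of (C1) and (C2) can be reproduced in a few lines; the coefficient values below are our own illustrative choices intended to satisfy the two conditions, not those used in the figures.

```python
import math

# Exponential AR(2): x_t = (phi1 + pi1*exp(-x_{t-1}**2)) x_{t-1}
#                        + (phi2 + pi2*exp(-x_{t-1}**2)) x_{t-2}.
# Near the origin the characteristic roots lie outside the unit circle
# (modulus sqrt(1.1) > 1), so small oscillations grow; for large |x_{t-1}|
# the exponential term vanishes and the roots lie inside (modulus
# sqrt(0.9) < 1), so large oscillations are damped.
phi1, pi1 = 1.0, 0.5
phi2, pi2 = -0.9, -0.2

x = [0.1, 0.1]
for _ in range(600):                 # deterministic simulation (noise suppressed)
    w = math.exp(-x[-1] ** 2)
    x.append((phi1 + pi1 * w) * x[-1] + (phi2 + pi2 * w) * x[-2])

peak = max(abs(v) for v in x)        # trajectory stays bounded
tail_peak = max(abs(v) for v in x[-200:])  # oscillation does not die out
print(peak, tail_peak)
```

The trajectory neither diverges nor decays to zero: a self-sustained oscillation of moderate amplitude emerges.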
2.4. Stationarity

The necessary and sufficient condition for the AR(2) model to be stationary is
that the roots of its characteristic equation lambda^2 - phi_1 lambda - phi_2 = 0 all lie
inside the unit circle. For checking the stationarity of the exponential model,
the following theorem about the ergodicity of a Markov chain on a norm space
is useful. Roughly, the theorem requires (i) that E{||X_t|| | X_{t-1} = x} decrease
toward the origin when ||x|| is large, and (ii) that it be bounded by some finite
number B for x in bounded sets, where E{ } means expectation and || || is the
norm of the state space.
In the case of model (2.31) the condition (i) means that the Markov chain
defined by X_t = (x_t, x_{t-1})' shifts back to the origin, on average, when
X_{t-1} = (x_{t-1}, x_{t-2})' is outside some finite region. We call this property the
'shift back to centre' property. This is satisfied when the eigenvalues of the
matrix

( phi_1  phi_2 )
(   1      0   ) ,

which are the characteristic roots of (2.30), lie inside the unit circle. The
condition (ii) is satisfied by the exponential AR model (2.31) because the
transition matrix of (2.34) is a continuous function of x_{t-1} and x_{t-2}. Therefore,
the process defined by (2.29) is an ergodic Markov chain if the coefficients
satisfy the condition (C2).
If the Markov chain is ergodic, we can define the stationary distribution by
lim_{n -> infinity} (1/n) Sum_{i=1}^{n} P^i(y, x) ,
sufficient conditions for the model to have a limit cycle. The model

x_t = (1.8 + 4 e^{-x_{t-1}^2}) x_{t-1} - (0.97 + 0.1 e^{-x_{t-1}^2}) x_{t-2}

satisfies conditions (C1) and (C2), but the trajectory of x_t converges to a point
(see Fig. 2.13) where it holds that

xi = (1.8 + 4 e^{-xi^2}) xi - (0.97 + 0.1 e^{-xi^2}) xi .

The point xi is called a singular point (or fixed point) of the model. Singular
points exist for the exponential AR model (2.35) if the coefficients admit a
non-zero solution of the above equation.
The model (2.39) has singular points

xi = +-1.719... .

However, even though the process x_t has an initial value near one of the
singular points, it starts to oscillate and tends to a limit cycle in the simulation
(see Fig. 2.14).
This is because the singular points of model (2.39) are unstable singular
points.

Fig. 2.13.
Fig. 2.14.
Since the necessary and sufficient condition for xi_t to converge to zero is that the
characteristic roots of (2.40) lie inside the unit circle, we have the following
proposition:

PROPOSITION 2.1. The singular point, if it exists, of (2.35) is stable if and only if
the roots of the equation

lambda^2 - h_1 lambda - h_2 = 0

lie inside the unit circle, where h_1 and h_2 are given by (2.41) and (2.42).
The model

x_t = (0.5 + 10 e^{-x_{t-1}^2}) x_{t-1}   (2.43)

has singular points xi = +-1.7308, but they do not satisfy the condition in
Proposition 2.1, and x_t tends to a limit cycle when it starts at a point near xi in
simulation (see Fig. 2.15).
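The singular points of (2.43) can be checked directly: a non-zero fixed point must satisfy 1 = 0.5 + 10 e^{-xi^2}, i.e. xi = +-(ln 20)^{1/2}.

```python
import math

# Non-zero singular point of x_t = (0.5 + 10*exp(-x_{t-1}**2)) x_{t-1}:
# xi = (0.5 + 10*exp(-xi**2)) * xi with xi != 0 implies
# 1 = 0.5 + 10*exp(-xi**2), hence xi = +/- sqrt(log(20)).
xi = math.sqrt(math.log(20.0))
print(round(xi, 4))  # 1.7308

# Verify that +/-xi are fixed points of the deterministic map.
f = lambda x: (0.5 + 10.0 * math.exp(-x * x)) * x
print(abs(f(xi) - xi) < 1e-12, abs(f(-xi) + xi) < 1e-12)
```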
Although the notions of singular points or limit cycles for the discrete time
non-linear difference equation seem to be obvious by analogy with the
continuous time case (Minorsky, 1962), it may be useful in later discussions to
give an explicit definition of singular points and limit cycles of a general
discrete time difference equation,

  x_t = f(x_{t-1}, …, x_{t-p}).   (2.44)
'Closed' means that if the initial value (x₁, …, x_p) belongs to the limit cycle,
then (x_{1+kq}, …, x_{p+kq}) = (x₁, …, x_p) for any integer k. 'Isolated' means that
every trajectory beginning sufficiently near the limit cycle approaches it either
for t → ∞ or t → -∞. If it approaches the limit cycle for t → ∞ we call it a stable
limit cycle, and if it approaches the limit cycle for t → -∞ we call it an unstable
limit cycle. The smallest integer q which satisfies Definition 2.2 is called the
period of the limit cycle of x_t = f(x_{t-1}, …, x_{t-p}). The singular point can be
considered to be a limit cycle of period 1, but we distinguish it because it has a
significantly different physical meaning.
The stability condition for a limit cycle is given in the same way as the
stability condition for a singular point. When the model (2.45) is known to have a
limit cycle ȳ_t, ȳ_{t+1}, …, ȳ_{t+q-1}, ȳ_{t+q} (= ȳ_t), a point x_s near the
limit cycle is represented as

  x_s = ȳ_s + ξ_s.   (2.46)
From this we can see that the trajectory approaches the limit cycle when the
solution ξ_t of the perturbation difference equation converges to zero, where the
perturbation satisfies

  ξ_{t+q} = {φ₁ + π₁(1 - 2ȳ²_{t+q-1}) e^{-ȳ²_{t+q-1}}}{φ₁ + π₁(1 - 2ȳ²_{t+q-2}) e^{-ȳ²_{t+q-2}}} ⋯ ξ_t.

PROPOSITION 2.2. A limit cycle of period q, ȳ_{t+1}, ȳ_{t+2}, …, ȳ_{t+q}, of the model (2.45)
is stable if

  |{φ₁ + π₁(1 - 2ȳ²_{t+q-1}) e^{-ȳ²_{t+q-1}}} ⋯ {φ₁ + π₁(1 - 2ȳ²_t) e^{-ȳ²_t}}| < 1.   (2.49)
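The product in (2.49) can be evaluated numerically for a candidate cycle. The sketch below (assuming the first-order exponential AR map of (2.45); the function name is ours) applies it to a period-1 "cycle", i.e. a singular point, and reproduces the instability of the singular point of model (2.43):

```python
import math

def cycle_stability(phi, pi_, cycle):
    """Absolute value of the product in (2.49) for the map
    x -> (phi + pi_*exp(-x^2))*x along a candidate limit cycle;
    a value below 1 indicates a stable cycle."""
    prod = 1.0
    for y in cycle:
        prod *= phi + pi_ * (1.0 - 2.0 * y * y) * math.exp(-y * y)
    return abs(prod)

# A singular point is a limit cycle of period 1.  For model (2.43)
# (phi = 0.5, pi_ = 10) the singular point xi = sqrt(log 20) gives a
# product of about 1.996 > 1, so the point is unstable:
xi = math.sqrt(math.log(20.0))
print(cycle_stability(0.5, 10.0, [xi]))
```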
For the pth-order exponential AR model

  x_t = (φ₁ + π₁ e^{-x²_{t-1}})x_{t-1} + ⋯ + (φ_p + π_p e^{-x²_{t-1}})x_{t-p} + ε_t,   (2.50)

we have:

PROPOSITION 2.3. A limit cycle of period q, ȳ_{t+1}, …, ȳ_{t+q}, for model (2.50) is
stable if all the eigenvalues of

  A = A_q · A_{q-1} ⋯ A₁,   (2.51)

lie inside the unit circle, where

  A_i = ( a^{(i)}_{1,1}  a^{(i)}_{1,2}  ⋯  a^{(i)}_{1,p-1}  a^{(i)}_{1,p} )
        ( 1              0              ⋯  0                0             )
        ( 0              1              ⋱  ⋮                ⋮             )   (2.52)
        ( 0              ⋯              0  1                0             )

with

  a^{(i)}_{1,1} = φ₁ + π₁ e^{-ȳ²_{t+i-1}} - 2 Σ_{j=1}^{p} π_j ȳ_{t+i-j} ȳ_{t+i-1} e^{-ȳ²_{t+i-1}},

  a^{(i)}_{1,k} = φ_k + π_k e^{-ȳ²_{t+i-1}},   k = 2, 3, …, p.
The fact that a deterministic non-divergent exponential AR model

  x_t = (φ₁ + π₁ e^{-x²_{t-1}}) x_{t-1} + (φ₂ + π₂ e^{-x²_{t-1}}) x_{t-2}   (2.53)

does not have any stable singular point does not necessarily mean that x_t goes
to some stable limit cycle for t → ∞. There can be a model which has neither a
stable singular point nor a stable limit cycle. For example, one such model is
non-divergent because the roots of

  λ² - λ + 0.25 = 0

lie inside the unit circle.
Fig. 2.16.
  x_t = φ^{(2)} x_{t-1} + ε_t   for T₁ ≤ x_{t-1} < T₂,   (2.56)
Fig. 2.17. (a) Linear threshold AR model; (b) non-linear threshold AR model; (c) exponential AR model.
Fig. 2.18.
2.7. Threshold structure
Fig. 2.19.
pattern and propagate if an impulse is larger than a critical value, while if the
impulse is less than the critical value the impulse wave dies out (see Fig. 2.19).
Neurophysiologically, the wave propagation is realized by the flow of ions
along the axon, which is caused by the change of membrane potential; a
mathematical model for this dynamic phenomenon, called the Hodgkin–Huxley
equation, was presented by Hodgkin and Huxley (1952). Starting from this
Hodgkin–Huxley equation, Fitzhugh (1969) obtained the following non-linear
dynamical system model for the dynamics of the potential V:

  dV/dt = a(V - E₀)²(E₂ - V) - b(V - E₁).   (2.59)
  x_{t+1} = φ(x_t) x_t,
Fig. 2.20. Ionic current: b(V - E₁) and a(V - E₀)²(E₂ - V), with E₀ < E₁ < 0 < E₂.
where ξ′₁ = 0.226… and ξ′₂ = -0.226… are unstable singular points and
ξ₁ = 2.1294…, ξ₂ = -2.1294… and ξ₀ = 0 are stable singular points. If we
apply an impulse to model (2.60), then x_t goes to zero for t → ∞ if the
magnitude of the impulse is less than the unstable singular point ξ′₁, but x_t goes
to ξ₁ for t → ∞ if the magnitude of the impulse is larger than the threshold
value ξ′₁ (see Fig. 2.21).
If we add a white noise input to the model defined by (2.60), we obtain the
model (2.61).
Fig. 2.21.
Fig. 2.22.
on the white noise input. By looking at the data (Fig. 2.22) of the above
example (2.61), one may think of two linear models, one above the
threshold and one below the threshold. However, the data are actually
described by one non-linear model.
A similar non-linear phenomenon is realized by a non-linear time series
model with time-varying coefficients. For example, consider a model whose
coefficient changes from

  x_{t+1} = (0.8 + 0.4 x_t² e^{-x_t²}) x_t + ε_{t+1}   (2.63)

to

  x_{t+1} = (0.8 + 0.4 e^{-x_t²}) x_t + ε_{t+1}.   (2.64)
Fig. 2.23.
Fig. 2.24.
2.8. Distributions
We have seen that a threshold structure is realized by a stationary non-linear
time series model (2.65), where x_t moves around from one stable singular point
to another depending on the white noise input. However, the process defined by
(2.65) has one and the same equilibrium distribution on the whole. Fig. 2.26
shows the histogram of the data generated by simulating the non-linear
threshold AR model which has the same structural property as (2.65). It has
three stable singular points ξ₀ = 0, ξ₁ = 0.9 and ξ₂ = -0.9 and two unstable
singular points ξ′₁ = 0.4358… and ξ′₂ = -0.4358…. Fig. 2.25 shows the
histogram of the white noise used in the above simulation, where the number of
data is N = 8000.

Fig. 2.25.

It is obvious that the three peaks in Fig. 2.26 correspond to the three stable
singular points ξ₀, ξ₁ and ξ₂, and the two valleys correspond to the two
unstable singular points ξ′₁ and ξ′₂. These correspondences remind us of the
potential function of a dynamical system.

Fig. 2.26.
Fig. 2.27.

For a dynamical system ẋ = f(x) we define the potential function

  V(x) = -∫^x f(y) dy.

For example, the dynamical system

  ẋ = -4x + 5x³ - x⁵

has three stable singular points ξ₀ = 0, ξ₁ = 2, ξ₂ = -2 and two unstable
singular points ξ′₁ = 1 and ξ′₂ = -1 (see Fig. 2.27).
The stable singular points correspond to the valleys of the potential and the
unstable singular points correspond to the peaks of the potential (see Fig. 2.28).
Further, it is known that the equilibrium distribution W(x) of the diffusion
process defined by the stochastic dynamical system

  ẋ = f(x) + n(t)

is given by

  W(x) = W₀ exp{-2V(x)/σ²},

where σ² is the variance of the white noise n(t) and W₀ is a normalizing constant.
If we consider this structural correspondence between non-linear time series
models and diffusion processes defined by stochastic dynamical systems, it may
be natural to study the diffusion process and its time discretization scheme in
the succeeding section.
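For the example ẋ = -4x + 5x³ - x⁵ above, the correspondence between potential and equilibrium distribution can be sketched directly (a minimal illustration; σ² = 1 is an arbitrary choice of ours):

```python
import math

def potential(x):
    """V(x) = -∫ f dx for f(x) = -4x + 5x^3 - x^5."""
    return 2.0 * x ** 2 - 1.25 * x ** 4 + x ** 6 / 6.0

def equilibrium_density(x, sigma2=1.0):
    """Unnormalized W(x) = W0 * exp(-2 V(x) / sigma^2)."""
    return math.exp(-2.0 * potential(x) / sigma2)

# Valleys of V at the stable points 0, ±2; peaks at the unstable ±1:
print(potential(0.0), potential(1.0), potential(2.0))
```

The density is highest where the potential is lowest, so W(x) has modes at the stable singular points and antimodes at the unstable ones.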
Fig. 2.28.
  lim_{τ→0} E[Δx]/τ = f(x),

  lim_{τ→0} E[(Δx)²]/τ = σ²,

  lim_{τ→0} E[(Δx)^k]/τ = 0   (k ≥ 3),

and the transition probability satisfies

  ∂p/∂t = -(∂/∂x)[f(x)p] + ½ (∂²/∂x²)[σ²p],   (3.3)
where p stands for the transition probability p(x | x₀, t), which means the
probability that the process takes the value x at time t, given that it had the
value x₀ at time t = 0. Thus the stochastic dynamical system uniquely defines a
diffusion process with transition probability p(x | x₀, t) defined by the
Fokker–Planck equation (3.3).
Conversely, the diffusion process defined by (3.3), obviously, uniquely
defines the stochastic dynamical system (3.2). However, the rate of the growth
of the variance,

  lim_{τ→0} E[(Δx)²]/τ,

need not be constant, and a general diffusion process has the Fokker–Planck
equation

  ∂p/∂t = -(∂/∂x)[a(x)p] + ½ (∂²/∂x²)[b(x)p].   (3.4)
Then (3.4) uniquely defines the following stochastic differential equation (see,
for example, Goel and Richter-Dyn, 1974),

  ẋ = f(x) + g(x)n(t),   (3.5)

with Fokker–Planck equation

  ∂p/∂t = -(∂/∂x)[f(x)p] + ½ (∂²/∂x²)[g²(x)p].

Using the variable transformation

  y = y(x) = ∫^x dξ/g(ξ),   (3.6)

we have, from the stochastic differential equation (3.5), the following stochastic
dynamical system:

  ẏ = α(y) + n(t),   (3.7)
where n(t) is a Gaussian white noise with unit variance. We call the process y
the associated diffusion process of (3.4), and we call the dynamical system
ẏ = α(y) the associated dynamical system of (3.4). By the analogy with
mechanics we define the potential function by

  V(y) = -∫^y α(η) dη.   (3.8)

We note that the potential function (3.8) is different from the potential function
well known in Markov process theory (Blumenthal and Getoor, 1968), and we
call V(y) of (3.8) the potential function associated with the diffusion process, or
simply the associated potential function.

The above discussion suggests that any diffusion process uniquely defines a
variable transformation and a potential function with respect to the trans-
formed variable.
where C is the normalizing constant. Wong (1963) showed that for any
probability distribution function W(x) defined by the Pearson system

  dW(x)/dx = {(c₀ + c₁x)/(d₀ + d₁x + d₂x²)} W(x).   (3.10)

PROPOSITION 3.1. For any distribution W(x) defined by the distribution system

  dW(x)/dx = {c(x)/d(x)} W(x),   (3.11)
where a and the β_i of β = (β₁, …, β_k) are real-valued functions of the
parameter, and a(x) and t(x) = (t₁(x), …, t_k(x))′ are analytic functions of x
(Barndorff-Nielsen, 1978). From the definition of the generalized Pearson
system the following propositions are also easily obtained.
The distribution system of the equilibrium distribution of the diffusion process
defined by

  ∂p/∂t = -(∂/∂x)[a(x)p] + ½ (∂²/∂x²)[b(x)p]   (3.14)

is

  dW/dx = {(2a(x) - b′(x))/b(x)} W(x).   (3.15)
  dW(y)/dy = 2α(y) W(y),   (3.18)

where

  α(y) = α(y(x)) = f(x)/g(x).   (3.19)
The above correspondence between the generalized Pearson system and the
diffusion process in Proposition 3.1 is unique if we restrict c(x) and d(x) of
(3.11) to be mutually irreducible.
  ẏ = f(y) + n(t),   (3.20)

whose Euler discretization

  y_{t+Δt} = y_t + Δt f(y_t) + √Δt ε_{t+Δt}   (3.21)

is a Markov chain approximating the dynamical system

  ẏ = f(y).   (3.22)
However, the Euler method is known to be unstable and explosive for any
small Δt, if the initial value of y is in some region. For example, the trajectory
y(t) of

  ẏ = -y³   (3.23)

is known to go to zero for any initial value of y. Its discretized model by the
Euler method is

  y_{t+Δt} = y_t - Δt y_t³,

which is explosive, the trajectory going to infinity if the initial value y₀ is in the
region |y₀| > √(2/Δt). It is also known that, for any small Δt, the Markov chain
(3.21) is non-stationary if f(y) is a non-linear function which goes to ±∞ for
|y| → ∞ (Jones, 1978). The same thing can be said for some other more
sophisticated discretization methods such as the Heun method or the Runge–
Kutta method (see, for example, Henrici, 1962).
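The explosiveness of the Euler scheme for ẏ = -y³ is easy to reproduce (a small sketch; the guard constant 10⁶ is ours, used only to stop the divergent iteration before floating-point overflow):

```python
def euler_step(y, dt):
    # Euler discretization of  y' = -y^3
    return y - dt * y ** 3

dt = 0.1
y_ok, y_bad = 1.0, 5.0          # |y0| > sqrt(2/dt) ~ 4.47 is explosive
for _ in range(100):
    y_ok = euler_step(y_ok, dt)
    if abs(y_bad) < 1e6:        # stop once divergence is evident
        y_bad = euler_step(y_bad, dt)

print(y_ok, y_bad)
```

The trajectory started inside the stable region decays toward zero, matching the true solution; the one started outside diverges within a handful of steps.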
For the estimation and simulation of diffusion processes by a digital com-
puter, it is desirable to have a stationary Markov chain which converges to the
concerned stationary diffusion process for Δt → 0. Our idea of obtaining such a
stationary Markov chain is based on the following local linearization idea. When
f(y) of (3.22) is linear as in

  ẏ = -αy,   (3.25)
we have a discrete time dynamical system which coincides with y(t) of (3.26) on
t, t + Δt, t + 2Δt, …. Also, the Markov chain corresponding to

  ẏ = -αy + n(t)

coincides with the diffusion process at these time points. The local linearization
idea extends this to the general non-linear system

  ẏ = f(y)   (3.29)

by assuming that

  J_t = ∂f/∂y ≠ 0   (3.30)

is constant on the interval, i.e. assuming that the system is linear on the
interval. Then we have
For J_t = 0 we take the limiting form.

It is easily seen that the model defined by (3.32) and (3.33), which we call a
locally linearized dynamical system, converges to ẏ = f(y) for Δt → 0. It is also
easily checked (see, for example, Gikhman and Skorohod, 1965) that the
Markov chain defined by

  y_{t+Δt} = φ(y_t) y_t + √Δt ε_{t+Δt},   (3.34)

where φ(y_t) = 1 + [f(y_t)/{J(y_t)y_t}][exp{J(y_t)Δt} - 1] and ε_{t+Δt} is a
Gaussian white noise with unit variance, converges to the diffusion process y(t)
of (3.20). We call the model (3.34) the locally linearized Markov chain model of
the stochastic dynamical system model (3.20). As we shall see later, the present
local linearization method gives us, unlike the Euler method or other
discretization methods, non-explosive discrete time dynamical systems. If f(x) is
specified, it is easy to check whether the locally linearized dynamical system is
non-explosive or not. However, it may sometimes be useful if sufficient
conditions for the non-explosiveness of the locally linearized dynamical system
are given for the general dynamical system ẏ = f(y).
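A minimal sketch of the locally linearized step (assuming the form of (3.34) with φ(y) = 1 + [f(y)/{J(y)y}][e^{J(y)Δt} - 1]; the function names are ours). For ẏ = -y³ it reproduces the φ function of (3.59) with σ = 1 and, unlike the Euler scheme, remains non-explosive from a large initial value:

```python
import math

def local_linearization_step(y, f, J, dt):
    """Deterministic part of the locally linearized step for y' = f(y);
    J is the derivative df/dy."""
    Jy = J(y)
    if y == 0.0 or Jy == 0.0:
        return y + f(y) * dt          # limiting cases
    return y + (math.exp(Jy * dt) - 1.0) / Jy * f(y)

# For y' = -y^3 the step reduces to phi(y) = 2/3 + (1/3) exp(-3*y^2*dt):
f = lambda y: -y ** 3
J = lambda y: -3.0 * y ** 2
dt, y0 = 0.1, 5.0
y1 = local_linearization_step(y0, f, J, dt)
phi = 2.0 / 3.0 + math.exp(-3.0 * y0 ** 2 * dt) / 3.0
print(y1, phi * y0)   # the two agree
```

Since 2/3 < φ(y) < 1 for every y, the map contracts everywhere, so the same y₀ = 5 that explodes under the Euler scheme simply relaxes toward zero here.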
  |φ(y_t)| < 1,

and hence the chain is non-explosive, for large |y_t| if f(y) satisfies conditions
guaranteeing

  φ(y) > -1.

If J(y) → 0 for y → ∞, then we have

  [exp{J(y)Δt} - 1] f(y)/{J(y)y} → Δt f(y)/y > -2   for y → ∞.   (3.41)

If J(y) → -∞ for y → ∞, we have exp{J(y)Δt} - 1 → -1. Therefore, the following
condition is also a sufficient condition for φ(y) > -1 for y → ∞:

(B2)  J(y) → -∞ for y → ∞ and f(y)/{J(y)y} < 2 for y → ∞,
which decrease more slowly than a linear function for y → ∞; examples of
functions which satisfy J(y) → c < 0 for y → ∞ exist as well.

(B2′)  J(y) → -∞ for y → ∞ and there exists y₀ > 0 such that f(y) is concave, i.e.

  ∂²f(y)/∂y² < 0   for y > y₀,

and for any c > 0 there exists y₁ ≥ y₀ such that
The similar conditions on f(y) for y → -∞ are obtained from the same logic as
follows:

(C2′)  J(y) → -∞ for y → -∞ and there exists y₀ < 0 such that for y < y₀,
f(y) is convex (i.e. ∂²f(y)/∂y² > 0), and for any c > 0 there exists y₁ ≤ y₀
such that
closely related to the ergodicity of Markov chains on a continuous state
space. For the locally linearized Markov chain (3.34) to be ergodic, Theorem
2.1 requires φ(y) to be a continuous function of y and to have the shift-back-to-
centre property, which is guaranteed by
Consider the Fokker–Planck equation

  ∂p/∂t = -(∂/∂x)[-αxp] + ½ (∂²/∂x²)[σ²p],   (3.42)

whose associated dynamical system is

  ẏ = -αy,   (3.44)

where the damping function is defined by

  z(y) = -f(y).

Then the damping function of (3.44) is a linear function (see Fig. 3.1)
Fig. 3.1.
  dW(x)/dx = (-2αx/σ²) W(x),   (3.47)

and the distribution W(x) is the well-known Gaussian distribution (see Fig. 3.3)

  W(x) = √(α/(πσ²)) exp(-αx²/σ²).   (3.48)
Xt = o . Y t ,
The dynamical system

  ẋ = -x³   (3.50)

has a non-linear cubic damping function as in Fig. 3.5. If this dynamical system is
driven by a white noise σn(t), we have

  ẋ = -x³ + σn(t).   (3.51)

The corresponding Fokker–Planck equation is

  ∂p/∂t = -(∂/∂x)[-x³p] + ½ (∂²/∂x²)[σ²p].   (3.52)

Using the variable transformation

  y = x/σ,   (3.53)

we have

  ẏ = -σ²y³.   (3.54)

The associated potential function is

  V(y) = σ²y⁴/4.   (3.55)

The distribution system of x is

  dW(x)/dx = (-2x³/σ²) W(x),   (3.56)

and the equilibrium distribution is

  W(x) = W₀ exp(-x⁴/(2σ²)).   (3.57)
The locally linearized Markov chain model is

  x_t = σ y_t,
  y_{t+Δt} = φ(y_t) y_t + √Δt ε_{t+Δt},   (3.58)

where

  φ(y_t) = 2/3 + (1/3) exp(-3σ²Δt y_t²).   (3.59)
The dynamical system

  ẋ = -6x + 5.5x³ - x⁵   (3.60)

has five zero points, ξ₀ = 0, ξ′₁ = √1.5, ξ′₂ = -√1.5, ξ₁ = 2 and ξ₂ = -2 (see Fig.
3.9). They are called singular points of the dynamical system. If an initial value
x₀ of (3.60) is one of the five singular points, then x(t) stays at x₀ for any t > 0.
If the dynamical system is driven by a white noise σn(t), we have

  ẋ = -6x + 5.5x³ - x⁵ + σn(t).   (3.61)

The corresponding Fokker–Planck equation is (3.62), and the associated
dynamical system, with y = x/σ, is

  ẏ = -6y + 5.5σ²y³ - σ⁴y⁵,

with associated potential function

  V(y) = 3y² - (11σ²/8) y⁴ + (σ⁴/6) y⁶.   (3.63)
The distribution system of x is

  dW(x)/dx = {(-12x + 11x³ - 2x⁵)/σ²} W(x),   (3.64)

and the locally linearized Markov chain model is

  x_t = σ y_t,
  y_{t+Δt} = φ(y_t) y_t + √Δt ε_{t+Δt},

where

  φ(y_t) = 1 + [f(y_t)/{J(y_t)y_t}][exp{J(y_t)Δt} - 1]   for J(y_t)y_t ≠ 0,
  φ(y_t) = 1 + (-6 + 5.5σ²y_t² - σ⁴y_t⁴)Δt               for J(y_t) = 0,
  φ(y_t) = e^{-6Δt}                                       for y_t = 0.
Fig. 3.9. Fig. 3.10.
  W(x) = x^{α-1} e^{-x/β} / {Γ(α) β^α}.   (3.66)

The distribution system is

  dW(x)/dx = {((α - 1)β - x)/(βx)} W(x),   (3.67)

and the corresponding Fokker–Planck equation is

  ∂p/∂t = -(∂/∂x)[(αβ - x)p] + ½ (∂²/∂x²)[2βxp].   (3.68)

Using the variable transformation

  y = √(2x/β),   (3.70)

the damping function of the associated dynamical system is

  z(y) = y/2 - (α - ½)/y.   (3.72)
As is seen in Fig. 3.13, if α ≥ ½ the damping function is negative for
y < √(2α - 1), while if α < ½ the damping function is always positive. The
associated potential function (see Fig. 3.14) is

  V(y) = y²/4 - (α - ½) log y.

When α = ½ the damping function is a linear function of y, and the potential
function is a quadratic function. Therefore, the distribution of y is Gaussian for
α = ½. The locally linearized Markov chain model for the diffusion process x(t)
is

  x_t = β y_t²/2,
  y_{t+Δt} = Φ(y_t) + √Δt ε_{t+Δt},   (3.75)

where

  Φ(y_t) = y_t + [exp{J(y_t)Δt} - 1] f(y_t)/J(y_t)   for J(y_t) ≠ 0,
  Φ(y_t) = y_t + Δt f(y_t)                            for J(y_t) = 0,   (3.76)

and

  f(y_t) = (α - ½)/y_t - y_t/2,   J(y_t) = -(α - ½)/y_t² - ½.
Although |φ(y_t)| = |Φ(y_t)/y_t| < 1 for y_t → ∞, φ(y_t) is not bounded near the
origin when α < ½. Therefore, we can have a 'stationary' locally linearized
Markov chain model for the Gamma-distributed process (3.68) only when
α ≥ ½.
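The Gamma example can be checked in its linear case: for α = ½ the drift f(y) = -y/2 is linear, so the locally linearized step should be exact, Φ(y) = y e^{-Δt/2}. A sketch (the function name is ours):

```python
import math

def gamma_ll_step(y, alpha, dt):
    """Deterministic part Phi(y) of the locally linearized model (3.75),
    using f and J as given in the text for the Gamma diffusion."""
    f = (alpha - 0.5) / y - y / 2.0
    J = -(alpha - 0.5) / (y * y) - 0.5
    if J == 0.0:
        return y + dt * f
    return y + (math.exp(J * dt) - 1.0) * f / J

# alpha = 1/2:  f(y) = -y/2, J = -1/2, so Phi(y) = y * exp(-dt/2) exactly.
y0, dt = 2.0, 0.25
print(gamma_ll_step(y0, 0.5, dt), y0 * math.exp(-dt / 2.0))
```

For α > ½ the step also pushes small positive y away from the origin, consistent with the boundedness condition stated above.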
Figs. 3.13–3.16 (damping function z(y) and potential V(y) for α > 0.5 and α < 0.5).
EXAMPLE 5. The Beta distribution is

  W(x) = {Γ(α + γ + 2)/(Γ(α + 1)Γ(γ + 1))} (1 + x)^α (1 - x)^γ / 2^{α+γ+1},   α, γ > -1.   (3.77)

The distribution system is

  dW(x)/dx = {((α - γ) - (α + γ)x)/(1 - x²)} W(x).   (3.78)

The corresponding Fokker–Planck equation is

  ∂p/∂t = -(∂/∂x)[{(α - γ) - (α + γ + 2)x}p] + ½ (∂²/∂x²)[2(1 - x²)p],   (3.79)

and the stochastic dynamical system is

  ẋ = (α - γ) - (α + γ + 1)x + √(2(1 - x²)) n(t).   (3.80)

Using the variable transformation

  y = (1/√2) sin⁻¹ x,   (3.81)

we obtain

  W(y) = W₀ exp{-2V(y)}
       = {Γ(α + γ + 2)/(Γ(α + 1)Γ(γ + 1) 2^{α+γ+1/2})} (1 + sin √2 y)^{α-γ} (cos √2 y)^{2γ+1}.   (3.85)
The locally linearized Markov chain model is given with

  φ(y_t) = Φ(y_t)/y_t.
however, since

  dW(x)/dx = {-(2α + 1)x/(1 + x²)} W(x),   (3.91)

the corresponding Fokker–Planck equation is

  ∂p/∂t = -(∂/∂x)[(1 - 2α)xp] + ½ (∂²/∂x²)[2(1 + x²)p],   (3.92)

and, using the variable transformation,

  y = (1/√2) sinh⁻¹(x),   (3.94)
  V(y) = α log(cosh √2 y).   (3.97)
  x_t = sinh(√2 y_t),
  y_{t+Δt} = Φ(y_t) + √Δt ε_{t+Δt},

where

  Φ(y_t) = y_t + [exp{-2αΔt/cosh²(√2 y_t)} - 1] cosh(√2 y_t) sinh(√2 y_t)/√2.   (3.99)

Since Φ(y_t)/y_t → e^{-2αΔt} for y_t → 0, the φ function of the model is given by
(see Fig. 3.24)

  φ(y_t) = 1 + [exp{-2αΔt/cosh²(√2 y_t)} - 1] cosh(√2 y_t) sinh(√2 y_t)/(√2 y_t)   for y_t ≠ 0,
  φ(y_t) = e^{-2αΔt}                                                               for y_t = 0.
                                                                                   (3.100)
EXAMPLE 7. Consider the distribution

  W(x) = x^{-(2α+1)} e^{-1/x} / Γ(2α).   (3.101)

The distribution (3.101) (see Fig. 3.27) is introduced (Wong, 1963) from the
Pearson system

  dW(x)/dx = {(1 - (2α + 1)x)/x²} W(x).   (3.102)
The corresponding Fokker–Planck equation is

  ∂p/∂t = -(∂/∂x)[{1 - (2α - 1)x}p] + ½ (∂²/∂x²)[2x²p],   (3.103)

and the damping function of the associated dynamical system is

  z(y) = √2 α - (1/√2) e^{-√2 y}.   (3.107)
  W(y) = {√2/Γ(2α)} exp{-2√2 αy - exp(-√2 y)}.   (3.109)

The locally linearized Markov chain model is

  x_t = exp(√2 y_t),   (3.110)
  y_{t+Δt} = Φ(y_t) + √Δt ε_{t+Δt},   (3.111)

where

  Φ(y_t) = y_t - [exp{-exp(-√2 y_t)Δt} - 1] {exp(-√2 y_t) - 2α}/{√2 exp(-√2 y_t)}.   (3.112)
4. Estimation
  x_t = h⁻¹(y_t | α),
  y_{t+Δt} = Φ(y_t | β) + √Δt ε_{t+Δt},   (4.1)

where α is a parameter used in the variable transformation y = h(x), β =
(β₁, …, β_r) is the parameter vector of the function Φ(y_t), and ε_t is a unit
Gaussian white noise. The model is considered to be a linear or a non-linear
time series model where y_{t+Δt} is denoted as y_{t+1} and the model (4.1) is
rewritten as

  x_t = h⁻¹(y_t | α),
  y_{t+1} = Φ(y_t | β) + √Δt ε_{t+1}.   (4.2)

The estimates of the parameters of the time series model (4.2) are obtained by the
maximum likelihood method. The log-likelihood of the model is given by
the logarithm of the likelihood (4.3), which is the product of the Gaussian
transition densities of the y's and the Jacobian terms det(∂h(x_t | α)/∂x_t):

  log L(α, β) = -((N - 1)/2) log(2πΔt) - (1/(2Δt)) Σ_{t=2}^{N} e_t²
                + Σ_{t=2}^{N} log |det{∂h(x_t | α)/∂x_t}| + log p(y₁ | α, β),   (4.4)

where e_t = y_t - Φ(y_{t-1} | β).
When N is sufficiently large, log p(y₁ | α, β) is small compared with the other
terms of (4.4) and can be ignored. Then the (approximate) maximum likelihood
estimates α̂ and β̂ are obtained by minimizing the following function l(α, β)
with respect to the variables α and β = (β₁, …, β_r):

  l(α, β) = (1/(2Δt)) Σ_{t=2}^{N} e_t² - Σ_{t=2}^{N} log |det{∂h(x_t | α)/∂x_t}|.   (4.5)
we have

  Y = X A + e,   (4.9)

where Y = (y₂, …, y_N)′, A is the coefficient vector (φ₁, π₁, …, φ_p, π_p)′ and the
rows of X consist of the regressors y_{t-j} and y_{t-j} e^{-y²_{t-1}} (j = 1, …, p).
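Because the model is linear in (φ_j, π_j) once the data are given, the regression (4.9) can be solved by ordinary least squares. A minimal sketch for p = 1 (plain-Python normal equations; noise-free simulated data are used only so that the recovery is exact):

```python
import math

def fit_exp_ar1(x):
    """Least-squares fit of x_t = (phi + pi*exp(-x_{t-1}^2))*x_{t-1} + e_t;
    the regressors are u = x_{t-1} and v = x_{t-1}*exp(-x_{t-1}^2)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(1, len(x)):
        u = x[t - 1]
        v = x[t - 1] * math.exp(-x[t - 1] ** 2)
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * x[t]
        b2 += v * x[t]
    det = s11 * s22 - s12 * s12
    phi = (s22 * b1 - s12 * b2) / det
    pi_ = (s11 * b2 - s12 * b1) / det
    return phi, pi_

# Recover the coefficients of (2.43) from a noise-free simulation:
x = [0.5]
for _ in range(500):
    x.append((0.5 + 10.0 * math.exp(-x[-1] ** 2)) * x[-1])
phi, pi_ = fit_exp_ar1(x)
print(phi, pi_)   # close to (0.5, 10.0)
```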
We note that the contribution of the non-constant terms of φ(y_t) depends
very much on the scaling unit of y_t. In practical applications of the model, the
identification of the dynamics of φ(y_t) is more efficient if we introduce a
scaling parameter γ as in (4.11). The estimation of model (4.11) is still much
easier than that of the locally linearized non-linear time series model (see Haggan
and Ozaki, 1980). The asymptotic behaviour of the maximum likelihood
estimates of ARMA models and Markov chains is discussed in Hannan (1973)
and Billingsley (1961).
the potential function using the estimated φ function. We call the former
method the direct method and the latter method the indirect method (Ozaki,
1983a). The indirect method is based on the following relation in local
linearization:

  φ(y) - 1 = (exp{[∂α(y)/∂y] Δt} - 1) α(y) / {[∂α(y)/∂y] y},   (4.13)

where

  ẏ = α(y) + n(t)   (4.14)

and

  y_{t+1} = φ(y_t) y_t + ε_{t+1}.   (4.15)
It is difficult to solve for α(y) in (4.16) for a given φ(y). One approximate method
is to expand

  exp{[∂α(y)/∂y] Δt} ≈ 1 + [∂α(y)/∂y] Δt,

which gives

  α(y) ≈ (1/Δt) {φ(y) - 1} y   (4.17)

and hence the estimated potential function

  V̂(y, Δt) = -(1/Δt) ∫ {φ(y) - 1} y dy.   (4.18)
For the estimation of the φ function of (4.15), the following three parametric
amplitude-dependent AR models studied in Section 2 are useful.
5. Discussions
We have seen in Section 3 that for any diffusion process x(t), there is an
associated diffusion process y(t) which is obtained by a variable transformation
y = h(x) (see Fig. 5.1). Since the variable transformation is uniquely defined by
(3.6), the associated diffusion process is unique. From the associated diffusion
process we have a Markov chain, which is a non-linear time series, converging
to the diffusion process for Δt → 0, by the local linearization method. However,
we note that the correspondence between the distribution system and the
diffusion process x(t) (see Fig. 5.1) is not unique if we do not restrict c(x) and
d(x) to be mutually irreducible (Ozaki, 1982b). For example, if we consider the
distribution system

  dW(x)/dx = {((α - 1)β - x)/(βx)} W(x),   (5.2)
we obtain the corresponding Fokker–Planck equation

  ∂p/∂t = -(∂/∂x)[{(α + 1)βx - x²}p] + ½ (∂²/∂x²)[2βx²p],   (5.3)
Fig. 5.1. Stochastic differential equation ẋ = f(x) + g(x)n(t) ↔ stochastic
dynamical system ẏ = α(y) + n(t); Fokker–Planck equation
∂p/∂t = -(∂/∂x)[a(x)p] + ½(∂²/∂x²)[b(x)p] ↔ potential V(y) = -∫ α(y) dy;
distribution system dW(x)/dx = {c(x)/d(x)}W(x) ↔ dW(y)/dy = 2α(y)W(y).
The damping function z(y) of the associated dynamical system is shown in
Fig. 5.2, and the locally linearized Markov chain model is

  x_t = exp(√(2β) y_t),
  y_{t+Δt} = Φ(y_t) + √Δt ε_{t+Δt}.
The φ function is given in Fig. 5.4, where φ(y) = Φ(y)/y for y ≠ 0, and φ(y) is
undefined for y = 0.
From the viewpoint of time series analysis, the above example shows two
methods for the whitening of a given Gamma-distributed time series. One is
square-root transformation plus non-linear time series modelling and the other
is logarithmic transformation plus non-linear time series modelling. Data
transformation methods have been used in many statistical methods with the
objective of obtaining Gaussian or quasi-Gaussian data. This idea is still valid
in the statistical analysis of non-Gaussian time series data. The above examples
of Gamma-distributed processes and Example 5 of the Beta-distributed process in
Section 3 suggest that not only non-linear time series modelling, but also
variable transformations, such as square-root, logarithmic and arcsine trans-
formations, are useful.
Fig. 5.4.
  ẍ + f(x)ẋ + g(x) = n(t).   (5.11)

This is equivalent to the following two-variate dynamical system model:

  Ẋ = F(X) + N(t),   (5.12)

whose local linearization employs the Jacobian J_t = ∂F/∂X (whose entries
involve ∂g(x)/∂x) and the matrix exponential

  exp(Δt J_t) = I + Δt J_t + (1/2!) Δt² J_t² + (1/3!) Δt³ J_t³ + ⋯.
P(co) P(ro)
~o
Fig. 5.5.
process model, the autocorrelation function of the noise process used is a delta
function and so its spectral density p(ω) is uniform over the frequency
band -∞ < ω < ∞ (see Fig. 5.5) and its integral is one. However, it is obvious
that in nature there is no noise process whose autocorrelation is exactly the
same as the delta function. If we make the sampling interval very small, any
noise process in nature would fail to have a delta autocorrelation function. In
other words, if we are concerned about the periodicity in a very high frequency
band of the noise process, the whiteness assumption becomes inadequate.
Nevertheless, diffusion process models seem to approximate well to
phenomena in many fields, such as genetics, statistical physics, ecology,
mechanical engineering and electrical engineering. This is because we are
concerned about a special finite frequency band in the spectrum of each
phenomenon, and the white noise approximation together with a dynamical
system approximation seems to work well. This consideration suggests that the
discrete time white noise approximation (see Fig. 5.5) together with discrete time
dynamical systems such as the amplitude-dependent A R models in Section 2
may work well for the approximation of the same phenomenon.
Acknowledgement
The author is grateful to Dr. Yamanouchi and Mr. Oda for inviting his
interest to dynamical systems in nautical engineering.
He is also grateful to Dr. Shimizu and Dr. Akaike for many useful comments
and discussions.
References
[1] Akaike, H. and Nakagawa, T. (1972). Statistical Analysis and Control of Dynamic Systems. Saiensu-sha (in Japanese).
[2] Akaike, H. (1974). Markovian representation of stochastic processes and its application to the analysis of autoregressive and moving average processes. Ann. Inst. Statist. Math. 26, 363–387.
[3] Akaike, H. (1977). On entropy maximization principle. In: P. R. Krishnaiah, ed., Applications of Statistics. North-Holland, Amsterdam.
[31] Wong, E. (1963). The construction of a class of stationary Markoff processes. Proc. Amer. Math. Soc. Symp. Appl. Math. 16, 264–276.
[32] Yamanouchi, Y. (1974). Ship's behaviour on ocean waves as a stochastic process. Proc. of International Symposium on the Dynamics of Marine Vehicles and Structures in Waves, Institution of Mechanical Engineering, London.
[33] Zeeman, E. C. (1977). Catastrophe Theory: Selected Papers (1972-1977). Addison-Wesley,
Reading, MA.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 85–118

3

ARMA Models, Intervention Problems and Outlier Detection

G. C. Tiao
1. Introduction
  Y_t = v₀X_t + v₁X_{t-1} + ⋯ + v_j X_{t-j} + ⋯
      = v(B)X_t,   (1.1)
where B is the backshift operator such that BX_t = X_{t-1}, and the quantity v_j
measures the effect of X_{t-j} on Y_t. The v_j's are known as the impulse responses
and v(B) the transfer function. Rather than dealing with a possibly infinite
number of the v_j's, in practice, a parsimonious representation takes the rational
polynomial form
  v(B) = ω(B)/δ(B),   (1.2)

where

  ω(B) = ω₀ - ω₁B - ⋯ - ω_sB^s  and  δ(B) = 1 - δ₁B - ⋯ - δ_rB^r.
The autoregressive moving average (ARMA) model

  Φ(B)Z_t = C + Θ(B)a_t,   (1.3)

originally proposed by Yule (1927) and Slutsky (1937), has been found useful in
representing the serially dependent relationship of many time series encountered
in practice. In (1.3), {Z_t} is the observable time series, {a_t} is a sequence of white
noise, identically and independently distributed as normal N(0, σ²), C is a
constant, Φ(B) = 1 - Φ₁B - ⋯ - Φ_pB^p is the autoregressive polynomial,
Θ(B) = 1 - Θ₁B - ⋯ - Θ_qB^q is the moving average polynomial, and Φ(B) and
Θ(B) are assumed to have no common factor. The model (1.3) is of the form in (1.2)
where Y_t = Z_t and the input X_t is now the white noise a_t.
The model (1.3) will be stationary if all the zeros of Φ(B) are restricted to lie
outside the unit circle, and in this case C = (1 - Φ₁ - ⋯ - Φ_p)μ, where μ is the
mean of the series. Theoretically, stationarity means that the probability
density functions of (Z_{t₁}, …, Z_{t₁+k}) and (Z_{t₂}, …, Z_{t₂+k}) are of identical form for
any arbitrary choice of the integers (t₁, t₂, k). In practice, this says that the
overall behavior of the observations remains the same over time. Now, real
world time series data often exhibit a drifting behavior. Such nonstationary
series can be modeled by allowing some of the zeros of Φ(B) to be equal to
one. Thus writing Φ(B) = φ(B)U(B) we have from (1.3) that

  φ(B)U(B)Z_t = C + Θ(B)a_t,   (1.5)

where U(B) = 1 - U₁B - ⋯ - U_dB^d has all its zeros on the unit circle. The
special form U(B) = (1 - B)^d(1 - B^s)^D, where s is some positive integer, has
been widely used in practice to model seasonal time series (s = 12 for monthly
data and s = 4 for quarterly data).
In what follows, we list some special cases of the class in (1.5) which have
been widely used in practice:
  Z_t = C + (1 - θ₁B)a_t,   (1.5b)

  (1 - B)Z_t = (1 - θ₁B)a_t,   (1.5c)
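The stationarity condition — all zeros of Φ(B) outside the unit circle — can be checked directly for low orders. A sketch for AR(1) and AR(2) (the closed-form quadratic roots are standard; the function name is ours):

```python
import cmath

def stationary_ar(phi):
    """Check stationarity for AR(1) or AR(2): all zeros of
    Phi(B) = 1 - phi1*B - phi2*B^2 must lie outside the unit circle."""
    if len(phi) == 1:
        return abs(phi[0]) < 1.0
    phi1, phi2 = phi
    if phi2 == 0.0:
        return abs(phi1) < 1.0
    # zeros of -phi2*B^2 - phi1*B + 1 = 0
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    r1 = (phi1 + disc) / (-2.0 * phi2)
    r2 = (phi1 - disc) / (-2.0 * phi2)
    return abs(r1) > 1.0 and abs(r2) > 1.0

print(stationary_ar([0.5]), stationary_ar([1.0]), stationary_ar([0.5, 0.3]))
```

The random-walk case φ₁ = 1 (model (1.5c) without the MA part) is flagged as nonstationary, since its zero lies on the unit circle.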
A R M A models, intervention problems and outlier detection 87
  Y_t = C + Σ_{j=1}^{J} {ω_j(B)/δ_j(B)} B^{b_j} X_{jt} + N_t,   (1.6)

where ω_j(B) and δ_j(B) are polynomials of the form in (1.2), X_{1t}, …, X_{Jt} are J
input time series, and N_t is the noise term following an ARMA(p, q) model. In
(1.6) the input series {X_{jt}} may be deterministic or stochastic and in the latter
case they are required to be independent of the noise series {N_t}. The effects of
exogenous interventions such as governmental policy changes, strikes, and
promotional campaigns on Y_t can frequently be modelled by making X_{jt} in
(1.6) indicator variables (see Box and Tiao, 1975).
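A minimal sketch of one intervention term of the form {ω/(1 - δB)}S_t acting on a step indicator S_t (parameter values are illustrative, not from the text): the effect rises geometrically from the intervention time toward the gain ω/(1 - δ):

```python
def intervention_effect(omega, delta, n, T0):
    """Response y_t = omega*S_t + delta*y_{t-1}, i.e. the transfer-function
    term (omega/(1 - delta*B)) S_t, for a step indicator S_t = 1, t >= T0."""
    y, prev = [], 0.0
    for t in range(n):
        s = 1.0 if t >= T0 else 0.0
        prev = omega * s + delta * prev
        y.append(prev)
    return y

eff = intervention_effect(omega=1.0, delta=0.5, n=20, T0=5)
print(eff[4], eff[5], eff[-1])   # 0 before; rises toward omega/(1-delta) = 2
```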
where the Φ's and Θ's are k × k matrices, and the zeros of the determinantal
polynomial |Φ(B)| are required to lie on or outside the unit circle. The vector
series will be stationary if (i) all the zeros of |Φ(B)| lie outside the unit
circle or (ii) all the zeros of |Φ(B)| on the unit circle are divisible by
[Adj Φ(B)]Θ(B).
For simplicity in presenting the main results, until further notice we shall
assume that C = 0. To discuss the properties of (1.3), it is convenient to begin by
supposing that the series {Z_t} actually starts at some time point m. Suppose
now that t - m + 1 observations Z_m, …, Z_t are available. From (1.3), we can
then write
The ψ and π weights

From (2.1), we obtain the ψ form of the model, where ψ(B) = 1 + ψ₁B + ψ₂B² + ⋯.
It follows that, for l ≥ 0, the ψ's satisfy the difference equation

  Φ(B)ψ_l = -θ_l,

where ψ₀ = 1, ψ_j = 0 for j < 0 and θ_l = 0 for l > q. Thus for l ≥ r, the ψ's can be
expressed in the form

  ψ_l = Σ_{i=1}^{p₀} A_i α_i^l,

where p₀ ≤ p, A₁, …, A_{p₀} are polynomials in l and α₁⁻¹, …, α_{p₀}⁻¹ are the p₀
distinct zeros of Φ(B). Analogous expressions can be obtained for the ψ*'s.
From (2.2), we can write Z_t in the ψ form as

  Z_t = a_t + Σ_{h=1}^{t-m} ψ_h a_{t-h} + Σ_{h=t-(m+r)+1}^{t-m} ψ*_h w_{t-h}.   (2.6)
  D₀⁻¹D₁Z_t - D₀⁻¹w = a_t,

where

  D₀⁻¹D₁ = ( 1                       )
           ( -π₁       1             )
           ( ⋮              ⋱        )   (2.7)
           ( -π_{t-m}  ⋯  -π₁   1    )
where π₀ = -1, π_j = 0 for j < 0 and φ_l = 0 for l > p. Thus, for l ≥ r, the π's can
be written in a form analogous to that of the ψ's.
Note that by supposing the series to start at time m and introducing the initial
values w_m, …, w_{m+r-1}, we have obtained two alternative forms of the ARMA
model (1.3). The ψ form (2.6) shows how Z_t is affected by current and previous
white noise shocks a_{t-j}, and the π form (2.10) indicates how Z_t relates to its
own past values Z_{t-j}. In obtaining (2.6) and (2.10) no assumption is made on
the nature of Φ(B) and Θ(B).
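The ψ weights can be computed recursively from the difference equation above (a sketch, with Φ(B) = 1 - φ₁B - ⋯ and Θ(B) = 1 - θ₁B - ⋯ as in (1.3); the function name is ours):

```python
def psi_weights(phi, theta, n):
    """psi weights of Phi(B)Z_t = Theta(B)a_t from the recursion
    psi_l = sum_i phi_i * psi_{l-i} - theta_l, with theta_l = 0 for l > q."""
    psi = [1.0]
    for l in range(1, n + 1):
        v = sum(phi[i - 1] * psi[l - i]
                for i in range(1, len(phi) + 1) if l - i >= 0)
        if l <= len(theta):
            v -= theta[l - 1]
        psi.append(v)
    return psi

# AR(1): psi_l = phi^l;  MA(1): psi_1 = -theta_1, psi_l = 0 afterwards.
print(psi_weights([0.5], [], 4))   # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(psi_weights([], [0.3], 3))   # [1.0, -0.3, 0, 0]
```

The geometric decay in the AR(1) case illustrates the general form ψ_l = Σ A_i α_i^l, here with a single zero α⁻¹ = 2 of Φ(B).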
so that Z_t will be stationary in this asymptotic sense. In what follows, we shall
refer to (2.11) as the stationarity condition of the ARMA model, which is
equivalent to requiring that the zeros of Φ(B) lie outside the unit circle.

Let us denote
Taking expectations on both sides and using (2.6), we have that, for l ≥ 0,

  γ(l) - Σ_{i=1}^{p} Φ_i γ(l - i) = -σ² Σ_{j=l}^{q} θ_j ψ_{j-l}   for 0 ≤ l ≤ q,
                                  = 0                              for l > q,   (2.14)

where θ₀ = -1, and note that γ(l) = γ(-l).
Autocorrelation function

The autocorrelation function

  ρ(l) = γ(l)/γ(0)

can be obtained directly from (2.14). Note that if Φ(B) = 1, i.e. (1.3) is a
moving average model of order q, MA(q), then

  ρ(l) = -θ_q (1 + θ₁² + ⋯ + θ_q²)⁻¹   for l = q,
  ρ(l) = 0                             for l > q.   (2.16)

This is an important property which will prove useful in the model building
process.
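The cutting-off property (2.16) is easy to verify numerically (a sketch computing the MA(q) autocovariances from its ψ weights; the function name is ours):

```python
def ma_acf(theta, max_lag):
    """Autocorrelations of the MA(q) model Z_t = (1 - theta1*B - ...)a_t;
    rho(l) = 0 for l > q, and (2.16) holds at l = q."""
    c = [1.0] + [-t for t in theta]       # psi weights of an MA(q)
    gamma = []
    for l in range(max_lag + 1):
        gamma.append(sum(c[j] * c[j + l] for j in range(len(c) - l)))
    return [g / gamma[0] for g in gamma]

rho = ma_acf([0.6], 3)                    # MA(1) with theta1 = 0.6
print(rho)   # rho(1) = -0.6/(1 + 0.36), zero beyond lag 1
```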
  γ(k, m) = [γ(m + 1), …, γ(m + k)]′

and

  Γ(k, m) = ( γ(m)        γ(m - 1)  ⋯  γ(m - k + 1) )
            ( ⋮                         ⋮            )
            ( γ(m + k - 1)  ⋯           γ(m)         ),

where Φ(p) = (Φ₁, …, Φ_p)′ and c = (c₁, …, c_p)′. When Θ(B) = 1, c = 0 and in
this case we can express Φ(p) as
This result then leads to defining the following function of the autocor-
relation coefficients ρ(1), …, ρ(l), known as the partial autocorrelation
function. It has the property that, for a stationary AR(p) model, i.e. Θ(B) = 1,
φ(l) = 0 for l > p.

The important property of φ(l) is that it vanishes for l > p when the model is
AR(p). This is akin to the property of the autocorrelation coefficients ρ(l)
with respect to the MA(q) model, and will prove to be a useful tool in model
building.
Consider the transformed series

$$w_t = \phi(B) Z_t , \quad (2.23)$$

which follows an MA($q$) model, and let $\rho(p, l)$ denote its lag-$l$ autocorrelation. Then, from (2.16),

$$\rho(p, l) = \begin{cases} -\theta_q (1 + \theta_1^2 + \cdots + \theta_q^2)^{-1} , & l = q , \\ 0 , & l > q . \end{cases} \quad (2.24)$$

In general, for $k \geq 1$, let $\varphi^0(k) = (\varphi_1^0(k), \ldots, \varphi_k^0(k))'$ denote the autoregressive coefficient vector obtained from the extended Yule-Walker equations of order $k$, and define $\rho(k, l)$ as the lag-$l$ autocorrelation of $b'(Z_t, Z_{t-1}, \ldots, Z_{t-k})'$,
where $b' = (1, -\varphi^0(k)')$; it is easily seen that $\rho(k, l)$ is a function of the autocorrelations $\rho(1), \ldots, \rho(k + l)$.
Now, for $k = p$ and $l \geq q$, $\rho(k, l)$ has the 'cutting off' property (2.24) for the ARMA($p, q$) model, which is akin to the property of $\rho(l)$ in (2.16) for the MA($q$) model. Following the work of Tsay and Tiao (1984), we shall call $\rho(k, l)$ the $k$th extended autocorrelation of lag $l$ for $Z_t$. We shall also denote $\rho(l) = \rho(0, l)$ so that $\rho(k, l)$ will be defined for $k \geq 0$ and $l \geq 1$. It can be readily shown that for a stationary ARMA($p, q$) model, when $k \geq p$,

$$\rho(k, l) = \begin{cases} c , & l = q + k - p , \\ 0 , & l > q + k - p , \end{cases} \quad (2.27)$$

where $|c| < 1$. The above property for $\rho(k, l)$ will be exploited later in the model building process.
which is the minimum mean squared error (m.m.s.e.) forecast of $Z_{T+l}$, and denote by $e_T(l) = Z_{T+l} - \hat{Z}_T(l)$ the forecast error. Future shocks have zero conditional expectation, so that $\hat{a}_T(i) = 0$ for $i > 0$. Thus, the $\hat{Z}_T(l)$'s can be recursively calculated from (2.30) once the expected values $\hat{a}_T(-j)$, $j = 0, \ldots, q - 1$, are determined, and for $l > q$ the $\hat{Z}_T(l)$'s satisfy the difference equation

$$\phi(B) \hat{Z}_T(l) = 0 , \quad (2.31)$$
$$\hat{a}_T(-j) = Z_{T-j} - \sum_{h=1}^{T-j-m} \pi_h Z_{T-j-h} + \sum_{h=T-j-(m+r)+1}^{T-j-m} \pi_h^* E(w_{T-j-h} \mid Z_{(T)}) . \quad (2.32)$$
It can be shown that when all the zeros of $\theta(B)$ lie outside the unit circle, both $\pi_h$ and $\pi_h^*$ approach zero as $h \to \infty$, and for $T - j \gg m$ the third term on the right-hand side of (2.32) can be ignored, so that

$$\hat{a}_T(-j) \approx Z_{T-j} - \sum_{h=1}^{T-j-m} \pi_h Z_{T-j-h} .$$

Thus, approximately, $\hat{a}_T(-j)$ only depends on $Z_{T-j}, \ldots, Z_m$. Note that the requirement that all zeros of $\theta(B)$ lie outside the unit circle is known as the 'invertibility condition' of the ARMA($p, q$) model. For a discussion of noninvertible models, see e.g. Harvey (1981).
It is of interest to study the behavior of the forecasts $\hat{Z}_T(l)$ as a function of the lead time $l$. It can be shown from (2.31) that, for $l > q$,

$$\hat{Z}_T(l) = \sum_{i=1}^{p_0} A_i^{(T)}(l)\, \alpha_i^{l} ,$$

where, as in (2.5), $p_0 \leq p$, $\alpha_1^{-1}, \ldots, \alpha_{p_0}^{-1}$ are the $p_0$ distinct zeros of $\phi(B)$, and $A_1^{(T)}(l), \ldots, A_{p_0}^{(T)}(l)$ are polynomials in $l$ whose coefficients are linear functions of the observations $Z_T, Z_{T-1}, \ldots$,
where it is to be noted that

$$e_T(j) = 0 \quad \text{for } j \leq 0$$

and

$$a_{T+i} - \hat{a}_T(i) = a_{T+i} \quad \text{for } i > 0 .$$

For $l = 1, \ldots, L$, $L > \max(p, q)$, we can write the vector of forecast errors $e_{(L)} = (e_T(1), \ldots, e_T(L))'$ in the form
$$D_{(L)} e_{(L)} = w_{(L)} - E(w_{(L)} \mid Z_{(T)}) , \quad (2.35)$$

where $w_{(L)} = (w_{T+1}, \ldots, w_{T+L})'$ and $D_{(L)}$ is the $L \times L$ lower triangular matrix with ones on the diagonal and the negative $\pi$ weights below it. It follows that, apart from a term which is negligible for large $T$,

$$e_T(l) = \sum_{j=0}^{l-1} \psi_j a_{T+l-j} , \quad \psi_0 = 1 ,$$

so that

$$\mathrm{Cov}(e_{(L)}) \doteq \sigma_a^2 \Psi_{(L)} \Psi_{(L)}' , \quad (2.37)$$

where $\Psi_{(L)}$ is lower triangular with $(i, j)$th element $\psi_{i-j}$, and in particular

$$\mathrm{Var}(e_T(l)) = \sigma_a^2 \sum_{j=0}^{l-1} \psi_j^2 . \quad (2.39)$$
For stationary models, $\mathrm{Var}(e_T(l)) \to \gamma(0)$ as $l \to \infty$, and for nonstationary models, $\mathrm{Var}(e_T(l))$ increases without bound as $l$ increases.
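As a numerical sketch of (2.39) (the function names and this recursion-based implementation are ours, not from the text): the $\psi$ weights of $\phi(B)Z_t = \theta(B)a_t$ satisfy $\psi_0 = 1$ and $\psi_j = \sum_i \phi_i \psi_{j-i} - \theta_j$, so the forecast-error variance can be accumulated directly.

```python
# Sketch: psi weights and forecast-error variance, equation (2.39).
# The recursion psi_j = phi_1 psi_{j-1} + ... + phi_p psi_{j-p} - theta_j
# (psi_0 = 1) follows from phi(B) psi(B) = theta(B).

def psi_weights(phi, theta, n):
    """First n psi weights of the ARMA(p, q) model phi(B) Z_t = theta(B) a_t."""
    psi = [1.0]
    for j in range(1, n):
        s = sum(phi[i] * psi[j - 1 - i] for i in range(len(phi)) if j - 1 - i >= 0)
        if j <= len(theta):
            s -= theta[j - 1]
        psi.append(s)
    return psi

def forecast_error_variance(phi, theta, l, sigma2=1.0):
    """Var(e_T(l)) = sigma_a^2 * sum_{j=0}^{l-1} psi_j^2, equation (2.39)."""
    psi = psi_weights(phi, theta, l)
    return sigma2 * sum(p * p for p in psi)
```

For a stationary model the variance levels off toward $\gamma(0)$ as $l$ grows; for a model with a unit root in $\phi(B)$ the partial sums grow without bound, matching the remark above.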
It is important to remember that the m.m.s.e. forecasts of $Z_{T+l}$, $\hat{Z}_T(l)$, $l = 1, \ldots, L$, and the covariance matrix $\mathrm{Cov}(e_{(L)})$ are simply the mean vector and covariance matrix of the conditional distribution $p(Z_{T+1}, \ldots, Z_{T+L} \mid Z_{(T)})$.
While these are important features of the conditional distribution, they are by
no means the only features of practical importance in forecasting. For example,
in some situations one may be interested in the probability of future obser-
vations exceeding some specified levels or in assessing the probability that a
turning point occurs at a specific point of time. The solutions of these and other
forecasting problems are all based on the joint density $p(Z_{T+1}, \ldots, Z_{T+L} \mid Z_{(T)})$.
The sample autocorrelation function (SACF) of the series is

$$r(l) = \frac{\sum_{t=1}^{n-l} (Z_t - \bar{Z})(Z_{t+l} - \bar{Z})}{\sum_{t=1}^{n} (Z_t - \bar{Z})^2} , \quad (2.41)$$

where $\bar{Z} = n^{-1} \sum_{t=1}^{n} Z_t$.
Thus, from (2.42), if the SACF (as a function of the lag $l$) of the original series is
persistently close to one as l increases, one then forms the series of the first
differences Wt = ( 1 - B ) Z t and studies its SACF to determine whether further
differencing is called for. Once stationarity is achieved, from (2.16) and (2.41), a
'cutting off' pattern after a certain lag, say q, in the SACF will then lead to
tentative specification of a MA(q) model.
For stationary models, the $r(l)$'s are asymptotically normally distributed, but the covariance structure is rather complex in general (see Bartlett, 1946). For the MA($q$) model and $l > q$, the asymptotic variance of $r(l)$ is

$$\mathrm{Var}(r(l)) \doteq \frac{1}{n}\Bigl(1 + 2\sum_{j=1}^{q} \rho^2(j)\Bigr) . \quad (2.43)$$

By substituting $r(j)$ for the unknown $\rho(j)$ in (2.43), the estimated variances of the $r(l)$'s are often used to help specify the order $q$ of an MA model.
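A minimal sketch of the SACF and the Bartlett-type standard error just described (the implementation and names are ours):

```python
# Sketch: sample autocorrelations (2.41) and a Bartlett-type standard
# error (2.43), as used to pick the order q of an MA model.

def sacf(z, max_lag):
    """Sample autocorrelations r(1), ..., r(max_lag)."""
    n = len(z)
    zbar = sum(z) / n
    d = [x - zbar for x in z]
    c0 = sum(x * x for x in d)
    return [sum(d[t] * d[t + l] for t in range(n - l)) / c0
            for l in range(1, max_lag + 1)]

def bartlett_se(r, l, n):
    """Estimated S.E. of r(l) under an MA(l-1) hypothesis, cf. (2.43);
    r holds r(1), r(2), ... and only r(1), ..., r(l-1) are used."""
    return ((1 + 2 * sum(rj * rj for rj in r[: l - 1])) / n) ** 0.5
```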
SPACF

The sample partial autocorrelations

$$\hat{\varphi}(l) , \quad l = 1, 2, \ldots , \quad (2.44)$$

are obtained by substituting the $r(j)$'s for the $\rho(j)$'s in the definition of $\varphi(l)$. They are consistent,

$$\hat{\varphi}(l) \xrightarrow{P} \varphi(l) , \quad (2.45)$$

and the $\hat{\varphi}(l)$'s are asymptotically normally distributed. Also, for a stationary AR($p$) model,

$$\mathrm{Var}(\hat{\varphi}(l)) \doteq \frac{1}{n} , \quad l > p . \quad (2.46)$$
The properties in (2.45) and (2.46) make the SPACF a convenient tool for specifying the order $p$ of a stationary AR model in practice. For nonstationary models, i.e. when $\phi(B)$ contains the factor $U(B)$ in (1.5), the asymptotic property of $\hat{\varphi}(l)$ is rather complex, however.
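The sample partial autocorrelations can be computed from the $r(l)$'s; a sketch using the Durbin-Levinson recursion (an implementation choice of ours, equivalent to solving the successive Yule-Walker systems):

```python
# Sketch: sample PACF via the Durbin-Levinson recursion, so that
# phi-hat(l) is the last AR coefficient of the order-l Yule-Walker fit.

def spacf(r, max_lag):
    """r = [r(1), r(2), ...]; returns [phi(1), ..., phi(max_lag)]."""
    pacf = []
    phi_prev = []                       # coefficients of the order (l-1) fit
    for l in range(1, max_lag + 1):
        if l == 1:
            phi_l = r[0]
            phi_prev = [phi_l]
        else:
            num = r[l - 1] - sum(phi_prev[j] * r[l - 2 - j] for j in range(l - 1))
            den = 1 - sum(phi_prev[j] * r[j] for j in range(l - 1))
            phi_l = num / den
            phi_prev = [phi_prev[j] - phi_l * phi_prev[l - 2 - j]
                        for j in range(l - 1)] + [phi_l]
        pacf.append(phi_l)
    return pacf
```

For an AR(1) with $\rho(l) = \phi^l$ the recursion returns $\varphi(1) = \phi$ and zeros thereafter, the 'cutting off' pattern used for order specification.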
In the past, the SACF and SPACF have been the most commonly used
statistical tools for tentative model specification. Specifically, a persistently high
SACF signals the need for differencing, a moving average model is suggested
by SACF exhibiting a small number of large values at low lags and an
autoregressive model, by SPACF showing a similar 'cutting off' pattern. Also,
for series exhibiting a strong seasonal behavior of period $s$, a persistently high SACF at lags which are multiples of $s$ signals the need to apply the 'seasonal differencing' operator $1 - B^s$ to the data, and so on. The weaknesses of these
two methods are (i) subjective judgement is often required to decide on the
order of differencing and (ii) for stationary mixed autoregressive moving
average models, both SACF and SPACF tend to exhibit a gradual 'tapering off'
behavior making specification of the orders of the autoregressive and the
moving average parts difficult.
ESACF
Recently, several approaches have been proposed to handle the mixed model
specification problems. These include the R- and S-array methods of Gray et
al. (1978) and the generalized partial autocorrelations by Woodward and Gray
(1981). In what follows, we discuss the procedure proposed by Tsay and Tiao
(1984), using what they called the extended sample autocorrelation function
(ESACF) for tentative specification of the order $(p, q)$ for the general nonstationary and stationary ARMA model (1.3). The proposed procedure eliminates the need to difference or in general transform the series to achieve stationarity, and directly specifies the values of $p$ and $q$.
For stationary ARMA models, estimates $\hat{\rho}(k, l)$ of the EACF $\rho(k, l)$ as defined in (2.26) can be obtained upon replacing the $\rho(l)$'s in (2.26) by their sample counterparts $r(l)$'s. In this case, the estimated $\hat{\rho}(k, l)$'s will be consistent for the $\rho(k, l)$'s and hence the property (2.27) can be exploited for model identification. However, for nonstationary models, the $\hat{\rho}(k, l)$'s will not in general have the asymptotic property given by the right-hand side of (2.27).
ARMA models, intervention problems and outlier detection 99
Now for ARMA($p, q$) models, one can view the extended sample autocorrelation function approach as consisting of the following two steps. We first attempt to find consistent estimates of the autoregressive parameters in order to transform $Z_t$ into a moving average process. We then make use of the 'cutting off' property of the autocorrelation function of the transformed process for model identification.
For estimating the autoregressive parameters, the following iterated regression approach has been proposed. First, let $\hat{\phi}_{1(k)}^{(0)}, \ldots, \hat{\phi}_{k(k)}^{(0)}$ be the ordinary least squares (OLS) estimates from fitting the AR($k$) regression

$$Z_t = \phi_{1(k)}^{(0)} Z_{t-1} + \cdots + \phi_{k(k)}^{(0)} Z_{t-k} + e_{k,t}^{(0)} \quad (2.47)$$

to the data, where $e_{k,t}^{(0)}$ denotes the error term. The first iterated AR($k$) regression is given by

$$Z_t = \phi_{1(k)}^{(1)} Z_{t-1} + \cdots + \phi_{k(k)}^{(1)} Z_{t-k} + \beta_{1(k)}^{(1)} \hat{e}_{k,t-1}^{(0)} + e_{k,t}^{(1)} , \quad (2.48)$$

where $\hat{e}_{k,t}^{(0)} = (1 - \hat{\phi}_{1(k)}^{(0)} B - \cdots - \hat{\phi}_{k(k)}^{(0)} B^k) Z_t$ is the residual from (2.47) and $e_{k,t}^{(1)}$ denotes the error term. This yields a new set of OLS estimates $\hat{\phi}_{1(k)}^{(1)}, \ldots, \hat{\phi}_{k(k)}^{(1)}$. In general, for $l = 1, 2, \ldots$, the estimates $\hat{\phi}_{1(k)}^{(l)}, \ldots, \hat{\phi}_{k(k)}^{(l)}$ are obtained from the $l$th iterated AR($k$) regression

$$Z_t = \phi_{1(k)}^{(l)} Z_{t-1} + \cdots + \phi_{k(k)}^{(l)} Z_{t-k} + \beta_{1(k)}^{(l)} \hat{e}_{k,t-1}^{(l-1)} + \cdots + \beta_{l(k)}^{(l)} \hat{e}_{k,t-l}^{(0)} + e_{k,t}^{(l)} , \quad (2.49)$$

where

$$\hat{e}_{k,t}^{(i)} = (1 - \hat{\phi}_{1(k)}^{(i)} B - \cdots - \hat{\phi}_{k(k)}^{(i)} B^k) Z_t - \sum_{h=1}^{i} \hat{\beta}_{h(k)}^{(i)} \hat{e}_{k,t-h}^{(i-h)}$$

(i.e. the residuals from the $i$th iterated regression) and $e_{k,t}^{(l)}$ is the error term. In practice, these iterated estimates $\hat{\phi}_{j(k)}^{(l)}$ can be obtained from the OLS estimates of the autoregressive coefficients from fitting AR($k$), \ldots, AR($k + l$) to $Z_t$, using the recursion

$$\hat{\phi}_{j(k)}^{(l)} = \hat{\phi}_{j(k+1)}^{(l-1)} - \hat{\phi}_{j-1(k)}^{(l)}\, \hat{\phi}_{k+1(k+1)}^{(l-1)} / \hat{\phi}_{k(k)}^{(l-1)} . \quad (2.50)$$
It can be shown that

$$\hat{\phi}^{(l)}(p) \xrightarrow{P} \phi(p) , \quad l \geq q , \quad (2.51)$$

where $\hat{\phi}^{(l)}(p) = (\hat{\phi}_{1(p)}^{(l)}, \ldots, \hat{\phi}_{p(p)}^{(l)})'$ and $\phi(p) = (\phi_1, \ldots, \phi_p)'$.
The transformed series corresponding to the $l$th iterated AR($k$) estimates is defined as

$$\hat{W}_{k,t}^{(l)} = (1 - \hat{\phi}_{1(k)}^{(l)} B - \cdots - \hat{\phi}_{k(k)}^{(l)} B^k) Z_t . \quad (2.53)$$
Let $r(k, l)$ denote the lag-$l$ sample autocorrelation of the transformed series $\hat{W}_{k,t}^{(l)}$. Also, we may denote $r(0, l) = r(l)$ for the ordinary sample autocorrelations, and shall call $r(k, l)$ the $k$th extended sample autocorrelation of lag $l$. Tsay and Tiao show that for the general ARMA($p, q$) model in (1.3), stationary or nonstationary, when $k \geq p$, $r(k, l)$ converges in probability to zero for $l > q + k - p$ and to a nonzero value for $l = q + k - p$.
Table 2.1
The ESACF table

AR \ MA   0        1        2        3
0         r(0,1)   r(0,2)   r(0,3)   r(0,4)
1         r(1,1)   r(1,2)   r(1,3)   r(1,4)
2         r(2,1)   r(2,2)   r(2,3)   r(2,4)
3         r(3,1)   r(3,2)   r(3,3)   r(3,4)
Table 2.2
The asymptotic ESACF table for an ARMA(1, 2) model, where x denotes a nonzero value and * denotes a value between -1 and 1

AR \ MA   0   1   2   3   4   5   6   7
0         *   x   x   x   x   x   x   x
1         *   x   0   0   0   0   0   0
2         *   x   x   0   0   0   0   0
3         *   x   x   x   0   0   0   0
4         *   x   x   x   x   0   0   0
In general, we are thus led to search in the ESACF table for the vertex of a triangle of asymptotic 'zero' values having boundary lines $k = c_1 \geq 0$ and $l - k = c_2 > 0$, and tentatively identify $p = c_1$ and $q = c_2$ as the orders of the ARMA model. In practice, for finite samples, the $r(k, l)$'s will not be zero. The asymptotic variance of the $r(k, l)$'s can be approximately obtained by using Bartlett's formula. As a crude but simple approximation, we may use the value $(n - k - l)^{-1}$, on the hypothesis that the transformed series $\hat{W}_{k,t}^{(l)}$ is white noise, to estimate the variance of $r(k, l)$. Of course, it is understood that this simple approximation might underestimate the variance of $r(k, l)$, and a further study of this subject is needed. As a preliminary but informative guide for model specification, the ESACF table may be supplemented by an analogous table consisting of indicator symbols, x denoting values greater than $+2$ or less than $-2$ standard deviations and 0 for values in between.
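A sketch of how the simplified indicator table can be produced from a matrix of extended sample autocorrelations, using the crude $(n - k - l)^{-1}$ variance approximation above (function name is ours):

```python
# Sketch: simplified ESACF indicator table from a matrix of extended
# sample autocorrelations r[k][l-1], flagging 'x' when the value exceeds
# two standard deviations under the white-noise approximation.

def simplified_esacf(r, n):
    """Return a table of 'x'/'0' symbols: 'x' if |r(k, l)| > 2 (n-k-l)^(-1/2)."""
    table = []
    for k, row in enumerate(r):
        syms = []
        for j, rkl in enumerate(row):
            l = j + 1
            se = (n - k - l) ** -0.5
            syms.append('x' if abs(rkl) > 2 * se else '0')
        table.append(syms)
    return table
```

The vertex of the resulting triangle of '0's is then read off as the tentative order $(p, q)$, as described above.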
2.3.2. Estimation

Once the order $(p, q)$ of the model (1.3) is tentatively specified, the parameters $(C, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma_a^2)$ can now be estimated by maximizing the corresponding likelihood function. An extensive literature exists on properties of the likelihood function, various simplifying approximations to this function, and asymptotic properties of the associated maximum likelihood estimates (see e.g. Anderson, 1971; Newbold, 1974; Fuller, 1976; Ljung and Box, 1979). In what follows, we consider two useful approximations, the first of which has been called the 'conditional likelihood function' proposed by Box and Jenkins (1970) and the second, the 'exact likelihood function' by Hillmer and Tiao (1979).
With $n$ observations $Z = (Z_1, \ldots, Z_n)'$ from the model (1.3) and assuming $m = 1$, consider the transformed vector $W = (W_1, \ldots, W_n)'$, where

$$W = D_\phi Z . \quad (2.55)$$

Partitioning $W' = (W_{(1)}', W_{(2)}')$, where $W_{(1)} = (W_1, \ldots, W_p)'$ and $W_{(2)} = (W_{p+1}, \ldots, W_n)'$, we can write the joint distribution of $W$ as

$$p(W) = p(W_{(1)} \mid W_{(2)})\, p(W_{(2)}) . \quad (2.56)$$
Both the 'conditional' and the 'exact' likelihood approaches are based on the distribution $p(W_{(2)})$ by ignoring $p(W_{(1)} \mid W_{(2)})$; and it can in fact be shown that, for moderately large $n$, the parameter estimates are little affected by $p(W_{(1)} \mid W_{(2)})$. Now from (1.3) and (2.55), the probabilistic structure of $W_{(2)}$ is given by

$$W_t = C - \sum_{i=1}^{q} \theta_i a_{t-i} + a_t , \quad t = p + 1, \ldots, n . \quad (2.57)$$
$$\ell_0(C, \phi, \theta, \sigma_a^2 \mid Z) \propto \sigma_a^{-(n-p)} \exp\Bigl(-\frac{1}{2\sigma_a^2} \sum_{t=p+1}^{n} a_t^2\Bigr) , \quad (2.58)$$

where for given parameter values of $(C, \phi, \theta)$ the $a_t$'s are recursively calculated from

$$a_t = Z_t - C - \sum_{i=1}^{p} \phi_i Z_{t-i} + \sum_{i=1}^{q} \theta_i a_{t-i} , \quad (2.59)$$

with starting values $a_p = \cdots = a_{p-q+1} = 0$.
Maximizing (2.58) is equivalent to minimizing the conditional sum of squares $S(C, \phi, \theta) = \sum_{t=p+1}^{n} a_t^2$, and the corresponding estimate of $\sigma_a^2$ is

$$\hat{\sigma}_a^2 = \frac{1}{n} S(\hat{C}, \hat{\phi}, \hat{\theta}) . \quad (2.61)$$
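The conditional approach of (2.58)-(2.59) thus reduces to a sum of squares that can be computed recursively; a sketch (function name is ours, presample shocks set to zero as in the conditional approach):

```python
# Sketch: the conditional sum of squares behind (2.58)-(2.59), with the
# presample shocks a_p, ..., a_{p-q+1} set to zero.

def conditional_css(z, c, phi, theta):
    """S(C, phi, theta) = sum of a_t^2 for t = p+1, ..., n via recursion (2.59)."""
    p, q = len(phi), len(theta)
    a = {}                               # computed shocks, keyed by 0-based t
    s = 0.0
    for t in range(p, len(z)):
        at = z[t] - c - sum(phi[i] * z[t - 1 - i] for i in range(p))
        at += sum(theta[i] * a.get(t - 1 - i, 0.0) for i in range(q))
        a[t] = at
        s += at * at
    return s
```

In practice this function would be handed to a numerical optimizer over $(C, \phi, \theta)$; the minimized value divided by $n$ gives $\hat{\sigma}_a^2$ as in (2.61).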
In the 'exact' approach, the assumption $a_p = \cdots = a_{p-q+1} = 0$ is not made, and after some algebraic reduction it can be shown that the likelihood function takes the form

$$\ell(C, \phi, \theta, \sigma_a^2 \mid Z) \propto \sigma_a^{-(n-p)} |\hat{D}|^{-1/2} \exp\Bigl(-\frac{1}{2\sigma_a^2} S_*(C, \phi, \theta)\Bigr) , \quad (2.62)$$

where $S_*(C, \phi, \theta)$ is a sum of squares in which the presample shocks are replaced by their conditional expectations, and $\hat{D} = I_q + R'M'MR$, with $M$ a lower triangular matrix in the $\pi$ weights. The corresponding estimate of $\sigma_a^2$ is

$$\hat{\sigma}_a^2 = \frac{1}{n - p} S_*(\hat{C}, \hat{\phi}, \hat{\theta}) . \quad (2.66)$$
In practice, one uses the conditional approach in the initial phases of the iterative modeling process and switches to the exact methods towards the end.
2.3.3. Diagnostic checking

The fitted model is then checked through an analysis of the residuals

$$\hat{a}_t = Z_t - \hat{C} - \sum_{i=1}^{p} \hat{\phi}_i Z_{t-i} + \sum_{i=1}^{q} \hat{\theta}_i \hat{a}_{t-i} \quad (2.67)$$

from the fitted model. Useful tools include plotting of residuals against time to spot outliers (see later discussion in Subsection 3.3) and changes in level and variability, and studying the sample autocorrelation function $r_a(l)$ of the residuals to determine if it is consonant with that of a white noise process. A 'portmanteau' criterion originally proposed by Box and Pierce (1970) and later modified by Ljung and Box (1978) is given by

$$Q = n(n+2) \sum_{l=1}^{m} \frac{r_a^2(l)}{n - l} . \quad (2.68)$$
On the hypothesis that the $Z_t$'s are generated from a stationary ARMA($p, q$) model, $Q$ in (2.68) obtained from the residuals will be approximately distributed as $\chi^2$ with $m - (p + q)$ degrees of freedom. It should be noted that in practice, when serious inadequacy occurs, patterns of the individual $r_a(l)$'s often provide useful information about directions in which to modify the tentatively specified model.
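The modified portmanteau statistic of Ljung and Box (1978) has the well-known form $Q = n(n+2)\sum_{l=1}^{m} r_a^2(l)/(n-l)$; a one-line sketch (function name is ours):

```python
# Sketch: the Ljung-Box portmanteau statistic from residual
# autocorrelations r_a(1), ..., r_a(m), to be compared against a
# chi-square with m - (p + q) degrees of freedom.

def ljung_box_q(ra, n):
    """Q = n(n+2) * sum_{l=1}^{m} r_a(l)^2 / (n - l)."""
    return n * (n + 2) * sum(r * r / (n - l)
                             for l, r in enumerate(ra, start=1))
```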
Fig. 3.1. The step input $S_t^{(T)}$ and the pulse input $P_t^{(T)}$, and examples of their dynamic responses.
by ordinary least squares, and then apply the identification procedures to the residuals $Y_t - \sum_i v_i(B) X_{it}$.

Once a model of the form (1.6) is tentatively specified, we can then estimate the intervention parameters and the parameters in the noise model for $N_t$ simultaneously via maximum likelihood. Specifically, write

$$Y_t = C + \sum_{j=1}^{J} u_{jt} + \phi^{-1}(B)\theta(B) a_t . \quad (3.3)$$
An additive outlier (AO) is defined as

$$Y_t = N_t + \omega \xi_t^{(t_0)} , \quad (3.4)$$

while an innovational outlier (IO) is defined as

$$Y_t = N_t + \frac{\theta(B)}{\phi(B)}\, \omega \xi_t^{(t_0)} , \quad (3.5)$$

where

$$\xi_t^{(t_0)} = \begin{cases} 1 , & t = t_0 , \\ 0 , & t \neq t_0 , \end{cases}$$
and $N_t$ follows the model (1.3). In terms of the $a_t$'s in (1.3) with $C = 0$, we have that

$$\text{(AO)} \quad Y_t = \frac{\theta(B)}{\phi(B)} a_t + \omega \xi_t^{(t_0)} \quad (3.6)$$

and

$$\text{(IO)} \quad Y_t = \frac{\theta(B)}{\phi(B)} \bigl(a_t + \omega \xi_t^{(t_0)}\bigr) . \quad (3.7)$$

Thus, the AO case may be called a 'gross error' model, since only the level of the $t_0$th observation is affected. On the other hand, an IO represents an extraordinary shock at $t_0$ influencing $Z_{t_0}, Z_{t_0+1}, \ldots$ through the memory of the system described by $\theta(B)/\phi(B)$.
where $F = B^{-1}$ and $\rho^2 = (1 + \pi_1^2 + \pi_2^2 + \cdots)^{-1}$. Thus, the best estimate of the effect of an IO at time $t_0$ is the residual $e_{t_0}$, while the best estimate of the effect of an AO is a linear combination of $e_{t_0}, e_{t_0+1}, \ldots$ with weights depending on the structure of the time series model. Note that the variance of $\hat{\omega}_A$ can be much smaller than $\sigma_a^2$. If desired, one may perform various tests among the hypotheses of no outlier ($H_0$), an IO at $t_0$ ($H_1$) and an AO at $t_0$ ($H_2$), using the standardized statistics

$$\lambda_1 = \hat{\omega}_I / \sigma_a \quad \text{for } H_1 \text{ vs. } H_0 , \qquad \lambda_2 = \hat{\omega}_A / (\rho \sigma_a) \quad \text{for } H_2 \text{ vs. } H_0 .$$
Detection of outliers

In practice, $t_0$ as well as the time series parameters are all unknown. If only $t_0$ is unknown, one may proceed by calculating $\lambda_1$ and $\lambda_2$ for each $t$, denoted by $\lambda_{1t}$ and $\lambda_{2t}$, and then make decisions based on the sampling properties given above. The time series parameters (the $\phi$'s, $\theta$'s, and $\sigma_a$) are also unknown, and it can be shown that the estimates of these parameters can be seriously biased by the existence of outliers. In particular, $\sigma_a$ will tend to be overestimated. These considerations have led to the following iterative procedure to handle a situation in which there may exist an unknown number of AO or IO outliers.
(i) Model the series $Y_t$ by supposing that there are no outliers (i.e. $Y_t = N_t$), and from the estimated model compute the residuals

$$e_t = \hat{\pi}(B) Y_t .$$

Let

$$\hat{\sigma}_a^2 = \frac{1}{n} \sum_{t=1}^{n} e_t^2$$

be the initial estimate of $\sigma_a^2$.
(ii) Compute $\hat{\lambda}_{it}$, $i = 1, 2$ and $t = 1, \ldots, n$, these being $\lambda_{1t}$ and $\lambda_{2t}$ evaluated with the estimated model. Let $|\hat{\lambda}_{t_0}| = \max_t \max_i |\hat{\lambda}_{it}|$. If $|\hat{\lambda}_{t_0}| = |\hat{\lambda}_{1t_0}| > c$, where $c$ is a predetermined positive constant usually taken to be some value between 3 and 4, then there is the possibility of an IO at $t_0$ and the best estimate of $\omega$ is $\hat{\omega}_{I t_0}$. Eliminate the effect of this possible IO by defining a new residual $e_{t_0} = e_{t_0} - \hat{\omega}_{I t_0} = 0$. If, on the other hand, $|\hat{\lambda}_{t_0}| = |\hat{\lambda}_{2t_0}| > c$, then there is the possibility of an AO at $t_0$, and the best estimate of its effect is $\hat{\omega}_{A t_0}$. The effect of this AO can be removed by defining the new residuals $e_t = e_t - \hat{\omega}_{A t_0} \hat{\pi}(B) \xi_t^{(t_0)}$, $t \geq t_0$. A new estimate $\hat{\sigma}_a^2$ is computed from the modified residuals.

(iii) Recompute $\hat{\lambda}_{1t}$ and $\hat{\lambda}_{2t}$ based on the same initial parameter estimates of the $\phi$'s and $\theta$'s but using the modified residuals and $\hat{\sigma}_a^2$, and repeat the process (ii).
(iv) When no more outliers are found in (iii), suppose that $J$ outliers (either IO or AO) have been tentatively identified at times $t_1, \ldots, t_J$. Treat these times as if they are known, and estimate the outlier parameters $\omega_1, \ldots, \omega_J$ and the time series parameters simultaneously using models of the form

$$Y_t = \sum_{j=1}^{J} \omega_j L_j(B) \xi_t^{(t_j)} + \frac{\theta(B)}{\phi(B)} a_t , \quad (3.10)$$

where $L_j(B) = 1$ for an AO and $L_j(B) = \theta(B)/\phi(B)$ for an IO at $t = t_j$. The entire process is repeated until all outliers are identified and their effects simultaneously estimated.
The above procedure is easy to implement since very few modifications to
existing software capable of dealing with ARMA and transfer function models
are needed to carry out the required computations. Based on simulation
studies, the performance of this procedure for estimating the autoregressive
coefficient of a simple AR(1) model compares favorably with the robust
estimation procedure proposed by Denby and Martin (1979) and Martin (1980).
While the latter procedures cover only the AR case, our iterative procedure
can be used for any ARMA model.
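As an illustration only, steps (i)-(iii) of the procedure can be sketched for an AR(1) noise model, where $\pi(B) = 1 - \phi B$ so the AO estimate involves just two residuals; the function, its tuning constant, and the simple iteration cap are our choices, not part of the text:

```python
# Sketch of the iterative outlier detection steps for an AR(1) noise
# model: pi(B) = 1 - phi B, so omega_A at t0 depends on e_t0, e_t0+1 only.

def detect_outliers_ar1(e, phi, c=3.5, max_iter=10):
    """Iteratively flag the largest IO/AO statistic until all fall below c."""
    e = list(e)
    rho2 = 1.0 / (1.0 + phi * phi)           # rho^2 = (1 + pi_1^2)^(-1)
    found = []
    for _ in range(max_iter):
        n = len(e)
        sigma = (sum(x * x for x in e) / n) ** 0.5
        best = None
        for t in range(n):
            lam1 = e[t] / sigma              # IO statistic
            w_a = rho2 * (e[t] - phi * (e[t + 1] if t + 1 < n else 0.0))
            lam2 = w_a / (rho2 ** 0.5 * sigma)   # AO statistic
            for kind, lam, w in (('IO', lam1, e[t]), ('AO', lam2, w_a)):
                if best is None or abs(lam) > abs(best[2]):
                    best = (kind, t, lam, w)
        kind, t, lam, w = best
        if abs(lam) <= c:
            break
        found.append((kind, t, w))
        if kind == 'IO':
            e[t] = 0.0                       # new residual e_t0 = 0
        else:                                # remove w * pi(B) xi^(t0) from e
            e[t] -= w
            if t + 1 < len(e):
                e[t + 1] += phi * w
    return found
```

A full implementation would re-estimate the model parameters jointly as in step (iv); the sketch only shows how the detection loop deflates $\hat{\sigma}_a$ as outlier effects are removed.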
4. Illustrative examples
Model specification
Tables 4.1a, 4.1b, 4.1c and 4.1d give, respectively, the SACF, SPACF, ESACF and the simplified ESACF for this example. Note that (i) the estimated standard errors of the SACF are computed using Bartlett's formula (2.43), (ii) those for the SPACF are obtained by assuming that the series is white noise, and (iii) the indicator symbol x is used in the simplified ESACF table when $|r(k, l)| > 2(n - k - l)^{-1/2}$. The SPACF suggests that an AR(3) model might be appropriate. On the other hand, an alternative ARMA(2, 3) model is suggested by the ESACF. The AR(3) model was used by Box and Jenkins; but we have found that the ARMA(2, 3) model gives a slightly better fit, and shall proceed with this model.
The fitted ARMA(2, 3) model (4.1) gives $\hat{\sigma}_a^2 = 0.0341$, where the values in the parentheses are the estimated standard errors of the parameter estimates.
Table 4.2 gives the SACF of the residuals from the fitted model (4.1). The
Table 4.1a
Sample autocorrelation function - gas data

l     1    2    3    4    5    6    7    8    9    10   11   12
r(l)  0.95 0.83 0.68 0.53 0.41 0.32 0.26 0.23 0.21 0.21 0.20 0.19
S.E.  0.06 0.10 0.12 0.13 0.14 0.14 0.15 0.15 0.15 0.15 0.15 0.15
Table 4.1b
Sample partial autocorrelation function - gas data

l      1    2     3    4    5    6     7    8    9    10    11    12
phi(l) 0.95 -0.79 0.34 0.12 0.06 -0.11 0.05 0.10 0.02 -0.07 -0.09 0.04
S.E.   0.06 0.06  0.06 0.06 0.06 0.06  0.06 0.06 0.06 0.06  0.06  0.06
Table 4.1c
Extended sample autocorrelation function - gas data
Table 4.1d
Simplified extended sample autocorrelation function - gas data

AR \ MA  0  1  2  3  4  5  6  7  8
0        x  x  x  x  x  x  x  x  x
1        x  x  x  0  0  x  x  x  0
2        x  x  x  0  0  0  0  0  0
3        x  0  x  x  0  0  0  0  0
4        x  0  x  x  x  0  0  0  0
5        x  x  0  x  0  0  0  0  0
6        x  x  x  x  0  0  0  0  0
associated values of the Q statistic indicate that the model seems adequate. However, inspection of these residuals themselves indicates the possible existence of a number of outliers. Specifically, applying the iterative outlier detection procedure discussed in Subsection 3.3 with c = 4.0, the following results are obtained:

t_0       43      55      113
lambda    6.62   -5.95   -4.23
Nature    AO      AO      AO
Simultaneous estimation of the effects of these three outliers and the time
Table 4.2
Sample autocorrelation function of residuals - gas data

l       1     2     3     4     5     6    7     8     9     10   11    12
r_a(l)  0.02 -0.02 -0.02  0.02 -0.05  0.05 0.04 -0.02 -0.05  0.07 0.13 -0.06
S.E.    0.06  0.06  0.06  0.06  0.06  0.06 0.06  0.06  0.06  0.06 0.06  0.06
Q       0.1   0.2   0.2   0.4   1.2   2.1  2.6   2.7   3.5   4.9  10.2  11.4
series model parameters yields the final model (4.2).
Forecasts
Employing (4.2) as the final model and treating the parameter estimates as the true values, Table 4.3 gives the forecasts $\hat{Z}_T(l)$ of future observations $Z_{T+l}$,
Table 4.3
Forecasts of future observations - gas data (T = 296)

Lead time l    1      2      3      4      5
Z_T(l)        -0.248 -0.192 -0.122 -0.076 -0.049
S.E.(e_T(l))   0.151  0.367  0.588  0.775  0.905

Lead time l    6      7      8      9      10
Z_T(l)        -0.036 -0.031 -0.031 -0.035 -0.039
S.E.(e_T(l))   0.986  1.031  1.054  1.064  1.068

Lead time l    11     12     13     14     15
Z_T(l)        -0.043 -0.047 -0.050 -0.052 -0.054
S.E.(e_T(l))   1.070  1.070  1.070  1.070  1.070

Lead time l    16     17     18     19     20
Z_T(l)        -0.055 -0.055 -0.056 -0.056 -0.056
S.E.(e_T(l))   1.070  1.070  1.070  1.070  1.070
$l = 1, \ldots, 20$, made at $T = 296$, the end of the data period. It is seen that as $l$ increases, $\hat{Z}_T(l)$ gradually approaches $-0.0553$, the estimated mean of the series. Also, the estimated standard error of the forecast error $e_T(l)$ increases from $0.151 = \hat{\sigma}_a$ for $l = 1$ to 1.070 for $l = 20$, which is essentially the estimated standard deviation of the series. The seven-fold increase, from 0.151 to 1.070, in the standard errors of forecasts shows that, although the series is stationary, substantial improvement in the accuracy of short-term forecasts is possible when past values of the series are utilized instead of relying solely on the mean level of the series.
It is noted that all the computations involved in this example are performed
using the package developed by Liu et al. (1983).
Fig. 4.2. Monthly averages of ozone at downtown Los Angeles (January 1955-December 1972).
The first intervention $I_1$ was expected to produce a step change in the ozone level at the beginning of 1960. As for $I_2$, the engine changes were expected to reduce the formation of ozone. Now in the absence of information on the proportion of cars with new design changes in the car population over time, we might represent the possible effect of $I_2$ as an annual trend reflecting the effect of the increased proportion of 'new design vehicles' in the population. As explained more fully in Tiao et al. (1975), because of the differences in meteorological conditions between the summer months and the winter months, the effect of $I_2$ would be different in these two seasons.

The above considerations have led to the following model for the monthly ozone observations $Y_t$,
and $N_t$ is the noise term. Inspection of the SACF of $Y_t$ and that of the seasonally differenced series $(1 - B^{12}) Y_t$ leads to the tentative model for the noise term $N_t$,
The models (4.3) and (4.4) allow for (i) a step change in the level of ozone of size $\omega_{01}$ associated with $I_1$, (ii) progressive yearly increments in ozone level of sizes $\omega_{02}$ and $\omega_{03}$, respectively, for the summer and the winter months associated with $I_2$, and (iii) seasonal and nonseasonal serial relationships in the data. Employing the estimation procedure described in Subsection 3.2, we have obtained the following fitting results:
Examination of the residuals shows that the model seems adequate for this
data set. Thus, there is evidence to support the following:
5. Some aspects of vector ARMA models
$$(1 - \Phi_{11}B) Z_{1t} = a_{1t} , \quad (5.4a)$$

$$Z_{2t} = \frac{\omega_0 - \omega_1 B}{1 - \Phi_{22}B}\, Z_{1t} + (1 - \Phi_{22}B)^{-1} a_{2t} , \quad (5.4b)$$
of $Z_t$, while feedbacks are allowed within each subset. This occurs when the $\Phi$'s and $\Theta$'s are lower block triangular. The important thing to note is that vector ARMA models cover both unidirectional and feedback relationships.
Model building procedures discussed in Subsection 2.3 can also be extended
to cover the vector case. For a discussion of the various modeling techniques,
see Quenouille (1957), Hannan (1970), Tiao and Box (1981), and Tiao and Tsay
(1983).
References
Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
Bartlett, M. S. (1946). On the theoretical specification and sampling properties of autocorrelated time-series. Suppl. J. Roy. Statist. Soc. 8, 27-41.
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-
integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
Box, G. E. P. and Tiao, G. C. (1975). Intervention analysis with application to economic and
environmental problems. J. Amer. Statist. Assoc. 70, 70-79.
Chang, I. and Tiao, G. C. (1983). Estimation of time series parameters in the presence of outliers.
Technical Report No. 8. Statistics Research Center, Graduate School of Business, University of
Chicago (to appear in Technometrics).
Denby, L. and Martin, R. D. (1979). Robust estimation of the first order autoregressive
parameters. J. Amer. Statist. Assoc. 74, 140--146.
Fox, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Ser. B 34, 350-363.
Fuller, W. A. (1976). Introduction to Time Series Analysis. Wiley, New York.
Gray, H. L., Kelly, G. D. and Mclntire, D. D. (1978). A new approach to ARMA modeling.
Comm. Statist. B7, 1-77.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
Harvey, A. C. (1981). Finite sample prediction and overdifferencing. J. Time Ser. Anal. 2, 221-232.
Hillmer, S. C. and Tiao, G. C. (1979). Likelihood function of stationary multiple autoregressive
moving average models. J. Amer. Statist. Assoc. 74, 652--660.
Liu, L. M., Hudak, G. B., Box, G. E. P., Muller, M. E. and Tiao, G. C. (1983). The SCA System for Univariate-Multivariate Time Series and General Statistical Analysis. Scientific Computing Associates, DeKalb, IL.
Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika
65, 297-304.
Ljung, G. M. and Box, G. E. P. (1979). The likelihood function of stationary autoregressive-moving
average models. Biometrika 66, 265-270.
Martin, R. D. (1980). Robust estimation of autoregressive models. In: D. R. Brillinger and G. C. Tiao, eds., Directions in Time Series. Institute of Mathematical Statistics, Hayward, CA.
Newbold, P. (1974). The exact likelihood function for a mixed autoregressive-moving average process. Biometrika 61, 423-426.
Quenouille, M. H. (1957). The Analysis of Multiple Time Series. Griffin, London.
Slutsky, E. (1937). The summation of random causes as the source of cyclic processes. Econometrica 5,
105-146.
Tiao, G. C. and Box, G. E. P. (1981). Modeling multiple time series with applications. J. Amer. Statist. Assoc. 76, 802-816.
Tiao, G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregressive parameters in ARMA models. Ann. Statist. 11, 856-871.
Robustness in Time Series and Estimating ARMA Models

R. Douglas Martin* and Victor J. Yohai†
1. Robustness concepts
*Research supported by the Office of Naval Research under contract N00014-82-0062, and by National Science Foundation Grant SES80-15570.
†Research supported by the Office of Naval Research under contract N00014-82-0062.
120 R. D. Martin and V. J. Yohai
or we might have

$$\mathrm{EFF}(T_n, P_0) = \frac{V_{CR}(P_0)}{\mathrm{Var}_{P_0}(T_n)} , \quad (1.2)$$

where $V_{CR}(P_0)$ is the Cramer-Rao lower bound at $P_0$; in the asymptotic version,

$$\mathrm{EFF}(T, P_0) = \frac{V_{CR}(P_0)}{V_\infty(T)} = \frac{1}{I(P_0)\, V_\infty(T)} , \quad (1.3)$$

where $V_\infty(T)$ is the asymptotic variance of $T$ and $I(P_0)$ the Fisher information.
Robustness in time series and estimating ARMA models

1.2. Min-max robustness

Let $V(T, P^\infty)$ denote the asymptotic variance of an estimate $T$ at distribution $P^\infty$, let $\mathbf{T}$ denote a large family of estimates, and let $\mathbf{P}^\infty$ denote a large family of distributions. A min-max robust estimate $T_0$ satisfies the saddle-point condition

$$\sup_{P^\infty \in \mathbf{P}^\infty}\, \inf_{T \in \mathbf{T}} V(T, P^\infty) = V(T_0, P_0^\infty) = \inf_{T \in \mathbf{T}}\, \sup_{P^\infty \in \mathbf{P}^\infty} V(T, P^\infty) . \quad (1.5)$$

Of course, for the most frequently treated case of i.i.d. processes $\{x_i\}_{i=1}^{n}$ with marginal distribution $F$, one would replace $\mathbf{P}^\infty$ by a family $\mathbf{P}$ of univariate distributions, and replace $P^\infty$ by a univariate distribution $P$ in the above expressions.
Huber's (1964) seminal work showed that for estimating location in the i.i.d. setting, the above problem is solved by a member of the class of M-estimates $\hat{\mu}$ defined by

$$\min_{\mu} \sum_{i=1}^{n} \rho(y_i - \mu) , \quad (1.6)$$

or, when $\rho$ has derivative $\psi$, by the solution of

$$\sum_{i=1}^{n} \psi(y_i - \hat{\mu}) = 0 . \quad (1.7)$$
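For concreteness, a sketch of a Huber-type location M-estimate solving (1.7) by iteratively reweighted averaging (the tuning constant, starting value, and iteration scheme are our choices):

```python
# Sketch: a location M-estimate with Huber's psi, solving the estimating
# equation sum psi(y_i - mu) = 0 by iteratively reweighted averaging.

def huber_psi(r, k=1.345):
    """Huber's psi: linear near zero, clipped at +-k."""
    return max(-k, min(k, r))

def m_estimate_location(y, k=1.345, tol=1e-9, max_iter=200):
    """Solve sum psi(y_i - mu) = 0; weights w_i = psi(r_i)/r_i give IRLS."""
    mu = sorted(y)[len(y) // 2]          # start from an order-statistic median
    for _ in range(max_iter):
        w = [huber_psi(yi - mu, k) / (yi - mu) if yi != mu else 1.0 for yi in y]
        mu_new = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```

Unlike the sample mean, the solution is barely moved by a single gross outlier, which is the resistance property discussed below.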
1.3. R e s i s t a n c e
An estimate $T_n$ is called resistant if 'small' changes in the data result in only small changes in $T_n$, where 'small changes' in the data means (i) large changes in a small fraction of the data, and/or (ii) small changes in all the data.
The large changes in (i) correspond to outliers, while the small changes in (ii)
correspond, for example, to rounding errors or grouping errors. The sample mean
lacks resistance, whereas estimates such as trimmed means (with the median as
a limiting case) and M-estimates are resistant.
$$\pi_d(\mu, \nu) = \inf\{\varepsilon: \exists P \in \mathcal{P}(\mu, \nu) \text{ such that } P([d(x, x') \geq \varepsilon]) \leq \varepsilon\} , \quad (1.10)$$

where $[d(x, x') \geq \varepsilon] = \{(x, x'): d(x, x') \geq \varepsilon\}$.
Let $T_n: X^n \to \Gamma$, $n \geq n_0$, be a sequence of estimates which are invariant under permutation of coordinates, where the parameter space $(\Gamma, \gamma)$ is also a Polish space (in most cases $\Gamma$ is a Euclidean space). The reason for the appearance of $n_0$ is that often a minimum number $n_0$ of observations is required in order to define the estimate. Hampel (1968) introduced two definitions of qualitative robustness. The first definition is as follows:
Therefore, two points of $X^n$ are close in the metric $d_n^+$ if all the coordinates except a small fraction are close. According to this notion of closeness, if the original sample is modified by replacing a fraction no greater than $\varepsilon$ of the observations by arbitrary outliers, or if all the observations are perturbed by round-off errors smaller than $\varepsilon$, then the original and modified samples are at a distance smaller than $\varepsilon$.
$$\Delta T_n(x^n, \delta) = \sup\{|T_n(y^n) - T_n(z^n)|: d_n^+(y^n, x^n) \leq \delta,\ d_n^+(z^n, x^n) \leq \delta\} . \quad (1.12)$$
From now on, $P^n$ will denote any probability in $\mathcal{P}(X^n)$ (not just a product probability) and similarly $P^\infty$ will denote any probability in $\mathcal{P}(X^\infty)$, unless otherwise noted.
The following definitions of strong and weak robustness were introduced by
Boente, Fraiman and Yohai (1982), and represent an alternative to Hampel's
definition of qualitative robustness.
DEFINITION 1.5. Let $P^\infty \in \mathcal{P}(X^\infty)$. $\{T_n\}_{n \geq n_0}$ is weakly robust at $P^\infty$ if, given $\varepsilon > 0$, there exists $\delta > 0$ such that

$$P^\infty([\Delta T_n(x^n, \delta) \leq \varepsilon]) \geq 1 - \varepsilon , \quad \forall n \geq n_0 . \quad (1.14)$$
Boente, Fraiman and Yohai (1982) proved the following relationships between (i) weak and strong robustness, and (ii) both weak and strong robustness and Hampel's definition of qualitative $\pi$-robustness:
Papantoni-Kazakos and Gray (1979), Bustos (1981) and Cox (1981) also gave various definitions of qualitative robustness which hold for dependent processes and which are in the spirit of Hampel's approach. There are two such definitions, which correspond to generalizations of Hampel's qualitative robustness and qualitative $\pi$-robustness, respectively.
where $P^n$ and $Q^n$ are the marginals corresponding to the first $n$ coordinates of $P^\infty$ and $Q^\infty$, respectively.
Boente, Fraiman and Yohai (1982) argue that the relevant definition for the case of dependent processes is qualitative $\rho_n$-robustness, and not qualitative $\rho$-robustness. They also propose using $\rho_n = \pi_{d_n^+}$, and show that if $(\Gamma, \gamma)$ is a Euclidean space, qualitative $\pi_{d_n^+}$-robustness is equivalent to weak robustness.

Papantoni-Kazakos and Gray (1979) proposed the use of the Vasershtein distance on $\mathcal{P}(X^n)$. The Vasershtein distance utilizes a so-called 'per-letter' metric $d = d(x_i, y_i)$ to measure the distance between coordinates $x_i$ and $y_i$ of $x^n$ and $y^n$. Boente, Fraiman and Yohai (1982) proved that if $d$ is bounded, then the notion of qualitative $\rho_n$-robustness based on the Vasershtein metric is equivalent to qualitative $\rho_n$-robustness with $\rho_n = \pi_{d_n^+}$. Other metrics on $\mathcal{P}(X^n)$ were proposed by Cox (1981) and Bustos (1981).
In the sequel the term qualitative robustness, or simply robustness, without
qualifiers, will mean either strong or weak robustness as defined in this section,
and we will keep in mind the equivalence of strong robustness and Hampel's
qualitative π-robustness in the i.i.d. setting, given by Theorem 1.1(ii).
1.5. Comments
Qualitative robustness is a continuity property of estimates which is quite fundamental
and basic. As such, it is a principle of statistics which should be
considered at the same level as the principles of sufficiency, unbiasedness,
likelihood, and so on.
Of course, qualitative robustness alone is not enough, because some silly
estimates, such as the constant estimate T_n ≡ c for any constant c, are qualitatively
robust. One needs a high degree of efficiency robustness as well. These two
kinds of robustness are paramount, and they are easily attainable in the i.i.d.
setting using location M-estimates, for example.
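Such a location M-estimate is easy to sketch. The following is a minimal illustration, not taken from the text: it iterates reweighted averaging with a Huber psi-function, and fixes the scale at the normalized MAD instead of estimating it simultaneously (a simplifying assumption).

```python
import statistics

def psi_huber(u, c=1.345):
    # Huber psi-function: identity in the core, clipped beyond c
    return max(-c, min(c, u))

def huber_location(y, c=1.345, n_iter=50, tol=1e-8):
    """Location M-estimate by iteratively reweighted averaging.
    The scale is fixed at the normalized MAD (a simplification;
    Huber's proposal estimates scale simultaneously)."""
    mu = statistics.median(y)
    s = statistics.median([abs(v - mu) for v in y]) / 0.6745 or 1.0
    for _ in range(n_iter):
        w = [psi_huber((v - mu) / s, c) / ((v - mu) / s) if v != mu else 1.0
             for v in y]
        mu_new = sum(wi * vi for wi, vi in zip(w, y)) / sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```

On a sample with one gross outlier, huber_location([0.1, -0.2, 0.3, 0.0, -0.1, 100.0]) stays near the bulk of the data, whereas the sample mean is dragged above 16.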
2.1. Least-squares estimates
where B is the backward shift operator, and φ(B), θ(B) are the polynomial
operators

   φ(B) = 1 − φ_1 B − ⋯ − φ_p B^p ,  θ(B) = 1 − θ_1 B − ⋯ − θ_q B^q .

The least-squares estimate minimizes

   Σ_{t=p+1}^{T} û_t²(λ) .   (2.3)
The following asymptotic result is well known (see, for example, Whittle, 1962, or
Box and Jenkins, 1976): if var(F) = E_F(u_t²) < ∞, then

   √T (λ̂_LS − λ) →_d N(0, D(φ, θ, F)) ,   (2.6)

where →_d denotes convergence in distribution, and N(μ, Σ) denotes the multivariate
normal distribution with mean μ and covariance matrix Σ. The (p + q + 1) × (p + q + 1)
covariance matrix D(φ, θ, F) for the least-squares estimate is given by

   V_LS = D(φ, θ, F) = [ C⁻¹(φ, θ)                 0
                         0           var(F) H²(φ, θ) ] .   (2.7)
where s_i and t_i, 0 ≤ i < ∞, are the coefficients in the series expansions of the
operators φ⁻¹(B) and θ⁻¹(B) respectively:
On the other hand, the lower right-hand block of (2.7) gives the asymptotic
variance of μ̂, and since this variance depends on var(F), the LSE of μ is very
sensitive to small deviations of F from a nominal Gaussian distribution. For
example, if F is given by (2.11), then var(F) = (1 − ε)σ² + ε var(G), and hence
for ε > 0 fixed, var(F) may take arbitrarily large values, including +∞
(e.g. for a Cauchy-tailed G). Thus μ̂ lacks efficiency robustness!
sufficient condition for consistency of φ̂_LS is that E_F([log |u_t|]⁺) < ∞, where [x]⁺
denotes the positive part of x. This condition cannot be weakened, since it is
necessary for the existence of the stationary autoregressive process. Hannan
and Kanter (1977) have shown that if F belongs to the domain of attraction of
a stable law of order α, 0 < α < 2, then the LS-estimate of an AR(p) process
(with μ = 0) converges to the true value at rate T^{1/α}, which is faster than the rate
T^{1/2} for the finite-variance case.
The maximum-likelihood estimate minimizes

   Σ_{t=1}^{T} ρ_F(û_t(λ)) ,   (2.12)

where ρ_F = −ln f.
The asymptotic covariance matrix VMLE of this estimate is under general
conditions equal to the inverse of the information matrix and is given by
Here

   V_loc(ψ, F) = E_F ψ²(u) / [E_F ψ′(u)]² ,  V_loc(ψ_F, F) = 1/i(F) .   (2.14′)
   inf_{F ∈ F_{ε,σ²}} EFF(ψ_LS, F) = 0 .

Thus for the innovations outliers model the infimum of the asymptotic
efficiency of the LS-estimate in any neighborhood F_{ε,σ²} of the normal distribution
is zero. The LS-estimates μ̂, φ̂, θ̂ all lack efficiency robustness for
perfectly observed ARMA models.
While the least-squares estimates φ̂_LS, θ̂_LS have the asymptotic distribution-free
property cited in Subsection 2.2, it should be noted from inspection of
(2.14), along with the fact that i(F) is relatively stable as F deviates from
Gaussianity in a sufficiently smooth way, that the lack of efficiency robustness
of φ̂ and θ̂ is associated with not taking advantage of the increased precision
attainable with heavy-tailed F. This is because innovations outliers result in
'good' leverage points for estimating φ and θ (see Belsley, Kuh and Welsch,
1980, or Cook and Weisberg, 1982, for definitions of leverage for ordinary
regression models).
The situation with regard to estimating μ is of course quite different in this
respect, since innovations outliers do not result in 'good' leverage points for
estimating μ, and as a consequence the (asymptotic) variance of μ̂_LS is quite
sensitive to heavy-tailed innovations distributions.
Since F is not generally known, it is not possible to use the MLE in practice.
However, (2.12) suggests defining the following class of M-estimates, which are
a natural generalization of the M-estimates for location (Huber, 1964) or
regression (Huber, 1973). An M-estimate λ̂ for an ARMA(p, q) model minimizes

   Σ_{t=p+1}^{T} ρ( û_t(λ) / σ ) ,   (2.16)
Robustness in time series and estimating ARMA models 131
where σ is a scale parameter (cf. (1.6) and (1.7), where the nuisance parameter
σ was omitted).
Let ψ = ρ′ be the derivative of ρ. Then differentiating (2.16) we get the
following system of equations:

   Σ_{t=p+1}^{T} ψ( û_t(λ)/σ ) â_{t−j}(λ) = 0 ,  j = 1, …, p ,   (2.17)

   Σ_{t=p+1}^{T} ψ( û_t(λ)/σ ) b̂_{t−j}(λ) = 0 ,  j = 1, …, q ,   (2.18)

   Σ_{t=p+1}^{T} ψ( û_t(λ)/σ ) = 0 ,   (2.19)
where

   â_t(λ) = θ⁻¹(B)(y_t − μ) = φ⁻¹(B) û_t(λ)   (2.20)

and

   b̂_t(λ) = θ⁻²(B) φ(B)(y_t − μ) = θ⁻¹(B) û_t(λ) .   (2.21)
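For concreteness, the residuals û_t(λ) = θ⁻¹(B)φ(B)(y_t − μ) can be computed recursively; the sketch below does this for an ARMA(1,1) model. The sign convention φ(B) = 1 − φB, θ(B) = 1 − θB and the start-up value û_0 = 0 are assumptions of this sketch, not taken from the text.

```python
def arma11_residuals(y, mu, phi, theta):
    """Residuals u_t = theta^{-1}(B) phi(B) (y_t - mu) for ARMA(1,1),
    i.e. u_t = z_t - phi*z_{t-1} + theta*u_{t-1} with z_t = y_t - mu
    (assumed sign convention), started at u_0 = 0."""
    z = [v - mu for v in y]
    u = [0.0]
    for t in range(1, len(z)):
        u.append(z[t] - phi * z[t - 1] + theta * u[-1])
    return u[1:]  # residuals for t = 1, ..., T-1 (0-based indexing)
```

Feeding back a series generated from known innovations recovers those innovations exactly once the start-up effect is absent.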
Two commonly used families of ψ-functions are the Huber family

   ψ_{H,c}(u) = { u ,         |u| ≤ c ,
                  c sgn(u) ,  |u| > c ,     (2.22)

and the bisquare family (Beaton and Tukey, 1974)

   ψ_{B,c}(u) = { u (1 − u²/c²)² ,  |u| ≤ c ,
                  0 ,               |u| > c .     (2.23)
In most practical cases the scale parameter σ is unknown and therefore must
be estimated. As was proposed by Huber (1964) for the location case, σ may be
estimated simultaneously by an equation of the form

   Σ_{t=p+1}^{T} χ( û_t(λ)/σ ) = 0 ,   (2.24)

where, for example, χ may be taken as

   χ_{H,c}(u) = ψ_{H,c}²(u) − b .
   √T (λ̂_M − λ) →_d N(0, a(ψ, F) D) ,   (2.25)
where

   EFF(ψ, F) = a(ψ_F, F) / a(ψ, F) = 1 / ( i(F) V_loc(ψ, F) ) .   (2.26)
But (2.26) is just the asymptotic efficiency expression for a location M-estimate,
and λ̂ therefore has the same asymptotic efficiency-robustness properties as a
location M-estimate based on the same psi-function ψ. More details on the
efficiency robustness of autoregression M-estimates for the innovations outlier
model may be found in Martin (1982).
Thus, one can achieve efficiency robustness through the use of M-estimates
for ARMA models. However, we will see in the next section that these
M-estimates are not at all qualitatively robust!
observation with the y_t's obtained from the general replacement type contamination
model
Table 1
Sample means of LS- and M-estimates for the
AR(1) and MA(1) models with additive outliers

Estimate    AR(1)     MA(1)
LS          0.422    −0.426
MH          0.507    −0.449
MB          0.583    −0.456
4. General M-estimates
Then according to (2.17), (2.18) and (2.19), the equations for the LSEs may be
written
   Σ_{t=p+1}^{T} û_t(λ) d̂_{t−1}(λ) = 0 .   (4.2)

The GM-estimates downweight the summands through a weight function w:

   Σ_{t=p+1}^{T} w( û_t(λ)/σ , d̂′_{t−1}(λ) C⁻¹ d̂_{t−1}(λ) ) û_t(λ) d̂_{t−1}(λ) = 0 ,   (4.3)

with asymptotic version

   Σ_{t=p+1}^{T} w( u_t(λ)/σ , d′_{t−1}(λ) C⁻¹ d_{t−1}(λ) ) u_t(λ) d_{t−1}(λ) = 0 ,   (4.3′)
where u_t(λ) is generated from (2.4) with the y_t starting in the remote past. If we
assume that F is symmetric, a sufficient condition for (4.3′) is that w be even with
respect to its first argument. The scale parameter σ may be estimated simultaneously
by an equation similar to (2.24).
There are two main types of GM-estimates:
(a) Mallows type:

   w(u, v) = ψ₁(u) ψ₂(v^{1/2}) / (u v^{1/2}) ;   (4.4)

(b) Hampel–Krasker type:

   w(u, v) = ψ(u v^{1/2}) / (u v^{1/2}) .   (4.5)
The GM-estimates were first proposed for regression by Hampel (1975) and
Mallows (1976) (see also Krasker, 1980; Krasker and Welsch, 1982; Maronna,
Bustos and Yohai, 1979). GM-estimates were studied for autoregressive models
by Denby and Martin (1979), Martin (1980), and Bustos (1982).
Call λ̂_w the estimate defined by (4.3); then a formal Taylor series expansion
indicates that under suitable regularity conditions we will have

   √T (λ̂_w − λ) →_d N(0, D_w(w, F, φ, θ)) ,   (4.6)

where

   D_w(w, F, φ, θ) = σ² B⁻¹ A B⁻¹ ,   (4.7)

with

   A = E[ w²( u_t/σ , d′_{t−1} C⁻¹ d_{t−1} ) (u_t/σ)² d_{t−1} d′_{t−1} ] ,   (4.8)

   B = E[ (∂/∂u)[ u w(u, d′_{t−1} C⁻¹ d_{t−1}) ]_{u=u_t/σ} d_{t−1} d′_{t−1} ] .   (4.9)
For a careful proof in the case of pure autoregressions, see Bustos (1982).
Künsch (1984) has shown that GM-estimates of the Hampel–Krasker type, with ψ
in the Huber family given by (2.22), have an optimality property of the Hampel
type for autoregressions: they minimize the trace of the asymptotic covariance
matrix under a Gaussian AR(p) model, subject to a bound on Hampel's (1974)
influence curve.
where λ̂ and σ̂ are the estimates from the preceding iteration.
The matrix C may be computed in one of two ways:
(i) C may be expressed parametrically as C = C(λ̂), or
(ii) for autoregressions, C may be computed independently of λ̂; for an
example of this approach, see Martin (1980).
   Σ_{h=0}^{T−j−p−1} t_h(λ) γ̂_{h+j}(λ) = 0 ,  1 ≤ j ≤ q ,   (5.2)

   Σ_{t=p+1}^{T} ψ( û_t(λ) ) = 0 ,   (5.3)
where

   γ̂_i(λ) = Σ_{t=p+1}^{T−i} û_t(λ) û_{t+i}(λ)   (5.4)

is, except for a factor of 1/(T − i − p − 1) (or perhaps 1/(T − p − 1)), an estimate
of the autocovariance of the residuals û_t(λ) at lag i. The s_h and t_h are as in (2.9),
except we now indicate their dependence on λ. For simplicity we shall refer to the
γ̂_i(λ), without the needed factors, as residual autocovariance estimates.
The lack of resistance and robustness of the M-estimates is due to the lack of
resistance and robustness of the residual autocovariance estimates γ̂_i(λ): even
if ψ is bounded, each summand ψ(û_{t+i}(λ)) û_t(λ) is an unbounded function of the
observations, and so a small fraction of outliers can ruin the estimate. The
residual autocovariance estimates (RA-estimates) proposed by Bustos and
Yohai (1983) are obtained by replacing γ̂_i(λ) in (5.4) by a robust estimate of the
residuals autocovariance; the form of the robust estimate is

   γ̂_i(λ) = σ² Σ_{t=p+1}^{T−i} η( û_t(λ)/σ , û_{t+i}(λ)/σ ) ,   (5.5)

where η satisfies

   E η(u, v) = 0 .   (5.6)
   D* = [ a*(η, F) C⁻¹(φ, θ)                       0
          0                 a(ψ, F) var(F) H²(φ, θ) ] ,   (5.7)
where C⁻¹(φ, θ) is the asymptotic covariance matrix of (φ̂_LS, θ̂_LS), with C(φ, θ)
given by (2.8); var(F) H²(φ, θ) is the asymptotic variance of μ̂_LS, with H(φ, θ)
given by (2.10); the form of a(ψ, F) is given by (2.14); and

   a*(η, F) = E η²(u/σ, v/σ) / [E η₁(u/σ, v/σ)]² ,

where η₁(u, v) = ∂η(u, v)/∂u and u, v are independent, each with distribution F.
A proof of these results may be found in Bustos, Fraiman and Yohai (1984).
It follows that the efficiency of (φ̂, θ̂) with respect to (φ̂_LS, θ̂_LS) is given by the
inverse of a*(η, F), independently of the ARMA model parameters. Similarly,
the relative efficiency of μ̂ with respect to μ̂_LS is given by the inverse of
a(ψ, F), as in the case of location M-estimates.
One of the advantages of the RA-estimates with respect to the GM-estimates
is the relative simplicity of their covariance matrix. The structure of D* in (5.7)
is much simpler than that of D_w in (4.7). Furthermore, it is clear from (5.7) that
the choice of tuning constant in the function ψ to obtain a given efficiency (say
0.95) is independent of the ARMA model parameters. This is not the case in
choosing tuning constants for the GM-estimates, since the expression (4.7) for
the asymptotic covariance involves the ARMA model parameters in a complicated
way.
5.2. Computing RA-estimates
In the case of the Mallows type RA-estimates with η(u, v) = ψ(u)ψ(v), there
is a heuristically appealing iterative numerical algorithm for computing RA-estimates.
Suppose that in the ith iteration we have estimates λ̂^(i), σ̂^(i); then the
estimate for the (i + 1)th iteration is obtained as follows:
(i) Compute the residuals û_t(λ̂^(i)), p + 1 ≤ t ≤ T.
(ii) Modify the residuals by applying the ψ-function,

   û*_t = σ̂^(i) ψ( û_t(λ̂^(i)) / σ̂^(i) ) .   (5.9)

(iii) Compute the modified observations

   y*_t = μ̂ + [φ̂^(i)(B)]⁻¹ θ̂^(i)(B) û*_t ,   (5.10)

and obtain λ̂^(i+1) by applying least squares to the y*_t.
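A minimal sketch of this iteration for a mean-zero AR(1) model, where θ(B) = 1 so that (5.10) reduces to rebuilding the series from the modified residuals. The Huber psi and the MAD-based scale are simplifying assumptions of the sketch; the text estimates σ from an equation like (2.24).

```python
import statistics

def psi_huber(u, c=1.345):
    return max(-c, min(c, u))

def ra_ar1(y, n_iter=20):
    """Iterate: (i) residuals, (ii) psi-modified residuals as in (5.9),
    (iii) pseudo-observations rebuilt as in (5.10), then least squares
    on the rebuilt series."""
    phi = (sum(a * b for a, b in zip(y[1:], y[:-1]))
           / sum(v * v for v in y[:-1]))          # initial LS estimate
    for _ in range(n_iter):
        u = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
        s = statistics.median(map(abs, u)) / 0.6745
        u_star = [s * psi_huber(v / s) for v in u]            # (5.9)
        y_star = [y[0]]
        for v in u_star:                                      # (5.10)
            y_star.append(phi * y_star[-1] + v)
        phi = (sum(a * b for a, b in zip(y_star[1:], y_star[:-1]))
               / sum(v * v for v in y_star[:-1]))
    return phi
```

On clean Gaussian AR(1) data the iteration stays close to the least-squares estimate, since the psi-function leaves moderate residuals essentially unchanged.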
5.3. Qualitative robustness of RA-estimates
and a single outlier y_t at time t has influence on all û_{t′}(θ) with t′ ≥ t. Thus, a
small fraction of outliers may have a large effect on a large fraction of
residuals. Just one large outlier in the first observation may have a large effect
on all the residuals. Therefore, since GM- and RA-estimates depend on the
residuals û_t, they cannot be qualitatively robust. However, GM- and RA-estimates
are less sensitive to outliers than LS- and M-estimates. A Monte
estimates are less sensitive to outliers than LS- and M-estimates. A Monte
Carlo study (see Bustos and Yohai, 1983) shows that for the MA(1) model with
additive outliers, the RA-estimates of the Mallows and Hampel type are more
robust than LS- or M-estimates. This is especially true when ψ is taken in the
bisquare family given by (2.23). More theoretical support of the behavior of
GM- or RA-estimates for the AR(1) and MA(1) models using a proper
definition of influence function for time series may be found in Martin and
Yohai (1984). The idea is briefly described in Section 7.
In the next subsection we present another class of estimates which are
qualitatively robust for ARMA(p, q) models with q > 0.
5.4. Truncated RA-estimates
Recall that an RA-estimate for the MA(1) model with mean zero is obtained
as a solution of

   Σ_{i=1}^{T−1} θ^{i−1} γ̂_i(θ) = 0 ,   (5.13)

and the truncated residual autocovariance estimates, truncated at lag k, are of
the form

   γ̂_{i,k}(θ) = Σ_{t=1}^{T−i} η( û_t(θ)/σ , û_{t+i}(θ)/σ ) .
One of several things learned from Huber's (1964) early work on robust
estimation of location was that robust estimates can be obtained using maxi-
mum-likelihood estimates for suitably heavy-tailed distributions. Some caveats
are in order here, e.g. densities whose MLE's for location are robust do not
always produce robust estimates of scale, and we do not yet have an MLE
rationale for the bounded-influence regression estimates studied by Krasker
and Welsch (1982) and Huber (1983). Nonetheless, the non-Gaussian MLE
rationale sometimes provides a convenient way of uncovering and understanding
the structure of robust estimates.
We have already seen in Subsections 2.4 and 3.1 that while 'simple' M-estimates
can provide efficiency robustness for perfectly observed ARMA models, they are
not resistant or robust toward general contamination models of the type (3.1).
In this section we describe a class of estimates of ARMA model parameters
which are motivated by maximum-likelihood estimates for the additive outliers
type of contamination model described in Subsection 3.1, and which are
resistant and robust. We call these estimates approximate maximum-likelihood
type estimates (AM-estimates) because of approximations involving the non-Gaussian
MLE.
   y_t = x_t + v_t ,   (6.1)

where x_t and v_t are assumed to be independent, and the v_t are i.i.d. with zero
mean. Throughout, we shall presume that the y_t in (6.1) have mean μ = 0.
When μ is unknown, it may be estimated robustly and the estimate μ̂ can be
used to form centered observations. When μ̂ is consistent, estimators based on
the centered data typically behave asymptotically as if μ were known and
exactly centered observations were used. Alternatively, an intercept term can
be included in some of the equations to follow.
The log likelihood for this model is

   log h(y^T | α) = Σ_{t=1}^{T} log h(y_t | y^{t−1}, α) ,   (6.2)

where y^t = (y₁, y₂, …, y_t)′ is the vector of observations up to and including
observation y_t. The observation-prediction density h(y_t | y^{t−1}, α) is the conditional
density of the observation y_t given y^{t−1}, and h(y₁ | y⁰, α) denotes the
unconditional density h(y₁ | α).
Since x_t and v_t are by assumption independent, we can write the
observation-prediction density in terms of the state-prediction density. Define

   x̂_t^{t−1} = E(x_t | y^{t−1})   (6.4)

and

   m_t = E[ (x_t − x̂_t^{t−1})² | y^{t−1} ] .   (6.5)
The state-prediction density is approximated by a Gaussian density with mean
x̂_t^{t−1} and variance m_t:

   f(x_t | y^{t−1}) ≈ N(x̂_t^{t−1}, m_t) .   (6.7)

With σ₀² ≪ σ² and small γ > 0, Martin (1979) gave some motivation for approximating
g_t by setting

   g_t(u_t) = (1/s_t) g( u_t / s_t ) ,   (6.10)

where

   g = f * F_v .   (6.11)
Although the functional forms assumed in (6.7) and (6.10) are not good
approximations for general non-Gaussian F v, we believe that the use of these
forms when F v is nearly normal involves an approximation error that is small
enough to be relatively inconsequential.
The scale measure s_t in (6.10) represents the scale of the y-prediction residuals
u_t = y_t − ŷ_t^{t−1}. Since the x-prediction residuals x_t − x̂_t^{t−1} have as scale measure the
quantity √m_t, and since y_t = x_t + v_t, with v_t independent of x_t, it is reasonable to let

   s_t = √(m_t + σ₀²) ,   (6.12)

which for σ₀² = 0 reduces to

   s_t = √m_t .   (6.13)
The approximate log likelihood is then

   log h(y^T | α) = −Σ_{t=1}^{T} log s_t + Σ_{t=1}^{T} log g( u_t / s_t ) ,   (6.14)

with the residuals u_t = u_t(α) and scale values s_t = s_t(α) obtained from the
approximate conditional-mean type filter cleaners described in Subsection 6.3.
This suggests minimizing

   L(α) = Σ_{t=1}^{T} log s_t(α) + Σ_{t=1}^{T} ρ( u_t(α)/s_t(α) ) .   (6.15)

The parameter vector α is included in (6.15) to indicate explicitly the dependence
of s_t(α) and u_t(α) on the parameter vector α′ = (φ′, θ′, σ²).
If ρ(t) = −log g(t) and the density g is normal, then minimization of L(α)
yields the Gaussian maximum-likelihood estimate.
The choice of the function ρ is guided by the same qualitative robustness
considerations as for Huber M-estimates for location and regression (see, for
example, Huber, 1981; Hampel, 1974) and for the ARMA model M-estimates
of Section 2: ρ should have a bounded and continuous derivative ψ = ρ′.
state-variable form

   X̂_t = Φ X̂_{t−1} + p_t s_t ψ( u_t / s_t ) ,   (6.19)
where u_t = y_t − ŷ_t^{t−1} and p_t = m_t/s_t², with m_t being the first column of the k × k
matrix M_t, which is computed recursively together with the filtering error-covariance
matrix P_t, with

   s_t² = m_{11,t} ,   (6.22)

and the cleaned value given by the first component of the state,

   x̂_t = (X̂_t)₁ .   (6.24)
With the scaling (6.22), we will have x̂_t = y_t a large fraction of the time when
there are rather few outliers in the series. This is why we use the term filter
cleaner.
Before proceeding, note that when ψ is the identity function, w is identically
1, and (6.22) is replaced by s_t² = m_{11,t} + σ₀² with σ₀² = var(v_t) in the additive-noise
model, the above recursions are those of the Kalman filter. Correspondingly, M_t
and P_t are the prediction and filtering error-covariance matrices. See, for example,
Kalman (1960), Jazwinski (1970), Meditch (1969). Unfortunately, the Kalman
filter is not robust; a single outlying observation y_t can spoil not only x̂_t, but
also x̂_u, u > t. Use of a robust version is imperative in many situations. Our use
of σ₀² = 0 in (6.22) corresponds to the assumption that v_t = 0 a large fraction of
the time, e.g. as when a contaminated normal distribution with degenerate
central component, i.e. σ₀² = 0 in (6.9), and γ small, provides a reasonable
model for F_v.
The weight function w should have the same qualitative properties as a good
robustifying ψ-function, namely: boundedness, continuity and perhaps compact
support.
A common compact support for ψ and w results in the following desirable
behavior of the filter cleaner: if an observation y_t deviates from its prediction ŷ_t^{t−1}
by a sufficiently large amount, then X̂_t will be the pure prediction Φ X̂_{t−1}, and
the filtering-error covariance is set equal to the one-step prediction-error
covariance P_t = M_t.
The latter idea has often been implemented as a so-called hard-rejection
rule: set X̂_t = Φ X̂_{t−1} and P_t = M_t if |u_t| > c s_t, replacing (6.22) by s_t² = m_{11,t} + σ₀² in
the general noise case where there is a nonzero additive Gaussian noise
component. Typically, c = 3 has been used according to a time-honored habit,
and the procedure accordingly is termed a 3-sigma-edit rule. This corresponds
to the choices

   ψ_{HR}(t) = { t ,  |t| < c ,
                0 ,  |t| ≥ c ,     (6.25)

with w_{HR}(t) = ψ_{HR}(t)/t.
Our filter cleaners would differ from this simple rule by imposing continuity,
as well as boundedness and compact support. The ψ and w functions should
return smoothly to zero. One reasonable way to accomplish this is to impose
continuity on ψ, and take w as

   w(t) = ψ(t)/t .   (6.27)

The two-part redescending psi-function

   ψ_{HA}(t) = { t ,                   |t| ≤ a ,
                (a/(b − a))(b − t) ,   a < t ≤ b ,
                (−a/(b − a))(b + t) ,  −b ≤ t < −a ,
                0 ,                   |t| > b ,     (6.28)

along with w_{HA}(t) = ψ_{HA}(t)/t, appears to work well in practice (see, for example,
Martin and Thomson, 1982; Martin, Samarov and Vandaele, 1983).
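The family (6.28) and its weight function transcribe directly into code (a sketch; the corner values a and b are left as parameters):

```python
def psi_ha(t, a, b):
    """Two-part redescending psi of (6.28): identity on [-a, a],
    linear descent to zero at +/-b, and zero beyond."""
    x = abs(t)
    if x <= a:
        return t
    if x <= b:
        return (a / (b - a)) * (b - x) * (1.0 if t > 0 else -1.0)
    return 0.0

def w_ha(t, a, b):
    # w_HA(t) = psi_HA(t)/t, continuous, with w_HA(0) = 1
    return 1.0 if t == 0 else psi_ha(t, a, b) / t
```

Unlike the hard-rejection rule (6.25), ψ_HA is continuous at the corners ±a and ±b, so small perturbations of a residual never cause a jump in the cleaned value.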
It should be noted that the data-dependent scale s_t has the property that
s_{t+1} > s_t if a gross outlier occurs at time t. This is a particularly important
feature when using a redescending psi-function such as ψ_HA. For if a fixed scale
s were used, the filter could lose track of the data, never to regain it for the
duration of a fixed-length data set. Although this is also possible when using s_t,
the properties of s_t suggest that the filter will tend to regain track more quickly
than when a fixed scale is used. This aspect of the filter cleaner is illustrated in
Section VII of Martin and Thomson (1982).
and (6.27) by

   w(t) = ψ′(t) = (d/dt) ψ(t) .   (6.30)
Namely, under a key assumption that the state-prediction density f_x(x_t | y^{t−1}) is
Gaussian, and that ψ(t) = −(∂/∂y_t) log f(y_t | y^{t−1}), where f(y_t | y^{t−1}) is the
observation-prediction density, the filter is the conditional-mean filter (Masreliez,
1975).
It should be pointed out that the key assumption just cited will never hold
exactly under an AO model where v_t is non-Gaussian. However, there is some
evidence that f(x_t | y^{t−1}) is nearly Gaussian and that the filter is a good
approximation to the exact conditional-mean filter (Martin, 1979; Masreliez's
Monte Carlo, 1975). Hence, the filter is referred to as an approximate conditional-mean
(ACM) filter. Some clarifying details may be found in Martin (1979); the
figures therein suggest that the use of ψ_HA is reasonable when the observation
noise v_t has a contaminated-normal distribution. However, the w given by (6.30) is
discontinuous when ψ = ψ_HA, and we prefer the w specified by (6.27).
Further study is needed on the goodness of approximation of such ACM filters.
where m_t = s_t² and ψ(r) = ρ′(r). We now assume that the first term in (6.31)
dominates the second one. This will be the case, for example, if ∂m_t/∂(φ, θ) = 0
most of the time, which will be true if m_t = σ² most of the time. In turn, this
will be the case for data sets for which a well-designed filter cleaner results in
x̂_t = y_t most of the time. See Martin (1981) for a similar approximation. Under
this assumption, (6.31) becomes

   Σ_{t=1}^{T} (1/√m_t) ψ( u_t / √m_t ) ∂ŷ_t^{t−1}/∂(φ, θ) = 0 .   (6.32)
Notice that (6.32) has the same form as Huber's robust nonlinear regression
equation (see Huber, 1981, pp. 179 ff.) but with a time-varying scale. It is natural to
solve (6.32) using an iterative algorithm which is a modification of the so-called
H-algorithm due to Huber (1977, p. 39).
Here are the steps in such an algorithm:
(1) Compute initial estimates of φ, θ, and σ; one possible method of doing this
robustly is discussed in Section 7 of Martin, Samarov and Vandaele (1983).
(2) Use a robust filter cleaner to obtain the 'cleaned' observations x̂_t^(j), the
predicted values ŷ_t^{t−1,(j)}, and scale s_t^(j) = √m_t^(j), based on the current parameter
estimate α̂^(j) = (φ̂^(j), θ̂^(j), σ̂²^(j)).
(3) Use the x̂_t^(j) as observations and the α̂^(j) as initial values in a standard
ARMA nonlinear least-squares estimation procedure, and compute new estimates
(φ̂^(j+1), θ̂^(j+1)). The first component of the vector recursion (6.19) may be
written as

   x̂_t = ŷ_t^{t−1} + (m_t/s_t) ψ( u_t / s_t ) .

Thus the derivatives required by the least-squares step are

   ∂x̂_t/∂φ_k^(j) ,  t = 1, …, T ,  k = 1, …, p ,

   ∂x̂_t/∂θ_{k−p}^(j) ,  t = 1, …, T ,  k = p + 1, …, p + q .
(4) Next calculate σ̂^(j+1) as the standard deviation of the least-squares residuals

   r_t^(j+1) = x̂_t^(j+1) − x̂_t^{t−1,(j+1)} .
(5) Iterate (2)-(4) until the estimates do not change by more than a pre-specified
tolerance.
For examples of application of the above algorithm to real time-series data
sets, see Martin, Samarov and Vandaele (1983).
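For an AR(1) state the filter-cleaner recursions of step (2) reduce to a few scalar lines. The sketch below uses a Huber psi (the text favors a redescending psi such as ψ_HA), σ₀² = 0 as in (6.22), and the stationary variance as a crude starting value for m_t; all three are assumptions made for simplicity.

```python
def psi_huber(u, c=2.0):
    return max(-c, min(c, u))

def w_huber(u, c=2.0):
    return 1.0 if u == 0 else psi_huber(u, c) / u

def acm_clean_ar1(y, phi, sigma2, c=2.0):
    """Scalar robust filter cleaner for x_t = phi*x_{t-1} + e_t,
    y_t = x_t + v_t, with s_t^2 = m_t (i.e. sigma_0^2 = 0)."""
    xhat = y[0]
    m = sigma2 / (1.0 - phi * phi)      # stationary variance start-up
    cleaned = [xhat]
    for obs in y[1:]:
        xpred = phi * xhat                        # one-step prediction
        s = m ** 0.5
        u = obs - xpred                           # prediction residual
        xhat = xpred + s * psi_huber(u / s, c)    # robustified update
        p = m * (1.0 - w_huber(u / s, c))         # filtering variance
        m = phi * phi * p + sigma2                # next prediction variance
        cleaned.append(xhat)
    return cleaned
```

Benign observations pass through essentially unchanged (the Gaussian case reduces to the Kalman filter with zero observation noise), while a gross outlier is pulled toward its prediction.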
Robustness
The algorithm for computing AM-estimates described in Subsection 6.4 is
very close to a simpler and more transparent scheme whereby one iterates
between the filter cleaner and least-squares estimates of α.
In terms of the latter algorithm, it is easy to see at a heuristic level that the
AM-estimates are resistant. The reason is that (i) the action of the filter cleaner
is to replace a small to moderate fraction of gross outliers by one-sided
interpolates based on 'good' data, at least when ψ is redescending, and (ii) the
use of a continuous (as well as bounded) ψ and w means that small rounding
errors will not result in large changes in the filter-cleaned values x̂_t. A careful
proof of qualitative robustness remains to be given.
   φ̂ = Σ_{t=1}^{T−1} y_t y_{t+1} / Σ_{t=1}^{T} y_t²   (7.1)

when the data contain k outliers with common value ξ which gets arbitrarily
large, in both isolated and contiguous (or patch) configurations of the k
outliers. For the isolated configuration φ̂ → 0 as ξ → ∞, while in the contiguous
configuration φ̂ → (k − 1)/k ≈ 1 for k large. Thus we should expect different
breakdown points for time-series parameter estimates for different types of
outlier configurations.
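The two limits are easy to verify numerically for the lag-one estimate (7.1); the benign base series, the outlier positions, and the value ξ = 10⁶ below are arbitrary illustrative choices.

```python
def lag1_ls(y):
    # (7.1): lag-one LS estimate  sum_t y_t y_{t+1} / sum_t y_t^2
    return sum(a * b for a, b in zip(y, y[1:])) / sum(v * v for v in y)

base = [0.1, -0.2, 0.15, -0.05] * 25       # 100 benign observations
xi, k = 1e6, 4

isolated = list(base)
for pos in (10, 30, 50, 70):               # k isolated outliers
    isolated[pos] = xi

patch = list(base)
for pos in range(40, 40 + k):              # one contiguous patch of k outliers
    patch[pos] = xi
```

Here lag1_ls(isolated) is essentially 0, while lag1_ls(patch) is essentially (k − 1)/k = 0.75, matching the two limits described above.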
Some asymptotic breakdown point calculations for GM-estimates of the
AR(1) parameter φ were carried out by Martin and Jong (1977). The Mallows
type GM-estimates (4.3)-(4.4) were used, and both isolated and patch outlier
configurations were considered. This yielded a set of lower and upper breakdown
points which correspond to the asymptotic value φ₀ of the estimate φ̂ being
taken to 0 and 1, respectively.
In spite of the calculations cited, much more remains to be done with regard
to breakdown points for time-series parameter estimates. For one thing, the
definition used by Martin and Jong (1977) is not the only possibility, and it is
not yet clear whether an alternative definition would be preferred. In any event,
asymptotic breakdown points need to be computed for GM-, RA-, TRA- and
AM-estimates (among others), at AR(p) and ARMA(p, q) models. Furthermore,
it remains to be determined whether or not computable finite-sample breakdown
points are a possibility for a variety of robust estimates of interest.
   Σ_{t=1}^{T} ψ_t(y_t; λ̂) = 0 ,
The trouble with this definition is that it is not very natural from the
following viewpoint, among others. The contamination process measure μ_γ is a
mixture which corresponds to obtaining a realization from the stationary
measure μ with probability 1 − γ, and with probability γ obtaining a realization
from the (nonstationary) process having marginal measure δ_y for
(y₁, y₀, y₋₁, …). Such a mixture process does not correspond to any realistic
contamination process occurring in practice! Further discussion on this point
may be found in Martin and Yohai (1984), who propose a new definition of a
time-series influence curve IC as follows.
Let the 0-1 process z_t in (3.1) satisfy P(z_t = 1) = γ + o(γ), let μ_x denote the
measure for x_t, let μ_w denote the measure for the contaminating process w_t, and let
μ_y^γ be the measure for y_t. We can get either isolated or patchy outliers depending
upon how we specify the processes z_t and w_t. Assume that the estimate of
interest is obtained from the functional T(μ_y^γ). Then the time-series influence
curve IC(μ_w) = IC(μ_w; T, {μ_y^γ}) is the derivative along the arc {μ_y^γ:
0 ≤ γ < 1} as γ → 0, and correspondingly μ_y^γ → μ_x:

   IC(μ_w; T, {μ_y^γ}) = lim_{γ→0} [ T(μ_y^γ) − T(μ_x) ] / γ .
univariate problems (and we are dealing only with univariate time series in the
present discussion).
Although the IC is similar in spirit to ICH, it coincides with ICH only in the
special case where the estimate is permutation invariant and y_t is an i.i.d.
substitution outliers model (Subsection 3.1(ii)); i.e., in the usual i.i.d. setup the
two definitions coincide (Corollary 4.1 of Martin and Yohai, 1984). Although in
general IC is different from ICH, there is a close relationship between the two
which facilitates the calculation of IC. Namely, under regularity conditions,

where y_t = (y₁, y₀, y₋₁, …) is governed by the measure μ_y for the process y_t in
(3.1), and E_y denotes expectation with respect to μ_y.
The above result is established in Martin and Yohai (1984), where several
other results concerning IC's are presented: Conditions are established which
aid in the computation of IC's and which ensure that an IC is bounded. IC's are
computed for both least-squares and a variety of robust estimates of first-order
autoregressive and moving-average models. Distinctly different behaviors of
the IC are exhibited for patchy versus isolated outliers. It is shown that
bounded monotone ψ-functions do not yield bounded IC's for moving-average
parameters, whereas redescending ψ-functions do yield bounded IC's. Finally,
the IC is used to show that a class of generalized RA-estimates has a certain
optimality property.
References
Akaike, H. (1974). Markovian representation of stochastic processes and its application to the
analysis of autoregressive moving average processes. Ann. Instit. Statist. Math. 26, 363-387.
Beaton, A.E. and Tukey, J.W. (1974). The fitting of power series, meaning polynomials, illustrated
on band spectroscopic data. Technometrics 16, 147-185.
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. Wiley, New York.
Boente, G., Fraiman, R. and Yohai, V. J. (1982). Qualitative robustness for general stochastic
processes. Technical Report No. 26. Department of Statistics. University of Washington, Seattle,
WA.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day,
San Francisco, CA.
Bustos, O. H. (1981). Qualitative robustness for general processes. Informes de Matemática, Serie
B-002/81. Instituto de Matemática Pura e Aplicada, Brazil.
Bustos, O. H. (1982). General M-estimates for contaminated p-th order autoregressive processes:
consistency and asymptotic normality. Z. Wahrsch. Verw. Gebiete 59, 491-504.
Bustos, O. H. and Yohai, V. J. (1983). Robust estimates for ARMA models. Informes de
Matemática, Serie B-12/83. Instituto de Matemática Pura e Aplicada, Brazil. To appear in
J. Amer. Statist. Assoc.
Bustos, O., Fraiman, R. and Yohai, V. J. (1984). Asymptotics for RA-estimates of ARMA models.
In: J. Franke, W. Härdle and D. Martin, eds., Robust and Nonlinear Time Series Analysis. Springer,
Berlin.
Cook, D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall,
New York.
Cox, D. (1981). Metrics on stochastic processes and qualitative robustness. Technical Report No. 3.
Department of Statistics, University of Washington, Seattle, WA.
Denby, L. and Mallows, C. L. (1977). Two diagnostic displays for robust regression analysis.
Technometrics 19, 1-13.
Denby, L. and Martin, R. D. (1979). Robust estimation on the first order autoregressive parameter.
J. Amer. Statist. Assoc. 74, 140-146.
Donoho, D. L. (1982). Breakdown properties of multivariate location estimators. Unpublished
manuscript. Harvard University Ph.D. qualifying paper.
Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In: P. J. Bickel, K.
Doksum and J. L. Hodges, eds., Festschrift für Erich L. Lehmann. Wadsworth, Belmont, CA.
Hampel, F. R. (1968). Contributions to the theory of robust estimation. Ph.D. Thesis. University of
California, Berkeley, CA.
Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Stat. 42,
1887-1896.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc.
69, 383-393.
Hampel, F. R. (1975). Beyond location parameters: Robust concepts and methods. Proc. 40th
Session I.S.I., Warsaw 1975. Bull. Int. Statist. Inst. 46, Book 1, 375-382.
Hannan, E. J. and Kanter, M. (1977). Autoregressive processes with infinite variance. J. Appl.
Probab. 14, 411-415.
Hodges, J. L. (1967). Efficiency in normal samples and tolerance of extreme values for some estimates
of location. Proc. Fifth Berkeley Symp. on Math. Statist. and Probab. Vol. 1, 163-186.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1,
799-821.
Huber, P. J. (1977). Robust Statistical Procedures. Regional Conference Series in Applied Mathematics
No. 27. SIAM, Philadelphia, PA.
Huber, P. J. (1981). Robust Statistics. Wiley, New York.
Huber, P. J. (1983). Minimax aspects of bounded-influence regression. J. Amer. Statist. Assoc. 78,
66-80.
Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME
Ser. D, J. Basic Eng. 82, 34-45.
Kleiner, B., Martin, R. D. and Thompson, D. J. (1979). Robust estimation of power spectra. J.
Roy. Statist. Soc. Ser. B 41, 313-351.
Krasker, W. S. (1980). Estimation in linear models with disparate points. Econometrica 48,
1833-1846.
Krasker, W. S. and Welsch, R. T. (1982). Efficient bounded-influence regression estimation. J,
Amer. Statist. Assoc. 77, 595-604.
K/inisch, H. (1984). Infinitesimal robustness for autoregressive processes. Arm. Statist. 12, 843 863~
Lee, C. H. and Martin, R. D. (1982). M-estimates for A R M A processes. Technical Report No. 23.
Department of Statistics, University of Washington, Seattle, WA.
Mallows, C. L. (1976). On some topics in Robustness. Bell Labs. Tech. Memo, Murray Hill, N J.
Maronna, R., Bustos, O. H. and Yohai, V. J. (1979). Bias and efficiency robustness of general
M-estimates for regression with random carriers. In: T. Gasser and M. Rosenblatt, eds.,
Smoothing Techniques for Curve Estimation (Proceedings, Heidelberg, 1979). Springer, New
York.
Martin, R. D. (1979). Approximate conditional-mean type smoothers and interpolators. In: T.
Gasser and M. Rosenblatt, Smoothing Techniques for Curve Estimation, 117-143. Springer, Berlin.
Martin, R. D. (1980). Robust estimation in Autoregressive models. In: D. R. Brillinger and G. C.
Tiao, eds., Directions in Time Series, 228-254. Institute of Mathematical Statistics Publication,
Haywood, CA.
Robustness in time series and estimating A R M A models 155
Martin, R. D. (1981). Robust methods for time series. In: D. F. Findley, ed., Applied Time Series H.
Academic Press, New York.
Martin, R. D. (1982). The Cramer-Rao bound and robust M-estimates for autoregressions.
Biometrika 69, 437-442.
Martin, R. D. and Jong, J. (1977). Asymptotic properties of robust generalized M-estimates for the
first-order autoregressive parameter. Bell Labs. Tech. Memo, Murray Hill, NJ.
Martin, R. D., Samarov, A. and Vandaele, W. (1983). Robust methods for A R I M A models. In: A.
Zellner, ed., Applied Time Series Analysis of Economic Data. Econ. Res. Report ER-5, Bureau of
the Census, Washington, DC.
Martin, R. D. and Thompson, D. J. (1982). Robust resistant spectrum estimation. IEEE Proceed-
ings 70(9), 1097-1115.
Martin, R. D. and Yohai, V. J. (1984). Influence function for time series. Tech. Report.
Department of Statistics, University of Washington, Seattle, WA.
Masreliez, C. J. (1975). Approximate non-Gaussian filtering with linear state and observation
relations. I E E E Trans. Automat. Control AC-20, 361-371.
Meditch, J. S. (1969). Stochastic Optimal Linear Estimation and Control. McGraw-Hill, New York.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading,
MA.
Papantoni-Kazakos, P. and Gray, R. M. (1979). Robustness of estimators on stationary obse~
vations. Ann. Probab. 7, 989-1002.
Rousseeuw, P. and Yohai, V.J. (1984). Robust regression by means of S-estimators. In: J. Franke, W
H~irdle and D. Martin, eds., Robust and Nonlinear Time Series Analysis. Springer, Berlin.
Schweppe, F. C. (1973). Uncertain Dynamic Systems. Prentice-Hall, Englewood Cliffs, NJ.
Strassen (1965). The existence of probability measures with given marginals. Ann. Math. Statist. 36,
423-439.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: I. Olkin, ed.,
Contributions to Probability and Statistics. Stanford University Press, Stanford, CA.
Tukey, J. W. (1976). Useable resistant/robust techniques of analysis. In: W. L. Nicholson and J. L.
Harris, eds., Proc. First E R D A Statistics Symposium. Batelle Northwest Laboratories, Richland,
WA.
Whittle, P. (1962). Gaussian estimation in stationary time series. Bull. lnt. Statist. 39, 105-129.
Yohai, V. J. and Maronna, R. A. (1978). Asymptotic behavior of least squares estimates for
autoregressive processes with infinite variances. Ann. Statist. 5, 554-560.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 157-177

5

Time Series Analysis with Unequally Spaced Data

Richard H. Jones
1. Introduction
Unequally spaced data can occur in two distinct ways. The data can be
equally spaced with missing observations, or the data can be truly unequally
spaced with no underlying sampling interval. For multivariate data when
several variables are recorded at each observation time, it is possible to have
missing observations within the observation vector at a given time. In this case,
the observation times may be equally or unequally spaced. The key to data
analysis in these situations is to represent the structure of the process using a
state-space representation. For Gaussian inputs and errors, this allows the
calculation of the exact likelihood using the Kalman filter. Nonlinear optimiza-
tion can then be used to obtain maximum likelihood estimates of the unknown
parameters of the process.
These methods are easily extended to regression with stationary errors
including analysis of variance with serially correlated errors. Mixed models that
include random effects fit naturally into the state-space formulation since the
random parameters can be included in the state vector, and the variances
estimated by maximum likelihood, even with unbalanced designs.
This turns a univariate process into a vector Markov process involving a
state transition matrix. It is this Markov property that allows -2 ln likelihood to be
calculated recursively (Schweppe, 1965).
A discussion of the Kalman filter can be found in Gelb (1974). A general
state-space model consists of two equations. The state equation defines the
properties of the process in vector Markov form, and the observation equation
defines what is actually observed. These equations are

    X(t) = F(t)X(t-1) + G(t)u(t) ,
    Y(t) = H(t)X(t) + v(t) ,

where u(t) is the random input with unit variance, v(t) is the observational
error with covariance matrix R(t), and ' denotes transpose. X(t|t) denotes the
estimate of the state at time t given observations up to time t, and its
covariance matrix is denoted P(t|t). Similarly, X(t|t-1) denotes the estimate of the
state at time t given observations up to time t-1, a one-step prediction, and its
covariance matrix is denoted P(t|t-1). To begin the recursion, it is necessary
to specify an initial value of the state vector before the first observation is
collected, X(0|0), and its covariance matrix P(0|0). The general step of the
recursion starts with the information available at time t-1, X(t-1|t-1) and
P(t-1|t-1), and ends when this same information is available at time t. The
recursion proceeds as follows:
(1) Calculate a one-step prediction of the state,

    X(t|t-1) = F(t)X(t-1|t-1) .

(2) Its covariance matrix is

    P(t|t-1) = F(t)P(t-1|t-1)F'(t) + G(t)G'(t) .

(3) The predicted observation is

    Y(t|t-1) = H(t)X(t|t-1) .

(4) The innovation vector is the difference between the observations and the
predicted observations,

    I(t) = Y(t) - Y(t|t-1) .

(5) The covariance matrix of the innovation is

    V(t) = H(t)P(t|t-1)H'(t) + R(t) .

(6) The contribution to -2 ln likelihood is

    I'(t)V^{-1}(t)I(t) + ln|V(t)| ,

where | | denotes the determinant of the matrix. The contribution for each
step is summed over all steps.

(7) The Kalman gain matrix is

    K(t) = P(t|t-1)H'(t)V^{-1}(t) .

(8) The updated estimate of the state is

    X(t|t) = X(t|t-1) + K(t)I(t) .

(9) Its covariance matrix is

    P(t|t) = P(t|t-1) - K(t)H(t)P(t|t-1) .
For univariate time series, Y(t), the observation at time t, and R(t), the
observational error variance, will be scalars. A process with time-invariant
structure observed at equally spaced time intervals has parameters that do not
depend on time, and the model can be written

    X(t) = FX(t-1) + Gu(t) ,   (2.3)
    Y(t) = HX(t) + v(t) ,

with observational error covariance matrix R. A stationary time series has
time-invariant structure.
Note that these two equations require no calculation since the values that are
in memory are not changed. For multivariate time series with missing
observations within the observation vector, it is only necessary to reduce the number
of rows in the H(t) matrix to allow for these missing observations.
When the unknown model parameters have been estimated by maximum
likelihood, predictions can be made by running the recursion off the end of the
data using the missing data form of the recursion. Calculating V(t) from step
(5) of the recursion gives the variance or covariance matrix of the prediction.
Consider an AR(1) process observed with error,

    x(t) = αx(t-1) + σu(t) ,
    y(t) = x(t) + v(t) .

u(t) is assumed to have unit variance, and the observational error variance is
R. The three parameters to be estimated by maximum likelihood are α, σ, and
R. For any pass through the recursion, these parameters are assumed to be
known and are varied by a nonlinear optimization routine between passes. In
other words, one pass through the Kalman filter produces a value of -2 ln
likelihood, which is one function evaluation for a nonlinear optimization
routine. Multivariate extensions of this model are discussed in Jones (1984).
This special case has many practical applications. ARMA(1, 1) processes can
be used to model serial correlation in many situations where data spans are not
too long and the process is not highly structured. In particular, in regression or
analysis of variance, using an ARMA(1, 1) model for the error structure may
be much better than the usual assumption of independent errors. Approximate
modeling of serial correlation when it exists is better than not modeling it at all.
To begin the recursion, it is necessary to specify the initial conditions. Since
x(t) is a zero mean AR(1) process, the variance of the process (lag zero
covariance) is σ²/(1-α²). For given values of the parameters, the initial
conditions specify what is known before any data are collected,

    x(0|0) = 0 ,    P(0|0) = σ²/(1-α²) .

The recursion is now a special case of the general recursion given in the last
section:

(1) Calculate a one-step prediction,

    x(t|t-1) = αx(t-1|t-1) .

(2) Its variance is

    P(t|t-1) = α²P(t-1|t-1) + σ² .

(3) The predicted observation is

    y(t|t-1) = x(t|t-1) .

(4) The innovation is

    I(t) = y(t) - y(t|t-1) .

(5) Its variance is

    V(t) = P(t|t-1) + R .

(6) The contribution to -2 ln likelihood is

    I²(t)/V(t) + ln V(t) .

(7) The Kalman gain is

    K(t) = P(t|t-1)/V(t) .

(8) The updated estimate of the state is

    x(t|t) = x(t|t-1) + K(t)I(t) .

(9) Its variance is

    P(t|t) = [1 - K(t)]P(t|t-1) = K(t)R .
The second form of this last equation is more numerically stable since it
prevents a subtraction.
It is possible to concentrate σ² out of the likelihood by differentiation, in
which case the recursion takes a slightly different form. In this case, to calculate
the -2 ln likelihood it is necessary to accumulate two terms, one associated
with the weighted residual (or innovation) sum of squares, RSS, and the other
associated with the determinant in the multivariate normal distribution, DET.
A new variable is defined which is the ratio of the two variances,

    c² = R/σ² .

The initialization is

    x(0|0) = 0 ,    P(0|0) = 1/(1-α²) .

Here the equal sign is used in the programming sense of "is replaced by".
For missing observations, steps (3)-(7) are skipped and the last two steps
are replaced by

(8) x(t|t) = x(t|t-1) ,
(9) P(t|t) = P(t|t-1) .
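The scalar recursion, including this missing-data form, can be sketched as follows (a sketch under the stated model; the function name and the use of None to mark a missing value are illustrative):

```python
import math

def ar1_noise_neg2ll(y, alpha, sigma, R):
    """-2 ln likelihood (2*pi constant omitted) for x(t) = alpha*x(t-1) + sigma*u(t)
    observed as y(t) = x(t) + v(t), Var v(t) = R.  A None entry of y is a
    missing observation: steps (3)-(7) are skipped and the last two steps
    reduce to carrying the prediction forward."""
    x = 0.0
    P = sigma**2 / (1.0 - alpha**2)       # stationary initial conditions x(0|0), P(0|0)
    n2ll = 0.0
    for yt in y:
        x = alpha * x                     # (1) one-step prediction
        P = alpha**2 * P + sigma**2       # (2) its variance
        if yt is None:
            continue                      # (8)-(9): x(t|t) = x(t|t-1), P(t|t) = P(t|t-1)
        I = yt - x                        # (4) innovation, since y(t|t-1) = x(t|t-1)
        V = P + R                         # (5) innovation variance
        n2ll += I * I / V + math.log(V)   # (6) likelihood contribution
        K = P / V                         # (7) Kalman gain
        x += K * I                        # (8) updated state
        P = K * R                         # (9) the subtraction-free form K(t)R
    return n2ll
```

One call to this function is one function evaluation for the nonlinear optimization routine that varies α, σ, and R.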
The above recursions are easily modified for an AR(1) process with missing
observations by setting R or c² = 0. In this case, the likelihood is nonlinear in only one
parameter, α. Since, for a stationary process, this parameter must be in the
range -1 < α < 1, and in most practical applications is in the range 0 ≤ α < 1, it
is easy to search for the maximum likelihood estimate of α on a micro-
computer. Note that in the case of an AR(1) process the steps simplify: when
y(t) is observed it is known exactly, so x(t|t) = y(t) and P(t|t) = 0, and the
recursion need only carry predictions across runs of missing observations.
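A sketch of this one-parameter search, with σ² concentrated out of the likelihood as described earlier (the grid size, function names, and RSS/DET accumulators are illustrative choices):

```python
import math

def ar1_profile_neg2ll(y, alpha):
    """-2 ln likelihood for an AR(1) process with missing values (None entries),
    with sigma^2 concentrated out: accumulate RSS and DET and return
    m*ln(RSS/m) + DET + m, where m is the number of observed values (R = 0)."""
    x, Pt = 0.0, 1.0 / (1.0 - alpha**2)   # variances in units of sigma^2
    rss, det, m = 0.0, 0.0, 0
    for yt in y:
        x, Pt = alpha * x, alpha**2 * Pt + 1.0
        if yt is None:
            continue
        rss += (yt - x)**2 / Pt
        det += math.log(Pt)
        m += 1
        x, Pt = yt, 0.0                   # an observed value is known exactly
    return m * math.log(rss / m) + det + m

def ar1_mle(y, grid=400):
    """Crude one-dimensional search for alpha over (-1, 1)."""
    alphas = [-0.995 + 1.99 * k / grid for k in range(grid + 1)]
    return min(alphas, key=lambda a: ar1_profile_neg2ll(y, a))
```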
4. ARMA(p, q) processes
An ARMA(p, q) process satisfies

    x(t) = α_1 x(t-1) + ⋯ + α_p x(t-p) + e(t) + β_1 e(t-1) + ⋯ + β_q e(t-q) ,   (4.1)

where e(t) has standard deviation σ. For a pure autoregression (q = 0), the state of
this process can be defined as the p most recent values of the process, and a
state-space representation is

    [ x(t)     ]   [ α_1 α_2 ⋯ α_{p-1} α_p ] [ x(t-1) ]   [ 1 ]
    [ x(t-1)   ]   [ 1   0   ⋯  0      0   ] [ x(t-2) ]   [ 0 ]
    [   ⋮      ] = [          ⋱            ] [   ⋮    ] + [ ⋮ ] e(t) .   (4.2)
    [ x(t-p+1) ]   [ 0   ⋯   1  0          ] [ x(t-p) ]   [ 0 ]

In the general case, let m = max(p, q+1) and define the state as the current value
together with the predictions x(t+j|t), j = 1, ..., m-1. A state-space representation
is then

    [ x(t)       ]   [ 0    1        0  ⋯  0   ] [ x(t-1)       ]   [ 1       ]
    [ x(t+1|t)   ]   [ 0    0        1  ⋯  0   ] [ x(t|t-1)     ]   [ g_1     ]
    [    ⋮       ] = [               ⋱         ] [    ⋮         ] + [  ⋮      ] e(t) ,   (4.3)
    [ x(t+m-1|t) ]   [ α_m  α_{m-1}     ⋯  α_1 ] [ x(t+m-2|t-1) ]   [ g_{m-1} ]

with α_k = 0 for k > p, and the observation equation is

    y(t) = [1 0 ⋯ 0] [ x(t), x(t+1|t), ..., x(t+m-1|t) ]' + v(t) .   (4.4)

The g's in (4.3) are a function of the α's and β's and are generated by the
recursion

    g_0 = 1 ,    g_j = β_j + Σ_{k=1}^{j} α_k g_{j-k} ,   j = 1, ..., m-1 ,

with β_j = 0 for j > q.
For the process to be stationary, it is necessary that the roots of

    1 - Σ_{k=1}^{p} α_k z^k = 0   (5.1)

are outside the unit circle, and for the moving average to be invertible, the
roots of

    1 + Σ_{k=1}^{q} β_k z^k = 0   (5.2)
must be outside the unit circle. To insure stationarity and invertibility, Jones
(1980) reparameterized in terms of the partial autoregression and partial moving
average coefficients, and constrained them to be in the interval (-1, 1) by a
logistic type transformation. If a_k is a partial autoregressive coefficient,
    a_k = [1 - exp(-u_k)]/[1 + exp(-u_k)] .

The u_k can vary from -∞ to ∞, and these are the variables that are varied by
the nonlinear optimization routine.
For j = 1, 2, ..., p, the autoregressive coefficients are obtained recursively from
the partial coefficients by

    α_j^(j) = a_j ,
    α_k^(j) = α_k^(j-1) - a_j α_{j-k}^(j-1) ,   k = 1, 2, ..., j-1 .   (5.5)
The α's are then used in the state-space representation along with the β's, which
are transformed in a similar fashion, and a value of -2 ln likelihood is calculated.
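The transformation from unconstrained parameters u_k to a stationary set of autoregressive coefficients can be sketched as follows (the function name is illustrative):

```python
import math

def pacf_to_ar(u):
    """Map unconstrained u_k to partial coefficients a_k in (-1, 1) by the
    logistic transform, then to AR coefficients by the recursion (5.5);
    the resulting AR operator is always stationary."""
    a = [(1.0 - math.exp(-uk)) / (1.0 + math.exp(-uk)) for uk in u]
    alpha = []
    for j, aj in enumerate(a, start=1):
        # alpha_k^(j) = alpha_k^(j-1) - a_j * alpha_{j-k}^(j-1), k = 1, ..., j-1
        alpha = [alpha[k] - aj * alpha[j - 2 - k] for k in range(j - 1)] + [aj]
    return alpha
```

A nonlinear optimizer can then search over the u_k freely, with stationarity guaranteed at every step.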
A natural way to obtain initial guesses at the parameters for nonlinear
optimization is to proceed in a stepwise fashion, adding a single parameter, or
perhaps both an autoregressive and a moving average parameter at each step.
The initial values of the parameters can be the final values obtained from the
previous step with the new parameter or parameters set to zero. The optimiza-
tion will then start from the best value of - 2 In likelihood found at the previous
step and try to improve it.
6. ARIMA(p, d, q) processes
For d = 1, multiplying the autoregressive operator by the difference operator
(1 - B), where B is the backshift operator, gives a nonstationary AR(p+1)
operator with coefficients

    α'_1 = α_1 + 1 ,
    α'_2 = α_2 - α_1 ,
    α'_3 = α_3 - α_2 ,
        ⋮
    α'_p = α_p - α_{p-1} ,
    α'_{p+1} = -α_p .   (6.6)
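The expansion (6.6), applied d times, can be sketched as:

```python
def difference_ar(alpha, d=1):
    """Absorb d differencing operators into the AR coefficients, as in (6.6):
    multiplying (1 - alpha_1 B - ... - alpha_p B^p) by (1 - B)^d gives a
    nonstationary AR(p + d) operator.  Coefficients follow the convention
    1 - sum alpha_k B^k."""
    for _ in range(d):
        ext = alpha + [0.0]                 # treat alpha_{p+1} as zero
        alpha = [ext[0] + 1.0] + [ext[k] - ext[k - 1] for k in range(1, len(ext))]
    return alpha
```

For example, differencing an AR(1) with coefficient 0.5 gives the operator 1 - 1.5B + 0.5B², and differencing white noise twice gives 1 - 2B + B².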
When data are truly unequally spaced, not equally spaced with missing
observations, continuous time models are necessary to represent the process.
These processes are discussed by Doob (1953). Kalman and Bucy (1961)
develop the state-space filtering approach for continuous time processes, and
Wiberg (1971) gives an easy-to-read introduction to the subject. The use of
continuous time models allows the prediction and updating equations to be
developed for an arbitrary time interval so that the Kalman filter recursion
depends on the length of the step.
As an introduction, consider a continuous time first-order autoregression,
referred to as a CAR(1) process. A zero mean CAR(1) process can be
represented as a first-order linear differential equation driven by 'white noise'.
The continuous time state-space representation is

    dx(t) = -αx(t) dt + dW(t) ,   (7.1)
where a > 0, and W(t) is a Wiener process, i.e. dW(t) is continuous time zero
mean 'white noise'. Integrated white noise is a continuous time random walk or
Brownian motion process which satisfies the differential equation
dz (t) = d W ( t ) . (7.2)
The variance of the change in the random walk over a finite time interval is
proportional to the length of the interval, i.e. for b > a,

    Var{z(b) - z(a)} = Var{∫_a^b dW(t)} = (b - a)Q .   (7.3)

Here Q will be referred to as the variance of the white noise process.
The process (7.1) is a continuous time Markov process, with covariance
function at lag τ

    C(τ) = Q exp(-α|τ|)/2α .

Note that as the prediction interval becomes large, the prediction variance
approaches the variance of the process, Q/2α.
If this CAR(1) process is observed at equally spaced time intervals with
spacing h, the resulting discrete time process is AR(1) with autoregression
coefficient

    exp(-αh)   (7.8)

and innovation variance

    σ² = Q[1 - exp(-2αh)]/2α .   (7.9)
It does not matter how these time points are determined, by some random
mechanism or selected in advance, as long as the time points do not depend on
the values of the process being sampled. It is assumed that the sampling times
are known. The continuous time state-space model (7.1) can now be
represented as a discrete time state-space model at the sampling times t_i,

    x(t_i) = exp[-α(t_i - t_{i-1})] x(t_{i-1}) + G(t_i)u(t_i) ,

where u(t_i) has unit variance and

    G(t_i) = (Q{1 - exp[-2α(t_i - t_{i-1})]}/2α)^{1/2} .

Now the state transition and the standard deviation of the random input
depend on the length of the time step. The observation equation is

    y(t_i) = x(t_i) + v(t_i) .
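A sketch of the resulting likelihood evaluation for a CAR(1) process observed with error at arbitrary known times (the function name and argument order are illustrative):

```python
import math

def car1_neg2ll(times, y, a, Q, R):
    """-2 ln likelihood (2*pi constant omitted) for a CAR(1) process observed
    with error variance R at arbitrary known times: between observations the
    state decays by exp(-a*dt) and the prediction variance uses
    Q(1 - exp(-2a*dt))/(2a), cf. (7.8)-(7.9)."""
    x, P = 0.0, Q / (2.0 * a)              # stationary initial conditions
    n2ll, t_prev = 0.0, None
    for t, yt in zip(times, y):
        if t_prev is not None:
            phi = math.exp(-a * (t - t_prev))       # step-dependent transition
            x = phi * x
            P = phi**2 * P + Q * (1.0 - phi**2) / (2.0 * a)
        t_prev = t
        I = yt - x                         # innovation
        V = P + R                          # innovation variance
        n2ll += I * I / V + math.log(V)
        K = P / V                          # gain, update, subtraction-free variance
        x += K * I
        P = K * R
    return n2ll
```

With equal spacing h this reduces exactly to the discrete AR(1)-plus-noise filter with coefficient exp(-αh) and σ² from (7.9).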
A continuous time autoregression of order p satisfies the stochastic differential
equation

    x^(p)(t) + α_{p-1}x^(p-1)(t) + ⋯ + α_1 x^(1)(t) + α_0 x(t) = dW(t)/dt ,   (8.1)

where x^(j)(t) denotes the jth derivative with respect to time. For stationarity, it
is necessary that the roots of

    Σ_{j=0}^{p} α_j z^j = 0 ,   α_p = 1 ,   (8.2)

have negative real parts. Defining the state as x(t) and its first p-1 derivatives
gives the state-space form

      [ x(t)       ]   [  0    1    0   ⋯   0        ] [ x(t)       ]      [ 0 ]
      [ x^(1)(t)   ]   [  0    0    1   ⋯   0        ] [ x^(1)(t)   ]      [ 0 ]
    d [    ⋮       ] = [               ⋱             ] [    ⋮       ] dt + [ ⋮ ] dW(t) ,   (8.3)
      [ x^(p-1)(t) ]   [ -α_0 -α_1 -α_2 ⋯  -α_{p-1}  ] [ x^(p-1)(t) ]      [ 1 ]

and the observation equation is

    y(t) = [1 0 ⋯ 0] [ x(t), x^(1)(t), ..., x^(p-1)(t) ]' + v(t) .   (8.4)
For stationarity, it is necessary that p > q and that the roots of (8.2) have
negative real parts. The representation will be 'minimum phase' if the roots of

    Σ_{j=0}^{q} β_j z^j = 0   (9.3)

have negative real parts. It is also assumed that (8.2) and (9.3) have no common
roots.
For a CARIMA(p, d, q) process, d roots of (8.2) must be zero.
    y = Xβ + ε .   (10.1)

Premultiplying by a whitening transformation K gives

    Ky = KXβ + Kε .   (10.2)
In a regression situation, the Kalman filter operates on the y vector and each
column of the X matrix in order to transform the errors to be uncorrelated
with constant variance. It is very easy to modify the Kalman filter algorithm so
that it operates on a matrix rather than a vector. By forming an n by p + 1
matrix by augmenting X by y, the algorithm can operate on each column
replacing the entry by the innovation. The usual X'X matrix and X'y vector
can be formed from the innovations. If y'y is also calculated, the total sum of
squares, then the residual sum of squares is

    RSS = y'y - b'X'y ,   (10.3)

where b is the estimated regression coefficient vector and all three quantities
are computed from the innovations. It is important that the determinant term
be included in the likelihood since the weight matrix is changed for each
iteration,

    -2 ln likelihood = n ln RSS + DET ,   (10.4)

where DET is the natural log of the innovation variance summed over the time
points.
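As an illustration of this whitening idea for the special case of AR(1) errors, where the innovation transform is known in closed form (the Prais-Winsten transform; a sketch, not the chapter's general Kalman-based algorithm):

```python
import numpy as np

def ar1_gls(X, y, alpha):
    """Whiten y and each column of X with the exact AR(1) innovation
    transform, then apply ordinary least squares, so the errors become
    uncorrelated with constant variance."""
    n = len(y)
    W = np.eye(n)
    W[0, 0] = np.sqrt(1.0 - alpha**2)    # first innovation scaled to unit variance
    for t in range(1, n):
        W[t, t - 1] = -alpha             # innovation y_t - alpha*y_{t-1}
    Xs, ys = W @ X, W @ y
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    rss = float(ys @ ys - b @ (Xs.T @ ys))   # RSS = y'y - b'X'y on the innovations
    return b, rss
```

With α = 0 the transform is the identity and this is ordinary least squares.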
Duncan and Horn (1972) showed how random effects are naturally handled
using the Kalman recursion. Random effects can be included in the state of the
process while linear fixed effects are concentrated out of the likelihood as in
regression. The advantage of this approach is the ability to handle unbalanced
designs, such as missing observations by exact likelihood methods. A second
advantage is that serial correlation in repeated measures designs can be
modeled, even when there are missing observations or the data are unequally
spaced.
Consider a simple two-way repeated measures design,
    y_ij = μ + τ_j + π_i + ε_ij ,   (11.1)

where i denotes the subject and j denotes the repeated measurements on each
subject. μ is the fixed grand mean, τ_j the fixed treatment or time effect, π_i the
random subject effect, and ε_ij the random error. It is assumed that the π_i are
independent N(0, V_π) and the ε_ij are independent N(0, V_ε) and are independent
of the π_i. These assumptions produce the compound symmetry correlation structure
for observations on the same subject, i.e. constant correlation between any two
observations. This intraclass correlation,

    ρ = V_π/(V_π + V_ε) ,   (11.2)

is a result of the random subject effect, and is not serial correlation in the usual
time series analysis sense.
If this model is balanced with no missing observations, the usual repeated
measures analysis of variance is appropriate (Winer, 1971), and exact maximum
likelihood estimates of the two variances can be expressed in closed form
(Herbach, 1959). In the unbalanced case with missing observations, the exact
likelihood can be calculated using a state-space model. Concentrating V_ε out of
the likelihood as before, -2 ln likelihood is nonlinear in only one parameter,
the ratio of the two variances,

    c² = V_π/V_ε .   (11.3)
Since the fixed effects can be handled by regression as in Section 10, only the
random terms need be represented in state-space form. Since subjects are
independent, -2 ln likelihood can be calculated for each subject and summed
over subjects. For subject i, the state equation is trivial since π_i is constant for
each subject. It is, however, random across subjects with variance V_π. The state
equation is

    π_i(j) = π_i(j-1) ,

with initial conditions π_i(0|0) = 0 and P(0|0) = c².
This initial variance would be V_π if V_ε had not been concentrated out of the
likelihood. ε_ij now plays the role of observational error. Concentrating V_ε out
of the likelihood has the effect of dividing all variances in the recursion by V_ε;
therefore, the observational error variance R for this model will be set equal
to 1.
If serial correlation exists between the e's within a subject, the e's must be
modeled as part of the state. Any of the models discussed in this chapter can be
used to model this serial correlation. The random subject effect is simply
tacked onto the end of the state vector. The observations can be equally or
unequally spaced, and there may be missing observations. For example, if the
ε's satisfy an AR(1) structure, the state equation is

    [ ε_ij ]   [ α  0 ] [ ε_{i,j-1} ]   [ 1 ]
    [ π_i  ] = [ 0  1 ] [ π_i       ] + [ 0 ] u_j ,   (11.8)

the observation equation is

    y_ij - μ - τ_j = [1 1] [ ε_ij, π_i ]' ,   (11.9)

and the initial state covariance matrix is

    P(0|0) = [ 1/(1-α²)  0  ]
             [ 0         c² ] .   (11.10)
    E{θ_i θ_i'} = V .   (11.12)
The methods presented here can be used to estimate β and the variance
components for unbalanced designs, and the estimation is nonlinear in only p
parameters, the variance components.
The nonlinear optimization routines used by the author are the quasi-
Newton methods discussed by Dennis and Schnabel (1983), who give
algorithms in the appendix of their book. The art of nonlinear optimization is
highly developed by computer scientists, and statisticians need only find good
code. Supplying derivatives for the functions being minimized is a good idea if
possible, but it is not necessary. Gradients can be approximated by finite
differences.
13. Conclusion
References
Solo, V. (1984). Some aspects of continuous-discrete time series modeling. In: E. Parzen, ed.,
Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statistics, Vol. 25, 325-345.
Springer, Berlin-New York.
Wecker, W. E. and Ansley, C. F. (1983). The signal extraction approach to nonlinear regression
and spline smoothing. J. Amer. Statist. Assoc. 78, 81-89.
Wiberg, D. M. (1971). Theory and Problems of State Space and Linear Systems. Schaum's Outline
Series. McGraw-Hill, New York.
Winer, B. J. (1971). Statistical Principles in Experimental Design, 2nd ed. McGraw-Hill, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 179-187

6

Model Selection Techniques in Time Series Analysis

Ritei Shibata
1. Introduction
This chapter aims to give a short review of various model selection techniques
which have been developed in the context of time series analysis. Our main
concern is with moving-average (MA), autoregressive (AR) or autoregressive-
moving average (ARMA) models. A related problem is model checking,
which aims at checking the adequacy of a model. Since the test statistics employed
in model checking can also be used for constructing a model selection pro-
cedure, we first review such statistics. We should, however, note that
model selection is not a simple combination of model checks.
The aim of model selection is not only checking the adequacy of a model but
also
(a) obtaining a good predictor, or
(b) describing a system, or identifying a system.
We consider a univariate ARMA(p, q) model,

    φ(B)z_t = θ(B)e_t ,   (1.1)

with estimated residuals

    ê_t = θ̂(B)^{-1}φ̂(B)z_t ,

where θ̂(B) and φ̂(B) are the maximum likelihood or quasi-maximum likeli-
hood estimates of θ(B) and φ(B), respectively. In this chapter, we do not go
A test statistic for the hypothesis that {z_t} is MA(q) can be based on the
sample autocovariances beyond lag q,

    T = n Σ_{l,m=q+1}^{h} γ̂_l σ̂^{lm} γ̂_m ,

where σ^{lm} or σ̂^{lm} are the l, m elements of the inverse of the autocovariance
matrix or of the sample autocovariance matrix, respectively. Therefore, by T we
can check q-dependence, a specific property of MA(q), but we may fail in
checking linearity of the process. A remarkable fact is that this statistic is not
equivalent to the maximum log likelihood in any sense.
Similarly, a test of AR(p) can be based on the sample partial autocorrelations
beyond lag p,

    T = n Σ_{l=1}^{h} φ̂²_{p+l} .

Therefore, by T we can test the null hypothesis of {z_t} being an AR(p). As the
sample partial autocorrelation φ̂_m, a commonly used definition is the last coordinate
of the solution φ̂(m)' = (φ̂_1(m), ..., φ̂_m(m)) of the mth-order Yule-Walker
equation.
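The Yule-Walker solutions for successive orders m, and hence the sample partial autocorrelations, can be computed by the Durbin-Levinson recursion; a sketch (the function name is illustrative, and the autocovariances are assumed given):

```python
def sample_pacf(gamma, maxlag):
    """Partial autocorrelations phi_m(m): the last coordinate of the solution
    of the mth-order Yule-Walker equations, by the Durbin-Levinson recursion.
    gamma: autocovariances gamma_0, ..., gamma_maxlag."""
    phi_prev, pacf = [], []
    v = gamma[0]                          # prediction error variance
    for m in range(1, maxlag + 1):
        num = gamma[m] - sum(phi_prev[k] * gamma[m - 1 - k] for k in range(m - 1))
        a = num / v
        # update phi_{m,k} = phi_{m-1,k} - a * phi_{m-1,m-k}
        phi_prev = [phi_prev[k] - a * phi_prev[m - 2 - k] for k in range(m - 1)] + [a]
        v *= (1.0 - a * a)
        pacf.append(a)
    return pacf
```

For an AR(1) with γ_j = 0.5^j, the partial autocorrelations beyond lag 1 vanish, as the test statistic above expects under the null hypothesis.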
An asymptotically equivalent statistic is

    T = (n/σ̂⁴(p)) Σ_{l=1}^{h} ĥ²_l ,

where

    ĥ_l = (1/n) Σ_{t=p+1}^{n} ê_t ẽ_{t-l} ,   ê_t = z_t - Σ_{j=1}^{p} φ̂_j(p) z_{t-j} ,

and

    σ̂²(p) = (1/(n-p)) Σ_{t=p+1}^{n} ê_t² .

Here ĥ_l is the sample cross covariance between the noise ê_t and its backward
representation ẽ_t = Σ_{l≥0} φ̃_l z_{t+l}. Hence, ĥ_l/σ̂²(p) might be more natural
than φ̂_{p+l} as an estimate of the partial autocorrelation.
It is well known that the above two statistics are asymptotically equivalent to
each other [2]. These statistics are also asymptotically equivalent to the
maximum log likelihood.
Since the process

    η(B)φ(B)z_t = η(B)θ(B)e_t

has the same covariance structure as that of (1.1), the transfer functions θ(B)
and φ(B) are not uniquely determined by autocovariances of {z_t}.
The generalized partial autocorrelation φ_k(j) is defined as the last coordinate of
the solution φ̂(j) of a generalized Yule-Walker equation. The properties of the
autocovariances imply that the matrix A(j, k) is nonsingular if and only if j ≤ q
or k ≤ p. The generalized partial autocorrelations then have the following
properties:

    φ_k(j) = 0          if j = q, k > p ,
           undefined    if j > q, k > p ,
           ≠ 0          otherwise.
    A = [ |A(j, k)| ] .

The innovation variance is estimated from the residuals by

    σ̂²_e = (1/n) Σ_{t=1}^{n} ê²_t .   (3.1)
The estimated transfer function

    θ̂(B)^{-1} φ̂_k(B)

is not far from the true one, even when overfitted [14].
Another way might be to use the Lagrangian multiplier test statistic as is
demonstrated in Poskitt and Tremayne [18]. By modifying the Fisher information
matrix, we can avoid the problem of the singularity, but for doing this we have to
fix an alternative a priori. Therefore, such a statistic is not suitable for model
selection.
The portmanteau statistic is based on the residual autocorrelations,

    T = n Σ_{l=1}^{h} ρ̂²_l ,

where ρ̂_l is the lth sample autocorrelation of the residuals ê_t.
It is shown by Box and Pierce [6] that the above T is asymptotically distributed
as χ²_{h-p-q} under the null hypothesis. To accelerate the speed of convergence to the
asymptotic distribution, Ljung and Box [16] proposed a correction such as

    T = n(n+2) Σ_{l=1}^{h} ρ̂²_l/(n - l) .
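Both portmanteau statistics can be sketched as follows (the function name is illustrative; the residuals are assumed given):

```python
def ljung_box(resid, h):
    """Box-Pierce statistic n * sum r_l^2 and the Ljung-Box correction
    n(n+2) * sum r_l^2/(n-l), from the first h residual autocorrelations."""
    n = len(resid)
    mean = sum(resid) / n
    c0 = sum((x - mean) ** 2 for x in resid) / n
    r = [sum((resid[t] - mean) * (resid[t + l] - mean) for t in range(n - l)) / n / c0
         for l in range(1, h + 1)]
    bp = n * sum(rl * rl for rl in r)
    lb = n * (n + 2) * sum(rl * rl / (n - l) for l, rl in enumerate(r, start=1))
    return bp, lb
```

Each statistic would be referred to the χ²_{h-p-q} distribution under the null hypothesis.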
3.3. Cross-validation
This kind of statistic was proposed by Stone [27] in the context of multiple
regression. A formal extension yields a statistic

    T = Σ_{t=1}^{n} {z_t - ẑ_t(-t)}² ,

where ẑ_t(-t) denotes the prediction of z_t computed with the tth observation
deleted.
4. Model selection
    T = -2 log(maximum likelihood) .

Criteria of the form

    C(p, q) = T + α(p + q)   (4.1)

are then minimized over the orders p and q.
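A sketch of this kind of penalized-likelihood order selection for an AR model, with the fits obtained from the Durbin-Levinson recursion on the sample autocovariances (an illustration under Gaussian assumptions, not Shibata's exact setup; α = 2 gives AIC, log n gives BIC):

```python
import math

def select_ar_order(z, max_p, a=2.0):
    """Choose the AR order minimizing C(p) = n*ln(sigma_hat_p^2) + a*p,
    where sigma_hat_p^2 is the prediction error variance from the
    Durbin-Levinson recursion."""
    n = len(z)
    mean = sum(z) / n
    gamma = [sum((z[t] - mean) * (z[t + l] - mean) for t in range(n - l)) / n
             for l in range(max_p + 1)]
    v, phi = gamma[0], []
    best_p, best_c = 0, n * math.log(gamma[0])
    for p in range(1, max_p + 1):
        num = gamma[p] - sum(phi[k] * gamma[p - 1 - k] for k in range(p - 1))
        ap = num / v
        phi = [phi[k] - ap * phi[p - 2 - k] for k in range(p - 1)] + [ap]
        v *= (1.0 - ap * ap)                 # residual variance of the AR(p) fit
        c = n * math.log(v) + a * p
        if c < best_c:
            best_p, best_c = p, c
    return best_p
```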
For such criteria, there has been much discussion. The most controversial point
is how to choose α, which is 2 in AIC, log n in BIC, and c loglog n for
some c > 2 in φ. The choice of α depends on the aim of the selection. If our main
concern is prediction, α should be chosen so as to yield smaller prediction error. If it
is to identify a system, consistency is more important than the size of
the prediction error. In Shibata [21], these two aspects of model selection are
demonstrated for the case of AIC, in the context of nested AR model fitting. It
is shown that selection by the minimum AIC procedure has a tendency toward
overfitting and is not consistent, but the increase of the prediction error is not
large, only of the order O(1/n) uniformly in φ_1, ..., φ_p. Similar discussions are
given for general α by Bhansali and Downham [5], or Atkinson [3]. Their
conclusion is consistent on the point that α should be greater than 2 even if the
prediction error is our main concern. An answer to the optimality is given by
Shibata [22] from the viewpoint of prediction error. He showed that the choice
α = 2 is asymptotically optimal, under the assumption that the underlying process
does not degenerate to a finite order AR process. This result, namely 'asymptotic
efficiency of the selection with α = 2', is also applied to an autoregressive spectral
estimate [24]. Taniguchi [28] showed that Shibata's result holds true also for
ARMA models. However, for the case of small samples, the above asymptotic
theory does not work so well [23]. Recently Shibata [25] showed that the
approximate minimax regret choice of α is 2.8. The regret means how much the
prediction error increases when a selection procedure is applied, compared with the
error when the true model is known. Further generalization of the AIC can be
found in [26].
If we want to avoid overfitting in any case, α should be chosen greater than
2 loglog n but of smaller order than n. This is the result of Hannan and Quinn [12]. The
term 2 loglog n follows from the fact that the range of the random fluctuation
of T is at most 2 loglog n, from the law of the iterated logarithm. It is interesting to
note that the choice α = log n in BIC, which is derived from a Bayesian
viewpoint, satisfies the above condition.
Hannan and Rissanen [13] proposed a practical way of selecting the orders p
and q of ARMA by using one of the above consistent criteria. Assuming p = q,
find m which minimizes C(m, m) in (4.1); this m is asymptotically equal to
max(p₀, q₀) of the true orders p₀ and q₀. Next, assuming p = m or q = m, find p
and q which minimize C(p, q); then we can find p₀ and q₀ consistently.
A remaining problem in practice is how to choose P and Q, which
specify the largest orders p and q. This is equivalent to the problem of how to
choose 'h' for the statistics in Section 2. This problem has not been analyzed well,
but an analysis by Shibata [26] gives a rough guideline: we can choose any
large P and Q, as long as the tail probability P(F_{m+2,n-P-Q} > αm/(m+2)) is
close enough to P(χ²_{m+2} > αm) for m = 1, 2, 3, ..., n - P - Q.
As a final remark, we note that if α is chosen bounded, then the actual penalty is
seriously affected by small changes of T as well as changes of initial conditions.
We should choose α so as to compensate well for any such changes.
References
[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In: B. N. Petrov and F. Csáki, eds., Second International Symposium on Information Theory,
267-281. Akadémiai Kiadó, Budapest.
[2] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[3] Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model.
Biometrika 67, 413-418.
[4] Beguin, J.-M., Gouriéroux, C. and Monfort, A. (1980). Identification of a mixed autore-
gressive-moving average process: the corner method. In: O. D. Anderson, ed., Time Series,
423-435. North-Holland, Amsterdam.
[5] Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive
model selected by a generalization of Akaike's FPE criterion. Biometrika 64, 547-551.
[6] Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autore-
gressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
[7] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control.
Holden-Day, New York.
[8] Chow, J. C. (1972). On estimating the orders of an autoregressive moving-average process
with uncertain observations. IEEE Trans. Automat. Control AC-17, 707-709.
Model selection techniques in time series analysis 187
[9] Godfrey, L. G. (1979). Testing the adequacy of a time series model. Biometrika 66, 67-72.
[10] Graupe, D., Krause, D. J. and Moore, J. B. (1975). Identification of autoregressive-moving
average parameters of time series. IEEE Trans. Automat. Control AC-20, 104-107.
[11] Gray, H. L., Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling.
Comm. Statist. B7, 1-115.
[12] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression.
J. Roy. Statist. Soc. Ser. B 41, 190-195.
[13] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving
average order. Biometrika 69, 81-94.
[14] Hannan, E. J. (1982). Fitting multivariate ARMA models. In: G. Kallianpur, P. R. Krishnaiah, J.
K. Ghosh, eds., Statistics and Probability: Essays in Honor of C. R. Rao, 307-316. North-Holland,
Amsterdam.
[15] Krishnaiah, P. R. (1982). Selection of variables under univariate regression models. In: P. R.
Krishnaiah, ed., Handbook of Statistics--II. North-Holland, Amsterdam.
[16] Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models.
Biometrika 65, 297-303.
[17] Milhøj, A. (1981). A test of fit in time series models. Biometrika 68, 177-187.
[18] Poskitt, D. S. and Tremayne, A. R. (1981). An approach to testing linear time series models.
Ann. Statist. 9, 974-986.
[19] Quenouille, M. H. (1947). A large-sample test for the goodness of fit of autoregressive
schemes. J. Roy. Statist. Soc. Ser. B 11, 123-129.
[20] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
[21] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information
criterion. Biometrika 63, 117-126.
[22] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Ann. Statist. 8, 147-164.
[23] Shibata, R. (1980). Selection of the number of regression parameters in small sample cases. In:
Statistical Climatology, 137-148. Elsevier, Amsterdam.
[24] Shibata, R. (1981). An optimal autoregressive spectral estimate. Ann. Statist. 9, 300-306.
[25] Shibata, R. (1983). A theoretical view of the use of AIC. In: O. D. Anderson, ed., Time Series
Analysis: Theory and Practice, Vol. 4, 237-244. Elsevier, Amsterdam
[26] Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression
variables. Biometrika 71, 43-49.
[27] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Z Roy~
Statist. Soc. 2, 111-133.
[28] Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary
process. Ann. Inst. Statist. Math. 32A, 401--419.
[29] Woodside, C. M. (1971)o Estimation of the order of linear systems. Automatica 7, '727-733.
[30] Woodward, W. A. and Gray, H. L. (1981). On the relationship between the S array and the
Box-Jenkins method of A R M A model identification. J. Amer. Statist. Assoc. 76, 579-587.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 189-211

7

Lennart Ljung
1. Introduction
This basically means that we rely upon 'laws of Nature' and other well-
established relationships that have their roots in earlier empirical work. These
subsystems are then joined together mathematically, and a model of the whole
system is obtained. This route is known as modelling, and does not necessarily
involve any experimentation on the actual system. When a model is required of
a yet unconstructed system (such as a projected aircraft), this is the only
possible approach.
The other route is based on experimentation. Input and output signals from
the system are recorded and are then subjected to data analysis in order to
infer a model of the system. This route is known as identification. It is often
advantageous to try to combine the approaches of modelling and identification in
order to maximize the information obtained from identification experiments
and to make the data analysis as sensible as possible.
In this chapter we shall discuss various techniques for the identification of
dynamical systems, focusing on methods that have been used, and to some
extent developed, in the control oriented community. We shall pay special
attention to sequential, or recursive methods, which refer to schemes that
process the measurements obtained from the system continually, as they
become available (Sections 7-11). We have found it suitable to present such
recursive schemes as a natural development of off-line or batch identification
methods, which assume that the whole data batch is available in each stage of
the data processing. Therefore, we will spend a major part of the chapter
exposing and explaining general ideas in identification (Sections 4-6). A particular problem
with parameter estimation in dynamical systems is the multitude of possible
models that are available. In Sections 2-3 we shall discuss a number of such
possibilities, and also point out a unified framework for how to handle them.
A very important problem, which is crucial for a successful application, is the
choice of a family of candidate models for describing the system. This problem
is quite application-dependent, and we consider it outside the scope of this
chapter to address the problem of choice of model set. This means that what
we discuss most of the time is actually how to estimate parameters in a given
model structure.
For further discussion of the topics treated here we may refer to Goodwin
and Payne (1977), Eykhoff (1974, 1981) and Ljung and Söderström (1983).
Linear difference equations
Let the relationship between the input sequence {u(t)} and the output
sequence {y(t)} be described by
ARMAX models
If the term {v(t)} in (2.5) is described as a moving average (MA) of white
Such models are often called output error models. The 'noise-free output' x(t)
is here not available for measurement, but given (2.7b) it can be reconstructed
from the input. We denote by x(t, θ) the noise-free output that is constructed
using the model parameters θ, i.e.

x(t, θ) + f_1 x(t-1, θ) + ... + f_n x(t-n, θ)
    = b_1 u(t-1) + ... + b_m u(t-m) .   (2.9)
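To make (2.9) concrete, here is a minimal sketch of reconstructing the noise-free output from the input alone; the first-order coefficient values and the step input are hypothetical choices, not taken from the text.

```python
def noise_free_output(f, b, u):
    """Reconstruct x(t) from the input via (2.9):
    x(t) + f1*x(t-1) + ... + fn*x(t-n) = b1*u(t-1) + ... + bm*u(t-m),
    with x(t) = 0 and u(t) = 0 assumed for t < 0."""
    x = []
    for t in range(len(u)):
        xt = sum(b[j] * u[t - 1 - j] for j in range(len(b)) if t - 1 - j >= 0)
        xt -= sum(f[i] * x[t - 1 - i] for i in range(len(f)) if t - 1 - i >= 0)
        x.append(xt)
    return x

# First-order example with hypothetical f1 = 0.5, b1 = 1.0 and a unit-step input:
print(noise_free_output([0.5], [1.0], [1.0] * 5))
```

Note that x(t, θ) depends on the input only, which is exactly why the prediction for the models (2.7)-(2.11) is formed from past inputs alone.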
With
State-space models
Estimation of parameters in dynamical systems 193

A common way of describing stochastic, dynamical systems is to use state-space
models. Then the relationship between input and output is described by
The list of potential models and model sets can be made long. For our
purposes it is useful to extract the basic features of models, so as to allow for a
treatment of model sets in general. First we introduce the following notation:
z(t) = (y(t), u(t)),    z^t = {z(1), ..., z(t)} .

Similarly, u^t and y^t denote the input sequence and the output sequence,
respectively, up to time t.
The various models that can be used for dynamical systems all represent
different ways of thinking and representing relationships between measured
signals. They have one feature in common, though. They all provide a rule for
computing the next output or a prediction (or 'guess') of the next output, given
previous observations. This rule is, at time t, a function from z^{t-1} to the space
where y(t) takes its values (R^p in general). It will also be parametrized in terms
of the model parameter θ. We shall use the notation

ŷ(t|θ) = g_M(θ; t, z^{t-1})   (3.1)

for this mapping. The actual form of (3.1) will of course depend on the
underlying model. For the linear difference equation (2.1) or (2.4), we will have
The same prediction or guess of the output y(t) will be used for the model (2.5)
with disturbances, in case {v(t)} is considered 'unpredictable' (like white
noise). For the state-space model (2.12) the predictor function is given by the
Kalman filter; g_M is then a linear function of past data.

For the ARMAX model (2.6) a natural predictor is computed as

For the models (2.7)-(2.11), a natural predictor is also given by (3.4a) with θ
and φ(t, θ) defined by (2.8)-(2.10). Notice that in this case the prediction is
formed from past inputs only. We then have, formally,
Here we simply assume that the prediction ŷ is a linear function of past data z^{t-1},
and that this linear function can be realized with a finite-dimensional, time-invariant
filter.
Notice that the function g_M(θ; t, ·) in (3.1) is a deterministic function from the
observations z^{t-1} to the predicted output. All stochastic assumptions involved
in the model descriptions (e.g. white noises, covariance matrices, Gaussianness)
have only served as vehicles or 'alibis' to arrive at the predictor function.
The prediction ŷ(t|θ) is computed from z^{t-1} at time t - 1. At time t the
output y(t) is received. We can then evaluate how good the prediction was by
computing

ε(t, θ) = y(t) - ŷ(t|θ) .   (3.7)

We shall call ε(t, θ) the prediction error at time t, corresponding to model M(θ).
This term will be the generic name for general model sets. Depending on the
character of the particular model set, other names, for example, the (general-
ized) equation error, may be used. For a simulation model (3.5) it is customary
to call the corresponding prediction error (3.7) the output error.
We can also adjoin an assumption about the stochastic properties of the
prediction error to the model A/(0):
M(θ): "Assume that the prediction error ε(t, θ) has the conditional
(given z^{t-1}) probability density function (p.d.f.) f(t, θ, x), i.e.
P(ε(t, θ) ∈ B) = ∫_{x∈B} f(t, θ, x) dx."
(3.8)
Notice that in (3.8) there is an implied assumption of independence of the
prediction errors for different t, since the p.d.f. does not depend on z^{t-1}. A
predictor model (3.1) adjoined with a probabilistic assumption (3.8) we shall call a
probabilistic model.
The problem now is to decide upon how to use the information contained in z^N
to select a proper member M(θ̂_N) in the model set that is capable of 'describing'
the data. Formally speaking, we have to determine a mapping from z^N to the set:

z^N → M(θ̂_N) .   (4.1)
Now, how can such a mapping be determined? We pointed out that the essence of a
model of a dynamical system is its prediction aspect. It is then natural to judge the
performance of a given model M(θ*) by evaluating the prediction errors ε(t, θ*)
given by (3.7). A guiding principle to form mappings (4.1) is thus the following
one:

"Based on z^t, compute the prediction error ε(t, θ) using (3.1) and (3.7). At time
t = N, select θ̂_N so that the sequence of prediction errors ε(t, θ̂_N), t =
1, ..., N, becomes as small as possible".
The question is how to quantify what 'small' should mean. Two approaches
have been taken. These will be treated in the following two subsections.
l(t, θ, ε(t, θ))

to evaluate 'how large' the prediction error ε(t, θ) is. Here l is a mapping from
R × R^d × R^p to R, where d = dim θ, p = dim y.
V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} l(t, θ, ε(t, θ)) .   (4.3)
V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} |y(t) - θ^T φ(t)|² ,   (4.4)
which we recognize as the familiar least squares criterion (see, e.g., Strejc,
1980). This function is quadratic in θ, which is a consequence of the prediction
being linear in θ and the quadratic choice of criterion function. This means that
an explicit expression for the minimizing element θ̂_N can be given:

θ̂_N = [Σ_{t=1}^{N} φ(t) φ^T(t)]^{-1} Σ_{t=1}^{N} φ(t) y(t) .
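As a hedged illustration of this explicit solution, the sketch below fits a first-order linear difference equation by forming the normal equations directly; the system, noise level and sample size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N = -0.7, 2.0, 500                   # hypothetical true parameters
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + 0.1 * rng.standard_normal()

# Regressors phi(t) = [-y(t-1), u(t-1)]^T, so that y(t) = theta^T phi(t) + e(t).
Phi = np.column_stack([-y[:-1], u[:-1]])
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y[1:])
print(theta_hat)                            # close to [-0.7, 2.0]
```

With white equation noise, as here, this least squares estimate is consistent; the general criterion (4.3) has no such closed form and requires the numerical search discussed below.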
-(1/N) log P(y(N), y(N-1), ..., y(1)) = -(1/N) Σ_{t=1}^{N} log f(t, θ, ε(t, θ)) .   (4.7)
Maximizing the likelihood function is thus the same as minimizing the criterion
(4.3) with
-log f(t, θ, ε) = const + (1/2) log det Λ_t(θ) + (1/2) ε^T Λ_t^{-1}(θ) ε ,   (4.9)
where Λ_t(θ) is the assumed covariance matrix for the prediction errors. If the
covariance matrix Λ_t is supposed to be known (independent of θ), then the first
two terms of (4.9) do not affect the minimization, and we have obtained a
quadratic criterion like (4.6). The maximum likelihood method was introduced
for ARMAX models in Åström and Bohlin (1965).
For the least squares case it was possible to give an explicit expression for
the parameter estimate. This is not the case in general. Then the criterion
function (4.3) must be minimized using numerical search procedures. We shall
comment more on this later.
We shall, following Ljung (1978), use the general term prediction error
identification methods for the procedures we described in this section (see also
Åström, 1980). When applied to the special simulation model (3.5), the term
output error methods might be preferred.
The rationale for requiring ε(t, θ) and ζ(t) to be uncorrelated is the following:
the predictors ŷ(t|θ) should ideally utilize all available information at time t - 1.

f_N(θ, z^N) = 0 ,   (4.12a)

with

f_N(θ, z^N) = (1/N) Σ_{t=1}^{N} ε(t, θ) ζ(t) ,   (4.12b)
where, normally, the dimension of ζ is such that (4.12) gives a system of
equations that is compatible with the dimension of θ. When (4.12) is applied to
the model (3.2), the well-known instrumental variable method results. The vector
ζ is then known as the instruments or the instrumental variables. See Young
(1970) and Söderström and Stoica (1981) for a further discussion of this method.
Then that contribution to ε(t, θ) that has its origin in the noise will be
uncorrelated with ζ for all θ. Choices (4.13) are typical for the instrumental
variable method. It turns out that the choices that give the best accuracy of
the obtained estimates are obtained when u is filtered through filters associated
with the true system (see Söderström and Stoica, 1981). We then have
For models that can be written as (3.4a) (like the ARMAX model (2.6) and the
output error model (2.7)-(2.11)), a natural choice is
we find that (4.12) will define the stationary points of the criterion (4.3), (4.6).
The criterion minimization approach can thus be seen as a special case of
(4.12), from this point of view.
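A small sketch of the correlation equations (4.12) in their instrumental variable form: the instruments are built by filtering the input through the (here, assumed known) true system, so they are uncorrelated with the coloured equation error; all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, N = -0.5, 1.0, 2000          # hypothetical true parameters
u = rng.standard_normal(N)
e = rng.standard_normal(N)
v = 0.3 * (e + 0.8 * np.concatenate([[0.0], e[:-1]]))   # coloured equation error

x = np.zeros(N)                    # noise-free output, used to build the instruments
y = np.zeros(N)
for t in range(1, N):
    x[t] = -a * x[t - 1] + b * u[t - 1]
    y[t] = -a * y[t - 1] + b * u[t - 1] + v[t]

phi  = np.column_stack([-y[:-1], u[:-1]])   # regressors
zeta = np.column_stack([-x[:-1], u[:-1]])   # instruments, uncorrelated with v
theta_ls = np.linalg.solve(phi.T @ phi,  phi.T @ y[1:])    # biased here: coloured noise
theta_iv = np.linalg.solve(zeta.T @ phi, zeta.T @ y[1:])   # solves the sample version of (4.12)
print(theta_ls, theta_iv)          # theta_iv is close to [-0.5, 1.0]
```

The instrumental variable estimate solves (1/N) Σ ζ(t)(y(t) - θ^T φ(t)) = 0 exactly, which is the finite-sample counterpart of (4.12).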
and

√N (θ̂_N - θ*) ∈ AsN(0, P) ,   (5.4)

where

P = Q^{-1} H Q^{-1} ,   (5.5)

H = lim E N V_N'(θ*, z^N) [V_N'(θ*, z^N)]^T ,   (5.6)

Q = V̄''(θ*) .   (5.7)
ψ(t, θ) = (d/dθ) ŷ(t|θ) = -(d/dθ) ε(t, θ)   (6.1)

(ψ is a d × p matrix) for the gradient of the prediction with respect to θ. Then
V_N'(θ, z^N) = (1/N) Σ_{t=1}^{N} [l_θ'(t, θ, ε(t, θ)) - ψ(t, θ) l_ε'(t, θ, ε(t, θ))] ,   (6.2)

which for the quadratic criterion with known covariance matrix Λ becomes

V_N'(θ, z^N) = -(1/N) Σ_{t=1}^{N} ψ(t, θ) Λ^{-1} ε(t, θ) .   (6.3)
θ̂_N^{(i+1)} = θ̂_N^{(i)} - μ [R_N^{(i)}]^{-1} V_N'(θ̂_N^{(i)}, z^N) ,   (6.4)
where θ̂_N^{(i)} denotes the ith iterate when solving for the minimizing value θ̂_N. The
number μ is chosen so that
R_N^{(i)} = I   (6.6a)

(gradient algorithm),

R_N^{(i)} = V_N''(θ̂_N^{(i)}, z^N)   (6.6b)

(Newton algorithm),

R_N^{(i)} = (1/N) Σ_{t=1}^{N} ψ(t, θ̂_N^{(i)}) Λ^{-1} ψ^T(t, θ̂_N^{(i)})   (6.6c)

(Gauss-Newton algorithm).
These are the natural counterparts of the gradient and the Newton methods,
respectively, for minimizing V_N(θ, z^N).
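The Gauss-Newton iteration (6.4) can be sketched for a first-order output error model, with the prediction gradient ψ obtained by the same kind of recursive filtering as the prediction itself; the starting point, the crude stability projection and all data are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
f_true, b_true, N = 0.5, 1.0, 1000
u = rng.standard_normal(N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = -f_true * x[t - 1] + b_true * u[t - 1]
y = x + 0.1 * rng.standard_normal(N)      # output-error data: y = x + white noise

def predict(theta):
    """Simulate x(t, theta) and its gradient psi(t) = d x(t, theta)/d theta."""
    f, b = theta
    xs  = np.zeros(N)
    dxf = np.zeros(N)                     # d x(t)/d f
    dxb = np.zeros(N)                     # d x(t)/d b
    for t in range(1, N):
        xs[t]  = -f * xs[t - 1] + b * u[t - 1]
        dxf[t] = -xs[t - 1] - f * dxf[t - 1]
        dxb[t] = u[t - 1] - f * dxb[t - 1]
    return xs, np.column_stack([dxf, dxb])

theta = np.array([0.3, 0.8])              # hypothetical starting point
for _ in range(15):                       # iterations of type (6.4) with mu = 1
    xs, psi = predict(theta)
    eps = y - xs                          # prediction errors
    theta = theta + np.linalg.solve(psi.T @ psi, psi.T @ eps)
    theta[0] = np.clip(theta[0], -0.9, 0.9)   # crude projection keeping the model stable
print(theta)
```

The solve step uses R of Gauss-Newton type, (1/N) Σ ψψ^T, which is why a few iterations suffice here.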
(∂/∂a_k) ŷ(t|θ) + c_1 (∂/∂a_k) ŷ(t-1|θ) + ... + c_n (∂/∂a_k) ŷ(t-n|θ) = -y(t-k) .
(6.10)
where φ(t, θ) is given by (2.4c) and the c_i are the c parameters corresponding
to θ. The calculation of ψ in this case thus corresponds to applying a simple
filtering operation to the vector φ, which has to be determined anyway. For the
linear model (3.6) we would obtain, conceptually, that

where col means that the columns of ψ are stacked on top of each other.
for some function φ(t, θ) of the data, as in (3.4a). With this structure, the
following approach to solve (4.12) is natural: determine the iterate θ̂^{(i)} by
solving

(1/N) Σ_{t=1}^{N} ζ(t, θ̂^{(i-1)}) (y(t) - θ^T φ(t, θ̂^{(i-1)})) = 0 .   (6.14)
where the function F may be implicitly defined, for example, as the minimizing
argument of some function. Such a general expression (7.1) cannot be used in a
recursive algorithm, since the evaluation of F may involve an unforeseen
amount of calculation, which perhaps cannot be completed by the next
sampling instant. Instead, a recursive algorithm must comply with the following
format:
It is clear that the general scheme (6.4) does in general not comply with the
structure (7.2) for a recursive algorithm. However, we can modify (6.4) to
achieve this. Conceptually, this is obtained by making one additional iteration
i - 1 → i at the same time as the data record is advanced from N - 1 to N. From
(6.4) we thus obtain
where we introduced
In case ψ(t, θ) and ε(t, θ) can be computed for any θ using only a fixed
(t-independent) amount of past data, then (8.5) is indeed a recursive algorithm. This is
the case for the model (3.2), for which we have

With μ = 1 and
The complete algorithm for estimating a and c in the ARMAX model (2.6)
For a general criterion function l(ε(t, θ)) instead of the quadratic one, the factor
ψ(t)ε(t) should be replaced by

ψ(t) l'(ε(t)) .
The algorithm (8.16) can be applied to arbitrary model sets, yielding a large
family of recursive identification methods. These are well known in special
cases. The 'recursive maximum likelihood method' (RML) is obtained when
(8.16) is applied to an ARMAX model, and the 'recursive generalized least
squares method' (RGLS) is obtained for models where the equation error is
described as an autoregression. The extended Kalman filter (EKF) is closely
related to (8.16) applied to the state-space model (2.12) (Ljung, 1979). See
Söderström et al. (1978) and Ljung and Söderström (1983) for a further
discussion.
9. Pseudolinear regressions
Let us now consider the correlation approach (4.12) applied to (3.4) with the
choice (4.15) for ζ. With the same approximations as those leading from (6.4) to
(8.5) we obtain

If ŷ(t|θ) and φ(t, θ) are subject to the structure (6.12) (with ψ = φ), and we use
the approximation (8.15), we arrive at the following algorithm:
We call this scheme a pseudolinear regression (PLR) for the model (3.4), since φ
is treated as though it does not depend on θ, in which case (3.4) would be a linear
regression.
φ(t) = [-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-m), ε(t-1), ..., ε(t-n)]^T .
PLRs are well known in a number of special cases. When applied to an
ARMAX model the PLR algorithm is known as extended least squares (ELS),
and when applied to the output error model (2.11) it is the scheme suggested
in Landau (1976).
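A minimal sketch of ELS for a first-order ARMAX model: the regression vector is extended with past residuals, which are refreshed as the estimate improves; the orders, parameter values and simple initialization are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, c, N = -0.8, 1.0, 0.6, 5000          # hypothetical ARMAX(1,1,1) parameters
u = rng.standard_normal(N)
e = 0.5 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + e[t] + c * e[t - 1]

theta = np.zeros(3)                         # estimates of [a, b, c]
P = 1000.0 * np.eye(3)
eps_prev = 0.0                              # residual epsilon(t-1), refreshed as we go
for t in range(1, N):
    phi = np.array([-y[t - 1], u[t - 1], eps_prev])   # regressor extended with residual
    k = P @ phi / (1.0 + phi @ P @ phi)
    theta = theta + k * (y[t] - phi @ theta)
    P = P - np.outer(k, phi @ P)
    eps_prev = y[t] - phi @ theta           # posterior residual, used at the next step
print(theta)                                # approaches [-0.8, 1.0, 0.6]
```

Treating the residual entries of φ as fixed regressors is exactly the 'pseudolinear' step; the convergence of this scheme is the subject of the positive realness conditions discussed in Section 10.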
ε(t) = y(t) - ŷ(t) ,

θ̂(t) = θ̂(t-1) + γ(t) R^{-1}(t) η(t) ε(t) ,

ξ(t+1) = A(θ̂(t)) ξ(t) + B(θ̂(t)) z(t) ,   (10.1)

col[ŷ(t+1), η(t+1)] = C(θ̂(t)) ξ(t) ,

γ(t) → 0 as t → ∞ .   (10.2)
Let ŷ(t|θ) and η(t, θ) be defined by (6.12) (with ψ = η) and let

ε(t, θ) = y(t) - ŷ(t|θ) .
f(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} η(t, θ) ε(t, θ) ,   (10.3)

G(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} η(t, θ) η^T(t, θ) ,   (10.4)
(d/dτ) θ_D(τ) = R_D^{-1}(τ) f(θ_D(τ)) ,   (10.5a)

(d/dτ) R_D(τ) = G(θ_D(τ)) - R_D(τ) .   (10.5b)
Global asymptotic stability of a certain stationary point θ*, R* will then imply
that θ̂(t) as defined by (10.1) converges to θ* with probability one. In that way
the convergence properties of (10.1) can be investigated. The results can be
summarized as follows:
V̄(θ) = lim_{N→∞} E (1/N) Σ_{t=1}^{N} ε²(t, θ) .   (10.6)
This follows from the quoted convergence theory, in which the main step is
that f(θ), defined by (10.3), will be -V̄'(θ), the negative gradient of (10.6). The
function V̄(θ) can thus be used as a Lyapunov function to prove the required
stability of (10.5). It can also be shown that
Pseudolinear regressions
For the PLR (9.2) less general results can be obtained. For special cases, like
the ARMAX case, it can be shown that the condition

Re[1/C(e^{iω}) - 1/2] > 0   for all ω

is required for convergence, where C represents the true description of the noise
characteristics in (2.6). For the output error model (2.7) an analogous condition
has to be imposed on the f coefficients. These results are proven in Ljung (1977b)
using the ODE method and in Solo (1979) using a martingale approach.
The forms in which the algorithms have been given here, such as (10.1), are
well suited for discussion and analysis. However, they are not particularly well
prepared for numerical implementation. A common step is to define the
quantity

P(t) = γ(t) R^{-1}(t) .

Under ideal conditions, for a constant system, this gain should asymptotically
behave like
Even for constant, time-invariant systems it has been noted that much better
transient performance is achieved if t · γ(t) increases to 2-3 during the first
couple of hundred samples and then falls back to 1. This may be achieved by
the choice
Often, the real reason for using recursive identification is that the system's
properties change with time. Then γ(t) should not tend to zero but to a value
γ_0 > 0, chosen as a trade-off between tracking capability (large γ_0) and noise
insensitivity (small γ_0). Alternatively, λ(t) is chosen as a constant λ_0 < 1. The
rule of thumb is that with such a choice, the algorithm 'remembers' the last
1/(1 - λ_0) = 1/γ_0 data points. The search direction, defined by the matrix R in
the algorithms above, may have an important influence on the algorithms. The
Gauss-Newton direction (8.16e) normally gives the best behavior, but is also
more demanding computationally.
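A scalar sketch of tracking with a constant forgetting factor λ_0 = 0.98, i.e. a memory of roughly 1/(1 - λ_0) = 50 samples; the parameter jump and noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N, lam = 2000, 0.98                # memory roughly 1/(1 - 0.98) = 50 samples
u = rng.standard_normal(N)
b = np.where(np.arange(N) < N // 2, 1.0, 2.0)   # system gain jumps mid-run
y = b * u + 0.1 * rng.standard_normal(N)

theta, p = 0.0, 100.0              # scalar RLS with constant forgetting factor
est = []
for t in range(N):
    k = p * u[t] / (lam + u[t] * p * u[t])
    theta = theta + k * (y[t] - u[t] * theta)
    p = (p - k * u[t] * p) / lam
    est.append(theta)
print(est[N // 2 - 1], est[-1])    # near 1.0 just before the jump, near 2.0 at the end
```

A larger λ_0 would reduce the noise in the estimate but slow down the recovery after the jump, which is exactly the trade-off described above.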
13. Conclusions
References
Åström, K. J. (1980). Maximum likelihood and prediction error methods. Automatica 16, 551-574.
Åström, K. J. and Bohlin, T. (1965). Numerical identification of linear dynamic systems from
normal operating records. IFAC Symposium on Self-Adaptive Systems, Teddington, England.
Also in P. H. Hammond, ed., Theory of Self-Adaptive Control Systems. Plenum Press, New York.
Bierman, G. J. (1977). Factorization Methods for Discrete Sequential Estimation. Academic Press,
New York.
Eykhoff, P. (1974). System Identification. Wiley, London.
Eykhoff, P., ed. (1981). Trends and Progress in System Identification. Pergamon Press, Oxford.
Goodwin, G. C. and Payne, R. L. (1977). Dynamic System Identification: Experiment Design and
Data Analysis. Academic Press, New York.
Gupta, N. K. and Mehra, R. K. (1974). Computational aspects of maximum likelihood estimation
and reduction in sensitivity function calculations. IEEE Trans. Automat. Control AC-19, 744-783.
Landau, I. D. (1976). Unbiased recursive identification using model reference techniques. IEEE
Trans. Automat. Control AC-21, 194-202.
Ljung, L. (1977a). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Control
AC-22, 551-575.
Ljung, L. (1977b). On positive real transfer functions and the convergence of some recursions.
IEEE Trans. Automat. Control AC-22, 539-551.
Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Trans.
Automat. Control AC-23, 770-783.
Ljung, L. (1979). Asymptotic behaviour of the extended Kalman filter as a parameter estimator for
linear systems. IEEE Trans. Automat. Control AC-24, 36-50.
Ljung, L. and Caines, P. (1979). Asymptotic normality of prediction error estimation for
approximate system models. Stochastics 3, 29-46.
Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identification.
M.I.T. Press, Cambridge, MA.
Lüders, G. and Narendra, K. S. (1974). Stable adaptive schemes for state estimation and
identification of linear systems. IEEE Trans. Automat. Control AC-19, 841-847.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22,
400-407.
Söderström, T. and Stoica, P. (1981). Comparison of instrumental variable methods: consistency
and accuracy aspects. Automatica 17, 101-115.
Söderström, T., Ljung, L. and Gustavsson, I. (1978). A theoretical analysis of recursive
identification methods. Automatica 14, 231-244.
Solo, V. (1978). Time series recursions and stochastic approximation. Ph.D. dissertation. The
Australian National University, Canberra, Australia.
Solo, V. (1979). The convergence of AML. IEEE Trans. Automat. Control AC-24, 958-963.
Strejc, V. (1980). Least squares parameter estimation. Automatica 16, 535-550.
Talmon, J. L. and van den Boom, A. J. W. (1973). On the estimation of transfer function
parameters of process and noise dynamics using a single-stage estimator. Proc. 3rd IFAC
Symposium on Identification and System Parameter Estimation, The Hague. North-Holland,
Amsterdam.
Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems. Academic Press, New York.
Widrow, B., Glover, J. R. Jr. et al. (1975). Adaptive noise cancelling: Principles and applications.
Proc. IEEE 63, 1692-1716.
Young, P. C. (1970). An instrumental variable method for real-time identification of a noisy
process. Automatica 6, 271-287.
Young, P. C. (1976). Some observations on instrumental variable methods of time-series analysis.
Internat. J. Control 23, 593-612.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 213-255

8

Peter Young

1. Introduction
[Fig. 1. Types of data processing: 1. batch (en-bloc) processing; 2. recursive processing, with recursive and iterative variants.]
The major time-series model form considered here is the transfer function
(TF) model, which came into prominence with the publication in 1970 of Box
and Jenkins' book Time Series Analysis, Forecasting and Control. This
model is shown diagrammatically in Fig. 2 and can be represented in the
[Fig. 2. Block diagram of the TF model: the input u_k passes through B(L)/A(L) to give the noise-free output x_k, the white noise e_k passes through D(L)/C(L) to give the coloured noise ξ_k, and y_k = x_k + ξ_k.]

y_k = (B(L)/A(L)) u_k + (D(L)/C(L)) e_k ,   (1)
A(L) = 1 + a_1 L + ... + a_n L^n ,
B(L) = b_0 + b_1 L + ... + b_n L^n ,
(2)
C(L) = 1 + c_1 L + ... + c_q L^q ,
D(L) = 1 + d_1 L + ... + d_p L^p .
Here x_k is the hypothetical 'noise-free' output of the system (i.e. that part of
the observed output y_k that is causally related to the input u_k) and ξ_k is the
'coloured noise' which is generated from e_k by the ARMA model (3.ii).
Equation (1) is a perfectly general linear time-series model that seems widely
applicable to practical problems. And it can be related, quite simply, to other
Recursive identification, estimation and control 217
A(L) x_k = B(L) u_k ,
C(L) ξ_k = D(L) e_k ,   (5)
y_k = x_k + ξ_k ,
where all the variables x_k, u_k, ξ_k, e_k and y_k are now specified vector quantities
(i.e. u_k and y_k are composed, respectively, of the measured inputs and outputs
of a multi-input, multi-output system), while the polynomials A(L), B(L), etc.
are appropriately dimensioned matrix polynomials in L (see Jakeman and
Young, 1979). However, the complexity of the estimation algorithms is much
increased in this vector case and the difficulty of applying them in practice
should not be underestimated.
Over the past twenty years, numerous recursive methods have been suggested
for estimating the parameters in time-series models. On the other hand,
if we assume here that the user demands an algorithm which can, if necessary,
generate estimates with the desirable statistical properties of consistency and
asymptotic efficiency, then the choice is much more limited. Currently, there are
only two classes of recursive estimation procedure which possess these properties:
the prediction error (PE) approach suggested by Ljung (1979) and others, and the
related optimal generalised equation error (OGEE) method of Young and
A(L) y_k = B(L) u_k + e_k   (6)

or

y_k = -a_1 y_{k-1} - ... - a_n y_{k-n} + b_0 u_k + ... + b_n u_{k-n} + e_k .   (7)
The recursive least squares (LS) algorithm for estimating the parameter vector
a is well known (see, e.g., Young, 1974 for a tutorial paper on this subject) and
takes the form

â_k = â_{k-1} + P_k z_k [y_k - z_k^T â_{k-1}] ,   I(i)

P_k = P_{k-1} - P_{k-1} z_k [σ² + z_k^T P_{k-1} z_k]^{-1} z_k^T P_{k-1} .   I(ii)

For the full TF model (1) the error can be written

e_k = (C/D)[A y_k - B u_k] ,   (9)

where the nomenclature for the polynomials has been simplified for
convenience. The corresponding prediction error (PE) recursion has the same
structure, with the gradient vector ψ_k in place of z_k and ε_k the recursive
prediction error:

θ̂_k = θ̂_{k-1} + P_k ψ_k ε_k ,   II(i)

P_k = P_{k-1} - P_{k-1} ψ_k [σ² + ψ_k^T P_{k-1} ψ_k]^{-1} ψ_k^T P_{k-1} ,   II(ii)
θ = [a_1, ..., a_n, b_0, ..., b_n, c_1, ..., c_n, d_1, ..., d_n]^T   (12)

and

ψ_k = ∇_θ(e)_k ,   (13)
where ∇_θ(e)_k denotes a vector composed of partial derivatives of e with respect
to each element of θ in turn, evaluated at the kth sampling instant. Note that
in algorithm II, P_k is a full (4n + 1) × (4n + 1) matrix and the generation of ψ_k will require
the use of adaptive 'prefilters'. The form and parameters of these prefilters will
depend upon the estimated parameters in θ̂ and will need to be updated
adaptively in some manner (e.g. Ljung, 1979).
The recursive PE algorithm for the TF model proposed by Ljung (1979) has
not been implemented in practice and so it is a relatively unknown quantity in
practical terms. It is, however, closely related to the equivalent OGEE
algorithm discussed below.
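The recursive least squares algorithm described above can be sketched and checked against the en-bloc estimate it should reproduce (with the σ² = 1 convention; the data-generating system below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + 1.5 * u[t - 1] + 0.2 * rng.standard_normal()

Phi = np.column_stack([y[:-1], u[:-1]])        # regressor rows z_k^T
theta_batch = np.linalg.solve(Phi.T @ Phi, Phi.T @ y[1:])   # en-bloc LS estimate

theta = np.zeros(2)
P = 1e6 * np.eye(2)                            # large P0 approximates a diffuse prior
for t in range(1, N):
    z = np.array([y[t - 1], u[t - 1]])
    kgain = P @ z / (1.0 + z @ P @ z)          # gain, with sigma^2 = 1
    theta = theta + kgain * (y[t] - z @ theta)
    P = P - np.outer(kgain, z @ P)
print(theta_batch, theta)                      # the two estimates agree to within rounding
```

The agreement is the point: the recursive form processes one sample at a time yet arrives at (essentially) the same answer as the batch solution, which is why recursive schemes can be presented as a natural development of off-line methods.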
e_k = (C/D)[A y_k - B u_k]   (14)

or

e_k = A y*_k - B u*_k ,   (15)
where the star indicates that the variables are filtered by a prefilter with a
Note that the use of instrumental variables in the solution of this problem
arises naturally when the estimation problem for the TF model (1) is considered
within the context of maximum likelihood (Young, 1976). However, it
is also justified intuitively: as we have seen, the TF model is of the errors-in-variables
type, and the IV method is well known as the simplest approach to the
consistent estimation of parameters in such models (e.g. Kendall and Stuart,
1961). The advantage of the refined IV method for the TF model is that it is
'optimal' in the sense that the particular choice of IV vector in (18) results in
asymptotically efficient (i.e. minimum variance) as well as consistent estimates
(see also Stoica and Söderström, 1983). This optimal behaviour has been
discussed theoretically by Solo (1978) and demonstrated in practice, using
Monte Carlo stochastic simulation techniques, by Young and Jakeman (1979).
It should also be noted that a 'symmetric matrix' version of algorithm III is
possible, in which ẑ_k^T in (ii) is replaced by x̂_k^T whenever it occurs. Both
algorithms perform similarly in asymptotic statistical terms (see, e.g., Solo,
1978), but the symmetric matrix version appears computationally less robust in
general use (Young and Jakeman, 1979). Nevertheless, the symmetric matrix
version of III(ii) is useful, subsequent to convergence of the more robust
nonsymmetric matrix version, for generating a good estimate of the error-covariance
matrix associated with the parameter estimates.
In order to allow for the adaptive prefiltering implicit in the star superscript
notation of (18) and (19), a second recursive estimator, the refined approximate
maximum-likelihood (AML) algorithm, is used to obtain recursive estimates of
the parameters in the C and D polynomials. If we denote the estimate of the
noise model parameter vector c = [c_1, ..., c_q, d_1, ..., d_p]^T by ĉ, then the
refined AML algorithm takes the form
P^N_k = P^N_{k-1} - P^N_{k-1} m_k [σ² + m_k^T P^N_{k-1} m_k]^{-1} m_k^T P^N_{k-1} ,   IV(ii)

where

ê_k = ξ̂_k - m_k^T ĉ_{k-1}   (21)

and

m_k = [-ξ̂**_{k-1}, ..., -ξ̂**_{k-q}, ê**_{k-1}, ..., ê**_{k-p}]^T .   (22)
The variables ξ̂_k and ê_k are, respectively, estimates of the noise variables ξ_k and
e_k obtained recursively as the estimation proceeds, while the double star
superscript indicates that the variables are adaptively prefiltered, in this case by
a prefilter of the form 1/D̂. This algorithm is similar to the recursive maximum-likelihood
(RML) procedure of Söderström (1973), but here it is applied in the
purely stochastic situation. Full details of these refined IVAML algorithms,
together with a description of their coordination, are given by Young and
Jakeman (1979). It should be noted, however, that in off-line applications a
recursive-iterative solution is utilised: the recursive estimates are obtained
after several iterative steps through the whole time-series data set, starting with
an initial recursive least squares evaluation as the first step (see, e.g., Young
and Jakeman, 1979). In this way, improved efficiency is obtained, especially for
small sample sizes. It is this recursive-iterative algorithm which is primarily
considered in this chapter.
The similarity between the refined IVAML and PE inspired algorithms for
the TF model becomes clear when we note that ψ_k in (13) is defined as
4. State-variable estimation
¹Computationally, it is usual to set σ² = 1.0 in algorithms I to IV and then the covariance
matrices are obtained as σ̂²P_k, σ̂²P^N_k and σ̂²P̂_k (Young, 1984).
x̂ = [x̂_1, x̂_2, ..., x̂_n]^T on the basis of noisy observations y_k of the system output and
assumed perfect knowledge of the input u_k. The analogy with the parameter
estimation problem discussed in previous sections is obvious.
It is well known (e.g. Åström, 1970) that the Kalman filter equations
themselves constitute a description of the stochastic system, the so-called
'innovations representation'. This can be written in the following 'state-space'
form:
where, without loss of generality, we can assume the following canonical form
for {F, g, h}:

    | -a_1  1  0  ...  0 |        | b_1 |
F = | -a_2  0  1  ...  0 | ,  g = | b_2 | ,  h^T = [1  0  ...  0] .
    |   :            1   |        |  :  |
    | -a_n  0  0  ...  0 |        | b_n |

In (23), x̂_k is the 'conditional mean' state vector (the optimal estimate of the
state vector). This innovations representation is equivalent to the TF form

y_k = (B(L)/A(L)) u_k + (D(L)/A(L)) e_k ,   with B(L) = b_1 L + ... + b_n L^n ,
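A quick numerical check of the canonical form: building {F, g, h} from hypothetical a and b coefficients and verifying that the (deterministic part of the) state-space simulation reproduces the corresponding difference equation.

```python
import numpy as np

a = [0.5, -0.2]        # hypothetical a1, a2
b = [1.0, 0.3]         # hypothetical b1, b2
n = len(a)

# Observer canonical form: -a down the first column, identity on the superdiagonal.
F = np.zeros((n, n))
F[:, 0] = -np.array(a)
F[:-1, 1:] = np.eye(n - 1)
g = np.array(b)
h = np.zeros(n); h[0] = 1.0

rng = np.random.default_rng(6)
u = rng.standard_normal(50)

x = np.zeros(n)
y_ss = []
for uk in u:                       # x(k+1) = F x(k) + g u(k), y(k) = h^T x(k)
    y_ss.append(h @ x)
    x = F @ x + g * uk

# The same input through the difference equation
# y(k) = -a1 y(k-1) - a2 y(k-2) + b1 u(k-1) + b2 u(k-2):
y_de = np.zeros(50)
for k in range(1, 50):
    y_de[k] = -a[0] * y_de[k - 1] + b[0] * u[k - 1]
    if k >= 2:
        y_de[k] += -a[1] * y_de[k - 2] + b[1] * u[k - 2]
print(np.allclose(y_ss, y_de))
```

The two simulations coincide exactly, which is the sense in which the state-space form and the TF form describe the same input-output relationship.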
where Z k is defined by
In these equations,
p = [d + - a T : g + ] + ,
u_k = -K^T x̂_k ,   (25)

where K = [K_1, K_2, ..., K_n]^T is the optimal SVF gain vector. From (24) and (25),
therefore, the control input is given by

u_k = -K^T ẑ_k

or

u_k = -γ^T z*_k ,   (26)
where z*_k is the vector of prefiltered input and output signals, as generated and
used in the optimal parameter estimation algorithm III, and γ is a suitably
defined gain vector. For example, in the simplest, first-order case, γ is a
2-dimensional vector defined by

γ^T = [K_1(d_1 - a_1), K_1 b_1] ,

so that

u_k = -[K_1(d_1 - a_1) y*_{k-1} + K_1 b_1 u*_{k-1}] .
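In this first-order case the control computation is just this arithmetic on prefiltered signals; the gain K_1 and the model values below are hypothetical, purely for illustration.

```python
# Hypothetical first-order values: K1 from an LQ design; a1, b1, d1 from the TF model.
K1, a1, b1, d1 = 0.4, -0.8, 1.0, -0.3

gamma = [K1 * (d1 - a1), K1 * b1]           # gamma^T = [K1(d1 - a1), K1*b1]

def control(y_star, u_star):
    """u_k = -[K1(d1 - a1) y*_{k-1} + K1 b1 u*_{k-1}]; prefiltered signals assumed given."""
    return -(gamma[0] * y_star + gamma[1] * u_star)

print(control(1.0, 0.5))
```

Note that only the prefiltered y* and u* from algorithm III are needed, which is the sense in which the controller lives entirely in the observation space.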
From this analysis, we see that optimal stochastic control can be synthesised
and implemented directly in the observation space, without specific reference to
the state-variable estimation and the Kalman filter. It is interesting to note that
this approach has some similarities with that of Arthur (1978), who considers
the problem of LQG control for systems described by the related 'distributed-lag'
models, as used in econometric studies. The reader is directed to this paper
for a more complete discussion of the optimal control aspects, albeit limited to
the more restricted distributed-lag representation.
5.1. Data pre-processing
Most pre-processing of data, prior to formal identification and estimation,
involves filtering of one kind or another. Such filtering operations are usually
aimed at removing nonstationarity, in the form of deterministic or stochastic
trends; or periodic behaviour, as in 'seasonal adjustment'. Given the filtering
interpretation that Kalman placed on recursive estimation, it is not surprising
that such procedures can be used very effectively in this pre-processing stage of
the analysis.
Here we will outline a very simple recursive approach which uses off-line,
recursive-smoothing versions of the Kalman filter, as suggested by Jakeman
and Young (1979). Other research work of a similar kind has been reported
previously (e.g. Pagan, 1975) but has not tended to emphasise the practical
advantages obtained by basing the algorithmic design on the simplest class of
nonstationary processes, namely the random walk family. Such an approach
also allows for the development of simple but effective algorithms for local
trend estimation.
226 P. Young
In this approach, the observed series y_k is modelled as

    y_k = U_k + e_k ,    (27)

where the trend U_k is described by one of:
(a) a RW process,

    U_k = U_{k-1} + η_k ;    (28)

(b) an IRW process,

    U_k = U_{k-1} + D_{k-1} ,    D_k = D_{k-1} + η_k ;    (29)

(c) an SRW process,

    U_k = U_{k-1} + D_{k-1} ,    D_k = φ D_{k-1} + η_k ;    (30)
where, in all cases, η_k is a zero mean, white noise sequence with variance q;
while, in case (c), φ is a smoothing parameter (0.5 < φ < 1.0) introduced to
produce a model whose descriptive ability lies somewhere between (a) and (b),
depending upon its chosen value (Young, 1984).
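The random walk family is easily simulated. The sketch below assumes the standard RW/IRW/SRW forms of Young (1984); the variance q, the parameter φ and the seed are illustrative choices.

```python
import random

def simulate_trend(n, model="RW", q=0.01, phi=0.95, seed=0):
    """Simulate the random walk family used to model the trend U_k:
      'RW'  : U_k = U_{k-1} + eta_k
      'IRW' : U_k = U_{k-1} + D_{k-1},  D_k = D_{k-1} + eta_k
      'SRW' : U_k = U_{k-1} + D_{k-1},  D_k = phi*D_{k-1} + eta_k
    where eta_k is zero-mean white noise with variance q and phi is the
    SRW smoothing parameter (0.5 < phi < 1.0)."""
    rng = random.Random(seed)
    U, D, out = 0.0, 0.0, []
    for _ in range(n):
        eta = rng.gauss(0.0, q ** 0.5)
        if model == "RW":
            U = U + eta
        else:
            U = U + D
            D = (phi if model == "SRW" else 1.0) * D + eta
        out.append(U)
    return out
```

The IRW produces smoother, locally linear trends than the RW, with the SRW lying between the two as φ moves from 0 towards 1.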
At first sight, the choice of such simple models for U k would appear to limit
the practical utility of the subsequent estimation algorithms. But this would not
appear to be the case, as we shall see in a number of practical examples
described in subsequent sections of this chapter.
In order to obtain the recursive-smoothing algorithms, it is necessary to use
the Kalman filter formulation (Kalman, 1960; Young, 1984) with (27) providing
the observation equation and one of the equations (28) to (30) defining the
state equations. This requires that we assume e_k is itself a zero mean white
noise sequence independent of η_k. Once again, while this would seem a
restrictive assumption, it does not appear to affect the performance of the
algorithm very much in practical terms. Clearly, however, more complex
stochastic models than (28)-(30) could be utilised (see Young, 1984; Pagan,
1975), if this proves necessary. Also, more complex observation equations can
introduce still greater flexibility (Jakeman and Young, 1979).
All of the models (28)-(30) can be considered to be of the general Gauss-
Markov form

    X_k = Φ X_{k-1} + Γ η_{k-1} .    (31)

Although superficially complex, these equations are really quite simple because
of the simplicity of Φ and Γ.
The 'filtering' algorithm V provides an estimate Û_k which is based on all of
the data up to and including the kth sample in the time series. To obtain the
smoothed estimate Û_{k|N}, i.e. the estimate at the kth sample based on all N
samples in the time series, it is necessary to process the data further as
discussed, for example, by Gelb et al. (1974). In contrast to the filtering
algorithm V, however, the smoothing algorithm has a number of forms, some
of which are inherently unstable in computational terms. Norton (1975) con-
cludes that the most useful form is that proposed by Bryson and Ho (1969).
Here the smoothed estimates are obtained in a backwards recursion of the
form
form
with λ_N = 0.
In practice, algorithm V is easy to use: the analyst simply chooses a value for q
(or q and φ if (30) is being used to model U_k) and observes the form of the
resulting Û_{k|N} estimate; he can then adjust q until satisfactory results are
obtained. It is possible to conceive of more complex algorithms in which q is
selected adaptively in some manner, (e.g. Davis, 1983), but practical experience
has shown that, at least in off-line applications, it is best to leave some freedom
for subjective judgement on the part of the analyst. After all, the human eye is
a very good filter of data, particularly in relation to the detection of long-term
trends.
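The procedure can be sketched for the scalar RW case. This sketch uses the equivalent Rauch-Tung-Striebel backward pass rather than the Bryson-Ho form discussed by Norton (1975); the diffuse prior variance is an illustrative assumption, and the observation noise variance is normalised to unity, as in algorithm V.

```python
def rw_smooth(y, q):
    """Off-line trend estimation with a scalar random-walk state: filter
    forward (observation noise variance normalised to unity), then smooth
    backwards with the Rauch-Tung-Striebel fixed-interval pass.
    q is the noise variance ratio chosen by the analyst."""
    n = len(y)
    u_f, p_f, u_pred, p_pred = [], [], [], []
    u, p = y[0], 1e6           # diffuse prior on the initial trend level
    for k in range(n):
        up, pp = u, p + q      # prediction: RW state transition is unity
        g = pp / (pp + 1.0)    # Kalman gain with unit observation variance
        u, p = up + g * (y[k] - up), (1.0 - g) * pp
        u_pred.append(up); p_pred.append(pp); u_f.append(u); p_f.append(p)
    u_s = u_f[:]               # backward (fixed-interval) smoothing pass
    for k in range(n - 2, -1, -1):
        a = p_f[k] / p_pred[k + 1]
        u_s[k] = u_f[k] + a * (u_s[k + 1] - u_pred[k + 1])
    return u_s
```

Small q gives a heavily smoothed trend, large q a trend that follows the data closely, which is exactly the freedom the analyst exercises by eye.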
    R_T² = 1 - [ Σ_{k=1}^N ξ̂_k² ] / [ Σ_{k=1}^N (y_k - ȳ)² ] ,    (32)

    EVN = (1/2n) Σ_{i=1}^{2n} p̂_ii .    (33)
² Note that, in algorithm V, the variance of e_k is normalised to unity for simplicity (cf. algorithm VI).
³ Sometimes normalised measures may be preferable (Young et al., 1980, p. 284).
Here ξ̂_k is the estimate of ξ_k, i.e. ξ̂_k = y_k - x̂_k; ȳ is the mean value of y_k; and p̂_ii
is the ith diagonal element of the P_N matrix obtained from algorithm II.
In the normal manner, R_T² is a measure of model fit (although it will be
noted that it is not based here on the usual least squares residuals); while the EVN,
which is typically evaluated in logarithmic form (i.e. ln EVN), is an indication
of the overall (or average) variance of the parameter estimates in the 2nth
order model. As Young et al. (1980) show, the EVN appears to be a sensitive
indicator of over-parameterisation: If the model has too many parameters, then
it can be shown that the instrumental cross-product matrix,⁴ Σ_{k=1}^N x̂_k z_k^T, tends to
singularity and its inverse, P̂_N, as computed by the refined instrumental variable
(IV) algorithm II, tends to increase sharply in value, with a consequent
increase in the EVN. This increase can often be of several orders of magnitude.
Thus the analyst monitors both statistics and chooses the model which has the
best combination of R_T² and ln EVN: usually R_T² will have reached a 'plateau'
level, with little further increase for any increase in model order; while ln EVN
will have a very low value in relation to that obtained for higher order models.
There is a need for more statistical research on the EVN test, since it is
currently justified largely in numerical analysis terms. Nevertheless, it does
appear to work well in practical applications. For example, when applied to the
famous gas furnace data of Box and Jenkins (1970), it revealed rather obvious
over-parameterisation of the Box-Jenkins identified model and suggested a TF
representation with two parameters rather than four (see Young et al., 1980).
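Statistics (32) and (33) amount to only a few lines of code. The function below is an illustrative sketch, with `p_diag` standing for the diagonal elements of the P_N matrix from the estimation algorithm.

```python
import math

def rt2_and_log_evn(y, x_hat, p_diag):
    """Identification statistics (32) and (33):
    R_T^2 = 1 - sum(xi_hat^2)/sum((y - ybar)^2), with xi_hat = y - x_hat;
    EVN   = mean of the diagonal elements of P_N, reported as ln EVN."""
    ybar = sum(y) / len(y)
    xi = [yk - xk for yk, xk in zip(y, x_hat)]
    rt2 = 1.0 - sum(e * e for e in xi) / sum((yk - ybar) ** 2 for yk in y)
    log_evn = math.log(sum(p_diag) / len(p_diag))
    return rt2, log_evn
```

A model is preferred when R_T² has reached its plateau while ln EVN remains low; a sharp rise in ln EVN with model order signals over-parameterisation.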
and TF estimation algorithms are utilised, with the current recursive estimate ek
obtained from the inverse AR model being used as the latest estimated input to
the TF estimation algorithm. However, such a procedure is more complicated
than the AML algorithm III and it has not yet been evaluated in practical
terms.
where â_{k|k-1} and P̂_{k|k-1} are the a priori predictions of â_k and P̂_k, based on the
estimates at the (k-1)th sampling instant and knowledge of the assumed
Gauss-Markov model; while Q is the covariance matrix of the white noise
vector in the assumed Gauss-Markov model for the parameter variations. The
Bayesian nature of the recursive algorithm is clear from these equations where
the a priori predictions are converted to a posteriori estimates on receipt of the
latest data; and where these a posteriori estimates provide the basis for a priori
predictions at the next recursion, as in the Kalman filter.
In the simplest RW case, the algorithm is extremely simple, with the
prediction equations reducing to

    â_{k|k-1} = â_{k-1} ,    P̂_{k|k-1} = P̂_{k-1} + Q .

The Q matrix in this analysis is restricted to diagonal form and its elements are specified by the user
on the basis of the diagonal elements of the final P_N matrix that are displayed
to him after an initial estimation run with Q = 0 (the user may need to adjust Q
further on the basis of the results obtained with his initially specified values for
the diagonal elements).
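One recursion of this scheme can be sketched as follows for the scalar-observation case, with the observation noise variance normalised to unity. The function name and the list-based matrices are illustrative, not the CAPTAIN implementation.

```python
def rls_rw_step(theta, P, z, y, Q):
    """One recursion of least squares with random-walk parameter variation:
    predict (theta unchanged, covariance inflated to P + Q), then correct
    on the new sample. theta and z are parameter and regressor vectors
    (lists); P and Q are matrices as nested lists; the observation noise
    variance is normalised to unity."""
    n = len(theta)
    # prediction: a priori estimate unchanged, covariance inflated by Q
    Pp = [[P[i][j] + Q[i][j] for j in range(n)] for i in range(n)]
    Pz = [sum(Pp[i][j] * z[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(z[i] * Pz[i] for i in range(n))
    g = [v / denom for v in Pz]                       # gain vector
    err = y - sum(theta[i] * z[i] for i in range(n))  # innovation
    theta = [theta[i] + g[i] * err for i in range(n)]
    P = [[Pp[i][j] - g[i] * Pz[j] for j in range(n)] for i in range(n)]
    return theta, P
```

With Q = 0 this reduces to ordinary recursive least squares; non-zero diagonal elements of Q keep the gain from collapsing and so allow the estimates to track parameter variations.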
The results obtained in steps 1 to 6 often provide a satisfactory time-series
model for most practical purposes; in certain cases, however, it may be
necessary to proceed with further, more sophisticated analysis based on either
the optimal algorithms available in the mainframe CAPTAIN package or
independent and specialised nonlinear time-series analysis (e.g. Young, 1978,
1982a; Priestley, 1980).
In addition, another microcomputer program MCAPTREF is available
which provides a simple version of the refined IVAML approach to model
estimation. Having obtained estimates for the A, B, C and D polynomials
using MICROCAPTAIN, the prefiltered variables u*_k, y*_k and x̂*_k are generated
off-line and used for re-estimation of the A and B polynomials. In effect, these
are the estimates obtained after one iteration of the refined IVAML program
in CAPTAIN and the statistical properties of the estimates so obtained are
usually quite close to those of the optimal estimates obtained with the main-
frame program. However, as we have stressed previously (Young and Jakeman,
1979), the basic IVAML estimates are often good enough for most practical
purposes and further refinement is usually only justified in special circum-
stances.
The CAPTAIN package was assembled in 1970 and has been used in many
practical applications during the ensuing thirteen years. MICROCAPTAIN
was written in 1982 and, although still in its final stages of development, has
proven extremely useful in teaching and research. In this section, we discuss a
number of examples which illustrate the use of both programs and demonstrate
the power of the recursive approach to time-series analysis.
practical robustness of the constituent algorithms when the noise on the data
does not conform to the usual statistical assumptions.
Fig. 3a shows the 100 sample set of output data for the simulation example.
These data are generated from the following model:
    y_k = [0.2 L³ / (1 - 1.5 L + 0.7 L²)] u_k + ξ_k ,    (34)
where the deterministic input u_k is a random binary (±1) signal and the noise ξ_k
is the sum of a pure sinusoidal component and a stochastic trend, in the form of
a positive random walk. The transfer function in (34) represents a second-order
oscillatory system with a pure time delay of 3 sampling periods, and the
sinusoidal component in the noise is chosen so that it interferes seriously with
the natural oscillations of the system.
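Data with the same structure as (34) can be generated in a few lines. In the sketch below the sinusoid amplitude and period, the random walk variance and the seed are illustrative choices, not the values used in the original example.

```python
import math, random

def generate_data(n=100, seed=1, amp=1.0, period=8.0, q=0.05):
    """Generate data resembling (34): a second-order oscillatory transfer
    function with a 3-sample pure time delay, driven by a random binary
    (+/-1) input, plus noise made of a sinusoid and a stochastically
    rising random walk. amp, period, q and seed are illustrative."""
    rng = random.Random(seed)
    u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    x, trend, y = [0.0, 0.0], 0.0, []
    for k in range(n):
        # noise-free output: x_k = 1.5 x_{k-1} - 0.7 x_{k-2} + 0.2 u_{k-3}
        xk = 1.5 * x[-1] - 0.7 * x[-2] + (0.2 * u[k - 3] if k >= 3 else 0.0)
        trend += abs(rng.gauss(0.0, q ** 0.5))   # positive (rising) RW
        xi = amp * math.sin(2 * math.pi * k / period) + trend
        x.append(xk)
        y.append(xk + xi)
    return u, x[2:], y
```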
The major problem posed by the data set shown in Fig. 3a lies in the nature
of the noise: not only does it contain a nonrandom sinusoid, chosen to confuse the
estimator, but the sinusoid is embedded in a nonstationary, stochastically rising
trend. The RWSMOOTH estimated trend is shown as a dashed line in Fig. 3a
and the de-trended output is given in Fig. 3b.
Estimates of the model parameters can be obtained directly from the data
shown in Fig. 3a, or from the data with the output de-trended, as in Fig. 3b.
Table 1 shows the estimates obtained in both cases: It is clear that the
Fig. 3. Output data for simulation example: (a) data (full) with estimated trend (dashed); (b) de-trended data.
Table 1
estimated parameters are quite similar but the estimated standard errors are
much smaller when the output series is de-trended.
Fig. 4 shows the graphical results obtained from the analysis of the measured
data. Fig. 4a compares the estimated noise-free output x̂_k with the actual value
x_k; and it also shows the measured noisy output y_k. Fig. 4b compares the noise
estimate ξ̂_k = y_k - x̂_k with the actual noise ξ_k. We see that, although the noise is
particularly bad in this case, the estimation results are very good: on the basis
of only 100 samples it has been possible to obtain remarkably accurate
estimates of both the noise-free output and the noise.
The estimation results obtained from the de-trended data are shown in Figs. 5
and 6. In Fig. 5a, we see how the sinusoidal component of the noise is
Fig. 4. MICROCAPTAIN results for simulation example: (a) model output x̂_k (full) vs. actual output x_k (dashed) and noisy observation y_k (dotted); (b) noise estimate ξ̂_k (full) vs. actual noise ξ_k (dashed).
Fig. 5. MICROCAPTAIN results for simulation example with de-trended output data: (a) model output (full) vs. data y_k (dashed); (b) estimated noise signal.
Fig. 6. MICROCAPTAIN results for simulation example with de-trended output data: (a) recursive estimate of a_1; (b) estimated (full) and actual (dashed) impulse response.
particularly problematical in the second half of the data, with the measured
output y_k at times completely out of phase with the noise-free output x_k (which
is virtually indistinguishable from the estimate x̂_k in this case). Fig. 6a shows
the recursive estimate of the a_1 parameter and, finally, Fig. 6b compares the
estimated and actual impulse response characteristics of the transfer function in
(34).
The robustness of the IV estimation algorithm is obvious from the
results obtained in this simulation example. Although the noise does not
conform to the usual nice statistical assumptions, the algorithm has no difficulty
in estimating the model from a comparatively small sample of 100 input-output
pairs.⁵ We also see that, provided the input signal u_k is independent of the
noise ξ_k, then we can achieve reasonable estimation results, even without the
removal of stochastic trends or bias components.
Of course, the whole topic of nonstationarity and the removal of low
frequency components is a complex one and it is difficult to generalise about
the wisdom or otherwise of such operations: normally the decision will be
problem dependent. Certainly, the ability to obtain reasonable results without
the removal of nonstationary behaviour can sometimes be of practical value.
For example, in hydrological modelling, a 'baseflow' component is often
present on flow data and conventional deterministic rainfall-flow analyses
usually requires its removal prior to model 'calibration'. 6 On the other hand,
when using M I C R O C A P T A I N , it is possible to analyse the data directly, with
the baseflow component treated as noise. The subsequent analysis then not
only provides an estimate of the transfer function between rainfall and flow,
but also an estimate of the noise ~k, from which the baseflow component can be
deduced (see Beer et al., 1982). We will see other examples later in Subsections
7.3 and 7.4, however, where trend removal is extremely helpful in identification
and estimation terms and where its retention would have caused problems.
Fig. 7. Data for flow modelling example: input u_k (dashed); output y_k (full).
taken at two locations on the river over a period of 32 hours with a sampling
interval of one hour. The data are plotted in Fig. 7, where it should be noted
that both series are scaled differently by the automatic scaling procedures in
MICROCAPTAIN: in fact, the magnitude of the input series is about 0.28 of
the output series.
Fig. 8. MICROCAPTAIN results for flow modelling example: (a) model outputs vs. data (dashed); (b) estimated noise signals (best identified model (full); Muskingum model (dashed)).
The best identified model is of first order, with two parameters (a_1, b_0) and a
pure time delay of 2 sampling intervals. Part of the MICROCAPTAIN results
for this model are shown in Fig. 8, where it will be seen that the model has
R_T² = 0.999, indicating a very good explanation of the data; and ln EVN =
-7.138, reflecting the low estimated standard errors on the parameter esti-
mates. These results are compared with those for other models in Table 2,
where it is clear that the first-order model is superior in all ways. The
considerably higher ln EVN values for the other models are indicative of the
less well-defined parameter estimates: for example, the recursive estimate of
Table 2

    1   1   1   2   0.999   -7.138
    2   1   2   2   0.999   -4.774
    3   2   1   2   0.999   -4.227
    4   2   2   2   0.999   -2.696
    5   1   2   0   0.996   -3.706
    y_k = [(-0.22 + 0.91 L) / (1 - 0.81 L)] u_k + ξ_k ,    (35)
and we see from the negative value for the b̂_0 coefficient that it exhibits
non-minimum phase (NMP) behaviour: in other words, the response of y_k to
increases in u_k is an immediate decrease followed by a subsequent increase.
This unacceptable physical behaviour of the model is characteristic of Muskingum
models and it arises from faults in the basic model formulation (see
Whitehead et al., 1979). The model can only be justified in numerical terms,
where it could be argued that the NMP properties reflect a first-order Padé
approximation (e.g. Truxal, 1955) to a pure time delay. However, this was not
the justification used originally in the model formulation and, in any case, it
seems more meaningful in physical terms to explicitly introduce the pure time
delay to reflect transportation processes in the river.
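The NMP behaviour is easy to verify by computing the step response of (35) in its difference-equation form. In the sketch below the noise term ξ_k is omitted and the step length is an arbitrary choice.

```python
def step_response(n=10):
    """Step response of the estimated Muskingum model (35):
    y_k = [(-0.22 + 0.91 L)/(1 - 0.81 L)] u_k, i.e.
    y_k = 0.81*y_{k-1} - 0.22*u_k + 0.91*u_{k-1}."""
    y, prev_y, prev_u = [], 0.0, 0.0
    for _ in range(n):
        u = 1.0                                      # unit step input
        yk = 0.81 * prev_y - 0.22 * u + 0.91 * prev_u
        y.append(yk)
        prev_y, prev_u = yk, u
    return y

# The first sample is negative (the initial decrease), the response then
# rises: this is the non-minimum phase behaviour discussed in the text.
```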
The above analysis shows the value of MICROCAPTAIN time-series analy-
sis: the analyst is able to evaluate existing physically based models objectively
and to introduce new types of model if the conventional models are shown
to be inferior. Other examples of this are available in environmental systems
analysis. For example, Beer and Young (1983) have shown that simple aggre-
gated dead-zone (ADZ) models of dispersion in rivers, which are obtained by
time-series analysis, can be superior to the classical 'Fickian diffusion' model
that has been the standard hydrodynamic model for dispersion processes for
many years.
Fig. 9. Data for economic example: (a) measured data; (b) de-trended data; input u_k (dashed); output y_k (full).
Fig. 10. MICROCAPTAIN results for economic example: (a) model output (dashed) vs. data (dotted); (b) estimated noise signal.
Fig. 11. MICROCAPTAIN results for economic example, assuming a time-variable b_0 parameter: (a) â_1; (b) b̂_0 (capital letters refer to similar time periods in Table 3).
The interpretation of these initial results is outside the scope of this chapter.
But the analysis does expose the possible value of recursive estimation in the
investigation of economic data. Provided the results are used with care and the
limitations of the analysis are acknowledged, we see that potentially useful
information can be obtained.
In this example, we will look at a set of sales data analysed by Jenkins (1979).
The data are shown in Fig. 12: they represent the monthly sales of a com-
petitive group of products from two organisations; one set are the sales of the
organisation which commissioned Jenkins' study and the other are those of a rival
organisation.
Jenkins' analysis is based on the procedures described in his book (Box and
Jenkins, 1970) and it involves initial pre-processing in the form of first
differencing and differencing over lag 12: the first differencing is intended to
remove the obvious nonstationarity and the lag 12 differencing to remove the
equally noticeable seasonal behaviour. The resultant differenced data are
shown in Fig. 13, where it will be seen that a not surprising amplification of
noise has occurred because of the differencing process.
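The noise amplification is easy to quantify: for a white noise component, the filter (1 - L)(1 - L¹²) multiplies the variance by 4, since the sum of the squared filter weights (+1, -1, -1, +1) is 4. A small sketch, with an illustrative sample size and seed:

```python
import random

def difference(series, lag=1):
    """Apply the differencing operator (1 - L^lag) to a series."""
    return [series[k] - series[k - lag] for k in range(lag, len(series))]

def variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# White noise passed through (1 - L)(1 - L^12) has its variance multiplied
# by 4, because the combined filter weights are +1, -1, -1, +1:
rng = random.Random(0)
e = [rng.gauss(0.0, 1.0) for _ in range(100000)]
d = difference(difference(e, 1), 12)
# variance(d) is close to 4 * variance(e)
```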
The mainframe CAPTAIN package has been used to perform an alternative
Table 3

     1    73    72    12
     2    71    72    12
     3    77    71    11
     4    75    68     8
     5    61    61     1
     6    60    56    -5    (A)
     7    32    45    15    oil crisis
     8    27    42   -18    ?
     9    41    47   -13    Common
    10    63    55    -5    Market entry
    11    61    59     1    ?
    12    60    62
    13    73    67
    14    76    68
    15    60    66
    16    76    67
    17    55    63
    18    55    63
    19    71    67
    20    73    68
    21    68    67
    22    70    66
    23    58    63
    24    58    63
    25    69    65
    26    67    66
    27    66    65
    28    57    64
    29    72    67
    30    80    67
    31    62    61
    32    43    54    -6    (B)
    33    57    54    -6    'Winter
    34    50    51    -9    of
    35    41    49   -11    discontent'
    36    40    51    -9
    37    53    59     1
    38    70    70    10    (C)
    39    83    81    21    deflationary
    40    92    91    31    policies
    41   101   100    40    ?
    42   128   110    50
Fig. 12. Data for business example (after Jenkins, 1979): Group 1 sales (dashed); Group 2 sales (full); trend estimated as SRW (φ = 0.99).
Fig. 13. Business data as pre-processed by Jenkins (key as for Fig. 12); Group 1 data shifted vertically for clarity.
where the noise vector ξ_k = [ξ_{1k}, ξ_{2k}]^T is the following first-order vector moving
average process:

    ξ_k = e_k + Θ e_{k-1} ,    (36b)

with the second element estimated as ξ_{2k} = e_{2k} + (-0.21 + 0.681 L) e_{2,k-1}.
Fig. 14. Group 1 (dashed) and Group 2 (full) series plotted against number of samples.
ration of the data. It is quite likely, therefore, that the model would provide a
good basis for forecasting, although this has not been attempted here.
The recursive estimates have also not been shown here, but they were useful
in identifying the model structure: in the finally identified model (36), the
estimates were sensibly stationary and well defined.
Fig. 15. Typical stochastic simulation obtained from estimated business data model (key as for Fig. 12).
ĝ^T = [-1.45, 0.44, 0.11, -0.06, 0.19, -0.20, 0.11] and the estimated impulse
response is shown in Fig. 16a.
The results of the identification analysis are summarised in Table 4. Two sets
of identification statistics are shown: the 'IV Results' were obtained using the
method described in Subsection 5.3; while for the 'LS Results' recursive least
squares, rather than IV estimation, was employed to estimate the ARMA
model parameters.
Both sets of results indicate that the best identified ARMA model has a
second-order C polynomial and either a first- or second-order D polynomial
(i.e. 2,2 or 2,3 models). However, closer examination of the results indicates
that both models have similar impulse response and spectral characteristics;
while the d̂_2 estimate, in the higher order case, is insignificantly different from
zero in relation to its standard error (-0.04 ± 0.04). Consequently, we can reject this model as
being over-parameterised and accept the 2,2 model, which has the form

    ξ_k = [(1 - 0.25 L) / (1 - 1.71 L + 0.79 L²)] e_k .    (37)
Fig. 16. MICROCAPTAIN results for univariate temperature data analysis: (a) estimated impulse responses (7th order AR model (full); ARMA(2,2) model (dashed)); (b) model output (full) vs. data (dashed); (c) typical stochastic simulation from estimated ARMA(2,2) model.
Table 4
Temperature time series: IV Results and LS Results.
Fig. 16a and the model output is compared with the original time-series data in
Fig. 16b. All of the recursive estimates except d̂_1 show rapid convergence and
well-defined characteristics: d̂_1 shows some variation at around 70 samples and
this probably means that it would be wise to carry out further analysis on
another independent data set before accepting and using the model.
Finally, Fig. 16c shows a randomly selected stochastic simulation of the
model (37). By comparison of this time-series with the original time series in
Fig. 16b, it would appear that the analysis has captured the major statistical
patterns in the data and that the ARMA model (37) provides a reasonable
basis for applications such as forecasting, which are referred to in the next
section.
8. State-of-the-art of recursive estimation
obtain some feeling for the large body of literature by consulting the review by
Young (1981b) and the Proceedings of the various IFAC Symposia on
Identification and System Parameter Estimation that have appeared in the last
few years (e.g. Eykhoff, 1973; Isermann, 1979).
One specific area of practical importance that should be mentioned,
however, is 'self-adaptive' and 'self-tuning' control, where a control system is
given the ability to update its control parameters on a continuing basis, in order
to achieve some specified performance. Much research on this topic was carried
out in the 1960s (e.g. Kopp and Orford, 1963; Lee, 1964; Young, 1965, 1966)
and the importance of recursive updating to the development of practical
adaptive control systems has become increasingly apparent since this time (see
Young, 1981a). More recent research (reviewed, for example, by Goodwin and
Sin, 1984) has demonstrated the practical potential of these methods in a wide
variety of application areas. There have been considerable practical develop-
ments, for example, in the area of self-tuning control (Åström and Wittenmark,
1973), where several systems have already been successfully implemented (e.g.
Åström, 1980; Clarke and Gawthrop, 1981; Sanoff and Wellstead, 1982) and
commercial units are becoming available for use in industry.
Other important work on recursive estimation has appeared in the statistics
and economics literature, often within the context of the Kalman filter. Again it
would be impossible to adequately review this literature here, and we will
merely mention briefly certain developments which seem to be of practical
merit. It is worth remarking, however, that significant theoretical developments
have also been reported (e.g. Hannan and Tanaka, 1976; Solo, 1978) and these
have considerably increased the confidence of the practitioner in using recur-
sive methods.
In the early 1970s a number of papers attempted to demonstrate the rele-
vance of recursive estimation to statistical and econometric data analysis. In
general terms, however, they tended to parallel earlier research reported in the
systems literature. A rather more novel insight into the value of recursive
estimation was provided in an important paper by Brown et al. (1975), which
showed how recursive least squares methods could be exploited for the
detection of changes in regression relationships over time.
A quite different use of recursive estimation was proposed at about the same
time by Harrison and Stevens (1976), in a paper which has apparently had
considerable practical impact in the area of time-series forecasting. We have
not dealt with this aspect of recursive estimation in this chapter, since forecast-
ing per se is of less importance in control and systems analysis than in
economic and business data analysis. However, certain aspects of the Harrison
and Stevens methodology are closely related to the data pre-processing
procedures outlined in Subsection 5.1 of this chapter.
In effect, Harrison and Stevens specify simple statistical models (such as the
RW family) for trends and seasonality in the data and use a Kalman filter to
update estimates of these lower frequency components. The techniques are, in
fact, quite closely related to earlier, more conventional procedures, such as
Acknowledgements
The economic data analysed in Subsection 7.3 was collected by Miss Rose-
mary Prince. The CAPTAIN analysis in Subsection 7.4 was carried out with the
help of Miss Christina Sirakott. Both contributions are gratefully ac-
knowledged.
References
Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-217.
Akaike, H. (1972). Use of an information theoretical quantity for statistical model identification.
In: Proc. of 5th Hawaii Int. Conf. on Systems Science, 249-250. Western Periodicals, North
Hollywood.
Akaike, H. (1979). Likelihood and the Bayes procedure. Int. Mtg. on Bayesian Statistics, Valencia,
Spain.
Arthur, W. B. (1978). Stochastic control for linear discrete-time distributed lag models. Internat. J.
Control 28, 611-619.
Åström, K. J. (1970). Introduction to Stochastic Control Theory. Academic Press, New York.
Åström, K. J. (1980). Self-tuning regulators: design principles and applications. In: K. S. Narendra
and I. Monopoli, eds., Applications of Adaptive Control.
Åström, K. J. and Wittenmark, B. (1973). On self-tuning regulators. Automatica 9, 185-199.
Beer, T. and Young, P. C. (1983). Longitudinal dispersion in natural streams. A S C E J. Env. Eng.
109, 1049-1067.
Beer, T., Young, P. C., Humphries, R. B. and Burgess, J. S. (1982). Environmental Water Quality
(CRES Monograph). ANU Press, Canberra.
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control. Holden Day,
San Francisco, CA.
Brown, R. G. (1963). Smoothing, Forecasting and Prediction. Prentice-Hall, Englewood Cliffs, NJ.
Brown, R. L., Durbin, J. and Evans, J. M. (1975). Techniques for testing the constancy of
regression relationships over time. J. Roy. Statist. Soc. Ser. B 37, 149-192.
Bryson, A. E. and Ho, Y. C. (1969). Applied Optimal Control. Blaisdell, MA.
Clarke, D. W. and Gawthrop, P. J. (1981). Implementation and application of microprocessor-
based self tuners. Automatica 17, 233-244.
Conrad, W. and Corrado, C. (1982). Application of the Kalman filter to revisions in monthly sales
estimates. J. Econom. Dynamics Control 1, 177-198.
Davis, P. F. (1983). Time-series analysis applied to data smoothing for an investigation of automatic
adjustments of rates of fitted parameter adaption. National Inst. of Agricult. Eng., Divisional
Note DN/1168.
Eykhoff, P. (ed.) (1973). Identification and System Parameter Estimation. North-Holland, Am-
sterdam and American Elsevier, New York.
Gauss, K. F. (1821, 1823, 1826). Theoria combinationis observationum erroribus minimis obnoxiae,
Parts 1, 2 and supplement. Werke 4, 1-11/8.
Gelb, A. (ed.) (1974). Applied Optimal Estimation. MIT Press for The Analytic Sciences Cor-
poration, Cambridge, MA.
Goodwin, G. C. and Sin, K. S. (1984). Adaptive Filtering, Prediction and Control. Prentice-Hall,
Englewood Cliffs, NJ.
Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive moving average
order. Biometrika 69, 81-94.
Hannan, E. J. and Tanaka, K. (1976). ARMAX models and recursive calculations. In: H. Myoken,
ed., Proc. Conf. System Dynamics and Control in Quantitative Economics. Nagoya City Uni-
versity. Nagoya City, Japan.
Harrison, P. J. and Stevens, C. F. (1976). Bayesian forecasting. J. Roy. Statist. Soc. Ser. B 38,
205-247.
Harvey, A. C. (1981). Time Series Models. Phillip Allan, Oxford.
Harvey, A. C. and Phillips, G. D. A. (1979). The estimation of regression models with ARMA
disturbances. Biometrika 66, 49-58.
Isermann, R. (ed.) (1979). Identification and System Parameter Estimation. Pergamon Press,
Oxford.
Jakeman, A. J. and Young, P. C. (1979). Recursive filtering and the inversion of ill-posed causal
problems. CRES, ANU, Report No. AS/R28/1979. Utilitas Math., to appear.
Jakeman, A. J. and Young, P. C. (1981). On the decoupling of system and noise model parameter
estimation in time-series analysis. Internat. J. Control 34, 423-431.
Jakeman, A. J. and Young, P. C. (1983). Advanced methods of recursive time-series analysis.
Internat. J. Control 37, 1291-1310.
Jakeman, A. J., Young, P. C. and Bayes, A. J. (1982). A computer program for general recursive
time-series analysis. In: G. Saridis, ed., Identification and System Parameter Estimation. Per-
gamon Press, Oxford.
Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York.
Jenkins, G. M. (1979). Practical experiences with modelling and forecasting time-series. In: O. D.
Anderson, ed., Forecasting. North-Holland, Amsterdam.
Johnston, J. (1963). Econometric Methods. McGraw-Hill, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME
Ser. D, J. Basic Eng. 82, 35-45.
Kendall, M. G. and Stuart, A. (1961). The Advanced Theory of Statistics, Vol. 2. Griffin, London.
Kirk, B. L., Rust, B. W. and Van Winkle, W. (1979). Time series analysis by the maximum
likelihood method. Env. Sciences Div., Public. No. 1220. Oak Ridge National Laboratory.
Kopp, R. E. and Orford, R. J. (1963). Linear regression applied to system identification for
adaptive control systems. A I A A J. 1, 2300-2306.
Lai, P. W. (1981). Transfer function modelling: relationship between time-series variables. Concepts
and Techniques in Modern Geography, No. 22.
Lee, R. C. K. (1964). Optimal Estimation, Identification and Control (Res. Monograph 28). MIT
Press, Cambridge, MA.
Ljung, L. (1979). Convergence of recursive estimators. In: R. Isermann, ed., Identification and
System Parameter Estimation, 131-144. Pergamon Press, Oxford.
Ljung, L. (1979b). Asymptotic behaviour of the Extended Kalman filter as a parameter estimator
for linear systems. IEEE Trans. Automat. Control. AC-24, 36-50.
Norton, J. P. (1975). Optimal smoothing in the identification of linear time-varying systems. Proc.
Inst. Electr. Engrs. 122, 663-668.
Pagan, A. (1975). A note on the extraction of components from time-series. Econometrica 43,
163-168.
Parzen, E. (1977). Multiple time-series: determining the order of approximating autoregressive
Schemes. In: P. Krishnaiah, ed., Multivariate Analysis--IV. North-Holland, Amsterdam.
Pierce, D. A. (1972). Least squares estimation in dynamic disturbance time-series models.
Biometrika 59, 73-78.
Plackett, R. L. (1950). Some theorems in least squares. Biometrika 37,149-157.
Priestley, M. B. (1980). State dependent models: a general approach to nonlinear time-series
analysis. Time Series Anal. 1, 47-72.
Rissanen, J. (1976). Minimax entropy estimation of models for vector processes. In: R. K. Mehra
and D. G. Lainiotis, eds., Systems Identification: Advances and Case Studies. Academic Press,
New York.
Sanoff, S. P. and Wellstead, P. E. (1982). Extended self-tuning: practical aspects. In: G. Saridis, ed.,
Identification and System Parameter Estimation. Pergamon, Oxford.
S6derst6m, T. (1973). An on-line algorithm for approximate maximum likelihood identification of
linear dynamic systems. Lund Inst. of Tech., Div. Auto. Control, Rep. No. 7308.
Solo, V. (1978). Time-series recursions and stochastic approximation. Ph.D. thesis. The Australian
National University, Canberra, Australia.
Stoica, P. and S~Sderstr6m, T. (1983). Optimal instrumental variable estimation and approximate
implementations. I E E E Trans. Automat. Control AC-28.
Truxal, T. G. (1955). Control System Synthesis. McGraw-Hill, New York.
Whitehead, P. G., Young, P. C. and Hornberger, G. (1979). A systems model of stream flow and
water quality in the Bedford-Ouse River, I: Stream flow modelling. Water Res. 13, 1155-1169.
Wiener, N. (1949). The Extrapolation, Interpolation and Smoothing of Stationary Time-Series. Wiley,
New York.
Young, P. C. (1965). The determination of the parameters of a dynamic process. Radio Electron.
Engineer (J. Brit. 1ERE) 29, 345-362.
Young, P. C. (1966). Process parameter estimation and self adaptive control. Proc. IFAC Symp.
Teddington. Appears in: P. H. Hammond, ed., Theory of Self Adaptive Control Systems. Plenum
Press, New York.
Young, P. C. (1970). An instrumental variable method for real-time identification of a noisy
process. Automatica 6, 271-287.
Young, P. C. (1974). Recursive approaches to time-series analysis. Bull. Inst. Maths. Appl. 10,
209--224.
Recursive identification, estimation and control 255
1. Introduction
Σ_{i=0}^{p} A(i) y(t − i) = Σ_{i=0}^{q} B(i) e(t − i) , (1.1)
258 M. Deistler
If we add a term Σ_{i=0}^{r} D(i) z(t − i), D(i) ∈ R^{s×m}, on the right-hand side of (1.1), where the z(t) are observed inputs (exogenous variables), we have an ARMAX system.
Every (linear) regular (wide sense) stationary solution (y(t)), t ∈ Z, of an ARMA system is called an ARMA process. One reason for the great practical importance of ARMA models is that every regular stationary process can be approximated with arbitrary accuracy by an ARMA process (this corresponds to a rational approximation of the spectral density) and that only a finite number of (real-valued) parameters is needed for the description of ARMA systems.
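As a concrete illustration of the difference equation (1.1), the scalar case s = 1, p = q = 1 can be simulated directly. The coefficient values below are assumptions chosen for illustration (a stable autoregression and a miniphase moving average), not taken from the text:

```python
import numpy as np

def simulate_arma(A, B, T, rng):
    """Simulate sum_i A[i] y(t-i) = sum_i B[i] e(t-i) with Gaussian e(t)."""
    p, q = len(A) - 1, len(B) - 1
    r = max(p, q)
    e = rng.standard_normal(T + r)
    y = np.zeros(T + r)
    for t in range(r, T + r):
        ma = sum(B[i] * e[t - i] for i in range(q + 1))
        ar = sum(A[i] * y[t - i] for i in range(1, p + 1))
        y[t] = (ma - ar) / A[0]
    return y[r:]

# a(z) = 1 - 0.5 z and b(z) = 1 + 0.3 z are nonzero for |z| <= 1
# (stability and miniphase); these values are illustrative assumptions
y = simulate_arma([1.0, -0.5], [1.0, 0.3], 500, np.random.default_rng(0))
```

With A(0) normalized to 1 this is just the recursion y(t) = −A(1) y(t − 1) + e(t) + B(1) e(t − 1).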
Every ARMA process can be obtained from a state-space system.
2. ARMA representations
y(t) = Σ_{i=0}^{∞} K(i) e(t − i) (2.3)
provided that the inputs are such that the sum on the right-hand side of (2.3)
exists.
In many cases, in addition, the stability condition (2.4), the miniphase condition (2.7) and the norming condition

k(0) = I , (2.6)

hold. Since under these assumptions the e(t) are the prediction errors of the best linear (least squares) one-step-ahead prediction (and thus the (linear) innovations) of y(t) (see e.g. [26, 57]), these conditions are natural if we commence from an ARMA process (rather than from a system).
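Comparing coefficients of z^j in a(z)k(z) = b(z) gives Σ_i A(i) K(j − i) = B(j), so the power-series coefficients K(i) in (2.3) can be computed recursively from the A(i), B(i). A minimal sketch for the multivariable case; the example matrices are assumptions (note that K(0) = I, as in (2.6), when A(0) = B(0) = I):

```python
import numpy as np

def markov(A, B, m):
    """K(0), ..., K(m-1) from the relations sum_i A(i) K(j-i) = B(j)."""
    s = A[0].shape[0]
    A0inv = np.linalg.inv(A[0])
    K = []
    for j in range(m):
        Bj = B[j] if j < len(B) else np.zeros((s, s))
        acc = Bj.copy()
        for i in range(1, min(j, len(A) - 1) + 1):
            acc = acc - A[i] @ K[j - i]
        K.append(A0inv @ acc)
    return K

A = [np.eye(2), np.array([[-0.5, 0.1],
                          [0.0, -0.4]])]    # A(0), A(1): assumed values
B = [np.eye(2), np.array([[0.3, 0.0],
                          [0.0, 0.2]])]     # B(0), B(1): assumed values
K = markov(A, B, 10)                        # K(0) = I, consistent with (2.6)
```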
Σ is uniquely determined under these assumptions. We will assume throughout that

Σ > 0 , (2.8)
Parametrization of ARMA and state-space systems 261
then also k(z) is uniquely determined from f_y. Equation (2.8) means that we restrict ourselves to nonsingular (over the field of rational functions) spectral density matrices f_y. Otherwise there exists an s̃ < s-dimensional process ỹ(t), where (ỹ(t)) and (y(t)) are mutually obtained by finite linear transformations from each other and where the spectral density of (ỹ(t)) is nonsingular. In this sense, if (2.8) is violated, the problem can be suitably transformed to satisfy (2.8). Also, a violation of (2.8) would be indicated from the data.
In the case of ARMAX systems, we have a second transfer function

l(z) = a^{−1}(z) d(z) ,

where

d(z) = Σ_{i=0}^{r} D(i) z^i .
l(z) can be obtained more directly from the observations (without solving the spectral factorization problem) than k(z): for example, if (z(t)) is stationary with spectral density f_z, then l is obtained from

l = f_{yz} f_z^{−1} ,

where f_{yz} is the cross spectrum between (y(t)) and (z(t)), provided that f_z(λ) is nonsingular on a set of positive Lebesgue measure.
Of course, for d(z) a miniphase assumption would not be justified in general. Also, unstable systems might be of interest. This is one reason (besides notational convenience) why we do not impose (2.4) and (2.7) throughout (and only require that k(z) has a convergent power series expansion in a neighborhood of zero, together with (2.6) and (2.8)). However, we will also discuss the consequences of additionally assuming (2.4) and (2.7) later in this contribution.
As already mentioned, the transfer function k and Σ are uniquely obtained from (y(t)). The parametrization of Σ by the vector σ(Σ) of its on- and above-diagonal elements is straightforward: we will assume that there are no further restrictions on Σ and no joint restrictions between the elements of Σ and those of the A(i), B(i). As can easily be seen, {σ(Σ) | Σ ∈ R^{s×s}, Σ > 0} is an open subset of R^{s(s+1)/2}. Thus we can restrict ourselves to the problem of the parametrization of the transfer functions.
First we consider the parameter space for all ARMA systems, where s is fixed and where p and q are arbitrary. Let Θ_A denote the set of all parameter matrices (A(0), A(1), …, B(1), B(2), …) satisfying our assumptions. A pair (a, b) of polynomial matrices such that (2.1) holds is called a left matrix fraction description (MFD) of k(z). We will identify (a, b) with the corresponding parameter matrices (A(0), A(1), …, B(1), B(2), …). By π: Θ_A → U_A we denote the mapping attaching to every (a, b) the corresponding transfer function k = a^{−1}b, where U_A is just the image of Θ_A under π.
As far as the observations (at least in the regular stationary case) are concerned, we cannot distinguish between two elements of Θ_A corresponding to the same k. Thus, from this point of view, if Θ ⊂ Θ_A is the parameter space under consideration, it is natural to consider also the quotient set Θ/π of Θ by π (restricted to Θ). π^{−1}(k) (∩ Θ) is called the k-equivalence class, or the class of all observationally equivalent MFD's corresponding to k (in Θ).
For most practical applications, parameter spaces that are subsets of Euclidean spaces are used because they are more convenient in several respects, e.g. for the optimization of the likelihood function. In addition we want a unique description of the transfer functions by their parameters. A subclass of ARMA systems (a subset Θ ⊂ Θ_A) is called identifiable if π restricted to Θ is injective, i.e. if within this class the MFD's (a, b) are uniquely determined from the corresponding transfer functions a^{−1}b.
If (a, b) ∈ Θ, then in general not all entries in the matrices A(i), B(i) will be needed for the description of (a, b), as some entries may be a priori known to be constant over Θ or to be functions of other entries. Let θ ∈ R^d be a vector of minimal dimension consisting of entries of the A(i), B(i) such that (a, b) is uniquely determined by θ. We call θ a vector of free parameters, and we identify θ with (a, b) ∈ Θ. If Θ is identifiable, then the function ψ: π(Θ) → Θ: ψ(π(θ)) = θ is called an (ARMA) parametrization of U = π(Θ).
The polynomial matrix (a, b) is called (relatively) left prime if every (s × s polynomial matrix) common left factor u (i.e. every u such that there exist polynomial matrices (ā, b̄) with (a, b) = u(ā, b̄)) is unimodular (a polynomial matrix u is called unimodular if det u = const ≠ 0). (a, b) is (relatively) left prime if and only if the degree of det a is minimal among all MFD's of k, or if and only if (a(z), b(z)) has rank s for all z ∈ C. (For these and related results, see e.g. [4].) In order to avoid redundancy of description we often restrict ourselves to irreducible, i.e. to left prime, MFD's (a, b). Then an MFD (ā, b̄) is observationally equivalent to an irreducible MFD (a, b) if and only if there exists a (nonsingular) polynomial matrix u such that

(ā, b̄) = u(a, b) , (2.11)
where n_i is the degree of the ith row of ā. Thus we may identify (ā, b̄) and (a, b), and they may even be represented by the same vector θ of free parameters. k̃ has the advantage of being strictly proper, i.e.

lim_{z→∞} k̃(z) = 0 ,

and therefore the degree of the ith row of b̃ must be smaller than n_i.
By what was said above, the degree of det a, n say, is an invariant of the set of all irreducible MFD's of k̃, and is called the order of the system (or of k̃). By M(n) we denote the set of all transfer functions k̃ of order n. To k̃ we attach its (block) Hankel matrix

H := ( K(1) K(2) K(3) ⋯ )
     ( K(2) K(3) K(4) ⋯ )
     ( K(3) K(4) K(5) ⋯ )
     (  ⋮    ⋮    ⋮      )   (2.13)
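The order n equals the rank of the Hankel matrix (2.13), which is convenient to check numerically on a sufficiently large truncation. A sketch for an assumed scalar transfer function k(z) = (1 + 0.3z)/(1 − 0.5z), whose Markov parameters satisfy K(1) = 0.8 and K(i) = 0.5 K(i − 1) for i ≥ 2:

```python
import numpy as np

def hankel(K, rows, cols):
    """H[i, j] = K(i + j + 1), i, j >= 0 -- the scalar Hankel matrix (2.13)."""
    return np.array([[K[i + j + 1] for j in range(cols)] for i in range(rows)])

# Markov parameters of the assumed k(z) = (1 + 0.3 z)/(1 - 0.5 z)
K = [1.0, 0.8]
for _ in range(20):
    K.append(0.5 * K[-1])

H = hankel(K, 5, 5)
n = int(np.linalg.matrix_rank(H))   # the order of the system
```

Every row of this truncated H is 0.5 times the previous one, so the rank (and hence the order) is 1.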
3. State-space representations
The general state-space system (in our context of linear, time-invariant and finite-dimensional systems) is of the form

x(t + 1) = F x(t) + G u(t) (+ K z(t)) , (3.1)
y(t) = H x(t) + v(t) , (3.2)
where u(t) and v(t) are unobserved white noise inputs satisfying

E u(t) = 0 , E v(t) = 0 ,

where x(t) is the state, y(t) is the output (z(t) is the observed input) and F ∈ R^{n×n}, G ∈ R^{n×s}, H ∈ R^{s×n} (K ∈ R^{n×m}) are parameter matrices. Equation (3.1) is called the state equation and (3.2) the observational equation. Again we restrict ourselves to the case of unobserved white noise inputs.
State-space systems have been widely studied, especially in control engineer-
ing (see e.g. [5, 9, 39, 40, 43, 61]), and the majority of the results on the
parametrization of linear systems have first been obtained for state-space
representations.
The stability condition that all eigenvalues of F are of modulus less than
unity, which corresponds to (2.4), guarantees that there is a unique (causal)
stationary solution of (3.1) and (3.2). In many cases (3.1) and (3.2) directly come
from the formulation of the theory considered.
Every state-space representation (3.1) and (3.2) can be transformed to a 'normalized' state-space system, having the same output y(t), as follows: let x̂(t + 1 | t) denote the best linear predictor of x(t + 1) from y(s), s ≤ t. Then we have from (3.1) and (3.2):

x̂(t + 1 | t) = F x̂(t | t − 1) + G e(t) , (3.3)
y(t) = H x̂(t | t − 1) + e(t) , (3.4)

where the e(t) are the prediction errors of the best linear prediction of y(t) from y(s), s < t. Equations (3.3) and (3.4) are called the prediction error representation, and we here restrict ourselves to this representation.
The transfer function of (3.3) and (3.4) is given by

k(z) = I + Σ_{i=1}^{∞} H F^{i−1} G z^i .

If (F, G, H) is minimal, then the dimension of x(t) is equal to the order of k̃(z) and thus to the number of linearly independent rows in the (block) Hankel matrix (2.13). Thus M(n) is also the set of all transfer functions corresponding to all minimal (F, G, H) with state dimension n.
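In the prediction error representation the power-series coefficients are K(0) = I and K(i) = H F^{i−1} G for i ≥ 1 (a standard state-space fact), so the state dimension of a minimal system can be recovered as the rank of a truncated Hankel matrix. A sketch with assumed matrices (F stable, the triple minimal):

```python
import numpy as np

F = np.array([[0.5, 0.2],
              [0.0, 0.3]])                  # eigenvalues 0.5, 0.3: stable
G = np.array([[1.0],
              [1.0]])
H = np.array([[1.0, 0.0]])

# Markov parameters K(i) = H F^(i-1) G, i = 1, 2, ...
K = [H @ np.linalg.matrix_power(F, i - 1) @ G for i in range(1, 9)]

# truncated block Hankel matrix with (i, j) block K(i + j + 1), i, j = 0, ..., 3
Hank = np.block([[K[i + j] for j in range(4)] for i in range(4)])
n = int(np.linalg.matrix_rank(Hank))        # recovers the state dimension
```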
4. Canonical forms
k̃(z) = a^{−1}(z) b̃(z) , (4.1)

we obtain

Then

y_t = H̃ e_t . (4.3)

It is easily seen that we obtain a unique (F, G, H) from the following equations:

F H_{(i)} = H_{(i+1)} , (4.4)
H H_{(i)} = H_i (4.5)

and

G = H_{(1)} , (4.6)
THEOREM 4.1.
(i) Θ_α^{(1)} is an open and dense subset of R^d, where d = (n_1 + ⋯ + n_s)(s + 1) + Σ_{i,j: j<i} {min(n_i, n_j) + min(n_j, n_i + 1)}.
(ii) ψ_α^{(1)}: U_α^{(1)} → Θ_α^{(1)} is a (T_pt-) homeomorphism.
(iii) {U_α^{(1)} | n_1 + ⋯ + n_s = n} is a (disjoint) partition of M(n).
(iv) π(Θ̄_α^{(1)}) = ∪_{β≤α} U_β^{(1)}.
(v) For every k ∈ π(Θ̄_α^{(1)}) the k-equivalence class in Θ̄_α^{(1)} is an affine subspace.
(vi) U_β^{(1)} is (T_pt-) open in Ū_α^{(1)}.
(vii) π(Θ̄_α^{(1)}) ⊂ Ū_α^{(1)}, and equality holds for s = 1.
det b(z) ≠ 0 , |z| ≤ 1 ,

then the corresponding parameter space is open in R^d. Let Θ̄_α^{(1)} denote the closure of Θ_α^{(1)} in R^d. Then Θ̄_α^{(1)} − Θ_α^{(1)} is the set of all θ ∈ R^d that either satisfy det a(z) ≠ 0, |z| < 1, with det a(z) = 0 for some |z| = 1 and with (2.7) fulfilled, or that are in R^d − Θ_α^{(1)} with det a(z) ≠ 0, det b(z) ≠ 0, |z| < 1. The equivalence classes in Θ̄_α^{(1)} are subsets of the affine subspaces described in (v) above, containing a relatively open set.
The corresponding state-space systems (see e.g. [10, 41, 45, 47, 53]) are defined completely analogously by (4.4)-(4.6), where again, in (4.4), the respective (i + n_i s)th rows of H are expressed as linear combinations of all basis rows in H_{(1)}.
Note that, in general, not all parameters of (ā, b̄) that are not restricted by (5.1) will be free. This is a disadvantage of this ARMA parametrization compared with the corresponding analogous state-space representation, where all restrictions are of the zero-one type. The free parameters in the MFD (ā, b̄) may be chosen as [16]:

ā_ij(u) , u = 0, 1, …, n_j − 1 , j, i = 1, …, s , (5.2)
b̄_ij(u) , u = 0, 1, …, n_i − 1 , i, j = 1, …, s ,

where ā_ij(u) and b̄_ij(u) denote the ij elements of Ā(u) and B̄(u), respectively.
Let ψ_α^{(2)} denote the mapping attaching to every k̃ ∈ U_α^{(2)} the vector θ ∈ R^{2ns} consisting of these free parameters, and let Θ_α^{(2)} denote the image of U_α^{(2)} under ψ_α^{(2)}.
We recall that a real analytic manifold of dimension d is a separable Hausdorff space M together with a family (U_α, ψ_α)_{α∈I}, where (U_α)_{α∈I} is an open cover of M and the ψ_α are homeomorphisms from U_α onto an open subset of R^d, such that, if U_α ∩ U_β ≠ ∅ (α, β ∈ I), then ψ_β ∘ ψ_α^{−1}: ψ_α(U_α ∩ U_β) → ψ_β(U_α ∩ U_β) is a real analytic function. The (U_α, ψ_α) are called local coordinates and ψ_β ∘ ψ_α^{−1} is called a coordinate transformation. Then we have, analogous to Theorem 4.1 (see [7, 8, 10, 16, 35, 37, 41]):
THEOREM 5.1.
(i) M(n) together with the family (U_α^{(2)}, ψ_α^{(2)}), Σ_i n_i = n, is a real analytic manifold of dimension 2ns.
(ii) Θ_α^{(2)} is dense in R^{2ns}.
(iii) M̄(n) − M(n) = ∪_{j<n} M(j).
The number (n + s − 1 choose s − 1) of local coordinates (U_α^{(2)}, ψ_α^{(2)}) used here is not minimal in general. However, for s > 1, a parametrization of M(n) by a single coordinate system does not exist.
The statements in Remarks 1 and 3 below Theorem 4.1 directly carry over to this case.
Since U_α^{(2)}, Σ n_i = n, is open and dense in M(n), 'almost all' points of M(n) can be described by a corresponding ψ_α^{(2)}; thus, once n is known, the choice of α is not that crucial, compared with the case of the echelon canonical form (or canonical forms in general), where only for the set where U_α^{(1)} = U_α^{(2)}, i.e. where n_1 = n_2 = ⋯ = n_K = n_{K+1} + 1 = ⋯ = n_s + 1, we have openness and denseness in M(n), whereas the other U_α^{(1)} are of dimension smaller than 2ns. Moreover, since M̄(n) is the set of all transfer functions with order smaller than or equal to n, every transfer function of order smaller than or equal to n can be approximated with arbitrary accuracy in U_α^{(2)}. However, there is a certain tradeoff for this advantage of the overlapping description of M(n) as a manifold, because parametrizing k̃ in a lower dimensional U_α^{(1)}, rather than in U_α^{(2)}, leads to a certain efficiency gain in estimation, as fewer free parameters are used.
A(0) = B ( 0 ) = I (6.1)
(or an analogous norming condition holds) (see [25, 27]). This is easily seen,
since multiplying such a (a, b) by a nonconstant matrix u in (2.9) would
increase the lag lengths beyond the prescribed ones.
Of course the sets Θ_{p,q}, p, q ∈ Z^+, can also be taken as convenient subclasses, without any structural interpretation, of a given model class.
For s > 1 there is one shortcoming of this prescription, namely {π(Θ_{p,q}), p, q ∈ Z^+} is not a cover of U_A, i.e. there exist k ∈ U_A for which there is no p, q such that k ∈ π(Θ_{p,q}). This problem does not occur if we more generally prescribe the maximal degrees α = (p_1, …, p_s, q_1, …, q_s) of the columns a_1, …, a_s, b_1, …, b_s of (a, b), i.e. we are prescribing the maximum lags for each variable. Let a_i(j) and b_i(j) denote the coefficients of z^j in a_i and b_i, respectively. Then the set Θ_α^{(3)} ⊂ R^d of all parameters θ = vec(a_1(1), …, a_1(p_1), …, a_s(1), …, a_s(p_s), b_1(1), …, b_s(q_s)) corresponding to (a, b) which are left prime, with prescribed column degrees α, where the column-end matrix (a_1(p_1), …, a_s(p_s), b_1(q_1), …, b_s(q_s)) has rank s (and where (6.1) holds), is identifiable ([27]); here d = s(p_1 + ⋯ + p_s + q_1 + ⋯ + q_s). Let U_α^{(3)} = π(Θ_α^{(3)}) and let ψ_α^{(3)}: U_α^{(3)} → Θ_α^{(3)} denote the corresponding parametrization.
Of course a disadvantage of this parametrization is that 2s integer-valued parameters p_1, …, q_s have to be prescribed (compared with 2 for Θ_{p,q}). In addition, note that p_1, …, q_s is, in general, not uniquely determined by k, i.e. there are k ∈ U_A such that k ∈ U_α^{(3)}, k ∈ U_β^{(3)}, β ≠ α.
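The rank condition on the column-end matrix is straightforward to check numerically for given coefficient matrices; the values below are assumptions for s = 2:

```python
import numpy as np

# column-end matrix (a_1(p_1), ..., a_s(p_s), b_1(q_1), ..., b_s(q_s)) for s = 2:
# stack the highest-degree coefficient columns and check that the rank is s
a_end = np.array([[0.5, 0.0],
                  [0.2, 0.4]])          # columns a_1(p_1), a_2(p_2): assumed
b_end = np.array([[0.3, 0.1],
                  [0.0, 0.2]])          # columns b_1(q_1), b_2(q_2): assumed
col_end = np.hstack([a_end, b_end])     # s x 2s column-end matrix

full_rank = np.linalg.matrix_rank(col_end) == col_end.shape[0]
```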
THEOREM 6.1.
(i) Θ_α^{(3)} is an open and dense subset of R^d, where d = s(p_1 + ⋯ + p_s + q_1 + ⋯ + q_s).
(ii) ψ_α^{(3)}: U_α^{(3)} → Θ_α^{(3)} is a (T_pt-) homeomorphism.
(iii) π(Θ̄_α^{(3)}) = ∪_{β≤α} U_β^{(3)}.
(iv) For every k ∈ π(Θ̄_α^{(3)}) the k-equivalence class in Θ̄_α^{(3)} is an affine subspace.
(v) U_β^{(3)} is (T_pt-) open in Ū_α^{(3)}.
(vi) π(Θ̄_α^{(3)}) ⊂ Ū_α^{(3)} and equality holds for s = 1.
For more general a priori restrictions, results analogous to Theorem 6.1 are not
yet available.
Again results analogous to Remarks 1 and 3 after Theorem 4.1 hold.
Here y_T = (y'(1), …, y'(T))' denotes the stacked vector of observations y(t), t = 1, …, T, Γ_T(θ, σ(Σ)) the corresponding covariance matrix of y_T, and

L_T(θ, σ(Σ)) = T^{−1} log det Γ_T(θ, σ(Σ)) + T^{−1} y'_T Γ_T^{−1}(θ, σ(Σ)) y_T . (7.1)
Here θ is an element of either Θ_α^{(1)}, Θ_α^{(2)} or Θ_α^{(3)}. We use Θ_α, for short, to cover all three cases. (Even more generally, Θ_α could be an identifiable set of MFD's with bounded degrees and with an additional technical assumption imposed.) In this section, (2.4) and (2.7) are assumed throughout, without taking this into account in our notation. Let U_α = π(Θ_α) be the corresponding set of transfer functions and let ψ_α: U_α → Θ_α denote the corresponding parametrization. As L_T(θ, σ(Σ)) depends on θ only via π(θ), a 'coordinate-free' likelihood depending on k (and on σ(Σ)) may be defined. In the process of optimization of the likelihood, the possibility that the optimum is attained at certain boundary points cannot be excluded, and this is one reason to define the coordinate-free likelihood L_T(k, σ(Σ)) as a function with domain Ū_α × {σ(Σ) | Σ > 0} (where again (2.4) and (2.7) have been imposed) rather than with domain U_α × {σ(Σ) | Σ > 0}. Note that this coordinate-free likelihood is introduced for mathematical convenience, as some statistical properties do not depend on the underlying parametrization; the actual optimization of the likelihood, however, has to be performed in suitable coordinates.
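For a Gaussian process, (7.1) can be evaluated directly once the covariance matrix Γ_T implied by the parameters is formed. A sketch for a scalar AR(1), where Γ_T is the Toeplitz matrix of the autocovariances σ² φ^{|i−j|}/(1 − φ²); the model and parameter values are assumptions for illustration:

```python
import numpy as np

def ar1_criterion(y, phi, sigma2):
    """Evaluate (7.1) for a scalar AR(1): y(t) = phi*y(t-1) + e(t), Var e = sigma2."""
    T = len(y)
    gamma = sigma2 / (1.0 - phi**2) * phi ** np.arange(T)    # autocovariances
    idx = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    Gamma = gamma[idx]                                       # Toeplitz covariance of y_T
    _, logdet = np.linalg.slogdet(Gamma)
    return (logdet + y @ np.linalg.solve(Gamma, y)) / T

rng = np.random.default_rng(1)
e = rng.standard_normal(400)
y = np.zeros(400)
for t in range(1, 400):
    y[t] = 0.5 * y[t - 1] + e[t]

# the criterion should be smaller near the true parameter than far from it
l_true = ar1_criterion(y, 0.5, 1.0)
l_far = ar1_criterion(y, -0.5, 1.0)
```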
A reason for the introduction of the coordinate-free likelihood is the following consistency result [19, 33]: under the additional ergodicity requirement

lim_{T→∞} (1/T) Σ_{t=1}^{T} y(t + s) y'(t) = E y(s) y'(0) (a.s.)

and if the true transfer function k_0 is in Ū_α, then the MLE's k̂_T, Σ̂_T for k_0 and Σ_0 (obtained by optimizing L_T over Ū_α × {σ(Σ) | Σ > 0}) are strongly consistent, i.e. k̂_T → k_0 (in T_pt) a.s. and Σ̂_T → Σ_0 (where Σ_0 is the true matrix) a.s.
This result, together with the properties of the parametrizations discussed in the previous sections, has the following implications for parameter estimation: let k̂_T → k_0 (in T_pt), k̂_T, k_0 ∈ Ū_α (where k̂_T is not necessarily the MLE); then we can distinguish three different cases [13, 16]:
(i) If k_0 ∈ U_α, then, by the openness of U_α in Ū_α, k̂_T will be in U_α too, from a certain T_0 onwards. From this T_0 onwards, the parameter estimates ψ_α(k̂_T) = θ̂_T are uniquely defined and, by the continuity of ψ_α, we have θ̂_T = ψ_α(k̂_T) → ψ_α(k_0) = θ_0; thus, for example, the MLE's θ̂_T are strongly consistent in this case.
(ii) Let k_0 ∈ π(Θ̄_α) − U_α. Then k_0 is represented by an equivalence class in Θ̄_α − Θ_α (along this equivalence class the likelihood defined on Θ̄_α, for fixed Σ, is constant). If in addition suitable prior bounds are imposed on the norm of the elements in Θ̄_α, then the (not necessarily unique) parameter estimates θ̂_T (i.e. π(θ̂_T) = k̂_T) will converge to the 'true' equivalence class. Whether the algorithm will search along this class or whether the θ̂_T converge to a certain point in the equivalence class depends on the actual estimation procedure used.
Of course, reparametrization with a suitable β < α, such that k_0 ∈ U_β, leads to the 'well-posed' situation described in (i).
(iii) The situation k_0 ∈ Ū_α − π(Θ̄_α) can only occur in the multivariable case (s > 1). In this case, k_0 corresponds to the 'point of infinity' of Θ_α, in the sense that even if k̂_T ∈ U_α, T ∈ N, then k̂_T → k_0 implies that the norm of the parameter estimates ψ_α(k̂_T) will tend to infinity. In the special case of the overlapping parametrization of M(n), when U_α = U_α^{(2)}, this situation occurs if either k_0 has order n and we have chosen the wrong local coordinates (i.e. k_0 ∉ U_α) or if k_0 has order smaller than n but cannot be described in a Θ_β^{(2)} such that β < α. Also in this situation, a suitable reparametrization leads to case (i).
If k_0 ∈ U_α but 'near' to a point in Ū_α − U_α, similar problems (in finite samples) may arise. In this case the matrices (as e.g. H_{(1)}) determining the parametrizations are ill conditioned, and thus ψ_α, although being continuous, is very distorting in the sense that a 'small' variation of the transfer functions causes a 'large' variation of the parameters θ.
The discussion in (ii) and (iii) may be considered as an analysis of the behavior of the parameter estimates in the case of a wrong dynamic specification. Of course there is also another case of wrong dynamic specification, namely when k_0 ∉ Ū_α, i.e. when the observations do not correspond to a system in the model class (underfitting). In this case, of course, we cannot have consistency of the estimates. However, the maximum likelihood type estimates still have an optimality property: they converge to the set (consisting of more than one element in general) in Ū_α corresponding to the best linear one-step-ahead predictors for the process generating the data [44].
Now let us turn to the problem of inference of integer-valued parameters for the dynamic specification of the submodel.
There are two main inference principles in this case, namely information criteria like AIC or BIC, and criteria based on the inference of the linear dependence relations in H.
We mainly consider the case of the overlapping parametrization of M(n). Here both the order n and appropriate local coordinates given by α have to be determined.
The reason why MLE's do not give reasonable results in order estimation (and in related problems) is as follows: since M̄(n_0) ⊂ M̄(n_1) for n_0 < n_1, and M(n_1) is 'almost all' of M̄(n_1), the MLE over M̄(n_1) will be attained 'almost surely' in M(n_1), even if n_0 is the true order.
One way to overcome this notorious tendency of the MLE to overestimate the true order (to be more precise, to attain its value at the maximum prescribed order) is to add a penalty term, taking into account the dimension of the parameter space. This leads to estimation criteria of the form

A_T(n) = log det Σ̂_T(n) + d C(T)/T ,

where Σ̂_T(n) is the MLE Σ̂_T over M̄(n) × {σ(Σ) | Σ > 0}, N is the maximum prescribed order and d = 2ns is the dimension of the parameter space. C(T) has to be prescribed. If C(T) = 2, then A_T(n) is called AIC [2, 3]. If C(T) = c log T, then A_T(n) is called BIC [54]. The estimates n̂_T of the order are obtained by minimizing A_T(n), n = 0, …, N. Consistency of the minimum BIC estimate n̂_T has been shown in [30, 31]. BIC, defined over U_α^{(1)} with d given by Theorem 4.1(i), also gives consistent estimates of the Kronecker indices [34]. Minimum AIC estimates of n are not consistent; AIC was designed to satisfy another optimality criterion [58]. Closely related to these estimation methods are likelihood ratio or Lagrange multiplier tests for the order [51, 52].
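The criterion can be sketched in the simplified setting of scalar autoregressive order selection, with least-squares AR fitting standing in for the MLE, so that A_T(n) = log σ̂²(n) + n C(T)/T is a scalar analogue of the criterion above (data and coefficients below are assumptions):

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares AR(p) fit; returns the residual variance estimate."""
    if p == 0:
        return float(np.var(y))
    Y = y[p:]
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean((Y - X @ coef) ** 2))

def select_order(y, N, C):
    """Minimize A_T(n) = log(sigma2_hat(n)) + n*C(T)/T over n = 0, ..., N."""
    T = len(y)
    return int(np.argmin([np.log(fit_ar(y, n)) + n * C(T) / T for n in range(N + 1)]))

rng = np.random.default_rng(2)
e = rng.standard_normal(2000)
y = np.zeros(2000)
for t in range(2, 2000):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]   # true order is 2

n_bic = select_order(y, 6, lambda T: np.log(T))      # BIC-type penalty C(T) = log T
```

With the BIC-type penalty the selected order concentrates on the true order as T grows, while the unpenalized fit would always prefer the maximum prescribed order.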
For estimation of the local coordinates, measures of the conditioning of the estimate of H_{(1)} are used [47]. In principle, all the integer-valued parameters discussed here could be inferred from an investigation of the linear dependence relations in H, where H is estimated, e.g. by a 'long' autoregression. However, in most practical applications this seems to be a fairly tedious procedure.
As has been pointed out in [34], in practical applications, for s > 3, both the large dimension of the parameter spaces and the large number of neighborhoods that have to be considered may cause great problems. Each optimization of the likelihood is itself a fairly costly procedure and, if N is the maximum prescribed order, we have to search over (N + s choose s) neighborhoods U_α^{(2)}, Σ n_i = n, 0 ≤ n ≤ N, or the same number of sets U_α^{(3)}. A practical procedure, where the search is performed only over suitably selected neighborhoods, is given in [34].
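The combinatorial burden is easy to quantify: the number of neighborhoods (N + s choose s) grows rapidly with the output dimension s, e.g. for maximum prescribed order N = 10:

```python
from math import comb

# number of neighborhoods U_alpha^(2), i.e. C(N + s, s), for maximum order N = 10
N = 10
for s in (1, 2, 4, 6):
    print(s, comb(N + s, s))
```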
References
[1] Akaike, H. (1974). Stochastic theory of minimal realization. IEEE Trans. Automat. Control AC-19, 667-674.
[2] Akaike, H. (1976). Canonical correlation analysis of time series and the use of an information
criterion. In: R. K. Mehra and D. G. Lainiotis, eds., Systems Identification: Advances and
Case Studies, 27-96. Academic Press, New York.
[3] Akaike, H. (1977). On entropy maximisation principle. In: P. R. Krishnaiah, ed., Applications
of Statistics. North-Holland, Amsterdam.
[4] Barnett, S. (1971). Matrices in Control Theory. Van Nostrand-Reinhold, London.
[5] Brockett, R. W. (1970). Finite Dimensional Linear Systems. Wiley, New York.
[6] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis--Forecasting and Control. Holden-Day, San Francisco, CA.
[7] Byrnes, C. I. (1982). A brief tutorial on calculus on manifolds, with emphasis on applications
to identification and control. Mimeo. Harvard University.
[8] Byrnes, C. I. and Martin, C. F. (1980). Geometrical Theory of Linear Systems. Reidel,
Dordrecht.
[9] Casti, J. L. (1977). Dynamical Systems and their Applications : Linear Theory. Academic Press,
New York.
[10] Clark, J. M. C. (1976). The consistent selection of parametrizations in systems identification.
Paper presented at Joint Automatic Control Conference, Purdue University.
[11] Deistler, M. (1978). The structural identifiability of linear models with autocorrelated errors in
the case of cross-equation restrictions. J. Econometrics 8, 23-31.
[12] Deistler, M. (1983). The structure of ARMA systems and its relation to estimation. In: P. E.
Caines and R. Hermann, eds., Geometry and Identification, 49-61. Math. Sci. Press, Brookline,
MA.
[13] Deistler, M. (1983). The properties of the parametrization of ARMAX systems and their
relevance for structural estimation. Econometrica 51, 1187-1207.
[14] Deistler, M. (1984). ARMA systems: Parametrization and estimation. In: J. P. Florens et al., eds.,
Alternative Approaches to Time Series Analysis, 143-160. Publications des Facult6s Universitaires
Saint-Louis, Bruxelles.
[15] Deistler, M., Dunsmuir, W. and Hannan, E. J. (1978). Vector linear time series models:
Corrections and extensions. Adv. in Appl. Probab. 10, 360-372.
[16] Deistler, M. and Hannan, E. J. (1981). Some properties of the parametrization of ARMA
systems with unknown order. J. Multivariate Anal. 11, 474-484.
[17] Denham, M. J. (1974). Canonical forms for the identification of multivariable linear systems.
IEEE Trans. Automat. Control AC-19, 646-656.
[18] Dickinson, B. W., Kailath, T. and Morf, M. (1974). Canonical matrix fraction and state-space
description for deterministic and stochastic linear systems. IEEE Trans. Automat. Control AC-19, 656-667.
[19] Dunsmuir, W. and Hannan, E. J. (1976). Vector linear time series models. Adv. in Appl.
Probab. 8, 339-364.
[20] Forney, D. G. (1975). Minimal bases of rational vector spaces with applications to multivariable linear systems. SIAM J. Control 13, 493-520.
[21] Gevers, M. and Wertz, V. (1982). On the problem of structure selection for the identification of stationary stochastic processes. Paper presented at the 6th IFAC Symposium on Identification and System Parameter Estimation, Washington, DC.
[22] Glover, K. and Willems, J. C. (1974). Parametrizations of linear dynamical systems: Canonical
forms and identifiability. IEEE Trans. Automat. Control AC-19, 640-646.
[23] Guidorzi, R. (1975). Canonical structures in the identification of multivariable systems.
Automatica 11, 361-374.
[24] Guidorzi, R. (1981). Invariants and canonical forms for systems structural and parametric
identification. Automatica 17, 117-133.
[25] Hannan, E. J. (1969). The identification of vector mixed autoregressive-moving average
systems. Biometrika 56, 223-225.
[26] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[27] Hannan, E. J. (1971). The identification problem for multiple equation systems with moving
average errors. Econometrica 39, 751-765.
[28] Hannan, E. J. (1976). The identification and parametrization of ARMAX and state space forms. Econometrica 44, 713-723.
[29] Hannan, E. J. (1979). The statistical theory of linear systems. In: P. R. Krishnaiah, ed.,
Developments in Statistics, Vol. 2, 83-121. Academic Press, New York.
[30] Hannan, E. J. (1980). The estimation of the order of an ARMA process. Ann. Statist. 8,
1071-1081.
[31] Hannan, E. J. (1981). Estimating the dimension of a linear system. J. Multivariate Anal. 11,
459-473.
[32] Hannan, E. J. (1981). System identification. In: M. Hazewinkel and J. C. Willems, eds.,
Stochastic Systems: The Mathematics of Filtering and Identification and Applications. Academic
Press, New York.
[33] Hannan, E. J., Dunsmuir, W. and Deistler, M. (1980). Estimation of vector ARMAX models. J.
Multivariate Anal. 10, 275-295.
[34] Hannan, E. J. and Kavalieris, L. (1984). Multivariate linear time series models. Adv. in Appl.
Probab. 16, 492-561.
[35] Hazewinkel, M. (1977). Moduli and canonical forms for linear dynamical systems II: The
topological case. Math. Systems Theory 10, 363-385.
[36] Hazewinkel, M. (1977). Moduli and canonical forms for linear dynamical systems III: The algebraic-geometric case. In: C. Martin and R. Hermann, eds., Geometric Control Theory, 291-336. Math. Sci. Press, Brookline, MA.
[37] Hazewinkel, M. and Kalman, R. E. (1975). On invariants, canonical forms, and moduli for linear
constant, finite-dimensional dynamical systems. In: Lecture Notes Econ.-Math. Systems Vol. 131,
48-60. Springer, Berlin.
[38] Ho, B. L. and Kalman, R. E. (1966). Effective construction of linear state variable models
from input-output functions. Regelungstechnik 14, 545-548.
[39] Kalman, R. E. (1963). Mathematical description of linear dynamical systems. SIAM J. Control
1, 152-192.
[40] Kalman, R. E. (1965). Irreducible realizations and the degree of a rational matrix. SIAM J.
Control 13, 520-544.
[41] Kalman, R. E. (1974). Algebraic geometric description of the class of linear systems of
constant dimension. Eighth Annual Princeton Conf. on Information Sciences and Systems,
Princeton, NJ.
[42] Kalman, R. E. (1982). Identifiability and problems of model selection in econometrics. In: W.
Hildenbrand, ed., Advances in Econometrics. Cambridge University Press, Cambridge.
[43] Kalman, R. E., Falb, P. L. and Arbib, M. A. (1969). Topics in Mathematical System Theory.
McGraw-Hill, New York.
[44] Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Trans.
Automat. Control AC-23, 770-783.
[45] Ljung, L. and Rissanen, J. (1976). On canonical forms, parameter identifiability and the
concept of complexity. Proceedings of the 4th I F A C Symposium on Identification and Systems
Parameter Estimation, Tbilisi.
[46] Luenberger, D. G. (1967). Canonical forms for linear multivariable systems. IEEE Trans.
Automat. Control AC-12, 290-293.
[47] van Overbeek, A. J. M. and Ljung, L. (1982). On line structure selection for multivariable
state space models. Automatica 18, 529-543.
[48] Picci, G. (1982). Some numerical aspects of multivariable systems identification. Math.
Programming Study Vol. 18, 72-101. North-Holland, Amsterdam.
[49] Popov, V. M. (1969). Some properties of the control systems with irreducible matrix-transfer
functions. In: Seminar on Differential Equations and Dynamical Systems II. Lecture Notes in
Mathematics, Vol. 144. Springer, Berlin.
[50] Popov, V. M. (1972). Invariant description of linear, time-invariant controllable systems.
SIAM J. Control 10, 252-264.
[51] Poskitt, D. S. and Tremayne, A. R. (1980). Testing the specification of a fitted autoregressive
moving average model. Biometrika 67, 359-363.
[52] Pötscher, B. M. (1983). Order estimation in ARMA-models by Lagrangian multiplier tests.
Ann. Statist. 11, 872-885.
[53] Rissanen, J. (1974). Basis of invariants and canonical forms for linear dynamic systems.
Automatica 10, 175.
[54] Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-471.
[55] Rissanen, J. and Ljung, L. (1975). Estimation of optimum structures and parameters for linear
systems. In: Proc. Symp. Advanced School on Mathematical System Theory. Udine, Italy.
Lecture Notes in Economics and Mathematical Systems, Vol. 131, 75-91. Springer, Berlin.
[56] Rosenbrock, H. H. (1970). State-Space and Multivariable Theory. Wiley, New York.
[57] Rozanov, Y. A. (1967). Stationary Random Processes. Holden-Day, San Francisco, CA.
[58] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Ann. Statist. 8, 147-164.
[59] Tiao, G. C. and Box, G. E. P. (1981). Modelling multiple time series with applications. J.
Amer. Statist. Assoc. 76, 802-816.
[60] Wertz, V., Gevers, M. and Hannan, E. J. (1980). The determination of optimum structures for
the state space representation of multivariate stochastic processes. IEEE Trans. Automat.
Control AC-27, 1200-1211.
[61] Wolovich, W. A. (1974). Linear Multivariable Systems. Springer, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 279-310

Harmonizable, Cramér, and Karhunen Classes of Processes

M. M. Rao
1. Introduction
and r(s, t) = f(s - t) which depends on the difference of s and t. Writing r for f,
it follows from the classical theorems of Herglotz and Bochner that r is
expressible as
know whether there is some kind of 'robustness' for this work. In response to such questions, and also to take into account some honestly nonstationary processes, the classes of harmonizable and Karhunen families were introduced independently and simultaneously by Loève [16] and Karhunen [14]. Only slightly later a common generalization of both these classes was formulated by Cramér [6]. An analysis and interrelations together with some of their extensions constitute the main theme of the present exposition. It turns out that harmonizable processes, properly generalized, have the 'robustness properties' alluded to above. Also their study proceeds through Karhunen classes, since it is shown that the harmonizable family is an important subset of the Karhunen processes. A discussion of Cramér's class is included since technically this illuminates the structure of the above two families and has independent interest.
There are two important classes of nonstationary processes that have been considered in recent studies on the subject. One is the class (KF), studied extensively by Kampé de Fériet and Frenkiel [13] and independently by Parzen [21], who termed it 'asymptotically stationary', and by Rozanov [31]. The second one is the Cramér-Hida class, which is based on the 'multiplicity theory', having been motivated by the classical work of Hellinger and Hahn on infinite matrices. It turns out that a stationary process has multiplicity one, while there is a harmonizable process of any given multiplicity, 1 ≤ N ≤ ∞ (cf. [7, 12, 5]). Even in the multiplicity one case, there are several types of nonstationary processes useful in prediction and filtering problems. This indicates that classes of nonstationary processes can be studied using different techniques, frequently in the time domain. It will also be found that generalizations of spectral ideas have a role to play in this work. Let us thus turn to a detailed description of these statements.
2. Harmonizable processes
From the point of view of applications, one of the most useful extensions of stationarity is harmonizability. Since for nonstationary processes (of second order) the covariance function r, given by (1), must depend on both variables s, t, it is natural to consider those processes for which the analog of (2) still holds. This leads to the following concept, introduced by Loève [16], and it is called strongly harmonizable hereafter. Namely, the covariance r admits the representation:
If ξ is a random variable with mean zero and unit variance, consider {X_t = ξf(t), t ∈ ℝ}. Then X_t has mean zero and the covariance r is given by r(s, t) = f(s)f̄(t). Setting

    F(A, A′) = ∫_A ∫_{A′} f(x)f̄(y) dx dy,

one verifies that F is positive definite and of bounded variation, so that the X_t process is strongly harmonizable. All finite linear combinations of such 'simple' processes constitute a large class, and if F in (2) is absolutely continuous, then these simple processes are even 'linearly dense' in a certain well-defined sense. On the other hand, if F of (3) concentrates on the diagonal λ = λ′, then clearly (3) reduces to (2). Thus stationary processes are properly extended.
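The positive definiteness of the covariance of such a 'simple' process can be checked numerically: r(s, t) = f(s)f̄(t) is a rank-one, positive semidefinite kernel. A minimal sketch in Python (the particular f below is an illustrative choice, not from the text):

```python
import numpy as np

# Simple strongly harmonizable process X_t = xi * f(t): its covariance
# r(s, t) = f(s) * conj(f(t)) is a rank-one, positive semidefinite kernel.
f = lambda t: np.exp(1j * 2.0 * t) / (1.0 + t**2)   # illustrative choice of f

ts = np.linspace(-5.0, 5.0, 81)
fv = f(ts)
R = np.outer(fv, np.conj(fv))          # Gram matrix r(s, t) on the grid

eig = np.linalg.eigvalsh(R)            # R is hermitian, so real eigenvalues
assert eig.min() > -1e-10              # positive semidefinite
assert np.linalg.matrix_rank(R) == 1   # rank one, as expected for xi*f(t)
```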
Even though the Loève extension of stationarity is useful, it does not go far enough to afford a flexibility for linear operations on these processes. Since the latter operations appear naturally in filtering problems, a further extension is needed to take care of these applications. First let us consider an example to understand how certain simple linear operations lead from stationarity to problems beyond the strongly harmonizable class. Thus let L_0^2(P) be the space of scalar random variables with zero means and finite variances, so that a second-order process considered in this chapter is a subset of L_0^2(P). Let the metric (or norm) be denoted by ‖f‖ = [E(|f|²)]^{1/2}, f ∈ L_0^2(P). If A is a bounded linear transformation on L_0^2(P), so that AX_t = Y_t ∈ L_0^2(P), consider a stationary (or strongly harmonizable) process {X_t, t ∈ T} and the transformed process {Y_t, t ∈ T}. If the range of A is finite dimensional, then the Y_t process is strongly harmonizable (it can be nonstationary), and if the range of A is infinite dimensional, then the Y_t process need not be strongly harmonizable. For instance, let T = ℤ, X_n = f_n, an orthonormal sequence (hence stationary) in L_0^2(P), and A be the projection such that Y_n = AX_n = f_n for n > 0, and = 0 for n ≤ 0. Then the Y_n sequence is a truncation of the original orthonormal sequence, but it is not strongly harmonizable. This fact, as well as the preceding general statement on the range of A, is not entirely simple. The details may be found in [28]. Since, as remarked earlier, linear operations are important for (filtering and other) practical problems, one should have an extension to include at least these questions. Fortunately, this is possible and it can be formulated as follows.
where f̂ is the Fourier transform, given by (4), for each integrable f on T, and the integral in (5) is defined in a standard manner as a vector (or Bochner) integral, ‖f̂‖_∞ being the uniform (= supremum) norm of f̂.
Harmonizable, Cramér, and Karhunen classes of processes 283
where S*(u) stands for the adjoint of S(u), so that it is a linear mapping satisfying a positive definiteness relation (a double sum over i, j = 1, …, n) for each finite set {u_1, …, u_n} ⊂ T. This is easy if T = ℤ, and the case that T = ℝ is then reducible to the former. Then one applies a form of the next result to deduce that there is a family of unitary transformations V_t (meaning V_t V_t* = V_t* V_t = identity, and (i) and (iii) hold) on a larger space L_0^2(P′) ⊃ L_0^2(P) such that S(t) = QV_t, t ∈ T. Here Q is the orthogonal projection of L_0^2(P′) onto L_0^2(P). It should be noted that if S(t) = V_t, so that L_0^2(P′) = L_0^2(P) and Q = identity, then Y_t = V_t X_0, t ∈ T, gives the classical representation of a stationary process. Thus the connection between these two classes, obtained by an enlargement of the underlying probability space, is an important and a deep result.
The precise statement alluded to above is the following:
The superspace L_0^2(P′) is not generally unique, but one can find a minimal space with the desired properties. The result and its extended space may be obtained essentially 'constructively'. It is related to some work of M. A. Naimark, B. Sz.-Nagy and others on Hilbert space operator theory. A detailed proof with related references is given in [28]. Based on this result one can show that each weakly harmonizable process may be represented in terms of a (continuous) positive definite semigroup, as described in the preceding example.
The above theorem enables some extensions of well-known results from the stationary theory to the harmonizable case. For instance, the following inversion formula for F(·, ·) of (3) can be obtained from the classical work at once.
The generally complex valued spectral function F of the process plays a role in analyzing harmonizable processes somewhat similar to the one given by the classical case (2). So it is desirable to estimate F and investigate the asymptotic properties of such estimators. This problem, even in the strongly harmonizable case, is not yet solved. Other unresolved points will be recorded for future work as the exposition proceeds.
The strongly harmonizable case admits an extension in a slightly different direction. The covariance function r of (3) may be written as
If D is a single point, then g is called a classical almost periodic function; and in any case, for each λ ∈ K ⊂ D, g(·, λ) is bounded. With this concept, a second-order process {X_t, t ∈ T} ⊂ L^2(P) is termed almost harmonizable if its covariance r admits the representation (7) with respect to a family {g(·, λ), λ ∈ T} of almost periodic functions on T, uniformly relative to T, and a covariance function F of bounded variation. It will be seen in Section 6 below that this family inherits an important structural property of the strongly harmonizable processes, of which it is an extension.
From an applicational point of view, however, one should consider multivariate processes. Thus if X_t : Ω → ℂⁿ, t ∈ T, so that X_t = (X_t^1, …, X_t^n), let X_t^i ∈ L^2(P), i = 1, …, n; t ∈ T. Then the X_t process is termed multivariate strongly or weakly or almost harmonizable (relative to a fixed scalar almost periodic g family in the last case) if for each vector α = (α^1, …, α^n) ∈ ℂⁿ, the scalar process

    X_t^α = ∑_{i=1}^n α^i X_t^i,  t ∈ T,

is of the same type as defined in the preceding paragraphs. From this definition it follows, after an easy algebraic manipulation, that for each 1 ≤ j, k ≤ n and s, t in T, the (cross-)covariance function r_jk of the component processes X_t^j, X_t^k is also harmonizable, with g(s, λ) = e^{isλ} in the (weak or strong) harmonizable case. The integrals here are defined componentwise. Again F will be called the spectral matrix function of the vector process {X_t, t ∈ T}.
In all the above cases, F has the following important property inherited from r:
for any m-by-n matrix function f with bounded Borel entries. Here f* is the conjugate transpose of f. In the stationary case, one has F(λ, λ′) = δ_{λλ′}G(λ), where δ_{λλ′} is the Kronecker delta and G is a positive definite hermitean n-by-n matrix function. In the general (e.g. harmonizable) cases the latter property is no longer present, because of the behavior of the off-diagonal entries of F noted earlier. Before considering the spectral properties of multivariate harmonizable processes, it will be necessary to discuss another extension of stationarity, due to Karhunen [14], and some of its ramifications. Let us introduce this.
3. Karhunen class
where h is the Fourier transform of √f (which exists), and then the last equality follows by the Parseval formula (since √f is square integrable). Note that if T = ℤ, then h is a polygonal function and the integral in (12) reduces to a (possibly infinite) sum. Thus a process whose covariance is representable by a formula of the type (12), relative to a Borel family {h(t, ·), t ∈ T} and a measure μ (here h(t, ·) = h(t + ·) and dμ(λ) = dλ), includes the stationary class and brings in considerable flexibility. It should also be observed, from (12) and (2), that even a stationary covariance can have different representations, and this remark will be pertinent later on. Let us thus present the desired general concept.
Here both S, T can be general sets without any relation. In applications, one usually has T = ℝ or ℤ (as is assumed in this chapter) and then S = T̂ (= ℝ, or [0, 2π)), ℂ, or such others. Also μ can be a nonfinite measure (dμ = dλ, the Lebesgue measure on ℝ in (12), is an example).
The Karhunen class is quite large. It was already noted that stationary processes are included in it. From the forms (4) and (13) it is not at all evident that there is any relationship between the harmonizable and Karhunen classes. It will now be shown that the former is also a subset of the latter. This fact could not be obtained until the availability of the dilation result (Theorem 2.2). It also depends on another classical fact (due to Cramér) that each stationary process {X_t, t ∈ T} is representable as
PROOF. A sketch of the argument follows, because it is not yet available in the literature and it is not long. Since the X_t process is weakly harmonizable, by Theorem 2.2 there exists a stationary dilation {Y_t, t ∈ T} ⊂ L_0^2(P′) on a larger probability space, of which L_0^2(P) can be identified as a subspace, such that X_t = QY_t, t ∈ T, where Q is the orthogonal projection from L_0^2(P′) onto L_0^2(P). But by (14),

    E(Y_s Ȳ_t) = r̃(s − t) = E(Y_{s−t} Ȳ_0).
In fact, assuming that the X_t process is not identically zero, one has a boundedness condition which, equivalently, implies a norm inequality of the form

    ‖∑_{i=1}^n a_i U_i‖ ≤ c ‖∑_{i=1}^n a_i V_i‖ .   (17)
the limit existing in the metric of ℋ, uniformly in s on closed intervals [0, a], a > 0. On the other hand, for each h > 0, A_h is a bounded normal operator on the Hilbert space ℋ. Hence one can invoke the standard spectral theorem, according to which there exists a 'resolution of the identity', {E_h(λ), λ ∈ ℂ}, such that
where the integral is a vector integral and μ_X(Δ) = E_h(Δ)X ∈ ℋ gives a vector measure. Here ℋ_0 ⊂ ℋ is the subspace for which the integral exists, i.e. z is E_h-integrable for X ∈ ℋ_0. But from the same theory one can also deduce that

    X_s = ∫_ℂ e^{sλ} Z(dλ),  s ≥ 0,   (22)
4. Cramér class
After seeing the work of the preceding two sections, it is natural to ask whether one can define a more inclusive nonstationary class incorporating and extending the ideas of both Karhunen and Loève. Indeed, the answer is yes, and such a family was already introduced by Cramér in 1951 [6]; a brief description of it is in order. This also has an independent methodological interest, since it results quite simply under linear transformations of Karhunen classes, in much the same way that harmonizable families result under similar mappings from the stationary ones.
One says that a function F on T × T into ℂ is locally of finite (Fréchet) variation if the restriction of F to each finite proper subrectangle I × I of T × T has finite (Fréchet) variation, I ⊂ T being a finite interval. Let us now state the concept in:
If F has locally finite Fréchet variation and the integrals in (24) and (25) are in the Morse-Transue sense, the corresponding concept is called weakly of class (C).
All these inclusions are proper. One key feature is that each member of these
classes admits an integral representation analogous to that of (14), though their
proofs in each case differ considerably.
Remembering the dilation of a harmonizable series into a stationary one as given in Theorem 2.2, one might ask for a similar result between the Cramér and Karhunen classes. The methods and ideas of proof of that result extend to give only the following somewhat weaker statement.
In this generalization, it is significant that the full dilation result does not
obtain. Only an interesting subclass extends. Details and related references
with further extensions on the problem can be found in [28]. It will appear in
applications (cf. Section 8 below) that these classes arise naturally, especially as
solutions of linear stochastic differential equations of filtering and signal
extraction problems. Another important reason for a study of Karhunen class
will emerge in Section 7.
in the form of (10). Note that F is also hermitean positive definite for the Karhunen class, but not for the Cramér and harmonizable families. Since there are not many results available for the other classes, the more familiar (nonstationary) harmonizable case will be discussed here. For this, the spectrum plays a role somewhat analogous to that of the stationary case, and its spectral domain is given as follows.
In all the extensions of stationarity considered above, their covariance
functions admit 'factorizable' kernels for their integrands (cf. (3), (13), (24)).
This fact translates itself into integral representations of their sample paths as
in (22) or analogously:
For the harmonizable case S = T̂, and in all cases 0 ≤ (f, f) < ∞ because of the special relationship between F and Z. The (·, ·) gives a (semi-)inner product and a (semi-)norm: ‖f‖² = (f, f). For the multivariate case, one has
with F as its spectral matrix function and ℒ²(F), defined by (28), as its spectral domain space; then (ℒ²(F), ‖·‖) is complete in the sense that it is a Hilbert space of equivalence classes of matrices with inner product defined by ((·, ·)) = trace(·, ·), where

    (f, g) = ∫_S ∫_S f(λ) F(dλ, dλ′) g*(λ′),   (29)
In order to assert a similar property for the class (C) or Karhunen class, it will be necessary to restrict the family {g(t, ·), t ∈ T} suitably.
The importance of the above property is better understood if one looks at an application. The following is one such, for signal extraction. A general signal plus noise model is given by

    X_t = S_t + N_t,  t ∈ T,   (30)

where S_t and N_t are (stochastic) signal and noise processes, both of which are supposed to be weakly harmonizable, so that the output process X_t is also, whenever S_t and N_t are uncorrelated or harmonizably correlated. For simplicity of exposition here, let us assume that they are uncorrelated. If F_x, F_s, F_n are the (known) spectral functions of these processes, let h(λ) = F_s(λ, T̂) + F_n(λ, T̂) and k(λ) = F_s(λ, T̂). The knowledge of these spectral functions is assumed from prior considerations. The problem here is to estimate S_a optimally, for any a ∈ T, based on the output X_t (i.e. on a realization). Here optimality refers to the least squares (or mean square error) criterion. A solution of the problem, using Theorem 5.1, can be given as
the h(. ) and k(. ) being the 'marginal' measures of the spectral functions of the
To use this result in specific applications, one has to compute F_x and Z_x. From the data one can calculate the covariance r, and then F_x is obtained by means of formula (6). The stochastic measure Z_x(·) can also be obtained by using the dilation theorem (cf. Theorem 2.2) and a known result from the stationary theory (cf. [32], pp. 26-27). Thus for any interval A = (a, b) ⊂ T for which Z({a}) = 0 = Z({b}), one has (limits taken in the mean square sense)
    Z(A) = (1/2π) [ X_0(b − a) + lim_{n→∞} ∑_{0<|k|≤n} ((e^{−iak} − e^{−ibk})/(ik)) X_k ],  if T = ℤ,

    Z(A) = lim_{α→∞} (1/2π) ∫_{−α}^{α} ((e^{−iat} − e^{−ibt})/(it)) X_t dt,  if T = ℝ.
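The discrete-parameter inversion formula can be tested on a toy process whose spectral measure is known in advance. In the sketch below, the single-frequency choice X_k = e^{ikθ} (so that Z is a point mass at θ) is an illustration, not from the text; the sum should then approximate 1 when θ ∈ (a, b) and 0 otherwise:

```python
import numpy as np

theta = 1.0                                   # frequency of the toy process
X = lambda k: np.exp(1j * k * theta)          # X_k = e^{ik*theta}; Z = delta_theta

def Z_estimate(a, b, n):
    # (1/2pi)[X_0 (b-a) + sum_{0<|k|<=n} (e^{-iak} - e^{-ibk})/(ik) X_k]
    k = np.concatenate([np.arange(-n, 0), np.arange(1, n + 1)])
    coef = (np.exp(-1j * a * k) - np.exp(-1j * b * k)) / (1j * k)
    return ((b - a) * X(0) + np.sum(coef * X(k))) / (2 * np.pi)

inside = Z_estimate(0.5, 1.5, 4000)    # theta = 1.0 lies in (0.5, 1.5)
outside = Z_estimate(2.0, 3.0, 4000)   # theta does not lie in (2.0, 3.0)
assert abs(inside - 1) < 0.01          # mass of Z captured by the interval
assert abs(outside) < 0.01             # no mass outside
```

This is just the partial Fourier series of the indicator of (a, b) evaluated at θ, which is why the convergence is pointwise away from the endpoints.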
As noted in the preceding sections, many processes which are extensions of the stationary ones with 'triangular covariances' admit integral representations such as (26). However, there are other classes, based on the behavior of covariances at infinity. An important such family, motivated by certain summability methods, is the one introduced in the 1950's by J. Kampé de Fériet and F. N. Frenkiel, with a detailed exposition later in [13]. It will be called class (KF). This was also independently given, slightly later, by Yu. A. Rozanov [31] and E. Parzen [21], the latter under the name 'asymptotic stationarity'. Let us state the precise concept here.
    r̃(h) = lim_{α→∞} (1/α) ∫_0^{α−|h|} r(s, s + |h|) ds,  if T = ℝ,

    r̃(h) = lim_{n→∞} (1/n) ∑_{k=0}^{n−|h|−1} r(k, k + |h|),  if T = ℤ.   (34)
It may be verified that each of the quantities on the right of (34), before taking the limit, is positive definite, so that, when the limits exist, r̃(·) is also. The continuity of r̃ in the discrete case is trivial, and in the case that T = ℝ, r̃ is measurable even when it is not continuous. In either case, as a consequence of the classical Herglotz-Bochner-Riesz theorem on a characterization of such functions (cf., e.g., [29], Sections 4.4 and 4.5), there is a unique positive bounded nondecreasing H(·) such that
    X_t = aX_{t−1} + e_t,  t ∈ ℤ,   (36)

where the e_t's are uncorrelated random variables with zero means and unit variances, and |a| < 1. It is easily seen that the limit (34) exists for this sequence. One can also consider kth-order difference equations with constant coefficients, extending (36), such that all the roots of the characteristic equations lie inside the unit circle, but the details will not be entered into here.
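For this scheme the averaged covariances of (34) can be computed in closed form. A short numerical check, using the exact covariances of the AR(1) sequence started at X_0 = 0 (a nonstationary choice of solution; the values a = 0.6, h = 2 are illustrative):

```python
import numpy as np

# Class (KF) check for X_t = a X_{t-1} + e_t started at X_0 = 0: the
# covariances r(k, k+h) = a^h (1 - a^{2k})/(1 - a^2) depend on k, but their
# averages r_n(h), as in (34), converge to the stationary value a^h/(1 - a^2).
a, h, n = 0.6, 2, 20000
k = np.arange(1, n + 1)
r_k = a**h * (1 - a**(2 * k)) / (1 - a**2)   # exact r(k, k+h) for this start
r_n = r_k.mean()                             # averaged covariance over n terms
r_limit = a**h / (1 - a**2)                  # the limit r~(h) in (34)

assert abs(r_n - r_limit) < 1e-3
```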
What about the weakly harmonizable class? In fact, this question was raised in [31]. It turns out that neither includes the other completely. The preceding example already gives a nonharmonizable process of class (KF). The other noninclusion will now be discussed, since it provides a better insight into the structure of both these classes.
Consider a weakly harmonizable process {X_t, t ∈ T} with covariance r and F as its representing bimeasure (cf. (3)). Suppose that for this bimeasure the dominated convergence theorem holds, in the sense that if |f_n| ≤ g a.e., f_n → f pointwise, and g is F-integrable, then

    lim_{n→∞} ∫∫ f_n dF = ∫∫ f dF.   (37)
    a_k = ∑_{n=0}^∞ [χ_{C_n} + 2χ_{D_n}](k),

where C_n = [2^{2n}, 2^{2n+1}) and D_n = [2^{2n+1}, 2^{2n+2}), the left closed and right open intervals. The sets C_n and D_n are disjoint, and for each k the series is finite (only one nonzero term), 1 ≤ a_k ≤ 2, k > 0. Then A defined with this set of a_k's is clearly bounded. The covariance r(k, l) = 0 if k ≠ l, and hence
    r_n(h) = (1/n) ∑_{k=0}^{n−1} r(k, k + h) = (1/n) ∑_{k=0}^{n−1} a_k²,  if h = 0.

In particular,

    r_n(0) = 3 − 4/(4^m − 1),  if n = 2^{2m} − 1,
    r_n(0) = (4^{m+1} − 4)/(2^{2m+1} − 1),  if n = 2^{2m+1} − 1,   (38)

and hence lim_{m→∞} r_{2^{2m}−1}(0) = 3 while lim_{m→∞} r_{2^{2m+1}−1}(0) = 2. Thus lim_{n→∞} r_n(0) does not exist. Consequently, {X_n, n ∈ ℤ} ∉ class (KF). This example is due to H. Niemi (personal communication).
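The two subsequential limits in (38) can be confirmed numerically from the definition of the a_k alone; a minimal sketch:

```python
import numpy as np

# Numerical check of (38): with a_k = 1 on C_n = [4^n, 2*4^n) and a_k = 2 on
# D_n = [2*4^n, 4^{n+1}), the averages r_n(0) = (1/n) sum_{k<n} a_k^2 have
# different limits along the two subsequences, so lim r_n(0) does not exist.
N = 4**8
a = np.zeros(N)
for m in range(8):
    a[4**m: 2 * 4**m] = 1.0          # C_m block
    a[2 * 4**m: 4**(m + 1)] = 2.0    # D_m block

csum = np.cumsum(a**2)
r = lambda n: csum[n - 1] / n        # r_n(0) = (1/n) * sum_{k=0}^{n-1} a_k^2

assert abs(r(4**7 - 1) - 3.0) < 0.01        # along n = 2^{2m} - 1
assert abs(r(2 * 4**7 - 1) - 2.0) < 0.01    # along n = 2^{2m+1} - 1
```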
The preceding computation suggests an extension of class (KF). Since by definition a process {X_n, n ∈ ℤ} ⊂ L^2(P) belongs to class (KF) provided that the sequence of its averaged covariances converges, it is natural to consider a wider class by looking at their higher-order averages. Thus one can say that a process with covariance r is in class (KF, p), p = 1 being the original definition
The analog for the case that T = ℝ can similarly be given. Since in (34) r_n(·) is positive definite, it is seen easily that r_n^{(p)}(·) is also positive definite. Hence r̃(·) satisfies the same hypothesis and (35) holds, so that the representing H(·) may now be called a pth-order associated spectrum. The classical results on summability imply that if r_n^{(p)}(h) → r̃(h), then r_n^{(p+1)}(h) → r̃(h) for each integer p ≥ 1, but the converse implication is false. Hence class (KF) ⊂ class (KF, p) ⊂ class (KF, p + 1), and the inclusions are proper. Thus one has an increasing sequence of classes of nonstationary processes, each having an associated spectrum. The computations given for (38) show that the preceding example does not belong even to the class ⋃_{p≥1} class (KF, p). This also indicates that weakly harmonizable processes form a much larger class than the strongly harmonizable one, and are not included in the last union.
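The strictness of the hierarchy of averaging orders can be illustrated numerically with a classical summability example (the series 1 − 2 + 3 − 4 + ⋯, a standard illustration not taken from the text): its partial sums diverge, their first (C, 1) averages oscillate, but the second (C, 2) averages converge.

```python
import numpy as np

# Repeated Cesaro averaging can create convergence where the previous order
# fails, mirroring class(KF) being a proper subset of class(KF, p).  The
# partial sums of 1 - 2 + 3 - 4 + ... diverge, their (C, 1) averages
# oscillate between ~0 and ~1/2, and the (C, 2) averages converge to 1/4.
n = 200000
terms = np.arange(1, n + 1) * (-1.0) ** np.arange(n)
s = np.cumsum(terms)                      # partial sums: 1, -1, 2, -2, ...
c1 = np.cumsum(s) / np.arange(1, n + 1)   # (C, 1) averages
c2 = np.cumsum(c1) / np.arange(1, n + 1)  # (C, 2) averages

assert abs(c1[-1] - c1[-2]) > 0.4         # (C, 1) averages still oscillating
assert abs(c2[-1] - 0.25) < 1e-3          # (C, 2) averages converge to 1/4
```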
It should be remarked here that a further extension of the preceding class is obtainable by considering the still weaker concept of Abel summability. The consequences of such an extension are not yet known, and perhaps should be investigated in the future.
The general idea behind class (KF, p), p ≥ 1, is that if the given process is not stationary, then some averaging, which is a smoothing operation, may give an insight into the structure by analyzing its associated spectrum. Moreover, if {X_t, t ∈ ℝ} ∈ class (KF) and f is any Lebesgue integrable scalar function on ℝ, then the convolution of f and the X_t process is again in class (KF) whenever the function φ defined by φ(t) = [E(|X_t|²)]^{1/2} is in L^q(ℝ) for some 1 ≤ q ≤ ∞. Then
In the previous discussion of the Karhunen and Cramér classes, it was noted that each {X_t, t ∈ T} admits an integral representation such as (26), relative to a family {g(t, ·), t ∈ T} and a stochastic measure Z(·) on the spectral set S into L_0^2(P). Both g(t, u) and Z(du) can be given the following intuitive meaning,
    X_t = ∫_S g(t, u) Z(du),  t ∈ T.   (41)
Since in (26) the g there need not satisfy this condition, that formula does not
generally reduce to (41). So one should seek conditions on a subclass of
Karhunen processes admitting a representation of the type (41) which clearly
has interesting applications. Such a class will be discussed together with some
illustrations.
First it is noted that each process {X_t, t ∈ T} ⊂ L^2(P), assumed to be left continuous with right limits (i.e. for each t ∈ T, E(|X_t − X_{t−h}|²) → 0 as h → 0+, and there is an X̃_t such that E(|X̃_t − X_{t+h}|²) → 0 as h → 0+, denoted X̃_t = X_{t+0}), can be decomposed into a deterministic and a purely nondeterministic part (defined below). The deterministic component does not change from the remote past, so that it has no real interest for further stochastic analysis, such as in prediction and filtering problems. Thus only the second component has to be analyzed for a possible representation (41). This was shown to be the case by Cramér [7] and Hida [12] independently, and it will be presented here. ([7] has the 1960 references to Cramér's papers.)
Let ℋ = sp{X_t, t ∈ T} ⊂ L^2(P) (closed linear spans throughout), and similarly ℋ_t = sp{X_s, s < t} ⊂ ℋ and ℋ_{−∞} = ⋂_{t∈T} ℋ_t. Since ℋ_{t_1} ⊂ ℋ_{t_2} for t_1 < t_2, one has ℋ_{−∞} ⊂ ℋ_t ⊂ ℋ, and ℋ_{−∞} represents the remote past while ℋ_t stands for the past and present. The X_t process is deterministic if ℋ_{−∞} = ℋ and purely nondeterministic if ℋ_{−∞} = {0}. Thus the remote past generally contributes little to the experiment. The separation of the remote past from the evolving part is achieved as follows. A process {X_t, t ∈ T} which is left continuous with right limits (and this is automatic if T = ℤ) can be uniquely decomposed as: X_t = Y_t + Z_t, t ∈ T, where the Y_t component is purely nondeterministic, the Z_t is deterministic, and the Y_t and Z_t processes are uncorrelated. (This is a special case of Wold's decomposition.)
Since the deterministic part is uninteresting for the problems of stochastic analysis, and can be separated by the above result, one can ignore it. Hence for the rest of this section it will be assumed that our processes are purely nondeterministic. The proofs of the following assertions may be completed from the work of Cramér in [7] (cf. the references for his other papers there).
The approach here does not give much insight if T = ℤ. However, T = ℝ is really the difficult case, and the present method is specifically designed for it. The new element in this analysis is the concept of 'multiplicity', and it is always
one if T = ℤ, while it can be any integer N ≥ 1 if T = ℝ. (See [5], and the references there, and also [7].) The basic idea is to 'break up' the continuous parameter case, in the sense that each such process can be expressed as a direct sum of mutually uncorrelated components of the type (41), so that each of the latter elements can be analyzed with special methods. This relatively deep result was obtained independently (cf. [7] and [12]) and can be given as follows:
    r(s, t) = ∑_{k=1}^N ∫_{−∞}^{min(s,t)} g_k(s, λ) ḡ_k(t, λ) F_k(dλ),  s, t ∈ ℝ,   (42)
Thus such stationary processes always have multiplicity unity. The converse is not true: there exist nonstationary (even strongly harmonizable) nondeterministic processes of multiplicity N, for any given N, 1 ≤ N ≤ ∞.
It should be noted that (42) can also be stated for the Xt process using the
(stochastic) integral representations:
    r(s, t) = p(s)p(t) ∫_{−∞}^{min(s,t)} f(λ)/(p(λ))² dλ.   (46)
Hence for all s < t < u, if p(s, t) = r(s, t)/r(s, s), s ≤ t, one gets

    p(s, u) = p(s, t) p(t, u),  s < t < u.   (47)

This p(·, ·) is called a correlation characteristic, and the functional equation (47) implies that {X_t, t ∈ ℝ} is a wide-sense Markov process. This means that for each t_1 < t_2 < ⋯ < t_n, n ≥ 1, t_k ∈ ℝ, the (orthogonal) projection of X_{t_n} on the linear span of X_{t_1}, …, X_{t_{n−1}} is the same as the projection of X_{t_n} on the one-dimensional span of X_{t_{n−1}}. (For a proof of this classical fact see [29], p. 145.) It is of some interest to note that, in the special case of (46), if ρ̃(s, t) = r(s, t)/[r(s, s)r(t, t)]^{1/2}, the correlation coefficient, then ρ̃ also satisfies the relation (47). If the X_t process is normal and (46) holds, the above-noted projection becomes the conditional expectation, and the wide-sense property becomes the usual (strict-sense) Markov property. As an example, one may consider r(s, t) = exp(−c|s − t|), c > 0.
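For this exponential covariance the functional equation (47) can be checked directly; a minimal sketch (the constant c = 0.7 and the test triples are illustrative choices):

```python
import numpy as np

# The Ornstein-Uhlenbeck-type covariance r(s, t) = exp(-c|s - t|) satisfies
# the functional equation (47): p(s, u) = p(s, t) p(t, u) for s < t < u,
# where p(s, t) = r(s, t)/r(s, s) is the correlation characteristic.
c = 0.7
r = lambda s, t: np.exp(-c * abs(s - t))
p = lambda s, t: r(s, t) / r(s, s)

for (s, t, u) in [(0.0, 1.0, 2.5), (-1.0, 0.3, 4.0), (1.1, 1.2, 9.9)]:
    assert abs(p(s, u) - p(s, t) * p(t, u)) < 1e-12   # wide-sense Markov
```

Here p(s, t) = e^{−c(t−s)} for s ≤ t, so the equation holds exactly, not just numerically.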
A different example of a nonstationary (nonharmonizable) process of multiplicity one is Brownian motion. Here g(t, λ) = 1, F(u) = 0 if u < 0, = u if 0 ≤ u < 1, and = 1 if u ≥ 1, with N = 1 in (42). If the process is not assumed normal (Gaussian)¹, g ≡ a constant, and F is also a constant outside of a compact interval, then each nondeterministic process of the form (44), which has orthogonal increments, has multiplicity one. Thus each of these classes is large

¹From here on, a process being normal means Gaussian. It should not be confused with the 'normal operators' of Section 3 (cf. (17)-(23)).
302 M. M. R a o
in itself. Other useful applications will now be discussed in the final two
sections. (Cf. also [7].)
E(|X_t - \hat X_{t,s}|^2) = \inf\{E(|X_t - Y|^2) : Y \in \mathcal{H}_s\} ,   (48)

\sigma_{t,s}^2 = E(|X_t - \hat X_{t,s}|^2) = \sum_{k=1}^{N} \int_s^t |g_k(t, \lambda)|^2 F_k(d\lambda) .   (50)
This in principle furnishes the desired solution of the least squares linear
prediction problem for processes of the type (44). In general, however, there is
as yet no recipe for determining the multiplicity of a given continuous-parameter,
purely nondeterministic, second-order process that is left continuous with right
limits. But results are available if one is willing to assume somewhat
more on the g_k's, generalizing the stationary case.
Even when the X_t process does not satisfy all the conditions of Theorem 7.1,
the least squares prediction problem can be formulated and solved differently.
To understand this aspect, let {X_t, t ∈ T} ⊂ L²₀(P) be a process and 𝓗_t = sp{X_s:
s ≤ t} as before. Suppose that 𝓗 = sp{X_t, t ∈ T} ⊂ L²₀(P) is separable, which
holds if the covariance r(·, ·) is continuous (e.g., T = Z). For each t₀ ∈ T, the
best linear least squares predictor of X_{t₀} based on the past {X_u, u ≤ s < t₀} is
X̂_{t₀,s} = P_s X_{t₀} (∈ 𝓗_s), and it is the limit in mean of linear combinations of X_u,
u ≤ s. On the other hand, it is known that a nonlinear least squares predictor of
X_{t₀} is given by the conditional expectation

Y_{t₀,s} = E(X_{t₀} \mid X_u, u \le s) .
Harmonizable, Cramér and Karhunen classes of processes 303

If the process is normal, then one can verify that X̂_{t₀,s} = Y_{t₀,s}. Thus for normal
processes with a continuous covariance, both these predictors coincide. (See
Yaglom [34], Chapters 4 and 6 for a lucid discussion of these problems.) Since
from a practical point of view it is not feasible to have a complete realization
{X_u, u ≤ s} at our disposal, it is desirable to have some approximations to the
best predictor. A result on this can be described as follows. Let T = Z for
simplicity, and for s < t₀ ∈ Z define 𝒞_n = sp{X_s, X_{s-1}, …, X_{s-n}}, so that
lim_n 𝒞_n = sp{∪_{n≥0} 𝒞_n} = 𝓗_s. If X̃_{t₀,n} = Q_n(X_{t₀}), Q_n being the orthogonal pro-
jection of 𝓗 onto 𝒞_n, then one can show, using the geometry of 𝓗, that
E(|X̃_{t₀,n} − X̂_{t₀,s}|²) → 0 as n → ∞. However, the pointwise convergence of X̃_{t₀,n}
to X̂_{t₀,s} is much more difficult, and in fact the truth of the general statement is
not known. For a normal process, an affirmative answer can be obtained from
the following nonlinear case.
Let Y_{t₀,n} = E(X_{t₀} | X_s, X_{s-1}, …, X_{s-n}) and Y_{t₀,s} be as before. Then the
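The finite-past projections X̃_{t₀,n} discussed above can be illustrated concretely. The following sketch assumes a hypothetical discrete covariance of Ornstein-Uhlenbeck type, r(i, j) = ρ^{|i−j|}: since such a process is wide-sense Markov, the normal equations load only on the most recent observation, so the finite-past predictors have already converged. The solver and parameters are illustrative only, not from the text.

```python
# Best linear predictor of X_{t0} from the finite past X_s, X_{s-1}, ..., X_{s-n},
# obtained by solving the normal equations R c = r for the covariance
# r(i, j) = rho**|i - j| (an AR(1)-like, wide-sense Markov case).
# The coefficients on lags before X_s should vanish: the finite-past predictor
# already equals rho**(t0 - s) * X_s, the full-past predictor.

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting (for illustration only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

rho, lead = 0.8, 3          # illustrative correlation and prediction lead t0 - s
n_past = 5                  # number of past observations beyond X_s
R = [[rho ** abs(i - j) for j in range(n_past + 1)] for i in range(n_past + 1)]
r_vec = [rho ** (lead + i) for i in range(n_past + 1)]   # Cov(X_{t0}, X_{s-i})
coeffs = solve(R, r_vec)    # regression weights on X_s, X_{s-1}, ..., X_{s-n}
```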
where β is the friction coefficient and ω₀ denotes the circular frequency of the
oscillator. Here A(t) is the random fluctuation, assumed to be white
noise, the symbolic (but really fictional) derivative of the Brownian motion. In
some cases, β and ω₀ may depend on time. To make (51) realistic, the symbolic
equation should be expressed as:
where the B(t) process is Brownian motion. Thus for each t > 0, B(t) is normal
with mean zero and variance σ²t, denoted N(0, σ²t), and if 0 < t₁ < t₂ < t₃, then
B(t₃) − B(t₂) and B(t₂) − B(t₁) are independent normal random variables with
distributions N(0, σ²(t₃ − t₂)) and N(0, σ²(t₂ − t₁)), respectively. Also, Ẋ(t) = dX(t)/dt is taken as a
mean square derivative. Then (52) and (51) can be interpreted in the integrated
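The stated increment properties of B(t) can be checked by simulation. A sketch with illustrative parameters and a fixed seed: it estimates E[B(s)B(t)] and compares it with σ² min(s, t), which is exactly what independence of disjoint increments implies.

```python
import random

# Monte Carlo check of the Brownian increment properties described above:
# B(t) ~ N(0, sigma^2 t), disjoint increments independent, hence
# E[B(s) B(t)] = sigma^2 * min(s, t).  Parameters are illustrative only.
random.seed(12345)
sigma, dt = 1.0, 0.01
n_steps, n_paths = 90, 5000      # simulate up to t = 0.9, recording B(0.4)
s_idx = 40                        # index of the intermediate time s = 0.4

acc, var_t = 0.0, 0.0
for _ in range(n_paths):
    b, bs = 0.0, 0.0
    for k in range(1, n_steps + 1):
        b += random.gauss(0.0, sigma * dt ** 0.5)   # independent N(0, sigma^2 dt)
        if k == s_idx:
            bs = b                                  # B(0.4) along this path
    acc += bs * b
    var_t += b * b
cov_est = acc / n_paths          # should be near sigma^2 * min(0.4, 0.9) = 0.4
var_est = var_t / n_paths        # should be near sigma^2 * 0.9
```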
the right side of (53) being a simple stochastic integral which is understood as in
Section 3 (since B is also orthogonally scattered). Here f is a nonstochastic
function. The integration theory if f is stochastic needs a more subtle treatment,
and the B(t) process can also be replaced by a 'semi-martingale'. (See,
e.g., [26], Chapters IV and V for details.) The point is that the following
statements have a satisfactory and rigorous justification. With Brownian
motion one can assert more and, in fact, regarding the solution process of (52),
the following is true.
X_t = \int_0^t G(t, u)\, dB(u) + C_1 V_1(t) + C_2 V_2(t), \qquad t \in J,   (54)
where V_i(·), i = 1, 2, are the unique solutions of the accompanying homogeneous
differential equation

\ddot V(t) + \beta \dot V(t) + \omega_0^2 V(t) = 0 ,   (55)

and G(t, s) is the associated Green function, satisfying the jump condition

\frac{\partial G}{\partial t}(s + 0, s) - \frac{\partial G}{\partial t}(s - 0, s) = 1 .   (56)
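The jump condition (56) can be verified numerically on a concrete Green function. The closed form of G below is a standard assumption for constant β and ω₀ (with ω₀ > β/2), not taken from the text; the check uses finite differences.

```python
import math

# Numerical check of the jump condition (56) for the damped-oscillator
# Green function (assumed closed form for constant beta, w0 with w0 > beta/2):
#   G(t, s) = exp(-beta*(t-s)/2) * sin(wd*(t-s)) / wd   for t >= s,  0 otherwise,
# where wd = sqrt(w0^2 - beta^2/4).  beta, w0 are illustrative values.
beta, w0 = 0.5, 2.0
wd = math.sqrt(w0 ** 2 - beta ** 2 / 4)

def G(t, s):
    if t < s:
        return 0.0
    u = t - s
    return math.exp(-beta * u / 2) * math.sin(wd * u) / wd

s, h = 1.0, 1e-6
dG_right = (G(s + h, s) - G(s, s)) / h      # one-sided derivative from above
dG_left = (G(s, s) - G(s - h, s)) / h       # one-sided derivative from below
jump = dG_right - dG_left                   # should equal 1 by (56)

# G also satisfies the homogeneous equation (55) in t for t > s:
t, h2 = 1.7, 1e-4
resid = ((G(t + h2, s) - 2 * G(t, s) + G(t - h2, s)) / h2 ** 2
         + beta * (G(t + h2, s) - G(t - h2, s)) / (2 * h2)
         + w0 ** 2 * G(t, s))
```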
This result shows that the processes appearing as solutions of (linear)
stochastic differential equations have interesting special properties. Further,
one can show that the vector process {(X_t, Ẋ_t), t ∈ J} is a (vector) Markov
normal process, almost all of whose sample paths are continuous. (For details of
these assertions, see [25], Sec. 4.) Related results for the nth-order case, with
continuous, (n − 1) times differentiable coefficients a_i(·) and initial conditions
C_i = 0, have been analyzed by Dolph and Woodbury [8]. The work exemplifies
the importance of nondeterministic processes of multiplicity one in applications
coming from both the physical sciences and communication theory. Let us now
turn to another type of application.
The general filtering problem can be presented, following Bochner [2], as
follows. Let 𝓛_T be the set of all second-order processes X = {X_t, t ∈ T} with
zero means. Let Λ be a linear operator on the linear space 𝓛_T. Suppose
X, Y ∈ 𝓛_T and that X ∈ domain(Λ), and

(\Lambda X)_t = \sum_{i=1}^{n} a_i X_{t-i}, \qquad t \in Z, \ a_i \in R ,
linear time trend and a random disturbance. Thus the model is given as

X_t = a + bt + Y_t, \qquad 0 \le t \le 1 ,
where a, b are real but unknown constants and {Y_t, 0 ≤ t ≤ 1} is a noise process
which is assumed to be stationary with mean zero and covariance r given by
r(s, t) = exp(−β|s − t|), β > 0. The problem is to find an unbiased linear estimator
of X_{t₀}, t₀ > 1, based on the output {X_t, 0 ≤ t ≤ 1}, using the least squares
criterion. This can be made more explicit as follows: it is desired to find a
weight function w(·) on 0 ≤ t ≤ 1 which is of bounded variation such that if

\hat X_{t_0}(w) = \int_0^1 X_t \, dw(t), \qquad E(\hat X_{t_0}(w)) = E(X_{t_0}) = a + b t_0 , \quad a, b \in R ,
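A discrete sketch of this estimation problem, with an illustrative β and grid: generalized least squares with the covariance r(s, t) = exp(−β|s − t|) recovers (a, b) exactly on noise-free data, which is the unbiasedness requirement in miniature. The linear-algebra helper is a toy, not a recommended solver.

```python
import math

# Discrete analogue of the trend-estimation problem: estimate (a, b) in
# X_t = a + b t + Y_t on a grid of [0, 1] by generalized least squares with
# noise covariance r(s, t) = exp(-beta0 * |s - t|).  Values are illustrative.

def solve(A, b):
    """Gaussian elimination with partial pivoting (illustration only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

beta0, m = 1.5, 21
grid = [i / (m - 1) for i in range(m)]
R = [[math.exp(-beta0 * abs(s - t)) for t in grid] for s in grid]
a_true, b_true = 2.0, -0.7
X = [a_true + b_true * t for t in grid]      # noise-free data: estimator is exact

# GLS normal equations (D' R^-1 D) theta = D' R^-1 X, design D = [1, t]
Rinv_cols = [solve(R, [1.0 if i == j else 0.0 for i in range(m)]) for j in range(m)]
def quad(u, v):                              # u' R^-1 v
    return sum(u[i] * Rinv_cols[j][i] * v[j] for i in range(m) for j in range(m))
ones = [1.0] * m
A_mat = [[quad(ones, ones), quad(ones, grid)],
         [quad(grid, ones), quad(grid, grid)]]
rhs = [quad(ones, X), quad(grid, X)]
a_hat, b_hat = solve(A_mat, rhs)
```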
Other estimation methods and their properties are discussed for stationary
error processes in [11].
In this final section some special inference questions when the processes are
normal are briefly discussed to supplement the preceding work.
Recall that a normal process {X_t, t ∈ T} is a collection of random variables
such that each finite subset has a joint normal distribution. Now if (Ω, Σ, P_i),
i = 1, 2, are a pair of probability spaces with a common base space Ω, then P₁
and P₂ are said to be mutually singular or perpendicular (written P₁ ⊥ P₂) if
there is an event A₀ ∈ Σ such that P₁(A₀) = 0 and P₂(A₀) = 1, and mutually
absolutely continuous or equivalent (written P₁ ~ P₂) if both P₁ and P₂ vanish on
the same class of sets from Σ. For instance, if Ω = R, Σ = the Borel σ-algebra,
P₁ = normal and P₂ = Cauchy, then P₁ ~ P₂. On the other hand, if P₁ is normal
and P₂ is Poisson, then P₁ ⊥ P₂. However, if Ω = R^T, Σ = the cylinder σ-algebra,
then X_t: Ω → R is defined as X_t(ω) = ω(t), i.e., the coordinate function, and the
problem of determining when P₁ ~ P₂, or P₁ ⊥ P₂, or neither, is not simple.
In the case that both P₁, P₂ are normal probability measures on Ω = R^T, only
the main dichotomy can occur: either P₁ ~ P₂ or P₁ ⊥ P₂. This was first established
independently by J. Feldman and J. Hájek in 1958, and later elementary proofs
of this theorem were presented by L. A. Shepp and others. A simplified but
still nontrivial proof of this result, with complete details, is given in ([27], pp.
212-217).
The statistical problem therefore is to decide, on the basis of a realization,
which one of P₁, P₂ is the correct probability governing the process. In the
singular case this is somewhat easier, but in case P₁ ~ P₂ the problem is not
simple. A number of cases had been discussed in [10] before the dichotomy
result was known. The simplest usable condition in the general case is the
following:

Let P_i have the mean and covariance functions (m_i, r_i), written P(m_i, r_i), i =
1, 2. Then P₁ ~ P₂ iff one has P(0, r₁) ~ P(0, r₂) and P(m₁, r₁) ~ P(m₂, r₁). Thus
P(m₁, r₁) ~ P(m₂, r₂) iff P(m₁, r₁) ~ P(m₂, r₁) ~ P(m₂, r₂). Some applications with
likelihood ratios appear in [25]. This equivalence criterion will now be illustrated
on a purely nondeterministic normal process of multiplicity one.
If {X_t, t ∈ T} is a normal process with mean zero and covariance r, let
Z_t = m(t) + X_t, where m: T → R is a measurable nonstochastic function, so that
the Z_t process has mean function m and covariance r, and is also normal. Let P
and P_m be the corresponding probabilities governing them. The mean m(·) is
Y = \int_a^b h(\lambda)\, Z(d\lambda) ,   (69)
for some h ∈ L²([a, b], F), where F(A) = E(|Z(A)|²). Suppose that the derivative
F′ exists outside a set of Lebesgue measure zero. Since Z(·) has orthogonal
increments, (65), (68) and (69) imply

m(t) = \int_a^b h(\lambda)\, g(t, \lambda)\, F'(\lambda)\, d\lambda, \qquad t \in T = [a, b].   (70)

This is the simplification noted above. If ∂g/∂t is assumed to exist, then (70)
implies that the derivative m′(t) of m(t) also exists. In particular, if X_t is
the Brownian motion, so that g = 1 and F′ = 1, one gets m′(t) = h(t) (a.e.) and
h ∈ L²([a, b], dt) in order that P_m ~ P.
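The L² criterion just obtained can be made concrete. A sketch comparing an admissible mean m₁(t) = t (with h₁ ≡ 1) against m₂(t) = √t, whose derivative h₂(u) = 1/(2√u) has a squared integral diverging logarithmically at 0, so that the corresponding measure is singular; the quadrature is illustrative.

```python
import math

# Admissibility of a mean for Brownian motion: P_m ~ P requires
# m(t) = integral_0^t h(u) du with h in L^2([0, 1], dt).
# m1(t) = t gives h1 = 1 (finite energy); m2(t) = sqrt(t) gives
# h2(u) = 1/(2*sqrt(u)), whose squared integral over [eps, 1] grows like
# (1/4)*log(1/eps) as eps -> 0, i.e. h2 is not in L^2.

def int_h2_sq(eps, n=100000):
    """Midpoint-rule approximation of the integral of h2(u)^2 = 1/(4u) on [eps, 1]."""
    step = (1.0 - eps) / n
    return sum(step / (4.0 * (eps + (k + 0.5) * step)) for k in range(n))

h1_energy = 1.0                       # integral of 1^2 over [0, 1]: m1 admissible
tail_a = int_h2_sq(1e-3)              # about (1/4)*log(1000)
tail_b = int_h2_sq(1e-6)              # larger still: logarithmic divergence
```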
There is a corresponding result when P₁ ~ P₂ and the P_i are normal but have
different covariances. However, this is more involved. A discussion of this case
from different points of view occurs in the works [35, 33, 7, 25]. (See also the
extensive bibliography in these papers.) There is a great deal of specialized
analysis for normal processes in both the stationary and general cases. It is thus
clear how various types of techniques can be profitably employed to several
classes of nonstationary processes of second order. Many realistic problems
raised by the above work are of interest for future investigations.
Acknowledgement
References
[1] Bhagavan, C. S. K. (1974). Nonstationary Processes, Spectral and Some Ergodic Theorems.
Andhra University Press, Waltair, India.
[2] Bochner, S. (1954). Stationarity, boundedness, almost periodicity of random valued functions.
In: Proc. Third Berkeley Symp. Math. Statist. and Probability, Vol. 2, 7-27. University of
California, Berkeley, CA.
[3] Chandrasekhar, S. (1943). Stochastic problems in physics and astronomy. Rev. Modern Phys.
15, 1-89.
[4] Chang, D. K. (1983). Harmonizable filtering and sampling of time series. UCR Tech. Report
No. 8, 26 pp. (to appear in Handbook of Statistics, Vol. 5).
[5] Chi, G. Y. H. (1971). Multiplicity and representation theory of generalized random processes.
J. Multivariate Anal. 1, 412-432.
[6] Cramér, H. (1951). A contribution to the theory of stochastic processes. In: Proc. Second
Berkeley Symp. Math. Statist. and Probability, 329-339. University of California, Berkeley, CA.
[7] Cramér, H. (1971). Structural and Statistical Problems for a Class of Stochastic Processes. S. S.
Wilks Memorial Lecture, Princeton University Press, Princeton, NJ.
[8] Dolph, C. L. and Woodbury, M. A. (1952). On the relation between Green's functions and
covariances of certain stochastic processes and its application to unbiased linear prediction.
Trans. Amer. Math. Soc. 72, 519-550.
[9] Getoor, R. K. (1956). The shift operator for nonstationary stochastic processes. Duke Math. J.
23, 175-187.
[10] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277.
[11] Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. Wiley,
New York.
[12] Hida, T. (1960). Canonical representation of Gaussian processes and their applications. Mem.
Coll. Sci. Univ. Kyoto, Ser. A, 32, 109-155.
[13] Kampé de Fériet, J. and Frenkiel, F. N. (1962). Correlation and spectra of nonstationary
random functions. Math. Comp. 16, 1-21.
[14] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann.
Acad. Sci. Fenn. Ser. A I Math. 37, 3-79.
[15] Kelsh, J. P. (1978). Linear analysis of harmonizable time series. Ph.D. thesis, UCR Library.
[16] Loève, M. (1948). Fonctions aléatoires du second ordre. A note in P. Lévy's Processus
Stochastiques et Mouvement Brownien, 228-352. Gauthier-Villars, Paris.
[17] Masani, P. (1968). Orthogonally scattered measures. Adv. in Math. 2, 61-117.
[18] Morse, M. and Transue, W. (1956). C-bimeasures and their integral extensions. Ann. Math.
64, 480-504.
[19] Nagabhushanam, K. (1951). The primary process of a smoothing relation. Ark. Mat. 1,
421-488.
[20] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann.
Acad. Sci. Fenn. Ser. A I Math. 591, 1-47.
[21] Parzen, E. (1962). Spectral analysis of asymptotically stationary time series. Bull. Internat.
Statist. Inst. 39, 87-103.
[22] Parzen, E. (1962). Stochastic Processes. Holden-Day, San Francisco, CA.
[23] Pitcher, T. S. (1959). Likelihood ratios of Gaussian processes. Ark. Mat. 4, 35-44.
[24] Rao, M. M. (1975). Inference in stochastic processes, V: Admissible means. Sankhyā Ser. A
37, 538-549.
[25] Rao, M. M. (1978). Covariance analysis of nonstationary time series. In: Developments in
Statistics, Vol. 1, 171-225. Academic Press, New York.
[26] Rao, M. M. (1979). Stochastic Processes and Integration. Sijthoff and Noordhoff, Alphen aan
den Rijn, The Netherlands.
[27] Rao, M. M. (1981). Foundations of Stochastic Analysis. Academic Press, New York.
[28] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351.
[29] Rao, M. M. (1984). Probability Theory with Applications. Academic Press, New York.
[30] Rao, M. M. (1984). The spectral domain of multivariate harmonizable processes. Proc. Nat.
Acad. Sci. U.S.A. 81, 4611-4612.
[31] Rozanov, Yu. A. (1959). Spectral analysis of abstract functions. Theory Probab. Appl. 4,
271-287.
[32] Rozanov, Yu. A. (1967). Stationary Random Processes (English translation). Holden-Day, San
Francisco, CA.
[33] Rozanov, Yu. A. (1971). Infinite Dimensional Gaussian Distributions (English translation).
American Mathematical Society, Providence, RI.
[34] Yaglom, A. M. (1962). An Introduction to the Theory of Stationary Random Functions (English
translation). Prentice-Hall, Englewood Cliffs, NJ.
[35] Yaglom, A. M. (1963). On the equivalence and perpendicularity of two Gaussian probability
measures in function spaces. In: Proc. Sympos. Time Series Analysis, 327-346. Wiley, New York.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 311-320

On Non-Stationary Time Series

C. S. K. Bhagavan
1. Introduction
2. Stationarity
Considerable progress has been made in the analysis of time series under the
basic assumption of stationarity. The stationarity considered is one of
structural invariance under translations of time. The structures considered are
of two kinds: one regards the probability setup and the other regards the
second moments. The former is known as strict stationarity,
while the latter is known as weak stationarity, the two coinciding for Gaussian
processes. Strict stationarity is only occasionally used, while weak stationarity is the
frequently used one. A slight variant of weak stationarity, which requires that
the first moment also be time invariant, is known as wide-sense stationarity, a
concept used in inferential problems. Unless otherwise stated, stationarity
normally means weak or wide-sense stationarity. It can be formally defined as
follows:
E(X(s)X̄(t)) = R(s − t), a function of the time lag, where the bar over an expression denotes the
complex conjugate; then X(t) is called a weakly stationary process. If, further,
m(t) = E(X(t)) = m, a constant, then X(t) is wide-sense stationary.
3. Spectrum
where

a(s) = \frac{2}{n} \sum_{t=1}^{n} X(t) \cos\frac{2\pi s t}{n}, \qquad b(s) = \frac{2}{n} \sum_{t=1}^{n} X(t) \sin\frac{2\pi s t}{n}.
The nature of the periodogram is that it runs close to the X-axis except that
there are sudden peaks at points corresponding to the jumps of the spectrum.
Thus the periodogram analysis plays a vital role in time series analysis. Before
passing on, it is to be emphasised that all these considerations are conditioned by
the assumption of stationarity.
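The peak behaviour of the periodogram can be illustrated directly. A sketch using ordinates built from a(s) and b(s) for a pure cosine wave at a hypothetical frequency index: the ordinate at the true index dominates all others.

```python
import math

# Periodogram ordinates I(s) = (n/2) * (a(s)^2 + b(s)^2) for a pure cosine
# wave: the periodogram runs close to zero except for a sharp peak at the
# true frequency index.  Amplitude and index are illustrative.
n = 200
s_true = 15
X = [5.0 * math.cos(2 * math.pi * s_true * t / n) for t in range(1, n + 1)]

def periodogram(X, s):
    n = len(X)
    a = (2.0 / n) * sum(X[t - 1] * math.cos(2 * math.pi * s * t / n) for t in range(1, n + 1))
    b = (2.0 / n) * sum(X[t - 1] * math.sin(2 * math.pi * s * t / n) for t in range(1, n + 1))
    return (n / 2.0) * (a * a + b * b)

I = [periodogram(X, s) for s in range(1, n // 2)]
peak_index = 1 + max(range(len(I)), key=lambda k: I[k])
```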
Having recognised the importance of the spectrum, one is naturally led to
the question: What happens to these considerations if stationarity is absent? In
other words, one is led to the consideration of 'non-stationary situations'. The
immediate problem one faces here is to restore the concept of spectrum in this
case.
It may be remarked, even at this stage, that when the facility and simplicity
of stationarity is gone, the attempts become diverse and the related aspects
need much further development. This chapter thus concentrates on
these spectral aspects, presenting the details in the discrete parameter case and
putting the concept of spectrum thus obtained to the same usage as in the
stationary case.
The generalisation of the spectrum envisaged can fruitfully be achieved if we
first have a look at what we are expecting of the spectral function. Broadly
speaking, the following would be the requirements for a spectral function (see
Loynes, 1968; Nagabhushanam, 1970):
Non-negativity and additivity, like mass or energy; unique determinability
from the auto-covariance function; relationship to a meaningful function by
Fourier-Stieltjes transformation; the possibility of possessing a jump part; determinability
of the spectral transfer function when the process variates undergo a
simple linear transformation; estimability of the spectral density from a single
realisation of the process; and reduction to the usual spectral function when the
process is specialised to a weakly stationary process.
Loynes (1968) has listed all the requirements for a spectral function of a process
and concluded that when a process is not stationary, there does not seem to exist a
spectral function satisfying all the requirements. Then what can be done seems to
be to define a spectrum of a type that will be suitable to the particular inquiry on
hand. These have broadly developed in two streams: one stream taking a start
from the covariance and the other taking a start from the process representation
(see Cramér and Leadbetter (1967) for process representation).
(a) Fano (1950) and Page (1952) have defined spectra based on considerations
of Fourier integrals. The spectrum defined by Fano cannot include
stationary processes in an essential way, and that of Page need not be
non-negative.
(b) Cramér (1961) has defined

F(s) = \int_{-\pi}^{s} \int_{-\pi}^{s} dh(\lambda, \mu)

as the spectrum of the harmonisable process of discrete parameter, where
h(λ, μ) is a function of bounded variation in terms of which the auto-covariance
function of the process has the representation

R(s, t) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} e^{i(s\lambda - t\mu)}\, dh(\lambda, \mu).
(c) Parzen (1962) assumes that

E(X(t)) = 0

and

R(k) = \lim_{T\to\infty} \frac{1}{T} \int_0^{T} E(X(t)\, X(t + k))\, dt

exist finitely for each k, and remarks that these may be termed asymptotically
weakly stationary processes, and that a time series X(t), t ≥ 0, for which there
exists a function R(k) satisfying the above could be said to possess a covariance
function R(k) and a spectrum. He establishes the existence of the spectrum
assuming that: (i) fourth moments of the process exist and are uniformly
bounded; (ii) (1/T) \int_0^{T-k} X(t) X(t + k) dt converges in mean square to R(k)
as T → ∞; and (iii) R(k) is continuous.
(d) Herbst (1964) has considered discrete parameter processes X(t) of
moving average form, for which the associated spectral density involves the factor

\Big| \sum_{j=0}^{\infty} a_j e^{ij\lambda} \Big|^2 .
(e) Another stream starts from the process representation X(t) = ∫ g(t, s) dZ(s),
F(s) being the spectral measure corresponding to the orthogonal process Z(s).
Priestley (1965) has studied a subclass of these processes for which g(t, s) is a
function changing slowly with time. They are referred to by him as oscillatory
processes or processes with evolutionary spectra, and he has defined the spectrum
as

dF_t(s) = |A_t(s)|^2\, dF(s), \qquad \text{where } g(t, s) = A_t(s)\, e^{its}.
Priestley's spectrum has been shown to be useful for linear prediction, filtering
and a test for stationarity, etc. (see Abdrabbo and Priestley (1967) and the
succeeding papers on the subject). Processes with slowly varying spectra have
also been considered by Granger and Hatanaka (1964).
(f) Nagabhushanam and/or Bhagavan (see Nagabhushanam and Bhagavan,
1968; Bhagavan, 1974) obtained a straightforward generalisation of the spectrum
of the stationary process, by considering a sequence of functions
(cf. Grenander and Rosenblatt, 1957) and obtaining the limit function F(s),
existing at every continuity point of the latter, when the limits, called averaged
covariances,

C(k) = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} R(t, t + k),

exist finitely for each k. In the continuous parameter case also, a similar set of
functions and the limit exist. In this setup, it is shown that

C(k) = \int e^{iks}\, dF(s) ,
where X₁ and X₂ are random variables with zero means, variances each unity,
and covariance ½. In this case, the forward spectrum consists of two jumps of
magnitude unity at 0 and ½ω, while the backward spectrum consists again of
two jumps of magnitude unity at 0 and −½ω (see Nagabhushanam, 1970).
When the directional spectra coincide, the common spectrum is called the
overall spectrum. The overall spectrum is shown to exist for the harmonisable
process (referred to in (b) above) and is given by the Fourier-Stieltjes transform
of the averaged covariances. Kampé de Fériet and Frenkiel considered processes
for which the limits

R(k) = \lim_{T\to\infty} \frac{1}{T} \int_{|k|/2}^{T - |k|/2} R(t - k/2, t + k/2)\, dt

exist. The class of processes for which the above limit exists was termed the K-F
class of processes.
M. M. Rao (1976) studied a more general class of non-stationary processes,
called almost harmonisable processes, arising as

X(t) - m(t) = \int_W f(t, s)\, dZ(s),

where the Z process has covariance of bounded variation and, for each s, f_s(t) has
the Fourier relation
The sample means

\frac{1}{n} \sum_{t=1}^{n} X(t), \qquad \frac{1}{T} \int_0^T X(t)\, dt, \qquad \frac{1}{n} \sum_{t=1}^{n} (X(t) - m(t)),

and the parameters they are estimating, need some clarification. This matter is trivial for
wide-sense stationary processes, as the parameter there is the constant mean.
But in respect of other processes, there is need to introduce a corresponding
parameter; it is called the pooled mean, defined as

\bar m = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} m(t).
This pooled mean has been shown to exist in respect of processes having
process representation, like stationary processes and harmonisable processes
(see Bhagavan, 1974). Thus when the pooled mean exists, the convergence of
the sample mean is to the pooled mean.
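A sketch of the pooled-mean idea, with an illustrative periodic mean function and white noise: the sample mean settles near the pooled mean, here the average of m over one period.

```python
import math
import random

# For X(t) = m(t) + e(t) with a periodic, non-constant mean m(t) and white
# noise e(t), the sample mean converges to the pooled mean
#   lim (1/n) * sum_{t=1}^{n} m(t),
# here the average of m over a period.  Parameters are illustrative.
random.seed(7)

def m(t):
    return 1.0 + math.cos(2 * math.pi * t / 8)   # pooled mean = 1.0

n = 50000                                         # a multiple of the period 8
sample_mean = sum(m(t) + random.gauss(0.0, 1.0) for t in range(1, n + 1)) / n
pooled_mean = 1.0
```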
6. Concluding remarks
References
[11] Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. Wiley,
New York.
[12] Herbst, L. J. (1964). Spectral analysis in the presence of variance fluctuations. J. Roy. Statist.
Soc. Ser. B 21, 354-360.
[13] Herglotz, G. (1911). Über Potenzreihen mit positivem reellem Teil im Einheitskreis. Ber.
Verh. Kgl. Sächs. Ges. Wiss. Leipzig 63, 501-511.
[14] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad.
Sci. Fenn. Ser. A I 37 (Helsinki).
[15] Khintchine, A. I. (1934). Korrelationstheorie der stationären stochastischen Prozesse. Math.
Ann. 109, 604-615.
[16] Loève, M. (1963). Probability Theory, 3rd ed. Van Nostrand, New York.
[17] Loynes, R. M. (1968). On the concept of the spectrum for non-stationary processes. J. Roy.
Statist. Soc. Ser. B 30, 1-30.
[18] Nagabhushanam, K. (1969). Mean ergodic theorem for second moment stationary processes.
Abstracts, Statistics section, Proceedings of the Indian Science Congress.
[19] Nagabhushanam, K. (1970). Spectral concepts in non-stationary processes. Unpublished
proceedings of the Second Mastech Conference on Probability and Statistics, Madras, India.
[20] Nagabhushanam, K. (1975). Mean ergodic theorem for an irreducible time homogeneous
Markov chain with a finite number of states. Current Science.
[21] Nagabhushanam, K. (1976). Some ergodic theorems for non-stationary Markov processes.
Lecture to the Symposium on Advanced Probability and Statistics at Annamalainagar, India.
[22] Nagabhushanam, K. and Bhagavan, C. S. K. (1968). Non-stationary processes and spectrum.
Can. J. Math. 20, 1203-1206.
[23] Nagabhushanam, K. and Bhagavan, C. S. K. (1969). A mean ergodic theorem for a class of
non-stationary processes. Sankhyā Ser. A 31, 421-424.
[24] Page, C. H. (1952). Instantaneous power spectra. J. Appl. Phys. 23, 103-106.
[25] Parzen, E. (1967). Time Series Analysis Papers, Papers 10, 11. Holden-Day, San Francisco, CA.
[26] Priestley, M. B. (1965). Evolutionary spectra and non-stationary processes. J. Roy. Statist. Soc.
Ser. B 27, 204-237.
[27] Rao, M. M. (1976). Covariance analysis of some non-stationary time series. Technical
Report AFFDL TR 76, 126, Aug. 1974-Aug. 1976. (Cf. Developments in Statistics, Vol. 1
(1978), 171-225, Academic Press, New York.)
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 321-336

Harmonizable Filtering and Sampling of Time Series

Derek K. Chang*
1. Introduction
In engineering, economics and some other fields, we usually face the following
types of problems.
(1) With the information obtained from observation of a phenomenon in
the past and present, we wish to predict it at a future time with the best
accuracy in some prescribed sense.
(2) The data one gets in observing a phenomenon are almost always an
approximation of the real data, due to the error caused in the process
of observation, for instance in the measurement of a quantity. It is then
desired to estimate the original values by utilizing the observed data with the
best accuracy in a certain sense.
(3) With a given input time series and a linear filter, such as an electric
circuit, a unique output series can be generated. In case the filter is known, and
only the output series is observed, it is desired to recover the input series, to
determine whether the solution to this inversion problem is unique, and
whether the filter is physically realizable, that is, whether the input series at any
time can be expressed solely in terms of the observations of the output series
made in the past and present, but not the future.
To treat these problems, we need to establish some mathematical and
statistical models of time series. Then we can describe these problems precisely,
and provide some solutions.
Let the triple (Ω, Σ, P) denote a probability space, where Ω is a sample
space, Σ is the σ-algebra of all events under consideration, and P is a
probability measure on Σ. Let C stand for the complex numbers. For any complex-valued
random variable X: Ω → C, which is Σ-measurable by definition, the
expectation of X is denoted by E(X) = ∫_Ω X dP. The Hilbert space H =
L²(Ω, Σ, P) of all the (equivalence classes of) random variables with finite
second moments can be constructed as usual, with inner product defined by
(f, g) = E(f ḡ). The 'overbar' denotes complex conjugation. Let ‖·‖₂ denote the
*Prepared under the ONR Contract No. N00014-79-C-0754 (Modification No. P00003).
X(t) = \int_D e^{it\lambda}\, dZ(\lambda), \qquad t \in T,   (2)

and the integral is in the sense of Dunford and Schwartz (see [10], p. 323).
Note that without the restriction that the time series X is of mean zero, we
have the slightly different concepts of weak and wide sense stationarity (see
[2]). They agree under our hypotheses.
2. If X is not necessarily weakly stationary, but its covariance function
admits a representation in terms of a spectral function μ of finite Fréchet variation

\|\mu\|_F = \sup\Big\{ \Big| \sum_{i,j=1}^{N} a_i \bar b_j\, [\mu(t_i, t_j) - \mu(t_{i-1}, t_j) - \mu(t_i, t_{j-1}) + \mu(t_{i-1}, t_{j-1})] \Big| : |a_i| \le 1,\ |b_j| \le 1 \Big\},
We now recall the definition of a linear filter for our work here. A linear
filter L is a mapping L: X ↦ Y, or LX = Y, where X = {X(t), t ∈ T} and
Y = {Y(t), t ∈ T} are two time series with T = Z or R, such that: (i) for any
a, b ∈ C and time series X and Y, L(aX + bY) = aL(X) + bL(Y); (ii) for any
h ∈ T, with the time series X_h defined by X_h(t) = X(t + h), t ∈ T,

L(X_h) = (L X)_h .

Condition (i) says that L is linear, and condition (ii) says that L commutes
with translations on the T axis. In the case that all the time series under
consideration are in L²₀(P), a linear filter does not have to be bounded, i.e.,
‖LX(t)‖₂/‖X(t)‖₂ need not be bounded for t ∈ T. For instance, one can have
differential filters which need not be bounded. A study of the general concept
of a linear filter on a second order homogeneous time series on a globally
symmetric index set can be found in Yaglom [29] and Hannan [12], which
extends the work of Masani [18] where the index set is the real line. However,
they will not be considered here.
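Conditions (i) and (ii) can be demonstrated on a simple moving-average filter; the weights below are illustrative.

```python
# A discrete sketch of conditions (i) and (ii): a moving-average filter
# L X(t) = sum_j a_j X(t - j) is linear and commutes with time shifts.
a = [0.5, 0.3, 0.2]   # illustrative filter weights

def L(X):
    # X is a dict t -> X(t); the output is defined where all needed lags exist
    return {t: sum(a[j] * X[t - j] for j in range(len(a)))
            for t in X if all(t - j in X for j in range(len(a)))}

X = {t: (t % 5) * 1.0 - 2.0 for t in range(-10, 30)}
h = 4
Xh = {t: X[t + h] for t in X if t + h in X}   # shifted input X_h(t) = X(t + h)
LXh = L(Xh)
LX = L(X)
# condition (ii): L(X_h)(t) must equal (L X)(t + h) wherever both are defined
shift_ok = all(abs(LXh[t] - LX[t + h]) < 1e-12 for t in LXh if t + h in LX)
```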
Next we discuss the problem mentioned at the beginning of Section 1. Let
X = {X(n), n ∈ Z} be a time series, and let L be a polynomial filter, also called
a moving average filter, as follows:

L X(n) = \sum_{j=0}^{N} a_j X(n - j), \qquad n \in Z,   (6)

with the associated function

f(t) = \sum_{j=0}^{N} a_j e^{ijt}, \qquad 0 \le t < 2\pi.   (7)
If both conditions are satisfied, then the solution series X to the inversion
problem is unique if and only if O is empty. If O is not empty, there exists only
one solution belonging to the closed span of all square integrable functions
relative to μ under the L² norm and whose spectral function ν satisfies the
condition ∫_O dν(t) = 0. Furthermore, if all the roots of the characteristic polynomial

P(t) = \sum_{j=0}^{N} a_j t^j

are outside the unit circle {t ∈ C: |t| = 1}, then the filter L is physically realizable.
Actually, with

1/P(t) = \sum_{n=0}^{\infty} b_n t^n ,
However, if some roots are inside and none is on the unit circle, then L is not
physically realizable. In this case, the function 1/P(t) has a Laurent series
expansion \sum_{n=-\infty}^{\infty} b_n t^n, and the future values of Y are clearly involved in the
expression
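The role of the root location can be sketched numerically: for a hypothetical first-order P(t) with its root outside the unit circle, the coefficients b_n of 1/P(t) decay geometrically and give a causal, hence physically realizable, inversion of Y = LX. Filter weights and input are illustrative.

```python
import math

# Physical realizability sketch: P(t) = a0 + a1*t with root -a0/a1 = -2.5
# outside the unit circle, so 1/P(t) = sum_{n>=0} b_n t^n converges and the
# input is recovered causally: X(n) = sum_k b_k Y(n - k).
a = [1.0, 0.4]          # illustrative weights; root at -2.5
Kb = 60                 # truncation level of the power series for 1/P

# recursion from P(t) * sum b_n t^n = 1:  a0*b0 = 1,  sum_j a_j b_{n-j} = 0
b = [1.0 / a[0]]
for nn in range(1, Kb):
    s = sum(a[j] * b[nn - j] for j in range(1, min(len(a), nn + 1)))
    b.append(-s / a[0])

# forward filter Y = L X, then causal inversion using only present and past Y
X = [math.sin(0.3 * n) for n in range(200)]
Y = [sum(a[j] * X[n - j] for j in range(len(a)) if n - j >= 0) for n in range(200)]
X_rec = [sum(b[k] * Y[n - k] for k in range(min(Kb, n + 1))) for n in range(200)]
err = max(abs(X[n] - X_rec[n]) for n in range(200))
```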
where g is a Lebesgue integrable weight function over R, and the integral on the
right side of (10), with a vector integrand, is in the sense of Bochner [9]. The
spectral characteristic F associated with L is now taken as the Fourier
transform of g, i.e.
With L and F just defined, and with W = R, all the results listed above for
polynomial filters still hold, except the last part concerning the physical
realizability of L. For the unbounded linear filter defined by a difference-differential
operator, and also for some more general filters, similar results have
been obtained by the author in [5].
been obtained by the author in [5]. In the case when both input and output
series X and Y are strongly harmonizable, the corresponding work was done
by Kelsh [15].
Since the spectral function μ of the output series Y is now defined on R × R,
the necessary and sufficient conditions (8) and (9) for the existence of a solution
X, which is also strongly harmonizable, to the equation Y = LX with L defined
by (6) should be replaced by

(ii)' \int\!\!\int_{R\times R} |f(u)\, \bar f(v)|^{-1}\, d|\mu|(u, v) < \infty ,   (13)
(i)'' |\mu|_F(O \times O) = 0 ,
(ii)" Xodff ~ t~(ix'*).
For the uniqueness and the physical realizability, we have the same situation as
before. Similar results hold for certain multidimensional filters (cf. [5]). It
should be noted that the above conditions, and hence the result, for weakly
harmonizable time series are not the same as those for the strongly harmonizable
time series. However, both agree when specialized to the stationary case.
X(t) = \int_R e^{it\lambda}\, dZ(\lambda) ,

r_{S,N}(u, v) = E(S(u)\, \overline{N(v)}) = 0, \qquad u, v \in R ,
then the best linear filter Ŝ in the mean square error sense, which is also called
an optimal filter, is given by the following formula: if

\frac{f_S(\lambda)}{f_S(\lambda) + f_N(\lambda)} = \lim_{M\to\infty} \sum_{k=1}^{M} c_k e^{i\alpha_k \lambda} ,

then we have

\hat S(t) = \lim_{M\to\infty} \sum_{k=1}^{M} c_k X(t + \alpha_k) .
In general, if the signal and the noise are not necessarily uncorrelated, but
are weakly stationarily correlated, i.e.

r_{S,N}(u - v) = \int_R e^{i(u-v)\lambda}\, f_{S,N}(\lambda)\, d\lambda ,
Note that in the above results, one has to assume that all the spectral functions
and the cross-spectral function are absolutely continuous. If this is not the case,
the results become more complicated.
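A discrete sketch of the optimal-filter gain f_S/(f_S + f_N): the signal is placed on a few known DFT bins, the noise is white, and shrinking each bin by its signal-to-total ratio reduces the mean square error. Spectra, seed and sizes are all illustrative, and the DFT is the naive O(n²) one.

```python
import cmath
import math
import random

# Wiener-type filtering in a discrete setting: signal on known low-frequency
# DFT bins, white noise, per-bin gain = signal power / total power.
random.seed(42)
n = 128
sig_bins = {1: 4.0, 2: 2.0, 3: 1.0}       # illustrative f_S on selected bins
noise_var = 1.0                            # flat f_N

S = [0.0] * n
for k, v in sig_bins.items():
    amp_c = random.gauss(0.0, math.sqrt(v))
    amp_s = random.gauss(0.0, math.sqrt(v))
    for t in range(n):
        S[t] += (amp_c * math.cos(2 * math.pi * k * t / n)
                 + amp_s * math.sin(2 * math.pi * k * t / n))
noise = [random.gauss(0.0, math.sqrt(noise_var)) for _ in range(n)]
X = [S[t] + noise[t] for t in range(n)]

def dft(x):
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(c):
    return [(sum(c[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n).real
            for t in range(n)]

Xf = dft(X)
gain = [0.0] * n
for k, v in sig_bins.items():
    # expected bin powers: |S_k|^2 ~ n^2*v/2, |N_k|^2 ~ n*noise_var
    g = (v * n / 2) / (v * n / 2 + noise_var)
    gain[k] = gain[n - k] = g
S_hat = idft([gain[f] * Xf[f] for f in range(n)])

mse_raw = sum((X[t] - S[t]) ** 2 for t in range(n)) / n
mse_filt = sum((S_hat[t] - S[t]) ** 2 for t in range(n)) / n
```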
When the series S, N and X are of Cramér class as defined in [8], which
contains the class of all strongly harmonizable time series, and when S and N are
uncorrelated, similar results were obtained in [27]. Without assuming that
S and N are uncorrelated, Kelsh [15] considered the same problem for
multidimensional Cramér class series, and got the corresponding result using
the technique essentially due to Rao [27]. For one-dimensional strongly harmonizable
series S, N and X, Kelsh's result can be stated as follows.
Let μ_S, μ_N : R × R → C be the spectral functions of S and N, μ_{S,N} : R × R → C
be the cross-spectral function, and let μ*_{S,N} : R × R → C be defined by
μ*_{S,N}(u, v) = μ_{S,N}(v, u), for u, v ∈ R. Then the optimal filter satisfies

E(Ŝ(t) X̄(s)) = ∫∫_{R×R} e^{itu − isv} d(μ_S + μ_{S,N})(u, v),   (15)
for all s ∈ R. In general, it is not easy to solve this system of integral equations
analytically. However, if the spectral functions μ_S, μ_N and μ_{S,N} are absolutely
continuous, expression (15) can be reduced to an explicit form as in (14).
X(t) = l.i.m._{N→∞} Σ_{n=−N}^{N} X(nh) sin[π(t − nh)/h] / [π(t − nh)/h],   t ∈ R,   (16)

where the convergence on the right side of (16) is in the sense of mean square.
This formula gives a periodic sampling theorem, where one observes the time
H a r m o n i z a b l e filtering a n d s a m p l i n g o f time series 331
series at the periodic points t = nh, −N ≤ n ≤ N, and the random variable X(t)
at any time t can be estimated by (16) by taking large enough N.
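Formula (16) is easy to try numerically. The sketch below reconstructs a value X(t) at a non-sample point from the samples X(nh); the test signal is an artificial finite combination of shifted sinc pulses (our own construction, not data from the text), so the truncated sum is exact up to rounding.

```python
import numpy as np

# Shannon-type reconstruction (16): X(t) ≈ sum_n X(nh) * sinc((t - nh)/h).
# np.sinc(x) = sin(pi*x)/(pi*x), matching sin[pi(t-nh)/h] / [pi(t-nh)/h].
h = 0.5
n = np.arange(-50, 51)

# Artificial band-limited test signal: a finite combination of shifted sinc
# pulses, so its samples at t = nh are 3 at n = 0 and -1.5 at n = 7.
def x(t):
    return 3.0 * np.sinc(t / h) - 1.5 * np.sinc(t / h - 7)

samples = x(n * h)

def reconstruct(t):
    return np.sum(samples * np.sinc((t - n * h) / h))

t0 = 1.234
print(reconstruct(t0), x(t0))   # the two values agree
```

For signals whose spectrum is merely supported in a bounded set, the truncation error at finite N behaves as in the Chang-Rao bound quoted below.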
If X is weakly stationary but its spectral function μ is not necessarily
supported by a compact set, then the following results of Lloyd [16] hold.
If μ has an open support S such that the sets {S + n/h}, n ∈ Z, are mutually
disjoint, i.e. no two values in S differ by an integer multiple of 1/h, then
where
The problem of sampling for strongly harmonizable time series has been
studied by Rao, Piranashvili and Pourahmadi (cf. [27], [24] and [25]). The
formula (16) was obtained by Piranashvili [24] for strongly harmonizable time
series whose spectral function has a bounded support in R × R. The formulas (17)
and (18) have been extended to strongly harmonizable time series by Rao [27]
and Pourahmadi [25]. Let X be a strongly harmonizable time series with
spectral function μ, and let ν be the function defined by ν(s) = μ((−∞, s] × R),
s ∈ R. If ν has an open support S such that the sets {S + n/h}, n ∈ Z, are
mutually disjoint, then (17) holds. If S is a finite union of intervals, or if
sup_{−∞<t<∞} |tK(t)| < ∞, then (18) holds.
In the case where X = {X(t), t ∈ R} is weakly harmonizable with spectral
function F, we have the following result due to Chang and Rao [6]. Given any
ε > 0, if there exists a bounded Borel set A = (A_ε) ⊂ R such that

∫∫_{B×B} |dF(u, v)| ≤ ε   for all B ⊂ A^c (B a Borel set),

and if σ_0 = diameter of A, then for any h < π/σ_0, one has an N (= N_{t,ε}) such
that

‖ X(t) − Σ_{n=−N}^{N} X(nh) sin[π(t − nh)/h] / [π(t − nh)/h] ‖ ≤ 2C(t)[(π − hσ_0)N]^{−1} + ε,
5. A numerical illustration
ε_n = Y(n) − Σ_{j=0}^{6} a_j Y(n − j).

Minimizing the sum Σ_{n=8}^{4000} ε_n² with respect to the a_j's, we obtain the values for the
filter coefficients a_0, …, a_6. Correct to three decimal places, these are as
follows:

a_0 = 0.852,   a_1 = −0.463,   a_2 = 0.227,

Y(n) = LX(n) = Σ_{j=0}^{6} a_j X(n − j),
Table 1
Since all these roots lie outside the unit circle, the filter L is physically
realizable. To compute the values for the sequence X, we need to expand the
rational function 1/P using the Taylor series method. With the coefficients
b_0, b_1, … thus determined, we can use the formula
Fig. 1a. Output Series Y.

Fig. 1b. Input Series X.
A set of two hundred values (from the same data records) of X and Y, correct
to two decimal places, is given in Table 1, and the graphs for both series X and
Y with these values are plotted in Fig. 1a,b for comparison.
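The Taylor series expansion of 1/P mentioned above is the standard power-series inversion recursion. A small sketch follows; only a_0, a_1, a_2 are quoted in the text (a_3, …, a_6 are not reproduced here), so a degree-2 truncation of P is used purely for illustration.

```python
# Invert P(z) = a0 + a1*z + ... + a_p*z^p as a power series B(z) = sum b_m z^m:
# from P(z)*B(z) = 1 the coefficients satisfy
#   b_0 = 1/a_0,   b_m = -(1/a_0) * sum_{j=1..min(m,p)} a_j * b_{m-j}.
a = [0.852, -0.463, 0.227]          # coefficients quoted above (truncated filter)

def invert_power_series(a, n_terms):
    b = [1.0 / a[0]]
    for m in range(1, n_terms):
        s = sum(a[j] * b[m - j] for j in range(1, min(m, len(a) - 1) + 1))
        b.append(-s / a[0])
    return b

b = invert_power_series(a, 30)

# Sanity check: the convolution of a and b must reproduce 1, 0, 0, ...
conv = [sum(a[j] * b[m - j] for j in range(len(a)) if m - j >= 0)
        for m in range(30)]
```

Since the roots of this truncated P lie outside the unit circle, the b_m decay and the inverse filter X(n) = Σ_j b_j Y(n − j) is physically realizable, as in the text.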
References
[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68, 337-404.
[2] Bhagavan, C. S. K. On non-stationary time series. This volume, Chapter 11.
[3] Bochner, S. (1956). Stationarity, boundedness, almost periodicity of random valued functions.
In: Proc. Third Berkeley Symp. Math. Statist. and Probability, Vol. 2, 7-27. University of
California Press, Berkeley, CA.
336 D. K. Chang
[4] Cauchy, A.-L. (1841). Mémoire sur diverses formules d'analyse. C. R. Acad. Sci. Paris 12,
283-298.
[5] Chang, D. K. (1983). Bimeasures, harmonizable processes and filtering. Ph.D. Dissertation.
University of California, Riverside, CA.
[6] Chang, D. K. and Rao, M. M. (1983). Bimeasures and sampling theorems for weakly
harmonizable processes. Stochastic Anal. & Appl. 1, 21-55.
[7] Clarkson, J. A. and Adams, C. R. (1933). On definitions of bounded variation of two
variables. Trans. Amer. Math. Soc. 35, 824-854.
[8] Cramér, H. (1951). A contribution to the theory of stochastic processes. In: Proc. Second
Berkeley Symp. Math. Statist. and Probability, 329-339. University of California, Berkeley,
CA.
[9] Diestel, J. and Uhl, J. J. Jr. (1977). Vector Measures, Mathematical Surveys No. 15. American
Mathematical Society, Providence, RI.
[10] Dunford, N. and Schwartz, J. T. (1958). Linear Operators, Part I: General Theory. Inter-
science, New York.
[11] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277.
[12] Hannan, E. J. (1967). The concept of a filter. Proc. Cambr. Phil. Soc. 63, 221-227.
[13] Helson, H. and Lowdenslager, D. (1958). Prediction theory and Fourier series in several
variables. Acta Math. 99, 165-202.
[14] Kallianpur, G. (1959). A problem in optimum filtering with finite data. Ann. Math. Statist. 30,
659-669.
[15] Kelsh, J. P. (1978). Linear analysis of harmonizable time series. Ph.D. Dissertation. University of
California, Riverside, CA.
[16] Lloyd, S. P. (1959). A sampling theorem for stationary (wide sense) stochastic processes. Trans.
Amer. Math. Soc. 92, 1-12.
[17] Loève, M. (1963). Probability Theory, 3rd ed. Van Nostrand, New York.
[18] Masani, P. (1965). The normality of time-invariant, subordinative operators in Hilbert space.
Bull. Amer. Math. Soc. 71, 546-550.
[19] Miamee, A. G. and Salehi, H. (1978). Harmonizability, V-boundedness, and stationary
dilations of stochastic processes. Indiana Univ. Math. J. 27, 37-50.
[20] Nagabhushanam, K. (1950). The primary process of a smoothing relation. Ark. Mat. 1,
421-488.
[21] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann.
Acad. Sci. Fenn. Ser. A I, 591, 1-47 (Helsinki).
[22] Parzen, E. (1962). Extraction and detection problems and reproducing kernel Hilbert spaces. J.
S I A M Control Ser. A 1, 35-62.
[23] Penrose, R. (1955). A generalized inverse for matrices. Proc. Cambr. Phil. Soc. 51, 406-413.
[24] Piranashvili, Z. A. (1967). On the problem of interpolation of stochastic processes. Theory
Prob. Appl. 12, 647-657.
[25] Pourahmadi, M. (1980). On subordination, sampling theorem and 'past and future' of
some classes of second-order processes. Ph.D. dissertation. Michigan State University.
[26] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351.
[27] Rao, M. M. (1967). Inference in stochastic processes, III. Zeit. Wahrsch. Verw. Gebiete 8,
49-72.
[28] Rozanov, Yu. A. (1959). Spectral theory of abstract functions. Theory Prob. Appl. 4,
271-287.
[29] Yaglom, A. M. (1961). Second order homogeneous random fields. In: Proc. Fourth Berkeley
Symp. Math. Statist. and Probability, Vol. 2, 593. University of California Press, Berkeley, CA.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 337-362

Sampling Designs for Time Series

Stamatis Cambanis
1. Introduction
In practice, a time series (or more generally a random field) is observed only
at a finite number of appropriately chosen points, which constitute the samp-
ling design, and based on these observations an estimate or a statistic is formed
for use in the problem at hand. How to select the most appropriate choice of
sampling points is the problem of sampling design. The statistician may be free
to choose any desirable points, or else part of the sampling mechanism may be
imposed on the statistician who then controls only certain parameters, e.g.
periodic sampling is imposed where the period is controlled by the statistician,
or Poisson sampling (at the times of occurrence of a Poisson stream of events)
is imposed but the statistician has control over its rate. With such constraints,
i.e. within certain classes of sampling designs, or with no constraints, i.e. among
all sampling designs, how can the statistician choose the best design of a given
sample size, or how can the statistician determine the sample size of a certain
kind of design required to achieve a given performance? These questions will
be considered in the context of three specific problems of interest involving
time series: the estimation of a weighted average of a random quantity, the
estimation of regression coefficients, and the detection of signals in noise.
These three problems have a great deal in common, and specifically their
sampling design questions are essentially the same.
The setup here differs in two important ways from the classical setup. All
observations are taken from a fixed (interval) region A and so, especially for large
sample sizes, it is not realistic to assume lack of correlation; hence, observations
form a correlated time series. Also repeated sampling at the same point is not
allowed, and only one realization of the time series is available; i.e. only simple
designs are considered in the terminology of Pázman (1977).
We consider both deterministic and random sampling designs, where either
*This research was supported under the Air Force Office of Scientific Research Grant No.
F49620-82-C-0009.
338 S. Cambanis
f'_T R_T^{−1} f_T   (1.3)

and of course RKHS(R_T) = R^n. In fact, f ∈ RKHS(R) if and only if the
supremum of (1.3) taken over all finite subsets T of A is finite, and the value of
that supremum is the RKHS norm of f. When R(s, t) = min(s, t) and A = [0, b],
the RKHS(R) consists of all functions f which vanish at zero and are absolutely
continuous with square integrable derivative: f(t) = ∫_0^t g(u) du, 0 ≤ t ≤ b, where
∫_0^b g²(u) du < ∞; the RKHS norm is given by ‖f‖² = ∫_0^b g²(u) du.
Similar simple expressions are available in the more general case of factorable
R, in the sense that R(s, t) = u(s)v(t) for s < t, and also when R is stationary
with rational spectral density (but with more complicated expressions).
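The finite-dimensional characterization above is easy to check numerically for the Wiener kernel. A minimal sketch, assuming f(t) = t²/2 (so g(t) = t and ‖f‖² = 1/3 on [0, 1]), showing the quadratic form f'_T R_T^{-1} f_T increasing toward ∫_0^b g²(u) du as the finite set T fills the interval:

```python
import numpy as np

# f'_T R_T^{-1} f_T for R(s,t) = min(s,t) on T = {t_1 < ... < t_n} in (0, b]
b = 1.0
f = lambda t: t ** 2 / 2         # f(t) = ∫_0^t g(u) du with g(u) = u
rkhs_norm_sq = b ** 3 / 3        # ∫_0^b g(u)**2 du = 1/3

def quad_form(n):
    T = np.linspace(b / n, b, n)           # finite subset of (0, b]
    R = np.minimum.outer(T, T)             # R_T with entries min(t_i, t_j)
    fT = f(T)
    return fT @ np.linalg.solve(R, fT)

vals = [quad_form(n) for n in (5, 50, 500)]
print(vals, rkhs_norm_sq)   # increasing, approaching 1/3
```

The monotone approach of the quadratic form to the RKHS norm is exactly the supremum characterization stated after (1.3).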
I = ∫_A X(t)φ(t) dt   (2.1)

I_T = Σ_{k=1}^{n} c_{T,k} X(t_k),   (2.2)
e_T² = E(I − I_T)² = s² − 2 c'_T f_T + c'_T R_T c_T.   (2.3)
If the random process X could be observed over the entire interval A, rather
than at n sample points, then of course the random integral I could be
evaluated precisely, i.e. e_A² = 0. Our goal therefore is to choose the sampling
points T and the estimator weights c r in such a way that the resulting mean
square estimation error e_T² should be as close to zero as possible.
Var β̂_A = 1/‖f‖² = 1/s²,   (2.5)
so that/3 can be estimated precisely if and only if the regression function f does
not belong to the reproducing kernel Hilbert space of R. In other words, when
the regression function is less smooth than the observation noise
(f ∉ RKHS(R)), then the regression coefficient can be determined precisely
from observations over the entire interval. When the regression function is of
comparable smoothness with the observation noise, f ∈ RKHS(R), then the
variance of β̂_A is positive. Here we consider the simpler case where the
regression function f is even slightly smoother, namely of the form given in
(1.1).
Our problem then is to estimate β linearly from observations at n sample
points T = {t_k}_{k=1}^{n} using appropriate weights c'_T = (c_{T,1}, …, c_{T,n}) by β̂_T = c'_T X_T,
and our goal is therefore to choose the sampling points T and the estimator
weights c_T in such a way that the resulting bias and variance are as close to 0
and to 1/s² as possible.
The noise process is assumed to be Gaussian with mean zero and covariance
function R (s, t).
If observations over the entire interval A were available, then the presence
of the signal could be detected with probability one if and only if it is less
smooth than the noise, in the sense that it does not belong to the reproducing
kernel Hilbert space of R. Here we consider the more interesting case where
the smoothness of the signal is comparable to that of the noise (f ∈ RKHS(R))
and in fact we consider the simpler case where the signal f is of the form (1.1).
In this case the sufficient statistic is S_A, which under H_1 is N(s², s²) and under
H_0 is N(0, s²), and the probability P_d of detection (i.e. of deciding H_1 when
H_1 is true) at false alarm rate α (i.e. when the probability of deciding H_1 when
H_0 is true is α) is
P_d(S_A) = ∫_{−∞}^{s − Φ^{−1}(1−α)} φ(u) du.   (2.10)

The test based on the sample T has detection probability P_d(S_T),
which is of course smaller than P_d(S_A), and our goal is therefore to choose the
sampling points T and the weights c_T in such a way that the resulting
probability of detection Pd(ST) should be as close as possible to Pd(SA), the
probability of detection of the optimal detector which uses observations over
the entire interval.
Another measure of performance of a test based on the statistic S_T is the
generalized signal-to-noise ratio defined by

SNR(S_A) = s²,

SNR(S_T) = (c'_T f_T)² / (c'_T R_T c_T),   (2.14)
so that
Thus when the noise is Gaussian, there is a very simple relationship between
signal-to-noise ratio and probability of detection. The signal-to-noise ratio is
used as performance criterion instead of the probability of detection whenever
the latter cannot be easily computed, e.g. when the noise is not Gaussian; or
when the noise is Gaussian but the test statistic is not, because random
sampling is used.
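In the Gaussian case the relationship just described takes the explicit form P_d = Φ(√SNR − Φ^{-1}(1−α)), with s = √SNR in the notation above. A small self-contained sketch (the inverse normal is computed by bisection purely for illustration):

```python
import math

# P_d = Phi(sqrt(SNR) - Phi^{-1}(1 - alpha)): the Gaussian link between
# probability of detection and signal-to-noise ratio (s = sqrt(SNR)).
def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    # simple bisection on the monotone function Phi; adequate here
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def p_detect(snr, alpha):
    return Phi(math.sqrt(snr) - Phi_inv(1.0 - alpha))

print(p_detect(0.0, 0.05))   # with no signal, P_d equals the false alarm rate
```

P_d increases monotonically with SNR, which is why the two performance criteria are interchangeable when the test statistic is Gaussian.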
such as when an observation can be taken only at a time a certain event occurs,
or ,when there are imperfections in deterministic sampling schemes, e.g.
periodic sampling with jitter.
(i.e. m_{ni} is the median of (t_{n+1,i}, t_{n+1,i+1}) with respect to h), called a sequence of
median designs generated by the density h, or simply median sampling.
mechanism chooses the points τ_{nk} + ε_{nk}, where the errors ε_{nk} are independent and
their normalizations nε_{nk} have density h.
In systematic sampling, the sampling point t_{n1} is chosen at random from
the stratum A_{n1} according to the density h_{n1}, and each t_{nk} (k = 2, …, n) is the
point in A_{nk} homologous to t_{n1} in A_{n1}. Thus any one of the n sampling points in
systematic sampling determines all the others.
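The designs just described can all be generated from the distribution function H of the density h. The sketch below is a minimal illustration on A = [0, 1] with h uniform; in particular, the median design is written in its quantile form t_{nk} = H^{-1}((2k − 1)/(2n)), which is an assumption on our part (a regular-sequence version of the recursive definition above).

```python
import random

# Sampling designs of size n on A = [0, 1] generated by a density h with
# distribution function H (here h uniform, so H^{-1}(u) = u, for clarity).
H_inv = lambda u: u

def median_design(n):
    # quantile (midpoint-in-h) form of the median design
    return [H_inv((2 * k - 1) / (2 * n)) for k in range(1, n + 1)]

def systematic_design(n, rng):
    # one random point in the first stratum fixes all the others
    u = rng.random() / n
    return [H_inv(u + k / n) for k in range(n)]

def stratified_design(n, rng):
    # one independent point per stratum
    return [H_inv((k + rng.random()) / n) for k in range(n)]

rng = random.Random(0)
print(median_design(4))       # [0.125, 0.375, 0.625, 0.875]
```

Note how a single draw determines the whole systematic design, while the stratified design uses n independent draws, one per stratum.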
4. The estimators
The natural choice for the coefficients c_T of the estimators I_T and β̂_T and of
the test statistic S_T is that which renders their performance optimal for the
given sample T of size n. We are also interested in considering simpler
coefficients which result in performance comparable to that obtained with
optimal coefficients, but which are substantially easier to compute.
with variance

Var β̂_T = (f'_T R_T^{−1} f_T)^{−1},

which, compared with the variance s^{−2} of the estimator β̂_A (cf. (2.5)), gives

Var β̂_T − s^{−2} = (s² − f'_T R_T^{−1} f_T) / (s² f'_T R_T^{−1} f_T).   (4.5)
Sampling designs for time series 345
so that
Comparing this with the probability of detection of the test based on the
sufficient statistic SA, we have
P_d(S_A) − P_d(S_T) = ∫_{(f'_T R_T^{−1} f_T)^{1/2} − Φ^{−1}(1−α)}^{s − Φ^{−1}(1−α)} φ(u) du.   (4.9)
where e_T² = s² − f'_T R_T^{−1} f_T. Thus P_d(S_{T_n}) → P_d(S_A) if and only if e_{T_n} → 0, and in
that case

[P_d(S_A) − P_d(S_{T_n})] / e_{T_n}² → (2s)^{−1} φ[s − Φ^{−1}(1 − α)].   (4.10)
for some functions c(t) or c_n(t), so that the coefficient c_{T_n,k} multiplying the
observation X(t_{nk}) depends only on the sample point t_{nk} and in fact via a
(global) function (which may or may not depend on the sample size). For such a
simplified choice of coefficients to be admissible, the functions c or {c_n} should
be chosen so that as the sample size increases the performance should tend to
that attained using observations over the entire interval. This turns out to be
possible for the various sampling designs introduced earlier.
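A Monte Carlo sketch of the simple-coefficient idea for the integral estimation problem: assuming coefficients of the form c(t_k) = φ(t_k)/(n h(t_k)) (our reading of the simple-coefficient consistency conditions), the design average of I_T equals ∫_A φ(t)X(t) dt for each fixed path. The weight function, path, and density below are illustrative choices, not from the text.

```python
import numpy as np

# Simple random sampling of n points with density h on A = [0, 1], and the
# estimator with simple coefficients c(t) = phi(t) / (n * h(t)):
#   I_T = (1/n) * sum_k phi(t_k) * x(t_k) / h(t_k),
# whose design expectation is ∫ phi(t) x(t) dt for any fixed path x.
rng = np.random.default_rng(0)
phi = lambda t: 1.0 + t                 # illustrative weight function
x = lambda t: np.sin(3.0 * t)           # one fixed realization, for illustration
h = lambda t: np.ones_like(t)           # uniform sampling density on [0, 1]

n = 200_000
t = rng.random(n)                       # simple random design drawn from h
I_hat = np.mean(phi(t) * x(t) / h(t))

# exact value of ∫_0^1 (1 + t) sin(3t) dt, for comparison
I_true = (1.0 - 2.0 * np.cos(3.0)) / 3.0 + np.sin(3.0) / 9.0
print(I_hat, I_true)
```

The Monte Carlo error here is O(n^{-1/2}), which is exactly why the deterministic and stratified designs discussed below, with their faster rates, are preferable.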
I_{sr,T_n} = I_{T_n},   (4.18)

e²_{sr,T_n} = n^{−1} ( ∫_A σ²(t)φ²(t) h̄_n^{−1}(t) dt − s² ) → 0,   (4.19)

where h̄_n is the averaged sampling density n^{−1} Σ_{k=1}^{n} h_{nk}, and then we have
e²_{st,T_n} = Σ_{k=1}^{n} { ∫_{A_{nk}} σ²(t)φ²(t) h_{nk}^{−1}(t) dt − ∫_{A_{nk}} ∫_{A_{nk}} R(s, t)φ(s)φ(t) ds dt }.   (4.22)
Estimation of regression coefficients
For a sequence {T_n} of median sampling designs generated by the density
h, if c(t) satisfies the consistency condition
c(t)h(t) = s^{−2}φ(t),   t ∈ A,   (4.23)
we have
S_{m,T_n} = I_{T_n},   (4.33)
and comparing it with the probability of detection of the optimal test based on
the entire interval we have
P_d(S_A) − P_d(S_{m,T_n}) = ∫_{m_n/s_n − Φ^{−1}(1−α)}^{s − Φ^{−1}(1−α)} φ(u) du,   (4.35)
where
δ_n² = SNR(S_A) − SNR(S_{m,T_n}) = s² − m_n²/s_n²
    = s_n^{−2} { s²(s_n² − s²) − (m_n − s²)(m_n + s²) } → 0.   (4.36)
S_{sr,T_n} = I_{T_n},   (4.37)
and in fact the distributions of S_{sr,T_n} under each alternative hypothesis converge
weakly to those of S_A:
SNR(S_{sr,T_n}) = s⁴ / (s² + e²_{sr,T_n}) → s²,   (4.40)
so that
In stratified sampling, if the functions c_n(t) are chosen from the consistency
condition (4.20), so that

then S_{st,T_n} have the desirable limiting properties (4.38) and (4.39), and again
concentrating on signal-to-noise ratios we have
SNR(S_{st,T_n}) = s⁴ / (s² + e²_{st,T_n}) → s²   (4.43)
and

s² − SNR(S_{st,T_n}) = s² e²_{st,T_n} / (s² + e²_{st,T_n}) → 0.   (4.44)
e²_{T'_n} / inf e²_T → 1,   Var β̂_{T'_n} / inf Var β̂_T → 1,   MSE β̂_{T'_n} / inf MSE β̂_T → 1,

P_d(S_{T'_n}) / sup P_d(S_T) → 1,   SNR(S_{T'_n}) / sup SNR(S_T) → 1,   (5.1)
where infimum and supremum are taken over all sampling designs of size n
in 𝒟_n.
it should be clear that for any random design there always exists a better
nonrandom design. Our main interest is therefore to find optimal or asymp-
totically optimal sampling designs within the class of all (deterministic) designs.
In the following we comment on the asymptotics of the performance of optimal
sampling designs, we show how in certain cases asymptotically optimal
sequences of designs can be found, and we consider the performance of
optimal fixed sample size simple random designs and of asymptotically optimal
stratified designs.
among all sampling designs 𝒟_n of size n: T = {t_1 < t_2 < ⋯ < t_n}, where P_T f
is the projection of f onto the subspace of the reproducing kernel Hilbert space of
R generated by {R(·, t), t ∈ T}. Since the maximization is over the open subset
of A^n determined by the inequalities t_1 < t_2 < ⋯ < t_n, an optimal sampling
design of size n does not necessarily exist. Such an optimal design exists when
R(s, t) = u(s)v(t) for s < t, including the Wiener and Gauss-Markov cases
min(s, t) and exp(−|s − t|), but its existence becomes a very delicate question when
R is differentiable (on the diagonal of A × A). Even when an optimal design
exists, it is usually difficult to determine it by carrying out the minimization (an
algorithm for certain special cases is developed in Eubank, Smith and Smith
(1982a)). A very special case where the minimization is easily carried out is when
φ ≡ 1 and X has stationary independent increments: the optimal design of sample
size n is given, for A = [0, 1], by t_{ni} = 2i/(2n + 1), i = 1, …, n, with corresponding
e²_{T_n} = (σ²/3)(2n + 1)^{−2}, where σ² = R(t, t)/t; this is derived in Samaniego (1976)
and in Cressie (1978). The optimal designs satisfy
When R is smooth, upper bounds can be found on e_T². Specifically, if R(s, t)
has a continuous (k, k) mixed partial derivative which is in addition smooth off
the diagonal of A × A, then

e²_{T_n} = O(n^{−2k−2}).   (5.6)
Thus the smoother R is, i.e. the more quadratic mean derivatives the centered
process X(t) has, the faster e²_{T_n} tends to 0.
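The Samaniego-Cressie design quoted above is easy to verify numerically. For the Wiener process on A = [0, 1] with φ ≡ 1 and σ² = 1, the optimal-coefficient error is e_T² = s² − f'_T R_T^{-1} f_T, with s² = ∫∫ min(s, t) ds dt = 1/3 and f(t) = ∫_0^1 min(t, u) du = t − t²/2 (a sketch under these assumptions, not code from the text):

```python
import numpy as np

# e_T^2 = s^2 - f_T' R_T^{-1} f_T for the Wiener process on A = [0, 1]
# (phi ≡ 1), evaluated at the design t_i = 2i/(2n+1).
def design_error(T):
    T = np.asarray(T, dtype=float)
    R = np.minimum.outer(T, T)              # R(s, t) = min(s, t)
    fT = T - T ** 2 / 2                     # f(t) = ∫_0^1 min(t, u) du
    return 1.0 / 3.0 - fT @ np.linalg.solve(R, fT)

n = 5
T_opt = [2 * i / (2 * n + 1) for i in range(1, n + 1)]
err = design_error(T_opt)
predicted = (1.0 / 3.0) * (2 * n + 1) ** -2
print(err, predicted)   # the two agree, matching Samaniego (1976)
```

Any other design of the same size, e.g. the equispaced design t_i = i/n, gives an error at least as large, as the optimality claim requires.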
We will not insist on the precise technical regularity conditions other than
giving the expression for the function α_k(t), assumed positive;
inf_{T∈𝒟_n} e²_T = inf_{T∈𝒟_n} ‖f − P_T f‖² has the same asymptotics (where 𝒟_n consists of
all sampling designs of size n).
The asymptotics of e²_{T'_n}, Var β̂_{T'_n} and P_d(S_{T'_n}) follow immediately from (4.2),
(4.5) and (4.10). Periodic sampling is covered by (5.7) by taking h the uniform
density over A, and its asymptotic performance is then easily compared with
that of the asymptotically optimal sequence of designs; they both have the
same rate but different asymptotic constants: C_{2k}|A|^{2k+2} ∫_A α_k(t)φ²(t) dt for periodic
sampling and as in the right-hand side of (5.9) for the asymptotically optimal
sampling design, and the ratio of the latter to the former can take (for different
φ's) any value in (0, 1]. Thus substantial improvement in asymptotic per-
formance may be achieved by sampling according to h*(t) rather than
periodically.
5.2. Simple coefficients and median sampling
For sequences of median sampling designs {T_n} generated by the density h
and using the simpler coefficients described in Section 4, it is clear from
expressions (4.15), (4.25), (4.26) and (4.34) that their asymptotic performance is
determined by the asymptotics of m_n → s² and of s_n² → s² of (4.16) and (4.17).
Here we describe the results for the case k = 0, i.e. the centered process X(t)
has no quadratic mean derivative, under regularity conditions similar to those
required for (5.7). Included are the cases where R is the covariance of the
Wiener process, the Gauss-Markov process, etc. The precise asymptotic
behavior is as follows:
n²(s_n² − s²) → (1/6) ∫_A α_0(t)φ²(t) h^{−2}(t) dt,   (5.10)

n²(m_n − s²) → (1/24) ∫_A α_0(t)φ²(t) h^{−2}(t) dt.   (5.11)
It then follows from (4.15), (4.25), (4.26), (4.36) and (4.35) that
(5.7) and (5.12) for the former case, and from (5.7), (5.16) and (4.35) for the
latter. Thus in these cases, median sampling design is both very simple (in
view of the very simple form of its coefficients) and asymptotically optimal. In
the regression problem the asymptotic constant in (5.14) is 50% larger than that
in (4.5) and (5.7), and thus median sampling is not asymptotically optimal; it
requires asymptotically about 22.5% more samples than the optimal sampling
design in order to achieve the same variance.
It is remarkable that median sampling design, utilizing such a simple
(nonparametric) form of estimator coefficients, is asymptotically optimal for
integral approximation and signal detection; for regression coefficient
estimation it has the same rate of convergence as the optimal sequence of designs
using (parametric) optimal coefficients, but with a larger asymptotic constant.
These results were obtained by Schoenfelder (1978) and complemented by
Cambanis and Masry (1983). Work in progress by Schoenfelder has extended
these results to k = 1, i.e. exactly one quadratic mean derivative for the
centered process X(t); and for k ≥ 2 it has produced rates of convergence
n^{−2k−2} (i.e. identical with those of the optimal sequence of designs using optimal
coefficients) by using, instead of the median of each interval (i.e. midpoint
sampling), k appropriate quantiles (i.e. quantile sampling).
e²_{sr,T_n} → 0,   e²_{st,T_n} → 0,

and all other quantities are determined by these via (4.27), (4.29), (4.37) and
(4.41).
n e²_{sr,T_n} → |A| ∫_A σ²(t)φ²(t) dt − s².   (5.20)
For each fixed partition {A_{nk}}_{k=1}^{n} of A, the sampling densities h_{nk}(t) which are
proportional to σ(t)|φ(t)| within each stratum A_{nk} minimize the right-hand side
of (4.22) term by term, producing a (partly optimal) stratified sampling design
with

e²_{st,T_n} = Σ_{k=1}^{n} { ( ∫_{A_{nk}} σ(t)|φ(t)| dt )² − ∫_{A_{nk}} ∫_{A_{nk}} R(s, t)φ(s)φ(t) ds dt }.
For regular sequences of partitions generated by a density h bounded away
from zero, we have

n² e²_{st,T_n} → (1/6) ( ∫_A [α_0(t)φ²(t)]^{1/3} dt )³.   (5.25)
Comparing (5.25) with (5.9), we see that the asymptotically optimal sequence of
stratified sampling designs has the same rate as the sequence of optimal
sampling designs using optimal coefficients, and the asymptotically optimal
sequence of median sampling designs using simple coefficients, but asymptotic
constant twice as large, thus requiring asymptotically 41.5% more samples for
the same performance.
When k ≥ 1, i.e. when the centered process X has one or more quadratic mean
derivatives, then

n³ e²_{st,T_n} → (1/12) ∫_A β(t)φ²(t) h^{−3}(t) dt,   (5.26)
Table 1. Exact rates of convergence.
sampling designs for small sample sizes, 1 < n < 20, is considered in Cambanis and
Masry (1983) for the signal detection problem, using both the asymptotically
optimal densities h* as well as a uniform density h, for R the covariance of the
Wiener process and of stationary first- and second-order Gauss-Markov processes:
the advantages of using h* versus a uniform h are discussed in special cases, and it is
seen that the asymptotically optimal median sampling design with simple
coefficients may outperform the asymptotically optimal sampling design with
optimal coefficients!
X(t) = Σ_{j=1}^{J} β_j f_j(t) + N(t),   t ∈ A,   (6.2)

where each regression function f_j is known and of the form (1.1), and we want
to estimate the unknown regression coefficients β = (β_1, …, β_J).
k = 0 and n^{−1−2/d} for k = 1 for stratified sampling designs, and n^{−2/d} for
systematic sampling designs, independent of the number of quadratic mean
derivatives; Schoenfelder (1982) also finds asymptotically optimal stratified
sampling designs with rectangular partitions. Ylvisaker (1975) shows that when
k = 0, product sampling designs using optimal coefficients have rate of con-
vergence n^{−2/d} (product sampling designs are obtained by sampling each
coordinate separately). The results of Ylvisaker (1975) and especially Wahba
(1978) indicate, however, that superposition of nonrandom product sampling
designs gives a better rate of convergence. A special design problem in the
plane is considered in Dalenius, Hájek and Zubrzycki (1961).
7. Related topics
7.1. Using derivatives
When the observation process has k (≥ 1) quadratic mean derivatives, and
in forming the estimators I_T and β̂_T and the statistic S_T one uses not only the
value of the observation process X(t) at each sampling instant but also its
derivatives X^{(j)}(t), j = 1, …, k, then naturally one can expect improved per-
formance. This situation is not of practical interest, as it requires measuring
derivatives at sampling points, but it provides bounds on performance and is of
independent theoretical interest. Sacks and Ylvisaker (1970a, 1970b) established
(5.7) in this case, and thus also (5.9) for the asymptotically optimal sequence of
designs, under the same regularity conditions but with a different constant C'_k
given by

C'_k = {(k + 1)!}² / [(2k + 2)!(2k + 3)!].
e_T(x, c_T) = | ∫_A x(t)φ(t) dt − Σ_{k=1}^{n} c_{T,k} x(t_k) | = | (f − Σ_{k=1}^{n} c_{T,k} R(·, t_k), x) |,

sup_{‖x‖≤1} e_{T*}(x, c_{T*}) = inf_T inf_{c_T} sup_{‖x‖≤1} e_T(x, c_T)  (= inf_T ‖f − P_T f‖).
The connection with the random integral approximation problem follows from
the relationship
Thus the best quadrature formula c'_{T*} x_{T*} in the sense of Sard for ∫_A x(t)φ(t) dt,
x ∈ RKHS(R), determines the optimal sampling design T* when optimal
coefficients c_T are used and the best quadratic mean approximation c'_{T*} X_{T*} of
∫_A X(t)φ(t) dt, and vice versa. Certain properties of the best quadrature
formula in the sense of Sard, and thus also of the optimal sampling design, are
established in Karlin (1976), and asymptotics are studied in Barrow and Smith
(1979).
References
Samaniego, F. J. (1976). The optimal sampling design for estimating the integral of a process with
stationary independent increments. IEEE Trans. Inform. Theory IT-22, 375-376.
Sard, A. (1963). Linear Approximation. American Mathematical Society, Providence, RI.
Schoenfelder, C. (1978). Random designs for estimating integrals of stochastic processes. Institute
of Statistics Mimeo Series No. 1201. University of North Carolina, Chapel Hill, NC.
Schoenfelder, C. (1982). Random designs for estimating integrals of stochastic processes: Asymp-
totics. Center for Stochastic Processes Tech. Rept. No. 6. University of North Carolina, Chapel
Hill, NC.
Schoenfelder, C. and Cambanis, S. (1982). Random designs for estimating integrals of stochastic
processes. Ann. Statist. 10, 526-538.
Tubilla, A. (1975). Error convergence rates for estimates of multi-dimensional integrals of random
functions. Tech. Rept. No. 72. Department of Statistics, Stanford University, Stanford, CA.
Wahba, G. (1971). On the regression design problem of Sacks and Ylvisaker. Ann. Math. Statist.
42, 1035-1053.
Wahba, G. (1974). Regression design for some equivalence classes of kernels. Ann. Statist. 2,
925-934.
Wahba, G. (1978). Interpolating surfaces: High order convergence rates and their associated
designs with applications to X-ray image reconstruction. Tech. Rept. No. 523. Department of
Statistics, University of Wisconsin, Madison.
Wittwer, G. (1976). Versuchsplanung im Sinne von Sacks-Ylvisaker für Vektorprozesse. Math.
Operationsforsch. Statist. 7, 95-105.
Wittwer, G. (1978). Über asymptotisch optimale Versuchsplanung im Sinne von Sacks-Ylvisaker.
Math. Operationsforsch. Statist. Set. Statist. 9, 61-71.
Ylvisaker, D. (1975). Designs on random fields. In: J. Srivastava, ed., A Survey of Statistical Design
and Linear Models, 593-607. North-Holland, Amsterdam.
Zubrzycki, S. (1958). Remarks on random, stratified, and systematic sampling in a plane. Colloq.
Math. 6, 251-264.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 363-387
Measuring Attenuation
M. A. Cameron and P. J. Thomson
1. Introduction
The analysis of the relationships between several time series may be per-
formed in a number of ways. In the time domain, vector models of ARMA
type may be fitted to the data and these give empirical descriptions of the
relationships between the series that can be used to generate forecasts. These
models may form the basis of further investigation of the data structure (see
Box and Tiao (1977) for an example). The alternative in the time domain is to
hypothesise a structure and to use that to formulate and fit an appropriate
model. A simple example of this is a time series regression (transfer function
model), but more generally the model will be a constrained vector ARMA
model.
In the frequency domain the methods developed in multivariate analysis for
the analysis of covariance matrices may be adapted to analyse complex
covariance matrices and applied to the estimated cross-spectral density matrices
at different frequencies. Brillinger (1975) gives examples of regression, prin-
cipal component and canonical variate analysis applied to the estimated cross-
spectral matrices. If the analysis is exploratory, then the estimated spectral
density for each of a number of frequencies across the frequency range of
interest is analysed separately and no attempt is made to combine, formally,
the information from different frequency bands. On the other hand, there may
be a parametric model for the dependence between the different series which is
formulated in either the time or the frequency domains. Then, if the analysis is
performed in the frequency domain, information from different frequency
bands should be combined formally, in either the time or the frequency
domain.
In this chapter, the problem of the estimation of the attenuation of a signal is
considered. Here a model for the observations may be written down in the time
domain, but the estimation of the time domain model may be performed in
either the time or the frequency domain. The model is a time series version of
a factor analysis model and so the methods and results are of wide relevance.
For example, Geweke and Singleton (1981) and Engle and Watson (1981)
363
364 M. A. Cameron and P. J. Thomson
describe the application of time series factor models to economic data and
Cameron (1983) uses the model for comparative calibration of time series
recorders. The methods are also applicable, for example, to a geophysical
problem described by Clay and Hinich (1981) and to the estimation of the
velocity and attenuation of a signal as it passes across an array of sensors. The
model will be developed and discussed in terms of the attenuation of a signal
across an array for definiteness.
In the next section, the model is described and some notation is introduced.
In Section 3, methods for estimating attenuation are given. These are for time
domain and for both narrow and broad band frequency domain estimation. In
Section 4 the methods are extended to the case where delays are present whilst
in Section 5 a discussion is given of how the methods could be used in practice
and the analysis of a simple set of data is described.
2. The model
Here the observed random variables Y_ij have mean μ_j, the common signal S_i has
mean zero and variance σ_s², S_i and e_ij are independent Gaussian random
variables and each e_ij has mean zero and variance σ_j². The α_j are attenuation
coefficients. However, for the case of the simple model (2.1), they are more
commonly known as factor loadings. This particular model has been extensively
used in many areas, in particular the social sciences. The properties of such
models together with associated estimation procedures have been extensively
discussed (see, for example, Joreskog, 1978). We wish to set this familiar
problem in the more general time series context of signal estimation where the
signal now becomes a stochastic process over time.
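To fix ideas, the simple one-factor model (2.1) can be simulated and its loadings recovered from the leading eigenvector of the sample covariance matrix. The following Python sketch is illustrative only; the numerical values, and the assumption that the noise standard deviation is known, are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 4
mu = 1.0                                  # common level
alpha = np.array([1.0, 0.8, 0.6, 0.4])    # attenuation coefficients / loadings
sigma_s, sigma_e = 2.0, 0.5               # signal and noise standard deviations

# Simulate Y_ij = mu + alpha_j * S_i + e_ij
S = rng.normal(0.0, sigma_s, size=n)
Y = mu + np.outer(S, alpha) + rng.normal(0.0, sigma_e, size=(n, p))

# Cov(Y) = sigma_s^2 * alpha alpha' + sigma_e^2 * I, so the leading
# eigenvector of the sample covariance is proportional to alpha.
w, V = np.linalg.eigh(np.cov(Y, rowvar=False))
a_hat = V[:, -1] * np.sqrt(w[-1] - sigma_e ** 2)  # rescale, taking sigma_e known
a_hat *= np.sign(a_hat[0])                        # sign convention
print(np.round(a_hat / sigma_s, 2))               # roughly recovers alpha
```

With σ_e unknown, the loadings and noise variances would instead be estimated jointly by the maximum likelihood factor analysis procedures cited above.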
T h e basic model considered is given by
will, in general, undergo modification prior to its arrival at any recorder and
that this modification will typically depend on the particular recorder
concerned. Suppose the spectral representation of S(t) is given by
where the complex valued process Z(ω) has zero mean and orthogonal
increments, i.e.

  a_j(ω) = ∫ a_j(τ) e^{-iωτ} dτ

is the transfer function of the filter with impulse response function a_j(τ). It is
evidently a_j(ω) that modifies the component of S(t) at frequency ω.
In practice, the data will normally be sampled at equidistant time intervals
either by virtue of the recording process adopted or as a consequence of the
fact that the data will invariably be analysed by a digital computer. Assuming
that the sampling interval has been chosen sufficiently small for aliasing effects
in either S(t) or the x_j(t) to be ignored, the continuous time model (2.2) is now
replaced by the discrete time process

  y_j(n) = Σ_k β_j(k) S(n − k) + x_j(n),   (2.5)

where time has now been rescaled so that the sampling interval represents one
time unit and the β_j(k) satisfy

  Σ_k β_j(k) e^{-ikω} = a_j(ω),   β_j(k) = ∫ a_j(τ) [sin{(k − τ)π} / {(k − τ)π}] dτ,
366 M. A . C a m e r o n a n d P. J. T h o m s o n
due to Shannon. However, (2.5) may be considered as a model in its own right
irrespective of whether or not it has been derived from some underlying
continuous time process. In any event, we are concerned with fitting models
such as (2.5) to data and, in particular, estimating the transfer functions a_j(ω).
From the latter, estimates of the β_j(k) and a_j(τ) can be derived, either
non-parametrically or via some suitably parametrised version of (2.5).
Consider the p-dimensional process y(n), n = 0, ±1, …, with jth component
y_j(n) given by (2.5). If γ_S(n) and γ_{x,j}(n) are the serial covariance functions of
S(n) and x_j(n) respectively, then the matrix of cross-covariance functions of
y(n) has typical element

where δ_jk is the Kronecker delta. Likewise, the spectral density matrix f(ω) of
y(n) has typical element
Apart from scaling the signal and noise to have the same amplitude, this
reformulation of f(ω) has the virtue that |ν_j(ω)|² admits a simple interpretation
as the signal-to-noise ratio at the jth recorder. The latter is an important
parameter in its own right.
In practice, it might be expected that the signal S(n) and the transfer
functions a_j(ω) would be relatively smooth with the noise processes x_j(t) having
reasonably flat spectra. This would mean that the ν_j(ω) would tend to be sizeable
in the lower frequencies. As a consequence, it would seem that these spectral
quantities should typically vary smoothly over frequency. Thus it might be
expected that non-parametric estimation techniques based on the Fourier
transforms of the data over non-overlapping narrow frequency bands would prove
Measuring attenuation 367
to be effective. Similarly, simple parametric forms for the ν_j(ω) and the f_{x,j}(ω)
should generally fit such data well.
The model (2.5) can be generalised in a number of ways. Of these, the most
obvious is to incorporate delays. Consider the situation where the signal is
received by an array of recorders. Because of the spatial configuration of the
array, the individual recorders will, at any instant of time, receive lagged or
delayed forms of the signal. In such circumstances, the model (2.5) becomes,
where the τ_j's are not integers in general. However, it may well be true that the
medium through which the signal is travelling is dispersive. This would mean
that the different frequency components that make up S(t) travel at different
speeds, resulting in frequency-dependent delays. This leads to a model of the
form
Methods of taking account of delays will be given later in this chapter. Other
generalisations concern the cases where trend is present, where the signal is
transient, where the observations and the signal are no longer scalar but are
vector time series, and where there is more than one signal. This last case
includes the situation where signal and noise are not incoherent.
3. Estimation
This section addresses the problems of fitting models such as (2.5) to data.
As in most model fitting, the procedure used is composed of three parts: (a)
exploratory data analysis and model selection, (b) parameter estimation and (c)
diagnostic model checking.
As models become more complicated and involve more parameters, the
model selection phase becomes increasingly important since there will be many
plausible models. In time series these models must be fitted by numerically
maximising some function or, equivalently, by solving a set of non-linear
equations. This becomes more difficult as the number of parameters increases,
particularly if good initial estimates of the parameters are not available or if the
model is a poor description of the data.
Unfortunately, the model selection itself becomes more complicated. For
example, model selection in ARMA models is generally based solely on the
autocorrelations and partial autocorrelations, whilst in transfer functions it is
either a multistage procedure (Box and Jenkins, 1976) or involves calculating
the impulse response function and noise autocorrelations from spectrum esti-
mates (Cameron, 1981; Pukkila, 1982). For models of the form of (2.5), the
simplest procedure is to calculate estimates of ν_j(ω) and f_{x,j}(ω) for a number of
frequencies and then to apply an inverse Fourier transform to these and choose
appropriate models on the basis of these derived quantities.
The estimation of ν_j(ω) and f_{x,j}(ω) is described in Subsection 3.1. Although
we have introduced these estimates as being part of the model selection
procedure, they are of interest in their own right in many applications,
especially if there are delays between the series (see Section 4).
Subsection 3.2 addresses the problem of fitting (2.5) and its various
parametric forms over all frequencies or possibly over some chosen band of
frequencies. For this estimation a parametric form is required for ν_j(ω) but not
for f_{x,j}(ω), which may be approximated by the narrow band estimates obtained
using the methods of Subsection 3.1. Finally, in Subsection 3.3, methods of
fitting (2.5) in the time domain are considered. Here all components of the
model must be parametrised.
The estimation procedures described in Subsections 3.1, 3.2 and 3.3 thus
form a natural sequence in fitting models such as (2.5). The exploratory phase
suggests the models to be fitted; the frequency domain procedure of Subsection
3.2 allows the transfer functions, or ν_j(ω)'s, to be modelled without a model also
being fitted to the noise processes. The time domain procedure allows a full,
exact maximum likelihood estimation of all parameters simultaneously. This
will be most useful if there are few observations (which may be the case with
economic data, for example). It will often be the case, however, that there are
sufficient data for frequency domain methods to be used and that it is only the
transfer function that is of interest, so that the time domain estimation
procedure will not be required.
  W(ω_k) = (2πN)^{-1/2} Σ_{n=1}^{N} y(n) e^{inω_k},   ω_k = 2πk/N,   0 ≤ k ≤ [½N].   (3.1)
Here [x] denotes the integral part of x. These quantities are important because
(2π/N)^{1/2} W(ω) is an approximation to the component of frequency ω in the
spectral representation of y(n) and, if N is highly composite, they are
extremely cheap (i.e. rapid) to compute. Moreover, under certain quite general
conditions, the W(ω_k) are asymptotically independently distributed, each with a
complex multivariate normal distribution with zero mean vector and covari-
ance matrix f(ω_k) (see Hannan, 1970). Now the situation we have in mind is
that where the ν_j(ω) and the f_{x,j}(ω) vary slowly with frequency ω. This means
that estimates of the ν_j(ω) and the f_{x,j}(ω) at any given frequency ω should be
able to be constructed from the W(ω_k) evaluated over those ω_k nearest to ω.
For the sake of clarity let us denote the chosen frequency of interest as λ and
suppose that our estimates of the ν_j(λ) and f_{x,j}(λ) will be based on the m values
of ω_k closest to λ. An obvious estimation technique in this situation is to use
the method of maximum likelihood where the likelihood is given by the
probability density function derived from the asymptotic distribution of the
W(ω_k) for the m values of ω_k nearest to λ.
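The transform ordinates (3.1) and the selection of the m nearest frequencies can be sketched in Python as follows. The function and variable names are ours, and a simulated white-noise series stands in for real data:

```python
import numpy as np

def dft_ordinates(y):
    """W(w_k) = (2*pi*N)^(-1/2) * sum_n y(n) e^{i n w_k}, w_k = 2*pi*k/N,
    computed for all k at once via the FFT (cf. (3.1))."""
    N = y.shape[0]
    return np.fft.ifft(y, axis=0) * N / np.sqrt(2.0 * np.pi * N)

def nearest_band(N, lam, m):
    """Indices of the m Fourier frequencies w_k, 0 <= k <= [N/2],
    closest to the frequency of interest lam."""
    k = np.arange(N // 2 + 1)
    return k[np.argsort(np.abs(2.0 * np.pi * k / N - lam))[:m]]

rng = np.random.default_rng(1)
y = rng.normal(size=(256, 3))        # three recorders, N = 256
W = dft_ordinates(y)
band = nearest_band(256, lam=1.0, m=9)
# Narrow-band cross-spectral estimate about lam: C = m^{-1} sum W(w_k) W(w_k)^*
C = sum(np.outer(W[k], W[k].conj()) for k in band) / len(band)
print(C.shape)                        # (3, 3) Hermitian matrix
```

The matrix C produced here is the narrow-band cross-spectral estimate on which the maximum likelihood fitting below operates.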
The relevant log-likelihood is proportional to

  −Σ_λ [log det f(ω_k) + tr{f^{-1}(ω_k) W(ω_k) W*(ω_k)}],   (3.2)

where det( ) and tr( ) denote the matrix operations of determinant and trace
respectively and Σ_λ is the sum over the m values of ω_k concerned. Moreover,
in keeping with the assumption that f(ω) is not varying to any degree over this
narrow band of frequencies, for the ω_k near λ, we set

  f(ω_k) = f_x^{1/2} [I + νν*] f_x^{1/2}.   (3.3)

Here I is the p × p identity matrix, the asterisk denotes transposition com-
bined with conjugation, the diagonal matrix f_x has typical diagonal element
f_{x,j}(λ) and the p-dimensional vector ν has typical element ν_j(λ). Note that,
given f(ω) as specified by (3.3), we can only know the |ν_j(λ)| and the phase
differences ψ_j(λ) − ψ_k(λ), where |ν_j(λ)| and ψ_j(λ) are the modulus and argument
respectively of ν_j(λ). Additional information is necessary in order to identify
the individual ψ_j(λ). We shall assume that ψ_1(λ) is zero. Thus the first recorder
is chosen as the recorder relative to which the phase differences ψ_j(λ) will be
measured.
Now, with f(ω) given by (3.3), maximising (3.2) is equivalent to minimising

  Σ_{j=1}^{p} log f_{x,j} + log(1 + ν*ν) + Σ_{j=1}^{p} Ĉ_jj/f_{x,j} − (1 + ν*ν)^{-1} ν* f_x^{-1/2} Ĉ f_x^{-1/2} ν,   (3.4)

where Ĉ = m^{-1} Σ_λ W(ω_k) W*(ω_k), and

  f̂_{x,j} = (1 + |ν̂_j|²)^{-1} Ĉ_jj,   j = 1, …, p.   (3.5)
370 M. A . Cameron and P. J. Thomson
In the special case when the noise spectra at the different recorders are
assumed equal, these equations may be solved explicitly, yielding

  (3.7)

and

  f̂_{x,j}(λ) = (1 + ν̂²_Rj + ν̂²_Ij)^{-1} Ĉ_jj(λ),   (3.8)

where ν̂_Rj(λ) and ν̂_Ij(λ) denote the real and imaginary parts of ν̂_j(λ).
Equations (3.7) and (3.8) can now be solved using methods similar to con-
ventional factor analysis procedures. Clearly, if good a priori estimates of the
f_{x,j}(λ) were known, then (3.4) and (3.2) state that ν̂ is proportional to the
eigenvector of f̂_x^{-1/2} f̂(λ) f̂_x^{-1/2} associated with the largest eigenvalue. The constant
of proportionality is the square root of the difference between the eigenvalue
and unity. This observation and the simple form of the relationship (3.5)
suggest that one might solve (3.4) and (3.5) numerically by first fixing the f_{x,j}
and determining ν̂, then adjusting the f_{x,j} using (3.7) and recomputing ν̂,
iterating until convergence. Unfortunately, this algorithm is frequently very
slow to converge since it fails to take into account the covariation between
small changes in the ν_j and the f_{x,j}. This problem has been discussed by
Joreskog (1967) and Lawley (1967). It is thus preferable to maximise the
log-likelihood directly or, equivalently, to devise a Newton-Raphson algorithm
based on (3.4) and (3.5), which takes into account the covariation and the fact
that (3.2) is being maximised. Experience in standard factor analysis suggests
that algorithms based on an optimisation technique of Fletcher and Powell
(1963) converge reliably. This is discussed in greater detail by Joreskog (1978)
and by Geweke and Singleton (1981).
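A minimal sketch of the simple alternating scheme just described (fix the f_{x,j}, take ν̂ from the leading eigenpair, update the f_{x,j}, and iterate) might look as follows in Python. As the text notes, this fixed-point iteration can be slow; the function names, iteration count and starting values are our own choices:

```python
import numpy as np

def narrowband_fit(C, n_iter=500):
    """Alternating estimation of nu and the f_xj from a narrow-band
    cross-spectral matrix C (p x p, Hermitian), under the model
    f = fx^{1/2} (I + nu nu*) fx^{1/2}: fix fx, take nu from the leading
    eigenpair of fx^{-1/2} C fx^{-1/2}, update fx, and repeat.  A sketch
    of the slow fixed-point scheme discussed in the text, not a Newton
    or Fletcher-Powell method."""
    fx = np.real(np.diag(C)) / 2.0               # crude starting values
    for _ in range(n_iter):
        M = C / np.sqrt(np.outer(fx, fx))        # fx^{-1/2} C fx^{-1/2}
        w, V = np.linalg.eigh(M)
        nu = np.sqrt(max(w[-1] - 1.0, 0.0)) * V[:, -1]
        nu *= np.exp(-1j * np.angle(nu[0]))      # phase fixed: arg nu_1 = 0
        fx = np.real(np.diag(C)) / (1.0 + np.abs(nu) ** 2)
    return nu, fx

# Check on an exactly model-consistent matrix:
nu0 = np.array([2.0, 1.5 * np.exp(0.3j), 1.0 * np.exp(-0.5j)])
fx0 = np.array([1.0, 2.0, 0.5])
C0 = np.sqrt(np.outer(fx0, fx0)) * (np.eye(3) + np.outer(nu0, nu0.conj()))
nu_hat, fx_hat = narrowband_fit(C0)
print(np.round(np.abs(nu_hat), 2), np.round(fx_hat, 2))
```

On exact model values the iteration recovers the moduli |ν_j| and the noise spectra; on real data C would be the narrow-band estimate Ĉ(λ) and a Newton-type method would be preferred.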
Under relatively mild regularity conditions (see Thomson, 1982), it can be
shown that the resulting estimators of ν and the f_{x,j}(λ) are strongly consistent.
Moreover, suppose the vector α_0 is defined as

where |ν_j(λ)| and ψ_j(λ) are as before the modulus and argument respectively of
ν_j(λ). If α̂ is defined as the corresponding vector estimator of α_0 obtained from
(3.4) and (3.5), then for m and N large enough m^{1/2}(α̂ − α_0) has an asymptotic
multivariate normal distribution with zero mean vector and covariance matrix
Γ. Here Γ is partitioned into blocks J1, J2, J12 and J3, the typical elements
J1_jk, J2_jk and J12_jk of which are functions only of the |ν_j(λ)|, ν*ν and the
f_{x,j}(λ).
with respect to the parameter θ. Except for a scaling factor, C_N(θ) is just the
sum of terms of the form (3.2) calculated over a number of non-overlapping
narrow bands. Differentiating C_N(θ) with respect to θ leads to the estimating
equations

  A^{-1} + A^{-1} K A^{-1},
  (4π)^{-1} ∫ tr{f^{-1}(λ) [∂f(λ)/∂θ_j] f^{-1}(λ) [∂f(λ)/∂θ_k]} dλ.

Here Φ(j)_ab(λ) is the (a, b) element of the matrix f^{-1}(λ)[∂f(λ)/∂θ_j]f^{-1}(λ) and
F_abcd(λ, −λ, −λ′, λ′) is the Fourier transform of the fourth cumulant function of
y(n). The integrand in A simplifies to

  tr{[I + νν*]^{-1} [∂(νν*)/∂θ_j] [I + νν*]^{-1} [∂(νν*)/∂θ_k]}

and

  Φ(j) = f_x^{-1/2} [I + νν*]^{-1} [∂(νν*)/∂θ_j] [I + νν*]^{-1} f_x^{-1/2}.
3.3. Time domain estimation
If parameters are to be estimated in the time domain, then not only must the
attenuation be appropriately parametrised but so also must the signal and
noise processes. In addition, constraints must be introduced in order that the
parameters be identifiable.
Suppose that the signal and noise processes are assumed to be ARMA
processes and that the transfer function a_j(ω) is a ratio of trigonometric
polynomials. Then the observed series y_j(n), j = 1, …, p, form a vector
ARMA model from a particular parametric family and so for given values of
the unknown parameters the exact likelihood can be calculated (see, for
example, Nicholls and Hall, 1979). An algorithm such as that of Fletcher and
Powell may then be used to find the parameter values maximising this likeli-
hood. Difficulties arise, however, in choosing the correct models and lags for
the signal and noise processes and the attenuation and in obtaining good initial
estimates of the unknown parameters. Thus in all but the simplest cases the
estimation methods described in Subsections 3.1 and 3.2 should be used before
attempting a full time domain model.
A different approach to calculating the likelihood in the time domain, which
is in some ways more direct, arises from noting that (2.5) is similar in form to the
observation equation in a state-space model, except that the noise here may be
coloured. Since the noise processes may also be written in state-space form, the
model (2.5) may be written in standard state-space form by augmenting the
state to include not only the signal but also the state of the coloured noise
processes. Once the model has been written in state-space form, the Kalman
filter can be used to compute the innovations recursively and a Gaussian
likelihood may be evaluated. This likelihood may be maximised to obtain
estimates of the unknown parameters.
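As an illustration of the state-space route, the following Python sketch evaluates the Gaussian likelihood by the Kalman filter for a deliberately simplified version of (2.5): a scalar AR(1) signal observed at p recorders in white (rather than coloured) noise, so no state augmentation is needed. All names and parameter values are assumptions made for the illustration:

```python
import numpy as np

def kalman_loglik(y, phi, alpha, q, r):
    """Gaussian log-likelihood, via Kalman-filter innovations, for the
    simplified model: S(n) = phi*S(n-1) + w(n), Var w(n) = q, observed as
    y_j(n) = alpha_j*S(n) + e_j(n) with white noise of variance r.
    (The coloured-noise case in the text would augment the state.)"""
    N, p = y.shape
    s, P = 0.0, q / (1.0 - phi ** 2)           # stationary prior for S(0)
    ll = 0.0
    for n in range(N):
        e = y[n] - alpha * s                   # innovation vector
        F = P * np.outer(alpha, alpha) + r * np.eye(p)
        Fi = np.linalg.inv(F)
        ll -= 0.5 * (p * np.log(2.0 * np.pi)
                     + np.log(np.linalg.det(F)) + e @ Fi @ e)
        K = P * (alpha @ Fi)                   # Kalman gain
        s, P = s + K @ e, (1.0 - K @ alpha) * P    # measurement update
        s, P = phi * s, phi ** 2 * P + q           # time update
    return ll

# Simulate from the model and compare likelihoods at true and wrong phi.
rng = np.random.default_rng(2)
N, p, phi, q, r = 300, 3, 0.8, 1.0, 0.25
alpha = np.array([1.0, 0.7, 0.4])
S = np.empty(N)
s = rng.normal(0.0, np.sqrt(q / (1.0 - phi ** 2)))
for n in range(N):
    s = phi * s + rng.normal(0.0, np.sqrt(q))
    S[n] = s
y = S[:, None] * alpha + rng.normal(0.0, np.sqrt(r), (N, p))
print(kalman_loglik(y, phi, alpha, q, r) > kalman_loglik(y, 0.0, alpha, q, r))
```

In practice this likelihood would be passed to a numerical optimiser over (phi, alpha, q, r), exactly as described in the text.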
Engle and Watson (1981) use this procedure to estimate parameters in an
example where they have 25 observations of each of 5 series. In their case, the
signal follows a second-order autoregression and each of the noises is a
first-order autoregression. The impulse response of the transfer function in-
volves only the present value of the signal so that the a_j(ω) are taken as
constant. The state is a vector of length seven, the components at time n being
Once the models for the various components have been chosen and the
Kalman filter appropriately parametrised, the likelihood is easily evaluated.
Again the likelihood must be maximised numerically and Engle and Watson
suggest using an algorithm based on the method of scoring. Aasnaes and
Kailath (1973) provide a direct recursion for state-space estimation in coloured
noise without going through the augmentation step.
As mentioned earlier, a difficulty with time domain estimation lies in the
need to choose the correct orders of the models for all components.
We now consider the situation where the model incorporates delayed as well
as attenuated forms of the signal. If the delay is frequency dependent, then a
simple time domain model cannot be given and estimation of the delay must be
performed in the frequency domain. If the delay is not frequency dependent,
then it may be approximated by a time domain model (Chan, Riley and Plant,
1980). However, a simpler procedure is to estimate delay in the frequency
domain, realign the series and then, if time domain models are required, use
the methods of Subsection 3.3 to estimate the remaining parameters in the
model. Only frequency domain methods are considered in this section.
The model is as given by (2.9) and the spectral density matrix of the observed
process, expressed in terms of the ν_j(ω), f_{x,j}(ω) and the τ_j(ω), now has typical
element

  f_jk(ω) = (ν_j(ω) ν̄_k(ω) + δ_jk) f_{x,j}^{1/2}(ω) f_{x,k}^{1/2}(ω) e^{−i(τ_j(ω) − τ_k(ω))ω}.   (4.1)
The effect of delay, if unaccounted for, is to bias the estimators. Consider,
for example, the case where, over a narrow band of frequencies centred at
frequency λ, the ν_j(ω), f_{x,j}(ω) and τ_j(ω) are approximately constant. Then,
keeping the bandwidth fixed but allowing N to increase, the estimator of f(λ)
given by

  f̂(λ) = m^{-1} Σ_λ W(ω_k) W*(ω_k)   (4.2)

Here the bandwidth is 2δ. Now, if the τ_j(λ) − τ_k(λ) are large, then (4.2) shows
Here f_x and ν are as defined below (3.3) and A(ω; τ) is a diagonal matrix with
typical element exp{iτ_j(λ)(ω − λ)}. As before we are requiring that ν_j(ω), f_{x,j}(ω)
and τ_j(ω) be effectively constant over the narrow band of frequencies and we
have chosen to describe the phase of ν_k(ω) near λ as

Thus ψ_j(λ), the argument of ν_j(λ), now incorporates τ_j(λ)λ, the phase at
frequency λ due to delays. To identify all the parameters we shall again
measure all phase differences relative to the first recorder and so

  ψ_1(λ) = τ_1(λ) = 0.
Newton-Raphson procedure with due account being taken of the fact that the
required solutions must maximise (3.2). However, it can be shown that,
appropriately normalised, the estimators τ̂_j are asymptotically independent of ν̂
and the f̂_{x,j}(λ). This suggests that the following two-stage iterative scheme
might have computational advantages. First, select initial estimates of the τ_j(λ)
and solve (4.4) and (4.5) using factor analysis techniques as before. Then, using
the resulting estimates of ν̂ and the f̂_{x,j}, solve (4.6). Note that, given ν and f_x,
(4.6) is equivalent to maximising
  ν̂* f̂_x^{-1/2} f̂(λ; τ) f̂_x^{-1/2} ν̂ = m^{-1} Σ_λ |ν̂* f̂_x^{-1/2} A*(ω_k; τ) W(ω_k)|²,

where

  f̂(λ; τ) = m^{-1} Σ_λ A*(ω_k; τ) W(ω_k) W*(ω_k) A(ω_k; τ).
With the new estimates of τ̂_j(λ), repeat the first step and so on. Schemes such
as this, based on first solving (4.4) and (4.5) and then (4.6), will be asymptotically
equivalent to a full Newton-Raphson procedure. However, as in any numerical
maximisation procedure, it is important to get good first estimates, especially of
the delays τ_j(λ).
If we had once-and-for-all estimates of the delays τ_j(λ), we would then apply
the standard estimation procedure described in Subsection 3.1. Such a pro-
cedure would have computational advantages over the direct solution of
(4.4)-(4.6). In Hannan and Thomson (1973) a non-parametric estimate of the
(group) delay in the case of two recorders is given. A simple generalisation of
their technique leads us to estimate the τ_j(λ) by the values of τ_j(λ) that minimise
det f̂(λ; τ). Indeed, this criterion follows from maximising (3.2) with f(ω)
replaced by A(ω)f(λ)A*(ω) and taking the f_jk(λ) and the τ_j(λ) as the
parameters. To compare this estimator with that obtained by maximising the
original likelihood involving ν(λ), f_x(λ) and τ(λ), consider maximising the
likelihood over ν and the f_{x,j} for any given τ. The resulting maximised value of
the likelihood is
  B_jk = (2π²/3) |ν_j|² (ν*ν − |ν_j|²) / (1 + ν*ν),   j = k,

  B_jk = −(2π²/3) |ν_j|² |ν_k|² / (1 + ν*ν),   j ≠ k.   (4.9)
For further details concerning the proof of these results, see Thomson (1982).
Note that the chosen criterion of minimising det f̂(λ; τ) has the following
interpretation. Let

and

i.e. the generalised variance of the ξ_j(n + l_j). Roughly speaking, ξ(n) is the
output of a band-pass filter acting on y(n) which passes only the band of
frequencies in question. Then the ξ_j(n) are lagged in the obvious way and the
lagging is optimised by minimising a (generalised) variance.
Finally, having determined estimates of the delays τ_j(λ), we now give
corrected estimates of the attenuation coefficients |ν̂_j(λ)| exp[i{ψ̂_j(λ) −
ν_j(λ), the signal-to-noise ratio |ν(λ)| and the noise spectrum f_x(λ). Each
component may be estimated within each band or else can be modelled by a finite
parameter model across a wide band of frequencies. The precise method to be
used depends on which components are to be modelled by a finite number of
parameters and which are to be estimated band by band. Of course, when all are
estimated band by band, the problem reverts to that considered in the previous
section. Here we consider only the case where the phase effects are modelled
across broad bands, but the other parameters remain specific to each narrow band.
Now, in contrast to the assumption in Subsection 4.1, it is assumed that the
variation of {τ_j(ω) − τ_k(ω)}ω over any narrow band of frequencies is small
compared with the variation across some given broad band of frequencies B. In
the broad band case it is the variation between narrow bands that is to be
modelled, whereas in Subsection 4.1 it was the within band variation that was
modelled. It will be supposed, in this section, that the phase differences depend
on a vector, φ, of unknown parameters. The unknown parameters may be
simply the relative delays of the signal at the different recorders, or, if the
signals have been dispersed, they may be coefficients of polynomials or splines
used to model the frequency dependence of the delays.
As in Subsection 3.2 we consider B to be the union of non-overlapping
narrow bands B_u of m frequencies about central frequencies λ_u. For any ω in
the band B_u, we model the spectral density of the process as

where f_x(λ_u) is a diagonal matrix with typical diagonal element f_{x,j}(λ_u), ρ(λ_u) is
a vector with typical element |ν_j(λ_u)| and A(λ_u; φ) is a diagonal matrix with
typical diagonal element exp{i g_j(λ_u; φ)}.
Here g_j(λ_u; φ) = ψ_j(λ_u; φ) + τ_j(λ_u; φ)λ_u, where ψ_j(λ_u; φ) and τ_j(λ_u; φ) are
appropriate parametric forms for ψ_j(λ) and τ_j(λ) that model the phase variation
over B.
If the f_x(λ_u) and ρ(λ_u) were known, then maximising the approximate
log-likelihood (or equivalently minimising (3.9)) with respect to φ reduces to
minimising

where the frequency λ_u, on which each of the quantities depends, has been
omitted for simplicity. Note that (4.14) can also be written as
  −(1/m) (1 + ν*ν)^{-1} Σ_{B_u} |ρ* f_x^{-1/2} A*(ω_k; φ) W(ω_k)|².   (4.15)

  {1 + ν*ν}^{-1} |ρ_j| |ρ_k| (f_{x,j} f_{x,k})^{-1/2} |f̂_jk|.   (4.16)
In similar vein, if ρ̂ has typical element |ν̂_j|, (4.15) yields the criterion

The properties of the estimators that minimise these criteria are the same as
those of the estimator which minimises (4.14).
Note that the methods of Subsection 3.2 could be used to estimate strongly
consistent, 'smooth' values of ν̂, and thus of the weight function in (4.15). To
completely match the procedure of Hannan and Thomson a finite parameter
model should also be fitted to the noise processes. In this case the number of
parameters increases only as p and the parameters are easily interpreted. The
cost is extra computation. At this stage no comparison of the finite sample
properties of this procedure with the others has been performed.
Some of the practical problems that can arise in fitting models of the type
described earlier are discussed in this section. These questions fall into two
categories: (i) assuming the model is correct, how should the parameters be
estimated? and (ii) what can be done to check that the model fitted is
appropriate, and what is the effect of a poor model on the parameter esti-
mates?
The questions in the first category are of most importance when frequency
domain methods are to be used, and cover problems such as whether the data
should be prefiltered, what frequency bands should be used and how wide
these should be. Of necessity the discussion of such matters is more nebulous
than that presented in earlier sections, where the data are always stationary and
the observed records are assumed to be long. However, when sample sizes are
finite and spectra are changing rapidly, poor data analysis may result if the
practical problems are not treated thoughtfully.
The main problems that arise when estimating parameters using frequency
domain methods are the result of biases in the spectrum estimates caused by
rapid changes in power or phase across a narrow band of frequencies. This was
mentioned at the beginning of Section 4 where it was shown that if there is a
large delay between recorders, then the modulus of the estimated ν_j's may be
biased downwards. The delay may also be less efficiently estimated in such
circumstances. See, for example, the simulation results in Hannan and Thom-
son (1981). These biases may be substantially reduced by realigning the series
(possibly in a frequency dependent way) to reduce the phase changes, by
prefiltering the observations so that their spectra are approximately flat, and by
making the individual frequency bands narrower. Decreasing the width of the
individual frequency bands increases the variance of spectral quantities within
the band. However, if effective realignment and prewhitening are performed,
then the width of the bands will not be critical. Unless data are delayed by an
integral number of time units, the simplest way to delay a series by τ(λ) time
units is to compute

That is, x(n) is Fourier transformed, the coefficient at frequency ω_k is multi-
plied by exp{−iτ(ω_k)ω_k} to introduce the delay and then an inverse Fourier
transform is applied. If τ is an integer constant, then this moves the first τ
382 M. A. Cameron and P. J. Thomson
observations from the beginning of the series to the end. This may not always
be an appropriate transformation, so an improved procedure would be to mean
correct the series, add zeroes to the ends and multiply by an appropriate
taper before performing the transform described above.
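The frequency-domain shifting operation just described might be sketched as follows (Python/NumPy; the function name is ours, and an integer shift is used to make the circular behaviour visible):

```python
import numpy as np

def shift_series(x, tau):
    """Shift the real series x by tau time units (tau need not be an
    integer): transform with e^{+i n w} as in (3.1), multiply the
    coefficient at frequency w_k by exp(-i*tau*w_k), and invert.  For
    integer tau this moves the first tau observations to the end, as
    noted in the text."""
    N = len(x)
    w = 2.0 * np.pi * np.fft.fftfreq(N)        # signed frequencies w_k
    X = np.fft.ifft(x) * N                     # sum_n x(n) e^{+i n w_k}
    return np.real(np.fft.fft(X * np.exp(-1j * tau * w)) / N)

x = np.arange(8.0)
print(np.round(shift_series(x, 3), 6) + 0.0)   # → [3. 4. 5. 6. 7. 0. 1. 2.]
```

A non-integer tau gives band-limited interpolation between the sample points, which is what makes the fractional realignment of series possible.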
So that appropriate prewhitening and realigning can be performed, explora-
tory analyses of the data will be required. These preliminary analyses should
also be used to suggest good parametrisations of the different components in
the model and to suggest initial parameter estimates to be used in the iterative
maximisation of the likelihood. A general procedure for estimating attenuation
and delay should thus include the following steps:
(i) Estimate the spectrum of each observed series separately, using a few
different bandwidths and also an autoregressive estimator. Choose a filter
which will, approximately, whiten each of the observed series.
(ii) From the whitened series, estimate the phase difference functions,
compute approximate delays between the observed series and use these to
realign the observations, if necessary.
(iii) Compute the coherence functions for the prewhitened and realigned
series. This should show the frequency range in which the signal common to
the different recorders is discernible over the background noise.
(iv) For narrow bands over the frequency range of interest, perform the
narrow band analysis described in Subsection 3.1 or, if significant delays
between the observed series remain, perform the analysis in Subsection 4.1.
This analysis should suggest parametric forms for the delay and attenuation.
Initial estimates of these parameters as well as estimates of the spectra of the
noise processes should be obtained at this stage.
(v) A broad band analysis of the form of those described in Subsections 3.2
and 4.2 may now be performed.
(vi) If a time domain estimation is to be performed, the data should be
corrected for any remaining delays estimated in (v). Also noise models may be
chosen and preliminary parameter estimates obtained from the estimates of the
noise spectra obtained in (iv).
(vii) The final estimates should be modified to account for any preliminary
whitening or realignment that has been performed.
The steps outlined above may need to be performed more than once. For
example, if there is a large delay between two series, then, except at very low
frequencies, the phase difference between the two will change rapidly and so a
reasonable estimate of the delay between the two series will be difficult to
obtain initially. However, an approximate realignment will make the next
estimate much better.
It remains to consider whether the model fitted is a good description of the
data. Difficulties may arise early in the estimation procedure if an inap-
propriate model is chosen. For example, if the noise at different recorders is
correlated (often called signal associated noise), then narrow band analysis will
yield noise spectrum estimates which are biased down and may be negative
Measuring attenuation 383
unless the estimation is constrained. If, for a particular band, the noise
spectrum is biased downwards, then the weight attached to that band will be
inflated in a broad band analysis leading to biased estimates of the attenuation
and delay. Signal associated noise is not uncommon, particularly in geophysical
data where the same signal may follow more than one path. If one path is
substantially longer than the other, then the amplitude of the signal received
after travelling along the longer path will often be diminished so as to be of the
same magnitude as background noise. When signal associated noise is detected,
the only solution is to model it as a second signal. When there is more than one
signal, the basic ideas for parameter estimation are the same as those described
here, but greater problems of identifiability arise. This will not be discussed
here. To check if there is more than one signal present, a principal component
analysis of the cross-spectral matrix for each frequency band of interest could
be performed (see Brillinger, 1975). Alternatively, narrow band factor analysis
models could be fitted and the number of factors chosen using an AIC-type
criterion.
To illustrate the methods of this chapter, we consider a simple example of
just two anemometers from which instantaneous readings of wind speed were
obtained several times per second. A plot of the spectrum of the data shows a
rapid decrease in power with frequency. Autoregressions were fitted to the two
series and it was found that both series were adequately described by simple
third-order autoregressions, the coefficients of the two series being very similar.
The two series were prewhitened using a filter obtained using the coefficients of
one of the autoregressions. This substantially reduced the variation of the
spectra with frequency, though the observations were still not white noise. The
two filtered series were strongly coherent at low frequencies, though the
coherence was reduced at high frequencies (see Fig. 1). There was no evidence
of delay between the two series.
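The prewhitening step can be sketched in code; the AR(3) coefficients, series length and flatness measure below are invented for illustration and are not the anemometer data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8192
# Simulate an AR(3) series whose power falls off rapidly with frequency
# (hypothetical coefficients, not those fitted in the chapter).
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(3, n):
    x[t] = 1.2 * x[t - 1] - 0.5 * x[t - 2] + 0.1 * x[t - 3] + e[t]

# Fit a third-order autoregression by least squares.
X = np.column_stack([x[2:-1], x[1:-2], x[:-3]])
phi_hat = np.linalg.lstsq(X, x[3:], rcond=None)[0]

# Prewhiten: pass the series through the fitted prediction-error filter.
resid = x[3:] - X @ phi_hat

def flatness(z):
    """Ratio of low-band to high-band mean periodogram power."""
    p = np.abs(np.fft.rfft(z)) ** 2
    k = len(p) // 8
    return p[1:k].mean() / p[-k:].mean()
```

Applying `flatness` to `x` and `resid` shows the effect reported in the text: the filtered series has a far less variable spectrum, even though it need not be exactly white.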
As mentioned earlier, when estimating the attenuation between just two
observed series, some further assumption must be made about the noise. The
simplest and most natural assumption is that the two noise series have the same
spectrum, and that assumption was made here.
The phase difference between the two series varied unsystematically about
zero, and so it was assumed that the ratio a₂(ω)/a₁(ω) [= ν₂(ω)/ν₁(ω)] was real
and further estimation was based solely on the real part of the cross-spectrum
between the two observed series.
Narrow band estimates of the ratio a₂(ω)/a₁(ω) were calculated and these are
plotted in Fig. 2. Because we are dealing with real quantities, confidence
intervals may be calculated for these estimates using the method of Creasy
(1956). The approximate 95% intervals calculated using this method are plotted
in Fig. 2.
It can be seen that the estimates are approximately constant and generally
greater than one. There is one band at which the estimate is much greater than
those at other frequencies. However, the signal-to-noise ratio is much lower at
384 M. A. Cameron and P. J. Thomson
Fig. 2. Narrow band estimates of ν₂(ω)/ν₁(ω) with approximate 95% confidence intervals.
Fig. 3. Narrow band estimates of |ν₁(ω)|, the signal-to-noise ratio for the first anemometer.
Fig. 4. The estimated noise spectrum, plotted against frequency (0.0 to 0.5).
the high frequencies, as can be seen from the widths of the confidence intervals
in Fig. 2 and from Fig. 3, where |ν̂₁(ω)| is plotted. The estimate of the spectrum
of the noise is plotted in Fig. 4.
The above analysis suggests that the ratio ν₂(ω)/ν₁(ω) is constant and, under
this assumption, the ratio was estimated using a wide frequency band estima-
tor. The estimated ratio was found to be 1.10 with an approximate standard
error of 0.02.
The estimation was repeated without prefiltering. The estimate of
ν₂(ω)/ν₁(ω) was essentially the same, but the estimated noise spectrum was
dominated by the power at low frequencies common to both signal and noise.
At this point, time domain models could be chosen for ν₁(ω) and f(ω) by
examination of their narrow band estimates and an overall time domain
estimation performed using the methods of Engle and Watson. This could
improve the estimate of ν₂(ω)/ν₁(ω) if there were fewer observations, but
because there are a moderate number of observations and because the spectra
contain no sharp peaks, there would appear to be no benefit in this case. Note
that the noise models would probably be misspecified in the first phase of a
time domain modelling unless the data were prewhitened.
References
Hannan, E. J. (1983). Signal estimation. In: P. R. Krishnaiah, ed., Time Series Analysis in the
Frequency Domain. North-Holland, Amsterdam.
Hannan, E. J. and Thomson, P. J. (1973). Estimating group delay. Biometrika 60, 241-253.
Hannan, E. J. and Thomson, P. J. (1981). Delay estimation and the estimation of coherence and phase.
IEEE Trans. Acoust. Speech and Signal Process. ASSP-29, 485-490.
Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing
observations. Technometrics 22, 389-395.
Joreskog, K. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32,
443--482.
Joreskog, K. (1978). Structural analysis of covariance and correlation matrices. Psychometrika 43,
443-447.
Lawley, D. N. (1967). Some new results in maximum likelihood factor analysis. Proc. Roy. Soc.
Edinburgh Ser. A 67, 256-264.
Nicholls, D. F. and Hall, A. D. (1979). The exact likelihood function of multivariate autoregressive-
moving average models. Biometrika 66, 259-264.
Pukkila, T. (1982). On the identification of transfer function noise models with several correlated
inputs. Scand. J. Statist. 9, 139-146.
Taniguchi, M. (1982). On estimation of the integrals of the fourth order cumulant spectral density.
Biometrika 69, 117-122.
Thomson, P. J. (1982). Signal estimation using an array of recorders. Stochastic Process. Appl. 13,
201-214.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 389-412

Speech recognition using LPC distance measures

P. J. Thomson and P. de Souza
1. Introduction
speech parameters can be estimated and in the speed with which they can be
computed. As it happens, estimation of the linear prediction coefficients
reduces to a set of linear equations which are blessed with mathematical
properties that allow extremely efficient solution [6].
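For the Toeplitz (autocorrelation-method) form of these normal equations, the classical fast solver is the Levinson-Durbin recursion, which works in O(p²) operations rather than O(p³). The sketch below illustrates that structure in general terms; it is background, not the specific algorithm referenced in [6].

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the order-p Toeplitz normal equations of linear prediction.

    r : autocorrelations r[0..p].  Returns (a, err) where a is the
    prediction-error filter (a[0] = 1) and err is the residual power.
    """
    r = np.asarray(r, float)
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for k in range(1, p + 1):
        acc = r[k] + a[1:k] @ r[k - 1:0:-1]   # inner product with past coefficients
        ref = -acc / err                       # reflection (PARCOR) coefficient
        a[1:k + 1] += ref * a[k - 1::-1][:k]   # order-update of the filter
        err *= 1.0 - ref * ref                 # updated residual power
    return a, err

# For r_k proportional to 0.5**k (an AR(1) with coefficient 0.5) the
# order-2 prediction-error filter should be (1, -0.5, 0), residual 0.75*r[0].
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The reflection coefficients computed along the way are themselves a useful alternative parameterisation in speech work.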
Additionally, LPC has the further advantage that the asymptotic distribution
of the linear prediction coefficients is known, tractable, and appears to provide
a workable approximation to the distribution of the coefficients as obtained
over finite length intervals typical of those used in speech recognition [9, 10].
The importance of knowing the approximate distribution of the coefficients is
that it provides the means of developing appropriate distance measures instead
of relying on empirically derived measures which would otherwise be the case.
Despite these advantages, Fant's model is known to be imperfect [6], and the
speech signal is not truly stationary as the LPC model assumes. Therefore, LPC
does not provide a perfect representation of speech and is not necessarily the
best parameter set for speech recognition purposes. Other parameters such as
the discrete Fourier transform spectrum are in use and good results have been
reported [11].
Once a parameter set has been selected, the isolated word recogniser
sketched earlier needs an appropriate distance measure to determine which of
the reference templates is closest to the feature matrix of an unidentified word.
Ideally, the distance measure should have the property that it maximises the
probability of selecting a template of the correct word. Since an unidentified
word will usually be of a different duration than its corresponding template(s),
it is necessary to perform some kind of time alignment between the uni-
dentified and the reference patterns. Thus, computing the distance between an
unidentified word and a reference template involves both time alignment and
the accumulation of some distance function after alignment.
It is well established that careful time alignment leads to a significant
reduction in the recognition error rate of isolated words, particularly when the
vocabulary contains polysyllabic words [12]. In practice this means that it is
inadequate to perform simple linear time alignment in which one of the
patterns is stretched or compressed linearly to match the other. Better results
are obtained by performing dynamic time warping in which the optimal
alignment is taken as being that one which minimises the total accumulated
distance between the unidentified and reference patterns. In dynamic time
warping the accumulated distance is taken to be the sum of the local distances
between the temporally aligned parameter vectors of the reference and uni-
dentified words. Efficient recursive procedures to find the required alignment
have been devised and a useful discussion on the subject is given by Myers et
al. [13].
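The dynamic-programming recursion behind dynamic time warping can be sketched as below. This is a minimal unconstrained version; practical recognisers add the slope constraints and adjustment windows discussed by Myers et al. [13].

```python
import numpy as np

def dtw_distance(a, b):
    """Total accumulated distance under the optimal monotone alignment
    of two sequences of parameter vectors (frames x features arrays)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = local + min(D[i - 1, j],            # insertion
                                  D[i, j - 1],            # deletion
                                  D[i - 1, j - 1])        # match
    return D[n, m]

# A pattern with a repeated frame still aligns perfectly with its original,
# which simple linear stretching would not achieve exactly.
ref = np.array([[0.0], [1.0], [2.0]])
test = np.array([[0.0], [1.0], [1.0], [2.0]])
```

Here the local distance is Euclidean purely for illustration; the following sections replace it with statistically motivated LPC distance measures.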
To complete the definition of the distance between an unidentified word and
a reference word, it is necessary to specify how the local distance between two
parameter vectors will be measured. In the case of parameter vectors which
have an unknown or intractable distribution, it is difficult to define an optimal
distance; it may be necessary, instead, to resort to measures such as the
392 P. J. Thomson and P. de Souza
\[ \frac{1}{N'}\sum_{t=p+1}^{N}\Bigl(\sum_{i=0}^{p}\alpha_i x_{t-i}\Bigr)^2 = a^{\mathrm T}Ca , \qquad (2.2) \]

where

\[ c_{ij} = \frac{1}{N'}\sum_{t=p+1}^{N} x_{t-i}x_{t-j} , \qquad i,j = 0, 1, \ldots, p , \qquad (2.3) \]

with N' = N − p and a = (1, αᵀ)ᵀ. Minimising (2.2) with respect to α yields

\[ \hat\sigma^2 = \frac{1}{N'}\sum_{t=p+1}^{N}\Bigl(\sum_{i=0}^{p}\hat\alpha_i x_{t-i}\Bigr)^2 = \hat a^{\mathrm T}C\hat a . \qquad (2.5) \]

\[ s^2 = \frac{1}{N'-p}\sum_{t=p+1}^{N}\Bigl(\sum_{i=0}^{p}\hat\alpha_i x_{t-i}\Bigr)^2 = \frac{N'}{N'-p}\,\hat\sigma^2 . \qquad (2.7) \]

This estimator takes into account the p degrees of freedom lost by using the
estimated values of α_i in (2.2). Following the argument given in Fuller [26, p.
338], s² should prove to be a less biased estimator of σ² in small samples. Now,
since σ̂² and s² are asymptotically equivalent, we can define an alternative
asymptotically unbiased estimator of Σ as

\[ S = s^2 D^{-1} . \qquad (2.8) \]

\[ (\sigma\sqrt{2\pi})^{-N'} \exp\Bigl\{ -\frac{N'}{2\sigma^2}\, a^{\mathrm T}Ca \Bigr\} . \qquad (2.9) \]
Maximising (2.9) with respect to α and σ² shows that α̂ and σ̂² are the
maximum likelihood estimates of α and σ².
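These conditional estimates can be coded directly from the definitions in (2.2)-(2.5); the simulated AR(2) record used to exercise the sketch below is invented for illustration.

```python
import numpy as np

def lpc_conditional(x, p):
    """Conditional least-squares LPC estimates following (2.2)-(2.5):
    build c_ij = (1/N') sum_{t=p+1}^N x_{t-i} x_{t-j}, solve for
    alpha_hat, and return the minimised residual variance."""
    x = np.asarray(x, float)
    N = len(x)
    Np = N - p                                         # N' = N - p
    L = np.column_stack([x[p - i:N - i] for i in range(p + 1)])
    C = L.T @ L / Np                                   # (p+1) x (p+1) matrix of c_ij
    alpha_hat = -np.linalg.solve(C[1:, 1:], C[1:, 0])  # minimises (2.2)
    a_hat = np.concatenate(([1.0], alpha_hat))
    return alpha_hat, a_hat @ C @ a_hat                # sigma2_hat of (2.5)

# Exercise on a simulated AR(2): x_t = 0.6 x_{t-1} - 0.2 x_{t-2} + e_t,
# for which alpha = (-0.6, 0.2) and the innovation variance is 1.
rng = np.random.default_rng(3)
n = 5000
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + e[t]
alpha_hat, sigma2_hat = lpc_conditional(x, 2)
```

Note the sign convention: with a = (1, αᵀ)ᵀ the autoregressive coefficients enter with opposite sign, so the fitted α̂ is approximately (−0.6, 0.2) here.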
Now the form of (2.9) as a function of α and σ is precisely analogous to that
of the Gaussian linear regression model in conventional statistics. (See Seber
[27] for example.) Indeed, if y = (y₁, …, y_n)ᵀ follows a Gaussian linear regres-
sion model, then y has a multivariate Gaussian distribution with mean Xβ and
covariance matrix σ²I. Here the columns of X contain the regressors, β
contains the regression coefficients and I is the identity matrix. Hence the
likelihood for the linear regression model is

\[ (\sigma\sqrt{2\pi})^{-n} \exp\Bigl[ -\frac{1}{2\sigma^2}(y - X\beta)^{\mathrm T}(y - X\beta) \Bigr] . \]

Replacing the βⱼ by −αⱼ, σ_y by σ, n by N', y by (x_{p+1}, …, x_N)ᵀ and the
columns of X by (x_{p−j+1}, …, x_{N−j})ᵀ, j = 1, …, p, yields (2.9). Note that this
establishes the correspondence between the likelihoods only and not the
respective distributions. The distribution of x_{p+1}, …, x_N conditioned on
Speech recognition using LPC distance measures 395
x₁, …, x_p does not follow the Gaussian linear regression model. The parallel
between the likelihoods proves to be useful in the development that follows.
The unconditional likelihood of x₁, …, x_N is (2.9) multiplied by

\[ (\sigma\sqrt{2\pi})^{-p} (\det \Lambda)^{-1/2} \exp\Bigl\{ -\frac{1}{2\sigma^2}\, x_p^{\mathrm T}\Lambda^{-1}x_p \Bigr\} . \qquad (2.10) \]

\[ \tilde c_{ij} = \frac{1}{N}\sum_{t=1}^{N-i-j} x_{t+i}x_{t+j} , \qquad i,j = 0, 1, \ldots, p . \qquad (2.11) \]

\[ \tilde\alpha = -\tilde D^{-1}\tilde d , \qquad (2.12) \]

where D̃ and d̃ are defined in terms of C̃ in exactly the same way that D and
d were defined in terms of C. Moreover, the estimate of σ² is

\[ \tilde\sigma^2 = \tilde a^{\mathrm T}\tilde C\tilde a , \qquad (2.13) \]

where ã = (1, α̃ᵀ)ᵀ. Paralleling (2.7) we also define the least squares estimator
of σ² as

\[ \tilde s^2 = \frac{N}{N-p}\,\tilde\sigma^2 . \qquad (2.14) \]

\[ \bar\alpha = -\bar D^{-1}\bar d , \qquad (2.15) \]

where, once again, D̄ and d̄ are defined in terms of C̄ in exactly the same
way that D and d were defined in terms of C. Here C̄ has typical element
the lag sample autocovariance computed from all N observations, and

\[ \bar\sigma^2 = \bar a^{\mathrm T}\bar C\bar a , \qquad \bar s^2 = \frac{N}{N-p}\,\bar\sigma^2 . \qquad (2.16) \]
main, to the conditional likelihood estimates α̂ and σ̂². This is not a
restriction. Since the estimates can all be obtained by maximising an
appropriate version of the likelihood (2.9), the three estimates of α and σ²
may be used interchangeably in the statistics and distance measures that
follow.
In this section we use the formal theory of statistical hypothesis testing and,
in particular, the likelihood ratio method to generate appropriate measures of
the distance between sets of LPC coefficients. Much of the material that
follows is drawn from [10].
Consider a sequence of observations x_T of length N_T corresponding to a
stretch of voiced input that is to be coded. Assume that x_T is generated by a
linear predictive process of order p with LPC coefficients α_T and innovation
variance σ_T². Moreover, we shall assume, for the moment, that x_T has a
Gaussian distribution so that the likelihood of x_T, conditioned on the first p
observations, is given by (2.9) with a = a_T = (1, α_Tᵀ)ᵀ and σ = σ_T.
In the simplest situation we might conceivably wish to test the hypothesis
H: α_T = α_R, where α_R is some known fixed reference vector. Alternatively we
may not know α_R, but know instead only a reference sequence of observations
x_R of length N_R. In such cases we shall assume that x_R is generated in-
dependently of x_T by a linear predictive process of order p with LPC
coefficients α_R and innovation variance σ_R². This again leads to consideration of
the test of the hypothesis H: α_T = α_R, where now both α_T and α_R are unknown
and must be estimated from the data.
In this section we discuss the likelihood ratio tests of the hypothesis H: α_R =
α_T in the various situations alluded to above. From these tests relevant
distance measures are constructed. It should be noted in passing that these tests
and distance measures are also of interest in their own right since they are
applicable to problems in fields other than speech recognition. They can be
seen as building on the work of Quenouille [35].
\[ \zeta(\hat\alpha_T, \alpha_R) = \frac{N'_T(\hat\alpha_T - \alpha_R)^{\mathrm T} D_T(\hat\alpha_T - \alpha_R)}{\hat\sigma_T^2} , \qquad (3.1) \]
where α̂_T, σ̂_T², C_T and D_T are obtained from x_T using (2.3)-(2.5). Writing
â_T = (1, α̂_Tᵀ)ᵀ, a_R = (1, α_Rᵀ)ᵀ, we note that (3.1) can also be written as

\[ \zeta(\hat\alpha_T, a_R) = N'_T\Bigl[ \frac{a_R^{\mathrm T} C_T a_R}{\hat a_T^{\mathrm T} C_T \hat a_T} - 1 \Bigr] , \qquad (3.2) \]

which is cheaper to compute than the form given by (3.1). Asymptotically ζ has
a χ_p² distribution when H: α_T = α_R is true.
Note that the logarithm of the likelihood ratio is proportional to

\[ I(\hat\alpha_T, a_R) = \log\Bigl[ \frac{a_R^{\mathrm T} C_T a_R}{\hat a_T^{\mathrm T} C_T \hat a_T} \Bigr] = \log\bigl(1 + \zeta(\hat\alpha_T, a_R)/N'_T\bigr) . \qquad (3.3) \]
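In code, (3.3) is a one-liner once the covariance matrix C_T and the two filters are to hand; the toy frame below is invented for illustration.

```python
import numpy as np

def itakura_distance(a_R, C_T, a_T_hat):
    """Itakura distance (3.3): log of the prediction-error ratio when the
    reference filter a_R (leading coefficient 1) is applied to the test
    frame's covariances C_T instead of that frame's own minimiser."""
    return np.log((a_R @ C_T @ a_R) / (a_T_hat @ C_T @ a_T_hat))

# Toy test frame: build C_T from a short random record and minimise (2.2).
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
p = 4
L = np.column_stack([x[p - i:len(x) - i] for i in range(p + 1)])
C_T = L.T @ L / (len(x) - p)
a_T_hat = np.concatenate(([1.0], -np.linalg.solve(C_T[1:, 1:], C_T[1:, 0])))
a_R = np.concatenate(([1.0], 0.1 * rng.standard_normal(p)))
```

Because â_T minimises aᵀC_T a over filters with leading coefficient 1, the distance is zero at a_R = â_T and non-negative elsewhere, as the log of a prediction-error ratio should be.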
[9, 10, 36] for example.) Because of co-articulation, there can be marked
differences between different realisations of the same speech segment. In this
context it is interesting to note that between 25% and 50% of words in normal
conversation are unintelligible when taken out of context and heard in isola-
tion. In practice this implies that α_R is frequently not a fixed reference vector,
but an estimated LPC vector with its own inherent variability. In these
circumstances ζ, ℓ and Itakura's distance I are inappropriate. More appropriate
measures are given in the remaining part of Section 3.
\[ (\sigma\sqrt{2\pi})^{-(N'_R+N'_T)} \exp\Bigl\{ -\frac{1}{2\sigma^2}\bigl( N'_R\, a_R^{\mathrm T} C_R a_R + N'_T\, a_T^{\mathrm T} C_T a_T \bigr) \Bigr\} , \qquad (3.6) \]

\[ F(\hat\alpha_T, \hat\alpha_R) = \frac{[(N'_R + N'_T)\hat\sigma_P^2 - N'_R\hat\sigma_R^2 - N'_T\hat\sigma_T^2]/p}{[N'_R\hat\sigma_R^2 + N'_T\hat\sigma_T^2]/(N'_R + N'_T - 2p)} , \qquad (3.7) \]

\[ \hat\sigma_P^2 = \hat a_P^{\mathrm T} C_P \hat a_P , \qquad (3.8) \]
where

\[ s_P^2 = (N'_R\hat\sigma_R^2 + N'_T\hat\sigma_T^2)/(N'_R + N'_T - 2p) \qquad (3.11) \]

estimates σ². Thus pF is just the squared distance between α̂_R and α̂_T
standardised by an estimate of the covariance matrix of α̂_R − α̂_T. Note that, as
a distance measure, F possesses the desirable property that it is symmetric in
α̂_R and α̂_T.
If the null hypothesis H: α_T = α_R is true, then pF will have an asymptotic χ_p²
distribution. This follows from the form of (3.10) and the stochastic properties
of α̂_R, α̂_T, σ̂_R² and σ̂_T². This result will hold under quite mild regularity
conditions concerning the nature of the processes generating x_R and x_T. It is
not restricted to Gaussian x_R and x_T. However, if x_R and x_T are Gaussian,
(N'_R σ̂_R² + N'_T σ̂_T²)/σ² is asymptotically equivalent to a χ²_{N'_R+N'_T−2p} random variable.
Since α̂_R and α̂_T are asymptotically independent of σ̂_R² and σ̂_T², it is evident that
F is, in the case of Gaussian x_R and x_T, asymptotically equivalent to an
F_{p, N'_R+N'_T−2p} random variable. Because the F distribution yields a more con-
servative testing procedure, we shall take the F_{p, N'_R+N'_T−2p} distribution as the
approximate distribution of F under H even when x_R and x_T are not Gaussian.
When N'_T = N'_R = N', then

\[ F(\hat\alpha_T, \hat\alpha_R) = \frac{2(N'-p)}{p}\Bigl[ \frac{2\hat\sigma_P^2}{\hat\sigma_R^2 + \hat\sigma_T^2} - 1 \Bigr] . \qquad (3.12) \]
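As a numerical cross-check, a direct coding of (3.7) agrees with the equal-length simplification in (3.12); the sample values below are arbitrary.

```python
def f_statistic(sig2_P, sig2_R, sig2_T, NpR, NpT, p):
    """F statistic (3.7); sig2_P is the pooled estimate (3.8),
    NpR and NpT denote the effective lengths N'_R and N'_T."""
    num = ((NpR + NpT) * sig2_P - NpR * sig2_R - NpT * sig2_T) / p
    den = (NpR * sig2_R + NpT * sig2_T) / (NpR + NpT - 2 * p)
    return num / den

# With equal record lengths, (3.7) collapses to the form given in (3.12).
Np, p = 100, 10
sP2, sR2, sT2 = 1.2, 1.0, 0.9
F_general = f_statistic(sP2, sR2, sT2, Np, Np, p)
F_equal = 2 * (Np - p) / p * (2 * sP2 / (sR2 + sT2) - 1)
```

Since the pooled fit can never predict better than the two separate fits, the numerator, and hence F, is non-negative.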
If σ_R² and σ_T² were known a priori, then it is easily shown that the likelihood
ratio test statistic for testing H: α_T = α_R is a monotonic function of

\[ (\hat\alpha_R - \hat\alpha_T)^{\mathrm T} \bigl\{ \sigma_R^2 D_R^{-1}/N'_R + \sigma_T^2 D_T^{-1}/N'_T \bigr\}^{-1} (\hat\alpha_R - \hat\alpha_T) . \qquad (3.14) \]

This has an asymptotic χ_p² distribution under H. Since σ_R², σ_T² are unknown, the
natural test statistic, or distance measure, to use is (3.14) with σ_R² and σ_T²
replaced by their estimates σ̂_R² and σ̂_T², or s_R² and s_T². (See (2.5) and (2.7).) This
yields the test statistic
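A sketch of the resulting Wald-type statistic in code; the identity matrices and round numbers in the check are toy inputs, not speech data.

```python
import numpy as np

def chi2_statistic(aR, aT, D_R, D_T, s2_R, s2_T, NpR, NpT):
    """(3.14) with the unknown innovation variances replaced by their
    estimates; asymptotically chi-squared with p degrees of freedom
    under H: alpha_R = alpha_T."""
    diff = np.asarray(aR, float) - np.asarray(aT, float)
    cov = s2_R * np.linalg.inv(D_R) / NpR + s2_T * np.linalg.inv(D_T) / NpT
    return float(diff @ np.linalg.solve(cov, diff))

# Toy check: with identity D matrices the statistic reduces to a scaled
# squared Euclidean distance between the coefficient vectors.
stat = chi2_statistic([0.5, -0.1], [0.4, 0.1],
                      np.eye(2), np.eye(2), 1.0, 1.0, 100, 100)
```

With D_R = D_T = I, unit variances and N' = 100 on each arm, the weighting matrix is (0.02)I, so the statistic is fifty times the squared coefficient difference.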
When H is true, the numerator of (3.17) has an asymptotic χ_p² distribution and
is asymptotically independent of the denominator. In addition, using Sat-
terthwaite's approximation [40], the denominator is asymptotically equivalent,
under H, to a χ_ν̂² random variable where ν is estimated as

\[ \hat\nu = \frac{(s_R^2/N'_R + s_T^2/N'_T)^2}{(s_R^2/N'_R)^2/(N'_R - p) + (s_T^2/N'_T)^2/(N'_T - p)} , \qquad (3.18) \]
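Equation (3.18) codes directly; as a sanity check, equal variances and equal record lengths give ν̂ = 2(N' − p).

```python
def satterthwaite_df(s2_R, s2_T, NpR, NpT, p):
    """Estimated denominator degrees of freedom (3.18), following
    Satterthwaite's approximation [40]; NpR, NpT denote N'_R, N'_T."""
    num = (s2_R / NpR + s2_T / NpT) ** 2
    den = ((s2_R / NpR) ** 2 / (NpR - p)
           + (s2_T / NpT) ** 2 / (NpT - p))
    return num / den

# Symmetric case: should equal 2 * (100 - 10) = 180.
nu_hat = satterthwaite_df(1.0, 1.0, 100, 100, 10)
```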
where s_P² is obtained from (3.8) and (3.9), but with C_R and C_T replaced by
C_R/s_R² and C_T/s_T² respectively. When α_R = α_T, the distribution of X² is
asymptotically χ_p² and the distribution of X²/p can be approximated by an F_{p,ν̂}
distribution, where

\[ \hat\nu = \frac{(s_R^2 + s_T^2)^2}{s_R^4 + s_T^4}\,(N'-p) . \]
When N'_R = N'_T = N' and α_R = α_T,

\[ X^2_*(\hat\alpha_T, \hat\alpha_R) = \frac{N'}{4}(\hat\alpha_R - \hat\alpha_T)^{\mathrm T}(D_R/s_R^2 + D_T/s_T^2)(\hat\alpha_R - \hat\alpha_T) \qquad (3.25) \]

in the situation where N'_T = N'_R = N' and it is assumed a priori that σ_T² = σ_R².
Observe that ζ*(α̂_T, α̂_R) = F(α̂_T, α̂_R)/2. Given σ_T² = σ_R², then, under H: α_R = α_T,
F* has an asymptotic χ² distribution. As in the case of ζ, a modification of ζ*
yields

\[ \ell_*(\hat\alpha_T, \hat\alpha_R) = \frac{N'-p}{N'p}\,\zeta_*(\hat\alpha_T, \hat\alpha_R) \]

whether σ_R² = σ_T² or N'_R = N'_T. Thus, under H: α_R = α_T,

\[ \zeta(\hat\alpha_T, a_R) \simeq N'_T I(\hat\alpha_T, a_R) , \qquad \zeta_*(\hat\alpha_T, \hat\alpha_R) \simeq N'_T I(\hat\alpha_T, \hat\alpha_R)/2 , \qquad (3.28) \]

3.6. H: σ_R² = σ_T², α_R = α_T
Here we wish to compare parameter vectors that comprise the p + 1
coefficients σ² and α. The likelihood ratio test statistic for testing H: σ_R² = σ_T²,
α_R = α_T can be shown to be Λ where −2 log Λ is given by LLR, in which

\[ \hat\sigma^2 = (N'_R + N'_T - 2p)\, s_P^2 / (N'_R + N'_T) \]

and s_P² is given by (3.11). When σ_R² = σ_T² and α_R = α_T, LLR has an asymptotic
χ²_{p+1} distribution. Alternatively the critical values of this statistic when H is true
can be determined numerically from the joint distribution of p s_P² F/σ², (N'_T −
p)s_T²/σ² and (N'_R − p)s_R²/σ² (σ² = σ_R² = σ_T²), which are asymptotically equivalent
to independent χ² random variables with degrees of freedom p, (N'_T − p) and
(N'_R − p) respectively. The null hypothesis will be rejected when LLR is
sufficiently large.
Simpler asymptotically equivalent expressions for LLR when σ_R² = σ_T² and
α_R = α_T are given by

\[ LLR_a = pF + \frac{1}{2}\,(N_R'^{-1} + N_T'^{-1})^{-1}\Bigl( \frac{s_R^2 - s_T^2}{s_P^2} \Bigr)^2 . \qquad (3.36) \]
\[ X^2 = (\hat\alpha_R - \hat\alpha_T)^{\mathrm T}(\hat\Sigma_R/N'_R + \hat\Sigma_T/N'_T)^{-1}(\hat\alpha_R - \hat\alpha_T) , \qquad (4.2) \]
\[ (w_1\Lambda_1^{-1} + w_2\Lambda_2^{-1})^{-1} \le w_1\Lambda_1 + w_2\Lambda_2 \qquad (4.4) \]
It would appear that F* and X²* are the more powerful statistics since F and X²
have the same respective asymptotic distributions under the null hypothesis as
F* and X²*. However, the asymptotic distributions of these statistics are only
approximations to the true distributions for finite N'_R and N'_T. Hence it might
be deduced from (4.6) that any apparent increase in power by F* and X²* would
be at the expense of a greater rate of Type I errors than that chosen. Evidence
that this is indeed the case is provided in the simulation studies of de Souza and
Thomson [10].
Since Itakura's distance I and its modification I* are asymptotically
equivalent to ζ and X², respectively, under the null hypothesis, it might be
expected that I and I* would be equally as powerful as ζ and X² respectively
in the important case of small departures from the null hypothesis. One way of
showing this is to consider, under the alternative hypothesis, a sequence of
values of α_R − α_T which decrease with N'_T and N'_R in such a way that τ₁² and τ₂²
remain fixed. In this situation it is readily shown that I and I* are asymptotic-
ally equivalent to ζ and X² respectively.
For the general case when σ_R² is not necessarily the same as σ_T², the statistic
F_R of Subsection 3.5 is asymptotically equivalent to a (σ_T²/σ_R²)F_{N'_T−p, N'_R−p}
random variable.
The simulation studies of de Souza and Thomson [10] give some guidance as
to the adequacy of the various asymptotic approximations in practice. In
particular, the LPC distance measures X² and X²* were found to be robust and
powerful. However, of the two, X² follows the χ_p² distribution more closely
under the null hypothesis, especially in the upper tail of the distribution.
We first note that the conditional likelihood estimates α̂ and σ̂², the
approximate maximum likelihood estimates α̃ and σ̃² and the Yule-Walker
estimates ᾱ and σ̄² can all be computed efficiently. Indeed, Morf et al. [32]
show that the number of multiplications necessary to compute α̂ and σ̂² or ᾱ
and σ̄² is (N' − 1)(p + 1) + 7p² + O(p), whereas Dickinson [34] shows that α̃ and
σ̃² can be computed in (N' − 1)(p + 1) + (7/2)p² + O(p) multiplications. In normal
speech processing applications, however, the value of p chosen is such that
these algorithms are no faster than the solution using Cholesky decomposition.
Nevertheless, these algorithms do lead to reduced storage requirements.
Consider now the computation of the various test statistics given in Section 3
and, in particular, the case N'_T = N'_R. First observe that ζ, ℓ, I, ζ* and ℓ* are all
functions of F which is computed more efficiently using (3.2) rather than (3.1).
Moreover F is computed more efficiently using (3.12) rather than (3.13) and X²
is more efficiently computed using (3.20) rather than (3.19). Here σ̂_P² (3.8) in F
and s_P² in X² (3.20) can be calculated using the algorithms of Friedlander et al.
[45] which take advantage of the near-Toeplitz structure, measured in terms of
displacement rank, of D_R and D_T. In particular, since D_R and D_T each have
displacement rank 2, any linear combination c₁D_R + c₂D_T where c₁ and c₂ are
positive constants has displacement rank 4. These algorithms together with
those for α̂ and σ̂² yield computationally efficient procedures for determining
not only F and X², but also all the other distance measures discussed in Section 3.
Once again, however, although the values of p used in most speech processing
applications are such that these algorithms are no faster than Cholesky
decomposition, they do lead to reduced storage requirements.
Using the number of multiplications as an estimate of computational com-
plexity, the calculations for F require approximately (p + 1)(N'_T − 1) + p³/3 +
3p² + 8p/3 multiplications, whereas those for ζ, ℓ, I, ζ*, ℓ* and F* each require
(p + 1)(N'_T − 1) + p³/6 + 2p² + 17p/6 multiplications. X² and X²* require an
additional ½p(p + 1) multiplications beyond those for F and F* respectively,
and I* requires an additional ½p(p + 3) multiplications in addition to those for
X²*. In arriving at these estimates it has been assumed that α̂_T, α̂_P, etc. were
obtained using the Cholesky decomposition and that the number of multiplications
that this entails is as given in [32]. Thus for N = 100, p = 10 the F and X²
statistics require approximately 20% more computations than any of ζ, ℓ, I, ζ*,
ℓ* and F*. For N = 300, p = 10 this figure reduces to approximately 8%.
Therefore, except under stringent computational conditions, the choice of test
statistic can be based on the properties of the test statistic concerned rather
than computational cost.
Turning now to storage requirements we note that the computation of F, X²,
F*, X²* and I* requires the storage of α̂_R, σ̂_R² and D_R (p + 1 + ½p(p + 1) floating
point numbers), whereas ζ, ℓ, I, ζ* and ℓ* require the storage of α̂_R only (p
floating point numbers). Thus, for the typical case p = 10, the storage necessary
to compute F, X², F*, X²* or I* exceeds that for ζ, ℓ, I, ζ* or ℓ* by about 56
floating point numbers per reference template.
dynamic time warping algorithm, which has been widely used, imposed strict
end-point constraints on the test and reference patterns by forcing the two sets
of end-points to coincide. Under these conditions 30 recognition errors were
made using the Itakura distance I (3.3).
The experiment was then repeated using the symmetric measure I* as the
local distance, and on this occasion the number of errors was 25.
Comparing these results, it can be seen that in this experiment, the price paid
for using the Itakura distance measure instead of a more powerful LPC
distance measure is a 20% increase in the word recognition error rate. This is
a consequence of discarding information about the variability of the reference
vector α̂_R.
In summary, this experiment shows that speech recognition systems based on
Itakura's distance I can be significantly improved by the trivial modification of
replacing I by I*, where, for the case N'_T = N'_R, I* is directly proportional to
the arithmetic mean of the Itakura distances I(α̂_R, α̂_T) and I(α̂_T, α̂_R).
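The symmetrised measure can be sketched as the average of the two one-sided distances; the covariance matrices and coefficient vectors below are toy values used only to exhibit the symmetry.

```python
import numpy as np

def itakura(a_ref, C, a_min):
    # One-sided Itakura distance (3.3).
    return np.log((a_ref @ C @ a_ref) / (a_min @ C @ a_min))

def symmetric_measure(a_R, C_R, a_T, C_T):
    """Arithmetic mean of the two one-sided Itakura distances, to which
    I* is directly proportional when the record lengths are equal."""
    return 0.5 * (itakura(a_R, C_T, a_T) + itakura(a_T, C_R, a_R))

C1 = np.array([[2.0, 0.5], [0.5, 1.0]])
C2 = np.array([[1.5, -0.2], [-0.2, 1.2]])
a1 = np.array([1.0, -0.3])
a2 = np.array([1.0, 0.2])
d12 = symmetric_measure(a1, C1, a2, C2)
d21 = symmetric_measure(a2, C2, a1, C1)
```

Unlike the one-sided I, swapping the roles of reference and test leaves the value unchanged, which is the property exploited in the experiment above.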
Acknowledgements
The authors would like to thank S. Haltsonen for his assistance in perform-
ing the isolated word recognition experiments.
References
[1] Martin, T. B. (1977). One way to talk to computers. IEEE Spectrum 14(5), 35-39.
[2] de Souza, P. (1983). A statistical approach to the design of an adaptive self-normalising silence
detector. IEEE Trans. Acoust. Speech Signal Process. 31(3), 678-684.
[3] Brown, M. K. and Rabiner, L. R. (1982). On the use of energy in LPC-based recognition of
isolated words. Bell System Tech. J. 61(10), 2971-2987.
[4] Rabiner, L. R., Rosenberg, A. E., Wilpon, J. G. and Keilin, W. J. (1982). Isolated word
recognition for large vocabularies. Bell System Tech. J. 61(10), 2989-3005.
[5] Zue, V. W. and Schwartz, R. M. (1980). Acoustic processing and phonetic analysis. In: W. A.
Lea, ed., Trends in Speech Recognition 101-124. Prentice-Hall, Englewood Cliffs, NJ.
[6] Markel, J. D. and Gray, A. H. (1976). Linear Prediction of Speech. Springer, Berlin.
[7] Fant, G. C. M. (1960). Acoustic Theory of Speech Production. Mouton and Co., 's-Gravenhage,
The Netherlands.
[8] Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals. Prentice-Hall,
Englewood Cliffs, NJ.
[9] de Souza, P. (1977). Statistical tests and distance measures for LPC coefficients. IEEE Trans.
Acoust. Speech Signal Process. 25(6), 554-559.
[10] de Souza, P. and Thomson, P. J. (1982). LPC distance measures and statistical tests with
particular reference to the likelihood ratio. IEEE Trans. Acoust. Speech Signal Process. 30(2),
304-315.
[11] Bahl, L. R., Cole, A. G., Jelinek, F., Mercer, R. L., Nadas, A., Nahamoo, D. and Picheny, M.
A. (1983). Recognition of isolated-word sentences from a 5000-word vocabulary office cor-
respondence task. Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing,
1065-1067.
[12] White, G. M. and Neely, R. B. (1976). Speech recognition experiments with linear prediction,
bandpass filtering, and dynamic programming. IEEE Trans. Acoust. Speech Signal Process.
24(2), 183-188.
[13] Myers, C. S., Rabiner, L. R. and Rosenberg, A. E. (1980). Performance tradeoffs in dynamic
time warping algorithms for isolated word recognition. IEEE Trans. Acoust. Speech Signal
Process. 28(6), 622-633.
[14] Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy, Chap. 4. Freeman, San
Francisco, CA.
[15] Itakura, F. (1975). Minimum prediction residual principle applied to speech recognition. IEEE
Trans. Acoust. Speech Signal Process. 23(1), 67-72.
[16] Tribolet, J. M., Rabiner, L. R. and Wilpon, J. G. (1982). An improved model for isolated
word recognition. Bell System Tech. J. 61(9), 2289-2312.
[17] Sakoe, H. (1979). Two-level DP-matching--a dynamic programming-based pattern matching
algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 27(6),
588-595.
[18] Jelinek, F., Mercer, R. L. and Bahl, L. R. (1982). Continuous speech recognition: Statistical
methods. In: P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2, 549-573.
North-Holland, Amsterdam.
[19] Lea, W. A. (1983). Selecting the best speech recogniser for the job. Speech Technology 1(4),
10-29.
[20] Rabiner, L. R. and Levinson, S. E. (1981). Isolated and connected word recognition--theory
and selected applications. IEEE Trans. Commun. 29(5), 621-659.
[21] Makhoul, J. (1975). Linear prediction: a tutorial review. Proc. IEEE 63(4), 561-580.
[22] Buzo, A., Gray, R. M., Gray, A. H. and Markel, J. D. (1980). Speech coding based upon
vector quantization. IEEE Trans. Acoust. Speech Signal Process. 28(5), 562-574.
[23] Juang, B. H., Wong, D. Y. and Gray, A. H. (1982). Distortion performance of vector
quantization for LPC voice coding. IEEE Trans. Acoust. Speech Signal Process. 30(2),
294-304.
[24] Mann, H. B. and Wald, A. (1943). On the statistical treatment of linear stochastic difference
equations. Econometrica 11, 173-220.
[25] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[26] Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
[27] Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.
[28] McLeod, I. (1975). Derivation of theoretical autocovariance function of autoregressive-moving
average time series. Appl. Statist. 24(2), 255-256.
[29] Ansley, C. F. (1979). An algorithm for the exact likelihood of a mixed autoregressive-moving
average process. Biometrika 66, 59-65.
[30] Ljung, G. M. and Box, G. E. P. (1979). The likelihood function of stationary autoregressive-
moving average models. Biometrika 66, 265-270.
[31] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis Forecasting and Control (rev.
ed.). Holden-Day, San Francisco, CA.
[32] Morf, M., Dickinson, B., Kailath, T. and Vieira, A. (1977). Efficient solution of covariance
equations for linear prediction. IEEE Trans. Acoust. Speech Signal Process. 25, 429-433.
[33] Chandra, S. and Lin, W. C. (1974). Experimental comparison between stationary and
non-stationary formulations of linear prediction applied to voiced speech analysis. IEEE
Trans. Acoust. Speech Signal Process. 22(6), 403-415.
[34] Dickinson, B. W. (1978). Two recursive estimates of autoregressive models based on maxi-
mum likelihood. J. Statist. Comput. Simulation 7, 85-92.
[35] Quenouille, M. H. (1958). The comparison of correlations in time series. J. Roy. Statist. Soc.
Ser. B 20, 158-164.
[36] Sambur, M. R. and Rabiner, L. R. (1976). A statistical decision approach to the recognition of
connected digits. IEEE Trans. Acoust. Speech Signal Process. 24(6), 550-558.
[37] Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury, Belmont, CA.
[38] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New
York.
412 P.J. Thomson and P. de Souza
[39] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London.
[40] Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin 2, 110-114.
[41] Tribolet, J. M., Rabiner, L. R. and Sondhi, M. M. (1979). Statistical properties of an LPC
distance measure. IEEE Trans. Acoust. Speech Signal Process. 27(5), 550-558.
[42] Rabiner, L. R., Levinson, S. E., Rosenberg, A. E. and Wilpon, J. G. (1979). Speaker-
independent recognition of isolated words using clustering techniques. IEEE Trans. Acoust.
Speech Signal Process. 27, 336-349.
[43] Roussas, G. G. (1973). A First Course in Mathematical Statistics. Addison-Wesley, Reading,
MA.
[44] Bellman, R. (1970). Introduction to Matrix Analysis. McGraw-Hill, New York.
[45] Friedlander, B., Kailath, T., Morf, M. and Ljung, L. (1978). Extended Levinson and
Chandrasekhar equations for general discrete-time linear estimation problems. IEEE Trans.
Automat. Control. 23, 653-659.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 413-449
D. F. Nicholls and A. R. Pagan
1. Introduction
Very early on in the development of methods for the analysis of time series
and the relationships between time series, it was recognized that techniques
based upon constant coefficient models might well be inadequate. Early examples
of this position would be Rubin (1950) and Kendall (1953); the former
allowed for some random variation in the coefficients whilst the latter restricted
them to a deterministically evolving pattern. Despite these qualms, constant
coefficient models have proven to be effective in empirical data analysis, so
much so that only relatively recently have there appeared either theoretical
papers detailing the methodology for dealing with the types of coefficient
variation important in the analysis of time series or empirical studies providing
applications of these techniques.¹
A number of surveys have been written in the last five years of the area that this chapter covers, a book by Raj and Ullah (1981) and a contribution by Chow (1983) to the Handbook of Econometrics being prominent examples. As well, there is an annotated bibliography by Johnson (1977, 1980). Combining these references would provide any reader with a fairly comprehensive list of papers on the topic. For this reason we do not attempt an exhaustive examination of all the work done. Our objective is best understood by considering what it was that made the work by Box and Jenkins (1976) so seminal. Though there were some advances in estimation and hypothesis testing documented in that book, it is arguable that most of the techniques used by them had been available for some period of time: the autocorrelation function had been routinely computed during spectral analysis, a non-linear least squares technique for fitting models with moving average errors can be found in Whittle (1954), and the analysis of residuals for diagnostic purposes was long a feature of applied research. What was pathbreaking in their material was the presen-
¹Because of the nature of this volume a large literature based on longitudinal data (which indexes responses by individual units to allow for variation in model coefficients across individuals) is ignored. Some of this literature is surveyed in Swamy (1971) and Engle and Watson (1979).
²Our preference is for the term 'specification' rather than 'identification' to describe the process of a preliminary screening of models, as this latter term also needs to be used when discussing whether unique estimates of the unknown parameters of a process can be obtained.
³In restricting β_t to have a constant mean at most we have ignored the possibility that the mean might vary in a deterministic fashion with some variables z_t (say), i.e. β̄_t = E(β_t) = z_t δ. As will be evident from later analysis this modification merely induces extra regressors involving the cross product between z_t and x_t and does not change the essence of our proposals.
Table 1
Assumptions employed in models
represents constant coefficient autoregressive models. Some have never been formally examined within the literature, e.g. (X1, B3, V2), and, apart from some general comments later, will have to be ignored. In Table 2 the content of each of the later sections of the chapter is matched with the various demarcations of Table 1.
Table 2
Section Models
y_t = Σ_{j=1}^p β_{t,j} y_{t-j} + ε_t ,

with β_{t,j} the jth element of the p × 1 vector β_t = β + η_t, so that this model becomes

y_t = Σ_{j=1}^p (β_j + η_{t,j}) y_{t-j} + ε_t .   (2.1)
For models of this form Andel (1976) derived conditions for their second-order stationarity, while Nicholls and Quinn (1981), referring to such models as random coefficient autoregressions (RCA), have extended Andel's results to the case of multivariate RCA's. For simplicity we shall concentrate, in the remainder of this section, on scalar models, though most of the results extend in a natural way to the multivariate situation.
In the case of the model (2.1) we make the following assumptions:
(i) {ε_t; t = 0, ±1, ±2, …} is a sequence of i.i.d. random variables with zero mean and variance σ².
(ii) β' = (β_1, …, β_p) is a vector of constants.
(iii) If η'_t = (η_{t,1}, …, η_{t,p}), then {η_t; t = 1, …, T} is a sequence of i.i.d. random vectors with zero mean and E(η_t η'_t) = Σ.
(iv) {η_t} and {ε_t} are mutually independent.
If

M = [ 0    I_{p-1} ]
    [ β_p  …  β_1  ] ,

with the (1, 1) block being the (p − 1) × 1 null matrix and the (1, 2) block the (p − 1) × (p − 1) identity matrix, and ℱ_t is the σ-field generated by {(ε_s, η_s); s ≤ t}, then it is possible to show (see Nicholls and Quinn, 1982, p. 31) that, when Σ > 0, there exists a unique ℱ_t-measurable second-order stationary solution to (2.1) if and only if M has all its eigenvalues within the unit circle and (vec Σ)' vec W < 1, where vec W is the last column of the matrix (I − M ⊗ M)^{-1}. (The tensor or Kronecker product ⊗ together with associated definitions and useful results are given in the Appendix.)
To obtain asymptotic properties of the estimators of the parameters of (2.1), the boundedness of the second moments of {y_t} is required. If the two criteria required for the second-order stationarity of (2.1) are bounded away from unity, it follows that this moment condition will be satisfied. As a result the next assumption for (2.1) is
the ordinary least squares (OLS) estimates of β_p, p = 1, 2, …, are just the partial correlation coefficients. Furthermore, as seen later in Theorem 2.2, these estimators are strongly consistent and asymptotically normally distributed. Consequently, in order to determine the order of the model (2.1) (or (2.2)), the partial autocorrelation coefficients and their standard errors are computed for orders 1, 2, …. Thereafter, the order at which the first (and subsequent) of these is not significantly different from zero is found in a similar fashion to that proposed in the Box-Jenkins procedure. If the first coefficient which is not significantly different from zero occurs at lag (p + 1), and all higher-order coefficients are not significantly different from zero, then the model is of order p.
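The order-selection recipe just described is easy to mimic numerically. The sketch below is our own illustration (the function names, the AR(2) example and the ±1.96/√T band are choices made here, not taken from the chapter): each partial autocorrelation is obtained as the last OLS coefficient of an AR(k) fit, and the chosen order is the largest lag whose coefficient remains significant.

```python
import numpy as np

def pacf_by_ols(y, max_lag):
    """Lag-k partial autocorrelation estimated as the last OLS
    coefficient of an AR(k) fit, for k = 1, ..., max_lag."""
    y = np.asarray(y, dtype=float)
    out = []
    for k in range(1, max_lag + 1):
        Y = y[k:]                                     # y_t
        X = np.column_stack([y[k - j:len(y) - j]      # y_{t-j}, j = 1..k
                             for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        out.append(coef[-1])                          # phi_kk
    return np.array(out)

def choose_order(y, max_lag):
    """Largest lag whose partial autocorrelation lies outside the
    approximate +/- 1.96 / sqrt(T) band; all higher lags must be inside."""
    band = 1.96 / np.sqrt(len(y))
    pacf = pacf_by_ols(y, max_lag)
    p = max_lag
    while p > 0 and abs(pacf[p - 1]) < band:
        p -= 1
    return p, pacf
```

For a simulated constant coefficient AR(2) the procedure settles on p = 2 in most samples; the same recipe applies to the random coefficient model because the partial autocorrelation estimates remain strongly consistent.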
Alternative order determination procedures, including Akaike's AIC, BIC or related criteria, could also be used. A number of these are discussed in Priestley (1981).
In practice, when considering models of the form (2.1), having determined the order of the model, the next question to be determined is whether or not a constant coefficient autoregressive model would fit the data just as well. The usual theory associated with tests based on maximum likelihood estimates will not hold in this context, since the vector of unknown coefficients under the null hypothesis lies on the boundary of the parameter space. Indeed, to demonstrate that the maximum likelihood estimators in (2.14)-(2.16) later are asymptotically normal, it is necessary to restrict the parameter space Θ in such a way that the parameters do not lie on the boundary of Θ. If a boundary value was to be permitted, e.g. θ ≥ 0, the maximum likelihood estimator (MLE) of θ would need to solve {max_θ l(θ) s.t. θ ≥ 0}, where l(θ) is the log likelihood. As Moran (1971), Chant (1974) and, more recently, Gourieroux et al. (1982) have shown, the estimator θ̂_R that stems from this restricted problem has a very complex asymptotic distribution. From the results of those papers the likelihood ratio test is a mixture of χ² and degenerate random variables, while Gourieroux et al. also demonstrate that this is true of the test statistic based upon the Kuhn-Tucker multipliers, the analogue of the Lagrange Multiplier test statistic.

Although the obvious way to test for coefficient constancy is to test the hypothesis Σ = 0, the fact that the MLE of Σ has a complex distribution when the null is true makes this route unattractive. One potential solution is to base a test for Σ = 0 upon the scores ∂l/∂θ, thereby ignoring the constraint Σ ≥ 0. Such an approach loses power compared to that based on the scores of the restricted likelihood, but it does at least have a tractable asymptotic distribution; essentially this represents the proposal by Moran that Neyman's (1959) C(α) statistic be used. In this sense, the test presented below can be regarded as a score or C(α) or Lagrange Multiplier test statistic.
In order to develop a test of the null hypothesis that Σ = 0 or, equivalently, γ = vech Σ = 0, let β̂_T and σ̂²_T be the maximum likelihood estimates of β and σ² under that hypothesis. (β̂_T and σ̂²_T are, of course, the usual maximum likelihood estimates for a fixed coefficient autoregression.) For a sample of size T, if
Y'_{t-1} = (y_{t-1}, …, y_{t-p}) ,

ε̂_t = y_t − β̂'_T Y_{t-1} ,   σ̂²_T = T^{-1} Σ_{t=1}^T ε̂²_t ,

ĝ_T = T^{-1} Σ_{t=1}^T (ε̂²_t/σ̂²_T − 1) z_t ,   (2.4)

Ŵ_T = T^{-1} Σ_{t=1}^T (z_t − z̄)(z_t − z̄)' ,

and

Φ = (T/2) ĝ'_T Ŵ_T^{-1} ĝ_T ,   (2.5)
then a test for γ = 0 may be based on Theorem 2.1, a proof of which appears in Nicholls and Quinn (1982, p. 103).
THEOREM 2.1. Under the assumptions (i)-(vii) and the conditions that σ² > δ > 0 and γ = 0, the statistic Φ above is asymptotically distributed as χ² with p(p + 1)/2 degrees of freedom.
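Theorem 2.1 can be illustrated numerically. The sketch below implements a score statistic of the general form described above for the scalar (p = 1) case; its exact algebraic form here is an assumption of ours (a standard Breusch-Pagan-type construction with z_t = y²_{t-1}), so it should be read as illustrative rather than as the precise statistic proved χ² by Nicholls and Quinn (1982).

```python
import numpy as np

def rca_score_test(y):
    """Score test of H0: constant coefficient AR(1) against a random
    coefficient alternative.  Assumed Breusch-Pagan-type form: regress
    the standardized squared residuals on z_t = y_{t-1}^2."""
    y = np.asarray(y, dtype=float)
    ylag, ycur = y[:-1], y[1:]
    T = len(ycur)
    beta = (ylag @ ycur) / (ylag @ ylag)        # MLE of beta under H0 (OLS)
    eps = ycur - beta * ylag                    # residuals
    s2 = (eps @ eps) / T                        # sigma^2 estimate under H0
    z = ylag ** 2                               # z_t for p = 1
    g = np.mean(z * (eps ** 2 / s2 - 1.0))      # score in the gamma direction
    W = np.mean((z - z.mean()) ** 2)
    return 0.5 * T * g * g / W                  # approx. chi^2(1) under H0
```

Under the null the statistic behaves like a χ²(1) draw; under a random coefficient alternative it grows with T, so the test rejects with high power in large samples.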
2.2. Estimation
We now examine the least squares and maximum likelihood estimation of
the parameters of (2.1), along with the asymptotic properties of these estimates.
Writing (2.1) as y_t = β'Y_{t-1} + u_t, where u_t = η'_t Y_{t-1} + ε_t (equation (2.2)), note that {vec(Y_{t-1} Y'_{t-1})}' = Y'_{t-1} ⊗ Y'_{t-1}. Consequently,

E(u²_t | ℱ_{t-1}) = σ² + {vec(Y_{t-1} Y'_{t-1})}' vec Σ
                 = σ² + z'_t γ .   (2.6)
From (2.2) and (2.6) least squares methods may now be used to obtain estimates of the unknown parameters, namely β, σ² and γ. For the sample of size (T + p), the first step is to apply least squares to (2.2) to yield the estimate β̂ of β as

β̂ = ( Σ_{t=1}^T Y_{t-1} Y'_{t-1} )^{-1} Σ_{t=1}^T Y_{t-1} y_t .   (2.7)

The second step in the estimation procedure is as follows. First, use (2.7) to determine, for t = 1, …, T,

û_t = y_t − Y'_{t-1} β̂ .
Since

û²_t = σ² + z'_t γ + ζ_t ,   (2.8)

where ζ_t is a zero-mean disturbance, the estimates σ̃² and γ̃ of σ² and γ respectively are obtained by regressing û²_t on 1 and z_t. Doing so gives

γ̃ = { Σ_{t=1}^T (z_t − z̄)(z_t − z̄)' }^{-1} Σ_{t=1}^T û²_t (z_t − z̄)   (2.9)

and

σ̃² = T^{-1} Σ_{t=1}^T û²_t − z̄' γ̃ .   (2.10)

Expressions (2.7), (2.9) and (2.10) provide the required least squares estimates.
The asymptotic properties of these least squares estimators are contained in:

THEOREM 2.2. For the process {y_t} satisfying (2.1) under conditions (i)-(vi), and for β̂, γ̃ and σ̃² defined by (2.7), (2.9) and (2.10) respectively, θ̃' = (β̂', γ̃', σ̃²) converges almost surely to θ' = (β', γ', σ²) if E(y⁴_t) < ∞. When E(y⁸_t) < ∞, T^{1/2}(θ̃ − θ) has a distribution which converges to that of a normally distributed random vector with mean zero and covariance matrix Ω.
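For the scalar case the two-step estimator of (2.7), (2.9) and (2.10) collapses to two simple regressions. The following sketch (our own code; the variable names and simulated parameter values are choices made here) estimates β by OLS and then regresses the squared residuals on a constant and z_t = y²_{t-1}:

```python
import numpy as np

def rca_two_step(y):
    """Two-step least squares for the scalar RCA model
    y_t = (beta + eta_t) y_{t-1} + eps_t, in the spirit of (2.7)-(2.10)."""
    y = np.asarray(y, dtype=float)
    ylag, ycur = y[:-1], y[1:]
    beta = (ylag @ ycur) / (ylag @ ylag)     # step 1: OLS, cf. (2.7)
    u2 = (ycur - beta * ylag) ** 2           # squared residuals
    z = ylag ** 2                            # z_t = y_{t-1}^2
    zc = z - z.mean()
    gamma = (zc @ u2) / (zc @ zc)            # step 2 slope, cf. (2.9)
    sigma2 = u2.mean() - z.mean() * gamma    # intercept, cf. (2.10)
    return beta, gamma, sigma2
```

With β = 0.4, γ = σ²_η = 0.1 and σ² = 1 the condition β² + σ²_η < 1 holds comfortably and all three estimates settle near their true values for moderately large T.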
Consider, as an illustration, the scalar model

y_t = (β + η_t) y_{t-1} + ε_t ,   (2.11)

where the ε_t are n.i.d.(0, σ²) and the η_t are n.i.d.(0, σ²_η). Relating this model to (2.1) it follows that M = β, Σ = σ²_η and W = (1 − β²)^{-1}; for second-order stationarity therefore we must have the largest eigenvalue of M within the unit circle, i.e. |β| < 1, and (vec Σ)' vec W < 1 which, in this case, becomes σ²_η/(1 − β²) < 1. Thus for model (2.11) a stationary solution exists if |β| < 1 and σ²_η < 1 − β²; that is, if and only if β² + σ²_η < 1, since this latter condition implies that |β| < 1.
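The condition β² + σ²_η < 1 and the implied stationary variance σ²/(1 − β² − σ²_η) are easy to check by simulation. The sketch below is our own illustration; the parameter values are arbitrary choices satisfying the condition.

```python
import numpy as np

def simulate_rca1(beta, sig_eta2, sig_eps2, T, seed=0):
    """Simulate the scalar model (2.11): y_t = (beta + eta_t) y_{t-1} + eps_t,
    with eta_t ~ N(0, sig_eta2) and eps_t ~ N(0, sig_eps2)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    for t in range(1, T):
        b_t = beta + rng.normal(0.0, sig_eta2 ** 0.5)   # random coefficient
        y[t] = b_t * y[t - 1] + rng.normal(0.0, sig_eps2 ** 0.5)
    return y

# Second-order stationarity holds iff beta**2 + sig_eta2 < 1, and then
# E(y_t**2) = sig_eps2 / (1 - beta**2 - sig_eta2).
```

With β = 0.5 and σ²_η = 0.2 the theoretical second moment is 1/0.55 ≈ 1.82, and the sample second moment of a long simulated path lands close to it; pushing β² + σ²_η past one makes simulated paths visibly explosive.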
From Theorem 2.2, in order that the estimators of the parameters be strongly consistent, it is sufficient that E(y⁴_t) < ∞. For the model (2.11), exploiting ε_t ~ n.i.d.(0, σ²),

E(y⁴_t) = E(β⁴_t) E(y⁴_{t-1}) + 6 E(β²_t) E(y²_{t-1}) σ² + 3σ⁴ ,

producing

E(β⁴_t) = β⁴ + 6β²σ²_η + 3σ⁴_η < 1   (2.12)

as the requirement for a finite fourth moment. From this it follows that (2.11) will be stationary if and only if β² + σ²_η < 1, while (2.12) gives a sufficient condition for {y_t} to have a finite fourth moment, a condition required for the strong consistency of the estimators.
It should also be noted that the results of Hannan and Kavalieris (1983)
could be applied to (2.1). For a general class of scalar stationary processes these
authors show that under very general conditions the least squares estimates of
the parameters, as well as the estimate of the order of the model using BIC,
converge almost surely to their theoretical values. We shall not present a
detailed discussion of their results here but refer the interested reader to that
paper.
Turning finally to least squares estimation of model (X2, B2, V1), the presence of exogenous variables rather than lagged values of y_t does not change the strategy outlined above. Amemiya (1977) provides a formal proof of this proposition when the x_t's are non-stochastic with an upper bound, while Hildreth and Houck (1968) made a very similar proposal, differing only in having E(û²_t) as the right-hand side of (2.8) instead of E(u²_t) = σ² + z'_t γ. Because T^{-1} Σ_t z_t (û²_t − u²_t) is o_p(T^{-1/2}), such a substitution does not affect the limiting distribution of T^{1/2}(θ̃ − θ), but it might be superior in small samples.
of y_t, …, y_{t-p} given an event Y_{t-s} in the σ-field ℱ_{t-s}. Then the joint density of y_1, …, y_T, conditional on the pre-period values (y_0, …, y_{1-p}) = Y_0, is

f(y_1, …, y_T | Y_0) = Π_{t=1}^T f_1(y_t | Y_{t-1})
= (2π)^{-T/2} Π_{t=1}^T (σ² + γ'z_t)^{-1/2} exp{ −(1/2) Σ_{t=1}^T (y_t − β'Y_{t-1})²/(σ² + γ'z_t) }
= L_T(β, γ, σ²), say,
so that, with λ = σ^{-2} γ,

l_T(β, γ, σ²) = −(2/T) ln L_T(β, γ, σ²) − ln(2π) = l_T(β, σ², λ) .

Having obtained β̂(λ) and σ̂²(λ), these values are now substituted into l_T to give

l̃_T(λ) = ln σ̂²(λ) + T^{-1} Σ_{t=1}^T ln(1 + λ'z_t) ,

σ̂²_T = σ̂²(λ̂) ,   β̂_T = β̂(λ̂) ,   γ̂_T = σ̂²_T λ̂ .   (2.13)
Both the first two derivatives of l_T with respect to β and λ are easily obtained. This approach proves useful not only in the application of the Newton-Raphson algorithm to obtain the required estimates, but also when discussing the asymptotic properties of the estimators. Using such a method, the maxi-
These two approaches do, of course, result in the same estimates of the parameters; that is, the estimates defined by (2.13) will be the same as the corresponding estimates given by (2.14)-(2.16).

The set Θ in (2.14) is a compact subset of R^{p(p+3)/2} and is such that condition (v) is satisfied along with the condition

(viii) σ^{-2} Σ has strictly positive eigenvalues, all of which are greater than or equal to δ₁ > 0.

This form of Θ is required in order to prove the strong consistency of, and obtain a central limit theorem for, the MLE.
We now state without proof Theorem 2.3, which relates to the asymptotic properties of the maximum likelihood estimators. For a proof of the results contained in this theorem, along with definitions of the matrices ℐ and 𝒥 appearing in it, see Appendix 4.2 and Sections 4.3 and 4.4 of Nicholls and Quinn (1982).
THEOREM 2.3. For {y_t} satisfying (2.1) and assumptions (i)-(viii), the estimators β̂_T, γ̂_T and σ̂²_T obtained by minimizing l_T(β, λ) over (β', λ')' ∈ Θ, where λ = σ^{-2}γ, are strongly consistent when the true values (β', λ')' ∈ int Θ. Furthermore, assuming finiteness of the fourth moments of {ε_t} and {η_t}, if θ̂'_T = (β̂'_T, γ̂'_T, σ̂²_T), then T^{1/2}(θ̂_T − θ) has a limiting normal distribution with mean zero and covariance matrix ℐ^{-1} 𝒥 ℐ^{-1}. When {ε_t} and {η_t} are jointly normal, this covariance matrix reduces to 2ℐ^{-1}.
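The concentration argument lends itself to a simple numerical sketch for the scalar case: for each λ = σ^{-2}γ on a grid, β and σ² are obtained by weighted least squares, and λ̂ minimizes the concentrated criterion ln σ̂²(λ) + T^{-1} Σ ln(1 + λ z_t). This is our own illustration of the route around (2.13), using a grid search rather than the Newton-Raphson iterations the chapter has in mind.

```python
import numpy as np

def rca1_mle(y, lam_grid=None):
    """Gaussian ML for the scalar RCA(1) model, concentrating beta and
    sigma^2 out of the likelihood and searching over lam = gamma/sigma^2."""
    y = np.asarray(y, dtype=float)
    ylag, ycur = y[:-1], y[1:]
    z = ylag ** 2
    if lam_grid is None:
        lam_grid = np.linspace(0.0, 2.0, 401)
    best = None
    for lam in lam_grid:
        w = 1.0 / (1.0 + lam * z)                         # var_t = sigma^2 (1 + lam z_t)
        beta = ((w * ylag) @ ycur) / ((w * ylag) @ ylag)  # weighted LS for beta(lam)
        s2 = np.mean(w * (ycur - beta * ylag) ** 2)       # sigma^2 hat(lam)
        crit = np.log(s2) + np.mean(np.log1p(lam * z))    # concentrated l_T
        if best is None or crit < best[0]:
            best = (crit, beta, s2, lam)
    _, beta, s2, lam = best
    return beta, s2, s2 * lam       # beta hat, sigma^2 hat, gamma hat
```

On data simulated from (2.11) the three estimates settle near the true parameter values, with the usual efficiency gain over the two-step least squares estimates in smaller samples.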
that

E(u_t u_{t-k} | ℱ_{t-1}) = Y'_{t-1} E(η_t η'_{t-k}) Y_{t-k-1} + E(ε_t ε_{t-k})
= δ_{0,k} (Y'_{t-1} Σ Y_{t-k-1} + σ²) ,

so that E(u_t u_{t-k} | ℱ_{t-1}) = 0, k ≠ 0, and

E(u²_t | ℱ_{t-1}) = h_t = Y'_{t-1} Σ Y_{t-1} + σ² = z'_t γ + σ² ,

and these might be adopted in forming analogues of the Box-Pierce (1970) test

Q = T Σ_{k=1}^K r̂²(k) ~ χ²_{K-p} ,   Q₁ = T(T + 2) Σ_{k=1}^K (T − k)^{-1} r̂²(k) ,

where r̂(k) are the autocorrelations of the residuals.
The model analysed in this section will be (X2, B3, V1) with comments about extensions to (X1, B3, V1) at the end; generally a combination of the insights of Section 2 with the results of Section 3 suggests that any results obtained for (X2, B3, V1) also extend to (X1, B3, V1), although only a few formal proofs are yet available. We will also strengthen the assumptions regarding variation in x_t to make x_t non-stochastic and uniformly bounded, commenting upon the likelihood that such an assumption can be relaxed at the end of Subsection 3.2.
The general form A(L)(β_t − β) = η_t given for evolving coefficient variation under (B3) in Table 1 highlights the fact that there are many more issues to be faced than when coefficient variation was purely random. Not only is it necessary to determine the order of A(L) and whether elements of the covariance matrix of η_t are zero, but also decisions must be reached concerning the nature of A(L), autoregressive and/or moving average, and whether it contains unit roots or not. Such diversity makes a succinct summary of past research an impossibility. Nevertheless, there is value in attempting to discuss such research in a parallel fashion to the account in Section 2. Accordingly, the following three subsections, 3.1, 3.2 and 3.3, treat specification, estimation and diagnostic testing in that order. Within each subsection a further attempt at classifying contributions according to major themes is made.
3.1. Specification

The multi-dimensional nature of the specification problem present in evolving coefficient models prompts the adoption of assumptions to narrow the range of alternatives needing to be investigated. As there are two elements constituting the evolutionary process, A(L) and η_t, it is natural therefore that research on specification has proceeded by ascribing particular properties to one of the elements, leaving the data to determine the other. Thus, in Subsection 3.1.1 it is A(L) which is treated as known and hypotheses about the covariance matrix of η_t are tested, while in Subsection 3.1.2 a prior division of coefficients into fixed and evolving (equivalent to a statement about the variances of η_t) is implicitly assumed and the characteristics of A(L) are determined. Combinations are conceptually possible, e.g. A(L) might be assumed known except for order, but we have been unable to locate any specific discussion on the matter. However, it might be mentioned that the order of A(L) could be chosen using Akaike's criterion.
y = Xβ + u ,   (3.2)

where 𝒥 = Δ'Δ with Δ' = ∂ vec(Ω)/∂ψ and Δ'₂ = ∂ vec(Ω)/∂ψ₂, and û are the residuals from OLS applied to (3.2).
Inspection of (3.3) reveals that Δ₂ and ψ₂ depend upon the type of coefficient evolution specified as the alternative hypothesis, i.e. A(L) (as reflected in Ω). La Motte and McWhorter (1978) consider the case where the elements of β_t followed ARIMA(0, 1, 0) processes and the covariance matrix of η_t was known up to a specified constant σ²_η, i.e. E(uu') = σ²_η V + σ² I_T where V was known. Therefore, Δ₂ = vec(V) and ψ₂ is a known scalar. Consequently, rather than work with (3.3), an equivalent test statistic is based upon the stochastic elements in (3.3) alone, i.e. upon vec(ûû'/σ̂²)' Δ₂ Δ'₂ vec(ûû'/σ̂²) or upon σ̂^{-2} û'Vû, which is the positive square root of the expression.

The numerator of σ̂^{-2} û'Vû is y'M_X V M_X y, where M_X = I − X(X'X)^{-1}X'. Defining HH' = M_X gives y'HH'VHH'y = z'HVH'z = z'(Σ_i λ_i P_i P'_i)z in their notation, where λ_i are the eigenvalues and P_i the eigenvectors of HVH'. Rather than finding Pr{T(û'Vû/û'û) < c}, they propose an F statistic based upon the ratio of S_g = z'(Σ_{i=1}^g λ_i P_i P'_i)z to σ̂². It is not entirely clear why such a strategy is followed; since the eigenvalues λ_i need to be obtained for their solution, one might as well utilize that information to compute the exact probability that σ̂^{-2} û'Vû < c, as this is a quadratic form in normal variables and Imhof's (1961) procedure can be applied. In fact, as King and Hillier (1980) have demonstrated, the LM test above is the locally best invariant test in
this situation where testing is being done on a boundary, and La Motte and McWhorter explicitly promulgated invariance as a desirable characteristic of their test. Furthermore, a direct application of the LM principle avoids the need of an arbitrary selection of g.

At first sight, the LM test seems to provide the desired test statistic. It extends to cover ψ₂ being a vector of unknown parameters and, even though there is no optimal property currently known for it in this wider context, its small sample distribution may be determined numerically through Monte Carlo means; see Breusch and Pagan (1980, p. 250). Furthermore, it is tempting to claim the standard limiting property that it is χ²(q) under the null. Unfortunately, as Tanaka (1981) shows, this property does not extend to the current model, and this rather reduces the attractiveness of the procedure. Tanaka demonstrates this failure for a simple model through an expansion of the characteristic function, while an alternative viewpoint is offered in Pagan and Hall (1983). It should be emphasized that the failure of (3.3) to be χ²(q) asymptotically under the null hypothesis has nothing to do with the fact that non-negativity constraints on ψ₂ were ignored in the construction of (3.3). Rather, it is a consequence of the fact that T^{-1} ∂²L/∂ψ ∂ψ' does not converge to its expectation as the sample size grows.
Watson (1980) concentrates upon β_t following an ARMA rather than ARIMA process, although in most of his analysis β_t is a scalar and follows the AR(1) β_t − β = α(β_{t-1} − β) + η_t. It is natural to consider testing if α = 0, σ²_η = 0 but, as setting σ²_η = 0 gives Ω = σ²I, it is apparent that there is some redundancy here. For any given value of α it is possible to construct an LM test for σ²_η = 0, but the problem of combining these different statistics remains. One approach, put forth by Davies (1977), is to base the decision on the maximum of the test statistics computed for various values of α in the range −1 < α < 1, and Watson proceeds in this way. Even apart from the difficulties of computation the idea is not especially attractive, in that it rests upon a knowledge of the order and form of A(L). Moreover, it is interesting to observe a result from Watson's Monte Carlo studies that the LM test indexed on α = 0, which is effectively an LM test for a particular type of heteroscedasticity and so is one of the members of the class studied in Godfrey (1978) and Breusch and Pagan (1979), exhibits very good performance in detecting coefficient variation, and the simplicity of that test statistic makes it a strong candidate for detecting any need to allow for evolving coefficients.
Finally, Garbade (1977), also restricting β_t to an ARIMA(0, 1, 0) process and Σ = E(η_t η'_t) to be diagonal, proposed that the likelihood ratio (LR) statistic −2 log[L_T(ψ̂₁, 0)/L_T(ψ̂)] be computed, where L_T is the likelihood, ψ̂ is the MLE of ψ and ψ̂₁ is the MLE of ψ₁ given the null hypothesis H₀: ψ₂ = 0. In standard cases this statistic is asymptotically equivalent to the LM statistic and is χ²(q) under the null hypothesis that ψ₂ = 0. Garbade found in simulation studies that the distribution of LR was not in fact χ² in this case. Such a finding could have arisen as a concomitant of imposing the restriction ψ₂ ≥ 0, but it might also reflect the problems with the LM test in this situation discovered by
Tanaka. Given the close relationship that normally holds between the two test statistics, there would seem to be a need to examine the asymptotic properties of the LR test statistic in this context more closely.
y_t = β₁ + x_t β₂ + (β_{1t} − β₁) + ε_t = β₁ + x_t β₂ + u_t ,   (3.6)
assumed that β_t − β is a stationary process; later we will briefly comment upon ways of discriminating between stationary and non-stationary models for β_t. Suppose that u_t were known. Equation (3.8) could then be written
THEOREM 3.1. Under conditions (i)-(iv) of Theorem 2.2 and with (v) β_t − β a stationary invertible process, (vi) E(v_j v'_j) = V_j > 0, (vii) x_t a sequence of non-stochastic regressors with uniform bound and lim_{T→∞} T^{-1} X'X = B > 0, (viii) lim_{T→∞} T^{-1} Z'_j Z_j = Q_j > 0, then

(c) T^{1/2}(γ̂_j − γ_j) →_d N(0, Q_j^{-1} (lim_{T→∞} T^{-1} Z'_j V_j Z_j) Q_j^{-1})
= Q_j^{-1} ( Σ_t x²_t x²_{t-j} γ̄_{t,0} + 2 Σ_{m=1}^k Σ_{t=m+1+j}^T x_t x_{t-j} x_{t-m} x_{t-j-m} γ̄_{t,m} ) Q_j^{-1} ,

where γ̄_{t,m} = E(a_{m,t}) E(a_{m,t-j}) under the null hypothesis. When x_t ≡ 1 and ε_t ≡ 0, E(a_{m,t}) = E(a_{m,t-j}) = γ_m and the formula corresponds to that in Box and Jenkins (1976, p. 35, eq. (2.1.13)).
It is apparent that the variance of T^{1/2}(γ̂_j − γ_j) corresponds to that from a regression model in which the errors follow a 'moving average' of kth order with time dependent covariances. Under the conditions of Theorem 3.1 it follows from Domowitz and White (1982) that this variance may be consistently estimated by T times the analogous sample quantity, with the OLS residuals v̂_{j,t} from (3.9) replacing v_{j,t}. Estimating the asymptotic variance in this fashion seems a good deal simpler than the alternative of explicitly evaluating the γ̄_{t,m} and replacing any γ_j appearing in them with γ̂_j. As well, Domowitz and White's formula applies even if β_t is not a scalar.
Many regression packages nowadays provide estimates of the variance of the OLS estimator adjusted for heteroscedasticity as recommended in Eicker (1967) and White (1980). In the context of the regression in (3.9), these estimates would correspond to T times Q_j^{-1}(T^{-1} Σ_t x²_t x²_{t-j} γ̄_{t,0}) Q_j^{-1}. When only a single coefficient evolves the omitted term is strictly non-negative, so that any test
based on it will be conservative: asymptotically, and for a single evolving coefficient, the computed OLS variance of T^{1/2}(γ̂_j − γ_j) understates the true variance, providing a conservative test statistic.
In all of the above analysis it was presumed that the process generating β_t was stationary, yet there is no compelling reason why A(L) should not contain unit roots. In standard time series analysis, such a happening is detected by successive differencing of the time series until the a.c.f. ordinates die out rapidly. Unfortunately, it does not seem easy to mimic that mode of operation here. Suppose β_t was a scalar and β_t = β_{t-1} + η_t. Then β̄ = β₀ and u_t in (3.1) becomes u_t = x_t Σ_{j=1}^t η_j + ε_t, demonstrating that

E(u_t(u_t − u_{t-1})) = σ²_η (x²_t t − x_t x_{t-1}(t − 1)) + σ²_ε .

Probably a similar strategy might be devised for the detection of models where A(L) is not solely composed of unit roots, e.g. A(L) = (1 − A₁L)(1 − L), but it must remain a high-priority area for research.
Finally, application of Theorem 3.1 to (X1, B3, V1) is not at all straightforward. It is a feature of (3.9) that in this case the random variable v_{j,t} will be autocorrelated. OLS applied to (3.9) will therefore yield inconsistent estimators of γ_j whenever x_t contains lagged values of y_t. This is a serious weakness, and to overcome it requires the use of some instrumental variables for the lagged values of y_t. Unfortunately, unless the autocorrelation in v_{j,t} is of the MA type, and there is little reason to believe that β_t − β would have this characteristic, it is not possible to exploit the past history of y_t for instruments. Finding instruments may well be difficult, and unless they are of good quality it could be very hard to make any prior discrimination between models. Once again this is a topic that requires much more detailed attention.
3.2. Estimation
Having isolated a range of models that are to be entertained, the next phase in the modelling cycle involves estimating the unknown parameters. In contrast to the specification aspect, there has been a substantial amount of research devoted to estimation. Much of this research represents an adaptation of the techniques presented in Section 2. The recommended estimation technique of that section was maximum likelihood, and Subsection 3.2.2 details the properties of this estimator in the evolving coefficient case. What differentiates the two situations is that the likelihood can only be defined implicitly in the evolving coefficient case, making it difficult to find analytic derivatives as was
that is,

u_t u_{t-1} = x_t x_{t-1} γ₀ α₁ + v_{1,t} ,   (3.11)

and, just as in the p.a.c.f. computations, α̂₁ could be found by regressing û_t û_{t-1} against x_t x_{t-1} γ̂₀ (γ̂₀ would be an output from the regression of û²_t against x²_t and unity).
In contrast to this approach the S-T regression would be similar in form, demonstrating its close relation to the estimator of Theorem 3.1; the sole difference in this instance being the replacement of γ̂₀ by ξ̂²_{t-1}. As might be expected, for α̂_{1,S-T} to be consistent certain conditions must be satisfied by ξ̂_t. In particular, for this scalar case, the sample moments of ξ̂_t should be consistent estimators of the population moments of ξ_t (up to the fourth order). S-T's choice of ξ̂_t can in fact be shown to imply this for the scalar case, but it is much harder to see the equivalent necessary conditions for consistency in more general models. It would obviously be desirable that a proof of consistency of S-T's proposed estimator be available before extensive use is made of it. Havenner and Swamy (1981) show that β̂ is consistent and asymptotically normal, but that is a comparatively simple task compared to establishing the limiting properties of estimators of α and Σ.
There are a number of other points that need to be made about S-T's approach. First, just as the insertion of γ̂₀ in place of γ₀ invalidates the consistency of the OLS estimate of the covariance matrix of v_j (and hence of α̂₁), so too the covariance matrix of α̂_{1,S-T} is not consistently estimated by the OLS variance formula. Applications made of the S-T estimator do not seem to have allowed for this. Second, it is not clear what is to be gained by moving from γ̂₀ to ξ̂²_{t-1}, the computational load of the first estimator being much lower. Of course, the S-T estimator is iterative in that new ξ̂_t can be found with the updated α̂ and Σ̂, and these may be exploited to give new estimates α̂ and Σ̂, etc. One could iterate the estimator of Theorem 3.1 as well by exploiting the form of the covariance matrix of v_{1,t}, e.g. a weighted least squares regression to account for the heteroscedasticity in v_{1,t}, but iterations on covariance estimators seem a bit pointless as the computational burden in each iteration is much the same as in each step of an iterative scheme to get the maximum likelihood estimator (MLE). Furthermore, because S-T's estimator is a variant of that in Theorem 3.1, it shares with that estimator the problems posed whenever x_t contains lagged values of y_t. Some applications of the S-T estimator have in fact been made to models such as (X1, B3, V1) without apparently realizing that the estimator will be inconsistent in such cases. Overall, this difficulty seriously reduces the appeal of covariance estimators for VCR's.
3.2.2. Maximum likelihood estimation
As would be familiar from ordinary time series analysis, covariance estimators tend to be fairly inefficient; their prime virtue being their simplicity and their ability to provide consistent estimators that possess a limiting distribution. To improve efficiency, most investigators interested in VCR models have followed Box and Jenkins and engaged in ML estimation.
Following the strategy of Section 2, the log likelihood is constructed as if e_t and η_t are jointly normal, giving (3.13). With the model cast in first-order state-space form as in (3.14), where x̄_t = (x_t ⋮ 0) and ξ_t has leading rows (β_t − β̄) and thereafter is defined to reduce A(L)(β_t − β̄) to first-order form, the Kalman Filter (KF) equations provide values of h_t and e_t for given α and Σ. This approach has been well documented elsewhere, e.g. Rosenberg (1973) and Harvey (1981), and interested readers can find the KF described in these and in a number of other references. Two items deserve some attention, however. First, the KF needs to be initialized by E(ξ_1) and E(ξ_1 ξ_1′); because ξ_t is composed from A(L)(β_t − β̄) = η_t, E(ξ_1) = 0 and E(ξ_1 ξ_1′) is a function solely of Σ and α. Accordingly, it is not necessary to treat the initial coefficient β_1 as fixed and unknown; if this were desirable, Rosenberg (1973) showed how to concentrate it out of the likelihood. Second, the term log f(y_1) in the log likelihood (3.13) needs examination. As y_t = x_t β̄ + x_t(β_t − β̄) + e_t, y_1 will be normally distributed with mean x_1 β̄ and variance x_1 Γ_1 x_1′ + σ², allowing log f(y_1) to be computed from x_1, α and Σ (this derivation assuming x_t to be non-stochastic).
What makes ML estimation desirable is the consistency of the resulting estimator and the fact that its covariance matrix is given by the inverse of the information matrix ℐ_θθ = −E(∂²L_T/∂θ ∂θ′); this latter quantity frequently being estimated by the inverse of the Hessian of the log likelihood. However, because VCR constitutes a non-standard problem, with observations y_t being dependent and non-stationary, there is no certainty that these desirable properties can be invoked. Crowder (1976) and Basawa et al. (1976) have provided general results for dependent observations from which the following theorem may be established.

438 D. F. Nicholls and A. R. Pagan
THEOREM 3.2. If
A. the model is asymptotically locally identified;
B. (i) x_t is non-stochastic and uniformly bounded from above,
(ii) Θ, the permissible parameter space, is a subset of R^s,
(iii) the eigenvalues of Φ in (3.14b) have modulus less than unity;
C. the errors e_t and η_t constitute a multivariate normal distribution with finite variances;
D. θ_0, the s × 1 vector of true parameter values, is an interior point of Θ;
then θ̂_ML → θ_0 in probability and ℐ_θθ^{1/2}(θ̂_ML − θ_0) → N(0, I_s) in distribution.

If θ does not include vec(Φ), i.e. the transition matrix is fixed a priori, condition B(iii) may be deleted.
Some comments can be made upon this theorem and its assumptions. The permissible parameter space is defined by the problem but would certainly require Σ to be p.s.d. In some situations the conditions are not exclusive, e.g. B(i)-(iii) would be a sufficient condition for A, but it seems worthwhile leaving unspecified what is needed for A and concentrating upon the asymptotic theory of the ML estimator given that A holds. Then, even if B(iii) did not hold, provided no elements in Φ were estimated and the parameters in Σ were asymptotically identified, consistency and asymptotic normality would follow. This then
extends the range of the estimation theorem to non-stationary cases, provided
a separate analysis of asymptotic identifiability can be given. If elements in Φ are to be estimated, then, by analogy with the corresponding situation of estimating unstable AR's, it would be expected that normality would not hold when Φ had unit roots. Although in that literature asymptotic normality of ℐ_θθ^{1/2}(θ̂ − θ_0) does hold when the roots are greater than unity, it is very doubtful that such a result would be true for a VCR model, the reason being that the innovations in an AR have bounded variance regardless of the roots, whereas the variance of the innovations would tend rapidly to infinity if the roots of Φ were greater than unity in the VCR case.
Of the other assumptions of the theorem, normality could be dispensed with by providing bounds on the moments of η_t and e_t. However, it would not seem possible to relax B(i) to allow non-stationary behaviour in x_t, as the outcome of such an alternative would be an unbounded variance of the innovations, and it is hard to see how the theorem could possibly hold. It is worth noting that Amemiya (1977) also retained this assumption for the ordinary random coefficient case.
From the definition of the log likelihood in (3.13), the conditioning on past data ensures that it remains the same even when x_t includes lagged values of y_t. Thus, ML estimates would be obtained in the same way regardless of the definition of x_t. However, Theorem 3.2 does not apply directly, although Weiss (1982) has considered the requisite extension. To do so demands the addition of various assumptions that serve to bound the moments of y_t; as might be expected, Weiss' methodology effectively combines Theorems 2.2 and 3.2. From Weiss' research it would seem that the properties of the MLE extend to the combination (X3, B3, V1), and, given the difficulties experienced by the covariance estimator under these circumstances, this establishes a strong case for its use.
3.2.3. Identifiability

Theorem 3.2 required that the model be asymptotically locally identified or that T^{−1}ℐ_θθ be non-singular in the limit, this latter interpretation being provided by Rothenberg (1971). Since this assumption is very closely bound up with the existence of a consistent estimator of θ, and Theorem 3.1 showed how such a consistent estimator might be found, it should come as no surprise that the conditions for the existence of the estimator of Theorem 3.1, viz. that T^{−1}X′X converge to a non-singular limit, are closely related to identifiability.
Of the two assumptions A2 is the strongest, but they indicate that it can be eliminated at the expense of a more complex proof. Accordingly, in most circumstances, the presence of unit roots in A(L) would not invalidate the standard properties of the MLE expressed in Theorem 3.2. Unfortunately, when x_t contains lagged values of y_t nothing is yet available concerning identifiability, and this is an area that is in need of much more research.
θ̂_(j+1) = θ̂_(j) + ℐ_θθ(θ̂_(j))^{−1} (∂L_T/∂θ)(θ̂_(j)) ,  (3.15)

where (j) indicates values at the jth iteration. If θ̂_(0) is a consistent estimator such that θ̂_(0) − θ is O_p(T^{−1/2}), it is well known that T^{1/2}(θ̂_(1) − θ) has the same limiting distribution as T^{1/2}(θ̂_ML − θ). Many proofs of this proposition are available, with a convenient statement being Rothenberg and Leenders (1964). Generally, θ̂_(1) is not second-order efficient, but a further iteration will in fact produce a θ̂_(2) which is; see Rothenberg (1983). Consequently, it is possible to derive asymptotically efficient estimators from (3.15) once a θ̂_(0) is available. But the estimator of Theorem 3.1 satisfies these requirements for θ̂_(0), making one step of the scoring algorithm from such estimates a means for deriving an estimator that is as efficient as the MLE. It is this argument that justifies the contention at the end of Subsection 3.2.1 that there was little point in iterating covariance estimators.
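The one-step argument can be seen in a toy problem: scoring for the variance of zero-mean normal data, started from a consistent but inefficient (median-based) estimator. The example, including its starting estimator, is ours; in this particular case a single scoring step happens to reproduce the MLE exactly, while in general the equivalence is only asymptotic.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=500)        # true sigma^2 = 4

def score(s2):
    # dL/d(sigma^2) for N(0, sigma^2) observations x
    return -len(x) / (2.0 * s2) + np.sum(x**2) / (2.0 * s2**2)

def info(s2):
    # Fisher information for sigma^2
    return len(x) / (2.0 * s2**2)

s2_0 = (1.4826 * np.median(np.abs(x))) ** 2   # consistent, inefficient start
s2_1 = s2_0 + score(s2_0) / info(s2_0)        # one scoring step
s2_ml = np.mean(x**2)                         # the exact MLE in this problem
print(s2_0, s2_1, s2_ml)
```

A little algebra shows why the step lands exactly on the MLE here: score(s2)/info(s2) = mean(x²) − s2 for this likelihood, so any consistent starting value is mapped straight to mean(x²).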
e_t = φ_j e_{t−j} + a_t .  (3.16)

Under H0: φ_j = 0, the error term in this regression is a_t + ê_t − e_t. By the mean value theorem,

ê_t = e_t + (∂e_t/∂θ′)(θ̂ − θ) + (θ̂ − θ*)′(∂²e_t/∂θ ∂θ′)(θ̂ − θ*) ,  (3.17)

so that

ê_t = φ_j ê_{t−j} + a_t + (∂e_t/∂θ′)(θ̂ − θ) + o_p(T^{−1/2}) ,  (3.18)

and

T^{−1} Σ_t ê_{t−j} (∂e_t/∂θ) → 0 in probability.  (3.19)

Equation (3.20) is the basis for the proof of the following theorem concerning a valid diagnostic test for H0: φ_j = 0.

THEOREM 3.3. In the regression of ê*_t against ê*_{t−j} and P̂, the 't statistic' associated with the coefficient of ê*_{t−j}, when treated as a standard normal deviate, is asymptotically a valid test statistic for the null hypothesis that φ_j = 0.
~ ) j : ~)j+ ^.,
(e_jM ,M6_j)
^ . - 1 ^e_jM
., ,
Ma . + o,(T 1/2) (3.21)
= ~b/+ (O*)MO*/)-20*)M'a* + o,(T1/2), (3.22)
P
using M ' M = M. Since/zt ~ ht because of the consistency of 0, it is easily seen
that
@
T1/2(dpj - dpj)'---~ N ( 0 , (plim T - l ~*_~M~*_j)-l) .
T--~
But the regression (3.20) involves the regression of the residuals from the
regression of ~* against P versus the residuals from the regression of ~* -/
against P; as is well known an identical estimate of q~/and its covariance matrix
can be found by the regression of 6" against 6*.-1
and P.
It might be noted that, except for the division by ĥ_t^{1/2} to produce a constant variance in the innovations, the regression in Theorem 3.3 is essentially that recommended by Durbin (1970) for adjusting the a.c.f. of residuals from an autoregression, again establishing a link between ordinary time series analysis and VCR methodology. Some other points of interest arise concerning Theorem 3.3. First, as ê*_{−j}′M′Mê*_{−j} = ê*_{−j}′Mê*_{−j} ≤ ê*_{−j}′ê*_{−j}, the t statistic from the regression of Theorem 3.3 exceeds that from a regression of ê* against ê*_{−j}, demonstrating that any failure to allow for the use of residuals rather than innovations creates a bias towards acceptance of the null hypothesis of serial independence. From this feature emerges the conclusion that past studies, e.g. Engle and Watson (1981), which have examined the a.c.f. of the estimated innovations, even if the latter were normalized with ĥ_t^{−1/2}, would have tended to accept their model as adequate when it may not have been. Second, the regression of Theorem 3.3 is easily performed, since the information is either available from the ML estimation program or could easily be extracted by numerical differentiation.
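The mechanics of the Theorem 3.3 regression can be sketched as follows: regress residuals on their jth lag plus a matrix of derivative regressors and read off the t statistic on the lag. The sketch is ours and purely illustrative; arbitrary columns stand in for the derivative regressors P̂, and the data are not generated from a VCR model.

```python
import numpy as np

def lag_t_stat(e_star, P, j=1):
    """t statistic on the j-th lag in a regression of the (normalized)
    residuals on that lag plus derivative regressors P, in the spirit of
    the Theorem 3.3 diagnostic.  Illustrative mechanics only."""
    y = e_star[j:]
    Z = np.column_stack([e_star[:-j], P[j:]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    s2 = resid @ resid / (len(y) - Z.shape[1])
    cov = s2 * np.linalg.inv(Z.T @ Z)
    return coef[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(2)
T = 400
P = rng.normal(size=(T, 2))          # stand-in for the derivative regressors
e_white = rng.normal(size=T)         # innovations under H0
e_ar = np.empty(T)
e_ar[0] = e_white[0]
for t in range(1, T):
    e_ar[t] = 0.6 * e_ar[t - 1] + e_white[t]   # serially correlated alternative
print(lag_t_stat(e_white, P), lag_t_stat(e_ar, P))
```

Treated as a standard normal deviate, the first statistic is small and the second large, which is the pattern the diagnostic is designed to detect.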
3.3.3. Algorithms
In Subsection 3.3.1 brief mention was made of the scoring algorithm for the derivation of the MLE. A range of other algorithms has been suggested: Newton-Raphson, in which ℐ is replaced by −∂²L/∂θ ∂θ′; general purpose optimization routines such as Davidon-Fletcher-Powell or Praxis; modified versions of scoring in which ℐ is estimated by Σ_t (∂L_t/∂θ)(∂L_t/∂θ)′ (see Berndt et al., 1974); and, recently, the E-M algorithm of Dempster et al. (1977) (see Watson and Engle (1982) for a full account of the application of the latter).
There is some evidence that scoring works very well provided good starting values are available, but that the E-M algorithm is very much faster if starting values are poor. A judicious combination of both algorithms, beginning with E-M and switching to scoring after some progress has been made towards the maximum, promises to be the best solution. Watson has indicated success with this strategy when up to one hundred parameters need to be estimated.
4. Some special models

4.2. Non-constant variances
It is sometimes forgotten that a regression model is characterized by two types of parameters: those connected with the mean behaviour of the data (β) and those connected with the second moment (σ²). Nevertheless, there has been a steady growth of interest either in allowing for the effects of a non-constant error variance or in modelling any changes in it.
If σ² is not constant, but rather indexed by t, it is well known that the variance of the OLS estimator β̂ = (Σ x_t′x_t)^{−1} Σ x_t′y_t is (Σ x_t′x_t)^{−1} Σ_t x_t′x_t σ_t² (Σ x_t′x_t)^{−1}, and Eicker (1967), Fuller (1975) and White (1980) have all proposed consistently estimating this quantity by replacing σ_t² in the formula by the squared OLS residuals û_t². As the autocorrelation and partial autocorrelation functions of a series y_t can be viewed as the estimated coefficients from a regression of y_t against its various lags, if the variance in such a regression is not constant the standard theory as in Box and Jenkins (1976) would not be applicable. Nicholls and Pagan (1983) have shown that it is possible to use the same adjustment as proposed by Eicker, Fuller and White even when x_t contains lagged values of y_t, and this approach allows test statistics based on the autocorrelation function to be adjusted for non-stationarity in variances.
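The Eicker-Fuller-White adjustment just described amounts to a sandwich formula with squared OLS residuals in the middle. A short sketch (data and names ours):

```python
import numpy as np

def hc_cov(X, y):
    """OLS estimate and its Eicker/White heteroscedasticity-consistent
    covariance (X'X)^{-1} (sum_t u_t^2 x_t' x_t) (X'X)^{-1}, with the
    squared OLS residuals u_t^2 replacing the unknown sigma_t^2."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                      # OLS residuals
    meat = (X * (u**2)[:, None]).T @ X    # sum_t u_t^2 x_t' x_t
    return beta, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(3)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
sigma_t = 0.5 + np.abs(X[:, 1])           # error variance moves with the regressor
y = X @ np.array([1.0, 2.0]) + sigma_t * rng.normal(size=T)
beta, V = hc_cov(X, y)
print(beta, np.sqrt(np.diag(V)))
```

The same code runs unchanged when a column of X is a lagged value of y, which is the case the Nicholls-Pagan result covers.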
Rather than react passively to a non-constant variance for e_t, some have suggested an active strategy of modelling it. Amemiya (1973) derived a three-step procedure asymptotically equivalent to the MLE when σ_t² is assumed a function of E(y_t). Harvey (1976) focussed upon an application of the scoring algorithm to give a two-step estimator asymptotically equivalent to the MLE for a general class of heteroscedastic errors. Amemiya (1977) considered consistent estimators rather than efficient ones. As might be expected, all of these proposals have been covered indirectly in Section 2. Any differences stem from the special characteristics of the model of Section 2, namely that lagged values of y_t appear in x_t and that σ_t² under random coefficients is related to the y_{t−j} rather than to a more arbitrary set of variables w_t. Nevertheless, the MLE and consistent estimators are derived exactly as in that section.
A more recent development in this area has been the distinction drawn between the conditional and unconditional variance of e_t. It may be that the unconditional variance of e_t is a constant, but that the variance conditional upon the sigma field ℱ_{t−1} = [y_{t−1}, …, y_1, x_t, …, x_1] is not; e.g. the model y_t = α y_{t−1} + e_t with E(e_t² | ℱ_{t−1}) = σ² + δ y_{t−1}² possesses this property (assuming that E(y_t²) is finite). There are obviously many ways in which such a situation could arise. One example is the random coefficient autoregressions of Section 2 or equivalent MA formulations (Robinson, 1977); another, the bilinear models of Granger and Andersen (1978). A third is the recent autoregressive conditional heteroscedasticity (ARCH) model of Engle (1982).
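The conditional-variance model just quoted is easy to simulate, and doing so illustrates the distinction being drawn: the conditional variance of e_t moves with y_{t−1} every period, while the unconditional variance settles at σ²/(1 − α² − δ) when that quantity is finite. The parameter values below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
T, alpha, sigma2, delta = 50_000, 0.5, 1.0, 0.2
y = np.zeros(T)
for t in range(1, T):
    h = sigma2 + delta * y[t - 1] ** 2      # conditional variance of e_t
    y[t] = alpha * y[t - 1] + np.sqrt(h) * rng.normal()

# The unconditional variance is sigma2 / (1 - alpha^2 - delta) here, even
# though the conditional variance changes every period; both half-sample
# variances should be close to it.
print(np.var(y[: T // 2]), np.var(y[T // 2 :]))
```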
5. Conclusion
Our aim has been to survey material on varying coefficient regression in the
context of a framework of crucial importance to the analysis of time series, viz.
the specification, estimation and diagnostic cycle. By adopting this systematic
approach, a better rapport between VCR research and traditional time series
analysis is established, and areas in which there is a serious research deficiency
can be more effectively isolated. As we commented in the introduction, there
are many cells in Table 1 that have not been studied at all, and our survey has
thrown up some gaps even for those models that have been studied. Two
aspects stand out from the review: considerable progress has been made in
enabling the standard fixed coefficient assumption to be relaxed but, at the
same time, much remains to be done before VCR models become as wide-
spread in their use as their fixed coefficient counterpart.
Appendix
References
Amemiya, T. (1973). Regression when the variance of the dependent variable is proportional to the
square of its expectation. J. Amer. Statist. Assoc. 68, 928-934.
Amemiya, T. (1977). A note on a heteroskedastic model. J. Econometrics 6, 365-370.
Andel, J. (1976). Autoregressive series with random parameters. Math. Operationsforsch. Statist. 7,
735-741.
Basawa, I. V., Feigin, P. D. and Heyde, C. C. (1976). Asymptotic properties of maximum likelihood estimators for stochastic processes. Sankhyā 38, 259-270.
Berndt, E. K., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974). Estimation and inference in
nonlinear structural models. Ann. Econ. Soc. Meas. 4, 653-665.
Billingsley, P. (1961). The Lindeberg-Lévy theorem for martingales. Proc. Amer. Math. Soc. 12, 788-792.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control (revised
edition). Holden-Day, San Francisco, CA.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-
integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient
variation. Econometrica 47, 1287-1294.
Breusch, T. S. and Pagan, A. R. (1980). The Lagrange multiplier test and its applications to model
specification in econometrics. Rev. Econom. Stud. 47, 239-253.
Chant, D. (1974). On asymptotic tests of composite hypotheses in non-standard conditions.
Biometrika 61, 291-298.
Chow, G. C. (1983). Random and changing coefficient models. In: Z. Griliches and M. D.
Intriligator, eds., Handbook of Econometrics, Chap. 21. North-Holland, Amsterdam.
Cooley, T. and Prescott, E. (1973). An adaptive regression model. Internat. Econom. Rev. 14,
364-371.
Cooley, T. and Prescott, E. (1976). Estimation in the presence of sequential parameter variation.
Econometrica 44, 167-184.
Crowder, M. J. (1976). Maximum likelihood estimation for dependent observations. J. Roy. Statist. Soc. Ser. B 38, 45-53.
Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the
alternative. Biometrika 64, 247-254.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. J. Roy. Statist. Soc. Ser B 39, 1-39.
Domowitz, I. and White, H. (1982). Misspecified models with dependent observations. J. Econometrics 20, 35-58.
Durbin, J. (1970). Testing for serial correlation in least squares regression when some of the regressors are lagged dependent variables. Econometrica 38, 410-421.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In: L. Le Cam
and J. Neyman, eds., Proc. Fifth Berkeley Symposium, 59-82. University of California Press,
Berkeley, CA.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica 50, 987-1007.
Engle, R. F. and Watson, M. (1979). A time domain approach to dynamic factor and MIMIC
models. Discussion paper 79-41. University of California, San Diego.
Engle, R. F. and Watson, M. (1981). A one-factor multivariate time series model for metropolitan
wage rates. J. Amer. Statist. Assoc. 76, 774-781.
Fuller, W. A. (1975). Regression analysis for sample survey. Sankhyā 37, Ser. C, 117-132.
Garbade, K. (1977). Two methods for examining the stability of regression coefficients. J. Amer.
Statist. Assoc. 72, 54-63.
Godfrey, L. G. (1978). Testing for multiplicative heteroskedasticity. J. Econometrics 8, 227-236.
Gourieroux, C., Holly, A. and Monfort, A. (1982). Likelihood ratio test, Wald test, and Kuhn-
Tucker test in linear models with inequality constraints on the regression parameters.
Econometrica 50, 63-80.
Granger, C. W. J. and Andersen, A. (1978). An Introduction to Bilinear Time Series Models. Vandenhoeck and Ruprecht, Göttingen.
Hannan, E. J. and Kavalieris, L. (1983). The convergence of autocorrelations and autoregressions.
Austral. J. Statist. 25, 287-297.
Hannan, E. J., Terrell, R. D. and Tuckwell, N. (1970). The seasonal adjustment of economic time
series. Internat. Econom. Rev. 11, 24-52.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity.
Econometrica 44, 461-466.
Harvey, A. C. (1981). Time Series Models. Phillip Allan, Oxford.
Hatanaka, M. and Tanaka, K. (1981). On the estimability of the covariance matrix in the
multivariate random walk representing the time changing parameters of regression models.
Mimeo. Osaka University.
Havenner, A. and Swamy, P. A. V. B. (1981). A random coefficient approach to seasonal
adjustment of economic time series. J. Econometrics 15, 177-210.
Henderson, H. V. and Searle, S. R. (1979). Vec and vech operators for matrices with some uses in
Jacobian and multivariate statistics. Canad. J. Statist. 7, 65-81.
Hildreth, C. and Houck, J. P. (1968). Some estimators for a linear model with random coefficients.
J. Amer. Statist. Assoc. 63, 584-595.
Hurwicz, L. (1950). Systems with non-additive disturbances. In: T. C. Koopmans, Ed., Statistical
Inference in Dynamic Economic Models, 410-418. Wiley, New York.
Imhof, J. P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika
48, 419-426.
Johnson, L. W. (1977). Stochastic parameter regression; an annotated bibliography. Internat.
Statist. Rev. 45, 257-272.
Johnson, L. W. (1980). Stochastic parameter regression: an additional annotated bibliography,
Internat. Statist. Rev. 48, 95-102.
Kendall, M. G. (1953). The analysis of economic time series, Part I: Prices. J. Roy. Statist. Soc. Ser. A 116, 11-25.
Kelejian, H. H. (1974). Random parameters in a simultaneous equation framework: Identification
and estimation. Econometrica 42, 517-528.
King, M. L. and Hillier, G. (1980). A small sample power property of the Lagrange multiplier test.
Monash University discussion paper.
La Motte, L. R. and McWhorter, A. (1978). An exact test for the presence of random walk
coefficients in a linear regression model. J. Amer. Statist. Assoc. 73, 816-820.
Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models.
Biometrika 65, 297-303.
McDonald, J. (1981). Consistent estimation of models with composite moving average disturbance
terms: A survey. Flinders University Mimeo.
Moran, P. A. P. (1971). Maximum likelihood estimation in non-standard conditions. Proc. Camb. Phil.
Soc. 70, 441-445.
Neyman, J. (1959). Optimal asymptotic tests for composite statistical hypotheses. In: U. Grenan-
der, ed., Probability and Statistics, 213-234. Wiley, New York.
Nicholls, D. F. and Pagan, A. R. (1983). Heteroscedasticity in models with lagged dependent
variables. Econometrica 51, 1233-1242.
Nicholls, D. F. and Quinn, B. G. (1981). Multiple autoregressive models with random coefficients.
J. Multivariate Anal. 11, 185-198.
Tanaka, K. (1981). On the Lagrange multiplier test for the constancy of regression coefficients and
the asymptotic expansion. Mimeo, Kanazawa University.
Tsurumi, H. and Shiba, T. (1982). A Bayesian analysis of a random coefficient model in a simple Keynesian system. J. Econometrics 18, 239-250.
Watson, M. (1980). Testing for varying coefficients when a parameter is unidentified. University of
California, San Diego discussion paper No. 80-8.
Watson, M. and Engle, R. F. (1982). The EM algorithm for dynamic factor and MIMIC models.
Harvard Institute of Economic Research discussion paper No. 879.
Weiss, A. A. (1982). The estimation of the dynamic regression model with stochastic coefficients.
University of California, San Diego discussion paper No. 82-11.
White, H. (1980). A heteroscedasticity-consistent covariance matrix estimator and a direct test
for heteroscedasticity. Econometrica 48, 817-838.
Whittle, P. (1954). Estimation and information in stationary time series. Ark. Mat. 2, 423-434.
Wilson, G. T. (1969). Factorization of the covariance generating function of a pure moving average.
S I A M J. Num. Anal. 6, 1-7.
Wilson, G. T. (1973). The estimation of parameters in multivariate time series models. J. Roy. Statist. Soc. Ser. B 35, 76-85.
E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 451-480

Small Samples and Large Equation Systems

H. Theil and D. G. Fiebig

1. Introduction

*Research supported in part by NSF Grant SES-8023555. The authors are indebted to Sartaj A. Kidwai of the University of Florida for his research assistance.
452 H. Theil and D. G. Fiebig
where y_it is consumption of good i, x_{0t} is total consumption, x_{jt} is the price of good j, e_it is a random error, and θ_i and π_ij are parameters, the π_ij's being known as Slutsky coefficients. Here we shall be interested in two hypotheses, viz., demand homogeneity,

Σ_{j=1}^{N} π_ij = 0 ,   i = 1, …, N ,  (2.2)
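Demand homogeneity is a linear restriction on each equation's Slutsky coefficients, so it can be tested with a standard Wald statistic. The sketch below tests the analogous "coefficients sum to zero" restriction in a single artificial OLS equation; it is our illustration of the form of such a test, not the system procedures whose rejection counts appear in Table 1.

```python
import numpy as np

def wald_sum_zero(X, y):
    """Wald statistic for H0: the coefficients sum to zero (the shape of
    the homogeneity restriction (2.2)) in one OLS equation.
    Illustrative only; the chapter's tests concern a full demand system."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    s2 = u @ u / (n - k)
    R = np.ones((1, k))                       # R b = sum of the coefficients
    W = (R @ b).T @ np.linalg.inv(R @ XtX_inv @ R.T * s2) @ (R @ b)
    return float(W)

rng = np.random.default_rng(8)
n, k = 120, 5
X = rng.normal(size=(n, k))
beta = np.array([2.0, -1.0, 0.5, -1.0, -0.5])   # sums to zero: H0 is true
y = X @ beta + rng.normal(size=n)
print(wald_sum_zero(X, y))
```

Under H0 the statistic is approximately chi-square with one degree of freedom; the interesting finding reported below is how badly such asymptotic approximations can behave when the system is large relative to the sample.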
Table 1
Rejections (out of 100 samples) of homogeneity and symmetry

                5%   1%     5%   1%     5%   1%
Rejections of homogeneity
 5 goods         7    1     14    6      8    1
 8 goods         8    2     30   16      5    2
11 goods         5    2     53   35      5    1
14 goods         4    2     87   81      6    1
Rejections of symmetry
 5 goods         6    0      9    3      5    1
 8 goods         5    2     26    8      5    1
11 goods         6    2     50   37      4    3
14 goods         4    2     96   91      6    0
which is one equation of a system that consists of several linear equations. The y's are observations on two endogenous variables of the system. The other equations contain certain endogenous variables in addition to these two, and also p exogenous variables; the observations on the latter variables are written x_{1t}, …, x_{pt}. The sample moment matrix of these variables and those in (3.1) is thus of order (p + 2) × (p + 2), where m_{1p} and m_{2p} are p-element vectors and M_p is square (p × p). The LS
H = −∫ f(x) log f(x) dx .  (4.1)
¹ For other ME properties of the exponential family, see Kagan et al. (1973).
² See Theil and Fiebig (1984) for a survey containing many other results as well as proofs of the statements which follow.
a fraction 1/n of the mass of the sample distribution. We impose on the density function f(·) which will be fitted that it preserves these fractions,
4.2. Applications
Given that the density picture of the ME distribution is so simple (piecewise constant or exponential), it is straightforward to evaluate its variance and higher moments. For example, the variance of the ME distribution (the ME variance) is given in (4.5) as the sample variance (1/n) Σ_{k=1}^{n} (x_k − x̄)² less two further terms, one of them involving the squared differences (x^{i+1} − x^{i−1})² for i = 2, …, n − 1. Since the first term is the sample variance and since the two other terms are negative, the ME variance is thus subject to shrinkage relative to the sample variance. Kidwai and Theil (1981) showed that, under normality, this shrinkage
³ Define the order statistics associated with each interval I_k as those which determine its end points: x^1 and x^2 for I_1 = (−∞, ξ^1) [see (4.2)], x^n and x^{n−1} for I_n, and x^{i−1}, x^i and x^{i+1} for I_i with 1 < i < n. The mean-preserving constraint on I_i requires that f(x) for x ∈ I_i be constructed so that the mean is a homogeneous linear function of the order statistics associated with I_i.
Small samples and large equation systems 457
is a random variable whose mean and standard deviation are both about proportional to n^{−1.3}.
Simulation experiments with pseudo-normal variates indicate that the ME
variance and third- and fourth-order moments about the mean are all more
accurate (in the mean-squared-error sense) than the corresponding estimators
derived from the discrete sample distribution. This difference reflects the
efficiency gain obtained by exploiting the knowledge that the parent distribution is continuous. However, the difference converges to zero as n → ∞,
implying that the efficiency gain is a small-sample gain. Fiebig (1982, Chap. 4)
extended the simulation experiment to the estimation of the variances of
fat-tailed mixtures of normal distributions. The fatter the tails for given n, the
larger is the efficiency gain of the ME variance over the sample variance.
Since the ME distribution is formulated in terms of order statistics, it is natural to consider the quantiles of the ME distribution as estimators of the parent quantiles. Let n be odd and write m = ½(n + 1). Then the sample median is x^m, but the ME median is x̄^m, i.e. the median of the secondary midpoints. For random samples from a normal population, the ME median has a smaller expected squared sampling error than the sample median, but the relative difference tends to zero as n → ∞. Let n + 1 be a multiple of 4 and write q = ¼(n + 1). Then the sample quartiles are x^q and x^{3q}, whereas the ME quartiles are

Q_L = ¼x^{q−1} + ½x^q + ¼x^{q+1}  and  Q_U = ¼x^{3q−1} + ½x^{3q} + ¼x^{3q+1}  if q > 1.⁴

For random samples from a normal population, the ME quartiles have smaller expected squared errors. Again, the relative difference tends to zero as n → ∞, but this difference is still in excess of 10 percent for the interquartile distances Q_U − Q_L and x^{3q} − x^q at n = 39. Also, the ME median and quartiles dominate their sample distribution counterparts (under squared-error loss) in the presence of an outlier with a different mean or a different variance; see Theil and Fiebig (1984) for details.
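Under the quartile formulas as read here (weights ¼, ½, ¼ on adjacent order statistics, i.e. secondary midpoints), the ME median and quartiles are simple to compute. The function below is our sketch for the case where n is odd and n + 1 is a multiple of 4.

```python
import random

def me_quartiles(sample):
    """ME median and quartiles as secondary midpoints
    (1/4)x^{k-1} + (1/2)x^k + (1/4)x^{k+1} of the order statistics,
    for odd n with n + 1 a multiple of 4 (so m = (n+1)/2, q = (n+1)/4)."""
    x = sorted(sample)
    n = len(x)
    assert n % 2 == 1 and (n + 1) % 4 == 0

    def smid(k):                    # k is a 1-based order-statistic index
        return 0.25 * x[k - 2] + 0.5 * x[k - 1] + 0.25 * x[k]

    q, m = (n + 1) // 4, (n + 1) // 2
    return smid(q), smid(m), smid(3 * q)

random.seed(5)
data = [random.gauss(0.0, 1.0) for _ in range(39)]   # n = 39, so q = 10, m = 20
ql, med, qu = me_quartiles(data)
print(ql, med, qu)
```

Because the ME cdf is continuous, these quantiles are uniquely defined for each admissible n, in contrast to sample quantiles that may require interpolation.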
4.3. Extensions

⁴ Since the ME distribution has a continuous cdf, its median and quartiles are uniquely defined for each n. This is in contrast to the sample quantiles, whose definitions for certain values of n can be made unique only by interpolation between order statistics.
within each cell with mass 1/n. Each such cell falls under one of three groups:
those which are bounded on all four sides, those which are open-ended on one
side, and those which are open-ended on two sides. For the first group, the ME distribution within the cell is the bivariate uniform distribution; for the second,
it is the product of the exponential (for the open-ended variable) and the
uniform (for the other variable); for the third, it is the product of two
exponentials.
The extension to the p-variate ME distribution is straightforward. There are then n^p cells, n of which contain one observation each and are assigned mass 1/n, while the n^p − n others are assigned zero mass. The ME distribution within each cell with mass 1/n is the product of p univariate distributions, each being either uniform or exponential. The cdf of this distribution is a continuous and nondecreasing function of its p arguments, and it is piecewise linear except for exponential tails.
The ME covariance of x and y equals

(1/n) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ) ,  (5.2)

where (x_k, y_k) for k = 1, …, n are the secondary midpoint pairs rearranged in the order of the original sample elements (x_k, y_k). This rearrangement is indicated by the use of subscripts rather than superscripts [cf. (4.4)].
The ME variance was given in (4.5), but this variance can also be written in the form (5.3), where the first term is the variance of the secondary midpoints.⁶ The two other terms are a weighted sum of squared differences between successive primary midpoints, which is always positive. On combining (5.2) and (5.3) we find that the 2 × 2 ME covariance matrix takes the form C + D, where C is the covariance matrix of the secondary midpoints and D is a diagonal matrix with positive diagonal elements. This C + D formulation applies to the covariance matrix of any p-variate ME distribution. The diagonal matrix D serves as the ridge of the ME covariance matrix;⁷ this ridge ensures that the ME covariance matrix is always positive definite even when p ≥ n.
⁶ Expression (5.3) is nothing but the variance decomposition of the univariate ME distribution between and within groups, the 'groups' being the intervals I_1, …, I_n.
⁷ This ridge formulation has a superficial similarity to ridge regression. The major difference is that the ridge of the ME covariance matrix is not subject to arbitrary choice but is uniquely determined by the ME criterion subject to mass- and mean-preserving constraints.
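The C + D structure can be mimicked numerically. In the sketch below, C is the covariance matrix of the secondary midpoints (rearranged to sample order) and the diagonal D collects within-interval variances computed as if each bounded interval were uniform; the exponential tails are ignored and the two extreme observations are left untransformed, so the weights are our illustrative reading rather than the chapter's exact formula. The point being demonstrated survives the simplification: C + D is positive definite even when p exceeds n.

```python
import numpy as np

def secondary_midpoints(col):
    """Secondary midpoints of one variable, rearranged to original sample
    order; the two extreme observations are left as-is for simplicity."""
    n = len(col)
    order = np.argsort(col)
    s = col[order]
    mids = s.copy()
    mids[1:-1] = 0.25 * s[:-2] + 0.5 * s[1:-1] + 0.25 * s[2:]
    out = np.empty(n)
    out[order] = mids
    return out

def me_cov(X):
    """Sketch of the C + D form: C from the secondary midpoints, D a
    positive diagonal ridge from uniform within-interval variances."""
    n, p = X.shape
    mids = np.column_stack([secondary_midpoints(X[:, j]) for j in range(p)])
    C = np.cov(mids, rowvar=False, bias=True)
    d = np.empty(p)
    for j in range(p):
        s = np.sort(X[:, j])
        prim = 0.5 * (s[:-1] + s[1:])                 # primary midpoints
        d[j] = np.sum(np.diff(prim) ** 2) / (12 * n)  # uniform variances
    return C + np.diag(d)

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 12))      # p = 12 variables but only n = 8 observations
V = me_cov(X)
print(np.min(np.linalg.eigvalsh(V)))
```

C alone would be singular here (its rank is at most n − 1), so the strictly positive smallest eigenvalue is entirely due to the ridge D.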
5.3. Ties and missing values

Ties have zero probability when the sample is drawn from a continuous distribution, but they can occur when the data are rounded. Let the ath and bth observations on x after rounding share the tth and (t + 1)st positions in ascending order, as in (5.4). The associated secondary midpoints then satisfy

x̄_a = x̄_b ,  (5.5)

which means that the tie x_a = x_b is preserved in the form x̄_a = x̄_b.
The univariate ME distribution is not affected by the tie (5.4), so that we can
⁸ Since the ridge of the ME covariance matrix tends to push the ME correlation toward zero, this difference mainly results from the downward bias of r and the upward bias of the Fisher transform of r (for ρ > 0). In Theil, Kidwai, Yalnizoglu and Yelle (1982) the simulation experiment is extended to the SYME correlation and also to the correlation of the spherically symmetric version mentioned in footnote 5. Only the last correlation estimator has some merits for particular values of ρ (around 0.95) under squared-error loss of the Fisher transform.
Here we return to (3.1) and we consider the question of whether the ME approach can be useful when the sample is undersized.
⁹ For t = 1 and t = n − 1, (5.4) is an extremal tie which implies that the exponential distribution over I_1 or I_n collapses, all mass being concentrated at the tied point. This also holds for a multiple tie, x_a = x_b = x_c = x^t = x^{t+1} = x^{t+2}. In both cases the ME distribution becomes mixed discrete/continuous, but the validity of the variance formula (4.5) is not affected.
¹⁰ LIML = limited-information maximum likelihood. 'Limited information' refers to the fact that no restrictions are incorporated on equations other than (3.1). 'Full information' and FIML use all restrictions in the system; see, e.g., Theil (1971, Chap. 10).
μ = B/(2A) − (1/(2A)) √(B² − 4A(m_11 m_22 − m_12²)) ,  (6.1)

where A = m_11.p m_22.p − m_12.p² and B = m_11 m_22.p + m_22 m_11.p − 2 m_12 m_12.p, the m_ij.p's being obtained from (3.3). Note that μ is random.
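As reconstructed here, (6.1) is the smallest root of the quadratic det(M − μ M_p) = 0 formed from the moment matrices of (y_1, y_2) before and after partialling out the p exogenous variables. A sketch, with data and names of our own choosing:

```python
import numpy as np

def liml_mu(y1, y2, X):
    """Smallest root mu of det(M - mu * M_p) = 0, written out as the
    quadratic formula in (6.1); M is the moment matrix of (y1, y2) and
    M_p the corresponding matrix after partialling out X."""
    Y = np.column_stack([y1, y2])
    M = Y.T @ Y
    R = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]   # residuals on X
    Mp = R.T @ R
    A = Mp[0, 0] * Mp[1, 1] - Mp[0, 1] ** 2
    B = M[0, 0] * Mp[1, 1] + M[1, 1] * Mp[0, 0] - 2.0 * M[0, 1] * Mp[0, 1]
    C = M[0, 0] * M[1, 1] - M[0, 1] ** 2
    return (B - np.sqrt(B**2 - 4.0 * A * C)) / (2.0 * A)

rng = np.random.default_rng(7)
n, p = 100, 4
X = rng.normal(size=(n, p))
z = X.sum(axis=1)
y1 = z + rng.normal(size=n)
y2 = z + rng.normal(size=n)
mu = liml_mu(y1, y2, X)
print(mu)      # always >= 1, since M - M_p is positive semidefinite
```

That μ is random, as the text notes, is visible here: rerunning with a different seed gives a different value of μ near one.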
As n → ∞, n(μ − 1) converges in distribution to a χ² variate, so that n^{1/2}(μ − 1) converges in probability to zero. Therefore, the propositions stated in the discussion following (3.4) imply that n^{1/2}[γ̂(μ) − γ] has the same asymptotic normal distribution as its 2SLS counterpart, n^{1/2}[γ̂(1) − γ].
A closer approximation to the sampling distributions of the 2SLS and LIML estimators may be described as follows.11 We standardize these two estimators by subtracting the true value of γ and then dividing the difference by their common asymptotic standard deviation. The asymptotic distribution of these two standardized estimators is standard normal. This is a first-order approximation which can be improved upon by appropriate expansions. The second-order approximation yields cdfs of the form
where Φ(u) and φ(u) are the standard normal cdf and density function, respectively, while θ is a constant determined by the parameters of the system which contains (3.1) as one of its equations. Since substitution of u = 0 into (6.3) yields Φ(0) - 0 = 1/2, we conclude that the approximate distribution of the standardized LIML estimator has zero median, whereas (6.2) shows that the standardized 2SLS estimator has this property only for p = 1. As p increases, the median of the latter approximate distribution moves away from zero.
It appears that to a large extent these properties also apply when the
estimators are formulated in terms of ME rather than sample moments. Theil
and Meisner (1980) performed a simulation experiment in which the 2SLS
estimator is systematically formulated in terms of ME moments. This has the
advantage that the estimator exists even when p > n [because Mp in (3.3) is
then positive definite], but the estimator is badly biased for large p. We shall
therefore pay no further attention to 2SLS-type estimators. On the other hand,
the approximate median-unbiasedness of LIML which is implied by (6.3)
appears to also apply when this estimator is formulated in terms of ME
moments.
We return again to (3.1) and specify that the two associated reduced-form equations are given in (6.4),12 which agree with (3.1) if and only if γ = 1 and ε_t = ζ_1t - ζ_2t. In the simulation experiment to be discussed, the x_ht's and ζ_jt's are all generated as independent pseudo-normal variates,13 the distribution of each x_ht being N(0, V/p) and that of each ζ_jt being N(0, σ0²). Therefore,

Σ_{h=1}^{p} x_ht ~ N(0, V),    ζ_jt ~ N(0, σ0²),    ε_t ~ N(0, σ²),    (6.5)
12The reduced form is obtained by solving the system for the endogenous variables. This requires the number of equations to be equal to the number of these variables.
13The x's are not constant in repeated trials. Making them constant would have implied that all
entries in any given row of Table 2 are determined by the same set of n observations on the p
exogenous variables.
14Since Mariano and Sawa (1972) have shown that the sampling distribution of the LIML estimator does not possess finite moments of any order, we use medians and quartiles to measure location and dispersion.
Table 2
Quartiles of LIML estimators based on sample, ME and hybrid moments^a

          Median            Lower quartile      Upper quartile      Interquartile distance
p      SA    ME    HY     SA    ME    HY      SA    ME    HY      SA    ME    HY
(1)   (2)   (3)   (4)    (5)   (6)   (7)     (8)   (9)  (10)    (11)  (12)  (13)

n = 21 observations
10   1.00  1.00  1.00   0.84  0.84  0.85    1.18  1.19  1.19    0.34  0.35  0.34
15   0.99  1.00  1.00   0.80  0.81  0.83    1.23  1.23  1.20    0.43  0.42  0.37
20   0.87  1.00  0.99   0.48  0.78  0.83    1.33  1.25  1.19    0.84  0.47  0.36
25    b    0.98  0.98    b    0.81  0.84     b    1.18  1.17     b    0.37  0.33
30    b    1.01  1.00    b    0.84  0.85     b    1.19  1.18     b    0.35  0.33
35    b    1.01  1.01    b    0.85  0.86     b    1.19  1.16     b    0.34  0.31
40    b    0.99  1.00    b    0.85  0.86     b    1.15  1.13     b    0.30  0.27

n = 31 observations
10   1.00  1.00  1.00   0.88  0.88  0.88    1.13  1.12  1.12    0.25  0.24  0.24
15   1.00  1.00  1.00   0.87  0.87  0.87    1.15  1.14  1.14    0.28  0.27  0.27
20   0.99  0.99  0.99   0.85  0.85  0.86    1.15  1.15  1.14    0.31  0.30  0.28
25   1.00  1.00  1.00   0.83  0.84  0.87    1.20  1.19  1.17    0.37  0.35  0.29
30   0.92  0.99  1.00   0.57  0.83  0.87    1.28  1.20  1.16    0.72  0.37  0.30
35    b    0.99  1.00    b    0.83  0.85     b    1.19  1.16     b    0.35  0.30
40    b    0.99  1.00    b    0.86  0.87     b    1.15  1.13     b    0.29  0.27
45    b    0.98  0.98    b    0.86  0.87     b    1.14  1.13     b    0.28  0.27
50    b    1.01  1.01    b    0.88  0.88     b    1.16  1.15     b    0.29  0.27

n = 41 observations
10   1.01  1.01  1.01   0.89  0.90  0.90    1.14  1.13  1.13    0.24  0.24  0.24
15   1.01  1.01  1.01   0.91  0.91  0.91    1.12  1.12  1.12    0.21  0.21  0.21
20   1.00  1.00  1.00   0.89  0.90  0.90    1.13  1.13  1.12    0.24  0.23  0.22
25   1.00  1.00  0.99   0.86  0.87  0.87    1.14  1.13  1.13    0.27  0.27  0.26
30   1.00  1.01  1.01   0.87  0.87  0.88    1.16  1.16  1.15    0.28  0.29  0.27
35   0.99  1.00  1.00   0.82  0.83  0.86    1.22  1.21  1.17    0.40  0.38  0.31
40   0.91  1.00  1.00   0.47  0.84  0.88    1.35  1.21  1.16    0.88  0.38  0.28
45    b    1.01  1.01    b    0.87  0.89     b    1.18  1.16     b    0.31  0.27
50    b    1.00  1.00    b    0.88  0.88     b    1.14  1.13     b    0.26  0.25
55    b    1.00  1.00    b    0.88  0.89     b    1.15  1.14     b    0.27  0.25
60    b    1.01  1.01    b    0.90  0.90     b    1.15  1.15     b    0.25  0.25
Table 3
Quartiles of LIML estimators based on correlated exogenous variables^a

          Median            Lower quartile      Upper quartile      Interquartile distance
p      SA    ME    HY     SA    ME    HY      SA    ME    HY      SA    ME    HY

ρ = 0
10   1.02  1.02  1.02   0.85  0.85  0.85    1.19  1.20  1.19    0.35  0.35  0.34
15   1.00  1.00  1.01   0.81  0.82  0.85    1.21  1.20  1.18    0.40  0.38  0.33
20   0.90  0.98  0.99   0.56  0.79  0.85    1.31  1.22  1.18    0.75  0.43  0.33
25    b    0.97  0.98    b    0.81  0.83     b    1.17  1.15     b    0.36  0.32
30    b    1.00  1.01    b    0.85  0.86     b    1.21  1.20     b    0.36  0.34
35    b    0.99  0.99    b    0.83  0.84     b    1.16  1.14     b    0.32  0.30
40    b    1.01  1.01    b    0.87  0.87     b    1.18  1.17     b    0.31  0.29

ρ = 0.3
10   1.02  1.01  1.01   0.84  0.84  0.85    1.18  1.18  1.18    0.34  0.34  0.33
15   1.01  1.01  1.01   0.82  0.83  0.86    1.23  1.22  1.19    0.41  0.39  0.33
20   0.91  0.99  0.99   0.49  0.81  0.85    1.37  1.24  1.17    0.88  0.42  0.32
25    b    0.98  0.99    b    0.83  0.85     b    1.16  1.14     b    0.33  0.29
30    b    1.00  1.00    b    0.85  0.86     b    1.16  1.15     b    0.31  0.29
35    b    1.00  1.01    b    0.87  0.88     b    1.16  1.16     b    0.30  0.29
40    b    1.00  1.00    b    0.87  0.88     b    1.16  1.15     b    0.29  0.27

ρ = 0.6
10   1.02  1.01  1.02   0.84  0.85  0.86    1.18  1.18  1.18    0.34  0.33  0.32
15   1.01  1.01  1.00   0.82  0.83  0.85    1.23  1.22  1.19    0.42  0.39  0.34
20   0.93  0.99  1.00   0.52  0.82  0.85    1.38  1.22  1.17    0.87  0.40  0.33
25    b    0.98  0.99    b    0.83  0.85     b    1.17  1.15     b    0.34  0.31
30    b    0.99  0.99    b    0.85  0.86     b    1.16  1.16     b    0.30  0.30
35    b    1.01  1.01    b    0.87  0.87     b    1.17  1.16     b    0.30  0.29
40    b    1.00  1.00    b    0.87  0.88     b    1.16  1.16     b    0.29  0.28

ρ = 0.9
10   1.02  1.02  1.02   0.85  0.85  0.85    1.17  1.17  1.17    0.32  0.32  0.31
15   1.01  1.01  1.01   0.81  0.84  0.86    1.24  1.20  1.19    0.43  0.36  0.32
20   0.94  1.00  0.99   0.53  0.83  0.85    1.38  1.20  1.18    0.85  0.37  0.33
25    b    0.99  0.99    b    0.83  0.85     b    1.17  1.17     b    0.33  0.32
30    b    0.99  0.99    b    0.84  0.86     b    1.17  1.16     b    0.33  0.30
35    b    1.01  1.01    b    0.87  0.88     b    1.18  1.17     b    0.31  0.30
40    b    0.99  0.99    b    0.86  0.86     b    1.17  1.17     b    0.32  0.31

ρ = 0.99
10   1.02  1.02  1.02   0.85  0.86  0.86    1.17  1.17  1.17    0.32  0.31  0.30
15   1.01  1.01  1.01   0.81  0.86  0.86    1.24  1.20  1.18    0.42  0.34  0.32
20   0.93  0.99  0.99   0.54  0.84  0.85    1.38  1.18  1.16    0.84  0.33  0.31
25    b    0.99  1.00    b    0.85  0.85     b    1.17  1.17     b    0.33  0.31
30    b    0.99  0.99    b    0.85  0.86     b    1.16  1.16     b    0.31  0.30
35    b    1.01  1.01    b    0.87  0.88     b    1.19  1.18     b    0.31  0.29
40    b    1.00  1.00    b    0.85  0.85     b    1.18  1.17     b    0.33  0.31
15There is no clear evidence of such a bulge when ρ is large. This exception reflects the fact that the p exogenous variables effectively behave as one variable when ρ is sufficiently close to 1.
    | Σ11   Σ12 |    } q dependent variables
    | Σ12'  Σ22 |    } p independent variables            (7.1)

|Σ12 Σ22⁻¹ Σ12' - ρ_i² Σ11| = 0,    (7.3)

|Σ - (1 - ρ_i²) Σ11| = 0,    (7.4)

which provides a link between the error covariance matrix Σ and the canonical correlations of the q dependent and the p independent variables of the system: for i = 1, ..., m, one minus each squared canonical correlation coefficient is a latent root of the diagonalization of Σ in the metric of the covariance matrix Σ11 of the dependent variables.
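The canonical correlations implied by a partitioned covariance matrix of the form (7.1) can be computed via the equivalent singular-value formulation (our choice of method; the helper name is hypothetical):

```python
import numpy as np

def canonical_correlations(S11, S12, S22):
    """Canonical correlations of a partitioned covariance matrix (7.1):
    the singular values of S11^{-1/2} S12 S22^{-1/2}."""
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T
    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(K, compute_uv=False)

# With S11 = S22 = I and a diagonal S12 -- the design used below --
# the canonical correlations are just the diagonal entries of S12.
rho = np.array([0.9, 0.7, 0.5])
r = canonical_correlations(np.eye(3), np.diag(rho), np.eye(3))
```

Feeding ME, hybrid or sample moments into the same routine yields the corresponding estimated canonical correlations.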
where the X_i's and v_i's are 18 independent standard pseudo-normal variates. Then the Y_i's are also independent standard pseudo-normal, while X_i and Y_j are uncorrelated for i ≠ j and (X_i, Y_i) has correlation ρ_i. Therefore, ρ_1, ..., ρ_9 are the canonical correlations of (X_1, ..., X_9) and (Y_1, ..., Y_9). The joint covariance matrix of the 18 variables (X's and Y's) takes the form (7.1) with Σ11 = Σ22 = I and Σ12 diagonal with ρ_1, ..., ρ_9 on the diagonal. Their true values
are specified as
By interpreting (7.1) as consisting of either ME or sample moments computed for a sample of size n, and then solving the associated determinantal equation (7.3), we obtain nine ME or sample canonical correlations. This experiment was replicated 100 times and the results are summarized in Table 4 in terms of means and RMSEs around the true value. The upper part of the table concerns the largest canonical correlation (with true value ρ_1 = 0.9). Both the ME and the sample estimator are subject to a substantial upward bias which slowly declines as n increases,16 but the bias of the former estimator is smaller and this also holds for its RMSE. The middle part of Table 4 concerns the arithmetic average canonical correlation (true value 0.5) and the lower part deals with the sum of the squared canonical correlations (true value 2.85); this sum plays a role in Hooper's (1959) trace correlation coefficient. The results are similar to those in the upper part: there is an upward bias which slowly decreases as n increases, and both the bias and the RMSE are smaller when ME rather than sample moments are used.
Although these results are encouraging for the ME approach, it should be admitted that the upward bias is quite substantial. A comparison of the last four columns of Table 4 shows that this bias is typically close to the corresponding RMSE, suggesting that a bias correction is in order. Let r_1 ≥ r_2 ≥ ⋯ ≥ r_m be the ME canonical correlations. The corrected coefficients are r̂_1, ..., r̂_m, obtained from (7.7), which is a correction in exponential form. To explain the exponent we note that each canonical variate involves p - 1 or q - 1 multiplicative coefficients (only the ratios of these coefficients matter). This yields p + q - 2 coefficients for a pair of canonical variates, to which we add 1 for the use of a constant term, yielding a total of p + q - 1 coefficients. (Both canonical variates have constant terms, but the covariance in the numerator of the canonical correlation is not affected when only one constant is used.)

Table 5 provides evidence of the correction (7.7) based on the experimental design (7.5) and (7.6) for both the ME and the hybrid canonical correlations.
16The upward bias of the sample estimator is not surprising, since canonical correlations are generalizations of multiple correlations. Let R be such a correlation, associated with a linear regression on p independent variables (including a constant term). A frequently used correction amounts to multiplying 1 - R² by the ratio of n - 1 to n - p - 1. Both this correction and that shown in (7.7) below for canonical correlations are corrections to the order 1/n, but (7.7) has the advantage of never yielding a negative r̂_i. See also Lawley (1956, 1959) for an asymptotic expansion of the expected sample canonical correlations; the implied correction is much more complicated than (7.7).
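The familiar correction described in this footnote can be written out directly; a sketch (the function name and example values are ours), which also shows why (7.7)'s never-negative property matters:

```python
def adjusted_R2(R2, n, p):
    """The correction described in footnote 16:
    1 - R_adj^2 = (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - R2) * (n - 1) / (n - p - 1)

r2_full = adjusted_R2(1.0, 20, 5)   # a perfect fit stays at 1
r2_zero = adjusted_R2(0.0, 20, 5)   # a zero R^2 is pushed below zero
```

Unlike (7.7), this order-1/n correction can go negative, which is exactly the disadvantage the footnote contrasts with (7.7).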
Table 4
ME and sample canonical correlation coefficients
The top row of the table shows the true value of each squared canonical correlation. The first eight rows contain means over 100 trials and, in parentheses, the RMSEs around the true value of the squared ME canonical correlation. The next eight lines provide analogous results for the hybrid estimates obtained by interpreting (7.1) as the hybrid covariance matrix (with sample variances on the diagonal and ME covariances elsewhere). In the lower half of the table the correction (7.7) is applied to either the ME or the hybrid estimator. A comparison of means and RMSEs shows that for n ≥ 15 the corrected hybrid estimator is superior except with respect to the largest canonical correlation.
y_it = α_i + β_i x_0t + Σ_{j=1}^{N} π_ij x_jt + ε_it.    (7.8)
Our application of this system will not be to time series data but to per capita data for 15 countries (t = 1, ..., n = 15); see the Appendix for further details. The analysis of homogeneity and symmetry testing is beyond the scope of this chapter, because it would involve not only the frequency of rejections of the null hypothesis when this hypothesis is true but also the power of the test. Instead, we shall impose the homogeneity condition (2.2) by writing (7.8) in the form
S = [1/(15 - (N + 1))] Y'[I - X(X'X)⁻¹X'] Y,    Y = [y_1, ..., y_{N-1}],    (7.10)

is an unbiased estimator of Σ.
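A sketch of the estimator (7.10) in the general case, dividing by n minus the number of regressors (the function name and the toy data are ours):

```python
import numpy as np

def error_cov_estimator(X, Y):
    """S of (7.10): residual moment matrix of the LS regressions of the
    columns of Y on the columns of X, divided by n minus the number of
    regressors (15 - (N + 1) in the text's application)."""
    n, k = X.shape
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # residual maker
    return (Y.T @ M @ Y) / (n - k)

# With a constant as the only regressor, S reduces to the ordinary
# unbiased sample covariance matrix.
rng = np.random.default_rng(1)
Xc = np.ones((15, 1))
Y = rng.normal(size=(15, 3))
S = error_cov_estimator(Xc, Y)
```

This degenerate check (constant-only regression) is a convenient way to validate the divisor convention.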
The LS estimator of δ_i does not satisfy the symmetry constraint (2.3). We can write (2.3) in the form Rδ = 0, where δ is a vector with δ_i as the ith subvector (i = 1, ..., N - 1) and R is a matrix whose elements are all 0 or ±1, each row of R corresponding to π_ij = π_ji for some (i, j). The BLU estimator of δ constrained by (2.3) is
(Σ21 Σ11⁻¹ Σ12 - r_i² Σ22) z_i = 0,    (7.13)
17Deleting the first of N equations rather than the last amounts to a linear transformation of the dependent variables. Such a transformation affects the corrected ME error covariance matrix in a nontrivial way, since the rectangular cells in the second paragraph of Section 5 become parallelograms when the variables are linearly transformed.
18There are p = N independent variables in (7.9); the constant terms α_i are handled by the use of variances and covariances rather than second moments around zero.
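Equations (7.11)-(7.12) are illegible in the scan, so the following sketch uses the standard textbook form of a BLU estimator under the linear restrictions Rδ = 0; it may differ in detail from the authors' exact expression, and all names are ours:

```python
import numpy as np

def restrict_to_zero(delta_hat, C, R):
    """Textbook BLU adjustment of an unrestricted estimator delta_hat
    with covariance matrix C so that the result satisfies R @ delta = 0:
    delta_tilde = delta_hat - C R' (R C R')^{-1} R delta_hat."""
    A = R @ C @ R.T
    return delta_hat - C @ R.T @ np.linalg.solve(A, R @ delta_hat)

# Toy symmetry restriction: the first two coefficients must be equal.
delta_hat = np.array([1.0, 2.0, 3.0])
C = np.eye(3)
R = np.array([[1.0, -1.0, 0.0]])
delta_tilde = restrict_to_zero(delta_hat, C, R)
```

With C equal to the identity the adjustment simply averages the two tied coefficients, which is a quick sanity check on the formula.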
It is evident from these figures that there is a substantial efficiency gain from using Σ̂ rather than S in the symmetry-constrained estimation procedure, and that the efficiency loss from not knowing the true error covariance matrix is quite modest when Σ̂ is used as its estimator.
Another matter of importance is whether the standard errors of the symmetry-constrained estimates provide an adequate picture of the variability of these estimates around the true parameter values. This problem is pursued by the RMSSEs of Table 6. These are obtained from the matrix (7.12), with Σ interpreted as either S or Σ̂ or the true Σ, by averaging the diagonal elements of (7.12) over the 500 trials and then taking square roots of these averages. On comparing the RMSSEs based on S with the corresponding RMSEs we must conclude that the standard errors based on S tend to underestimate the variability of their coefficient estimates. Table 7 illustrates this more clearly by providing the ratio of the RMSSE to the corresponding RMSE for each estimator. The third column of this table shows the substantial understatement of the variability of the estimates based on S. The quartiles of the 35 ratios in each of the last three columns are as follows:
When the true Σ is used, the ratios are tightly distributed around unity. Use of Σ̂ yields ratios which are more widely dispersed around 1, but which represent a marked improvement over the use of S.
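The RMSSE and RMSE computations described above can be sketched as follows (function names and the toy inputs are ours):

```python
import numpy as np

def rmsse(var_diag_trials):
    """Average the estimated variances (diagonal elements of (7.12))
    over the trials, then take square roots of the averages."""
    return np.sqrt(np.asarray(var_diag_trials).mean(axis=0))

def rmse(estimates, truth):
    """Root-mean-square error of the coefficient estimates around the
    true parameter value, across trials."""
    est = np.asarray(estimates)
    return np.sqrt(((est - truth) ** 2).mean(axis=0))

# Two trials, one coefficient: estimated variances 4 and 16, estimates
# 1 and 3 around a true value of 2.
ratio = rmsse([[4.0], [16.0]]) / rmse([[1.0], [3.0]], 2.0)
```

The entries of Table 7's last three columns are ratios of exactly this kind, one per coefficient.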
Table 7
Ratios of RMSEs and RMSSEs of symmetry-constrained estimates

                    Ratio of RMSE based on true Σ
                    to RMSE based on                Ratio of RMSSE to RMSE
                       S       Σ̂                     S       Σ̂       Σ

Coefficients β_i
i = 1                0.70    0.92                  0.56    1.39    1.01
i = 2                0.69    0.90                  0.60    0.94    1.06
i = 3                0.89    0.99                  0.86    1.08    1.02
i = 4                0.63    0.99                  0.44    0.92    0.98
i = 5                0.79    0.90                  0.67    1.25    0.98
i = 6                0.83    0.94                  0.76    0.95    1.02
i = 7                0.67    0.94                  0.48    0.94    0.97

Diagonal Slutsky coefficients π_ii
i = 1                0.72    0.94                  0.57    1.52    1.03
i = 2                0.78    1.00                  0.60    0.88    1.00
i = 3                0.92    0.97                  0.84    0.95    1.00
i = 4                0.61    1.00                  0.38    0.83    0.97
i = 5                0.79    0.93                  0.67    1.35    1.04
i = 6                0.88    0.97                  0.79    0.85    1.03
i = 7                0.65    0.92                  0.44    0.87    0.96

Off-diagonal Slutsky coefficients π_ij
i = 1, j = 2         0.70    0.87                  0.52    0.93    0.99
i = 1, j = 3         0.83    0.86                  0.72    1.27    1.03
i = 1, j = 4         0.62    0.97                  0.42    0.98    0.99
i = 1, j = 5         0.77    0.93                  0.59    1.43    0.99
i = 1, j = 6         0.86    0.88                  0.66    1.00    0.96
i = 1, j = 7         0.63    0.96                  0.43    1.06    1.01
i = 2, j = 3         0.81    0.99                  0.71    0.90    1.02
i = 2, j = 4         0.63    0.99                  0.43    0.84    0.99
i = 2, j = 5         0.72    0.94                  0.55    0.98    1.01
i = 2, j = 6         0.81    1.00                  0.67    0.86    1.02
i = 2, j = 7         0.67    0.95                  0.45    0.87    0.98
i = 3, j = 4         0.78    1.01                  0.66    0.92    0.98
i = 3, j = 5         0.81    0.87                  0.71    1.11    1.01
i = 3, j = 6         0.90    1.01                  0.79    0.90    1.01
i = 3, j = 7         0.79    0.96                  0.69    0.91    0.99
i = 4, j = 5         0.68    0.95                  0.49    0.97    1.02
i = 4, j = 6         0.69    0.96                  0.54    0.86    1.03
i = 4, j = 7         0.60    1.01                  0.36    0.89    1.00
i = 5, j = 6         0.78    0.86                  0.65    0.97    1.05
i = 5, j = 7         0.72    0.92                  0.55    0.98    1.01
i = 6, j = 7         0.74    0.99                  0.58    0.93    1.01
8. Conclusion
Appendix
index,

w_i = p_i q_i / M,    d(log Q) = Σ_{i=1}^{N} w_i d(log q_i),    (A2)
while θ_i = ∂(p_i q_i)/∂M is the marginal budget share of good i and the Slutsky coefficient π_ij equals (p_i p_j / M) ∂q_i/∂p_j, the derivative ∂q_i/∂p_j measuring the effect of p_j on q_i when real income remains constant. The homogeneity property (2.2) reflects that proportionate changes in all prices do not affect any q_i when M also changes proportionately. The symmetry property (2.3) results from the assumed symmetry of the Hessian matrix of the utility function.
To apply (A1) to time series we write Dx_t = log(x_t / x_{t-1}) for any positive variable x with value x_t at time t. A finite-change approximation to (A1) is then given in (A3), where DQ_t = Σ_i w̄_it Dq_it and w̄_it is the arithmetic average budget share of good i at t - 1 and t. Equation (A3) is equivalent to (2.1) for y_it = w̄_it Dq_it, x_0t = DQ_t, x_jt = Dp_jt. For further details, see Theil (1980).
The numerical results reported in Section 7 are based on the analysis of Theil and Suhm (1981) of data on 15 countries collected by Kravis et al. (1978). These countries are the U.S., Belgium, France, West Germany, U.K., The Netherlands, Japan, Italy, Hungary, Iran, Colombia, Malaysia, Philippines, South Korea, and India. Let w_it be the per capita budget share of good i in country t. Working's (1943) model describes such a share as a linear function of the logarithm of income. To take into account that different countries have different relative prices, Working's model is postulated to hold at the geometric mean prices across countries, p̄_1, ..., p̄_N, where

log p̄_i = (1/15) Σ_{t=1}^{15} log p_it,    (A4)

which requires that a substitution term be added to the model. The result is that the demand system takes the form (7.8), with x_0t per capita real income of country t, x_jt = log(p_jt / p̄_j) and y_it equal to 1 - x_0t + Σ_j w_jt x_jt, multiplied by w_it. Then the sums over i = 1, ..., N of y_it, α_i, β_i and π_ij are equal to 1, 1, 0 and 0, respectively, implying that ε_1t, ..., ε_Nt are linearly dependent.
References
Byron, R. P. (1970). The restricted Aitken estimation of sets of demand equations. Econometrica
39, 816-830.
Christensen, L. R., Jorgenson, D. W. and Lau, L. J. (1975). Transcendental logarithmic utility
functions. American Economic Review 65, 367-383.
Conway, D. and Theil, H. (1980). The maximum entropy moment matrix with missing values.
Econom. Lett. 5, 319-322.
Deaton, A. S. (1974). The analysis of consumer demand in the United Kingdom. Econometrica 42,
341-367.
Deaton, A. S. (1975). Models and Projections of Demand in Post-War Britain. Chapman and Hall,
London.
Fiebig, D. G. (1980). Maximum entropy canonical correlations. Econom. Lett. 6, 345-348.
Fiebig, D. G. (1982). The maximum entropy distribution and its covariance matrix. Doctoral
dissertation. Department of Economics, University of Southern California.
Haff, L. R. (1980). Empirical Bayes estimation of the multivariate normal covariance matrix. Ann. Statist. 8, 586-597.
Hooper, J. W. (1959). Simultaneous equations and canonical correlation theory. Econometrica 27,
245-256.
Kagan, A. M., Linnik, Y. V. and Rao, C. R. (1973). Characterization Problems in Mathematical
Statistics. Wiley, New York.
Kidwai, S. A. and Theil, H. (1981). Simulation evidence on the ridge and the shrinkage of the
maximum entropy variance. Econom. Lett. 8, 59-61.
Kravis, I. B., Heston, A. W. and Summers, R. (1978). International Comparisons of Real Product
and Purchasing Power. The Johns Hopkins University Press, Baltimore, MD.
Laitinen, K. (1978). Why is demand homogeneity so often rejected? Econom. Lett. 1, 187-191.
Lawley, D. N. (1956). Tests of significance for the latent roots of covariance and correlation
matrices. Biometrika 43, 128-136.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
Lluch, C. (1971). Consumer demand functions, Spain, 1958-1964. European Economic Review 2,
277-302.
Malinvaud, E. (1980). Statistical Methods of Econometrics, 3rd ed. North-Holland, Amsterdam.
Mariano, R. S. and Sawa, T. (1972). The exact finite-sample distribution of the limited-information
maximum likelihood estimator in the case of two included exogenous variables. J. Amer. Statist.
Assoc. 67, 159-165.
Meisner, J. F. (1979). The sad fate of the asymptotic Slutsky symmetry test for large systems.
Econom. Lett. 2, 231-233.
Meisner, J. F. (1981). Appendix to Theil and Suhm (1981).
Schuster, E. F. (1973). On the goodness-of-fit problem for continuous symmetric distributions. J.
Amer. Statist. Assoc. 68, 713-715.
Schuster, E. F. (1975). Estimating the distribution function of a symmetric distribution. Biometrika
62, 631-635.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
Theil, H. (1980). The System-Wide Approach to Microeconomics. The University of Chicago Press,
Chicago, IL.
Theil, H. and Fiebig, D. G. (1984). Exploiting Continuity: Maximum Entropy Estimation of
Continuous Distributions. Ballinger, Cambridge, MA.
Theil, H. and Laitinen, K. (1980). Singular moment matrices in applied econometrics. In: P. R.
Krishnaiah, ed., Multivariate Analysis--V, 629-649. North-Holland, Amsterdam.
Theil, H. and Meisner, J. F. (1980). Simultaneous equation estimation based on maximum entropy
moments. Econom. Lett. 5, 339-344.
Theil, H. and Suhm, F. E. (1981). International Consumption Comparisons: A System-Wide Approach. North-Holland, Amsterdam.
Theil, H., Kidwai, S. A., Yalnizoğlu, M. A. and Yellé, K. A. (1982). Estimating characteristics of a symmetric continuous distribution. CEDS Discussion Paper 74. College of Business Administration, University of Florida.
Working, H. (1943). Statistical laws of family expenditure. J. Amer. Statist. Assoc. 38, 43-56.
Subject Index

Qualitative robustness, 119, 122, 123, 125, 136
Quantile sampling, 353, 356
Quenouille's test statistic, 181
Random coefficient autoregressions, 416, 444
Random coefficient variation, 415
Random fields, 357, 358
Random sampling, 342, 343, 355
Random vibrational system, 32
Random walk, 226
R-array, 182
Rate of convergence, 353-358, 360
Realignment, 381, 382
Real-time identification, 202
Recursive identification, 202, 213
Recursive least squares algorithm, 218
Recursive methods, 190
Recursive prediction error methods, 202
Recursive time series, 231
Regression with stationary errors, 171
Regular sampling, 343, 349
Relatively smooth with noise, 366
Relevant log-likelihood, 369
Reproducing kernel Hilbert space, 327, 338, 339, 340, 341, 350, 359
Residual autocovariance estimates, 136, 137, 138-140
Robust filter cleaners, 144
Robustness, 123, 124, 125, 356
Runge-Kutta method, 56
Sampling designs, 337, 342, 343
S-array, 182
Score-test statistic, 428
Seasonal adjustment model, 439
Seasonal factors, 414
Second-order efficient, 440
Second-order stationary, 417
Sequential parameter estimation, 202
Shift operator, 288
Ship rolling, 26
Signal associated noise, 383
Signal characteristic, 294
Signal process, 294
Signals in noise, 341, 345, 347
Signal-to-noise ratio, 342, 348, 366
Simple random sampling, 343, 346, 347, 348, 353, 356, 357
Simultaneous equation estimation, 453, 461
Small sample distribution, 429
Smoothed random walk, 226
Spectral characteristic, 325
Spectral function, 322, 323
Spectral matrix function, 285
Spectral representation, 365
Spectrum of the process, 312
Speech recognition, 389-412
State-space models, 192, 258, 263, 264, 374
State-space representation, 157
State-variable estimation, 222
State-variable feedback, 224
Stationarity condition, 90
Stationary covariances, 351, 360
Stationary independent increments, 350
Stationary invertible processes, 439
Statistical ergodic theorem, 318
Stochastic differential equation, 31
Stochastic dynamical systems, 51, 52, 67
Strassen characterization, 123
Stratified sampling, 343, 346, 348, 353, 355, 356, 358
Strict stationarity, 1, 417
Strong consistency, 370, 421
Strongly harmonizable time series, 323
Strong robustness, 124
Structural identifiability, 271, 272
Sufficient statistic, 341, 345
Systematic sampling, 344, 355
Tensor notation, 445
Testing for coefficient evolution, 427
Three-step procedure, 444
Time-series influence curve, 152
Time-variable parameter estimation, 232
Transfer function models, 104, 215, 216, 217, 259
Trigonometric polynomials, 371
Two-step estimator(s), 440, 444
Typical diagonal element, 379
Unequally spaced data, 157
Unit roots, 434, 439
Variance component models, 173
Varying coefficient models, 413, 414
Vector ARMA models, 87, 116
Vibration systems, 33
Vitali variation, 323
Weakly harmonizable time series, 323
Weakly stationary process, 312
Weakly stationary time series, 322
Wide band, 371
Wiener, 350, 352, 356
Yule-Walker equation, 180
Yule-Walker estimates, 395, 396, 406, 409
Handbook of Statistics
Contents of Previous Volumes

Volume 1. Analysis of Variance
Edited by P. R. Krishnaiah
1980 xviii + 1002 pp.
Volume 2. Classification, Pattern Recognition and Reduction of Dimensionality
Edited by P. R. Krishnaiah and L. N. Kanal
1982 xxii + 903 pp.

Volume 3. Time Series in the Frequency Domain
Edited by D. R. Brillinger and P. R. Krishnaiah
1983 xiv + 485 pp.

Volume 4. Nonparametric Methods
Edited by P. R. Krishnaiah and P. K. Sen
1984 xx + 968 pp.