
Metrika (1991) 38:37-60

On Kalman Filtering, Posterior Mode Estimation and Fisher Scoring in Dynamic Exponential Family Regression
By L. Fahrmeir 1 and H. Kaufmann 1,2

Summary: Dynamic exponential family regression provides a framework for nonlinear regression analysis with time dependent parameters β_0, β_1, ..., β_t, ..., dim β_t = p. In addition to the familiar conditionally Gaussian model, it covers e.g. models for categorical or counted responses. Parameters can be estimated by extended Kalman filtering and smoothing. In this paper, further algorithms are presented. They are derived from posterior mode estimation of the whole parameter vector (β_0', ..., β_t')' by Gauss-Newton resp. Fisher scoring iterations. Factorizing the information matrix into block-bidiagonal matrices, algorithms can be given in a forward-backward recursive form where only inverses of "small" p×p-matrices occur. Approximate error covariance matrices are obtained by an inversion formula for the information matrix, which is explicit up to p×p-matrices.

1 Introduction

Let a regression relationship between a response variable y_t and covariates x_t be observed sequentially in time. Then it is often a plausible assumption that the unknown regression parameters are also time dependent:

y_t = x_t' \beta_t + \epsilon_t , \qquad t = 1, 2, \ldots \qquad (1.1)

Supposing a linear transition equation

\beta_t = T_t \beta_{t-1} + v_t , \qquad t = 1, 2, \ldots \qquad (1.2)

1 Ludwig Fahrmeir and Heinz Kaufmann, Universität Regensburg, Lehrstuhl für Statistik, Universitätsstraße 31, D-8400 Regensburg. 2 Heinz Leo Kaufmann, my friend and coauthor for many years, died in a tragic rock-climbing accident in August 1989. This paper is dedicated to his memory.


for the regression parameters and making certain assumptions on the noise processes {ε_t}, {v_t}, equations (1.1) and (1.2) constitute a linear state space model (e.g. Sage and Melsa 1971, ch. 7; Anderson and Moore 1979, ch. 2). Given the observations y_1, ..., y_t, estimation of β_t ("filtering") and of β_0, ..., β_{t-1} ("smoothing"), together with corresponding error covariance matrices, is of primary interest. These tasks are solved recursively by the linear Kalman filter and the fixed interval smoother.

The observation equation (1.1) is appropriate for metric responses. Generalized linear models (Nelder and Wedderburn 1972) provide a framework for regression analysis where the distribution of the response variable y_t belongs to a natural exponential family. Univariate examples are the normal, binomial, Poisson and gamma distribution, multivariate examples the multinormal and the multinomial distribution. Thus, besides metric Gaussian responses, generalized linear models allow for categorical, counted and nonnegative metric response variables. Since such responses are common in longitudinal as well as cross-sectional analyses, it is of considerable interest to extend static generalized linear models to a dynamic setting like the one above. Such extensions have been suggested by West, Harrison and Migon (1985, univariate responses) and Fahrmeir (1988, multivariate responses). In both papers, the observation equation (1.1) is replaced by specifying the conditional distribution of y_t, given x_t and β_t, in analogy to the corresponding specification in static generalized linear models. In order to facilitate the use of the discount concept (Ameen and Harrison 1985), the transition equation (1.2) is generalized somewhat in the first paper, whereas it is retained in the second.

A full Bayesian filter would give a recursive update of the whole posterior density p(β_t | y_1, ..., y_t). In the linear model (1.1), (1.2) with Gaussian errors, this posterior density is also Gaussian, whence updating the mean and the covariance matrix suffices. In general, however, posterior densities are not available in closed form, forcing numerical integration. Using spline functions, such a filter has recently been presented by Kitagawa (1987). Since this approach becomes computationally unfeasible for higher dimensional parameter vectors and large data sets, recursive filters which are similar to the linear Kalman filter are also attractive. Similar remarks apply to smoothing. Based on discounting and on using conjugate prior-posterior distributions for the linear predictor x_t'β_t, West et al. (1985) present such a filter. However, their approach raises a number of problems, in particular in extension to multivariate responses. These problems are discussed in Fahrmeir (1988), and a different filter approximating the posterior mode is proposed. It is termed extended Kalman filter, since it is analogous to the extended Kalman filter for Gaussian responses which applies if x_t'β_t and T_t β_t in (1.1), (1.2) are generalized to nonlinear smooth functions of β_t (Sage and Melsa 1971, ch. 9; Anderson and Moore 1979, ch. 8).

In this paper, the relevant models are defined in Section 2.1. They are referred to as models for dynamic exponential family regression. Actually, they extend dynamic generalized linear models (Section 2.2). They also cover conditionally Gaussian responses common in extended Kalman filtering (Section 2.3).
Some quantities of relevance in estimation are noted in 2.4. In Section 3, first the extended Kalman filter and smoother are given. These algorithms apply if smoothing is carried out after t consecutive filter steps. At the end of 3.2, we obtain an integrated algorithm where smoothing back takes place after each filter step.

In Section 4, the estimation problem is taken up from a different point of view. We consider posterior mode estimation of the whole sequence β_0, ..., β_t, i.e. maximizing the posterior density of β_0, ..., β_t, given y_1, ..., y_t. Algorithmically, this can be performed by Gauss-Newton or Fisher scoring iterations, replacing in the latter case the random information matrix (negative second derivatives of the log posterior density) by some other information matrix. For static generalized linear models where the β-vectors are all equal, this approach is followed by West (1985, Section 4.1). In the dynamic setting, it may at first seem unfeasible computationally: if p is the dimension of β_s, s = 0, ..., t, then the total number of parameters is p(t+1), and a correspondingly large system of equations must be solved at each iteration. However, the information matrix, which forms the coefficient matrix of this system, has a block tridiagonal structure (Section 4.1). This structure is investigated further in 4.2, where a proposition on factorization and inversion of the information matrix is given. This proposition is basic for the results which follow. The factorization part leads to a simple and efficient forward-backward recursive implementation of Gauss-Newton resp. Fisher scoring iterations (Section 4.3). The inversion part yields approximate error covariance matrices. In both parts, only inverses of "small" p×p-matrices occur. If Gauss-Newton (Fisher scoring) iterations are applied sequentially in time, one may conjecture that in many cases the new observation does not change estimates too drastically, with the implication that a single iteration suffices. The resulting algorithm is considered in Section 4.4.

Relationships between the various algorithms are discussed in Section 5. It is shown that they form a hierarchy of approximations to posterior mode estimation. With rather different arguments, extended Kalman filtering and smoothing is derived as an approximation to posterior mode estimation in Sage and Melsa (1971, ch. 9), for Gaussian error sequences. Although their arguments extend to exponential families (Fahrmeir 1988), the derivation given here sheds new light on this relationship. It suggests forward-backward recursive estimation algorithms which seem to be new also in the Gaussian case, and it clarifies the nature of approximations. Quantitative assertions on the quality of approximations are desirable. This topic will be treated in subsequent work.


2 Dynamic Exponential Family Regression


2.1 Assumptions

The models of this paper form a dynamic extension of nonlinear exponential family regression models. Let y_1, y_2, ... be a sequence of observations (responses), where each y_t ranges in a subset Y of R^q. In parallel, there evolves a sequence β_0, β_1, ... of unobservable p-dimensional state or parameter vectors. Let

y^*_t = (y_1', \ldots, y_t')' , \qquad \beta^*_t = (\beta_0', \ldots, \beta_t')' , \qquad t = 0, 1, 2, \ldots

denote the first t observations resp. the first t+1 parameter vectors, where y*_0 is to be read as an empty vector. In view of the dynamic nature, it is natural to model the conditional distribution of y_t given y*_{t-1}, the past observations, and β*_t, the history of the parameter process, including the actual parameter vector. Specifically, denoting conditional densities by p(· | ·) as we shall do throughout the paper, we suppose

p(y_t \mid \beta^*_t, y^*_{t-1}) = p(y_t \mid \beta_t, y^*_{t-1}) , \qquad t = 1, 2, \ldots . \qquad (2.1)

Loosely speaking, given y*_{t-1}, the actual parameter vector β_t contains the same information on y_t as the whole β*_t. The conditional density in (2.1) is assumed to be of the natural exponential type:

\log p(y_t \mid \beta_t, y^*_{t-1}) = \theta_t' y_t - b_t(\theta_t) - c_t(y_t) , \qquad t = 1, 2, \ldots , \qquad (2.2)

where c_t is a measurable function on Y. The density in (2.2) is fully specified once the natural parameter θ_t is given as a function of the conditioning quantities. Let Θ denote the natural parameter space of (2.2) (e.g. Fahrmeir and Kaufmann 1985). We suppose that Θ contains interior points. Then in the interior Θ⁰ the function b_t is differentiable infinitely often, and the conditional mean and covariance matrix are given by

\mu_t = \frac{\partial b_t(\theta_t)}{\partial \theta_t} , \qquad (2.3)

\Sigma_t = \frac{\partial^2 b_t(\theta_t)}{\partial \theta_t \, \partial \theta_t'} . \qquad (2.4)


We assume that Σ_t is positive definite (briefly Σ_t > 0) on Θ⁰, implying that (2.3) defines a one-to-one mapping from Θ⁰ onto

M = \frac{\partial b_t}{\partial \theta_t} (\Theta^0) .

Thus, instead of θ_t, one can equivalently specify the conditional mean μ_t. We allow for the general specification

\mu_t = h_t(\beta_t, y^*_{t-1}) , \qquad t = 1, 2, \ldots , \qquad (2.5)

where h_t: R^p × Y^{t-1} → M is an arbitrary measurable function of the conditioning quantities, only subject to the requirement of being two times continuously differentiable with respect to β_t. Analogous modelling in the static setting has been proposed by Jørgensen (1983), for more general densities than those of the natural exponential type.

Of course, in regression applications μ_t depends not only on lagged values y*_{t-1} of the response, but also on covariates, typically forming a parallel process {x_t}. If such covariates are present, we assume that they are deterministic, implying that formally they can be absorbed into the subscript of h_t. In this sense, (2.5) covers the seemingly more general

\mu_t = h_t(\beta_t, y^*_{t-1}, x_1, \ldots, x_t) .

Conditioning in (2.1), (2.2) additionally on x_1, ..., x_t, a random covariate process can also be handled. Under a further conditional independence assumption, interpreting the likelihood given later as a partial likelihood, results continue to hold. For simplicity, we do not discuss this issue any further.

Equation (2.5) corresponds to (1.1) or, more generally, to the observation equation in nonlinear filtering. It must be supplemented by a transition equation (compare (1.2)), assumed to be linear for simplicity:

\beta_t = T_t \beta_{t-1} + v_t , \qquad t = 1, 2, \ldots ; \qquad \beta_0 = a_0 + v_0 . \qquad (2.6)

The p×p-transition matrices T_1, T_2, ... are nonrandom, and a_0 is a nonrandom p-vector. The error process {v_t} is supposed to be nondegenerate Gaussian white noise, i.e. a sequence of independent random variables with


v_t \sim N(0, Q_t) , \qquad t = 0, 1, \ldots , \qquad (2.7)

where Q_t is positive definite (Q_t > 0). In (conditionally) Gaussian filtering, the noise processes of the observation and the transition equations are usually assumed to be independent. In our more general setting, this assumption must be replaced by a further conditional independence assumption:

p(\beta_t \mid \beta^*_{t-1}, y^*_{t-1}) = p(\beta_t \mid \beta^*_{t-1}) , \qquad t = 1, 2, \ldots . \qquad (2.8)

In view of (2.6) and (2.7), this is equivalent to the apparently stronger assumption

p(\beta_t \mid \beta^*_{t-1}, y^*_{t-1}) = p(\beta_t \mid \beta_{t-1}) , \qquad t = 1, 2, \ldots . \qquad (2.9)

Throughout the paper, we suppose that the probability model is completely specified, i.e. contains no unknown deterministic "hyperstructural" parameters. Thus

T_1, T_2, \ldots , \qquad a_0 , \qquad Q_0, Q_1, \ldots

are assumed to be known, as well as any other nonrandom quantities appearing in the exponential family density or in the model equation for μ_t.
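As a concrete special case (our own illustration, not singled out in the paper), the simplest completely specified transition structure is a random walk prior, obtained by setting T_t = I:

```latex
% Random-walk transition model: each parameter follows its own random walk
% with known covariance matrices Q_t.
\beta_t = \beta_{t-1} + v_t , \qquad v_t \sim N(0, Q_t) , \qquad t = 1, 2, \ldots ,
\qquad \beta_0 = a_0 + v_0 .
```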

2.2 Dynamic Generalized Linear Models

This model family is obtained by specializing the observation equation (2.5) to

\mu_t = h(Z_t' \beta_t) , \qquad t = 1, 2, \ldots , \qquad (2.10)

where h: R^r → M is the two times continuously differentiable "response function", and Z_t is a p×r-matrix depending on lagged values y*_{t-1} and on covariates. It is assumed to be known before the observation y_t is made, so that it may be referred to as the predetermined matrix (in the observation equation). Dynamic modeling of {Z_t} can be performed along the lines described in Kaufmann (1987), Fahrmeir and Kaufmann (1987).

If q = r = 1, we get a family of univariate dynamic generalized linear models which are related to, yet different from the models of West, Harrison and Migon (1985). For continuous responses, this family covers conditionally Gaussian and


gamma models. If the range Y of the observations is discrete, the density in (2.1), (2.2) must be taken with respect to counting measure, leading e.g. to the log-linear Poisson model for counted data and to logit and probit models for binomial data. Apart from the conditionally Gaussian model (see 2.3), the most interesting multivariate (q > 1) models are those for multicategorical or multinomial observations. More details are given by Fahrmeir (1988). In contrast to the standard formulation where q = r and h is a one-to-one mapping, q and r may differ in the present setting. Therefore (2.10) includes a dynamic extension of the composite link function models of Baker and Thompson (1981).

Let us finally note an additional restriction not necessary in static generalized linear models. Since β_t can vary in the whole R^p, the response function h must be defined on the whole R^r. This rules out e.g. h(η) = 1/η, η > 0, which is the natural response function for the gamma distribution, or the linear Poisson model (h(η) = η > 0). The problem may be overcome by using other prior distributions, but this would destroy the simplicity of the normal priors approach.
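To make the specification (2.10) concrete, the following minimal sketch (our own, in Python with numpy; all names are hypothetical) sets up a univariate dynamic logit model with q = r = 1, an intercept and one covariate:

```python
import numpy as np

def h(eta):
    """Logistic response function; defined on all of R, as required above."""
    return 1.0 / (1.0 + np.exp(-eta))

def make_Z(x_t):
    """Predetermined matrix Z_t (here a p-vector, p = 2, r = 1)."""
    return np.array([1.0, x_t])

beta_t = np.array([0.5, -1.0])     # hypothetical state vector at time t
Z_t = make_Z(x_t=0.3)              # known before y_t is observed
mu_t = h(Z_t @ beta_t)             # conditional success probability (2.10)
sigma_t = mu_t * (1.0 - mu_t)      # conditional variance Sigma_t (scalar, q = 1)
```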

2.3 Conditionally Gaussian Observations

The standard form of the nonlinear observation equation (e.g. Anderson and Moore 1979, ch. 8, formula (1.2)) is

y_t = h_t(\beta_t) + \epsilon_t , \qquad t = 1, 2, \ldots , \qquad (2.11)

where {ε_t} is a white noise sequence, independent of the error sequence {v_t} of the transition equation, with ε_t ~ N(0, Σ_t), Σ_t > 0. Thus, given β_t and y*_{t-1}, the observation y_t is conditionally Gaussian, and this fits into our framework. Equation (2.5) obviously holds, and the exponential family assumption is met with

\theta_t = \Sigma_t^{-1} \mu_t , \qquad b_t = \theta_t' \Sigma_t \theta_t / 2 .

The covariance matrix Σ_t is a (nonrandom) parameter of the conditional density supposed to be known, according to the assumptions at the end of 2.1. More generally than (2.11), our hypotheses allow for conditionally Gaussian observations with

\mu_t = h_t(\beta_t, y^*_{t-1})

depending on previous observations.


More specifically, if the model (2.10) is taken with h the identity mapping, then we get back the familiar linear state space model.

2.4 Contribution to the Log Likelihood and Related Quantities

Let us give some preliminary expressions following from the exponential family assumption, which are needed later on. According to Section 2.1, instead of μ_t we can model the natural parameter by

\theta_t = w_t(\beta_t, y^*_{t-1}) , \qquad t = 1, 2, \ldots , \qquad (2.12)

with a function w_t: R^p × Y^{t-1} → Θ⁰. The relationship between w_t and h_t is

h_t = \frac{\partial b_t}{\partial \theta_t} \circ w_t , \qquad w_t = \left( \frac{\partial b_t}{\partial \theta_t} \right)^{-1} \circ h_t . \qquad (2.13)

Stressing dependence on the parameter β_t, we write

\mu_t(\beta_t) = h_t(\beta_t, y^*_{t-1}) , \qquad \theta_t(\beta_t) = w_t(\beta_t, y^*_{t-1}) ,

and Σ_t(β_t) for the conditional covariance matrix, inserting θ_t(β_t) into (2.4). Define further the first derivatives

H_t(\beta_t) = \frac{\partial h_t'(\beta_t, y^*_{t-1})}{\partial \beta_t} , \qquad W_t(\beta_t) = \frac{\partial w_t'(\beta_t, y^*_{t-1})}{\partial \beta_t} .

From (2.13), we get

H_t(\beta_t) = W_t(\beta_t) \Sigma_t(\beta_t) . \qquad (2.14)

Finally, denoting the components of w_t by w_{tj}, j = 1, ..., q, let

V_{tj}(\beta_t) = \frac{\partial^2 w_{tj}}{\partial \beta_t \, \partial \beta_t'} , \qquad j = 1, \ldots, q .

The contribution of the observation y_t to the log likelihood is

l_t(\beta_t) = \theta_t' y_t - b_t(\theta_t) - c_t(y_t) , \qquad (2.15)

inserting θ_t = θ_t(β_t). The contribution to its first derivative, the score function, is

r_t(\beta_t) = W_t (y_t - \mu_t) = H_t \Sigma_t^{-1} (y_t - \mu_t) , \qquad (2.16)

with the quantities on the right evaluated at β_t, as in the sequel. As the contribution of y_t to the information on β_t, we may consider the random information

\tilde{G}_t(\beta_t) = - \frac{\partial^2 l_t}{\partial \beta_t \, \partial \beta_t'} ,

in conformity with common practice in posterior mode estimation with a normal prior (e.g. West 1985, (4.4)). The conditional information

G_t(\beta_t) = E( \tilde{G}_t(\beta_t) \mid \beta_t, y^*_{t-1} )

seems a reasonable approximation which is in general easier to evaluate and has the advantage of being always positive semidefinite. In our setting, we have

G_t(\beta_t) = W_t \Sigma_t W_t' = H_t \Sigma_t^{-1} H_t' , \qquad (2.17)

\tilde{G}_t(\beta_t) = G_t(\beta_t) - \sum_{j=1}^{q} V_{tj} (y_{tj} - \mu_{tj}) . \qquad (2.18)

The algorithms which follow can always be applied with the conditional information matrix. Provided certain positive definiteness conditions hold, they can also be applied with the random information matrix. Therefore they are written down with an information matrix R_t which may be G_t or G̃_t. For a dynamic generalized linear model (2.10) with a natural response function (or link function, in the more standard terminology), we have a linear model θ_t = Z_t'β_t for the natural parameter, leading to

r_t = Z_t (y_t - \mu_t) , \qquad (2.19)

R_t = G_t = \tilde{G}_t = Z_t \Sigma_t Z_t' . \qquad (2.20)

In particular, both information matrices coincide.
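As an illustration of (2.19), (2.20) (our own sketch, not from the paper), the contributions r_t and R_t for a log-linear Poisson model with natural link θ_t = Z_t'β_t and q conditionally independent counts can be computed as follows; numpy is assumed and all names are hypothetical:

```python
import numpy as np

def poisson_contributions(Z_t, beta_t, y_t):
    """Score and information contributions (2.19), (2.20) for the Poisson model."""
    theta_t = Z_t.T @ beta_t            # natural parameter, q-vector
    mu_t = np.exp(theta_t)              # conditional mean
    Sigma_t = np.diag(mu_t)             # conditional covariance matrix
    r_t = Z_t @ (y_t - mu_t)            # (2.19)
    R_t = Z_t @ Sigma_t @ Z_t.T         # (2.20): R_t = G_t = tilde G_t
    return r_t, R_t

Z_t = np.array([[1.0, 0.0], [0.2, 1.0]])          # hypothetical p x q design, p = q = 2
r_t, R_t = poisson_contributions(Z_t, np.zeros(2), np.array([1.0, 2.0]))
```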


3 Extended Kalman Filtering and Smoothing

3.1 Filtering

Filtering and smoothing algorithms provide estimates of parameter vectors using the observations up to a certain time point, together with (approximate) conditional error covariance matrices. Let β̂_{s|t} denote the estimate of β_s based on the observation of y*_t = (y_1', ..., y_t')', and Σ_{s|t} the corresponding approximate conditional error covariance matrix, s, t = 0, 1, ... . The extended Kalman filter gives estimates β̂_{t|t-1}, Σ_{t|t-1} (prediction step) resp. β̂_{t|t}, Σ_{t|t} (correction step), proceeding recursively. The following filtering algorithm is analogous to the standard one for conditionally Gaussian observations (e.g. Sage and Melsa 1971, Section 9.3; Anderson and Moore 1979, ch. 8). It can be justified as an approximate posterior mode filter by extending the arguments in Sage and Melsa (1971) to exponential families, see Fahrmeir (1988).

Extended Kalman filter

1. Initialization:

\hat\beta_{0|0} = a_0 , \qquad \Sigma_{0|0} = Q_0 .

For t = 1, 2, ...:

2. Prediction step:

\hat\beta_{t|t-1} = T_t \hat\beta_{t-1|t-1} , \qquad \Sigma_{t|t-1} = T_t \Sigma_{t-1|t-1} T_t' + Q_t .

3. Correction step:

\hat\beta_{t|t} = \hat\beta_{t|t-1} + K_t (y_t - \mu_t) , \qquad \Sigma_{t|t} = (I - K_t H_t') \Sigma_{t|t-1} ,

where

K_t = \Sigma_{t|t-1} H_t [ H_t' \Sigma_{t|t-1} H_t + \Sigma_t ]^{-1}


is the Kalman gain, and μ_t, Σ_t, H_t are evaluated at β̂_{t|t-1}. By an application of the matrix inversion lemma (e.g. Anderson and Moore 1979, p. 138) to the Kalman gain, the correction step can be given in a different form: it is easily verified that

\Sigma_{t|t} = ( \Sigma_{t|t-1}^{-1} + H_t \Sigma_t^{-1} H_t' )^{-1} = ( \Sigma_{t|t-1}^{-1} + G_t )^{-1}

and

K_t (y_t - \mu_t) = \Sigma_{t|t} H_t \Sigma_t^{-1} (y_t - \mu_t) = \Sigma_{t|t} r_t .

A slightly different form is obtained if the conditional information G_t is replaced by the random information G̃_t. This can be done provided the matrices Σ_{t|t-1}^{-1} + G̃_t, t = 1, 2, ..., occurring in the modified correction step are positive definite. This condition makes sense, since the inverse of Σ_{t|t-1}^{-1} + G̃_t forms a covariance matrix estimate. According to the remarks at the end of 2.4, the correction step is written down with the general R_t, which may be G_t or G̃_t.

3'. Correction step:

\Sigma_{t|t} = ( \Sigma_{t|t-1}^{-1} + R_t )^{-1} , \qquad \hat\beta_{t|t} = \hat\beta_{t|t-1} + \Sigma_{t|t} r_t ,

with r_t and R_t evaluated at β̂_{t|t-1}.

As will be seen in Section 5, the filter is closely connected to finding posterior mode estimates by Gauss-Newton resp. Fisher scoring iterations. A first indication of this relationship is provided by the form 3' of the correction step: thinking of covariance matrices and information matrices as inverses of each other, Σ_{t|t-1}^{-1} is the (estimated) information on β_t given y*_{t-1}. The matrix R_t is the information on β_t contributed by the new observation y_t, and the sum Σ_{t|t-1}^{-1} + R_t is the information on β_t given y*_t. Inverting this matrix, we get the covariance matrix Σ_{t|t}. The update step β̂_{t|t} = ... has just the form of a single Gauss-Newton resp. Fisher scoring iteration.
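A minimal numerical sketch of the filter with correction step 3' (our own, written for the log-linear Poisson model of 2.2; numpy assumed, all names hypothetical) might look as follows:

```python
import numpy as np

def contributions(Z_t, beta, y_t):
    """r_t and R_t of (2.19), (2.20) for the Poisson model with natural link."""
    mu = np.exp(Z_t.T @ beta)
    return Z_t @ (y_t - mu), Z_t @ np.diag(mu) @ Z_t.T

def extended_kalman_filter(y, Z, T, Q, a0, Q0):
    """Lists y, Z, T, Q are indexed t = 1,...,n; the entry at index 0 is unused."""
    beta_f, Sig_f = a0.copy(), Q0.copy()                   # beta_{0|0}, Sigma_{0|0}
    estimates = []
    for t in range(1, len(y)):
        beta_p = T[t] @ beta_f                             # prediction step
        Sig_p = T[t] @ Sig_f @ T[t].T + Q[t]
        r_t, R_t = contributions(Z[t], beta_p, y[t])       # evaluated at beta_{t|t-1}
        Sig_f = np.linalg.inv(np.linalg.inv(Sig_p) + R_t)  # correction step 3'
        beta_f = beta_p + Sig_f @ r_t
        estimates.append((beta_f, Sig_f))
    return estimates
```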

3.2 Smoothing

If an extended Kalman filter is run until time t, estimates β̂_{0|0}, ..., β̂_{t|t} are available. The full information is only used in estimating β_t. Improved estimates for β_0, ..., β_{t-1} are obtained with the following backward recursive smoother (Sage and Melsa 1971, ch. 9).

Smoother 1

For s = t, ..., 1:

\hat\beta_{s-1|t} - \hat\beta_{s-1|s-1} = B_s ( \hat\beta_{s|t} - \hat\beta_{s|s-1} ) ,

\Sigma_{s-1|t} - \Sigma_{s-1|s-1} = B_s ( \Sigma_{s|t} - \Sigma_{s|s-1} ) B_s' ,

where

B_s = \Sigma_{s-1|s-1} T_s' \Sigma_{s|s-1}^{-1} .
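The backward recursion of Smoother 1 can be sketched as follows (our own illustration; it assumes numpy and that the filtered and predicted quantities have been stored by an extended Kalman filter run):

```python
import numpy as np

def smoother1(beta_filt, Sig_filt, beta_pred, Sig_pred, T):
    """beta_filt, Sig_filt indexed s = 0..t; beta_pred, Sig_pred, T indexed 1..t
    (entry 0 unused).  Returns beta_{s|t}, Sigma_{s|t}, s = 0,...,t."""
    t = len(beta_filt) - 1
    beta_s, Sig_s = [None] * (t + 1), [None] * (t + 1)
    beta_s[t], Sig_s[t] = beta_filt[t], Sig_filt[t]
    for s in range(t, 0, -1):
        B_s = Sig_filt[s - 1] @ T[s].T @ np.linalg.inv(Sig_pred[s])
        beta_s[s - 1] = beta_filt[s - 1] + B_s @ (beta_s[s] - beta_pred[s])
        Sig_s[s - 1] = Sig_filt[s - 1] + B_s @ (Sig_s[s] - Sig_pred[s]) @ B_s.T
    return beta_s, Sig_s
```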

This smoother is essentially the same as the fixed interval smoother in the linear Gaussian case, except that Σ_{s-1|s-1}, Σ_{s|s-1} and therefore B_s, s = 1, ..., t, depend on β̂_{1|0}, ..., β̂_{t-1|t-2}.

The fixed interval smoother 1 provides improved estimates β̂_{s|t} for β_s, s < t, only after a block of t consecutive filter steps. For recursive Gauss-Newton and Fisher scoring posterior mode filtering and smoothing in Section 4, it is of interest to smooth backwards after each filter step. To do so, one needs a smoother based on

\hat\beta_{0|t-1}, \ldots, \hat\beta_{t|t-1}, \hat\beta_{t|t} \qquad (3.1)

instead of

\hat\beta_{0|0}, \hat\beta_{1|0}, \ldots, \hat\beta_{t|t-1}, \hat\beta_{t|t} . \qquad (3.2)

Writing down smoother 1 first for observations until time t and then for observations until time t-1 and forming the difference, we find

\hat\beta_{s-1|t} - \hat\beta_{s-1|t-1} = B_s ( \hat\beta_{s|t} - \hat\beta_{s|t-1} ) . \qquad (3.3)

Analogously, we get the corresponding equation for the covariance matrices. However, in evaluating (3.3), the covariance matrices determining B_1, ..., B_t should also depend on β̂_{1|t-1}, ..., β̂_{t-1|t-1} instead of β̂_{1|0}, ..., β̂_{t-1|t-2}. This requirement can be fulfilled by separating the covariance recursion from the filter. The same thing will be needed later with further sequences, whence the following

scheme is devised for an arbitrary sequence β_1, β_2, ... . Note that the scheme is forward recursive. For the choice R_t(β_t) = G̃_t(β_t), it makes sense provided Σ_{t|t-1}^{-1} + G̃_t(β_t) is positive definite, t = 1, 2, ... .

Covariance recursion

Let β_1, β_2, ... be a sequence of p-vectors. Starting with Σ_{0|0} = Q_0, define for t = 1, 2, ...:

\Sigma_{t|t-1} = T_t \Sigma_{t-1|t-1} T_t' + Q_t , \qquad (3.4)

B_t = \Sigma_{t-1|t-1} T_t' \Sigma_{t|t-1}^{-1} , \qquad (3.5)

\Sigma_{t|t} = [ \Sigma_{t|t-1}^{-1} + R_t(\beta_t) ]^{-1} . \qquad (3.6)

Actually, Σ_{t|t-1} and B_t (Σ_{t|t}) depend only on β_1, ..., β_{t-1} (β_1, ..., β_t). If the sequence β_1, β_2, ... referred to in the covariance recursion is to be stressed, we write

\Sigma_{t|t-1} = \Sigma_{t|t-1}(\beta_1, \ldots, \beta_{t-1}) , \quad B_t = B_t(\beta_1, \ldots, \beta_{t-1}) , \quad \Sigma_{t|t} = \Sigma_{t|t}(\beta_1, \ldots, \beta_t) . \qquad (3.7)

If the extended Kalman filter of 3.1 is run until time t, it uses the covariance recursion based on β̂_{1|0}, ..., β̂_{t|t-1}. To be consistent with the intended smoother, it should instead be based on β̂_{1|t-1}, ..., β̂_{t|t-1}. This leads to the following integrated algorithm for filtering with smoothing back after each filter step.

Filter and Smoother 2

1. Initialization:

\hat\beta_{0|0} = a_0 , \qquad \Sigma_{0|0} = Q_0 .

For t = 1, 2, ...:

2. Prediction step:

\hat\beta_{t|t-1} = T_t \hat\beta_{t-1|t-1} .

3. Covariance recursion, for s = 1, ..., t:

\Sigma_{s|s-1} = T_s \Sigma_{s-1|s-1} T_s' + Q_s , \qquad B_s = \Sigma_{s-1|s-1} T_s' \Sigma_{s|s-1}^{-1} , \qquad \Sigma_{s|s} = [ \Sigma_{s|s-1}^{-1} + R_s(\hat\beta_{s|t-1}) ]^{-1} .

4. Correction step:

\hat\beta_{t|t} = \hat\beta_{t|t-1} + \Sigma_{t|t} r_t(\hat\beta_{t|t-1}) .

5. Smoother 2, for s = t, ..., 1:

\hat\beta_{s-1|t} - \hat\beta_{s-1|t-1} = B_s ( \hat\beta_{s|t} - \hat\beta_{s|t-1} ) ,

\Sigma_{s-1|t} - \Sigma_{s-1|t-1} = B_s ( \Sigma_{s|t} - \Sigma_{s|t-1} ) B_s' .


Step t of this combined algorithm consists of a forward recursion, including the filter steps, and the backward recursive smoother. In terms of (3.7), in step t we have

\Sigma_{s|s-1} = \Sigma_{s|s-1}(\hat\beta_{1|t-1}, \ldots, \hat\beta_{s-1|t-1}) , \quad B_s = B_s(\hat\beta_{1|t-1}, \ldots, \hat\beta_{s-1|t-1}) , \quad \Sigma_{s|s} = \Sigma_{s|s}(\hat\beta_{1|t-1}, \ldots, \hat\beta_{s|t-1}) ,

s = 1, ..., t. Since error covariance matrices are recomputed before the t-th correction step, estimates are in general different from those of the previous extended Kalman filter and smoother.
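A compact sketch of this integrated algorithm (our own; it reuses the hypothetical contributions() routine from the filter sketch in 3.1 and assumes numpy) is:

```python
import numpy as np

def filter_smoother2(y, Z, T, Q, a0, Q0, contributions):
    """Lists y, Z, T, Q indexed t = 1,...,n (entry 0 unused).  Returns the
    current smoothed estimates beta_{s|t}, s = 0,...,t, after the last step."""
    beta = [a0.copy()]                                # holds beta_{s|t-1}
    for t in range(1, len(y)):
        beta.append(T[t] @ beta[t - 1])               # 2. prediction step
        # 3. covariance recursion along beta_{1|t-1},...,beta_{t|t-1}
        Sig_f, Sig_p, B = [Q0.copy()], [None], [None]
        for s in range(1, t + 1):
            Sig_p.append(T[s] @ Sig_f[s - 1] @ T[s].T + Q[s])
            B.append(Sig_f[s - 1] @ T[s].T @ np.linalg.inv(Sig_p[s]))
            _, R_s = contributions(Z[s], beta[s], y[s])
            Sig_f.append(np.linalg.inv(np.linalg.inv(Sig_p[s]) + R_s))
        # 4. correction step at time t
        r_t, _ = contributions(Z[t], beta[t], y[t])
        smoothed = beta[t] + Sig_f[t] @ r_t           # beta_{t|t}
        # 5. smoother 2: smooth back after the filter step
        for s in range(t, 0, -1):
            prev = beta[s - 1] + B[s] @ (smoothed - beta[s])
            beta[s], smoothed = smoothed, prev
        beta[0] = smoothed
    return beta
```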

4 Posterior Mode Estimation


4.1 Score Function and Information Matrix

Let us now look from a different angle at the problem of estimating β*_t = (β_0', ..., β_t')'. Given y*_t, the whole parameter vector β*_t may be estimated by maximizing its posterior density. Equivalently, one can maximize the joint density


p(y^*_t \mid \beta^*_t) \, p(\beta^*_t)

or, taking logarithms,

l^*_t(\beta^*_t) + a^*_t(\beta^*_t) , \qquad (4.1)

with the log likelihood l*_t(β*_t) = log p(y*_t | β*_t) and the log prior a*_t(β*_t) = log p(β*_t). In maximizing (4.1), the score function

u^*_t(\beta^*_t) = \frac{\partial}{\partial \beta^*_t} \left( l^*_t(\beta^*_t) + a^*_t(\beta^*_t) \right) \qquad (4.2)

and the information matrix U*_t(β*_t), say, are of interest. According to common practice in posterior mode estimation, one would use the random information

U^*_t(\beta^*_t) = - \frac{\partial^2}{\partial \beta^*_t \, \partial \beta^{*\prime}_t} \left( l^*_t(\beta^*_t) + a^*_t(\beta^*_t) \right) . \qquad (4.3)

An alternative is to replace (4.3) by some kind of conditional information.

Due to the recursive nature of dynamic exponential family regression models, u*_t and U*_t have a special structure. From the assumptions in Section 2, we get

l^*_t(\beta^*_t) = \sum_{s=1}^{t} l_s(\beta_s) \qquad (4.4)

for the log likelihood by successive conditioning. Up to a summand independent of β*_t, the log prior is

a^*_t(\beta^*_t) = - \tfrac{1}{2} (\beta_0 - a_0)' Q_0^{-1} (\beta_0 - a_0) - \tfrac{1}{2} \sum_{s=1}^{t} (\beta_s - T_s \beta_{s-1})' Q_s^{-1} (\beta_s - T_s \beta_{s-1}) . \qquad (4.5)

The score function can be partitioned as u*_t = (u_0', ..., u_t')', where the subvector u_s gives the derivative of the log posterior with respect to β_s, s = 0, ..., t. In order to avoid special formulas for the boundary quantities u_0, u_t, it is convenient to complement the score function contributions r_1, ..., r_t given in (2.16) and (2.19) by r_0 = 0, and to set T_{t+1} = 0, T_{t+1}' c_{t+1} = 0. Defining additionally


c_0 = Q_0^{-1} (\beta_0 - a_0) , \qquad c_s = Q_s^{-1} (\beta_s - T_s \beta_{s-1}) , \quad s = 1, \ldots, t , \qquad (4.6)

we have

u_s = r_s - c_s + T_{s+1}' c_{s+1} , \qquad s = 0, \ldots, t . \qquad (4.7)

The subvector u_s of the score function depends only on β_{s-1}, β_s and β_{s+1}, s = 1, ..., t-1, the boundary vectors u_0, u_t only on β_0, β_1 resp. β_{t-1}, β_t. The information matrix therefore has block tridiagonal structure:

"U~ U~ U*=

U01
0

U~
. .

(4.8)
Ut-l,t

..

..

U't- 1, t with
Uss = R s + Q s I +T's+lOs+ll T s + l , s= 1,...,t

U.

s = 0 .....

t ,

(4.9)

Us_l, s = - T ' s Q s 1 ,

(4.10)

where R0 = 0, Tt+ , = 0 and R , . . . . . R t are the ( r a n d o m or conditional) information matrix contributions given in 2.4.
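For illustration (our own sketch, numpy assumed, names hypothetical), the score subvectors (4.6), (4.7) and the nonzero blocks (4.9), (4.10) of U*_t can be assembled as follows; the contributions r_s, R_s are those of 2.4, with r_0 = 0, R_0 = 0 supplied by the caller:

```python
import numpy as np

def score_and_blocks(beta, r, R, T, Q, a0):
    """beta, r, R, Q indexed s = 0..t (with r[0] = 0, R[0] = 0); T indexed 1..t."""
    t = len(beta) - 1
    Qinv = [np.linalg.inv(Qs) for Qs in Q]
    c = [Qinv[0] @ (beta[0] - a0)]                             # (4.6)
    for s in range(1, t + 1):
        c.append(Qinv[s] @ (beta[s] - T[s] @ beta[s - 1]))
    u, U_diag, U_off = [], [], []
    for s in range(t + 1):
        cross = T[s + 1].T @ c[s + 1] if s < t else 0          # T_{t+1} = 0
        u.append(r[s] - c[s] + cross)                          # (4.7)
        diag = R[s] + Qinv[s]
        if s < t:
            diag = diag + T[s + 1].T @ Qinv[s + 1] @ T[s + 1]  # (4.9)
        U_diag.append(diag)
        if s >= 1:
            U_off.append(-T[s].T @ Qinv[s])                    # (4.10): block U_{s-1,s}
    return u, U_diag, U_off
```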

4.2 Factorization and Inversion of the Information Matrix

Any positive definite block-tridiagonal matrix can uniquely be factorized into

\begin{pmatrix} I & & & 0 \\ -B_1' & I & & \\ & \ddots & \ddots & \\ 0 & & -B_t' & I \end{pmatrix}
\begin{pmatrix} D_0 & & & 0 \\ & D_1 & & \\ & & \ddots & \\ 0 & & & D_t \end{pmatrix}
\begin{pmatrix} I & -B_1 & & 0 \\ & I & \ddots & \\ & & \ddots & -B_t \\ 0 & & & I \end{pmatrix} \qquad (4.11)


where D_0, ..., D_t are positive definite matrices. Conversely, if a matrix possesses such a factorization with D_0, ..., D_t positive definite, then it is positive definite and block-tridiagonal. The first part of the following proposition specifies the factors for the information matrix U*_t. It rests on the covariance recursion, which includes expressions for the matrices B_1, ..., B_t, together with similar expressions for D_0, ..., D_t. The second part gives the inverses D_0^{-1}, ..., D_t^{-1}. Finally, in part (iii) we obtain the inverse of U*_t, which yields approximate error covariance matrices.

Proposition 1. (i) Let Σ_{0|0}, Σ_{s|s-1}, Σ_{s|s}, s = 1, ..., t, be defined by the covariance recursion, as well as

B_s = \Sigma_{s-1|s-1} T_s' \Sigma_{s|s-1}^{-1} , \qquad s = 1, \ldots, t . \qquad (4.12)

Assume that Σ_{s|s-1}^{-1} + R_s, s = 1, ..., t, is positive definite. Then U*_t is also positive definite, and the matrices in its factorization are given by (4.12) and

D_s = \Sigma_{s|s}^{-1} + T_{s+1}' Q_{s+1}^{-1} T_{s+1} , \qquad s = 0, \ldots, t . \qquad (4.13)

(ii) The inverses are

D_s^{-1} = \Sigma_{s|s} - B_{s+1} \Sigma_{s+1|s} B_{s+1}' , \qquad s = 0, \ldots, t-1 , \qquad (4.14)

D_t^{-1} = \Sigma_{t|t} .

(iii) The (r,s)-block of the inverse of U*_t is

A_{rs} = \sum_{j=m}^{t} B_{r+1} \cdots B_j \, D_j^{-1} \, B_j' \cdots B_{s+1}' , \qquad (4.15)

0 ≤ r, s ≤ t, where m = max{r, s}.

Remarks. (i) In (4.13), recall that T_{t+1} = 0. In (4.15), the product B_{r+1} · ... · B_j is in increasing order, so one must read B_{r+1} · ... · B_r as an empty product, i.e. B_{r+1} · ... · B_r = I. Correspondingly, B_j' · ... · B_{s+1}' is in decreasing order, implying B_s' · ... · B_{s+1}' = I.

(ii) Part (iii) actually gives the inverse of any positive definite block tridiagonal matrix in terms of the factorization (4.11), since it can be proved without recourse to (i) and (ii).


Proof" (i) Due to the assumptions in Section 2.1, X ~ = Q o 1 is positive definite The same holds true for Xs-i~s = Z -s1l s _ l + R s, s = 1. . . . . t by hypothesis. Since positive semidefinite matrices are added, it follows that D O. . . . . D t defined by (4.13) are positive definite According to the remarks preceeding Proposition 1, it remains to be shown that U* possesses a factorization (4.11), with the submatrices of the factors given by (4.12), (4.13). From (4.12), (4.13) and the covariance recursion, we get D s _ l B s = TsXsls-I , -1 , -I + T s Q s I Ts'Ss-I Is-1Ts'Ssls-1 ~ Z -sis-1 I = T ' ~ Q s l ( O s + T s X s - t l s - I T ,s) = T'sQ~ 1

(4.16)

and similarly
D s + B s D s _ I B s = R s + Q s 1 +Ts+l , Qs+l -1 Ts+l ,

(4.17)

s = 1. . . . . t. Multiplying out (4.11) yields the matrix


DO -B~Do .. 0 -DoB 1 B~DoBI+D1 ".. "'. -Dt_lBt 0 ]

(4.18)

-B~Dt_ 1 B'tDt_IBt+Dt]

Inserting (4.6) and (4.7) into (4.18) and comparing with U* given in (4.8)-(4.10), the desired result is obtained. (ii) This can be inferred by inserting the formula for Bs+ l into
Z~I S - - B S + 1Z~ + l i B S'

S+ 1

and invoking the matrix inversion lemma (e.g. Anderson and Moore 1979, p. 138). (iii) By multiplying out, it can be verified that -I -B l 0 I 0
-1

-IB 1 I

... B2

B 1 " . . . ' B t-

-B2
-B t I
0

Bt I


From this formula, we get (4.15) by multiplying the inverses of the matrices in (4.11) in reverse order. □
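The factorization can be checked numerically; the following small script (our own construction with arbitrary positive definite R_s, numpy assumed) builds U*_t from (4.8)-(4.10), the factors of (4.11) from the covariance recursion and (4.12), (4.13), and verifies that their product reproduces U*_t:

```python
import numpy as np

rng = np.random.default_rng(0)
p, t = 2, 3
T = [None] + [np.eye(p) for _ in range(t)]                   # T_1,...,T_t
Q = [np.eye(p) for _ in range(t + 1)]                        # Q_0,...,Q_t
R = [np.zeros((p, p))] + [a @ a.T + np.eye(p)                # R_0 = 0, R_s > 0
                          for a in rng.standard_normal((t, p, p))]

# covariance recursion (3.4)-(3.6) and the blocks B_s of (4.12)
Sig_f, B = [Q[0]], [None]
for s in range(1, t + 1):
    Sig_p = T[s] @ Sig_f[s - 1] @ T[s].T + Q[s]
    B.append(Sig_f[s - 1] @ T[s].T @ np.linalg.inv(Sig_p))
    Sig_f.append(np.linalg.inv(np.linalg.inv(Sig_p) + R[s]))

# assemble U*_t from (4.8)-(4.10) and the factors of (4.11)
n = p * (t + 1)
U, A, D = np.zeros((n, n)), np.eye(n), np.zeros((n, n))
for s in range(t + 1):
    i = slice(s * p, (s + 1) * p)
    cross = T[s + 1].T @ np.linalg.inv(Q[s + 1]) @ T[s + 1] if s < t else 0
    U[i, i] = R[s] + np.linalg.inv(Q[s]) + cross             # (4.9)
    D[i, i] = np.linalg.inv(Sig_f[s]) + cross                # D_s of (4.13)
    if s >= 1:
        j = slice((s - 1) * p, s * p)
        U[j, i] = -T[s].T @ np.linalg.inv(Q[s])              # (4.10)
        U[i, j] = U[j, i].T
        A[j, i] = -B[s]                                      # upper bidiagonal factor
print(np.allclose(A.T @ D @ A, U))                           # should print True
```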

4.3 Gauss-Newton and Fisher Scoring Iterations


A vector maximizing the posterior density p(β*_t | y*_t) can be found by Gauss-Newton or Fisher scoring iterations. If β*_t denotes the current vector, then the next iterate is β*_t + δ*_t, where δ*_t = (δ_0', ..., δ_t')' solves

U^*_t(\beta^*_t) \, \delta^*_t = u^*_t(\beta^*_t) . \qquad (4.19)

This process is repeated until convergence. If U*_t is the random information (R_s = G̃_s), then (4.19) is a Gauss-Newton, otherwise (R_s = G_s) a Fisher scoring iteration.

Due to the decomposition (4.11), equations (4.19) can be solved by forward-backward recursion, thus avoiding inversion of U*_t(β*_t). First one solves the system

\begin{pmatrix} I & & & 0 \\ -B_1' & I & & \\ & \ddots & \ddots & \\ 0 & & -B_t' & I \end{pmatrix} \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_t \end{pmatrix} = \begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_t \end{pmatrix}

for the auxiliary vector e*_t = (e_0', ..., e_t')' by forward recursion, and then

\begin{pmatrix} I & -B_1 & & 0 \\ & I & \ddots & \\ & & \ddots & -B_t \\ 0 & & & I \end{pmatrix} \begin{pmatrix} \delta_0 \\ \delta_1 \\ \vdots \\ \delta_t \end{pmatrix} = \begin{pmatrix} D_0^{-1} e_0 \\ D_1^{-1} e_1 \\ \vdots \\ D_t^{-1} e_t \end{pmatrix}

by backward recursion. Incorporating computation of B_1, ..., B_t and D_0, ..., D_t, we get the following.

Gauss-Newton (Fisher scoring) step

1. Initialization:

e_0 = u_0 , \qquad \Sigma_{0|0} = Q_0 .

2. Forward recursion, for s = 1, ..., t: compute Σ_{s|s-1}, B_s, Σ_{s|s} by the covariance recursion, u_s by (4.6), (4.7), and

e_s = u_s + B_s' e_{s-1} .

3. Filter correction:

\delta_t = \Sigma_{t|t} e_t .

4. Smoother corrections, for s = t, ..., 1:

D_{s-1}^{-1} = \Sigma_{s-1|s-1} - B_s \Sigma_{s|s-1} B_s' , \qquad \delta_{s-1} = D_{s-1}^{-1} e_{s-1} + B_s \delta_s .

Of course, several numerical variants are possible. In the smoother corrections, for instance, instead of computing D_{s-1}^{-1} by means of (4.14), one can first compute D_{s-1} by (4.13) and then invert. Although this is more time-consuming, it should have the advantage of greater numerical stability, since cancellation effects are avoided.

After one or more iterations, one can apply Proposition 1 (iii) to get approximate error covariance matrices. To give an example, let us look at the diagonal blocks

A_{ss} = \Sigma_{s|t} , \qquad s = 0, \ldots, t .

From (4.15), we obtain the formula

\Sigma_{s-1|t} = D_{s-1}^{-1} + B_s \Sigma_{s|t} B_s' , \qquad s = 1, \ldots, t . \qquad (4.20)

This can be applied backward recursively, starting with Σ_{t|t}.
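One Gauss-Newton (Fisher scoring) step in this forward-backward recursive form can be sketched as follows (our own illustration, numpy assumed; contrib(s, beta_s) is a hypothetical routine returning r_s, R_s as in 2.4):

```python
import numpy as np

def gauss_newton_step(beta, contrib, T, Q, a0):
    """One iteration on beta = [beta_0,...,beta_t]; T indexed 1..t, Q indexed 0..t."""
    t = len(beta) - 1
    Qinv = [np.linalg.inv(Qs) for Qs in Q]
    c = [Qinv[0] @ (beta[0] - a0)] + \
        [Qinv[s] @ (beta[s] - T[s] @ beta[s - 1]) for s in range(1, t + 1)]   # (4.6)
    # forward recursion: covariance recursion, u_s of (4.7), e_s
    Sig_f, Sig_p, B = [Q[0].copy()], [None], [None]
    e = [-c[0] + T[1].T @ c[1]]                               # e_0 = u_0 (r_0 = 0)
    for s in range(1, t + 1):
        Sig_p.append(T[s] @ Sig_f[s - 1] @ T[s].T + Q[s])
        B.append(Sig_f[s - 1] @ T[s].T @ np.linalg.inv(Sig_p[s]))
        r_s, R_s = contrib(s, beta[s])
        Sig_f.append(np.linalg.inv(np.linalg.inv(Sig_p[s]) + R_s))
        u_s = r_s - c[s] + (T[s + 1].T @ c[s + 1] if s < t else 0)
        e.append(u_s + B[s].T @ e[s - 1])
    # filter correction and backward smoother corrections
    delta = [None] * (t + 1)
    delta[t] = Sig_f[t] @ e[t]
    for s in range(t, 0, -1):
        Dinv = Sig_f[s - 1] - B[s] @ Sig_p[s] @ B[s].T        # (4.14)
        delta[s - 1] = Dinv @ e[s - 1] + B[s] @ delta[s]
    return [b + d for b, d in zip(beta, delta)]               # next iterate
```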

4.4 A Single Gauss-Newton (Fisher Scoring) Step

In the discussion of Gauss-Newton (Fisher scoring) iterations, time t has been fixed, in contrast to the algorithms of Section 3. If Gauss-Newton (Fisher scoring) steps are to be applied sequentially for t = 1, 2, ..., then β̂_{0|t-1}, ..., β̂_{t-1|t-1} together with the forecast β̂_{t|t-1} = T_t β̂_{t-1|t-1} is a reasonable starting value at time t, when the new observation y_t becomes available. In many cases, y_t should not change estimates too drastically, with the implication that a single Gauss-Newton (Fisher scoring) iteration suffices. This leads to the following algorithm.

Filter and Smoother 3

1. Initialization:

\hat\beta_{0|0} = a_0 , \qquad \Sigma_{0|0} = Q_0 .

For t = 1, 2, ...:

2. Prediction step:

\hat\beta_{t|t-1} = T_t \hat\beta_{t-1|t-1} .

3. Forward recursion, with e_0 = u_0 and for s = 1, ..., t: compute Σ_{s|s-1}, B_s, Σ_{s|s} by the covariance recursion and u_s by (4.6), (4.7), based on β̂_{1|t-1}, ..., β̂_{t|t-1}, and

e_s = u_s + B_s' e_{s-1} .

4. Filter correction:

\hat\beta_{t|t} = \hat\beta_{t|t-1} + \Sigma_{t|t} e_t .

5. Smoother corrections, for s = t, ..., 1:

D_{s-1}^{-1} = \Sigma_{s-1|s-1} - B_s \Sigma_{s|s-1} B_s' ,

\hat\beta_{s-1|t} - \hat\beta_{s-1|t-1} = B_s ( \hat\beta_{s|t} - \hat\beta_{s|t-1} ) + D_{s-1}^{-1} e_{s-1} .
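Sequential application, as described above, then amounts to one such step per time point; a brief sketch (our own, reusing the hypothetical gauss_newton_step routine from the sketch in 4.3) is:

```python
def filter_smoother3(y, contrib, T, Q, a0):
    """contrib(s, beta_s) returns r_s, R_s and may use y[s] internally;
    y and T carry a dummy entry at index 0, and Q[0] = Q_0."""
    beta = [a0.copy()]                                # beta_{0|0}
    for t in range(1, len(y)):
        beta.append(T[t] @ beta[t - 1])               # prediction step (forecast)
        # a single Gauss-Newton (Fisher scoring) step on beta_0,...,beta_t
        beta = gauss_newton_step(beta, contrib, T[:t + 1], Q[:t + 1], a0)
    return beta                                       # single-step estimates beta_{s|t}
```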


5 Relationships Between Algorithms

The following proposition gives the connection between filter and smoother 2 and 3, which look rather close. Briefly, they are referred to as algorithms 2 and 3.

Proposition 2. Assume that step t of algorithms 2 and 3 is run with the same input vectors β̂_{0|t-1}, ..., β̂_{t-1|t-1}. Then the following assertions are equivalent:

(i) algorithms 2 and 3 provide the same estimates β̂_{0|t}, ..., β̂_{t|t},
(ii) in algorithm 3, it holds that e_0 = ... = e_{t-1} = 0,
(iii) the input vector (β̂_{0|t-1}, ..., β̂_{t-1|t-1}) is a stationary point of p(β*_{t-1} | y*_{t-1}).

Proof." Prediction steps are obviously the same, inducing that the matrices computed in the covariance recursion are also equal for b o t h algorithms, since they are based on the same sequence. Thus, we need only consider the filter and smoother corrections. Note further that the prediction step implies

Ct = O t l

(~/it_l-Zt~t_llt_l)

: 0

(5.1)

inducing (compare (4.6), (4.7))


ut = rt(t~tlt- 1)

(5.2)

If (i) holds, then a comparison o f the smoother formulas yields (ii). If (ii) holds good, then we have

e t = ut+B~gt_

1 = u t = rt(~tlt_l)

from (5.2), inducing that the filter corrections are equal. Since the smoother corrections are obviously equal if e 0 = . . . = e t_l = 0, (ii) implies (i). To see that (ii) and (iii) are equivalent, let v*_~ = (v~ . . . . . v~_l)' denote the score function of log p(fl*_~[y*_0. By definition, (iii) is equivalent to

vLl~?-~) = 0 .
F r o m (4.6), (4.7), we get

(5.3)


u_s = v_s , \qquad s = 0, \ldots, t-2 , \qquad u_{t-1} = v_{t-1} + T_t' c_t .

For evaluation at (β̂_{0|t-1}, ..., β̂_{t|t-1}), we even have

u_s = v_s , \qquad s = 0, \ldots, t-1 , \qquad (5.4)

due to (5.1).

From e_0 = u_0 and the forward recursion, it can easily be inferred that e_0 = ... = e_{t-1} = 0 is equivalent to u_0 = ... = u_{t-1} = 0. Equations (5.3) and (5.4) yield the desired equivalence. □

In Sections 3 and 4, we have discussed four algorithms:

1. extended Kalman filtering and smoothing,
2. filtering combined with smoothing after each filter step,
3. filtering and smoothing by a single Gauss-Newton (Fisher scoring) iteration at step t,
4. posterior mode estimation by Gauss-Newton (Fisher scoring) iterations.

Regarding the relationship between 3 and 4, it seems plausible that a single Gauss-Newton (Fisher scoring) iteration at step t often provides a good approximation: for s fixed, the score function subvector u_s appears in all steps t ≥ s, and it should therefore become closer and closer to zero. Algorithm 2 makes a further approximation in that the quantities e_0, ..., e_{t-1} are neglected in step t. It may be thought of as a variant of filtering and smoothing by a single Gauss-Newton (Fisher scoring) iteration, which relies on the induction hypothesis that β̂_{0|t-1}, ..., β̂_{t-1|t-1} are posterior mode estimates. If this holds true, then it provides the same estimates as algorithm 3, according to Proposition 2. However, the induction hypothesis will usually not hold exactly. Since e_0, ..., e_{t-1} are neglected, the score function subvector u_s enters only at step t = s, and there is no possibility for correction at subsequent steps. The original algorithm 1 simplifies further by smoothing back only after t filter steps.

To summarize, we have shown that the algorithms form a hierarchy of approximations to posterior mode estimation, the quality of approximation being opposed to computational effort. The kind of approximations has been clarified to some extent. Of course, it would be very valuable to have results judging the quality of the various approximations more quantitatively, and characterizing the situations where they can be successfully applied. This is presently only known in the linear Gaussian model, where all four algorithms provide the same estimates, but should be a topic for future research.
Acknowledgement. I thank a referee for his valuable comments, which helped to improve the presentation of the paper.


References

Ameen JRM, Harrison PJ (1985) Normal discount Bayesian models. In: Bernardo JM, DeGroot MH, Lindley DV, Smith AFM (eds) Bayesian Statistics 2:271-294
Anderson BDO, Moore JB (1979) Optimal Filtering. Prentice Hall, Englewood Cliffs
Baker RJ, Thompson R (1981) Composite link functions in generalized linear models. Appl Stat 30:125-131
Fahrmeir L (1988) Extended Kalman filtering for dynamic generalized linear models and survival data. Regensburger Beiträge zur Statistik und Ökonometrie 10
Fahrmeir L, Kaufmann H (1985) Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann Statist 13:342-368
Fahrmeir L, Kaufmann H (1987) Regression models for nonstationary categorical time series. J Time Ser Anal 8:147-160
Jørgensen B (1983) Maximum likelihood estimation and large sample inference for generalized linear and nonlinear regression models. Biometrika 70:19-28
Kaufmann H (1987) Regression models for nonstationary categorical time series: asymptotic estimation theory. Ann Statist 15:79-98
Kitagawa G (1987) Non-Gaussian state-space modelling of nonstationary time series (with comments). JASA 82:1032-1063
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J Roy Statist Soc Ser A 135:370-384
Sage AP, Melsa JL (1971) Estimation Theory with Applications to Communication and Control. McGraw Hill, New York
West M (1985) Generalized linear models: scale parameters, outlier accommodation and prior distributions. In: Bernardo JM, DeGroot MH, Lindley DV, Smith AFM (eds) Bayesian Statistics 2:531-538
West M, Harrison PJ, Migon HS (1985) Dynamic generalized linear models and Bayesian forecasting. JASA 80:73-83

Received 9 June 1989 Revised version 10 January 1990
