In Lecture 2, we introduced stationary linear time series models. In that lecture, we discussed the data generating processes and their characteristics, assuming that we know all parameters (autoregressive or moving average coefficients). However, in empirical studies, we have to specify an econometric model, estimate this model, and draw inferences based on the estimates. In this lecture, we provide an introduction to parametric estimation of a linear model with time series observations. Three commonly used estimation methods are least squares estimation (LS), maximum likelihood estimation (MLE), and the generalized method of moments (GMM). In this lecture, we will discuss LS and MLE.
1 Least Squares Estimation

Consider the linear regression model
$$Y_n = X_n\beta_0 + U_n, \qquad (3)$$
where $Y_n$ and $U_n$ are $n \times 1$ vectors, $X_n$ is an $n \times k$ matrix of regressors, and $\beta_0$ is the $k \times 1$ parameter vector,
and the OLS estimator can be written as
$$\hat\beta_n = (X_n'X_n)^{-1}X_n'Y_n.$$
Define
$$M_X = I_n - X_n(X_n'X_n)^{-1}X_n'.$$
It is easy to see that $M_X$ is symmetric, idempotent ($M_XM_X = M_X$), and orthogonal to the columns of $X_n$. Then we have
$$\hat U_n = Y_n - X_n\hat\beta_n = M_XY_n.$$
To derive the distribution of the estimator $\hat\beta_n$, substitute $Y_n = X_n\beta_0 + U_n$ into the formula for $\hat\beta_n$ to obtain
$$\hat\beta_n = \beta_0 + (X_n'X_n)^{-1}X_n'U_n. \qquad (5)$$
Therefore, the properties of $\hat\beta_n$ depend on $(X_n'X_n)^{-1}X_n'U_n$. For example, if $E[(X_n'X_n)^{-1}X_n'U_n] = 0$, then $\hat\beta_n$ is an unbiased estimator.
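As a small numerical sketch of these formulas (using NumPy; the design, coefficients, and error scale below are made up purely for illustration), $\hat\beta_n$, $M_X$, and the residuals can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Hypothetical design: constant, trend, and one random regressor
X = np.column_stack([np.ones(n), np.arange(1, n + 1), rng.normal(size=n)])
beta0 = np.array([1.0, 0.5, -2.0])          # assumed true coefficients
u = rng.normal(scale=1.5, size=n)           # i.i.d. N(0, sigma^2) errors
y = X @ beta0 + u

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual maker M_X = I - X (X'X)^{-1} X'; residuals u_hat = M_X y
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
u_hat = M @ y

print(beta_hat)                               # close to beta0
print(np.allclose(u_hat, y - X @ beta_hat))   # True: M_X y = y - X beta_hat
```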
1.1 Case 1: OLS with deterministic regressors and i.i.d. Gaussian errors
Assumption 1 (a) $x_t$ is deterministic; (b) $u_t \sim$ i.i.d.$(0, \sigma^2)$; (c) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.
Under assumption 1 (a) and (b), $E(U_n) = 0$ and $E(U_nU_n') = \sigma^2I_n$. Then from (5) we have
$$E(\hat\beta_n) = \beta_0$$
and
$$\mathrm{Var}(\hat\beta_n) = E[(\hat\beta_n - \beta_0)(\hat\beta_n - \beta_0)'] = \sigma^2(X_n'X_n)^{-1}.$$
Under these assumptions, the Gauss-Markov theorem tells us that the OLS estimator $\hat\beta_n$ is the best linear unbiased estimator for $\beta_0$. The OLS estimator for $\sigma^2$ is
$$s_n^2 = \frac{\hat U_n'\hat U_n}{n-k} = \frac{RSS}{n-k}.$$
Note that $\hat U_n = M_XY_n = M_XU_n$, so $RSS = U_n'M_XU_n$. To derive its distribution, decompose $M_X$ as
$$M_X = P\Lambda P' \quad\text{and}\quad P'P = I_n,$$
where $\Lambda$ is an $n \times n$ matrix with the eigenvalues of $M_X$ along the principal diagonal and zeros elsewhere. From the properties of $M_X$ we can compute that $\Lambda$ contains $k$ zeros and $n-k$ ones along its principal diagonal. Then
$$RSS = U_n'M_XU_n = U_n'P\Lambda P'U_n = (P'U_n)'\Lambda(P'U_n) = W_n'\Lambda W_n = \sum_{t=1}^{n}\lambda_tw_t^2,$$
where $W_n = P'U_n$. Then $E(W_nW_n') = P'E(U_nU_n')P = \sigma^2I_n$; therefore the $w_t$ are uncorrelated with mean 0 and variance $\sigma^2$. Therefore,
$$E(U_n'M_XU_n) = \sum_{t=1}^{n}\lambda_tE(w_t^2) = (n-k)\sigma^2,$$
so $E(s_n^2) = \sigma^2$ and $s_n^2$ is unbiased. Under assumption 1 (c), $U_n$ is Gaussian and $X_n$ is deterministic, so $\hat\beta_n$ is a linear combination of Gaussian variables and is exactly normally distributed:
$$\hat\beta_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1}).$$
Note that here $\hat\beta_n$ is exactly normal, while many of the estimators in our later discussions are only asymptotically normal. Actually, under assumption 1, the OLS estimator is optimal. Also, with the Gaussian assumption, $w_t$ is i.i.d. $N(0, \sigma^2)$. Therefore we have
$$U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k).$$
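To see these exact finite-sample distributions at work, here is a minimal Monte Carlo sketch (Python/NumPy and SciPy; the trend design, $\sigma$, and number of replications are arbitrary illustrative choices): it checks that the slope estimate has standard deviation $\sigma\sqrt{[(X'X)^{-1}]_{22}}$ and that $RSS/\sigma^2$ behaves like $\chi^2(n-k)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, sigma = 50, 2, 2.0
X = np.column_stack([np.ones(n), np.arange(1, n + 1)])   # deterministic regressors
beta0 = np.array([0.5, 0.1])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 5000
b1 = np.empty(reps)           # slope estimates
rss_over_s2 = np.empty(reps)  # RSS / sigma^2
for r in range(reps):
    u = rng.normal(scale=sigma, size=n)
    y = X @ beta0 + u
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    b1[r] = b[1]
    rss_over_s2[r] = resid @ resid / sigma**2

# Exact normal standard deviation of the slope
print(b1.std(), np.sqrt(sigma**2 * XtX_inv[1, 1]))
# RSS / sigma^2 should behave like a chi-square with n - k degrees of freedom
print(rss_over_s2.mean(), n - k)
print(stats.kstest(rss_over_s2, 'chi2', args=(n - k,)).pvalue)
```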
1.2 Case 2: OLS with stochastic regressors and i.i.d. Gaussian errors
The assumption of deterministic regressors is very strong for empirical studies in economics. Some examples of deterministic regressors are constants and deterministic trends (i.e. $x_t = (1, t, t^2, \ldots)$). However, most data we have for econometric regressions are stochastic. Therefore, from this subsection on, we will allow the regressors to be stochastic. However, in case 2 and case 3, we assume that $x_t$ is independent of the errors (all leads and lags). This is still too strong in time series, as it rules out many processes, including ARMA models.
Assumption 2 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.
This assumption can be equivalently written as $U_n|X_n \sim N(0, \sigma^2I_n)$. Under these assumptions, $\hat\beta_n$ is still unbiased:
$$E(\hat\beta_n) = \beta_0 + E[(X_n'X_n)^{-1}X_n']E(U_n) = \beta_0.$$
Conditional on $X_n$, $\hat\beta_n$ is normal: $\hat\beta_n|X_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$. To get the unconditional probability distribution for $\hat\beta_n$, we have to integrate this conditional density over $X$. Therefore, the unconditional distribution of $\hat\beta_n$ will depend on the distribution of $X$. However, we still have the unconditional distribution for the variance estimate: $U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k)$.
1.3 Case 3: OLS with stochastic regressors and i.i.d. Non-Gaussian errors
Compared to case 2, in this section we let the error terms follow an arbitrary i.i.d. distribution with finite fourth moments. Since this is an arbitrary unknown distribution, it is very hard to obtain the exact (finite sample) distribution of $\hat\beta_n$; instead, we will apply asymptotic theory to this problem.
Assumption 3 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d.$(0, \sigma^2)$ and $E(u_t^4) = \mu_4 < \infty$; (c) $E(x_tx_t') = Q_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^{n}Q_t \to Q$, a positive definite matrix; (d) $E(x_{it}x_{jt}x_{kt}x_{lt}) < \infty$ for all $i, j, k, l$ and $t$; (e) $(1/n)\sum_{t=1}^{n}(x_tx_t') \to_p Q$.
With assumption (a), we still have that $\hat\beta_n$ is an unbiased estimator for $\beta_0$. The assumptions (c) to (e) are restrictions on $x_t$. Basically we want to have $(1/n)\sum_{t=1}^{n}x_tx_t' \to_p \lim_{n\to\infty}(1/n)\sum_{t=1}^{n}E(x_tx_t') = Q$.
We have
$$\hat\beta_n - \beta_0 = \left[\sum_{t=1}^{n}x_tx_t'\right]^{-1}\left[\sum_{t=1}^{n}x_tu_t\right] = \left[(1/n)\sum_{t=1}^{n}x_tx_t'\right]^{-1}\left[(1/n)\sum_{t=1}^{n}x_tu_t\right].$$
$x_tu_t$ is a martingale difference sequence with finite variance, so by the LLN for mixingales we have
$$(1/n)\sum_{t=1}^{n}x_tu_t \to_p 0,$$
and therefore $\hat\beta_n$ is consistent for $\beta_0$. Next we apply a CLT to the term $\sum_{t=1}^{n}x_tu_t$, after it is properly normed (so that the limit is nondegenerate).
Note that $E(x_tx_t'u_t^2) = \sigma^2Q_t$ and $(1/n)\sum_{t=1}^{n}\sigma^2Q_t \to \sigma^2Q$. By the CLT for martingale difference sequences, we have
$$(1/\sqrt{n})\sum_{t=1}^{n}x_tu_t \to_d N(0, \sigma^2Q).$$
Therefore,
$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^{n}x_tx_t'\right]^{-1}\left[(1/\sqrt{n})\sum_{t=1}^{n}x_tu_t\right] \to_d N(0, Q^{-1}(\sigma^2Q)Q^{-1}) = N(0, \sigma^2Q^{-1}),$$
so $\hat\beta_n$ approximately follows
$$\hat\beta_n \approx N\!\left(\beta_0, \frac{\sigma^2Q^{-1}}{n}\right).$$
Note that this distribution is not exact, but approximate, so we should read it as "approximately distributed as normal".
To compute this variance, we need to know $\sigma^2$. When it is unknown, the OLS estimator $s_n^2$ is still consistent under assumption 3. We have
$$(1/n)\sum_{t=1}^{n}u_t^2 = (1/n)\sum_{t=1}^{n}(y_t - x_t'\hat\beta_n)^2 + (2/n)\sum_{t=1}^{n}(y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) + (1/n)\sum_{t=1}^{n}[x_t'(\hat\beta_n - \beta_0)]^2.$$
By the LLN, we have $(1/n)\sum_{t=1}^{n}u_t^2 \to \sigma^2$. There are three terms on the right hand side of the above equation. For the second term, the orthogonality of the OLS residuals to the regressors gives
$$(1/n)\sum_{t=1}^{n}(y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) = 0,$$
and we have
$$\hat\sigma_n^2 = (1/n)\sum_{t=1}^{n}(y_t - x_t'\hat\beta_n)^2 = (1/n)\sum_{t=1}^{n}u_t^2 - (1/n)\sum_{t=1}^{n}[x_t'(\hat\beta_n - \beta_0)]^2 \to_p \sigma^2.$$
This estimator is only slightly different from $s_n^2$ (note that $\hat\sigma_n^2 = (n-k)s_n^2/n$). Since $(n-k)/n \to 1$ as $n \to \infty$, if $\hat\sigma_n^2$ is consistent, so is $s_n^2$.
Next, we derive the distribution of $\hat\sigma_n^2$:
$$\sqrt{n}(\hat\sigma_n^2 - \sigma^2) = (1/\sqrt{n})\sum_{t=1}^{n}(u_t^2 - \sigma^2) - \sqrt{n}(\hat\beta_n - \beta_0)'\left[(1/n)\sum_{t=1}^{n}x_tx_t'\right](\hat\beta_n - \beta_0).$$
The second term goes to zero since $(1/n)\sum_{t=1}^{n}x_tx_t' \to_p Q$ and $\hat\beta_n - \beta_0 \to_p 0$. Define $z_t = u_t^2 - \sigma^2$; then $z_t$ is i.i.d. with mean zero and variance $E(u_t^4) - \sigma^4 = \mu_4 - \sigma^4$. Applying the CLT, we have
$$(1/\sqrt{n})\sum_{t=1}^{n}z_t \to_d N(0, \mu_4 - \sigma^4),$$
therefore,
$$\sqrt{n}(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4).$$
The same limiting distribution applies to $s_n^2$, since the difference between $\hat\sigma_n^2$ and $s_n^2$ is $o_p(n^{-1/2})$.
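As a quick numerical illustration of these asymptotic results (a minimal Monte Carlo sketch; the single-regressor design, the uniform error distribution, and the sample size are illustrative choices, not part of the notes), we can simulate case 3 with a non-Gaussian error and check the limiting variances of $\sqrt{n}(\hat\beta_n - \beta_0)$ and $\sqrt{n}(\hat\sigma_n^2 - \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 3000
sigma2 = 1.0
mu4 = 9.0 / 5.0          # E(u^4) for Uniform(-sqrt(3), sqrt(3)) errors

zb, zs = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)                        # stochastic regressor, independent of u
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # i.i.d. non-Gaussian, mean 0, variance 1
    y = 1.0 * x + u                               # true beta0 = 1, no intercept for simplicity
    b = (x @ y) / (x @ x)                         # OLS
    s2 = np.mean((y - b * x) ** 2)                # sigma_hat^2
    zb[r] = np.sqrt(n) * (b - 1.0)
    zs[r] = np.sqrt(n) * (s2 - sigma2)

print(zb.var(), sigma2 / 1.0)          # approx sigma^2 Q^{-1}, with Q = E(x_t^2) = 1
print(zs.var(), mu4 - sigma2 ** 2)     # approx mu_4 - sigma^4 = 0.8
```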
1.4 Case 4: OLS estimation in autoregression with i.i.d. error
In an autoregression, say $x_t = \phi x_{t-1} + \varepsilon_t$ where $\varepsilon_t$ is i.i.d., the regressor is no longer independent of all leads and lags of the error ($x_{t-1}$ depends on $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$). In this case, the OLS estimator of $\phi$ is biased. However, we will show that under assumption 4 the estimator is consistent.

Assumption 4 $y_t$ follows an AR(p) process,
$$y_t = \phi_1y_{t-1} + \phi_2y_{t-2} + \cdots + \phi_py_{t-p} + \varepsilon_t,$$
with the roots of $(1 - \phi_1z - \phi_2z^2 - \cdots - \phi_pz^p) = 0$ outside the unit circle (so $y_t$ is stationary) and with $\varepsilon_t$ i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment $\mu_4$.
Pages 215-216 in Hamilton present the general AR(p) case with a constant. We will use an AR(2) as an example: $y_t = \phi_1y_{t-1} + \phi_2y_{t-2} + \varepsilon_t$. Let $x_t' = (y_{t-1}, y_{t-2})$, $u_t = \varepsilon_t$, and $y_t = x_t'\beta_0 + u_t$ (so $\beta_0' = (\phi_1, \phi_2)$).
" n
#1 " n
#
X
0 X
n(n 0 ) = (1/n) xt xt (1/ n) xt ut (7)
t=1 t=1
The first term is
$$(1/n)\sum_{t=1}^{n}x_tx_t' = (1/n)\begin{bmatrix}\sum_{t=1}^{n}y_{t-1}^2 & \sum_{t=1}^{n}y_{t-1}y_{t-2}\\ \sum_{t=1}^{n}y_{t-1}y_{t-2} & \sum_{t=1}^{n}y_{t-2}^2\end{bmatrix}.$$
In this matrix, the diagonal terms $(1/n)\sum_{t=1}^{n}y_{t-j}^2$ converge to $\gamma_0$, and the remaining term $(1/n)\sum_{t=1}^{n}y_{t-1}y_{t-2}$ converges to $\gamma_1$. Therefore,
$$(1/n)\sum_{t=1}^{n}x_tx_t' \to_p Q = \begin{bmatrix}\gamma_0 & \gamma_1\\ \gamma_1 & \gamma_0\end{bmatrix},$$
and since $x_tu_t$ is again a martingale difference sequence with $E(x_tx_t'u_t^2) = \sigma^2Q$, the CLT for martingale difference sequences gives $(1/\sqrt{n})\sum_{t=1}^{n}x_tu_t \to_d N(0, \sigma^2Q)$. Therefore,
$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2Q^{-1}).$$
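A small simulation sketch of this AR(2) example (the parameter values below are illustrative; any stationary pair $(\phi_1, \phi_2)$ would do) shows the OLS estimates converging to $(\phi_1, \phi_2)$ and $(1/n)\sum x_tx_t'$ converging to the autocovariance matrix $Q$:

```python
import numpy as np

rng = np.random.default_rng(3)
phi1, phi2, sigma = 0.5, 0.3, 1.0      # stationary AR(2) (roots outside unit circle)
n, burn = 5000, 500

y = np.zeros(n + burn)
eps = rng.normal(scale=sigma, size=n + burn)
for t in range(2, n + burn):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]
y = y[burn:]

# OLS regression of y_t on (y_{t-1}, y_{t-2})
X = np.column_stack([y[1:-1], y[:-2]])
Y = y[2:]
phi_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Q_hat = X.T @ X / len(Y)               # estimates Q = [[gamma0, gamma1], [gamma1, gamma0]]

print(phi_hat)                          # close to (0.5, 0.3)
print(Q_hat)                            # close to the autocovariances gamma_0, gamma_1
```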
So far we have considered four cases of OLS regressions. The common assumption in all four cases is i.i.d. errors. From the next section on, we will consider cases where the errors are not i.i.d.
Assumption 5 (a) $x_t$ is stochastic; (b) conditional on the full matrix $X$, the vector $U \sim N(0, \sigma^2V)$; (c) $V$ is a known positive definite matrix.
Under these assumptions, the exact distribution of $\hat\beta_n$ can be derived. However, this is a very strong assumption, and it rules out autoregressive regressions. Also, the assumption that $V$ is known rarely holds in applications.
Case 6 in Hamilton assumes uncorrelated but heteroskedastic errors with unknown covariance
matrix. Under assumption 6, the OLS estimator is still consistent and asymptotically normal.
The trick here is to make use of the known fact that $\hat\beta_n - \beta_0 \to_p 0$. Let $\hat\Omega_n = (1/n)\sum_{t=1}^{n}\hat u_t^2x_tx_t'$ denote the residual-based estimate and $\Omega_n = (1/n)\sum_{t=1}^{n}u_t^2x_tx_t'$ its counterpart based on the true errors. If we can write $\hat\Omega_n - \Omega_n$ as sums of products of $\hat\beta_n - \beta_0$ and terms that are bounded, then $\hat\Omega_n - \Omega_n \to_p 0$.
Then, using $\hat u_t = u_t - x_t'(\hat\beta_n - \beta_0)$,
$$\hat\Omega_n - \Omega_n = -(2/n)\sum_{t=1}^{n}u_t[(\hat\beta_n - \beta_0)'x_t](x_tx_t') + (1/n)\sum_{t=1}^{n}[(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t').$$
The first term can be written as $-2\sum_{i=1}^{k}(\hat\beta_{in} - \beta_{i0})\left[(1/n)\sum_{t=1}^{n}u_tx_{it}(x_tx_t')\right]$. The term in the bracket has a finite plim by assumption 6 (e), and we have $\hat\beta_{in} - \beta_{i0} \to_p 0$ for each $i$, so this term converges to zero. (If this looks messy, take $k = 1$; then you can simply move $(\hat\beta_n - \beta_0)$ out of the summation. $\hat\beta_n - \beta_0 \to_p 0$ and the sum has a finite plim, so the product goes to zero.)
Similarly, for the second term,
$$(1/n)\sum_{t=1}^{n}[(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t') = \sum_{i=1}^{k}\sum_{j=1}^{k}(\hat\beta_{in} - \beta_{i0})(\hat\beta_{jn} - \beta_{j0})\left[(1/n)\sum_{t=1}^{n}x_{it}x_{jt}(x_tx_t')\right] \to_p 0.$$
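As a small numerical check of this argument (a sketch under one assumed heteroskedastic design; all values are illustrative), we can compare the residual-based matrix $(1/n)\sum\hat u_t^2x_tx_t'$ with its infeasible counterpart based on the true errors:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, 2.0])
u = rng.normal(size=n) * (0.5 + np.abs(x[:, 1]))      # heteroskedastic, mean-zero errors
y = x @ beta0 + u

b = np.linalg.solve(x.T @ x, x.T @ y)
uh = y - x @ b

# Feasible and infeasible versions of (1/n) sum u_t^2 x_t x_t'
Omega_hat = (x * uh[:, None] ** 2).T @ x / n
Omega_true = (x * u[:, None] ** 2).T @ x / n
print(np.max(np.abs(Omega_hat - Omega_true)))         # small: the difference vanishes as n grows
```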
Then the new error $\tilde U = LU$ is i.i.d. conditional on $X$. Now suppose the errors follow an AR(1) process, $u_t = \rho_0u_{t-1} + \varepsilon_t$, with $\rho_0$ unknown, and consider estimating $\rho_0$ from the OLS residuals, $\hat\rho_n = \sum_{t=2}^{n}\hat u_t\hat u_{t-1}/\sum_{t=2}^{n}\hat u_{t-1}^2$. The residuals can be written as
$$\hat u_t = (y_t - \beta_0'x_t + \beta_0'x_t - \hat\beta_n'x_t) = u_t + (\beta_0 - \hat\beta_n)'x_t.$$
Then
$$\frac{1}{n}\sum_{t=2}^{n}\hat u_t\hat u_{t-1} = \frac{1}{n}\sum_{t=2}^{n}[u_t + (\beta_0 - \hat\beta_n)'x_t][u_{t-1} + (\beta_0 - \hat\beta_n)'x_{t-1}]$$
$$= \frac{1}{n}\sum_{t=2}^{n}u_tu_{t-1} + (\beta_0 - \hat\beta_n)'\frac{1}{n}\sum_{t=2}^{n}(u_tx_{t-1} + u_{t-1}x_t) + (\beta_0 - \hat\beta_n)'\left[\frac{1}{n}\sum_{t=2}^{n}x_tx_{t-1}'\right](\beta_0 - \hat\beta_n)$$
$$= \frac{1}{n}\sum_{t=2}^{n}u_tu_{t-1} + o_p(1)$$
$$= \frac{1}{n}\sum_{t=2}^{n}(\rho_0u_{t-1} + \varepsilon_t)u_{t-1} + o_p(1) \;\to_p\; \rho_0\,\mathrm{var}(u_t).$$
Similarly, we can show that $\frac{1}{n}\sum_{t=2}^{n}\hat u_{t-1}^2 \to_p \mathrm{var}(u_t)$, hence $\hat\rho_n \to_p \rho_0$. Using a similar method, we can show that
$$\frac{1}{\sqrt{n}}\sum_{t=2}^{n}\hat u_t\hat u_{t-1} = \frac{1}{\sqrt{n}}\sum_{t=2}^{n}u_tu_{t-1} + o_p(1).$$
Hence
$$\sqrt{n}(\hat\rho_n - \rho_0) \to_d N(0, 1 - \rho_0^2).$$
Finally, the FGLS estimator for $\beta_0$ based on $V(\hat\rho_n)$ has the same limiting distribution as the GLS estimator based on $V(\rho_0)$ (pages 222-225 in Hamilton).
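The following is a minimal sketch of this two-step feasible GLS idea for AR(1) errors (the second step uses Cochrane-Orcutt-style quasi-differencing, which drops the first observation; the data-generating values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta0, rho0 = 1000, np.array([1.0, 2.0]), 0.6

# Regression with AR(1) errors: u_t = rho0 * u_{t-1} + eps_t
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    u[t] = rho0 * u[t - 1] + eps[t]
y = X @ beta0 + u

# Step 1: OLS, then estimate rho from the residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
uh = y - X @ b_ols
rho_hat = (uh[1:] @ uh[:-1]) / (uh[:-1] @ uh[:-1])

# Step 2: quasi-difference the data and re-run OLS (feasible GLS)
ys = y[1:] - rho_hat * y[:-1]
Xs = X[1:] - rho_hat * X[:-1]
b_fgls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

print(rho_hat)      # close to rho0
print(b_fgls)       # close to beta0, with the GLS limiting distribution
```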
The t-statistic for testing the null hypothesis that $\beta_i = 0$ is $\hat\beta_i/\mathrm{sd}(\hat\beta_i)$. Let the estimate of the variance of $\hat\beta$ be denoted by $s^2W$; then the standard deviation of $\hat\beta_i$ is the product of $s$ and the square root of the $i$th element on the diagonal of $W$, i.e.,
$$t = \frac{\hat\beta_i}{s\sqrt{w_{ii}}}. \qquad (9)$$
Recall that if $X/\sigma \sim N(0, 1)$, $Y^2/\sigma^2 \sim \chi^2(m)$, and $X$ and $Y$ are independent, then
$$t = \frac{X\sqrt{m}}{Y}$$
follows an exact Student t distribution with $m$ degrees of freedom.
The F-statistic is used to test a hypothesis of $m$ different linear restrictions on $\beta$, say
$$H_0: R\beta = r,$$
where $R$ is an $m \times k$ matrix. The F statistic is then defined as
$$F = (R\hat\beta - r)'[\mathrm{Var}(R\hat\beta - r)]^{-1}(R\hat\beta - r). \qquad (10)$$
This is a Wald statistic. To derive the distribution of the statistic, we will need the following result.

Proposition 2 If a $k \times 1$ vector $X \sim N(\mu, \Sigma)$, then $(X - \mu)'\Sigma^{-1}(X - \mu) \sim \chi^2(k)$.
Also recall that an exact $F(m, n)$ distribution is defined as the ratio of two independent chi-square variables, each divided by its degrees of freedom:
$$F(m, n) = \frac{\chi^2(m)/m}{\chi^2(n)/n}.$$
With assumption 1, $W = (X_n'X_n)^{-1}$, and under the null hypothesis $\hat\beta_i \sim N(0, \sigma^2w_{ii})$. We can then write
$$t = \frac{\hat\beta_i/\sqrt{\sigma^2w_{ii}}}{\sqrt{s^2/\sigma^2}}.$$
Since the numerator is $N(0, 1)$, the denominator is the square root of a $\chi^2(n-k)$ variable divided by $n-k$ (since $RSS/\sigma^2 \sim \chi^2(n-k)$), and the numerator and denominator are independent, the t statistic (9) under assumption 1 follows an exact t distribution with $n-k$ degrees of freedom.
With assumption 1 and under the null hypothesis, we have
$$R\hat\beta - r \sim N(0, \sigma^2R(X_n'X_n)^{-1}R'),$$
so by Proposition 2, the F statistic defined in (10) under the hypothesis $H_0$ satisfies
$$(R\hat\beta - r)'[\sigma^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r) \sim \chi^2(m).$$
If we replace $\sigma^2$ with $s^2$ and divide by the number of restrictions $m$, we get the OLS F test of a linear hypothesis:
$$F = (R\hat\beta - r)'[s^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r)/m = \frac{\chi^2(m)/m}{(RSS/\sigma^2)/(n-k)},$$
so $F$ follows an exact $F(m, n-k)$ distribution.
An alternative way to express the F statistic is to compute the estimator without restrictions and its associated sum of squared residuals $RSS_u$, and the estimator with the restrictions imposed and its associated sum of squared residuals $RSS_r$; then we can write
$$F = \frac{(RSS_r - RSS_u)/m}{RSS_u/(n-k)}.$$
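As a small numerical sketch of these test statistics (Python/NumPy and SciPy; the design and the particular restriction tested are illustrative), the Wald form of (10) with $s^2$, divided by $m$, and the restricted-versus-unrestricted RSS form give the same number:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta0 = np.array([1.0, 0.0, 0.0])                 # the last two coefficients are truly zero
y = X @ beta0 + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
s2 = resid @ resid / (n - k)

# H0: beta_2 = beta_3 = 0, i.e. R beta = r with m = 2 restrictions
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
m = 2
V = s2 * R @ np.linalg.inv(X.T @ X) @ R.T
F_wald = (R @ b - r) @ np.linalg.solve(V, R @ b - r) / m

# Same statistic from restricted vs. unrestricted sums of squared residuals
Xr = X[:, [0]]                                    # restricted model: intercept only
br = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
rss_r = np.sum((y - Xr @ br) ** 2)
rss_u = resid @ resid
F_rss = ((rss_r - rss_u) / m) / (rss_u / (n - k))

print(F_wald, F_rss)                              # identical up to rounding
print(1 - stats.f.cdf(F_wald, m, n - k))          # exact F(m, n-k) p-value
```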
Under assumption 3, the t statistic is only asymptotically justified: $t_n = \hat\beta_{in}/(s_n\sqrt{\hat w_{ii}})$, where $\hat w_{ii}$ is the $i$th element on the diagonal of the estimated asymptotic variance $\hat Q^{-1}n^{-1}$ of $\hat\beta_n$. If we denote the $i$th diagonal element of $Q^{-1}$ by $q^{ii}$, then under the null hypothesis $\sqrt{n}\,\hat\beta_{in} \to_d N(0, \sigma^2q^{ii})$. Recall that under assumption 3, $s_n \to_p \sigma$, so we have
$$t_n \to_d N(0, 1).$$
2 Maximum Likelihood Estimation
2.1 Review: maximum likelihood principle and Cramer-Rao lower bound
The basic idea of the maximum likelihood principle is to choose the parameter estimates that maximize the probability of obtaining the observed sample. Suppose we observe a sample $X_n = (x_1, x_2, \ldots, x_n)$ and assume that the sample is drawn from an i.i.d. distribution whose parameters are denoted by $\theta$. Let $p(x_t; \theta)$ denote the pdf of the $t$th observation. For example, when $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, then $\theta = (\mu, \sigma^2)$ and
$$p(x_t; \theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_t - \mu)^2}{2\sigma^2}\right).$$
The log likelihood function of the sample is $l(X_n; \theta) = \sum_{t=1}^{n}\log p(x_t; \theta)$, and the maximum likelihood estimates of $\theta$ are chosen so that $l(X_n; \theta)$ is maximized. Define the score function $S(\theta) = \partial l(\theta)/\partial\theta$ and the Hessian matrix $H(\theta) = \partial^2l(\theta)/\partial\theta\partial\theta'$; then the famous Cramer-Rao inequality tells us that the lower bound for the variance of an unbiased estimator of $\theta$ is the inverse of the information matrix $I(\theta_0) = E[S(\theta_0)S(\theta_0)']$, where $\theta_0$ denotes the true value of the parameter. An estimator whose variance equals this bound is known as efficient. Under some regularity conditions, which are satisfied for the Gaussian density, we have the equality
$$I(\theta) = -E[H(\theta)] = -E\left[\frac{\partial^2l(\theta)}{\partial\theta\partial\theta'}\right].$$
So, if we find an unbiased estimator whose variance achieves the Cramer-Rao lower bound, then we know that this estimator is efficient and there is no other unbiased estimator (linear or nonlinear) with a smaller variance. However, this lower bound is not always achievable. If an estimator does achieve this bound, then this estimator is identical to the MLE. Note that the Cramer-Rao inequality holds for unbiased estimators, while sometimes ML estimators are biased. If an estimator is biased but consistent, and its variance approaches the Cramer-Rao bound asymptotically, then this estimator is known as asymptotically efficient.
Example 1 (MLE estimation for an i.i.d. Gaussian distribution) Let $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, so the parameter vector is $\theta = (\mu, \sigma^2)$. Then we have
$$p(x_t; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t - \mu)^2}{2\sigma^2}\right),$$
$$l(X_n; \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{n}(x_t - \mu)^2,$$
$$S(X_n; \mu) = \frac{\partial l(X_n; \theta)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^{n}(x_t - \mu),$$
$$S(X_n; \sigma^2) = \frac{\partial l(X_n; \theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{n}(x_t - \mu)^2.$$
Setting the score functions to zero, we find that the MLE estimators are $\hat\mu = \bar X_n$ and $\hat\sigma^2 = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat\mu)^2$. It is easy to verify that $E(\hat\mu) = E(\bar X_n) = \mu$, so $\hat\mu$ is unbiased, and its variance is $\mathrm{Var}(\hat\mu) = \sigma^2/n$, while
" n #
1 X
E 2 = E (xt )2
n
t=1
= E(xt )2
= E[(xt ) + ( )]2
2 1
= 2 2 + 2
n n
n1 2
=
n
so $\hat\sigma^2$ is biased, but it is consistent since $\hat\sigma^2 \to \sigma^2$ as $n \to \infty$. Define $s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(x_t - \hat\mu)^2$; then $E(s^2) = \sigma^2$ and $\mathrm{Var}(s^2) = 2\sigma^4/(n-1)$.
We can further compute the Hessian matrix,
$$H(X_n; \theta) = \begin{bmatrix}\dfrac{\partial^2l(X_n;\theta)}{\partial\mu^2} & \dfrac{\partial^2l(X_n;\theta)}{\partial\mu\,\partial\sigma^2}\\[8pt] \dfrac{\partial^2l(X_n;\theta)}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2l(X_n;\theta)}{\partial(\sigma^2)^2}\end{bmatrix},$$
where
$$\frac{\partial^2l(X_n;\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$
$$\frac{\partial^2l(X_n;\theta)}{\partial\mu\,\partial\sigma^2} = \frac{\partial^2l(X_n;\theta)}{\partial\sigma^2\,\partial\mu} = -\frac{1}{\sigma^4}\sum_{t=1}^{n}(x_t - \mu),$$
$$\frac{\partial^2l(X_n;\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^{n}(x_t - \mu)^2,$$
therefore the information matrix is
$$I(\theta) = -E[H(X_n; \theta)] = \begin{bmatrix}\dfrac{n}{\sigma^2} & 0\\[4pt] 0 & \dfrac{n}{2\sigma^4}\end{bmatrix}.$$
So the MLE of $\mu$ achieves the Cramer-Rao lower bound of variance $\sigma^2/n$. Although $s^2$ does not attain the lower bound, it turns out that it is still the unbiased estimator of $\sigma^2$ with minimum variance.
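A short numerical sketch of Example 1 (the true values $\mu = 2$, $\sigma^2 = 4$ and the sample size are illustrative): compute $\hat\mu$, $\hat\sigma^2$, and $s^2$ for one simulated sample, and print the Cramer-Rao bounds $\sigma^2/n$ and $2\sigma^4/n$ from the information matrix for comparison.

```python
import numpy as np

rng = np.random.default_rng(6)
mu0, sigma2_0, n = 2.0, 4.0, 500
x = rng.normal(mu0, np.sqrt(sigma2_0), size=n)

# MLE from the score equations: mu_hat = sample mean, sigma2_hat = (1/n) sum (x - mu_hat)^2
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)
s2 = x.var(ddof=1)                        # the unbiased estimator s^2

print(mu_hat, sigma2_hat, s2)
# Cramer-Rao lower bounds from I(theta)^{-1}: sigma^2/n and 2 sigma^4/n
print(sigma2_0 / n, 2 * sigma2_0**2 / n)
```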
Now, since $E[S(X_n, \theta_0)S(X_n, \theta_0)'] + E[H(X_n; \theta_0)] = 0$, we have that $E[S(X_n, \theta_0)S(X_n, \theta_0)'] = -E[H(X_n; \theta_0)]$.
Next, define $s(x_t; \theta) = \partial\log p(x_t; \theta)/\partial\theta$; then we can write the score function as the sum of the $s(x_t; \theta)$, i.e., $S(X_n, \theta) = \sum_{t=1}^{n}s(x_t; \theta)$. The $s(x_t; \theta)$ are i.i.d., and we can show that $E[s(x_t; \theta_0)] = 0$ and $E[s(x_t; \theta_0)s(x_t; \theta_0)'] = -E[H(x_t; \theta_0)]$. Applying the Lindeberg-Levy CLT, we obtain the asymptotic normality of the score function:
$$n^{-1/2}S(X_n; \theta_0) \to_d N(0, \mathcal{I}), \qquad \text{where } \mathcal{I} \equiv -E[H(x_t; \theta_0)] = -\frac{1}{n}E[H(X_n; \theta_0)].$$
Next, we consider the properties of the Hessian matrix. First we assume that $E[H(x_t; \theta_0)]$ is non-singular. Let $\mathcal{N}$ be a neighborhood of $\theta_0$; then for any sequence $\bar\theta_n$ in $\mathcal{N}$ with $\bar\theta_n \to_p \theta_0$, we have
$$\frac{1}{n}\sum_{t=1}^{n}H(x_t; \bar\theta_n) \to_p E[H(x_t; \theta_0)] = -\mathcal{I}.$$
Proposition 3 (Asymptotic normality of the MLE) With all the conditions we have outlined above,
$$\sqrt{n}(\hat\theta - \theta_0) \to_d N(0, \mathcal{I}^{-1}).$$
To see this, expand the first order condition $S(X_n; \hat\theta) = 0$ around $\theta_0$, which gives $\hat\theta - \theta_0 \approx -H(X_n; \theta_0)^{-1}S(X_n; \theta_0)$. Therefore, we have
$$\sqrt{n}(\hat\theta - \theta_0) = -\sqrt{n}\,H(X_n; \theta_0)^{-1}S(X_n; \theta_0) = -\left[\frac{1}{n}H(X_n; \theta_0)\right]^{-1}\left[\frac{1}{\sqrt{n}}S(X_n; \theta_0)\right] \to_d N(0, \mathcal{I}^{-1}\mathcal{I}\mathcal{I}^{-1}) = N(0, \mathcal{I}^{-1}).$$
There are two common ways to estimate the information matrix, and hence the variance of $\hat\theta$. The first is to compute the Hessian matrix and evaluate it at $\theta = \hat\theta$, i.e. $\hat V = -H(X_n; \hat\theta)$. The second way is to use the outer product estimate, which is
$$\hat V = \sum_{t=1}^{n}s(x_t; \hat\theta)s(x_t; \hat\theta)'.$$
In either case, the variance of $\hat\theta$ is then estimated by $\hat V^{-1}$.
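As a sketch of how this is done in practice (the Gaussian example from above, maximized numerically with SciPy; the log-variance parameterization is my own device to keep $\sigma^2$ positive, and both variance estimates are reported in the original $(\mu, \sigma^2)$ coordinates):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=400)          # illustrative sample
n = x.size

def negloglik(theta):
    mu, log_s2 = theta                       # sigma^2 = exp(log_s2) stays positive
    s2 = np.exp(log_s2)
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum((x - mu) ** 2) / (2 * s2)

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method='BFGS')
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])

# Hessian-based estimate V = -H(X_n; theta_hat), using the formulas from Example 1
H = np.array([[-n / s2_hat, -np.sum(x - mu_hat) / s2_hat**2],
              [-np.sum(x - mu_hat) / s2_hat**2,
               n / (2 * s2_hat**2) - np.sum((x - mu_hat)**2) / s2_hat**3]])
V_hess = -H

# Outer-product estimate V = sum_t s(x_t; theta_hat) s(x_t; theta_hat)'
scores = np.column_stack([(x - mu_hat) / s2_hat,
                          -0.5 / s2_hat + (x - mu_hat) ** 2 / (2 * s2_hat ** 2)])
V_opg = scores.T @ scores

# Either inverse estimates var(theta_hat); compare with [[s2/n, 0], [0, 2 s2^2/n]]
print(np.linalg.inv(V_hess))
print(np.linalg.inv(V_opg))
```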
Note that the $\hat\beta_n$ that maximizes $l$ is the vector that minimizes the sum of squares; therefore, under assumption 2, the OLS estimator is equivalent to the ML estimator of $\beta_0$. It can be shown that this estimator is unbiased and achieves the Cramer-Rao lower bound, so under assumption 2 the OLS/MLE estimator is efficient (compared to all unbiased, linear or nonlinear, estimators). Recall that under assumption 1 we have the Gauss-Markov theorem to show that the OLS estimator is the best linear unbiased estimator; now the Cramer-Rao inequality establishes the optimality of the OLS estimator under assumption 2. The ML estimator for $\sigma^2$ is $(Y - X\hat\beta)'(Y - X\hat\beta)/n$. We introduced this estimator a moment ago, and we showed that the difference between $\hat\sigma_n^2$ and the OLS estimator $s_n^2$ becomes arbitrarily small as $n \to \infty$.
Next, consider assumption 5, where $U|X \sim N(0, \sigma^2V)$ and $V$ is known. Then the log likelihood function, omitting constant terms, is
$$l(\beta, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y_n - X_n\beta)'V^{-1}(Y_n - X_n\beta).$$
Similarly, the probability density for the $n$th observation conditional on $x_{n-1}$ is
$$p(x_n|x_{n-1}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_n - c - \phi x_{n-1})^2}{2\sigma^2}\right).$$
Taking logs, we get the exact log likelihood function (omitting constant terms for simplicity)
$$l(X_n; \theta) = -\frac{1}{2}\log\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)} - \frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^{n}\frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}. \qquad (11)$$
Next, to construct the conditional likelihood, assume that $x_1$ is observable and condition on it; then the log likelihood function is (again, constant terms are omitted)
$$l(X_n; \theta) = -\frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^{n}\frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}. \qquad (12)$$
The maximum likelihood estimates $\hat c$ and $\hat\phi$ are obtained by maximizing (12), or by solving the score equations. Note that maximizing (12) with respect to $(c, \phi)$ is equivalent to minimizing
$$\sum_{t=2}^{n}(x_t - c - \phi x_{t-1})^2,$$
so the conditional MLE of $(c, \phi)$ is just the OLS estimator from regressing $x_t$ on a constant and $x_{t-1}$.
3 Model Selection
In the discussion of estimation above, we assumed that the order of the lags is known. However, in empirical estimation, we have to choose a proper order. A larger number of lags (parameters) will improve the fit of the model, therefore we need some criterion to balance goodness of fit against model parsimony. There are three commonly used criteria: the Akaike information criterion (AIC), Schwarz's Bayesian information criterion (BIC), and the posterior information criterion (PIC) developed by Phillips (1996).
In all these criteria, we specify a maximum order $k_{max}$, and then choose $k$ to minimize a criterion function.
$$AIC = \log\left(\frac{SSR_k}{n}\right) + \frac{2k}{n}, \qquad (13)$$
where $n$ is the sample size, $k = 1, 2, \ldots, k_{max}$ is the number of parameters in the model, and $SSR_k$ is the sum of squared residuals from the fitted model. When $k$ increases, the fit improves, so $SSR_k$ decreases, but the second term increases. So this shows a trade-off between fit and parsimony. Since the model is estimated using different lags, the sample size also varies: we can either use the varying sample size $n - k$, or we can use a fixed sample size $n - k_{max}$. Ng and Perron (2000) recommend using the fixed sample size and using it to replace $n$ in the criterion. However, the AIC rule is not consistent and tends to overfit the model by choosing a larger $k$.
With all other issues handled as in the AIC rule, the BIC rule imposes a larger penalty for increasing the number of parameters:
$$BIC = \log\left(\frac{SSR_k}{n}\right) + \frac{k\log(n)}{n}. \qquad (14)$$
BIC suggests a smaller $k$ than AIC, and the BIC rule is consistent for stationary data, i.e., $\lim_{n\to\infty}\hat k_{BIC} = k$. Further, Hannan and Deistler (1988) have shown that $\hat k_{BIC}$ is consistent when we set $k_{max} = [c\log(n)]$ (the integer part of $c\log(n)$) for any $c > 0$. Therefore, we can estimate $k_{BIC}$ consistently without knowing an upper bound for $k$.
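A minimal sketch of AIC/BIC lag selection for an AR model (Python/NumPy; the simulated AR(2) and the helper function are illustrative), using the fixed common sample of size $n - k_{max}$ as recommended above:

```python
import numpy as np

rng = np.random.default_rng(9)
n, burn, kmax = 500, 100, 8
phi = np.array([0.5, 0.3])                   # true order is 2 (assumed for illustration)

y = np.zeros(n + burn)
eps = rng.normal(size=n + burn)
for t in range(2, n + burn):
    y[t] = phi[0] * y[t - 1] + phi[1] * y[t - 2] + eps[t]
y = y[burn:]

def ssr_ar(y, k, kmax):
    """SSR of an AR(k) fit on the common sample t = kmax, ..., n-1 (fixed sample size)."""
    Y = y[kmax:]
    X = np.column_stack([y[kmax - j:-j] for j in range(1, k + 1)])
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    return np.sum((Y - X @ b) ** 2), len(Y)

for k in range(1, kmax + 1):
    ssr, T = ssr_ar(y, k, kmax)
    aic = np.log(ssr / T) + 2 * k / T
    bic = np.log(ssr / T) + k * np.log(T) / T
    print(k, round(aic, 4), round(bic, 4))   # choose the k that minimizes AIC or BIC
```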
Finally, to present the PIC criterion, let $K = k_{max}$, and let $X(K)$ and $X(k)$ denote the regressor matrices with $K$ and $k$ parameters respectively, and similarly for the parameter vector $\beta$. Then
$$PIC = \left|A(k)/\hat\sigma_k^2\right|^{1/2}\exp\left\{-\frac{1}{2\hat\sigma_k^2}\,\tilde\beta(k)'A(k)\tilde\beta(k)\right\}.$$
PIC is asymptotically equivalent to the BIC criterion when the data is stationary, and when
the data is nonstationary, PIC is still consistent.
Reading: Hamilton, Ch. 5, 8.