
Lecture 5: Linear Regressions

In Lecture 2, we introduced stationary linear time series models. In that lecture, we discussed
the data generating processes and their characteristics, assuming that we know all parameters
(autoregressive or moving average coefficients). However, in empirical studies, we have to specify
an econometric model, estimate this model, and draw inferences based on the estimates. In this
lecture, we will provide an introduction to parametric estimation of a linear model with time
series observations. Three commonly used estimation methods are least squares estimation (LS),
maximum likelihood estimation (MLE), and the generalized method of moments (GMM). In this
lecture, we will discuss LS and MLE.

1 Least Squares Estimation


Least squares (LS) estimation is one of the first techniques we learn in econometrics. It is both
intuitive and easy to implement, and the famous Gauss-Markov theorem tells us that under certain
assumptions, the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE).
We will start with a review of classical LS estimation and then consider estimation under
relaxed assumptions.
Below are our notations in this lecture and the basic algebra in LS estimation. Consider the
regression
$$y_t = x_t'\beta_0 + u_t, \quad t = 1, \ldots, n \qquad (1)$$
where $x_t$ is a $k \times 1$ vector and $\beta_0$, also a $k \times 1$ vector, is the true parameter. Then the OLS estimator
of $\beta_0$, denoted by $\hat\beta_n$, is
$$\hat\beta_n = \left[\sum_{t=1}^n x_t x_t'\right]^{-1}\left[\sum_{t=1}^n x_t y_t\right] \qquad (2)$$
and the OLS sample residual is $\hat u_t = y_t - x_t'\hat\beta_n$.
Sometimes it is more convenient to work in matrix form. Define
$$Y_n = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
X_n = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}, \qquad
U_n = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}.$$
Then the regression can be written as
$$Y_n = X_n\beta_0 + U_n, \qquad (3)$$

and the OLS estimator can be written as
$$\hat\beta_n = (X_n'X_n)^{-1}X_n'Y_n. \qquad (4)$$
Define
$$M_X = I_n - X_n(X_n'X_n)^{-1}X_n'.$$
It is easy to see that $M_X$ is symmetric, idempotent ($M_X M_X = M_X$), and orthogonal to the
columns of $X_n$. Then we have
$$\hat U_n = Y_n - X_n\hat\beta_n = M_X Y_n.$$
To derive the distribution of the estimator $\hat\beta_n$, write
$$\hat\beta_n = (X_n'X_n)^{-1}X_n'Y_n = (X_n'X_n)^{-1}X_n'(X_n\beta_0 + U_n) = \beta_0 + (X_n'X_n)^{-1}X_n'U_n. \qquad (5)$$
Therefore, the properties of $\hat\beta_n$ depend on $(X_n'X_n)^{-1}X_n'U_n$. For example, if $E[(X_n'X_n)^{-1}X_n'U_n] = 0$, then $\hat\beta_n$ is an unbiased estimator.
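To make the matrix algebra in (2)-(6) concrete, here is a minimal sketch (not part of the original notes) of the OLS computations in NumPy; the sample size, coefficient values, and random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Simulate y_t = x_t' beta_0 + u_t with illustrative parameter values.
beta0 = np.array([1.0, 0.5, -0.3])
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # X_n: n x k
u = rng.normal(size=n)                                            # U_n
y = X @ beta0 + u                                                  # Y_n

# OLS estimator (4): beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals via the annihilator matrix M_X = I - X(X'X)^{-1}X'.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
u_hat = M @ y                        # equals y - X @ beta_hat up to rounding

# Unbiased variance estimator (6): s^2 = U_hat'U_hat / (n - k).
s2 = u_hat @ u_hat / (n - k)
cov_beta = s2 * np.linalg.inv(X.T @ X)   # estimated Var(beta_hat) = s^2 (X'X)^{-1}

print(beta_hat, np.sqrt(np.diag(cov_beta)))
```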

1.1 Case 1: OLS with deterministic regressors and i.i.d. Gaussian errors
Assumption 1 (a) $x_t$ is deterministic; (b) $u_t \sim$ i.i.d.$(0, \sigma^2)$; (c) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.

Under Assumption 1 (a) and (b), $E(U_n) = 0$ and $E(U_n U_n') = \sigma^2 I_n$. Then from (5) we have
$$E(\hat\beta_n) = \beta_0 + (X_n'X_n)^{-1}X_n'E(U_n) = \beta_0,$$
and
$$E[(\hat\beta_n - \beta_0)(\hat\beta_n - \beta_0)'] = E[(X_n'X_n)^{-1}X_n'U_n U_n'X_n(X_n'X_n)^{-1}]
= (X_n'X_n)^{-1}X_n'E(U_n U_n')X_n(X_n'X_n)^{-1} = \sigma^2(X_n'X_n)^{-1}.$$

Under these assumptions, the Gauss-Markov theorem tells us that the OLS estimator $\hat\beta_n$ is the best
linear unbiased estimator for $\beta_0$. The OLS estimator for $\sigma^2$ is
$$s_n^2 = \hat U_n'\hat U_n/(n-k) = U_n'M_X'M_X U_n/(n-k) = U_n'M_X U_n/(n-k). \qquad (6)$$
Since $M_X$ is symmetric, there exists an $n \times n$ matrix $P$ such that
$$M_X = P\Lambda P' \quad \text{and} \quad P'P = I_n,$$
where $\Lambda$ is an $n \times n$ matrix with the eigenvalues of $M_X$ along the principal diagonal and zeros
elsewhere. From the properties of $M_X$ we can compute that $\Lambda$ contains $k$ zeros and $n-k$ ones along
its principal diagonal. Then
$$RSS = U_n'M_X U_n = U_n'P\Lambda P'U_n = (P'U_n)'\Lambda(P'U_n) = W_n'\Lambda W_n = \sum_{t=1}^n \lambda_t w_t^2,$$

where $W_n = P'U_n$. Then $E(W_n W_n') = P'E(U_n U_n')P = \sigma^2 I_n$; therefore, the $w_t$ are uncorrelated with
mean 0 and variance $\sigma^2$. Therefore,
$$E(U_n'M_X U_n) = \sum_{t=1}^n \lambda_t E(w_t^2) = (n-k)\sigma^2.$$
So $s_n^2$ defined in (6) is an unbiased estimator for $\sigma^2$: $E(s_n^2) = \sigma^2$.

With the Gaussian assumption (c), $\hat\beta_n$ is also Gaussian,
$$\hat\beta_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1}).$$
Note that here $\hat\beta_n$ is exactly normal, while many of the estimators in our later discussions are only
asymptotically normal. In fact, under Assumption 1 the OLS estimator is optimal. Also, with the
Gaussian assumption, $w_t \sim$ i.i.d. $N(0, \sigma^2)$. Therefore we have
$$U_n'M_X U_n/\sigma^2 \sim \chi^2(n-k).$$

1.2 Case 2: OLS with stochastic regressors and i.i.d. Gaussian errors
The assumption of deterministic regressors is very strong for empirical studies in economics. Some
examples of deterministic regressors are constants and deterministic trends (i.e. $x_t = (1, t, t^2, \ldots)$).
However, most data we have for econometric regressions are stochastic. Therefore, from this subsection
on, we will allow the regressors to be stochastic. However, in Cases 2 and 3, we assume that
$x_t$ is independent of the errors (at all leads and lags). This is still too strong for time series, as it rules out
many processes, including ARMA models.

Assumption 2 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.

This assumption can be equivalently written as $U_n|X_n \sim N(0, \sigma^2 I_n)$. Under these assumptions,
$\hat\beta_n$ is still unbiased:
$$E(\hat\beta_n) = \beta_0 + E[(X_n'X_n)^{-1}X_n']E(U_n) = \beta_0.$$
Conditional on $X_n$, $\hat\beta_n$ is normal: $\hat\beta_n|X_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$. To get the unconditional
probability distribution for $\hat\beta_n$, we have to integrate this conditional density over $X$. Therefore, the
unconditional distribution of $\hat\beta_n$ will depend on the distribution of $X$. However, we still have the
unconditional distribution for the estimate of the variance: $U_n'M_X U_n/\sigma^2 \sim \chi^2(n-k)$.

1.3 Case 3: OLS with stochastic regressors and i.i.d. Non-Gaussian errors
Compared to Case 2, in this section we let the error terms follow an arbitrary i.i.d. distribution
with finite fourth moments. Since this is an arbitrary unknown distribution, it is very hard to obtain
the exact (finite sample) distribution of $\hat\beta_n$; instead, we will apply asymptotic theory to
this problem.

Assumption 3 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d.$(0, \sigma^2)$, and
$E(u_t^4) = \mu_4 < \infty$; (c) $E(x_t x_t') = Q_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n Q_t \to Q$, a positive
definite matrix; (d) $E(x_{it}x_{jt}x_{kt}x_{lt}) < \infty$ for all $i, j, k, l$ and $t$; (e) $(1/n)\sum_{t=1}^n (x_t x_t') \to_p Q$.

With assumption (a), we still have that $\hat\beta_n$ is an unbiased estimator for $\beta_0$. Assumptions (c) through
(e) are restrictions on $x_t$. Basically, we want to have $(1/n)\sum_{t=1}^n x_t x_t' \to_p \lim_{n\to\infty}(1/n)\sum_{t=1}^n E(x_t x_t') = Q$.

We have
$$\hat\beta_n - \beta_0 = \left[\sum_{t=1}^n x_t x_t'\right]^{-1}\left[\sum_{t=1}^n x_t u_t\right]
= \left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1}\left[(1/n)\sum_{t=1}^n x_t u_t\right].$$

From the assumptions and the continuous mapping theorem, we have
$$\left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1} \to_p Q^{-1}.$$
$x_t u_t$ is a martingale difference sequence with finite variance, so by the LLN for mixingales we
have
$$(1/n)\sum_{t=1}^n x_t u_t \to_p 0.$$

Therefore $\hat\beta_n \to_p \beta_0$, so $\hat\beta_n$ is a consistent estimator. Next, we will derive its asymptotic
distribution. This is the first time we derive the asymptotic distribution of an OLS estimator. The routine for
deriving the asymptotic distribution of $\hat\beta_n$ is as follows: first we apply a LLN to the term
$\sum_{t=1}^n x_t x_t'$, after proper norming (so that the limit is a constant); then we apply the continuous mapping
theorem to get the limit of $[\sum_{t=1}^n x_t x_t']^{-1}$. We already obtained this in the above proof of consistency
of $\hat\beta_n$. Then we apply a CLT to the term $\sum_{t=1}^n x_t u_t$, also after proper norming (so that the limit
is nondegenerate).
Note that $E(x_t x_t' u_t^2) = \sigma^2 Q_t$ and $(1/n)\sum_{t=1}^n \sigma^2 Q_t \to \sigma^2 Q$. By the CLT for mds, we have
$$(1/\sqrt{n})\sum_{t=1}^n x_t u_t \to_d N(0, \sigma^2 Q).$$

Therefore,
$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1}\left[(1/\sqrt{n})\sum_{t=1}^n x_t u_t\right]
\to_d N\!\big(0, Q^{-1}(\sigma^2 Q)Q^{-1}\big) = N(0, \sigma^2 Q^{-1}),$$
so $\hat\beta_n$ approximately follows
$$\hat\beta_n \sim N\!\left(\beta_0, \frac{\sigma^2 Q^{-1}}{n}\right).$$
Note that this distribution is not exact, but approximate, so we should read it as "approximately
distributed as normal."

To compute this variance, we need to know $\sigma^2$. When it is unknown, the OLS estimator $s_n^2$ is
still consistent under Assumption 3. We have
$$u_t^2 = (y_t - x_t'\beta_0)^2 = [y_t - x_t'\hat\beta_n + x_t'(\hat\beta_n - \beta_0)]^2
= (y_t - x_t'\hat\beta_n)^2 + 2(y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) + [x_t'(\hat\beta_n - \beta_0)]^2.$$
By the LLN, we have $(1/n)\sum_{t=1}^n u_t^2 \to_p \sigma^2$. There are three terms on the right-hand side of the
above equation. For the second term, we have
$$(1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) = 0$$
as $(y_t - x_t'\hat\beta_n)$ is orthogonal to $x_t$. For the third term,
$$(\hat\beta_n - \beta_0)'\left[(1/n)\sum_{t=1}^n x_t x_t'\right](\hat\beta_n - \beta_0) \to_p 0$$
as $\hat\beta_n - \beta_0$ is $o_p(1)$ and $(1/n)\sum_{t=1}^n x_t x_t' \to_p Q$. Therefore, we can define
$$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2,$$
and we have
$$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2 = (1/n)\sum_{t=1}^n u_t^2 - (1/n)\sum_{t=1}^n [x_t'(\hat\beta_n - \beta_0)]^2 \to_p \sigma^2.$$
This estimator is only slightly different from $s_n^2$ ($\hat\sigma_n^2 = (n-k)s_n^2/n$). Since $(n-k)/n \to 1$ as
$n \to \infty$, if $\hat\sigma_n^2$ is consistent, so is $s_n^2$.
Next, we derive the distribution of $\hat\sigma_n^2$:
$$\sqrt{n}(\hat\sigma_n^2 - \sigma^2) = (1/\sqrt{n})\sum_{t=1}^n (u_t^2 - \sigma^2)
- \sqrt{n}(\hat\beta_n - \beta_0)'\left[(1/n)\sum_{t=1}^n x_t x_t'\right](\hat\beta_n - \beta_0).$$
The second term goes to zero since $(1/n)\sum_{t=1}^n x_t x_t' \to_p Q$ and $\hat\beta_n - \beta_0 \to_p 0$. Define $z_t = u_t^2 - \sigma^2$;
then $z_t$ is i.i.d. with mean zero and variance $E(u_t^4) - \sigma^4 = \mu_4 - \sigma^4$. Applying the CLT, we have
$$(1/\sqrt{n})\sum_{t=1}^n z_t \to_d N(0, \mu_4 - \sigma^4),$$
therefore,
$$\sqrt{n}(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4).$$
The same limit distribution applies for $s_n^2$, since the difference between $\hat\sigma_n^2$ and $s_n^2$ is $o_p(n^{-1/2})$.
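A small Monte Carlo can illustrate the Case 3 result that $\sqrt{n}(\hat\beta_n - \beta_0)$ is approximately $N(0, \sigma^2 Q^{-1})$ even with non-Gaussian i.i.d. errors. The sketch below (not from the original notes) uses a centered chi-square error with unit variance and an illustrative design for which $\sigma^2 Q^{-1}$ is the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 2000
beta0 = np.array([1.0, 2.0])

draws = np.empty((reps, 2))
for r in range(reps):
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    # i.i.d. non-Gaussian errors with mean 0 and variance 1 (centered chi-square).
    u = (rng.chisquare(df=1, size=n) - 1.0) / np.sqrt(2.0)
    y = x @ beta0 + u
    b = np.linalg.solve(x.T @ x, x.T @ y)
    draws[r] = np.sqrt(n) * (b - beta0)

# Sample covariance of sqrt(n)(beta_hat - beta_0) should be close to sigma^2 Q^{-1}.
# For this design Q = E(x_t x_t') = I and sigma^2 = 1, so the target is the identity.
print(np.cov(draws.T))
```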

1.4 Case 4: OLS estimation in autoregression with i.i.d. error
In an autoregression, say $x_t = \rho_0 x_{t-1} + \epsilon_t$ where $\epsilon_t$ is i.i.d., the regressors are no longer
independent of the errors at all leads and lags. In this case, the OLS estimator of $\rho_0$ is biased. However, we will show that under
Assumption 4, the estimator is consistent.

Assumption 4 The regression model is
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t,$$
with the roots of $(1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p) = 0$ outside the unit circle (so $y_t$ is stationary) and with
$\epsilon_t$ i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment $\mu_4$.

Pages 215-216 in Hamilton present the general AR(p) case with a constant. We will use an AR(2)
as an example: $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t$. Let $x_t' = (y_{t-1}, y_{t-2})$, $u_t = \epsilon_t$ and $y_t = x_t'\beta_0 + u_t$ (so
$\beta_0' = (\phi_1, \phi_2)$).
$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1}\left[(1/\sqrt{n})\sum_{t=1}^n x_t u_t\right] \qquad (7)$$
The first term is
$$(1/n)\sum_{t=1}^n x_t x_t' = (1/n)\begin{bmatrix} \sum_{t=1}^n y_{t-1}^2 & \sum_{t=1}^n y_{t-1}y_{t-2} \\ \sum_{t=1}^n y_{t-1}y_{t-2} & \sum_{t=1}^n y_{t-2}^2 \end{bmatrix}.$$
In this matrix, the diagonal terms $n^{-1}\sum_{t=1}^n y_{t-j}^2$ converge to $\gamma_0$, and the off-diagonal term $n^{-1}\sum_{t=1}^n y_{t-1}y_{t-2}$
converges to $\gamma_1$. Therefore,
$$(1/n)\sum_{t=1}^n x_t x_t' \to_p Q = \begin{bmatrix} \gamma_0 & \gamma_1 \\ \gamma_1 & \gamma_0 \end{bmatrix}.$$

Applying the CLT for mds to the second term in (7),
$$(1/\sqrt{n})\sum_{t=1}^n x_t u_t \to_d N(0, \sigma^2 Q),$$
therefore,
$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2 Q^{-1}).$$
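To illustrate Case 4, the sketch below (not from the original notes) simulates a stationary AR(2) and estimates $(\phi_1, \phi_2)$ by OLS on lagged values; the coefficient values, burn-in length, and sample size are illustrative, and the chosen coefficients satisfy the stationarity condition.

```python
import numpy as np

rng = np.random.default_rng(2)
n, phi1, phi2, sigma = 2000, 0.5, 0.3, 1.0   # roots of 1 - 0.5z - 0.3z^2 lie outside the unit circle

# Simulate y_t = phi1*y_{t-1} + phi2*y_{t-2} + eps_t (with a burn-in toward stationarity).
burn = 200
e = rng.normal(scale=sigma, size=n + burn)
y = np.zeros(n + burn)
for t in range(2, n + burn):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + e[t]
y = y[burn:]

# OLS of y_t on (y_{t-1}, y_{t-2}): biased in finite samples but consistent.
X = np.column_stack([y[1:-1], y[:-2]])   # x_t' = (y_{t-1}, y_{t-2})
Y = y[2:]
phi_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(phi_hat)   # close to (0.5, 0.3) for large n
```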
So far we have considered four cases of OLS regression. The common assumption in all four cases
is i.i.d. errors. From the next section on, we will consider cases where the errors are not i.i.d.

1.5 OLS with non-i.i.d. errors


When the error $u_t$ is i.i.d., the variance-covariance matrix is $V = E(U_n U_n') = \sigma^2 I_n$. If $V$ is
still diagonal but the diagonal elements are not equal, for example, the errors on some dates display larger
variance and on other dates smaller variance, then the errors are said to exhibit
heteroskedasticity. If $V$ is non-diagonal, then the errors are said to be autocorrelated. For example,
if $u_t = \epsilon_t + \theta\epsilon_{t-1}$ (an MA(1) error) where $\epsilon_t$ is i.i.d., then $u_t$ is a serially correlated error.
Case 5 in Hamilton assumes:

Assumption 5 (a) $x_t$ is stochastic; (b) conditional on the full matrix $X$, the vector $U \sim N(0, \sigma^2 V)$;
(c) $V$ is a known positive definite matrix.

Under these assumptions, the exact distribution of $\hat\beta_n$ can be derived. However, this is a very
strong assumption and it rules out autoregressions. Also, the assumption that $V$ is
known rarely holds in applications.
Case 6 in Hamilton assumes uncorrelated but heteroskedastic errors with unknown covariance
matrix. Under assumption 6, the OLS estimator is still consistent and asymptotically normal.

Assumption 6 (a) $x_t$ is stochastic, including perhaps lagged values of $y$; (b) $x_t u_t$ is a martingale
difference sequence; (c) $E(u_t^2 x_t x_t') = \Omega_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n \Omega_t \to \Omega$, a positive
definite matrix, and $(1/n)\sum_{t=1}^n u_t^2 x_t x_t' \to_p \Omega$; (d) $E(u_t^4 x_{it}x_{jt}x_{lt}x_{kt}) < \infty$ for all $i, j, k, l$ and $t$; (e) plims of
$(1/n)\sum_{t=1}^n u_t x_{it} x_t x_t'$ and $(1/n)\sum_{t=1}^n x_{it}x_{jt} x_t x_t'$ exist and are finite for all $i, j$, and $(1/n)\sum_{t=1}^n x_t x_t' \to_p Q$, a nonsingular matrix.

Again, write the OLS estimator as
$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1}\left[(1/\sqrt{n})\sum_{t=1}^n x_t u_t\right].$$
Assumption 6 (e) ensures that
$$\left[(1/n)\sum_{t=1}^n x_t x_t'\right]^{-1} \to_p Q^{-1}.$$
Applying the CLT for mds,
$$(1/\sqrt{n})\sum_{t=1}^n x_t u_t \to_d N(0, \Omega),$$
therefore,
$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d N(0, Q^{-1}\Omega Q^{-1}).$$
However, both $Q$ and $\Omega$ are unobservable and we need to find consistent estimates for them.
White proposes the estimators $\hat Q_n = (1/n)\sum_{t=1}^n x_t x_t'$ and $\hat\Omega_n = (1/n)\sum_{t=1}^n \hat u_t^2 x_t x_t'$, where
$\hat u_t$ is the OLS residual $y_t - x_t'\hat\beta_n$.

Proposition 1 With heteroskedasticity of unknown form satisfying Assumption 6, the asymptotic
variance-covariance matrix of the OLS coefficient vector can be consistently estimated by
$$\hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1} \to_p Q^{-1}\Omega Q^{-1}. \qquad (8)$$

Proof: Assumption 6 (e) ensures $\hat Q_n \to_p Q$, and Assumption 6 (c) ensures that
$$\tilde\Omega_n \equiv (1/n)\sum_{t=1}^n u_t^2 x_t x_t' \to_p \Omega.$$
So to prove (8), we only need to show that
$$\hat\Omega_n - \tilde\Omega_n = (1/n)\sum_{t=1}^n (\hat u_t^2 - u_t^2)x_t x_t' \to_p 0.$$

The trick here is to make use of the known fact that $\hat\beta_n - \beta_0 \to_p 0$. If we can write $\hat\Omega_n - \tilde\Omega_n$ as
sums of products of $\hat\beta_n - \beta_0$ and terms that are bounded, then $\hat\Omega_n - \tilde\Omega_n \to_p 0$. Write
$$\hat u_t^2 - u_t^2 = (\hat u_t + u_t)(\hat u_t - u_t)
= \big[2(y_t - \beta_0'x_t) - (\hat\beta_n - \beta_0)'x_t\big]\big[-(\hat\beta_n - \beta_0)'x_t\big]
= -2u_t(\hat\beta_n - \beta_0)'x_t + \big[(\hat\beta_n - \beta_0)'x_t\big]^2.$$
Then
$$\hat\Omega_n - \tilde\Omega_n = -(2/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t\,(x_t x_t') + (1/n)\sum_{t=1}^n \big[(\hat\beta_n - \beta_0)'x_t\big]^2(x_t x_t').$$

Write the first term as
$$-(2/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t\,(x_t x_t') = -2\sum_{i=1}^k (\hat\beta_{in} - \beta_{i0})\left[(1/n)\sum_{t=1}^n u_t x_{it}(x_t x_t')\right].$$
The term in brackets has a finite plim by Assumption 6 (e), and we have $\hat\beta_{in} - \beta_{i0} \to_p 0$ for
each $i$, so this term converges to zero. (If this looks messy, take $k = 1$; then you can simply
move $(\hat\beta_n - \beta_0)$ out of the summation: $\hat\beta_n - \beta_0 \to_p 0$ and the sum has a finite plim, so the product
goes to zero.)
Similarly for the second term,
$$(1/n)\sum_{t=1}^n \big[(\hat\beta_n - \beta_0)'x_t\big]^2(x_t x_t')
= \sum_{i=1}^k\sum_{j=1}^k (\hat\beta_{in} - \beta_{i0})(\hat\beta_{jn} - \beta_{j0})\left[(1/n)\sum_{t=1}^n x_{it}x_{jt}(x_t x_t')\right] \to_p 0,$$
as the term in brackets has a finite plim. Therefore, $\hat\Omega_n - \tilde\Omega_n \to_p 0$.


Define $\hat V_n = \hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1}$; then
$$\hat\beta_n \approx N(\beta_0, \hat V_n/n),$$
and $\hat V_n/n$ is a heteroskedasticity-consistent estimate of the variance-covariance matrix. Newey and West
propose the following estimator for the variance-covariance matrix, which is heteroskedasticity and
autocorrelation consistent (HAC):
$$\hat V_n/n = (X_n'X_n)^{-1}\left[\sum_{t=1}^n \hat u_t^2 x_t x_t' + \sum_{k=1}^q\left(1 - \frac{k}{q+1}\right)\sum_{t=k+1}^n\big(x_t\hat u_t\hat u_{t-k}x_{t-k}' + x_{t-k}\hat u_{t-k}\hat u_t x_t'\big)\right](X_n'X_n)^{-1}.$$
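The sketch below (not from the original notes) computes the White covariance and the Newey-West HAC covariance following the two formulas above. The helper names `white_cov` and `newey_west_cov`, the lag truncation $q$, and the simulated data are all illustrative choices.

```python
import numpy as np

def white_cov(X, u_hat):
    """White estimator: (X'X)^{-1} [sum_t u_t^2 x_t x_t'] (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    S = (X * u_hat[:, None] ** 2).T @ X          # sum_t u_t^2 x_t x_t'
    return XtX_inv @ S @ XtX_inv

def newey_west_cov(X, u_hat, q):
    """Newey-West HAC estimator with Bartlett weights 1 - k/(q+1)."""
    S = (X * u_hat[:, None] ** 2).T @ X
    for k in range(1, q + 1):
        w = 1.0 - k / (q + 1.0)
        # sum_{t=k+1}^n x_t u_t u_{t-k} x_{t-k}'
        G = (X[k:] * (u_hat[k:] * u_hat[:-k])[:, None]).T @ X[:-k]
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv

# Illustration on simulated data with heteroskedastic, autocorrelated errors.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
e = rng.normal(size=n) * (1 + np.abs(x))         # heteroskedastic innovation
u = np.empty(n); u[0] = e[0]
for t in range(1, n):
    u[t] = 0.4 * u[t - 1] + e[t]                 # autocorrelated error
y = X @ np.array([1.0, 2.0]) + u
b = np.linalg.solve(X.T @ X, X.T @ y)
res = y - X @ b
print(np.sqrt(np.diag(white_cov(X, res))))       # White standard errors
print(np.sqrt(np.diag(newey_west_cov(X, res, q=5))))   # HAC standard errors
```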

1.6 Generalized least squares


Generalized least squares (GLS) and feasible generalized least squares (FGLS) are preferred in least squares
estimation when the errors are heteroskedastic and/or autocorrelated.
Let $x_t$ be stochastic and $U|X \sim N(0, \sigma^2 V)$ where $V$ is known (Assumption 5). Since $V$ is
symmetric and positive definite, there exists a matrix $L$ such that $V^{-1} = L'L$. Premultiplying our
regression by $L$ gives
$$LY = LX\beta_0 + LU.$$

Then the new error $\tilde U = LU$ is i.i.d. conditional on $X$:
$$E(\tilde U\tilde U'|X) = LE(UU'|X)L' = \sigma^2 LVL' = \sigma^2 I_n.$$
Then the estimator
$$\hat\beta_n^{GLS} = (X'L'LX)^{-1}X'L'LY = (X'V^{-1}X)^{-1}X'V^{-1}Y$$
is known as the generalized least squares estimator.
However, as we remarked earlier, in applications $V$ is rarely known and we have to estimate it.
The GLS estimator obtained using an estimated $V$ is known as the feasible GLS (FGLS) estimator. Usually, FGLS
requires that we specify a parametric model for the error. For example, let the error $u_t$ follow an
AR(1) process, $u_t = \rho_0 u_{t-1} + \epsilon_t$, where $\epsilon_t \sim$ i.i.d.$(0, \sigma^2)$. In this case, we can run OLS first,
obtain the OLS residuals $\hat u_t$, and then run an OLS regression of $\hat u_t$ on $\hat u_{t-1}$ to estimate $\rho$. This estimator, denoted
by $\hat\rho_n$, is a consistent estimator of $\rho_0$. To show this, write
$$\hat u_t = y_t - x_t'\hat\beta_n = (y_t - x_t'\beta_0) + x_t'\beta_0 - x_t'\hat\beta_n = u_t + (\beta_0 - \hat\beta_n)'x_t.$$
Then
$$\frac{1}{n}\sum_{t=1}^n \hat u_t\hat u_{t-1} = \frac{1}{n}\sum_{t=1}^n \big[u_t + (\beta_0 - \hat\beta_n)'x_t\big]\big[u_{t-1} + (\beta_0 - \hat\beta_n)'x_{t-1}\big]$$
$$= \frac{1}{n}\sum_{t=1}^n u_t u_{t-1} + (\beta_0 - \hat\beta_n)'\frac{1}{n}\sum_{t=1}^n (u_t x_{t-1} + u_{t-1}x_t) + (\beta_0 - \hat\beta_n)'\left[\frac{1}{n}\sum_{t=1}^n x_t x_{t-1}'\right](\beta_0 - \hat\beta_n)$$
$$= \frac{1}{n}\sum_{t=1}^n u_t u_{t-1} + o_p(1) = \frac{1}{n}\sum_{t=1}^n (\epsilon_t + \rho_0 u_{t-1})u_{t-1} + o_p(1) \to_p \rho_0\,\mathrm{var}(u_t).$$
Similarly, we can show that $\frac{1}{n}\sum_{t=1}^n \hat u_{t-1}^2 \to_p \mathrm{var}(u_t)$; hence $\hat\rho_n \to_p \rho_0$. Using a similar argument, we
can show that
$$\frac{1}{\sqrt{n}}\sum_{t=1}^n \hat u_t\hat u_{t-1} = \frac{1}{\sqrt{n}}\sum_{t=1}^n u_t u_{t-1} + o_p(1).$$
Hence
$$\sqrt{n}(\hat\rho_n - \rho_0) \to_d N(0, 1 - \rho_0^2).$$
Finally, the FGLS estimator of $\beta_0$ based on $V(\hat\rho)$ has the same limit distribution as the GLS
estimator based on $V(\rho_0)$ (pages 222-225 in Hamilton).
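Here is a minimal sketch (not from the original notes) of the two-step FGLS procedure just described: estimate $\rho$ from the OLS residuals, then apply the implied GLS transform. The sketch uses a quasi-differencing (Cochrane-Orcutt style) transform that drops the first observation, and all data-generating values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0, rho0 = np.array([1.0, 0.5]), 0.7

# Errors follow an AR(1): u_t = rho0*u_{t-1} + eps_t.
eps = rng.normal(size=n)
u = np.empty(n); u[0] = eps[0] / np.sqrt(1 - rho0**2)
for t in range(1, n):
    u[t] = rho0 * u[t - 1] + eps[t]
y = X @ beta0 + u

# Step 1: OLS, then estimate rho by regressing u_hat_t on u_hat_{t-1}.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ b_ols
rho_hat = (r[1:] @ r[:-1]) / (r[:-1] @ r[:-1])

# Step 2: quasi-difference the data and rerun OLS on the transformed regression.
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_fgls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
print(b_ols, rho_hat, b_fgls)
```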

1.7 Statistical inference with LS estimation


Commonly used test statistics for LS estimation are the t statistic and the F statistic. The t statistic
is used to test a hypothesis about a single parameter, say $\beta_i = c$. For simplicity we take
$c = 0$, so the t statistic tests whether a variable is significant. The t statistic is defined as the ratio
$\hat\beta_i/\mathrm{sd}(\hat\beta_i)$. Let the estimate of the variance of $\hat\beta$ be denoted by $s^2 W$; then the standard deviation
of $\hat\beta_i$ is $s$ times the square root of the $i$th diagonal element of $W$, i.e.,
$$t = \frac{\hat\beta_i}{\sqrt{s^2 w_{ii}}}. \qquad (9)$$
Recall that if $X/\sigma \sim N(0, 1)$, $Y^2/\sigma^2 \sim \chi^2(m)$, and $X$ and $Y$ are independent, then
$$t = \frac{X\sqrt{m}}{Y}$$
follows an exact Student t distribution with $m$ degrees of freedom.
The F statistic is used to test a hypothesis of $m$ different linear restrictions on $\beta$, say
$$H_0: R\beta = r,$$
where $R$ is an $m \times k$ matrix. The F statistic is then defined as
$$F = (R\hat\beta - r)'\big[\widehat{\mathrm{Var}}(R\hat\beta - r)\big]^{-1}(R\hat\beta - r). \qquad (10)$$
This is a Wald statistic. To derive the distribution of the statistic, we will need the following
result.

Proposition 2 If a $k \times 1$ vector $X \sim N(\mu, \Sigma)$, then $(X-\mu)'\Sigma^{-1}(X-\mu) \sim \chi^2(k)$.

Also recall that an exact $F(m, n)$ distribution is defined as the distribution of the ratio
$$F(m, n) = \frac{\chi^2(m)/m}{\chi^2(n)/n}$$
of two independent $\chi^2$ variables, each divided by its degrees of freedom.

Under Assumption 1, $W = (X_n'X_n)^{-1}$, and under the null hypothesis $\hat\beta_i \sim N(0, \sigma^2 w_{ii})$. We can
then write
$$t = \frac{\hat\beta_i/\sqrt{\sigma^2 w_{ii}}}{\sqrt{s^2/\sigma^2}}.$$
Since the numerator is $N(0,1)$, the denominator is the square root of a $\chi^2(n-k)$ variable divided by $n-k$
(since $RSS/\sigma^2 \sim \chi^2(n-k)$), and the numerator and denominator are independent, the t statistic
(9) under Assumption 1 follows an exact t distribution.
With Assumption 1 and under the null hypothesis, we have
$$R\hat\beta - r \sim N(0, \sigma^2 R(X_n'X_n)^{-1}R'),$$
so by Proposition 2, the F statistic defined in (10) under $H_0$ satisfies
$$(R\hat\beta - r)'\big[\sigma^2 R(X_n'X_n)^{-1}R'\big]^{-1}(R\hat\beta - r) \sim \chi^2(m).$$
If we replace $\sigma^2$ with $s^2$ and divide by the number of restrictions $m$, we get the OLS F test
of a linear hypothesis,
$$F = (R\hat\beta - r)'\big[s^2 R(X_n'X_n)^{-1}R'\big]^{-1}(R\hat\beta - r)/m
= \frac{(R\hat\beta - r)'\big[\sigma^2 R(X_n'X_n)^{-1}R'\big]^{-1}(R\hat\beta - r)/m}{(RSS/\sigma^2)/(n-k)},$$
so $F$ follows an exact $F(m, n-k)$ distribution.
An alternative way to express the F statistic is to compute the estimator without the restriction
and its associated residual sum of squares $RSS_u$, and the estimator with the restriction and its associated
residual sum of squares $RSS_r$; then we can write
$$F = \frac{(RSS_r - RSS_u)/m}{RSS_u/(n-k)}.$$
Now, with Assumption 2, $X$ is stochastic, $\hat\beta$ is normal conditional on $X$, and $RSS/\sigma^2 \sim \chi^2(n-k)$
conditional on $X$. This conditional distribution of RSS is the same for all $X$; therefore, the
unconditional distribution of RSS is the same as the conditional distribution. The same is true
for the t and F statistics. Therefore we have the same results under Assumption 2 as under
Assumption 1.
From Case 3 on, we no longer have an exact distribution for the estimator and we have to derive its
asymptotic distribution, so we also use asymptotic distributions for the test statistics. Write
$$t_n = \frac{\hat\beta_i}{s_n\sqrt{w_{ii}}} = \frac{\sqrt{n}\,\hat\beta_i}{s_n\sqrt{n w_{ii}}},$$
where $w_{ii}$ is the $i$th diagonal element of $(X_n'X_n)^{-1}$, so that $s_n^2 w_{ii}$ estimates the $i$th diagonal element of
$\hat\beta$'s asymptotic variance $\sigma^2 Q^{-1}n^{-1}$. If we denote the $i$th diagonal element of $Q^{-1}$ by $q^{ii}$, then under the null
we have $\sqrt{n}\,\hat\beta_i \to_d N(0, \sigma^2 q^{ii})$ and $n w_{ii} \to_p q^{ii}$. Recall that under Assumption 3, $s_n \to_p \sigma$, so we have
$$t_n \to_d N(0, 1).$$
Next, write
$$F_n = (R\hat\beta - r)'\big[s_n^2 R(X_n'X_n)^{-1}R'\big]^{-1}(R\hat\beta - r)/m
= \sqrt{n}(R\hat\beta - r)'\big[s_n^2 R(X_n'X_n/n)^{-1}R'\big]^{-1}\sqrt{n}(R\hat\beta - r)/m.$$
Now we have $s_n^2 \to_p \sigma^2$, $X_n'X_n/n \to_p Q$, and under the null,
$$\sqrt{n}(R\hat\beta - r) = R\sqrt{n}(\hat\beta - \beta_0) \to_d N(0, \sigma^2 RQ^{-1}R').$$
Then by Proposition 2, we have
$$mF_n \to_d \chi^2(m).$$
We can then use similar methods to derive the distributions of the test statistics in the other cases. In general, if $\hat\beta \to_p \beta_0$
and is asymptotically normal, $s_n^2 \to_p \sigma^2$, and we have found a consistent estimate of the variance of $\hat\beta$,
then the t statistic is asymptotically normal and the (scaled) F statistic is asymptotically $\chi^2(m)$. In fact, under
Assumption 1 or 2, when the sample size is large, we can also use the normal and $\chi^2$ distributions to
approximate the exact t and F distributions. Further, since we are using asymptotic distributions,
the Wald test can also be used to test nonlinear restrictions.
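The sketch below (not from the original notes) computes the t statistic (9) and the Wald/F statistic (10) from an OLS fit; the simulated data, the null hypotheses being tested, and the restriction matrix are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u = y - X @ b
s2 = u @ u / (n - k)
cov_b = s2 * XtX_inv                     # estimated Var(beta_hat)

# t statistic for H0: beta_1 = 0 (second coefficient), with a t(n-k) p-value.
t_stat = b[1] / np.sqrt(cov_b[1, 1])
p_t = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))

# Wald/F statistic for H0: R beta = r, with m = 2 linear restrictions.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.5])
diff = R @ b - r
F = diff @ np.linalg.solve(R @ cov_b @ R.T, diff) / R.shape[0]
p_F = 1 - stats.f.cdf(F, R.shape[0], n - k)
print(t_stat, p_t, F, p_F)
```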

2 Maximum Likelihood Estimation
2.1 Review: maximum likelihood principle and Cramer-Rao lower bound
The basic idea of the maximum likelihood principle is to choose the parameter estimates that maximize
the probability of obtaining the observed sample. Suppose we observe a sample
$X_n = (x_1, x_2, \ldots, x_n)$ and assume that the sample is drawn from an i.i.d. distribution with
parameters denoted by $\theta$. Let $p(x_t;\theta)$ denote the pdf of the $t$th observation. For
example, when $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, then $\theta = (\mu, \sigma^2)$ and
$$p(x_t;\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_t-\mu)^2}{2\sigma^2}\right).$$

The likelihood function for the whole sample $X_n$ is
$$L(X_n;\theta) = \prod_{t=1}^n p(x_t;\theta)$$
and the log likelihood function is
$$l(X_n;\theta) = \sum_{t=1}^n \log p(x_t;\theta).$$

The maximum likelihood estimates of $\theta$ are chosen so that $l(X_n;\theta)$ is maximized. Define the
score function $S(\theta) = \partial l(\theta)/\partial\theta$ and the Hessian matrix $H(\theta) = \partial^2 l(\theta)/\partial\theta\partial\theta'$; then the famous
Cramer-Rao inequality tells us that the lower bound for the variance of an unbiased estimator of $\theta$
is the inverse of the information matrix $I(\theta_0) = E[S(\theta_0)S(\theta_0)']$, where $\theta_0$ denotes the true value
of the parameter. An estimator whose variance equals this bound is known as efficient.
Under some regularity conditions, which are satisfied for the Gaussian density, we have the following
equality:
$$I(\theta) = -E[H(\theta)] = -E\left[\frac{\partial^2 l(\theta)}{\partial\theta\partial\theta'}\right].$$
So, if we find an unbiased estimator and its variance achieves the Cramer-Rao lower bound,
then we know that this estimator is efficient and there is no other unbiased estimator (linear or
nonlinear) that has smaller variance. However, this lower bound is not
always achievable. If an estimator does achieve this bound, then this estimator is identical to the MLE.
Note that the Cramer-Rao inequality holds for unbiased estimators, while sometimes ML estimators
are biased. If the estimator is biased but consistent, and its variance approaches the Cramer-Rao
bound asymptotically, then this estimator is known as asymptotically efficient.

Example 1 (MLE for an i.i.d. Gaussian sample) Let $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, so the parameter
vector is $\theta = (\mu, \sigma^2)$. Then we have
$$p(x_t;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t-\mu)^2}{2\sigma^2}\right),$$
$$l(X_n;\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^n (x_t-\mu)^2,$$
$$S(X_n;\mu) = \frac{\partial l(X_n;\theta)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^n (x_t - \mu),$$
$$S(X_n;\sigma^2) = \frac{\partial l(X_n;\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^n (x_t - \mu)^2.$$
Setting the score functions to zero, we find that the MLE estimators of $\theta$ are $\hat\mu = \bar X_n$ and $\hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (x_t - \hat\mu)^2$. It is easy to verify that $E(\hat\mu) = E(\bar X_n) = \mu$, so $\hat\mu$ is unbiased and its variance is $\mathrm{Var}(\hat\mu) = \sigma^2/n$,
while
$$E\hat\sigma^2 = E\left[\frac{1}{n}\sum_{t=1}^n (x_t - \hat\mu)^2\right] = E(x_t - \hat\mu)^2 = E[(x_t - \mu) + (\mu - \hat\mu)]^2
= \sigma^2 - \frac{2}{n}\sigma^2 + \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2,$$
so $\hat\sigma^2$ is biased, but it is consistent since $\hat\sigma^2 \to \sigma^2$ as $n \to \infty$. Define $s^2 = \frac{1}{n-1}\sum_{t=1}^n (x_t - \hat\mu)^2$; then
$Es^2 = \sigma^2$ and $\mathrm{Var}(s^2) = 2\sigma^4/(n-1)$.
We can further compute the Hessian matrix,
$$H(X_n;\theta) = \begin{bmatrix}
\dfrac{\partial^2 l(X_n;\theta)}{\partial\mu^2} & \dfrac{\partial^2 l(X_n;\theta)}{\partial\mu\,\partial\sigma^2} \\[2ex]
\dfrac{\partial^2 l(X_n;\theta)}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2 l(X_n;\theta)}{\partial(\sigma^2)^2}
\end{bmatrix}$$
where
$$\frac{\partial^2 l(X_n;\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$
$$\frac{\partial^2 l(X_n;\theta)}{\partial\mu\,\partial\sigma^2} = \frac{\partial^2 l(X_n;\theta)}{\partial\sigma^2\,\partial\mu} = -\frac{1}{\sigma^4}\sum_{t=1}^n (x_t - \mu),$$
$$\frac{\partial^2 l(X_n;\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^n (x_t - \mu)^2.$$
Evaluated at $\hat\theta$ (where $\sum_{t=1}^n (x_t - \hat\mu) = 0$), the determinant of the Hessian is
$$|H(X_n;\hat\theta)| = \frac{n^2}{2\hat\sigma^6} > 0,$$
and since $\partial^2 l/\partial\mu^2 < 0$ the Hessian is negative definite, so we have found a maximum (not a minimum) of the likelihood function. Next,
compute the information matrix. Since
$$E\left[\sum_{t=1}^n (x_t - \mu)\right] = 0, \qquad E\left[\sum_{t=1}^n (x_t - \mu)^2\right] = n\sigma^2,$$

the information matrix is therefore
$$I(\theta) = -E[H(X_n;\theta)] = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}.$$
So the MLE of $\mu$ achieves the Cramer-Rao lower bound of variance $\frac{\sigma^2}{n}$. Although $s^2$ does
not achieve the lower bound, it turns out that it is still the unbiased estimator of $\sigma^2$ with minimum
variance.
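A sketch (not from the original notes) comparing the closed-form MLE above with a direct numerical maximization of the log likelihood; the simulated sample, the true values of $\mu$ and $\sigma$, and the `negloglik` helper are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # illustrative mu = 2, sigma = 1.5
n = x.size

# Closed-form MLE: mu_hat = sample mean, sigma2_hat = (1/n) sum (x_t - mu_hat)^2.
mu_hat = x.mean()
sig2_hat = ((x - mu_hat) ** 2).mean()

# Numerical MLE: minimize the negative log likelihood over (mu, log sigma^2).
def negloglik(theta):
    mu, log_s2 = theta
    s2 = np.exp(log_s2)                          # parameterize sigma^2 > 0 via its log
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * np.sum((x - mu) ** 2) / s2

res = minimize(negloglik, x0=np.array([0.0, 0.0]))
print(mu_hat, sig2_hat)
print(res.x[0], np.exp(res.x[1]))                # should match the closed form
```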

2.2 Asymptotic Normality of MLE


There are a few regularity conditions that ensure the MLE is consistent. First, we assume that the
data are strictly stationary and ergodic (for example, i.i.d.). Second, we assume that the parameter
space $\Theta$ is convex and that neither the estimate $\hat\theta$ nor the true parameter $\theta_0$ lies on the boundary of $\Theta$.
Third, we require that the likelihood function evaluated at $\theta$ is different from that at $\theta_0$ for any $\theta \neq \theta_0$ in
$\Theta$; this is known as the identification condition. Finally, we assume that $E[\sup_\theta |l(X_n;\theta)|] < \infty$.
With all these conditions satisfied, the MLE is consistent: $\hat\theta \to_p \theta_0$.
Next we discuss asymptotic results for the score function $S(X_n;\theta)$, the Hessian matrix
$H(X_n;\theta)$, and the asymptotic distribution of the MLE $\hat\theta$.
First, we want to show that $E[S(X_n,\theta_0)] = 0$ and $E[S(X_n,\theta_0)S(X_n,\theta_0)'] = -E[H(X_n;\theta_0)]$.
Letting the integral sign denote integration over $X_1, X_2, \ldots, X_n$, we have
$$\int L(X_n,\theta_0)\,dX_n = 1.$$
Taking the derivative with respect to $\theta$, we have
$$\int \frac{\partial L(X_n,\theta_0)}{\partial\theta}\,dX_n = 0.$$
Meanwhile, we can write
$$\int \frac{\partial L(X_n,\theta_0)}{\partial\theta}\,dX_n
= \int \frac{1}{L(X_n,\theta_0)}\frac{\partial L(X_n,\theta_0)}{\partial\theta}\,L(X_n,\theta_0)\,dX_n
= \int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\,L(X_n,\theta_0)\,dX_n
= E[S(X_n,\theta_0)].$$
So we know that $E[S(X_n,\theta_0)] = 0$. Next, differentiate the integral (which equals zero) with respect to $\theta'$:
$$\int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\frac{\partial L(X_n,\theta_0)}{\partial\theta'}\,dX_n
+ \int \frac{\partial^2 l(X_n;\theta_0)}{\partial\theta\,\partial\theta'}\,L(X_n,\theta_0)\,dX_n = 0.$$
The second term is just $E[H(X_n;\theta_0)]$. The first term can be written as
$$\int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\left[\frac{1}{L(X_n,\theta_0)}\frac{\partial L(X_n,\theta_0)}{\partial\theta'}\right]L(X_n,\theta_0)\,dX_n
= \int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\frac{\partial l(X_n;\theta_0)}{\partial\theta'}\,L(X_n,\theta_0)\,dX_n
= E[S(X_n,\theta_0)S(X_n,\theta_0)'].$$
Now, since $E[S(X_n,\theta_0)S(X_n,\theta_0)'] + E[H(X_n;\theta_0)] = 0$, we have $E[S(X_n,\theta_0)S(X_n,\theta_0)'] = -E[H(X_n;\theta_0)]$.
Next, define $s(x_t;\theta) = \frac{\partial \log p(x_t;\theta)}{\partial\theta}$, so that the score function is the sum of the $s(x_t;\theta)$,
i.e., $S(X_n,\theta) = \sum_{t=1}^n s(x_t;\theta)$. The $s(x_t;\theta)$ are i.i.d., and we can show that $E[s(x_t;\theta_0)] = 0$ and $E[s(x_t;\theta_0)s(x_t;\theta_0)'] = -E[H(x_t;\theta_0)]$. Applying the Lindeberg-Levy CLT, we obtain the asymptotic normality of the score
function:
$$n^{-1/2}S(X_n;\theta_0) \to_d N\!\left(0,\; -\frac{1}{n}E[H(X_n;\theta_0)]\right).$$
Next, we consider the properties of the Hessian matrix. First we assume that $E[H(X_n;\theta_0)]$ is
non-singular. Let $N$ be a neighborhood of $\theta_0$ with
$$E\big[\sup_{\theta\in N}\|H(X_n;\theta)\|\big] < \infty;$$
then for any consistent estimator $\bar\theta$ of $\theta_0$ we have
$$\frac{1}{n}\sum_{t=1}^n H(x_t;\bar\theta) \to_p E[H(x_t;\theta_0)].$$
Applying the LLN, we also have
$$\frac{1}{n}H(X_n;\theta_0) = \frac{1}{n}\sum_{t=1}^n H(x_t;\theta_0) \to_p E[H(x_t;\theta_0)] = E\left[\frac{1}{n}H(X_n;\theta_0)\right].$$
With the notation $\Omega \equiv -E\left[\frac{1}{n}H(X_n;\theta_0)\right]$, we can write $n^{-1/2}S(X_n;\theta_0) \to_d N(0, \Omega)$.

Proposition 3 (Asymptotic normality of MLE) With all the conditions we have outlined above,
$$\sqrt{n}(\hat\theta - \theta_0) \to_d N(0, \Omega^{-1}).$$

Proof: Take a first-order Taylor expansion of $S(X_n;\hat\theta)$ around $\theta_0$:
$$0 = S(X_n;\hat\theta) \approx S(X_n;\theta_0) + H(X_n;\theta_0)(\hat\theta - \theta_0).$$
Therefore, we have
$$\sqrt{n}(\hat\theta - \theta_0) = -\left[\frac{1}{n}H(X_n;\theta_0)\right]^{-1}\left[\frac{1}{\sqrt{n}}S(X_n;\theta_0)\right]
\to_d N(0, \Omega^{-1}\Omega\,\Omega^{-1}) = N(0, \Omega^{-1}).$$
Note that $\Omega^{-1} = \left\{-E\left[\frac{1}{n}H(X_n;\theta_0)\right]\right\}^{-1} = nI(\theta_0)^{-1}$, so the asymptotic distribution of $\hat\theta$ can be
written as
$$\hat\theta \approx N(\theta_0, I(\theta_0)^{-1}).$$
However, $I(\theta_0)$ depends on $\theta_0$, which is unknown, so we need to find a consistent estimator for
it, denoted by $\hat V$. There are two ways to compute this variance matrix of $\hat\theta$. One is to
compute the Hessian matrix and evaluate it at $\theta = \hat\theta$, i.e., $\hat V = -H(X_n;\hat\theta)$. The second is to
use the outer product estimate,
$$\hat V = \sum_{t=1}^n s(x_t;\hat\theta)s(x_t;\hat\theta)'.$$
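For the Gaussian example of Section 2.1, both variance estimates (minus the Hessian at $\hat\theta$, and the outer product of the per-observation scores) can be computed directly. This sketch is not from the original notes, and the simulated data are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=500)
n = x.size

# MLE for the i.i.d. Gaussian example: theta_hat = (mu_hat, sigma2_hat).
mu, s2 = x.mean(), ((x - x.mean()) ** 2).mean()

# Per-observation scores s(x_t; theta_hat) for the Gaussian log density.
s_mu = (x - mu) / s2
s_s2 = -0.5 / s2 + 0.5 * (x - mu) ** 2 / s2**2
S = np.column_stack([s_mu, s_s2])

# Outer-product estimate: V_hat = sum_t s_t s_t'.
V_opg = S.T @ S

# Hessian-based estimate: V_hat = -H(X_n; theta_hat), using the analytic Hessian.
H = np.array([[-n / s2,                  -np.sum(x - mu) / s2**2],
              [-np.sum(x - mu) / s2**2,   n / (2 * s2**2) - np.sum((x - mu) ** 2) / s2**3]])
V_hess = -H

print(np.linalg.inv(V_opg))    # both inverses approximate Var(theta_hat)
print(np.linalg.inv(V_hess))   # diagonal is roughly (s2/n, 2*s2^2/n)
```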

2.3 Statistical Inference for MLE


There are three asymptotically equivalent tests for MLE: the likelihood ratio (LR) test, the Wald test, and
the Lagrange multiplier (LM) test, or score test. You can find discussions of these three tests
in any graduate textbook in econometrics, so we only describe them briefly here.
The likelihood ratio test is based on the difference between the likelihood computed (maximized)
with and without the restriction. Let $l_u$ denote the log likelihood without the restriction and $l_r$
denote the log likelihood with the restriction (note that $l_r \le l_u$). If the restriction is valid, then we expect
that $l_r$ should not be too much lower than $l_u$. Therefore, to test whether the restriction is valid, the
statistic we compute is $2(l_u - l_r)$, which follows a $\chi^2$ distribution with degrees of freedom equal to
the number of restrictions imposed.
To do the LR test, we have to compute the likelihood under both the restricted and unrestricted models.
In comparison, the other two tests use either only the estimator without the restriction (denoted
by $\hat\theta$) or only the estimator with the restriction (denoted by $\tilde\theta$).
Let the restriction be $H_0: R(\theta) = r$. The idea of the Wald test is that if this restriction is valid,
then the estimator obtained without the restriction will make $R(\hat\theta) - r$ close to zero. Therefore the
Wald statistic is
$$W = (R(\hat\theta) - r)'\big[\widehat{\mathrm{Var}}(R(\hat\theta) - r)\big]^{-1}(R(\hat\theta) - r),$$
which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions
imposed.
To find the ML estimator, we set the score function equal to zero and solve for the estimator,
i.e., $S(\hat\theta) = 0$. If the restriction is valid and the estimator obtained with the restriction is $\tilde\theta$,
then we expect that $S(\tilde\theta)$ is close to zero. This idea leads to the LM test, or score test. The LM
statistic is
$$LM = S(\tilde\theta)'I(\tilde\theta)^{-1}S(\tilde\theta),$$
which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions
imposed.
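As a small illustration, the sketch below (not from the original notes) computes the LR statistic for $H_0: \mu = 0$ in the i.i.d. Gaussian example of Section 2.1, where the restriction has one degree of freedom; the data and the true mean are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(loc=0.2, scale=1.0, size=200)
n = x.size

def loglik(mu, s2):
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * np.sum((x - mu) ** 2) / s2

# Unrestricted MLE.
mu_u = x.mean()
s2_u = ((x - mu_u) ** 2).mean()
l_u = loglik(mu_u, s2_u)

# Restricted MLE under H0: mu = 0 (sigma^2 re-estimated under the restriction).
s2_r = (x ** 2).mean()
l_r = loglik(0.0, s2_r)

# LR statistic: 2(l_u - l_r), approximately chi^2(1) under H0 (one restriction).
LR = 2 * (l_u - l_r)
p_value = 1 - stats.chi2.cdf(LR, df=1)
print(LR, p_value)
```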

2.4 LS and MLE


In the regression $Y_n = X_n\beta_0 + U_n$ where $U_n|X_n \sim N(0, \sigma^2 I_n)$ (as in Assumption 2), the conditional
density of $Y$ given $X$ is
$$f(Y|X;\theta) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right).$$
The log likelihood function is
$$l(Y|X;\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta).$$

Note that the $\hat\beta_n$ that maximizes $l$ is the vector that minimizes the sum of squares; therefore, under
Assumption 2, the OLS estimator is equivalent to the ML estimator of $\beta_0$. It can be shown that this
estimator is unbiased and achieves the Cramer-Rao lower bound; therefore, under Assumption 2, the
OLS/MLE estimator is efficient (compared to all unbiased linear or nonlinear estimators). Recall
that under Assumption 1, the Gauss-Markov theorem shows that the OLS estimator is the best
linear unbiased estimator; now the Cramer-Rao inequality establishes the optimality of the OLS estimator
under Assumption 2. The ML estimator of $\sigma^2$ is $(Y - X\hat\beta)'(Y - X\hat\beta)/n$. We introduced this
estimator a moment ago and showed that the difference between $\hat\sigma_n^2$ and the OLS estimator $s_n^2$
becomes arbitrarily small as $n \to \infty$.
Next, consider Assumption 5, where $U|X \sim N(0, \sigma^2 V)$ and $V$ is known. Then the log likelihood
function, omitting constant terms, is
$$l(Y|X, \beta) = -\frac{1}{2}\log|V| - \frac{1}{2}(Y - X\beta)'V^{-1}(Y - X\beta).$$
The MLE estimator is
$$\hat\beta_n = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
which is equivalent to the GLS estimator. The score vector is $S_n(\beta) = (Y - X\beta)'V^{-1}X$ and the Hessian
matrix is $H_n(\beta) = -X'V^{-1}X$; therefore, the information matrix is $I(\beta) = X'V^{-1}X$, and the
GLS/MLE estimator is efficient as it achieves the Cramer-Rao lower bound $(X'V^{-1}X)^{-1}$.
When $V$ is unknown, we can parameterize it as $V(\theta)$, say, and maximize the likelihood
$$l(Y|X, \beta, \theta) = -\frac{1}{2}\log|V(\theta)| - \frac{1}{2}(Y - X\beta)'V^{-1}(\theta)(Y - X\beta).$$

2.5 Example: MLE in autoregressive estimation


In Hamilton's book, you can find detailed discussions of MLE for ARMA models in Chapter 5. We will take an AR(1) model as an example.
Consider an AR(1) model,
$$x_t = c + \phi x_{t-1} + u_t,$$
where $u_t \sim$ i.i.d. $N(0, \sigma^2)$. Let $\theta = (c, \phi, \sigma^2)$ and let the sample size be $n$. There are
two ways to construct the likelihood function, and the difference lies in how we treat the initial
observation $x_1$. If we let $x_1$ be random, we know that the unconditional distribution of $x_t$ is
$N(c/(1-\phi), \sigma^2/(1-\phi^2))$, and this leads to the exact likelihood function. Alternatively, we can
treat $x_1$ as observable (known), which leads to the conditional likelihood function.
We first consider the exact likelihood function. We know that
$$p(x_1;\theta) = \left(\frac{2\pi\sigma^2}{1-\phi^2}\right)^{-1/2}\exp\left(-\frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)}\right).$$

Conditional on $x_1$, the distribution of $x_2$ is $N(c + \phi x_1, \sigma^2)$, so the conditional
probability density of the second observation is
$$p(x_2|x_1;\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_2 - c - \phi x_1)^2}{2\sigma^2}\right).$$
So the joint probability density of $(x_1, x_2)$ is
$$p(x_1, x_2;\theta) = p(x_2|x_1;\theta)\,p(x_1;\theta).$$

Similarly, the probability density of the $n$th observation conditional on $x_{n-1}$ is
$$p(x_n|x_{n-1};\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_n - c - \phi x_{n-1})^2}{2\sigma^2}\right),$$
and the density of the joint observations $X_n = (x_1, x_2, \ldots, x_n)$ is
$$L(X_n;\theta) = p(x_1;\theta)\prod_{t=2}^n p(x_t|x_{t-1};\theta).$$
Taking logs, we get the exact log likelihood function (omitting constant terms for simplicity)
$$l(X_n;\theta) = -\frac{1}{2}\log\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)}
- \frac{n-1}{2}\log(\sigma^2) - \frac{1}{2}\sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{\sigma^2}. \qquad (11)$$

Next, to construct the conditional likelihood, treat $x_1$ as given (observable); then the log likelihood
function is (again, constant terms are omitted)
$$l(X_n;\theta) = -\frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}. \qquad (12)$$
The maximum likelihood estimates $\hat c$ and $\hat\phi$ are obtained by maximizing (12), or by solving the
score equations. Note that maximizing (12) with respect to $(c, \phi)$ is equivalent to minimizing
$$\sum_{t=2}^n (x_t - c - \phi x_{t-1})^2,$$
which is the objective function in OLS.


Compared to the exact likelihood function, we see that the conditional likelihood function is
much easier to work with. Actually, when the sample size is large, the first observation becomes
negligible to the total likelihood function. When || < 1, the estimator computed from exact
likelihood and the estimator from conditional likelihood are asymptotically equivalent.
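The sketch below (not from the original notes) estimates an AR(1) both by the conditional likelihood (12), which reduces to OLS of $x_t$ on $(1, x_{t-1})$, and by numerically maximizing the exact likelihood (11). The true parameter values, the starting point, and the bound on $\phi$ are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n, c0, phi0, sig0 = 500, 0.5, 0.8, 1.0

# Simulate a stationary Gaussian AR(1), drawing x_1 from its unconditional distribution.
x = np.empty(n)
x[0] = rng.normal(c0 / (1 - phi0), sig0 / np.sqrt(1 - phi0**2))
for t in range(1, n):
    x[t] = c0 + phi0 * x[t - 1] + rng.normal(scale=sig0)

# Conditional MLE = OLS of x_t on (1, x_{t-1}) for t = 2, ..., n.
Z = np.column_stack([np.ones(n - 1), x[:-1]])
c_hat, phi_hat = np.linalg.solve(Z.T @ Z, Z.T @ x[1:])

# Exact MLE: minimize minus the exact log likelihood (11), with |phi| < 1 enforced.
def neg_exact_loglik(theta):
    c, phi, log_s2 = theta
    s2 = np.exp(log_s2)
    m1, v1 = c / (1 - phi), s2 / (1 - phi**2)      # unconditional mean/variance of x_1
    ll = -0.5 * np.log(2 * np.pi * v1) - (x[0] - m1) ** 2 / (2 * v1)
    e = x[1:] - c - phi * x[:-1]
    ll += -0.5 * (n - 1) * np.log(2 * np.pi * s2) - np.sum(e ** 2) / (2 * s2)
    return -ll

res = minimize(neg_exact_loglik, x0=np.array([0.0, 0.5, 0.0]),
               method="L-BFGS-B",
               bounds=[(None, None), (-0.99, 0.99), (None, None)])
print(c_hat, phi_hat)          # conditional MLE
print(res.x[0], res.x[1])      # exact MLE: very close for large n
```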
Finally, if the errors are not Gaussian and we estimate the parameters using the conditional
Gaussian likelihood as in (12), then the estimate we obtain is known as the quasi-maximum likelihood
estimate (QMLE). QMLE is very frequently used in empirical estimation. Although we misspecify
the density function, in many cases QMLE is still consistent. For instance, in an AR(p)
process, if the sample second moments converge to the population second moments, then the QMLE
based on (12) is consistent, whether or not the errors are Gaussian. However, standard errors
for the estimated coefficients that are computed under the Gaussian assumption need not be correct
if the true data are not Gaussian (White, 1982).

3 Model Selection
In the discussion of estimation above, we assumed that the lag order is known. However,
in empirical estimation we have to choose a proper order. A larger number of lags (parameters)
will improve the fit of the model, so we need some criterion that balances goodness of
fit against model parsimony. There are three commonly used criteria: the Akaike information criterion
(AIC), Schwarz's Bayesian information criterion (BIC), and the posterior information criterion
(PIC) developed by Phillips (1996).
In all these criteria, we specify a maximum order $k_{max}$ and then choose $k$ to minimize a
criterion function:
$$AIC(k) = \log\left(\frac{SSR_k}{n}\right) + \frac{2k}{n}, \qquad (13)$$
where $n$ is the sample size, $k = 1, 2, \ldots, k_{max}$ is the number of parameters in the model, and $SSR_k$
is the sum of squared residuals from the fitted model. When $k$ increases, the fit improves, so $SSR_k$ decreases, but
the second term increases, so there is a trade-off between fit and parsimony. Since the model is
estimated using different lags, the effective sample size also varies. We can either use the varying sample
size $n - k$ or a fixed sample size $n - k_{max}$; Ng and Perron (2000) recommend
using the fixed sample size in place of $n$ in the criterion. However, the AIC rule is not
consistent and tends to overfit the model by choosing a larger $k$.
With all other issues the same as for the AIC rule, the BIC rule imposes a larger penalty on the
number of parameters:
$$BIC(k) = \log\left(\frac{SSR_k}{n}\right) + \frac{k\log(n)}{n}. \qquad (14)$$
BIC suggests a smaller $k$ than AIC, and the BIC rule is consistent for stationary data, i.e., $\lim_{n\to\infty}\hat k_{BIC} = k$. Further, Hannan and Deistler (1988) have shown that $\hat k_{BIC}$ is consistent when we set $k_{max} =
[c\log(n)]$ (the integer part of $c\log(n)$) for any $c > 0$. Therefore, we can estimate $k_{BIC}$ consistently
without knowing an upper bound on $k$.
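A sketch (not from the original notes) of choosing the AR lag order with (13) and (14): fit AR(k) for $k = 1, \ldots, k_{max}$ on a fixed effective sample (in the spirit of the Ng-Perron recommendation above) and pick the $k$ minimizing each criterion. The simulated process, $k_{max}$, and the helper `ssr_ar` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n, kmax = 400, 8

# Simulate an AR(2) so that the true order is 2.
x = np.zeros(n + 100)
e = rng.normal(size=n + 100)
for t in range(2, n + 100):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + e[t]
x = x[100:]

def ssr_ar(x, k, kmax):
    """SSR of an AR(k) with intercept, fitted on the fixed sample t = kmax+1, ..., n."""
    Y = x[kmax:]
    Z = np.column_stack([np.ones(len(Y))] + [x[kmax - j:-j] for j in range(1, k + 1)])
    b = np.linalg.lstsq(Z, Y, rcond=None)[0]
    r = Y - Z @ b
    return r @ r, len(Y)

aic, bic = [], []
for k in range(1, kmax + 1):
    ssr, m = ssr_ar(x, k, kmax)
    aic.append(np.log(ssr / m) + 2 * k / m)           # criterion (13)
    bic.append(np.log(ssr / m) + k * np.log(m) / m)   # criterion (14)

print("AIC picks k =", 1 + int(np.argmin(aic)))
print("BIC picks k =", 1 + int(np.argmin(bic)))
```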
Finally, to present the PIC criterion, let $K = k_{max}$, and let $X(K)$ and $X(k)$ denote the regressor
matrices with $K$ and $k$ parameters respectively, and similarly $\beta(K)$ and $\beta(k)$ for the parameter vectors.
Partition the full model as
$$Y = X(K)\beta(K) + \text{error} = X(k)\beta(k) + X(\kappa)\beta(\kappa) + \text{error},$$
where $X(\kappa)$ collects the regressors in $X(K)$ that are excluded from $X(k)$. Define
$$A(\kappa) = X(\kappa)'X(\kappa), \qquad A(k) = X(k)'X(k), \qquad A(\kappa, k) = X(\kappa)'X(k),$$
$$\underline{A}(\kappa) = A(\kappa) - A(\kappa,k)A(k)^{-1}A(k,\kappa),$$
$$\hat\beta(\kappa) = \big[X(\kappa)'X(\kappa) - X(\kappa)'X(k)\big(X(k)'X(k)\big)^{-1}X(k)'X(\kappa)\big]^{-1}\big[X(\kappa)'Y - X(\kappa)'X(k)\big(X(k)'X(k)\big)^{-1}X(k)'Y\big],$$
$$\hat\sigma_K^2 = SSR_K/(n-K),$$
then
$$PIC = \big|\underline{A}(\kappa)/\hat\sigma_K^2\big|^{1/2}\exp\left(-\frac{1}{2\hat\sigma_K^2}\hat\beta(\kappa)'\underline{A}(\kappa)\hat\beta(\kappa)\right).$$
PIC is asymptotically equivalent to the BIC criterion when the data are stationary, and when the
data are nonstationary, PIC is still consistent.
Reading: Hamilton, Ch. 5, 8.

