
OLS ESTIMATION OF SINGLE EQUATION MODELS
Structural Model:

$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u = x\beta + u,$$

where $x_{1\times K} = (x_1\ x_2\ \cdots\ x_K)$ with $x_1 \equiv 1$ (so $\beta_1$ is the intercept) and

$$\beta_{K\times 1} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}.$$
Assumptions

1. We can obtain a random sample from the population, so that the sample observations $\{(x_i, y_i) : i = 1, 2, \ldots, N\}$ are iid.

2. The population error has zero mean and is uncorrelated with the regressors:

$$E(u) = 0, \qquad \mathrm{cov}(x_j, u) = 0, \quad j = 1, 2, \ldots, K. \qquad (1)$$

Sufficient for (1) is the assumption [HW: show it]

$$E(u \mid x_1, x_2, \ldots, x_K) = E(u \mid x) = 0. \qquad (2)$$

[Note:

i. An explanatory variable is called endogenous if it is correlated with the population error. An econometric model with endogenous explanatory variables is said to suffer from endogeneity.

ii. Usually endogeneity arises in one of three ways:

(a) Omitted Variables: Suppose we cannot control for some explanatory variables or regressors in the structural model because we do not have data on them (they may not be enumerable at all). Suppose $E(y \mid x, q)$ is the true population regression function, linear in all the $x$'s and in $q$. If we do not have data on $q$, we may estimate $E(y \mid x)$ (an estimable equation), in which $q$ becomes part of $u$. Now if $q$ and $x_j$ are correlated for any $j = 1, 2, \ldots, K$, this leads to endogeneity in the estimable model.

(b) Measurement Error: Suppose we want to include a regressor $x_K^*$ in the true structural model, but the data allow us to observe only an imperfect measure of $x_K^*$, namely $x_K$ [e.g. true income versus reported income], where $x_K = x_K^* + e_K$, $e_K$ being the measurement error. Depending on how $e_K$ is correlated with $x_K^*$ and $x_K$, $x_K$ and $u$ may be correlated if we use $x_K$ in place of $x_K^*$ in the estimable model, leading to endogeneity.

(c) Simultaneity: Simultaneity arises when at least one explanatory variable is determined simultaneously with the dependent variable of the equation. Suppose $x_K$ is determined partly as a function of $y$. Then $x_K$ and $u$ are generally correlated. For example, if quantity supplied is the dependent variable and price is an explanatory variable, then the market-clearing (equilibrium) mechanism codetermines the values of quantity supplied and price as observed in the data. In this case $x_K$ (price) is likely to be endogenous.]

Assumption (2) is stronger than what is required for deriving the asymptotic properties of $\hat\beta_{OLS}$. So we stick to (1):

$$E(x'u) = 0 \qquad [\text{OLS 1}]$$

i.e.

$$E\begin{pmatrix} x_1 u \\ x_2 u \\ \vdots \\ x_K u \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix},$$

i.e. $E(x_j u) = 0$, $j = 1, \ldots, K$.

Also, as $x_1 = 1$, $E(x_1 u) = 0 \Rightarrow E(u) = 0$. Thus $E\{[x_j - E(x_j)][u - E(u)]\} = E(x_j u) - E(x_j)E(u) = 0$ (remember $E(x_j)$ is a population moment, hence constant), i.e. $\mathrm{cov}(x_j, u) = 0$, $j = 1, 2, \ldots, K$.

3.

$$\mathrm{rank}\ E(x'x) = K \qquad [\text{OLS 2}]$$

$$E(x'x) = E\left[\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_K \end{pmatrix}(x_1\ x_2\ \cdots\ x_K)\right] = \begin{pmatrix} Ex_1^2 & Ex_1x_2 & \cdots & Ex_1x_K \\ Ex_2x_1 & Ex_2^2 & \cdots & Ex_2x_K \\ \vdots & \vdots & \ddots & \vdots \\ Ex_Kx_1 & Ex_Kx_2 & \cdots & Ex_K^2 \end{pmatrix}_{K\times K}$$

Since $E(x'x)$ is a symmetric $K\times K$ matrix of second moments (hence positive semidefinite), full rank implies that $E(x'x)$ is positive definite.

This condition implies that we are not replicating regressors. If, for instance, $x_1 \equiv x_K$, then the first and last columns of $E(x'x)$ are identical, so $\mathrm{rank}(E(x'x)) < K$. Likewise, if the regressors are linearly dependent [for instance, if you include dummy variables for all categories], then $\mathrm{rank}(E(x'x)) < K$. Assumption OLS 2 precludes these possibilities.
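[A minimal numerical sketch of the rank condition, with a hypothetical simulated design: duplicating the intercept through a full set of category dummies makes the sample analogue of $E(x'x)$ rank deficient.]

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# x1 = 1 (intercept), x2 continuous, d a binary category indicator
x2 = rng.normal(size=N)
d = rng.integers(0, 2, size=N)

X_ok = np.column_stack([np.ones(N), x2, d])          # one dummy omitted: full rank
X_bad = np.column_stack([np.ones(N), x2, d, 1 - d])  # dummy-variable trap: column 1 = column 3 + column 4

for X in (X_ok, X_bad):
    A_hat = X.T @ X / N                              # sample analogue of E(x'x)
    print(X.shape[1], np.linalg.matrix_rank(A_hat))  # rank < K in the trap case
```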
Identification of β in the Structural Model

In the context of linear models, identification of the parameters of a model means that the parameters can be expressed in terms of population moments of observable variables.

Our model is $y = x\beta + u$. Premultiplying by $x'$,

$$x'y = x'x\beta + x'u$$

$$\Rightarrow E(x'y) = E(x'x)\beta + E(x'u).$$

By OLS 1, $E(x'u) = 0$. Hence

$$\beta = [E(x'x)]^{-1}E(x'y),$$

where OLS 2 ensures that $[E(x'x)]^{-1}$ exists.


Method of Moments

Replace the population moments $E(x'x)$ and $E(x'y)$ with the corresponding sample moments to obtain the sample counterpart (or estimator) of the population parameter.

So,

$$\hat\beta_{OLS} = \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}x_i'y_i\right)$$

$$= \begin{pmatrix} \sum x_{1i}^2 & \sum x_{1i}x_{2i} & \cdots & \sum x_{1i}x_{Ki} \\ \sum x_{2i}x_{1i} & \sum x_{2i}^2 & \cdots & \sum x_{2i}x_{Ki} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_{Ki}x_{1i} & \sum x_{Ki}x_{2i} & \cdots & \sum x_{Ki}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum x_{1i}y_i \\ \sum x_{2i}y_i \\ \vdots \\ \sum x_{Ki}y_i \end{pmatrix}$$
The full data matrix analysis yields the same result.

Suppose the full data matrices are as follows. Let

$$X_{N\times K} = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{K1} \\ x_{12} & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1N} & x_{2N} & \cdots & x_{KN} \end{pmatrix}$$

be the N-observation data matrix on the regressors $x_1, x_2, \ldots, x_K$, and

$$y_{N\times 1} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

be the N-observation data vector on the dependent variable $y$.

Then $\hat\beta_{OLS} = \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}x_i'y_i\right) = (X'X)^{-1}X'y$.
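[A minimal numerical sketch on simulated data (all names and values are illustrative): the sample-moment form and the data-matrix form $(X'X)^{-1}X'y$ give the same estimate.]

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 1000, np.array([1.0, 2.0, -0.5])           # true beta (K = 3); beta[0] is the intercept

X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ beta + rng.normal(size=N)                    # u independent of x, so OLS 1 holds

# Method-of-moments form: (N^-1 sum x_i'x_i)^-1 (N^-1 sum x_i'y_i)
Sxx = X.T @ X / N
Sxy = X.T @ y / N
beta_mom = np.linalg.solve(Sxx, Sxy)

# Data-matrix form: (X'X)^-1 X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_mom, beta_ols)                            # identical up to floating-point error
```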

Under assumption OLS 2, $X'X$ is non-singular with probability approaching 1. This is because, as $N \to \infty$, $(1/N)\sum_{i=1}^{N}x_i'x_i \xrightarrow{p} E(x'x)$ (the sample of size $N$ approaching the entire population causes the sample moment to converge in probability to the corresponding population moment). But $E(x'x)$ is non-singular. Hence $P[\sum_{i=1}^{N}x_i'x_i = X'X \text{ is non-singular}] \to 1$ as $N \to \infty$.

Hence, by Corollary 1 of Asymptotic Theory, $\mathrm{plim}\left[\left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\right] = A^{-1}$, where $A = E(x'x)$.

We have used the method of moments to derive this estimator. Why are we calling it Ordinary Least Squares then?

The answer may be found by looking at Property 8 of the Conditional Expectation Operator: if $\mu(x) \equiv E(y \mid x)$, then $\mu$ is a solution to $\min_{m \in \mathcal{M}} E[(y - m(x))^2]$. The least-squares exercise, as we commonly understand it, is the sample counterpart of this population problem: $\min_b N^{-1}\sum_{i=1}^{N}(y_i - x_ib)^2$. In the method of moments we did exactly the same thing: first we expressed $\beta$ in terms of population moments of $x$ and $y$, and then we replaced the population moments with the corresponding sample moments. Thus, in effect, we found the sample counterpart of $E(y \mid x) = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K$, namely $\hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_K x_K$. Hence $\hat\beta_{K\times 1}$ is called the least-squares estimator of $\beta$.
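[An illustrative numerical check (using scipy's general-purpose minimizer; the setup is hypothetical): the closed-form $\hat\beta$ is also the minimizer of the sample sum of squared residuals.]

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

ssr = lambda b: np.sum((y - X @ b) ** 2)                 # sample least-squares objective

beta_closed = np.linalg.solve(X.T @ X, X.T @ y)          # method-of-moments / closed form
beta_numeric = minimize(ssr, x0=np.zeros(X.shape[1])).x  # direct numerical minimization

print(np.allclose(beta_closed, beta_numeric, atol=1e-3)) # True: same minimizer
```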
Consistency of $\hat\beta_{OLS}$

$$\hat\beta_{OLS} = \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}x_i'y_i\right) = \beta + \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}x_i'u_i\right).$$

By the weak law of large numbers (Theorem 1),

$$N^{-1}\sum_{i=1}^{N}x_i'u_i \xrightarrow{p} E(x'u) = 0 \quad \text{by OLS 1}.$$

Hence, by Slutsky's Theorem (Lemma 4),

$$\mathrm{plim}\ \hat\beta_{OLS} = \beta + A^{-1}\cdot 0 = \beta$$

(remember, $\mathrm{plim}\left[\left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\right] = A^{-1}$ and $\mathrm{plim}\left[N^{-1}\sum_{i=1}^{N}x_i'u_i\right] = 0$).

Note that if OLS 1 or OLS 2 fails, $\beta$ is not identified.
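[A small Monte Carlo sketch of consistency (hypothetical design): as $N$ grows, $\hat\beta_{OLS}$ concentrates around the true $\beta$.]

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, 2.0])

def ols(N):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    y = X @ beta + rng.normal(size=N)                  # OLS 1 holds: u independent of x
    return np.linalg.solve(X.T @ X, X.T @ y)

for N in (50, 500, 5000, 50000):
    reps = np.array([ols(N) for _ in range(200)])
    print(N, np.mean(np.abs(reps - beta), axis=0))     # mean absolute deviation shrinks with N
```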

$\hat\beta_{OLS}$ is not necessarily unbiased under OLS 1 and OLS 2.

$$\hat\beta_{OLS} = \beta + \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}x_i'u_i\right) = \beta + (X'X)^{-1}X'u,$$

where $u_{N\times 1} = (u_1, u_2, \ldots, u_N)'$.

But $E(\hat\beta_{OLS}) = \beta + E[(X'X)^{-1}X'u] \neq \beta$ in general.

However, if we use the stronger assumption (2) instead of OLS 1, i.e., $E(u \mid x) = 0$, then unbiasedness of OLS may be retrieved:

$$\hat\beta_{OLS} = \beta + (X'X)^{-1}X'u.$$

Now $E(\hat\beta \mid X) = \beta + (X'X)^{-1}X'E(u \mid X) = \beta + 0 = \beta$.

But then $E(\hat\beta) = E[E(\hat\beta \mid X)] = E(\beta) = \beta$.
Asymptotic Inference Using OLS

Note that $\sqrt{N}(\hat\beta_{OLS} - \beta) = \left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right)$.

We know that $\left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right) \xrightarrow{p} A$, so that $\left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1} \xrightarrow{p} A^{-1}$, i.e.

$$\left(N^{-1}\sum_{i=1}^{N}x_i'x_i\right)^{-1} - A^{-1} = o_p(1).$$

Again, $E(x_i'u_i) = 0$, $i = 1, 2, \ldots$, by OLS 1.

Also, $\{x_i'u_i : i = 1, 2, \ldots\}$ is an iid sequence with zero mean, and each term is assumed to have finite variance. Then, by the Central Limit Theorem,

$$N^{-1/2}\sum_{i=1}^{N}x_i'u_i \xrightarrow{d} N(0, B), \quad \text{where } B_{K\times K} = \mathrm{var}(x_i'u_i) = E(x_i'u_iu_ix_i) = E(u_i^2x_i'x_i) = E(u^2x'x) \text{ for any } i.$$

This means $N^{-1/2}\sum_{i=1}^{N}x_i'u_i = O_p(1)$, by Lemma 5.

Then

$$\sqrt{N}(\hat\beta_{OLS} - \beta) = [A^{-1} + o_p(1)]\,O_p(1) = A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) + o_p(1)O_p(1) = A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) + o_p(1),$$

by Lemma 2.

Assumption 4

4.

$$E(u^2x'x) = \sigma^2 E(x'x), \quad \text{where } \sigma^2 \equiv E(u^2) \qquad [\text{OLS 3}]$$

Since $E(u) = 0$, $\sigma^2 \equiv E(u^2) = E(u^2) - [E(u)]^2 = \mathrm{var}(u)$. In other words, OLS 3 states that the variance of $u$, viz. $E(u^2) = \sigma^2$, is constant and hence independent of $x$, and thus can be taken out of $E(u^2x'x)$.

OLS 3 is the weak homoskedasticity assumption. It means that $u^2$ is uncorrelated with $x_j$, $x_j^2$ and $x_jx_k$, $j, k = 1, \ldots, K$.

Sufficient for OLS 3 is the assumption $E(u^2 \mid x) = \sigma^2$, which is equivalent to $\mathrm{var}(u \mid x) = \sigma^2$ when $E(u \mid x) = 0$.

Asymptotic Normality of $\hat\beta_{OLS}$

From OLS 1 - OLS 3, and the fact that $\sqrt{N}(\hat\beta_{OLS} - \beta) = A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) + o_p(1)$, it follows that

$$\sqrt{N}(\hat\beta_{OLS} - \beta) \overset{a}{\sim} N(0, \sigma^2A^{-1}).$$

Proof: $N^{-1/2}\sum_{i=1}^{N}x_i'u_i \xrightarrow{d} N(0, B)$.

Hence, by Corollary 2, $A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) \xrightarrow{d} N(0, A^{-1}BA^{-1})$.

Now, $\sqrt{N}(\hat\beta_{OLS} - \beta) - A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) = o_p(1)$, i.e.,

$$\sqrt{N}(\hat\beta_{OLS} - \beta) - A^{-1}\left(N^{-1/2}\sum_{i=1}^{N}x_i'u_i\right) \xrightarrow{p} 0.$$

Hence, by Lemma 7 (Asymptotic Equivalence), $\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{d} N(0, A^{-1}BA^{-1})$.

But under OLS 3, $B = \sigma^2E(x'x) = \sigma^2A$, so that $A^{-1}BA^{-1} = \sigma^2A^{-1}$.

Thus, $\sqrt{N}(\hat\beta_{OLS} - \beta) \overset{a}{\sim} N(0, \sigma^2A^{-1})$.
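[A hypothetical simulation of the limiting distribution: with homoskedastic errors, the covariance of $\sqrt{N}(\hat\beta - \beta)$ across replications should be close to $\sigma^2A^{-1}$.]

```python
import numpy as np

rng = np.random.default_rng(4)
beta, sigma, N, R = np.array([1.0, 2.0]), 1.5, 1000, 2000

draws = np.empty((R, 2))
for r in range(R):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    y = X @ beta + sigma * rng.normal(size=N)          # homoskedastic errors: OLS 3 holds
    bhat = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(N) * (bhat - beta)

A = np.eye(2)                                          # E(x'x) for x = (1, z) with z ~ N(0, 1)
print(np.cov(draws, rowvar=False))                     # close to sigma^2 * inv(A)
print(sigma ** 2 * np.linalg.inv(A))
```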

The above result allows us to treat $\hat\beta_{OLS}$ as approximately normal with mean $\beta$ and variance-covariance matrix $\frac{\sigma^2A^{-1}}{N}$, i.e., $\frac{\sigma^2[E(x'x)]^{-1}}{N}$.

The usual estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{RSS}{N - K}$, where $RSS = \sum_{i=1}^{N}\hat u_i^2$ (the sum of squared OLS residuals, $\hat u_i = y_i - x_i\hat\beta$).

It can be shown that $\hat\sigma^2$ is consistent. [H.W.: show it]

Replace $\sigma^2$ with $\hat\sigma^2$ and $E(x'x)$ with the sample average $N^{-1}\sum_{i=1}^{N}x_i'x_i = \frac{1}{N}(X'X)$.

Thus $\widehat{\mathrm{Avar}}(\hat\beta_{OLS}) = \frac{\hat\sigma^2}{N}\left[\frac{1}{N}(X'X)\right]^{-1} = \hat\sigma^2(X'X)^{-1}$.
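[A short numpy sketch of the usual (homoskedasticity-based) standard errors; the function name and the simulated data are illustrative.]

```python
import numpy as np

def ols_classical(X, y):
    """OLS with the usual sigma^2_hat * (X'X)^-1 variance estimate."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - K)        # RSS / (N - K)
    avar_hat = sigma2_hat * XtX_inv             # estimated Avar(beta_hat)
    se = np.sqrt(np.diag(avar_hat))
    return beta_hat, se, beta_hat / se          # coefficients, std. errors, t-stats for H0: beta_j = 0

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=300)
print(ols_classical(X, y))
```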

Hence, under OLS 1 - OLS 3, the usual OLS standard errors, t-statistics and F-statistics are also asymptotically valid (the F-statistic being a degrees-of-freedom-adjusted Wald statistic for testing linear restrictions of the form $R\beta = r$).

[See undergraduate notes for the derivation of the t- and F-statistics via distributions of quadratic forms.]
Violation of CLRM Assumptions

Suppose OLS 3 does not hold (Heteroskedasticity)

We have already shown that

$$\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{d} N(0, A^{-1}BA^{-1}).$$

In other words, the asymptotic variance of $\hat\beta_{OLS}$ is

$$\mathrm{Avar}(\hat\beta_{OLS}) = \frac{A^{-1}BA^{-1}}{N}, \quad \text{where } A_{K\times K} = E(x'x) \text{ and } B_{K\times K} = \mathrm{var}(x'u) = E(u^2x'x).$$

A consistent estimator of $A$ is $N^{-1}\sum_{i=1}^{N}x_i'x_i$.

What is a consistent estimator of $B$?

By the Law of Large Numbers (Theorem 1), $N^{-1}\sum_{i=1}^{N}u_i^2x_i'x_i \xrightarrow{p} E(u^2x'x) = B$.

As $u_i$ cannot be observed, replace $u_i$ by the OLS residual $\hat u_i = y_i - x_i\hat\beta_{OLS}$.

White (1980, Econometrica) proves the following.


White (1980): A consistent estimator of $B$ is $\hat B = N^{-1}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i$.

Proof: The proof consists of several parts.

Part I:

$$\hat u_i = u_i - x_i(\hat\beta - \beta)$$

$$\Rightarrow x_i'\hat u_i = x_i'u_i - x_i'x_i(\hat\beta - \beta) \qquad (1)$$

Transposing,

$$\hat u_ix_i = u_ix_i - (\hat\beta - \beta)'x_i'x_i \qquad (2)$$

Multiplying (1) by (2),

$$\hat u_i^2\,x_i'x_i = u_i^2\,x_i'x_i - u_ix_i'(\hat\beta - \beta)'x_i'x_i - u_ix_i'x_i(\hat\beta - \beta)x_i + x_i'x_i(\hat\beta - \beta)(\hat\beta - \beta)'x_i'x_i$$

Hence

$$\frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i = \frac{1}{N}\sum_{i=1}^{N}u_i^2\,x_i'x_i - \frac{1}{N}\sum_{i=1}^{N}u_ix_i'(\hat\beta - \beta)'x_i'x_i - \frac{1}{N}\sum_{i=1}^{N}u_ix_i'x_i(\hat\beta - \beta)x_i + \frac{1}{N}\sum_{i=1}^{N}x_i'x_i(\hat\beta - \beta)(\hat\beta - \beta)'x_i'x_i \qquad (3)$$
Part II: A digression on Matrix Algebra

The vec operator: stacking the columns of a matrix to form a vector. Thus,

$$\mathrm{vec}(A) = \mathrm{vec}\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{pmatrix}$$

$\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$, where $\otimes$ is the Kronecker or direct product:

$$A_{K\times L} \otimes B_{M\times N} = C_{KM\times LN} = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1L}B \\ a_{21}B & a_{22}B & \cdots & a_{2L}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{K1}B & a_{K2}B & \cdots & a_{KL}B \end{pmatrix}$$
To prove: $\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$.

We prove it using an example. Let

$$A_{2\times 3} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}, \quad B_{3\times 1} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}, \quad C_{1\times 2} = (c_1\ \ c_2).$$

$$ABC = \begin{pmatrix} c_1(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) & c_2(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) \\ c_1(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) & c_2(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) \end{pmatrix}$$

Therefore,

$$\text{LHS} = \mathrm{vec}(ABC) = \begin{pmatrix} c_1(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) \\ c_1(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) \\ c_2(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) \\ c_2(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) \end{pmatrix}$$

Now,

$$\text{RHS} = (C' \otimes A)\,\mathrm{vec}(B) = \left[\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \otimes \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}\right]\mathrm{vec}\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

$$= \begin{pmatrix} c_1a_{11} & c_1a_{12} & c_1a_{13} \\ c_1a_{21} & c_1a_{22} & c_1a_{23} \\ c_2a_{11} & c_2a_{12} & c_2a_{13} \\ c_2a_{21} & c_2a_{22} & c_2a_{23} \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} a_{11}b_1c_1 + a_{12}b_2c_1 + a_{13}b_3c_1 \\ a_{21}b_1c_1 + a_{22}b_2c_1 + a_{23}b_3c_1 \\ a_{11}b_1c_2 + a_{12}b_2c_2 + a_{13}b_3c_2 \\ a_{21}b_1c_2 + a_{22}b_2c_2 + a_{23}b_3c_2 \end{pmatrix} = \begin{pmatrix} c_1(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) \\ c_1(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) \\ c_2(a_{11}b_1 + a_{12}b_2 + a_{13}b_3) \\ c_2(a_{21}b_1 + a_{22}b_2 + a_{23}b_3) \end{pmatrix}$$

Hence LHS = RHS.
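[A quick numerical check of the identity (illustrative; vec is implemented via column-major ravel and the Kronecker product via numpy's kron):]

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 1))
C = rng.normal(size=(1, 2))

vec = lambda M: M.ravel(order="F").reshape(-1, 1)  # stack the columns of M

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)                     # (C' kron A) vec(B)
print(np.allclose(lhs, rhs))                       # True
```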
Part III:

Consider the 3rd term on the RHS of equation (3):

$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{vec}\left(u_ix_i'x_i(\hat\beta - \beta)x_i\right) = \frac{1}{N}\sum_{i=1}^{N}\left(u_ix_i' \otimes x_i'x_i\right)\mathrm{vec}(\hat\beta - \beta).$$

[$u_i$ is a scalar. Treat $(x_i'x_i)_{K\times K} = A$, $(\hat\beta - \beta)_{K\times 1} = B$ and $(x_i)_{1\times K} = C$.]

Now, clearly $(\hat\beta - \beta) \xrightarrow{p} 0$, hence $\mathrm{vec}(\hat\beta - \beta) \xrightarrow{p} 0$.

Again, $u_ix_i' \otimes x_i'x_i$ is the $K^2 \times K$ matrix obtained by stacking the blocks $u_ix_{ji}\,(x_i'x_i)$, $j = 1, \ldots, K$:

$$u_ix_i' \otimes x_i'x_i = \begin{pmatrix} u_ix_{1i}\,(x_i'x_i) \\ u_ix_{2i}\,(x_i'x_i) \\ \vdots \\ u_ix_{Ki}\,(x_i'x_i) \end{pmatrix}, \qquad x_i'x_i = \begin{pmatrix} x_{1i}^2 & x_{1i}x_{2i} & \cdots & x_{1i}x_{Ki} \\ x_{2i}x_{1i} & x_{2i}^2 & \cdots & x_{2i}x_{Ki} \\ \vdots & \vdots & \ddots & \vdots \\ x_{Ki}x_{1i} & x_{Ki}x_{2i} & \cdots & x_{Ki}^2 \end{pmatrix}.$$

The terms of the matrix $\frac{1}{N}\sum_{i=1}^{N}(u_ix_i' \otimes x_i'x_i)$ are therefore of the form $\frac{1}{N}\sum_{i=1}^{N}u_ix_{ki}^3$, $\frac{1}{N}\sum_{i=1}^{N}u_ix_{ki}^2x_{ji}$ or $\frac{1}{N}\sum_{i=1}^{N}u_ix_{ki}x_{ji}x_{li}$, $j, k, l = 1, \ldots, K$.

Assume that the corresponding population moments, i.e., $E(x_j^3u)$, $E(x_k^2x_ju)$ and $E(x_jx_kx_lu)$, exist and are finite.

Then, by the WLLN, $\frac{1}{N}\sum_{i=1}^{N}(u_ix_i' \otimes x_i'x_i) \xrightarrow{p} E(ux' \otimes x'x)$, the corresponding matrix of population moments.

Consequently, by Lemma 1 of Asymptotic Theory, the 3rd term on the RHS of (3) is $O_p(1)o_p(1) = o_p(1)$. The 2nd term can be treated similarly, as it is only a transpose of the 3rd term.
Now consider the 4th term on the RHS of equation (3):

$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{vec}\left(x_i'x_i(\hat\beta - \beta)(\hat\beta - \beta)'x_i'x_i\right) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i'x_i \otimes x_i'x_i\right)\mathrm{vec}\left((\hat\beta - \beta)(\hat\beta - \beta)'\right).$$

Again, $(\hat\beta - \beta) \xrightarrow{p} 0$, thus $\mathrm{vec}\left((\hat\beta - \beta)(\hat\beta - \beta)'\right) \xrightarrow{p} 0$, by Lemma 2.

Also, $x_i'x_i \otimes x_i'x_i$ is the $K^2 \times K^2$ matrix whose $(j, k)$ block is $x_{ji}x_{ki}\,(x_i'x_i)$. The terms of the matrix $\frac{1}{N}\sum_{i=1}^{N}(x_i'x_i \otimes x_i'x_i)$ are therefore of the form $\frac{1}{N}\sum_{i=1}^{N}x_{ji}^4$, $\frac{1}{N}\sum_{i=1}^{N}x_{ji}^3x_{ki}$, $\frac{1}{N}\sum_{i=1}^{N}x_{ji}^2x_{ki}^2$, $\frac{1}{N}\sum_{i=1}^{N}x_{ji}^2x_{ki}x_{li}$ or $\frac{1}{N}\sum_{i=1}^{N}x_{ji}x_{ki}x_{li}x_{mi}$, $j, k, l, m = 1, \ldots, K$.

Assume that the corresponding population moments, i.e., $E(x_j^4)$, $E(x_j^3x_k)$, $E(x_j^2x_k^2)$, $E(x_j^2x_kx_l)$ and $E(x_jx_kx_lx_m)$, exist and are finite.

Hence, by the WLLN, $\frac{1}{N}\sum_{i=1}^{N}(x_i'x_i \otimes x_i'x_i) \xrightarrow{p} E(x'x \otimes x'x)$, the corresponding matrix of population moments.

Consequently, by Lemma 1, the 4th term is also $O_p(1)o_p(1) = o_p(1)$.

Note also that if $\frac{1}{N}\sum_{i=1}^{N}\mathrm{vec}\left(u_ix_i'x_i(\hat\beta - \beta)x_i\right)$ or $\frac{1}{N}\sum_{i=1}^{N}\mathrm{vec}\left(x_i'x_i(\hat\beta - \beta)(\hat\beta - \beta)'x_i'x_i\right)$ is $o_p(1)$, then so is the corresponding original expression on the RHS of equation (3), viz. $\frac{1}{N}\sum_{i=1}^{N}u_ix_i'x_i(\hat\beta - \beta)x_i$ or $\frac{1}{N}\sum_{i=1}^{N}x_i'x_i(\hat\beta - \beta)(\hat\beta - \beta)'x_i'x_i$, since the vec operator does nothing but stack the columns of the original matrices into vectors.

PART IV: Finally, from equation (3),

$$\frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i = \frac{1}{N}\sum_{i=1}^{N}u_i^2\,x_i'x_i + o_p(1).$$

So

$$\frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i \xrightarrow{p} \frac{1}{N}\sum_{i=1}^{N}u_i^2\,x_i'x_i.$$

We already know that

$$\frac{1}{N}\sum_{i=1}^{N}u_i^2\,x_i'x_i \xrightarrow{p} E(u^2x'x) = B, \quad \text{by the WLLN}.$$

Thus,

$$\frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i \xrightarrow{p} B.$$

Hence the heteroskedasticity-robust variance-covariance matrix of $\hat\beta_{OLS}$ is:

$$\widehat{\mathrm{Avar}}(\hat\beta)_{robust} = \frac{\hat A^{-1}\hat B\hat A^{-1}}{N} = \frac{1}{N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i'x_i\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i\right)\left(\frac{1}{N}\sum_{i=1}^{N}x_i'x_i\right)^{-1}$$

$$= (X'X)^{-1}\left(\sum_{i=1}^{N}\hat u_i^2\,x_i'x_i\right)(X'X)^{-1}, \quad \text{where } \sum_{i=1}^{N}x_i'x_i = X'X.$$

This matrix is also often called the sandwich matrix because of its form.
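[A numpy sketch of the sandwich estimator above (the function name and data are illustrative; this is the basic form, without any finite-sample degrees-of-freedom correction):]

```python
import numpy as np

def ols_robust(X, y):
    """OLS with the heteroskedasticity-robust (sandwich) variance estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    u_hat = y - X @ beta_hat
    meat = (X * u_hat[:, None] ** 2).T @ X        # sum_i uhat_i^2 x_i'x_i
    avar_robust = XtX_inv @ meat @ XtX_inv        # (X'X)^-1 [sum uhat^2 x'x] (X'X)^-1
    return beta_hat, avar_robust, np.sqrt(np.diag(avar_robust))

# heteroskedastic example: var(u|x) grows with x2^2
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = X @ np.array([1.0, 2.0]) + np.abs(X[:, 1]) * rng.normal(size=500)
print(ols_robust(X, y)[2])                        # robust standard errors
```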

Once the heteroskedasticity-consistent variance-covariance matrix of $\hat\beta_{OLS}$ is obtained, we can also get the heteroskedasticity-robust standard errors of $\hat\beta_{OLS}$ by taking square roots of the diagonal terms of $\widehat{\mathrm{Avar}}(\hat\beta)_{robust}$.

Once robust standard errors are obtained, t- and F-statistics can be computed in the usual way (robust t- or F-stats).

However, under heteroskedasticity the usual F-stats are generally not valid, even asymptotically. So Wald statistics should be employed instead.

$$H_0: R_{(Q\times K)}\,\beta_{(K\times 1)} = r_{(Q\times 1)}, \quad \text{where } \mathrm{rank}(R) = Q \le K.$$

The heteroskedasticity-robust Wald statistic is

$$W = (R\hat\beta - r)'(R\hat VR')^{-1}(R\hat\beta - r), \quad \text{where } \hat V = \widehat{\mathrm{Avar}}(\hat\beta)_{robust}. \qquad W \overset{a}{\sim} \chi^2_Q.$$
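[A sketch of the robust Wald test; `ols_robust` is the illustrative function from the previous sketch, and the restriction tested is hypothetical.]

```python
import numpy as np
from scipy.stats import chi2

def wald_test(R, r, beta_hat, V_hat):
    """Robust Wald statistic for H0: R beta = r, with V_hat = Avar_robust(beta_hat)."""
    diff = R @ beta_hat - r
    W = diff @ np.linalg.solve(R @ V_hat @ R.T, diff)  # (Rb - r)' (R V R')^-1 (Rb - r)
    Q = R.shape[0]
    return W, chi2.sf(W, df=Q)                         # statistic and chi-square(Q) p-value

# e.g. test H0: beta_2 = 0 (a single restriction, Q = 1):
# beta_hat, V_hat, _ = ols_robust(X, y)
# print(wald_test(np.array([[0.0, 1.0]]), np.array([0.0]), beta_hat, V_hat))
```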

So far, we have been proceeding under the weak exogeneity assumption OLS 1. If we had assumed strong exogeneity, i.e., $E(u \mid x) = 0$, then there would exist another solution to the violation of OLS 3, i.e., heteroskedasticity.

In this case, i.e. when $E(u \mid x) = 0$, if OLS 3 fails we can specify a model for $\mathrm{var}(y \mid x)$, estimate that model, and apply generalized least squares (GLS).

For observation $i$, $y_i$ and each element of $x_i$ are divided by an estimate of the conditional standard deviation, $[\widehat{\mathrm{var}}(y_i \mid x_i)]^{1/2}$. OLS applied to this transformed (weighted) data gives $\hat\beta_{GLS}$. GLS is a special form of weighted least squares (WLS).
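[A minimal WLS sketch under an assumed skedastic function: here $\mathrm{var}(y \mid x) \propto 1 + x_2^2$ is purely illustrative and taken as known; in feasible GLS this function would itself have to be specified and estimated.]

```python
import numpy as np

rng = np.random.default_rng(8)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
h = 1.0 + X[:, 1] ** 2                                   # assumed skedastic function: var(y|x) proportional to h(x)
y = X @ np.array([1.0, 2.0]) + np.sqrt(h) * rng.normal(size=N)

w = 1.0 / np.sqrt(h)                                     # weights = 1 / conditional standard deviation
Xw, yw = X * w[:, None], y * w                           # transformed (weighted) data
beta_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)         # OLS on the transformed data = WLS
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols, beta_wls)                                # both consistent here; WLS typically more precise
```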

In modern econometrics, however, the more popular approach is to stick to the OLS estimate of $\beta$ and apply the heteroskedasticity correction to the estimated variance-covariance matrix of $\hat\beta_{OLS}$, viz. $\widehat{\mathrm{Avar}}(\hat\beta_{OLS})_{robust}$. This latter matrix and the consequent standard errors are then used for testing.

Note that the robust standard errors are valid even when OLS 3 holds (only then does $\widehat{\mathrm{Avar}}(\hat\beta_{OLS})$ simplify to $\hat\sigma^2(X'X)^{-1}$). So this is an easier approach.

GLS may be avoided for other reasons as well.

1. GLS leads to an efficiency gain only when the model for $\mathrm{var}(y \mid x)$ is correct. So it requires a lot of information.

2. Finite sample properties of feasible GLS are usually not known, except in the simplest cases.

3. Under the weak exogeneity assumption [OLS 1], GLS is generally inconsistent if $E(u \mid x) \neq 0$.

[See undergraduate notes for the treatment of GLS.]