
VIOLATING THE ASSUMPTIONS OF CLRM

(Continued)
ENDOGENEITY
Suppose we know that $Y_i$ is generated as:
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

An important question that arises is: when does OLS consistently (at least) estimate $\beta_0$ and $\beta_1$, and thus is a meaningful procedure to follow?

The most important assumption (and the most difficult to satisfy in practice) is that the
regressors are (at least) uncorrelated with (or, as is sometimes said, "orthogonal" to) the error
term, in the sense that:
$$Cov(X_i, u_i) = 0 = Corr(X_i, u_i)$$
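Indeed, consider the slope estimator:
$$\hat\beta_1 = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} = \beta_1 + \frac{\frac{1}{n}\sum_i (X_i - \bar X)(u_i - \bar u)}{\frac{1}{n}\sum_i (X_i - \bar X)^2}$$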

By the LLN, the numerator and the denominator converge in probability to their population
analogs:
$$\frac{1}{n}\sum_i (X_i - \bar X)^2 \xrightarrow{p} E(X_i - \mu_x)^2 = Var(X_i)$$
$$\frac{1}{n}\sum_i (X_i - \bar X)(u_i - \bar u) \xrightarrow{p} E[(X_i - \mu_x)(u_i - \mu_u)] = Cov(X_i, u_i)$$

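It is clear that if $Cov(X_i, u_i) = 0$, then $\hat\beta_1$ consistently estimates $\beta_1$, i.e.
$$\hat\beta_1 \xrightarrow{p} \beta_1,$$
and so does the intercept estimator:
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = \frac{1}{n}\sum_i Y_i - \hat\beta_1 \frac{1}{n}\sum_i X_i$$
$$= \beta_0 + \beta_1 \frac{1}{n}\sum_i X_i + \frac{1}{n}\sum_i u_i - \hat\beta_1 \frac{1}{n}\sum_i X_i$$
$$\xrightarrow{p} \beta_0 + \beta_1 \mu_x + 0 - \beta_1 \mu_x = \beta_0$$
(It is safe to assume that $E(u_i) = \mu_u = 0$, as the model includes an intercept.)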

Thus the orthogonality between $u$ and $X$ suffices for consistency of the OLS estimators, although
it does not guarantee unbiasedness. For unbiasedness, and for the slope coefficient to be
interpreted as the average marginal effect of $X$ on $Y$, we need the (stronger) assumption of
mean independence,
$$E(u_i \mid X_i) = 0$$
Mean independence implies uncorrelatedness (but not the other way around), hence under
mean independence OLS is not only consistent but also unbiased.
There are many cases where the uncorrelatedness between the error and the regressors fails for one
or more regressors, in which case those regressors are called endogenous.
The result is then that OLS is not only biased in any finite sample but also inconsistent, and
hence inappropriate to use. This phenomenon is known as endogeneity bias.
In a multiple regression model, the endogeneity of even one regressor in general contaminates all estimated parameters, and not only the coefficient of the endogenous regressor!
We next present several examples where the regressor is endogenous and hence OLS is inappropriate to use. We then present an estimation method that provides consistent estimators.
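
The inconsistency result can be illustrated with a small simulation. The sketch below is not part of the original notes: it uses Python with NumPy, and the parameter values (an intercept of 1, a slope of 2, Cov(X, u) = 0.5, Var(X) = 1.25) are hypothetical choices made only for illustration. It compares the OLS slope when the regressor is uncorrelated with the error against the case where it is not, in which the slope should settle near beta1 + Cov(X, u)/Var(X) rather than beta1.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    xd = x - x.mean()
    return float(xd @ (y - y.mean()) / (xd @ xd))

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 1.0, 2.0                  # hypothetical true parameters

u = rng.normal(size=n)                   # error term, Var(u) = 1
x_exog = rng.normal(size=n)              # Cov(X, u) = 0
x_endog = rng.normal(size=n) + 0.5 * u   # Cov(X, u) = 0.5, Var(X) = 1.25

slope_exog = ols_slope(x_exog, beta0 + beta1 * x_exog + u)
slope_endog = ols_slope(x_endog, beta0 + beta1 * x_endog + u)

# Probability limit under endogeneity: beta1 + Cov(X, u) / Var(X)
plim_endog = beta1 + 0.5 / 1.25

print(f"exogenous X : {slope_exog:.3f} (true slope {beta1})")
print(f"endogenous X: {slope_endog:.3f} (plim {plim_endog})")
```

Increasing `n` does not move the second estimate toward 2, which is what inconsistency means in practice.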

Example 1: Omitted Variables


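Let the true model be the (long) model
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + u_i$$
where
$$E(u_i \mid X_{i1}, X_{i2}) = 0$$
If we wrongly omit $X_{i2}$, i.e. if we estimate the (short) model
$$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i$$
the OLS estimate of $\beta_1$ is
$$\hat\beta_1 = \frac{\sum_i (X_{i1} - \bar X_1)(Y_i - \bar Y)}{\sum_i (X_{i1} - \bar X_1)^2}$$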

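Substituting in $Y_i$ and $\bar Y$ from the true (long) model,
$$\hat\beta_1 = \frac{\sum_i (X_{i1} - \bar X_1)\left[\beta_1 (X_{i1} - \bar X_1) + \beta_2 (X_{i2} - \bar X_2) + (u_i - \bar u)\right]}{\sum_i (X_{i1} - \bar X_1)^2}$$
$$= \beta_1 + \beta_2 \frac{\sum_i (X_{i1} - \bar X_1)(X_{i2} - \bar X_2)}{\sum_i (X_{i1} - \bar X_1)^2} + \frac{\sum_i (X_{i1} - \bar X_1)(u_i - \bar u)}{\sum_i (X_{i1} - \bar X_1)^2}$$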
It is clear that
$$E(\hat\beta_1 \mid X_1, X_2) = \beta_1 + \beta_2 \frac{\sum_i (X_{i1} - \bar X_1)(X_{i2} - \bar X_2)}{\sum_i (X_{i1} - \bar X_1)^2}$$
The second term above,
$$\beta_2 \frac{\sum_i (X_{i1} - \bar X_1)(X_{i2} - \bar X_2)}{\sum_i (X_{i1} - \bar X_1)^2}$$
is known as the omitted variable bias. Note that this is not zero unless $\beta_2 = 0$ (so that
$X_{i2}$ does not in fact belong in the regression) or $\sum_i (X_{i1} - \bar X_1)(X_{i2} - \bar X_2) = 0$ (i.e. in the
particular sample the two regressors are uncorrelated). Note that the ratio multiplying $\beta_2$ is nothing
but the OLS slope coefficient we would obtain from a regression of $X_2$ on $X_1$.
In the simple case of one omitted regressor and one included regressor, we can sign the bias:
it is positive if $\beta_2$ and the sample correlation of the two regressors have the same sign, and it
is negative if they are of opposite sign.

                  Corr(X1, X2) > 0     Corr(X1, X2) < 0
$\beta_2 > 0$     Bias > 0             Bias < 0
$\beta_2 < 0$     Bias < 0             Bias > 0

It is clear that the omitted variable bias term does not vanish as the sample size goes to
infinity. Its probability limit is
$$\beta_2 \frac{Cov(X_{i1}, X_{i2})}{Var(X_{i1})}$$
and it is not zero unless $\beta_2 = 0$ or $Cov(X_{i1}, X_{i2}) = 0$. Hence the omission of one (or more)
explanatory variables leads to biased and inconsistent OLS estimators.
A classic example of omitting a variable occurs in estimating the returns to education in
wage equations, due to the presence of unobserved worker ability, which is part of the error term and
which is thought to be correlated with the worker's education, or when estimating production
functions with unobserved factors of production.
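
The omitted variable bias formula can be checked numerically. The following Python/NumPy sketch is an illustration added to the notes, with hypothetical parameter values (beta1 = 2, beta2 = 3, Cov(X1, X2) = 0.5, Var(X1) = 1). It verifies that the short-regression slope equals the long-regression slope plus the long-regression coefficient on X2 times the slope from regressing X2 on X1, and that it settles near beta1 + beta2 Cov(X1, X2)/Var(X1) = 3.5.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    xd = x - x.mean()
    return float(xd @ (y - y.mean()) / (xd @ xd))

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0      # hypothetical long-model parameters

x1 = rng.normal(size=n)                  # Var(X1) = 1
x2 = 0.5 * x1 + rng.normal(size=n)       # Cov(X1, X2) = 0.5
u = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + u

# Long regression: y on a constant, x1 and x2
X_long = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

b_short = ols_slope(x1, y)               # short regression omits x2
delta = ols_slope(x1, x2)                # slope from regressing X2 on X1

# In-sample identity: short slope = long slope + (coef on x2) * delta
identity_gap = b_short - (b_long[1] + b_long[2] * delta)
plim_short = beta1 + beta2 * 0.5 / 1.0   # = 3.5

print(f"short slope {b_short:.3f}, plim {plim_short}, identity gap {identity_gap:.2e}")
```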

Example 2: Supply and Demand Estimation


A classic example is Working's (1927) simple model of supply and demand:
$$q_i^d = \alpha_0 + \alpha_1 p_i + u_i \qquad \text{(Demand Equation)}$$
$$q_i^s = \beta_0 + \beta_1 p_i + v_i \qquad \text{(Supply Equation)}$$
$$q_i^d = q_i^s = q_i \qquad \text{(Market Equilibrium)}$$

Imposing market equilibrium, the system may be written in terms of the observed equilibrium
prices and quantities as
$$q_i = \alpha_0 + \alpha_1 p_i + u_i \qquad \text{(Demand Equation)}$$
$$q_i = \beta_0 + \beta_1 p_i + v_i \qquad \text{(Supply Equation)} \qquad (*)$$

If we regress the observed (equilibrium) quantity on a constant and the observed (equilibrium)
price, are we estimating the demand or the supply curve?
The OLS estimator of the slope coefficient is
$$\frac{\sum_{i=1}^n (p_i - \bar p)(q_i - \bar q)}{\sum_{i=1}^n (p_i - \bar p)^2}$$

Its probability limit (applying a LLN) is
$$\frac{E[(p - E(p))(q - E(q))]}{E[(p - E(p))^2]} = \frac{Cov(p, q)}{Var(p)}$$

Using the demand equation (with slope $\alpha_1$) and taking its covariance with $p$, we have
$$\frac{Cov(p, q)}{Var(p)} = \frac{\alpha_1 Var(p) + Cov(p, u)}{Var(p)} = \alpha_1 + \frac{Cov(p, u)}{Var(p)}$$
while using the supply equation (with slope $\beta_1$) we obtain:
$$\frac{Cov(p, q)}{Var(p)} = \frac{\beta_1 Var(p) + Cov(p, v)}{Var(p)} = \beta_1 + \frac{Cov(p, v)}{Var(p)}$$
For OLS to consistently estimate either the demand or the supply equation, the regressor p
needs to be uncorrelated with the error terms.
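However, in this (simultaneous equations) model, $Cov(p, u) \neq 0$ and $Cov(p, v) \neq 0$. Indeed,
(*) is a linear system of two equations in the two variables $p_i$ and $q_i$, which (with $\alpha_0, \alpha_1$ the demand and $\beta_0, \beta_1$ the supply coefficients) solves for
$$p_i = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{v_i - u_i}{\alpha_1 - \beta_1}$$
$$q_i = \frac{\alpha_1 \beta_0 - \alpha_0 \beta_1}{\alpha_1 - \beta_1} + \frac{\alpha_1 v_i - \beta_1 u_i}{\alpha_1 - \beta_1}$$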

from which we see that pi is endogenous in both equations since it is a linear function of
both error terms.
In this example, endogeneity is a result of market equilibrium, i.e. of the fact that price and
quantity are simultaneously determined, i.e. they are both endogenous variables in a system of
equations. In this case the inconsistency of OLS resulting from endogeneity of the regressor(s)
is also known as simultaneity bias.
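
A short simulation makes the point concrete. This Python/NumPy sketch is an added illustration, not part of the original notes. It assumes hypothetical coefficients (demand intercept 10 and slope -1, supply intercept 2 and slope +1, named `a0`, `a1`, `b0`, `b1` in the code) and independent unit-variance shocks. Prices and quantities are generated from the equilibrium solution, and quantity is then regressed on price. With independent shocks the solution implies a probability limit that is a variance-weighted average of the two slopes, which here equals 0, so OLS recovers neither the demand nor the supply curve.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a0, a1 = 10.0, -1.0        # hypothetical demand intercept and slope
b0, b1 = 2.0, 1.0          # hypothetical supply intercept and slope

u = rng.normal(size=n)     # demand shock, Var(u) = 1
v = rng.normal(size=n)     # supply shock, Var(v) = 1, independent of u

# Equilibrium price from solving the two equations, then quantity from demand
p = (b0 - a0) / (a1 - b1) + (v - u) / (a1 - b1)
q = a0 + a1 * p + u

# OLS slope of observed quantity on observed price
p_dev = p - p.mean()
ols = float(p_dev @ (q - q.mean()) / (p_dev @ p_dev))

# plim = Cov(p, q) / Var(p) = (a1 * Var(v) + b1 * Var(u)) / (Var(u) + Var(v))
var_u, var_v = 1.0, 1.0
plim = (a1 * var_v + b1 * var_u) / (var_u + var_v)

print(f"OLS slope {ols:.3f}; demand slope {a1}, supply slope {b1}, plim {plim}")
```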

Example 3: Consumption Function Estimation


Consider Haavelmo's (1943) simple macroeconomic model:
$$C_t = \beta_0 + \beta_1 Y_t + u_t, \qquad 0 < \beta_1 < 1$$
$$Y_t = C_t + I_t$$
where $C_t$ is aggregate consumption, $Y_t$ is GNP, and $I_t$ is investment in year $t$. The coefficient
$\beta_1$ is the Marginal Propensity to Consume (MPC), which determines the size of the so-called
macroeconomic multiplier $\frac{1}{1 - \beta_1}$.

It is easy to see that in equilibrium
$$Y_t = \frac{\beta_0}{1 - \beta_1} + \frac{1}{1 - \beta_1} I_t + \frac{1}{1 - \beta_1} u_t$$
If investment is predetermined, i.e. $Cov(I_t, u_t) = 0$, it follows that
$$Cov(Y_t, u_t) = \frac{Var(u_t)}{1 - \beta_1} > 0$$

So income is endogenous in the consumption function and hence an OLS regression of consumption on income produces an inconsistent estimator of the MPC.
In fact,
$$\text{plim}\, \hat\beta_1 = \beta_1 + \frac{Cov(Y_t, u_t)}{Var(Y_t)} = \beta_1 + \frac{1 - \beta_1}{1 + \frac{Var(I_t)}{Var(u_t)}} > \beta_1$$
and OLS systematically overestimates $\beta_1$.

This is also an example of simultaneity bias, due to the fact that Ct and Yt are simultaneously
determined in the economy in a system of equations.
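
The size of the overestimate can be checked by simulation. The sketch below (Python with NumPy, added here as an illustration) assumes hypothetical values: an MPC of 0.6, Var(I) = 4 and Var(u) = 1, for which the probability limit above is 0.6 + 0.4/(1 + 4) = 0.68.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta0, beta1 = 1.0, 0.6                        # hypothetical; true MPC = 0.6

inv = rng.normal(loc=5.0, scale=2.0, size=n)   # predetermined investment, Var(I) = 4
u = rng.normal(size=n)                         # consumption shock, Var(u) = 1

y = (beta0 + inv + u) / (1.0 - beta1)          # equilibrium income
c = y - inv                                    # consumption from the identity Y = C + I

# OLS slope of consumption on income
y_dev = y - y.mean()
mpc_ols = float(y_dev @ (c - c.mean()) / (y_dev @ y_dev))

# plim = beta1 + (1 - beta1) / (1 + Var(I) / Var(u))
plim = beta1 + (1.0 - beta1) / (1.0 + 4.0 / 1.0)

print(f"OLS estimate of the MPC {mpc_ols:.3f}; true MPC {beta1}; plim {plim:.2f}")
```

The smaller the variance of investment relative to the variance of the shock, the larger the upward bias.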

Example 4: Errors-in-Variables
Consider Friedman's (1957) permanent income hypothesis that
$$C_i^* = \beta Y_i^*$$
where $C_i^*$ is permanent consumption for household $i$ and $Y_i^*$ is permanent income, with $0 < \beta < 1$.
It is assumed that observed consumption $C_i$ and income $Y_i$ are measured with (additive) error:
$$C_i = C_i^* + c_i$$
$$Y_i = Y_i^* + y_i$$
with the classical measurement errors $c_i$ and $y_i$ satisfying:
$$E(c_i) = E(y_i) = 0$$
$$Cov(c_i, y_i) = Cov(C_i^*, c_i) = Cov(Y_i^*, y_i) = Cov(C_i^*, y_i) = Cov(Y_i^*, c_i) = 0$$
Substituting the measured variables in the model we obtain:
$$C_i = \beta Y_i + u_i$$
where
$$u_i = c_i - \beta y_i$$
It is straightforward to show that
$$Cov(Y_i, u_i) = E(Y_i u_i) = -\beta E(y_i^2) < 0$$

so that measured income is endogenous in the consumption function due to the measurement
error in the regressor.
The probability limit of the OLS estimator of $\beta$ from regressing measured consumption on
measured income is
$$\text{plim}\, \hat\beta = \frac{E(C_i Y_i)}{E(Y_i^2)} = \beta \frac{E(Y_i^{*2})}{E(Y_i^{*2}) + E(y_i^2)} < \beta$$
which shows that OLS systematically underestimates $\beta$.
Note that in this example both the left and the right-side variables are measured with error.
However it can be easily seen that it is really only the measurement error in the regressor that
causes the problem!
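
Both claims can be illustrated with a simulation. This Python/NumPy sketch is an addition to the notes and uses hypothetical values: a true coefficient of 0.8, permanent income measured in deviations from its mean with second moment 9, and measurement errors with second moment 1. The attenuation factor is then 9/(9 + 1), so the regression on mismeasured income should give about 0.72, while measurement error in consumption alone leaves the estimate near 0.8.

```python
import numpy as np

def slope_no_intercept(x, y):
    """OLS slope from a regression of y on x through the origin."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(4)
n = 200_000
beta = 0.8                                   # hypothetical true coefficient

y_star = rng.normal(scale=3.0, size=n)       # permanent income, E(Y*^2) = 9
c_star = beta * y_star                       # permanent consumption
c_err = rng.normal(size=n)                   # measurement error in consumption
y_err = rng.normal(size=n)                   # measurement error in income, E(y^2) = 1

c_obs = c_star + c_err
y_obs = y_star + y_err

b_both = slope_no_intercept(y_obs, c_obs)      # error in both variables
b_lhs_only = slope_no_intercept(y_star, c_obs) # error only in the dependent variable

plim_both = beta * 9.0 / (9.0 + 1.0)           # = 0.72

print(f"error in regressor: {b_both:.3f} (plim {plim_both:.2f}); "
      f"error only in C: {b_lhs_only:.3f} (true {beta})")
```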

INSTRUMENTAL VARIABLES (IV) ESTIMATOR


As we saw, in the presence of endogeneity OLS is biased and inconsistent, and hence inappropriate to use. How do we estimate the model then?
We need to find (at least) one additional variable for each endogenous regressor, with certain
properties.
Suppose that in the simple linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
$X_i$ is endogenous, i.e.
$$Cov(X_i, u_i) \neq 0$$
Suppose however that we find an instrument for the endogenous variable, i.e. an observable
variable $Z$ such that
$$Cov(Z_i, u_i) = 0$$
$$Cov(Z_i, X_i) \neq 0$$
Note that the instrument has to satisfy both conditions!
For example, in the wage equation
$$wage = \beta_0 + \beta_1\, educ + u$$
$u$ includes ability, which we think is correlated with education. To estimate the effect of education on wages we need to find an instrument for education, i.e. a variable that is uncorrelated
with $u$ (i.e. with ability) but correlated with education. Researchers have proposed several
instruments, like mother's education (but is it uncorrelated with the child's ability?) and/or
number of siblings.
In the supply-demand model, if we think that the supply equation is determined as
$$q_i = \beta_0 + \beta_1 p_i + \beta_2 Z_i + v_i$$
i.e. the supplied quantity of the good (e.g. coffee) is affected by a factor $Z_i$ (say, temperature
in the producing country) that does not affect the demanded quantity, then $Z_i$ may serve as an
instrument for estimating the demand equation, as it can be easily shown that $Z_i$ is correlated
with price but may be assumed uncorrelated with the errors.
In the macroeconomic consumption function model, if we think that investment is predetermined (i.e. its level is set before consumption is determined) then we may assume that it
is uncorrelated with the error in the consumption equation, while by the definition of income,
investment is correlated with income, and hence it is a valid instrument.
In the case of measurement error in a regressor, a valid instrument might be a second measurement of the regressor (but measured independently of the first measurement!).

We use this instrument $Z_i$ to construct a new estimator, called the instrumental variable (IV)
estimator, as follows:
$$\hat\beta_{1,IV} = \frac{\sum_i (Z_i - \bar Z)(Y_i - \bar Y)}{\sum_i (Z_i - \bar Z)(X_i - \bar X)}$$
for the slope. The intercept is estimated as
$$\hat\beta_{0,IV} = \bar Y - \hat\beta_{1,IV} \bar X$$

The idea for this estimator comes from the fact that in the model
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
if we take the covariance with $Z$ we get:
$$Cov(Y_i, Z_i) = \beta_1 Cov(Z_i, X_i) + Cov(Z_i, u_i)$$
which solves for
$$\beta_1 = \frac{Cov(Y_i, Z_i)}{Cov(Z_i, X_i)}$$
since by assumption $Cov(Z_i, u_i) = 0$ and $Cov(Z_i, X_i) \neq 0$. The IV estimator is the sample
analog of the formula for $\beta_1$ above and hence, by the law of large numbers,
$$\hat\beta_{1,IV} = \frac{\frac{1}{n}\sum_i (Z_i - \bar Z)(Y_i - \bar Y)}{\frac{1}{n}\sum_i (Z_i - \bar Z)(X_i - \bar X)} \xrightarrow{p} \frac{E[(Z_i - \mu_Z)(Y_i - \mu_Y)]}{E[(Z_i - \mu_Z)(X_i - \mu_X)]} = \frac{Cov(Y_i, Z_i)}{Cov(Z_i, X_i)} = \beta_1$$
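
The IV formulas translate directly into code. The following Python/NumPy sketch is an added illustration with a hypothetical data-generating process: an intercept of 1, a slope of 2, an instrument with Cov(Z, X) = 1 and Cov(Z, u) = 0, and a regressor with Cov(X, u) = 0.5 and Var(X) = 2.25. It computes the OLS slope, which should settle near 2 + 0.5/2.25, and the IV slope and intercept, which should settle near the true values.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta0, beta1 = 1.0, 2.0                    # hypothetical true parameters

z = rng.normal(size=n)                     # instrument: Cov(Z, u) = 0
u = rng.normal(size=n)                     # structural error
x = z + 0.5 * u + rng.normal(size=n)       # endogenous: Cov(X, u) = 0.5, Cov(Z, X) = 1
y = beta0 + beta1 * x + u

z_dev, x_dev, y_dev = z - z.mean(), x - x.mean(), y - y.mean()

b1_ols = float(x_dev @ y_dev / (x_dev @ x_dev))   # inconsistent under endogeneity
b1_iv = float(z_dev @ y_dev / (z_dev @ x_dev))    # sample Cov(Z, Y) / Cov(Z, X)
b0_iv = float(y.mean() - b1_iv * x.mean())        # IV intercept

print(f"OLS slope {b1_ols:.3f}, IV slope {b1_iv:.3f}, IV intercept {b0_iv:.3f}")
```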

PROPERTIES OF IV ESTIMATOR
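It can be easily seen that the IV estimator is, in general, biased in finite samples!
$$E(\hat\beta_{1,IV} \mid X, Z) = \frac{\sum_i (Z_i - \bar Z)\, E(Y_i - \bar Y \mid X, Z)}{\sum_i (X_i - \bar X)(Z_i - \bar Z)}$$
$$= \beta_1 + \frac{\sum_i (Z_i - \bar Z)\, E(u_i - \bar u \mid X, Z)}{\sum_i (X_i - \bar X)(Z_i - \bar Z)} \neq \beta_1$$
since $X$ is correlated with $u$.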

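However, under the assumptions that make a variable a valid instrument, the estimator is
consistent:
$$\hat\beta_{1,IV} = \frac{\sum_i (Z_i - \bar Z)(Y_i - \bar Y)}{\sum_i (Z_i - \bar Z)(X_i - \bar X)} = \beta_1 + \frac{\sum_i (Z_i - \bar Z)(u_i - \bar u)}{\sum_i (Z_i - \bar Z)(X_i - \bar X)}$$
$$\xrightarrow{p} \beta_1 + \frac{Cov(u_i, Z_i)}{Cov(Z_i, X_i)} = \beta_1$$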
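It can also be shown to be asymptotically normal:
$$\sqrt{n}\,(\hat\beta_{1,IV} - \beta_1) = \frac{\frac{1}{\sqrt n}\sum_i (Z_i - \bar Z)(u_i - \bar u)}{\frac{1}{n}\sum_i (Z_i - \bar Z)(X_i - \bar X)}$$
$$\xrightarrow{d} \frac{1}{Cov(Z_i, X_i)}\, N\!\left(0,\; E[(Z_i - \mu_Z)^2 u_i^2]\right) = N\!\left(0,\; \frac{E[(Z_i - \mu_Z)^2 u_i^2]}{Cov(Z_i, X_i)^2}\right)$$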

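If $u$ is homoskedastic conditionally on $Z$, in the sense that
$$E(u_i^2 \mid Z_i) = \sigma^2$$
then the asymptotic variance is
$$AVar(\hat\beta_{1,IV}) = \frac{\sigma^2\, Var(Z_i)}{Cov(Z_i, X_i)^2} = \frac{\sigma^2\, Var(Z_i)}{\rho_{ZX}^2\, Var(X_i)\, Var(Z_i)} = \frac{\sigma^2}{\rho_{ZX}^2\, Var(X_i)}$$
(Recall that the correlation coefficient is defined as $\rho_{ZX} = \frac{Cov(Z_i, X_i)}{\sqrt{V(X_i)}\sqrt{V(Z_i)}}$.)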

Recall that the asymptotic variance of the (inconsistent) OLS estimator (under conditional
homoskedasticity given both $X$ and $Z$) is
$$AVar(\hat\beta_{1,OLS}) = \frac{\sigma^2}{Var(X_i)}$$
Since $|\rho_{ZX}| \leq 1$, the IV estimator has a larger variance than the (biased and inconsistent) OLS
estimator, and this variance is larger the smaller the correlation $\rho_{ZX}$ between $Z$ and $X$.
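
The variance comparison can be checked with a small Monte Carlo experiment. The Python/NumPy sketch below is an added illustration under hypothetical assumptions: 2000 replications of samples of size 1000, homoskedastic errors, and a first-stage coefficient of 0.5 so that the squared correlation between Z and X is 0.2. To isolate the variance comparison, X is generated as exogenous here, so both estimators are centred on the true slope and the ratio of their sampling variances should be close to 1/0.2 = 5.

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n = 2000, 1000
pi = 0.5                                   # hypothetical first-stage coefficient

# Each row is one simulated sample
z = rng.normal(size=(reps, n))
x = pi * z + rng.normal(size=(reps, n))    # Var(X) = pi^2 + 1
u = rng.normal(size=(reps, n))             # independent of Z and X in this sketch
y = 1.0 + 2.0 * x + u

def demean(a):
    """Subtract each sample's own mean (row-wise)."""
    return a - a.mean(axis=1, keepdims=True)

z_dev, x_dev, y_dev = demean(z), demean(x), demean(y)
b_iv = (z_dev * y_dev).sum(axis=1) / (z_dev * x_dev).sum(axis=1)
b_ols = (x_dev * y_dev).sum(axis=1) / (x_dev * x_dev).sum(axis=1)

rho2 = pi**2 / (pi**2 + 1.0)               # squared Corr(Z, X) = 0.2
ratio = float(b_iv.var() / b_ols.var())    # Monte Carlo variance ratio

print(f"Var(IV)/Var(OLS) = {ratio:.2f}; 1/rho^2 = {1.0 / rho2:.2f}")
```

Lowering `pi` weakens the instrument and inflates the ratio further.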


TWO STAGE LEAST SQUARES (2SLS)


If there is more than one instrument, say $Z_1, \ldots, Z_p$, for the endogenous regressor, the most efficient estimator is the so-called Two-Stage Least Squares estimator (2SLS), which is obtained
from two OLS regressions:
1. First we regress the endogenous $X_i$ on all the instruments and obtain fitted values $\hat X_i$:
$$\hat X_i = \hat\pi_0 + \hat\pi_1 Z_{i1} + \ldots + \hat\pi_p Z_{ip}$$
2. Second we regress $Y_i$ on $\hat X_i$ and obtain
$$\hat\beta_{1,2SLS} = \frac{\sum_i (\hat X_i - \bar{\hat X})(Y_i - \bar Y)}{\sum_i (\hat X_i - \bar{\hat X})^2}$$

NOTE:
2SLS generalizes to more than one endogenous variable. In this case we need at least one
instrument per endogenous variable, i.e. we need at least as many instruments as endogenous
variables. This is the so-called order condition of identification.
IV and 2SLS are special cases of the so-called Generalized Method of Moments (GMM) estimators.
To do instrumental variables in STATA we use the command ivreg y varlist1 (varlist2 = varlist_IV),
where varlist1 is the list of exogenous variables in the regression, varlist2 is the list of endogenous variables and varlist_IV is the list of instrumental variables.
We may obtain the 2SLS estimates by manually doing the two regressions. However, the
correct standard errors for the 2SLS estimator are NOT the ones that STATA gives us from
the second stage OLS regression! For those we need to run the ivreg command!
We should always check whether the first stage regression produces a good fit, i.e. whether
the instruments are correlated with the endogenous variables. Otherwise we may have the so-called weak instruments problem, which translates into large standard errors for the IV/2SLS
estimates.
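
The two stages, and the warning about second-stage standard errors, can be sketched as follows. This Python/NumPy example is an added illustration with a hypothetical data-generating process (two instruments `z1` and `z2`, true slope 2). It runs both OLS stages by hand and then computes the slope's standard error twice: once from the second-stage residuals (the naive calculation) and once from residuals that use the actual regressor with the 2SLS coefficients, which is the correction that dedicated IV routines apply under homoskedasticity. In this particular design the naive standard error comes out too large; in other designs the direction can differ.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(7)
n = 100_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)            # two instruments
u = rng.normal(size=n)
x = 0.6 * z1 + 0.4 * z2 + 0.5 * u + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0 * x + u                                      # hypothetical true slope 2

const = np.ones(n)

# Stage 1: regress x on a constant and all instruments, keep fitted values
Z = np.column_stack([const, z1, z2])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on a constant and the fitted values
X_hat = np.column_stack([const, x_hat])
b_2sls = ols(X_hat, y)

# Naive residuals use x_hat; correct residuals use the actual x
res_naive = y - X_hat @ b_2sls
res_correct = y - np.column_stack([const, x]) @ b_2sls

sxx_hat = ((x_hat - x_hat.mean()) ** 2).sum()
se_naive = float(np.sqrt(res_naive @ res_naive / (n - 2) / sxx_hat))
se_correct = float(np.sqrt(res_correct @ res_correct / (n - 2) / sxx_hat))

print(f"2SLS slope {b_2sls[1]:.3f}; naive SE {se_naive:.4f}; corrected SE {se_correct:.4f}")
```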


EXAMPLE: Wage Equation


Using WAGE2.DTA (only the 857 observations that don't have meduc and/or sibs missing), OLS
yields (standard errors in parentheses):
$$\widehat{lwage} = \underset{(.116)}{5.45} + \underset{(.007)}{.08}\, educ + \underset{(.003)}{.022}\, exper$$

To take into account unobserved ability we add the variable IQ (an imperfect measure of
ability) and obtain:
$$\widehat{lwage} = \underset{(.126)}{5.14} + \underset{(.007)}{.059}\, educ + \underset{(.003)}{.022}\, exper + \underset{(.001)}{.0059}\, IQ$$

Using sibs as an instrument for the endogenous educ yields:
$$\widehat{lwage} = \underset{(.471)}{4.49} + \underset{(.029)}{.138}\, educ + \underset{(.007)}{.036}\, exper$$

The coefficient on education almost doubles. Observe also how much larger the standard
errors are compared to OLS!
Using meduc as an instrument for the endogenous educ yields:
$$\widehat{lwage} = \underset{(.367)}{4.32} + \underset{(.022)}{.149}\, educ + \underset{(.006)}{.038}\, exper$$

Using both sibs and meduc as instruments for educ, 2SLS yields:
$$\widehat{lwage} = \underset{(.327)}{4.37} + \underset{(.02)}{.146}\, educ + \underset{(.006)}{.037}\, exper$$

Using IQ as well hardly changes the results; IQ, however, is now statistically insignificant:
$$\widehat{lwage} = \underset{(.345)}{4.37} + \underset{(.035)}{.143}\, educ + \underset{(.007)}{.037}\, exper + \underset{(.002)}{.0005}\, IQ$$

STATA Program
clear all
use WAGE2
sum
keep if (sibs!=.) & (meduc!=.)
reg lwage educ exper
reg lwage educ exper IQ
ivreg lwage exper (educ=sibs)
ivreg lwage exper (educ=meduc)
ivreg lwage exper (educ=sibs meduc)
ivreg lwage exper IQ (educ=sibs meduc)
The last 2SLS estimates may be obtained via the following commands:
reg educ sibs meduc exper IQ
predict educhat
reg lwage educhat exper IQ
Note
The standard errors obtained in the second stage are different from those
obtained using the ivreg command (and they are wrong!).
The $R^2$ (= .4195) of the first stage regression (reg educ sibs meduc exper IQ) is large and all
instruments are statistically significant.
We need to regress educ on all the exogenous variables of the model, i.e. not only sibs and
meduc, but also exper and IQ.


Hausman Test of Endogeneity


Suppose we want to test for endogeneity of $X$ in the simple linear model
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Under the null, $X_i$ is exogenous, i.e.
$$Cov(X_i, u_i) = 0$$
whereas under the alternative it is endogenous:
$$Cov(X_i, u_i) \neq 0$$
We assume we have at least one valid instrument, $Z_i$.
The idea of Hausman's endogeneity test is to compare the OLS estimate of $\beta_1$ (which is
consistent and efficient under the null that there is no endogeneity, but is inconsistent under
the alternative that there is endogeneity) with the IV estimate (which is consistent whether
there are endogenous regressors or not, i.e. under both the null and the alternative, but is
inefficient under the null of no endogeneity). The test statistic is
$$H = \frac{(\hat\beta_{1,IV} - \hat\beta_{1,OLS})^2}{Var(\hat\beta_{1,IV} - \hat\beta_{1,OLS})}$$
and it is asymptotically distributed as $\chi^2(1)$. It turns out that
$$Var(\hat\beta_{1,IV} - \hat\beta_{1,OLS}) = Var(\hat\beta_{1,IV}) - Var(\hat\beta_{1,OLS}).$$
In the general case with $k$ regressors the test statistic is asymptotically distributed as $\chi^2(k)$.
In STATA we may perform Hausman's test as follows:
keep if (sibs!=.) & (meduc!=.)
reg lwage educ exper IQ
estimates store OLS
ivreg lwage exper IQ (educ=sibs meduc)
estimates store IV
hausman IV OLS

In our example, H = 5.85 (and is distributed as $\chi^2(3)$) with a p-value of 0.1193. We reject exogeneity
of education at significance levels of 12% and higher. If however we exclude IQ from the regressions
above, we obtain a test statistic of 12.45 (distributed as $\chi^2(2)$) and a p-value of 0.0020, which leads
to rejection of exogeneity of education at all conventional levels of significance.
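
The reported p-values, and the rough size of the statistics, can be reproduced by hand. The Python sketch below is an added illustration that uses only the standard library. `chi2_sf` is a hypothetical helper that implements the closed-form upper-tail probabilities of the chi-square distribution for 1, 2 and 3 degrees of freedom. `hausman_scalar` applies the single-coefficient formula for H to the rounded education estimates and standard errors reported in the wage example. Since the hausman command compares the full coefficient vectors, the scalar values only approximate the reported 12.45 and 5.85.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable (closed forms for dof 1, 2, 3 only)."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2))
    if dof == 2:
        return math.exp(-x / 2)
    if dof == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only dof 1, 2 and 3 are implemented in this sketch")

def hausman_scalar(b_iv, se_iv, b_ols, se_ols):
    """Scalar Hausman statistic, using Var(b_iv - b_ols) = Var(b_iv) - Var(b_ols)."""
    return (b_iv - b_ols) ** 2 / (se_iv ** 2 - se_ols ** 2)

# p-values for the statistics and degrees of freedom reported in the text
p_with_iq = chi2_sf(5.85, 3)
p_without_iq = chi2_sf(12.45, 2)

# Scalar approximations from the rounded educ coefficients and standard errors
h_without_iq = hausman_scalar(0.146, 0.02, 0.08, 0.007)
h_with_iq = hausman_scalar(0.143, 0.035, 0.059, 0.007)

print(f"p-values: {p_with_iq:.4f} (with IQ), {p_without_iq:.4f} (without IQ)")
print(f"scalar H: {h_with_iq:.2f} (with IQ), {h_without_iq:.2f} (without IQ)")
```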
