
Master in Financial Engineering

Econometrics
Prof. Loriano Mancini
Swiss Finance Institute at EPFL
First semester
Slides version: September 2011
Information about the course
Material: slides, exercises, data, etc., at
http://sfi.epfl.ch/mfe Courses online Mancini E
Username: StudMFE
Password: Fall2011
Also, register at http://is-academia.epfl.ch
Book: Econometric Analysis, sixth edition, W. Greene, Prentice Hall, 2008
Assignments: each week, due next Monday, groups max 3 persons
Exams: written, closed book, closed notes, one A4 page hand-written notes
Grade: 30% homework, 30% midterm, 40% final exam
Assistants: Benjamin Junge (E-mail: benjamin.junge@epfl.ch)
Emmanuel Leclercq (E-mail: emmanuel.leclercq@epfl.ch)
1
Information about the course
Exercise sessions start on October 3rd
i.e. no exercise session on September 26th
Prerequisites: W. Greene Econometric Analysis book
Appendix A on matrix algebra
Appendix B on probability and distributions
Appendix D on Laws of Large Numbers and Central Limit Theorems
2
Agenda of the course
Linear regression model
Generalized regression model
Panel data model
Instrumental variables
Generalized method of moments
Maximum likelihood estimation
Hypothesis testing
3
Chapter 2: Econometric model
Econometrics: intersection of Economics and Statistics
Econometric model = association between y_i and x_i
E.g.: stock return y_i (IBM) and market return x_i (S&P 500 index)
Econometric model provides an approximate description of the association
The relation will be stochastic and not deterministic
Econometric model provides a probabilistic description of the association
Model: y_i = f(x_i) + ε_i
4
Linear regression model
y_i = f(x_i1, . . . , x_iK) + ε_i = x_i1 β_1 + · · · + x_iK β_K + ε_i
y_i: dependent or explained variable
x_i: regressors or covariates or explanatory variables
ε_i: error term or random disturbance
Each observation in a sample {y_i, x_i1, . . . , x_iK}, i = 1, . . . , n, comes from
y_i = x_i1 β_1 + · · · + x_iK β_K (deterministic part) + ε_i (random part)
Goal: estimate β_1, . . . , β_K
5
Assumptions of the linear regression model
Assumptions on the data generating process
1. Linearity: linear relationship between y_i and x_i1, . . . , x_iK
2. Full rank: X = [x_1, . . . , x_K] is an n×K matrix with rank K
3. Exogeneity of the independent variables: E[ε_i | x_j1, . . . , x_jK] = 0, ∀ i, j
4. Homoscedasticity and nonautocorrelation: Var[ε_i | X] = σ², i = 1, . . . , n, and Cov[ε_i, ε_j | X] = 0, i ≠ j
5. Data generation: X can include constants and random variables
6. Normal distribution: ε | X ∼ N(0, σ²I)
Assumptions 4 and 6 simplify life but are too restrictive and will be relaxed
6
Linearity of the regression model
The same linear model holds for all n observations {y_i, x_i1, . . . , x_iK}, i = 1, . . . , n:
y = x_1 β_1 + · · · + x_K β_K + ε = Xβ + ε
Notation: y is an n×1 vector; X = [x_1, . . . , x_K] is an n×K matrix;
ε is an n×1 vector; β is a K×1 vector
In the design matrix X: columns are variables, rows are observations
E.g. for the i-th observation: y_i = x_i'β + ε_i
Remark: we are modeling E[y | X] = Xβ, as E[ε | X] = 0 by assumption
Linearity refers to β and ε, not to X
E.g. g(y_i) = h(x_i)'β + ε_i is a linear model for any functions g and h
7
Error term
By assumption E[ε | X] = 0 ⇒ E[ε] = 0
Note: ε_i does not depend on any x_j, neither past nor future x's
Let μ_X = E[X]. By the tower property, or law of iterated expectations,
Cov[ε, X] = E[ε(X − μ_X)] = E_X[E[ε(X − μ_X) | X]] = E_X[E[ε | X](X − μ_X)] = 0, since E[ε | X] = 0
E[ε | X] = 0 implies E[y | X] = Xβ, i.e. Xβ is the conditional mean of y | X
Our analysis is conditional on the design matrix X, which can be stochastic
8
Spherical error term
Assumptions:
Homoscedasticity: Var[ε_i | X] = σ², i = 1, . . . , n
Nonautocorrelation: Cov[ε_i, ε_j | X] = 0, i ≠ j
In short: E[εε' | X] = σ²I
9
Data generating process for the regressors
X may include constants and random variables
Golden rule: include a column of 1s in X
Crucial assumption: X is generated independently of ε (cf. the exogeneity assumption)
10
Chapter 3: Least squares
Regression model: y_i = x_i'β + ε_i
Goal: statistical inference on β, e.g. estimate β
Population quantities, not observed: E[y_i | x_i] = x_i'β, β, ε_i
Sample quantities, estimated from sample data: ŷ_i = x_i'b, b, e_i
11
Least squares estimator
Let b be the least squares estimator:
b = arg min_{b_0} S(b_0), where
S(b_0) := Σ_{i=1}^n (y_i − x_i'b_0)² = Σ_{i=1}^n e_i0² = e_0'e_0
        = (y − Xb_0)'(y − Xb_0)
        = y'y − 2y'Xb_0 + b_0'X'Xb_0
12
Least squares estimator: normal equations
Necessary condition for a minimum:
∂S(b_0)/∂b_0 = ∂(y'y − 2y'Xb_0 + b_0'X'Xb_0)/∂b_0 = −2X'y + 2X'Xb_0 = 0
Let b be the solution, normal equations:
X'Xb = X'y
By assumption X has full column rank, so
b = (X'X)⁻¹X'y
Since X has full column rank, the following matrix is positive definite:
∂²S(b_0)/∂b_0∂b_0' = 2X'X
13
Example: regression with simulated data
DGP: y_i = x_i'β + ε_i, with x_i' = [1 x_2i], x_2i ∼ U[0, 1], ε_i ∼ N(0, 1)
i = 1, . . . , 100, β = [1 2]', in this sample b = [1.01 2.07]'
[Figure: scatter of (x_2i, y_i) with the true and estimated regression lines]
14
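A minimal numerical sketch of this example (my addition, not part of the original slides): it simulates the DGP above and computes the LS estimate b = (X'X)⁻¹X'y via the normal equations. Seed and variable names are illustrative.

```python
import numpy as np

# Simulate the DGP of the example: y_i = beta_1 + beta_2 * x_2i + eps_i
rng = np.random.default_rng(0)
n = 100
beta = np.array([1.0, 2.0])
x2 = rng.uniform(0.0, 1.0, size=n)         # x_2i ~ U[0, 1]
eps = rng.standard_normal(n)               # eps_i ~ N(0, 1)
X = np.column_stack([np.ones(n), x2])      # design matrix with a column of 1s
y = X @ beta + eps

# Least squares via the normal equations: (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                              # residuals
print("b =", b)                            # should be close to [1, 2]
print("mean residual =", e.mean())         # ~0 because X contains a constant
```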
Algebraic aspects of the least squares solution
Golden rule: include a column of 1s in X
Normal equations: 0 = X'y − X'Xb = X'(y − Xb) = X'e
If the first column of X, x_1, is a column of 1s, then the first normal equation is
0 = x_1'e = [1 · · · 1]e = Σ_{i=1}^n e_i = Σ_{i=1}^n (y_i − x_i'b)
Implications:
1. Least squares residuals have zero mean, ē = 0
2. The estimated regression line passes through the means of the data, ȳ = x̄'b
3. The mean of the fitted values equals the mean of the actual data, ŷ̄ = ȳ
None of these implications holds if X does not include a column of 1s
15
Projection
Estimated residuals e = y − Xb, LS estimator b = (X'X)⁻¹X'y
e = y − Xb = y − X(X'X)⁻¹X'y = (I − X(X'X)⁻¹X')y = My
M is called the residual maker matrix as My = e
As MX = 0, e'ŷ = y'M'Xb = 0, so LS partitions y into two orthogonal parts:
y = Xb + e = ŷ + e
P is called the projection matrix:
ŷ = y − e = (I − M)y = X(X'X)⁻¹X'y = Py
16
Properties of M and P matrices
M and P are symmetric, idempotent and orthogonal (PM = MP = 0)
Orthogonal decomposition of y:
y = Xb + e = Py + My = projection + residual
Pythagorean theorem:
y'y = y'Py + y'My = ŷ'ŷ + e'e
17
Partitioned regression
E.g. regression model: income = β_0 + β_1 age + β_2 education + error
Goal: study the income–education association; age is a control variable
Model: y = Xβ + ε = X_1β_1 + X_2β_2 + ε, solve the normal equations for b_2:
b_2 = (X_2'M_1X_2)⁻¹X_2'M_1y
    = (X_2'M_1'M_1X_2)⁻¹X_2'M_1'M_1y
    = (X_2*'X_2*)⁻¹X_2*'y*
M_1 is the residual maker matrix based on the columns of X_1
b_2 is obtained regressing y* on X_2*
y* (resp. X_2*) are the residuals from the regression of y (resp. X_2) on X_1
18
Partial correlation
When education increases, income increases; but education and age both increase over time
What is the net effect of education on income?
Partial correlation, r*_yz:
y* = residuals in a regression of income on a constant and age
z* = residuals in a regression of education on a constant and age
r*_yz = simple correlation between y* and z*
19
Goodness of t
Goal: measure how much variation in y is explained by variation in x
Suppose ȳ = x̄ = 0. Recall ŷ_i ⊥ e_i, for each observation i:
y_i = ŷ_i + e_i
Σ_{i=1}^n y_i² = Σ_{i=1}^n ŷ_i² + Σ_{i=1}^n e_i²
SST = SSR + SSE
Good regression model: SST ≈ SSR, hence SSE ≈ 0
20
Goodness of fit (when means ≠ 0)
When ȳ, x̄ ≠ 0, consider deviations from the means:
y_i − ȳ = ŷ_i − ȳ + e_i = (x_i − x̄)'b + e_i
Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n ((x_i − x̄)'b)² + Σ_{i=1}^n e_i²
Define M⁰ = [I − ii'/n], n×n, symmetric, idempotent; i' = [1 · · · 1]
M⁰ transforms observations into deviations from sample means, M⁰y = y − iȳ
y'M⁰'M⁰y = b'X'M⁰'M⁰Xb + e'e
y'M⁰y = b'X'M⁰Xb + e'e
SST = SSR + SSE
21
Coefficient of determination, R²
R² = SSR/SST = b'X'M⁰Xb / (y'M⁰y) = 1 − e'e/(y'M⁰y)
Properties:
R² measures the linear association between X and y
0 ≤ R² ≤ 1, as 0 ≤ SSR ≤ SST
R² increases when a regressor is added, from X = [x_1 · · · x_K] to X̃ = [x_1 · · · x_K+1]
Adjusted R²: R̄² = 1 − [e'e/(n − K)] / [y'M⁰y/(n − 1)]
Remark: X should include a column of 1s ⇒ M⁰e = e and e ⊥ X
22
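A short sketch (my addition, not from the slides) computing R² and adjusted R² exactly as defined above, with the total sum of squares in deviation-from-mean form; variable names are illustrative.

```python
import numpy as np

def r_squared(y, X):
    """Return (R2, adjusted R2) for the LS fit of y on X; X must contain a column of 1s."""
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    sst = np.sum((y - y.mean()) ** 2)      # y'M0y
    sse = e @ e                            # e'e
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - K)) / (sst / (n - 1))
    return r2, r2_adj

# Example with simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(200)
print(r_squared(y, X))
```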
Chapter 4: Statistical properties of LS estimators
LS estimator enjoys various good statistical properties:
1. Easy to compute
2. Explicit use of model assumptions
3. Optimal linear predictor
4. Most efficient, under certain conditions
23
Orthogonality conditions
Assumptions: X stochastic or not, linear model, E[ε_i | X] = 0 ⇒
E[x_i ε_i] = E_x[E[x_i ε_i | X]] = E_x[x_i E[ε_i | X]] = 0 = E_x[x_i E[(y_i − x_i'β) | X]]
which implies the population orthogonality conditions:
E_x[E[x_i y_i | X]] = E_x[E[x_i x_i' | X]] β
E[x_i y_i] = E[x_i x_i'] β
The LS normal equations are the sample counterpart of the orthogonality conditions:
X'y = X'Xb
(1/n) Σ_{i=1}^n x_i y_i = (1/n) Σ_{i=1}^n x_i x_i' b
24
Optimal linear predictor
Goal: find the linear function of x_i, namely x_i'β, that minimizes the MSE
MSE = E[(y_i − x_i'β)²]
    = E[(y_i − E[y_i | X] + E[y_i | X] − x_i'β)²]
    = E[(y_i − E[y_i | X])²] + E[(E[y_i | X] − x_i'β)²]
min_β MSE = min_β E[(E[y_i | X] − x_i'β)²]
0 = −2E[x_i(E[y_i | X] − x_i'β)]
E[x_i y_i] = E[x_i x_i'] β
which are the LS normal equations
Implicit assumption: all these expectations exist, i.e. E[·] < ∞
25
Unbiased estimation
LS estimator is unbiased in every sample:
b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + ε) = β + (X'X)⁻¹X'ε
Using the law of iterated expectations and the assumption E[ε | X] = 0:
E[b] = E_X[E[β + (X'X)⁻¹X'ε | X]]
     = β + E_X[(X'X)⁻¹X' E[ε | X]]
     = β
26
Monte Carlo simulation: b_2 slope estimates
DGP: y_i = x_i'β + ε_i, with x_i' = [1 x_2i], x_2i ∼ U[0, 1], ε_i ∼ N(0, 1)
i = 1, . . . , 100, β = [1 2]', repeat simulation and estimation 1,000 times
[Figure: histogram of the 1,000 slope estimates b_2 (frequency vs. b_2)]
27
Variance of LS estimator
LS estimator is linear in ε: b = β + (X'X)⁻¹X'ε
Easy to derive the variance of a linear estimator:
Var[b | X] = E[(b − β)(b − β)' | X]
           = E[(X'X)⁻¹X'εε'X(X'X)⁻¹ | X]
           = (X'X)⁻¹X' E[εε' | X] X(X'X)⁻¹
           = (X'X)⁻¹X'(σ²I)X(X'X)⁻¹
           = σ²(X'X)⁻¹
Note: the assumption of spherical errors, Var[ε | X] = σ²I, is crucial
28
Gauss–Markov theorem
Any linear unbiased estimator: b_0 = Cy, where C is a K×n matrix
Unbiasedness: E[Cy | X] = E[CXβ + Cε | X] = CXβ ⇒ CX = I
Define C = D + (X'X)⁻¹X' ⇒ CX = I = DX + (X'X)⁻¹X'X = DX + I ⇒ DX = 0
Var[b_0 | X] = C Var[y | X] C' = C Var[ε | X] C' = σ²CC'
             = σ²(D + (X'X)⁻¹X')(D + (X'X)⁻¹X')'
             = σ²(D + (X'X)⁻¹X')(D' + X(X'X)⁻¹)
             = σ²DD' + σ²(X'X)⁻¹
             = σ²DD' + Var[b | X]
             = Var[b | X] + nonnegative definite matrix
LS estimator is BLUE (when X is constant and/or stochastic)
29
Estimating the variance of LS estimator
To estimate σ² in Var[b | X] = σ²(X'X)⁻¹, use e_i, the sample analog of ε_i
But e_i = y_i − x_i'b = ε_i − x_i'(b − β) is an imperfect estimate of ε_i
Sample residual: e = My = M(Xβ + ε) = Mε, as MX = 0
E[e'e | X] = E[ε'Mε | X] = E[tr(ε'Mε) | X] = E[tr(Mεε') | X]
           = tr(M E[εε' | X]) = tr(M)σ² = tr(I − X(X'X)⁻¹X')σ²
           = (tr(I_n) − tr(X'X(X'X)⁻¹))σ² = (tr(I_n) − tr(I_K))σ² = (n − K)σ²
Unbiased estimator of σ² (conditionally on X and unconditionally):
s² = e'e/(n − K) = Σ_{i=1}^n e_i²/(n − K)
30
Normality of LS estimator
Assumption ε | X ∼ N(0, σ²I), and linearity b = β + (X'X)⁻¹X'ε
⇒ joint normality of b (multivariate normal distribution):
b | X ∼ N(β, σ²(X'X)⁻¹)
and each slope, b_k, is normally distributed:
b_k | X ∼ N(β_k, σ²(X'X)⁻¹_kk)
Note: exact distribution in finite samples
31
Distribution of b_2 slope estimates: simulation
DGP: y_i = [1 x_2i][1 2]' + ε_i, with ε_i ∼ N(0, 1), 1,000 estimations
Comparison between the simulated and the true normal density of b_2
[Figure: simulated density of b_2 against the exact normal density]
32
Hypothesis testing on a coecient
As b | X ∼ N(β, σ²(X'X)⁻¹):
(b_k − β_k)/√(σ²(X'X)⁻¹_kk) ∼ N(0, 1)
Unfortunately σ² is not known but estimated via s². Useful statistic:
[(b_k − β_k)/√(σ²(X'X)⁻¹_kk)] / √([e'e/σ²]/(n − K)) ∼ N(0, 1)/√(χ²_(n−K)/(n − K)) ∼ t-Student_(n−K)
Note: σ² is unknown but cancels in the ratio above
Need to show: e'e/σ² ∼ χ²_(n−K) and e'e independent of b_k
33

χ² distribution of e'e
Recall: M is the residual maker matrix, e = My = Mε as MX = 0
As ε | X ∼ N(0, σ²I), ε/σ | X ∼ N(0, I)
e'e/σ² = (ε/σ)'M(ε/σ)
which is an idempotent quadratic form in ε/σ, and by Appendix B.11.4
e'e/σ² ∼ χ²_rank(M)
where rank(M) = tr(M) = n − K
34
Independence of b and e'e
To show independence between
b − β = (X'X)⁻¹X'ε = Lε ∼ N(0, σ²LL')
and
e'e/σ² = (ε/σ)'M(ε/σ)
it suffices to show that LM = 0, because this implies, conditional on X,
Cov[Lε, Mε] = E[Lε(Mε)'] = E[Lεε'M'] = L(σ²I)M' = σ²LM
            = σ²(X'X)⁻¹X'(I − X(X'X)⁻¹X') = 0
which implies independence as ε | X ∼ N
35
Signicance of a coecient: t-statistic
Common test H_0: β_k = 0
t_k = t-statistic = [(b_k − 0)/√((X'X)⁻¹_kk)] / √([e'e]/(n − K)) = b_k/√(s²(X'X)⁻¹_kk) ∼ t-Student_(n−K)
36
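The sketch below is my added illustration (it assumes SciPy is available): it computes s², the standard errors √(s²(X'X)⁻¹_kk), the t-statistics for H_0: β_k = 0, and two-sided p-values from the t_(n−K) distribution.

```python
import numpy as np
from scipy import stats

def ols_t_stats(y, X):
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (n - K)                      # unbiased estimate of sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))       # standard errors of b_k
    t = b / se                                # t-statistics for H0: beta_k = 0
    pval = 2 * stats.t.sf(np.abs(t), df=n - K)
    return b, se, t, pval

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(100)
b, se, t, p = ols_t_stats(y, X)
print(np.column_stack([b, se, t, p]))
```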
Example: Significance of a coefficient
True β = [1 2]', estimate b = [1.01 2.07]', n = 100, K = 2
Is b_2 statistically different from zero?
[Figure: left panel, data with true and estimated regression lines; right panel, density of b/(s²(X'X)⁻¹)^(1/2) with t-Student and normal densities]
37
Condence intervals for parameters
Point estimates are useless without confidence intervals or standard errors
Use the t-Student distribution
[(b_k − β_k)/√(σ²(X'X)⁻¹_kk)] / √([e'e/σ²]/(n − K)) ∼ t-Student_(n−K)
to set confidence intervals:
Pr(b_k − t_α/2 s_b_k ≤ β_k ≤ b_k + t_α/2 s_b_k) = 1 − α
where t_α/2 is the t-Student quantile, e.g. α = 0.05, and s_b_k = √(s²(X'X)⁻¹_kk)
38
Signicance of the regression
Common test H_0: β_2 = · · · = β_K = 0 (except the intercept)
or equivalently H_0: R² = 0
F-test statistic:
F[K − 1, n − K] = [R²/(K − 1)] / [(1 − R²)/(n − K)] ∼ [χ²_(K−1)/(K − 1)] / [χ²_(n−K)/(n − K)]
R² → 1 ⇒ large F ⇒ reject H_0
39
Marginal distribution of test statistics
Under H_0: β_k = β_k^0, and conditionally on X, b_k | X ∼ N(β_k^0, σ²(X'X)⁻¹_kk)
Unconditionally, b_k ∼ ?, hard to find, depends on the distribution of X
Key property: t-statistic
t_k = t | X = (b_k − β_k^0)/√(s²(X'X)⁻¹_kk) ∼ t-Student_(n−K)
but t-Student_(n−K) does not depend on X ⇒ unconditionally
t_k ∼ t-Student_(n−K)
40
Multicollinearity
Multicollinearity = variables in X are linearly dependent ⇒ X'X is singular
In practice, variables in X are often close to being linearly dependent
Symptoms of multicollinearity:
Small changes in the data produce large changes in b
Var[b | X] very large (⇒ t-statistics close to zero) but R² is high
Coefficient estimates with the wrong sign or implausible magnitude
41
Multicollinearity: analysis
Demeaned variables, X = [X_(k) x_k], where x_k (n×1) is the k-th variable
Use Appendix A.5.3 on the inverse of a partitioned matrix:
Var[b_k | X] = σ²(X'X)⁻¹_kk
  = σ²[x_k'x_k − x_k'X_(k)(X_(k)'X_(k))⁻¹X_(k)'x_k]⁻¹
  = σ²{x_k'x_k[1 − x_k'X_(k)(X_(k)'X_(k))⁻¹X_(k)'x_k/(x_k'x_k)]}⁻¹
  = σ²{x_k'x_k[1 − x_k'P_(k)x_k/(x_k'x_k)]}⁻¹
  = σ²{x_k'x_k[1 − x̂_k'x̂_k/(x_k'x_k)]}⁻¹
  = σ²{x_k'x_k[1 − R²_k·]}⁻¹
  = σ² / (x_k'x_k[1 − R²_k·])
42
Multicollinearity: interpretation
Hence, as the column variables in X are demeaned,
Var[b_k | X] = σ² / (x_k'x_k[1 − R²_k·]) = σ² / (Σ_{i=1}^n (x_ik − x̄_k)² [1 − R²_k·])
where R²_k· is the R² from the regression of x_k on X_(k) (i.e. X \ x_k)
Var[b_k | X] increases when
R²_k· → 1, i.e. multicollinearity
Σ_{i=1}^n (x_ik − x̄_k)² → 0
σ² increases, i.e. dispersion of y_i around the regression line
43
Large sample properties of LS estimator
ε | X ∼ N is a strong assumption and can be relaxed, but now
Assumption 5a (DGP of X):
(x_i, ε_i), i = 1, . . . , n, is a sequence of independent observations
plim X'X/n = Q, a positive definite matrix
Notation: plim = probability limit, i.e. convergence in probability
plim Z_n = Z stands for
lim_{n→∞} Pr(|Z_n − Z| > δ) = 0, ∀δ > 0
where Z can be either random or constant
44
Consistency of LS estimator
Consistency means plim b = β
Highly desirable property of any estimator
Recall: the distribution of ε | X is unknown
b = β + (X'X/n)⁻¹ X'ε/n
plim b = β + plim(X'X/n)⁻¹ plim X'ε/n = β + Q⁻¹ plim X'ε/n
If plim X'ε/n = 0, then b is consistent
45
Random term X

/n
E[
X

n
] = E
X
[E[
X

n
|X]] =
1
n
n

i=1
E
X
[x
i
=0
..
E[
i
|X]] = 0
Var[
X

n
] = E[Var[
X

n
|X]] + Var[
=0
..
E[
X

n
|X]]
= E[
1
n
2
X

E[

|X]X] =

2
n
E[
X

X
n
] =

2
n
Q
As E[X

/n] = 0 and lim


n
Var[X

/n] = 0,
X

n
m.s.
0 =plim
X

n
= 0
Remark: Var[X

/n] decays as 1/n


46
Example: convergence of X'ε/n
x_i ∼ U[−0.5, 0.5], σ² = 2, hence Var[X'ε/n] = (σ²/n) E[Σ_{i=1}^n x_i²/n]
[Figure: simulated densities of X'ε/n for n = 10, 50, 100, concentrating around 0]
47
Asymptotic distribution of LS estimator
Key idea: stabilize the distribution of X'ε/n
Recall: Var[X'ε/n] decays as 1/n
⇒ Var[√n X'ε/n] = n Var[X'ε/n] ∼ O(1)
√n(b − β) = (X'X/n)⁻¹ √n X'ε/n
≈ Q⁻¹ × asymptotic distribution of √n X'ε/n
48
Random term √n X'ε/n
Recall: E[√n X'ε/n] = 0; (x_i, ε_i) independent; regressors well behaved
Var[√n X'ε/n] = (1/n) Var[X'ε] = (1/n) Var[Σ_{i=1}^n x_i ε_i]
              = (1/n) Σ_{i=1}^n Var[x_i ε_i] = (1/n) Σ_{i=1}^n σ² E[x_i x_i'] = σ²Q
By the Central Limit Theorem:
√n X'ε/n →d N(0, σ²Q)
√n(b − β) →d Q⁻¹ N(0, σ²Q) =d N(0, σ²Q⁻¹)
b ∼a N(β, (σ²/n) Q⁻¹)
49
Asymptotic normality of LS estimator
If the regressors are well behaved and the observations independent, then
asymptotic normality of the LS estimator follows from the CLT, not from ε | X ∼ N
In practice, in
b ∼a N(β, (σ²/n) Q⁻¹)
Q is estimated by X'X/n
σ² is estimated by s² = e'e/(n − K) (as plim s² = σ²)
If ε | X ∼ N(0, σ²I), then b ∼ N(β, σ²(X'X)⁻¹) for every sample size n
50
Asymptotic dist. of a nonlinear function: Delta method
f(b): J possibly nonlinear C¹ functions; ∂f(b)/∂b' =: C(b) (J×K)
Goal: find the asymptotic distribution of f(b)
Slutsky theorem: plim f(b) = f(plim b) = f(β), and plim C(b) = C(β)
First order Taylor expansion (remainder negligible if plim b = β):
f(b) = f(β) + C(β)(b − β) + remainder
f(b) ∼a N(f(β), C(β) (σ²/n) Q⁻¹ C(β)')
51
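An illustrative sketch (my addition) of the delta method for a nonlinear function of the LS coefficients, here the ratio f(b) = b_2/b_1; the gradient C(b) is computed analytically and the variance is C(b) V̂ C(b)' with V̂ = s²(X'X)⁻¹. Names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
V = (e @ e / (n - X.shape[1])) * XtX_inv     # estimated Var[b|X]

# Nonlinear function f(b) = b_2 / b_1 and its gradient C(b) = df/db'
f = b[1] / b[0]
C = np.array([-b[1] / b[0] ** 2, 1.0 / b[0]])
var_f = C @ V @ C                            # delta-method variance of f(b)
print("f(b) =", f, "std. err. =", np.sqrt(var_f))
```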
t-Statistic: remark
To test H_0: β_k = 0, t-statistic t_k = b_k/√(s²(X'X)⁻¹_kk)
If in finite samples ε | X ∼ N, then t_k ∼ t-Student_(n−K)
If ε is only asymptotically normal (not in finite samples), then t_k →d N(0, 1)
[Figure: density of the t-statistic b/(s²(X'X)⁻¹)^(1/2) for n = 10, against the normal and t-Student densities]
52
Missing observations
Common issue in applied work
Missing at random: least serious case, just discard those observations, sample
size reduced
Not missing at random: most difficult case, selection bias, the missingness mechanism should
be studied
Read Chapter 4.8.2
53
Chapter 5: Inference
Goal: test implications of economic theory
Example: unrestricted model of investment, I_t:
ln I_t = β_1 + β_2 i_t + β_3 p_t + β_4 ln Y_t + β_5 t + ε_t
where i_t is the nominal interest rate, p_t the inflation rate, Y_t real output
H_0: investors care only about the real interest rate, (i_t − p_t)
⇒ restricted (or nested) model of investment:
ln I_t = β_1 + β_2(i_t − p_t) + β_4 ln Y_t + β_5 t + ε_t
⇔ β_3 = −β_2 ⇔ β_2 + β_3 = 0, in the unrestricted model
54
Linear restrictions
In the linear regression model, y = Xβ + ε, consider J linear restrictions:
Rβ = q
R is J×K and usually J < K
Example: β = (β_1 β_2 β_3 β_4)'
1. H_0: β_2 = 0 tested with R = (0 1 0 0) and q = 0
2. H_0: β_2 = β_3 = β_4 = 0 tested with
R = [0 1 0 0; 0 0 1 0; 0 0 0 1] and q = (0 0 0)'
55
Two approaches to testing hypothesis
1. Fit the unrestricted model and check whether the estimates satisfy the restrictions
2. Fit the restricted model and check the loss of fit (in terms of R²)
The two approaches are equivalent in the linear regression model
Working assumption: ε | X ∼ N(0, σ²I) (to be relaxed)
56
Approach 1: discrepancy vector
Null hypothesis: J linear restrictions, R is J×K
H_0: Rβ − q = 0
Alternative hypothesis:
H_1: Rβ − q ≠ 0
Discrepancy vector, m = Rb − q, will not be exactly zero (most likely)
Decide whether m is not exactly zero because of
(a) sampling variability (do not reject H_0)
(b) or restrictions not satisfied by the data (reject H_0)
57
Wald criterion
Under H_0: Rβ − q = 0, discrepancy vector m = Rb − q
E[m | X] = R E[b | X] − q = Rβ − q = 0
Var[m | X] = Var[Rb − q | X] = R Var[b | X] R' = σ²R(X'X)⁻¹R'
Recall, as ε | X ∼ N(0, σ²I) by assumption, b | X ∼ N(β, σ²(X'X)⁻¹)
⇒ m | X ∼ N(0, σ²R(X'X)⁻¹R')
Wald statistic:
W = m'(Var[m | X])⁻¹m
  = (Rb − q)'(σ²R(X'X)⁻¹R')⁻¹(Rb − q) ∼ χ²_(J)
χ² distribution of a full rank Gaussian quadratic form, Appendix B.11.6
58
Wald statistic feasible and F-statistic
In the Wald statistic, we need to get rid of the unknown σ²:
F = [(Rb − q)'(σ²R(X'X)⁻¹R')⁻¹(Rb − q)/J] / ([e'e/σ²]/(n − K)) ∼ [χ²_(J)/J] / [χ²_(n−K)/(n − K)] ∼ F_(J, n−K)
Numerator: under H_0, (Rb − q)/σ = R(b − β)/σ = R(X'X)⁻¹X'ε/σ,
i.e. a standardized Gaussian quadratic form in R(X'X)⁻¹X'ε/σ ∼ χ²_(J)
Denominator: standardized Gaussian quadratic form in Mε/σ ∼ χ²_(n−K)
As MX = 0, Cov(R(X'X)⁻¹X'ε/σ, Mε/σ) = 0 ⇒ numerator and denominator independent
59
Hypothesis testing on a single coecient
H_0: β_k = β^0 can be tested with the t-statistic
t := [(b_k − β^0)/√(σ²(X'X)⁻¹_kk)] / √([e'e/σ²]/(n − K)) ∼ N(0, 1)/√(χ²_(n−K)/(n − K)) ∼ t-Student_(n−K)
or with the linear restriction R = (0 · · · 0 1 0 · · · 0) and q = β^0:
F = [(b_k − β^0)(σ²(X'X)⁻¹_kk)⁻¹(b_k − β^0)] / ([e'e/σ²]/(n − K)) ∼ χ²_(1) / [χ²_(n−K)/(n − K)] ∼ F_(1, n−K)
As t² = F, the two tests are equivalent
60
Approach 2: restricted least squares
Fit of the restricted model cannot be better than that of the unrestricted model
Restricted LS:
b* = arg min_{b_0} (y − Xb_0)'(y − Xb_0) subject to Rb_0 = q
   = b − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rb − q)
(b* − b) = −(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rb − q)
e* residuals from the restricted model. Loss of fit due to the constraints:
e* = y − Xb* = y − Xb − X(b* − b) = e − X(b* − b)
e*'e* = e'e + (b* − b)'X'X(b* − b) ≥ e'e
e*'e* − e'e = (b* − b)'X'X(b* − b)
            = (Rb − q)'[R(X'X)⁻¹R']⁻¹(Rb − q)
61
Loss of t and F-statistic
F-statistic for H_0: Rβ = q:
F = [(Rb − q)'[R(X'X)⁻¹R']⁻¹(Rb − q)/J] / [e'e/(n − K)]
  = [(e*'e* − e'e)/J] / [e'e/(n − K)]
  = [(R² − R*²)/J] / [(1 − R²)/(n − K)] ∼ F_(J, n−K)
Recall R² = 1 − e'e/(y'M⁰y) and y'M⁰y does not depend on b. Similarly for R*².
Special case: overall significance of the regression
β_2 = . . . = β_K = 0 (except the intercept) ⇒ R*² = 0 with J = K − 1
62
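A small sketch (added here, assuming SciPy for the F p-value) of the loss-of-fit form of the F-test: fit the unrestricted and restricted models by LS and compare e*'e* with e'e as above; the example restriction is hypothetical.

```python
import numpy as np
from scipy import stats

def f_test_loss_of_fit(y, X_unrestricted, X_restricted, J):
    """F-test of J restrictions via restricted vs. unrestricted residual sums of squares."""
    n, K = X_unrestricted.shape
    def ssr(X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        return e @ e
    ee_u, ee_r = ssr(X_unrestricted), ssr(X_restricted)
    F = ((ee_r - ee_u) / J) / (ee_u / (n - K))
    return F, stats.f.sf(F, J, n - K)

# Example: H0: beta_3 = 0 (drop the third regressor), J = 1 restriction
rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.uniform(size=n)])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.standard_normal(n)
print(f_test_loss_of_fit(y, X, X[:, :2], J=1))
```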
Nonnormal disturbances and large sample tests
Drop the assumption ε | X ∼ N, which implies b | X ∼ N(β, σ²(X'X)⁻¹)
All previous tests hold asymptotically, when n → ∞
Key ingredient: asymptotic distribution of b
b ∼a N(β, (σ²/n) Q⁻¹), where Q = plim(X'X/n)
Recall √n(b − β) →d N(0, σ²Q⁻¹), from the CLT
plim s² = σ², where s² = e'e/(n − K)
63
Example: limiting distribution of Wald statistic
If √n(b − β) →d N(0, σ²Q⁻¹) and H_0: Rβ − q = 0, then
√n(Rb − q) = √n R(b − β) →d N(0, σ²RQ⁻¹R')
which implies
√n(Rb − q)'(σ²RQ⁻¹R')⁻¹√n(Rb − q) →d χ²_(J)
which has the same limiting distribution as W:
W = (Rb − q)'(s²R(X'X)⁻¹R')⁻¹(Rb − q) →d χ²_(J)
when plim s²(X'X/n)⁻¹ = σ²Q⁻¹. Note: in W all n's cancel
Remark: W is only approximately distributed as χ²_(J) in finite samples;
in practice n < ∞
64
Testing nonlinear restrictions
Test H_0: c(β) = q, where c is a J×1 vector of nonlinear functions
Apply the delta method: first order Taylor expansion of c:
c(β̂) ≈ c(β) + [∂c(β)/∂β'](β̂ − β)
Var[c(β̂)] ≈ [∂c(β)/∂β'] Var[β̂] [∂c(β)/∂β']'
In ∂c(β)/∂β' replace β by β̂
Wald statistic:
W = (c(β̂) − q)'(Var[c(β̂)])⁻¹(c(β̂) − q) →d χ²_(J)
65
Prediction
Prominent use of the regression model
{y⁰, x⁰} not in our sample, not observed. Predict y⁰ using
ŷ⁰ = E[y⁰ | x⁰, X] = x⁰'b
as y⁰ = x⁰'β + ε⁰, and assuming that x⁰ is known
Forecast error: e⁰ = y⁰ − ŷ⁰ = (β − b)'x⁰ + ε⁰
Prediction variance: Var[e⁰ | x⁰, X] = σ² + x⁰'[σ²(X'X)⁻¹]x⁰ > σ²
Prediction interval at the (1 − α) confidence level:
ŷ⁰ ± z_α/2 √(Var[e⁰ | x⁰, X])
where z_α/2 is the α/2-quantile of N(0, 1), e.g. α = 0.05
66
Prediction of y⁰ and x⁰
If x⁰ is known, prediction of y⁰:
E[y⁰ | x⁰, X] = x⁰'b
with Var[e⁰ | x⁰, X]
If x⁰ is not known and needs to be predicted too, prediction of y⁰:
E_{x⁰} E[y⁰ | x⁰, X] = E_{x⁰}[x⁰'b | X]
depends on the distribution of x⁰, usually unknown and computed by simulation,
with Var[e⁰ | X] > Var[e⁰ | x⁰, X]
67
Measure of predictive accuracy
Notation: y_i realized values, ŷ_i predicted values, n⁰ number of predictions
Not scale invariant:
Root mean square error (RMSE) = √(Σ_i (y_i − ŷ_i)²/n⁰)
Mean absolute error (MAE) = Σ_i |y_i − ŷ_i|/n⁰
Scale invariant:
Theil U statistic
U = √[ (Σ_i (y_i − ŷ_i)²/n⁰) / (Σ_i y_i²/n⁰) ]
68
Chapter 6: Functional form
Very general functional form of the regression model:
L independent variables: z_i = [z_1i · · · z_Li]
K linearly independent functions of z_i: f_1i(z_i), . . . , f_Ki(z_i)
g(y_i) observable function of y_i
usual assumptions on ε_i
The following model is still linear and can be estimated by LS:
g(y_i) = β_1 f_1i(z_i) + . . . + β_K f_Ki(z_i) + ε_i = β_1 x_1i + . . . + β_K x_Ki + ε_i
⇒ y_i = x_i'β + ε_i
69
Nonlinearity in variables
A linear model, e.g. y_i = β_1 + x_iβ_2 + w_iβ_3 + ε_i, is typically enriched with
dummy variables
nonlinear functions of regressors (e.g. quadratic terms)
interaction terms (i.e. cross products)
y_i = β_1 + x_iβ_2 + w_iβ_3 + β_4 d_i + β_5 x_i² + β_6 x_i w_i + ε_i = x_i'β + ε_i
where x_i' = [1 x_i w_i d_i x_i² x_i w_i] and the dummy variable
d_i = 1 if i ∈ D, 0 otherwise
70
Dummy variable
Easy to use: one dummy variable is one more column in X
Used to study various effects (treatment, grouping, seasonality, thresholds, etc.)
y_i = β_1 + x_2iβ_2 + d_iβ_3 + ε_i = (β_1 + d_iβ_3) + x_2iβ_2 + ε_i = x_i'β + ε_i
where
d_i = 1 if i ∈ D, 0 otherwise
In this model the dummy variable shifts the intercept: β_1 → (β_1 + β_3)
71
Example: regression with dummy variable
y_i = x_i'β + ε_i, x_i' = [1 x_2i d_i], x_2i ∼ U[0, 1], d_i = 1_{x_2i > 0.5}, ε_i ∼ N(0, 1)
i = 1, . . . , 100, β = [1 2 2]', in this sample b = [0.99 2.13 1.96]'
[Figure: scatter of (x_2i, y_i) with the true and estimated regression lines, showing the intercept shift at x_2i = 0.5]
72
Structural break
The previous graph shows a structural break in the model:
y_i = β_1 + x_2iβ_2 + ε_i,           x_2i ≤ 0.5
y_i = (β_1 + β_3) + x_2iβ_2 + ε_i,   x_2i > 0.5
Structural change can be tested with an F-test
Note: the break point is supposed to be known a priori
73
Testing for a structural break
Split the sample into two parts, according to the potential structural break:
n_b observations on y_b and X_b (n_b × k) before the potential structural break
n_a observations on y_a and X_a (n_a × k) after the potential structural break
Unrestricted model allows for a potential structural break, β_b ≠ β_a:
[y_b; y_a] = [X_b 0; 0 X_a][β_b; β_a] + [ε_b; ε_a]
Restricted model, no structural break, β = [β_b' β_a']' with β_b = β_a:
β_b − β_a = 0 ⇔ [I_k  −I_k] β = Rβ = 0
74
F-test for a structural break
H_0: Rβ = q, with q = 0, R = [I_k  −I_k], dim(R) = k×2k, dim(β) = 2k×1
F = [(Rb − q)'[R(X'X)⁻¹R']⁻¹(Rb − q)/J] / [e'e/(n − K)] ∼ F_(J, n−K)
where
J = k = number of restrictions = number of rows of R
n − K = (n_b + n_a) − 2k = total number of observations minus dim(β)
Alternative ways exist to test for a structural break (e.g. the Wald statistic)
Typical issue: limited sample sizes before, n_b, and/or after, n_a, the break
75
Chapter 7: Specication analysis
Implicit assumption: the model y = Xβ + ε is correct
Common model misspecifications:
Omission of relevant variables
Inclusion of superfluous variables
76
Omitted relevant variables
True regression model: y = X_1β_1 + X_2β_2 + ε
Wrong regression model used: y = X_1β_1 + ε
Regress y on X_1 only:
b_1 = (X_1'X_1)⁻¹X_1'y = (X_1'X_1)⁻¹X_1'(X_1β_1 + X_2β_2 + ε)
    = β_1 + (X_1'X_1)⁻¹X_1'X_2β_2 + (X_1'X_1)⁻¹X_1'ε
E[b_1 | X] = β_1 + (X_1'X_1)⁻¹X_1'X_2β_2
Unless X_1'X_2 = 0 or β_2 = 0:
E[b_1 | X] ≠ β_1, i.e. b_1 is biased
plim(b_1) ≠ β_1, i.e. b_1 is inconsistent
Inference procedures (t-test, F-test, etc.) are invalid
77
Inclusion of superuous variables
True regression model: y = X_1β_1 + ε
Wrong regression model used: y = X_1β_1 + X_2β_2 + ε
Rewrite y = X_1β_1 + X_2β_2 + ε = Xβ + ε
where X = [X_1 X_2] and β' = [β_1' β_2'] = [β_1' 0']
The model used is not wrong per se; simply β_2 = 0
Regress y on X: the LS estimator is an unbiased estimator of β:
E[b | X] = β = [β_1; β_2] = [β_1; 0]
Price to pay for not using the information β_2 = 0: reduced precision of the estimates:
Var[b | X] ≥ Var[b_1 | X]
78
Model building
Simple-to-general:
not a good strategy, omitted variables induce biased and inconsistent
estimates
General-to-simple:
better strategy, computing power is cheap, but variable selection is a difficult
task
79
Choosing between nonnested models
The F-test of H_0: Rβ = q is only for nested models
Rβ = q represents (linear) restrictions on the model y = Xβ + ε
Various nonnested hypotheses can be of interest
e.g. choosing between linear and loglinear functional forms:
y_i = β_1 + x_iβ_2 + ε_i  or  log(y_i) = β_1 + log(x_i)β_2 + ε_i
Typically, these tests are based on the likelihood function
80
Likelihood function: digression
Probability theory : given population model, what is the probability of
observing that sample?
Inference procedure : given that sample, what is the population model?
Likelihood function = probability of observing that sample as a function of
model parameters
81
Likelihood function: simple example
Fair coin, {H, T}, Pr(toss = T) = p_0 = 0.5
Goal: estimate p_0 (unknown to us)
Observed sample: n = 60 tosses, total T = k = 28
L(p) = C(n, k) p^k (1 − p)^(n−k) = C(60, 28) p^28 (1 − p)^32
[Figure: likelihood L(p) as a function of p, maximized near p = 28/60]
82
Choosing between nonnested models: Vuong's test
Goal: choose between two nonnested models
No model is favored, as in classical hypothesis testing
Models can both be wrong: choose the least misspecified
Assumption: observations are independent (conditionally on regressors)
True model: y_i ∼ h(y_i), density with parameter θ
Model 0: y_i ∼ f(y_i), density with parameter β
Model 1: y_i ∼ g(y_i), density with parameter γ
KLIC_0 = E[(ln h(y_i) − ln f(y_i)) | h is true] ≥ 0
KLIC_0 = distance between model h (true) and f in terms of log-likelihood
83
Vuong's statistic
Decision criterion: model 1 is better than model 0 if KLIC_1 < KLIC_0
KLIC_1 − KLIC_0 = E[(ln f(y_i) − ln g(y_i)) | h is true]
               ≈ (1/n) Σ_{i=1}^n (ln f(y_i) − ln g(y_i)) = Σ_{i=1}^n m_i/n
Vuong's statistic:
V = √n (Σ_{i=1}^n m_i/n) / √(Σ_{i=1}^n (m_i − m̄)²/n)
V →d N(0, 1) when models 0 and 1 are equivalent
V →a.s. +∞ when model 0, f(y_i), is better
V →a.s. −∞ when model 1, g(y_i), is better
84
Vuongs test: application to linear models
Assume ε ∼ N(0, σ²)
Model 0: y_i ∼ f(y_i), with y_i = x_i'β + ε_0i
Model 1: y_i ∼ g(y_i), with y_i = x_i'γ + ε_1i
f(y_i) = (2πσ²)^(−1/2) exp(−0.5 (y_i − x_i'β)²/σ²)
ln f(y_i) = −½ ln(2πσ²) − ½ (y_i − x_i'β)²/σ² = −½ [ln 2π + ln σ² + ε_0i²/σ²]
ln f̂(y_i) − ln ĝ(y_i) = {−½ [ln(e_0'e_0/n) + e_0i²/(e_0'e_0/n)]} − {−½ [ln(e_1'e_1/n) + e_1i²/(e_1'e_1/n)]}
85
Model selection criteria
Various criteria have been proposed
Adjusted R²:
R̄² = 1 − [e'e/(n − K)] / [Σ_{i=1}^n (y_i − ȳ)²/(n − 1)]
Akaike Information Criterion:
ln AIC(K) = ln(e'e/n) + 2K/n
Bayesian Information Criterion:
ln BIC(K) = ln(e'e/n) + K ln n/n
86
Chapter 8: Generalized regression model
Spherical errors, E[εε' | X] = σ²I, is a restrictive assumption
Allow for heteroscedasticity, σ_i² ≠ σ_j², and autocorrelation, σ_ij ≠ 0, ∀ i, j:
E[εε' | X] = σ²Ω = Σ = [σ_1² σ_12 · · · σ_1n; σ_12 σ_2² · · · σ_2n; . . . ; σ_1n · · · σ_n²]
Total number of parameters in Σ: n + (n² − n)/2 = n(n + 1)/2 ≥ n
E.g., n = 100 ⇒ n(n + 1)/2 = 5,050 ⇒ too many!
Need to impose structure on Σ
87
Heteroscedasticity: asset returns and stochastic volatility
S&P 500 daily returns, 1999–2003, and asymmetric GARCH volatility
[Figure: top panel, daily returns in %; bottom panel, fitted volatility]
88
Least square estimator
When Var[ε | X] = σ²Ω,
the LS estimator, b = β + (X'X)⁻¹X'ε, still has good properties:
unbiased, consistent, and asymptotically normal
E[b | X] = β
Var[b | X] = (X'X)⁻¹X' Var[ε | X] X(X'X)⁻¹ = (σ²/n)(X'X/n)⁻¹(X'ΩX/n)(X'X/n)⁻¹
If plim(X'X/n) and plim(X'ΩX/n) are positive definite, plim b = β
√n(b − β) = (X'X/n)⁻¹ √n X'ε/n →d Q⁻¹ N(0, σ² plim X'ΩX/n)
89
Generalized least square estimator
Var[ε | X] = σ²Ω, assume Ω is known; decompose Ω = CΛC'
Ω⁻¹ = CΛ^(−1/2)Λ^(−1/2)C' = P'P, where Λ = diag(λ_1, . . . , λ_n), C'C = I
Transformed model: Py = PXβ + Pε ⇒ Var[Pε | X] = σ²PΩP' = σ²I
β̂ = (X'P'PX)⁻¹X'P'Py
  = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
  = arg min_{β_0} (y − Xβ_0)'Ω⁻¹(y − Xβ_0)
Heteroscedasticity case: Ω = diag(w_1, . . . , w_n)
β̂ = arg min_{β_0} Σ_{i=1}^n (y_i − x_i'β_0)²/w_i
Recall: OLS case Ω = I
90
GLS ecient estimator
In the classical model, y = Xβ + ε, where Var[ε | X] = σ²I:
OLS is the minimum variance, BLUE, estimator
In the transformed model, Py = PXβ + Pε, where Var[Pε | X] = σ²I:
GLS estimator = OLS in the transformed model
⇒ the GLS estimator is efficient (not OLS)
91
Feasible generalized least square estimator (FGLS)
Var[ε | X] contains n(n + 1)/2 parameters: impossible to estimate them all
Var[ε | X] = σ²Ω parameterized with few unknown parameters θ
E.g. Time series: Ω_ij = ρ^|i−j|, where |ρ| < 1
E.g. Heteroscedasticity: Ω_ii = exp(z_i'θ)
FGLS estimator relies on Ω̂ = Ω(θ̂):
β̂ = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y
Key result: when n → ∞, β̂(θ̂) behaves like β̂(θ)
using any consistent (not necessarily efficient) estimator θ̂ of θ
92
Heteroscedasticity
Var[ε | X] = σ²Ω = σ² diag(w_1, . . . , w_n)
Scaling: tr(σ²Ω) = Σ_{i=1}^n σ_i² = σ² Σ_{i=1}^n w_i = σ²n ⇒ σ² = Σ_{i=1}^n σ_i²/n
Interpretation: w_i positive weight
When the form of heteroscedasticity is
known: parameterize and estimate Ω, then FGLS
unknown: OLS can still be applied, but Var[b | X]?
93
Estimating Var[b|X] under unknown heteroscedasticity
White's heteroscedasticity consistent estimator:
Var[b | X] = (σ²/n)(X'X/n)⁻¹(X'ΩX/n)(X'X/n)⁻¹
           = (1/n)(X'X/n)⁻¹ [(1/n) Σ_{i=1}^n σ_i² x_i x_i'] (X'X/n)⁻¹
           ≈ (1/n)(X'X/n)⁻¹ [(1/n) Σ_{i=1}^n e_i² x_i x_i'] (X'X/n)⁻¹
Proof sketch: as σ_i² x_i x_i' = E[ε_i² x_i x_i' | x_i],
plim (1/n) Σ_{i=1}^n σ_i² x_i x_i' = plim (1/n) Σ_{i=1}^n ε_i² x_i x_i' = plim (1/n) Σ_{i=1}^n e_i² x_i x_i'
Remark: the equalities above hold in plim; X'ΩX/n itself is never estimated
94
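A compact sketch (my addition) of White's heteroscedasticity-consistent covariance estimator in the sandwich form on the slide, with "meat" Σ e_i² x_i x_i'; data and names are illustrative.

```python
import numpy as np

def white_cov(y, X):
    """White heteroscedasticity-consistent estimate of Var[b|X] (sandwich form)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = (X * e[:, None] ** 2).T @ X           # sum_i e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv                 # (X'X)^{-1} meat (X'X)^{-1}
    return b, V

rng = np.random.default_rng(5)
n = 500
x2 = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x2])
eps = rng.standard_normal(n) * (0.5 + 2.0 * x2)  # heteroscedastic errors
y = X @ np.array([1.0, 2.0]) + eps
b, V = white_cov(y, X)
print("b =", b, "robust s.e. =", np.sqrt(np.diag(V)))
```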
Test for heteroscedasticity: BreuschPagan test
Form of heteroscedasticity: σ_i² = σ² f(α_0 + α'z_i)
Note: the functional form f does not need to be specified
H_0: α = 0, i.e. homoscedasticity
Under H_0, E[ε_i²/(σ² f(α_0)) − 1] = 0 and it does not depend on z_i
Regress g_i := e_i²/(e'e/n) − 1 on Z_i' := [1 z_i'] (1×k), i = 1, . . . , n
calculate b = (Z'Z)⁻¹Z'g and ĝ = Zb
Under H_0, test statistic:
½ ĝ'ĝ = ½ g'Z(Z'Z)⁻¹Z'g →d χ²_(k−1)
95
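A minimal sketch (added, not in the slides) of the Breusch–Pagan statistic exactly as defined above: regress g_i = e_i²/(e'e/n) − 1 on [1 z_i'] and compute ½ ĝ'ĝ, asymptotically χ²_(k−1) under homoscedasticity; SciPy supplies the p-value.

```python
import numpy as np
from scipy import stats

def breusch_pagan(e, Z):
    """Breusch-Pagan LM statistic; Z must include a column of 1s (n x k)."""
    n, k = Z.shape
    g = e ** 2 / (e @ e / n) - 1.0
    bz = np.linalg.lstsq(Z, g, rcond=None)[0]
    g_hat = Z @ bz
    lm = 0.5 * g_hat @ g_hat                     # (1/2) g'Z(Z'Z)^{-1}Z'g
    return lm, stats.chi2.sf(lm, k - 1)

# Example: heteroscedasticity driven by z = x2
rng = np.random.default_rng(6)
n = 500
x2 = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x2])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n) * (0.5 + 2.0 * x2)
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(breusch_pagan(e, X))                       # here Z = [1 x2]
```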
Multiplicative heteroscedasticity: example
Goal: explain firms' profits, y_i, i = 1, . . . , n
Model: y_i = x_i'β + ε_i, where Var[ε_i | X] = σ² exp(z_i'γ) (Harvey's model)
Step 1: regress y_i on x_i using OLS and compute e_i
Step 2: regress log(e_i²) on [1 z_i'] using OLS to estimate σ² (biased) and γ
Step 3: regress y_i on x_i using FGLS with Ω̂_ii = exp(z_i'γ̂) to estimate β
LS applied twice to the model y_i = x_i'β + ε_i: two-stage least squares
Remark: the LS estimate of σ² is biased (but this is not important for FGLS) because
E log ε_i² < log E ε_i² = log σ_i² = log σ² + z_i'γ
E log ε_i² = −c + log σ_i², where c > 0
log e_i² = −c + log σ² + z_i'γ + ν_i, where ν_i is an error term
96
Chapter 9: Panel data models
Time series: y_it, t = 1, . . . , T
Cross section: y_it, i = 1, . . . , n
Panel or longitudinal: y_it, i = 1, . . . , n, t = 1, . . . , T, with n ≫ T
[Schematic: the panel collects y_it for all individuals i = 1, . . . , n over periods t = 1, . . . , T]
97
Why panel data model
Rich panel databases are available, e.g. labor market, industrial sectors
Certain phenomena can be studied only with panel data models
E.g. analysis of a production function:
technological change (over time) and
economies of scale (across firms of different sizes)
98
General framework for panel data model
Typically n ≫ T
y_it = x_it'β + z_i'α + ε_it = x_it'β + c_i + ε_it
x_it: K×1, without constant term
z_i: individual specific variables, observed or unobserved, including a constant term
c_i: individual effect, often unobserved and stochastic, e.g. health, ability
Goal: estimate partial effects β = ∂E[y_it | x_it]/∂x_it and E[c_i | x_i1, x_i2, . . .]
Note: if z_i observed ∀ i ⇒ linear model estimated by LS
99
Modeling frameworks
Panel data model: y_it = x_it'β + c_i + ε_it
1. Pooled model: c_i = α constant term. Use OLS to estimate α, β
2. Fixed effects: c_i unobserved and correlated with x_it: E[c_i | X_i] = α_i
y_it = x_it'β + α_i + ε_it + (c_i − α_i)
Regressing y_it on x_it omits variables: LS biased, inconsistent estimate of β
3. Random effects: c_i unobserved and uncorrelated with x_it: E[c_i | X_i] = α
y_it = x_it'β + α + ε_it + (c_i − α)
Regressing y_it on x_it and a constant: OLS consistent, inefficient estimate of α, β
100
Pooled model
Assumption: c_i = α constant term
y_it = x_it'β + c_i + ε_it = x_it'β + α + ε_it
E[ε_it | X_i] = 0
Var[ε_it | X_i] = σ_ε²
Cov[ε_it, ε_js | X_i, X_j] = 0, if i ≠ j or t ≠ s
If the assumptions of the linear regression model are met: OLS unbiased and efficient
But this is hardly the case
101
LS estimation of pooled model
Pooled model: y_it = x_it'β + c_i + ε_it = x_it'β + α + ε_it
If FE is the true model, Cov[c_i, x_it] ≠ 0: LS is inconsistent (omitted variables)
If RE is the true model, Cov[c_i, x_it] = 0: LS is consistent but inefficient
In the RE model:
y_it = x_it'β + c_i + ε_it = x_it'β + E[c_i | X_i] + (c_i − E[c_i | X_i]) + ε_it
     = x_it'β + α + u_i + ε_it = x_it'β + α + w_it
Autocorrelation (within group i): Cov[w_it, w_is] = σ_u² ≠ 0, t ≠ s
102
Pooled regression with random eects
RE model: y_it = x_it'β + α + u_i + ε_it. Stack the T_i observations of individual i:
y_i = [i_i  x_i][α; β] + (ε_i + i_i u_i) = X_iθ + w_i
Shocks, w_i, are heteroscedastic (across individuals) and autocorrelated:
Var[w_i] = Var[ε_i + i_i u_i] = σ_ε² I_{T_i} + σ_u² i_i i_i' = Ω_i
Recall: i = 1, . . . , n, and the goal is to estimate θ
103
LS pooled regression with random eects
Stack all observations for all individuals, (T_1 + . . . + T_n):
b = (X'X)⁻¹X'y = θ + [(1/n) Σ_{i=1}^n X_i'X_i]⁻¹ (1/n) Σ_{i=1}^n X_i'w_i →p θ
Asy.Var[b] = (1/n) plim[(1/n) Σ_{i=1}^n X_i'X_i]⁻¹ plim[(1/n) Σ_{i=1}^n X_i'w_i w_i'X_i] plim[(1/n) Σ_{i=1}^n X_i'X_i]⁻¹
LS is consistent; Asy.Var[b] is called the robust covariance matrix
If the data are well behaved,
plim[(1/n) Σ_{i=1}^n X_i'X_i] and plim[(1/n) Σ_{i=1}^n X_i'w_i w_i'X_i]
are positive definite,
but the second matrix needs to be estimated
104
Estimating center matrix in Asy.Var[b]
Use White's approach (not White's heteroscedasticity estimator):
plim[(1/n) Σ_{i=1}^n X_i'w_i w_i'X_i] = plim[(1/n) Σ_{i=1}^n X_i'Ω_i X_i]
= plim{(1/n) Σ_{i=1}^n [(Σ_{t=1}^{T_i} x_it w_it)(Σ_{t=1}^{T_i} x_it w_it)']}
≠ plim{(1/n) Σ_{i=1}^n [Σ_{t=1}^{T_i} w_it² x_it x_it']}
Correlations across observations (not heteroscedasticity) contribute most to Asy.Var[b]
105
Pooled regression: group means estimator
To estimate θ use the n group means, e.g. for {y_it, t = 1, . . . , T_i}:
(1/T_i) Σ_{t=1}^{T_i} y_it = (1/T_i) i_i'y_i = ȳ_i.
Averaging eliminates the time series dimension of the panel (⇒ cross section):
y_i = X_iθ + w_i
(1/T_i) i_i'y_i = (1/T_i) i_i'X_iθ + (1/T_i) i_i'w_i
ȳ_i. = x̄_i.'θ + w̄_i.
In the pooled model w̄_i. = ε̄_i.; in the RE model w̄_i. = ε̄_i. + u_i ⇒ heteroscedastic
Sample data {(ȳ_i., x̄_i.), i = 1, . . . , n}
Estimation: LS for θ and White's heteroscedasticity estimator for Asy.Var[b]
106
Pooled regression: rst dierence estimator
General panel data model: y_i,t = x_i,t'β + c_i + ε_i,t, where
c_i is correlated (fixed effects) or uncorrelated (random effects) with x_i,t
y_i,t − y_i,t−1 = (x_i,t − x_i,t−1)'β + ε_i,t − ε_i,t−1
Δy_i,t = (Δx_i,t)'β + u_i,t
Advantage: first differencing removes all individual specific heterogeneity c_i
Disadvantage: first differencing also removes all time-invariant variables
u_i,t: moving average (MA), tridiagonal covariance matrix, two-stage GLS
107
Fixed eects model
Assumption: unobservable individual effect, c_i, correlated with x_it
y_it = x_it'β + c_i + ε_it = x_it'β + E[c_i | X_i] + (c_i − E[c_i | X_i]) + ε_it
     = x_it'β + h(X_i) + δ_i + ε_it
     = x_it'β + α_i + ε_it    (with α_i = h(X_i) and δ_i absorbed in the error)
Further assumption: Var[c_i | X_i] = Var[δ_i | X_i] is constant
In general: Cov[ε̃_it, ε̃_is | X_i] = E[(δ_i + ε_it)(δ_i + ε_is) | X_i] = E[δ_i² | X_i] ≠ 0
Assumption: Var[ε_i | X_i] = σ_ε² I_{T_i} ⇒ classical regression model
Parameters to estimate (K + n): [β_1 · · · β_K]' and α_i, i = 1, . . . , n
108
Fixed eects model: drawback
Time invariant variables in x_it are absorbed in α_i
x_it' = [x_1,it'  x_2,i'] time varying and time invariant variables
y_it = x_it'β + α_i + ε_it = x_1,it'β_1 + x_2,i'β_2 + α_i + ε_it
     = x_1,it'β_1 + α̃_i + ε_it
⇒ β_2 cannot be estimated (not identified)
109
Fixed eects model: Least Squares Dummy Variable
Recall i = (T×1) column of ones. Stack the T observations of individual i:
y_i = X_iβ + i α_i + ε_i
Stack all regression models for the n individuals, LSDV model:
[y_1; . . . ; y_n] = [X_1; . . . ; X_n]β + [i 0 · · · 0; . . . ; 0 · · · 0 i][α_1; . . . ; α_n] + [ε_1; . . . ; ε_n]
y = [X d_1 · · · d_n][β; α] + ε
  = Xβ + Dα + ε
110
Fixed eects model: least squares estimation
Model for nT observations: y = Xβ + Dα + ε, interest on β
Partitioned regression, M_D y on M_D X, reduces the size of the computation:
b = [X'M_D X]⁻¹X'M_D y
Asy.Var[b] = s²[X'M_D X]⁻¹
Individual effect, α_i, estimated using only the T observations on individual i:
a_i = ȳ_i. − x̄_i.'b = (1/T) Σ_{t=1}^T (α_i + x_it'β + ε_it) − x̄_i.'b
a_i − α_i = (1/T) Σ_{t=1}^T ε_it + (1/T) Σ_{t=1}^T x_it'(β − b) = ε̄_i. + x̄_i.'(β − b)
Asy.Var[a_i] = σ_ε²/T + x̄_i.' Asy.Var[b] x̄_i. → σ_ε²/T > 0, when n → ∞
111
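A sketch (my addition) of the within (fixed effects) estimator, numerically identical to LSDV for β: demean y and X within each group, run LS on the demeaned data, and recover a_i = ȳ_i. − x̄_i.'b. Data and names are illustrative.

```python
import numpy as np

def within_estimator(y, X, groups):
    """Fixed-effects (within) estimator: LS on within-group demeaned data."""
    ids = np.unique(groups)
    y_dm, X_dm = y.copy(), X.copy()
    for g in ids:                                # demean within each group i
        m = groups == g
        y_dm[m] -= y[m].mean()
        X_dm[m] -= X[m].mean(axis=0)
    b = np.linalg.lstsq(X_dm, y_dm, rcond=None)[0]
    # individual effects: a_i = ybar_i - xbar_i'b
    a = np.array([y[groups == g].mean() - X[groups == g].mean(axis=0) @ b for g in ids])
    return b, a

# Example: 50 individuals, T = 5, true beta = 1.5, alpha_i correlated with x
rng = np.random.default_rng(7)
n_ind, T = 50, 5
alpha = rng.standard_normal(n_ind)
groups = np.repeat(np.arange(n_ind), T)
x = alpha[groups] + rng.standard_normal(n_ind * T)   # x correlated with alpha_i
y = 1.5 * x + alpha[groups] + rng.standard_normal(n_ind * T)
b, a = within_estimator(y, x[:, None], groups)
print("beta_hat =", b)
```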
Testing dierences across groups
Null hypothesis H_0: α_1 = · · · = α_n, i.e.
α_1 − α_2 = 0, α_2 − α_3 = 0, . . . , α_{n−1} − α_n = 0
[1 −1 0 · · · 0; 0 1 −1 · · · 0; . . . ; 0 · · · 0 1 −1][α_1; . . . ; α_n] = Rα = 0
that is, J = n − 1 restrictions on α.
F-statistic: compare the unrestricted R² vs. the restricted R²:
F[n − 1, nT − K − n] = [(R²_LSDV − R²_Pooled)/(n − 1)] / [(1 − R²_LSDV)/(nT − K − n)]
112
Random eects model
Assumption: unobservable individual effect, c_i, uncorrelated with x_it
y_it = x_it'β + c_i + ε_it = x_it'β + E[c_i] + (c_i − E[c_i]) + ε_it
     = x_it'β + α + u_i + ε_it = x_it'β + α + η_it
For the T observations on individual i:
Var[η_i] = Var[ε_i + i_T u_i] = σ_ε² I_T + σ_u² i_T i_T' = Ω
113
Random eects model: Generalized least squares
Observations i ⊥ j ⇒ the nT×nT covariance matrix is block diagonal, Σ = I_n ⊗ Ω
Remark: Ω does not depend on i
GLS:
β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y = [Σ_{i=1}^n X_i'Ω⁻¹X_i]⁻¹ [Σ_{i=1}^n X_i'Ω⁻¹y_i]
σ_ε² and σ_u² in Ω are usually unknown: estimate them and then FGLS
114
FGLS of random eects model: estimating
2

Taking deviations from group means remove heterogeneity u


i
y
it
= x

it
+ +u
i
+
it
y
i.
= x

i.
+ +u
i
+
i.
y
it
y
i.
= (x
it
x
i.
)

+
it

i.
= (x
it
x
i.
)

b +e
it
e
i.

2

n
i=1

T
t=1
(e
it
e
i.
)
2
nT n K
p

Degrees of freedom: nT observations n y


i.
means K slopes
Note e
i.
= 0; Note
2

= s
2
LSDV
as
e
it
= y
it
y
i.
(x
it
x
i.
)

b = y
it
x

it
b (y
i.
x

i.
b)
= y
it
x

it
b a
i
= residual in FE LSDV model
115
FGLS of random eects model: estimating
2
u
OLS consistent, unbiased, not ecient estimator of and in
y
it
= x

it
+ +u
i
+
it
= x

it
+ +
it
Hence
plims
2
Pooled
= plim
e

e
nT K 1
= Var[
it
] =
2
u
+
2

Consistent estimator of
2
u
:

2
u
= s
2
Pooled
s
2
LSDV
If negative, change degrees of freedom
116
Random Eects or Fixed Eects model?
FE: flexible, Cov[c_i, x_it] ≠ 0 allowed, but many parameters to estimate: α_1, . . . , α_n
RE: parsimonious, but the assumption Cov[c_i, x_it] = 0 might be violated
Hausman's specification test, H_0: RE model
H_0: Cov[c_i, x_it] = 0 ⇒ OLS in LSDV and GLS in RE model both consistent,
but OLS inefficient
H_1: Cov[c_i, x_it] ≠ 0 ⇒ only OLS in LSDV consistent,
but GLS in RE model inconsistent
Under H_0: OLS in LSDV model ≈ GLS in RE model
117
Hausmans specication test
b OLS in the LSDV model; β̂ GLS in the RE model. Under H_0: b − β̂ ≈ 0
Var[b − β̂] = Var[b] + Var[β̂] − Cov[b, β̂] − Cov[β̂, b]
Hausman's key result:
0 = Cov[efficient estimator, (efficient estimator − inefficient estimator)]
0 = Cov[β̂, (β̂ − b)] = Var[β̂] − Cov[β̂, b]
This implies, under H_0,
Var[b − β̂] = Var[b] − Var[β̂]
Wald criterion, based on the K estimated slopes, excluding the intercept:
W = [b − β̂]'(Var[b] − Var[β̂])⁻¹[b − β̂] ∼ χ²_(K)
118
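A schematic sketch (my addition, simplified) of the Hausman statistic W = (b − β̂)'(Var[b] − Var[β̂])⁻¹(b − β̂), assuming the FE and RE slope estimates and their covariance matrices have already been computed (e.g. with the estimators sketched earlier); the numbers below are purely illustrative.

```python
import numpy as np
from scipy import stats

def hausman(b_fe, V_fe, b_re, V_re):
    """Hausman specification test; H0 = random effects model (both estimators consistent)."""
    d = b_fe - b_re
    V_diff = V_fe - V_re                      # Var[b] - Var[beta_hat] under H0
    W = d @ np.linalg.solve(V_diff, d)        # Wald criterion, chi2 with K d.o.f.
    K = d.size
    return W, stats.chi2.sf(W, K)

# Illustrative numbers only (one slope): FE and RE estimates with their variances
b_fe = np.array([1.48]); V_fe = np.array([[0.010]])
b_re = np.array([1.60]); V_re = np.array([[0.006]])
print(hausman(b_fe, V_fe, b_re, V_re))
```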
Mundlaks approach
Fixed effects model: E[c_i | X_i] = α_i, one parameter for each individual i
Random effects model: E[c_i | X_i] = α, one parameter for all individuals
Mundlak's approach: E[c_i | X_i] = x̄_i.'γ, K parameters for all individuals
Model:
y_it = x_it'β + c_i + ε_it = x_it'β + E[c_i | X_i] + (c_i − E[c_i | X_i]) + ε_it
     = x_it'β + x̄_i.'γ + u_i + ε_it
Drawback: x̄_i. can only include time varying variables
119
Dynamic panel data model
The model y_it = x_it'β + c_i + ε_it describes a static relation
The dynamic model y_it = γ y_i,t−1 + x_it'β + c_i + ε_it fits data much better
OLS and GLS are inconsistent: c_i is correlated with y_i,t−1
FE model, deviations from means, first differences: inconsistent estimates
Instrumental variable estimator: consistent estimates
Read about SUR and the CAPM in Chapter 10
120
Chapter 12: Instrumental variables
Linear regression model: y = Xβ + ε
b = β + (X'X)⁻¹X'ε
b unbiased when E[ε | X] = 0
b consistent when plim X'ε/n = 0
In many situations (e.g. dynamic panel models, measurement error on X),
X and ε are correlated ⇒ OLS (and GLS) biased and inconsistent
Solution: instrumental variables (IV), consistent estimates
121
Assumptions of the model
1. Linearity: E[y | X] linear in β
2. Full rank: X is an n×K matrix with rank K
3. Endogeneity of the independent variables: E[ε_i | x_i] ≠ 0
4. Homoscedasticity and nonautocorrelation of ε_i
5. Stochastic or nonstochastic X
6. Normal distribution: ε | X ∼ N(0, σ²I)
122
Instrumental variable: Denition
Instrumental variables Z = [z_1 · · · z_L] (n×L), L ≥ K, have two properties:
1. Exogeneity: Z uncorrelated with ε
2. Relevance: Z correlated with X
Further assumptions of the model:
[x_i, z_i, ε_i], i = 1, . . . , n, i.i.d.
E[ε_i | z_i] = 0
plim Z'Z/n = Q_zz, a finite, positive definite matrix
plim Z'ε/n = 0 (Exogeneity)
plim Z'X/n = Q_zx, a finite, L×K matrix with rank K (Relevance)
123
Insight on IV estimation
When plim X'ε/n = 0:
y = Xβ + ε
X'y/n = X'Xβ/n + X'ε/n
X'y/n ≈ X'Xβ/n
β ≈ (X'X)⁻¹X'y
When plim X'ε/n ≠ 0, but plim Z'ε/n = 0 (and L = K):
Z'y/n = Z'Xβ/n + Z'ε/n
Z'y/n ≈ Z'Xβ/n
β ≈ (Z'X)⁻¹Z'y
Remark: the ≈ are = in plim
124
Instrumental variable estimator (L = K)
L instruments, observed variables, Z is an n×L matrix; when L = K:
b_IV = (Z'X)⁻¹Z'y = β + (Z'X)⁻¹Z'ε
plim b_IV = β + (plim Z'X/n)⁻¹ plim Z'ε/n = β
√n(b_IV − β) = (Z'X/n)⁻¹ √n Z'ε/n →d Q_zx⁻¹ N(0, σ²Q_zz) =d N(0, σ²Q_zx⁻¹Q_zz Q_xz⁻¹)
b_IV ∼a N(β, σ²Q_zx⁻¹Q_zz Q_xz⁻¹/n)
Exogeneity ⇒ consistency; Relevance ⇒ low variance
125
Instrumental variable estimator (L > K)
When L > K, Z'X is L×K, not an invertible matrix
X correlated with ε ⇒ inconsistency
Z uncorrelated with ε (Exogeneity)
Idea: project X on Z to get X̂, then regress y on X̂ to estimate β
X̂ = Z × slope of X on Z = Z(Z'Z)⁻¹Z'X
Regressing y on X̂:
b_IV = [X̂'X̂]⁻¹X̂'y
     = [X'Z(Z'Z)⁻¹Z'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'y
     = [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'y
Two-stage least squares (2SLS) estimator (the two stages are only conceptual)
126
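A small sketch (added here) of the 2SLS/IV estimator b_IV = [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'y on simulated data with an endogenous regressor and one valid instrument; all names and numbers are illustrative.

```python
import numpy as np

def two_sls(y, X, Z):
    """2SLS: project X on Z, then regress y on the projection."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # Z(Z'Z)^{-1}Z'X
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Simulated endogeneity: x correlated with eps, z correlated with x but not with eps
rng = np.random.default_rng(8)
n = 2000
z = rng.standard_normal(n)
u = rng.standard_normal(n)                    # common shock creating endogeneity
x = 0.8 * z + u + 0.3 * rng.standard_normal(n)
eps = u + rng.standard_normal(n)              # Cov[x, eps] != 0
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
print("OLS  :", np.linalg.lstsq(X, y, rcond=None)[0])  # biased slope
print("2SLS :", two_sls(y, X, Z))                      # close to [1, 2]
```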
Which instruments?
Instrumental variables are generally difficult to find
Z can include variables in X uncorrelated with ε
In time series settings, lagged values of x and y are typical instruments
Relevance ⇒ high correlation between X and Z (otherwise Q_xz⁻¹ large)
But then Z might be correlated with ε (as ε is correlated with X)
127
Example: Dynamic panel data model
Model: y_it = γ y_i,t−1 + x_it'β + c_i + ε_it
c_i correlated or uncorrelated with x_it
c_i certainly correlated with y_i,t−1 ⇒ LS inconsistent
Taking first differences, Δy_it = y_it − y_i,t−1:
Δy_it = γ Δy_i,t−1 + Δx_it'β + Δε_it
Cov[Δy_i,t−1, Δε_it] ≠ 0 ⇒ LS still inconsistent
To estimate γ and β, valid instruments are, e.g., y_i,t−2 and Δy_i,t−2
128
Measurement error
Measurement errors are very common in practice
E.g., variables of interest are not available but only approximated by others
E.g., GDP, consumption, capital, . . . , cannot be measured exactly
129
Regression model with measurement error
True, latent (unobserved), univariate model:
y*_i = β x*_i + ε_i
Observed data: y_i = y*_i + v_i and x_i = x*_i + u_i
where v_i ∼ (0, σ_v²), v_i ⊥ y*_i, x*_i, u_i and u_i ∼ (0, σ_u²), u_i ⊥ y*_i, x*_i, v_i
Working model, derived from the true model:
y_i − v_i = β(x_i − u_i) + ε_i
y_i = β x_i + (−β u_i + ε_i + v_i)
Measurement error on y_i, i.e. v_i, is absorbed in the error term
Measurement error on x_i, i.e. u_i, makes LS inconsistent
130
LS estimation with measurement error
Set v_i = 0 for simplicity. Working model:
y_i = β x_i + (−β u_i + ε_i) = β x_i + w_i
LS estimation of β is inconsistent because
Cov[x_i, w_i] = Cov[x*_i + u_i, −β u_i + ε_i] = −β σ_u² ≠ 0
b = [Σ_{i=1}^n x_i²/n]⁻¹ Σ_{i=1}^n x_i y_i/n
plim b = [plim Σ_{i=1}^n (x*_i + u_i)²/n]⁻¹ plim Σ_{i=1}^n (x*_i + u_i)(β x*_i + ε_i)/n
       = [Q* + σ_u²]⁻¹ Q* β = β/(1 + σ_u²/Q*)
→ 0 when σ_u² → ∞
131
IV estimation with measurement error
The instrument z_i has the two properties:
1. Exogeneity: Cov[z_i, u_i] = 0
2. Relevance: Cov[z_i, x*_i] = Q*_zx ≠ 0
Recall the true model y_i = β x*_i + ε_i, observed regressor x_i = x*_i + u_i
b_IV = [Σ_{i=1}^n x_i z_i/n]⁻¹ Σ_{i=1}^n z_i y_i/n
plim b_IV = [plim Σ_{i=1}^n (x*_i + u_i)z_i/n]⁻¹ plim Σ_{i=1}^n z_i(β x*_i + ε_i)/n
          = (Q*_zx)⁻¹ Q*_zx β = β
132
IV estimation of generalized regression model
In the generalized regression model, E[εε' | X] = σ²Ω:
b_IV = [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'y
     = β + [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'ε
plim b_IV = β + Q_xx.z plim Z'ε/n = β
√n(b_IV − β) →d Q_xx.z N(0, σ² plim(Z'ΩZ/n)) =d N(0, σ²Q_xx.z plim(Z'ΩZ/n) Q_xx.z')
b_IV ∼a N(β, σ²Q_xx.z plim(Z'ΩZ/n) Q_xx.z'/n)
Same derivation as when E[εε' | X] = σ²I
133
Chapter 15: Generalized method of moments (GMM)
General framework for estimation and hypothesis testing
LS, NLS, GLS, IV, etc. are special cases of GMM
GMM relies on weak assumptions about first moments
(existence and convergence of first moments)
Strength (and limitation) of GMM:
No assumptions about the distribution ⇒ robust to misspecification of the DGP
Widely used in Econometrics, Finance, . . .
134
Logic behind method of moments
Sample moments →p population moments = function(parameters)
E.g., random sample {y_i, i = 1, . . . , n}, with E[y_i] = μ and Var[y_i] = σ²:
(1/n) Σ_{i=1}^n y_i →p E[y_i] = μ
(1/n) Σ_{i=1}^n y_i² →p E[y_i²] = σ² + μ²
The assumptions of the Law of Large Numbers need to hold
135
Orthogonality conditions: Example
Parameters are implicitly defined by two orthogonality conditions:
E[y_i − μ] = 0
E[y_i² − σ² − μ²] = 0
To estimate μ and σ², replace E[·] by the empirical distribution
and solve the two moment equations:
(1/n) Σ_{i=1}^n (y_i − μ̂) = 0
(1/n) Σ_{i=1}^n (y_i² − σ̂² − μ̂²) = 0
Moment estimators: μ̂ = Σ_{i=1}^n y_i/n and σ̂² = Σ_{i=1}^n (y_i − μ̂)²/n
136
Example: Gamma distribution
The Gamma distribution is used to model positive r.v. y_i, e.g. waiting times:
f(y) = λ^p/Γ(p) e^(−λy) y^(p−1), y ≥ 0, p > 0, λ > 0
(Some) orthogonality conditions:
E[y_i − p/λ] = 0
E[y_i² − p(p + 1)/λ²] = 0
E[ln y_i − d ln Γ(p)/dp + ln λ] = 0
E[1/y_i − λ/(p − 1)] = 0
Orthogonality conditions are (in general) nonlinear functions of the sample data
More orthogonality conditions (four) than parameters (two)
Any two orthogonality conditions give (p̂, λ̂): need to reconcile all of them
137
Orthogonality conditions
K parameters to estimate, θ = (θ_1, . . . , θ_K)'
L moment conditions (L ≥ K):
E[m_i1(y_i, x_i, θ), . . . , m_il(y_i, x_i, θ), . . . , m_iL(y_i, x_i, θ)]' = E[m_i(y_i, x_i, θ)] = 0
θ implicitly defined by the equation above
θ estimated via the empirical counterpart of E[·]
138
Exactly identied case
When L = K, i.e. # moment conditions = # parameters,
the sample moment equations have a unique solution and are all exactly satisfied
E.g., the previous method of moments estimator of μ and σ²
E.g., the LS estimator: E[m_i(y_i, x_i, β)] = E[x_i(y_i − x_i'β)] = 0
Solving the sample moment equations (i.e. the normal equations):
(1/n) Σ_{i=1}^n [x_i(y_i − x_i'β̂)] = 0
(1/n) Σ_{i=1}^n x_i y_i − (1/n) Σ_{i=1}^n x_i x_i' β̂ = 0
β̂ = [Σ_{i=1}^n x_i x_i']⁻¹ [Σ_{i=1}^n x_i y_i]
139
Overidentied case
When L > K, i.e. # moment conditions > # parameters,
the system of L equations in K unknown parameters
(1/n) Σ_{i=1}^n m_il(y_i, x_i, θ) = 0, l = 1, . . . , L
has no solution (equations functionally independent) in finite samples
although
plim (1/n) Σ_{i=1}^n m_il(y_i, x_i, θ) = E[m_il(y_i, x_i, θ)] = 0, l = 1, . . . , L
E.g., the previous estimation of the Gamma distribution parameters
E.g., IV estimation when # instruments L > # parameters K
140
Criterion function
When L > K, to reconcile the different estimates, minimize the criterion function
q = m̄(θ)'W_n m̄(θ)
where m̄(θ) = Σ_{i=1}^n m_i(y_i, x_i, θ)/n, L×1 moment conditions
W_n: positive definite weighting matrix, with plim W_n = W
When W_n = I:
q = m̄(θ)'m̄(θ) = Σ_{l=1}^L m̄_l(θ)²
where m̄_l(θ) = Σ_{i=1}^n m_il(y_i, x_i, θ)/n, l = 1, . . . , L
When W_n is inversely proportional to the variance of m̄(θ) ⇒ efficiency gains
same logic that makes GLS more efficient than OLS
141
Optimal weighting matrix
L orthogonality conditions, possibly correlated
⇒ optimal weighting matrix:
W = {Asy.Var[√n m̄(θ)]}⁻¹ = Φ⁻¹
Recall, Var[m̄(θ)] = Var[Σ_{i=1}^n m_i(y_i, x_i, θ)/n] ∼ O(1/n)
Efficient GMM estimator based on Φ⁻¹
When L > K, W = I (or any W ≠ Φ⁻¹) produces inefficient estimates of θ
When L = K ⇒ moment equations satisfied exactly, i.e. m̄(θ̂) = 0,
⇒ q = 0 and W irrelevant
142
Assumptions of GMM estimation

θ_0 true parameter vector, K×1
L population orthogonality conditions: E[m_i(θ_0)] = 0, L ≥ K
L sample moments: m̄_n(θ_0) = Σ_{i=1}^n m_i(θ_0)/n
E.g., IV estimation: m̄_n(θ_0) = Σ_{i=1}^n z_i(y_i − x_i'θ_0)/n,
L ≥ K instruments, one orthogonality condition for each instrument
Assumption 1: Convergence of the empirical moments
The data generating process satisfies the assumptions of the Law of Large Numbers:
m̄_n(θ_0) = (1/n) Σ_{i=1}^n m_i(θ_0) →p E[m_i(θ_0)] = 0
)] = 0
143
Assumptions of GMM estimation
Empirical moment equations are continuous and continuously differentiable
⇒ L×K matrix of partial derivatives:
Ḡ_n(θ_0) = ∂m̄_n(θ_0)/∂θ_0' = (1/n) Σ_{i=1}^n ∂m_i(θ_0)/∂θ_0' →p G(θ_0)
The Law of Large Numbers applies to moments and derivatives of moments
Assumption 2: Identification
For any n ≥ K, if θ_1 ≠ θ_2, then m̄_n(θ_1) ≠ m̄_n(θ_2)
plim q_n(θ) = plim(m̄_n(θ)'W_n m̄_n(θ)) has a unique minimum (= zero) at θ_0
Identification ⇒ L ≥ K and rank(Ḡ_n(θ_0)) = K
144
Assumptions of GMM estimation
Assumptions 1 and 2 ⇒ θ can be estimated
Assumption 3: Asymptotic distribution of the empirical moments
The empirical moments obey a Central Limit Theorem:
√n m̄_n(θ_0) →d N(0, Φ)
145
Asymptotic properties of GMM
Under the previous assumptions:
θ̂_GMM →p θ_0
θ̂_GMM ∼a N(θ_0, [G(θ_0)'Φ⁻¹G(θ_0)]⁻¹/n)
146
Consistency of GMM estimator
Recall the criterion function q_n(θ) = m̄_n(θ)'W_n m̄_n(θ)
Assumption 1 and continuity of the moments ⇒ q_n(θ) →p q_0(θ)
W_n positive definite, for any finite n:
0 ≤ q_n(θ̂_GMM) ≤ q_n(θ_0)
When n → ∞, q_n(θ_0) →p 0 ⇒ q_n(θ̂_GMM) →p 0
W positive definite and the identification assumption ⇒ θ̂_GMM →p θ_0
147
Asymptotic normality of GMM estimator
First order condition for the GMM estimator:
∂q_n(θ̂_GMM)/∂θ̂_GMM = 2 Ḡ_n(θ̂_GMM)'W_n m̄_n(θ̂_GMM) = 0
Assumption: moment equations continuous and continuously differentiable
Mean Value Theorem and Taylor expansion at θ_0 of the moment equations:
m̄_n(θ̂_GMM) = m̄_n(θ_0) + Ḡ_n(θ̄)(θ̂_GMM − θ_0)
where θ_0 < θ̄ < θ̂_GMM componentwise. The first order condition becomes:
2 Ḡ_n(θ̂_GMM)'W_n [m̄_n(θ_0) + Ḡ_n(θ̄)(θ̂_GMM − θ_0)] = 0
Solving for (θ̂_GMM − θ_0) and multiplying by √n gives:
148
Asymptotic normality of GMM estimator

√n(θ̂_GMM − θ_0) = −[Ḡ_n(θ̂_GMM)'W_n Ḡ_n(θ̄)]⁻¹ Ḡ_n(θ̂_GMM)'W_n √n m̄_n(θ_0)
When n → ∞:
θ̂_GMM →p θ_0 and θ̄ →p θ_0 as θ_0 < θ̄ < θ̂_GMM componentwise
Ḡ_n(θ̂_GMM) →p G(θ_0) and Ḡ_n(θ̄) →p G(θ_0)
W_n →p W by construction of the weighting matrix
√n m̄_n(θ_0) →d N(0, Φ) by Assumption 3
√n(θ̂_GMM − θ_0) →d −[G(θ_0)'W G(θ_0)]⁻¹G(θ_0)'W N(0, Φ)
  =d N[0, {[G(θ_0)'W G(θ_0)]⁻¹G(θ_0)'W} Φ {. . . . . .}']
  =d N[0, [G(θ_0)'Φ⁻¹G(θ_0)]⁻¹]    using W = Φ⁻¹
θ̂_GMM ∼a N(θ_0, [G(θ_0)'Φ⁻¹G(θ_0)]⁻¹/n)
149
Weighting matrix
Any positive definite matrix W produces consistent GMM estimates
W determines the efficiency of the GMM estimator:
Optimal W = {Asy.Var[√n m̄(θ)]}⁻¹ depends on the unknown θ
Feasible two-step procedure:
Step 1. Use W = I to obtain a consistent estimator, θ̂_(1), then estimate
Φ̂ = (1/n) Σ_{i=1}^n m_i(y_i, x_i, θ̂_(1)) m_i(y_i, x_i, θ̂_(1))'
(when {m_i(y_i, x_i, θ_0), i = 1, . . . , n} is an uncorrelated sequence)
Step 2. Use W = Φ̂⁻¹ to compute the GMM estimator
150
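A toy sketch (my addition) of the two-step procedure for the overidentified linear IV setting, with moments m_i(β) = z_i(y_i − x_i'β): step 1 uses W = I, step 2 uses W = Φ̂⁻¹ with Φ̂ = (1/n)Σ m_i m_i'. For linear moments the minimizer of m̄'W m̄ has the closed form used below; data and names are illustrative.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """GMM estimator for moments E[z_i(y_i - x_i'beta)] = 0 with weighting matrix W."""
    A = X.T @ Z @ W @ Z.T @ X
    c = X.T @ Z @ W @ Z.T @ y
    return np.linalg.solve(A, c)

def two_step_gmm(y, X, Z):
    n = y.size
    # Step 1: W = I gives a consistent first-step estimate
    b1 = linear_gmm(y, X, Z, np.eye(Z.shape[1]))
    # Estimate Phi = (1/n) sum m_i m_i' with m_i = z_i * e_i (uncorrelated observations)
    e = y - X @ b1
    m = Z * e[:, None]
    Phi = m.T @ m / n
    # Step 2: W = Phi^{-1} gives the efficient GMM estimate
    return linear_gmm(y, X, Z, np.linalg.inv(Phi))

# Overidentified example: two instruments for one endogenous regressor
rng = np.random.default_rng(9)
n = 2000
z1, z2, u = rng.standard_normal((3, n))
x = 0.7 * z1 + 0.4 * z2 + u
y = 1.0 + 2.0 * x + (u + rng.standard_normal(n))
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
print(two_step_gmm(y, X, Z))     # close to [1, 2]
```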
Testing hypothesis in GMM framework
Two sets of tests:
1. Testing restrictions induced by moment equations
2. GMM counterparts to Wald, LM, and LR tests
151
Specication test
In the exactly identified case, L moment equations = K parameters:
θ̂ exists such that m̄(θ̂) = 0
In the overidentified case, L moment equations > K parameters:
L − K moment equations imply moment restrictions on θ̂
Intuition: K moment equations are set to zero to compute the K parameters
⇒ L − K free moment equations
Test of the overidentifying restrictions, using Ŵ = {Asy.Var[√n m̄(θ̂)]}⁻¹:
J-stat = nq = √n m̄(θ̂)' Ŵ √n m̄(θ̂) →d χ²_(L−K)
Note: no parametric restrictions on θ in the specification test
152
Testing parametric restrictions
To test J (linear or nonlinear) parametric restrictions on θ:
Given L moment equations, now only K − J free parameters
nq_R = √n m̄(θ̂_R)' Ŵ √n m̄(θ̂_R) →d χ²_(L−(K−J))
nq_R − nq →d χ²_(J)
as for the degrees of freedom (L − (K − J)) − (L − K) = J
Note: the same optimal weighting matrix W is used in q_R and q ⇒ q_R ≥ q
153
Application of GMM: Asset pricing model estimation
Asset pricing model:
E[r^e_j,t] = λ_AER β_AER,j + λ_HML β_HML,j + λ_CLS β_CLS,j = β_j'λ
Stochastic Discount Factor (SDF) representation, demeaned factors f̃_t:
m_t = 1 − b_AER f̃_AER,t − b_HML f̃_HML,t − b_CLS f̃_CLS,t = 1 − b'f̃_t
Euler pricing equation:
E[m_t r^e_j,t] = 0
⇒ N moment conditions, j = 1, . . . , N
Market prices of risk, λ, and SDF loadings, b:
λ = E[f̃_t f̃_t'] b
154
GMM estimation results
Model        (1)       (2)       (3)
λ_AER      11.43                4.34
           (7.26)             (11.36)
λ_HML      18.70                7.48
          (13.08)             (35.93)
λ_CLS      13.16               27.13
           (3.35)             (17.77)
b_AER                0.13       0.03
                    (0.05)     (0.06)
b_HML                0.05       0.16
                    (0.04)     (0.07)
b_CLS                0.26       0.75
                    (0.07)     (0.24)
J-stat     0.0467    0.0444    0.0349
p-value    6.03%     7.71%    14.14%

Table 1: Parameter estimates (Newey–West standard errors in parentheses).
155
Chapter 16: Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE): very important inference method
Maximum likelihood principle:
Given sample data generated from a parametric model,
find the parameters that maximize the probability of observing that sample
Basic, strong assumption:
the DGP has a parametric distribution, known up to θ
Fundamental result:
MLE makes the best use of this information
156
Likelihood function
Likelihood function = probability of observing that sample
Formally, the joint density of n i.i.d. observations y_1, . . . , y_n:
f(y_1, . . . , y_n; θ) = Π_{i=1}^n f(y_i; θ) = L(θ; y)
L(θ; y) is the likelihood function, with θ unknown
The log-likelihood is usually easier to deal with:
ln L(θ; y) = Σ_{i=1}^n ln f(y_i; θ)
157
Identication
Identification means the parameters are estimable. It depends on the model
Check identification before estimating or testing the model
Definition: θ is identified (or estimable) if
L(θ; y) ≠ L(θ*; y) for θ* ≠ θ and some data y
E.g. the linear regression model is not identified when rank[x_1, . . . , x_K] < K
E.g. threshold model for y_i > 0 or y_i ≤ 0:
Pr(y_i > 0) = Pr(β_1 + β_2 x_i + ε_i > 0) = Pr(ε_i/σ > −(β_1 + β_2 x_i)/σ)
not identified: σ, β_1, β_2 not estimable (normalization required, e.g. σ = 1)
158
Maximum likelihood estimator
Maximum likelihood estimator, $\hat\theta$, solves
$\hat\theta = \arg\max_{\theta} L(\theta; y) = \arg\max_{\theta} \ln L(\theta; y)$
or equivalently the likelihood equation
$\partial \ln L(\theta; y)/\partial\theta = 0$
159
Maximum likelihood estimator: Example
i.i.d. normal random variables, $y_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$
$\ln L(\mu, \sigma^2; y) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2/\sigma^2$
$\partial \ln L/\partial\mu = \sum_{i=1}^{n}(y_i - \mu)/\sigma^2 = 0$
$\partial \ln L/\partial\sigma^2 = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2 = 0$
Solve likelihood equations:
$\hat\mu_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i$
$\hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu_{ML})^2$
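
For illustration only (not part of the slides), a minimal Python sketch that maximizes this normal log-likelihood numerically and checks the result against the closed-form solutions; the simulated data and the log-variance parameterization (to keep $\sigma^2 > 0$) are assumptions made for the example.

```python
# Numerical MLE for i.i.d. normal data, checked against the closed-form solution
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=1000)

def neg_loglik(params, y):
    mu, log_s2 = params                      # sigma^2 = exp(log_s2) > 0
    s2 = np.exp(log_s2)
    n = len(y)
    return 0.5 * (n * np.log(2 * np.pi) + n * np.log(s2) + np.sum((y - mu) ** 2) / s2)

res = minimize(neg_loglik, x0=[0.0, 0.0], args=(y,))
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, y.mean())                       # numerical vs analytical mu_ML
print(s2_hat, ((y - y.mean()) ** 2).mean())   # numerical vs analytical sigma^2_ML
```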
160
Asymptotic efficiency
An estimator is asymptotically efficient if it is
consistent,
asymptotically normally distributed (CAN), and has
asy. covariance matrix not larger than that of any other CAN estimator
Under some regularity conditions, MLE is asymptotically efficient
Finite sample properties usually not optimal
E.g., $\hat\sigma^2_{ML} = \sum_{i=1}^{n}(y_i - \bar y)^2/n$ biased (no correction for degrees of freedom)
161
Properties of MLE
Under regularity conditions, MLE $\hat\theta$ has the following properties:
M1 Consistency: $\text{plim}\ \hat\theta = \theta_0$
M2 Asymptotic normality: $\hat\theta \stackrel{a}{\sim} N\big(\theta_0,\, \{-E_0[\partial^2 \ln L/\partial\theta_0\partial\theta_0']\}^{-1}\big)$
M3 Asymptotic efficiency: $\hat\theta$ reaches the Cramer–Rao lower bound in M2
M4 Invariance: MLE of $\gamma_0 = c(\theta_0)$ is $\hat\gamma_0 = c(\hat\theta)$ if $c \in C^1$
162
Regularity conditions on $f(y_i; \theta)$
R1 First three derivatives of $\ln f(y_i; \theta)$ w.r.t. $\theta$ are continuous and finite
R2 Conditions for $E[\partial \ln f(y_i; \theta)/\partial\theta] < \infty$ and $E[\partial^2 \ln f(y_i; \theta)/\partial\theta\partial\theta'] < \infty$ hold
R3 $|\partial^3 \ln f(y_i; \theta)/\partial\theta_j\partial\theta_k\partial\theta_l| < h$, where $E[h] < \infty$
Definition: Regular densities satisfy R1–R3
Goals: use Taylor approximation; interchange differentiation and expectation
Notation: gradient $g_i = \partial \ln f(y_i; \theta)/\partial\theta$, Hessian $H_i = \partial^2 \ln f(y_i; \theta)/\partial\theta\partial\theta'$
163
Properties of regular densities
Moments of derivatives of log-likelihood:
D1 $\ln f(y_i; \theta)$, $g_i$, $H_i$, $i = 1, \ldots, n$, are random samples
D2 $E_0[g_i(\theta_0)] = 0$
D3 $\text{Var}_0[g_i(\theta_0)] = -E_0[H_i(\theta_0)]$
D1 implied by assumption: $\{y_i,\ i = 1, \ldots, n\}$ is a random sample
To prove D2: by definition $1 = \int f(y_i; \theta_0)\, dy_i$
$\frac{\partial 1}{\partial\theta_0} = \frac{\partial}{\partial\theta_0}\int f(y_i; \theta_0)\, dy_i$
$0 = \int \frac{\partial f(y_i; \theta_0)}{\partial\theta_0}\, dy_i = \int \frac{\partial \ln f(y_i; \theta_0)}{\partial\theta_0}\, f(y_i; \theta_0)\, dy_i = E_0[g_i(\theta_0)]$
164
Information matrix equality
To prove D3: differentiate the previous integral once more w.r.t. $\theta_0'$
$0 = \frac{\partial 0}{\partial\theta_0'} = \frac{\partial}{\partial\theta_0'}\int \frac{\partial \ln f(y_i; \theta_0)}{\partial\theta_0}\, f(y_i; \theta_0)\, dy_i$
$0 = \int \Big[\frac{\partial^2 \ln f(y_i; \theta_0)}{\partial\theta_0\partial\theta_0'}\, f(y_i; \theta_0) + \frac{\partial \ln f(y_i; \theta_0)}{\partial\theta_0}\, \frac{\partial f(y_i; \theta_0)}{\partial\theta_0'}\Big]\, dy_i$
$= \int \Big[\frac{\partial^2 \ln f(y_i; \theta_0)}{\partial\theta_0\partial\theta_0'}\, f(y_i; \theta_0) + \frac{\partial \ln f(y_i; \theta_0)}{\partial\theta_0}\, \frac{\partial \ln f(y_i; \theta_0)}{\partial\theta_0'}\, f(y_i; \theta_0)\Big]\, dy_i$
$= E_0[H_i(\theta_0)] + \text{Var}_0[g_i(\theta_0)] \;\Rightarrow\; \text{D3}$
D1 (random sample) $\Rightarrow$ $\text{Var}_0\big[\sum_{i=1}^{n} g_i(\theta_0)\big] = \sum_{i=1}^{n} \text{Var}_0[g_i(\theta_0)]$
$\text{Var}_0\Big[\sum_{i=1}^{n} g_i(\theta_0)\Big] =: \text{Var}_0\Big[\frac{\partial \ln L(\theta_0; y)}{\partial\theta_0}\Big] \underbrace{= -E_0\Big[\frac{\partial^2 \ln L(\theta_0; y)}{\partial\theta_0\partial\theta_0'}\Big]}_{\text{Information matrix equality}} =: -E_0\Big[\sum_{i=1}^{n} H_i(\theta_0)\Big]$
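
As a purely illustrative check (not in the slides), a short simulation for the i.i.d. normal example verifying D3 numerically: the sample covariance of the per-observation scores should be close to minus the average Hessian when both are evaluated at the true parameters.

```python
# Numerical check of Var_0[g_i] = -E_0[H_i] for y_i ~ N(mu, s2)
import numpy as np

rng = np.random.default_rng(42)
mu, s2 = 1.0, 4.0
y = rng.normal(mu, np.sqrt(s2), size=200_000)

# Scores g_i = d ln f_i / d(mu, s2), evaluated at the true parameters
g = np.column_stack([(y - mu) / s2,
                     -0.5 / s2 + 0.5 * (y - mu) ** 2 / s2 ** 2])

# Average Hessian (1/n) sum_i H_i
H_bar = np.array([[-1.0 / s2, -np.mean(y - mu) / s2 ** 2],
                  [-np.mean(y - mu) / s2 ** 2,
                   0.5 / s2 ** 2 - np.mean((y - mu) ** 2) / s2 ** 3]])

print(np.cov(g, rowvar=False))   # approx [[1/s2, 0], [0, 1/(2*s2^2)]]
print(-H_bar)                    # approximately the same matrix
```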
165
Likelihood equation
Score vector at $\theta$:
$g = \frac{\partial \ln L(\theta; y)}{\partial\theta} = \sum_{i=1}^{n} \frac{\partial \ln f(y_i; \theta)}{\partial\theta} = \sum_{i=1}^{n} g_i$
D1 (random sample) and D2 ($E_0[g_i(\theta_0)] = 0$) $\Rightarrow$ Likelihood equation at $\theta_0$:
$E_0\Big[\frac{\partial \ln L(\theta_0; y)}{\partial\theta_0}\Big] = 0$
166
Consistency of MLE
In any finite sample, $\ln L(\hat\theta) \geq \ln L(\theta_0)$ (and in general $\geq \ln L(\theta)$ for any $\theta \neq \hat\theta$, not only $\theta_0$)
From Jensen's inequality, if $\theta_0 \neq \hat\theta$ (and in general for any $\theta \neq \theta_0$, not only $\hat\theta$):
$E_0\Big[\ln \frac{L(\hat\theta)}{L(\theta_0)}\Big] < \ln E_0\Big[\frac{L(\hat\theta)}{L(\theta_0)}\Big] = \ln \int \frac{L(\hat\theta)}{L(\theta_0)}\, L(\theta_0)\, dy = \ln 1 = 0$
$\Rightarrow\; E_0[\ln L(\hat\theta)/n] < E_0[\ln L(\theta_0)/n] \quad (*)$
Under previous assumptions, using the inequality in the very first row:
$\text{plim}\ \ln L(\hat\theta)/n \geq \text{plim}\ \ln L(\theta_0)/n \;\Rightarrow\; E_0[\ln L(\hat\theta)/n] \geq E_0[\ln L(\theta_0)/n]$
Combining with $(*)$: $E_0[\ln L(\theta_0)/n] > E_0[\ln L(\hat\theta)/n] \geq E_0[\ln L(\theta_0)/n]$, which can only be reconciled in the limit with equality
$\Rightarrow\; \text{plim}\ \ln L(\hat\theta)/n = E_0[\ln L(\theta_0)/n]$ and $\text{plim}\ \hat\theta = \theta_0$
167
Asymptotic normality of MLE
MLE solves the sample likelihood equation: $g(\hat\theta) = \sum_{i=1}^{n} g_i(\hat\theta) = 0$
First order Taylor expansion: $g(\hat\theta) = g(\theta_0) + H(\bar\theta)(\hat\theta - \theta_0) = 0$
As $\bar\theta = w\,\theta_0 + (1 - w)\,\hat\theta$, $0 < w < 1$, $\text{plim}\ \hat\theta = \theta_0 \Rightarrow \text{plim}\ \bar\theta = \theta_0$
Hessian is continuous in $\theta$. Rearranging, scaling by $\sqrt{n}$, taking limit $n \to \infty$:
$\sqrt{n}(\hat\theta - \theta_0) = \{-H(\bar\theta)\}^{-1}\sqrt{n}\,g(\theta_0) = \Big[-\frac{1}{n}\sum_{i=1}^{n} H_i(\bar\theta)\Big]^{-1}\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n} g_i(\theta_0)$
$\xrightarrow{d} \Big\{-E_0\Big[\frac{1}{n}\sum_{i=1}^{n} H_i(\theta_0)\Big]\Big\}^{-1} N\Big(0,\, -E_0\Big[\frac{1}{n}\sum_{i=1}^{n} H_i(\theta_0)\Big]\Big)$
$\stackrel{d}{=} N\Big(0,\, \Big\{-E_0\Big[\frac{1}{n}\sum_{i=1}^{n} H_i(\theta_0)\Big]\Big\}^{-1}\Big)$
$\hat\theta \stackrel{a}{\sim} N\big(\theta_0,\, \{-E_0[H(\theta_0)/n]\}^{-1}/n\big) = N\big(\theta_0,\, \{I(\theta_0)\}^{-1}\big)$
168
Asymptotic efficiency
Cramer–Rao lower bound:
Assume that $f(y_i; \theta_0)$ satisfies regularity conditions R1–R3;
the asymptotic variance of a consistent and asymptotically normally distributed
estimator of $\theta_0$ is at least as large as
$\{I(\theta_0)\}^{-1} = \Big\{-E_0\Big[\frac{\partial^2 \ln L(\theta_0)}{\partial\theta_0\partial\theta_0'}\Big]\Big\}^{-1}$
Asymptotic variance of MLE reaches the Cramer–Rao lower bound
169
Invariance
MLE of $\gamma_0 = c(\theta_0)$ is $\hat\gamma_0 = c(\hat\theta)$ if $c \in C^1$
MLE invariant to one-to-one transformation
Useful application: $\ln L(\theta_0)$ can be a complicated function of $\theta_0$
$\Rightarrow$ re-parameterize the model to simplify calculations using $\ln L(\gamma_0)$
E.g. Normal log-likelihood, precision parameter $\tau^2 = 1/\sigma^2$:
$\ln L(\mu, \tau^2; y) = -\frac{n}{2}\ln(2\pi) + \frac{n}{2}\ln\tau^2 - \frac{\tau^2}{2}\sum_{i=1}^{n}(y_i - \mu)^2$
$\frac{\partial \ln L}{\partial\tau^2} = \frac{n}{2}\frac{1}{\tau^2} - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2 = 0$
$\hat\tau^2_{ML} = \frac{n}{\sum_{i=1}^{n}(y_i - \hat\mu_{ML})^2} = \frac{1}{\hat\sigma^2_{ML}}$
170
Estimating asymptotic covariance matrix of MLE
Asy.Var$[\hat\theta]$ depends on $\theta_0$. Three estimators, asymptotically equivalent:
1. Calculate $E_0[H(\theta_0)]$ (very difficult) and evaluate it at $\hat\theta$ to estimate
$\{I(\theta_0)\}^{-1} = \Big\{-E_0\Big[\frac{\partial^2 \ln L(\theta_0)}{\partial\theta_0\partial\theta_0'}\Big]\Big\}^{-1}$
2. Calculate $H(\theta_0)$ (still quite difficult) and evaluate it at $\hat\theta$ to get
$\{\hat I(\hat\theta)\}^{-1} = \Big[-\frac{\partial^2 \ln L(\hat\theta)}{\partial\hat\theta\partial\hat\theta'}\Big]^{-1} = \Big[-\sum_{i=1}^{n} H_i(\hat\theta)\Big]^{-1}$
3. BHHH or OPG estimator (very easy): use D3, $-E_0[H_i(\theta_0)] = \text{Var}_0[g_i(\theta_0)]$:
$\{\check I(\hat\theta)\}^{-1} = \Big[\sum_{i=1}^{n} g_i(\hat\theta)\, g_i(\hat\theta)'\Big]^{-1}$
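
A minimal sketch (not from the slides) comparing estimators 2 and 3 for the i.i.d. normal example, with analytic per-observation gradients and Hessian; the simulated data are an assumption made for the illustration.

```python
# Standard errors of (mu_ML, sigma^2_ML): observed-Hessian vs BHHH/OPG estimator
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(1.5, 2.0, size=2000)
n = len(y)
mu_hat = y.mean()
s2_hat = ((y - mu_hat) ** 2).mean()

# Per-observation scores g_i at the MLE (n x K matrix)
G = np.column_stack([(y - mu_hat) / s2_hat,
                     -0.5 / s2_hat + 0.5 * (y - mu_hat) ** 2 / s2_hat ** 2])

# 3. BHHH / OPG: [sum_i g_i g_i']^{-1}
opg_cov = np.linalg.inv(G.T @ G)

# 2. Observed Hessian: [-sum_i H_i(theta_hat)]^{-1}
H = np.array([[-n / s2_hat, -np.sum(y - mu_hat) / s2_hat ** 2],
              [-np.sum(y - mu_hat) / s2_hat ** 2,
               n / (2 * s2_hat ** 2) - np.sum((y - mu_hat) ** 2) / s2_hat ** 3]])
hess_cov = np.linalg.inv(-H)

print(np.sqrt(np.diag(opg_cov)))    # standard errors, OPG
print(np.sqrt(np.diag(hess_cov)))   # standard errors, Hessian (asymptotically equivalent)
```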
171
Conditional likelihood
Econometric models involve exogenous variables $x_i$ $\Rightarrow$ $\{y_i\}$ not i.i.d.
E.g. Model: $y_i = x_i'\beta + \varepsilon_i$, $x_i$ can be stochastic, correlated across $i$'s, etc.
Usually $f(y; \theta)$ not interesting, data generated by $f(y, x)$ not known
Way out: DGP of $x_i$ exogenous and well-behaved (LLN applies),
$x_i \sim f(x_i; \delta)$, and no common elements, no restrictions between $\theta$ and $\delta$:
$f(y_i, x_i; \theta, \delta) = f(y_i|x_i; \theta)\, f(x_i; \delta)$
$\ln L(\theta, \delta; y, x) = \sum_{i=1}^{n} \ln f(y_i|x_i; \theta) + \sum_{i=1}^{n} \ln f(x_i; \delta)$
$\hat\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \ln f(y_i|x_i; \theta)$
172
Maximizing log-likelihood
Log-likelihoods are typically highly nonlinear functions of parameters
E.g., GARCH-in-mean model for asset return, $y_t = p_t/p_{t-1} - 1 = E_{t-1}[y_t] + \varepsilon_t$
with $E_{t-1}[y_t] = \beta_0 + \beta_1 \sigma^2_t$ and $\text{Var}_{t-1}[y_t] = \sigma^2_t = \alpha_0 + \alpha_1 \varepsilon^2_{t-1} + \alpha_2 \sigma^2_{t-1}$
$\ln L = -0.5 \sum_{t=1}^{T}\big[\ln(2\pi) + \ln\sigma^2_t + (y_t - \beta_0 - \beta_1\sigma^2_t)^2/\sigma^2_t\big]$
Maximizing the log-likelihood is a numerical problem, various methods:
Brute force (but using good routines, e.g. FMINSEARCH in Matlab)
Newton's method: $\theta^{(i+1)} = \theta^{(i)} - H_{(i)}^{-1}\, g_{(i)}$, use actual Hessian
Score method: $\theta^{(i+1)} = \theta^{(i)} - \{E[H_{(i)}]\}^{-1}\, g_{(i)}$, use expected Hessian
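
Purely as an illustration (not part of the slides), a rough Python sketch of this GARCH-in-mean Gaussian log-likelihood; the parameter ordering and the initialization of $\sigma^2_1$ at the sample variance are assumptions. The returned value could then be minimized with a generic routine such as scipy.optimize.minimize, keeping $\alpha_0 > 0$ and $\alpha_1, \alpha_2 \geq 0$.

```python
# Negative Gaussian log-likelihood of the GARCH-in-mean model above
import numpy as np

def garch_m_neg_loglik(params, y):
    b0, b1, a0, a1, a2 = params
    T = len(y)
    sigma2 = np.empty(T)
    sigma2[0] = y.var()                       # initialization assumption
    eps_prev = 0.0
    ll = 0.0
    for t in range(T):
        if t > 0:
            sigma2[t] = a0 + a1 * eps_prev ** 2 + a2 * sigma2[t - 1]
        eps = y[t] - b0 - b1 * sigma2[t]      # GARCH-in-mean residual
        ll += -0.5 * (np.log(2 * np.pi) + np.log(sigma2[t]) + eps ** 2 / sigma2[t])
        eps_prev = eps
    return -ll                                # negative log-likelihood for minimizers
```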
173
Hypothesis testing
Test of hypothesis $H_0: c(\theta) = 0$
Three tests, asymptotically equivalent (not in finite sample)
Likelihood ratio: If $c(\theta) = 0$ holds, then $\ln L_U - \ln L_R \approx 0$
Both unrestricted (ML) and restricted estimators are required
Wald test: If $c(\theta) = 0$ holds, then $c(\hat\theta_{ML}) \approx 0$
Only the unrestricted (ML) estimator is required
Lagrange multiplier test: If $c(\theta) = 0$ holds, then $\partial \ln L/\partial\theta$ evaluated at $\hat\theta_R$ is $\approx 0$
Only the restricted estimator is required
174
Likelihood ratio test
$L_U = L(\hat\theta_U)$, where $\hat\theta_U$ is the MLE, unrestricted
$L_R = L(\hat\theta_R)$, where $\hat\theta_R$ is the restricted estimator
Likelihood ratio: $0 \leq L_R/L_U \leq 1$
Limiting distribution of likelihood ratio: $2\,(\ln L_U - \ln L_R) \xrightarrow{d} \chi^2_{df}$
with df = # of restrictions
Remarks:
LR test cannot be used to test two restricted models, $\hat\theta_U$ must be the MLE
Likelihood function $L$ must be the same in $L_U$ and $L_R$
175
Wald test
Wald test based on full rank quadratic forms
Recall: If $x \sim N(\mu, \Sigma)$, quadratic form $(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi^2_{(J)}$
If $E[x] \neq \mu$, $(x - \mu)'\Sigma^{-1}(x - \mu) \sim$ noncentral $\chi^2_{(J)}$ ($> \chi^2_{(J)}$ on average)
If $H_0: c(\theta) = q$ is true, $c(\hat\theta_{ML}) - q \approx 0$ (not $= 0$ due to sampling variability)
If $H_0: c(\theta) = q$ is false, $c(\hat\theta_{ML}) - q \gg 0$ or $\ll 0$
Wald test statistic:
$W = [c(\hat\theta_{ML}) - q]'\, \{\text{Asy.Var}[c(\hat\theta_{ML}) - q]\}^{-1}\, [c(\hat\theta_{ML}) - q] \sim \chi^2_{df}$
with df = # of restrictions
Drawbacks: no $H_1$ $\Rightarrow$ limited power; not invariant to restriction formulation
176
Lagrange multiplier test
Lagrange multiplier (or score) test based on the restricted model
Restrictions $H_0: c(\theta) = q$, Lagrangean: $\ln L(\theta) + \lambda'(c(\theta) - q)$
First order conditions for restricted $\theta$, i.e. $\hat\theta_R$:
$\frac{\partial \ln L(\theta)}{\partial\theta} + \frac{\partial c(\theta)'}{\partial\theta}\lambda = 0$
If restrictions not binding $\Rightarrow$ $\lambda = 0$ (first term MLE) and $\lambda$ can be tested.
Simpler, equivalent approach: at the restricted maximum
$\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R} + \frac{\partial c(\hat\theta_R)'}{\partial\hat\theta_R}\lambda = 0 \;\Rightarrow\; -\frac{\partial c(\hat\theta_R)'}{\partial\hat\theta_R}\lambda = \frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R} = g_R$
Under $H_0: \lambda = 0$, $g_R = \sum_{i=1}^{n} g_i(\hat\theta_R) \approx 0$
Recall, $\text{Var}_0\big[\sum_{i=1}^{n} g_i(\theta_0)\big] = -E_0[\partial^2 \ln L/\partial\theta_0\partial\theta_0'] = I(\theta_0)$
177
Lagrange multiplier test statistic
As in the Wald test, the LM statistic is a full rank quadratic form:
$LM = \Big[\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\Big]'\, \big[I(\hat\theta_R)\big]^{-1}\, \Big[\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\Big] \sim \chi^2_{df}$
with df = # of restrictions
Alternative calculation of LM test: define $G_R = [g_1(\hat\theta_R), \ldots, g_n(\hat\theta_R)]'$ ($n \times K$),
regress a column of 1s, $i$, on $G_R$ $\Rightarrow$ slope $b_i = (G_R'G_R)^{-1}G_R'\,i$
uncentered $R^2_i = \frac{\hat i'\hat i}{i'i} = \frac{(b_i'G_R')(G_R b_i)}{i'i} = \frac{\big(i'G_R(G_R'G_R)^{-1}G_R'\big)\big(G_R(G_R'G_R)^{-1}G_R'\,i\big)}{n} = \frac{i'G_R(G_R'G_R)^{-1}G_R'\,i}{n} = \frac{LM}{n}$
178
Application of MLE: Linear regression model
Model: $y_i = x_i'\beta + \varepsilon_i$, and $y_i|x_i \sim N(x_i'\beta, \sigma^2)$
Log-likelihood based on n conditionally independent observations:
$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i - x_i'\beta)^2}{\sigma^2}$
$= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}$
Likelihood equations:
$\frac{\partial \ln L}{\partial\beta} = \frac{X'(y - X\beta)}{\sigma^2} = 0$
$\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{(y - X\beta)'(y - X\beta)}{2\sigma^4} = 0$
179
MLE of linear regression model
Solving likelihood equations:
$\hat\beta_{ML} = (X'X)^{-1}X'y$ and $\hat\sigma^2_{ML} = \frac{e'e}{n}$
$\hat\beta_{ML} = b$ = OLS has all desirable asymptotic properties of MLE
$\hat\sigma^2_{ML} \neq s^2 = e'e/(n - K)$ $\Rightarrow$ $\hat\sigma^2_{ML}$ biased in finite samples, but
$E[\hat\sigma^2_{ML}] = E\Big[\frac{(n - K)}{n}\, s^2\Big] = \frac{(n - K)}{n}\,\sigma^2 \to \sigma^2, \quad n \to \infty$
Cramer–Rao lower bound for $\hat\theta_{ML} = (\hat\beta_{ML}',\, \hat\sigma^2_{ML})'$ can be computed explicitly:
$\{I(\theta)\}^{-1} = \Big\{-E\Big[\frac{\partial^2 \ln L(\theta)}{\partial\theta\partial\theta'}\Big]\Big\}^{-1} = \begin{bmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}$
180
MLE and Wald test
Testing J (possibly nonlinear) restrictions, $H_0: c(\beta) = 0$ vs. $H_1: c(\beta) \neq 0$
Idea: check whether the unrestricted estimator (i.e. MLE) satisfies the restrictions
Under $H_0$, Wald statistic
$W = c(b)'\Big[\frac{\partial c(b)}{\partial b'}\,\sigma^2 (X'X)^{-1}\,\frac{\partial c(b)'}{\partial b}\Big]^{-1} c(b) \xrightarrow{d} \chi^2_{(J)}$
where $\sigma^2 (X'X)^{-1} = \text{Asy.Var}[b]$, using the Delta method:
$c(b) \approx c(\beta) + \frac{\partial c(\beta)}{\partial\beta'}(b - \beta)$
$\text{Asy.Var}[c(b)] = \frac{\partial c(\beta)}{\partial\beta'}\,\text{Asy.Var}[b]\,\frac{\partial c(\beta)'}{\partial\beta}$
and $\text{plim}\ b = \beta$, $\text{plim}\ c(b) = c(\beta)$, $\text{plim}\ \partial c(b)/\partial b' = \partial c(\beta)/\partial\beta'$
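
A hedged sketch (not from the slides) of this Wald test for the illustrative nonlinear restriction $c(\beta) = \beta_1\beta_2 - 1 = 0$ on simulated data; the restriction, data, and variable names are assumptions, and the Jacobian of $c$ is computed analytically.

```python
# Wald test of a nonlinear restriction on OLS/ML estimates via the Delta method
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 2.0, 0.5])            # beta1*beta2 = 1, so H0 is true
y = X @ beta_true + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)            # OLS = ML estimate of beta
e = y - X @ b
sigma2_hat = e @ e / n                           # ML estimate of sigma^2
avar_b = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated Asy.Var[b]

c = np.array([b[1] * b[2] - 1.0])                # c(b), J = 1 restriction
C = np.array([[0.0, b[2], b[1]]])                # Jacobian dc/db'
W = c @ np.linalg.inv(C @ avar_b @ C.T) @ c      # Wald statistic
print(W.item(), chi2.sf(W.item(), df=1))
```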
181
MLE and Likelihood ratio test
Testing J (possibly nonlinear) restrictions, $H_0: c(\beta) = 0$ vs. $H_1: c(\beta) \neq 0$
Idea: check whether the unrestricted $\ln L$ is significantly larger than the restricted $\ln L^*$
Likelihood ratio test: $b$ unrestricted, $b^*$ restricted slopes
FOC of $\sigma^2$ implies: $\text{Est.}[\sigma^2] = (y - X\hat\beta)'(y - X\hat\beta)/n$, with $\hat\beta = b$ or $b^*$
$LR = 2\,[\ln L - \ln L^*]$
$= \Big[-n\ln\hat\sigma^2 - \frac{(y - Xb)'(y - Xb)}{\hat\sigma^2}\Big] - \Big[-n\ln\hat\sigma^2_* - \frac{(y - Xb^*)'(y - Xb^*)}{\hat\sigma^2_*}\Big]$
$= n\ln\hat\sigma^2_* - n\ln\hat\sigma^2 \;\xrightarrow{d}\; \chi^2_{(J)}$
plugging $\hat\sigma^2$ in $\ln L$, and $\hat\sigma^2_*$ in $\ln L^*$ (i.e. concentrating the log-likelihood)
$\Rightarrow$ second terms in square brackets both equal $n$ and cancel out
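
A small sketch (illustrative assumptions: simulated data, the linear restriction $\beta_2 = 0$) of the concentrated-likelihood LR statistic above.

```python
# LR test in the linear regression model: LR = n ln(sigma2_restricted / sigma2_unrestricted)
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

def ml_sigma2(y, X):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e / len(y)

sigma2_u = ml_sigma2(y, X)            # unrestricted: all regressors
sigma2_r = ml_sigma2(y, X[:, :2])     # restricted: drop the last regressor (beta_2 = 0)
LR = n * (np.log(sigma2_r) - np.log(sigma2_u))
print(LR, chi2.sf(LR, df=1))
```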
182
MLE and Lagrange multiplier test
Testing J (possibly nonlinear) restrictions, $H_0: c(\beta) = 0$ vs. $H_1: c(\beta) \neq 0$
Idea: gradient of $\ln L$ at the restricted maximum, $g_R$, should be close to zero
From Lagrangean: $g_R(\beta) = \partial \ln L(\beta)/\partial\beta = -\big(\partial c(\beta)'/\partial\beta\big)\lambda$
Under $H_0: \lambda = 0$, $E_0[g_R(\beta)] = E_0[X'\varepsilon/\sigma^2] = 0$ $\Rightarrow$ $X'e^* \approx 0$
Lagrange multiplier: apply a Wald-type test to the restricted gradient of $\ln L$
$LM = e^{*\prime}X\,(\text{Est.Var}[X'\varepsilon])^{-1}X'e^* = e^{*\prime}X\,(\hat\sigma^2_*\, X'X)^{-1}X'e^* = \frac{e^{*\prime}X(X'X)^{-1}X'e^*}{e^{*\prime}e^*/n} = n R^2_* \sim \chi^2_{(J)}$
$R^2_*$ in the regression of restricted residuals $e^* = (y - Xb^*)$ on $X$
Intuition: if restrictions not binding, $b^* \approx b$, $e^* \perp X$, $LM \approx 0$
183
Pseudo Maximum Likelihood estimation
ML requires complete specification of $f(y_i|x_i; \theta)$
What if the density is misspecified?
Under certain conditions, the estimator retains some good properties
even if the wrong likelihood is maximized
E.g., in the model $y_i = x_i'\beta + \varepsilon_i$, OLS is MLE when $\varepsilon_i \sim N(0, \sigma^2)$,
but under certain conditions LS is still consistent, even when $\varepsilon_i \not\sim N(0, \sigma^2)$
When $\varepsilon_i \not\sim N(0, \sigma^2)$, OLS is maximizing the wrong likelihood
Key point: OLS solves normal equations $E[x_i(y_i - x_i'\beta)] = 0$
These equations might hold even when $\varepsilon_i \not\sim N(0, \sigma^2)$
184
Pseudo Maximum Likelihood estimator

$\hat\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \ln f(y_i|x_i; \theta)$, where $f(y_i|x_i; \theta)$ is the true p.d.f.
$\hat\theta_{PML} = \arg\max_{\theta} \sum_{i=1}^{n} \ln h(y_i|x_i; \theta)$, where $h(y_i|x_i; \theta) \in$ {Exponential family}
Key point: $h(y_i|x_i; \theta) \neq f(y_i|x_i; \theta)$, possibly
If $h(y_i|x_i; \theta) = f(y_i|x_i; \theta)$, then $\hat\theta_{PML} = \hat\theta_{ML}$
E.g., estimate $\beta$ when $f(y_i|x_i; \theta) = N(x_i'\beta, \sigma^2_i)$ and $h(y_i|x_i; \theta) = N(x_i'\beta, \sigma^2)$
PML estimator solves first order conditions:
$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial \ln h(y_i|x_i; \hat\theta_{PML})}{\partial\hat\theta_{PML}} = 0$
185
Asymptotic distribution of PML estimator
Usual method: first order Taylor expansion of FOC, mean value theorem,
rearrange to have $(\hat\theta_{PML} - \theta_0)$ in LHS, scale by $\sqrt{n}$, take limit $n \to \infty$:
$\sqrt{n}(\hat\theta_{PML} - \theta_0) = \Big[-\frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \ln h(y_i|x_i; \bar\theta)}{\partial\bar\theta\partial\bar\theta'}\Big]^{-1} \sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n} \frac{\partial \ln h(y_i|x_i; \theta_0)}{\partial\theta_0}$
$\xrightarrow{d} \big[-H(\theta_0)\big]^{-1} N(0, \Phi)$
$\stackrel{d}{=} N\big(0,\; H(\theta_0)^{-1}\,\Phi\,H(\theta_0)^{-1}\big)$
$\hat\theta_{PML} \stackrel{a}{\sim} N\big(\theta_0,\; H(\theta_0)^{-1}\,\Phi\,H(\theta_0)^{-1}/n\big)$
If $h(y_i|x_i; \theta_0)$ is the true p.d.f., then the information matrix equality holds,
$\Phi = -H(\theta_0)$, and $\hat\theta_{PML} = \hat\theta_{ML}$:
$\hat\theta_{ML} \stackrel{a}{\sim} N\big(\theta_0,\; -H(\theta_0)^{-1}/n\big)$
186
Estimator of Asy.Var$[\hat\theta_{PML}]$
Sandwich (or robust) estimator of
$\text{Asy.Var}[\hat\theta_{PML}] = H(\theta_0)^{-1}\,\Phi\,H(\theta_0)^{-1}/n$
based on:
Empirical counterpart (no expectation) of the Hessian $H(\theta_0)$:
$\text{Est.}[H(\theta_0)] = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \ln h(y_i|x_i; \hat\theta_{PML})}{\partial\hat\theta_{PML}\partial\hat\theta_{PML}'}$
Sample variance of gradients:
$\text{Est.}[\Phi] = \frac{1}{n}\sum_{i=1}^{n} \Big[\frac{\partial \ln h(y_i|x_i; \hat\theta_{PML})}{\partial\hat\theta_{PML}}\Big]\Big[\frac{\partial \ln h(y_i|x_i; \hat\theta_{PML})}{\partial\hat\theta_{PML}}\Big]'$
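
A hedged sketch (not from the slides) of this sandwich covariance for Gaussian PML of the linear model with heteroscedastic errors; for the $\beta$ block it reduces to the familiar heteroscedasticity-robust covariance. The data and variable names are illustrative assumptions.

```python
# Sandwich (robust) covariance for the beta block of Gaussian PML in the linear model
import numpy as np

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma_i = 0.5 + np.abs(X[:, 1])                      # true errors are heteroscedastic
y = X @ np.array([1.0, 2.0]) + sigma_i * rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)                # PML estimate of beta (= OLS)
e = y - X @ b
s2 = e @ e / n                                       # PML estimate of sigma^2

# Beta blocks: Est.[H] = -X'X/(n s2), Est.[Phi] = X' diag(e^2) X / (n s2^2)
H_bar = -(X.T @ X) / (n * s2)
Phi_hat = (X * (e ** 2)[:, None]).T @ X / (n * s2 ** 2)
H_inv = np.linalg.inv(H_bar)
avar_b = H_inv @ Phi_hat @ H_inv / n                 # sandwich Asy.Var[beta_PML]
print(np.sqrt(np.diag(avar_b)))                      # robust standard errors
```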
187
Remarks on PML estimation
In general, maximizing wrong likelihoods gives inconsistent estimates
(in those cases, the sandwich estimator of Asy.Var$[\hat\theta_{PML}]$ is useless)
Under certain conditions, $\hat\theta_{PML}$ robust to some model misspecification
Major advantage of PML: if $h(y_i|x_i; \theta_0)$ is the true p.d.f., then $\hat\theta_{PML} = \hat\theta_{ML}$
(in those cases, the sandwich estimator should not be used)
Typical application of PML in Finance: daily asset returns are not normal,
but GARCH volatility models typically estimated using Gaussian likelihoods
188
Summary of the course
Linear regression model: OLS estimator, specification and hypothesis testing
Generalized regression model: heteroscedastic data, GLS estimator
Panel data model: Fixed and Random effects, Hausman's specification test
Instrumental variables: regressors correlated with disturbances
Generalized method of moments: general framework for inference, weak
assumptions
Maximum likelihood estimation: assume parametric DGP, best use of this
information
Hypothesis testing: Likelihood ratio, Wald, Lagrange multiplier tests
189
