Generalized Least Squares
(Handout Version)∗
Walter Belluzzo
Econ 507 – Spring 2013
1 Introduction
Efficiency of the OLS Estimator
• Remember that the OLS estimator is efficient (best linear unbiased estimator) if the DGP
belongs to the regression model.
• For efficiency of least squares, the error terms must be uncorrelated and have equal
variance, Var(u) = σ²I.
• The usual estimators of the covariance matrices of the OLS and NLS estimators are not
valid when these assumptions do not hold.
• Alternative “sandwich” covariance matrix estimators that are asymptotically valid can be
obtained, but the inefficiency of the estimator β̂ remains.
• Non-spherical disturbances affect linear and nonlinear regression models in the same
way, so we can focus our attention on the simpler, linear case:
y = Xβ + u,   E(uu′) = Ω.
• The idea for obtaining an efficient estimator of the vector β in this model is to find a
transformation of the model that makes the Gauss-Markov conditions hold.
• The resulting efficient estimator (why?) is called the generalized least squares, or
GLS, estimator.
∗ This lecture is based on D&M Chapter 6.
• The covariance matrix of the transformed error vector Ψ′u is
E(Ψ′uu′Ψ) = Ψ′E(uu′)Ψ = Ψ′ΩΨ.
• To make the expression on the far right-hand side reduce to the identity matrix, we must define Ψ
such that
Ω⁻¹ = ΨΨ′.
• Premultiplying the original model by Ψ′ then yields the transformed regression
Ψ′y = Ψ′Xβ + Ψ′u.
• Because the covariance matrix Ω is nonsingular, the matrix Ψ must be as well, and so
the transformed regression model is perfectly equivalent to the original model.
• Applying OLS to the transformed model yields
β̂gls = (X′ΨΨ′X)⁻¹X′ΨΨ′y = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y.
• Since β̂gls is just the OLS estimator for the transformed model, its covariance matrix can
be found directly from the usual OLS covariance matrix, σ²(X′X)⁻¹, evaluated at the transformed
regressors Ψ′X (with σ² = 1, since the transformed errors have covariance matrix I):
Var(β̂gls) = (X′ΨΨ′X)⁻¹ = (X′Ω⁻¹X)⁻¹.
A minimal numerical check is sketched below.
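A minimal numerical sketch (not part of the handout; the data, names, and the diagonal Ω are illustrative assumptions) showing that β̂gls computed from the formula above coincides with OLS applied to the Ψ′-transformed data:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta0 = np.array([1.0, 0.5, -2.0])

# Example Omega: heteroskedastic but uncorrelated errors (diagonal covariance).
omega2 = np.exp(rng.normal(size=n))              # error variances
u = np.sqrt(omega2) * rng.normal(size=n)
y = X @ beta0 + u

Omega_inv = np.diag(1.0 / omega2)
# GLS from the explicit formula (X'Omega^{-1}X)^{-1} X'Omega^{-1}y.
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# GLS as OLS on the transformed model Psi'y = Psi'X beta + Psi'u,
# where Psi is a Cholesky factor of Omega^{-1}, so that Omega^{-1} = Psi Psi'.
Psi = np.linalg.cholesky(Omega_inv)
beta_ols_transformed, *_ = np.linalg.lstsq(Psi.T @ X, Psi.T @ y, rcond=None)

print(np.allclose(beta_gls, beta_ols_transformed))   # True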
• The generalized least squares estimator β̂gls can also be obtained by minimizing the GLS
criterion function
(y − Xβ)′Ω⁻¹(y − Xβ),
which is just the sum of squared residuals from the transformed regression.
• This can be viewed as the SSR function from the original model, weighted by the inverse
of the matrix Ω.
• The effect of such a weighting scheme is clearest when Ω is a diagonal matrix. In that
case, the weight given to the tth observation is proportional to the inverse of Var(uₜ).
• The GLS estimator β̂gls defined above is also the solution of the set of moment conditions
X′Ω⁻¹(y − Xβ̂gls) = 0,
which are the same moment conditions as before, W′(y − Xβ̂) = 0, now with W = Ω⁻¹X.
• It is easy to verify that these moment conditions are equivalent to the first-order conditions
for the minimization of the GLS criterion function (do it as an exercise).
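As a hint for the exercise, differentiating the GLS criterion function with respect to β gives
∂/∂β [(y − Xβ)′Ω⁻¹(y − Xβ)] = −2X′Ω⁻¹(y − Xβ),
and setting this derivative equal to zero at β = β̂gls reproduces exactly the moment conditions above.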
• Suppose that the DGP is a special case of that model, with parameter vector β0 and
known covariance matrix Ω.
• To show efficiency of β̂gls, we proceed as in previous cases and show that the difference of
the precision matrices of β̂gls and of the estimator β̂w based on the moment conditions
W′(y − Xβ) = 0, for an arbitrary n × k matrix W,
X′Ω⁻¹X − X′W(W′ΩW)⁻¹W′X,                      (1)
is positive semidefinite.
• This difference being positive semidefinite means that any other choice of the variables W
yields a larger variance than W = Ω⁻¹X.
• In fact, β̂gls is typically more efficient for all elements of β, because it is only in very
special cases that the matrix (1) will have any zero diagonal elements.
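One standard way to verify that the matrix (1) is positive semidefinite (a sketch, using the factorization Ω⁻¹ = ΨΨ′ introduced above): define Z ≡ Ψ′X and V ≡ Ψ⁻¹W, so that
X′Ω⁻¹X = Z′Z,   W′ΩW = V′V,   W′X = V′Z.
Then
X′Ω⁻¹X − X′W(W′ΩW)⁻¹W′X = Z′Z − Z′V(V′V)⁻¹V′Z = Z′M_V Z = (M_V Z)′(M_V Z),
where M_V ≡ I − V(V′V)⁻¹V′ is an orthogonal projection matrix, and any matrix of the form A′A is positive semidefinite.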
• Note that β̂w reduces to the OLS estimator when W = X. Thus these conclusions apply, in
particular, to the OLS estimator β̂.
• In general, computation of the GLS estimator will be easy only if the matrix Ψ has a
form that allows us to calculate Ψ′x without having to store Ψ itself in memory.
• Suppose that Ω = σ²∆, where the n × n matrix ∆ is known to the investigator, but the
positive scalar σ² is unknown.
• The OLS estimates from the transformed regression based on the modified Ψ, defined so that
∆⁻¹ = ΨΨ′, are numerically identical to β̂gls:
(X′∆⁻¹X)⁻¹X′∆⁻¹y = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y = β̂gls,
since the factors of σ⁻² cancel.
• Thus the GLS estimates will be the same whether we use Ω or ∆, that is, whether or not
we know σ².
• Its covariance matrix, however, does depend on σ²:
Var(β̂gls) = σ²(X′∆⁻¹X)⁻¹,
which can be estimated by replacing σ² with the usual OLS estimator of the error variance,
s², from the transformed regression.
• Suppose now that Ω is diagonal with tth diagonal element ωₜ². Then Ω⁻¹ is a diagonal
matrix with tth diagonal element ωₜ⁻², and thus Ψ will be a diagonal matrix with tth
diagonal element ωₜ⁻¹.
• This special case of GLS estimation is often called weighted least squares, or WLS.
• The weight given to observation t is ωₜ⁻¹, and thus observations for which the variance
of the error term is large (small) are given low (high) weight.
• Note that all the variables in the regression, including the constant term, must be multi-
plied by the same weights.
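A minimal WLS sketch (not from the handout; the data-generating process and names are illustrative assumptions), emphasizing that the constant column is weighted along with everything else:

import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
omega = np.exp(0.5 * x)                      # known error standard deviations
y = 1.0 + 2.0 * x + omega * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])         # regressors, constant included
Xw, yw = X / omega[:, None], y / omega       # divide ALL columns and y by omega_t
beta_wls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(beta_wls)                              # close to [1.0, 2.0]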
• Note that the R² only makes sense in terms of the transformed regressand, since undoing
the weighting does not preserve the orthogonality of the residuals and fitted values. That is,
û ⊥ ŷ  ⇏  Ψ⁻¹û ⊥ Ψ⁻¹ŷ.
• For a nonlinear regression model y = x(β) + u with E(uu′) = Ω, the GLS estimator similarly
solves the moment conditions
X′(β)Ω⁻¹(y − x(β)) = 0,
where X(β) denotes the matrix of derivatives of x(β) with respect to β.
• Life is much easier if there is heteroskedasticity and no serial correlation. In this case, we
can simply use weighted least squares.
• But even in this case some information on ωₜ is still necessary, such as the sampling design
or a direct relationship between E(uₜ²) and some variable zₜ that can be used as a weight.
• In practice, the covariance matrix Ω is often not known even up to a scalar factor. This
makes it impossible to compute GLS estimates.
• Instead, we can model Ω as a function of a vector of parameters γ that can be estimated
consistently, say by γ̂, and compute GLS using the estimate
Ω̂ = Ψ(γ̂)Ψ′(γ̂).
• The resulting estimator is called the feasible generalized least squares, or feasible GLS, estimator.
• In the same way that a regression function determines the conditional mean of a random
variable, a skedastic function determines its conditional variance; a common example, used below, is
E(uₜ² | Zₜ) = exp(Zₜγ),
which is positive for any γ.
• We can then obtain OLS estimates γ̂ by running the auxiliary linear regression
log ûₜ² = Zₜγ + vₜ,
for all t, where the ûₜ are the OLS residuals from the original regression.
• Finally, feasible GLS estimates of β are obtained by using ordinary least squares to esti-
mate the weighted regression, with the estimates ω̂ₜ replacing the unknown ωₜ (for the
exponential skedastic function above, ω̂ₜ² = exp(Zₜγ̂)), as sketched in code below:
(1/ω̂ₜ) yₜ = (1/ω̂ₜ) Xₜβ + (1/ω̂ₜ) uₜ.
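A self-contained sketch of this two-step procedure (illustrative only; the exponential skedastic function and all names are assumptions, not part of the handout):

import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x])         # variables driving the variance
omega = np.exp(Z @ np.array([-1.0, 1.5]) / 2.0)   # true standard deviations
y = X @ np.array([1.0, 2.0]) + omega * rng.normal(size=n)

# Step 1: OLS and residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_ols

# Step 2: auxiliary regression log(u_hat^2) = Z gamma + v.
gamma_hat, *_ = np.linalg.lstsq(Z, np.log(u_hat**2), rcond=None)
omega_hat = np.exp(Z @ gamma_hat / 2.0)      # estimated standard deviations

# Step 3: feasible GLS = OLS on the data weighted by 1 / omega_hat.
beta_fgls, *_ = np.linalg.lstsq(X / omega_hat[:, None], y / omega_hat, rcond=None)
print(beta_ols, beta_fgls)                   # both consistent; FGLS more precise

(The intercept of the auxiliary regression is biased, but this only rescales every ω̂ₜ by the same constant factor, which does not affect the resulting estimates of β.)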
• Under suitable regularity conditions, it can be shown that this type of procedure yields
a feasible GLS estimator β̂f that is consistent and asymptotically equivalent to the GLS
estimator β̂gls .
• If we substitute Xβ0 + u for y into the formula for the GLS estimator, we find that
√n (β̂gls − β0) = ((1/n) X′Ω⁻¹X)⁻¹ (1/√n) X′Ω⁻¹u.
• As usual, we assume sufficient conditions for the first factor in the right-hand side to tend
to a non-stochastic k × k matrix.
• Then, we apply a CLT to the second factor to conclude that it is an asymptotically normal
random vector, and thus obtain root-n consistency and asymptotic normality.
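Putting the two factors together, the limiting distribution (a standard result, stated here for completeness) is
√n (β̂gls − β0) → N( 0, (plim (1/n) X′Ω⁻¹X)⁻¹ )   in distribution,
which is consistent with the finite-sample result Var(β̂gls) = (X′Ω⁻¹X)⁻¹ obtained earlier.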
• Following the same argument for the feasible GLS estimator, we find that
√n (β̂f − β0) ≈ ( plim (1/n) X′Ω⁻¹(γ̂)X )⁻¹ plim ( (1/√n) X′Ω⁻¹(γ̂)u ),
where ≈ denotes asymptotic equality.
• If Ω(γ̂) is a very good estimate, then feasible GLS will have essentially the same properties
as GLS itself.
• As a result, inferences should be reasonably reliable, even though they will not be exact
in finite samples.
• On the other hand, if Ω(γ̂) is a poor estimate, feasible GLS estimates may have quite
different properties from real (infeasible) GLS estimates, and inferences may be quite
misleading.
• The feasible GLS procedure can be iterated, using the new residuals to re-estimate γ and
then β; the iteration can either be stopped after a predetermined number of rounds or continued
until convergence is achieved (although convergence is not guaranteed).
• Iteration does not change the asymptotic distribution of the feasible GLS estimator, but
it does change its finite-sample distribution.
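A compact sketch of such an iteration (same illustrative assumptions and exponential skedastic function as the feasible GLS sketch above; all names are hypothetical):

import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = Z = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + np.exp(Z @ np.array([-1.0, 1.5]) / 2.0) * rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # start from OLS
for rounds in range(1, 21):                        # predetermined maximum of 20 rounds
    u_hat = y - X @ beta                           # residuals at the current beta
    gamma, *_ = np.linalg.lstsq(Z, np.log(u_hat**2), rcond=None)
    w = np.exp(Z @ gamma / 2.0)                    # updated standard deviations
    beta_new, *_ = np.linalg.lstsq(X / w[:, None], y / w, rcond=None)
    converged = np.max(np.abs(beta_new - beta)) < 1e-8
    beta = beta_new
    if converged:                                  # stop at convergence
        break
print(rounds, beta)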
• Another way to estimate models in which the covariance matrix of the error terms depends
on one or more unknown parameters is to use the method of maximum likelihood.
• As we will see later on, in this case, β and γ are estimated jointly and consistency will
follow if the maximum likelihood regularity conditions are satisfied.
• In many cases, an iterated feasible GLS estimator will be the same as a maximum likeli-
hood estimator based on the assumption of normally distributed errors.
• If the true DGP is heteroskedastic but we estimate a model that assumes homoskedasticity,
the DGP will not be included in the estimated model, and therefore there is a specification error.
• The specification error does not bias the OLS estimator, but renders it inefficient, as the
sandwich form of its covariance matrix suggests.
• As we have seen, we can compute asymptotically valid covariance matrix estimates for
the (inefficient) OLS and NLS parameter estimates.
• So, what if we choose to assume heteroskedasticity and settle for an inefficient estimator,
but the true DGP is homoskedastic?
• Simulation experiments suggest that this specification error frequently has little cost.
• This evidence can be taken as an indication that it may be prudent to employ an HCCME
anyway, especially if the sample size is large.
• However, in finite samples, tests and confidence intervals based on HCCMEs will always
be somewhat less reliable than ones based on the usual OLS covariance matrix under
homoskedasticity.
• If we have information on the form of the skedastic function, we might well wish to use
feasible generalized least squares, which is asymptotically equivalent to the efficient
(infeasible) GLS estimator.
• However, the small-sample properties of feasible generalized least squares depend critically
on the estimate Ω̂.
• So, if the true DGP is homoskedastic and we assume heteroskedasticity, we can expect
that the specification error may be costly in small samples.
• Tests for heteroskedasticity are typically based on a skedastic model of the form
E(uₜ² | Ωₜ) = h(δ + Zₜγ),
where the skedastic function h(·) is a nonlinear function that can take on only posi-
tive values, Zₜ is a 1 × r vector of observations on exogenous or predetermined variables
that belong to the information set Ωₜ, δ is a scalar parameter, and γ is an r-vector of
parameters.
• Under the null hypothesis that γ = 0, the function h(δ + Zₜγ) collapses to h(δ), a constant.
• A test of this null can be based on the artificial regression
uₜ² = h(δ + Zₜγ) + vₜ.
• Alternatively, you can define vₜ as the difference between uₜ² and its conditional expecta-
tion, and rewrite the skedastic function as in the last expression.
• So, let us take a first-order Taylor expansion of this regression around γ = 0 and
δ = δ̃ ≡ h⁻¹(σ̃²), where σ̃² is the sample variance of the uₜ:
uₜ² − σ̃² = h′(δ̃)bδ + h′(δ̃)Zₜbγ + residual.
• For the purpose of testing the null hypothesis that γ = 0, this regression is equivalent to
uₜ² = bδ + Zₜbγ + residual,
with a suitable redefinition of the artificial parameters bδ and bγ, which does not depend
on the functional form of h(·).
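Concretely (a small completing step, not spelled out in the handout): since σ̃² and h′(δ̃) are constants, the two artificial regressions differ only through the reparametrization
bδ* = σ̃² + h′(δ̃)bδ,   bγ* = h′(δ̃)bγ,
so that, provided h′(δ̃) ≠ 0, the hypothesis bγ = 0 holds if and only if bγ* = 0.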
• It can be shown that replacing uₜ² by ûₜ² does not change the asymptotic distribution of
the F and nR² statistics for testing the hypothesis bγ = 0.
• The last issue is to choose the variables to be included in Z. White suggests including
all squares and cross-products of the variables in X (why?), which results in the White
Test.
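A sketch of the resulting nR² test (illustrative only; the data-generating process and names are assumptions, not from the handout):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + np.exp(0.5 * x1) * rng.normal(size=n)

# OLS residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u2 = (y - X @ beta) ** 2

# White-style Z: levels, squares and cross-products of the regressors.
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
b, *_ = np.linalg.lstsq(Z, u2, rcond=None)
R2 = 1.0 - np.sum((u2 - Z @ b) ** 2) / np.sum((u2 - u2.mean()) ** 2)

stat = n * R2                                    # nR² statistic
df = Z.shape[1] - 1                              # number of slope coefficients in Z
print(stat, stats.chi2.sf(stat, df))             # statistic and asymptotic p-value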
• The general form of the test is basically the Breusch-Pagan Test. We will derive the
limiting distribution of this test later, in a more convenient framework.
• Since the asymptotic approximations for these test statistics may be inaccurate in finite
samples, bootstrapping them when the sample size is small or moderate may be a good
idea, as in the sketch below.
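A sketch of one possible bootstrap scheme, imposing the null of homoskedasticity by resampling the OLS residuals (continuing the previous sketch and reusing the objects defined there; the scheme and names are illustrative assumptions, not from the handout):

def n_r2_stat(y, X, Z):
    # nR² statistic from the auxiliary regression of squared residuals on Z.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2
    b, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    r2 = 1.0 - np.sum((u2 - Z @ b) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(y) * r2

stat = n_r2_stat(y, X, Z)                        # observed statistic
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta
B, count = 999, 0
for _ in range(B):
    u_star = rng.choice(u_hat, size=n, replace=True)   # i.i.d. resampled residuals
    y_star = X @ beta + u_star                   # bootstrap DGP is homoskedastic
    if n_r2_stat(y_star, X, Z) >= stat:
        count += 1
print((count + 1) / (B + 1))                     # bootstrap p-value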