Michele Pellizzari
IGIER-Bocconi, IZA and fRDB
1 Introduction
1.1 Brief aside on asymptotic theory
Most of you should have already seen the things we discuss in this section
in previous econometrics courses. It is still perhaps useful to review some of
these concepts to refresh ideas and also to harmonize our languages so that
we all know what we are talking about and why. In this discussion I will
try to give you the intuition of the statistical concepts that we need for this
course but I will also avoid much of the technicalities and formalities that
you typically see in a statistics course.1
Earlier on we said that we wanted to produce correct estimators for some
parameters of interest and so far we only specified that we want such estima-
tors to be consistent. Moreover, we also left the question of the distribution of
the estimator unanswered. That is a very important question that we really
want to address because otherwise we would not be able to run tests of our
results (for example, we may want to test if our β of interest is significantly
different from zero) and to do that we need to know the distribution of our
estimator.
∗
These notes are largely based on the textbook by Jeffrey M. Wooldridge. 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press. Contact details: Michele Pellizzari, IGIER-Bocconi, via Roentgen 1, 20136-Milan (Italy). michele.pellizzari@unibocconi.it
1
Such technicalities and formalities are nonetheless extremely important! So, if you
feel that the material in this section is completely new or very hard, you may want to go
back and revise some of your notes from statistics.
In this section we will review briefly the basic statistical tools and concepts
needed to derive asymptotic consistency and the asymptotic distribution of
estimators. But before we move to reviewing these concepts, it is perhaps
useful to clarify why we focus on the asymptotic properties of estimators and
what that means exactly.
1.1.2 Convergence in probability
Let us start with a simple math definition:

Definition 1 (Convergence of a deterministic sequence) A sequence of non-random numbers $\{a_N : N = 1, 2, \dots\}$ converges to $a$ if for all $\epsilon > 0$ there exists $N_\epsilon$ such that $|a_N - a| < \epsilon$ for all $N > N_\epsilon$, and we write

$$a_N \to a \quad \text{as } N \to \infty$$

Definition 2 (Convergence in probability) A sequence of random variables $\{x_N : N = 1, 2, \dots\}$ converges in probability to the constant $a$ if for all $\epsilon > 0$, $P(|x_N - a| > \epsilon) \to 0$ as $N \to \infty$, and we write

$$x_N \xrightarrow{p} a \quad \text{or} \quad \text{plim}(x_N) = a$$

Theorem 1 (Weak Law of Large Numbers) Let $\{w_i : i = 1, 2, \dots\}$ be a sequence of i.i.d. random vectors with $E(|w_i|) < \infty$. Then the sequence of sample averages converges in probability to the population mean:

$$N^{-1} \sum_{i=1}^N w_i \xrightarrow{p} E(w_i)$$
Definition 3 (Consistent estimator) Let $\{\hat\theta_N : N = 1, 2, \dots\}$ be a sequence of estimators of the $K \times 1$ vector $\theta$, where $N$ indexes sample size. $\hat\theta_N$ is a consistent estimator of $\theta$ if $\text{plim}\,\hat\theta_N = \theta$.
You must have seen theorem 1 somewhere in your previous courses and I hope you can now appreciate its power and beauty.2 It essentially says that if you simply have i.i.d. random numbers with finite means, the sample average of such random numbers will converge in probability to their true population mean. To bring this to our standard econometrics set-up, what we usually have is a vector of data for a sample of $N$ observations. For example, our $w_i$ might be the vector $(y_i, x_{i1}, x_{i2}, \dots, x_{iK})$ of the dependent variable and all the explanatory variables. The weak law of large numbers says that, if the observations are i.i.d. (and they certainly are if the sample is a random extraction from the population), then the sample averages of each variable are consistent estimators for the population averages. As you can certainly imagine, theorem 1 is the foundation of what we called the analogy principle, which we used earlier to construct $\hat\beta$. In fact, the analogy principle is an extension of the weak law of large numbers to functions of random variables. Theorem 1 simply says that, under its assumptions, sample averages converge in probability to population averages. The analogy principle needs more than that: it needs functions of sample averages to converge to the same functions of population averages. Fortunately, the following theorem guarantees precisely that:
2
I agree there are more beautiful things in life but still...
Theorem 2 (Slutsky's theorem) If $w_N \xrightarrow{p} w$ and if $g$ is a continuous function, then $g(w_N) \xrightarrow{p} g(w)$.
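As an illustration of theorems 1 and 2, here is a small simulation sketch in Python (the exponential distribution, its population mean of 2, and the function $g(w) = 1/w$ are all arbitrary choices): as $N$ grows, the sample average approaches the population mean, and the continuous transformation of the average approaches the transformation of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. draws with population mean 2.0 (exponential with scale 2)
for N in [100, 10_000, 1_000_000]:
    w = rng.exponential(scale=2.0, size=N)
    w_bar = w.mean()     # WLLN: plim(w_bar) = 2.0
    g_bar = 1.0 / w_bar  # Slutsky: plim(g(w_bar)) = g(2.0) = 0.5
    print(N, w_bar, g_bar)
```

Note that Slutsky's theorem is exactly what justifies the second computation: $1/\bar{w}$ is a nonlinear but continuous function of $\bar{w}$.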
Importantly, theorem 2 also motivates why we often prefer to look at
asymptotic consistency rather than unbiasedness, a property of estimators
that you have certainly seen in previous courses.
Definition 4 (Unbiased estimator) The random vector $\hat\beta$ is an unbiased estimator of the vector of constants $\beta$ if $E(\hat\beta) = \beta$.
In fact, the expectation operator does not pass through nonlinear functions. In particular, using the same notation as theorem 2, if $E(w_N) = w$ then $E(g(w_N)) = g(w)$ only if $g$ is a linear function (on top of being continuous, which is still required). This means that for estimators that are nonlinear functions of the random variables of our model we would not be able to establish unbiasedness. And most of the estimators that we will see in this course are in fact nonlinear.
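A minimal numerical check of this point (a Python sketch with a hypothetical two-point random variable; the values and the function $g(w) = w^2$ are arbitrary choices):

```python
import numpy as np

# a discrete random variable taking values 1 and 3 with equal probability
values = np.array([1.0, 3.0])
probs = np.array([0.5, 0.5])

def g(w):
    return w ** 2  # a nonlinear (though continuous) function

E_w = float(probs @ values)        # E(w) = 2.0
E_g_w = float(probs @ g(values))   # E(g(w)) = 5.0
g_E_w = g(E_w)                     # g(E(w)) = 4.0
print(E_g_w, g_E_w)  # 5.0 4.0
```

Since $E(g(w)) = 5 \neq 4 = g(E(w))$, the expectation does not pass through $g$; with a linear $g$ the two would coincide.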
say that the asymptotic distribution of any of these $x_N$'s is $F$ and we write it as follows:

$$x_N \overset{a}{\sim} F$$
Similarly to how we proceeded with consistency, we now need a theorem to make definition 5 operational, a theorem that allows us to identify the asymptotic distribution of a given estimator. Here is the theorem we need:

Theorem 3 (Central Limit Theorem) Let $\{w_i : i = 1, 2, \dots\}$ be a sequence of i.i.d. random vectors with $E(w_i) = 0$ and finite variance matrix $B = E(w_i w_i')$. Then:

$$N^{-1/2} \sum_{i=1}^N w_i \xrightarrow{d} N(0, B)$$
This theorem is even more powerful and more beautiful than the weak law of large numbers! It essentially says that, under conditions that are very easily met by most available datasets (the observations are i.i.d. with finite variance), the sample sum normalized by $N^{-1/2}$ (rather than by $N^{-1}$, as in the sample mean) is asymptotically distributed according to a normal. So regardless of the actual (small-sample) distribution of $w_i$, the asymptotic distribution is normal (which is in fact one of the reasons why the normal is actually called normal!).
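The central limit theorem too can be visualized with a quick simulation (a Python sketch; the exponential distribution, which is strongly skewed in small samples, and the replication counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 500, 20_000

# M replications of the normalized sum of N i.i.d. exponential(1) draws,
# centered at the population mean of 1
w = rng.exponential(scale=1.0, size=(M, N))
z = (w - 1.0).sum(axis=1) / np.sqrt(N)  # N^{-1/2} * sum of zero-mean draws

# if the CLT holds, z should behave like N(0, 1) draws:
print(z.std(), np.mean(np.abs(z) <= 1.96))
```

Even though each $w_i$ is far from normal, the standard deviation of the normalized sums is close to 1 and roughly 95% of them fall within $\pm 1.96$.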
with $x_{i1} = 1$.

Equation 1.2 is in fact a list of $K$ conditions which we can use to estimate the $K$ coefficients in $\beta$ as follows:

$$\hat\beta = \left( \sum_{i=1}^N x_i' x_i \right)^{-1} \sum_{i=1}^N x_i' y_i = \beta + \left( \sum_{i=1}^N x_i' x_i \right)^{-1} \sum_{i=1}^N x_i' u_i$$

Hence:

$$\text{plim}\,\hat\beta = \beta + \text{plim}\left[ \left( N^{-1} \sum_{i=1}^N x_i' x_i \right)^{-1} \left( N^{-1} \sum_{i=1}^N x_i' u_i \right) \right]$$
1. $\displaystyle \text{plim}\, N^{-1} \sum_{i=1}^N x_i' x_i = E(x_i' x_i) = A$
2. $\displaystyle \text{plim}\, N^{-1} \sum_{i=1}^N x_i' u_i = E(x_i' u_i) = 0$

so that:

$$\text{plim}\,\hat\beta = \beta + A^{-1} \cdot 0 = \beta$$
Notice that things are different for unbiasedness. Taking expectations of the OLS estimator gives:

$$E(\hat\beta) = \beta + E\left[ (X'X)^{-1} X' u \right]$$

and, since $(X'X)^{-1}X'u$ is a nonlinear function of the data, this expectation cannot be simplified using assumption 1.2 alone. Under the stronger assumption

$$E(u|X) = 0 \quad (1.4)$$

we can instead write:

$$E(\hat\beta|X) = \beta + E\left[ (X'X)^{-1} X' u \,|\, X \right] = \beta + (X'X)^{-1} X' E(u|X) = \beta$$

which shows that $\hat\beta$ is unbiased, but only under assumption 1.4 and not the weaker assumption 1.2.
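The consistency result can be replicated numerically (a Python sketch; the true coefficients and the regressor distributions are arbitrary choices, and the moment condition $E(x_i'u_i) = 0$ holds by construction):

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([1.0, 2.0, -0.5])  # true coefficients (arbitrary)

for N in [100, 10_000, 1_000_000]:
    X = np.column_stack([np.ones(N),          # x_{i1} = 1 (the constant)
                         rng.normal(size=N),
                         rng.uniform(size=N)])
    u = rng.normal(size=N)                    # independent errors: E(x_i'u_i) = 0
    y = X @ beta + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS: (X'X)^{-1} X'y
    print(N, np.abs(beta_hat - beta).max())
```

The largest deviation of $\hat\beta$ from $\beta$ shrinks as $N$ grows, as the plim derivation predicts.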
1.2.2 Asymptotic normality of OLS
To derive the asymptotic distribution of βb it is useful to start from the
following expression:

$$\sqrt{N}(\hat\beta - \beta) = \left[ N^{-1} \sum_{i=1}^N x_i' x_i \right]^{-1} \left[ N^{-1} \sum_{i=1}^N x_i' u_i \right] N^{1/2} = \left[ N^{-1} \sum_{i=1}^N x_i' x_i \right]^{-1} \left[ N^{-1/2} \sum_{i=1}^N x_i' u_i \right] \quad (1.5)$$
Since $N^{-1} \sum_{i=1}^N x_i' x_i$ converges in probability to the finite positive semidefinite matrix $A$ (as we have assumed earlier), some simple lemmas of theorems 1 and 3 guarantee that the asymptotic distribution of the expression in equation 1.5 is the same as the asymptotic distribution of $A^{-1} \left[ N^{-1/2} \sum_{i=1}^N x_i' u_i \right]$.4
Let us focus, then, on the expression $N^{-1/2} \sum_{i=1}^N x_i' u_i$, which satisfies all the assumptions of the central limit theorem (theorem 3): it is the sum, normalized by $N^{-1/2}$, of a sequence of i.i.d. random vectors with zero mean (and this is guaranteed by assumption 1.2) and finite variance. Then, by direct application of the central limit theorem:

$$N^{-1/2} \sum_{i=1}^N x_i' u_i \xrightarrow{d} N(0, B) \quad (1.6)$$

Combining this result with equation 1.5:

$$\sqrt{N}(\hat\beta - \beta) \xrightarrow{d} N(0, A^{-1} B A^{-1}) \quad (1.7)$$
4
We are not going into the details of these lemmas and how they work. You find all the details in the textbook.

Let us make things simple and assume homoskedasticity and no serial correlation, i.e. $Var(u_i|x_i) = E(u_i^2|x_i) = \sigma^2$, so that:5

$$B = Var(x_i' u_i) = E(x_i' u_i u_i x_i) = E(u_i^2 x_i' x_i) = \sigma^2 E(x_i' x_i) = \sigma^2 A$$
This simplifies expression 1.7 to:
$$\sqrt{N}(\hat\beta - \beta) \xrightarrow{d} N(0, \sigma^2 A^{-1}) \quad (1.8)$$

and makes it easy to proceed to the following derivation:

$$(\hat\beta - \beta) \xrightarrow{d} N(0, N^{-1} \sigma^2 A^{-1})$$

$$\hat\beta \xrightarrow{d} N(\beta, N^{-1} \sigma^2 A^{-1})$$

$$\hat\beta \xrightarrow{d} N\left(\beta, \sigma^2 \frac{[E(X'X)]^{-1}}{N}\right) \quad (1.9)$$
which defines the asymptotic distribution of the OLS estimator.6
We now know a lot about $\hat\beta$: in particular, we know its probability limit and its asymptotic distribution. However, in order to be able to really use the distribution in expression 1.9 we need one additional bit. In fact, in that expression the asymptotic variance-covariance matrix of $\hat\beta$ depends on a parameter $\sigma^2$ and a matrix $E(X'X)$ that need to be estimated. The simplest way of obtaining consistent estimators for these unknowns is, again, to apply the analogy principle and estimate them using their sample counterparts:7
$$\hat\sigma^2 = N^{-1} \sum_{i=1}^N \hat u_i^2 = N^{-1} \sum_{i=1}^N (y_i - x_i \hat\beta)^2 \quad (1.10)$$

$$\hat A = \widehat{E(X'X)} = N^{-1} \sum_{i=1}^N x_i' x_i \quad (1.11)$$
5
By homoskedasticity we mean that all errors have the same (finite) variance: $Var(u_i) = Var(u_j)\ \forall i, j$. By serial correlation we mean the presence of some type of correlation across error terms: $Cov(u_i, u_j) \neq 0$ for some $i \neq j$. Very frequently, these two features (homoskedasticity and lack of serial correlation) are assumed together. For simplicity, unless specified differently, when we assume homoskedasticity we will also mean to assume lack of serial correlation.
6
Also for this last derivation we have used some lemmas and corollaries of the main theorems 1 and 3 which we are not discussing in detail here. All the details are in the textbook.
7
In previous econometrics courses you might have used a slightly different estimator for $\sigma^2$: $\hat\sigma^2 = (N-K)^{-1} \sum_{i=1}^N \hat u_i^2$. This alternative estimator is not only consistent but also unbiased. For consistency, however, the correction for the degrees of freedom of the model is not necessary.
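Equations 1.10 and 1.11 are straightforward to compute in practice. As a sketch in Python (the simulated dataset, the true coefficients, and the error variance are arbitrary choices), the estimated asymptotic variance $\hat\sigma^2 \hat A^{-1}/N$ can be compared with the true one, which is known here because we control the simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # x_{i1} = 1 plus one regressor
beta = np.array([0.5, 1.5])   # true coefficients (arbitrary)
sigma = 2.0                   # true (homoskedastic) error s.d. (arbitrary)
y = X @ beta + rng.normal(scale=sigma, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / N                 # equation 1.10
A_hat = X.T @ X / N                            # equation 1.11
avar = sigma2_hat * np.linalg.inv(A_hat) / N   # estimated asymptotic variance of beta_hat
se = np.sqrt(np.diag(avar))
print(beta_hat, se)
```

With these choices $A = I$, so the true asymptotic standard errors are $\sigma/\sqrt{N} \approx 0.0089$, and the estimated ones should come out very close to that.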
Finally, how can we proceed in case we are not willing to assume homoskedasticity of the error term? Nothing particularly complicated. In that case the asymptotic variance-covariance matrix of $\hat\beta$ does not simplify to $\sigma^2 \frac{[E(X'X)]^{-1}}{N}$ and remains equal to $AVar(\hat\beta) = \frac{A^{-1} B A^{-1}}{N}$,8 which can be easily estimated with $\hat A$ as defined in equation 1.11 and the following:

$$\hat B = N^{-1} \sum_{i=1}^N \hat u_i^2 x_i' x_i \quad (1.12)$$
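As a final sketch (in Python, with simulated data whose error variance deliberately depends on the regressor; all parameter values are arbitrary choices), here is how $\hat A$ and $\hat B$ combine into the robust "sandwich" variance estimator:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
beta = np.array([1.0, 2.0])                    # true coefficients (arbitrary)
u = rng.normal(size=N) * np.sqrt(0.5 + x**2)   # heteroskedastic: Var(u|x) = 0.5 + x^2
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

A_hat = X.T @ X / N                              # equation 1.11
B_hat = (X * (u_hat**2)[:, None]).T @ X / N      # equation 1.12
A_inv = np.linalg.inv(A_hat)
avar_robust = A_inv @ B_hat @ A_inv / N          # A^{-1} B A^{-1} / N
se_robust = np.sqrt(np.diag(avar_robust))

# homoskedastic formula, invalid here, computed only for comparison
avar_homo = (u_hat @ u_hat / N) * A_inv / N
se_homo = np.sqrt(np.diag(avar_homo))
print(se_robust, se_homo)
```

Because the error variance rises with $x^2$, the homoskedastic formula understates the variance of the slope coefficient, while the sandwich estimator accounts for it.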
8
The notation $AVar$ to indicate the asymptotic variance of a random variable is pretty standard and we will make use of it several times later on in the course.