
Point Estimation: definition of estimators

Point estimator: any function $W(X_1, \dots, X_n)$ of a data sample.
The exercise of point estimation is to use particular functions of the data in order to
estimate certain unknown population parameters.
Examples: Assume that $X_1, \dots, X_n$ are drawn i.i.d. from some distribution with
unknown mean $\mu$ and unknown variance $\sigma^2$.
Potential point estimators for $\mu$ include: the sample mean $\bar{X}_n = \frac{1}{n}\sum_i X_i$; the sample median $\mathrm{med}(X_1, \dots, X_n)$.
Potential point estimators for $\sigma^2$ include: the sample variance $\frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$.
Any point estimator is a random variable, whose distribution is that induced by the
distribution of $X_1, \dots, X_n$.
Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$. Then the sample mean $\bar{X}_n \sim N(\mu_n, \sigma_n^2)$, where $\mu_n = \mu$ for all $n$ and $\sigma_n^2 = \sigma^2 / n$.
For a particular realization of the random variables $x_1, \dots, x_n$, the corresponding
point estimator evaluated at $x_1, \dots, x_n$, i.e., $W(x_1, \dots, x_n)$, is called the point estimate.
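To see concretely that a point estimator is itself a random variable, here is a minimal simulation sketch; the use of NumPy and the specific values ($\mu = 2$, $\sigma = 3$, $n = 50$) are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Simulate the sampling distribution of the sample mean for X_i i.i.d. N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 50, 10_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Across repeated samples, X_bar_n has mean close to mu and variance close to sigma^2 / n.
print(xbar.mean(), xbar.var(), sigma**2 / n)
```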

In these lecture notes, we will consider three types of estimators:


1. Method of moments
2. Maximum likelihood
3. Bayesian estimation

Method of moments: very intuitive idea


Assume: $X_1, \dots, X_n$ i.i.d. $f(x \mid \theta_1, \dots, \theta_K)$.
Here the unknown parameters are $\theta_1, \dots, \theta_K$ ($K \in \mathbb{N}$).
The idea is to find values of the parameters such that the population moments are as close
as possible to their sample analogs. This involves finding values of the parameters
to solve the following system of $K$ equations:

$m_1 \equiv \frac{1}{n}\sum_i X_i = EX = \int x\, f(x \mid \theta_1, \dots, \theta_K)\,dx$

$m_2 \equiv \frac{1}{n}\sum_i X_i^2 = EX^2 = \int x^2 f(x \mid \theta_1, \dots, \theta_K)\,dx$

$\vdots$

$m_K \equiv \frac{1}{n}\sum_i X_i^K = EX^K = \int x^K f(x \mid \theta_1, \dots, \theta_K)\,dx.$
Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$. Parameters are $\mu, \sigma^2$.
The moment equations are:

$\frac{1}{n}\sum_i X_i = EX = \mu$

$\frac{1}{n}\sum_i X_i^2 = EX^2 = VX + (EX)^2 = \sigma^2 + \mu^2.$

Hence, the MOM estimators are $\hat{\mu}_{MOM} = \bar{X}_n$ and $\hat{\sigma}^2_{MOM} = \frac{1}{n}\sum_i X_i^2 - (\bar{X}_n)^2$.
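As a quick illustration (not part of the original derivation), the two MOM estimators above can be computed directly from simulated data; the sample size, the parameter values, and the use of NumPy are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # simulated sample with mu = 2, sigma = 1.5

# Match the first two sample moments to their population counterparts.
mu_mom = x.mean()                                 # hat{mu}_MOM = X_bar_n
sigma2_mom = (x**2).mean() - x.mean()**2          # hat{sigma}^2_MOM = m_2 - m_1^2

print(mu_mom, sigma2_mom)
```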
Example: $X_1, \dots, X_n$ i.i.d. $U[0, \theta]$. The parameter is $\theta$.
MOM: $\bar{X}_n = \frac{\theta}{2} \implies \hat{\theta}_{MOM} = 2\bar{X}_n$.

Remarks:
Apart from these special cases above, for general density functions $f(\cdot \mid \theta)$, the
MOM estimator is often difficult to calculate, because the population moments
involve difficult integrals. (In Pearson's original paper, the density was
a mixture of two normal density functions:

$f(x \mid \theta) = \lambda \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) + (1 - \lambda) \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right)$

with unknown parameters $\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2$.)
The model assumption that $X_1, \dots, X_n$ i.i.d. $f(\cdot \mid \theta)$ implies a number of
moment equations equal to the number of moments, which can be $\gg K$. This
leaves room for evaluating the model specification.
For example, in the uniform distribution example above, another moment condition
which should be satisfied is that

$\frac{1}{n}\sum_i X_i^2 = EX^2 = VX + (EX)^2 = \frac{\theta^2}{12} + \frac{\theta^2}{4} = \frac{\theta^2}{3}. \qquad (1)$
At the MOM estimator $\hat{\theta}_{MOM}$, one can check whether

$\frac{1}{n}\sum_i X_i^2 \approx \frac{\hat{\theta}_{MOM}^2}{3}.$

(Later, you will learn how this can be tested more formally.) If this does not
hold, then that might be cause for you to conclude that the original specification
that $X_1, \dots, X_n$ i.i.d. $U[0, \theta]$ is inadequate. Eq. (1) is an example of an
overidentifying restriction.
While the MOM estimator focuses on using the sample uncentered moments to
construct estimators, there are other sample quantities which could be useful,
such as the sample median (or other sample percentiles), as well as sample min-
imum or maximum. (Indeed, for the uniform case above, the sample maximum
would be a very reasonable estimator for .) All these estimators are lumped
under the rubric of generalized method of moments (GMM).
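A short sketch tying the last two remarks together for the $U[0, \theta]$ example; the sample size, the value $\theta = 4$, and the informal (untested) comparison are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 4.0
x = rng.uniform(0.0, theta, size=500)

theta_mom = 2 * x.mean()     # MOM estimator from the first moment
theta_max = x.max()          # alternative estimator: the sample maximum

# Informal check of the overidentifying restriction (1): the second sample
# moment should be close to theta_mom^2 / 3 if the U[0, theta] model is right.
m2 = (x**2).mean()
print(theta_mom, theta_max, m2, theta_mom**2 / 3)
```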

Maximum Likelihood Estimation


Let $X_1, \dots, X_n$ be i.i.d. with density $f(\cdot \mid \theta_1, \dots, \theta_K)$.
Define: the likelihood function, for a continuous random variable, is the joint density
of the sample observations:

$L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^n f(x_i \mid \theta).$
View $L(\theta \mid x)$ as a function of the parameters $\theta$, for the data observations $x$.
From the classical point of view, the likelihood function $L(\theta \mid x)$ is a random variable
due to the randomness in the data $x$. (In the Bayesian point of view, which we talk
about later, the likelihood function is also random because the parameters $\theta$ are also
treated as random variables.)
The maximum likelihood estimator (MLE) is the parameter value $\hat{\theta}_{ML}$ which
maximizes the likelihood function:

$\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta} L(\theta \mid x).$

Usually, in practice, to avoid numerical overflow and underflow problems, we maximize the log of the
likelihood function:

$\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta} \log L(\theta \mid x) = \operatorname*{argmax}_{\theta} \sum_i \log f(x_i \mid \theta).$

Analogously, for discrete random variables, the likelihood function is the joint probability mass function:

$L(\theta \mid x) = \prod_{i=1}^n P(X = x_i \mid \theta).$
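In practice the log-likelihood is often maximized numerically. Below is a minimal sketch for an i.i.d. $N(\mu, \sigma^2)$ sample with both parameters unknown; the use of SciPy's general-purpose optimizer, the reparameterization in $\log\sigma$, and all numerical values are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=1.5, scale=2.0, size=500)

def neg_log_lik(params):
    mu, log_sigma = params                  # optimize over log(sigma) so that sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)                      # close to x.mean() and x.std() (the 1/n version)
```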

Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, 1)$.

$\log L(\mu \mid x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2$

$\max_{\mu} \log L(\mu \mid x) \iff \min_{\mu} \frac{1}{2}\sum_i (x_i - \mu)^2$

FOC: $\frac{\partial \log L}{\partial \mu} = \sum_i (x_i - \mu) = 0 \implies \hat{\mu}_{ML} = \frac{1}{n}\sum_i x_i$ (the sample mean).

Also should check the second-order condition: $\frac{\partial^2 \log L}{\partial \mu^2} = -n < 0$: so satisfied.
Example: $X_1, \dots, X_n$ i.i.d. Bernoulli with probability $p$. The unknown parameter is $p$.

$L(p \mid x) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}$

$\log L(p \mid x) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i)\log(1 - p) \right] = y \log p + (n - y)\log(1 - p)$, where $y$ is the number of 1's.

FOC: $\frac{\partial \log L}{\partial p} = \frac{y}{p} - \frac{n - y}{1 - p} = 0 \implies \hat{p}_{ML} = \frac{y}{n}$.

For $y = 0$ or $y = n$, $\hat{p}_{ML}$ is (respectively) 0 and 1: corner solutions.
SOC: $\frac{\partial^2 \log L}{\partial p^2}\Big|_{p = \hat{p}_{ML}} = -\frac{y}{p^2} - \frac{n - y}{(1 - p)^2} < 0$ for $0 < y < n$.
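A minimal numerical check of the Bernoulli result (the sample size, the true $p$, and the brute-force grid search are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=200)    # Bernoulli(p = 0.3) sample
y, n = x.sum(), len(x)

p_ml = y / n                          # closed-form MLE from the FOC

# Compare against a brute-force maximization of the log-likelihood over a grid.
grid = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(grid) + (n - y) * np.log(1 - grid)
print(p_ml, grid[np.argmax(loglik)])
```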
When the parameter is multidimensional: check that the Hessian matrix $\frac{\partial^2 \log L}{\partial \theta\, \partial \theta'}$ is
negative definite.

You can think of ML as a MOM estimator: for $X_1, \dots, X_n$ i.i.d. and a $K$-dimensional
parameter vector $\theta$, the MLE solves the FOCs:

$\frac{1}{n}\sum_i \frac{\partial \log f(x_i \mid \theta)}{\partial \theta_1} = 0$

$\frac{1}{n}\sum_i \frac{\partial \log f(x_i \mid \theta)}{\partial \theta_2} = 0$

$\vdots$

$\frac{1}{n}\sum_i \frac{\partial \log f(x_i \mid \theta)}{\partial \theta_K} = 0.$

Under an LLN, $\frac{1}{n}\sum_i \frac{\partial \log f(x_i \mid \theta)}{\partial \theta_k} \overset{p}{\to} E_{\theta_0} \frac{\partial \log f(X \mid \theta)}{\partial \theta_k}$, for $k = 1, \dots, K$, where the notation
$E_{\theta_0}$ denotes the expectation over the distribution of $X$ at the true parameter vector $\theta_0$.

Hence, MLE is equivalent to MOM with the moment conditions

$E_{\theta_0} \frac{\partial \log f(X \mid \theta)}{\partial \theta_k} = 0, \quad k = 1, \dots, K.$
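To connect this back to the Bernoulli example, the sample analog of this moment condition (the average score) is exactly zero at the MLE. A small sketch, with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.6, size=500)
p_ml = x.mean()                                  # Bernoulli MLE

# Score of one observation: d/dp log f(x_i | p) = x_i/p - (1 - x_i)/(1 - p).
score = x / p_ml - (1 - x) / (1 - p_ml)
print(score.mean())                              # ~0: the sample moment condition holds at the MLE
```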

Bayes estimators
Fundamentally different view of the world. Model the unknown parameters $\theta$ as
random variables, and assume that the researcher's beliefs about $\theta$ are summarized in a
prior distribution $f(\theta)$.
In this sense, the Bayesian approach is subjective, because the researcher's beliefs about $\theta$
are accommodated in the inferential approach.
$X_1, \dots, X_n$ i.i.d. $f(x \mid \theta)$: the Bayesian views the density of each data observation
as a conditional density, which is conditional on a realization of the random variable
$\theta$.
Given data $X_1, \dots, X_n$, we can update our beliefs about the parameter by computing
the posterior density (using Bayes' Rule):

$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)} = \frac{f(x \mid \theta)\, f(\theta)}{\int f(x \mid \theta) f(\theta)\, d\theta}.$
A Bayesian point estimate of $\theta$ is some feature of this posterior density. Common
point estimators are:
Posterior mean: $E[\theta \mid x] = \int \theta\, f(\theta \mid x)\, d\theta$.
Posterior median: $F^{-1}_{\theta \mid x}(0.5)$, where $F_{\theta \mid x}$ is the CDF corresponding to the posterior
density, i.e., $F_{\theta \mid x}(\bar{\theta}) = \int_{-\infty}^{\bar{\theta}} f(\theta \mid x)\, d\theta$.
Posterior mode: $\operatorname*{argmax}_{\theta} f(\theta \mid x)$. This is the point at which the posterior density is highest.
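These three summaries can be computed numerically on a grid when the posterior has no closed form. A minimal sketch for a Bernoulli likelihood with a flat prior on $p$; the data, the flat prior, and the grid resolution are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.7, size=50)
y, n = x.sum(), len(x)

p = np.linspace(0.001, 0.999, 999)
log_post = y * np.log(p) + (n - y) * np.log(1 - p)    # log-likelihood + flat prior
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, p)                              # normalize to a density on the grid

post_mean = np.trapz(p * post, p)
cdf = np.cumsum(post) * (p[1] - p[0])
post_median = p[np.searchsorted(cdf, 0.5)]
post_mode = p[np.argmax(post)]
print(post_mean, post_median, post_mode)
```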


Note that $f(x \mid \theta)$ is just the likelihood function, so that the posterior density $f(\theta \mid x)$
can be written as:

$f(\theta \mid x) = \frac{L(\theta \mid x)\, f(\theta)}{\int L(\theta \mid x) f(\theta)\, d\theta}.$

But there is a difference in interpretation: in the Bayesian world, the likelihood function
is random due to both $x$ and $\theta$, whereas in the classical world, only $x$ is random.
Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, 1)$, with prior density $f(\mu)$.
The posterior density is

$f(\mu \mid x) = \frac{\exp\!\left( -\frac{1}{2}\sum_i (x_i - \mu)^2 \right) f(\mu)}{\int \exp\!\left( -\frac{1}{2}\sum_i (x_i - \mu)^2 \right) f(\mu)\, d\mu}.$

The integral in the denominator can be difficult to calculate: computational difficulties can
hamper computation of posterior densities.
However, note that the denominator is not a function of $\mu$. Thus

$f(\mu \mid x) \propto L(\mu \mid x)\, f(\mu).$

Hence, if we assume that $f(\mu)$ is constant (i.e., uniform) for all possible values of $\mu$,
then the posterior mode $\operatorname*{argmax}_{\mu} f(\mu \mid x) = \operatorname*{argmax}_{\mu} L(\mu \mid x) = \hat{\mu}_{ML}$.

Example: Bayesian updating for the normal distribution, with a normal prior

$X \sim N(\mu, \sigma^2)$; assume $\sigma^2$ is known.
Prior: $\mu \sim N(\nu, \tau^2)$; assume $\nu$ and $\tau^2$ are known.
Then the posterior distribution is

$\mu \mid X \sim N\big(E(\mu \mid X),\, V(\mu \mid X)\big),$

where

$E(\mu \mid X) = \frac{\tau^2}{\sigma^2 + \tau^2} X + \frac{\sigma^2}{\sigma^2 + \tau^2}\nu$

$V(\mu \mid X) = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}.$

This is an example of a conjugate prior and conjugate distribution, where the posterior
distribution comes from the same family as the prior distribution.
The posterior mean $E(\mu \mid X)$ is a weighted average of $X$ and the prior mean $\nu$.
In this case, as $\tau \to \infty$ (so that the prior information gets worse and worse): then
$E(\mu \mid X) \to X$ (a.s.). This is just the MLE (for just one data observation).
When you observe an i.i.d. sample $\mathbf{X}_n \equiv (X_1, \dots, X_n)$, with sample mean $\bar{X}_n$:

$E(\mu \mid \mathbf{X}_n) = \frac{n\tau^2}{n\tau^2 + \sigma^2}\bar{X}_n + \frac{\sigma^2}{\sigma^2 + n\tau^2}\nu$

$V(\mu \mid \mathbf{X}_n) = \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2}.$

In this case, as the number of observations $n \to \infty$, the posterior mean $E(\mu \mid \mathbf{X}_n) \to \bar{X}_n$. So as $n \to \infty$, the posterior mean converges to the MLE: when your sample
becomes arbitrarily large, you place no weight on your prior information.
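A brief sketch of these updating formulas, showing the posterior mean moving toward $\bar{X}_n$ as $n$ grows; all numerical values ($\sigma = 1$, $\nu = 0$, $\tau = 2$, true $\mu = 1.5$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, nu, tau, mu_true = 1.0, 0.0, 2.0, 1.5    # sigma known; N(nu, tau^2) prior on mu

for n in (1, 10, 1000):
    xbar = rng.normal(mu_true, sigma, size=n).mean()
    w = n * tau**2 / (n * tau**2 + sigma**2)    # weight on the sample mean
    post_mean = w * xbar + (1 - w) * nu
    post_var = sigma**2 * tau**2 / (sigma**2 + n * tau**2)
    print(n, xbar, post_mean, post_var)         # posterior mean approaches the MLE xbar
```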
Data augmentation
The important philosophical distinction of the Bayesian approach is that data and model parameters are treated on an equal footing. Hence,
just as we make posterior inference on model parameters, we can also make posterior
inference on unobserved variables in latent variable models, which are models where
not all the model variables are observed.
Consider a simple example (the binary probit model):

$z = x\beta + \epsilon, \quad \epsilon \sim N(0, 1)$

$y = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0. \end{cases} \qquad (2)$
The researcher observes $(x, y)$, but not $(z, \epsilon)$. He wishes to form the posterior of
$z, \beta \mid x, y$.
We do all inference conditional on $x$. Therefore the relevant prior is

$\pi(z, \beta \mid x) = \pi(z \mid \beta, x) \cdot \pi(\beta \mid x) = N(x\beta, 1) \cdot \underbrace{N(\bar{\beta}, a^2)}_{f(\beta)}. \qquad (3)$

In the above, we assume the marginal prior on $\beta$ is normal (and doesn't depend on
$x$), with mean $\bar{\beta}$ and variance $a^2$. The conditional prior density of $z \mid \beta, x$ is derived from the model specification (2).
The posterior can also be factored into two parts:

$\pi(z, \beta \mid y, x) = \pi(z \mid \beta, y, x) \cdot \pi(\beta \mid y, x)$

$\propto \pi(z \mid \beta, y, x) \cdot L(y \mid \beta, x) \cdot f(\beta)$

$= \begin{cases} \dfrac{\phi(z - x\beta)}{\Phi(x\beta)} \cdot \Phi(x\beta) \cdot \dfrac{1}{a}\phi\!\left( \dfrac{\beta - \bar{\beta}}{a} \right) & \text{with support } z \geq 0, \text{ if } y = 1 \\[8pt] \dfrac{\phi(z - x\beta)}{1 - \Phi(x\beta)} \cdot \big(1 - \Phi(x\beta)\big) \cdot \dfrac{1}{a}\phi\!\left( \dfrac{\beta - \bar{\beta}}{a} \right) & \text{with support } z < 0, \text{ if } y = 0 \end{cases}$

$= \begin{cases} \phi(z - x\beta) \cdot \dfrac{1}{a}\phi\!\left( \dfrac{\beta - \bar{\beta}}{a} \right) & \text{with support } z \geq 0, \text{ if } y = 1 \\[8pt] \phi(z - x\beta) \cdot \dfrac{1}{a}\phi\!\left( \dfrac{\beta - \bar{\beta}}{a} \right) & \text{with support } z < 0, \text{ if } y = 0 \end{cases} \qquad (4)$
In the above, $\Phi$ and $\phi$ denote the CDF and density functions of the $N(0, 1)$ distribution. In the second line, note that the proportionality constant (i.e., the denominator
in Bayes' rule) does not depend on $(\beta, z)$. In the third equation above, note that
$\pi(z \mid \beta, y, x)$ is a truncated normal distribution (with the direction of truncation
depending on whether $y = 0$ or $y = 1$).
Accordingly, this can be marginalized over $\beta$ to obtain the posterior of $z \mid y, x$. Using the
Bayesian procedure to do posterior inference on latent data variables is sometimes
called data augmentation.
In a non-Bayesian context, obtaining values for missing data is usually done by
some sort of imputation procedure. Thus, data augmentation can be viewed as a sort
of Bayesian imputation procedure. One attractive feature of the Bayesian approach
is that it follows easily and naturally from the usual Bayesian logic.
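As a concrete illustration of the data-augmentation step implied by Eq. (4), here is a minimal sketch of one conditional draw of the latent $z$ given $(\beta, x, y)$, using SciPy's truncated normal; the numerical values for $\beta$, $x$, and $y$ are assumptions:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(8)
beta, x, y = 0.8, 1.2, 1           # illustrative values for the observed (x, y) and current beta

mean = x * beta                    # z | beta, x ~ N(x*beta, 1), truncated by the observed y
if y == 1:
    a, b = -mean, np.inf           # support z >= 0 (bounds are standardized: (bound - mean)/scale)
else:
    a, b = -np.inf, -mean          # support z < 0

z_draw = truncnorm.rvs(a, b, loc=mean, scale=1.0, random_state=rng)
print(z_draw)
```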
