
Prior vs Likelihood vs Posterior

Posterior Predictive Distribution


Poisson Data
Statistics 220
Spring 2005

Copyright © 2005 by Mark E. Irwin

Choosing the Likelihood Model


While much thought is put into thinking about priors in a Bayesian analysis,
the data (likelihood) model can also have a big effect. Choices that need to be
made involve:

• Independence vs Exchangeable vs More Complex Dependence
• Tail size, e.g. Normal vs t_df
• Probability of events


Example: Probability of God's Existence


Two different analyses, both using the prior P[God] = P[No God] = 0.5.

Likelihood ratio components:

D_i = \frac{P[\mathrm{Data}_i \mid \mathrm{God}]}{P[\mathrm{Data}_i \mid \mathrm{No\ God}]}

Evidence (Data_i)                         D_i - Unwin    D_i - Shermer
Recognition of goodness                       10              0.5
Existence of moral evil                        0.5             0.1
Existence of natural evil                      0.1             0.1
Intranatural miracles (prayers)                2               1
Extranatural miracles (resurrection)           1               0.5
Religious experiences                          2               0.1

P[God | Data]:

Unwin: 2/3        Shermer: 0.00025
So even starting with the same prior, different beliefs about what the data
say give quite different posterior probabilities.

This is based on an analysis published in a July 2004 Scientific American
article. (Available on the course web site on the Articles page.)

Stephen D. Unwin is a risk management consultant who has done work in
physics on quantum gravity. He is the author of the book The Probability of
God.

Michael Shermer is the publisher of Skeptic magazine and a regular contributor
to Scientific American as author of the column Skeptic.


Note that in the article, Bayes rule is presented as

P[\mathrm{God} \mid \mathrm{Data}] = \frac{P[\mathrm{God}] \, D}{P[\mathrm{God}] \, D + P[\mathrm{No\ God}]}

where D = \prod_i D_i. See if you can show that this is equivalent to the usual
version of Bayes rule, under the assumption that the components of the data
model are independent.
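A minimal Python sketch of this product form, using the D_i values from the
table above (the code is an illustration, not from the article):

```python
def posterior_prob(d_values, prior=0.5):
    """P[God | Data] from the product form, assuming independent evidence components."""
    D = 1.0
    for d in d_values:
        D *= d                            # D is the product of the likelihood ratios D_i
    return prior * D / (prior * D + (1 - prior))

unwin   = [10, 0.5, 0.1, 2, 1, 2]         # D_i values from Unwin's column
shermer = [0.5, 0.1, 0.1, 1, 0.5, 0.1]    # D_i values from Shermer's column

print(posterior_prob(unwin))              # 0.666... = 2/3
print(posterior_prob(shermer))            # ~0.00025
```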


Prior vs Likelihood vs Posterior


The posterior distribution can be seen as a compromise between the prior
and the data.

In general, this can be seen from the two well-known relationships

E[\theta] = E[E[\theta \mid y]]                                    (1)

Var(\theta) = E[Var(\theta \mid y)] + Var(E[\theta \mid y])        (2)

The first equation says that our prior mean is the average of all possible
posterior means (averaged over all possible data sets).
The second says that the posterior variance is, on average, smaller than the
prior variance. The size of the difference depends on the variability of the
posterior means.
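These two identities can also be checked by simulation. A quick sketch using
the Beta-Binomial model developed below; the values of a, b, and n are
illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n = 4, 3, 20                       # illustrative Beta(a, b) prior and sample size

theta = rng.beta(a, b, size=200_000)     # draw theta from the prior
y = rng.binomial(n, theta)               # draw a data set for each theta

post_mean = (a + y) / (a + b + n)        # E[theta | y] under the Beta(a + y, b + n - y) posterior
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)

# (1): prior mean vs the average of the posterior means
print(a / (a + b), post_mean.mean())
# (2): prior variance vs E[Var(theta | y)] + Var(E[theta | y])
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(prior_var, post_var.mean() + post_mean.var())
```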

This can be exhibited more precisely using examples.

Binomial Model - Conjugate prior

\theta \sim \mathrm{Beta}(a, b)

y \mid \theta \sim \mathrm{Bin}(n, \theta)

\theta \mid y \sim \mathrm{Beta}(a + y, b + n - y)

Prior mean:  E[\theta] = \frac{a}{a+b}        MLE:  \hat{\theta} = \frac{y}{n}

Then the posterior mean satisfies

E[\theta \mid y] = \frac{a+b}{a+b+n} \, E[\theta] + \frac{n}{a+b+n} \, \hat{\theta}

a weighted average of the prior mean and the sample proportion (MLE).
Prior variance:

Var(\theta) = \frac{E[\theta](1 - E[\theta])}{a+b+1}

Posterior variance:

Var(\theta \mid y) = \frac{E[\theta \mid y](1 - E[\theta \mid y])}{a+b+n+1}

So if n is large enough, the posterior variance will be smaller than the
prior variance.
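A small numerical sketch of these formulas; the numbers match the
n = 5, y = 2, a = 4, b = 3 example plotted a few slides below:

```python
a, b = 4, 3          # Beta prior parameters
n, y = 5, 2          # y successes in n trials

prior_mean = a / (a + b)
mle = y / n

w = (a + b) / (a + b + n)                      # weight on the prior mean
post_mean = w * prior_mean + (1 - w) * mle
print(post_mean, (a + y) / (a + b + n))        # both 0.5

prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)
print(prior_var, post_var)                     # 0.0306 vs 0.0192
```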


Normal Model - Conjugate prior, fixed variance

\theta \sim N(\mu_0, \tau_0^2)

y_i \mid \theta \overset{iid}{\sim} N(\theta, \sigma^2);    i = 1, \ldots, n

\theta \mid y \sim N(\mu_n, \tau_n^2)

Then

\mu_n = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}

and

\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}

So the posterior mean is a weighted average of the prior mean and the
sample mean of the data.
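A minimal sketch of these updating formulas; the prior parameters and data
values are made-up illustrations:

```python
import numpy as np

mu0, tau0_sq = 0.0, 4.0                  # prior mean and variance (illustrative)
sigma_sq = 1.0                           # known data variance (illustrative)
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])  # a small made-up sample
n, ybar = len(y), y.mean()

# Posterior precision = prior precision + data precision
tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)

print(mu_n, tau_n_sq)   # mu_n lies between mu0 and ybar; tau_n_sq < min(tau0_sq, sigma_sq / n)
```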


The posterior mean can be thought of in two other ways:

\mu_n = \mu_0 + (\bar{y} - \mu_0) \frac{\tau_0^2}{\sigma^2/n + \tau_0^2}

      = \bar{y} - (\bar{y} - \mu_0) \frac{\sigma^2/n}{\sigma^2/n + \tau_0^2}

The first form has \mu_n as the prior mean adjusted towards the sample
average of the data.

The second form has the sample average shrunk towards the prior mean.

In most problems, the posterior mean can be thought of as a shrinkage
estimator, where the estimate based on the data alone is shrunk toward the
prior mean. The shrinkage may not have such a nice closed form in more
general problems.

In this example the posterior variance is never bigger than the prior
variance, since

\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \ge \frac{1}{\tau_0^2}

and

\frac{1}{\tau_n^2} \ge \frac{n}{\sigma^2}

The first part of this can be thought of as

Posterior Precision = Prior Precision + Data Precision

The first inequality gives \tau_n^2 \le \tau_0^2.

The second inequality gives \tau_n^2 \le \frac{\sigma^2}{n}.
n = 5, y = 2, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities p(\theta \mid y) plotted against \theta on [0, 1].]

Prior:      E[\theta] = 4/7,  Var(\theta) = 3/98 = 0.0306,  SD(\theta) = 0.175
MLE:        \hat{\theta} = 2/5
Posterior:  E[\theta \mid y] = 6/12 = 0.5,  Var(\theta \mid y) = 1/52 = 0.0192,  SD(\theta \mid y) = 0.139
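As an aside, a figure like the one above could be reproduced along these lines
(assuming numpy, scipy, and matplotlib are available; the likelihood curve is
rescaled to integrate to 1 in \theta so the three curves are comparable):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

a, b, n, y = 4, 3, 5, 2
theta = np.linspace(0, 1, 500)

prior = stats.beta.pdf(theta, a, b)
likelihood = stats.beta.pdf(theta, y + 1, n - y + 1)   # likelihood rescaled to a density in theta
posterior = stats.beta.pdf(theta, a + y, b + n - y)

plt.plot(theta, posterior, label="Posterior")
plt.plot(theta, likelihood, label="Likelihood")
plt.plot(theta, prior, label="Prior")
plt.xlabel(r"$\theta$")
plt.ylabel(r"$p(\theta \mid y)$")
plt.legend()
plt.show()
```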

n = 20, y = 8, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities p(\theta \mid y) plotted against \theta on [0, 1].]

Prior:      E[\theta] = 4/7,  Var(\theta) = 3/98 = 0.0306,  SD(\theta) = 0.175
MLE:        \hat{\theta} = 2/5
Posterior:  E[\theta \mid y] = 12/27 = 0.444,  Var(\theta \mid y) = 0.0088,  SD(\theta \mid y) = 0.094

n = 100, y = 40, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities p(\theta \mid y) plotted against \theta on [0, 1].]

Prior:      E[\theta] = 4/7,  Var(\theta) = 3/98 = 0.0306,  SD(\theta) = 0.175
MLE:        \hat{\theta} = 2/5
Posterior:  E[\theta \mid y] = 44/107 = 0.411,  Var(\theta \mid y) = 0.0022,  SD(\theta \mid y) = 0.047

n = 1000, y = 400, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities p(\theta \mid y) plotted against \theta on [0, 1].]

Prior:      E[\theta] = 4/7,  Var(\theta) = 3/98 = 0.0306,  SD(\theta) = 0.175
MLE:        \hat{\theta} = 2/5
Posterior:  E[\theta \mid y] = 404/1007 = 0.401,  Var(\theta \mid y) = 0.00024,  SD(\theta \mid y) = 0.0154

Prediction
Another useful summary is the posterior predictive distribution of a future
observation, \tilde{y}:

p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta) \, p(\theta \mid y) \, d\theta

In many situations, \tilde{y} will be conditionally independent of y given \theta. Thus
the distribution in this case reduces to

p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta

In many situations this can be difficult to calculate, though it is often easy
with a conjugate prior.


For example, with the Binomial-Beta model, the posterior distribution of the
success probability is Beta(a_1, b_1) (for some a_1, b_1). Then the distribution
of the number of successes in m new trials is

p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta

           = \int_0^1 \binom{m}{\tilde{y}} \theta^{\tilde{y}} (1-\theta)^{m-\tilde{y}}
             \frac{\Gamma(a_1+b_1)}{\Gamma(a_1)\Gamma(b_1)} \theta^{a_1-1} (1-\theta)^{b_1-1} \, d\theta

           = \binom{m}{\tilde{y}} \frac{\Gamma(a_1+b_1)}{\Gamma(a_1)\Gamma(b_1)}
             \frac{\Gamma(a_1+\tilde{y})\,\Gamma(b_1+m-\tilde{y})}{\Gamma(a_1+b_1+m)}

which is an example of the Beta-Binomial distribution.

The mean of this distribution is

E[\tilde{y} \mid y] = m \, \frac{a_1}{a_1+b_1}

This can be gotten by applying

E[\tilde{y} \mid y] = E[E[\tilde{y} \mid \theta] \mid y] = E[m\theta \mid y]

The variance can be gotten from

Var(\tilde{y} \mid y) = Var(E[\tilde{y} \mid \theta] \mid y) + E[Var(\tilde{y} \mid \theta) \mid y]
                      = Var(m\theta \mid y) + E[m\theta(1-\theta) \mid y]

This is of the form

m \, \pi (1-\pi) \{1 + (m-1)\tau^2\}

where \pi = a_1/(a_1+b_1) and \tau^2 = 1/(a_1+b_1+1).
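A sketch that evaluates this Beta-Binomial predictive distribution numerically
and checks the mean and variance formulas; the choices a_1 = b_1 = 6 (the
posterior from the n = 5, y = 2 example) and m = 10 are illustrative:

```python
from math import comb, exp, lgamma

def beta_binom_pmf(k, m, a1, b1):
    """P(y_tilde = k | y) for the Beta-Binomial predictive with m new trials."""
    log_ratio = (lgamma(a1 + b1) - lgamma(a1) - lgamma(b1)
                 + lgamma(a1 + k) + lgamma(b1 + m - k) - lgamma(a1 + b1 + m))
    return comb(m, k) * exp(log_ratio)

a1, b1, m = 6, 6, 10
pmf = [beta_binom_pmf(k, m, a1, b1) for k in range(m + 1)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

pi, tau_sq = a1 / (a1 + b1), 1 / (a1 + b1 + 1)
print(mean, m * pi)                                        # both 5.0
print(var, m * pi * (1 - pi) * (1 + (m - 1) * tau_sq))     # both about 4.23
```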


One way of thinking about this is that there are two pieces of uncertainty in
predicting a new observation:

1. Uncertainty about the true success probability
2. Deviation of an observation from its expected value

This is clearer with the Normal-Normal model with fixed variance. As we saw
earlier, the posterior distribution is of the form

\theta \mid y \sim N(\mu_n, \tau_n^2)

Then

p(\tilde{y} \mid y) = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(\tilde{y} - \theta)^2\right)
    \frac{1}{\sqrt{2\pi}\,\tau_n} \exp\left(-\frac{1}{2\tau_n^2}(\theta - \mu_n)^2\right) d\theta

A little bit of calculus will show that this reduces to a normal density with
mean

E[\tilde{y} \mid y] = \mu_n = E[E[\tilde{y} \mid \theta] \mid y] = E[\theta \mid y]

and variance

Var(\tilde{y} \mid y) = \tau_n^2 + \sigma^2
                      = Var(E[\tilde{y} \mid \theta] \mid y) + E[Var(\tilde{y} \mid \theta) \mid y]
                      = Var(\theta \mid y) + E[\sigma^2 \mid y]


An analogue to this is the variance for prediction in linear regression, which is
exactly of this form:

Var(\tilde{y} \mid x) = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1)s_x^2}\right)
Simulating the posterior predictive distribution

This is easy to do, assuming that you can simulate from the posterior
distribution of the parameter, which is usually feasible. It involves two steps:

1. Simulate \theta_i from \theta \mid y;  i = 1, \ldots, m
2. Simulate \tilde{y}_i from \tilde{y} \mid \theta_i (= \tilde{y} \mid \theta_i, y);  i = 1, \ldots, m

The pairs (\theta_i, \tilde{y}_i) are draws from the joint distribution of \theta, \tilde{y} \mid y. Therefore
the \tilde{y}_i are draws from \tilde{y} \mid y.
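A minimal sketch of these two steps for the Normal-Normal model above; the
values of mu_n, tau_n, and sigma are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_n, tau_n, sigma = 1.1, 0.4, 1.0        # illustrative posterior parameters and data sd
m = 100_000

theta = rng.normal(mu_n, tau_n, size=m)   # step 1: theta_i ~ theta | y
y_new = rng.normal(theta, sigma)          # step 2: y_tilde_i ~ y_tilde | theta_i

# The draws should match the analytic predictive N(mu_n, tau_n^2 + sigma^2)
print(y_new.mean(), mu_n)
print(y_new.var(), tau_n**2 + sigma**2)
```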

Why interest in the posterior predictive distribution?

• You might want to make predictions. For example, what will happen to a
  stock in 6 months?

• Model checking: is your model reasonable?
  There are a number of ways of doing this. Future observations could be
  compared with the posterior predictive distribution.
  Another option is something along the lines of cross-validation: fit the
  model with part of the data and compare the remaining observations to
  the posterior predictive distribution calculated from the sample used for
  fitting.


Other One Parameter Models


Poisson
Example: Prussian Cavalry Fatalities Due to Horse Kicks

10 Prussian cavalry corps were monitored for 20 years (200 corps-years) and
the number of fatalities due to horse kicks was recorded:

x = # Deaths                                0     1     2     3     4
Number of Corps-Years with x Fatalities     109   65    22    3     1

Let y_i, i = 1, \ldots, 200, be the number of deaths in observation i.


Assume that y_i \overset{iid}{\sim} \mathrm{Poisson}(\lambda). (This has been shown to be a good
description of these data.) Then the MLE for \lambda is

\hat{\lambda} = \bar{y} = \frac{122}{200} = 0.61

This can be seen from

p(y \mid \lambda) = \prod_{i=1}^{200} \frac{1}{y_i!} \lambda^{y_i} e^{-\lambda}
               \propto \lambda^{\sum y_i} e^{-n\lambda} = \lambda^{n\bar{y}} e^{-n\lambda}
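A small sketch computing the MLE directly from the table of counts above:

```python
deaths = [0, 1, 2, 3, 4]
corp_years = [109, 65, 22, 3, 1]

n = sum(corp_years)                                      # 200 corps-years
total = sum(x * c for x, c in zip(deaths, corp_years))   # 122 deaths in total
lam_hat = total / n
print(n, total, lam_hat)                                 # 200 122 0.61
```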

Instead, let's take a Bayesian approach. For a prior, let's use

\lambda \sim \mathrm{Gamma}(\alpha, \beta),   i.e.   p(\lambda) = \frac{\beta^\alpha \lambda^{\alpha-1} e^{-\beta\lambda}}{\Gamma(\alpha)}

Note that this is a conjugate prior for \lambda.

The posterior density satisfies

p(\lambda \mid y) \propto \lambda^{n\bar{y}} e^{-n\lambda} \cdot \lambda^{\alpha-1} e^{-\beta\lambda}
               = \lambda^{n\bar{y}+\alpha-1} e^{-(n+\beta)\lambda}

which is proportional to a Gamma(\alpha + n\bar{y}, \beta + n) density.

The mean and variance of a Gamma(\alpha, \beta) are

E[\lambda] = \frac{\alpha}{\beta}        Var(\lambda) = \frac{\alpha}{\beta^2}

So the posterior mean and variance in this analysis are

E[\lambda \mid y] = \frac{\alpha + n\bar{y}}{\beta + n}        Var(\lambda \mid y) = \frac{\alpha + n\bar{y}}{(\beta + n)^2}

Similarly to before, the posterior mean is a weighted average of the prior
mean and the MLE (with weights \beta and n).

Let's examine the posteriors under different prior choices.
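A small sketch computing the posterior summaries shown on the following
slides, one line per choice of \alpha = \beta:

```python
n, ybar = 200, 0.61

for a in [0.5, 1, 10, 100]:        # alpha = beta, so the prior mean is always 1
    alpha_post = a + n * ybar      # Gamma(alpha + n*ybar, beta + n) posterior
    beta_post = a + n
    post_mean = alpha_post / beta_post
    post_var = alpha_post / beta_post ** 2
    print(a, round(post_mean, 3), round(post_var, 4), round(post_var ** 0.5, 3))
```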

n = 200, \bar{y} = 0.61, \alpha = \beta = 0.5

[Figure: prior, likelihood, and posterior densities p(\lambda \mid y) plotted against \lambda on [0, 2].]

Prior:      E[\lambda] = 1,  Var(\lambda) = 2,  SD(\lambda) = 1.414
MLE:        \hat{\lambda} = 0.61
Posterior:  E[\lambda \mid y] = 0.611,  Var(\lambda \mid y) = 0.0030,  SD(\lambda \mid y) = 0.055

n = 200, \bar{y} = 0.61, \alpha = \beta = 1

[Figure: prior, likelihood, and posterior densities p(\lambda \mid y) plotted against \lambda on [0, 2].]

Prior:      E[\lambda] = 1,  Var(\lambda) = 1,  SD(\lambda) = 1
MLE:        \hat{\lambda} = 0.61
Posterior:  E[\lambda \mid y] = 0.612,  Var(\lambda \mid y) = 0.0030,  SD(\lambda \mid y) = 0.055

n = 200, \bar{y} = 0.61, \alpha = \beta = 10

[Figure: prior, likelihood, and posterior densities p(\lambda \mid y) plotted against \lambda on [0, 2].]

Prior:      E[\lambda] = 1,  Var(\lambda) = 0.1,  SD(\lambda) = 0.316
MLE:        \hat{\lambda} = 0.61
Posterior:  E[\lambda \mid y] = 0.629,  Var(\lambda \mid y) = 0.0030,  SD(\lambda \mid y) = 0.055

n = 200, \bar{y} = 0.61, \alpha = \beta = 100

[Figure: prior, likelihood, and posterior densities p(\lambda \mid y) plotted against \lambda on [0, 2].]

Prior:      E[\lambda] = 1,  Var(\lambda) = 0.01,  SD(\lambda) = 0.1
MLE:        \hat{\lambda} = 0.61
Posterior:  E[\lambda \mid y] = 0.74,  Var(\lambda \mid y) = 0.0025,  SD(\lambda \mid y) = 0.050

One way to think of the gamma prior in this case is that it is like having a
prior data set of \beta observations with an observed total Poisson count of \alpha.

Note that the Gamma distribution can be parameterized in many ways. Often
the scale parameterization, with scale 1/\beta, is used.

It can also be parameterized in terms of its mean, variance, and coefficient of
variation (only two are needed).

This gives some flexibility in thinking about the desired form of the prior for
a particular model. In the example, I fixed the prior mean at 1 and let the
variance decrease.

