Dᵢ (Unwin): 10, 0.5, 0.1, 2, 1, 2
Dᵢ (Shermer): 0.5, 0.1, 0.1, 1, 0.5, 0.1

P[God | Data]:
Unwin: 2/3
Shermer: 0.00025
So even starting from the same prior, differing beliefs about what the data
say lead to quite different posterior probabilities.
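To see where these numbers come from: with P[God] = 0.5 the prior odds are 1, and (treating the Dᵢ as independent Bayes factors) the posterior odds are just the product of the Dᵢ. A minimal sketch of the computation in Python:

```python
# Sketch of the odds calculation behind the table above, treating
# each D_i as an independent Bayes factor applied to the prior odds.
from math import prod

d_unwin = [10, 0.5, 0.1, 2, 1, 2]
d_shermer = [0.5, 0.1, 0.1, 1, 0.5, 0.1]

def posterior_prob(d_values, prior=0.5):
    """P[God | Data] from a prior probability and a list of Bayes factors."""
    odds = prior / (1 - prior) * prod(d_values)  # posterior odds
    return odds / (1 + odds)                     # odds -> probability

print(posterior_prob(d_unwin))    # 0.667 = 2/3
print(posterior_prob(d_shermer))  # ~0.00025
```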
This is based on an analysis published in a July 2004 Scientific American
article. (Available on the course web site on the Articles page.)
Stephen D. Unwin is a risk management consultant who has done work in
physics on quantum gravity. He is the author of the book The Probability of
God.
Michael Shermer is the publisher of Skeptic magazine and a regular
contributor to Scientific American as the author of its column Skeptic.
E[θ] = E[E[θ|y]]                          (1)

Var(θ) = E[Var(θ|y)] + Var(E[θ|y])        (2)
The first equation says that our prior mean is the average of all possible
posterior means (averaged over all possible data sets).
The second says that the posterior variance is, on average, smaller than the
prior variance. The size of the difference depends on the variability of the
posterior means.
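These identities are easy to check by simulation. A minimal Monte Carlo sketch in the Beta-Binomial setting used below (the Beta(4, 3) prior and n = 5 are just the example values from these notes):

```python
# Monte Carlo check of E[theta] = E[E[theta|y]] and
# Var(theta) = E[Var(theta|y)] + Var(E[theta|y])
# in the Beta(a, b) / Bin(n, theta) model.
import numpy as np

rng = np.random.default_rng(0)
a, b, n, m = 4, 3, 5, 200_000

theta = rng.beta(a, b, size=m)     # draws from the prior
y = rng.binomial(n, theta)         # one data set per theta draw

post_mean = (a + y) / (a + b + n)                            # E[theta | y]
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)     # Var(theta | y)

print(a / (a + b), post_mean.mean())                 # both ~ 0.571  (eq. 1)
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(prior_var, post_var.mean() + post_mean.var())  # both ~ 0.0306 (eq. 2)
```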
Prior vs Likelihood vs Posterior
θ ∼ Beta(a, b)
y|θ ∼ Bin(n, θ)
θ|y ∼ Beta(a + y, b + n − y)

Prior mean:

E[θ] = a/(a + b)

MLE (sample proportion):

y/n

Posterior mean:

E[θ|y] = (a + y)/(a + b + n) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · (y/n)

a weighted average of the prior mean and the sample proportion (MLE).
Prior variance:

Var(θ) = E[θ](1 − E[θ])/(a + b + 1)

Posterior variance:

Var(θ|y) = E[θ|y](1 − E[θ|y])/(a + b + n + 1)
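Putting the pieces together in code (using the n = 5, y = 2, a = 4, b = 3 example that follows):

```python
# Beta-Binomial posterior summaries; the posterior mean is written
# explicitly as a weighted average of the prior mean and the MLE.
a, b, n, y = 4, 3, 5, 2

prior_mean = a / (a + b)
mle = y / n
w = (a + b) / (a + b + n)               # weight on the prior mean
post_mean = w * prior_mean + (1 - w) * mle

prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)

print(post_mean)            # 0.5 (= (a + y)/(a + b + n) = 6/12)
print(prior_var, post_var)  # 0.0306, 0.0192
```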
μ ∼ N(μ₀, τ₀²)
yᵢ|μ ∼ iid N(μ, σ²), i = 1, …, n
μ|y ∼ N(μₙ, τₙ²)

Then

μₙ = [(1/τ₀²)μ₀ + (n/σ²)ȳ] / [1/τ₀² + n/σ²]

and

1/τₙ² = 1/τ₀² + n/σ²

Equivalently,

μₙ = μ₀ + (ȳ − μ₀) · τ₀²/(σ²/n + τ₀²)
   = ȳ − (ȳ − μ₀) · (σ²/n)/(σ²/n + τ₀²)
The first form shows μₙ as the prior mean adjusted towards the sample
average of the data. The second shows the sample average shrunk towards
the prior mean.
In most problems, the posterior mean can be thought of as a shrinkage
estimator, where the estimate based on the data alone is shrunk toward
the prior mean. For more general problems the shrinkage may not have such
a clean closed form.
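A minimal sketch of the update, with illustrative values for μ₀, τ₀², σ², ȳ, and n (these are not from the notes):

```python
# Normal-Normal posterior update with known sigma^2:
# mu ~ N(mu0, tau0^2), y_i | mu ~ N(mu, sigma^2).
def normal_posterior(mu0, tau0_sq, sigma_sq, ybar, n):
    """Return (mu_n, tau_n_sq), the posterior mean and variance of mu."""
    precision = 1 / tau0_sq + n / sigma_sq     # this is 1 / tau_n^2
    tau_n_sq = 1 / precision
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)
    return mu_n, tau_n_sq

mu_n, tau_n_sq = normal_posterior(mu0=0.0, tau0_sq=4.0, sigma_sq=1.0,
                                  ybar=1.5, n=10)
print(mu_n, tau_n_sq)            # shrinkage: mu_n lies between mu0 and ybar
# tau_n_sq is smaller than both tau0_sq and sigma_sq / n, as noted next.
print(tau_n_sq <= 4.0, tau_n_sq <= 1.0 / 10)
```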
In this example the posterior variance is never bigger than the prior
variance, since

1/τₙ² = 1/τ₀² + n/σ² ≥ 1/τ₀²

and likewise

1/τₙ² ≥ n/σ²

so τₙ² ≤ τ₀² and τₙ² ≤ σ²/n.
n = 5, y = 2, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 6/12 = 0.5, Var(θ|y) = 1/52 = 0.0192, SD(θ|y) = 0.139
n = 20, y = 8, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 12/27 = 0.444, Var(θ|y) = 0.0088, SD(θ|y) = 0.094
n = 100, y = 40, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 44/107 = 0.411, Var(θ|y) = 0.0022, SD(θ|y) = 0.047
n = 1000, y = 400, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 404/1007 = 0.401, Var(θ|y) = 0.00024, SD(θ|y) = 0.0154
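The four sets of posterior summaries above can be reproduced directly from the Beta(a + y, b + n − y) posterior; a short sketch:

```python
# Posterior summaries for the four Beta-Binomial examples above,
# showing the posterior concentrating around the MLE 0.4 as n grows.
a, b = 4, 3
for n, y in [(5, 2), (20, 8), (100, 40), (1000, 400)]:
    a1, b1 = a + y, b + n - y                       # posterior Beta(a1, b1)
    mean = a1 / (a1 + b1)
    var = a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1))
    print(n, round(mean, 3), round(var, 5), round(var ** 0.5, 4))
```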
Prediction
Another useful summary is the posterior predictive distribution of a future
observation ỹ:

p(ỹ|y) = ∫ p(ỹ|y, θ) p(θ|y) dθ = ∫ p(ỹ|θ) p(θ|y) dθ

where the second equality holds since ỹ is conditionally independent of y
given θ.
For the Beta-Binomial model, write a₁ = a + y and b₁ = b + n − y for the
posterior parameters, so θ|y ∼ Beta(a₁, b₁), and let ỹ be the number of
successes in m future trials. Then

p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ

= (m choose ỹ) · [Γ(a₁ + b₁)/(Γ(a₁)Γ(b₁))] ∫₀¹ θ^(a₁+ỹ−1) (1 − θ)^(b₁+m−ỹ−1) dθ

= (m choose ỹ) · [Γ(a₁ + b₁)/(Γ(a₁)Γ(b₁))] · [Γ(a₁ + ỹ)Γ(b₁ + m − ỹ)/Γ(a₁ + b₁ + m)]

the Beta-Binomial distribution, which has mean

E[ỹ|y] = m · a₁/(a₁ + b₁)
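This predictive distribution is available in scipy; a quick check using the posterior Beta(6, 6) from the n = 5, y = 2 example, with m = 5 future trials (an illustrative choice):

```python
# Posterior predictive for the Beta-Binomial example: with
# theta | y ~ Beta(6, 6), the number of successes in m = 5 new
# trials follows a Beta-Binomial(m, 6, 6) distribution.
from scipy import stats

m, a1, b1 = 5, 6, 6
pred = stats.betabinom(m, a1, b1)

print(pred.mean())    # m * a1 / (a1 + b1) = 2.5
print([round(pred.pmf(k), 3) for k in range(m + 1)])
```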
One way of thinking about this is that there are two pieces of uncertainty in
predicting a new observation:
1. Uncertainty about the true success probability
2. Deviation of an observation from its expected value
This is clearer in the Normal-Normal model with fixed variance. As
we saw earlier, the posterior distribution is of the form

μ|y ∼ N(μₙ, τₙ²)

Then

p(ỹ|y) = ∫ [1/(√(2π)σ)] exp(−(ỹ − μ)²/(2σ²)) · [1/(√(2π)τₙ)] exp(−(μ − μₙ)²/(2τₙ²)) dμ
A little bit of calculus will show that this reduces to a normal density with
mean

E[ỹ|y] = μₙ = E[E[ỹ|μ]|y] = E[μ|y]

and variance

Var(ỹ|y) = τₙ² + σ²
= Var(E[ỹ|μ]|y) + E[Var(ỹ|μ)|y]
= Var(μ|y) + E[σ²|y]
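A quick Monte Carlo sketch of this decomposition (the values of μₙ, τₙ², and σ² are illustrative):

```python
# Check Var(y_tilde | y) = tau_n^2 + sigma^2 by simulation:
# draw mu from the posterior, then y_tilde given mu.
import numpy as np

rng = np.random.default_rng(1)
mu_n, tau_n_sq, sigma_sq = 1.46, 0.098, 1.0   # illustrative posterior values
m = 500_000

mu = rng.normal(mu_n, np.sqrt(tau_n_sq), size=m)
y_tilde = rng.normal(mu, np.sqrt(sigma_sq))

print(y_tilde.mean(), mu_n)                 # ~ equal
print(y_tilde.var(), tau_n_sq + sigma_sq)   # ~ equal
```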
Compare the prediction variance for a new observation in simple linear
regression, which has the same structure:

Var(ỹ|x) = σ²[1 + 1/n + (x − x̄)²/((n − 1)s²ₓ)]
Simulating the posterior predictive distribution
This is easy to do, assuming that you can simulate from the posterior
distribution of the parameter, which is usually feasible.
It involves two steps:
1. Simulate θᵢ from θ|y, i = 1, …, m
2. Simulate ỹᵢ from ỹ|θᵢ (= ỹ|θᵢ, y), i = 1, …, m
The pairs (θᵢ, ỹᵢ) are draws from the joint distribution of (θ, ỹ) given y.
Therefore the ỹᵢ are draws from ỹ|y, as in the sketch below.
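A minimal sketch for the Beta-Binomial example (posterior Beta(6, 6) from n = 5, y = 2, a = 4, b = 3; the number of future trials is an illustrative choice):

```python
# Two-step posterior predictive simulation for the Beta-Binomial
# example: theta | y ~ Beta(6, 6), future sample of n_new trials.
import numpy as np

rng = np.random.default_rng(42)
m, n_new = 100_000, 5

theta = rng.beta(6, 6, size=m)        # step 1: theta_i ~ theta | y
y_tilde = rng.binomial(n_new, theta)  # step 2: y_tilde_i ~ y_tilde | theta_i

# The y_tilde_i are draws from p(y_tilde | y); the mean should be close
# to n_new * E[theta | y] = 2.5, and the estimated pmf should match the
# exact Beta-Binomial pmf shown earlier.
print(y_tilde.mean())
print(np.bincount(y_tilde, minlength=n_new + 1) / m)
```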
Now consider the Poisson model with a Gamma prior:

yᵢ|λ ∼ iid Poisson(λ), i = 1, …, n

λ ∼ Gamma(α, β), with density

p(λ) = β^α λ^(α−1) e^(−βλ) / Γ(α)

Note that this is a conjugate prior for λ.
The posterior is

λ|y ∼ Gamma(α + nȳ, β + n)

with

E[λ|y] = (α + nȳ)/(β + n)

Var(λ|y) = (α + nȳ)/(β + n)²
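A sketch of this update, checked against the n = 200, ȳ = 0.61, α = β = 1 example below:

```python
# Poisson-Gamma posterior update: lambda ~ Gamma(alpha, beta) in the
# rate parameterization, y_i | lambda iid Poisson(lambda).
def gamma_posterior(alpha, beta, n, ybar):
    """Return posterior (shape, rate) and the mean/variance of lambda | y."""
    a_post = alpha + n * ybar    # shape: alpha + total observed count
    b_post = beta + n            # rate: beta + number of observations
    return a_post, b_post, a_post / b_post, a_post / b_post ** 2

print(gamma_posterior(1, 1, 200, 0.61))   # mean ~ 0.612, var ~ 0.0030
```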
n = 200, ȳ = 0.61, α = β = 0.5

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 2, SD(λ) = 1.414
Posterior: E[λ|y] = 0.611, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 1

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 1, SD(λ) = 1
Posterior: E[λ|y] = 0.612, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 10

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 0.1, SD(λ) = 0.316
Posterior: E[λ|y] = 0.629, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 100

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 0.01, SD(λ) = 0.1
Posterior: E[λ|y] = 0.74, Var(λ|y) = 0.0025, SD(λ|y) = 0.050
One way to think of the gamma prior in this case is that it corresponds to a
prior data set of β observations with a total observed Poisson count of α.
Note that the Gamma distribution can be parameterized in many ways. Often
the scale parameterization, with scale θ = 1/β, is used.
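For instance, scipy parameterizes the gamma by shape and scale, so a Gamma(α, β) in the rate form used here corresponds to scale 1/β; a small sketch:

```python
# Gamma(alpha, beta) in the rate parameterization equals
# scipy's gamma(a=alpha, scale=1/beta).
from scipy import stats

alpha, beta = 123, 201        # posterior from the alpha = beta = 1 example
post = stats.gamma(a=alpha, scale=1 / beta)

print(post.mean(), alpha / beta)        # both ~ 0.6119
print(post.var(), alpha / beta ** 2)    # both ~ 0.00304
```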