Dᵢ (Unwin): 10, 0.5, 0.1, 2, 1, 2
Dᵢ (Shermer): 0.5, 0.1, 0.1, 1, 0.5, 0.1

P[God | Data]:
Unwin: 2/3
Shermer: 0.00025
So even starting from the same prior, differing beliefs about what the data
say lead to quite different posterior probabilities.
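To see where these numbers come from: with P[God] = 0.5 the prior odds are 1, and (treating the Dᵢ as independent Bayes factors) the posterior odds are just the product of the Dᵢ. A minimal sketch of the computation in Python:

```python
# Sketch of the odds calculation behind the table above, treating
# each D_i as an independent Bayes factor applied to the prior odds.
from math import prod

d_unwin = [10, 0.5, 0.1, 2, 1, 2]
d_shermer = [0.5, 0.1, 0.1, 1, 0.5, 0.1]

def posterior_prob(d_values, prior=0.5):
    """P[God | Data] from a prior probability and a list of Bayes factors."""
    odds = prior / (1 - prior) * prod(d_values)  # posterior odds
    return odds / (1 + odds)                     # odds -> probability

print(posterior_prob(d_unwin))    # 0.667 = 2/3
print(posterior_prob(d_shermer))  # ~0.00025
```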
This is based on an analysis published in a July 2004 Scientific American
article. (Available on the course web site on the Articles page.)
Stephen D. Unwin is a risk management consultant who has done work in
physics on quantum gravity. He is the author of the book The Probability of
God.
Michael Shermer is the publisher of Skeptic magazine and a regular
contributor to Scientific American as the author of its column Skeptic.
E[θ] = E[E[θ|y]]                          (1)

Var(θ) = E[Var(θ|y)] + Var(E[θ|y])        (2)
The first equation says that our prior mean is the average of all possible
posterior means (averaged over all possible data sets).
The second says that the posterior variance is, on average, smaller than the
prior variance. The size of the difference depends on the variability of the
posterior means.
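These identities are easy to check by simulation. A minimal Monte Carlo sketch in the Beta-Binomial setting used below (the Beta(4, 3) prior and n = 5 are just the example values from these notes):

```python
# Monte Carlo check of E[theta] = E[E[theta|y]] and
# Var(theta) = E[Var(theta|y)] + Var(E[theta|y])
# in the Beta(a, b) / Bin(n, theta) model.
import numpy as np

rng = np.random.default_rng(0)
a, b, n, m = 4, 3, 5, 200_000

theta = rng.beta(a, b, size=m)     # draws from the prior
y = rng.binomial(n, theta)         # one data set per theta draw

post_mean = (a + y) / (a + b + n)                            # E[theta | y]
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)     # Var(theta | y)

print(a / (a + b), post_mean.mean())                 # both ~ 0.571  (eq. 1)
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(prior_var, post_var.mean() + post_mean.var())  # both ~ 0.0306 (eq. 2)
```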
Prior vs Likelihood vs Posterior
θ ∼ Beta(a, b)
y|θ ∼ Bin(n, θ)
θ|y ∼ Beta(a + y, b + n − y)

Prior mean:

E[θ] = a/(a + b)

MLE (sample proportion):

y/n

Posterior mean:

E[θ|y] = (a + y)/(a + b + n) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · (y/n)

a weighted average of the prior mean and the sample proportion (MLE).
Prior variance:

Var(θ) = E[θ](1 − E[θ])/(a + b + 1)

Posterior variance:

Var(θ|y) = E[θ|y](1 − E[θ|y])/(a + b + n + 1)
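Putting the pieces together in code (using the n = 5, y = 2, a = 4, b = 3 example that follows):

```python
# Beta-Binomial posterior summaries; the posterior mean is written
# explicitly as a weighted average of the prior mean and the MLE.
a, b, n, y = 4, 3, 5, 2

prior_mean = a / (a + b)
mle = y / n
w = (a + b) / (a + b + n)               # weight on the prior mean
post_mean = w * prior_mean + (1 - w) * mle

prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)
post_var = post_mean * (1 - post_mean) / (a + b + n + 1)

print(post_mean)            # 0.5 (= (a + y)/(a + b + n) = 6/12)
print(prior_var, post_var)  # 0.0306, 0.0192
```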
μ ∼ N(μ₀, τ₀²)
yᵢ|μ ∼ iid N(μ, σ²), i = 1, …, n
μ|y ∼ N(μₙ, τₙ²)

Then

μₙ = [(1/τ₀²)μ₀ + (n/σ²)ȳ] / [1/τ₀² + n/σ²]

and

1/τₙ² = 1/τ₀² + n/σ²

Equivalently,

μₙ = μ₀ + (ȳ − μ₀) · τ₀²/(σ²/n + τ₀²)
   = ȳ − (ȳ − μ₀) · (σ²/n)/(σ²/n + τ₀²)
The first form shows μₙ as the prior mean adjusted towards the sample
average of the data. The second shows the sample average shrunk towards
the prior mean.
In most problems, the posterior mean can be thought of as a shrinkage
estimator, where the estimate based on the data alone is shrunk toward
the prior mean. For more general problems the shrinkage may not have such
a clean closed form.
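A minimal sketch of the update, with illustrative values for μ₀, τ₀², σ², ȳ, and n (these are not from the notes):

```python
# Normal-Normal posterior update with known sigma^2:
# mu ~ N(mu0, tau0^2), y_i | mu ~ N(mu, sigma^2).
def normal_posterior(mu0, tau0_sq, sigma_sq, ybar, n):
    """Return (mu_n, tau_n_sq), the posterior mean and variance of mu."""
    precision = 1 / tau0_sq + n / sigma_sq     # this is 1 / tau_n^2
    tau_n_sq = 1 / precision
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)
    return mu_n, tau_n_sq

mu_n, tau_n_sq = normal_posterior(mu0=0.0, tau0_sq=4.0, sigma_sq=1.0,
                                  ybar=1.5, n=10)
print(mu_n, tau_n_sq)            # shrinkage: mu_n lies between mu0 and ybar
# tau_n_sq is smaller than both tau0_sq and sigma_sq / n, as noted next.
print(tau_n_sq <= 4.0, tau_n_sq <= 1.0 / 10)
```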
In this example the posterior variance is never bigger than the prior
variance, since

1/τₙ² = 1/τ₀² + n/σ² ≥ 1/τ₀²

and likewise

1/τₙ² ≥ n/σ²

so τₙ² ≤ τ₀² and τₙ² ≤ σ²/n.
n = 5, y = 2, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 6/12 = 0.5, Var(θ|y) = 1/52 = 0.0192, SD(θ|y) = 0.139
n = 20, y = 8, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 12/27 = 0.444, Var(θ|y) = 0.0088, SD(θ|y) = 0.094
n = 100, y = 40, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 44/107 = 0.411, Var(θ|y) = 0.0022, SD(θ|y) = 0.047
n = 1000, y = 400, a = 4, b = 3

[Figure: prior, likelihood, and posterior densities of θ]

Prior: E[θ] = 4/7 = 0.571, Var(θ) = 3/98 = 0.0306, SD(θ) = 0.175
Posterior: E[θ|y] = 404/1007 = 0.401, Var(θ|y) = 0.00024, SD(θ|y) = 0.0154
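The four sets of posterior summaries above can be reproduced directly from the Beta(a + y, b + n − y) posterior; a short sketch:

```python
# Posterior summaries for the four Beta-Binomial examples above,
# showing the posterior concentrating around the MLE 0.4 as n grows.
a, b = 4, 3
for n, y in [(5, 2), (20, 8), (100, 40), (1000, 400)]:
    a1, b1 = a + y, b + n - y                       # posterior Beta(a1, b1)
    mean = a1 / (a1 + b1)
    var = a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1))
    print(n, round(mean, 3), round(var, 5), round(var ** 0.5, 4))
```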
Prediction
Another useful summary is the posterior predictive distribution of a future
observation ỹ:

p(ỹ|y) = ∫ p(ỹ|y, θ) p(θ|y) dθ = ∫ p(ỹ|θ) p(θ|y) dθ

where the second equality holds since ỹ is conditionally independent of y
given θ.
For the Beta-Binomial model, write a₁ = a + y and b₁ = b + n − y for the
posterior parameters, so θ|y ∼ Beta(a₁, b₁), and let ỹ be the number of
successes in m future trials. Then

p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ

= (m choose ỹ) · [Γ(a₁ + b₁)/(Γ(a₁)Γ(b₁))] ∫₀¹ θ^(a₁+ỹ−1) (1 − θ)^(b₁+m−ỹ−1) dθ

= (m choose ỹ) · [Γ(a₁ + b₁)/(Γ(a₁)Γ(b₁))] · [Γ(a₁ + ỹ)Γ(b₁ + m − ỹ)/Γ(a₁ + b₁ + m)]

the Beta-Binomial distribution, which has mean

E[ỹ|y] = m · a₁/(a₁ + b₁)
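This predictive distribution is available in scipy; a quick check using the posterior Beta(6, 6) from the n = 5, y = 2 example, with m = 5 future trials (an illustrative choice):

```python
# Posterior predictive for the Beta-Binomial example: with
# theta | y ~ Beta(6, 6), the number of successes in m = 5 new
# trials follows a Beta-Binomial(m, 6, 6) distribution.
from scipy import stats

m, a1, b1 = 5, 6, 6
pred = stats.betabinom(m, a1, b1)

print(pred.mean())    # m * a1 / (a1 + b1) = 2.5
print([round(pred.pmf(k), 3) for k in range(m + 1)])
```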
One way of thinking about this is that there are two pieces of uncertainty in
predicting a new observation:
1. Uncertainty about the true success probability
2. Deviation of an observation from its expected value
This is clearer in the Normal-Normal model with fixed variance. As
we saw earlier, the posterior distribution is of the form

μ|y ∼ N(μₙ, τₙ²)

Then

p(ỹ|y) = ∫ [1/(√(2π)σ)] exp(−(ỹ − μ)²/(2σ²)) · [1/(√(2π)τₙ)] exp(−(μ − μₙ)²/(2τₙ²)) dμ
A little bit of calculus will show that this reduces to a normal density with
mean

E[ỹ|y] = μₙ = E[E[ỹ|μ]|y] = E[μ|y]

and variance

Var(ỹ|y) = τₙ² + σ²
= Var(E[ỹ|μ]|y) + E[Var(ỹ|μ)|y]
= Var(μ|y) + E[σ²|y]
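A quick Monte Carlo sketch of this decomposition (the values of μₙ, τₙ², and σ² are illustrative):

```python
# Check Var(y_tilde | y) = tau_n^2 + sigma^2 by simulation:
# draw mu from the posterior, then y_tilde given mu.
import numpy as np

rng = np.random.default_rng(1)
mu_n, tau_n_sq, sigma_sq = 1.46, 0.098, 1.0   # illustrative posterior values
m = 500_000

mu = rng.normal(mu_n, np.sqrt(tau_n_sq), size=m)
y_tilde = rng.normal(mu, np.sqrt(sigma_sq))

print(y_tilde.mean(), mu_n)                 # ~ equal
print(y_tilde.var(), tau_n_sq + sigma_sq)   # ~ equal
```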
Compare the prediction variance for a new observation in simple linear
regression, which has the same structure:

Var(ỹ|x) = σ²[1 + 1/n + (x − x̄)²/((n − 1)s²ₓ)]
Simulating the posterior predictive distribution
This is easy to do, assuming that you can simulate from the posterior
distribution of the parameter, which is usually feasible.
It involves two steps:
1. Simulate θᵢ from θ|y, i = 1, …, m
2. Simulate ỹᵢ from ỹ|θᵢ (= ỹ|θᵢ, y), i = 1, …, m
The pairs (θᵢ, ỹᵢ) are draws from the joint distribution of (θ, ỹ) given y.
Therefore the ỹᵢ are draws from ỹ|y, as in the sketch below.
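A minimal sketch for the Beta-Binomial example (posterior Beta(6, 6) from n = 5, y = 2, a = 4, b = 3; the number of future trials is an illustrative choice):

```python
# Two-step posterior predictive simulation for the Beta-Binomial
# example: theta | y ~ Beta(6, 6), future sample of n_new trials.
import numpy as np

rng = np.random.default_rng(42)
m, n_new = 100_000, 5

theta = rng.beta(6, 6, size=m)        # step 1: theta_i ~ theta | y
y_tilde = rng.binomial(n_new, theta)  # step 2: y_tilde_i ~ y_tilde | theta_i

# The y_tilde_i are draws from p(y_tilde | y); the mean should be close
# to n_new * E[theta | y] = 2.5, and the estimated pmf should match the
# exact Beta-Binomial pmf shown earlier.
print(y_tilde.mean())
print(np.bincount(y_tilde, minlength=n_new + 1) / m)
```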
Now consider the Poisson model with a Gamma prior:

yᵢ|λ ∼ iid Poisson(λ), i = 1, …, n

λ ∼ Gamma(α, β), with density

p(λ) = β^α λ^(α−1) e^(−βλ) / Γ(α)

Note that this is a conjugate prior for λ.
The posterior is

λ|y ∼ Gamma(α + nȳ, β + n)

with

E[λ|y] = (α + nȳ)/(β + n)

Var(λ|y) = (α + nȳ)/(β + n)²
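A sketch of this update, checked against the n = 200, ȳ = 0.61, α = β = 1 example below:

```python
# Poisson-Gamma posterior update: lambda ~ Gamma(alpha, beta) in the
# rate parameterization, y_i | lambda iid Poisson(lambda).
def gamma_posterior(alpha, beta, n, ybar):
    """Return posterior (shape, rate) and the mean/variance of lambda | y."""
    a_post = alpha + n * ybar    # shape: alpha + total observed count
    b_post = beta + n            # rate: beta + number of observations
    return a_post, b_post, a_post / b_post, a_post / b_post ** 2

print(gamma_posterior(1, 1, 200, 0.61))   # mean ~ 0.612, var ~ 0.0030
```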
n = 200, ȳ = 0.61, α = β = 0.5

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 2, SD(λ) = 1.414
Posterior: E[λ|y] = 0.611, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 1

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 1, SD(λ) = 1
Posterior: E[λ|y] = 0.612, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 10

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 0.1, SD(λ) = 0.316
Posterior: E[λ|y] = 0.629, Var(λ|y) = 0.0030, SD(λ|y) = 0.055
n = 200, ȳ = 0.61, α = β = 100

[Figure: prior, likelihood, and posterior densities of λ]

Prior: E[λ] = 1, Var(λ) = 0.01, SD(λ) = 0.1
Posterior: E[λ|y] = 0.74, Var(λ|y) = 0.0025, SD(λ|y) = 0.050
One way to think of the gamma prior in this case is that it corresponds to a
prior data set of β observations with a total observed Poisson count of α.
Note that the Gamma distribution can be parameterized in many ways. Often
the scale parameterization, with scale θ = 1/β, is used.
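For instance, scipy parameterizes the gamma by shape and scale, so a Gamma(α, β) in the rate form used here corresponds to scale 1/β; a small sketch:

```python
# Gamma(alpha, beta) in the rate parameterization equals
# scipy's gamma(a=alpha, scale=1/beta).
from scipy import stats

alpha, beta = 123, 201        # posterior from the alpha = beta = 1 example
post = stats.gamma(a=alpha, scale=1 / beta)

print(post.mean(), alpha / beta)        # both ~ 0.6119
print(post.var(), alpha / beta ** 2)    # both ~ 0.00304
```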