
Introduction to Bayesian Methods with an

Example

Wesley Burr
Queen’s University, Kingston, Ontario
wburr@mast.queensu.ca

Statistical Methods Seminar


May 18, 2011
The Book

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. Bayesian
Data Analysis. Chapman and Hall/CRC, 2004.

(and the other book)

Gelman, A. and Hill, J. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press, 2007.
The process of Bayesian data analysis can be idealized by dividing
it into three steps:

• Set up a full probability model – a joint probability distribution
  for all observable and unobservable quantities in a problem.

• Condition on the observed data – calculate and interpret the
  appropriate posterior distribution.

• Evaluate the fit of the model and its implications – does the
  model fit the data? Are the conclusions reasonable?

. . . iterate if necessary . . .
Some Notation

• θ: an unobservable vector quantity or population parameter of
  interest

• y: observed data

• ỹ: an unknown, but potentially observable quantity


Some Definitions

• p(θ): the prior distribution

• p(y|θ): the sampling distribution (if considered as a function of
  θ, for fixed y, called the likelihood function)

• p(θ|y) = p(θ)p(y|θ) / p(y): the posterior distribution

• p(θ|y) ∝ p(θ)p(y|θ): the unnormalized posterior distribution

• p(y) = ∫ p(y, θ) dθ: the prior predictive distribution

• p(ỹ|y) = ∫ p(ỹ, θ|y) dθ: the posterior predictive distribution
A Toy Example

Consider the disease hemophilia, which exhibits X-chromosome-
linked recessive inheritance, meaning that a male who inherits the
gene is affected, whereas a female carrying the gene on only one of
her two X chromosomes is not affected.

Now, consider a woman who has an affected brother but is not
herself affected (i.e. she may be a carrier, but she does not have
both X chromosomes affected). The unknown quantity of interest
has just two values: the woman is either a carrier (θ = 1) or she is
not (θ = 0). Also assume that the woman has two sons.

. . . more on the board . . .
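The board computation can be sketched numerically. Following the example as worked in BDA, the prior is Pr(θ = 1) = 1/2 (the affected brother implies the woman's mother is a carrier), and the observed data are that both sons are unaffected, which has probability (1/2)(1/2) if she is a carrier and 1 if she is not. A minimal sketch in Python:

```python
# Hemophilia example: posterior probability the woman is a carrier,
# given that both of her sons are unaffected (the case worked in BDA).
prior = {1: 0.5, 0: 0.5}                 # theta = 1: carrier; theta = 0: not

# Likelihood of "two unaffected sons": each son of a carrier is
# unaffected with probability 1/2; all sons of a non-carrier are unaffected.
likelihood = {1: 0.5 * 0.5, 0: 1.0}

unnormalized = {t: prior[t] * likelihood[t] for t in prior}
p_y = sum(unnormalized.values())         # prior predictive prob. of the data
posterior = {t: unnormalized[t] / p_y for t in prior}

print(posterior[1])                      # 0.2
```

The data reduce the carrier probability from the prior 1/2 to the posterior 1/5.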


Simulation of Posterior and Predictive Quantities

In practice, we will be interested in simulating draws from the
posterior distribution of θ, and possibly from the posterior predictive
distribution of ỹ.

Next week: Aaron will talk about using BUGS (WinBUGS, OpenBUGS)
interfaced through R as a tool for simulating these draws.
Single-Parameter Models

(Chapter 2 of BDA)
Binomial Data

The binomial sampling model states that

p(y|θ) = Bin(y|n, θ) = (n choose y) θ^y (1 − θ)^(n−y),

where we can suppress the dependence on n since it is regarded as
part of the experimental design that can be considered fixed.

To perform Bayesian inference, we must specify the prior
distribution. For simplicity at this point, assume the prior
distribution for θ is uniform on [0, 1]. Then, apply Bayes’ rule:

p(θ|y) ∝ p(θ)p(y|θ) = θ^y (1 − θ)^(n−y)

since p(θ) = 1 for θ ∈ [0, 1].

Notice the closed-form solution: this is typical of many ‘examples’,
but not typical of real problems.
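Concretely, the posterior under the uniform prior is Beta(y + 1, n − y + 1), which we can both evaluate and sample. A minimal sketch, with illustrative values of n and y (not from the talk):

```python
import random

random.seed(0)

# Uniform prior + Binomial(n, theta) likelihood gives, in closed form,
# theta | y ~ Beta(y + 1, n - y + 1).  n and y here are illustrative.
n, y = 20, 7

post_mean = (y + 1) / (n + 2)            # mean of Beta(y + 1, n - y + 1)
draws = [random.betavariate(y + 1, n - y + 1) for _ in range(50_000)]
sim_mean = sum(draws) / len(draws)

print(post_mean)                         # 8/22 = 0.3636...
```

The simulated mean agrees with the closed-form posterior mean, which is the kind of check we lose when no closed form exists.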
The Posterior Distribution

Bayesian inference involves passing from a prior p(θ) to a posterior
p(θ|y); we naturally might expect that some general relations hold
between these two. Two expressions hold:

E[θ] = E[E[θ|y]]
var(θ) = E[var(θ|y)] + var(E[θ|y])
The first says that the prior mean of θ is the average of all possible
posterior means over the distribution of all possible data.

The variance formula is more interesting because it says that the
posterior variance is on average smaller than the prior variance. The
amount by which it is smaller depends on the variation of posterior
means over the distribution of all possible data.

The greater the variation in posterior means, the greater the
potential for reducing our uncertainty with regard to θ.
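Both identities can be checked exactly for the uniform-prior binomial model: the prior predictive p(y) is uniform on {0, . . . , n}, and the posterior is Beta(y + 1, n − y + 1). A sketch in exact rational arithmetic:

```python
from fractions import Fraction

# Verify E[theta] = E[E[theta|y]] and
# var(theta) = E[var(theta|y)] + var(E[theta|y]) exactly, for a
# Uniform(0,1) = Beta(1,1) prior and Binomial(n, theta) data.
n = 5
p_y = Fraction(1, n + 1)                 # prior predictive: uniform on 0..n

def post_mean(y):                        # E[theta | y] for Beta(y+1, n-y+1)
    return Fraction(y + 1, n + 2)

def post_var(y):                         # var(theta | y)
    a, b = y + 1, n - y + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

E_post_mean = sum(p_y * post_mean(y) for y in range(n + 1))
E_post_var = sum(p_y * post_var(y) for y in range(n + 1))
var_post_mean = sum(p_y * (post_mean(y) - E_post_mean) ** 2
                    for y in range(n + 1))

print(E_post_mean)                       # 1/2, the prior mean
print(E_post_var + var_post_mean)        # 1/12, the prior variance
```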
Informative Priors and Conjugacy

In the state of knowledge interpretation for a prior distribution, the
guiding principle is to express our knowledge and uncertainty about
θ as if its value could be thought of as a random realization from
p(θ).

If the posterior follows the same parametric form as the prior, this
is called conjugacy; e.g. the beta prior is a conjugate family for the
binomial likelihood.

Justification: it is easy to understand the results, they are often a
good approximation, and they simplify computation.

We can replace conjugate priors with nonconjugate priors, at the
main expense of transparency and ease of computation – if our
knowledge supports such a prior, it’s not unreasonable to use it.
Other Models

We can easily extend the binomial model to other simple models,
such as the normal distribution, Poisson distribution or exponential
distribution. The details are similar, and each has an appropriate
conjugate family of priors available. Each of the following has n
i.i.d. yi as data.

(normal)       p(y|σ^2) ∝ (σ^2)^(−n/2) exp( −(1/(2σ^2)) Σ_{i=1}^n (yi − θ)^2 )

(Poisson)      p(y|θ) ∝ θ^t(y) e^(−nθ)  for  t(y) = Σ_{i=1}^n yi

(exponential)  p(y|θ) = θ^n exp(−n ȳ θ),  y > 0
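For the Poisson and exponential cases, the conjugate Gamma(α, β) prior updates in closed form. A sketch (the hyperparameters and data are illustrative, not from the talk):

```python
# Conjugate Gamma(alpha, beta) updates for two of the models above:
#   Poisson likelihood:     theta | y ~ Gamma(alpha + sum(y), beta + n)
#   exponential likelihood: theta | y ~ Gamma(alpha + n, beta + sum(y))

def poisson_gamma_update(alpha, beta, y):
    return alpha + sum(y), beta + len(y)

def exponential_gamma_update(alpha, beta, y):
    return alpha + len(y), beta + sum(y)

counts = [3, 0, 2, 4, 1]                 # hypothetical Poisson counts
waits = [0.8, 1.3, 0.4]                  # hypothetical exponential waits

print(poisson_gamma_update(2.0, 1.0, counts))     # (12.0, 6.0)
print(exponential_gamma_update(2.0, 1.0, waits))  # approximately (5.0, 3.5)
```

In each case the update is a one-line bookkeeping step: add a data summary to each hyperparameter.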
Estimating Cancer Rates with Informative Priors

We will consider a large set of inferences, each based on a different
data set, but with a common prior distribution. This example also
introduces hierarchical modeling, which we will focus on through
the summer.

The following figure shows the counties (3071 total) in the United
States with the highest age-standardized kidney cancer death rates
during the 1980s. The rates are age-adjusted and restricted to
white males.
Model-Based Approach to Estimating Rates

The misleading patterns on the previous two plots suggest that a
model-based approach to estimating the true underlying rates might
be helpful. Thus, we model

yj ∼ Poisson(10nj θj )

for yj the number of kidney cancer deaths in county j from 1980–1989,
nj the population of the county and θj the underlying rate in
units of deaths per person per year. Note that for this example we
ignore the age-standardization.

To perform Bayesian inference, we need a prior distribution for the
unknown rate θj : for convenience, we use a Gamma distribution, as
it is conjugate to the Poisson.
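This generative model is easy to simulate from, which is a useful sanity check. A sketch with made-up county populations, using the hyperparameters α = 20, β = 430,000 that the talk arrives at later:

```python
import math
import random

random.seed(1)

# Simulate counties from the model: theta_j ~ Gamma(alpha, beta),
# then y_j ~ Poisson(10 * n_j * theta_j).  Populations are made up.
alpha, beta = 20.0, 430_000.0

def draw_poisson(lam):
    """Inversion sampler for a Poisson draw (fine for moderate lam)."""
    u, p, k = random.random(), math.exp(-lam), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

for n_j in [1_000, 10_000, 50_000]:
    theta_j = random.gammavariate(alpha, 1 / beta)   # scale = 1/beta
    y_j = draw_poisson(10 * n_j * theta_j)
    print(n_j, y_j)
```

Small counties routinely produce zero counts even when θj is near the prior mean, which previews why raw rates are so noisy there.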
Choosing the Gamma Hyperparameters

For a distribution Gamma(α, β), we estimate α and β from the
data to match the distribution of the observed cancer death rates
yj /(10nj ). It might seem inappropriate to use the data to set the
prior, but the authors view this as a useful approximation to the
preferred method of hierarchical modeling.

Under the model above, the observed count yj for any county j
comes from

p(yj ) = ∫ p(yj |θj ) p(θj ) dθj

which is the prior predictive distribution.


Prior Predictive Distribution for Poisson

With conjugate families, the known form of the prior and posterior
densities can be used to find the marginal distribution p(y), using

p(y) = p(y|θ) p(θ) / p(θ|y).

Then, for a Poisson model:

p(y) = Poisson(y|θ) Gamma(θ|α, β) / Gamma(θ|α + y, 1 + β)
     = Γ(α + y) β^α / ( Γ(α) y! (1 + β)^(α+y) )
     = (y + α − 1 choose y) (β/(β + 1))^α (1/(β + 1))^y
     = Neg-bin(y|α, β).
Thus the prior predictive distribution for a Poisson model with
Gamma prior is a negative binomial density.
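We can check numerically that this closed form defines a proper probability distribution. A sketch, with illustrative α and β (not the cancer-data values, chosen so the tail is short):

```python
import math

# Prior predictive for the Poisson-Gamma model:
#   p(y) = Gamma(a+y) * b^a / (Gamma(a) * y! * (1+b)^(a+y)),
# the Neg-bin(a, b) mass function.  a, b here are illustrative.
a, b = 3.0, 2.0

def prior_predictive(y):
    log_p = (math.lgamma(a + y) - math.lgamma(a) - math.lgamma(y + 1)
             + a * math.log(b) - (a + y) * math.log(1 + b))
    return math.exp(log_p)

total = sum(prior_predictive(y) for y in range(200))
print(total)   # ~1.0: the masses sum to one
```

Working on the log scale with `math.lgamma` avoids overflow in the Gamma functions for large y.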
Choosing the Gamma Hyperparameters (ctd.)
From the previous slide, p(yj ) is Neg-bin(α, β/(10nj )). From
standard results, the mean and variance of this distribution are:

E[yj ] = 10nj (α/β)
var(yj ) = 10nj (α/β) (1 + 10nj /β)

In R, we compute the empirical mean and variance of the yj /(10nj )
term:

mean(yj /(10nj )) = 1.080832e−05
var(yj /(10nj )) = 4.683567e−11
Substituting these values into the relationships above (with age-
adjusted death counts) gives parameters α = 20, β = 430,000,
according to the textbook. However, the actual computation is
“complicated because [of reasons]” (BDA), and the results I obtain
via R are not the same.

We will continue assuming there’s a subtlety in the computation that
isn’t clear in the text. The values that should have been obtained
via the empirical computation are:

mean(yj /(10nj )) = 4.65e−05
var(yj /(10nj )) = 1.08e−10
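Matching these moments by the method of moments recovers roughly the textbook hyperparameters. A sketch that ignores the extra population-dependent term in the variance (which may be the subtlety BDA alludes to):

```python
# Method-of-moments matching: with E[y_j/(10 n_j)] ~ alpha/beta and
# var(y_j/(10 n_j)) ~ alpha/beta^2 (population term ignored here),
#   beta = mean / var,  alpha = mean^2 / var.
mean_rate = 4.65e-5
var_rate = 1.08e-10

beta = mean_rate / var_rate              # ~430,556
alpha = mean_rate ** 2 / var_rate        # ~20.0

print(round(alpha), round(beta))
```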
Posterior Distribution

As the prior is from the conjugate family of the Poisson model, the
posterior distribution will be Gamma:

θj |yj ∼ Gamma(20 + yj , 430000 + 10nj )

with mean and variance

E[θj |yj ] = (20 + yj ) / (430000 + 10nj )
var(θj |yj ) = (20 + yj ) / (430000 + 10nj )^2 .

The posterior mean can be viewed as a sort of weighted average of
the raw rate, yj /(10nj ), and the prior mean, α/β = 4.65 × 10^−5.
Small Local Data and the Prior

Consider a small county with nj = 1000 (the actual minimum
population is 202).

• If yj = 0, then the raw rate is 0 but the posterior mean is
  4.55 × 10^−5.

• If yj = 2, then the raw death rate is an extremely high 2 × 10^−4,
  but the posterior mean is still only 5.0 × 10^−5.

With such small population size, the data are dominated by the
prior.
Large Local Data and the Prior

Consider a large county with nj = 1,000,000 (the actual maximum
population is 15,937,146).

• If yj = 393, the raw rate is 3.93 × 10^−5 and the posterior mean
  is 3.96 × 10^−5.

• If yj = 545, the raw death rate is 5.45 × 10^−5 and the posterior
  mean is 5.41 × 10^−5.

With such a large population size, the data dominate the prior.
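The four scenarios above can be reproduced directly from the posterior mean formula (20 + yj ) / (430000 + 10nj ):

```python
# Posterior means for the small- and large-county scenarios, using the
# Gamma(20 + y_j, 430000 + 10 n_j) posterior from the earlier slide.
def post_mean(y_j, n_j):
    return (20 + y_j) / (430_000 + 10 * n_j)

small_0 = post_mean(0, 1_000)            # ~4.55e-5, despite a raw rate of 0
small_2 = post_mean(2, 1_000)            # ~5.0e-5, despite a raw rate of 2e-4
large_393 = post_mean(393, 1_000_000)    # ~3.96e-5, close to the raw 3.93e-5
large_545 = post_mean(545, 1_000_000)    # ~5.4e-5, close to the raw 5.45e-5

print(small_0, small_2, large_393, large_545)
```

Shrinkage toward the prior mean is strong when 10nj is small relative to β = 430,000 and nearly negligible when it is large.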
Where to go from Here?

The obvious extension to today’s example is any problem where the
posterior distribution is not available in closed form. In that case,
we will not be able to simply ‘compute’ a posterior mean estimate
by plugging in the data, and we will need simulation.

Next week: Aaron will show us how to interface R and WinBUGS
to simulate drawing from arbitrary posterior distributions.
