
Introduction to Bayesian Methods with an

Example

Wesley Burr
Queen’s University, Kingston, Ontario
wburr@mast.queensu.ca

Statistical Methods Seminar


May 18, 2011
The Book

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. Bayesian
Data Analysis. Chapman and Hall/CRC, 2004.

(and the other book)

Gelman, A. and Hill, J. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press, 2007.
The process of Bayesian data analysis can be idealized by dividing
it into three steps:

• Set up a full probability model – a joint probability distribution
  for all observable and unobservable quantities in a problem.

• Condition on the observed data – calculate and interpret the
  appropriate posterior distribution.

• Evaluate the fit of the model and its implications – does the
  model fit the data? Are the conclusions reasonable?

. . . iterate if necessary . . .
Some Notation

• θ: an unobservable vector quantity or population parameter of
  interest

• y: observed data

• ỹ: an unknown, but potentially observable quantity


Some Definitions

• p(θ): the prior distribution

• p(y|θ): the sampling distribution (if considered as a function of
  θ, for fixed y, called the likelihood function)

• p(θ|y) = p(θ)p(y|θ) / p(y): the posterior distribution

• p(θ|y) ∝ p(θ)p(y|θ): the unnormalized posterior distribution

• p(y) = ∫ p(y, θ) dθ: the prior predictive distribution

• p(ỹ|y) = ∫ p(ỹ, θ|y) dθ: the posterior predictive distribution
A Toy Example

Consider the disease hemophilia, which exhibits X-chromosome-
linked recessive inheritance, meaning that a male who inherits the
gene is affected, whereas a female carrying the gene on only one of
her two X chromosomes is not affected.

Now, consider a woman who has an affected brother but is not
herself affected (i.e. she may be a carrier, but she does not have
both X chromosomes affected). The unknown quantity of interest
has just two values: the woman is either a carrier (θ = 1) or she is
not (θ = 0). Also assume that the woman has two sons.

. . . more on the board . . .
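The board computation can be sketched numerically. Following the example as worked in BDA, the prior is Pr(θ = 1) = 1/2 (the affected brother implies the woman's mother is a carrier), and the observed data are that both sons are unaffected, which has probability (1/2)(1/2) if she is a carrier and 1 if she is not. A minimal sketch in Python:

```python
# Hemophilia example: posterior probability the woman is a carrier,
# given that both of her sons are unaffected (the case worked in BDA).
prior = {1: 0.5, 0: 0.5}                 # theta = 1: carrier; theta = 0: not

# Likelihood of "two unaffected sons": each son of a carrier is
# unaffected with probability 1/2; all sons of a non-carrier are unaffected.
likelihood = {1: 0.5 * 0.5, 0: 1.0}

unnormalized = {t: prior[t] * likelihood[t] for t in prior}
p_y = sum(unnormalized.values())         # prior predictive prob. of the data
posterior = {t: unnormalized[t] / p_y for t in prior}

print(posterior[1])                      # 0.2
```

The data reduce the carrier probability from the prior 1/2 to the posterior 1/5.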


Simulation of Posterior and Predictive Quantities

In practice, we will be interested in simulating draws from the
posterior distribution of θ, and possibly from the posterior predictive
distribution of ỹ.

Next week: Aaron will talk about using BUGS (WinBUGS, OpenBUGS)
interfaced through R as a tool for simulating these draws.
Single-Parameter Models

(Chapter 2 of BDA)
Binomial Data

The binomial sampling model states that

p(y|θ) = Bin(y|n, θ) = (n choose y) θ^y (1 − θ)^(n−y),

where we can suppress the dependence on n since it is regarded as
part of the experimental design that can be considered fixed.

To perform Bayesian inference, we must specify the prior
distribution. For simplicity at this point, assume the prior
distribution for θ is uniform on [0, 1]. Then, apply Bayes’ rule:

p(θ|y) ∝ p(θ)p(y|θ) = θ^y (1 − θ)^(n−y)

since p(θ) = 1 for θ ∈ [0, 1].

Notice the closed-form solution: this is typical of many ‘examples’,
but not typical of real problems.
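Concretely, the posterior under the uniform prior is Beta(y + 1, n − y + 1), which we can both evaluate and sample. A minimal sketch, with illustrative values of n and y (not from the talk):

```python
import random

random.seed(0)

# Uniform prior + Binomial(n, theta) likelihood gives, in closed form,
# theta | y ~ Beta(y + 1, n - y + 1).  n and y here are illustrative.
n, y = 20, 7

post_mean = (y + 1) / (n + 2)            # mean of Beta(y + 1, n - y + 1)
draws = [random.betavariate(y + 1, n - y + 1) for _ in range(50_000)]
sim_mean = sum(draws) / len(draws)

print(post_mean)                         # 8/22 = 0.3636...
```

The simulated mean agrees with the closed-form posterior mean, which is the kind of check we lose when no closed form exists.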
The Posterior Distribution

Bayesian inference involves passing from a prior p(θ) to a posterior
p(θ|y); we naturally might expect that some general relations hold
between these two. Two expressions hold:

E[θ] = E[E[θ|y]]
var(θ) = E[var(θ|y)] + var(E[θ|y])
The first says that the prior mean of θ is the average of all possible
posterior means over the distribution of all possible data.

The variance formula is more interesting because it says that the
posterior variance is on average smaller than the prior variance. The
amount by which it is smaller depends on the variation of posterior
means over the distribution of all possible data.

The greater the variation in posterior means, the greater the
potential for reducing our uncertainty with regard to θ.
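Both identities can be checked exactly for the uniform-prior binomial model: the prior predictive p(y) is uniform on {0, . . . , n}, and the posterior is Beta(y + 1, n − y + 1). A sketch in exact rational arithmetic:

```python
from fractions import Fraction

# Verify E[theta] = E[E[theta|y]] and
# var(theta) = E[var(theta|y)] + var(E[theta|y]) exactly, for a
# Uniform(0,1) = Beta(1,1) prior and Binomial(n, theta) data.
n = 5
p_y = Fraction(1, n + 1)                 # prior predictive: uniform on 0..n

def post_mean(y):                        # E[theta | y] for Beta(y+1, n-y+1)
    return Fraction(y + 1, n + 2)

def post_var(y):                         # var(theta | y)
    a, b = y + 1, n - y + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

E_post_mean = sum(p_y * post_mean(y) for y in range(n + 1))
E_post_var = sum(p_y * post_var(y) for y in range(n + 1))
var_post_mean = sum(p_y * (post_mean(y) - E_post_mean) ** 2
                    for y in range(n + 1))

print(E_post_mean)                       # 1/2, the prior mean
print(E_post_var + var_post_mean)        # 1/12, the prior variance
```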
Informative Priors and Conjugacy

In the state of knowledge interpretation for a prior distribution, the
guiding principle is to express our knowledge and uncertainty about
θ as if its value could be thought of as a random realization from
p(θ).

If the posterior follows the same parametric form as the prior, this
is called conjugacy; e.g. the beta prior is a conjugate family for the
binomial likelihood.

Justification: it is easy to understand the results, they are often a
good approximation, and they simplify computation.

We can replace conjugate priors with nonconjugate priors, at the
main expense of transparency and ease of computation – if our
knowledge supports such a prior, it’s not unreasonable to use it.
Other Models

We can easily extend the binomial model to other simple models,
such as the normal distribution, Poisson distribution or exponential
distribution. The details are similar, and each has an appropriate
conjugate family of priors available. Each of the following has n
i.i.d. yi as data.

(normal)       p(y|σ^2) ∝ (σ^2)^(−n/2) exp( −(1/(2σ^2)) Σ_{i=1}^n (yi − θ)^2 )

(Poisson)      p(y|θ) ∝ θ^t(y) e^(−nθ)  for  t(y) = Σ_{i=1}^n yi

(exponential)  p(y|θ) = θ^n exp(−n ȳ θ),  y > 0
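For the Poisson and exponential cases, the conjugate Gamma(α, β) prior updates in closed form. A sketch (the hyperparameters and data are illustrative, not from the talk):

```python
# Conjugate Gamma(alpha, beta) updates for two of the models above:
#   Poisson likelihood:     theta | y ~ Gamma(alpha + sum(y), beta + n)
#   exponential likelihood: theta | y ~ Gamma(alpha + n, beta + sum(y))

def poisson_gamma_update(alpha, beta, y):
    return alpha + sum(y), beta + len(y)

def exponential_gamma_update(alpha, beta, y):
    return alpha + len(y), beta + sum(y)

counts = [3, 0, 2, 4, 1]                 # hypothetical Poisson counts
waits = [0.8, 1.3, 0.4]                  # hypothetical exponential waits

print(poisson_gamma_update(2.0, 1.0, counts))     # (12.0, 6.0)
print(exponential_gamma_update(2.0, 1.0, waits))  # approximately (5.0, 3.5)
```

In each case the update is a one-line bookkeeping step: add a data summary to each hyperparameter.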
Estimating Cancer Rates with Informative Priors

We will consider a large set of inferences, each based on a different
data set, but with a common prior distribution. This example also
introduces hierarchical modeling, which we will focus on through
the summer.

The following figure shows the counties (3071 total) in the United
States with the highest age-standardized kidney cancer death rates
during the 1980s. The rates are age-adjusted and restricted to
white males.
Model-Based Approach to Estimating Rates

The misleading patterns on the previous two plots suggest that a
model-based approach to estimating the true underlying rates might
be helpful. Thus, we model

yj ∼ Poisson(10nj θj )

for yj the number of kidney cancer deaths in county j from 1980–1989,
nj the population of the county and θj the underlying rate in
units of deaths per person per year. Note that for this example we
ignore the age-standardization.

To perform Bayesian inference, we need a prior distribution for the
unknown rate θj : for convenience, we use a Gamma distribution, as
it is conjugate to the Poisson.
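This generative model is easy to simulate from, which is a useful sanity check. A sketch with made-up county populations, using the hyperparameters α = 20, β = 430,000 that the talk arrives at later:

```python
import math
import random

random.seed(1)

# Simulate counties from the model: theta_j ~ Gamma(alpha, beta),
# then y_j ~ Poisson(10 * n_j * theta_j).  Populations are made up.
alpha, beta = 20.0, 430_000.0

def draw_poisson(lam):
    """Inversion sampler for a Poisson draw (fine for moderate lam)."""
    u, p, k = random.random(), math.exp(-lam), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

for n_j in [1_000, 10_000, 50_000]:
    theta_j = random.gammavariate(alpha, 1 / beta)   # scale = 1/beta
    y_j = draw_poisson(10 * n_j * theta_j)
    print(n_j, y_j)
```

Small counties routinely produce zero counts even when θj is near the prior mean, which previews why raw rates are so noisy there.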
Choosing the Gamma Hyperparameters

For a distribution Gamma(α, β), we estimate α and β from the
data to match the distribution of the observed cancer death rates
yj /(10nj ). It might seem inappropriate to use the data to set the
prior, but the authors view this as a useful approximation to the
preferred method of hierarchical modeling.

Under the model above, the observed count yj for any county j
comes from

p(yj ) = ∫ p(yj |θj ) p(θj ) dθj

which is the prior predictive distribution.


Prior Predictive Distribution for Poisson

With conjugate families, the known form of the prior and posterior
densities can be used to find the marginal distribution p(y), using

p(y) = p(y|θ) p(θ) / p(θ|y).

Then, for a Poisson model:

p(y) = Poisson(y|θ) Gamma(θ|α, β) / Gamma(θ|α + y, 1 + β)
     = Γ(α + y) β^α / ( Γ(α) y! (1 + β)^(α+y) )
     = (y + α − 1 choose y) (β/(β + 1))^α (1/(β + 1))^y
     = Neg-bin(y|α, β).
Thus the prior predictive distribution for a Poisson model with
Gamma prior is a negative binomial density.
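We can check numerically that this closed form defines a proper probability distribution. A sketch, with illustrative α and β (not the cancer-data values, chosen so the tail is short):

```python
import math

# Prior predictive for the Poisson-Gamma model:
#   p(y) = Gamma(a+y) * b^a / (Gamma(a) * y! * (1+b)^(a+y)),
# the Neg-bin(a, b) mass function.  a, b here are illustrative.
a, b = 3.0, 2.0

def prior_predictive(y):
    log_p = (math.lgamma(a + y) - math.lgamma(a) - math.lgamma(y + 1)
             + a * math.log(b) - (a + y) * math.log(1 + b))
    return math.exp(log_p)

total = sum(prior_predictive(y) for y in range(200))
print(total)   # ~1.0: the masses sum to one
```

Working on the log scale with `math.lgamma` avoids overflow in the Gamma functions for large y.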
Choosing the Gamma Hyperparameters (ctd.)
From the previous slide, p(yj ) is Neg-bin(α, β/(10nj )). From
standard results, the mean and variance of this distribution are:

E[yj ] = 10nj (α/β)
var(yj ) = 10nj (α/β) (1 + 10nj /β)

In R, we compute the empirical mean and variance of the yj /(10nj )
term:

mean(yj /(10nj )) = 1.080832e−05
var(yj /(10nj )) = 4.683567e−11
Substituting these values into the relationships above (with age-
adjusted death counts) gives parameters α = 20, β = 430,000,
according to the textbook. However, the actual computation is
“complicated because [of reasons]” (BDA), and the results I obtain
via R are not the same.

We will continue assuming there’s a subtlety in the computation that
isn’t clear in the text. The values that should have been obtained
via the empirical computation are:

mean(yj /(10nj )) = 4.65e−05
var(yj /(10nj )) = 1.08e−10
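Matching these moments by the method of moments recovers roughly the textbook hyperparameters. A sketch that ignores the extra population-dependent term in the variance (which may be the subtlety BDA alludes to):

```python
# Method-of-moments matching: with E[y_j/(10 n_j)] ~ alpha/beta and
# var(y_j/(10 n_j)) ~ alpha/beta^2 (population term ignored here),
#   beta = mean / var,  alpha = mean^2 / var.
mean_rate = 4.65e-5
var_rate = 1.08e-10

beta = mean_rate / var_rate              # ~430,556
alpha = mean_rate ** 2 / var_rate        # ~20.0

print(round(alpha), round(beta))
```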
Posterior Distribution

As the prior is from the conjugate family of the Poisson model, the
posterior distribution will be Gamma:

θj |yj ∼ Gamma(20 + yj , 430000 + 10nj )

with mean and variance

E[θj |yj ] = (20 + yj ) / (430000 + 10nj )
var(θj |yj ) = (20 + yj ) / (430000 + 10nj )^2 .

The posterior mean can be viewed as a sort of weighted average of
the raw rate, yj /(10nj ), and the prior mean, α/β = 4.65 × 10^−5.
Small Local Data and the Prior

Consider a small county with nj = 1000 (the actual minimum
population is 202).

• If yj = 0, then the raw rate is 0 but the posterior mean is
  4.55 × 10^−5.

• If yj = 2, then the raw death rate is an extremely high 2 × 10^−4,
  but the posterior mean is still only 5.0 × 10^−5.

With such small population size, the data are dominated by the
prior.
Large Local Data and the Prior

Consider a large county with nj = 1,000,000 (the actual maximum
population is 15,937,146).

• If yj = 393, the raw rate is 3.93 × 10^−5 and the posterior mean
  is 3.96 × 10^−5.

• If yj = 545, the raw death rate is 5.45 × 10^−5 and the posterior
  mean is 5.41 × 10^−5.

With such a large population size, the data dominate the prior.
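The four scenarios above can be reproduced directly from the posterior mean formula (20 + yj ) / (430000 + 10nj ):

```python
# Posterior means for the small- and large-county scenarios, using the
# Gamma(20 + y_j, 430000 + 10 n_j) posterior from the earlier slide.
def post_mean(y_j, n_j):
    return (20 + y_j) / (430_000 + 10 * n_j)

small_0 = post_mean(0, 1_000)            # ~4.55e-5, despite a raw rate of 0
small_2 = post_mean(2, 1_000)            # ~5.0e-5, despite a raw rate of 2e-4
large_393 = post_mean(393, 1_000_000)    # ~3.96e-5, close to the raw 3.93e-5
large_545 = post_mean(545, 1_000_000)    # ~5.4e-5, close to the raw 5.45e-5

print(small_0, small_2, large_393, large_545)
```

Shrinkage toward the prior mean is strong when 10nj is small relative to β = 430,000 and nearly negligible when it is large.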
Where to go from Here?

The obvious extension to today’s example is any problem where the
posterior distribution is not available in closed form. In that case,
we will not be able to simply ‘compute’ a posterior mean estimate
by plugging in the data, and we will need simulation.

Next week: Aaron will show us how to interface R and WinBUGS
to simulate drawing from arbitrary posterior distributions.
