
Probabilistic Graphical Models

David Sontag

New York University

Lecture 2, February 7, 2013



Bayesian networks
Reminder of last lecture

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
1 One node i ∈ V for each random variable Xi
2 One conditional probability distribution (CPD) per node, p(xi | xPa(i)), specifying the variable's probability conditioned on its parents' values

Corresponds 1-1 with a particular factorization of the joint distribution:

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

Powerful framework for designing algorithms to perform probability computations



Example
Consider the following Bayesian network:
[Figure: the "student" network, with edges Difficulty → Grade, Intelligence → Grade, Intelligence → SAT, Grade → Letter]

p(d):  d0 0.6, d1 0.4
p(i):  i0 0.7, i1 0.3

p(g | i, d):      g1     g2     g3
    i0, d0        0.3    0.4    0.3
    i0, d1        0.05   0.25   0.7
    i1, d0        0.9    0.08   0.02
    i1, d1        0.5    0.3    0.2

p(s | i):         s0     s1
    i0            0.95   0.05
    i1            0.2    0.8

p(l | g):         l0     l1
    g1            0.1    0.9
    g2            0.4    0.6
    g3            0.99   0.01

What is its joint distribution?

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
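To make the factorization concrete, here is a minimal sketch in plain Python (the dictionary layout and variable names are my own) that stores the CPDs above and evaluates the joint probability of one full assignment:

```python
# CPDs of the student network, taken from the tables above.
p_d = {"d0": 0.6, "d1": 0.4}
p_i = {"i0": 0.7, "i1": 0.3}
p_g = {  # p(g | i, d)
    ("i0", "d0"): {"g1": 0.3,  "g2": 0.4,  "g3": 0.3},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
    ("i1", "d0"): {"g1": 0.9,  "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.5,  "g2": 0.3,  "g3": 0.2},
}
p_s = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
p_l = {"g1": {"l0": 0.1, "l1": 0.9}, "g2": {"l0": 0.4, "l1": 0.6}, "g3": {"l0": 0.99, "l1": 0.01}}

def joint(d, i, g, s, l):
    """Probability of a full assignment: the product of the local CPDs."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# e.g. easy class, smart student, good grade, high SAT, strong letter
print(joint("d0", "i1", "g1", "s1", "l1"))  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 ≈ 0.117
```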
D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when the variables Y are observed:

[Figure: two example graphs over X, Y, Z, labeled (a) and (b)]

If there is no such path, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
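The slides treat d-separation as a graph-connectivity test; one standard equivalent way to implement it (not spelled out in the lecture, so take this as a sketch) is to restrict to the ancestral subgraph of X ∪ Z ∪ Y, moralize it, delete the observed nodes, and check connectivity. All function names below are my own:

```python
from collections import deque

def ancestors(parents, nodes):
    """All nodes in `nodes` plus their ancestors, given a child -> parents map."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(parents.get(n, []))
    return result

def d_separated(parents, X, Z, Y):
    """True if X ⊥ Z | Y in the DAG given by `parents` (child -> list of parents)."""
    keep = ancestors(parents, set(X) | set(Z) | set(Y))
    # Build the moralized, undirected ancestral graph
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:                      # keep original edges, drop directions
            adj[p].add(child); adj[child].add(p)
        for i in range(len(ps)):          # "marry" co-parents of each child
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Remove observed nodes, then test connectivity from X to Z with BFS
    blocked = set(Y)
    frontier = deque(set(X) - blocked)
    seen = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in Z:
            return False                  # found an active path
        for m in adj[n] - blocked - seen:
            seen.add(m); frontier.append(m)
    return True

# Example on the student network (parents match the factorization above)
parents = {"G": ["I", "D"], "S": ["I"], "L": ["G"], "D": [], "I": []}
print(d_separated(parents, {"D"}, {"I"}, set()))   # True:  D ⊥ I
print(d_separated(parents, {"D"}, {"I"}, {"G"}))   # False: observing Grade couples them
```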



Independence maps

Let I(G) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G

Let I(p) denote the set of all conditional independencies that hold for the joint distribution p

A DAG G is an I-map (independence map) of a distribution p if I(G) ⊆ I(p)

A fully connected DAG G is an I-map for any distribution, since I(G) = ∅ ⊆ I(p) for all p

G is a minimal I-map for p if the removal of even a single edge makes it not an I-map
A distribution may have several minimal I-maps
Each corresponds to a specific node ordering

G is a perfect map (P-map) for distribution p if I(G) = I(p)



Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Which of these are equivalent?

[Figure: four three-node networks over X, Y, Z, labeled (a)-(d)]



Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Are these equivalent?

[Figure: two five-node networks over V, W, X, Y, Z]



What are some frequently used graphical models?



Hidden Markov models
[Figure: HMM with hidden chain Y1 → · · · → Y6 and an observation Xt attached to each Yt]

Frequently used for speech recognition and part-of-speech tagging

Joint distribution factors as:

p(y, x) = p(y1) p(x1 | y1) ∏_{t=2}^{T} p(yt | yt−1) p(xt | yt)

p(y1) is the distribution for the starting state
p(yt | yt−1) is the transition probability between any two states
p(xt | yt) is the emission probability

What are the conditional independencies here? For example, Y1 ⊥ {Y3, . . . , Y6} | Y2
Hidden Markov models
[Figure: HMM with hidden chain Y1 → · · · → Y6 and an observation Xt attached to each Yt]

Joint distribution factors as:

p(y, x) = p(y1) p(x1 | y1) ∏_{t=2}^{T} p(yt | yt−1) p(xt | yt)

A homogeneous HMM uses the same parameters (β and α below) for each transition and emission distribution (parameter sharing):

p(y, x) = p(y1) αx1,y1 ∏_{t=2}^{T} βyt,yt−1 αxt,yt

How many parameters need to be learned?
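As a rough answer to the question above, here is a sketch (my own code, using numpy; the index order of α and β is my own convention) that builds a homogeneous HMM with K states and M observation symbols, counts its free parameters (each distribution has one fewer free entry than its length because it must sum to one), and evaluates the joint log-probability of a sequence:

```python
import numpy as np

K, M = 3, 5                      # number of hidden states / observation symbols
rng = np.random.default_rng(0)

pi = rng.dirichlet(np.ones(K))               # p(y1): K - 1 free parameters
beta = rng.dirichlet(np.ones(K), size=K)     # beta[y_prev, y] = p(y_t | y_{t-1}): K(K - 1)
alpha = rng.dirichlet(np.ones(M), size=K)    # alpha[y, x] = p(x_t | y_t): K(M - 1)

free_params = (K - 1) + K * (K - 1) + K * (M - 1)
print("free parameters:", free_params)       # independent of the sequence length T

def joint_log_prob(y, x):
    """log p(y, x) = log p(y1) + log p(x1 | y1) + sum_t [log p(yt | yt-1) + log p(xt | yt)]."""
    lp = np.log(pi[y[0]]) + np.log(alpha[y[0], x[0]])
    for t in range(1, len(y)):
        lp += np.log(beta[y[t - 1], y[t]]) + np.log(alpha[y[t], x[t]])
    return lp

print(joint_log_prob([0, 1, 1, 2], [4, 0, 0, 3]))
```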


Mixture of Gaussians

The N-dim. multivariate normal distribution, N(µ, Σ), has density:

p(x) = 1 / ((2π)^{N/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

Suppose we have k Gaussians given by µk and Σk, and a distribution θ over the numbers 1, . . . , k

The mixture of Gaussians distribution p(y, x) is given by:
1 Sample y ∼ θ (specifies which Gaussian to use)
2 Sample x ∼ N(µy, Σy)
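A minimal sketch of this two-step sampling procedure, with made-up component parameters and numpy (all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, 0.3, 0.2])                      # mixing weights over k = 3 components
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

def sample(n):
    ys = rng.choice(len(theta), size=n, p=theta)       # step 1: which Gaussian to use
    xs = np.stack([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])  # step 2
    return ys, xs

ys, xs = sample(1000)
print(xs.mean(axis=0))   # close to sum_y theta_y * mu_y for large n
```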



Mixture of Gaussians
The marginal distribution over x looks like: [Figure: a multimodal density, with one mode per mixture component]



Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents

[Figure: example topics, each a distribution βt = p(w | z = t) over words:
politics: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, …
religion: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, …
sports: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, …]

Many applications in information retrieval, document summarization, and classification

[Figure: inference for a new document. Given its words w1, . . . , wN, what is this document about? E.g., a distribution of topics θ such as weather .50, finance .49, sports .01]

LDA is one of the simplest and most widely used topic models
Generative model for a document in LDA
1 Sample the document's topic distribution θ (aka topic vector)

    θ ∼ Dirichlet(α1:T)

where the {αt}, t = 1, . . . , T, are fixed hyperparameters. Thus θ is a distribution over T topics with mean θt = αt / Σ_{t′} αt′

2 For i = 1 to N, sample the topic zi of the i'th word

    zi | θ ∼ θ

3 ... and then sample the actual word wi from the zi'th topic

    wi | zi ∼ βzi

where the {βt}, t = 1, . . . , T, are the topics (a fixed collection of distributions on words)
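Here is a small sketch of the three-step generative process above (my own code; the topic matrix β and the hyperparameters α are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, N = 3, 8, 20                         # topics, vocabulary size, words per document
alpha = np.full(T, 0.5)                    # Dirichlet hyperparameters alpha_1..alpha_T
beta = rng.dirichlet(np.ones(V), size=T)   # beta[t] = distribution over words for topic t

def generate_document():
    theta = rng.dirichlet(alpha)                             # 1. topic distribution for the doc
    z = rng.choice(T, size=N, p=theta)                       # 2. topic assignment z_i per word
    w = np.array([rng.choice(V, p=beta[zi]) for zi in z])    # 3. word w_i ~ beta_{z_i}
    return theta, z, w

theta, z, w = generate_document()
print("theta:", np.round(theta, 2))
print("words:", w)
```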
Generative model for a document in LDA
1 Sample the document's topic distribution θ (aka topic vector)

    θ ∼ Dirichlet(α1:T)

where the {αt}, t = 1, . . . , T, are hyperparameters. The Dirichlet density, defined over the simplex ∆ = {θ ∈ R^T : θt ≥ 0 for all t, Σ_{t=1}^{T} θt = 1}, is:

    p(θ1, . . . , θT) ∝ ∏_{t=1}^{T} θt^{αt − 1}

For example, for T = 3 (θ3 = 1 − θ1 − θ2):

[Figure: two surface plots of log Pr(θ) as a function of (θ1, θ2), for two different settings of α1 = α2 = α3 (the specific values are not readable in this copy)]
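A short sketch (my own code) that evaluates the unnormalized Dirichlet log-density Σ_t (αt − 1) log θt on the T = 3 simplex, for a large and a small symmetric α:

```python
import numpy as np

def dirichlet_log_density(theta, alpha):
    """Unnormalized log p(theta); correct up to an additive constant."""
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)
    return np.sum((np.asarray(alpha) - 1.0) * np.log(theta))

theta = np.array([0.2, 0.3, 0.5])
print(dirichlet_log_density(theta, alpha=[5.0, 5.0, 5.0]))   # density peaked near the simplex center
print(dirichlet_log_density(theta, alpha=[0.5, 0.5, 0.5]))   # density favors sparse theta (corners)
```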



Generative model for a document in LDA

3 ... and then sample the actual word wi from the zi'th topic

    wi | zi ∼ βzi

where the {βt}, t = 1, . . . , T, are the topics (a fixed collection of distributions on words)

[Figure: the example topics from before (politics, religion, sports), each a distribution βt = p(w | z = t) over words, together with the inference question for a new document: given its words w1, . . . , wN, what is its distribution of topics θ?]
Example of using LDA

[Figure: topics, documents, and topic proportions/assignments. Topics β1, . . . , βT on the left are distributions over words (e.g. gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; data 0.02, number 0.02, computer 0.01); each document has topic proportions θd and per-word topic assignments z1d, . . . , zNd]

Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the …

(Blei, Introduction to Probabilistic Topic Models, 2011)
“Plate” notation for LDA model

[Figure: plate diagram for LDA. Dirichlet hyperparameters α generate the topic distribution θd for each document; θd generates the topic zid of word i of doc d; the word wid is then drawn using the topic-word distributions β. The inner plate ranges over i = 1 to N, the outer plate over d = 1 to D]

Variables within a plate are replicated in a conditionally independent manner



Comparison of mixture and admixture models

[Figure: two plate diagrams side by side. Left (mixture model): a prior distribution θ over topics, a single topic zd per document, and words wid drawn from the topic-word distributions β; plates i = 1 to N and d = 1 to D. Right (LDA, admixture model): Dirichlet hyperparameters α, a topic distribution θd per document, a topic zid per word, and words wid drawn from the topic-word distributions β; same plates]

Model on left is a mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic

Model on right (LDA) is an admixture model
Document is generated from a distribution over topics
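To make the contrast concrete, here is a sketch (my own code; the parameters are made up) of generating a document under each model: the mixture model draws a single topic for the whole document, while LDA draws a fresh topic per word from a document-specific θd:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, N = 3, 8, 10                         # topics, vocabulary size, words per document
beta = rng.dirichlet(np.ones(V), size=T)   # shared topic-word distributions

def mixture_document(theta):
    """Multinomial naive Bayes: one topic z_d for the whole document."""
    z_d = rng.choice(T, p=theta)
    return rng.choice(V, size=N, p=beta[z_d])

def admixture_document(alpha):
    """LDA: a topic z_id per word, drawn from the document's own theta_d."""
    theta_d = rng.dirichlet(alpha)
    z = rng.choice(T, size=N, p=theta_d)
    return np.array([rng.choice(V, p=beta[zi]) for zi in z])

print(mixture_document(theta=np.array([0.2, 0.5, 0.3])))
print(admixture_document(alpha=np.full(T, 0.5)))
```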



Summary

Bayesian networks are given by (G, P), where P is specified as a set of local conditional probability distributions associated with G's nodes

One interpretation of a BN is as a generative model, where variables are sampled in topological order

Local and global independence properties are identifiable via the d-separation criteria

The probability of any full assignment is obtained by multiplying CPDs

Bayes' rule is used to compute conditional probabilities

Marginalization and inference are often computationally difficult

Examples (will show up again): naive Bayes, hidden Markov models, latent Dirichlet allocation



Bayesian networks have limitations

Recall that G is a perfect map for distribution p if I(G) = I(p)

Theorem: Not every distribution has a perfect map as a DAG

Proof.
(By counterexample.) There is a distribution on 4 variables where the only independencies are A ⊥ C | {B, D} and B ⊥ D | {A, C}. This cannot be represented by any Bayesian network.

[Figure: two candidate DAGs over A, B, C, D, labeled (a) and (b)]

Both (a) and (b) encode A ⊥ C | {B, D}, but in neither case does B ⊥ D | {A, C} hold.



Example

Let's come up with an example of a distribution p satisfying A ⊥ C | {B, D} and B ⊥ D | {A, C}

A = Alex's hair color (red, green, blue)
B = Bob's hair color
C = Catherine's hair color
D = David's hair color

Alex and Bob are friends, Bob and Catherine are friends, Catherine and David are friends, David and Alex are friends

Friends never have the same hair color!
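Here is a small brute-force sketch (my own code) that builds this distribution as the uniform distribution over colorings in which friends always differ, and checks the two claimed conditional independencies by enumeration:

```python
from itertools import product

colors = ["red", "green", "blue"]
friends = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
names = ["A", "B", "C", "D"]

# Uniform distribution over colorings where no pair of friends shares a color
support = [dict(zip(names, vals)) for vals in product(colors, repeat=4)
           if all(x[u] != x[v] for u, v in friends)]
p = {tuple(x[n] for n in names): 1.0 / len(support) for x in support}

def marginal(assign):
    """P(the variables in `assign` take the given values)."""
    return sum(pr for vals, pr in p.items()
               if all(vals[names.index(n)] == v for n, v in assign.items()))

def indep_given(X, Y, conds):
    """Check P(X, Y | conds) = P(X | conds) P(Y | conds) for all value combinations."""
    for vals in product(colors, repeat=2 + len(conds)):
        cond = dict(zip(conds, vals[2:]))
        pc = marginal(cond)
        if pc == 0:
            continue
        lhs = marginal({X: vals[0], Y: vals[1], **cond}) / pc
        rhs = (marginal({X: vals[0], **cond}) / pc) * (marginal({Y: vals[1], **cond}) / pc)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

print(indep_given("A", "C", ["B", "D"]))   # True
print(indep_given("B", "D", ["A", "C"]))   # True
print(indep_given("A", "C", []))           # False: A and C are dependent marginally
```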



Bayesian networks have limitations
Although we could represent any distribution as a fully connected BN, this obscures its structure

Alternatively, we can introduce "dummy" binary variables Z and work with a conditional distribution:

[Figure: a BN over A, B, C, D with a binary variable Zi for each pair of friends (Z1, Z2, Z3, Z4), each Zi a child of the corresponding pair]

This satisfies A ⊥ C | {B, D, Z} and B ⊥ D | {A, C, Z}

Returning to the previous example, we would set:
p(Z1 = 1 | a, d) = 1 if a ≠ d, and 0 if a = d
Z1 is the observation that Alex and David have different hair colors
Undirected graphical models

An alternative representation for joint distributions is as an undirected graphical model

As in BNs, we have one node for each random variable

Rather than CPDs, we specify (non-negative) potential functions over sets of variables associated with the cliques C of the graph:

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} φc(xc)

Z is the partition function and normalizes the distribution:

Z = Σ_{x̂1,...,x̂n} ∏_{c∈C} φc(x̂c)

Like CPDs, φc(xc) can be represented as a table, but it is not normalized

Also known as Markov random fields (MRFs) or Markov networks



Undirected graphical models

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} φc(xc),     Z = Σ_{x̂1,...,x̂n} ∏_{c∈C} φc(x̂c)

Simple example (the graph is a triangle on A, B, C, and the potential function on each edge encourages the two variables to take the same value):

φA,B(a, b):           φB,C(b, c):           φA,C(a, c):
        b=0   b=1            c=0   c=1             c=0   c=1
a=0     10    1       b=0    10    1        a=0    10    1
a=1     1     10      b=1    1     10       a=1    1     10

p(a, b, c) = (1/Z) φA,B(a, b) · φB,C(b, c) · φA,C(a, c),

where

Z = Σ_{â,b̂,ĉ ∈ {0,1}³} φA,B(â, b̂) · φB,C(b̂, ĉ) · φA,C(â, ĉ) = 2 · 1000 + 6 · 10 = 2060.
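A small sketch (my own code) that brute-forces Z for this triangle example and confirms the value above:

```python
from itertools import product

def phi(x, y):
    """Each pairwise potential: 10 if the two variables agree, 1 otherwise."""
    return 10 if x == y else 1

Z = sum(phi(a, b) * phi(b, c) * phi(a, c) for a, b, c in product([0, 1], repeat=3))
print(Z)   # 2 * 1000 + 6 * 10 = 2060

# The two fully-agreeing configurations are far more likely than the rest:
p = {(a, b, c): phi(a, b) * phi(b, c) * phi(a, c) / Z for a, b, c in product([0, 1], repeat=3)}
print(p[(0, 0, 0)], p[(1, 0, 1)])   # 1000/2060 ≈ 0.485 vs 10/2060 ≈ 0.005
```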



Hair color example as a MRF
We now have an undirected graph: [Figure: the cycle A - B - C - D - A]

The joint probability distribution is parameterized as

p(a, b, c, d) = (1/Z) φAB(a, b) φBC(b, c) φCD(c, d) φAD(a, d) φA(a) φB(b) φC(c) φD(d)

Pairwise potentials enforce that no friend has the same hair color: φAB(a, b) = 0 if a = b, and 1 otherwise

Single-node potentials specify an affinity for a particular hair color, e.g. φD("red") = 0.6, φD("blue") = 0.3, φD("green") = 0.1

The normalization Z makes the potentials scale invariant! Equivalent to φD("red") = 6, φD("blue") = 3, φD("green") = 1
Markov network structure implies conditional independencies

Let G be the undirected graph where we have one edge for every pair of variables that appear together in a potential

Conditional independence is given by graph separation!

[Figure: an undirected graph whose nodes are partitioned into sets XA, XB, XC, with XB separating XA from XC]

XA ⊥ XC | XB if there is no path from a ∈ A to c ∈ C after removing all variables in B
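Graph separation is just a connectivity test, so it is easy to implement; here is a sketch (my own code, names are mine) using breadth-first search on the graph with the conditioning set removed:

```python
from collections import deque

def separated(adj, A, C, B):
    """adj: node -> set of neighbors (undirected). True if B separates A from C."""
    blocked = set(B)
    frontier = deque(set(A) - blocked)
    seen = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in C:
            return False              # found a path that avoids B
        for m in adj[n] - blocked - seen:
            seen.add(m)
            frontier.append(m)
    return True

# The hair-color Markov network: the cycle A - B - C - D - A
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(separated(adj, {"B"}, {"D"}, {"A", "C"}))   # True:  B ⊥ D | {A, C}
print(separated(adj, {"B"}, {"D"}, {"A"}))        # False: the path B - C - D remains
```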



Example

Returning to the hair color example, its undirected graphical model is the cycle A - B - C - D - A

Since removing A and C leaves no path from D to B, we have D ⊥ B | {A, C}

Similarly, since removing D and B leaves no path from A to C, we have A ⊥ C | {D, B}

No other independencies are implied by the graph



Markov blanket

A set U is a Markov blanket of X if X ∉ U and U is a minimal set of nodes such that X ⊥ (𝒳 − {X} − U) | U, where 𝒳 denotes the set of all variables

In undirected graphical models, the Markov blanket of a variable is precisely its neighbors in the graph

In other words, X is independent of the rest of the nodes in the graph given its immediate neighbors



Proof of independence through separation
We will show that A ⊥ C | B for the following distribution (the chain A - B - C):

p(a, b, c) = (1/Z) φAB(a, b) φBC(b, c)

First, we show that p(a | b) can be computed using only φAB(a, b):

p(a | b) = p(a, b) / p(b)
         = [ (1/Z) Σ_ĉ φAB(a, b) φBC(b, ĉ) ] / [ (1/Z) Σ_{â,ĉ} φAB(â, b) φBC(b, ĉ) ]
         = [ φAB(a, b) Σ_ĉ φBC(b, ĉ) ] / [ Σ_â φAB(â, b) Σ_ĉ φBC(b, ĉ) ]
         = φAB(a, b) / Σ_â φAB(â, b)

More generally, the probability of a variable conditioned on its Markov blanket depends only on the potentials involving that node
Proof of independence through separation

We will show that A ⊥ C | B for the following distribution (the chain A - B - C):

p(a, b, c) = (1/Z) φAB(a, b) φBC(b, c)

Proof.

p(a, c | b) = p(a, c, b) / Σ_{â,ĉ} p(â, b, ĉ)
            = φAB(a, b) φBC(b, c) / Σ_{â,ĉ} φAB(â, b) φBC(b, ĉ)
            = [ φAB(a, b) φBC(b, c) ] / [ Σ_â φAB(â, b) Σ_ĉ φBC(b, ĉ) ]
            = p(a | b) p(c | b)

