The "student" Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with its conditional probability tables:

P(G | I, D):
           g1     g2     g3
  i0, d0   0.3    0.4    0.3
  i0, d1   0.05   0.25   0.7
  i1, d0   0.9    0.08   0.02
  i1, d1   0.5    0.3    0.2

P(S | I):
        s0     s1
  i0    0.95   0.05
  i1    0.2    0.8

P(L | G):
        l0     l1
  g1    0.1    0.9
  g2    0.4    0.6
  g3    0.99   0.01
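To make the factorization concrete, here is a minimal Python sketch that evaluates the joint via the chain rule. The priors P(D) and P(I) are not shown in the tables above, so the values used here (p(d0) = 0.6, p(i0) = 0.7) are assumptions for illustration only.

```python
# Joint probability in the student network via the chain rule:
# P(I, D, G, S, L) = P(I) P(D) P(G | I, D) P(S | I) P(L | G).
p_D = {"d0": 0.6, "d1": 0.4}   # assumed prior (not in the tables above)
p_I = {"i0": 0.7, "i1": 0.3}   # assumed prior (not in the tables above)
p_G = {  # P(G | I, D), from the table above
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
p_S = {"i0": {"s0": 0.95, "s1": 0.05},  # P(S | I)
       "i1": {"s0": 0.20, "s1": 0.80}}
p_L = {"g1": {"l0": 0.10, "l1": 0.90},  # P(L | G)
       "g2": {"l0": 0.40, "l1": 0.60},
       "g3": {"l0": 0.99, "l1": 0.01}}

def joint(i, d, g, s, l):
    return p_I[i] * p_D[d] * p_G[(i, d)][g] * p_S[i][s] * p_L[g][l]

print(joint("i1", "d0", "g1", "s1", "l1"))  # 0.3*0.6*0.9*0.8*0.9 ≈ 0.1166
```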
[Figures: example graphs (a) and (b) over variables X, Y, Z, V, W; two chain-structured models over X1, ..., X6]
p(x) = (2π)^{−N/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
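As a sanity check on the density formula, here is a short numpy/scipy sketch comparing a direct implementation against scipy.stats; the 2-dimensional µ and Σ are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

# Direct evaluation of the multivariate Gaussian density.
N = len(mu)
diff = x - mu
direct = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma)))

print(np.isclose(direct, multivariate_normal(mu, Sigma).pdf(x)))  # True
```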
Many applications in information retrieval, document summarization, and classification.

Almost all uses of topic models (e.g., for unsupervised learning, information retrieval, classification) require probabilistic inference:

New document: What is this document about?

  weather  .50
  finance  .49
  sports   .01

Words w_1, ..., w_N → distribution of topics θ
LDA is one of the simplest and most widely used topic models
Generative model for a document in LDA

1 Sample the document's topic distribution θ (aka topic vector):
    θ ∼ Dirichlet(α_{1:T})
2 For each word i = 1, ..., N, sample the word's topic:
    z_i | θ ∼ θ
3 ... and then sample the actual word w_i from the z_i'th topic:
    w_i | z_i ∼ β_{z_i}
where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words).
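The three steps translate directly into code. A minimal sketch of the generative process for one document, where the vocabulary, the topics β, and the hyperparameters α are toy values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "game", "stock", "bank"]
beta = np.array([               # T = 2 topics: distributions over words
    [0.45, 0.45, 0.05, 0.05],   # a "sports"-like topic
    [0.05, 0.05, 0.45, 0.45],   # a "finance"-like topic
])
alpha = np.array([1.0, 1.0])    # Dirichlet hyperparameters
N = 6                           # words in the document

theta = rng.dirichlet(alpha)                        # step 1
z = rng.choice(len(alpha), size=N, p=theta)         # step 2: topic per word
words = [vocab[rng.choice(len(vocab), p=beta[t])]   # step 3: word per topic
         for t in z]
print(theta, z, words)
```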
Generative model for a document in LDA

1 Sample the document's topic distribution θ (aka topic vector):
    θ ∼ Dirichlet(α_{1:T}),
  where the {α_t}_{t=1}^T are hyperparameters. The Dirichlet density, defined over the simplex Δ = {θ ∈ R^T : θ_t ≥ 0 for all t, Σ_{t=1}^T θ_t = 1}, is

    p(θ_1, ..., θ_T) ∝ ∏_{t=1}^T θ_t^{α_t − 1}

  [Figure: Dirichlet densities over (θ_1, θ_2) for two settings of α]

2 For each word i = 1, ..., N, sample the word's topic:
    z_i | θ ∼ θ
3 ... and then sample the actual word w_i from the z_i'th topic:
    w_i | z_i ∼ β_{z_i},
  where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words).

Complexity of Inference in Latent Dirichlet Allocation
David Sontag, Daniel Roy (NYU, Cambridge)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents.
Documents → Topics

  politics    .0100    religion   .0500    sports      .0105
  president   .0095    hindu      .0092    baseball    .0100
  obama       .0090    judaism    .0080    soccer      .0055
  washington  .0085    ethics     .0075    basketball  .0050
  religion    .0060    buddhism   .0016    football    .0045
  ...                  ...                 ...

β_t = p(w | z = t)
Almost all uses of topic models (e.g., for unsupervised learning, information retrieval, classification) require probabilistic inference:

New document: What is this document about?
Example of using LDA
[Figure: example topics as distributions over words, with a document's topic proportions θ_d, per-word topic assignments z_{1:N,d}, and topics β_{1:T}:
   life .02, evolve .01, organism .01, ...
   brain .04, neuron .02, nerve .01, ...
   data .02, number .02, computer .01, ...]

Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the ... (Blei, Introduction to Probabilistic Topic Models, 2011)
“Plate” notation for LDA model
[Figure: plate diagram.
   α: Dirichlet hyperparameters
   θ_d: topic distribution for document d
   z_{id}: topic of word i of doc d
   w_{id}: word i of doc d
   β: topic-word distributions
   inner plate: i = 1 to N; outer plate: d = 1 to D]
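Reading off the plates, the joint distribution factorizes as follows (the standard LDA factorization, matching the generative steps above):

```latex
p(\theta_{1:D}, z, w \mid \alpha, \beta)
  = \prod_{d=1}^{D} \Big[ p(\theta_d \mid \alpha)
      \prod_{i=1}^{N} p(z_{id} \mid \theta_d) \, p(w_{id} \mid \beta_{z_{id}}) \Big]
```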
Proof.
(By counterexample.) There is a distribution on 4 variables where the only independencies are A ⊥ C | {B, D} and B ⊥ D | {A, C}. This cannot be represented by any Bayesian network.

[Figure: candidate Bayesian networks (a) and (b) over A, B, C, D]

Both (a) and (b) encode (A ⊥ C | B, D), but in both cases B and D are not independent given {A, C}.
[Figure: four-node cycle Markov network over Z1, Z2, Z3, Z4]
p(x_1, ..., x_n) = (1/Z) ∏_{c∈C} φ_c(x_c),   where   Z = Σ_{x̂_1,...,x̂_n} ∏_{c∈C} φ_c(x̂_c)
Simple example (the potential function on each edge encourages the variables to take the same value):

[Figure: triangle Markov network over A, B, C]

φ_{A,B}(a, b):       φ_{B,C}(b, c):       φ_{A,C}(a, c):
       b=0  b=1            c=0  c=1            c=0  c=1
  a=0   10    1       b=0   10    1       a=0   10    1
  a=1    1   10       b=1    1   10       a=1    1   10

p(a, b, c) = (1/Z) φ_{A,B}(a, b) · φ_{B,C}(b, c) · φ_{A,C}(a, c),

where

Z = Σ_{(â,b̂,ĉ) ∈ {0,1}^3} φ_{A,B}(â, b̂) · φ_{B,C}(b̂, ĉ) · φ_{A,C}(â, ĉ) = 2 · 1000 + 6 · 10 = 2060.
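The value of Z is easy to verify by brute-force enumeration of all 2^3 assignments; a minimal sketch:

```python
from itertools import product

def phi(x, y):
    # Edge potential from the tables above: 10 on agreement, 1 otherwise.
    return 10 if x == y else 1

Z = sum(phi(a, b) * phi(b, c) * phi(a, c)
        for a, b, c in product([0, 1], repeat=3))
print(Z)  # 2060 = 2*1000 + 6*10
```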
Pairwise potentials enforce that no friend has the same hair color:
φAB (a, b) = 0 if a = b, and 1 otherwise
Single-node potentials specify an affinity for a particular hair color, e.g.
φD (“red”) = 0.6, φD (“blue”) = 0.3, φD (“green”) = 0.1
Because of the normalization constant Z, the distribution is invariant to rescaling the potentials! The above is equivalent to
φD (“red”) = 6, φD (“blue”) = 3, φD (“green”) = 1
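A quick numerical check of this scale invariance, on a hypothetical two-person version of the friends network (one pairwise potential plus the single-node potential on D):

```python
import math
from itertools import product

colors = ["red", "blue", "green"]
phi_D = {"red": 0.6, "blue": 0.3, "green": 0.1}

def phi_AD(a, d):
    # Friends must have different hair colors.
    return 0.0 if a == d else 1.0

def prob(a, d, scale):
    # Rescaling phi_D by `scale` multiplies every unnormalized term,
    # so it cancels against Z.
    unnorm = {(x, y): phi_AD(x, y) * scale * phi_D[y]
              for x, y in product(colors, repeat=2)}
    return unnorm[(a, d)] / sum(unnorm.values())

print(math.isclose(prob("red", "blue", 1.0), prob("red", "blue", 10.0)))  # True
```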
Markov network structure implies conditional independencies
Let G be the undirected graph where we have one edge for every pair
of variables that appear together in a potential
Conditional independence is given by graph separation!
[Figure: graph with node sets X_A, X_B, X_C, where every path from X_A to X_C passes through X_B, so X_A ⊥ X_C | X_B]
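Graph separation is easy to test mechanically. A sketch using networkx on a small made-up graph, where removing X_B should disconnect X_A from X_C:

```python
import networkx as nx

G = nx.Graph([("A1", "B1"), ("A2", "B1"), ("B1", "B2"),
              ("B2", "C1"), ("B2", "C2")])
X_A, X_B, X_C = {"A1", "A2"}, {"B1", "B2"}, {"C1", "C2"}

# X_A is independent of X_C given X_B iff deleting X_B
# disconnects every node of X_A from every node of X_C.
H = G.copy()
H.remove_nodes_from(X_B)
print(all(not nx.has_path(H, a, c) for a in X_A for c in X_C))  # True
```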
Consider the Markov chain A - B - C, with

p(a, b, c) = (1/Z) φ_{AB}(a, b) φ_{BC}(b, c).

Proof.
First, we show that p(a | b) can be computed using only φ_{AB}(a, b):

p(a | b) = p(a, b) / p(b)
         = [ (1/Z) Σ_ĉ φ_{AB}(a, b) φ_{BC}(b, ĉ) ] / [ (1/Z) Σ_{â,ĉ} φ_{AB}(â, b) φ_{BC}(b, ĉ) ]
         = [ φ_{AB}(a, b) · Σ_ĉ φ_{BC}(b, ĉ) ] / [ Σ_â φ_{AB}(â, b) · Σ_ĉ φ_{BC}(b, ĉ) ]
         = φ_{AB}(a, b) / Σ_â φ_{AB}(â, b).
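The last step can be confirmed numerically: computing p(a | b) from the full joint gives the same answer as the φ_{AB}-only formula. A sketch with arbitrary potential values:

```python
from itertools import product

phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 5.0}
phi_BC = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}

# Unnormalized joint; Z cancels in the conditional, so we skip it.
joint = {(a, b, c): phi_AB[a, b] * phi_BC[b, c]
         for a, b, c in product([0, 1], repeat=3)}

a, b = 1, 0
lhs = (sum(joint[a, b, c] for c in [0, 1])
       / sum(joint[a2, b, c] for a2 in [0, 1] for c in [0, 1]))
rhs = phi_AB[a, b] / sum(phi_AB[a2, b] for a2 in [0, 1])
print(abs(lhs - rhs) < 1e-12)  # True
```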