
A generic approach to topic models

and its application to virtual communities


Gregor Heinrich
PhD presentation (English translation incl. backup slides, 45min)
Faculty of Mathematics and Computer Science
University of Leipzig

28 November 2012
Version 2.9 EN BU


Overview

Introduction
Generic topic models
Inference methods
Application to virtual communities
Conclusions and outlook


Motivation: Virtual communities


[Figure: persons and documents in a virtual community, connected by relations such as authorship, annotation, citation, recommendation, cooperation and similarity]

- Virtual communities = groups of persons who exchange information and knowledge electronically
- Examples: organisations, digital libraries, Web 2.0 applications incl. social networks
- Data are multimodal: text content; authorship, citation, annotations and recommendations; cooperation and other social relations
- Typical case: discrete data with high dynamics and large volumes

Motivation: Unsupervised mining of discrete data

- Identification of relationships in large data volumes
- Only data (and possibly a model) required (information retrieval, network analysis, clustering, NLP methods)
- Density problem: features too sparse for analysis in the high-dimensional feature space
- Vocabulary problem: semantic similarity ≠ lexical similarity (polysemy, synonymy, etc.)

[Figure: word-sense graph illustrating the vocabulary problem: ambiguous terms such as "bank", "bar", "counter", "stick", "table", "court" and "yard" linked to senses (location, furniture, people, verb, long object, ...) and related words (restaurant, teller, atm, computer, network, glue, employees, staff, personnel, ...)]



Topic models as approach

[Figure: documents (e.g. "Playing Drums for Beginners", "Rhythm & Spice Jamaican Grill", "Leipzig's Bars and Restaurants") linked to latent topics, which in turn are linked to words such as rhythm, drum, bar, wine, restaurant]

- Probabilistic representations of grouped discrete data
- Illustrative for text: words grouped in documents
- Latent topics = probability distributions over the vocabulary. The dominant terms of a topic are semantically similar.
- Language = mixture of topics (latent semantic structure)
- Reduces the vocabulary problem: finds semantic relations
- Reduces the density problem: dimensionality reduction


Language models: Unigram model

[Figure: a single term distribution p(w | z) generates every word w_{m,n} of every document]

One distribution for all data

Language models: Unigram mixture model

[Figure: each document m has one topic z_m; the corresponding term distribution p(w | z_m) generates all words of that document]

One distribution per document

Language models: Unigram admixture model

[Figure: each word w_{m,n} has its own topic z_{m,n}; the corresponding term distribution p(w | z_{m,n}) generates that word]

One distribution per word: the basic topic model
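To make the difference between the three variants concrete (an added sketch in the slide notation, not part of the original deck): the unigram model scores every word with one shared term distribution, the mixture model uses the single document topic z_m, and the admixture model mixes topics per word, p(w_{m,n} | ~theta_m, Phi) = sum_k theta_{m,k} * phi_{k,w}. A minimal Java sketch of the admixture case:

    // Minimal sketch (illustration only): word probability under the
    // admixture model, p(w | theta, Phi) = sum_k theta[k] * phi[k][w].
    class AdmixtureSketch {
        static double wordProbability(double[] theta, double[][] phi, int w) {
            double p = 0.0;
            for (int k = 0; k < theta.length; k++)
                p += theta[k] * phi[k][w];
            return p;
        }
    }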


Bayesian topic models: The Dirichlet distribution

Bayesian methodology:
- Distributions are themselves generated from prior distributions
- For language and other discrete data the Dirichlet distribution is an important prior: the term distribution p(w | z) is drawn as ~phi ~ Dir(~alpha)
- Defined on the simplex: the surface containing all discrete distributions
- The parameter ~alpha controls its behaviour

[Figure: the simplex of discrete distributions (p1, p2, p3) and Dirichlet densities on it, e.g. for ~alpha = (4, 4, 2)]

Bayesian topic model: Latent Dirichlet Allocation (LDA) (Blei et al. 2003)
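As a concrete illustration of the prior (an added sketch, not code from the dissertation): a Dirichlet sample can be drawn by normalising independent Gamma variates; the Gamma sampler below is the standard Marsaglia-Tsang method.

    import java.util.Random;

    // Minimal sketch: drawing a discrete distribution p ~ Dir(alpha)
    // by normalising independent Gamma(alpha_k, 1) variates.
    class DirichletSketch {

        // Marsaglia-Tsang sampler for Gamma(shape, 1).
        static double sampleGamma(double shape, Random rng) {
            if (shape < 1.0) {
                // boost: Gamma(a) = Gamma(a + 1) * U^(1/a)
                return sampleGamma(shape + 1.0, rng)
                        * Math.pow(rng.nextDouble(), 1.0 / shape);
            }
            double d = shape - 1.0 / 3.0;
            double c = 1.0 / Math.sqrt(9.0 * d);
            while (true) {
                double x = rng.nextGaussian();
                double v = 1.0 + c * x;
                if (v <= 0.0) continue;
                v = v * v * v;
                double u = rng.nextDouble();
                if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v))
                    return d * v;
            }
        }

        // One sample from Dir(alpha): a point on the simplex.
        static double[] sampleDirichlet(double[] alpha, Random rng) {
            double[] p = new double[alpha.length];
            double sum = 0.0;
            for (int k = 0; k < alpha.length; k++) {
                p[k] = sampleGamma(alpha[k], rng);
                sum += p[k];
            }
            for (int k = 0; k < alpha.length; k++)
                p[k] /= sum;
            return p;
        }
    }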

Latent Dirichlet Allocation

[Figure: graphical model of LDA: per document m a topic distribution ~theta_m ~ Dir(~alpha); per topic k a term distribution ~phi_k ~ Dir(~beta); per word n in document m a topic z_{m,n} ~ ~theta_m and a word w_{m,n} ~ ~phi_{z_{m,n}}]

Latent Dirichlet Allocation (Blei et al. 2003)

Generative process, illustrated for the example "Concert tonight at Rhythm and Spice Restaurant ...":
- Generate word distributions ~phi_k for all topics (e.g. topic 1: restaurant, bar, food, grill, ...; topic 2: concert, music, rhythm, bar, ...)
- Generate the topic distribution ~theta_1 for document 1 (weights over topic 1, topic 2, ...)
- Sample the topic index for the first word, z = 2
- Sample a word from the term distribution of topic 2: "concert"
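A minimal Java sketch of this generative process (an illustration added here, not taken from the dissertation); it samples topic indices and words for one document, given already drawn parameters ~theta_m and ~phi_k, which would themselves be Dirichlet draws as sketched above:

    import java.util.Random;

    // Minimal sketch: LDA's generative process for one document,
    // given theta (topic weights of the document) and phi (term weights per topic).
    class LdaGenerativeSketch {

        // Draw an index from an (unnormalised) discrete distribution.
        static int sampleDiscrete(double[] weights, Random rng) {
            double sum = 0.0;
            for (double w : weights) sum += w;
            double u = rng.nextDouble() * sum;
            for (int i = 0; i < weights.length; i++) {
                u -= weights[i];
                if (u <= 0.0) return i;
            }
            return weights.length - 1;
        }

        // Generate N word indices for one document.
        static int[] generateDocument(double[] theta, double[][] phi, int N, Random rng) {
            int[] words = new int[N];
            for (int n = 0; n < N; n++) {
                int z = sampleDiscrete(theta, rng);      // topic for word n
                words[n] = sampleDiscrete(phi[z], rng);  // word drawn from topic z
            }
            return words;
        }
    }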

State of the art

Large number of published models that extend LDA:
- Authors (Rosen-Zvi et al. 2004),
- Citations (Dietz et al. 2007),
- Hierarchy (Li and McCallum 2006; Li et al. 2007),
- Image features and captions (Barnard et al. 2003), etc.
- Search results for "topic model" (title + abstract) since 2012 alone: ACM >400, Google Scholar >1300.
Expanding research area with practical relevance

But: no existing analysis as a generic model class
- Partly tedious derivations, especially for inference algorithms

Conjecture:
- Important properties are generic across models
- Simplifications for the derivation of concrete model properties, inference algorithms and design methods

[Overlay figure: plate diagram of the expert-tag-topic model (Heinrich 2011), shown as an example of such an LDA extension]

Research questions

How can topic models be described in a generic way in order to use their properties across particular applications?
Can generic topic models be implemented generically and, if so, can repeated structures be exploited for optimisations?
How can generic models be applied to data in virtual communities?

Overview

Introduction
Generic topic models
Inference methods
Application to virtual communities
Conclusions and outlook


How can topic models be described in a generic way in order to use their properties across particular applications?

Topic models: Example structures


[Figure: graphical models of (a) latent Dirichlet allocation (LDA), (b) the author-topic model (ATM), (c) the four-level pachinko allocation model (PAM4), and (d) hierarchical PAM (hPAM)]

(Blei et al. 2003; Rosen-Zvi et al. 2004; Li and McCallum 2006; Li et al. 2007)

Generic topic models: NoMMs

[Figure: generic NoMM nodes: a discrete component ~phi_k ~ Dir(~alpha) is selected by the value(s) on the incoming edge(s), k = f(x_in) or k = f(x_in1, x_in2), and emits value(s) x_out on the outgoing edge(s). Example: LDA as a chain of two such nodes, m -> ~theta_m | alpha -> z_{m,n} -> ~phi_k | beta -> w_{m,n}; e.g. for document m = 1 the topic z_{1,1} = 3 is drawn from ~theta_1 and the word w_{1,1} = 2 from ~phi_3]

Generic characteristics of topic models:
- Levels with discrete components ~phi_k, generated from Dirichlet distributions
- Coupling via the values of discrete variables x

Network of mixed membership (NoMM): compact representation for topic models
- Directed acyclic graph
- Node: sample from a mixture component, selected via incoming edges; terminal node: observation
- Edge: propagation of discrete values to child nodes

Topic models as NoMMs

[Figure: NoMM representations next to the corresponding graphical models:
(a) Latent Dirichlet allocation, LDA: m -> ~theta_m | alpha ([M]) -> z_{m,n} = k -> ~phi_k | beta ([K]) -> w_{m,n} = t ([V])
(b) Author-topic model, ATM: the observed author distribution ~a_m ([M]) selects an author x_{m,n} = x ([A]); ~theta_x | alpha generates z_{m,n} = k ([K]), and ~phi_k | beta generates w_{m,n} = t ([V])
(c) Pachinko allocation model, PAM: a chain of document-specific distributions, ~theta^r_m -> z2_{m,n} = x -> ~theta_{m,x} -> z3_{m,n} = y -> ~phi_y -> w_{m,n} = t ([V])
(d) Hierarchical pachinko allocation model, hPAM: super-topics z^T_{m,n}, sub-topics z^t_{m,n} and a level variable select one of |T| + |t| + 1 term distributions ~phi_k that generates w_{m,n}]

(Blei et al. 2003; Rosen-Zvi et al. 2004; Li and McCallum 2006; Li et al. 2007)

Overview

Introduction
Generic topic models
Inference methods
Application to virtual communities
Conclusions and outlook


Can generic topic models be implemented generically ...?

Bayesian inference problem and Gibbs sampler

Bayesian inference: inversion of the generative process:
- Find distributions over the parameters Θ and the latent variables/topics H, given observations V and Dirichlet parameters A
- i.e. determine the posterior distribution p(H, Θ | V, A)
- Intractability -> approximate approaches
- Gibbs sampling: variant of Markov chain Monte Carlo (MCMC)
- In topic models: marginalise out the parameters Θ (collapsed Gibbs sampler)
- Sample the topics H_i for each data point i in turn: H_i ~ p(H_i | H_¬i, V, A)

[Figure: example NoMM chain ~theta^r_m -> x (= H1) -> ~theta_{m,x} -> y (= H2) -> ~phi_y -> w (= V); the sought posterior is p(Θ1, H1, Θ2, H2, Θ3 | V, A), and the collapsed Gibbs sampler instead draws H_i1, H_i2 ~ p(H_i1, H_i2 | H_¬i, V, A)]

Sampling distribution for NoMMs

[Figure: example NoMM chain with its per-level factors q(m, x), q((m, x), y), q(y, w)]

The Gibbs sampler can be derived generically (Heinrich 2009).

Typical case: a product of count-ratio factors, one per level l:

    p(H_i | H_¬i, V, A)  ∝  prod_l q(k, t)^[l],
    with  q(k, t) = (n_{k,t}^¬i + alpha_t) / sum_t (n_{k,t}^¬i + alpha_t)

- n_{k,t} = count of co-occurrences between input and output values of a level (components and samples)
- More complex variants are covered by generalised q-functions, e.g.
  q(k, t) ≜ beta({n_{k,t}}_{t=1}^T + ~alpha) / beta({n_{k,t}^¬i}_{t=1}^T + ~alpha)
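For the simplest NoMM, plain LDA, this product of q-factors is the familiar collapsed Gibbs update. A minimal Java sketch (an illustration; variable names are mine, not output of the meta-sampler):

    import java.util.Random;

    // Minimal sketch: one collapsed Gibbs update for LDA,
    // p(z_i = k | .) ∝ (nmk[m][k] + alpha) * (nkt[k][t] + beta) / (nk[k] + V * beta).
    class CollapsedGibbsSketch {
        static int resampleTopic(int m, int t, int oldK,
                                 int[][] nmk, int[][] nkt, int[] nk,
                                 double alpha, double beta, int V, Random rng) {
            int K = nk.length;
            // remove the current assignment from the counts
            nmk[m][oldK]--; nkt[oldK][t]--; nk[oldK]--;
            // unnormalised sampling distribution over topics
            double[] p = new double[K];
            double sum = 0.0;
            for (int k = 0; k < K; k++) {
                p[k] = (nmk[m][k] + alpha) * (nkt[k][t] + beta) / (nk[k] + V * beta);
                sum += p[k];
            }
            // draw the new topic
            double u = rng.nextDouble() * sum;
            int newK = K - 1;
            for (int k = 0; k < K; k++) {
                u -= p[k];
                if (u <= 0.0) { newK = k; break; }
            }
            // add the new assignment back to the counts
            nmk[m][newK]++; nkt[newK][t]++; nk[newK]++;
            return newK;
        }
    }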

Typology of NoMM substructures

[Figure: library of NoMM substructures with their q-function representations:
N1. Dirichlet-multinomial node: q(a, z) q(z, b)
N2. Observed parameters: an observed component c_{a,z} replaces the q-factor, c_{a,z} q(z, b)
E2. Autonomous edges: q(a, z) q(z, b) q(z, c)
E3. Coupled edges: q(a, x ∧ y) q(x, b) q(y, c)
C2. Combined indices: q(a, x) q(b, y) q(k, c) with k = f(x, y)
C3. Interleaved indices: q(a, z1) q(b, z2) with interleaved factors over the shared output c]

NoMM substructures: nodes, edges/branches, component indices / merging of edges:
- Representation via q-functions and likelihood
- Multiple samples per data point: q(a, x ∧ y) for the respective level
- Library incl. additional structures: alternative distributions, regression, aggregation etc. -> q-functions + other factors
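One way to picture the role of the q-functions (my own sketch, not code from the meta-sampler): each substructure contributes an object that maps a component index k and an outcome t to its count-ratio factor. A minimal Java version of the plain Dirichlet-multinomial case:

    // Minimal sketch: a q-function as it appears in the NoMM sampling
    // distribution, here q(k, t) = (n[k][t] + alpha) / (nsum[k] + T * alpha).
    interface QFunction {
        double value(int k, int t);
    }

    class DirichletMultinomialQ implements QFunction {
        final int[][] n;     // co-occurrence counts of component k and outcome t
        final int[] nsum;    // row sums of n
        final double alpha;  // symmetric Dirichlet hyperparameter
        final int T;         // number of outcomes

        DirichletMultinomialQ(int[][] n, int[] nsum, double alpha, int T) {
            this.n = n; this.nsum = nsum; this.alpha = alpha; this.T = T;
        }

        public double value(int k, int t) {
            return (n[k][t] + alpha) / (nsum[k] + T * alpha);
        }
    }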

Implementation: Gibbs meta-sampler

[Figure: tool chain of the Gibbs meta-sampler: a topic model specification is validated by the NoMM code generator, which uses code templates to generate a Java model instance (prototype on the Java VM) and C/Java code modules that are compiled, optimised and deployed on the native platform]

- Code generator for topic models in Java and C
- Separation of knowledge domains: topic model applications vs. machine learning vs. computing architecture

Example NoMM script and generated kernel: hPAM2

[Figure: NoMM for hPAM2: m -> ~theta_m | alpha -> x_mn (super-topic, [X]) -> ~theta^x_{m,x} | alpha^x -> y_mn (sub-topic, [Y]) -> ~phi_k | beta -> w_mn ([V]), with the component index k chosen as
    x = 0            : k = 0
    x ≠ 0, y = 0     : k = 1 + x
    x ≠ 0, y ≠ 0     : k = 1 + X + y]

NoMM script:

    model = HPAM2
    description:
        Hierarchical PAM model 2 (HPAM2)
    sequences:
        # variables sampled for each (m,n)
        w, x, y : m, n
    network:
        # each line one NoMM node
        m   >> theta  | alpha     >>
        m,x >> thetax | alphax[x] >>
        x,y >> phi[k]             >>
        # java code to assign k
        k : {
            if (x==0) { k = 0; }
            else if (y==0) k = 1 + x;
            else k = 1 + X + y;
        }.

Generated sampling kernel (Java; the count arrays, as their names suggest, hold nmx = document x super-topic, nmxy = (document, super-topic) x sub-topic and nkw = component x word counts):

    // hidden edge: super-topic x
    for (hx = 0; hx < X; hx++) {
        // hidden edge: sub-topic y
        for (hy = 0; hy < Y; hy++) {
            mxsel = X * m + hx;
            mxjsel = hx;
            if (hx == 0)
                ksel = 0;
            else if (hy == 0)
                ksel = 1 + hx;
            else
                ksel = 1 + X + hy;
            pp[hx][hy] = (nmx[m][hx] + alpha[hx])
                * (nmxy[mxsel][hy] + alphax[mxjsel][hy])
                / (nmxysum[mxsel] + alphaxsum[mxjsel])
                * (nkw[ksel][w[m][n]] + beta)
                / (nkwsum[ksel] + betasum);
            psum += pp[hx][hy];
        } // for hy
    } // for hx

Document-topic distribution in the Gibbs sampler

[Animation: document-topic matrix (200 documents, 50 topics) shown at Gibbs iterations 1, 5, 10, 15, 20, 30, 40, 50, 60, 80, 100, 120, 150, 200, 300 and 500; at iteration 500 the chain has converged to a stationary state]

Fast sampling: Hybrid scaling methods

[Figure: test perplexity over Gibbs iterations (1 to 5000) for LDA and PAM4 with dependent vs. independent samplers; table of speedups (serial, parallel, independent) for LDA (dim. 500) and PAM4 (dim. 40 x 40) with values 30.2, 7.4, 24.1, 49.8]

Serial and parallel scaling methods:
- Generalised results for LDA to generic NoMMs, specifically (Porteous et al. 2008; Newman et al. 2009) + a novel approach

Problem: the sampling space for statistically dependent variables has size K * L * ...
- Independence assumption: separate samplers with dimensions K + L + ... << K * L * ...
- Empirical result: more iterations needed, but comparable topic quality
- Hybrid approaches with independent samplers are highly effective
- Implementation: complexity covered by the meta-sampler
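A minimal sketch of the independence assumption (my illustration, not the dissertation's implementation): instead of enumerating all K * L joint states of two statistically dependent hidden variables, separate samplers draw each variable from its own distribution, so only K + L states are evaluated per data point:

    import java.util.Random;

    class HybridSamplingSketch {
        static int sampleDiscrete(double[] w, Random rng) {
            double sum = 0.0;
            for (double v : w) sum += v;
            double u = rng.nextDouble() * sum;
            for (int i = 0; i < w.length; i++) {
                u -= w[i];
                if (u <= 0.0) return i;
            }
            return w.length - 1;
        }

        // Dependent (joint) sampling: all K*L states are enumerated,
        // as in the generated hPAM2 kernel shown earlier.
        static int[] sampleJoint(double[][] pxy, Random rng) {
            int K = pxy.length, L = pxy[0].length;
            double[] flat = new double[K * L];
            for (int k = 0; k < K; k++)
                for (int l = 0; l < L; l++)
                    flat[k * L + l] = pxy[k][l];
            int i = sampleDiscrete(flat, rng);
            return new int[] { i / L, i % L };
        }

        // Independence assumption: x and y are drawn separately,
        // only K + L states per data point.
        static int[] sampleIndependent(double[] px, double[] py, Random rng) {
            return new int[] { sampleDiscrete(px, rng), sampleDiscrete(py, rng) };
        }
    }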

Overview

Introduction
Generic topic models
Inference methods
Application to virtual communities
Conclusions and outlook


How can generic models be applied to data in virtual communities?


NoMM design process

Typology -> library of NoMM substructures (cf. the typology slide above)

[Figure: design process flow chart with the steps: define modelling task and metrics; define evidence; formulate model assumptions; compose model and predict properties; create model terminals; write NoMM script; generate and adapt Gibbs sampler; implement target metric; evaluate based on test corpus; optimise and integrate for target platform]

Idea: construct models from simple substructures that connect terminal nodes:
- Terminal nodes <-> multimodal data (virtual communities, ...)
- Substructures <-> relationships in the data; latent semantics

Process:
- Assumptions on dependencies in the data
- Iterative association to structures in the model (using the typology)
- The Gibbs distribution is known -> model behaviour: q(x, y) = "rich get richer"
- Implementation and test with the Gibbs meta-sampler; possibly iterate

Application: Expert finding with tag annotations

[Figure: a document m with author vector ~a_m (e.g. authors AB, TH), word vector ~w_m and tag vector ~c_m]

Scenario: expert finding via documents with tag annotations
- Authors of relevant documents <-> experts
- Documents frequently carry additional annotations, here: tags
- Goal: enable tag queries, improve the quality of text queries
- Problem: tags are often incomplete, partly wrong
- Connection of tags and experts via topics

(1) Data: for each document m: text ~w_m, authors ~a_m, tags ~c_m
(2) Goal: tag query ~c': p(~c' | a) = max; word query ~w': p(~w' | a) = max
(3) Terminal nodes: authors on the input side, words and tags on the output side
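As an illustration of goal (2) (my own sketch, not the retrieval method of the dissertation): once the model links authors to tags via latent topics, a tag query can be scored per author by marginalising over topics, p(c | a) = sum_y p(y | a) p(c | y), using point estimates of the distributions provided by the model constructed on the following slides:

    // Minimal sketch: score author a for a tag query by marginalising over
    // latent tag topics y. thetaA[y] = p(y | a) and psi[y][c] = p(c | y) are
    // assumed point estimates from the fitted model.
    class TagQuerySketch {
        static double score(int[] queryTags, double[] thetaA, double[][] psi) {
            double score = 1.0;                 // product over the query tags
            for (int c : queryTags) {
                double pc = 0.0;
                for (int y = 0; y < thetaA.length; y++)
                    pc += thetaA[y] * psi[y][c];
                score *= pc;
            }
            return score;
        }
    }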

Model assumptions

[Figure: document m with authors ~a_m, words ~w_m and tags ~c_m, as before]

(4) Model assumptions:
(a) The expertise of an author is weighted with his or her share of the authorship
(b) The semantics of expertise are expressed by topics z. Each author has a single field of expertise (topic distribution).
(c) The semantics of tags are expressed by topics y

Model construction

(5) Model construction:
(a) Start with the terminal nodes (from step 3): authors ~a_m, words w_{m,n}, tags ~c_m
        p(... | ~a, ~w, ~c) ∝ ...
(b) Authorship ~a_m is given as an observed distribution; a node samples the author x of each word
        p(x, ... | .) ∝ a_{m,x} q(x, ...) ...
(c) Each author has only a single field of expertise (topic distribution): q(x, z) associates word topics z with the sampled authors x (cf. ATM)
        p(x, z, ... | .) ∝ a_{m,x} q(x, z) ...
(d) Topic distributions over terms: connect z and w via q(z, w)
        p(x, z, ... | .) ∝ a_{m,x} q(x, z) q(z, w) ...
(e) Introduce tag topics y_{m,j} for the tags c_{m,j} as distributions over tags; q(x, z ∧ y) overlays the values for z and y
        p(x, z, y | .) ∝ a_{m,x} q(x, z ∧ y) q(z, w) q(y, c)

Model construction: ordinary approach

[Figure: plate diagram of the resulting expert-tag-topic model (Heinrich 2011) with author distributions ~a_m, author choices x_{m,n} and x_{m,j}, word topics z_{m,n}, tag topics y_{m,j}, words w_{m,n}, tags c_{m,j}, and per-author topic distributions, per-topic term distributions and per-topic tag distributions]
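To connect the construction with the generic sampling formula, a minimal sketch of the word-token part of a collapsed Gibbs update for such a model (my illustration; the variable names and the exact factorisation are assumptions, not the dissertation's code): p(x = a, z = k | .) ∝ a_{m,a} q(a, k) q(k, t), where the author-topic counts are shared between word topics and tag topics because of the coupled edge q(x, z ∧ y).

    // Minimal sketch: unnormalised weights over (author a, word topic k)
    // for one word token t in document m of an ETT-style model.
    class EttKernelSketch {
        static double[][] wordTokenWeights(int m, int t,
                double[][] a,                // a[m][x]: observed author shares
                int[][] nax, int[] naxsum,   // author-topic counts (words and tags)
                int[][] nkt, int[] nktsum,   // topic-term counts
                double alpha, double beta, int K, int V) {
            int A = a[m].length;
            double[][] p = new double[A][K];
            for (int x = 0; x < A; x++)
                for (int k = 0; k < K; k++)
                    p[x][k] = a[m][x]
                            * (nax[x][k] + alpha) / (naxsum[x] + K * alpha)
                            * (nkt[k][t] + beta) / (nktsum[k] + V * beta);
            return p;
        }
    }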

Expert-tag-topic model: Evaluation

[Figure: average precision @10 for word queries and tag queries, and topic coherence scores, comparing ATM and ETT]

- NIPS corpus: 2.3 million words, 2037 authors, 165 tags
- Retrieval, average precision @10:
    Term queries: ETT > ATM
    Tag queries: similarly good AP values
- Topic coherence (Mimno et al. 2011): ETT > ATM
- Semi-supervised learning: tag queries also retrieve items without tags
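For reference, a minimal sketch of the retrieval metric used above (my own illustration; definitions of AP@10 differ slightly across the literature):

    import java.util.Set;

    // Minimal sketch: average precision at 10 for one query,
    // given a ranked result list and a set of relevant item ids.
    class ApAt10Sketch {
        static double averagePrecisionAt10(int[] ranked, Set<Integer> relevant) {
            double sum = 0.0;
            int hits = 0;
            int n = Math.min(10, ranked.length);
            for (int i = 0; i < n; i++) {
                if (relevant.contains(ranked[i])) {
                    hits++;
                    sum += (double) hits / (i + 1);   // precision at rank i + 1
                }
            }
            int denom = Math.min(relevant.size(), 10);
            return denom == 0 ? 0.0 : sum / denom;
        }
    }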

Overview

Introduction
Generic topic models
Inference methods
Application to virtual communities
Conclusions and outlook



Conclusions: Research contributions

- Networks of Mixed Membership: generic model and domain-specific compact representation of topic models
- Inference algorithms: generic Gibbs sampler
    Fast sampling methods (serial, parallel, independent)
    Implementation in the Gibbs meta-sampler
    + Variational inference for NoMMs (Heinrich and Goesele 2009)
- Design process based on the typology of NoMM substructures
    + AMQ model: meta-model for virtual communities as a formal basis for scenario modelling (Heinrich 2010)
- Application to virtual communities: expert-tag-topic model for expert finding with annotated documents
    + Models ETT2 and ETT3 incl. a novel NoMM structure; retrieval approaches (Heinrich 2011b)
    + Graph-based expert search using ETT models: integration of explorative search/browsing and distributions from topic models
- Contribution to facilitated model-based construction of topic models, specifically for virtual communities and other multimodal scenarios

[Figure: ETT1: Expert search in the community browser, showing a graph of NIPS papers, authors and citation/authorship links around the query topics "blind source separation" and "independent component analysis"]


Outlook
- New applications and NoMM structures, e.g., time as a variable
- Alternative inference methods:
    Generic collapsed variational Bayes (Teh et al. 2007): structure similar to the collapsed Gibbs sampler
    Non-parametric methods: learning model dimensions using Dirichlet or Pitman-Yor process priors (Teh et al. 2004; Buntine and Hutter 2010), NoMM polymorphism (Heinrich 2011a)
- Improved support in the design process:
    Data-driven design: search over model structures to obtain the best model for a data set
    Architecture-specific Gibbs meta-samplers, e.g., massively parallel or FPGA, cf. (Heinrich et al. 2011)
- Integration with interactive user interfaces: models can be created on the fly, e.g., for visual analytics

Thank you!

Q+A


References I
References
Barnard, K., P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan (2003, August). Matching words and pictures. JMLR Special Issue on Machine Learning Methods for Text and Images 3(6), 1107–1136.
Bellegarda, J. (2000, August). Exploiting latent semantic information in statistical language modeling. Proc. IEEE 88(8), 1279–1296.
Blei, D., A. Ng, and M. Jordan (2003, January). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Buntine, W. and M. Hutter (2010). A Bayesian review of the Poisson-Dirichlet process. arXiv:1007.0296v1 [math.ST].
Chang, J., J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei (2009). Reading tea leaves: How humans interpret topic models. In Proc. Neural Information Processing Systems (NIPS).

References II
Dietz, L., S. Bickel, and T. Scheffer (2007, June). Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA.
Heinrich, G. (2009). A generic approach to topic models. In Proc. European Conf. on Mach. Learn. / Principles and Pract. of Know. Discov. in Databases (ECML/PKDD), Part 1, pp. 517–532.
Heinrich, G. (2010). Actors-media-qualities: a generic model for information retrieval in virtual communities. In Proc. 7th International Workshop on Innovative Internet Community Systems (I2CS 2007), part of the I2CS Jubilee proceedings, Lecture Notes in Informatics, GI.
Heinrich, G. (2011a, March). Infinite LDA: Implementing the HDP with minimum code complexity. Technical note TN2011/1, arbylon.net.
Heinrich, G. (2011b). Typology of mixed-membership models: Towards a design method. In Proc. European Conf. on Mach. Learn. / Principles and Pract. of Know. Discov. in Databases (ECML/PKDD).

References III
Heinrich, G. and M. Goesele (2009). Variational Bayes for generic topic models. In Proc. 32nd Annual German Conference on Artificial Intelligence (KI 2009).
Heinrich, G., J. Kindermann, C. Lauth, G. Paaß, and J. Sanchez-Monzon (2005). Investigating word correlation at different scopes: a latent concept approach. In Workshop Lexical Ontology Learning at Int. Conf. Mach. Learning.
Heinrich, G., F. Logemann, V. Hahn, C. Jung, G. Figueiredo, and W. Luk (2011). HW/SW co-design for heterogeneous multi-core platforms: The hArtes toolchain, Chapter Audio array processing for telepresence, pp. 173–207. Springer.
Li, W., D. Blei, and A. McCallum (2007). Mixtures of hierarchical topics with pachinko allocation. In International Conference on Machine Learning.
Li, W. and A. McCallum (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, pp. 577–584. ACM.

References IV
Mimno, D., H. M. Wallach, E. Talley, M. Leenders, and A. McCallum (2011, July).
Optimizing semantic coherence in topic models.
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing,
Edinburgh, UK, pp. 262–272.
Newman, D., A. Asuncion, P. Smyth, and M. Welling (2009, August).
Distributed algorithms for topic models.
JMLR 10, 1801–1828.
Porteous, I., D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling (2008).
Fast collapsed Gibbs sampling for latent Dirichlet allocation.
In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, New York, NY, USA, pp. 569–577. ACM.
Rosen-Zvi, M., T. Griffiths, M. Steyvers, and P. Smyth (2004).
The author-topic model for authors and documents.
In Proc. 20th Conference on Uncertainty in Artificial Intelligence (UAI).
Teh, Y., M. Jordan, M. Beal, and D. Blei (2004).
Hierarchical Dirichlet processes.
Technical Report 653, Department of Statistics, University of California at Berkeley.
Teh, Y. W., D. Newman, and M. Welling (2007).
A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation.
In Advances in Neural Information Processing Systems, Volume 19.
Gregor Heinrich

A generic approach to topic models

39 / 35

Appendix

Gregor Heinrich

A generic approach to topic models

40 / 35

Example: Text mining for semantic clusters


Topic label: dominant terms according to φ_{k,t} = p(term | topic)

Bundesliga: Bochum Freiburg VfB FC SC München Borussia SV VfL Kickers SpVgg Uhr Köln Eintracht Bayern Hamburger Bayern+München
Polizei / Unfall: Polizei verletzt schwer Auto Unfall Fahrer Angaben schwer+verletzt Menschen Wagen Verletzungen Lawine Mann vier Meter Straße
Tschetschenien: Rebellen russischen Grosny russische Tschetschenien Truppen Kaukasus Moskau Angaben Interfax tschetschenischen Agentur
Politik / Hessen: FDP Koch Hessen CDU Koalition Gerhardt Wagner Liberalen hessischen Westerwelle Wolfgang Roland+Koch Wolfgang+Gerhardt
Wetter: Grad Temperaturen Regen Schnee Süden Norden Sonne Wetter Wolken Deutschland zwischen Nacht Wetterdienst Wind
Politik / Kroatien: Parlament Partei Stimmen Mehrheit Wahlen Wahl Opposition Kroatien Präsident Parlamentswahlen Mesic Abstimmung HDZ
Die Grünen: Grünen Parteitag Atomausstieg Trittin Grüne Partei Trennung Mandat Ausstieg Amt Röstel Jahren Müller Radcke Koalition
Russische Politik: Russland Putin Moskau russischen russische Jelzin Wladimir Tschetschenien Russlands Wladimir+Putin Kreml Boris Präsidenten
Polizei / Schulen: Polizei Schulen Schüler Täter Polizisten Schule Tat Lehrer erschossen Beamten Mann Polizist Beamte verletzt Waffe

Bigram-LDA: topics from 18,400 dpa news messages (German), Jan. 2000 (Heinrich et al. 2005)
Gregor Heinrich

A generic approach to topic models

41 / 35
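Such term lists are read off the estimated topic-term matrix. The following helper is a small illustration only (method and variable names are my own and assume the usual java.util imports; it is not code from this deck):

    // Return the n most probable terms of one topic, given its row of the
    // topic-term matrix phi[k][t] = p(term t | topic k) and a vocabulary array.
    static List<String> topTerms(double[] phiRow, String[] vocab, int n) {
        Integer[] ids = new Integer[phiRow.length];
        for (int t = 0; t < ids.length; t++) ids[t] = t;
        // sort term ids by descending probability
        Arrays.sort(ids, (a, b) -> Double.compare(phiRow[b], phiRow[a]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < n && i < ids.length; i++) top.add(vocab[ids[i]]);
        return top;
    }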

Notation: Bayesian network vs. NoMM levels

[Figure: the same mixture level drawn as a Bayesian network (left) and in NoMM notation (right), with variables k_i, x_i, component parameters ~θ_k | ~α and dimensions [K], [T].]

parameters + hyperparameters → nodes (~θ_k | ~α)
variables k_i, x_i → edges k_i, x_i
plates (i.i.d. repetitions over i, k) → indexes i + dimensions [K]


Gregor Heinrich

A generic approach to topic models

42 / 35

NoMM representation: Variable dependencies

[Figure: NoMM representation of variable dependencies in an example model: hidden edges h_i and visible edges v_i connected through mixture levels ℓ = 1, ..., 8, each level carrying component parameters ~θ_k | ~α.]

Gregor Heinrich

A generic approach to topic models

43 / 35

Collapsed Gibbs sampler


[Figure: Gibbs sampling trajectory ~x^(0), ~x^(1), ... in the (x1, x2) plane, alternating draws from p(x1 | x2) and p(x2 | x1) and converging to the posterior p(~x | V).]

Collapsed Gibbs sampler: stochastic EM / MCMC:

NoMMs: parameters Θ correlate with H → marginalise Θ
For each data point i: draw latent variables H_i = (y_i, z_i, ...), given all
other data, latent H_{¬i} and observed V:

    H_i \sim p(H_i \mid H_{\neg i}, V, A) .    (1)

Stationary state: the full conditional distribution (1) simulates the posterior
Faster absolute convergence for NoMMs than, e.g., variational
inference (Heinrich and Goesele 2009)
Gregor Heinrich

A generic approach to topic models

44 / 35
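As a concrete instance of (1), consider standard LDA (shown purely as an illustration; the symbols follow common LDA notation rather than this deck): the collapsed full conditional for the topic assignment z_i of word w_i = t in document m is

    p(z_i = k \mid \vec z_{\neg i}, \vec w, \alpha, \beta)
        \propto \frac{n_{k,t}^{\neg i} + \beta}{n_k^{\neg i} + V\beta}
        \cdot \frac{n_{m,k}^{\neg i} + \alpha}{n_m^{\neg i} + K\alpha} ,

where n_{k,t} counts assignments of term t to topic k and n_{m,k} assignments of topic k in document m, both excluding position i.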


Generic topic models: Generative process


[Figure: one generic NoMM level: input edges select a component ~θ_j among J components with Dirichlet hyperparameters ~α (A hyperparameter groups), generating output edges x_i, i ∈ [I].]

Generative process on level ℓ:

    x_i \sim \mathrm{Mult}(x_i \mid \vec\theta_k),      k = f_k(\mathrm{parents}(x_i), i)            (2)
    \vec\theta_k \sim \mathrm{Dir}(\vec\theta_k \mid \vec\alpha_j),   j = f_j(\mathrm{known\ parents}(x_i), i) .   (3)

Gregor Heinrich

A generic approach to topic models

45 / 35
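As a familiar special case (standard LDA, used purely as an example; the symbols ϑ, φ follow common LDA notation rather than this slide), two such levels suffice: on level ℓ = 1 the component is selected by the document index, on level ℓ = 2 by the topic drawn on level 1:

    z_{m,n} \sim \mathrm{Mult}(z_{m,n} \mid \vec\vartheta_m),          \vec\vartheta_m \sim \mathrm{Dir}(\vec\vartheta_m \mid \vec\alpha)
    w_{m,n} \sim \mathrm{Mult}(w_{m,n} \mid \vec\varphi_{z_{m,n}}),    \vec\varphi_k \sim \mathrm{Dir}(\vec\varphi_k \mid \vec\beta)

Here f_k simply returns the value of the parent edge (m, resp. z_{m,n}) and the hyperparameter selector f_j is constant.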

Generic topic models: Complete-data likelihood


Likelihood of all hidden and visible data X = {H, V} and parameters Θ:

    p(X, \Theta \mid A) = \prod_{\ell \in L} \Big[ \prod_i p(x_{i,\mathrm{out}} \mid \vec\theta_{x_{i,\mathrm{in}}}) \Big]^{[\ell]}
                          \cdot \prod_{\ell \in L} \Big[ \prod_k p(\vec\theta_k \mid \vec\alpha_j) \Big]^{[\ell]}
                        \;\rightarrow\; f(\vec\theta_k, \vec n_k, \vec\alpha), \qquad \vec n_k = (n_{k,1}, n_{k,2}, \ldots)        (4)

    (the first bracket collects the Discrete factors of the data items, the second the Dirichlet factors of the components; the outer products run over the levels ℓ)

Product dependent on co-occurrences n_{k,t} between input and output
values, x_{i,in} = k and x_{i,out} = t, on each level ℓ
There are variants to component selection x_{i,in} = k
There are mixture node variants, e.g., observed components

Gregor Heinrich

A generic approach to topic models

46 / 35

Generic topic models: Complete-data likelihood


The conjugacy between the multinomial and Dirichlet distributions of
model levels leads to a simple complete-data likelihood:

    p(X, \Theta \mid A)
      = \prod_\ell \Big[ \prod_i \mathrm{Mult}(x_i \mid \vec\theta_{k_i}) \prod_k \mathrm{Dir}(\vec\theta_k \mid \vec\alpha_j) \Big]^\ell                            (5)
      = \prod_\ell \Big[ \prod_i \theta_{k_i, x_i} \prod_k \frac{1}{B(\vec\alpha_j)} \prod_t \theta_{k,t}^{\alpha_{j,t} - 1} \Big]^\ell                               (6)
      = \prod_\ell \Big[ \prod_k \frac{1}{B(\vec\alpha_j)} \prod_t \theta_{k,t}^{\alpha_{j,t} + n_{k,t} - 1} \Big]^\ell                                               (7)
      = \prod_\ell \Big[ \prod_k \frac{B(\vec n_k + \vec\alpha_j)}{B(\vec\alpha_j)} \, \mathrm{Dir}(\vec\theta_k \mid \vec n_k + \vec\alpha_j) \Big]^\ell             (8)

where brackets [·]^ℓ enclose a particular level ℓ and
n_{k,t} is how often k and t co-occur.
Gregor Heinrich

A generic approach to topic models

47 / 35
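A step that (8) implies but the slide leaves implicit: integrating out Θ (each Dirichlet density integrates to one) yields the collapsed likelihood used for the Gibbs full conditionals on the next slide,

    p(X \mid A) = \int p(X, \Theta \mid A) \, d\Theta
                = \prod_\ell \Big[ \prod_k \frac{B(\vec n_k + \vec\alpha_j)}{B(\vec\alpha_j)} \Big]^\ell .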

Inference: Generic full conditionals


Gibbs full conditionals are derived for groups of dependent hidden edges,
H_i^d ⊆ H^d ⊆ X, with surrounding edges S_i^d ⊆ S^d considered observed. All
tokens co-located with a particular observation: X_i^d = {H_i^d, S_i^d}.

[Figure: hidden edge groups H^1 = {H_1^1, H_2^1, ...} and H^2 = {H_1^2, H_2^2, ...} across tokens.]

Full conditional via the chain rule applied to (8) with Θ integrated out:

    p(H_i^d \mid X \setminus H_i^d, A) = \frac{p(H_i^d, S_i^d \mid X \setminus \{H_i^d, S_i^d\}, A)}{p(S_i^d \mid X \setminus \{H_i^d, S_i^d\}, A)}        (9)

    p(X_i^d \mid X \setminus X_i^d, A) = \frac{p(X \mid A)}{p(X \setminus X_i^d \mid A)}                                                                   (10)
        = \prod_\ell \Big[ \prod_k \frac{B(\vec n_k + \vec\alpha_j)}{B(\vec n_k \setminus X_i^d + \vec\alpha_j)} \Big]^\ell                                (11)
        \rightarrow \prod_{\ell \in \{H^d, S^d\}} \Big[ \frac{B(\vec n_k + \vec\alpha_j)}{B(\vec n_k \setminus X_i^d + \vec\alpha_j)} \Big]^\ell           (12)

Gregor Heinrich

A generic approach to topic models

48 / 35

Inference: q-functions

    q(k, t) := \frac{B(\vec n_k + \vec\alpha_j)}{B(\vec n_k \setminus x_i^d + \vec\alpha_j)}

    |x_i^d| = 1:   q(k, t) = \frac{n_{k,t}^{\neg i} + \alpha_t}{\sum_t n_{k,t}^{\neg i} + \sum_t \alpha_t}

    |x_i^d| = 2:   q(k, t) = \frac{n_{k,t} \setminus x_{i,1}^d + \alpha_t}{\sum_t n_{k,t} \setminus x_{i,1}^d + \sum_t \alpha_t}
                             \cdot \frac{n_{k,t} \setminus x_{i,2}^d + \alpha_t + \delta(x_{i,1}^d \equiv x_{i,2}^d)}{\sum_t n_{k,t} \setminus x_{i,2}^d + \sum_t \alpha_t + 1}

    ...

Gregor Heinrich

A generic approach to topic models

49 / 35
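A small worked example (numbers invented purely for illustration): for a component k with counts \vec n_k^{\neg i} = (3, 1, 0) over T = 3 output values and symmetric hyperparameter \alpha = 0.1,

    q(k, 1) = (3 + 0.1) / (4 + 3 \cdot 0.1) \approx 0.72,
    q(k, 2) = 1.1 / 4.3 \approx 0.26,
    q(k, 3) = 0.1 / 4.3 \approx 0.02,

i.e., exactly the smoothed ratio of occurrences that the Pólya urn picture on the next slide visualises.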


q-functions: Pólya urn and sampling weights

[Figure: Pólya urn with expected component parameters E{~θ_k}: sampling with over-replacement.]

    q(k, t) := \frac{B(\vec n_k + \vec\alpha)}{B(\vec n_k^{\neg i} + \vec\alpha)}

    |t| = 1:      q(k, t) = \frac{n_{k,t}^{\neg i} + \alpha}{n_k^{\neg i} + T\alpha} = smoothed ratio of occurrences

    t = {u, v}:   q(k, u \wedge v) := \frac{n_{k,u}^{\neg i} + \alpha}{n_k^{u,\neg i} + T\alpha} \cdot \frac{n_{k,v}^{\neg i} + \alpha + \delta(u \equiv v)}{n_k^{v,\neg i} + T\alpha + 1}

    ...

Gregor Heinrich

A generic approach to topic models

50 / 35
q-functions: Pólya urn and sampling weights

[Figure: Pólya urn and discrete parameters; the two factors of q(k, u ∧ v) correspond to drawing first u^{¬i} and then v^{¬i} from the same urn, the δ(u ≡ v) term and the +1 in the denominator accounting for the ball returned after the first draw.]

    q(k, t) := \frac{B(\vec n_k + \vec\alpha)}{B(\vec n_k^{\neg i} + \vec\alpha)}

    |t| = 1:      q(k, t) = \frac{n_{k,t}^{\neg i} + \alpha}{n_k^{\neg i} + T\alpha} = smoothed ratio of occurrences

    t = {u, v}:   q(k, u \wedge v) := \frac{n_{k,u}^{\neg i} + \alpha}{n_k^{u,\neg i} + T\alpha} \cdot \frac{n_{k,v}^{\neg i} + \alpha + \delta(u \equiv v)}{n_k^{v,\neg i} + T\alpha + 1}

    ...

Gregor Heinrich

A generic approach to topic models

51 / 35

NoMM substructure library: Gibbs weights and likelihood


[Table (thesis Sec. 9.3, Figure 9.2, p. 171): NoMM sub-structure library. For every sub-structure the thesis gives a structure diagram, the Gibbs sampler weight w and the likelihood p for a single token i; reproduced here are only the sub-structure names, the modelled aspect and example models.]

N1, E1, C1. Dir-Mult nodes, unbranched (mixture/admixture): LDA [Blei et al. 2003b], PAM [Li & McCallum 2006]; LDCC [Shafiei & Milios 2006] (E1S aggregation)
N2. Observed parameters (label distribution): ATM [Rosen-Zvi et al. 2004]
N3. Non-Dirichlet prior (alternative distributions on the simplex): CTM [Blei & Lafferty 2007], TLM [Wallach 2008] (hierarchy of Dirichlet priors)
N4. Non-discrete output (non-multinomial observations): Corr-LDA [Barnard et al. 2003], GMM [McLachlan & Peel 2000]
N5 + E4. Regression (regression / supervised learning): supervised LDA [Blei & McAuliffe 2007], relational topic model [Chang & Blei 2009]
E2. Autonomous edges (common mixture of causes): multimodal LDA [Ramage et al. 2009]
E3. Coupled edges (common cause for observations): hidden relational model (HRM) [Xu et al. 2006], Link-LDA [Erosheva et al. 2004]
C2. Combined indices (different dependent causes, relations): hPAM [Li et al. 2007a], HRM [Xu et al. 2006], Multi-LDA [Porteous et al. 2008a]
C3. Interleaved indices (different causes, same effect): proposed here
C4. Switch (selection of complex submodels): multi-grain LDA [Titov & McDonald 2008], entity-topic models [Newman et al. 2006a]
C5. Node coupling (correlation of submodels, relations): simple relational component model [Sinkkonen et al. 2008], relational topic model [Chang & Blei 2009]

Figure 9.2: NoMM sub-structure properties (the figure also defines the notation for adding counts, excluding the current token i, and combining sequences; see Sec. 9.3 of the thesis).

Gregor Heinrich

A generic approach to topic models

52 / 35

Gibbs meta-sampler: Java data structure


MixNet  // represents a NoMM
    nodes : List<MixNode>               // nodes of the NoMM
    edges : List<MixEdge>               // edges of the NoMM
    sequences : List<MixSequence>       // sequences of the NoMM
    constants : Map<String, String>     // constants for the NoMM

MixItem  // interface: node or edge
    name : String                       // global id (= unique variable name)
    parents : List<MixItem>             // >= 2 edges: multiple inputs C2; >= 2 nodes: merged inputs C3
    children : List<MixItem>            // >= 2 edges: indep. branches E2; >= 2 nodes: coupled branches E3
    datatype : enum                     // type of item: seq., topic, qfixed, ...
    linktype : enum                     // link type: C and E classifications

MixNode implements MixItem  // NoMM node
    theta : Variable                    // parameters theta_{k,t}
    ntheta, nthetasum : Variable        // counts n_{k,t}, sum_t n_{k,t}
    alpha : Variable                    // hyperparameter alpha

MixEdge implements MixItem  // NoMM edge
    x : Variable                        // variable x_{m,n}
    T : Expression                      // range of x, T
    siblingsE2 : List<MixEdge>          // E2 edge siblings, for expansion
    sparse : boolean                    // flag: parent node emits subset of range

MixSequence  // NoMM sequence
    subseqs : List<MixSequence>         // subsequences, null for leaf
    superseq : MixSequence              // supersequence, null for root
    m, n, s : Variable                  // sequence index variables m, n, s
    M, Mq, Nm, Nmq, W, Wq : Expression  // sequence index ranges M, N_m, W (and query variants)
    qfixed : boolean                    // flag: fixed topics for query

(MixNet collects the nodes, edges and sequences of the NoMM; MixNode and MixEdge implement MixItem.)

Gregor Heinrich

A generic approach to topic models

53 / 35

/** run the main Gibbs sampling kernel */
public void run(int niter) {
    // iteration loop
    for (int iter = 0; iter < niter; iter++) {
        // major loop, sequence [m][n]
        for (int m = 0; m < M; m++) {
            // component selectors
            int mxsel = -1;
            int mxjsel = -1;
            int ksel = -1;
            // minor loop, sequence [m][n]
            for (int n = 0; n < w[m].length; n++) {
                double psum;
                double u;
                // decrement counts
                nmx[m][x[m][n]]--;
                mxsel = X * m + x[m][n];
                nmxy[mxsel][y[m][n]]--;
                nmxysum[mxsel]--;
                if (x[m][n] == 0)
                    ksel = 0;
                else if (y[m][n] == 0)
                    ksel = 1 + x[m][n];
                else
                    ksel = 1 + X + y[m][n];
                nkw[ksel][w[m][n]]--;
                nkwsum[ksel]--;
                // compute weights
                /* p(x_{m,n} \eq x, y_{m,n} \eq y ... (LaTeX omitted) */
                psum = 0;
                int hx = -1;
                int hy = -1;
                // hidden edge x
                for (hx = 0; hx < X; hx++) {
                    // hidden edge y
                    for (hy = 0; hy < Y; hy++) {
                        mxsel = X * m + hx;
                        mxjsel = hx;
                        if (hx == 0)
                            ksel = 0;
                        else if (hy == 0)
                            ksel = 1 + hx;
                        else
                            ksel = 1 + X + hy;
                        pp[hx][hy] = (nmx[m][hx] + alpha[hx])
                                * (nmxy[mxsel][hy] + alphax[mxjsel][hy])
                                / (nmxysum[mxsel] + alphaxsum[mxjsel])
                                * (nkw[ksel][w[m][n]] + beta)
                                / (nkwsum[ksel] + betasum);
                        psum += pp[hx][hy];
                    } // for hy
                } // for hx
                // sample topics
                u = rand.nextDouble() * psum;
                psum = 0;
                SAMPLED:
                // each edge value x
                for (hx = 0; hx < X; hx++) {
                    // each edge value y
                    for (hy = 0; hy < Y; hy++) {
                        psum += pp[hx][hy];
                        if (u <= psum)
                            break SAMPLED;
                    } // for hy
                } // for hx
                // assign topics
                x[m][n] = hx;
                y[m][n] = hy;
                // increment counts
                nmx[m][x[m][n]]++;
                mxsel = X * m + x[m][n];
                nmxy[mxsel][y[m][n]]++;
                nmxysum[mxsel]++;
                if (x[m][n] == 0)
                    ksel = 0;
                else if (y[m][n] == 0)
                    ksel = 1 + x[m][n];
                else
                    ksel = 1 + X + y[m][n];
                nkw[ksel][w[m][n]]++;
                nkwsum[ksel]++;
            } // for n
        } // for m
        // estimate hyperparameters
        estAlpha();
    } // for iter
} // run()

Gregor Heinrich

A generic approach to topic models

54 / 35

Fast serial sampling: Using a normalisation bound


[Figure: sampling weights s_{lk} sorted in descending order; the exact normalisation Z_i is bracketed between the known mass Z_i^known (main mass plus adjustment masses) and a bound on the unknown remainder Z_i^unknown, so a draw u ~ U[0, 1] can often be resolved after computing only the first few partial normalisations uZ_{i,0}, uZ_{i,1}, ...]

Idea: exploit saliency of few elements → compute only the largest
(= most likely) weights
Approximate normalisation via vector norms (Porteous et al. 2008)
Generalisation to multiple dependent variables: more expensive
higher-order vector norms ↔ higher sparsity of sampling space
Gregor Heinrich

A generic approach to topic models

55 / 35

Fast parallel sampling: Synchronisation methods

[Figure: documents partitioned across processors P1, ..., PP; document-specific parameters (pmfs over (sub-)topics) stay local, global parameters (pmfs over the vocabulary) are synchronised while sampling x_{m,n}, y_{m,n}, w_{m,n}.]

Multi-processor parallelisation using shared memory (OpenMP)

Main challenge: synchronisation and communication of global data
Synchronisation methods (LDA + generic NoMMs):
a. Naive synchronisation locks
b. Query read-only + MAP update step for Θ (split-state)
c. Local copies + reduction step (= AD-LDA (Newman et al. 2009))

Gregor Heinrich

A generic approach to topic models

56 / 35
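A minimal sketch of synchronisation method (c) (my own illustration, not the thesis code): every worker samples an epoch on its document partition using a private copy of the global topic-term counts, and a reduction step folds the per-worker changes back into the shared table, as in AD-LDA (Newman et al. 2009):

    // Copy the global topic-term counts before an epoch.
    static int[][] copyCounts(int[][] a) {
        int[][] c = new int[a.length][];
        for (int k = 0; k < a.length; k++) c[k] = a[k].clone();
        return c;
    }

    // Reduction after the epoch:
    // merged[k][t] = snapshot[k][t] + sum_p (local_p[k][t] - snapshot[k][t])
    static int[][] reduceCounts(int[][] snapshot, int[][][] locals) {
        int[][] merged = copyCounts(snapshot);
        for (int[][] local : locals)
            for (int k = 0; k < snapshot.length; k++)
                for (int t = 0; t < snapshot[k].length; t++)
                    merged[k][t] += local[k][t] - snapshot[k][t];
        return merged;
    }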

Fast sampling: Serial parallel


[Plot: speed-up of the fast sampling methods sa, pa and spa over the number of topics (10 to 500), LDA on the NIPS corpus; P marks the number of processors.]

Figure: Speed-up for fast sampling methods: LDA.


Gregor Heinrich

A generic approach to topic models

57 / 35

Fast sampling: Serial parallel independent


[Plots over model dimensions (K, L) from (10, 10) to (20, 100): (a) parallel and independent samplers (ia, pa, pb, pc; speed-up vs. a) and (b) parallel, serial and independent samplers (ipa, ipas, ipcs, ipb, ipc; speed-up vs. ia resp. vs. a).]

(a) Parallel, independent    (b) Parallel, serial, independent

Figure: Speed-up for combined fast samplers: PAM4 (2 dependent variables).

Gregor Heinrich

A generic approach to topic models

58 / 35

Fast sampling: The impact of assumed independence


[Plot: perplexity (approx. 1500 to 3000) over Gibbs iterations for the dependent (pa) and independence-assuming (ipa) samplers at model dimensions (K, L) from (10, 10) to (20, 100); the convergence points of the dependent and independent variants are marked.]

Figure: Perplexity over iterations. Example model: PAM4.


Gregor Heinrich

A generic approach to topic models

59 / 35

ETT1 model: Derivation using NoMM structure


[Figure: ETT1 as a NoMM: author edge x drawn from ~a_m (A_m authors per document, M documents); word branch with topics z_{m,n} ∈ [K] and words w_{m,n} ∈ [V] over {M, N_m}; tag branch with topics y_{m,j} ∈ [K] and tags c_{m,j} ∈ [C] over {M, J_m}; q-functions q(x, z ∧ y), q(z, w), q(y, c).]

Lining up q-functions:

    p(x, z, y \mid \cdot) \propto a_{m,x} \, q(x, z \wedge y) \, q(z, w) \, q(y, c)                 (13)

Transforming to standard Gibbs full conditionals:

    p(x_{m,n} = x, z_{m,n} = z \mid \cdot) \propto a_{m,x}
        \cdot \frac{n_{x,z}^{\neg \{x,z\}_{m,n}} + \alpha}{n_x^{\neg \{x,z\}_{m,n}} + K\alpha}
        \cdot \frac{n_{z,w_{m,n}}^{\neg z_{m,n}} + \beta}{n_z^{\neg z_{m,n}} + V\beta}              (14)

    p(x_{m,j} = x, y_{m,j} = y \mid \cdot) \propto a_{m,x}
        \cdot \frac{n_{x,y}^{\neg \{x,y\}_{m,j}} + \alpha}{n_x^{\neg \{x,y\}_{m,j}} + K\alpha}
        \cdot \frac{n_{y,c_{m,j}}^{\neg y_{m,j}} + \gamma}{n_y^{\neg y_{m,j}} + C\gamma}            (15)

Retrieval via the query likelihood model:

    p(\vec w \mid a) = \prod_{w \in \vec w} \sum_z \vartheta_{a,z} \, \varphi_{z,w} , \qquad
    p(\vec c \mid a) = \prod_{c \in \vec c} \sum_y \vartheta_{a,y} \, \psi_{y,c} .                  (16)

Gregor Heinrich

A generic approach to topic models

60 / 35
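A small sketch of how (16) ranks candidate experts (my own illustration; the array names theta and phi are assumptions standing for the estimated ϑ_{a,z} and φ_{z,w}, not code from the deck):

    // Log query likelihood of eq. (16) for expert a:
    // log p(query | a) = sum_{w in query} log sum_z theta[a][z] * phi[z][w]
    static double logQueryLikelihood(int[] queryTerms, double[][] theta, double[][] phi, int a) {
        double logLik = 0.0;
        for (int w : queryTerms) {
            double pw = 0.0;
            for (int z = 0; z < phi.length; z++)
                pw += theta[a][z] * phi[z][w];   // mixture over topics z
            logLik += Math.log(pw);
        }
        return logLik;                            // higher = better match
    }

Experts are then sorted by this value (or, as on the result slides, by the corresponding likelihood score) for a given term or tag query.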


ETT1 model: Derivation using ordinary method (excerpt)

[Excerpt from thesis Appendix E, "Details on application models": Figure E.1 shows the Bayesian network of the ETT1 model; Figure E.2 the networks of the iterated ETT models (a) ETT2 and (b) ETT3. The appendix carries out the traditional derivation of ETT1 inference: complete-data likelihood (E.1), marginalisation of the parameters via Dirichlet-multinomial conjugacy (E.3)-(E.6), and the chain rule to obtain the Gibbs full conditionals (E.7)-(E.13); the word and tag branches are sampled disjointly, each jointly with the author association x at the root.]

The resulting full conditionals (with i = (m, n) for words and i = (m, j) for tags):

    p(z_i = k, x_i = x \mid w_i = t, \vec z_{\neg i}, \vec y, \vec x_{\neg i}, \vec w_{\neg i}, \vec a, \vec c)
        \propto \frac{n_{k,t}^{\neg i} + \beta}{n_k^{\neg i} + V\beta}
        \cdot \frac{n_{x,k}^{(z),\neg i} + \alpha}{n_x^{(z),\neg i} + K\alpha} \cdot a_{m,x}
        = q(k, t) \, q(x, k) \, a_{m,x}                                                          (E.11)

    p(y_i = k, x_i = x \mid c_i = c, \vec z, \vec y_{\neg i}, \vec x_{\neg i}, \vec w, \vec a, \vec c_{\neg i})
        \propto \frac{n_{k,c}^{\neg i} + \gamma}{n_k^{\neg i} + C\gamma}
        \cdot \frac{n_{x,k}^{(y),\neg i} + \alpha}{n_x^{(y),\neg i} + K\alpha} \cdot a_{m,x}
        = q(k, c) \, q(x, k) \, a_{m,x}                                                          (E.13)

The difference of (E.11) and (E.13) to (10.3) is a result of the definition of n_{x,k} as a summed count and of the fact that both branches are sampled disjointly. (Heinrich 2011b)

Footnote: alternative derivation strategies for topic model Gibbs samplers have been published in [Griffiths 2002], working via p(z_i | z_{¬i}, w) ∝ p(w_i | w_{¬i}, z) p(z_i | z_{¬i}), and in [McCallum et al. 2007], who use the chain rule via the token likelihood.

Comparing this to the NoMM-based derivation developed in the thesis illustrates the usefulness of the generic method to avoid tedious calculations.

Gregor Heinrich

A generic approach to topic models

61 / 35

ETT1 evaluation: Truncated Average Precision


[Figure: two example result lists of five retrieved documents each; in the first list the relevant documents appear at ranks 1, 2 and 5, in the second at ranks 2, 4 and 5.]

    AP@5 = (1/1 + 2/2 + 3/5) / 3 = 0.867
    AP@5 = (1/2 + 2/4 + 3/5) / 3 = 0.533

Figure: Average Precision at 5 (assuming 3 relevant documents in corpus)

Gregor Heinrich

A generic approach to topic models

62 / 35
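For reference, a small helper (my own sketch, not taken from the thesis) that computes AP@k from a ranked relevance list and reproduces the two hand computations above:

    // Average precision at cutoff k: sum of precision@rank over the relevant
    // ranks within the top k, divided by the number of relevant documents
    // in the corpus (assumed known, as on the slide).
    static double averagePrecisionAtK(boolean[] relevantAtRank, int k, int numRelevantInCorpus) {
        double sum = 0.0;
        int hits = 0;
        for (int r = 0; r < k && r < relevantAtRank.length; r++) {
            if (relevantAtRank[r]) {
                hits++;
                sum += (double) hits / (r + 1);   // precision at this rank
            }
        }
        return sum / numRelevantInCorpus;
    }

    // Example: relevant at ranks 1, 2, 5 -> (1/1 + 2/2 + 3/5) / 3 = 0.867
    // averagePrecisionAtK(new boolean[]{true, true, false, false, true}, 5, 3);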

ETT1 results: Term Retrieval


query: svm support vector machine kernel classifier hyperplane regression

1. Schölkopf B, lik = 76.272, tokens = 2830, docs = 10: judged relevant


From Regularization Operators to Support Vector Kernels (9); Improving
the Accuracy and Speed of Support Vector Machines (9); Shrinking the
Tube: A New Support Vector Regression Algorithm (11) . . .
2. Smola A, lik = 77.509, tokens = 2760, docs = 11: judged relevant

Support Vector Regression Machines (9); Prior Knowledge in Support

Vector Kernels (10); Support Vector Method for Novelty Detection (12)
The Entropy Regularization Information Criterion (12, support vector
machines, regularization) . . .

3. Vapnik V, lik = 77.525, tokens = 2332, docs = 10: judged relevant

Support Vector Regression Machines (9); Prior Knowledge in Support


Vector Kernels (10); Prior Knowledge in Support Vector Kernels (10);
Support Vector Method for Multivariate Density Estimation (12); . . .

4. Crisp D, lik = 81.401, tokens = 699, docs = 2: judged relevant

A Geometric Interpretation of ν-SVM Classifiers (12); Uniqueness of the


SVM Solution (12)

5. Burges C, lik = 81.630, tokens = 1309, docs = 5: judged relevant

Improving the Accuracy and Speed of Support Vector Machines (9); A


Geometric Interpretation of ν-SVM Classifiers (12); Uniqueness of the
SVM Solution (12) . . .

6. Laskov P, lik = 84.275, tokens = 738, docs = 1: judged relevant


An Improved Decomposition Algorithm for Regression Support Vector
Machines (12)
7. Steinhage V, lik = 84.600, tokens = 438, docs = 1: judged irrelevant

Nonlinear Discriminant Analysis Using Kernel Functions (12)

8. Bennett K, lik = 86.754, tokens = 384, docs = 1: judged relevant

Semi-Supervised Support Vector Machines (11)

9. Herbrich R, lik = 86.754, tokens = 462, docs = 2: judged irrelevant

Classification on Pairwise Proximity Data (11); Bayesian Transduction


(12, classification)

10. Chapelle O, lik = 87.431, tokens = 494, docs = 2: judged relevant


Model Selection for Support Vector Machines (12); Transductive Inference for Estimating Values of Functions (12, regression, classification)

Gregor Heinrich

A generic approach to topic models

63 / 35

ETT1 results: Tag retrieval


query: face recognition
1. Movellan J, lik = 4.680, tokens = 3153, docs = 8: judged relevant
Dyn. Features for Visual Speechreading: A System Comparison (9, no
tags); Image Representation for Facial Expression Coding (12, tags: face
recognition, image, ICA); Visual Speech Recognition with Stochastic
Networks (7, tags: HMM, speech recognition) . . .
2. Bartlett M, lik = 4.951, tokens = 812, docs = 3: judged relevant

Viewpoint Invariant Face Recognition using ICA and Attractor Networks

(9, tags: face recognition, invariances, pattern recognition); Image Representation for Facial Expression Coding (12, tags: face recognition, image, ICA) . . .

3. Dailey M, lik = 4.952, tokens = 903, docs = 2: judged relevant

Task and Spatial Frequency Effects on Face Specialization (10, tags: face

recognition); Facial Memory Is Kernel Density Estimation (Almost) (11,


no tags)

4. Padgett C, lik = 4.974, tokens = 499, docs = 1: judged relevant

Representing Face Images for Emotion Classification (9, tags: classification, face recognition, image)

5. Hager J, lik = 5.023, tokens = 377, docs = 2: judged relevant

Classifying Facial Action (8, tags: classification); Image Representation


for Facial Expression Coding (12, tags: face recognition, image, ICA)

6. Ekman P, lik = 5.027, tokens = 374, docs = 2: judged relevant


Image Representation for Facial Expression Coding (12, tags: face
recognition, image, ICA); Classifying Facial Action (8, tags: classification)
7. Phillips P, lik = 5.127, tokens = 795, docs = 1: judged relevant

Support Vector Machines Applied to Face Recognition (11, tags: face


recognition, SVM)

8. Gray M, lik = 5.159, tokens = 470, docs = 2: judged irrelevant

Dynamic Features for Visual Speechreading: A Systematic Comparison


(9, text: dynamic visual features; no tags)

9. Lawrence D, lik = 5.217, tokens = 265, docs = 1: judged relevant

SEXNET: A Neural Network Identifies Sex From Human Faces (3, tags:
neural networks, object recognition, pattern recognition)

10. Ahuja N, lik = 5.221, tokens = 366, docs = 2: judged relevant
A SNoW-Based Face Detector (12, tags: face recognition, image, vision)

Gregor Heinrich

A generic approach to topic models

64 / 35

ETT1 results: Tag query and expert topics


tag: face recognition (ETT1/J20)
0.82702 face images faces image facial visual human video database detection
0.09392 image images texture pixel resolution pyramid regions pixels region search
0.02696 wavelet video view images tracking user camera image motion shape
0.00117 eeg brain ica artifacts subjects activity subject erp signals scalp
0.00100 image images visual vision optical pixel surface edge disparity receptive
0.00094 orientation cortical dominance ocular cortex development lateral eye cells visual
0.00089 chip neuron synapse digital pulse analog synaptic chips synapses murray
0.00084 hinton object image energy cost images code visible zemel codes

author: Movellan J (ETT1/J20)


0.53816:
0.16216:
0.08954:
0.06216:
0.03939:
0.03508:
0.02770:
0.02154:

face images faces image facial visual human video database detection
image images texture pixel resolution pyramid regions pixels region search
speech speaker acoustic vowel phonetic phoneme utterances spoken formant
bayesian prior density posterior entropy evidence likelihood distributions
filter frequency signals phase channel amplitude frequencies temporal spectrum
activation boltzmann annealing temperature neuron stochastic schedule machine
cell firing cells neuron activity excitatory inhibitory synaptic potential membrane
convergence stochastic descent optimization batch density global update

author: Cottrell G (ETT1/J20)


0.41865:
0.27523:
0.17531:
0.11287:
0.07130:
0.06143:
0.03695:
0.02049:

recurrent nets correlation cascade activation connection epochs representations


face images faces image facial visual human video database detection
subjects human stimulus cue subject trials experiment perceptual psychophysical
tangent transformation image simard images invariant invariance euclidean
modules attractors cortex phase olfactory frequency bulb activity oscillatory eeg
word connectionist representations words activation production cognitive musical
node activation graph cycle nets message recurrence links connection child
visual attention contour search selective orientation iiii region saliency segment

Gregor Heinrich

A generic approach to topic models

65 / 35

ETT1 results: Topic coherence


Topic coherence (Mimno et al. 2011):
How often do top-ranked topic terms co-occur in documents?
Re-enacts human judgement in topic intrusion experiments (Chang
et al. 2009; Heinrich 2011b)

Words in topic (choose the worst match (A-F) in every group):
1. A. orientation  B. cortex  C. visual  D. ocular  E. acoustic  F. eye
2. A. likelihood  B. mixture  C. theorem  D. density  E. em  F. prior
3. A. risk  B. return  C. stock  D. trading  E. processor  F. prediction
4. A. language  B. word  C. stress  D. grammar  E. neural  F. syllable
5. A. circuit  B. bayesian  C. analog  D. voltage  E. vlsi  F. chip
6. A. validation  B. set  C. variance  D. regression  E. selection  F. bias

(a) Topic intrusion experiment

[Plot (b): coherence scores (roughly -150 to -500) for LDA, ATM, ETT1/J20 and ETT1/J100.]

(b) Coherence scores

Gregor Heinrich

A generic approach to topic models

66 / 35
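For reference, the coherence score as defined by Mimno et al. (2011), reproduced here rather than taken from the slide:

    C(k; V^{(k)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(k)}, v_l^{(k)}) + 1}{D(v_l^{(k)})}

where V^{(k)} = (v_1^{(k)}, ..., v_M^{(k)}) are the M most probable terms of topic k, D(v) is the number of documents containing term v, and D(v, v') the number containing both. Less negative scores indicate more coherent topics.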

