
A Statistical Machine Learning

Perspective of Deep Learning:


Algorithm, Theory, Scalable Computing
Maruan Al-Shedivat, Zhiting Hu, Hao Zhang, and Eric Xing

Petuum Inc
&
Carnegie Mellon University
Elements of AI/Machine Learning

• Task
• Model: Graphical Models, Large-Margin, Deep Learning, Sparse Coding, Nonparametric Bayesian Models, Regularized Bayesian Methods, Spectral/Matrix Methods, Sparse Structured I/O Regression
• Algorithm: Stochastic Gradient Descent / Backpropagation, Coordinate Descent, L-BFGS, Gibbs Sampling, Metropolis-Hastings
• Implementation: Mahout, MLlib, CNTK, MxNet, TensorFlow, …
• System: Hadoop, Spark, MPI, RPC, GraphLab, … (MapReduce / BSP / Async execution models)
• Platform and Hardware: network switches, Infiniband, network-attached storage, flash storage, SSD, RAM, server machines, desktops/laptops, NUMA machines, mobile devices, GPUs, CPUs, FPGAs, TPUs, ARM-powered devices, virtual machines, cloud compute (e.g., Amazon EC2), IoT networks, data centers
© Petuum,Inc. 1
ML vs DL

© Petuum,Inc. 2
Plan
• Statistical And Algorithmic Foundation and Insight of Deep
Learning

• On a Unified Framework of Deep Generative Models

• Computational Mechanisms: Distributed Deep Learning


Architectures
© Petuum,Inc. 3
Part-I
Basics
Outline
• Probabilistic Graphical Models: Basics
• An overview of DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 5
Outline
• Probabilistic Graphical Models: Basics
• An overview of DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 6
Fundamental questions of probabilistic modeling
• Representation: what is the joint probability distribution on multiple variables?
  P(X_1, X_2, X_3, …, X_n)
• How many state configurations are there?
• Do they all need to be represented?
• Can we incorporate any domain-specific insights into the representation?
• Learning: where do we get the probabilities from?
• Maximum likelihood estimation? How much data do we need?
• Are there any other established principles?
• Inference: if not all variables are observable, how to compute the conditional
distribution of latent variables given evidence?
• Computing P(H | A) would require summing over 2^6 configurations of the unobserved variables

© Petuum,Inc. 7
What is a graphical model?
• A possible world of cellular signal transduction

© Petuum,Inc. 8
GM: structure simplifies representation
• A possible world of cellular signal transduction

© Petuum,Inc. 9
Probabilistic Graphical Models
• If the X_i's are conditionally independent (as described by a PGM), then
the joint can be factored into a product of simpler terms:

  P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) =
  P(X_1) P(X_2) P(X_3 | X_1) P(X_4 | X_2) P(X_5 | X_2) P(X_6 | X_3, X_4) P(X_7 | X_6) P(X_8 | X_5, X_6)

• Why we may favor a PGM?


• Easy to incorporate domain knowledge and causal (logical) structures
• Significant reduction in representation cost (2^8 reduced down to 18)
© Petuum,Inc. 10
The two types of GMs
(Inference: P(H | D); learning: θ = argmax_θ P_θ(D))

• Directed edges assign causal meaning to the relationships
  (Bayesian Networks or Directed Graphical Models)
  P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) =
  P(X_1) P(X_2) P(X_3 | X_1) P(X_4 | X_2) P(X_5 | X_2) P(X_6 | X_3, X_4) P(X_7 | X_6) P(X_8 | X_5, X_6)

• Undirected edges represent correlations between the variables
  (Markov Random Fields or Undirected Graphical Models)
  P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) =
  (1/Z) exp{ E(X_1) + E(X_2) + E(X_1, X_3) + E(X_2, X_4) + E(X_5, X_2) + E(X_3, X_4, X_6) + E(X_6, X_7) + E(X_5, X_6, X_8) }
© Petuum,Inc. 11
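To make the two factorizations above concrete, here is a minimal numpy sketch (all conditional-probability tables and potential values are made-up toy numbers, not from the slides) of the same 8-variable joint written once as a directed (Bayesian network) factorization and once as an undirected (MRF) energy-based model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Directed: P(X1..X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3,X4)P(X7|X6)P(X8|X5,X6)
p1 = np.array([0.4, 0.6])                         # P(X1)
p2 = np.array([0.7, 0.3])                         # P(X2)
p3_1 = np.array([[0.9, 0.1], [0.2, 0.8]])         # P(X3 | X1), rows index X1
p4_2 = np.array([[0.5, 0.5], [0.1, 0.9]])         # P(X4 | X2)
p5_2 = np.array([[0.8, 0.2], [0.3, 0.7]])         # P(X5 | X2)
p6_34 = rng.dirichlet([1, 1], size=(2, 2))        # P(X6 | X3, X4)
p7_6 = np.array([[0.6, 0.4], [0.25, 0.75]])       # P(X7 | X6)
p8_56 = rng.dirichlet([1, 1], size=(2, 2))        # P(X8 | X5, X6)

def joint_directed(x):
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return (p1[x1] * p2[x2] * p3_1[x1, x3] * p4_2[x2, x4] * p5_2[x2, x5]
            * p6_34[x3, x4, x6] * p7_6[x6, x7] * p8_56[x5, x6, x8])
# Only 18 free parameters are needed instead of 2^8 = 256 joint-table entries.

# Undirected: P(x) = (1/Z) exp{ sum of local energy/potential functions }
def energy(x):
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return (0.1*x1 + 0.2*x2 + 0.5*x1*x3 + 0.3*x2*x4 + 0.4*x5*x2
            + 0.2*x3*x4*x6 + 0.1*x6*x7 + 0.3*x5*x6*x8)   # toy potentials

states = [np.unravel_index(i, (2,)*8) for i in range(2**8)]
Z = sum(np.exp(energy(s)) for s in states)        # brute-force partition function
def joint_undirected(x):
    return np.exp(energy(x)) / Z

x = (1, 0, 1, 1, 0, 1, 0, 1)
print(joint_directed(x), joint_undirected(x))
```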
Outline
• Probabilistic Graphical Models: Basics
• An overview of DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 12
Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron), McCulloch & Pitts (1943)
  (Figure: inputs x1, x2 with weights w1, w2 feed a linear combiner Σ followed by a hard limiter with threshold θ, producing output Y)

• From biological neuron networks to artificial neural networks
  (Figure: biological neurons, with dendrites, soma, axon, and synapses carrying input and output signals, compared with an artificial network of input layer, middle layer, and output layer)
© Petuum,Inc. 13
The perceptron learning algorithm

• Recall the nice property of sigmoid function


• Consider a regression problem f: X → Y, for scalar Y:
• We used to maximize the conditional data likelihood

• Here …

© Petuum,Inc. 14
The perceptron learning algorithm
x_d = input
t_d = target output
o_d = observed output
w_i = weight i

Batch mode:
Do until converge:
  1. compute gradient ∇E_D[w]
  2. update w ← w − η ∇E_D[w]

Incremental mode:
Do until converge:
  For each training example d in D:
    1. compute gradient ∇E_d[w]
    2. update w ← w − η ∇E_d[w]

where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²  and  E_d[w] = ½ (t_d − o_d)²

© Petuum,Inc. 15
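A minimal sketch of the batch update loop above, assuming a sigmoid output unit and the squared error E_D[w] (the data, learning rate eta, and iteration count are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # 100 examples, 3 inputs
w_true = np.array([1.5, -2.0, 0.5])
t = 1.0 / (1.0 + np.exp(-X @ w_true))           # targets t_d

w = np.zeros(3)
eta = 0.5
for step in range(200):                         # "do until converge"
    o = 1.0 / (1.0 + np.exp(-X @ w))            # observed outputs o_d
    # gradient of E_D[w] = 1/2 * sum_d (t_d - o_d)^2 w.r.t. w
    grad = -((t - o) * o * (1 - o)) @ X
    w -= eta * grad                             # w <- w - eta * grad
print(w, w_true)
```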
Neural Network Model
(Figure: a small feedforward network. The independent variables Age = 34, Gender = 2, and Stage = 4 are combined, via weights, by logistic Σ units in a hidden layer, which feed a Σ output unit producing 0.6, the "probability of being alive".)
Independent variables, Weights, Hidden Layer, Weights, Dependent variable (Prediction)
© Petuum,Inc. 16
"Combined logistic models"
(Figure: the same network, showing one hidden unit's logistic model in isolation.)
Independent variables, Weights, Hidden Layer, Weights, Dependent variable (Prediction)
© Petuum,Inc. 17
"Combined logistic models"
(Figure: the same network, showing a second hidden unit's logistic model in isolation.)
Independent variables, Weights, Hidden Layer, Weights, Dependent variable (Prediction)
© Petuum,Inc. 18
"Combined logistic models"
(Figure: the same network, showing a third hidden unit's logistic model in isolation.)
Independent variables, Weights, Hidden Layer, Weights, Dependent variable (Prediction)
© Petuum,Inc. 19
Not really, no target for hidden units...
(Figure: the full network again; the hidden units have no training targets of their own.)
Independent variables, Weights, Hidden Layer, Weights, Dependent variable (Prediction)
© Petuum,Inc. 20
Backpropagation:
Reverse-mode differentiation
• Artificial neural networks are nothing more than complex functional compositions that can be
represented by computation graphs:
  (Figure: a computation graph with an input variable x, intermediate computations 1–5, and output f(x))

  ∂f_n/∂x = Σ_{i_1 ∈ π(n)} (∂f_n/∂f_{i_1}) (∂f_{i_1}/∂x)

© Petuum,Inc. 21
Backpropagation:
Reverse-mode differentiation
• Artificial neural networks are nothing more than complex functional compositions that can be
represented by computation graphs:
  (Figure: the same computation graph)

• By applying the chain rule and using reverse accumulation, we get

  ∂f_n/∂x = Σ_{i_1 ∈ π(n)} (∂f_n/∂f_{i_1}) (∂f_{i_1}/∂x)
          = Σ_{i_1 ∈ π(n)} Σ_{i_2 ∈ π(i_1)} (∂f_n/∂f_{i_1}) (∂f_{i_1}/∂f_{i_2}) (∂f_{i_2}/∂x) = …

• The algorithm is commonly known as backpropagation


• What if some of the functions are stochastic?
• Then use stochastic backpropagation!
(to be covered in the next part)
• Modern packages can do this automatically (more later) © Petuum,Inc. 22
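A minimal hand-written sketch of reverse accumulation (the function f(x, y) = sin(xy) + xy and the node names are illustrative, not from the slides); modern packages do exactly this bookkeeping automatically:

```python
import numpy as np

def f_and_grad(x, y):
    # forward pass: record intermediate values
    a = x * y                      # node 1
    b = np.sin(a)                  # node 2
    f = b + a                      # output node
    # reverse pass: start from df/df = 1 and apply the chain rule backwards
    df_db = 1.0                    # f = b + a
    df_da = 1.0 + df_db * np.cos(a)   # a feeds both f and b
    df_dx = df_da * y              # a = x * y
    df_dy = df_da * x
    return f, df_dx, df_dy

print(f_and_grad(0.5, 2.0))
# sanity check against a finite difference
eps = 1e-6
print((f_and_grad(0.5 + eps, 2.0)[0] - f_and_grad(0.5, 2.0)[0]) / eps)
```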
Modern building blocks of deep networks
• Activation functions
  • Linear and ReLU
  • Sigmoid and tanh
  • Etc.
(Figure: a single unit computing f(Wx + b) from inputs x1, x2, x3 with weights w1, w2, w3; plots of the linear and rectified linear (ReLU) activations, output vs. input)
© Petuum,Inc. 23
Modern building blocks of deep networks
• Activation functions
  • Linear and ReLU
  • Sigmoid and tanh
  • Etc.
• Layers
  • Fully connected
  • Convolutional & pooling
  • Recurrent
  • ResNets (blocks with residual connections)
  • Etc.
(Figures: fully connected, convolutional, and recurrent layers; source: colah.github.io)
© Petuum,Inc. 24
Modern building blocks of deep networks
Putting things together:
• Activation functions
  • Linear and ReLU
  • Sigmoid and tanh
  • Etc.
• Layers
  • Fully connected
  • Convolutional & pooling
  • Recurrent
  • ResNets
  • Etc.
• Loss functions
  • Cross-entropy loss
  • Mean squared error
  • Etc.
(Figure: a part of GoogLeNet, with convolutional, avg & max pooling, concatenation, fully connected, activation, and loss layers)
© Petuum,Inc. 25
Modern building blocks of deep networks
Putting things together:
• Activation functions: Linear and ReLU; Sigmoid and tanh; etc.
• Layers: fully connected; convolutional & pooling; recurrent; ResNets; etc.
• Loss functions: cross-entropy loss; mean squared error; etc.
(Figure: a part of GoogLeNet)

• Arbitrary combinations of the basic building blocks
• Multiple loss functions – multi-target prediction, transfer learning, and more
• Given enough data, deeper architectures just keep improving
• Representation learning: the networks learn increasingly more abstract representations of the data that are "disentangled," i.e., amenable to linear separation.
(A tiny example of composing these blocks is given below.)
© Petuum,Inc. 26
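A minimal sketch of composing the blocks listed above (a fully connected layer, a ReLU activation, another fully connected layer, a softmax, and a cross-entropy loss); the sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 10, 32, 3                           # input dim, hidden dim, classes
x = rng.normal(size=D)
y = 1                                         # true class label

W1, b1 = rng.normal(size=(D, H)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(H, K)) * 0.1, np.zeros(K)

h = np.maximum(0, x @ W1 + b1)                # fully connected + ReLU
logits = h @ W2 + b2                          # fully connected output layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax
loss = -np.log(probs[y])                      # cross-entropy loss
print(loss)
```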
Outline
• Probabilistic Graphical Models: Basics
• An overview of the DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 27
Graphical models vs. Deep nets

Graphical models
• Representation for encoding meaningful knowledge and the associated uncertainty in a graphical form

Deep neural networks
• Learn representations that facilitate computation and performance on the end metric (intermediate representations are not guaranteed to be meaningful)

© Petuum,Inc. 28
Graphical models vs. Deep nets

Graphical models
• Representation for encoding meaningful knowledge and the associated uncertainty in a graphical form
• Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC, etc.)
• Graphs represent models

Deep neural networks
• Learn representations that facilitate computation and performance on the end metric (intermediate representations are not guaranteed to be meaningful)
• Learning is predominantly based on the gradient descent method (aka backpropagation); inference is often trivial and done via a "forward pass"
• Graphs represent computation

© Petuum,Inc. 29
Graphical models vs. Deep nets

Graphical models
(Figure: a small MRF over X1, …, X5 with log P(X) = Σ_i log φ(x_i) + Σ_{i,j} log φ(x_i, x_j))

Utility of the graph
• A vehicle for synthesizing a global loss function from local structure
  • potential function, feature function, etc.
• A vehicle for designing sound and efficient inference algorithms
  • Sum-product, mean-field, etc.
• A vehicle to inspire approximation and penalization
  • Structured MF, tree-approximation, etc.
• A vehicle for monitoring theoretical and empirical behavior and accuracy of inference (e.g., of E_{H∼P(H|D)}[·])

Utility of the loss function
• A major measure of quality of the learning algorithm and the model: θ = argmax_θ P_θ(D)
© Petuum,Inc. 30
Graphical models vs. Deep nets

Deep neural networks

Utility of the network
• A vehicle to conceptually synthesize complex decision hypotheses
  • stage-wise projection and aggregation
• A vehicle for organizing computational operations
  • stage-wise update of latent states
• A vehicle for designing processing steps and computing modules
  • layer-wise parallelization
• No obvious utility in evaluating DL inference algorithms

Utility of the loss function
• Global loss? Well, it is complex and non-convex...
Images from Distill.pub © Petuum,Inc. 31
Graphical models vs. Deep nets

Graphical models: utility of the graph
• A vehicle for synthesizing a global loss function from local structure (potential function, feature function, etc.)
• A vehicle for designing sound and efficient inference algorithms (sum-product, mean-field, etc.)
• A vehicle to inspire approximation and penalization (structured MF, tree-approximation, etc.)
• A vehicle for monitoring theoretical and empirical behavior and accuracy of inference
Utility of the loss function
• A major measure of quality of the learning algorithm and the model

Deep neural networks: utility of the network
• A vehicle to conceptually synthesize complex decision hypotheses (stage-wise projection and aggregation)
• A vehicle for organizing computational operations (stage-wise update of latent states)
• A vehicle for designing processing steps and computing modules (layer-wise parallelization)
• No obvious utility in evaluating DL inference algorithms
Utility of the loss function
• Global loss? Well, it is complex and non-convex...
© Petuum,Inc. 32
Graphical models vs. Deep nets: DL <=>? ML (e.g., GM)

• Empirical goal: DL: e.g., classification, feature learning; GM: e.g., latent variable inference, transfer learning
• Structure: DL: graphical; GM: graphical
• Objective: DL: something aggregated from local functions; GM: something aggregated from local functions
• Vocabulary: DL: neuron, activation function, …; GM: variable, potential function, …
• Algorithm: DL: a single, unchallenged inference algorithm: backpropagation (BP); GM: a major focus of open research, many algorithms, and more to come
• Evaluation: DL: on a black-box score (end performance); GM: on almost every intermediate quantity
• Implementation: DL: many tricks; GM: more or less standardized
• Experiments: DL: massive, real data (GT unknown); GM: modest, often simulated data (GT known)
© Petuum,Inc. 33
Graphical Models vs. Deep Nets
• So far:
• Graphical models are representations of probability distributions
• Neural networks are function approximators (with no probabilistic meaning)
• Some of the neural nets are in fact proper graphical models
(i.e., units/neurons represent random variables):
• Boltzmann machines (Hinton & Sejnowski, 1983)
• Restricted Boltzmann machines (Smolensky, 1986)
• Learning and Inference in sigmoid belief networks (Neal, 1992)
• Fast learning in deep belief networks (Hinton, Osindero, Teh, 2006)
• Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)
• Let’s go through these models one-by-one © Petuum,Inc. 34
I: Restricted Boltzmann Machines
• RBM is a Markov random field represented with a bipartite graph
• All nodes in one layer/part of the graph are connected to all nodes in the other; there are no intra-layer connections

• Joint distribution:
  P(v, h) = (1/Z) exp( Σ_{i,j} W_{ij} v_i h_j + Σ_i b_i v_i + Σ_j c_j h_j )

Images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 35


I: Restricted Boltzmann Machines
• Log-likelihood of a single data point (unobservables marginalized out):

  log p(v) = log Σ_h exp( Σ_{i,j} W_{ij} v_i h_j + Σ_i b_i v_i + Σ_j c_j h_j ) − log Z

• Gradient of the log-likelihood w.r.t. the model parameters:

  ∂/∂W_{ij} log p(v) = Σ_h P(h|v) ∂/∂W_{ij} log p̃(v, h) − Σ_{v,h} P(v, h) ∂/∂W_{ij} log p̃(v, h)

  where p̃(v, h) ∝ exp( Σ_{i,j} W_{ij} v_i h_j + Σ_i b_i v_i + Σ_j c_j h_j ) is the unnormalized joint
  (for an RBM, ∂ log p̃(v, h)/∂W_{ij} = v_i h_j)
• i.e., we average over the posterior and over the joint.

Images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 36


I: Restricted Boltzmann Machines
• Gradient of the log-likelihood w.r.t. the parameters (alternative form):

  ∂/∂W_{ij} log p(v) = E_{P(h|v)}[ ∂/∂W_{ij} log p̃(v, h) ] − E_{P(v,h)}[ ∂/∂W_{ij} log p̃(v, h) ]

• Both expectations can be approximated via sampling


• Sampling from the posterior is exact (RBM factorizes over ℎ given G)
• Sampling from the joint is done via MCMC (e.g., Gibbs sampling)
• In the neural networks literature:
• computing the first term is called the clamped / wake / positive phase
(the network is “awake” since it conditions on the visible variables)
• Computing the second term is called the unclamped / sleep / free / negative phase
(the network is “asleep” since it samples the visible variables from the joint;
metaphorically, it is ”dreaming” the visible inputs) © Petuum,Inc. 37
I: Restricted Boltzmann Machines
• Gradient of the log-likelihood w.r.t. the parameters (alternative form):

  ∂/∂W_{ij} log p(v) = E_{P(h|v)}[ ∂/∂W_{ij} log p̃(v, h) ] − E_{P(v,h)}[ ∂/∂W_{ij} log p̃(v, h) ]

• Learning is done by optimizing the log-likelihood of the model for a given


data via stochastic gradient descent (SGD)
• Estimation of the second term (the negative phase) heavily relies on the
mixing properties of the Markov chain
• This often causes slow convergence and requires extra computation

© Petuum,Inc. 38
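A minimal sketch of this stochastic gradient for a small binary RBM. Note the negative phase here is approximated with a single step of block Gibbs sampling (the CD-1 heuristic), rather than the long MCMC chain described above; sizes, learning rate, and the toy data vector are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.01 * rng.normal(size=(nv, nh))
b, c = np.zeros(nv), np.zeros(nh)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def cd1_update(v0, lr=0.1):
    # positive (clamped/wake) phase: exact expectation of h given the data v0
    ph0 = sigmoid(v0 @ W + c)                    # P(h_j = 1 | v0)
    h0 = (rng.random(nh) < ph0).astype(float)
    # negative (unclamped/sleep) phase: one step of block Gibbs sampling
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(nv) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # gradient ~ E_posterior[v_i h_j] - E_joint[v_i h_j], approximated with samples
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

v = (rng.random(nv) < 0.5).astype(float)         # a toy "data" vector
for _ in range(100):
    cd1_update(v)
```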
II: Sigmoid Belief Networks

from Neal, 1992

• Sigmoid belief nets are simply Bayesian networks over binary variables with conditional
probabilities represented by sigmoid functions:
  P( x_i | π(x_i) ) = σ( x_i Σ_{x_j ∈ π(x_i)} W_{ij} x_j )

• Bayesian networks exhibit a phenomenon called “explain away effect”

If A correlates with C, then the chance of B correlating with C


decreases. A and B become correlated given C.
© Petuum,Inc. 39
II: Sigmoid Belief Networks

from Neal, 1992

• Sigmoid belief nets are simply Bayesian networks over binary variables with conditional
probabilities represented by sigmoid functions:
  P( x_i | π(x_i) ) = σ( x_i Σ_{x_j ∈ π(x_i)} W_{ij} x_j )

• Bayesian networks exhibit a phenomenon called “explain away effect”


Note:
Due to the “explain away effect,” when we
condition on the visible layer in belief networks,
hidden variables all become dependent. © Petuum,Inc. 40
Sigmoid Belief Networks:
Learning and Inference
• Neal proposed Monte Carlo methods for learning and inference (Neal, 1992):
  • The log-derivative of the probability of the visibles (obtained via marginalization) is approximated with Gibbs sampling
  • Conditional distributions: obtained via Bayes' rule (rearranging sums), plugging in the actual sigmoid form of the conditional probabilities

• No negative phase as in RBMs!

• Convergence is very slow, especially for large belief nets, due to the intricate "explain-away" effects…

Equations from Neal, 1992 © Petuum,Inc. 41


RBMs are infinite belief networks
• Recall the expression for the gradient of the log likelihood for RBM:
  ∂/∂W_{ij} log p(v) = E_{P(h|v)}[ ∂/∂W_{ij} log p̃(v, h) ] − E_{P(v,h)}[ ∂/∂W_{ij} log p̃(v, h) ]
• To make a gradient update of the model parameters, we need to compute
the expectations via sampling.
• We can sample exactly from the posterior in the first term
• We run block Gibbs sampling to approximately sample from the joint distribution

images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 42


RBMs are infinite belief networks
• Gibbs sampling: alternate between sampling hidden and visible variables

sampling steps

• Conditional distributions !(G|ℎ) and ! ℎ G are represented by sigmoids


• Thus, we can think of Gibbs sampling from the joint distribution represented by
an RBM as a top-down propagation in an infinitely deep sigmoid belief network!

images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 43


RBMs are infinite belief networks
• RBMs are equivalent to infinitely deep belief networks

• Sampling from this is the same as sampling from


the network on the right

images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 44


RBMs are infinite belief networks
• RBMs are equivalent to infinitely deep belief networks

images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 45


RBMs are infinite belief networks
• RBMs are equivalent to infinitely deep belief networks

• When we train an RBM, we are really training an infinitely deep belief net!
• It is just that the weights of all layers are tied.
• If the weights are “untied” to some extent, we get a Deep Belief Network.
images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 46
III: Deep Belief Nets

• DBNs are hybrid graphical models (chain graphs):


• Exact inference in DBNs is problematic due to explaining away effect
• Training: greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward
© Petuum,Inc. 47
Deep Belief Networks
• DBNs represent a joint probability distribution
  P( v, h¹, h², h³ ) = P( h², h³ ) P( h¹ | h² ) P( v | h¹ )
• Note that P(h², h³) is an RBM and the conditionals P(h¹ | h²)
and P(v | h¹) are represented in the sigmoid form
• The model is trained by optimizing the log-likelihood log P(v) for the
given data

Challenges:
• Exact inference in DBNs is problematic due to explain away effect
• Training is done in two stages:
• greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward (bottom-up) © Petuum,Inc. 48
DBN: Layer-wise pre-training
• Pre-train and freeze the 1st RBM
• Stack another RBM on top and train it

• The weights of layers 2 and above remain tied


• We repeat this procedure: pre-train and untie
the weights layer-by-layer…

images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 49


DBN: Layer-wise pre-training
• We repeat this procedure: pre-train and untie
the weights layer-by-layer:
• The weights of 3+ layers remain tied

• and so forth
• From the optimization perspective, this procedure loosely corresponds
to an approximate block-coordinate ascent on the log-likelihood
images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 50
DBN: Fine-tuning
• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic
model per se
• However, the layers of representations could perhaps be
useful for some other downstream tasks!

• We can further “fine-tune” a pre-trained DBN for some other task

Setting A: Unsupervised learning (DBN → autoencoder)


1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create an autoencoder
3. Fine-tune the parameters by optimizing the reconstruction error

images from Hinton & Salakhutdinov, 2006 © Petuum,Inc. 51


DBN: Fine-tuning
• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic
model per se
• However, the layers of representations could perhaps be
useful for some other downstream tasks!

• We can further “fine-tune” a pre-trained DBN for some other task

Setting B: Supervised learning (DBN → classifier)


1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create a feedforward classifier
3. Fine-tune the parameters by optimizing the classification error
Some intuitions about how pre-training works:
Erhan et al.: Why Does Unsupervised Pre-training Help Deep Learning? JMLR, 2010 © Petuum,Inc. 54
Deep Belief Nets and Boltzmann Machines

• DBNs are hybrid graphical models (chain graphs):


• Inference in DBNs is problematic due to explaining away effect
• Training: greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward

© Petuum,Inc. 55
Deep Belief Nets and Boltzmann Machines

• DBMs are fully undirected models (Markov random fields):


• Can be trained similarly as RBMs via MCMC (Hinton & Sejnowski, 1983)
• Use a variational approximation of the data distribution for faster training
(Salakhutdinov & Hinton, 2009)
• Similarly, can be used to initialize other networks for downstream tasks
© Petuum,Inc. 56
Graphical models vs. Deep networks
• A few critical points to note about all these models:
• The primary goal of deep generative models is to represent the
distribution of the observable variables. Adding layers of hidden
variables allows to represent increasingly more complex distributions.
• Hidden variables are secondary (auxiliary) elements used to facilitate
learning of complex dependencies between the observables.
• Training of the model is ad-hoc, but what matters is the quality of
learned hidden representations.
• Representations are judged by their usefulness on a downstream task
(the probabilistic meaning of the model is often discarded at the end).
• In contrast, classical graphical models are often concerned
with the correctness of learning and inference of all variables
© Petuum,Inc. 57
An old study of belief networks
from the GM standpoint [Xing, Russell, Jordan, UAI 2003]

Mean-field partitions of a sigmoid belief network for subsequent GMF inference

The study focused only on inference/learning accuracy, speed, and partition

(Figures: mean-field partitions and inference results, comparing GMFb, GMFr, and BP)

© Petuum,Inc. 58
“Optimize” how to optimize via truncation & re-opt
• Energy-based modeling of the structured output (CRF)

• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)

(Figure: the optimization over y unrolled into a chain of iterates y_1, y_2, …, y_6)

• We can backprop through the optimization steps since they are just a sequence of computations
Relevant recent paper:
Andrychowicz et al.: Learning to learn by gradient descent by gradient descent. 2016.
© Petuum,Inc. 59
Dealing with structured prediction
• Energy-based modeling of the structured output (CRF)

• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)

• We can think of y* as some non-linear differentiable function of the inputs and


weights → impose some loss and optimize it as any other standard computation
graph using backprop!
• Similarly, message passing based inference algorithms can be truncated and
converted into computational graphs (Domke, 2011; Stoyanov et al., 2011)

© Petuum,Inc. 60
Outline
• Probabilistic Graphical Models: Basics
• An overview of DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 61
Combining sequential NNs and GMs

© Petuum,Inc. 62
slide courtesy: Matt Gormley
Combining sequential NNs and GMs

© Petuum,Inc. 63
slide courtesy: Matt Gormley
Hybrid NNs + conditional GMs

• In a standard CRF, each of the factor cells is a parameter.


• In a hybrid model, these values are computed by a neural network.

© Petuum,Inc. 64
slide courtesy: Matt Gormley
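A minimal sketch of the hybrid idea above: a linear-chain CRF whose unary factor values are produced by a small neural network instead of being free parameters. The network shapes, the example label sequence, and the transition matrix are illustrative assumptions, not the tutorial's code:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
T, D, H, K = 5, 8, 16, 3            # sequence length, input dim, hidden dim, #labels
X = rng.normal(size=(T, D))         # one input sequence

# neural net producing unary factor scores psi_t(y) for each position t
W1, W2 = rng.normal(size=(D, H)) * 0.1, rng.normal(size=(H, K)) * 0.1
unary = np.maximum(0, X @ W1) @ W2          # (T, K) factor values from the NN
pairwise = rng.normal(size=(K, K)) * 0.1    # shared transition scores

def log_partition(unary, pairwise):
    # forward algorithm in log space over the chain CRF
    alpha = unary[0]
    for t in range(1, T):
        alpha = unary[t] + logsumexp(alpha[:, None] + pairwise, axis=0)
    return logsumexp(alpha)

y = [0, 1, 1, 2, 0]                         # an example label sequence
score = unary[np.arange(T), y].sum() + sum(pairwise[y[t-1], y[t]] for t in range(1, T))
log_prob = score - log_partition(unary, pairwise)
print(log_prob)
```

In training, gradients of this log-probability flow through both the CRF factors and the neural network weights.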
Hybrid NNs + conditional GMs

© Petuum,Inc. 65
slide courtesy: Matt Gormley
Hybrid NNs + conditional GMs

© Petuum,Inc. 66
slide courtesy: Matt Gormley
Using GMs as Prediction Explanations

• Idea: Use deep neural nets to generate parameters of a graphical model for a
given context (e.g., specific instance or case)
• Produced GMs are used to make the final prediction
• GMs are built on top of interpretable variables (not deep embeddings!) and can
be used as contextual explanations for each prediction

Al-Shedivat, Dubey, Xing, arXiv, 2017 © Petuum,Inc. 67


Using GMs as Prediction Explanations
(Figure: Contextual Explanation Networks (CEN) vs. a mixture of experts (MoE). A context encoder processes the context with attention over a dictionary of GM parameters θ; the selected/combined parameters define a graphical model over interpretable attributes X1…X4 that predicts Y1…Y4.)
A practical implementation:
• Maintain a (sparse) dictionary of GM parameters
• Process complex inputs (images, text, time series, etc.) using deep nets; use soft
attention to either select or combine models from the dictionary
• Use constructed GMs (e.g., CRFs) to make predictions
• Inspect GMs to understand the reasoning behind predictions
Al-Shedivat, Dubey, Xing, arXiv, 2017 © Petuum,Inc. 68
Outline
• An overview of the DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning

© Petuum,Inc. 69
Bayesian learning of NNs
• A neural network as a probabilistic model:
  • Likelihood: p(y | x, w)
    • Categorical distribution for classification ⇒ cross-entropy loss
    • Gaussian distribution for regression ⇒ squared loss
  • Prior on parameters: p(w)
• Maximum a posteriori (MAP) solution:
  • w_MAP = argmax_w log p(y | x, w) p(w)
  • Gaussian prior ⇒ L2 regularization
  • Laplace prior ⇒ L1 regularization
• Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]
  • Posterior: p(w | x, y)
  • Variational inference with approximate posterior q(w)
(Figure courtesy: Blundell et al., 2015, "Weight Uncertainty in Neural Networks". Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.)
© Petuum,Inc. 70
Bayesian learning of NNs
• Variational inference (in a nutshell):
  min_q F(D, q) = KL( q(w) || p(w) ) − E_{q(w)}[ log p(D | w) ]
               ≈ KL( q(w) || p(w) ) − Σ_i log p(D | w_i)
  where w_i ∼ q(w); the KL term can be approximated similarly

• We can define q(w) as a diagonal Gaussian or a full-covariance Gaussian

• Alternatively, q(w) can be defined implicitly, e.g., via dropout [Gal & Ghahramani, 2016]:
  W = M ⋅ diag(z),  z ∼ Bernoulli(p)
  • Dropping out neurons is equivalent to zeroing out columns of the parameter matrices (i.e., weights)
  • z_i = 0 corresponds to the i-th column of M being dropped out
  ⇒ the procedure is equivalent to dropout of unit i [Hinton et al., 2012]
  • Variational parameters are {M, p} © Petuum,Inc. 71
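A minimal sketch of dropout as approximate Bayesian inference for a 2-layer regression net (all sizes, weights, and the keep probability are illustrative assumptions): keep dropout ON at test time and average stochastic forward passes to obtain a predictive mean and uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 1, 50
M1, M2 = rng.normal(size=(D, H)) * 0.5, rng.normal(size=(H, 1)) * 0.5
p_keep = 0.9

def stochastic_forward(x):
    z1 = (rng.random(H) < p_keep) / p_keep        # Bernoulli mask, one sample from q(W)
    h = np.maximum(0, x @ (M1 * z1[None, :]))     # z1_i = 0 drops column i of M1 (unit i)
    return h @ M2

x = np.array([[0.3]])
samples = np.array([stochastic_forward(x).item() for _ in range(200)])
print("predictive mean:", samples.mean(), "predictive std:", samples.std())
```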
“Infinitely Wide” Deep Models
• We have seen that an "infinitely deep" network can be explained by a proper GM.
How about an "infinitely wide" one?
• Consider a neural network with a Gaussian prior on its weights and infinitely many hidden
neurons in the intermediate layer.
  (Figure: a network with infinitely many hidden units)
• It turns out that, under a certain Gaussian prior on the weights, such an infinite network
is equivalent to a Gaussian process [Neal 1996].

• A Gaussian process (GP) is a distribution over functions:

• When used for prediction, GPs account for correlations between the data points and can
output well-calibrated predictive uncertainty estimates.
© Petuum,Inc. 72
Gaussian Process and Deep Kernel Learning
• Consider a neural network with a Gaussian prior on its weights and infinitely many hidden neurons in
the intermediate layer.
  (Figure: a network with infinitely many hidden units)

• Certain classes of Gaussian priors for neural networks with infinitely many hidden units converge to
Gaussian processes [Neal 1996]
• Deep kernel learning [Wilson et al., 2016]
  • Combines the inductive biases of deep model architectures with the non-parametric flexibility of Gaussian processes
  • Starting from a base kernel k(x_i, x_j | γ), transform the inputs x with a neural net g(·, w):
    k(x_i, x_j | γ) → k( g(x_i, w), g(x_j, w) | γ, w )
  • GP regression model: p(y | f) = N(y | f, β⁻¹ I),  p(f | γ, w) = N(f | μ(X), K),  where K_{ij} = k(x_i, x_j)
  • Learn both kernel and neural parameters {γ, w} jointly by optimizing the marginal log-likelihood (or its variational lower bound)
  • Fast learning and inference with local kernel interpolation, structured inducing points, and Monte Carlo approximations
© Petuum,Inc. 73
Gaussian Process and Deep Kernel Learning
• By adding GP as a layer to a deep neural net, we can think of it as adding
an infinite hidden layer with a particular prior on the weights
• Deep kernel learning [Wilson et al., 2016]
• Combines the inductive biases of
deep models with the non-parametric
flexibility of Gaussian processes
• GPs add powerful regularization to
the network
• Additionally, they provide predictive
uncertainty estimates

© Petuum,Inc. 74
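A minimal illustrative sketch of a "deep kernel" (not the KISS-GP implementation referenced on the slides): GP regression where an RBF kernel is applied to features g(x, w) produced by a small neural network; the data, network shapes, and hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 20, 1, 10
X = rng.uniform(-3, 3, size=(N, D))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=N)

W1, W2 = rng.normal(size=(D, H)), rng.normal(size=(H, 2))   # neural parameters w
gamma, beta = 1.0, 100.0                                    # kernel scale, noise precision

def g(A):                            # neural feature extractor g(x, w)
    return np.tanh(A @ W1) @ W2

def deep_kernel(A, B):               # k(g(a, w), g(b, w) | gamma)
    GA, GB = g(A), g(B)
    d2 = ((GA[:, None, :] - GB[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / gamma)

K = deep_kernel(X, X) + np.eye(N) / beta
# GP posterior mean at a few test points (joint marginal-likelihood optimization omitted)
Xs = np.linspace(-3, 3, 5)[:, None]
mean = deep_kernel(Xs, X) @ np.linalg.solve(K, y)
print(mean)
```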
Deep kernel learning on sequential data
What if we have data of
sequential nature?

Can we still apply the same reasoning and build rich
nonparametric models on top of recurrent nets?

© Petuum,Inc. 75
Deep kernel learning on sequential data
The answer is YES!

By adding a GP layer to a recurrent


network, we effectively correlate
samples across time and get
predictions along with well calibrated
uncertainty estimates.

Training such a model using stochastic
techniques, however, requires some
additional care (see our paper).

Al-Shedivat et al., JMLR, 2017 © Petuum,Inc. 76


Deep kernel learning on sequential data
Lane prediction: LSTM vs GP-LSTM
(Figure: predicted lane trajectories, front distance (m) vs. side distance (m), for the LSTM (top) and the GP-LSTM (bottom))

Al-Shedivat et al., JMLR, 2017 © Petuum,Inc. 77


Deep kernel learning on sequential data
Lead vehicle prediction: LSTM vs GP-LSTM
(Figure: lead vehicle predictions, front distance (m) vs. side distance (m), for the LSTM (top) and the GP-LSTM (bottom))

Al-Shedivat et al., JMLR, 2017 © Petuum,Inc. 78


Conclusion
• DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then
diverge to their own signature pipelines
• DL: most effort is directed to comparing different architectures and their components
(models are driven by evaluating empirical performance on a downstream tasks)
• DL models are good at learning robust hierarchical representations from the data and suitable
for simple reasoning (call it “low-level cognition”)
• GM: the effort is directed towards improving inference accuracy and convergence
speed
• GMs are best for provably correct inference and suitable for high-level complex reasoning
tasks (call it “high-level cognition”)
• Convergence of both fields is very promising!
• Next part: a unified view of deep generative models in the GM interpretation

© Petuum,Inc. 79
Part-II
Deep Generative Models
Plan
• Statistical And Algorithmic Foundation and Insight of Deep
Learning

• On a Unified Framework of Deep Generative Models

• Computational Mechanisms: Distributed Deep Learning


Architectures
© Petuum,Inc. 81
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables

© Petuum,Inc. 82
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables

© Petuum,Inc. 83
Deep generative models
• Define probabilistic distributions over a set of variables
• "Deep" means multiple layers of hidden variables!

(Figure: multiple layers of hidden variables z⁽¹⁾, …, z⁽ᴸ⁾ stacked above the visible variable x)
© Petuum,Inc. 84
Early forms of deep generative models
• Hierarchical Bayesian models
  • Sigmoid belief nets [Neal 1992]
  (Figure: a two-layer sigmoid belief net with top hidden units z_n⁽²⁾ ∈ {0,1}, middle hidden units z_n⁽¹⁾ ∈ {0,1}, visible units x_n ∈ {0,1}, and weights θ_ij)

  p( x_{kn} = 1 | θ_k, z_n⁽¹⁾ ) = σ( θ_kᵀ z_n⁽¹⁾ )
  p( z_{in}⁽¹⁾ = 1 | θ_i, z_n⁽²⁾ ) = σ( θ_iᵀ z_n⁽²⁾ )
© Petuum,Inc. 85
Early forms of deep generative models
• Hierarchical Bayesian models
  • Sigmoid belief nets [Neal 1992]

• Neural network models
  • Helmholtz machines [Dayan et al., 1995]
  (Figure from Dayan et al., 1995: a Helmholtz machine with hidden layers above the data x; generative weights point down, inference (recognition) weights point up)


© Petuum,Inc. 86
Early forms of deep generative models
• Hierarchical Bayesian models
• Sigmoid belief nets [Neal 1992]

• Neural network models


• Helmholtz machines [Dayan et al.,1995]
• Predictability minimization [Schmidhuber 1995]

DATA
Figure courtesy: Schmidhuber 1996
© Petuum,Inc. 87
Early forms of deep generative models
• Training of DGMs via an EM-style framework
  • Sampling / data augmentation:
    z = { z₁, z₂ }
    z₁ⁿᵉʷ ∼ p( z₁ | z₂, x )
    z₂ⁿᵉʷ ∼ p( z₂ | z₁ⁿᵉʷ, x )

  • Variational inference:
    log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) || p(z) ) =: ℒ(θ, φ; x)
    max_{θ,φ} ℒ(θ, φ; x)
  • Wake sleep:
    Wake: max_θ E_{q_φ(z|x)}[ log p_θ(x|z) ]
    Sleep: max_φ E_{p_θ(x,z)}[ log q_φ(z|x) ] © Petuum,Inc. 88
Resurgence of deep generative models
• Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
• Building blocks of deep probabilistic models

© Petuum,Inc. 89
Resurgence of deep generative models
• Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
• Building blocks of deep probabilistic models
• Deep belief networks (DBNs) [Hinton et al., 2006]
• Hybrid graphical model
• Inference in DBNs is problematic due to explaining away
• Deep Boltzmann Machines (DBMs) [Salakhutdinov & Hinton, 2009]
• Undirected model

© Petuum,Inc. 90
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]

q_φ(z|x): inference model    p_θ(x|z): generative model

Figure courtesy: Kingma & Welling, 2014

© Petuum,Inc. 91
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)

G_θ: generative model
D_φ: discriminator
© Petuum,Inc. 92
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)
• Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et
al., 2015]

© Petuum,Inc. 93
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)
• Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et
al., 2015]

• Autoregressive neural networks

(Figure: an autoregressive model over x₁, x₂, x₃, x₄)

© Petuum,Inc. 94
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables

© Petuum,Inc. 95
Synonyms in the literature

• Posterior Distribution -> Inference model


• Variational approximation
• Recognition model
• Inference network (if parameterized as neural networks)
• Recognition network (if parameterized as neural networks)
• (Probabilistic) encoder
• "The Model" (prior + conditional, or joint) -> Generative model
• The (data) likelihood model
• Generative network (if parameterized as neural networks)
• Generator
• (Probabilistic) decoder
© Petuum,Inc. 96
Recap: Variational Inference
• Consider a generative model p_θ(x|z) and prior p(z)
  • Joint distribution: p_θ(x, z) = p_θ(x|z) p(z)
• Assume a variational distribution q_φ(z|x)
• Objective: maximize the lower bound on the log-likelihood
  log p(x)
  = KL( q_φ(z|x) || p_θ(z|x) ) + Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]
  ≥ Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]
  =: ℒ(θ, φ; x)
• Equivalently, minimize the free energy
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
© Petuum,Inc. 97
Recap: Variational Inference
Maximize the variational lower bound ℒ(θ, φ; x)
• E-step: maximize ℒ w.r.t. φ with θ fixed
  max_φ ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) || p(z) )
  • If with closed-form solutions:

    q_φ(z|x) ∝ exp[ log p_θ(x, z) ]

• M-step: maximize ℒ w.r.t. θ with φ fixed

  max_θ ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) || p(z) )
© Petuum,Inc. 98
Recap: Amortized Variational Inference
• Variational distribution as an inference model hì | ç with
parameters ñ
• Amortize the cost of inference by learning a single data-
dependent inference model
• The trained inference model can be used for quick inference
on new data
• Maximize the variational lower bound ℒ(c, ñ; ç)
• E-step: maximize ℒ wrt. ñ with c fixed
• M-step: maximize ℒ wrt. c with ñ fixed

© Petuum,Inc. 99
Deep generative models with amortized inference

• Helmholtz machines
• Variational autoencoders (VAEs) / Neural Variational Inference
and Learning (NVIL)

• We will see later that adversarial approaches are also included


in the list
• Predictability minimization (PM)
• Generative adversarial networks (GANs)

© Petuum,Inc. 100
Wake Sleep Algorithm
• [Hinton et al., Science 1995]
• Train a separate inference model along with the generative model
• Generally applicable to a wide range of generative models, e.g., Helmholtz machines
• Consider a generative model p_θ(x|z) and prior p(z)
  • Joint distribution p_θ(x, z) = p_θ(x|z) p(z)
  • E.g., multi-layer belief nets
• Inference model q_φ(z|x)
• Maximize the data log-likelihood with two steps of loss relaxation:
  • Maximize the lower bound of the log-likelihood, or equivalently, minimize the free
energy
    F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
  • Minimize a different objective (reversed KLD) w.r.t. φ to ease the optimization
    • Disconnected from the original variational lower bound loss
    F'(θ, φ; x) = −log p(x) + KL( p_θ(z|x) || q_φ(z|x) )
© Petuum,Inc. 101
Wake Sleep Algorithm
(Figure: a network with recognition weights R1, R2 over data x)
• Free energy:
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
• Minimize the free energy w.r.t. θ of p_θ → wake phase
  max_θ E_{q_φ(z|x)}[ log p_θ(x, z) ]
  • Get samples from q_φ(z|x) through inference on the hidden variables
  • Use the samples as targets for updating the generative model p_θ(x|z)
  • Corresponds to the variational M step

[Figure courtesy: Maei’s slides] © Petuum,Inc. 102


Wake Sleep Algorithm
• Free energy:
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
• Minimize the free energy w.r.t. φ of q_φ(z|x)
  • Corresponds to the variational E step
  • Difficulties:
    • The optimal q*_φ(z|x) = p_θ(z, x) / ∫ p_θ(z, x) dz is intractable
    • High variance of the direct gradient estimate ∇_φ F(θ, φ; x) = ⋯ + ∇_φ E_{q_φ(z|x)}[ log p_θ(z, x) ] + ⋯
      • Gradient estimate with the log-derivative trick:
        ∇_φ E_{q_φ}[ log p_θ ] = ∫ ∇_φ q_φ ⋅ log p_θ = ∫ q_φ ⋅ log p_θ ⋅ ∇_φ log q_φ = E_{q_φ}[ log p_θ ∇_φ log q_φ ]
      • Monte Carlo estimation:
        ∇_φ E_{q_φ}[ log p_θ ] ≈ (1/N) Σ_i log p_θ(x, z_i) ∇_φ log q_φ(z_i | x),  z_i ∼ q_φ(z|x)
      • The scale factor log p_θ of the derivative ∇_φ log q_φ can have arbitrarily
large magnitude © Petuum,Inc. 103
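A minimal toy illustration (not from the slides) of the log-derivative (score-function) estimator above, with a 1-D Gaussian q_φ = N(φ, 1) and a stand-in function f(z) playing the role of log p_θ(x, z):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.5
f = lambda z: -(z - 2.0) ** 2             # stand-in for log p_theta(x, z)

def score_function_grad(phi, n):
    z = rng.normal(phi, 1.0, size=n)      # z_i ~ q_phi(z)
    dlogq = z - phi                       # d/dphi log N(z | phi, 1)
    return np.mean(f(z) * dlogq)          # (1/N) sum_i f(z_i) * grad_phi log q(z_i)

# Unbiased but noisy: the estimate scatters widely across re-runs because
# the scale factor f(z) can take arbitrarily large (negative) values.
print([score_function_grad(phi, 10) for _ in range(5)])   # true gradient is 3.0
```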
Wake Sleep Algorithm
(Figure: a network with recognition weights R1, R2 and generative weights G1, G2 over data x)
• Free energy:
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
• WS works around the difficulties with the sleep phase approximation
• Minimize the following objective → sleep phase
  F'(θ, φ; x) = −log p(x) + KL( p_θ(z|x) || q_φ(z|x) )
  max_φ E_{p_θ(z,x)}[ log q_φ(z|x) ]
  • "Dreaming" up samples from p_θ(x|z) through a top-down pass
  • Use the samples as targets for updating the inference model
• (Recent approaches other than the sleep phase reduce the variance of the
gradient estimate: slides later)

[Figure courtesy: Maei’s slides] © Petuum,Inc. 104


Wake Sleep Algorithm
Wake sleep
• Parametrized inference model q_φ(z|x)
• Wake phase:
  • minimize KL( q_φ(z|x) || p_θ(z|x) ) w.r.t. θ
  • E_{q_φ(z|x)}[ ∇_θ log p_θ(x|z) ]
• Sleep phase:
  • minimize KL( p_θ(z|x) || q_φ(z|x) ) w.r.t. φ
  • E_{p_θ(z,x)}[ ∇_φ log q_φ(z|x) ]  (low variance)
  • Learning with generated samples of x
• Two objectives, not guaranteed to converge

Variational EM
• Variational distribution q_φ(z|x)
• Variational M step:
  • minimize KL( q_φ(z|x) || p_θ(z|x) ) w.r.t. θ
  • E_{q_φ(z|x)}[ ∇_θ log p_θ(x|z) ]
• Variational E step:
  • minimize KL( q_φ(z|x) || p_θ(z|x) ) w.r.t. φ
  • q_φ ∝ exp[ log p_θ ] if with closed form
  • ∇_φ E_{q_φ}[ log p_θ(z, x) ]  (needs variance reduction in practice)
  • Learning with real data x
• Single objective, guaranteed to converge
© Petuum,Inc. 105
Variational Autoencoders (VAEs)
• [Kingma & Welling, 2014]

• Use variational inference with an inference model


• Enjoy similar applicability with wake-sleep algorithm

• Generative model p_θ(x|z) and prior p(z)

  • Joint distribution p_θ(x, z) = p_θ(x|z) p(z)
  (Figure: q_φ(z|x), the inference model; p_θ(x|z), the generative model)
• Inference model q_φ(z|x)

Figure courtesy: Kingma & Welling, 2014

© Petuum,Inc. 106
Variational Autoencoders (VAEs)
• Variational lower bound
  ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) || p(z) )

• Optimize ℒ(θ, φ; x) w.r.t. θ of p_θ(x|z)
  • The same as the wake phase
• Optimize ℒ(θ, φ; x) w.r.t. φ of q_φ(z|x)
  ∇_φ ℒ(θ, φ; x) = ⋯ + ∇_φ E_{q_φ(z|x)}[ log p_θ(x|z) ] + ⋯
  • Use the reparameterization trick to reduce variance
  • Alternatives: use control variates as in reinforcement learning [Mnih &
Gregor, 2014; Paisley et al., 2012]

© Petuum,Inc. 107
Reparametrized gradient
• Optimize ℒ(θ, φ; x) w.r.t. φ of q_φ(z|x)
• Recap: gradient estimate with the log-derivative trick:
  ∇_φ E_{q_φ}[ log p_θ(x, z) ] = E_{q_φ}[ log p_θ(x, z) ∇_φ log q_φ ]
  • High variance: ∇_φ E_{q_φ}[ log p_θ ] ≈ (1/N) Σ_i log p_θ(x, z_i) ∇_φ log q_φ(z_i|x),  z_i ∼ q_φ(z|x)
  • The scale factor log p_θ(x, z_i) of the derivative ∇_φ log q_φ can have arbitrarily large
magnitude

• Gradient estimate with the reparameterization trick:

  z ∼ q_φ(z|x)  ⇔  z = g_φ(ε, x),  ε ∼ p(ε)
  ∇_φ E_{q_φ(z|x)}[ log p_θ(x, z) ] = E_{ε∼p(ε)}[ ∇_φ log p_θ( x, z_φ(ε) ) ]

• (Empirically) lower variance of the gradient estimate

  • E.g., z ∼ N( μ(x), σ(x) σ(x)ᵀ )  ⇔  ε ∼ N(0, I),  z = μ(x) + σ(x) ε
© Petuum,Inc. 108
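For comparison with the earlier score-function toy example, here is a minimal sketch of the reparameterized estimator on the same 1-D problem (q_φ = N(φ, 1), so z = φ + ε with ε ∼ N(0, 1)); the stand-in f(z) is again an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.5
f = lambda z: -(z - 2.0) ** 2             # stand-in for log p_theta(x, z)
df = lambda z: -2.0 * (z - 2.0)           # its derivative w.r.t. z

def reparam_grad(phi, n):
    eps = rng.normal(size=n)              # eps ~ p(eps)
    z = phi + eps                         # z = g_phi(eps)
    return np.mean(df(z) * 1.0)           # chain rule: dz/dphi = 1

# Noticeably tighter around the true gradient (3.0) than the score-function estimates.
print([reparam_grad(phi, 10) for _ in range(5)])
```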
VAEs: algorithm

[Kingma & Welling, 2014]

© Petuum,Inc. 109
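The algorithm box from the paper is an image in the slides and is not reproduced here; the following is only a minimal sketch of one Monte Carlo evaluation of the VAE objective ℒ(θ, φ; x) for an assumed Gaussian encoder and Bernoulli decoder, using the reparameterization trick (all shapes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dz, H = 20, 2, 16
x = (rng.random(Dx) < 0.5).astype(float)           # a toy binary datapoint

# encoder q_phi(z|x) = N(mu(x), diag(sigma(x)^2)); decoder p_theta(x|z) = Bernoulli(pi(z))
We, Wmu, Wsig = (rng.normal(size=(Dx, H)) * 0.1,
                 rng.normal(size=(H, Dz)) * 0.1,
                 rng.normal(size=(H, Dz)) * 0.1)
Wd1, Wd2 = rng.normal(size=(Dz, H)) * 0.1, rng.normal(size=(H, Dx)) * 0.1

def elbo(x):
    h = np.tanh(x @ We)
    mu, log_sig = h @ Wmu, h @ Wsig
    eps = rng.normal(size=Dz)
    z = mu + np.exp(log_sig) * eps                  # reparameterization
    logits = np.tanh(z @ Wd1) @ Wd2
    pi = 1.0 / (1.0 + np.exp(-logits))
    log_px_z = np.sum(x * np.log(pi + 1e-8) + (1 - x) * np.log(1 - pi + 1e-8))
    kl = 0.5 * np.sum(np.exp(2 * log_sig) + mu**2 - 1 - 2 * log_sig)  # KL(q || N(0, I))
    return log_px_z - kl

# In training, gradients of this value w.r.t. all weights are taken (e.g. by autodiff).
print(elbo(x))
```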
VAEs: example results
• VAEs tend to generate blurred images due to the mode-covering behavior (more later)
  (Figure: generated celebrity faces [Radford 2015])
• Latent code interpolation and sampling for sentences [Bowman et al., 2015]:
  input:   we looked out at the setting sun .
  mean:    they were laughing at the same time .
  samp. 1: ill see you in the early morning .
  samp. 2: i looked up at the blue sky .
  samp. 3: it was down on the dance floor .
  (Sentences generated by decoding from the mean of the posterior distribution and from three samples)
  Interpolation between two sentences:
  " i want to talk to you . "
  " i want to be with you . "
  " i do n't want to be with you . "
  i do n't want to be with you .
  she did n't want to be with him .
© Petuum,Inc. 110
Generative Adversarial Nets (GANs)
• [Goodfellow et al., 2014]
• Generative model x = G_θ(z), z ∼ p(z)
  • Maps the noise variable z to data space x
  • Defines an implicit distribution over x: p_{g_θ}(x)
    • a stochastic process to simulate data x
    • Intractable to evaluate likelihood
• Discriminator D_φ(x)
  • Outputs the probability that x came from the data rather than the generator
• No explicit inference model
• No obvious connection to previous models with inference networks like VAEs
• We will build formal connections between GANs and VAEs later

© Petuum,Inc. 111
Generative Adversarial Nets (GANs)
• Learning
  • A minimax game between the generator and the discriminator
  • Train D to maximize the probability of assigning the correct label to both
training examples and generated samples:
    max_D L_D = E_{x∼p_data(x)}[ log D(x) ] + E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
  • Train G to fool the discriminator:
    min_G L_G = E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
© Petuum,Inc. 112
[Figure courtesy: Kim's slides]
Generative Adversarial Nets (GANs)
• Learning
  • Train G to fool the discriminator
  • The original generator loss, min_G L_G = E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ],
suffers from vanishing gradients when D is too strong
  • Instead, the following non-saturating objective is used in practice:
    max_D L_D = E_{x∼p_data(x)}[ log D(x) ] + E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
    max_G L_G = E_{x=G(z), z∼p(z)}[ log D(x) ]
  • I.e., for learning the generator we maximize E_{x=G(z), z∼p(z)}[ log D(x) ],
as is usually done in practice (Goodfellow et al., 2014), rather than minimizing
E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
© Petuum,Inc. 113
[Figure courtesy: Kim's slides]
Generative Adversarial Nets (GANs)
• Learning
• Aim to achieve equilibrium of the game
• Optimal state:
  • p_g(x) = p_data(x)
  • D(x) = p_data(x) / ( p_data(x) + p_g(x) ) = 1/2

© Petuum,Inc. 114
[Figure courtesy: Kim’s slides]
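A minimal 1-D toy sketch (illustrative only, not the original GAN implementation) of the two objectives above: the discriminator ascends L_D while the generator ascends the non-saturating L_G; the data distribution, model forms, learning rates, and step counts are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# toy models: generator G_theta(z) = z + theta; discriminator D_phi(x) = sigmoid(phi0 + phi1*x)
theta = -3.0
phi = np.array([0.0, 1.0])

def sample_batch(n=128):
    x_real = rng.normal(2.0, 0.5, size=n)          # data distribution p_data
    x_fake = rng.normal(size=n) + theta            # x = G_theta(z), z ~ N(0, 1)
    return x_real, x_fake

for step in range(500):
    # ascend L_D = E[log D(x_real)] + E[log(1 - D(x_fake))]
    x_real, x_fake = sample_batch()
    d_real, d_fake = sigmoid(phi[0] + phi[1]*x_real), sigmoid(phi[0] + phi[1]*x_fake)
    grad_phi0 = np.mean(1 - d_real) + np.mean(-d_fake)
    grad_phi1 = np.mean((1 - d_real) * x_real) + np.mean(-d_fake * x_fake)
    phi += 0.05 * np.array([grad_phi0, grad_phi1])
    # ascend the non-saturating L_G = E[log D(G(z))]; d/dtheta = (1 - D) * phi1
    x_real, x_fake = sample_batch()
    d_fake = sigmoid(phi[0] + phi[1]*x_fake)
    theta += 0.05 * np.mean((1 - d_fake) * phi[1])

print(theta)   # the generator mean moves toward the data mean (about 2.0)
```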
GANs: example results

Generated bedrooms [Radford et al., 2016]


© Petuum,Inc. 115
Alchemy Vs Modern Chemistry

© Petuum,Inc. 116
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables

Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing,
"On Unifying Deep Generative Models", arXiv:1706.00550

© Petuum,Inc. 117
A unified view of deep generative models
• Literatures have viewed these DGM approaches as distinct
model training paradigms
• GANs: achieve an equilibrium between generator and discriminator
• VAEs: maximize lower bound of the data likelihood

• Let's study a new formulation for DGMs


• Connects GANs, VAEs, and other variants, under a unified view
• Links them back to inference and learning of Graphical Models, and the
wake-sleep heuristic that approximates this
• Provides a tool to analyze many GAN-/VAE-based algorithms
• Encourages mutual exchange of ideas from each individual class of
models

© Petuum,Inc. 118
Adversarial domain adaptation (ADA)
• Let’s start from ADA
• The application of adversarial approach on domain adaptation
• We then show GANs can be seen as a special case of ADA
• Correspondence of elements:

  Element: GANs / ADA
  • x: data/generation / features
  • z: code vector / data from source-target domains
  • y: real/fake indicator / source-target domain indicator
  (Figures: schematic diagrams of GANs and ADA)
© Petuum,Inc. 119
Adversarial domain adaptation (ADA)
• Data z from two domains indicated by y ∈ {0, 1}
  • Source domain (y = 1)
  • Target domain (y = 0)

• ADA transfers prediction knowledge learned from the
source domain to the target domain
  • Learn a feature extractor G_θ: x = G_θ(z)
  • Wants x to be indistinguishable by a domain discriminator
    D_φ(x)
• Application in classification
  • E.g., we have labels for the source domain data
  • Train a classifier over x of the source domain data to predict the
labels
  • x is domain invariant ⇒ x is predictive for the target domain
data
© Petuum,Inc. 120
ADA: conventional formulation
• Train D_φ to distinguish between domains, i.e., maximize the binary classification accuracy of recognizing the feature domains:
  max_φ L_φ = E_{x=G_θ(z), z∼p(z|y=1)}[ log D_φ(x) ] + E_{x=G_θ(z), z∼p(z|y=0)}[ log(1 − D_φ(x)) ]   (1)
• Train G_θ to fool D_φ; the feature extractor is then trained to fool the discriminator:
  max_θ L_θ = E_{x=G_θ(z), z∼p(z|y=1)}[ log(1 − D_φ(x)) ] + E_{x=G_θ(z), z∼p(z|y=0)}[ log D_φ(x) ]   (2)
• Here we omit the additional loss on θ that fits the features to the data-label pairs of the source domain (see the supplementary materials for details).
• With this background, we can frame a new interpretation of ADA: the data distribution p(z|y) and the deterministic transformation G_θ together form an implicit distribution over x, denoted p_θ(x|y), which is intractable to evaluate likelihood for but easy to sample from. Let p(y) be the prior distribution of the domain indicator y, e.g., a uniform distribution as in Eqs.(1)-(2). The discriminator defines a conditional distribution q_φ(y|x) = D_φ(x); let q^r_φ(y|x) = q_φ(1 − y|x) be the reversed distribution over domains. The objectives of ADA are then rewritten (up to a constant scale factor of 2) on the next slide. © Petuum,Inc. 121
ADA: new formulation
ure drawing a graphical model here for ADA.] let z be a data example either in the source
rget domain, and y 2 {0, 1} be the domain indicator with y = 0 indicating the target domain
y = 1 the source domain. The data distributions conditioning on the domain are then denoted
(z|y). Let• p(y) be the prior
To reveal the distribution
Figure connections (e.g.,
touniform)
2: One optimization conventional of the domain indicator.
variational
step of the parameter
The feature let’s✓rewrite
approaches,
✓ through Eq.(7) at point 0 . The posterior
ctor maps ztheto representations
objectives x = G (z)
q r (x|y)inisaaformat
mixture
✓ with
that parameters
of p✓0resembles ✓.
(x|y = 0) (blue)The data distributions
variational over z and
EM = 1) (red in the left panel) with the
and p✓0 (x|y
rministic transformationmixing
G✓ together
weightsform an implicit
induced from q distribution
r
(y|x). over x, denoted
Minimizing the KL as p✓ (x|y), of Eq.(7) w.r.t ✓ drives
divergence
• Implicit
h is intractable distribution
to evaluate likelihood over ç ∼tobsample
but easy g (ç|`)from:
0
p✓ (x|y = 0) towards the respective mixture q r (x|y = 0) (green), resulting in a new state where
ç = ùg | , | ∼ bnew |`
nforce domain invariance p✓newof (x|y
feature= 0)
x, = pg (x) gets is
a discriminator closer to pto
trained adversarially
✓0 (x|y data (x). Due to the asymmetry of
= 1) = pdistinguish
een the two• Discriminator
domains, which distribution
KL divergence,
defines h ì (`|ç)
a conditional
pnew
g (x) missed the smaller
distribution modewith
q (y|x) of the mixture q r (x|y
parameters , and= 0) which is a mode of
eature extractor is optimized ∏to

pdata `fool
(x). ç the
= hdiscriminator.
ì (1 − `|ç) Let q r
(y|x) = q (1 y|x) be the reversed
ibution over domains. The objectives of ADA are therefore given as:
• Rewrite the objective in the new form (up to constant scale factor)
maxtheLprior
where =E p(y) [logasq is(y|x)]
is uniform
p✓ (x|y)p(y) widely set, resulting in the constant scale factor 1/2. Note that
r unsaturated objective [16] which is(1)
here the generator is trained ⇥
using the ⇤ commonly used in practice.
max L = E
✓ ✓ p✓ (x|y)p(y) log q (y|x) ,
• | is encapsulated in the implicit distribution bg (ç|`)
e we omit the additional loss of ✓ to fit to the data label pairs of source domain (see supplements
more details). In conventional
max view,
L =the Epfirst equation minimizes
✓ (x|y=0)p(y=0)
[log q (y =the + Ep✓ (x|y=1)p(y=1)
discriminator
0|x)] binary cross
[log q (y = 1|x)]
opy with respect to discriminative parameter
1 , while the second trains the feature
1 extractor (6)
= Eto [log(1 D (x))] + Ex=G
✓. [Eric: [log D (x)]
aximize the cross entropy with respect the✓ (z),z⇠p(z|y=0)
transformation parameter 2 I think for
x=G ✓ (z),z⇠p(z|y=1)
2
containedness, it• would
(Ignorebe
thebetter
constanttoscale
explain
factorboth
1/2) of the cross-entropy notion above.] © Petuum,Inc. 122
ADA: new formulation

• New formulation:

    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
    max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]

• The only (but critical) difference between the objectives of φ and θ: q_φ vs. q_φ^r
• This is where the adversarial mechanism comes about
© Petuum,Inc. 123
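To make the two alternating objectives above concrete, here is a minimal PyTorch-style sketch of one ADA training iteration. The module names (feature_extractor, discriminator), the inputs src_z/tgt_z and the optimizers are hypothetical placeholders, not part of the original slides; the discriminator is assumed to end in a sigmoid.

    import torch
    import torch.nn.functional as F

    def ada_step(feature_extractor, discriminator, opt_g, opt_d, src_z, tgt_z):
        # Domain labels: y = 1 for source, y = 0 for target.
        x_src = feature_extractor(src_z)            # x = G_theta(z), z ~ p(z|y=1)
        x_tgt = feature_extractor(tgt_z)            # x = G_theta(z), z ~ p(z|y=0)

        # max_phi L_phi: train the discriminator q_phi(y|x) to tell the domains apart.
        d_src = discriminator(x_src.detach())
        d_tgt = discriminator(x_tgt.detach())
        loss_d = F.binary_cross_entropy(d_src, torch.ones_like(d_src)) + \
                 F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # max_theta L_theta: train the feature extractor against the reversed labels
        # (q_phi^r), i.e., make source and target features indistinguishable.
        d_src = discriminator(x_src)
        d_tgt = discriminator(x_tgt)
        loss_g = F.binary_cross_entropy(d_src, torch.zeros_like(d_src)) + \
                 F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()

The source-domain classification loss on θ, which the slides omit, would simply be added to loss_g.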
ADA vs. Variational EM

Variational EM:
• Objectives:
    max_φ L_{θ,φ} = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z))
    max_θ L_{θ,φ} = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z))
• A single objective for both θ and φ
• Extra prior regularization by p(z)

ADA:
• Objectives:
    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
    max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]
• Two objectives
• Has a global optimal state in the game-theoretic view
© Petuum,Inc. 124
ADA vs. Variational EM

Variational EM:
• The reconstruction term: maximize the conditional log-likelihood of x under the generative distribution p_θ(x|z), conditioning on the latent code z inferred by q_φ(z|x) (the lower bound is spelled out right after this slide)
• p_θ(x|z) is the generative model
• q_φ(z|x) is the inference model

ADA:
• The objectives: maximize the conditional log-likelihood of y (or 1 − y) under the distribution q_φ(y|x), conditioning on the latent feature x inferred by p_θ(x|y)
• Interpret q_φ(y|x) as the generative model
• Interpret p_θ(x|y) as the inference model
© Petuum,Inc. 125
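For reference, here is the standard variational EM lower bound that the left column abbreviates, written out in LaTeX; this only restates what the slide compares against.

    \begin{align*}
    \log p_\theta(x) &\ge \mathcal{L}_{\theta,\phi}(x)
      = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
        - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \\
    \text{(E-step)}\quad & \max_\phi \mathcal{L}_{\theta,\phi}(x), \qquad
    \text{(M-step)}\quad \max_\theta \mathcal{L}_{\theta,\phi}(x)
    \end{align*}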
ADA: graphical model

Define:
• Solid-line arrows (x → y):
  • generative process
• Dashed-line arrows (y, z → x):
  • inference
• Hollow arrows (z → x):
  • deterministic transformation
  • leading to implicit distributions
• Blue arrows (x → y):
  • adversarial mechanism
  • involves both q_φ(y|x) and q_φ^r(y|x)
© Petuum,Inc. 126
GANs: a variant of ADA
• Transfer the properties of source domain to target domain
• Source domain: e.g., real images, y = 1
• Target domain: e.g., generated images, y = 0

[Figure: graphical models of ADA (left) and GANs (right)]
© Petuum,Inc. 127


GANs: a variant of ADA

• Implicit distribution over x ~ p_θ(x|y):

    p_θ(x|y) = { p_{g_θ}(x)    y = 0   (distribution of generated images)
               { p_data(x)     y = 1   (distribution of real images)                (5)

• x ~ p_{g_θ}(x)  ⟺  x = G_θ(z), z ~ p(z|y = 0)
• x ~ p_data(x): the code space of z is degenerated; samples are drawn directly from the data
• The parameters θ are only associated with p_{g_θ}(x) of the generated sample domain; p_data(x) is constant
• As in ADA, the discriminator D_φ is trained to infer the probability that x comes from the real data domain: q_φ(y = 1|x) = D_φ(x)
© Petuum,Inc. 128
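As a concrete reading of Eq.(5), the following minimal NumPy-style sketch samples from the implicit distribution p_θ(x|y); the functions generator and sample_real_data are hypothetical stand-ins for G_θ and the empirical data distribution.

    import numpy as np

    def sample_p_theta(y, generator, sample_real_data, z_dim=64):
        """Draw one sample x ~ p_theta(x|y) as defined in Eq.(5)."""
        if y == 0:
            z = np.random.randn(z_dim)      # z ~ p(z|y=0), e.g., a standard Gaussian prior
            return generator(z)             # x = G_theta(z), i.e., x ~ p_{g_theta}(x)
        else:
            return sample_real_data()       # degenerated code space: x ~ p_data(x)

    # With a uniform prior p(y), a joint draw from p_theta(x|y) p(y) would be:
    # y = np.random.randint(2); x = sample_p_theta(y, generator, sample_real_data)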
GANs: new formulation

• Again, rewrite the GAN objectives in the "variational-EM" format
• Recap: conventional formulation (with the commonly used non-saturating generator objective):

    max_φ L_φ = E_{x=G_θ(z), z~p(z|y=0)}[log(1 − D_φ(x))] + E_{x~p_data(x)}[log D_φ(x)]
    max_θ L_θ = E_{x=G_θ(z), z~p(z|y=0)}[log D_φ(x)]

• Rewrite in the new form:

    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
    max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]

• Exactly the same form as ADA
• The same correspondence to variational EM
© Petuum,Inc. 129
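The conventional formulation above corresponds to the usual alternating updates. Here is a minimal PyTorch-style sketch of one such step; generator, discriminator (ending in a sigmoid), the optimizers and z_dim are hypothetical, not from the slides.

    import torch
    import torch.nn.functional as F

    def gan_step(generator, discriminator, opt_g, opt_d, real_x, z_dim=100):
        z = torch.randn(real_x.size(0), z_dim)
        fake_x = generator(z)                                  # x ~ p_{g_theta}(x)

        # max_phi L_phi: discriminator assigns y=1 to real data, y=0 to samples.
        d_real = discriminator(real_x)
        d_fake = discriminator(fake_x.detach())
        loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # max_theta L_theta (non-saturating): maximize log D_phi(G_theta(z)).
        d_fake = discriminator(fake_x)
        loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()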
GANs vs. Variational EM

Variational EM:
• Objectives:
    max_φ L_{θ,φ} = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z))
    max_θ L_{θ,φ} = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z))
• A single objective for both θ and φ
• Extra prior regularization by p(z)
• The reconstruction term: maximize the conditional log-likelihood of x under the generative distribution p_θ(x|z), conditioning on the latent code z inferred by q_φ(z|x)
• p_θ(x|z) is the generative model; q_φ(z|x) is the inference model

GAN:
• Objectives:
    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
    max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]
• Two objectives
• Has a global optimal state in the game-theoretic view
• The objectives: maximize the conditional log-likelihood of y (or 1 − y) under q_φ(y|x), conditioning on the data/generation x inferred by p_θ(x|y)
• Interpret q_φ(y|x) as the generative model; interpret p_θ(x|y) as the inference model
© Petuum,Inc. 130
GANs vs. Variational EM

• Interpret x as latent variables
• Interpret the generation of x as performing inference over the latents

(Same comparison as on the previous slide:)
• Variational EM: p_θ(x|z) is the generative model, q_φ(z|x) is the inference model
• GAN: interpret q_φ(y|x) as the generative model, and p_θ(x|y) as the inference model
© Petuum,Inc. 131
GANs: minimizing KLD

• As in variational EM, we can further rewrite the objective in the form of minimizing a KLD, to reveal more insights into the optimization problem
• For each optimization step of p_θ(x|y) at the point (θ = θ_0, φ = φ_0), let:
  • p(y): the uniform prior distribution
  • p_{θ_0}(x) = E_{p(y)}[p_{θ_0}(x|y)]
  • q^r(x|y) ∝ q_{φ_0}^r(y|x) p_{θ_0}(x)
• Lemma 1: the updates of θ at θ_0 satisfy

    ∇_θ E_{p_θ(x|y)p(y)}[log q_{φ_0}^r(y|x)] |_{θ=θ_0}
      = −∇_θ [ E_{p(y)}[KL(p_θ(x|y) || q^r(x|y))] − JSD(p_θ(x|y=0) || p_θ(x|y=1)) ] |_{θ=θ_0}      (7)

• KL: Kullback-Leibler divergence
• JSD: Jensen-Shannon divergence
© Petuum,Inc. 132

Proof of Lemma 1

Proof. First,

    E_{p_θ(x|y)p(y)}[log q_{φ_0}^r(y|x)]
      = −E_{p(y)}[ KL(p_θ(x|y) || q^r(x|y)) − KL(p_θ(x|y) || p_{θ_0}(x)) ],                         (3)

where

    E_{p(y)}[KL(p_θ(x|y) || p_{θ_0}(x))]
      = p(y=0) · KL( p_θ(x|y=0) || (p_{θ_0}(x|y=0) + p_{θ_0}(x|y=1))/2 )
      + p(y=1) · KL( p_θ(x|y=1) || (p_{θ_0}(x|y=0) + p_{θ_0}(x|y=1))/2 ).                           (4)

Note that p_θ(x|y=0) = p_{g_θ}(x) and p_θ(x|y=1) = p_data(x). Let p_{M_θ} = (p_{g_θ} + p_data)/2. Eq.(4) can then be simplified as:

    E_{p(y)}[KL(p_θ(x|y) || p_{θ_0}(x))] = 1/2 · KL(p_{g_θ} || p_{M_{θ_0}}) + 1/2 · KL(p_data || p_{M_{θ_0}}).   (5)
© Petuum,Inc. 133
Proof of Lemma 1 (cont.)

On the other hand,

    JSD(p_{g_θ} || p_data)
      = 1/2 · E_{p_{g_θ}}[log (p_{g_θ}/p_{M_θ})] + 1/2 · E_{p_data}[log (p_data/p_{M_θ})]
      = 1/2 · E_{p_{g_θ}}[log (p_{g_θ}/p_{M_{θ_0}})] + 1/2 · E_{p_data}[log (p_data/p_{M_{θ_0}})] + E_{p_{M_θ}}[log (p_{M_{θ_0}}/p_{M_θ})]
      = 1/2 · KL(p_{g_θ} || p_{M_{θ_0}}) + 1/2 · KL(p_data || p_{M_{θ_0}}) − KL(p_{M_θ} || p_{M_{θ_0}}).          (6)

Note that

    ∇_θ KL(p_{M_θ} || p_{M_{θ_0}}) |_{θ=θ_0} = 0.                                                    (7)

Taking derivatives of Eq.(5) w.r.t. θ at θ_0, we get

    ∇_θ E_{p(y)}[KL(p_θ(x|y) || p_{θ_0}(x))] |_{θ=θ_0}
      = ∇_θ ( 1/2 · KL(p_{g_θ} || p_{M_{θ_0}}) + 1/2 · KL(p_data || p_{M_{θ_0}}) ) |_{θ=θ_0}
      = ∇_θ JSD(p_{g_θ} || p_data) |_{θ=θ_0}.                                                        (8)

Taking derivatives of both sides of Eq.(3) w.r.t. θ at θ_0 and plugging in the last equation of Eq.(8),
we obtain the desired result. © Petuum,Inc. 134
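Eq.(6) rests on the identity JSD(p || q) = 1/2·KL(p || m) + 1/2·KL(q || m) − KL((p+q)/2 || m), which holds for any reference distribution m (taking m = (p+q)/2 recovers the usual JSD definition). A quick NumPy check on random discrete distributions, offered only as a sanity test of the algebra:

    import numpy as np

    rng = np.random.default_rng(0)
    normalize = lambda v: v / v.sum()
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))

    p_g    = normalize(rng.random(10))   # plays the role of p_{g_theta}
    p_data = normalize(rng.random(10))   # plays the role of p_data
    m_ref  = normalize(rng.random(10))   # plays the role of p_{M_theta_0}
    m      = 0.5 * (p_g + p_data)        # p_{M_theta}

    jsd = 0.5 * kl(p_g, m) + 0.5 * kl(p_data, m)
    rhs = 0.5 * kl(p_g, m_ref) + 0.5 * kl(p_data, m_ref) - kl(m, m_ref)
    print(np.isclose(jsd, rhs))   # True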
GANs: minimizing KLD

• Lemma 1: the updates of θ at θ_0 satisfy

    ∇_θ E_{p_θ(x|y)p(y)}[log q_{φ_0}^r(y|x)] |_{θ=θ_0}
      = −∇_θ [ E_{p(y)}[KL(p_θ(x|y) || q^r(x|y))] − JSD(p_θ(x|y=0) || p_θ(x|y=1)) ] |_{θ=θ_0}

• Connection to variational inference
  • See x as latent variables, y as visible
  • p_{θ_0}(x): the prior distribution
  • q^r(x|y) ∝ q_{φ_0}^r(y|x) p_{θ_0}(x): the posterior distribution
  • p_θ(x|y): the variational distribution
  • Amortized inference: updates the model parameter θ
  • Suggests relations to VAEs, as we will explore shortly
© Petuum,Inc. 135

GANs: minimizing KLD

• Minimizing the KLD drives p_{g_θ}(x) towards p_data(x)
  • By definition: p_{θ_0}(x) = E_{p(y)}[p_{θ_0}(x|y)] = ( p_{g_{θ_0}}(x) + p_data(x) ) / 2
  • KL(p_θ(x|y=1) || q^r(x|y=1)) = KL(p_data(x) || q^r(x|y=1)): constant, no free parameters
  • KL(p_θ(x|y=0) || q^r(x|y=0)) = KL(p_{g_θ}(x) || q^r(x|y=0)): the term with the parameter θ to optimize
  • q^r(x|y=0) ∝ q_{φ_0}^r(y=0|x) p_{θ_0}(x)
    • can be seen as a mixture of p_{g_{θ_0}}(x) and p_data(x)
    • with mixing weights induced from q_{φ_0}^r(y=0|x)
  • Minimizing the KLD drives p_{g_θ}(x) to this mixture of p_{g_{θ_0}}(x) and p_data(x)
    ⇒ drives p_{g_θ}(x) towards p_data(x)
© Petuum,Inc. 136

GANs: minimizing KLD

[Figure: one optimization step of θ at θ_0. Left: p_{θ_0}(x|y=1) = p_data(x) and p_{θ_0}(x|y=0) = p_{g_{θ_0}}(x); the mixture q^r(x|y=0); right: the updated p_{θ_new}(x|y=0) = p_{g_{θ_new}}(x), which moves closer to p_data(x)]

• Minimizing the KLD drives p_{g_θ}(x) towards p_data(x) (same decomposition as on the previous slide)
© Petuum,Inc. 137
GANs: minimizing KLD

[Figure: the update of p_{g_θ}(x|y=0) towards the mixture q^r(x|y=0) can miss the smaller mode of the mixture, which is a mode of p_data(x) ("missed mode")]

• Missing-mode phenomenon of GANs
  • Asymmetry of the KLD:

      KL( p_{g_θ}(x) || q^r(x|y=0) ) = ∫ p_{g_θ}(x) log [ p_{g_θ}(x) / q^r(x|y=0) ] dx

  • Concentrates p_θ(x|y=0) on large modes of q^r(x|y=0)
  • Regions of the x space where q^r(x|y=0) is small make a large positive contribution to the KLD, unless p_{g_θ}(x) is also small there
    ⇒ p_{g_θ}(x) tends to avoid regions where q^r(x|y=0) is small
    ⇒ p_{g_θ}(x) misses modes of p_data(x)
  • Symmetry of the JSD: does not affect the mode-missing behavior
© Petuum,Inc. 138
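The mode-missing argument above is the familiar asymmetry of the KL divergence. A small self-contained NumPy illustration (not from the slides): fit a single Gaussian to a bimodal target by scanning its mean, and compare which mean each direction of the KL prefers.

    import numpy as np

    x = np.linspace(-8, 8, 2000)
    dx = x[1] - x[0]
    gauss = lambda mu, sigma: np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    # Bimodal "data" distribution with two well-separated modes.
    p = 0.5 * gauss(-3.0, 0.5) + 0.5 * gauss(3.0, 0.5)
    kl = lambda a, b: np.sum(a * np.log(a / b)) * dx

    means = np.linspace(-5, 5, 201)
    rev = [kl(gauss(m, 1.0), p) for m in means]   # KL(q || p): mode-seeking, like the GAN-side KLD
    fwd = [kl(p, gauss(m, 1.0)) for m in means]   # KL(p || q): mode-covering, like the VAE-side KLD

    print("reverse-KL best mean:", means[int(np.argmin(rev))])  # lands on one mode (around +/-3)
    print("forward-KL best mean:", means[int(np.argmin(fwd))])  # lands between the modes (around 0)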
GANs: minimizing KLD

• No assumption of an optimal discriminator q_{φ_0}(y|x)
  • Previous results usually rely on a (near-)optimal discriminator:
      q_φ*(y=1|x) = p_data(x) / ( p_data(x) + p_{g_θ}(x) )
  • The optimality assumption is impractical: limited expressiveness of D_φ [Arora et al., 2017]
• Our result is a generalization of the previous theorem [Arjovsky & Bottou, 2017]
  • Plug the optimal discriminator into the equation above, and we recover the theorem:

      ∇_θ E_{p_θ(x|y)p(y)}[log q_{φ_0}^r(y|x)] |_{θ=θ_0}
        = −∇_θ [ 1/2 · KL(p_{g_θ} || p_data) − JSD(p_{g_θ} || p_data) ] |_{θ=θ_0}       (8)

  • The theorem explains the generator training dynamics and the missing-mode issue only when the discriminator meets the optimality criteria; the generalized result covers broader situations
© Petuum,Inc. 139

GANs: minimizing KLD

In summary:
• Reveals the connection to variational inference
  • Builds connections to VAEs (slides soon)
  • Inspires new model variants based on the connections
• Offers insights into generator training
  • A formal explanation of the missing-mode behavior of GANs
  • The results still hold when the discriminator does not achieve its optimum at each iteration
© Petuum,Inc. 140

Variant of GAN: InfoGAN

• GANs don't offer the functionality of inferring the code z given data x
• InfoGAN [Chen et al., 2016]
  • Introduce an inference model Q(z|x) (i.e., q_η(z|x)) with parameters η
  • Augment the objectives of GANs by additionally reconstructing z:

      max_D L_D = E_{x~p_data(x)}[log D(x)] + E_{x~G(z), z~p(z)}[log(1 − D(x))]
      max_{G,Q} L_{G,Q} = E_{x~G(z), z~p(z)}[log D(x) + log Q(z|x)]                     (16)

[Figure: graphical models of GANs (left) and InfoGAN (right)]
© Petuum,Inc. 141
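A minimal PyTorch-style sketch of the generator/Q update in Eq.(16), assuming a continuous code with a unit-variance Gaussian Q(z|x) so that log Q(z|x) reduces to a squared error up to a constant; generator, discriminator, q_net and the optimizer are hypothetical modules, not from the slides. The discriminator is trained exactly as in the earlier GAN sketch.

    import torch

    def infogan_gq_step(generator, discriminator, q_net, opt_gq, batch_size=64, z_dim=16):
        z = torch.randn(batch_size, z_dim)        # code to be reconstructed, z ~ p(z)
        x = generator(z)
        d = discriminator(x)                       # probability of "real", D(x)
        z_hat = q_net(x)                           # mean of Q(z|x), unit variance assumed

        log_d = torch.log(d + 1e-8).mean()                     # E[log D(x)]
        log_q = -0.5 * ((z - z_hat) ** 2).sum(dim=1).mean()    # E[log Q(z|x)] up to a constant

        loss = -(log_d + log_q)                    # maximize L_{G,Q} = E[log D(x) + log Q(z|x)]
        opt_gq.zero_grad(); loss.backward(); opt_gq.step()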
InfoGAN: new formulation

• Introduce an extra conditional q_η(z|x, y) for reconstructing the code z
• q_η(z|x, y=1) is fixed, without free parameters to learn
  • as GANs assume the code space of real data is degenerated
  • the parameters η are only associated with q_η(z|x, y=0)
• Rewrite in the new form, performing full reconstruction of both z and y:

    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_η(z|x, y) q_φ(y|x)]
    max_{θ,η} L_{θ,η} = E_{p_θ(x|y)p(y)}[log q_η(z|x, y) q_φ^r(y|x)]

• The ground-truth z to reconstruct is sampled from the prior p(z|y) and encapsulated in the implicit distribution p_θ(x|y)
© Petuum,Inc. 142
GANs vs InfoGAN

GANs:
    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
    max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]

InfoGAN:
    max_φ L_φ = E_{p_θ(x|y)p(y)}[log q_η(z|x, y) q_φ(y|x)]
    max_{θ,η} L_{θ,η} = E_{p_θ(x|y)p(y)}[log q_η(z|x, y) q_φ^r(y|x)]

• InfoGAN additionally recovers (part of) the latent code z given example x
© Petuum,Inc. 143
InfoGAN: new formulation

• Similar results as in GANs hold:
  • Let q^r(x|z, y) ∝ q_{η_0}(z|x, y) q_{φ_0}^r(y|x) p_{θ_0}(x)
  • We have:

    ∇_θ E_{p_θ(x|y)p(y)}[log q_{η_0}(z|x, y) q_{φ_0}^r(y|x)] |_{θ=θ_0}
      = −∇_θ [ E_{p(y)}[KL(p_θ(x|y) || q^r(x|z, y))] − JSD(p_θ(x|y=0) || p_θ(x|y=1)) ] |_{θ=θ_0}

• Next we show correspondences between GANs/InfoGAN and VAEs
© Petuum,Inc. 144
Relates VAEs with GANs
• Resemblance of GAN generator learning to variational inference
  • Suggests strong relations between VAEs [Kingma & Welling, 2013] and GANs
• Indeed, VAEs basically minimize a KLD in the opposite direction, with a degenerated adversarial discriminator

[Figure: InfoGAN vs. VAEs: VAEs swap the generation (solid-line) and inference (dashed-line) processes of InfoGAN, and the discriminator is degenerated]
© Petuum,Inc. 145
Recap: conventional formulation of VAEs

• Objective:

    max_{θ,η} L^{vae}_{θ,η} = E_{p_data(x)}[ E_{q̃_η(z|x)}[log p̃_θ(x|z)] − KL(q̃_η(z|x) || p̃(z)) ]

  • p̃(z): prior over z
  • p̃_θ(x|z): generative model
  • q̃_η(z|x): inference model
• Only uses real examples from p_data(x), and lacks an adversarial mechanism
• To align with GANs, let's introduce the real/fake indicator y and a (perfect) adversarial discriminator
© Petuum,Inc. 146
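For concreteness, a minimal PyTorch-style sketch of the objective above with a Gaussian q̃_η(z|x), a standard-normal prior and a Bernoulli decoder; encoder, decoder and the optimizer are hypothetical modules, not from the slides, and x is assumed to be a [batch, features] tensor of values in [0, 1].

    import torch
    import torch.nn.functional as F

    def vae_step(encoder, decoder, opt, x):
        mu, logvar = encoder(x)                        # parameters of q_eta(z|x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        x_logits = decoder(z)                          # logits of the Bernoulli p_theta(x|z)

        recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction='none').sum(dim=1)
        kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(dim=1)   # KL(q || N(0, I))
        elbo = (recon - kl).mean()

        loss = -elbo                                   # maximize the ELBO
        opt.zero_grad(); loss.backward(); opt.step()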
VAEs: new formulation

• Assume a perfect discriminator q_*(y|x):
  • q_*(y=1|x) = 1 if x is a real example
  • q_*(y=0|x) = 1 if x is a generated sample
  • q_*^r(y|x) := q_*(1 − y|x), the reversed distribution
• Generative distribution (accounting for the uncertainty of generating x given z):

    p_θ(x|z, y) = { p_θ(x|z)    y = 0
                  { p_data(x)   y = 1

• Lemma 2: Let p_θ(z, y|x) ∝ p_θ(x|z, y) p(z|y) p(y). Then

    L^{vae}_{θ,η} = 2 · E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[log p_θ(x|z, y)] − KL(q_η(z|x,y) q_*^r(y|x) || p(z|y)p(y)) ]
                  = −2 · E_{p_{θ_0}(x)}[ KL( q_η(z|x,y) q_*^r(y|x) || p_θ(z, y|x) ) ]

• Most components have exact correspondences (and the same definitions) in GANs and InfoGAN
• The generative module p_θ(x|z, y) and the inference module q_η(z|x, y) q^r(y|x) are placed in opposite directions, with inverted hidden/visible treatments of (z, y) and x
© Petuum,Inc. 147
Lemma 2: sketch of proof

• Lemma 2:

    L^{vae}_{θ,η} = 2 · E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[log p_θ(x|z, y)] − KL(q_η(z|x,y) q_*^r(y|x) || p(z|y)p(y)) ]
                  = −2 · E_{p_{θ_0}(x)}[ KL( q_η(z|x,y) q_*^r(y|x) || p_θ(z, y|x) ) ]

• Proof sketch:
  1) Expand E_{p_{θ_0}(x)}[ · ] = 1/2 · E_{p_{θ_0}(x|y=1)}[ · ] + 1/2 · E_{p_{θ_0}(x|y=0)}[ · ]
  2) The term 1/2 · E_{p_{θ_0}(x|y=0)}[ · ] is constant
     • due to the perfect discriminator q_*^r(y|x)
     • it blocks out generated samples from the training loss
  3) The term 1/2 · E_{p_{θ_0}(x|y=1)}[ · ] = 1/2 · E_{p_data(x)}[ · ]
     • recovers the conventional formulation
© Petuum,Inc. 148
Proof of Lemma 2

Proof. For the reconstruction term:

    E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[log p_θ(x|z, y)] ]
      = 1/2 · E_{p_{θ_0}(x|y=1)}[ E_{q_η(z|x,y=0), y=0~q_*^r(y|x)}[log p_θ(x|z, y=0)] ]
      + 1/2 · E_{p_{θ_0}(x|y=0)}[ E_{q_η(z|x,y=1), y=1~q_*^r(y|x)}[log p_θ(x|z, y=1)] ]
      = 1/2 · E_{p_data(x)}[ E_{q̃_η(z|x)}[log p̃_θ(x|z)] ] + const,                                    (25)

where y=0 ~ q_*^r(y|x) means that q_*^r(y|x) predicts y=0 with probability 1. Note that both q_η(z|x, y=1) and p_θ(x|z, y=1) are constant distributions without free parameters to learn; q_η(z|x, y=0) = q̃_η(z|x), and p_θ(x|z, y=0) = p̃_θ(x|z).

For the KL prior regularization term:

    E_{p_{θ_0}(x)}[ KL(q_η(z|x,y) q_*^r(y|x) || p(z|y)p(y)) ]
      = E_{p_{θ_0}(x)}[ ∫ q_*^r(y|x) KL(q_η(z|x,y) || p(z|y)) dy + KL(q_*^r(y|x) || p(y)) ]
      = 1/2 · E_{p_{θ_0}(x|y=1)}[ KL(q_η(z|x,y=0) || p(z|y=0)) + const ] + 1/2 · E_{p_{θ_0}(x|y=1)}[const]    (26)
      = 1/2 · E_{p_data(x)}[ KL(q̃_η(z|x) || p̃(z)) ].

Combining Eq.(25) and Eq.(26), we recover the conventional VAE objective in Eq.(7) of the paper.
© Petuum,Inc. 149
GANs vs VAEs side by side

                           GANs (InfoGAN)                                   VAEs
Generative         p_θ(x|y) = { p_{g_θ}(x)   y = 0              p_θ(x|z, y) = { p_θ(x|z)    y = 0
distribution                  { p_data(x)    y = 1                            { p_data(x)   y = 1

Discriminator      q_φ(y|x)                                     q_*(y|x), perfect, degenerated
distribution

z-inference        q_η(z|x, y)  (InfoGAN)                       q_η(z|x, y)
model

KLD to             min_θ KL( p_θ(x|y) || q^r(x|z, y) )          min_θ KL( q_η(z|x,y) q_*^r(y|x) || p_θ(z, y|x) )
minimize           ~ min_θ KL( P_θ || Q )                       ~ min_θ KL( Q || P_θ )
© Petuum,Inc. 150
Link back to wake sleep algorithm

• The classic wake-sleep (WS) algorithm was proposed for learning deep generative models such as Helmholtz machines
• Denote:
  • latent variables h
  • general parameters λ
• Recap: the wake-sleep algorithm:

    Wake:  max_θ E_{q_λ(h|x) p_data(x)}[log p_θ(x|h)]
    Sleep: max_λ E_{p_θ(x|h) p(h)}[log q_λ(h|x)]
© Petuum,Inc. 151
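A minimal PyTorch-style sketch of the two phases above, using a continuous Gaussian latent h for simplicity (classic wake-sleep uses stochastic binary units); gen_net, inf_net, their sample/log_prob methods and the optimizers are hypothetical, not from the slides.

    import torch

    def wake_phase(gen_net, inf_net, opt_theta, x):
        # Wake: max_theta E_{q_lambda(h|x) p_data(x)}[log p_theta(x|h)]
        with torch.no_grad():
            h = inf_net.sample(x)              # h ~ q_lambda(h|x), treated as fixed targets
        loss = -gen_net.log_prob(x, h).mean()
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()

    def sleep_phase(gen_net, inf_net, opt_lambda, batch_size=64):
        # Sleep: max_lambda E_{p_theta(x|h) p(h)}[log q_lambda(h|x)]
        with torch.no_grad():
            h = torch.randn(batch_size, gen_net.h_dim)   # h ~ p(h), here a standard Gaussian
            x_dream = gen_net.sample(h)                  # "dreamed" data x ~ p_theta(x|h)
        loss = -inf_net.log_prob(h, x_dream).mean()
        opt_lambda.zero_grad(); loss.backward(); opt_lambda.step()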
VAEs vs. Wake-sleep

• Recap: the wake-sleep algorithm:

    Wake:  max_θ E_{q_λ(h|x) p_data(x)}[log p_θ(x|h)]
    Sleep: max_λ E_{p_θ(x|h) p(h)}[log q_λ(h|x)]

• Let h be z, and λ be η
  ⇒ the wake phase becomes max_θ E_{q_η(z|x) p_data(x)}[log p_θ(x|z)], which recovers the VAE objective of optimizing θ
• VAEs extend the wake phase by also learning the inference model (η):

    max_{θ,η} L^{vae}_{θ,η} = E_{q_η(z|x) p_data(x)}[log p_θ(x|z)] − E_{p_data(x)}[KL(q_η(z|x) || p(z))]

  • i.e., minimize the KLD in the original variational free energy w.r.t. both θ and η, with additional prior regularization on the latents z
• VAEs stick to minimizing the wake-phase KLD w.r.t. both θ and η
  • they do not involve the sleep-phase objective
  • recall: the sleep phase minimizes the reversed KLD in the variational free energy
© Petuum,Inc. 152
GANs vs. Wake-sleep

• Recap: the wake-sleep algorithm:

    Wake:  max_θ E_{q_λ(h|x) p_data(x)}[log p_θ(x|h)]
    Sleep: max_λ E_{p_θ(x|h) p(h)}[log q_λ(h|x)]

• Let h be y, and λ be φ
  ⇒ the sleep phase becomes max_φ E_{p_θ(x|y) p(y)}[log q_φ(y|x)], which recovers the GAN objective of optimizing φ
• GANs extend the sleep phase by also learning the generative model (θ)
  • Directly extending the sleep phase:  max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ(y|x)]
  • GANs:                                max_θ L_θ = E_{p_θ(x|y)p(y)}[log q_φ^r(y|x)]
  • The only difference: replacing q_φ(y|x) with q_φ^r(y|x)
  • This is where the adversarial mechanism comes about!
• GANs stick to minimizing the sleep-phase KLD
  • they do not involve the wake-phase objective
© Petuum,Inc. 153
Mutual exchanges of ideas: augment the loss functions

                    GANs (InfoGAN)                                 VAEs
KLD to minimize     min_θ KL( p_θ(x|y) || q^r(x|z, y) )            min_θ KL( q_η(z|x,y) q_*^r(y|x) || p_θ(z, y|x) )
                    ~ min_θ KL( P_θ || Q )                         ~ min_θ KL( Q || P_θ )

• The asymmetry of the KLDs inspires combinations of GANs and VAEs
  • GANs: min_θ KL(P_θ || Q) tends to miss modes
  • VAEs: min_θ KL(Q || P_θ) tends to cover regions with small values of p_data
• Augment VAEs with the GAN loss [Larsen et al., 2016]
  • Alleviates the mode-covering issue of VAEs
  • Improves the sharpness of VAE-generated images
• Augment GANs with the VAE loss [Che et al., 2017]
  • Alleviates the mode-missing issue of GANs

[Figure courtesy: PRML. Panels (a): mode covering; (b), (c): mode missing]
© Petuum,Inc. 154
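As a sketch of how such a combination can look in practice (loosely in the spirit of VAE/GAN hybrids; the exact weighting and where the GAN term is applied are assumptions, not from the slides), the per-batch generator-side loss simply mixes the two objectives; the discriminator itself is trained exactly as in the earlier GAN sketch.

    import torch
    import torch.nn.functional as F

    def vae_gan_generator_loss(x, x_recon_logits, mu, logvar, d_on_recon, gan_weight=1.0):
        """VAE ELBO terms plus a non-saturating GAN term on the reconstructions."""
        recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction='sum') / x.size(0)
        kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum() / x.size(0)
        gan = F.binary_cross_entropy(d_on_recon, torch.ones_like(d_on_recon))  # fool the discriminator
        return recon + kl + gan_weight * gan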
Mutual exchanges of ideas: augment the graphical model

                                GANs (InfoGAN)          VAEs
Discriminator distribution      q_φ(y|x)                q_*(y|x), perfect, degenerated

• Activate the adversarial mechanism in VAEs
  • Enables adaptive incorporation of fake samples for learning
  • Straightforward derivation by making a symbolic analog to GANs

[Figure: vanilla VAEs → Adversary Activated VAEs]
© Petuum,Inc. 156
Adversary Activated VAEs (AAVAE)
• Vanilla VAEs include a degenerated adversarial discriminator q_*(y|x), which blocks out
  generated samples from contributing to model learning:

    max_{θ,η} L^{vae}_{θ,η} = E_{p_θ0(x)} [ E_{q_η(z|x,y) q_*^r(y|x)} [log p_θ(x|z,y)]
                              − KL( q_η(z|x,y) q_*^r(y|x) || p(z|y)p(y) ) ]

• We enable adaptive incorporation of fake samples by activating the adversarial mechanism
  • Derivations are straightforward by making symbolic analog to GANs
• Replace the perfect discriminator q_*(y|x) with the learnable discriminator network q_φ(y|x)
  with parameters φ, as in GANs
• As usual, denote the reversed distribution q_φ^r(y|x) = q_φ(1−y|x); the adapted objective becomes

    max_{θ,η} L^{aavae}_{θ,η} = E_{p_θ0(x)} [ E_{q_η(z|x,y) q_φ^r(y|x)} [log p_θ(x|z,y)]
                                − KL( q_η(z|x,y) q_φ^r(y|x) || p(z|y)p(y) ) ]

© Petuum,Inc. 157
AAVAE: adaptive data selection
• An effective data selection mechanism:
  • Both generated samples and real examples are weighted by q_φ^r(y = 0|x) = q_φ(y = 1|x)
  • Only samples that resemble real data and fool the discriminator will be used for training
  • A real example receiving large weight q_φ(y = 1|x)
    ⇒ Easily recognized by the discriminator as real
    ⇒ Hard to be simulated from the generator
    ⇒ Hard examples get larger weights
© Petuum,Inc. 158
AAVAE: discriminator learning
• The discriminator q_φ(y|x) is trained with the binary classification objective, as in GANs:

    max_φ L^{aavae}_φ = E_{p_θ(x|z,y) p(z|y) p(y)} [ log q_φ(y|x) ]

© Petuum,Inc. 159
AAVAE: empirical results
• Applied the adversary activating method on
• vanilla VAEs
• class-conditional VAEs (CVAE)
• semi-supervised VAEs (SVAE)

© Petuum,Inc. 160
AAVAE: empirical results
• Evaluated test-set variational lower bound on MNIST
  • The higher the better
• X-axis: the ratio of training data used for learning (0.01, 0.1, 1.)
• Y-axis: value of the test-set lower bound
• Solid lines: base models; dashed lines: adversary activated models
  • Left: VAE vs. AA-VAE. Middle: CVAE vs. AA-CVAE. Right: SVAE vs. AA-SVAE
© Petuum,Inc. 161
AAVAE: empirical results
• Evaluated classification accuracy of semi-supervised VAEs (SVAE) and the adversary activated
  extension (AA-SVAE) on the MNIST test set
• Used 1% and 10% of the MNIST labels for training

              1%                10%
  SVAE        0.9412±.0039      0.9768±.0009
  AASVAE      0.9425±.0045      0.9797±.0010

© Petuum,Inc. 162
Mutual exchanges of ideas
• AAVAE enhances VAEs with ideas from GANs
• We can also enhance GANs with ideas from VAEs
• VAEs maximize a variational lower bound of log likelihood
• Importance weighted VAE (IWAE) [Burda et al., 2016]
• Maximizes a tighter lower bound through importance sampling
• The variational inference interpretation of GANs allows the
importance weighting method to be straightforwardly applied
to GANs
• Just copy the derivations of IWAE side by side with few adaptations!

© Petuum,Inc. 163
Importance weighted GANs (IWGAN)
• Generator learning in vanilla GANs

    max_θ E_{x ∼ p_θ(x|y)p(y)} [ log q^r_{φ0}(y|x) ]

• Generator learning in IWGAN

    max_θ E_{x_1,...,x_k ∼ p_θ(x|y)p(y)} [ Σ_{i=1}^k ( q^r_{φ0}(y|x_i) / q_{φ0}(y|x_i) ) log q^r_{φ0}(y|x_i) ]

• As in GANs, only y = 0 (i.e., generated samples) is effective for learning the parameters θ
• Assigns higher weights to samples that are more realistic and fool the discriminator better
  • Consistent with IWAE, which emphasizes more on code states providing better reconstructions
• In practice, the k samples correspond to a minibatch of samples in the standard GAN update
  • The only extra cost is evaluating the weight for each sample, which is generally negligible
  • The discriminator is trained in the same way as in the standard GANs
Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 164
IWGAN: empirical results
• Applied the importance weighting method to
• vanilla GANs
• class-conditional GANs (CGAN)
• CGAN adds one dimension to the code z to represent the class label
• The derivations of the IW extension remain the same as in vanilla GANs

© Petuum,Inc. 165
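To make the importance weighting concrete, below is a minimal NumPy sketch (not from the original slides or paper code) of how the per-sample weights could be computed for a minibatch of generated samples; discriminator_prob_real is a hypothetical callable returning q(y=1|x), i.e., the discriminator's probability that x is real:

import numpy as np

def iw_generator_objective(x_fake, discriminator_prob_real):
    # Importance-weighted generator objective over generated samples (y = 0).
    # Weight: q^r(y=0|x) / q(y=0|x) = q(y=1|x) / q(y=0|x); loss term: log q^r(y=0|x) = log q(y=1|x).
    d = discriminator_prob_real(x_fake)      # q(y=1|x) for each sample, shape (k,)
    eps = 1e-8
    w = d / (1.0 - d + eps)                  # importance weights, treated as constants
    return float(np.mean(w * np.log(d + eps)))   # to be maximized w.r.t. generator parameters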
IWGAN: empirical results
• Evaluated on MNIST and SVHN
• Used a pretrained NN to evaluate:
  • Inception scores of samples from GAN and IW-GAN
    • Confidence of a pre-trained classifier on generated samples + diversity of generated samples
  • Classification accuracy of samples from CGAN and IW-CGAN

  Inception scores          MNIST         SVHN
  GAN                       8.34±.03      5.18±.03
  IWGAN                     8.45±.04      5.34±.03

  Classification accuracy   MNIST         SVHN
  CGAN                      0.985±.002    0.797±.005
  IWCGAN                    0.987±.002    0.798±.006

© Petuum,Inc. 166
Recap: Variational Inference
Maximize the variational lower bound ℒ(θ, η; x), or equivalently,
minimize the free energy

    F(θ, η; x) = −log p(x) + KL( q_η(z|x) || p_θ(z|x) )

• E-step: maximize ℒ wrt. η with θ fixed
    max_η ℒ(θ, η; x) = E_{q_η(z|x)} [log p_θ(x|z)] − KL( q_η(z|x) || p(z) )
  • If with closed-form solution:
    q_η(z|x) ∝ exp[ log p_θ(x, z) ]
• M-step: maximize ℒ wrt. θ with η fixed
    max_θ ℒ(θ, η; x) = E_{q_η(z|x)} [log p_θ(x|z)] − KL( q_η(z|x) || p(z) )

© Petuum,Inc. 167
Discussion: Modeling latent vs. visible variables

• Latent and visible variables are traditionally distinguished


clearly and modeled in very different ways

• A key thought in the new formulation:


• Not necessary to make clear boundary between latent and visible
variables,
• And between inference and generation
• Instead treat them as a symmetric pair

© Petuum,Inc. 168
Symmetric modeling of latent & visible variables

• Help with modeling and understanding:


• Treating the generation space x in GANs as latent
  • reveals the connection between GANs and ADA
  • provides a variational inference interpretation of generation
Inference on features Treat generation of X
as performing
inference

ADA GANs © Petuum,Inc. 169


Symmetric modeling of latent & visible variables

• Help with modeling and understanding:


• Treating the generation space x in GANs as latent
  • reveals the connection between GANs and ADA
  • provides a variational inference interpretation of generation
• Wake sleep algorithm
• wake phase reconstructs visible variables based on latents
• sleep phase reconstructs latent variables based on visibles
• latent and visible variables are treated in a completely symmetric
way
Wake:  max_θ E_{q_η(z|x)} [ log p_θ(x, z) ]
Sleep: max_η E_{p_θ(x, z)} [ log q_η(z|x) ]                                © Petuum,Inc. 170


Symmetric modeling of latent & visible variables

• New modeling approaches narrow the gap

Empirical distributions over visible variables
• Impossible to be an explicit distribution
  • The only information we have is the observed data examples
  • Do not know the true parametric form of the data distribution
• Naturally an implicit distribution
  • Easy to sample from, hard to evaluate likelihood

Prior distributions over latent variables
• Traditionally defined as explicit distributions, e.g., Gaussian prior distribution
  • Amenable for likelihood evaluation
  • We can assume the parametric form according to our prior knowledge
• New tools allow implicit priors and models
  • GANs, density ratio estimation, approximate Bayesian computations
  • E.g., adversarial autoencoder [Makhzani et al., 2015] replaces the Gaussian prior of vanilla
    VAEs with implicit priors
© Petuum,Inc. 171
Symmetric modeling of latent & visible variables

• No difference in terms of formulations
  • with implicit distributions and black-box NN models
  • just swap the symbols x and z

  Generation model:  z ∼ p_prior(z)  →  x ∼ f_transform(z)
  Inference model:   x ∼ p_data(x)   →  z ∼ f′_transform(x)

[Figure 5] Symmetric view of generation and inference: there is little difference between the two
processes in terms of formulation — with implicit distribution modeling, both only need to perform
simulation through black-box neural transformations between the latent and visible spaces
© Petuum,Inc. 172
Symmetric modeling of latent & visible variables

• No difference in terms of formulations
  • with implicit distributions and black-box NN models
• Difference in terms of space complexity
  • depends on the problem at hand
  • choose appropriate tools:
    • implicit/explicit distribution, adversarial/maximum-likelihood optimization, …

[Figure: the symmetric pair of objectives — maximum likelihood loss vs. adversarial loss —
connecting the generation and inference models between the prior distr. and the data distr.]
© Petuum,Inc. 173
Part-II: Conclusions Z Hu, Z YANG, R Salakhutdinov, E Xing,
“On Unifying Deep Generative Models”, arxiv 1706.00550

• Deep generative models research has a long history
  • Deep belief nets / Helmholtz machines / Predictability Minimization / …
• Unification of deep generative models
• GANs and VAEs are essentially minimizing KLD in opposite directions
• Extends two phases of classic wake sleep algorithm, respectively
• A general formulation framework useful for
• Analyzing broad class of existing DGM and variants: ADA/InfoGAN/Joint-models/…
• Inspiring new models and algorithms by borrowing ideas across research fields
• Symmetric view of latent/visible variables
• No difference in formulation with implicit prior distributions and black-box NN
transformations
• Difference in space complexity: choose appropriate tools

© Petuum,Inc. 174
Plan
• Statistical And Algorithmic Foundation and Insight of Deep
Learning

• On Unified Framework of Deep Generative Models

• Computational Mechanisms: Distributed Deep Learning


Architectures
© Petuum,Inc. 175
Part-III (1)
Inference and Learning

© Petuum,Inc. 176
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries

© Petuum,Inc. 177
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries

© Petuum,Inc. 178
A Computational Layer in DL
• A layer in a neural network is composed of a few finer
  computational operations
  • A layer l has input x and output z, and transforms x into z following:
    y = Wx + b,   z = ReLU(y)
  • Denote the transformation of layer l as f_l, which can be represented
    as a dataflow graph: the input x flows through the layer

    x  →  [ f_l ]  →  z

© Petuum,Inc. 179
From Layers to Networks
• A neural network is thus a few stacked layers l = 1, … , L, where
  every layer represents a function transform f_l
• The forward computation proceeds by sequentially executing
  f_1, f_2, f_3, … , f_L

    f_1  →  f_2  →  f_3  →  …  →  f_L

• Training the neural network involves deriving the gradient of its
  parameters with a backward pass (next slides)

© Petuum,Inc. 180
A Computational Layer in DL
• Denote the backward pass through a layer l as b_l
• b_l derives the gradient of the input x (dx), given the gradient of the output z
  (dz), as well as the gradients of the parameters W, b
• dx will be the backward input of its previous layer l − 1
• The backward pass can be thought of as a backward dataflow where the
  gradients flow through the layer

    dx  ←  [ b_l ]  ←  dz

© Petuum,Inc. 181
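As a concrete (hypothetical) illustration of one layer's forward transform f_l and backward pass b_l, a minimal NumPy sketch might look as follows; the function names are illustrative and not tied to any particular toolkit:

import numpy as np

def forward(x, W, b):
    # Forward pass of one layer: y = W x + b, z = ReLU(y)
    y = W @ x + b
    z = np.maximum(y, 0.0)            # ReLU activation
    return z, (x, y)                  # cache what the backward pass needs

def backward(dz, W, cache):
    # Backward pass: given dL/dz, produce dL/dx and the parameter gradients
    x, y = cache
    dy = dz * (y > 0)                 # gradient through ReLU
    dW = np.outer(dy, x)              # dL/dW
    db = dy                           # dL/db
    dx = W.T @ dy                     # dL/dx, becomes the backward input of layer l-1
    return dx, dW, db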
Backpropagation through a NN
• The backward computation proceeds by sequentially
  executing b_L, b_{L−1}, b_{L−2}, … , b_1

    b_1  ←  b_2  ←  …  ←  b_L

© Petuum,Inc. 182
A Layer as a Dataflow Graph
• Given the forward computation flow, gradients can be computed
by auto differentiation
• Automatically derive the backward gradient flow graph from the forward
dataflow graph

Photo from TensorFlow website

© Petuum,Inc. 183
A Network as a Dataflow Graph
• Gradients can be computed by auto differentiation
• Automatically derive the gradient flow graph from the forward dataflow
graph

É$ É& É‘

M$ M& M‘

Photo from TensorFlow website

© Petuum,Inc. 184
Gradient Descent via Backpropagation
• The computational workflow of deep learning
• Forward, which we usually also call inference: forward dataflow
• Backward, which derives the gradients: backward gradient flow
• Apply/update gradients and repeat
Backward
• Mathematically,

Model parameters Forward Data

© Petuum,Inc. 185
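A minimal sketch of this forward/backward/update loop, reusing the per-layer forward() and backward() functions sketched earlier (all names are illustrative; a real toolkit would derive the backward pass automatically):

import numpy as np   # assumes forward() and backward() from the earlier layer sketch

def train_step(x, target, layers, lr=0.01):
    # One iteration: forward dataflow, backward gradient flow, then apply updates
    caches = []
    for layer in layers:                                           # f_1, ..., f_L
        x, cache = forward(x, layer["W"], layer["b"])
        caches.append(cache)
    dz = x - target                                                # gradient of 0.5*||output - target||^2
    for layer, cache in zip(reversed(layers), reversed(caches)):   # b_L, ..., b_1
        dz, dW, db = backward(dz, layer["W"], cache)
        layer["W"] -= lr * dW                                      # apply/update gradients
        layer["b"] -= lr * db
    return 0.5 * float(np.sum((x - target) ** 2))                  # loss at this iteration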
Program a neural network
• Define a neural network
• Define operations and layers: fully-connected? Convolution? Recurrent?
• Define the data I/O: read what data from where?
• Define a loss function/optimization objective: L2 loss? Softmax?
Ranking Loss?
• Define an optimization algorithm: SGD? Momentum SGD? etc
• Auto-differential Libraries will then take over
• Connect operations, data I/O, loss functions and trainer.
• Build forward dataflow graph and backward gradient flow graphs.
• Perform training and apply updates

© Petuum,Inc. 186
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries

© Petuum,Inc. 187
Auto-differential Libraries
• Auto-differential Library automatically derives the gradients following the back-
propagation rule.
• A lot of auto-differentiation libraries have been developed:
• So-called Deep Learning toolkits

© Petuum,Inc. 188
Deep Learning Toolkits
• They are adopted differently in different domains
• For example

Vision
NLP

© Petuum,Inc. 189
Deep Learning Toolkits
• They are also designed differently
• Symbolic v.s. imperative programming

Imperative Symbolic
© Petuum,Inc. 190
Deep Learning Toolkits
• Symbolic vs. imperative programming
• Symbolic: write symbols to assemble the networks first, evaluate later
• Imperative: immediate evaluation

Symbolic Imperative

© Petuum,Inc. 191
Deep Learning Toolkits
• Symbolic
• Good
• easy to optimize (e.g. distributed, batching, parallelization) for developers
• More efficient
• Bad
• The way of programming might be counter-intuitive
• Hard to debug for user programs
• Less flexible: you need to write symbols before actually doing anything
• Imperative:
• Good
• More flexible: write one line, evaluate one line
• Easy to program and easy to debug: because it matches the way we use C++ or python
• Bad
• Less efficient
• More difficult to optimize

© Petuum,Inc. 192
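A toy illustration (hypothetical, not taken from any specific toolkit) of the difference: the imperative style evaluates each line immediately, while the symbolic style first assembles an expression graph and only evaluates it when explicitly asked:

import numpy as np

# Imperative: every line is executed immediately, values are available right away
a = np.ones(4)
b = a * 2                 # b is a concrete array at this point
c = b + 1

# Symbolic: first build a graph of deferred operations, then evaluate it
class Node:
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs
    def eval(self):
        vals = [x.eval() if isinstance(x, Node) else x for x in self.inputs]
        return self.op(*vals)

a_sym = Node(lambda: np.ones(4), [])
b_sym = Node(lambda a: a * 2, [a_sym])
c_sym = Node(lambda b: b + 1, [b_sym])    # nothing computed yet
result = c_sym.eval()                     # evaluation happens here, after the graph is defined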
Deep Learning Toolkits
• They are also designed differently
• For another example, dataflow graphs v.s. layer-by-layer construction

Layer-by-layer Dataflow graphs


construction © Petuum,Inc. 193
Good and Bad of Dataflow Graphs

• Dataflow graphs seems to be a dominant choice for representing


deep learning models
• What’s good for dataflow graphs
• Good for static workflows: define once, run for arbitrary batches/data
• Programming convenience: easy to program once you get used to it.
• Easy to parallelize/batching for a fixed graph
• Easy to optimize: a lot of off-the-shelf optimization techniques for graph
• What‘s bad for dataflow graphs
• Not good for dynamic workflows: need to define a graph for every training sample -
> overheads
• Hard to program dynamic neural networks: how can you define dynamic graphs
using a language for static graphs? (e.g. LSTM, tree-LSTM).
• Not easy for debugging.
• Difficult to parallelize/batching across multiple graphs: every graph is different, no
natural batching.

© Petuum,Inc. 194
Static vs. Dynamic Dataflow Graphs

• Static Dataflow graphs


• Define once, execute many times
• For example: convolutional neural networks
• Execution: Once defined, all following computation will follow the
defined computation
• Advantages
• No extra effort for batching optimization, because it can be by nature batched
• It is always easy to handle a static computational dataflow graphs in all aspects,
because of its fixed structure
• Node placement, distributed runtime, memory management, etc.
• Benefit the developers

© Petuum,Inc. 195
Static vs. Dynamic Dataflow Graphs

• Dynamic Dataflow graphs


• When do we need?
• In all cases that static dataflow graphs do not work well
• Variably sized inputs
• Variably structured inputs
• Nontrivial inference algorithms
• Variably structured outputs
• Etc.

© Petuum,Inc. 196
Static vs. Dynamic Dataflow Graphs

• Can we handle dynamic dataflow graphs? Using static


methods (or declaration) will have a lot of problems
• Difficulty in expressing complex flow-control logic
• Complexity of the computation graph implementation
• Difficulty in debugging

© Petuum,Inc. 197
Introducing DyNet
• Designed for dynamic deep learning workflow, e.g.
• Tree-LSTM for neural machine translation, where each sentence defines a structure that
corresponds to the computational flow
• Graph-LSTM for image parsing, where each image has a specific connection between
segments
• etc.

© Petuum,Inc. 198
Key Ingredients in DyNet
• Concept
• Separate parameter declaration and graph construction
• Declare trainable parameters and construct models first
• Parameters, e.g. the weight matrices in an LSTM unit.
• Construct a model as a collection of trainable parameters
• Construct computation graphs
• Allocate a few nodes for our computation (node can be seen as layers in NN)
• Specify the dataflow graph by connecting nodes together
• If necessary, different graphs for different input samples
• Conclusion: Define parameter once, but define graphs dynamically depending on inputs

© Petuum,Inc. 199
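The concept can be sketched in plain Python (illustrative only; this does not use the actual DyNet API): parameters are created once, while a fresh computation is built for every input, e.g., recursively over a tree whose structure differs per example:

import numpy as np

# Declare trainable parameters once (a single composition function shared by all trees)
params = {"W": 0.01 * np.random.randn(64, 128), "b": np.zeros(64)}

def build_and_run(tree, params):
    # Build the computation for ONE input tree: a different structure (graph) per example,
    # but always the same parameters.  `tree.is_leaf`, `tree.left`, `tree.right`,
    # `tree.embedding` are assumed attributes of a hypothetical tree data structure.
    if tree.is_leaf:
        return tree.embedding                               # a 64-dim leaf vector
    left = build_and_run(tree.left, params)
    right = build_and_run(tree.right, params)
    h = np.concatenate([left, right])                       # 128-dim child representation
    return np.maximum(params["W"] @ h + params["b"], 0.0)   # compose with shared parameters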
Key Ingredients in DyNet
• Backend and programing model
• Graph construction
• In TensorFlow, constructing a graph has a considerable overhead.
• TensorFlow users avoid defining graphs repeatedly
• DyNet: highly optimized graph definition
• Little overhead defining a graph: good for dynamic neural networks.
• Easy to write recursive programs to define graphs (very effective for many
dynamic networks, such as tree-LSTM or graph-LSTM).

© Petuum,Inc. 200
Key Ingredients in DyNet
• A visual comparison

DyNet TreeLSTM (30 LoC) TensorFlow TreeLSTM (200 LoC)

© Petuum,Inc. 201
Part-III (2)
Distributed Deep Learning

© Petuum,Inc. 202
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory

© Petuum,Inc. 203
Review – DL toolkits on single machine
• Using GPU is a must
• A small number of GPU-equipped machines could achieve satisfactory
speedup compared to CPU clusters with thousands of cores

More readily
• A cluster of 8 GPU-equipped machines
available to
• A cluster of 2000 CPU cores researchers

© Petuum,Inc. 204
Review – DL toolkits on single machine
• However, using a single GPU is far from sufficient
• average-sized deep networks can take days to train on a single GPU when
faced with 100s of GBs to TBs of data
• Demand faster training of neural networks on ever-larger datasets

AlexNet, 5 – 7 days GoogLeNet, 10+ days

• However, current distributed DL implementations (e.g. in TensorFlow) can


scale poorly due to substantial parameter synchronization over the network
(we will show later)

© Petuum,Inc. 205
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory

© Petuum,Inc. 206
Challenges
• Communication challenges
• GPUs are at least one order of magnitude faster than CPUs

  [Diagram] GPUs are faster + limited network bandwidth → high communication load, bursty
  communication → communication bottleneck → low GPU utilization → poor scalability with
  additional machines

• High communication load raises the network communication as the main bottleneck
given limited bandwidth of commodity Ethernet
• Managing the computation and communication in a distributed GPU cluster often
complicates the algorithm design

© Petuum,Inc. 207
Let’s see what causes the problem
• Deep Learning on a single node – an iterative-convergent
formulation
Backward

Model parameters Forward Data

Apply gradients

Zhang et al., 2017 © Petuum,Inc. 208


Let’s see what causes the problem
• Deep Learning on a single node – an iterative-convergent
formulation
Backward

Forward Data

Forward and backward are the main computation (99%) workload of deep
learning programs.
© Petuum,Inc. 209
Distributed Deep Learning
• Distributed DL: parallelize DL training using multiple machines.
• i.e. we want to accelerate the heaviest workload (in the box) to
multiple machines Backward

Forward Data

Forward and backward are the main computation (99%) workload of deep
learning programs.
© Petuum,Inc. 210
Data parallelism with stochastic gradient
descent
• We usually seek a parallelization strategy called data parallelism, based
on SGD
• We partition data into different parts
• Let different machines compute the gradient updates on different data partitions
• Then aggregate/sync.

Data Worker 1 Worker 2 Data

Sync

(one or more
machines)
Data Data
Worker 3 Worker 4

© Petuum,Inc. 211
Data Parallel SGD
• Data parallel stochastic gradient descent
• Data-parallelism requires every worker to have read and write
  access to the shared model parameters θ, which causes
  communication among the P workers:

    θ_t = θ_{t−1} + ε Σ_{p=1}^P ∇ℓ(θ_{t−1}, D_p)

  • ∇ℓ(θ_{t−1}, D_p): computed locally on each worker p over its data partition D_p
  • The per-worker updates are collected and aggregated before application, which is
    where communication is required
Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 212
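A minimal single-process simulation of this data-parallel update (illustrative only; in a real system the per-worker gradients are computed on different machines):

import numpy as np

def data_parallel_step(theta, partitions, grad_fn, eps=0.01):
    # One data-parallel SGD iteration over P workers.
    # grad_fn(theta, D_p) is the update direction computed locally on worker p's partition D_p.
    updates = [grad_fn(theta, D_p) for D_p in partitions]   # happens locally on each worker
    aggregated = np.sum(updates, axis=0)                    # collect and aggregate (communication)
    return theta + eps * aggregated                         # apply to the shared parameters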
How to communicate
• Parameter server, e.g. Bosen, SSP
• A parameter server (PS) is a shared memory system that provides
  shared access to the global model parameters θ
• Deep learning can be trivially data-parallelized over distributed
  workers using PS by 3 steps:
  • Each worker computes the gradients (∇ℓ) on its own data partition
    (D_p) and sends them to the remote servers;
  • servers receive the updates and apply (+) them to the globally shared
    parameters;
  • Each worker pulls back the updated parameters θ

Ho et al., 2013, Wei et al. 2015 © Petuum,Inc. 213
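The three steps can be sketched as the following worker-side loop (hypothetical interface; ps.pull and ps.push stand in for whatever RPC layer a parameter server exposes):

def worker_loop(ps, data_partition, grad_fn, num_iters):
    # Data-parallel training of one worker against a parameter server (PS).
    for t in range(num_iters):
        theta = ps.pull()                        # fetch the current globally shared parameters
        update = grad_fn(theta, data_partition)  # compute gradients on the local data partition
        ps.push(update)                          # send them; the server applies (+) them globally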


How PS works

[Diagram] Each of the 4 workers pushes its gradients Δθ_p to the PS and pulls back the updated
parameters θ

Ho et al., 2013, Wei et al. 2015, Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 214
Parameter Server
• Parameter server has been successful for CPU-based deep
learning
• Google Distbelief, Dean et al. 2012
• Scale up to thousands of CPU machines and 16000 CPU cores
• SSPTable, Ho et al, 2013
• Stale-synchronous parallel consistency model
• Microsoft Adam, Chilimbi et al. 2014
• 63 machines, state-of-art results on ImageNet 22K
• Bosen, Wei et al. 2015
• Managed communication

© Petuum,Inc. 215
Parameter Server on GPUs
• Directly applying parameter server for GPU-based distributed deep
learning will underperform (as will show later).
• GPU is too fast
• Ethernet bandwidth is limited, and has latency
• For example
• AlexNet: 61.5M float parameters, 0.25s/iteration on Geforce Titan X
(batchsize = 256)
• Gradient generation rate: 240M float/(s*GPU)
• Parallelize it over 8 machines each w/ one GPU using PS.
• To ensure the computation not blocked on GPU (i.e. linear speed-up with
additional nodes)
• As a worker: send 240M floats/s and pull back 240M floats/s (at least)
• As a server: receive 240M * 8 floats/s and send back 240M * 8/s (at least)

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 216


Parameter Server on GPUs
• Let’s see where we are
This is what the GPU Ethernet standards
workstation in you lab has

One of the most expensive instances


AWS could provide you (18$/h?)

Specialized hardware! Non-


commodity anymore, inaffordable

© Petuum,Inc. 217
Parameter Server on GPUs
The problem is more severe than described above
• We only use 8 nodes (which is small). How about 32,128, or even 256?
• We haven’t considered other issues (which might be also
troublesome), e.g.
• Memory copy between DRAM and GPU will have a non-trivial cost
• The Ethernet might be shared with other tasks, i.e. available bandwidth is even
less.
• Burst communication happens very often on GPUs (which will explain later).

© Petuum,Inc. 218
Address the Communication Bottleneck
• A simple fact:
• Communication time may be reduced, but cannot be eliminated (of
course)
• Therefore, possible ideas to address the communication
bottleneck
• Hide the communication time by overlapping it with the computation
time
• Reduce the size of messages needed to be communications

© Petuum,Inc. 219
Address the Communication Bottleneck
• A simple fact:
• Communication time may be reduced, but cannot be eliminated (of
course).
• Therefore, possible ideas to address the communication
bottleneck
• Hide the communication time by overlapping it with the
computation time
• Reduce the size of messages needed to be communications

© Petuum,Inc. 220
Overlap Computation and Communication
• Revisit on a single node the computation flow of BP
  • b_l : backpropagation computation through layer l
  • C_t : forward and backward computation at iteration t

  [Timeline] b_1 b_2 … b_L inside each iteration;  C_t  C_{t+1}  C_{t+2}  ⋯  over time t

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 221


Overlap Computation and Communication
• On multiple nodes, when communication is involved
  • Introduce two communication operations
    • s_l : send out the gradients in layer l to the remote
    • r_l : pull back the globally shared parameters of layer l from the remote
  • O_t : the set {s_l}_{l=1}^L at iteration t
  • I_t : the set {r_l}_{l=1}^L at iteration t

  [Timeline] C_t  O_t  I_t  C_{t+1}  O_{t+1}  I_{t+1}  ⋯
  Computation and communication happen sequentially!

Zhang et al., 2015, Zhang et al. 2017                                      © Petuum,Inc. 222
Overlap Computation and Communication
• Note the following independence
  • The send-out operation s_l is independent of the backward operations
  • The read-in operation r_l could update the layer parameters as long as
    b_l has finished, without blocking the subsequent backward operations
    b_m (m < l)
• Idea: overlap computation and communication by utilizing
  concurrency
  • Pipelining the updates and computation operations

© Petuum,Inc. 223
WFBP: Wait-free backpropagation
• Idea: overlap computation and communication by utilizing concurrency
• Pipelining the updates and computation operations


  [Before] {s_l}_{l=1}^L and {r_l}_{l=1}^L are issued only after the whole backward pass b_1 … b_L

  reschedule ⇒

  [After]  s_l and r_l are issued per layer: as soon as b_l finishes, s_1 … s_L and r_1 … r_L run
  concurrently with the remaining backward computation

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 224
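A sketch of the rescheduling idea using Python threads (illustrative only; layer.backward, layer.set_params, send_grads, and pull_params are assumed interfaces, and a real system like Poseidon would use dedicated communication streams/threads rather than this simplified scheme):

import threading

def backward_with_wfbp(layers, dz, send_grads, pull_params):
    # Backward pass that issues per-layer communication as soon as each b_l finishes,
    # so s_l / r_l overlap with the remaining backward computation b_{l-1}, ..., b_1.
    pending = []
    for layer in reversed(layers):                              # b_L, ..., b_1
        dz, dW, db = layer.backward(dz)
        t = threading.Thread(target=send_grads, args=(layer.name, dW, db))
        t.start()                                               # s_l starts immediately
        pending.append((layer, t))
    for layer, t in pending:                                    # r_l: read back updated parameters
        t.join()
        layer.set_params(pull_params(layer.name))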


WFBP: Wait-free backpropagation
• Idea: overlap computation and communication by utilizing
concurrency
• Communication overhead is hidden under computation
• Results: more computations in unit time
  [Before] C_t  O_t  I_t  C_{t+1}  O_{t+1}  I_{t+1}   (sequential)

  pipelining ⇒

  [After]  C_t  C_{t+1}  C_{t+2}  C_{t+3}
           O_t  O_{t+1}  O_{t+2}  O_{t+3}
           I_t  I_{t+1}  I_{t+2}  I_{t+3}     (communication overlapped with computation, over time t)
Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 225
WFBP: Distributed Wait-free backpropagation
• How does WFBP perform?
• Using Caffe as an engine:

50% comms
bottleneck
reduction
Zhang et al. 2017

• Using TensorFlow as engine:

© Petuum,Inc. 226
WFBP: Distributed Wait-free backpropagation
• Observation: Why DWBP would be effective
• More statistics of modern CNNs

Params/FLOP distribution of modern CNNs

• 90% computation happens at bottom layers


• 90% communication happens at top layers
• WFBP overlaps 90% and 90%

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 227


WFBP: Wait-free Backpropagation
• Does overlapping communication and computation solve all the
problems?
• When communication time is longer than computation time, no (see the figure below)
• Say, if communication and computation are perfectly overlapped, how much scalability can
  we achieve?

  [Figure] Single node: back-to-back iterations C_t ; Distributed: each C_t is followed by O_t / I_t
  that outlast the computation, leaving a gap

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 228


Address the communication bottleneck
• Note a simple fact:
• Communication time may be reduced, but cannot be eliminated (of
course).
• Therefore, possible ideas to address the communication
bottleneck
• Hide the communication time by overlapping it with the computation
time – which we have described before.
• Reduce the size of messages needed to be communications
• While without compromising statistical convergence

© Petuum,Inc. 229
Introducing Sufficient Factor Broadcasting
• Matrix-parametrized models
Multiclass Logistic
Regression Distance Metric Learning
Feature dim. Feature dim.

#classes Latent dim.

Sparse Coding Neural Network


Feature dim. × Dictionary size          #neurons in layer l − 1  ×  #neurons in layer l

© Petuum,Inc. 230
Distributed Learning of MPMs
• Learning MPMs by communicating parameter matrices between server
and workers
• Dean and Ghemawat, 2008; Dean et al, 2012; Sindhwani and Ghoting, 2012; Gopal
and Yang, 2013; Chilimbi et al, 2014, Li et al, 2015
• High communication cost and large synchronization delays
Multiclass Logistic
Regression Neural Network (AlexNet)

Feature dim. = 20K #neurons in layer


fc6=4096
#neurons in
26G #classes=325K 200M layer fc7
=4096

© Petuum,Inc. 231
Contents:
Sufficient Factor (SF) Updates
Full parameter matrix update ΔW can be computed as outer product of two
vectors uvT (called sufficient factors)
• Example: Primal stochastic gradient descent (SGD)
    min_W  (1/N) Σ_{i=1}^N  f(W a_i ; b_i) + h(W)

    ΔW = u v^T,   u = ∂f(W a_i, b_i) / ∂(W a_i),   v = a_i

• Example: Stochastic dual coordinate ascent (SDCA)

    min_Z  (1/N) Σ_{i=1}^N  f_i^*(−z_i) + h^*( Z A^T / N )

    ΔW = u v^T,   u = Δz_i,   v = a_i

Send lightweight SF updates (u,v), instead of expensive full-matrix ΔW updates!

© Petuum,Inc. 232
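A small numeric illustration of why SF updates are cheaper to communicate, assuming a J×D parameter matrix and the SGD case from the slide (sizes chosen for illustration only):

import numpy as np

J, D = 4096, 20000                      # e.g., a J x D parameter matrix (#classes x feature dim)
a_i = np.random.randn(D)                # one training sample's features
u = np.random.randn(J)                  # u = d f(W a_i, b_i) / d (W a_i), from the loss gradient
v = a_i                                 # v = a_i

delta_W = np.outer(u, v)                # full-matrix update: J*D floats (~82M values here)
sf_floats = u.size + v.size             # sufficient factors: only J + D floats (~24K values)

# The receiver reconstructs exactly the same update locally from (u, v):
assert np.allclose(delta_W, np.outer(u, v))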
Sufficient Factor Broadcasting:
P2P Topology + SF Updates

Xie et al., 2015 © Petuum,Inc. 233


A computing & communication tradeoff
Training examples
• Full update:
Individual update
matrices

Aggregated
update matrix Sum

• Pre-update
Training
examples

Sufficient vectors: (u_1, v_1), (u_2, v_2), (u_3, v_3), (u_4, v_4) — cannot be aggregated
• Stochastic algorithms
  • Mini-batch: C samples
  • Matrix representation: O(J·D)        SV representation: O((J + D)·C)
© Petuum,Inc. 234
Synchronization of Parameter Replicas
parameter server Transfer SVs instead of ΔW

• A Cost Comparison
                      Size of one message    Number of messages    Network Traffic
P2P SV-Transfer       O(J + D)               O(P²)                 O((J + D)·P²)
Parameter Server      O(J·D)                 O(P)                  O(J·D·P)


© Petuum,Inc. 235
Convergence Speedup

Multiclass Logistic Regression (MLR) Distance Metric Learning (DML) Sparse Coding (SC)

• 3 Benchmark ML Programs
• Big parameter matrices with 6.5-8.6b entries (30+GB), running on 12- & 28-
machine clusters
• 28-machine SFB finished in 2-7 hours
• Up to 5.6x faster than 28-machine PS, 12.3x faster than 28-machine Spark
• PS cannot support SF communication, which requires decentralized
storage

Xie et al., 2015 © Petuum,Inc. 236


Convergence Guarantee
• Assumptions
• Bridging model
• Staleness Synchronous Parallel (SSP) with staleness
parameter „
• Bulk Synchronous Parallel is a special case of SSP when
„=0
• Communication methods
• Partial broadcast (PB): sending messages to a subset of
E (E < ! − 1) machines
• Full broadcast is a special case of PB when E = ! − 1
• Additional assumptions

© Petuum,Inc. 237
Convergence Guarantee
• Results

© Petuum,Inc. 238
Convergence Guarantee
• Take-home message:
• Under full broadcasting, given a properly-chosen
learning rate, all local worker parameters –õŒ
eventually converge to stationary points (i.e. local
minima) of the objective function, despite the fact
that SV transmission can be delayed by up to „
iterations.
• Under partial broadcasting, the algorithm
converges to a ‹(Qù(! − E)) neighbourhood if
Ÿ ⟶ ∞.

© Petuum,Inc. 239
Parameter Storage and
Communication Paradigms
Centralized Storage Decentralized Storage

Server Worker

Send change Send change Send change


ΔW Send W itself ΔW ΔW

Worker Worker

• Centralized: send parameter W itself from server to worker


• Advantage: allows compact comms topology, e.g. bipartite
• Decentralized: always send changes ΔW between workers
• Advantage: more robust, homogeneous code, low communication (?)
© Petuum,Inc. 240
Topologies:
Master-Slave versus P2P?

Master-slave P2P
• Used with centralized storage paradigm • Used with decentralized storage
• Disadvantage: need to code/manage clients • Disadvantage (?): high comms volume for
and servers separately large # of workers
• Advantage: bipartite topology is comms- • Advantage: same code for all workers; no
efficient single point of failure, high elasticity to
• Popular for Parameter Servers: Yahoo LDA, resource adjustment
Google DistBelief, Petuum PS, Project Adam, • Less well-explored due to perception of high
Li&Smola PS, … communication overhead?

© Petuum,Inc. 241
Hybrid Updates: PS + SFB
• Hybrid communications:
Parameter Server +
Sufficient Factor
Broadcasting
• Parameter Server: Master-
Slave topology
• Sufficient factor
broadcasting: P2P topology

• For problems with a mix of


large and small matrices,
• Send small matrices via PS
• Send large matrices via SFB

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 242


Hybrid example: CNNHao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, Eric P. Xing. Poseidon: A
System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. USENIX ATC 2016.

• Example: AlexNet CNN model


• Final layers = 4096 * 30000 matrix (120M parameters)
• Use SFB to communicate
• 1. Decouple into two 4096 vectors: u, v
• 2. Transmit two vectors
• 3. Reconstruct the gradient matrix

Figure from
Krizhevsky et al. 2012

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 243


Hybrid example: CNNHao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, Eric P. Xing. Poseidon: A
System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. USENIX ATC 2016.

• Example: AlexNet CNN model


• Convolutional layers = e.g. 11 * 11 matrix (121 parameters)
• Use Full-matrix updates to communicate
• 1. Send/receive using Master-Slave PS topology

Figure from
Krizhevsky et al. 2012

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 244


Hybrid Communication

• Idea
• Sync FC layers using SFB
• Sync Conv layer using PS
• Effectiveness
• It directly reduces the size
of messages in many
situations
• Is SFB always optimal?
• No, its communication
load increases
quadratically
• The right strategy: choose
PS whenever it results in
less communication

© Petuum,Inc. 245
Hybrid Communication
• A best of both worlds strategy
• For example, AlexNet parameters between FC6 and FC7
• Tradeoff between PS and SFB communication

Zhang et al., 2015 © Petuum,Inc. 246


Hybrid Communication
• How to choose? Where is the threshold?
• Determine the best strategy depending on
• Layer type: CONV or FC?
• Layer size
• Batch size
• # of Cluster nodes

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 247


Hybrid Communication
• Hybrid communication algorithm

Determine the best strategy depending on


• Layer type: CONV or FC?
• Layer size: M, N
• Batch size: K
• # of Cluster nodes: !$ , !&

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 248
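A sketch of the kind of cost comparison such an algorithm could make (the exact rule used in Poseidon may differ; this simply compares estimated per-worker bytes transferred by PS-style full-matrix updates vs. P2P sufficient-factor broadcasting for one M×N fully-connected layer):

def best_scheme(layer_type, M, N, K, P):
    # Choose PS or SFB for synchronizing one layer.
    # M, N: parameter matrix dimensions; K: batch size; P: number of workers.
    # PS traffic per worker   ~ 2*M*N           (push gradients, pull parameters)
    # SFB traffic per worker  ~ (M+N)*K*(P-1)   (broadcast K sufficient-factor pairs to peers)
    if layer_type == "CONV":
        return "PS"                              # conv kernels are tiny: full matrices are cheap
    ps_cost = 2 * M * N
    sfb_cost = (M + N) * K * (P - 1)
    return "SFB" if sfb_cost < ps_cost else "PS"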


Hybrid Communication
• Results: achieve linear scalability across different models/data with 40GbE bandwidth
• Using Caffe as an engine:

• Using TensorFlow as engine

Improve over WFBP

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 249


Hybrid Communication
• Linear scalability on throughput, even with limited bandwidth!
• Make distributed deep learning affordable

# parameters 5M 143M 229M

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 250


Hybrid Communication
• Discussion: Utilizing SFs is not a new idea, actually
• Microsoft Adam uses the third strategy (c)

PS SFB push: SFs


Pull: full matrices

© Petuum,Inc. 251
Hybrid Communication
• Adam’s strategy leads to communication bottleneck
• Pushing SFs to server is fine
• Pulling full matrices back will create a bottleneck on the server node.

• Hybrid communication yields communication load balancing


• Which is important to address the problem of burst communication.

© Petuum,Inc. 252
Introducing Poseidon
• Poseidon: An efficient communication architecture
• A distributed platform to amplify existing DL toolkits

toolkits

platform Poseidon
© Petuum,Inc. 253
Poseidon’s position
• Design principles
• Efficient distributed platform for amplifying any DL toolkits
• Preserve the programming interface for any high-level toolkits
• i.e. distribute the DL program without changing any line of code
• Easy deployment, easy adoption.

© Petuum,Inc. 254
Poseidon System Architecture
data flow GPU CPU KV Store
allocation
instruction

KV Store
Synceri

SFB

Coordinator

Stream Pool Thread Pool

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 255


Poseidon APIs
• KV Store, Syncer and Coordinator
• Standard APIs similar to parameter server
• Push/Pull API for parameter synchronization
• BestScheme method to return the best communication method

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 256


Amplify DL toolboxes Using Poseidon
• For developers: plug Poseidon API into the backpropagation
code, all you need to do is:
• Back propagate through layer œ
• Sync parameters of layer œ
• Wait for finishing
• Amplifying Google TensorFlow
• 250 line of code
• Amplifying Caffe
• 150 line of code

Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 257


Using Poseidon
• Poseidon: An efficient communication architecture
• Preserve the programming interface for any high-level toolkits
• i.e. distribute the DL program without changing any line of application code

toolkits

platform Poseidon

© Petuum,Inc. 258
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory

© Petuum,Inc. 259
What is the Issue
• Memory
• GPUs have dedicate memory
• For a DL training program to be efficient, its data must be placed on
GPU memory
• GPU memory is limited, compared to CPU, e.g. maximally 12Gb
• Memcpy between CPU and GPU is expensive – a memcpy takes the
same time as launching a GPU computation kernel
• Problems to be answered
• How to Avoid memcpy overhead between CPU and GPU?
• How to proceed the training of a gigantic network with very limited
available memory?
© Petuum,Inc. 260
A Machine w/o GPU
Network
CPU cores Local
... storage
NIC

DRAM
(CPU memory)

© Petuum,Inc. 261
A Machine w/ GPU
Network
CPU cores Local
... storage
NIC

GPU device
GPU cores
DRAM
GPU
(CPU memory)
memory
(a few GB)
Small GPU memory
Expensive to copy between GPU/CPU mem

© Petuum,Inc. 262
Machine Learning on GPU
Staging memory Input data file
for input data batch (training data)

a mini-batch of training data

Input Intermediate
data data

Parameter data

CPU memory GPU memory

© Petuum,Inc. 263
Deep Learning on GPU
Class probabilities

Training batch

parameters

GPU memory

Eagle Vulture Intermediate states

Osprey Accipiter
© Petuum,Inc. 264
Training batch

Numbers parameters

GPU memory

Max available GPU memory: 12G

Intermediate states

Network      Batch size   Input size   Parameters + grads   Intermediate states
AlexNet      256          150MB        <500M                4.5G
GoogLeNet    64           19MB         <40M                 10G
VGG19        16           10MB         <1.2G                10.8G

© Petuum,Inc. 265
Why Memory is an Issue?
• Intermediate states occupy 90% of the GPU memory
• Intermediate states is proportional to input batch size

• However,
• If you want high throughput, you must have large batch size (because
of the SIMD nature of GPUs)
• If you have large batch size, your GPU will be occupied by
intermediate states, which thereby limits your model size/depth

© Petuum,Inc. 266
Saving Memory: A Simple Trick
• Basic idea
• The fact: intermediate states are proportional to the batch size K
• Idea: achieve large batch size by accumulating gradients generated by smaller batch sizes
which are affordable in the GPU memory

• Solution:
• Partition K into M parts, each part has K/M samples
• For iter = 1:M
• Train with mini-batchsize K/M
• Accumulate the gradient on GPU w/o updating model parameters
• Update the model parameter all together when all M parts finished

• Drawbacks
• What if the GPU still cannot afford the intermediate states even if K=1?
• Small batch size usually leads to insufficient use of GPUs’ computational capability

© Petuum,Inc. 267
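A minimal sketch of this accumulation trick (illustrative; forward_backward is a stand-in for one forward+backward pass that returns the mean gradients over its micro-batch):

import numpy as np

def accumulated_step(params, batch, M, forward_backward, lr=0.01):
    # Emulate a large batch of size K with M micro-batches of size K/M that fit in GPU memory.
    acc = {k: np.zeros_like(v) for k, v in params.items()}
    for micro_batch in np.array_split(batch, M):        # K/M samples each
        grads = forward_backward(params, micro_batch)   # affordable in GPU memory
        for k in acc:
            acc[k] += grads[k]                          # accumulate, do NOT update yet
    for k in params:
        params[k] -= lr * acc[k] / M                    # update once, after all M parts finished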
Memory Management using CPU Memory
• Core ideas
• If the memory is limited, trade something for memory
• Trade extra computations for memory
• Trade other cost (e.g. memory exchange) for more available memory
• If the memory is limited, then get more
• model parallel
• CPU memory

© Petuum,Inc. 268
Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of two


layers are used

Training images

Cui et al., 2016 © Petuum,Inc. 269


Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of two


layers are used

Training images

Cui et al., 2016 © Petuum,Inc. 270


Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of two


layers are used

Training images

Cui et al., 2016 © Petuum,Inc. 271


Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of two


layers are used

Training images

Cui et al., 2016 © Petuum,Inc. 272


Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of two


layers are used
Training images

Cui et al., 2016 © Petuum,Inc. 273


Memory Management using CPU Memory
Class probabilities
• For each iteration (mini-
batch)
• A forward pass
• Then a backward pass

• Each time only data of


two layers are used
Training images
The idea
• Use GPU mem as a cache to keep actively used data
• Store the remaining in CPU memory
Cui et al., 2016 © Petuum,Inc. 274
Memory Management using CPU Memory

Staging memory Input data file


Very expensive, for input data batch (training data)
sometimes more
expensive than
computation

Input Intermediate
CPU/GPU data data
data transfer
parameters

CPU memory GPU memory


Cui et al., 2016 © Petuum,Inc. 275
Memory Management using CPU Memory

Staging memory Input data file


for input data batch (training data)
Controller/Scheduler
to alleviate/hide this
overhead

Input Intermediate
CPU/GPU data data
data transfer
parameters

CPU memory GPU memory


Cui et al., 2016 © Petuum,Inc. 276
Memory Management using CPU Memory
• Controller
• The fact: the memory access order is deterministic and can be exactly
known by a single forward and backward pass
• Idea:
• Obtain the memory access order by a virtual iteration
• Pre-fetch memory blocks from CPU to GPU
• Overlap memory swap overhead with computation

© Petuum,Inc. 277
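The controller idea can be sketched as follows (illustrative pseudocode only; a real GPU implementation would use asynchronous memcpy streams rather than Python threads):

import threading

def run_iteration(access_order, gpu_cache, cpu_store, compute):
    # Prefetch the next memory block from CPU while the current block is being computed on.
    def prefetch(name):
        if name not in gpu_cache:
            gpu_cache[name] = cpu_store[name]           # stands in for a CPU->GPU memcpy
    for i, name in enumerate(access_order):             # order recorded by a virtual iteration
        prefetch(name)                                  # ensure the current block is resident
        nxt = None
        if i + 1 < len(access_order):
            nxt = threading.Thread(target=prefetch, args=(access_order[i + 1],))
            nxt.start()                                 # overlap the next copy with computation
        compute(gpu_cache[name])                        # compute on the current block
        if nxt is not None:
            nxt.join()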
Memory Management using CPU Memory
• What’s the best we can do with this strategy
• We only need 3 memory blocks (peak size) on GPU for:
• Input, Parameters, Output
• The whole training can process with ONLY these three blocks by
• Scheduling memcpy between CPU and GPU to be overlapped with computation
• Move in and out for each layer’s computation as training proceeds

peak

Cui et al., 2016 © Petuum,Inc. 278


Throughput vs. memory budget

All data in GPU memory

Only buffer pool in GPU memory


Twice the peak size for double buffering

• Only 27% reduction in throughput with 35% memory


• Can do 3x bigger problems with little overhead

Cui et al., 2016 © Petuum,Inc. 279


Larger models

• Models up to 20 GB

Cui et al., 2016 © Petuum,Inc. 280


Summary
• Deep learning as dataflow graphs
• A lot of auto-differentiation libraries have been developed to train NNs
• Different adoption, advantages, disadvantages
• DyNet is a new framework for next-wave dynamic NNs
• Difficulties arise when scaling up DL using distributed GPUs
• Communication bottleneck
• Memory limit
• Poseidon as a platform to support and amplify different kinds of DL
toolboxes

© Petuum,Inc. 281
Elements of Modern AI
Data

Task

Model • Graphical Models • Large-Margin • Deep Learning • Sparse Coding

• Nonparametric • Regularized • Spectral/Matrix • Sparse Structured


Bayesian Models Bayesian Methods Methods I/O Regression
Algorithm • Stochastic Gradient • Coordinate • L-BFGS • Gibbs Sampling • Metropolis-
Descent / Back Descent Hastings
propagation
Implementation
• Mahout • Mllib • CNTK • MxNet • Tensorflow …
(MapReduce) (BSP) (Async)
System
Hadoop Spark MPI RPC GraphLab …
Platform • Network switches • Network attached • Server machines • RAM • IoT device • Virtual
• Infiniband storage • Desktops/Laptops • Flash networks (e.g. machines
and Hardware • Flash storage • ARM-powered • SSD Amazon EC2)
devices
• Mobile devices
• GPUs

© Petuum,Inc. 282
Sys-Alg Co-design Inside!
Data

Task

Model

Our “VML” Algorithm


Software Layer
Implementation

System

Platform
and Hardware

© Petuum,Inc. 283
Better Performance
• Fast and Real-Time
  • Orders of magnitude faster than Spark and TensorFlow
  • As fast as hand-crafted systems
• Any Scale
  • Perfect straight-line speedup with more computing devices
  • Spark, TensorFlow can slow down with more devices
• Low Resource
  • Turning a regular cluster into a super computer:
    • Achieve AI results with much more data, but using fewer computing devices
    • Google brain uses ~1000 machines whereas Petuum uses ~10 for the same job

[Charts] Up to 20x faster deep learning (speedup vs. TensorFlow, by number of GPU computers);
up to 200x faster on some ML algorithms (time taken in minutes: Spark vs. hand-crafted system
vs. PetuumOS)

© Petuum,Inc. 284
A Petuum Vision
Data

Task

Model • Graphical Models • Large-Margin • Deep Learning • Sparse Coding

• Nonparametric • Regularized • Omni-Source


• Spectral/Matrix • Sparse Structured
Bayesian Models Bayesian Methods (Any Data)
Methods I/O Regression
Algorithm • Stochastic Gradient • Coordinate • L-BFGS • Gibbs Sampling • Metropolis-
Descent / Back Descent • Omni-Lingual Hastings
propagation
Implementation (Any Programming Language)
• Mahout • Mllib • CNTK • MxNet • Tensorflow …
(MapReduce) (BSP) (Async)
• Omni-Mount
System (Any Hardware)
Hadoop Spark MPI RPC GraphLab …
Platform • Network switches • Network attached • Server machines • RAM • IoT device • Virtual
• Infiniband storage • Desktops/Laptops • Flash networks (e.g. machines
and Hardware • Flash storage • ARM-powered • SSD Amazon EC2)
devices
• Mobile devices
• GPUs

© Petuum,Inc. 285
