Petuum Inc
&
Carnegie Mellon University
Elements of AI/Machine Learning
• Task
• System: Hadoop, Spark, MPI, RPC, GraphLab, …
• Platform and Hardware: network switches, Infiniband, network-attached storage, flash storage, SSD, RAM, server machines, desktops/laptops, NUMA machines, mobile devices, ARM-powered devices, GPUs, CPUs, FPGAs, TPUs, virtual machines, cloud compute (e.g. Amazon EC2), IoT networks, data centers
© Petuum,Inc. 1
ML vs DL
© Petuum,Inc. 2
Plan
• Statistical and Algorithmic Foundation and Insight of Deep Learning
© Petuum,Inc. 7
What is a graphical model?
• A possible world of cellular signal transduction
© Petuum,Inc. 8
GM: structure simplifies representation
• A possible world of cellular signal transduction
© Petuum,Inc. 9
Probabilistic Graphical Models
• If the Xi's are conditionally independent (as described by a PGM), then the joint can be factored into a product of simpler terms, e.g.,
P(X1, X2, X3, X4, X5, X6, X7, X8) =
P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3, X4) P(X7|X6) P(X8|X5, X6)
[Figure: a biological neuron, with dendrites and soma receiving input signals through synapses and the axon carrying output signals, shown alongside an artificial neural network with an input layer, a middle (hidden) layer, and an output layer]
© Petuum,Inc. 13
The perceptron learning algorithm
• Here …
© Petuum,Inc. 14
The perceptron learning algorithm
xd = input
td = target output
od = observed output
wi = weight i
Incremental mode: do until convergence: for each training example d, update wi ← wi + η (td − od) xi,d
Batch mode: do until convergence: 1. compute Δwi = η Σd (td − od) xi,d over all examples; 2. update wi ← wi + Δwi
(A code sketch of both modes follows below.)
© Petuum,Inc. 15
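A minimal NumPy sketch of the two update modes above (not from the slides; the thresholded output, learning rate, and toy data are illustrative assumptions):

```python
import numpy as np

def perceptron_train(X, t, lr=0.1, epochs=50, batch=False):
    """Delta-rule training: w_i <- w_i + lr * (t_d - o_d) * x_{i,d}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if batch:
            o = (X @ w > 0).astype(float)        # batch mode: outputs for all examples
            w += lr * (t - o) @ X                # accumulate the full update, then apply it
        else:
            for x_d, t_d in zip(X, t):           # incremental mode: update per example
                o_d = float(x_d @ w > 0)
                w += lr * (t_d - o_d) * x_d
    return w

# Toy usage: learn logical OR, with a constant bias feature appended.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)
w = perceptron_train(X, t)
```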
Neural Network Model
Inputs: independent variables, e.g. Age = 34, Gender = 2, Stage = 4
[Figure: a small neural network with weighted connections from the inputs to a hidden layer of sigmoid units and on to a single output unit; the output (e.g. 0.6) is the prediction of the dependent variable, "Probability of being Alive"]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
© Petuum,Inc. 16
“Combined logistic models”
Inputs: Age = 34, Gender = 2, Stage = 4
[Figure: the same network with one hidden unit and its incoming/outgoing weights highlighted; each hidden unit is a logistic regression over the inputs]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
© Petuum,Inc. 17
“Combined logistic models”
Inputs: Age = 34, Gender = 2, Stage = 4
[Figure: the same network with another hidden unit and its weights highlighted; another logistic model over the same inputs]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
© Petuum,Inc. 18
“Combined logistic models”
Inputs: Age = 34, Gender = 2, Stage = 4
[Figure: the same network with the remaining weights highlighted; the output unit combines the hidden logistic models into the final prediction]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
© Petuum,Inc. 19
Not really, no target for hidden units...
[Figure: the full network again (inputs Age = 34, Gender = 2, Stage = 4; hidden sigmoid units; output "Probability of being Alive" = 0.6); the hidden units have no training targets, so the component logistic models cannot be fitted separately]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
© Petuum,Inc. 20
Backpropagation:
Reverse-mode differentiation
• Artificial neural networks are nothing more than complex functional compositions that can be
represented by computation graphs:
∂f_n/∂x = Σ_{i ∈ π(n)} (∂f_n/∂f_i) (∂f_i/∂x)
[Figure: a computation graph with input variables x, intermediate computations (nodes 1–5), and outputs f(x)]
© Petuum,Inc. 21
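As an illustration of the chain rule above, here is a hand-written forward/backward pass for a tiny two-layer composition (a minimal NumPy sketch, not the slides' example; the sigmoid nonlinearity and the shapes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    """Forward pass of the composition f(x) = w2 . sigmoid(W1 x)."""
    h = sigmoid(W1 @ x)          # intermediate computation
    y = w2 @ h                   # output node f_n
    return y, h

def backward(x, h, W1, w2):
    """Reverse-mode pass: accumulate df_n/d(node) from the output back to the inputs."""
    dy = 1.0                     # df_n / df_n
    dh = w2 * dy                 # chain rule through the linear output
    dpre = dh * h * (1.0 - h)    # chain rule through the sigmoid
    dW1 = np.outer(dpre, x)      # gradient w.r.t. first-layer weights
    dw2 = h * dy                 # gradient w.r.t. output weights
    dx = W1.T @ dpre             # gradient w.r.t. the input variables
    return dW1, dw2, dx

rng = np.random.default_rng(0)
x, W1, w2 = rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=4)
y, h = forward(x, W1, w2)
grads = backward(x, h, W1, w2)
```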
Backpropagation:
Reverse-mode differentiation
• Artificial neural networks are nothing more than complex functional compositions that can be
represented by computation graphs:
∂f_n/∂x = Σ_{i ∈ π(n)} (∂f_n/∂f_i) (∂f_i/∂x)
[Figure: forward-mode differentiation traverses the computation graph from inputs to outputs; reverse-mode traverses it from the outputs back to the inputs]
© Petuum,Inc. 23
Modern building blocks of deep networks
• Activation functions
• Linear and ReLU
• Sigmoid and tanh
• Etc.
• Layers
• Fully connected
• Convolutional & pooling
• Recurrent
• ResNets
• Etc.
[Figure: fully connected, convolutional, and recurrent layers; source: colah.github.io]
© Petuum,Inc. 24
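For concreteness, a small PyTorch sketch that combines several of the listed building blocks into one network (the layer sizes, toy batch, and the use of PyTorch itself are assumptions, not part of the slides):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                   # ReLU activation
    nn.MaxPool2d(2),                             # pooling
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer
)
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy loss

x = torch.randn(8, 3, 32, 32)                    # a toy batch of 32x32 RGB images
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
loss.backward()                                  # gradients via reverse-mode autodiff
```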
Modern building blocks of deep networks
Putting things together:
• Activation functions
• Linear and ReLU
• Sigmoid and tanh
• Etc.
• Layers
• Fully connected
• Convolutional & pooling
• Recurrent
• ResNets (blocks with residual connections)
• Etc.
• Loss functions
• Cross-entropy loss
• Mean squared error
• Etc.
[Figure: a part of GoogLeNet, combining convolutional, avg & max pooling, fully connected, and concatenation layers, activations, and a loss]
© Petuum,Inc. 25
Modern building blocks of deep networks
Putting things together:
• Activation functions: Linear and ReLU; Sigmoid and tanh; etc.
• Layers: Fully connected; Convolutional & pooling; Recurrent; ResNets; etc.
• Loss functions: Cross-entropy loss; Mean squared error; etc.
• Arbitrary combinations of the basic building blocks
• Multiple loss functions – multi-target prediction, transfer learning, and more
• Given enough data, deeper architectures just keep improving
• Representation learning: the networks learn increasingly more abstract representations of the data that are "disentangled," i.e., amenable to linear separation
[Figure: a part of GoogLeNet]
© Petuum,Inc. 26
Outline
• Probabilistic Graphical Models: Basics
• An overview of the DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 27
Graphical models vs. Deep nets
Graphical models Deep neural networks
© Petuum,Inc. 28
Graphical models vs. Deep nets
Graphical models Deep neural networks
© Petuum,Inc. 29
Graphical models vs. Deep nets
[Figure: an example graphical model over X1, X2, X3]
Utility of the graph: log P(X) = Σ_i log φ(x_i) + Σ_{i,j} log ψ(x_i, x_j)
Objective: something aggregated from local functions (both graphical models and deep nets)
Algorithm: graphical models – a major focus of open research, many algorithms, and more to come; deep nets – a single, unchallenged inference algorithm, backpropagation (BP)
Evaluation: graphical models – on almost every intermediate quantity; deep nets – on a black-box score, end performance
Implementation: graphical models – more or less standardized; deep nets – many tricks
Experiments: graphical models – modest, often simulated data (GT known); deep nets – massive, real data (GT unknown)
© Petuum,Inc. 33
Graphical Models vs. Deep Nets
• So far:
• Graphical models are representations of probability distributions
• Neural networks are function approximators (with no probabilistic meaning)
• Some of the neural nets are in fact proper graphical models
(i.e., units/neurons represent random variables):
• Boltzmann machines (Hinton & Sejnowski, 1983)
• Restricted Boltzmann machines (Smolensky, 1986)
• Learning and Inference in sigmoid belief networks (Neal, 1992)
• Fast learning in deep belief networks (Hinton, Osindero, Teh, 2006)
• Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)
• Let’s go through these models one-by-one © Petuum,Inc. 34
I: Restricted Boltzmann Machines
• An RBM is a Markov random field represented with a bipartite graph
• All nodes in one layer/part of the graph are connected to all nodes in the other; there are no intra-layer connections
• Joint distribution:
  P(v, h) = (1/Z) exp( Σ_{i,j} w_ij v_i h_j + Σ_i b_i v_i + Σ_j c_j h_j )
• Log-likelihood gradient:
  ∂/∂w_ij log P(v) = E_{P(h|v)}[ v_i h_j ] − E_{P(v,h)}[ v_i h_j ]
© Petuum,Inc. 38
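The second expectation (over the model distribution) is intractable, so in practice it is approximated, for example with contrastive divergence. A minimal NumPy sketch of a CD-1 update consistent with the equations above (binary units, the learning rate, and the toy sizes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, rng, lr=0.01):
    """One CD-1 step approximating dlogP(v)/dW = <v h>_data - <v h>_model."""
    ph0 = sigmoid(v0 @ W + c)                          # P(h|v) is factorial in an RBM
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                        # one step of block Gibbs sampling
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))  # data statistics minus model statistics
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))                # 6 visible units, 4 hidden units
b, c = np.zeros(6), np.zeros(4)
v = (rng.random(6) < 0.5).astype(float)                # a toy binary visible vector
cd1_update(v, W, b, c, rng)
```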
II: Sigmoid Belief Networks
• Sigmoid belief nets are simply Bayesian networks over binary variables with conditional probabilities represented by sigmoid functions:
  P(x_i | pa(x_i)) = σ( x_i Σ_{x_j ∈ pa(x_i)} w_ij x_j )
[Figure: an RBM unrolled into an infinitely deep belief net, alternating "Bayes rule + rearrange sums" and sampling steps]
• When we train an RBM, we are really training an infinitely deep belief net!
• It is just that the weights of all layers are tied.
• If the weights are “untied” to some extent, we get a Deep Belief Network.
images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 46
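A small NumPy sketch of ancestral (top-down) sampling in a sigmoid belief net, using the standard form p(child = 1 | parents) = σ(w · parents); the layer sizes and weights below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_sbn(layer_weights, top_prob, rng):
    """Ancestral sampling: draw the top layer, then each lower layer from
    Bernoulli(sigmoid(w . parents)), down to the visible units."""
    s = (rng.random(top_prob.shape) < top_prob).astype(float)
    for W in layer_weights:
        p = sigmoid(s @ W)                      # conditional probabilities of the child layer
        s = (rng.random(p.shape) < p).astype(float)
    return s

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 12))]   # 5 -> 8 -> 12 units
visible = sample_sbn(weights, top_prob=np.full(5, 0.5), rng=rng)
```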
III: Deep Belief Nets
Challenges:
• Exact inference in DBNs is problematic due to the explaining-away effect
• Training is done in two stages:
• greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward (bottom-up) © Petuum,Inc. 48
DBN: Layer-wise pre-training
• Pre-train and freeze the 1st RBM
• Stack another RBM on top and train it
• and so forth
• From the optimization perspective, this procedure loosely corresponds to an approximate block-coordinate ascent on the log-likelihood
images from Marcus Frean, MLSS Tutorial 2010 © Petuum,Inc. 50
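In pseudocode-like Python, the greedy layer-wise procedure can be sketched as follows; `train_rbm` is a hypothetical helper (e.g. built on contrastive divergence) assumed to return the trained RBM and the hidden representation of its input:

```python
def pretrain_dbn(data, hidden_sizes, train_rbm):
    """Greedy layer-wise pre-training: train an RBM, freeze it, then stack the
    next RBM on the hidden activities the frozen one produces."""
    rbms, h = [], data
    for n_hidden in hidden_sizes:
        rbm, h = train_rbm(h, n_hidden)   # fit the next layer on the current representation
        rbms.append(rbm)                  # freeze this layer and move up
    return rbms
```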
DBN: Fine-tuning
• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic
model per se
• However, the layers of representations could perhaps be
useful for some other downstream tasks!
© Petuum,Inc. 55
Deep Belief Nets and Boltzmann Machines
[Figure: approximate inference schemes for Deep Belief Nets and Deep Boltzmann Machines (GMFb, GMFr, BP)]
© Petuum,Inc. 58
“Optimize” how to optimize via truncation & re-opt
• Energy-based modeling of the structured output (CRF)
• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)
[Figure: the optimization over the structured output unrolled into a fixed sequence of steps y_1, y_2, …, y_T]
We can backprop through the optimization steps since they are just a sequence of computations (a sketch follows below).
Relevant recent paper: Andrychowicz et al.: Learning to learn by gradient descent by gradient descent. 2016.
© Petuum,Inc. 59
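A minimal PyTorch sketch of "backprop through the optimization steps": a few inner gradient steps on an energy are unrolled with the graph kept, so an outer loss can differentiate through them (the quadratic toy energy, step count, and sizes are assumptions):

```python
import torch

def unrolled_inference(x, energy, steps=5, step_size=0.1):
    """Unroll a few gradient steps on an energy function, keeping the graph so an
    outer training loss can backpropagate through the inner optimization."""
    y = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        e = energy(x, y)
        (g,) = torch.autograd.grad(e, y, create_graph=True)  # keep graph for outer backprop
        y = y - step_size * g                                 # one differentiable inner update
    return y

# Toy quadratic energy coupling input x and output y (hypothetical, for illustration).
theta = torch.randn(4, 4, requires_grad=True)
energy = lambda x, y: ((y - x @ theta) ** 2).sum()
x = torch.randn(2, 4)
y_star = unrolled_inference(x, energy)
outer_loss = ((y_star - 1.0) ** 2).mean()
outer_loss.backward()   # gradients reach theta through the unrolled steps
```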
Dealing with structured prediction
• Energy-based modeling of the structured output (CRF)
• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)
© Petuum,Inc. 60
Outline
• Probabilistic Graphical Models: Basics
• An overview of DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 61
Combining sequential NNs and GMs
© Petuum,Inc. 62
slide courtesy: Matt Gormley
Combining sequential NNs and GMs
© Petuum,Inc. 63
slide courtesy: Matt Gormley
Hybrid NNs + conditional GMs
© Petuum,Inc. 64
slide courtesy: Matt Gormley
Hybrid NNs + conditional GMs
© Petuum,Inc. 65
slide courtesy: Matt Gormley
Hybrid NNs + conditional GMs
© Petuum,Inc. 66
slide courtesy: Matt Gormley
Using GMs as Prediction Explanations
• Idea: Use deep neural nets to generate parameters of a graphical model for a
given context (e.g., specific instance or case)
• Produced GMs are used to make the final prediction
• GMs are built on top of interpretable variables (not deep embeddings!) and can
be used as contextual explanations for each prediction
[Figure: a deep network encodes the context and, via attention over a dictionary (a mixture-of-experts style combination), produces the parameters of a graphical model over interpretable attributes X1–X4, which then predicts the outputs Y1–Y4]
A practical implementation:
• Maintain a (sparse) dictionary of GM parameters
• Process complex inputs (images, text, time series, etc.) using deep nets; use soft
attention to either select or combine models from the dictionary
• Use constructed GMs (e.g., CRFs) to make predictions
• Inspect GMs to understand the reasoning behind predictions
Al-Shedivat, Dubey, Xing, arXiv, 2017 © Petuum,Inc. 68
Outline
• An overview of the DL components
• Historical remarks: early days of neural networks
• Modern building blocks: units, layers, activation functions, loss functions, etc.
• Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
• Graphical models vs. computational graphs
• Sigmoid Belief Networks as graphical models
• Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
• Using outputs of NNs as inputs to GMs
• GMs with potential functions represented by NNs
• NNs with structured outputs
• Bayesian Learning of NNs
• Bayesian learning of NN parameters
• Deep kernel learning
© Petuum,Inc. 69
Bayesian learning of NNs
• A neural network as a probabilistic model:
• Likelihood: p(y | x, w)
• Categorical distribution for classification ⇒ cross-entropy loss
• Gaussian distribution for regression ⇒ squared loss
• Gaussian prior ⇒ L2 regularization
• Laplace prior ⇒ L1 regularization
• Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]
• Posterior: p(w | X, y)
• Variational inference with approximate posterior q(w)
[Figure ("Weight Uncertainty", courtesy Blundell et al., 2016): left, each weight has a fixed value, as provided by classical backpropagation; right, each weight is assigned a distribution, as provided by Bayes by Backprop]
© Petuum,Inc. 70
Bayesian learning of NNs
• Variational inference (in a nutshell):
  min_q F(D, θ) = KL( q(w) ‖ p(w) ) − E_{q(w)}[ log p(D|w) ]
  ≈ KL( q(w) ‖ p(w) ) − Σ_i log p(D|w_i)
where w_i ∼ q(w); the KL term can be approximated similarly (a code sketch follows below)
• When used for prediction, GPs account for correlations between the data points and can
output well-calibrated predictive uncertainty estimates.
© Petuum,Inc. 72
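A minimal sketch of this variational scheme for a single weight vector, in the spirit of Bayes by Backprop: a factorized Gaussian q(w), a standard-normal prior, and a one-sample Monte Carlo estimate of the likelihood term (the linear-Gaussian model, sizes, and optimizer are assumptions):

```python
import torch

mu = torch.zeros(5, requires_grad=True)        # variational mean of q(w)
rho = torch.zeros(5, requires_grad=True)       # std via softplus(rho), kept positive
opt = torch.optim.Adam([mu, rho], lr=1e-2)

X, y = torch.randn(20, 5), torch.randn(20)     # toy regression data

for _ in range(100):
    sigma = torch.nn.functional.softplus(rho)
    w = mu + sigma * torch.randn_like(sigma)               # w ~ q(w), reparameterized
    log_lik = -((X @ w - y) ** 2).sum()                    # Gaussian likelihood up to constants
    kl = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * torch.log(sigma)).sum()  # KL(q || N(0, I))
    loss = kl - log_lik                                    # negative ELBO estimate
    opt.zero_grad(); loss.backward(); opt.step()
```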
Gaussian Process and Deep Kernel Learning
• Consider a neural network with a Gaussian prior on its weights and infinitely many hidden neurons in the intermediate layer.
[Figure: a network with infinitely many hidden units]
• Certain classes of Gaussian priors for neural networks with infinitely many hidden units converge to Gaussian processes [Neal 1996]
• Deep kernel [Wilson et al., 2016]
• Combines the inductive biases of deep model architectures with the non-parametric flexibility of Gaussian processes
• GP regression with K_ij = k(x_i, x_j):  p(u | θ) = N( u | μ(X), K ),  p(y | u) = N( y | u, β⁻¹ I )
• Starting from a base kernel k(x_i, x_j | θ), transform the inputs x with a deep net g(·, w):
  k(x_i, x_j | θ) → k( g(x_i, w), g(x_j, w) | θ, w )
• Learn both kernel and neural parameters θ, w jointly by optimizing the marginal log-likelihood (or its variational lower bound).
• Fast learning and inference with local kernel interpolation, structured inducing points, and Monte Carlo approximations
© Petuum,Inc. 73
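A minimal PyTorch sketch of a deep kernel: an RBF base kernel applied to features g(x, w) from a small network, with the GP marginal log-likelihood driving both the kernel and network parameters. The architecture, lengthscale parameterization, and noise level are assumptions, and the scalable approximations mentioned above are omitted:

```python
import torch

# Base RBF kernel applied to features from a small network g(.; w); both the network
# weights and the kernel lengthscale receive gradients from the GP marginal likelihood.
g = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
log_lengthscale = torch.zeros(1, requires_grad=True)

def deep_kernel(Xa, Xb):
    Fa, Fb = g(Xa), g(Xb)                               # transform inputs with the deep net
    d2 = torch.cdist(Fa, Fb) ** 2                       # squared distances in feature space
    return torch.exp(-0.5 * d2 / torch.exp(log_lengthscale) ** 2)

X, y = torch.randn(50, 10), torch.randn(50)
K = deep_kernel(X, X) + 1e-2 * torch.eye(50)            # add observation noise on the diagonal
alpha = torch.linalg.solve(K, y.unsqueeze(1)).squeeze(1)
mll = -0.5 * y @ alpha - 0.5 * torch.logdet(K)          # marginal log-likelihood up to constants
(-mll).backward()                                       # gradients for net and kernel parameters
```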
Gaussian Process and Deep Kernel Learning
• By adding GP as a layer to a deep neural net, we can think of it as adding
an infinite hidden layer with a particular prior on the weights
• Deep kernel learning [Wilson et al., 2016]
• Combines the inductive biases of
deep models with the non-parametric
flexibility of Gaussian processes
• GPs add powerful regularization to
the network
• Additionally, they provide predictive
uncertainty estimates
© Petuum,Inc. 74
Deep kernel learning on sequential data
What if we have data of
sequential nature?
© Petuum,Inc. 75
Deep kernel learning on sequential data
The answer is YES!
[Figure: panels of front distance (m) vs. side distance (m)]
© Petuum,Inc. 79
Part-II
Deep Generative Models
Plan
• Statistical and Algorithmic Foundation and Insight of Deep Learning
© Petuum,Inc. 82
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables
© Petuum,Inc. 83
Deep generative models
• Define probabilistic distributions over a set of variables
• "Deep" means multiple layers of hidden variables!
[Figure: multiple layers of hidden variables z(1), …, z(L) above the visible variable x]
© Petuum,Inc. 84
Early forms of deep generative models
• Hierarchical Bayesian models
• Sigmoid belief nets [Neal 1992]
  Binary hidden layers z_n^(2) ∈ {0,1}^K and z_n^(1) ∈ {0,1}^M above binary visible units x_n ∈ {0,1}^D, with
  p( x_{dn} = 1 | θ_d, z_n^(1) ) = σ( θ_d^T z_n^(1) )
  p( z_{mn}^(1) = 1 | θ_m, z_n^(2) ) = σ( θ_m^T z_n^(2) )
© Petuum,Inc. 85
Early forms of deep generative models
• Hierarchical Bayesian models
• Sigmoid belief nets [Neal 1992]
[Figure: a sigmoid belief net with inference weights leading from the DATA up to the hidden layers; figure courtesy: Schmidhuber 1996]
© Petuum,Inc. 87
Early forms of deep generative models
• Training of DGMs via an EM style framework
• Sampling / data augmentation:
  z = {z_1, z_2}
  z_1^new ∼ p(z_1 | z_2, x),   z_2^new ∼ p(z_2 | z_1^new, x)
• Variational inference:
  log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ) := ℒ(θ, φ; x)
  max_{θ,φ} ℒ(θ, φ; x)
• Wake sleep:
  Wake: max_θ E_{q_φ(h|x)}[ log p_θ(x|h) ]
  Sleep: max_φ E_{p_θ(x|h)}[ log q_φ(h|x) ]
© Petuum,Inc. 88
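A minimal PyTorch sketch of the wake and sleep phases for a one-layer Helmholtz-style model with Bernoulli units (the architectures, uniform prior, optimizers, and toy data are illustrative assumptions):

```python
import torch

gen = torch.nn.Linear(8, 20)    # p_theta(x | h): logits of 20 visible units from 8 hidden units
rec = torch.nn.Linear(20, 8)    # q_phi(h | x): recognition (inference) network
opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_rec = torch.optim.Adam(rec.parameters(), lr=1e-3)
bce = torch.nn.functional.binary_cross_entropy_with_logits

x = (torch.rand(32, 20) < 0.5).float()                       # a batch of binary "data"

# Wake phase: h ~ q_phi(h|x); update the generator to reconstruct x from h.
h = torch.bernoulli(torch.sigmoid(rec(x)).detach())
wake_loss = bce(gen(h), x)
opt_gen.zero_grad(); wake_loss.backward(); opt_gen.step()

# Sleep phase: dream (h, x) from the generative model; update the recognition
# network to recover h from the dreamed x (the reversed-KL relaxation).
h_dream = (torch.rand(32, 8) < 0.5).float()                  # h ~ p(h), here a uniform prior
x_dream = torch.bernoulli(torch.sigmoid(gen(h_dream)).detach())
sleep_loss = bce(rec(x_dream), h_dream)
opt_rec.zero_grad(); sleep_loss.backward(); opt_rec.step()
```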
Resurgence of deep generative models
• Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
• Building blocks of deep probabilistic models
© Petuum,Inc. 89
Resurgence of deep generative models
• Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
• Building blocks of deep probabilistic models
• Deep belief networks (DBNs) [Hinton et al., 2006]
• Hybrid graphical model
• Inference in DBNs is problematic due to explaining away
• Deep Boltzmann Machines (DBMs) [Salakhutdinov & Hinton, 2009]
• Undirected model
© Petuum,Inc. 90
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
q_φ(z|x): inference model        p_θ(x|z): generative model
© Petuum,Inc. 91
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)
G_θ: generative model
D_φ: discriminator
© Petuum,Inc. 92
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)
• Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et
al., 2015]
© Petuum,Inc. 93
Resurgence of deep generative models
• Variational autoencoders (VAEs) [Kingma & Welling, 2014]
/ Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
• Generative adversarial networks (GANs)
• Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et
al., 2015]
© Petuum,Inc. 94
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables
© Petuum,Inc. 95
Synonyms in the literature
© Petuum,Inc. 98
Recap: Amortized Variational Inference
• Variational distribution as an inference model q_φ(z|x) with parameters φ
• Amortize the cost of inference by learning a single data-dependent inference model
• The trained inference model can be used for quick inference on new data
• Maximize the variational lower bound ℒ(θ, φ; x)
• E-step: maximize ℒ w.r.t. φ with θ fixed
• M-step: maximize ℒ w.r.t. θ with φ fixed
© Petuum,Inc. 99
Deep generative models with amortized inference
• Helmholtz machines
• Variational autoencoders (VAEs) / Neural Variational Inference
and Learning (NVIL)
© Petuum,Inc. 100
Wake Sleep Algorithm
• [Hinton et al., Science 1995]
• Train a separate inference model along with the generative model
• Generally applicable to a wide range of generative models, e.g., Helmholtz machines
• Consider a generative model p_θ(x|z) and prior p(z)
• Joint distribution: p_θ(x, z) = p_θ(x|z) p(z)
• E.g., multi-layer belief nets
• Inference model q_φ(z|x)
• Maximize the data log-likelihood with two steps of loss relaxation:
• Maximize the lower bound of the log-likelihood, or equivalently, minimize the free energy
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) ‖ p_θ(z|x) )
• Minimize a different objective (reversed KLD) w.r.t. φ to ease the optimization
• Disconnected from the original variational lower bound loss:
  F′(θ, φ; x) = −log p(x) + KL( p_θ(z|x) ‖ q_φ(z|x) )
© Petuum,Inc. 101
Wake Sleep Algorithm
[Figure: a Helmholtz machine over x with recognition weights R1, R2 and generative weights]
• Free energy:
  F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) ‖ p_θ(z|x) )
• Minimize the free energy w.r.t. θ of p_θ → wake phase
  max_θ E_{q_φ(z|x)}[ log p_θ(x, z) ]
• Get samples from q_φ(z|x) through inference on hidden variables
• Use the samples as targets for updating the generative model p_θ(x|z)
• Corresponds to the variational M step
© Petuum,Inc. 106
Variational Autoencoders (VAEs)
• Variational lower bound
ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
© Petuum,Inc. 107
Reparametrized gradient
• Optimize ℒ(θ, φ; x) w.r.t. φ of q_φ(z|x)
• Recap: gradient estimate with the log-derivative trick:
  ∇_φ E_{q_φ}[ log p_θ(x, z) ] = E_{q_φ}[ log p_θ(x, z) ∇_φ log q_φ ]
• High variance: ∇_φ E_{q_φ}[ log p_θ ] ≈ (1/n) Σ_i log p_θ(x, z_i) ∇_φ log q_φ(z_i|x),  z_i ∼ q_φ
• The scale factor log p_θ(x, z_i) of the derivative ∇_φ log q_φ can have arbitrarily large magnitude
© Petuum,Inc. 109
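The reparameterized (pathwise) estimator avoids this: sample ε ∼ N(0, I) and set z = μ(x) + σ(x) ⊙ ε so gradients flow through μ and σ rather than through the score function. A minimal PyTorch sketch with a Gaussian q_φ(z|x) (the encoder/decoder sizes and the Gaussian likelihood are assumptions):

```python
import torch

enc = torch.nn.Linear(10, 2 * 4)                 # outputs [mu, log_var] for a 4-d code z
dec = torch.nn.Linear(4, 10)

x = torch.randn(16, 10)
mu, log_var = enc(x).chunk(2, dim=-1)
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps          # z ~ q(z|x) via the reparameterization trick

recon = -((dec(z) - x) ** 2).sum(dim=-1).mean()  # E_q[log p(x|z)] for a Gaussian, up to constants
kl = -0.5 * (1 + log_var - mu**2 - log_var.exp()).sum(dim=-1).mean()
loss = kl - recon                                # negative ELBO
loss.backward()                                  # low-variance pathwise gradients for enc and dec
```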
VAEs: example results
input:   we looked out at the setting sun .
mean:    they were laughing at the same time .
samp. 1: ill see you in the early morning .
samp. 2: i looked up at the blue sky .
samp. 3: it was down on the dance floor .
© Petuum,Inc. 111
Generative Adversarial Nets (GANs)
• Learning: a minimax game between the generator and the discriminator
• Train D to maximize the probability of assigning the correct label to both training examples and generated samples
• Train G to fool the discriminator:
  max_D L_D = E_{x∼p_data(x)}[ log D(x) ] + E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
  min_G L_G = E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
• In practice the generator uses the adapted, non-saturating objective, maximizing E_{x=G(z), z∼p(z)}[ log D(x) ] rather than minimizing E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ] (Goodfellow et al., 2014)
© Petuum,Inc. 112
[Figure courtesy: Kim’s slides]
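A minimal PyTorch sketch of one discriminator step and one non-saturating generator step matching the objectives above (the 2-D toy data, network sizes, and optimizers are assumptions):

```python
import torch

G = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
D = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.functional.binary_cross_entropy_with_logits

x_real = torch.randn(64, 2) + 3.0          # stand-in for samples from p_data(x)
z = torch.randn(64, 8)                     # z ~ p(z)

# Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))]
d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): maximize E[log D(G(z))]
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```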
Generative Adversarial Nets (GANs)
• The discriminator and the (adapted) generator objectives:
  max_D L_D = E_{x∼p_data(x)}[ log D(x) ] + E_{x=G(z), z∼p(z)}[ log(1 − D(x)) ]
  max_G L_G = E_{x∼p_data(x)}[ log(1 − D(x)) ] + E_{x=G(z), z∼p(z)}[ log D(x) ]
            = E_{x=G(z), z∼p(z)}[ log D(x) ]
• KL-divergence interpretation: assume a uniform prior p(y), i.e., p(y = 0) = p(y = 1) = 0.5, and consider optimizing the conditional p_θ(x|z, y) parameterized with θ (Theorem 1; see the KLD analysis on the following slides)
[Figure courtesy: Kim's slides]
© Petuum,Inc. 113
Generative Adversarial Nets (GANs)
• Learning
• Aim to achieve equilibrium of the game
• Optimal state:
  p_g(x) = p_data(x)
  D(x) = p_data(x) / ( p_data(x) + p_g(x) ) = 1/2
© Petuum,Inc. 114
[Figure courtesy: Kim’s slides]
GANs: example results
© Petuum,Inc. 116
Outline
• Overview of advances in deep generative models
• Backgrounds of deep generative models
• Wake sleep algorithm
• Variational autoencoders
• Generative adversarial networks
• A unified view of deep generative models
• new formulations of deep generative models
• Symmetric modeling of latent and visible variables
© Petuum,Inc. 117
A unified view of deep generative models
• Literatures have viewed these DGM approaches as distinct
model training paradigms
• GANs: achieve an equilibrium between generator and discriminator
• VAEs: maximize lower bound of the data likelihood
© Petuum,Inc. 118
Adversarial domain adaptation (ADA)
• Let’s start from ADA
• The application of adversarial approach on domain adaptation
• We then show GANs can be seen as a special case of ADA
• Correspondence of elements:
© Petuum,Inc. 123
ADA vs. Variational EM
Variational EM:
• Objectives:
  max_φ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
  max_θ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
• Single objective for both θ and φ
• Extra prior regularization by p(z)
ADA:
• Objectives:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
• Two objectives
• Has a global optimal state in the game-theoretic view
© Petuum,Inc. 124
ADA vs. Variational EM
Variational EM:
• Objectives:
  max_φ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
  max_θ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
• Single objective for both θ and φ
• Extra prior regularization by p(z)
• The reconstruction term: maximize the conditional log-likelihood of x with the generative distribution p_θ(x|z), conditioning on the latent code z inferred by q_φ(z|x)
• p_θ(x|z) is the generative model
• q_φ(z|x) is the inference model
ADA:
• Objectives:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
• Two objectives
• Has a global optimal state in the game-theoretic view
• The objectives: maximize the conditional log-likelihood of y (or 1 − y) with the distribution q_φ(y|x), conditioning on the latent feature x inferred by p_θ(x|y)
• Interpret q_φ(y|x) as the generative model
• Interpret p_θ(x|y) as the inference model
© Petuum,Inc. 125
ADA: graphical model
Setup: let z be a data example from either the source or target domain, and y ∈ {0, 1} the domain indicator, with y = 0 indicating the target domain and y = 1 the source domain. The data distributions conditioned on the domain are p(z|y), and p(y) is the prior (e.g., uniform) over the domain indicator. The feature extractor maps z to representations x = G_θ(z) with parameters θ; p(z|y) and the deterministic transformation G_θ together form an implicit distribution p_θ(x|y), which is intractable to evaluate but easy to sample from. To enforce domain invariance of the feature x, a discriminator is trained to adversarially distinguish the two domains, defining a conditional distribution q_φ(y|x) with parameters φ, and the feature extractor is optimized to fool it. Let q_φ^r(y|x) = q_φ(1 − y|x) be the reversed distribution over domains. The ADA objectives are:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
Define:
• Solid-line arrows (x → y): generative process
• Dashed-line arrows (y, z → x): inference
• Hollow arrows (z → x): deterministic transformation, leading to implicit distributions
• Blue arrows (x → y): adversarial mechanism; involves both q_φ(y|x) and q_φ^r(y|x)
© Petuum,Inc. 126
GANs: a variant of ADA
• Transfer the properties of source domain to target domain
• Source domain: e.g. real image, ` = 1
• Target domain: e.g. generated image, ` = 0
4 © Petuum,Inc. 128
GANs: new formulation
• Again, rewrite the GAN objectives in the "variational EM" format
• Recap: conventional formulation (the prior p(y) is uniform, giving a constant scale factor 1/2; the generator is trained with the non-saturating objective, as is common in practice):
  max_D L_D = E_{x=G_θ(z), z∼p(z|y=0)}[ log(1 − D_φ(x)) ] + E_{x∼p_data(x)}[ log D_φ(x) ]
  max_θ L_θ = E_{x=G_θ(z), z∼p(z|y=0)}[ log D_φ(x) ] + E_{x∼p_data(x)}[ log(1 − D_φ(x)) ]
            = E_{x=G_θ(z), z∼p(z|y=0)}[ log D_φ(x) ]
• Rewrite in the new form:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
            = ½ E_{x=G_θ(z), z∼p(z|y=0)}[ log(1 − D_φ(x)) ] + ½ E_{x∼p_data(x)}[ log D_φ(x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
• Exactly the same as ADA!
• The same correspondence to variational EM
© Petuum,Inc. 129
GANs vs. Variational EM
Variational EM:
• Objectives:
  max_φ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
  max_θ ℒ_{φ,θ} = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
• Single objective for both θ and φ
• Extra prior regularization by p(z)
• The reconstruction term: maximize the conditional log-likelihood of x with the generative distribution p_θ(x|z), conditioning on the latent code z inferred by q_φ(z|x)
• p_θ(x|z) is the generative model
• q_φ(z|x) is the inference model
GAN:
• Objectives:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
• Two objectives
• Has a global optimal state in the game-theoretic view
• The objectives: maximize the conditional log-likelihood of y (or 1 − y) with the distribution q_φ(y|x), conditioning on the data/generation x inferred by p_θ(x|y)
• Interpret q_φ(y|x) as the generative model
• Interpret p_θ(x|y) as the inference model
© Petuum,Inc. 130
GANs vs. Variational EM
(Same comparison as the previous slide, with the additional interpretation on the GAN side:)
• Interpret x as latent variables
• Interpret generation of x as performing inference over the latent variables
© Petuum,Inc. 131
GANs: minimizing KLD
• As in variational EM, we can further rewrite in the form of minimizing a KLD to reveal more insights into the optimization problem
• For each optimization step of p_θ(x|y) at the point θ = θ_0, φ = φ_0, let
  • p(y): uniform prior distribution
  • p_{θ_0}(x) = E_{p(y)}[ p_{θ_0}(x|y) ]
  • q^r(x|y) ∝ q_{φ_0}^r(y|x) p_{θ_0}(x)
• Lemma 1: the updates of θ at θ_0 have
  ∇_θ E_{p_θ(x|y)p(y)}[ log q_{φ_0}^r(y|x) ] |_{θ=θ_0}
  = −∇_θ [ E_{p(y)}[ KL( p_θ(x|y) ‖ q^r(x|y) ) ] − JSD( p_θ(x|y=0) ‖ p_θ(x|y=1) ) ] |_{θ=θ_0}
• KL: KL divergence
• JSD: Jensen-Shannon divergence
© Petuum,Inc. 132
Proof of Lemma 1
Note that p_θ(x|y=0) = p_{gθ}(x) and p_θ(x|y=1) = p_data(x). Let p_{Mθ} = (p_{gθ} + p_data)/2. Then
  E_{p(y)}[ KL( p_θ(x|y) ‖ p_{θ_0}(x) ) ] = ½ KL( p_{gθ} ‖ p_{Mθ_0} ) + ½ KL( p_data ‖ p_{Mθ_0} )
© Petuum,Inc. 133
Proof of Lemma 1 (cont.)
On the other hand,
  JSD( p_{gθ} ‖ p_data ) = ½ E_{p_{gθ}}[ log( p_{gθ} / p_{Mθ} ) ] + ½ E_{p_data}[ log( p_data / p_{Mθ} ) ]
  = ½ E_{p_{gθ}}[ log( p_{gθ} / p_{Mθ_0} ) ] + ½ E_{p_data}[ log( p_data / p_{Mθ_0} ) ] + E_{p_{Mθ}}[ log( p_{Mθ_0} / p_{Mθ} ) ]
  = ½ KL( p_{gθ} ‖ p_{Mθ_0} ) + ½ KL( p_data ‖ p_{Mθ_0} ) − KL( p_{Mθ} ‖ p_{Mθ_0} )
Taking derivatives of both sides w.r.t. θ at θ_0 and plugging in the previous equation, we obtain the desired result.
© Petuum,Inc. 134
GANs: minimizing KLD
• Lemma 1: the updates of θ at θ_0 have
  ∇_θ E_{p_θ(x|y)p(y)}[ log q_{φ_0}^r(y|x) ] |_{θ=θ_0}
  = −∇_θ [ E_{p(y)}[ KL( p_θ(x|y) ‖ q^r(x|y) ) ] − JSD( p_θ(x|y=0) ‖ p_θ(x|y=1) ) ] |_{θ=θ_0}
where KL(·‖·) and JSD(·‖·) are the KL and Jensen-Shannon divergences, respectively.
© Petuum,Inc. 135
GANs: minimizing KLD
• Lemma 1: the updates of θ at θ_0 have
  ∇_θ E_{p_θ(x|y)p(y)}[ log q_{φ_0}^r(y|x) ] |_{θ=θ_0}
  = −∇_θ [ E_{p(y)}[ KL( p_θ(x|y) ‖ q^r(x|y) ) ] − JSD( p_θ(x|y=0) ‖ p_θ(x|y=1) ) ] |_{θ=θ_0}
[Figure: p_{θ_0}(x|y=1) = p_data(x), p_{θ_0}(x|y=0) = p_{gθ_0}(x), and q^r(x|y=0); after an update, p_{θ_new}(x|y=0) = p_{gθ_new}(x) moves toward q^r(x|y=0) and can leave a mode of p_data(x) uncovered (a "missed mode")]
where KL(·‖·) and JSD(·‖·) are the KL and Jensen-Shannon divergences, respectively.
• Missing mode phenomena of GANs
• Asymmetry of the KLD:
  KL( p_{gθ}(x) ‖ q^r(x|y=0) ) = ∫ p_{gθ}(x) log [ p_{gθ}(x) / q^r(x|y=0) ] dx
  • Concentrates p_θ(x|y=0) on large modes of q^r(x|y)
  • Large positive contribution to the KLD in regions of the x space where q^r(x|y=0) is small, unless p_{gθ}(x) is also small
  ⇒ p_{gθ}(x) tends to avoid regions where q^r(x|y=0) is small ⇒ p_{gθ}(x) misses modes of p_data(x)
• Symmetry of the JSD: does not affect the mode-missing behavior
• Resemblance to variational inference: treating y as visible and x as latent (as in ADA), q^r(x|y) plays the role of the posterior, p_{θ_0}(x) the prior, and p_θ(x|y) the variational distribution that approximates the posterior
© Petuum,Inc. 138
GANs: minimizing KLD
• Lemma 1 makes no assumption of an optimal discriminator q_φ(y|x)
• Previous results usually rely on a (near-)optimal discriminator:
  q_{φ*}(y = 1|x) = p_data(x) / ( p_data(x) + p_{gθ}(x) )
• The optimality assumption is impractical: limited expressiveness of D_φ [Arora et al., 2017]
• Our result is a generalization of the previous theorem [Arjovsky & Bottou, 2017]
• Plugging the optimal discriminator into the update equation of Lemma 1 recovers that theorem:
  ∇_θ E_{p_θ(x|y)p(y)}[ log q_{φ_0}^r(y|x) ] |_{θ=θ_0} = −∇_θ [ ½ KL( p_{gθ} ‖ p_data ) − JSD( p_{gθ} ‖ p_data ) ] |_{θ=θ_0}
© Petuum,Inc. 140
GANs and VAEs can be seen as extending the sleep and wake phases of wake-sleep, respectively: VAEs extend the wake phase (minimized w.r.t. both φ and θ), while GANs extend the sleep phase, minimizing the reversed objective w.r.t. φ and a y-switched objective KL( p_θ(x|y) ‖ q(x|y) ) minus the JSD term w.r.t. θ.
InfoGAN
• GANs don't offer the functionality of inferring the code z given data x
• InfoGAN [Chen et al., 2016]
• Introduce an inference model Q(z|x) (parameters η)
• Augment the objectives of GANs by additionally inferring z:
  max_D L_D = E_{x∼p_data(x)}[ log D(x) ] + E_{x∼G(z), z∼p(z)}[ log(1 − D(x)) ]
  max_{G,Q} L_{G,Q} = E_{x∼G(z), z∼p(z)}[ log D(x) + log Q(z|x) ]
[Figure: graphical models of GANs and InfoGAN]
© Petuum,Inc. 141
InfoGAN: new formulation
GANs:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_φ(y|x) ]
  max_θ L_θ = E_{p_θ(x|y)p(y)}[ log q_φ^r(y|x) ]
InfoGAN:
  max_φ L_φ = E_{p_θ(x|y)p(y)}[ log q_η(z|x, y) q_φ(y|x) ]
  max_{θ,η} L_{θ,η} = E_{p_θ(x|y)p(y)}[ log q_η(z|x, y) q_φ^r(y|x) ]
© Petuum,Inc. 143
InfoGAN: new formulation
• Full reconstruction of both z and y is recovered by combining q_η(z|x, y) with q_φ(y|x):
  max_{θ,η} L_{θ,η} = E_{p_θ(x|y)p(y)}[ log q_η(z|x, y) q_φ^r(y|x) ]
• The analog of Lemma 1 holds:
  ∇_θ E_{p_θ(x|y)p(y)}[ log q_{η_0}(z|x, y) q_{φ_0}^r(y|x) ] |_{θ=θ_0}
  = −∇_θ [ E_{p(y)}[ KL( p_θ(x|y) ‖ q^r(x|z, y) ) ] − JSD( p_θ(x|y=0) ‖ p_θ(x|y=1) ) ] |_{θ=θ_0}
• As a side result, interpreting the data space x as latent immediately relates InfoGAN to Adversarial Autoencoders (AAE) [35] and Predictability Minimization (PM) [50]: InfoGAN is precisely an AAE that treats the data space x as latent (and adversarially regularized) and the code space z as visible; the graphical model of AAE is obtained by simply swapping x and z in InfoGAN.
• Next we show correspondences between GANs/InfoGAN and VAEs
© Petuum,Inc. 144
formulations of AAE to the supplementary materials. Further, instead of considering x and z as data
Relates VAEs with GANs
• Resemblance of GAN generator learning to variational
inference
• Suggests strong relations between VAEs and GANs
© Petuum,Inc. 145
Recap: conventional formulation of VAEs
• Objective:
  max_{θ,η} L^vae_{θ,η} = E_{p_data(x)}[ E_{q̃_η(z|x)}[ log p̃_θ(x|z) ] − KL( q̃_η(z|x) ‖ p̃(z) ) ]
• p̃(z): prior over z
• p̃_θ(x|z): generative model
• q̃_η(z|x): inference model
• Only uses real examples from p_data(x); lacks an adversarial mechanism
• To align with GANs, let's introduce the real/fake indicator y and an adversarial discriminator: assume a perfect discriminator q_*(y|x) which always predicts y = 1 with probability 1 given real examples and y = 0 given generated samples, and let q_*^r(y|x) = q_*(1 − y|x) be the reversed distribution
© Petuum,Inc. 146
VAEs: new formulation
• Assume a perfect discriminator q_*(y|x):
  • q_*(y = 1 | x) = 1 if x is a real example
  • q_*(y = 0 | x) = 1 if x is a generated sample
  • q_*^r(y|x) := q_*(1 − y | x)
• Generative distribution (accounting for the uncertainty of generating x given z):
  p_θ(x|z, y) = p_θ(x|z) for y = 0;  p_data(x) for y = 1
• Let p_θ(z, y|x) ∝ p_θ(x|z, y) p(z|y) p(y)
• Lemma 2:
  L^vae_{θ,η} = 2 · E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[ log p_θ(x|z, y) ] − KL( q_η(z|x,y) q_*^r(y|x) ‖ p(z|y) p(y) ) ]
             = 2 · E_{p_{θ_0}(x)}[ −KL( q_η(z|x,y) q_*^r(y|x) ‖ p_θ(z, y|x) ) ]
• The resulting KL divergence closely relates to that in GANs and InfoGAN, with the generative module p_θ(x|z, y) and the inference networks q_η(z|x, y) q^r(y|x) placed in the opposite directions, and with inverted hidden/visible treatments of (z, y) and x
[Figure: graphical models of (a) InfoGAN, which adds conditional generation of the code z with distribution q_η(z|x, y); (b) VAEs, obtained by swapping InfoGAN's generation and inference processes (solid-line vs. dashed-line arrows); (c) Adversarial Autoencoders, obtained by swapping data x and code z in InfoGAN]
© Petuum,Inc. 147
Lemma 2: sketch of proof
• Lemma 2:
  L^vae_{θ,η} = 2 · E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[ log p_θ(x|z, y) ] − KL( q_η(z|x,y) q_*^r(y|x) ‖ p(z|y) p(y) ) ]
             = 2 · E_{p_{θ_0}(x)}[ −KL( q_η(z|x,y) q_*^r(y|x) ‖ p_θ(z, y|x) ) ]
• Proof sketch:
  1) Expand E_{p_{θ_0}(x)}[·] = ½ E_{p_{θ_0}(x|y=1)}[·] + ½ E_{p_{θ_0}(x|y=0)}[·]
  2) ½ E_{p_{θ_0}(x|y=0)}[·] is constant: due to the perfect discriminator q_*^r(y|x), generated samples are blocked out of the training loss
  3) ½ E_{p_{θ_0}(x|y=1)}[·] = ½ E_{p_data(x)}[·]: recovers the conventional formulation
© Petuum,Inc. 148
Proof of Lemma 2
For the reconstruction term:
  E_{p_{θ_0}(x)}[ E_{q_η(z|x,y) q_*^r(y|x)}[ log p_θ(x|z, y) ] ]
  = ½ E_{p_{θ_0}(x|y=1)}[ E_{q_η(z|x, y=0), y=0∼q_*^r(y|x)}[ log p_θ(x|z, y=0) ] ]
  + ½ E_{p_{θ_0}(x|y=0)}[ E_{q_η(z|x, y=1), y=1∼q_*^r(y|x)}[ log p_θ(x|z, y=1) ] ]
  = ½ E_{p_data(x)}[ E_{q̃_η(z|x)}[ log p̃_θ(x|z) ] ] + const,
where y = 0 ∼ q_*^r(y|x) means q_*^r(y|x) predicts y = 0 with probability 1. Both q_η(z|x, y=1) and p_θ(x|z, y=1) are constant distributions without free parameters to learn; q_η(z|x, y=0) = q̃_η(z|x) and p_θ(x|z, y=0) = p̃_θ(x|z).
For the KL prior regularization term:
  E_{p_{θ_0}(x)}[ KL( q_η(z|x,y) q_*^r(y|x) ‖ p(z|y) p(y) ) ]
  = E_{p_{θ_0}(x)}[ ∫ q_*^r(y|x) KL( q_η(z|x,y) ‖ p(z|y) ) dy + KL( q_*^r(y|x) ‖ p(y) ) ]
  = ½ E_{p_{θ_0}(x|y=1)}[ KL( q_η(z|x, y=0) ‖ p(z|y=0) ) + const ] + ½ E_{p_{θ_0}(x|y=0)}[ const ]
  = ½ E_{p_data(x)}[ KL( q̃_η(z|x) ‖ p̃(z) ) ] + const.
Combining the two terms recovers the conventional VAE objective.
© Petuum,Inc. 149
GANs vs VAEs side by side
Here x denotes a real example or a generated sample and z is the respective latent code. For the generated-sample domain (y = 0), the implicit distribution p_θ(x|y=0) is defined by the prior of z and the generator G_θ(z), also written p_{gθ}(x); for the real-example domain (y = 1) the code space and generator are degenerated and we are directly presented with p_data(x), which is an implicit distribution allowing efficient empirical sampling. The discriminator is trained to infer the probability that x comes from the real data domain, i.e., q_φ(y = 1|x) = D(x).
• Generative distribution:
  GANs (InfoGAN): p_θ(x|y) = p_{gθ}(x) for y = 0, p_data(x) for y = 1
  VAEs: p_θ(x|z, y) = p_θ(x|z) for y = 0, p_data(x) for y = 1
• Discriminator distribution:
  GANs: q_φ(y|x)
  VAEs: q_*(y|x), perfect, degenerated
• z-inference model:
  GANs: q_η(z|x, y) (InfoGAN)
  VAEs: q_η(z|x, y)
• KLD to minimize:
  GANs: min_θ KL( p_θ(x|y) ‖ q^r(x|z, y) )  ~  min_θ KL( P_θ ‖ Q )
  VAEs: min_θ KL( q_η(z|x, y) q_*^r(y|x) ‖ p_θ(z, y|x) )  ~  min_θ KL( Q ‖ P_θ )
• For real examples (y = 1), q_η(z|x, y=1) and p_θ(x|z, y=1) are constant distributions, and the perfect discriminator q_*^r blocks fake samples, so only real examples are effective in the VAE loss, which is identical to the conventional objective; VAEs can be extended to also leverage fake samples.
© Petuum,Inc. 150
VAE/GAN joint models: previous work has explored combining VAEs and GANs; the unified view also relates both to the Wake-Sleep algorithm (WS).
Discriminator distribution: GANs use q_φ(y|x); VAEs use the perfect, degenerated q_*(y|x).
© Petuum,Inc. 160
AAVAE: empirical results
• Evaluated test-set variational lower bound on MNIST
• The higher the better
• Classification accuracy of semi-supervised VAEs (SVAE) and the adversary-activated extension (AASVAE) on the MNIST test set, with varying amounts of real labeled training examples:
            1%              10%
  SVAE      0.9412±.0039    0.9768±.0009
  AASVAE    0.9425±.0045    0.9797±.0010
• Used 1% and 10% data labels in MNIST
© Petuum,Inc. 162
Mutual exchanges of ideas
• AAVAE enhances VAEs with ideas from GANs
• We can also enhance GANs with ideas from VAEs
• VAEs maximize a variational lower bound of log likelihood
• Importance weighted VAE (IWAE) [Burda et al., 2016]
• Maximizes a tighter lower bound through importance sampling
• The variational inference interpretation of GANs allows the
importance weighting method to be straightforwardly applied
to GANs
• Just copy the derivations of IWAE side by side, with only minor adaptations!
© Petuum,Inc. 163
Importance weighted GANs (IWGAN)
• Generator learning in vanilla GANs:

$$
\max_\theta \; \mathbb{E}_{x \sim p_\theta(x|y)\,p(y)}\big[\log q^r_{\phi_0}(y|x)\big].
$$

• As in GANs, only y = 0 (i.e., generated samples) is effective for learning the parameters θ
• The importance weighted extension assigns higher weights to samples that are more realistic and fool the discriminator better

Intuitively, the algorithm assigns higher weights to those samples that are more realistic and fool the discriminator better, which is consistent with IWAE that emphasizes more on code states providing better reconstructions. In practice, the k samples in Eq.(15) correspond to a minibatch of samples in the standard GAN update. Thus the only computational cost added by the importance weighting method…
© Petuum,Inc. 164
IWGAN: empirical results
• Applied the importance weighting method to
• vanilla GANs
• class-conditional GANs (CGAN)
• CGAN adds one dimension to code k to represent the class label
• The derivations of the IW extension remain the same as in vanilla GANs
© Petuum,Inc. 165
IWGAN: empirical results
• Evaluated on MNIST and SVHN
• Used pretrained NN to evaluate:
• Inception scores of samples from GANs and IW-GAN
• Confidence of a pre-trained classifier on generated samples + diversity of
generated samples
• Classification accuracy of samples from CGAN and IW-CGAN

Table 2: Left: Inception scores of vanilla GANs and the importance weighted extension. Middle: Classification accuracy of the generations by class-conditional GANs and the IW extension. Right: Classification accuracy of semi-supervised VAEs and the adversary activated extension on the MNIST test set, with varying size of real labeled training examples.

           MNIST      SVHN                 MNIST        SVHN                   1%       10%
GAN        8.34±.03   5.18±.03   CGAN      0.985±.002   0.797±.005   SVAE      0.9412   0.9768
IWGAN      8.45±.04   5.34±.03   IWCGAN    0.987±.002   0.798±.006   AASVAE    0.9425   0.9797
© Petuum,Inc. 166
Recap: Variational Inference
Maximize the variational lower bound ℒ(θ, η; x), or equivalently, minimize the free energy.
© Petuum,Inc. 168
Symmetric modeling of latent & visible variables
• Naturally an implicit distribution
  • Easy to sample from, hard to evaluate likelihood
• New tools allow implicit priors and models
  • GANs, density ratio estimation, approximate Bayesian computations
  • E.g., adversarial autoencoder [Makhzani et al., 2015] replaces the Gaussian prior of vanilla VAEs with implicit priors
© Petuum,Inc. 171
Symmetric modeling of latent & visible variables
[Diagram: z ∼ prior distr.; x ∼ generator distribution given z; the Generation model (z → x) and the Inference model (x → z) form a symmetric pair]
Symmetric modeling of latent & visible variables
• No difference in terms of formulations
  • with implicit distributions and black-box NN models
• Difference in terms of space complexity
  • depends on the problem at hand
  • choose appropriate tools: implicit/explicit distribution, adversarial/maximum-likelihood optimization, …
[Diagram: symmetric treatment — a generation model (prior distr. over z → data distr. over x) and an inference model (data distr. → prior distr.), each trainable with maximum-likelihood or adversarial losses]

From the paper: …activating the adversary mechanism on the VAE models. We see that the adversary activated models consistently outperform their respective base models. Generally, larger improvement can be obtained with a smaller set of real training data. Table 2, right panel, further shows the classification accuracy of semi-supervised VAE and its adversary activated variants with different sizes of labeled training data. We can observe improved performance of the AA-SVAE model. The full results of standard deviations are reported in the supplementary materials.

6 Discussions
Our new interpretations of GANs and VAEs have revealed strong connections between them, and linked the emerging new approaches to the classic wake-sleep algorithm. The generality of the proposed formulation offers a unified statistical insight of the broad landscape of deep generative modeling, and encourages mutual exchange of improvement ideas across research lines. It is interesting to further generalize the framework to connect to other learning paradigms such as reinforcement learning, as previous works have started to explore [14, 44]. GANs simultaneously learn a metric (defined by the discriminator) to guide the generator learning, which resembles the iterative teacher-student distillation framework [23, 24] where a teacher network is simultaneously learned from structured knowledge (e.g., logic rules) and provides knowledge-informed learning signals for student networks of interest. It will be intriguing to build formal connections between these approaches and enable incorporation of structured knowledge in deep generative modeling.

Symmetric view of generation and inference. Traditional modeling approaches usually distinguish between latent and visible variables clearly and treat them in very different ways. One of the key thoughts in our formulation is that it is not necessary to make a clear boundary between the two types of variables (and between generation and inference); instead, treating them as a symmetric pair helps with modeling and understanding. For instance, we treat the generation space x in GANs as latent, which immediately reveals the connection between GANs and adversarial domain adaptation, and provides a variational inference interpretation of the generation. A second example is the classic wake-sleep algorithm, where the wake phase reconstructs visibles conditioned on latents, while the sleep phase reconstructs latents conditioned on visibles (i.e., generated samples). Hence, visible and …
© Petuum,Inc. 173
Part-II: Conclusions
Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, "On Unifying Deep Generative Models", arXiv 1706.00550
© Petuum,Inc. 174
Plan
• Statistical And Algorithmic Foundation and Insight of Deep
Learning
© Petuum,Inc. 176
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries
© Petuum,Inc. 177
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries
© Petuum,Inc. 178
A Computational Layer in DL
• A layer in a neural network is composed of a few finer
computational operations
• A layer l has input X and output Y, and transforms X into Y following:
  Z = WX + b,   Y = σ(Z)   (σ: a nonlinear activation, e.g. ReLU)
• Denote the transformation of layer l as f_l, which can be represented as a dataflow graph: the input X flows through the layer
  X → f_l → Y
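As a concrete illustration (our own minimal NumPy sketch, not code from the slides, assuming a ReLU activation), one fully connected layer's forward transform looks like:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def layer_forward(X, W, b):
        """Forward pass of one fully-connected layer: Z = W X + b, Y = relu(Z)."""
        Z = W @ X + b          # affine transform
        Y = relu(Z)            # elementwise nonlinearity
        cache = (X, W, Z)      # values needed later by the backward pass
        return Y, cache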
© Petuum,Inc. 179
From Layers to Networks
• A neural network is thus a stack of layers l = 1, …, L, where every layer represents a function transform f_l
• The forward computation proceeds by sequentially executing f_1, f_2, f_3, …, f_L
  x → f_1 → f_2 → f_3 → ⋯ → f_L
© Petuum,Inc. 180
A Computational Layer in DL
• Denote the backward pass through layer l as b_l
• b_l derives the gradient of the input X (dX), given the gradient dY of the output Y, as well as the gradients of the parameters W, b
• dX will be the backward input of the previous layer l − 1
• The backward pass can be thought of as a backward dataflow where gradients flow through the layer
  dX ← b_l ← dY
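Continuing the forward sketch above (again ours, not from the slides), the corresponding backward transform of the affine + ReLU layer computes dX, dW, db from dY:

    def layer_backward(dY, cache):
        """Backward pass: given dY = dL/dY, return dX, dW, db."""
        X, W, Z = cache
        dZ = dY * (Z > 0)      # ReLU gradient: pass through only where Z > 0
        dW = np.outer(dZ, X)   # dL/dW
        db = dZ                # dL/db
        dX = W.T @ dZ          # dL/dX, fed to the previous layer l-1
        return dX, dW, db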
© Petuum,Inc. 181
Backpropagation through a NN
• The backward computation proceeds by sequentially executing b_L, b_{L−1}, b_{L−2}, …, b_1
  b_L → b_{L−1} → ⋯ → b_2 → b_1
© Petuum,Inc. 182
A Layer as a Dataflow Graph
• Given the forward computation flow, gradients can be computed
by auto differentiation
• Automatically derive the backward gradient flow graph from the forward
dataflow graph
© Petuum,Inc. 183
A Network as a Dataflow Graph
• Gradients can be computed by auto differentiation
• Automatically derive the gradient flow graph from the forward dataflow
graph
  Forward:  f_1 → f_2 → ⋯ → f_L
  Backward: b_L → ⋯ → b_2 → b_1
© Petuum,Inc. 184
Gradient Descent via Backpropagation
• The computational workflow of deep learning
• Forward, which we usually also call inference: forward dataflow
• Backward, which derives the gradients: backward gradient flow
• Apply/update gradients and repeat
Backward
• Mathematically,
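Written out in our own notation (the slide's formula is not preserved in this transcript), the update applied at each iteration is the standard mini-batch gradient-descent step:

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta \cdot \frac{1}{|B|}\sum_{(x,y)\in B} \nabla_\theta\, \ell\big(f_\theta(x),\, y\big),
$$

where B is the current mini-batch and η the learning rate.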
© Petuum,Inc. 185
Program a neural network
• Define a neural network
• Define operations and layers: fully-connected? Convolution? Recurrent?
• Define the data I/O: read what data from where?
• Define a loss function/optimization objective: L2 loss? Softmax?
Ranking Loss?
• Define an optimization algorithm: SGD? Momentum SGD? etc
• Auto-differential Libraries will then take over
• Connect operations, data I/O, loss functions and trainer.
• Build forward dataflow graph and backward gradient flow graphs.
• Perform training and apply updates
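For instance, a minimal PyTorch-style sketch of this workflow (illustrative only — the slides are toolkit-agnostic — assuming a `loader` that yields (input, target) mini-batches):

    import torch
    import torch.nn as nn

    # 1. Define the network (operations and layers)
    model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
    # 2. Define the loss function and the optimization algorithm
    loss_fn = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # 3. The library builds the forward/backward graphs and applies updates
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()            # backward gradient flow (auto-differentiation)
        optimizer.step()           # apply the gradients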
© Petuum,Inc. 186
Outline
• Deep Learning as Dataflow Graphs
• Auto-differentiable Libraries
© Petuum,Inc. 187
Auto-differential Libraries
• Auto-differential Library automatically derives the gradients following the back-
propagation rule.
• A lot of auto-differentiation libraries have been developed:
• So-called Deep Learning toolkits
© Petuum,Inc. 188
Deep Learning Toolkits
• They are adopted differently in different domains
• For example
Vision
NLP
© Petuum,Inc. 189
Deep Learning Toolkits
• They are also designed differently
• Symbolic vs. imperative programming
Imperative Symbolic
© Petuum,Inc. 190
Deep Learning Toolkits
• Symbolic vs. imperative programming
• Symbolic: write symbols to assemble the networks first, evaluate later
• Imperative: immediate evaluation
Symbolic Imperative
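A toy contrast of the two styles (our own sketch, not the code shown in the slide's figure):

    import numpy as np

    # Imperative: every line is evaluated immediately
    a = np.ones(4)
    b = a * 2
    c = b + 1          # c already holds concrete values here

    # Symbolic: first assemble an expression graph, evaluate later
    class Var:
        def __init__(self, name): self.name = name
        def eval(self, env): return env[self.name]

    class Add:
        def __init__(self, x, y): self.x, self.y = x, y
        def eval(self, env): return self.x.eval(env) + self.y.eval(env)

    expr = Add(Var("b"), Var("one"))          # nothing is computed yet
    result = expr.eval({"b": b, "one": 1})    # evaluation happens only here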
© Petuum,Inc. 191
Deep Learning Toolkits
• Symbolic
• Good
• easy to optimize (e.g. distributed, batching, parallelization) for developers
• More efficient
• Bad
• The way of programming might be counter-intuitive
• Hard to debug for user programs
• Less flexible: you need to write symbols before actually doing anything
• Imperative:
• Good
• More flexible: write one line, evaluate one line
• Easy to program and easy to debug: because it matches the way we use C++ or python
• Bad
• Less efficient
• More difficult to optimize
© Petuum,Inc. 192
Deep Learning Toolkits
• They are also designed differently
• For another example, dataflow graphs vs. layer-by-layer construction
© Petuum,Inc. 194
Static vs. Dynamic Dataflow Graphs
© Petuum,Inc. 195
Static vs. Dynamic Dataflow Graphs
© Petuum,Inc. 196
Static vs. Dynamic Dataflow Graphs
© Petuum,Inc. 197
Introducing DyNet
• Designed for dynamic deep learning workflow, e.g.
• Tree-LSTM for neural machine translation, where each sentence defines a structure that
corresponds to the computational flow
• Graph-LSTM for image parsing, where each image has a specific connection between
segments
• etc.
© Petuum,Inc. 198
Key Ingredients in DyNet
• Concept
• Separate parameter declaration and graph construction
• Declare trainable parameters and construct models first
• Parameters, e.g. the weight matrices in an LSTM unit.
• Construct a model as a collection of trainable parameters
• Construct computation graphs
• Allocate a few nodes for our computation (node can be seen as layers in NN)
• Specify the dataflow graph by connecting nodes together
• If necessary, different graphs for different input samples
• Conclusion: Define parameter once, but define graphs dynamically depending on inputs
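A rough DyNet-style sketch of this pattern (API details may differ across DyNet versions; the point is that parameters are declared once while a fresh graph is built per input):

    import dynet as dy

    pc = dy.ParameterCollection()          # declare trainable parameters once
    p_W = pc.add_parameters((8, 4))
    p_b = pc.add_parameters((8,))
    trainer = dy.SimpleSGDTrainer(pc)

    def step(x_values):
        dy.renew_cg()                      # a new computation graph for this input
        W, b = dy.parameter(p_W), dy.parameter(p_b)
        h = dy.rectify(W * dy.inputVector(x_values) + b)
        loss = dy.squared_norm(h)          # placeholder loss, just to close the graph
        loss.backward()
        trainer.update()
        return loss.value()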
© Petuum,Inc. 199
Key Ingredients in DyNet
• Backend and programing model
• Graph construction
• In TensorFlow, constructing a graph has a considerable overhead.
• TensorFlow users avoid defining graphs repeatedly
• DyNet: highly optimized graph definition
• Little overhead defining a graph: good for dynamic neural networks.
• Easy to write recursive programs to define graphs (very effective for many
dynamic networks, such as tree-LSTM or graph-LSTM).
© Petuum,Inc. 200
Key Ingredients in DyNet
• A visual comparison
© Petuum,Inc. 201
Part-III (2)
Distributed Deep Learning
© Petuum,Inc. 202
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory
© Petuum,Inc. 203
Review – DL toolkits on single machine
• Using GPU is a must
• A small number of GPU-equipped machines can achieve satisfactory speedup compared to CPU clusters with thousands of cores
  • e.g., a cluster of 8 GPU-equipped machines vs. a cluster of 2000 CPU cores
  • GPU machines are also more readily available to researchers
© Petuum,Inc. 204
Review – DL toolkits on single machine
• However, using a single GPU is far from sufficient
• average-sized deep networks can take days to train on a single GPU when
faced with 100s of GBs to TBs of data
• Demand faster training of neural networks on ever-larger datasets
© Petuum,Inc. 205
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory
© Petuum,Inc. 206
Challenges
• Communication challenges
• GPUs are at least one order of magnitude faster than CPUs
• High communication load raises the network communication as the main bottleneck
given limited bandwidth of commodity Ethernet
• Managing the computation and communication in a distributed GPU cluster often
complicates the algorithm design
© Petuum,Inc. 207
Let’s see what causes the problem
• Deep Learning on a single node – an iterative-convergent
formulation
Backward
Apply gradients
Forward Data
Forward and backward are the main computation (99%) workload of deep
learning programs.
© Petuum,Inc. 209
Distributed Deep Learning
• Distributed DL: parallelize DL training using multiple machines.
• i.e. we want to accelerate the heaviest workload (in the box) to
multiple machines Backward
Forward Data
Forward and backward are the main computation (99%) workload of deep
learning programs.
© Petuum,Inc. 210
Data parallelism with stochastic gradient
descent
• We usually seek a parallelization strategy called data parallelism, based
on SGD
• We partition data into different parts
• Let different machines compute the gradient updates on different data partitions
• Then aggregate/sync.
[Diagram: data partitioned across workers (e.g. Worker 3, Worker 4 shown); each computes gradient updates on its partition and syncs via one or more machines]
© Petuum,Inc. 211
Data Parallel SGD
• Data parallel stochastic gradient descent
• Data-parallelism requires every worker to have read and write access to the shared model parameters θ, which causes communication among workers; in total P workers
[Diagram: Worker 1–4 each hold data partition p, compute gradients Δθ_p, and exchange θ / Δθ_p with the parameter server (PS)]
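A schematic of one synchronous data-parallel SGD step with a central parameter server (a minimal sketch in plain Python, assuming per-worker gradient functions; not any particular system's API):

    import numpy as np

    def data_parallel_sgd_step(theta, workers, lr):
        """`workers` is a list of callables; workers[p](theta) returns the gradient
        computed on data partition p with the current parameters."""
        grads = [compute_grad(theta) for compute_grad in workers]  # run in parallel in practice
        delta = np.mean(grads, axis=0)       # server aggregates the P gradients
        return theta - lr * delta            # server updates and broadcasts new parameters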
Ho et al., 2013, Wei et al. 2015, Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 214
Parameter Server
• Parameter server has been successful for CPU-based deep
learning
• Google Distbelief, Dean et al. 2012
• Scale up to thousands of CPU machines and 16000 CPU cores
• SSPTable, Ho et al, 2013
• Stale-synchronous parallel consistency model
• Microsoft Adam, Chilimbi et al. 2014
• 63 machines, state-of-art results on ImageNet 22K
• Bosen, Wei et al. 2015
• Managed communication
© Petuum,Inc. 215
Parameter Server on GPUs
• Directly applying a parameter server to GPU-based distributed deep learning will underperform (as we will show later).
• GPU is too fast
• Ethernet bandwidth is limited, and has latency
• For example
• AlexNet: 61.5M float parameters, 0.25s/iteration on Geforce Titan X
(batchsize = 256)
• Gradient generation rate: 240M float/(s*GPU)
• Parallelize it over 8 machines each w/ one GPU using PS.
• To ensure the computation is not blocked on the GPU (i.e. linear speed-up with
additional nodes)
• As a worker: send 240M floats/s and pull back 240M floats/s (at least)
• As a server: receive 240M * 8 floats/s and send back 240M * 8/s (at least)
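To put these rates in network terms (our arithmetic, using the numbers above and 4-byte floats): 240M floats/s ≈ 960 MB/s ≈ 7.7 Gbit/s in each direction per worker, and roughly 8 × 7.7 ≈ 61 Gbit/s in and out of the server machine — well beyond commodity 1–10 Gbit Ethernet.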
© Petuum,Inc. 217
Parameter Server on GPUs
The problem is more severe than described above
• We only use 8 nodes (which is small). How about 32,128, or even 256?
• We haven’t considered other issues (which might be also
troublesome), e.g.
• Memory copy between DRAM and GPU will have a non-trivial cost
• The Ethernet might be shared with other tasks, i.e. available bandwidth is even
less.
• Burst communication happens very often on GPUs (as we will explain later).
© Petuum,Inc. 218
Address the Communication Bottleneck
• A simple fact:
• Communication time may be reduced, but cannot be eliminated (of
course)
• Therefore, possible ideas to address the communication
bottleneck
• Hide the communication time by overlapping it with the computation
time
• Reduce the size of the messages that need to be communicated
© Petuum,Inc. 219
Address the Communication Bottleneck
• A simple fact:
• Communication time may be reduced, but cannot be eliminated (of
course).
• Therefore, possible ideas to address the communication
bottleneck
• Hide the communication time by overlapping it with the
computation time
• Reduce the size of the messages that need to be communicated
© Petuum,Inc. 220
Overlap Computation and Communication
• Revisit the computation flow of BP on a single node
  • b_l: backward computation through layer l
  • F_t: forward and backward computation at iteration t
  [Diagram: backward passes b_1, b_2, …, b_L within iteration t, followed by iterations t+1, t+2, …]
© Petuum,Inc. 223
WFBP: Wait-free backpropagation
• Idea: overlap computation and communication by utilizing concurrency
• Pipelining the update (communication) and computation operations — see the sketch below
[Diagram: instead of communicating all layers' gradients and parameters only after the full backward pass b_1 … b_L, WFBP reschedules so that each layer's communication starts as soon as that layer's backward computation finishes, overlapping it with the remaining backward computation]
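A schematic of the idea (our own pseudo-sketch, not Poseidon's actual API), assuming non-blocking `push_async`/`pull_async` communication primitives:

    def backward_with_wfbp(layers, dY, push_async, pull_async):
        """Backward pass that overlaps per-layer communication with computation.
        push_async(l, grads) sends layer l's gradients without blocking;
        pull_async(l) returns a handle whose .wait() yields fresh parameters."""
        handles = {}
        grad = dY
        for l in reversed(range(len(layers))):       # backward through the layers
            grad, dW, db = layers[l].backward(grad)
            push_async(l, (dW, db))                  # start this layer's communication now
            handles[l] = pull_async(l)               # request updated parameters
        for l, h in handles.items():                 # before the next forward pass,
            layers[l].set_params(h.wait())           # wait for the updated parameters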
Zhang et al., 2015, Zhang et al. 2017 © Petuum,Inc. 225
WFBP: Distributed Wait-free backpropagation
• How does WFBP perform?
• Using Caffe as an engine:
[Plot: ~50% reduction of the communication bottleneck]
Zhang et al. 2017
© Petuum,Inc. 226
WFBP: Distributed Wait-free backpropagation
• Observation: Why DWBP would be effective
• More statistics of modern CNNs
  [Figure: per-layer parameter and computation statistics of modern CNNs]
© Petuum,Inc. 229
Introducing Sufficient Factor Broadcasting
• Matrix-parametrized models
Multiclass Logistic
Regression Distance Metric Learning
Feature dim. Feature dim.
Dictionary #neurons in
size layer fl
© Petuum,Inc. 230
Distributed Learning of MPMs
• Learning MPMs by communicating parameter matrices between server
and workers
• Dean and Ghemawat, 2008; Dean et al, 2012; Sindhwani and Ghoting, 2012; Gopal
and Yang, 2013; Chilimbi et al, 2014, Li et al, 2015
• High communication cost and large synchronization delays
Multiclass Logistic
Regression Neural Network (AlexNet)
© Petuum,Inc. 231
Sufficient Factor (SF) Updates

The full parameter matrix update ΔW can be computed as the outer product of two vectors, ΔW = u vᵀ (u and v are called sufficient factors).

• Example: primal stochastic gradient descent (SGD)

$$
\min_W \; \frac{1}{N}\sum_{i=1}^{N} f(W a_i;\, b_i) + h(W),
\qquad
\Delta W = u v^\top,\quad
u = \frac{\partial f(W a_i, b_i)}{\partial (W a_i)},\quad v = a_i
$$

• Example: stochastic dual coordinate ascent (SDCA)

$$
\min_Z \; \frac{1}{N}\sum_{i=1}^{N} f_i^{*}(-z_i) + h^{*}\!\Big(\frac{1}{N}\, Z A^\top\Big),
\qquad
\Delta W = u v^\top,\quad u = \Delta z_i,\quad v = a_i
$$
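A small NumPy illustration of why this helps (ours, not from the slides): for a J×K parameter matrix, sending the two sufficient factors costs J+K numbers instead of J·K.

    import numpy as np

    J, K = 2000, 1000                       # e.g. #classes x feature dim in MLR
    a_i = np.random.randn(K)                # one training example's features
    u = np.random.randn(J)                  # stand-in for df/d(W a_i)

    delta_W = np.outer(u, a_i)              # full update: J*K = 2M numbers
    # Sufficient factor update: transmit only (u, a_i), i.e. J + K = 3000 numbers,
    # and reconstruct the identical delta_W on the receiver with one outer product.
    reconstructed = np.outer(u, a_i)
    assert np.allclose(delta_W, reconstructed)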
© Petuum,Inc. 232
Sufficient Factor Broadcasting:
P2P Topology + SF Updates
[Diagram: each worker computes sufficient vectors (u_p, v_p) from its training examples and broadcasts them to its peers; the aggregated update matrix is the sum of the outer products]
• Pre-update: sufficient vectors cannot be aggregated (unlike full update matrices, which can be summed before sending)
• Stochastic algorithms, mini-batch of C samples:
  • matrix representation: O(J·K) values per update
  • SV representation: O(C·(J + K)) values per update
Synchronization of Parameter Replicas
• Parameter server: transfer SVs instead of ΔW
• A cost comparison (size of one message, number of messages, network traffic) across three benchmark ML programs: Multiclass Logistic Regression (MLR), Distance Metric Learning (DML), Sparse Coding (SC)
  • Big parameter matrices with 6.5–8.6b entries (30+ GB), running on 12- and 28-machine clusters
  • 28-machine SFB finished in 2–7 hours
  • Up to 5.6x faster than 28-machine PS, 12.3x faster than 28-machine Spark
• PS cannot support SF communication, which requires decentralized storage
© Petuum,Inc. 237
Convergence Guarantee
• Results
© Petuum,Inc. 238
Convergence Guarantee
• Take-home message:
• Under full broadcasting, given a properly-chosen learning rate, all local worker parameters W_p eventually converge to stationary points (i.e. local minima) of the objective function, despite the fact that SV transmission can be delayed by up to a bounded number of iterations.
• Under partial broadcasting, the algorithm converges to a bounded neighbourhood of a stationary point as the number of iterations grows.
© Petuum,Inc. 239
Parameter Storage and
Communication Paradigms
Centralized storage — Master-slave topology (server + workers)
• Used with the centralized storage paradigm
• Advantage: bipartite topology is comms-efficient
• Disadvantage: need to code/manage clients and servers separately
• Popular for Parameter Servers: Yahoo LDA, Google DistBelief, Petuum PS, Project Adam, Li&Smola PS, …

Decentralized storage — P2P topology (workers only)
• Used with the decentralized storage paradigm
• Advantage: same code for all workers; no single point of failure; high elasticity to resource adjustment
• Disadvantage (?): high comms volume for a large # of workers
• Less well-explored due to perception of high communication overhead?
© Petuum,Inc. 241
Hybrid Updates: PS + SFB
• Hybrid communications:
Parameter Server +
Sufficient Factor
Broadcasting
• Parameter Server: Master-
Slave topology
• Sufficient factor
broadcasting: P2P topology
Figure from
Krizhevsky et al. 2012
• Idea
• Sync FC layers using SFB
• Sync Conv layer using PS
• Effectiveness
• It directly reduces the size
of messages in many
situations
• Is SFB always optimal?
• No, its communication
load increases
quadratically
• The right strategy: choose
PS whenever it results in
less communication
© Petuum,Inc. 245
Hybrid Communication
• A best of both worlds strategy
• For example, AlexNet parameters between FC6 and FC7
• Tradeoff between PS and SFB communication
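A back-of-the-envelope way to pick the cheaper channel per layer (our sketch; the constants are simplified and not Poseidon's exact cost model):

    def bytes_ps(J, K, P):
        """Parameter server: each of P workers pushes and pulls a J x K float matrix."""
        return P * 2 * J * K * 4

    def bytes_sfb(J, K, P, C):
        """SFB: each of P workers broadcasts C sufficient-vector pairs to P-1 peers."""
        return P * (P - 1) * C * (J + K) * 4

    def choose_channel(J, K, P, C):
        return "SFB" if bytes_sfb(J, K, P, C) < bytes_ps(J, K, P) else "PS"

    # Example: AlexNet's FC6-FC7 weight matrix is roughly 4096 x 4096
    print(choose_channel(4096, 4096, P=8, C=256))

Note how the SFB term grows quadratically in the number of workers P, matching the earlier remark that SFB is not always optimal and PS should be chosen whenever it results in less communication.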
© Petuum,Inc. 251
Hybrid Communication
• Adam’s strategy leads to communication bottleneck
• Pushing SFs to server is fine
• Pulling full matrices back will create a bottleneck on the server node.
© Petuum,Inc. 252
Introducing Poseidon
• Poseidon: An efficient communication architecture
• A distributed platform to amplify existing DL toolkits
[Diagram: existing DL toolkits run on top of the Poseidon platform]
© Petuum,Inc. 253
Poseidon’s position
• Design principles
• Efficient distributed platform for amplifying any DL toolkits
• Preserve the programming interface for any high-level toolkits
• i.e. distribute the DL program without changing any line of code
• Easy deployment, easy adoption.
© Petuum,Inc. 254
Poseidon System Architecture
[Architecture diagram: a per-machine coordinator issues allocation instructions and manages GPU/CPU data flow; syncers communicate through a KV store (parameter server) and SFB; Poseidon sits as a platform beneath existing DL toolkits]
© Petuum,Inc. 258
Outline
• Overview: Distributed Deep Learning on GPUs
• Challenges 1: Addressing the communication bottleneck
• Challenges 2: Handling the limited GPU memory
© Petuum,Inc. 259
What is the Issue
• Memory
• GPUs have dedicated memory
• For a DL training program to be efficient, its data must be placed in GPU memory
• GPU memory is limited compared to CPU memory, e.g. at most ~12 GB
• Memcpy between CPU and GPU is expensive – a memcpy takes about the same time as launching a GPU computation kernel
• Problems to be answered
  • How to avoid memcpy overhead between CPU and GPU?
  • How to train a gigantic network with very limited available memory?
© Petuum,Inc. 260
A Machine w/o GPU
Network
CPU cores Local
... storage
NIC
DRAM
(CPU memory)
© Petuum,Inc. 261
A Machine w/ GPU
Network
CPU cores Local
... storage
NIC
GPU device
GPU cores
DRAM
GPU
(CPU memory)
memory
(a few GB)
Small GPU memory
Expensive to copy between GPU/CPU mem
© Petuum,Inc. 262
Machine Learning on GPU
[Diagram: a staging memory buffers each input data batch read from the input data file (training data); input data, intermediate data, and parameter data reside in GPU memory]
© Petuum,Inc. 263
Deep Learning on GPU
[Diagram: the training batch and parameters live in GPU memory; the forward pass produces class probabilities (e.g. "Osprey" vs. "Accipiter")]
© Petuum,Inc. 264
[Diagram: GPU memory holds the training batch, the parameters, and the intermediate states]
© Petuum,Inc. 265
Why Memory is an Issue?
• Intermediate states occupy 90% of the GPU memory
• Intermediate states are proportional to the input batch size
• However,
• If you want high throughput, you must have large batch size (because
of the SIMD nature of GPUs)
• If you have large batch size, your GPU will be occupied by
intermediate states, which thereby limits your model size/depth
© Petuum,Inc. 266
Saving Memory: A Simple Trick
• Basic idea
• The fact: intermediate states are proportional to the batch size K
• Idea: achieve large batch size by accumulating gradients generated by smaller batch sizes
which are affordable in the GPU memory
• Solution (see the sketch after this list):
  • Partition K into M parts, each part has K/M samples
  • For iter = 1:M
    • Train with mini-batch size K/M
    • Accumulate the gradients on GPU without updating the model parameters
  • Update the model parameters all together when all M parts are finished
• Drawbacks
• What if the GPU still cannot afford the intermediate states even if K=1?
• Small batch size usually leads to insufficient use of GPUs’ computational capability
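In a PyTorch-style training loop this accumulation trick looks like the following (illustrative sketch, reusing the hypothetical `model`, `loss_fn`, `optimizer`, and `loader` from the earlier sketch):

    M = 4                                  # split the logical batch K into M micro-batches
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader): # each (x, y) is a micro-batch of K/M samples
        loss = loss_fn(model(x), y) / M    # scale so the summed gradient matches batch size K
        loss.backward()                    # gradients accumulate in the .grad buffers
        if (step + 1) % M == 0:
            optimizer.step()               # apply the accumulated gradient once per K samples
            optimizer.zero_grad()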
© Petuum,Inc. 267
Memory Management using CPU Memory
• Core ideas
• If the memory is limited, trade something for memory
• Trade extra computations for memory
• Trade other cost (e.g. memory exchange) for more available memory
• If the memory is limited, then get more
• model parallel
• CPU memory
© Petuum,Inc. 268
Memory Management using CPU Memory
• For each iteration (mini-batch):
  • a forward pass
  • then a backward pass
[Diagram: training images are read batch by batch; each layer's input data, intermediate data, and parameters are transferred between CPU and GPU as the pass proceeds, and the forward pass ends in class probabilities]
© Petuum,Inc. 277
Memory Management using CPU Memory
• What’s the best we can do with this strategy
• We only need 3 memory blocks (peak size) on GPU for:
• Input, Parameters, Output
• The whole training can process with ONLY these three blocks by
• Scheduling memcpy between CPU and GPU to be overlapped with computation
• Move in and out for each layer’s computation as training proceeds
peak
• Models up to 20 GB
© Petuum,Inc. 281
Elements of Modern AI
Data
Task
© Petuum,Inc. 282
Sys-Alg Co-design Inside!
Data
Task
Model
System
Platform
and Hardware
© Petuum,Inc. 283
Better Performance
• Fast and Real-Time
  • Orders of magnitude faster than Spark and TensorFlow
  • As fast as hand-crafted systems
• Any Scale
  • Perfect straight-line speedup with more computing devices
  • Spark and TensorFlow can slow down with more devices
• Low Resource
  • Turning a regular cluster into a super computer
  • Achieve AI results with much more data, but using fewer computing devices
  • Google Brain uses ~1000 machines whereas Petuum uses ~10 for the same job
[Charts: throughput of Spark vs. hand-crafted systems vs. PetuumOS; speedup vs. number of GPU computers]
© Petuum,Inc. 284
A Petuum Vision
Data
Task
© Petuum,Inc. 285