
Deep Learning of

Representations

AAAI Tutorial

Yoshua Bengio
July 14th, 2013, Bellevue, WA, USA


Outline of the Tutorial
1. Motivations and Scope
2. Algorithms
3. Practical Considerations
4. Challenges
See (Bengio, Courville & Vincent 2013),
"Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives",
and http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-aaai2013.html for a
pdf of the slides and a detailed list of references.
Ultimate Goals
AI
needs knowledge
needs learning
(involves priors + optimization/search)
needs generalization
(guessing where probability mass concentrates)
needs ways to fight the curse of dimensionality
(exponentially many configurations of the variables to consider)
needs disentangling the underlying explanatory factors
(making sense of the data)
Good features essential for successful ML
Handcrafting features vs learning them
Good representation: captures posterior belief about explanatory causes, disentangles these underlying factors of variation
Representation learning: guesses the features / factors / causes = good representation of observed data.
Representation Learning
[figure: raw input data, represented either by chosen (hand-crafted) features or by learned features, feeding into MACHINE LEARNING]
Deep Representation Learning
Learn multiple levels of representation
of increasing complexity/abstraction
[diagram: x -> h1 -> h2 -> h3]
potentially exponential gain in expressive power
brains are deep
humans organize knowledge in a compositional way
better MCMC mixing in the space of deeper representations (Bengio et al, ICML 2013)
they work! SOTA on industrial-scale AI tasks
(object recognition, speech recognition, language modeling, music modeling)

When the number of levels can be data-selected, this is a deep architecture.
A Good Old Deep Architecture: MLPs
Output layer: here predicting a supervised target
Hidden layers: these learn more abstract representations as you head up
Input layer: this has raw sensory inputs (roughly)
A (Vanilla) Modern Deep Architecture
Optional output layer: here predicting or conditioning on a supervised target
Hidden layers: these learn more abstract representations as you head up
Input layer: inputs can be reconstructed, filled-in or sampled
(2-way connections between layers)
ML 101. What We Are Fighting Against:
The Curse of Dimensionality
To generalize locally, need representative examples for all relevant variations!
Classical solution: hope for a smooth enough target function, or make it smooth by handcrafting good features / kernel
Easy Learning
[figure: a learned function prediction = f(x) fit through training examples (x, y), shown against the true unknown function]
Local Smoothness Prior: Locally Capture the Variations
[figure: learnt f(x) = interpolated prediction at a test point x; true function unknown; * = training example]
However, Real Data Are near Highly Curved Sub-Manifolds
Not Dimensionality so much as Number of Variations
Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line
Theorem: for a Gaussian kernel machine to learn some maximally varying functions over d inputs requires O(2^d) examples
(Bengio, Delalleau & Le Roux 2007)
Putting Probability Mass where Structure is Plausible
Empirical distribution: mass at training examples
Smoothness: spread mass around -- insufficient
Guess some 'structure' and generalize accordingly

Is there any hope to generalize non-locally?
Yes! Need good priors!
Six Good Reasons to Explore Representation Learning
Part 1
Learning features, not just handcrafting them
Most ML systems use very carefully hand-designed features and representations
Many practitioners are very experienced, and good, at such feature design (or kernel design)
"Machine learning" often reduces to linear models (including CRFs) and nearest-neighbor-like features/models (including n-grams, kernel SVMs, etc.)
Hand-crafting features is time-consuming, brittle, incomplete
The need for distributed representations
Clustering, nearest-neighbors, RBF SVMs, local non-parametric density estimation & prediction, decision trees, etc.
Parameters for each distinguishable region
# of distinguishable regions is linear in # of parameters
-> No non-trivial generalization to regions without examples

The need for distributed representations
Factor models, PCA, RBMs, neural nets, Sparse Coding, Deep Learning, etc.
Each parameter influences many regions, not just local neighbors
# of distinguishable regions grows almost exponentially with # of parameters
GENERALIZE NON-LOCALLY TO NEVER-SEEN REGIONS

The need for distributed representations
Multi-Clustering vs Clustering
[figure: clusters C1, C2, C3 over the input; non-mutually exclusive features/attributes create a combinatorially large set of distinguishable configurations]
Learning a set of features that are not mutually exclusive can be exponentially more statistically efficient than having nearest-neighbor-like or clustering-like models
Unsupervised feature learning
Today, most practical ML applications require (lots of) labeled training data
But almost all data is unlabeled
The brain needs to learn about 10^14 synaptic strengths ... in about 10^9 seconds
Labels cannot possibly provide enough information
Most information acquired in an unsupervised fashion
#3 How do humans generalize from very few examples?
They transfer knowledge from previous learning:
Representations
Explanatory factors
Previous learning from: unlabeled data + labels for other tasks
Prior: shared underlying explanatory factors, in particular between P(x) and P(y|x)
#3 Sharing Statistical Strength by Semi-Supervised Learning
Hypothesis: P(x) shares structure with P(y|x)
[figure: decision boundary learned purely supervised vs semi-supervised]
Learning multiple levels of representation
There is theoretical and empirical evidence in favor of multiple levels of representation
Exponential gain for some families of functions
Biologically inspired learning:
the brain has a deep architecture
the cortex seems to have a generic learning algorithm
humans first learn simpler concepts and then compose them into more complex ones
#4 Sharing Components in a Deep Architecture
Sum-product network
Polynomial expressed with shared components: advantage of depth may grow exponentially
Theorems in (Bengio & Delalleau, ALT 2011; Delalleau & Bengio, NIPS 2011)
Learning multiple levels of representation
(Lee, Pham, Largman & Ng, NIPS 2009; Lee, Grosse, Ranganath & Ng, ICML 2009)
Successive model layers learn deeper intermediate representations
[figure: Layer 1, Layer 2, Layer 3 features, up to high-level linguistic representations; parts combine to form objects]
Prior: underlying factors & concepts compactly expressed with multiple levels of abstraction
Handling the compositionality of human language and thought
Human languages, ideas, and artifacts are composed from simpler components
Recursion: the same operator (same parameters) is applied repeatedly on different states/components of the computation
Result after unfolding = deep computation / representation
[figures: a recurrent network unrolled over x_{t-1}, x_t, x_{t+1} with states z_{t-1}, z_t, z_{t+1}; a recursive network computing semantic representations over a parsed sentence]
(Bottou 2011; Socher et al 2011)
Multi-Task Learning
Generalizing better to new tasks (tens of thousands!) is crucial to approach AI
Deep architectures learn good intermediate representations that can be shared across tasks
(Collobert & Weston ICML 2008; Bengio et al AISTATS 2011)
Good representations that disentangle underlying factors of variation make sense for many tasks because each task concerns a subset of the factors
[figure: raw input x feeding shared hidden layers with task outputs y1, y2, y3 for Task A, Task B, Task C]
Prior: shared underlying explanatory factors between tasks
E.g. a dictionary, with intermediate concepts re-used across many definitions
#5 Combining Multiple Sources of Evidence with Shared Representations
Traditional ML: data = matrix
Relational learning: multiple sources, different tuples of variables
Share representations of the same types across data sources
Shared learned representations help propagate information among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, ImageNet...
(Bordes et al AISTATS 2012; ML J. 2013)
FACTS = DATA
Deduction = Generalization
[figure: relations (person, url, event) and (url, words, history) sharing the learned representations of person, url, event, words, history]
#5 Different object types represented in same space
Google: S. Bengio, J. Weston & N. Usunier
(IJCAI 2011, NIPS 2010, JMLR 2010, ML J. 2010)
#6 Invariance and Disentangling
Invariant features
Which invariances?
Alternative: learning to disentangle factors
Good disentangling -> avoid the curse of dimensionality
#6 Emergence of Disentangling
(Goodfellow et al. 2009): sparse auto-encoders trained on images:
some higher-level features more invariant to geometric factors of variation
(Glorot et al. 2011): sparse rectified denoising auto-encoders trained on bags of words for sentiment analysis:
different features specialize on different aspects (domain, sentiment)
WHY?
#6 Sparse Representations
Just add a sparsifying penalty on the learned representation (prefer 0s in the representation)
Information disentangling (compare to dense compression)
More likely to be linearly separable (high-dimensional space)
Locally low-dimensional representation = local chart
Hi-dim. sparse = efficient variable-size representation = data structure
Few bits of information vs many bits of information
Prior: only few concepts and attributes relevant per example
Deep Sparse Rectifier Neural Networks
(Glorot, Bordes and Bengio AISTATS 2011), following up on (Nair & Hinton 2010) softplus RBMs
Neuroscience motivations: leaky integrate-and-fire model; rectifier f(x) = max(0, x)
Machine learning motivations: sparse representations, sparse gradients; trains deep nets even w/o pretraining
Outstanding results by Krizhevsky et al 2012,
killing the state-of-the-art on ImageNet 1000:

                    1st choice   Top-5
2nd best                         27% err
Previous SOTA       45% err      26% err
Krizhevsky et al    37% err      15% err
Temporal Coherence and Scales
Hints from nature about different explanatory factors:
Rapidly changing factors (often noise)
Slowly changing factors (generally more abstract)
Different factors at different time scales
Exploit those hints to disentangle better!
(Becker & Hinton 1993; Wiskott & Sejnowski 2002; Hurri & Hyvarinen 2003; Berkes & Wiskott 2005; Mobahi et al 2009; Bergstra & Bengio 2009)
Bypassing the curse
We need to build compositionality into our ML models
Just as human languages exploit compositionality to give representations and meanings to complex ideas
Exploiting compositionality gives an exponential gain in representational power
Distributed representations / embeddings: feature learning
Deep architecture: multiple levels of feature learning
Prior: compositionality is useful to describe the world around us efficiently
Bypassing the curse by sharing statistical strength
Besides very fast GPU-enabled predictors, the main advantage of representation learning is statistical: the potential to learn from fewer labeled examples because of sharing of statistical strength:
Unsupervised pre-training and semi-supervised training
Multi-task learning
Multi-data sharing, learning about symbolic objects and their relations
Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Deep Learning 1st Place
[figure: test performance with raw data vs 1, 2, 3 and 4 layers of learned representation]
ICML 2011 workshop on Unsup. & Transfer Learning
NIPS 2011 Transfer Learning Challenge
Paper: ICML 2012
Why now?
Despite prior investigation and understanding of many of the algorithmic techniques ...
Before 2006 training deep architectures was unsuccessful
(except for convolutional neural nets when used by people who speak French)
What has changed?
New methods for unsupervised pre-training have been developed (variants of Restricted Boltzmann Machines = RBMs, regularized auto-encoders, sparse coding, etc.)
New methods to successfully train deep supervised nets even without unsupervised pre-training
Successful real-world applications, winning challenges and beating SOTAs in various areas, large-scale industrial apps
Major Breakthrough in 2006
Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed
Unsupervised feature learners:
RBMs
Auto-encoder variants
Sparse coding variants
[map: Montreal (Bengio), Toronto (Hinton), New York (Le Cun)]
2012: Industrial-scale success in speech recognition
Google uses DL in their Android speech recognizer (both server-side and on some phones with enough memory)
Microsoft uses DL in their speech recognizer
Error reductions on the order of 30%, a major progress
Deep Networks for Speech Recognition:
results from Google, IBM, Microsoft

Task                      Hours of training data   Deep net + HMM   GMM + HMM (same data)   GMM + HMM (more data)
Switchboard               309                      16.1             23.6                    17.1 (2k hours)
English Broadcast News    50                       17.5             18.8
Bing voice search         24                       30.4             36.2
Google voice input        5870                     12.3                                     16.0 (lots more)
YouTube                   1400                     47.6             52.3

(numbers taken from Geoff Hinton's June 22, 2012 Google talk)
Industrial-scale success in object recognition
Krizhevsky, Sutskever & Hinton NIPS 2012
Google incorporates DL in Google+ photo search, "A step across the semantic gap" (Google Research blog, June 12, 2013)
Baidu now offers similar services
[table repeated: previous SOTA 45% / 26% error vs Krizhevsky et al 37% / 15% error (1st choice / Top-5); 2nd best 27% Top-5]
[figure: example images labeled "baby" and "car"]
More Successful Applications
Microsoft uses DL for its speech rec. service (audio video indexing), based on Hinton/Toronto's DBNs (Mohamed et al 2012)
Google uses DL in its Google Goggles service, using Ng/Stanford DL systems, and in its Google+ photo search service, using deep convolutional nets
NYT talks about these: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html?_r=1
Substantially beating SOTA in language modeling (perplexity from 140 to 102 on Broadcast News) for speech recognition (WSJ WER from 16.9% to 14.4%) (Mikolov et al 2011) and translation (+1.8 BLEU) (Schwenk 2012)
SENNA: unsup. pre-training + multi-task DL reaches SOTA on POS, NER, SRL, chunking, parsing, with >10x better speed & memory (Collobert et al 2011)
Recursive nets surpass SOTA in paraphrasing (Socher et al 2011)
Denoising AEs substantially beat SOTA in sentiment analysis (Glorot et al 2011)
Contractive AEs SOTA in knowledge-free MNIST (0.8% err) (Rifai et al NIPS 2011)
Le Cun/NYU's stacked PSDs most accurate & fastest in pedestrian detection, and DL in the top 2 winning entries of the German road sign recognition competition
Already Many NLP Applications of DL
Language Modeling (Speech Recognition, Machine Translation)
Acoustic Modeling
Part-Of-Speech Tagging
Chunking
Named Entity Recognition
Semantic Role Labeling
Parsing
Sentiment Analysis
Paraphrasing
Question-Answering
Word-Sense Disambiguation
Neural Language Model
(Bengio et al NIPS 2000 and JMLR 2003, "A Neural Probabilistic Language Model")
Each word represented by a distributed continuous-valued code vector = embedding
Generalizes to sequences of words that are semantically similar to training sequences
Neural word embeddings - visualization
[figure: 2-D visualization of learned word embeddings, with semantically related words nearby]
Analogical Representations for Free
(Mikolov et al, ICLR 2013)
Semantic relations appear as linear relationships in the space of learned representations:
King - Queen = Man - Woman
Paris - France + Italy = Rome
[figure: embedding vectors for Paris, France, Italy, Rome]
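A minimal sketch of how such analogies can be queried from an embedding matrix with plain vector arithmetic and cosine similarity. The embedding array and vocabulary below are hypothetical placeholders, not the tutorial's actual trained model.

```python
import numpy as np

# Hypothetical embedding matrix: one row per word (stand-in for learned embeddings).
vocab = ["king", "queen", "man", "woman", "paris", "france", "italy", "rome"]
E = np.random.randn(len(vocab), 50)
E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalize rows for cosine similarity
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c, topn=1):
    """Return the word(s) closest to vector(b) - vector(a) + vector(c)."""
    q = E[idx[b]] - E[idx[a]] + E[idx[c]]
    q /= np.linalg.norm(q)
    scores = E @ q                               # cosine similarities
    for w in (a, b, c):                          # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return [vocab[i] for i in np.argsort(-scores)[:topn]]

print(analogy("france", "paris", "italy"))       # with real embeddings: ["rome"]
```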
More about depth

Architecture Depth
[figure: two computation graphs for the same function, one of depth 3 and one of depth 4]
Deep Architectures are More Expressive
Theoretical arguments:
2 layers of logic gates / formal neurons / RBF units = universal approximator
Theorems on the advantage of depth:
(Hastad et al 86 & 91; Bengio et al 2007; Bengio & Delalleau 2011; Braverman 2011)
Some functions compactly represented with k layers may require exponential size with 2 layers
RBMs & auto-encoders = universal approximators
[figure: shallow network with an exponentially wide hidden layer (units 1, 2, 3, ..., 2^n) vs a deeper, narrower network]
Deep computer program vs shallow computer program
[figure: a deep program where main calls sub1, sub2, sub3, which call subsub1, subsub2, subsub3, which call subsubsub1, subsubsub2, subsubsub3; vs a shallow program where subroutine1 includes subsub1 code, subsub2 code and subsubsub1 code, and subroutine2 includes subsub2 code, subsub3 code and subsubsub3 code, etc.]

Deep circuit vs shallow circuit
[figure: a deep circuit from input to output vs a shallow circuit with one very wide hidden layer (units 1, 2, 3, ..., n)]
Falsely reassuring theorems: one can approximate any reasonable (smooth, boolean, etc.) function with a 2-layer architecture
Representation Learning Algorithms
Part 2
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
... which we can feed into another logistic regression function
and it is the training criterion that will decide what those intermediate binary target variables should be, so as to make a good job of predicting the targets for the next layer, etc.
Before we know it, we have a multilayer neural network...
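A minimal numpy sketch of that picture: one layer of logistic regressions feeding another. Layer sizes and the random weights are arbitrary placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
x = rng.randn(3)                      # a toy 3-dimensional input

# Layer 1: four logistic regressions applied to the same input vector.
W1, b1 = 0.1 * rng.randn(4, 3), np.zeros(4)
h = sigmoid(W1 @ x + b1)              # vector of "outputs" = hidden features

# Layer 2: another logistic regression fed with those outputs.
W2, b2 = 0.1 * rng.randn(1, 4), np.zeros(1)
y = sigmoid(W2 @ h + b2)              # final prediction

print(h, y)  # the training criterion, not the designer, decides what h should encode
```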
Back-Prop
Compute the gradient of the example-wise loss wrt parameters
Simply applying the derivative chain rule wisely
If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule
If z = g(y) and y = f(x), then dz/dx = (dz/dy)(dy/dx)

Multiple Paths Chain Rule
If z depends on x through y1 and y2, then dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)

Multiple Paths Chain Rule - General
dz/dx = sum_i (dz/dy_i)(dy_i/dx)

Chain Rule in Flow Graph
Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency
{y_1, y_2, ...} = successors of x
dz/dx = sum over the successors y_i of (dz/dy_i)(dy_i/dx)
Back-Prop in Multi-Layer Net
[figure: gradients propagated backwards from the output loss through each layer's weights and activations]

Back-Prop in General Flow Graph
Single scalar output z; {y_1, y_2, ...} = successors of x
1. Fprop: visit nodes in topo-sort order
   Compute the value of each node given its predecessors
2. Bprop:
   Initialize the output gradient = 1
   Visit nodes in reverse order:
   Compute the gradient wrt each node using the gradient wrt its successors
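A minimal sketch of that fprop/bprop recipe over a tiny flow graph. The Node class and the two supported operations (multiply and tanh) are illustrative, not taken from any particular library.

```python
import numpy as np

class Node:
    """Toy flow-graph node: stores its value and the gradient of the output wrt it."""
    def __init__(self, op, parents):
        self.op, self.parents, self.value, self.grad = op, parents, None, 0.0

def fprop(nodes):                      # nodes assumed in topological order
    for n in nodes:
        if n.op == "input":
            continue
        vals = [p.value for p in n.parents]
        n.value = np.tanh(vals[0]) if n.op == "tanh" else vals[0] * vals[1]

def bprop(nodes):
    for n in nodes:
        n.grad = 0.0
    nodes[-1].grad = 1.0               # initialize output gradient = 1
    for n in reversed(nodes):          # visit nodes in reverse topological order
        if n.op == "tanh":
            n.parents[0].grad += (1 - n.value ** 2) * n.grad
        elif n.op == "mul":
            a, b = n.parents
            a.grad += b.value * n.grad
            b.grad += a.value * n.grad

x, w = Node("input", []), Node("input", [])
x.value, w.value = 0.5, 2.0
h = Node("mul", [x, w]); y = Node("tanh", [h])
graph = [x, w, h, y]
fprop(graph); bprop(graph)
print(y.value, w.grad)                 # d tanh(w*x)/dw = (1 - tanh(w*x)^2) * x
```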
Back-Prop in Recurrent & Recursive Nets
Replicate a parameterized function over different time steps or nodes of a DAG
The output state at one time-step / node is used as input for another time-step / node
[figures: a recurrent net unrolled over inputs x_{t-1}, x_t, x_{t+1} with states z_{t-1}, z_t, z_{t+1}; a recursive net composing semantic representations over the parse of the sentence "A small crowd quietly enters the historic church"]
Backpropagation Through Structure
Inference -> discrete choices
(e.g., shortest path in an HMM, best output configuration in a CRF)
E.g. max over configurations or sum weighted by posterior
The loss to be optimized depends on these choices
The inference operations are flow graph nodes
If continuous, can perform stochastic gradient descent: max(a, b) is continuous.
Automatic Differentiation
The gradient computation can be automatically inferred from the symbolic expression of the fprop.
Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
Easy and fast prototyping
Deep Supervised Neural Nets
We can now train them even without unsupervised pre-training, thanks to better initialization and non-linearities (rectifiers, maxout), and they can generalize well with large labeled sets and dropout.
Unsupervised pre-training is still useful for rare classes, transfer, smaller labeled sets, or as an extra regularizer.
Stochastic Neurons as Regularizer:
Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al 2012, arXiv)
Dropout trick: during training multiply each neuron output by a random bit (p=0.5), during test by 0.5
Used in deep supervised networks
Similar to denoising auto-encoder, but corrupting every layer
Works better with some non-linearities (rectifiers, maxout) (Goodfellow et al. ICML 2013)
Equivalent to averaging over exponentially many architectures
Used by Krizhevsky et al to break through the ImageNet SOTA
Also improves SOTA on CIFAR-10 (18% -> 16% err)
Knowledge-free MNIST with DBMs (0.95% -> 0.79% err)
TIMIT phoneme classification (22.7% -> 19.7% err)
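A minimal numpy sketch of the dropout trick described above: a random binary mask at training time, scaling by the keep probability at test time. Layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
p = 0.5                                      # probability of keeping a unit

def dropout_layer(h, train):
    if train:
        mask = rng.binomial(1, p, size=h.shape)  # multiply outputs by a random bit
        return h * mask
    return h * p                             # at test time, scale by p instead

h = np.tanh(rng.randn(8))                    # some hidden-layer activations
h_train = dropout_layer(h, train=True)       # roughly half the units silenced
h_test = dropout_layer(h, train=False)       # deterministic, rescaled activations
print(h_train, h_test)
```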
Dropout Regularizer: Super-Efficient Bagging
[figure: dropout as training an exponential ensemble of sub-networks that share parameters]
Temporal & Spatial Inputs:
Convolutional & Recurrent Nets
Local connectivity across time/space
Sharing weights across time/space (translation equivariance)
Pooling (translation invariance, cross-channel pooling for learned invariances)
Recurrent nets (RNNs) can summarize information from the past
Bidirectional RNNs also summarize information from the future
[figure: recurrent network unrolled over x_{t-1}, x_t, x_{t+1} with states z_{t-1}, z_t, z_{t+1}]
Distributed Representations & Neural Nets:
How to do unsupervised training?
PCA
= Linear Manifold
= Linear Auto-Encoder
= Linear Gaussian Factors
[figure: input x, its projection reconstruction(x) on the linear manifold, and the reconstruction error vector]
Input x, 0-mean
Features = code = h(x) = W x
Reconstruction(x) = W^T h(x) = W^T W x
W = principal eigen-basis of Cov(x)
Probabilistic interpretations:
1. Gaussian with full covariance W^T W + lambda I
2. Latent marginally iid Gaussian factors h with x = W^T h + noise
[figure: input -> code = latent features h -> reconstruction]
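A small numpy sketch, under the assumption of zero-mean toy data, showing that the principal eigen-basis of Cov(x) gives exactly the code h(x) = Wx and reconstruction W^T W x described above.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 5) @ rng.randn(5, 5)    # toy data, 500 examples in 5-D
X -= X.mean(axis=0)                        # PCA assumes 0-mean input

k = 2                                      # code size (number of principal directions)
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
W = eigvecs[:, -k:].T                      # principal eigen-basis of Cov(x), shape (k, 5)

H = X @ W.T                                # code: h(x) = W x
X_rec = H @ W                              # reconstruction: W^T h(x) = W^T W x
print("mean squared reconstruction error:", np.mean((X - X_rec) ** 2))
```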
Directed Factor Models: P(x,h) = P(h) P(x|h)
P(h) factorizes into P(h_1) P(h_2) ...
Different priors:
PCA: P(h_i) is Gaussian
ICA: P(h_i) is non-parametric
Sparse coding: P(h_i) is concentrated near 0
Likelihood is typically Gaussian x | h, with mean given by W^T h
Inference procedures (predicting h, given x) differ
Sparse h: x is explained by the weighted addition of a few selected filters, e.g. x = 0.9 * (filter) + 0.8 * (filter) + 0.7 * (filter)
[figure: directed graphical model with latent factors h_1 ... h_5 (prior) generating inputs x_1, x_2 (likelihood) through weights w_i]
Sparse autoencoder illustration for images
Natural images
Learned bases: "edges"
[figure: image patches and the dictionary of learned edge filters]
Test example = 0.8 * w_36 + 0.3 * w_42 + 0.5 * w_63
[h_1, ..., h_64] = [0, 0, ..., 0, 0.8, 0, ..., 0, 0.3, 0, ..., 0, 0.5, 0]
(feature representation)
Stacking Single-Layer Learners
PCA is great but can't be stacked into deeper, more abstract representations (linear x linear = linear)
One of the big ideas from Hinton et al. 2006: layer-wise unsupervised feature learning
Stacking Restricted Boltzmann Machines (RBM) -> Deep Belief Network (DBN)
Effective deep learning first became possible with unsupervised pre-training
[Erhan et al., JMLR 2010]
[figure: test error of a purely supervised neural net vs with unsupervised pre-training (with RBMs and Denoising Auto-Encoders)]
Optimizing Deep Non-Linear Composition of Functions Seems Hard
Failure of training deep supervised nets before 2006
Regularization effect vs optimization effect of unsupervised pre-training
Is the optimization difficulty due to ill-conditioning? local minima? both?
The jury is still out, but we now have success stories of training deep supervised nets without unsupervised pre-training
Initial Examples Matter More (critical period?)
Vary 10% of the training set at the beginning, middle, or end of the online sequence. Measure the effect on the learned function.
[figure: measured variation of the learned function]
Order & Selection of Examples Matters
(Bengio, Louradour, Collobert & Weston, ICML 2009)
Curriculum learning (Bengio et al 2009; Krueger & Dayan 2009)
Start with easier examples
Faster convergence to a better local minimum in deep architectures
[figure: test error vs number of examples seen (x1000), curriculum vs no-curriculum]
Understanding the difficulty of training deep feedforward neural networks
(Glorot & Bengio, AISTATS 2010)
Study the activations and gradients
wrt depth
as training progresses
for different initializations -> big difference
for different non-linearities -> big difference
First demonstration that deep supervised nets can be successfully trained almost as well as with unsupervised pre-training, by setting up the optimization problem appropriately.
Layer-wise Unsupervised Learning / Layer-Wise Unsupervised Pre-training
[figure sequence:
1. Train a first layer on the raw input, learning features and checking the reconstruction of the input against the input.
2. Keep the learned features, treat them as the input to a second layer, and learn more abstract features (again with a reconstruction of the first-layer features).
3. Repeat to obtain even more abstract features.
4. Finally add an output layer f(X) (e.g. predicting "six"), compared against the target Y (e.g. "two!").]

Supervised Fine-Tuning
Additional hypothesis: features good for P(x) are good for P(y|x)
Restricted Boltzmann Machines
See Bengio (2009) detailed monograph/review: "Learning Deep Architectures for AI".
See Hinton (2010) "A practical guide to training Restricted Boltzmann Machines"
Undirected Models: the Restricted Boltzmann Machine
[Hinton et al 2006]
Probabilistic model of the joint distribution of the observed variables (inputs alone, or inputs and targets) x
Latent (hidden) variables h model high-order dependencies
Inference is easy: P(h|x) factorizes into a product of P(h_i | x)
[figure: bipartite graph with hidden units h_1, h_2, h_3 connected to visible units x_1, x_2]
Boltzmann Machines & MRFs
Boltzmann machines (Hinton 84): P(x) = exp(b^T x + x^T W x) / Z
Markov Random Fields: P(x) = exp(sum_c w_c f_c(x)) / Z
-> More interesting with latent variables!
Soft constraint / probabilistic statement
(undirected graphical models)
Restricted Boltzmann Machine (RBM)
A popular building block for deep architectures
Bipartite undirected graphical model: hidden units on one side, observed units on the other, no within-layer connections
Energy: E(x,h) = -h^T W x - c^T x - b^T h, with P(x,h) = exp(-E(x,h)) / Z
Gibbs Sampling & Block Gibbs Sampling
Want to sample from P(x_1, x_2, ..., x_n)
Gibbs sampling:
Iterate or randomly choose i in 1..n
Sample x_i from P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
Can only make small changes at a time -> slow mixing
Note how the fixed point samples from the joint.
Special case of Metropolis-Hastings.
Block Gibbs sampling (not always possible):
The x's are organized in blocks, e.g. A=(x_1,x_2,x_3), B=(x_4,x_5,x_6), C=(x_7,x_8,x_9)
Do Gibbs on P(A,B,C), i.e.
Sample A from P(A|B,C)
Sample B from P(B|A,C)
Sample C from P(C|A,B), and iterate.
Larger changes -> faster mixing
Block Gibbs Sampling in RBMs
P(h|x) and P(x|h) factorize: P(h|x) = prod_i P(h_i|x)
-> Easy inference
-> Efficient block Gibbs sampling x -> h -> x -> h ...
h_1 ~ P(h|x_1), x_2 ~ P(x|h_1), h_2 ~ P(h|x_2), x_3 ~ P(x|h_2), h_3 ~ P(h|x_3), ...
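A minimal numpy sketch of alternating block Gibbs sampling in a binary-binary RBM, using the standard sigmoid conditionals; the weights here are random placeholders rather than a trained model.

```python
import numpy as np

rng = np.random.RandomState(0)
n_v, n_h = 6, 4
W = 0.1 * rng.randn(n_h, n_v)           # placeholder weights (not a trained RBM)
b, c = np.zeros(n_h), np.zeros(n_v)     # hidden and visible biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v):
    p = sigmoid(W @ v + b)               # P(h_i = 1 | v) factorizes
    return (rng.uniform(size=n_h) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(W.T @ h + c)             # P(v_j = 1 | h) factorizes
    return (rng.uniform(size=n_v) < p).astype(float)

v = (rng.uniform(size=n_v) < 0.5).astype(float)
for _ in range(100):                     # alternate v -> h -> v -> h ...
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h)
```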
Obstacle: Vicious Circle Between Learning and MCMC Sampling
Early during training, the density is smeared out, mode bumps overlap
Later on, hard to cross empty voids between modes
Are we doomed if we rely on MCMC during training?
Will we be able to train really large & complex models?
[diagram: training updates <-> mixing; vicious circle]
RBM with (image, label) visible units
(Larochelle & Bengio 2008)
[figure: a one-hot label y and the image x are both visible units, connected to the hidden layer h through weights U and W]
RBMs are Universal Approximators
(Le Roux & Bengio 2008)
Adding one hidden unit (with a proper choice of parameters) guarantees increasing likelihood
With enough hidden units, can perfectly model any discrete distribution
RBMs with a variable number of hidden units = non-parametric
RBM Conditionals Factorize
P(h|x) = prod_i P(h_i|x), with P(h_i = 1 | x) = sigmoid(b_i + W_i x)

RBM Energy Gives Binomial Neurons

RBM Free Energy
Free energy = equivalent energy when marginalizing: F(x) = -c^T x - sum_i log(1 + exp(b_i + W_i x))
Can be computed exactly and efficiently in RBMs
Marginal likelihood P(x) = exp(-F(x)) / Z is tractable up to the partition function Z

Energy-Based Models Gradient / Boltzmann Machine Gradient
The gradient has two components: a positive phase (expectation over h given the observed x) and a negative phase (expectation over the model distribution of (x, h))
In RBMs, easy to sample or sum over h|x
Difficult part: sampling from P(x), typically with a Markov chain
Positive & Negative Samples
Observed (+) examples push the energy down
Generated / dream / fantasy (-) samples / particles push the energy up
[figure: energy curve pushed down at X+ and up at X-]
Equilibrium: E[gradient] = 0
Training RBMs
Contrastive Divergence (CD-k): start the negative Gibbs chain at the observed x, run k Gibbs steps
SML / Persistent CD (PCD): run the negative Gibbs chain in the background while the weights slowly change
Fast PCD: two sets of weights, one with a large learning rate only used for the negative phase, quickly exploring modes
Herding: a deterministic near-chaos dynamical system defines both learning and sampling
Tempered MCMC: use a higher temperature to escape modes
Contrastive Divergence
Contrastive Divergence (CD-k): start the negative-phase block Gibbs chain at the observed x, run k Gibbs steps (Hinton 2002)
[figure: observed x+ (positive phase) with h+ ~ P(h|x+); after k = 2 Gibbs steps, sampled x- (negative phase) with h- ~ P(h|x-); the free energy is pushed down at x+ and up at x-]
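A minimal numpy sketch of a CD-k parameter update for a binary RBM, in the same style as the block Gibbs sketch above; learning rate, k and sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(1)
n_v, n_h, lr, k = 6, 4, 0.05, 2
W = 0.01 * rng.randn(n_h, n_v)
b, c = np.zeros(n_h), np.zeros(n_v)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update(x_pos):
    global W, b, c
    # Positive phase: hidden probabilities given the observed example.
    ph_pos = sigmoid(W @ x_pos + b)
    # Negative phase: k steps of block Gibbs starting at the observed x.
    v = x_pos.copy()
    for _ in range(k):
        h = (rng.uniform(size=n_h) < sigmoid(W @ v + b)).astype(float)
        v = (rng.uniform(size=n_v) < sigmoid(W.T @ h + c)).astype(float)
    ph_neg = sigmoid(W @ v + b)
    # Parameter update: positive statistics minus negative statistics.
    W += lr * (np.outer(ph_pos, x_pos) - np.outer(ph_neg, v))
    b += lr * (ph_pos - ph_neg)
    c += lr * (x_pos - v)

x = (rng.uniform(size=n_v) < 0.5).astype(float)   # stand-in for a training example
cd_k_update(x)
```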
Persistent CD (PCD) / Stochastic Max. Likelihood (SML)
Run the negative Gibbs chain in the background while the weights slowly change (Younes 1999; Tieleman 2008)
[figure: observed x+ (positive phase) with h+ ~ P(h|x+); the previous negative particle x- is advanced by one Gibbs step to a new x-]
Guarantees (Younes 1999; Yuille 2005):
if the learning rate decreases in 1/t,
the chain mixes before the parameters change too much,
the chain stays converged when the parameters change
Some RBM Variants
Different energy functions and allowed values for the hidden and visible units:
Hinton et al 2006: binary-binary RBMs
Welling NIPS 2004: exponential family units
Ranzato & Hinton CVPR 2010: Gaussian RBM weaknesses (no conditional covariance), propose mcRBM
Ranzato et al NIPS 2010: mPoT, similar energy function
Courville et al ICML 2011: spike-and-slab RBM
Convolutionally Trained Spike & Slab RBMs Samples
ssRBM is not Cheating
[figure: generated samples shown next to their nearest training examples]
Auto-Encoders & Variants:
Learning a computational graph

Computational Graphs
Operations for a particular task
Neural nets' structure = computational graph for P(y|x)
Graphical model's structure != computational graph for inference
Recurrent nets & graphical models -> family of computational graphs sharing parameters
Could we have a parametrized family of computational graphs defining the model?
Simple Auto-Encoders
MLP whose target output = input
Reconstruction = decoder(encoder(input)), e.g. with a one-layer encoder h = s(W x + b) and decoder r(x) = s(W^T h + c) for some non-linearity s
With a bottleneck, the code = a new coordinate system
Encoder and decoder can have 1 or more layers
Training deep auto-encoders is notoriously difficult
[figure: input -> encoder -> code = latent features h -> decoder -> reconstruction r(x)]
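A minimal numpy sketch of one gradient step on a tied-weights sigmoid auto-encoder with squared reconstruction error; sizes and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code, lr = 8, 3, 0.1
W = 0.1 * rng.randn(n_code, n_in)        # tied weights: the decoder uses W.T
b, c = np.zeros(n_code), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.uniform(size=n_in)               # stand-in for one training example

# Forward pass: encode then decode.
h = sigmoid(W @ x + b)                   # code
r = sigmoid(W.T @ h + c)                 # reconstruction
loss = np.sum((r - x) ** 2)

# Backward pass (chain rule through the two sigmoid layers).
dr = 2 * (r - x) * r * (1 - r)           # gradient at the decoder pre-activation
dh = (W @ dr) * h * (1 - h)              # gradient at the encoder pre-activation
dW = np.outer(dh, x) + np.outer(h, dr)   # encoder and decoder contributions to W
W -= lr * dW
b -= lr * dh
c -= lr * dr
print("reconstruction error:", loss)
```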
Link Between Contrastive Divergence and Auto-Encoder Reconstruction Error Gradient
(Bengio & Delalleau 2009):
CD-2k estimates the log-likelihood gradient from 2k diminishing terms of an expansion that mimics the Gibbs steps
The reconstruction error gradient looks only at the first step, i.e., it is a kind of mean-field approximation of CD-0.5
I finally understand what auto-encoders do!
Try to carve holes in ||r(x) - x||^2 or -log P(x | h(x)) at the examples
The vector r(x) - x points in the direction of increasing probability, i.e. estimates the score d log p(x) / dx: learn a score vector field = local mean
Generalize (valleys) in between the above holes to form manifolds
d r(x) / dx estimates the local covariance and is linked to the Hessian d^2 log p(x) / dx^2
A Markov chain associated with AEs estimates the data-generating distribution (Bengio et al, arXiv 1305.6663, 2013)
Stacking Auto-Encoders
Auto-encoders can be stacked successfully (Bengio et al NIPS 2006) to form highly non-linear representations, which with fine-tuning outperformed purely supervised MLPs
Greedy Layerwise Supervised Training
Generally worse than unsupervised pre-training but better than ordinary training of a deep neural network (Bengio et al. NIPS 2006). Has been used successfully on large labeled datasets, where unsupervised pre-training did not make as much of an impact.
Supervised Fine-Tuning is Important
Greedy layer-wise unsupervised pre-training phase with RBMs or auto-encoders on MNIST
Supervised phase with or without unsupervised updates, with or without fine-tuning of hidden layers
Can train all RBMs at the same time, same results
(Auto-Encoder) Reconstruction Loss
Discrete inputs: cross-entropy for binary inputs
  -sum_i [ x_i log r_i(x) + (1 - x_i) log(1 - r_i(x)) ]   (with 0 < r_i(x) < 1)
or a log-likelihood reconstruction criterion, e.g., for a multinomial (one-hot) input
  -sum_i x_i log r_i(x)   (where sum_i r_i(x) = 1, summing over the subset of inputs associated with this multinomial variable)
In general: consider what are appropriate loss functions to predict each of the input variables;
typically, reconstruction neg. log-likelihood -log P(x | h(x))
Manifold Learning
Additional prior: examples concentrate near a lower dimensional "manifold" (a region of high density where only few operations are allowed, making small changes while staying on the manifold)
Variable dimension locally?
Soft # of dimensions?
Denoising Auto-Encoder
(Vincent et al 2008)
Corrupt the input during training only
Train to reconstruct the uncorrupted input
[figure: raw input -> corrupted input -> hidden code (representation) -> reconstruction; loss = KL(reconstruction || raw input)]
Encoder & decoder: any parametrization
As good or better than RBMs for unsupervised pre-training
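A minimal numpy sketch of the training signal of a denoising auto-encoder: corrupt the input (here with masking noise), encode the corrupted version, and measure cross-entropy against the clean input. The encoder/decoder and sizes are placeholders.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code = 10, 5
W = 0.1 * rng.randn(n_code, n_in)        # placeholder encoder weights (tied decoder)
b, c = np.zeros(n_code), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = (rng.uniform(size=n_in) < 0.5).astype(float)   # clean binary input

# Corrupt the input during training only (masking noise: zero out ~30% of entries).
mask = rng.binomial(1, 0.7, size=n_in)
x_tilde = x * mask

# Encode the corrupted input, decode, and compare against the *uncorrupted* input.
h = sigmoid(W @ x_tilde + b)
r = sigmoid(W.T @ h + c)
eps = 1e-7
cross_entropy = -np.sum(x * np.log(r + eps) + (1 - x) * np.log(1 - r + eps))
print("denoising reconstruction loss:", cross_entropy)
```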
Denoising Auto-Encoder
Learns a vector field pointing towards higher probability directions (Alain & Bengio 2013): r(x) - x is proportional to d log p(x) / dx
Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011)
[equivalent when noise -> 0]
Compared to the RBM: no partition function issue, + can measure the training criterion
[figure: corrupted inputs mapped back towards the data manifold]
Prior: examples concentrate near a lower dimensional "manifold"
Stacked Denoising Auto-Encoders
Infinite MNIST
[figure: online error on Infinite MNIST]
Note how the advantage of better initialization does not vanish like other regularizers as the number of examples grows
Auto-Encoders Learn Salient Variations, like a non-linear PCA
Minimizing reconstruction error forces keeping the variations along the manifold.
The regularizer wants to throw away all variations.
With both: keep ONLY sensitivity to variations ON the manifold.
Regularized Auto-Encoders Learn a Vector Field or a Markov Chain Transition Distribution
(Bengio, Vincent & Courville, TPAMI 2013) review paper
(Alain & Bengio ICLR 2013; Bengio et al, arXiv 2013)
Contractive Auto-Encoders
(Rifai, Vincent, Muller, Glorot, Bengio ICML 2011; Rifai, Mesnil, Vincent, Bengio, Dauphin, Glorot ECML 2011; Rifai, Dauphin, Vincent, Bengio, Muller NIPS 2011)
Training criterion: reconstruction error + a penalty on the Frobenius norm of the encoder Jacobian, sum_ij (dh_j(x)/dx_i)^2
wants contraction in all directions
cannot afford contraction in manifold directions
If h_j = sigmoid(b_j + W_j x), then (dh_j(x)/dx_i)^2 = h_j^2 (1 - h_j)^2 W_ji^2
Most hidden units saturate (near 0 or 1, derivative near 0):
few responsive units represent the active subspace (local chart)
Contractive Auto-Encoders
(Rifai, Vincent, Muller, Glorot, Bengio ICML 2011; Rifai, Mesnil, Vincent, Bengio, Dauphin, Glorot ECML 2011; Rifai, Dauphin, Vincent, Bengio, Muller NIPS 2011)
Each region/chart = a subset of active hidden units
Neighboring region: one of the units becomes active/inactive
SHARED SET OF FILTERS ACROSS REGIONS, EACH USING A SUBSET
The Jacobian's spectrum is peaked = local low-dimensional representation / relevant factors
Inactive hidden unit = 0 singular value
Contractive Auto-Encoders
Benchmark of medium-size datasets on which several deep learning algorithms had been evaluated (Larochelle et al ICML 2007)
[figures: MNIST input point and tangent directions; tangents learned by Local PCA (no sharing across regions) vs by the Contractive Auto-Encoder]
Distributed vs Local (CIFAR-10 unsupervised)
Denoising auto-encoders are also contractive!
Taylor-expand the Gaussian corruption noise in the reconstruction error:
this yields a contractive penalty on the reconstruction function (instead of the encoder), proportional to the amount of corruption noise
Learned Tangent Prop: the Manifold Tangent Classifier
(Rifai et al NIPS 2011)
3 hypotheses:
1. Semi-supervised hypothesis (P(x) related to P(y|x))
2. Unsupervised manifold hypothesis (data concentrates near low-dim. manifolds)
3. Manifold hypothesis for classification (low density between class manifolds)
Algorithm:
1. Estimate the local principal directions of variation U(x) by CAE (principal singular vectors of dh(x)/dx)
2. Penalize the f(x) = P(y|x) predictor by || df/dx U(x) ||
Makes f(x) insensitive to variations on the manifold at x, the tangent plane characterized by U(x).
Manifold Tangent Classifier Results
Leading singular vectors on MNIST, CIFAR-10, RCV1
Knowledge-free MNIST: 0.81% error
Semi-sup.
Forest (500k examples)
Inference and Explaining Away
Easy inference in RBMs and regularized Auto-Encoders
But no explaining away (competition between causes)
(Coates et al 2011): even when training filters as RBMs it helps to perform additional explaining away (e.g. plug them into a Sparse Coding inference), to obtain better-classifying features
RBMs would need lateral connections to achieve a similar effect
Auto-Encoders would need lateral recurrent connections or a deep recurrent structure
Sparse Coding (Olshausen et al 97)
Directed graphical model: a sparse prior on the code h (concentrated near 0) and a linear decoder, x = W h + noise
One of the first unsupervised feature learning algorithms with non-linear feature extraction (but a linear decoder)
MAP inference recovers a sparse h, e.g. h(x) = argmin_h ||x - W h||^2 + lambda ||h||_1, although P(h|x) is not concentrated at 0
Linear decoder, non-parametric encoder
Sparse Coding inference: convex but expensive optimization
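A minimal sketch of that MAP inference using ISTA (iterative shrinkage-thresholding), one standard way to solve the convex L1 problem; the dictionary below is a random placeholder, and the code uses the half-squared-error scaling convention.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code, lam = 10, 20, 0.1
W = rng.randn(n_in, n_code)
W /= np.linalg.norm(W, axis=0)            # placeholder dictionary with unit-norm columns
x = rng.randn(n_in)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: gradient step on 0.5*||x - W h||^2 followed by soft-thresholding for the L1 term.
L = np.linalg.norm(W, 2) ** 2             # Lipschitz constant of the smooth part
h = np.zeros(n_code)
for _ in range(200):
    grad = W.T @ (W @ h - x)
    h = soft_threshold(h - grad / L, lam / L)

print("non-zero codes:", np.sum(h != 0),
      "reconstruction error:", np.linalg.norm(x - W @ h))
```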
Predictive Sparse Decomposition
Approximate the inference of sparse coding by a parametric encoder:
Predictive Sparse Decomposition (Kavukcuoglu et al 2008)
Very successful applications in machine vision with convolutional architectures

Predictive Sparse Decomposition
Stacked to form deep architectures
Alternating convolution, rectification, pooling
Tiling: no sharing across overlapping filters
Group sparsity penalty yields topographic maps
Deep Variants
Level-Local Learning is Important
Initializing each layer of an unsupervised deep Boltzmann machine helps a lot
Initializing each layer of a supervised neural network as an RBM, auto-encoder, denoising auto-encoder, etc. can help a lot
Helps most the layers furthest away from the target
Not just an effect of the unsupervised prior
Jointly training all the levels of a deep architecture is difficult because of the increased non-linearity / non-smoothness
Initializing using a level-local learning algorithm is a useful trick
Providing intermediate-level targets can help tremendously (Gulcehre & Bengio ICLR 2013)
Stack of RBMs / AEs -> Deep MLP
Encoder or P(h|v) becomes an MLP layer
[figure: the stacked RBM/AE weights W1, W2, W3 are reused as the layers of a feed-forward MLP x -> h1 -> h2 -> h3 -> y]
Stack of RBMs / AEs -> Deep Auto-Encoder
(Hinton & Salakhutdinov 2006)
Stack the encoders / P(h|x) into a deep encoder
Stack the decoders / P(x|h) into a deep decoder
[figure: deep encoder with weights W1, W2, W3 and deep decoder with the transposed weights W3^T, W2^T, W1^T]
Stack of RBMs / AEs -> Deep Recurrent Auto-Encoder
(Savard 2011) (Bengio & Laufer, arXiv 2013)
Each hidden layer receives input from below and above
Deterministic (mean-field) recurrent computation (Savard 2011)
Stochastic (injecting noise) recurrent computation: Deep Generative Stochastic Networks (GSNs) (Bengio & Laufer arXiv 2013)
[figure: the stacked weights W1, W2, W3 and their transposes unrolled into a recurrent computation over x, h1, h2, h3]
Stack of RBMs -> Deep Belief Net (Hinton et al 2006)
Stack the lower-level RBMs' P(x|h) along with the top-level RBM:
P(x, h1, h2, h3) = P(h2, h3) P(h1|h2) P(x|h1)
Sample: Gibbs on the top RBM, then propagate down
[figure: x, h1, h2, h3]
Stack of RBMs -> Deep Boltzmann Machine
(Salakhutdinov & Hinton AISTATS 2009)
Halve the RBM weights because each layer now has inputs from below and from above
Positive phase: (mean-field) variational inference = recurrent AE
Negative phase: Gibbs sampling (stochastic units)
Train by SML/PCD
[figure: layered undirected model over x, h1, h2, h3 with weights W1, W2, W3]
Stack of Auto-Encoders -> Deep Generative Auto-Encoder
(Rifai et al ICML 2012)
MCMC on the top-level auto-encoder:
h_{t+1} = encode(decode(h_t)) + noise,
where the noise is Normal(0, d/dh encode(decode(h_t)))
Then deterministically propagate down with the decoders
[figure: x, h1, h2, h3]
Generative Stochastic Networks (GSN)
Recurrent parametrized stochastic computational graph that defines a transition operator for a Markov chain whose asymptotic distribution is implicitly estimated by the model
Noise injected in the input and hidden layers
Trained to maximize the reconstruction probability of the example at each step
Example structure inspired from the DBM Gibbs chain:
[figure: GSN computational graph unrolled over a few steps, with noise injected into the hidden layers h1, h2, h3, sampled states at each step and a reconstruction target at each step]
(Bengio, Yao, Alain & Vincent, arXiv 2013; Bengio & Laufer, arXiv 2013)
Denoising Auto-Encoder Markov Chain
P(X): the true data-generating distribution
C(X~|X): the corruption process
P_theta(X|X~): the denoising auto-encoder trained with n examples from P, probabilistically "inverts" the corruption
Markov chain over X alternating corruption C and denoising P_theta
[figure: chain ... X_{t-2} -> X~_{t-2} -> X_{t-1} -> X~_{t-1} -> X_t -> X~_t ... (corrupt / denoise)]
Previous Theoretical Results on Probabilistic Interpretation of Auto-Encoders
(Vincent 2011; Alain & Bengio 2013)
Continuous X, Gaussian corruption, noise -> 0
Squared reconstruction error ||r(X + noise) - X||^2
(r(X) - X) / sigma^2 estimates the score d log p(X) / dX
New Theoretical Results
Denoising AEs are consistent estimators of the data-generating distribution through their Markov chain, so long as they consistently estimate the conditional denoising distribution and the Markov chain converges.
Making P_theta_n(X | X~) match P(X | X~) makes the stationary distribution pi_n(X) match P(X)
[diagram: truth -> denoising distr.; stationary distr. -> truth]
Generative Stochastic Networks (GSN)
If we decompose the reconstruction probability into a parametrized noise-dependent part and a noise-independent part, we also get a consistent estimator of the data-generating distribution, if the chain converges.
[figure: GSN computational graph as above]
GSN Experiments: validating the theorem in a continuous non-parametric setting
Continuous data, x in R^10, Gaussian corruption
Reconstruction distribution = Parzen (mixture of Gaussians) estimator
3000 training examples, 3000 samples
Visualize a pair of dimensions
[figure: training data vs generated samples for a pair of dimensions]
Shallow Model: Generalizing the Denoising Auto-Encoder Probabilistic Interpretation
Classical denoising auto-encoder architecture, single hidden layer with noise only injected in the input
Factored Bernoulli reconstruction prob. distr.
C = parameter-less, salt-and-pepper noise on top of x
Generalizes (Alain & Bengio ICLR 2013): not just continuous r.v., any training criterion (such as log-likelihood), not just Gaussian but any corruption (no need for it to be tiny to correctly estimate the distribution).
[figure: chain x_0 -> sample x_1 -> sample x_2 -> sample x_3 ... with tied weights W1 and W1^T and a reconstruction target at each step]
Experiments: Shallow vs Deep
Shallow (DAE), no recurrent path at higher levels, state = x only
Deep GSN:
[figures: shallow chain x_0 -> sample x_1 -> sample x_2 -> x_3 vs the deep GSN chain]
Quantitative Evaluation of Samples
Previous procedure for evaluating samples (Breuleux et al 2011; Rifai et al 2012; Bengio et al 2013):
Generate 10000 samples from the model
Use them as training examples for a Parzen density estimator
Evaluate its log-likelihood on MNIST test data
[figure: training examples]
Question Answering, Missing Inputs and Structured Output
Once trained, a GSN can provably sample from any conditional over subsets of its inputs, so long as we use the conditional associated with the reconstruction distribution and clamp the right-hand side variables.
(Bengio & Laufer arXiv 2013)
Experiments: Structured Conditionals
Stochastically fill in missing inputs, sampling from the chain that generates the conditional distribution of the missing inputs given the observed ones (notice the fast burn-in!)

Not Just MNIST: experiments on TFD
3 hidden layer model, consecutive samples:
Practical Considerations
Part 3
Deep Learning Tricks of the Trade
Y. Bengio (2013), "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Unsupervised pre-training
Stochastic gradient descent and setting learning rates
Main hyper-parameters:
Learning rate schedule
Early stopping
Minibatches
Parameter initialization
Number of hidden units
L1 and L2 weight decay
Sparsity regularization
Debugging
How to efficiently search for hyper-parameter configurations
Stochastic Gradient Descent (SGD)
Gradient descent uses the total gradient over all examples per update; SGD updates after only 1 or a few examples:
  theta <- theta - epsilon_t * dL(z_t, theta)/dtheta
where L = loss function, z_t = current example, theta = parameter vector, and epsilon_t = learning rate.
Ordinary gradient descent is a batch method, very slow, should never be used. 2nd-order batch methods are being explored as an alternative, but SGD with a well-chosen learning rate schedule remains the method to beat.
Learning Rates
Simplest recipe: keep it fixed and use the same for all parameters.
Collobert scales them by the inverse of the square root of the fan-in of each neuron
Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t) because of theoretical convergence guarantees, e.g.,
  epsilon_t = epsilon_0 * tau / max(t, tau)
with hyper-parameters epsilon_0 and tau.
Newer papers on adaptive learning rate procedures: (Schaul 2012, 2013), Adagrad (Duchi et al 2011), ADADELTA (Zeiler 2012)
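A minimal sketch of an SGD loop with the O(1/t) learning-rate schedule above, on a toy linear regression; the data and hyper-parameter values are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.randn(1000)

theta = np.zeros(3)
eps0, tau = 0.1, 500.0                  # hyper-parameters epsilon_0 and tau

for t in range(1, 5001):
    i = rng.randint(len(X))             # SGD: one example per update
    grad = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of the squared error on z_t
    eps_t = eps0 * tau / max(t, tau)    # learning rate decreasing in O(1/t)
    theta -= eps_t * grad

print(theta)                            # close to w_true
```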
Early Stopping
Beautiful FREE LUNCH (no need to launch many different training runs for each value of the number-of-iterations hyper-parameter)
Monitor the validation error during training (after visiting a number of training examples that is a multiple of the validation set size)
Keep track of the parameters with the best validation error and report them at the end
If the error does not improve enough (with some patience), stop.
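A minimal sketch of the patience-based early stopping loop described above; `train_step` and `validation_error` are placeholders for whatever model and data are actually in use.

```python
import copy

def early_stopping_train(model, train_step, validation_error,
                         patience=10, max_epochs=1000):
    """Stop when validation error has not improved for `patience` checks."""
    best_err, best_params, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                      # e.g. one pass over the training set
        err = validation_error(model)          # monitor validation error
        if err < best_err:
            best_err = err
            best_params = copy.deepcopy(model) # keep track of the best parameters
            waited = 0
        else:
            waited += 1
            if waited >= patience:             # ran out of patience: stop
                break
    return best_params, best_err
```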
Long-Term Dependencies
In very deep networks such as recurrent networks (or possibly recursive ones), the gradient is a product of Jacobian matrices, each associated with a step in the forward computation. This product can become very small or very large quickly [Bengio et al 1994], and the locality assumption of gradient descent breaks down.
Two kinds of problems:
singular values of the Jacobians > 1 -> gradients explode
or singular values < 1 -> gradients shrink & vanish
The Optimization Challenge in Deep / Recurrent Nets
Higher-level abstractions require highly non-linear transformations to be learned
Sharp non-linearities are difficult to learn by gradient
Composition of many non-linearities = sharp non-linearity
Exploding or vanishing gradients:
  dE_t/dx_k = dE_t/dx_t * prod_{i=k+1..t} dx_i/dx_{i-1}
[figure: recurrent network unrolled over inputs u_{t-1}, u_t, u_{t+1}, states x_{t-1}, x_t, x_{t+1} and losses E_{t-1}, E_t, E_{t+1}, with the Jacobians dx_i/dx_{i-1} along the chain]
RNN Tricks
(Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
Clipping gradients (avoids exploding gradients)
Leaky integration (propagates long-term dependencies)
Momentum (cheap 2nd order)
Initialization (starting in the right ballpark avoids exploding/vanishing)
Sparse gradients (symmetry breaking)
Gradient propagation regularizer (avoids vanishing gradient)
LSTM self-loops (avoid vanishing gradient)
Long-Term Dependencies and Clipping Trick
Trick first introduced by Mikolov: clip gradients to a maximum NORM value.
Makes a big difference in recurrent nets (Pascanu et al ICML 2013)
Allows SGD to compete with HF optimization on difficult long-term dependency tasks. Helped to beat SOTA in text compression, language modeling, speech recognition.
[figure: error surface with a cliff, where the clipped gradient step stays in the valley; recurrent chain over x_{t-1}, x_t, x_{t+1} and z_{t-1}, z_t, z_{t+1}]
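A minimal sketch of gradient-norm clipping as described above; `grads` stands for whatever list of parameter gradients the training loop produces.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm     # shrink, keeping the gradient direction
        grads = [g * scale for g in grads]
    return grads

# Example: a deliberately exploding gradient gets rescaled to norm 5.
grads = [np.array([300.0, -400.0])]       # norm = 500
print(clip_gradients(grads))              # [array([ 3., -4.])]
```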
Combining clipping to avoid gradient explosion and a Jacobian regularizer to avoid gradient vanishing
(Pascanu, Mikolov & Bengio, ICML 2013)
[figure: x -> h -> y]
Normalized Initialization to Achieve Unity-Like Jacobians
Assuming f'(act = 0) = 1

Normalized Initialization with Variance-Preserving Jacobians
[figure: Shapeset 2x3 data]
Unsupervised pre-training: automatically variance-preserving!
Parameter Initialization
Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g. mean target, or inverse sigmoid of the mean target).
Initialize weights ~ Uniform(-r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size):
  r = sqrt(6 / (fan-in + fan-out)) for tanh units (and 4x bigger for sigmoid units)
(Glorot & Bengio AISTATS 2010)
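A minimal sketch of that initialization recipe; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)

def init_layer(fan_in, fan_out, unit="tanh"):
    """Uniform(-r, r) weights with r = sqrt(6 / (fan_in + fan_out)); zero biases."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0                              # 4x bigger range for sigmoid units
    W = rng.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)                     # hidden biases initialized to 0
    return W, b

W1, b1 = init_layer(784, 500)                 # e.g. the first layer of an MNIST net
W2, b2 = init_layer(500, 10, unit="sigmoid")
print(W1.std(), W2.std())
```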
Handling Large Output Spaces
Auto-encoders and RBMs reconstruct the input, which is sparse and high-dimensional; language models have a huge output space (1 unit per word).
[figure: sparse input (cheap) -> code = latent features -> dense output probabilities (expensive); hierarchical output decomposed into categories, then words within each category]
(Dauphin et al, ICML 2011): reconstruct the non-zeros in the input, and reconstruct as many randomly chosen zeros, + importance weights
(Collobert & Weston, ICML 2008): sample a ranking loss
Decompose output probabilities hierarchically (Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007, 2009; Mikolov et al 2011)
Automatic Differentiation
Makes it easier to quickly and safely try new models.
The Theano library (Python) does it symbolically. Other neural network packages (Torch, Lush) can compute gradients for any given run-time value.
(Bergstra et al SciPy 2010)
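A minimal Theano example of symbolic differentiation of a tiny expression, in the spirit of the library mentioned above (not code from the tutorial itself); the shapes and values are arbitrary.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')                            # symbolic input vector
W = theano.shared(0.1 * np.ones((3, 4)), name='W')
b = theano.shared(np.zeros(3), name='b')

h = T.tanh(T.dot(W, x) + b)                   # symbolic fprop expression
cost = T.sum(h ** 2)

gW, gb = T.grad(cost, [W, b])                 # gradients inferred from the symbolic graph
f = theano.function([x], [cost, gW, gb])

print(f(np.ones(4)))
```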
Random Sampling of Hyperparameters
(Bergstra & Bengio 2012)
Common approach: manual + grid search
Grid search over hyperparameters: simple & wasteful
Random search: simple & efficient
Independently sample each HP, e.g. learning rate ~ exp(U[log(.1), log(.0001)])
Each training trial is iid
If a HP is irrelevant, grid search is wasteful
More convenient: ok to early-stop, continue further, etc.
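A minimal sketch of drawing iid random hyper-parameter configurations, including the log-uniform learning-rate sampling mentioned above; the evaluation function is a placeholder for an actual training run.

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_config():
    return {
        # learning rate ~ exp(U[log(1e-4), log(1e-1)])
        "learning_rate": np.exp(rng.uniform(np.log(1e-4), np.log(1e-1))),
        "n_hidden": rng.randint(100, 2000),
        "l2": np.exp(rng.uniform(np.log(1e-6), np.log(1e-2))),
    }

def validation_error(config):
    # Placeholder: in practice, train a model with `config` and return its validation error.
    return rng.uniform()

configs = [sample_config() for _ in range(20)]         # each trial is iid
results = [(cfg, validation_error(cfg)) for cfg in configs]
best_cfg, best_err = min(results, key=lambda t: t[1])
print(best_cfg, best_err)
```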
Sequential Model-Based Optimization of Hyper-Parameters
(Hutter et al JAIR 2009; Bergstra et al NIPS 2011; Thornton et al arXiv 2012; Snoek et al NIPS 2012)
Iterate:
Estimate P(valid. err | hyper-params config x, D)
Choose an optimistic x, e.g. maximizing over x the probability P(valid. err < current min. err | x)
Train with config x, observe valid. err. v, D <- D U {(x, v)}
Discussion
Concerns
Many algorithms and variants (burgeoning field)
Hyper-parameters (layer size, regularization, possibly learning rate)
Use multi-core machines, clusters and random sampling for cross-validation or sequential model-based optimization
Concerns
Slower to train than linear models?
Only by a small constant factor, and much more compact than non-parametric models (e.g. n-gram models or kernel machines)
Very fast during inference/test time (the feed-forward pass is just a few matrix multiplies)
Need more training data?
Can handle and benefit from more training data (esp. unlabeled), suitable for Big Data (Google trains nets with a billion connections, [Le et al, ICML 2012; Dean et al NIPS 2012])
Actually needs less labeled data
Concern: non-convex optimization
Can initialize the system with a convex learner:
Convex SVM
Fixed feature space
Then optimize the non-convex variant (add and tune learned features); it can't be worse than the convex learner
Challenges & Questions
Part 4
Why is Unsupervised Pre-Training Sometimes Working So Well?
Regularization hypothesis:
The unsupervised component forces the model close to P(x)
Representations good for P(x) are good for P(y|x)
Optimization hypothesis:
Unsupervised initialization lands near a better local minimum of P(y|x)
Can reach a lower local minimum otherwise not achievable by random initialization
Easier to train each layer using a layer-local criterion
(Erhan et al JMLR 2010)
Learning Trajectories in Function Space
Each point is a model in function space
Color = epoch
Top: trajectories without pre-training
Each trajectory converges to a different local minimum
No overlap of the regions with and without pre-training

Learning Trajectories in Function Space
Each trajectory converges to a different local minimum
With ISOMAP, trying to preserve geometry: pre-trained nets converge near each other (less variance)
Good answers = worse than a needle in a haystack (learning dynamics)
Deep Learning Challenges
(Bengio, arXiv 1305.0445, "Deep learning of representations: looking forward")
Computational Scaling
Optimization & Underfitting
Approximate Inference & Sampling
Disentangling Factors of Variation
Reasoning & One-Shot Learning of Facts
Challenge: Computational Scaling
Recent breakthroughs in speech, object recognition and NLP hinged on faster computing, GPUs, and large datasets
A 100-fold speedup is possible without waiting another 10 years?
Challenge of distributed training
Challenge of conditional computation
[figure: a deep net where gating units select which paths/units to compute for a given input]
Conditional Computation: only visit a small fraction of parameters / example
Deep nets vs decision trees
Hard mixtures of experts
Conditional computation for deep nets: sparse distributed gaters selecting combinatorial subsets of a deep net
Challenges:
Back-prop through hard decisions
Gated architectures exploration
Symmetry breaking to reduce ill-conditioning
Distributed Training
Minibatches (too large = slow down)
Large minibatches + 2nd-order methods
Asynchronous SGD (Bengio et al 2003; Le et al ICML 2012; Dean et al NIPS 2012)
Bottleneck: sharing weights/updates among nodes
New ideas:
Low-resolution sharing only where needed
Specialized conditional computation (each computer specializes in updates to some cluster of gated experts, and prefers examples which trigger these experts)
Optimization & Underfitting
On large datasets, the major obstacle is underfitting
The marginal utility of wider MLPs decreases quickly below the memorization baseline
Current limitations: local minima or ill-conditioning?
Adaptive learning rates and stochastic 2nd-order methods
Conditional computation & sparse gradients -> better conditioning: when some gradients are 0, many cross-derivatives are also 0.

MCMC Sampling Challenges
Burn-in: going from an unlikely configuration to likely ones
Mixing (the challenge for gradient estimation & inference):
Local: auto-correlation between successive samples
Global: mixing between major "modes"

More difficult to mix with better trained models
Early during training, density smeared out, mode bumps overlap
Later on, hard to cross empty voids between modes
Are we doomed if we rely on MCMC during training?
Will we be able to train really large & complex models?
[diagram: training updates <-> mixing; vicious circle]
Poor Mixing: Depth to the Rescue
Sampling from DBNs and stacked Contractive Auto-Encoders:
1. MCMC sampling from the top-layer model
2. Propagate top-level representations to input-level representations
Deeper nets visit more modes (classes) faster
[figure: consecutive samples from a 1-layer model (RBM) vs a 2-layer model (CAE)]
(Bengio et al ICML 2013)
Space-Filling in Representation-Space
High-probability samples fill more of the convex set between them when viewed in the learned representation-space, making the empirical distribution more uniform and unfolding manifolds
[figure: linear interpolation between the 9's manifold and the 3's manifold in pixel space vs at layer 1 vs at layer 2]
Poor Mixing: Depth to the Rescue
Deeper representations -> abstractions -> disentangling
E.g. reverse video bit, class bits in learned representations: easy to Gibbs sample between modes at the abstract level
Hypotheses tested and not rejected:
more abstract/disentangled representations unfold manifolds and fill more of the space
this can be exploited for better mixing between modes
[figure: the 9's and 3's manifolds in pixel space vs in representation space]
Inference Challenges
Many latent variables are involved in understanding complex inputs (e.g. in NLP: sense ambiguity, parsing, semantic roles)
Almost any inference mechanism can be combined with deep learning
See [Bottou, LeCun, Bengio 97], [Graves 2012]
Complex inference can be hard (exponentially) and needs to be approximate -> learn to perform inference
Inference & Sampling
Currently for unsupervised learning & structured output models
P(h|x) intractable because of many important modes
MAP, variational, MCMC approximations limited to 1 or a few modes
Approximate inference can hurt learning (Kulesza & Pereira NIPS 2007)
Mode mixing gets harder as training progresses (Bengio et al ICML 2013)
[diagram: training updates <-> mixing; vicious circle]
Latent Variables: a Love-Hate Relationship
GOOD! Appealing: model explanatory factors h
BAD! Exact inference? Nope. Just pain. Too many possible configurations of h.
WORSE! Each learning step usually requires inference and/or sampling from P(h, x)
Anonymous Latent Variables
No pre-assigned semantics
Learning discovers the underlying factors,
e.g., PCA discovers the leading directions of variation
Increases the expressiveness of P(x) = sum_h P(x, h)
Universal approximators, e.g. for RBMs
(Le Roux & Bengio, Neural Comp. 2008)
Approximate Inference
MAP:
h* = argmax_h P(h|x) -> assumes 1 dominant mode
Variational:
Look for a tractable Q(h) minimizing KL(Q(h) || P(h|x))
Q is either factorial or tree-structured -> strong assumption
MCMC:
Set up a Markov chain asymptotically sampling from P(h|x)
Approximate marginalization through an MC average over a few samples -> assumes a few dominant modes
Approximate inference can seriously hurt learning
(Kulesza & Pereira NIPS 2007)
Learned Approximate Inference
1. Construct a computational graph corresponding to inference:
Loopy belief prop. (Ross et al CVPR 2011; Stoyanov et al 2011)
Variational mean-field (Goodfellow et al, ICLR 2013)
MAP (Kavukcuoglu et al 2008; Gregor & LeCun ICML 2010)
2. Optimize its parameters wrt the criterion of interest, possibly decoupling them from the generative model's parameters
Learning can compensate for the inadequacy of approximate inference, taking advantage of the specifics of the data distribution
However: Potentially Huge Number of Modes in the Posterior P(h|x)
Foreign speech utterance example, y = answer to a question:
10 word segments
100 plausible candidates per word
10^6 possible segmentations
Most configurations (999999/1000000) implausible
-> about 10^20 high-probability modes
All known approximate inference schemes may break down if the posterior has a huge number of modes (fails MAP & MCMC) and does not respect a variational approximation (fails variational)

Deep neural nets learn good P(y|x) classifiers even if there are potentially many true latent variables involved
Exploits structure in P(y|x) that persists even after summing over h
But how do we generalize this idea to full joint-distribution learning and answering any question about these variables, not just one?
Learning Computational Graphs
Deep Stochastic Generative Networks (GSNs) trainable by backprop (Bengio & Laufer, arXiv 1306.1091)
Avoid any explicit latent variables whose marginalization is intractable; instead train a stochastic computational graph that generates the right {conditional} distribution.
[figure: GSN computational graph unrolled over a few steps, with noise injected into the hidden layers and a reconstruction target at each step]
Theoretical Results
The Markov chain associated with a denoising auto-encoder is a consistent estimator of the data-generating distribution (if the chain converges)
Same thing for Generative Stochastic Networks (so long as the reconstruction probability has enough expressive power to learn the required conditional distribution).
[figure: GSN computational graph as above]
GSN Experiments: validating the theorem in a continuous non-parametric setting
[figure: generated samples vs training data]

GSN Experiments: Consecutive Samples
[figure: consecutive samples; filling in the LHS (left half of the images)]
The Challenge of Disentangling Underlying Factors
Good disentangling ->
figure out the underlying structure of the data
avoid the curse of dimensionality
mix better between modes
How to obtain better disentangling????
Learning Multiple Levels of Abstraction
The big payoff of deep learning is to allow learning higher levels of abstraction
Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer
If Time Permits
Culture vs Effective Local Minima
(Bengio 2013; also arXiv 2012)
Issue: underfitting due to combinatorially many poor effective local minima where the optimizer gets stuck
Hypothesis 1
When the brain of a single biological agent learns, it performs an approximate optimization with respect to some endogenous objective.

Hypothesis 2
When the brain of a single biological agent learns, it relies on approximate local descent in order to gradually improve itself.

Hypothesis 3
Higher-level abstractions in brains are represented by deeper computations (going through more areas or more computational steps in sequence over the same areas).
Hypothesis 4
Learning of a single human learner is limited by effective local minima.
Theoretical and experimental results on deep learning suggest: possibly due to ill-conditioning, but it behaves like local minima.

Hypothesis 5
A single human learner is unlikely to discover high-level abstractions by chance because these are represented by a deep sub-network in the brain.
Hypothesis 6
A human brain can learn high-level abstractions if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-level abstractions.
Supporting evidence: (Gulcehre & Bengio ICLR 2013)

How is one brain transferring abstractions to another brain?
[figure: two brains (deep networks) receiving a shared input x; the linguistic exchange between them is a tiny / noisy channel connecting their linguistic representations]
How do we escape local minima?
Linguistic inputs = extra examples, summarizing knowledge
Criterion landscape easier to optimize (e.g. curriculum learning)
Turn difficult unsupervised learning into easy supervised learning of intermediate abstractions
Hypothesis 7
Language and meme recombination provide an efficient evolutionary operator, allowing rapid search in the space of memes, that helps humans build up better high-level internal representations of their world.

How could language/education/culture possibly help find the better local minima associated with more useful abstractions?
More than random search: potential exponential speed-up by divide-and-conquer
Combinatorial advantage: can combine solutions to independently solved sub-problems
From where do new ideas emerge?
Seconds: inference (novel explanations for the current x)
Minutes, hours: learning (local descent, like current DL)
Years, centuries: cultural evolution (global optimization, recombination of ideas from other humans)
Related Tutorials
Deep Learning Tutorials (python): http://deeplearning.net/tutorials
Stanford deep learning tutorials with simple programming assignments and reading list: http://deeplearning.stanford.edu/wiki/
ACL 2012 Deep Learning for NLP tutorial: http://www.socher.org/index.php/DeepLearningTutorial/
ICML 2012 Representation Learning tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
IPAM 2012 Summer School on Deep Learning: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-aaai2013.html
More reading: paper references in a separate pdf, on my web page
Software
Theano (Python CPU/GPU) mathematical and deep learning library: http://deeplearning.net/software/theano
Can do automatic, symbolic differentiation
SENNA: POS, Chunking, NER, SRL
by Collobert et al. http://ronan.collobert.com/senna/
State-of-the-art performance on many tasks
3500 lines of C, extremely fast and using very little memory
Torch ML Library (C++ + Lua): http://www.torch.ch/
Recurrent Neural Network Language Model: http://www.fit.vutbr.cz/~imikolov/rnnlm/
Recursive neural net and RAE models for paraphrase detection, sentiment analysis, relation classification: www.socher.org
Software: what's next
Off-the-shelf SVM packages are useful to researchers from a wide variety of fields (no need to understand RKHS).
To make deep learning more accessible: release off-the-shelf learning packages that handle hyper-parameter optimization, exploiting the multi-core machines or clusters at the user's disposal.
Spearmint (Snoek)
HyperOpt (Bergstra)
Conclusions
Deep Learning & Representation Learning have matured
Int. Conf. on Learning Representations 2013 a huge success!
Industrial-strength applications in place (Google, Microsoft)
Room for more research:
Scaling computation even more
Better optimization
Getting rid of intractable inference (in the works!)
Coaxing the models into more disentangled abstractions
Learning to reason from incrementally added facts
Merci! Questions?
LISA team