
Kernels, Random Embeddings and Deep Learning

Vikas Sindhwani
IBM Research, NY

October 28, 2014

Acknowledgements

At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan

Summer Interns: Jiyan Yang (Stanford), Po-sen Huang (UIUC)

Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)

IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)



Setting

- Given labeled data in the form of input-output pairs
  $\{x_i, y_i\}_{i=1}^n$, with $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^m$,
  estimate the unknown dependency $f : \mathcal{X} \mapsto \mathcal{Y}$.

- Regularized Risk Minimization in a suitable hypothesis space $\mathcal{H}$:
  $\arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda\, \Omega(f)$

- Large $n$ $\Rightarrow$ big models: $\mathcal{H}$ rich / non-parametric / nonlinear.

- Two great ML traditions around choosing $\mathcal{H}$:
  - Deep Neural Networks: $f(x) = s_n(\ldots s_2(W_2\, s_1(W_1 x)) \ldots)$
  - Kernel Methods: a general nonlinear function space generated by a kernel function $k(x, z)$ on $\mathcal{X} \times \mathcal{X}$.

- This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?


Deep Learning is Supercharging Machine Learning

- Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a
  top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.
- Many statistical and computational ingredients:
  - Large datasets (ILSVRC since 2010)
  - Large statistical capacity (1.2M images, 60M params)
  - Distributed computation
  - Depth, invariant feature learning (transferable to other tasks)
  - Engineering: Dropout, ReLU, ...
- Very active area in Speech and Natural Language Processing.


Machine Learning in the 1990s

- Convolutional Neural Networks (Fukushima 1980; LeCun et al. 1989)
  - 3 days to train on USPS (n = 7291; digit recognition) on a Sun
    SPARCstation 1 (33MHz clock speed, 64MB RAM)
- Personal history:
  - 1998: First ML experiment - train a DNN on the UCI Wine dataset.
  - 1999: Introduced to Kernel Methods - by DNN researchers!
  - 2003-4: NN paper rejected at JMLR; accepted in IEEE Trans. Neural Nets
    with a kernel methods section!
- Why Kernel Methods?
  - Local-minima free - a stronger role for Convex Optimization.
  - Theoretically appealing.
  - Handle non-vectorial and high-dimensional data.
  - Easier model selection via continuous optimization.
  - Matched NNs in many cases, although they didn't scale as well with respect to n.
- So what changed?
  - More data, parallel algorithms, hardware? Better DNN training? ...

Kernel Methods and Neural Networks (Pre-Google)



Kernel Methods and Neural Networks

- "Geoff Hinton facts" meme maintained at http://yann.lecun.com/ex/fun/
  - All kernels that ever dared approaching Geoff Hinton woke up convolved.
  - The only kernel Geoff Hinton has ever used is a kernel of truth.
  - If you defy Geoff Hinton, he will maximize your entropy in no time.
    Your free energy will be gone even before you reach equilibrium.
- Are there synergies between these fields towards the design of even better
  (faster and more accurate) algorithms?


The Mathematical Naturalness of Kernel Methods

- Data $\mathcal{X} \subseteq \mathbb{R}^d$, models $\mathcal{H} : \mathcal{X} \mapsto \mathbb{R}$.
- Geometry in $\mathcal{H}$: inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, norm $\|\cdot\|_{\mathcal{H}}$ (Hilbert spaces).
- Theorem: all "nice" Hilbert spaces are generated by a symmetric positive
  definite function (the kernel) $k(x, x')$ on $\mathcal{X} \times \mathcal{X}$:
  if $f, g \in \mathcal{H}$ are close, i.e. $\|f - g\|_{\mathcal{H}}$ is small, then $f(x), g(x)$ are close $\forall x \in \mathcal{X}$.
  $\Rightarrow$ Reproducing Kernel Hilbert Spaces (RKHSs).
- Functional Analysis (Aronszajn, Bergman (1950s)); Statistics (Parzen (1960s)); PDEs; Numerical Analysis ...
- ML: nonlinear classification, regression, clustering, time-series analysis,
  dynamical systems, hypothesis testing, causal modeling, ...
- In principle, it is possible to compose Deep Learning pipelines using more
  general nonlinear functions drawn from RKHSs.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?

Scalability Challenges for Kernel Methods

$f^\star = \arg\min_{f \in \mathcal{H}_k} \; \frac{1}{n}\sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}_k}, \qquad x_i \in \mathbb{R}^d$

- Representer Theorem: $f^\star(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$
- Regularized Least Squares: $(K + \lambda I)\,\alpha = Y$
  - $O(n^2)$ storage
  - $O(n^3 + n^2 d)$ training
  - $O(nd)$ test speed
- Hard to parallelize when working directly with $K_{ij} = k(x_i, x_j)$.
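To make these costs concrete, here is a minimal MATLAB sketch (not the talk's implementation) of exact Gaussian-kernel regularized least squares on a synthetic toy problem; the data, bandwidth, and regularizer are illustrative assumptions, and pdist2 comes from the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): exact kernel RLS showing the O(n^2) Gram
% matrix and the O(n^3) dense solve on a toy regression problem.
n = 2000; d = 10; sigma = 1; lambda = 1e-3;   % assumed toy sizes
X = randn(n, d); y = sin(X(:,1)) + 0.1*randn(n, 1);

D2 = pdist2(X, X).^2;                % n-by-n squared distances: O(n^2) memory
K  = exp(-D2 / (2*sigma^2));         % Gram matrix K_ij = k(x_i, x_j)
alpha = (K + lambda*eye(n)) \ y;     % O(n^3) dense solve

xtest = randn(5, d);                 % prediction: f(x) = sum_i alpha_i k(x, x_i)
Ktest = exp(-pdist2(xtest, X).^2 / (2*sigma^2));
ypred = Ktest * alpha;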

Randomized Algorithms

- Explicit approximate feature map $\Psi : \mathbb{R}^d \mapsto \mathbb{C}^s$ such that
  $k(x, z) \approx \langle \Psi(x), \Psi(z) \rangle_{\mathbb{C}^s}$
  - $O(ns)$ storage
  - $\left(Z(X)^T Z(X) + \lambda I\right) w = Z(X)^T Y$: $O(ns^2)$ training
  - $O(s)$ test speed
- Interested in data-oblivious maps that depend only on the kernel function,
  and not on the data.
- Should be very cheap to apply and parallelizable.


Random Fourier Features (Rahimi & Recht, 2007)

- Theorem [Bochner 1930-33]: there is a one-to-one Fourier-pair correspondence
  between any (normalized) shift-invariant kernel $k$ and a density $p$ such that
  $k(x, z) = \psi(x - z) = \int_{\mathbb{R}^d} e^{i (x - z)^T w}\, p(w)\, dw$
  - Gaussian kernel: $k(x, z) = e^{-\frac{\|x - z\|_2^2}{2\sigma^2}} \;\Leftrightarrow\; p = \mathcal{N}(0, \sigma^{-2} I_d)$
- Monte Carlo approximation to the integral representation:
  $k(x, z) \approx \frac{1}{s} \sum_{j=1}^{s} e^{i (x - z)^T w_j} = \langle \Psi_S(x), \Psi_S(z) \rangle_{\mathbb{C}^s}$
  $\Psi_S(x) = \frac{1}{\sqrt{s}} \left[ e^{i x^T w_1} \ldots e^{i x^T w_s} \right] \in \mathbb{C}^s, \qquad S = [w_1 \ldots w_s] \sim p$
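A minimal MATLAB sketch of this Monte Carlo feature map for the Gaussian kernel, compared against the exact kernel value; the dimensions and bandwidth are illustrative assumptions.

% Minimal sketch (illustrative): random Fourier features vs. the exact
% Gaussian kernel value for a single pair (x, z).
d = 10; s = 4000; sigma = 1;                    % assumed toy parameters
x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));

W = randn(s, d) / sigma;                        % rows w_j ~ N(0, sigma^-2 I_d)
phi = @(u) exp(1i * W * u) / sqrt(s);           % complex feature map Psi_S(u)
k_approx = real(phi(x)' * phi(z));              % <Psi_S(x), Psi_S(z)>
fprintf('exact %.4f, approx %.4f\n', k_exact, k_approx);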

DNNs vs Kernel Methods on TIMIT (Speech)

- Joint work with the IBM Speech Group and P. Huang:
  can shallow, convex, randomized kernel methods match DNNs?
  (predicting HMM states given a short window of coefficients representing acoustic input)

% One possible reading of the sketch: X is n-by-d (rows = examples), with
% frequencies drawn for a unit-bandwidth Gaussian kernel.
G = randn(size(X,2), s);                 % d-by-s random frequencies
Z = exp(1i*X*G);                         % n-by-s random Fourier features
C = Z'*Z;                                % s-by-s feature covariance
alpha = (C + lambda*eye(s)) \ (Z'*y(:)); % regularized least squares
ztest = real(exp(1i*xtest*G)*alpha);     % predictions on test inputs

- TIMIT: n = 2M, d = 440, k = 147.
  [Figure: classification error (%) vs. number of random features (s/10000) for a
  DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75GB).]
- Z(X): 1.2TB. Stream on blocks: C += Z_B' Z_B. But C is also big (47GB).
- Need: distributed solvers that handle big n and s, with Z(X) formed implicitly.

DNNs vs Kernel Methods on TIMIT (Speech)

- Kernel Methods match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech group.
- High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

- Phone error rate (PER) of 21.3% - the best reported for kernel methods, and
  below the 22.3% of the comparable DNN.
  - Competitive with HMM/DNN systems.
  - New record: 16.7% with CNNs (ICASSP 2014).
- Only two hyperparameters: the kernel bandwidth σ and s (which acts as an
  early-stopping regularizer).
- Z is about 6.4TB and C about 1.2TB; both are materialized in blocks, used,
  and discarded on the fly, in parallel.
- 2 hours on 256 IBM BlueGene/Q nodes.
  [Figure: TIMIT classification error (%) vs. number of random features (s/10000),
  same comparison as the previous slide.]

Distributed Convex Optimization

- Alternating Direction Method of Multipliers (1950s operator splitting; Boyd et al., 2013)
  $\arg\min_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$
- Row/column splitting; block splitting (Parikh & Boyd, 2013)
  $\arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{R} f_i(x) + g(x) \;\equiv\; \arg\min \sum_{i=1}^{R} f_i(x_i) + g(z) \;\text{ s.t. }\; x_i = z \qquad (1)$
- Consensus updates (with scaled dual variables $\nu_i$):
  $x_i^{k+1} = \arg\min_x\; f_i(x) + \tfrac{\rho}{2}\|x - z^k + \nu_i^k\|_2^2 \qquad (2)$
  $z^{k+1} = \mathrm{prox}_{g/(\rho R)}\!\left[\bar{x}^{k+1} + \bar{\nu}^k\right] \quad \text{(communication)} \qquad (3)$
  $\nu_i^{k+1} = \nu_i^k + x_i^{k+1} - z^{k+1} \qquad (4)$
  where $\mathrm{prox}_f[x] = \arg\min_y \tfrac{1}{2}\|x - y\|_2^2 + f(y)$.
- Note: extra consensus and dual variables need to be managed.
- Closed-form updates, extensibility, code reuse, parallelism.
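As a concrete, small-scale illustration of updates (1)-(4) (not the libSkylark solver), here is a MATLAB sketch of consensus ADMM for ridge regression with the rows split into R blocks; the problem sizes, rho, and lambda are assumptions, and the "reduce"/"broadcast" comments only mimic what a distributed implementation would communicate.

% Minimal sketch (illustrative): consensus ADMM for ridge regression,
% data rows split into R blocks, following updates (1)-(4).
n = 1200; s = 50; R = 4; lambda = 1e-2; rho = 1;
Z = randn(n, s); w_true = randn(s, 1); y = Z*w_true + 0.01*randn(n, 1);
blk = reshape(1:n, [], R);                       % column i = row indices of block i

x  = zeros(s, R); nu = zeros(s, R); z = zeros(s, 1);
for k = 1:100
    for i = 1:R                                  % local solves (parallelizable)
        Zi = Z(blk(:,i), :); yi = y(blk(:,i));
        x(:,i) = (Zi'*Zi + rho*eye(s)) \ (Zi'*yi + rho*(z - nu(:,i)));
    end
    v = mean(x + nu, 2);                         % "reduce": average of x_i + nu_i
    z = (rho*R) * v / (rho*R + lambda);          % prox of (lambda/2)||z||^2, scaled by 1/(rho R)
    nu = nu + x - repmat(z, 1, R);               % dual updates ("broadcast" z)
end
fprintf('distance to w_true: %.2e\n', norm(z - w_true));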

Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Diagram: each MPI rank holds a data block (Y_i, X_i) and applies the loss prox
(prox_l) and the random-feature transform T[X_i, .] across T OpenMP threads per
node; feature blocks Z_ij and local variables W_ij go through the graph projection
(proj_Z_ij); a reduce to node 0 applies the regularizer prox (prox_r) to the
consensus model W, which is then broadcast back to all ranks.]

Scalability

- Graph projection:
  $U = [Z_{ij}^T Z_{ij} + I]^{-1}(X + Z_{ij}^T Y), \qquad V = Z_{ij} U$
- High-performance implementation that can handle large numbers of column splits
  by reorganizing updates to exploit shared-memory access and the structure of the
  graph projection.
- Per-node memory and computation decompose into transform, gemm, graph-projection,
  and cached terms, amortized over T threads per node and R nodes.
- Communication: $O(sm \log R)$ (model reduce/broadcast).
- Stochastic, asynchronous versions may be possible to develop.
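A minimal MATLAB sketch of the graph-projection step above for a single block; the block sizes and names (Zij, Xin, Yin) are assumptions, and the Cholesky factor is computed once to suggest how it could be cached across iterations.

% Minimal sketch (illustrative): one graph projection
% U = (Z'Z + I)^{-1}(X + Z'Y), V = Z U, with a cached Cholesky factor.
nB = 500; sB = 80; m = 10;                       % assumed block sizes
Zij = randn(nB, sB); Xin = randn(sB, m); Yin = randn(nB, m);

Lc = chol(Zij'*Zij + eye(sB), 'lower');          % factor once, reuse across iterations
U = Lc' \ (Lc \ (Xin + Zij'*Yin));               % two triangular solves
V = Zij * U;                                     % second output of the projection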

Randomized Kernel Methods on Thousands of Cores

- Triloka: 20 nodes, 16 cores per node; BG/Q: 1024 nodes, 16 (x4) cores per node.
- s = 100K, C = 200; strong scaling (n = 250k), weak scaling (n = 250k per node).

[Figure: MNIST strong- and weak-scaling panels on Triloka (t = 6 threads/process)
and BG/Q (t = 64 threads/process), plotting speedup vs. ideal, time (secs), and
classification accuracy (%) against the number of MPI processes.]

Comparisons on MNIST
See no-distortion results at http://yann.lecun.com/exdb/mnist/

Method                                              Error (%)   Reference
Gaussian Kernel                                     1.4
Poly (4)                                            1.1
3-layer NN, 500+300 HU                              1.53        Hinton, 2005
LeNet5                                              0.8         LeCun et al., 1998
Large CNN (pretrained)                              0.60        Ranzato et al., NIPS 2006
Large CNN (pretrained)                              0.53        Jarrett et al., ICCV 2009
Scattering transform + Gaussian Kernel              0.4         Bruna and Mallat, 2012
20000 random features                               0.45        my experiments
5000 random features                                0.52        my experiments
Committee of 35 conv. nets [elastic distortions]    0.23        Ciresan et al., 2012

- When similar prior knowledge is enforced (invariant learning), the performance
  gaps vanish.
- RKHS mappings can, in principle, be used to implement CNNs.

Aside: libSkylark
http://xdata-skylark.github.io/libskylark/docs/sphinx/

- C/C++/Python library, MPI, Elemental/CombBLAS containers.
- Distributed sketching operators: $\|Ax - b\|_2 \approx \|S(Ax - b)\|_2$
- Randomized least squares, SVD.
- Randomized kernel methods, with modularity via prox operators:
  - Losses: squared, hinge, L1, multinomial logistic regression.
  - Regularizers: L1, L2.

Kernel              Embedding   z(x)                                            Time
Shift-invariant     RFT         exp(iGx)                                        O(sd), O(s nnz(x))
Shift-invariant     FRFT        exp(iSGHPHBx)                                   O(s log(d))
Semigroup           RLT         exp(-Gx)                                        O(sd), O(s nnz(x))
Polynomial (deg q)  PPT         F^{-1}(F(C_1 x) .* ... .* F(C_q x))             O(q(nnz(x) + s log s))

Rahimi & Recht, 2007; Pham & Pagh, 2013; Le, Sarlos and Smola, 2013.
Random Laplace Feature Maps for Semigroup Kernels on Histograms, CVPR 2014, J. Yang, V.S., M. Mahoney, H. Avron, Q. Fan.
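For the PPT row of the table, here is a minimal MATLAB sketch of one reading of the Pham & Pagh (2013) TensorSketch for the homogeneous polynomial kernel $(x^T z)^q$; the sketch size s, degree q, and dimensions are illustrative assumptions, not libSkylark's implementation.

% Minimal sketch (illustrative): degree-q TensorSketch, i.e. the elementwise
% product of FFT'd CountSketches, inverted back with an IFFT.
d = 50; s = 512; q = 2; x = randn(d, 1); z = randn(d, 1);
h = randi(s, d, q); sgn = sign(randn(d, q));           % q independent CountSketches
cs = @(u, j) accumarray(h(:,j), sgn(:,j).*u, [s 1]);   % CountSketch C_j u
px = ones(s, 1); pz = ones(s, 1);
for j = 1:q
    px = px .* fft(cs(x, j)); pz = pz .* fft(cs(z, j));
end
zx = ifft(px); zz = ifft(pz);                          % embeddings of x and z
fprintf('exact %.3f, sketch %.3f\n', (x'*z)^q, real(zx'*zz));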


Efficiency of Random Embeddings

- TIMIT: 58.8M parameters (s = 400k, m = 147) vs. 19.9M parameters for the DNN.
- Draw $S = [w_1 \ldots w_s] \sim p$ and approximate the integral:
  $k(x, z) = \int_{\mathbb{R}^d} e^{i(x-z)^T w}\, p(w)\, dw \;\approx\; \frac{1}{|S|} \sum_{w \in S} e^{i(x-z)^T w} \qquad (6)$
- Integration error:
  $\epsilon_{p,S}[f] = \left| \int_{[0,1]^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w) \right|$
- Monte Carlo convergence rate: $\mathbb{E}[\epsilon_{p,S}[f]^2] \le \sigma_f^2\, s^{-1}$,
  so a 4-fold increase in s only cuts the error in half.
- Can we do better with a different sequence S?
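A quick illustration of that Monte Carlo rate, under assumed toy dimensions: averaging the kernel-approximation error over repeated draws, quadrupling s roughly halves it.

% Minimal sketch (illustrative): error of the estimate in (6) decays ~ 1/sqrt(s).
d = 10; sigma = 1; x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));
for s = [1000 4000 16000]
    err = zeros(50, 1);
    for t = 1:50                                 % average over repeated draws of S
        W = randn(s, d) / sigma;
        err(t) = abs(mean(cos(W*(x - z))) - k_exact);
    end
    fprintf('s = %5d, mean abs error %.4f\n', s, mean(err));
end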


Quasi-Monte Carlo Sequences: Intuition

- Weyl 1916; Koksma 1942; Dick et al., 2013; Caflisch, 1998.
- Consider approximating $\int_{[0,1]^2} f(x)\, dx$ with $\frac{1}{s} \sum_{w \in S} f(w)$.

[Figure: uniform (MC) vs. Halton point sets on the unit square.]

- Deterministic, correlated QMC sampling avoids the clustering and clumping
  effects seen in MC point sets.
- Hierarchical structure: coarse-to-fine sampling as s increases.


Star Discrepancy

- The integration error depends on the variation of f and the uniformity of S.
- Theorem (Koksma-Hlawka inequality (1941, 1961)):
  $\epsilon_S[f] \le D^\star(S)\, V_{HK}[f]$, where
  $D^\star(S) = \sup_{x \in [0,1]^d} \left| \mathrm{vol}(J_x) - \frac{|\{i : w_i \in J_x\}|}{s} \right|$
  and $J_x = [0, x)$ denotes the anchored box at $x$.


Quasi-Monte Carlo Sequences

We seek sequences with low discrepancy.

- Theorem (Roth, 1954): $D^\star(S) \ge c_d\, \frac{(\log s)^{\frac{d-1}{2}}}{s}$.
- For the regular grid on $[0,1)^d$ with $s = m^d$, $D^\star \sim s^{-1/d}$ $\Rightarrow$ only
  optimal for $d = 1$.
- Theorem (Doerr, 2013): let $S$ with $|S| = s$ be chosen uniformly at random from
  $[0,1)^d$. Then $\mathbb{E}[D^\star(S)] \sim \sqrt{d/s}$ - the Monte Carlo rate.
- There exist low-discrepancy sequences that achieve $C_d\, \frac{(\log s)^d}{s}$,
  conjectured to be the optimal rate.
  - Matlab QMC generators: haltonset, sobolset, ...
- With d fixed, this bound actually grows until $s \approx e^d$, to about $(d/e)^d$!
  - "On the unreasonable effectiveness of QMC in high dimensions."
- RKHSs and kernels show up in modern QMC analysis!
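A minimal MATLAB sketch of QMC random Fourier features using the haltonset generator mentioned above, with a Halton point set pushed through the Gaussian inverse CDF; the bandwidth and skip value are illustrative, and haltonset/norminv require the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): QMC features for the Gaussian kernel.
d = 10; s = 4000; sigma = 1; x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));

P = haltonset(d, 'Skip', 1000);        % low-discrepancy sequence on [0,1)^d
U = net(P, s);                         % first s points, s-by-d
W = norminv(U) / sigma;                % map through inverse CDF of N(0, sigma^-2 I_d)
k_qmc = mean(cos(W*(x - z)));
fprintf('exact %.4f, QMC approx %.4f\n', k_exact, k_qmc);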


How do standard QMC sequences perform?

- Compare $K \approx Z(X) Z(X)^*$, where $Z(X) = e^{iXG}$ and G is drawn from a
  QMC sequence generator instead.

[Figure: relative error on $\|K\|_2$ vs. number of random features for MC, Halton,
Sobol, digital net, and lattice sequences, on USPST (n = 1506) and CPU (n = 6554).]

- QMC sequences are consistently better.
- Why are some QMC sequences better than others, e.g., Halton over Sobol?
- Can we learn sequences even better adapted to our problem class?
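A minimal MATLAB sketch of this comparison on a small synthetic dataset (not USPST/CPU); the sizes and bandwidth are assumptions, and pdist2/haltonset/norminv require the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): relative error ||K - Z Z*||_2 / ||K||_2
% for MC vs. Halton feature maps.
n = 300; d = 8; s = 500; sigma = 2;
X = randn(n, d);
K = exp(-pdist2(X, X).^2 / (2*sigma^2));

Gmc  = randn(d, s) / sigma;                         % Monte Carlo frequencies
U    = net(haltonset(d, 'Skip', 1000), s);          % QMC points on [0,1)^d
Gqmc = norminv(U)' / sigma;                         % d-by-s QMC frequencies

Gs = {Gmc, Gqmc}; names = {'MC', 'Halton'};
for t = 1:2
    Z = exp(1i*X*Gs{t}) / sqrt(s);                  % n-by-s complex features
    fprintf('%s relative error %.4f\n', names{t}, norm(K - real(Z*Z'), 2) / norm(K, 2));
end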


RKHSs in QMC Theory

- "Nice" integrands: $f \in \mathcal{H}_h$, where $h(\cdot, \cdot)$ is a kernel function.
- By the reproducing property and the Cauchy-Schwarz inequality:
  $\left| \int_{\mathbb{R}^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w) \right| \le \|f\|_{\mathcal{H}_h}\, D_{h,p}(S) \qquad (7)$
  where $D_{h,p}$ is a discrepancy measure built from the mean embedding
  $m_p = \int_{\mathbb{R}^d} h(x, \cdot)\, p(x)\, dx$:
  $D_{h,p}(S) = \left\| m_p(\cdot) - \frac{1}{s} \sum_{w \in S} h(w, \cdot) \right\|_{\mathcal{H}_h}$
- Expanding the square:
  $D_{h,p}(S)^2 = \text{const.} - \underbrace{\frac{2}{s} \sum_{l=1}^{s} \int_{\mathbb{R}^d} h(w_l, \omega)\, p(\omega)\, d\omega}_{\text{alignment with } p} + \underbrace{\frac{1}{s^2} \sum_{l=1}^{s} \sum_{j=1}^{s} h(w_l, w_j)}_{\text{pairwise similarity in } S}$


Box Discrepancy

- Assume that the (shifted) data lives in a box:
  $\square_b = \{ u = x - z : -b \preceq u \preceq b,\; x, z \in \mathcal{X} \}$
- Class of functions we want to integrate:
  $\mathcal{F}_b = \{ f(w) = e^{i \langle w, u \rangle},\; u \in \square_b \}$
- Theorem: the expected integration error is proportional to the box discrepancy,
  $\mathbb{E}_{f \sim U(\mathcal{F}_b)}\!\left[\epsilon_{S,p}[f]^2\right] \;\propto\; D^{\square}_p(S)^2 \qquad (9)$
  where $D^{\square}_p(S)$ is the discrepancy associated with the sinc kernel
  $\mathrm{sinc}_b(u, v) = \pi^{-d} \prod_{j=1}^{d} \frac{\sin\!\big(b_j (u_j - v_j)\big)}{u_j - v_j}$
- Can be evaluated in closed form for the Gaussian density.
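Since the box discrepancy is built from this sinc kernel, here is a small MATLAB sketch, under assumed box half-widths b, of evaluating it and the pairwise-similarity term that appears in the discrepancy expansion; this is only an illustration, not the closed-form Gaussian-density evaluation mentioned above.

% Minimal sketch (illustrative): the sinc kernel and the pairwise-similarity
% term (1/s^2) sum_{l,j} sinc_b(w_l, w_j) for a candidate frequency set S.
d = 5; s = 200; b = ones(1, d);                  % assumed box half-widths
S = randn(s, d);                                 % candidate frequencies, s-by-d

sinc_b = @(u, v) pi^(-d) * prod(sin(b .* (u - v)) ./ (u - v));
pair = pi^(-d) * prod(b) * s;                    % diagonal terms: sin(b_j t)/t -> b_j
for l = 1:s
    for j = l+1:s
        pair = pair + 2 * sinc_b(S(l,:), S(j,:));   % off-diagonal terms (symmetric)
    end
end
pair = pair / s^2;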

Does Box Discrepancy Explain the Behaviour of QMC Sequences?

[Figure: $D^{\square}(S)^2$ vs. number of samples on the CPU dataset (d = 21),
comparing digital net, MC (expected), Halton, Sobol, and lattice sequences.]

Learning Adaptive QMC Sequences

- Unlike the star discrepancy, the box discrepancy admits numerical optimization:
  $S^\star = \arg\min_{S = (w_1 \ldots w_s) \in \mathbb{R}^{d \times s}} D^{\square}(S), \qquad S^{(0)} = \text{Halton}. \qquad (10)$

[Figure: CPU dataset, s = 100; normalized $D^{\square}(S)^2$, maximum squared error,
mean squared error, and $\|\hat{K} - K\|_2 / \|K\|_2$ vs. optimization iteration.]

- However, the full impact on large-scale problems is an open research topic.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?

Randomization-vs-Optimization

[Diagram: a kernel k(x, z) can be approached either by randomization (sampling
frequencies from the associated density $p_k$) or by optimization (selecting
leading eigenfunctions $eig_1(x), \ldots, eig_5(x)$).]

- Jarrett et al. 2009, "What is the Best Multi-Stage Architecture for Object
  Recognition?": "The most astonishing result is that systems with random filters
  and no filter learning whatsoever achieve decent performance."
- "On Random Weights and Unsupervised Feature Learning", Saxe et al., ICML 2011:
  "a surprising fraction of performance can be attributed to architecture alone."

Deep Learning with Kernels?

- Maps across layers can be parameterized using more general nonlinearities (kernels).

[Figure, adapted from Mairal et al., NIPS 2014: an image (pixel layer $x_1$) is mapped
by $f_1 \in \mathcal{H}_{k_1}$ to a feature layer $x_2$, which is mapped in turn by $f_2 \in \mathcal{H}_{k_2}$.]

- Mathematics of the Neural Response, Smale et al., FoCM (2010).
- Convolutional Kernel Networks, Mairal et al., NIPS 2014.
- SimNets: A Generalization of Conv. Nets, Cohen and Shashua, 2014:
  learns networks 1/8 the size of comparable ConvNets.

Conclusions

- Some empirical evidence suggests that once kernel methods are scaled up and
  embody similar statistical principles, they are competitive with Deep Neural Networks.
  - Randomization and distributed computation are both required.
  - Ideas from QMC integration techniques are promising.
- There are opportunities for designing new algorithms that combine insights from
  deep learning with the generality and mathematical richness of kernel methods.
