
Kernels, Random Embeddings and Deep Learning

Vikas Sindhwani
IBM Research, NY

October 28, 2014

Acknowledgements

At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan

Summer Interns: Jiyan Yang (Stanford), Po-sen Huang (UIUC)

Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)

IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)



Setting

- Given labeled data in the form of input-output pairs
  $\{x_i, y_i\}_{i=1}^n$, with $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^m$,
  estimate the unknown dependency $f : \mathcal{X} \mapsto \mathcal{Y}$.

- Regularized Risk Minimization in a suitable hypothesis space $\mathcal{H}$:
  $\arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda\, \Omega(f)$

- Large $n$ $\Rightarrow$ big models: $\mathcal{H}$ rich / non-parametric / nonlinear.

- Two great ML traditions around choosing $\mathcal{H}$:
  - Deep Neural Networks: $f(x) = s_n(\ldots s_2(W_2\, s_1(W_1 x)) \ldots)$
  - Kernel Methods: a general nonlinear function space generated by a kernel function $k(x, z)$ on $\mathcal{X} \times \mathcal{X}$.

- This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?


Deep Learning is Supercharging Machine Learning

- Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a
  top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.
- Many statistical and computational ingredients:
  - Large datasets (ILSVRC since 2010)
  - Large statistical capacity (1.2M images, 60M params)
  - Distributed computation
  - Depth, invariant feature learning (transferable to other tasks)
  - Engineering: Dropout, ReLU, ...
- Very active area in Speech and Natural Language Processing.


Machine Learning in the 1990s

- Convolutional Neural Networks (Fukushima 1980; LeCun et al. 1989)
  - 3 days to train on USPS (n = 7291; digit recognition) on a Sun
    SPARCstation 1 (33MHz clock speed, 64MB RAM)
- Personal history:
  - 1998: First ML experiment - train a DNN on the UCI Wine dataset.
  - 1999: Introduced to Kernel Methods - by DNN researchers!
  - 2003-4: NN paper rejected at JMLR; accepted in IEEE Trans. Neural Nets
    with a kernel methods section!
- Why Kernel Methods?
  - Local-minima free - a stronger role for Convex Optimization.
  - Theoretically appealing.
  - Handle non-vectorial and high-dimensional data.
  - Easier model selection via continuous optimization.
  - Matched NNs in many cases, although they didn't scale as well with respect to n.
- So what changed?
  - More data, parallel algorithms, hardware? Better DNN training? ...

Kernel Methods and Neural Networks (Pre-Google)



Kernel Methods and Neural Networks

- "Geoff Hinton facts" meme maintained at http://yann.lecun.com/ex/fun/
  - All kernels that ever dared approaching Geoff Hinton woke up convolved.
  - The only kernel Geoff Hinton has ever used is a kernel of truth.
  - If you defy Geoff Hinton, he will maximize your entropy in no time.
    Your free energy will be gone even before you reach equilibrium.
- Are there synergies between these fields towards the design of even better
  (faster and more accurate) algorithms?


The Mathematical Naturalness of Kernel Methods

- Data $\mathcal{X} \subseteq \mathbb{R}^d$, models $\mathcal{H} : \mathcal{X} \mapsto \mathbb{R}$.
- Geometry in $\mathcal{H}$: inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, norm $\|\cdot\|_{\mathcal{H}}$ (Hilbert spaces).
- Theorem: all "nice" Hilbert spaces are generated by a symmetric positive
  definite function (the kernel) $k(x, x')$ on $\mathcal{X} \times \mathcal{X}$:
  if $f, g \in \mathcal{H}$ are close, i.e. $\|f - g\|_{\mathcal{H}}$ is small, then $f(x), g(x)$ are close $\forall x \in \mathcal{X}$.
  $\Rightarrow$ Reproducing Kernel Hilbert Spaces (RKHSs).
- Functional Analysis (Aronszajn, Bergman (1950s)); Statistics (Parzen (1960s)); PDEs; Numerical Analysis ...
- ML: nonlinear classification, regression, clustering, time-series analysis,
  dynamical systems, hypothesis testing, causal modeling, ...
- In principle, it is possible to compose Deep Learning pipelines using more
  general nonlinear functions drawn from RKHSs.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?

Scalability Challenges for Kernel Methods

$f^\star = \arg\min_{f \in \mathcal{H}_k} \; \frac{1}{n}\sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}_k}, \qquad x_i \in \mathbb{R}^d$

- Representer Theorem: $f^\star(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$
- Regularized Least Squares: $(K + \lambda I)\,\alpha = Y$
  - $O(n^2)$ storage
  - $O(n^3 + n^2 d)$ training
  - $O(nd)$ test speed
- Hard to parallelize when working directly with $K_{ij} = k(x_i, x_j)$.
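To make these costs concrete, here is a minimal MATLAB sketch (not the talk's implementation) of exact Gaussian-kernel regularized least squares on a synthetic toy problem; the data, bandwidth, and regularizer are illustrative assumptions, and pdist2 comes from the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): exact kernel RLS showing the O(n^2) Gram
% matrix and the O(n^3) dense solve on a toy regression problem.
n = 2000; d = 10; sigma = 1; lambda = 1e-3;   % assumed toy sizes
X = randn(n, d); y = sin(X(:,1)) + 0.1*randn(n, 1);

D2 = pdist2(X, X).^2;                % n-by-n squared distances: O(n^2) memory
K  = exp(-D2 / (2*sigma^2));         % Gram matrix K_ij = k(x_i, x_j)
alpha = (K + lambda*eye(n)) \ y;     % O(n^3) dense solve

xtest = randn(5, d);                 % prediction: f(x) = sum_i alpha_i k(x, x_i)
Ktest = exp(-pdist2(xtest, X).^2 / (2*sigma^2));
ypred = Ktest * alpha;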

Randomized Algorithms

- Explicit approximate feature map $\Psi : \mathbb{R}^d \mapsto \mathbb{C}^s$ such that
  $k(x, z) \approx \langle \Psi(x), \Psi(z) \rangle_{\mathbb{C}^s}$
  - $O(ns)$ storage
  - $\left(Z(X)^T Z(X) + \lambda I\right) w = Z(X)^T Y$: $O(ns^2)$ training
  - $O(s)$ test speed
- Interested in data-oblivious maps that depend only on the kernel function,
  and not on the data.
- Should be very cheap to apply and parallelizable.


Random Fourier Features (Rahimi & Recht, 2007)

- Theorem [Bochner 1930-33]: there is a one-to-one Fourier-pair correspondence
  between any (normalized) shift-invariant kernel $k$ and a density $p$ such that
  $k(x, z) = \psi(x - z) = \int_{\mathbb{R}^d} e^{i (x - z)^T w}\, p(w)\, dw$
  - Gaussian kernel: $k(x, z) = e^{-\frac{\|x - z\|_2^2}{2\sigma^2}} \;\Leftrightarrow\; p = \mathcal{N}(0, \sigma^{-2} I_d)$
- Monte Carlo approximation to the integral representation:
  $k(x, z) \approx \frac{1}{s} \sum_{j=1}^{s} e^{i (x - z)^T w_j} = \langle \Psi_S(x), \Psi_S(z) \rangle_{\mathbb{C}^s}$
  $\Psi_S(x) = \frac{1}{\sqrt{s}} \left[ e^{i x^T w_1} \ldots e^{i x^T w_s} \right] \in \mathbb{C}^s, \qquad S = [w_1 \ldots w_s] \sim p$
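A minimal MATLAB sketch of this Monte Carlo feature map for the Gaussian kernel, compared against the exact kernel value; the dimensions and bandwidth are illustrative assumptions.

% Minimal sketch (illustrative): random Fourier features vs. the exact
% Gaussian kernel value for a single pair (x, z).
d = 10; s = 4000; sigma = 1;                    % assumed toy parameters
x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));

W = randn(s, d) / sigma;                        % rows w_j ~ N(0, sigma^-2 I_d)
phi = @(u) exp(1i * W * u) / sqrt(s);           % complex feature map Psi_S(u)
k_approx = real(phi(x)' * phi(z));              % <Psi_S(x), Psi_S(z)>
fprintf('exact %.4f, approx %.4f\n', k_exact, k_approx);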

DNNs vs Kernel Methods on TIMIT (Speech)

- Joint work with the IBM Speech Group and P. Huang:
  can shallow, convex, randomized kernel methods match DNNs?
  (predicting HMM states given a short window of coefficients representing acoustic input)

% One possible reading of the sketch: X is n-by-d (rows = examples), with
% frequencies drawn for a unit-bandwidth Gaussian kernel.
G = randn(size(X,2), s);                 % d-by-s random frequencies
Z = exp(1i*X*G);                         % n-by-s random Fourier features
C = Z'*Z;                                % s-by-s feature covariance
alpha = (C + lambda*eye(s)) \ (Z'*y(:)); % regularized least squares
ztest = real(exp(1i*xtest*G)*alpha);     % predictions on test inputs

- TIMIT: n = 2M, d = 440, k = 147.
  [Figure: classification error (%) vs. number of random features (s/10000) for a
  DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75GB).]
- Z(X): 1.2TB. Stream on blocks: C += Z_B' Z_B. But C is also big (47GB).
- Need: distributed solvers that handle big n and s, with Z(X) formed implicitly.

DNNs vs Kernel Methods on TIMIT (Speech)

- Kernel Methods match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech group.
- High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

- Phone error rate (PER) of 21.3% - the best reported for kernel methods, and
  below the 22.3% of the comparable DNN.
  - Competitive with HMM/DNN systems.
  - New record: 16.7% with CNNs (ICASSP 2014).
- Only two hyperparameters: the kernel bandwidth σ and s (which acts as an
  early-stopping regularizer).
- Z is about 6.4TB and C about 1.2TB; both are materialized in blocks, used,
  and discarded on the fly, in parallel.
- 2 hours on 256 IBM BlueGene/Q nodes.
  [Figure: TIMIT classification error (%) vs. number of random features (s/10000),
  same comparison as the previous slide.]

Distributed Convex Optimization

- Alternating Direction Method of Multipliers (1950s operator splitting; Boyd et al., 2013)
  $\arg\min_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$
- Row/column splitting; block splitting (Parikh & Boyd, 2013)
  $\arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{R} f_i(x) + g(x) \;\equiv\; \arg\min \sum_{i=1}^{R} f_i(x_i) + g(z) \;\text{ s.t. }\; x_i = z \qquad (1)$
- Consensus updates (with scaled dual variables $\nu_i$):
  $x_i^{k+1} = \arg\min_x\; f_i(x) + \tfrac{\rho}{2}\|x - z^k + \nu_i^k\|_2^2 \qquad (2)$
  $z^{k+1} = \mathrm{prox}_{g/(\rho R)}\!\left[\bar{x}^{k+1} + \bar{\nu}^k\right] \quad \text{(communication)} \qquad (3)$
  $\nu_i^{k+1} = \nu_i^k + x_i^{k+1} - z^{k+1} \qquad (4)$
  where $\mathrm{prox}_f[x] = \arg\min_y \tfrac{1}{2}\|x - y\|_2^2 + f(y)$.
- Note: extra consensus and dual variables need to be managed.
- Closed-form updates, extensibility, code reuse, parallelism.
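As a concrete, small-scale illustration of updates (1)-(4) (not the libSkylark solver), here is a MATLAB sketch of consensus ADMM for ridge regression with the rows split into R blocks; the problem sizes, rho, and lambda are assumptions, and the "reduce"/"broadcast" comments only mimic what a distributed implementation would communicate.

% Minimal sketch (illustrative): consensus ADMM for ridge regression,
% data rows split into R blocks, following updates (1)-(4).
n = 1200; s = 50; R = 4; lambda = 1e-2; rho = 1;
Z = randn(n, s); w_true = randn(s, 1); y = Z*w_true + 0.01*randn(n, 1);
blk = reshape(1:n, [], R);                       % column i = row indices of block i

x  = zeros(s, R); nu = zeros(s, R); z = zeros(s, 1);
for k = 1:100
    for i = 1:R                                  % local solves (parallelizable)
        Zi = Z(blk(:,i), :); yi = y(blk(:,i));
        x(:,i) = (Zi'*Zi + rho*eye(s)) \ (Zi'*yi + rho*(z - nu(:,i)));
    end
    v = mean(x + nu, 2);                         % "reduce": average of x_i + nu_i
    z = (rho*R) * v / (rho*R + lambda);          % prox of (lambda/2)||z||^2, scaled by 1/(rho R)
    nu = nu + x - repmat(z, 1, R);               % dual updates ("broadcast" z)
end
fprintf('distance to w_true: %.2e\n', norm(z - w_true));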

Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Diagram: each MPI rank holds a data block (Y_i, X_i) and applies the loss prox
(prox_l) and the random-feature transform T[X_i, .] across T OpenMP threads per
node; feature blocks Z_ij and local variables W_ij go through the graph projection
(proj_Z_ij); a reduce to node 0 applies the regularizer prox (prox_r) to the
consensus model W, which is then broadcast back to all ranks.]

Scalability

- Graph projection:
  $U = [Z_{ij}^T Z_{ij} + I]^{-1}(X + Z_{ij}^T Y), \qquad V = Z_{ij} U$
- High-performance implementation that can handle large numbers of column splits
  by reorganizing updates to exploit shared-memory access and the structure of the
  graph projection.
- Per-node memory and computation decompose into transform, gemm, graph-projection,
  and cached terms, amortized over T threads per node and R nodes.
- Communication: $O(sm \log R)$ (model reduce/broadcast).
- Stochastic, asynchronous versions may be possible to develop.
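A minimal MATLAB sketch of the graph-projection step above for a single block; the block sizes and names (Zij, Xin, Yin) are assumptions, and the Cholesky factor is computed once to suggest how it could be cached across iterations.

% Minimal sketch (illustrative): one graph projection
% U = (Z'Z + I)^{-1}(X + Z'Y), V = Z U, with a cached Cholesky factor.
nB = 500; sB = 80; m = 10;                       % assumed block sizes
Zij = randn(nB, sB); Xin = randn(sB, m); Yin = randn(nB, m);

Lc = chol(Zij'*Zij + eye(sB), 'lower');          % factor once, reuse across iterations
U = Lc' \ (Lc \ (Xin + Zij'*Yin));               % two triangular solves
V = Zij * U;                                     % second output of the projection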

Randomized Kernel Methods on Thousands of Cores

- Triloka: 20 nodes, 16 cores per node; BG/Q: 1024 nodes, 16 (x4) cores per node.
- s = 100K, C = 200; strong scaling (n = 250k), weak scaling (n = 250k per node).

[Figure: MNIST strong- and weak-scaling panels on Triloka (t = 6 threads/process)
and BG/Q (t = 64 threads/process), plotting speedup vs. ideal, time (secs), and
classification accuracy (%) against the number of MPI processes.]

Comparisons on MNIST
See no-distortion results at http://yann.lecun.com/exdb/mnist/

Method                                              Error (%)   Reference
Gaussian Kernel                                     1.4
Poly (4)                                            1.1
3-layer NN, 500+300 HU                              1.53        Hinton, 2005
LeNet5                                              0.8         LeCun et al., 1998
Large CNN (pretrained)                              0.60        Ranzato et al., NIPS 2006
Large CNN (pretrained)                              0.53        Jarrett et al., ICCV 2009
Scattering transform + Gaussian Kernel              0.4         Bruna and Mallat, 2012
20000 random features                               0.45        my experiments
5000 random features                                0.52        my experiments
Committee of 35 conv. nets [elastic distortions]    0.23        Ciresan et al., 2012

- When similar prior knowledge is enforced (invariant learning), the performance
  gaps vanish.
- RKHS mappings can, in principle, be used to implement CNNs.

Aside: libSkylark
http://xdata-skylark.github.io/libskylark/docs/sphinx/

- C/C++/Python library, MPI, Elemental/CombBLAS containers.
- Distributed sketching operators: $\|Ax - b\|_2 \approx \|S(Ax - b)\|_2$
- Randomized least squares, SVD.
- Randomized kernel methods, with modularity via prox operators:
  - Losses: squared, hinge, L1, multinomial logistic regression.
  - Regularizers: L1, L2.

Kernel              Embedding   z(x)                                            Time
Shift-invariant     RFT         exp(iGx)                                        O(sd), O(s nnz(x))
Shift-invariant     FRFT        exp(iSGHPHBx)                                   O(s log(d))
Semigroup           RLT         exp(-Gx)                                        O(sd), O(s nnz(x))
Polynomial (deg q)  PPT         F^{-1}(F(C_1 x) .* ... .* F(C_q x))             O(q(nnz(x) + s log s))

Rahimi & Recht, 2007; Pham & Pagh, 2013; Le, Sarlos and Smola, 2013.
Random Laplace Feature Maps for Semigroup Kernels on Histograms, CVPR 2014, J. Yang, V.S., M. Mahoney, H. Avron, Q. Fan.
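For the PPT row of the table, here is a minimal MATLAB sketch of one reading of the Pham & Pagh (2013) TensorSketch for the homogeneous polynomial kernel $(x^T z)^q$; the sketch size s, degree q, and dimensions are illustrative assumptions, not libSkylark's implementation.

% Minimal sketch (illustrative): degree-q TensorSketch, i.e. the elementwise
% product of FFT'd CountSketches, inverted back with an IFFT.
d = 50; s = 512; q = 2; x = randn(d, 1); z = randn(d, 1);
h = randi(s, d, q); sgn = sign(randn(d, q));           % q independent CountSketches
cs = @(u, j) accumarray(h(:,j), sgn(:,j).*u, [s 1]);   % CountSketch C_j u
px = ones(s, 1); pz = ones(s, 1);
for j = 1:q
    px = px .* fft(cs(x, j)); pz = pz .* fft(cs(z, j));
end
zx = ifft(px); zz = ifft(pz);                          % embeddings of x and z
fprintf('exact %.3f, sketch %.3f\n', (x'*z)^q, real(zx'*zz));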


Efficiency of Random Embeddings

- TIMIT: 58.8M parameters (s = 400k, m = 147) vs. 19.9M parameters for the DNN.
- Draw $S = [w_1 \ldots w_s] \sim p$ and approximate the integral:
  $k(x, z) = \int_{\mathbb{R}^d} e^{i(x-z)^T w}\, p(w)\, dw \;\approx\; \frac{1}{|S|} \sum_{w \in S} e^{i(x-z)^T w} \qquad (6)$
- Integration error:
  $\epsilon_{p,S}[f] = \left| \int_{[0,1]^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w) \right|$
- Monte Carlo convergence rate: $\mathbb{E}[\epsilon_{p,S}[f]^2] \le \sigma_f^2\, s^{-1}$,
  so a 4-fold increase in s only cuts the error in half.
- Can we do better with a different sequence S?
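A quick illustration of that Monte Carlo rate, under assumed toy dimensions: averaging the kernel-approximation error over repeated draws, quadrupling s roughly halves it.

% Minimal sketch (illustrative): error of the estimate in (6) decays ~ 1/sqrt(s).
d = 10; sigma = 1; x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));
for s = [1000 4000 16000]
    err = zeros(50, 1);
    for t = 1:50                                 % average over repeated draws of S
        W = randn(s, d) / sigma;
        err(t) = abs(mean(cos(W*(x - z))) - k_exact);
    end
    fprintf('s = %5d, mean abs error %.4f\n', s, mean(err));
end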


Quasi-Monte Carlo Sequences: Intuition

- Weyl 1916; Koksma 1942; Dick et al., 2013; Caflisch, 1998.
- Consider approximating $\int_{[0,1]^2} f(x)\, dx$ with $\frac{1}{s} \sum_{w \in S} f(w)$.

[Figure: uniform (MC) vs. Halton point sets on the unit square.]

- Deterministic, correlated QMC sampling avoids the clustering and clumping
  effects seen in MC point sets.
- Hierarchical structure: coarse-to-fine sampling as s increases.


Star Discrepancy

- The integration error depends on the variation of f and the uniformity of S.
- Theorem (Koksma-Hlawka inequality (1941, 1961)):
  $\epsilon_S[f] \le D^\star(S)\, V_{HK}[f]$, where
  $D^\star(S) = \sup_{x \in [0,1]^d} \left| \mathrm{vol}(J_x) - \frac{|\{i : w_i \in J_x\}|}{s} \right|$
  and $J_x = [0, x)$ denotes the anchored box at $x$.


Quasi-Monte Carlo Sequences

We seek sequences with low discrepancy.

- Theorem (Roth, 1954): $D^\star(S) \ge c_d\, \frac{(\log s)^{\frac{d-1}{2}}}{s}$.
- For the regular grid on $[0,1)^d$ with $s = m^d$, $D^\star \sim s^{-1/d}$ $\Rightarrow$ only
  optimal for $d = 1$.
- Theorem (Doerr, 2013): let $S$ with $|S| = s$ be chosen uniformly at random from
  $[0,1)^d$. Then $\mathbb{E}[D^\star(S)] \sim \sqrt{d/s}$ - the Monte Carlo rate.
- There exist low-discrepancy sequences that achieve $C_d\, \frac{(\log s)^d}{s}$,
  conjectured to be the optimal rate.
  - Matlab QMC generators: haltonset, sobolset, ...
- With d fixed, this bound actually grows until $s \approx e^d$, to about $(d/e)^d$!
  - "On the unreasonable effectiveness of QMC in high dimensions."
- RKHSs and kernels show up in modern QMC analysis!
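A minimal MATLAB sketch of QMC random Fourier features using the haltonset generator mentioned above, with a Halton point set pushed through the Gaussian inverse CDF; the bandwidth and skip value are illustrative, and haltonset/norminv require the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): QMC features for the Gaussian kernel.
d = 10; s = 4000; sigma = 1; x = randn(d, 1); z = randn(d, 1);
k_exact = exp(-norm(x - z)^2 / (2*sigma^2));

P = haltonset(d, 'Skip', 1000);        % low-discrepancy sequence on [0,1)^d
U = net(P, s);                         % first s points, s-by-d
W = norminv(U) / sigma;                % map through inverse CDF of N(0, sigma^-2 I_d)
k_qmc = mean(cos(W*(x - z)));
fprintf('exact %.4f, QMC approx %.4f\n', k_exact, k_qmc);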


How do standard QMC sequences perform?

- Compare $K \approx Z(X) Z(X)^*$, where $Z(X) = e^{iXG}$ and G is drawn from a
  QMC sequence generator instead.

[Figure: relative error on $\|K\|_2$ vs. number of random features for MC, Halton,
Sobol, digital net, and lattice sequences, on USPST (n = 1506) and CPU (n = 6554).]

- QMC sequences are consistently better.
- Why are some QMC sequences better than others, e.g., Halton over Sobol?
- Can we learn sequences even better adapted to our problem class?
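A minimal MATLAB sketch of this comparison on a small synthetic dataset (not USPST/CPU); the sizes and bandwidth are assumptions, and pdist2/haltonset/norminv require the Statistics and Machine Learning Toolbox.

% Minimal sketch (illustrative): relative error ||K - Z Z*||_2 / ||K||_2
% for MC vs. Halton feature maps.
n = 300; d = 8; s = 500; sigma = 2;
X = randn(n, d);
K = exp(-pdist2(X, X).^2 / (2*sigma^2));

Gmc  = randn(d, s) / sigma;                         % Monte Carlo frequencies
U    = net(haltonset(d, 'Skip', 1000), s);          % QMC points on [0,1)^d
Gqmc = norminv(U)' / sigma;                         % d-by-s QMC frequencies

Gs = {Gmc, Gqmc}; names = {'MC', 'Halton'};
for t = 1:2
    Z = exp(1i*X*Gs{t}) / sqrt(s);                  % n-by-s complex features
    fprintf('%s relative error %.4f\n', names{t}, norm(K - real(Z*Z'), 2) / norm(K, 2));
end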


RKHSs in QMC Theory

- "Nice" integrands: $f \in \mathcal{H}_h$, where $h(\cdot, \cdot)$ is a kernel function.
- By the reproducing property and the Cauchy-Schwarz inequality:
  $\left| \int_{\mathbb{R}^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w) \right| \le \|f\|_{\mathcal{H}_h}\, D_{h,p}(S) \qquad (7)$
  where $D_{h,p}$ is a discrepancy measure built from the mean embedding
  $m_p = \int_{\mathbb{R}^d} h(x, \cdot)\, p(x)\, dx$:
  $D_{h,p}(S) = \left\| m_p(\cdot) - \frac{1}{s} \sum_{w \in S} h(w, \cdot) \right\|_{\mathcal{H}_h}$
- Expanding the square:
  $D_{h,p}(S)^2 = \text{const.} - \underbrace{\frac{2}{s} \sum_{l=1}^{s} \int_{\mathbb{R}^d} h(w_l, \omega)\, p(\omega)\, d\omega}_{\text{alignment with } p} + \underbrace{\frac{1}{s^2} \sum_{l=1}^{s} \sum_{j=1}^{s} h(w_l, w_j)}_{\text{pairwise similarity in } S}$


Box Discrepancy

- Assume that the (shifted) data lives in a box:
  $\square_b = \{ u = x - z : -b \preceq u \preceq b,\; x, z \in \mathcal{X} \}$
- Class of functions we want to integrate:
  $\mathcal{F}_b = \{ f(w) = e^{i \langle w, u \rangle},\; u \in \square_b \}$
- Theorem: the expected integration error is proportional to the box discrepancy,
  $\mathbb{E}_{f \sim U(\mathcal{F}_b)}\!\left[\epsilon_{S,p}[f]^2\right] \;\propto\; D^{\square}_p(S)^2 \qquad (9)$
  where $D^{\square}_p(S)$ is the discrepancy associated with the sinc kernel
  $\mathrm{sinc}_b(u, v) = \pi^{-d} \prod_{j=1}^{d} \frac{\sin\!\big(b_j (u_j - v_j)\big)}{u_j - v_j}$
- Can be evaluated in closed form for the Gaussian density.
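Since the box discrepancy is built from this sinc kernel, here is a small MATLAB sketch, under assumed box half-widths b, of evaluating it and the pairwise-similarity term that appears in the discrepancy expansion; this is only an illustration, not the closed-form Gaussian-density evaluation mentioned above.

% Minimal sketch (illustrative): the sinc kernel and the pairwise-similarity
% term (1/s^2) sum_{l,j} sinc_b(w_l, w_j) for a candidate frequency set S.
d = 5; s = 200; b = ones(1, d);                  % assumed box half-widths
S = randn(s, d);                                 % candidate frequencies, s-by-d

sinc_b = @(u, v) pi^(-d) * prod(sin(b .* (u - v)) ./ (u - v));
pair = pi^(-d) * prod(b) * s;                    % diagonal terms: sin(b_j t)/t -> b_j
for l = 1:s
    for j = l+1:s
        pair = pair + 2 * sinc_b(S(l,:), S(j,:));   % off-diagonal terms (symmetric)
    end
end
pair = pair / s^2;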

Does Box Discrepancy Explain the Behaviour of QMC Sequences?

[Figure: $D^{\square}(S)^2$ vs. number of samples on the CPU dataset (d = 21),
comparing digital net, MC (expected), Halton, Sobol, and lattice sequences.]

Learning Adaptive QMC Sequences

- Unlike the star discrepancy, the box discrepancy admits numerical optimization:
  $S^\star = \arg\min_{S = (w_1 \ldots w_s) \in \mathbb{R}^{d \times s}} D^{\square}(S), \qquad S^{(0)} = \text{Halton}. \qquad (10)$

[Figure: CPU dataset, s = 100; normalized $D^{\square}(S)^2$, maximum squared error,
mean squared error, and $\|\hat{K} - K\|_2 / \|K\|_2$ vs. optimization iteration.]

- However, the full impact on large-scale problems is an open research topic.

Outline

- Motivation and Background
- Scalable Kernel Methods
  - Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  - libSkylark: An open source software stack
  - Quasi-Monte Carlo Embeddings (ICML 2014)
- Synergies?

Randomization-vs-Optimization

[Diagram: a kernel k(x, z) can be approached either by randomization (sampling
frequencies from the associated density $p_k$) or by optimization (selecting
leading eigenfunctions $eig_1(x), \ldots, eig_5(x)$).]

- Jarrett et al. 2009, "What is the Best Multi-Stage Architecture for Object
  Recognition?": "The most astonishing result is that systems with random filters
  and no filter learning whatsoever achieve decent performance."
- "On Random Weights and Unsupervised Feature Learning", Saxe et al., ICML 2011:
  "a surprising fraction of performance can be attributed to architecture alone."

Deep Learning with Kernels?

- Maps across layers can be parameterized using more general nonlinearities (kernels).

[Figure, adapted from Mairal et al., NIPS 2014: an image (pixel layer $x_1$) is mapped
by $f_1 \in \mathcal{H}_{k_1}$ to a feature layer $x_2$, which is mapped in turn by $f_2 \in \mathcal{H}_{k_2}$.]

- Mathematics of the Neural Response, Smale et al., FoCM (2010).
- Convolutional Kernel Networks, Mairal et al., NIPS 2014.
- SimNets: A Generalization of Conv. Nets, Cohen and Shashua, 2014:
  learns networks 1/8 the size of comparable ConvNets.

Conclusions

- Some empirical evidence suggests that once kernel methods are scaled up and
  embody similar statistical principles, they are competitive with Deep Neural Networks.
  - Randomization and distributed computation are both required.
  - Ideas from QMC integration techniques are promising.
- There are opportunities for designing new algorithms that combine insights from
  deep learning with the generality and mathematical richness of kernel methods.
