Vikas Sindhwani
IBM Research, NY
Acknowledgements
Setting
- Data: $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y_i \in \mathcal{Y} \subseteq \mathbb{R}^m$
- Learn $f \in \mathcal{H}$ by minimizing the regularized empirical risk
$$\sum_{i=1}^{n} V(f(x_i), y_i) + \Omega(f)$$
Outline
Personal history:
1998: First ML experiment - train DNN on UCI Wine dataset.
1999: Introduced to Kernel Methods - by DNN researchers!
2003-4: NN paper at JMLR rejected; accepted in IEEE Trans. Neural Nets with a kernel methods section!

So what changed?
More data, parallel algorithms, hardware? Better DNN training? ...
Are there synergies between these fields towards design of even better (faster and more accurate) algorithms?
Data $\mathcal{X} \subseteq \mathbb{R}^d$, Models $\mathcal{H}: \mathcal{X} \mapsto \mathbb{R}$
Outline
$$f^\star = \arg\min_{f \in \mathcal{H}_k} \; \frac{1}{n}\sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2, \qquad x_i \in \mathbb{R}^d$$

$$f^\star(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i)$$

- $O(n^2)$ storage
- $O(n^3 + n^2 d)$ training
- $O(nd)$ test speed
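For concreteness, a minimal MATLAB sketch of the exact (non-randomized) solver with a Gaussian kernel, with illustrative variable names (X, y, Xtest, sigma, lambda are assumed inputs; implicit expansion requires R2016b+); the n-by-n Gram matrix is the $O(n^2)$ storage and the direct solve the cubic training cost:

% Exact kernel ridge regression with a Gaussian kernel (illustrative sketch).
% X: n-by-d training inputs, y: n-by-1 targets, Xtest: m-by-d test inputs.
sqdist = @(A,B) sum(A.^2,2) + sum(B.^2,2)' - 2*A*B';    % pairwise squared distances
K      = exp(-sqdist(X, X) / (2*sigma^2));              % n-by-n Gram matrix: O(n^2) storage
alpha  = (K + lambda*eye(size(K,1))) \ y;               % direct solve: O(n^3 + n^2 d) training
ypred  = exp(-sqdist(Xtest, X) / (2*sigma^2)) * alpha;  % f(x) = sum_i alpha_i k(x, x_i)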
Randomized Algorithms
Feature map $\Phi: \mathcal{X} \to \mathbb{C}^s$ with $\langle \Phi(x), \Phi(z) \rangle_{\mathbb{C}^s} \approx k(x, z)$
- $O(ns)$ storage
- $\left( Z(X)^T Z(X) + \lambda I \right) w = Z(X)^T Y$: $O(ns^2)$ training
- $O(s)$ test speed
Gaussian kernel: $k(x, z) = e^{-\frac{\|x - z\|_2^2}{2\sigma^2}}$, with $p = \mathcal{N}(0, \sigma^{-2} I_d)$

$$k(x, z) \approx \frac{1}{s} \sum_{j=1}^{s} e^{i (x - z)^T w_j} = \langle \Phi_S(x), \Phi_S(z) \rangle_{\mathbb{C}^s}$$

$$\Phi_S(x) = \frac{1}{\sqrt{s}} \left[ e^{i x^T w_1} \;\cdots\; e^{i x^T w_s} \right] \in \mathbb{C}^s, \qquad S = [w_1 \ldots w_s] \sim p$$
% Random Fourier features for the Gaussian kernel: X is n-by-d, y is n-by-1.
G = randn(size(X,2), s) / sigma;    % frequencies w_j ~ N(0, sigma^-2 I_d), one per column
Z = exp(1i*X*G);                    % n-by-s feature matrix Z(X)
I = eye(s);
C = Z'*Z;                           % s-by-s
alpha = (C + lambda*I) \ (Z'*y(:));
ztest = exp(1i*xtest*G)*alpha;      % predictions on test inputs
[Figure: classification error (%) vs. number of random features (s)/10000, using the code above.]

Z(X): 1.2TB. Stream on blocks: $C \leftarrow C + Z_B^T Z_B$.
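Since Z(X) is too large to materialize (1.2TB here), C can be accumulated one block at a time. A minimal sketch of that streaming loop, with an illustrative block size B and variable names following the snippet above:

% Stream over row blocks of X, materializing Z only one block at a time:
% C = Z(X)'*Z(X) and r = Z(X)'*y are accumulated, then the s-by-s system is solved.
G = randn(size(X,2), s) / sigma;
C = zeros(s, s); r = zeros(s, 1);
for start = 1:B:size(X,1)
    idx = start:min(start+B-1, size(X,1));
    ZB  = exp(1i * X(idx,:) * G);      % block of random Fourier features (B-by-s)
    C   = C + ZB' * ZB;                % C += Z_B' Z_B
    r   = r + ZB' * y(idx);
end                                    % each ZB is discarded after use
alpha = (C + lambda*eye(s)) \ r;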
[Figure: classification error (%) vs. number of random features (s)/10000.]

- Competitive with HMM/DNN systems. New record: 16.7% with CNNs (ICASSP 2014).
- Only two hyperparameters: $\sigma$, $s$ (early stopping regularizer).
- $Z \approx 6.4$TB, $C \approx 1.2$TB.
- Materialized in blocks/used/discarded on-the-fly, in parallel.
$$\min_x \; \sum_{i=1}^{R} f_i(x) + g(x) \qquad (1)$$

ADMM updates:
$$x_i^{k+1} = \arg\min_x \; f_i(x) + \frac{\rho}{2} \left\| x - z^k + \lambda_i^k \right\|_2^2 \qquad (2)$$
$$z^{k+1} = \mathrm{prox}_{g/(\rho R)} \left[ \bar{x}^{k+1} + \bar{\lambda}^k \right] \quad \text{(comm.)} \qquad (3)$$
$$\lambda_i^{k+1} = \lambda_i^k + x_i^{k+1} - z^{k+1} \qquad (4)$$

where $\mathrm{prox}_f(x) = \arg\min_y \; \tfrac{1}{2}\|x - y\|_2^2 + f(y)$.
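As a toy illustration of updates (2)-(4), here is a minimal single-process MATLAB sketch for a hypothetical instance with $f_i(x) = \|A_i x - b_i\|_2^2$ and $g(x) = \mu \|x\|_1$; all data, sizes, and parameters below are made up, and the local steps would run in parallel in the distributed setting:

% Consensus ADMM sketch for min_x sum_i ||A_i*x - b_i||^2 + mu*||x||_1 (toy data).
R = 4; d = 20; rho = 1; mu = 0.1;
A = arrayfun(@(i) randn(50, d), 1:R, 'UniformOutput', false);
b = arrayfun(@(i) randn(50, 1), 1:R, 'UniformOutput', false);
X = zeros(d, R); L = zeros(d, R); z = zeros(d, 1);
for k = 1:100
    for i = 1:R                                           % local prox steps, update (2)
        X(:,i) = (2*A{i}'*A{i} + rho*eye(d)) \ ...
                 (2*A{i}'*b{i} + rho*(z - L(:,i)));
    end
    v = mean(X, 2) + mean(L, 2);                          % communication: averages
    z = sign(v) .* max(abs(v) - mu/(rho*R), 0);           % update (3): prox of mu*||.||_1
    L = L + X - z;                                        % update (4): dual variables
end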
[Diagram: distributed implementation. Data blocks $(Y_i, X_i)$ live on MPI ranks (nodes 1..3); random feature blocks $Z_{ij}$ and model blocks $W_{ij}$ are formed on the fly via the transform $T[X_i, j]$ using T OpenMP threads per rank. Each iteration: local prox$_l$ steps, graph projection proj$_{Z_{ij}}$, reduce to node 0, prox$_r$ on $W$, broadcast of the model.]
Scalability
- Graph Projection: $U = \left[ Z_{ij}^T Z_{ij} + I \right]^{-1} \left( X + Z_{ij}^T Y \right), \quad V = Z_{ij} U$
- Per-iteration compute: transform, gemm, and graph-projection/gemm terms, each parallelized over the $R$ MPI ranks and $T$ threads per rank; feature blocks are cached where possible.
- Communication: $O(s \, m \log R)$ (model reduce/broadcast)
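In single-block MATLAB form, the graph projection above is two dense gemms around an s-by-s solve whose factorization can be cached across iterations. A sketch with illustrative names (Zij, Y, Xw, s are assumed inputs; `decomposition` needs R2017b+, otherwise cache a Cholesky factor instead):

% Graph projection for one feature block Zij (ni-by-s), labels Y (ni-by-m),
% and current s-by-m iterate Xw. The s-by-s factorization depends only on Zij,
% so it is computed once and reused every iteration.
F = decomposition(Zij'*Zij + eye(s), 'chol');   % cache once per block
U = F \ (Xw + Zij'*Y);                          % s-by-m solve
V = Zij * U;                                    % ni-by-m gemm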
[Figures: strong-scaling speedup vs. number of nodes (with ideal speedup shown) and test accuracy (%) vs. training time in seconds.]
Comparisons on MNIST
See no-distortion results at http://yann.lecun.com/exdb/mnist/

Method                                              Error (%)  Reference
Gaussian Kernel                                     1.4
Poly (4)                                            1.1
3-layer NN, 500+300 HU                              1.53       Hinton, 2005
LeNet5                                              0.8        LeCun et al., 1998
Large CNN (pretrained)                              0.60       Ranzato et al., NIPS 2006
Large CNN (pretrained)                              0.53       Jarrett et al., ICCV 2009
Scattering transform + Gaussian Kernel              0.4        Bruna and Mallat, 2012
20000 random features                               0.45       my experiments
5000 random features                                0.52       my experiments
Committee of 35 conv. nets [elastic distortions]    0.23       Ciresan et al., 2012
Aside: LibSkylark
http://xdata-skylark.github.io/libskylark/docs/sphinx/

- C/C++/Python library, MPI, Elemental/CombBLAS containers.
- Distributed sketching operators: $\|Ax - b\|_2 \approx \|S(Ax - b)\|_2$
- Randomized Least Squares, SVD
- Randomized Kernel Methods:

Embedding   $z(x)$                                                    Time
RFT         $e^{iGx}$                                                 $O(sd)$, $O(s\,\mathrm{nnz}(x))$
FRFT        $e^{iSGHPHBx}$                                            $O(s \log d)$
RLT         $e^{Gx}$                                                  $O(sd)$, $O(s\,\mathrm{nnz}(x))$
PPT         $F^{-1}\!\left[ F(C_1 x) \odot \cdots \odot F(C_q x) \right]$   $O(q(\mathrm{nnz}(x) + s \log s))$

Rahimi & Recht, 2007; Pham & Pagh, 2013; Le, Sarlos and Smola, 2013
Random Laplace Feature Maps for Semigroup Kernels on Histograms, CVPR 2014, J. Yang, V.S., M. Mahoney, H. Avron, Q. Fan.
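A single-node MATLAB analogue of the sketched least-squares idea $\|Ax - b\|_2 \approx \|S(Ax - b)\|_2$, using a plain dense Gaussian sketch rather than LibSkylark's fast distributed operators (all sizes and data below are made up):

% Sketch-and-solve least squares: replace the tall n-by-d problem by a small s-by-d one.
n = 10000; d = 20; s = 500;                 % d << s << n
A = randn(n, d);  b = A*randn(d, 1) + 0.1*randn(n, 1);
S = randn(s, n) / sqrt(s);                  % dense Gaussian sketching operator (illustrative only)
x_sketched = (S*A) \ (S*b);                 % solve min_x ||S*(A*x - b)||_2
x_exact    = A \ b;                         % full least-squares solution
rel_err    = norm(x_sketched - x_exact) / norm(x_exact)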
Integration error of a sample set $S = \{w_1, \ldots, w_s\}$:
$$\epsilon_{p,S}[f] = \int_{[0,1]^d} f(x)\, p(x)\, dx \;-\; \frac{1}{s} \sum_{w \in S} f(w) \qquad (6)$$
Weyl 1916; Koksma 1942; Dick et al., 2013; Caflisch, 1998

Consider approximating $\int_{[0,1]^2} f(x)\, dx$ with $\frac{1}{s} \sum_{w \in S} f(w)$.

[Figure: s points in $[0,1]^2$, drawn uniformly at random (left) vs. from the Halton sequence (right).]
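A small MATLAB sketch of this comparison; `haltonset` is from the Statistics and Machine Learning Toolbox, and the integrand is an arbitrary smooth test function whose exact integral over $[0,1]^2$ is 0:

% Approximate the integral of f over [0,1]^2 with s uniform vs. s Halton points.
f = @(W) cos(2*pi*W(:,1)) .* exp(-W(:,2));      % test integrand; exact integral is 0
s = 1024;
W_mc  = rand(s, 2);                             % uniform (Monte Carlo) points
W_qmc = net(haltonset(2, 'Skip', 1), s);        % Halton (quasi-Monte Carlo) points
est_mc  = mean(f(W_mc));
est_qmc = mean(f(W_qmc));
fprintf('MC error: %.2e   QMC error: %.2e\n', abs(est_mc), abs(est_qmc));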
Star Discrepancy
[Figures: error vs. number of random features on two datasets (one labeled CPU, n=6554), comparing MC, Halton, Sobol, Digital net, and Lattice sequences.]

Why are some QMC sequences better, e.g., Halton over Sobol?
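The QMC feature maps compared above amount to replacing the Gaussian draws in the earlier MATLAB snippet with transformed low-discrepancy points, roughly as follows (`haltonset` and `norminv` are from the Statistics and Machine Learning Toolbox; X, s, sigma as before):

% QMC random Fourier features: map a Halton sequence in [0,1]^d through the
% inverse Gaussian CDF so that the frequencies follow p = N(0, sigma^-2 I_d).
d = size(X, 2);
T = net(haltonset(d, 'Skip', 1), s);    % s-by-d low-discrepancy points in (0,1)^d
G = norminv(T)' / sigma;                % d-by-s frequencies: QMC analogue of randn(d,s)/sigma
Z = exp(1i * X * G);                    % drop-in replacement for the MC feature map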
$$\epsilon_{p,S}[f]^2 \;\le\; \|f\|_{\mathcal{H}_h}^2 \, D^2_{h,p}(S) \qquad (7)$$

where $D^2_{h,p}$ is a discrepancy measure:

$$D^2_{h,p}(S) = \Big\| \underbrace{m_p(\cdot)}_{\text{mean embedding}} - \; \frac{1}{s} \sum_{w \in S} h(w, \cdot) \Big\|^2_{\mathcal{H}_h}, \qquad m_p = \int_{\mathbb{R}^d} h(x, \cdot)\, p(x)\, dx$$

$$D^2_{h,p}(S) = \text{const.} \;-\; \underbrace{\frac{2}{s} \sum_{l=1}^{s} \int_{\mathbb{R}^d} h(w_l, \omega)\, p(\omega)\, d\omega}_{\text{Alignment with } p\ (w_l)} \;+\; \underbrace{\frac{1}{s^2} \sum_{l=1}^{s} \sum_{j=1}^{s} h(w_l, w_j)}_{\text{Pairwise similarity in } S}$$
Box Discrepancy
- Data differences $u = x - z$ lie in a box $[-b, b]$.
- The corresponding kernel has the closed form
$$h(u, v) = \prod_{j=1}^{d} \frac{\sin\!\big( b_j (u_j - v_j) \big)}{u_j - v_j}$$
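A minimal MATLAB sketch of the pairwise-similarity term of $D^\square(S)^2$ using the product sinc kernel above; the constant and alignment-with-p terms from the previous slide are omitted, and the box half-widths b and the point set S below are purely illustrative:

% Pairwise-similarity term (1/s^2) * sum_{l,j} h(w_l, w_j) with the product sinc kernel
% h(u,v) = prod_j sin(b_j*(u_j - v_j)) / (u_j - v_j).
d = 5; s = 64;
b = ones(d, 1);                          % box half-widths (illustrative)
S = randn(d, s);                         % candidate frequency set, one column per w_l
P = 0;
for l = 1:s
    for j = 1:s
        du    = S(:,l) - S(:,j);
        v     = b;                       % limit of sin(b*du)/du as du -> 0 is b
        nz    = du ~= 0;
        v(nz) = sin(b(nz) .* du(nz)) ./ du(nz);
        P     = P + prod(v);
    end
end
pairwise_term = P / s^2;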
[Figure: box discrepancy $D^\square(S)^2$ vs. number of samples (CPU dataset, d=21) for MC (expected), Halton, Sobol, Digital Net, and Lattice sequences.]
$$\arg\min_S \; D^\square(S), \qquad S^{(0)} = \text{Halton} \qquad (10)$$

[Figure: normalized $D^\square(S)^2$, maximum squared error, mean squared error, and $\|\hat{K} - K\|_2 / \|K\|_2$ vs. optimization iteration.]
Outline
Synergies?
Randomization-vs-Optimization
$k(x, z)$: Randomization ($p_k$) vs. Optimization

[Figure: eigenfunctions $\mathrm{eig}_1(x), \ldots, \mathrm{eig}_5(x)$.]

- Jarrett et al., 2009, "What is the Best Multi-Stage Architecture for Object Recognition?": "The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance."
- "On Random Weights and Unsupervised Feature Learning", Saxe et al., ICML 2011: a "surprising fraction of performance can be attributed to architecture alone."
[Diagram: image, pixel $x_1$, $f_1(x_1)$ with $f_1 \in \mathcal{H}_{k_1}$, and $x_2$.]
Conclusions