
Online Semi-Supervised Learning

Andrew B. Goldberg, Ming Li, Xiaojin Zhu

jerryzhu@cs.wisc.edu
Computer Sciences
University of Wisconsin-Madison



Life-long learning

Data stream: $x_1, x_2, \ldots, x_{1000}, \ldots, x_{1000000}, \ldots$

Occasional labels: $y_1 = 0$, $-$, $-$, $\ldots$, $y_{1000} = 1$, $\ldots$, $y_{1000000} = 0$, $\ldots$



This is how children learn, too

Data stream: $x_1, x_2, \ldots, x_{1000}, \ldots, x_{1000000}, \ldots$

Occasional labels: $y_1 = 0$, $-$, $-$, $\ldots$, $y_{1000} = 1$, $\ldots$, $y_{1000000} = 0$, $\ldots$

Unlike standard supervised learning:

examples arrive sequentially; we cannot even store them all
most examples are unlabeled
there is no iid assumption; p(x, y) can change over time
New paradigm: online semi-supervised learning

Main contribution: merging

1. online learning: learns sequentially from non-iid data, but requires fully labeled data
2. semi-supervised learning: learns from labeled and unlabeled data, but only in batch mode

The protocol (a minimal code sketch follows):

1. At time t, an adversary picks $x_t \in \mathcal{X}$, $y_t \in \mathcal{Y}$, not necessarily iid, and shows $x_t$
2. The learner has a classifier $f_t : \mathcal{X} \to \mathbb{R}$ and predicts $f_t(x_t)$
3. With small probability, the adversary reveals $y_t$; otherwise it abstains (unlabeled)
4. The learner updates to $f_{t+1}$ based on $x_t$ and $y_t$ (if given). Repeat.
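
To make the setting concrete, here is a minimal sketch of this protocol in Python. The names `adversary`, `learner`, and `label_prob` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def run_protocol(adversary, learner, T, label_prob=0.05, seed=0):
    """Online semi-supervised learning loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    for t in range(T):
        x_t, y_t = adversary(t)             # adversary picks (x_t, y_t), shows only x_t
        prediction = learner.predict(x_t)   # learner predicts f_t(x_t)
        # with small probability the label is revealed; otherwise unlabeled
        revealed = y_t if rng.random() < label_prob else None
        learner.update(x_t, revealed)       # learner forms f_{t+1}; repeat
```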



Review: batch manifold regularization
A form of graph-based semi-supervised learning [Belkin et al., JMLR 2006]:
Graph on $x_1 \ldots x_n$
Edge weights $w_{st}$ encode the similarity between $x_s$ and $x_t$, e.g., a kNN graph
Assumption: similar examples have similar labels
Manifold regularization minimizes risk:
$J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} \big(f(x_s) - f(x_t)\big)^2 w_{st}$

where $\delta(y_t) = 1$ if $y_t$ is labeled and $0$ otherwise.

$c(f(x), y)$: convex loss function, e.g., the hinge loss.

Solution: $f^* = \arg\min_f J(f)$.

Generalizes graph mincut and label propagation. A toy evaluation of $J(f)$ is sketched below.
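
As a concrete reading of the risk, here is a small numpy sketch that evaluates $J(f)$ from precomputed predictions, assuming the hinge loss and a given weight matrix W; all names are illustrative.

```python
import numpy as np

def batch_mr_risk(f_x, y, labeled, f_norm_sq, W, lam1, lam2, l):
    """Evaluate J(f) given predictions f_x[t] = f(x_t) (illustrative sketch)."""
    T = len(f_x)
    # (1/l) sum_t delta(y_t) c(f(x_t), y_t), with hinge loss c
    hinge = np.maximum(0.0, 1.0 - y[labeled] * f_x[labeled])
    loss = hinge.sum() / l
    reg = 0.5 * lam1 * f_norm_sq                   # (lam1 / 2) ||f||_K^2
    diff_sq = (f_x[:, None] - f_x[None, :]) ** 2   # (f(x_s) - f(x_t))^2
    smooth = lam2 / (2.0 * T) * np.sum(W * diff_sq)
    return loss + reg + smooth
```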



From batch to online

batch risk = average of instantaneous risks:

$J(f) = \frac{1}{T} \sum_{t=1}^{T} J_t(f)$

Batch risk

$J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} \big(f(x_s) - f(x_t)\big)^2 w_{st}$

Instantaneous risk

$J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \sum_{i=1}^{t} \big(f(x_i) - f(x_t)\big)^2 w_{it}$

(includes graph edges between $x_t$ and all previous examples)
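
A matching sketch for $J_t(f)$, again assuming the hinge loss; `f_prev` and `w_t` hold $f(x_i)$ and $w_{it}$ for the earlier points (illustrative names).

```python
import numpy as np

def instantaneous_risk(f_prev, f_t, y_t, f_norm_sq, w_t, T, l, lam1, lam2):
    """Evaluate J_t(f); f_prev[i] = f(x_i) for i <= t, f_t = f(x_t) (sketch)."""
    # (T/l) delta(y_t) c(f(x_t), y_t): the label term is scaled up by T/l
    loss = (T / l) * max(0.0, 1.0 - y_t * f_t) if y_t is not None else 0.0
    reg = 0.5 * lam1 * f_norm_sq
    smooth = lam2 * np.sum(w_t * (f_prev - f_t) ** 2)  # edges to all earlier x_i
    return loss + reg + smooth
```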



Online convex programming
Instead of minimizing the convex $J(f)$, reduce the convex $J_t(f)$ at each step $t$:

$f_{t+1} = f_t - \eta_t \left.\frac{\partial J_t(f)}{\partial f}\right|_{f_t}$
Remarkable no-regret guarantee against the adversary:
Accuracy can be arbitrarily bad if the adversary flips the target often
If so, no batch learner in hindsight can do well either
$\mathrm{regret} \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*)$

[Zinkevich, ICML 2003] No regret: $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*) \le 0$.

If there is no adversary (iid data), the average classifier $\bar{f} = \frac{1}{T} \sum_{t=1}^{T} f_t$ is good:

$J(\bar{f}) \to J(f^*)$.



Kernelized algorithm

The classifier is maintained as a kernel expansion:

$f_t(\cdot) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, \cdot)$

Init: $t = 1$, $f_1 = 0$
Repeat:

1. receive $x_t$, predict $f_t(x_t) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, x_t)$
2. occasionally receive $y_t$
3. update $f_t$ to $f_{t+1}$ by
   $\alpha_i^{(t+1)} = (1 - \eta_t \lambda_1)\, \alpha_i^{(t)} - 2 \eta_t \lambda_2 \big(f_t(x_i) - f_t(x_t)\big) w_{it}, \quad i < t$
   $\alpha_t^{(t+1)} = 2 \eta_t \lambda_2 \sum_{i=1}^{t} \big(f_t(x_i) - f_t(x_t)\big) w_{it} - \eta_t \frac{T}{l} \delta(y_t)\, c'(f_t(x_t), y_t)$
4. store $x_t$, let $t = t + 1$

One full update step is sketched below.
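
A minimal numpy sketch of one update step with the hinge loss. Using the RBF kernel both as $K$ and as the edge weight $w_{it}$, and taking $T$, $l$, `eta`, `lam1`, `lam2` as known constants, are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def online_mr_step(X_prev, alpha, x_t, y_t, T, l, eta, lam1, lam2, sigma=1.0):
    """One gradient step f_t -> f_{t+1}; returns the extended coefficients."""
    k_t = np.array([rbf(x_i, x_t, sigma) for x_i in X_prev])   # K(x_i, x_t)
    f_xt = alpha @ k_t if len(alpha) else 0.0                  # f_t(x_t)
    # f_t evaluated at every stored example (O(t^2) work, as the slides note)
    f_xi = np.array([sum(a * rbf(x_j, x_i, sigma) for x_j, a in zip(X_prev, alpha))
                     for x_i in X_prev])
    w_t = k_t                                                  # edge weights w_it
    # old coefficients: shrink by (1 - eta*lam1), push along smoothness gradient
    alpha_new = (1.0 - eta * lam1) * alpha - 2.0 * eta * lam2 * (f_xi - f_xt) * w_t
    # new coefficient for K(x_t, .); hinge subgradient c' = -y if y*f < 1 else 0
    a_t = 2.0 * eta * lam2 * np.sum((f_xi - f_xt) * w_t)
    if y_t is not None and y_t * f_xt < 1.0:
        a_t += eta * (T / l) * y_t
    return np.append(alpha_new, a_t)
```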



Sparse approximation

The algorithm is impractical:

space $O(T)$: stores all previous examples
time $O(T^2)$: each new example is compared to all previous ones

Two ways to speed up:
buffering, or
a random projection tree



Sparse approximation 1: buffering

Keep a buffer of size $\tau$:

approximate representers: $f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot)$
approximate instantaneous risk:

$J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \frac{t}{\tau} \sum_{i=t-\tau}^{t-1} \big(f(x_i) - f(x_t)\big)^2 w_{it}$

dynamic graph on the examples in the buffer (buffer bookkeeping is sketched below)
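
A sketch of the buffer bookkeeping, assuming FIFO eviction; folding the evicted point's coefficient into the survivors is deferred to the matching-pursuit step on the next slide. The class and method names are illustrative.

```python
from collections import deque

class MRBuffer:
    """Fixed-size buffer of (example, coefficient) pairs (illustrative sketch)."""
    def __init__(self, tau):
        self.tau = tau
        self.items = deque()                 # holds (x_i, alpha_i), newest last

    def add(self, x, a):
        self.items.append((x, a))
        if len(self.items) > self.tau:
            return self.items.popleft()      # evicted pair; see matching pursuit
        return None

    def predict(self, x_t, kernel):
        # f_t(x_t) = sum over buffered representers
        return sum(a * kernel(x_i, x_t) for x_i, a in self.items)
```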



Sparse approximation 1: buffer update
At each step, start with the current representers:

$f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot) + 0 \cdot K(x_t, \cdot)$

Gradient descent on the $\tau + 1$ terms:

$f' = \sum_{i=t-\tau}^{t} \alpha_i' K(x_i, \cdot)$

Reduce back to $\tau$ representers $f_{t+1} = \sum_{i=t-\tau+1}^{t} \alpha_i^{(t+1)} K(x_i, \cdot)$ by

$\min_{\alpha^{(t+1)}} \|f' - f_{t+1}\|^2$

Kernel matching pursuit (a simplified projection step is sketched below)
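
Full kernel matching pursuit is greedy; as a simplification, the sketch below solves the projection exactly, re-fitting the $\tau$ kept coefficients so the reduced function is as close as possible in RKHS norm to $f'$. `G_keep`, `g_old`, and the ridge safeguard are illustrative assumptions.

```python
import numpy as np

def reduce_representers(G_keep, g_old, alpha_keep, alpha_old, ridge=1e-8):
    """Project f' onto span{K(x_i, .) : i kept}, dropping the oldest point.

    f' = sum_i alpha_keep[i] K(x_i, .) + alpha_old * K(x_old, .)
    Minimizing ||sum_i beta_i K(x_i, .) - f'||_K^2 over beta gives the normal
    equations  G_keep @ beta = G_keep @ alpha_keep + alpha_old * g_old,
    where G_keep[i, j] = K(x_i, x_j) and g_old[i] = K(x_i, x_old).
    """
    rhs = G_keep @ alpha_keep + alpha_old * g_old
    n = len(G_keep)
    beta = np.linalg.solve(G_keep + ridge * np.eye(n), rhs)
    return beta
```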



Sparse approximation 2: random projection tree
[Dasgupta and Freund, STOC08]
Discretize the data manifold by online clustering.
When a cluster accumulates enough examples, split it along a random hyperplane.
Extends the k-d tree.



Sparse approximation 2: random projection tree
We use the clusters $\mathcal{N}(\mu_i, \Sigma_i)$ as representers:

$f_t = \sum_{i=1}^{s} \alpha_i^{(t)} K(\mu_i, \cdot)$

The cluster-graph edge weight between a cluster $i$ and an example $x_t$ is

$w_{it} = E_{x \sim \mathcal{N}(\mu_i, \Sigma_i)}\left[\exp\left(-\frac{\|x - x_t\|^2}{2\sigma^2}\right)\right]$
$\;\;\;\; = (2\pi)^{\frac{d}{2}}\, |\Sigma_i|^{-\frac{1}{2}}\, |\Sigma_0|^{-\frac{1}{2}}\, |\Sigma|^{\frac{1}{2}} \exp\left(-\frac{1}{2}\left(\mu_i^\top \Sigma_i^{-1} \mu_i - \mu^\top \Sigma^{-1} \mu + x_t^\top \Sigma_0^{-1} x_t\right)\right)$

where $\Sigma_0 = \sigma^2 I$, $\Sigma = (\Sigma_i^{-1} + \Sigma_0^{-1})^{-1}$, and $\mu = \Sigma(\Sigma_i^{-1} \mu_i + \Sigma_0^{-1} x_t)$.

A further approximation is

$w_{it} = e^{-\|\mu_i - x_t\|^2 / (2\sigma^2)}$

Update $f$ (i.e., $\alpha$) and the RPtree, then discard $x_t$. Both weights are sketched below.
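
A sketch of both weights. The exact expectation is written here in the equivalent marginal form $(2\pi\sigma^2)^{d/2}\,\mathcal{N}(x_t; \mu_i, \Sigma_i + \sigma^2 I)$, which avoids forming $\Sigma$ and $\mu$ explicitly; names are illustrative.

```python
import numpy as np

def weight_approx(mu_i, x_t, sigma):
    """Cheap cluster weight: w_it ~ exp(-||mu_i - x_t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((mu_i - x_t) ** 2) / (2.0 * sigma ** 2))

def weight_exact(mu_i, Sigma_i, x_t, sigma):
    """E_{x ~ N(mu_i, Sigma_i)} exp(-||x - x_t||^2 / (2 sigma^2)).

    Equals (2 pi sigma^2)^{d/2} * N(x_t; mu_i, Sigma_i + sigma^2 I),
    i.e. sigma^d / sqrt(det(S)) * exp(-0.5 * quad) with S = Sigma_i + sigma^2 I.
    """
    d = len(mu_i)
    S = Sigma_i + sigma ** 2 * np.eye(d)        # Sigma_i + Sigma_0
    diff = x_t - mu_i
    quad = diff @ np.linalg.solve(S, diff)
    return sigma ** d / np.sqrt(np.linalg.det(S)) * np.exp(-0.5 * quad)
```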


Experiment: runtime

Buffering and RPtree scale linearly, enabling life-long learning.


[Figure: runtime (seconds) vs. T on Spirals and MNIST 0 vs. 1, comparing Batch MR, Online MR, Online MR (buffer), and Online RPtree]



Experiment: risk

The online MR average instantaneous risk $J_{air}(T) \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t)$ approaches the batch risk $J(f^*)$ as $T$ increases.
[Figure: risk vs. T; $J(f^*)$ for Batch MR against $J_{air}(T)$ for Online MR, Online MR (buffer), and Online RPtree]



Experiment: generalization error of $\bar{f}$ in the iid case
A variant of buffering is as good as batch MR (preferentially keep labeled examples, but not their labels, in the buffer).
[Figure: generalization error rate vs. T on four datasets: (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2; curves for Batch MR, Online MR, Online MR (buffer), Online MR (bufferU), Online RPtree, and Online RPtree (PPK)]



Experiment: adversarial concept drift
Slowly rotating spirals; both p(x) and p(y|x) change over time.
Batch $f^*$ vs. online MR (buffer) $f_T$.
Test sets are drawn from the current p(x, y) at time T.

[Figure: generalization error rate vs. T under concept drift, comparing Batch MR and Online MR (buffer)]



Conclusions

Online semi-supervised learning framework


Sparse approximations: buffering and RPtree
Future work: new bounds, new algorithms (e.g., S3VM)

