
Online Semi-Supervised Learning

Andrew B. Goldberg, Ming Li, Xiaojin Zhu

jerryzhu@cs.wisc.edu
Computer Sciences
University of Wisconsin-Madison



Life-long learning

Data stream: $x_1, x_2, \ldots, x_{1000}, \ldots, x_{1000000}, \ldots$

Occasional labels: $y_1 = 0$, $-$, $-$, $\ldots$, $y_{1000} = 1$, $\ldots$, $y_{1000000} = 0$, $\ldots$



This is how children learn, too

Data stream: $x_1, x_2, \ldots, x_{1000}, \ldots, x_{1000000}, \ldots$

Occasional labels: $y_1 = 0$, $-$, $-$, $\ldots$, $y_{1000} = 1$, $\ldots$, $y_{1000000} = 0$, $\ldots$

Unlike standard supervised learning:

examples arrive sequentially; we cannot even store them all
most examples are unlabeled
there is no iid assumption; p(x, y) can change over time
New paradigm: online semi-supervised learning

Main contribution: merging

1. online learning: learns sequentially from non-iid data, but requires fully labeled data
2. semi-supervised learning: learns from labeled and unlabeled data, but only in batch mode

The protocol (a minimal code sketch follows):

1. At time t, an adversary picks $x_t \in \mathcal{X}$, $y_t \in \mathcal{Y}$, not necessarily iid, and shows $x_t$
2. The learner has a classifier $f_t : \mathcal{X} \to \mathbb{R}$ and predicts $f_t(x_t)$
3. With small probability, the adversary reveals $y_t$; otherwise it abstains (unlabeled)
4. The learner updates to $f_{t+1}$ based on $x_t$ and $y_t$ (if given). Repeat.
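
To make the setting concrete, here is a minimal sketch of this protocol in Python. The names `adversary`, `learner`, and `label_prob` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def run_protocol(adversary, learner, T, label_prob=0.05, seed=0):
    """Online semi-supervised learning loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    for t in range(T):
        x_t, y_t = adversary(t)             # adversary picks (x_t, y_t), shows only x_t
        prediction = learner.predict(x_t)   # learner predicts f_t(x_t)
        # with small probability the label is revealed; otherwise unlabeled
        revealed = y_t if rng.random() < label_prob else None
        learner.update(x_t, revealed)       # learner forms f_{t+1}; repeat
```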



Review: batch manifold regularization
A form of graph-based semi-supervised learning [Belkin et al., JMLR 2006]:
Graph on $x_1 \ldots x_n$
Edge weights $w_{st}$ encode the similarity between $x_s$ and $x_t$, e.g., a kNN graph
Assumption: similar examples have similar labels
Manifold regularization minimizes risk:
$J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} \big(f(x_s) - f(x_t)\big)^2 w_{st}$

where $\delta(y_t) = 1$ if $y_t$ is labeled and $0$ otherwise.

$c(f(x), y)$: convex loss function, e.g., the hinge loss.

Solution: $f^* = \arg\min_f J(f)$.

Generalizes graph mincut and label propagation. A toy evaluation of $J(f)$ is sketched below.
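
As a concrete reading of the risk, here is a small numpy sketch that evaluates $J(f)$ from precomputed predictions, assuming the hinge loss and a given weight matrix W; all names are illustrative.

```python
import numpy as np

def batch_mr_risk(f_x, y, labeled, f_norm_sq, W, lam1, lam2, l):
    """Evaluate J(f) given predictions f_x[t] = f(x_t) (illustrative sketch)."""
    T = len(f_x)
    # (1/l) sum_t delta(y_t) c(f(x_t), y_t), with hinge loss c
    hinge = np.maximum(0.0, 1.0 - y[labeled] * f_x[labeled])
    loss = hinge.sum() / l
    reg = 0.5 * lam1 * f_norm_sq                   # (lam1 / 2) ||f||_K^2
    diff_sq = (f_x[:, None] - f_x[None, :]) ** 2   # (f(x_s) - f(x_t))^2
    smooth = lam2 / (2.0 * T) * np.sum(W * diff_sq)
    return loss + reg + smooth
```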



From batch to online

batch risk = average of instantaneous risks:

$J(f) = \frac{1}{T} \sum_{t=1}^{T} J_t(f)$

Batch risk

$J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} \big(f(x_s) - f(x_t)\big)^2 w_{st}$

Instantaneous risk

$J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \sum_{i=1}^{t} \big(f(x_i) - f(x_t)\big)^2 w_{it}$

(includes graph edges between $x_t$ and all previous examples)
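
A matching sketch for $J_t(f)$, again assuming the hinge loss; `f_prev` and `w_t` hold $f(x_i)$ and $w_{it}$ for the earlier points (illustrative names).

```python
import numpy as np

def instantaneous_risk(f_prev, f_t, y_t, f_norm_sq, w_t, T, l, lam1, lam2):
    """Evaluate J_t(f); f_prev[i] = f(x_i) for i <= t, f_t = f(x_t) (sketch)."""
    # (T/l) delta(y_t) c(f(x_t), y_t): the label term is scaled up by T/l
    loss = (T / l) * max(0.0, 1.0 - y_t * f_t) if y_t is not None else 0.0
    reg = 0.5 * lam1 * f_norm_sq
    smooth = lam2 * np.sum(w_t * (f_prev - f_t) ** 2)  # edges to all earlier x_i
    return loss + reg + smooth
```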



Online convex programming
Instead of minimizing the convex $J(f)$, reduce the convex $J_t(f)$ at each step $t$:

$f_{t+1} = f_t - \eta_t \left.\frac{\partial J_t(f)}{\partial f}\right|_{f_t}$
Remarkable no-regret guarantee against the adversary:
Accuracy can be arbitrarily bad if the adversary flips the target often
If so, no batch learner in hindsight can do well either
$\mathrm{regret} \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*)$

[Zinkevich, ICML 2003] No regret: $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*) \le 0$.

If there is no adversary (iid data), the average classifier $\bar{f} = \frac{1}{T} \sum_{t=1}^{T} f_t$ is good:

$J(\bar{f}) \to J(f^*)$.



Kernelized algorithm

The classifier is maintained as a kernel expansion:

$f_t(\cdot) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, \cdot)$

Init: $t = 1$, $f_1 = 0$
Repeat:

1. receive $x_t$, predict $f_t(x_t) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, x_t)$
2. occasionally receive $y_t$
3. update $f_t$ to $f_{t+1}$ by
   $\alpha_i^{(t+1)} = (1 - \eta_t \lambda_1)\, \alpha_i^{(t)} - 2 \eta_t \lambda_2 \big(f_t(x_i) - f_t(x_t)\big) w_{it}, \quad i < t$
   $\alpha_t^{(t+1)} = 2 \eta_t \lambda_2 \sum_{i=1}^{t} \big(f_t(x_i) - f_t(x_t)\big) w_{it} - \eta_t \frac{T}{l} \delta(y_t)\, c'(f_t(x_t), y_t)$
4. store $x_t$, let $t = t + 1$

One full update step is sketched below.
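
A minimal numpy sketch of one update step with the hinge loss. Using the RBF kernel both as $K$ and as the edge weight $w_{it}$, and taking $T$, $l$, `eta`, `lam1`, `lam2` as known constants, are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def online_mr_step(X_prev, alpha, x_t, y_t, T, l, eta, lam1, lam2, sigma=1.0):
    """One gradient step f_t -> f_{t+1}; returns the extended coefficients."""
    k_t = np.array([rbf(x_i, x_t, sigma) for x_i in X_prev])   # K(x_i, x_t)
    f_xt = alpha @ k_t if len(alpha) else 0.0                  # f_t(x_t)
    # f_t evaluated at every stored example (O(t^2) work, as the slides note)
    f_xi = np.array([sum(a * rbf(x_j, x_i, sigma) for x_j, a in zip(X_prev, alpha))
                     for x_i in X_prev])
    w_t = k_t                                                  # edge weights w_it
    # old coefficients: shrink by (1 - eta*lam1), push along smoothness gradient
    alpha_new = (1.0 - eta * lam1) * alpha - 2.0 * eta * lam2 * (f_xi - f_xt) * w_t
    # new coefficient for K(x_t, .); hinge subgradient c' = -y if y*f < 1 else 0
    a_t = 2.0 * eta * lam2 * np.sum((f_xi - f_xt) * w_t)
    if y_t is not None and y_t * f_xt < 1.0:
        a_t += eta * (T / l) * y_t
    return np.append(alpha_new, a_t)
```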



Sparse approximation

The algorithm is impractical:

space $O(T)$: stores all previous examples
time $O(T^2)$: each new example is compared to all previous ones

Two ways to speed up:
buffering, or
a random projection tree



Sparse approximation 1: buffering

Keep a buffer of size $\tau$:

approximate representers: $f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot)$
approximate instantaneous risk:

$J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \frac{t}{\tau} \sum_{i=t-\tau}^{t-1} \big(f(x_i) - f(x_t)\big)^2 w_{it}$

dynamic graph on the examples in the buffer (buffer bookkeeping is sketched below)
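
A sketch of the buffer bookkeeping, assuming FIFO eviction; folding the evicted point's coefficient into the survivors is deferred to the matching-pursuit step on the next slide. The class and method names are illustrative.

```python
from collections import deque

class MRBuffer:
    """Fixed-size buffer of (example, coefficient) pairs (illustrative sketch)."""
    def __init__(self, tau):
        self.tau = tau
        self.items = deque()                 # holds (x_i, alpha_i), newest last

    def add(self, x, a):
        self.items.append((x, a))
        if len(self.items) > self.tau:
            return self.items.popleft()      # evicted pair; see matching pursuit
        return None

    def predict(self, x_t, kernel):
        # f_t(x_t) = sum over buffered representers
        return sum(a * kernel(x_i, x_t) for x_i, a in self.items)
```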



Sparse approximation 1: buffer update
At each step, start with the current representers:

$f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot) + 0 \cdot K(x_t, \cdot)$

Gradient descent on the $\tau + 1$ terms:

$f' = \sum_{i=t-\tau}^{t} \alpha_i' K(x_i, \cdot)$

Reduce back to $\tau$ representers $f_{t+1} = \sum_{i=t-\tau+1}^{t} \alpha_i^{(t+1)} K(x_i, \cdot)$ by

$\min_{\alpha^{(t+1)}} \|f' - f_{t+1}\|^2$

Kernel matching pursuit (a simplified projection step is sketched below)
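
Full kernel matching pursuit is greedy; as a simplification, the sketch below solves the projection exactly, re-fitting the $\tau$ kept coefficients so the reduced function is as close as possible in RKHS norm to $f'$. `G_keep`, `g_old`, and the ridge safeguard are illustrative assumptions.

```python
import numpy as np

def reduce_representers(G_keep, g_old, alpha_keep, alpha_old, ridge=1e-8):
    """Project f' onto span{K(x_i, .) : i kept}, dropping the oldest point.

    f' = sum_i alpha_keep[i] K(x_i, .) + alpha_old * K(x_old, .)
    Minimizing ||sum_i beta_i K(x_i, .) - f'||_K^2 over beta gives the normal
    equations  G_keep @ beta = G_keep @ alpha_keep + alpha_old * g_old,
    where G_keep[i, j] = K(x_i, x_j) and g_old[i] = K(x_i, x_old).
    """
    rhs = G_keep @ alpha_keep + alpha_old * g_old
    n = len(G_keep)
    beta = np.linalg.solve(G_keep + ridge * np.eye(n), rhs)
    return beta
```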



Sparse approximation 2: random projection tree
[Dasgupta and Freund, STOC08]
Discretize the data manifold by online clustering.
When a cluster accumulates enough examples, split it along a random hyperplane.
Extends the k-d tree.



Sparse approximation 2: random projection tree
We use the clusters $\mathcal{N}(\mu_i, \Sigma_i)$ as representers:

$f_t = \sum_{i=1}^{s} \alpha_i^{(t)} K(\mu_i, \cdot)$

The cluster-graph edge weight between a cluster $i$ and an example $x_t$ is

$w_{it} = E_{x \sim \mathcal{N}(\mu_i, \Sigma_i)}\left[\exp\left(-\frac{\|x - x_t\|^2}{2\sigma^2}\right)\right]$
$\;\;\;\; = (2\pi)^{\frac{d}{2}}\, |\Sigma_i|^{-\frac{1}{2}}\, |\Sigma_0|^{-\frac{1}{2}}\, |\Sigma|^{\frac{1}{2}} \exp\left(-\frac{1}{2}\left(\mu_i^\top \Sigma_i^{-1} \mu_i - \mu^\top \Sigma^{-1} \mu + x_t^\top \Sigma_0^{-1} x_t\right)\right)$

where $\Sigma_0 = \sigma^2 I$, $\Sigma = (\Sigma_i^{-1} + \Sigma_0^{-1})^{-1}$, and $\mu = \Sigma(\Sigma_i^{-1} \mu_i + \Sigma_0^{-1} x_t)$.

A further approximation is

$w_{it} = e^{-\|\mu_i - x_t\|^2 / (2\sigma^2)}$

Update $f$ (i.e., $\alpha$) and the RPtree, then discard $x_t$. Both weights are sketched below.
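
A sketch of both weights. The exact expectation is written here in the equivalent marginal form $(2\pi\sigma^2)^{d/2}\,\mathcal{N}(x_t; \mu_i, \Sigma_i + \sigma^2 I)$, which avoids forming $\Sigma$ and $\mu$ explicitly; names are illustrative.

```python
import numpy as np

def weight_approx(mu_i, x_t, sigma):
    """Cheap cluster weight: w_it ~ exp(-||mu_i - x_t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((mu_i - x_t) ** 2) / (2.0 * sigma ** 2))

def weight_exact(mu_i, Sigma_i, x_t, sigma):
    """E_{x ~ N(mu_i, Sigma_i)} exp(-||x - x_t||^2 / (2 sigma^2)).

    Equals (2 pi sigma^2)^{d/2} * N(x_t; mu_i, Sigma_i + sigma^2 I),
    i.e. sigma^d / sqrt(det(S)) * exp(-0.5 * quad) with S = Sigma_i + sigma^2 I.
    """
    d = len(mu_i)
    S = Sigma_i + sigma ** 2 * np.eye(d)        # Sigma_i + Sigma_0
    diff = x_t - mu_i
    quad = diff @ np.linalg.solve(S, diff)
    return sigma ** d / np.sqrt(np.linalg.det(S)) * np.exp(-0.5 * quad)
```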


Experiment: runtime

Buffering and RPtree scale linearly, enabling life-long learning.


[Figure: runtime (seconds) vs. T on Spirals and MNIST 0 vs. 1, comparing Batch MR, Online MR, Online MR (buffer), and Online RPtree]



Experiment: risk

The online MR average instantaneous risk $J_{air}(T) \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t)$ approaches the batch risk $J(f^*)$ as $T$ increases.
[Figure: risk vs. T; $J(f^*)$ for Batch MR against $J_{air}(T)$ for Online MR, Online MR (buffer), and Online RPtree]



Experiment: generalization error of $\bar{f}$ in the iid case
A variant of buffering is as good as batch MR (preferentially keep labeled examples, but not their labels, in the buffer).
[Figure: generalization error rate vs. T on four datasets: (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2; curves for Batch MR, Online MR, Online MR (buffer), Online MR (bufferU), Online RPtree, and Online RPtree (PPK)]



Experiment: adversarial concept drift
Slowly rotating spirals; both p(x) and p(y|x) change over time.
Batch $f^*$ vs. online MR (buffer) $f_T$.
Test sets are drawn from the current p(x, y) at time T.

[Figure: generalization error rate vs. T under concept drift, comparing Batch MR and Online MR (buffer)]



Conclusions

Online semi-supervised learning framework


Sparse approximations: buffering and RPtree
Future work: new bounds, new algorithms (e.g., S3VM)

