
Published as a conference paper at ICLR 2018

ACTIVE LEARNING FOR CONVOLUTIONAL NEURAL NETWORKS: A CORE-SET APPROACH
Ozan Sener* (Intel Labs) ozan.sener@intel.com
Silvio Savarese (Stanford University) ssilvio@stanford.edu

ABSTRACT

Convolutional neural networks (CNNs) have been successfully applied to many
recognition and learning tasks using a universal recipe: training a deep model
on a very large dataset of supervised examples. However, this approach is rather
restrictive in practice since collecting a large set of labeled images is very expensive.
One way to ease this problem is coming up with smart ways for choosing images
to be labelled from a very large collection (i.e. active learning).
Our empirical study suggests that many of the active learning heuristics in the
literature are not effective when applied to CNNs in the batch setting. Inspired by
these limitations, we define the problem of active learning as core-set selection,
i.e. choosing a set of points such that a model learned over the selected subset
is competitive for the remaining data points. We further present a theoretical
result characterizing the performance of any selected subset using the geometry
of the datapoints. As an active learning algorithm, we choose the subset which is
expected to yield the best result according to our characterization. Our experiments
show that the proposed method outperforms existing approaches in image
classification experiments by a large margin.

1 INTRODUCTION

Deep convolutional neural networks (CNNs) have shown unprecedented success in many areas of
research in computer vision and pattern recognition, such as image classification, object detection,
and scene segmentation. Although CNNs are universally successful in many tasks, they have a
major drawback: they need a very large amount of labeled data to be able to learn their large number
of parameters. More importantly, it is almost always better to have more data since the accuracy
of CNNs is often not saturated with increasing dataset size. Hence, there is a constant desire to
collect more and more data. Although this is desirable from an algorithmic perspective (higher
representative power is typically better), labeling a dataset is a time-consuming and expensive
task. These practical considerations raise a critical question: “what is the optimal way to choose data
points to label such that the highest accuracy can be obtained given a fixed labeling budget?” Active
learning is one of the common paradigms to address this question.
The goal of active learning is to find effective ways to choose data points to label, from a pool of
unlabeled data points, in order to maximize the accuracy. Although it is not possible to obtain a
universally good active learning strategy (Dasgupta, 2004), there exist many heuristics (Settles, 2010)
which have been proven to be effective in practice. Active learning is typically an iterative process in
which a model is learned at each iteration and a set of points is chosen to be labelled from a pool of
unlabelled points using these heuristics. We experiment with many of these heuristics
in this paper and find them to be ineffective when applied to CNNs. We argue that the main factor behind
this ineffectiveness is the correlation caused by batch acquisition/sampling. In the classical setting,
active learning algorithms typically choose a single point at each iteration; however, this is not
feasible for CNNs since i) a single point is likely to have no statistically significant impact on the
accuracy due to the local optimization methods, and ii) each iteration requires a full training until
convergence, which makes it intractable to query labels one by one. Hence, it is necessary to query

*Work completed while the author was at Stanford University.


labels for a large subset at each iteration, which results in correlated samples even for moderately
small subset sizes.
In order to tailor an active learning method for the batch sampling case, we define
active learning as a core-set selection problem. The core-set selection problem aims to find a small subset of
a large labeled dataset such that a model learned over the small subset is competitive over the
whole dataset. Since we have no labels available, we perform the core-set selection without using
the labels. In order to attack the unlabeled core-set problem for CNNs, we provide a rigorous bound
between the average loss over any given subset of the dataset and the remaining data points via the
geometry of the data points. As an active learning algorithm, we choose a subset such that
this bound is minimized. Moreover, minimization of this bound turns out to be equivalent to the
k-Center problem (Wolf, 2011), and we adopt an efficient approximate solution to this combinatorial
optimization problem. We further study the behavior of our proposed algorithm empirically for the
problem of image classification using three different datasets. Our empirical analysis demonstrates
state-of-the-art performance by a large margin.

2 RELATED WORK

We discuss the related work in the following categories separately. Briefly, our work differs
from existing approaches in that i) it defines the active learning problem as core-set selection, ii)
it considers both fully supervised and weakly supervised cases, and iii) it rigorously addresses the
core-set selection problem directly for CNNs with no extra assumptions.
Active Learning Active learning has been widely studied and most of the early work can be found in
the classical survey of Settles (2010). It covers acquisition functions such as information theoretical
methods (MacKay, 1992), ensemble approaches (McCallumzy & Nigamy, 1998; Freund et al., 1997)
and uncertainty based methods (Tong & Koller, 2001; Joshi et al., 2009; Li & Guo, 2013).
Bayesian active learning methods typically use a non-parametric model like a Gaussian process to
estimate the expected improvement by each query (Kapoor et al., 2007) or the expected error after
a set of queries (Roy & McCallum, 2001). These approaches are not directly applicable to large
CNNs since they do not scale to large-scale datasets. A recent approach by Gal & Ghahramani (2016)
shows an equivalence between dropout and approximate Bayesian inference enabling the application
of Bayesian methods to deep learning. Although Bayesian active learning has been shown to be
effective for small datasets (Gal et al., 2017), our empirical analysis suggests that they do not scale to
large-scale datasets because of batch sampling.
One important class is that of uncertainty based methods, which try to find hard examples using
heuristics like highest entropy (Joshi et al., 2009) and geometric distance to decision boundaries
(Tong & Koller, 2001; Brinker, 2003). Our empirical analysis finds them to be ineffective for CNNs.
There are recent optimization based approaches which can trade off uncertainty and diversity to
obtain a diverse set of hard examples in the batch mode active learning setting. Both Elhamifar et al.
(2013) and Yang et al. (2015) design a discrete optimization problem for this purpose and use its
convex surrogate. Similarly, Guo (2010) casts a similar problem as matrix partitioning. However, the
optimization algorithms proposed in these papers use n^2 variables where n is the number of data
points. Hence, they do not scale to large datasets. There are also many pool based active learning
algorithms designed for specific classes of machine learning algorithms like k-nearest neighbors and
naive Bayes (Wei et al., 2015), logistic regression (Hoi et al., 2006; Guo & Schuurmans, 2008), and
linear regression with Gaussian noise (Yu et al., 2006). Even in the algorithm agnostic case, one can
design a set-cover algorithm to cover the hypothesis space using sub-modularity (Guillory & Bilmes,
2010; Golovin & Krause, 2011). On the other hand, Demir et al. (2011) use a heuristic to first filter
the pool based on uncertainty and then choose points to label using diversity. Our algorithm can be
considered to be in this class; however, we do not use any uncertainty information. Our algorithm is
also the first one applied to CNNs. Most similar to ours are (Joshiy et al., 2010) and
(Wang & Ye, 2015). Joshiy et al. (2010) use a similar optimization problem; however, they offer no
theoretical justification or analysis. Wang & Ye (2015) propose to use empirical risk minimization
like us; however, they try to minimize the difference between two distributions (maximum mean
discrepancy between i.i.d. samples from the dataset and the actively selected samples) instead of


the core-set loss. Moreover, neither algorithm is experimented with CNNs. In our experimental
study, we compare with (Wang & Ye, 2015).
Recently, a discrete optimization based method (Berlind & Urner, 2015) which is similar to ours
has been presented for k-NN type algorithms in the domain shift setting. Although our theoretical
analysis borrows some techniques from them, their results are only valid for k-NNs.
Active learning algorithms for CNNs have also recently been presented in (Wang et al., 2016; Stark et al.,
2015). Wang et al. (2016) propose a heuristic-based algorithm which directly assigns labels to the
data points with high confidence and queries labels for the ones with low confidence. Moreover,
Stark et al. (2015) specifically targets recognizing CAPTCHA images. Although their results are
promising for CAPTCHA recognition, their method is not effective for image classification. We
discuss limitations of both approaches in Section 5.
On the theoretical side, it is shown that greedy active learning is not possible in the algorithm- and
data-agnostic case (Dasgupta, 2005). However, there are data dependent results showing that it is indeed
possible to obtain a query strategy which has better sample complexity than querying all points.
These results either use assumptions about data-dependent realizability of the hypothesis space like
(Gonen et al., 2013) or a data dependent measure of the concept space called disagreement coefficient
(Hanneke, 2007). It is also possible to perform active learning in a batch setting using the greedy
algorithm via importance sampling (Ganti & Gray, 2012). Although the aforementioned algorithms
enjoy theoretical guarantees, they do not apply to large-scale problems.
Core-Set Selection The closest literature to our work is the problem of core-set selection since we
define active learning as a core-set selection problem. This problem considers a fully labeled dataset
and tries to choose a subset of it such that the model trained on the selected subset will perform as
closely as possible to the model trained on the entire dataset. For specific learning algorithms, there
are methods like core-sets for SVM (Tsang et al., 2005) and core-sets for k-Means and k-Medians
(Har-Peled & Kushal, 2005). However, we are not aware of such a method for CNNs.
The most similar algorithm to ours is the unsupervised subset selection algorithm in (Wei et al., 2013).
It uses a facility location problem to find a diverse cover for the dataset. Our algorithm differs in that
it uses a slightly different formulation of the facility location problem. Instead of the min-sum, we use
the minimax (Wolf, 2011) form. More importantly, we apply this algorithm for the first time to the
problem of active learning and provide theoretical guarantees for CNNs.
Weakly-Supervised Deep Learning Our paper is also related to semi-supervised deep learning since
we experiment with active learning in both the fully-supervised and weakly-supervised schemes. One of
the early weakly-supervised convolutional neural network algorithms was Ladder networks (Rasmus
et al., 2015). Recently, we have seen adversarial methods which can learn a data distribution as a
result of a two-player non-cooperative game (Salimans et al., 2016; Goodfellow et al., 2014; Radford
et al., 2015). These methods are further extended to feature learning (Dumoulin et al., 2016; Donahue
et al., 2016). We use Ladder networks in our experiments; however, our method is agnostic to the
weakly-supervised learning algorithm choice and can utilize any model.

3 PROBLEM DEFINITION

In this section, we formally define the problem of active learning in the batch setting and set up
the notation for the rest of the paper. We are interested in a C-class classification problem defined
over a compact space X and a label space Y = {1, . . . , C}. We also consider a loss function
l(·, ·; w) : X × Y → R parametrized over the hypothesis class (w), e.g. parameters of the deep
learning algorithm. We further assume the class-specific regression functions η_c(x) = p(y = c | x) to be
λ^η-Lipschitz continuous for all c.
We consider a large collection of data points which are sampled i.i.d. over the space Z = X × Y as
{x_i, y_i}_{i∈[n]} ∼ p_Z where [n] = {1, . . . , n}. We further consider an initial pool of data points chosen
uniformly at random as s^0 = {s^0(j) ∈ [n]}_{j∈[m]}.
An active learning algorithm only has access to {x_i}_{i∈[n]} and {y_{s^0(j)}}_{j∈[m]}. In other words, it can
only see the labels of the points in the initial sub-sampled pool. It is also given a budget b of queries


to ask an oracle, and a learning algorithm A_s which outputs a set of parameters w given a labelled
set s. The pool-based active learning problem can simply be defined as

    min_{s^1 : |s^1| ≤ b}  E_{x,y∼p_Z} [ l(x, y; A_{s^0 ∪ s^1}) ].    (1)

In other words, an active learning algorithm can choose b extra points and get them labelled by an
oracle to minimize the future expected loss. There are a few differences between our formulation and
the classical definition of active learning. Classical methods consider the case in which the budget is
1 (b = 1), but a single point has negligible effect in a deep learning regime; hence we consider the
batch case. It is also very common to consider multiple rounds of this game. We also follow the
multiple round formulation with a myopic approach by solving a single round of labelling as:

    min_{s^{k+1} : |s^{k+1}| ≤ b}  E_{x,y∼p_Z} [ l(x, y; A_{s^0 ∪ ... ∪ s^{k+1}}) ].    (2)
For brevity, we only discuss the first iteration, where k = 0, although we apply the procedure over multiple rounds.
At each iteration, an active learning algorithm has two stages: 1. identifying a set of data points and
presenting them to an oracle to be labelled, and 2. training a classifier using both the new and the
previously labeled data points. The second stage (training the classifier) can be done in a fully or
weakly-supervised manner. In the fully-supervised case, the classifier is trained using only the labeled
data points. In the weakly-supervised case, training also utilizes the points which are not yet labelled.
Although the existing literature only focuses on active learning for fully-supervised models, we
consider both cases and experiment on both.
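To make the two-stage procedure concrete, the sketch below outlines one round of the batch active-learning loop in Python. The helpers train_model, query_batch and oracle_label are hypothetical placeholders standing in for the training routine, the selection strategy of Section 4 and the human annotator; they are assumptions for illustration, not part of the paper.

import numpy as np

def active_learning_round(X, labeled_idx, labels, budget,
                          train_model, query_batch, oracle_label):
    """One round of batch active learning (illustrative sketch, not the authors' code).

    X           : array of all data points, shape (n, d)
    labeled_idx : indices of points whose labels are known (s^0)
    labels      : dict mapping a labeled index to its label
    budget      : number of new labels to request (b)
    """
    # Stage 2 of the previous round: train on the currently labeled pool.
    # A weakly-supervised learner may additionally use the unlabeled part of X.
    model = train_model(X, labeled_idx, labels)

    # Stage 1: pick b informative points from the unlabeled pool.
    unlabeled_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)
    new_idx = query_batch(model, X, labeled_idx, unlabeled_idx, budget)

    # Ask the oracle for the labels of the selected points.
    for i in new_idx:
        labels[i] = oracle_label(i)

    return np.concatenate([labeled_idx, new_idx]), labels, model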

4 METHOD
4.1 ACTIVE LEARNING AS A SET COVER

In the classical active learning setting, the algorithm acquires labels one by one by querying an oracle
(i.e. b = 1). Unfortunately, this is not feasible when training CNNs since i) a single point will not
have a statistically significant impact on the model due to the local optimization algorithms, and ii) it is
infeasible to train as many models as there are points since many practical problems of interest are
very large-scale. Hence, we focus on the batch active learning problem, in which the active learning
algorithm chooses a moderately large set of points to be labelled by an oracle at each iteration.
In order to design an active learning strategy which is effective in the batch setting, we consider the
following upper bound on the active learning loss we formally defined in (1):


    E_{x,y∼p_Z}[ l(x, y; A_s) ]  ≤  | E_{x,y∼p_Z}[ l(x, y; A_s) ] − (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) |   (Generalization Error)
        + (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s)   (Training Error)                                            (3)
        + | (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) − (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s) |   (Core-Set Loss)

The quantity we are interested in is the population risk of the model learned using a small labelled
subset (s). The population risk is controlled by the training error of the model on the labelled subset,
the generalization error over the full dataset ([n]), and a term we define as the core-set loss. The core-set
loss is simply the difference between the average empirical loss over the set of points for which we have
labels and the average empirical loss over the entire dataset, including unlabelled points. Empirically, it
is widely observed that CNNs are highly expressive, leading to very low training error, and they
typically generalize well for various visual problems. Moreover, the generalization error of CNNs has
also been theoretically studied and shown to be bounded (Xu & Mannor, 2012). Hence, the critical part
for active learning is the core-set loss. Following this observation, we re-define the active learning
problem as:

    min_{s^1 : |s^1| ≤ b}  | (1/n) Σ_{i∈[n]} l(x_i, y_i; A_{s^0 ∪ s^1}) − (1/|s^0 ∪ s^1|) Σ_{j∈s^0 ∪ s^1} l(x_j, y_j; A_{s^0 ∪ s^1}) |    (4)


Figure 1: Visualization of Theorem 1. Consider the set of selected points s and the points in the
remainder of the dataset [n] \ s. Our result shows that if s is a δ_s cover of the dataset, then
| (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) − (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s) |  ≤  O(δ_s) + O(sqrt(1/n)).

Informally, given the initial labelled set (s^0) and the budget (b), we are trying to find a set of points
to query labels for (s^1) such that when we learn a model, the performance of the model on the labelled
subset and that on the whole dataset will be as close as possible.
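For intuition only, the core-set loss in (4) can be evaluated directly on a fully labeled benchmark once per-point losses of a trained model are available; the small numpy sketch below does that. During actual active learning this quantity is not computable, since the losses of unlabelled points are unknown; the function name and inputs here are illustrative assumptions.

import numpy as np

def core_set_loss(losses, selected):
    """|average loss over all points - average loss over the selected subset|.

    losses   : (n,) array of per-point losses l(x_i, y_i; A_s)
    selected : indices of the labeled subset s^0 U s^1
    """
    return abs(losses.mean() - losses[list(selected)].mean())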

4.2 CORE-SETS FOR CNNS

The optimization objective we define in (4) is not directly computable since we do not have access to
all the labels (i.e. [n] \ (s^0 ∪ s^1) is unlabelled). Hence, in this section we give an upper bound on this
objective function which we can optimize.
We start by presenting this bound for any loss function which is Lipschitz for a fixed true label
y and parameters w, and then show that loss functions of CNNs with ReLU non-linearities satisfy
this property. We also rely on a zero training error assumption. Although zero training error is
not an entirely realistic assumption, our experiments suggest that the resulting upper bound is very
effective. We state the following theorem:
Theorem 1. Given n i.i.d. samples drawn from p_Z as {x_i, y_i}_{i∈[n]}, and a set of points s. If the loss
function l(·, y, w) is λ^l-Lipschitz continuous for all y, w and bounded by L, the regression function is
λ^η-Lipschitz, s is a δ_s cover of {x_i, y_i}_{i∈[n]}, and l(x_{s(j)}, y_{s(j)}; A_s) = 0 for all j ∈ [m]; then with
probability at least 1 − γ,

    | (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) − (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s) |  ≤  δ_s (λ^l + λ^η L C) + sqrt( L² log(1/γ) / (2n) ).

Since we assume zero training error on the core-set, the core-set loss is equal to the average error
over the entire dataset: (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) − (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s) = (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s).

We state the theorem in this form to be consistent with (3). We visualize this theorem in Figure 1 and
defer its proof to the appendix. In this theorem, “a set s is a δ cover of a set s*” means that a set of balls
with radius δ centered at each member of s can cover the entire s*. Informally, this theorem suggests
that we can bound the core-set loss with the covering radius and a term which goes to zero at a rate that
depends solely on n. This is an interesting result since this bound does not depend on the number of
labelled points. In other words, a provided label does not help the core-set loss unless it decreases the
covering radius.
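As an illustration of the quantity driving the bound, the snippet below computes the covering radius δ_s of a candidate subset from a pairwise distance matrix; it is a small numpy sketch under the assumption that the full distance matrix fits in memory, not code from the paper.

import numpy as np

def covering_radius(dist, s):
    """delta_s: the largest distance from any point to its nearest point in s.

    dist : (n, n) matrix of pairwise distances Delta(x_i, x_j)
    s    : indices of the selected (covering) subset
    """
    # For every point, take the distance to its closest member of s,
    # then take the worst case over all points.
    return dist[:, list(s)].min(axis=1).max()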
In order to show that this bound applies to CNNs, we prove the Lipschitz-continuity of the loss
function of a CNN with respect to the input image for a fixed true label with the following lemma,
where max-pool and rectified linear units are the non-linearities and the loss is defined as the l2

distance between the desired class probabilities and the soft-max outputs. CNNs are typically used
with cross-entropy loss for classification problems in the literature. Indeed, we also perform our
experiments using the cross-entropy loss although we use l2 loss in our theoretical study. Although
our theoretical study does not extend to cross-entropy loss, our experiments suggest that the resulting
algorithm is very effective for cross-entropy loss.
Lemma 1. The loss function defined as the 2-norm between the class probabilities and the softmax
output of a convolutional neural network with n_c convolutional (with max-pool and ReLU) and n_fc
fully connected layers defined over C classes is a (sqrt(C−1)/C) α^{n_c + n_fc}-Lipschitz function of the
input for fixed class probabilities and network parameters.
Here, α is the maximum sum of input weights per neuron (see the appendix for a formal definition).
Although it is in general unbounded, it can be made arbitrarily small without changing the loss
function behavior (i.e. keeping the label of any data point unchanged). We defer the proof to the
appendix and conclude that CNNs enjoy the bound we presented in Theorem 1.
In order to computationally perform active learning, we use this upper bound. In other words,
the practical problem of interest becomes min_{s^1 : |s^1| ≤ b} δ_{s^0 ∪ s^1}. This problem is equivalent to the
k-Center problem (also called the min-max facility location problem) (Wolf, 2011). In the next
section, we explain how we solve the k-Center problem in practice using a greedy approximation.

4.3 SOLVING THE K-CENTER PROBLEM

We have so far provided an upper bound for the loss function of the core-set selection problem and showed
that minimizing it is equivalent to the k-Center problem (minimax facility location (Wolf, 2011)), which
can intuitively be defined as follows: choose b center points such that the largest distance between a data
point and its nearest center is minimized. Formally, we are trying to solve:

    min_{s^1 : |s^1| ≤ b}  max_i  min_{j ∈ s^1 ∪ s^0}  ∆(x_i, x_j)    (5)

Algorithm 1 k-Center-Greedy
  Input: data x_i, existing pool s^0 and a budget b
  Initialize s = s^0
  repeat
    u = argmax_{i ∈ [n]\s} min_{j ∈ s} ∆(x_i, x_j)
    s = s ∪ {u}
  until |s| = b + |s^0|
  return s \ s^0
Unfortunately, this problem is NP-Hard (Cook et al., 1998). However, it is possible to obtain
a 2-OPT solution efficiently using the greedy approach shown in Algorithm 1. If OPT =
min_{s^1} max_i min_{j ∈ s^1 ∪ s^0} ∆(x_i, x_j), the greedy algorithm shown in Algorithm 1 is proven to have a
solution s^1 such that max_i min_{j ∈ s^1 ∪ s^0} ∆(x_i, x_j) ≤ 2 × OPT.
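A minimal numpy implementation of Algorithm 1 might look as follows. It assumes a precomputed pairwise distance matrix; a memory-efficient variant would compute distances to the newest center on the fly instead.

import numpy as np

def k_center_greedy(dist, s0, budget):
    """Greedy 2-OPT approximation for the k-Center problem (Algorithm 1 sketch).

    dist   : (n, n) matrix of pairwise distances Delta(x_i, x_j)
    s0     : list of indices of the existing labeled pool
    budget : number of new points b to select
    Returns the indices of the newly selected points (s^1).
    """
    # min_dist[i] = distance from point i to its nearest center chosen so far
    min_dist = dist[:, s0].min(axis=1)
    selected = []
    for _ in range(budget):
        u = int(np.argmax(min_dist))              # farthest point from all current centers
        selected.append(u)
        # Adding u as a center can only shrink each point's nearest-center distance.
        min_dist = np.minimum(min_dist, dist[:, u])
    return selected

Given the distance matrix, each iteration costs O(n), so selecting b points costs O(nb).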
Although the greedy algorithm gives a good initialization, in practice we can improve the 2-OPT
solution by iteratively querying upper bounds on the optimal value. In other words, we can design
an algorithm which decides if OPT ≤ δ. In order to do so, we define a mixed integer program
(MIP) parametrized by δ such that its feasibility indicates min_{s^1} max_i min_{j ∈ s^1 ∪ s^0} ∆(x_i, x_j) ≤ δ. A
straightforward algorithm would be to use this MIP as a sub-routine and perform a binary search
between the result of the greedy algorithm and its half, since the optimal solution is guaranteed to
be in that range. While constructing this MIP, we also try to handle one of the weaknesses
of the k-Center algorithm, namely robustness. To make the k-Center problem robust, we assume an
upper limit on the number of outliers Ξ such that our algorithm can choose not to cover at most Ξ
unsupervised data points. This mixed integer program can be written as:
    Feasible(b, s^0, δ, Ξ):
        Σ_j u_j = |s^0| + b,            Σ_{i,j} ξ_{i,j} ≤ Ξ
        Σ_j ω_{i,j} = 1   ∀i,           ω_{i,j} ≤ u_j   ∀i, j                       (6)
        u_i = 1   ∀i ∈ s^0,             u_i ∈ {0, 1}   ∀i
        ω_{i,j} = ξ_{i,j}   ∀i, j  such that  ∆(x_i, x_j) > δ.
In this formulation, u_i is 1 if the i-th data point is chosen as a center, ω_{i,j} is 1 if the i-th point is covered
by the j-th point, and ξ_{i,j} is 1 if the i-th point is an outlier and covered by the j-th point without the δ
constraint, and 0 otherwise. All variables are binary: u_i, ω_{i,j}, ξ_{i,j} ∈ {0, 1}. We further visualize
these variables in a diagram in Figure 2, and give the details of the method in Algorithm 2.
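Since the authors report using Gurobi (Section 4.4), a direct transcription of the feasibility problem (6) with the gurobipy interface might look as below. It is an unoptimized sketch: creating all n^2 pairwise variables is only practical for small n, and a practical implementation would restrict ω and ξ to pairs with ∆(x_i, x_j) below the current upper bound. The function name and signature are assumptions for illustration.

import gurobipy as gp
from gurobipy import GRB

def feasible(dist, s0, budget, delta, xi_max):
    """Check feasibility of the robust k-Center MIP (6) for a given delta.

    Returns the chosen centers {i : u_i = 1} if feasible, otherwise None.
    dist is the (n, n) matrix of Delta(x_i, x_j); s0 is the existing pool.
    """
    n = dist.shape[0]
    m = gp.Model("robust-k-center-feasibility")
    m.Params.OutputFlag = 0

    u = m.addVars(n, vtype=GRB.BINARY, name="u")        # u_i: point i is a center
    w = m.addVars(n, n, vtype=GRB.BINARY, name="w")     # w_ij: point i covered by j
    xi = m.addVars(n, n, vtype=GRB.BINARY, name="xi")   # xi_ij: i covered as an outlier

    m.addConstr(u.sum() == len(s0) + budget)            # total number of centers
    m.addConstr(xi.sum() <= xi_max)                     # outlier budget Xi
    for i in range(n):
        m.addConstr(w.sum(i, "*") == 1)                 # every point covered exactly once
        for j in range(n):
            m.addConstr(w[i, j] <= u[j])                # only centers can cover
            if dist[i, j] > delta:
                m.addConstr(w[i, j] == xi[i, j])        # coverage beyond delta only as outlier
    for i in s0:
        m.addConstr(u[i] == 1)                          # the existing pool stays selected

    m.optimize()
    if m.Status != GRB.OPTIMAL:
        return None
    return [i for i in range(n) if u[i].X > 0.5]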

Algorithm 2 Robust k-Center
  Input: data x_i, existing pool s^0, budget b and outlier bound Ξ
  Initialize s_g = k-Center-Greedy(x_i, s^0, b)
  δ_{2-OPT} = max_j min_{i ∈ s_g} ∆(x_i, x_j)
  lb = δ_{2-OPT} / 2,  ub = δ_{2-OPT}
  repeat
    if Feasible(b, s^0, (lb + ub)/2, Ξ) then
      ub = max_{i,j | ∆(x_i, x_j) ≤ (lb + ub)/2} ∆(x_i, x_j)
    else
      lb = min_{i,j | ∆(x_i, x_j) ≥ (lb + ub)/2} ∆(x_i, x_j)
    end if
  until ub = lb
  return {i s.t. u_i = 1}

Figure 2: Visualizations of the variables. In this solution, the 4th node is chosen as a center and nodes
0, 1, 3 are in a δ ball around it. The 2nd node is marked as an outlier.
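Putting the pieces together, Algorithm 2 can be sketched as below, reusing k_center_greedy and feasible from the earlier sketches. Restricting the binary search to the finite set of observed pairwise distances guarantees termination; this is an illustrative reading of the algorithm, not the authors' implementation.

import numpy as np

def robust_k_center(dist, s0, budget, xi_max):
    """Binary search over delta with the MIP feasibility oracle (Algorithm 2 sketch)."""
    greedy = k_center_greedy(dist, s0, budget)
    delta_2opt = dist[:, list(s0) + greedy].min(axis=1).max()

    values = np.unique(dist)                       # candidate covering radii
    lb, ub = delta_2opt / 2.0, delta_2opt
    best = list(s0) + greedy                       # fall back to the greedy solution
    while lb < ub:
        mid = (lb + ub) / 2.0
        centers = feasible(dist, s0, budget, mid, xi_max)
        if centers is not None:
            ub = values[values <= mid].max()       # tighten to the largest distance <= mid
            best = centers
        else:
            lb = values[values >= mid].min()       # raise to the smallest distance >= mid
    s0_set = set(s0)
    return [i for i in best if i not in s0_set]    # only the newly queried points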

4.4 IMPLEMENTATION DETAILS

One of the critical design choices is the distance metric ∆(·, ·). We use the l2 distance between
activations of the final fully-connected layer as the distance. For weakly-supervised learning, we
used Ladder networks (Rasmus et al., 2015), and for all experiments we used VGG-16 (Simonyan
& Zisserman, 2014) as the CNN architecture. We initialized all convolutional filters according to
He et al. (2016). We optimized all models using RMSProp with a learning rate of 1e-3 using
TensorFlow (Abadi et al., 2016). We train CNNs from scratch after each iteration.
We used the Gurobi (Inc., 2016) framework for checking feasibility of the MIP defined in (6). As an
upper bound on outliers, we used Ξ = 1e-4 × n where n is the number of unlabelled points.
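For completeness, the sketch below computes the pairwise l2 distances used as ∆(·, ·), assuming the activations of the final fully-connected layer have already been extracted into a feature matrix (the VGG-16 feature extractor itself is not shown). The resulting matrix can be fed directly to the k-Center routines sketched above.

import numpy as np

def pairwise_l2(features):
    """Pairwise l2 distances Delta(x_i, x_j) between final-FC-layer activations.

    features : (n, d) array with one row of activations per image
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; clip tiny negatives from rounding.
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.sqrt(np.maximum(d2, 0.0))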

5 EXPERIMENTAL RESULTS

We tested our algorithm on the problem of classification using three different datasets. We performed
experiments on the CIFAR (Krizhevsky & Hinton, 2009) and Caltech-256 (Griffin et al., 2007)
datasets for image classification and on the SVHN (Netzer et al., 2011) dataset for digit classification.
The CIFAR (Krizhevsky & Hinton, 2009) dataset has two tasks: one coarse-grained over 10 classes
and one fine-grained over 100 classes. We performed experiments on both.
We compare our method with the following baselines: i) Random: choosing the points to be labelled
uniformly at random from the unlabelled pool. ii) Best Empirical Uncertainty: following the empirical
setup in (Gal et al., 2017), we perform active learning using max-entropy, BALD and Variation
Ratios, treating soft-max outputs as probabilities. We only report the best performing one for each
dataset since they perform similarly to each other. iii) Deep Bayesian Active Learning (DBAL) (Gal
et al., 2017): we perform Monte Carlo dropout to obtain improved uncertainty measures and report
only the best performing acquisition function among max-entropy, BALD and Variation Ratios for
each dataset. iv) Best Oracle Uncertainty: we also report a best performing oracle algorithm
which uses the label information for the entire dataset. We replace the uncertainty with l(x_i, y_i; A_{s^0})
for all unlabelled examples and sample the queries from the normalized form of this function by
setting the probability of choosing the i-th point to be queried as p_i = l(x_i, y_i; A_{s^0}) / Σ_j l(x_j, y_j; A_{s^0})
(a small sampling sketch is given after this list). v) k-Median: choosing the points to be labelled as
the cluster centers of the k-Median algorithm (k is equal to the budget). vi) Batch Mode
Discriminative-Representative Active Learning (BMDR) (Wang & Ye, 2015): an ERM based
approach which uses uncertainty and minimizes the MMD between i.i.d. samples from the dataset
and the actively chosen points. vii) CEAL (Wang et al., 2016): a weakly-supervised active learning
method proposed specifically for CNNs; we include it in the weakly-supervised analysis.
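As a concrete reading of baseline iv), the snippet below samples a batch of queries with probability proportional to each unlabelled point's oracle loss; the loss values are assumed to be given by an oracle and are of course unavailable in a real deployment. Function and argument names are illustrative.

import numpy as np

def sample_oracle_uncertainty(losses, unlabeled_idx, budget, rng=None):
    """Sample b queries with p_i proportional to l(x_i, y_i; A_{s^0}) (sketch of baseline iv)."""
    if rng is None:
        rng = np.random.default_rng()
    l = losses[unlabeled_idx]
    p = l / l.sum()                                   # normalize losses into a distribution
    return rng.choice(unlabeled_idx, size=budget, replace=False, p=p)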


Figure 3: Results on Active Learning for Weakly-Supervised Model (error bars are std-dev).
[Plots of classification accuracy (%) vs. number of labelled images (ratio) on CIFAR-10, CIFAR-100,
Caltech-256 and SVHN for Random, Empirical-Unc., Oracle-Unc., DBAL, BMDR, CEAL, k-Median
and our method.]
Figure 4: Results on Active Learning for Fully-Supervised Model (error bars are std-dev).
[Plots of classification accuracy (%) vs. number of labelled images (ratio) on the same four datasets
for the same baselines, excluding CEAL.]
We conducted experiments on active learning for fully-supervised models as well as active learning for
weakly-supervised models. In our experiments, we start with a small set of images sampled uniformly
at random from the dataset as an initial pool. The weakly-supervised model has access to labeled
examples as well as unlabelled examples. The fully-supervised model only has access to the labeled
data points. We run all experiments with five random initializations of the initial pool of labeled points
and use the average classification accuracy as a metric. We plot the accuracy vs. the number of labeled
points, with error bars showing standard deviations. We run the query algorithm iteratively; in other
words, we solve the discrete optimization problem min_{s^{k+1} : |s^{k+1}| ≤ b} E_{x,y∼p_Z}[ l(x, y; A_{s^0 ∪ ... ∪ s^{k+1}}) ]
for each point on the accuracy vs. number of labelled examples graph. We present the results in
Figures 3 and 4.
Figures 3 and 4 suggest that our algorithm outperforms all other baselines in all experiments; for the
case of weakly-supervised models, by a large margin. We believe the effectiveness of our approach in
the weakly-supervised case is due to better feature learning. Weakly-supervised models provide
better feature spaces, resulting in more accurate geometries. Since our method is geometric, it performs
significantly better with better feature spaces. We also observed that our algorithm is less effective
on CIFAR-100 and Caltech-256 when compared with CIFAR-10 and SVHN. This can easily be
explained using our theoretical analysis: our bound on the core-set loss scales with the number of
classes, hence it is better to have fewer classes.
One interesting observation is the fact that a state-of-the-art batch mode active learning baseline
(BMDR (Wang & Ye, 2015)) does not necessarily perform better than greedy ones. We believe this is
due to the fact that it still uses uncertainty information, and soft-max probabilities are not a good
proxy for uncertainty. Our method does not use any uncertainty, and incorporating uncertainty into
our method in a principled way is an open problem and a fruitful future research direction. On the
other hand, a pure clustering based batch active learning baseline (k-Medoids) is also not effective.
We believe this is rather intuitive since cluster centers are likely the points which are already well
covered by the initial i.i.d. samples. Hence, this clustering based method fails to sample the tails of
the data distribution.
Our results suggest that both oracle uncertainty information and Bayesian estimation of uncertainty are
helpful since they improve over the empirical uncertainty baseline; however, they are still not effective in
the batch setting since random sampling outperforms them. We believe this is due to the correlation
in the queried labels as a consequence of active learning in the batch setting. We further investigate this
with a qualitative analysis via tSNE (Maaten & Hinton, 2008) embeddings. We compute embeddings
for all points using the features learned from the labelled examples and visualize the points
sampled by our method as well as by the oracle uncertainty. This visualization suggests that, due to the
correlation among samples, uncertainty based methods fail to cover a large portion of the space,
confirming our hypothesis.
Optimality of the k-Center Solution: Our proposed method uses the greedy 2-OPT solution for the
k-Center problem as an initialization and checks the feasibility of a mixed integer program (MIP).


Table 1: Average run-time of our algorithm for b = 5k and |s^0| = 10k in seconds.

Distance Matrix | Greedy (2-OPT) | MIP (iteration) | MIP (total) | Total
104.2           | 2              | 7.5             | 244.03      | 360.23

Figure 5: tSNE embeddings of the CIFAR dataset and behavior of the uncertainty oracle as well as our
method. For both methods, the initial labeled pool of 1000 images is shown in blue, the 1000 images
chosen to be labeled in green, and the remaining ones in red. Our algorithm results in queries evenly
covering the space. On the other hand, samples chosen by the uncertainty oracle fail to cover a large
portion of the space. [Panels: (a) Uncertainty Oracle, (b) Our Method.]

Figure 6: We compare our method with k-Center-Greedy. Our algorithm results in a small but
important accuracy improvement. [Plot of classification accuracy vs. number of labelled images
(10k to 50k) on CIFAR-100 for Greedy and Our Method.]

We use an LP-relaxation of the defined MIP and use branch-and-bound to obtain integer solutions. The
utility obtained by solving this expensive MIP should be investigated. We compare the average
run-time of the MIP (on an Intel Core i7-5930K @ 3.50GHz with 64GB memory) with the run-time of
the 2-OPT solution in Table 1. We also compare the accuracy obtained with the optimal k-Center
solution and the 2-OPT solution in Figure 6 on the CIFAR-100 dataset. As shown in Table 1, although
the run-time of the MIP is not polynomial in the worst case, in practice it converges in a tractable amount
of time for a dataset of 50k images. Hence, our algorithm can easily be applied in practice. Figure 6
suggests a small but significant drop in accuracy when the 2-OPT solution is used. Hence, we conclude
that unless the scale of the dataset is too restrictive, using the proposed optimal solver is desirable.
Even with the accuracy drop, our active learning strategy using the 2-OPT solution still outperforms
the other baselines. Hence, we can conclude that our algorithm can scale to any dataset size with a
small accuracy drop even if solving the MIP is not feasible.

6 CONCLUSION
We study the active learning problem for CNNs. Our empirical analysis showed that classical
uncertainty based methods have limited applicability to CNNs due to the correlations caused
by batch sampling. We re-formulate the active learning problem as core-set selection and study the
core-set problem for CNNs. We further validated our algorithm using an extensive empirical study.
Empirical results on three datasets showed state-of-the-art performance by a large margin.

REFERENCES
Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine
learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
C. Berlind and R. Urner. Active nearest neighbors in changing environments. In ICML, 2015.
Klaus Brinker. Incorporating diversity in active learning with support vector machines. In ICML,
volume 3, pp. 59–66, 2003.


William J Cook, William H Cunningham, William R Pulleyblank, and Alexander Schrijver. Combi-
natorial optimization, volume 605. Springer, 1998.

Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In NIPS, 2004.

Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In L. K. Saul, Y. Weiss, and L. Bottou
(eds.), Advances in Neural Information Processing Systems 17, pp. 337–344. MIT Press, 2005. URL
http://papers.nips.cc/paper/2636-analysis-of-a-greedy-active-learning-strategy.pdf.

Begüm Demir, Claudio Persello, and Lorenzo Bruzzone. Batch-mode active-learning methods for the
interactive classification of remote sensing images. IEEE Transactions on Geoscience and Remote
Sensing, 49(3):1014–1031, 2011.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning.
arXiv:1605.09782, 2016.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro,
and Aaron Courville. Adversarially learned inference. arXiv:1606.00704, 2016.

Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, and S Shankar Sasrty. A convex optimization
framework for active learning. In ICCV, 2013.

Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query
by committee algorithm. Machine learning, 28(2-3), 1997.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model
uncertainty in deep learning. In International Conference on Machine Learning, 2016.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data.
arXiv preprint arXiv:1703.02910, 2017.

Ravi Ganti and Alexander Gray. Upal: Unbiased pool based active learning. In Artificial Intelligence
and Statistics, pp. 422–431, 2012.

Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active
learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486,
2011.

Alon Gonen, Sivan Sabato, and Shai Shalev-Shwartz. Efficient active learning of halfspaces: an
aggressive approach. The Journal of Machine Learning Research, 14(1):2583–2615, 2013.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694,
California Institute of Technology, 2007. URL http://authors.library.caltech.edu/7694.

Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. arXiv:1002.3345, 2010.

Yuhong Guo. Active instance sampling via matrix partition. In Advances in Neural Information
Processing Systems, pp. 802–810, 2010.

Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in
neural information processing systems, pp. 593–600, 2008.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the
24th international conference on Machine learning, pp. 353–360. ACM, 2007.

Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. In Annual
Symposium on Computational geometry. ACM, 2005.


Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 770–778, 2016.
Steven CH Hoi, Rong Jin, Jianke Zhu, and Michael R Lyu. Batch mode active learning and its
application to medical image classification. In Proceedings of the 23rd international conference
on Machine learning, pp. 417–424. ACM, 2006.
Gurobi Optimization Inc. Gurobi optimizer reference manual, 2016. URL http://www.gurobi.com.
Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image
classification. In CVPR, 2009.
A. J. Joshiy, F. Porikli, and N. Papanikolopoulos. Multi-class batch-mode active learning for image
classification. In 2010 IEEE International Conference on Robotics and Automation, pp. 1873–1878,
May 2010. doi: 10.1109/ROBOT.2010.5509293.
Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with gaussian
processes for object categorization. In ICCV, 2007.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
Xin Li and Yuhong Guo. Adaptive active learning for image classification. In CVPR, 2013.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine
Learning Research, 9(Nov):2579–2605, 2008.
David JC MacKay. Information-based objective functions for active data selection. Neural computa-
tion, 4(4):590–604, 1992.
Andrew Kachites McCallumzy and Kamal Nigamy. Employing em and pool-based active learning
for text classification. In ICML, 1998.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading
digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning
and unsupervised feature learning, volume 2011, pp. 5, 2011.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv:1511.06434, 2015.
Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised
learning with ladder networks. In NIPS, 2015.
Nicholas Roy and Andrew McCallum. Toward optimal active learning through monte carlo estimation
of error reduction. ICML, 2001.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans. In NIPS, 2016.
Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv:1409.1556, 2014.
Fabian Stark, Caner Hazırbas, Rudolph Triebel, and Daniel Cremers. Captcha recognition with active
deep learning. In GCPR Workshop on New Challenges in Neural Computation, 2015.
Simon Tong and Daphne Koller. Support vector machine active learning with applications to text
classification. JMLR, 2(Nov):45–66, 2001.
Ivor W Tsang, James T Kwok, and Pak-Ming Cheung. Core vector machines: Fast svm training on
very large data sets. JMLR, 6(Apr):363–392, 2005.


Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for
deep image classification. Transactions on Circuits and Systems for Video Technology, 2016.
Zheng Wang and Jieping Ye. Querying discriminative and representative samples for batch mode
active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):17, 2015.
Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff A Bilmes. Using document summarization tech-
niques for speech data subset selection. In HLT-NAACL, 2013.
Kai Wei, Rishabh K Iyer, and Jeff A Bilmes. Submodularity in data subset selection and active
learning. In ICML, 2015.
Gert W Wolf. Facility location: concepts, models, algorithms and case studies., 2011.
Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G Hauptmann. Multi-class active
learning by uncertainty sampling with diversity maximization. International Journal of Computer
Vision, 113(2):113–127, 2015.
Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In
Proceedings of the 23rd international conference on Machine learning, pp. 1081–1088. ACM,
2006.

A PROOF FOR LEMMA 1



Proof. We will start by showing that the softmax function defined over C classes is (sqrt(C−1)/C)-Lipschitz
continuous. It is easy to show that for any differentiable function f : R^n → R^m,

    ||f(x) − f(y)||_2 ≤ ||J*||_F ||x − y||_2   ∀x, y ∈ R^n,

where ||J*||_F = max_x ||J||_F and J is the Jacobian matrix of f.

The softmax function is defined as

    f(x)_i = exp(x_i) / Σ_{j=1}^{C} exp(x_j),    i = 1, 2, ..., C.

For brevity, we will denote f_i(x) as f_i. The Jacobian matrix will be

    J = [ f_1(1 − f_1)   −f_1 f_2        ...   −f_1 f_C
          −f_2 f_1       f_2(1 − f_2)    ...   −f_2 f_C
          ...            ...             ...   ...
          −f_C f_1       −f_C f_2        ...   f_C(1 − f_C) ]
Now, the Frobenius norm of the above matrix will be

    ||J||_F = sqrt( Σ_{i=1}^{C} Σ_{j=1, j≠i}^{C} f_i² f_j²  +  Σ_{i=1}^{C} f_i² (1 − f_i)² ).

It is straightforward to show that f_i = 1/C is the optimal solution for ||J*||_F = max_x ||J||_F. Hence,
putting f_i = 1/C in the above equation, we get ||J*||_F = sqrt(C−1)/C.

Now, consider two inputs x and x̃, such that their representations at layer d are x^d and x̃^d.
Let's consider any convolutional or fully-connected layer as x_j^d = Σ_i w_{i,j} x_i^{d−1}. If we assume
Σ_i |w_{i,j}| ≤ α ∀i, j, d, then for any convolutional or fully connected layer we can state:

    ||x^d − x̃^d||_2 ≤ α ||x^{d−1} − x̃^{d−1}||_2


On the other hand, using |max(0, a) − max(0, b)| ≤ |a − b| and the fact that a max-pool layer can
be written as a convolutional layer such that only one weight is 1 and the others are 0, we can state for
ReLU and max-pool layers,

    ||x^d − x̃^d||_2 ≤ ||x^{d−1} − x̃^{d−1}||_2

Combining this with the Lipschitz constant of the soft-max layer,

    ||CNN(x; w) − CNN(x̃; w)||_2 ≤ (sqrt(C−1)/C) α^{n_c + n_fc} ||x − x̃||_2
Using the reverse triangle inequality,

    |l(x, y; w) − l(x̃, y; w)| = | ||CNN(x; w) − y||_2 − ||CNN(x̃; w) − y||_2 | ≤ ||CNN(x; w) − CNN(x̃; w)||_2,

we can conclude that the loss function is (sqrt(C−1)/C) α^{n_c + n_fc}-Lipschitz for any fixed y and w.

B PROOF FOR THEOREM 1


Before starting our proof, we state Claim 1 from Berlind & Urner (2015). Fix some p, p' ∈ [0, 1]
and y' ∈ {0, 1}. Then,

    P_{y∼p}(y ≠ y') ≤ P_{y∼p'}(y ≠ y') + |p − p'|.

Proof. We will start our proof by bounding E_{y_i∼η(x_i)}[l(x_i, y_i; A_s)]. We have a condition which
states that there exists an x_j in the δ ball around x_i such that x_j has 0 loss.

    E_{y_i∼η(x_i)}[l(x_i, y_i; A_s)] = Σ_{k∈[C]} P_{y_i∼η_k(x_i)}(y_i = k) l(x_i, k; A_s)

    (d)  ≤  Σ_{k∈[C]} P_{y_i∼η_k(x_j)}(y_i = k) l(x_i, k; A_s) + Σ_{k∈[C]} |η_k(x_i) − η_k(x_j)| l(x_i, k; A_s)

    (e)  ≤  Σ_{k∈[C]} P_{y_i∼η_k(x_j)}(y_i = k) l(x_i, k; A_s) + δ λ^η L C

With abuse of notation, we represent {y_i = k} ∼ η_k(x_i) with y_i ∼ η_k(x_i). We use Claim 1 in (d),
and the Lipschitz property of the regression function and the bound on the loss in (e). Then, we can
further bound the remaining term as:
    Σ_{k∈[C]} P_{y_i∼η_k(x_j)}(y_i = k) l(x_i, k; A_s)
        = Σ_{k∈[C]} P_{y_i∼η_k(x_j)}(y_i = k) [l(x_i, k; A_s) − l(x_j, k; A_s)] + Σ_{k∈[C]} P_{y_i∼η_k(x_j)}(y_i = k) l(x_j, k; A_s)
        ≤ δ λ^l,
where the last step comes from the fact that the trained classifier is assumed to have 0 loss over the
training points. If we combine them,

    E_{y_i∼η(x_i)}[l(x_i, y_i; A_s)] ≤ δ (λ^l + λ^η L C)
We further use Hoeffding's bound and conclude that with probability at least 1 − γ,

    | (1/n) Σ_{i∈[n]} l(x_i, y_i; A_s) − (1/|s|) Σ_{j∈s} l(x_j, y_j; A_s) |  ≤  δ (λ^l + λ^η L C) + sqrt( L² log(1/γ) / (2n) ).

