Abstract
Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form
of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features
by an estimate of the inverse diagonal Fisher information matrix. We also establish
a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised
algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts
the performance of dropout training, improving on state-of-the-art results on the
IMDB reviews dataset.
Introduction
Dropout training was introduced by Hinton et al. [1] as a way to control overfitting by randomly
omitting subsets of features at each iteration of a training procedure.1 Although dropout has proved
to be a very successful technique, the reasons for its success are not yet well understood at a theoretical level.
Dropout training falls into the broader category of learning methods that artificially corrupt training data to stabilize predictions [2, 4, 5, 6, 7]. There is a well-known connection between artificial
feature corruption and regularization [8, 9, 10]. For example, Bishop [9] showed that the effect of
training with features that have been corrupted with additive Gaussian noise is equivalent to a form
of L2 -type regularization in the low noise limit. In this paper, we take a step towards understanding how dropout training works by analyzing it as a regularizer. We focus on generalized linear
models (GLMs), a class of models for which feature dropout reduces to a form of adaptive model
regularization.
Using this framework, we show that dropout training is first-order equivalent to $L_2$-regularization after transforming the input by $\mathrm{diag}(\hat{\mathcal{I}})^{-1/2}$, where $\hat{\mathcal{I}}$ is an estimate of the Fisher information matrix.
This transformation effectively makes the level curves of the objective more spherical, and so balances out the regularization applied to different features. In the case of logistic regression, dropout
can be interpreted as a form of adaptive L2 -regularization that favors rare but useful features.
The problem of learning with rare but useful features is discussed in the context of online learning
by Duchi et al. [11], who show that their AdaGrad adaptive descent procedure achieves better regret
bounds than regular stochastic gradient descent (SGD) in this setting. Here, we show that AdaGrad
and dropout training have an intimate connection: just as SGD progresses by repeatedly solving
linearized $L_2$-regularized problems, a close relative of AdaGrad advances by solving linearized
dropout-regularized problems.

[Footnote: S.W. is supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship.]
[Footnote 1: Hinton et al. introduced dropout training in the context of neural networks specifically, and also advocated omitting random hidden units during training. In this paper, we follow [2, 3] and study feature dropout as a generic training method that can be applied to any learning algorithm.]
Our formulation of dropout training as adaptive regularization also leads to a simple semi-supervised
learning scheme, where we use unlabeled data to learn a better dropout regularizer. The approach
is fully discriminative and does not require fitting a generative model. We apply this idea to several
document classification problems, and find that it consistently improves the performance of dropout
training. On the benchmark IMDB reviews dataset introduced by [12], dropout logistic regression
with a regularizer tuned on unlabeled data outperforms the previous state of the art. In follow-up research [13], we extend the results from this paper to more complicated structured prediction tasks, such as multi-class logistic regression and linear chain conditional random fields.
We begin by discussing the general connections between feature noising and regularization in generalized linear models (GLMs). We will apply the machinery developed here to dropout training in
Section 4.
A GLM defines a conditional distribution over a response $y \in \mathcal{Y}$ given an input feature vector $x \in \mathbb{R}^d$:

$$p_\beta(y \mid x) \stackrel{\text{def}}{=} h(y) \exp\{y\, x \cdot \beta - A(x \cdot \beta)\}, \qquad \ell_{x,y}(\beta) \stackrel{\text{def}}{=} -\log p_\beta(y \mid x). \tag{1}$$
Here, $h(y)$ is a quantity independent of $x$ and $\beta$, $A(\cdot)$ is the log-partition function, and $\ell_{x,y}(\beta)$ is the loss function (i.e., the negative log-likelihood); Table 1 contains a summary of notation. Common examples of GLMs include linear ($\mathcal{Y} = \mathbb{R}$), logistic ($\mathcal{Y} = \{0, 1\}$), and Poisson ($\mathcal{Y} = \{0, 1, 2, \ldots\}$) regression.
Given $n$ training examples $(x_i, y_i)$, the standard maximum likelihood estimate $\hat\beta \in \mathbb{R}^d$ minimizes the empirical loss over the training examples:

$$\hat\beta \stackrel{\text{def}}{=} \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \ell_{x_i, y_i}(\beta). \tag{2}$$
With artificial feature noising, we replace the observed feature vectors $x_i$ with noisy versions $\tilde{x}_i = \nu(x_i, \xi_i)$, where $\nu$ is our noising function and $\xi_i$ is an independent random variable. Conceptually, we first create many noisy copies of the dataset, and then average out the auxiliary noise. In this paper, we consider two types of noise:

- Additive Gaussian noise: $\nu(x_i, \xi_i) = x_i + \xi_i$, where $\xi_i \sim \mathcal{N}(0, \sigma^2 I_{d \times d})$.
- Dropout noise: $\nu(x_i, \xi_i) = x_i \odot \xi_i$, where $\odot$ denotes the elementwise product and each coordinate of $\xi_i$ is independently $0$ with probability $\delta$ and $1/(1-\delta)$ with probability $1-\delta$.

Artificial feature noising replaces the empirical loss with the noised empirical loss

$$\hat\beta \stackrel{\text{def}}{=} \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \mathbb{E}_\xi\left[\ell_{\tilde{x}_i, y_i}(\beta)\right], \tag{3}$$

where the expectation is taken with respect to the artificial feature noise $\xi = (\xi_1, \ldots, \xi_n)$. Similar expressions have been studied by [9, 10].
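To make the two noising schemes concrete, here is a minimal NumPy sketch (ours, not from the paper); both schemes are mean-preserving, i.e., $\mathbb{E}_\xi[\tilde{x}] = x$:

```python
import numpy as np

def additive_gaussian_noise(x, sigma, rng):
    # nu(x, xi) = x + xi with xi ~ N(0, sigma^2 I); E[x_tilde] = x.
    return x + rng.normal(0.0, sigma, size=x.shape)

def dropout_noise(x, delta, rng):
    # Each coordinate is zeroed with probability delta and scaled by
    # 1/(1 - delta) otherwise, so that E[x_tilde] = x.
    mask = rng.random(x.shape) >= delta
    return x * mask / (1.0 - delta)
```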
For GLMs, the noised empirical loss takes on a simpler form:

$$\sum_{i=1}^n \mathbb{E}_\xi\left[\ell_{\tilde{x}_i, y_i}(\beta)\right] = \sum_{i=1}^n \ell_{x_i, y_i}(\beta) + R(\beta), \quad \text{where} \tag{4}$$

$$R(\beta) \stackrel{\text{def}}{=} \sum_{i=1}^n \left( \mathbb{E}_\xi\left[A(\tilde{x}_i \cdot \beta)\right] - A(x_i \cdot \beta) \right). \tag{5}$$

This identity holds because $\mathbb{E}_\xi[\tilde{x}_i] = x_i$, so the linear term $y\, x_i \cdot \beta$ of the GLM loss is unaffected by noising and only the log-partition term contributes.

[Footnote 2: Artificial noise of the form $x_i \odot \xi_i$ [...] corresponds to dropout noise as defined by [1].]
[Table 1: summary of notation, including $R(\beta)$, $R^q(\beta)$, and $\ell(\beta)$.]
Here, $R(\beta)$ acts as a regularizer that incorporates the effect of artificial feature noising. In GLMs, the log-partition function $A$ is always convex, and so $R$ is always non-negative by Jensen's inequality.
The key observation here is that the effect of artificial feature noising reduces to a penalty $R(\beta)$ that does not depend on the labels $\{y_i\}$. Because of this, artificial feature noising penalizes the complexity of a classifier in a way that does not depend on the accuracy of the classifier. Thus, for GLMs, artificial feature noising is a regularization scheme on the model itself that can be compared with other forms of regularization such as ridge ($L_2$) or lasso ($L_1$) penalization. In Section 6, we exploit the label-independence of the noising penalty and use unlabeled data to tune our estimate of $R(\beta)$.
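As a quick numerical sanity check (our sketch; the `noising_penalty` helper is ours, not from the paper), one can estimate $R(\beta)$ for dropout-noised logistic regression by Monte Carlo and confirm that it is non-negative, as Jensen's inequality guarantees:

```python
import numpy as np

def log_partition(z):
    # Logistic-regression log-partition function A(z) = log(1 + e^z).
    return np.logaddexp(0.0, z)

def noising_penalty(X, beta, delta, n_samples=5000, seed=0):
    # Monte Carlo estimate of R(beta) = sum_i E[A(x_tilde_i . beta)] - A(x_i . beta)
    # under dropout noise with dropout probability delta.
    rng = np.random.default_rng(seed)
    mask = rng.random((n_samples,) + X.shape) >= delta
    z_tilde = (X * mask / (1.0 - delta)) @ beta   # shape (n_samples, n)
    return float(np.sum(log_partition(z_tilde).mean(axis=0) - log_partition(X @ beta)))
```

Because $A$ is convex, each summand $\mathbb{E}_\xi[A(\tilde{x}_i \cdot \beta)] - A(x_i \cdot \beta)$ is non-negative on its own, and no labels enter the computation.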
The fact that $R$ does not depend on the labels has another useful consequence that relates to prediction. The natural prediction rule with artificially noised features is to select $\hat{y}$ to minimize the expected loss over the added noise: $\hat{y} = \arg\min_y \mathbb{E}_\xi[\ell_{\tilde{x}, y}(\beta)]$. It is common practice, however, not to noise the inputs and simply to output classification decisions based on the original feature vector [1, 3, 14]: $\hat{y} = \arg\min_y \ell_{x, y}(\beta)$. These expressions are in general not equivalent, but they are equivalent whenever the effect of feature noising reduces to a label-independent penalty on the likelihood. Thus, the common practice of predicting with clean features is formally justified for GLMs.
2.1 A Quadratic Approximation to the Noising Penalty
Although the noising penalty $R$ yields an explicit regularizer that does not depend on the labels $\{y_i\}$, the form of $R$ can be difficult to interpret. To gain more insight, we work with a quadratic approximation of the type used by [9, 10]. Taking a second-order Taylor expansion of $A$ around $x \cdot \beta$ gives

$$\mathbb{E}_\xi\left[A(\tilde{x} \cdot \beta)\right] \approx A(x \cdot \beta) + \tfrac{1}{2} A''(x \cdot \beta)\, \mathrm{Var}_\xi\left[\tilde{x} \cdot \beta\right].$$

Here the first-order term $\mathbb{E}_\xi[A'(x \cdot \beta)(\tilde{x} \cdot \beta - x \cdot \beta)]$ vanishes because $\mathbb{E}_\xi[\tilde{x}] = x$. Applying this quadratic approximation to (5) yields the following quadratic noising regularizer, which will play a pivotal role in the rest of the paper:

$$R^q(\beta) \stackrel{\text{def}}{=} \frac{1}{2} \sum_{i=1}^n A''(x_i \cdot \beta)\, \mathrm{Var}_\xi\left[\tilde{x}_i \cdot \beta\right]. \tag{6}$$
This regularizer penalizes two types of variance over the training examples: (i) $A''(x_i \cdot \beta)$, which corresponds to the variance of the response $y_i$ in the GLM, and (ii) $\mathrm{Var}_\xi[\tilde{x}_i \cdot \beta]$, the variance of the linear predictor $\tilde{x}_i \cdot \beta$ due to noising.³
Accuracy of approximation. Figure 1a compares the noising penalties $R$ and $R^q$ for logistic regression in the case that $\tilde{x} \cdot \beta$ is Gaussian;⁴ we vary the mean parameter $p \stackrel{\text{def}}{=} (1 + e^{-x \cdot \beta})^{-1}$ and the noise level $\sigma$. We see that $R^q$ is generally very accurate, although it tends to overestimate the true penalty for $p \approx 0.5$ and to underestimate it for very confident predictions. We give a graphical explanation for this phenomenon in the Appendix (Figure A.1).

The quadratic approximation also appears to hold up on real datasets. In Figure 1b, we compare the evolution during training of both $R$ and $R^q$ on the 20 newsgroups alt.atheism vs. [...] classification task.

[Footnote 3: Although $R^q$ is not convex, we were still able (using an L-BFGS algorithm) to train logistic regression with $R^q$ as a surrogate for the dropout regularizer without running into any major issues with local optima.]
[Footnote 4: This assumption holds a priori for additive Gaussian noise, and can be reasonable for dropout by the central limit theorem.]
[Figure 1: (a) Noising penalty and quadratic penalty as a function of the noise level $\sigma$, for $p \in \{0.5, 0.73, 0.82, 0.88, 0.95\}$. (b) Dropout penalty, quadratic penalty, and negative log-likelihood (loss) over training iterations.]
Having established the general quadratic noising regularizer $R^q$, we now turn to studying the effects of $R^q$ for various likelihoods (linear and logistic regression) and noising models (additive and dropout). In this section, we warm up with additive noise; in Section 4 we turn to our main target of interest, namely dropout noise.
Linear regression. Suppose $\tilde{x} = x + \varepsilon$ is generated by adding noise with $\mathrm{Var}[\varepsilon] = \sigma^2 I_{d \times d}$ to the original feature vector $x$. Then $\mathrm{Var}_\xi[\tilde{x} \cdot \beta] = \sigma^2 \|\beta\|_2^2$, and in the case of linear regression $A(z) = \frac{1}{2} z^2$, so $A''(z) = 1$. Applying these facts to (6) yields a simplified form for the quadratic noising penalty:

$$R^q(\beta) = \frac{1}{2} n \sigma^2 \|\beta\|_2^2. \tag{7}$$

Thus, we recover the well-known result that linear regression with additive feature noising is equivalent to ridge regression [2, 9]. Note that, with linear regression, the quadratic approximation $R^q$ is exact, and so the correspondence with $L_2$-regularization is also exact.
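Because the quadratic form is exact here, a Monte Carlo estimate of the noising penalty should match (7) up to sampling error. A small check (our sketch; function names are ours):

```python
import numpy as np

def quad_penalty_linear(X, beta, sigma):
    # Closed form (7): R^q(beta) = (1/2) n sigma^2 ||beta||_2^2.
    return 0.5 * X.shape[0] * sigma**2 * float(beta @ beta)

def mc_penalty_linear(X, beta, sigma, n_samples=200000, seed=0):
    # Monte Carlo estimate of sum_i E[A(x_tilde_i . beta)] - A(x_i . beta)
    # for A(z) = z^2 / 2 under additive Gaussian noise.
    rng = np.random.default_rng(seed)
    z = X @ beta
    eps = rng.normal(0.0, sigma, size=(n_samples,) + X.shape)
    z_tilde = (X[None, :, :] + eps) @ beta        # shape (n_samples, n)
    return float(np.sum(0.5 * (z_tilde**2).mean(axis=0) - 0.5 * z**2))
```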
Logistic regression. The situation gets more interesting when we move beyond linear regression. For logistic regression, $A''(x_i \cdot \beta) = p_i (1 - p_i)$, where $p_i = (1 + \exp(-x_i \cdot \beta))^{-1}$ is the predicted probability of $y_i = 1$. The quadratic noising penalty is then

$$R^q(\beta) = \frac{1}{2} \sigma^2 \|\beta\|_2^2 \sum_{i=1}^n p_i (1 - p_i). \tag{8}$$

In other words, the noising penalty now simultaneously encourages parsimonious modeling as before (by encouraging $\|\beta\|_2^2$ to be small) as well as confident predictions (by encouraging the $p_i$'s to move away from $\frac{1}{2}$).
Table 2: Noising penalties for linear regression, logistic regression, and GLMs, up to multiplicative constants.

|                     | $L_2$ penalization | Additive noising                                   | Dropout                                                   |
|---------------------|--------------------|----------------------------------------------------|-----------------------------------------------------------|
| Linear regression   | $\|\beta\|_2^2$    | $\sigma^2 \|\beta\|_2^2$                           | $\beta^\top \mathrm{diag}(X^\top X)\, \beta$              |
| Logistic regression | $\|\beta\|_2^2$    | $\sigma^2 \|\beta\|_2^2 \sum_i p_i (1 - p_i)$      | $\sum_{i,j} p_i (1 - p_i)\, x_{ij}^2 \beta_j^2$           |
| GLM                 | $\|\beta\|_2^2$    | $\sigma^2 \|\beta\|_2^2\, \mathrm{tr}(V(\beta))$   | $\beta^\top \mathrm{diag}(X^\top V(\beta) X)\, \beta$     |
Recall that dropout training corresponds to applying dropout noise to the training examples: the noised features $\tilde{x}_i$ are obtained by setting $\tilde{x}_{ij}$ to $0$ with some dropout probability $\delta$ and to $x_{ij}/(1-\delta)$ with probability $1-\delta$, independently for each coordinate $j$ of the feature vector. We can check that

$$\mathrm{Var}_\xi\left[\tilde{x}_i \cdot \beta\right] = \frac{\delta}{1-\delta} \sum_{j=1}^d x_{ij}^2 \beta_j^2, \tag{9}$$
and so the quadratic dropout penalty is

$$R^q(\beta) = \frac{\delta}{2(1-\delta)} \sum_{i=1}^n A''(x_i \cdot \beta) \sum_{j=1}^d x_{ij}^2 \beta_j^2. \tag{10}$$
Letting $X \in \mathbb{R}^{n \times d}$ be the design matrix with rows $x_i$ and $V(\beta) \in \mathbb{R}^{n \times n}$ be the diagonal matrix with entries $A''(x_i \cdot \beta)$, we can re-write this penalty as

$$R^q(\beta) = \frac{\delta}{2(1-\delta)}\, \beta^\top \mathrm{diag}(X^\top V(\beta) X)\, \beta. \tag{11}$$
Let $\hat\beta$ be the maximum likelihood estimate given infinite data. When computed at $\hat\beta$, the matrix

$$\frac{1}{n} X^\top V(\hat\beta)\, X = \frac{1}{n} \sum_{i=1}^n \nabla^2 \ell_{x_i, y_i}(\hat\beta)$$

is an estimate of the Fisher information matrix $\mathcal{I}$. Thus, dropout can be seen as an attempt to apply an $L_2$ penalty after normalizing the feature vector by $\mathrm{diag}(\mathcal{I})^{-1/2}$. The Fisher information is linked to the shape of the level surfaces of $\ell(\beta)$ around $\hat\beta$. If $\mathcal{I}$ were a multiple of the identity matrix, then these level surfaces would be perfectly spherical around $\hat\beta$. Dropout, by normalizing the problem by $\mathrm{diag}(\mathcal{I})^{-1/2}$, ensures that while the level surfaces of $\ell(\beta)$ may not be spherical, the $L_2$-penalty is applied in a basis where the features have been balanced out. We give a graphical illustration of this phenomenon in Figure A.2.
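The variance identity (9) underlying these penalties is easy to verify numerically. A minimal sketch (ours, not from the paper) comparing the closed form against an empirical variance over random dropout masks:

```python
import numpy as np

def dropout_variance(x, beta, delta):
    # Closed form (9): Var[x_tilde . beta] = delta / (1 - delta) * sum_j x_j^2 beta_j^2.
    return delta / (1.0 - delta) * float(np.sum(x**2 * beta**2))

def mc_dropout_variance(x, beta, delta, n_samples=400000, seed=0):
    # Empirical variance of x_tilde . beta over random dropout masks.
    rng = np.random.default_rng(seed)
    mask = rng.random((n_samples, x.size)) >= delta
    return float(((x * mask / (1.0 - delta)) @ beta).var())
```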
Linear regression. For linear regression, $V$ is the identity matrix, so the dropout objective is equivalent to a form of ridge regression in which each column of the design matrix is normalized before applying the $L_2$ penalty.⁵ This connection has been noted previously by [3].
Logistic regression. The form of dropout penalties becomes much more intriguing once we move beyond the realm of linear regression. The case of logistic regression is particularly interesting. Here, we can write the quadratic dropout penalty from (10) as

$$R^q(\beta) = \frac{\delta}{2(1-\delta)} \sum_{i=1}^n \sum_{j=1}^d p_i (1 - p_i)\, x_{ij}^2 \beta_j^2. \tag{12}$$

Thus, just like additive noising, dropout generally gives an advantage to confident predictions and small $\beta$. However, unlike all the other methods considered so far, dropout may allow some large $p_i(1-p_i)$ and some large $\beta_j^2$, provided that the corresponding cross-terms $x_{ij}^2$ are small.
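The quadratic dropout penalty (12) is cheap to compute in vectorized form. A minimal sketch (ours, not from the paper); note that coordinates with $x_{ij} = 0$ contribute nothing to the penalty on $\beta_j$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_penalty_logistic(X, beta, delta):
    # Eq. (12): R^q(beta) = delta / (2 (1 - delta)) * sum_{i,j} p_i (1 - p_i) x_ij^2 beta_j^2.
    p = sigmoid(X @ beta)
    return delta / (2.0 * (1.0 - delta)) * float((p * (1.0 - p)) @ (X**2 @ beta**2))
```

Rare features have $x_{ij} = 0$ in most rows, so their weights are penalized only on the few examples where they are active.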
Our analysis shows that dropout regularization should be better than $L_2$-regularization for learning weights for features that are rare (i.e., often 0) but highly discriminative, because dropout effectively does not penalize $\beta_j$ over observations for which $x_{ij} = 0$. Thus, in order for a feature to earn a large $\beta_j^2$, it suffices for it to contribute to a confident prediction with small $p_i(1-p_i)$ each time that it is active.⁶ Dropout training has been empirically found to perform well on tasks such as document classification, where rare but discriminative features are prevalent [3]. Our result suggests that this is no mere coincidence.

[Footnote 5: Normalizing the columns of the design matrix before performing penalized regression is standard practice, and is implemented by default in software like glmnet for R [16].]
[Footnote 6: To be precise, dropout does not reward all rare but discriminative features. Rather, dropout rewards those features that are rare and positively co-adapted with other features in a way that enables the model to make confident predictions whenever the feature of interest is active.]

Table 3: Accuracy of $L_2$- and dropout-regularized logistic regression on a simulated example. The first row reports results over test examples where some of the rare useful features are active (i.e., where there is some signal that can be exploited), while the second row reports accuracy over the full test set. Results are averaged over 100 simulation runs, with 75 training examples in each. All tuning parameters were set to optimal values. The sampling error on all reported values is within 0.01.

| Accuracy         | $L_2$-regularization | Dropout training |
|------------------|----------------------|------------------|
| Active instances | 0.66                 | 0.73             |
| All instances    | 0.53                 | 0.55             |
We summarize the relationship between $L_2$-penalization, additive noising, and dropout in Table 2. Additive noising introduces a product-form penalty depending on both $\beta$ and $A''$. However, the full potential of artificial feature noising only emerges with dropout, which allows the penalty terms due to $\beta$ and $A''$ to interact in a non-trivial way through the design matrix $X$ (except for linear regression, in which all the noising schemes we consider collapse to ridge regression).
4.1 A Simulation Example
The above discussion suggests that dropout logistic regression should perform well with rare but
useful features. To test this intuition empirically, we designed a simulation study where all the
signal is grouped in 50 rare features, each of which is active only 4% of the time. We then added
1000 nuisance features that are always active to the design matrix, for a total of d = 1050 features.
To make sure that our experiment was picking up the effect of dropout training specifically and not
just normalization of X, we ensured that the columns of X were normalized in expectation.
The dropout penalty for logistic regression can be written as a matrix product:

$$R^q(\beta) = \frac{\delta}{2(1-\delta)} \begin{pmatrix} \cdots & p_i(1-p_i) & \cdots \end{pmatrix} \begin{pmatrix} & \vdots & \\ \cdots & x_{ij}^2 & \cdots \\ & \vdots & \end{pmatrix} \begin{pmatrix} \vdots \\ \beta_j^2 \\ \vdots \end{pmatrix}. \tag{13}$$

We designed the simulation study in such a way that, at the optimal $\beta$, the dropout penalty should have the structure

$$\begin{pmatrix} \text{Small (confident prediction)} & \text{Big (weak prediction)} \end{pmatrix} \begin{pmatrix} x_{ij}^2 \end{pmatrix} \begin{pmatrix} \text{Big (useful feature)} \\ \text{Small (nuisance feature)} \end{pmatrix}. \tag{14}$$
A dropout penalty with such a structure should be small: although there are some uncertain predictions with large $p_i(1-p_i)$ and some big weights $\beta_j^2$, these terms cannot interact, because the corresponding entries $x_{ij}^2$ are all 0 (these are examples without any of the rare discriminative features, and thus have no signal). Meanwhile, $L_2$-penalization has no natural way of penalizing some $\beta_j$ more and others less. Our simulation results, given in Table 3, confirm that dropout training outperforms $L_2$-regularization here, as expected. See Appendix A.1 for details.
There is a well-known connection between $L_2$-regularization and stochastic gradient descent (SGD). In SGD, the weight vector is updated with $\beta_{t+1} = \beta_t - \eta_t g_t$, where $g_t = \nabla \ell_{x_t, y_t}(\beta_t)$ is the gradient of the loss due to the $t$-th training example. We can also write this update as the solution to a linearized $L_2$-penalized problem:

$$\beta_{t+1} = \arg\min_\beta \left\{ \ell_{x_t, y_t}(\beta_t) + g_t \cdot (\beta - \beta_t) + \frac{1}{2\eta_t} \|\beta - \beta_t\|_2^2 \right\}, \tag{15}$$

where the first two terms form a linear approximation to the loss and the third term is an $L_2$-regularizer. Thus, SGD progresses by repeatedly solving linearized $L_2$-regularized problems.
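The equivalence in (15) follows by setting the gradient of the right-hand side, $g_t + (\beta - \beta_t)/\eta_t$, to zero. A small check (our sketch, with hypothetical helper names) that the SGD step indeed minimizes the linearized objective:

```python
import numpy as np

def sgd_step(beta_t, g_t, eta):
    # Minimizer of g_t . (beta - beta_t) + ||beta - beta_t||^2 / (2 eta):
    # setting the gradient g_t + (beta - beta_t) / eta to zero gives the SGD update.
    return beta_t - eta * g_t

def linearized_objective(beta, beta_t, g_t, eta):
    # Right-hand side of (15), dropping the constant loss term.
    d = beta - beta_t
    return float(g_t @ d + d @ d / (2.0 * eta))
```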
Figure 2: Test set accuracy on the IMDB dataset [12] with unigram features. Left: 10,000 labeled training examples and up to 40,000 unlabeled examples. Right: 3,000-15,000 labeled training examples and 25,000 unlabeled examples. The unlabeled data is discounted by a factor $\alpha = 0.4$.
As discussed by Duchi et al. [11], a problem with classic SGD is that it can be slow at learning weights corresponding to rare but highly discriminative features. This problem can be alleviated by running a modified form of SGD with $\beta_{t+1} = \beta_t - \eta\, A_t^{-1} g_t$, where the transformation $A_t$ is also learned online; this leads to the AdaGrad family of stochastic descent rules. Duchi et al. use $A_t = \mathrm{diag}(G_t)^{1/2}$, where $G_t = \sum_{i=1}^t g_i g_i^\top$, and show that this choice achieves desirable regret bounds in the presence of rare but useful features. At least superficially, AdaGrad and dropout seem to have similar goals: for logistic regression, they can both be understood as adaptive alternatives to methods based on $L_2$-regularization that favor learning rare, useful features. As it turns out, they have a deeper connection.
The natural way to incorporate dropout regularization into SGD is to replace the penalty term $\|\beta - \beta_t\|_2^2 / 2\eta_t$ in (15) with the dropout regularizer, giving us the update rule

$$\beta_{t+1} = \arg\min_\beta \left\{ \ell_{x_t, y_t}(\beta_t) + g_t \cdot (\beta - \beta_t) + R^q(\beta - \beta_t;\, \beta_t) \right\}, \tag{16}$$

where $R^q(\cdot\,; \beta_t)$ is the quadratic noising regularizer centered at $\beta_t$:⁷

$$R^q(\beta - \beta_t;\, \beta_t) = \frac{1}{2} (\beta - \beta_t)^\top \mathrm{diag}(H_t)\, (\beta - \beta_t), \quad \text{where } H_t = \sum_{i=1}^t \nabla^2 \ell_{x_i, y_i}(\beta_t). \tag{17}$$
This implies that dropout descent is first-order equivalent to an adaptive SGD procedure with $A_t = \mathrm{diag}(H_t)$. To see the connection between AdaGrad and this dropout-based online procedure, recall that for GLMs both of the expressions

$$\mathbb{E}\left[\nabla \ell_{x,y}(\hat\beta)\, \nabla \ell_{x,y}(\hat\beta)^\top\right] \quad \text{and} \quad \mathbb{E}\left[\nabla^2 \ell_{x,y}(\hat\beta)\right] \tag{18}$$

are equal to the Fisher information $\mathcal{I}$ [17]. In other words, as $\beta_t$ converges to $\hat\beta$, $G_t$ and $H_t$ are both consistent estimates of the Fisher information. Thus, by using dropout instead of $L_2$-regularization to solve linearized problems in online learning, we end up with an AdaGrad-like algorithm.
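The two online procedures differ only in the conditioner $A_t$. A minimal sketch for logistic regression (ours, not from the paper; the single pass over the data, the unit step size, and the small `eps` stabilizer are simplifying assumptions):

```python
import numpy as np

def logistic_grad_and_hess_diag(x, y, beta):
    # Per-example gradient and diagonal Hessian of the logistic loss (y in {0, 1}).
    p = 1.0 / (1.0 + np.exp(-(x @ beta)))
    return (p - y) * x, p * (1.0 - p) * x**2

def adaptive_sgd(X, y, rule="dropout", eps=1.0):
    # rule="dropout": A_t = diag(H_t), the dropout-style conditioner.
    # rule="adagrad": A_t = diag(G_t)^(1/2), the AdaGrad conditioner.
    beta = np.zeros(X.shape[1])
    acc = np.full(X.shape[1], eps)   # running sum of curvature or squared gradients
    for x, yi in zip(X, y):
        g, h = logistic_grad_and_hess_diag(x, yi, beta)
        acc += h if rule == "dropout" else g**2
        beta -= g / (acc if rule == "dropout" else np.sqrt(acc))
    return beta
```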
Of course, the connection between AdaGrad and dropout is not perfect. In particular, AdaGrad allows for a more aggressive learning rate by using $A_t = \mathrm{diag}(G_t)^{1/2}$ instead of $\mathrm{diag}(G_t)$. But, at a high level, AdaGrad and dropout appear to both be aiming for the same goal: scaling the features by the Fisher information to make the level curves of the objective more circular. In contrast, $L_2$-regularization makes no attempt to sphere the level curves, and AROW [18], another popular adaptive method for online learning, only attempts to normalize the effective feature matrix but does not consider the sensitivity of the loss to changes in the model weights. In the case of logistic regression, AROW also favors learning rare features but, unlike dropout and AdaGrad, does not privilege confident predictions.
[Footnote 7: [...] $\beta_t$ to compute $H_t$.]
[Results table (excerpt): Drop-Uni 87.78 / 89.52; Drop-Bi 91.31 / 91.98 — IMDB test accuracy (%) for dropout with unigram and bigram features; the second figure in each pair uses unlabeled data.]
Recall that the regularizer $R(\beta)$ in (5) is independent of the labels $\{y_i\}$. As a result, we can use additional unlabeled training examples to estimate it more accurately. Suppose we have an unlabeled dataset $\{z_i\}$ of size $m$, and let $\alpha \in (0, 1]$ be a discount factor for the unlabeled data. Then we can define a semi-supervised penalty estimate

$$R_*(\beta) \stackrel{\text{def}}{=} \frac{n}{n + \alpha m} \left( R(\beta) + \alpha\, R_{\text{Unlabeled}}(\beta) \right), \tag{19}$$

where $R(\beta)$ is the original penalty estimate and $R_{\text{Unlabeled}}(\beta) = \sum_i \left( \mathbb{E}_\xi[A(\tilde{z}_i \cdot \beta)] - A(z_i \cdot \beta) \right)$ is computed using (5) over the unlabeled examples $z_i$. We select the discount parameter $\alpha$ by cross-validation; empirically, $\alpha \in [0.1, 0.4]$ works well. For convenience, we optimize the quadratic surrogate $R^q_*$ instead of $R_*$. Another practical option would be to use the Gaussian approximation from [3] for estimating $R_*(\beta)$.
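In code, the semi-supervised penalty simply reuses the label-free penalty on the unlabeled rows. A minimal sketch (ours, not from the paper) for the logistic case, using the quadratic surrogate throughout:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quad_dropout_penalty(X, beta, delta):
    # Label-free quadratic dropout penalty, eq. (12): only the features enter.
    p = sigmoid(X @ beta)
    return delta / (2.0 * (1.0 - delta)) * float((p * (1.0 - p)) @ (X**2 @ beta**2))

def semi_supervised_penalty(X_labeled, Z_unlabeled, beta, delta, alpha):
    # Eq. (19): R_*(beta) = n / (n + alpha m) * (R(beta) + alpha * R_unlabeled(beta)),
    # with both terms estimated by the quadratic surrogate.
    n, m = X_labeled.shape[0], Z_unlabeled.shape[0]
    return n / (n + alpha * m) * (
        quad_dropout_penalty(X_labeled, beta, delta)
        + alpha * quad_dropout_penalty(Z_unlabeled, beta, delta))
```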
Most approaches to semi-supervised learning either rely on a generative model [19, 20, 21, 22, 23] or on various assumptions about the relationship between the predictor and the marginal distribution over inputs. Our semi-supervised approach is based on a different intuition: we'd like to set weights to make confident predictions on unlabeled data as well as on labeled data, an intuition shared by entropy regularization [24] and transductive SVMs [25].
Conclusion
References
[1] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[2] Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q. Weinberger. Learning with marginalized corrupted features. In Proceedings of the International Conference on Machine Learning, 2013.
[3] Sida I. Wang and Christopher D. Manning. Fast dropout training. In Proceedings of the International Conference on Machine Learning, 2013.
[4] Yaser S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6(2):192–198, 1990.
[5] Chris J. C. Burges and Bernhard Schölkopf. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, pages 375–381, 1997.
[6] Patrice Y. Simard, Yann A. Le Cun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181–197, 2000.
[7] Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. Advances in Neural Information Processing Systems, 24:2294–2302, 2011.
[8] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man and Cybernetics, 22(3):436–440, 1992.
[9] Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[10] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[12] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150, 2011.
[13] Sida I. Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D. Manning. Feature noising for log-linear structured prediction. In Empirical Methods in Natural Language Processing, 2013.
[14] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the International Conference on Machine Learning, 2013.
[15] Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 90–94, 2012.
[16] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[17] Erich Leo Lehmann and George Casella. Theory of Point Estimation. Springer, 1998.
[18] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. Advances in Neural Information Processing Systems, 22:414–422, 2009.
[19] Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. Large scale text classification using semi-supervised multinomial naive Bayes. In Proceedings of the International Conference on Machine Learning, 2011.
[20] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, May 2000.
[21] G. Bouchard and B. Triggs. The trade-off between generative and discriminative classifiers. In International Conference on Computational Statistics, pages 721–728, 2004.
[22] R. Raina, Y. Shen, A. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems. MIT Press, 2004.
[23] J. Suzuki, A. Fujino, and H. Isozaki. Semi-supervised structured output learning based on a hybrid generative and discriminative approach. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
[24] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning. Springer, 2005.
[25] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning, pages 200–209, 1999.