optimization
Mohit Kothari
Computer Science and Engineering
University of California, San Diego
mkothari@ucsd.edu
Sonali Rahagude
Computer Science and Engineering
University of California, San Diego
srahagud@ucsd.edu
Abstract
Logistic regression is a binary classification model used to predict the outcome of a
dichotomous dependent variable from independent variables that are continuous or discrete.
In this project, we trained a logistic regression model that predicts the gender of an
individual. The Gender Recognition [DCT] dataset (footnote 1) was used for training and
testing the model; it contains a total of 798 examples, each consisting of 800 features. To
train the logistic regression model efficiently, we used the Stochastic Gradient Descent (SGD)
algorithm with L2 regularization. We also used the L-BFGS algorithm, which belongs to the
family of quasi-Newton optimization techniques. The report discusses methods for finding a
near-optimal hyperparameter configuration for training the model and reports the accuracy of
this model on the test dataset. The model trained with SGD achieves an error rate of 0.092,
and the model trained with L-BFGS also achieves an error rate of 0.092. Additionally, the
report analyzes how SGD performs for different hyperparameter configurations and attempts to
explain the observed behavior.
1 Introduction
The problem of classification deals with identifying to which of a set of categories a new
observation belongs, on the basis of a training set of observations whose categories are
known. One of the most basic classification problems is one that contains only 2 categories.
Such problems are known as binary classification problems, and a logistic regression model can
be used to solve them. This report implements Stochastic Gradient Descent (SGD) to train a
logistic regression model on the gender recognition dataset and discusses the parameter
optimization methods used to achieve high classification accuracy. For comparison, it also
shows results when L-BFGS is used to train the model and how the accuracy compares with that
of the implemented SGD.
The report is organized as follows. This chapter covers the background of logistic regression
with L2 regularization, how Stochastic Gradient Descent is used to learn the model, and the
minimal basics of L-BFGS. Chapter 2 describes the methodology and design of the experiments
that were conducted on the dataset. Chapter 3 analyzes the results obtained and tries to
explain the behavior observed. Finally, Chapter 4 concludes with the final results and our
learnings from the project.
1. http://mlcomp.org/datasets/1571
1.1 Logistic Regression with L2 Regularization
[Elkan(2014)] Suppose $Y$ is the outcome (the gender label in our case), $X$ is the feature
vector of dimension $d$, and there are $n$ examples. According to logistic regression, the
probability of the label $y = 1$ given a feature vector $x$ and weight vector
$\beta = [\alpha, \beta_1, \beta_2, \ldots, \beta_d]$ is given by

$$p(y = 1 \mid x; \beta) = \sigma\Big(\alpha + \sum_{j=1}^{d} \beta_j x_j\Big) = \frac{1}{1 + \exp\big(-\big[\alpha + \sum_{j=1}^{d} \beta_j x_j\big]\big)} \tag{1}$$
In order to train the model, we maximize the joint log conditional likelihood on the training
set of $n$ examples $\{\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_n, y_n\rangle\}$
and obtain an optimal $\beta$. By assumption, the observed examples are independent of each
other, so the joint log conditional likelihood is the sum of the logs over all the examples.
The Log Conditional Likelihood (LCL) is therefore given by

$$LCL = \sum_{i=1}^{n} \log p(y_i \mid x_i; \beta) \tag{2}$$
To prevent overfitting, we use L2 regularization. The regularized log conditional likelihood
(RLCL) is given by

$$RLCL = \sum_{i=1}^{n} \log p(y_i \mid x_i; \beta) - \mu \lVert \beta \rVert_2^2 \tag{3}$$

where $\mu$ is the regularization strength.
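As a minimal sketch of Equations (1) to (3), and not the MATLAB code used for this report, the RLCL can be computed as follows in plain Python. The function names are our own for illustration, `beta[0]` is assumed to hold the intercept $\alpha$, and the penalty is assumed to include the intercept.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(beta, x):
    """p(y = 1 | x; beta) from Equation (1); beta[0] is the intercept alpha."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return sigmoid(z)

def rlcl(beta, X, y, mu):
    """Regularized log conditional likelihood, Equation (3), for labels in {0, 1}."""
    lcl = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(beta, xi)
        # log p(y_i | x_i; beta): p for a positive label, 1 - p for a negative one
        lcl += math.log(p) if yi == 1 else math.log(1.0 - p)
    penalty = mu * sum(b * b for b in beta)  # assumption: intercept is penalized too
    return lcl - penalty
```

With all weights at zero every example has probability 0.5, so the LCL of $n$ examples is $n \log 0.5$ and the penalty vanishes.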
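We train the model by maximizing the regularized log conditional likelihood. To be able to
learn near-optimal parameters of the regression model, we need to show that the above
objective function is concave, i.e. that any maximum it has is the unique global maximum. If
the second derivative is always negative, the function is concave. Writing
$p_i = p(y = 1 \mid x_i; \beta)$ and taking $\beta_0 = \alpha$ with $x_{i0} = 1$, the first and
second partial derivatives with respect to $\beta_j$ are

$$\frac{\partial}{\partial \beta_j}\Big[\sum_i \log p(y_i \mid x_i; \beta) - \mu \sum_{k=0}^{d} \beta_k^2\Big] = \frac{\partial}{\partial \beta_j}\Big[\sum_i \log p(y_i \mid x_i; \beta)\Big] - 2\mu\beta_j = \sum_i (y_i - p_i) x_{ij} - 2\mu\beta_j \tag{4}$$

$$\frac{\partial^2}{\partial \beta_j^2}\Big[\sum_i \log p(y_i \mid x_i; \beta) - \mu \sum_{k=0}^{d} \beta_k^2\Big] = -2\mu - \sum_i p_i (1 - p_i) x_{ij}^2 \tag{5}$$

Given $p_i \in (0, 1)$ and $\mu > 0$, the above expression is always negative. The same argument
applies to the full Hessian, $-\sum_i p_i (1 - p_i) x_i x_i^T - 2\mu I$, which is negative
definite. Hence the objective function RLCL is concave and any maximum is also the global
maximum.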
Since we are now guaranteed that there is only one global maximum, we can use Stochastic
Gradient Ascent (SGA) to maximize the RLCL and learn the parameters of the model. SGA is a
computationally cheaper variant of the Gradient Ascent (GA) method for maximizing the
conditional log likelihood. According to GA, with learning rate $\lambda$, the update rule
follows the gradient in Equation (4):

$$\beta_j := \beta_j + \lambda \Big[\sum_{i=1}^{n} (y_i - p_i) x_{ij} - 2\mu\beta_j\Big] \tag{6}$$
In SGA, instead of going through all the examples for every update, we select a random example
$i$ with uniform probability from the set $\{1, 2, \ldots, n\}$ and update the weight vector
using that example alone. Thus, the SGA update rule for $\beta_j$ is given by

$$\beta_j := \beta_j + \lambda \big[(y_i - p_i) x_{ij} - 2\mu\beta_j\big] \tag{7}$$
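The single-example update of Equation (7) can be sketched in a few lines of Python. This is an illustration under our own naming, not the report's MATLAB code; each feature vector is assumed to carry a leading 1.0 so that `beta[0]` acts as the intercept.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sga_step(beta, x, y, lam, mu):
    """Return the weights after one update of Equation (7) on the example (x, y).

    lam is the learning rate and mu the regularization strength.
    """
    p = sigmoid(sum(b * xj for b, xj in zip(beta, x)))
    return [b + lam * ((y - p) * xj - 2.0 * mu * b) for b, xj in zip(beta, x)]

# One step from zero weights on a positive example, without regularization.
beta = sga_step([0.0, 0.0], x=[1.0, 2.0], y=1, lam=0.1, mu=0.0)
```

Starting from zero weights the predicted probability is 0.5, so each weight moves by $\lambda \cdot 0.5 \cdot x_j$ in the direction that raises the probability of the observed label.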
Maximizing the RLCL value is equivalent to minimizing the cost function for logistic
regression. Since we use L-BFGS (see Section 1.2) to solve a minimization problem, we would
like to compare it with an equivalent gradient method. Hence, we use the term Stochastic
Gradient Descent (SGD) instead of SGA for the rest of the report.
1.2 L-BFGS with L2 Regularization
The L-BFGS implementation that we use solves a minimization problem. Hence, we need to define
the cost function for logistic regression with L2 regularization. The cost function is stated
below [Ng(2010)]:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \big[y^{(i)} \log(p_i) + (1 - y^{(i)}) \log(1 - p_i)\big] + \frac{\mu}{2n} \sum_{j=1}^{d} \beta_j^2 \tag{8}$$

To run this algorithm, we used the third-party library minFunc (footnote 2), which implements
it efficiently.
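For illustration, the cost of Equation (8) can be written directly in Python. This is a sketch under our own naming, not the function handed to minFunc in the experiments; `beta[0]` is assumed to be the intercept and is left out of the penalty because the regularization sum in (8) starts at $j = 1$.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(beta, X, y, mu):
    """Equation (8): mean negative log-likelihood plus (mu / 2n) * sum_{j>=1} beta_j^2."""
    n = len(X)
    nll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(beta[0] + sum(b * xj for b, xj in zip(beta[1:], xi)))
        nll -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    penalty = mu / (2.0 * n) * sum(b * b for b in beta[1:])
    return nll / n + penalty
```

With zero weights the cost equals $\log 2$ regardless of the data, which is a convenient sanity check before passing a cost function to an optimizer.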
The next section discusses the approach and algorithms used to implement SGD and L-BFGS.
2 Methodology
2.1 Implementation Environment
We use Python for data preprocessing which is described in the following section. The code for
training, validating and testing the logistic regression model is entirely implemented in MATLAB.
2.2 Data Preprocessing
We observe that the feature values of the raw data in the Gender Recognition [DCT] dataset vary
widely. This is depicted in Figure 1, which plots the mean and variance of individual features
across all examples in the raw dataset. An interesting observation is that there is a clear
demarcation in the features: the first 320 features have both mean and variance close to 0,
while the remaining features have varied values for mean and variance. As pointed out in the
notes, if one of the features has a broad range of values, the objective function of the model
will be governed by this particular feature. Therefore, we normalize all features to mean 0 and
variance 1 so that each feature contributes approximately proportionately to the final
objective function. We use the MATLAB built-in function zscore for normalization.
The labels specified in the given dataset are (-1, 1). Since our definition of logistic
regression requires the labels to be either 0 or 1, we modified the label values, replacing -1
with 0.
We use holdout validation for assessing the trained model. To obtain disjoint training and
validation sets, we split the given train dataset into two sets, training and validation, with
a ratio of 7:3. While splitting the dataset, we randomized the order of observations, assuming
the given order might be artificially tailored. The effect of order randomization on training
is further discussed in chapter 3.
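The three preprocessing steps above (standardization, label remapping and a shuffled 7:3 holdout split) can be sketched in plain Python as follows. The function names, the seed, and the handling of constant columns are assumptions of this sketch and are not taken from our preprocessing scripts; the divisor $n - 1$ follows the default of MATLAB's zscore.

```python
import math
import random

def zscore_columns(X):
    """Standardize each column to mean 0 and unit sample variance (divisor n - 1).

    Constant columns are mapped to 0.0 (an assumption of this sketch).
    """
    n = len(X)
    out_cols = []
    for col in zip(*X):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out_cols.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]

def remap_labels(labels):
    """Map labels from {-1, 1} to {0, 1}."""
    return [0 if lab == -1 else 1 for lab in labels]

def holdout_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples once, then split into disjoint training and validation sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(idx)))
    train, val = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in val], [y[i] for i in val])
```

Fixing the seed makes the split repeatable, which matters later when models trained with different hyperparameters are compared on the same validation set.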
2.3 Implementation of SGD with L2 Regularization
The training examples are shuffled into a random order only once, before the start of the
method. We train the model for a number of epochs, where every epoch passes over all the
training examples. We check convergence at the end of every epoch, and if the change in the
RLCL over 2 consecutive epochs is less than a specified threshold (described in Section 2.4),
we stop.
We adapt the learning rate while updating the parameters using Stochastic Gradient Descent. We
reduce the value of $\lambda$ using exponential damping at the start of every epoch. We also
tested inverse damping, but the accuracy on the validation set was worse than with exponential
damping. We choose a constant of $c = 1.2$ for damping the learning rate. Under exponential
damping, the learning rate $\lambda_e$ at the start of the $e$-th epoch is given by

$$\lambda_e = \frac{\lambda_0}{c^{e}} \tag{9}$$
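To give a feel for how quickly Equation (9) shrinks the step size, the short Python sketch below evaluates it for a few epochs. The helper name is ours; the values $\lambda_0 = 0.005$, $c = 1.2$ and the epoch count 77 are the ones reported elsewhere in this report.

```python
def damped_rate(lam0, c, epoch):
    """Learning rate at the start of the given epoch, Equation (9)."""
    return lam0 / (c ** epoch)

# Learning rate at epochs 0, 1, 10 and 77 for lam0 = 0.005 and c = 1.2.
rates = [damped_rate(0.005, 1.2, e) for e in (0, 1, 10, 77)]
```

Since $1.2^{77}$ is on the order of $10^{6}$, the rate at epoch 77 is on the order of $10^{-9}$, so late epochs change the weights, and therefore the RLCL, very little.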
Algorithm 1 describes the steps involved.
2. http://www.di.ens.fr/~mschmidt/Software/minFunc.html
Figure 1: Mean and Variance of the dataset
Algorithm 1 SGD
Input: Matrix X, Vector Y
  Initialize $\beta$ to zeros
  set maxEpochCount = 2000
  set RLCL = $-\infty$
  set threshold = $10^{-6}$
  set regStrength $\mu = 10^{-3}$
  set $\lambda_0 = 0.05$
  set exponentialDampConst = 1.2
  set $\lambda_e = \lambda_0$
  for epochCount = 1, 2, 3, ..., maxEpochCount do
    for $X_i$ in randomizedRowOrder(X) do
      prob ← $\sigma(\beta \cdot X_i)$
      gradientValue ← $(Y_i - \text{prob})\, X_i$
      $\beta_{New}$ ← $\beta + \lambda_e\,[\text{gradientValue} - 2\mu\beta]$
      set $\beta = \beta_{New}$
    end for
    $RLCL_{New}$ ← calculateRLCL($\beta$, X, Y)
    if abs($RLCL_{New}$ - RLCL) < threshold then
      break
    end if
    RLCL ← $RLCL_{New}$
    $\lambda_e$ ← $\lambda_0$ / exponentialDampConst^epochCount
  end for
Output: $\beta$
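Algorithm 1 can be sketched in plain Python as below. This is an illustrative re-implementation, not the MATLAB code used for the experiments: the function names, the seed, and the small floor inside the logarithm are our own assumptions, and each row of X is assumed to start with 1.0 so that `beta[0]` is the intercept.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rlcl(beta, X, y, mu):
    """Regularized log conditional likelihood for labels in {0, 1}."""
    floor = 1e-300  # assumption: avoid log(0) instead of returning -Inf
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(beta, xi))
        total += math.log(max(p if yi == 1 else 1.0 - p, floor))
    return total - mu * dot(beta, beta)

def sgd(X, y, lam0=0.05, mu=1e-3, damp=1.2, threshold=1e-6, max_epochs=2000, seed=0):
    """Train with the per-example update and exponential damping of the learning rate."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)  # examples are shuffled once, before training
    beta = [0.0] * len(X[0])
    prev = -math.inf
    lam = lam0
    for epoch in range(1, max_epochs + 1):
        for i in order:
            p = sigmoid(dot(beta, X[i]))
            beta = [b + lam * ((y[i] - p) * xj - 2.0 * mu * b)
                    for b, xj in zip(beta, X[i])]
        cur = rlcl(beta, X, y, mu)
        if abs(cur - prev) < threshold:  # converged: RLCL barely changed
            break
        prev = cur
        lam = lam0 / damp ** epoch  # Equation (9)
    return beta, cur, epoch
```

A call such as `beta, final_rlcl, epochs = sgd(X_train, y_train)` returns the learnt weights together with the final RLCL and the number of epochs run, where `X_train` and `y_train` are placeholder names for the preprocessed training data.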
2.4 Stop condition
To determine convergence, we calculate the change in the RLCL value from the previous one at
the end of every epoch. If the change is within a threshold, we conclude that the value has
converged for the current parameter estimates. To estimate the value of the threshold, we
plotted graphs of the RLCL value against different values of the learning rate $\lambda$ and
of the regularization strength $\mu$. These are shown in Figures 2 and 3.
Looking at the plots, we can safely set the threshold to $10^{-6}$, given that we choose the
step size from the range in the figure.
Figure 2: RLCL vs $\lambda$ for Training Set
Figure 3: RLCL vs $\mu$ for Training Set
2.5 Estimation of Hyperparameters: Grid Search
[Chih-Wei Hsu and Lin(2010), Bergstra and Bengio(2012)] We estimate the hyperparameters only
for the SGD approach. Since we use a third-party library for L-BFGS, we treat it as a black box
and observe what effect different values of $\mu$ have on the algorithm and how the convergence
varies. The results are covered in chapter 3.
In SGD, we need to determine the learning rate $\lambda$ and the regularization strength
$\mu$. The goal is to identify a pair $(\mu, \lambda)$ so that the classifier can accurately
predict unseen data. For this, we implemented grid search, where we train logistic regression
models for different combinations of $(\mu, \lambda)$ values. We then test each of these models
on the validation set and choose the pair which gives the maximum conditional likelihood (and
not the regularized conditional likelihood) on the validation set. Algorithm 2 describes the
steps involved.
To select the range of values for $\lambda$ and $\mu$, we use trial and error: we fix
$\lambda$ and select a range for $\mu$ which gives optimal models for this fixed value.
Similarly, we select the range of $\lambda$ values by fixing a value of $\mu$. We then perform
a grid search on these ranges, using sequential increments for the range of $\lambda$ and
exponential increments for the range of $\mu$.
Algorithm 2 Grid Search
  $RLCL\_Train_{Opt} = -\infty$
  $LCL\_Val_{Opt} = -\infty$
  $\beta_{Opt} = 0$
  initialize $Accuracy_{Opt}$
  for $\mu = 10^{-6}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10$ do
    for $\lambda$ = 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5 do
      $\beta$, $RLCL\_Train$ ← SGD($\mu$, $\lambda$)
      $LCL\_Val$, $Accuracy$ = Validate($\beta$, validation set)
      if $LCL\_Val_{Opt} < LCL\_Val$ then
        $LCL\_Val_{Opt} = LCL\_Val$
        $Accuracy_{Opt} = Accuracy$
        $\beta_{Opt} = \beta$
      end if
    end for
  end for
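The selection loop of Algorithm 2 can be sketched in Python as follows. The callables `train` and `validate` are hypothetical stand-ins for the SGD trainer and the validation routine, and the dictionary layout is our own choice; only the two grids are taken from Algorithm 2.

```python
import math

# Grids as listed in Algorithm 2.
MUS = [1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
LAMS = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]

def grid_search(train, validate, mus=MUS, lams=LAMS):
    """Return the configuration with the highest validation LCL.

    train(mu, lam) -> beta and validate(beta) -> (lcl_val, accuracy)
    are hypothetical callables supplied by the caller.
    """
    best = {"lcl_val": -math.inf, "accuracy": None, "beta": None, "mu": None, "lam": None}
    for mu in mus:
        for lam in lams:
            beta = train(mu, lam)
            lcl_val, accuracy = validate(beta)
            # A validation LCL of -Inf never beats the initial value and is skipped.
            if lcl_val > best["lcl_val"]:
                best = {"lcl_val": lcl_val, "accuracy": accuracy,
                        "beta": beta, "mu": mu, "lam": lam}
    return best
```

Selecting on the validation LCL rather than on accuracy breaks ties between configurations that misclassify the same number of validation examples.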
3 Results & Analysis
3.1 SGD
3.1.1 Learning Rate and Regularization Strength with Grid Search
We choose $\mu$ in the range $[10^{-8}, 10]$ with a scale factor of 10 and $\lambda$ in the
range [0.001, 0.5] following a 1, 2, 5 step pattern, namely [0.001, 0.002, 0.005, 0.01, ...,
0.5]. A comparison of the log conditional likelihoods and accuracies for models generated using
different $(\mu, \lambda)$ pairs is shown in Table 1. Because of space constraints, we have
listed only a subset of the $(\mu, \lambda)$ pairs considered for experimentation.
3.1.2 Choice of Optimality Score
In choosing the optimal hyperparameters from the results of grid search, we considered two
measures for determining the optimality of the trained model: accuracy on the validation set
and LCL on the validation set. As can be seen in Table 1, LCL is more fine-grained in terms of
values than the accuracy rate, so we choose LCL as the optimality score. An interesting thing
to note here is that the model with the maximum RLCL value on the training set has a high
error rate on the validation set and a validation set LCL value of -Inf. An LCL of -Inf arises
when the model assigns, in floating point, probability 0 to the correct label of at least one
validation example. This model is an evident case of overfitting the training set. For every
combination of $\mu$ and $\lambda$ from the above values, we train a logistic regression model
and check its performance on the validation set.
3.1.3 A note on Regularization Strength
In Table 1, we can see that for a given value of the learning rate $\lambda$, the Log
Conditional Likelihood value decreases as the regularization strength $\mu$ increases. In the
case of the validation set, according to theory, we would expect the log likelihood to increase
as $\mu$ increases to the optimal value and then decrease for values of $\mu$ beyond it.
However, for the given model, the log likelihood remains constant for small values of $\mu$
and decreases later. This is depicted in Figure 4. We observe an exception in the plot of LCL
versus $\mu$ for the $\lambda$ value of 0.05, which behaves according to theory, as explained by
the professor in the lecture.
Figure 4: LCL vs Mu for validation set
However, since the LCL decreases monotonically with increasing regularization strength for all
other values of $\lambda$, we hypothesize that the training data represents the model well and
does not need regularization. To check our hypothesis, we trained the model without L2
regularization. The resulting accuracy on the validation and test sets is similar to that of
the model with regularization. The results are shown in Table 1.
                         LCL (Validation Set)   Validation misclassification   Test misclassification
With Regularization      -39.894                13                             22
Without Regularization   -35.897                13                             22
Table 1: Regularized vs non-regularized training with the selected near-optimal parameters
($\mu = 10^{-3}$, $\lambda = 0.005$)
3.1.4 Test Set Accuracy
We now take the model defined by the optimal hyperparameter values derived from grid search,
$\mu = 10^{-3}$ and $\lambda = 0.005$, and evaluate it on the test set. It gives an accuracy of
91% (error rate of 0.09).
3.2 L-BFGS
We use the same training and validation sets for L-BFGS as for Stochastic Gradient Descent. We
use the minFunc implementation provided by Mark Schmidt (footnote 3). Since minFunc only
requires the cost function to be minimized, we provide it with the cost function of logistic
regression as described in chapter 1. In this case, we only need to estimate the value of the
regularization strength (hyperparameter $\mu$), which is passed as an input to the function
along with the training examples.
3.2.1 Gradient Check
We use the DerivativeCheck option provided by minFunc to check the correctness of the gradient
of the logistic regression cost function that we pass to it for minimization. The output is as
follows: {Checking Gradient... Max difference between user and numerical gradient:
1.924995e-07}
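The idea behind such a check can be illustrated with a generic central-difference comparison in Python. This sketch does not reproduce minFunc's own DerivativeCheck; the function names and the step size `h` are our assumptions, and the cost is the one in Equation (8) with `beta[0]` as an unpenalized intercept.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_and_grad(beta, X, y, mu):
    """Cost of Equation (8) and its analytic gradient with respect to beta."""
    n = len(X)
    cost = 0.0
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        row = [1.0] + list(xi)  # prepend the constant feature for the intercept
        p = sigmoid(sum(b * v for b, v in zip(beta, row)))
        cost -= (yi * math.log(p) + (1 - yi) * math.log(1.0 - p)) / n
        for j, v in enumerate(row):
            grad[j] += (p - yi) * v / n
    for j in range(1, len(beta)):
        cost += mu / (2.0 * n) * beta[j] ** 2
        grad[j] += mu / n * beta[j]
    return cost, grad

def max_gradient_gap(beta, X, y, mu, h=1e-6):
    """Largest absolute gap between the analytic and central-difference gradients."""
    _, grad = cost_and_grad(beta, X, y, mu)
    gap = 0.0
    for j in range(len(beta)):
        up, down = list(beta), list(beta)
        up[j] += h
        down[j] -= h
        numeric = (cost_and_grad(up, X, y, mu)[0] - cost_and_grad(down, X, y, mu)[0]) / (2 * h)
        gap = max(gap, abs(numeric - grad[j]))
    return gap
```

A small gap indicates that the analytic gradient agrees with the cost function; a large gap usually points to a sign or indexing error in the gradient code.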
3.2.2 Estimation of Regularization Strength
We choose $\mu$ in the range $[10^{-8}, 100]$ with a scale factor of 10 and train the logistic
regression model using L-BFGS. We then test the model on the validation set and choose the
value of $\mu$ which gives the maximum accuracy on the validation set. Table 2 contains the
results for different values of $\mu$.
3. http://www.di.ens.fr/~mschmidt/Software/copyright.html
$\mu$      Minimal Func value   Validation Misclassification
10^-6      0.000105             14
10^-5      0.0007727            14
10^-4      0.0056838            14
10^-3      0.040261             14
0.01       0.26788              14
0.1        1.6316               15
1          8.7223               15
Table 2: Consistent accuracy on the validation set for L-BFGS with varying $\mu$
3.2.3 Test Set Accuracy
We now take the model defined by the optimal value of $\mu$ derived above and evaluate it on
the test set. It gives an accuracy of 91% (error rate of 0.09).
3.3 Time Bounds
3.3.1 SGD
For $n$ training examples with $d$ features each, if the Stochastic Gradient Descent method
converges in $k$ epochs, training of the logistic regression model can be done in
$O(k \cdot n \cdot d)$ time. Validation and testing of the learnt model take $O(m \cdot d)$
time, where $m$ is the number of test/validation examples. Figure 5 shows the time statistics
for a run of grid search over 54 pairs of $(\mu, \lambda)$ values. As expected, the majority of
the time is spent in training the model. From theory, we conclude that within the training
algorithm, the calculation of the LCL after every epoch and that of the sigmoid function value
for every example account for most of the total time.
Figure 5: Time Stats for Grid Search
3.4 Observations
3.4.1 Effect of randomization with grid search in SGD
In the initial implementation of grid search, we randomized the training set each time a model
was trained for a new $(\mu, \lambda)$ pair. As a result, each run of the grid search gave
different results for measures like misclassification on the validation set, as shown in
Table 3. We attribute this observation to the fact that SGD is known to depend on the sequence
in which training data is presented to it, and hence a different model is possible for the
same $(\mu, \lambda)$ pair in different runs with a different sequence of training data. As is
evident from the table, one of the runs gave an error rate of 0.086 on the validation dataset,
but we chose not to use that configuration as it is not consistently reproducible.
Run 1                              Run 2                              Run 3
$\mu$     $\lambda$  Misclass.     $\mu$     $\lambda$  Misclass.     $\mu$     $\lambda$  Misclass.
0.0001    0.005      12            0.0001    0.002      13            1.00E-06  0.005      12
0.01      0.005      12            0.0001    0.01       13            1.00E-06  0.01       12
0.001     0.005      12            0.01      0.0002     13            0.0001    0.002      13
0.001     0.05       12            0.01      0.005      13            0.01      0.002      13
1.00E-06  0.002      13            0.001     0.002      13            0.001     0.002      13
Table 3: Results of multiple runs of grid search with randomized input for every pair of
hyperparameter configuration for SGD; reported misclassifications are on the validation set.
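One way to remove this source of variation is to draw the example order once from a seeded generator and reuse it for every hyperparameter pair. The Python sketch below illustrates the idea; the function name and the seed value are hypothetical and not taken from our implementation.

```python
import random

def fixed_order(n, seed=42):
    """Return a reproducible permutation of the example indices 0 .. n-1."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

# The same seed always gives the same visiting order, so two training runs
# with the same hyperparameters see the examples in the same sequence.
order_a = fixed_order(10)
order_b = fixed_order(10)
```

With the order fixed, differences between models in a grid search can be attributed to the hyperparameters rather than to the sequence of training examples.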
3.4.2 Effect of randomization after every epoch in SGD
In an attempt to make the numerical optimization algorithm as close to theory as possible, we
tried randomizing the examples at the start of every epoch instead of randomizing just once at
the beginning. However, we observed more oscillations in the RLCL vs epoch count plot with
randomization at each epoch. We have not been able to fully explain these results; we believe
that an insufficient number of training examples causes the SGD algorithm to behave
erratically (see Figure 6).
Figure 6: RLCL vs epochCount: Data randomization after every epoch
3.4.3 Sensitivity of L-BFGS to hyperparameter $\mu$
We analyze how different values of $\mu$ affect the epoch number, accuracy and conditional
likelihood for the model trained by the L-BFGS algorithm. We found that the algorithm is
insensitive for a particular range of $\mu$. This is evident from Figure 7, which shows that
the error rate does not change for low values of $\mu$, namely in the range
$[10^{-6}, 10^{-2}]$. One possible explanation is the size of the training and validation sets,
which are small for the given dataset, and hence there are not many fluctuations.
Figure 7: Sensitivity of L-BFGS for different $\mu$ values
4 Conclusion
We learnt about the logistic regression model for classification and how the parameters of the
model can be learnt using numerical optimization methods to maximize its log conditional
likelihood. We used SGD and L-BFGS to learn the model. We learnt how regularization can be used
to address the problem of overfitting while training a model, and about the importance of
validating the trained model on a disjoint dataset to determine its optimality.
For experimentation, we normalized the data using the MATLAB function zscore. Furthermore, to
test the trained model on a set disjoint from the training dataset, we split the train dataset
into training and validation sets with a ratio of 7:3. The SGD technique for learning a model
includes the estimation of two hyperparameters: learning rate ($\lambda$) and regularization
strength ($\mu$). Using grid search, we found the optimal hyperparameter values to be
$\lambda = 0.005$ and $\mu = 10^{-3}$. For determining convergence of the RLCL in the SGD
technique, we estimated a threshold of $10^{-6}$. The training algorithm converged in 77 epochs
for the above selection of hyperparameters.
The L-BFGS implementation was used as a black box for training the model given the
hyperparameter $\mu$. We found a range of optimal values for the regularization strength,
$\mu = 10^{-6}, 10^{-4}, 10^{-3}$. The training algorithm for these values converged in 23, 25
and 30 iterations respectively.
Both the models achieved an accuracy of 91% on the test dataset.
We made some key observations:
• The dataset is representative of the model being learnt, and hence the model gives the same
  accuracy even in the absence of regularization.
• Some of the models trained in grid search returned an optimal RLCL on the training set but
  gave an error rate of more than 50%. We concluded that these models had overfit the training
  data.
• The L-BFGS algorithm is not sensitive to the values of $\mu$ within the range
  $[10^{-6}, 10^{-2}]$.
• The dataset itself is quite small, and hence there were some noteworthy results such as
  oscillations for certain pairs of hyperparameter configurations.
References
[Bergstra and Bengio(2012)] James Bergstra and Yoshua Bengio. Random search for
hyper-parameter optimization. Journal of Machine Learning Research, December 2012.
[Chih-Wei Hsu and Lin(2010)] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical
guide to support vector classification, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[Elkan(2014)] Charles Elkan. Maximum likelihood, logistic regression, and stochastic gradient
training. January 2014.
[Ng(2010)] Andrew Ng. Machine learning, 2010. URL
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex5/ex5.html.
Table 1: Accuracies on validation for different values of $\mu$ (regularization
hyper-parameter, columns) and $\lambda$ (learning rate, rows)

RLCL value of training data set
$\lambda$   10^-6      10^-4      10^-3      10^-2      10^-1      1
0.001       -40.795    -40.81     -40.948    -42.32     -55.941    -136.42
0.002       -22.08     -22.102    -22.307    -24.379    -45.119    -133.7
0.005       -8.4316    -8.4646    -8.7668    -11.96     -39.681    -133.14
0.01        -3.2127    -3.2505    -3.6047    -7.8432    -34.023    -133.12
0.02        -0.7179    -0.7424    -0.9901    -3.1885    -35.086    -133.12
0.05        -0.0900    -0.1098    -0.2697    -1.6601    -34.834    -133.12
0.1         -0.0124    -0.0654    -0.2817    -2.3773    -35.038    -133.13
0.2         -0.0058    -0.2194    -0.4570    -1.9263    -35.179    -133.15
0.5         -0.0168    -1.08      -0.3033    -1.8907    -35.242    -133.15

LCL value of validation data set
$\lambda$   10^-6      10^-4      10^-3      10^-2      10^-1      1
0.001       -41.99     -41.994    -42.028    -42.375    -46.015    -70.053
0.002       -37.82     -37.824    -37.863    -38.272    -43.26     -68.092
0.005       -35.897    -35.896    -35.894    -36.079    -41.518    -66.941
0.01        -37.63     -37.596    -37.309    -36.152    -39.843    -67.05
0.02        -55.452    -55.096    -52.095    -45.803    -39.472    -67.156
0.05        -148.68    -145.78    -124.89    -52.861    -39.438    -67.143
0.1         -Inf       -Inf       -Inf       -53.457    -39.365    -67.247
0.2         -Inf       -Inf       -Inf       -82.606    -39.357    -67.35
0.5         -Inf       -Inf       -Inf       -59.592    -39.324    -67.337

Wrong predictions on validation data set
$\lambda$   10^-6      10^-4      10^-3      10^-2      10^-1      1
0.001       15         15         15         15         15         18
0.002       14         14         14         14         14         19
0.005       13         13         13         13         14         19
0.01        14         14         14         15         13         19
0.02        16         16         16         16         16         19
0.05        22         22         23         19         15         19
0.1         16         17         20         18         14         19
0.2         22         21         19         20         14         19
0.5         21         24         18         20         14         19