M. Soleymani
Sharif University of Technology
Fall 2017
Some slides have been adapted from Fei-Fei Li's lectures, cs231n, Stanford 2017
Types of ML problems
• Supervised learning (regression, classification)
– predicting a target variable for which we get to see examples.
• Unsupervised learning
– revealing structure in the observed data
• Reinforcement learning
– partial (indirect) feedback, no explicit guidance
– learn a policy and utility functions from the rewards given for a sequence of moves
Components of (Supervised) Learning
• Unknown target function: 𝑓: 𝒳 → 𝒴
– Input space: 𝒳
– Output space: 𝒴
• We use a training set to find a function that can also predict the output
on the test set
Training data: Example
Training data (plotted in the (𝑥1, 𝑥2) plane):
𝑥1     𝑥2     𝑦
0.9    2.3     1
3.5    2.6     1
2.6    3.3     1
2.7    4.1     1
1.8    3.9     1
6.5    6.8    -1
7.2    7.5    -1
7.9    8.3    -1
6.9    8.3    -1
8.8    7.9    -1
9.1    6.2    -1
Supervised Learning: Regression vs. Classification
• Supervised Learning
– Regression: predict a continuous target variable
• E.g., 𝑦 ∈ [0,1]
Regression Example
• Housing price prediction
[Figure: house price (in $1000's) plotted against size (in feet²)]
• Supervised learning
– Given: Training set
• labeled set of 𝑁 input-output pairs $D = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
– Goal: learning a mapping from 𝒙 to 𝑦
• Unsupervised learning
– Given: Training set
• $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$
– Goal: find groups or structures in the data
• Discover the intrinsic structure in the data
Supervised Learning: Samples
[Figure: classification samples in the (𝑥1, 𝑥2) plane]
Unsupervised Learning: Samples
[Figure: clustering of samples in the (𝑥1, 𝑥2) plane into groups Type I, Type II, Type III, and Type IV]
Reinforcement Learning
• Provides only an indication as to whether an action is correct or not
Reinforcement Learning
• Typically, we need to make a sequence of decisions
– it is usually assumed that reward signals refer to the entire sequence
Generalization
• We don’t intend to memorize data but need to figure out the pattern.
(Typical) Steps of solving supervised learning problem
• Select the hypothesis space
– A class of parametric models that map each input vector, x, into a predicted output y.
• Come up with a way of efficiently finding the parameters that minimize the
loss function. (optimization)
Cost function:
$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - (w_0 + w_1 x^{(i)}) \right)^2$
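A minimal sketch of evaluating this cost for a single-input linear model. The function name `sse_cost` and the toy data are assumptions made for illustration only.

```python
def sse_cost(w0, w1, xs, ys):
    """Sum of squared errors J(w) for the line f(x) = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

perfect = sse_cost(1.0, 2.0, xs, ys)  # every residual is 0
worse = sse_cost(0.0, 2.0, xs, ys)    # every residual is 1
```

Parameters that fit the data exactly give zero cost; shifting the intercept by 1 adds one unit of squared error per sample.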
Cost function: example
[Figure: house price (in $1000's) versus size in feet² (𝑥), alongside 𝐽(𝒘) as a function of the parameters 𝑤0, 𝑤1]
This example has been adapted from Prof. Andrew Ng’s slides.
Cost function: example
𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1 𝑥 (for fixed 𝑤0, 𝑤1, this is a function of 𝑥)
𝐽(𝑤0, 𝑤1) (function of the parameters 𝑤0, 𝑤1)
Review: Iterative optimization of cost function
• Cost function: 𝐽(𝒘)
• Optimization problem: $\boldsymbol{w} = \arg\min_{\boldsymbol{w}} J(\boldsymbol{w})$
• Steps:
– Start from $\boldsymbol{w}^{0}$
– Repeat
• Update $\boldsymbol{w}^{t}$ to $\boldsymbol{w}^{t+1}$ in order to reduce 𝐽
• 𝑡 ← 𝑡 + 1
– until we hopefully end up at a minimum
How to optimize parameters?
– the slope of the error surface can be calculated by taking the derivative of the error
function at that point
Gradient descent
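$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \left[ \frac{\partial J(\boldsymbol{w})}{\partial w_0}, \frac{\partial J(\boldsymbol{w})}{\partial w_1}, \dots, \frac{\partial J(\boldsymbol{w})}{\partial w_d} \right]$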
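As a rough illustration of the gradient as a vector of partial derivatives, the sketch below estimates each component with a central finite difference. The helper `numerical_gradient`, the step `eps`, and the toy cost are assumptions for illustration; in practice the gradient is usually derived analytically.

```python
def numerical_gradient(J, w, eps=1e-6):
    """Central-difference estimate of [dJ/dw_0, ..., dJ/dw_d] at the point w."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        grad.append((J(w_plus) - J(w_minus)) / (2 * eps))
    return grad


# Hypothetical cost J(w) = w_0^2 + 3 * w_1, whose exact gradient is [2 * w_0, 3].
def toy_cost(w):
    return w[0] ** 2 + 3 * w[1]


g = numerical_gradient(toy_cost, [1.0, 2.0])
```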
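𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1 𝑥
𝐽(𝑤0, 𝑤1) (function of the parameters 𝑤0, 𝑤1)
Update rule:
$\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta \sum_{i=1}^{N} \left( {\boldsymbol{w}^{t}}^{T} \boldsymbol{x}^{(i)} - y^{(i)} \right) \boldsymbol{x}^{(i)}$
This example has been adapted from Prof. Ng’s slides (ML Online Course, Stanford).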
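The batch update for the line 𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1 𝑥 can be sketched as follows, with 𝒙 = [1, 𝑥] so that the two gradient components are the sum of residuals and the sum of residuals times 𝑥. The learning rate, the number of steps, the zero starting point, and the toy data are all assumptions chosen so that this small example converges.

```python
def gradient_descent(xs, ys, eta=0.05, steps=2000):
    """Batch gradient descent for f(x) = w0 + w1 * x with squared error."""
    w0, w1 = 0.0, 0.0  # starting point w^0 (an arbitrary choice)
    for _ in range(steps):
        # Residuals (w^T x - y) for every training sample, with x = [1, x].
        errs = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errs)                             # component for w0
        g1 = sum(e * x for e, x in zip(errs, xs))  # component for w1
        w0, w1 = w0 - eta * g0, w1 - eta * g1
    return w0, w1


# Hypothetical noise-free data generated by y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = gradient_descent(xs, ys)
```

On this data the iterates approach 𝑤0 = 1 and 𝑤1 = 2. A learning rate that is too large would make the same loop diverge.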
Gradient descent disadvantages
• However, when 𝐽 is convex, all local minima are also global minima ⇒ gradient
descent can converge to the global solution.
Stochastic gradient descent
• Batch techniques process the entire training set in one go
– thus they can be computationally costly for large data sets.
• Stochastic gradient descent: applicable when the cost function is a sum over
data points:
$J(\boldsymbol{w}) = \sum_{i=1}^{n} J^{(i)}(\boldsymbol{w})$
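A minimal sketch of the stochastic variant for the squared-error line fit, assuming the usual scheme of updating the parameters after each individual sample using only that sample's term of the cost. The learning rate, epoch count, random seed, and toy data are illustrative assumptions.

```python
import random


def sgd(xs, ys, eta=0.05, epochs=1000, seed=0):
    """Stochastic gradient descent for f(x) = w0 + w1 * x with squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            err = (w0 + w1 * xs[i]) - ys[i]  # residual of sample i only
            w0 -= eta * err
            w1 -= eta * err * xs[i]
    return w0, w1


# Hypothetical noise-free data generated by y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = sgd(xs, ys)
```

Each update touches a single sample, so the cost per step does not grow with the size of the training set.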
Linear model: multi-dimensional inputs
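$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d = \boldsymbol{w}^T \boldsymbol{x}$
$\boldsymbol{w} = [w_0, w_1, \dots, w_d]^T$, $\boldsymbol{x} = [1, x_1, \dots, x_d]^T$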
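The augmented-input form can be sketched as below: a constant 1 is prepended to the input so that 𝑤0 is handled by the same inner product. The function name `predict` and the numbers are illustrative assumptions.

```python
def predict(w, x):
    """f(x; w) = w^T [1, x_1, ..., x_d], with w = [w_0, w_1, ..., w_d]."""
    x_aug = [1.0] + list(x)  # prepend the constant 1 that multiplies w_0
    return sum(wk * xk for wk, xk in zip(w, x_aug))


y_hat = predict([0.5, 2.0, -1.0], [3.0, 4.0])  # 0.5 + 2 * 3 - 1 * 4
```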
Generalized linear regression
• Linear combination of fixed non-linear functions of the input vector
$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$
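One possible choice of basis functions, assumed here purely for illustration, is the set of powers of a scalar input, 𝜙𝑘(𝑥) = 𝑥^𝑘. The model stays linear in 𝒘 even though it is non-linear in 𝑥.

```python
def poly_features(x, m):
    """phi_1(x), ..., phi_m(x) = x, x^2, ..., x^m for a scalar input x."""
    return [x ** k for k in range(1, m + 1)]


def predict_glr(w, x):
    """f(x; w) = w_0 + sum_k w_k * phi_k(x), where len(w) == m + 1."""
    phi = poly_features(x, len(w) - 1)
    return w[0] + sum(wk * pk for wk, pk in zip(w[1:], phi))


y_hat = predict_glr([1.0, 0.0, 2.0], 3.0)  # 1 + 0 * 3 + 2 * 9
```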
Model complexity and overfitting
• With limited training data, models may achieve zero training error but
a large test error.
• Over-fitting: when the training loss no longer bears any relation to the
test (generalization) loss.
– Fails to generalize to unseen examples.
Over-fitting causes
• Model complexity
– E.g., Model with a large number of parameters (degrees of freedom)
Model complexity
• Example:
– Polynomials with larger 𝑚 become increasingly tuned to the random
noise on the target values.
[Figure: polynomial fits for 𝑚 = 0, 𝑚 = 1, 𝑚 = 3, and 𝑚 = 9, with 𝑦 on the vertical axis] [Bishop]
Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data
increases.
[Figure: fits with 𝑚 = 9 for 𝑛 = 15 and for 𝑛 = 100 training points] [Bishop]
Avoiding over-fitting
• Determine a suitable value for model complexity
– Simple hold-out method
– Cross-validation
Simple hold out: training, validation, and test sets
[Figure: error versus degree of polynomial 𝑚, with curves 𝐽𝑣 and 𝐽𝑡𝑟𝑎𝑖𝑛; the data is split into Training, Validation, and Test parts]
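A minimal sketch of a simple hold-out split into training, validation, and test sets. The 60/20/20 proportions, the fixed seed, and the function name `holdout_split` are assumptions for illustration.

```python
import random


def holdout_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data and cut it into train, validation, and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


train, val, test = holdout_split(range(10))
```

The model complexity (for example the degree 𝑚) would then be chosen by the error on `val`, and `test` would be used only once for the final estimate.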
Cross-Validation (CV): Evaluation
• 𝑘-fold cross-validation steps:
– Shuffle the dataset and randomly partition training data into 𝑘 groups of approximately equal size
– for 𝑖 = 1 to 𝑘
• Choose the 𝑖-th group as the held-out validation group
• Train the model on all but the 𝑖-th group of data
• Evaluate the model on the held-out group
– Performance scores of the model from 𝑘 runs are averaged.
[Figure: the 𝑘 runs, from the first run to the 𝑘-th run, each holding out a different group]
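The 𝑘-fold steps can be sketched as follows. The callback `train_and_score` is a hypothetical placeholder for whatever fits a model on the training indices and returns its score on the held-out indices; the dummy scorer below only reports the size of the held-out group.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k groups of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, k, train_and_score):
    """Average the score over k runs, holding out one group per run."""
    folds = k_fold_indices(n, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train_idx, val_idx))
    return sum(scores) / k


folds = k_fold_indices(10, 3)
# Dummy scorer (hypothetical): returns the size of the held-out group.
avg = cross_validate(10, 3, lambda train_idx, val_idx: len(val_idx))
```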
Regularization
• Adding a penalty term in the cost function to discourage the
coefficients from reaching large values.
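$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2 + \lambda R(\boldsymbol{w})$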
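A minimal sketch of the regularized cost. The slide leaves 𝑅(𝒘) general; the sum of squared coefficients used here is one common choice and is an assumption, as are the feature vectors and the value of 𝜆.

```python
def regularized_cost(w, phis, ys, lam):
    """Squared error plus lam * R(w), assuming R(w) = sum_k w_k^2."""
    data_term = sum(
        (y - sum(wk * pk for wk, pk in zip(w, phi))) ** 2
        for phi, y in zip(phis, ys)
    )
    penalty = sum(wk ** 2 for wk in w)
    return data_term + lam * penalty


phis = [[1.0, 0.0], [1.0, 1.0]]  # hypothetical feature vectors phi(x)
ys = [1.0, 3.0]
w = [1.0, 2.0]                   # fits both points exactly
cost = regularized_cost(w, phis, ys, lam=0.1)  # 0 + 0.1 * (1 + 4)
```

Even with zero training error the cost stays positive, which is what pushes the optimizer toward smaller coefficients.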
Regularization
[Bishop]
Choosing the regularization parameter
[Figure: error curves 𝐽𝑣 and 𝐽𝑡𝑟𝑎𝑖𝑛]
Classification problem
• Given: Training set
– labeled set of 𝑁 input-output pairs $D = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
– 𝑦 ∈ {1, … , 𝐾}
• Examples:
– Image classification
– Speech recognition
–…
Linear Classifier example
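• Two-class example: decision boundary $-\frac{3}{4} x_1 - x_2 + 3 = 0$
if $\boldsymbol{w}^T \boldsymbol{x} + w_0 \ge 0$ then 𝒞1 else 𝒞2
$\boldsymbol{w} = [-\frac{3}{4}, -1]^T$, $w_0 = 3$
[Figure: the decision boundary in the (𝑥1, 𝑥2) plane, separating regions 𝒞1 and 𝒞2]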
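The decision rule for this example can be sketched directly with 𝒘 = [−3/4, −1] and 𝑤0 = 3. The function name `classify` and the two test points are illustrative assumptions.

```python
def classify(x, w=(-0.75, -1.0), w0=3.0):
    """Return "C1" if w^T x + w0 >= 0, otherwise "C2"."""
    score = sum(wk * xk for wk, xk in zip(w, x)) + w0
    return "C1" if score >= 0 else "C2"


label_a = classify((1.0, 1.0))  # -0.75 - 1 + 3 = 1.25, non-negative
label_b = classify((4.0, 3.0))  # -3 - 3 + 3 = -3, negative
```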
Square error loss function for classification!
𝐾 = 2
Square error loss is not suitable for classification:
– Least square loss penalizes ‘too correct’ predictions (those that lie a long way on the correct
side of the decision boundary)
– Least square loss also lacks robustness to noise
$J(\boldsymbol{w}) = \sum_{i=1}^{N} \left( w x^{(i)} + w_0 - y^{(i)} \right)^2$
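A small numeric illustration of the ‘too correct’ problem, assuming targets 𝑦 ∈ {1, −1} as in the training data example and hypothetical score values. A score far on the correct side is penalized more than a score on the wrong side.

```python
def squared_loss(score, y):
    """Squared error between the linear score w * x + w0 and the target y."""
    return (score - y) ** 2


y = 1
exactly_on_target = squared_loss(1.0, y)  # score equals the target
very_confident = squared_loss(5.0, y)     # far on the correct side
wrong_side = squared_loss(-1.0, y)        # misclassified
```

Here the confidently correct score costs 16 while the misclassified one costs only 4.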
Parametric classifier: Multiclass
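• $f(\boldsymbol{x}; \boldsymbol{W}) = [f_1(\boldsymbol{x}, \boldsymbol{W}), \dots, f_K(\boldsymbol{x}, \boldsymbol{W})]^T$
Example (28 × 28 input flattened to 784 values):
$\boldsymbol{x} = [x_1, \dots, x_{784}]^T$
$\boldsymbol{W}^T$: a 10 × 784 matrix with rows $\boldsymbol{w}_1^T, \dots, \boldsymbol{w}_{10}^T$
$\boldsymbol{b} = [b_1, \dots, b_{10}]^T$
How can we tell whether this 𝑾 and 𝒃 are good or bad?
The bias can also be included in the 𝑾 matrix.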
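A sketch of computing one score per class and of folding the bias into the weight matrix. It assumes the scores take the linear form 𝑾ᵀ𝒙 + 𝒃; the tiny 3-class, 2-feature numbers and both function names are hypothetical.

```python
def scores(W_T, b, x):
    """One score per class: row j of W^T dotted with x, plus b_j."""
    return [sum(wk * xk for wk, xk in zip(row, x)) + bj for row, bj in zip(W_T, b)]


def fold_bias(W_T, b):
    """Append b_j to row j, so the bias is absorbed when x gets a trailing 1."""
    return [list(row) + [bj] for row, bj in zip(W_T, b)]


W_T = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # 3 classes, 2 features
b = [0.5, 0.0, -1.0]
x = [2.0, 4.0]

s = scores(W_T, b, x)
s_folded = scores(fold_bias(W_T, b), [0.0, 0.0, 0.0], x + [1.0])
```

Both forms give the same scores, which is why the bias is often dropped from the notation.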
Multi-class SVM
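$J(\boldsymbol{W}) = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\boldsymbol{W})$
$s_j = \boldsymbol{w}_j^T \boldsymbol{x}^{(i)}$
Example with 𝑁 = 3:
$L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$
$L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$
$L^{(3)} = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$
$\frac{1}{N} \sum_{i=1}^{N} L^{(i)} = \frac{1}{3}(2.9 + 0 + 12.9) \approx 5.27$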
Some questions?
$L^{(i)} = \sum_{j \ne y^{(i)}} \max(0, 1 + s_j - s_{y^{(i)}})$
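The per-sample hinge loss can be sketched as below, using the score values that appear in the worked example. The position of each score within its list, and therefore the label indices 0, 1, 2, are assumptions made for illustration.

```python
def svm_loss(s, y):
    """Multi-class SVM loss: sum over j != y of max(0, 1 + s_j - s_y)."""
    return sum(max(0.0, 1.0 + sj - s[y]) for j, sj in enumerate(s) if j != y)


# (scores, index of the correct class); the ordering is assumed.
examples = [
    ([3.2, 5.1, -1.7], 0),
    ([1.3, 4.9, 2.0], 1),
    ([2.2, 2.5, -3.1], 2),
]
losses = [svm_loss(s, y) for s, y in examples]
mean_loss = sum(losses) / len(losses)
```

Up to floating-point rounding the three losses are 2.9, 0, and 12.9, so the data term of the cost is their average, 15.8 / 3.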
Cross-entropy loss:
$L^{(i)} = -\log P(Y = y^{(i)} \mid X = x^{(i)}) = -s_{y^{(i)}} + \log \sum_{j=1}^{K} e^{s_j}$
Softmax classifier loss: example
$L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{j=1}^{K} e^{s_j}}$
$H(q, p) = -\sum_{x} q(x) \log p(x)$
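A minimal sketch of the softmax loss for one sample, written in the equivalent form −𝑠_𝑦 + log Σ 𝑒^{𝑠_𝑗}. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; it does not change the value. The example scores are hypothetical.

```python
import math


def softmax_loss(s, y):
    """Cross-entropy loss -log(exp(s_y) / sum_j exp(s_j)) for one sample."""
    m = max(s)  # shift by the max so that exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(sj - m) for sj in s))
    return -s[y] + log_sum_exp


loss_uniform = softmax_loss([0.0, 0.0, 0.0], 0)     # equal scores: log(3)
loss_confident = softmax_loss([10.0, 0.0, 0.0], 0)  # correct class dominates
```

With equal scores the loss equals log 𝐾; as the correct class's score grows relative to the others, the loss approaches zero but never reaches it.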