
# Machine Learning Review

M. Soleymani
Sharif University of Technology
Fall 2017

Some slides have been adapted from Fei-Fei Li's lectures, CS231n, Stanford, 2017
Types of ML problems
• Supervised learning (regression, classification)
– predicting a target variable for which we get to see examples.
• Unsupervised learning
– revealing structure in the observed data
• Reinforcement learning
– partial (indirect) feedback, no explicit guidance
– given rewards for a sequence of moves, learn a policy and utility functions

Components of (Supervised) Learning
• Unknown target function: 𝑓: 𝒳 → 𝒴
– Input space: 𝒳
– Output space: 𝒴

• Training data: (𝒙1, 𝑦1), (𝒙2, 𝑦2), …, (𝒙𝑁, 𝑦𝑁)

• We use the training set to find a function that can also predict the output on the test set.

Training data: Example

| 𝑥1  | 𝑥2  | 𝑦  |
|-----|-----|----|
| 0.9 | 2.3 | 1  |
| 3.5 | 2.6 | 1  |
| 2.6 | 3.3 | 1  |
| 2.7 | 4.1 | 1  |
| 1.8 | 3.9 | 1  |
| 6.5 | 6.8 | -1 |
| 7.2 | 7.5 | -1 |
| 7.9 | 8.3 | -1 |
| 6.9 | 8.3 | -1 |
| 8.8 | 7.9 | -1 |
| 9.1 | 6.2 | -1 |

[Figure: the same data plotted in the (𝑥1, 𝑥2) plane, with the two classes forming separate clusters.]
Supervised Learning: Regression vs. Classification

• Supervised Learning
– Regression: predict a continuous target variable
• E.g., 𝑦 ∈ [0,1]

– Classification: predict a discrete target variable
• E.g., 𝑦 ∈ {1, 2, …, 𝐶}
Regression Example
• Housing price prediction

[Figure: scatter plot of price (in \$1000s, 0–400) versus size in feet² (0–2500), with a curve fit through the points.]

Figure adapted from slides of Andrew Ng, Machine Learning course, Stanford.
Supervised Learning vs. Unsupervised Learning

• Supervised learning
– Given: Training set
• labeled set of 𝑁 input–output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁}
– Goal: learning a mapping from 𝒙 to 𝑦

• Unsupervised learning
– Given: Training set
• unlabeled set {𝒙^(𝑖)}_{𝑖=1}^{𝑁}
– Goal: find groups or structures in the data
• Discover the intrinsic structure in the data

Supervised Learning: Samples

[Figure: labeled points of two classes in the (𝑥1, 𝑥2) plane, separated by a decision boundary — classification.]
Unsupervised Learning: Samples

[Figure: unlabeled points in the (𝑥1, 𝑥2) plane forming four clusters (Type I–IV) — clustering.]
Reinforcement Learning
• Provides only an indication as to whether an action is correct or not

Data in supervised learning:
(input, correct output)

Data in reinforcement learning:
(input, some output, a grade of reward for this output)
Reinforcement Learning
• Typically, we need to make a sequence of decisions
– it is usually assumed that the reward signal refers to the entire sequence
Generalization

• We don’t intend to memorize the data; we need to figure out the underlying pattern.

• A core objective of learning is to generalize from experience.
– Generalization: the ability of a learning algorithm to perform accurately on new, unseen examples after having experienced a training set.
(Typical) Steps of solving supervised learning problem
• Select the hypothesis space
– A class of parametric models that map each input vector 𝒙 to a predicted output 𝑦.

• Define a loss function that quantifies how undesirable each parameter vector is across the training data.

• Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).

• Evaluate the obtained model.
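As a concrete illustration of the four steps, here is a minimal sketch in Python, assuming a made-up 1-D dataset and a linear hypothesis space (none of these names or numbers are from the slides):

```python
import numpy as np

# Step 1 -- hypothesis space: linear models f(x; w) = w0 + w1*x.
# The data below is invented for illustration (true line: y = 2 + 3x).
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=50)
y_train = 2.0 + 3.0 * x_train + rng.normal(0, 0.5, size=50)

# Step 2 -- loss: sum of squared errors J(w) = sum_i (y_i - f(x_i; w))^2
def loss(w0, w1, x, y):
    return np.sum((y - (w0 + w1 * x)) ** 2)

# Step 3 -- optimization: here, the closed-form least-squares solution
X = np.column_stack([np.ones_like(x_train), x_train])
w0, w1 = np.linalg.lstsq(X, y_train, rcond=None)[0]

# Step 4 -- evaluation on held-out data drawn from the same process
x_test = rng.uniform(0, 10, size=20)
y_test = 2.0 + 3.0 * x_test + rng.normal(0, 0.5, size=20)
test_mse = loss(w0, w1, x_test, y_test) / len(x_test)
```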

Linear regression: squared-error loss function

𝑓(𝑥; 𝒘) = 𝑤0 + 𝑤1𝑥, with parameters 𝒘 = [𝑤0, 𝑤1] to be found.

[Figure: price vs. size scatter plot; the vertical gap at each point is the residual 𝑦^(𝑖) − 𝑓(𝑥^(𝑖); 𝒘).]

Cost function:
𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} (𝑦^(𝑖) − (𝑤0 + 𝑤1𝑥^(𝑖)))²
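This cost can be computed directly; a small sketch, with a dataset invented for illustration (points lying exactly on y = 2x):

```python
import numpy as np

# Squared-error cost J(w) = sum_i (y_i - (w0 + w1*x_i))^2.
def cost(w, x, y):
    w0, w1 = w
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])          # exactly on the line y = 2x

j_perfect = cost([0.0, 2.0], x, y)     # the true line: zero cost
j_off = cost([0.0, 1.0], x, y)         # slope too small: positive cost
```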

Cost function: example

[Figure: left, the price (in \$1000s) vs. size-in-feet² data; right, the cost surface 𝐽(𝒘) as a function of the parameters (𝑤0, 𝑤1).]

This example has been adapted from Prof. Andrew Ng's slides.
Cost function: example

𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥 (for fixed 𝑤0, 𝑤1, a function of 𝑥)   𝐽(𝑤0, 𝑤1) (a function of the parameters 𝑤0, 𝑤1)

[Figure: left, the current fit; right, the corresponding point on the contour plot of 𝐽(𝑤0, 𝑤1).]

This example has been adapted from Prof. Andrew Ng's slides.
Review: Iterative optimization of cost function
• Cost function: 𝐽(𝒘)
• Optimization problem: 𝒘̂ = argmin_𝒘 𝐽(𝒘)

• Steps:
– Start from 𝒘^0
– Repeat
• Update 𝒘^𝑡 to 𝒘^{𝑡+1} in order to reduce 𝐽
• 𝑡 ← 𝑡 + 1
– until we hopefully end up at a minimum
How to optimize parameters?

A person is stuck in the mountains and is trying to get down (i.e., trying to find the minimum). The steepness of the hill represents the slope of the cost surface at that point.
How to compute the slope?
• In one dimension, the derivative of a function:
– the slope of the error surface can be calculated by taking the derivative of the error function at that point

• In multiple dimensions, the gradient is the vector of partial derivatives along each dimension

• In each step, gradient descent takes a step proportional to the negative of the gradient of the function at the current point 𝒘^𝑡:

𝒘^{𝑡+1} = 𝒘^𝑡 − 𝛾_𝑡 𝛻𝐽(𝒘^𝑡)

– 𝐽(𝒘) decreases fastest if one moves from 𝒘^𝑡 in the direction of −𝛻𝐽(𝒘^𝑡)

Learning rate: in the mountain analogy, the amount of time the hiker travels before taking another measurement is the learning rate of the algorithm.

• Minimize 𝐽(𝒘) with step size (learning-rate parameter) 𝜂:

𝒘^{𝑡+1} = 𝒘^𝑡 − 𝜂 𝛻_𝒘 𝐽(𝒘^𝑡)

𝛻_𝒘 𝐽(𝒘) = [∂𝐽(𝒘)/∂𝑤0, ∂𝐽(𝒘)/∂𝑤1, …, ∂𝐽(𝒘)/∂𝑤𝑑]

• If 𝜂 is small enough, then 𝐽(𝒘^{𝑡+1}) ≤ 𝐽(𝒘^𝑡).
• 𝜂 can be allowed to change at every iteration as 𝜂_𝑡.
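A minimal sketch of this update rule, minimizing an assumed simple convex quadratic (not a function from the slides) whose minimum is at (1, −2):

```python
import numpy as np

# Gradient descent w_{t+1} = w_t - eta * grad J(w_t) applied to
# J(w) = (w0 - 1)^2 + (w1 + 2)^2, which is minimized at (1, -2).
def grad_J(w):
    return np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])

w = np.array([5.0, 5.0])   # starting point w^0
eta = 0.1                  # learning rate (step size)
for _ in range(200):
    w = w - eta * grad_J(w)
```

With this step size the error shrinks by a constant factor per iteration, so 200 steps land essentially on the minimizer.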

𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥   𝐽(𝑤0, 𝑤1) (a function of the parameters 𝑤0, 𝑤1)

[Figure: the current fit and the corresponding point on the contour plot of 𝐽.]

Gradient-descent update for linear regression:
𝒘^{𝑡+1} = 𝒘^𝑡 − 𝜂 Σ_{𝑖=1}^{𝑁} ((𝒘^𝑡)^𝑇𝒙^(𝑖) − 𝑦^(𝑖)) 𝒙^(𝑖)
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
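This specific update can be sketched as follows, assuming a small made-up dataset, with 𝒙^(i) = [1, x^(i)] so that 𝒘 = [w0, w1]:

```python
import numpy as np

# Batch gradient descent for linear regression, implementing
# w <- w - eta * sum_i (w^T x_i - y_i) x_i on illustrative data
# generated from the line w0 = 0.5, w1 = 1.5 (no noise).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5 + 1.5 * x
X = np.column_stack([np.ones_like(x), x])   # rows are x_i = [1, x_i]

w = np.zeros(2)
eta = 0.02
for _ in range(2000):
    grad = X.T @ (X @ w - y)                # sum_i (w^T x_i - y_i) x_i
    w = w - eta * grad
```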
[Figures: eight successive gradient-descent iterations — on the left, the current fit 𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥; on the right, the corresponding point descending the contour plot of 𝐽(𝑤0, 𝑤1). Adapted from Prof. Ng's slides (ML Online Course, Stanford).]

• Local minima problem: gradient descent may get stuck in a local minimum.

• However, when 𝐽 is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
• Batch techniques process the entire training set in one go
– thus they can be computationally costly for large data sets.

• Stochastic gradient descent: applicable when the cost function is a sum over data points:

𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} 𝐽^(𝑖)(𝒘)

𝒘^{𝑡+1} = 𝒘^𝑡 − 𝜂 Σ_{𝑗∈𝑆} 𝛻_𝒘 𝐽^(𝑗)(𝒘)   (𝑆: a subset of the data points)
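A mini-batch SGD sketch for least-squares linear regression, under assumed toy data; the batch size and learning rate are illustrative choices, not values from the slides:

```python
import numpy as np

# Mini-batch SGD: each step uses the gradient of the loss on a small
# random subset S of the data instead of the whole training set.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(0, 5, 200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 200)

w = np.zeros(2)
eta = 0.005
batch = 16
for _ in range(10000):
    S = rng.choice(len(y), size=batch, replace=False)  # random subset S
    grad = X[S].T @ (X[S] @ w - y[S])                  # gradient over S only
    w = w - eta * grad
```

Because the gradient is noisy, the iterates hover near the optimum rather than converging exactly; a decaying 𝜂_𝑡 would remove that residual noise.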

Linear model: multi-dimensional inputs

𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝑥1 + ⋯ + 𝑤𝑑𝑥𝑑 = 𝒘^𝑇𝒙

𝒘 = [𝑤0, 𝑤1, …, 𝑤𝑑]^𝑇,  𝒙 = [1, 𝑥1, …, 𝑥𝑑]^𝑇  (a constant 1 is prepended to 𝒙 to absorb the bias 𝑤0)
Generalized linear regression
• Linear combination of fixed non-linear function of the input vector
𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝜙1(𝒙) + ⋯ + 𝑤𝑚𝜙𝑚(𝒙)

{𝜙1(𝒙), …, 𝜙𝑚(𝒙)}: set of basis functions (or features), 𝜙𝑖(𝒙): ℝ^𝑑 → ℝ

• Polynomial (univariate): 𝜙𝑖(𝑥) = 𝑥^𝑖
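A sketch of generalized linear regression with the univariate polynomial basis 𝜙𝑖(𝑥) = 𝑥^𝑖, using an assumed noiseless cubic so that the fit can be checked exactly:

```python
import numpy as np

# The model is linear in w even though it is non-linear in x:
# f(x; w) = w0 + w1*x + w2*x^2 + w3*x^3.
def poly_features(x, m):
    # Phi[i, j] = x_i ** j for j = 0..m  (phi_0 = 1 is the bias term)
    return np.vander(x, m + 1, increasing=True)

# Noiseless samples of the cubic y = 1 - 2x + 0.5x^3 (illustrative)
x = np.linspace(-1, 1, 20)
y = 1 - 2 * x + 0.5 * x ** 3

Phi = poly_features(x, 3)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # recovers [1, -2, 0, 0.5]
```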

Model complexity and overfitting
• With limited training data, models may achieve zero training error but
a large test error.

• Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
– The model fails to generalize to unseen examples.
Over-fitting causes
• Model complexity
– E.g., Model with a large number of parameters (degrees of freedom)

• Too few training data
– Small dataset size compared to the complexity of the model
Model complexity
• Example:
– Polynomials with larger 𝑚 are becoming increasingly tuned to the random
noise on the target values.
[Figures from Bishop: polynomial fits of degree 𝑚 = 0, 1, 3, 9 to the same noisy data — the low-degree fits are too rigid, while the 𝑚 = 9 fit oscillates through every noisy point.]
Number of training data & overfitting
 The over-fitting problem becomes less severe as the size of training data increases.

[Figures from Bishop: degree-9 (𝑚 = 9) polynomial fits with 𝑛 = 15 vs. 𝑛 = 100 training points — the larger dataset tames the oscillations.]
Avoiding over-fitting
• Determine a suitable value for model complexity
– Simple hold-out method
– Cross-validation

• Regularization (Occam's razor)
– Explicit preference towards simpler models
– Penalize model complexity in the objective function
Simple hold out: training, validation, and test sets

• The simple hold-out method chooses the model (hyperparameters) that minimizes the error on the validation set.

[Figure: the dataset is split into training, validation, and test sets; the training error 𝐽_train keeps decreasing with the degree of polynomial 𝑚, while the validation error 𝐽_𝑣 is U-shaped.]

• Run on the test set once, at the very end!
Cross-Validation (CV): Evaluation
• 𝑘-fold cross-validation steps:
– Shuffle the dataset and randomly partition training data into 𝑘 groups of approximately equal size
– for 𝑖 = 1 to 𝑘
• Choose the 𝑖-th group as the held-out validation group
• Train the model on all but the 𝑖-th group of data
• Evaluate the model on the held-out group
– Performance scores of the model from 𝑘 runs are averaged.

[Diagram: in each of the 𝑘 runs, a different fold is held out for validation while the remaining 𝑘−1 folds are used for training.]
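The 𝑘-fold procedure above can be written out by hand; this sketch assumes a toy two-class dataset and a simple nearest-class-mean model, both invented for illustration:

```python
import numpy as np

# k-fold cross-validation: shuffle, partition into k groups, train on
# k-1 groups, evaluate on the held-out group, and average the k scores.
def k_fold_accuracy(X, y, k, rng):
    idx = rng.permutation(len(y))          # shuffle the dataset
    folds = np.array_split(idx, k)         # k groups of ~equal size
    scores = []
    for i in range(k):
        val = folds[i]                                     # held-out group
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # "train": compute class means; "evaluate": nearest-mean accuracy
        mu0 = X[train][y[train] == 0].mean(axis=0)
        mu1 = X[train][y[train] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[val] - mu1, axis=1)
                < np.linalg.norm(X[val] - mu0, axis=1)).astype(int)
        scores.append((pred == y[val]).mean())
    return np.mean(scores)                 # average over the k runs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
acc = k_fold_accuracy(X, y, k=5, rng=rng)
```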

Regularization
• Adding a penalty term in the cost function to discourage the
coefficients from reaching large values.

• Ridge regression (weight decay):

𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} (𝑦^(𝑖) − 𝒘^𝑇𝝓(𝒙^(𝑖)))² + 𝜆𝑅(𝒘)

– The first term measures how well model predictions match the training data (approximation); the penalty term prefers simple models and controls the variance of the models (generalization).
– e.g., 𝑅(𝒘) = ‖𝒘‖² = 𝒘^𝑇𝒘
– 𝜆: regularization strength (a hyperparameter)
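For the squared penalty 𝑅(𝒘) = 𝒘^𝑇𝒘, the penalized least-squares problem has the well-known closed-form solution 𝒘 = (𝚽^𝑇𝚽 + 𝜆𝐈)⁻¹𝚽^𝑇𝒚; a sketch on assumed noisy data, comparing a degree-9 fit with and without regularization:

```python
import numpy as np

# Noisy samples of sin(2*pi*x), as in Bishop's running example
# (the exact noise level and sample count here are our choices).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 15)
Phi = np.vander(x, 10, increasing=True)      # degree-9 polynomial features

def ridge(Phi, y, lam):
    d = Phi.shape[1]
    # Closed-form ridge solution: (Phi^T Phi + lam I)^(-1) Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]   # lambda = 0 (overfits)
w_reg = ridge(Phi, y, lam=1e-3)                    # lambda > 0 shrinks w
```

The penalty discourages the huge coefficients that the unregularized degree-9 fit needs in order to thread every noisy point.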

Regularization

[Figures from Bishop: the degree-9 polynomial fit with 𝜆 = 0 (overfits) vs. 𝜆 = 𝑒^{−18} > 0 (a smooth fit).]
Choosing the regularization parameter

[Figure: training error 𝐽_train and validation error 𝐽_𝑣 as functions of 𝜆 — 𝐽_train grows with 𝜆 while 𝐽_𝑣 is U-shaped.]
Classification problem
• Given: Training set
– labeled set of 𝑁 input–output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁}
– 𝑦 ∈ {1, …, 𝐾}

• Goal: Given an input 𝒙, assign it to one of 𝐾 classes

• Examples:
– Image classification
– Speech recognition
–…

Linear Classifier example
• Two class example:
The decision boundary is the line −(3/4)𝑥1 − 𝑥2 + 3 = 0.

Decision rule: if 𝒘^𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1, else 𝒞2

𝒘 = [−3/4, −1]^𝑇,  𝑤0 = 3

[Figure: the line in the (𝑥1, 𝑥2) plane separating region 𝒞1 from region 𝒞2.]
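The decision rule can be checked directly in code; the test points below are chosen for illustration, not taken from the figure:

```python
import numpy as np

# Two-class linear classifier from the slide: w = [-3/4, -1], w0 = 3,
# assign C1 iff w^T x + w0 >= 0 (boundary: -(3/4)x1 - x2 + 3 = 0).
w = np.array([-3.0 / 4.0, -1.0])
w0 = 3.0

def classify(x):
    return "C1" if w @ x + w0 >= 0 else "C2"

c_a = classify(np.array([1.0, 1.0]))   # -0.75 - 1 + 3 = 1.25 >= 0
c_b = classify(np.array([4.0, 3.0]))   # -3 - 3 + 3 = -3 < 0
```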

Square error loss function for classification!
𝐾 = 2. The squared-error loss is not suitable for classification:
– Least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
– Least-squares loss also lacks robustness to noise

𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (𝒘^𝑇𝒙^(𝑖) + 𝑤0 − 𝑦^(𝑖))²
Parametric classifier: Multiclass
• 𝑓(𝒙; 𝑾) = [𝑓1(𝒙; 𝑾), …, 𝑓𝐾(𝒙; 𝑾)]^𝑇

• 𝑾 = [𝒘1 ⋯ 𝒘𝐾] contains one vector of parameters for each class

Parametric classifier: Linear
• 𝑓(𝒙; 𝑾) = [𝑓1(𝒙; 𝑾), …, 𝑓𝐾(𝒙; 𝑾)]^𝑇

• 𝑾 = [𝒘1 ⋯ 𝒘𝐾] contains one vector of parameters for each class
– In linear classifiers, 𝑾 is 𝑑 × 𝐾, where 𝑑 is the number of features
– 𝑾^𝑇𝒙 gives us a vector of scores

• 𝑓(𝒙; 𝑾) contains 𝐾 numbers giving the class scores for the input 𝒙

Linear classifier
• Output obtained from 𝑾^𝑇𝒙 + 𝒃

𝒙 = [𝑥1, …, 𝑥784]^𝑇 (a 28 × 28 image flattened into a vector)
𝑾^𝑇 = [𝒘1 ⋯ 𝒘10]^𝑇 (10 × 784)
𝒃 = [𝑏1, …, 𝑏10]^𝑇

Example
[Figure: an input image is mapped by 𝑾^𝑇𝒙 + 𝒃 to 10 class scores.]

How can we tell whether this 𝑾 and 𝒃 is good or bad?
The bias can also be included in the 𝑾 matrix (by appending a constant 1 to 𝒙).
Multi-class SVM

𝐽(𝑾) = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝐿^(𝑖) + 𝜆𝑅(𝑾)

Hinge loss: 𝐿^(𝑖) = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝑠𝑗 − 𝑠_{𝑦^(𝑖)}),  where 𝑠𝑗 ≡ 𝑓𝑗(𝒙^(𝑖); 𝑾) = 𝒘𝑗^𝑇𝒙^(𝑖)

      = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝒘𝑗^𝑇𝒙^(𝑖) − 𝒘_{𝑦^(𝑖)}^𝑇𝒙^(𝑖))

L2 regularization: 𝑅(𝑾) = Σ_{𝑘=1}^{𝐾} Σ_{𝑙=1}^{𝑑} 𝑤_{𝑙𝑘}²
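A direct implementation of the hinge loss and the regularized objective (the helper names are ours, not from the slides):

```python
import numpy as np

# Hinge loss for one example: sum over the incorrect classes j != y of
# max(0, 1 + s_j - s_y), where s is the vector of class scores.
def hinge_loss(scores, y):
    margins = np.maximum(0, 1 + scores - scores[y])
    margins[y] = 0                      # the sum excludes j = y
    return margins.sum()

# Full objective: J(W) = (1/N) sum_i L_i + lam * R(W), s = W^T x,
# with L2 regularization R(W) = sum of squared entries of W.
def svm_objective(W, X, y, lam):
    data_loss = np.mean([hinge_loss(W.T @ x, yi) for x, yi in zip(X, y)])
    return data_loss + lam * np.sum(W ** 2)
```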
Multi-class SVM loss: Example
3 training examples, 3 classes. With some 𝑾, the scores 𝑾^𝑇𝒙 for the three examples are:

| class | example 1 | example 2 | example 3 |
|-------|-----------|-----------|-----------|
| 1     | 3.2       | 1.3       | 2.2       |
| 2     | 5.1       | 4.9       | 2.5       |
| 3     | −1.7      | 2.0       | −3.1      |

(correct classes: 1, 2, 3, respectively)

𝑠𝑗 = 𝒘𝑗^𝑇𝒙^(𝑖),  𝐿^(𝑖) = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝑠𝑗 − 𝑠_{𝑦^(𝑖)})

𝐿^(1) = max(0, 1 + 5.1 − 3.2) + max(0, 1 − 1.7 − 3.2) = max(0, 2.9) + max(0, −3.9) = 2.9 + 0 = 2.9
𝐿^(2) = max(0, 1 + 1.3 − 4.9) + max(0, 1 + 2.0 − 4.9) = max(0, −2.6) + max(0, −1.9) = 0 + 0 = 0
𝐿^(3) = max(0, 1 + 2.2 − (−3.1)) + max(0, 1 + 2.5 − (−3.1)) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9

(1/𝑁) Σ_{𝑖=1}^{𝑁} 𝐿^(𝑖) = (2.9 + 0 + 12.9)/3 ≈ 5.27
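The worked example can be re-computed in code; the score columns are the CS231n example numbers used on the slide:

```python
import numpy as np

# scores[:, i] holds the three class scores for example i
scores = np.array([[3.2, 1.3, 2.2],
                   [5.1, 4.9, 2.5],
                   [-1.7, 2.0, -3.1]])
y = [0, 1, 2]                          # correct class of each example

losses = []
for i in range(3):
    s = scores[:, i]
    m = np.maximum(0, 1 + s - s[y[i]])
    m[y[i]] = 0                        # exclude j = y_i from the sum
    losses.append(m.sum())
# per-example losses: 2.9, 0.0, 12.9; mean = 15.8 / 3 ≈ 5.27
```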
Some questions

𝐿^(𝑖) = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝑠𝑗 − 𝑠_{𝑦^(𝑖)})

• Q1: What if the sum were over all classes (including 𝑗 = 𝑦^(𝑖))?
• Q2: What if we used the mean instead of the sum?
• Q3: What if we used 𝐿^(𝑖) = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝑠𝑗 − 𝑠_{𝑦^(𝑖)})² ?
• Q4: What are the minimum/maximum possible values of the loss?
• Q5: Why do we use a regularization term?
Other regularization terms

• L2 regularization: Σ_{𝑘=1}^{𝐾} Σ_{𝑙=1}^{𝑑} 𝑤_{𝑙𝑘}²
• L1 regularization: Σ_{𝑘=1}^{𝐾} Σ_{𝑙=1}^{𝑑} |𝑤_{𝑙𝑘}|
• Elastic net (L1 + L2): Σ_{𝑘=1}^{𝐾} Σ_{𝑙=1}^{𝑑} (β𝑤_{𝑙𝑘}² + |𝑤_{𝑙𝑘}|)
Softmax Classifier (Multinomial Logistic Regression)

softmax function: 𝑃(𝑌 = 𝑘 | 𝑋 = 𝒙^(𝑖)) = 𝑒^{𝑠𝑘} / Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗},  where 𝑠𝑘 = 𝑓𝑘(𝒙^(𝑖); 𝑾) = 𝒘𝑘^𝑇𝒙^(𝑖)

• Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood of the correct class (the cross-entropy loss):

𝐿^(𝑖) = −log 𝑃(𝑌 = 𝑦^(𝑖) | 𝑋 = 𝒙^(𝑖)) = −𝑠_{𝑦^(𝑖)} + log Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗}
Softmax classifier loss: example

𝐿^(𝑖) = −log( 𝑒^{𝑠_{𝑦^(𝑖)}} / Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗} )

For scores (3.2, 5.1, −1.7) with correct class 1, the softmax probabilities are ≈ (0.13, 0.87, 0.00), so

𝐿^(1) = −ln 0.13 ≈ 2.04
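Re-computing this example with the natural logarithm (the scores are the same (3.2, 5.1, −1.7) as in the SVM example; the max-shift below is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

# Softmax / cross-entropy loss: L = -s_y + log sum_j e^{s_j}.
def softmax_loss(s, y):
    s = s - s.max()                      # shift scores for stability
    log_probs = s - np.log(np.sum(np.exp(s)))
    return -log_probs[y]

s = np.array([3.2, 5.1, -1.7])
probs = np.exp(s - s.max()) / np.sum(np.exp(s - s.max()))
loss = softmax_loss(s, y=0)              # -ln(0.13) ≈ 2.04
```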
Cross entropy

𝐻(𝑞, 𝑝) = −Σ_𝑥 𝑞(𝑥) log 𝑝(𝑥)

• For the loss of the softmax classifier:
– 𝑝: the estimated class probabilities 𝑒^{𝑠𝑘} / Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗}
– 𝑞: the true distribution
• all probability mass is on the correct class: 𝑞(𝑌 = 𝑦^(𝑖)) = 1 (and 𝑞(𝑌 ≠ 𝑦^(𝑖)) = 0).
Relation to KL divergence

• Since 𝐻(𝑞, 𝑝) = 𝐻(𝑞) + 𝐷_KL(𝑞‖𝑝), and 𝐻(𝑞) is zero for the one-hot true distribution used in the softmax classifier loss:
– Minimizing the cross entropy is equivalent to minimizing the KL divergence between the two distributions (a measure of distance).
– The cross-entropy loss wants the predicted distribution to have all of its mass on the correct class.