Lec 10

So far in the course we have seen some specific
cases of learning classifiers given the training data.
PRNN (PSS) Jan-Apr 2016 p.1/315

We have looked at the optimal Bayes classifier.

We discussed how to estimate class conditional densities for implementing the Bayes classifier.
We have seen various methods (perceptron, least squares, LMS, logistic regression, FLD, etc.) to learn linear classifiers.
We now take a look at the general problem of learning
classifiers.
Learning and generalization
The problem of designing a classifier is essentially one of learning from examples.
Given training data, we want to find an appropriate classifier.
It amounts to searching over a family of classifiers to find one that minimizes error over the training set.
For example, in the least squares approach we are searching over the family of linear classifiers to minimize the squared error.
As we discussed earlier, performance on the training set is not the real issue.
We would like the learnt classifier to perform well on new data.
This is the issue of generalization. Does the learnt classifier generalize well?
In practice one assesses the generalization of the learnt classifier by looking at the error on a separate set of labelled data called the test set.
Since the test set is not used in training, the error on that data can be a good measure of the performance of the learnt classifier.
This means that we should have access to some more labelled data.
We look at these specific issues of practice later on.
Currently our focus is on the theoretical analysis of how to say whether a learning algorithm would generalize well.
We can see the main issue through a simple example of regression.
Suppose we have data $\{(X_i, y_i)\}$, $X_i, y_i \in \mathbb{R}$.
We want to learn a function $f$ so that we can predict $y$ as $f(X)$.
This is a simple regression problem and we can use least squares for it based on the form of $f$.
Suppose we choose a polynomial function
$$f(X) = w_0 + w_1 X + w_2 X^2 + \cdots + w_m X^m$$
As we discussed earlier, we can do this using the linear least squares algorithm.
One question is what $m$ to choose.
We have looked at regularized least squares for this. (It does not tell us the best $m$ but helps learn a model with good generalization.)
There are other methods (e.g., BIC).
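The dependence of the data error on $m$ can be seen in a small experiment. The following is only a sketch: the data (noisy samples of a sine curve), the sample size and the helper names `fit_polynomial` and `data_error` are assumptions made for illustration and are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of a smooth function on [-1, 1].
n = 20
X = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(n)

def fit_polynomial(X, y, m):
    """Least squares fit of f(X) = w_0 + w_1 X + ... + w_m X^m."""
    Phi = np.vander(X, m + 1, increasing=True)  # columns 1, X, ..., X^m
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def data_error(X, y, w):
    """Mean squared error of the fitted polynomial on the given data."""
    Phi = np.vander(X, len(w), increasing=True)
    return float(np.mean((Phi @ w - y) ** 2))

# Training error for each degree m = 0, ..., 9.
train_errors = {m: data_error(X, y, fit_polynomial(X, y, m)) for m in range(0, 10)}
for m, e in train_errors.items():
    print(f"m = {m}: training error = {e:.4f}")
```

Because the degree-$m$ polynomials are contained in the degree-$(m+1)$ polynomials, the training error can only go down as $m$ grows, so a lower training error by itself says little about which $m$ is proper.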

But more fundamentally, let us ask: can our data error tell what $m$ is proper?
Firstly, the fact that we get less error for $m'$ compared to $m$ does not necessarily mean that the $m'$-degree polynomial is a better fit.
If for a particular $m$ we get very low data error, can we say it is good?
We know that if we search over all polynomials, we can never really learn anything. Can we have a formalism that makes this precise?
There are different ways of addressing this issue (MDL, VC theory, etc.).
We discuss one such approach next.
Any learning algorithm takes training data as the input and outputs a specific classifier/function.
For this, it searches over some chosen family of functions to find one that optimizes a chosen criterion function.
Training data $\{(X_i, y_i)\}$ $\to$ Learning Algorithm (searching over $\mathcal{F}$) $\to$ $f \in \mathcal{F}$
The question is: how can we formalize correctness of learning?
For example, a generic approach is what is called the Minimum Description Length principle.
Suppose we want to send the data over a communication channel.
We can send the $2n$ numbers, $X_i, y_i$, using some number of bits.
Or we can send $X_i$, the function $f$ and the errors $y_i - f(X_i)$.
If the fit is good, the errors $y_i - f(X_i)$ would have a small range and we may be able to send them using a smaller number of bits compared to sending $y_i$.
However, we also need to send $f$.
If $f$ is very complex, then what we save in bits by sending errors instead of $y_i$ may be more than offset by the bits needed to send the description of $f$.
Hence we can rate different $f$ by the total number of bits we need.
This can balance the data error and model complexity in a natural way.
For example, suppose we use polynomials as $f$ (with $X_i, y_i \in \mathbb{R}$).
So, to send $f$ we need to send $d + 1$ numbers if we use a polynomial with degree $d$.
Getting a good fit using a high degree polynomial may not really pay!
This intuitively captures our idea that simple models are preferred.
As presented, this approach only asks whether the $f$ is a good fit for the specific data at hand.
We will follow a different statistical approach to address the issue of correctness of learning.
Intuitively, we can learn the correct relationship if we are given a sufficient number of representative examples.
Also, the sufficiency of the number of examples depends on the family $\mathcal{F}$ of classifiers over which we are searching.
We hence need a good notion of the complexity of the learning problem.
We begin with a simple formalism where there is no noise and the goal of learning is well-defined.
A learning problem is defined by giving:
(i) $\mathcal{X}$: input space; often $\mathbb{R}^d$ (feature space)
(ii) $\mathcal{Y} = \{0, 1\}$: output space (set of class labels)
(iii) $\mathcal{C} \subset 2^{\mathcal{X}}$: concept space (family of classifiers). Each $C \in \mathcal{C}$ can also be viewed as a function $C : \mathcal{X} \to \{0, 1\}$, with $C(X) = 1$ iff $X \in C$.
(iv) $S = \{(X_i, y_i), \; i = 1, \ldots, n\}$: the set of examples, where $X_i$ are drawn iid according to some distribution $P_x$ on $\mathcal{X}$ and $y_i = C^*(X_i)$ for some $C^* \in \mathcal{C}$. $C^*$ is called the target concept.
We are considering a 2-class case.
Hence any classifier is a function $C : \mathcal{X} \to \{0, 1\}$.
Thus, $\mathcal{C}$ is a family of classifiers.
We call this the concept space because we can say the system is learning a concept from examples.

The learning algorithm knows $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{C}$; but does not know $C^*$.
It needs to learn the target concept from examples.
We do not know the distribution $P_x$. However, taking that the examples are iid ensures we get representative examples.
We are trying to teach a concept through examples that come from an arbitrary distribution.
Since we have taken $y_i = C^*(X_i), \; \forall i$, there is no noise.
Also, assuming that $C^* \in \mathcal{C}$ means that ideally we can learn the target concept.
We could take $\mathcal{C} = 2^{\mathcal{X}}$.
This means we are searching over the family of all possible (2-class) classifiers.
This is often not viable even theoretically.
So, choosing a particular $\mathcal{C}$ is based on either some knowledge we have about the problem or because of the kind of learning algorithm we have.
For example, we can take $\mathcal{C}$ to be all half-spaces: the family of all linear classifiers.
Suppose we want to learn the concept of medium-build persons based on features of height and weight.
Here $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{Y} = \{0, 1\}$.
We would be given examples (with no errors!) drawn from some arbitrary distribution.
The examples for learning our concept could be the following. [Figure: the examples]
What could be $\mathcal{C}$ in this example? We can use some problem-based intuition.
We can choose $\mathcal{C}$ to be all axis-parallel rectangles.
Now assuming $C^* \in \mathcal{C}$ means that the god-given classifier is also an axis-parallel rectangle.
The examples along with $C^*$ now would be: [Figure: the examples along with $C^*$]
Probably Approximately Correct Learning
Let us now try to define the goal of learning.
Note that each $C \in \mathcal{C}$ can be viewed either as a subset of $\mathcal{X}$ or as a binary-valued function on $\mathcal{X}$.
Let $C_n$ denote the concept or classifier output by the learning algorithm after it processes $n$ iid examples.
For correctness of the learning algorithm we want $C_n$ to be close to $C^*$ as $n$ becomes large.
The closeness of $C_n$ to $C^*$ is in terms of classifying samples drawn from $\mathcal{X}$ according to $P_x$.
We define the error of $C_n$ by
$$\mathrm{err}(C_n) = P_x(C_n \Delta C^*) = \mathrm{Prob}[\{X \in \mathcal{X} : C_n(X) \neq C^*(X)\}]$$
where, for sets $C_n$, $C^*$,
$$C_n \Delta C^* = (C_n - C^*) \cup (C^* - C_n).$$
The $\mathrm{err}(C_n)$ is the probability that on a random sample, drawn according to $P_x$, the classifications of $C_n$ and $C^*$ differ.
Essentially, we want $\mathrm{err}(C_n)$ to become zero as $n \to \infty$.
However, $\mathrm{err}(C_n)$ is a random variable because $C_n$ is a function of the random samples $X_1, \ldots, X_n$.
Hence we have to properly define the sense in which $\mathrm{err}(C_n)$ converges as $n \to \infty$.
We say a learning algorithm Probably Approximately Correctly (PAC) learns a concept class $\mathcal{C}$ if given any $\epsilon, \delta > 0$, $\exists N < \infty$ such that
$$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] < \delta$$
for all $n > N$ and for any distribution $P_x$ and any $C^*$.
The probability above is with respect to the distribution of $n$-tuples of iid samples drawn according to $P_x$ on $\mathcal{X}$.
The $P_x$ is arbitrary. But the distribution is the same for training and testing, which is fair to the algorithm.
In short, an algorithm PAC learns $\mathcal{C}$ if $\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] < \delta$ for sufficiently large $n$ and any $P_x$.
If $\mathrm{err}(C_n) \leq \epsilon$, then $C_n$ is approximately correct.
So, what the above says is that the classifier output by the algorithm after seeing $n$ random examples, $C_n$, is approximately correct with a high probability.
The $\epsilon$ and $\delta$ are called the accuracy and confidence parameters respectively.
Let us look at the example of learning the concept of medium-build persons.
Here, $\mathcal{X} = \mathbb{R}^2$.
The strategy of the learning algorithm is as follows.
The algorithm outputs a classifier which correctly classifies all examples.
If there is more than one $C \in \mathcal{C}$ that is consistent with all examples, we output the smallest such $C$.
For finite sets, smallest is in terms of the number of points; for other sets it is in terms of the areas of the sets. (This will do for our purpose here.)
We will look at two different $\mathcal{C}$ and find out what the algorithm does.
We take $\mathcal{C}_1$ to be the set of all axis-parallel rectangles.
We take $\mathcal{C}_2$ to be $2^{\mathcal{X}}$; that is, the set of all possible classifiers.
We assume, as earlier, that $C^*$ is an axis-parallel rectangle.
Note that $C^*$ belongs to both $\mathcal{C}_1$ and $\mathcal{C}_2$.
All our examples are classified according to $C^*$.
Hence an $(x, y) \in \mathbb{R}^2$ is a positive example if it is in $C^*$ and a negative example otherwise.
The examples along with $C^*$ in this problem are: [Figure: the examples along with $C^*$]
First consider $\mathcal{C}_1$.
The smallest $C \in \mathcal{C}_1$ consistent with all examples would be the smallest axis-parallel rectangle enclosing all the positive examples seen so far.
The concept output by the algorithm (when using $\mathcal{C}_1$) would be the following. [Figure: the learnt rectangle $C_n$ inside $C^*$]
Thus, under the strategy of our learning algorithm, for all $n$, the $C_n$ would always be inside $C^*$.
Now let us show that this is a PAC learning algorithm.
Whenever any example is classified as positive by $C_n$ it would also be classified as positive by $C^*$.
Since $C_n$ is also an axis-parallel rectangle which is inside $C^*$, the $C_n \Delta C^*$ would be the annular region between the two rectangles.
Hence the set of points of $\mathcal{X}$ where $C_n$ makes errors is this annular region.
Hence, $\mathrm{err}(C_n)$ is the $P_x$-probability of this annular region.
Note that we are not really bothered about the area of this annular region; we are only interested in the probability mass of this region under $P_x$.
Now, given an $\epsilon > 0$, we have to bound the probability that $\mathrm{err}(C_n) > \epsilon$.
The error is greater than $\epsilon$ only if the probability mass (under $P_x$) of the annular region is greater than $\epsilon$.
When does this event $\mathrm{err}(C_n) > \epsilon$ occur?
Only when none of the examples seen happen to be in the annular region.
Why? Otherwise, the rectangle learnt by our algorithm would have been closer to $C^*$.
Hence the probability of the event $\mathrm{err}(C_n) > \epsilon$ is the same as the probability that when $n$ iid examples are drawn according to $P_x$ none of them came from a subset of $\mathcal{X}$ that has $P_x$-probability at least $\epsilon$.
That is, all examples came from a subset of probability at most $(1 - \epsilon)$.
The probability of this happening is at most $(1 - \epsilon)^n$.
Hence we have
$$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] \leq (1 - \epsilon)^n$$
Let $N$ be such that $(1 - \epsilon)^n < \delta$, for all $n > N$.
The required $N$ is
$$N \geq \frac{\ln(\delta)}{\ln(1 - \epsilon)}$$
(bound on number of examples).
For this $N$, we have
$$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] \leq \delta, \quad \forall n \geq N$$
showing the algorithm PAC learns the concept class.
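The learning strategy for axis-parallel rectangles and the sample-size expression stated above can be sketched in a few lines. This is only an illustration under assumptions that are not in the lecture: $P_x$ is taken to be uniform on the unit square, the target rectangle `target` is made up, and the function names are hypothetical. Under a uniform $P_x$ the error of the learnt rectangle is exactly the area of the annular region.

```python
import math
import numpy as np

def sample_size(eps, delta):
    """Smallest integer N with (1 - eps)^N <= delta, i.e. N >= ln(delta) / ln(1 - eps)."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

def tightest_rectangle(points, labels):
    """Smallest axis-parallel rectangle enclosing the positive examples.

    Returns (x_min, x_max, y_min, y_max), or None if no positive example
    was seen (the learner then outputs the empty concept).
    """
    pos = points[labels == 1]
    if len(pos) == 0:
        return None
    return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

def area(rect):
    return 0.0 if rect is None else (rect[1] - rect[0]) * (rect[3] - rect[2])

# Assumed setup: P_x uniform on the unit square, target C* = [0.2, 0.8] x [0.3, 0.7].
target = (0.2, 0.8, 0.3, 0.7)
rng = np.random.default_rng(0)

def learn(n):
    """Draw n iid examples labelled by the target and return the learnt rectangle C_n."""
    pts = rng.random((n, 2))
    lab = ((pts[:, 0] >= target[0]) & (pts[:, 0] <= target[1]) &
           (pts[:, 1] >= target[2]) & (pts[:, 1] <= target[3])).astype(int)
    return tightest_rectangle(pts, lab)

def err(rect):
    # C_n lies inside C*, so under uniform P_x the error is the area of the annular region.
    return area(target) - area(rect)

print("N for eps = 0.1, delta = 0.05:", sample_size(0.1, 0.05))
for n in (10, 100, 1000, 10000):
    print(n, round(err(learn(n)), 4))
```

The learnt rectangle always lies inside the target, and the printed error shrinks as more examples are seen.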
Now let us consider the same algorithm with concept class $\mathcal{C}_2 = 2^{\mathcal{X}}$.
Here we are searching over all possible 2-class classifiers.
So, intuitively, we do not expect the algorithm to be able to learn anything.
There is too much flexibility in the bag of classifiers over which we are searching.
Let us show this formally.
What would be $C_n$ now?
After seeing $n$ examples, the smallest set in $\mathcal{C}_2$ that is consistent with all examples is the set consisting of all the positive examples seen so far!
Now the algorithm simply remembers all the positive examples seen.
This happened because every possible finite subset of $\mathcal{X}$ is in our concept class.
So, now, $C_n \Delta C^*$ would be the axis-parallel rectangle $C^*$ minus some finite number of points from it.
So, under any continuous $P_x$,
$$\mathrm{err}(C_n) = P_x(C_n \Delta C^*) = P_x(C^*).$$
Hence, for any $\epsilon < P_x(C^*)$, $\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] = 1$ for all $n$.
Thus, the algorithm cannot PAC learn with $\mathcal{C}_2$.
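A small simulation shows the same effect. It is a sketch under assumed conditions (uniform $P_x$ on the unit square and a made-up target rectangle with $P_x(C^*) = 0.36$); `memorizing_learner` and `estimated_error` are hypothetical names. The error is estimated on fresh samples, so the numbers are Monte Carlo estimates rather than exact values.

```python
import numpy as np

rng = np.random.default_rng(1)
target = (0.2, 0.8, 0.2, 0.8)  # assumed C*, with P_x(C*) = 0.36 under uniform P_x

def in_target(pts):
    """Boolean labels given by the target rectangle."""
    return ((pts[:, 0] >= target[0]) & (pts[:, 0] <= target[1]) &
            (pts[:, 1] >= target[2]) & (pts[:, 1] <= target[3]))

def memorizing_learner(pts, labels):
    """Smallest consistent set in 2^X: exactly the positive examples seen."""
    remembered = {tuple(p) for p in pts[labels]}
    return lambda q: np.array([tuple(p) in remembered for p in q])

def estimated_error(classifier, n_test=20000):
    """Fraction of fresh samples on which the classifier and the target disagree."""
    test = rng.random((n_test, 2))
    return float(np.mean(classifier(test) != in_target(test)))

errors = {}
for n in (10, 1000, 100000):
    train = rng.random((n, 2))
    errors[n] = estimated_error(memorizing_learner(train, in_target(train)))
    print(n, errors[n])
```

The estimated error stays near $P_x(C^*)$ no matter how many examples are memorized, because a fresh sample from a continuous distribution almost surely does not coincide with a remembered point.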
This example clearly illustrates the difficulty of learning from examples if the bag of classifiers being considered is too large.
The largeness is not in terms of the number of elements in our concept class.
Both $\mathcal{C}_1$ and $\mathcal{C}_2$ contain an uncountably infinite number of classifiers.
We would later on define an appropriate quantity to quantify the sense in which one concept class can be said to be bigger than (or more complex than) another.
At this point we can still see how $\mathcal{C}_1$ is smaller than $\mathcal{C}_2$.
Since every axis-parallel rectangle can be specified by four quantities, this class can be parameterized by four parameters.
However, there is no such finite parameterization for $\mathcal{C}_2 = 2^{\mathbb{R}^2}$.
Also, the strategy of our algorithm can be coded efficiently in the case of $\mathcal{C}_1$.
The concept of PAC learnability is interesting.
It allows one to properly define what correctness of learning is and allows us to ask questions like whether a given algorithm learns correctly.
As we have seen in our example, we can also bound the number of examples needed to learn to a given level of accuracy and confidence.
Thus we can appreciate the relative complexities of different learning problems.
However, PAC learnability deals with ideal learning situations.
We assume there is a (god-given) $C^*$ and that it is in our $\mathcal{C}$.
Also, we assume that examples are noise free and are perfectly classified.
Next we consider an extension of this framework that is relevant for realistic learning scenarios.
In our new framework we are given
$\mathcal{X}$: input space (as earlier, feature space)
$\mathcal{Y}$: output space (as earlier, set of class labels)
$\mathcal{H}$: hypothesis space (family of classifiers). Each $h \in \mathcal{H}$ is a function $h : \mathcal{X} \to \mathcal{A}$, where $\mathcal{A}$ is called the action space.
Training data: $\{(X_i, y_i), \; i = 1, \ldots, n\}$ drawn iid according to some distribution $P_{xy}$ on $\mathcal{X} \times \mathcal{Y}$.
Some Comments
We have replaced $\mathcal{C}$ with $\mathcal{H}$.
If we take $\mathcal{A} = \mathcal{Y}$ then it is the same as earlier.
But the freedom in choosing $\mathcal{A}$ allows for taking care of many situations.
For example, even when $\mathcal{Y} = \{0, 1\}$, we can take $\mathcal{A} = \mathbb{R}$ (e.g., learning discriminant functions).
Now, e.g., the sign of $h(X)$ may denote the class and its magnitude may give some measure of confidence in the assigned class.
Now we draw examples from $\mathcal{X} \times \mathcal{Y}$ according to $P_{xy}$.
This allows for noise in the training data.
For example, when class conditional densities overlap, the same $X$ can come from different classes with different probabilities.
We can always factorize $P_{xy} = P_x P_{y|x}$. If $P_{y|x}$ is a degenerate distribution then it is the same as earlier: we draw iid samples from $\mathcal{X}$ and each point is essentially classified by the target classifier.
However, having examples drawn from $\mathcal{X} \times \mathcal{Y}$ using a distribution allows for many more scenarios.
As before, the learning machine outputs a hypothesis, $h_n \in \mathcal{H}$, given the training data consisting of $n$ examples.
However, now there is no notion of a target concept/hypothesis.
There may be no $h \in \mathcal{H}$ which is consistent with all examples.
Hence we use the idea of loss functions to define the goal of learning.
Loss function
Loss function: $L : \mathcal{Y} \times \mathcal{A} \to \mathbb{R}^+$.
The idea is that $L(y, h(X))$ is the loss suffered by $h \in \mathcal{H}$ on a (random) sample $(X, y) \in \mathcal{X} \times \mathcal{Y}$.
More generally we can let the loss depend on $X$ also explicitly and can write $L(X, y, h(X))$ for the loss function.
By convention we assume that the loss function is non-negative.
Now we can look for hypotheses that have low average loss over samples drawn according to $P_{xy}$.
Risk function
Define a function, $R : \mathcal{H} \to \mathbb{R}^+$, by
$$R(h) = E[L(y, h(X))] = \int L(y, h(X)) \, dP_{xy}$$
$R$ is called the risk function.
Risk is the expectation of loss, where the expectation is with respect to $P_{xy}$.
Hence, $h$ with a low $R(h)$ is a better classifier.
We can now define the goal of learning as minimization of risk.
Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$.
We define the goal of learning as finding $h^*$, the global minimizer of risk.
Risk minimization is a very general strategy adopted by most machine learning algorithms.
However, note that we may not have any knowledge of $P_{xy}$.
How can we find the minimizer of risk? Minimization of $R(\cdot)$ directly is not feasible.
Empirical Risk function
Define the empirical risk function, $\hat{R}_n : \mathcal{H} \to \mathbb{R}^+$, by
$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i))$$
This is the sample mean estimator of risk obtained from $n$ iid samples.
Let $\hat{h}^*_n$ be the global minimizer of empirical risk, $\hat{R}_n$:
$$\hat{h}^*_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$$
The loss function is known to us.
Hence, given any $h$ we can calculate $\hat{R}_n(h)$.
Hence, we can (in principle) find $\hat{h}^*_n$ by optimization methods.
Approximating $h^*$ by $\hat{h}^*_n$ is the basic idea of the empirical risk minimization strategy, which is used in most ML algorithms.
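As a concrete, hedged illustration of empirical risk minimization, the sketch below searches a small family of threshold classifiers on the real line under the 0-1 loss. The data distribution (uniform $X$, labels flipped with probability 0.1), the grid of thresholds and the names `empirical_risk` and `make_h` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed P_xy: X uniform on [0, 1], y = 1 if X > 0.5, each label flipped w.p. 0.1.
n = 500
X = rng.random(n)
y = (X > 0.5).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

def zero_one_loss(y, a):
    """L(y, a) = 1 if y != a, else 0 (elementwise)."""
    return (y != a).astype(float)

def empirical_risk(h, X, y, loss):
    """Sample mean of the loss of h over the n examples."""
    return float(np.mean(loss(y, h(X))))

# Hypothesis family H: threshold classifiers h_t(X) = 1 if X > t, for t on a grid.
thresholds = np.linspace(0.0, 1.0, 101)

def make_h(t):
    return lambda X: (X > t).astype(int)

# Exhaustive search over H for the minimizer of empirical risk.
risks = np.array([empirical_risk(make_h(t), X, y, zero_one_loss) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold = {t_hat:.2f}, empirical risk = {risks.min():.3f}")
```

Here the family is finite, so the minimizer is found by direct search; for richer families one would need an optimization method suited to the chosen loss.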
Is $\hat{h}^*_n$ a good approximator of $h^*$, the minimizer of true risk (for large $n$)?
This is the question of consistency of empirical risk minimization.
Thus, we can say that any learning problem has two parts.
The optimization part: find $\hat{h}^*_n$, the minimizer of $\hat{R}_n$.
The statistical part: is $\hat{h}^*_n$ a good approximator of $h^*$?
We will now look at some examples of loss functions.
Note that the loss function is chosen by us; it is part of the specification of the learning problem.
The loss function is intended to capture how we would like to evaluate the performance of the classifier.
The 0-1 loss function
Consider a 2-class classification problem.
Let $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \mathcal{Y}$.
Now, the 0-1 loss function is defined by
$$L(y, h(X)) = I_{[y \neq h(X)]}$$
where $I_{[A]}$ denotes the indicator of event $A$.
Risk is the expectation of loss.
Hence, $R(h) = \mathrm{Prob}[y \neq h(X)]$; the risk is the probability of misclassification.
So, $h^*$ minimizes the probability of misclassification. A very desirable goal. (Bayes classifier)
Here we assumed that the learning algorithm searches over a class of binary-valued functions on $\mathcal{X}$.
We can extend this to, e.g., discriminant function learning.
We take $\mathcal{A} = \mathbb{R}$ (now $h(X)$ is a discriminant function).
We can define the 0-1 loss now as
$$L(y, h(X)) = I_{[y \neq \mathrm{sgn}(h(X))]}$$
Having any fixed misclassification costs is essentially the same as 0-1 loss.
Even if we take $\mathcal{A} = \mathbb{R}$, the 0-1 loss compares only the sign of $h(X)$ with $y$. The magnitude of $h(X)$ has no effect on the loss.
Here, we cannot trade good performance on some data with bad performance on others.
This makes the 0-1 loss function more robust to noise in classification labels.
While 0-1 loss is an intuitively appealing performance measure, minimizing empirical risk here is hard.
Note that the 0-1 loss function is non-differentiable, which makes the empirical risk function also non-differentiable.
The empirical risk minimization is also a non-convex optimization problem here.
Hence many other loss functions are often used in Machine Learning.
Squared error loss
The squared error loss function is defined by
$$L(y, h(X)) = (y - h(X))^2$$
As is easy to see, the linear least squares method that we considered is empirical risk minimization with the squared error loss function.
Here, for a 2-class classification problem, we can take $\mathcal{Y}$ as $\{0, 1\}$ or $\{+1, -1\}$. We take $\mathcal{A} = \mathbb{R}$ so that each $h$ is a discriminant function.
As we know, we can use this for regression problems also and then we take $\mathcal{Y} = \mathbb{R}$.
Another interesting scenario here is to take $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = [0, 1]$.
Then each $h$ can be interpreted as a posterior probability (of class-1) function.
As we know, the minimizer of the expectation of squared error loss (the risk here) is the posterior probability function.
So, risk minimization would now look for a function in $\mathcal{H}$ that is a good approximation for the posterior probability function.
The empirical risk minimization under squared error loss is a convex optimization problem for linear models (when $h$ is linear in its parameters).
The squared error loss is extensively used in many learning algorithms.
Soft margin loss or hinge loss
We take $\mathcal{Y} = \{+1, -1\}$ and $\mathcal{A} = \mathbb{R}$. The loss function is given by
$$L(y, h(X)) = \max(0, \; 1 - y h(X))$$
Here, if $y h(X) > 0$ then classification is correct and if $y h(X) \geq 1$, loss is zero.
This also results in convex optimization for empirical risk minimization.
We can look at the other loss functions as some convex approximations of the 0-1 loss function as follows.
Let us take $\mathcal{Y} = \{+1, -1\}$.
We can write all loss functions as functions of the single variable $y h(X)$.
For 0-1 loss, $L(y, h(X))$ is one if $y h(X)$ is negative and zero otherwise.
The squared error loss can be written (since $y^2 = 1$) as
$$L(y, h(X)) = (y - h(X))^2 = (1 - y h(X))^2$$
The hinge loss is defined as a function of $y h(X)$:
$$L(y, h(X)) = \max(0, \; 1 - y h(X))$$
We can plot all the functions as follows.
[Figure: 0-1 loss, square loss and hinge loss, $L(y, f(x))$, plotted against $y f(x)$.]
(Here we plot $y h(X)$ on the x-axis and $L(y, h(X))$ on the y-axis.)
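The three losses can be written as functions of the margin $z = y h(X)$ and compared numerically. This is a minimal sketch; the function names and the grid of margin values are chosen here for illustration.

```python
import numpy as np

def zero_one(z):
    """0-1 loss as a function of the margin z = y h(X): one if z is negative."""
    return (z < 0).astype(float)

def squared(z):
    """Squared error loss for y in {+1, -1}: (1 - z)^2."""
    return (1.0 - z) ** 2

def hinge(z):
    """Hinge loss: max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 2.5, 10)
for zi, a, b, c in zip(z, zero_one(z), squared(z), hinge(z)):
    print(f"z = {zi:5.2f}  0-1 = {a:.0f}  squared = {b:.3f}  hinge = {c:.3f}")
```

Both the squared error loss and the hinge loss lie on or above the 0-1 loss for every value of the margin, which is the sense in which they act as convex surrogates for it. Note also that the squared error loss grows again for margins larger than 1, while the hinge loss stays at zero there.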

There are also many other loss functions used in regression problems.
The squared error loss can be used. But it is very sensitive to data errors such as outliers.
So, other loss functions are used for better robustness during regression.
We list a couple of them here.
The $L_1$ loss is defined by
$$L(y, h(X)) = |y - h(X)|$$
This is more robust than the square loss.
Another similar one is the $\epsilon$-insensitive loss defined by
$$L(y, h(X)) = \max(0, \; |y - h(X)| - \epsilon)$$
Here, if we make an error less than $\epsilon$ then the loss is zero.
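The sensitivity to outliers can be seen even in the simplest regression setting, where the hypothesis is a single constant prediction $c$. The sketch below uses made-up targets with one gross outlier and a grid search over $c$; the data, the grid and the helper `best_constant` are assumptions for illustration only.

```python
import numpy as np

def l1_loss(y, a):
    return np.abs(y - a)

def eps_insensitive_loss(y, a, eps):
    return np.maximum(0.0, np.abs(y - a) - eps)

def squared_loss(y, a):
    return (y - a) ** 2

# Hypothetical targets with one gross outlier; the hypothesis is a constant prediction c.
y = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 50.0])
candidates = np.linspace(0.0, 50.0, 5001)

def best_constant(loss):
    """Constant c on the grid that minimizes the empirical risk under the given loss."""
    risks = [np.mean(loss(y, c)) for c in candidates]
    return float(candidates[int(np.argmin(risks))])

c_sq = best_constant(squared_loss)
c_l1 = best_constant(l1_loss)
c_eps = best_constant(lambda y, a: eps_insensitive_loss(y, a, 0.1))
print("squared:", c_sq, " L1:", c_l1, " eps-insensitive:", c_eps)
```

Under the squared error loss the best constant is pulled towards the outlier (it is the sample mean), whereas under the $L_1$ and $\epsilon$-insensitive losses it stays with the bulk of the data.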
As we saw, there are many different loss functions one can think of.
Many of them also make the empirical risk minimization problem efficiently solvable.
Thus a choice of loss function can efficiently tackle the optimization part of learning.
We consider some such algorithms later in this course.
Now, let us get back to the statistical question that we started with.
Consistency of Empirical Risk Minimization
Our objective is to find $h^*$, the minimizer of risk $R(\cdot)$.
Since we do not know $R$, we minimize the empirical risk, $\hat{R}_n$, instead and thus find $\hat{h}^*_n$.
We want $h^*$ and $\hat{h}^*_n$ to be close. More precisely we are interested in the question
$$R(\hat{h}^*_n) \to R(h^*) \; ?$$
(As earlier, we know $\hat{h}^*_n$ is random and hence we take the above as convergence in probability.)
What is the intuitive reason for using empirical risk minimization?
Sample mean is a good estimator and hence, with large $n$, $\hat{R}_n(h)$ converges to $R(h)$, for any $h \in \mathcal{H}$.
This is the (weak) law of large numbers.
But this does not necessarily mean $R(\hat{h}^*_n)$ converges to $R(h^*)$.
We are interested in: does the true risk of the minimizer of empirical risk converge to the global minimum of risk?
Let us consider a specific scenario to appreciate this.
We take $\mathcal{A} = \mathcal{Y} = \{0, 1\}$. We use 0-1 loss.
Suppose the examples are drawn according to $P_x$ on $\mathcal{X}$ and classified according to a $\tilde{h} \in \mathcal{H}$.
That is, $P_{xy} = P_x P_{y|x}$ and $P_{y|x}$ is a degenerate distribution.
Now the global minimum of risk is zero and $R(\tilde{h}) = 0$.
Note that now the risk of any $h$ is the same as $P_x(h \Delta \tilde{h})$.
That is, this scenario is the same as what we considered under the PAC framework.
Now, under 0-1 loss, the global minimum of empirical risk is also zero.
For any $n$, there may be many $h$ (other than $\tilde{h}$) with $\hat{R}_n(h) = 0$.
Hence our optimization algorithm can only use some general rule to output one such hypothesis.
Consider $h_1 : \mathcal{X} \to \mathcal{Y}$ with $h_1(X_i) = y_i, \; \forall (X_i, y_i) \in S$, and $h_1(X) = 1$ for all other $X$.
Then $\hat{R}_n(h_1) = 0$! It is a global minimizer of empirical risk. But it is obvious that $h_1$ is not a good classifier.
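The gap between the empirical risk and the true risk of $h_1$ is easy to exhibit numerically. The sketch below assumes a one-dimensional problem with $P_x$ uniform on $[0, 1]$ and a made-up target that is 1 on $[0, 0.3)$; the true risk is estimated on fresh samples, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(X):
    # Assumed target: class 1 on [0, 0.3) only, under P_x uniform on [0, 1].
    return (X < 0.3).astype(int)

n = 200
X_train = rng.random(n)
y_train = target(X_train)

def h1(X):
    """Agrees with the labels on the training points and outputs 1 everywhere else."""
    lookup = dict(zip(X_train.tolist(), y_train.tolist()))
    return np.array([lookup.get(x, 1) for x in np.asarray(X).tolist()])

def risk_01(h, X, y):
    """Average 0-1 loss of h on the given samples."""
    return float(np.mean(h(X) != y))

X_new = rng.random(20000)
emp_risk = risk_01(h1, X_train, y_train)
true_risk_estimate = risk_01(h1, X_new, target(X_new))
print("empirical risk:", emp_risk, " estimated true risk:", true_risk_estimate)
```

The empirical risk of $h_1$ is exactly zero, while its estimated true risk is close to the probability that the target outputs 0, however large $n$ is taken.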
Suppose we took $\mathcal{H} = 2^{\mathcal{X}}$.
This is the same as the example we considered earlier and hence we know that $P_x(h^* \Delta \hat{h}_n^*)$ will not go to zero with $n$.
Thus, here, $R(\hat{h}_n^*)$ will not converge to $R(h^*)$.
Note that the law of large numbers still implies that $\hat{R}_n(h)$ converges to $R(h)$, $\forall h$.
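The behaviour of a memorizing classifier like $h_1$ can be checked numerically. The sketch below is only an illustration under assumed specifics that are not in the lecture: $X$ uniform on $[0, 1]$ and target $h^*(x) = 1$ iff $x > 0.5$. The true risk is estimated on a fresh sample.

```python
import random

def target(x):
    # Assumed target concept h*: label 1 iff x > 0.5.
    return 1 if x > 0.5 else 0

def make_memorizer(sample):
    # h1: return the stored label on training points, and 1 everywhere else.
    table = {x: y for x, y in sample}
    return lambda x: table.get(x, 1)

def risk(h, points):
    # Average 0-1 loss of h against the target on the given points.
    return sum(h(x) != target(x) for x in points) / len(points)

def experiment(n, n_test=20000, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    h1 = make_memorizer([(x, target(x)) for x in xs])
    empirical = risk(h1, xs)  # risk on the training sample
    fresh = [rng.random() for _ in range(n_test)]
    estimated_true = risk(h1, fresh)  # Monte Carlo estimate of the true risk
    return empirical, estimated_true

for n in (10, 100, 1000, 10000):
    emp, est = experiment(n)
    print(n, emp, round(est, 3))
```

In this setup the empirical risk of $h_1$ is zero for every $n$, while its true risk stays near $P_x(x \le 0.5) = 0.5$ because $h_1$ outputs 1 on almost every unseen point.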
If functions like $h_1$ are in our $\mathcal{H}$ then empirical risk minimization (ERM) may not yield good classifiers.
If $\mathcal{H}$ contains all possible functions, then this is certainly the case as we saw in our example.
Functions like $h_1$ could be highly non-smooth and hence one way is to impose some smoothness conditions on the learnt function (e.g., regularization).
We saw that though $\hat{R}_n(h) \to R(h)$, $\forall h$, the risk of the minimizer of empirical risk need not converge to the global minimum of risk.

And this happens based on our choice of $\mathcal{H}$.
Hence, the statistical question we need to ask is: for what $\mathcal{H}$ is empirical risk minimization consistent?


That is, given any $\epsilon, \delta > 0$, does there exist $N < \infty$ such that
$\mathrm{Prob}[|R(\hat{h}_n^*) - R(h^*)| > \epsilon] \le \delta$ for all $n \ge N$?
This is the question we address next.

We would like the algorithm to satisfy: $\forall \epsilon, \delta > 0$, $\exists N < \infty$, such that
$\mathrm{Prob}[|R(\hat{h}_n^*) - R(h^*)| > \epsilon] \le \delta, \ \forall n \ge N$
In addition, we would also like to have
$\mathrm{Prob}[|\hat{R}_n(\hat{h}_n^*) - R(\hat{h}_n^*)| > \epsilon] \le \delta, \ \forall n \ge N$
We would like to (approximately) know the true risk of the learnt classifier.
For what kind of $\mathcal{H}$ do these hold?
As we already saw, the law of large numbers (that $\hat{R}_n(h) \to R(h)$, $\forall h$) is not enough.
As it turns out, what we need is that the convergence under the law of large numbers be uniform over $\mathcal{H}$.
Such uniform convergence is necessary and sufficient for consistency of empirical risk minimization.
Law of large numbers says that sample mean

converges to expectation of the random variable.

Given any $h$, $\epsilon, \delta > 0$, $\exists N < \infty$ such that
$\mathrm{Prob}[|\hat{R}_n(h) - R(h)| > \epsilon] \le \delta, \ \forall n \ge N$
The $N$ that exists can depend on $\epsilon, \delta$ and also on $h$.
The convergence is said to be uniform if the $N$ depends only on $\epsilon, \delta$ and not on $h$.
That is, for a given $\epsilon, \delta$ the same $N(\epsilon, \delta)$ works for all $h \in \mathcal{H}$.
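As a rough numerical illustration of uniform convergence, one can estimate how often the worst-case deviation over a small hypothesis family exceeds $\epsilon$. The setup is assumed for this sketch only and is not from the lecture: $X$ uniform on $[0, 1]$, target $h^*(x) = 1$ iff $x > 0.5$, and a hypothetical finite family of 21 threshold classifiers $h_t(x) = 1$ iff $x > t$, for which $R(h_t) = |t - 0.5|$ under 0-1 loss.

```python
import random

def sup_deviation(n, thresholds, rng):
    # Draw n points and return max over h_t of |empirical risk - true risk|.
    xs = [rng.random() for _ in range(n)]
    worst = 0.0
    for t in thresholds:
        # h_t errs exactly where it disagrees with the assumed target.
        errors = sum((x > t) != (x > 0.5) for x in xs)
        empirical = errors / n
        true_risk = abs(t - 0.5)  # valid only for this assumed setup
        worst = max(worst, abs(empirical - true_risk))
    return worst

def exceed_fraction(n, eps, trials=100, seed=0):
    # Monte Carlo estimate of Prob[ sup_h |R_n(h) - R(h)| > eps ].
    rng = random.Random(seed)
    thresholds = [k / 20 for k in range(21)]
    hits = sum(sup_deviation(n, thresholds, rng) > eps for _ in range(trials))
    return hits / trials

for n in (50, 200, 800):
    print(n, exceed_fraction(n, eps=0.05))
```

For this small family the estimated probability should shrink as $n$ grows, which is what a single $N(\epsilon, \delta)$ valid for all $h$ requires. The family of all functions from the earlier example would not behave this way.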
To sum up, $\hat{R}_n(h)$ converges (in probability) to $R(h)$ uniformly over $\mathcal{H}$ if $\forall \epsilon, \delta > 0$, $\exists N(\epsilon, \delta) < \infty$ such that
$\mathrm{Prob}\left[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \right] \le \delta, \ \forall n \ge N(\epsilon, \delta)$
This implies that the same $N(\epsilon, \delta)$ works for all $h \in \mathcal{H}$.
It is easy to show that uniform convergence is sufficient for consistency of empirical risk minimization.
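One standard way to see the sufficiency is sketched below, assuming that both the empirical risk minimizer $\hat{h}_n^*$ and the risk minimizer $h^*$ exist in $\mathcal{H}$. The middle bracket is non-positive because $\hat{h}_n^*$ minimizes $\hat{R}_n$ over $\mathcal{H}$.

```latex
\begin{align*}
0 \le R(\hat{h}_n^*) - R(h^*)
 &= \big[ R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*) \big]
  + \big[ \hat{R}_n(\hat{h}_n^*) - \hat{R}_n(h^*) \big]
  + \big[ \hat{R}_n(h^*) - R(h^*) \big] \\
 &\le 2 \sup_{h \in \mathcal{H}} \big| \hat{R}_n(h) - R(h) \big|
\end{align*}
```

Hence $\mathrm{Prob}[|R(\hat{h}_n^*) - R(h^*)| > \epsilon] \le \mathrm{Prob}[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon/2] \le \delta$ for all $n \ge N(\epsilon/2, \delta)$.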

