PRNN (PSS) Jan-Apr 2016 p.1/315

So far in the course we have seen some specific cases of learning classifiers given the training data.

We have looked at the optimal Bayes classifier.

We discussed how to estimate class conditional densities for implementing the Bayes classifier.

We have seen various methods (perceptron, least squares, LMS, logistic regression, FLD, etc.) to learn linear classifiers.

We now take a look at the general problem of learning classifiers.

Learning and generalization

The problem of designing a classifier is essentially one of learning from examples.

Given training data, we want to find an appropriate classifier.

It amounts to searching over a family of classifiers to find one that minimizes error over the training set.

For example, in the least squares approach we are searching over the family of linear classifiers to minimize the squared error.

As we discussed earlier, performance on the training set is not the real issue.

We would like the learnt classifier to perform well on new data.

This is the issue of generalization. Does the learnt classifier generalize well?

In practice one assesses the generalization of the learnt classifier by looking at the error on a separate set of labelled data called the test set.

Since the test set is not used in training, error on that data can be a good measure of the performance of the learnt classifier.

This means that we should have access to some more labelled data.

We look at these specific issues of practice later on. Currently our focus is on the theoretical analysis of how to say whether a learning algorithm would generalize well.

We can see the main issue through a simple example of regression.

Suppose we have data $\{(X_i, y_i)\}$, $X_i \in \mathbb{R}$, $y_i \in \mathbb{R}$.

We want to learn a function $f$ so that we can predict $y$ as $f(X)$.

This is a simple regression problem and we can use least squares for it based on the form of $f$.

Suppose we choose a polynomial function

$f(X) = w_0 + w_1 X + w_2 X^2 + \cdots + w_m X^m$

As we discussed earlier, we can do this using the linear least squares algorithm.

One question is what $m$ to choose.

We have looked at regularized least squares for this. (It does not tell us the best $m$ but helps learn a model with good generalization.)

There are other methods (e.g., BIC).


But more fundamentally, let us ask: can our data error tell us what $m$ is proper?

Firstly, the fact that we get less error for $m'$ compared to $m$ does not necessarily mean that the degree-$m'$ polynomial is a better fit.

If for a particular $m$ we get very low data error, can we say it is good?

We know that if we search over all polynomials, we can never really learn anything. Can we have a formalism that makes this precise?
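
The point about data error can be checked numerically. The following is a minimal sketch, not taken from the lecture: it assumes a quadratic ground truth with Gaussian noise, and the helper names `fit_polynomial` and `mean_squared_error` are illustrative. It fits polynomials of increasing degree by solving the normal equations and records the error on the training data. Because the families are nested, the training error can only go down as the degree grows, so a lower data error alone cannot single out the right $m$.

```python
import random

def fit_polynomial(xs, ys, m):
    """Least-squares fit of a degree-m polynomial; returns [w0, ..., wm]."""
    k = m + 1
    # Normal equations A w = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, k))
        w[i] = (b[i] - tail) / A[i][i]
    return w

def mean_squared_error(w, xs, ys):
    """Average squared difference between the polynomial and the targets."""
    preds = [sum(wj * x ** j for j, wj in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

# Assumed toy data: 15 equally spaced points, quadratic truth plus noise.
rng = random.Random(0)
xs = [-1.0 + 2.0 * i / 14 for i in range(15)]
ys = [1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, 0.3) for x in xs]

degrees = list(range(0, 7))
train_errors = [mean_squared_error(fit_polynomial(xs, ys, m), xs, ys)
                for m in degrees]
for m, e in zip(degrees, train_errors):
    print(f"degree {m}: training MSE = {e:.4f}")
```

The printed errors keep shrinking past degree 2 even though the data were generated by a quadratic, which is exactly why the data error cannot be the only criterion.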

There are different ways of addressing this issue (MDL, VC-theory, etc.).

We discuss one such approach next.

Any learning algorithm takes training data as the input and outputs a specific classifier/function.

For this, it searches over some chosen family of functions to find one that optimizes a chosen criterion function.

$\{(X_i, y_i)\}$ → Learning Algorithm (searching over $\mathcal{F}$) → $f \in \mathcal{F}$

The question is: how can we formalize correctness of learning?

For example, a generic approach is what is called the Minimum Description Length (MDL) principle.

Suppose we want to send the data over a communication channel.

We can send the $2n$ numbers, $X_i, y_i$, using some number of bits.

Or we can send $X_i$, the function $f$ and the errors $y_i - f(X_i)$.

If the fit is good, the errors $y_i - f(X_i)$ would have a small range and we may be able to send them using a smaller number of bits compared to sending $y_i$.

However, we also need to send $f$.

If $f$ is very complex, then what we save in bits by sending errors instead of $y_i$ may be more than offset by the bits needed to send the description of $f$.

Hence we can rate different $f$ by the total number of bits we need.

This can balance the data error and model complexity in a natural way.

For example, suppose we use polynomials as $f$ (with $X_i, y_i \in \mathbb{R}$).

So, to send $f$ we need to send $d + 1$ numbers if we use a polynomial of degree $d$.

Getting a good fit using a high-degree polynomial may not really pay!

This intuitively captures our idea that simple models are preferred.

As presented, this approach is only asking the question whether $f$ is a good fit for the specific data at hand.
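
The bit-counting argument can be made concrete with a toy calculation. The sketch below rests on assumptions that are not part of the lecture: each coefficient costs a fixed 32 bits (`COEFF_BITS`), residuals are sent with a fixed-length code at a resolution of 0.01 (`PRECISION`), and the data come from a noisy straight line. It compares a constant, a least-squares line, and an interpolating polynomial of degree $n - 1$, whose residuals are zero by construction.

```python
import math
import random

COEFF_BITS = 32    # assumed cost of sending one polynomial coefficient
PRECISION = 0.01   # assumed resolution at which residuals are transmitted

def residual_bits(residuals):
    """Bits for a fixed-length code covering the residual range at PRECISION."""
    spread = max(residuals) - min(residuals)
    levels = int(spread / PRECISION) + 1
    return len(residuals) * math.ceil(math.log2(levels))

def total_bits(degree, residuals):
    """Description length: d + 1 coefficients plus the coded residuals."""
    return (degree + 1) * COEFF_BITS + residual_bits(residuals)

# Assumed toy data: a straight line observed with small bounded noise.
rng = random.Random(0)
n = 20
xs = [i / (n - 1) for i in range(n)]
ys = [1.0 + 2.0 * x + rng.uniform(-0.05, 0.05) for x in xs]

# Degree 0: the best constant in the least-squares sense is the mean.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
res_const = [y - mean_y for y in ys]

# Degree 1: closed-form simple linear regression.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
res_line = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Degree n - 1: an interpolating polynomial reproduces every y_i exactly.
res_interp = [0.0] * n

bits = {
    0: total_bits(0, res_const),
    1: total_bits(1, res_line),
    n - 1: total_bits(n - 1, res_interp),
}
for degree, cost in bits.items():
    print(f"degree {degree}: {cost} bits")
```

Under these assumptions the line needs the fewest bits: the constant pays for wide residuals, while the interpolating polynomial pays for its $n$ coefficients even though its residuals cost nothing.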

We will follow a different statistical approach to address the issue of correctness of learning.

Intuitively, we can learn the correct relationship if we are given a sufficient number of representative examples.

Also, the sufficiency of the number of examples depends on the family $\mathcal{F}$ of classifiers over which we are searching. We hence need a good notion of the complexity of the learning problem.

We begin with a simple formalism where there is no noise and the goal of learning is well-defined.

A learning problem is defined by giving:

(i) $\mathcal{X}$: input space; often $\mathbb{R}^d$ (feature space)

(ii) $\mathcal{Y} = \{0, 1\}$: output space (set of class labels)

(iii) $\mathcal{C} \subset 2^{\mathcal{X}}$: concept space (family of classifiers). Each $C \in \mathcal{C}$ can also be viewed as a function $C : \mathcal{X} \to \{0, 1\}$, with $C(X) = 1$ iff $X \in C$.

(iv) $S = \{(X_i, y_i), i = 1, \ldots, n\}$: the set of examples, where $X_i$ are drawn iid according to some distribution $P_x$ on $\mathcal{X}$ and $y_i = C^*(X_i)$ for some $C^* \in \mathcal{C}$. $C^*$ is called the target concept.
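
The four ingredients above can be sketched in code. This is only an illustration under stated assumptions: the input space is taken to be $\mathbb{R}^2$, the target concept is a hypothetical half-space, $P_x$ is a standard Gaussian, and the names `make_halfspace` and `draw_examples` are invented for this sketch. A concept is represented by its indicator function, and the examples are labelled by the target without noise.

```python
import random

def make_halfspace(w, b):
    """Indicator function of the half-space {x : w . x + b >= 0} in R^2."""
    def concept(x):
        return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0
    return concept

def draw_examples(target, sample_x, n, rng):
    """Draw n iid inputs from P_x and label each one with the target concept."""
    examples = []
    for _ in range(n):
        x = sample_x(rng)
        examples.append((x, target(x)))
    return examples

def standard_gaussian(rng):
    """Assumed P_x: independent standard normal coordinates."""
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

rng = random.Random(0)
c_star = make_halfspace((1.0, -2.0), 0.5)   # hypothetical target concept
S = draw_examples(c_star, standard_gaussian, 50, rng)
print(S[:3])
```

A learning algorithm would receive only `S` and the knowledge that the target is some half-space; `c_star` itself stays hidden from it.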

We are considering a 2-class case.

Hence any classifier is a function $C : \mathcal{X} \to \{0, 1\}$.

Thus, $\mathcal{C}$ is a family of classifiers.

We call this the concept space because we can say the system is learning a concept from examples.

The learning algorithm knows $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{C}$, but does not know $C^*$. It needs to learn the target concept from examples.

We do not know the distribution $P_x$. However, assuming that the examples are iid ensures we get representative examples.

We are trying to teach a concept through examples that come from an arbitrary distribution.

Since we have taken $y_i = C^*(X_i), \; \forall i$, there is no noise.

Also, assuming that $C^* \in \mathcal{C}$ means that ideally we can learn the target concept.

We could take $\mathcal{C} = 2^{\mathcal{X}}$. This means we are searching over the family of all possible (2-class) classifiers.

This is often not viable even theoretically.

So, choosing a particular $\mathcal{C}$ is based either on some knowledge we have about the problem or on the kind of learning algorithm we have.

For example, we can take $\mathcal{C}$ to be all half-spaces: the family of all linear classifiers.

Suppose we want to learn the concept of medium-build persons based on the features of height and weight.

Here $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{Y} = \{0, 1\}$.

We would be given examples (with no errors!) drawn from some arbitrary distribution.

The examples for learning our concept could be the following.

[Figure: labelled examples plotted by height and weight]

What could be $\mathcal{C}$ in this example? We can use some problem-based intuition.

We can choose $\mathcal{C}$ to be all axis-parallel rectangles.

Now assuming $C^* \in \mathcal{C}$ means that the god-given classifier is also an axis-parallel rectangle.

The examples along with $C^*$ would now be:

[Figure: the examples together with the target rectangle $C^*$]

Probably Approximately Correct Learning

Let us now try to define the goal of learning.

Note that each $C \in \mathcal{C}$ can be viewed either as a subset of $\mathcal{X}$ or as a binary-valued function on $\mathcal{X}$.

Let $C_n$ denote the concept or classifier output by the learning algorithm after it processes $n$ iid examples.

For correctness of the learning algorithm we want $C_n$ to be close to $C^*$ as $n$ becomes large.

The closeness of $C_n$ to $C^*$ is in terms of classifying samples drawn from $\mathcal{X}$ according to $P_x$.

We define the error of $C_n$ by

$\mathrm{err}(C_n) = P_x(C_n \,\Delta\, C^*) = \mathrm{Prob}[\{X \in \mathcal{X} : C_n(X) \neq C^*(X)\}]$

where, for sets $C_n$, $C^*$,

$C_n \,\Delta\, C^* = (C_n - C^*) \cup (C^* - C_n).$

The $\mathrm{err}(C_n)$ is the probability that on a random sample, drawn according to $P_x$, the classifications of $C_n$ and $C^*$ differ.
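
Since $\mathrm{err}(C_n)$ is just the $P_x$-probability of disagreement, it can be estimated by sampling. The sketch below is illustrative only: the two rectangles, the uniform distribution on the unit square, and the function names are assumptions made for the example. With these choices the exact error is the area between the rectangles, $0.36 - 0.16 = 0.20$, so the estimate can be compared against a known value.

```python
import random

def in_rectangle(rect):
    """Indicator function of the axis-parallel rectangle (x_lo, x_hi, y_lo, y_hi)."""
    x_lo, x_hi, y_lo, y_hi = rect
    def concept(p):
        return 1 if (x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi) else 0
    return concept

def estimate_error(c_n, c_star, sample_x, trials=100_000, seed=0):
    """Monte Carlo estimate of P_x(C_n symmetric-difference C*)."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        x = sample_x(rng)
        if c_n(x) != c_star(x):
            disagreements += 1
    return disagreements / trials

def uniform_unit_square(rng):
    """Assumed P_x: uniform on [0, 1] x [0, 1]."""
    return (rng.random(), rng.random())

c_star = in_rectangle((0.2, 0.8, 0.2, 0.8))   # hypothetical target concept
c_n = in_rectangle((0.3, 0.7, 0.3, 0.7))      # hypothetical learnt concept
err = estimate_error(c_n, c_star, uniform_unit_square)
print(f"estimated err(C_n) = {err:.3f}")
```

Changing `uniform_unit_square` to another sampler changes the error even though the two rectangles stay the same, which is the sense in which closeness is measured under $P_x$ rather than by area.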

Essentially, we want $\mathrm{err}(C_n)$ to become zero as $n \to \infty$.

However, $\mathrm{err}(C_n)$ is a random variable because $C_n$ is a function of the random samples $X_1, \ldots, X_n$.

Hence we have to properly define the sense in which $\mathrm{err}(C_n)$ converges as $n \to \infty$.

We say a learning algorithm Probably Approximately Correctly (PAC) learns a concept class $\mathcal{C}$ if, given any $\epsilon, \delta > 0$, $\exists N < \infty$ such that

$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] < \delta$

for all $n > N$ and for any distribution $P_x$ and any $C^*$.

The probability above is with respect to the distribution of $n$-tuples of iid samples drawn according to $P_x$ on $\mathcal{X}$.

The $P_x$ is arbitrary. But the distribution is the same for testing and training, which is fair to the algorithm.
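
To see what the probability in this definition ranges over, one can repeat the whole learning experiment many times with fresh training sets. The following sketch is an illustration under assumptions not made in the lecture: the input space is the interval $[0, 1]$ with a uniform $P_x$, the target is the interval $[0.25, 0.75]$, and the learner is a hypothetical one that outputs the tightest interval containing the positive examples. For this setup the error of a learnt interval can be computed exactly.

```python
import random

TARGET = (0.25, 0.75)   # assumed target concept C*: an interval in [0, 1]

def learn_interval(n, rng):
    """Draw n iid points from Uniform[0, 1]; return the tightest interval
    containing the positive examples, or None if there are none."""
    positives = [x for x in (rng.random() for _ in range(n))
                 if TARGET[0] <= x <= TARGET[1]]
    if not positives:
        return None
    return (min(positives), max(positives))

def error(interval):
    """Exact err(C_n) under Uniform[0, 1]: the mass of C* missed by C_n."""
    if interval is None:
        return TARGET[1] - TARGET[0]
    lo, hi = interval
    return (lo - TARGET[0]) + (TARGET[1] - hi)

def prob_error_exceeds(epsilon, n, runs=2000, seed=0):
    """Fraction of independent training sets of size n with err(C_n) > epsilon."""
    rng = random.Random(seed)
    bad = sum(error(learn_interval(n, rng)) > epsilon for _ in range(runs))
    return bad / runs

p_small = prob_error_exceeds(0.1, n=10)
p_large = prob_error_exceeds(0.1, n=200)
print(f"n = 10:  Prob[err > 0.1] is about {p_small:.3f}")
print(f"n = 200: Prob[err > 0.1] is about {p_large:.3f}")
```

Each run corresponds to one draw of the $n$-tuple of samples; PAC learning asks that the fraction of bad runs fall below $\delta$ once $n$ is large enough, whatever $P_x$ and $C^*$ are.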

An algorithm PAC learns $\mathcal{C}$ if

$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] < \delta$

for sufficiently large $n$ and any $P_x$.

If $\mathrm{err}(C_n) \leq \epsilon$, then $C_n$ is approximately correct.

So, what the above says is that the classifier output by the algorithm after seeing $n$ random examples, $C_n$, is approximately correct with a high probability.

The $\epsilon$ and $\delta$ are called the accuracy and confidence parameters respectively.

Let us look at the example of learning the concept of medium-built persons.

Here, $\mathcal{X} = \mathbb{R}^2$. The strategy of the learning algorithm is as follows.

The algorithm outputs a classifier which correctly classifies all examples.

If there is more than one $C \in \mathcal{C}$ that is consistent with all examples, we output the smallest such $C$.

For finite sets, smallest is in terms of the number of points; for other sets it is in terms of the areas of the sets. (This will do for our purpose here.)

We will look at two different $\mathcal{C}$ and find out what the algorithm does.

We take $\mathcal{C}_1$ to be the set of all axis-parallel rectangles.

We take $\mathcal{C}_2$ to be $2^{\mathcal{X}}$; that is, the set of all possible classifiers.

We assume, as earlier, that $C^*$ is an axis-parallel rectangle.

Note that $C^*$ belongs to both $\mathcal{C}_1$ and $\mathcal{C}_2$.

All our examples are classified according to $C^*$.

Hence an $(x, y) \in \mathbb{R}^2$ is a positive example if it is in $C^*$ and a negative example otherwise.

The examples along with $C^*$ in this problem are:

[Figure: the examples together with the target rectangle $C^*$]

First consider $\mathcal{C}_1$. The smallest $C \in \mathcal{C}_1$ consistent with all examples would be the smallest axis-parallel rectangle enclosing all the positive examples seen so far.

The concept output by the algorithm (when using $\mathcal{C}_1$) would be the following.

[Figure: the learnt rectangle $C_n$ drawn inside the target rectangle $C^*$]

Thus, under the strategy of our learning algorithm, for all $n$, the $C_n$ would always be inside the $C^*$.

Now let us show that this is a PAC learning algorithm.
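
The learning strategy for axis-parallel rectangles is short enough to write down. The sketch below is one possible rendering, with assumed details: the target rectangle `C_STAR`, the uniform sampling on the unit square, and the convention of returning `None` (the empty concept) when no positive example has been seen are all choices made for illustration.

```python
import random

def tightest_rectangle(examples):
    """Smallest axis-parallel rectangle enclosing all positive examples.
    Returns (x_lo, x_hi, y_lo, y_hi), or None if no positive was seen."""
    positives = [x for x, label in examples if label == 1]
    if not positives:
        return None
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def classify(rect, point):
    """Label 1 inside the rectangle, 0 outside; None is the empty concept."""
    if rect is None:
        return 0
    x_lo, x_hi, y_lo, y_hi = rect
    return 1 if (x_lo <= point[0] <= x_hi and y_lo <= point[1] <= y_hi) else 0

C_STAR = (0.2, 0.8, 0.3, 0.7)   # hypothetical target rectangle
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
examples = [(p, classify(C_STAR, p)) for p in points]

c_n = tightest_rectangle(examples)
print("learnt rectangle:", c_n)
```

The learnt rectangle is consistent with every training example and lies inside the target, because every positive example lies in the target and the rectangle is the tightest one around the positives.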

Whenever any example is classified as positive by $C_n$ it would also be classified as positive by $C^*$.

Hence the set of points of $\mathcal{X}$ where $C_n$ makes errors is the annular region between $C_n$ and $C^*$.

Since $C_n$ is also an axis-parallel rectangle which is inside $C^*$, the $C_n \,\Delta\, C^*$ would be the annular region between the two rectangles.

Hence, $\mathrm{err}(C_n)$ is the $P_x$-probability of this annular region.

Note that we are not really bothered about the area of this annular region; we are only interested in the probability mass of this region under $P_x$.

Now, given an $\epsilon > 0$, we have to bound the probability that $\mathrm{err}(C_n) > \epsilon$.

The error is greater than $\epsilon$ only if the probability mass (under $P_x$) of the annular region is greater than $\epsilon$.

When does this event $\mathrm{err}(C_n) > \epsilon$ occur?

Only when none of the examples seen happen to be in the annular region.

Why? Otherwise, the rectangle learnt by our algorithm would have been closer to $C^*$.

PRNN (PSS) Jan-Apr 2016 p.131/315

Hence the probability of the event $err(C_n) > \epsilon$ is the same as the probability that, when $n$ iid examples are drawn according to $P_x$, none of them came from a subset of $\mathcal{X}$ that has $P_x$-probability at least $\epsilon$.

That is, all examples came from a subset of probability at most $(1 - \epsilon)$.

The probability of this happening is at most $(1 - \epsilon)^n$.

Hence we have
$$\mathrm{Prob}[err(C_n) > \epsilon] \le (1 - \epsilon)^n$$

Let $N$ be such that $(1 - \epsilon)^n < \delta$, for all $n > N$.

The required $N$ is
$$N \ge \frac{\ln(\delta)}{\ln(1 - \epsilon)}$$
(bound on number of examples).

For this $N$, we have
$$\mathrm{Prob}[err(C_n) > \epsilon] \le \delta, \quad \forall n \ge N$$
showing that the algorithm PAC learns the concept class.
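
To get a feel for the numbers, the bound on the number of examples can be evaluated directly. The following is a small sketch; the function name `sample_bound` and the chosen values of $\epsilon$ and $\delta$ are illustrative assumptions.

```python
import math

def sample_bound(eps, delta):
    """An integer n with n >= ln(delta) / ln(1 - eps), so that (1 - eps)**n <= delta."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    # ln(1 - eps) is negative, so dividing by it flips the inequality (1 - eps)**n <= delta.
    return math.ceil(math.log(delta) / math.log(1 - eps))

for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.01, 0.01)]:
    n = sample_bound(eps, delta)
    print(eps, delta, n, (1 - eps) ** n)
```

Note how the required number of examples grows as either the accuracy parameter $\epsilon$ or the confidence parameter $\delta$ is made smaller.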

Now let us consider the same algorithm with concept class $\mathcal{C}_2 = 2^{\mathcal{X}}$.

Here we are searching over all possible 2-class classifiers.

So, intuitively, we do not expect the algorithm to be able to learn anything.

There is too much flexibility in the bag of classifiers over which we are searching.

Let us show this formally.

What would be $C_n$ now?

After seeing $n$ examples, the smallest set in $\mathcal{C}_2$ that is consistent with all examples is the set consisting of all the positive examples seen so far!

Now the algorithm simply remembers all the positive examples seen.

This happened because every possible finite subset of $\mathcal{X}$ is in our concept class.

So, now, $C_n \Delta C^*$ would be the axis-parallel rectangle $C^*$ minus some finite number of points from it.

So, under any continuous $P_x$, $err(C_n) = P_x(C_n \Delta C^*) = P_x(C^*)$.

Hence, for any $\epsilon < P_x(C^*)$, $\mathrm{Prob}[err(C_n) > \epsilon] = 1$ for all $n$.

Thus, the algorithm cannot PAC learn with $\mathcal{C}_2$.
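
The same conclusion can be seen in a small simulation. This is a sketch under assumptions made only for illustration: a hypothetical target rectangle `TARGET` with $P_x(C^*) = 0.36$ under a uniform $P_x$ on the unit square, and a Monte Carlo estimate of the error on fresh points. The estimated error should stay near $P_x(C^*)$ however many examples are memorized.

```python
import random

TARGET = (0.2, 0.8, 0.3, 0.9)  # hypothetical C*, chosen only for illustration

def in_target(p):
    x_lo, x_hi, y_lo, y_hi = TARGET
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

def learn_memorizer(points, labels):
    """Smallest consistent set in the class of all subsets: exactly the positive examples seen."""
    return {p for p, lab in zip(points, labels) if lab}

def estimate_error(memory, rng, trials=20000):
    """Monte Carlo estimate of the P_x-probability of disagreement between C_n and C*."""
    wrong = 0
    for _ in range(trials):
        p = (rng.random(), rng.random())
        if (p in memory) != in_target(p):
            wrong += 1
    return wrong / trials

rng = random.Random(1)
for n in (10, 1000, 10000):
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    mem = learn_memorizer(pts, [in_target(p) for p in pts])
    print(n, estimate_error(mem, rng))
```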

This example clearly illustrates the difficulty of learning from examples if the bag of classifiers being considered is too large.

The largeness is not in terms of the number of elements in our concept class.

Both $\mathcal{C}_1$ and $\mathcal{C}_2$ contain an uncountably infinite number of classifiers.

We will later define an appropriate quantity to quantify the sense in which one concept class can be said to be bigger than (or more complex than) another.

At this point we can still see how $\mathcal{C}_1$ is smaller than $\mathcal{C}_2$.

Since every axis-parallel rectangle can be specified by four quantities, this class can be parameterized by four parameters.

However, there is no such finite parameterization for $\mathcal{C}_2 = 2^{\mathcal{X}}$.

Also, the strategy of our algorithm can be coded efficiently in the case of $\mathcal{C}_1$.

The concept of PAC learnability is interesting.

It allows one to properly define what correctness of learning means, and allows us to ask questions like whether a given algorithm learns correctly.

As we have seen in our example, we can also bound the number of examples needed to learn to a given level of accuracy and confidence.

Thus we can appreciate the relative complexities of different learning problems.

However, PAC learnability deals with ideal learning situations.

We assume there is a (god-given) $C^*$ and that it is in our $\mathcal{C}$.

Also, we assume that examples are noise free and are perfectly classified.

Next we consider an extension of this framework that is relevant for realistic learning scenarios.

In our new framework we are given

$\mathcal{X}$: input space (as earlier, feature space)

$\mathcal{Y}$: output space (as earlier, set of class labels)

$\mathcal{H}$: hypothesis space (family of classifiers). Each $h \in \mathcal{H}$ is a function $h : \mathcal{X} \to \mathcal{A}$, where $\mathcal{A}$ is called the action space.

Training data: $\{(X_i, y_i), \; i = 1, \cdots, n\}$, drawn iid according to some distribution $P_{xy}$ on $\mathcal{X} \times \mathcal{Y}$.

Some Comments

We have replaced $\mathcal{C}$ with $\mathcal{H}$.

If we take $\mathcal{A} = \mathcal{Y}$ then it is the same as earlier.

But the freedom in choosing $\mathcal{A}$ allows for taking care of many situations.

For example, even when $\mathcal{Y} = \{0, 1\}$, we can take $\mathcal{A} = \Re$ (e.g., learning discriminant functions).

Now, e.g., the sign of $h(X)$ may denote the class and its magnitude may give some measure of confidence in the assigned class.

Now we draw examples from $\mathcal{X} \times \mathcal{Y}$ according to $P_{xy}$.

This allows for noise in the training data.

For example, when class conditional densities overlap, the same $X$ can come from different classes with different probabilities.

We can always factorize $P_{xy} = P_x P_{y|x}$. If $P_{y|x}$ is a degenerate distribution then it is the same as earlier: we draw iid samples from $\mathcal{X}$ and each point is essentially classified by the target classifier.

However, having examples drawn from $\mathcal{X} \times \mathcal{Y}$ using a distribution allows for many more scenarios.
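
The factorization $P_{xy} = P_x P_{y|x}$ also suggests how such examples could be generated. The sketch below is purely illustrative: the uniform $P_x$ on $[0, 1]$, the function `posterior`, and the `noisy` switch are all assumptions introduced here. With `noisy=False` the conditional distribution is degenerate and every point is labelled by a fixed target classifier, as in the earlier framework.

```python
import random

def posterior(x):
    """Hypothetical P(y = 1 | x) on [0, 1]; any function with values in [0, 1] would do."""
    return x

def draw_example(rng, noisy=True):
    """Draw (x, y): first x from P_x (uniform here, by assumption), then y from P_{y|x}."""
    x = rng.random()
    if noisy:
        y = 1 if rng.random() < posterior(x) else 0   # overlapping classes: label is random given x
    else:
        y = 1 if posterior(x) >= 0.5 else 0           # degenerate P_{y|x}: a fixed target classifier
    return x, y

rng = random.Random(0)
print([draw_example(rng) for _ in range(5)])
print([draw_example(rng, noisy=False) for _ in range(5)])
```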

As before, the learning machine outputs a hypothesis, $h_n \in \mathcal{H}$, given the training data consisting of $n$ examples.

However, now there is no notion of a target concept/hypothesis.

There may be no $h \in \mathcal{H}$ which is consistent with all examples.

Hence we use the idea of loss functions to define the goal of learning.

Loss function

Loss function: $L : \mathcal{Y} \times \mathcal{A} \to \Re^+$.

The idea is that $L(y, h(X))$ is the loss suffered by $h \in \mathcal{H}$ on a (random) sample $(X, y) \in \mathcal{X} \times \mathcal{Y}$.

More generally we can let the loss depend on $X$ also explicitly and can write $L(X, y, h(X))$ for the loss function.

By convention we assume that the loss function is non-negative.

Now we can look for hypotheses that have low average loss over samples drawn according to $P_{xy}$.

Risk function

Define a function, $R : \mathcal{H} \to \Re^+$, by
$$R(h) = E[L(y, h(X))] = \int L(y, h(X)) \, dP_{xy}$$

$R$ is called the risk function.

Risk is the expectation of loss, where the expectation is with respect to $P_{xy}$.

Hence, $h$ with a low $R(h)$ is a better classifier.

We can now define the goal of learning as finding the minimizer of risk.

Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$.

We define the goal of learning as finding $h^*$, the global minimizer of risk.

Risk minimization is a very general strategy adopted by most machine learning algorithms.

However, note that we may not have any knowledge of $P_{xy}$.

How can we find the minimizer of risk? Minimization of $R(\cdot)$ directly is not feasible.

Empirical Risk function

Define the empirical risk function, $\hat{R}_n : \mathcal{H} \to \Re^+$, by
$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i))$$

This is the sample mean estimator of risk obtained from $n$ iid samples.

Let $\hat{h}^*_n$ be the global minimizer of empirical risk, $\hat{R}_n$:
$$\hat{h}^*_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$$
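
As a minimal sketch of these two definitions, the code below computes the empirical risk of a hypothesis and finds its minimizer by enumeration over a small finite hypothesis space. Everything concrete here is an assumption for illustration: the threshold classifiers `threshold(t)`, the grid of thresholds, the 0-1 loss, and the toy $P_{xy}$ with 10% label noise.

```python
import random

def zero_one(y, a):
    return 0.0 if y == a else 1.0

def empirical_risk(h, data, loss):
    """Sample mean of the loss of h over the training data [(x_1, y_1), ..., (x_n, y_n)]."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

def threshold(t):
    """Hypothetical hypothesis h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

# Toy P_xy (assumption): x uniform on [0, 1], clean label 1 iff x >= 0.6, flipped with probability 0.1.
rng = random.Random(0)
data = []
for _ in range(500):
    x = rng.random()
    y = 1 if x >= 0.6 else 0
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# A small finite hypothesis space, so the arg min can be found by simple enumeration.
grid = [i / 20 for i in range(21)]
best_t = min(grid, key=lambda t: empirical_risk(threshold(t), data, zero_one))
print(best_t, empirical_risk(threshold(best_t), data, zero_one))
```

For richer hypothesis spaces enumeration is not possible, and the minimizer has to be found by an optimization method suited to the chosen loss.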

The loss function is known to us.

Hence, given any $h$ we can calculate $\hat{R}_n(h)$.

Hence, we can (in principle) find $\hat{h}^*_n$ by optimization methods.

Approximating $h^*$ by $\hat{h}^*_n$ is the basic idea of the empirical risk minimization strategy, which is used in most ML algorithms.

Is $\hat{h}^*_n$ a good approximator of $h^*$, the minimizer of true risk (for large $n$)?

This is the question of consistency of empirical risk minimization.

Thus, we can say that any learning problem has two parts.

The optimization part: find $\hat{h}^*_n$, the minimizer of $\hat{R}_n$.

The statistical part: Is $\hat{h}^*_n$ a good approximator of $h^*$?

We will now look at some examples of loss functions.

Note that the loss function is chosen by us; it is part of the specification of the learning problem.

The loss function is intended to capture how we would like to evaluate the performance of the classifier.

The 0-1 loss function

Consider a 2-class classification problem.

Let $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \mathcal{Y}$.

Now, the 0-1 loss function is defined by
$$L(y, h(X)) = I_{[y \neq h(X)]}$$
where $I_{[A]}$ denotes the indicator of event $A$.

The 0-1 loss function is
$$L(y, h(X)) = I_{[y \neq h(X)]}$$

Risk is the expectation of loss.

Hence, $R(h) = \mathrm{Prob}[y \neq h(X)]$; the risk is the probability of misclassification.

So, $h^*$ minimizes the probability of misclassification, which is a very desirable goal. (Bayes classifier)

Here we assumed that the learning algorithm searches over a class of binary-valued functions on $\mathcal{X}$.

We can extend this to, e.g., discriminant function learning.

We take $\mathcal{A} = \Re$ (now $h(X)$ is a discriminant function).

We can define the 0-1 loss now as
$$L(y, h(X)) = I_{[y \neq \mathrm{sgn}(h(X))]}$$
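
A small sketch of the 0-1 loss with a real-valued discriminant follows. It assumes labels in $\{+1, -1\}$ so that they can be compared with the sign directly; the convention of sending $h(X) = 0$ to $+1$, the linear discriminant `h`, and the toy data are likewise assumptions made for illustration.

```python
def sgn(a):
    """Map a real-valued discriminant output to a label in {+1, -1} (zero sent to +1 by assumption)."""
    return 1 if a >= 0 else -1

def zero_one_loss(y, a):
    """0-1 loss for y in {+1, -1} and real-valued a = h(X): 1 on a sign mismatch, else 0."""
    return 1 if y != sgn(a) else 0

def h(x):
    """Hypothetical linear discriminant on the real line."""
    return 2.0 * x - 1.0

data = [(0.9, 1), (0.8, 1), (0.2, -1), (0.6, -1)]
losses = [zero_one_loss(y, h(x)) for x, y in data]
# The mean of the losses is the empirical risk, i.e. the fraction of misclassified examples.
print(losses, sum(losses) / len(losses))
```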

Having any fixed misclassification costs is essentially the same as the 0-1 loss.

Even if we take $\mathcal{A} = \Re$, the 0-1 loss compares only the sign of $h(x)$ with $y$. The magnitude of $h(x)$ has no effect on the loss.

Here, we cannot trade good performance on some data with bad performance on others.

This makes the 0-1 loss function more robust to noise in classification labels.

While 0-1 loss is an intuitively appealing performance measure, minimizing empirical risk here is hard.

Note that the 0-1 loss function is non-differentiable, which makes the empirical risk function also non-differentiable.

The empirical risk minimization is also a non-convex optimization problem here.

Hence many other loss functions are often used in Machine Learning.

Squared error loss

The squared error loss function is defined by
$$L(y, h(X)) = (y - h(X))^2$$

As is easy to see, the linear least squares method that we considered is empirical risk minimization with the squared error loss function.

Here, for a 2-class classification problem, we can take $\mathcal{Y}$ as $\{0, 1\}$ or $\{+1, -1\}$. We take $\mathcal{A} = \Re$ so that each $h$ is a discriminant function.

As we know, we can use this for regression problems also and then we take $\mathcal{Y} = \Re$.
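
The link between least squares and empirical risk minimization can be made concrete with a one-dimensional sketch. The model $h(x) = wx + b$, the closed-form solution via the normal equations, and the toy data below are assumptions chosen to keep the example short; they are not the formulation used earlier in the course.

```python
def fit_least_squares(xs, ys):
    """Minimize (1/n) * sum (y_i - (w*x_i + b))**2 over (w, b) using the normal equations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = my - w * mx
    return w, b

def empirical_risk(w, b, xs, ys):
    """Empirical risk of h(x) = w*x + b under the squared error loss."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy 2-class data with labels in {+1, -1}; h(x) = w*x + b acts as a discriminant function.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_least_squares(xs, ys)
print(w, b, empirical_risk(w, b, xs, ys))
print([1 if w * x + b >= 0 else -1 for x in xs])
```

Perturbing `w` or `b` away from the fitted values can only increase the printed empirical risk, which is what being the minimizer means here.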

Another interesting scenario here is to take $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = [0, 1]$.

Then each $h$ can be interpreted as a posterior probability (of class-1) function.

As we know, the minimizer of the expectation of squared error loss (the risk here) is the posterior probability function.

So, risk minimization would now look for a function in $\mathcal{H}$ that is a good approximation for the posterior probability function.
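
For completeness, here is a short sketch of why the posterior probability function minimizes the risk under squared error loss. The symbol $\eta(X) = E[y \mid X] = \mathrm{Prob}[y = 1 \mid X]$ is notation assumed here for the posterior probability function; it does not appear in the slides.

```latex
\begin{align*}
E\big[(y - h(X))^2 \mid X\big]
  &= E\big[(y - \eta(X))^2 \mid X\big]
   + 2\,(\eta(X) - h(X))\,E\big[y - \eta(X) \mid X\big]
   + (\eta(X) - h(X))^2 \\
  &= E\big[(y - \eta(X))^2 \mid X\big] + (\eta(X) - h(X))^2
\end{align*}
```

The cross term vanishes because $E[y - \eta(X) \mid X] = 0$. The first term does not depend on $h$, so the conditional expectation, and hence the risk, is smallest when $h(X) = \eta(X)$.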

The empirical risk minimization under squared error loss is a convex optimization problem for linear models (when $h$ is linear in its parameters).

The squared error loss is extensively used in many learning algorithms.

Soft margin loss or hinge loss

We take $\mathcal{Y} = \{+1, -1\}$ and $\mathcal{A} = \Re$. The loss function is given by
$$L(y, h(X)) = \max(0, 1 - y h(X))$$

Here, if $y h(X) > 0$ then classification is correct and if $y h(X) \ge 1$, loss is zero.

This also results in convex optimization for empirical risk minimization.
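
A few values make the behaviour of the hinge loss concrete, in particular the range $0 < y h(X) < 1$ where the classification is correct but the loss is still positive. This is a minimal sketch; the function name and the sample values are illustrative assumptions.

```python
def hinge_loss(y, a):
    """Hinge loss max(0, 1 - y*a) for y in {+1, -1} and real-valued a = h(X)."""
    return max(0.0, 1.0 - y * a)

for a in (-1.0, 0.0, 0.5, 1.0, 2.0):
    # With y = +1 the product y*h(X) is just h(X).
    print(a, hinge_loss(+1, a))
```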

We can look at the other loss functions as convex approximations of the 0-1 loss function, as follows.

Let us take $\mathcal{Y} = \{+1, -1\}$.

We can write all loss functions as functions of the single variable $y h(X)$.

For 0-1 loss, $L(y, h(X))$ is one if $y h(X)$ is negative and zero otherwise.

The squared error loss can be written as
$$L(y, h(X)) = (1 - y h(X))^2$$

The hinge loss is defined as a function of $y h(X)$:
$$L(y, h(X)) = \max(0, 1 - y h(X))$$
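
The rewritten form of the squared error loss follows from a short piece of algebra, sketched here under the assumption $y \in \{+1, -1\}$, so that $y^2 = 1$:

```latex
\begin{align*}
(y - h(X))^2 &= y^2 - 2\,y\,h(X) + h(X)^2 \\
             &= 1 - 2\,y\,h(X) + (y\,h(X))^2 \qquad (\text{since } y^2 = 1) \\
             &= (1 - y\,h(X))^2
\end{align*}
```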

We can plot all the functions as follows.

[Figure: 0-1 loss, square loss and hinge loss, $L(y, f(x))$, plotted against $y f(x)$.]

(Here we plot $y h(X)$ on the x-axis and $L(y, h(X))$ on the y-axis.)
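
The curves can also be tabulated numerically. The sketch below writes each loss as a function of the margin $m = y h(X)$; the function names and the grid of margins are assumptions made for illustration.

```python
def zero_one(m):
    """0-1 loss as a function of the margin m = y*h(X): 1 if m is negative, else 0."""
    return 1.0 if m < 0 else 0.0

def square(m):
    """Squared error loss written in terms of the margin, for y in {+1, -1}."""
    return (1.0 - m) ** 2

def hinge(m):
    return max(0.0, 1.0 - m)

margins = [i / 2 for i in range(-4, 5)]   # -2.0, -1.5, ..., 2.0
print("margin  0-1  square  hinge")
for m in margins:
    print(f"{m:6.1f}  {zero_one(m):3.0f}  {square(m):6.2f}  {hinge(m):5.2f}")
```

On this grid both convex losses lie on or above the 0-1 loss. The square loss also grows again for margins larger than 1, while the hinge loss stays at zero there.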

There are also many other loss functions used in regression problems.

The squared error loss can be used. But it is very sensitive to data errors such as outliers.

So, other loss functions are used for better robustness during regression.

We list a couple of them here.

The $L_1$ loss is defined by
$$L(y, h(X)) = |y - h(X)|$$

This is more robust than the square loss.

Another similar one is the $\epsilon$-insensitive loss defined by
$$L(y, h(X)) = \max(0, |y - h(X)| - \epsilon)$$

Here if we make an error less than $\epsilon$ then the loss is zero.
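
The difference in sensitivity to outliers can be seen from a few numbers. In this sketch the residuals and the value $\epsilon = 0.5$ are arbitrary illustrative assumptions; the last residual plays the role of an outlier.

```python
def squared_loss(y, a):
    return (y - a) ** 2

def l1_loss(y, a):
    return abs(y - a)

def eps_insensitive_loss(y, a, eps=0.5):
    """max(0, |y - a| - eps); eps = 0.5 is an arbitrary illustrative value."""
    return max(0.0, abs(y - a) - eps)

# Hypothetical residuals y - h(X); each loss is evaluated with h(X) = 0 so that y is the residual.
residuals = [0.1, -0.3, 0.4, 10.0]
for name, loss in [("squared", squared_loss), ("L1", l1_loss), ("eps-insensitive", eps_insensitive_loss)]:
    print(name, [loss(r, 0.0) for r in residuals])
```

The outlier contributes 100 to the squared error loss but only about 10 to the other two, so it dominates the empirical risk far less under the $L_1$ and $\epsilon$-insensitive losses.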

As we saw, there are many different loss functions one can think of. Many of them also make the empirical risk minimization problem efficiently solvable.

Thus, a suitable choice of loss function lets us tackle the optimization part of learning efficiently.

We consider some such algorithms later in this course. Now, let us get back to the statistical question that we started with.

Consistency of Empirical Risk Minimization

Our objective is to find $h^*$, the minimizer of the risk $R(\cdot)$.

Since we do not know $R$, we minimize the empirical risk, $\hat{R}_n$, instead and thus find $\hat{h}^*_n$.

We want $h^*$ and $\hat{h}^*_n$ to be close. More precisely, we are interested in the question: does

$$R(\hat{h}^*_n) \to R(h^*)?$$

(As earlier, we know $\hat{h}^*_n$ is random and hence we take the above as convergence in probability.)

What is the intuitive reason for using empirical risk minimization?

The sample mean is a good estimator and hence, with large $n$, $\hat{R}_n(h)$ converges to $R(h)$, for any $h \in \mathcal{H}$. This is the (weak) law of large numbers.

But this does not necessarily mean that $R(\hat{h}^*_n)$ converges to $R(h^*)$.

We are interested in: does the true risk of the minimizer of empirical risk converge to the global minimum of risk?
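
The law-of-large-numbers statement for one fixed hypothesis can be checked with a small simulation. This is a sketch under assumptions made only for illustration: inputs are uniform on [0, 1], labels come from a threshold at 0.5, and the fixed hypothesis thresholds at 0.3, so its true risk under 0-1 loss is 0.2. None of these choices come from the lecture.

```python
import random

# Illustrative assumptions: X ~ Uniform[0, 1), labels from a threshold at 0.5.
def target(x):
    return 1 if x >= 0.5 else 0

# One fixed hypothesis; it disagrees with the target exactly on [0.3, 0.5).
def h(x):
    return 1 if x >= 0.3 else 0

TRUE_RISK = 0.2  # probability mass of the disagreement region

def draw_sample(n, rng):
    """Draw n labelled examples (x, target(x))."""
    return [(x, target(x)) for x in (rng.random() for _ in range(n))]

def empirical_risk(hyp, sample):
    """Average 0-1 loss of hyp on the sample."""
    return sum(hyp(x) != y for x, y in sample) / len(sample)

rng = random.Random(0)
for n in (10, 100, 1000, 100000):
    r_hat = empirical_risk(h, draw_sample(n, rng))
    print(n, r_hat, abs(r_hat - TRUE_RISK))
```

The printed gap shrinks as n grows, but only for this one hypothesis chosen before seeing the data; the simulation says nothing about a hypothesis selected using the sample.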

Let us consider a specific scenario to appreciate this.

We take $\mathcal{A} = \mathcal{Y} = \{0, 1\}$. We use the 0-1 loss.

Suppose the examples are drawn according to $P_x$ on $\mathcal{X}$ and classified according to a $\tilde{h} \in \mathcal{H}$.

That is, $P_{xy} = P_x P_{y|x}$ and $P_{y|x}$ is a degenerate distribution.

Now the global minimum of risk is zero and $R(\tilde{h}) = 0$.

Note that now the risk of any $h$ is the same as $P_x(h \Delta \tilde{h})$. That is, this scenario is the same as what we considered under the PAC framework.

Now, under 0-1 loss, the global minimum of empirical risk is also zero.

For any $n$, there may be many $h$ (other than $\tilde{h}$) with $\hat{R}_n(h) = 0$. Hence our optimization algorithm can only use some general rule to output one such hypothesis.

Consider $h_1 : \mathcal{X} \to \mathcal{Y}$ with $h_1(X_i) = y_i$, $\forall (X_i, y_i) \in S$, and $h_1(X) = 1$ for all other $X$.

Then $\hat{R}_n(h_1) = 0$! It is a global minimizer of empirical risk. But it is obvious that $h_1$ is not a good classifier.
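
The behaviour of such a memorizing hypothesis can be sketched in a few lines. The data distribution (uniform on [0, 1]), the labelling threshold 0.8, and the sample sizes are assumptions picked only for illustration; `make_h1` is a hypothetical helper name.

```python
import random

# Illustrative assumption: labels come from a threshold at 0.8,
# so about 80% of points have label 0.
def target(x):
    return 1 if x >= 0.8 else 0

def make_h1(sample):
    """Build h1: return the stored label on training points, 1 everywhere else."""
    memory = {x: y for x, y in sample}
    def h1(x):
        return memory.get(x, 1)
    return h1

def risk_01(hyp, sample):
    """Average 0-1 loss of hyp on the sample."""
    return sum(hyp(x) != y for x, y in sample) / len(sample)

rng = random.Random(0)
train = [(x, target(x)) for x in (rng.random() for _ in range(200))]
fresh = [(x, target(x)) for x in (rng.random() for _ in range(20000))]

h1 = make_h1(train)
train_risk = risk_01(h1, train)   # empirical risk on the training set S
fresh_risk = risk_01(h1, fresh)   # estimate of the true risk on new data
print(train_risk, fresh_risk)
```

On the training set the empirical risk is exactly zero, while on fresh data h1 predicts 1 almost everywhere and so errs on roughly the fraction of points whose label is 0.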

Suppose we took $\mathcal{H} = 2^{\mathcal{X}}$.

This is the same as the example we considered earlier and hence we know that $P_x(\hat{h}^*_n \Delta \tilde{h})$ will not go to zero with $n$.

Thus, here, $R(\hat{h}^*_n)$ will not converge to $R(h^*)$.

Note that the law of large numbers still implies that $\hat{R}_n(h)$ converges to $R(h)$, $\forall h$.

If functions like $h_1$ are in our $\mathcal{H}$, then empirical risk minimization (ERM) may not yield good classifiers.

If $\mathcal{H}$ contains all possible functions, then this is certainly the case, as we saw in our example.

Functions like $h_1$ could be highly non-smooth and hence one way is to impose some smoothness conditions on the learnt function (e.g., regularization).

We saw that though $\hat{R}_n(h) \to R(h)$, $\forall h$, the risk of the minimizer of empirical risk need not converge to the global minimum of risk.

Whether this happens depends on our choice of $\mathcal{H}$.

Hence, the statistical question we need to ask is: for what $\mathcal{H}$ is empirical risk minimization consistent?

That is, given any $\epsilon, \delta > 0$, does there exist $N < \infty$ such that, for all $n \ge N$,

$$\mathrm{Prob}[\,|R(\hat{h}^*_n) - R(h^*)| > \epsilon\,] \le \delta\,?$$

This is the question we address next.

Consistency of Empirical Risk Minimization

We would like the algorithm to satisfy: $\forall \epsilon, \delta > 0$, $\exists N < \infty$, such that

$$\mathrm{Prob}[\,|R(\hat{h}^*_n) - R(h^*)| > \epsilon\,] \le \delta, \quad \forall n \ge N$$

In addition, we would also like to have

$$\mathrm{Prob}[\,|\hat{R}_n(\hat{h}^*_n) - R(h^*)| > \epsilon\,] \le \delta, \quad \forall n \ge N$$

We would like to (approximately) know the true risk of the learnt classifier.

For what kind of $\mathcal{H}$ do these hold?

As we already saw, the law of large numbers (that $\hat{R}_n(h) \to R(h)$, $\forall h$) is not enough.

As it turns out, what we need is that the convergence under the law of large numbers be uniform over $\mathcal{H}$.

Such uniform convergence is necessary and sufficient for consistency of empirical risk minimization.

The law of large numbers says that the sample mean converges to the expectation of the random variable.

Given any $h$ and $\epsilon, \delta > 0$, $\exists N < \infty$ such that

$$\mathrm{Prob}[\,|\hat{R}_n(h) - R(h)| > \epsilon\,] \le \delta, \quad \forall n \ge N$$

The $N$ that exists can depend on $\epsilon, \delta$ and also on $h$.

The convergence is said to be uniform if the $N$ depends only on $\epsilon, \delta$ and not on $h$.

That is, for a given $\epsilon, \delta$, the same $N(\epsilon, \delta)$ works for all $h \in \mathcal{H}$.
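
To see what a bound that holds for every hypothesis at once looks like, one can track the largest gap between empirical and true risk over a whole class. The sketch below assumes, purely for illustration, a small finite class of 21 threshold classifiers on [0, 1], uniform inputs, and a target threshold of 0.5; these choices are not from the lecture.

```python
import random

# Illustrative assumptions: X ~ Uniform[0, 1), target threshold 0.5,
# and a finite class of threshold classifiers on a grid.
TARGET_T = 0.5
THRESHOLDS = [i / 20 for i in range(21)]

def predict(t, x):
    """Threshold classifier h_t(x) = 1 if x >= t, else 0."""
    return 1 if x >= t else 0

def true_risk(t):
    """h_t and the target disagree on an interval of length |t - TARGET_T|."""
    return abs(t - TARGET_T)

def empirical_risk(t, sample):
    """Average 0-1 loss of h_t on the sample."""
    return sum(predict(t, x) != y for x, y in sample) / len(sample)

def sup_deviation(sample):
    """Largest |empirical risk - true risk| over the whole class."""
    return max(abs(empirical_risk(t, sample) - true_risk(t)) for t in THRESHOLDS)

rng = random.Random(0)
for n in (50, 500, 5000, 20000):
    sample = [(x, predict(TARGET_T, x)) for x in (rng.random() for _ in range(n))]
    print(n, round(sup_deviation(sample), 4))
```

For this small class the worst-case gap shrinks with n, so a single sample size serves every hypothesis in the class. The class of all functions on the input space does not behave this way, as the memorizing hypothesis shows.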

To sum up, $\hat{R}_n(h)$ converges (in probability) to $R(h)$ uniformly over $\mathcal{H}$ if $\forall \epsilon, \delta > 0$, $\exists N(\epsilon, \delta) < \infty$ such that

$$\mathrm{Prob}\Big[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\Big] \le \delta, \quad \forall n \ge N(\epsilon, \delta)$$

This implies that the same $N(\epsilon, \delta)$ works for all $h \in \mathcal{H}$.

It is easy to show that uniform convergence is sufficient for consistency of empirical risk minimization.
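
One standard way to see the sufficiency is sketched below. It assumes that the minimizers $h^*$ and $\hat{h}^*_n$ exist in $\mathcal{H}$; the shorthand $D_n$ is introduced only for this sketch and is not notation from the lecture.

```latex
\begin{align*}
D_n &:= \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \\[4pt]
0 \;\le\; R(\hat{h}^*_n) - R(h^*)
  &= \big[R(\hat{h}^*_n) - \hat{R}_n(\hat{h}^*_n)\big]
   + \big[\hat{R}_n(\hat{h}^*_n) - \hat{R}_n(h^*)\big]
   + \big[\hat{R}_n(h^*) - R(h^*)\big] \\
  &\le D_n + 0 + D_n \;=\; 2 D_n \\[4pt]
|\hat{R}_n(\hat{h}^*_n) - R(h^*)|
  &\le |\hat{R}_n(\hat{h}^*_n) - R(\hat{h}^*_n)| + |R(\hat{h}^*_n) - R(h^*)|
  \;\le\; 3 D_n
\end{align*}
```

The middle bracket is at most zero because $\hat{h}^*_n$ minimizes $\hat{R}_n$ over $\mathcal{H}$. Hence if $D_n$ exceeds any given $\epsilon$ only with small probability for large $n$, the same holds for both quantities on the left, with $\epsilon$ replaced by $2\epsilon$ and $3\epsilon$ respectively.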