
## So far in the course we have seen some specific cases of learning classifiers given the training data.

## We discussed how to estimate class conditional densities for implementing the Bayes classifier.

## We have seen various methods (perceptron, least squares, LMS, logistic regression, FLD, etc.) to learn linear classifiers.

We now take a look at the general problem of learning classifiers.

## The problem of designing a classifier is essentially one of learning from examples.

## Given training data, we want to find an appropriate classifier.

It amounts to searching over a family of classifiers to find one that minimizes error over the training set.

## For example, in the least squares approach we are searching over the family of linear classifiers to minimize the squared error.

## As we discussed earlier, performance on the training set is not the real issue.

We would like the learnt classifier to perform well on new data.
This is the issue of generalization: does the learnt classifier generalize well?

## In practice, one assesses the generalization of the learnt classifier by looking at the error on a separate set of labelled data called the test set.

Since the test set is not used in training, the error on that data can be a good measure of the performance of the learnt classifier.
This, of course, requires more labelled data.
We look at these specific issues of practice later on.
Currently our focus is on a theoretical analysis of how to say whether a learning algorithm would generalize well.
PRNN (PSS) Jan-Apr 2016 p.16/315
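As a minimal sketch of this practice (the synthetic data, the 80/20 split, and the nearest-class-mean rule below are all illustrative assumptions, not from the course), we can train on one part of the labelled data and estimate generalization on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data: two Gaussian blobs in 2-D (illustrative).
n = 500
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
y = np.array([0] * n + [1] * n)

# Hold out a separate test set that plays no role in training.
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]

# A trivial classifier learnt from the training split only:
# assign each point to the class whose training mean is closer.
mu0 = X[train][y[train] == 0].mean(axis=0)
mu1 = X[train][y[train] == 1].mean(axis=0)

def predict(Z):
    d0 = ((Z - mu0) ** 2).sum(axis=1)
    d1 = ((Z - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

train_err = float((predict(X[train]) != y[train]).mean())
test_err = float((predict(X[test]) != y[test]).mean())
print(train_err, test_err)  # test error is the estimate of generalization
```

The key point is only that `test` never influences `mu0` and `mu1`; the error measured on it is therefore an unbiased estimate of performance on new data.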

## Let us consider an example problem of regression, where we want to learn the underlying relationship as f(X).

This is a simple regression problem and we can use least squares for it, based on the form of f.

## Suppose we choose a polynomial function:

f(X) = w0 + w1 X + w2 X^2 + … + wm X^m

## As we discussed earlier, we can do this using the linear least squares algorithm.

## We have looked at regularized least squares for this.

(It does not tell us the best m but helps learn a model with good generalization.)

## There are other methods (e.g., BIC) for choosing m.
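A small sketch of this fit (the data and noise level are made up for illustration): `np.polyfit` solves the linear least squares problem for the coefficients w0, …, wm, and since the degree-m models are nested, the training error can only go down as m grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: noisy samples of an underlying smooth function.
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, X.shape)

# Fit f(X) = w0 + w1 X + ... + wm X^m by linear least squares
# for several degrees m and record the training error.
errs = {}
for m in (1, 3, 9):
    w = np.polyfit(X, y, deg=m)
    errs[m] = float(np.mean((np.polyval(w, X) - y) ** 2))
print(errs)  # training error is non-increasing in m
```

This is exactly why the training error by itself cannot choose m: the richer model always fits the training data at least as well.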

## But more fundamentally, let us ask: can our data error tell us what m is proper?

## Firstly, the fact that we get less error for m′ compared to m does not necessarily mean that degree m′ is a better fit.

If for a particular m we get very low data error, can we say it is good?

## We know that if we search over all polynomials, we can never really learn anything. Can we have a formalism that makes this precise?

## There are different ways of addressing this issue (MDL, VC-theory, etc.).

## Any learning algorithm takes training data as the input and outputs a specific classifier/function.

## For this, it searches over some chosen family of functions to find one that optimizes a chosen criterion function.

{(Xi, yi)} → [ Learning Algorithm (searching over F) ] → f ∈ F

## The question is: how can we formalize the correctness of learning?

## For example, a generic approach is what is called the Minimum Description Length (MDL) principle.

## Suppose we want to send the data over a communication channel.

We can send the 2n numbers, Xi, yi, using some number of bits.
Or we can send the Xi, the function f, and the errors yi − f(Xi).

## If the fit is good, the errors yi − f(Xi) would have a small range and we may be able to send them using a smaller number of bits compared to sending the yi.

## If f is very complex, then what we save in bits by sending the errors instead of the yi may be more than offset by the bits needed to send the description of f.

## Hence we can rate different f by the total number of bits we need.

This can balance the data error and model complexity in a natural way.

## So, to send f we need to send d + 1 numbers if we use a polynomial with degree d.

If the degree is high, sending f instead of the yi may not really pay!
Hence, simpler functions that fit the data well would be preferred.

## As presented, this approach is only asking the question whether f is a good fit for the specific data at hand.
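The bit-accounting above can be sketched numerically. Everything below is an assumed, deliberately crude encoding (a fixed per-coefficient cost and a fixed quantization step for the residuals), not a rigorous MDL code; it only illustrates how total description length trades model cost against residual cost.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, X.shape)

BITS_PER_COEFF = 16   # assumed cost to transmit one coefficient of f
QUANTUM = 1e-3        # assumed quantization step for the residuals

def description_length(m):
    # Bits for f: the (m + 1) coefficients of a degree-m polynomial.
    w = np.polyfit(X, y, deg=m)
    model_bits = (m + 1) * BITS_PER_COEFF
    # Bits for the residuals y_i - f(X_i): roughly log2 of how many
    # quantization steps their spread covers, per sample.
    r = y - np.polyval(w, X)
    spread = max(float(r.std()), QUANTUM)
    residual_bits = len(y) * np.log2(spread / QUANTUM + 1)
    return model_bits + residual_bits

lengths = {m: description_length(m) for m in range(12)}
best_m = min(lengths, key=lengths.get)
print(best_m)  # a moderate degree: low enough to be cheap, high enough to fit
```

Very low degrees pay heavily for large residuals, very high degrees pay for coefficients; the minimum sits in between.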

## We will follow a different statistical approach to address the issue of correctness of learning.

## Intuitively, we can learn the correct relationship if we are given a sufficient number of representative examples.

## Also, the sufficiency of the number of examples depends on the family F of classifiers over which we are searching.

We hence need a good notion of the complexity of the learning problem.

## We begin with a simple formalism where there is no noise and the goal of learning is well-defined.

## A learning problem is defined by giving:

(i) X — input space; often ℝ^d (feature space)
(ii) Y — output space; the set of class labels
(iii) C ⊆ 2^X — concept space (family of classifiers)
Each C ∈ C can also be viewed as a function C : X → {0, 1}, with C(X) = 1 iff X ∈ C.
(iv) S = {(Xi, yi), i = 1, …, n} — the set of examples, where the Xi are drawn iid according to some distribution Px on X and yi = C*(Xi) for some C* ∈ C. C* is called the target concept.

## We call this the concept space because we can say the system is learning a concept from examples.

## The learning algorithm knows X, Y, C; but it does not know C*.

It needs to learn the target concept from examples.

## We do not know the distribution Px. However, taking that the examples are iid ensures that we get representative examples.

## We are trying to teach a concept through examples that come from an arbitrary distribution.

## Since we have taken yi = C*(Xi) for all i, there is no noise.

Also, assuming that C* ∈ C means that ideally we can learn the target concept.

We could take C = 2^X.
This means we are searching over the family of all possible (2-class) classifiers.

## So, choosing a particular C is based on either some knowledge we have about the problem or because of the kind of learning algorithm we have.

## For example, we can take C to be all half-spaces, the family of all linear classifiers.

## Suppose we want to learn the concept of medium-build persons based on the features of height and weight.

## We would be given examples (with no errors!) drawn from some arbitrary distribution.

## What could be C in this example? We can use some problem-based intuition.

For example, C could be the family of axis-parallel rectangles in the height–weight plane.

## Now, assuming C* ∈ C means that the god-given classifier is also an axis-parallel rectangle.

## Note that each C ∈ C can be viewed either as a subset of X or as a binary-valued function on X.

## Let Cn denote the concept or classifier output by the learning algorithm after it processes n iid examples.

## For correctness of the learning algorithm, we want Cn to be close to C* as n becomes large.

## The closeness of Cn to C* is in terms of classifying samples drawn from X according to Px.

We define the error of Cn by

err(Cn) = Px(Cn Δ C*) = Prob[{X ∈ X : Cn(X) ≠ C*(X)}]

where, for sets Cn, C*,

Cn Δ C* = (Cn − C*) ∪ (C* − Cn).

## The err(Cn) is the probability that, on a random sample drawn according to Px, the classifications of Cn and C* differ.
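The definition err(Cn) = Px(Cn Δ C*) can be checked numerically by sampling from Px. The rectangles and the uniform Px below are illustrative assumptions chosen so that the true symmetric-difference mass is easy to compute by hand (0.24 − 0.175 = 0.065):

```python
import numpy as np

rng = np.random.default_rng(3)

# Axis-parallel rectangles as (x1, x2, y1, y2); membership is the
# concept viewed as a {0, 1}-valued function on X.
def member(rect, pts):
    x1, x2, y1, y2 = rect
    return ((pts[:, 0] >= x1) & (pts[:, 0] <= x2) &
            (pts[:, 1] >= y1) & (pts[:, 1] <= y2))

C_star = (0.2, 0.8, 0.3, 0.7)   # target concept (illustrative)
C_n = (0.3, 0.8, 0.35, 0.7)     # a learnt rectangle inside C*

# Monte Carlo estimate of err(C_n) = P_x(C_n Δ C*): the fraction of
# samples from P_x (uniform on the unit square here) on which the
# two classifications differ.
pts = rng.uniform(0, 1, (200_000, 2))
err = float(np.mean(member(C_n, pts) != member(C_star, pts)))
print(err)  # close to the true symmetric-difference mass, 0.065
```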

## Essentially, we want err(Cn) to become zero as n → ∞.

However, err(Cn) is a random variable because Cn is a function of the random samples X1, …, Xn.
Hence we have to properly define the sense in which err(Cn) converges as n → ∞.

## We say a learning algorithm Probably Approximately Correctly (PAC) learns a concept class C if, given any ε, δ > 0, there exists N < ∞ such that

Prob[err(Cn) > ε] < δ

for all n > N, for any distribution Px and any C* ∈ C.

## The probability above is with respect to the distribution of n-tuples of iid samples drawn according to Px on X.

## The Px is arbitrary. But the distribution is the same for training and testing, which is fair to the algorithm.

## An algorithm PAC learns C if

Prob[err(Cn) > ε] < δ

for sufficiently large n and any Px.

## So, what the above says is that the classifier output by the algorithm after seeing n random examples, Cn, is approximately correct (to within ε) with a high probability (at least 1 − δ).

## The ε and δ are called the accuracy and confidence parameters, respectively.

## Let us look at the example of learning the concept of medium-built persons.

Here, X = ℝ².
The strategy of the learning algorithm is as follows.

## The algorithm outputs a classifier which correctly classifies all examples.

## If there is more than one C ∈ C that is consistent with all examples, we output the smallest such C.

## For finite sets, smallest is in terms of the number of points; for other sets it is in terms of the areas of the sets. (This will do for our purpose here.)

## Let us consider two concept classes, C1 and C2, and see what this algorithm does.

C1 is the family of axis-parallel rectangles and C2 = 2^X is the family of all classifiers.

## Suppose the target concept C* is some fixed axis-parallel rectangle.

Hence an (x, y) ∈ ℝ² is a positive example if it is in C* and a negative example otherwise.

First consider C1.
The smallest C ∈ C consistent with all examples would be the smallest axis-parallel rectangle enclosing all the positive examples seen so far.
This is the concept output by the algorithm when using C1.
Thus, under the strategy of our learning algorithm, for all n, the Cn would always be inside C*.

## Whenever any example is classified as positive by Cn, it would also be classified positive by C*.
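This learning strategy is a few lines of code. The target rectangle and the uniform Px below are illustrative assumptions; the point is that the learnt rectangle, being the smallest one enclosing the positive examples, always lies inside the target:

```python
import numpy as np

rng = np.random.default_rng(4)

# Target concept C*: an axis-parallel rectangle (illustrative values).
x1, x2, y1, y2 = 0.2, 0.8, 0.3, 0.7

# Draw n iid examples from P_x (uniform here) and label them by C*.
n = 100
X = rng.uniform(0, 1, (n, 2))
y = (X[:, 0] >= x1) & (X[:, 0] <= x2) & (X[:, 1] >= y1) & (X[:, 1] <= y2)

# The algorithm's output C_n: the smallest axis-parallel rectangle
# enclosing all positive examples seen so far.
pos = X[y]
C_n = (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())
print(C_n)  # always contained in the target rectangle
```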

## Since Cn is also an axis-parallel rectangle which is inside C*, Cn Δ C* would be the annular region between the two rectangles.

So Cn errs only on samples that fall in this annular region.

## Note that we are not really bothered about the area of this annular region; we are only interested in the probability mass of this region under Px.

## Now, given an ε > 0, we have to bound the probability that err(Cn) > ε.

## The error is greater than ε only if the probability mass (under Px) of the annular region is greater than ε.

## When can that happen? Only when none of the examples seen happen to be in the annular region.

## Why? Because otherwise the rectangle learnt by our algorithm would have been closer to C*.

## Hence the probability of the event err(Cn) > ε is the same as the probability that, when n iid examples are drawn according to Px, none of them came from a subset of X that has Px-probability at least ε.

## That is, all examples came from a subset of probability at most (1 − ε).

Hence we have

Prob[err(Cn) > ε] ≤ (1 − ε)^n

## Let N be such that (1 − ε)^n < δ for all n > N.

The required N is N ≥ ln(δ) / ln(1 − ε) (a bound on the number of examples).

## For this N, we have

Prob[err(Cn) > ε] ≤ δ, for all n > N,

showing the algorithm PAC learns the concept class.
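The bound N ≥ ln(δ) / ln(1 − ε) is easy to evaluate (note that both logarithms are negative, so the ratio is positive):

```python
import math

def pac_sample_bound(eps, delta):
    """Smallest integer n with (1 - eps)^n < delta.

    From n * ln(1 - eps) < ln(delta), dividing by the negative
    quantity ln(1 - eps) gives n > ln(delta) / ln(1 - eps).
    """
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

print(pac_sample_bound(0.1, 0.05))   # 29 examples suffice
print(pac_sample_bound(0.01, 0.05))  # 299: tighter accuracy costs more data
```

Tightening the accuracy ε is expensive (the bound grows roughly like 1/ε), while tightening the confidence δ is cheap (only logarithmic).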

## Now let us consider the same algorithm with the concept class C2 = 2^X.

Here we are searching over all possible 2-class classifiers.
So, intuitively, we do not expect the algorithm to be able to learn anything.
There is too much flexibility in the bag of classifiers over which we are searching.

## What would be Cn now?

After seeing n examples, the smallest set in C2 that is consistent with all examples is the set consisting of all the positive examples seen so far!
Now the algorithm simply remembers all the positive examples seen.

## This happened because every possible finite subset of X is in our concept class.

## So, now, Cn Δ C* would be the axis-parallel rectangle C* minus some finite number of points from it.

## So, under any continuous Px,

err(Cn) = Px(Cn Δ C*) = Px(C*)

for all n.

## Hence, for any ε < Px(C*), Prob[err(Cn) > ε] = 1 for all n.

Thus, the algorithm cannot PAC learn with C2.
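This failure is easy to demonstrate. The target rectangle and uniform Px below are illustrative assumptions: the memorizing learner is perfect on its training data, yet on fresh samples it almost surely answers "negative" everywhere, so its error stays at Px(C*) no matter how large n gets.

```python
import numpy as np

rng = np.random.default_rng(5)
x1, x2, y1, y2 = 0.2, 0.8, 0.3, 0.7   # target rectangle C* (illustrative)

def label(Z):
    return (Z[:, 0] >= x1) & (Z[:, 0] <= x2) & (Z[:, 1] >= y1) & (Z[:, 1] <= y2)

# With C2 = 2^X the "smallest consistent set" is just the set of
# positive examples seen so far, i.e. pure memorization.
X = rng.uniform(0, 1, (1000, 2))
memorized = {tuple(p) for p in X[label(X)]}

def predict(Z):
    return np.array([tuple(p) in memorized for p in Z])

# Zero error on the training data...
assert np.all(predict(X) == label(X))

# ...but under a continuous P_x a fresh sample almost surely misses
# every memorized point, so predict() is negative everywhere and
# err(C_n) = P_x(C*) = 0.6 * 0.4 = 0.24 here.
fresh = rng.uniform(0, 1, (100_000, 2))
err = float(np.mean(predict(fresh) != label(fresh)))
print(err)  # close to 0.24 regardless of training-set size
```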

## This example clearly illustrates the difficulty of learning from examples if the bag of classifiers being considered is too large.

## The largeness is not in terms of the number of elements in our concept class.

Both C1 and C2 contain an uncountably infinite number of classifiers.
We would later on define an appropriate quantity to quantify the sense in which one concept class can be said to be bigger than (or more complex than) another.

## Since every axis-parallel rectangle can be specified by four quantities, this class can be parameterized by four parameters.

## However, there is no such finite parameterization for C2 = 2^X.

Also, the strategy of our algorithm can be coded efficiently in the case of C1.

## It allows one to properly define what correctness of learning is, and allows us to ask questions like whether a given algorithm learns correctly.

## As we have seen in our example, we can also bound the number of examples needed to learn to a given level of accuracy and confidence.

## Thus we can appreciate the relative complexities of different learning problems.

## However, PAC learnability deals with ideal learning situations.

We assume there is a (god-given) C* and that it is in our C.
Also, we assume that the examples are noise-free and perfectly classified.
Next we consider an extension of this framework that is relevant for realistic learning scenarios.

## X — input space (as earlier, the feature space)

Y — output space (as earlier, the set of class labels)
H — hypothesis space (family of classifiers)
Each h ∈ H is a function h : X → A, where A is called the action space.

## Training data: {(Xi, yi), i = 1, …, n},

drawn iid according to some distribution Pxy on X × Y.

## If we take A = Y then it is the same as earlier.

But the freedom in choosing A allows for taking care of many situations.

## For example, even when Y = {0, 1}, we can take A = ℝ (e.g., learning discriminant functions).

Now, e.g., the sign of h(X) may denote the class and its magnitude may give some measure of confidence in the assigned class.

## Now we draw examples from X × Y according to P_xy.

This allows for noise in the training data.

For example, when the class conditional densities overlap, the same X can come from different classes with different probabilities.

We can always factorize P_xy = P_x P_y|x. If P_y|x is a degenerate distribution then it is the same as earlier: we draw iid samples from X and each point is essentially classified by the target classifier.

However, having examples drawn from X × Y using a distribution allows for many more scenarios.

## As before, the learning machine outputs a hypothesis, hₙ ∈ H, given the training data consisting of n examples.

However, now there is no notion of a target concept/hypothesis.

Hence we use the idea of loss functions to define the goal of learning.

Loss function

Loss function: L : Y × A → ℝ₊.
The idea is that L(y, h(X)) is the loss suffered by h ∈ H on a (random) sample (X, y) ∈ X × Y.
More generally, we can let the loss depend on X explicitly as well, and write L(X, y, h(X)) for the loss function.
By convention we assume that the loss function is non-negative.
Now we can look for hypotheses that have low average loss over samples drawn according to P_xy.

Risk function

Define a function R : H → ℝ₊ by

    R(h) = ∫ L(y, h(X)) dP_xy

That is, R(h) is the expectation of the loss incurred by h, taken with respect to P_xy.

We can now define the goal of learning as minimization of risk.

## We define the goal of learning as finding h*, the global minimizer of risk.

Risk minimization is a very general strategy adopted by most machine learning algorithms.
However, we cannot compute R(h) since we do not have knowledge of P_xy.

How can we find the minimizer of risk? Minimizing R(·) directly is not feasible.

## Empirical Risk function

Define the empirical risk function R̂ₙ : H → ℝ₊ by

    R̂ₙ(h) = (1/n) Σᵢ₌₁ⁿ L(yᵢ, h(Xᵢ))

This is the sample mean estimator of risk obtained from n iid samples.

Let ĥₙ = arg min_{h ∈ H} R̂ₙ(h).
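The sample-mean estimator above is directly computable from data. A minimal sketch (function names and data are illustrative, not part of the original slides):

```python
import numpy as np

def empirical_risk(loss, h, X, y):
    """Sample-mean estimate of risk: (1/n) * sum_i loss(y_i, h(X_i))."""
    return np.mean([loss(yi, h(xi)) for xi, yi in zip(X, y)])

# Example with squared error loss and a simple hypothesis h(x) = 2x.
squared_loss = lambda y, a: (y - a) ** 2
h = lambda x: 2.0 * x
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 5.0])
risk_hat = empirical_risk(squared_loss, h, X, y)  # (0 + 0 + 1) / 3
```

Any loss function from the following slides can be plugged in for `squared_loss`.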

## The loss function is known to us.

Hence, given any h, we can calculate R̂ₙ(h).

Hence, we can (in principle) find ĥₙ by optimization methods.
Approximating h* by ĥₙ is the basic idea of the empirical risk minimization strategy, which is used in most ML algorithms.

## Is ĥₙ a good approximator of h*, the minimizer of true risk (for large n)?

This is the question of consistency of empirical risk minimization.
Thus, we can say that any learning problem has two parts.

The optimization part: find ĥₙ, the minimizer of R̂ₙ.
The statistical part: is ĥₙ a good approximator of h*?

## Note that the loss function is chosen by us; it is part of the specification of the learning problem.

The loss function is intended to capture how we would like to evaluate the performance of the classifier.

## L(y, h(X)) = I_[y ≠ h(X)]

where I_[A] denotes the indicator of event A.

## Hence, R(h) = Prob[y ≠ h(X)]; the risk is the probability of misclassification.

So, h* minimizes the probability of misclassification, a very desirable goal (the Bayes classifier).
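Under this loss the empirical risk is just the fraction of misclassified training points. A small illustration (labels and predictions are made up):

```python
import numpy as np

def zero_one_loss(y, a):
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return float(y != a)

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])   # two mistakes
emp_risk = np.mean([zero_one_loss(t, p) for t, p in zip(y_true, y_pred)])
# emp_risk is the misclassification rate, here 2/5
```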

## Here we assumed that the learning algorithm searches over a class of binary-valued functions on X.

We can extend this to, e.g., discriminant function learning.
We take A = ℝ (now h(X) is a discriminant function).

## Having any fixed misclassification costs is essentially the same as the 0-1 loss.

Even if we take A = ℝ, the 0-1 loss compares only the sign of h(x) with y; the magnitude of h(x) has no effect on the loss.
Here, we cannot trade good performance on some data with bad performance on others.
This makes the 0-1 loss function more robust to noise in the classification labels.

## While the 0-1 loss is an intuitively appealing performance measure, minimizing empirical risk under it is hard.

Note that the 0-1 loss function is non-differentiable, which makes the empirical risk function non-differentiable too.
Empirical risk minimization is also a non-convex optimization problem here.

Hence many other loss functions are often used in machine learning.

## As is easy to see, the linear least squares method that we considered is empirical risk minimization with the squared error loss function.

Here, for a 2-class classification problem, we can take Y as {0, 1} or {+1, -1}. We take A = ℝ so that each h is a discriminant function.
As we know, we can use this for regression problems also, and then we take Y = ℝ.

## Another interesting scenario here is to take Y = {0, 1} and A = [0, 1].

Then each h can be interpreted as a posterior probability (of class-1) function.

As we know, the minimizer of the expectation of the squared error loss (the risk here) is the posterior probability function.
So, risk minimization would now look for a function in H that is a good approximation of the posterior probability function.
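The claim about the minimizer can be verified with a short conditional-expectation calculation (for a fixed x, write a = h(x)):

```latex
% For a fixed x, choose a = h(x) to minimize the conditional expected loss:
\[
\mathbb{E}\big[(y-a)^2 \mid X = x\big]
  = \mathbb{E}[y^2 \mid x] - 2a\,\mathbb{E}[y \mid x] + a^2 .
\]
% Setting the derivative w.r.t. a to zero, 2a - 2\,\mathbb{E}[y \mid x] = 0, gives
\[
a^{\ast} = \mathbb{E}[y \mid x] = P(y = 1 \mid x),
\]
% where the last equality holds because y takes values in {0, 1}.
```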

## The empirical risk minimization under squared error loss is a convex optimization problem for linear models (when h is linear in its parameters).

The squared error loss is extensively used in many learning algorithms.
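As a concrete instance, linear least squares solves this convex empirical risk minimization in closed form. A sketch on synthetic data (all names and the data-generating choices are illustrative):

```python
import numpy as np

# Synthetic 2-class data with labels in {+1, -1}.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

def emp_sq_risk(w):
    """Empirical risk of the linear h(x) = w.x under squared error loss."""
    return np.mean((X @ w - y) ** 2)

# Least squares: the closed-form global minimizer of this convex empirical risk.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the problem is convex, `w_hat` beats every other linear hypothesis on the training set.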

## We take Y = {+1, -1} and A = ℝ. The loss function is given by

    L(y, h(X)) = max(0, 1 - y h(X))

(the hinge loss). Here, if y h(X) > 0 then the classification is correct, and if y h(X) ≥ 1 the loss is zero.

This also results in a convex optimization problem for empirical risk minimization.

## We can look at the other loss functions as convex approximations of the 0-1 loss function, as follows.

Let us take Y = {+1, -1}.
We can write all these loss functions as functions of the single variable y h(X).

## For the 0-1 loss, L(y, h(X)) is one if y h(X) is negative and zero otherwise.

The squared error loss can be written as (1 - y h(X))², since y² = 1.

[Figure: the 0-1 loss, square loss, and hinge loss plotted as functions of y f(x); x-axis: y f(x), y-axis: L(y, f(x)).]

## (Here we plot y h(X) on the x-axis and L(y, h(X)) on the y-axis.)

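The three curves in the figure can be reproduced numerically. With y ∈ {+1, -1}, each loss is a function of the margin m = y h(X) (a sketch; function names are illustrative):

```python
import numpy as np

# Each loss written as a function of the margin m = y * h(X).
def zero_one(m):
    return np.where(m < 0, 1.0, 0.0)       # 0-1 loss

def square(m):
    return (1.0 - m) ** 2                  # (y - h)^2 = (1 - y h)^2 since y^2 = 1

def hinge(m):
    return np.maximum(0.0, 1.0 - m)        # hinge loss

m = np.linspace(-2.0, 2.0, 401)
# Both convex losses upper-bound the 0-1 loss at every margin value:
upper_bound = bool(np.all(square(m) >= zero_one(m)) and np.all(hinge(m) >= zero_one(m)))
```

The upper-bound property is one reason these convex surrogates are sensible stand-ins for the 0-1 loss.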

## There are also many other loss functions used in regression problems.

The squared error loss can be used, but it is very sensitive to data errors such as outliers.
So, other loss functions are used for better robustness in regression.

## As we saw, there are many different loss functions one can think of.

Many of them also make the empirical risk minimization problem efficiently solvable.

Thus a choice of loss function can efficiently tackle the optimization part of learning.

We consider some such algorithms later in this course.
Now, let us get back to the statistical question that we started with.

## Since we do not know R, we minimize the empirical risk, R̂ₙ, instead, and thus find ĥₙ.

We want h* and ĥₙ to be close. More precisely, we are interested in the question

    R(ĥₙ) → R(h*) ?

(As earlier, ĥₙ is random and hence we take the above as convergence in probability.)

## What is the intuitive reason for using empirical risk minimization?

The sample mean is a good estimator and hence, with large n, R̂ₙ(h) converges to R(h), for any h ∈ H.
This is the (weak) law of large numbers.

But this does not necessarily mean that R(ĥₙ) converges to R(h*).
We are interested in: does the true risk of the minimizer of empirical risk converge to the global minimum of risk?

## Suppose the examples are drawn according to P_x on X and classified according to an h̄ ∈ H.

That is, P_xy = P_x P_y|x and P_y|x is a degenerate distribution.

Now the global minimum of risk is zero and R(h̄) = 0.

## Note that now the risk of any h is the same as P_x(h ≠ h̄).

That is, this scenario is the same as what we considered under the PAC framework.
Now, under the 0-1 loss, the global minimum of empirical risk is also zero.

For any n, there may be many h (other than ĥₙ) with R̂ₙ(h) = 0.
Hence our optimization algorithm can only use some general rule to output one such hypothesis.

## Consider h1 : X → Y with h1(Xᵢ) = yᵢ, ∀(Xᵢ, yᵢ) ∈ S, and h1(X) = 1 for all other X.

Then R̂ₙ(h1) = 0! It is a global minimizer of empirical risk. But it is obvious that h1 is not a good classifier.
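The failure of this h1 is easy to see in simulation: it attains zero empirical risk yet its true risk stays near chance level. A sketch on synthetic data (the target classifier and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    """An illustrative underlying target classifier: label 1 iff x > 0.5."""
    return (np.asarray(x) > 0.5).astype(int)

X_train = rng.random(50)
y_train = target(X_train)
memory = dict(zip(X_train.tolist(), y_train.tolist()))

def h1(x):
    """Memorize the training set; output 1 for every other X."""
    return np.array([memory.get(float(v), 1) for v in np.atleast_1d(x)])

train_risk = np.mean(h1(X_train) != y_train)        # empirical risk: exactly 0
X_test = rng.random(100000)                         # fresh iid samples
test_risk = np.mean(h1(X_test) != target(X_test))   # near 0.5: h1 says 1 everywhere
```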

Suppose we took H = 2^X.

This is the same as the example we considered earlier, and hence we know that P_x(ĥₙ ≠ h̄) will not go to zero with n.

Thus, here, R(ĥₙ) will not converge to R(h*).
Note that the law of large numbers still implies that R̂ₙ(h) converges to R(h), ∀h.

## If functions like h1 are in our H then empirical risk minimization (ERM) may not yield good classifiers.

If H contains all possible functions, then this is certainly the case, as we saw in our example.

Functions like h1 could be highly non-smooth and hence one way out is to impose some smoothness conditions on the learnt function (e.g., regularization).

## We saw that though R̂ₙ(h) → R(h), ∀h, the risk of the minimizer of empirical risk need not converge to the global minimum of risk.

Hence, the statistical question we need to ask is: for what H is empirical risk minimization consistent? That is, when does

    Prob[ |R(ĥₙ) - R(h*)| > ε ] → 0 ?

This is the question we address next.

## We would like the algorithm to satisfy: ∀ε, δ > 0, ∃N < ∞ such that

    Prob[ |R(ĥₙ) - R(h*)| > ε ] ≤ δ,  ∀n ≥ N

In addition, we would also like to have

    Prob[ |R̂ₙ(ĥₙ) - R(h*)| > ε ] ≤ δ,  ∀n ≥ N

We would like to (approximately) know the true risk of the learnt classifier.
For what kind of H do these hold?

## As we already saw, the law of large numbers (that R̂ₙ(h) → R(h), ∀h) is not enough.

As it turns out, what we need is that the convergence under the law of large numbers be uniform over H.

Such uniform convergence is necessary and sufficient for consistency of empirical risk minimization.

## The law of large numbers says that the sample mean converges to the expectation of the random variable:

    ∀ε, δ > 0, ∃N(ε, δ, h) such that Prob[ |R̂ₙ(h) - R(h)| > ε ] ≤ δ,  ∀n ≥ N(ε, δ, h)

The convergence is said to be uniform if the N depends only on ε, δ and not on h.

That is, for given ε, δ, the same N(ε, δ) works for all h ∈ H.

## To sum up, R̂ₙ(h) converges (in probability) to R(h) uniformly over H if ∀ε, δ > 0, ∃N(ε, δ) < ∞ such that

    Prob[ sup_{h ∈ H} |R̂ₙ(h) - R(h)| > ε ] ≤ δ,  ∀n ≥ N(ε, δ)

## It is easy to show that uniform convergence is sufficient for consistency of empirical risk minimization.
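For a finite H, uniform convergence can be observed directly: the worst-case deviation sup_h |R̂ₙ(h) - R(h)| shrinks as n grows. A simulation sketch with threshold classifiers (the hypothesis space, target threshold, and sample sizes are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# A finite hypothesis space of threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0.0, 1.0, 21)
T_STAR = 0.3   # target threshold; X ~ Uniform[0, 1]

def sup_deviation(n):
    """sup over H of |R_hat_n(h) - R(h)| under the 0-1 loss."""
    X = rng.random(n)
    y = (X > T_STAR).astype(int)
    devs = []
    for t in thresholds:
        emp = np.mean((X > t).astype(int) != y)   # empirical risk of h_t
        true = abs(t - T_STAR)                    # R(h_t): measure of the disagreement region
        devs.append(abs(emp - true))
    return max(devs)

dev_small_n = sup_deviation(50)
dev_large_n = sup_deviation(50000)   # the uniform deviation shrinks as n grows
```

For a finite H a union bound over hypotheses already gives such uniform convergence; the hard (and interesting) case is infinite H.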