p.s.sastry

© All Rights Reserved


We have seen some special cases of learning classifiers given the training data. We saw how to estimate class-conditional densities for implementing the Bayes classifier. We also saw several algorithms (least squares, LMS, logistic regression, FLD etc.) to learn linear classifiers.

We now take a look at the general problem of learning classifiers.

The problem is one of learning from examples: the training examples specify the desired classifier. Learning amounts to searching over a family of classifiers to find one that minimizes error over the training set. Linear least squares, for example, amounts to searching over the family of linear classifiers for minimizing the square of the error.

However, minimizing error on the training data is not the real issue. We would like the learnt classifier to perform well on new data.

This is the issue of generalization: does the learnt classifier generalize well?

In practice, we assess the performance of the learnt classifier by looking at the error on a separate set of labelled data called the test set. Since the test set would not be used in training, the error on that data could be a good measure of the performance of the learnt classifier. This means that we should have access to some more labelled data. We look at these specific issues of practice later on. Currently our focus is on a theoretical analysis of how to say whether a learning algorithm would generalize well.

PRNN (PSS) Jan-Apr 2016 p.16/315

Consider the problem of regression. We want to learn a function of the input, which we write as f(X), and we can formulate least squares for it based on the form of f. For example, take f to be a polynomial:

    f(X) = w0 + w1 X + w2 X^2 + ... + wm X^m

For a fixed degree m, we can learn the weights using the least squares algorithm. (This does not tell us the best m but, given an m, helps learn a model with good generalization.)
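To make this concrete, here is a minimal numpy sketch of the polynomial least squares fit. The synthetic quadratic data and the range of degrees tried are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): a noisy quadratic.
X = rng.uniform(-1, 1, size=30)
y = 1.0 + 2.0 * X - 3.0 * X**2 + 0.1 * rng.standard_normal(30)

def fit_poly(X, y, m):
    """Least squares fit of f(X) = w0 + w1 X + ... + wm X^m."""
    A = np.vander(X, m + 1, increasing=True)  # columns: 1, X, ..., X^m
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def train_mse(X, y, w):
    """Mean squared error of the fitted polynomial on the training data."""
    A = np.vander(X, len(w), increasing=True)
    return float(np.mean((A @ w - y) ** 2))

# Training error never increases with m, which is exactly why it
# cannot, by itself, tell us the proper degree.
errs = [train_mse(X, y, fit_poly(X, y, m)) for m in range(6)]
assert all(b <= a + 1e-9 for a, b in zip(errs, errs[1:]))
```

The monotone decrease of `errs` with the degree is the model-order problem discussed next.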


But the training error alone cannot tell us what m is proper. A lower training error with a higher degree compared to m does not necessarily mean the higher-degree polynomial is a better fit. If for a particular m we get a very low data error, can we say it is good? If the family of models is rich enough to fit any data exactly, we can never really learn anything. Can we have a formalism that makes this precise?


There are formalisms that try to address such questions (MDL, VC-theory etc.). First, a generic view of learning: a learning algorithm takes the training data as input and outputs a specific classifier/function. It searches over a family F of functions to find one that optimizes a chosen criterion function.

    {(Xi, yi)}  -->  Learning Algorithm (searching over F)  -->  f ∈ F

In what sense can we talk of the correctness of such learning?


One formalism is the Minimum Description Length principle. Suppose we want to transmit the training data over a communication channel. We can send the 2n numbers, Xi, yi, using some number of bits. Or we can send the Xi, the function f, and the errors yi - f(Xi). If f fits the data well, the errors would have a small range and we may be able to send them using a smaller number of bits compared to sending the yi. But the bits saved by sending errors instead of yi may be more than offset by the bits needed to send the description of f. The MDL principle is to choose the f that minimizes the total number of bits we need. This can balance the data error and model complexity in a natural way.
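A toy sketch of the two-part code idea. The coding choices here (32 bits per coefficient, a Gaussian-style code length of (n/2)·log2(mean squared error) bits for the residuals) are my own illustrative assumptions; the slides give only the principle.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data that is truly linear plus noise (illustrative).
X = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * X + 0.2 * rng.standard_normal(40)

def description_length(X, y, m, bits_per_param=32):
    """Two-part code length for a degree-m least squares fit.

    Part 1: the m+1 coefficients of f at a fixed precision.
    Part 2: the residuals y_i - f(X_i); smaller residuals cost
    fewer bits (Gaussian-style code, an illustrative choice)."""
    A = np.vander(X, m + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = np.mean((y - A @ w) ** 2)
    model_bits = (m + 1) * bits_per_param
    error_bits = 0.5 * len(y) * np.log2(mse + 1e-12)
    return model_bits + error_bits

# Higher degrees fit slightly better but cost more bits to describe;
# the total is minimized at the true (linear) model.
best_m = min(range(8), key=lambda m: description_length(X, y, m))
assert best_m == 1
```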

For example, suppose that to fit the data (Xi, yi) we use a polynomial with degree d. A large d can make the data errors very small, but the bits needed to describe f grow with d, so the saving may not really pay! A simpler f with slightly larger errors may then be preferred. MDL thus gives one way of answering the question whether the f is a good fit for the specific data at hand.


We next look at a formalism to address the issue of correctness of learning. We can expect to learn the correct classifier only if we are given a sufficient number of representative examples. How many examples are needed would depend on the family F of classifiers where we are searching. We hence need a good notion of the complexity of the learning problem. We begin with an idealized framework where there is no noise and the goal of learning is well-defined.


The framework (PAC learning) is as follows.

(i) X : input space; often ℝ^d (the feature space).
(ii) C : concept class, a set of subsets of X. Each C ∈ C can also be viewed as a function C : X → {0, 1}, with C(X) = 1 iff X ∈ C.
(iii) Training data: {(Xi, yi), i = 1, ..., n}, where the Xi are drawn iid according to some distribution Px on X and yi = C*(Xi) for some C* ∈ C. C* is called the target concept.


In this framework, the learning system is learning a concept from examples. The algorithm does not know C*. It needs to learn the target concept from examples. The assumption that the examples are iid ensures we get representative examples, and the framework allows examples that come from an arbitrary distribution. We assume the examples are classified without noise. Also, assuming that C* ∈ C means that ideally we can learn the target concept.

We could take C = 2^X, the set of all subsets of X. This means we are searching over the family of all possible (2-class) classifiers. In general C would be a smaller family, either because of prior knowledge we have about the problem or because of the kind of learning algorithm we have. For example, we may search only over the family of all linear classifiers.

As a running example, suppose we want to learn to classify medium-build persons based on features of height and weight. The examples come from some arbitrary distribution. Using problem-based intuition, we take C to be the set of all axis-parallel rectangles in the plane, so that the learnt classifier is also an axis-parallel rectangle.

The output of the learning algorithm is a concept, that is, a subset of X, or a binary-valued function on X. Let Cn denote the output of the learning algorithm after it processes n iid examples. For the learning to be correct, we want Cn to be close to C* as n becomes large, where closeness is judged with respect to samples drawn from X according to Px.

We define the error of Cn by

    err(Cn) = Px(Cn Δ C*)
            = Prob[{X ∈ X : Cn(X) ≠ C*(X)}]

where, for sets Cn, C*,

    Cn Δ C* = (Cn - C*) ∪ (C* - Cn).

Thus err(Cn) is the probability that, on a random sample drawn according to Px, the classifications of Cn and C* differ.


We would like err(Cn) → 0 as n → ∞. However, err(Cn) is a random variable because Cn is a function of the random samples X1, ..., Xn. Hence we have to properly define the sense in which err(Cn) converges as n → ∞.

Definition: An algorithm Probably Approximately Correctly (PAC) learns a concept class C if, given any ε, δ > 0, there is an N < ∞ such that

    Prob[err(Cn) > ε] < δ

for all n > N, for any distribution Px and any target C* ∈ C. Here the probability is with respect to the distribution of n-tuples of iid samples drawn according to Px on X. Since the error is measured under the same Px that generates the examples, demanding this for any distribution is fair to the algorithm.


Thus, to PAC learn, we need

    Prob[err(Cn) > ε] < δ

for sufficiently large n and any Px. That is, the output of the algorithm after seeing n random examples, Cn, is approximately correct with a high probability. ε and δ are the accuracy and confidence parameters respectively.


Consider our example of learning the class of medium-built persons. Here, X = ℝ² and C is the set of all axis-parallel rectangles. The strategy of the learning algorithm is as follows: find a C ∈ C that correctly classifies all examples; if more than one such C classifies all examples, we output the smallest such C.

Here 'smallest' is in the sense of area: for rectangles this is immediate, and for other sets it is in terms of the areas of the sets. (This will do for our purpose here.) Let us now see what this algorithm does. We consider it with two different families of classifiers.
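A small sketch of this learner for the class of axis-parallel rectangles; the particular target rectangle and the uniform sampling distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

C_STAR = (150.0, 180.0, 50.0, 80.0)  # target rectangle (x_lo, x_hi, y_lo, y_hi)

def in_rect(points, rect):
    """Membership test: the classifier defined by a rectangle."""
    x_lo, x_hi, y_lo, y_hi = rect
    return ((points[:, 0] >= x_lo) & (points[:, 0] <= x_hi) &
            (points[:, 1] >= y_lo) & (points[:, 1] <= y_hi))

def learn_rectangle(points, labels):
    """Smallest axis-parallel rectangle enclosing the positive examples."""
    pos = points[labels]
    if len(pos) == 0:
        return (0.0, -1.0, 0.0, -1.0)  # empty rectangle (x_lo > x_hi)
    return (pos[:, 0].min(), pos[:, 0].max(),
            pos[:, 1].min(), pos[:, 1].max())

# iid examples from a (here uniform) Px, labelled by C*.
pts = np.column_stack([rng.uniform(140, 200, 500), rng.uniform(40, 100, 500)])
ys = in_rect(pts, C_STAR)
C_n = learn_rectangle(pts, ys)

# C_n is consistent with every example and lies inside C*.
assert np.array_equal(in_rect(pts, C_n), ys)
assert (C_n[0] >= C_STAR[0] and C_n[1] <= C_STAR[1]
        and C_n[2] >= C_STAR[2] and C_n[3] <= C_STAR[3])
```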

Let C1 denote the class of all axis-parallel rectangles; the target concept C* is then some axis-parallel rectangle. An example is a positive example if it is in C* and a negative example otherwise.

First consider C1. The smallest C ∈ C1 consistent with all examples would be the smallest axis-parallel rectangle enclosing all the positive examples seen so far; this is what the algorithm outputs as Cn. Thus, under the strategy of our learning algorithm, for all n, Cn would always be inside C*.

If an example is classified positive by Cn, it would also be classified positive by C*. Hence, since Cn is inside C*, Cn Δ C* is the annular region between the two rectangles. Cn misclassifies a sample only if it falls in this annular region; we are only interested in the probability mass of this region under Px.

Suppose that err(Cn) > ε. This means the probability mass (under Px) of the annular region is greater than ε. This can happen only if none of the n examples fell in the annular region; had any example fallen there, the rectangle output by the algorithm would have been closer to C*.

Hence we can bound Prob[err(Cn) > ε] by the probability that, when n iid examples are drawn according to Px, none of them came from a subset of X that has Px-probability at least ε. A single example misses such a subset with probability at most (1 - ε).

Hence we have

    Prob[err(Cn) > ε] ≤ (1 - ε)^n

Given ε, δ, the required N is

    N ≥ ln(δ) / ln(1 - ε)

(a bound on the number of examples). Then (1 - ε)^n ≤ δ, and hence

    Prob[err(Cn) > ε] ≤ δ,  ∀ n ≥ N,

showing the algorithm PAC learns the concept class.
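The bound can be evaluated directly; this helper just computes the smallest n satisfying (1 - ε)^n ≤ δ.

```python
import math

def pac_sample_bound(eps, delta):
    """Smallest integer n with (1 - eps)^n <= delta,
    i.e. n >= ln(delta) / ln(1 - eps) (both logs are negative)."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

n = pac_sample_bound(0.1, 0.05)   # accuracy eps = 0.1, confidence delta = 0.05
# Verify the defining inequality at n and its failure at n - 1.
assert (1 - 0.1) ** n <= 0.05 < (1 - 0.1) ** (n - 1)
```

For ε = 0.1, δ = 0.05 this gives n = 29; note the bound holds for every Px.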


Now consider running the same algorithm with the concept class C2 = 2^X. Here we are searching over all possible 2-class classifiers. So, intuitively, we do not expect the algorithm to be able to learn anything: there is too much flexibility in the bag of classifiers over which we are searching.

After seeing n examples, the smallest set in C2 that is consistent with all examples is the set consisting of all the positive examples seen so far!! Now the algorithm simply remembers all the positive examples seen.

Note that every finite subset of X is in our concept class. The output Cn is now a finite set of points, and hence Cn Δ C* is C* minus some finite number of points. If Px is a continuous distribution, finite sets have probability zero, so

    err(Cn) = Px(Cn Δ C*) = Px(C*).

If Px(C*) > ε, then err(Cn) > ε for all n. Thus, the algorithm cannot PAC learn with C2.
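A quick simulation of this failure. The target interval and the uniform Px below are illustrative assumptions; any continuous Px gives the same picture.

```python
import numpy as np

rng = np.random.default_rng(3)

def c_star(x):
    """Target concept: the interval [0, 0.3] in X = [0, 1], so that
    Px(C*) = 0.3 under the uniform distribution."""
    return x <= 0.3

def memorizer_error(n, n_test=100_000):
    """Test error of the C2 = 2^X learner that outputs exactly the set of
    positive examples seen so far."""
    train = rng.uniform(0, 1, n)
    positives = train[c_star(train)]       # Cn = stored positive examples
    test = rng.uniform(0, 1, n_test)
    pred = np.isin(test, positives)        # almost surely all False
    return float(np.mean(pred != c_star(test)))

# More training data does not help: the error stays at Px(C*) = 0.3.
errs = [memorizer_error(n) for n in (10, 1_000, 100_000)]
assert all(abs(e - 0.3) < 0.01 for e in errs)
```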

This example shows that we may not be able to do proper learning from examples if the bag of classifiers being considered is too large. Here C2 = 2^X is too large a concept class, while C1 is a much smaller family of classifiers. We would later on define an appropriate quantity to quantify the sense in which one concept class can be said to be bigger than (or more complex than) another.

Since an axis-parallel rectangle is determined by four quantities, the class C1 can be parameterized by four parameters. No such finite parameterization is possible for C2 = 2^(ℝ²). Also, the strategy of our algorithm can be coded efficiently in the case of C1.

The PAC framework formalizes the notion of correct learning and allows us to ask questions like whether a given algorithm learns correctly. It also lets us ask for bounds on the number of examples needed to learn to a given level of accuracy and confidence, and thereby compare the complexity of different learning problems.

However, this framework is idealized and restrictive for many situations. We assume there is a (god-given) C* and that it is in our C. Also, we assume that examples are noise-free and are perfectly classified. Next we consider an extension of this framework that is relevant for realistic learning scenarios.

The extended framework is as follows.

(i) X : input space (as earlier).
(ii) Y : output space (as earlier, the set of class labels).
(iii) H : hypothesis space (family of classifiers). Each h ∈ H is a function h : X → A, where A is called the action space.
(iv) Training data: {(Xi, yi), i = 1, ..., n}, drawn iid according to some distribution Pxy on X × Y.


Some Comments

Often we simply take A = Y. But the freedom in choosing A allows for taking care of many situations. For instance, with binary labels we can take A = ℝ (e.g., learning discriminant functions). Now, e.g., the sign of h(X) may denote the class and its magnitude may give some measure of confidence in the assigned class.

The examples are now drawn from a joint distribution Pxy. This allows for noise in the training data. When the class-conditional densities overlap, the same X can come from different classes with different probabilities. If the conditional distribution of y given X is a degenerate distribution, then this is the same as earlier: we draw iid samples from X and each point is essentially classified by the target classifier. The general case, with an arbitrary joint distribution, allows for many more scenarios.


The learning algorithm has to output a hypothesis hn ∈ H, given the training data consisting of n examples. Note that there is no longer a target concept/hypothesis, and there is no noise-free classification of the examples. So we need a fresh way of specifying the goal of learning.

Loss function

The goal is specified through a loss function:

    L : Y × A → ℝ+

The idea is that L(y, h(X)) is the loss suffered by h ∈ H on a (random) sample (X, y) ∈ X × Y. More generally, we can let the loss depend on X explicitly too and write L(X, y, h(X)) for the loss function. By convention, we assume that the loss function is non-negative. Now we can look for hypotheses that have low average loss over samples drawn according to Pxy.


Risk function

Define a function R : H → ℝ+ by

    R(h) = ∫ L(y, h(X)) dPxy

R(h) is the expected loss of h with respect to Pxy. The goal of learning can now be posed as minimization of risk.


Let h* denote the global minimizer of risk over H. Risk minimization is a very general strategy adopted by most machine learning algorithms. However, we have no knowledge of Pxy. Hence minimizing R(·) directly is not feasible.

Define the empirical risk function R̂n : H → ℝ+ by

    R̂n(h) = (1/n) Σ_{i=1}^{n} L(yi, h(Xi))

R̂n(h) is the sample-mean estimate of R(h) obtained from n iid samples. Let

    ĥn = arg min_{h ∈ H} R̂n(h)

denote the minimizer of empirical risk.

Given any h, we can calculate R̂n(h) from the training data. Hence, we can (in principle) find ĥn by optimization methods.
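A minimal sketch of this pipeline with the 0-1 loss over a small finite H (threshold classifiers on [0, 1]); the data model and the 10% label-noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy training data: true class is I[X > 0.5], labels flipped w.p. 0.1.
n = 2000
X = rng.uniform(0, 1, n)
y = (X > 0.5).astype(int)
flip = rng.uniform(0, 1, n) < 0.1
y[flip] = 1 - y[flip]

# A finite hypothesis space H of threshold classifiers h_t(X) = I[X > t].
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """Empirical risk of h_t under the 0-1 loss: fraction misclassified."""
    return float(np.mean((X > t).astype(int) != y))

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]   # the empirical risk minimizer over H

# The minimizer's empirical risk sits near the 10% label-noise floor,
# and the learnt threshold is near the true one.
assert risks.min() < 0.12
assert abs(t_hat - 0.5) < 0.1
```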

Approximating h* by ĥn is the basic idea of the empirical risk minimization strategy which is used in most ML algorithms.

Is ĥn a good approximation to the minimizer of risk (for large n)? This is the question of consistency of empirical risk minimization. Thus, we can say that any learning problem has two parts:
- the optimization part: find ĥn, the minimizer of R̂n;
- the statistical part: is ĥn a good approximator of h*?

The choice of loss function is part of the specification of the learning problem; it should capture how we would like to evaluate performance of the classifier. For classification, a natural choice is the 0-1 loss, L(y, h(X)) = I[h(X) ≠ y], where I[A] denotes the indicator of event A. Under this loss, R(h) is the probability of misclassification. So, h* minimizes the probability of misclassification, a very desirable goal. (h* is the Bayes classifier.)

With the 0-1 loss and A = Y, the algorithm searches over a class of binary-valued functions on X. We can extend this to, e.g., discriminant function learning. We take A = ℝ (now h(X) is a discriminant function), and count an error whenever the sign of h(X) disagrees with y; this is essentially the same as the 0-1 loss.

Even if we take A = ℝ, the 0-1 loss compares only the sign of h(x) with y; the magnitude of h(x) has no effect on the loss. Here, we cannot trade good performance on some data against bad performance on others. This makes the 0-1 loss function more robust to noise in classification labels.

Though it is a natural performance measure, minimizing empirical risk under the 0-1 loss is hard. The loss is discontinuous, which makes the empirical risk function non-differentiable. The empirical risk minimization is also a non-convex optimization problem here. This is one reason other loss functions are popular in Machine Learning.

The linear least squares method we considered is empirical risk minimization with the squared error loss function, L(y, h(X)) = (y - h(X))². For classification we can take Y as {0, 1} or {+1, -1}; we take A = ℝ so that each h is a discriminant function. As we know, we can use this for regression problems also, and then we take Y = ℝ.

As another example, take Y = {0, 1} and A = [0, 1]. Now each h can be viewed as a posterior probability (of class-1) function. The global minimizer of the risk under the squared error loss is, here, the posterior probability function. So, risk minimization would now look for a function in H that is a good approximation for the posterior probability function.

Empirical risk minimization with the squared error loss is a convex optimization problem for linear models (when h is linear in its parameters). This is one reason for its popularity in learning algorithms.

Another useful loss is the hinge loss. With Y = {+1, -1} and A = ℝ, the hinge loss function is given by

    L(y, h(X)) = max(0, 1 - y h(X))

Thus, when y h(X) ≥ 1, the loss is zero. The hinge loss is also convex in h(X), which helps in empirical risk minimization.

We can view the squared error and hinge losses as convex approximations of the 0-1 loss function as follows. Let us take Y = {+1, -1}. We can then write all the loss functions as functions of the single variable y h(X).

The 0-1 loss is one if y h(X) ≤ 0 and zero otherwise. The hinge loss is 1 - y h(X) if y h(X) < 1 and zero otherwise. The squared error loss can be written as (1 - y h(X))², since y² = 1.

[Figure: the 0-1 loss, square loss, and hinge loss plotted as functions of y f(x).]
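The three curves in the figure follow from these one-line definitions in the margin variable u = y h(X):

```python
import numpy as np

def zero_one_loss(u):
    """0-1 loss: 1 if the margin y*h(X) is <= 0, else 0."""
    return (u <= 0).astype(float)

def square_loss(u):
    """Squared error loss in terms of the margin: (1 - y*h(X))^2."""
    return (1.0 - u) ** 2

def hinge_loss(u):
    """Hinge loss: max(0, 1 - y*h(X)); zero once the margin reaches 1."""
    return np.maximum(0.0, 1.0 - u)

u = np.linspace(-2.0, 2.0, 401)
# Both convex losses upper-bound the 0-1 loss everywhere,
# which is what makes them usable surrogates for it.
assert np.all(square_loss(u) >= zero_one_loss(u))
assert np.all(hinge_loss(u) >= zero_one_loss(u))
```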


The squared error loss is the most popular choice for regression problems. However, it is sensitive to data errors such as outliers. So, other loss functions are used for better robustness during regression, for example, losses under which sufficiently small errors count as zero.

There are many other loss functions one can think of. Many of them also make the empirical risk minimization problem efficiently solvable. Thus a choice of loss function can efficiently tackle the optimization part of learning. We will see several loss functions later in the course. Now, let us get back to the statistical question that we started with.

Since we cannot minimize the risk R directly, we minimize the empirical risk, R̂n, instead and thus find ĥn. We want ĥn to perform nearly as well as h*; hence we are interested in the question:

    R(ĥn) → R(h*)?

(As earlier, ĥn is random and hence we take the above as convergence in probability.)


Can we always expect such consistency of empirical risk minimization? The sample mean is a good estimator, and hence, with large n, R̂n(h) converges to R(h), for any fixed h ∈ H. This is the (weak) law of large numbers. But this does not necessarily mean R(ĥn) converges to R(h*). We are interested in: does the true risk of the minimizer of empirical risk converge to the global minimum of risk?


Consider the following example. Suppose examples are drawn iid from some distribution Px on X and classified according to a h̄ ∈ H, so that the conditional distribution of y given X is a degenerate distribution. Take the 0-1 loss. Now the global minimum of risk is zero and R(h̄) = 0.

Note that now the risk of any h is the same as Px(h Δ h̄). That is, this scenario is the same as what we considered under the PAC framework.

Now, under the 0-1 loss, the global minimum of empirical risk is also zero. For any n, there may be many h (other than ĥn) with R̂n(h) = 0. Hence our optimization algorithm can only use some general rule to output one such hypothesis.

Suppose we took H = 2^X, the set of all possible classifiers on X.

Consider h₁ with h₁(Xᵢ) = h*(Xᵢ) on the training examples and h₁(X) = 1 for all other X. Then R̂_n(h₁) = 0, so h₁ achieves the minimum empirical risk. But it is obvious that h₁ is not a good classifier.

Such an H is not learnable in the PAC sense, and hence we know P_X(ĥ_n ≠ h*) will not go to zero with n. Thus, here, R(ĥ_n) does not converge to R(h*).

Note that the law of large numbers still implies that R̂_n(h) converges to R(h), ∀h.
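This failure mode is easy to reproduce. The sketch below builds the memorizing classifier h₁ for an invented target (parity labels on a large X): its empirical risk is exactly zero while its true risk stays near 1/2.

```python
import random

random.seed(2)

# Illustrative target h* on a large X: label by parity.
def h_star(x):
    return x % 2

n = 100
train = [(x, h_star(x)) for x in random.sample(range(10**6), n)]
memory = dict(train)

# h1 memorizes the training set and outputs 1 everywhere else:
# a member of H = 2^X with zero empirical risk.
def h1(x):
    return memory.get(x, 1)

emp_risk = sum(h1(x) != y for x, y in train) / n
print(emp_risk)  # 0.0: h1 minimizes empirical risk

# On fresh points h1 almost always outputs 1, so it errs whenever
# h*(x) = 0, i.e. with probability about 1/2.
test_points = [random.randrange(10**6) for _ in range(10_000)]
true_risk_est = sum(h1(x) != h_star(x) for x in test_points) / len(test_points)
print(true_risk_est)  # ≈ 0.5, far from the global minimum of risk, 0
```

Zero empirical risk with near-chance true risk is exactly the gap between pointwise convergence of R̂_n and the behaviour of the empirical risk minimizer.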

Thus, empirical risk minimization (ERM) may not yield good classifiers. When H is too rich, this is certainly the case, as we saw in our example.

The learnt classifier should not simply memorize the training data, and hence one way is to impose some smoothness conditions on the learnt function (e.g., regularization).

We saw that though R̂_n(h) converges to R(h) for each h, the true risk of the minimizer of empirical risk need not converge to the global minimum of risk.

So, we would like to know under what conditions on H empirical risk minimization is consistent.

That is, when do we have Prob[ |R(ĥ_n) − R(h*)| > ε ] → 0?


We want: ∀ε, δ > 0, ∃N < ∞, such that

Prob[ |R(ĥ_n) − R(h*)| > ε ] ≤ δ, ∀n ≥ N, and

Prob[ |R̂_n(ĥ_n) − R(h*)| > ε ] ≤ δ, ∀n ≥ N.

We would also like to (approximately) know the true risk of the learnt classifier; the second condition ensures this.

For what kind of H do these hold?


Pointwise convergence (R̂_n(h) → R(h), ∀h) is not enough.

We need that the convergence under the law of large numbers be uniform over H.

Uniform convergence of R̂_n over H is sufficient for consistency of empirical risk minimization.

For any fixed h, R̂_n(h) is a sample mean of iid random variables and hence converges to the expectation of the random variable, which is R(h).

Since the 0-1 loss is bounded, we get (e.g., by Hoeffding's inequality): Prob[ |R̂_n(h) − R(h)| > ε ] ≤ δ, ∀n ≥ N(ε, δ), where N(ε, δ) depends only on ε, δ and not on h.

But this bound holds separately for each h ∈ H; it does not by itself control the deviation simultaneously over all of H.


To sum up, R̂_n converges to R uniformly over H if ∀ε, δ > 0, ∃N(ε, δ) < ∞ such that

Prob[ sup_{h ∈ H} |R̂_n(h) − R(h)| > ε ] ≤ δ, ∀n ≥ N(ε, δ).

Such uniform convergence is sufficient for consistency of empirical risk minimization.
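For a finite H this uniform guarantee does hold: Hoeffding's inequality plus a union bound gives Prob[ sup_h |R̂_n(h) − R(h)| > ε ] ≤ 2|H| exp(−2nε²), so N(ε, δ) = ⌈ln(2|H|/δ) / (2ε²)⌉ suffices. A minimal sketch (the numbers plugged in are arbitrary):

```python
import math

def uniform_N(H_size, eps, delta):
    """Smallest n with 2*|H|*exp(-2*n*eps^2) <= delta
    (Hoeffding + union bound; finite H, 0-1 loss)."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# e.g. |H| = 1000 hypotheses, eps = 0.05, delta = 0.01:
print(uniform_N(1000, 0.05, 0.01))  # 2442
```

Note the bound grows only logarithmically in |H|, which is why small hypothesis classes generalize from modest samples.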
