
ECS171: Machine Learning

Lecture 5: Error and Noise (LFD 1.4)

Cho-Jui Hsieh
UC Davis

Jan 24, 2018


Error Measurement
The learning diagram
Error Measure

What does “h ≈ f ” mean?

Define an error measure: E(h, f)

Almost always a pairwise definition (defined at each point x):

    e(h(x), f(x))   (loss on x)

Examples:

Square error: e(h(x), f(x)) = (h(x) − f(x))²


Binary error: e(h(x), f(x)) = [h(x) ≠ f(x)]
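As a concrete illustration, the two pointwise errors can be coded directly (a minimal sketch; the function names are our own):

```python
def square_error(hx, fx):
    # e(h(x), f(x)) = (h(x) - f(x))^2
    return (hx - fx) ** 2

def binary_error(hx, fx):
    # e(h(x), f(x)) = [h(x) != f(x)]  (Iverson bracket: 1 if true, 0 if false)
    return 1 if hx != fx else 0

print(square_error(0.5, 1.0))   # 0.25
print(binary_error(+1, -1))     # 1
print(binary_error(+1, +1))     # 0
```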
From pointwise to overall

Overall error E(h, f) = average of pointwise errors e(h(x), f(x))


In-sample error (training error):

    Ein(h) = (1/N) Σ_{n=1}^{N} e(h(xn), f(xn))

Out-of-sample error (generalization error):

    Eout(h) = E_x[ e(h(x), f(x)) ]
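In code, Ein is just the sample average of the pointwise loss (a sketch; the hypothesis, data, and loss below are toy placeholders):

```python
def in_sample_error(h, xs, ys, loss):
    # Ein(h) = (1/N) * sum_{n=1}^{N} loss(h(x_n), y_n)
    return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)

# Toy hypothesis and data: sign of x, with one point it gets wrong.
h = lambda x: 1 if x >= 0 else -1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, 1, 1, 1]
zero_one = lambda a, b: 1 if a != b else 0

print(in_sample_error(h, xs, ys, zero_one))  # 0.25 (1 of 4 misclassified)
```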


Learning diagram (with error measurement)
Choice of Error Measure

Two types of error:


false accept and false reject
Choice of Error Measure

Application: Supermarket verifies fingerprint for discounts


False reject is costly: customer gets annoyed
False accept is minor: just gave away a free discount
Define the error measure:
Choice of Error Measure

Application: CIA security


False reject is tolerable
False accept is disaster
Define the error measure:
Take-home lesson

The error measure is application/user-dependent


Plausible:
0/1 loss: minimum misclassification
square error: standard for regression
Friendly:
closed-form solution (square loss)
convex objective function (logistic loss)
Weighted Classification

Out-of-sample error:

    Eout(h) = E_{x,y}[ c(y) · e(h(x), y) ]

Class-dependent weight:

    c(y) = 1     if y = +1
    c(y) = 1000  if y = −1

In-sample error:

    Ein(h) = (1/N) Σ_{n=1}^{N} c(yn) · e(h(xn), yn)
How to solve it?

In general (any solver):


Augment the dataset by duplicating each −1 example 1000 times
Then solve the unweighted version
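The duplication trick above can be sketched as follows (a hypothetical helper; the data and weights are toy values):

```python
def augment_by_weight(data, weight):
    # Duplicate each (x, y) example weight[y] times, so that an
    # unweighted solver effectively minimizes the weighted error.
    out = []
    for x, y in data:
        out.extend([(x, y)] * weight[y])
    return out

data = [([1.0], +1), ([2.0], -1)]
weight = {+1: 1, -1: 1000}
augmented = augment_by_weight(data, weight)
print(len(augmented))  # 1001: the single -1 example now appears 1000 times
```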
How to solve it?

Duplicating the data requires much more memory


For most algorithms, we can incorporate the weight in the algorithm
without duplicating data
In Stochastic Gradient Descent (SGD):

Approach I: update the model according to the weight:

    w ← w − η_t · c(yn) · ∇e(w^T xn, yn)

Approach II: change the sampling rate:
increase the probability of choosing −1 examples by a factor of 1000
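A single update of Approach I might look like this for a linear model with squared loss (a sketch under our own assumptions; the slides do not fix a particular loss):

```python
def weighted_sgd_step(w, x, y, c, eta):
    # Approach I: w <- w - eta * c(y) * grad_w e(w^T x, y)
    # with squared loss e = (w^T x - y)^2, so grad = 2*(w^T x - y)*x.
    pred = sum(wi * xi for wi, xi in zip(w, x))
    g = 2.0 * (pred - y)
    return [wi - eta * c(y) * g * xi for wi, xi in zip(w, x)]

c = lambda y: 1 if y == +1 else 1000   # class-dependent weight from the slides
w = weighted_sgd_step([0.0, 0.0], [1.0, 2.0], -1, c, eta=1e-4)
print(w)  # the step on a -1 example is scaled 1000x
```

Note that with such a large weight the learning rate usually has to be reduced to keep the updates stable.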
Noisy Targets

The “target function” is not always a deterministic function


Consider the credit card approval:

Two identical customers → two different behaviors


Target “distribution”

Instead of y = f(x), we use a target distribution

    P(y | x)

(x, y) is now generated by the joint distribution

    P(x) P(y | x)

Noisy target ≈ deterministic target f(x) = E[y | x] plus noise y − f(x)
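For binary y ∈ {+1, −1} with P(y = +1 | x) = p(x), the decomposition above gives f(x) = E[y | x] = 2p(x) − 1, and the noise y − f(x) has zero mean. A quick numeric check (p = 0.9 is an arbitrary example value):

```python
def f_of(p):
    # f(x) = E[y | x] = (+1)*p + (-1)*(1 - p) = 2p - 1
    return 2 * p - 1

p = 0.9
fx = f_of(p)
# E[y - f(x)] = p*(+1 - f(x)) + (1 - p)*(-1 - f(x)) = 0
noise_mean = p * (+1 - fx) + (1 - p) * (-1 - fx)
print(fx)          # approximately 0.8, up to float rounding
print(noise_mean)  # approximately 0
```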


Learning diagram with noisy target
Distinction between P(y | x) and P(x)

The target distribution P(y | x)


is what we are trying to learn
The input distribution P(x)
quantifies relative importance of x
Preamble to the theory
What we know so far?

It is likely that

    Eout(g) ≈ Ein(g)

Not yet learning

For machine learning, we need g ≈ f, which means

    Eout(g) ≈ 0
The 2 questions of learning

Eout ≈ 0 is achieved through:

    Eout(g) ≈ Ein(g)   and   Ein(g) ≈ 0

Learning is thus split into 2 questions:

Can we make sure that Eout(g) is close enough to Ein(g)? (Hoeffding’s inequality)
Can we make Ein(g) small? (optimization)
What the theory will achieve

Currently we only know

    P[ |Ein(g) − Eout(g)| > ε ] ≤ 2M e^(−2ε²N)

Show Ein(g) ≈ Eout(g) for infinite M (the number of hypotheses)
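To see why a finite M matters, the bound can be evaluated numerically (a sketch; the numbers are arbitrary):

```python
import math

def hoeffding_bound(eps, N, M):
    # P[|Ein(g) - Eout(g)| > eps] <= 2 * M * exp(-2 * eps^2 * N)
    return 2 * M * math.exp(-2 * eps ** 2 * N)

# The bound decays exponentially in N but grows linearly in M:
print(hoeffding_bound(0.1, 1000, 1))      # tiny, about 4.1e-09
print(hoeffding_bound(0.1, 1000, 10**9))  # greater than 1, i.e. vacuous
```

This is why the theory must replace M with a finite effective quantity for infinite hypothesis sets, which is the topic of the next lectures.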


Conclusions

Next class: LFD 2.1, 2.2, 2.3 (VC-dimension)

Questions?
