$$f(X) = w_0 + w_1 X + w_2 X^2 + \cdots + w_m X^m$$
$f \in \mathcal{F}$
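As a quick illustration (not from the slides), a model of this form can be fit by least squares; the sketch below uses NumPy's `polyfit`, with made-up toy data and an assumed degree $m = 3$.

```python
import numpy as np

# Toy data, made up for illustration: noisy samples of an unknown function.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)

m = 3                            # degree of the model f(X) = w0 + w1 X + ... + wm X^m
w = np.polyfit(x, y, deg=m)      # least-squares fit; returns [wm, ..., w1, w0]
f = np.poly1d(w)                 # callable polynomial f(X)
print("coefficients w0..wm:", w[::-1])
```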
We could take $\mathcal{C} = 2^{\mathcal{X}}$.
This means we are searching over the family of all possible (2-class) classifiers.
We define the error of $C_n$ by
$$\mathrm{err}(C_n) = P_X(C_n \,\Delta\, C^*) = \mathrm{Prob}[\{X \in \mathcal{X} : C_n(X) \neq C^*(X)\}]$$
where, for sets $C_n, C^*$,
$$C_n \,\Delta\, C^* = (C_n \setminus C^*) \cup (C^* \setminus C_n).$$
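Since $\mathrm{err}(C_n)$ is just the probability mass of the symmetric difference, it can be estimated by Monte Carlo whenever we can sample from $P_X$. A minimal sketch, assuming $P_X$ uniform on $[0,1]^2$ and rectangles encoded as `(x1, x2, y1, y2)` (both are assumptions of this sketch, not part of the slides):

```python
import numpy as np

def in_rect(X, rect):
    """Membership in an axis-parallel rectangle rect = (x1, x2, y1, y2)."""
    x1, x2, y1, y2 = rect
    return (x1 <= X[:, 0]) & (X[:, 0] <= x2) & (y1 <= X[:, 1]) & (X[:, 1] <= y2)

def mc_error(C_n, C_star, n_samples=100_000, seed=0):
    """Monte Carlo estimate of err(C_n) = P_X(C_n Δ C*): draw fresh X
    from P_X and count how often the two classifiers disagree."""
    X = np.random.default_rng(seed).uniform(0.0, 1.0, size=(n_samples, 2))
    return float(np.mean(in_rect(X, C_n) != in_rect(X, C_star)))
```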
Here, $\mathcal{X} = \Re^2$.
The strategy of the learning algorithm is as follows.
First consider $C_1$.
The smallest $C \in \mathcal{C}$ consistent with all the examples would be the smallest axis-parallel rectangle enclosing all the positive examples seen so far.
Thus, under the strategy of our learning algorithm, for all $n$, $C_n$ would always be inside $C^*$.
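A minimal sketch of this learner (with the same illustrative representation as above: points as rows of a NumPy array, rectangles as `(x1, x2, y1, y2)`):

```python
import numpy as np

def fit_rectangle(X, labels):
    """C_n: the smallest axis-parallel rectangle enclosing all positives.

    X: (n, 2) array of points in R^2; labels: boolean array, True = positive.
    Returns None before any positive example is seen (predict all-negative).
    """
    pos = X[np.asarray(labels)]
    if len(pos) == 0:
        return None
    # Every positive example lies in C*, so this rectangle sits inside C*.
    return (pos[:, 0].min(), pos[:, 0].max(),
            pos[:, 1].min(), pos[:, 1].max())
```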
Hence we have
$$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] \leq 4\,(1 - \epsilon/4)^n$$
The required $N$ is $N \geq \frac{4}{\epsilon}\ln\frac{4}{\delta}$ (bound on number of examples).
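Taking the bound above at face value and relaxing it via $1 - x \leq e^{-x}$, it suffices that $4e^{-N\epsilon/4} \leq \delta$. A quick numeric check of the resulting sample size:

```python
import math

def required_N(eps, delta):
    """Smallest N with 4 * exp(-N * eps / 4) <= delta,
    i.e. N >= (4 / eps) * ln(4 / delta)."""
    return math.ceil(4.0 / eps * math.log(4.0 / delta))

print(required_N(0.1, 0.05))   # eps = 0.1, delta = 0.05  ->  176 examples
```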
Some Comments
Now, e.g., the sign of $h(X)$ may denote the class and its magnitude may give some measure of confidence in the assigned class.
Loss function:
$$L : \mathcal{Y} \times \mathcal{A} \to \Re_+$$
The idea is that $L(y, h(X))$ is the loss suffered by $h \in \mathcal{H}$ on a (random) sample $(X, y) \in \mathcal{X} \times \mathcal{Y}$.
More generally, we can let the loss depend on $X$ explicitly and write $L(X, y, h(X))$ for the loss function.
By convention we assume that the loss function is non-negative.
Now we can look for hypotheses that have low average loss over samples drawn according to $P_{XY}$.
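For concreteness, here are the three standard losses that appear in the plot further below, written for labels $y \in \{-1, +1\}$ and a real-valued $h(X)$ (the margin form is an assumption of this sketch):

```python
import numpy as np

def zero_one_loss(y, hX):
    return np.where(y * hX > 0, 0.0, 1.0)   # 1 exactly on a misclassification

def square_loss(y, hX):
    return (y - hX) ** 2                    # penalizes distance from the label

def hinge_loss(y, hX):
    return np.maximum(0.0, 1.0 - y * hX)    # penalizes margins below 1
```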
Risk function
Define a function $R : \mathcal{H} \to \Re_+$ by
$$R(h) = \int L(y, h(X)) \, dP_{XY}$$
Define the empirical risk function, $\hat{R}_n : \mathcal{H} \to \Re_+$, by
$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i))$$
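$\hat{R}_n(h)$ is just a sample average, so it is one line of code given data and a hypothesis (a sketch; `loss` is any function like those above):

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """Empirical risk of hypothesis h: average loss over the n samples."""
    return float(np.mean([loss(yi, h(Xi)) for Xi, yi in zip(X, y)]))
```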
Hence, given any $h$, we can calculate $\hat{R}_n(h)$.
Hence, we can (in principle) find $\hat{h}_n$ by optimization methods.
Approximating $h^*$ by $\hat{h}_n$ is the basic idea of the empirical risk minimization strategy, which is used in most ML algorithms.
The optimization part: find $\hat{h}_n$, the minimizer of $\hat{R}_n$.
The statistical part: Is $\hat{h}_n$ a good approximator of $h^*$?
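For a parametric hypothesis class, the optimization part can be handed to a generic solver. A sketch (not the course's prescribed method): linear hypotheses $h_w(X) = w^\top X$ under square loss, minimized with `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

def erm_linear(X, y):
    """Return the parameters of ĥ_n: argmin_w of the empirical
    square-loss risk for the linear hypothesis h_w(X) = w . X."""
    def emp_risk(w):
        return np.mean((y - X @ w) ** 2)
    w0 = np.zeros(X.shape[1])           # start from the zero hypothesis
    return minimize(emp_risk, w0).x     # generic numerical minimizer
```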
[Figure: the 0-1 loss, square loss, and hinge loss $L(y, f(X))$ plotted as functions of $y\,f(X)$.]
But this does not necessarily mean $R(\hat{h}_n)$ converges to $R(h^*)$.
We are interested in: does the true risk of the minimizer of the empirical risk converge to the global minimum of the risk?
Now the global minimum of risk is zero and $R(h^*) = 0$.
Note that now the risk of any $h$ is the same as $P_X(h \,\Delta\, h^*)$.
That is, this scenario is the same as what we considered under the PAC framework.
Now, under 0-1 loss, the global minimum of empirical risk is also zero.
For any $n$, there may be many $h$ (other than $\hat{h}_n$) with $\hat{R}_n(h) = 0$.
Hence our optimization algorithm can only use some general rule to output one such hypothesis.
Suppose we took $\mathcal{H} = 2^{\mathcal{X}}$.
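With $\mathcal{H} = 2^{\mathcal{X}}$, a hypothesis that simply memorizes the training set already achieves zero empirical risk while saying nothing about unseen points. An illustrative sketch (the default class $-1$ for unseen points is an arbitrary assumption):

```python
def memorizer(train_X, train_y):
    """A hypothesis from H = 2^X: zero empirical risk, arbitrary elsewhere."""
    table = {tuple(Xi): yi for Xi, yi in zip(train_X, train_y)}
    # Perfect on every training point, but a fixed guess on any new X.
    return lambda X: table.get(tuple(X), -1)
```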
We ask: is there, for every $\epsilon, \delta > 0$, an $N$ such that
$$\mathrm{Prob}\big[\,|R(\hat{h}_n) - R(h^*)| > \epsilon\,\big] \leq \delta, \quad \forall n \geq N\,?$$
We would also like to (approximately) know the true risk of the learnt classifier:
$$\mathrm{Prob}\big[\,|\hat{R}_n(\hat{h}_n) - R(h^*)| > \epsilon\,\big] \leq \delta, \quad \forall n \geq N$$