
Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 1: Introduction to Pattern
Recognition (Sections 1.1-1.6)

• Machine Perception
• An Example
• Pattern Recognition Systems
• The Design Cycle
• Learning and Adaptation
• Conclusion

Machine Perception

• Build a machine that can recognize patterns:


• Speech recognition
• Fingerprint identification
• OCR (Optical Character Recognition)
• DNA sequence identification


An Example

• “Sorting incoming fish on a conveyor according to species using optical sensing”

[Figure: the species to be sorted, sea bass or salmon]


• Problem Analysis
• Set up a camera and take some sample images to extract
features

• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc…
• This is the set of all suggested features to explore for use in our
classifier!


• Preprocessing

• Use a segmentation operation to isolate fish from one another and from the background

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features

• The features are passed to a classifier


• Classification
• Select the length of the fish as a possible feature for
discrimination


The length is a poor feature alone!

Select the lightness as a possible feature.


• Threshold decision boundary and cost relationship


• Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)

Task of decision theory


• Adopt the lightness and add the width of the fish

Fish: xᵀ = [x1, x2], where x1 = lightness and x2 = width
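For concreteness, a minimal sketch of representing one fish by this two-feature vector and applying a simple linear boundary (all numbers below are made-up illustrations, not values from the slides):

```python
import numpy as np

# Hypothetical feature vector for one fish: x = [x1, x2] = [lightness, width]
x = np.array([4.2, 13.5])

# Hypothetical linear decision boundary w·x + b = 0 in the lightness-width
# plane; the weights are chosen purely for illustration.
w = np.array([-1.0, 0.8])
b = -6.0

# One side of the boundary is called "salmon", the other "sea bass".
label = "salmon" if w @ x + b > 0 else "sea bass"
print(label)
```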


• We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such “noisy features”

• Ideally, the best decision boundary should be the one which provides optimal performance, such as in the following figure:


• However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input

Issue of generalization!


Pattern Recognition Systems

• Sensing
• Use of a transducer (camera or microphone)
• The PR system depends on the bandwidth, resolution, sensitivity and distortion of the transducer

• Segmentation and grouping


• Patterns should be well separated and should not overlap


• Feature extraction
• Discriminative features
• Invariant features with respect to translation, rotation and
scale.

• Classification
• Use a feature vector provided by a feature extractor to
assign the object to a category

• Post Processing
• Exploit context (input-dependent information other than the target pattern itself) to improve performance
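Taken together, the stages above form a processing chain. A schematic sketch of that chain (every function body here is a placeholder, not an implementation from the slides):

```python
def sense(raw_signal):
    """Transduce the physical signal (e.g. a camera image) into an array."""
    return raw_signal

def segment(image):
    """Isolate each object (fish) from the background and from the others."""
    return [image]                 # placeholder: one region per object

def extract_features(region):
    """Reduce the data to a small feature vector, e.g. [lightness, width]."""
    return [0.0, 0.0]              # placeholder values

def classify(features):
    """Assign the feature vector to a category."""
    return "salmon"                # placeholder decision

def post_process(label, context):
    """Use context (e.g. neighbouring decisions, costs) to adjust the decision."""
    return label

def recognize(raw_signal, context=None):
    image = sense(raw_signal)
    return [post_process(classify(extract_features(region)), context)
            for region in segment(image)]
```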


The Design Cycle

• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity


• Data Collection
• How do we know when we have collected an adequately
large and representative set of examples for training and
testing the system?


• Feature Choice
• Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.


• Model Choice
• When we are unsatisfied with the performance of our fish classifier, we may want to jump to another class of model


• Training
• Use data to determine the classifier. There are many different procedures for training classifiers and choosing models


• Evaluation
• Measure the error rate (or performance) and switch from one set of features to another


• Computational Complexity
• What is the trade-off between computational ease and performance?

• (How does an algorithm scale as a function of the number of features, patterns or categories?)


Learning and Adaptation

• Supervised learning
• A teacher provides a category label or cost for each
pattern in the training set

• Unsupervised learning
• The system forms clusters or “natural groupings” of the
input patterns


Conclusion

• The reader may feel overwhelmed by the number, complexity and magnitude of the sub-problems of pattern recognition

• Many of these sub-problems can indeed be solved


• Many fascinating unsolved problems still remain

Chapter 2 (Part 1):
Bayesian Decision Theory
(Sections 2.1-2.2)

• Introduction
• Bayesian Decision Theory–Continuous Features

Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)

• P(ω1) + P( ω2) = 1 (exclusivity and exhaustivity)


• Decision rule with only the prior information

• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Use of the class-conditional information

• P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon


• Posterior, likelihood, evidence


• P(ωj | x) = P(x | ωj) P(ωj) / P(x)

• where, in the case of two categories:

P(x) = Σj P(x | ωj) P(ωj)    (sum over j = 1, 2)

• Posterior = (Likelihood × Prior) / Evidence
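A minimal numeric sketch of this computation (the likelihood values are invented for illustration):

```python
# Hypothetical class-conditional likelihoods P(x | ωj) at one observed x,
# and the priors P(ωj), for two categories.
likelihood = [0.30, 0.10]     # P(x | ω1), P(x | ω2)
prior      = [0.50, 0.50]     # P(ω1), P(ω2)

# Evidence: P(x) = Σj P(x | ωj) P(ωj)
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior = Likelihood × Prior / Evidence
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
print(posterior)              # [0.75, 0.25] -> decide ω1
```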



• Decision given the posterior probabilities

x is an observation for which:

if P(ω1 | x) > P(ω2 | x), the true state of nature is ω1
if P(ω1 | x) < P(ω2 | x), the true state of nature is ω2

Therefore, whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

• Minimizing the probability of error


• Decide ω1 if P(ω1 | x) > P(ω2 | x);
otherwise decide ω2

Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)

Bayesian Decision Theory –
Continuous Features

• Generalization of the preceding ideas

• Use of more than one feature
• Use of more than two states of nature
• Allowing actions, not only deciding on the state of nature
• Introducing a loss function which is more general than the probability of error


• Allowing actions other than classification primarily allows the possibility of rejection

• Refusing to make a decision in close or bad cases!

• The loss function states how costly each action taken is


Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)

Let {α1, α2, …, αa} be the set of a possible actions

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

Overall risk
R = sum of all R(αi | x) for i = 1, …, a

Conditional risk:

R(αi | x) = Σj λ(αi | ωj) P(ωj | x)    (sum over j = 1, …, c), for i = 1, …, a

Minimizing R ⇔ minimizing R(αi | x) for i = 1, …, a
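A small sketch of choosing the minimum-conditional-risk action; the loss matrix and posteriors below are illustrative assumptions, not values from the text:

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = λ(αi | ωj)
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])

# Hypothetical posteriors P(ωj | x) at the observed x
posterior = np.array([0.8, 0.2])

# Conditional risk R(αi | x) = Σj λ(αi | ωj) P(ωj | x), one entry per action
risk = loss @ posterior
best_action = int(np.argmin(risk))
print(risk, "-> take action", best_action + 1)   # here α2 has the smaller risk
```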

Select the action αi for which R(αi | x) is minimum

R is then minimized, and this minimum R is called the Bayes risk: the best performance that can be achieved!


• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11P(ω1 | x) + λ12P(ω2 | x)


R(α2 | x) = λ21P(ω1 | x) + λ22P(ω2 | x)


Our rule is the following:

if R(α1 | x) < R(α2 | x), action α1 (“decide ω1”) is taken

This results in the equivalent rule:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise



Likelihood ratio:

The preceding rule is equivalent to the following rule:

if  P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

then take action α1 (decide ω1); otherwise take action α2 (decide ω2)


Optimal decision property

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”

Exercise

Select the optimal decision where:


Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)

P(ω1) = 2/3
P(ω2) = 1/3

λ = [1  2]
    [3  4]
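A sketch of checking the exercise numerically, assuming the second parameter of N(·, ·) is the variance (the slide does not say) and evaluating the rule at an arbitrary observation:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mean, var):
    """Univariate normal density N(mean, var)."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

# Priors and loss matrix λ = [[1, 2], [3, 4]] from the exercise
p1, p2 = 2 / 3, 1 / 3
l11, l12, l21, l22 = 1.0, 2.0, 3.0, 4.0

# Likelihood-ratio threshold: decide ω1 if
# P(x|ω1)/P(x|ω2) > [(λ12 - λ22)/(λ21 - λ11)] * P(ω2)/P(ω1)
theta = (l12 - l22) / (l21 - l11) * (p2 / p1)

x = 1.8                                        # an example observation
ratio = normal_pdf(x, 2.0, 0.5) / normal_pdf(x, 1.5, 0.2)

# Note: θ is negative for this λ, so the (positive) ratio always exceeds it
# and action α1 (decide ω1) is optimal for every x.
print("decide ω1" if ratio > theta else "decide ω2")
```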


Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 2.3-2.5)

• Minimum-Error-Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density

Minimum-Error-Rate Classification

• Actions are decisions on classes


If action αi is taken and the true state of nature is ωj then:
the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error, which is the error rate


• Introduction of the zero-one loss function:


λ(αi, ωj) = 0 if i = j, 1 if i ≠ j    (i, j = 1, …, c)

Therefore, the conditional risk is:

R(αi | x) = Σj λ(αi | ωj) P(ωj | x)    (sum over j = 1, …, c)
          = Σ P(ωj | x) over j ≠ i
          = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average probability of error”

• Minimizing the risk requires maximizing P(ωi | x)
(since R(αi | x) = 1 − P(ωi | x))

• For minimum error rate:

• Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i


• Regions of decision and the zero-one loss function, therefore:

Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); then decide ω1 if: P(x | ω1) / P(x | ω2) > θλ

• If λ is the zero-one loss function, which means:

  λ = [0  1]
      [1  0]

  then θλ = P(ω2) / P(ω1) = θa

• If instead

  λ = [0  2]
      [1  0]

  then θλ = 2 P(ω2) / P(ω1) = θb
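For concreteness, with illustrative priors (not specified on the slide) the two thresholds become:

```python
# Illustrative priors, e.g. the values used in the earlier exercise
p1, p2 = 2 / 3, 1 / 3

theta_a = p2 / p1          # zero-one loss λ = [[0, 1], [1, 0]]
theta_b = 2 * p2 / p1      # loss λ = [[0, 2], [1, 0]]

# Decide ω1 whenever P(x|ω1)/P(x|ω2) exceeds the threshold
print(theta_a, theta_b)    # 0.5 and 1.0
```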
Classifiers, Discriminant Functions
and Decision Surfaces

• The multi-category case


• Set of discriminant functions gi(x), i = 1,…, c
• The classifier assigns a feature vector x to class ωi
if:
gi(x) > gj(x) ∀j ≠ i

• Let gi(x) = - R(αi | x)
(max. discriminant corresponds to min. risk!)

• For the minimum error rate, we take


gi(x) = P(ωi | x)

(max. discriminant corresponds to max. posterior!)


gi(x) ≡ P(x | ωi) P(ωi)

gi(x) = ln P(x | ωi) + ln P(ωi)


(ln: natural logarithm!)


• Feature space divided into c decision regions


if gi(x) > gj(x) ∀j ≠ i then x is in Ri
(Ri means assign x to ωi)

• The two-category case


• A classifier is a “dichotomizer” that has two discriminant
functions g1 and g2

Let g(x) ≡ g1(x) – g2(x)

Decide ω1 if g(x) > 0 ; Otherwise decide ω2


• The computation of g(x)

g(x) = P(ω1 | x) − P(ω2 | x)
     = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]

The Normal Density
• Univariate density
• Density which is analytically tractable
• Continuous density
• A lot of processes are asymptotically Gaussian
• Handwritten characters and speech sounds are ideal or prototype patterns corrupted by a random process (central limit theorem)

P(x) = [1 / (√(2π) σ)] · exp[ −½ ((x − μ) / σ)² ],

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
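A direct transcription of this density as a small function (a minimal sketch):

```python
from math import sqrt, pi, exp

def univariate_normal(x, mu, sigma):
    """P(x) = 1/(sqrt(2π)·σ) · exp(-0.5·((x - μ)/σ)²)"""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

print(univariate_normal(0.0, 0.0, 1.0))   # ≈ 0.3989, the peak of the standard normal
```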


• Multivariate density
• Multivariate normal density in d dimensions is:
P(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] · exp[ −½ (x − μ)ᵗ Σ⁻¹ (x − μ) ]

where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
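A sketch of evaluating the multivariate density with NumPy (the test values are arbitrary):

```python
import numpy as np

def multivariate_normal(x, mu, sigma):
    """d-dimensional normal density with mean vector mu and covariance sigma."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

x     = np.array([0.5, 1.0])
mu    = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
print(multivariate_normal(x, mu, sigma))
```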

Chapter 2 (Part 3):
Bayesian Decision Theory
(Sections 2.6-2.9)

• Discriminant Functions for the Normal Density


• Bayes Decision Theory – Discrete Features
Discriminant Functions for the
Normal Density
• We saw that the minimum error-rate
classification can be achieved by the
discriminant function

gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of the multivariate normal:

gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)
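A sketch of this discriminant computed directly from the formula; the class parameters below are illustrative assumptions:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """gi(x) = -0.5·(x-μi)ᵗ Σi⁻¹ (x-μi) - (d/2)·ln 2π - 0.5·ln|Σi| + ln P(ωi)"""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Decide the class with the largest discriminant (illustrative parameters).
x = np.array([1.0, 0.5])
classes = [
    (np.array([0.0, 0.0]), np.eye(2),       0.5),   # (μ1, Σ1, P(ω1))
    (np.array([2.0, 2.0]), np.eye(2) * 2.0, 0.5),   # (μ2, Σ2, P(ω2))
]
scores = [gaussian_discriminant(x, m, s, p) for m, s, p in classes]
print("decide ω%d" % (int(np.argmax(scores)) + 1))
```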


• Case Σi = σ²·I (I stands for the identity matrix)

gi(x) = wiᵗ x + wi0   (linear discriminant function)

where:
wi = μi / σ²;   wi0 = −(1 / (2σ²)) μiᵗ μi + ln P(ωi)

(wi0 is called the threshold for the ith category!)


• A classifier that uses linear discriminant functions is called “a linear machine”

• The decision surfaces for a linear machine are pieces of hyperplanes defined by:

gi(x) = gj(x)


• The hyperplane separating Ri and Rj passes through the point

x0 = ½ (μi + μj) − [σ² / ‖μi − μj‖²] · ln [P(ωi) / P(ωj)] · (μi − μj)

and is always orthogonal to the line linking the means!

if P(ωi) = P(ωj) then x0 = ½ (μi + μj)


• Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)

• The hyperplane separating Ri and Rj passes through the point

x0 = ½ (μi + μj) − [ ln (P(ωi) / P(ωj)) / ((μi − μj)ᵗ Σ⁻¹ (μi − μj)) ] · (μi − μj)

(the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)


• Case Σi = arbitrary
• The covariance matrices are different for each category

gi(x) = xᵗ Wi x + wiᵗ x + wi0

where:
Wi = −½ Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −½ μiᵗ Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi)

(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)
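A sketch of computing the quadratic-discriminant parameters for one class (the numbers are illustrative):

```python
import numpy as np

def quadratic_params(mu, sigma, prior):
    """Return (Wi, wi, wi0) so that gi(x) = xᵗ Wi x + wiᵗ x + wi0."""
    sigma_inv = np.linalg.inv(sigma)
    W  = -0.5 * sigma_inv
    w  = sigma_inv @ mu
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return W, w, w0

mu    = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
W, w, w0 = quadratic_params(mu, sigma, prior=0.5)

x = np.array([0.5, 1.5])
g = x @ W @ x + w @ x + w0     # value of the quadratic discriminant at x
print(g)
```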

Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued; x can take only one of m discrete values
v1, v2, …, vm

• Case of independent binary features in a 2-category problem
Let x = [x1, x2, …, xd]ᵗ where each xi is either 0 or 1, with probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)

• The discriminant function in this case is:

g(x) = Σi wi xi + w0    (sum over i = 1, …, d)

where:
wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],   i = 1, …, d

and:
w0 = Σi ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]    (sum over i = 1, …, d)

decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
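A sketch of this linear discriminant for independent binary features; the pi, qi and prior values are illustrative:

```python
import numpy as np

# Illustrative parameters: p[i] = P(xi = 1 | ω1), q[i] = P(xi = 1 | ω2)
p = np.array([0.8, 0.6, 0.3])
q = np.array([0.2, 0.5, 0.7])
prior1, prior2 = 0.5, 0.5

w  = np.log(p * (1 - q) / (q * (1 - p)))                    # weights wi
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

x = np.array([1, 1, 0])                                     # an observed binary vector
g = w @ x + w0
print("decide ω1" if g > 0 else "decide ω2")
```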
