Statistical Approach To Pattern Recognition

Statistical Approach to PR

Various PR approaches

2/21/2017

A fundamental statistical approach to the problem of pattern classification. It assumes that the decision problem is posed in probabilistic terms and that the relevant probability structure of the categories is known perfectly.

Example: classifying sea bass and salmon that appear randomly on a conveyor belt. Let ω denote the state of nature, or class:

ω = ω1 for sea bass, ω = ω2 for salmon

The state of nature is unpredictable, so it must be described probabilistically by a priori (prior) probabilities. Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it, estimated:

1) from collected samples used as training data,
2) based on the time of the year,
3) based on location.

Let

P(ω1) be the prior probability that the fish is a sea bass, and
P(ω2) be the prior probability that the fish is a salmon.

Also, assuming no other fish can appear,

P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)


Discriminant function

Case 1: We must make a decision without seeing the fish, i.e. with no features. The priors can be estimated from training data: if N is the total number of available training patterns, and N1, N2 of them belong to ω1 and ω2 respectively, then P(ω1) ≈ N1/N and P(ω2) ≈ N2/N.

Discriminant function:

if P(ω1) > P(ω2): choose ω1, otherwise choose ω2

P(error) = P(choose ω2 | ω1)·P(ω1) + P(choose ω1 | ω2)·P(ω2) = min[P(ω1), P(ω2)]

e.g. if P(ω1) = 0.9 and P(ω2) = 0.1, we always choose ω1, so

P(error) = 0 × 0.9 + 1 × 0.1 = 0.1

i.e. the probability of error is 10%, the minimum achievable without features. The rule performs no better than chance if P(ω1) = P(ω2), i.e. if the classes are equiprobable (uniform priors).
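The prior-only rule above can be sketched in a few lines of Python (the 0.9/0.1 priors are the slide's example):

```python
# Prior-only decision rule: with no features, always choose the class
# with the larger prior; the resulting error rate is the smaller prior.
def decide_by_prior(p_w1, p_w2):
    assert abs(p_w1 + p_w2 - 1.0) < 1e-9   # exclusivity and exhaustivity
    chosen = 1 if p_w1 > p_w2 else 2
    p_error = min(p_w1, p_w2)
    return chosen, p_error

print(decide_by_prior(0.9, 0.1))   # -> (1, 0.1): always choose w1, 10% error
print(decide_by_prior(0.5, 0.5))   # -> (2, 0.5): uniform priors, worst case
```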

A classifier based on Bayesian decision theory integrates all the available problem information, such as measurements, a priori probabilities, likelihoods and evidence, to form the decision rules:

a) by calculating the posterior probability P(ωi|X) from the prior probability P(ωi) and the likelihood p(X|ωi): Bayes' theorem, and

b) by defining a measure of the expected error or risk and choosing a decision rule that minimizes this measure.


Bayes' Theorem

To derive Bayes' theorem, start from the definition of conditional probability. The probability of the event A given the event B is

P(A|B) = P(A ∩ B) / P(B), and likewise P(B|A) = P(A ∩ B) / P(A),

so that P(A|B)·P(B) = P(A ∩ B) = P(B|A)·P(A).

Discarding the middle term and dividing both sides by P(B), provided that neither P(B) nor P(A) is 0, we obtain Bayes' theorem:

P(A|B) = P(B|A)·P(A) / P(B) ----------(1)

Bayes' theorem is often completed by noting that, according to the Law of Total Probability,

P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A) ----------(2)

Substituting (2) into (1):

P(A|B) = P(B|A)·P(A) / [P(B|A)·P(A) + P(B|¬A)·P(¬A)]

More generally, the Law of Total Probability states that, given a partition {Ai} of the event space,

P(B) = Σi P(B|Ai)·P(Ai) ----------(3)
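A quick numeric check of equations (1)-(3), with made-up probabilities for A and B:

```python
# Bayes' theorem P(A|B) = P(B|A) P(A) / P(B), with the evidence P(B)
# expanded via the Law of Total Probability. The numbers are illustrative.
p_A = 0.1                      # prior P(A)
p_B_given_A = 0.8              # likelihood P(B|A)
p_B_given_notA = 0.2           # P(B|~A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' theorem

print(round(p_B, 4), round(p_A_given_B, 4))
```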


A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector x = (x1, x2, …, xd)^T.

Given a pattern x = (x1, x2, …, xd)^T, assign it to one of n classes in the set C = {ω1, ω2, …, ωn}.

P(ωi): the class prior probabilities (calculated from training data)

p(x|ωi): the class-conditional probabilities (likelihoods, calculated from training data)

Fig: Classes ω1, ω2, ω3 with priors P(ω1), P(ω2), P(ω3) and class-conditional densities p(x|ω1), p(x|ω2), p(x|ω3) generating the observed feature x.


Posterior Probability

P(ωi|X), the posterior probability that a test pattern X belongs to class ωi, is given as:

P(ωi|X) = P(X|ωi) · P(ωi) / P(X)

i.e. Posterior = (Likelihood × Prior) / Evidence

i.e. to classify a test pattern with attribute vector X, we assign it to the class most probable for X.

This means we estimate P(ωi|X) for each class i = 1 to n, then assign the pattern to the class with the highest posterior probability; that is, we maximize P(ωi|X).

Bayes classifier: P(ωi|X) = P(X|ωi)·P(ωi) / P(X)

Since P(X) is common to all classes, maximizing the posterior amounts to maximizing the numerator of the right-hand side of the above equation. All the terms can be calculated from training data.

For a single feature x and multiple classes, the evidence is calculated as:

P(X = x) = Σ_{i=1}^{n} p(x|ωi)·P(ωi)

P(X|ωi): considering all the attributes jointly needs huge computation, so assume that the attributes are independent of each other. This makes the classifier a Naïve Bayes classifier, i.e.

P(X|ωi) = Π_{k=1}^{d} P(xk|ωi)
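A minimal sketch of the Naïve Bayes score, with hypothetical priors and per-attribute likelihoods (the real values would come from training data):

```python
# Naive Bayes: with independent attributes, the class likelihood factors as
# P(X|w_i) = prod_k P(x_k|w_i); pick the class maximizing P(X|w_i) * P(w_i).
from math import prod

priors = {1: 0.6, 2: 0.4}            # hypothetical priors P(w_i)
attr_likelihoods = {                 # attr_likelihoods[c][k] = P(x_k | w_c)
    1: [0.5, 0.2],                   # hypothetical values for d = 2 attributes
    2: [0.3, 0.7],
}

def naive_bayes_score(c):
    return prod(attr_likelihoods[c]) * priors[c]

best = max(priors, key=naive_bayes_score)
print(best, naive_bayes_score(1), naive_bayes_score(2))
```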


The simple prior-only probabilistic approach is not sufficient to classify the fishes; we therefore use features measured on the fish and incorporate the Bayes classification concept.

Let the pattern be represented by a feature vector comprising a single feature:

x as length (a continuous random variable)

p(x|ωi) is the class-conditional probability density function (the probability density of x given the state of nature ωi). Its distribution depends on the state of nature (class) ωi and the feature value. It is obtained by observing a large number of training pattern samples (statistical estimation).

p(x|ω1) and p(x|ω2), the pdf curves, describe the difference in populations of the two classes, e.g. sea bass and salmon:

Fig: Hypothetical class-conditional pdfs show the probability density of measuring a particular feature value x given the pattern is in category ωi. Each is a density, thus the area under each curve is 1.0.


Suppose we know the priors P(ωi) and the likelihoods p(x|ωi), and we measure the length of a fish as the value x.

Now P(ωi|x), the a posteriori probability (the probability of ωi given the measurement of feature value x), is given by Bayes' theorem as:

P(ωi|x) = p(x|ωi)·P(ωi) / p(x), where p(x) = Σ_{i=1}^{2} p(x|ωi)·P(ωi)

P(ω1|x) + P(ω2|x) = 1

Decision Rule:

Choose ω1 if P(ω1|x) > P(ω2|x), otherwise ω2; equivalently, choose ω1 if p(x|ω1)·P(ω1) > p(x|ω2)·P(ω2), otherwise ω2.

The evidence p(x) can be dropped from the comparison because it is constant for all classes. Also, if the classes are equiprobable, i.e. P(ω1) = P(ω2) = 0.5, the priors can be dropped from the comparison as well.

Fig (a): Posterior pdfs show the probability of each class given a measured feature value x, for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3.

Given that a pattern is measured to have feature value x = 14, its posterior probability for class ω2 is 0.08 and for class ω1 is 0.92.
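The posterior computation behind the figure can be sketched as follows; the class-conditional density values at x = 14 are hypothetical stand-ins chosen to reproduce the 0.92/0.08 posteriors quoted above:

```python
# Posterior from likelihood and prior at a single measurement, with the
# figure's priors P(w1) = 2/3, P(w2) = 1/3. The density values below are
# hypothetical readings of p(x=14|w1) and p(x=14|w2) from the curves.
p_w = [2/3, 1/3]
p_x_given_w = [0.46, 0.08]

evidence = sum(l * p for l, p in zip(p_x_given_w, p_w))          # p(x)
posteriors = [l * p / evidence for l, p in zip(p_x_given_w, p_w)]

print([round(p, 2) for p in posteriors])   # posteriors sum to 1
```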


Probability of Error

Remember that the goal is to minimize the error. Whenever we observe a particular x, the probability of error is:

P(error|x) = P(ω1|x) if we decide ω2 in place of ω1
P(error|x) = P(ω2|x) if we decide ω1 in place of ω2

Therefore, under the optimal rule,

P(error|x) = min[P(ω1|x), P(ω2|x)]

i.e. to minimize the error we pick the largest of P(ω1|x) and P(ω2|x):

Decide class ω1 if P(ω1|x) > P(ω2|x), otherwise ω2.

The decision boundary between classes ωi and ωj lies simply where P(ωi|x) = P(ωj|x).

Fig: Components of error with equal priors and a non-optimal decision point x*. The complete pink area (including the triangular area) corresponds to the probability of error for deciding ω1 when the class is ω2, and the gray area corresponds to the converse. If we select xB (as decision boundary) in place of x*, we eliminate the reducible error portion and minimize the error.


Generalization:

Allowing more than two features: replace the scalar x with a vector x from a d-dimensional feature space R^d.

Allowing more than two classes: decide ωi if P(ωi|X) > P(ωj|X) for all j ≠ i.

Allowing actions other than classification: e.g. a reject option allows us not to classify if we don't know the class.

Allowing a loss function: weighting decision costs.

Two common density models:

1. Gaussian (normal) density function
2. Uniform density function

Bayes classifier behavior is determined by

1) the conditional density p(x|ωi)
2) the a priori probability P(ωi)

If the class-conditional density follows a known parametric distribution function, then estimating p(x|ωi) for a particular value of x is easy, i.e. there is no need to first find the actual empirical pdf.

The Gaussian is the most popular choice for the pdf, because:

Extensive studies on it say that it is followed by nature.
It is well behaved and analytically tractable.
It is an appropriate model for continuous values randomly distributed around a mean μ.
It is suitable for multivariate modeling.
Above all, the Gaussian pdf provides optimal Bayesian classifiers.


A univariate or single-dimensional (d = 1) Gaussian function is defined as:

p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

About 95% of the probability mass lies in the range |x − μ| ≤ 2σ, and the peak has value p(μ) = 1/(σ√(2π)).

Fig (b): Mean value μ = 1 and variance σ² = 0.2. The graphs are symmetric, and they are centered at the respective mean value.
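The univariate density and its peak value can be checked directly, using the figure's μ = 1 and σ² = 0.2:

```python
# Univariate Gaussian pdf p(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2/(2*sigma^2)),
# evaluated for the figure's parameters mu = 1, sigma^2 = 0.2.
from math import sqrt, pi, exp

def gaussian_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

mu, var = 1.0, 0.2
peak = gaussian_pdf(mu, mu, var)   # the peak value is 1/(sigma*sqrt(2*pi))
print(round(peak, 4))
assert abs(peak - 1 / sqrt(2 * pi * var)) < 1e-12
```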


A multivariate or d-dimensional Gaussian function is defined as:

p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) · exp(−½ (x − μ)^T Σ^{-1} (x − μ))

A classification problem with a Gaussian pdf:

Number of features: d = 1 (x). Number of classes: C = 2.
Prior probabilities: P(ω1) = P(ω2) = 0.5 # same for both classes, for simplicity
Class variances: σ1 = σ2 = σ # same for both classes
Class means: μ1 ≠ μ2 # different for the two classes

Posterior probability: P(ωi|x) ∝ p(x|ωi)·P(ωi)

p(x) is not taken into account because it is the same for all classes and does not affect the decision. Furthermore, since the a priori probabilities are equal, they also do not affect the comparison of the two posterior values.

So the decision rule is: max[p(x|ω1), p(x|ω2)] # also called the maximum-likelihood rule

i.e. the search for the maximum now rests on the values of the conditional pdfs evaluated at x.


Decision with a Gaussian pdf:

The line at x0 is a threshold partitioning the feature space into two regions, R1 and R2: the decision boundary.

The probability of a decision error for the case of two equiprobable classes is given by the areas of the two pdfs that fall on the wrong side of x0.

The decision boundary depends on the prior probability, because the prior is part of the posterior probability.


Gaussian pdf:

To further improve classification correctness, we add one more feature to our fish sorting, representing each fish by a feature vector X comprising two features:

x1 as length
x2 as lightness intensity

The class-conditional probability density functions for both classes are obtained by observing a large number of training pattern samples (statistical estimation).

By extending Bayes' theorem to the multi-feature and multi-class problem, we can design the decision rule for our fish-sorting problem.

Fig: The probability density distribution functions in 2D (two features) for classes ω1 and ω2. Since we have included two features, the decision boundaries are clearer, minimizing the error.

Gaussian pdf

The general multivariate Gaussian pdf is defined as above, so the class-dependent Gaussian density functions p(x|ωi) can be given as:

p(x|ωi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) · exp(−½ (x − μi)^T Σi^{-1} (x − μi)) ----------(1)


Gaussian pdf

For classification, the largest discriminant function for the ith class can be given in terms of posterior-probability maximization as:

gi(x) = P(ωi|x) = p(x|ωi)·P(ωi) / p(x)

Since p(x) is constant for all classes, we can drop it, so

gi(x) = p(x|ωi)·P(ωi)

A log() can be used as an alternative discriminant function for the ith class:

gi(x) = log(p(x|ωi)·P(ωi)) = log p(x|ωi) + log P(ωi) ----------(2)

From equations (1) and (2) we find:

gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2)·log 2π − ½·log|Σi| + log P(ωi) ----------(3)

Dropping the class-independent constant (d/2)·log 2π:

gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − ½·log|Σi| + log P(ωi) ----------(4)

Now we have the following cases:

Case 1: Σi = Σ = σ²·I (I stands for the identity matrix: diagonal case), i.e. equal covariance matrices for all classes, proportional to I.

Case 2: Σi = Σ = an arbitrary covariance matrix, but identical for all classes.

Case 3: Σi = an arbitrary covariance matrix, not identical across classes.


Case 1: Σi = Σ = σ²·I (I stands for the identity matrix: diagonal case), i.e. equal covariance matrices for all classes, proportional to I.

gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) − ½·log|Σ| + log P(ωi) ----------(4)

Assuming equal covariance matrices (i.e. the dependence on the class enters only through the mean vectors μi), we can drop the log|Σ| term as a class-independent constant bias, so:

gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + log P(ωi) ----------(5)

The first term is the (squared) distance of the feature vector x from the ith mean vector μi, weighted by the inverse of the covariance matrix Σ^{-1}.

Assuming the features are statistically independent and each feature has the same variance σ², we have Σ^{-1} = (1/σ²)·I and (x − μi)^T·(x − μi) = ||x − μi||². Putting these into eq. (5), we can write the simple discriminant function:

gi(x) = −||x − μi||² / (2σ²) + log P(ωi) ----------(6)

If the P(ωi) are the same for all c classes, then the log P(ωi) term becomes another unimportant additive constant that can be ignored:

gi(x) = −||x − μi||² / (2σ²)

The distance is thus the discriminating factor: gi(x) is largest when the distance is minimum.

The optimal decision rule: to classify a feature vector x, measure the Euclidean distance ||x − μi||² between x and each of the mean vectors, and assign x to the category of the nearest mean. Such a classifier is called a minimum-distance classifier.
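A minimal minimum-distance classifier, with hypothetical class means:

```python
# Minimum-distance classifier (Case 1: Sigma_i = sigma^2 I, equal priors):
# assign x to the class whose mean vector is nearest in Euclidean distance.
def classify_min_distance(x, means):
    def sq_dist(m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return min(range(len(means)), key=lambda i: sq_dist(means[i]))

means = [(0.0, 0.0), (4.0, 4.0), (0.0, 5.0)]     # hypothetical class means
print(classify_min_distance((3.0, 3.5), means))  # -> 1 (nearest mean is (4, 4))
```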


Proof

The Bayesian discriminant function (with equal priors) is given as:

gi(x) = −||x − μi||² / (2σ²)

Since ||x − μi||² = x^T·x − 2·μi^T·x + μi^T·μi, then

gi(x) = −(x^T·x − 2·μi^T·x + μi^T·μi) / (2σ²)

The quadratic term x^T·x is the same for all i, making it an ignorable additive constant. Thus we obtain the equivalent linear discriminant function:

gi(x) = wi^T·x + wi0

where wi = μi / σ² and wi0 = −μi^T·μi / (2σ²).
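A small numeric check of the proof: the linear form with these wi, wi0 differs from the quadratic form only by the class-independent term −x^T·x/(2σ²), so both rank classes identically (σ², the means and the test point below are arbitrary illustrative values):

```python
# Verify that g_i(x) = w_i^T x + w_i0, with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 sigma^2), matches the quadratic discriminant
# -||x - mu_i||^2 / (2 sigma^2) up to a class-independent constant.
sigma2 = 1.5
mus = [(0.0, 1.0), (2.0, 2.0)]
x = (1.2, 0.7)

def g_quad(mu):
    return -sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / (2 * sigma2)

def g_lin(mu):
    w = [mi / sigma2 for mi in mu]
    w0 = -sum(mi * mi for mi in mu) / (2 * sigma2)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# The difference is -x^T x / (2 sigma^2), identical for every class.
diffs = [g_quad(mu) - g_lin(mu) for mu in mus]
print(all(abs(d - diffs[0]) < 1e-12 for d in diffs))   # -> True
```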


The decision surfaces for a linear classifier are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities.

If x0 is on the threshold, then gi(x0) − gj(x0) = 0. Since

gi(x) = −||x − μi||² / (2σ²) + log P(ωi),

solving gi(x0) = gj(x0) gives the boundary point

x0 = ½(μi + μj) − (σ² / ||μi − μj||²) · ln[P(ωi)/P(ωj)] · (μi − μj)

If P(ωi) = P(ωj), then the point x0 is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means.


Since the covariance matrix is Σ = σ²·I, the distributions are hyperspherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means.

If P(ωi) ≠ P(ωj), then the point x0 shifts away from the more likely mean.

If the variance σ² is small relative to the squared distance ||μi − μj||², then the position of the decision boundary x0 is relatively insensitive to the exact values of the prior probabilities P(ωi).

1-D Case


2-D Case

3-D Case


Case 2: Σi = Σ = an arbitrary covariance matrix, but identical for each class.

gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) − ½·log|Σ| + log P(ωi) ----------(4)

Due to the identical covariance matrix, we can drop the log|Σ| term and get:

gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + log P(ωi) ----------(6)

If the prior probabilities P(ωi) are the same for all classes, then the log P(ωi) term can also be ignored.

In this case the covariance matrix influences the distance measurement, so this is not the simple Euclidean distance but is called the Mahalanobis distance from the feature vector x to μi. It is given as:

r² = (x − μi)^T Σ^{-1} (x − μi)
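The Mahalanobis distance can be computed directly from its definition; Σ^{-1} below is a hypothetical inverse covariance matrix, and Σ = I recovers the Euclidean case:

```python
# Squared Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu) in 2-D;
# when Sigma = I it reduces to the squared Euclidean distance.
def mahalanobis_sq(x, mu, sigma_inv):
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d[i] * sigma_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

x, mu = (2.0, 1.0), (0.0, 0.0)
identity_inv = [[1.0, 0.0], [0.0, 1.0]]
sigma_inv = [[1.0, -0.5], [-0.5, 1.0]]   # inverse of a hypothetical Sigma

print(mahalanobis_sq(x, mu, identity_inv))   # -> 5.0 (Euclidean case)
print(mahalanobis_sq(x, mu, sigma_inv))
```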


Fig: Gaussian distributions with an arbitrary covariance matrix Σ, identical for all classes.

2-D Case


3-D Case

Case 3: Σi = an arbitrary covariance matrix, not identical for each class, but invertible.

In the general multivariate normal case, the covariance matrices are different for each category. The discriminant function is:

gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − ½·log|Σi| + log P(ωi)

Due to the arbitrary, class-specific covariance matrix in the first term, the resulting discriminant functions are inherently quadratic. Expanding the first term we get:

gi(x) = −½ x^T Σi^{-1} x + μi^T Σi^{-1} x − ½ μi^T Σi^{-1} μi − ½·log|Σi| + log P(ωi)


Decision boundaries

In the two-category case, the decision surfaces are hyperquadrics, which can assume any of the following general forms:

1) Hyperplanes,
2) Pairs of hyperplanes,
3) Hyperspheres,
4) Hyperellipsoids,
5) Hyperparaboloids, and
6) Hyperhyperboloids

1-D Case

Fig: Decision regions in one dimension for Gaussians having unequal variance.


2-D Case

Fig: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

3-D Case

Fig: Arbitrary three-dimensional Gaussian distributions yield Bayes decision boundaries that are two-dimensional hyperquadrics.


Fig: The decision regions for four normal distributions. Even with such a low

number of categories, the shapes of the boundary regions can be rather complex.


Sometimes we must consider cost measures more general than the probability of error. In classification problems we face the following situations:

In some cases, all classification mistakes are equally costly.

In other cases, the errors due to classification mistakes are not equally costly: some mistakes are costlier than others, e.g.

Consider distinguishing between a signal (e.g. a message or perhaps a radar return) and noise. Here our attempt is to develop strategies so that the incoming signal is correctly classified.


Case 1: The incoming signal is a valid signal and we correctly classify it as such (a hit): ω1 -> ω1
Case 2: The incoming signal is a valid signal and we incorrectly classify it as noise (a miss): ω1 -> ω2
Case 3: The incoming signal is noise and we incorrectly classify it as a valid signal (a false alarm): ω2 -> ω1
Case 4: The incoming signal is noise and we correctly classify it as noise (a true reject): ω2 -> ω2

Suppose that after classification we need to take some action, such as firing a missile or not. In this case, we can assume that an action associated with a wrong classification is costlier than one associated with a right classification.

Let α denote the action or decision taken in each of the above four cases, e.g.

α1 = fire a missile if signal (ω1)
α2 = do not fire a missile if noise (ω2)

We want to quantify the costs of these outcomes in a systematic manner.


Let

X be a d-dimensional feature vector, Ω = {ω1, ω2, …, ωn} a finite set of n classes, and {α1, α2, …, αn} the set of possible actions, where αi = the decision to choose class ωi.

A decision rule is a function α(x), i.e. a mapping α(x) -> αi.

λ(αi|ωj) is a loss or cost function estimating the loss or cost resulting from taking action αi when the class is ωj.

Let λij be the cost associated with selecting action αi when the class is ωj. E.g., in the case c = 2 (number of classes):

λ11 and λ22 are the rewards for correct classification, and
λ12 and λ21 are the penalties for incorrect classification.

Risk R: in decision-theoretic terminology, the expected loss is called the risk; R denotes the overall risk.

R(αi|x) is the conditional risk, which depends on the action selected.

For an n-class problem, if X is classified as belonging to class ωk, then the probability of error per Bayes' rule is:

P(error|X) = Σ_{i=1, i≠k}^{n} P(ωi|X)

But this Bayes formulation for the error does not describe the risk. So, for the risk:

The probability that we select action αi when ωj is the true class is given as:

P(αi, ωj) = P(αi|ωj)·P(ωj)

The term P(αi|ωj) depends on the chosen mapping α(x) -> αi, which in turn depends on x. The conditional risk is then given as:

R(αi|x) = Σ_{j=1}^{n} λ(αi|ωj)·P(ωj|x) = Σ_{j=1}^{n} λij·P(ωj|x)
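A sketch of the conditional-risk computation, reusing the λij values from the signal/noise example later in this section (the posteriors are illustrative):

```python
# Conditional risk R(a_i|x) = sum_j lambda_ij * P(w_j|x) for two classes;
# the minimum-risk action is chosen for each observation.
lam = [[-2.0, 2.0],    # lambda_11, lambda_12 (a1 = fire: hit reward, false-alarm penalty)
       [ 4.0, -1.0]]   # lambda_21, lambda_22 (a2 = don't fire: miss penalty, reject reward)

def conditional_risk(i, posteriors):
    return sum(lam[i][j] * posteriors[j] for j in range(len(posteriors)))

posteriors = [0.7, 0.3]    # illustrative P(w1|x), P(w2|x)
risks = [conditional_risk(i, posteriors) for i in range(2)]
best_action = min(range(2), key=lambda i: risks[i])
print(risks, best_action)
```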


If c = 2 and the number of actions is 2, then the overall risk measure is:

R = λ11·P(α1|ω1)·P(ω1) + λ12·P(α1|ω2)·P(ω2) + λ21·P(α2|ω1)·P(ω1) + λ22·P(α2|ω2)·P(ω2)

where, for each action, the measure of conditional risk is:

R[α(x) -> α1] = R(α1|x) = λ11·P(ω1|x) + λ12·P(ω2|x)
R[α(x) -> α2] = R(α2|x) = λ21·P(ω1|x) + λ22·P(ω2|x)

i.e. the risk of each action combines the costs of deciding correctly and incorrectly (using the total probability theorem).

For minimizing the overall risk, we have to minimize the conditional risk, and the lower bound for this minimization is the Bayes risk.

Two actions α1 and α2 are to be taken for ω1 and ω2 respectively. The decision rule is formulated as: decide action α1 (class ω1) if

R(α1|x) < R(α2|x), i.e. λ11·P(ω1|x) + λ12·P(ω2|x) < λ21·P(ω1|x) + λ22·P(ω2|x) ----------(1)


So alternatively, under the assumption λ21 > λ11 (i.e. the cost of an incorrect decision is higher than that of a correct one), we can rewrite the decision rule as: decide ω1 if

p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)] ----------(2)

The threshold is the ratio of prior probabilities [P(ω2)/P(ω1)] weighted by the cost ratio θ = (λ12 − λ22)/(λ21 − λ11).

This is the likelihood-ratio rule: decide ω1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.

Effect of λij in the case of n = 2 classes. Take

λ11 = −2, λ22 = −1 : for correct classification
λ12 = 2, λ21 = 4 : for incorrect classification

Then the threshold in equation (2) becomes [(2 − (−1)) / (4 − (−2))] · P(ω2)/P(ω1) = 0.5 · P(ω2)/P(ω1), i.e. the lowered threshold biases the rule toward identifying class ω1, i.e. signal rather than noise.
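The threshold computation just described, as a sketch:

```python
# Likelihood-ratio rule with the slide's costs: decide w1 (signal) if
# p(x|w1)/p(x|w2) > [(l12 - l22)/(l21 - l11)] * P(w2)/P(w1).
l11, l22 = -2.0, -1.0    # rewards for correct decisions
l12, l21 = 2.0, 4.0      # penalties for incorrect decisions

def threshold(p_w1, p_w2):
    return (l12 - l22) / (l21 - l11) * (p_w2 / p_w1)

# With equal priors, the cost ratio (2-(-1))/(4-(-2)) = 0.5 lowers the
# threshold below 1, biasing the rule toward declaring w1 (signal).
t = threshold(0.5, 0.5)
print(t)                                   # -> 0.5
print("signal" if 0.8 > t else "noise")    # a likelihood ratio of 0.8 -> signal
```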


Often we choose:

λ11 = λ22 = 0, i.e. no cost or risk for correct classification
λ12 = λ21 = 1, i.e. unit cost or risk for incorrect classification

So in this case equation (2) becomes: decide ω1 if

p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1)

which is just the minimization of the probability of error.

For an n-class problem, to set the cost for correct and incorrect classification we can define the cost/loss function as a zero-one loss function:

λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, n

This loss function ties the risk directly to the probability of error, since the conditional risk for decision αi is now given as:

R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)


So, to minimize the conditional risk, we should select the αi that maximizes the posterior probability P(ωi|x).

Thanks
