2/21/2017

Statistical Approach to PR

Various PR approaches


Simple probabilistic approach


A fundamental statistical approach to the problem of pattern classification.

The decision problem is posed in probabilistic terms.

Ideal case: the probability structure underlying the categories is known perfectly.

Specific example: Fish sorting

Sea bass and salmon appear randomly on a belt and must be classified. Let $\omega$ denote the state of nature, or class:
$\omega = \omega_1$ for sea bass, $\omega = \omega_2$ for salmon.
The state of nature is unpredictable, so it must be described probabilistically by a priori (prior) probabilities. Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.

A priori probabilities may be known in advance, e.g.
1) from collected samples used as training data,
2) from the time of the year,
3) from the location.
Let
$P(\omega_1)$ be the prior probability that the fish is a sea bass,
$P(\omega_2)$ be the prior probability that the fish is a salmon.
Assuming no other fish can appear,
$P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity)


Discrimination function
Case 1: We must make a decision without seeing the fish, i.e. with no features.

Prior probabilities of classes $\omega_1$ and $\omega_2$: estimate them from the available training data. If N is the total number of available training patterns, and $N_1$, $N_2$ of them belong to $\omega_1$ and $\omega_2$ respectively, then e.g.

$P(\omega_1) = N_1/N = 0.9$, $P(\omega_2) = N_2/N = 0.1$

Discriminating function:
if $P(\omega_1) > P(\omega_2)$: choose $\omega_1$, otherwise choose $\omega_2$
$P(error) = P(\text{choose } \omega_2|\omega_1)\,P(\omega_1) + P(\text{choose } \omega_1|\omega_2)\,P(\omega_2)$, which equals $\min[P(\omega_1), P(\omega_2)]$ under this rule
e.g. if we always choose $\omega_1$:
P(error) = 0 * 0.9 + 1 * 0.1 = 0.1
i.e. the probability of error is 10%, which is the minimal error achievable without features.

So it works well if $P(\omega_1) \gg P(\omega_2)$, and not well at all if $P(\omega_1) = P(\omega_2)$, i.e. if the classes are equiprobable (uniform priors).
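The prior-only rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name and the training counts (90 sea bass, 10 salmon) are assumptions chosen to reproduce the 0.9 / 0.1 priors used above.

```python
def prior_only_decision(counts):
    """Pick the class with the largest prior; return (choice, priors, p_error)."""
    total = sum(counts.values())
    priors = {label: n / total for label, n in counts.items()}
    choice = max(priors, key=priors.get)
    # Always answering `choice` is wrong exactly when some other class occurs.
    p_error = 1.0 - priors[choice]
    return choice, priors, p_error


# Hypothetical training counts matching P(w1) = 0.9 and P(w2) = 0.1.
choice, priors, p_error = prior_only_decision({"sea bass": 90, "salmon": 10})
print(choice, priors, round(p_error, 3))
```

With equal counts the same function returns an error of 0.5, which is the equiprobable case where the rule is no better than guessing.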

Bayesian Decision Theory


A classifier designed with Bayesian decision theory integrates all the available problem information, such as measurements, a priori probabilities, likelihoods and evidence, to form the decision rules.

Decision rules formation:

a) By calculating the a posteriori probability $P(\omega_i|X)$ from the a priori probability $P(\omega_i)$ and the likelihood $p(X|\omega_i)$: Bayes' theorem.

b) By formulating a measure of expected classification error or risk and choosing a decision rule that minimizes this measure.


Bayes' Theorem
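To derive Bayes' theorem, start from the definition of conditional probability. The probability of the event A given the event B is

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

Equivalently, the probability of the event B given the event A is

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} \]

Rearranging and combining these two equations, we find

\[ P(A|B)\,P(B) = P(A \cap B) = P(B|A)\,P(A) \]

Discarding the middle term and dividing both sides by P(B), provided that neither P(B) nor P(A) is 0, we obtain Bayes' theorem:

\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \] ----------(1)

Bayes' theorem is often completed by noting that, according to the Law of Total Probability,

\[ P(B) = P(B|A)\,P(A) + P(B|A^c)\,P(A^c) \] -------(2)

where $A^c$ is the complementary event of A.

Substituting (2) into (1):

\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A^c)\,P(A^c)} \]

More generally, the Law of Total Probability states that, given a partition $\{A_i\}$ of the event space,

\[ P(B) = \sum_i P(B|A_i)\,P(A_i) \]

Therefore, for any $A_i$ in the partition,

\[ P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_j P(B|A_j)\,P(A_j)} \] -------(3)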
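As a small numeric check of Bayes' theorem over a partition, the sketch below computes $P(A_i|B)$ from priors $P(A_i)$ and likelihoods $P(B|A_i)$. The function name and the probabilities are hypothetical values picked only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each event A_i of a partition."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # law of total probability gives P(B)
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return [j / evidence for j in joint]


# Hypothetical values: P(A) = 0.3, P(A^c) = 0.7, P(B|A) = 0.8, P(B|A^c) = 0.1
post = posterior([0.3, 0.7], [0.8, 0.1])
print(post)
```

The posteriors always sum to 1 because the denominator is the sum of the numerators.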


Pattern and Class Representation


A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector

\[ x = (x_1, x_2, \dots, x_d)^T \]

Given a pattern $x = (x_1, x_2, \dots, x_d)^T$, assign it to one of n classes in the set C:

\[ C = \{\omega_1, \omega_2, \dots, \omega_n\} \]

The states of nature. Given:
$P(\omega_i)$: the class probabilities (prior probabilities, calculated from training data)
$P(x|\omega_i)$: the class-conditional probabilities (likelihoods, calculated from training data)

[Figure: a pattern x and three classes $\omega_1$, $\omega_2$, $\omega_3$, each labeled with its prior $P(\omega_i)$ and its class-conditional probability $P(x|\omega_i)$.]


Posterior Probability
$P(\omega_i|X)$: the posterior probability that a test pattern X belongs to class $\omega_i$ is given by:

\[ P(\omega_i|X) = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)} \]

\[ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \]

i.e. to classify a test pattern with attribute vector X, we assign it to the class most probable for X.

This means we estimate $P(\omega_i|X)$ for each class i = 1 to n, and then assign the pattern to the class with the highest posterior probability, i.e. we maximize $P(\omega_i|X)$.

Bayes classifier: maximizing the posterior means maximizing the right-hand side of the equation above. All of its terms can be calculated from training data:

$P(\omega_i)$: (number of instances of class $\omega_i$) / (total number of samples).

$P(X)$: constant for all classes, so only a single value is needed. For a single feature x and multiple classes it is calculated as:

\[ P(X = x) = \sum_{i=1}^{n} p(x|\omega_i)\,P(\omega_i) \]

$P(X|\omega_i)$: considering all the attributes jointly requires a huge amount of computation. So we assume that the attributes are independent of each other given the class. This makes the classifier a Naive Bayes classifier, i.e.

\[ P(X|\omega_i) = \prod_{k=1}^{d} P(x_k|\omega_i) \]
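The counting estimates and the independence assumption can be put together in a short sketch. Everything specific here is an assumption for illustration: the discretized features ("long"/"short", "light"/"dark"), the eight training samples, and the function names. No smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (feature_tuple, label). Returns class and feature counts."""
    class_counts = Counter(label for _, label in samples)
    feat_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for features, label in samples:
        for k, value in enumerate(features):
            feat_counts[(label, k)][value] += 1
    return class_counts, feat_counts


def naive_bayes_classify(features, class_counts, feat_counts):
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(w_i)
        for k, value in enumerate(features):
            score *= feat_counts[(label, k)][value] / n  # P(x_k | w_i)
        scores[label] = score  # proportional to P(w_i | X); P(X) is omitted
    return max(scores, key=scores.get), scores


# Hypothetical training data: (length, lightness) -> class
samples = (
    [(("long", "light"), "sea bass")] * 3
    + [(("long", "dark"), "sea bass")]
    + [(("short", "dark"), "salmon")] * 3
    + [(("short", "light"), "salmon")]
)
class_counts, feat_counts = train_naive_bayes(samples)
label, scores = naive_bayes_classify(("long", "light"), class_counts, feat_counts)
print(label, scores)
```

Dividing each score by the sum of all scores would recover the actual posteriors, but the argmax is unchanged, which is why P(X) can be skipped.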


Bayes Classifier for fish sorting


The simple probabilistic approach is not sufficient to classify the fish.

To improve classification correctness, we use features that can be measured on the fish and incorporate the Bayes classification concept.

Assume the fish (sea bass/salmon) is described by a scalar feature X comprising a single feature:
x, the length (a continuous random variable).

Define $p(x|\omega_i)$ as the class-conditional probability density (the probability density of x given the state of nature $\omega_i$). Its distribution depends on the state of nature (class) $\omega_i$ and the feature value.

These probability density functions can be obtained by observing a large number of training pattern samples (statistical).

The curves of the pdfs $p(x|\omega_1)$ and $p(x|\omega_2)$ describe the difference between the populations of the two classes, e.g. sea bass and salmon.

[Figure: hypothetical class-conditional pdfs showing the probability density of measuring a particular feature value x given that the pattern is in category $\omega_i$.]

Density functions are normalized, so the area under each curve is 1.0.


Bayes Classifier for fish sorting


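Suppose we know the priors $P(\omega_i)$ and the densities $p(x|\omega_i)$, and we measure the length of a fish as the value x.
Then $P(\omega_i|x)$, the a posteriori probability (the probability of $\omega_i$ given the measured feature value x), is given by Bayes' theorem as

\[ P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}, \quad \text{where } p(x) = \sum_{i=1}^{2} p(x|\omega_i)\,P(\omega_i) \]

\[ P(\omega_1|x) + P(\omega_2|x) = 1 \]

Decision Rule:

\[ \omega = \begin{cases} \omega_1 & \text{if } P(\omega_1|x) > P(\omega_2|x) \\ \omega_2 & \text{otherwise} \end{cases} \quad\Longleftrightarrow\quad \omega = \begin{cases} \omega_1 & \text{if } p(x|\omega_1)\,P(\omega_1) > p(x|\omega_2)\,P(\omega_2) \\ \omega_2 & \text{otherwise} \end{cases} \]

The maximization of $P(\omega_i|x)$ depends only on the product of likelihood and prior, $p(x|\omega_i)\,P(\omega_i)$, because the evidence p(x) is constant for all classes. If the classes are also equiprobable, i.e. $P(\omega_1) = P(\omega_2) = 0.5$, the prior can be dropped from the comparison as well, and the decision depends only on the likelihood $p(x|\omega_i)$.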
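A minimal sketch of this two-class decision rule follows. It assumes Gaussian class-conditional densities for the fish length; the means, standard deviations and function names are hypothetical and are not taken from the slides' figures. Only the priors 2/3 and 1/3 echo the example that follows.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)


def posteriors(x, params, priors):
    """params[i] = (mu_i, sigma_i). Returns [P(w_1 | x), P(w_2 | x), ...]."""
    joint = [gaussian_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    evidence = sum(joint)  # p(x)
    return [j / evidence for j in joint]


def decide(x, params, priors):
    post = posteriors(x, params, priors)
    return post.index(max(post)) + 1, post  # 1-based class index


# Hypothetical length models: w1 (sea bass) ~ N(14, 2^2), w2 (salmon) ~ N(10, 2^2)
params = [(14.0, 2.0), (10.0, 2.0)]
priors = [2 / 3, 1 / 3]
print(decide(14.0, params, priors))
print(decide(9.0, params, priors))
```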

Posterior probability for particular priors

Fig (a): Hypothetical class-conditional pdfs show the probability density of measuring a particular feature value x given the pattern is in category $\omega_i$.

Fig (b): Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ for fig (a). Given that a pattern is measured to have feature value x = 14, its probability for class $\omega_2$ is 0.08 and for class $\omega_1$ is 0.92.


Probability of Error
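Remember that the goal is to minimize error.
Whenever we observe a particular x, the probability of error is:
$P(error|x) = P(\omega_1|x)$ if we decide $\omega_2$ in place of $\omega_1$
$P(error|x) = P(\omega_2|x)$ if we decide $\omega_1$ in place of $\omega_2$
Therefore, if we always decide for the class with the larger posterior,
$P(error|x) = \min[P(\omega_1|x), P(\omega_2|x)]$

For any given x, we can minimize the error by choosing the largest of $P(\omega_1|x)$ and $P(\omega_2|x)$, i.e.
decide the class $\omega_1$ if
$P(\omega_1|x) > P(\omega_2|x)$
or, equivalently,
$p(x|\omega_1)\,P(\omega_1) > p(x|\omega_2)\,P(\omega_2)$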

Deciding the decision boundary

The decision boundary $x_B$ is the border between classes $\omega_i$ and $\omega_j$, simply where
$P(\omega_i|x) = P(\omega_j|x)$

Fig: Components of the error for equal priors and a non-optimal decision point x*. The complete pink area (including the triangular area) corresponds to the probability of error for deciding $\omega_1$ when the class is $\omega_2$, and the gray area corresponds to the converse.

If we select $x_B$ as the decision boundary in place of x*, we eliminate the reducible error portion and minimize the error.


Generalization:
Allowing more than one feature:
replaces the scalar x with a vector x from a d-dimensional feature space $R^d$

Allowing more than two classes:
decide $\omega_i$ if $P(\omega_i|X) > P(\omega_j|X)$ for all $j \neq i$

Allowing actions other than classification:
e.g. allowing the classifier not to classify if the class is not known

Introducing a loss function more general than the probability of error:
weighting decision costs

Allowing generalization on the probability density function:
1. Gaussian (normal) density function
2. Uniform density function

Modeling the conditional density


Bayes classifier behavior is determined by
1) the conditional density $p(x|\omega_i)$
2) the a priori probability $P(\omega_i)$

If we assume that our training data follow some standard probability distribution function, then estimating $p(x|\omega_i)$ for a particular value of x is easy, i.e. there is no need to find the actual pdf first.

The normal (Gaussian) probability density function is a very common assumption for the pdf, because:
Extensive studies suggest that many natural phenomena follow it.
It is well behaved and analytically tractable.
It is an appropriate model for continuous values randomly distributed around a mean $\mu$.
It is suitable for multivariate modeling.
Above all, when the class-conditional densities really are Gaussian, the resulting Bayesian classifier is optimal.


Univariate Gaussian (or Normal) function: $N(\mu, \sigma^2)$

A univariate, or single-dimensional (d = 1), Gaussian function is defined as:

\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \]

The pdf has roughly 95% of its area in the range $|x-\mu| \le 2\sigma$, and the peak has value $p(\mu) = 1/(\sqrt{2\pi}\,\sigma)$.

[Figure: two univariate Gaussian pdfs]
(a) Mean value $\mu = 0$ and variance $\sigma^2 = 1$
(b) Mean value $\mu = 1$ and variance $\sigma^2 = 0.2$

The larger the variance, the broader the graph.
The graphs are symmetric, and they are centered at their respective mean values.
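The two properties quoted for the univariate Gaussian (the peak value and the roughly 95% of area within two standard deviations) can be checked numerically. The sketch below uses only Python's standard `math` module; the function names are illustrative, and the two parameter sets are the cases (a) and (b) listed above.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mass_within(k):
    """Probability that |x - mu| <= k * sigma for any normal distribution."""
    return math.erf(k / math.sqrt(2.0))


peak_a = normal_pdf(0.0, 0.0, 1.0)             # case (a): mu = 0, sigma^2 = 1
peak_b = normal_pdf(1.0, 1.0, math.sqrt(0.2))  # case (b): mu = 1, sigma^2 = 0.2
print(peak_a, peak_b, mass_within(2.0))
```

The smaller variance in case (b) gives the taller, narrower curve, which matches the statement that a larger variance broadens the graph.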


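Multivariate Gaussian (or Normal) function: $N(\mu, \Sigma)$

A multivariate, or d-dimensional, Gaussian function is defined as:

\[ p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \]

where $\mu$ is the d-dimensional mean vector and $\Sigma$ is the d x d covariance matrix.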

Case 1: Single feature and two classes (equiprobable) problem with Gaussian pdf:
Number of features: d = 1 (x)    Number of classes: C = 2
Prior probabilities: $P(\omega_1) = P(\omega_2) = 0.5$  # Same for both classes for simplicity
Class variances: $\sigma_1 = \sigma_2 = \sigma$  # Same for both classes
Class means: $\mu_1 \neq \mu_2$  # Different for the two classes

Posterior probability:

\[ P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} \]

p(x) is not taken into account, because it is the same for all classes and does not affect the decision. Furthermore, if the a priori probabilities are equal, they do not affect the comparison of the two posterior values either.

So, Decision Rule: $\max[p(x|\omega_1), p(x|\omega_2)]$  # Also called the maximum likelihood rule

i.e. the search for the maximum now rests on the values of the conditional pdfs evaluated at x.


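Case 1: Single feature and two classes problem with Gaussian pdf:
The line at $x_0$ is a threshold (the decision boundary) partitioning the feature space into two regions, $R_1$ and $R_2$.

The total probability of committing a decision error for the case of two equiprobable classes is given by:

\[ P_e = \frac{1}{2}\int_{R_1} p(x|\omega_2)\,dx + \frac{1}{2}\int_{R_2} p(x|\omega_1)\,dx \]

where $R_i$ is the region in which we decide $\omega_i$.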
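For two equiprobable Gaussians with the same standard deviation, the pdfs cross midway between the means, so the threshold and the total error have a closed form. The sketch below is an illustration under those assumptions; the function names and the example values (means 0 and 2, sigma 1) are hypothetical.

```python
import math


def std_normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def equiprobable_error(mu1, mu2, sigma):
    """Threshold x0 and total error for two equiprobable Gaussians with equal sigma."""
    x0 = 0.5 * (mu1 + mu2)  # the two class-conditional pdfs intersect here
    half_gap = abs(mu2 - mu1) / (2.0 * sigma)
    # Each class contributes its tail beyond x0, weighted by its prior of 1/2.
    tail = std_normal_cdf(-half_gap)
    p_error = 0.5 * tail + 0.5 * tail
    return x0, p_error


x0, p_error = equiprobable_error(0.0, 2.0, 1.0)
print(x0, p_error)
```

Moving the means further apart, or shrinking sigma, increases `half_gap` and drives the error toward zero.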

Case 1: Single feature and two classes (not equiprobable) problem with Gaussian pdf:
The decision boundary now also depends on the prior probabilities, because they are part of the posterior probability.


Case 1: Two features and two classes problem with Gaussian pdf:
To improve classification correctness further, we add one more feature to our fish sorting.

Assume the fish (sea bass/salmon) is described by a feature vector X comprising two features:
x1, the length
x2, the lightness intensity

Define $p(x_1, x_2|\omega_i)$ as the class-conditional probability density function for both classes.

These probability density functions can be obtained by observing a large number of training pattern samples (statistical).

By extending Bayes' theorem to the multi-feature, multi-class problem, we can design the decision rule for our fish sorting problem.

Fig: The probability density functions in 2D (two features) for classes $\omega_1$ and $\omega_2$. Since we have included two features, the decision boundaries are clearer, which reduces the error.

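Estimation of Discriminating function for multivariate Gaussian pdf
The general multivariate Gaussian pdf is defined as:

\[ p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \]

With class-specific mean vectors $\mu_i$ and class-specific covariance matrices $\Sigma_i$, the class-dependent Gaussian density functions $p(x|\omega_i)$ are given as:

\[ p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] \] -------(1)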


Estimation of Discriminating function for multivariate Gaussian pdf
For classification, the discriminant function for the ith class can be given in terms of posterior probability maximization; x is assigned to the class with the largest $g_i(x)$:

\[ g_i(x) = P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} \]

Since p(x) is constant for all classes we can drop the term p(x), so

\[ g_i(x) = p(x|\omega_i)\,P(\omega_i) \]

Because of the exponential form of the densities involved, the (monotonic) natural logarithm can be used to obtain an alternative discriminating function for the ith class:

\[ g_i(x) = \ln\big(p(x|\omega_i)\,P(\omega_i)\big) = \ln p(x|\omega_i) + \ln P(\omega_i) \] ---------(2)

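Discriminating function for multivariate Gaussian pdf

From equations 1 and 2 we find:

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \] ----(3)

Dropping the second term, which is the same for every class, we get

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \] ----(4)

Here $\Sigma_i$ influences classification.

Now we have the following cases:

Case 1:
$\Sigma_i = \Sigma = \sigma^2 I$ (I stands for the identity matrix: diagonal case), i.e. the covariance matrix is equal for all classes and proportional to I.
Case 2:
$\Sigma_i = \Sigma$: an arbitrary covariance matrix, but identical for all classes.
Case 3:
$\Sigma_i$: an arbitrary covariance matrix, not identical for all classes.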


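Discriminating function for multivariate Gaussian pdf

Case 1:
$\Sigma_i = \Sigma = \sigma^2 I$ (I stands for the identity matrix: diagonal case), i.e. the covariance matrix is equal for all classes and proportional to I.

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \] ----(4)

Assuming equal covariance matrices (i.e. the classes differ only through the mean vectors $\mu_i$), we can drop the 2nd term as a class-independent constant bias, so

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) + \ln P(\omega_i) \] -------(5)

But $\Sigma$ still influences classification through the first term, which is the squared distance of the feature vector x from the ith mean vector $\mu_i$, weighted by the inverse of the covariance matrix, $\Sigma^{-1}$.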

Discriminating function for multivariate Gaussian pdf

Assume the features are statistically independent and each feature has the same variance $\sigma^2$. Substituting

\[ \Sigma^{-1} = \frac{1}{\sigma^2} I, \qquad (x-\mu_i)^T (x-\mu_i) = \|x-\mu_i\|^2 \]

into eq. 5 gives eq. 6, a simple discriminant function:

\[ g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) \] --------(6)

If the $P(\omega_i)$ are the same for all c classes, then the $\ln P(\omega_i)$ term becomes another unimportant additive constant that can be ignored:

\[ g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} \]

If $\sigma = 1$ then $\Sigma = I$, which results in plain Euclidean space. For any $\sigma$ the ranking of the classes is the same, so the Euclidean distance can be taken as the discriminating factor: $g_i(x)$ is largest when the distance is smallest.
The optimal decision rule:

To classify a feature vector x, measure the Euclidean distance $\|x-\mu_i\|$ between x and each of the mean vectors, and assign x to the category of the nearest mean.
Such a classifier is called a minimum-distance classifier.
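The minimum-distance rule is short enough to sketch directly. This assumes equal priors and a covariance matrix proportional to the identity, as in the case above; the function names, class means and test points are hypothetical.

```python
def squared_distance(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def minimum_distance_classify(x, means):
    """Return the index of the nearest class mean (equal priors, Sigma = sigma^2 I)."""
    distances = [squared_distance(x, mu) for mu in means]
    # The largest g_i(x) corresponds to the smallest squared distance.
    return distances.index(min(distances))


# Hypothetical class means in a 2-D feature space
means = [(0.0, 0.0), (4.0, 4.0)]
print(minimum_distance_classify((1.0, 1.0), means))
print(minimum_distance_classify((3.0, 2.5), means))
```

Comparing squared distances avoids the square root, which does not change which mean is nearest.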


Discriminating function for multivariate Gaussian pdf

The discriminant function as a linear discriminant function:

Proof
The Bayesian discriminant function is given as:

\[ g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} \]

Since

\[ \|x-\mu_i\|^2 = x^T x - 2\mu_i^T x + \mu_i^T \mu_i \]

then

\[ g_i(x) = -\frac{x^T x - 2\mu_i^T x + \mu_i^T \mu_i}{2\sigma^2} \]

The quadratic term $x^T x$ is the same for all i, making it an ignorable additive constant. Thus we obtain the equivalent linear discriminant function

\[ g_i(x) = w_i^T x + w_{i0} \]

where

\[ w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^T \mu_i \quad \text{and} \quad w_i = \frac{1}{\sigma^2}\,\mu_i \]


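Decision boundaries for the case $\Sigma_i = \Sigma = \sigma^2 I$

The decision surfaces for a linear classifier are pieces of hyperplanes defined by the linear equations $g_i(x) = g_j(x)$ for the two categories with the highest posterior probabilities.
If $x_0$ is a point on the boundary (the threshold), then $g_i(x_0) - g_j(x_0) = 0$.

Since

\[ g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) \]

the boundary can be written as

\[ w^T (x - x_0) = 0, \quad w = \mu_i - \mu_j, \quad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j) \]

If $P(\omega_i) = P(\omega_j)$, then the point $x_0$ is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means.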
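The position of the boundary point can be computed directly. The sketch below assumes the standard closed form for the case of a shared covariance $\sigma^2 I$, namely $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}(\mu_i - \mu_j)$; the function name and the numeric values are hypothetical.

```python
import math


def boundary_point(mu_i, mu_j, sigma2, p_i, p_j):
    """Point x0 on the line through the means where g_i(x0) = g_j(x0)."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    norm2 = sum(d * d for d in diff)
    shift = sigma2 / norm2 * math.log(p_i / p_j)
    return [0.5 * (a + b) - shift * d for a, b, d in zip(mu_i, mu_j, diff)]


mu_i, mu_j = (0.0, 0.0), (4.0, 0.0)
equal = boundary_point(mu_i, mu_j, 1.0, 0.5, 0.5)
unequal = boundary_point(mu_i, mu_j, 1.0, 0.8, 0.2)
print(equal, unequal)
```

With equal priors the point is the midpoint of the means. With $P(\omega_i) = 0.8$ it moves toward $\mu_j$, i.e. away from the more likely mean, which enlarges the region assigned to the more probable class.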


Decision boundaries for $\Sigma_i = \sigma^2 I$ and $P(\omega_1) = P(\omega_2)$

[Figure panels: 1-D case, 2-D case, 3-D case]

If the covariances of two distributions are equal and proportional to the identity matrix, then the distributions are hyperspherical in d dimensions, and the boundary is a generalized hyperplane of d - 1 dimensions, perpendicular to the line separating the means.

In the 3-dimensional case, the grid plane separates $R_1$ from $R_2$.

Decision boundaries for $\Sigma_i = \sigma^2 I$ and $P(\omega_1) \neq P(\omega_2)$

If $P(\omega_i) \neq P(\omega_j)$, then the point $x_0$ shifts away from the more likely mean.

If the variance $\sigma^2$ is small relative to the squared distance $\|\mu_i - \mu_j\|^2$, then the position of the decision boundary $x_0$ is relatively insensitive to the exact values of the prior probabilities $P(\omega_i)$.

[Figure panels: 1-D case, 2-D case, 3-D case]

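Discriminating function for multivariate Gaussian pdf

Case 2: $\Sigma_i = \Sigma$, an arbitrary covariance matrix but identical for each class

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \] --------(4)

Because the covariance matrix is identical for all classes, we can drop the 2nd term and get the same form as eq. 5:

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) + \ln P(\omega_i) \] --------(5)

If the prior probabilities $P(\omega_i)$ are the same for all classes, then the $\ln P(\omega_i)$ term can be ignored.

In this case the covariance matrix affects the distance measurement, so it is not the simple Euclidean distance but the Mahalanobis distance from the feature vector x to $\mu_i$. Its square is given as:

\[ r^2 = (x-\mu_i)^T \Sigma^{-1} (x-\mu_i) \]

On the basis of the Mahalanobis distance, we can classify the patterns.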
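A small 2-D sketch shows how the Mahalanobis distance can change the decision relative to the Euclidean distance. It assumes equal priors and a shared covariance matrix; the helper names, the covariance values and the test point are hypothetical, and the 2x2 inverse is written out by hand to keep the example free of external libraries.

```python
def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu) in 2-D."""
    dx = [x[0] - mu[0], x[1] - mu[1]]
    row = [cov_inv[0][0] * dx[0] + cov_inv[0][1] * dx[1],
           cov_inv[1][0] * dx[0] + cov_inv[1][1] * dx[1]]
    return dx[0] * row[0] + dx[1] * row[1]


def mahalanobis_classify(x, means, cov):
    cov_inv = inverse_2x2(cov)
    d2 = [mahalanobis_sq(x, mu, cov_inv) for mu in means]
    return d2.index(min(d2)), d2


# Hypothetical shared covariance: large variance along x1, small along x2
cov = [[9.0, 0.0], [0.0, 1.0]]
means = [(0.0, 0.0), (3.0, 3.0)]
label, d2 = mahalanobis_classify((4.0, 0.5), means, cov)
print(label, d2)
```

The point (4, 0.5) is closer to the second mean in Euclidean terms, but the large variance along the first axis makes the first mean nearer in the Mahalanobis sense.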


Decision boundaries for the case $\Sigma_i = \Sigma$, arbitrary but identical for all classes

[Figure panels: 2-D case, 3-D case]

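Discriminating function for multivariate Gaussian pdf

Case 3: $\Sigma_i$ is an arbitrary covariance matrix, not identical for each class but invertible.
In the general multivariate normal case, the covariance matrices are different for each category. The discriminating function is:

\[ g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \]

Nothing can be dropped from the above equation now.

Also, because of the arbitrary, class-specific covariance matrix in the first term, the resulting discriminant functions are inherently quadratic. By expanding the first term we get:

\[ g_i(x) = x^T W_i x + w_i^T x + w_{i0} \]

where

\[ W_i = -\frac{1}{2}\Sigma_i^{-1}, \quad w_i = \Sigma_i^{-1}\mu_i, \quad w_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \]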


Decision boundaries
In the two-category case, the decision surfaces are hyperquadrics, which can assume any of the following general forms:

1) Hyperplanes,
2) Pairs of hyperplanes,
3) Hyperspheres,
4) Hyperellipsoids,
5) Hyperparaboloids, and
6) Hyperhyperboloids

Even in 1-D, the decision regions need not be simply connected.

[Figure: 1-D case] Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.
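The 1-D effect is easy to reproduce with the log-discriminant of a univariate Gaussian. In the sketch below both classes share the same mean but have different standard deviations, so the narrow class wins near the mean and the wide class wins on both sides. The parameters and function names are hypothetical.

```python
import math


def log_discriminant(x, mu, sigma, prior):
    # g_i(x) = -(x - mu)^2 / (2 sigma^2) - ln(sigma) + ln P(w_i);
    # the constant -0.5 * ln(2 pi) is shared by all classes and dropped.
    return -((x - mu) ** 2) / (2.0 * sigma ** 2) - math.log(sigma) + math.log(prior)


def decide(x, classes):
    scores = [log_discriminant(x, *c) for c in classes]
    return scores.index(max(scores)) + 1  # 1-based class index


# Hypothetical classes as (mu, sigma, prior): same mean, unequal variance
classes = [(0.0, 1.0, 0.5), (0.0, 2.0, 0.5)]
labels = [decide(x, classes) for x in (-3.0, 0.0, 3.0)]
print(labels)
```

The region assigned to the second class consists of two disjoint intervals, one on each side of the mean, so it is not simply connected.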


2-D Case

Fig: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

3-D Case

Fig: Arbitrary three-dimensional Gaussian distributions yield Bayes decision boundaries that are two-dimensional hyperquadrics.


2-features and 4-classes Case

Fig: The decision regions for four normal distributions. Even with such a low
number of categories, the shapes of the boundary regions can be rather complex.


Introducing a loss function more general than the probability of error
In classification problems we face the following situations:

In simple cases, all classification mistakes are equally costly.

In other cases, classification mistakes are not equally costly, i.e. some mistakes are costlier than others. For example:

In a RADAR signal analysis problem, the two classes correspond to a valid signal (e.g. a message or perhaps a radar return) and noise. Here, our aim is to develop strategies so that the incoming signal is correctly classified.


RADAR signal analysis problem

Case 1: The incoming signal is a valid signal and we correctly classify it as such (a hit): $\omega_1 \to \omega_1$

Case 2: The incoming signal is a valid signal and we incorrectly classify it as noise (a miss): $\omega_1 \to \omega_2$

Case 3: The incoming signal is noise and we incorrectly classify it as a valid signal (a false alarm): $\omega_2 \to \omega_1$

Case 4: The incoming signal is noise and we correctly classify it as noise (a true reject): $\omega_2 \to \omega_2$

Suppose that after classification we need to take some action, such as firing a missile or not.
In this case, we can assume that the action associated with a wrong classification is costlier than the one associated with a right classification.

Let $\alpha$ denote the action or decision taken as per the above four cases,
e.g. $\alpha_1$ = fire a missile if signal ($\omega_1$)
$\alpha_2$ = do not fire a missile if noise ($\omega_2$)

So this example leads us to define the probability of error in a new manner.


RADAR signal analysis problem

Let
X be a d-dimensional vector and $[\omega_1, \omega_2, \dots, \omega_n]$ a finite set of n classes.
$[\alpha_1, \alpha_2, \dots, \alpha_n]$ is a finite set of n possible actions, one associated with each class, where $\alpha_i$ = the decision to choose class $\omega_i$.
A decision rule: a function $\alpha(x)$, i.e. a mapping $x \to \alpha_i$.
$\lambda(\alpha_i|\omega_j)$ is a loss or cost function: it gives the loss or cost resulting from taking action $\alpha_i$ when the class is $\omega_j$.
$\lambda_{ij}$ is shorthand for $\lambda(\alpha_i|\omega_j)$, the cost of selecting action $\alpha_i$ when the class is $\omega_j$.
E.g. in the case c = 2 (number of classes),
$\lambda_{11}$ and $\lambda_{22}$ are the rewards for correct classification and
$\lambda_{12}$ and $\lambda_{21}$ are the penalties for incorrect classification.
Risk R: in decision-theoretic terminology, the expected loss is called the risk; R denotes the overall risk.
$R(\alpha_i|x)$: the conditional risk, which depends on the action selected.

Estimation and Minimization of Risk

For an n-class problem, let X be classified as belonging to class $\omega_k$. Then the probability of error as per Bayes' rule is:

\[ P(error|X) = \sum_{i=1,\ i \neq k}^{n} P(\omega_i|X) \]

But this Bayes formulation for error does not describe the risk. So for risk:

The probability that we select action $\alpha_i$ when $\omega_j$ is the true class is given as:

\[ P(\alpha_i, \omega_j) = P(\alpha_i|\omega_j)\,P(\omega_j) \]

The term $P(\alpha_i|\omega_j)$ depends on the chosen mapping $\alpha(x) \to \alpha_i$, which in turn depends on x. The conditional risk is then given as:

\[ R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j=1}^{n} \lambda_{ij}\,P(\omega_j|x) \]


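Estimation and Minimization of Risk

If c = 2 and the number of actions is 2, then the overall risk measure is:

\[ R = \lambda_{11}\,P(\alpha_1|\omega_1)\,P(\omega_1) + \lambda_{12}\,P(\alpha_1|\omega_2)\,P(\omega_2) + \lambda_{21}\,P(\alpha_2|\omega_1)\,P(\omega_1) + \lambda_{22}\,P(\alpha_2|\omega_2)\,P(\omega_2) \]

where, for action $\alpha_1$, the measure of conditional risk is

\[ R[\alpha(x) \to \alpha_1] = R(\alpha_1|x) = \lambda_{11}\,P(\omega_1|x) + \lambda_{12}\,P(\omega_2|x) \]

and for action $\alpha_2$, the measure of conditional risk is

\[ R[\alpha(x) \to \alpha_2] = R(\alpha_2|x) = \lambda_{21}\,P(\omega_1|x) + \lambda_{22}\,P(\omega_2|x) \]

For n classes, the expected risk $R[\alpha(x)]$ incurred when classifying x (using the total probability theorem) is:

\[ R[\alpha(x)] = \int R(\alpha(x)|x)\,p(x)\,dx \]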
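The conditional risk of each action is a weighted sum of the posteriors, so choosing the minimum-risk action takes only a few lines. In the sketch below `loss[i][j]` stands for $\lambda_{ij}$; the loss values, the posteriors and the function names are hypothetical.

```python
def conditional_risks(loss, posteriors):
    """loss[i][j] = cost of action i when the true class is j. Returns [R(a_i | x)]."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def min_risk_action(loss, posteriors):
    risks = conditional_risks(loss, posteriors)
    return risks.index(min(risks)), risks  # 0-based action index


# Hypothetical losses: a miss (action 2 when class 1) costs five times a false alarm
loss = [[0.0, 1.0],
        [5.0, 0.0]]
action, risks = min_risk_action(loss, [0.3, 0.7])
print(action, risks)
```

Here the second class is more probable (0.7), yet the first action has the lower risk because a miss is so expensive. This is the sense in which the loss function generalizes the minimum-error rule.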

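Estimation and Minimization of Risk

To minimize the overall risk, we have to minimize the conditional risk for every x. The resulting minimum overall risk is the Bayes risk, which is the lower bound for this minimization.

Minimizing $R(\alpha_i|x)$ for the n = 2 class problem:

Two actions $\alpha_1$ and $\alpha_2$ are to be taken for $\omega_1$ and $\omega_2$ respectively.
The decision rule is formulated as: decide the action $\alpha_1$ (class $\omega_1$) if:

In terms of risk comparison:

\[ R(\alpha_1|x) < R(\alpha_2|x), \quad \text{i.e.} \quad (\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x) \]

In terms of prior probabilities:

\[ (\lambda_{21} - \lambda_{11})\,p(x|\omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)\,P(\omega_2) \] -----------(1)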


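Estimation and Minimization of Risk

So alternatively, under the assumption $\lambda_{21} > \lambda_{11}$, i.e. the cost of an incorrect decision is higher than that of the correct one, we can write the decision rule as: decide $\omega_1$ if

\[ \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \] ------------(2)

Eq. 2 compares the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ with the ratio of prior probabilities $[P(\omega_2)/P(\omega_1)]$ weighted by the loss ratio $(\lambda_{12} - \lambda_{22})/(\lambda_{21} - \lambda_{11})$.

Thus, the Bayes decision rule can be interpreted as calling for deciding $\omega_1$ if the likelihood ratio exceeds a threshold value that is independent of the observation x.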

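Estimation and Minimization of Risk

Effect of $\lambda_{ij}$ in the case of n = 2 classes

Suppose $P(\omega_1) = P(\omega_2) = 0.5$, and
$\lambda_{11} = -2$, $\lambda_{22} = -1$: for correct decisions
$\lambda_{12} = 2$, $\lambda_{21} = 4$: for incorrect decisions

A miss is thus twice as costly as a false alarm, so equation 2 becomes:

\[ \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{2 - (-1)}{4 - (-2)} \cdot \frac{0.5}{0.5} = \frac{1}{2} \]

It shows that we are more concerned with correctly identifying class $\omega_1$, i.e. the signal, than the noise.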
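The threshold on the likelihood ratio can be computed from the losses and priors. This sketch assumes the two-class rule "decide $\omega_1$ if $p(x|\omega_1)/p(x|\omega_2)$ exceeds $\frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$"; the function name is illustrative and the numbers are the ones given in this example.

```python
def likelihood_ratio_threshold(loss, p1, p2):
    """loss = [[l11, l12], [l21, l22]]; returns the threshold on p(x|w1)/p(x|w2)."""
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("expected lambda_21 > lambda_11")
    return (l12 - l22) / (l21 - l11) * (p2 / p1)


radar = likelihood_ratio_threshold([[-2, 2], [4, -1]], 0.5, 0.5)
zero_one = likelihood_ratio_threshold([[0, 1], [1, 0]], 0.5, 0.5)
print(radar, zero_one)
```

A threshold of 0.5 means the signal class is chosen even when the noise likelihood is up to twice as large, whereas zero-one losses with equal priors give a threshold of 1.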


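Estimation and Minimization of Risk: Simplest case

Often we choose:
$\lambda_{11} = \lambda_{22} = 0$, i.e. no cost or risk for correct classification
$\lambda_{12} = \lambda_{21} = 1$, i.e. unit cost or risk for incorrect classification
So, in this case equation 2 becomes: decide $\omega_1$ if

\[ \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}, \quad \text{i.e.} \quad p(x|\omega_1)\,P(\omega_1) > p(x|\omega_2)\,P(\omega_2) \]

This is the same rule we established earlier for minimization of the probability of error.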

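Estimation and Minimization of Risk: Generalization

For an n-class problem, to set the cost of correct and incorrect classification we can define the cost/loss function as a zero-one loss function:

\[ \lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, n \]

It shows that all errors are equally costly.

The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk for decision $\alpha_i$ is now given as:

\[ R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x) \]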


Estimation and Minimization of Risk: Generalization

Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability $P(\omega_i|x)$.

In other words, for minimum error rate:

decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$

Thanks
