
PR

Dr. Robi Polikar, Lecture 3


Dimensionality Reduction
&
Discriminant Based Approaches
Genetic Algorithms
Information Theoretic Approaches
Principal Component Analysis
Fisher’s Linear Discriminant
All rights reserved, Robi Polikar © 2001 – 2013, No part of this presentation, including the PR logo, may be used without explicit written permission
PR Today in PR
 Dimensionality Reduction: Feature selection vs. feature extraction
 Forward & Backward Search
 Genetic Algorithms
 Information theoretic approaches
 Feature extraction: Transformation based approaches
 Principal Component Analysis
 Fisher Linear Discriminant

D Duda, Hart & Stork, Pattern Classification, 2/e Wiley, 2000


G R. Gutierrez-Osuna, http://psi.cse.tamu.edu/teaching/lecture_notes/
RP Robi Polikar – All Rights Reserved © 2001 – 2013

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Dimensionality
Reduction
 Why? Do we still need to answer this question? → The curse of dimensionality
 Reduces time complexity: less computation → faster model generation / evaluation
 Reduces space complexity: fewer parameters → smaller, faster models
 Saves the cost of observing (measuring) the features that are left out
 Simpler models are more robust on small datasets → Occam's razor
 More interpretable; simpler explanation
 Data visualization (structure, groups, outliers, etc) if plotted in 2 or 3 dimensions
 There are two ways to do this:
 Feature extraction: Using (typically) a mathematical transformation, project the
d-dimensional feature space onto a d′-dimensional space, where d′ < d (typically, d′ ≪ d)
• Principal component analysis
• Linear discriminant analysis
• Factor analysis
 Feature subset selection: Choose the d′ most informative features among a total of d
features.
• Filter approach – Choose relevant features based on some prior information (e.g., information theoretic)
• Wrapper approach – Search for subsets of features that provide the best performance on some
classification algorithm or figure of merit (genetic algorithms).
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feature Selection
 Choose 𝑑’ most informative features among a total of 𝑑 features.
 Filter approach – Choose relevant features based on some prior information
• Information theoretic approach: determine the features that carry the most information (e.g., highest
entropy, highest mutual information, maximum relevance minimum redundancy)
 Wrapper approach – Search for subsets of features that provide the best performance on
some classification algorithm or figure of merit.
• Forward sequential search: Start with no features. Evaluate each individual feature with a
classifier and pick the one that performs best. Keeping this feature, evaluate all pairs of
features that include the first winner, and choose the pair that performs best. Then evaluate
all triplets that include the previously winning pair, and so on. Stop when performance no
longer improves. (A MATLAB sketch of this procedure is given below.)
• Backward sequential search: Same process in reverse; start with all features and remove
one feature at a time.
• These are greedy approaches and are, of course, suboptimal.
[Figure: forward search on a 12-feature problem. Starting from the empty subset
0,0,0,0,0,0,0,0,0,0,0,0, the first row tests the 12 single-feature candidates; if the 2nd feature gives
the best performance, the next row tests the 11 pairs that contain it; if the 2nd and 12th features
give the best performance, the next row tests the 10 triplets that contain them; and so on. The
search continues until the best subset in the current row is no better than that of the previous row.]
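A minimal MATLAB sketch of forward sequential selection. It assumes a hypothetical helper
evalSubset(X, y) that returns the (e.g., cross-validated) accuracy of a classifier trained on the
given feature columns; this is an illustrative sketch, not a library routine.

function best = forwardSearch(X, y, evalSubset)
    % Greedy forward selection: keep adding one feature at a time as long as
    % the evaluation score keeps improving.
    d = size(X, 2);
    best = [];                                     % indices of currently selected features
    bestScore = -Inf;
    improved = true;
    while improved
        improved = false;
        for f = setdiff(1:d, best)                 % every feature not yet selected
            score = evalSubset(X(:, [best f]), y); % try adding feature f
            if score > bestScore
                bestScore = score;  winner = f;  improved = true;
            end
        end
        if improved
            best = [best winner];                  % keep the winning feature and continue
        end
    end
end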

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Feature Selection
PR
Genetic Algorithms
 Based on the theory of evolution and survival of the fittest
 Individuals of a population who possess a particular phenotype that allows them to
accommodate an environment are the ones that are most fit to survive.
 These individuals procreate, creating new individuals (offspring) with the same
genotype, who therefore also possess the phenotype that allowed their parents to
survive in the first place.
 Those offspring later procreate and create more individuals with that favorable
phenotype, allowing the "feature" that made them successful in the
first place to become well established.
 What caused the original phenotype to appear is usually a mutation in the genotype.
 A (not so fictitious) example
 An environmental factor causes a mutation in some people who live in a hot climate,
giving them dark skin color (a phenotype). Dark skin color protects against skin
cancer. Those with fair skin die young, before they can procreate, whereas
those with darker skin survive and continue to procreate, passing their
genotype to their offspring. This allows more and more dark-skinned people to
survive, and the phenotype becomes established in that geographical area.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Feature Selection
PR
Genetic Algorithms
 Genetic algorithms are heuristic local search algorithms that mimic evolution's
survival-of-the-fittest and natural selection mechanisms, as they occur in nature,
to solve combinatorial optimization problems.
 In a GA, the search space of all possible solutions (good or bad) to an optimization
problem is the population. Specific solutions are individuals of this population. The
goal is to find the best individuals.
 Individuals are evaluated with respect to a figure of merit, called the fitness
function. The fitness function depends on the specific problem.
 In a classification problem, the fitness function may be the classification
performance. In the traveling salesman problem, it is the total distance travelled.
 The GA continues for many generations (iterations), in each of which the least fit
individuals are eliminated and the most fit ones become the new parents. Parents are
combined through cross-over, creating new (children) solutions that share the
parents' (good) characteristics.
 A mutation operation provides a small perturbation to the new solutions, and then
the process continues as long as new solutions have better fitness.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Feature Selection
PR
Genetic Algorithms
 The pseudocode of a generic genetic algorithm looks like this (a MATLAB sketch follows the pseudocode):

1. Choose the initial random population of individuals, an appropriate fitness function, and
stop criteria (max number of generations, total time, improvement in fitness, etc.)
2. Evaluate the fitness of each individual in that population
3. Repeat until stop criteria met
a. Select the most fit individuals to be used for reproduction
b. With crossover probability 𝑝𝑐 , breed new individuals using the
crossover operation
c. With mutation probability 𝑝𝑚 , randomly flip gene values in the offspring
d. Place the new offspring in the population and compute their fitness
e. Replace least-fit individuals with new individuals
4. End
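A minimal MATLAB sketch of this loop for bit-string individuals. It assumes an even population
size and a user-supplied fitnessFcn that maps a 1-by-nBits logical vector to a scalar score; this is
an illustrative sketch, not the toolbox implementation.

function [best, bestFit] = simpleGA(fitnessFcn, nBits, popSize, nGen, pc, pm)
    pop = rand(popSize, nBits) > 0.5;            % 1. random initial population of bit strings
    for gen = 1:nGen                             % 3. repeat until the stop criterion is met
        fit = zeros(popSize, 1);
        for i = 1:popSize, fit(i) = fitnessFcn(pop(i,:)); end     % 2. evaluate each individual
        [~, order] = sort(fit, 'descend');
        parents = pop(order(1:popSize/2), :);    % 3a. keep the most fit half as parents
        children = parents;
        for i = 1:2:size(parents,1)-1            % 3b. crossover with probability pc
            if rand < pc
                cp = randi(nBits - 1);           % random crossover position
                children(i,  :) = [parents(i,   1:cp), parents(i+1, cp+1:end)];
                children(i+1,:) = [parents(i+1, 1:cp), parents(i,   cp+1:end)];
            end
        end
        mask = rand(size(children)) < pm;        % 3c. flip gene values with probability pm
        children(mask) = ~children(mask);
        pop = [parents; children];               % 3d/e. offspring replace the least fit half
    end
    fit = zeros(popSize, 1);
    for i = 1:popSize, fit(i) = fitnessFcn(pop(i,:)); end
    [bestFit, idx] = max(fit);                   % return the best individual found
    best = pop(idx, :);
end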

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Genetic Algorithms
PR
Cross-over & Mutation
 In a genetic algorithm, the solutions (the individuals) are often in the form of a
binary array of bits.
 To do cross-over, you simply select a random cross-over position and, with a
previously set cross-over probability (e.g., 0.8), swap the two parents' segments after
that position:

Parent 1:  1001001001001111001001
Parent 2:  0010010110101001011010    (cross-over after position 9)

Child 1:   1001001000101001011010
Child 2:   0010010111001111001001

 Then, again select a random mutation position and, with a preset small mutation
probability (e.g., 0.01), flip the selected bit:

Before:    1001001000101001011010    (mutation at position 9)
After:     1001001010101001011010
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Feature Selection
PR
Genetic Algorithms
 So, how can this process be used for feature selection?
 Suppose we have 22 features, of which we would like to choose the best subset.
 We can choose the size of the “best subset” or leave it to the GA to figure it out.
 Represent each feature with a binary bit: 1 if that feature is to be included, 0
otherwise. Create a random initial population of, say, 1000 such individuals. Here is an example:
1001001001001111001001
 In this individual (remember, individuals are potential solutions), features 1, 4, 7, 10,
etc. are used (as indicated by a "1"), whereas the others are not (as indicated by a
"0"). A small decoding sketch is given below.
 We then choose a fitness function, say the classification performance of a classifier
trained with this feature-vector "solution."
 Run the GA for many, many generations. Set a stop criterion, e.g., 10,000 iterations,
or until the selected features no longer change.
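A tiny MATLAB sketch of how such an individual maps to a feature subset (X is assumed to be the
n-by-22 data matrix):

individual = [1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1];   % the example bit string above
selected = find(individual);        % -> 1  4  7  10  13  14  15  16  19  22
Xsub = X(:, selected);              % evaluate the fitness using only these columns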

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR In Matlab

ga Find minimum of function using genetic algorithm (Global Optimization Toolbox)

[x,fval,exitflag,output,population] = ga(fitnessfcn,nvars,A,b,Aeq,beq,LB,UB,nonlcon,options) finds a local
minimum, x, of the objective function fitnessfcn. nvars is the dimension (number of design
variables) of fitnessfcn. The objective function, fitnessfcn, accepts a vector x of size 1-by-nvars and returns a
scalar evaluated at x. When provided, the additional arguments set up linear inequality constraints of the
form A·x ≤ b, linear equality constraints of the form Aeq·x = beq, lower and upper bound constraints of the
form LB ≤ x ≤ UB, and nonlinear constraints set by nonlcon, using the optimization parameters given in
options, which can be created using the gaoptimset function.

The output parameters include the solution (local minimum) x, the value of its fitness function fval, an exitflag
identifying the reason the algorithm terminated, a structure output that contains output from each
generation and other information about the performance of the algorithm, and the matrix population,
whose rows are the final population.

This is a complex function to use. Read the documentation carefully before attempting to use.
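A hypothetical usage sketch for GA-based feature selection with this function. It assumes a helper
evalError(X, y) that returns the (cross-validated) classification error of a classifier trained on the
given columns, and that X (n-by-22) and y are already in the workspace; ga minimizes, so we
minimize the error rather than maximize the accuracy.

nFeatures = 22;
fitnessfcn = @(bits) evalError(X(:, logical(bits)), y);    % bits is a 1-by-nFeatures 0/1 vector;
                                                           % evalError must handle an empty subset
options = gaoptimset('PopulationType', 'bitstring', ...    % individuals are bit strings
                     'PopulationSize', 1000, ...
                     'Generations',    10000);
[bestBits, bestErr] = ga(fitnessfcn, nFeatures, [], [], [], [], [], [], [], options);
selectedFeatures = find(bestBits);                         % indices of the features set to 1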

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Information Theoretic
Feature Selection
 In filter based feature selection, we typically compute some figure of merit or
some objective function for each feature, based on which we determine whether
to include that feature.
 A large class of such figures of merit is based on information theoretic metrics.
 Entropy, mutual information, maximum relevance minimum redundancy, etc. are
examples of such metrics.
 The thinking behind this approach is that features that carry more
information must be more relevant to the classification problem, and hence should be
used in developing the learning model.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR A review of IT
Features We have seen
 If a message (or a random variable) 𝑋 is sent, with a probability distribution 𝑝(𝑥), then the average
amount of information sent is the entropy of the random variable 𝑋. It is also the amount of
uncertainty in the data, as represented by the random variable 𝑋:
$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$

 If we have two random variables, 𝑋 and 𝑌, the joint entropy is the average information
contained in these two variables:
$$H(X,Y) = \mathbb{E}_{X,Y}\left[-\log p(x,y)\right] = -\sum_{x,y} p(x,y)\,\log p(x,y)$$
where the notation 𝔼 indicates the expectation over both 𝑋 and 𝑌 (which is approximated in
practice as the mean over all observed values of 𝑋 and 𝑌).
 The conditional entropy is the average uncertainty remaining in a variable 𝑋 after the random
variable 𝑌 is observed. The average is computed over the r.v. 𝑌:
$$H(X|Y) = \mathbb{E}_{Y}\left[H(X|y)\right] = -\sum_{y\in Y} p(y)\sum_{x\in X} p(x|y)\,\log p(x|y) = -\sum_{x,y} p(x,y)\,\log p(x|y)$$
 Note that the conditional entropy is the joint entropy (joint uncertainty) in 𝑋 and 𝑌 after the
entropy (uncertainty) associated with 𝑌 is removed: $H(X|Y) = H(X,Y) - H(Y)$

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Mutual Information
 Mutual information of 𝑋 and 𝑌, 𝐼(𝑋, 𝑌) is the amount of information that 𝑋 and 𝑌 share, or
more specifically, how much knowing one can reduce the uncertainty about the other.
$$I(X,Y) = \sum_{x\in X}\sum_{y\in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
$$I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(X|Y) - H(Y|X)$$

 Mutual information measures the difference between two entropies: the uncertainty before
knowing 𝑌, i.e., 𝐻 𝑋 , and the uncertainty after knowing 𝑌, i.e., 𝐻 𝑋|𝑌 . This can also be
interpreted as the amount of uncertainty in 𝑋 removed by knowing 𝑌.
 Mutual information is symmetric, i.e., 𝐼(𝑋,𝑌) = 𝐼(𝑌,𝑋), and it is zero if and only if the variables are
independent, i.e., 𝑝(𝑥,𝑦) = 𝑝(𝑥)𝑝(𝑦).
 Mutual information can also be conditioned on another variable, 𝑍:
$$I(X,Y|Z) = H(X|Z) - H(X|Y,Z) = \sum_{z\in Z} p(z)\sum_{x\in X}\sum_{y\in Y} p(x,y|z)\,\log\frac{p(x,y|z)}{p(x|z)\,p(y|z)}$$
which can be interpreted as the information still shared by 𝑋 and 𝑌 after the value of 𝑍 is revealed.
G. Brown, A. Pocock, M. Zhao, M. Lujan, Conditional likelihood maximization: A unifying framework for
information theoretic feature selection, JMLR, vol. 13, pp. 27-66, 2012.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Kullback-Leibler
Divergence
 Mutual information is closely related to the KL divergence, which is a measure of pseudo-
distance between two distributions:
$$KL(p\,\|\,q) = -\sum_{x} p(x)\,\ln\frac{q(x)}{p(x)} = \sum_{x} p(x)\,\ln\frac{p(x)}{q(x)} \qquad \text{(discrete case)}$$
$$KL(p\,\|\,q) = -\int p(\mathbf{x})\,\ln\frac{q(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x} \qquad \text{(continuous case)}$$
 The KL divergence is not a true distance metric, as it is not symmetric, $KL(p\,\|\,q) \neq KL(q\,\|\,p)$;
however, it is a useful substitute.
 In pattern recognition, we are often interested in computing the probability distributions of
class labels given the observations we make (e.g., what is the probability that the true label is
A in this handwritten character recognition problem, given the observation of A I just made)
 Hence, estimation of such distribution functions becomes important. The KL divergence
gives us a measure of "how much we screwed up" by estimating the true density 𝑝(𝑥) as
𝑞(𝑥).

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Mutual Information &
KL Divergence
 We can now see that the mutual information of 𝑋 and 𝑌 is in fact the KL divergence between their
joint distribution 𝑝(𝑥, 𝑦) and the product of the marginal distributions 𝑝(𝑥) and 𝑝(𝑦), which of
course is what the joint distribution would be if X and Y were independent.
 Hence mutual information measures the "distance" between the joint distribution of X and
Y, and what that distribution would have been if X and Y were independent:
$$I(X,Y) = \sum_{x\in X}\sum_{y\in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = KL\left(p(x,y)\,\|\,p(x)\,p(y)\right)$$
 Perhaps the best way to summarize these concepts is to look at a Venn diagram:
[Venn diagram: two overlapping circles representing H(X) and H(Y); their overlap is I(X,Y), the
non-overlapping parts are H(X|Y) and H(Y|X), and the union of the two circles is H(X,Y).]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Information Theoretic
Feature Selection
 So, how do we use these concepts for selecting features?
 As in all filter approaches, we define our figure of merit (criterion function,
relevance index, etc.) to be an information theoretic metric, compute this metric
for all features, rank order the features with respect to this metric, and pick the top 𝑘
features with the highest value of the metric.
 Mutual information between the true class label and the value of a feature is one
such information theoretic metric that can be used as a figure of merit. Here 𝑋𝑘 is
the random variable representing the 𝑘-th feature and 𝑌 is the class label. Then, we
compute $J_{MI}(X_k) = I(X_k, Y)$ for each feature (a ranking sketch is given below).
 Note that this approach assumes that all features are independent. When this is not
the case, MI may not lead to the best set of features. If there are two features, say one
of them simply twice the other, they are linearly related (strongly correlated), and
hence have the same MI. If that MI is high, both of those features will be selected,
though it is clear that the second one is redundant.
 Features should not only be relevant, but also not redundant, or more
specifically, not highly correlated.
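A minimal MATLAB sketch of this ranking for discrete-valued features, assuming X is an n-by-d data
matrix and y an n-by-1 vector of class labels (continuous features would first have to be discretized,
e.g., by histogram binning); an illustrative sketch, not a library routine.

function [ranked, MI] = rankByMI(X, y)
    [n, d] = size(X);
    MI = zeros(d, 1);
    for k = 1:d
        [xv, ~, xi] = unique(X(:, k));                              % distinct feature values
        [yv, ~, yi] = unique(y);                                    % distinct class labels
        pxy = accumarray([xi yi], 1, [numel(xv) numel(yv)]) / n;    % joint distribution p(x, y)
        px = sum(pxy, 2);   py = sum(pxy, 1);                       % marginals p(x), p(y)
        R = pxy ./ (px * py);                                       % p(x,y) / (p(x)p(y))
        nz = pxy > 0;                                               % avoid log(0) terms
        MI(k) = sum(pxy(nz) .* log2(R(nz)));                        % I(Xk, Y)
    end
    [~, ranked] = sort(MI, 'descend');    % feature indices, most informative first
end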

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Information Theoretic
Feature Selection
 The Mutual Information Feature Selection (MIFS) criterion subtracts the mutual information
between the candidate feature and each already selected feature (the set 𝑆) to penalize highly
correlated features, i.e., the redundancy:
$$J_{MIFS}(X_k) = I(X_k, Y) - \beta\sum_{X_j\in S} I(X_k, X_j)$$
 Here, 𝛽 is a parameter that controls the amount of penalty given to correlated features.
 Joint Mutual Information (JMI) attempts to increase the complementary information
among the selected features, where $(X_k, X_j)$ denotes a pair of features considered jointly:
$$J_{JMI}(X_k) = \sum_{X_j\in S} I\left((X_k, X_j),\, Y\right)$$
 A special case of MIFS, with 𝛽 = 1/|𝑆|, is minimum redundancy maximum relevance (mRMR),
sketched in code below:
$$J_{mRMR}(X_k) = I(X_k, Y) - \frac{1}{|S|}\sum_{X_j\in S} I(X_k, X_j)$$
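A greedy mRMR sketch built on the same idea, assuming a helper mi(a, b) that returns the mutual
information between two discrete vectors (it could be implemented as in the ranking sketch above);
d is the number of features and nSel the desired subset size.

relevance = zeros(d, 1);
for k = 1:d, relevance(k) = mi(X(:,k), y); end       % relevance term I(Xk, Y)
[~, first] = max(relevance);
S = first;                                           % start with the most relevant feature
while numel(S) < nSel
    bestJ = -Inf;  bestK = 0;
    for k = setdiff(1:d, S)
        redundancy = 0;
        for j = S, redundancy = redundancy + mi(X(:,k), X(:,j)); end
        J = relevance(k) - redundancy / numel(S);    % J_mRMR(Xk)
        if J > bestJ, bestJ = J; bestK = k; end
    end
    S = [S bestK];                                   % greedily add the best-scoring feature
end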
For excellent overview: Read the following two papers
• I. Guyon, A. Elisseeff, An introduction to variable and feature selection, JMLR, vol. 3, pp. 1157-1182, 2003.
• G. Brown, A. Pocock, M. Zhao, M. Lujan, Conditional likelihood maximization: A unifying framework for
information theoretic feature selection, JMLR, vol. 13, pp. 27-66, 2012.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principal Component
Analysis (PCA)
 By far the most commonly used feature extraction technique due to its elegant yet
simple theory.
 In data sets with many variables, groups of variables often move together, as more than one
variable may be measuring the same driving principle governing the behavior of the system.
 In many systems there are only a few such driving forces, yet an abundance of
instrumentation allows us to measure dozens of system variables.
 This redundancy can be removed by replacing each group of variables with a single new
variable; these new variables, called principal components, are linear combinations of the original
variables.
 PCA assumes that the information is carried in the variance of the features: the
higher the variance in one dimension (feature), the higher the information carried
by that feature
 The transformation is based on preserving the most variance in the data using the
least number of dimensions.
 The data is projected onto a lower dimensional space where the new features best
represent the old features in the least squares sense.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA (Cont.)
 Assume we wish to represent N d-dimensional vectors 𝐱1 … 𝐱𝑁 with only one such vector 𝐱0,
such that the sum of squared distances between 𝐱0 and each of the 𝐱𝑘, as measured by the
criterion function 𝐽0, is minimum:
$$J_0(\mathbf{x}_0) = \sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{x}_0\right\|^2$$
This can be shown to be minimized by the sample mean,
$$\mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$$
The sample mean is the zero-dimensional representation of the
entire dataset. It is simple, but provides no information about the
variability in the data.
A better representation can be obtained with a 1-dimensional
projection, a line through the sample mean.
Let 𝐰 be a unit vector in the direction of this line. Then
this line can be represented as
$$\mathbf{x} = \mathbf{m} + y\,\mathbf{w}$$
where y is a scalar coefficient that indicates the (signed)
distance of a point 𝐱 from 𝐦 along 𝐰.
[Figure: a 2-D data scatter with the sample mean x0 = m marked, and the line through m in the
direction of the unit vector w.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA (Cont.)
 In general, we can represent any 𝐱𝑘 by 𝐦 + 𝑦𝑘𝐰, where the optimal coefficients 𝑦𝑘 can be
obtained by minimizing the “squared error criterion function”
$$J_1(y_1,\dots,y_N,\mathbf{w}) = \sum_{k=1}^{N}\left\|\left(\mathbf{m} + y_k\mathbf{w}\right) - \mathbf{x}_k\right\|^2$$
which yields 𝑦𝑘 = 𝐰𝑇(𝐱𝑘 − 𝐦), that is, we obtain the least square error coefficients by
projecting the data vector 𝐱𝑘 onto a line 𝐰 that passes through the sample mean m.
 But, what is the best direction for w?

[Figure: the same data scatter with the projection line through m; its direction w is yet to be determined.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA (Cont.)
 We take a closer look at the criterion function 𝐽1 to determine the best direction 𝐰, substituting
$y_k = \mathbf{w}^T(\mathbf{x}_k - \mathbf{m})$:
$$\begin{aligned}
J_1(y_1,\dots,y_N,\mathbf{w}) &= \sum_{k=1}^{N}\left\|\left(\mathbf{m} + y_k\mathbf{w}\right) - \mathbf{x}_k\right\|^2 \\
&= \sum_{k=1}^{N} y_k^2\left\|\mathbf{w}\right\|^2 - 2\sum_{k=1}^{N} y_k\,\mathbf{w}^T(\mathbf{x}_k - \mathbf{m}) + \sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{m}\right\|^2 \\
&= -\sum_{k=1}^{N}\left[\mathbf{w}^T(\mathbf{x}_k - \mathbf{m})\right]^2 + \sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{m}\right\|^2 \qquad \left(\text{using } y_k = \mathbf{w}^T(\mathbf{x}_k - \mathbf{m}),\ \left\|\mathbf{w}\right\| = 1\right) \\
&= -\sum_{k=1}^{N}\mathbf{w}^T(\mathbf{x}_k - \mathbf{m})(\mathbf{x}_k - \mathbf{m})^T\mathbf{w} + \sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{m}\right\|^2 \\
&= -\mathbf{w}^T\mathbf{S}\,\mathbf{w} + \sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{m}\right\|^2, \qquad \mathbf{S} = \sum_{k=1}^{N}(\mathbf{x}_k - \mathbf{m})(\mathbf{x}_k - \mathbf{m})^T \ \ \text{(the total scatter matrix)}
\end{aligned}$$
Note that to minimize J1 we need to maximize 𝐰𝐓𝐒𝐰, subject to the constraint that ||𝐰|| = 1
Why do we need the constraint…?
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Lagrange Multipliers
 Constrained minimization / maximization: Lagrange multipliers
 If we wish to find the extremum of a function 𝑓(𝐱) subject to some constraint
𝑔(𝐱) = 0, the extremum point 𝐱 can be found as follows:
1. Form the Lagrange function to convert the problem to an unconstrained one, where λ,
whose value needs to be determined, is the Lagrange multiplier:
$$L(\mathbf{x},\lambda) = f(\mathbf{x}) + \lambda\,g(\mathbf{x})$$
2. Solve the resulting unconstrained problem by setting the derivative to zero:
$$\frac{\partial L(\mathbf{x},\lambda)}{\partial\mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial\mathbf{x}} + \lambda\,\frac{\partial g(\mathbf{x})}{\partial\mathbf{x}} = 0$$
 For the PCA problem, the constraint is $\|\mathbf{w}\| = 1 \;\Leftrightarrow\; g(\mathbf{w}) = 1 - \mathbf{w}^T\mathbf{w} = 0$, so
$$L(\mathbf{w},\lambda) = \mathbf{w}^T\mathbf{S}\,\mathbf{w} + \lambda\left(1 - \mathbf{w}^T\mathbf{w}\right) \qquad \text{(Ring any bells?)}$$
$$\frac{\partial L}{\partial\mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0 \;\;\Rightarrow\;\; \mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA - Summary
 In projecting to multiple dimensions, we replace 𝐱 = 𝐦 + 𝑦𝐰 with
$\mathbf{x} = \mathbf{m} + \sum_{i=1}^{d'} y_i\,\mathbf{w}_i$; that is, we represent a vector 𝐱 as a weighted sum of
a series of basis vectors that are orthogonal to each other.
 If we choose to use all eigenvectors, that is, we project the data onto all eigenvectors
and then add the projections back up, we get the original data back (hence no dimensionality
reduction).
 From a geometrical standpoint, the eigenvectors represent the principal axes, along
which the data (and hence the covariance matrix) show the largest variance. The weight
coefficients 𝑦𝑖 are called the principal components.

The optimal approximation, in the minimum sum of squared error sense, of an N-
dimensional random vector 𝐱 ∈ ℝ𝑁 by a linear combination of 𝑀 < 𝑁 independent
vectors is obtained by projecting the vector 𝐱 onto the eigenvectors 𝐰𝑖 corresponding to
the largest eigenvalues 𝜆𝑖 of the covariance matrix (or the scatter matrix) of the data
from which 𝐱 is drawn.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Eigenvalues &
Recall from Lecture 0
Eigenvactors
 For a given d-by-d matrix M, the scalars 𝜆 and the vectors v that satisfy the
equation 𝐌𝐯 = 𝜆𝐯 are called eigenvalues and eigenvectors of M, respectively.
 Note that when an eigenvector is multiplied by M, it changes only in
magnitude, not in direction.
 For a 𝑑-by-𝑑 matrix, there are exactly 𝑑 eigenvalues and corresponding
eigenvectors.
 The eigenvectors of the covariance matrix of a dataset point in the directions of the principal
axes of the data, whereas the eigenvalues represent the lengths of these axes.
 Eigenvalues and eigenvectors play an extremely
important role in pattern recognition!
 They can be computed using the characteristic
equation, but are usually computed
numerically.
 The determinant of a matrix is the product of
its eigenvalues:
$$|\mathbf{M}| = \prod_{i=1}^{d}\lambda_i$$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Properties of
Recall from Lecture 0

Eigenvalues/Eigenvectors
 A matrix M for which 𝐰𝐓𝐌𝐰 ≥ 0 for any vector w is called a positive-semi-definite
matrix. If the inequality is strict, then it is a positive-definite matrix.
 The eigenvalues of a positive (semi) definite matrix are always positive (non-
negative).
 Recall: If M is symmetric, we can write $V^T M V = V^{-1} M V = D$, where $V^T V = I$ is satisfied →
V is orthogonal (hence $V^T = V^{-1}$) and D is diagonal. This is called diagonalization of M.
 Now, let R be the correlation matrix obtained by averaging (random) data vectors as
$R = \mathbb{E}[\mathbf{u}\mathbf{u}^T]$, where $\mathbb{E}[\cdot]$ represents the expectation operation. R is guaranteed to be symmetric,
positive semidefinite and (in practice) invertible. Therefore, we can write $V^T R V = D$. This proves that any
random vector can be turned into another whose elements are uncorrelated:
• Let $\mathbf{u}_1 = V^T\mathbf{u}$ (where V is an orthogonal matrix). Then, the new correlation matrix is
$R_1 = \mathbb{E}[\mathbf{u}_1\mathbf{u}_1^T] = \mathbb{E}[V^T\mathbf{u}\,(V^T\mathbf{u})^T] = V^T\,\mathbb{E}[\mathbf{u}\mathbf{u}^T]\,V = V^T R V = D$
• Since the new correlation matrix $R_1 = D$ is diagonal, with zero off-diagonal elements, the
elements of $\mathbf{u}_1$ are uncorrelated!
• Furthermore, if $D^{1/2}$ is the diagonal matrix whose elements are the square roots of the eigenvalues
of R (i.e., $D^{1/2}D^{1/2} = D$), we can show that $\mathbf{u}' = D^{-1/2}V^T\mathbf{u}$ has uncorrelated elements with unit
variance (where $D^{-1/2}$ is the inverse of $D^{1/2}$). This is called the whitening transformation.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Example
Recall from Lecture 0
Properties of Eigenvectors
% August 20, 2009, © Robi Polikar
close all; clear all
% Generate normally distributed data with mean "mu" and covariance "Sigma"
mu = [1 -1]; Sigma = [.9 .4; .4 .3];
raw_data = mvnrnd(mu, Sigma, 500);
plot(raw_data(:,1), raw_data(:,2), '.'); grid on; hold on;
% Compute the covariance matrix of the actually sampled random data
C1 = cov(raw_data);            % This should be close to "Sigma"
[V1, D1] = eig(C1);            % Compute the eigenvalues and eigenvectors
rot_data = V1'*raw_data';      % Rotate the raw data by the eigenvectors
                               % (raw_data is transposed so that the data are column vectors)
rot_data = rot_data';
plot(rot_data(:,1), rot_data(:,2), '.r');
origin = [0; 0];
vectarrow(origin, V1(:,1));    % Draw a vector from the origin to the first eigenvector
vectarrow(origin, V1(:,2));    % Draw a vector from the origin to the second eigenvector

C2 = cov(rot_data);
[V2, D2] = eig(C2);
vectarrow(origin, V2(:,1));
vectarrow(origin, V2(:,2));

[Figure: scatter plots of raw_data and rot_data with the eigenvectors for raw_data / rot_data drawn
from the origin (one at an angle of about 154.6°, i.e., 2.69 rad).
Sample results: V1 ≈ [0.9033 0.4290; 0.4290 0.9033] (up to sign), C2 ≈ [0.1004 0; 0 1.0890].
Zero off-diagonal elements → uncorrelated data. *Your numbers may be different!]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Example
Recall from Lecture 0
Whitening
% Whitening Transformation
% August 20, 2009, © Robi Polikar
close all
clear all
% Generate normally distributed data with mean "mu" and covariance "Sigma"
mu = [1 -1]; Sigma = [.9 .4; .4 .3];
raw_data = mvnrnd(mu, Sigma, 500);
plot(raw_data(:,1), raw_data(:,2), '.'); grid on
% Compute the covariance matrix of the actually sampled random data
C1 = cov(raw_data);         % This should be close to "Sigma"
[V1, D1] = eig(C1);         % Compute the eigenvalues and eigenvectors
W_temp = sqrt(D1);          % The whitening matrix is the inverse of the
W = inv(W_temp);            % square root of the eigenvalue matrix
white_data = W*V1'*raw_data';  % Rotate the raw data by the product of the whitening and
                               % eigenvector matrices. Note the transposition of raw_data
                               % so that the data are column vectors
white_data = white_data';
hold on;
plot(white_data(:,1), white_data(:,2), '.r');

C2 = cov(white_data)        % This should be darn close to the unit matrix
[V2, D2] = eig(C2);

[Figure: scatter plots of raw_data and white_data. Sample result: C2 ≈ [1.000 0.000; 0.000 1.000].]

The unit covariance matrix of the whitened data indicates that (i) the two axes are uncorrelated
(the off-diagonal elements are zero), and (ii) the variance along each axis is the same and equal to
one (the diagonal elements are 1). Hence this dataset is now "whitened" or standardized, giving
the data a circular distribution.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Back to PCA
 The vectors 𝐰 that maximize $\mathbf{w}^T\mathbf{S}\mathbf{w}$ are the eigenvectors of S, with corresponding eigenvalues 𝜆:
$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}, \qquad \mathbf{S} = \sum_{k=1}^{N}(\mathbf{x}_k - \mathbf{m})(\mathbf{x}_k - \mathbf{m})^T$$
 Note that S is nothing but a scaled covariance matrix, $\Sigma = \tfrac{1}{N}\mathbf{S}$. Therefore,
maximizing $\mathbf{w}^T\mathbf{S}\mathbf{w}$ means finding the directions 𝐰 that maximize the variance
of the data along those directions.
 If we want the "best" line that represents the data, then we need to project the
data onto a single line, for which we need to pick only one of the eigenvectors of S.
To ensure that $\mathbf{w}^T\mathbf{S}\mathbf{w}$ is maximized, we pick the eigenvector corresponding to the
largest eigenvalue 𝜆𝑚𝑎𝑥.
 This can be readily extended to larger dimensions:
 If we want to project 𝒅-dimensional data onto a 𝒅′-dimensional subspace (𝒅′ < 𝒅),
we project the data onto the 𝒅′ eigenvectors of the scatter matrix S (which is really a
constant multiple of the covariance matrix) corresponding to the largest 𝒅′ eigenvalues.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR USING PCA
 Remember that the coefficient 𝑦𝑘 for 𝐱𝑘 that gives
the minimum least squares error is $y_k = \mathbf{w}^T(\mathbf{x}_k - \mathbf{m})$,
where 𝐰 is the eigenvector of the covariance matrix
corresponding to the largest eigenvalue.
 Extending this notation to multiple coefficients 𝐲 that
collectively give the minimum least squares error, we have $\mathbf{y} = \mathbf{W}^T(\mathbf{x} - \mathbf{m})$,
where the columns of the 𝐖 matrix are the
eigenvectors of the covariance matrix 𝚺 or the scatter matrix 𝐒.
 The principal components 𝐲 are then simply the projections of the mean-removed data points
onto the columns of the 𝐖 matrix. So what does this mean…? (A minimal code sketch follows the figure below.)
[Figure: a 2-D data cloud shown in the original (x1, x2) axes with mean m; after PCA, a point
(x1*, x2*) is expressed by its coordinates (y1, y2) along the new axes w1 and w2, which are aligned
with the principal axes of the data.]
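A minimal MATLAB sketch of this whole procedure, assuming X is an n-by-d data matrix with rows
as observations; an illustrative sketch, not the toolbox routine.

function [Y, W, m] = simplePCA(X, dprime)
    m = mean(X, 1);                          % sample mean
    Xc = bsxfun(@minus, X, m);               % mean-removed data
    [V, D] = eig(cov(Xc));                   % eigenvectors/eigenvalues of the covariance matrix
    [~, order] = sort(diag(D), 'descend');   % sort by decreasing eigenvalue
    W = V(:, order(1:dprime));               % keep the d' eigenvectors with largest eigenvalues
    Y = Xc * W;                              % principal components: y = W'(x - m), row-wise
end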
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA


Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA Example
[Figure: a worked PCA example using the scatter/covariance matrix S = Σ_x, from R. Gutierrez @ TAMU.]


Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA Example (Cont.)

Solution by MATLAB: princomp(.)
[Figure: the example solved using princomp.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA Example (Cont.)

MATLAB solution using the definition (assuming the rows of x are observations; note that the data
must be mean-removed before projecting, and that eig does not sort the eigenvalues):

[V, D] = eig(cov(x));               % eigenvectors/eigenvalues of the covariance matrix
xc = bsxfun(@minus, x, mean(x));    % mean-removed data
PC = xc * V;                        % principal components (projections onto the eigenvectors)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR PCA In Matlab
princomp Principal component analysis (PCA) on data Statistics Toolbox

COEFF = princomp(X) performs principal components analysis (PCA) on the n-by-p data matrix X, and returns the principal component
coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column
containing coefficients for one principal component. These are the coefficients of the linear combinations of the original variables that
generate the principal components. The columns are in order of decreasing component variance.

princomp centers the data in X by subtracting off column means, but does not rescale the columns of X. To perform principal components
analysis with standardized variables, that is, based on correlations, use princomp(zscore(X)). To perform principal components analysis directly
on a covariance or correlation matrix, use pcacov.

[COEFF,SCORE] = princomp(X) returns SCORE, the principal component scores; that is, the representation of X in the principal component space.
Rows of SCORE correspond to observations, columns to components.

[COEFF,SCORE,latent] = princomp(X) returns latent, a vector containing the eigenvalues of the covariance matrix of X.

[COEFF,SCORE,latent,tsquare] = princomp(X) returns tsquare, which contains Hotelling's T2 statistic for each data point.

The scores are the data formed by transforming the original data into the space of the principal components. The values of the vector latent are
the variance of the columns of SCORE. Hotelling's T2 is a measure of the multivariate distance of each observation from the center of the data
set.

When n <= p, SCORE(:,n:p) and latent(n:p) are necessarily zero, and the columns of COEFF(:,n:p) define directions that are orthogonal to X.

[...] = princomp(X,'econ') returns only the elements of latent that are not necessarily zero, and the corresponding columns of COEFF and SCORE,
that is, when n <= p, only the first n-1. This can be significantly faster when p is much larger than n.
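A short usage sketch (variable names are illustrative): reducing an n-by-p data matrix X to its first
d' principal components.

dprime = 3;
[coeff, score, latent] = princomp(X);         % rows of X are observations
Xreduced = score(:, 1:dprime);                % data expressed in the top d' principal components
explained = cumsum(latent) / sum(latent);     % cumulative fraction of variance explained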

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR A Real World Example
OCR Data
 Handwritten character
recognition problem
 3823 instances, 62
attributes, 10 classes
 Originally 64 attributes
(8-by-8 grid), two
constant valued attributes
removed.

RP
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR A Real World Example
PCA_for_OCR.m, knn_on_OCR_with_PCA.m OCR Data
[Figure: "Principal Components of the OCR data" - a 3-D scatter plot of the OCR data projected onto
its first three principal components (axes: Principal Component 1, 2 and 3), with the ten digit classes
0-9 forming visible clusters.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Final Thoughts
 PCA transforms the original features into a new space,
 where the new space is simply a rotated version of the original one; the angle is
chosen such that the maximum variability in the data can be seen.
 This is equivalent to walking around the data to see from which angle you get the
best view. The rotation of the axes is done in such a way that the new axes are
aligned with the directions of maximum variance, which are the directions of the
eigenvectors of the covariance matrix.
 PCA decorrelates the data. That is, the PCs will be uncorrelated!
 The dimensionality reduction is obtained by using only a subset of the new axes
(dimensions) that account for most of the variance.
 PCA, also known as the Karhunen-Loève transformation in communication and
image processing, is the oldest technique in multivariate analysis: originally
developed by Pearson in 1901 and generalized by Loève in 1963.
 PCA does not take the class information into consideration; therefore,
there is no guarantee that the classes in the transformed data will be better
separated than in the original one. For that → the Fisher Linear Discriminant.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Fisher Linear Discriminant
(FLD)
 PCA finds the minimum number of components that best represent the data.

 However, this best representation is in the least squares sense. It does not
guarantee any usefulness for discrimination / classification.

 We need to reduce the dimensionality under some constraint of maximizing the
class discrimination.

 Maximizing the discrimination can be achieved by increasing the intercluster
distances and reducing the intracluster distances. These distances are obtained
using the between-class and within-class scatter matrices, respectively.

 FLD (also called linear discriminant analysis, LDA) is based on a
transformation of the type (similar to that of PCA)
$$y = \mathbf{w}^T\mathbf{x}$$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD
 Note that y=wTx defines a line, hence FLD basically projects the data onto a line,
along the direction of w, just like the PCA. However, unlike the PCA which looked
for a projection where the data was best represented, FLD looks for a line that
maximizes the separability of classes.

[Figure (courtesy of R. Gutierrez): the same two-class data projected onto two different directions,
w1 and w2; one is a bad projection that mixes the classes, the other a good projection that keeps
them separated.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Bad Projection…?
 How do we know ahead of time which projection will be good? We need to define a
separability criterion that increases with increasing separability.
 For example, we can look at the distance between the projected means:
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad
\tilde{m}_i = \frac{1}{n_i}\sum_{y\in Y_i} y = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i, \qquad
J(\mathbf{w}) = \left|\tilde{m}_1 - \tilde{m}_2\right| = \left|\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\right|$$

 Note however, the distance between the means does not take the variances into consideration.
What we really want is that the distance between the means be large with respect to variances.

[Figure (courtesy of R. Gutierrez): an example showing that the distance between the projected
means alone is not a good measure of separability; the spread (variance) of each class along the
projection also matters.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD
 To obtain a criterion function that measures the separation of the means relative to the variances,
we normalize the squared distance between the projected means by the total within-class scatter:
$$J(\mathbf{w}) = \frac{\left|\tilde{m}_1 - \tilde{m}_2\right|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, \qquad
\tilde{s}_i^2 = \sum_{y\in Y_i}\left(y - \tilde{m}_i\right)^2$$
where $\tilde{s}_i^2$ is the scatter for class i (a measure of the variance within class i), and $\tilde{s}_1^2 + \tilde{s}_2^2$ is the
total within-class scatter.
 To obtain J as an explicit function of 𝐰, we define the following scatter matrices (for 2 classes):
$$\mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T \qquad \text{(within-class scatter matrix for class } i\text{)}$$
$$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2 = \sum_{i=1}^{2}\sum_{\mathbf{x}\in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T \qquad \text{(total within-class scatter matrix)}$$
$$\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T \qquad \text{(between-class scatter matrix)}$$
$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}} \quad
\text{(maximize the between-class scatter, minimize the within-class scatter)}
\qquad\Rightarrow\qquad \mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
A two-class MATLAB sketch is given below.
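A minimal two-class MATLAB sketch following these formulas, assuming X1 (n1-by-d) and X2
(n2-by-d) hold the samples of the two classes.

m1 = mean(X1, 1)';   m2 = mean(X2, 1)';      % class means (d-by-1)
Z1 = bsxfun(@minus, X1, m1');                % mean-removed class-1 data
Z2 = bsxfun(@minus, X2, m2');                % mean-removed class-2 data
Sw = Z1'*Z1 + Z2'*Z2;                        % total within-class scatter matrix
w  = Sw \ (m1 - m2);                         % FLD direction, w = inv(Sw)*(m1 - m2)
y1 = X1 * w;   y2 = X2 * w;                  % 1-D projections of the two classes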

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Example

[Figure: a worked two-class FLD example, courtesy of R. Gutierrez - http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD in Higher Dimensions
 For a C-class problem, FLD will reduce the dimensionality from d to C−1 through C−1
projection vectors, which are collectively placed in column format in the W
matrix. Obviously, C < d is required. As before, the problem is to identify the matrix
W for the following transformation:
$$y_i = \mathbf{w}_i^T\mathbf{x} \;\;\rightarrow\;\; \mathbf{y} = \mathbf{W}^T\mathbf{x}, \qquad \mathbf{W} = \left[\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_{C-1}\right]$$
 Note that the dimensionalities will be as follows:
 x: [d × m], where m is the number of instances and d is the dimensionality of each
instance (number of features)
 W: [d × (c−1)], where c is the number of classes (so that W^T is [(c−1) × d])
 y: [(c−1) × m]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD in Higher Dimensions
 We first define the total within-class and between-class scatter matrices in the
original feature space, where the 𝐱 lie:
$$\mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T, \qquad \mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i$$
where $D_i$ is the set of instances drawn from class i, and $\mathbf{S}_W$ is the total within-class scatter matrix.
$$\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T, \qquad \mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$$
where $n_i$ is the number of instances drawn from class i, 𝐦 is the mean of the individual class means $\mathbf{m}_i$,
$\mathbf{S}_B$ is the total between-class scatter matrix, and $\mathbf{S}_T$ is the total scatter matrix.
 In the transformed (𝐲) space, these matrices can be shown to be
$$\tilde{\mathbf{S}}_W = \sum_{i=1}^{c}\sum_{\mathbf{y}\in Y_i}(\mathbf{y} - \tilde{\mathbf{m}}_i)(\mathbf{y} - \tilde{\mathbf{m}}_i)^T = \mathbf{W}^T\mathbf{S}_W\mathbf{W}$$
$$\tilde{\mathbf{S}}_B = \sum_{i=1}^{c} n_i\,(\tilde{\mathbf{m}}_i - \tilde{\mathbf{m}})(\tilde{\mathbf{m}}_i - \tilde{\mathbf{m}})^T = \mathbf{W}^T\mathbf{S}_B\mathbf{W}$$

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD In Higher Dimensions

 The FLD then tries to find the transformation matrix W that maximizes the
following criterion function. Note that maximizing this function increases the intercluster
distances (by maximizing the between-class scatter) and decreases the
intracluster distances (by minimizing the within-class scatter):
$$J(\mathbf{W}) = \frac{\left|\tilde{\mathbf{S}}_B\right|}{\left|\tilde{\mathbf{S}}_W\right|} = \frac{\left|\mathbf{W}^T\mathbf{S}_B\mathbf{W}\right|}{\left|\mathbf{W}^T\mathbf{S}_W\mathbf{W}\right|}$$
 It turns out that the columns of the optimum W matrix are the generalized eigenvectors
corresponding to the largest eigenvalues in
$$\mathbf{S}_B\,\mathbf{w}_i = \lambda_i\,\mathbf{S}_W\,\mathbf{w}_i$$
(Figures courtesy of R. Gutierrez - http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR FLD in Higher Dimensions
 This generalized eigenvalue problem can be solved by first computing the eigenvalues 𝜆𝑖 as the
roots of the characteristic polynomial
$$\left|\mathbf{S}_B - \lambda_i\,\mathbf{S}_W\right| = 0$$
and then solving for the 𝐰𝑖, the columns of the W matrix:
$$\left(\mathbf{S}_B - \lambda_i\,\mathbf{S}_W\right)\mathbf{w}_i = 0$$
Note that | · | denotes the determinant. Also note that this procedure will generate d
eigenvalues and d corresponding eigenvectors; however, only c−1 of these eigenvalues
should end up being non-zero (why? …Hmmm, good exam question!). In practice, the problem is
solved numerically, as sketched below.
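A short MATLAB sketch of the numerical solution, assuming SB and SW have been computed as on
the previous slides, X is the d-by-m data matrix, and c is the number of classes.

[V, D] = eig(SB, SW);                     % generalized eigenvectors: SB*v = lambda*SW*v
[~, order] = sort(diag(D), 'descend');    % sort by decreasing eigenvalue
W = V(:, order(1:c-1));                   % keep the c-1 eigenvectors with largest eigenvalues
Y = W' * X;                               % project the data: Y is (c-1)-by-m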

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Shortcomings of
the FLD
 The dimensionality after the projection will be at most C-1. For really high dimensional data
where d>>C, the information represented in C-1 dimensions may not be adequate

Figures courtesy of R. Gutierrez - http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf


 The approach assumes that the class distributions are unimodal (e.g., Gaussian). If the original
distributions are multimodal and/or the classes are highly overlapping, this method is of
little use.

 LDA will fail, but PCA will prevail, if the discriminatory information lies in the variance
rather than in the mean!

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Variations of LDA

[Figure: variations of LDA, courtesy of R. Gutierrez - http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Exercise
 Implement PCA and FLD
 Test on four datasets
• 2 artificial (of your own choosing) – on one it should work, on the other it should not!
• 2 real world (you may use OCR)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
