
Principal Component Analysis

Philosophy of PCA
Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables.
We typically have a data matrix of n observations on p correlated variables x1, x2, ..., xp.
PCA looks for a transformation of the xi into p new variables yi that are uncorrelated.

PCA

The data matrix

case   ht (x1)   wt (x2)   age (x3)   sbp (x4)   heart rate (x5)
1      175       1225      25         117        56
2      156       1050      31         122        63
3      202       1350      58         154        67
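As a minimal sketch, the same toy data matrix can be entered in R (the language the slides themselves point to later, for ICA); the short column names are just shorthand for the headers above:

# The 3 x 5 data matrix from the slide, entered in R
X <- matrix(c(175, 1225, 25, 117, 56,
              156, 1050, 31, 122, 63,
              202, 1350, 58, 154, 67),
            nrow = 3, byrow = TRUE)
colnames(X) <- c("ht", "wt", "age", "sbp", "heart.rate")
cov(X)   # the covariance matrix C used throughout these slides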

Reduce dimension
The simplest way is to keep one variable and discard all others: not reasonable!
Weight all variables equally: not reasonable (unless they have the same variance).
Weighted average based on some criterion.
Which criterion?

Let us write it first
Looking for a transformation of the data matrix X (n×p) such that
Y = δᵀX = δ1 X1 + δ2 X2 + ... + δp Xp
where δ = (δ1, δ2, ..., δp)ᵀ is a column vector of weights with
δ1² + δ2² + ... + δp² = 1

One good criterion
Maximize the variance of the projection of the observations on the Y variables.
Find δ so that
Var(δᵀX) = δᵀ Var(X) δ
is maximal.
The matrix C = Var(X) is the covariance matrix of the Xi variables.
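A quick numerical check of this identity, as a sketch on hypothetical toy data (the weights d are arbitrary, chosen to have unit norm):

# Var(d'X) equals d' C d for any weight vector d
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # 100 toy observations on 2 variables
d <- c(0.6, 0.8)                    # unit-norm weights: 0.6^2 + 0.8^2 = 1
var(X %*% d)                        # variance of the projection
t(d) %*% cov(X) %*% d               # the same value, written as d' C d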

Let us see it on a figure
[Figure: the observations projected on two candidate directions, labelled "Good" and "Better"]

Covariance matrix
      | v(x1)      c(x1,x2)  ...  c(x1,xp) |
C  =  | c(x1,x2)   v(x2)     ...  c(x2,xp) |
      | ...                                |
      | c(x1,xp)   c(x2,xp)  ...  v(xp)    |

And so... we find that
The direction of δ is given by the eigenvector a1 corresponding to the largest eigenvalue λ1 of matrix C.
The second vector, orthogonal (uncorrelated) to the first, is the one with the second highest variance, which comes to be the eigenvector corresponding to the second eigenvalue.
And so on.

So PCA gives
New variables Yi that are linear combinations of the original variables (xi):
Yi = ai1 x1 + ai2 x2 + ... + aip xp ;  i = 1..p
The new variables Yi are derived in decreasing order of importance; they are called principal components.

Calculating eigenvalues and eigenvectors
The eigenvalues λi are found by solving the equation
det(C − λI) = 0
Eigenvectors are the columns of the matrix A such that
C = A D Aᵀ
where
      | λ1   0    ...   0  |
D  =  | 0    λ2   ...   0  |
      | 0    0    ...   λp |
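In R this is a one-liner with eigen(); a minimal sketch on hypothetical toy data:

# eigen() returns eigenvalues in decreasing order and eigenvectors as columns
set.seed(1)
X <- matrix(rnorm(500), ncol = 5)
C <- cov(X)
e <- eigen(C)
e$values    # lambda_1 >= lambda_2 >= ... >= lambda_5
e$vectors   # the matrix A
all.equal(e$vectors %*% diag(e$values) %*% t(e$vectors), C)   # C = A D A^T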

An example
Let us take two variables with covariance c > 0

C  =  | 1  c |
      | c  1 |

C − λI  =  | 1−λ   c  |
           | c    1−λ |

det(C − λI) = (1−λ)² − c²
Solving this we find λ1 = 1 + c
and λ2 = 1 − c < λ1

and eigenvectors
Any eigenvector A = (a1, a2)ᵀ satisfies the condition CA = λA:

CA  =  | 1  c | | a1 |  =  | a1 + c a2 |  =  λ | a1 |
       | c  1 | | a2 |     | c a1 + a2 |       | a2 |

Solving, we find A1 = (1/√2)(1, 1)ᵀ and A2 = (1/√2)(1, −1)ᵀ
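A minimal check of this worked example in R, assuming the arbitrary value c = 0.5:

c0 <- 0.5                               # any covariance 0 < c < 1
C  <- matrix(c(1, c0, c0, 1), nrow = 2)
eigen(C)$values    # 1.5 and 0.5, i.e. 1 + c and 1 - c
eigen(C)$vectors   # columns proportional to (1, 1) and (1, -1), up to sign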

PCA is sensitive to scale
If you multiply one variable by a scalar you get different results (can you show it? see the sketch below).
This is because PCA uses the covariance matrix (and not the correlation matrix).
PCA should be applied on data that have approximately the same scale in each variable.
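One way to show it, as a sketch on hypothetical toy data: rescale a single column and compare the loadings.

set.seed(1)
X  <- matrix(rnorm(300), ncol = 3)
X2 <- X
X2[, 1] <- 100 * X2[, 1]        # same variable, different measurement unit
prcomp(X)$rotation              # loadings on the original scale
prcomp(X2)$rotation             # different result: PC1 is dominated by column 1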

Interpretation of PCA
The new variables (PCs) have a variance equal to their corresponding eigenvalue:
Var(Yi) = λi for all i = 1..p
Small λi means small variance: the data change little in the direction of component Yi.
The relative variance explained by each PC is given by λi / Σλj.
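A sketch of both facts with prcomp() on hypothetical toy data (prcomp reports the PC standard deviations, so their squares are the eigenvalues):

set.seed(1)
X  <- matrix(rnorm(300), ncol = 3)
pc <- prcomp(X)
pc$sdev^2                                    # Var(Y_i) = lambda_i
all.equal(pc$sdev^2, eigen(cov(X))$values)   # same as the eigenvalues of C
pc$sdev^2 / sum(pc$sdev^2)                   # relative variance explained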

How many components to keep?
Enough PCs to have a cumulative variance explained by the PCs that is >50-70%.
Kaiser criterion: keep PCs with eigenvalues >1.
Scree plot: represents the ability of the PCs to explain the variation in the data; do it graphically (see the sketch below).
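A sketch of the three rules on hypothetical toy data (the Kaiser rule is usually applied to standardized PCA, where the eigenvalues average 1):

set.seed(1)
lambda <- prcomp(matrix(rnorm(500), ncol = 5), scale. = TRUE)$sdev^2
plot(lambda, type = "b", xlab = "component", ylab = "eigenvalue",
     main = "Scree plot")            # look for the 'elbow'
cumsum(lambda) / sum(lambda)         # keep enough PCs to pass ~0.5-0.7
which(lambda > 1)                    # Kaiser criterion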

Interpretation of components
See the weights of the variables in each component.
If Y1 = 0.89 X1 + 0.15 X2 − 0.77 X3 + 0.51 X4
then X1 and X3 have the highest weights and so are the most important variables in the first PC.
See also the correlation between the variables Xi and the PCs: the circle of correlation.

Circle of correlation
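The quantities plotted in that circle are just the variable-to-PC correlations; a minimal sketch on hypothetical toy data:

set.seed(1)
X  <- matrix(rnorm(500), ncol = 5)
pc <- prcomp(X, scale. = TRUE)
cor(X, pc$x)   # rows: variables Xi; columns: PCs; each row gives a point inside the circle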

Normalized (standardized) PCA
If variables have very heterogeneous variances we standardize them.
The standardized variables are
Xi* = (Xi − mean) / standard deviation
The new variables all have the same variance (1), so each variable has the same weight.
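In R, a sketch of the two equivalent routes, via scale() or via prcomp's own option:

set.seed(1)
X  <- matrix(rnorm(300), ncol = 3)
Xs <- scale(X)                  # (Xi - mean) / standard deviation
apply(Xs, 2, var)               # every column now has variance 1
prcomp(X, scale. = TRUE)        # standardized PCA in one call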

Application of PCA in Genomics
PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low-variance features.
Analysis of expression data
Analysis of metabolomics data (Ward et al., 2003)

However
PCA is only powerful if the biological question is related to the highest variance in the dataset.
If not, other techniques are more useful: Independent Component Analysis (ICA).
Introduced by Jutten in 1987.

What is ICA?

It looks like this

The idea behind ICA

How does it work?

Rationale of ICA
Find the components Si that are as independent as possible, in the sense of maximizing some function F(s1, s2, ..., sk) that measures independence.
All ICs (except possibly one) should be non-Normal.
The variance of all ICs is 1.
There is no hierarchy between the ICs.

How to find ICs?
Many choices of objective function F
Mutual information:
MI = ∫ f(s1, s2, ..., sk) log [ f(s1, s2, ..., sk) / ( f1(s1) f2(s2) ... fk(sk) ) ] ds1...dsk
We use the kurtosis of the variables to approximate the distribution function.
The number of ICs is chosen by the user.
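Kurtosis here is the classical fourth-moment measure of non-Normality; a minimal sketch (the plain moment formula, not any particular package's estimator):

kurt <- function(s) mean((s - mean(s))^4) / var(s)^2 - 3   # excess kurtosis
set.seed(1)
kurt(rnorm(1e5))   # close to 0 for a Normal variable
kurt(runif(1e5))   # negative: sub-Gaussian, hence 'interesting' for ICA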

Difference with PCA
It is not a dimensionality-reduction technique.
There is no single (exact) solution for the components; different algorithms are used (in R: FastICA, PearsonICA, MLICA).
ICs are of course uncorrelated, but also as independent as possible.
Uninteresting for Normally distributed variables.
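A minimal sketch with the first of those packages (on CRAN as fastICA; it must be installed first), on hypothetical mixed signals:

library(fastICA)
set.seed(1)
S    <- cbind(runif(500), runif(500)^3)        # two non-Normal sources
Xmix <- S %*% matrix(c(1, 2, 2, 1), nrow = 2)  # observed linear mixtures
ica  <- fastICA(Xmix, n.comp = 2)              # n.comp: number of ICs, chosen by the user
head(ica$S)                                    # estimated independent components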

Example: Lee and Batzoglou (2003)
Microarray expression data on 7070 genes in 59 normal human tissue samples (19 types).
We are not interested in reducing dimension but rather in looking for genes that show a tissue-specific expression profile (what makes tissue types different).

PCA vs ICA
Hsiao et al. (2002) applied PCA and, by visual inspection, observed three gene clusters of 425 genes: liver-specific, brain-specific and muscle-specific.
ICA identified more tissue-specific genes than PCA.
