
Dimensionality Reduction

Singular Value Decomposition (SVD)


Singular value decomposition (SVD) can be looked at from three mutually compatible points of view.
On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated
ones that better expose the various relationships among the original data items.
At the same time, SVD is a method for identifying and ordering the dimensions along which data points
exhibit the most variation.
This ties in to the third way of viewing SVD, which is that once we have identified where the most variation
is, it is possible to find the best approximation of the original data points using fewer dimensions. Hence,
SVD can be seen as a method for data reduction.
Example: You have a matrix with people in rows and observations about their movie watching in columns. You
reduce these features to a single feature: movie taste (called a latent feature, that is, one that is not directly observable).
Let X be an $m \times n$ matrix of rank k (each column is a variable, each row an observation). This matrix can be
decomposed into:
$X = UDV^T$
The columns of U are orthonormal (mutually orthogonal unit vectors), and so are the columns of V.
D is a diagonal matrix (it contains the singular values).
For dimensionality reduction:
Order the columns of U and V by decreasing order of their corresponding singular values.
Replace D by a sub-matrix taken from the upper left corner of D, and keep only the matching columns of U and V.
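The truncation described above can be sketched in R with the built-in svd() function. This is only an illustration: the data are random numbers, and the names X, k, U_k, D_k, V_k and X_k are assumptions made for this example, not part of the notes.

```r
# Toy data: 6 observations (rows), 4 variables (columns); values are random.
set.seed(1)
X <- matrix(rnorm(24), nrow = 6, ncol = 4)

# svd() returns u, d, v with X = u %*% diag(d) %*% t(v);
# the singular values in d are already sorted in decreasing order.
s <- svd(X)

k <- 2                                  # number of dimensions to keep (arbitrary)
U_k <- s$u[, 1:k, drop = FALSE]         # first k columns of U
D_k <- diag(s$d[1:k], nrow = k)         # upper-left k x k block of D
V_k <- s$v[, 1:k, drop = FALSE]         # first k columns of V

X_k <- U_k %*% D_k %*% t(V_k)           # rank-k approximation of X
X_full <- s$u %*% diag(s$d) %*% t(s$v)  # using all singular values recovers X

# Squared approximation error and the squared discarded singular values
err_k <- sum((X - X_k)^2)
discarded <- sum(s$d[-(1:k)]^2)
print(c(err_k = err_k, discarded = discarded))
```

The two printed numbers agree, which shows that the information lost by truncation is exactly what the dropped singular values carried.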
Principal Component Analysis (PCA)
Principal component analysis is a method of unsupervised learning. It is used as a tool for data visualization
or data pre-processing before supervised techniques are applied.
It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually
uncorrelated. Then we can use these derived variables in supervised techniques.
The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the
features
$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$
that has the largest variance. Normalized means that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$. This constraint is necessary since
otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large
variance.
If we take the vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$, it defines a direction in feature space along which the data
vary the most.
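As a rough check of the two properties just stated (unit-length loadings and maximal variance), the following R sketch uses prcomp() on the built-in USArrests data. The variable names (phi1, z1, w) and the choice of one random comparison direction are assumptions made for illustration only.

```r
# Fit PCA on standardized variables
pca <- prcomp(USArrests, scale. = TRUE)

phi1 <- pca$rotation[, 1]        # loading vector of the first component
norm_sq <- sum(phi1^2)           # normalization constraint: should be 1

Xs <- scale(USArrests)           # centered, unit-variance columns
z1 <- as.vector(Xs %*% phi1)     # scores Z_1 = phi_11 X_1 + ... + phi_p1 X_p
var_z1 <- var(z1)                # equals the squared first standard deviation

# Any other unit-length direction should give no more variance than phi1
set.seed(42)
w <- rnorm(ncol(Xs))
w <- w / sqrt(sum(w^2))          # rescale to unit length
var_w <- var(as.vector(Xs %*% w))

print(c(norm_sq = norm_sq, var_z1 = var_z1, var_w = var_w))
```

A single random direction does not prove maximality, but it illustrates why the normalization is needed: without it, scaling w up would inflate the variance without limit.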
The second principal component is the linear combination of $X_1, X_2, \ldots, X_p$ that has maximal variance among
all linear combinations that are uncorrelated with $Z_1$, and so on. For a data set with p variables there are at most p
principal components, with loading vectors $\phi_1, \phi_2, \ldots, \phi_p$. You can find the $\phi$s by using the singular value decomposition (SVD) technique.
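A minimal sketch of obtaining the loading vectors from an SVD, again on the built-in USArrests data: the right singular vectors of the standardized data matrix are compared with the rotation matrix returned by prcomp(). The names Xs, V and sdev_from_svd are chosen for this example; signs of individual vectors may differ, so absolute values are compared.

```r
Xs <- scale(USArrests)                 # center and standardize each variable
s <- svd(Xs)

V <- s$v                               # columns are the loading vectors
pca <- prcomp(USArrests, scale. = TRUE)

# Largest absolute difference between the two sets of loadings (ignoring sign)
max_diff <- max(abs(abs(V) - abs(pca$rotation)))

# Singular values relate to the component standard deviations
sdev_from_svd <- s$d / sqrt(nrow(Xs) - 1)

print(max_diff)
print(rbind(svd = sdev_from_svd, prcomp = pca$sdev))
```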


We can now put all of these vectors, in order, into the columns of a matrix (call it V). The result of PCA is then the matrix approximation:
$X \approx UV^T$
You want to minimize the discrepancy between the actual X and your approximation of X via U and V (measured
by squared error):
$\operatorname{argmin}_{U,V} \sum_{i,j} (x_{i,j} - u_i \cdot v_j)^2$
The dot product $u_i \cdot v_j$ is the predicted preference of user i for item j. You want it to be as close as possible to the actual
preference.
So the best choice for U and V is the one that minimizes the squared differences between prediction
and observation.
Matrix U will have a row for each user and a column for each latent feature. V will have a row for each item
and a column for each latent feature.
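The user and item matrices can be illustrated with a small, fully observed ratings matrix. The sketch below builds U and V from a truncated SVD, which for a fully observed matrix gives the smallest squared error among all rank-k approximations (the Eckart-Young theorem). The ratings are invented, and folding the singular values into U is one of several possible conventions, assumed here for simplicity.

```r
# Hypothetical ratings: 5 users (rows) x 4 movies (columns); numbers are invented.
X <- matrix(c(5, 4, 1, 1,
              4, 5, 1, 2,
              1, 1, 5, 4,
              2, 1, 4, 5,
              5, 5, 2, 1), nrow = 5, byrow = TRUE)

k <- 2                                   # number of latent features (assumed)
s <- svd(X)

# One row per user, one column per latent feature
U <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
# One row per item, one column per latent feature
V <- s$v[, 1:k, drop = FALSE]

X_hat <- U %*% t(V)                      # all predicted preferences at once
pred_1_3 <- sum(U[1, ] * V[3, ])         # dot product u_1 . v_3: user 1, item 3

sse <- sum((X - X_hat)^2)                # squared error being minimized
print(round(X_hat, 2))
print(sse)
```

In real recommender data most entries of X are missing, so U and V are usually fitted by iterative optimization over the observed entries only; that case is not covered by this sketch.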
Method
1) Center Data: Subtract the variable mean from each data point, so that every variable has a mean of
0. Call this matrix U.
2) Calculate Covariance Matrix
3) Calculate Eigenvectors and Eigenvalues of the Covariance Matrix: If you can multiply a square
matrix by a vector and get a multiple of the same vector, you have an eigenvector.
$Av = \lambda v$
A is a square matrix
v is an eigenvector
$\lambda$ is an eigenvalue
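Example:
$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$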
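The eigenvectors of a symmetric matrix, such as a covariance matrix, can be chosen to be orthogonal to each other (this does not hold for a general square matrix like the one in the example above).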


These eigenvectors point along the directions in the data that show the biggest variation, and thus carry the
most information about the data. They are like lines of best fit.
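The eigenvector example from step 3 can be verified in R with the base function eigen(). The second half of the sketch checks orthogonality for a covariance matrix, using the built-in USArrests data as an assumed stand-in for "the data".

```r
# The 2 x 2 example: multiplying A by v gives 4 times v
A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)
v <- c(3, 2)
Av <- as.vector(A %*% v)
print(Av)                      # 12 8, which is 4 * c(3, 2)

# eigen() finds all eigenvalues and (unit-length) eigenvectors
e <- eigen(A)
print(e$values)

# For a covariance matrix (symmetric), the eigenvectors are orthonormal:
S <- cov(USArrests)
E <- eigen(S)$vectors
gram <- crossprod(E)           # t(E) %*% E, should be the identity matrix
print(round(gram, 10))
```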
4) Dimensionality Reduction: Order the eigenvectors by their eigenvalues, highest to lowest. The
eigenvector with the highest eigenvalue is the first principal component of the data set. It is the most significant
relationship in the data.
You can remove the components of lesser significance, because you'll lose only a small amount of information.
5) Create Feature Vector: Take the eigenvectors that you want to keep and put them in a matrix with the
eigenvectors in the columns. Call this matrix V.
6) Final Data Set
$X \approx UV^T$
In X, the data are expressed in terms of the patterns between them.
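The six steps above can be sketched in a few lines of R on the built-in USArrests data. The variable names are choices made for this example: in particular the centered data are called Xc here and the projected data are called scores, so that they cannot be confused with the U and V of the factorization. The number of kept components, k = 2, is arbitrary.

```r
X  <- as.matrix(USArrests)

# 1) Center the data: subtract each column mean
Xc <- scale(X, center = TRUE, scale = FALSE)

# 2) Covariance matrix of the variables
S <- cov(Xc)

# 3) Eigenvectors and eigenvalues (eigen() sorts eigenvalues in decreasing order)
e <- eigen(S)

# 4) Keep the k components with the largest eigenvalues
k <- 2

# 5) Feature vector: kept eigenvectors as columns
V <- e$vectors[, 1:k, drop = FALSE]

# 6) Final data set: centered data expressed along the kept eigenvectors
scores <- Xc %*% V

# Approximation of the centered data from the k kept components
X_hat <- scores %*% t(V)

# Cross-check against prcomp() without scaling (signs of columns may differ)
p <- prcomp(X)
print(rbind(eigenvalues = e$values, prcomp_var = p$sdev^2))
```

Working through the covariance matrix mirrors the steps in the notes; prcomp() itself uses an SVD of the centered data, which is numerically preferable but gives the same components.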
R Example: Principal Components
From Hastie / Tibshirani, Statistical Learning, using the USArrests data (which is built into R)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##   Murder  Assault UrbanPop     Rape
##    18.97  6945.17   209.52    87.73
We see that Assault has a much larger variance than the other variables. It would dominate the principal
components, so we choose to standardize the variables when we perform PCA (scale=TRUE).
pca.out=prcomp(USArrests, scale=TRUE)
pca.out
## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
##
## Rotation:
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
names(pca.out)
## [1] "sdev" "rotation" "center" "scale" "x"
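For comparison, a short sketch of what happens without standardization. The object names pca.raw and pca.std are chosen for this example; the only difference between the two fits is the scale. argument of prcomp().

```r
# PCA on the raw variables: Assault has by far the largest variance
pca.raw <- prcomp(USArrests, scale. = FALSE)
raw_pc1 <- pca.raw$rotation[, 1]
print(round(raw_pc1, 3))     # first loading vector, raw data

# PCA on standardized variables: every variable has variance 1
pca.std <- prcomp(USArrests, scale. = TRUE)
std_pc1 <- pca.std$rotation[, 1]
print(round(std_pc1, 3))     # first loading vector, standardized data
```

In the unscaled fit the first loading vector puts almost all of its weight on Assault, so the first component is little more than that one variable; after scaling, the weights are spread across the variables.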
summary(pca.out)
## Importance of components:
##                         PC1   PC2    PC3    PC4
## Standard deviation     1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion  0.62 0.868 0.9566 1.0000
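The "Proportion of Variance" row in the summary can be reproduced by hand from the standard deviations. A minimal sketch, with pve and cum_pve as names chosen for this example:

```r
pca.out <- prcomp(USArrests, scale. = TRUE)

# Variance of each component is the squared standard deviation
pc_var <- pca.out$sdev^2

# Proportion of variance explained by each component, and the running total
pve <- pc_var / sum(pc_var)
cum_pve <- cumsum(pve)

print(round(pve, 4))
print(round(cum_pve, 4))
```

Because the four variables were standardized, the component variances sum to 4, so each proportion is simply the component variance divided by 4. The first two components together account for roughly 87% of the variance, which is why a two-dimensional plot of this data set is a reasonable summary.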
==============================================================================
Dimensionality Problem
Gabriela Hromis
Notes are based on different books and class notes from different universities.
==============================================================================