
CS7015 (Deep Learning) : Lecture 6

Eigenvalues, Eigenvectors, Eigenvalue Decomposition, Principal Component Analysis, Singular Value Decomposition

Prof. Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 6.1 : Eigenvalues and Eigenvectors

What happens when a matrix hits a vector?

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad Ax = \begin{bmatrix} 7 \\ 5 \end{bmatrix}$$

The vector gets transformed into a new vector (it strays from its path).
The vector may also get scaled (elongated or shortened) in the process.

For a given square matrix A, there exist special vectors which refuse to stray from their path.

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad Ax = \begin{bmatrix} 3 \\ 3 \end{bmatrix} = 3\begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

These vectors are called eigenvectors.
More formally,

Ax = λx    [direction remains the same]

The vector will only get scaled but will not change its direction.

So what is so special about eigenvectors? Why are they always in the limelight?
It turns out that several properties of matrices can be analyzed based on their eigenvalues (for example, see spectral graph theory).
We will now see two cases where eigenvalues/vectors will help us in this course.

Let us assume that on day 0, k1 students eat Chinese food and k2 students eat Mexican food. (Of course, no one eats in the mess!)

$$v(0) = \begin{bmatrix} k_1 \\ k_2 \end{bmatrix}$$

On each subsequent day i, a fraction p of the students who ate Chinese food on day (i − 1) continue to eat Chinese food on day i, and (1 − p) shift to Mexican food.
Similarly, a fraction q of the students who ate Mexican food on day (i − 1) continue to eat Mexican food on day i, and (1 − q) shift to Chinese food.

$$v(1) = \begin{bmatrix} p k_1 + (1 - q) k_2 \\ (1 - p) k_1 + q k_2 \end{bmatrix} = \begin{bmatrix} p & 1 - q \\ 1 - p & q \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \end{bmatrix}$$

v(1) = M v(0)
v(2) = M v(1) = M^2 v(0)

The number of customers in the two restaurants is thus given by the following series:

v(0), M v(0), M^2 v(0), M^3 v(0), ...

In general, v(n) = M^n v(0).

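As a quick sanity check of the recurrence v(n) = M^n v(0), here is a minimal NumPy sketch; the values of p, q, k1 and k2 are made up purely for illustration:

```python
import numpy as np

p, q = 0.8, 0.9            # made-up fractions that stick with Chinese / Mexican food
k1, k2 = 300, 200          # made-up day-0 counts

M = np.array([[p, 1 - q],
              [1 - p, q]])
v = np.array([k1, k2], dtype=float)        # v(0)

for day in range(1, 31):
    v = M @ v                              # v(i) = M v(i-1), so v(n) = M^n v(0)
    if day % 10 == 0:
        print(f"day {day}: Chinese = {v[0]:.2f}, Mexican = {v[1]:.2f}")
```

The printed counts barely change between day 20 and day 30, hinting at the steady state discussed next.
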
[Transition diagram: Chinese (k1) and Mexican (k2) customers; a fraction p stays Chinese and (1 − p) moves to Mexican, while a fraction q stays Mexican and (1 − q) moves to Chinese.]

This is a problem for the two restaurant owners.
The number of patrons is changing constantly.
Or is it? Will the system eventually reach a steady state? (i.e., will the number of customers in the two restaurants become constant over time?)
Turns out they will!
Let's see how.

Definition (dominant eigenvalue)
Let λ1, λ2, ..., λn be the eigenvalues of an n × n matrix A. λ1 is called the dominant eigenvalue of A if

|λ1| ≥ |λi|,    i = 2, ..., n

Definition (stochastic matrix)
A matrix M is called a stochastic matrix if all the entries are positive and the sum of the elements in each column is equal to 1.
(Note that the matrix in our example is a stochastic matrix.)

Theorem
The largest (dominant) eigenvalue of a stochastic matrix is 1.
See proof here

Theorem
If A is an n × n square matrix with a dominant eigenvalue, then the sequence of vectors given by Av0, A^2 v0, ..., A^n v0, ... approaches a multiple of the dominant eigenvector of A.
(The theorem is slightly misstated here for ease of explanation.)

Let ed be the dominant eigenvector of M and λd = 1 the corresponding dominant eigenvalue.
Given the previous definitions and theorems, what can you say about the sequence M v(0), M^2 v(0), M^3 v(0), ...?
There exists an n such that

v(n) = M^n v(0) = k ed    (some multiple of ed)

Now what happens at time step (n + 1)?

v(n+1) = M v(n) = M(k ed) = k(M ed) = k(λd ed) = k ed

The population in the two restaurants becomes constant after time step n.
See proof here

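To connect the steady state with the dominant eigenvector, this sketch (same made-up p, q and day-0 counts as before) computes the eigenvector of M for eigenvalue 1 and compares it with M^n v(0) for a large n:

```python
import numpy as np

p, q = 0.8, 0.9                            # made-up transition fractions
M = np.array([[p, 1 - q],
              [1 - p, q]])
v0 = np.array([300.0, 200.0])              # made-up day-0 counts

eigvals, eigvecs = np.linalg.eig(M)
d = np.argmax(np.abs(eigvals))             # index of the dominant eigenvalue
e_d = eigvecs[:, d]
print("dominant eigenvalue:", eigvals[d])  # ~1.0 for a stochastic matrix

v_n = np.linalg.matrix_power(M, 100) @ v0  # v(100)
k = v_n[0] / e_d[0]                        # scale factor
print("v(100)  :", v_n)
print("k * e_d :", k * e_d)                # the same vector, i.e. v(n) = k ed
```
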
Now instead of a stochastic matrix let us consider any square matrix A.
Let p be the time step at which the sequence x0, Ax0, A^2 x0, ... approaches a multiple of ed (the dominant eigenvector of A):

A^p x0 = k ed
A^(p+1) x0 = A(A^p x0) = k A ed = k λd ed
A^(p+2) x0 = A(A^(p+1) x0) = k λd A ed = k (λd)^2 ed
A^(p+n) x0 = k (λd)^n ed

In general, if λd is the dominant eigenvalue of a matrix A, what would happen to the sequence x0, Ax0, A^2 x0, ... if
|λd| > 1 (will explode)
|λd| < 1 (will vanish)
|λd| = 1 (will reach a steady state)
(We will use this in the course at some point)

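A minimal sketch of the three regimes; the diagonal matrices below are made up so that the dominant eigenvalues are 1.1, 0.9 and 1.0 respectively:

```python
import numpy as np

x0 = np.array([1.0, 1.0])

cases = {
    "|lambda_d| > 1 (explode)": np.diag([1.1, 0.5]),
    "|lambda_d| < 1 (vanish) ": np.diag([0.9, 0.5]),
    "|lambda_d| = 1 (steady) ": np.diag([1.0, 0.5]),
}

for name, A in cases.items():
    x = x0.copy()
    for _ in range(50):                    # the sequence x0, Ax0, A^2 x0, ...
        x = A @ x
    print(f"{name}: ||A^50 x0|| = {np.linalg.norm(x):.4f}")
```
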
Module 6.2 : Linear Algebra - Basic Definitions

We will see some more examples where eigenvectors are important, but before that let's revisit some basic definitions from linear algebra.

Basis
A set of vectors ∈ R^n is called a basis if they are linearly independent and every vector ∈ R^n can be expressed as a linear combination of these vectors.

Linearly independent vectors
A set of n vectors v1, v2, ..., vn is linearly independent if no vector in the set can be expressed as a linear combination of the remaining n − 1 vectors.
In other words, the only solution to

c1 v1 + c2 v2 + ... + cn vn = 0    is    c1 = c2 = ... = cn = 0    (the ci's are scalars)

For example, consider the space R^2. Now consider the vectors

$$x = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Any vector [a, b]^T ∈ R^2 can be expressed as a linear combination of these two vectors, i.e.,

$$\begin{bmatrix} a \\ b \end{bmatrix} = a\begin{bmatrix} 1 \\ 0 \end{bmatrix} + b\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Further, x and y are linearly independent (the only solution to c1 x + c2 y = 0 is c1 = c2 = 0).

In fact, it turns out that x and y are unit vectors in the direction of the co-ordinate axes.
And indeed we are used to representing all vectors in R^2 as a linear combination of these two vectors.
But there is nothing sacrosanct about the particular choice of x and y.
We could have chosen any 2 linearly independent vectors in R^2 as the basis vectors.
For example, consider the linearly independent vectors [2, 3]^T and [5, 7]^T. See how any vector [a, b]^T ∈ R^2 can be expressed as a linear combination of these two vectors:

$$\begin{bmatrix} a \\ b \end{bmatrix} = x_1\begin{bmatrix} 2 \\ 3 \end{bmatrix} + x_2\begin{bmatrix} 5 \\ 7 \end{bmatrix}$$

a = 2 x1 + 5 x2
b = 3 x1 + 7 x2

We can find x1 and x2 by solving a system of linear equations.

In general, given a set of linearly independent vectors u1, u2, ..., un ∈ R^n, we can express any vector z ∈ R^n as a linear combination of these vectors:

z = α1 u1 + α2 u2 + ... + αn un

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \alpha_1\begin{bmatrix} u_{11} \\ u_{12} \\ \vdots \\ u_{1n} \end{bmatrix} + \alpha_2\begin{bmatrix} u_{21} \\ u_{22} \\ \vdots \\ u_{2n} \end{bmatrix} + \cdots + \alpha_n\begin{bmatrix} u_{n1} \\ u_{n2} \\ \vdots \\ u_{nn} \end{bmatrix}$$

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} u_{11} & u_{21} & \cdots & u_{n1} \\ u_{12} & u_{22} & \cdots & u_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1n} & u_{2n} & \cdots & u_{nn} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix}$$

(Basically rewriting in matrix form.)

We can now find the αi's using Gaussian elimination (time complexity: O(n^3)).

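As a quick illustration (a sketch only): in NumPy the coefficients can be obtained with np.linalg.solve, which performs this Gaussian-elimination step. The basis vectors [2, 3]^T and [5, 7]^T come from the earlier R^2 example; the particular z below is made up.

```python
import numpy as np

# Columns of U are the linearly independent basis vectors u1, u2
U = np.array([[2.0, 5.0],
              [3.0, 7.0]])
z = np.array([4.0, 9.0])           # an arbitrary vector to express in this basis

alpha = np.linalg.solve(U, z)      # Gaussian elimination on U alpha = z, O(n^3)
print("alpha:", alpha)             # coefficients alpha_1, alpha_2
print("check:", U @ alpha)         # reconstructs z
```
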
Now let us see what happens if we have an orthonormal basis, i.e.,

ui^T uj = 0  ∀ i ≠ j    and    ui^T ui = ||ui||^2 = 1

Again we have:

z = α1 u1 + α2 u2 + ... + αn un
u1^T z = α1 u1^T u1 + ... + αn u1^T un = α1

We can directly find each αi using a dot product between z and ui (time complexity O(n)).
The total complexity will be O(n^2).

Geometrically, if θ is the angle between z and u1, then

α1 = |z| cos θ = |z| (z^T u1)/(|z||u1|) = z^T u1

Similarly, α2 = z^T u2.

When u1 and u2 are unit vectors along the co-ordinate axes,

$$z = \begin{bmatrix} a \\ b \end{bmatrix} = a\begin{bmatrix} 1 \\ 0 \end{bmatrix} + b\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

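For contrast with the Gaussian-elimination route, here is a minimal sketch using an orthonormal basis of R^2 (a rotation of the co-ordinate axes by an arbitrary angle), where each coefficient is just a dot product:

```python
import numpy as np

theta = np.pi / 6                                  # arbitrary rotation angle
u1 = np.array([np.cos(theta), np.sin(theta)])
u2 = np.array([-np.sin(theta), np.cos(theta)])     # u1.u2 = 0 and ||u1|| = ||u2|| = 1

z = np.array([4.0, 9.0])

alpha1 = z @ u1                                    # alpha_i = z^T u_i, O(n) each
alpha2 = z @ u2
print("alpha1, alpha2:", alpha1, alpha2)
print("check:", alpha1 * u1 + alpha2 * u2)         # reconstructs z
```
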
Remember
An orthogonal basis is the most convenient basis that one can hope for.

But what does any of this have to do with eigenvectors?
Turns out that the eigenvectors can form a basis.

Theorem 1
The eigenvectors of a matrix A ∈ R^{n×n} having distinct eigenvalues are linearly independent.
Proof: See here

In fact, the eigenvectors of a square symmetric matrix are even more special.

Theorem 2
The eigenvectors of a square symmetric matrix are orthogonal.
Proof: See here

Thus they form a very convenient basis.
Why would we want to use the eigenvectors as a basis instead of the more natural co-ordinate axes?
We will answer this question soon.

Module 6.3 : Eigenvalue Decomposition

Before proceeding let’s do a quick recap of eigenvalue decomposition.

Let u1, u2, ..., un be the eigenvectors of a matrix A and let λ1, λ2, ..., λn be the corresponding eigenvalues.
Consider a matrix U whose columns are u1, u2, ..., un. Now

$$AU = A\begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} = \begin{bmatrix} | & | & & | \\ Au_1 & Au_2 & \cdots & Au_n \\ | & | & & | \end{bmatrix} = \begin{bmatrix} | & | & & | \\ \lambda_1 u_1 & \lambda_2 u_2 & \cdots & \lambda_n u_n \\ | & | & & | \end{bmatrix}$$

$$= \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = U\Lambda$$

where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of A.

AU = UΛ

If U^{-1} exists, then we can write

A = U Λ U^{-1}      [eigenvalue decomposition]
U^{-1} A U = Λ      [diagonalization of A]

Under what conditions would U^{-1} exist?
If the columns of U are linearly independent [see proof here],
i.e., if A has n linearly independent eigenvectors,
i.e., if A has n distinct eigenvalues [sufficient condition; proof: Slide 19, Theorem 1].

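A minimal NumPy check of the decomposition A = UΛU^{-1}, reusing the 2 × 2 matrix from the opening example (its eigenvalues, 3 and −1, are distinct, so U is invertible):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

eigvals, U = np.linalg.eig(A)       # columns of U are eigenvectors of A
Lam = np.diag(eigvals)              # Lambda has the eigenvalues on its diagonal

print("eigenvalues:", eigvals)                                         # 3 and -1
print("A = U Lam U^-1 ?", np.allclose(A, U @ Lam @ np.linalg.inv(U)))  # True
print("U^-1 A U = Lam ?", np.allclose(np.linalg.inv(U) @ A @ U, Lam))  # True
```
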
If A is symmetric then the situation is even more convenient.
The eigenvectors are orthogonal [proof: Slide 19, Theorem 2].
Further, let's assume that the eigenvectors have been normalized [ui^T ui = 1].

$$Q = U^T U = \begin{bmatrix} \leftarrow & u_1 & \rightarrow \\ \leftarrow & u_2 & \rightarrow \\ & \vdots & \\ \leftarrow & u_n & \rightarrow \end{bmatrix} \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}$$

Each cell of this matrix, Qij, is given by ui^T uj:

Qij = ui^T uj = 0 if i ≠ j
             = 1 if i = j

∴ U^T U = I (the identity matrix)

U^T is the inverse of U (very convenient to calculate).

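A quick sketch of this convenience for a symmetric matrix (np.linalg.eigh returns orthonormal eigenvectors, so the transpose really does act as the inverse); the random matrix below is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # construct a square symmetric matrix

eigvals, U = np.linalg.eigh(A)          # orthonormal eigenvectors of symmetric A

print("U^T U = I ?     ", np.allclose(U.T @ U, np.eye(4)))              # True
print("A = U Lam U^T ? ", np.allclose(A, U @ np.diag(eigvals) @ U.T))   # True
```
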
Something to think about
Given the EVD A = UΛU^T, what can you say about the sequence x0, Ax0, A^2 x0, ... in terms of the eigenvalues of A?
(Hint: You should arrive at the same conclusion we saw earlier)

Theorem (one more important property of eigenvectors)
If A is a square symmetric N × N matrix, then the solution to the following optimization problem is given by the eigenvector corresponding to the largest eigenvalue of A:

max_x  x^T A x    s.t.  ||x|| = 1

and the solution to

min_x  x^T A x    s.t.  ||x|| = 1

is given by the eigenvector corresponding to the smallest eigenvalue of A.
Proof: Next slide.

This is a constrained optimization problem that can be solved using Lagrange multipliers:

L = x^T A x − λ(x^T x − 1)
∂L/∂x = 2Ax − λ(2x) = 0  =>  Ax = λx

Hence x must be an eigenvector of A with eigenvalue λ.
Multiplying by x^T:

x^T A x = λ x^T x = λ    (since x^T x = 1)

Therefore, the critical points of this constrained problem are the eigenvalues of A.
The maximum value is the largest eigenvalue, while the minimum value is the smallest eigenvalue.

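A small numerical sanity check of this theorem (a sketch, using a randomly generated symmetric matrix): for many random unit vectors, x^T A x stays between the smallest and largest eigenvalues of A.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                  # square symmetric matrix

eigvals = np.linalg.eigvalsh(A)                    # sorted in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

X = rng.standard_normal((10000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # random unit vectors, ||x|| = 1
vals = np.einsum('ij,jk,ik->i', X, A, X)           # x^T A x for each row x

print("smallest eigenvalue:", lam_min, "  min sampled:", vals.min())
print("largest eigenvalue: ", lam_max, "  max sampled:", vals.max())
```
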
The story so far...
The eigenvectors corresponding to different eigenvalues are linearly independent.
The eigenvectors of a square symmetric matrix are orthogonal.
The eigenvectors of a square symmetric matrix can thus form a convenient basis.
We will put all of this to use.

Module 6.4 : Principal Component Analysis and its Interpretations

The story ahead...
Over the next few slides we will introduce Principal Component Analysis and see three different interpretations of it.

Consider the following data (a scatter of points in the x–y plane).
Each point (vector) here is represented using a linear combination of the x and y axes (i.e., using the point's x and y co-ordinates).
In other words, we are using x and y as the basis.
What if we choose a different basis?

For example, what if we use u1 and u2 as a basis instead of x and y?
We observe that all the points have a very small component in the direction of u2 (almost noise).
It seems that the same data which was originally in R^2 (x, y) can now be represented in R^1 (u1) by making a smarter choice for the basis.

Let's try stating this more formally.
Why do we not care about u2? Because the variance in the data in this direction is very small (all data points have almost the same value in the u2 direction).
If we were to build a classifier on top of this data then u2 would not contribute to the classifier as the points are not distinguishable along this direction.

In general, we are interested in representing the data using fewer dimensions such that the data has high variance along these dimensions.
Is that all? No, there is something else that we desire. Let's see what.

Consider the following data:

x      y      z
1      1      1
0.5    0      0
0.25   1      1
0.35   1.5    1.5
0.45   1      1
0.57   2      2.1
0.62   1.1    1
0.73   0.75   0.76
0.72   0.86   0.87

Is z adding any new information beyond what is already contained in y?
The two columns are highly correlated (or they have a high covariance):

$$\rho_{yz} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(z_i - \bar{z})^2}}$$

In other words, the column z is redundant since it is linearly dependent on y.

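Computing ρ_yz for the table above is a one-liner; a minimal sketch (np.corrcoef implements exactly the formula shown):

```python
import numpy as np

# y and z columns from the table above
y = np.array([1, 0, 1, 1.5, 1, 2, 1.1, 0.75, 0.86])
z = np.array([1, 0, 1, 1.5, 1, 2.1, 1, 0.76, 0.87])

rho_yz = np.corrcoef(y, z)[0, 1]
print(f"rho_yz = {rho_yz:.4f}")     # very close to 1, so z is (almost) redundant
```
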
In general, we are interested in representing the data using fewer dimensions such that
the data has high variance along these dimensions,
the dimensions are linearly independent (uncorrelated),
(even better if they are orthogonal, because that is a very convenient basis).

Let p1, p2, ..., pn be a set of such n linearly independent orthonormal vectors, and let P be an n × n matrix such that p1, p2, ..., pn are the columns of P.
Let x1, x2, ..., xm ∈ R^n be m data points and let X be a matrix such that x1, x2, ..., xm are the rows of this matrix. Further, let us assume that the data is 0-mean and unit variance.
We want to represent each xi using this new basis P:

xi = αi1 p1 + αi2 p2 + αi3 p3 + ... + αin pn

For an orthonormal basis we know that we can find these αij's using

αij = xi^T pj

In general, the transformed data point x̂i is given by

$$\hat{x}_i = x_i^T \begin{bmatrix} | & & | \\ p_1 & \cdots & p_n \\ | & & | \end{bmatrix} = x_i^T P$$

and

X̂ = XP    (X̂ is the matrix of transformed points)

Theorem:
If X is a matrix such that its columns have zero mean and if X̂ = XP, then the columns of X̂ will also have zero mean.
Proof: For any matrix A, 1^T A gives us a row vector whose ith element contains the sum of the ith column of A (this is easy to see using the row-column picture of matrix multiplication). Consider

1^T X̂ = 1^T XP = (1^T X)P

But 1^T X is the row vector containing the sums of the columns of X, so 1^T X = 0. Therefore, 1^T X̂ = 0.
Hence the transformed matrix also has columns with sum = 0.

Theorem:
X^T X is a symmetric matrix.
Proof: We can write (X^T X)^T = X^T (X^T)^T = X^T X.

Definition:
If X is a matrix whose columns are zero mean then Σ = (1/m) X^T X is the covariance
matrix. In other words, each entry Σ_{ij} stores the covariance between columns i and
j of X.
Explanation: Let C be the covariance matrix of X. Let µ_i, µ_j denote the means
of the i-th and j-th columns of X respectively. Then by definition of covariance, we
can write:

C_{ij} = (1/m) ∑_{k=1}^{m} (X_{ki} − µ_i)(X_{kj} − µ_j)
       = (1/m) ∑_{k=1}^{m} X_{ki} X_{kj}            (∵ µ_i = µ_j = 0)
       = (1/m) X_i^T X_j = (1/m) (X^T X)_{ij}
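As a quick sanity check (our own sketch, not from the slides; random stand-in data), the definition Σ = (1/m) X^T X agrees with NumPy's covariance routine once the columns are centred and the biased (divide-by-m) estimator is requested.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                       # zero-mean columns

m = X.shape[0]
Sigma = (X.T @ X) / m                        # covariance matrix as defined above
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True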
X̂ = XP

Using the previous theorem & definition, we get that (1/m) X̂^T X̂ is the covariance matrix of
the transformed data. We can write:

(1/m) X̂^T X̂ = (1/m) (XP)^T XP = (1/m) P^T X^T X P = P^T ((1/m) X^T X) P = P^T Σ P

Each cell (i, j) of the covariance matrix (1/m) X̂^T X̂ stores the covariance between columns
i and j of X̂.
Ideally we want,

((1/m) X̂^T X̂)_{ij} = 0   for i ≠ j    (covariance = 0)
((1/m) X̂^T X̂)_{ij} ≠ 0   for i = j    (variance ≠ 0)

In other words, we want

(1/m) X̂^T X̂ = P^T Σ P = D    [where D is a diagonal matrix]
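The following small check (our own, with randomly generated correlated data) shows that choosing the columns of P to be the orthonormal eigen vectors of Σ does make P^T Σ P diagonal; np.linalg.eigh is used since Σ is symmetric.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated columns
X = X - X.mean(axis=0)

Sigma = (X.T @ X) / X.shape[0]
eigvals, P = np.linalg.eigh(Sigma)            # columns of P are orthonormal eigen vectors

D = P.T @ Sigma @ P                           # should be diagonal
off_diag = D - np.diag(np.diag(D))
print(np.allclose(off_diag, 0))               # True: covariances of transformed data vanish
print(np.diag(D), eigvals)                    # diagonal entries are the eigen values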
We want,
P^T Σ P = D
But Σ is a square matrix and P is an orthogonal matrix
Which orthogonal matrix satisfies the following condition?

P^T Σ P = D

In other words, which orthogonal matrix P diagonalizes Σ?

Answer: A matrix P whose columns are the eigen vectors of Σ (equivalently, of X^T X,
since the factor 1/m does not change the eigen vectors) [By Eigen Value Decomposition]
Thus, the new basis P used to transform X is the basis consisting of the eigen
vectors of X^T X
Why is this a good basis?
Because the eigen vectors of X^T X are linearly independent (proof: Slide 19,
Theorem 1)
And because the eigen vectors of X^T X are orthogonal (∵ X^T X is symmetric -
saw proof earlier)
This method is called Principal Component Analysis for transforming the data
to a new basis where the dimensions are non-redundant (low covariance) & not
noisy (high variance)
In practice, we select only the top-k dimensions along which the variance is
high (this will become clearer when we look at an alternate interpretation
of PCA); a minimal sketch of the full recipe is given below.
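The sketch below is our own illustration of this recipe (it is not the lecture's code); the function name pca, the random stand-in data and the choice k = 2 are ours.

import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k eigen vectors of the covariance matrix."""
    X = X - X.mean(axis=0)                      # zero-mean columns
    Sigma = (X.T @ X) / X.shape[0]              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigen values in ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending
    P = eigvecs[:, order[:k]]                   # top-k eigen vectors (new basis)
    return X @ P, P                             # transformed data and the basis

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X_hat, P = pca(X, k=2)
print(X_hat.shape)                              # (100, 2)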
Module 6.5 : PCA : Interpretation 2
Given n orthogonal linearly independent vectors P = [p_1, p_2, · · · , p_n] we can
represent x_i exactly as a linear combination of these vectors:

x_i = ∑_{j=1}^{n} α_{ij} p_j    [we know how to estimate the α_{ij}'s but we will come back to that later]

But we are interested only in the top-k dimensions (we want to get rid of noisy &
redundant dimensions):

x̂_i = ∑_{j=1}^{k} α_{ij} p_j

We want to select the p_i's such that we minimize the reconstruction error

e = ∑_{i=1}^{m} (x_i − x̂_i)^T (x_i − x̂_i)
e = ∑_{i=1}^{m} (x_i − x̂_i)^T (x_i − x̂_i)

  = ∑_{i=1}^{m} ( ∑_{j=1}^{n} α_{ij} p_j − ∑_{j=1}^{k} α_{ij} p_j )^2

  = ∑_{i=1}^{m} ( ∑_{j=k+1}^{n} α_{ij} p_j )^2 = ∑_{i=1}^{m} ( ∑_{j=k+1}^{n} α_{ij} p_j )^T ( ∑_{j=k+1}^{n} α_{ij} p_j )

  = ∑_{i=1}^{m} (α_{i,k+1} p_{k+1} + α_{i,k+2} p_{k+2} + . . . + α_{i,n} p_n)^T (α_{i,k+1} p_{k+1} + α_{i,k+2} p_{k+2} + . . . + α_{i,n} p_n)

  = ∑_{i=1}^{m} ∑_{j=k+1}^{n} α_{ij} p_j^T p_j α_{ij} + ∑_{i=1}^{m} ∑_{j=k+1}^{n} ∑_{l=k+1, l≠j}^{n} α_{ij} p_j^T p_l α_{il}

  = ∑_{i=1}^{m} ∑_{j=k+1}^{n} α_{ij}^2        (∵ p_j^T p_j = 1, p_i^T p_j = 0 ∀ i ≠ j)

  = ∑_{i=1}^{m} ∑_{j=k+1}^{n} (x_i^T p_j)^2

  = ∑_{i=1}^{m} ∑_{j=k+1}^{n} p_j^T x_i x_i^T p_j

  = ∑_{j=k+1}^{n} p_j^T ( ∑_{i=1}^{m} x_i x_i^T ) p_j

  = ∑_{j=k+1}^{n} p_j^T (mC) p_j        [∵ (1/m) ∑_{i=1}^{m} x_i x_i^T = (1/m) X^T X = C]
We want to minimize e:

min_{p_{k+1}, p_{k+2}, · · · , p_n}  ∑_{j=k+1}^{n} p_j^T (mC) p_j    s.t.  p_j^T p_j = 1   ∀ j = k+1, k+2, · · · , n

The solution to the above problem is given by the eigen vectors corresponding to
the smallest eigen values of C (Proof: refer Slide 26).

Thus we select P = [p_1, p_2, · · · , p_n] as the eigen vectors of C and retain only the top-k eigen
vectors to express the data [or discard the eigen vectors k+1, · · · , n]
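A numerical consequence worth noting (our own check, not stated on the slides): since mC = X^T X, projecting onto the top-k eigen vectors leaves a reconstruction error equal to the sum of the discarded eigen values of X^T X. The snippet below verifies this on randomly generated stand-in data.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

eigvals, P = np.linalg.eigh(X.T @ X)        # eigen pairs of X^T X (ascending)
P = P[:, ::-1]; eigvals = eigvals[::-1]     # sort descending

k = 3
X_rec = (X @ P[:, :k]) @ P[:, :k].T         # reconstruct from the top-k directions
err = np.sum((X - X_rec) ** 2)              # e = sum_i ||x_i - x_hat_i||^2
print(np.allclose(err, eigvals[k:].sum()))  # True: error = sum of discarded eigen values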
Key Idea
Minimize the error in reconstructing x_i after projecting the data on to a new basis.
Let's look at the 'Reconstruction Error' in the context of our toy example
[Figure: the toy data in the x-y plane, with the new basis vectors u_1 and u_2 drawn at the origin]

u_1 = [1, 1] and u_2 = [−1, 1] are the new basis vectors
Let us convert them to unit vectors: u_1 = [1/√2, 1/√2] & u_2 = [−1/√2, 1/√2]
Consider the point x = [3.3, 3] in the original data
α_1 = x^T u_1 = 6.3/√2
α_2 = x^T u_2 = −0.3/√2
The perfect reconstruction of x is given by (using n = 2 dimensions)

x = α_1 u_1 + α_2 u_2 = [3.3  3]

But we are going to reconstruct it using fewer dimensions (only k = 1 < n
dimensions, ignoring the low-variance u_2 dimension):

x̂ = α_1 u_1 = [3.15  3.15]    (reconstruction with minimum error)
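The numbers in this toy example can be reproduced directly; the few lines below are our own check (not part of the slides).

import numpy as np

x  = np.array([3.3, 3.0])
u1 = np.array([1.0, 1.0]) / np.sqrt(2)
u2 = np.array([-1.0, 1.0]) / np.sqrt(2)

a1, a2 = x @ u1, x @ u2
print(a1 * np.sqrt(2), a2 * np.sqrt(2))    # approx 6.3 and -0.3, i.e. 6.3/sqrt(2) and -0.3/sqrt(2)
print(a1 * u1 + a2 * u2)                   # [3.3, 3.0]  : perfect reconstruction (k = 2)
print(a1 * u1)                             # [3.15, 3.15]: reconstruction with k = 1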
Recap
The eigen vectors of a matrix with distinct eigen values are linearly independent
The eigen vectors of a square symmetric matrix are orthogonal
PCA exploits this fact by representing the data using a new basis comprising
only the top-k eigen vectors
The n − k dimensions which contribute very little to the reconstruction error
are discarded
These are also the directions along which the variance is minimum
Module 6.6 : PCA : Interpretation 3
We started off with the following wishlist
We are interested in representing the data using fewer dimensions such that
  the dimensions have low covariance
  the dimensions have high variance
So far we have paid a lot of attention to the covariance
It has indeed played a central role in all our analysis
But what about variance? Have we achieved our stated goal of high variance
along dimensions?
To answer this question we will see yet another interpretation of PCA
The i-th dimension of the transformed data X̂ is given by

X̂_i = X p_i

The variance along this dimension is given by

(1/m) X̂_i^T X̂_i = (1/m) p_i^T X^T X p_i
                 = (1/m) p_i^T λ_i p_i        [∵ p_i is an eigen vector of X^T X]
                 = (1/m) λ_i p_i^T p_i        [p_i^T p_i = 1]
                 = λ_i / m

Thus the variance along the i-th dimension (i-th eigen vector of X^T X) is given
by the corresponding (scaled) eigen value.
Hence, we did the right thing by discarding the dimensions (eigen vectors)
corresponding to lower eigen values!
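This identity is easy to confirm numerically; the snippet below is our own check on randomly generated, centred stand-in data.

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
m = X.shape[0]

eigvals, P = np.linalg.eigh(X.T @ X)       # eigen pairs of X^T X (ascending order)
X_hat = X @ P                              # data expressed in the eigen basis

var = (X_hat ** 2).sum(axis=0) / m         # variance of each transformed column
print(np.allclose(var, eigvals / m))       # True: variance along p_i is lambda_i / m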
A Quick Summary
We have seen 3 different interpretations of PCA
It ensures that the covariance between the new dimensions is minimized
It picks dimensions such that the data exhibits a high variance across these
dimensions
It ensures that the data can be represented using fewer dimensions
Module 6.7 : PCA : Practical Example
Suppose we are given a large number of images of human faces (say, m images)
Each image is 100 × 100 [10K dimensions]
We would like to represent and store the images using much fewer dimensions
(around 50-200)
We construct a matrix X ∈ R^{m×10K}
Each row of the matrix corresponds to 1 image
Each image is represented using 10K dimensions
X ∈ R^{m×10K} (as explained on the previous slide)
We retain the top 100 dimensions corresponding to the top 100 eigen vectors of X^T X
Note that X^T X is an n × n matrix, so its eigen vectors will be n-dimensional
(n = 10K in this case)
We can convert each eigen vector into a 100 × 100 matrix and treat it as an image
Let's see what we get
What we have plotted here are the first 16 eigen vectors of X^T X (basically, treating each
10K-dimensional eigen vector as a 100 × 100 image)
These images are called eigenfaces and form a basis for representing any face in
our database
In other words, we can now represent a given image (face) as a linear combination
of these eigenfaces:

x_1 ≈ ∑_{i=1}^{k} α_{1i} p_i    [the lecture figures show this reconstruction for k = 1, 2, 4, 8, 12 and 16]

In practice, we just need to store p_1, p_2, · · · , p_k (one-time storage)
Then for each image i we just need to store the scalar values α_{i1}, α_{i2}, · · · , α_{ik}
This significantly reduces the storage cost without much loss in image quality
A small sketch of this pipeline is given below.
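The sketch below is our own illustration, not the lecture's code: the array faces is a random stand-in for the real image matrix, and we obtain the eigen vectors of X^T X via the SVD of X (an equivalent route that avoids forming the 10K × 10K matrix explicitly).

import numpy as np

# hypothetical data: m face images, each flattened to 10,000 pixels (100 x 100)
m, n, k = 500, 10_000, 100
rng = np.random.default_rng(6)
faces = rng.random((m, n))                  # stand-in for the real image matrix X

X = faces - faces.mean(axis=0)              # centre the data
# the right singular vectors of X are the eigen vectors of X^T X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
eigenfaces = Vt[:k]                         # each row reshapes to a 100 x 100 "eigenface"

alphas = X @ eigenfaces.T                   # k coefficients per image (what we store)
reconstruction = alphas @ eigenfaces        # approximate faces from the k coefficients
print(alphas.shape, reconstruction.shape)   # (500, 100) (500, 10000)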
Module 6.8 : Singular Value Decomposition
Let us get some more perspective on eigen vectors before moving ahead
Let v_1, v_2, · · · , v_n be the eigen vectors of A and let λ_1, λ_2, · · · , λ_n be the
corresponding eigen values:

A v_1 = λ_1 v_1,  A v_2 = λ_2 v_2,  · · · ,  A v_n = λ_n v_n

If a vector x in R^n is represented using v_1, v_2, · · · , v_n as the basis then

x = ∑_{i=1}^{n} α_i v_i

Now,  Ax = ∑_{i=1}^{n} α_i A v_i = ∑_{i=1}^{n} α_i λ_i v_i

The matrix multiplication reduces to a scalar multiplication if the eigen vectors
of A are used as a basis.
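As a small check (our own; it uses a symmetric matrix so that the eigen vectors form an orthonormal basis and the coordinates α_i are easy to compute), Ax computed through the eigen basis matches the direct matrix-vector product.

import numpy as np

rng = np.random.default_rng(7)
S = rng.normal(size=(4, 4))
A = S + S.T                                  # symmetric, so eigen vectors are orthonormal

lam, V = np.linalg.eigh(A)                   # A v_i = lambda_i v_i
x = rng.normal(size=4)

alpha = V.T @ x                              # coordinates of x in the eigen basis
Ax_via_scalars = V @ (lam * alpha)           # sum_i alpha_i * lambda_i * v_i
print(np.allclose(A @ x, Ax_via_scalars))    # True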
So far all the discussion was centered around square matrices (A ∈ R^{n×n})
What about rectangular matrices A ∈ R^{m×n}? Can they have eigen vectors?
Is it possible to have A_{m×n} x_{n×1} = λ x_{n×1}? Not possible!
The result of A_{m×n} x_{n×1} is a vector belonging to R^m (whereas x ∈ R^n)
So do we miss out on the advantage that a basis of eigen vectors provides
for square matrices (i.e. converting matrix multiplications into scalar
multiplications)?
We will see the answer to this question over the next few slides
Note that the matrix A_{m×n} provides a transformation R^n → R^m
What if we could have pairs of vectors (v_1, u_1), (v_2, u_2), · · · , (v_k, u_k) such that v_i ∈ R^n,
u_i ∈ R^m and A v_i = σ_i u_i
Further let's assume that v_1, · · · , v_k, · · · , v_n are orthogonal & thus form a basis V in R^n
Similarly let's assume that u_1, · · · , u_k, · · · , u_m are orthogonal & thus form a basis U in R^m
Now what if every vector x ∈ R^n is represented using the basis V:

x = ∑_{i=1}^{k} α_i v_i    [note we are using k instead of n; will clarify this in a minute]

Ax = ∑_{i=1}^{k} α_i A v_i = ∑_{i=1}^{k} α_i σ_i u_i

Once again the matrix multiplication reduces to a scalar multiplication
Let's look at a geometric interpretation of this
[Figure: A maps R^n to R^m; the row space of A (dimension k = rank(A)) on the left is sent to the
column space of A (dimension k = rank(A)) on the right]

R^n - Space of all vectors which can multiply with A to give Ax [this is the space of
inputs of the function]
R^m - Space of all vectors which are outputs of the function Ax
We are interested in finding a basis U, V such that
  V - basis for the inputs
  U - basis for the outputs
such that if the inputs and outputs are represented using this basis then the operation
Ax reduces to a scalar operation
What do we mean by saying that the dimension of the row space is k? If x ∈ R^n,
then why is the dimension not n?
It means that of all the possible directions in R^n, only the component of x lying in
a k-dimensional subspace (the row space) contributes to Ax; the component lying in
the remaining (n − k)-dimensional subspace produces a zero output.
Hence we need only k dimensions to represent x (for the purpose of computing Ax):

x = ∑_{i=1}^{k} α_i v_i
Let's look at a way of writing this as a matrix operation:

A v_1 = σ_1 u_1,  A v_2 = σ_2 u_2,  · · · ,  A v_k = σ_k u_k

A_{m×n} V_{n×k} = U_{m×k} Σ_{k×k}    (Σ is a diagonal matrix)

If we have k orthogonal vectors (V_{n×k}) then using Gram-Schmidt
orthogonalization, we can find n − k more orthogonal vectors to complete the
basis for R^n [we can do the same for U]

A_{m×n} V_{n×n} = U_{m×m} Σ_{m×n}

U^T A V = Σ    [U^{-1} = U^T]        A = U Σ V^T    [V^{-1} = V^T]

Σ is a diagonal matrix with only the first k diagonal elements non-zero
Now the question is how do we find V, U and Σ
Suppose V, U and Σ exist; then

A^T A = (U Σ V^T)^T (U Σ V^T)
      = V Σ^T U^T U Σ V^T
      = V Σ^2 V^T

What does this look like? The Eigen Value Decomposition of A^T A

Similarly we can show that

A A^T = U Σ^2 U^T

Thus U and V are the eigen vectors of A A^T and A^T A respectively, and Σ^2 = Λ
where Λ is the diagonal matrix containing the eigen values of A^T A
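These identities can be checked with NumPy's SVD routine; the snippet below is our own verification on a randomly generated rectangular matrix.

import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(5, 3))                  # a rectangular matrix (m = 5, n = 3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: A = U Sigma V^T

# Sigma^2 matches the eigen values of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigen values, sorted descending
print(np.allclose(s ** 2, eigvals))          # True

# each column of V is an eigen vector of A^T A: (A^T A) v_i = sigma_i^2 v_i
V = Vt.T
print(np.allclose(A.T @ A @ V, V * s ** 2))  # True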
   
A_{m×n} = [u_1 · · · u_k]_{m×k} · diag(σ_1, · · · , σ_k)_{k×k} · [v_1 · · · v_k]^T_{k×n}
        = ∑_{i=1}^{k} σ_i u_i v_i^T

Theorem:
σ_1 u_1 v_1^T is the best rank-1 approximation of the matrix A. ∑_{i=1}^{2} σ_i u_i v_i^T is the best
rank-2 approximation of matrix A. In general, ∑_{i=1}^{k} σ_i u_i v_i^T is the best rank-k
approximation of matrix A. In other words, the solution to

min_B ‖A − B‖_F^2    (over all matrices B of rank at most k)

is given by B = U_{·,k} Σ_{k,k} V_{k,·}^T (which minimizes the reconstruction error of A).
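A small numerical illustration of the theorem above (ours, with a random stand-in matrix): the truncated SVD has rank k and its squared Frobenius error equals the sum of the discarded squared singular values.

import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # rank-k truncation of the SVD
print(np.linalg.matrix_rank(B))              # 2

err = np.linalg.norm(A - B, 'fro') ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))   # True: error = sum of discarded sigma_i^2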
σ_i = √λ_i = the i-th singular value of A
U = the left singular matrix of A
V = the right singular matrix of A
