
CS 256: LMS Algorithms

Chapter 7: LMS Algorithms


• The LMS Objective Function.
• Global solution.
• Pseudoinverse of a matrix.
• Optimization (learning) by gradient descent.
• Widrow-Hoff Algorithm

Copyright © Robert R. Snapp 2005


LMS Objective Function

Again we consider the problem of programming a linear threshold function


[Figure: a linear threshold unit with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, bias weight $w_0$, and output $y$.]

$$
y = \operatorname{sgn}(w^T x + w_0) =
\begin{cases}
1, & \text{if } w^T x + w_0 > 0,\\
-1, & \text{if } w^T x + w_0 \le 0.
\end{cases}
$$

so that it agrees with a given dichotomy of m feature vectors,

$$X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},$$

where $x_i \in \mathbb{R}^n$ and $\ell_i \in \{-1, +1\}$, for $i = 1, 2, \ldots, m$.


LMS Objective Function (cont.)

Thus, given a dichotomy

$$X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},$$

we seek a solution weight vector w and bias w0, such that,

$$\operatorname{sgn}(w^T x_i + w_0) = \ell_i,$$


for i = 1, 2, . . . , m.
Equivalently, using homogeneous coordinates (or augmented feature vectors),

$$\operatorname{sgn}(\hat{w}^T \hat{x}_i) = \ell_i,$$

for $i = 1, 2, \ldots, m$, where $\hat{x}_i^T = (1, x_i^T)$ and $\hat{w}^T = (w_0, w^T)$.

Using normalized coordinates,

$$\operatorname{sgn}(\hat{w}^T \hat{x}'_i) = 1, \quad\text{or,}\quad \hat{w}^T \hat{x}'_i > 0,$$

for $i = 1, 2, \ldots, m$, where $\hat{x}'_i = \ell_i \hat{x}_i$.
LMS Objective Function (cont.)

Alternatively, let $b \in \mathbb{R}^m$ satisfy $b_i > 0$ for $i = 1, 2, \ldots, m$. We call $b$ a margin vector. Frequently we will assume that $b_i = 1$.

Our learning criterion is certainly satisfied if

$$\hat{w}^T \hat{x}'_i = b_i$$

for $i = 1, 2, \ldots, m$. (Note that this condition is sufficient, but not necessary.)

Equivalently, the above is satisfied if the expression

$$E(\hat{w}) = \sum_{i=1}^m \left(\hat{w}^T \hat{x}'_i - b_i\right)^2$$

equals zero. The above expression we call the LMS objective function, where LMS stands for least mean square. (Technically, we should normalize the above by dividing the right side by m.)
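As a concrete illustration (this code is not part of the original notes; the Python/NumPy helper and its name, lms_objective, are hypothetical), the objective can be evaluated directly from its definition:

    import numpy as np

    def lms_objective(X_feat, labels, w_hat, b=None):
        # Evaluate E(w_hat) = sum_i (w_hat^T x'_i - b_i)^2 for a dichotomy.
        #   X_feat : (m, n) array whose rows are the feature vectors x_i
        #   labels : (m,) array of labels l_i in {-1, +1}
        #   w_hat  : (n+1,) augmented weight vector (w_0, w_1, ..., w_n)
        #   b      : (m,) margin vector with b_i > 0; defaults to all ones
        m = X_feat.shape[0]
        if b is None:
            b = np.ones(m)
        x_hat = np.hstack([np.ones((m, 1)), X_feat])    # augmented vectors x_hat_i
        x_prime = labels[:, None] * x_hat               # normalized vectors x'_i = l_i x_hat_i
        residuals = x_prime @ w_hat - b
        return np.sum(residuals ** 2)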
LMS Objective Function: Remarks

Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc. And there is no guarantee that we can find a $\hat{w}^\star \in \mathbb{R}^{n+1}$ that satisfies $E(\hat{w}^\star) = 0$. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.

Also, there is no unique objective function. The LMS expression,

$$E(\hat{w}) = \sum_{i=1}^m \left(\hat{w}^T \hat{x}'_i - b_i\right)^2,$$

can be replaced by numerous candidates, e.g.

$$\sum_{i=1}^m \left|\hat{w}^T \hat{x}'_i - b_i\right|, \qquad \sum_{i=1}^m \left(1 - \operatorname{sgn}\!\left(\hat{w}^T \hat{x}'_i\right)\right), \qquad \sum_{i=1}^m \left(1 - \operatorname{sgn}\!\left(\hat{w}^T \hat{x}'_i\right)\right)^2, \quad \text{etc.}$$

However, minimizing $E(\hat{w})$ is generally an easy task.
Minimizing the LMS Objective Function

Inspection suggests that the LMS Objective Function


$$E(\hat{w}) = \sum_{i=1}^m \left(\hat{w}^T \hat{x}'_i - b_i\right)^2$$
describes a parabolic function. It may have a unique global minimum, or an
infinite number of global minima which occupy a connected linear set. (The
latter can occur if m < n + 1.) Letting,
$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \begin{pmatrix} \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\ \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},$$

then,

$$E(\hat{w}) = \sum_{i=1}^m \left(\hat{x}'^T_i \hat{w} - b_i\right)^2 = \left\| \begin{pmatrix} \hat{x}'^T_1 \hat{w} - b_1 \\ \hat{x}'^T_2 \hat{w} - b_2 \\ \vdots \\ \hat{x}'^T_m \hat{w} - b_m \end{pmatrix} \right\|^2 = \left\| \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} \hat{w} - b \right\|^2 = \|X\hat{w} - b\|^2.$$
Minimizing the LMS Objective Function (cont.)

$$
\begin{aligned}
E(\hat{w}) &= \|X\hat{w} - b\|^2 \\
&= (X\hat{w} - b)^T (X\hat{w} - b) \\
&= \left(\hat{w}^T X^T - b^T\right)(X\hat{w} - b) \\
&= \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b \\
&= \hat{w}^T X^T X \hat{w} - 2 b^T X \hat{w} + \|b\|^2.
\end{aligned}
$$

As an aside, note that


 
$$X^T X = (\hat{x}'_1, \ldots, \hat{x}'_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^m \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1)\times(n+1)},$$

$$b^T X = (b_1, \ldots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^m b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.$$
Minimizing the LMS Objective Function (cont.)

Okay, so how do we minimize

$$E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2 b^T X \hat{w} + \|b\|^2\,?$$

Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$, and algebraically determine a value of $\hat{w}^\star$ which makes each component vanish.
That is, solve
$$\nabla E(\hat{w}) = \begin{pmatrix} \dfrac{\partial E}{\partial \hat{w}_0} \\[1.5ex] \dfrac{\partial E}{\partial \hat{w}_1} \\ \vdots \\ \dfrac{\partial E}{\partial \hat{w}_n} \end{pmatrix} = 0.$$

It is straightforward to show that

$$\nabla E(\hat{w}) = 2 X^T X \hat{w} - 2 X^T b.$$
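As a sanity check (an addition, not part of the notes), this closed-form gradient can be compared against a central finite-difference approximation; the sketch below assumes X and b are NumPy arrays built as above.

    import numpy as np

    def grad_E(w_hat, X, b):
        # Closed-form gradient: 2 X^T X w_hat - 2 X^T b = 2 X^T (X w_hat - b).
        return 2.0 * X.T @ (X @ w_hat - b)

    def grad_E_numeric(w_hat, X, b, eps=1.0e-6):
        # Central finite differences, one coordinate of w_hat at a time.
        E = lambda w: np.sum((X @ w - b) ** 2)
        g = np.zeros(w_hat.shape)
        for j in range(w_hat.size):
            step = np.zeros(w_hat.shape)
            step[j] = eps
            g[j] = (E(w_hat + step) - E(w_hat - step)) / (2.0 * eps)
        return g

    # The two should agree to within floating-point error:
    # np.allclose(grad_E(w, X, b), grad_E_numeric(w, X, b))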
Minimizing the LMS Objective Function (cont.)

Thus,

$$\nabla E(\hat{w}) = 2 X^T X \hat{w} - 2 X^T b = 0$$

if

$$\hat{w}^\star = (X^T X)^{-1} X^T b = X^\dagger b,$$

where the matrix,

$$X^\dagger \triangleq (X^T X)^{-1} X^T \in \mathbb{R}^{(n+1)\times m}$$

is called the pseudoinverse of $X$. If $X^T X$ is singular, one defines

$$X^\dagger \triangleq \lim_{\epsilon \to 0} (X^T X + \epsilon I)^{-1} X^T.$$

Observe that if $X^T X$ is nonsingular,

$$X^\dagger X = (X^T X)^{-1} X^T X = I.$$
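The pseudoinverse solution is easy to compute numerically. The sketch below (Python with NumPy, offered only as an illustration) uses the formula above in the nonsingular case and falls back on numpy.linalg.pinv, an SVD-based pseudoinverse, when $X^T X$ is singular.

    import numpy as np

    def lms_pseudoinverse_solution(X, b):
        # Return w_hat_star = X_dagger b, the minimizer of ||X w - b||^2.
        XtX = X.T @ X
        if np.linalg.matrix_rank(XtX) == XtX.shape[0]:
            # Nonsingular case: X_dagger = (X^T X)^{-1} X^T.
            X_dagger = np.linalg.solve(XtX, X.T)
        else:
            # Singular case: use the SVD-based pseudoinverse, which agrees
            # with the limit (X^T X + eps I)^{-1} X^T as eps -> 0.
            X_dagger = np.linalg.pinv(X)
        return X_dagger @ b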
Example

The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern


Classification, Second Edition, Wiley, NY, 2001, p. 241.
Given the dichotomy,
$$X_4 = \left\{\left((1,2)^T,\, 1\right),\ \left((2,0)^T,\, 1\right),\ \left((3,1)^T,\, -1\right),\ \left((2,3)^T,\, -1\right)\right\}$$

we obtain,

$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.$$

Whence,

$$X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix}, \quad\text{and,}\quad X^\dagger = (X^T X)^{-1} X^T = \begin{pmatrix} \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex] -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex] 0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3} \end{pmatrix}.$$
Example (cont.)

Letting, b = (1, 1, 1, 1)T , then


  
$$\hat{w} = X^\dagger b = \begin{pmatrix} \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex] -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex] 0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} \tfrac{11}{3} \\[0.5ex] -\tfrac{4}{3} \\[0.5ex] -\tfrac{2}{3} \end{pmatrix}.$$

Whence,

$$w_0 = \frac{11}{3} \quad\text{and}\quad w = \left(-\frac{4}{3},\, -\frac{2}{3}\right)^T.$$

[Figure: the four feature vectors plotted in the $(x_1, x_2)$ plane, together with the decision boundary determined by these weights.]
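As a quick numerical check (not part of the Duda, Hart, and Stork example itself), the same numbers can be reproduced with a few lines of NumPy:

    import numpy as np

    # Rows are the normalized, augmented vectors x'_i = l_i * (1, x_i^T).
    X = np.array([[ 1.0,  1.0,  2.0],
                  [ 1.0,  2.0,  0.0],
                  [-1.0, -3.0, -1.0],
                  [-1.0, -2.0, -3.0]])
    b = np.ones(4)

    w_hat = np.linalg.solve(X.T @ X, X.T @ b)
    print(w_hat)   # approximately [ 3.6667 -1.3333 -0.6667], i.e. (11/3, -4/3, -2/3)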
Method of Steepest Descent

An alternative approach is the method of steepest descent.

We begin by reviewing Taylor's theorem for functions of more than one variable: let $x \in \mathbb{R}^n$, and $f : \mathbb{R}^n \to \mathbb{R}$, so

f (x) = f (x1, x2, . . . , xn) ∈ R.

Now let δx ∈ Rn, and consider

f (x + δx) = f (x1 + δx1, . . . , xn + δxn).

Define F : R → R, such that,

F (s) = f (x + s δx).

Thus,

F (0) = f (x), and, F (1) = f (x + δx).


Method of Steepest Descent (cont.)

Taylor’s theorem for a single variable (Math 21/22),


$$F(s) = F(0) + \frac{1}{1!} F'(0)\, s + \frac{1}{2!} F''(0)\, s^2 + \frac{1}{3!} F'''(0)\, s^3 + \cdots.$$

Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \delta x)$, $F(0)$ by $f(x)$, etc.
To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,

$$\frac{d}{ds} f\big(u(s), v(s)\big) = \frac{\partial f}{\partial u}(u, v)\, u'(s) + \frac{\partial f}{\partial v}(u, v)\, v'(s).$$

Thus,

$$
\begin{aligned}
F'(s) = \frac{dF}{ds}(s) &= \frac{d}{ds} f(x_1 + s\,\delta x_1, \ldots, x_n + s\,\delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\,\delta x)\, \frac{d}{ds}(x_1 + s\,\delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\delta x)\, \frac{d}{ds}(x_n + s\,\delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\,\delta x)\, \delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\delta x)\, \delta x_n.
\end{aligned}
$$
Method of Steepest Descent (cont.)

Thus,
$$F'(0) = \frac{\partial f}{\partial x_1}(x)\, \delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\, \delta x_n = \nabla f(x)^T \delta x.$$
Method of Steepest Descent (cont.)

Thus, it is possible to show

 
$$
\begin{aligned}
f(x + \delta x) &= f(x) + \nabla f(x)^T \delta x + O\!\left(\|\delta x\|^2\right) \\
&= f(x) + \|\nabla f(x)\|\,\|\delta x\| \cos\theta + O\!\left(\|\delta x\|^2\right),
\end{aligned}
$$

where $\theta$ defines the angle between $\nabla f(x)$ and $\delta x$. If $\|\delta x\| \ll 1$, then

$$\delta f = f(x + \delta x) - f(x) \approx \|\nabla f(x)\|\,\|\delta x\| \cos\theta.$$

Thus, the greatest reduction $\delta f$ occurs if $\cos\theta = -1$, that is if $\delta x = -\eta \nabla f$, where $\eta > 0$. We thus seek a local minimum of the LMS objective function by taking a sequence of steps

$$\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big(\hat{w}(t)\big).$$
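In code, the resulting descent loop is only a few lines. The following sketch (Python/NumPy; the step size eta, iteration cap, and stopping tolerance are illustrative choices, not values prescribed by the notes) minimizes the LMS objective in its matrix form:

    import numpy as np

    def steepest_descent_lms(X, b, eta=0.01, max_steps=10000, tol=1.0e-8):
        # Minimize E(w) = ||X w - b||^2 by repeated steps w <- w - eta * grad E(w).
        w = np.zeros(X.shape[1])
        for _ in range(max_steps):
            grad = 2.0 * X.T @ (X @ w - b)      # gradient of the LMS objective
            w_next = w - eta * grad             # steepest-descent step
            if np.linalg.norm(w_next - w) < tol:
                break
            w = w_next
        return w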
Training an LTU using Steepest Descent

We now return to our original problem. Given a dichotomy

$$X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\}$$

of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, 1\}$ for $i = 1, \ldots, m$, we construct


the set of normalized, augmented feature vectors

$$\hat{X}'_m = \left\{ (\ell_i,\, \ell_i x_{i,1},\, \ldots,\, \ell_i x_{i,n})^T \in \mathbb{R}^{n+1} \;\middle|\; i = 1, \ldots, m \right\}.$$

Given a margin vector, b ∈ Rm , with bi > 0 for i = 1, . . . , m, we construct the


LMS objective function,
$$E(\hat{w}) = \frac{1}{2} \sum_{i=1}^m \left(\hat{w}^T \hat{x}'_i - b_i\right)^2 = \frac{1}{2}\, \|X\hat{w} - b\|^2,$$

(the factor of $\frac{1}{2}$ is added with some foresight), and then evaluate its gradient

$$\nabla E(\hat{w}) = \sum_{i=1}^m \left(\hat{w}^T \hat{x}'_i - b_i\right) \hat{x}'_i = X^T (X\hat{w} - b).$$
Training an LTU using Steepest Descent (cont.)

Substitution into the steepest descent update rule,


 
$$\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big(\hat{w}(t)\big),$$

yields the batch LMS update rule,

$$\hat{w}(t+1) = \hat{w}(t) + \eta \sum_{i=1}^m \left(b_i - \hat{w}(t)^T \hat{x}'_i\right) \hat{x}'_i.$$
Alternatively, one can abstract the sequential LMS , or Widrow-Hoff rule, from the
above:
 
$$\hat{w}(t+1) = \hat{w}(t) + \eta\left(b - \hat{w}(t)^T \hat{x}'(t)\right) \hat{x}'(t),$$

where $\hat{x}'(t) \in \hat{X}'_m$ is the element of the dichotomy that is presented to the LTU at epoch $t$. (Here, we assume that $b$ is fixed; otherwise, replace it by $b(t)$.)
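For illustration (these helper functions are not from the notes), the batch and sequential updates can be written side by side in Python/NumPy, where X_norm holds the normalized, augmented vectors $\hat{x}'_i$ as rows, b is the margin vector, and eta is a small positive learning rate:

    import numpy as np

    def batch_lms_step(w_hat, X_norm, b, eta):
        # One batch LMS step: w <- w + eta * sum_i (b_i - w^T x'_i) x'_i.
        errors = b - X_norm @ w_hat               # residual for every pattern
        return w_hat + eta * (X_norm.T @ errors)

    def widrow_hoff_step(w_hat, x_prime, b_i, eta):
        # One sequential (Widrow-Hoff) step using a single presented pattern x'(t).
        return w_hat + eta * (b_i - w_hat @ x_prime) * x_prime

Note that the sequential step touches only the current weight vector and the single presented pattern, which is the property emphasized in the remarks on the next slide.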


Sequential LMS Rule

The sequential LMS rule


$$\hat{w}(t+1) = \hat{w}(t) + \eta\left[\,b - \hat{w}(t)^T \hat{x}'(t)\,\right] \hat{x}'(t),$$

resembles the sequential perceptron rule,

$$\hat{w}(t+1) = \hat{w}(t) + \frac{\eta}{2}\left[\,1 - \operatorname{sgn}\!\left(\hat{w}(t)^T \hat{x}'(t)\right)\right] \hat{x}'(t).$$

Each quantity enclosed in red brackets $[\,\cdots\,]$ measures the current classification error of $\hat{x}'(t)$ in a particular way.

Sequential rules are well suited to real-time implementations, as only the current
values for the weights, i.e. the configuration of the LTU itself, need to be stored.
They also work with dichotomies of infinite sets.
Convergence of the batch LMS rule

Recall, the LMS objective function in the form


$$E(\hat{w}) = \frac{1}{2}\, \|X\hat{w} - b\|^2,$$

has as its gradient,

$$\nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T (X\hat{w} - b).$$

Substitution into the rule of steepest descent,



$$\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big(\hat{w}(t)\big),$$

yields,

$$\hat{w}(t+1) = \hat{w}(t) - \eta\, X^T \left(X \hat{w}(t) - b\right).$$

Convergence of the batch LMS rule (cont.)

The algorithm is said to converge to a fixed point $\hat{w}^\star$ if, for every finite initial value $\|\hat{w}(0)\| < \infty$,

$$\lim_{t\to\infty} \hat{w}(t) = \hat{w}^\star.$$

The fixed points $\hat{w}^\star$ satisfy $\nabla E(\hat{w}^\star) = 0$, whence

$$X^T X \hat{w}^\star = X^T b.$$

The update rule becomes,

$$
\begin{aligned}
\hat{w}(t+1) &= \hat{w}(t) - \eta\left(X^T X \hat{w}(t) - X^T b\right) \\
&= \hat{w}(t) - \eta\, X^T X \left(\hat{w}(t) - \hat{w}^\star\right).
\end{aligned}
$$

Let $\delta\hat{w}(t) \triangleq \hat{w}(t) - \hat{w}^\star$. Then,

$$\delta\hat{w}(t+1) = \delta\hat{w}(t) - \eta\, X^T X\, \delta\hat{w}(t) = \left(I - \eta\, X^T X\right) \delta\hat{w}(t).$$
Convergence of the batch LMS rule (cont.)

Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that $\|\delta\hat{w}(t+1)\| < \|\delta\hat{w}(t)\|$. Inspecting the update rule,

$$\delta\hat{w}(t+1) = \left(I - \eta\, X^T X\right) \delta\hat{w}(t),$$

this reduces to the condition that all the eigenvalues of

$$I - \eta\, X^T X$$

have magnitudes less than 1.

We will now evaluate the eigenvalues of the above matrix.


Convergence of the batch LMS rule (cont.)

Let S ∈ R(n+1)×(n+1) denote the similarity transform that reduces the symmet-
ric matrix X TX to a diagonal matrix Λ ∈ R(n+1)×(n+1). Thus, S TS = SS T = I,
and

$$S\, X^T X\, S^T = \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n).$$

The numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ represent the eigenvalues of $X^T X$. Note that $0 \le \lambda_i$ for $i = 0, 1, \ldots, n$. Thus,

$$
\begin{aligned}
S\, \delta\hat{w}(t+1) &= S\left(I - \eta\, X^T X\right) S^T S\, \delta\hat{w}(t) \\
&= (I - \eta\, \Lambda)\, S\, \delta\hat{w}(t).
\end{aligned}
$$

Thus convergence occurs if

$$\|S\, \delta\hat{w}(t+1)\| < \|S\, \delta\hat{w}(t)\|,$$

which occurs if the eigenvalues of $I - \eta\Lambda$ all have magnitudes less than one.
Convergence of the batch LMS rule (cont.)

The eigenvalues of $I - \eta\Lambda$ equal $1 - \eta\lambda_i$, for $i = 0, 1, \ldots, n$. (These are in fact the eigenvalues of $I - \eta\, X^T X$.) Thus, we require that

$$-1 < 1 - \eta\lambda_i < 1, \quad\text{or}\quad 0 < \eta\lambda_i < 2,$$

for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that

$$0 < \eta < \frac{2}{\lambda_{\max}}.$$
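To illustrate the bound (a sketch only; it reuses the example matrix X from earlier in the chapter, and the two test step sizes are arbitrary choices), one can compute $\lambda_{\max}$ numerically and observe how the batch iteration behaves on either side of $2/\lambda_{\max}$:

    import numpy as np

    X = np.array([[ 1.0,  1.0,  2.0],
                  [ 1.0,  2.0,  0.0],
                  [-1.0, -3.0, -1.0],
                  [-1.0, -2.0, -3.0]])
    b = np.ones(4)

    lam_max = np.max(np.linalg.eigvalsh(X.T @ X))   # largest eigenvalue of X^T X
    eta_limit = 2.0 / lam_max

    def run_batch_lms(eta, steps=2000):
        w = np.zeros(3)
        for _ in range(steps):
            w = w - eta * X.T @ (X @ w - b)          # gradient step for (1/2)||Xw - b||^2
        return w

    print(run_batch_lms(0.9 * eta_limit))   # expected to settle near (11/3, -4/3, -2/3)
    print(run_batch_lms(1.1 * eta_limit))   # expected to diverge (entries grow very large)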
