[Figure: a linear threshold unit with inputs $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$.]
\[
y = \mathrm{sgn}\big(w^T x + w_0\big) =
\begin{cases}
\phantom{-}1, & \text{if } w^T x + w_0 > 0,\\
-1, & \text{if } w^T x + w_0 \le 0.
\end{cases}
\]
We seek weights $w$ and $w_0$ such that
\[
\mathrm{sgn}\big(w^T x_i + w_0\big) = \ell_i, \quad\text{for } i = 1, 2, \dots, m.
\]
Equivalently, using homogeneous coordinates (or augmented feature vectors),
\[
\mathrm{sgn}\big(\hat{w}^T \hat{x}_i\big) = \ell_i, \quad\text{for } i = 1, 2, \dots, m,
\]
where $\hat{x}_i = \big[1, x_i^T\big]^T$ and $\hat{w} = \big[w_0, w^T\big]^T$.
Multiplying each sample by its label, this becomes
\[
\mathrm{sgn}\big(\hat{w}^T \hat{x}'_i\big) = 1, \quad\text{or,}\quad \hat{w}^T \hat{x}'_i > 0, \quad\text{for } i = 1, 2, \dots, m,
\]
where $\hat{x}'_i = \ell_i\,\hat{x}_i$.
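To make this concrete, here is a minimal NumPy sketch (variable names and sample values are mine; the points are chosen so that they reproduce the matrix $X$ of the worked example later in the slides) that builds the augmented, label-normalized samples:

```python
import numpy as np

# Sample points and their labels (illustrative values).
x = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [2.0, 3.0]])
ell = np.array([1, 1, -1, -1])

# Augment each sample with a leading 1 (homogeneous coordinates)...
x_hat = np.hstack([np.ones((len(x), 1)), x])
# ...and multiply by its label, so correct classification means w_hat . x' > 0.
x_hat_prime = ell[:, None] * x_hat

print(x_hat_prime)   # rows: (1,1,2), (1,2,0), (-1,-3,-1), (-1,-2,-3)
```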
LMS Objective Function (cont.)
The LMS approach replaces the inequalities with the equations
\[
\hat{w}^T \hat{x}'_i = b_i, \quad i = 1, 2, \dots, m,
\]
for chosen positive targets $b_i$, or, in matrix form, $X\hat{w} = b$, where the rows of $X \in \mathbb{R}^{m\times(n+1)}$ are the $\hat{x}'^T_i$ and $b = (b_1, \dots, b_m)^T$. We then minimize
\begin{align*}
E(\hat{w}) &= \|X\hat{w} - b\|^2\\
&= (X\hat{w} - b)^T (X\hat{w} - b)\\
&= \big(\hat{w}^T X^T - b^T\big)(X\hat{w} - b)\\
&= \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b\\
&= \hat{w}^T X^T X \hat{w} - 2\,b^T X \hat{w} + \|b\|^2,
\end{align*}
since the scalar $\hat{w}^T X^T b$ equals $b^T X \hat{w}$.
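A quick numerical sanity check of this expansion, as a sketch with arbitrary data (all names and values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # arbitrary 5 samples, 3 augmented features
b = rng.normal(size=5)
w = rng.normal(size=3)

direct = np.linalg.norm(X @ w - b) ** 2
expanded = w @ X.T @ X @ w - 2 * b @ X @ w + np.linalg.norm(b) ** 2
assert np.isclose(direct, expanded)   # both forms of E(w) agree
```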
Minimizing the LMS Objective Function
How do we minimize
\[
E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\,b^T X \hat{w} + \|b\|^2\,?
\]
Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$ and algebraically determine a value $\hat{w}^\star$ which makes each component vanish.
That is, solve
\[
\nabla E(\hat{w}) =
\begin{bmatrix}
\dfrac{\partial E}{\partial \hat{w}_0}\\[4pt]
\dfrac{\partial E}{\partial \hat{w}_1}\\
\vdots\\[2pt]
\dfrac{\partial E}{\partial \hat{w}_n}
\end{bmatrix}
= 0.
\]
Differentiating the quadratic form gives
\[
\nabla E(\hat{w}) = 2X^TX\hat{w} - 2X^Tb.
\]
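This closed-form gradient can be checked against finite differences; a minimal sketch (names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
b = rng.normal(size=5)
w = rng.normal(size=3)

E = lambda v: np.linalg.norm(X @ v - b) ** 2
grad = 2 * X.T @ X @ w - 2 * X.T @ b   # the closed-form gradient above

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(grad, fd, atol=1e-5)
```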
Minimizing the LMS Objective Function (cont.)
Thus,
\[
\nabla E(\hat{w}) = 2X^TX\hat{w} - 2X^Tb = 0
\]
if
\[
\hat{w}^\star = (X^TX)^{-1}X^Tb = X^\dagger b,
\]
where
\[
X^\dagger \triangleq (X^TX)^{-1}X^T \in \mathbb{R}^{(n+1)\times m}
\]
is the pseudoinverse of $X$. More generally, the pseudoinverse is defined by the limit
\[
X^\dagger \triangleq \lim_{\epsilon \to 0}\,(X^TX + \epsilon I)^{-1}X^T,
\]
which exists even when $X^TX$ is singular. Note that
\[
X^\dagger X = (X^TX)^{-1}X^TX = I.
\]
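Both characterizations of $X^\dagger$ are easy to verify numerically. A sketch, assuming $X$ has full column rank so that $X^TX$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))   # tall random matrix: full column rank (w.h.p.)

pinv_normal = np.linalg.inv(X.T @ X) @ X.T                      # (X^T X)^-1 X^T
pinv_limit = np.linalg.inv(X.T @ X + 1e-10 * np.eye(3)) @ X.T   # small-epsilon limit
pinv_np = np.linalg.pinv(X)                                     # library pseudoinverse

assert np.allclose(pinv_normal, pinv_np)
assert np.allclose(pinv_limit, pinv_np, atol=1e-6)
assert np.allclose(pinv_normal @ X, np.eye(3))                  # X† X = I
```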
Example
For the dichotomy with positive samples $x_1 = (1, 2)^T$, $x_2 = (2, 0)^T$ and negative samples $x_3 = (3, 1)^T$, $x_4 = (2, 3)^T$, we obtain
\[
X = \begin{bmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{bmatrix}.
\]
Whence,
\[
X^TX = \begin{bmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{bmatrix},
\quad\text{and}\quad
X^\dagger = (X^TX)^{-1}X^T =
\begin{bmatrix}
\frac{5}{4} & \frac{13}{12} & \frac{3}{4} & \frac{7}{12}\\[3pt]
-\frac{1}{2} & -\frac{1}{6} & -\frac{1}{2} & -\frac{1}{6}\\[3pt]
0 & -\frac{1}{3} & 0 & -\frac{1}{3}
\end{bmatrix}.
\]
Example (cont.)
Whence, taking $b = (1, 1, 1, 1)^T$,
\[
\hat{w}^\star = X^\dagger b, \quad\text{i.e.,}\quad w_0 = \frac{11}{3} \quad\text{and}\quad w = \left(-\frac{4}{3},\, -\frac{2}{3}\right)^T.
\]
[Figure: the sample points and the resulting decision boundary $w^Tx + w_0 = 0$ in the $(x_1, x_2)$ plane.]
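The numbers in this example can be reproduced directly; a sketch assuming, as the result suggests, that $b$ was taken to be the all-ones vector:

```python
import numpy as np
from fractions import Fraction

X = np.array([[1, 1, 2],
              [1, 2, 0],
              [-1, -3, -1],
              [-1, -2, -3]], dtype=float)
b = np.ones(4)   # assumed target vector

print(X.T @ X)   # [[4, 8, 6], [8, 18, 11], [6, 11, 14]]
w_hat = np.linalg.pinv(X) @ b
print([Fraction(float(v)).limit_denominator(100) for v in w_hat])  # 11/3, -4/3, -2/3
```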
Method of Steepest Descent
We begin by recalling Taylor's theorem for functions of more than one variable: let $x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$, and define
\[
F(s) = f(x + s\,\delta x).
\]
Thus,
\begin{align*}
F'(s) = \frac{dF}{ds}(s) &= \frac{d}{ds}\,f(x_1 + s\,\delta x_1, \dots, x_n + s\,\delta x_n)\\
&= \frac{\partial f}{\partial x_1}(x + s\,\delta x)\,\frac{d}{ds}(x_1 + s\,\delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\delta x)\,\frac{d}{ds}(x_n + s\,\delta x_n)\\
&= \frac{\partial f}{\partial x_1}(x + s\,\delta x)\cdot\delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\delta x)\cdot\delta x_n.
\end{align*}
Method of Steepest Descent (cont.)
Thus,
\[
F'(0) = \frac{\partial f}{\partial x_1}(x)\cdot\delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\cdot\delta x_n = \nabla f(x)^T\delta x.
\]
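The identity $F'(0) = \nabla f(x)^T\delta x$ is easy to confirm numerically; a sketch with an arbitrary smooth $f$ of my choosing:

```python
import numpy as np

# An arbitrary smooth f : R^2 -> R and its gradient, for illustration only.
f = lambda x: x[0] ** 2 + np.sin(x[1])
grad_f = lambda x: np.array([2 * x[0], np.cos(x[1])])

x = np.array([1.0, 0.5])
dx = np.array([0.3, -0.2])

s = 1e-6
F_prime_0 = (f(x + s * dx) - f(x - s * dx)) / (2 * s)  # finite-difference F'(0)
assert np.isclose(F_prime_0, grad_f(x) @ dx)
```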
Method of Steepest Descent (cont.)
By Taylor's theorem,
\begin{align*}
f(x + \delta x) &= f(x) + \nabla f(x)^T\delta x + O\big(\|\delta x\|^2\big)\\
&= f(x) + \|\nabla f(x)\|\,\|\delta x\|\cos\theta + O\big(\|\delta x\|^2\big),
\end{align*}
where $\theta$ is the angle between $\nabla f(x)$ and $\delta x$. If $\|\delta x\| \ll 1$, then the first-order term dominates, and $f$ decreases most rapidly when $\cos\theta = -1$, that is, when $\delta x$ points in the direction of $-\nabla f(x)$.
Training an LTU using Steepest Descent
To train an LTU by steepest descent, we define the objective
\[
E(\hat{w}) = \frac{1}{2}\sum_{i=1}^{m}\big(\hat{w}^T\hat{x}'_i - b_i\big)^2
\]
(the factor of $\frac{1}{2}$ is added with some foresight), and then evaluate its gradient:
\[
\nabla E(\hat{w}) = \sum_{i=1}^{m}\big(\hat{w}^T\hat{x}'_i - b_i\big)\,\hat{x}'_i = X^T(X\hat{w} - b).
\]
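The sum form and the matrix form of this gradient agree, as a quick sketch confirms (random data, names mine):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))   # rows are the x_hat'_i
b = rng.normal(size=4)
w = rng.normal(size=3)

# Sum form: sum_i (w^T x'_i - b_i) x'_i
grad_sum = sum((w @ xi - bi) * xi for xi, bi in zip(X, b))
# Matrix form: X^T (X w - b)
grad_mat = X.T @ (X @ w - b)
assert np.allclose(grad_sum, grad_mat)
```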
Training an LTU using Steepest Descent (cont.)
The sequential (online) version of the rule updates the weights after each sample presentation:
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\,\big(\hat{w}(t)^T\hat{x}'(t) - b(t)\big)\,\hat{x}'(t),
\]
where $\hat{x}'(t) \in \hat{X}'_m$ is the element of the dichotomy that is presented to the LTU at step $t$.
Sequential rules are well suited to real-time implementations, as only the current
values for the weights, i.e. the configuration of the LTU itself, need to be stored.
They also work with dichotomies of infinite sets.
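A minimal sketch of a sequential rule of this kind (the learning rate, cyclic presentation order, and stopping criterion are my choices; the slides do not specify them):

```python
import numpy as np

def sequential_lms(X, b, eta=0.01, epochs=10000):
    """One-sample-at-a-time LMS: update the weights after each presentation."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, bi in zip(X, b):              # present samples cyclically
            w -= eta * (w @ xi - bi) * xi     # sequential LMS update
    return w

X = np.array([[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]], dtype=float)
b = np.ones(4)
print(sequential_lms(X, b))   # approaches (11/3, -4/3, -2/3)
```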
Convergence of the batch LMS rule
With the factor of $\frac{1}{2}$, the gradient is
\[
\nabla E(\hat{w}) = X^TX\hat{w} - X^Tb = X^T(X\hat{w} - b).
\]
Substituting this into the steepest-descent update rule yields
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\,X^T\big(X\hat{w}(t) - b\big).
\]
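The batch rule in code; a minimal sketch using the example's data:

```python
import numpy as np

def batch_lms(X, b, eta, steps=5000):
    """Steepest descent on E(w) = 0.5 * ||X w - b||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - b)   # w(t+1) = w(t) - eta X^T (X w(t) - b)
    return w

X = np.array([[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]], dtype=float)
b = np.ones(4)
print(batch_lms(X, b, eta=0.05))   # converges to (11/3, -4/3, -2/3)
```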
Convergence of the batch LMS rule (cont.)
We require that
\[
\lim_{t\to\infty}\hat{w}(t) = \hat{w}^\star.
\]
The fixed points $\hat{w}^\star$ satisfy $\nabla E(\hat{w}^\star) = 0$, whence
\[
X^TX\hat{w}^\star = X^Tb.
\]
Therefore,
\begin{align*}
\hat{w}(t+1) &= \hat{w}(t) - \eta\,X^T\big(X\hat{w}(t) - b\big)\\
&= \hat{w}(t) - \eta\,X^TX\big(\hat{w}(t) - \hat{w}^\star\big),
\end{align*}
using $X^Tb = X^TX\hat{w}^\star$. Let $\delta\hat{w}(t) \triangleq \hat{w}(t) - \hat{w}^\star$. Then,
\[
\delta\hat{w}(t+1) = \delta\hat{w}(t) - \eta\,X^TX\,\delta\hat{w}(t) = \big(I - \eta\,X^TX\big)\,\delta\hat{w}(t).
\]
Convergence of the batch LMS rule (cont.)
Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that $\|\delta\hat{w}(t+1)\| < \|\delta\hat{w}(t)\|$. Inspecting the update rule,
\[
\delta\hat{w}(t+1) = \big(I - \eta\,X^TX\big)\,\delta\hat{w}(t),
\]
convergence is governed by the eigenvalues of $I - \eta\,X^TX$.
Let $S \in \mathbb{R}^{(n+1)\times(n+1)}$ denote the (orthogonal) similarity transform that reduces the symmetric matrix $X^TX$ to a diagonal matrix $\Lambda \in \mathbb{R}^{(n+1)\times(n+1)}$. Thus, $S^TS = SS^T = I$, and
\[
S\,X^TX\,S^T = \Lambda = \mathrm{diag}(\lambda_0, \lambda_1, \dots, \lambda_n).
\]
The numbers $\lambda_0, \lambda_1, \dots, \lambda_n$ represent the eigenvalues of $X^TX$; note that $\lambda_i \ge 0$ for $i = 0, 1, \dots, n$, since $X^TX$ is positive semidefinite. Because $S$ is orthogonal, $\|Sv\| = \|v\|$, and $S\,\delta\hat{w}(t+1) = (I - \eta\Lambda)\,S\,\delta\hat{w}(t)$. Thus we require
\[
\|S\,\delta\hat{w}(t+1)\| < \|S\,\delta\hat{w}(t)\|,
\]
which occurs if the eigenvalues of $I - \eta\Lambda$ all have magnitudes less than one.
Convergence of the batch LMS rule (cont.)
That is,
\[
|1 - \eta\lambda_i| < 1
\]
for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n}\lambda_i$ denote the largest eigenvalue of $X^TX$; then convergence requires that
\[
0 < \eta < \frac{2}{\lambda_{\max}}.
\]
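This bound is easy to demonstrate numerically: step sizes just below $2/\lambda_{\max}$ contract the error, while step sizes just above it diverge. A sketch using the example's data:

```python
import numpy as np

X = np.array([[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]], dtype=float)
b = np.ones(4)
w_star = np.linalg.pinv(X) @ b

lam_max = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X^T X
print(f"2/lam_max = {2 / lam_max:.4f}")

def error_after(eta, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - b)
    return np.linalg.norm(w - w_star)

print(error_after(0.9 * 2 / lam_max))   # tiny: converges
print(error_after(1.1 * 2 / lam_max))   # huge: diverges
```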