Neural network models

• Artificial neural networks provide a ‘good’ parameterized class of nonlinear functions for learning nonlinear classifiers.
• Nonlinear functions are built up through composition of summations and sigmoids.
• Useful for both classification and regression.
• We looked at such models in general.
• We will now consider feedforward models in greater detail.

• As we have seen, a single neuron ‘represents’ a class of functions from $\Re^m$ to $\Re$.
• A specific set of weights realizes a specific function.
• By interconnecting many units/neurons, networks can represent more complicated functions from $\Re^m$ to $\Re^{m'}$.
• The architecture constrains the function class that can be represented; the weights define a specific function in that class.
• To form meaningful networks, nonlinearity of the activation function is important.
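• A minimal sketch of such a unit in Python, assuming a sigmoid activation; the weights and input below are arbitrary example values:

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-eta))

def neuron(x, w, b=0.0):
    """A single unit: a weighted sum of the inputs passed through the sigmoid."""
    return sigmoid(np.dot(w, x) + b)

# A specific choice of weights realizes one specific function from R^3 to R.
w = np.array([0.5, -1.0, 2.0])   # assumed example weights
x = np.array([1.0, 0.0, 1.0])    # assumed example input
print(neuron(x, w))
```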

• Feedforward networks can always be organized as layered networks.

• The function represented by the network in the figure (a small layered network with input nodes 1 and 2, hidden nodes 3 and 4, and output nodes 5 and 6) can be written as
$$y_5 = f_5(w_{35} y_3 + w_{45} y_4)$$
$$= f_5\big(w_{35} f_3(w_{13} y_1 + w_{23} y_2) + w_{45} f_4(w_{14} y_1 + w_{24} y_2)\big)$$
$$= f_5\big(w_{35} f_3(w_{13} x_1 + w_{23} x_2) + w_{45} f_4(w_{14} x_1 + w_{24} x_2)\big)$$
• We can similarly write $y_6$ as a function of $x_1, x_2$.
• We now start with a general notation for a multilayer feedforward network.
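• As a concrete sketch of this composition (the numeric weights are arbitrary example values; the connections into node 6 are named $w_{36}, w_{46}$ by the same convention):

```python
import numpy as np

def f(eta):
    """Sigmoid activation, used here for all the f_i."""
    return 1.0 / (1.0 + np.exp(-eta))

# Assumed example weights for the small 2-2-2 network.
w13, w23, w14, w24 = 0.1, -0.4, 0.7, 0.2   # input layer  -> hidden layer
w35, w45, w36, w46 = 0.5, -0.3, 0.8, 0.6   # hidden layer -> output layer

def network(x1, x2):
    y3 = f(w13 * x1 + w23 * x2)
    y4 = f(w14 * x1 + w24 * x2)
    y5 = f(w35 * y3 + w45 * y4)
    y6 = f(w36 * y3 + w46 * y4)
    return y5, y6

print(network(1.0, -1.0))
```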

Multilayer feedforward networks

• Here is a general multilayer feedforward network.

Notation

• $L$ – number of layers.
• $n_\ell$ – number of nodes in layer $\ell$, $\ell = 1, \cdots, L$.
• $y_i^\ell$ – output of the $i$-th node in layer $\ell$, $i = 1, \cdots, n_\ell$, $\ell = 1, \cdots, L$.
• $w_{ij}^\ell$ – weight of the connection from node-$i$, layer-$\ell$ to node-$j$, layer-$(\ell+1)$.
• $\eta_i^\ell$ – net input of node-$i$ in layer-$\ell$.
• Our network represents a function from $\Re^{n_1}$ to $\Re^{n_L}$.
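• In code, this notation can be mirrored by storing one weight matrix per pair of adjacent layers; a sketch with an assumed example architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example architecture: L = 4 layers with sizes n_1, ..., n_4.
layer_sizes = [3, 5, 4, 2]       # n_1 = 3 inputs, n_L = 2 outputs

# W[k] collects the weights w_ij^l for l = k + 1 (1-based layers), i.e. the
# connections from layer l to layer l + 1, as an (n_l x n_{l+1}) matrix.
W = [rng.normal(scale=0.1, size=(n_in, n_out))
     for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print([w.shape for w in W])      # [(3, 5), (5, 4), (4, 2)]
```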

• The input layer gets the external inputs.
• The outputs of the output layer are the outputs of the network.
• If we want to learn a function from $\Re^m$ to $\Re^{m'}$, then we would have $n_1 = m$ and $n_L = m'$.
• The nodes in the input layer simply fan out the inputs.
• All other nodes calculate their net input as a weighted sum of their inputs and pass it through an activation function to compute their outputs.

• Let $X = [x_1, \cdots, x_{n_1}]^T$ represent the external inputs to the network.
• The input layer is special and only fans out the inputs. Hence we take
$$y_i^1 = x_i, \quad i = 1, \cdots, n_1$$
• From layer-2 onwards, the units in each layer successively compute their outputs.

• From layer-2 onwards, each unit computes its net input as a weighted sum of its inputs.
• This is passed through an activation function to get its output.
• We assume all units use the same activation function.
• The outputs of the units in the output layer are the final outputs of the network.

• The output of a typical unit is computed as
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}, \qquad y_j^\ell = f(\eta_j^\ell)$$
• If we want to include a bias input, then
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1} + b_j^\ell$$
• We can think of the bias as an extra input and write
$$\eta_j^\ell = \sum_{i=0}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}$$
where, by notation, $w_{0j}^{\ell-1} = b_j^\ell$ and $y_0^\ell = +1$ for all $\ell$.
• This can be shown in a figure by drawing the bias as an extra input node whose value is fixed at $+1$.
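• A minimal sketch of this bias-as-extra-input trick, with assumed example sizes; both routines compute the same net inputs:

```python
import numpy as np

def net_input_with_bias(y_prev, W, b):
    """eta_j = sum_i w_ij * y_i + b_j, written explicitly."""
    return y_prev @ W + b

def net_input_augmented(y_prev, W_aug):
    """The same computation with the bias folded in as weight row 0
    and a constant +1 prepended to the previous layer's outputs."""
    y_aug = np.concatenate(([1.0], y_prev))
    return y_aug @ W_aug

# Assumed example: 3 units feeding 2 units.
y_prev = np.array([0.2, 0.7, 0.1])
W = np.array([[0.5, -0.1],
              [0.3,  0.8],
              [-0.2, 0.4]])
b = np.array([0.1, -0.3])
W_aug = np.vstack([b, W])           # row 0 plays the role of w_0j = b_j

print(net_input_with_bias(y_prev, W, b))
print(net_input_augmented(y_prev, W_aug))   # identical result
```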

• Consider the general network.

• Given inputs $x_1, \cdots, x_{n_1}$, the outputs are computed as follows.
• For the input layer: $y_i^1 = x_i$, $i = 1, \cdots, n_1$.
• For $\ell = 2, \cdots, L$, we now compute
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}, \qquad y_j^\ell = f(\eta_j^\ell)$$
• The $y_1^L, \cdots, y_{n_L}^L$ form the final outputs of the network.
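• A short sketch of this forward computation, assuming sigmoid activations and one weight matrix per layer; the architecture and inputs are arbitrary examples:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def forward(x, weights):
    """Compute y^1, ..., y^L for a layered feedforward network.

    weights[k] is the (n_k x n_{k+1}) matrix of weights connecting
    layer k+1 to layer k+2 (0-based list index, 1-based layers).
    Returns the list of layer outputs; the last entry is the network output.
    """
    y = [np.asarray(x, dtype=float)]        # y^1 = x: the input layer fans out
    for W in weights:
        eta = y[-1] @ W                     # net inputs of the next layer
        y.append(sigmoid(eta))              # outputs of the next layer
    return y

# Assumed example: a 3-4-2 network with random weights.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
outputs = forward([0.5, -1.0, 2.0], weights)
print(outputs[-1])                          # y^L, the final outputs
```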

• The network represents functions from $\Re^{n_1}$ to $\Re^{n_L}$.
• To get a specific function we need to learn appropriate weights.
• Thus, $w_{ij}^\ell$, $i = 0, \cdots, n_\ell$, $j = 1, \cdots, n_{\ell+1}$, $\ell = 1, \cdots, L-1$, are the parameters to learn.
• Let $W$ represent all these parameters.
• The $y_i^L$ are functions of $W$ and the external inputs $X$, though we may not always show this explicitly in the notation.

• Consider a 2-layer network with a single output node.
• This is like a Perceptron or an AdaLinE:
$$y = f\left(\sum_{j=1}^{m} w_{j1}^1 x_j\right)$$
• In a 3-layer network we will have a hidden layer.
• The hidden layer can be thought of as computing an ‘internal representation’ so that a single unit can then correctly predict the output.
• Let us first consider using neural network models to learn a function.
• Suppose we have training data $\{(X^i, d^i), \ i = 1, \cdots, N\}$, where $X^i = [x_1^i, \cdots, x_m^i]^T \in \Re^m$ and $d^i = [d_1^i, \cdots, d_{m'}^i]^T \in \Re^{m'}$.
• We want to learn a neural network to represent this function.

• We can use an $L$-layer network with $n_1 = m$ and $n_L = m'$.
• $L$ and $n_2, \cdots, n_{L-1}$ are parameters which we fix (for now) arbitrarily.
• We assume that nodes in all layers, including the output layer, use the sigmoid activation function.
• Hence we take $d^i \in [0, 1]^{m'}$, $\forall i$.
• We can always linearly scale the output as needed. (Or we can also use a linear activation function for the output nodes.)
• Let $y^L = [y_1^L, \cdots, y_{m'}^L]^T$ denote the vector of outputs.
• We should actually write $y^L(W, X)$, $y_i^L(W, X)$ and so on.
• Now, the risk minimization framework is as follows.
• We fix a hypothesis space $\mathcal{H}$ by fixing the architecture of the network (that is, by fixing $L$ and $n_\ell$, $\ell = 1, \cdots, L$).
• This hypothesis space is parameterized by $W$.
• We can now choose a loss function and learn the ‘best’ $W$ by minimizing the empirical risk.
• We choose the squared error loss function.
• The empirical risk is given by
$$\hat{R}_N(W) = \frac{1}{N} \sum_{i=1}^{N} \left\| y^L(W, X^i) - d^i \right\|^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{j=1}^{m'} \big( y_j^L(W, X^i) - d_j^i \big)^2 \right)$$
• We want to find a $W$ that is a minimizer of $\hat{R}_N(W)$.
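• A small sketch of this empirical risk, assuming the network outputs have already been computed; the numbers are arbitrary examples:

```python
import numpy as np

def empirical_risk(Y_pred, D):
    """R_hat_N(W) = (1/N) * sum_i || y^L(W, X^i) - d^i ||^2,
    given Y_pred[i] = y^L(W, X^i) and targets D[i] = d^i."""
    Y_pred = np.asarray(Y_pred, dtype=float)
    D = np.asarray(D, dtype=float)
    return np.mean(np.sum((Y_pred - D) ** 2, axis=1))

# Assumed toy values: N = 3 samples with m' = 2 outputs each.
Y_pred = np.array([[0.8, 0.1], [0.4, 0.6], [0.2, 0.9]])
D      = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(empirical_risk(Y_pred, D))
```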

• This is the same as finding $W$ to minimize
$$J(W) = \sum_{i=1}^{N} J_i(W), \quad \text{where} \quad J_i(W) = \frac{1}{2} \sum_{j=1}^{m'} \big( y_j^L(W, X^i) - d_j^i \big)^2$$
• $J_i$ is (one half of) the squared error between the output of the network and the desired output for the training example $X^i$.
• One method of finding a minimizer of $J$ is to use gradient descent.
• This gives us the following learning algorithm:
$$W(t+1) = W(t) - \lambda \, \nabla J(W(t)) = W(t) - \lambda \sum_{i=1}^{N} \nabla J_i(W(t))$$
where $t$ is the iteration count and $\lambda$ is the step-size parameter.
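• A sketch of this iteration; the gradient routine is assumed to be supplied (backpropagation, derived in what follows, is what computes it), and the toy objective only checks the update rule:

```python
import numpy as np

def gradient_descent(W0, grad_J, lam=0.01, num_iters=1000):
    """W(t+1) = W(t) - lambda * grad J(W(t)).

    W0     : list of weight matrices (the parameter W)
    grad_J : function returning a list of gradients, one per weight matrix
    lam    : step-size parameter lambda
    """
    W = [w.copy() for w in W0]
    for t in range(num_iters):
        grads = grad_J(W)                     # e.g. sum of grad J_i over the data
        W = [w - lam * g for w, g in zip(W, grads)]
    return W

# Tiny check on a toy objective J(W) = sum_k ||W_k||^2, whose gradient is 2*W_k.
W0 = [np.array([[1.0, -2.0]]), np.array([[3.0]])]
W_min = gradient_descent(W0, lambda W: [2 * w for w in W], lam=0.1, num_iters=100)
print(W_min)    # both matrices are driven close to zero
```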

• If we consider a 2-layer network with a single output node with a linear activation function, this gradient descent is the LMS algorithm.
• In this case, $J$ is a quadratic function of the weights.
• For a general multilayer network, $J(W)$ is a high-dimensional nonlinear function.
• Gradient descent then gives us only a local minimum.
• However, often that may be the best we can do.

• For gradient descent, we need $J$ to be differentiable.
• This is so if we take our activation function to be differentiable.
• Hence we take the activation function to be the sigmoid.
• We can also take it to be the hyperbolic tangent.
• Unless otherwise specified, we consider multilayer feedforward networks with sigmoidal activation functions.
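• Both choices have convenient derivatives, which the gradient computation below will need: $f'(\eta) = f(\eta)(1 - f(\eta))$ for the sigmoid and $f'(\eta) = 1 - \tanh^2(\eta)$ for the hyperbolic tangent. A small sketch with a numerical check:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sigmoid_prime(eta):
    """f'(eta) = f(eta) * (1 - f(eta)) for the logistic sigmoid."""
    s = sigmoid(eta)
    return s * (1.0 - s)

def tanh_prime(eta):
    """f'(eta) = 1 - tanh(eta)^2 for the hyperbolic tangent."""
    return 1.0 - np.tanh(eta) ** 2

# Quick finite-difference check of both derivatives at a few points.
eta = np.array([-2.0, 0.0, 1.5])
h = 1e-6
print(np.allclose(sigmoid_prime(eta), (sigmoid(eta + h) - sigmoid(eta - h)) / (2 * h)))
print(np.allclose(tanh_prime(eta), (np.tanh(eta + h) - np.tanh(eta - h)) / (2 * h)))
```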

• To completely specify our algorithm for learning the weights, we need an expression for the gradient of $J_i$.
• In terms of the individual weights, the gradient descent is
$$w_{ij}^\ell(t+1) = w_{ij}^\ell(t) - \lambda \sum_{s=1}^{N} \frac{\partial J_s}{\partial w_{ij}^\ell}(W(t))$$
• As in the case of LMS, we can have a batch or an incremental version of the algorithm.
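• A sketch of the two variants; the per-sample gradient routine is assumed to be provided (the backpropagation computation derived next is what supplies it):

```python
def batch_update(W, data, grad_Ji, lam):
    """One batch step: accumulate the gradients of all J_i, then update once."""
    grads = None
    for x_i, d_i in data:
        g_i = grad_Ji(W, x_i, d_i)
        grads = g_i if grads is None else [a + b for a, b in zip(grads, g_i)]
    return [w - lam * g for w, g in zip(W, grads)]

def incremental_update(W, data, grad_Ji, lam):
    """One incremental pass: update the weights after every training sample."""
    for x_i, d_i in data:
        g_i = grad_Ji(W, x_i, d_i)
        W = [w - lam * g for w, g in zip(W, g_i)]
    return W
```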

• Thus, all we need to calculate is the partial derivative of the square of the error, for any training sample, with respect to any weight in the network.
• Because of the structure of the network, there is an efficient way of computing such partial derivatives.
• We look at this computation for any one training sample (and for a general multilayer feedforward network).
• From now on we omit explicit mention of the specific training example.

• Let $y^L = [y_1^L, \cdots, y_{n_L}^L]^T$ be the output of the network and let $d = [d_1, \cdots, d_{n_L}]^T$ be the desired output.
• Let
$$J = \frac{1}{2} \sum_{j=1}^{n_L} (y_j^L - d_j)^2$$
• We need the partial derivative of $J$ with respect to any weight $w_{ij}^\ell$.

• Any weight $w_{ij}^\ell$ can affect $J$ only by affecting the final output of the network.
• In a layered network, the weight $w_{ij}^\ell$ can affect the final output only through its effect on $\eta_j^{\ell+1}$.

• Hence, using the chain rule of differentiation, we have
$$\frac{\partial J}{\partial w_{ij}^\ell} = \frac{\partial J}{\partial \eta_j^{\ell+1}} \, \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell}$$
• Recall that
$$\eta_j^{\ell+1} = \sum_{s=1}^{n_\ell} w_{sj}^\ell y_s^\ell \;\;\Rightarrow\;\; \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell} = y_i^\ell$$

• Define
$$\delta_j^\ell = \frac{\partial J}{\partial \eta_j^\ell}, \quad \forall j, \ell$$
• Now we get
$$\frac{\partial J}{\partial w_{ij}^\ell} = \frac{\partial J}{\partial \eta_j^{\ell+1}} \, \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
• We can get all the needed partial derivatives if we calculate $\delta_j^\ell$ for all nodes.
• We can compute $\delta_j^\ell$ recursively:
$$\delta_j^\ell = \frac{\partial J}{\partial \eta_j^\ell} = \sum_{s=1}^{n_{\ell+1}} \frac{\partial J}{\partial \eta_s^{\ell+1}} \frac{\partial \eta_s^{\ell+1}}{\partial \eta_j^\ell} = \sum_{s=1}^{n_{\ell+1}} \frac{\partial J}{\partial \eta_s^{\ell+1}} \frac{\partial \eta_s^{\ell+1}}{\partial y_j^\ell} \frac{\partial y_j^\ell}{\partial \eta_j^\ell} = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$

• Recall that the partial derivatives are given by
$$\frac{\partial J}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
• For the weights, the range of $\ell$ is $\ell = 1, \cdots, (L-1)$.
• Hence we need $\delta_j^\ell$ for $\ell = 2, \cdots, L$ and all nodes $j$.

• Recall the recursive formula for $\delta_j^\ell$:
$$\delta_j^\ell = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$
• So, we need to first compute $\delta_j^L$.
• By definition, we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L}$$

• We have
$$J = \frac{1}{2} \sum_{j=1}^{n_L} (y_j^L - d_j)^2$$
• Hence we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L} = \frac{\partial J}{\partial y_j^L} \, \frac{\partial y_j^L}{\partial \eta_j^L} = (y_j^L - d_j) f'(\eta_j^L)$$
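• A small sketch of this output-layer computation for sigmoid units, where $f'(\eta_j^L) = y_j^L (1 - y_j^L)$; the output and target vectors are arbitrary examples:

```python
import numpy as np

def output_deltas(y_L, d):
    """delta_j^L = (y_j^L - d_j) * f'(eta_j^L) for sigmoid output units,
    using f'(eta_j^L) = y_j^L * (1 - y_j^L)."""
    y_L = np.asarray(y_L, dtype=float)
    d = np.asarray(d, dtype=float)
    return (y_L - d) * y_L * (1.0 - y_L)

# Assumed example output and target.
print(output_deltas([0.7, 0.2], [1.0, 0.0]))
```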

• Using the above we can compute $\delta_j^L$, $j = 1, \cdots, n_L$.
• Then we can compute $\delta_j^\ell$, $j = 1, \cdots, n_\ell$, for $\ell = (L-1), \cdots, 2$, recursively, using
$$\delta_j^\ell = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$
• We call $\delta_j^\ell$ the ‘error’ at node-$j$, layer-$\ell$.

• Then we compute all the partial derivatives with respect to the weights as
$$\frac{\partial J}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
and hence can update the weights using the gradient descent procedure.
• Note that what we actually need are the partial derivatives of the $J_i$; the $J$ above plays the role of $J_i$ for the (unspecified) training sample under consideration.
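• Putting the pieces together, a compact sketch of the backpropagation computation for one training sample, under the same assumptions as the earlier sketches (sigmoid activations, one weight matrix per layer, no bias terms); the network and data are arbitrary examples:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def backprop_single_sample(x, d, weights):
    """Return dJ_i/dW for one sample, where J_i = 0.5 * ||y^L - d||^2.

    weights[k] is the (n_k x n_{k+1}) weight matrix between layers k+1 and k+2.
    """
    # Forward pass: store every layer's output y^l.
    y = [np.asarray(x, dtype=float)]
    for W in weights:
        y.append(sigmoid(y[-1] @ W))

    # Output-layer error: delta^L = (y^L - d) * f'(eta^L), with f' = y * (1 - y).
    delta = (y[-1] - np.asarray(d, dtype=float)) * y[-1] * (1.0 - y[-1])

    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        # dJ/dw_ij = delta_j^{l+1} * y_i^l  -> an outer product for the whole matrix.
        grads[k] = np.outer(y[k], delta)
        if k > 0:
            # Recursion: delta_j^l = (sum_s delta_s^{l+1} w_js^l) * f'(eta_j^l).
            delta = (weights[k] @ delta) * y[k] * (1.0 - y[k])
    return grads

# Assumed toy example: a 3-4-2 network and one training pair.
rng = np.random.default_rng(3)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
grads = backprop_single_sample([0.5, -1.0, 2.0], [1.0, 0.0], weights)
print([g.shape for g in grads])     # [(3, 4), (4, 2)]
```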
