Neural network models

• Artificial neural networks provide a ‘good’ parameterized class of nonlinear functions for learning nonlinear classifiers.
• Nonlinear functions are built up through composition of summations and sigmoids.
• Useful for both classification and regression.
• We looked at such models in general.
• We will now consider feedforward models in greater detail.

• As we have seen, a single neuron ‘represents’ a class of functions from $\Re^m$ to $\Re$.
• A specific set of weights realizes a specific function.
• By interconnecting many units/neurons, networks can represent more complicated functions from $\Re^m$ to $\Re^{m'}$.
• The architecture constrains the function class that can be represented; the weights define a specific function in that class.
• To form meaningful networks, nonlinearity of the activation function is important.
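• A minimal sketch of such a unit in Python, assuming a sigmoid activation; the weights and input below are arbitrary example values:

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-eta))

def neuron(x, w, b=0.0):
    """A single unit: a weighted sum of the inputs passed through the sigmoid."""
    return sigmoid(np.dot(w, x) + b)

# A specific choice of weights realizes one specific function from R^3 to R.
w = np.array([0.5, -1.0, 2.0])   # assumed example weights
x = np.array([1.0, 0.0, 1.0])    # assumed example input
print(neuron(x, w))
```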

• Feedforward networks can always be organized as layered networks.

• The function represented by the network in the figure (a small layered network with input nodes 1 and 2, hidden nodes 3 and 4, and output nodes 5 and 6) can be written as
$$y_5 = f_5(w_{35} y_3 + w_{45} y_4)$$
$$= f_5\big(w_{35} f_3(w_{13} y_1 + w_{23} y_2) + w_{45} f_4(w_{14} y_1 + w_{24} y_2)\big)$$
$$= f_5\big(w_{35} f_3(w_{13} x_1 + w_{23} x_2) + w_{45} f_4(w_{14} x_1 + w_{24} x_2)\big)$$
• We can similarly write $y_6$ as a function of $x_1, x_2$.
• We now start with a general notation for a multilayer feedforward network.
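• As a concrete sketch of this composition (the numeric weights are arbitrary example values; the connections into node 6 are named $w_{36}, w_{46}$ by the same convention):

```python
import numpy as np

def f(eta):
    """Sigmoid activation, used here for all the f_i."""
    return 1.0 / (1.0 + np.exp(-eta))

# Assumed example weights for the small 2-2-2 network.
w13, w23, w14, w24 = 0.1, -0.4, 0.7, 0.2   # input layer  -> hidden layer
w35, w45, w36, w46 = 0.5, -0.3, 0.8, 0.6   # hidden layer -> output layer

def network(x1, x2):
    y3 = f(w13 * x1 + w23 * x2)
    y4 = f(w14 * x1 + w24 * x2)
    y5 = f(w35 * y3 + w45 * y4)
    y6 = f(w36 * y3 + w46 * y4)
    return y5, y6

print(network(1.0, -1.0))
```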

Multilayer feedforward networks

• Here is a general multilayer feedforward network.

Notation

• $L$ – number of layers.
• $n_\ell$ – number of nodes in layer $\ell$, $\ell = 1, \cdots, L$.
• $y_i^\ell$ – output of the $i$-th node in layer $\ell$, $i = 1, \cdots, n_\ell$, $\ell = 1, \cdots, L$.
• $w_{ij}^\ell$ – weight of the connection from node-$i$, layer-$\ell$ to node-$j$, layer-$(\ell+1)$.
• $\eta_i^\ell$ – net input of node-$i$ in layer-$\ell$.
• Our network represents a function from $\Re^{n_1}$ to $\Re^{n_L}$.
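• In code, this notation can be mirrored by storing one weight matrix per pair of adjacent layers; a sketch with an assumed example architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example architecture: L = 4 layers with sizes n_1, ..., n_4.
layer_sizes = [3, 5, 4, 2]       # n_1 = 3 inputs, n_L = 2 outputs

# W[k] collects the weights w_ij^l for l = k + 1 (1-based layers), i.e. the
# connections from layer l to layer l + 1, as an (n_l x n_{l+1}) matrix.
W = [rng.normal(scale=0.1, size=(n_in, n_out))
     for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print([w.shape for w in W])      # [(3, 5), (5, 4), (4, 2)]
```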

• The input layer gets the external inputs.
• The outputs of the output layer are the outputs of the network.
• If we want to learn a function from $\Re^m$ to $\Re^{m'}$, then we would have $n_1 = m$ and $n_L = m'$.
• The nodes in the input layer simply fan out the inputs.
• All other nodes calculate their net input as a weighted sum of their inputs and pass it through an activation function to compute their outputs.

• Let $X = [x_1, \cdots, x_{n_1}]^T$ represent the external inputs to the network.
• The input layer is special and only fans out the inputs. Hence we take
$$y_i^1 = x_i, \quad i = 1, \cdots, n_1$$
• From layer-2 onwards, the units in each layer successively compute their outputs.

• From layer-2 onwards, each unit computes its net input as a weighted sum of its inputs.
• This is passed through an activation function to get its output.
• We assume all units use the same activation function.
• The outputs of the units in the output layer are the final outputs of the network.

• The output of a typical unit is computed as
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}, \qquad y_j^\ell = f(\eta_j^\ell)$$
• If we want to include a bias input, then
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1} + b_j^\ell$$
• We can think of the bias as an extra input and write
$$\eta_j^\ell = \sum_{i=0}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}$$
where, by notation, $w_{0j}^{\ell-1} = b_j^\ell$ and $y_0^\ell = +1$ for all $\ell$.
• This can be shown in a figure by drawing the bias as an extra input node whose value is fixed at $+1$.
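• A minimal sketch of this bias-as-extra-input trick, with assumed example sizes; both routines compute the same net inputs:

```python
import numpy as np

def net_input_with_bias(y_prev, W, b):
    """eta_j = sum_i w_ij * y_i + b_j, written explicitly."""
    return y_prev @ W + b

def net_input_augmented(y_prev, W_aug):
    """The same computation with the bias folded in as weight row 0
    and a constant +1 prepended to the previous layer's outputs."""
    y_aug = np.concatenate(([1.0], y_prev))
    return y_aug @ W_aug

# Assumed example: 3 units feeding 2 units.
y_prev = np.array([0.2, 0.7, 0.1])
W = np.array([[0.5, -0.1],
              [0.3,  0.8],
              [-0.2, 0.4]])
b = np.array([0.1, -0.3])
W_aug = np.vstack([b, W])           # row 0 plays the role of w_0j = b_j

print(net_input_with_bias(y_prev, W, b))
print(net_input_augmented(y_prev, W_aug))   # identical result
```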

• Consider the general network.

• Given inputs $x_1, \cdots, x_{n_1}$, the outputs are computed as follows.
• For the input layer: $y_i^1 = x_i$, $i = 1, \cdots, n_1$.
• For $\ell = 2, \cdots, L$, we now compute
$$\eta_j^\ell = \sum_{i=1}^{n_{\ell-1}} w_{ij}^{\ell-1} y_i^{\ell-1}, \qquad y_j^\ell = f(\eta_j^\ell)$$
• The $y_1^L, \cdots, y_{n_L}^L$ form the final outputs of the network.
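• A short sketch of this forward computation, assuming sigmoid activations and one weight matrix per layer; the architecture and inputs are arbitrary examples:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def forward(x, weights):
    """Compute y^1, ..., y^L for a layered feedforward network.

    weights[k] is the (n_k x n_{k+1}) matrix of weights connecting
    layer k+1 to layer k+2 (0-based list index, 1-based layers).
    Returns the list of layer outputs; the last entry is the network output.
    """
    y = [np.asarray(x, dtype=float)]        # y^1 = x: the input layer fans out
    for W in weights:
        eta = y[-1] @ W                     # net inputs of the next layer
        y.append(sigmoid(eta))              # outputs of the next layer
    return y

# Assumed example: a 3-4-2 network with random weights.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
outputs = forward([0.5, -1.0, 2.0], weights)
print(outputs[-1])                          # y^L, the final outputs
```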

• The network represents functions from $\Re^{n_1}$ to $\Re^{n_L}$.
• To get a specific function we need to learn appropriate weights.
• Thus, $w_{ij}^\ell$, $i = 0, \cdots, n_\ell$, $j = 1, \cdots, n_{\ell+1}$, $\ell = 1, \cdots, L-1$, are the parameters to learn.
• Let $W$ represent all these parameters.
• The $y_i^L$ are functions of $W$ and the external inputs $X$, though we may not always show this explicitly in the notation.

• Consider a 2-layer network with a single output node.
• This is like a Perceptron or an AdaLinE:
$$y = f\left(\sum_{j=1}^{m} w_{j1}^1 x_j\right)$$
• In a 3-layer network we will have a hidden layer.
• The hidden layer can be thought of as computing an ‘internal representation’ so that a single unit can then correctly predict the output.
• Let us first consider using neural network models to learn a function.
• Suppose we have training data $\{(X^i, d^i), \ i = 1, \cdots, N\}$, where $X^i = [x_1^i, \cdots, x_m^i]^T \in \Re^m$ and $d^i = [d_1^i, \cdots, d_{m'}^i]^T \in \Re^{m'}$.
• We want to learn a neural network to represent this function.

• We can use an $L$-layer network with $n_1 = m$ and $n_L = m'$.
• $L$ and $n_2, \cdots, n_{L-1}$ are parameters which we fix (for now) arbitrarily.
• We assume that nodes in all layers, including the output layer, use the sigmoid activation function.
• Hence we take $d^i \in [0, 1]^{m'}$, $\forall i$.
• We can always linearly scale the output as needed. (Or we can also use a linear activation function for the output nodes.)
• Let $y^L = [y_1^L, \cdots, y_{m'}^L]^T$ denote the vector of outputs.
• We should actually write $y^L(W, X)$, $y_i^L(W, X)$ and so on.
• Now, the risk minimization framework is as follows.
• We fix a hypothesis space $\mathcal{H}$ by fixing the architecture of the network (that is, by fixing $L$ and $n_\ell$, $\ell = 1, \cdots, L$).
• This hypothesis space is parameterized by $W$.
• We can now choose a loss function and learn the ‘best’ $W$ by minimizing the empirical risk.
• We choose the squared error loss function.
• The empirical risk is given by
$$\hat{R}_N(W) = \frac{1}{N} \sum_{i=1}^{N} \left\| y^L(W, X^i) - d^i \right\|^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{j=1}^{m'} \big( y_j^L(W, X^i) - d_j^i \big)^2 \right)$$
• We want to find a $W$ that is a minimizer of $\hat{R}_N(W)$.
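• A small sketch of this empirical risk, assuming the network outputs have already been computed; the numbers are arbitrary examples:

```python
import numpy as np

def empirical_risk(Y_pred, D):
    """R_hat_N(W) = (1/N) * sum_i || y^L(W, X^i) - d^i ||^2,
    given Y_pred[i] = y^L(W, X^i) and targets D[i] = d^i."""
    Y_pred = np.asarray(Y_pred, dtype=float)
    D = np.asarray(D, dtype=float)
    return np.mean(np.sum((Y_pred - D) ** 2, axis=1))

# Assumed toy values: N = 3 samples with m' = 2 outputs each.
Y_pred = np.array([[0.8, 0.1], [0.4, 0.6], [0.2, 0.9]])
D      = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(empirical_risk(Y_pred, D))
```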

• This is the same as finding $W$ to minimize
$$J(W) = \sum_{i=1}^{N} J_i(W), \quad \text{where} \quad J_i(W) = \frac{1}{2} \sum_{j=1}^{m'} \big( y_j^L(W, X^i) - d_j^i \big)^2$$
• $J_i$ is (one half of) the squared error between the output of the network and the desired output for the training example $X^i$.
• One method of finding a minimizer of $J$ is to use gradient descent.
• This gives us the following learning algorithm:
$$W(t+1) = W(t) - \lambda \, \nabla J(W(t)) = W(t) - \lambda \sum_{i=1}^{N} \nabla J_i(W(t))$$
where $t$ is the iteration count and $\lambda$ is the step-size parameter.
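• A sketch of this iteration; the gradient routine is assumed to be supplied (backpropagation, derived in what follows, is what computes it), and the toy objective only checks the update rule:

```python
import numpy as np

def gradient_descent(W0, grad_J, lam=0.01, num_iters=1000):
    """W(t+1) = W(t) - lambda * grad J(W(t)).

    W0     : list of weight matrices (the parameter W)
    grad_J : function returning a list of gradients, one per weight matrix
    lam    : step-size parameter lambda
    """
    W = [w.copy() for w in W0]
    for t in range(num_iters):
        grads = grad_J(W)                     # e.g. sum of grad J_i over the data
        W = [w - lam * g for w, g in zip(W, grads)]
    return W

# Tiny check on a toy objective J(W) = sum_k ||W_k||^2, whose gradient is 2*W_k.
W0 = [np.array([[1.0, -2.0]]), np.array([[3.0]])]
W_min = gradient_descent(W0, lambda W: [2 * w for w in W], lam=0.1, num_iters=100)
print(W_min)    # both matrices are driven close to zero
```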

• If we consider a 2-layer network with a single output node with a linear activation function, this gradient descent is the LMS algorithm.
• In this case, $J$ is a quadratic function of the weights.
• For a general multilayer network, $J(W)$ is a high-dimensional nonlinear function.
• Gradient descent then gives us only a local minimum.
• However, often that may be the best we can do.

• For gradient descent, we need $J$ to be differentiable.
• This is so if we take our activation function to be differentiable.
• Hence we take the activation function to be the sigmoid.
• We can also take it to be the hyperbolic tangent.
• Unless otherwise specified, we consider multilayer feedforward networks with sigmoidal activation functions.
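• Both choices have convenient derivatives, which the gradient computation below will need: $f'(\eta) = f(\eta)(1 - f(\eta))$ for the sigmoid and $f'(\eta) = 1 - \tanh^2(\eta)$ for the hyperbolic tangent. A small sketch with a numerical check:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sigmoid_prime(eta):
    """f'(eta) = f(eta) * (1 - f(eta)) for the logistic sigmoid."""
    s = sigmoid(eta)
    return s * (1.0 - s)

def tanh_prime(eta):
    """f'(eta) = 1 - tanh(eta)^2 for the hyperbolic tangent."""
    return 1.0 - np.tanh(eta) ** 2

# Quick finite-difference check of both derivatives at a few points.
eta = np.array([-2.0, 0.0, 1.5])
h = 1e-6
print(np.allclose(sigmoid_prime(eta), (sigmoid(eta + h) - sigmoid(eta - h)) / (2 * h)))
print(np.allclose(tanh_prime(eta), (np.tanh(eta + h) - np.tanh(eta - h)) / (2 * h)))
```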

• To completely specify our algorithm for learning the weights, we need an expression for the gradient of $J_i$.
• In terms of the individual weights, the gradient descent is
$$w_{ij}^\ell(t+1) = w_{ij}^\ell(t) - \lambda \sum_{s=1}^{N} \frac{\partial J_s}{\partial w_{ij}^\ell}(W(t))$$
• As in the case of LMS, we can have a batch or an incremental version of the algorithm.
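• A sketch of the two variants; the per-sample gradient routine is assumed to be provided (the backpropagation computation derived next is what supplies it):

```python
def batch_update(W, data, grad_Ji, lam):
    """One batch step: accumulate the gradients of all J_i, then update once."""
    grads = None
    for x_i, d_i in data:
        g_i = grad_Ji(W, x_i, d_i)
        grads = g_i if grads is None else [a + b for a, b in zip(grads, g_i)]
    return [w - lam * g for w, g in zip(W, grads)]

def incremental_update(W, data, grad_Ji, lam):
    """One incremental pass: update the weights after every training sample."""
    for x_i, d_i in data:
        g_i = grad_Ji(W, x_i, d_i)
        W = [w - lam * g for w, g in zip(W, g_i)]
    return W
```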

• Thus, all we need to calculate is the partial derivative of the square of the error, for any training sample, with respect to any weight in the network.
• Because of the structure of the network, there is an efficient way of computing such partial derivatives.
• We look at this computation for any one training sample (and for a general multilayer feedforward network).
• From now on we omit explicit mention of the specific training example.

• Let $y^L = [y_1^L, \cdots, y_{n_L}^L]^T$ be the output of the network and let $d = [d_1, \cdots, d_{n_L}]^T$ be the desired output.
• Let
$$J = \frac{1}{2} \sum_{j=1}^{n_L} (y_j^L - d_j)^2$$
• We need the partial derivative of $J$ with respect to any weight $w_{ij}^\ell$.

• Any weight $w_{ij}^\ell$ can affect $J$ only by affecting the final output of the network.
• In a layered network, the weight $w_{ij}^\ell$ can affect the final output only through its effect on $\eta_j^{\ell+1}$.

• Hence, using the chain rule of differentiation, we have
$$\frac{\partial J}{\partial w_{ij}^\ell} = \frac{\partial J}{\partial \eta_j^{\ell+1}} \, \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell}$$
• Recall that
$$\eta_j^{\ell+1} = \sum_{s=1}^{n_\ell} w_{sj}^\ell y_s^\ell \;\;\Rightarrow\;\; \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell} = y_i^\ell$$

• Define
$$\delta_j^\ell = \frac{\partial J}{\partial \eta_j^\ell}, \quad \forall j, \ell$$
• Now we get
$$\frac{\partial J}{\partial w_{ij}^\ell} = \frac{\partial J}{\partial \eta_j^{\ell+1}} \, \frac{\partial \eta_j^{\ell+1}}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
• We can get all the needed partial derivatives if we calculate $\delta_j^\ell$ for all nodes.
• We can compute $\delta_j^\ell$ recursively:
$$\delta_j^\ell = \frac{\partial J}{\partial \eta_j^\ell} = \sum_{s=1}^{n_{\ell+1}} \frac{\partial J}{\partial \eta_s^{\ell+1}} \frac{\partial \eta_s^{\ell+1}}{\partial \eta_j^\ell} = \sum_{s=1}^{n_{\ell+1}} \frac{\partial J}{\partial \eta_s^{\ell+1}} \frac{\partial \eta_s^{\ell+1}}{\partial y_j^\ell} \frac{\partial y_j^\ell}{\partial \eta_j^\ell} = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$

• Recall that the partial derivatives are given by
$$\frac{\partial J}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
• For the weights, the range of $\ell$ is $\ell = 1, \cdots, (L-1)$.
• Hence we need $\delta_j^\ell$ for $\ell = 2, \cdots, L$ and all nodes $j$.

• Recall the recursive formula for $\delta_j^\ell$:
$$\delta_j^\ell = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$
• So, we need to first compute $\delta_j^L$.
• By definition, we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L}$$

• We have
$$J = \frac{1}{2} \sum_{j=1}^{n_L} (y_j^L - d_j)^2$$
• Hence we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L} = \frac{\partial J}{\partial y_j^L} \, \frac{\partial y_j^L}{\partial \eta_j^L} = (y_j^L - d_j) f'(\eta_j^L)$$
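• A small sketch of this output-layer computation for sigmoid units, where $f'(\eta_j^L) = y_j^L (1 - y_j^L)$; the output and target vectors are arbitrary examples:

```python
import numpy as np

def output_deltas(y_L, d):
    """delta_j^L = (y_j^L - d_j) * f'(eta_j^L) for sigmoid output units,
    using f'(eta_j^L) = y_j^L * (1 - y_j^L)."""
    y_L = np.asarray(y_L, dtype=float)
    d = np.asarray(d, dtype=float)
    return (y_L - d) * y_L * (1.0 - y_L)

# Assumed example output and target.
print(output_deltas([0.7, 0.2], [1.0, 0.0]))
```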

• Using the above we can compute $\delta_j^L$, $j = 1, \cdots, n_L$.
• Then we can compute $\delta_j^\ell$, $j = 1, \cdots, n_\ell$, for $\ell = (L-1), \cdots, 2$, recursively, using
$$\delta_j^\ell = \left( \sum_{s=1}^{n_{\ell+1}} \delta_s^{\ell+1} w_{js}^\ell \right) f'(\eta_j^\ell)$$
• We call $\delta_j^\ell$ the ‘error’ at node-$j$, layer-$\ell$.

• Then we compute all the partial derivatives with respect to the weights as
$$\frac{\partial J}{\partial w_{ij}^\ell} = \delta_j^{\ell+1} y_i^\ell$$
and hence can update the weights using the gradient descent procedure.
• Note that what we actually need are the partial derivatives of the $J_i$; the $J$ above plays the role of $J_i$ for the (unspecified) training sample under consideration.
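• Putting the pieces together, a compact sketch of the backpropagation computation for one training sample, under the same assumptions as the earlier sketches (sigmoid activations, one weight matrix per layer, no bias terms); the network and data are arbitrary examples:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def backprop_single_sample(x, d, weights):
    """Return dJ_i/dW for one sample, where J_i = 0.5 * ||y^L - d||^2.

    weights[k] is the (n_k x n_{k+1}) weight matrix between layers k+1 and k+2.
    """
    # Forward pass: store every layer's output y^l.
    y = [np.asarray(x, dtype=float)]
    for W in weights:
        y.append(sigmoid(y[-1] @ W))

    # Output-layer error: delta^L = (y^L - d) * f'(eta^L), with f' = y * (1 - y).
    delta = (y[-1] - np.asarray(d, dtype=float)) * y[-1] * (1.0 - y[-1])

    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        # dJ/dw_ij = delta_j^{l+1} * y_i^l  -> an outer product for the whole matrix.
        grads[k] = np.outer(y[k], delta)
        if k > 0:
            # Recursion: delta_j^l = (sum_s delta_s^{l+1} w_js^l) * f'(eta_j^l).
            delta = (weights[k] @ delta) * y[k] * (1.0 - y[k])
    return grads

# Assumed toy example: a 3-4-2 network and one training pair.
rng = np.random.default_rng(3)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
grads = backprop_single_sample([0.5, -1.0, 2.0], [1.0, 0.0], weights)
print([g.shape for g in grads])     # [(3, 4), (4, 2)]
```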
