Artificial Neural Networks (ANNs), also called parallel distributed processing (PDP)
systems or connectionist systems, are intended for modeling the organizational principles of
the central nervous system. This offers the hope that the biologically inspired computing
capabilities of the ANN will allow cognitive and sensory tasks to be performed more
easily and more satisfactorily than with conventional serial processors.
3.1
STRUCTURE OF A BIOLOGICAL NEURON
The cell body (soma), also called the cyton or neurocyton, with dendrons (dendrites),
which are outgrowths of the cell membrane of the neuron cell body.
The axon (neurite): a single long process arising from the axon hillock of the cell
body of the neuron.
The axons and the dendrons constitute the nerve fibers. The axon gives off branches
called collaterals along its course, and near its end it ramifies into non-myelinated
terminal branches known as axon terminals, which serve as the third
structural unit of a neuron.
3.2
MATHEMATICAL MODEL OF A NEURON
With this basic idea, a mathematical model of a neuron can be developed for ANNs, based on
two distinct operations performed by neurons. They are:

Synaptic Operation

The synaptic operation provides a confluence between the n-dimensional neural input vector, X,
and the n-dimensional synaptic weight vector, W. A dot product is often used as the confluence
operation. Thus, the components of the resulting vector, Z, can be expressed as

    z_i = w_i x_i ,   i = 1, 2, ..., n                                    (3.1)

where

    W = (w_1 w_2 w_3 ... w_n)^T                                          (3.2)

    X = (x_1 x_2 x_3 ... x_n)^T                                          (3.3)

    Z = (z_1 z_2 z_3 ... z_n)^T                                          (3.4)

and x_i is the ith input to the neuron.
Somatic Operation

This operation is a two-step process. First, the weighted inputs are aggregated:

    u = Σ_{i=1}^{n} z_i = Σ_{i=1}^{n} w_i x_i = W^T X                     (3.5)

where u is the intermediate accumulated sum. Thus, the combined synaptic operation and
the somatic aggregation operation provide a mapping from the n-dimensional neural input space,
X, to the one-dimensional space, u. Second, a nonlinear activation f is applied together with
the bias weight w_0:

    y = f[u − w_0]                                                        (3.6)

    y = f[ Σ_{i=1}^{n} w_i x_i − w_0 ] = f[ Σ_{i=0}^{n} w_i x_i ] = f[v]  (3.7)

provided

    x_0 = −1                                                              (3.8)

where v is the final accumulated sum and x_0 is the fixed bias input. Thus, a Neural
Processing Unit (NPU) can be schematically represented as in Fig. 3.2.
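As an illustration, the two operations can be sketched in plain Python. The sigmoid activation and the sample numbers are assumptions (any suitable nonlinearity may serve as f); the bias enters through the fixed input x_0 = −1 of eqn (3.8):

```python
import math

def npu(x, w, w0, f=lambda u: 1.0 / (1.0 + math.exp(-u))):
    """Neural Processing Unit: synaptic plus somatic operations.

    x  : list of n inputs (X, eqn 3.3)
    w  : list of n synaptic weights (W, eqn 3.2)
    w0 : bias weight, multiplied by the fixed bias input x0 = -1
    f  : somatic activation function (a sigmoid is assumed here)
    """
    # Synaptic operation, eqn (3.1): z_i = w_i * x_i
    z = [wi * xi for wi, xi in zip(w, x)]
    # Somatic aggregation, eqn (3.5): u = sum of the z_i
    u = sum(z)
    # Somatic activation, eqns (3.6)-(3.8): v = u + w0 * x0 = u - w0
    v = u - w0
    return f(v)

# Hypothetical numbers: u = 1.0 - 0.3 = 0.7, v = 0.6, y = sigmoid(0.6)
y = npu([0.5, -1.0], [2.0, 0.3], w0=0.1)
```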
3.3
CLASSIFICATION OF NEURAL NETWORKS
Although a single neuron processing unit can handle simple pattern classification problems,
the strength of neural computation comes from neurons connected in a network. A set
of processing units assembled in a closely interconnected network is called an artificial
neural network.
Depending on the different modes of interconnection, ANNs can be broadly classified as:
[Fig. 3.2: Schematic of a Neural Processing Unit: inputs X_0 = −1 (bias) to X_n, synaptic
weights W_0 to W_n, synaptic products Z_0 to Z_n, somatic activation f(v), and neural output Y.]
1. Feed-forward Neural Networks: In these networks, the neurons are arranged in layers
and information flows only in the forward direction, from the inputs towards the outputs.
2. Laterally Connected Neural Networks: These consist of feed-forward input units and a
layer of neurons that are laterally connected to their neighbors.
3. Recurrent or Dynamic Neural Networks: In these networks, neurons are connected
in a layered structure, and the neurons in a given layer may receive inputs from the
neurons in the layer below it and/or from the layers above it. The output depends not
only upon the current inputs but also upon past inputs and/or outputs. These
networks have been mostly used for the solution of optimization problems.
4. Hybrid Neural Networks: These networks combine two or more of the features of
the above-mentioned networks.
3.4
SUPERVISED LEARNING
The ability of a particular neural network is largely determined by the learning process and
the network structure used. Learning procedures are divided into three types: supervised,
reinforced, and unsupervised. These three types of learning are distinguished by the type of
error signal used to train the weights in the network. In supervised learning, an error scalar is
provided for each output unit by an external teacher, while in reinforced learning the
network is given only a global punish/reward signal. In unsupervised learning, no external
error signal is provided; instead, internal errors are generated between the units, which are
then used to modify the weights.
In supervised learning paradigm, the weights connecting units in the network are set on the
basis of detailed error information supplied to the network by an external teacher. In most
cases the network is trained using a set of input-output pairs which are examples of the
mapping that the network is required to learn to compute. The learning process may,
therefore, be viewed as fitting a function and its performance can thus be judged on whether
the network can learn the desired function over the interval represented by the training set
and to what extent the network can successfully generalize away from the points that it has
been trained on.
As an example, consider the case of electrode contour optimization, where the input-output
training sets are known, i.e. the predetermined electrode contours and the stresses along those
contours obtained from electric field computations carried out for such electrode contours.
For such a problem, a neural network with supervised learning is therefore needed.
3.5
MULTILAYER FEED-FORWARD NEURAL NETWORKS (MFNNs)
Pattern classification problems that are not linearly separable can be solved with multilayer
feed-forward neural networks (MFNNs) possessing one or more hidden layers in which the
neurons have nonlinear characteristics.
A single neuron can solve only simple pattern classification problems, i.e. transformations of
sets or functions from the input space to the output space. A two-layer network consisting of
two inputs and N outputs can produce N distinct lines in the pattern space, provided the
regions formed by the problem are linearly separable.
But in many problems, as the dimensionality of the input space grows, the classes are not
linearly separable and cannot be handled by a two-layer neural network. This leads to
MFNNs.
In MFNNs, neurons are connected in a layered structure: neurons in a given layer receive
inputs from the neurons in the layer immediately below and send their outputs to the
neurons in the layer immediately above. Their outputs are a function of only the current
inputs and are independent of past inputs and/or outputs.
Although, theoretically, an infinite number of layers may be required to define an arbitrary
decision boundary, a three-layer FNN can generate arbitrarily complex decision regions. For
this reason, three-layer FNNs are often referred to as universal approximators. The additional
layer between the input and the output layer is known as the hidden layer, and the number of
units in the hidden layer depends on the nature of the problem.
The term feed-forward implies that all the information in the network flows in the forward
direction, and during normal processing there is no feedback from the outputs to the inputs.
The neural network can identify input pattern vectors once the connection weights are
adjusted by means of the learning process. The back-propagation learning algorithm, which is
a generalization of the Widrow-Hoff error-correction rule, is the most popular method of
training the ANN. This learning algorithm is presented below in detail.
Let the net input to a neuron i in the input layer be net_i. Then, for each neuron in the input
layer, the neuron output is given by

    O_i = net_i                                                           (3.9)

For a neuron j in the hidden layer, the net input is

    net_j = Σ_{i=1}^{N_i} w_ji O_i                                        (3.10)

and the output is

    O_j = f(net_j, θ_j)                                                   (3.11)

where, for the sigmoidal activation function,

    O_j = 1 / (1 + e^−(net_j + θ_j))                                      (3.12)
In eqn (3.12), the parameter θ_j serves as a threshold or bias. The effect of a positive θ_j is
to shift the activation function to the left along the horizontal axis. This effect is
illustrated in Fig. 3.4.
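A one-line sketch of eqn (3.12) makes the shift explicit (the function name is hypothetical):

```python
import math

def sigmoid_output(net_j, theta_j):
    # Eqn (3.12): O_j = 1 / (1 + exp(-(net_j + theta_j)))
    return 1.0 / (1.0 + math.exp(-(net_j + theta_j)))

# A positive theta_j shifts the curve left along the net axis: the
# output an unbiased neuron produces at net = 2 is produced at net = 0.
assert sigmoid_output(0.0, 2.0) == sigmoid_output(2.0, 0.0)
```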
Similarly, for a neuron K in the output layer,

    net_K = Σ_{j=1}^{N_j} w_Kj O_j                                        (3.13)

    O_K = f(net_K, θ_K)                                                   (3.14)
In the learning phase, or training, of such a network, a pattern is presented as input, and the
set of weights in all the connecting links and all the thresholds in the neurons are adjusted
in such a way that the desired outputs t_pK are obtained at the output neurons. Once this
adjustment has been accomplished by the network, another pair of input-output patterns is
presented, and the network is required to learn that association as well. In fact, the network is
required to find a single set of weights and thresholds that will satisfy all the input-output
pairs presented to it.
In general, the outputs O_pK will not be the same as the target values t_pK. For each pattern
p, the sum of squared errors is

    E_p = (1/2) Σ_{K=1}^{N_K} (t_pK − O_pK)²                              (3.15)

where N_K is the number of neurons in the output layer. In the generalized delta rule
formulated by Rumelhart et al. for learning the weights and thresholds, the procedure for
learning the correct set of weights is to vary the weights in a manner calculated to reduce the
error E_p as rapidly as possible. In other words, a gradient search in weight space is carried
out on the basis of E_p.
Omitting the subscript p for convenience, eqn (3.15) is written as

    E = (1/2) Σ_{K=1}^{N_K} (t_K − O_K)²                                  (3.16)

Convergence towards improved values of the weights and thresholds is achieved by taking
incremental changes Δw_Kj proportional to −∂E/∂w_Kj, that is,

    Δw_Kj = −η ∂E/∂w_Kj                                                   (3.17)

where η is the learning rate. Now,

    ∂E/∂w_Kj = (∂E/∂net_K)(∂net_K/∂w_Kj)                                  (3.18)

and, from eqn (3.13),

    ∂net_K/∂w_Kj = O_j                                                    (3.19)
Let

    δ_K = −∂E/∂net_K                                                      (3.20)

so that

    Δw_Kj = η δ_K O_j                                                     (3.21)

Again,

    δ_K = −∂E/∂net_K = −(∂E/∂O_K)(∂O_K/∂net_K)                            (3.22)

From eqn (3.16),

    ∂E/∂O_K = ∂/∂O_K [ (1/2) Σ_K (t_K − O_K)² ] = −(t_K − O_K)            (3.23)

and, from eqn (3.12),

    ∂O_K/∂net_K = ∂/∂net_K [ 1 / (1 + e^−(net_K + θ_K)) ] = O_K (1 − O_K) (3.24)

Therefore, for any output-layer neuron K, δ_K is obtained from eqns (3.22), (3.23) and (3.24)
as follows:

    δ_K = (t_K − O_K) O_K (1 − O_K)                                       (3.25)
For the next lower layer, where the weights do not affect the output nodes directly, it can be
written that

    Δw_ji = −η ∂E/∂w_ji                                                   (3.26)
          = −η (∂E/∂net_j)(∂net_j/∂w_ji)
          = −η (∂E/∂net_j) O_i
          = η δ_j O_i                                                     (3.27)

where

    δ_j = −∂E/∂net_j                                                      (3.28)
        = −(∂E/∂O_j)(∂O_j/∂net_j)                                         (3.29)

and, from eqn (3.12),

    ∂O_j/∂net_j = O_j (1 − O_j)                                           (3.30)
However, the factor ∂E/∂O_j cannot be evaluated directly. Instead, it is written in terms of
quantities that are known and other quantities that can be evaluated:

    −∂E/∂O_j = Σ_{K=1}^{N_K} (−∂E/∂net_K)(∂net_K/∂O_j)
             = Σ_{K=1}^{N_K} (−∂E/∂net_K) ∂/∂O_j ( Σ_j w_Kj O_j )
             = Σ_{K=1}^{N_K} δ_K w_Kj                                     (3.31)

Hence,

    δ_j = O_j (1 − O_j) Σ_{K=1}^{N_K} δ_K w_Kj                            (3.32)
Thus, the deltas at a hidden-layer neuron can be evaluated in terms of the deltas at the layer
above. Hence, starting at the highest layer, i.e. the output layer, the deltas are evaluated using
eqn (3.25), and the errors are then propagated backward to the lower layers using eqn (3.32).
Summarizing, and using the subscript p to denote the pattern number,

    Δ_p w_Kj = η δ_pK O_pj                                                (3.33)

and

    δ_pK = (t_pK − O_pK) O_pK (1 − O_pK)                                  (3.34)

    Δ_p w_ji = η δ_pj O_pi                                                (3.35)

and

    δ_pj = O_pj (1 − O_pj) Σ_{K=1}^{N_K} δ_pK w_Kj                        (3.36)

With a momentum term added, the weight update becomes

    Δw_ji(n+1) = η δ_pj O_pi + α Δw_ji(n)                                 (3.37)
where (n+1) is used to indicate the (n+1)th step and α is a proportionality constant called the
momentum constant. The second term in eqn (3.37) specifies that the change in w_ji
at the (n+1)th step should be somewhat similar to the change undertaken at the nth step. In
this way, some inertia is built in, and the momentum in the rate of change is conserved to some
degree.
It is also important to note that the network must not be allowed to start off with a set of equal
weights. It has been shown that it is not possible to proceed from such a weight configuration
to one of unequal weights, even if the latter corresponds to a smaller system error.
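The complete training loop defined by eqns (3.25), (3.32) and (3.37) can be sketched in plain Python. This is an illustrative implementation, not the code used here: the network size, η, α, epoch count and the XOR task are assumptions chosen to keep the example self-contained, and the initial weights are small unequal random values, as the note above requires. Thresholds are folded in as an extra weight on a fixed unit input.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, n_in, n_hid, n_out, eta=0.5, alpha=0.8, epochs=2000, seed=1):
    """Back-propagation with momentum, eqns (3.25), (3.32), (3.37)."""
    rng = random.Random(seed)  # small, unequal starting weights
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
    dW1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]
    dW2 = [[0.0] * (n_hid + 1) for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in patterns:
            xb = x + [1.0]          # bias folded in as a fixed input
            # forward pass, eqns (3.10)-(3.14)
            Oj = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
            Ojb = Oj + [1.0]
            Ok = [sigmoid(sum(w * v for w, v in zip(row, Ojb))) for row in W2]
            # eqn (3.25): deltas at the output layer
            dk = [(tk - ok) * ok * (1.0 - ok) for tk, ok in zip(t, Ok)]
            # eqn (3.32): deltas propagated back to the hidden layer
            dj = [oj * (1.0 - oj) * sum(dk[K] * W2[K][j] for K in range(n_out))
                  for j, oj in enumerate(Oj)]
            # eqn (3.37): change = eta * delta * input + alpha * previous change
            for K in range(n_out):
                for j in range(n_hid + 1):
                    dW2[K][j] = eta * dk[K] * Ojb[j] + alpha * dW2[K][j]
                    W2[K][j] += dW2[K][j]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    dW1[j][i] = eta * dj[j] * xb[i] + alpha * dW1[j][i]
                    W1[j][i] += dW1[j][i]

    def predict(x):
        xb = x + [1.0]
        Oj = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1] + [1.0]
        return [sigmoid(sum(w * v for w, v in zip(row, Oj))) for row in W2]
    return predict

# XOR: a classic problem that is not linearly separable
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.1 - 0.1, 1.0], [0.0])]
net = train(xor, n_in=2, n_hid=3, n_out=1)
```

With these (assumed) settings the network typically, though not invariably, separates the XOR classes; a run that stalls in a local minimum can be restarted from a different random seed.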
3.6
NORMALIZATION OF INPUT-OUTPUT DATA
Scaling of the input-output data has a significant influence on the convergence property and
also on the accuracy of the learning process. It is obvious from the sigmoidal activation
function given in eqn (3.12) that the range of the output of the network must be within (0, 1).
Moreover, the input variables should be kept small in order to avoid the saturation effect
caused by the sigmoidal function. Thus, the input-output data must be normalized before the
initiation of the training of the neural network. Two schemes have been tried for scaling the
input-output variables, as detailed below.
Scheme 1 of Normalization
In this scheme, the maximum values of the input and output vector components are
determined as follows:

    net_i,max = max_p { net_i(p) },   p = 1, ..., NP;  i = 1, ..., N_i    (3.38)

    O_K,max = max_p { O_K(p) },   p = 1, ..., NP;  K = 1, ..., N_K        (3.39)

Normalized by these maximum values, the input and output variables are given as follows:

    net_i,nor(p) = net_i(p) / net_i,max ,   p = 1, ..., NP;  i = 1, ..., N_i   (3.40)

and

    O_K,nor(p) = O_K(p) / O_K,max ,   p = 1, ..., NP;  K = 1, ..., N_K    (3.41)
After normalization, the input and output variables lie within (0,1) in this scheme.
Scheme 2 of Normalization
In this scheme of normalization, the output variables are normalized by using eqns (3.39) and
(3.41) to get a variable range within (0,1). But the input variables are normalized as follows:

    net_i,nor(p) = ( net_i(p) − net_i,av ) / σ_i ,   p = 1, ..., NP;  i = 1, ..., N_i   (3.42)

where net_i,av and σ_i are the average value and the standard deviation of the ith input
variable over the training patterns. The normalized input variables then fall within a range
(−K_1, K_2), where K_1 and K_2 are real positive numbers. These input variables can then
easily be made to fall in the range (−1, 1) by dividing them by the greater of the two numbers
K_1 and K_2.
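Both schemes can be sketched as follows. The function names are hypothetical, and Scheme 2 follows the reconstruction of eqn (3.42) above, so it should be read as one plausible reading rather than the exact procedure used here:

```python
def normalize_scheme1(columns):
    """Scheme 1, eqns (3.38)-(3.41): divide every variable by its
    maximum over the NP training patterns, giving values in (0, 1)
    for positive-valued data (each column is one variable)."""
    return [[v / max(col) for v in col] for col in columns]

def normalize_scheme2(col):
    """Scheme 2, eqn (3.42) as reconstructed: centre one input
    variable on its average, divide by its standard deviation, then
    rescale into (-1, 1) by the largest resulting magnitude, which
    plays the role of the greater of K1 and K2."""
    n = len(col)
    avg = sum(col) / n
    sd = (sum((v - avg) ** 2 for v in col) / n) ** 0.5
    z = [(v - avg) / sd for v in col]
    m = max(abs(v) for v in z)
    return [v / m for v in z]
```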
3.7
FASTER TRAINING
For the sigmoidal activation function

    s = 1 / (1 + exp(−D·x))

the derivative is

    D · s (1 − s)

where s is the activation value of the output unit and most often D = 1. The derivative is
largest at s = 1/2 and it is here that you will get the largest weight changes. Unfortunately, as
you near the values 0 or 1, the derivative term gets close to 0 and the weight changes become
very small. In fact, if the network's response is 1 and the target is 0, that is, the network is off
by quite a lot, you end up with very small weight changes. It can take a VERY long time for
the training process to correct this. More than likely you will get tired of waiting. Fahlman's
solution was to add 0.1 to the derivative term, making it:

    0.1 + s (1 − s)
The solution of Chen and Mars was to drop the derivative term altogether, in effect making
the derivative 1. This method passes back much larger error quotas to the lower layer, so
large that a smaller η must be used there. For their problem they found the best results came
when the lower-level η was 0.1 times the upper-level η, hence they called their method the
"differential step size" method. One tenth is not always the best value, so you must
experiment with both the upper- and lower-level etas to get the best results. Besides that, the
η you use for the upper layer must also be much smaller than the η you would use with the
standard derivative term.
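The three derivative terms can be compared in a small sketch (the function and variant names are hypothetical):

```python
def sigmoid_derivative(s, variant="standard"):
    """Derivative term at an output unit with activation s.

    'standard'  : s(1-s), the sigmoid derivative with D = 1
    'fahlman'   : 0.1 + s(1-s), keeps weight changes alive near 0 and 1
    'chen_mars' : constant 1, the derivative term dropped entirely
                  (to be paired with a much smaller lower-level eta)
    """
    if variant == "fahlman":
        return 0.1 + s * (1.0 - s)
    if variant == "chen_mars":
        return 1.0
    return s * (1.0 - s)

# Near a saturated, badly wrong output (response close to 1, target 0)
# the standard term almost vanishes while Fahlman's stays above 0.1.
```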
3.8
ADAPTIVE LEARNING RATE AND MOMENTUM
To make the learning process converge more rapidly than in the conventional method, in
which both the learning rate and the momentum are kept constant during learning, an
adaptive learning algorithm has been developed to adapt both the momentum and the
learning rate during the learning process.
The proposed adaption rule for the learning rate is as follows:

    η_i(n+1) = η_i(n) · exp(−k·n),   if Re(n) < Re(n−1)
    η_i(n+1) = η_i(n),               if Re(n) ≥ Re(n−1)                   (3.43)

where η_i(n) is the learning rate at iteration n between the input layer and the next hidden
layer, k is a constant, and Re is the error measure over all patterns and output neurons,

    Re = [ (1/(NP·N_k)) Σ_{p=1}^{NP} Σ_{k=1}^{N_k} (t_pk − O_pk)² ]^(1/2) (3.44)
The learning rate is thus held constant, η_i(n+1) = η_i(n), when Re(n) ≥ Re(n−1). When
Re(n) < Re(n−1), the error is decreasing, which implies that the connection weights are
being updated in the correct direction. It is reasonable to maintain this update direction in the
next iteration, and in this case we achieve this by decreasing the learning rate in the next
iteration. On the other hand, if the connection weights are moved in the opposite direction,
causing the error to increase, we should try to ignore this direction in the next iteration by
keeping the value of η the same as the value of η in the previous iteration. The value of the
constant k should be selected judiciously to give the best result, and the optimum value of k
is problem dependent.
Similarly to the learning rate, the proposed adaption rule for the momentum constant is as
follows:

    α_i(n+1) = [1 + (R/100)] · α_i(n),   if Re(n) < Re(n−1)
    α_i(n+1) = [1 − (R/100)] · α_i(n),   if Re(n) ≥ Re(n−1)               (3.45)

where R is the percentage rate of change of η_i between two successive iterations, and
α_i(n) is the momentum at iteration n between the input layer and the next hidden layer.
The learning rates and the momenta for the other layers are updated in the same way as those
for η_i and α_i, respectively, at the nth iteration during the learning process.
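Since eqns (3.43) and (3.45) are reconstructed here, the following sketch should be read as one plausible reading of the adaption rules rather than the definitive algorithm; the function names and the default value of k are assumptions:

```python
import math

def adapt_eta(eta, n, re_now, re_prev, k=0.01):
    """Learning-rate adaption in the spirit of eqn (3.43): shrink eta
    while the error Re is falling, hold it otherwise.  The decay
    factor exp(-k*n) and the value of k are assumptions; the text
    notes that the optimum k is problem dependent."""
    if re_now < re_prev:
        return eta * math.exp(-k * n)
    return eta

def adapt_alpha(alpha, R, re_now, re_prev):
    """Momentum adaption in the spirit of eqn (3.45): R is the
    percentage rate of change of eta between successive iterations."""
    if re_now < re_prev:
        return (1.0 + R / 100.0) * alpha
    return (1.0 - R / 100.0) * alpha
```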
3.9
RESILIENT BACK-PROPAGATION (RPROP)
The basic principle of RPROP is to eliminate the harmful influence of the size of the partial
derivative on the weight step. As a consequence, only the sign of the derivative is considered
to indicate the direction of the weight update. The size of the weight change is exclusively
determined by a weight-specific, so-called update-value Δ_ij(t):

    Δw_ij(t) = −Δ_ij(t),   if ∂E(t)/∂w_ij > 0
             = +Δ_ij(t),   if ∂E(t)/∂w_ij < 0
             = 0,          else                                           (3.46)
where ∂E(t)/∂w_ij denotes the partial derivative of the error with respect to the weight w_ij.
In the second step, the update-values Δ_ij(t) themselves are adapted:

    Δ_ij(t) = η+ · Δ_ij(t−1),   if (∂E(t−1)/∂w_ij) · (∂E(t)/∂w_ij) > 0
            = η− · Δ_ij(t−1),   if (∂E(t−1)/∂w_ij) · (∂E(t)/∂w_ij) < 0
            = Δ_ij(t−1),        else                                      (3.47)

where 0 < η− < 1 < η+. Thus, every time the partial derivative of the corresponding
weight w_ij changes its sign, which indicates that the last update was too big and the
algorithm has jumped over the local minimum, the update-value Δ_ij(t) is decreased by the
factor η−. If the derivative retains its sign, the update-value is slightly increased in order to
accelerate convergence in shallow regions. Additionally, in case of a change in sign, there
should be no adaptation in the succeeding learning step. In practice this can be achieved by
setting ∂E(t)/∂w_ij := 0 in the adaptation rule for Δ_ij. The weight-update and the
adaptation are performed after the gradient information of the whole pattern set is computed.
The Rprop algorithm requires setting the following parameters: (i) the increase factor, set
to η+ = 1.2; (ii) the decrease factor, set to η− = 0.5; (iii) the initial update-value
Δ_0 = 0.1; and (iv) the maximum weight step Δ_max, which is used in order to prevent the
weights from becoming excessively large.
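A per-weight sketch of the two RPROP steps, using the parameter values quoted above (the function name is hypothetical, and Δ_max = 50 is an assumed bound taken from the common Rprop defaults):

```python
def rprop_step(w, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5,
               delta_min=1e-6, delta_max=50.0):
    """One RPROP update for a single weight w, eqns (3.46)-(3.47).

    Returns (new_w, stored_grad, new_delta).  stored_grad is the
    gradient memory carried to the next step; it is forced to 0 on a
    sign change so that no adaptation occurs in the succeeding step.
    """
    if grad * prev_grad > 0:
        # derivative kept its sign: accelerate in shallow regions
        delta = min(delta * eta_plus, delta_max)
    elif grad * prev_grad < 0:
        # sign change: the last step jumped over a minimum, so shrink
        # the update-value and suppress the next adaptation (grad := 0)
        delta = max(delta * eta_minus, delta_min)
        return w, 0.0, delta
    # eqn (3.46): the weight moves against the sign of the gradient
    if grad > 0:
        w -= delta
    elif grad < 0:
        w += delta
    return w, grad, delta

# Same sign twice: the step grows by eta_plus and is applied.
w, g, d = rprop_step(1.0, 0.1, 0.1, 0.1)
```

Only the sign of the gradient enters the weight move, which is the whole point of RPROP: the step size is carried entirely by Δ_ij.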