Network Architecture
● Three different classes of network
architectures
− single-layer feed-forward
− multi-layer feed-forward
− recurrent
● A single neuron computes the output y = φ(v), where v = w1x1 + … + wnxn + b
is the weighted sum of the inputs plus the bias b, and φ is the step
activation function:
      φ(v) = +1 if v ≥ 0
             −1 if v < 0
[Figure: a single neuron with inputs x1 … xn, weights w1 … wn, bias b, net input v, activation φ(v) and output y.]
Perceptron for Classification
● The perceptron is used for binary classification.
● First, train the perceptron for a classification task.
− Find suitable weights in such a way that the training examples are
correctly classified.
− Geometrically, this means finding a hyperplane that separates the
examples of the two classes.
● The perceptron can only model linearly separable classes.
● Given training examples of classes C1 and C2, train the perceptron
in such a way that:
− If the output of the perceptron is +1 then the input is assigned to class
C1
− If the output is -1 then the input is assigned to C2
Boolean function OR – Linearly separable
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   1
[Plot: the four input points in the (X2, X1) plane; (0,0) is false, the other three points are true, and a single straight line separates the false point from the true ones.]
Learning Process for Perceptron
● Initially assign random weights to inputs between -0.5 and
+0.5
● Training data is presented to the perceptron and its output is
observed.
● If the output is incorrect, the weights are adjusted using the
following formula:
      wi(new) = wi(old) + (α * xi * e),
where e is the error produced and α is the learning rate.
− Once the modification to weights has taken place, the next piece of
training data is used in the same way.
− Once all the training data have been applied, the process starts again
until all the weights are correct and all errors are zero.
− Each iteration of this process is known as an epoch.
Example: Perceptron to learn OR
function
● Initially consider w1 = -0.2 and w2 = 0.4.
● For training data x1 = 0 and x2 = 0, the desired output is 0.
● Compute y = Step(w1*x1 + w2*x2) = Step(0) = 0. The output is correct, so
the weights are not changed.
● For training data x1 = 0 and x2 = 1, the desired output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(0.4) = 1. The output is correct,
so the weights are not changed.
● For the next training data, x1 = 1 and x2 = 0, the desired output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(-0.2) = 0. The output is
incorrect, hence the weights are to be changed.
● Assume α = 0.2 and error e = 1:
wi = wi + (α * xi * e) gives w1 = 0 and w2 = 0.4
● With these weights, test the remaining training data.
● Repeat the process until an entire epoch produces no errors.
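The worked example above can be sketched in code. This is a minimal sketch that assumes the slides' conventions: no bias term, Step(0) treated as 0, and the update wi ← wi + α·xi·e applied only on misclassified examples.

```python
# Perceptron learning of OR, following the worked example: initial weights
# w1 = -0.2, w2 = 0.4, learning rate 0.2, no bias term, Step(0) = 0.

def step(v):
    return 1 if v > 0 else 0      # Step(0) treated as 0, as in the example

def train_or(weights, rate=0.2, max_epochs=20):
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR table
    w = list(weights)
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in data:
            e = target - step(w[0] * x1 + w[1] * x2)
            if e != 0:                        # adjust only on a mistake
                errors += 1
                w[0] += rate * x1 * e
                w[1] += rate * x2 * e
        if errors == 0:                       # stable: a whole error-free epoch
            return w
    return w

print(train_or([-0.2, 0.4]))
```

With these starting weights the first epoch changes only w1 (to 0, exactly as in the example); one further correction in the next epoch is needed before the whole OR table is classified correctly.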
Perceptron: Limitations
● The perceptron can only model linearly separable
functions,
− i.e. functions whose two output classes can be separated by a single
straight line in a 2-dimensional plot (a hyperplane in general).
● Boolean functions given below are linearly separable:
− AND
− OR
− COMPLEMENT
● It cannot model XOR function as it is non linearly
separable.
− When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the mean
squared error.
Adaptive linear element (Adaline)
• An important generalisation of the perceptron training
algorithm was presented by Widrow and Hoff as the
‘least mean square’ (LMS) learning procedure, also
known as the delta rule.
• The main functional difference from the perceptron
training rule is the way the output of the system is
used in the learning rule. The perceptron learning rule
uses the output of the threshold function (either -1 or
+1) for learning; the delta rule uses the net output
without further mapping into the output values -1 or +1.
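A minimal sketch of one delta-rule (LMS) update makes the contrast concrete: the error is taken on the net linear output v = w·x, not on a thresholded ±1 value. The learning rate, toy data and bias-as-input trick are illustrative assumptions.

```python
# Delta (LMS) rule sketch: the update uses the net linear output v = w.x,
# not the thresholded -1/+1 value that the perceptron rule uses.

def lms_step(w, x, d, rate=0.1):
    """One Widrow-Hoff update: w <- w + rate * (d - v) * x, with v = w.x."""
    v = sum(wi * xi for wi, xi in zip(w, x))   # net output, no thresholding
    e = d - v                                  # error on the linear output
    return [wi + rate * e * xi for wi, xi in zip(w, x)]

# Two examples with x[0] = 1 acting as a bias input; the exact
# least-mean-square solution here is w = [1, 1].
w = [0.0, 0.0]
for _ in range(200):
    for x, d in [((1.0, 0.0), 1.0), ((1.0, 1.0), 2.0)]:
        w = lms_step(w, x, d)
print(w)
```

Because the error varies smoothly with the weights, repeated passes drive them toward the least-mean-square solution even when the classes cannot be separated exactly.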
XOR – Non-linearly separable function
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
[Plot: in the (X2, X1) plane the true points (0,1) and (1,0) and the false points (0,0) and (1,1) cannot be separated by any single straight line.]
Multi layer feed-forward NN (FFNN)
[Figure: a 3-4-2 network – an input layer of 3 nodes, one hidden layer of 4 nodes, and an output layer of 2 nodes.]
FFNN for XOR
● The ANN for XOR has two hidden nodes that realize this non-linear
separation and uses the sign (step) activation function.
● Arrows from input nodes to two hidden nodes indicate the directions of
the weight vectors (1,-1) and (-1,1).
● The output node is used to combine the outputs of the two hidden
nodes.
[Figure: inputs X1 and X2 feed hidden nodes H1 and H2 with weight vectors (1, –1) and (–1, 1); both hidden nodes feed the output node Y with weight 1, and the output unit has bias –0.5.]
Inputs    Output of Hidden Nodes    Output Node    X1 XOR X2
X1  X2    H1         H2
0   0     0          0              –0.5 → 0       0
0   1     –1 → 0     1              0.5 → 1        1
1   0     1          –1 → 0         0.5 → 1        1
1   1     0          0              –0.5 → 0       0
Since we are representing two states by 0 (false) and 1 (true), we
will map negative outputs (–1, –0.5) of hidden and output layers
to 0 and positive output (0.5) to 1.
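The table above can be checked with a short forward-pass sketch that applies the slides' mapping (negative activations map to 0, positive to 1; a net input of exactly 0 also maps to 0):

```python
# Forward pass of the two-hidden-node XOR network: hidden weight vectors
# (1, -1) and (-1, 1), output node summing the two hidden outputs with
# bias -0.5, and the negative->0 / positive->1 mapping from the slides.

def to_binary(v):
    return 1 if v > 0 else 0      # net of exactly 0 also maps to 0

def xor_net(x1, x2):
    h1 = to_binary(1 * x1 + (-1) * x2)   # fires only for (1, 0)
    h2 = to_binary(-1 * x1 + 1 * x2)     # fires only for (0, 1)
    return to_binary(h1 + h2 - 0.5)      # OR of the two hidden detectors

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```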
FFNN NEURON MODEL
● Forward step: compute the network activation, propagating the inputs
layer by layer to the output.
● Backward step: propagate the prediction error back through the network.
● Example: suppose the net input to the output neuron is -0.087 and the
actual target is Y = 2. Then the prediction is f(-0.087) = 0.478 and the
prediction error is (2 - 0.478) = 1.522.
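The arithmetic above checks out if f is assumed to be the logistic sigmoid (the slides do not name f explicitly):

```python
# Verifying the forward-step example, assuming f is the logistic sigmoid.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

y_hat = sigmoid(-0.087)
print(round(y_hat, 3))      # prediction: 0.478
print(round(2 - y_hat, 3))  # prediction error: 1.522
```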
Building ANN Model
How to build the model? Choose the network weights so as to minimize the
total prediction error E = Σi (Yi – Vi)²

Training the Model
How to train the model? Feed forward each input to obtain the prediction,
then back-propagate the error E = Σi (Yi – Vi)² to adjust the weights.
Weight adjustment during Back Propagation
Weight adjustment formula in Back Propagation
Consider a very simple network with 2 inputs and 1 output. No hidden layer.
There are only two weights whose values needs to be specified.
E(w1, w2) = Σi [ Yi – Vi(w1, w2) ]²
[Figure: the error surface E plotted over the weight space (W1, W2); gradient descent moves from the initial weights w0 down the surface to the global minimum at w*.]
Training Algorithm
Do
    For each training example i:
        present the input, compute the output Vi, adjust the weights
    Next i
    E = Σi (Yi – Vi)²
    Check for Convergence
End Do
Convergence Criterion
Suggestion:
1. Stop if the decrease in total prediction error (since last cycle) is small.
2. Stop if the overall changes in the weights (since last cycle) are small.
Drawback:
The error keeps on decreasing, so we get a very good fit to the training data.
BUT the network thus obtained may have poor generalizing power on unseen data
- for X's which the network didn't see before, it predicts poorly.
Convergence Criterion
Modified Suggestion:
Partition the training data into Training set and Validation set
Use the Training set to build the model and the Validation set to test the
performance of the model on unseen data
[Plot: training-set and validation-set error versus training cycle.]
Choice of Training Parameters
Learning Parameter
Too big – large leaps in weight space – risk of missing the global minimum.
Too small –
- takes a long time to converge to the global minimum
- once stuck in a local minimum, it is difficult to get out of it.
Suggestion
Trial and Error – Try various choices of Learning Parameter and Momentum
See which choice leads to minimum prediction error
NN DESIGN ISSUES
● Data representation
● Network Topology
● Network Parameters
● Training
● Validation
Data Representation
● Data representation depends on the problem.
● In general ANNs work on continuous (real valued) attributes.
Therefore symbolic attributes are encoded into continuous ones.
● Attributes of different types may have different ranges of values
which affect the training process.
● Normalization may be used, like the following one, which scales
each attribute to assume values between 0 and 1:
      xi' = (xi – mini) / (maxi – mini)
for each value xi of the ith attribute, where mini and maxi are the minimum
and maximum values of that attribute over the training set.
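The scaling formula above as a short sketch, using plain lists (the sample data is illustrative):

```python
# Min-max normalization of each attribute to [0, 1]:
# x' = (x - min) / (max - min), computed column by column.

def min_max_normalize(rows):
    cols = list(zip(*rows))                 # attribute-wise view of the data
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0   # guard constant columns
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

data = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
print(min_max_normalize(data))
# first attribute becomes [0.0, 0.5, 1.0], second [0.0, 1.0, 0.5]
```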
Network Topology
● The number of layers and neurons depend on the specific
task.
● In practice this issue is solved by trial and error.
● Two types of adaptive algorithms can be used:
− start from a large network and successively remove some neurons
and links until network performance degrades.
− begin with a small network and introduce new neurons until
performance is satisfactory.
Recurrent Network
● FFNN is acyclic: data passes from the input nodes to the output nodes
and never in the reverse direction.
− Once the FFNN is trained, its state is fixed and does not alter as new
data is presented to it. It does not have memory.
● A recurrent network can have connections that go backward from output to
input nodes, and can therefore model dynamic systems.
− In this way, a recurrent network’s internal state can be altered as sets
of input data are presented. It can be said to have memory.
− It is useful in solving problems where the solution depends not just on
the current inputs but on all previous inputs.
● Applications
− predict stock market price,
− weather forecast
Recurrent network
[Figure: a network with feedback connections from later layers back to earlier ones.]

Radial-Basis Function (RBF) Networks
[Figure: inputs x1 … xm feed hidden units φ1 … φm1, whose outputs are combined with weights w1 … wm1 into a single linear output y.]
● One hidden layer with RBF activation functions φ1 … φm1.
● Output layer with linear activation function.
● Each hidden unit computes φ(||x – t||), where
− t is called the center,
− σ is called the spread;
center and spread are the parameters of the hidden unit.
Types of φ
• Multiquadrics:
      φ(r) = (r² + c²)^(1/2),  c > 0
• Gaussian functions:
      φ(r) = exp( – r² / (2σ²) )
With centers ti, i = 1, …, m1, and dmax the maximum distance between the
chosen centers, a common choice of spread σ = dmax / √(2·m1) gives
      φi(x) = exp( – (m1 / dmax²) · ||x – ti||² ),  i = 1, …, m1
Learning Algorithm 1
• Weights are computed by means of the pseudo-inverse method.
– For an example (xi, di), consider the output of the network:
      y(xi) = w1 φ1(||xi – t1||) + … + wm1 φm1(||xi – tm1||)
Since the hidden unit activation space is only two-dimensional,
we can easily plot how the input patterns have been transformed:
[Plot: the four XOR patterns in the (φ1, φ2) hidden-activation space, where they become linearly separable.]
In this case we just have one output y(x), with one weight wj to each hidden
unit j, and one bias -θ. This gives us the network's input-output relation
for each input pattern x:

      y(x) = w1 φ1(x) + w2 φ2(x) − θ

[Figure: x1 and x2 feed hidden units φ1 and φ2; the output y sums w1φ1(x) and w2φ2(x) together with a bias input of −1 weighted by θ.]
Then, if we want the outputs y(xp) to equal the targets
tp, we get the four equations

      1.0000 w1 + 0.1353 w2 − 1.0000 θ = 0
      0.3678 w1 + 0.3678 w2 − 1.0000 θ = 1
      0.3678 w1 + 0.3678 w2 − 1.0000 θ = 1
      0.1353 w1 + 1.0000 w2 − 1.0000 θ = 0

Only three of these equations are distinct, and we have three
unknowns, so we can easily solve them to give

      w1 = w2 = −2.5018 ,  θ = −2.8404
This completes our “training” of the RBF network for the
XOR problem
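This "training" step can be reproduced numerically. The sketch below assumes Gaussian units exp(−||x − t||²) with centers (0,0) and (1,1), since those reproduce the 1.0000 / 0.3678 / 0.1353 coefficients; solving the three distinct equations gives w1 = w2 ≈ −2.503 and θ ≈ −2.841 (the −2.5018 / −2.8404 values come from the coefficients being rounded to four digits before solving).

```python
# Solving the RBF-XOR linear system, assuming Gaussian basis functions
# exp(-||x - t||^2) with centers (0,0) and (1,1).
import math

centers = [(0.0, 0.0), (1.0, 1.0)]
patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def phi(x, t):
    return math.exp(-((x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2))

# Each pattern gives one equation  phi1(x) w1 + phi2(x) w2 - theta = target.
A = [[phi(x, centers[0]), phi(x, centers[1]), -1.0] for x, _ in patterns]
b = [t for _, t in patterns]

def solve3(M, v):
    # Gauss-Jordan elimination with partial pivoting for a 3x3 system.
    aug = [row[:] + [vi] for row, vi in zip(M, v)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(aug[r][i]))
        aug[i], aug[p] = aug[p], aug[i]
        for r in range(3):
            if r != i:
                f = aug[r][i] / aug[i][i]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[i])]
    return [aug[i][3] / aug[i][i] for i in range(3)]

# Equations 2 and 3 coincide, so keep the three distinct ones and solve.
w1, w2, theta = solve3([A[0], A[1], A[3]], [b[0], b[1], b[3]])
print(w1, w2, theta)
```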
Comparison: RBF NN vs. MLP
● Both are non-linear layered feed-forward networks.
● The hidden layer of an RBF network is non-linear and its output layer is
linear; the hidden and output layers of an MLP are usually both non-linear.
● An RBF network has one single hidden layer; an MLP may have more hidden
layers.
● In an RBF network the neuron model of the hidden neurons is different from
that of the output nodes; in an MLP, hidden and output neurons share a
common neuron model, though not always the same activation function.
● The activation function of each hidden neuron in an RBF network computes
the Euclidean distance between the input vector and the center of that
unit; the activation function of each hidden neuron in an MLP computes the
inner product of the input vector and the synaptic weight vector of that
neuron.
● RBF networks are usually fully connected; it is common for MLPs to be only
partially connected.
● RBF networks are usually trained one layer at a time, with the first layer
unsupervised; MLPs are usually trained with a single global supervised
algorithm.
● Generally, for approximating non-linear input-output mappings, RBF
networks can be trained much faster, but an MLP may require a smaller
number of parameters.
Network parameters
● Initial weights should be small random values, chosen so that each
neuron's net input starts in the active region of its activation function.
− For weights wij from the input to the first layer, scale the random
range by the fan-in N and the input magnitudes |xi|, i = 1, …, N.
− For weights wjk from the first to the second layer, scale similarly
using the first-layer outputs φ(Σi wij xi).
Choice of learning rate
Monitor the average squared error over the N training examples:
      EAV = (1/N) Σn=1..N E(n)
Weight Update Rule
● The Backprop weight update rule is based on the gradient
descent method:
− It takes a step in the direction yielding the maximum decrease of
the network error E.
− This direction is the opposite of the gradient of E.
● Iteration of the Backprop algorithm is usually terminated
when the sum of squares of errors of the output values for
all training data in an epoch is less than some threshold
such as 0.01
      Δwij = − η ∂E/∂wij ,   wij ← wij + Δwij
where η is the learning rate.
Backprop learning algorithm
(incremental-mode)
n=1;
initialize weights randomly;
while (stopping criterion not satisfied and n < max_iterations)
for each example (x,d)
- run the network with input x and compute the output y
- update the weights in backward order starting from those of
the output layer:
      wji ← wji + Δwji
with Δwji computed using the (generalized) Delta rule
end-for
n = n+1;
end-while;
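The incremental-mode loop above can be sketched for a 2-4-1 sigmoid network learning XOR. The architecture, learning rate, random seed and iteration cap are illustrative assumptions; the per-example updates follow the generalized Delta rule.

```python
# Incremental-mode backprop sketch: a 2-4-1 sigmoid network learning XOR.
import math, random

random.seed(0)
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
D = [0.0, 1.0, 1.0, 0.0]

H = 4                                           # hidden units (illustrative)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def sig(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x):
    h = [sig(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    y = sig(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y

eta = 0.5
for n in range(10000):                          # max_iterations
    for x, d in zip(X, D):
        h, y = forward(x)                       # run the network with input x
        # generalized Delta rule: output layer first, then hidden layer
        delta_out = (d - y) * y * (1.0 - y)
        for j in range(H):
            delta_hid = h[j] * (1.0 - h[j]) * W2[j] * delta_out
            W2[j] += eta * delta_out * h[j]
            W1[j][0] += eta * delta_hid * x[0]
            W1[j][1] += eta * delta_hid * x[1]
            b1[j] += eta * delta_hid
        b2 += eta * delta_out
    # stopping criterion: sum of squared errors over the epoch
    E = sum((d - forward(x)[1]) ** 2 for x, d in zip(X, D))
    if E < 0.01:
        break

print(n, E)
```

The epoch error E is the quantity the slides' stopping criterion compares against the threshold (0.01 here).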
Stopping criterions
● Total mean squared error change:
− Back-prop is considered to have converged when the absolute
rate of change in the average squared error per epoch is
sufficiently small (in the range [0.1, 0.01]).
● Generalization based criterion:
− After each epoch, the NN is tested for generalization.
− If the generalization performance is adequate then stop.
− If this stopping criterion is used, then the part of the training set
used for testing the network's generalization must not be used for
updating the weights.
Training
● Rule of thumb:
− the number of training examples should be at least five to ten
times the number of weights of the network.
● Other rule:
      N ≥ |W| / (1 − a)
where N = number of training examples, |W| = number of weights in the
network, and a = expected accuracy on the test set.
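The rule above as a one-line sketch (the 80-weight network and 90% accuracy target are made-up numbers):

```python
# Rule-of-thumb sample size: at least |W| / (1 - a) training examples for a
# network with |W| weights and desired test-set accuracy a.

def min_training_examples(num_weights, accuracy):
    return num_weights / (1.0 - accuracy)

# e.g. 80 weights and a 90% accuracy target -> 800 examples,
# i.e. ten times the number of weights, consistent with the earlier
# five-to-ten-times rule of thumb.
print(round(min_training_examples(80, 0.9)))
```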