
Network Architecture
● Three different classes of network architectures:
− single-layer feed-forward
− multi-layer feed-forward
− recurrent

● The architecture of a neural network is closely linked to the learning algorithm used to train it.
Single Layer Feed-forward

[Figure: an input layer of source nodes connected directly to an output layer of neurons.]
Perceptron: Neuron Model
(Special form of single layer feed forward)
• The Perceptron was introduced by Frank Rosenblatt (1958).
• The Perceptron is a feedforward neural network with no hidden neurons. The goal of the perceptron is to learn a given transformation using learning samples with input x and corresponding output y = f(x).
• It uses the hard-limit transfer function as the activation of the output neuron; therefore the perceptron output is limited to either +1 or –1.
Perceptron: Neuron Model/
Threshold Logic Unit (TLU)
− A perceptron uses a step function that returns +1 if the weighted sum of its inputs is ≥ 0, and –1 otherwise:

  φ(v) = +1 if v ≥ 0
         –1 if v < 0
[Figure: inputs x1 … xn, weighted by w1 … wn, plus a bias b, are summed into v, which passes through the step function φ(v) to give the output y.]
Perceptron for Classification
● The perceptron is used for binary classification.
● First, train the perceptron for the classification task:
− Find suitable weights such that the training examples are correctly classified.
− Geometrically, this amounts to finding a hyper-plane that separates the examples of the two classes.
● The perceptron can only model linearly separable classes.
● Given training examples of classes C1 and C2, train the perceptron in such a way that:
− If the output of the perceptron is +1, the input is assigned to class C1.
− If the output is –1, the input is assigned to class C2.
Boolean function OR – Linearly separable
X1 X2 | Y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1

[Figure: the four points plotted in the (X1, X2) plane; the single false point (0, 0) can be separated from the three true points by one straight line.]
Learning Process for Perceptron
● Initially, assign random weights to the inputs, between –0.5 and +0.5.
● The training data is presented to the perceptron and its output is observed.
● If the output is incorrect, the weights are adjusted using the following formula:

  wi(new) ← wi(old) + (α * xi * e), where e is the error produced and α is the learning rate.

− Once the modification to the weights has taken place, the next piece of training data is used in the same way.
− Once all the training data have been applied, the process starts again, until all the weights are correct and all errors are zero.
− Each iteration of this process is known as an epoch.
Example: Perceptron to learn OR
function
● Initially consider w1 = –0.2 and w2 = 0.4.
● For training data x1 = 0 and x2 = 0, the target output is 0.
● Compute y = Step(w1*x1 + w2*x2) = Step(0) = 0. The output is correct, so the weights are not changed.
● For training data x1 = 0 and x2 = 1, the target output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(0.4) = 1. The output is correct, so the weights are not changed.
● For training data x1 = 1 and x2 = 0, the target output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(–0.2) = 0. The output is incorrect, hence the weights must be changed.
● Assume α = 0.2; the error is e = 1.
  wi = wi + (α * xi * e) gives w1 = 0 and w2 = 0.4.
● With these weights, test the remaining training data.
● Repeat the process until a stable result is reached.
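The epoch-by-epoch procedure above is easy to reproduce in code. Below is a minimal Python sketch (not from the slides) of the perceptron learning rule applied to OR; it assumes a 0/1 step function that outputs 1 only for a strictly positive weighted sum, matching the worked example, and omits the bias term as the example does.

```python
# Minimal perceptron training for OR: a sketch assuming Step(v) = 1
# only when v > 0 (so Step(0) = 0, as in the worked example) and no bias.
def step(v):
    return 1 if v > 0 else 0

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [-0.2, 0.4]     # initial weights from the example
alpha = 0.2         # learning rate

for epoch in range(100):
    total_error = 0
    for (x1, x2), target in data:
        y = step(w[0] * x1 + w[1] * x2)
        e = target - y
        w[0] += alpha * x1 * e      # wi <- wi + (alpha * xi * e)
        w[1] += alpha * x2 * e
        total_error += abs(e)
    if total_error == 0:            # an epoch with no errors: done
        break

print(w)   # [0.2, 0.4] -- stable weights that realize OR
```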
Perceptron: Limitations
● The perceptron can only model linearly separable functions,
− i.e. functions whose values, plotted in a 2-dimensional graph, can be separated into two parts by a single straight line.
● The Boolean functions given below are linearly separable:
− AND
− OR
− COMPLEMENT
● It cannot model the XOR function, as XOR is not linearly separable.
− When the two classes are not linearly separable, it may be desirable to obtain a linear separator that minimizes the mean squared error.
Adaptive linear element (Adaline)
• An important generalisation of the perceptron training
algorithm was presented by Widrow and Hoff as the
‘least mean square’ (LMS) learning procedure, also
known as the delta rule.
• The main functional difference with the perceptron
training rule is the way the output of the system is
used in the learning rule. The perceptron learning rule
uses the output of the threshold function (either -1 or
+1) for learning. The delta-rule uses the net output
without further mapping into output values -1 or +1.
XOR – Non linearly separable function

● A typical example of a non-linearly separable function is XOR, which computes the logical exclusive or.
● This function takes two input arguments with values in {0,1} and returns one output in {0,1}.
● Here 0 and 1 encode the truth values false and true.
● The output is true if and only if the two inputs have different truth values.
● XOR is a non-linearly separable function and cannot be modeled by a perceptron.
● For such functions we have to use a multi-layer feed-forward network.
Input         Output
X1 X2 | X1 XOR X2
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

These two classes (true and false) cannot be separated using a single line; hence XOR is not linearly separable.

[Figure: the four points in the (X1, X2) plane; the true points (0,1) and (1,0) and the false points (0,0) and (1,1) lie diagonally opposite each other, so no single straight line separates the classes.]
Multi layer feed-forward NN (FFNN)

● The FFNN is a more general network architecture, where there are hidden layers between the input and output layers.
● Hidden nodes do not directly receive inputs from, nor send outputs to, the external environment.
● FFNNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks.

[Figure: a 3-4-2 network – an input layer of 3 nodes, a hidden layer of 4 nodes, and an output layer of 2 nodes.]
FFNN for XOR
● The ANN for XOR has two hidden nodes that realize this non-linear separation, and it uses the sign (step) activation function.
● The arrows from the input nodes to the two hidden nodes indicate the directions of the weight vectors (1, –1) and (–1, 1).
● The output node is used to combine the outputs of the two hidden nodes.

[Figure: X1 and X2 feed the hidden nodes H1 and H2 with weight vectors (1, –1) and (–1, 1); the output node Y combines H1 and H2 with weights 1 and 1 and a bias of –0.5.]

Inputs    Output of hidden nodes    Output node    X1 XOR X2
X1 X2     H1         H2
0  0      0          0              –0.5 → 0       0
0  1      –1 → 0     1              0.5 → 1        1
1  0      1          –1 → 0         0.5 → 1        1
1  1      0          0              –0.5 → 0       0

Since we are representing the two states by 0 (false) and 1 (true), we map the negative outputs (–1, –0.5) of the hidden and output layers to 0, and the positive output (0.5) to 1.
FFNN NEURON MODEL

● The classical learning algorithm of the FFNN is based on the gradient descent method.
● The activation functions used in an FFNN are continuous functions of the weights, differentiable everywhere.
● The activation function for node i may be defined as a simple form of the sigmoid function in the following manner:

  φ(Vi) = 1 / (1 + e^(–A·Vi))

where A > 0 and Vi = Σj Wij · Yj, such that Wij is the weight of the link from node i to node j and Yj is the output of node j.
Training Algorithm: Backpropagation
● The Backpropagation algorithm learns in the same way as a single perceptron.
● It searches for weight values that minimize the total error of
the network over the set of training examples (training set).
● Backpropagation consists of the repeated application of the
following two passes:
− Forward pass: In this step, the network is activated on one example
and the error of (each neuron of) the output layer is computed.
− Backward pass: In this step the network error is used for updating
the weights. The error is propagated backwards from the output
layer through the network layer by layer. This is done by recursively
computing the local gradient of each neuron.
Backpropagation

● Back-propagation training algorithm:
− Forward step: network activation.
− Backward step: error propagation.

● Backpropagation adjusts the weights of the NN in order to minimize the network's total mean squared error.
Prediction using a particular ANN Model

Input: X1, X2, X3   Output: Y   Model: Y = f(X1, X2, X3)
Activation: f(x) = e^x / (1 + e^x)

[Figure: a 3-2-1 network evaluated on the input X1 = 1, X2 = –1, X3 = 2.]

● Hidden node 1 net input: 0.5*1 – 0.1*(–1) – 0.2*2 = 0.2; its output is f(0.2) = 0.55.
● Hidden node 2 net input: 0.9; its output is f(0.9) = 0.71.
● Output node net input: 0.1*0.55 – 0.2*0.71 = –0.087; the predicted Y = f(–0.087) = 0.478.
● Suppose the actual Y = 2. Then the prediction error = (2 – 0.478) = 1.522.
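A short Python sketch of this forward pass is given below. The hidden-node-1 and output weights are taken from the slide; the hidden-node-2 weights are an assumption chosen to reproduce its stated net input of 0.9.

```python
import math

def f(x):
    return math.exp(x) / (1 + math.exp(x))   # logistic activation

x = [1.0, -1.0, 2.0]        # X1, X2, X3
w_h1 = [0.5, -0.1, -0.2]    # input -> hidden node 1 (from the slide)
w_h2 = [0.6, 0.1, 0.2]      # input -> hidden node 2 (assumed; gives net 0.9)
w_out = [0.1, -0.2]         # hidden -> output (from the slide)

h1 = f(sum(w * xi for w, xi in zip(w_h1, x)))   # f(0.2) = 0.55
h2 = f(sum(w * xi for w, xi in zip(w_h2, x)))   # f(0.9) = 0.71
y = f(w_out[0] * h1 + w_out[1] * h2)            # f(-0.087) = 0.478

print(round(h1, 2), round(h2, 2), round(y, 3))  # 0.55 0.71 0.478
```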
Building ANN Model
How to build the model?

Input: X1, X2, X3   Output: Y   Model: Y = f(X1, X2, X3)

● # Input neurons = # inputs = 3   # Output neurons = # outputs = 1
● # Hidden layers = ??? Try 1. (There is no fixed strategy.)
● # Neurons in the hidden layer = ??? Try 2. (By trial and error.)

The architecture is now defined… how to get the weights?

Given this architecture, there are 8 weights to decide: W = (W1, W2, …, W8).
Training data: (Yi, X1i, X2i, …, Xpi), i = 1, 2, …, n.
Given a particular choice of W, we get predicted values (V1, V2, …, Vn); they are functions of W.
Choose W such that the overall prediction error E is minimized:

  E = Σ (Yi – Vi)²
Training the Model
How to train the model?

• Start with a random set of weights.
• Feed forward the first observation through the net:
  X1 → Network → V1 ;  Error = (Y1 – V1)
• Back propagation: adjust the weights so that this error is reduced
  (so the network fits the first observation well).
• Feed forward the second observation; adjust the weights to fit the second observation well.
• Keep repeating until you reach the last observation.
• This finishes one CYCLE through the data.
• Perform many such training cycles until the overall prediction error
  E = Σ (Yi – Vi)² is small.
Back Propagation
A bit more detail on back propagation:

● Each weight ‘shares the blame’ for the prediction error E = Σ (Yi – Vi)² with the other weights.
● The back propagation algorithm decides how to distribute the blame among all the weights and adjusts the weights accordingly.
● A small portion of the blame leads to a small adjustment; a large portion of the blame leads to a large adjustment.
Weight adjustment during Back Propagation
Weight adjustment formula in back propagation:

● Vi – the prediction for the i-th observation – is a function of the network weight vector W = (W1, W2, …).
● Hence E, the total prediction error, is also a function of W:

  E(W) = Σ [Yi – Vi(W)]²

● Gradient descent method: for every individual weight Wi, the update formula is

  Wnew = Wold – η * (∂E/∂W)|Wold ,  η = learning parameter (between 0 and 1)

● Another slight variation is also used sometimes:

  W(t+1) = W(t) – η * (∂E/∂W)|W(t) + α * (W(t) – W(t–1)) ,  α = momentum (between 0 and 1)
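As a sketch, the two update rules above can be written as a small helper; the function name and the toy values are illustrative, and grad_E stands for the already-computed gradient ∂E/∂W.

```python
# Gradient descent with momentum: one update step for a weight vector.
def update_weights(w, w_prev, grad_E, eta=0.5, mom=0.9):
    """Return (new weights, current weights) after one update."""
    w_new = [wi - eta * gi + mom * (wi - wpi)
             for wi, gi, wpi in zip(w, grad_E, w_prev)]
    return w_new, w

w, w_prev = [0.5, -0.3], [0.5, -0.3]
w, w_prev = update_weights(w, w_prev, grad_E=[0.2, -0.1])
print(w)   # [0.4, -0.25]
```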
Geometric interpretation of the Weight adjustment

Consider a very simple network with 2 inputs and 1 output, and no hidden layer. There are only two weights whose values need to be specified:

  E(w1, w2) = Σ [Yi – Vi(w1, w2)]²

• A pair (w1, w2) is a point on the 2-D plane.
• For any such point we can get a value of E.
• Plotting E vs. (w1, w2) gives a 3-D surface – the ‘error surface’.
• The aim is to identify the pair for which E is minimum, i.e. the pair for which the height of the error surface is minimum.

Gradient Descent Algorithm:
• Start with a random point (w1, w2).
• Move to a ‘better’ point (w'1, w'2) where the height of the error surface is lower.
• Keep moving until you reach (w*1, w*2), where the error is minimum.
Crawling the Error Surface

[Figure: a 3-D error surface plotted over the weight space (W1, W2), showing a local minimum, the global minimum at w*, and a starting point w0.]
Training Algorithm

● Decide the network architecture
  (# hidden layers, # neurons in each hidden layer)
● Decide the learning parameter and momentum
● Initialize the network with random weights

Do until the convergence criterion is met
  For i = 1 to # training data points
    Feed forward the i-th observation through the net
    Compute the prediction error on the i-th observation
    Back propagate the error and adjust the weights
  Next i
  Compute E = Σ (Yi – Vi)² and check for convergence
End Do
Convergence Criterion

When to stop training the network?

● Ideally – when we reach the global minimum of the error surface.
  How do we know we have reached it? We don’t…

● Suggestion:
1. Stop if the decrease in the total prediction error (since the last cycle) is small.
2. Stop if the overall changes in the weights (since the last cycle) are small.

● Drawback:
The training error keeps on decreasing, and we get a very good fit to the training data,
BUT… the network thus obtained has poor generalizing power on unseen data.

This phenomenon is known as overfitting of the training data. The network is said to memorize the training data,
− so that when an X in the training set is given, the network faithfully produces the corresponding Y;
− however, for X’s which the network did not see before, it predicts poorly.
Convergence Criterion

● Modified suggestion:
Partition the available data into a Training set and a Validation set.
Use the Training set to build the model, and the Validation set to test the performance of the model on unseen data.

● Typically, as we run more and more training cycles,
− the error on the Training set keeps on decreasing, while
− the error on the Validation set first decreases and then increases.

● Stop training when the error on the Validation set starts increasing, as in the sketch below.

[Figure: error vs. training cycle – the Training curve decreases monotonically, while the Validation curve first falls and then rises.]
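A sketch of this validation-based stopping rule is shown below. To keep it self-contained, the validation error is simulated with a U-shaped curve like the figure's; the 'patience' counter, which waits a few cycles before stopping, is a small refinement beyond the slide's rule.

```python
# Early stopping on a simulated validation-error curve.
def validation_error(cycle):
    return (cycle - 30) ** 2 / 100.0 + 1.0   # falls, then rises (minimum at 30)

best_val, best_cycle = float("inf"), 0
patience, bad_cycles = 5, 0

for cycle in range(1000):
    val_err = validation_error(cycle)        # would follow one training cycle
    if val_err < best_val:
        best_val, best_cycle, bad_cycles = val_err, cycle, 0
    else:
        bad_cycles += 1
        if bad_cycles >= patience:           # validation error keeps rising
            break                            # keep the weights from best_cycle

print(best_cycle, best_val)                  # 30 1.0
```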
Choice of Training Parameters

● Learning parameter and momentum
− need to be supplied by the user from outside, and should be between 0 and 1.

● What should the optimal values of these training parameters be?
− There is no clear consensus on any fixed strategy.
− However, the effects of wrongly specifying them are well studied.

● Learning parameter:
− Too big – large leaps in weight space – risk of missing the global minimum.
− Too small –
  - takes a long time to converge to the global minimum;
  - once stuck in a local minimum, it is difficult to get out of it.

● Suggestion:
Trial and error – try various choices of learning parameter and momentum, and see which choice leads to the minimum prediction error.
NN DESIGN ISSUES

● Data representation
● Network Topology
● Network Parameters
● Training
● Validation
Data Representation
● Data representation depends on the problem.
● In general ANNs work on continuous (real valued) attributes.
Therefore symbolic attributes are encoded into continuous ones.
● Attributes of different types may have different ranges of values
which affect the training process.
● Normalization may be used, like the following one, which scales each attribute to assume values between 0 and 1:

  xi ← (xi – mini) / (maxi – mini)

where, for each value xi of the i-th attribute, mini and maxi are the minimum and maximum values of that attribute over the training set.
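A quick numpy sketch of this min–max normalization (assuming maxi ≠ mini for every attribute):

```python
import numpy as np

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)                 # min_i for each attribute
    maxs = X.max(axis=0)                 # max_i for each attribute
    return (X - mins) / (maxs - mins)    # each column scaled to [0, 1]

X = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
print(min_max_normalize(X))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```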
Network Topology
● The number of layers and the number of neurons depend on the specific task.
● In practice this issue is solved by trial and error.
● Two types of adaptive algorithms can be used:
− start from a large network and successively remove some neurons
and links until network performance degrades.
− begin with a small network and introduce new neurons until
performance is satisfactory.
Recurrent Network
● FFNN is acyclic where data passes from input to the output
nodes and not vice versa.
− Once the FFNN is trained, its state is fixed and does not alter as new
data is presented to it. It does not have memory.
● Recurrent network can have connections that go backward
from output to input nodes and models dynamic systems.
− In this way, a recurrent network’s internal state can be altered as sets
of input data are presented. It can be said to have memory.
− It is useful in solving problems where the solution depends not just on
the current inputs but on all previous inputs.
● Applications:
− stock market price prediction
− weather forecasting
Recurrent network
Motivation

• There is a need for systems which can process time-dependent data,
• especially for applications (like weather forecasting) which involve prediction based on the past.
Introduction to RNN
• Feedforward neural networks are unable to process data with time-dependent information.
• In an RNN (recurrent neural network), the hidden or output units can be provided with an extended input that is composed of the current input activities together with the activities from the previous time step.
• RTRL (the real-time recurrent learning algorithm), used for training RNNs, is based on gradient minimization of the error.
Simple Recurrent Network (SRN)

• The SRN is a multi-layer perceptron with feedback connections.
• It consists of an input layer, an output layer, a recurrent layer, and a context layer.
• The recurrent layer is fed, through recurrent connections, by the activities of the context layer.
• The context units hold the activities of the recurrent layer from the previous time step.
• The units of the input layer I and the recurrent layer R are fully connected through the weights WRI, and the recurrent layer is connected to the output layer O by the weights WOR.
RNN Architecture

• Here every recurrent unit is fed by the activities of all the recurrent units from the previous time step through the recurrent weights WRC.
• Let the recurrent activity at time t be r(t), the input activity i(t), and the output activity o(t). The recurrent unit net input and output activity are then given by

  r(t) = f( WRI · i(t) + WRC · r(t–1) )

and the output unit net input and output activity by

  o(t) = f( WOR · r(t) )

• f is the activation function; here the logistic sigmoid function is used:

  f(x) = (1 + e^–x)^–1
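The SRN forward step can be sketched as follows; the weight names follow the slide's notation, while the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # f(x) = (1 + e^-x)^-1

def srn_step(i_t, r_prev, W_RI, W_RC, W_OR):
    """One time step: returns (recurrent activity, output activity)."""
    r_t = sigmoid(W_RI @ i_t + W_RC @ r_prev)   # recurrent layer
    o_t = sigmoid(W_OR @ r_t)                   # output layer
    return r_t, o_t

rng = np.random.default_rng(0)
n_in, n_rec, n_out = 3, 4, 2
W_RI = rng.normal(size=(n_rec, n_in))
W_RC = rng.normal(size=(n_rec, n_rec))
W_OR = rng.normal(size=(n_out, n_rec))

r = np.zeros(n_rec)                    # context activities start at zero
for t in range(5):                     # feed a short random input sequence
    r, o = srn_step(rng.normal(size=n_in), r, W_RI, W_RC, W_OR)
print(o)
```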
RTRL (Real-Time Recurrent Learning) algorithm for SRN

• RTRL is an on-line algorithm for training RNNs.
• RTRL is based on approximate on-line gradient computation.
• The network weights are updated at every time step in order to minimize the current output error with respect to the calculated approximate gradient.
Radial-Basis Function Networks
● A function is said to be a radial basis function (RBF) if its
output depends on the distance of the input from a given
stored vector.
− The RBF neural network has an input layer, a hidden layer and an
output layer.
− In such RBF networks, the hidden layer uses neurons with RBFs as
activation functions.
− The outputs of all these hidden neurons are combined linearly at the
output node.
● These networks have a wide variety of applications, such as
− function approximation,
− time series prediction,
− control and regression,
− complex (non-linear) pattern classification tasks.
RBF Architecture

[Figure: inputs x1 … xm feed a hidden layer of RBF units φ1 … φm1, whose outputs are combined linearly with weights w1 … wm1 at the output node y.]

● One hidden layer with RBF activation functions φ1 … φm1.
● Output layer with a linear activation function:

  y = w1 φ1(||x – t1||) + … + wm1 φm1(||x – tm1||)

where ||x – t|| is the distance of x = (x1, …, xm) from the center t.
HIDDEN NEURON MODEL

• Hidden units use radial basis functions: the output φ(||x – t||) depends on the distance of the input x from the center t.

[Figure: inputs x1 … xm feed a single hidden unit computing φ(||x – t||).]

− t is called the center.
− σ is called the spread.
− The center and spread are parameters.
Types of φ

With r = ||x – t||:

• Multiquadrics:

  φ(r) = (r² + c²)^(1/2) ,  c > 0

• Inverse multiquadrics:

  φ(r) = 1 / (r² + c²)^(1/2) ,  c > 0

• Gaussian functions (most used):

  φ(r) = exp( –r² / (2σ²) ) ,  σ > 0
RBF network parameters

• What do we have to learn for an RBF NN with a given architecture?
– The centers of the RBF activation functions
– the spreads of the Gaussian RBF activation functions
– the weights from the hidden to the output layer
• Different learning algorithms may be used for
learning the RBF network parameters.
Learning Algorithm 1
• Centers: are selected at random
– centers are chosen randomly from the training set
• Spreads: are chosen by normalization:

  σ = dmax / √(2·m1)

  where dmax is the maximum distance between any 2 centers, and m1 is the number of centers.

• Then the activation function of hidden neuron i becomes:

  φi(||x – ti||) = exp( –(m1 / dmax²) · ||x – ti||² )
Learning Algorithm 1
• Weights: are computed by means of the pseudo-inverse method.
− For an example (xi, di), consider the output of the network:

  y(xi) = w1 φ1(||xi – t1||) + … + wm1 φm1(||xi – tm1||)

− We would like y(xi) = di for each example, that is:

  w1 φ1(||xi – t1||) + … + wm1 φm1(||xi – tm1||) = di
Learning Algorithm 1
• This can be re-written in matrix form for one example:

  [φ1(||xi – t1||) … φm1(||xi – tm1||)] [w1 … wm1]^T = di

and for all the examples at the same time:

  | φ1(||x1 – t1||) … φm1(||x1 – tm1||) |
  | …                                  | [w1 … wm1]^T = [d1 … dN]^T
  | φ1(||xN – t1||) … φm1(||xN – tm1||) |
Learning Algorithm 1
Let

  Φ = | φ1(||x1 – t1||) … φm1(||x1 – tm1||) |
      | …                                  |
      | φ1(||xN – t1||) … φm1(||xN – tm1||) |

Then we can write

  Φ [w1 … wm1]^T = [d1 … dN]^T

If Φ⁺ is the pseudo-inverse of the matrix Φ, we obtain the weights using the following formula:

  [w1 … wm1]^T = Φ⁺ [d1 … dN]^T
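Learning Algorithm 1 end-to-end, as a numpy sketch (the function and variable names are mine, not the slides'): random centers, the normalized spread, and weights from the pseudo-inverse.

```python
import numpy as np

def train_rbf(X, d, m1, rng=np.random.default_rng(0)):
    # 1. centers chosen randomly from the training set
    centers = X[rng.choice(len(X), size=m1, replace=False)]
    # 2. normalized spread via d_max, the largest inter-center distance
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    # Gaussian activations: phi_i(x) = exp(-(m1/d_max^2) * ||x - t_i||^2)
    Phi = np.exp(-(m1 / d_max**2) *
                 ((X[:, None, :] - centers[None, :, :])**2).sum(-1))
    # 3. weights via the pseudo-inverse: w = Phi+ d
    w = np.linalg.pinv(Phi) @ d
    return centers, d_max, w

X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
d = np.sin(X).ravel()
centers, d_max, w = train_rbf(X, d, m1=3)
print(w)
```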
The XOR Problem in RBF Form
• To perform the XOR classification in an RBF network, we start by deciding how many basis functions we need. Given that there are four training patterns and two classes, M = 2 seems a reasonable first guess.
• We then need to decide on the basis function centres. Let us take t1 = (0, 0) and t2 = (1, 1);
• the distance between them is dmax = √2.
Learning Algorithm 1: summary

1. Choose the centers randomly from the training set.
2. Compute the spread for the RBF functions using the normalization method.
3. Find the weights using the pseudo-inverse method.
Example: the XOR problem

Since the hidden unit activation space is only two-dimensional, we can easily plot how the input patterns have been transformed.

In this case we just have one output y(x), with one weight wj to each hidden unit j and one bias –θ. This gives us the network’s input–output relation for each input pattern x:

  y(x) = w1 φ1(x) + w2 φ2(x) – θ

[Figure: x1 and x2 feed the two basis-function units φ1 and φ2, whose outputs are combined with weights w1 and w2, together with a bias input –1 weighted by θ, to give y.]

Then, if we want the outputs y(xp) to equal the targets tp, we get the four equations:

  1.0000 w1 + 0.1353 w2 – 1.0000 θ = 0
  0.3678 w1 + 0.3678 w2 – 1.0000 θ = 1
  0.3678 w1 + 0.3678 w2 – 1.0000 θ = 1
  0.1353 w1 + 1.0000 w2 – 1.0000 θ = 0

Only three of the equations are distinct, and we have three variables, so we can easily solve them to give

  w1 = w2 = –2.5018 ,  θ = –2.8404

This completes our “training” of the RBF network for the XOR problem.
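The numbers can be verified with a few lines of numpy: with Gaussian activations φj(x) = exp(−||x − tj||²) and centers (0,0) and (1,1), the coefficients 1.0000, 0.3678 and 0.1353 appear, and a least-squares solve recovers the stated solution (up to rounding).

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 0], dtype=float)

# phi_j(x) = exp(-||x - t_j||^2) for the two centers
Phi = np.exp(-((X[:, None, :] - t[None, :, :])**2).sum(-1))   # shape (4, 2)
A = np.hstack([Phi, -np.ones((4, 1))])   # columns: phi1, phi2, bias term -1
sol, *_ = np.linalg.lstsq(A, targets, rcond=None)
print(sol)   # approximately [-2.5018 -2.5018 -2.8404] = (w1, w2, theta)
```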
Comparison: RBF NN vs. MLP

● Both are non-linear layered feed-forward networks.
● The hidden layer of an RBF network is non-linear and its output layer is linear; the hidden and output layers of an FFNN are usually both non-linear.
● An RBF network has one single hidden layer; an MLP may have more hidden layers.
● In an RBF network, the neuron model of the hidden neurons is different from that of the output nodes; in an MLP, hidden and output neurons share a common neuron model, though not always the same activation function.
● The activation function of each hidden neuron in an RBF NN computes the Euclidean distance between the input vector and the center of that unit; the activation function of each hidden neuron in an FFNN computes the inner product of the input vector and the synaptic weight vector of that neuron.
● RBF networks are usually fully connected; it is common for MLPs to be only partially connected.
● RBF networks are usually trained one layer at a time, with the first layer unsupervised; MLPs are usually trained with a single global supervised algorithm.
● Generally, for approximating non-linear input–output mappings, RBF networks can be trained much faster, but an MLP may require a smaller number of parameters.
Network parameters

● How are the weights initialized?


● How is the learning rate chosen?
● How many hidden layers and how many
neurons?
● How many examples in the training set?
Initialization of weights
● In general, initial weights are randomly chosen, with
typical values between -1.0 and 1.0 or -0.5 and 0.5.
● If some inputs are much larger than others, random
initialization may bias the network to give much more
importance to larger inputs.
● In such a case, weights can be initialized as follows:

  wij = ± (1 / (2N)) · (1 / ⟨|xi|⟩) ,  i = 1, …, N

for the weights from the input to the first layer, where ⟨|xi|⟩ is the average magnitude of input i over the training set, and

  wjk = ± (1 / (2N)) · (1 / ⟨ φ(Σi wij xi) ⟩)

for the weights from the first to the second layer.
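A possible reading of this initialization, as a numpy sketch; this is one interpretation of the formula above, not a definitive reconstruction.

```python
import numpy as np

def init_input_weights(X, n_hidden, rng=np.random.default_rng(0)):
    # Random signs, scaled inversely to each input's average magnitude,
    # so large inputs start with proportionally small weights.
    X = np.asarray(X, dtype=float)
    N = X.shape[1]                          # number of inputs
    avg_mag = np.abs(X).mean(axis=0)        # <|x_i|> over the training set
    scale = 1.0 / (2 * N * avg_mag)
    signs = rng.choice([-1.0, 1.0], size=(n_hidden, N))
    return signs * scale

X = [[0.1, 200.0], [0.2, 300.0], [0.3, 400.0]]
print(init_input_weights(X, n_hidden=2))    # the large input gets tiny weights
```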
Choice of learning rate

● The right value of η depends on the application.
● Values between 0.1 and 0.9 have been used
in many applications.
Contd..
● Consider a network of three layers.
● Let us use i to represent nodes in the input layer, j to represent nodes in the hidden layer, and k to represent nodes in the output layer.
● wij refers to the weight of the connection between a node in the input layer and a node in the hidden layer.
● The following equation is used to derive the output value Yj of node j:

  Yj = 1 / (1 + e^(–Xj))

where Xj = Σ xi · wij – θj, 1 ≤ i ≤ n; n is the number of inputs to node j, and θj is the threshold for node j.
Total Mean Squared Error

● The error of output neuron k after the activation of the network on the n-th training example (x(n), d(n)) is:

  ek(n) = dk(n) – yk(n)

● The network error is the sum of the squared errors of the output neurons:

  E(n) = Σk ek²(n)

● The total mean squared error is the average of the network errors over the N training examples:

  EAV = (1/N) Σn=1..N E(n)
Weight Update Rule
● The Backprop weight update rule is based on the gradient
descent method:
− It takes a step in the direction yielding the maximum decrease of
the network error E.
− This direction is the opposite of the gradient of E.
● Iteration of the Backprop algorithm is usually terminated
when the sum of squares of errors of the output values for
all training data in an epoch is less than some threshold
such as 0.01
  wij ← wij + Δwij ,  with  Δwij = –η · (∂E / ∂wij)
Backprop learning algorithm
(incremental mode)

n = 1;
initialize weights randomly;
while (stopping criterion not satisfied and n < max_iterations)
  for each example (x, d)
    - run the network with input x and compute the output y
    - update the weights in backward order, starting from those of the output layer:
        wji ← wji + Δwji
      with Δwji computed using the (generalized) Delta rule
  end-for
  n = n + 1;
end-while;
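Below is a runnable sketch of incremental-mode backprop for a 2-2-1 sigmoid network trained on XOR; the architecture, learning rate, and epoch count are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, size=(2, 2)); b1 = np.zeros(2)   # input -> hidden
W2 = rng.uniform(-0.5, 0.5, size=2);      b2 = 0.0           # hidden -> output
eta = 0.5

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([0, 1, 1, 0], dtype=float)

for n in range(20000):                        # epochs
    for x, d in zip(X, D):
        h = sigmoid(W1 @ x + b1)              # forward pass
        y = sigmoid(W2 @ h + b2)
        delta_out = (d - y) * y * (1 - y)     # generalized Delta rule
        delta_hid = delta_out * W2 * h * (1 - h)
        W2 += eta * delta_out * h;  b2 += eta * delta_out          # output layer first,
        W1 += eta * np.outer(delta_hid, x); b1 += eta * delta_hid  # then hidden layer

print([round(float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)), 2) for x in X])
# typically close to [0, 1, 1, 0]; with only 2 hidden units an unlucky
# seed can land in a local minimum, in which case retry or add units
```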
Stopping criteria
● Total mean squared error change:
− Back-prop is considered to have converged when the absolute
rate of change in the average squared error per epoch is
sufficiently small (in the range [0.1, 0.01]).
● Generalization-based criterion:
− After each epoch, the NN is tested for generalization.
− If the generalization performance is adequate, then stop.
− If this stopping criterion is used, then the part of the training set used for testing the network’s generalization must not be used for updating the weights.
Training
● Rule of thumb:
− the number of training examples should be at least five to ten
times the number of weights of the network.
● Other rule:

  N > |W| / (1 – a)

where |W| = number of weights and a = expected accuracy on the test set.
For example, with |W| = 80 weights and a desired accuracy a = 0.9, at least N > 800 training examples are needed.
