
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher
Chapter 6: Multilayer Neural Networks
(Sections 6.1 - 6.3)
Introduction
Feedforward Operation and Classification
Backpropagation Algorithm
Artificial Neural Net Models
Massive parallelism is essential for high-performance cognitive tasks (speech & image recognition); humans need only a few msec for most cognitive tasks
Design a massively parallel architecture
composed of many simple processing elements
interconnected to achieve certain collective
computational capabilities
Also known as connectionist models and parallel
distributed processing (PDP) models
Derive inspiration from knowledge of biological
neural nets

Natural Neural Net Models
Human brain consists of a very large number of neurons (between 10^10 and 10^12)
No. of interconnections per neuron is between 1K and 10K
Total number of interconnections is about 10^14
Damage to a few neurons or synapse (links)
does not appear to impair overall performance
significantly (robustness)
Artificial Neural Net Models
Artificial Neural nets are specified by
Net topology
Node (processor) characteristics
Training/learning rules
We consider only feedforward multilayer
networks
These networks essentially implement a non-
parametric non-linear classifier
Linear Discriminant Functions
A discriminant function that is a linear combination of the input features can be written as

g(x) = w^t x + w_0

where w is the weight vector and w_0 is the bias or threshold weight; the sign of the function value gives the class label
Introduction
Goal: Classify objects by learning nonlinearity
There are many problems for which linear
discriminants are not sufficient for minimum error
The central difficulty is the choice of the appropriate
nonlinear functions
A brute-force approach might be to select a complete basis set, such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples
There is no automatic method for determining the
nonlinearities when no information is provided to the
classifier
In multilayer neural networks (multilayer perceptrons), the form of the nonlinearity is learned from the training data
Multilayer networks can, in principle, provide the optimal
solution to an arbitrary classification problem
Nothing magical about multilayer neural networks; they
implement linear discriminants but in a space where the
inputs have been mapped nonlinearly
Feedforward Operation and Classification
A three-layer neural network consists of an
input layer, a hidden layer and an output layer
interconnected by modifiable weights
represented by links between layers
Multilayer neural network implements linear
discriminants, but in a space where the inputs
have been mapped nonlinearly
Figure 6.1 shows a simple three-layer
network
A single bias unit is connected to each unit in addition to the input units

Net activation:

net_j = Σ_{i=1}^{d} x_i w_ji + w_j0 = Σ_{i=0}^{d} x_i w_ji ≡ w_j^t · x

where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j (in neurobiology, such weights or connections are called synapses)

Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)
Figure 6.1 shows a simple threshold function:

f(net) = sgn(net) = +1 if net ≥ 0, -1 if net < 0

The function f(·) is also called the activation function or nonlinearity of a unit. There are more general activation functions with desirable properties

Each output unit similarly computes its net activation based on the hidden unit signals as:

net_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj ≡ w_k^t · y

where the subscript k indexes units in the output layer and n_H denotes the number of hidden units
The output units are referred to as z_k. An output unit computes the nonlinear function of its net input, emitting

z_k = f(net_k)

In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x); the input x is classified according to the largest discriminant function g_k(x), k = 1, …, c

The three-layer network with the weights listed in fig. 6.1 solves the XOR problem
The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0:
y_1 = +1 if x_1 + x_2 + 0.5 ≥ 0, y_1 = -1 otherwise

The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0:
y_2 = +1 if x_1 + x_2 - 1.5 ≥ 0, y_2 = -1 otherwise

The output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1

Using the terminology of computer logic, the units behave like gates: the first hidden unit is an OR gate, the second hidden unit is an AND gate, and the output unit implements

z_1 = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2

which provides the nonlinear decision of fig. 6.1
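As an aside, a minimal sketch of this network in Python, with inputs and unit outputs in {-1, +1} and sign activations; the output-unit weights used here (0.7, -0.4, bias -1.0) are an assumed choice that realizes y_1 AND NOT y_2, not necessarily the exact values of fig. 6.1:

def sgn(net):
    # threshold activation: +1 if net >= 0, else -1
    return 1.0 if net >= 0 else -1.0

def xor_net(x1, x2):
    # hidden layer, weights from the slide
    y1 = sgn(1.0 * x1 + 1.0 * x2 + 0.5)   # acts as x1 OR x2
    y2 = sgn(1.0 * x1 + 1.0 * x2 - 1.5)   # acts as x1 AND x2
    # output unit fires only when y1 = +1 and y2 = -1 (y1 AND NOT y2);
    # the weights 0.7, -0.4 and bias -1.0 are one assumed choice with this behavior
    return sgn(0.7 * y1 - 0.4 * y2 - 1.0)

for x1 in (-1.0, +1.0):
    for x2 in (-1.0, +1.0):
        print(x1, x2, '->', xor_net(x1, x2))
# prints -1 for (-1,-1) and (+1,+1), +1 for the mixed inputs: XOR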
General Feedforward Operation: the case of c output units
Hidden units enable us to express more complicated nonlinear functions and extend the classification capability
The activation function does not have to be a sign function; it is often required to be continuous and differentiable
We can allow the activation function in the output layer to differ from the activation function in the hidden layer, or allow a different activation for each individual unit
Assume for now that all activation functions are identical:
g_k(x) ≡ z_k = f( Σ_{j=1}^{n_H} w_kj · f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 ),   k = 1, …, c     (1)
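As an illustration only, a minimal Python forward pass implementing equation (1); the tanh activation and random weights are assumptions, since the equation allows any f:

import numpy as np

def forward(x, W_ji, w_j0, W_kj, w_k0, f=np.tanh):
    # x: (d,) input; W_ji: (n_H, d); w_j0: (n_H,); W_kj: (c, n_H); w_k0: (c,)
    net_j = W_ji @ x + w_j0      # hidden net activations
    y = f(net_j)                 # hidden outputs y_j = f(net_j)
    net_k = W_kj @ y + w_k0      # output net activations
    return f(net_k)              # discriminant values g_k(x) = z_k

# classify x according to the largest discriminant function g_k(x)
rng = np.random.default_rng(0)
d, n_H, c = 3, 5, 2
x = rng.normal(size=d)
z = forward(x, rng.normal(size=(n_H, d)), rng.normal(size=n_H),
            rng.normal(size=(c, n_H)), rng.normal(size=c))
label = int(np.argmax(z))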
Expressive Power of Multilayer Networks
Question: Can every decision boundary be implemented by a three-layer network described by equation (1)?
Answer: Yes (due to A. Kolmogorov)
Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights.
Any continuous function g(x) defined on the unit hypercube I^n (I = [0, 1], n ≥ 2) can be represented in the form

g(x) = Σ_{j=1}^{2n+1} Ξ_j ( Σ_{i=1}^{n} ψ_ij(x_i) )

for properly chosen functions Ξ_j and ψ_ij
This representation can be expressed in neural network terminology as follows:
Each of the (2n+1) hidden units takes as input a sum of d nonlinear functions, one for each input feature x_i
Each hidden unit emits a nonlinear function Ξ_j of its total input
The output unit emits the sum of the contributions of the hidden units
Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition
Another question: how many hidden nodes should we have?
Neural Network Functions
Backpropagation Algorithm
Any function from input to output can be implemented as a three-layer neural network
These results are of greater theoretical than practical interest, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!
Our goal is to learn the interconnection weights based on
the training patterns and the desired outputs
In a three-layer network, it is a straightforward matter to
understand how the output, and thus the error, depends
on the hidden-to-output layer weights
The power of backpropagation is that it enables us to
compute an effective error for each hidden unit, and thus
derive a learning rule for the input-to-hidden weights. This
is known as:
The credit assignment problem
Network Learning
Network has two modes of operation:
Feedforward
The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to yield a decision from the output units
Learning
The supervised learning consists of presenting an
input pattern and modifying the network parameters
(weights) to bring the actual outputs closer to the
desired target values
Network Learning
Start with an untrained network, present a training
pattern to the input layer, pass the signal through the
network and determine the output.
Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, …, c. Let w represent all the weights of the network
The training error:

J(w) = (1/2) Σ_{k=1}^{c} (t_k - z_k)^2 = (1/2) ||t - z||^2

The backpropagation learning rule is based on gradient descent
The weights are initialized with random values and are changed in a direction that will reduce the error:

Δw = -η ∂J/∂w
where η is the learning rate, which indicates the relative size of the change in weights

w(m + 1) = w(m) + Δw(m)

where m indexes the m-th training pattern presented

Error on the hidden-to-output weights:

∂J/∂w_kj = (∂J/∂net_k) · (∂net_k/∂w_kj) = -δ_k · (∂net_k/∂w_kj)

where the sensitivity of unit k is defined as

δ_k = -∂J/∂net_k

and describes how the overall error changes with the unit's net activation:

δ_k = -∂J/∂net_k = -(∂J/∂z_k) · (∂z_k/∂net_k) = (t_k - z_k) f'(net_k)
Since net_k = w_k^t · y, therefore:

∂net_k/∂w_kj = y_j

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:

Δw_kj = η δ_k y_j = η (t_k - z_k) f'(net_k) y_j

The learning rule for the input-to-hidden units is more subtle and is the crux of the credit assignment problem

Error on the input-to-hidden units, using the chain rule:

∂J/∂w_ji = (∂J/∂y_j) · (∂y_j/∂net_j) · (∂net_j/∂w_ji)
However,

∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k - z_k)^2 ]
        = -Σ_{k=1}^{c} (t_k - z_k) · (∂z_k/∂y_j)
        = -Σ_{k=1}^{c} (t_k - z_k) · (∂z_k/∂net_k) · (∂net_k/∂y_j)
        = -Σ_{k=1}^{c} (t_k - z_k) f'(net_k) w_kj

Similarly to the preceding case, we define the sensitivity of a hidden unit:

δ_j ≡ f'(net_j) Σ_{k=1}^{c} w_kj δ_k

The above equation is the core of the credit assignment problem: the sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f'(net_j); see fig. 6.5

Conclusion: the learning rule for the input-to-hidden weights is:

Δw_ji = η x_i δ_j = η [ Σ_{k=1}^{c} w_kj δ_k ] f'(net_j) x_i
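A minimal sketch of these two update rules applied to a single training pattern, assuming tanh activations in both layers (so f'(net) = 1 - f(net)^2); the function and variable names simply mirror the notation above:

import numpy as np

def backprop_step(x, t, W_ji, w_j0, W_kj, w_k0, eta=0.1):
    # forward pass with f = tanh, so f'(net) = 1 - f(net)**2
    net_j = W_ji @ x + w_j0
    y = np.tanh(net_j)
    net_k = W_kj @ y + w_k0
    z = np.tanh(net_k)

    # sensitivities
    delta_k = (t - z) * (1.0 - z ** 2)              # output units: (t_k - z_k) f'(net_k)
    delta_j = (1.0 - y ** 2) * (W_kj.T @ delta_k)   # hidden units: credit assignment

    # weight updates (in place): Δw_kj = η δ_k y_j  and  Δw_ji = η δ_j x_i
    W_kj += eta * np.outer(delta_k, y)
    w_k0 += eta * delta_k
    W_ji += eta * np.outer(delta_j, x)
    w_j0 += eta * delta_j
    return 0.5 * np.sum((t - z) ** 2)               # J(w) for this pattern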
Sensitivity at Hidden Node
Backpropagation Algorithm
More specifically, the backpropagation of
errors algorithm
During training, an error must be propagated
from the output layer back to the hidden layer to
learn the input-to-hidden weights
It is gradient descent in a layered network
Exact behavior of the learning algorithm
depends on the starting point
Start the process with random values of weights;
in practice you learn many networks with
different initializations
Training protocols:
Stochastic: patterns are chosen randomly from training
set; network weights are updated for each pattern
Batch: Present all patterns before updating weights
On-line: present each pattern once & only once (no
memory for storing patterns)
Stochastic backpropagation algorithm:

Begin  initialize n_H, w, criterion θ, η, m ← 0
    do  m ← m + 1
        x^m ← randomly chosen pattern
        w_ji ← w_ji + η δ_j x_i ;  w_kj ← w_kj + η δ_k y_j
    until  ||∇J(w)|| < θ
    return w
End
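A runnable sketch of this stochastic protocol, reusing the backprop_step sketch above and training a small tanh network on XOR; the 2-2-1 architecture, learning rate, and stopping threshold are illustrative assumptions, and the stopping test here uses the change in the total training error:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])   # training patterns
T = np.array([[-1.], [1.], [1.], [-1.]])                     # XOR target values
d, n_H, c = 2, 2, 1

# small random initial weights
W_ji = rng.uniform(-0.7, 0.7, (n_H, d)); w_j0 = rng.uniform(-0.7, 0.7, n_H)
W_kj = rng.uniform(-0.7, 0.7, (c, n_H)); w_k0 = rng.uniform(-0.7, 0.7, c)

eta, theta, prev_J = 0.1, 1e-6, np.inf
for m in range(100_000):
    i = rng.integers(len(X))                                  # x^m <- random pattern
    backprop_step(X[i], T[i], W_ji, w_j0, W_kj, w_k0, eta)    # update w_ji, w_kj
    # stopping criterion: change in the training-set error J(w)
    Z = np.tanh(np.tanh(X @ W_ji.T + w_j0) @ W_kj.T + w_k0)
    J = 0.5 * np.sum((T - Z) ** 2)
    if abs(prev_J - J) < theta:
        break
    prev_J = J

As the slides note, gradient descent can get caught in a local minimum, so in practice the run may need to be repeated from several random initializations.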
Stopping criterion
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ; there are other stopping criteria that lead to better performance than this one
A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set
In stochastic backpropagation and batch backpropagation, we must make several passes through the training data
Learning Curves
Before training starts, the error on the training set is high;
as the learning proceeds, error becomes smaller
Error per pattern depends on the amount of training data
and the expressive power (such as the number of
weights) in the network
Average error on an independent test set is always higher
than on the training set, and it can decrease as well as
increase
A validation set is used in order to decide when to stop training; we do not want to overfit the network and degrade the classifier's generalization ability
Stop training when the error on the validation set is at its minimum
Representation at the Hidden Layer
What do the learned weights mean?
The weights connecting hidden layer to output
layer form a linear discriminant
The weights connecting input layer to hidden
layer represent a mapping from the input
feature space to a latent feature space
For each hidden unit, the weights from input
layer describe the input pattern that leads to
the maximum activation of that node
Backpropagation as Feature Mapping
64-2-3 sigmoidal network for classifying three characters (E, F, L): the input-to-hidden weights learned for this character recognition task, shown for the two hidden nodes as 8x8 patterns. The left node is activated most strongly by F, the right node by L, and both are activated by E
Non-linear interactions between the features may cause the features of a pattern to not manifest at a single hidden node (contrary to the example shown above)
It may be difficult to draw similar interpretations in large networks, and caution must be exercised when analyzing weights
Practical Techniques for Improving Backpropagation
A naïve application of the backpropagation procedure may lead to slow convergence and poor performance
Some practical suggestions follow; there are no theoretical results
Activation Function f(.)
Must be non-linear (otherwise, 3-layer network is just a linear
discriminant) and saturate (have max and min value) to keep
weights and activation functions bounded
Activation function and its derivative must be continuous and
smooth; optionally monotonic
Choice may depend on the problem, e.g. a Gaussian activation if the data come from a mixture of Gaussians
Examples: sigmoid (most popular), polynomial, tanh, sign function
Parameters of the activation function (e.g. the sigmoid)
Centered at 0, odd function f(-net) = -f(net) (anti-symmetric);
leads to faster learning
Depend on the range of the input values
Activation Function
The anti-symmetric sigmoid f(x) = a·tanh(b·x), which satisfies f(-x) = -f(x), with a = 1.716 and b = 2/3, together with its first & second derivatives
Practical Considerations
Scaling inputs (important not just for neural networks)
Large differences in the scale of different features, due to the choice of units, are compensated for by normalizing the features to the same range, [0,1] or [-1,1]; without normalization, the error will hardly depend on features with very small values
Standardization: Shift the inputs to have zero mean and unit
variance
Target Values
Use a one-of-c representation for the target vector (c is the no. of classes); see the sketch after this slide
Better to use target values +1 and -1, which lie well within the range of the sigmoid saturation values (-1.716, +1.716)
Higher target values (e.g. ±1.716, the saturation points of the sigmoid) may require the weights to go to infinity to minimize the error
Training with Noise
For small training sets, it is better to add noise to the input patterns
and generate new virtual training patterns
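A minimal sketch of these preprocessing choices, assuming the usual formulas (zero-mean, unit-variance standardization and a one-of-c target vector coded with ±1); the array names are illustrative:

import numpy as np

def standardize(X):
    # shift each feature to zero mean and scale to unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)

def one_of_c_targets(labels, c):
    # one-of-c target vectors: +1 for the true class, -1 elsewhere
    T = -np.ones((len(labels), c))
    T[np.arange(len(labels)), labels] = 1.0
    return T

X = standardize(np.array([[180.0, 0.002], [175.0, 0.004], [160.0, 0.001]]))
T = one_of_c_targets(np.array([0, 2, 1]), c=3)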
Practical Considerations
Number of hidden units (n_H)
Governs the expressive power of the network
The easier the task, the fewer the nodes needed
Rule of thumb: the total no. of weights must be less than the number of training examples (preferably 10 times less); the no. of hidden units determines the total no. of weights
A more principled method is to adjust the network complexity in response to the training data, e.g. start with a large no. of hidden units and decay, prune, or eliminate weights
Initializing weights
We cannot initialize the weights to zero, otherwise learning cannot take place
Choose initial weights w uniformly at random with |w| < ŵ
ŵ too small: slow learning; ŵ too large: early saturation and no learning
ŵ is chosen as 1/√d for the input-to-hidden weights and 1/√n_H for the hidden-to-output weights
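A short sketch of this initialization rule, with ŵ = 1/√d for the input-to-hidden weights and ŵ = 1/√n_H for the hidden-to-output weights; uniform sampling and the extra bias column are assumptions about the layout:

import numpy as np

def init_weights(d, n_H, c, rng=np.random.default_rng(0)):
    w_in = 1.0 / np.sqrt(d)       # bound for input-to-hidden weights
    w_out = 1.0 / np.sqrt(n_H)    # bound for hidden-to-output weights
    W_ji = rng.uniform(-w_in, w_in, (n_H, d + 1))     # +1 column for the bias unit
    W_kj = rng.uniform(-w_out, w_out, (c, n_H + 1))
    return W_ji, W_kj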
Total no. of Weights
Error per pattern as the number of hidden nodes is increased
A 2-n_H-1 network (with bias) was trained on 90 two-dimensional patterns from each class (n = 180), each class sampled from a mixture of 3 Gaussians
The minimum test error occurs at 17-21 weights in total (4-5 hidden nodes); this illustrates the rule of thumb that n/10 weights often gives the lowest error
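For reference, the weight count behind these numbers (assuming one bias weight per hidden and output unit): a 2-n_H-1 network has (2+1)·n_H weights into the hidden layer and (n_H+1) weights into the output unit, i.e. 4·n_H + 1 in total, so n_H = 4 gives 17 weights and n_H = 5 gives 21, consistent with the n/10 = 18 rule of thumb.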
Practical Considerations
Learning Rate
Small learning rate: slow convergence
Large learning rate: high oscillation and slow convergence
Practical Considerations
Momentum
Prevents the algorithm from getting stuck at plateaus and local
minima
Weight decay
Avoid overfitting by imposing the condition that weights must be
small
After each update, weights are decayed by some factor
Related to regularization (also used in SVM)
Hints
Additional output nodes added to the network that are only used during training; they help learn a better internal feature representation
Practical Considerations
Training setup
Online, stochastic, batch-mode
Stop training
Halt when validation error reaches (first) minimum
Number of hidden layers
More layers -> more complex
Networks with more hidden layers are more prone to
get caught in local minima
Smaller the better (KISS)
Criterion function
We talked about squared error, but there are others
