
Artificial neural networks

Background
Artificial neurons, what they can and cannot do
The multilayer perceptron (MLP)
Three forms of learning
The back propagation algorithm
Radial basis function networks
Competitive learning (and relatives)

An artificial neuron
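[Figure: a neuron with inputs x_0 = +1, x_1, x_2, ..., x_n, weights w_0, w_1, w_2, ..., w_n, and output y.]

$$y = f(S), \qquad S = \sum_{i=0}^{n} w_i x_i = w_0 + \sum_{i=1}^{n} w_i x_i$$

f(S) = any non-linear, saturating function, e.g. a step function or a sigmoid:

$$f(S) = \frac{1}{1 + e^{-S}}$$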
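The neuron's computation can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid transfer function; the function names and the example weights are hypothetical and not taken from the slides.

```python
import math

def sigmoid(s):
    """Sigmoid transfer function f(S) = 1 / (1 + e^(-S))."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(weights, inputs):
    """Output y = f(S) of one neuron.

    weights = [w0, w1, ..., wn], inputs = [x1, ..., xn].
    The constant input x0 = +1 is prepended here, so w0 acts as the bias.
    """
    x = [1.0] + list(inputs)
    s = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return sigmoid(s)

# Hypothetical weights: the neuron answers "class 1" only when both inputs are on.
y_on = neuron_output([-1.5, 1.0, 1.0], [1.0, 1.0])   # S = 0.5, so y > 0.5
y_off = neuron_output([-1.5, 1.0, 1.0], [0.0, 1.0])  # S = -0.5, so y < 0.5
```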

A single neuron as a classifier

The neuron can be used as a classifier:
y < 0.5: class 0
y > 0.5: class 1
Linear discriminant = a hyperplane
2D example: a line

$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$$

Only linearly separable classification problems can be solved.

The XOR problem

Not linearly separable: must combine two linear discriminants.
[Figure: the XOR problem in the (x1, x2) plane, with one discriminant for AND and one for NOR.]
Two sigmoids implement fuzzy AND and NOR.
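One way to see how two discriminants combine is to set the weights by hand. The sketch below assumes one hidden sigmoid acting as a fuzzy AND, one as a fuzzy NOR, and an output node that fires when neither of them does. All weight values are hypothetical choices for illustration, not values from the slides.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def xor_net(x1, x2):
    """XOR built from two hidden sigmoids (fuzzy AND, fuzzy NOR) and one output node."""
    h_and = sigmoid(-15.0 + 10.0 * x1 + 10.0 * x2)  # high only for (1, 1)
    h_nor = sigmoid(5.0 - 10.0 * x1 - 10.0 * x2)    # high only for (0, 0)
    # XOR is true when the input is neither (1, 1) nor (0, 0):
    # the output node is itself a fuzzy NOR of the two hidden nodes.
    return sigmoid(5.0 - 10.0 * h_and - 10.0 * h_nor)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "class", 1 if xor_net(a, b) > 0.5 else 0)
```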

The multilayer perceptron

Output nodes: linear (function approximation) or sigmoidal (classification)
Can approximate any continuous function, given a sufficiently rich internal structure (number of nodes and layers)

Artificial neural networks ...

store information in the weights, not in the nodes
are trained, by adjusting the weights, not programmed
can generalize to previously unseen data
are adaptive
are fast computational devices, well suited for parallel simulation and/or hardware implementation
are fault tolerant

Application areas
Finance
Forecasting
Fraud detection
Medicine
Image analysis
Consumer market
Household equipment
Character recognition
Speech recognition
Industry
Adaptive control
Signal analysis
Data mining

Why neural networks?

(statistical methods are always at least as good, right?)

Neural networks are statistical methods
Model independence
Adaptivity/Flexibility
Concurrency
Economical reasons (rapid prototyping)

Three forms of learning

[Figure: block diagrams of the learning set-ups. Recoverable labels: Input, Target function, Learning system, Output (y), Desired output (d), Error function, Error; State, Suggested actions, Action selector, Action; Environment, Agent, Reward.]

Back propagation

The contribution to the error E from a particular weight w_ji is

$$\frac{\partial E}{\partial w_{ji}}$$

The weight should be moved in proportion to that contribution, but in the other direction:

$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$$

where η is the learning rate.

Error function and transfer function must both be differentiable.

Back propagation update rule

Error is squared error:

$$E = \frac{1}{2} \sum_{j=1}^{n} (d_j - y_j)^2$$

Transfer (activation) function is sigmoid:

$$y_j = f(S_j) = \frac{1}{1 + e^{-S_j}}$$

With learning rate η, the update rule is

$$\Delta w_{ji} = \eta \, \delta_j x_i$$

$$\delta_j = \begin{cases} y_j (1 - y_j)(d_j - y_j) & \text{if node } j \text{ is an output node} \\ y_j (1 - y_j) \sum_k w_{kj} \delta_k & \text{otherwise} \end{cases}$$

Here y_j (1 - y_j) is the derivative of the sigmoid, (d_j - y_j) comes from the derivative of the error, and the sum runs over all nodes k in the next layer (closer to the outputs).

Training procedure (1)

Network is initialised with small random weights
Split data in two: a training set and a test set
The training set is used for training and is passed through many times.
  Update weights after each presentation (pattern learning), or
  accumulate weight changes (Δw) until the end of the training set is reached (epoch or batch learning)
The test set is used to test for generalization (to see how well the net does on previously unseen data). This is the result that counts!
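The back propagation update rule can be sketched for a network with one hidden layer. This is a minimal illustration, assuming sigmoid nodes throughout, squared error, and a constant +1 input to each layer as the bias; the function names and data layout are hypothetical. It returns the weight changes for a single pattern, as in pattern learning.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(w_hidden, w_out, x):
    """Return (inputs with bias, hidden outputs with bias, network outputs)."""
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    y = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in w_out]
    return xb, hb, y

def squared_error(w_hidden, w_out, x, d):
    _, _, y = forward(w_hidden, w_out, x)
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y))

def backprop_deltas(w_hidden, w_out, x, d, eta):
    """Weight changes eta * delta_j * x_i for one pattern (x, d)."""
    xb, hb, y = forward(w_hidden, w_out, x)
    # Output nodes: delta_j = y_j (1 - y_j) (d_j - y_j)
    delta_out = [yj * (1 - yj) * (dj - yj) for yj, dj in zip(y, d)]
    # Hidden nodes: delta_j = y_j (1 - y_j) * sum over next-layer nodes k of w_kj delta_k
    delta_hid = []
    for j in range(len(w_hidden)):
        hj = hb[j + 1]  # index 0 is the bias input, which has no delta
        back = sum(w_out[k][j + 1] * delta_out[k] for k in range(len(w_out)))
        delta_hid.append(hj * (1 - hj) * back)
    dw_out = [[eta * dk * hi for hi in hb] for dk in delta_out]
    dw_hidden = [[eta * dj * xi for xi in xb] for dj in delta_hid]
    return dw_hidden, dw_out
```

Adding the returned changes to the weights after every pattern gives pattern learning; summing them over the whole training set before applying them gives epoch (batch) learning.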


Overtraining

[Figure: typical error curves. Error E against time (epochs), showing the training set error and the test or validation set error.]

Overtraining is more likely to occur
  if we train on too little data
  if the network has too many hidden nodes
  if we train for too long

Cross validation: Use a third set, a validation set, to decide when to stop (find the minimum for this set, and retrain for that number of epochs)

Network size

The network should be slightly larger than the size necessary to represent the target function
Unfortunately, the target function is unknown ...
Need much more training data than the number of weights!

Training procedure (2)

1. Start with a small network, train, increase the size, train again, etc., until the error on the training set can be reduced to acceptable levels.
2. If an acceptable error level was found, increase the size by a few percent and retrain again, this time using the cross-validation procedure to decide when to stop. Publish the result on the independent test set.
3. If the network failed to reduce the error on the training set, despite a large number of nodes and attempts, something is likely to be wrong with the data.
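The stopping decision in step 2 can be sketched as picking the epoch at which the validation set error is lowest. The helper name and the error values below are made up purely for illustration; in practice the curve would come from evaluating the network on the validation set after each epoch.

```python
def best_stopping_epoch(validation_errors):
    """Return the 1-based epoch at which the validation set error is lowest."""
    best_index = min(range(len(validation_errors)),
                     key=lambda e: validation_errors[e])
    return best_index + 1

# Hypothetical validation error after each of eight epochs:
# it falls at first, then rises again once overtraining sets in.
valid_err = [0.95, 0.70, 0.52, 0.45, 0.43, 0.46, 0.51, 0.58]
stop_at = best_stopping_epoch(valid_err)  # retrain for this many epochs
```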

Practical considerations
What happens if the mapping represented by the data is not a function? For example, what if the same input does not always lead to the same output?
In what order should data be presented? Sequentially? At random?
How should data be represented? Compact? Distributed?
What can be done about missing data?
Trick of the trade: Monotonic functions are easier to learn than non-monotonic functions! (at least for the MLP)

Radial basis functions (RBF)

Layered structure, like the MLP, with one hidden layer
Output nodes are conventional
Each hidden node
  measures the distance between its weight vector and the input vector (instead of a weighted sum)
  feeds that through a Gaussian (instead of a sigmoid)

Geometric interpretation
The input space is covered with overlapping Gaussians.
In classification, the discriminants become hyperspheres (circles in 2D).
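A single RBF hidden node can be sketched as follows. This assumes Euclidean distance and a Gaussian with one width parameter per node; the function and parameter names are hypothetical.

```python
import math

def rbf_hidden_output(weights, inputs, width):
    """Output of one RBF hidden node.

    The node measures the squared Euclidean distance between its weight
    vector (the centre of the Gaussian) and the input vector, then feeds
    that through a Gaussian of the given width.
    """
    dist_sq = sum((x - w) ** 2 for x, w in zip(inputs, weights))
    return math.exp(-dist_sq / (2.0 * width ** 2))

centre = [1.0, 2.0]
at_centre = rbf_hidden_output(centre, [1.0, 2.0], 0.5)  # maximum response
nearby = rbf_hidden_output(centre, [1.5, 2.0], 0.5)     # weaker response
```

All inputs at the same distance from the centre give the same output, which is why the discriminants are circles in 2D.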



RBF training
Could use backprop (transfer function still differentiable)
Better: Train layers separately
  Hidden layer: Find position and size of Gaussians by unsupervised learning (e.g. competitive learning, K-means)
  Output layer: Supervised, e.g. Delta-rule, LMS, backprop

RBF compared with MLP:
RBF (hidden) nodes work in a local region, MLP nodes are global
MLPs do better in high-dimensional spaces
MLPs require fewer nodes and generalize better
RBFs can learn faster
RBFs are less sensitive to the order in which data is presented
RBFs make fewer false-yes classification errors
MLPs extrapolate better

Unsupervised learning
Classifying unlabeled data

Nearest neighbour classifiers
Classify the unknown sample (vector) x to the class of its closest previously classified neighbour
[Figure: the new pattern, x, is assigned the class of its closest neighbour.]
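A nearest neighbour classifier can be sketched in a few lines. This assumes Euclidean distance and stores the classified samples as (vector, label) pairs; the names and the example data are hypothetical.

```python
def nearest_neighbour_class(x, samples):
    """Return the label of the stored sample closest to x.

    samples is a list of (vector, label) pairs. Every stored sample is
    compared against x, so the cost grows with the number of samples.
    """
    def dist_sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(samples, key=lambda s: dist_sq(x, s[0]))
    return label

samples = [([0.0, 0.0], "a"), ([0.1, 0.3], "a"), ([1.0, 1.0], "b")]
predicted = nearest_neighbour_class([0.2, 0.1], samples)
```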

K-means, for K=2
1. Make a codebook of two vectors, c1 and c2
2. Sample (at random) two vectors from the data as initial values of c1 and c2
3. Split the data in two subsets, D1 and D2, where D1 is the set of all points with c1 as their closest codebook vector, and vice versa
4. Move c1 towards the mean in D1 and c2 towards the mean in D2
5. Repeat from 3 until convergence (until the codebook vectors stop moving)
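The K-means steps for K=2 can be sketched as follows. This version assumes Euclidean distance and moves each codebook vector all the way to the mean of its subset (the usual batch form) rather than only part of the way; the function names are hypothetical.

```python
import random

def dist_sq(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def two_means(data, rng=random, max_iter=100):
    """K-means for K=2. Returns the two codebook vectors c1 and c2."""
    c1, c2 = rng.sample(data, 2)  # initial values sampled from the data
    for _ in range(max_iter):
        # Split the data according to the closest codebook vector.
        d1 = [p for p in data if dist_sq(p, c1) <= dist_sq(p, c2)]
        d2 = [p for p in data if dist_sq(p, c1) > dist_sq(p, c2)]
        # Move each codebook vector to the mean of its subset
        # (an empty subset leaves its vector where it is).
        new_c1 = mean(d1) if d1 else c1
        new_c2 = mean(d2) if d2 else c2
        if new_c1 == c1 and new_c2 == c2:
            break  # converged: the codebook vectors stopped moving
        c1, c2 = new_c1, new_c2
    return c1, c2

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
c1, c2 = two_means(data, rng=random.Random(0))
```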

Problems with nearest neighbour classifiers:
Problem 1: The closest neighbour may be an outlier from the wrong class
Problem 2: Must store lots of samples and compute distance to each one, for every new sample

Voronoi regions
K-means forms so-called Voronoi regions in the input space
The Voronoi region around a codebook vector ci is the region in which ci is the closest codebook vector

Competitive learning
M linear, threshold-less nodes (only weighted sums), N inputs
Present a pattern (sample), x
The node with the largest output (node k) is declared winner
The weights of the winner are updated so that it will become even stronger the next time the same pattern is presented. All other weights are left unchanged
With normalised weights, this is equivalent to finding the node with the minimum distance between its weight vector and the input vector
Network node = Codebook vector

The standard competitive learning rule

$$\Delta w_{ki} = \eta (x_i - w_{ki}), \qquad 1 \le i \le N$$

where η is the learning rate.

Competitive learning + batch learning = K-means
[Figure: Voronoi regions around 10 codebook vectors.]
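One step of the standard competitive learning rule can be sketched as follows. The winner is chosen by largest weighted sum, as described above; the function name and the example values are hypothetical.

```python
def competitive_step(weights, x, eta):
    """Present one pattern x and update only the winning node.

    weights is a list of M weight vectors (one per node), each of length N.
    Returns the index k of the winner. The list is modified in place.
    """
    outputs = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    k = max(range(len(weights)), key=lambda j: outputs[j])
    # Move the winner's weights a fraction eta of the way towards x.
    weights[k] = [w_i + eta * (x_i - w_i) for w_i, x_i in zip(weights[k], x)]
    return k

weights = [[1.0, 0.0], [0.0, 1.0]]
winner = competitive_step(weights, [0.8, 0.6], eta=0.5)
```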

The winner takes it all

Problem with competitive learning: A node may become invincible
[Figure: initial weight vectors (W) and data clusters A and B.]
Poor initialisation: The weight vectors have been initialised to small random numbers (in W), but these are far from the data (A and B)
The first node to win will move from W towards A or B and will always win, henceforth
Solution: Use the data to initialise the weights (as in K-means), or include the winning-frequency in the distance measure, or move more nodes than only the winner.

Self organising maps

The cerebral cortex is a two-dimensional structure, yet we can reason in more than two dimensions
Different neurons in the auditory cortex respond to different frequencies. These neurons are located in frequency order!
Topological preservation / topographic map

Dimensional reduction

[Figure: a 3x3 grid, making a two-dimensional map of the four-dimensional input space.]

Kohonen's self-organising feature map (SOFM or SOM)

Non-linear, topologically preserving, dimensional reduction (like pressing a flower)

Competitive learning, extended in two ways:
1. The nodes are organised in a two-dimensional grid (in competitive learning, there is no defined order between nodes)
2. A neighbourhood function is introduced (not only the winner is updated, but also its closest neighbours in the grid)

SOM update rule

Find the winner, node k, and then update the weights of every node j by:

$$\Delta w_{ji} = \eta \, f(j, k) \, (x_i - w_{ji}), \qquad 1 \le i \le N$$

f(j, k) is a neighbourhood function in the range [0,1], with a maximum for the winner (j=k) and decreasing with distance from the winner, e.g. a Gaussian
Gradually decrease neighbourhood radius (width of the Gaussians) and learning rate (η) over time.
Result: Vectors that are close in the high-dimensional input space will activate areas that are close on the grid.
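One SOM update step can be sketched as follows. This assumes the winner is the node whose weight vector is closest to the input, and a Gaussian neighbourhood function of the distance between grid positions; the names `som_step`, `grid`, and `radius` are hypothetical.

```python
import math

def som_step(weights, grid, x, eta, radius):
    """Present one pattern x and update all nodes. Returns the winner index k.

    weights[j] is the weight vector of node j, grid[j] its (row, col)
    position on the map. The weights list is modified in place.
    """
    # Winner: the node whose weight vector is closest to the input.
    k = min(range(len(weights)),
            key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j])))
    for j in range(len(weights)):
        # Gaussian neighbourhood: 1 for the winner, decreasing with grid distance.
        grid_dist_sq = ((grid[j][0] - grid[k][0]) ** 2
                        + (grid[j][1] - grid[k][1]) ** 2)
        f = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
        weights[j] = [wi + eta * f * (xi - wi) for wi, xi in zip(weights[j], x)]
    return k

# A 1x3 map of two-dimensional inputs (hypothetical values).
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
grid = [(0, 0), (0, 1), (0, 2)]
winner = som_step(weights, grid, [0.2, 0.0], eta=0.5, radius=1.0)
```

In a full training loop, both `eta` and `radius` would be decreased gradually between calls.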


SOM online example

A two-dimensional, clickable map of Usenet news articles (from ...

SOM offline example

A 10x10 SOM is trained on a chemical analysis of 178 wines from one region in Italy, where the grapes have grown on three different types of soil. The input is 13-dimensional. After training, wines from different soil types activate different regions of the SOM.
[Figure: example activation regions of the SOM.]
Note that the network is not told that the difference between the wines is the soil type, nor how many such types (how many classes) there are.

Growing neural gas

Growing unsupervised network (starting from two nodes)
Dynamic neighbourhood
Constant parameters
Very good at following moving targets
Can also follow jumping targets
Current work: Using GNG to define and train the hidden layer of Gaussians in an RBF network

Node positions
Start with two nodes
Each node has a set of neighbours, indicated by edges
The edges are created and destroyed dynamically during training
For each sample, the closest node, k, and all its current neighbours are moved towards the input

Node creation
A new node (blue) is created every λ-th time step, unless the maximum number of nodes has been reached
The new node is placed halfway between the node with the greatest error and the node among its current neighbours with the greatest error
The node with the greatest error is the most unstable one

Node creation (contd.)

Here, a fourth node has just been created
In effect, new nodes are created close to where they are most likely needed
The exact position of the new node is not crucial, since nodes move around

After a while

[Figure: the network after a while. Panels: 7 nodes; 50 nodes (Voronoi regions in red).]

Neighbourhood edges are created and destroyed as follows:
For each sample, let k denote the winner (the node closest to the sample) and r the runner-up (the second closest)
If an edge exists between k and r, reset its age to 0. Otherwise, create such an edge and set its age to 0
Increment the age of all other edges emanating from node k
Edges older than a_max are removed, as are any nodes that in this way lose their last remaining edge
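The edge bookkeeping above can be sketched with a dictionary from node pairs to ages. This is a minimal illustration of the edge rules only (no node movement or node creation); the data layout and the function name are assumptions.

```python
def update_edges(edges, k, r, a_max):
    """Apply the GNG edge rules for one sample.

    edges maps frozenset({i, j}) to the age of the edge between nodes i and j.
    k is the winner and r the runner-up. The dictionary is modified in place.
    Returns the set of nodes that lost their last remaining edge.
    """
    connected_before = {n for e in edges for n in e}
    # Age every other edge emanating from the winner.
    for e in edges:
        if k in e and r not in e:
            edges[e] += 1
    # Create the edge between winner and runner-up, or reset its age.
    edges[frozenset((k, r))] = 0
    # Remove edges that have become too old.
    for e in [e for e, age in edges.items() if age > a_max]:
        del edges[e]
    connected_after = {n for e in edges for n in e}
    return connected_before - connected_after

edges = {frozenset((0, 1)): 1, frozenset((0, 2)): 2, frozenset((1, 3)): 1}
removed_nodes = update_edges(edges, k=0, r=1, a_max=2)
```

Edges that do not touch the winner keep their age, so the edge between nodes 1 and 3 in this example is left as it was.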

Delaunay triangulation
Connect the codebook vectors in all adjacent Voronoi regions
The graph of GNG edges is a subset of the Delaunay triangulation
[Figure: Voronoi regions (red) and Delaunay triangulation (yellow).]

Dead units
There is only one way for an edge to get younger: when the two nodes it interconnects are the two closest to the input
If one of the two nodes wins, but the other one is not the runner-up, then, and only then, the edge ages
If neither of the two nodes wins, the edge does not age!
[Figure: the input distribution has jumped from the lower left to the upper right corner.]


The lab
(in room 1515!)
Classification of bitmaps, by supervised learning (back propagation), using the SNNS simulator
An illustration of some unsupervised learning algorithms, using the GNG demo applet:
  LBG/LBG-U (K-means)
  HCL (Hard competitive learning)
  Neural gas
  CHL (Competitive Hebbian learning)
  Neural gas with CHL
  GNG/GNG-U (Growing neural gas)
  SOM (Self organising map)
  Growing grid