
UNIT-1

ARCHITECTURE


What are Neural Networks?


• Simple computational elements forming a large
network
– Emphasis on learning (pattern recognition)
– Local computation (neurons)

• Definition of NNs is vague


– Often, but not always, inspired by the biological brain


Machine Learning
Machine learning involves adaptive mechanisms
that enable computers to learn from experience,
learn by example and learn by analogy. Learning
capabilities can improve the performance of an
intelligent system over time. The most popular
approaches to machine learning are artificial
neural networks and genetic algorithms. This
lecture is dedicated to neural networks.


Biological neural network


[Figure: two biological neurons connected by synapses; each neuron has a soma (cell body), dendrites, and an axon.]


The neuron as a simple computing element


Diagram of a neuron

[Figure: input signals x1, x2, ..., xn enter the neuron through weighted links w1, w2, ..., wn; the neuron produces an output signal Y that is sent along several outgoing branches.]


Architecture of a typical artificial neural network

[Figure: a typical artificial neural network; input signals enter an input layer, pass through a middle (hidden) layer, and leave the output layer as output signals.]


• A neural network can be defined as a model of reasoning based on the human brain. The brain consists of a densely interconnected set of nerve cells, or basic information-processing units, called neurons.
• The human brain incorporates nearly 10 billion neurons and 60 trillion connections, synapses, between them. By using multiple neurons simultaneously, the brain can perform its functions much faster than the fastest computers in existence today.


• Each neuron has a very simple structure, but an army of such elements constitutes tremendous processing power.
• A neuron consists of a cell body, soma, a number of fibers called dendrites, and a single long fiber called the axon.


• Our brain can be considered a highly complex, non-linear and parallel information-processing system.
• Information is stored and processed in a neural network simultaneously throughout the whole network, rather than at specific locations. In other words, in neural networks, both data and its processing are global rather than local.
• Learning is a fundamental and essential characteristic of biological neural networks. The ease with which they can learn led to attempts to emulate a biological neural network in a computer.

• An artificial neural network consists of a number of very simple processors, also called neurons, which are analogous to the biological neurons in the brain.
• The neurons are connected by weighted links passing signals from one neuron to another.


Network Structure
• The output signal is transmitted through the
neuron’s outgoing connection. The outgoing
connection splits into a number of branches
that transmit the same signal. The outgoing
branches terminate at the incoming
connections of other neurons in the network.


Analogy between biological and artificial neural networks

Biological Neural Network    Artificial Neural Network
Soma                         Neuron
Dendrite                     Input
Axon                         Output
Synapse                      Weight

[Figure: a biological neuron (soma, dendrites, axon, synapses) shown side by side with an artificial network of input, middle and output layers.]


Course Topics
Learning Tasks

Supervised
  Data: labeled examples (input, desired output)
  Tasks: classification, pattern recognition, regression
  NN models: perceptron, adaline, feed-forward NN, radial basis function, support vector machines

Unsupervised
  Data: unlabeled examples (different realizations of the input)
  Tasks: clustering, content-addressable memory
  NN models: self-organizing maps (SOM), Hopfield networks

Network architectures

• Three different classes of network architectures:
– single-layer feed-forward (neurons organized in acyclic layers)
– multi-layer feed-forward (neurons organized in acyclic layers)
– recurrent

• The architecture of a neural network is linked with the learning algorithm used to train it


Single Layer Feed-forward

[Figure: single-layer feed-forward network: an input layer of source nodes projecting onto an output layer of neurons.]


Multi layer feed-forward

[Figure: a 3-4-2 multi-layer feed-forward network: an input layer with 3 nodes, a hidden layer with 4 neurons, and an output layer with 2 neurons.]


Recurrent network
Recurrent network with hidden neurons: the unit-delay operator z^-1 is used to model a dynamic system.

[Figure: recurrent network in which outputs are fed back to the inputs and to hidden neurons through unit-delay elements z^-1.]


The Neuron
[Figure: neuron model: input values x1, x2, ..., xm with synaptic weights w1, w2, ..., wm feed a summing function; a bias b is added to give the local field v, which is passed through an activation function φ(·) to produce the output y.]


The Neuron
• The neuron is the basic information-processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights w1, w2, …, wm
2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers):
   u = Σ_{j=1..m} w_j x_j
3. An activation function φ (squashing function) for limiting the amplitude of the neuron output:
   y = φ(u + b)
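As a concrete illustration (a minimal sketch, not taken from these slides), the Python snippet below computes such a neuron output; the choice of a sigmoid as φ and the numeric values are assumptions.

```python
import math

def sigmoid(v):
    # squashing activation: maps the local field to (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias):
    # adder (linear combiner): u = sum_j w_j * x_j
    u = sum(w * x for w, x in zip(weights, inputs))
    # activation applied to the local field v = u + b
    return sigmoid(u + bias)

# example: three inputs with arbitrary weights and bias
print(neuron_output([1.0, 0.5, -1.0], [0.2, 0.4, 0.1], bias=-0.1))
```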

Bias of a Neuron
• The bias b has the effect of applying an affine
transformation to the weighted sum u
v=u+b
• v is called the induced field of the neuron

[Figure: the lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 in the (x1, x2) plane, illustrating how the bias shifts the boundary defined by u + b = 0.]

Bias as extra input
• The bias is an external parameter of the neuron. It can be modeled by adding an extra input:
  x0 = +1, with weight w0 = b
  v = Σ_{j=0..m} w_j x_j

[Figure: neuron model with the bias as an extra input x0 = +1 with synaptic weight w0 = b; input signals x1, ..., xm with synaptic weights w1, ..., wm feed the summing function, whose output v passes through the activation function φ(·) to give the output y.]
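A tiny NumPy sketch of this trick (the arrays and values are my own illustrative assumptions): the bias is absorbed into the weight vector by prepending a constant input x0 = +1.

```python
import numpy as np

w = np.array([0.2, 0.4, 0.1])    # weights w1..wm
b = -0.1                         # bias
x = np.array([1.0, 0.5, -1.0])   # inputs x1..xm

# explicit bias: v = w.x + b
v_explicit = w @ x + b

# bias as an extra input: x0 = +1 with weight w0 = b
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))
v_aug = w_aug @ x_aug

assert np.isclose(v_explicit, v_aug)   # both formulations give the same local field
```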

Activation Function φ
There are different activation functions used in different applications. The
most common ones are:

Hard-limiter:           φ(v) = 1 if v ≥ 0;  0 if v < 0
Piecewise linear:       φ(v) = 1 if v ≥ 1/2;  v if -1/2 < v < 1/2;  0 if v ≤ -1/2
Sigmoid:                φ(v) = 1 / (1 + exp(-a v))
Hyperbolic tangent:     φ(v) = tanh(v)
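Purely as an illustration (a sketch, not part of the original slides), these four functions in Python; the slope parameter a of the sigmoid is assumed:

```python
import math

def hard_limiter(v):
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    # follows the slide: 1 for v >= 1/2, 0 for v <= -1/2, v in between
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def sigmoid(v, a=1.0):
    # logistic sigmoid with slope parameter a
    return 1.0 / (1.0 + math.exp(-a * v))

def hyperbolic_tangent(v):
    return math.tanh(v)
```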


Neuron Models
• The choice of φ determines the neuron model. Examples:
• Step function:
  φ(v) = a if v < c;  b if v ≥ c
• Ramp function:
  φ(v) = a if v < c;  b if v > d;  a + ((v - c)(b - a)/(d - c)) otherwise
• Sigmoid function (with z, x, y parameters):
  φ(v) = z + 1 / (1 + exp(-x v + y))
• Gaussian function:
  φ(v) = (1 / (√(2π) σ)) · exp( -(1/2) ((v - μ)/σ)² )
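A similarly hedged Python sketch of these parameterized models (default parameter values are arbitrary assumptions):

```python
import math

def step(v, a=0.0, b=1.0, c=0.0):
    # a below the threshold c, b at or above it
    return a if v < c else b

def ramp(v, a=0.0, b=1.0, c=-1.0, d=1.0):
    # saturates at a below c and at b above d, linear in between
    if v < c:
        return a
    if v > d:
        return b
    return a + (v - c) * (b - a) / (d - c)

def sigmoid(v, z=0.0, x=1.0, y=0.0):
    # shifted and scaled logistic function with parameters z, x, y
    return z + 1.0 / (1.0 + math.exp(-x * v + y))

def gaussian(v, mu=0.0, sigma=1.0):
    # bell-shaped activation centred at mu with width sigma
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
```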


Learning Algorithms

Depend on the network architecture:


• Error correcting learning (perceptron)
• Delta rule (AdaLine, Backprop)
• Competitive Learning (Self Organizing Maps)


Applications
• Classification:
– Image recognition
– Speech recognition
– Diagnosis
– Fraud detection
– …
• Regression:
– Forecasting (prediction on the basis of past history)
– …
• Pattern association:
– Retrieve an image from a corrupted one
– …
• Clustering:
– client profiles
– disease subtypes
– …


Supervised Learning

• Training and test data sets


• Training set: input & target


Perceptron: architecture
• We consider the architecture: feed-forward NN
with one layer
• It is sufficient to study single layer perceptrons
with just one neuron:

[Figure: a single-layer perceptron with one neuron.]


Single layer perceptrons


• Generalization to single layer perceptrons with more
neurons is easy because:

• The output units are independent of each other
• Each weight only affects one of the outputs


Perceptron: Neuron Model


• The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function φ, the sign function:
  φ(v) = +1 if v ≥ 0;  -1 if v < 0

[Figure: perceptron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn and bias b; the local field v is passed through φ(v) to give the output y.]


Perceptron for Classification


• The perceptron is used for binary
classification.
• Given training examples of classes C1 and C2, train the perceptron in such a way that it correctly classifies the training examples:
– If the output of the perceptron is +1 then the input is
assigned to class C1
– If the output is -1 then the input is assigned to C2


Perceptron Training
• How can we train a perceptron for a
classification task?
• We try to find suitable values for the
weights in such a way that the training
examples are correctly classified.
• Geometrically, we try to find a hyper-plane
that separates the examples of the two
classes.


Perceptron Geometric View


The equation below describes a (hyper-)plane in the input space
consisting of real valued 2D vectors. The plane splits the input
space into two regions, each of them describing one class.

  Σ_{i=1..2} w_i x_i + w_0 = 0,   i.e.  w1·x1 + w2·x2 + w0 = 0

[Figure: the decision boundary w1·x1 + w2·x2 + w0 = 0 in the (x1, x2) plane; the half-plane where w1·x1 + w2·x2 + w0 ≥ 0 is the decision region for class C1, the other side belongs to class C2.]

Example: AND
• Here is a representation of the AND function
• White means false, black means true for the output
• -1 means false, +1 means true for the input

-1 AND -1 = false
-1 AND +1 = false
+1 AND -1 = false
+1 AND +1 = true
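For illustration (a sketch, not from the original slides), the following Python code trains a single perceptron on this AND data with the perceptron learning rule; the learning rate and initial weights are assumed values.

```python
import numpy as np

# AND data with inputs in {-1, +1}; the target is +1 only for (+1, +1)
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
d = np.array([-1, -1, -1, +1], dtype=float)

w = np.zeros(2)   # weights w1, w2
b = 0.0           # bias
eta = 0.1         # learning rate (assumption)

def predict(x):
    return 1.0 if w @ x + b >= 0 else -1.0

# perceptron learning rule: update only on misclassified examples
for epoch in range(20):
    errors = 0
    for x, target in zip(X, d):
        y = predict(x)
        if y != target:
            w += eta * (target - y) * x
            b += eta * (target - y)
            errors += 1
    if errors == 0:          # all training examples classified correctly
        break

print(w, b, [predict(x) for x in X])   # reproduces the AND outputs above
```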


Example: AND continued


• A linear decision surface (i.e. a plane in 3D
space) intersecting the feature space (i.e.
the 2D plane where z=0) separates false
from true instances


Example: AND continued


• Watch a perceptron learn the AND function [animation in the original slides, not reproduced here]


Example: XOR

• Here’s the XOR function:


-1 XOR -1 = false
-1 XOR +1 = true
+1 XOR -1 = true
+1 XOR +1 = false

Perceptrons cannot learn such linearly inseparable functions



Example: XOR continued

• Watch a perceptron try to learn XOR [animation in the original slides, not reproduced here]


Perceptron: Limitations
• The perceptron can only model linearly separable
classes, like (those described by) the following Boolean
functions:
• AND
• OR
• COMPLEMENT
• It cannot model the XOR.

• You can experiment with these functions in the Matlab


practical lessons.

Gradient Descent Learning Rule
• Perceptron learning rule fails to converge if examples
are not linearly separable

• Gradient Descent: consider a linear unit without a threshold and with continuous output o (not just -1, 1)
• o(x) = w0 + w1 x1 + … + wn xn
• Update the wi's such that they minimize the squared error
  E[w1, …, wn] = ½ Σ_{(x,d)∈D} (d - o(x))²
  where D is the set of training examples
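A minimal sketch of this gradient descent update for a linear unit (my own illustration; the toy data, learning rate, and iteration count are assumptions):

```python
import numpy as np

# toy training set D: each row of X is (1, x1), so w[0] plays the role of w0
X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
d = np.array([1.0, 3.0, 5.0, 7.0])     # here d = 1 + 2*x1

w = np.zeros(2)
eta = 0.05

for _ in range(500):
    o = X @ w                    # linear unit output o(x) = w0 + w1*x1 + ...
    grad = -(X.T @ (d - o))      # gradient of E = 1/2 * sum (d - o)^2
    w -= eta * grad              # move against the gradient

print(w)                         # approaches [1, 2]
```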


• Replace the step function in the perceptron with a continuous (differentiable) function f; the simplest choice is a linear function.
• With or without the threshold, the Adaline is trained on the output of the function f rather than on the final output.

[Figure: Adaline: the weighted sum of the inputs passes through f(x); a threshold (+/-) may then be applied to give the final output.]
Incremental Stochastic
Gradient Descent
• Batch mode: gradient descent over the entire data D
  w = w - η ∇E_D[w],   E_D[w] = ½ Σ_d (t_d - o_d)²
• Incremental mode: gradient descent over individual training examples d
  w = w - η ∇E_d[w],   E_d[w] = ½ (t_d - o_d)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.

Weights Update Rule:
incremental mode
• Computation of the gradient of E:
  ∂E(w)/∂w = e · ∂e/∂w = -e x^T
• Delta rule for weight update:
  w(n+1) = w(n) + η e(n) x(n)
LMS learning algorithm

n = 1;
initialize w(n) randomly;
while (E_tot unsatisfactory and n < max_iterations)
    select an example (x(n), d(n))
    e(n) = d(n) - w(n)^T x(n)
    w(n+1) = w(n) + η e(n) x(n)
    n = n + 1;
end-while;

η = learning rate parameter (a real number)

A modification uses the normalized update
    w(n+1) = w(n) + η e(n) x(n) / ||x(n)||
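For concreteness (my own runnable sketch, not from the slides), a Python version of this LMS loop; the learning rate, stopping threshold, and cyclic example selection are assumptions:

```python
import numpy as np

def lms_train(X, d, eta=0.05, max_iterations=1000, tol=1e-6, normalized=False):
    """Train a linear unit with the LMS (Widrow-Hoff) rule."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])    # initialize w randomly
    for n in range(max_iterations):
        i = n % len(X)                            # select an example (x(n), d(n))
        x, target = X[i], d[i]
        e = target - w @ x                        # e(n) = d(n) - w(n)^T x(n)
        step = x / np.linalg.norm(x) if normalized else x
        w = w + eta * e * step                    # w(n+1) = w(n) + eta * e(n) * x(n)
        if np.mean((d - X @ w) ** 2) < tol:       # total error small enough
            break
    return w

X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
d = np.array([1.0, 3.0, 5.0, 7.0])
print(lms_train(X, d))   # approaches [1, 2]
```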

Perceptron Learning Rule vs. Gradient Descent Rule

The perceptron learning rule is guaranteed to succeed if
• the training examples are linearly separable
• the learning rate η is sufficiently small
The linear unit training rule uses gradient descent
• guaranteed to converge to the hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not separable by H

Outline
• INTRODUCTION
• ADALINE
• MADALINE
• Least-Square Learning Rule
• The proof of Least-Square Learning Rule


Widrow and Hoff, 1960


Bernard Widrow and Ted Hoff introduced the Least-Mean-Square algorithm (a.k.a. the delta rule or Widrow-Hoff rule) and used it to train the Adaline (ADAptive LINear Element / Adaptive Linear Neuron).
– The Adaline was similar to the perceptron, except that it used a linear activation function instead of the threshold.
– The LMS algorithm is still heavily used in adaptive signal processing.

MADALINE: Many ADALINEs; a network of ADALINEs.



Perceptron vs. ADALINE

• Perceptron: LTU (Linear Threshold Unit), empirical Hebbian assumption
• ADALINE: LGU (Linear Graded Unit), gradient descent
• LTU: sign function, output +/- (positive/negative)
• LGU: continuous and differentiable activation function, including the linear function

[Figure: a neuron with inputs x0, x1, ..., xn and weights w0, w1, ..., wn feeding a summation and activation f; plots of f(s) = sgn(s), tanh(s), and linear(s) against s.]

MADALINE: Many ADALINEs; a network of ADALINEs

ADALINE
• ADALINE (Adaptive Linear Neuron) is a network model proposed by Bernard Widrow in 1959.

[Figure: inputs X1, X2, X3 feeding a single processing element (PE).]


Method
• Method : The value in each unit must +1 or –1
net =  X iWi
X 0  1  net  W0  W1 X 1  W2 X 2    Wn X n
1 net  0

Y  if
 1 net  0

This is different from perception's transfer function.


Method
  ΔW_i = η (T - Y) X_i,   where T is the expected output
  W_i ← W_i + ΔW_i

ADALINE can solve only linear problems (this is its limitation).
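A small numeric illustration of this update rule (my own example; the learning rate and input values are assumptions):

```python
eta = 0.2
X = [1, 1, -1]     # X0 = +1 (bias input), then X1, X2
W = [0.0, 0.0, 0.0]
T = -1             # expected output for this example

net = sum(x * w for x, w in zip(X, W))    # net = sum_i X_i * W_i = 0 here
Y = 1 if net >= 0 else -1                 # Y = +1, so the example is misclassified
# delta rule from the slide: dW_i = eta * (T - Y) * X_i
W = [w + eta * (T - Y) * x for w, x in zip(W, X)]
print(W)                                  # [-0.4, -0.4, 0.4]
```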


MADALINE
• MADALINE: it is composed of many ADALINEs (a multilayer Adaline).

[Figure: inputs Xi, ..., Xn connected through weights Wij to ADALINE units computing net_j and outputs Y_j; a second layer combines the Y_j.]

• If more than half of the net_j ≥ 0, the output is +1; otherwise the output is -1.
• After the second layer, the majority vote is used.
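A toy sketch of this majority-vote output stage (my own illustration; the weights and input are arbitrary):

```python
import numpy as np

def madaline_output(x, W):
    """W holds one weight row per ADALINE; the output is a majority vote."""
    nets = W @ x                        # net_j for each ADALINE
    votes = np.where(nets >= 0, 1, -1)  # each unit outputs +1 or -1
    # more than half of the net_j >= 0  ->  +1, otherwise -1
    return 1 if np.sum(votes == 1) > len(votes) / 2 else -1

x = np.array([1.0, 1.0, -1.0])                      # includes x0 = +1
W = np.array([[0.5, 1.0, 0.2],
              [-0.3, 0.4, 0.9],
              [0.1, -0.8, 0.3]])
print(madaline_output(x, W))                        # -> -1 for these values
```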


Least-Square Learning Rule (1/2)

• Least-Square Learning Rule: for each training pattern j, 1 ≤ j ≤ p,
  X_j = (X_0, X_1, …, X_n)^t   (a column vector of inputs)
  W   = (W_0, W_1, …, W_n)^t   (the weight vector)
  Net_j = W^t X_j = Σ_{i=0..n} W_i X_i = W_0 X_0 + W_1 X_1 + … + W_n X_n


Least-Square Learning Rule (2/2)

• By applying the least-square learning rule, the weight vector is
  W* = R^(-1) P,   or equivalently  R W* = P
  where R is the correlation matrix,
  R = R'/p,   R' = R'_1 + R'_2 + … + R'_p = Σ_{j=1..p} X_j X_j^t
  and
  P^t = (1/p) Σ_{j=1..p} T_j X_j^t

Exercise: Use Adaline (1/4)

Example:
  X1 = (1, 1, 0)^t,  X2 = (1, 0, 1)^t,  X3 = (1, 1, 1)^t
  T1 = 1,  T2 = 1,  T3 = -1

       x_0  x_1  x_2   T_j
  X1    1    1    0     1
  X2    1    0    1     1
  X3    1    1    1    -1


• Solution: first calculate R.

  R'_1 = X1 X1^t = [ 1 1 0 ; 1 1 0 ; 0 0 0 ]
  R'_2 = X2 X2^t = [ 1 0 1 ; 0 0 0 ; 1 0 1 ]
  R'_3 = X3 X3^t = [ 1 1 1 ; 1 1 1 ; 1 1 1 ]

  R = (R'_1 + R'_2 + R'_3) / 3
    = (1/3) [ 3 2 2 ; 2 2 1 ; 2 1 2 ]
    = [ 1 2/3 2/3 ; 2/3 2/3 1/3 ; 2/3 1/3 2/3 ]


  P_1 = T1 X1^t = 1 · (1, 1, 0) = (1, 1, 0)
  P_2 = T2 X2^t = 1 · (1, 0, 1) = (1, 0, 1)
  P_3 = T3 X3^t = -1 · (1, 1, 1) = (-1, -1, -1)

  P = (P_1 + P_2 + P_3) / 3 = (1/3, 0, 0)

  Solve R W* = P; multiplying both sides by 3:
    3 W1 + 2 W2 + 2 W3 = 1
    2 W1 + 2 W2 +   W3 = 0
    2 W1 +   W2 + 2 W3 = 0
  ⇒ W1 = 3,  W2 = -2,  W3 = -2

• Verify the net:
  Substituting (1, 1, 0): net = 3·X1 - 2·X2 - 2·X3 = 1,   Y = 1   ok
  Substituting (1, 0, 1): net = 3·X1 - 2·X2 - 2·X3 = 1,   Y = 1   ok
  Substituting (1, 1, 1): net = 3·X1 - 2·X2 - 2·X3 = -1,  Y = -1  ok

[Figure: the trained ADALINE with inputs X1, X2, X3, weights 3, -2, -2, and output Y.]
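The whole exercise can be reproduced with a few lines of NumPy (my own check, not part of the slides):

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)   # patterns X1, X2, X3 as rows
T = np.array([1, 1, -1], dtype=float)    # targets T1, T2, T3

R = (X.T @ X) / len(X)      # correlation matrix R = (1/p) * sum_j X_j X_j^t
P = (X.T @ T) / len(X)      # P = (1/p) * sum_j T_j X_j

W = np.linalg.solve(R, P)   # W* = R^-1 P
print(W)                    # -> [ 3. -2. -2.]
print(np.sign(X @ W))       # -> [ 1.  1. -1.], matching the targets
```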


Proof of Least-Square Learning Rule (1/3)

• We use the least mean-square error to ensure the minimum total error. As long as the total error approaches zero, the best solution is found. Therefore, we look for the minimum of ⟨ε_k²⟩.
• Proof:
  ⟨ε_k²⟩ = (1/L) Σ_{k=1..L} ε_k² = (1/L) Σ_k (T_k - Y_k)² = (1/L) Σ_k (T_k² - 2 T_k Y_k + Y_k²)
         = (1/L) Σ_k T_k² - (2/L) Σ_k T_k Y_k + (1/L) Σ_k Y_k²     (let ⟨T_k²⟩ be the mean of T_k²)
         = ⟨T_k²⟩ - (2/L) Σ_k T_k Y_k + (1/L) Σ_k W^t (X_k X_k^t) W

Proof of Least-Square Learning Rule (2/3)

• Note (from the previous page):
  Y_k² = (Σ_{i=1..n} w_i x_ik)² = (W^t X_k)² = (W^t X_k)(X_k^t W) = W^t (X_k X_k^t) W
• Hence
  ⟨ε_k²⟩ = ⟨T_k²⟩ - (2/L) Σ_k T_k Y_k + W^t [ (1/L) Σ_k X_k X_k^t ] W
         = ⟨T_k²⟩ - (2/L) Σ_k T_k (W^t X_k) + W^t ⟨X_k X_k^t⟩ W
         = ⟨T_k²⟩ - 2 [ (1/L) Σ_k T_k X_k^t ] W + W^t ⟨X_k X_k^t⟩ W
         = ⟨T_k²⟩ - 2 ⟨T_k X_k^t⟩ W + W^t ⟨X_k X_k^t⟩ W

Proof of Least-Square Learning Rule (3/3)

• Let R_k = X_k X_k^t; R_k is an n × n matrix, also called the correlation matrix.
• Let R' = R'_1 + R'_2 + … + R'_k + … + R'_L, and let R = R'/L, i.e. R = ⟨X_k X_k^t⟩. Then
  ⟨ε_k²⟩ = ⟨T_k²⟩ + W^t R W - 2 ⟨T_k X_k^t⟩ W
• We want to find W such that ⟨ε_k²⟩ is minimal:
  ∂⟨ε_k²⟩/∂W = ∂/∂W [ ⟨T_k²⟩ + W^t R W - 2 ⟨T_k X_k^t⟩ W ]
             = 2 R W - 2 ⟨T_k X_k^t⟩        (let P = ⟨T_k X_k^t⟩)
             = 2 R W - 2 P
• Setting ∂⟨ε_k²⟩/∂W = 2 R W* - 2 P = 0 gives
  W* = R^(-1) P,   or equivalently  R W* = P

Comparison of Perceptron and Adaline

                      Perceptron                  Adaline
Architecture          Single-layer                Single-layer
Neuron model          Non-linear                  Linear
Learning algorithm    Minimize the number of      Minimize the total error
                      misclassified examples
Application           Linear classification       Linear classification and regression