
Hybrid Evolutionary One-Step Gradient Descent for Training Recurrent Neural Networks

Rohitash Chandra and Christian W. Omlin
Department of Computer Engineering, Middle East Technical University,
Guzelyurt, Turkish Republic of Northern Cyprus.

Abstract - In this paper, we present a fast hybrid of gradient descent and a genetic algorithm for training recurrent neural networks. The hybrid algorithm combines the strengths of genetic algorithms and gradient descent learning in training recurrent neural networks (RNNs) for learning fuzzy finite automata. In the hybrid algorithm, the chromosomes are evolved using one-step gradient descent together with genetic evolution. The hybrid algorithm is applied to learning deterministic finite-state automata using recurrent neural networks. The surprising results demonstrate that the hybrid algorithm trains recurrent neural networks faster than a regular genetic algorithm alone.

Keywords: Genetic algorithms, Recurrent neural networks, Hybrid algorithm, Gradient descent.

1 Introduction

Although neural networks have performed very well in many real-world application problems, training them can be cumbersome in cases where the data contains noise and is not linearly separable. Traditionally, neural networks have been trained by the error back-propagation algorithm, which employs gradient descent for training. Gradient-based learning algorithms tend to get trapped in local minima, resulting in poor training and generalization performance. To overcome this shortcoming, evolutionary techniques such as genetic algorithms have been used in neural network training, which alleviates the problem of local minima. Genetic algorithms are evolutionary search techniques; thus, training of neural networks using genetic algorithms can be time-consuming. Performance comparisons of the two methods have shown that genetic algorithms generally outperform gradient descent in training feedforward neural networks for real-world application problems [1].

In recent years, training neural networks using hybrid algorithms has gained much interest. A common hybrid technique uses an evolutionary algorithm in the initial training phase; once a certain number of training generations has been reached, the evolutionary search terminates and gradient descent training is used for final training. Thus, evolutionary training is used to search the weight space globally for a promising solution and gradient descent refines that solution through local optimization. Examples of such hybrid approaches include the combination of genetic algorithms and gradient descent [2, 3], and particle swarm optimization with gradient descent [4].

Combining gradient descent with an evolutionary algorithm for training neural networks allows parallel and continuous global and local search for a solution in weight space. Gradient descent has been successfully embedded in genetic algorithms for image reconstruction [5]. Particle swarm optimization (PSO) has also been combined with evolutionary algorithms (EA) for training recurrent neural networks. It has been shown that the hybrid PSO-EA outperforms standard EA and PSO algorithms [6].

The strengths of gradient descent and genetic algorithms have motivated us to develop a hybrid GA-GD learning algorithm which exploits their respective strengths and alleviates their respective weaknesses. In our hybrid algorithm, gradient descent is embedded in a genetic algorithm. Gradient descent is used to evolve the chromosomes by updating weights with the error backpropagation algorithm. The fitness of each chromosome in the population is calculated after one weight update. The fitness directly affects the selection of parent chromosomes for the crossover operator, which combines components of two parents into a single offspring. We use the roulette wheel method for probabilistically selecting parent chromosomes. Each gene in the offspring is then mutated according to the mutation probability. This hybrid algorithm differs from traditional genetic algorithms in that the evolution of weights is based on gradient information and probabilistic mutation rather than mutation alone. Mutation offers a network the opportunity to escape from a possible local minimum in weight space.

In principle, gradient descent and genetic algorithms can be combined in two ways: the genetic evolution using crossover and mutation can be done prior to the gradient descent weight update, or the gradient descent weight update can be performed prior to genetic evolution. In fact, these two methods are equivalent: the first gradient descent weight optimization simply creates a new population; thus, we can think of that population as the initial random population of the hybrid algorithm that first performs crossover and mutation before the gradient descent optimization. Thus, there is no need to separately consider the two hybrid algorithms. In the remainder of this paper, we will only consider the hybrid algorithm which uses gradient descent weight update followed by crossover and mutation operations.
The remainder of the paper is organized as follows: In Section 2, we discuss recurrent neural networks, finite-state automata, gradient descent and genetic algorithms for training RNNs. In Section 3, we discuss the framework of the hybrid GA-GD algorithm in detail. In Section 4, we show empirical results on how the hybrid GA-GD algorithm outperforms traditional GAs in training RNNs on finite-state automata. We then conclude the work and discuss the feasibility of future research according to the results.

2 Material and Methods

2.1 Recurrent neural networks

Fig. 1 First-order recurrent neural network (diagram labels: Z-1, context layer, hidden layer, output layer, wij)
Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [7]-[9]. They have the ability to provide good generalization performance on unseen data but are difficult to train. Recurrent neural networks are dynamical systems and it has been shown that they can represent deterministic finite-state automata in their internal weight representations [10].

Unlike feedforward neural networks, recurrent neural networks contain feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer. Each layer contains one or more neurons which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs. Recurrent neural networks maintain information about their past states for the computation of future states and outputs. Popular architectures of recurrent neural networks include first-order recurrent networks [11], second-order recurrent networks [12], NARX networks [13] and LSTM recurrent networks [14]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper. Fig. 1 is a diagram of a first-order recurrent neural network showing the recurrence from the hidden to the context layer. The dynamics of the hidden state neuron activations in a first-order recurrent neural network are given by Equation 1.

$$ S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right) \qquad (1) $$

where $S_k(t)$ and $I_j(t)$ represent the outputs of the state neurons and input neurons, respectively, $V_{ik}$ and $W_{ij}$ represent their corresponding weights, and $g(\cdot)$ is a sigmoidal discriminant function.

2.2 Finite-state automata for RNN training

Recurrent neural networks are appropriate tools for modeling real-world application problems of speech, signature and gesture recognition, as stated earlier. However, these applications are not well suited for addressing fundamental issues such as training algorithms and knowledge representation. These applications come with specific characteristics; for example, speech recognition may require feature extraction, which may hinder the investigation of the network's fundamental issues. Different applications require different feature extraction techniques. Models such as finite-state automata and their corresponding languages can be viewed as a general paradigm of temporal, symbolic language. No feature extraction is necessary for recurrent neural networks to learn these languages. The knowledge acquired by recurrent neural networks through learning corresponds well with the dynamics of finite-state automata. The representation of an automaton is a prerequisite for learning its corresponding language; i.e. if the architecture cannot represent a particular automaton, then it will not be able to learn it either.

Finite automata have been used for training recurrent neural networks as they represent dynamical systems. They have also been used to study knowledge representation in recurrent neural networks, and it has been demonstrated through knowledge extraction that RNNs can represent dynamical systems [15]-[17]. Finite-state automata are used as test beds for training recurrent neural networks. They have been popular as they represent dynamical systems and the strings for training do not need to undergo any feature extraction.

An alphabet ∑ is a finite set of symbols. A formal language is a set of strings of symbols over some alphabet. Simple alphabets, e.g. ∑ = {0, 1}, are typically considered in the study of formal languages since results can easily be extended to larger alphabets. The set of all strings of odd parity, L = {1, 01, 10, 001, 010, 100, 111, …}, is an example of a simple language. The symbol ε is used to denote the null string, which has even parity and is therefore not in L. The language contains an infinite number of strings.
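As a small illustration of the odd-parity language, the sketch below checks membership with a two-state transition table and lists the members of length at most 3. It assumes that "odd parity" means an odd number of 1s; the names `DELTA`, `has_odd_parity` and `strings_up_to` are hypothetical and not part of the paper.

```python
from itertools import product

# Two-state machine: state 0 = even number of 1s read so far,
# state 1 = odd number of 1s read so far. Reading "1" flips the state.
DELTA = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}


def has_odd_parity(string):
    """Return True if the binary string contains an odd number of 1s."""
    state = 0  # the null string has even parity
    for symbol in string:
        state = DELTA[(state, symbol)]
    return state == 1


def strings_up_to(max_len):
    """Yield every string over {0, 1} of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)


odd = [s for s in strings_up_to(3) if has_odd_parity(s)]
print(odd)  # ['1', '01', '10', '001', '010', '100', '111']
```

The listing grows without bound as the maximum length increases, which is why the language is infinite even though the alphabet has only two symbols.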
Fig. 2 A 7-state deterministic finite-state automaton.

A deterministic finite-state automaton (DFA) is defined as a 5-tuple M = (Q, ∑, δ, q1, F), where Q is a finite set of states, ∑ is the input alphabet, δ is the next-state function δ : Q × ∑ → Q which defines which state q' = δ(q, σ) is reached by the automaton after reading symbol σ when in state q, q1 ∈ Q is the initial state of the automaton (before reading any string) and F ⊆ Q is the set of accepting states of the automaton. The language L(M) accepted by the automaton contains all the strings that bring the automaton to an accepting state. The languages accepted by DFAs are called regular languages. Fig. 2 shows the DFA which will be used for training RNNs using the hybrid evolutionary one-step algorithm. Double circles in the figure show accepting states while rejecting states are shown by single circles. State 1 is the automaton's start state. The training and testing sets are obtained by presenting strings to this automaton, which gives an output, i.e. a rejecting or accepting state, depending on the state reached after the last symbol of the string has been presented. For example, the string 0100101 of length 7 ends in state 5, which is an accepting state; therefore the output is 1.

2.3 Gradient descent algorithm for training neural networks

Error backpropagation employs gradient descent learning and is the most popular algorithm used for training neural networks. The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates which represent the knowledge learnt from training. A limitation of gradient descent learning is its tendency to get trapped in a local minimum during training, resulting in poor training and generalization performance.

Backpropagation is used for training feedforward networks while backpropagation-through-time (BPTT) is employed for training recurrent neural networks. BPTT is the spatio-temporal extension of the backpropagation algorithm [18]. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network. This can be done by adding a layer for each time step. When unfolded in time, the network has the same behavior as a recurrent neural network for a finite number of time steps. Gradient descent is limited in learning longer-term dependencies in recurrent neural networks as the error gradient decreases significantly in longer sequences [19]. The weight update in gradient descent learning is computed by adding ∆wji to the respective weight as shown in Equation 2:

$$ \Delta w_{ji} = -\alpha \frac{\partial E_d}{\partial w_{ji}} \qquad (2) $$

where $\alpha$ is the learning rate and $E_d$ is the error on training example d, summed over all output units in the network as shown in Equation 3.

$$ E_d = \frac{1}{2} \sum_{j=1}^{m} \left( d_j - S_j^{L} \right)^2 \qquad (3) $$

Here $d_j$ is the desired output for neuron j in the output layer, which contains m neurons, and $S_j^{L}$ is the network output of neuron j in the output layer L. Fig. 4 shows a high-level framework of BPTT, which employs gradient descent for error back-propagation and weight update.

__________________________

procedure: Gradient Descent for RNN training

initialize weights and biases
while (termination condition is not satisfied) do
   i) forward propagate
   ii) back-propagate error through time and do weight update
end
load data for testing the RNN
______________________________

Fig. 4 Description of BPTT for Weight Update

2.4 Genetic algorithms for training neural networks

Genetic algorithms provide a learning method motivated by biological evolution [20]. They have been successfully applied to neural network weight updates and to network topology optimization [21]. In recent years, such hybrid approaches to neural network training have gained popularity and have been applied to real-world problems such as job scheduling [22], forecasting [23] and robotics control [24].

The general idea in using genetic algorithms for training neural networks is to encode weights as chromosomes in a population. The task of the genetic algorithm is then to find the set of weights that best represents the knowledge in the training data. The fitness function is thus the sum of squared errors returned by the network after being presented with the weights encoded in a chromosome. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the input signals forward, and the sum of squared errors is calculated. In this way, genetic algorithms attempt to find a set of weights which minimizes the error function of the network. Unlike learning with gradient descent, genetic algorithms can help neural networks to escape from local minima in weight space. Fig. 5 shows a high-level description of genetic algorithms for training RNNs.
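The fitness evaluation described above can be sketched as follows: a flat chromosome is decoded into the weights of a first-order recurrent network, each training string is propagated forward using the state update of Equation 1, and the squared errors are summed. This is only a minimal sketch under stated assumptions, not the authors' implementation: biases are omitted, the initial state is zero, symbols are one-hot encoded, and the output weights `U` and all function names are hypothetical, since the paper gives no output-layer equation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(row, vec):
    return sum(w * x for w, x in zip(row, vec))


def decode(chromosome, n_in, n_hidden, n_out):
    """Split a flat list of genes into the weight matrices V, W and U."""
    def take(rows, cols, start):
        return [chromosome[start + r * cols: start + (r + 1) * cols]
                for r in range(rows)]
    a = n_hidden * n_hidden
    b = a + n_hidden * n_in
    V = take(n_hidden, n_hidden, 0)  # state-to-state weights V_ik
    W = take(n_hidden, n_in, a)      # input-to-state weights W_ij
    U = take(n_out, n_hidden, b)     # assumed state-to-output weights
    return V, W, U


def run_network(chromosome, sequence, n_in, n_hidden, n_out):
    """Forward-propagate one input sequence and return the final output."""
    V, W, U = decode(chromosome, n_in, n_hidden, n_out)
    state = [0.0] * n_hidden  # assumed initial state
    for inputs in sequence:
        # Equation 1: new state from the previous state and the current input
        state = [sigmoid(dot(V[i], state) + dot(W[i], inputs))
                 for i in range(n_hidden)]
    return [sigmoid(dot(U[j], state)) for j in range(n_out)]


def sum_squared_error(chromosome, samples, n_in, n_hidden, n_out):
    """Sum of 0.5 * (d_j - S_j)^2 over all samples and output neurons."""
    total = 0.0
    for sequence, target in samples:
        output = run_network(chromosome, sequence, n_in, n_hidden, n_out)
        total += 0.5 * sum((d - s) ** 2 for d, s in zip(target, output))
    return total


ONE_HOT = {"0": [1.0, 0.0], "1": [0.0, 1.0]}  # assumed input encoding
samples = [([ONE_HOT[c] for c in "0100101"], [1.0])]  # accepted string
n_in, n_hidden, n_out = 2, 8, 1
n_genes = n_hidden * n_hidden + n_hidden * n_in + n_out * n_hidden
zero_chromosome = [0.0] * n_genes
print(sum_squared_error(zero_chromosome, samples, n_in, n_hidden, n_out))  # 0.125
```

With all weights set to zero every sigmoid outputs 0.5, so the single accepted string contributes 0.5 × (1 − 0.5)² = 0.125 to the error. A genetic algorithm would call `sum_squared_error` once per chromosome in each generation.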


________________________________

procedure: Genetic Algorithm for RNN training

initialize population
evaluate RNN’s fitness
while (termination condition is not satisfied) do
   i) crossover and mutation
   ii) update population
   iii) evaluate RNN’s fitness
end
get the optimal chromosome of weights
load data for testing the RNN
________________________________

Fig. 5 Genetic algorithms for evolution of RNN weights

The weights are encoded into chromosomes using either binary or real-numbered weight encoding schemes. In binary encoding, a set of genes corresponds to a certain weight link [25, 26]. The genes are converted into real weight values and assigned to their respective weight links in order to evaluate the fitness function. Real-number encodings are an alternative approach [27]. In order to use this method, the genetic operators must be changed, as traditional genetic operators are specifically designed for binary chromosomes. One way of altering the genetic operators is as follows: the crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents, as shown in Fig. 6. The mutation operator adds a small random real number in the range [-1, 1] to each gene in the offspring according to the mutation probability.

Fig. 6 The crossover operator in genetic algorithm

3 Hybrid Training Algorithm

The strengths and weaknesses of gradient descent and genetic algorithms have been discussed in the previous sections. While genetic algorithms have been shown to overcome the problem of local minima, their drawback is that evolutionary optimization can be time-consuming. The evolution of weights can also temporarily direct the network away from the optimal solution.

The update of weights and biases in order to minimize the error is common to both gradient descent and genetic algorithms. In the hybrid approach, the gradient information is used in creating a new population on which genetic operators such as crossover and mutation are applied. Genetic operators combine components from two different solutions according to the respective selection criterion, and therefore a better solution can be obtained. The description of the proposed algorithm is shown in Fig. 7.

In the description of the algorithm in Fig. 7, a population of hypotheses, each representing the network weights as a chromosome, is initialized with small random real values. Each chromosome is presented to the network, where it is updated using gradient descent, which calculates the weight update according to the error from the input-to-output mappings of the training samples. This weight update is done for one epoch only. Each updated network then becomes part of the new population. Each chromosome in the new population is evaluated according to the fitness function, which is the reciprocal of the objective function (e.g. the sum of squared errors returned by the network). If the termination condition is not satisfied, then the algorithm proceeds with genetic evolution using crossover and mutation: (1) two parents are chosen by the respective selection criterion such as rank, roulette wheel or tournament selection, (2) an offspring is created from the components of each parent using the crossover operator according to the crossover probability. Then each gene in the chromosome is altered by adding a small random number according to the mutation probability.

__________________________

procedure: Hybrid Training Algorithm for RNN

initialize population
evaluate RNN’s fitness
while (termination condition is not reached) do
   i) crossover and mutation
      (Genetic evolution)
   ii) present each chromosome to RNN using GD weight update for 1 epoch only
      (neural weight update)
   iii) update population
   iv) evaluate RNN’s fitness
end
get the optimal chromosome of weights
load data for testing the RNN
________________________________

Fig. 7 Description of the proposed hybrid genetic algorithm/gradient descent training algorithm

4 Results and Discussion

In the following results, we show the performance of training recurrent neural networks with the genetic algorithm alone and with our hybrid training method, respectively. In both cases, we randomly initialise all the genes in the chromosomes in the range [-1, 1]. From trial experiments, we determined that a population size of 40 gives the best results; therefore, this population size is used in all the experiments. We used different combinations of crossover and mutation probabilities of 0.9 and 0.5. For each combination of crossover and mutation probabilities, we ran 10 experiments.

In the implementation of the genetic algorithm, which evolves real-numbered weight values from the network, the probabilities of crossover and mutation are important for rapid convergence to a solution. To understand the genetic training process for neural networks, one has to consider that the actual learning takes place during mutation, where there is a significant change in the weight values. The crossover operator does not alter the value of the weights in any way; it only exchanges them between the selected parents. When using a real-valued genetic weight representation, mutation is thus more significant for the learning. Therefore, we ran experiments to find out the optimal probabilities for crossover and mutation. We used the following hybrid weight update strategy: we construct an RNN from a chromosome, perform a weight update for one epoch only, and then apply probabilistic crossover and mutation.

Note that we used the sum of squared errors from the network as the fitness function. The crossover operator chooses two parents using roulette wheel selection and creates a child chromosome by probabilistically selecting genes from each parent. The mutation operator adds a small random real number in the range [-1, 1] to each gene in the chromosome. The maximum training time allowed was 1000 generations. We used 8 neurons in the hidden layer as this showed successful results for representing a 7-state DFA in trial experiments. The results for training RNNs using the hybrid algorithm are shown in Table 1. Table 2 shows the results for training using a standard GA.

The results clearly demonstrate that our hybrid algorithm outperformed training of RNNs with a genetic algorithm alone in terms of training time. The training time is considerably affected by the combination of crossover and mutation probabilities. The results are promising, which motivates the application of the hybrid training algorithm to training feedforward neural networks. The contribution of our hybrid training algorithm to solving real-world problems looks promising.

TABLE 1: Hybrid Training Algorithm for RNN

Mutation   Crossover   Training Accuracy   Generalization Accuracy   Training time (generations)
0.5        0.9         100±0%              100±0%                    2.7±1.2
0.9        0.9         100±0%              100±0%                    4.2±2.0
0.5        0.5         100±0%              100±0%                    3.3±0.9
0.9        0.5         100±0%              100±0%                    3.5±0.7

The 90 percent confidence interval for 10 experiments done with different values of crossover and mutation is given. The training time is given by the number of ‘generations’. The maximum training time allowed was 1000 generations.

TABLE 2: GA Training for RNN

Mutation   Crossover   Training Accuracy   Generalization Accuracy   Training time (generations)
0.5        0.9         100±0%              100±0%                    118.9±44.1
0.9        0.9         100±0%              100±0%                    63.2±23.4
0.5        0.5         100±0%              100±0%                    86.4±21.2
0.9        0.5         100±0%              100±0%                    77.9±15.6

The 90 percent confidence interval for 10 experiments done with different values of crossover and mutation is given. The training time is given by the number of ‘generations’. The maximum training time allowed was 1000 generations.
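The hybrid weight update strategy used in these experiments (one gradient descent step per chromosome, followed by roulette wheel selection, crossover and mutation) can be sketched as below. This is a hedged illustration rather than the authors' code: the `error` and `gradient` callables stand in for the RNN's sum of squared errors and its BPTT gradient, the toy quadratic objective, the learning rate, the tolerance and all function names are assumptions, and the fitness is taken as the reciprocal of the error so that roulette wheel selection favours low-error chromosomes.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        running += fitness
        if running >= pick:
            return chromosome
    return population[-1]


def crossover(parent1, parent2, p_cross, rng):
    """With probability p_cross, build a child by picking each gene from either parent."""
    if rng.random() >= p_cross:
        return list(parent1)
    return [a if rng.random() < 0.5 else b for a, b in zip(parent1, parent2)]


def mutate(chromosome, p_mut, rng):
    """Add a random number in [-1, 1] to each gene with probability p_mut."""
    return [g + rng.uniform(-1.0, 1.0) if rng.random() < p_mut else g
            for g in chromosome]


def hybrid_train(error, gradient, n_genes, pop_size=40, p_cross=0.9, p_mut=0.5,
                 learning_rate=0.5, max_generations=1000, tolerance=1e-3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best, best_error = None, float("inf")
    for generation in range(1, max_generations + 1):
        # One gradient descent step per chromosome (neural weight update)
        population = [[w - learning_rate * g for w, g in zip(c, gradient(c))]
                      for c in population]
        errors = [error(c) for c in population]
        for c, e in zip(population, errors):
            if e < best_error:
                best, best_error = list(c), e
        if best_error <= tolerance:
            return best, generation
        # Genetic evolution: selection, crossover and mutation
        fitnesses = [1.0 / (e + 1e-12) for e in errors]
        population = [
            mutate(crossover(roulette_select(population, fitnesses, rng),
                             roulette_select(population, fitnesses, rng),
                             p_cross, rng), p_mut, rng)
            for _ in range(pop_size)
        ]
    return best, max_generations


# Toy stand-in objective (not an RNN): squared distance to a fixed target.
target = [0.3, -0.7, 0.5]


def toy_error(c):
    return 0.5 * sum((w - t) ** 2 for w, t in zip(c, target))


def toy_gradient(c):
    return [w - t for w, t in zip(c, target)]


best, generations = hybrid_train(toy_error, toy_gradient, n_genes=3)
print(generations, toy_error(best))
```

Replacing `toy_error` and `toy_gradient` with the network's sum of squared errors and its BPTT gradient would give the structure of the hybrid algorithm. Setting `learning_rate=0.0` removes the gradient step and reduces the loop to a plain genetic algorithm for comparison.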
5 Conclusion

We have presented a simple hybrid algorithm for training recurrent neural networks using a combination of gradient descent and genetic algorithm weight update. Each chromosome in the population is modified using one step of gradient descent optimization followed by the application of standard crossover and mutation operators. Surprisingly, we have found that this single gradient descent step makes the difference between rapid convergence and non-convergence within 1000 generations when applied to the problem of training recurrent neural networks to behave like deterministic finite-state automata. It would be interesting to see the performance of the hybrid training algorithm on fuzzy finite-state automata in future work. The contribution of our hybrid training algorithm to solving real-world problems looks promising.

6 References

[1] Randall S. Sexton, Robert E. Dorsey, “Reliable classification using neural networks: a genetic algorithm and backpropagation comparison”, Decision Support Systems, 30, 2000, 11-22.
[2] Rohitash Chandra, Christian W. Omlin, “Combining Genetic and Gradient Descent Learning in Recurrent Neural Networks: An Application to Speech Phoneme Classification”, Proceedings of the International Conference on Artificial Intelligence and Pattern Recognition, Orlando FL, USA, July 2007, pp. 278-285.
[3] Lixin Lu and Yan-Qing Zhang, “Evolutionary Fuzzy neural networks for Hybrid Financial Prediction”, IEEE Transactions of systems, man and cybernetics-Part C: Applications and Reviews, Vol 35, No. 2, May 2005, pp. 244-249.
[4] Jing-Ru Zang, Jun Zhang, Tat-Ming Lok, Michael R. Lyu, “A hybrid particle swarm optimization-backpropagation algorithm for feedforward neural network training,” Applied Mathematics and Computation, 185, 2007, 1026-1037.
[5] Liu Mei, Liu WeiDong, Sun DeQing, Chen Guqiao and Liu Huinian, “A new super-resolution image reconstruction method based on hybrid genetic algorithm”, Proceedings of the 2004 IEEE International Conference on Control Applications, Taipei, Taiwan, 2004, pp. 211-216.
[6] Xindi Cai, Nian Zhang, Ganesh K. Venayagamoorthy, Donald C. Wunsch II, “Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm”, Neurocomputing, 70, 2007, 2342-2353.
[7] A.J Robinson, An application of recurrent nets to phone probability estimation, IEEE transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.
[8] C.L. Giles, S. Lawrence and A.C. Tsoi, Rule inference for financial prediction using recurrent neural networks, Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.
[9] K. Marakami and H Taguchi, Gesture recognition using recurrent neural networks, Proc. of the SIGCHI conference on Human factors in computing systems: Reaching through technology, Louisiana, USA, 1991, pp. 237-242.
[10] C. Lee Giles, C.W Omlin and K. Thornber, “Equivalence in Knowledge Representation: Automata, Recurrent Neural Networks, and dynamical Systems”, Proc. of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.
[11] P. Manolios and R. Fanelli, First order recurrent neural networks and deterministic finite state automata, Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.
[12] R. L. Watrous and G. M. Kuhn, Induction of finite-state languages using second-order recurrent networks, Proc. of Advances in Neural Information Systems, California, USA, 1992, pp. 309-316.
[13] T. Lin, B.G. Horne, P. Tino, & C.L. Giles, Learning long-term dependencies in NARX recurrent neural networks, IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.
[14] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.
[15] C. W. Omlin and C. Giles, “Constructing deterministic finite-state automata in recurrent neural networks”, Journal of the ACM, vol. 43, no. 6, 1996, pp. 937-972.
[16] C. W. Omlin, K. K. Thornber, & C. L. Giles, Fuzzy finite state automata can be deterministically encoded into recurrent neural networks, IEEE Trans. Fuzzy Syst., 6, 1998, 76-89.
[17] R. L. Watrous and G. M. Kuhn, “Induction of finite-state languages using second-order recurrent networks,” Proc. of Advances in Neural Information Systems, California, USA, 1992, pp. 309-316.
[18] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proc. of the IEEE, vol. 78, no. 10, 1990, pp. 1550-1560.
[19] Y. Bengio, P. Simard and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157-166.
[20] J. H. Holland, “Genetic Algorithms and the Optimal Allocation of Trials”, SIAM Journal of Computing, vol. 2, no. 2, 1973, pp. 88-105.
[21] J. H. Ang, K. C. Tan, A. Al-Mamun, “Training neural networks for classification using growth probability-based evolution”, Neurocomputing, 2008, doi:10.1016/j.neucom.2007.10.011
[22] Haibin Yu, Wei Liang, “Neural Network and genetic algorithm based hybrid approach to extended job-scheduling”, Computers and Industrial Engineering, 39, 2001, 337-356.
[23] Harri Niska, Teri Hiltunen, Ari Karppinen, Juhani Ruuskanen, Mikko Kolehmainen, “Evolving the neural network model for forecasting air pollution time series”, Engineering Applications of Artificial Intelligence, 17, 2004, 159-167.
[24] Genci Capi, Kenji Doya, “Evolution of recurrent neural controllers using an extended parallel genetic algorithm”, Robotic and Autonomous Systems, 52, 2005, 148-159.
[25] P. J. Angeline, G. M. Sauders, and J. B. Pollack, An evolutionary algorithm that constructs recurrent neural networks, IEEE Transactions on Neural Networks, vol. 5, 1994, pp. 54-65.
[26] M. A. Potter and D. Jong, “Evolving neural networks with collaborative species”, Proc. of the Summer Computer Simulation Conference, 1995.
[27] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligence Systems, Addison Wesley, 2004.