
Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Optimization
Adaptation, Learning, and Optimization, Volume 7
Series Editors-in-Chief

Meng-Hiot Lim
Nanyang Technological University, Singapore
E-mail: emhlim@ntu.edu.sg

Yew-Soon Ong
Nanyang Technological University, Singapore
E-mail: asysong@ntu.edu.sg

Further volumes of this series can be found on our homepage: springer.com

Vol. 1. Jingqiao Zhang and Arthur C. Sanderson
Adaptive Differential Evolution, 2009
ISBN 978-3-642-01526-7

Vol. 2. Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Expensive Optimization Problems, 2010
ISBN 978-3-642-10700-9

Vol. 3. Ying-ping Chen (Ed.)
Exploitation of Linkage Learning in Evolutionary Algorithms, 2010
ISBN 978-3-642-12833-2

Vol. 4. Anyong Qing and Ching Kwang Lee
Differential Evolution in Electromagnetics, 2010
ISBN 978-3-642-12868-4

Vol. 5. Ruhul A. Sarker and Tapabrata Ray (Eds.)
Agent-Based Evolutionary Search, 2010
ISBN 978-3-642-13424-1

Vol. 6. John Seiffertt and Donald C. Wunsch
Unified Computational Intelligence for Complex Systems, 2010
ISBN 978-3-642-03179-3

Vol. 7. Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Optimization, 2010
ISBN 978-3-642-12774-8
Yoel Tenne and Chi-Keong Goh (Eds.)

Computational Intelligence in
Optimization
Applications and Implementations

Dr. Yoel Tenne
Department of Mechanical Engineering and Science, Faculty of Engineering,
Kyoto University, Yoshida-honmachi,
Sakyo-ku, Kyoto 606-8501, Japan
E-mail: yoel.tenne@ky3.ecs.kyoto-u.ac.jp
Formerly: School of Aerospace, Mechanical and Mechatronic Engineering,
The University of Sydney, NSW 2006, Australia

Dr. Chi-Keong Goh
Advanced Technology Centre,
Rolls-Royce Singapore Pte Ltd,
50 Nanyang Avenue, Block N2,
Level B3C, Unit 05-08, Singapore 639798
E-mail: chi.keong.goh@rolls-royce.com

ISBN 978-3-642-12774-8 e-ISBN 978-3-642-12775-5

DOI 10.1007/978-3-642-12775-5

Adaptation, Learning, and Optimization ISSN 1867-4534

Library of Congress Control Number: 2010926028


© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilm or in any other
way, and storage in data banks. Duplication of this publication or parts thereof is
permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this
publication does not imply, even in the absence of a specific statement, that such
names are exempt from the relevant protective laws and regulations and therefore
free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com
To our families for their love and support.
Preface

Optimization is an integral part of science and engineering. Most real-world
applications involve complex optimization processes, which are difficult to
solve without advanced computational tools. With the increasing challenge
of fulfilling the optimization goals of current applications, there is a strong
drive to advance the development of efficient optimizers. The challenges
introduced by emerging problems include:
• objective functions which are prohibitively expensive to evaluate, so that
typically only a small number of objective function evaluations can be made
during the entire search,
• objective functions which are highly multimodal or discontinuous, and
• non-stationary (dynamic) problems which may change over time.

Classical optimizers may perform poorly, or may even fail to produce any
improvement over the starting vector, in the face of such challenges. This
has motivated researchers to explore the use of computational intelligence (CI)
to augment classical methods in tackling such challenging problems. These
methods include: a) population-based search methods such as evolutionary
algorithms and particle swarm optimization, and b) non-linear mapping and
knowledge-embedding approaches such as artificial neural networks and fuzzy
logic, to name a few. Such approaches have been shown to perform well in
challenging settings. Specifically, CI methods are powerful tools which offer
several potential benefits: a) robustness (they impose few or no requirements
on the objective function), b) versatility (they handle highly non-linear
mappings), c) self-adaptation to improve performance, and d) parallel operation
(making it easy to decompose complex tasks). However, the successful
application of CI methods to real-world problems is not straightforward and
requires both expert knowledge and trial-and-error experiments. As such, the
goal of this volume is to survey a wide range of studies where CI has been
successfully applied to challenging real-world optimization problems, while
highlighting the insights researchers have obtained. Broadly, the studies in
this volume focus on four main disciplines: continuous optimization,
classification, scheduling and hardware implementations.
For continuous optimization, Neto et al. study the use of artificial neural
networks (ANNs) and heuristic rules for solving large-scale optimization
problems. They focus on a recurrent ANN for solving a quadratic programming
problem and propose several techniques to accelerate the convergence of
the algorithm. Their method is more efficient than one using an ANN alone.
Starzyk et al. propose a direct-search optimization algorithm which uses
reinforcement learning, resulting in an algorithm which ‘learns’ the best path
during the search. The algorithm weights past steps based on their success
to yield a new candidate search step. They benchmark their algorithm on
several mathematical test functions and apply it to the training of a multi-layer
perceptron neural network for image recognition. Ventresca et al. use the
opposition sampling approach to decrease the number of function evaluations.
The approach attempts to sample the function in a subspace generated by the
‘opposites’ of an existing population of candidates. They apply their method
to differential evolution and incremental learning and show that the opposition
method improves performance over baseline variants. Bazan studies an
optimization algorithm for problems where the objective function requires
large computational resources. His proposed algorithm uses locally regularized
approximations of the objective function based on radial basis functions. He
provides convergence proofs and formulates a framework which can be applied
to other algorithms, such as Gauss-Seidel or conjugate directions.
Ruiz-Torrubiano et al. study hybrid methods for solving large-scale
optimization problems with cardinality constraints, a class of problems arising in
areas as diverse as finance, machine learning and statistical data analysis.
While existing methods (such as branch-and-bound) can provide exact
solutions, they require large resources. As such, the study focuses on methods
which can efficiently identify approximate solutions while requiring far less
computational effort. For problems where it is expensive to evaluate the
objective function, Jayadeva et al. propose using a support vector machine to
predict the location of yet-undiscovered optima. Their framework can be
applied to problems where little or no a priori information is available on
the objective function, as the algorithm ‘learns’ during the search process.
Benchmarks show their method can outperform existing methods such as
particle swarm optimization or genetic algorithms. Voutchkov and Keane study
multi-objective optimization problems using surrogate models. They
investigate how to efficiently update the surrogates under a small optimization
‘budget’ and compare different updating strategies. They also show that using
a number of surrogate models can improve the optimization search and
that the size of the ensemble should increase with the problem dimension.
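For readers unfamiliar with the idea, the ‘opposite’ of a candidate in a bounded search space is its reflection through the centre of the box. The sketch below illustrates only this sampling step; the function names and the simple keep-the-fitter rule are illustrative assumptions, not the authors’ full algorithm:

```python
def opposite(x, lo, hi):
    """Reflect each coordinate within the search bounds:
    the opposite of x_i in [lo_i, hi_i] is lo_i + hi_i - x_i."""
    return [l + h - xi for xi, l, h in zip(x, lo, hi)]

def opposition_step(population, f, lo, hi):
    """Evaluate each candidate together with its opposite and keep
    the fitter of the two (minimization), giving the search a cheap
    second chance when a candidate lies in a poor region."""
    return [min(x, opposite(x, lo, hi), key=f) for x in population]
```

In opposition-based variants of population methods, a step of this kind is typically applied to the initial population and then, with some probability, to later generations.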
Other studies consider agent-based algorithms, that is, algorithms where the
optimization is done by agents which co-operate during the search. Dreżewski
and Siwik review agent-based co-evolutionary algorithms for multi-objective
problems. Such algorithms combine co-evolution (multiple species) with the
agent approach (interaction). They review and compare existing methods and
benchmark them over a range of test problems. Results show that agent-based
co-evolutionary algorithms can perform equally well as, and even surpass, some
of the best existing multi-objective evolutionary algorithms. Salhi and Töreyen
propose a multi-agent algorithm based on game theory. Their framework uses
multiple solvers (agents) which compete over available resources, and their
algorithm identifies the most successful solver. In the spirit of game theory,
successful solvers are rewarded with increased computing resources and vice
versa. Test results show the framework provides a better final solution than
using any single solver.
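The reward mechanism described above can be pictured as a budget reallocation between solvers. The sketch below is a hypothetical stand-in for the chapter’s game-theoretic scheme: every solver is ‘taxed’ a fraction of its evaluation budget and the pot is granted to the currently best-scoring solver:

```python
def reallocate(budgets, scores, rate=0.1):
    """Shift a fraction `rate` of every solver's evaluation budget
    to the best-scoring solver, keeping the total budget constant."""
    total = sum(budgets.values())
    best = max(scores, key=scores.get)       # most successful solver so far
    new = {name: b * (1 - rate) for name, b in budgets.items()}
    new[best] += total - sum(new.values())   # grant the collected pot
    return new
```

Iterating such a rule concentrates resources on the solver best suited to the problem at hand while still letting the others probe for a comeback.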
For applications in classification, Arana-Daniel et al. use Clifford algebra
to generalize support vector machines (SVMs) for classification (with an
extension to regression). They represent the input data as a multivector and use
a single Clifford kernel for multi-class problems. This approach significantly
reduces the computational complexity involved in training the SVM. Tests
on real-world applications in signal processing and computer vision show
the merit of their approach. Luukka and Lampinen propose a classification
method which combines principal component analysis to pre-process the data
with subsequent optimization of the classifier parameters using a differential
evolution algorithm. Specifically, they optimize the class vectors used by the
classifier and the power of the distance metric. Test results on real-world
data sets show the proposed approach performs as well as or better than some
of the best existing classifiers. Lastly in this category, Zhang et al. study the
problem of feature selection in high-dimensional problems. They focus on the
GA-SVM approach, where a genetic algorithm (GA) optimizes the parameters
of the SVM (the GA uses the SVM output as the objective values).
The approach requires large computational resources, which makes it difficult
to apply to large or high-dimensional data sets. As such, they propose several
measures, such as parallelization, neighbour search and caching, to accelerate
the search. Test results show their approach can reduce the computational cost
of training an SVM classifier.
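Of the acceleration measures just listed, caching is the simplest to illustrate. The wrapper below is a generic memoization sketch, not code from the chapter; `evaluate` stands in for an expensive run such as training and cross-validating an SVM on a candidate feature subset:

```python
def make_cached_objective(evaluate):
    """Wrap an expensive fitness function so that repeated GA
    evaluations of the same feature subset hit a cache instead of
    re-running the underlying training."""
    cache = {}
    def objective(feature_subset):
        key = frozenset(feature_subset)   # order-insensitive cache key
        if key not in cache:
            cache[key] = evaluate(feature_subset)
        return cache[key]
    return objective
```

Because GA populations revisit many of the same individuals across generations, the hit rate of such a cache, and hence the saving, grows as the search converges.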
Two studies focus on difficult scheduling problems. First, Pieters studies
the problem of railway timetable design, an NP-hard scheduling problem
with additional challenging features, such as being reactive and dynamic.
He studies solving the problem with Symbiotic Networks, a class of neural
networks inspired by the symbiosis phenomenon in nature, in which the
network uses ‘agents’ to adapt itself to the problem. Test results show the
Symbiotic Network can successfully handle this complex scheduling problem.
Next, Srivastava et al. propose an approach combining evolutionary
algorithms, neural networks and fuzzy logic to solve multi-objective time-cost
trade-off problems. They consider a range of such problems, including
non-linear time-cost relationships, constrained resources and project
uncertainties. They show the merit of their approach by testing it on a
real-world test case.
Lastly, for applications of CI to hardware implementations, Meher studies
the use of systolic arrays for implementing artificial neural networks on VLSI
and FPGA platforms, targeting efficient hardware realizations of neural
networks for real-time applications. The chapter surveys various approaches
and current achievements as well as future directions, such as mixed
analog-digital neural networks. This is followed by Thangavelautham et al.,
who propose using coarse-coding techniques to evolve multi-robot controllers,
aiming to evolve both the controller and the sensor configuration
simultaneously. To make the problem tractable, they use an Artificial Neural
Tissue to exploit regularity in the sensor data. Test results show their
approach outperforms a reference one.
Overall, the chapters in this volume address a spectrum of issues arising
in the application of computational intelligence to difficult real-world
optimization problems. The chapters discuss both current accomplishments
and the remaining open issues, and point to future research directions
in the field.

September 2009

Yoel Tenne
Chi-Keong Goh
Acknowledgement to Reviewers

We express our thanks for the expertise provided by our fellow researchers
who have kindly reviewed for this edited book. Their assistance has been
invaluable to our endeavors.

B.V. Babu
Will Browne
Pedro M. S. Carvalho
Jia Chen
Sheng Chen
Tsung-Che Chiang
Siang-Yew Chong
Antonio Della Cioppa
Carlos A. Coello Coello
Marco Cococcioni
Claudio De Stefano
Antonio Gaspar-Cunha
Kyriakos C. Giannakoglou
David Ginsbourger
Frederico Guimarães
Martin Holena
Amitay Issacs
Jayadeva
Wee Tat Koo
Slawomir Koziel
Jouni Lampinen
Xiaodong Li
Dudy Lim
Pasi Luukka
Pramod Kumar Meher
Hirotaka Nakayama
Ferrante Neri
Thai Dung Nguyen
Alberto Ochoa
Yew-Soon Ong
Khaled Rasheed
Tapabrata Ray
Abdellah Salhi
Vui Ann Shim
Ofer M. Shir
Dimitri Solomatine
Sanjay Srivastava
Janusz Starzyk
Stephan Stilkerich
Haldun Süral
Mohammed B. Trabia
Massimiliano Vasile
Lingfeng Wang
Chee How Wong
Contents

1 New Hybrid Intelligent Systems to Solve Linear and Quadratic
Optimization Problems and Increase Guaranteed Optimal
Convergence Speed of Recurrent ANN . . . . . . . . . . . . . . . . . . . . . . . . . 1
Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, Milde M.S. Lira
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Neural Network of Maa and Shanblatt: Two-Phase
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Hybrid Intelligent System Description . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Method of Tendency Based on the Dynamics in
Space-Time (TDST) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Method of Tendency Based on the Dynamics in
State-Space (TDSS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Case 1: Mathematical Linear Programming
Problem – Four Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Case 2: Mathematical Linear Programming
Problem – Eleven Variables . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 Case 3: Mathematical Quadratic Programming
Problem – Three Variables . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 A Novel Optimization Algorithm Based on Reinforcement
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Janusz A. Starzyk, Yinyin Liu, Sebastian Batog
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 Basic Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Extracting Historical Information by Weighted
Optimized Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Predicting New Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.4 Stopping Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.5 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Simulation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 Finding Global Minimum of a Multi-variable
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 Optimization of Weights in Multi-layer Perceptron
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.3 Micro-saccade Optimization in Active Vision for
Machine Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 The Use of Opposition for Decreasing Function Evaluations in
Population-Based Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Mario Ventresca, Shahryar Rahnamayan, Hamid Reza Tizhoosh
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Theoretical Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Definitions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Consequences of Opposition . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.3 Lowering Function Evaluations . . . . . . . . . . . . . . . . . . . . . 53
3.2.4 Comparison to Existing Methods . . . . . . . . . . . . . . . . . . . 54
3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.2 Opposition-Based Differential Evolution . . . . . . . . . . . . . 57
3.3.3 Population-Based Incremental Learning . . . . . . . . . . . . . . 57
3.3.4 Oppositional Population-Based Incremental
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.1 Evolutionary Image Thresholding . . . . . . . . . . . . . . . . . . . 59
3.4.2 Parameter Settings and Solution Representation . . . . . . . 63
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.1 ODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.2 OPBIL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 Search Procedure Exploiting Locally Regularized Objective
Approximation: A Convergence Theorem for Direct Search
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Marek Bazan
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 The Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Zangwill’s Method to Prove Convergence . . . . . . . . . . . . . . . . . . . . 75
4.4 The Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Closedness of the Algorithmic Transformation . . . . . . . . 78
4.4.2 A Perturbation in the Line Search . . . . . . . . . . . . . . . . . . . 80
4.5 The Radial Basis Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Detecting Dense Regions . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 Regularization Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.3 Choice of the Regularization Parameter λ Value . . . . . . . 90
4.5.4 Error Bounds for Radial Basis Approximation . . . . . . . . 91
4.6 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5 Optimization Problems with Cardinality Constraints . . . . . . . . . . . . 105
Rubén Ruiz-Torrubiano, Sergio Garcı́a-Moratilla, Alberto Suárez
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Approximate Methods for the Solution of Optimization
Problems with Cardinality Constraints . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.3 Estimation of Distribution Algorithms . . . . . . . . . . . . . . . 111
5.3 Benchmark Optimization Problems with Cardinality
Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.1 The Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Ensemble Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.3 Portfolio Optimization with Cardinality Constraints . . . . 119
5.3.4 Index Tracking by Partial Replication . . . . . . . . . . . . . . . . 122
5.3.5 Sparse Principal Component Analysis . . . . . . . . . . . . . . . 124
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6 Learning Global Optimization through a Support Vector
Machine Based Adaptive Multistart Strategy . . . . . . . . . . . . . . . . . . . 131
Jayadeva, Sameena Shah, Suresh Chandra
6.1 Introduction and Background Research . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Global Optimization with Support Vector Regression Based
Adaptive Multistart (GOSAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.3.1 One Dimensional Wave Function . . . . . . . . . . . . . . . . . . . 137
6.3.2 Two Dimensional Case: Ackley’s Function . . . . . . . . . . . 140
6.3.3 Comparison with PSO and GA on Higher
Dimensional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.4 Extension to Constrained Optimization Problems . . . . . . . . . . . . . . 143
6.4.1 Sequential Unconstrained Minimization Techniques . . . 143
6.5 Design Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5.1 Sample and Hold Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5.2 Folded Cascode Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . . . 155
Ivan Voutchkov, Andy Keane
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Surrogate Models for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . 158
7.4 Pareto Fronts - Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.5 Response Surface Methods, Optimization Procedure and Test
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6 Update Strategies and Related Parameters . . . . . . . . . . . . . . . . . . . . 163
7.7 Test Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.8 Pareto Front Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.1 Generational Distance ([3], p. 326) . . . . . . . . . . . . . . . . . 165
7.8.2 Spacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.3 Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.4 Maximum Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9.1 Understanding the Results . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9.2 Preliminary Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.9.3 The Effect of the Update Strategy Selection . . . . . . . . . . . 167
7.9.4 The Effect of the Initial Design of Experiments . . . . . . . . 171
7.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

8 A Review of Agent-Based Co-Evolutionary Algorithms for
Multi-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Rafał Dreżewski, Leszek Siwik
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.2 Model of Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 179
8.2.1 Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 180
8.2.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.2.3 Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.2.4 Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.2.5 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.3.1 Co-Evolutionary Multi-Agent System with
Co-Operation Mechanism (CCoEMAS) . . . . . . . . . . . . . . 187
8.3.2 Co-Evolutionary Multi-Agent System with
Predator-Prey Interactions (PPCoEMAS) . . . . . . . . . . . . . 190
8.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.4.1 Test Suite, Performance Metric and State-of-the-Art
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.4.2 A Glance at Assessing Co-operation Based Approach
(CCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.3 A Glance at Assessing Predator-Prey Based
Approach (PPCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9 A Game Theory-Based Multi-Agent System for Expensive
Optimisation Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Abdellah Salhi, Özgun Töreyen
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.1 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.2 Game Theory: The Iterated Prisoners’ Dilemma . . . . . . 213
9.2.3 Multi-Agent Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.3 Constructing GTMAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.3.1 GTMAS at Work: Illustration . . . . . . . . . . . . . . . . . . . . . . 216
9.4 The GTMAS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.4.1 Solver-Agents Decision Making Procedure . . . . . . . . . . . 219
9.5 Application of GTMAS to TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.6 Tests and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.7 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10 Optimization with Clifford Support Vector Machines and
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
N. Arana-Daniel, C. López-Franco, E. Bayro-Corrochano
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.2 Geometric Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.2.1 The Geometric Algebra of n-D Space . . . . . . . . . . . . . . . . 235
10.2.2 The Geometric Algebra of 3-D Space . . . . . . . . . . . . . . . 237
10.3 Linear Clifford Support Vector Machines for Classification . . . . . 237
10.4 Non Linear Clifford Support Vector Machines for
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.5 Clifford SVM for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.6 Recurrent Clifford SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.7.1 3D Spiral: Nonlinear Classification Problem . . . . . . . . . . 247
10.7.2 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.7.3 Multi-case Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.7.4 Experiments Using Recurrent CSVM . . . . . . . . . . . . . . . . 257
10.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
11 A Classification Method Based on Principal Component Analysis
and Differential Evolution Algorithm Applied for Prediction
Diagnosis from Clinical EMR Heart Data Sets . . . . . . . . . . . . . . . . . . 263
Pasi Luukka, Jouni Lampinen
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
11.2 Heart Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.3 Classification Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.3.1 Dimension Reduction Using Principal Component
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.3.2 Classification Based on Differential Evolution . . . . . . . . 269
11.3.3 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12 An Integrated Approach to Speed Up GA-SVM Feature Selection
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh,
Gary Kee Khoon Lee
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
12.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2.1 Parallel/Distributed GA . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2.2 Parallel SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
12.2.3 Neighbor Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2.4 Evaluation Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

13 Computation in Complex Environments; Optimizing Railway Timetable Problems with Symbiotic Networks . . . . . . . . . . . . . . . . . . . . 299
Kees Pieters
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.1.1 Convergence Inducing Process . . . . . . . . . . . . . . . . . . . . . 300
13.1.2 A Classification of Problem Domains . . . . . . . . . . . . . . . . 301
13.2 Railway Timetable Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
13.3 Symbiotic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.3.1 A Theory of Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.3.2 Premature Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.4 Symbiotic Networks as Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.5 Trains as Symbiots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.5.1 Trains in Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.5.2 The Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.5.3 The Trains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.5.4 The Optimizing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.5.5 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 319
13.5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
13.5.7 A Symbiotic Network as a CCGA . . . . . . . . . . . . . . . . . . . 321
13.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14 Project Scheduling: Time-Cost Tradeoff Problems . . . . . . . . . . . . . . . 325
Sanjay Srivastava, Bhupendra Pathak, Kamal Srivastava
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
14.1.1 A Mathematical Description of TCT Problems . . . . . . . . 328
14.2 Resource-Constrained Nonlinear TCT . . . . . . . . . . . . . . . . . . . . . . . 329
14.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330
14.2.2 Working of ANN and Heuristic Embedded Genetic
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.2.3 ANNHEGA for a Case Study . . . . . . . . . . . . . . . . . . . . . . 334
14.3 Sensitivity Analysis of TCT Profiles . . . . . . . . . . . . . . . . . . . . . . . . 336
14.3.1 Working of IFAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
14.3.2 IFAG for a Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
14.4 Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.4.1 Working of Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . 345
14.4.2 HMH Approach for Case Studies . . . . . . . . . . . . . . . . . . . 348
14.4.3 Standard Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
15 Systolic VLSI and FPGA Realization of Artificial Neural
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Pramod Kumar Meher
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.2 Direct-Design of VLSI for Artificial Neural Network . . . . . . . . . . 362
15.3 Design Considerations and Systolic Building Blocks for ANN . . . 364
15.4 Systolic Architectures for ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
15.4.1 Systolic Architecture for Hopfield Net . . . . . . . . . . . . . . . 371
15.4.2 Systolic Architecture for Multilayer Neural Network . . . 373
15.4.3 Systolic Implementation of Back-Propagation
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
15.4.4 Implementation of Advance Algorithms and
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
15.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
16 Application of Coarse-Coding Techniques for Evolvable
Multirobot Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Jekanthan Thangavelautham, Paul Grouchy,
Gabriele M.T. D’Eleuterio
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
16.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
16.2.1 The Body and the Brain . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
16.2.2 Task Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
16.2.3 Machine-Learning Techniques and Modularization . . . . 386
16.2.4 Fixed versus Variable Topologies . . . . . . . . . . . . . . . . . . . 387
16.2.5 Regularity in the Environment . . . . . . . . . . . . . . . . . . . . . . 388
16.3 Artificial Neural Tissue Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.3.2 The Decision Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
16.3.3 Evolution and Development . . . . . . . . . . . . . . . . . . . . . . . . 391
16.3.4 Sensory Coarse Coding Model . . . . . . . . . . . . . . . . . . . . . 393
16.4 An Example Task: Resource Gathering . . . . . . . . . . . . . . . . . . . . . . 395
16.4.1 Coupled Motor Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 397
16.4.2 Evolutionary Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
16.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
16.5.1 Evolution and Robot Density . . . . . . . . . . . . . . . . . . . . . . . 403
16.5.2 Behavioral Adaptations . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
16.5.3 Evolved Controller Scalability . . . . . . . . . . . . . . . . . . . . . . 406
16.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
16.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Chapter 1
New Hybrid Intelligent Systems to Solve Linear
and Quadratic Optimization Problems and
Increase Guaranteed Optimal Convergence
Speed of Recurrent ANN

Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, and Milde M.S. Lira

Abstract. This chapter deals with the study of artificial neural networks (ANNs) and
Heuristic Rules (HR) to solve optimization problems. The study of ANN as optimiza-
tion tools for solving large scale problems was due to the fact that this technique has
great potential for hardware VLSI implementation, in which it may be more efficient
than traditional optimization techniques. However, the software implementation of the algorithm has shown that, although the proposed technique is effective, it is slow compared with traditional mathematical methods. In order to make it a fast method, we will show two ways to increase the speed of convergence of the computational algorithm. For analysis and comparison, we solved three test cases. This chapter considers recurrent ANNs to solve linear and quadratic programming problems. These networks are based on the solution of a set of differential equations that
are obtained from a transformation of an augmented Lagrange energy function. The
proposed hybrid systems combining recurrent ANN and HR presented a reduced
computational effort in relation to the one using only the recurrent ANN.

1.1 Introduction
The early 1980’s were marked by a resurgence of interest in artificial neural net-
works (ANNs). At that time, the development of ANNs had the important charac-
teristic of temporal processing. Many researchers have attributed the resurgence of research on ANNs in the eighties to the Hopfield model presented in 1982 [1]. This recurrent Hopfield model represented a great advance over the then-current state of knowledge in the area of neural networks.
Nowadays, it is known that there are two ways of incorporating temporal computation in a neural network: the first is to use a static neural network to accomplish a dynamical mapping through a short-term memory structure; and the
Otoni Nóbrega Neto · Ronaldo R.B. de Aquino · Milde M.S. Lira
Electrical Engineering Department, Federal University of Pernambuco, Brazil
e-mail: otoninobrega@hotmail.com,rrba@ufpe.br

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 1–26.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010

second one is by making internal feedback connections that may be made by single
or multi-loop feedback in which the neural network can be fully connected. Artifi-
cial neural networks that have feedback connections in their topology are known as
recurrent neural networks [2]. The theoretical study and applications of recurrent
neural nets were developed in several subsequent works [3, 4, 5, 6, 7, 8, 9]. In fact, Hopfield's works have shown that a value of energy can be associated with each state of the net and that this energy decreases monotonically as a path is described within the state-space towards a fixed point. These fixed points are therefore stable points of energy [10], i.e., the energy function behaves as a Lyapunov function for the model described in detail in Hopfield's works. It is at this point that questions of stability arise in recurrent neural nets. When considering stability in a non-linear dynamical system, we usually
think about stability in the sense of Lyapunov. The Direct Method of Lyapunov is
broadly used for stability analysis of linear and non-linear systems which may be
either time-variant or time-invariant. Therefore, it can be directly applicable to the
stability analysis of ANNs [2].
In 1985, Hopfield solved the traveling salesman problem [7], a combinatorial optimization problem, using a continuous model of the recurrent neural network as an optimization tool. In 1986, Hopfield proposed a specialized ANN to solve specific problems of linear programming (LP) [9] based on analog circuits, studied since 1956 by Insley B. Pyne and presented in [11]. On that occasion, Hopfield demonstrated that the dynamics of recurrent artificial neural nets are described by a Lyapunov function; consequently, the network is stable, and its point of stability is the solution of the problem for which the ANN was modeled.
In 1987, Kennedy and Chua demonstrated that the ANN proposed by Hopfield in 1986, despite searching for the minimum of the energy function, had not been modeled to provide a lower bound, which was reached only when an operational amplifier of the circuit saturated [12]. Due to this deficiency, Kennedy and Chua proposed a new circuit for LP problems that also proved able to solve quadratic programming (QP) problems. These circuits were named "canonical non-linear programming circuits" and are based on the Kuhn-Tucker (KT) conditions [12]. In this kind of ANN-based optimization, the problem has to be "hard-wired" into the network, and the convergence behavior of the ANN depends greatly on how the cost function is modeled.
Later on, further studies [13, 14] confirmed that for non-linear programming problems the model proposed by Kennedy and Chua [15] completely satisfies the KT optimality conditions and the penalty method, and that under appropriate conditions this net is stable. In spite of the important progress presented in Kennedy and Chua's studies, a deficiency was observed in the model: the equilibrium point of the net lies only in the neighborhood of the optimal point of the original problem, although the distance between the optimal point and the equilibrium point of the network can be reduced by increasing the penalty parameter (s), as in [14] and [16]. Even so, Kennedy and Chua's network is able to solve a great class of optimization problems with and without constraints. However, when
the solutions of constrained optimization problems lie on the boundary of the feasible region, i.e., the equality constraints are nearly active, the network only converges to an approximate solution that may lie outside the feasible region [17].
This is explained by the application of the penalty function theorem [16]. For applications in which an infeasible solution cannot be tolerated, the usefulness of this technique (Kennedy and Chua's neural networks) is seriously jeopardized. With the intention of overcoming this difficulty, Maa and Shanblatt proposed the two-phase method [14]. This work builds on the method presented by David W. Tank and John J. Hopfield [18] and guarantees that, under certain conditions, the proposed network evolves towards the exact solution of the optimization problem.
Since the Kennedy and Chua network contains a finite penalty parameter, it generates only approximate solutions and presents an implementation problem when the penalty parameter is very large. To reach an exact solution, the Maa and Shanblatt method uses another penalty parameter in the second phase. Therefore, to
avoid using penalty parameters, some significant works have been done in recent
years. Among them, a few primal-dual neural networks with two-layer and one-
layer structure were developed for solving linear and quadratic programming prob-
lems [18, 19, 20, 21]. These neural networks were proved to be globally convergent
to an exact solution when the objective function is convex.
Nowadays, recurrent ANNs have been used to solve real-world problems such as hydrothermal scheduling, in [22] based on the augmented Lagrange Hopfield network and in [23, 24, 25] based on the Maa and Shanblatt two-phase neural network.
In this work, an ANN approach was adopted to solve optimization problems, and the method proposed by Maa and Shanblatt was applied. The study of ANNs as optimization tools for solving large-scale problems is motivated by the fact that this technique has great potential for hardware VLSI implementation, in which it can be more efficient than traditional optimization techniques. However, the implementation of the method in software has shown that, although the technique is effective in solving optimization problems, its speed of convergence can be slow compared with traditional mathematical methods. In this regard, heuristic rules were created and proposed in a hybrid form to aid and accelerate the convergence of the two-phase method in software. It is important to point out that the software implementation of the method is an important part of the development and analysis of the method in hardware. An important reason for choosing the Maa and Shanblatt network is that it readily solves linear and quadratic optimization problems with linear equality and inequality constraints, without mathematical transformations that would increase the dimension of the problem. As we plan to apply the HIS developed in this work to the hydrothermal scheduling problem [24, 25], which does not need an exact solution, the first phase of the Maa and Shanblatt network was chosen for this implementation. In future works, we may try other kinds of recurrent ANNs.
Decision trees and classification rules are important and common methods used for knowledge representation in expert systems [26]. Heuristic rules have no particular foundation in a scientific theory; they are based only on the observation of general patterns and derived from facts. These rules are
applicable to many problems as shown in [27, 28, 29, 30]. Here the basis of the
proposed heuristic rules is the dynamical behavior of neural networks. From the
convergence analysis, we identified the parameters and their relationships, which
are then transformed into a set of heuristic rules. We developed an algorithm based
on the heuristic rules and carried out some experiments to evaluate and support the
proposed technique.
In this work, two possible implementations were developed, tested and compared, and a large reduction in computational effort was observed when using the proposed heuristic rules. This reduction is related to the decrease in the number of ODEs solved during the convergence process. Other possible implementations are also indicated.
This work is organized as follows: we begin with a review of the two-phase method of Maa and Shanblatt; next, we present the proposed heuristic rules and show the solutions for test cases using the previously discussed techniques; then the simulation results are presented and analyzed; and finally, we draw conclusions about the proposed work.

1.2 Neural Network of Maa and Shanblatt: Two-Phase Optimization
The operation of the Hopfield network model and of the subsequent models is based on constraint violations of the optimization problem. When a constraint violation occurs, the magnitude and the direction of the violation are fed back to adjust the states of the neurons of the network so that the overall energy function of the network always decreases until it reaches a minimum. These ANN models have dynamical characteristics governed by a Lyapunov function. Therefore, it can be demonstrated that these networks are stable and that the equilibrium point is the solution of the LP or QP problem that the network represents. This type of neural network was first improved in [15] and later in [14]; the latter version is the one used in this work.
The Maa and Shanblatt network is able to solve constrained or unconstrained convex quadratic and linear programming problems.
Consider the following convex problem P:
$$(P)\quad \min\; f(x) = \tfrac{1}{2}x^T Q x + c^T x$$
$$\text{s.t.}\quad g(x) = Dx - b \le 0,\qquad h(x) = Hx - w = 0,\qquad x \in \mathbb{R}^n \qquad (1.1)$$

where $c \in \mathbb{R}^n$, $D \in \mathbb{R}^{p\times n}$, $b \in \mathbb{R}^p$, $H \in \mathbb{R}^{q\times n}$, $w \in \mathbb{R}^q$, $p, q \le n$, and $Q \in \mathbb{R}^{n\times n}$ is symmetric and positive definite or positive semidefinite; $f$, the $g_i$'s and the $h_j$'s are functions from $\mathbb{R}^n$ to $\mathbb{R}$. Assume that the feasible domain of P is not empty and that the objective function is bounded below over the domain of P.
Particularly, P is said to be a convex program if $f$ and the $g_i$'s are convex functions and the $h_j$'s are affine functions. Another particular case of the formulation arises when $Q$ is a zero matrix, so that the cost function reduces to $f(x) = c^T x$; in this case, if the inequality and equality constraints are linear, the problem P becomes a linear programming problem.
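To make the formulation concrete, problem (P) can be encoded numerically. The following sketch builds a hypothetical two-variable instance (illustrative data only, not one of the chapter's test cases) with the matrices $Q$, $c$, $D$, $b$, $H$, $w$ and the functions $f$, $g$, $h$:

```python
import numpy as np

# Hypothetical 2-variable instance of problem (P); the data below are
# illustrative only, not one of the chapter's test cases.
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])      # symmetric positive definite
c = np.array([-2.0, -4.0])
D = np.array([[1.0, 1.0]])      # one inequality: x1 + x2 - 3 <= 0
b = np.array([3.0])
H = np.array([[1.0, -1.0]])     # one equality: x1 - x2 = 0
w = np.array([0.0])

def f(x):
    """Objective: 1/2 x^T Q x + c^T x."""
    return 0.5 * x @ Q @ x + c @ x

def g(x):
    """Inequality constraints g(x) = Dx - b; feasible when <= 0."""
    return D @ x - b

def h(x):
    """Equality constraints h(x) = Hx - w; feasible when == 0."""
    return H @ x - w
```

Setting Q to the zero matrix reduces f(x) to cᵀx and turns this instance into a linear program, as noted above.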
The method presented by Maa and Shanblatt [14] is composed of two phases. The first phase initializes the problem and converges quickly, without high accuracy, towards the neighborhood of the optimal point, while the second phase aims to reach the exact solution of the problem. To this end, the dynamics of the first phase are based on the exact penalty Lagrangian (energy) function L(s, x):
 
$$L(s,x) = f(x) + \frac{s}{2}\left[\sum_{i=1}^{p}\left(g_i^{+}(x)\right)^2 + \sum_{j=1}^{q}\left(h_j(x)\right)^2\right] \qquad (1.2)$$

where s is a large positive real number, and the function g+ i (x(t)) = max{0, gi (x(t))},
whose notation was simplified to g+ = [g+ + T
1 , . . . , gm ] , according to [14].
As long as the system converges x(t) → x̂, sg+ i (x(t)) → λi and sh j (x(t)) → μ j
which are the Lagrange multipliers associated with each corresponding constraint.
In the first phase, an approximation of the Lagrange multipliers is already obtained.
The block diagram of a two-phase optimization network is shown in Fig. 1.1.
The dynamics that happen in the first phase are in the time range 0 ≤ t ≤ t1 (t1 is the
time instant when the switch is closed connecting the first phase to the second one).
The network operates according to the following dynamics:
 
$$\frac{dx}{dt} = -\nabla f(x) - s\left[\sum_{i=1}^{p} \nabla g_i(x)\, g_i^{+}(x) + \sum_{j=1}^{q} \nabla h_j(x)\, h_j(x)\right] \qquad (1.3)$$
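As a rough software illustration (not the authors' implementation), the first-phase dynamics (1.3) can be simulated with a forward-Euler discretization; since the constraints of problem (P) are linear, $\nabla g_i$ and $\nabla h_j$ are the rows of $D$ and $H$, and the penalty term reduces to $D^T g^{+} + H^T h$:

```python
import numpy as np

def first_phase(x0, Q, c, D, b, H, w, s=100.0, dt=1e-3, steps=20000):
    """Forward-Euler sketch of the first-phase dynamics (1.3):
    dx/dt = -grad f(x) - s * (D^T g+(x) + H^T h(x)),
    with grad f(x) = Qx + c, g+(x) = max(0, Dx - b), h(x) = Hx - w."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g_plus = np.maximum(0.0, D @ x - b)   # violated inequalities only
        h_val = H @ x - w                     # equality residuals
        dx = -(Q @ x + c) - s * (D.T @ g_plus + H.T @ h_val)
        x = x + dt * dx                       # Euler step
    return x
```

Consistent with the penalty function theorem discussed below, the returned point only approximates the constrained minimizer; the gap shrinks as s grows.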

In the second phase ($t \ge t_1$) the network begins to shift the directional vector $sg_i^{+}(x)$ gradually to $\lambda_i$, and $sh_j(x)$ to $\mu_j$. By imposing a small positive real value $\varepsilon$, the update rates $d\lambda_i/dt$ and $d\mu_j/dt$, given in (1.6) and (1.7) respectively, are comparatively much slower than that of $dx/dt$ (1.5). These dynamics can be approximated by considering $\lambda$ and $\mu$ to be fixed; it can then be seen that (1.5) seeks a minimum point of the augmented Lagrangian function $L_a(s, x)$:
$$L_a(s,x) = f(x) + \lambda^T g(x) + \mu^T h(x) + \frac{s}{2}\left(\|g^{+}(x)\|^2 + \|h(x)\|^2\right) \qquad (1.4)$$
In the block diagram of Fig. 1.1, in the first phase, the subsystems within the two
large rectangles do not contribute during t ≤ t1 and in the second phase, when t > t1 ,
the dynamics of the network become:
 
$$\frac{dx}{dt} = -\nabla f(x) - \left[\sum_{i=1}^{p} \nabla g_i(x)\left(s g_i^{+}(x) + \lambda_i\right) + \sum_{j=1}^{q} \nabla h_j(x)\left(s h_j(x) + \mu_j\right)\right] \qquad (1.5)$$

Fig. 1.1 Block diagram of the dynamical system to Maa and Shanblatt network

The Lagrange multipliers are updated as:

$$\frac{d\lambda_i(t + \Delta t)}{dt} = \varepsilon s\, g_i^{+}(x(t)), \quad \text{for } i = 1, \dots, p, \text{ and} \qquad (1.6)$$
$$\frac{d\mu_j(t + \Delta t)}{dt} = \varepsilon s\, h_j(x(t)), \quad \text{for } j = 1, \dots, q. \qquad (1.7)$$
A practical value is $\varepsilon = 1/s$ according to [14], which leaves the network with just one adjustment parameter. However, choosing $\varepsilon$ independently of $s$ gives more freedom to control the dynamics of the network. During the first phase, the Lagrange multipliers are null, so there is no restriction on the initial value of $x(t)$.
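A hedged software sketch of the second phase follows: the state obeys (1.5) while the multiplier estimates drift slowly according to (1.6) and (1.7). The function name and the default parameters are illustrative, not taken from the chapter.

```python
import numpy as np

def second_phase(x0, Q, c, D, b, H, w, s=100.0, eps=None, dt=1e-3, steps=100000):
    """Euler sketch of the second-phase dynamics (1.5)-(1.7).
    lam and mu start at zero (they are null during the first phase) and
    are updated at the slow rate eps; eps = 1/s is the practical default."""
    if eps is None:
        eps = 1.0 / s
    x = np.asarray(x0, dtype=float)
    lam = np.zeros(D.shape[0])
    mu = np.zeros(H.shape[0])
    for _ in range(steps):
        g_plus = np.maximum(0.0, D @ x - b)
        h_val = H @ x - w
        # state update, Eq. (1.5)
        dx = -(Q @ x + c) - (D.T @ (s * g_plus + lam) + H.T @ (s * h_val + mu))
        x = x + dt * dx
        # multiplier updates, Eqs. (1.6) and (1.7)
        lam = lam + dt * eps * s * g_plus
        mu = mu + dt * eps * s * h_val
    return x, lam, mu
```

Starting from the approximate first-phase solution, the slow multiplier drift gradually pushes the state from the penalty-method equilibrium towards the exact constrained minimizer.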
According to the penalty function theorem, the solution achieved in the first phase is not equivalent to the minimum of the function $f(x)$ unless the penalty parameter $s$ is infinite. Thus, the second optimization phase is necessary for any finite value of $s$. The system reaches equilibrium when:
$$g_i^{+} = 0, \qquad h_j = 0, \quad \text{and} \quad \nabla f(x) + \sum_{i=1}^{p} \nabla g_i(x)\,\lambda_i + \sum_{j=1}^{q} \nabla h_j(x)\,\mu_j = 0, \qquad (1.8)$$

that is identical to optimality condition of the KT theorem and thus the equilibrium
point of the two-phase network is precisely a global minimum point to a convex
problem (P).
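The equilibrium conditions (1.8) can be checked numerically. The small helper below (an illustrative sketch, with gradients specialized to the linear constraints of (P)) returns residuals that should all vanish at a KT point:

```python
import numpy as np

def kt_residuals(x, lam, mu, Q, c, D, b, H, w):
    """Residuals of the equilibrium conditions (1.8): violated inequalities,
    equality residuals, and the gradient of the Lagrangian."""
    g_plus = np.maximum(0.0, D @ x - b)                # g_i^+ terms
    h_val = H @ x - w                                  # h_j terms
    stationarity = (Q @ x + c) + D.T @ lam + H.T @ mu  # Eq. (1.8), last line
    return g_plus, h_val, stationarity
```

Such a check is useful for deciding when the simulated network has effectively reached its equilibrium point.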
In [12] it is demonstrated that the Kennedy and Chua network for linear and quadratic programming problems completely satisfies the KT optimality conditions and the penalty function method. It is also shown that under appropriate conditions this network is completely stable. Moreover, it is shown that the equilibrium point lies in the neighborhood of the optimal point of the original problem and that the distance between them can be made arbitrarily small by selecting a sufficiently large value of the penalty parameter (s).
For problems that cannot tolerate a solution in the infeasible region, due to the physical limits of operational amplifiers, the two-phase optimization network model is proposed. In the second phase, we can obtain both the exact solution of these problems and the corresponding Lagrange multipliers associated with each constraint.

1.3 Hybrid Intelligent System Description


The network proposed by Maa and Shanblatt has two attractive features: the property of guaranteed global convergence for the mathematical programming problem, and the possibility of physical implementation of the neural network as an electrical circuit, in which the response time of the circuit dynamics would be imposed by its capacitance and the convergence time would therefore be negligible. In spite of these attractive characteristics, the time required to process the computational algorithm becomes a barrier in ANN-based applications for solving large-scale mathematical programming problems, since several differential equations must be solved. Problems with larger numbers of variables and constraints will involve a correspondingly larger number of differential equations. In order to mitigate this problem, heuristic rules were developed to accelerate the convergence of the computational algorithm involved in recurrent neural networks.
The combination of recurrent ANN with heuristic rules forms the Hybrid Intelli-
gent System (HIS), in which these two techniques interact and exchange information
with one another while the optimal solution of the problem is not achieved.
The basis of the proposed heuristic rules is the dynamical behavior of the neural networks. Control theory studies the dynamical behavior of a process in depth. With the aid of control theory and of the Lyapunov theorem for the network [2], it can be stated that, from any given initial state vector x(0), the network will always change the values of the state variables $x_i(t)$ in the direction in which the value of the Lyapunov function for the network decreases continuously until it reaches a stationary point, which corresponds to a
global minimum of the programming problem. The trajectory of the variables is
illustrated in Fig. 1.2. It depicts a two-dimensional trajectory (orbit) of a dynamical
system, where it is possible to observe the state variables of the system at certain
time instants (t0 ,t1 , . . . ,t5 ). The dotted vector can be understood as the tendency of
the convergence (indicated by the gradient vector) of the variables in the dynamics
of the system.

Fig. 1.2 A two-dimensional trajectory (orbit) of a dynamical system

Trajectories of the state variables for the same system are shown graphically in Fig. 1.3. These trajectories are distinct because the state variables have different initial states. The dynamics of recurrent ANNs have the same properties and are therefore similar to those shown in Fig. 1.3.
Although the Maa and Shanblatt model deals with a continuous-time recurrent network, in a computational algorithm the iterations are performed in discrete time, since the numerical integration of the equations demands a small, but non-null, step size. We therefore have total control over the course of the iterations of the algorithm.
Detailed observations carried out during tests of the algorithm of the Maa and Shanblatt model showed that the computational convergence is slow and that the convergence trajectories of recurrent networks in the state-space are smooth and possibly predictable. We then observed that, under certain conditions, it is not only possible to estimate a point closer to the minimum point of the energy function of the network, but also to estimate a point that serves as the initial point of a new orbit of convergence in place of the initial one. This new orbit has a shorter curvature and, consequently, a smaller Euclidean distance to the optimal point. In this way, the number of steps required for the convergence of the computational algorithm can be reduced and, consequently, so can the time needed to compute the equilibrium point of the network.

Fig. 1.3 An illustration of a two-dimensional state (phase) of a dynamical system and the
associated vectorial field

To reach the equilibrium point, we use two methods. In the first, the point is calculated from the evolution of the dynamics in the space-time plane (in this work, we consider only autonomous systems). In the second, the calculation is performed by observing the evolution of the variables in the state-space. The mechanism of these two methods and the way they operate in the proposed HIS are described as follows.

1.3.1 Method of Tendency Based on the Dynamics in Space-Time (TDST)
Consider the convergences of dynamical systems of first order according to the
graphs in Fig. 1.4.
Observing the curves of Fig. 1.4, and noting that time is always increasing, i.e., the following point $x_1(t)$ is always ahead of $x_1(t - \Delta t)$, we can conclude that a point closer to convergence would be located outside the internal area of the concavity of the convergence curve, when the curve behaves as shown in Fig. 1.4(a). For example, for $t = 0.060$, $x_1(0.060) \approx 0.3$; in this case, a better estimate of the point would be $x_1(0.061) = 0.45$ for $\Delta t = 0.001$ s. Restarting the network with this initial state would generate a convergence curve as illustrated in Fig. 1.5. However, this rule applies only to curves of types (a) and (c) in Fig. 1.4, since for curves (b) and (d) a better estimate of the point is located inside the internal area of the concavity.
Observing the particularities of the possible curvatures of the convergence curve in the space-time plane, the following parameters were created in relation to:
• curvature {curve, straight line};
• concavity, when it exists, {concave downwards, concave upwards};
Fig. 1.4 Dynamical convergences of first order (single variable systems): graphs of evolution
of state variables in time

• time rating {high, mean, low};
• variable xi(t) {increasing, decreasing}.
In order to assess these parameters, the network must provide at least three points (P0, P1, P2) on the convergence curve. These three points are then normalized along both the horizontal and the vertical axes, to avoid numerical problems in the algorithm used to estimate a better point. The chosen normalization equation is given in (1.9), where M is the maximum and m the minimum of the three values to be normalized, and a and b are chosen according to the desired range of the normalized values. In this work, values are normalized into the range [0, 1], so a = 0 and b = 1. Here z is the value to be normalized and zN is its normalized value.

zN = [b(z − m) − a(z − M)] / (M − m)    (1.9)
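As a sanity check, (1.9) can be coded directly. The sketch below is illustrative Python, not part of the chapter; the function name and the list-based interface are assumptions.

```python
def normalize(values, a=0.0, b=1.0):
    """Map a list of values onto [a, b] using (1.9):
    zN = (b*(z - m) - a*(z - M)) / (M - m),
    where M and m are the maximum and minimum of the values."""
    M, m = max(values), min(values)
    return [(b * (z - m) - a * (z - M)) / (M - m) for z in values]
```

With the chapter's choice a = 0 and b = 1, the minimum maps to 0, the maximum to 1, and intermediate values scale linearly; for example, `normalize([0.2, 0.5, 0.8])` yields (up to rounding) `[0.0, 0.5, 1.0]`.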
1 New HIS to Solve LP and QP Optimization and Increase Convergence Speed 11

Fig. 1.5 Action of the HIS to calculate a better point in dynamics evolving over time. [Plot of x1(t) against t showing the original dynamic, the point predicted by the heuristic rule (HIS-1), the time at which the ANN is restarted, the advanced dynamic, and the resulting algorithm time gain.]

From the normalized points, two spatial vectors v1 and v2 are computed, and the following relevant information is obtained from them: the Euclidean norm and the angle (θi) of each vector relative to the horizontal axis, and the angle between the two vectors (Δθ = θ2 − θ1). When there is a spatial curvature over the normalized points, classification regions (decision regions) are generated as shown in Fig. 1.6. To better understand Fig. 1.6, consider that the normalized initial point (P0N) is always found at the beginning of each region (S4, S5, S6, S7, S8 and S9). Besides these six possibilities, there are three more that occur for straight-line convergences, where Δθ is approximately zero.
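The geometric quantities just described can be sketched in code. The following is one plausible reading of the text, not the authors' implementation: it assumes v1 = P1 − P0 and v2 = P2 − P1 for points given as (t, x) pairs, uses atan2 for the angles, applies an illustrative threshold eps for the "straight line" decision, and omits the time-rating parameter.

```python
import math

def classify(p0, p1, p2, eps=0.02):
    """Classify three normalized (t, x) points on a convergence curve
    into curvature, concavity and trend. eps is an assumed threshold
    on |delta_theta| for calling the curve a straight line."""
    th1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])  # angle of v1 = P1 - P0
    th2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])  # angle of v2 = P2 - P1
    dth = th2 - th1
    curvature = "straight line" if abs(dth) < eps else "curve"
    concavity = None
    if curvature == "curve":
        # slope decreasing (dth < 0) -> concave downwards, as in Fig. 1.4(a)
        concavity = "concave upwards" if dth > 0 else "concave downwards"
    trend = "increasing" if p2[1] > p0[1] else "decreasing"
    return curvature, concavity, trend
```

Three points sampled from a saturating curve such as x(t) = 1 − e^(−t) come out as a curve, concave downwards and increasing, matching the qualitative description of Fig. 1.4(a).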

Fig. 1.6 Classification regions to patterns that have spatial curvature

Region 4 (S4) is similar to the beginning of the convergence shown in Fig. 1.4(a), while region 5 (S5) describes a behavior close to the curve formed by the end of the convergence in Fig. 1.4(a) and the beginning of the convergence in Fig. 1.4(b). Region 6 (S6) has a convergence similar to that of Fig. 1.4(b). Region 7 (S7) describes a dynamic of the type shown in Fig. 1.4(d), and region 8 (S8) describes a behavior close to the curve formed by the end of the convergence in Fig. 1.4(c) and the beginning of the convergence in Fig. 1.4(d), while region 9 (S9) represents a behavior close to that of Fig. 1.4(c).
The straight-line regime can be of three types: increasing - the derivative of the curve is positive and not close to zero, corresponding to region 1 (S1); constant - the derivative is approximately zero, corresponding to region 2 (S2); and decreasing - the derivative is negative and not close to zero, corresponding to region 3 (S3). Note that the regimes described by regions S5 and S8 can be considered close to the constant straight-line regime. Accordingly, we modeled the following heuristic rules:

Rule 1: if <curvature is a straight line> and <variable increases> and <time rating is high or mean> then <Action I>.
Rule 2: if <curvature is a straight line> and <time rating is low> then <Action II>.
Rule 3: if <curvature is a straight line> and <variable decreases> and <time rating is high or mean> then <Action III>.
Rule 4: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave downward> then <Action IV>.
Rule 5: if <curvature is a curve> and <time rating is low> and <concavity is concave downward> then <Action II>.
Rule 6: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave downward> then <Action V>.
Rule 7: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave upward> then <Action V>.
Rule 8: if <curvature is a curve> and <time rating is low> and <concavity is concave upward> then <Action II>.
Rule 9: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave upward> then <Action VI>.

The actions listed in the rules invoke sub-functions that return a better value for the next initialization point of the network. The straight-line condition implies that either the system is converging very slowly or the step size of the integration algorithm is very small. In this case, the linear function shown in (1.10) can be applied:

P3N = a(P2N − P0N) + P2N ,    (1.10)

where a is a constant that yields a gain in the magnitude of v3 (v3 = P3 − P2). The rules above are summarized in Table 1.1.

Table 1.1 Description of the actions to be taken due to the heuristic rules according to each
decision region

Actions  Description of the Actions                                          Regions

I        P3N is calculated according to (1.10) with a = a1.                  S1
II       P3N is calculated according to (1.10) with a = a2.                  S2, S5, S8
III      P3N is calculated according to (1.10) with a = a3.                  S3
IV       P3N takes the coordinates of the uppermost point of the             S4
         circumference that passes through the normalized points P0N,
         P1N and P2N.
V        P3N takes the coordinates of the rightmost point of the             S6, S7
         circumference that passes through the normalized points P0N,
         P1N and P2N.
VI       P3N takes the coordinates of the lowest point of the                S9
         circumference that passes through the normalized points P0N,
         P1N and P2N.

Once the normalized point P3N has been estimated by the heuristic rules, it must be denormalized to obtain the value P3, which is then used to restart the recurrent network.
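Inverting (1.9) gives z = m + (zN − a)(M − m)/(b − a), since (1.9) is algebraically equivalent to zN = a + (b − a)(z − m)/(M − m). A small illustrative sketch (the function name is an assumption):

```python
def denormalize(zN, m, M, a=0.0, b=1.0):
    """Invert (1.9): recover z from its normalized value zN,
    given the original minimum m and maximum M of the data."""
    return m + (zN - a) * (M - m) / (b - a)
```

For example, with m = 10, M = 30 and the default range [0, 1], a normalized value of 0.35 maps back to 17.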

1.3.2 Method of Tendency Based on the Dynamics in State-Space (TDSS)
To calculate a better point using the dynamics in state-space, two facts must be noted: first, the convergence of each variable depends on the convergence of the other variables; second, in state-space the variations in the state variables are mapped by taking one variable xi as reference, so that the curve is free to evolve in all directions, which is not the case in the space-time plane. Bearing these facts in mind and observing the convergence orbits in state-space shown in Fig. 1.2 and Fig. 1.3, we conclude that a better point in state-space must be located inside the concavity of the orbit of the system dynamics.
The calculation in state-space proceeds as follows: first we take one of the variables as reference (for example, x1(t)) and draw n − 1 complex planes (for a system with n variables), giving the planes x1(t)0x2(t), x1(t)0x3(t), ..., x1(t)0xn(t). Given the three points P2 = x(t), P1 = x(t − Δt) and P0 = x(t − 2Δt) provided by the network, we can create state vectors in each of the n − 1 complex planes. For example, for the plane x1(t)0x2(t), we have:


v 1 (t) = (x1 (t) + i · x2(t)) − (x1 (t − Δ t) + i · x2(t − Δ t)) and (1.11)



v 2 (t) = (x1 (t − Δ t) + i · x2(t − Δ t)) − (x1(t − 2Δ t) + i · x2(t − 2Δ t)). (1.12)
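Representing each point (x1, x2) as the complex number x1 + i·x2, (1.11)-(1.12) map directly onto built-in complex arithmetic. A minimal sketch; the function name and the example trajectory are illustrative, not from the chapter.

```python
def state_vectors(p0, p1, p2):
    """Build v1 and v2 of (1.11)-(1.12) in the plane x1-0-x2,
    representing each point (x1, x2) as the complex number x1 + i*x2.
    p2 is the newest point x(t), p0 the oldest x(t - 2*dt)."""
    z0, z1, z2 = (complex(*p) for p in (p0, p1, p2))
    v1 = z2 - z1   # (1.11): most recent displacement
    v2 = z1 - z0   # (1.12): previous displacement
    return v1, v2
```

The same construction is repeated in each of the n − 1 planes, pairing the reference variable x1 with every other variable.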

From these vectors, we carry out a rotation of the axes using (1.13) and (1.14), according to Fig. 1.7:

v1' = v1 exp(−iθ1)    (1.13)

v2' = v2 exp(−iθ1)    (1.14)

where θ1 is the angle of v1(t); and also a translation transformation using:

x1'(t − 2Δt) = −|v1|
x2'(t − 2Δt) = 0
x1'(t − Δt) = 0
x2'(t − Δt) = 0
x1'(t) = x1(t) − x1(t − 2Δt)
x2'(t) = x2(t) − x2(t − 2Δt)    (1.15)
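Since the state vectors are complex numbers, the rotation in (1.13)-(1.14) amounts to multiplication by exp(−iθ1), where θ1 is the phase of v1; the imaginary unit in the exponent is an assumption read into the printed formulas. A minimal sketch (function name illustrative):

```python
import cmath

def rotate_to_v1_axis(v1, v2):
    """Rotate both state vectors by -theta1 (the phase of v1),
    as in (1.13)-(1.14), so that v1' lies on the positive real axis."""
    theta1 = cmath.phase(v1)
    r = cmath.exp(-1j * theta1)
    return v1 * r, v2 * r
```

After the rotation, v1' is real and equal to |v1|, which is what makes the subsequent comparison of v2' against v1' (and the translation of (1.15)) straightforward.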

Fig. 1.7 Translation and rotation transformation in state-space

The rotation and translation transformations facilitate the analysis of the behavior of vector v2 relative to vector v1. Heuristic rules can then be applied to adjust the magnitude and angle of the state vectors, yielding a vector v3' in each of the n − 1 complex planes. Finally, a strategy is needed to determine the final value of the reference variable. An effective strategy is to add to the current value of the reference variable the average of the increments of this variable calculated over the complex planes.
Fig. 1.8 shows two pictures associated with two examples of sets of heuristic rules that can be used to produce vector v3'. In each picture, the leftmost point on the circumference is P0', the point fixed at the center of the circumference is P1', and the points marked with small circles on the circumference represent several possibilities for the point P2'. Finally, the results of the heuristic rules, the points P3', are marked with green circles. To obtain the final point P3, we apply the inverse translation and rotation transformations to P3', generating the appropriate value to restart the recurrent network.
Fig. 1.9 shows an example of the application of the heuristic rules to estimate a better point through the dynamics in state-space. The external curve represents the dynamics of the recurrent network without the heuristic rules, and the internal one represents the dynamics of the HIS (ANN and TDSS), which is based on the heuristic rules (TDSS). On the internal curve, the points marked with circles are the iteration points produced by the network, and the points marked with a plus sign are the points estimated by the heuristic rules (P3).

Fig. 1.8 Variations of the rules applied to points P0, P1, P2 to calculate the estimated point P3. [Two panels, (a) and (b), plotting x2'(t) against x1'(t).]

Fig. 1.9 Graph of the convergence orbit of the state variables x1 and x2: external curve - ANN; internal curve - HIS. [The plot of x2(t) against x1(t) marks the original orbit, the points computed by the recurrent NN, the points predicted by the heuristic rule (HIS-2), and the advanced orbit.]

By iteratively applying the ANN and the heuristic rules (HR), the system reduces the curvature of the orbit in state-space, jumping from one orbit to another until it reaches the solution of the problem (the equilibrium point of the recurrent network).
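The interplay between integration steps and heuristic jumps can be illustrated on a toy one-variable dynamic. Everything below is a made-up stand-in: an Euler step replaces the recurrent network's ODE integration, a fixed linear-extrapolation jump in the spirit of (1.10) replaces the rule set of Table 1.1, and the target, gain a = 20 and tolerance are arbitrary.

```python
def euler_step(x, h=0.01, target=5.0):
    # Stand-in for one ODE-integration step of the recurrent network:
    # dx/dt = target - x converges exponentially towards target.
    return x + h * (target - x)

def solve(x0, use_heuristic, a=20.0, tol=1e-3, target=5.0):
    """Count integration steps until convergence, with or without
    a heuristic jump P3 = P2 + a*(P2 - P0) after every two steps."""
    x, steps = x0, 0
    while abs(x - target) > tol and steps < 100000:
        p0 = x
        p1 = euler_step(p0)
        p2 = euler_step(p1)
        steps += 2
        if use_heuristic:
            # jump forward along the recent displacement, then let the
            # integrator resume from the predicted point (self-correcting)
            x = p2 + a * (p2 - p0)
        else:
            x = p2
    return steps
```

On this toy problem the heuristic variant converges in a small fraction of the integration steps needed by plain integration, mirroring the point-count reductions reported in Section 1.5.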

1.4 Case Studies

In order to test the proposed hybrid intelligent system, we chose mathematical programming problems previously solved in [16, 31, 32].
The following cases were solved using the HIS, which was implemented in MATLAB. To solve the differential equations involved in the problems, we implemented the Richardson extrapolation method of level four, which embeds the classical fourth-order Runge-Kutta method and uses a fixed integration step size [33].
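The integration scheme can be sketched generically. This is not the authors' MATLAB code; the "level four" structure is read here as one Richardson extrapolation combining a full RK4 step with two half steps, with weight 1/(2^4 − 1) appropriate for an order-4 method.

```python
import math

def rk4_step(f, t, y, h):
    """Classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def richardson_rk4_step(f, t, y, h):
    """One step of size h and two steps of size h/2, combined by
    Richardson extrapolation: y_R = y_half + (y_half - y_full)/15."""
    y_full = rk4_step(f, t, y, h)
    y_half = rk4_step(f, t + h / 2, rk4_step(f, t, y, h / 2), h / 2)
    return y_half + (y_half - y_full) / 15
```

On the test equation y' = −y the extrapolated step is markedly more accurate than a single RK4 step of the same size, which is why the extrapolation is worthwhile with a fixed step size.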

1.4.1 Case 1: Mathematical Linear Programming Problem - Four Variables
A classical linear programming problem [31] was used to choose the parameters
of the developed heuristic rules. In [31] the problem was solved to compare the
performance of several recurrent network models. In this case we deal with the
following problem LP1 :

(LP1)  min f(x) = −8x1 − 8x2 − 5x3 − 5x4
       s.t.  x1 + x3 = 40
             x2 + x4 = 60
             −5x1 + 5x2 ≤ 0
             2x1 − 3x2 ≤ 0
             x ≥ 0
             x ∈ R4    (1.16)
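The final cost of −740 reported later in Table 1.2 can be checked by hand: the vertex x* = (40, 40, 0, 20) (identified here for illustration, not stated in the text) is feasible for (1.16) and attains that cost.

```python
def lp1_cost(x):
    """Objective of (1.16)."""
    x1, x2, x3, x4 = x
    return -8 * x1 - 8 * x2 - 5 * x3 - 5 * x4

def lp1_feasible(x, tol=1e-9):
    """All constraints of (1.16), with a small numerical tolerance."""
    x1, x2, x3, x4 = x
    return (abs(x1 + x3 - 40) < tol and abs(x2 + x4 - 60) < tol
            and -5 * x1 + 5 * x2 <= tol and 2 * x1 - 3 * x2 <= tol
            and all(v >= -tol for v in x))

x_star = (40.0, 40.0, 0.0, 20.0)   # candidate optimal vertex (assumption)
```

Any other feasible point, e.g. (40, 30, 0, 30), gives a strictly larger cost, consistent with x* being the minimizer.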

1.4.2 Case 2: Mathematical Linear Programming Problem - Eleven Variables
With the heuristic rule parameters already defined in Case 1, we chose a larger-scale problem to compare the performance of the models: a minimum-cost flow problem (LP2) solved in [32]. In this problem there are several nodes (points), representing consumers and suppliers, which are connected through paths (arcs). The aim is to calculate the flow through all paths so as to minimize the total cost, computed as the sum over all arcs of the product of the cost and the flow carried by each arc. The problem can be represented as a graph, as shown in Fig. 1.10, where the amount of flow on a node cannot exceed the capacity of the node. A network is a set of elements called nodes and a set of elements called arcs, each arc eij being an ordered pair (i, j) of distinct nodes i and j. If eij is an arc, then node i is called the tail of eij and node j is called the head of eij. The directed graph shown in Fig. 1.10 is formed of 6 nodes and 11 arcs.

Fig. 1.10 A directed graph of a minimum cost flow problem. [Nodes 1-6 connected by the arcs e12, e13, e15, e23, e42, e43, e53, e54, e56, e62 and e64.]

The cost of each arc is represented by a vector c, its maximum flow capacity by a vector b, and the demands by a vector w. If wi ≤ 0, node i is a supplier (source); if wi > 0, node i is a consumer (sink). Suppose, for instance, that wT = [−9 4 17 1 −5 −8]. The matrix H is called the incidence matrix of the network. In general, the incidence matrix of a network with n nodes and m arcs has n rows and m columns; thus, our matrix H has size 6 × 11 and is formed as follows:
      [ −1 −1 −1  0  0  0  0  0  0  0  0 ]
      [  1  0  0 −1  1  1  0  0  0  0  0 ]
      [  0  1  0  1  0  0  1  1  0  0  0 ]
H =   [  0  0  0  0 −1  0  0 −1  1  1  0 ]    (1.17)
      [  0  0  1  0  0  0 −1  0 −1  0 −1 ]
      [  0  0  0  0  0 −1  0  0  0 −1  1 ]

Assuming there are no losses in the network, i.e., everything that is produced is consumed, the sum of the elements wi is zero. This condition makes the rows of H linearly dependent (LD): any row can be obtained as a linear combination of the other rows. To overcome this problem, we remove one row of H and the corresponding element of the column vector w. Here, the last row of H was removed, turning the matrix and the vector w into a truncated incidence matrix and a truncated vector, following [32].
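The linear dependence is easy to verify: every column of an incidence matrix contains exactly one +1 and one −1, so each column sums to zero and the removed row equals minus the sum of the remaining rows. A quick check on H of (1.17):

```python
# Incidence matrix H of (1.17), rows = nodes 1..6, columns = arcs.
H = [
    [-1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 1,  0,  0, -1,  1,  1,  0,  0,  0,  0,  0],
    [ 0,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0],
    [ 0,  0,  0,  0, -1,  0,  0, -1,  1,  1,  0],
    [ 0,  0,  1,  0,  0,  0, -1,  0, -1,  0, -1],
    [ 0,  0,  0,  0,  0, -1,  0,  0,  0, -1,  1],
]

# Each column sums to zero (+1 at the tail row, -1 at the head row,
# or vice versa depending on sign convention)...
col_sums = [sum(row[j] for row in H) for j in range(11)]

# ...so the last row is recoverable from the first five, which is why
# it can be dropped without losing information.
last_row = [-sum(H[i][j] for i in range(5)) for j in range(11)]
```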
      [ −1 −1 −1  0  0  0  0  0  0  0  0 ]
      [  1  0  0 −1  1  1  0  0  0  0  0 ]
H =   [  0  1  0  1  0  0  1  1  0  0  0 ]    (1.18)
      [  0  0  0  0 −1  0  0 −1  1  1  0 ]
      [  0  0  1  0  0  0 −1  0 −1  0 −1 ]

Problem LP2 has the following form:

(LP2)  min f(x) = 3x1 + 5x2 + x3 + x4 + 4x5 + x6 + 6x7 + x8 + x9 + x10 + x11
       s.t.  −x1 − x2 − x3 = 9
             −x1 − x4 + x5 + x6 = −4
             x2 + x4 + x7 + x8 = −17
             −x5 − x8 + x9 + x10 = −1
             x3 − x7 − x9 − x11 = 5
             x ≤ [2 10 10 6 8 7 9 9 10 8 6]T
             x ≥ 0
             x ∈ R11    (1.19)

1.4.3 Case 3: Mathematical Quadratic Programming Problem - Three Variables
As an example of quadratic programming (QP), the economic dispatch problem was solved as formulated in [16]. The aim is to minimize the total generation cost while meeting the demand of the power system. The formulation contemplates 3 thermal generators (n = 3) connected to a single load.
Defining fi as the generation cost of the i-th generation unit, xi as the power generated by the i-th unit, and w as the total power demand of the load, the limits xi,min and xi,max are set by the physical limitations of the i-th unit. The economic power dispatch is expressed as the QP:
(QP)  min f(x) = ∑i=1..n fi(xi)
      s.t.  ∑i=1..n xi − w = 0
            xmin ≤ x ≤ xmax
            x ∈ Rn    (1.20)

Data were obtained from [16]: xmin = [150 100 50]T in MW, xmax = [600 400 200]T in MW, w = 850 MW, and the following costs for the generator units:

f1(x1) = 561 + 7.92x1 + 0.00156x1²
f2(x2) = 310 + 7.85x2 + 0.00194x2²
f3(x3) = 78 + 7.95x3 + 0.00482x3²    (1.21)
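Because all three cost curves in (1.21) are quadratic and, at this demand, no generator limit turns out to be active, the dispatch can be cross-checked analytically with the classical equal-incremental-cost condition fi'(xi) = λ with ∑xi = w. This is a textbook check, independent of the neural network; variable names are illustrative.

```python
# Quadratic cost coefficients from (1.21): f_i(x) = a_i + b_i*x + c_i*x^2
b = [7.92, 7.85, 7.95]
c = [0.00156, 0.00194, 0.00482]
w = 850.0  # demand in MW

# f_i'(x_i) = b_i + 2*c_i*x_i = lambda  =>  x_i = (lambda - b_i)/(2*c_i);
# substituting into sum(x_i) = w gives lambda in closed form.
inv = [1.0 / (2 * ci) for ci in c]
lam = (w + sum(bi * ai for bi, ai in zip(b, inv))) / sum(inv)
x = [(lam - bi) * ai for bi, ai in zip(b, inv)]
```

The resulting dispatch is roughly x ≈ (392, 334, 124) MW, which lies strictly inside the bounds xmin and xmax, confirming that ignoring the limits was legitimate here.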

1.5 Simulations

The parameters chosen to simulate problems LP1 and LP2 were: an integration step size of 1e−3, a value of 100 for the parameter s of the neural network in both the first and the second phase, and a value of 1.1 for the parameter e of the network in the second phase. The main results of the LP1 and LP2 simulations are presented in Table 1.2.

Table 1.2 Simulation Results of LP1 and LP2

LP1 LP2
ANN HIS-1a HIS-2b ANN HIS-1a HIS-2b
No. points by the ANN (Phase 1) 8316 1000 2391 8670 2965 3099
No. points by the HR (Phase 1) - 331 795 - 986 1031
Total no. points (Phase 1) 8316 1331 3186 8670 3951 4130
Normalized computer processing time (Phase 1) 1.00 0.12 0.29 1.00 0.33 0.35
Time instant when the switch is closed (s) = t1 8.32 1.33 3.19 8.67 3.95 4.13
No. of calculated points by the ANN (Phase 2) 8607 8277 8316 7692 7251 7263
No. of calculated points in both phases 16923 9608 11502 16362 11202 11393
Initial cost (Phase 1) = f (x(0)) -260.00 -260.00 -260.00 0.00 0.00 0.00
Final cost (Phase 1) = f (x(t1 )) -741.77 -741.82 -741.78 55.67 55.65 55.63
Final cost (Phase 2) = f (x(tend )) -740.00 -740.00 -740.00 56.00 55.99 56.00
a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

The results shown in row 5 of Table 1.2 point out that both proposed hybrid systems (HIS-1 and HIS-2) were able to advance the dynamics of the simulated linear problems efficiently. This greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for problem LP1 the total number of points calculated by the ANN at the end of the first phase was 8316, i.e., 33264 ODEs were solved; using the HIS-1, only 1000 points, i.e., 4000 ODEs, were needed to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 88% compared to the ANN. For the LP2 problem, this rate was approximately 66%. The rates of computational-effort reduction for problems LP1 and LP2 when comparing the ANN to the HIS-2 were 71% and 64%, respectively.
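The percentages quoted above follow directly from the ANN point counts in Table 1.2, counting only the points actually integrated by the ANN (the heuristic-rule points are comparatively cheap). A quick check:

```python
def reduction(ann_points, his_ann_points):
    """Percent reduction in ODE integrations. Each integrated point
    costs n ODE evaluations, so the ratio of ANN point counts is
    what determines the saving."""
    return round(100 * (1 - his_ann_points / ann_points))
```

For example, `reduction(8316, 1000)` reproduces the 88% figure for LP1 with HIS-1, and the LP2 and HIS-2 figures follow the same way from the table.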
Figs. 1.11-1.13 present the simulation results for the LP1 problem and Figs. 1.14-1.16 for the LP2 problem.
The parameters chosen to simulate the QP problem were: an integration step size of 1e−2 and a value of 50 for the parameter s of the neural network in the first phase. The initial condition used was x(0) = [400 300 150]T.

Fig. 1.11 Dynamics of the problem LP1 obtained by the ANN with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.12 Dynamics of the problem LP1, obtained by the HIS-1 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

The results shown in row 5 of Table 1.3 point out that both proposed hybrid systems were able to advance the dynamics of the simulated quadratic problem efficiently. This greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for the QP problem the total number of points calculated by the ANN at the end of the first phase was 172629, i.e., 517887 ODEs were solved; using the HIS-1, only 8257 points, i.e., 24771 ODEs, were needed to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 95% compared to the

Fig. 1.13 Dynamics of the problem LP1, obtained by the HIS-2 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.14 Dynamics of the problem LP2, obtained by the ANN with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

ANN. For the HIS-2 on the QP problem, this rate was approximately 70%. Figs. 1.17-1.19 present the simulation results for the QP problem.
All case studies were carried out on the same computer; we therefore take the processing time of the ANN in phase 1 as the base for normalizing the processing time of the hybrid systems in that phase. We point out that the hybrid systems were not used in phase 2. We observed that for the LP1 case the HIS-1 took 12% of the base processing time and the HIS-2 took 29%; for the LP2 case the HIS-1 took 33% and the HIS-2 35%; and for the QP case the HIS-1 took 6% and the HIS-2 45%. It is important

Fig. 1.15 Dynamics of the problem LP2, obtained by the HIS-1 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.16 Dynamics of the problem LP2, obtained by the HIS-2 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

to note that the efficiency of the heuristic rules varies with the type of problem; in this work, the results showed that the HIS-1 yielded better performance, specifically on the LP1 and QP problems. We thus observed a decrease in processing time produced by the implemented heuristic rules, together with a reduction in the number of ODEs computed. These rules estimate the next value of each variable of the problem throughout the convergence. We highlight that, since the ODE solver also computes points during the application of the HIS, the proposed systems can correct themselves in the case of an incorrect estimate, which makes them resilient.

Table 1.3 Simulation Results of QP

QP
ANN HIS-1a HIS-2b
No. points by the ANN (Phase 1) 172629 8257 51900
No. points by the HR (Phase 1) - 2750 17299
Total no. points (Phase 1) 172629 11007 69199
Normalized computer processing time (Phase 1) 1.00 0.06 0.45
Time instant when the switch is closed (s) = t1 1726.29 110.07 691.99
Final cost (Phase 1) = f (x(t1 )) 22680.05 22680.05 22680.05
a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

Fig. 1.17 Dynamics of the problem QP, obtained by the ANN with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

1.6 Conclusion

In this paper, two Hybrid Intelligent Systems have been proposed. These systems combine the Maa and Shanblatt network with heuristic rules. The Maa and Shanblatt network is a two-phase recurrent neural network that provides the exact solution of linear and quadratic programming problems. Compared to conventional linear and nonlinear optimization techniques, the two-phase network formulation is advantageous because no matrix inversion is required. The main aim of the proposed HIS is to increase the speed of convergence towards the optimal point, which is guaranteed by the ANN. In the cases presented, the optimal convergence

Fig. 1.18 Dynamics of the problem QP, obtained by the HIS-1 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.19 Dynamics of the problem QP, obtained by the HIS-2 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

was reached, and the proposed systems combine both advantages: the simulation analyses show a reduction in computational effort of approximately 95% compared to the ANN in the QP case solved in this paper, with guaranteed optimal convergence and without matrix inversion. The proposed HIS has been developed with a view to solving large-scale operational planning problems in future work, that is, large-scale economic power dispatch with the scheduling of hydro, thermal and wind power plants to minimize the overall production cost while satisfying the load demand in the mid-term operation planning of hydrothermal generation systems. In future work, we will also propose combinations of these heuristic rules and/or their application in the second phase of the Maa and Shanblatt method.

References
1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computa-
tional abilities. Proc. Natl. Acad. Sci. USA 79, 2552–2558 (1982)
2. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall, USA
(1999)
3. Hopfield, J.J.: Neurons with graded response have collective computational properties
like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092 (1984)
4. Hopfield, J.J.: Learning algorithms and probability distributions in feed-forward and
feed-back networks. Proc. Natl. Acad. Sci. USA 84, 8429–8433 (1987)
5. Hopfield, J.J.: The effectiveness of analogue neural network hardware. Network: Com-
putation in Neural Systems 1(1), 27–40 (1990)
6. Hopfield, J.J., Feinstein, D.I., Palmer, R.G.: Unlearning has a stabilizing effect in collec-
tive memories. Nature 304, 158–159 (1983)
7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problem.
Biological Cybernetics 52, 141–152 (1985)
8. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233(8),
625–633 (1986)
9. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter,
Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. on Circuits and
Systems 33(5), 533–541 (1986)
10. Ludemir, T.B., Braga, A.P., Carvalho, A.C.P.L.F.: Redes Neurais Artificiais: Teoria e Aplicações, 1st edn. LTC - Livros Técnicos e Científicos Editora S.A., Rio de Janeiro (2000)
11. Pyne, I.B.: Linear Programming on an electronic analogue computer. Trans. AIEE. Part
I (Comm. & Elect.) 75, 139–143 (1956)
12. Kennedy, M.P., Chua, L.O.: Unifying Tank and Hopfield Linear Programming Circuit
and the Canonical Nonlinear Programming Circuit of Chua and Lin. IEEE Trans. on
Circuits and Systems 34(2), 210–214 (1987)
13. Chiu, C., Maa, C.Y., Shanblatt, M.A.: An artificial neural network algorithm for dynamic
programming. Int. J. Neural Syst. 1(3), 211–220 (1990)
14. Maa, C.Y., Shanblatt, M.A.: A Two-Phase Optimization Neural Network. IEEE Trans-
actions on Neural Networks 3(6), 1003–1009 (1992)
15. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans.
on Circuits and Systems 35(5), 210–220 (1988)
16. Maa, C.Y., Shanblatt, M.A.: Linear and Quadratic Programming Neural Network Anal-
ysis. IEEE Transactions on Neural Networks 3(4), 580–594 (1992)
17. Chiu, C., Maa, C.Y., Shanblatt, M.A.: Energy Function Analysis of Dynamic Program-
ming Neural Networks. IEEE Transactions on Neural Networks 2(4) (July 1991)
18. Xia, Y.S.: A New Neural Network for Solving Linear and Quadratic Programming Prob-
lems. IEEE Transactions on Neural Networks 7(6), 1544–1547 (1996)
19. Tao, Q., Cao, J.D., Xue, M.S., Qiao, H.: A High Performance Neural Network for Solv-
ing Nonlinear Programming Problems with Hybrid Constraints. Phys. Lett. A 288(2),
88–94 (2001)
20. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Op-
timization Neural Networks. IEEE Transactions on Neural Networks 9(6), 1331–1343
(1998)

21. Xia, Y.S., Wang, J.: A Recurrent Neural Network for Solving Nonlinear Convex Pro-
grams Subject to Linear Constraints. IEEE Transactions on Neural Networks 16(2),
379–386 (2005)
22. Dieu, V.N., Ongsakul, W.: Enhanced Merit Order and Augmented Lagrange Hop-
field Network for Hydrothermal Scheduling. Electrical Power and Energy Systems 30,
93–101 (2008)
23. Naresh, R., Dubey, J., Sharma, J.: Two-phase Neural Network Based Modeling Frame-
work of Constrained Economic Load Dispatch. IEE Proc. Gener. Transm. Distrib. 151(3)
(May 2004)
24. Aquino, R.R.B.: Recurrent Artificial Neural Networks: an application to optimization of
hydro thermal power systems (in Portuguese), Ph.D. Thesis, COPELE/UFPE, Campina
Grande, Brazil (January 2001)
25. Rosas, P., Aquino, R.R.B., et al.: Study of Impacts of a Large Penetration of Wind Power
and Distributed Power Generation as a Whole on the Brazilian Power System. In: Euro-
pean Wind Energy Conference (EWEC), London (November 2004)
26. Witten, I.H., Frank, E.: Data Mining Practical Machine Learning Tools and Techniques,
2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Mitra, S., Mitra, M., Chaudhuri, B.B.: Pattern Defined Heuristic Rules and Directional
Histogram Based Online ECG Parameter Extraction. Measurement 42, 150–156 (2009)
28. Tuncel, G.: A Heuristic Rule-Based Approach for Dynamic Scheduling of Flexible Man-
ufacturing Systems. In: Levner, E. (ed.) Multiprocessor Scheduling: Theory and Appli-
cations, December 2007, p. 436. Itech Education and Publishing, Vienna (2007)
29. Baykasoglu, A., Ozbakir, L., Dereli, T.: Multiple Dispatching Rule Based Heuristic for
Multi-Objective Scheduling of Job Shops Using Tabu Search. In: Proceedings of MIM
2002: 5th International Conference on Managing Innovations in Manufacturing (MIM)
Milwaukee, Wisconsin, USA, September 9-11, pp. 1–6 (2002)
30. Idris, N., Baba, S., Abdullah, R.: Using Heuristic Rules from Sentence Decomposition
of Experts Summaries to Detect Students Summarizing Strategies. International Journal
of Human and Social Sciences 2, 1 (Winter 2008), www.waset.org
31. Zak, S.H., Upatising, V., Hui, S.: Solving Linear Programming Problems with Neural
Networks: A Comparative Study. IEEE Transactions on Neural Networks 6(1), 94–104
(1995)
32. Chvatal, V.: Linear Programming. W.H. Freman and Company, New York (1983)
33. Lastman, G.J., Sinha, N.K.: Microcomputer-based numerical methods for science and engineering. Saunders College Publishing, USA (1988)
Chapter 2
A Novel Optimization Algorithm Based on
Reinforcement Learning

Janusz A. Starzyk, Yinyin Liu, and Sebastian Batog

Abstract. In this chapter, an efficient optimization algorithm is presented for problems with hard-to-evaluate objective functions. It uses the reinforcement learning principle to determine the particle moves in the search for the optimum. A model of successful actions is built, and future actions are based on past experience. The step increment combines exploitation of the known search path with exploration for improved search directions. The algorithm requires no prior knowledge of the objective function, nor any special characteristics of that function. It is simple, intuitive, and easy to implement and tune. The optimization algorithm was tested on several multi-variable functions and compared with other widely used random-search optimization algorithms. Furthermore, the training of a multi-layer perceptron to find a set of optimized weights is treated as an optimization problem, and the optimized multi-layer perceptron is applied to classification of the Iris database. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.

2.1 Introduction
Optimization is the process of finding the maximum or the minimum function value within given constraints by changing the values of its multiple variables. It can be essential for solving complex engineering problems in such areas as computer science, aerospace, machine intelligence applications, etc. When the analytical relation
Janusz A. Starzyk · Yinyin Liu
Ohio University, School of Electrical Engineering and Computer Science, U.S.A.
e-mail: starzyk@bobcat.ent.ohiou.edu,yliu@bobcat.ent.ohiou.edu
Sebastian Batog
Silesian University of Technology, Institute of Computer Science, Poland
e-mail: sebastian.batog@gmail.com

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 27–47.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

between the variables and the objective function value is explicitly known, analyti-
cal methods, such as Lagrange multiplier methods [1], interior point methods [18],
Newton methods [30], gradient descent methods [25], etc., can be applied. How-
ever, in many practical applications, analytical methods do not apply. This happens
when the objective functions are unknown, when relations between variables and
function value are not given or difficult to find, when the functions are known while
their derivatives are not applicable, or when the optimum value of function cannot
be verified. In these cases, iterative search processes are required to find the function
optimum.
Direct search algorithms [10] contain a set of optimization methods that do not
require derivatives and do not approximate either the objective functions or their
derivatives. These algorithms find locations with better function values following a
search strategy. They only need to compare the objective function values in succes-
sive iterative steps to make the move decision. Within the category of direct search,
distinctions can be made among three classes including pattern search methods [28],
simplex methods [6], and adaptive sets of search directions [23]. In pattern search
methods, the variables of the function are varied by steps of predetermined magnitude, or the step sizes are reduced by the same factor [15]. Simplex methods construct a simplex in ℜN using N+1 points and use the simplex to drive the
search for optimum. The methods with adaptive sets of search directions, proposed
by Rosenbrock [23] and Powell [21], construct conjugate directions using the infor-
mation about the curvature of the objective function during the search.
In order to avoid local minima, random search methods are developed utilizing
randomness in setting the initial search points and other search parameters like the
search direction or the step size. In Optimized Step-Size Random Search (OSSRS)
[24], the step size is determined by fitting a quadratic function for the optimized
function values in each of the random directions. The random direction is generated
with a normal distribution of a given mean and standard deviation. Monte-Carlo optimization adopts randomness in the search process to create opportunities to escape from local minima. Simulated Annealing (SA) [13] is a typical Monte-Carlo algorithm. It exploits the analogy between the search for a minimum in the optimization problem and the annealing process in which a metal cools
and stabilizes into a minimum energy crystalline structure. It accepts the move to a
new position with a worse function value with a probability controlled by the “temperature” parameter, and this probability decreases along the “cooling process”. SA can deal with highly nonlinear, chaotic problems provided that the cooling
schedule and other parameters are carefully tuned.
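The acceptance rule just described can be sketched in a few lines. The following is a minimal simulated annealing loop for minimization; the function names, the geometric cooling schedule, and all parameter values are illustrative assumptions rather than the settings used in the chapter's experiments.

```python
import math
import random

def sa_accept(delta, temperature):
    # Metropolis rule: always accept an improvement; accept a worsening
    # move with probability exp(-delta / T), which shrinks as T cools.
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)

def simulated_annealing(f, x0, neighbor, t0=1.0, cooling=0.95, steps=1000):
    # Minimal SA loop; `neighbor` proposes a random nearby point.
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        y = neighbor(x)
        fy = f(y)
        if sa_accept(fy - fx, t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling  # geometric "cooling process"
    return best, fbest
```

For example, minimizing (x − 2)² with Gaussian neighbor moves settles into the neighborhood of x = 2 once the temperature has decayed.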
Particle Swarm Optimization (PSO) [11] is a population-based evolutionary com-
putational algorithm. It exploits the cooperation within the solution population in-
stead of the competition among them. At each iteration in PSO, a group of search
particles make moves in a mutually coordinated fashion. The step size of a particle is
a function of both the best solution found by that particle and the best solution found
so far by all the particles in the group. The use of a population of search particles
and the cooperation among them enable the algorithm to evaluate function values in
a wide range of variables in the input space and to find the optimum position. Each

particle only remembers its best solution and the global best solution of the group
to determine its step sizes.
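The move rule described in this paragraph can be sketched as follows. This is a minimal PSO for minimization; the inertia and acceleration coefficients w, c1 and c2 are conventional textbook values chosen for illustration, not parameters taken from the chapter.

```python
import random

def pso_minimize(f, dim, bounds, n_particles=20, iters=200,
                 w=0.7, c1=1.5, c2=1.5):
    # Each particle remembers only its own best position (pbest) and the
    # swarm's best (gbest); its step is a weighted combination of both.
    lo, hi = bounds
    pos = [[random.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            v = f(pos[i])
            if v < pbest_val[i]:          # update the particle's memory
                pbest[i], pbest_val[i] = pos[i][:], v
                if v < gbest_val:         # and, if needed, the group's
                    gbest, gbest_val = pos[i][:], v
    return gbest, gbest_val
```

On a simple convex function, such as the 2-D sphere, this sketch converges to the global optimum reliably.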
Generally, during the course of search, a sequence of decisions on the step sizes is
made and a number of function values are obtained in these optimization methods.
In order to implement an efficient search for the optimum point, it is desired that
such historical information can be utilized in the optimization process.
Reinforcement Learning (RL) [27] is a type of learning process to maximize cer-
tain numerical values by combining exploration and exploitation and using rewards
as learning stimuli. In the reinforcement learning problem, the learning agent per-
forms the experiments to interact with the unknown environment and accumulate
the knowledge during this process. It is a trial-and-error exploratory process with
the objective to find the optimum action. During this process, an agent can learn
to build the model of the environment to instruct its search, so that the agent can
predict the environment’s response to its actions and choose the most useful actions
for its objectives based on its past exploring experience.
Surrogate-based optimization refers to the idea of speeding up the optimization process by using surrogates for the objective and constraint functions. The surrogates also allow for the optimization of problems with non-smooth or noisy responses, and can provide insight into the nature of the design space. The max-min SAGA approach [20] searches for designs that have the best worst-case performance in the presence of parameter uncertainty. By leveraging a trust-region approach which uses computationally cheap surrogate models, it allows for the possibility of achieving robust design solutions on a limited computational budget.
Another example of a surrogate based optimization is the surrogate assisted
Hooke-Jeeves (SAHJA) algorithm [8] which can be used as a local component of a
global optimization algorithm. This local searcher uses the Hooke-Jeeves method,
which performs its exploration of the input space intelligently employing both the
real fitness and an approximated function.
The idea of building knowledge about an unknown problem through exploration can be applied to optimization problems. To find the optimum of an unknown
multivariable function, an efficient search procedure can be performed using only
historical information from conducted experiments to expedite the search. In this
chapter, a novel and efficient optimization algorithm based on reinforcement learn-
ing is presented. This algorithm uses simple search operators and will be called
reinforcement learning optimization (RLO) in the later sections. It does not require
any prior knowledge of the objective function or function’s gradient information,
nor does it require any characteristics of the objective function. In addition, it is
conceptually very simple and easy to implement. This approach to optimization
is compatible with the neural networks and learning through interaction, thus it is
useful for systems of embodied intelligence and motivated learning as presented in
[26]. The following section presents the RLO method and illustrates it within several
machine learning applications.

2.2 Optimization Algorithm


2.2.1 Basic Search Procedure
An N-variable optimization objective function

$$V = f(p_1, p_2, \ldots, p_N), \qquad p_1, p_2, \ldots, p_N, V \in \Re,$$

could have several local minima and several global minima $V_{opt_1}, \ldots, V_{opt_N}$. It is
desired that the search process, initiated from a random point, finds a path to the
global optimum point. Unlike particle swarm optimization [11], this process can be
performed with a single search particle that learns how to find its way to the opti-
mum point. It does not require the cooperation among a group of particles, although
implementing the cooperation among several search particles may further enhance
the search process in this method.
At each point of the search, the search particle intends to find a new location
with a better value within a searching range around it and then determines the direction and the step size for the next move. It tries to reach the optimum through a weighted random search of each variable (coordinate). The step size of the search in each variable is randomly generated with its own probability density function. These functions are gradually learned during the search process. It is expected that at the later stages of the search, the probability density functions are approximated for each variable, and thus the stochastically randomized path from the start point to the minimum point of the function is learned.
The step sizes of all the coordinates determine the center of the new searching
area and the standard deviations of the probability functions determine the size of
the new searching area around the center. In the new searching area, several locations $P_S$ are randomly generated. If there is a location p′ with a better value than the current one, the search operator moves to it. From this new location, new step sizes and a new searching range are determined, so that the search for the optimum continues. If there is no point in the current searching area with a better value that the search particle can move to, further sets of random points are generated until no improvement is obtained after several, say M, trials. Then the searching area size and step sizes are modified in order to find a better function value. If no better value is found after K trials of generating different searching areas, or if the proposed stopping criterion is met, we can claim that the optimum point has been found. The algorithm for searching for the minimum point is schematically shown in Figure 2.1.
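The loop just outlined can be sketched as follows. This is a deliberately simplified single-particle version: the enlarge-area/reduce-step adjustment introduced later (Sec. 2.2.4) is collapsed here into a single contraction of the sampling spread, and all names and parameter values are illustrative.

```python
import random

def basic_search(f, x0, sigma0, n_points=10, M=5, K=20,
                 shrink=0.5, max_batches=500):
    # Sample `n_points` random locations around the current point and
    # move only if one of them improves the function value; after M
    # failed batches contract the searching area, and stop after K
    # consecutive contractions (or when a fixed budget runs out).
    x, fx = list(x0), f(x0)
    sigma = list(sigma0)
    failures = 0
    for _ in range(max_batches):
        improved = False
        for _ in range(M):
            cands = [[xi + random.gauss(0.0, si) for xi, si in zip(x, sigma)]
                     for _ in range(n_points)]
            vals = [f(c) for c in cands]
            v = min(vals)
            if v < fx:                       # move only on improvement
                x, fx = cands[vals.index(v)], v
                improved = True
                break
        if improved:
            failures = 0
        else:                                # contract the searching area
            sigma = [shrink * s for s in sigma]
            failures += 1
            if failures >= K:
                break
    return x, fx
```

On a smooth test function this sketch makes steady progress toward the optimum, although it lacks the model learning that distinguishes the full RLO method.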

2.2.2 Extracting Historical Information by Weighted Optimized


Approximation
After the search particle makes a sequence of n moves, the step sizes of these moves
$dp_t^i$ (t = 1, 2, ..., n; i = 1, 2, ..., N) are available for learning. These historical steps
have made the search particle move towards better values of the objective function
and hopefully get closer to the optimum location. In this sense, these steps are the

Fig. 2.1 The algorithm of RLO searching for the minimum point

successful actions during the trial. It is proposed that the successful actions which
result in a positive reinforcement (as the step sizes of each coordinate) follow a
function of the iterative steps t, as in (2.1), where $dp^i$ represents the step size on the i-th coordinate and $f^i(t)$ is the function for coordinate i.

$$dp^i = f^i(t) \qquad (i = 1, 2, \ldots, N), \tag{2.1}$$

These unknown functions $f^i(t)$ can be approximated, for example, using polynomials through the least-squared fit (LSF) process:

$$\begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^B \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^B \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix} = \begin{Bmatrix} dp_1^i \\ \vdots \\ dp_n^i \end{Bmatrix} \tag{2.2}$$

In (2.2), the step sizes from $dp_1^i$ to $dp_n^i$ are the step sizes on a certain coordinate during n steps and are fitted as unknown function values using polynomials of order B. The polynomial coefficients $a_0$ to $a_B$ can be obtained and will represent the function $f^i(t)$ used to estimate $dp^i$:

$$dp^i = \sum_{j=0}^{B} a_j t^j. \tag{2.3}$$
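Equations (2.2) and (2.3) amount to an ordinary polynomial least-squares fit of the step-size history against the iteration index, which NumPy solves directly. The quadratic history below is made-up illustrative data, not output from the chapter's experiments.

```python
import numpy as np

# Historical step sizes dp_t for one coordinate over n = 8 iterations.
t = np.arange(1, 9, dtype=float)
dp = 0.5 * t**2 - 2.0 * t + 1.0

# Build the Vandermonde matrix of (2.2) and solve for a_0..a_B in the
# least-squares sense.
B = 2
V = np.vander(t, B + 1, increasing=True)   # columns: 1, t, t^2
a, *_ = np.linalg.lstsq(V, dp, rcond=None)

# Predict the next step size with (2.3): dp(t) = sum_j a_j * t^j.
dp_next = float(sum(a[j] * 9.0**j for j in range(B + 1)))
```

Since the data here are exactly quadratic, the recovered coefficients are (1, −2, 0.5) and the predicted step size for t = 9 is 23.5.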

Using polynomials for function approximation could be easy and efficient. How-
ever, considering the characteristic of optimization problems, we have two concerns.
First, in order to generate a good approximation while avoiding overfitting, a proper
order of polynomials must be selected. In the optimized approximation algorithm
(OAA) presented in [17], the goodness of fit is determined by the so-called signal-
to-noise ratio figure (SNRF). Based on SNRF, an approximation stopping criterion

was developed. Using a certain set of basis functions for approximation, the error
signal, computed as the difference between the approximated function and the sam-
pled data, can be examined by SNRF to determine how much useful information it contains. The SNRF for the error signal, denoted as $SNRF_e$, is compared to the pre-calculated SNRF for white Gaussian noise (WGN), denoted as $SNRF_{WGN}$. If $SNRF_e$ is higher than $SNRF_{WGN}$, more basis functions should be used to improve the learning. Otherwise, the error signal shows the characteristic of WGN and should not be reduced any further, to avoid fitting the noise; the obtained approximation is then the optimum function. Such a process can be applied to determine the proper order of the polynomial.
The second concern is that in the case of reinforcement learning, the knowl-
edge about originally unknown environment is gradually accumulated throughout
the learning process. The information that the learning system obtains at the be-
ginning of the process is mostly based on initially random exploration. During the
process of interaction, the learning system collects the historical information and
builds the model of the environment. The model can be updated after each step of
interaction. The decisions made at the later stages of the interaction are more based
on the built model rather than a random exploration. This means that the recent re-
sults are more important and should be weighted more heavily than the old ones.
For example, the weights applied can be exponentially increasing from the initial trials to the recent ones, as

$$w_t = \frac{\alpha^t}{n} \qquad (t = 1, 2, \ldots, n), \tag{2.4}$$

where we define $\alpha^n = n$, i.e. $\alpha = n^{1/n}$. As a result, the weights lie in the half-open interval (0, 1], and the weight is 1 for the most recent sample. Applying the weights in the LSF, we
have the weighted least-squared fit (WLSF), expressed as follows:
$$\begin{bmatrix} 1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \cdots & t_1^B w_1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 \cdot w_n & t_n w_n & t_n^2 w_n & \cdots & t_n^B w_n \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix} = \begin{Bmatrix} dp_1 w_1 \\ \vdots \\ dp_n w_n \end{Bmatrix} \tag{2.5}$$

Due to the weights applied to the given samples, the approximated function will fit
to the recent data better than to the old ones.
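The weighting of (2.4) and (2.5) can be realized by scaling each row of the Vandermonde system by $w_t$ before solving. A sketch on illustrative data (the sine-shaped history is ours):

```python
import numpy as np

n, B = 10, 2
t = np.arange(1, n + 1, dtype=float)
dp = np.sin(0.3 * t)                 # illustrative step-size history

# Weights of (2.4): w_t = alpha^t / n with alpha^n = n, so w_n = 1.
alpha = n ** (1.0 / n)
w = alpha ** t / n

# Weighted LSF of (2.5): scale each row of the Vandermonde system by w_t.
V = np.vander(t, B + 1, increasing=True)
a_w, *_ = np.linalg.lstsq(V * w[:, None], dp * w, rcond=None)

# For comparison, the unweighted LSF of (2.2).
a_u, *_ = np.linalg.lstsq(V, dp, rcond=None)

# Residual norms measured under the weights of (2.4).
res_w = np.linalg.norm((V @ a_w - dp) * w)
res_u = np.linalg.norm((V @ a_u - dp) * w)
```

By construction $w_n = \alpha^n/n = 1$, and since the weighted solution minimizes the weighted residual norm, its weighted residual can never exceed that of the unweighted fit; this is the sense in which the approximation follows the recent data more closely.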
Utilizing the concept of OAA to obtain optimized WLSF, the SNRF for the error
signal or WGN has to be estimated considering the sample weights. In the original
OAA for the one-dimensional problem [17], the SNRF for the error signal was calculated as

$$SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})} \tag{2.6}$$

where C represents the correlation calculation, $e_j$ represents the error signal (j = 1, 2, ..., n), and $e_{j-1}$ represents the (circular) shifted version of $e_j$. The characteristics

of SNRF for WGN, expressed through the average value and the standard deviation, can be estimated from Monte-Carlo simulation as (see the derivation in [17])

$$\mu_{SNRF_{WGN}}(n) = 0 \tag{2.7}$$

$$\sigma_{SNRF_{WGN}}(n) = \frac{1}{\sqrt{n}}. \tag{2.8}$$
Then the threshold, which determines whether $SNRF_e$ shows the characteristic of $SNRF_{WGN}$ and the fitting error should not be further reduced, is

$$th_{SNRF_{WGN}}(n) = \mu_{SNRF_{WGN}}(n) + 1.7\,\sigma_{SNRF_{WGN}}(n). \tag{2.9}$$

For the weighted approximation, the SNRF for the error signal is calculated as

$$SNRF_e = \frac{C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j,\; e_j \cdot w_j) - C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}. \tag{2.10}$$

In Fig. 2.2(a), $\sigma_{SNRF_{WGN}}(n)$ from a 200-run Monte-Carlo simulation is shown on a logarithmic scale. The $\sigma_{SNRF_{WGN}}(n)$ can be estimated as

$$\sigma_{SNRF_{WGN}}(n) = \frac{2}{\sqrt{n}}. \tag{2.11}$$

It is found that the 5% significance level can be approximated by the average value plus 1.5 times the standard deviation for an arbitrary n. Fig. 2.2(b) illustrates the histogram of $SNRF_{WGN}$ with $2^{16}$ samples, as an example. The threshold in this case of a dataset with $2^{16}$ samples can be calculated as $\mu + 1.5\sigma = 0 + 1.5 \times 0.0078 = 0.0117$.
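The weighted SNRF of (2.10) and the threshold implied by (2.11) can be sketched as follows; reading the correlation C as a plain dot product is our assumption about the intended computation.

```python
import numpy as np

def snrf_weighted(e, w):
    # SNRF of (2.10): dot-product correlation of the weighted error
    # signal with its circularly shifted version, over the remainder.
    ew = e * w
    ew_shift = np.roll(ew, 1)            # e_{j-1}: circular shift
    c_auto = float(np.dot(ew, ew))
    c_shift = float(np.dot(ew, ew_shift))
    return c_shift / (c_auto - c_shift)

def wgn_threshold(n):
    # Threshold for the weighted case: mu + 1.5*sigma with mu = 0 and
    # sigma = 2/sqrt(n), per (2.11).
    return 1.5 * 2.0 / np.sqrt(n)
```

For instance, `wgn_threshold(2**16)` evaluates to 1.5 · 2/256 ≈ 0.0117, matching the value computed in the text, while the SNRF of a pure white-noise signal stays close to zero.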
Therefore, to obtain an optimized weighted approximation in the one-dimensional case, the following algorithm is performed.

Optimized weighted approximation algorithm (OWAA)


Step (2.1). Assume that an unknown function F, with input space $t \subset \Re$, is described by n training samples $dp_t$ (t = 1, 2, ..., n).
Step (2.2). The signal detection threshold is pre-calculated for the given number of samples n based on $SNRF_{WGN}$. For a one-dimensional problem,

$$th_{SNRF_{WGN}}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}.$$

Step (2.3). Take a set of basis functions, for example, polynomials of order from 0
up to order B.
Step (2.4). Use these B+1 basis functions to obtain the approximated function

$$\hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t) \qquad (t = 1, 2, \ldots, n). \tag{2.12}$$

Fig. 2.2 Characteristic of SNRF for WGN in weighted approximation

Step (2.5). Calculate the approximation error signal

$$e_t = dp_t - \hat{dp}_t \qquad (t = 1, 2, \ldots, n). \tag{2.13}$$

Step (2.6). Determine SNRF of the error signal using (2.10).


Step (2.7). Compare $SNRF_e$ with $th_{SNRF_{WGN}}$. If $SNRF_e$ is equal to or less than $th_{SNRF_{WGN}}$, or if B exceeds the number of samples, stop the procedure; in this case $\hat{F}$ is the optimized approximation. Otherwise, add one basis function (in this example, increase the order of the approximating polynomial to B+1) and repeat Steps (2.4)-(2.7).
Using the above algorithm, the proper order of the polynomial is determined, extracting the useful information (but not the noise) from the historical data. Moreover, the extracted information fits the recent results better than the old ones.
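Steps (2.1) to (2.7) combine into a short order-selection loop: keep adding basis functions until the SNRF of the weighted error signal drops below the WGN threshold. A self-contained sketch (the dot-product correlation and the synthetic data used to exercise it are our assumptions):

```python
import numpy as np

def owaa_fit(t, dp, w, max_order=10):
    # OWAA sketch: grow the polynomial order until the SNRF of the
    # weighted fitting error falls below the WGN threshold 1.5*2/sqrt(n);
    # returns the selected order and its coefficients.
    n = len(t)
    th = 1.5 * 2.0 / np.sqrt(n)                    # Step (2.2)
    for B in range(max_order + 1):                 # Steps (2.3)-(2.7)
        V = np.vander(t, B + 1, increasing=True)
        a, *_ = np.linalg.lstsq(V * w[:, None], dp * w, rcond=None)
        e = (dp - V @ a) * w                       # weighted error, (2.13)
        e_shift = np.roll(e, 1)
        c_auto, c_shift = float(np.dot(e, e)), float(np.dot(e, e_shift))
        snrf_e = c_shift / (c_auto - c_shift)      # (2.10)
        if snrf_e <= th or B + 1 >= n:
            return B, a
    return max_order, a
```

On data generated from a noisy quadratic, the loop keeps increasing the order while the residual is still structured and stops once the residual looks like white noise.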
We illustrate this process of learning historical information by considering a 2-
variable function as an example.

Example

The function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$ has several local minima, but only one global minimum, as shown in Fig. 2.3. In the process of
interaction, the historical information after each iteration is collected. The historical
step sizes of 2 coordinates are separately approximated, as shown in the Fig. 2.4 (a)
and 2.4 (b). The step sizes of two coordinates are approximated by quadratic poly-
nomials which are determined by OWAA and the coefficients of polynomials are
obtained using WLSF. In Fig. 2.4, the approximated functions are compared with
the quadratic polynomials whose coefficients are obtained from LSF. Again, it is

observed that, the function obtained using WLSF is fitted closer to the data in later
iterations than the function obtained using LSF.

Fig. 2.3 A 2-variable function V (p1 , p2 )

Fig. 2.4 Function approximation for historical step sizes

The level of the approximation error signal $e_t$ for the step sizes of a certain coordinate $dp^i$, which is the difference between the observed sampled data and the approximated function, can be measured by its standard deviation, as shown in (2.14):

$$\sigma_{p^i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2} \tag{2.14}$$

This standard deviation will be called the approximation deviation in the follow-
ing discussion. It represents the maximum deviation of the location of the search
particle from the prediction by the approximated function in the unknown function
optimization problem.

2.2.3 Predicting New Step Sizes


The approximated functions will be used to determine the step sizes for the next iteration, as shown in (2.15) and in Fig. 2.5 along with the approximated functions.

$$dp_{t+1}^i = f^i(t+1) \tag{2.15}$$

Fig. 2.5 Prediction of the step sizes for the next iteration

The step size functions are the model of environment that the learning system builds
during the process of interaction based on historical information. The future step
size determined by such model can be employed as exploitation of the existing
model. However, such model built during the learning process cannot be treated
as exact. Besides exploitation which best utilizes the obtained model, exploration
is desired to a certain degree in order to improve the model and discover better
solutions. The exploration can be implemented using Gaussian random generator
(GRG). As a good trade-off between exploitation and exploration is needed, we pro-
pose to use the step sizes for the next iteration determined by the step size functions
as the mean value and the approximation deviation as the standard deviation of the
random generator. Gaussian random generators give several random choices of the
step sizes. Effectively, the determined step sizes of multiple coordinates generate
the center of the searching area, and the size of the searching range is determined
by the standard deviations of GRG for the coordinates. The multiple random values
generated by GRG for each coordinate effectively create multiple locations within
the searching area. The objective function values of these locations will be com-
pared and the location with the best value, called current best location, will be
chosen as the place from which the search particle will continue searching in the
next iteration. Therefore, the actual step sizes are calculated using the distance from
the “previous best location” to the “current best location”. The actual step sizes will
be added in the historical step sizes and used to update the model of the unknown
environment.
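The exploitation/exploration mechanism described above can be sketched as follows: the polynomial model predicts the next step sizes, which become the mean of a Gaussian random generator (GRG) whose spread is given by the approximation deviations. The function and argument names are illustrative.

```python
import numpy as np

def propose_candidates(coeffs, sigmas, t_next, n_points=10, rng=None):
    # `coeffs` holds one low-to-high polynomial coefficient vector per
    # coordinate; evaluating it at t_next exploits the learned model,
    # while the Gaussian spread around it provides the exploration.
    if rng is None:
        rng = np.random.default_rng()
    means = np.array([np.polynomial.polynomial.polyval(t_next, c)
                      for c in coeffs])
    # n_points random step vectors: center = predicted steps (mean),
    # searching-range size = approximation deviations (std).
    return rng.normal(loc=means, scale=sigmas, size=(n_points, len(coeffs)))
```

The best of the resulting candidate locations becomes the "current best location", and the step actually taken is fed back into the model, as described in the text.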

Several locations of the search particle in this approach are illustrated in


Fig. 2.6 using a 2-variable function as an example. The search particle was located at the previous best location $p_{prev}(p_{prev}^1, p_{prev}^2)$, and the previous step size was found as $dp_{prev}(dp_{prev}^1, dp_{prev}^2)$ after the current best location $p(p_1, p_2)$ was found as the best location in the previous searching area (an area containing $p(p_1, p_2)$, not shown in the figure). At the current best position $p(p_1, p_2)$, using the environment model built with the historical step sizes, the current step size is determined to be $dp^1$ on coordinate 1 and $dp^2$ on coordinate 2, so that the center of the searching area is determined. The approximation deviations of the two coordinates, $\sigma_{p^1}$ and $\sigma_{p^2}$, give the size of the searching range. Within the searching range, several random points are generated in order to find a better position to which the search operator will move.

Fig. 2.6 Step sizes and searching area

2.2.4 Stopping Criterion


The search particle moves from every “previous best location” to the “current best location”, and the step sizes actually taken are used for model learning. As new step sizes are
generated, the search particle is expected to move to locations with better objective
function values. In the proposed algorithm, the search particle only makes the move
when a location with a better function value is found.
However, if all the points generated in the current searching range have no better
function values than the current best value, the search particle does not move and
the GRG will repeat generating groups of particle locations for several trials. If no
better location is found after M trials, we suspect that the current searching range
is too small or the current step size is too large, which makes us miss the locations
with better function values. In such a case, we should enlarge the size of the searching area and reduce the step size, as in (2.16):

$$\sigma_{p^i} = \alpha\, \sigma_{p^i}, \qquad dp^i = \varepsilon\, dp^i \qquad (i = 1, 2, \ldots, N), \tag{2.16}$$
where α > 1, and ε < 1. If this new search is still not successful, the searching range
and the step size will continue changing until some points with better function values
are found. If at certain step of the search process, in order to find the new location
with better function values, the current step size is reduced to be too small to make
the search particle move anywhere, it indicates that the optimum point has been
reached. The stop criterion can be defined by the current step size being β times smaller than the previous step size, as

$$dp < \beta\, dp_{prev} \qquad (0 < \beta < 1,\ \beta \text{ usually small}). \tag{2.17}$$
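The adjustment rule (2.16) and the stopping test (2.17) translate directly into code; comparing the step vectors through their Euclidean norms in (2.17) is our interpretation of the scalar inequality.

```python
def adjust_search(sigma, dp, alpha=1.1, eps=0.9):
    # (2.16): enlarge the searching area (alpha > 1) and reduce the
    # step sizes (eps < 1) on every coordinate after M failed trials.
    return [alpha * s for s in sigma], [eps * d for d in dp]

def should_stop(dp, dp_prev, beta=0.005):
    # (2.17): stop once the current step is beta times smaller than the
    # previous one (norms used here as our interpretation).
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return norm(dp) < beta * norm(dp_prev)
```

The values alpha = 1.1, eps = 0.9 and beta = 0.005 are the ones quoted later for the synthetic-function experiment.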

2.2.5 Optimization Algorithm


Based on previous discussion, the proposed optimization algorithm (RLO) can be
described as follows.
(a). The procedure starts from a random point of the objective function with N variables, $V = f(p_1, p_2, \ldots, p_N)$. It will try to make a series of moves to get closer to the global optimum point.
(b). To change from the current location, the step size $dp^i$ (i = 1, 2, ..., N) and the standard deviation $\sigma_{p^i}$ (i = 1, 2, ..., N) for each coordinate are generated from the uniform probability distribution.
(c). The step sizes $dp^i$ determine the center of the searching area. The deviations of all the coordinates $\sigma_{p^i}$ determine the size of the searching area. Several points $P_S$ in this range are randomly chosen from the Gaussian distribution using $dp^i$ as mean values and $\sigma_{p^i}$ as standard deviations.
(d). The objective function values are evaluated at these new points. Compare the
objective function values on random points with that at the current location.
(e). If the new points generated in Step (c) have no better values than the current position, Step (c) is repeated for up to M trials until a point with a better function value is found.
(f). If the search fails after M trials, enlarge the size of the searching area, and reduce
the step size, as in (2.16).
(g). If the search with the updated searching area size and the step sizes from Step (f) is not successful, the range and the step size will keep being adjusted until either some points with better values are found, the current step sizes become much smaller than the previous step sizes as in (2.17), or the function value changes by less than a pre-specified threshold. If either of the latter two conditions occurs, the algorithm terminates, which indicates that the optimum point has been reached.
(h). Move the search particle to the point $p(p_1, p_2)$ with the best function value $V_b$ (a local minimum or maximum depending on the optimization objective). The distance between the previous best point $p_{prev}(p_{prev}^1, p_{prev}^2)$ and the current best point $p(p_1, p_2)$ gives the actual step sizes $dp^i$ (i = 1, 2, ..., N). Collect the historical information of the step sizes taken during the search process.

(i). Approximate the step sizes as a function of the iterative steps using the weighted least-squared fit as in (2.5). The proper maximum order of the basis functions is determined using the SNRF described in Section 2.2.2 to avoid overfitting.
(j). Use the modeled function to determine the step sizes $dp^i$ (i = 1, 2, ..., N) for the next iteration step. The difference between the approximated step sizes and the actual step sizes gives the approximation deviation $\sigma_{p^i}$ (i = 1, 2, ..., N). Repeat Steps (c) to (j).
In general, the optimization algorithm based on the reinforcement learning builds
the model of successful moves for a given objective function. The model is built
based on historical successful actions and it is used to determine new actions. The
algorithm combines the exploitation and exploration of searching using random generators. The optimization algorithm does not require any prior knowledge of the objective function or its derivatives, nor are there any special requirements on the objective function. The search operator is conceptually very simple and intuitive. In the following section, the algorithm is verified using several experiments.

2.3 Simulation and Discussion


2.3.1 Finding Global Minimum of a Multi-variable Function
2.3.1.1 A Synthetic Bivariate Function
A synthetic bivariate function

$$V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2),$$

used previously in the example in section 2.2.2, is used as the objective function.
This function has several local minima and one global minimum equal to -112.2586.
The optimization algorithm starts at a random point and performs the search process
looking for the optimum point (minimum in this example). The number of random
points Ps generated in the searching area in each step is 10. The scaling factors α
and ε in (2.16) are 1.1 and 0.9. The β in (2.17) is 0.005.
One possible search path is shown in Fig. 2.7 from the start location to the final
optimum location as found by RLO algorithm. The global optimum is found in
13 iterative steps. The historical locations are shown in the figure as well. During
the search process, the historical step sizes taken are shown in Fig. 2.8 with their
approximation by WLSF.
Another search process, starting from a different random point, is shown in Fig. 2.9. The global optimum is found in 10 iterative steps.
Table 2.1 shows changes in the numerical function values and adjustment of the step
sizes dp1 and dp2 for p1 and p2 in the successive search steps. Notice how the step
size was initially reduced to be increased again once the algorithm started to follow
a correct path towards the optimum.

Fig. 2.7 Search path from start point to optimum

Fig. 2.8 Step sizes taken during the search process

Fig. 2.9 Search path from start point to optimum



Table 2.1 Function values and step sizes in a searching process

Search steps Function value V (p1 , p2 ) Step size d p1 Step size d p2


1 1.4430 2.9455 0.8606
2 -34.8100 0.3570 -1.7924
3 -61.4957 -0.0508 -0.7299
4 -69.8342 -0.0477 -0.3114
5 -70.5394 -0.1232 0.2015
6 -71.5813 0.0000 4.4358
7 -109.0453 -0.0281 0.3408
8 -110.8888 0.0495 -0.0531
9 -112.0104 0.0438 -0.0772
10 -112.1666

This search process was performed for 300 random trials. The success rate of finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 function evaluations to find the optimum in this problem.
The same problem was tested with several other direct-search-based optimization algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rate of finding the global optimum and the average number of function evaluations are compared in Tables 2.2, 2.3 and 2.4. All the simulations were performed using an Intel Core Duo
2.2GHz based PC, with 2GB of RAM.

Table 2.2 Comparison of optimization performances on synthetic function

RLO SA PSO OSSRS


Success rate of finding the global optimum 93.78% 29.08% 94.89% 52.21%
Number of function evaluations 4299 13118 4087 313
CPU time consumption [s] 28.4 254.35 20.29 1.95

2.3.1.2 Six-Hump Camel Back Function


The classic 2D six-hump camel back function [5] has 6 local minima and 2 global minima. The function is given as

$$V(p_1, p_2) = \left(4 - 2.1 p_1^2 + \frac{p_1^4}{3}\right) p_1^2 + p_1 p_2 + \left(-4 + 4 p_2^2\right) p_2^2 \qquad (p_1 \in [-3, 3],\ p_2 \in [-2, 2]).$$
Within the specified bounded region, the function has 2 global minima equal to -
1.0316. The optimization performances of these algorithms from 300 random trials
are compared in Table 2.3.
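As a concrete check, the function can be sketched in a few lines of Python. This is our illustration, not code from the chapter; the minimizer locations (±0.0898, ∓0.7126) are standard values for this benchmark and are assumed here rather than taken from the text.

```python
def camel_back(p1, p2):
    """Six-hump camel back function; p1 in [-3, 3], p2 in [-2, 2]."""
    return (4 - 2.1 * p1**2 + p1**4 / 3) * p1**2 + p1 * p2 \
        + (-4 + 4 * p2**2) * p2**2
```

Evaluating at either of the two symmetric minimizers gives approximately −1.0316, matching the value quoted above.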

Table 2.3 Comparison of optimization performances on six-hump camel back function

RLO SA PSO OSSRS


Success rate of finding the global optimum 80.33% 45.22% 86.44% 42.67%
Number of function evaluations 5016 8045.5 3971 256
CPU time consumption [s] 33.60 151.86 20.35 1.63

2.3.1.3 Banana Function


Rosenbrock’s famous “banana function” [23], given as

V(p1, p2) = 100(p2 − p1²)² + (1 − p1)²,

has 1 global minimum equal to 0 lying inside a narrow, curved valley. The opti-
mization performances of these algorithms from 300 random trials are compared in
Table 2.4.
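A minimal Python sketch of the banana function, useful for reproducing such comparisons:

```python
def banana(p1, p2):
    """Rosenbrock's banana function; global minimum 0 at (1, 1)."""
    return 100 * (p2 - p1**2)**2 + (1 - p1)**2
```

Along the valley floor p2 = p1² only the (1 − p1)² term remains, which is why progress toward (1, 1) is slow for many direct search methods.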

Table 2.4 Comparison of optimization performances on banana function

RLO SA PSO OSSRS


Success rate of finding the global optimum 74.55% 3.33% 41% 88.89%
Number of function evaluations 48883.7 28412 4168 882.4
CPU time consumption [s] 320.74 539.38 20.27 5.15

In these optimization problems, RLO demonstrates consistently satisfactory per-
formance without particular tuning of its parameters, whereas the other methods show
different levels of efficiency and capability in handling the various problems.

2.3.2 Optimization of Weights in Multi-layer Perceptron Training


The output of a multi-layer perceptron (MLP) can be viewed as the value of a
function with the weights as the approximation variables. Training the MLP, in the
sense of finding optimal values of weights to accomplish the learning task, can be
treated as an optimization problem. We can take the Iris plant database [22] as a
testing case. The Iris database contains 3 classes, 5 numerical features and 150 sam-
ples. In order to accomplish the classification of the iris samples, a 3-layered MLP
with an input layer, a hidden layer and an output layer can be used. The size of the
input layer should be equal to the number of features. The size of the hidden layer
is chosen to be 6, and since the class IDs are numerical values equal to 1, 2 and
3, the size of the output layer is 1. The weight matrix between the input layer and
the hidden layer contains 30 elements, and the one between the hidden layer and the

output layer contains 6 elements. Overall, there are 36 weight elements (parameters)
to be optimized. In a typical trial, the optimization algorithm finds the optimal set
of weights after only 3 iterations. In the testing stage, the outputs of the MLP are
rounded to the nearest integers to indicate predicted class IDs. Comparing the
given class IDs with the class IDs predicted by the MLP in Fig. 2.10 shows
that 146 out of 150 iris samples are correctly classified by this set of weights,
a correct-classification rate of 97.3%. A single support vector machine (SVM)
achieved a 96.73% classification rate [12]. In addition, an MLP with the
same structure, trained by back-propagation (BP), achieved 96% on the Iris test case.
The MLP and BP are implemented using the MATLAB neural network toolbox.
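The idea of treating MLP training as black-box optimization can be sketched as below. This is not the chapter's implementation (which uses the MATLAB toolbox); it is a minimal Python illustration assuming a tanh hidden layer, a single linear output unit and no bias terms, so that a 5-input, 6-hidden, 1-output network is exactly a 36-element weight vector, as in the text.

```python
import math

def mlp_output(x, w, n_in=5, n_hid=6):
    """Forward pass of a 3-layer MLP with one linear output unit.
    The flat vector w packs n_in*n_hid input-to-hidden weights
    followed by n_hid hidden-to-output weights."""
    hid_w = [w[i * n_in:(i + 1) * n_in] for i in range(n_hid)]
    out_w = w[n_in * n_hid:]
    h = [math.tanh(sum(wi * xi for wi, xi in zip(hw, x))) for hw in hid_w]
    return sum(wo * hi for wo, hi in zip(out_w, h))

def classification_error(w, data, n_in=5, n_hid=6):
    """Objective for the optimizer: fraction of samples whose rounded
    network output does not match the numerical class ID."""
    wrong = sum(round(mlp_output(x, w, n_in, n_hid)) != y for x, y in data)
    return wrong / len(data)
```

Any black-box optimizer (RLO included) can then minimize `classification_error` over the 36-dimensional weight vector.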

Fig. 2.10 RLO performance on neural network training on Iris problem

2.3.3 Micro-saccade Optimization in Active Vision for Machine Intelligence

In the area of machine intelligence, active vision has become an interesting topic.
Instead of taking in the whole scene captured by the camera and making sense of all
the information, as in the conventional computer vision approach, an active vision agent
focuses on small parts of the scene and moves its fixation frequently. Humans and
other animals use such quick movements of both eyes, called saccades [3], to
focus on the interesting part of the scene and use their resources efficiently. The
interesting parts are usually important features of the input, and once the important
features are extracted, the high-resolution scene can be analyzed and recognized with
a relatively small number of samples.
In a saccade movement network (SMN) presented in [16], the original images are
transformed into a set of low resolution images after saccade movements and retina

sampling. The set of images, as the sampled features, are fed to the self-organizing
winner-take-all classifier (SOWTAC) network for recognition. To find interesting
features of the input image and to direct the saccade movements, image segmentation,
edge detection and basic morphology tools [4] are utilized.

Fig. 2.11 Face image and its interesting features in active vision [16]

Fig. 2.11 (a) shows a face image from [7] with 320×240 pixels. The interesting
features found are shown in Fig. 2.11 (b). The stars represent the centers of the four
interesting features found on the face image and the rectangles represent the feature
boundaries. Then, the retina sampling model [16] places its fovea at the center of
each interesting feature, so that these features can be extracted.
Practically, the centers of the interesting features found by image processing tools
[4] are not guaranteed to be accurate, which affects the accuracy of the
feature extraction and pattern recognition process. To help find the
optimum sampling position, the RLO algorithm can be used to direct the movement of the
retina's fovea and find the closest match between the obtained sample features
and pre-stored reference sample features. These slight movements during fixation to find
the optimum sampling positions can be called microsaccades in the active vision
process, although the actual role of microsaccades has been an unresolved debate
for several decades [19].
Fig. 2.12 (a) shows a group of ideal samples of important features in face recog-
nition. Fig. 2.12 (b) shows the group of sampled features at the initial sampling posi-
tions. In the optimization process, the x-y coordinates need to be optimized so that
the sampled images have the optimum similarity to the ideal images. The level of
similarity can be measured by the sum of squared intensity differences [9]; in this
metric, increased similarity means a decreased intensity difference. Such a problem
can also be perceived as an image registration problem. The two-variable objective
function V(x, y), the sum of squared intensity differences, needs to be minimized
by the RLO algorithm.

Fig. 2.12 Image sampling by micro-saccade

It is noted that the only information available is that V is a function of the x and y
coordinates; how the function is expressed and what its characteristics are is totally
unknown, and the minimum value of the objective function is not known either. RLO
is a suitable algorithm for such an optimization problem. Fig. 2.12 (c) shows the
sampled images optimized using RLO-directed microsaccades. The optimized feature
samples are closer to the ideal feature samples, which helps the processing of the
face image.
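The objective can be sketched as follows. This is an illustrative Python fragment of ours (the chapter gives no code), with images represented as plain 2-D lists of intensities and the patch size fixed by the stored reference feature:

```python
def ssd(patch, reference):
    """Sum of squared intensity differences; lower means more similar."""
    return sum((p - r) ** 2
               for row_p, row_r in zip(patch, reference)
               for p, r in zip(row_p, row_r))

def V(image, ref, x, y):
    """Two-variable objective V(x, y): SSD between the patch sampled with
    its top-left corner at (x, y) and the stored reference feature."""
    h, w = len(ref), len(ref[0])
    patch = [row[x:x + w] for row in image[y:y + h]]
    return ssd(patch, ref)
```

A search such as RLO then moves (x, y) to drive V toward its (unknown) minimum; V(x, y) = 0 exactly when the fovea lands on the reference feature.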
After the featured images are obtained through RLO-directed microsaccades,
these low-resolution images, instead of the entire high-resolution face image, are
sent to the SOWTAC network for further processing or recognition.

2.4 Conclusions
In this chapter, a novel and efficient optimization algorithm was presented for
problems in which the objective function is unknown. The search particle is able
to build a model of successful actions and choose its future action based on its
past exploring experience. The decisions on the step sizes (and directions) are made
based on a trade-off between exploitation of the known search path and exploration
for an improved search direction. In this sense, the algorithm falls into the category
of reinforcement learning based optimization (RLO) methods. The algorithm does
not require any prior knowledge of the objective function or its characteristics.
It is conceptually very simple and intuitive, as well as very easy to implement and tune.
The optimization algorithm was tested and verified using several multi-variable
functions and compared with several other widely used random search optimization
algorithms. Furthermore, the training of a multi-layer perceptron (MLP), based on
finding a set of optimized weights to accomplish the learning, was treated as an opti-
mization problem, and the proposed RLO was used to find the weights of the MLP in
the training problem on the Iris database. Finally, the algorithm was used in the image
recognition process to find a familiar object with retina sampling and micro-saccades.
The performance of RLO will depend to a certain degree on the values of several
parameters that the algorithm uses. With certain preset parameters, the performance
of RLO meets our requirements in several machine learning problems involved
in our current research. In future research, a theoretical and systematic analysis
of the effect of these parameters will be conducted. In addition, using a group of
search particles and their cooperation and competition, a population-based RLO can
be developed. With the help of model approximation techniques and the trade-off
between exploration and exploitation proposed in this work, the population-based
RLO is expected to have better performance.

References
1. Arfken, G.: Lagrange Multipliers, 3rd edn. §17.6 in Mathematical Methods for Physi-
cists, pp. 945–950. Academic Press, Orlando (1985)
2. Belur, S.: A random search method for the optimization of a function of
n variables. MATLAB central file exchange, http://www.mathworks.com/
matlabcentral/fileexchange/loadFile.do?objectId=100
3. Cassin, B., Solomon, S.: Dictionary of Eye Terminology. Triad Publishing Company,
Gainsville (1990)
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, the Mathworks,
http://www.mathworks.com/products/image/demos.html
5. Dixon, L.C.W., Szego, G.P.: The optimization problem: An introduction. Towards Global
Optimization II. North Holland, New York (1978)
6. Nelder, J.A., Mead, R.: A simplex method for function minimization. The Computer
Journal 7, 308–313 (1965)
7. Facegen Modeller. Singular Inversions,
http://www.facegen.com/products.htm
8. del Toro Garcia, X., Neri, F., Cascella, G.L., Salvatore, N.: A surrogate associated
Hooke-Jeeves algorithm to optimize the control system of a PMSM drive. IEEE ISIE,
347–352 (July 2006)
9. Hill, D.L.G., Batchelor, P.: Registration methodology: concepts and algorithms. In: Hajnal,
J.V., Hill, D.L.G., Hawkes, D.J. (eds.) Medical Image Registration. CRC, Boca Raton (2001)
10. Hooke, R., Jeeves, T.A.: Direct search solution of numerical and statistical problems.
Journal of the Association for Computing Machinery 8, 212–229 (1961)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neu-
ral Networks, Perth, Australia, December 1995, vol. 4, pp. 1942–1948 (1995)
12. Kim, H., Pang, S., Je, H.: Support vector machine ensemble with bagging. In: Lee, S.-W.,
Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, p. 397. Springer, Heidelberg (2002)
13. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-
ence 220(4598), 671–680 (1983)

14. Leontitsis, A.: Hybrid Particle Swarm Optimization, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=6497
15. Lewis, R.M., Torczon, V., Trosset, M.W.: Direct search methods: Then and now. Journal
of Computational and Applied Mathematics 124(1), 191–207 (2000)
16. Li, Y.: Active Vision through Invariant Representations and Saccade Movements. Master
thesis, School of Electrical Engineering and Computer Science, Ohio University (2006)
17. Liu, Y., Starzyk, J.A., Zhu, Z.: Optimized Approximation Algorithm in Neural Networks
without overfitting. IEEE Trans. on Neural Networks 19(4), 983–995 (2008)
18. Lustig, I.J., Marsten, R.E., Shanno, D.F.: Computational Experience with a Primal-Dual
Interior Point Method for Linear Programming. Linear Algebra and its Application 152,
191–222 (1991)
19. Martinez-Conde, S., Macknik, S.L., Hubel, D.H.: The role of fixational eye movements
in visual perception. Nature Reviews Neuroscience 5(3), 229–240 (2004)
20. Ong, Y.-S.: Max-min surrogate-assisted evolutionary algorithm for robust design. IEEE
Trans. on Evolutionary Computation 10(4), 392–404 (2006)
21. Powell, M.J.D.: An efficient method for finding the minimum of a function of several
variables without calculating derivatives. The Computer Journal 7, 155–162 (1964)
22. Fisher, R.A.: Iris Plants Database (July 1988),
http://faculty.cs.byu.edu/˜cgc/Teaching/CS_478/iris.arff
23. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a func-
tion. The Computer Journal 3, 175–184 (1960)
24. Sheela, B.V.: An optimized step-size random search. Computer Methods in Applied Me-
chanics and Engineering 19(1), 99–106 (1979)
25. Snyman, J.A.: Practical Mathematical Optimization: An Introduction to Basic Optimiza-
tion Theory and Classical and New Gradient-Based Algorithms. Springer, Heidelberg
(2005)
26. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Frontiers in Robotics, Automa-
tion and Control, October 2008, pp. 83–110. I-Tech Education and Publishing (2008),
http://www.intechweb.org/
book.php?%20id=78&content=subject&sid=11
27. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cam-
bridge (1998)
28. Torczon, V.: On the Convergence of Pattern Search Algorithms. SIAM Journal on Opti-
mization 7(1), 1–25 (1997)
29. Vandekerckhove, J.: General simulated annealing algorithm, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=10548
30. Ypma, T.J.: Historical development of the Newton-Raphson method. SIAM Re-
view 37(4), 531–551 (1995)
Chapter 3
The Use of Opposition for Decreasing Function
Evaluations in Population-Based Search

Mario Ventresca, Shahryar Rahnamayan, and Hamid Reza Tizhoosh

Abstract. This chapter discusses the application of opposition-based computing
to reducing the number of function calls required to perform optimization
by population-based search. We provide motivation and comparison to similar,
but different, approaches including antithetic variates and quasi-randomness/
low-discrepancy sequences. We employ differential evolution and population-based
incremental learning as optimization methods for image thresholding. Our results
confirm improvements in required function calls, as well as support the oppositional
principles used to attain them.

3.1 Introduction
Global optimization is concerned with discovering an optimal (minimum or maxi-
mum) solution to a given problem generally within a large search space. In some in-
stances the search space may be simple (i.e. concave or convex optimization can be
used). However, most real world problems are multi-modal and deceptive [5] which
often causes traditional optimization algorithms to become trapped at local optima.
Many strategies have been developed to overcome this for global optimization in-
cluding, but not limited to simulated annealing [9], tabu search [4], evolutionary
algorithms [7] and swarm intelligence [3].
Some of these methods employ a single-solution-per-iteration methodology,
whereby only one solution is generated and successively perturbed towards more
appropriate solutions (i.e. simulated annealing and tabu search). Single-solution
methods inherently require low computational time per iteration; however, they
often lack sufficient diversity to adequately explore the search space. An alternative
is population-based techniques, where many solutions are generated at each iter-
ation and all (or most) are used to determine the search direction (i.e. evolutionary
and swarm intelligence algorithms). By considering many solutions per iteration
these methods have higher diversity, but require a larger amount of computation.
In this chapter we focus on population-based optimization.

Mario Ventresca · Hamid Reza Tizhoosh
Department of Systems Design Engineering, The University of Waterloo, Ontario, Canada
e-mail: mventres@pami.uwaterloo.ca, tizhoosh@uwaterloo.ca

Shahryar Rahnamayan
Faculty of Engineering and Applied Science,
The University of Ontario Institute of Technology, Ontario, Canada
e-mail: shahryar.rahnamayan@uoit.ca

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 49–71.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
Real-world problems also present the possible issue of complexity in the evaluation
of a solution; that is, determining the quality of a given solution is computationally
expensive. Investigating simpler evaluation metrics is one possible direction; however,
evaluation may still be found to be expensive. Reducing the time spent (i.e. the
number of function evaluations) then becomes an important goal.
Opposition-based computing (OBC) is a newly devised concept having, as one
of its aims, the improvement of the convergence rates of algorithms by defining and
simultaneously considering pairs of opposite solutions [24]. This improved conver-
gence rate is also usually accompanied by a more desirable final result. To date,
OBC has shown improvements in reinforcement learning [20, 22, 23], evolution-
ary algorithms [14, 15, 16], ant colony optimization [12], simulated annealing [28],
estimation of distribution [29], and neural networks [26, 27, 30].
In this chapter we discuss the application of OBC to reducing the number of
function calls required to achieve a desired accuracy for population-based searches.
We show the theoretical reasoning behind OBC (which has roots in monotonic opti-
mization) and provide mathematical motivation for its ability to reduce function
calls and improve the accuracy of simulation results. The key factor in accomplishing
this is the simultaneous consideration of negatively associated variables and their
effect on the target evaluation function and search process. We choose to highlight
the improvements for the task of image thresholding using differential evolution and
population-based incremental learning.
The rest of this chapter is organized as follows: Section 3.2 discusses the theoret-
ical motivations behind our approach. Differential evolution and population-based
incremental learning are discussed in Section 3.3, as are their respective opposition-
based counterparts. The experimental setup is given in Section 3.4 and results are
presented in Section 3.5. Conclusions are given in Section 3.6.

3.2 Theoretical Motivations


In this section we introduce notations and definitions required to explain the concept
of opposition and its ability to reduce the number of function calls. We also provide
a brief comparison of OBC to antithetic variates and quasi-random/low-discrepancy
sequences.

3.2.1 Definitions and Notations


In the following definitions, assume that A ⊆ ℜ^d is non-empty, compact and d-
dimensional. Without loss of generality, f : A → ℜ is a continuous function to be
maximized. We assume all of A is feasible.
The purpose of a global search method is to discover the global optima (either
minimum or maximum) of a given function and not converge to one of the local
optima.
Definition 1 (Global Optimum). A solution θ∗ ∈ A is a global optimum if f(θ) ≤
f(θ∗) for all θ ∈ A. There may exist more than one global optimum.
Definition 2 (Local Optimum). A solution θ′ ∈ A is a local optimum if there exists an
ε-neighborhood Nε(θ′) with radius ε > 0 such that f(θ) ≤ f(θ′) for all θ ∈ A ∩ Nε(θ′)
with g(θ, θ′) < ε, for a distance function g.
Recent research [1, 25, 32] has shown the benefit of utilizing monotonic transforma-
tions of the evaluation criteria as a means of discovering global optima. This causes
a reordering of the solutions and a gradient-based method can be used to search the
reordered space. An issue with these convexification and concavification methods
which transform certain functions to a monotonic form is that the mapping must be
known a priori. Otherwise, optimization on the transformed function is unreliable.
Definition 3 (Monotonicity). Function φ : ℜ → ℜ is monotonic if for x, y ∈ ℜ and
x < (>)y, then φ (x) ≤ (≥)φ (y). A strictly monotonic function is one which does
not permit equality, (i.e. φ (x) < (>)φ (y)).
Theoretically, a monotonic transformation is ideal; however, OBC does not require
it. Instead, opposition extends the monotonic global search idea through the use of
opposite solutions, which are simultaneously considered, and the more desirable
(w.r.t. f and the problem definition) is used during the search.
Definition 4 (Opposite). A pair (x, x̆) ∈ A are opposites if there exists a function
Φ : A → A where Φ (x) = y and Φ (y) = x. The breve notation will be used to
denote the opposite element (i.e. x̆ = Φ (x) = y).
The function Φ referred to in Definition 4 is the key to employing opposition-based
techniques. This determines which elements are opposites, and a poorly selected
function could lead to poor performance (see the following section).
Definition 5 (Opposite Mapping). A one-to-one function Φ : A → A where every
pair x, x̆ ∈ A are unique (i.e. for z ∈ A , if Φ (x) = y and Φ (y) = x then there does
not exist Φ (y) = z or Φ (x) = z).
Φ can be determined via prior knowledge, intuition or through some a priori
or online learning procedure. Simultaneous use of the opposites (for example,
a maximization problem) is easily accomplished by allowing f (θ ) = f (θ̆ ) =
max( f (θ ), f (θ̆ )) and searching for a solution S which corresponds to the most de-
sirable solution.
S = arg max_θ f(θ).  (3.1)

3.2.2 Consequences of Opposition


As with a monotonic transformation, the “optimal” opposition map is one which
effectively reorders the elements of A such that φ(f) is monotonic. Under this trans-
formation cor(X, X̆) ≤ 0 for X, X̆ ⊂ A, as shown in Figure 3.1.

Fig. 3.1 Example of an evaluation function transformed to a monotonic function. The values
in X and X̆ are negatively correlated. The original function (not shown) could have been any
nonlinear, continuous function

The implementation of opposition for an optimization problem involves the si-
multaneous examination of x, x̆ and returns the most desirable. That is, we aim for
a negative correlation between the two guesses. A consequence of Φ is an effec-
tive halving of the possible evaluations (as shown in Figure 3.2), where f(θ) = f(θ̆) =
max(f(θ), f(θ̆)). The halving results from full information of the transformed func-
tion, such that the opposite solution is not required to be observed to compute the
max operation. In the more general case, it is sufficient to determine a function such
that Pr(max(f(θ1), f(θ̆1)) > max(f(θ1), f(θ2))) > 0.5, for θ1, θ2 ∈ A.
A further consequence of the simultaneous consideration of opposites is a provably
more desirable E[f] and lower variance [28]. Alone this does not guarantee a higher
quality outcome; the probability density function of the opposite-transformed func-
tion values should also be more desirable than the joint p.d.f. of random sampling
(i.e. the distribution corresponding to Pr(max(x1, x2))).

Fig. 3.2 Taking f(x) = max(f(x), f(−x)), we see that the possible evaluations in the search
space have been halved in the optimal situation of full knowledge (the plot contrasts the
function representing the maximum of a solution and its opposite with the transformed
monotonic function). In the general case, the transformed function will have a more desirable
mean and lower variance

While not investigated in this chapter, we make the observation that successive
applications of different opposite maps will lead to further smoothing of f. For
example,

f²(θ) = max(f(z = arg max_θ f(θ)), f(z̆))  (3.2)

where z̆ is determined via an opposite map Φ2 ≠ Φ1 and the superscript in f² indicates
the two applications of opposite mappings. In the limit,

lim_{i→∞} f^i(θ) = max(f^i(z_i = arg max_{θ_i} f^{i−1}(θ_i)), f(z̆_i)) = f(θ∗)  (3.3)

for i > 0 and global optimum f(θ∗). Effectively, this flattens the entire error surface
of f, except for the global optimum(s). A more feasible alternative is to use k > 0
transformations, which give reasonable results where 0 ≤ |f^{k−1} − f^k| < ε does not
diminish greatly.

3.2.3 Lowering Function Evaluations

We briefly discuss conditions on the design of the opposition map Φ which often lead
to improvements over purely random sampling with respect to lowering function
evaluations. If using an algorithm solely based on randomly generating solutions to
the problem at hand, then we require that for some ε > 0 and δ > 0.5

Pr (max( f (x), f (x̆)) − max( f (x), f (y)) > ε ) > δ (3.4)

where x, y, x̆ ∈ A. That is, the distribution of max(x, x̆) must be more desirable than the
distribution of i.i.d. random guesses. If this condition is met, then the probability
that the optimal solution (or one of higher quality) is discovered is higher using opposite
guesses. The goal in developing Φ is to maximize ε and δ. A similar goal is to deter-
mine Φ such that E[g(x, x̆)] is maximized for some distance function g. Satisfying
this condition implies (3.4).
Thus, probabilistically we expect a lower number of function calls to find a solu-
tion of a given thresholded quality. If employing this strategy within a guided search
method the dynamics of the algorithm must be considered to assure the guarantee
(i.e. the algorithm adds bias to the search which affects the convergence rate).
Practically, the simplest manner to decide a satisfactory Φ is through intuition
or prior knowledge about f . A possibility is to utilize modified low-discrepancy
sequences (see below), which aim to distribute guesses evenly throughout the search
space.

3.2.4 Comparison to Existing Methods


While opposition may sometimes employ methods from antithetic variates and low-
discrepancy sequences, in general that is not the case. To elucidate the uniqueness
of opposition in the following we distinguish it from these two methods.

3.2.4.1 Antithetic Variates


Suppose we desire to estimate ξ = E[f] from two samples Y1, Y2 with the unbiased
estimator

ξ̂ = (Y1 + Y2)/2.  (3.5)

If Y1, Y2 are i.i.d. then var(ξ̂) = var(Y1)/2. However, if cov(Y1, Y2) < 0 then the
variance can be further reduced.
One method to accomplish this is through the use of a monotonic function h.
Generate Y1 as an i.i.d. uniform value on [0, 1] as before; utilizing h, our two variables
are h(Y1) and h(1 − Y1), which are negatively correlated since h is monotonic over
the interval [0, 1]. Then

ξ̂ = (h(Y1) + h(1 − Y1))/2  (3.6)
will be an unbiased estimator of E[ f ].
Opposition is similar in its selection of negatively correlated samples. However,
in the antithetic approach there is no guideline for constructing such a monotonic func-
tion, although such a function has been proven to exist [17]. Opposition provides
various means to accomplish this, as well as to incorporate the idea into optimiza-
tion while guaranteeing more desirable expected values and lower variance in the
target function.

Further, opposition extends beyond the generation of solutions in random
sampling-based algorithms. It can also be applied to algorithm behavior and can be
used to relate concepts expressed linguistically, where we have not found evidence
that antithetic variates have application.
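A small numerical sketch of ours (not from the chapter) makes the variance argument concrete: for a monotonic h, pairing h(U) with h(1 − U) as in Eq. (3.6) yields negatively correlated samples and a visibly smaller estimator variance than i.i.d. sampling.

```python
import random

def mc_estimates(h, n, seed=3):
    """Compare plain Monte Carlo samples h(U) with antithetic pairs
    (h(U) + h(1 - U))/2 for estimating E[h(U)], U ~ Unif(0, 1)."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    plain = [h(u) for u in us]                  # i.i.d. samples
    anti = [(h(u) + h(1 - u)) / 2 for u in us]  # antithetic pairs

    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    return mean(anti), var(plain), var(anti)
```

For h(x) = x², theory gives per-sample variances 4/45 ≈ 0.089 (plain) versus 1/180 ≈ 0.0056 (antithetic pairs); even charging the antithetic estimator for its two function evaluations per pair (i.e. comparing against 2/45), it wins comfortably here.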

3.2.4.2 Quasi-Randomness and Low-Discrepancy Sequences


These methods aim to combine the randomness from pseudorandom generators
which select values i.i.d. with the advantages of generating points distributed in
a grid-like fashion. The goal is to uniformly distribute values over the search space
by achieving a low-discrepancy. However, this is achieved at the cost of statistical
independence.
The discrepancy of a sequence is a measure of its uniformity and can be calcu-
lated via [6]:

D_N(X) = sup_{B∈J} | |B ∩ X|/N − λ_d(B) |  (3.7)

where λ_d is the d-dimensional Lebesgue measure, |B ∩ X| is the number of points
of X = (x1, ..., xN) that fall into the interval B, and J is the set of d-dimensional
intervals defined as:

∏_{i=1}^{d} [a_i, b_i) = {x ∈ ℜ^d : a_i ≤ x_i < b_i}  (3.8)

for 0 ≤ a_i < b_i ≤ 1.
That is, the actual number of points within each interval for a given sample is
close to the expected number. Such sequences have been widely studied in quasi-
Monte Carlo methods [10].
Opposition may utilize low-discrepancy sequences in some situations. Though
in general, low-discrepancy sequences are simply a means for attaining uniform
distribution without regard to the correlation between the evaluations at these points.
Further, opposition-based techniques simultaneously consider two points in order to
smooth the evaluation function and improve performance of the sampling algorithm
whereas quasi-random sequences often are concerned with many more points.
These methods have been applied to evolutionary algorithms, where a performance
study found that sampling methods such as Uniform, Normal, Halton, Sobol, Faure,
and Low-Discrepancy sequences are valuable only for low-dimensional (d < 10, and
so non-highly-sparse) populations [11].
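For intuition, the one-dimensional star discrepancy can be computed exactly and a classic low-discrepancy construction compared against pseudorandom points. This sketch is ours and assumes d = 1; the base-2 van der Corput sequence used here is a standard example, not one discussed above.

```python
import random

def van_der_corput(n, base=2):
    """First n points of the base-b van der Corput low-discrepancy
    sequence (the radical inverse of 1..n)."""
    points = []
    for i in range(1, n + 1):
        x, denom, k = 0.0, 1.0, i
        while k > 0:
            denom *= base
            x += (k % base) / denom
            k //= base
        points.append(x)
    return points

def star_discrepancy_1d(points):
    """Exact 1-D star discrepancy: sup over t of | |[0,t) ∩ X|/N - t |."""
    xs, n = sorted(points), len(points)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

With N = 64 points the van der Corput set (discrepancy 1/64 here) is far more uniform than a typical pseudorandom draw.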

3.3 Algorithms
In this section we describe Differential Evolution (DE) and Population-Based
Incremental Learning (PBIL), which are the

parent algorithms for this study. We also describe the oppositional variants of these
methods.

3.3.1 Differential Evolution


Differential Evolution (DE) was proposed by Price and Storn in 1995 [21]. It is an
effective, robust, and simple global optimization algorithm [8]. DE is a population-
based directed search method. Like other evolutionary algorithms, it starts with an
initial population vector. If no preliminary knowledge about the solution space is
available the population is randomly generated. Each vector of the initial population
can be generated as follows [8]:

Xi,j = aj + randj(0, 1) × (bj − aj);  j = 1, 2, ..., d,  (3.9)

where d is the problem dimension; a j and b j are the lower and the upper boundaries
of the variable j, respectively. rand(0, 1) is the uniformly generated random number
in [0, 1].
Assume Xi,t (i = 1, 2, ..., Np) are candidate solution vectors in generation t, where
Np is the population size. Successive populations are generated by adding the
weighted difference of two randomly selected vectors to a third randomly selected
vector. For classical DE (DE/rand/1/bin), the mutation, crossover, and selection
operators are defined as follows:
Mutation: For each vector Xi,t in generation t a mutant vector Vi,t is defined by

Vi,t = Xa,t + F(Xc,t − Xb,t ), (3.10)

where i ∈ {1, 2, ..., Np} and a, b, and c are mutually different random integer indices
selected from {1, 2, ..., Np}. Further, i, a, b, and c are mutually distinct, which
necessitates Np ≥ 4. F ∈ [0, 2] is a real constant which determines the amplification of
the added differential variation Xc,t − Xb,t. Larger values of F result in higher
diversity in the generated population and lower values lead to faster convergence.
Crossover: DE utilizes the crossover operation, which shuffles competing solution vectors, to generate new solutions and to increase population diversity. For classical DE (DE/rand/1/bin), binomial crossover (indicated by 'bin' in the notation) is utilized. It defines the following trial vector:

Ui,t = (U1i,t ,U2i,t , ...,Udi,t ), (3.11)

where

          Vji,t   if randj(0, 1) ≤ Cr or j = k,
Uji,t =                                              (3.12)
          Xji,t   otherwise,

for Cr ∈ (0, 1) the predefined crossover rate, and randj(0, 1) the jth evaluation of a uniform random number generator. k ∈ {1, 2, ..., d} is a random parameter index,
3 The Use of Opposition for Decreasing Function Evaluations 57

chosen once for each i to make sure that at least one parameter is always selected
from the mutated vector, V ji,t . Most popular values for Cr are in the range of (0.4, 1)
[14].
Selection: This decides which vector (Ui,t or Xi,t) becomes a member of the next generation, t + 1. For a minimization problem, the vector with the lower objective function value is chosen (greedy selection).
This evolutionary cycle (i.e., mutation, crossover, and selection) is applied to each of the Np population members to generate a new population. Successive generations are produced until the predefined termination criterion is met.

3.3.2 Opposition-Based Differential Evolution


By utilizing opposite points, we can obtain fitter starting candidate solutions even when there is no a priori knowledge about the solution(s), as follows:
1. Randomly initialize the population P of Np individuals.
2. Calculate the opposite population by

OPi, j = a j + b j − Pi, j , (3.13)

i = 1, 2, ..., N p ; j = 1, 2, ..., D,
where Pi,j and OPi,j denote the jth variable of the ith vector of the population and the opposite population, respectively.
3. Select the Np fittest individuals from P ∪ OP as the initial population.
The general ODE scheme also employs generation jumping, but it has not been used in this work; only opposition-based population initialization and sample generation are employed.
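Steps 1-3 above can be sketched as follows; the function name and use of plain Python lists are illustrative assumptions.

```python
import random

def opposition_init(f, Np, a, b):
    """Opposition-based initialization (Eqs. 3.9 and 3.13): keep the Np
    fittest points from a random population and its opposite (minimization).
    a, b are per-variable lower/upper bounds."""
    d = len(a)
    P = [[a[j] + random.random() * (b[j] - a[j]) for j in range(d)]
         for _ in range(Np)]
    # opposite point: OP_{i,j} = a_j + b_j - P_{i,j}
    OP = [[a[j] + b[j] - x[j] for j in range(d)] for x in P]
    # select the Np fittest from the union P + OP
    return sorted(P + OP, key=f)[:Np]
```

Note that each opposite point stays inside the box [a, b], so no boundary repair is needed.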

3.3.3 Population-Based Incremental Learning


PBIL is a stochastic search which abstracts the population of samples found in evo-
lutionary computation with a probability distribution for each variable of the so-
lution [2]. At each generation a new sample population is generated based on the
current probability distribution. The best individual is retained and the probability
model is updated accordingly to reflect the belief regarding the final solution. The
update rule is similar to that found in reinforcement learning.
A population is represented by a probability matrix M := (mi,j)d×c which stores the probability distribution over each possible element in the solution. If considering a binary problem, then the solution is S := (si,j)d×c ∈ {0, 1} and mi,j ∈ [0, 1] is the probability that element si,j = 1. For continuous problems probability distributions are
used instead of a threshold value as is the case for discrete problems [18].
Learning consists of utilizing M to generate population P of k samples. After
evaluation of each sample according to function f the “best” (B∗ ) solution is re-
tained and M is updated according to
58 M. Ventresca, S. Rahnamayan, and H.R. Tizhoosh

Mt = (1 − α )Mt−1 + α B∗ (3.14)

where 0 < α < 1 is the learning rate and t ≥ 1 is the iteration. Initially, mi,j = 0.5 to reflect the lack of prior information.
To abstract the crossover and mutation operators of evolutionary computation,
PBIL employs a randomization of M. Let 0 < β , γ < 1 be the probability of mutation
and degree of mutation, respectively. Then with probability β

mi, j = (1 − γ )mi, j + γ · random(0 or 1). (3.15)

Algorithm 1 provides a summary of this approach.

Algorithm 1. Population-Based Incremental Learning [2]


1: {Initialize probabilities}
2: M0 := (mi, j ) = 0.5
3: for t = 1 to ω do
4: {Generate samples}
5: G1 = generate samples(k,Mt−1 )

6: {Find best sample}


7: B∗ = select best({B∗ } ∪ G1)

8: {Update M}
9: Mt = (1 − α )Mt−1 + α B∗

10: {Mutate probability vector}


11: for i = 0...d and j = 0...c do
12: if random(0, 1) < β then
13: mi, j = (1 − γ )mi, j + γ · random(0 or 1)
14: end if
15: end for
16: end for
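Algorithm 1 can be condensed into a short sketch for a binary problem. A single probability vector stands in for the d × c matrix M, and the function name and problem setup are our illustrative assumptions, not the authors' code.

```python
import random

def pbil(f, n, iters=100, k=24, alpha=0.35, beta=0.15, gamma=0.25):
    """Minimal PBIL (after Algorithm 1) maximizing f over bit-vectors of length n."""
    m = [0.5] * n                       # M0: no prior information
    best = None
    for _ in range(iters):
        # generate k samples from the current probability vector
        samples = [[1 if random.random() < m[j] else 0 for j in range(n)]
                   for _ in range(k)]
        cand = max(samples, key=f)
        if best is None or f(cand) > f(best):
            best = cand                 # elitist B*
        # update toward the best solution (Eq. 3.14)
        m = [(1 - alpha) * mj + alpha * bj for mj, bj in zip(m, best)]
        # mutate each probability entry with probability beta (Eq. 3.15)
        m = [(1 - gamma) * mj + gamma * random.randint(0, 1)
             if random.random() < beta else mj for mj in m]
    return best
```

On the OneMax problem (maximize the number of ones), the probability vector is quickly pulled toward the all-ones string.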

3.3.4 Oppositional Population-Based Incremental Learning


The opposition-based version of PBIL (OPBIL) shown in Algorithm 2 employs the
opposite concept to improve diversity within the sample generation phase. A direct
effect on convergence rate is observed as a consequence of this mechanism. Further, OPBIL has an ability to escape the local optima in which estimation of distribution algorithms such as PBIL are prone to becoming trapped [19]. The description provided here is brief; the interested reader is referred to [29] for a more detailed description.
The general structure of the PBIL algorithm remains; however, aside from the sampling procedure, the update and mutation rules are altered to reflect a degrading

degree of opposition with respect to the number of iterations. That is, as the number of iterations t → ∞, the amount by which two opposite solutions differ approaches 1 bit (w.r.t. Hamming distance).
Sampling is accomplished using an opposite-guessing strategy whereby half of the population, R1, is generated using the probability matrix M and the other half is generated at a prescribed Hamming distance from a corresponding element of R1. The distance is calculated using an exponentially decaying function of the form

ξ(t) = l · e^(ct), (3.16)

where l is the maximum number of bits in a guess and c < 0 is a user-defined constant.
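A sketch of the decaying opposite-guess radius: ξ(t) shrinks exponentially, so early opposites differ in many bits and late opposites in roughly one. Flipping max(1, round(ξ(t))) randomly chosen bits is our reading of the strategy for illustration, not the authors' exact procedure; the value of c is an assumed example.

```python
import math
import random

def xi(t, l, c=-0.05):
    """Decaying Hamming radius (Eq. 3.16): xi(t) = l * e^(c*t), with c < 0."""
    return l * math.exp(c * t)

def opposite_guess(x, t, c=-0.05):
    """Return an 'opposite' of bit-vector x at iteration t by flipping
    max(1, round(xi(t))) randomly chosen bit positions."""
    n = len(x)
    flips = min(n, max(1, round(xi(t, n, c))))
    idx = set(random.sample(range(n), flips))
    return [1 - b if j in idx else b for j, b in enumerate(x)]
```

At t = 0 the opposite is the full bit-complement; for large t it differs from x by exactly one bit, matching the degrading degree of opposition described above.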
Updating of M is performed in lines 14-28. The sample best solution is used to focus the search either when a new global best solution has been discovered (i.e., η = B∗) or, otherwise, with probability pamp. When no new optima have been discovered this strategy tends away from B∗. The actual update is performed in line 16 and is based
on a reinforcement learning update using the sample best solution. The degree to
which M is updated is controlled by the user defined parameter 0 < ρ < 1.
Should the above criteria for update fail, a decay of M with probability pdecay is attempted in line 17. The decay, performed in lines 21-27, slowly tends M away from B∗. This portion of the algorithm is intended to prevent premature convergence and to aid exploration through small, smooth updates. The parameter 0 < τ < 1 is user defined, where often τ ≪ ρ.
The equations in lines 11 and 12 were determined experimentally and no argument regarding their optimality is provided. Indeed, there likely exist many functions which would yield more desirable results. These were chosen because they tend to lead to good behavior and outcomes for PBIL.

3.4 Experimental Setup


In this section we provide a discussion of the image thresholding problem, and the
application of evolutionary algorithms to solving it. Additionally, the evaluation
measure we employ to grade the quality of a segmentation is presented. Parame-
ter settings and problem representation are also given.

3.4.1 Evolutionary Image Thresholding


Image segmentation involves partitioning an image I into a set of segments with the
goal of locating objects in the image which are sufficiently similar. Thresholding is a special case of image segmentation with only two classes, defined by whether a given pixel is above or below a specific threshold value ω. This task has numerous applications and several general segmentation algorithms have been proposed
[33]. Due to the variety of image types there does not exist a single algorithm for
segmenting all images optimally.

Algorithm 2. Pseudocode for the OPBIL algorithm


Require: Maximum iterations, ω
Require: Number of samples per iteration, k
1: {Initialize probabilities}
2: M0 = mi..l = 0.5

3: for t = 1 to ω do
4: {Generate samples}
5: R1 = generate samples(k/2,M)
6: R̆1 = generate opposites(R1)

7: {Find best sample}


8: η = select best({R1 ∪ R̆1 })
9: B∗ = select best(B∗ , η )

10: {Compute probabilities}


11: pamp(Δ) = 1 − e^(−bΔ)
12: pdecay(Δ; f(B∗), f(η)) = (1 − (f(B∗) − f(η))/f(B∗)) / √(Δ + 1)
13: {Update M}
14: if η = B∗ OR random(0, 1) < pamp then
15: Δ =0
16: Mt = (1 − ρ )Mt−1 + ρη
17: else if random(0, 1) < pdecay then
18: if random(0, 1) < pdecay then
19: use B∗ in line 23 instead of η
20: end if
21: for all i, j each with probability < pdecay do
22: if ηi, j = B∗i, j then

23: mi,j = mi,j · (1 − τ · random(0, 1)) if ηi,j = 1, else mi,j · (1 + τ · random(0, 1))
24: else
25: mi,j = mi,j · (1 + τ · random(0, 1)) if ηi,j = 1, else mi,j · (1 − τ · random(0, 1))
26: end if
27: end for
28: end if
29: Δ = Δ +1
30: end for

Many general-purpose segmentation algorithms are histogram-based: they aim to discover a deep valley between two peaks and set ω equal to that value. However, many real-world problems have multimodal histograms, and deciding which value (i.e., valley) corresponds to the best thresholding is not obvious. The difficulty is compounded by the fact that the relative size of the peaks may be large (so that the valley becomes hard to distinguish) or the valleys may be very broad. Several algorithms have been proposed to overcome this [33]. Other methods based on
information theory and other statistical methods have been proposed as well [13].
Typically, the problem of segmentation involves a high degree of uncertainty
which makes solving the problem difficult. Stochastic searches such as evolution-
ary algorithms and population-based incremental learning often cope well with un-
certainty in optimization, hence they provide an interesting alternative approach to
traditional methods.
The main difficulty associated with the use of population-based methods is that they are computationally expensive due to the large number of function calls required during the optimization process. One approach to minimizing uncertainty is to split the image into subimages which (hopefully) have characteristics allowing for an easy segmentation. Combining the subimages then forms
the entire segmented image, although this requires extra function calls to analyze
each subimage. An important caveat is that the local image may represent a good
segmentation, but may not be useful with respect to the image as a whole.
In this chapter we investigate thresholding with population-based techniques. Using ODE we do not perform any splitting into subimages, while for OPBIL we split I into four equal-sized subregions, each having its own threshold value. In both cases
we require a single evaluation to perform the segmentation and we show that the
opposition-based techniques reduce the required number of function calls.
As stated above, there exist many different segmentation algorithms. Further, numerous methods for evaluating the quality of a segmentation have also been put forth [34]. In this paper we use a simple method which aims to minimize the discrepancy between the original M × N gray-level image I and its thresholded image T [31]:
∑_{i=1}^{M} ∑_{j=1}^{N} |Ii,j − Ti,j| (3.17)

where | · | denotes the absolute value. Using a different evaluation measure would change the outcome of the algorithm; however, segmentation posed in this manner nonetheless remains computationally expensive.
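The discrepancy measure of Eq. 3.17 is straightforward to compute. How the thresholded image T is produced from ω (here, output levels 0 and 255) is our assumption for illustration, as the chapter does not fix the output levels.

```python
def threshold(I, omega, low=0, high=255):
    """Binarize gray-level image I at threshold omega (output levels assumed)."""
    return [[high if p > omega else low for p in row] for row in I]

def discrepancy(I, T):
    """Sum of absolute differences between image I and thresholded image T (Eq. 3.17)."""
    return sum(abs(I[i][j] - T[i][j])
               for i in range(len(I)) for j in range(len(I[0])))
```

An evolutionary thresholding algorithm would then search for the ω minimizing discrepancy(I, threshold(I, ω)).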
We use the images shown in Figure 3.3 to evaluate the algorithms. The first column shows the original image, the second the gold image, and the third the approximate target image for ODE and OPBIL (i.e., the value-to-reach targets, discussed below). We show the gold image for completeness; it is not required in the experiments.
Both experiments employ a value-to-reach (VTR) stopping criterion which measures the time or function calls required to reach a specific value. The VTR values have been experimentally determined and are given in Table 3.1. Due to

Fig. 3.3 The images used to benchmark the algorithms. The first column is the original gray-level image, the second is the gold image, and the third column is the target image of the optimization within the required function calls

the respective algorithms' differing ability to solve this problem, given their representation and convergence behavior, these values differ for ODE and OPBIL.

Table 3.1 Value-to-reach (VTR) for O/DE and O/PBIL experiments

Image O/DE O/PBIL


1 19579 19850
2 3391 4925
3 7139 7175
4 19449 19850
5 19650 19700
6 22081 22700

3.4.2 Parameter Settings and Solution Representation


The ODE and OPBIL algorithms differ in that the former is a real-valued optimization algorithm and the latter operates in the binary domain. Therefore, the solution representations also differ and consequently directly affect the quality of results. However, the goal of this chapter is to show the ability of opposition to decrease the required number of function evaluations, and so fine-tuning aspects of these algorithms is not the focus of this investigation.

ODE Settings

The differential evolution experiments follow standard encoding guidelines. The user-defined parameter settings were determined empirically, as shown in Table 3.2.

Table 3.2 Parameter settings for differential evolution-based experiments

Parameter Value
Population size Np = 5
Amplification factor F = 0.9
Crossover probability Cr = 0.9
Mutation strategy DE/rand/1/bin
Maximum function calls MAXNFC = 200
Jumping rate (no jumping) Jr = −1

In order to maintain a reliable and fair comparison, these settings are kept un-
changed for all conducted experiments for both DE and ODE algorithms.

OPBIL Settings

As stated above, OPBIL requires a binary solution representation. However, thresholding aims to discover an integer value 0 ≤ T ≤ 255 to perform the segmentation operation I > T. Additionally, we use an approach of splitting I into subimages I1,...,16, where each Ii is an equal-sized square region of the original image.
The encoding is a matrix R := (ri,j)16×8, which corresponds to 16 subimages, each with a gray-level threshold < 2^8 = 256. Each row of R is converted to an integer which is used to segment the respective region of I. The extra regions increase problem difficulty as they result in more deceptive and multimodal problems.
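Decoding a row of R into a threshold might look as follows; the MSB-first bit order and function name are assumptions, since the chapter does not specify them.

```python
def decode_thresholds(R):
    """Convert each 8-bit binary row of R into an integer threshold in
    [0, 255], one per subimage (MSB-first bit order assumed)."""
    return [sum(bit << (len(row) - 1 - j) for j, bit in enumerate(row))
            for row in R]
```

Each of the 16 decoded integers is then used as the threshold ω for its corresponding subregion of I.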
Parameter settings for PBIL and OPBIL are as follows:

Table 3.3 Parameter settings for population-based incremental learning experiments

Parameter Value
Maximum iterations t = 150
Sample size k = 24
PBIL Only
Learning rate α = 0.35
Mutation probability β = 0.15
Mutation degree γ = 0.25
OPBIL Only
Update frequency control b = 0.1
Learning rate ρ = 0.25
Probability decay τ = 0.0005

3.5 Experimental Results


Using the test images and parameter settings stated above we show the ability of
oppositional concepts to decrease the required function calls. Due to space limitations, only the results for OPBIL are presented in detail, but similar behavior should be observed for ODE. All results presented correspond to the average of 30 runs, unless otherwise noted.

3.5.1 ODE
Table 3.4 presents a summary of the results obtained regarding function calls for ODE versus DE. Except for image 2, we show a decrease in function calls for all images. Images 4 and 5 have statistically significant improvements with respect to the decreased number of function calls, using a t-test at the 0.9 confidence level. Further, except for image 6, we show a lower standard deviation, which indicates higher reliability of the results.

Computing the overall number of function calls, we show an improvement of 322 − 277 = 45 function calls. This equates to an average of 45/6 = 7.5 saved function calls per image. This implies a savings ratio of 322/277 ≈ 1.16, indicating approximately 16% fewer function calls. For expensive optimization problems this can correspond to a great amount of savings with respect to algorithm run-time.

Table 3.4 Summary results for DE vs. ODE with respect to required function calls. μ and σ
correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image μDE σDE μODE σODE


1 74 41 60 34
2 32 20 35 16
3 42 23 37 19
4 74 36 54 30
5 47 37 45 22
6 63 26 46 31
Total 322 277

3.5.2 OPBIL
Table 3.5 shows the expected number of iterations (each iteration has 24 function
calls) to attain the value-to-reach given in Table 3.1. In all cases OPBIL reaches its
goal in fewer iterations than PBIL, where the results for images 2, 5, and 6 are found to be statistically significant using a t-test at the 0.9 confidence level. Additionally, in all cases we find a lower standard deviation, indicating more reliable behavior for OPBIL.
Overall, 444 − 347 = 97 iterations are saved using OPBIL, an average of roughly 16 iterations, or 16 × 24 = 384 function calls, per image. The approximate savings ratio is 444/347 ≈ 1.28, which is about a 28% improvement in required iterations.

Table 3.5 Summary results for PBIL vs. OPBIL with respect to required iterations.
μ and σ correspond to the mean and standard deviation of the subscripted algorithm,
respectively

Image μPBIL σPBIL μOPBIL σOPBIL


1 62 19 53 12
2 80 25 65 9
3 61 12 60 5
4 47 14 40 10
5 68 13 53 9
6 128 21 76 14
Total 444 347

In the following we analyze the correlation and distance for each sample per
iteration. This is to examine whether the negative correlation and larger distance
properties between a guess and its opposite are found in the sample. If true, we have
supported (although not confirmed) the hypothesis that the observed improvements
can be due to these characteristics.
Figure 3.4 shows the averaged correlation cor(Rt1, R̆t1) for the generated samples Rt1 and their respective opposites R̆t1 at iteration t. The solid line corresponds to OPBIL and the dotted line to PBIL. These plots show that OPBIL indeed has a lower correlation (with respect to the evaluation function) than PBIL (where we generate R1 as above, and let R̆1 also be randomly generated). In all cases the correlation is much stronger for PBIL (noting that if the algorithm reaches the VTR then we set the correlation to 0).

[Figure 3.4: six panels, one per image (Images 1-6), each plotting correlation (0-1) versus iterations (0-150).]

Fig. 3.4 Sample mean correlation over 30 trials for PBIL (dotted) versus OPBIL (solid). We
find OPBIL indeed yields a lower correlation than PBIL

We also examine the mean distance

ḡ = (2/k) ∑_{i=1}^{k/2} g(Rt1,i, R̆t1,i), (3.18)

which computes the fitness-distance between the ith guess R1,i and its opposite R̆1,i at iteration t, and which is shown in Figure 3.5. The distance for PBIL is relatively low throughout the 150 iterations, gently decreasing as the algorithm converges. However, as a consequence of OPBIL's ability to maintain diversity, the distance between samples increases during the early stages of the search and then rapidly decreases. Indeed, this is consistent with the lower correlation shown above.
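Eq. 3.18 can be computed as below. Taking g as the absolute fitness difference |f(x) − f(y)| is our interpretation of "fitness-distance (w.r.t. evaluation function)"; the chapter leaves g's exact form implicit.

```python
def mean_fitness_distance(f, R1, R1_opp):
    """Mean fitness-distance (Eq. 3.18) over k/2 guess/opposite pairs:
    g_bar = (2/k) * sum_i g(R1_i, R1_opp_i), with g(x, y) = |f(x) - f(y)|."""
    assert len(R1) == len(R1_opp)      # the k/2 pairs must line up
    return sum(abs(f(x) - f(y)) for x, y in zip(R1, R1_opp)) / len(R1)
```

Dividing by the number of pairs, len(R1) = k/2, is the same as the (2/k) factor in Eq. 3.18.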

[Figure 3.5: six panels, one per image (Images 1-6), each plotting distance versus iterations (0-150).]

Fig. 3.5 Sample mean distance over 30 trials for samples of PBIL (dotted) versus OPBIL
(solid). We find OPBIL indeed yields a higher distance between paired samples than PBIL

The final test is to examine the standard deviation (w.r.t. evaluation function) of
the distance between samples, given in Figure 3.6. Both algorithms have similarly
formed plots with respect to this measure, reflecting the convergence rate of the respective algorithms. It seems that the use of opposition aids the search by infusing diversity early on and quickly focusing once a high-quality optimum is found. Conversely, the basic PBIL does not include this bias; therefore its convergence is less rapid.

[Figure 3.6: six panels, one per image (Images 1-6), each plotting the standard deviation of the distance versus iterations (0-150).]

Fig. 3.6 Sample mean standard deviations over 30 trials for samples of PBIL (dotted) versus
OPBIL (solid)

3.6 Conclusion
In this chapter we have discussed the application of opposition-based computing
techniques to reducing the required number of function calls. Firstly, a brief

introduction to the underlying concepts of opposition was given, along with conditions under which opposition-based methods should be successful. A comparison to the similar concepts of antithetic variates and quasi-random/low-discrepancy sequences highlighted the uniqueness of our method.
Two recently proposed algorithms, ODE and OPBIL, were briefly introduced, along with the manner in which opposition is used to improve their respective parent algorithms, DE and PBIL. The manner in which opposites are used differs between the two cases, but the underlying concepts are the same.
Using the expensive optimization problem of image thresholding as a benchmark, we examined the ability of ODE and OPBIL to lower the number of function calls required to reach the pre-specified target value. It was found that both algorithms reduce the expected number of function calls: ODE by approximately 16% (function calls) and OPBIL by 28% (iterations). Further, concentrating on OPBIL, we showed the hypothesized lower correlation and higher fitness-distance measures for a quality opposite mapping.
Our results are very promising; however, future work is required in various regards. Firstly, a further theoretical basis for opposition and for choosing opposite mappings is needed. This could possibly lead to general implementation strategies when no prior knowledge is available. Further application to different real-world problems is also desired.

Acknowledgements
This work has been partially supported by Natural Sciences and Engineering Research Coun-
cil of Canada (NSERC).

References
1. Bai, F., Wu, Z.: A novel monotonization transformation for some classes of global opti-
mization problems. Asia-Pacific Journal of Operational Research 23(3), 371–392 (2006)
2. Baluja, S.: Population Based Incremental Learning - A Method for Integrating Genetic
Search Based Function Optimization and Competitive Learning. Tech. rep., Carnegie
Mellon University, CMU-CS-94-163 (1994)
3. Engelbrecht, A.: Fundamentals of Computational Swarm Intelligence. Wiley, Chichester
(2005)
4. Glover, F., Laguna, M.: Tabu Search. Kluwer, Dordrecht (1997)
5. Goldberg, D.E., Horn, J., Deb, K.: What makes a problem hard for a classifier system?
Tech. rep. In: Collected Abstracts for the First International Workshop on Learning Clas-
sifier Systems (IWLCS 1992), NASA Johnson Space (1992)
6. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. Society
for Industrial and Applied Mathematics (1992)
7. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan
Press (1975)
8. Price, K., Storn, R., Lampinen, J.A.: Differential Evolution: A Practical Approach to
Global Optimization. Springer, Heidelberg (2005)

9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Sci-
ence 220(4598), 671–680 (1983)
10. Lemieux, C.: Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Heidelberg
(2009)
11. Maaranen, H., Miettinen, K., Penttinen, A.: On initial populations of genetic algorithms for continuous optimization problems. Journal of Global Optimization 37(3), 405–436 (2007)
12. Montgomery, J., Randall, M.: Anti-pheromone as a tool for better exploration of search
space. In: Third International Workshop, ANTS, pp. 1–3 (2002)
13. O’Gorman, L., Sammon, M., Seul, M. (eds.): Practical Algorithms for Image Analysis. Cambridge University Press, Cambridge (2008)
14. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based differential evolu-
tion. IEEE Transactions on Evolutionary Computation 12(1), 64–79 (2008)
15. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based Differential Evolution Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 7363–7370 (2006)
16. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based Differential Evolution Algorithms for Optimization of Noisy Problems. In: IEEE Congress on Evolutionary Computation, pp. 6756–6763 (2006)
17. Rubinstein, R.: Monte Carlo Optimization, Simulation and Sensitivity of Queueing Net-
works. Wiley, Chichester (1986)
18. Sebag, M., Ducoulombier, A.: Extending Population-Based Incremental Learning to
Continuous Search Spaces. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P.
(eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998)
19. Shapiro, J.: Diversity loss in general estimation of distribution algorithms. In: Parallel
Problem Solving in Nature IX, pp. 92–101 (2006)
20. Shokri, M., Tizhoosh, H.R., Kamel, M.: Opposition-based Q(lambda) Algorithm. In:
IEEE International Joint Conference on Neural Networks, pp. 646–653 (2006)
21. Storn, R., Price, K.: Differential evolution- a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
22. Tizhoosh, H.R.: Reinforcement Learning Based on Actions and Opposite Actions. In:
International Conference on Artificial Intelligence and Machine Learning (2005)
23. Tizhoosh, H.R.: Opposition-based Reinforcement Learning. Journal of Advanced Com-
putational Intelligence and Intelligent Informatics 10(4), 578–585 (2006)
24. Tizhoosh, H.R., Ventresca, M. (eds.): Oppositional Concepts in Computational Intelli-
gence. Springer, Heidelberg (2008)
25. Toh, K.: Global optimization by monotonic transformation. Computational Optimization
and Applications 23(1), 77–99 (2002)
26. Ventresca, M., Tizhoosh, H.R.: Improving the Convergence of Backpropagation by Op-
posite Transfer Functions. In: IEEE International Joint Conference on Neural Networks,
pp. 9527–9534 (2006)
27. Ventresca, M., Tizhoosh, H.R.: Opposite Transfer Functions and Backpropagation
Through Time. In: IEEE Symposium on Foundations of Computational Intelligence, pp.
570–577 (2007)
28. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: IEEE
Symposium on Foundations of Computational Intelligence, pp. 186–192 (2007)
29. Ventresca, M., Tizhoosh, H.R.: A diversity maintaining population-based incremental
learning algorithm. Information Sciences 178(21), 4038–4056 (2008)

30. Ventresca, M., Tizhoosh, H.R.: Numerical condition of feedforward networks with op-
posite transfer functions. In: IEEE International Joint Conference on Neural Networks,
pp. 3232–3239 (2008)
31. Weszka, J., Rosenfeld, A.: Threshold evaluation techniques. IEEE Transactions on Sys-
tems, Man and Cybernetics 8(8), 622–629 (1978)
32. Wu, Z., Bai, F., Zhang, L.: Convexification and concavification for a general class of
global optimization problems. Journal of Global Optimization 31(1), 45–60 (2005)
33. Yoo, T. (ed.): Insight into Images: Principles and Practice for Segmentation, Registration,
and Image Analysis. AK Peters (2004)
34. Zhang, H., Fritts, J., Goldman, S.: Image segmentation evaluation: A survey of unsuper-
vised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
Chapter 4
Search Procedure Exploiting Locally
Regularized Objective Approximation:
A Convergence Theorem for Direct Search
Algorithms

Marek Bazan

Abstract. The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes in which the objective
function evaluation is expensive. It was introduced in [1] and further developed in
[2]. In this paper we present the convergence theorem of the method. The theorem
is proved for the EXTREM [6] algorithm but applies to any Gauss-Seidel algorithm
that uses sequential quadratic interpolation (SQI) as a line search method. After
some extension it can also be applied to conjugate direction algorithms. The proof
is based on the Zangwill theory of closed transformations. This method of proof was chosen instead of the sufficient-decrease approach since the crucial element of the presented proof is an extension of the SQI convergence proof from [14], which is itself based on Zangwill's theory.

4.1 Introduction
Optimization processes with objective functions that are expensive to evaluate – since usually their evaluation requires solving a large system of linear equations or simulating some physical process – occur in many fields of modern design. The main strategy for speeding up such processes by constructing a model to approximate the objective function is the use of trust region methods [4]. The application of radial basis function approximation as an approximation model in trust region methods was discussed in [13]. The standard method to prove the convergence of a trust region method is the method of sufficient decrease.
In [1] and [2] we presented the search procedure which can be viewed as an
alternative to trust region methods. It relies on combining the direct search algo-
rithm EXTREM [6] with the locally regularized radial basis approximation. The
Marek Bazan
Institute of Informatics, Automatics and Robotics, Department of Electronics,
Wrocław University of Technology, ul. Janiszewskiego 11/17, 50-372 Wrocław, Poland
e-mail: marek.bazan@pwr.wroc.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 73–103.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
74 M. Bazan

method achieves a speed-up of the optimization by replacing some number of direct function evaluations with radial basis approximations. The method combined with
the EXTREM algorithm was implemented within the computer program ROXIE
for superconducting magnets design and optimization [3], however it is a general
framework and, as we shall see, any search algorithm that is based on the Gauss-Seidel non-gradient algorithm or on a conjugate direction algorithm can be used. We
will call our method the Search Procedure Exploiting Locally Regularized Objec-
tive Approximation (SPELROA).
In this paper we give the convergence theorem for the SPELROA method, com-
bined with the EXTREM minimization algorithm under the assumption that the
radial basis approximation has a relative error less than ε. The method of proof is
different from those used in proofs of trust region methods since it is based on the
Zangwill theory of closed transformations. The crucial element of the proof is the
modification of the proof of the convergence of the quadratic interpolation as a line
search method (cf. [14]) under the assumption that a perturbation of the function
value may be introduced to the algorithm at each step. We give the conditions on the
function values as well as on ε to maintain convergence. The proof of the conver-
gence of the sequential quadratic interpolation in [14] is based on Zangwill’s theory
and as we essentially extend it we chose this method also to prove the convergence
of the whole SPELROA method.
The plan of this chapter is the following. In the next section we sketch the SPEL-
ROA method. In the third section we give some theory from [21] to be used in the
next section to prove the main result. Finally in the fourth section we discuss the
radial basis approximation and heuristics used for its construction. We also describe
difficulties in establishing strict error bound in the current state of the development
of radial basis function approximation for sparse data. In the remainder of the paper we give numerical results for three test functions of 6, 8, and 11 variables from a set
of test functions proposed in [12].

4.2 The Search Procedure


Let there be given a direct search optimization algorithm A that uses the quadratic
interpolation as a line search method. The SPELROA method combined with algo-
rithm A can be written in the form of the following algorithm (c.f. [2]).
While generating the set Z we have to take care that data points are not placed too
close to each other. When two points are too close to each other – where a distance
is controlled by a user-supplied parameter whose value is relative to the diameter of
the set Z – one of the points has to be replaced by another point not yet included.
Such procedure of constructing Z keeps the separation distance (c.f. [16]) greater
than the user-supplied parameter value and therefore guarantees that the radial basis
function interpolation matrix is not singular. The crucial step of the scheme is point
3, containing a threefold check for whether the approximation f˜(xk ) can be used
in the algorithm A instead of f (xk ) evaluated directly. Conditions being checked in
steps 3.a) and 3.b) are related to the radial basis approximation and will be discussed

Algorithm 1. The Search Procedure Exploiting Locally Regularized Objective Approximation
Input: f : Rd → R – the objective function,
x0 ∈ Rd – a starting point,
ε > 0 – prescribed accuracy of the approximation,
f̃(·) – a radial basis function approximation of f(·),
Is – number of initial steps performed by the direct optimization algorithm A,
N < Is – size of the data set used to construct the approximation f̃(·),
ε-check – a procedure to check the conditions required for the convergence theorem to hold.
0. Perform Is initial steps of the algorithm A.
1. In the k-th step generate point xk for which the function value is supposed to be
evaluated using algorithm A.
2. Generate a set Z from the N nearest points for which the function was evaluated directly.
3. Judge, whether for point xk a reliable approximation of f (xk ) can be
constructed.
a. If point xk is located in a reliable region of the search domain then construct
the approximation f˜ and evaluate f˜(xk ).
b. If the approximation f˜(xk ) was correctly constructed then perform an
ε -check.
c. If the ε -check is positive (i.e. the procedure returns true) then substitute

f (xk ) ← f˜(xk )

to the algorithm A.
d. Else evaluate f (xk ) directly.
4. If the stopping criterion of algorithm A is satisfied then stop.
Else replace k := k + 1 and go to 1.

in section 4.5 whereas the ε -check procedure from step 3.c) is associated with the
convergence theorem and will be discussed in section 4.4.
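The control flow of Algorithm 1 can be sketched as follows. This is an illustrative Python rendering, not the ROXIE implementation: the helper names (A_step, build_rbf, eps_check) and the thinning rule for Z are stand-ins for the components described above.

```python
import numpy as np

def spelroa(f, x0, A_step, build_rbf, eps_check, Is=10, N=8,
            min_sep=1e-3, max_iter=100):
    # history of directly evaluated points (x, f(x))
    history = [(np.asarray(x0, dtype=float), f(x0))]
    x = history[0][0]
    for k in range(max_iter):
        x_new = np.asarray(A_step(x, k), dtype=float)   # step 1: candidate point from A
        fx = None
        if k >= Is:
            # step 2: Z = the N nearest directly evaluated points, thinned so
            # that the separation distance stays above min_sep (this keeps the
            # RBF interpolation matrix nonsingular)
            pts = sorted(history, key=lambda p: np.linalg.norm(p[0] - x_new))
            Z = []
            for p, fp in pts:
                if all(np.linalg.norm(p - q) > min_sep for q, _ in Z):
                    Z.append((p, fp))
                if len(Z) == N:
                    break
            # step 3: try the surrogate, guarded by the eps-check
            surrogate = build_rbf(Z)
            if surrogate is not None:
                f_approx = surrogate(x_new)
                if eps_check(x_new, f_approx, Z):
                    fx = f_approx                        # step 3.c: use approximation
        if fx is None:
            fx = f(x_new)                                # step 3.d: evaluate directly
            history.append((x_new, fx))
        x = x_new
    return x
```

With a trivial step rule that always falls back to direct evaluation, the loop reduces to the underlying algorithm A.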

4.3 Zangwill’s Method to Prove Convergence


Feasible direction optimization algorithms can be written as

xk+1 = xk + τk dk

where dk is a search direction and τk is a search step. The k-th iteration of such algorithms can be viewed as a composite algorithmic transformation A = M1 D, where D : Rd → Rd × Rd is a search-direction generating transform

D(x) = (x, d)

where d ∈ Rd is a direction vector, and M1 : Rd × Rd → Rd is a line minimization transform

M1(x, d) = {y : f(y) = min{f(x + τd) : τ ∈ J}, y = x + τ0 d}

where (x, d) ∈ Rd × Rd, τ0 ∈ J is a minimizing step, and J is a variability interval for the scalar τ.


In his monograph [21] Zangwill gave a method of proving the convergence of
feasible directions optimization algorithms based on properties of an algorithmic
transformation A. In this section we sketch all crucial definitions and lemmas used
to state the main convergence theorem.
Definition 1. A transformation A : V → V is a transformation of a point into a set
when each point x ∈ V is assigned a set A(x) of points from V . The result of appli-
cation of the transform A(·) to a point xk can be any point xk+1 from a set A(xk )
thus
xk+1 ∈ A(xk ).
A transformation A = M 1 D defining a feasible direction optimization algorithm is a
transformation of a point into a set.
Definition 2. We say that a transformation A : V → V is closed in a point x∞ , if the
following implication holds true:
1. xk → x∞ , k ∈ K ,
2. yk ∈ A(xk ), k ∈ K ,
3. yk → y∞ ,
imply
4. y∞ ∈ A(x∞ ),
where K is a subsequence of natural numbers. We say that the transformation A is closed on X ⊂ V if it is closed at every point x ∈ X.
The property of closedness for algorithmic transformations is an analogue of the property of continuity for "usual" functions.
Theorem 1. (see [21], page 99)
Let a transformation A : V → V of a point into a set be an algorithm which, for a given point x1, generates a sequence {xk}∞k=1. Let S ⊂ V be a set of solutions. Let us assume:
1. All points xk are in a compact set X ⊂ V.
2. There exists a function Z : V → R such that

a. if point x is not a solution, then for any y ∈ A(x) there is

Z(y) < Z(x)

b. if point x is a solution, then either the algorithm finishes or for any y ∈ A(x)

Z(y) ≤ Z(x).

3. The transformation A is closed at every point x that is not a solution.


Then either the algorithm finishes in a point which is a solution or any convergent
subsequence generated by the algorithm has its limit in the solution set S.

Now we additionally need two lemmas from [21] concerning the closedness of a
composition of closed transforms.

Lemma 1. Let C : W → X be a given function and B : X → Y be a transform of a


point into a set. If function C is continuous in point w∞ and B is closed in C(w∞ ),
then a composition A = BC : W → Y is closed in w∞ .

Lemma 2. Let f be a continuous function. Then the transform M 1 is closed, if J is


a compact and bounded interval.

4.4 The Main Result


In practice another line search operator is usually considered, since implementation of the operator M1(·, ·) is expensive. Let us consider a line search operator M∗ defined as

M∗(x, d) = M1(x, d) ∪ {y = x + τd : f(y) ≤ f(x) − Δ, τ ∈ J}. (4.1)

Its value is the set of points at which f decreases by at least Δ along the direction d from the point x; when this set is empty, its value is the minimum of f along the direction. A suggestion for a practical application of the operator M∗ can be found in one of the exercises in [21]. In the EXTREM algorithm such an operator is used instead of M1, implemented as the parabolic search algorithm. The operator M∗ is more practical, in particular in the initial part of the optimization process, where the first steps of the parabolic interpolation search give the most significant decrease of the function value, whereas later steps usually decrease f much more slowly.
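The behaviour of M∗ can be illustrated with a small sketch: a parabolic line search that returns as soon as the sufficient decrease Δ is achieved, and otherwise falls back to the best interpolated point. The bracketing scheme and iteration limit are assumptions for illustration, not the EXTREM line search.

```python
import numpy as np

def m_star(f, x, d, Delta, tau_max=1.0, n_iter=20):
    phi = lambda t: f(x + t * d)                  # restriction of f to the line
    t1, t2, t3 = 0.0, 0.5 * tau_max, tau_max      # initial triplet in J = [0, tau_max]
    f0 = phi(0.0)
    for _ in range(n_iter):
        f1, f2, f3 = phi(t1), phi(t2), phi(t3)
        num = (t3**2 - t2**2) * f1 - (t3**2 - t1**2) * f2 + (t2**2 - t1**2) * f3
        den = (t3 - t2) * f1 - (t3 - t1) * f2 + (t2 - t1) * f3
        if abs(den) < 1e-14:
            break
        t_star = 0.5 * num / den                  # minimizer of the interpolating parabola
        t_star = min(max(t_star, t1), t3)
        if phi(t_star) <= f0 - Delta:             # early exit: sufficient decrease reached
            return x + t_star * d
        # otherwise shrink the bracket around the new point and iterate
        if t_star < t2:
            t1, t2, t3 = t1, t_star, t2
        else:
            t1, t2, t3 = t2, t_star, t3
    best = min((t1, t2, t3), key=phi)
    return x + best * d
```

On an exactly quadratic restriction the first interpolation already lands on the line minimizer.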
The main result of this note is the following theorem.

Theorem 2. Let a function f : Ω → R, Ω ⊂ Rd, be the objective function of the optimization problem, and let a method be given to approximate the objective function f at certain points of the domain Ω with a relative error ε > 0. If f is strictly convex, the SPELROA method combined with the Gauss–Seidel search algorithm, using approximated sequential quadratic interpolation as a line search method, converges to a stationary point x0 ∈ S, where S = {x : ∇f(x) = 0}.

The proof of Theorem 2 will be based on Theorem 1, where the transformation A is

A = R M∗Dd M∗Dd−1 . . . M∗D2 M∗D1

where Di chooses the i-th direction from the orthogonal direction base of the k-th iteration and R is an orthogonalization step producing a new base along the direction joining x0^(k−1) and xd^(k−1) from iteration k − 1.

4.4.1 Closedness of the Algorithmic Transformation


To show that point 3 of Theorem 1 is fulfilled, we first have to show that the
transformation M ∗ defined in (4.1) is closed. We prove the following lemma.

Lemma 3. Let f be a continuous function. Then the transform M ∗ is closed, if J is


a compact and bounded interval.

Proof. According to Definition 2, let us consider sequences {(xk, dk)}∞k=1 and {yk}∞k=1. We assume that
1. (xk, dk) → (x∞, d∞), k ∈ K,
2. yk ∈ M∗(xk, dk), k ∈ K,
3. yk → y∞, k ∈ K.
So we have that
yk = xk + τk dk
where τk ∈ J is such that

f(xk + τk dk) ≤ f(xk) − Δ.

Because τk ∈ J for k ∈ K and J is compact, there exists a convergent subsequence

τk → τ∞, k ∈ K1,

where K1 ⊂ K and τ∞ ∈ J. For a fixed τ ∈ J it follows from the definition of yk that

f (yk ) < f (xk + τ dk ) (4.2)

or
f (yk ) − f (xk + τ dk ) ≤ Δ . (4.3)
Note that if (4.3) is not satisfied then (4.2) is satisfied.
Since f is continuous, in the limit we get

f(y∞) = lim k∈K1 f(yk) < lim k∈K1 f(xk + τdk) = f(x∞ + τd∞) (4.4)

and in the same way


f (y∞ ) − f (x∞ + τ d∞ ) ≤ Δ . (4.5)

Because for any τ either (4.4) or (4.5) is fulfilled, for any point y∗ ∈ M∗(x∞, d∞) we have
f (y∞ ) < f (y∗ ), (4.6)
or
f (y∞ ) − f (y∗ ) ≤ Δ . (4.7)
On the other hand, at a point y∗ ∈ M∗(x∞, d∞) the function f either attains its least value over τ ∈ J, while
y∞ = x∞ + τ∞ d∞, τ∞ ∈ J,
or
f(y∞) < f(x∞) − Δ, (4.8)
and therefore
f(y∗) − f(y∞) ≤ Δ. (4.9)
Comparing (4.8) and (4.9) with (4.6) and (4.7) we get the result

y∞ ∈ M ∗ (x∞ , d∞ ).

To make use of Lemma 1 in order to show the closedness of the transformation A, we also have to note that the transformations Di(x) = (x, di) (i = 1, . . . , d) are continuous functions. For direct search algorithms the transformations Di generate orthogonal directions, the same as in the Gauss–Seidel algorithm (c.f. [21]). For conjugate-direction search algorithms the transformations Di generate successive conjugate directions. The transformation R that generates the orthogonal search directions [d0^new, . . . , dd−1^new] for step k is defined as

R(x0^(k−1), [d0, . . . , dd−1]) = (x0^k, [d0^new, . . . , dd−1^new]),

where the sequence of the new orthogonal vectors is uniquely defined by the orthog-
onalization of the vectors w0 , w1 , . . . , wd−1

w0 = s0 d0 + s1 d1 + · · · + sd−1 dd−1
w1 = s1 d1 + · · · + sd−1 dd−1
· · ·
wd−1 = sd−1 dd−1

where the scalars s0, s1, . . . , sd−1 correspond to the step sizes in all directions in step k − 1. The transformation R is uniquely defined without any conditions on the scalars s0, s1, . . . , sd−1 as long as the orthogonalization is performed using the algorithm presented in [15]; in this case it is also a continuous function. Finally, since the transformation A is a composition of the closed transformations M∗ with the continuous functions Di (i = 0, . . . , d − 1) and R, the assumptions of Lemma 1 are satisfied and we conclude that the transformation A is closed. This proves that assumption 3 of Theorem 1 holds for the unperturbed M∗. In the following subsection we show that the
transformation M ∗ can be realized by the perturbed transformation M 1 .
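The base rotation performed by R can be sketched as follows: build the vectors wj = Σi≥j si di from the previous step sizes and old directions, then orthonormalize them. Plain Gram–Schmidt is used here as a generic stand-in for the stable orthogonalization algorithm of [15]; the function name is illustrative.

```python
import numpy as np

def rotate_base(D, s):
    """D: (d, n) array whose columns are the old directions d_0..d_{d-1};
    s: step sizes s_0..s_{d-1} taken along them in the last iteration."""
    d = D.shape[1]
    # w_j = s_j d_j + s_{j+1} d_{j+1} + ... + s_{d-1} d_{d-1}
    W = np.stack([sum(s[i] * D[:, i] for i in range(j, d)) for j in range(d)],
                 axis=1)
    Q = np.zeros_like(W)
    for j in range(d):
        v = W[:, j].copy()
        for i in range(j):
            v -= (Q[:, i] @ W[:, j]) * Q[:, i]   # remove components along earlier dirs
        Q[:, j] = v / np.linalg.norm(v)
    return Q                                     # new orthonormal direction base
```

Note the first new direction is parallel to the overall displacement of the last iteration, as required.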

4.4.2 A Perturbation in the Line Search


For non-gradient optimization algorithms with an orthogonal basis as the set of search directions, the transformation M∗ is the only place in the algorithm where the perturbation arising from approximation of the objective function is introduced by the SPELROA method. Therefore, to show that point 2 of Theorem 1 is fulfilled, it is sufficient to show that an implementation of the transformation M∗ that allows a perturbation of the function value at the level ε > 0 still minimizes the function f along the direction d.
A proof of the convergence of the parabolic interpolation line search method can
be found in [8] or [14]. Here we will give conditions on the perturbation of the
function so that the proof given in [14] holds true. We will keep the notation as
close as possible to that in [14].
Let the function f : R → R be unimodal, and let a triplet ζ(i) = (ζ1(i), ζ2(i), ζ3(i)) satisfy f(ζ2(i)) ≤ min{f(ζ1(i)), f(ζ3(i))}, i.e. the interval [ζ1(i), ζ3(i)] contains the unique minimizer of the function f.
For a non-perturbed objective function we define the set of feasible triplets T ⊂ R3 defining an interval [ζ1, ζ3] that contains the minimizer λ̂ as

T := {ζ ∈ R3 : ζ1 < ζ2 < ζ3, f(ζ2) ≤ min{f(ζ1), f(ζ3)}}
  ∪ {ζ ∈ R3 : ζ1 = ζ2 < ζ3, f′(ζ1) ≤ 0, f(ζ3) ≥ f(ζ1)}
  ∪ {ζ ∈ R3 : ζ1 < ζ2 = ζ3, f′(ζ3) ≥ 0, f(ζ1) ≥ f(ζ3)}
  ∪ {ζ ∈ R3 : ζ1 = ζ2 = ζ3 = λ̂}.

For ζ ∈ T with ζ1 < ζ2 < ζ3, the minimizer of the quadratic interpolating the points (ζ1, f(ζ1)), (ζ2, f(ζ2)), (ζ3, f(ζ3)) equals

λ∗(ζ) = (1/2) · [(ζ3² − ζ2²)f(ζ1) − (ζ3² − ζ1²)f(ζ2) + (ζ2² − ζ1²)f(ζ3)] / [(ζ3 − ζ2)f(ζ1) − (ζ3 − ζ1)f(ζ2) + (ζ2 − ζ1)f(ζ3)].
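As a quick sanity check of this formula: for an exactly quadratic f the interpolating parabola coincides with f itself, so λ∗ must return f's true minimizer for any feasible triplet. A direct transcription:

```python
def lam_star(z1, z2, z3, f):
    # minimizer of the parabola through (z1,f(z1)), (z2,f(z2)), (z3,f(z3))
    f1, f2, f3 = f(z1), f(z2), f(z3)
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den
```

Adding a constant to f or scaling it leaves λ∗ unchanged, since the corresponding contributions cancel in both the numerator and the denominator.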

Then the set of admissible replacement triplets A(ζ) is the set of candidate triplets that may replace ζ ∈ T, defining a smaller interval containing λ̂ in the next iteration of the algorithm. A0(ζ) is defined as

A0(ζ) := T ∩ {u1(ζ), u2(ζ), u3(ζ), u4(ζ)}, where
u1(ζ) = (ζ1, λ∗(ζ), ζ2),
u2(ζ) = (ζ2, λ∗(ζ), ζ3),
u3(ζ) = (λ∗(ζ), ζ2, ζ3),
u4(ζ) = (ζ1, ζ2, λ∗(ζ)).
(4.10)
The crucial assumption on the perturbation that we introduce into the triplet used to construct the quadratic is that we allow a perturbation at only one of the three points. Without loss of generality let us assume that the minimum of the perturbed quadratic is constructed only for ζ such that ζ1 < ζ2 < ζ3.
For a perturbation of the value of the function f at the level ε > 0 we define three sets of triplets:

T1 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 ) ≤ min{ f (ζ1 )(1 − |ε |), f (ζ3 )} } (4.11)


T2 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 )(1 + |ε |) ≤ min{ f (ζ1 ), f (ζ3 )} } (4.12)
T3 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 ) ≤ min{ f (ζ1 ), f (ζ3 )(1 − |ε |)} } (4.13)

with the minimizers of the underlying perturbed quadratics

λ̃1∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²)f(ζ1)(1 + |ε|) − (ζ3² − ζ1²)f(ζ2) + (ζ2² − ζ1²)f(ζ3)] / [(ζ3 − ζ2)f(ζ1)(1 + |ε|) − (ζ3 − ζ1)f(ζ2) + (ζ2 − ζ1)f(ζ3)],

λ̃2∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²)f(ζ1) − (ζ3² − ζ1²)f(ζ2)(1 + |ε|) + (ζ2² − ζ1²)f(ζ3)] / [(ζ3 − ζ2)f(ζ1) − (ζ3 − ζ1)f(ζ2)(1 + |ε|) + (ζ2 − ζ1)f(ζ3)],

λ̃3∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²)f(ζ1) − (ζ3² − ζ1²)f(ζ2) + (ζ2² − ζ1²)f(ζ3)(1 + |ε|)] / [(ζ3 − ζ2)f(ζ1) − (ζ3 − ζ1)f(ζ2) + (ζ2 − ζ1)f(ζ3)(1 + |ε|)]

for a perturbation of the function f at the points ζ1, ζ2 and ζ3, respectively.
Such definitions of the sets Tl(ε) (l ∈ {1, 2, 3}) ensure that the perturbed minimizers are contained within the interval [ζ1, ζ3]. The corresponding sets of admissible triplets are defined as

Ãl(ε; ζ) := Tl(ε) ∩ {ũl1(ε; ζ), ũl2(ε; ζ), ũl3(ε; ζ), ũl4(ε; ζ)}

where
ũl1(ε; ζ) = (ζ1, λ̃l∗(ε; ζ), ζ2),
ũl2(ε; ζ) = (ζ2, λ̃l∗(ε; ζ), ζ3),
ũl3(ε; ζ) = (λ̃l∗(ε; ζ), ζ2, ζ3),
ũl4(ε; ζ) = (ζ1, ζ2, λ̃l∗(ε; ζ)),
(4.14)
for l ∈ {1, 2, 3}.
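The effect of perturbing a single value can be probed numerically: the sketch below applies the relative factor (1 + |ε|) to the l-th value and measures the resulting shift of the minimizer, which is precisely the quantity Δl(ε; ζ) used later (a closed-form expression is given in Appendix B; the helper names here are illustrative).

```python
def lam_perturbed(z, fv, l, eps):
    """z, fv: triplets (zeta_1, zeta_2, zeta_3) and (f1, f2, f3); l in {0, 1, 2}
    marks the perturbed point; eps: relative error of the approximation."""
    g = list(fv)
    g[l] = g[l] * (1.0 + abs(eps))               # perturb exactly one value
    z1, z2, z3 = z
    f1, f2, f3 = g
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den

def delta_l(z, fv, l, eps):
    # shift of the parabola minimizer induced by the perturbation
    return abs(lam_perturbed(z, fv, l, eps) - lam_perturbed(z, fv, l, 0.0))
```

For small ε the shift is correspondingly small, which is what the ε-check of Algorithm 1 exploits.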
Lemma 4. Let S denote the set of triplets containing a stationary point of the function f:

S := {ζ ∈ T : f′(ζ1) = 0 or f′(ζ2) = 0 or f′(ζ3) = 0}.

The following statements hold:


1. For every ζ ∈ T, the set A(ζ ) = A0 (ζ ) ∪ Ã1 (ε ; ζ ) ∪ Ã2 (ε ; ζ ) ∪ Ã3 (ε ; ζ ) is
nonempty.
2. The set valued map A(·) is closed.
3. For every ζ ∈ T\S := {ζ ∈ T : ζ ∉ S} such that ζ1 < ζ2 < ζ3 we have c(y) < c(ζ) for all y ∈ A(ζ), where

c(ζ) = f̃1(ζ1) + f̃2(ζ2) + f̃3(ζ3)

and f̃l(·) = f(·) or f̃l(·) = f(·)(1 + |ε|) depending on whether the value at the point ζl was exact or perturbed.
Proof. Let us introduce the following notation: f̃l(·) = f(·) when the value at x = ζl is exact, and f̃l(·) = f(·)(1 + ε) when the value at x = ζl is perturbed. First let us note that Tl(ε) ⊂ T for l ∈ {1, 2, 3} and ε > 0.

1. Let ζ = (ζ1, ζ2, ζ3) ∈ T be fixed. If f(ζ1), f(ζ2) and f(ζ3) are computed without perturbation, then A(ζ) is nonempty by the proof in [14]. Let us then consider the case in which one function value is approximated with relative error ε. The minimum of the quadratic constructed in this case will be λ̃l∗(ε; ζ), where l ∈ {1, 2, 3} depends on the point at which the function value is approximated. Consider the case λ̃l∗(ε; ζ) ∈ [ζ1, ζ2], assuming moreover that the minimum λ∗(ζ) obtained as if the function were evaluated without any perturbation also belongs to [ζ1, ζ2]. Then A(ζ) is empty if and only if both ũl1(ε; ζ) and ũl3(ε; ζ) are not in A(ζ), i.e. if and only if

f (λ̃1∗ (ε ; ζ )) > min{ f (ζ1 )(1 + |ε |), f (ζ2 )} = f (ζ2 ) (4.15)

and

f (ζ2 ) > min{ f (λ̃1∗ (ε ; ζ )), f (ζ3 )} ≥ min{ f (λ̃1∗ (ε ; ζ )), f (ζ2 )} (4.16)

if the function value is approximated at ζ1 (analogous inequalities are considered for perturbations at ζ2 and ζ3).
Since the inequalities (4.16) imply that f(ζ2) ≥ f(λ̃1∗(ε; ζ)), we get a contradiction with inequality (4.15), which proves the claim when the perturbation is at the point ζ1. A similar contradiction is obtained from the analogous inequalities for perturbations at ζ2 and ζ3. Note that the condition that the corresponding unperturbed λ∗(ζ) is also in [ζ1, ζ2] guarantees that if we compare ỹ = (ỹ1, ỹ2, ỹ3) ∈ Ãl(ζ) (l ∈ {1, 2, 3}) with y = (y1, y2, y3) ∈ A0(ζ) we get ỹ1 ≤ y1 and ỹ3 ≥ y3. The case when λ̃∗(ζ) belongs to [ζ2, ζ3] is symmetric. This shows that A(ζ) is nonempty.
2. We will prove the closedness of A according to Definition 2. Let us assume that ζ(i) → ζ∗ ∈ T and that there exist ζ∗∗ ∈ T and an infinite subsequence K ⊂ N such that ζ(i+1) ∈ A(ζ(i)) for every i ∈ K and ζ(i+1) →K ζ∗∗ as i → ∞. Then there must exist k ∈ {1, 2, 3, 4} and an infinite subsequence K′ ⊂ K such that ζ(i+1) = uk(ζ(i)) or ζ(i+1) = ũlk(ε; ζ(i)) (where l = li ∈ {1, 2, 3}) for every i ∈ K′. As shown in [14] the functions uk(·) are continuous, and therefore when the sequence {ζ(i)}∞i=0 does not contain any ζ(i) generated by approximated values, it follows from the continuity of uk(·) with respect to ζ and the closedness of the set T that uk(ζ(i)) → uk(ζ∗) = ζ∗∗ ∈ A(ζ∗). This proves the closedness of the transformation A if no approximation is used in the sequence ζ(i).
Now we consider sequences containing approximated triplets. Introducing an approximated triplet ζ(i+1) (i.e. ζ(i+1) ∈ Ãl(ε; ζ(i))) introduces discontinuities of the first kind into the functions uk(ζ), and the argument based on the continuity of the functions uk(ζ) cannot be applied directly. We will show that the algorithm introduces only a finite number of isolated discontinuity points, which guarantees that from any sequence {ζ(i)}∞i=1, after removing a finite number of initial elements, we can apply the proof from [14].
We have to consider two cases

• When the number of occurrences of ũlk in the sequence {ζ(i)}∞i=0 is finite, the proof of the closedness of the transformation A given in [14] applies after removing some number of initial elements from {ζ(i)}∞i=0.
• Let us then assume that the ũlk occur an infinite number of times in {ζ(i)}∞i=0. Since {ζ(i)}∞i=1 is convergent, for any δ ∈ R there exists i0 such that for all i > i0 we have

|ζ(i) − ζ(i+1)| = |(ζ1(i), ζ2(i), ζ3(i)) − (ζ1(i+1), ζ2(i+1), ζ3(i+1))| < δ. (4.17)

Let us choose i1 for which inequality (4.17) holds. Since the approximation is used infinitely many times in subsequent iterations, there exists a subsequence K = {i : i > i1} ⊂ N such that if i ∈ K then

ζ(i+1) ∈ Ãl(ε; ζ(i))

for some l ∈ {1, 2, 3}.


We will show that for any δ it is possible to choose ε such that

||ζ(i1) − ζ(i1+1)||2 > 2δ. (4.18)

This will in the end give us a contradiction with the assumed convergence. To derive conditions on ε we solve inequality (4.18) using the expressions (4.14) for ũlk(ε; ζ), k = 1, . . . , 4 and l ∈ {1, 2, 3}, and applying the parabola transformation q̂ from Appendix A.
For example, for a perturbation at ζ1(i) we get

ε1 > [(B − C)(ζ(r) − Ka) − (B − ζ(r)C) − A(Ka − 1)] / [A(Ka − 1)] (4.19)

and one of the following systems of inequalities is fulfilled:

{A(1 + ε1) + B − C > 0 and A(Ka − 1) < 0}  or  {A(1 + ε1) + B − C < 0 and A(Ka − 1) > 0}. (4.20)

where

ε1 = ε / f(ζl(i)),  Ka = (2δ/(ζ3 − ζ1))² − [1 − ζ(r)]²,  ζ(r) = (ζ2 − ζ1)/(ζ3 − ζ1),

with A, B and C defined in Appendix A. We leave it to the reader to show that these conditions are not contradictory, and also to derive the analogous inequalities for l = 2 and l = 3.
The above considerations mean that for any δ the value ε can be chosen so that (4.18) holds. But this contradicts the assumption that {ζi}∞i=0 converges, since we can choose two subsequences of {ζi}∞i=0 that converge to two different accumulation points: the first formed from the ζi1, the second from the ζi1+1; both are infinite. From this we conclude that from a certain i0 onward the sequence {ζi}∞i=0 cannot contain approximated points, and therefore the proof from [14] also applies in this case.
This finishes the proof of the closedness of the transformation A(ζ).
3. Let us assume that ζ ∈ T\S. Then λ̃∗(ζ) ∈ (ζ1, ζ3). Hereafter in this point we consider the quadratic obtained with the transformation q̂ from Appendix A; this transformation preserves all properties used in this proof. In what follows we will use the expression Δl(ε; ζ), the distance between the unperturbed and the perturbed minimum, i.e. Δl(ε; ζ) = |λ̃l∗(ε; ζ) − λ∗(ζ)|; see Appendix B for the expression for Δl(ε; ζ).
Here the situation is again symmetric with respect to ζ1 and ζ3, and we will consider the case where λ∗(ζ) ∈ (ζ1, ζ2] as well as λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).
Firstly we observe that f(ζ2) < f(ζ3): if we had f(ζ2) = f(ζ3), then λ̃∗(ζ) = (ζ2 + ζ3)/2, since either no value is perturbed or only the value at the point ζ2 is perturbed, and such a perturbation has no effect on the minimizer. Both cases contradict the assumption that λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).
When λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}), only u1(ζ) and ũl1(ε; ζ), and u3(ζ) and ũl3(ε; ζ) (l ∈ {1, 2, 3}), can belong to A(ζ). For unperturbed function values we have λ∗(ζ) < ζ2, and for perturbed function values we have

λ̃l∗(ε; ζ) + Δl(ε; ζ) < ζ2 (l ∈ {1, 2, 3}). (4.21)

In Appendix B we solve (4.21) with respect to ε, establishing conditions under which λ̃l∗(ε; ζ) < ζ2 and the line of proof from [14] remains valid. Only three cases arise:
a. A(ζ) = {u1(ζ), ũ11(ζ), ũ21(ζ), ũ31(ζ)}. Then, since u3(ζ) ∉ A(ζ) and ũl3(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), and f(λ∗(ζ)) < f(ζ2) as well as f(λ̃l∗(ε; ζ)) < f(ζ2) (l = 1, 3) and

f(λ̃2∗(ε; ζ)) < f(ζ2)(1 + |ε|), (4.22)

we must have

c(ũl1(ζ)) = f̃l(ζ1) + f(λ̃l∗(ζ)) + f̃l(ζ3) < f̃l(ζ1) + f̃l(ζ2) + f̃l(ζ3) = c(ζ),

where f̃l(·) = f(·) or f̃l(·) = f(·)(1 + ε) depending on which value was perturbed.
b. A(ζ) = {u3(ζ), ũ13(ζ), ũ23(ζ), ũ33(ζ)}. Then, since u1(ζ) ∉ A(ζ) and ũl1(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), we must have f(ζ2) ≤ f(λ∗(ζ)) as well as f̃(ζ2) ≤ f(λ̃l∗(ζ)) (l ∈ {1, 2, 3}), depending on which value was perturbed. Also f(λ∗(ζ)) < f(ζ1) as well as f(λ̃l∗(ζ)) < f̃l(ζ1) (l ∈ {1, 2, 3}), since otherwise we would have a local maximum in [ζ1, ζ2], contradicting unimodality. Therefore in this case we must have

c(ũl3(ζ)) = f(λ̃l∗(ζ)) + f̃2(ζ2) + f̃3(ζ3) < f̃l(ζ1) + f̃l(ζ2) + f̃l(ζ3) = c(ζ)

with the perturbation at the point ζl.



c. Finally, we can have the case A(ζ) = {u1(ζ), u3(ζ)}. In this case we cannot include in A(ζ) any of the approximated triplets ũl(ε; ζ) (l = 1, 2, 3), because of the following properties:
i. f(ζ2) < f(ζ1) by assumption,
ii. λ∗(ζ) ≤ ζ2,
iii. f(λ∗(ζ)) = f(ζ2), which implies λ∗(ζ) = ζ2.
These equalities hold since otherwise we would have a contradiction with the unimodality of f(·). Approximating any of the values would mean that we could not guarantee property iii. Therefore, since f(ζ2) < min{f(ζ1), f(ζ3)}, we get c(u1(ζ)) < c(ζ) and c(u3(ζ)) < c(ζ). From a practical point of view, for a given ζ(i) where one of the coordinates is approximated or all are exact, we can decide whether we have to use exact values, i.e. remove the approximation, by checking whether

|λ̃l∗(ε; ζ) − ζ2| > Δl(ε; ζ) (l ∈ {1, 2, 3}).

This exhausts all the possibilities and finishes the proof of the third point.

We are now in a position to formulate the Sequential Quadratic Interpolation Algorithm with Perturbation (c.f. [14] for the unperturbed version).
Please note that in the following we use the expressions "approximated" and "perturbed" function values interchangeably – they are synonymous here. Moreover, the adjective "exact" here means perturbed only at the level of the machine precision εM, where εM ≪ ε.
The main condition under which we can make use of the available approximation of the function at one of the points is the separation property, i.e. for the approximation used at a point with index l ∈ {1, 2, 3}, the triplet ζ(i) belongs to Tl, where Tl is defined by (4.11), (4.12) or (4.13), respectively.
Algorithm 2. The Sequential Quadratic Interpolation Algorithm with Objective Function Perturbation
Input: ζ(0) ∈ T – a starting point,
ε > 0 – the relative error of the approximation available at certain points of the function evaluation.
0. Set i = 0.
1. Compute λ∗ = λ∗(ζ(i)) or λ∗ = λ̃l∗(ε; ζ(i)), depending on whether the function value was exact at all points ζ1(i), ζ2(i), ζ3(i) or was perturbed at the point ζl(i) (l ∈ {1, 2, 3}), respectively.
2. If λ∗ = ζ1(i) or λ∗ = ζ3(i) then STOP; else construct the set A(ζ(i)):
a. If no value of the triplet ζ(i) is approximated, then A(ζ(i)) = A0 according to (4.10).
b. If the approximation at the point ζl(i) (l ∈ {1, 2, 3}) is available:
i. Compute the transformation q̂ as described in Appendix A.
ii. Compute Δl(ε; ζ(i)).
iii. If
|λ̃l∗(ε; ζ) − ζ2(i)| < Δl(ε; ζ)
then A(ζ(i)) = A0 and go to 3.
iv. If
λ̃l∗(ε; ζ) + Δl(ε; ζ) < ζ2(i)  or  λ̃l∗(ε; ζ) − Δl(ε; ζ) > ζ2(i) (4.23)
then A = A0 ∪ Ãl.
3. Compute
ζ(i+1) ∈ arg min{c(ζ) : ζ ∈ A(ζ(i))}.
4. Replace i := i + 1 and go to step 1.

Now we can formulate a convergence theorem for the above algorithm, analogous to that in [14], p. 155.

Theorem 3. Suppose that {ζ(i)}∞i=0 is a sequence constructed by Algorithm 2 in minimizing a continuously differentiable and unimodal function f : R → R. Then ζ(i) → ζ̂ as i → ∞ with ζ̂ ∈ S.

To prove the above theorem we show how the proof given in [14] can be applied, making use of Lemma 4.

Proof. The main difficulty in applying the method of proof from [14] is the fact that using the approximation at certain points of the domain causes a discontinuity of the cost function c as well as a discontinuity of the functional computing the minimum of the quadratic with respect to the parameter triplet ζ(i).
Let us first observe that the proof of the third point of Lemma 4 shows that allowing a perturbation according to (4.23) ensures that {ζ1(i)}∞i=0 is monotone increasing and {ζ3(i)}∞i=0 is monotone decreasing. Since both sequences are bounded, they are both convergent. Moreover, keeping Δl(ε; ζ(i)) at a level such that the above property of these sequences would also hold if the approximation were not used guarantees that ζ̂ ∈ [ζ1(i), ζ3(i)] for every i ∈ N. We have to distinguish between two nontrivial cases:
1. When {ζ(i)}∞i=1 → ζ̂ and ζ̂ = (ζ̂1, ζ̂2, ζ̂3) is an accumulation point with ζ̂1 < ζ̂2 < ζ̂3. In this case the contradiction is obtained using the continuity of the cost function c, as if an unperturbed algorithm were used. We can use the continuity argument here because, as shown in the proof of Lemma 4, the approximation can only be used in a finite number of steps; therefore, after removing a certain number of initial steps, the proof from [14] applies.
2. When the first case does not apply, i.e. the sequence constructed by Algorithm 2 has two accumulation points. In [14] it is shown that these are in fact the same accumulation point. Lemma 8 guarantees that the argument contained in [14] holds true if the constructed sequence reaches the stopping criterion with an unperturbed triplet. Since the sequence is not infinite, we have to consider one more case, namely when the algorithm stops with a perturbed triplet.
The two accumulation points are ζ∗ = (ζ̂1, ζ̂1, ζ̂3) and ζ∗∗ = (ζ̂1, ζ̂3, ζ̂3). To preserve the separation of function values, in the first case the perturbation can only be at the point ζ̂3, and in the second case only at the point ζ̂1. In the above sequences we therefore have ζ2(i) → ζ1(i) or ζ2(i) → ζ3(i). On the other hand, the separation between λ̃∗(ε; ζ(i)) and ζ2(i) has to be greater than Δ3(ε; ζ(i)) and Δ1(ε; ζ(i)) in the two cases respectively. This gives the contradiction with the convergence.
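The acceptance test of Algorithm 2 (steps 2.b.iii and 2.b.iv) reduces to a simple predicate: a perturbed minimizer is admitted only when its uncertainty interval of radius Δl stays strictly on one side of ζ2. The sketch below is illustrative; the names are not from the original implementation.

```python
def admit_perturbed(lam_pert, z2, Delta):
    """Return True when the perturbed minimum may be used, i.e. when
    lam_pert +/- Delta cannot cross zeta_2 (condition (4.23))."""
    if abs(lam_pert - z2) < Delta:       # step 2.b.iii: too close to zeta_2, reject
        return False
    # step 2.b.iv: the whole uncertainty interval lies left or right of zeta_2
    return lam_pert + Delta < z2 or lam_pert - Delta > z2
```

Rejection forces the algorithm back to the unperturbed candidate set A0, i.e. to a direct function evaluation.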
From the practical point of view, to ensure the convergence of the perturbed SQI algorithm, and therefore of the whole SPELROA method, it is sufficient to launch the procedure ε-check from step 3.c) of Algorithm 1 in each iteration of the perturbed SQI algorithm. It takes the last three points xk−2, xk−1, xk and computes ζk = (0, t1, t2) with t1 = (xk−2 − xk−1)/(xk−2 − xk) and t2 = 1, then computes the transformation q̂ defined by (4.34) to obtain the points (0, −2), (ζ(r), q̂(ζ(r))), (1, −1) when f̃(xk−2) < f̃(xk) (when f̃(xk−2) > f̃(xk) it is sufficient to rotate the scaled parabola around the point 0.5), and scales ε:

ε := ε/f(xk−2), when the perturbation is at the point xk−2,
ε := ε/f(xk−1), when the perturbation is at the point xk−1,
ε := 2ε/f(xk), when the perturbation is at the point xk,

depending on the point at which the function value was approximated. Then if


1. the perturbation is at 0: if ε satisfies (4.35) and fulfills (4.36) or (4.37), then the approximation can be used; else evaluate the objective function directly,
2. the perturbation is at ζ(r): if ε satisfies (4.38) and fulfills (4.37) or (4.40), then the approximation can be used; else evaluate the objective function directly,
3. the perturbation is at 1: ε has to satisfy analogous conditions (left to the reader) for the approximation to be used by the algorithm.
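The scaling of ε in the ε-check can be rendered schematically as below. The factor 2 for a perturbation at xk is taken verbatim from the text's scaling of the parabola to the points (0, −2) and (1, −1) and should be treated as quoted rather than derived; the function name is illustrative.

```python
def scale_eps(eps, which, f_km2, f_km1, f_k):
    """Turn the absolute approximation error eps into a relative one by
    dividing by the directly evaluated function value at whichever of the
    last three points x_{k-2}, x_{k-1}, x_k was approximated."""
    if which == "x_{k-2}":
        return eps / f_km2
    if which == "x_{k-1}":
        return eps / f_km1
    return 2.0 * eps / f_k
```

The resulting relative ε is then checked against the admissibility conditions of the preceding enumeration.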

4.5 The Radial Basis Approximation


In this section we will discuss the main aspects and possibilities of constructing a
radial basis approximation of the objective function in Algorithm 1.

4.5.1 Detecting Dense Regions


Since the data set Z = {xi }Ni=1 is very sparse then a method to detect regions rich
in data within the convex hull Ω of Z is required. In [2] we introduced a number of
merit
88 M. Bazan

γ(x, Z) := [ ∑_{i<j} a_{ij} W_{ij} ] / [ ∑_{i<j} W_{ij} ],    (4.24)

where a_{ij} = d_{ij}/(r_i + r_j), W_{ij} = 1/(r_i + r_j), d_{ij} = ||x_j − x_i||_2 and r_j = ||x − x_j||_2.
γ(x, Z) measures how well the data points from Z surround the evaluation point x.
The numbers a_{ij} measure how far the point x is placed from the segment x_i x_j;
the maximal value 1 is attained by a_{ij} if x lies on this segment. The weights W_{ij}
emphasize in γ(x, Z) the impact of segments x_i x_j that are close to the point x. The
additional normalization by the sum of the weights W_{ij} ensures that for any x ∈ R^d
the range of values of γ(x, Z) is (0, 1]. If the value of γ(x, Z) is greater than a certain
threshold value, then the construction of an approximation with good local
quality can be expected at the evaluation point x.
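The merit function (4.24) is easy to transcribe; the sketch below (our own illustration, not the chapter's code) assumes points are given as coordinate tuples and that x does not coincide with a data point, since otherwise a weight 1/(r_i + r_j) is undefined.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def gamma(x, Z):
    """Merit function (4.24): how well the points of Z surround x.
    Returns a value in (0, 1]; larger means x lies in a data-rich region."""
    num = den = 0.0
    for xi, xj in combinations(Z, 2):
        d_ij = dist(xi, xj)
        r_i, r_j = dist(x, xi), dist(x, xj)
        w_ij = 1.0 / (r_i + r_j)   # weight: pairs near x dominate
        a_ij = d_ij * w_ij         # a_ij = d_ij / (r_i + r_j) <= 1
        num += a_ij * w_ij
        den += w_ij
    return num / den

# x on the segment between two data points gives a_ij = 1 for that pair
Z = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(gamma((0.5, 0.0), Z))
```

A point far away from all of Z gets a much smaller value than a point surrounded by the data, which is exactly the property used to detect the reliable region.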


Fig. 4.1 γ (x, Z ) defined by (4.24). Here the set Z is a data set constructed on the 50-th
optimization step of a 2-parametric Rosenbrock function optimization with the EXTREM
algorithm

4.5.2 Regularization Training


For a given data set Z = {(x_i, f(x_i))}_{i=1}^N ⊂ R^d × R of pairwise different data points
x_i, and for the Gaussian radial basis function φ(x) = exp(−x²/r²), r > 0 (see [9]), a
radial basis function interpolant s(x) is defined as

s(x) = ∑_{i=1}^N w_i φ(||x − x_i||),    (4.25)

where
s(x_i) = f(x_i) = f_i,  i = 1, . . . , N.
4 Search Procedure Exploiting Locally Regularized Objective Approximation 89

The interpolation conditions s(x_i) = f_i imply the matrix formulation

Φw = f,    (4.26)

where Φ = [φ(||x_i − x_j||)]_{i,j=1,...,N}, f = [f_1, f_2, . . . , f_N]^T, w = [w_1, w_2, . . . , w_N]^T.


Positive definiteness [9] of φ guarantees nonsingularity of the matrix Φ, and thus there
exists a unique solution w for which s(x) interpolates the data from the set Z. Although
other choices of positive definite radial basis functions are possible, we
choose the Gaussian function due to the natural interpretation of its parameter r, which
can be set to a value proportional to the diameter of the data set Z = {x_i}_{i=1}^N. There
are two reasons why we solve an approximation problem rather than an interpolation
one. Firstly, when the data set Z is very irregular and φ is strictly positive definite,
the matrix Φ is very ill-conditioned. Secondly, solving (4.26) yields a
solution s(x) which may oscillate between data points where data is sparse.
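A minimal sketch of the interpolation system (4.25)-(4.26) for one-dimensional data, solved here with plain Gaussian elimination instead of a library solver; this is illustrative only, since the chapter's setting is multivariate and regularized.

```python
from math import exp

def gauss_rbf_interpolant(centers, values, r):
    """Solve Phi w = f for a 1-d Gaussian RBF interpolant (4.25)-(4.26)
    with phi(t) = exp(-t^2 / r^2), via Gaussian elimination with pivoting."""
    n = len(centers)
    phi = lambda t: exp(-(t * t) / (r * r))
    # augmented interpolation matrix: Phi_ij = phi(|x_i - x_j|), last column f
    A = [[phi(abs(ci - cj)) for cj in centers] + [fi]
         for ci, fi in zip(centers, values)]
    for k in range(n):                       # forward elimination
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    w = [0.0] * n                            # back substitution
    for k in range(n - 1, -1, -1):
        w[k] = (A[k][n] - sum(A[k][j] * w[j] for j in range(k + 1, n))) / A[k][k]
    return lambda x: sum(wi * phi(abs(x - ci)) for wi, ci in zip(w, centers))

s = gauss_rbf_interpolant([0.0, 0.5, 1.0], [1.0, 0.0, 1.0], r=1.0)
print(s(0.5))  # reproduces the data value 0.0 up to rounding
```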
The approximate solution is sought by means of Tikhonov regularization. Suppose
that we are given a linear mapping T : H → R and define the regularization
operator J : H → R by J(s) = ||Ts||², where H is the native space of the underlying
radial basis function φ(·). For a function f ∈ H and for a prescribed value of the
single regularization parameter λ, its approximation is the function f_λ(x) ∈ H of the
form (4.25) which is the solution of the minimization problem

min { ∑_{i=1}^N [f(x_i) − f_λ(x_i)]² + λ J(f_λ) : f_λ ∈ H }.    (4.27)

In matrix form the problem (4.27) is written as

(Φ^T Φ + λI)w = Φ^T f    (4.28)

for a λ > 0 governing the trade-off between data reproduction and the desired
smoothness of the solution f_λ(x). Due to the ill-conditioning of the matrix Φ, a direct
inversion of the matrix (Φ^T Φ + λI) in (4.28) is not numerically stable.
To solve (4.28) in a numerically stable way we use the singular value decomposition
of the matrix Φ, defined as

Φ = USV^T,    (4.29)

where U ∈ R^{N×N}, V ∈ R^{N×N} are orthogonal matrices and S = diag(σ_1, . . . , σ_N),
where σ_1 ≥ σ_2 ≥ · · · ≥ σ_N are the singular values of Φ. The singular value decomposition
is unique up to the signs of the columns of the matrices U and V. Using the above
decomposition we express the inverse matrix in (4.28) as

Φ† = (Φ^T Φ + λ²I)^{−1} Φ^T = V((S^T S + λ²I)^{−1} S^T)U^T = V Ω_λ U^T,    (4.30)

where

Ω_λ = diag( σ_1/(σ_1² + λ²), σ_2/(σ_2² + λ²), . . . , σ_N/(σ_N² + λ²) ).

Using the above equation, the weight vector w_λ is expressed (see [5]) by the expansion
with respect to the singular vectors of the matrix Φ:

w_λ = V Ω_λ U^T f = ∑_{i=1}^N [ σ_i/(σ_i² + λ²) ] (u_i^T f) v_i.    (4.31)

Comparing the expansion (4.31) for λ > 0 with the expansion for λ = 0, i.e. for the
interpolation problem, which reads

w = V S^{−1} U^T f = ∑_{i=1}^N (1/σ_i)(u_i^T f) v_i,    (4.32)

one can see the role of the regularization parameter λ. For σ_p ≥ λ ≥ σ_{p+1}
we have

1/σ_k ≫ σ_k/(σ_k² + λ²)   (k > p),

and hence the impact of the singular vectors corresponding to singular values σ_k <
λ is damped in the expansion (4.31). This enables us to avoid oscillations of the
solution that would be introduced by inverting small singular values in the expansion of
the weight vector. Another approach to solving the problem of ill-conditioning of
the interpolation matrix generated by multiquadric functions was presented in [7].
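The damping effect of the filter factors σ_i/(σ_i² + λ²) in (4.31), compared with the plain inverses 1/σ_i in (4.32), can be seen numerically; this is a toy illustration, not the chapter's code.

```python
def filter_weight(sigma, lam):
    """Tikhonov filter factor sigma/(sigma^2 + lambda^2) from (4.31).
    For sigma >> lambda it behaves like 1/sigma; for sigma << lambda
    it stays bounded (its maximum over sigma is 1/(2*lambda) at sigma = lambda),
    instead of blowing up like the plain inverse 1/sigma in (4.32)."""
    return sigma / (sigma ** 2 + lam ** 2)

lam = 1e-3
for sigma in (1.0, 1e-3, 1e-6):
    print(sigma, 1.0 / sigma, filter_weight(sigma, lam))
```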

4.5.3 Choice of the Regularization Parameter λ Value


An appropriate procedure for choosing the λ parameter is a crucial issue in Tikhonov
regularization. The chosen procedure should be able to find a λ for which the solution
s(x) both reproduces the data set Z well and generalizes well between data
points. These goals are conflicting, and therefore the method should strike the right
balance between reproduction of the data set and generalization to data outside it.

4.5.3.1 Weighted Gradient Variance and Local Mean Square Error


In [2] we introduced a method that relates the choice of λ to the reproduction quality
at the data points close to the evaluation point x.
The method is defined as follows. Let

r_j = ||x − x_j||_2,  j = 1, . . . , N.

For a sequence of λ s that covers the singular value spectrum (σN , σ1 ) of the matrix
S from the decomposition (4.29) we calculate the Normalized Local Mean Square
Error
NLMSE_{λ,Z}(x) = sqrt( [ ∑_{j=1}^N ([s_λ(x_j) − f_j]² / f_j²) · (1/r_j²) ] / [ ∑_{j=1}^N 1/r_j² ] ),    (4.33)

and the Weighted Gradient Variance

WGV_{λ,Z}(x) = [ ∑_{j=1}^N ||∇s_λ(x_j) − G_{λ,Z}(x)||² / r_j² ] / [ ∑_{j=1}^N 1/r_j² ],

where G_{λ,Z}(x) is the weighted mean gradient at the point x, defined as

G_{λ,Z}(x) = [ ∑_{j=1}^N ∇s_λ(x_j) / r_j² ] / [ ∑_{j=1}^N 1/r_j² ],

where s_λ(x_j) is the value of the approximation constructed for λ at the point x_j from the
data set Z, and ∇ is the gradient operator. This is a discrepancy method, since only
those λ's are considered for which the value of NLMSE is smaller than a prescribed,
user-defined threshold value NLMSE_thr. The threshold value NLMSE_thr specifies the
approximation quality that has to be preserved at the points of Z nearest to the evaluation
point x. From the set of λ's for which NLMSE_{λ,Z}(x) is smaller than NLMSE_thr,
the minimizer of the oscillation measure WGV_{λ,Z}(x) is chosen as the optimum.
Figure 4.2 shows a) a data set of 30 points from a 2-parameter Rosenbrock function
optimization by the EXTREM algorithm, b) a plot of NLMSE_{λ,Z}(x) for two different
points of the domain, and c) the corresponding plots of WGV_{λ,Z}(A)
and WGV_{λ,Z}(B). Figure 4.3 shows a) the obtained approximation and b) the chosen
value of the λ parameter.
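The discrepancy test above is driven by (4.33); a direct transcription for a single evaluation point looks as follows. The names are illustrative: in the actual method the values s_vals would come from the approximation s_λ for each candidate λ.

```python
from math import sqrt

def nlmse(s_vals, f_vals, r_vals):
    """Normalized Local Mean Square Error (4.33) for one evaluation point:
    s_vals are s_lambda(x_j), f_vals the data f_j, r_vals the distances
    r_j = ||x - x_j||_2 (all assumed nonzero)."""
    num = sum(((s - f) / f) ** 2 / r ** 2
              for s, f, r in zip(s_vals, f_vals, r_vals))
    den = sum(1.0 / r ** 2 for r in r_vals)
    return sqrt(num / den)

# exact reproduction of the data gives NLMSE = 0
print(nlmse([2.0, 3.0], [2.0, 3.0], [0.5, 1.0]))  # -> 0.0
```

The WGV measure is computed analogously, with squared relative errors replaced by squared deviations of the gradients from their weighted mean.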

4.5.4 Error Bounds for Radial Basis Approximation


In the previous section we presented a method for choosing the value of the regularization
parameter λ. Unfortunately, this method neither guarantees that the
generalization error is smaller than a prescribed value ε, nor does it allow us
to estimate the generalization error directly. Similarly to Generalized
Cross Validation (cf. [18]), it estimates an error at the data points (WGV at the
points nearest to the evaluation point, GCV at all points of the data set).
Error bounds for radial basis interpolation have been studied extensively for
more than two decades. The first bounds establishing the rate of convergence of
a radial basis interpolant for functions from the native space of the underlying
radial basis function were given in [10]. The latter paper laid the foundations for
the development of the theory of convergence of radial basis interpolation (see


Fig. 4.2 a) Data set consisting of 30 points from the optimization path of the EXTREM
algorithm optimizing the 2-parameter Rosenbrock function. b) Local reproduction of the
data near points A and B measured by NLMSE_{λ,Z}(A) and NLMSE_{λ,Z}(B) respectively; here
NLMSE_thr = 10^{−5} is depicted by a straight line. c) The oscillation measure of the solution,
WGV_{λ,Z}(x), at points A and B. Dots mark the optimal λ for points A and B

e.g. [17], [19] and references therein). The rate of convergence considered
is with respect to the data-set density, i.e. the global fill distance

h(Z, Ω) := max_{y∈Ω} min_{1≤i≤N} ||y − x_i||_2,

Fig. 4.3 a) The approximation error for λ chosen by the measure WGV_{λ,Z}(x) with
NLMSE_thr = 5.0 · 10^{−6}. b) The chosen value of the λ parameter – it can be noticed that in
regions where data is sparse the method suggests a greater value of λ

where Z = {x_i}_{i=1}^N ⊆ Ω. The fill distance is a mesh norm measuring the radius of the
biggest ball contained in the domain Ω that contains no data points, and here
the domain Ω satisfies the interior cone condition with radius r and angle θ.
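The fill distance can be estimated on a finite sample of Ω; the sketch below (our own illustration, not from the chapter) approximates h(Z, Ω) for three points on the interval [0, 1].

```python
from math import dist

def fill_distance(Z, omega_samples):
    """Approximate global fill distance h(Z, Omega): the radius of the
    largest data-free ball, estimated over a finite sample of Omega."""
    return max(min(dist(y, x) for x in Z) for y in omega_samples)

Z = [(0.0,), (0.5,), (1.0,)]
samples = [(i / 100.0,) for i in range(101)]  # grid on Omega = [0, 1]
print(fill_distance(Z, samples))  # -> 0.25
```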

Definition 3. A set Ω ⊆ R^d is said to satisfy an interior cone condition if there exist
an angle θ ∈ (0, π/2) and a radius r > 0 such that for every x ∈ Ω a unit vector ξ(x)
exists for which the cone C(x, ξ(x), θ, r) := {x + ηy : y ∈ R^d, ||y||_2 = 1, y^T ξ(x) ≥
cos θ, η ∈ [0, r]} is contained in Ω.

A summary of error bounds for various radial basis functions was given in [16]. A
very precise derivation of the error bounds for Gaussian radial basis function interpolation
was given in [20]; in the latter paper one can find a derivation of all constants
involved in the bound. Analogous error estimates (with all constants involved)
for the approximation with positive definite radial basis functions constructed with
Tikhonov regularization, for functions from the Sobolev space W_p^τ, were presented in
[19]. Here we will show that the error estimates contained in [19] cannot be used
in our scheme due to the small number of points in the optimization process, and
that using the heuristic methods from the previous sections is therefore justified.
All of the error bounds rely on a common property of local polynomial reproduction
that has to be guaranteed by the approximation procedure (cf. [19]). The error
bounds can be formulated in the form of the following theorem.

Theorem 4. Suppose that Ω ⊂ R^d is bounded and satisfies an interior cone condition
with angle θ and radius r. Let m be the maximal degree of polynomials reproduced
by f_λ of the form (4.25) defined as the solution of (4.27). Define the following
quantities:

ϑ := 2 arcsin( sin θ / (4(1 + sin θ)) ),

Q(m, θ) := sin θ sin ϑ / ( 8m² (1 + sin θ)(1 + sin ϑ) ).

If the global fill distance satisfies

h(Z, Ω) ≤ Q(m, θ) r,

then the approximation error can be bounded as

||f − f_λ||_{L∞(Ω)} ≤ C [h(Z, Ω)]^{τ−d/p} |f|_{W_p^τ(Ω)} + 2ε,   where ε = max_{x_j∈Z} |f(x_j) − f_λ(x_j)|.

Let us consider the unit ball B(0, 1) as the domain of the approximation. It satisfies
the interior cone condition with r = 1 and θ = π/3. The approximation can be seen to
reproduce only linear polynomials, i.e. m = 1, and therefore for the above bound to
hold there must be

h(Z, Ω) ≤ Q(1, π/3) < 0.012613.

This means that the distance from one data point to the next must not be greater
than about 1.3% of the radius of the ball containing the whole data set. Such a number of
points cannot be generated by any local optimization algorithm.
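The constant Q(1, π/3) can be evaluated directly from the definitions in Theorem 4; the value obtained (about 0.011) is consistent with the bound 0.012613 quoted above.

```python
from math import asin, sin, pi

def Q(m, theta):
    """Constant Q(m, theta) from Theorem 4."""
    vartheta = 2.0 * asin(sin(theta) / (4.0 * (1.0 + sin(theta))))
    return (sin(theta) * sin(vartheta)) / (
        8.0 * m ** 2 * (1.0 + sin(theta)) * (1.0 + sin(vartheta)))

print(Q(1, pi / 3))  # about 0.011, consistent with Q(1, pi/3) < 0.012613
```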
The above consideration shows that the existing accurate error bounds for regularized
approximation with radial basis functions cannot be applied to estimate
the approximation error in the SPELROA method. This is due to the sparseness of the
data sets constructed by local optimization algorithms.

4.6 Numerical Results


Numerical results for real optimization problems from the LHC
magnet design process, with up to 5 parameters, were presented in [2]. Here we
present results for three test problems from the set of test problems proposed in [12].

4.6.1 Test Problems


1. Six-variable problem and eight-variable problem I
As the six-variable and the first eight-variable problem we considered the Extended Rosenbrock
function (problem no. 21 in [12]). It is defined as

f([x_1, x_2, . . . , x_d]) = ∑_{i=1}^{d/2} [ 100(x_{2i} − x_{2i−1}²)² + (1 − x_{2i−1})² ].

The standard starting point is x_0 = (ξ_j), where ξ_{2j−1} = −1.2 and ξ_{2j} = 1, and the
minimum equals f* = 0 at (1, . . . , 1).

2. Eight-variable problem II
As the second eight-variable problem we chose the Chebyquad function (problem
no. 35 in [12]). It is defined as

f(x_1, . . . , x_d) = ∑_{i=1}^d f_i(x)²,   where   f_i(x) = (1/d) ∑_{j=1}^d T_i(x_j) − ∫_0^1 T_i(x) dx

and T_i is the i-th Chebyshev polynomial shifted to the interval [0, 1]. The standard
starting point is x_0 = (ξ_j), where ξ_j = j/(d + 1), and the minimum for d = 8 equals
f* = 3.51687 · 10^{−3}.
3. Eleven-variable problem
As the eleven-variable problem we chose the Osborne 2 function (problem no. 19
in [12]). It is defined as

f(x_1, . . . , x_{11}) = ∑_{i=1}^{65} f_i(x)²,   where
f_i(x) = y_i − ( x_1 e^{−t_i x_5} + x_2 e^{−(t_i − x_9)² x_6} + x_3 e^{−(t_i − x_{10})² x_7} + x_4 e^{−(t_i − x_{11})² x_8} ),

t_i = (i − 1)/10, and y_i for i = 1, . . . , 65 are constants that can be found in [12].
The standard starting point is x_0 = [1.3, 0.65, 0.65, 0.7, 0.6, 3, 5, 7, 2, 4.5, 5.5] and
the minimum is f* = 4.01377 · 10^{−2}.
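The first two test functions are easy to transcribe; the Osborne 2 function is omitted because its 65 constants y_i are only listed in [12]. The integral of the shifted Chebyshev polynomial used below is the standard value: 0 for odd i and −1/(i² − 1) for even i.

```python
def ext_rosenbrock(x):
    """Extended Rosenbrock function (problem no. 21 in [12])."""
    return sum(100.0 * (x[2*i + 1] - x[2*i] ** 2) ** 2 + (1.0 - x[2*i]) ** 2
               for i in range(len(x) // 2))

def chebyquad(x):
    """Chebyquad function (problem no. 35 in [12])."""
    d = len(x)
    total = 0.0
    for i in range(1, d + 1):
        mean = sum(shifted_cheb(i, xj) for xj in x) / d
        integral = 0.0 if i % 2 == 1 else -1.0 / (i * i - 1.0)
        total += (mean - integral) ** 2
    return total

def shifted_cheb(i, x):
    """T_i mapped to [0, 1] via the recurrence T_{i+1} = 2tT_i - T_{i-1}."""
    t = 2.0 * x - 1.0
    t_prev, t_cur = 1.0, t
    for _ in range(i - 1):
        t_prev, t_cur = t_cur, 2.0 * t * t_cur - t_prev
    return t_cur

print(ext_rosenbrock([1.0] * 6))                  # -> 0.0 (the minimum)
print(chebyquad([j / 9.0 for j in range(1, 9)]))  # standard start, d = 8
```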

4.6.2 Results
For each problem we ran Algorithm 1 combined with the EXTREM algorithm, with the
three following objective function approximation methods:
1. radial basis function approximation without regularization,
2. radial basis function approximation with regularization, using Generalized Cross
Validation [18] to choose the λ parameter,
3. radial basis function approximation with regularization, using the Weighted Gradient
Variance to choose the λ parameter.
To construct the approximation of the objective function with one of these methods
we used 30 Gaussian radial basis functions with an equal shape parameter set
to half of the distance between the two most distant centers. No additional parameters
were required for the first and second methods. The single user-defined threshold
NLMSE_thr for the measure (4.33) was used in the third method. Apart from the parameters
concerning the construction of the radial basis approximation, we had to set up three
parameters related directly to Algorithm 1 itself: the number of initial
steps I_s = 50, ε = 10^{−3} for the ε-check procedure, and γ_thr – the threshold value
for the measure (4.24) used to detect the reliable region.
In the tables below we show the performance of Algorithm 1 for the problems defined
in the previous subsection with the above objective function approximation
strategies, compared to the EXTREM algorithm. The first column shows the number

Table 4.1 6-variable Rosenbrock function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65

EXTREM                                 Alg. 1 no regularization
step num.  ||x − x*||  f               step num.  ||x − x*||  f          num. approx.
 250       2.267853    2.885768         250       2.459117    3.711824    22
 500       0.650821    0.109817         500       1.318398    0.559350    45
 750       0.418187    0.031570         750       0.675244    0.089372    35
1000       0.246522    0.012666         933       0.130808    0.005606    10
1250       0.029985    0.001228
1523       0.002014    0.000001

Table 4.2 6-variable Rosenbrock function optimization using: Left) Algorithm 1 with regu-
larization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr =
5 · 10−6 . In both cases γthr = 0.65

Alg. 1 with GCV                                      Alg. 1 with WGV
step num.  ||x − x*||  f          num. approx.       step num.  ||x − x*||  f          num. approx.
 250       2.363547    3.991352    11                 250       2.292762    2.923941    12
 500       1.094954    0.334820    28                 500       1.250487    0.435774    16
 750       0.778636    0.162569    48                 750       0.630858    0.150665    28
1000       0.218992    0.009666    47                1000       0.513698    0.048269    38
1250       0.025537    0.000131    49                1250       0.094284    0.001763    47
1288       0.025537    0.000131    10                1262       0.094284    0.001763     4

Table 4.3 8-variable Rosenbrock function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and γthr = 0.65

EXTREM                                 Alg. 1 no regularization
step num.  ||x − x*||  f               step num.  ||x − x*||  f          num. approx.
 250       3.923525   10.438226         250       3.030179    6.376938    22
 500       2.731706    5.606541         500       2.153959    1.855113    23
 750       1.771076    1.032648         750       1.380208    0.471618    30
1000       1.009230    0.251387        1000       0.910852    0.215283    46
1250       0.484663    0.050215        1250       0.361361    0.028169    43
1500       0.274941    0.015386        1500       0.186309    0.006926    55
1750       0.262434    0.012950        1537       0.186309    0.006926     7
2000       0.208392    0.007465
3471       0.000553    0.000000

Table 4.4 8-variable Rosenbrock function optimization using: Left) Algorithm 1 with regu-
larization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr =
10−6 . In both cases γthr = 0.65

Alg. 1 with GCV                                      Alg. 1 with WGV
step num.  ||x − x*||  f          num. approx.       step num.  ||x − x*||  f          num. approx.
 250       3.100123    7.932378    17                 250       3.263247    9.379781    11
 500       2.353671    3.580721    28                 500       2.288176    2.394216    15
 750       1.810286    1.085058    26                 750       1.272541    0.435303    17
1000       0.946269    0.240704    41                1000       0.767617    0.143009    35
1200       0.705151    0.118365    43                1250       0.505116    0.057994    42
                                                     1500       0.146317    0.004114    51
                                                     1750       0.028930    0.000169    35
                                                     1861       0.028958    0.000169    12

Table 4.5 8-variable Chebyquad function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γ_thr = 0.65

EXTREM                                 Alg. 1 no regularization
step num.  ||x − x*||  f               step num.  ||x − x*||  f          num. approx.
   1       0.161512    0.038618         250       0.098420    0.006216     4
 250       0.097709    0.006187         500       0.059421    0.004684    18
 500       0.059121    0.004681         750       0.009926    0.003584    10
 750       0.016460    0.003698        1000       0.000454    0.003517    48
1000       0.000351    0.003517        1197       0.000141    0.003517    97
1237       0.000000    0.003517

of steps of the algorithm, i.e. the sum of the number of direct function evaluations
and the number of steps in which the radial basis function approximation was used.
The second column shows the distance from the minimum and the third column shows
the objective function value. The last column shows the number of steps within the
previous 250 steps in which the objective function approximation was used.
As we can see, in all cases SPELROA required considerably fewer steps to stop than the
pure EXTREM algorithm. Using the WGV method to build the radial
basis approximation gave the best convergence results, i.e. the stopping point of
SPELROA with WGV was better than the stopping points given by
the other methods. For all problems NLMSE_thr was chosen intuitively as ∼10^{−6},
meaning that the reproduction of the training set in the vicinity of the evaluation
point was at the level of 10^{−6}. For detecting the reliable region, setting γ_thr to 0.65 was
sufficient to preserve convergence of the method. In the optimization of the 8-variable
Chebyquad function it turned out that it was possible to reduce γ_thr to 0.6. That gave
16 points at which the objective function was approximated in the first 250 steps

Table 4.6 8-variable Chebyquad function optimization using: Left) Algorithm 1 with regularization
using GCV and γ_thr = 0.65, Right) Algorithm 1 with regularization using WGV with
NLMSE_thr = 10^{−6} and with γ_thr = 0.60

Alg. 1 with GCV                                      Alg. 1 with WGV
step num.  ||x − x*||  f          num. approx.       step num.  ||x − x*||  f          num. approx.
 250       0.098419    0.006216     4                 250       0.109306    0.007110    16
 500       0.059477    0.004684    17                 500       0.064139    0.004759    73
 750       0.008713    0.003584    11                 750       0.009802    0.003586    46
1000       0.00156     0.003518   118                1000       0.001266    0.003519    31
1095       0.00156     0.003518    68                1077       0.001051    0.003517    52

Table 4.7 11-variable Osborne 2 function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65

EXTREM                                 Alg. 1 no regularization
step num.  ||x − x*||  f               step num.  ||x − x*||  f          num. approx.
   1       4.755269    2.093420         250       1.379902    0.977272    29
 250       1.004450    0.081672         500       0.568322    0.059211    31
 500       0.099411    0.041034         710       0.335485    0.041512    43
 750       0.073530    0.041772
1000       0.007715    0.040138
1250       0.001092    0.040138
1434       0.000000    0.040138

Table 4.8 11-variable Osborne 2 function optimization using: Left) Algorithm 1 with regularization
using GCV, Right) Algorithm 1 with regularization using WGV with
NLMSE_thr = 10^{−6}. In both cases γ_thr = 0.65

Alg. 1 with GCV                                      Alg. 1 with WGV
step num.  ||x − x*||  f          num. approx.       step num.  ||x − x*||  f          num. approx.
 250       1.673133    0.365797    32                 250       0.953416    0.106864    14
 500       0.975217    0.045721    31                 500       0.334653    0.047136    38
 750       0.496683    0.041805    43                 750       0.0441106   0.040209    49
1000       0.068066    0.040462    35                 919       0.037960    0.040185    29
1106       0.062328    0.040177    25

instead of 4 such points when γ_thr = 0.65. An interesting result was also obtained for
the Osborne 2 function: the EXTREM algorithm found a better solution than that suggested
in [12]. Algorithm 1 did not converge to this minimum with any of the approximation
methods. The method without regularization did not converge at all, whereas GCV
and WGV converged to the minimum suggested in [12] rather than to that found by
EXTREM.

4.7 Summary
The Search Procedure Exploiting Locally Regularized Objective Approximation is
a method to speed up local optimization processes. The method combines a non-gradient
optimization algorithm with a regularized local radial basis function
approximation: in a certain number of function evaluation steps of the optimization
algorithm, a local regularized radial basis function approximation is used instead of a
direct objective function evaluation. In this chapter we presented a
proof of the convergence of the Search Procedure Exploiting Regularized Objective
Approximation, which applies to any Gauss–Seidel and conjugate direction search
algorithm that uses sequential quadratic interpolation as a line search procedure.
The convergence is proven under the assumption that the approximation of the objective
function with the prescribed relative approximation error is exploited only in
the sequential quadratic interpolation. The performance of the method was presented
on the 6- and 8-parameter Rosenbrock functions, the 8-parameter Chebyquad function and
the 11-parameter Osborne 2 function. Further studies will compare the method
with trust region methods.

Acknowledgements. I would like to thank all referees for very valuable comments.

Appendix A
The minimum of the quadratic q(ζ) built on the three points (ζ_1, f(ζ_1)), (ζ_2, f(ζ_2)) and
(ζ_3, f(ζ_3)), where {ζ_1, ζ_2, ζ_3} ⊂ R, equals

λ* = (1/2) · [ (ζ_2² − ζ_3²) f(ζ_1) + (ζ_3² − ζ_1²) f(ζ_2) + (ζ_1² − ζ_2²) f(ζ_3) ] / [ (ζ_2 − ζ_3) f(ζ_1) + (ζ_3 − ζ_1) f(ζ_2) + (ζ_1 − ζ_2) f(ζ_3) ].
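The three-point formula can be checked numerically on a parabola with a known minimum; this sketch uses the standard alternating-sign form of the formula.

```python
def parabola_min(z1, z2, z3, f1, f2, f3):
    """Minimum of the quadratic through (z1,f1), (z2,f2), (z3,f3)."""
    num = (z2**2 - z3**2) * f1 + (z3**2 - z1**2) * f2 + (z1**2 - z2**2) * f3
    den = (z2 - z3) * f1 + (z3 - z1) * f2 + (z1 - z2) * f3
    return 0.5 * num / den

f = lambda z: (z - 2.0) ** 2  # minimum at z = 2
print(parabola_min(0.0, 1.0, 3.0, f(0.0), f(1.0), f(3.0)))  # -> 2.0
```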

Let us transform the quadratic q(x) = ax² + bx + c, x = ζ_1 + t(ζ_3 − ζ_1), t ∈ (−∞, ∞),
into a quadratic q̂(x) = âx² + b̂x + ĉ assuming the following.
1. The transformation is of the form

q̂(x) = E(p(x)) + D,   p(x) = L_2(x),    (4.34)

where L_2(x) is the Lagrange interpolation parabola constructed on the points
(0, f(ζ_1)), (ζ_{(r)}, f(ζ_2)) and (1, f(ζ_3)), with ζ_{(r)} = (ζ_2 − ζ_1)/(ζ_3 − ζ_1).
2. a) If f(ζ_1) > f(ζ_3):  q̂(0) = −1, q̂(1) = −2.   b) If f(ζ_1) < f(ζ_3):  q̂(0) = −2, q̂(1) = −1.

From condition 1 we get

L_2(x) = a'x² + b'x + c'

where

a' = f(ζ_1)/ζ_{(r)} + f(ζ_2)/(ζ_{(r)}(ζ_{(r)} − 1)) + f(ζ_3)/(1 − ζ_{(r)}),
b' = −[ f(ζ_1)(ζ_{(r)} + 1)/ζ_{(r)} + f(ζ_2)/(ζ_{(r)}(ζ_{(r)} − 1)) + f(ζ_3)ζ_{(r)}/(1 − ζ_{(r)}) ],
c' = f(ζ_1).

From conditions 2 we get (note that a' + b' = f(ζ_3) − f(ζ_1)):

a)  E = −1/(a' + b'),   D = −1 − Ec';
b)  E = 1/(a' + b'),    D = −2 − Ec'.

Then in the canonical form q̂(x) = âx² + b̂x + ĉ we have â = Ea', b̂ = Eb', ĉ = Ec' + D.
The crucial properties of this transformation are:
1. q̂(ζ_{(r)}) < −1.
2. p(ζ_{(r)}) = f(ζ_2).
3. q̂(ζ_{(r)}) does not depend on f(ζ_2).
4. For the minimum point λ* of q such that λ* ∈ [ζ_1, ζ_3] we get the minimum point
   λ̂* = (λ* − ζ_1)/(ζ_3 − ζ_1), where λ̂* is the minimum point of q̂.
5. The transformation has a singularity when a' = −b'.
This transformation reduces the number of free parameters from 6 to 3.
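The construction of q̂ can be sketched as follows; note that the signs of E and D are reconstructed here directly from the normalization q̂(0) = −2, q̂(1) = −1 (the case f(ζ_1) < f(ζ_3)) and may be printed differently in the text above.

```python
def normalized_parabola(z_r, f1, f2, f3):
    """Transformation (4.34): q_hat = E * L2 + D built on the nodes
    (0, f1), (z_r, f2), (1, f3), normalized here so that
    q_hat(0) = -2 and q_hat(1) = -1 (case f1 < f3)."""
    # coefficients of the Lagrange parabola L2(x) = a*x^2 + b*x + c
    a = f1 / z_r + f2 / (z_r * (z_r - 1.0)) + f3 / (1.0 - z_r)
    b = -(f1 * (z_r + 1.0) / z_r + f2 / (z_r * (z_r - 1.0))
          + f3 * z_r / (1.0 - z_r))
    c = f1
    E = 1.0 / (a + b)   # a + b = L2(1) - L2(0) = f3 - f1
    D = -2.0 - E * c    # forces q_hat(0) = -2
    return lambda x: E * ((a * x + b) * x + c) + D

q = normalized_parabola(0.4, 1.0, 0.2, 3.0)
print(q(0.0), q(1.0), q(0.4))  # endpoints near -2 and -1; interior below -1
```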

Appendix B

B.1 Expression for Δ (ε ; ζ )


The unperturbed minimum λ*(ζ) is related to the perturbed minimum λ̃_l*(ε; ζ) by

λ̃_l*(ε; ζ) = λ*(ζ) ± Δ_l(ε; ζ).

Here we assume that both λ*(ζ) and λ̃_l*(ε; ζ) are minima of the quadratics obtained
by the transformation q̂(·) from Appendix A under the assumption f(ζ_1) < f(ζ_3);
if the opposite is true, then we have to rotate the quadratics with respect to the
center of the interval [ζ_1, ζ_3].
To simplify the derivations, let us denote A = 2(ζ_{(r)} − 1), B = q̂(ζ_{(r)}), C = ζ_{(r)}.
Then for the unperturbed quadratic we have

λ* = (1/2) · [ A(ζ_{(r)} + 1) + B − ζ_{(r)}C ] / [ A + B − C ],

whereas for the perturbed one we have

λ_1*(ε; ζ) = (1/2) · [ A(ζ_{(r)} + 1)(1 + ε) + B − ζ_{(r)}C ] / [ A(1 + ε) + B − C ],
λ_2*(ε; ζ) = (1/2) · [ A(ζ_{(r)} + 1) + B(1 + ε) − ζ_{(r)}C ] / [ A + B(1 + ε) − C ],
λ_3*(ε; ζ) = (1/2) · [ A(ζ_{(r)} + 1) + B − ζ_{(r)}C(1 + ε) ] / [ A + B − C(1 + ε) ],

for the perturbation at 0, ζ_{(r)} and 1 respectively, where ζ_{(r)} = (ζ_2 − ζ_1)/(ζ_3 − ζ_1).
We can simplify the expression |λ* − λ̃*(ε; ζ)| to get

Δ_1(ε; ζ) = [ (C − ζ_{(r)}B) / (2(A + B − C)) ] · [ Aε / (Aε + A + B − C) ],

and Δ_2(ε; ζ) and Δ_3(ε; ζ) similarly.
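The closed-form shift Δ_1 can be verified against a direct computation of |λ_1* − λ*|; this is a numerical spot check with arbitrarily chosen values of ζ_{(r)}, B and ε.

```python
def lam_star(A, B, C, z):
    """Unperturbed minimum of the normalized quadratic (Appendix B)."""
    return 0.5 * (A * (z + 1.0) + B - z * C) / (A + B - C)

def lam1_star(A, B, C, z, eps):
    """Minimum when the value at 0 is perturbed by a factor (1 + eps)."""
    return 0.5 * (A * (z + 1.0) * (1.0 + eps) + B - z * C) / (A * (1.0 + eps) + B - C)

def delta1(A, B, C, z, eps):
    """Closed-form shift Delta_1(eps; zeta) of the minimum."""
    return (C - z * B) / (2.0 * (A + B - C)) * (A * eps) / (A * eps + A + B - C)

z, B, eps = 0.4, -1.5, 0.01
A, C = 2.0 * (z - 1.0), z
shift = abs(lam1_star(A, B, C, z, eps) - lam_star(A, B, C, z))
print(shift, abs(delta1(A, B, C, z, eps)))  # the two quantities agree
```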

B.2 The Main Condition


If the inequalities (4.21) are satisfied, then we have a guarantee that λ*(ζ) < ζ_2. This is
because if λ̃_l*(ε; ζ) is shifted by Δ_l(ε; ζ) to the left, then shifting it back to the right
will not make it greater than ζ_2; if it is shifted to the right, then we have a margin of
2Δ_l(ε; ζ). As previously mentioned, we will consider only perturbations at 0 and ζ_{(r)},
i.e. l = 1 and l = 2. For l = 1 we get

[ A(ζ_{(r)} + 1)(1 + ε) + B − ζ_{(r)}C ] / [ A(1 + ε) + B − C ] < 2ζ_2.

We have to consider two cases, depending on the sign of the denominator.

1. If A(1 + ε) + B − C > 0, then we get

A(1 − ζ_{(r)})ε < A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1),

which again divides into two cases depending on the sign of the coefficient of ε:

a. when A(1 − ζ_{(r)}) > 0, then

ε < [ A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1) ] / [ A(1 − ζ_{(r)}) ];

b. when A(1 − ζ_{(r)}) < 0, then

ε > [ A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1) ] / [ A(1 − ζ_{(r)}) ].

2. If A(1 + ε) + B − C < 0, then we get two cases:

a. when A(1 − ζ_{(r)}) > 0, then

ε > [ A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1) ] / [ A(1 − ζ_{(r)}) ];

b. when A(1 − ζ_{(r)}) < 0, then

ε < [ A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1) ] / [ A(1 − ζ_{(r)}) ].

Only the upper bounds for ε, i.e. cases 1.a) and 2.b), are interpretable as a solution to our
problem. So finally we get two regions for ε and ζ_{(r)}, where

ε < [ A(ζ_{(r)} − 1) + B(2ζ_{(r)} − 1) − C(2ζ_{(r)} − 1) ] / [ A(1 − ζ_{(r)}) ],    (4.35)

for

{ A(1 + ε) + B − C > 0  and  A(1 − ζ_{(r)}) > 0 },    (4.36)

or

{ A(1 + ε) + B − C < 0  and  A(1 − ζ_{(r)}) < 0 }.    (4.37)

In the same way we can obtain conditions for l = 2:

ε < [ −B(1 − ζ_{(r)}) − A ] / [ B(1 − ζ_{(r)}) ],    (4.38)

for

{ A + B(1 + ε) − C > 0  and  B(1 − ζ_{(r)}) < 0 },    (4.39)

or

{ A + B(1 + ε) − C < 0  and  B(1 − ζ_{(r)}) > 0 }.    (4.40)

References
1. Bazan, M., Russenschuck, S.: Using neural networks to speed up optimization algo-
rithms. Eur. Phys. J. AP 12, 109–115 (2000)
2. Bazan, M., Aleksa, M., Russenschuck, S.: An improved method using radial basis func-
tion neural networks to speed up optimization algorithms. IEEE Trans. on Magnetics 38,
1081–1084 (2002)

3. Bazan, M., Aleksa, M., Lucas, J., Russenschuck, S., Ramberger, S., Völlinger, C.: In-
tegrated design of superconducting magnets with the CERN field computation pro-
gram ROXIE. In: Proc. 6th International Computational Accelerator Physics Conference,
Darmstadt, Germany (September 2000)
4. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM, Philadelphia (2005)
5. Hansen, P.C.: Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia
(1998)
6. Jacob, H.G.: Rechnergestützte Optimierung statischer und dynamischer Systeme.
Springer, Heidelberg (1982)
7. Kansa, E.J., Hon, Y.C.: Circumventing the ill-conditioning problem with multiquadric
radial basis functions: Applications to elliptic partial differential equations. Comp. Math. with
App. 39(7-8), 123–137 (2000)
8. Luenberger, D.G.: Introduction to linear and nonlinear programming, 2nd edn. Addison-
Wesley, New York (1984)
9. Micchelli, C.A.: Interpolation of Scattered Data: Distance Matrices and Conditionally
Positive Definite Functions. Constructive Approximation 2, 11–22 (1986)
10. Madych, W.R., Nelson, S.A.: Multivariate interpolation and conditionally positive defi-
nite functions II. Math. Comp. 4(189), 211–230 (1990)
11. Madych, W.R.: Miscellaneous error bounds for multiquadric and related interpolators.
Comp. Math. with Appl. 24(12), 121–138 (1992)
12. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software.
ACM Trans. Math. Software 7(1), 17–41 (1981)
13. Oeuvray, R.: Trust region method based on radial basis functions with application on
biomedical imaging, Ecole Polytechnique Federale de Lausanne (2005)
14. Polak, E.: Optimization. Algorithms and Consistent Approximations. Applied Mathe-
matical Sciences, vol. 124. Springer, Heidelberg (1997)
15. Powell, M.J.D.: On calculation of orthogonal vectors. The Computer Journal 11(3),
302–304 (1968)
16. Schaback, R.: Error estimates and condition number for radial basis function interpola-
tion. Adv. Comput. Math. 3, 251–264 (1995)
17. Schaback, R.: Native Hilbert Spaces for Radial Basis Functions I. The new development
in Approximation Theory. Birkhäuser, Basel (1999)
18. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)
19. Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting
smoothing parameters. Numerische Mathematik 101, 643–662 (2005)
20. Wendland, H.: Gaussian Interpolation Revisited. In: Kopotun, K., Lyche, T., Neamtu,
M. (eds.) Trends in Approximation Theory, pp. 427–436. Vanderbilt University Press,
Nashville (2001)
21. Zangwill, W.I.: Nonlinear Programming; a Unified Approach. Prentice-Hall Interna-
tional Series. Prentice-Hall, Englewood Cliffs (1969)
Chapter 5
Optimization Problems with Cardinality
Constraints

Rubén Ruiz-Torrubiano, Sergio Garcı́a-Moratilla, and Alberto Suárez

Abstract. In this article we review several hybrid techniques that can be used to
accurately and efficiently solve large optimization problems with cardinality con-
straints. Exact methods, such as branch-and-bound, require lengthy computations
and are, for this reason, infeasible in practice. As an alternative, this study focuses
on approximate techniques that can identify near-optimal solutions at a reduced
computational cost. Most of the methods considered encode the candidate solutions
as sets. This representation, when used in conjunction with specially devised search
operators, is specially suited to problems whose solution involves the selection of
optimal subsets of specified cardinality. The performance of these techniques is il-
lustrated in optimization problems of practical interest that arise in the fields of
machine learning (pruning of ensembles of classifiers), quantitative finance (port-
folio selection), time-series modeling (index tracking) and statistical data analysis
(sparse principal component analysis).

5.1 Introduction
Many practical optimization problems involve the selection of subsets of specified
cardinality from a collection of items. These problems can be solved by exhaustive
enumeration of all the candidate solutions of the specified cardinality. In practice,
only small problems of this type can be exactly solved within a reasonable amount of
Rubén Ruiz-Torrubiano
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: ruben.ruizt@estudiante.uam.es
Sergio Garcı́a-Moratilla
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: sergio.garcia@uam.es
Alberto Suárez
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: alberto.suarez@uam.es

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 105–130.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

time. The number of steps required to find the optimal solution can be reduced using
branch-and-bound techniques. Nevertheless, the computational complexity of the
search remains exponential, which means that large problems cannot be handled by
these exact methods. It is therefore important to design algorithms that can identify
near-optimal solutions at a reduced computational cost. In this article we present an
unified framework for handling optimization problems with cardinality constraints.
A number of approximate methods within this framework are analyzed and their
performance is tested in extensive benchmark experiments.
In its general form, an optimization problem with cardinality constraints can be
formulated in terms of a vector of binary variables z = {z1 , z2 , . . . , zD }, zi ∈ {0, 1}.
The goal is to minimize a cost function that depends on z, subject to a constraint
that specifies the number of non-zero bits in z:

    min_z F(z)   s.t.   ∑_{i=1}^{D} z_i = k.        (5.1)

Optimization problems with cardinality constraints given by an inequality ∑_{i=1}^{D} z_i ≤ K can be solved by selecting the best of the solutions of the K optimization problems with the equality constraint ∑_{i=1}^{D} z_i = k; k = 1, 2, ..., K. Finally, the solution of a combinatorial optimization problem without restrictions can be obtained by solving the sequence of problems with cardinality constraints ∑_{i=1}^{D} z_i = k; k = 1, 2, ..., D.
Continuous optimization tasks with cardinality constraints can also be analyzed
within this framework. Consider the problem of minimizing a function that depends
on a D-dimensional continuous parameter θ. We search for solutions with exactly k non-zero components of θ:

    min_θ F(θ),   θ ∈ R^D,   ∑_{i=1}^{D} I(θ_i ≠ 0) = k,          (5.2)

where I(·) is an indicator function (I(true) = 1, I(false) = 0). This hybrid problem can be transformed into a purely combinatorial one of the type (5.1) by introducing a D-dimensional binary vector z whose i-th component indicates whether variable i is allowed to take a non-zero value (z_i = 1) or is set to zero (z_i = 0):

    min_z F*(z)   s.t.   ∑_i z_i = k.                              (5.3)

The function F*(z) is the solution of an auxiliary continuous optimization problem in the reduced space defined by z:

    F*(z) = min_{θ[z]} F(θ[z]),                                    (5.4)

where θ [z] denotes the k-dimensional vector formed by the components of θ for which
the value of the corresponding component of z is 1. The remaining components of θ
are set to zero in the auxiliary problem.
This decomposition makes it clear how hybrid methods that combine techniques
for combinatorial and continuous optimization can be applied to identify the so-
lution of the subset selection problem with a continuous objective function: For a
given value of z, the optimal θ [z] is calculated by solving the surrogate problem
defined by (5.4), where z determines which components of θ are allowed to take a
non-zero value. The final solution is obtained by searching in the purely combinato-
rial space of possible values of z, using the optimal function value that is a solution
of (5.4) to guide the exploration.
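To make the decomposition concrete, consider a toy separable objective F(θ) = ∑_i (θ_i − c_i)² (this objective and all names below are illustrative assumptions, not part of the chapter). The inner problem (5.4) then has the closed-form solution θ_i = c_i on the selected components, so F*(z) = ∑_{i: z_i = 0} c_i², and for small D the outer combinatorial search can be written as an exhaustive enumeration:

```python
from itertools import combinations

def f_star(z, c):
    """Surrogate F*(z) of Eq. (5.4) for the separable quadratic
    F(theta) = sum_i (theta_i - c_i)^2: the components with z_i = 1 are
    free (optimum theta_i = c_i), the rest are fixed to zero."""
    return sum(ci * ci for zi, ci in zip(z, c) if zi == 0)

def solve_by_enumeration(c, k):
    """Outer combinatorial search of Eq. (5.3): enumerate every binary
    vector z with exactly k ones and keep the one minimizing F*(z)."""
    D = len(c)
    best_z, best_f = None, float("inf")
    for ones in combinations(range(D), k):
        z = [1 if i in ones else 0 for i in range(D)]
        f = f_star(z, c)
        if f < best_f:
            best_z, best_f = z, f
    return best_z, best_f

c = [3.0, -0.5, 2.0, 0.1, -4.0]
z, f = solve_by_enumeration(c, k=2)   # selects the two largest |c_i|
```

As expected, the search picks the components with the largest c_i², exactly the behavior the hybrid schemes of Section 5.2 try to reproduce without exhaustive enumeration.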
The success of this hybrid approach depends on the availability of a continuous
optimization algorithm that can efficiently identify the globally optimal solution
of the auxiliary optimization problem defined in (5.4) and on the efficiency of the
algorithm used to address the combinatorial part of the search. For simple forms
of the continuous objective function and of the remaining restrictions (other than
the cardinality constraint), the auxiliary problem can be efficiently solved by exact
optimization techniques. For instance, efficient linear and quadratic programming
algorithms are available if the function is linear or quadratic, respectively [1]. For
more complex objective functions, general non-linear optimization techniques (such
as quasi-Newton [2] or interior-point methods [3]) may be necessary. In these cases,
there is no guarantee that the solution of the auxiliary problem is globally optimal.
As a consequence, if the solutions found are far from the global optimum, the com-
binatorial search that is used to solve the original problem (5.3) can be seriously
misled.
In this work, we assume that the continuous optimization task defined by (5.4)
can be solved exactly and focus on the solution of the combinatorial part of the orig-
inal problem. Section 5.2 describes how standard combinatorial optimization tech-
niques can be adapted to handle the cardinality constraints considered. Emphasis is
placed on the use of an appropriate encoding for the search states in terms of sets.
This set-based encoding is particularly well-suited for the definition of search oper-
ators that preserve the cardinality of the candidate solutions. With this adaptation,
the approximate methods described provide a practicable alternative to identifying
the exact solution by exhaustive search, which becomes computationally infeasible
in large problems, or to computationally inexpensive optimization methods, such as
greedy search, which tend to find suboptimal solutions. The experiments presented
in Section 5.3 illustrate how the techniques reviewed find near-optimal solutions
with limited computational resources and can therefore be used to address optimal
subset selection problems of practical interest. Novel results regarding the applica-
tion of these techniques to some of these problems (ensemble pruning and sparse
PCA) are also provided. Finally, Section 5.4 summarizes the conclusions of this
work.

5.2 Approximate Methods for the Solution of Optimization Problems with Cardinality Constraints
In this section we describe how simulated annealing (SA), genetic algorithms (GA),
and estimation of distribution algorithms (EDA) can be used to solve large opti-
mization problems with cardinality constraints. They are stochastic search methods
that involve the generation of candidate solutions, which are then rejected or se-
lected according to their performance. In their standard formulation, no particular
consideration is given to the number of non-zero components in the candidate solu-
tions generated. Cardinality constraints can be taken into account using one of the
following approaches:
(i) No candidate solution violating the constraint is generated at any time by the al-
gorithm. Enforcing this property requires the design of appropriate genetic and
neighborhood operators, such that the space of solutions of a given cardinality
is closed under these search operators [4] [5].
(ii) Solutions that violate the cardinality constraint can be generated by the suc-
cessor operators. Whenever a violation occurs, a repair algorithm is applied to
transform the infeasible solution into a solution of the desired cardinality. Typ-
ically, a local search is used to obtain the closest feasible solution, but random
repair mechanisms can be used as well [6] [7].
(iii) Solutions that violate the cardinality constraint can be generated by the succes-
sor operators. In contrast with the previous approach, infeasible solutions are
not repaired. Instead, a penalty term is introduced on the evaluation function so
that infeasible candidate solutions have worse scores than feasible ones with an
equivalent performance [6].
In the experiments described in Section 5.3, the best overall performance is obtained
by methods that use a set-based representation together with appropriately designed
successor operators that preserve the cardinality of the solutions. These results un-
derscore the importance of using a representation that properly reflects the structure
of the problem. Therefore, the focus of this study is on the design of search opera-
tors that preserve the cardinality of the candidate solutions. These specially adapted
methods are generally preferable to standard schemes that take into account the re-
strictions by either ad-hoc repair mechanisms or by including a term in the cost
function that penalizes violations of the constraints.

5.2.1 Simulated Annealing


Simulated annealing (SA) is an optimization technique inspired by the field of ther-
modynamics [8]. The main idea is to mimic the physical process of melting a solid
and then cooling it to allow the formation of a regular crystalline structure that at-
tains a minimum of the system’s free energy. In simulated annealing the function to
be minimized F(z) (objective or cost function) takes the role of the free energy in
the physical system. The physical configuration space is replaced by the space of
candidate solutions, which are connected by transitions defined by a neighborhood

operator. The stochastic search proceeds by considering transitions from the cur-
rent state z(cur) to a neighboring configuration zl ∈ N (z(cur) ) generated at random.
The proposed transition is accepted if the value of the objective function decreases.
Otherwise, if the candidate configuration is of higher cost, the transition is accepted
only with a certain probability. This probability is expressed as a Boltzmann factor

    P_accept(z_l, z^(cur); T_k) = exp( −( F(z_l) − F(z^(cur)) ) / T_k ),        (5.5)

where the parameter Tk plays the role of a temperature. A general version of this
technique is given as Algorithm 1. In this pseudocode, the function annealingSchedule returns the temperature T_k for the following epoch. It is common to use a geometric schedule T_k = γ T_{k−1}, where the factor γ, smaller than but usually close to one, regulates how fast the temperature is decreased.

Algorithm 1. Simulated annealing


• Generate initial configuration z(0) and initial temperature T0
• z(cur) ← z(0)
• i←0
• While convergence criteria are not met [Annealing loop]
– i ← i+1
– Fix temperature for epoch i: Ti = annealingSchedule(Ti−1 )
– Fix length for epoch i: Li
– For l = 1, . . . , Li [Epoch loop]
1. Select randomly an element zl ∈ N (z(cur) ).
2. If F(zl ) < F(z(cur) ), then z(cur) ← zl
3. Else, generate u ∼ U[0, 1]
If u < Paccept (zl , z(cur) ; Ti ), then z(cur) ← zl
• Return the best value found.

Cardinality constraints can be handled in SA by selecting an appropriate encoding for z and a corresponding neighborhood N(z). In particular, the candidate
solutions can be encoded as sets of specified cardinality. The components of the
binary vector z are then interpreted as indicating membership to the set: if zi = 1,
the ith element is included in the solution. Otherwise, if zi = 0 it is excluded from
the selection. It is also necessary to design a neighborhood operator that preserves
the cardinality constraints, so that no penalty or repair mechanisms are needed. A
simple design is to exchange an element included in the current candidate solution
with an element excluded from it. This is the version of SA that will be used in
Section 5.3.
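The set-encoded SA described above can be sketched in Python as follows (the toy objective, the parameter values, and all names are illustrative assumptions; only the swap neighborhood and the geometric schedule come from the text):

```python
import math
import random

def swap_neighbour(z, rng):
    """Exchange one selected element with one unselected element, so the
    cardinality of the candidate solution is preserved."""
    inside = [i for i, zi in enumerate(z) if zi == 1]
    outside = [i for i, zi in enumerate(z) if zi == 0]
    out_i, in_i = rng.choice(inside), rng.choice(outside)
    zn = list(z)
    zn[out_i], zn[in_i] = 0, 1
    return zn

def simulated_annealing(F, D, k, T0=1.0, gamma=0.9, epochs=50,
                        epoch_len=20, seed=0):
    """Algorithm 1 with the cardinality-preserving swap neighborhood and
    the geometric annealing schedule T_k = gamma * T_{k-1}."""
    rng = random.Random(seed)
    cur = [1] * k + [0] * (D - k)
    rng.shuffle(cur)
    cur_f = F(cur)
    best, best_f, T = cur, cur_f, T0
    for _ in range(epochs):
        for _ in range(epoch_len):
            cand = swap_neighbour(cur, rng)
            cand_f = F(cand)
            # accept improvements; accept worsening moves with the
            # Boltzmann probability of Eq. (5.5)
            if cand_f < cur_f or rng.random() < math.exp(-(cand_f - cur_f) / T):
                cur, cur_f = cand, cand_f
                if cur_f < best_f:
                    best, best_f = cur, cur_f
        T *= gamma
    return best, best_f

# toy objective: prefer the k components with the largest weights
w = [5, 1, 4, 2, 8, 3]
F = lambda z: -sum(wi for wi, zi in zip(w, z) if zi)
z, f = simulated_annealing(F, D=6, k=3)
```

Every state visited has exactly k selected elements, so no penalty or repair mechanism is needed.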

5.2.2 Genetic Algorithms


Genetic algorithms are a class of optimization methods that mimic the process of natural evolution [9]. Optimization is achieved by selection from a population
that exhibits some random variability. The outline of a general genetic algorithm is
shown in Algorithm 2.

Algorithm 2. Genetic Algorithm


• Generate an initial population P0 with P individuals.
• For each individual I j ∈ P0 , calculate fitness Φ (I j ).
• Initialize the generation counter t ← 0.
• While convergence criteria are not met:
– Increase the generation counter t ← t + 1.
– Select a parent set Πt ⊂ Pt composed of nP individuals from the population.
– While Π_t ≠ ∅:
· Extract two individuals I1 and I2 from Πt .
· Apply the crossover operator Θ (I1 , I2 ) and generate nC children (with
probability pC ).
· Apply the mutation operator to the nC children (with probability pM ).
– Calculate the fitness value of the new individuals.
– Add the new individuals to the population.
– Select P individuals that make up Pt+1 , the population for generation t + 1.

For problems with cardinality constraints, two alternative encodings for the can-
didate solutions are considered. A first possibility is a standard binary representa-
tion, where the chromosomes are bit-strings. The difficulty with this encoding is
that standard mutation and crossover operators do not preserve the number of non-
zero bits of the parents. A possible solution to this problem is to assign a lower
fitness value to individuals in the population that violate the cardinality constraint.
Assuming that a problem with an inequality cardinality constraint is considered, a
penalized fitness function can be built by subtracting from the standard fitness func-
tion a penalty term that depends on the magnitude of the violation of the cardinality
constraint
    Δ_k(z) = |Card(z) − k|                                         (5.6)
The penalized fitness function is

    Φ_p(z) = Φ(z) − β f_p(Δ_k(z)),                                 (5.7)

where f_p : N → R+ is a monotonically increasing function of Δ_k(z) and β ≥ 0 represents the strength of the penalty.
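A sketch of this penalized evaluation, assuming the identity f_p(x) = x as the monotonically increasing penalty function (the function names and the toy fitness below are illustrative, not from the chapter):

```python
def penalised_fitness(phi, z, k, beta=400.0):
    """Penalised fitness of Eqs. (5.6)-(5.7) with f_p(x) = x."""
    delta_k = abs(sum(z) - k)        # magnitude of the violation, Eq. (5.6)
    return phi(z) - beta * delta_k   # Eq. (5.7)

phi = lambda z: float(sum(z))        # toy fitness: number of selected items
score_ok = penalised_fitness(phi, [1, 1, 0, 0], k=2)    # feasible: no penalty
score_bad = penalised_fitness(phi, [1, 1, 1, 1], k=2)   # delta_k = 2
```

With a sufficiently large β, any feasible individual scores better than any infeasible one of comparable raw fitness.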
Another option is to repair infeasible individuals when they are generated. Sev-
eral repair mechanisms can be defined for this purpose. For instance, an individual

can be repaired by randomly setting some bits to 0 or 1, as needed, until the cardi-
nality constraint is satisfied (random repair). Another alternative is to use a heuristic
to determine which bits must be set to 0 or to 1 (heuristic repair). The results of a
greedy optimization or the solutions of a relaxed version of the problem can also be
used to achieve this objective [10].
An alternative to binary encoding is to use the set representation introduced in
simulated annealing. The use of this representation simplifies the design of crossover
and mutation operators that preserve the cardinality of the individuals. The neighborhood operator defined for SA can be used to construct mutated individuals. Since
this operator swaps a variable in the set of selected variables with another variable in
the complement of this set, the cardinality of the original chromosome is preserved
by the mutation.
Some crossover operators on sets were introduced in [5]. They are defined taking
into account the properties of respect and assortment [11]. Respect ensures that the
offspring inherit the common genetic material of the parents. Assortment guarantees
that every combination of the alleles of the two parents is possible in the child, pro-
vided that these alleles are compatible. When cardinality constraints are considered,
it is no longer possible to design crossover operators that guarantee both respect and
assortment.
A crossover operation that provides a good balance of these properties and en-
sures that the cardinality of the parents is preserved in the offspring is random as-
sorting recombination (RAR). RAR crossover is described in Algorithm 3. In this
algorithm, the integer parameter w ≥ 0 determines the amount of common informa-
tion from both parents that is retained by the offspring. For w = 0, elements that are
present in the chromosomes of both parents are not allowed in the child. Higher val-
ues of w assign more importance to the elements in the intersection of the parents’
sets (chromosomes). In the limit w → ∞, the child contains every element that is in
both of the parents’ chromosomes with a probability that approaches 1.

5.2.3 Estimation of Distribution Algorithms


Estimation of distribution algorithms (EDAs) are a class of evolutionary methods in
which diversity is generated by a probabilistic sampling scheme [12]. Depending
on the nature of this sampling scheme, different variants of EDAs can be designed.
In this work, we consider the Population Based Incremental Learning (PBIL) algo-
rithm as a representative algorithm of the EDA family [13]. It operates on binary
chromosomes of fixed length (z) and assumes statistical independence among the
genes {zi ; i = 1, 2, . . . , D}. In generation g, the genotype of the population is char-
acterized by the probability vector p(g) , whose ith component is the probability of
assigning the value 1 to the gene in the ith position. The update of the probability
distribution using DSe g (see Algorithm 4) in PBIL is

1 M (gim )
p(g+1) = α ∑ z + (1 − α )p(g),
M m=1
(5.8)

Algorithm 3. Random Assortment Recombination algorithm


Input: Two parents I1 and I2, and a fixed cardinality k.
Output: A child chromosome Θ.
• Initialize Θ = ∅ and create the auxiliary sets A, B, C, D, E:
  – A = include-copies of the elements present in both parents.
  – B = exclude-copies of the elements present in neither parent.
  – C = include-copies and D = exclude-copies of the elements present in only one parent.
  – E = ∅.
• Build the multiset
  G = {w copies of the elements from A and B, and 1 copy of the elements in C and D}.
• While |Θ| < k and G ≠ ∅:
  – Extract g ∈ G without replacement.
  – If g ∈ A or g ∈ C, and g ∉ E, then Θ = Θ ∪ {g}.
  – If g ∈ B or g ∈ D, then E = E ∪ {g}.
• If |Θ| < k, add elements chosen at random from U − Θ until the chromosome is complete.
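A possible Python implementation of RAR on set-encoded chromosomes (the function name, the random completion step, and the handling of the include/exclude copies are illustrative assumptions based on Algorithm 3):

```python
import random

def rar_crossover(p1, p2, universe, k, w=1, seed=None):
    """Random assorting recombination (RAR_w) on set-encoded chromosomes.
    p1, p2 and the returned child are subsets of `universe` with
    cardinality k; w weights the common information of the parents."""
    rng = random.Random(seed)
    A = p1 & p2                      # in both parents: include-copies
    B = universe - (p1 | p2)         # in neither parent: exclude-copies
    C = p1 ^ p2                      # in exactly one parent: one include-
    D = set(C)                       #   copy (C) and one exclude-copy (D)
    pool = ([(a, True) for a in A] * w + [(b, False) for b in B] * w
            + [(c, True) for c in C] + [(d, False) for d in D])
    rng.shuffle(pool)                # draw without replacement
    child, excluded = set(), set()
    for elem, include in pool:
        if len(child) == k:
            break
        if include and elem not in excluded:
            child.add(elem)
        elif not include:
            excluded.add(elem)
    # complete the child at random if too few elements were accepted
    rest = list(universe - child)
    rng.shuffle(rest)
    while len(child) < k:
        child.add(rest.pop())
    return child

child = rar_crossover({0, 1, 2, 3}, {0, 1, 4, 5}, set(range(10)), k=4, w=3, seed=7)
```

Because the child is always completed to exactly k elements, the operator is closed with respect to the cardinality constraint.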

Algorithm 4. Estimation of distribution algorithm (EDA)


• Initialize the distribution P^(0)(z) that characterizes the population.
• Initialize the generation counter g ← 0.
• While convergence criteria are not met:
  – Sample a population of P individuals using P^(g)(z):

        D_g = {z^(g1), ..., z^(gP)}

  – Sort the population by non-increasing fitness value:

        D_g = {z^(g i_1), z^(g i_2), ..., z^(g i_P)},

    where i_1, i_2, ..., i_P is a reordering of the indices 1, 2, ..., P such that

        Φ(z^(g i_1)) ≥ Φ(z^(g i_2)) ≥ ... ≥ Φ(z^(g i_P))

  – Select the first M ≤ P individuals from the sorted population:

        D_g^Se = {z^(g i_1), z^(g i_2), ..., z^(g i_M)}

  – Estimate the new probability distribution P^(g+1)(z) using D_g^Se.
  – Update the generation counter g ← g + 1.
• Return the best solution found.

where z^(g i_m) represents the individual in the i_m-th position of generation g, and α ∈ (0, 1] is a smoothing parameter included to avoid strong fluctuations in the estimates of the probability distribution. Individuals are sorted by decreasing fitness values. The Univariate Marginal Distribution Algorithm (UMDA) [14] from the EDA family is recovered when α = 1.
the cardinality constraints can be enforced in the sampling of individuals. Algorithm
5 describes a sampling method that generates individuals of a specified cardinality k
from a distribution of bits characterized by the probability vector p. The application
of this method to sample new individuals guarantees that the algorithm is closed
with respect to the cardinality constraint.
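The probability update of Eq. (5.8) amounts to moving p towards the empirical bit frequencies of the M selected individuals; a minimal sketch (function name and toy values are illustrative):

```python
def pbil_update(p, selected, alpha=0.1):
    """PBIL update of Eq. (5.8): blend the current probability vector p
    with the bit frequencies of the M selected individuals."""
    M = len(selected)
    D = len(p)
    freq = [sum(ind[i] for ind in selected) / M for i in range(D)]
    return [alpha * freq[i] + (1 - alpha) * p[i] for i in range(D)]

p = [0.5, 0.5, 0.5]
selected = [[1, 0, 1], [1, 1, 0]]
p_new = pbil_update(p, selected, alpha=0.5)
```

With alpha = 1 the new vector equals the empirical frequencies, which recovers the UMDA update mentioned above.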

Algorithm 5. Sampling individuals of a specified cardinality from p.


• Initialize p̂ ← p.
• Initialize the individual x = 0.
• For i = 1, 2, ..., k:
  – Generate a random number u ∼ U[0, 1].
  – Determine the value of j such that ∑_{l=1}^{j−1} p̂_l < u ≤ ∑_{l=1}^{j} p̂_l.
  – Set x_j = 1.
  – Update the value p̂_j ← 0.
  – Renormalize

        p̂_l ← p̂_l / ∑_{m=1}^{D} p̂_m,   l = 1, 2, ..., D,

    so that p̂ can be interpreted as a probability vector (∑_{l=1}^{D} p̂_l = 1).
• Return the generated individual x.
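Algorithm 5 can be sketched in Python; rescaling the uniform draw u by the current total mass of p̂ plays the role of the explicit renormalization step (all names are illustrative, and p is assumed to have at least k positive components):

```python
import random

def sample_with_cardinality(p, k, rng=None):
    """Draw a binary individual with exactly k ones from the probability
    vector p. After each draw the chosen component is zeroed, so it
    cannot be selected again; assumes p has >= k positive entries."""
    rng = rng or random.Random()
    p_hat = list(p)
    x = [0] * len(p)
    for _ in range(k):
        # roulette-wheel draw over the remaining mass of p_hat
        u = rng.random() * sum(p_hat)
        acc, j = 0.0, None
        for idx, pj in enumerate(p_hat):
            if pj > 0.0:
                acc += pj
                j = idx
                if u < acc:
                    break
        x[j] = 1
        p_hat[j] = 0.0
    return x

x = sample_with_cardinality([0.0, 0.4, 0.0, 0.6, 0.3], k=2,
                            rng=random.Random(3))
```

Components with zero probability are never selected, and every sampled individual satisfies the cardinality constraint by construction.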

5.3 Benchmark Optimization Problems with Cardinality Constraints
This section introduces a collection of optimization problems with cardinality con-
straints that are used to illustrate the application of the methods described in the pre-
vious section. These include standard combinatorial optimization problems, such as
the knapsack problem, and real-world problems that arise in the fields of machine
learning (ensemble pruning), quantitative finance (cardinality-constrained portfolio
optimization), time series modeling (index tracking by partial replication) and statis-
tical data analysis (sparse principal components analysis). The problems considered
are either purely combinatorial or involve the optimization of continuous parame-
ters. In a purely combinatorial optimization problem F(z) can be directly evaluated
once z is known. The knapsack problem and ensemble pruning are of this type. The
remaining problems considered (portfolio selection, index tracking and sparse PCA)
are hybrid optimization tasks, in which the evaluation of F(z) for a fixed value of
the binary vector z requires the solution of an auxiliary continuous optimization

problem. While it is possible to address the combinatorial and the continuous op-
timization problems simultaneously, we concentrate on strategies that handle these
aspects separately. Therefore, the outcome of the continuous optimization algorithm
is used to guide the combinatorial optimization search, as in (5.4). For the hybrid
problems considered, the secondary optimization task can be efficiently solved in an
exact manner by quadratic programming. Nonetheless, the scheme can be directly
generalized when the evaluation of F(z) requires a more complex programming so-
lution, possibly without guarantee of convergence to the global solution of the sur-
rogate optimization problem. Under these conditions the algorithm used to address
the combinatorial part can actually be misled by the suboptimal solutions found in
the auxiliary problem.

5.3.1 The Knapsack Problem


Knapsack problems are a family of combinatorial optimization problems that in-
volve selecting a subset from a pool of items [15]. In this work, we consider the
0/1 knapsack problem, which can be shown to be NP-complete [16]. In an instance
of the 0/1 knapsack D items are available to fill up a knapsack. A profit pi and a
weight wi are associated to the i-th item, i = 1, 2, . . . , D. The objective is to identify
the subset of items whose accumulated profit is maximum and whose overall weight
does not exceed a given capacity W:

    max ∑_{i=1}^{D} p_i z_i   s.t.   ∑_{i=1}^{D} w_i z_i ≤ W,   z_i ∈ {0, 1}, i = 1, 2, ..., D.        (5.9)

Both exact and approximate methods have been used to address the 0/1 knapsack
problem. Exact algorithms based on branch-and-bound approaches and dynamic
programming are reviewed in [17]. Genetic algorithms [18, 19] and EDAs [12] have
also been used to address this problem.
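For reference, one of the exact approaches mentioned above, dynamic programming, can be sketched as follows under the additional assumption of integer weights (the function name and instance are illustrative):

```python
def knapsack_01(profits, weights, W):
    """Exact 0/1 knapsack by dynamic programming over integer capacities
    (O(D*W) time, O(W) memory). Returns the maximum achievable profit."""
    best = [0] * (W + 1)
    for p, w in zip(profits, weights):
        # iterate capacities backwards so each item is used at most once
        for cap in range(W, w - 1, -1):
            best[cap] = max(best[cap], best[cap - w] + p)
    return best[W]

profit = knapsack_01([60, 100, 120], [10, 20, 30], W=50)  # optimum is 220
```

The pseudo-polynomial dependence on W is what makes such exact methods impractical for the large instances targeted by the approximate techniques of Section 5.2.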
Cardinality constraints are generally not considered in the standard 0/1 knapsack
problem. Nevertheless, the optimum of the unconstrained problem can be obtained
by solving the D cardinality-constrained knapsack problems ∑_{i=1}^{D} z_i = k; k = 1, 2, ..., D.
The k-th element in this sequence is a knapsack problem with the restriction that
only k items can be included in the knapsack. To compare the performance of the
different optimization methods analyzed in this work, we use the testing protocol
proposed in [20] [18]. Three types of problems, defined in terms of two parameters
v, r ∈ R+ , v > 1, are considered:
(1) Uncorrelated: Weights and profits are generated randomly in [1, v].
(2) Weakly correlated: Weights are generated randomly in [1, v] and profits are
generated in the interval [wi − r, wi + r].
(3) Strongly correlated: Weights are generated randomly in [1, v] and profit pi =
wi + r.
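The three instance families can be generated, for example, as follows (the function name and seeding are illustrative; note that with the intervals given above, weakly correlated profits may occasionally be negative when w_i < r):

```python
import random

def knapsack_instance(D, kind, v=10.0, r=5.0, seed=None):
    """Generate a (profits, weights) pair of one of the three benchmark
    families: weights always uniform in [1, v]; profits uncorrelated,
    weakly correlated or strongly correlated with the weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(1.0, v) for _ in range(D)]
    if kind == "uncorrelated":
        profits = [rng.uniform(1.0, v) for _ in range(D)]
    elif kind == "weak":
        profits = [rng.uniform(w - r, w + r) for w in weights]
    elif kind == "strong":
        profits = [w + r for w in weights]
    else:
        raise ValueError("unknown instance type: %s" % kind)
    return profits, weights

profits, weights = knapsack_instance(250, "strong", seed=0)
```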
In general, knapsack problems with correlations between weights and profits are more
difficult to solve than problems in which the weights and profits are independent. We

use v = 10, r = 5 and a capacity W = 2v, which tends to include very few items in
the solution. The results reported are averages over 25 realizations of each problem,
which are solved using the different approximate methods: SA, a standard GA with
linear penalty, a GA using set encoding and the RAR operator (w = 1), and PBIL.
The conditions under which the search is conducted are determined in exploratory
experiments. A geometric annealing schedule Tk = γ Tk−1 with γ = 0.9 is used in
SA. The GAs evolve populations composed of 100 individuals. The probabilities of
crossover and mutation are p_c = 1 and p_m = 10^{−2}, respectively. In PBIL, a population
composed of 1000 individuals is used. The probability distribution is updated using
10% of the individuals. The smoothing parameter α is 0.1. Exact results, obtained with the solver SYMPHONY from the COIN-OR project [21], which implements a branch-and-cut (B&C) approach [22], are also reported for reference. In the strongly correlated
problems it was not possible to find the exact solutions within a reasonable amount
of time.

Table 5.1 Results for the 0-1 Knapsack problem with restrictive capacity

Corr.    No. Items   GA Lin.           GA RAR            SA                PBIL              B&C (exact)
                     Profit   Time     Profit   Time     Profit   Time     Profit   Time     Profit   Time
none     100         79.36    26.2     82.09    54.5     80.70    98.4     81.89    24.1     82.11    62.0
none     250         90.63    38.1     105.34   134.9    102.91   284.4    104.51   47.4     106.43   178.7
none     500         95.93    57.2     119.88   261.9    118.07   531.7    117.28   91.5     123.93   568.8
weak     100         52.97    26.9     54.38    53.9     53.53    99.87    54.33    24.2     54.43    81.8
weak     250         59.07    38.3     66.24    130.4    65.13    286.4    65.85    47.7     67.10    180.3
weak     500         60.40    56.1     74.17    266.1    73.40    531.9    72.05    87.9     76.61    560.1
strong   100         76.19    26.2     79.77    57.8     79.73    98.7     78.99    24.0     −        −
strong   250         83.98    37.8     94.20    139.5    94.15    286.0    92.39    47.3     −        −
strong   500         84.52    55.3     101.40   272.2    102.16   525.6    96.60    86.9     −        −

Table 5.1 displays the average profit obtained and the time (in seconds) to reach
a solution for each method. The experiments were performed on an AMD Turion
computer with a 1.79 GHz processor and 1 GB of RAM. None of the approximate
methods reaches the optimal profit, which is calculated using an exact branch-and-
cut method. The highest profit obtained by an approximate optimization is high-
lighted in boldface. In all cases, the algorithms that use a set encoding (GA with
RAR crossover and SA) exhibit the best performance. They also require longer
times to reach a solution, especially SA. PBIL obtains good results only in small
uncorrelated knapsack problems. This is explained by the fact that the sampling and
estimation of probability distributions becomes progressively more difficult as the
dimensionality of the problem increases. Furthermore, PBIL assumes statistical in-
dependence between the variables, which makes the algorithm perform worse on
problems in which correlations are present. The standard GA with linear penalty
has a very poor performance in all the knapsack problems analyzed.

5.3.2 Ensemble Pruning


Consider the problem of automatic induction of classifiers from a collection of in-
stances {(xn , yn ); n = 1, 2, . . . , N}, where yn is the class label of the example char-
acterized by the vector of attributes xn . The goal is to induce from these data an
autonomous system that accurately predicts the class label on the basis of the vec-
tor of attributes of a previously unseen instance. There are a number of algorithms
that can be used for learning different types of classifiers: decision trees, neural
networks, support vector machines, etc. In practice, one of the most successful
paradigms is ensemble learning [23]. Ensembles are composed of a diverse col-
lection of classifiers that are generated from the same training data by introduc-
ing variations in the algorithm used for induction or in the conditions under which
learning takes place. The outputs of the individual classifiers are then combined (for
instance, by majority voting) to produce the prediction of the ensemble. Pooling the
decisions of the ensemble members has the potential of improving the generaliza-
tion capacity of a single learner. However, ensembles are costly to generate and have
large storage requirements. Furthermore, the time required to classify an unlabeled
instance increases linearly with the size of the ensemble. Recent work has shown
that the storage requirements and classification times can be significantly reduced
by selecting a subset of classifiers whose generalization capacity is equivalent and
sometimes superior to the original complete ensemble. This process receives the
name of ensemble pruning [24], selection [25], or thinning [26].
Ensemble pruning has been a subject of great interest in the recent literature on
machine learning (see refs. in [27]). Most studies focus on the definition of appropri-
ate quantities that can be optimized on the training set to obtain pruned ensembles
with good generalization performance. The individual properties of classifiers are
not useful to guide the selection process. The generalization capacity of the pruned
ensemble crucially depends on the complementarity of the classifiers that are part of
it. The search in the space of subensembles is usually greedy. A notable exception
is [28, 29, 30], where genetic algorithms, generally with real-valued chromosomes,
are used.
Another exception is [31], where ensemble pruning is formulated as a quadratic
integer programming problem. Consider an ensemble composed of D classifiers and
a set of labeled instances. Define a matrix G, whose element Gi j is the number of
common errors between classifier i and classifier j, where i, j = 1, 2, . . . , D. The
value of the diagonal term Gii is the number of errors made by classifier i. The
matrix is then symmetrized and its elements normalized so that they are in the same
scale  
Gii 1 Gi j G ji
G̃ii = , G̃i j,i= j = + . (5.10)
N 2 Gii G j j
Intuitively, ∑i G̃ii measures the overall strength of the ensemble classifiers and
∑i j,i= j G̃i j measures their diversity. The subensemble selection problem of size k
can now be formulated as a quadratic integer programming problem

    argmin_z z^T · G̃ · z,   s.t.   ∑_{i=1}^{D} z_i = k,   z_i ∈ {0, 1}.        (5.11)

The binary variable zi indicates whether classifier i should be selected. The size of
the pruned ensemble, k, is specified beforehand. The selection process is a combinatorial optimization problem whose exact solution requires evaluating the performance of the exponentially large number, C(D, k), of subensembles of size k that can be extracted from an ensemble of size D.
nomial time by applying semi-definite programming (SDP) to a convex relaxation
of the original problem.
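The construction of G̃ in Eq. (5.10) and the evaluation of the objective of Eq. (5.11) can be sketched as follows (pure Python with illustrative names; it assumes every classifier makes at least one error, so the diagonal of G is non-zero):

```python
def normalise_g(G, N):
    """Build the symmetrised, normalised matrix G~ of Eq. (5.10) from the
    raw counts G (G[i][i] = errors of classifier i on the N instances,
    G[i][j] = errors common to classifiers i and j)."""
    D = len(G)
    Gt = [[0.0] * D for _ in range(D)]
    for i in range(D):
        Gt[i][i] = G[i][i] / N
        for j in range(D):
            if j != i:
                Gt[i][j] = 0.5 * (G[i][j] / G[i][i] + G[j][i] / G[j][j])
    return Gt

def pruning_objective(z, Gt):
    """Quadratic objective z^T G~ z of Eq. (5.11) for a binary selection z."""
    idx = [i for i, zi in enumerate(z) if zi]
    return sum(Gt[i][j] for i in idx for j in idx)

# tiny 2-classifier example on N = 10 instances
Gt = normalise_g([[2, 1], [1, 4]], N=10)
value = pruning_objective([1, 1], Gt)  # 0.2 + 0.4 + 2*0.375 = 1.35 (up to rounding)
```

This objective is what the combinatorial methods of Section 5.2 minimize over the binary vectors z of cardinality k.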
To investigate the performance in the ensemble pruning problem of the opti-
mization methods described in Section 5.2, we generate bagging ensembles for five
representative benchmark problems from the UCI repository: heart, pima, satellite,
waveform and wdbc (Breast Cancer Wisconsin) [32]. The individual classifiers in
the ensemble are trained on different bootstrap samples of the original data [33].
If the classifiers used as base learners are unstable the fluctuations in the bootstrap
sample lead to the induction of different predictors. Assuming that the errors of
these classifiers are uncorrelated, pooling their decisions by majority voting should
improve the accuracy of the predictions. In the experiments performed, bagging en-
sembles of 101 CART trees are built [34]. The original ensemble is pruned to k = 21
decision trees. The strength-diversity measure G, the time consumed in seconds and
the number of evaluations are averaged over 5 ten-fold cross-validations for heart,
pima, satellite and wdbc, and over 50 independent partitions for waveform. The suc-
cess rate is the average over 50 repetitions of the optimization for a given partition
of the data into training and testing sets.
The parameters for the metaheuristic optimization methods are determined in
exploratory experiments using the results of SDP as a gauge. For the GAs, populations with 100 individuals are evolved using a steady-state generational substitution scheme. The crossover probability is set to 1. The mutation probability is 10^{−2} for GAs with binary representation and 10^{−3} for GAs with set representation. The
strength of the penalty term in the GA with linear penalties is β = 400. If the best
individual of the final population does not satisfy the cardinality constraint, a greedy
search is performed to fulfill the restriction. The value w = 1 is used in RAR-GA. A
geometric annealing schedule with γ = 0.9 is used in SA. In these experiments, the
best solution in 10 independent executions of the SA algorithm is chosen. For PBIL,
a population of 1000 individuals is generated, where 10% of the individuals are used
to update the probability distribution. The smoothing constant is set to α = 0.1.
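The PBIL configuration described above can be sketched as follows. This is a minimal reimplementation for illustration, not the code used in the experiments; in particular, the repair step that enforces the cardinality constraint on sampled individuals is a hypothetical choice, since the text does not specify how infeasible samples are handled.

```python
import numpy as np

def repair(z, k, rng):
    """Force exactly k ones by randomly adding/removing bits (hypothetical repair rule)."""
    ones, zeros = np.flatnonzero(z == 1), np.flatnonzero(z == 0)
    if len(ones) > k:
        z[rng.choice(ones, size=len(ones) - k, replace=False)] = 0
    elif len(ones) < k:
        z[rng.choice(zeros, size=k - len(ones), replace=False)] = 1
    return z

def pbil(objective, D, k, pop=1000, frac=0.1, alpha=0.1, iters=50, seed=0):
    """Minimize a set function over binary vectors with exactly k ones."""
    rng = np.random.default_rng(seed)
    p = np.full(D, 0.5)                       # per-bit probability of sampling a 1
    best_z, best_f = None, np.inf
    for _ in range(iters):
        Z = (rng.random((pop, D)) < p).astype(int)
        Z = np.array([repair(row, k, rng) for row in Z])
        f = np.array([objective(row) for row in Z])
        elite = Z[np.argsort(f)[: max(1, int(frac * pop))]]   # best 10% of the population
        p = (1 - alpha) * p + alpha * elite.mean(axis=0)      # smoothed update, alpha = 0.1
        if f.min() < best_f:
            best_f, best_z = f.min(), Z[np.argmin(f)].copy()
    return best_z, best_f
```

A quadratic objective of the form z^T G z, as in (5.11), can be passed directly as `objective`.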
The results of the ensemble pruning experiments performed are summarized in
Table 5.2. Most of the optimization methods analyzed reach similar solutions in
all the classification problems considered, with the exception of the standard GA
with linear penalty, which obtains the worst values of the objective function. In
terms of this quantity, the best overall results correspond to SA and SDP. In terms
of efficiency, SDP should be preferred. In machine learning, the relevant measure
of performance is the generalization capacity of the classifiers generated. The test
118 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

Table 5.2 Results for the GA, SA and EDA approaches in the ensemble pruning problem

Algorithm         Problem     Best G     Success rate  Time (s)  Test Error
SA                heart       156.2940   1.00            7.575   18.06
                  pima        234.8931   0.98           14.644   23.99
                  satellite   185.1163   1.00           63.490   13.15
                  waveform    105.2465   1.00           37.621   19.64
                  wdbc        121.9183   0.88           34.085    4.50
GA                heart       157.8668   0.98            2.017   17.96
Linear Penalty    pima        235.0572   0.92            2.072   23.86
                  satellite   186.0706   1.00            0.870   12.85
                  waveform    105.8294   0.02            5.179   19.70
                  wdbc        122.7625   0.06            4.963    4.45
GA                heart       156.3127   1.00            0.851   17.87
Heuristic Repair  pima        234.8931   1.00            1.665   23.99
                  satellite   185.1163   1.00            0.520   13.15
                  waveform    105.2465   0.80            8.168   19.64
                  wdbc        121.9183   0.90            7.875    4.50
GA                heart       156.3860   1.00            0.697   17.96
RAR (w = 1)       pima        234.9190   1.00            1.381   24.09
                  satellite   185.1163   1.00            0.910   13.15
                  waveform    105.2510   0.48            6.880   19.62
                  wdbc        121.9399   0.40            6.449    4.50
PBIL              heart       156.4111   1.00           38.409   17.69
                  pima        234.9358   0.96           38.426   24.05
                  satellite   185.1163   1.00           16.400   13.15
                  waveform    105.2663   0.36           16.086   19.67
                  wdbc        122.0467   0.34           38.392    4.34
SDP               heart       156.3034   1.00            1.137   18.15
                  pima        234.8956   1.00            1.159   24.09
                  satellite   185.1163   1.00            1.230   13.15
                  waveform    104.9984   0.90            1.230   19.60
                  wdbc        121.9143   0.90            1.117    4.39

error displayed in the last column of the table provides an estimate of the error rate
in examples that have not been used to train the classifiers. Lower test errors indicate
better generalization capacity. According to this measure the ranking of methods is
rather different: classifiers that were optimal according to the objective function are
suboptimal in terms of their generalization capacity. This indicates that the learning
process is affected by overfitting, because the objective function is estimated on the
training data. Nevertheless, the generalization performance of the pruned ensembles
is very similar for all the optimization methods considered. Table 5.3 shows the test
error of a single CART tree, of a complete bagging ensemble and the range of values

Table 5.3 Test errors for CART, standard bagging and pruned bagging

Problem     CART    Bagging  Pruned bagging
heart       23.63   21.48    [17.69, 18.15]
pima        24.84   24.67    [23.86, 24.09]
satellite   13.80   14.25    [12.85, 13.15]
waveform    30.27   22.53    [19.62, 19.67]
wdbc         7.28    5.68    [4.34, 4.50]

of the test error obtained by pruned bagging ensembles of size k = 21. In all the
classification problems considered, pruned ensembles have a lower test error than
CART and complete bagging.

5.3.3 Portfolio Optimization with Cardinality Constraints


The selection of optimal investment portfolios is a problem of great interest in the
area of quantitative finance and has attracted much attention in the scientific com-
munity (see refs. in [10]). It is a multiobjective optimization task with two opposed
goals: The maximization of profit and the minimization of risk. Several methods
have been proposed to address this problem, mostly within the classical mean-
variance model developed by H. Markowitz [35]. In this framework, the returns of
the assets considered for investment are modeled as white noise. Profit is quantified
in terms of the expected return of the portfolio. The variance of the portfolio re-
turns is used as a measure of risk. In its simplest version, the problem can be solved
by quadratic programming [1]. However, if cardinality constraints are included, the
problem becomes a mixed-integer quadratic problem, which can be shown to be
NP-complete [10]:

$$\min_{z}\; w_{[z]}^{T}\, \Sigma_{[z,z]}\, w_{[z]} \tag{5.12}$$

$$\text{s.t.}\quad w_{[z]}^{T}\, \bar{r}_{[z]} = R^{*} \tag{5.13}$$

$$a_{[z]} \le w_{[z]} \le b_{[z]}, \quad a_{[z]} \ge 0,\; b_{[z]} \ge 0 \tag{5.14}$$

$$l \le A_{[z]} \cdot w_{[z]} \le u \tag{5.15}$$

$$z^{T} \cdot \mathbf{1} \le K \tag{5.16}$$

$$w^{T} \cdot \mathbf{1} = 1, \quad w \ge 0. \tag{5.17}$$

The inputs of the algorithm are r̄, the vector of expected asset returns and Σ, the
covariance matrix of the asset returns. The goal is to determine the optimal weights
of the assets in the portfolio; i.e. the value of w that minimizes the variance of the
portfolio returns (5.12), for a given value of the expected return of the portfolio, R∗
(5.13). The i-th element of the binary vector z specifies whether asset i is included in the
final portfolio (zi = 1) or not (zi = 0). Column vectors x[z] are obtained by remov-
ing from the corresponding vector x those components i for which zi = 0. Similarly,

the matrix A[z] is obtained by eliminating the i-th column of A whenever zi = 0. Fi-
nally, Σ[z,z] is obtained by removing from Σ the rows and columns for which the
corresponding indicator is zero (zi = 0). The symbols 0 and 1 denote vectors of the
appropriate size whose entries are all equal to 0 or to 1, respectively. Minimum and
maximum investment constraints, which set a lower and an upper bound on the in-
vestment of each asset in the portfolio are captured by (5.14). Vectors a and b are
D × 1 column vectors with the lower and upper bounds on the portfolio weights, re-
spectively. Inequality (5.15) summarizes the M concentration of capital constraints.
The m-th row of the M × D matrix A is the vector of coefficients of the linear combi-
nation that defines the constraint. The M × 1 column vectors l and u correspond to the
lower and upper bounds of the M linear restrictions, respectively. Concentration of
capital constraints can be used, for instance, to control the amount of capital invested
in a group of assets, so that investor preferences or limits for investment in certain
asset classes can be formally expressed. Since these constraints are linear, they do
not increase the difficulty of the problem, which can still be solved efficiently by
quadratic programming. Expression (5.16) corresponds to the cardinality constraint,
which limits the number of assets that can be included in the final portfolio. Finally,
equation (5.17) ensures that all the capital is invested in the portfolio.
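For a fixed support z, the problem (5.12)-(5.17) reduces to a continuous quadratic program over the selected assets only, which is how hybrid approaches evaluate a candidate z. The sketch below illustrates this masked subproblem; it omits the concentration constraints (5.15) for brevity and uses a generic SLSQP solver rather than a dedicated QP routine, so it should be read as an illustration, not the implementation used in these experiments.

```python
import numpy as np
from scipy.optimize import minimize

def portfolio_subproblem(Sigma, rbar, z, R_star, a, b):
    """Solve the continuous part of (5.12)-(5.17) for a fixed support z.
    Concentration constraints (5.15) are omitted for brevity."""
    idx = np.flatnonzero(z)
    S, r = Sigma[np.ix_(idx, idx)], rbar[idx]            # Sigma[z,z] and rbar[z]
    k = len(idx)
    cons = [
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},  # budget constraint (5.17)
        {"type": "eq", "fun": lambda w: w @ r - R_star}, # expected return (5.13)
    ]
    res = minimize(lambda w: w @ S @ w, np.full(k, 1.0 / k),
                   bounds=list(zip(a[idx], b[idx])),     # bound constraints (5.14)
                   constraints=cons, method="SLSQP")
    w_full = np.zeros(len(z))
    w_full[idx] = res.x                                  # re-expand to all D assets
    return w_full, res.fun
```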
The cardinality-constrained problem is difficult to solve by standard optimization
techniques. Branch-and-Bound methods can be used to find exact solutions [36]. De-
spite the improvements in efficiency, the complexity of the search is still exponential.
Genetic algorithms have also been used to address this problem: In [37], the perfor-
mance of GAs is compared to SA and to tabu search (TS) [38]. According to this in-
vestigation, the best-performing portfolios are obtained by pooling the results of the
different heuristics. In [39] SA is used to search directly in the space of real-valued
asset weights. Tabu search is employed in [40]. This work focuses on the design
of appropriate neighborhood operators to improve the efficiency of the search. In
[7, 41] Multi-Objective Evolutionary Algorithms (MOEAs) are used to address the
problem. These algorithms employ a hybrid encoding instead of a pure continuous
one and heuristic repair mechanisms to handle infeasible individuals. The impact of
local search improvements is also investigated in this work. The authors conclude
that the hybrid encoding improves the overall performance of the algorithm.
In the experiments carried out in this investigation, we address the problem of
optimal portfolio selection with lower bounds and cardinality constraints. The pa-
rameters of the constraints considered are li = 0.1, ui = 0.1, i = 1, . . . , D and K = 10.
The performance of the different optimization methods is compared by calculating
the efficient frontier for the problem with and without these constraints. Points on
the efficient frontier correspond to minimum-risk portfolios for a given expected
return, or, alternatively, to portfolios that have the largest expected return from a
family of portfolios with equal risk. As a measure of the quality of the solution ob-
tained, the average relative distance to the unconstrained efficient frontier (without
cardinality and lower bound constraints) is calculated

$$D = \frac{1}{N_F} \sum_{i=1}^{N_F} \frac{\sigma_i^{c} - \sigma_i^{*}}{\sigma_i^{*}} \tag{5.18}$$

where NF = 100 is the number of frontier points considered, σic is the solution of
the constrained problem in the i-th point of the frontier, and σi∗ is the solution of the
corresponding unconstrained problem.
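Equation (5.18) translates directly into code; a small sketch with hypothetical input names, taking the constrained and unconstrained risks at the same frontier points:

```python
import numpy as np

def frontier_distance(sigma_c, sigma_u):
    """Mean relative distance (5.18) between the constrained frontier points
    sigma_c and the unconstrained ones sigma_u (both of length N_F)."""
    sigma_c, sigma_u = np.asarray(sigma_c, float), np.asarray(sigma_u, float)
    return float(np.mean((sigma_c - sigma_u) / sigma_u))
```

For example, constrained risks uniformly 10% above the unconstrained ones give D = 0.1.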

Table 5.4 Results for the GA, SA and EDA approaches in the portfolio selection problem

Algorithm         Index      Best D       Success rate  Time (s)  Optimizations
SA                Hang Seng  0.00321150   1.00           1499.9   3.87 · 10^7
                  DAX        2.53162860   0.98           2877.3   7.63 · 10^7
                  FTSE       1.92205745   0.92           3610.4   8.87 · 10^7
                  S&P        4.69373181   0.91           3567.8   9.54 · 10^7
                  Nikkei     0.20197748   0.95           4274.5   9.25 · 10^7
GA                Hang Seng  0.00327011   0.86            750.9   1.36 · 10^7
Linear penalty    DAX        2.53314271   0.69           2999.0   4.60 · 10^7
                  FTSE       1.93255870   0.51           3539.3   5.76 · 10^7
                  S&P        4.69373181   0.76           4636.8   7.03 · 10^7
                  Nikkei     0.22992173   0.42           4811.7   6.47 · 10^7
GA                Hang Seng  0.00321150   1.00           1122.9   2.18 · 10^7
Heuristic Repair  DAX        2.53162860   1.00           4730.6   7.45 · 10^7
                  FTSE       1.92150019   0.94           6301.4   9.70 · 10^7
                  S&P        4.69373181   1.00           7860.6  11.42 · 10^7
                  Nikkei     0.20197748   0.99          10191.2  11.47 · 10^7
GA                Hang Seng  0.00321150   1.00           1200.8   2.77 · 10^7
RAR (w = 1)       DAX        2.53162860   1.00           3178.5   6.14 · 10^7
                  FTSE       1.92150019   0.95           6384.6  12.02 · 10^7
                  S&P        4.69373181   0.99           6575.6  12.34 · 10^7
                  Nikkei     0.20197748   1.00           9893.3  14.17 · 10^7
PBIL              Hang Seng  0.00321150   1.00           2292.8   5.55 · 10^7
                  DAX        2.53162860   0.94           4489.1   7.70 · 10^7
                  FTSE       1.92208910   0.85           4782.3   8.06 · 10^7
                  S&P        4.69570006   0.88           5100.2   8.28 · 10^7
                  Nikkei     0.30164777   0.43           7486.5   8.21 · 10^7

The expected returns and the covariance matrix of the components of five major
world markets included in the OR-Library [42] are used as inputs for the optimiza-
tion: Hang Seng (Hong-Kong, 31 assets), DAX (Germany, 85 assets), FTSE (UK,
89 assets), Standard and Poor’s (U.S.A., 98 assets) and Nikkei (Japan, 225 assets).
The methods compared are SA, standard GA with linear penalty, standard GA with
heuristic repair, GA with a set representation and RAR (w = 1) crossover, and PBIL.
The SA heuristic is used with a geometric annealing scheme with constant γ = 0.9.
Populations of 100 individuals are used for the GAs. The mutation and crossover
probabilities are pm = 10−2 and pc = 1, respectively. PBIL samples populations

of 400 individuals, 10% of which are used to update the probability distribution.
The heuristic repair scheme performs an unconstrained optimization without the
cardinality constraint, and then either includes in the chromosome those products
with the highest weights or eliminates the products with the smallest weights in the
unconstrained solution, as needed.
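The heuristic repair scheme just described can be sketched as follows, assuming the weights of the unconstrained solution are available. The exact tie-breaking and ordering are hypothetical details not specified in the text.

```python
import numpy as np

def repair_chromosome(z, w_unc, K):
    """Adjust support z to exactly K assets using the weights w_unc of the
    unconstrained solution: add the largest excluded weights, drop the
    smallest included ones."""
    z = z.astype(int).copy()
    while z.sum() < K:
        cand = np.flatnonzero(z == 0)
        z[cand[np.argmax(w_unc[cand])]] = 1   # include highest-weight excluded asset
    while z.sum() > K:
        cand = np.flatnonzero(z == 1)
        z[cand[np.argmin(w_unc[cand])]] = 0   # drop lowest-weight included asset
    return z
```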
Table 5.4 summarizes the results of the experiments. The value of D (5.18) displayed
in the third column is the best out of 5 executions of each of the methods considered.
The proportion of attempts in which the corresponding optimization algorithm
obtains the best known solution is given in the column labeled success rate. The last
two columns report the time employed (in seconds) and the number of quadratic
optimizations performed, respectively. In terms of the quality of the obtained solu-
tions, using a binary encoding with linear penalties performs worse than all the other
approximate methods. By contrast, the heuristic repair scheme identifies the best of
the known solutions in all the problems investigated. GA with a set representation
and RAR (w = 1) crossover also has excellent performance and is slightly more
efficient on average. High quality solutions are also obtained by SA, albeit at higher
computational cost. PBIL performs well only in problems in which the number of
assets considered for investment is small. As the dimensionality of the problem in-
creases, sampling and estimation of the probability distribution in algorithms of the
EDA family become less effective.

5.3.4 Index Tracking by Partial Replication


Index tracking is a passive investment strategy whose goal is to match the perfor-
mance of a reference financial index. The problem can be exactly solved by invest-
ing on each asset an amount of capital that is proportional to the corresponding
weight in the index. In practice, this strategy has the drawback of incurring high
initial transaction costs. Furthermore, there is an overhead in managing a portfolio
that invests in every constituent of the index. In particular, rebalancing the portfolio
can be costly if the composition of the index is revised. An alternative is to create a
tracking portfolio that invests only in a reduced set of assets. This partial replication
strategy will in general be unable to perfectly reproduce the behavior of the index.
However, a portfolio that invests in a fixed number of assets and closely follows the
evolution of the index can be obtained by minimizing the tracking error
$$\min_{w,z}\; \frac{1}{T} \sum_{t=1}^{T} \Big( \sum_{j=1}^{D} w_j\, r_j(t) - r_t \Big)^{2} \tag{5.19}$$

$$\text{s.t.}\quad \sum_{i=1}^{D} w_i = 1, \tag{5.20}$$

$$l \le A \cdot w \le u \tag{5.21}$$

$$z_i \in \{0,1\}, \quad a_i z_i \le w_i \le b_i z_i, \quad a_i \ge 0,\; b_i \ge 0, \quad i = 1, 2, \ldots, D \tag{5.22}$$

$$\sum_{i=1}^{D} z_i \le K, \tag{5.23}$$

where T is the length of the time series considered, D is the number of constituents
of the index, r j (t) is the return of asset j at time t and rt is the return of the index
at time t. Restriction (5.20) is a budget constraint, which ensures that all the cap-
ital is invested in the portfolio. Investment concentration constraints are captured
by (5.21). Expression (5.22) reflects lower and upper bound constraints. The binary
variables {z1 , z2 , . . . , zD } indicate whether an asset is included or excluded from the
tracking portfolio. Note that when zi = 0, the lower and upper bounds for the weight
of asset i are both equal to zero, which effectively excludes this asset from the
investment. The cardinality constraint is expressed by Eq.(5.23).
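The objective (5.19) is simply the mean squared deviation between the portfolio and index return series, so for a candidate weight vector it can be evaluated directly. A sketch, with hypothetical variable names:

```python
import numpy as np

def tracking_error(w, R, r_index):
    """Empirical tracking error (5.19): R is the T x D matrix of asset returns,
    r_index the length-T series of index returns, w the portfolio weights."""
    return float(np.mean((R @ w - r_index) ** 2))
```

A portfolio that fully replicates the index (w equal to the index weights) yields a tracking error of zero on the data used to fit it.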
Index tracking has been extensively investigated in the literature. The hybrid GA
with set encoding and RAR crossover described in Section 5.2 is used in [4]. In-
stead of the tracking error, this work minimizes the variance of the difference be-
tween the returns of the index and of the tracking portfolio. Optimal impulse control
techniques are used in [43]. In [44] the problem is solved by using the threshold ac-
cepting (TA) heuristic, which is a deterministic analogue of simulated annealing, in
which transitions are rejected only when they lead to a deterioration in performance
that is above a specified threshold. Evolutionary algorithms with real-valued chro-
mosome representations are used in [45]. This investigation focuses on the influence
of transaction costs and portfolio rebalancing. In [46] the portfolio optimization and
index tracking problems are addressed by means of a heuristic relaxation method
that consists in solving a small number of convex optimization problems with fixed
transaction costs. Hybrid optimization approaches to minimizing the tracking error by
partial replication are also investigated in [47, 48, 49].
In the current investigation, publicly available benchmark data from the OR-
Library [42] is used to compare the optimization techniques described in Section
5.2. Five major world market indices are used in the experiments: Hang-Seng, DAX,
FTSE, S&P and Nikkei. For each index, the time series of 290 weekly returns for
the index and for its constituents are given. From these data, the first 145 values
are used to create a tracking portfolio that includes a maximum of K = 10 assets.
The last 145 values are used to measure the out-of-sample tracking error. The pop-
ulation sizes are 350 for the GAs and 1000 for PBIL. The values of the remaining
parameters coincide with those used in the portfolio selection problem.
Table 5.5 presents a summary of the experiments performed. The best out of 5 ex-
ecutions of the different optimization methods are reported. GA with random repair
obtains the best overall results. GA with set encoding and RAR (w = 1) crossover
matches these results except in Nikkei, which is the index with the largest number of
constituents. PBIL also has a good performance, but the computational cost is higher
than for the other algorithms. In fact, the algorithm reached the maximum number
of optimizations established without converging. The results of SA and GA with
binary encoding and linear penalty are suboptimal in all but the simplest problems.
They also exhibit low success rates. In all problems investigated, the out-of-sample
error is typically larger than the in-sample error, but of the same order of magnitude.

Table 5.5 Results for the GA, SA and EDA approaches in the index tracking problem

Algorithm       Index      Best MSE        MSE             Success  Time    Number
                           In-Sample       Out-of-Sample   rate     (s)     opts.
SA              Hang Seng  1.3462 · 10^-5  2.0575 · 10^-5  0.40       1.12    19342
                DAX        8.0837 · 10^-6  7.4824 · 10^-5  0.40       1.73    27101
                FTSE       2.3951 · 10^-5  7.0007 · 10^-5  0.20       1.44     1.43
                S&P        1.6781 · 10^-5  4.7347 · 10^-5  0.20       1.97    29764
                Nikkei     2.1974 · 10^-5  1.0719 · 10^-4  0.20      95.00  1476549
GA              Hang Seng  1.3462 · 10^-5  2.0575 · 10^-5  0.60       4.15    51509
Linear Penalty  DAX        8.0837 · 10^-6  7.4824 · 10^-5  0.20      13.69   144868
                FTSE       2.7345 · 10^-5  5.3148 · 10^-5  0.20      17.66   158465
                S&P        1.7974 · 10^-5  5.2898 · 10^-5  0.20      36.89   311008
                Nikkei     2.0061 · 10^-5  1.0707 · 10^-4  0.20     123.28  1015774
GA              Hang Seng  1.3462 · 10^-5  2.0575 · 10^-5  1.00       5.92    81690
Random Repair   DAX        8.0837 · 10^-6  7.4824 · 10^-5  1.00      18.89   231840
                FTSE       2.1836 · 10^-5  8.0091 · 10^-5  0.40      21.20   255820
                S&P        1.6573 · 10^-5  5.5457 · 10^-5  0.20      47.02   508313
                Nikkei     1.8255 · 10^-5  6.9574 · 10^-5  0.20     170.62  1664696
GA              Hang Seng  1.3462 · 10^-5  2.0575 · 10^-5  1.00       4.67    51513
RAR (w = 1)     DAX        8.0837 · 10^-6  7.4824 · 10^-5  1.00      14.17   124717
                FTSE       2.1836 · 10^-5  8.0091 · 10^-5  0.40      18.83   156456
                S&P        1.6573 · 10^-5  5.5457 · 10^-5  0.20      42.31   311002
                Nikkei     1.8917 · 10^-5  8.1057 · 10^-5  0.20     175.34  1015766
PBIL            Hang Seng  1.3462 · 10^-5  2.0575 · 10^-5  1.00     167.04  2010000
                DAX        8.0837 · 10^-6  7.4824 · 10^-5  1.00     199.28  2010000
                FTSE       2.1836 · 10^-5  8.0091 · 10^-5  1.00     195.31  2010000
                S&P        1.6781 · 10^-5  4.7347 · 10^-5  0.60     314.77  2010000
                Nikkei     1.9510 · 10^-5  7.4572 · 10^-5  0.20     222.86  2010000

5.3.5 Sparse Principal Component Analysis


Principal Component Analysis (PCA) is a dimensionality reduction technique that is
frequently used in data analysis, data compression and data visualization. The goal
is to identify the directions along which the multidimensional data have the largest
variance. The principal components can be obtained by maximizing the variance of
normalized linear combinations of the original variables. Typically they have a non-
zero projection on all the original coordinates, which can make their interpretation
difficult. The goal of sparse PCA is to find principal components that have non-
zero loadings in only a small number of the original directions, while at the same
time explaining most of the variance. The first sparse principal component can be
obtained by solving the cardinality-constrained optimization problem
$$\max_{w,z}\; w_{[z]}^{T}\, \Sigma_{[z,z]}\, w_{[z]} \tag{5.24}$$

$$\text{s.t.}\quad \|w_{[z]}\|_{2} = 1 \tag{5.25}$$

$$z^{T} \cdot \mathbf{1} \le K, \tag{5.26}$$

where Σ is the data covariance matrix. As in the previous problems, the elements of
the binary vector z encode whether the principal component has a non-zero projec-
tion along the corresponding direction. Once the first principal component has been
found, if more principal components are to be calculated, the covariance matrix Σ
is deflated as follows

$$\Sigma \leftarrow \Sigma - \left(w^{T} \Sigma\, w\right) w\, w^{T} \tag{5.27}$$
and a new problem of the form given by (5.24), defined now in terms of this de-
flated covariance matrix is solved. The decomposition stops after a maximum of
Rank(Σ) iterations. In practice, the number of principal components is either spec-
ified beforehand or determined by the percentage of the total variance of the data
explained.
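For a fixed support z, problem (5.24)-(5.25) is an ordinary eigenvalue problem: the optimal loadings are the leading eigenvector of Σ[z,z], and the explained variance is the corresponding eigenvalue. The sketch below illustrates this continuous subproblem together with the deflation step (5.27); the function names are illustrative.

```python
import numpy as np

def sparse_pc_for_support(Sigma, z):
    """For a fixed support z, (5.24)-(5.25) reduces to an eigenvalue problem:
    the optimal loadings are the leading eigenvector of Sigma[z,z]."""
    idx = np.flatnonzero(z)
    vals, vecs = np.linalg.eigh(Sigma[np.ix_(idx, idx)])   # eigenvalues in ascending order
    w = np.zeros(Sigma.shape[0])
    w[idx] = vecs[:, -1]                                   # leading eigenvector
    return w, float(vals[-1])                              # loadings, explained variance

def deflate(Sigma, w):
    """Deflation step (5.27) applied before extracting the next component."""
    return Sigma - (w @ Sigma @ w) * np.outer(w, w)
```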
The problem of finding sparse principal components has also received a fair
amount of attention in the recent literature. Greedy search is used in [50]. In [51]
SPCA is formulated as a regression problem, so that LASSO techniques [52] can be
used to favor sparse solutions. In LASSO, an L1 -norm penalty for non-zero values
of the factor loadings is used. A higher weight of the penalty term in the objective
functions induces models that are sparser. However it is not possible to have a di-
rect control on the number of non-zero coefficients in the solution. The cardinality
constraint is explicitly considered in [53], which uses a method based on solving a
relaxation of the problem by semidefinite programming (SDP).
To compare the performance of the different methods analyzed, we use the bench-
mark problem introduced in [54]. Consider the sparse vector v, whose components
are

$$v_i = \begin{cases} 1, & \text{if } i \le 50 \\ 1/(i-50), & \text{if } 50 < i \le 100 \\ 0, & \text{otherwise} \end{cases} \tag{5.28}$$
A covariance matrix is built from this vector and U, a square matrix of dimensions
150 × 150 whose elements are U[0, 1] random variables

$$\Sigma = \sigma\, v v^{T} + U^{T} U, \tag{5.29}$$

where σ = 10 is the signal-to-noise ratio. In this manner, the pattern of cardinality
is partially masked by noise. In our experiments the results of SA, binary GA
with linear penalties, binary GA with random repair, set GA with RAR crossover
operator and w = 1, PBIL and DSPCA, an approximate method based on semidef-
inite programming [53, 55] are compared. SA uses a geometric annealing scheme
with γ = 0.9. The GAs use a population of 50 individuals. Crossover and muta-
tion are performed with probabilities pc = 1 and pm = 10−2 , respectively. PBIL is
executed with a population of 400 individuals and α = 0.1. In this algorithm, the

best 10% of the individuals are used to update the probability distribution. The first
sparse principal component is then calculated. For each of the methods that involve
stochastic search (all except DSPCA), the best out of 5 independent executions of
the algorithm is taken. Figure 5.1 displays the variance explained by the first sparse
principal component as a function of its cardinality K = 1, 2, . . . , 140, for all the
methods considered. GA using a linear penalty does not obtain good solutions in
this high-dimensional problem. PBIL performs slightly better, but is clearly infe-
rior to SA, GAs with random repair, GA with set encoding and DSPCA. Table 5.6
shows the detailed results for cardinality K = 50, which is the cardinality of the true
hidden pattern. In this table, the largest value of the variance achieved is highlighted
in bold. The success rates, the computation times on an AMD Turion machine with
a 1.79 GHz processor and 1 GB of RAM, and the total number of optimizations are
also given. Times for the DSPCA algorithm are not given, because a
MATLAB implementation was used [54], which cannot be directly compared with
the other results, obtained with code written in C. The GA with set encoding and
RAR (w = 1) crossover and the GA with binary encoding and random repair obtain
the best results and explain more variance than the solution obtained by DSPCA.
The first of this methods is slightly faster. SA is very fast and achieves a result
that is only slightly worse with a success rate of 100%. PBIL and GA with binary
encoding and linear penalty obtain solutions that are clearly inferior.

[Figure: variance explained by the first sparse principal component as a function of its cardinality, for GA Linear Penalty, GA Random Repair, GA RAR, SA, PBIL and DSPCA.]
Fig. 5.1 Comparison of results for the SPCA problem



Table 5.6 Results for the GA, SA, EDA and SDP approaches in the synthetic problem for K = 50

Algorithm            Best variance  Success rate  Time (s)  Optimizations
SA                   22.5727        1.00           65.95    11639
GA + Linear Penalty  19.7881        0.20          126.1      5137
GA + Random Repair   22.7423        0.80          172.1      7981
GA + RAR (w = 1)     22.7423        1.00          105.41     5146
PBIL                 20.1778        1.00          198.20    40800
SDP                  22.5001        −              −          −

5.4 Conclusions
Many tasks of practical interest can be formulated as optimization problems with
cardinality constraints. The examples analyzed in this article arise in various fields
of application: ensemble pruning, optimal portfolio selection, financial index track-
ing and sparse principal component analysis. They are large optimization problems
whose solution by standard optimization methods is computationally expensive. In
practice, using exact methods like branch-and-bound is feasible only for small prob-
lem instances. A practicable alternative is to use approximate optimization methods
that can identify near-optimal solutions at a lower computational cost: Genetic al-
gorithms, simulated annealing and estimation of distribution algorithms. However,
the search operators used in the standard formulations of these techniques are ill-
suited to the problem because they do not preserve the cardinality of the candi-
date solutions. This means that either ad-hoc penalization or repair mechanisms are
needed to enforce the constraints. Including penalty terms in the objective func-
tion distorts the search and generally leads to suboptimal solutions. Applying repair
mechanisms to infeasible configurations provides a more elegant and effective ap-
proach to the problem. Nonetheless, the best option is to use a set representation,
in conjunction with specially designed search operators that preserve the cardinality
of the candidate solutions. Some of the problems considered, such as the knapsack
problem and ensemble pruning are purely combinatorial optimization tasks. In prob-
lems like portfolio selection, index tracking and sparse PCA both combinatorial and
continuous aspects are present. For these we advocate the use of hybrid methods
that separately handle the combinatorial and the continuous aspects of cardinality-
constrained optimization problems. Among the approximate methods considered,
a genetic algorithm with set encoding and RAR crossover obtains the best overall
performance. In problems where the comparison was possible, the solutions ob-
tained are close to the exact ones and to those identified by approximate methods
that use semidefinite programming. Using the same encoding, simulated annealing
also obtains fairly good solutions, generally at a higher computational cost. This
indicates that the RAR crossover operator seems to enhance the search by introduc-
ing in the population individuals that effectively combine advantageous features of

their ancestors. Estimation of distribution algorithms, such as PBIL, perform well
on small and medium-sized problem instances. However, they fail to obtain good
solutions on large problems. The reason for this loss of efficacy is that the sampling
and estimation of probability distributions becomes progressively more difficult as
the dimensionality of the problem increases.

Acknowledgments
This research has been supported by Dirección General de Investigación (Spain),
project TIN2007-66862-C02-02.

References
1. Gill, P.E., Murray, W., Saunders, M.A., Wright, M.H.: Inertia-controlling methods for
general quadratic programming. SIAM Review 33, 1–36 (1991)
2. Gill, P., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA Journal
of Applied Mathematics 9 (1), 91–108 (1972)
3. Adler, I., Karmarkar, N., Resende, M.G.C., Veiga, G.: An implementation of Kar-
markar’s algorithm for linear programming. Mathematical Programming 44, 297–335
(1989)
4. Shapcott, J.: Index tracking: genetic algorithms for investment portfolio selection. Tech-
nical report, EPCC-SS92-24, Edinburgh, Parallel Computing Centre (1992)
5. Radcliffe, N.J.: Genetic set recombination. Foundations of Genetic Algorithms. Morgan
Kaufmann Publishers, San Francisco (1993)
6. Coello, C.: Theoretical and numerical constraint-handling techniques used with evolu-
tionary algorithms: a survey of the state of the art. Computer Methods in Applied Me-
chanics and Engineering 191, 1245–1287 (2002)
7. Streichert, F., Ulmer, H., Zell, A.: Evaluating a hybrid encoding and three crossover
operators on the constrained portfolio selection problem. In: Proceedings of the Congress
on Evolutionary Computation (CEC 2004), vol. 1, pp. 932–939 (2004)
8. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-
ence 4598, 671–679 (1983)
9. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading (1989)
10. Moral-Escudero, R., Ruiz-Torrubiano, R., Suarez, A.: Selection of optimal investment
portfolios with cardinality constraints. In: Proceedings of the IEEE World Congress on
Evolutionary Computation, pp. 2382–2388 (2006)
11. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5,
183–205 (1991)
12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms: A New Tool
for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)
13. Baluja, S.: Population-based incremental learning: A method for integrating genetic
search based function optimization and competitive learning. Technical Report CMU-
CS-94-163, Carnegie Mellon University (1994)
14. Muehlenbein, H.: The equation for response to selection and its use for prediction. Evo-
lutionary Computation 5, 303–346 (1998)
15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004)

16. Miller, R.E., Thatcher, J.W. (eds.): Reducibility among combinatorial problems, pp. 85–
103. Plenum Press (1972)
17. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research,
2271–2284 (2005)
18. Simões, A., Costa, E.: An evolutionary approach to the zero/one knapsack problem: Test-
ing ideas from biology. In: Proceedings of the Fifth International Conference on Artificial
Neural Networks and Genetic Algorithms, ICANNGA (2001)
19. Ku, S., Lee, B.: A set-oriented genetic algorithm and the knapsack problem. In: Proceed-
ings of the IEEE World Congress on Evolutionary Computation, CEC 2001 (2001)
20. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer,
Heidelberg (1996)
21. Ladanyi, L., Ralphs, T., Guzelsoy, M., Mahajan, A.: SYMPHONY (2009),
https://projects.coin-or.org/SYMPHONY
22. Padberg, M.W., Rinaldi, G.: A branch-and-cut algorithm for the solution of large scale
traveling salesman problems. SIAM Review 33, 60–100 (1991)
23. Dietterich, T.G.: An experimental comparison of three methods for constructing ensem-
bles of decision trees: Bagging, boosting, and randomization. Machine Learning 40,
139–157 (2000)
24. Margineantu, D.D., Dietterich, T.G.: Pruning adaptive boosting. In: Proc. of the 14th
International Conference on Machine Learning, pp. 211–218. Morgan Kaufmann, San
Francisco (1997)
25. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from li-
braries of models. In: Proc. of the 21st International Conference on Machine Learning,
p. 18. ACM Press, New York (2004)
26. Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: Ensemble diversity mea-
sures and their application to thinning. Information Fusion 6, 49–62 (2005)
27. Martı́nez-Muñoz, G., Lobato, D.H., Suárez, A.: An analysis of ensemble pruning tech-
niques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence 31, 245–259 (2009)
28. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than
all. Artificial Intelligence 137, 239–263 (2002)
29. Zhou, Z.H., Tang, W.: Selective ensemble of decision trees. In: Liu, Q., Yao, Y., Skowron,
A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 476–483. Springer, Heidelberg
(2003)
30. Hernández-Lobato, D., Hernández-Lobato, J.M., Ruiz-Torrubiano, R., Valle, Á.: Pruning
adaptive boosting ensembles by means of a genetic algorithm. In: Corchado, E., Yin,
H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 322–329. Springer,
Heidelberg (2006)
31. Zhang, Y., Burer, S., Street, W.N.: Ensemble pruning via semi-definite programming.
Journal of Machine Learning Research 7, 1315–1338 (2006)
32. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
34. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression
Trees. Chapman & Hall, New York (1984)
35. Markowitz, H.: Portfolio selection. Journal of Finance 7, 77–91 (1952)
36. Bienstock, D.: Computational study of a family of mixed-integer quadratic program-
ming problems. In: Balas, E., Clausen, J. (eds.) IPCO 1995. LNCS, vol. 920. Springer,
Heidelberg (1995)
130 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

37. Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M.: Heuristics for cardinality con-
strained portfolio optimisation. Computers and Operations Research 27, 1271–1302
(2000)
38. Glover, F.: Future paths for integer programming and links to artificial intelligence. Com-
puters and Operations Research 13, 533–549 (1986)
39. Crama, Y., Schyns, M.: Simulated annealing for complex portfolio selection problems.
Technical report, Groupe d’Etude des Mathematiques du Management et de l’Economie
9911, Universie de Liege (1999)
40. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Com-
putational Economics 20, 177–190 (2002)
41. Streichert, F., Tamaka-Tamawaki, M.: The effect of local search on the constrained port-
folio selection problem. In: Proceedings of the IEEE World Congress on Evolutionary
Computation (CEC 2006), Vancouver, Canada, pp. 2368–2374 (2006)
42. Beasley, J.E.: Or-library: Distributing test problems by electronic mail. Journal of the
Operational Research Society 41(11), 1069–1072 (1990)
43. Buckley, I., Korn, R.: Optimal index tracking under transaction costs and impulse con-
trol. International Journal of Theoretical and Applied Finance 1(3), 315–330 (1998)
44. Gilli, M., Këllezi, E.: Threshold accepting for index tracking. Computing in Economics
and Finance 72 (2001)
45. Beasley, J.E., Meade, N., Chang, T.: An evolutionary heuristic for the index tracking
problem. European Journal of Operations Research 148(3), 621–643 (2003)
46. Lobo, M., Fazel, M., Boyd, S.: Portfolio optimization with linear and fixed transaction
costs. Annals of Operations Research, special issue on financial optimization 152(1),
376–394 (2007)
47. Jeurissen, R., van den Berg, J.: Index tracking using a hybrid genetic algorithm. In: ICSC
Congress on Computational Intelligence Methods and Applications 2005 (2005)
48. Jeurissen, R., van den Berg, J.: Optimized index tracking using a hybrid genetic algo-
rithm. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC
2008), pp. 2327–2334 (2008)
49. Ruiz-Torrubiano, R., Suárez, A.: A hybrid optimization approach to index tracking. Ac-
cepted for publication in Annals of Operations Research (2007)
50. Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA. In: Advances in
Neural Information Processing Systems, NIPS 2005 (2005)
51. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Com-
putational and Graphical Statistics 15(2), 265–286 (2006)
52. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B 58, 267–268 (1996)
53. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: A direct formulation for
sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007)
54. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal compo-
nent analysis. Journal of Machine Learning Research 9, 1269–1294 (2008)
55. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: MATLAB code for DSPCA
(2008), http://www.princeton.edu/˜aspremon/DSPCA.htm
Chapter 6
Learning Global Optimization through a
Support Vector Machine Based Adaptive
Multistart Strategy
Jayadeva, Sameena Shah, and Suresh Chandra
Abstract. We propose a global optimization algorithm called GOSAM (Global Op-
timization using Support vector regression based Adaptive Multistart) that applies
statistical machine learning techniques, viz. Support Vector Regression (SVR) to
adaptively direct iterative search in large-scale global optimization. At each itera-
tion, GOSAM builds a training set of the objective function’s local minima discov-
ered till the current iteration, and applies SVR to construct a regressor that learns
the structure of the local minima. In the next iteration the search for the local min-
imum is started from the minimum of this regressor. The idea is that the regressor
for local minima will generalize well to the local minima not obtained so far in the
search, and hence its minimum would be a ‘crude approximation’ to the global min-
imum. This approximation improves over time, leading the search towards regions
that yield better local minima and eventually the global minimum. Simulation re-
sults on well known benchmark problems show that GOSAM requires significantly
fewer function evaluations to reach the global optimum, in comparison with meth-
ods like Particle Swarm optimization and Genetic Algorithms. GOSAM proves to
be relatively more efficient as the number of design variables (dimension) increases.
GOSAM does not require explicit knowledge of the objective function, and also
Jayadeva
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: jayadeva@ee.iitd.ac.in
Sameena Shah
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: sameena.shah@gmail.com
Suresh Chandra
Dept. of Mathematics, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: chandras@maths.iitd.ac.in
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 131–154.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
does not assume any specific properties. We also discuss some real world applica-
tions of GOSAM involving constrained and design optimization problems.
6.1 Introduction and Background Research
Global optimization involves finding the optimal or best possible configuration from
a large space of possible configurations. It is among the most fundamental of compu-
tational tasks, and has numerous applications including bio-informatics [26],
robotics [20], portfolio optimization [5], VLSI design [31], and nearly every
engineering application [27].
If the search space is small, then obtaining the global optimum is trivial; otherwise
some special structure like linearity, convexity or differentiability of the problem
needs to be exploited. Classical mathematical optimization techniques are based
on utilizing such special structures. Difficulties in obtaining the global optimum
arise when the objective function has no special structure, lacks properties like
continuity and differentiability, or has numerous local optima that
obstruct the search for the global optimum [25]. Such objective functions are common
in many applications including data mining, location problems, and computational
chemistry, amongst others [13, 15]. Similar difficulties arise if some structure exists
but is not known a priori.
For the global optimization of these kinds of objective functions, one utilizes
the broad class of general-purpose algorithms called local search algorithms [29].
Global optimizers typically depend on local search algorithms that search multiple
states in the neighborhood of a given state or configuration. Such methods find a
local optimum, which depends on the starting state. Local search generally does not
yield the global optimum, because it gets stuck in a local optimum. Therefore, local
search methods are usually augmented with some strategy to escape from local op-
tima. For instance, in simulated annealing (SA), first introduced by [22], the escape
strategy is a probabilistic shaking that is associated with a temperature parameter.
The “temperature” is reduced iteratively from a high initial value, based on a cooling
schedule.
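The accept/cool loop just described can be sketched in a few lines. The neighbourhood move, the initial temperature, and the geometric cooling rate below are illustrative choices for a one-dimensional objective, not parameters taken from [22]:

```python
import math
import random

def simulated_annealing(f, x0, t0=10.0, cooling=0.95, steps=2000, seed=0):
    """Minimize f: accept a worse neighbour with probability exp(-delta / T)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        cand = x + rng.uniform(-1.0, 1.0)      # random neighbourhood move
        fc = f(cand)
        delta = fc - fx
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x, fx = cand, fc                   # accept the move
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling                           # geometric cooling schedule
    return best_x, best_f

# a multimodal test function with global minimum f(0) = -10
wave = lambda x: (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)
x_star, f_star = simulated_annealing(wave, x0=5.0)
```

Early on, when the temperature is high, almost any uphill move is accepted; as the temperature decays, the acceptance test hardens into plain descent.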
On the other hand, though local search strategies suffer from entrapment in local
optima, they are very fast. The time required to run an iteration of simulated an-
nealing is sufficient for several iterations of gradient descent. Therefore, instead of
escaping from local optima, an alternative is the use of multi-restart local search ap-
proaches. These start from a new initial state once a local search step has terminated
in a local optimum. Multistart approaches are known to outperform other strategies
on some problems, e.g. simulated annealing on the Travelling Salesperson Problem
(TSP) [19].
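A bare-bones multi-restart scheme looks like the following; the fixed-step greedy descent stands in for whatever local search procedure is actually used, and all constants are illustrative:

```python
import math
import random

def local_descent(f, x, step=0.01, max_iter=10000):
    """Greedy fixed-step descent: move while a neighbour improves f."""
    for _ in range(max_iter):
        if f(x - step) < f(x):
            x -= step
        elif f(x + step) < f(x):
            x += step
        else:
            break                    # local minimum at this step resolution
    return x

def multistart(f, lo, hi, restarts=20, seed=0):
    """Restart local descent from random points; keep the best local minimum."""
    rng = random.Random(seed)
    best = local_descent(f, rng.uniform(lo, hi))
    for _ in range(restarts - 1):
        x = local_descent(f, rng.uniform(lo, hi))
        if f(x) < f(best):
            best = x
    return best

# multimodal test function with global minimum f(0) = -10
wave = lambda x: (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)
x_best = multistart(wave, -10.0, 10.0)
```

Each restart is independent: no information from earlier descents influences where the next one begins, which is precisely the limitation GOSAM addresses.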
The performance of any local search procedure depends on the starting state, and
multi-restart local search algorithms start from a randomly chosen state. None of the
above mentioned approaches exploit knowledge of the space that has been explored
so far, to guide further search. In other words, their search strategy does not evolve
with time. A question that comes to mind is whether, based on some knowledge
collected about the function, it is possible to generate a start state that is better than
a random one. If the answer is in the affirmative, then successive iterations will lead
us closer to the global minimum.
Evolutionary algorithms like Particle Swarm optimization (PSO), Genetic Al-
gorithms (GA) and Ant Colony optimization (ACO), are distributed iterative search
algorithms, which indirectly use some form of information about the space explored
so far, to direct search. Initially, there is a finite number of “agents” that search for
the global optimum. The paths of these agents are dynamically and independently
updated during the search based on the results obtained till the current update.
PSO, developed by Kennedy and Eberhart [21], is inspired by the flocking behav-
ior of birds. In PSO, particles start search in different regions of the solution space,
and every particle is made aware of the best local optimum amongst those found
by its neighbors, as well as the global optimum obtained up to the current iteration.
Each particle then iteratively adapts its path and velocity accordingly. The algorithm
converges when a particle finds a highly desirable path to proceed on, and the other
particles effectively follow its lead.
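The update described above can be sketched as a standard global-best PSO; the inertia weight w and acceleration coefficients c1, c2 below are common textbook values rather than the toolbox defaults used later, and the sphere test function is only for illustration:

```python
import random

def pso(f, dim, lo, hi, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO: pull each particle toward its own best and the swarm's best."""
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # personal best positions
    gbest = min(pbest, key=f)[:]               # swarm best position
    for _ in range(iters):
        for i, x in enumerate(xs):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - x[d])
                            + c2 * r2 * (gbest[d] - x[d]))
                x[d] = min(max(x[d] + vs[i][d], lo), hi)   # clamp to bounds
            if f(x) < f(pbest[i]):
                pbest[i] = x[:]
                if f(x) < f(gbest):
                    gbest = x[:]
    return gbest

sphere = lambda x: sum(xi * xi for xi in x)    # convex test function, minimum at 0
g = pso(sphere, dim=3, lo=-10.0, hi=10.0)
```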
Genetic Algorithms [12] are motivated by the biological evolutionary opera-
tions of selection, mutation and crossover. In real life, the fittest individuals tend
to survive, reproduce and improve over generations. Based on this, “chromosomes”
that yield better optima are considered to correspond to fitter individuals, and are
used for creating the next generation of chromosomes that hopefully lead us to bet-
ter optima. The population of chromosomes is updated till convergence, or until a
specified number of updates is completed.
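A minimal real-coded GA with the selection, crossover, and mutation operations just described might look as follows; tournament selection, uniform crossover, and Gaussian mutation are one common set of operator choices, and all parameters are illustrative:

```python
import random

def ga(f, dim, lo, hi, pop_size=30, gens=100, mut_rate=0.1, seed=0):
    """Real-coded GA: tournament selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    def tournament():
        a, b = rng.sample(pop, 2)
        return a if f(a) < f(b) else b         # fitter of two random individuals
    for _ in range(gens):
        elite = min(pop, key=f)                # elitism: keep the fittest
        nxt = [elite[:]]
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            child = [p1[d] if rng.random() < 0.5 else p2[d] for d in range(dim)]
            for d in range(dim):               # Gaussian mutation, clamped to bounds
                if rng.random() < mut_rate:
                    child[d] = min(max(child[d] + rng.gauss(0.0, 1.0), lo), hi)
            nxt.append(child)
        pop = nxt
    return min(pop, key=f)

sphere = lambda x: sum(xi * xi for xi in x)
best = ga(sphere, dim=3, lo=-10.0, hi=10.0)
```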
Ant colony optimization [10] mimics the behavior of a group of ants following
the shortest path to a food source. Ants (agents) exchange information indirectly,
through a mechanism called “stigmergy”, by leaving a trail of pheromone on the
paths traversed. States believed to be good are marked by heavier concentrations
of pheromone to guide the ants that arrive later. Therefore, decisions that are taken
subsequently get biased by previous decisions and their outcomes.
Some heuristic techniques use an alternate approach to guide further search by
application of machine learning techniques on past search results. Machine learning
techniques help in discovering relationships by analyzing the search data that other
techniques may ignore. If any relationship exists, then it could be exploited to re-
duce search time or improve the quality of optima. For this task, some papers try
to understand the structure of the search space, while others try to tune algorithms
accordingly (cf. [4] for a survey of these algorithms).
Boyan used information about the complete trajectories to local minima and the
corresponding values of the local minima reached, to construct evaluation functions
[8], [9]. The minimum of the evaluation function determined a new starting point.
Optimal solutions were obtained for many combinatorial optimization problems like
bin packing, channel routing, etc.
Agakov et al. [3] gave a compiler optimization algorithm that trains on a set of
computer programs and predicts which parts of the optimization space are likely
to give large performance improvements for programs. Boese et al. [6] explored the
use of local minima to adapt the optimization algorithm. For graph bisection and the
TSP, they found a “big valley” structure to the set of minima. Using this information
they were able to hand code a strategy to find good starting states for these problems.
Is this possible for other problems as well? The proposed work is motivated by
the question: for any general global optimization problem, is there a structure
to the set of local optima? If so, can it be learnt automatically through the use
of machine learning?
We propose a new algorithm for the general global optimization problem, termed
Global Optimization using Support vector regression based Adaptive Multistart
(GOSAM). GOSAM attempts to learn the structure of local minima based on local
minima discovered during earlier local searches. GOSAM uses Support Vector Ma-
chine based learning to learn a Fit function (regressor) that passes through all the
local minima, thereby learning the structure of the locations of local minima. Since
the regressor can only learn the structure of the local minima encountered till the
present iteration, the idea is that the regressor for local minima will generalize well
to the local minima not obtained so far in the search. Consequently, its minimum
would be a ‘crude approximation’ to the global minimum of the objective function.
In the next iteration the search for the local minimum is started from the minimum of
this regressor. The new local minimum obtained is added as a new training point and
the Fit function is re-constructed. Over time, this approximation gets better, leading
the search towards regions that yield better local minima and eventually the global
minimum. Surprisingly, for most problems this algorithm tends to direct search to
the region containing the global minimum in just a few iterations and is significantly
faster than other methods. The results reinforce our belief that many problems have
some ‘structure’ to the location of local minima, which can be exploited in direct-
ing further search. It is important to emphasize that GOSAM’s approach is differ-
ent from approximating a fitness landscape; GOSAM attempts to predict how local
minima are distributed, and where the best one might lie. This turns out to be very
efficient in practice.
In this chapter, we wish to demonstrate the same by testing on many benchmark
global optimization problems against established evolutionary methods. The rest of
the chapter is organized as follows. Section 6.2 discusses the proposed algorithm.
Section 6.3 is devoted to GOSAM’s performance on benchmark optimization prob-
lems, as well as a comparison with GA and PSO. Section 6.4 extends the algorithm
for constrained optimization problems. Section 6.5 demonstrates how the algorithm
may be applied to design optimization problems. Section 6.6 is devoted to a general
discussion on the convergence of GOSAM to the global optimum, while Section 6.7
contains concluding remarks.
6.2 Global Optimization with Support Vector Regression Based Adaptive Multistart (GOSAM)
The motivation of the proposed algorithm is to use the information about the local
minima encountered in earlier steps, to predict the location of other better minima.
We denote the objective function to be minimized by f (x), where x = (xi , ∀i = 1, . . . , n)
is an n-dimensional vector of variables. We assume that the lower and upper bounds
of each of these variables are known. For an unconstrained optimization problem, the
feasible region is the complete search space that lies within the lower and upper
bounds of all variables.
We now summarize the flow of the GOSAM algorithm. At each iteration, the
algorithm performs a local search1, starting from a location termed the start-state,
and then determines the start-state for the next iteration.
GOSAM Algorithm for Minimization of a Multivariate Function
1. Initialize start-state to an initial guess, possibly chosen randomly, lying in the
search space.
2. Starting from start-state, obtain a locally optimal solution using a local search
procedure. Term this solution as current-local-optimum.
3. Store current-local-optimum (x∗i , ∀i = 1 , . . . , n) and the corresponding function
value ( f (x∗ )) in the training set.
4. Apply Support Vector Regression treating all the local optima collected so far, as
the independent variables and their corresponding function values as the target
values. The regressor obtained will be called the current Fit function.
5. Obtain the minimum of the current Fit function using a local search procedure.
6. Set start-state to the minimum obtained in Step 5. If this minimum is out of bounds
or is the same as that obtained in the previous r runs, set the start-state to a random
one.
7. If the termination criteria have been met, proceed to Step 8. Otherwise, go to
Step 2.
8. Return the best local minimum obtained.
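The eight steps above can be sketched for a one-dimensional objective as follows. Two deliberate, hypothetical simplifications are made: a least-squares quadratic fit stands in for the SVR regressor of Step 4 (with a quadratic kernel the Fit function is in any case a quadratic model), and a fixed-step greedy descent stands in for the local search procedure:

```python
import math
import random

def descend(f, x, step=0.01, max_iter=10000):
    """Fixed-step greedy descent, standing in for the local search procedure."""
    for _ in range(max_iter):
        if f(x - step) < f(x):
            x -= step
        elif f(x + step) < f(x):
            x += step
        else:
            break
    return x

def solve3(A, b):
    """Solve a 3x3 linear system by Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(A)
    out = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = b[r]
        out.append(det(M) / d)
    return out

def fit_quadratic(pts):
    """Least-squares fit y ~ a x^2 + b x + c (the 'Fit function' stand-in)."""
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for x, y in pts:
        phi = (x * x, x, 1.0)
        for i in range(3):
            rhs[i] += phi[i] * y
            for j in range(3):
                A[i][j] += phi[i] * phi[j]
    return solve3(A, rhs)

def gosam_1d(f, lo, hi, iters=20, r=3, seed=0):
    rng = random.Random(seed)
    start = rng.uniform(lo, hi)                 # Step 1: random start-state
    minima, starts = [], []
    for _ in range(iters):
        xm = descend(f, start)                  # Step 2: local search
        minima.append((xm, f(xm)))              # Step 3: store (x*, f(x*))
        nxt = None
        if len({round(x, 6) for x, _ in minima}) >= 3:
            a, b, _ = fit_quadratic(minima)     # Step 4: build the Fit function
            if a > 1e-9:                        # convex fit: its minimum exists
                nxt = -b / (2.0 * a)            # Step 5: minimum of the Fit
        # Step 6: fall back to a random start-state if the Fit minimum is
        # unavailable, out of bounds, or (nearly) repeats a recent start.
        if (nxt is None or not lo <= nxt <= hi
                or any(abs(nxt - s) < 1e-3 for s in starts[-r:])):
            nxt = rng.uniform(lo, hi)
        starts.append(nxt)
        start = nxt
    return min(minima, key=lambda p: p[1])      # Step 8: best local minimum

# local minima of this objective lie close to the parabola 0.1 x^2 - 1,
# so even a quadratic Fit predicts the global basin well
obj = lambda x: math.sin(5.0 * x) + 0.1 * x * x
x_star, f_star = gosam_1d(obj, -10.0, 10.0)
```

The sketch keeps GOSAM's control flow intact: the training set of local minima grows each iteration, the regressor is refit, and its minimum becomes the next start-state unless the fallback rule of Step 6 fires.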
Initially, we generate a feasible random state and assign it to start-state. In step 2,
we perform local search on the objective function starting from the start-state. The
search terminates at a local minimum. We store the local minimum and the corre-
sponding value of the objective function in an array. This constitutes our training
data. We treat local minima as data points and their corresponding function values
as target values. In Step 4, we use Support Vector Regression (SVR) to perform
regression on the training data comprising the local minima obtained till the cur-
rent iteration. In general, fitting the local minima with a linear SVR regressor would
incur a large error. Nonlinear regression can be achieved with a wide choice of
kernel functions, and choice of the kernel will also impact the number of function
evaluations required to reach the global optimum. In all our experiments, we chose
the most commonly used 2nd-degree polynomial kernel, also termed a quadratic
kernel, primarily to simplify computation. It also facilitates Step 5 of the GOSAM
algorithm, which requires minimization of the SVR regressor. For a polynomial ker-
nel of degree 2, the problem to be minimized is a quadratic one, which can be solved
efficiently.
We found that GOSAM is not handicapped by the choice of kernel, and that a
quadratic kernel worked well over a wide range of problems. All the local minima
1 To the best of our knowledge, there is no restriction on the choice of local search procedure
used.
obtained till the current iteration are treated as training samples, and their corre-
sponding function values as the target values. The regressor obtained in Step 4,
termed as the current Fit function, approximates local minima of the objective func-
tion. In the limit that all local minima are known, SVR will construct a regressor
that passes through all local minima of the objective function. The global minimum
of this function would then correspond to the global minimum of the original objec-
tive function. If we knew all the local minima, then regression is not required and
one can easily determine the best local minimum. We utilize only the information
of the few local minima obtained through local search till the current iteration. We
then rely on the excellent generalization properties of SVRs to predict how the local
minima are distributed. Search is redirected to the region containing the minimum
of the regressor or the Fit function. Because of the limited size of the training set,
this regressor will not be an exact approximation of the local minima of the objective
function. However, over successive iterations, the Fit function tends to better local-
ize the global minimum of the function being optimized. This is demonstrated by
the experiments presented in section 6.3, that show that the ‘predictor’ turns out to
be so good that search terminates in the global minimum within very few iterations.
Apart from the generalization ability of SVRs, which is imperative in predicting
better starting points and finding the global optimum quickly, the choice of using
SVR for function approximation is also motivated by the fact that the regressors ob-
tained using SVR are generally very simple and can be constructed by using only a
few support vectors. Since minimization of the Fit function requires evaluating it at
several points, the use of only the support vectors contributes to computational ef-
ficiency. Regardless of the complexity of the kernel used, the optimization problem
that needs to be solved remains a convex quadratic one, because only a kernel matrix
that contains an inner product between the points is required. The meagre amounts
of data to be fit, i.e. the small number of local minima and their corresponding func-
tion values, also contribute to making the process fast and efficient.
In step 5, we minimize the Fit obtained and reset the start-state to its minimum.
If the local minimum obtained from this start-state is the same as the one obtained
in the previous r iterations, or out of bounds, then we conclude that the search has
become too localized, and needs to explore other regions to discover new minima.
In such a case, we reset the start-state to a random state.
6.3 Experimental Results
GOSAM was encoded in MATLAB and run on an 800 MHz Pentium III PC with
256 MB RAM. For all our test cases, we used the local minimizer of LINDO API
[23]. We tested GOSAM on a number of difficult global optimization benchmark
problems, which are available online at [11, 24]2 . Most of the benchmark problems
available in [11, 24] are parameterized in the number of variables, and can thus
2 The websites also include visualizations of the two dimensional examples, a discussion of
why each of these problems is difficult, and a mention of the estimated number of local
minima of the benchmarks.
be extended to any dimension. We first illustrate the working of GOSAM on one
and two variable examples that possess several local minima. These examples will
help visualize GOSAM’s working. We then discuss results for higher dimensional
benchmarks.

6.3.1 One Dimensional Wave Function
Figures 6.1 through 6.4 show the objective function f (x) = (|x|−10)cos(2π x) (taken
from [8]) as a dotted wave. The bounds for this problem were taken to be x = -10.0
to x = 10.0. As seen in Figs. 6.1 - 6.4, the objective function has several local min-
ima, but the global minimum lies at x = 0. We now show how GOSAM finds the
global minimum for this example.
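The function values quoted in the figure captions (f(-1) = -9 at the first local minimum, f(0) = -10 at the global minimum) can be checked directly:

```python
import math

def wave(x):
    """Objective from [8]: f(x) = (|x| - 10) cos(2 pi x)."""
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)

print(wave(-1.0))   # local minimum found in iteration 1: -9.0
print(wave(0.0))    # global minimum: -10.0
```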
Fig. 6.1 Iteration 1 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = -1.3444 (indicated
by the circle) and terminated at the local minimum x = -1.0 where f (x) = -9.0. Using only
this one local minimum in the training set, the regressor obtained till the end of iteration 1 is
shown by the solid line
The initial randomly chosen starting state is x = -1.3444. This is shown as the
circled point in Fig. 6.1. Local search from this point led to the local minimum at
x = -1.0, indicated by a square in Fig. 6.1. At this point, the objective function has a
value of f (−1.0) = -9.0. Using only one local minimum in the training set, the SVR
regressor that was obtained is shown by the solid line parallel to the x-axis. Since
this regressor has a constant function value, its minimum is the same everywhere;
therefore, any random point can be selected as the minimum. In our simulation,
the random point returned was x = -6.3. Local search from this point terminated at
the local minimum x = -6.0. The regressor obtained using these two points led to
a minimum at the boundary. In cases when the minimum is at a boundary, we find
that one can start the next local search from either the boundary point, or from
a random new starting point. The search for the global optimum was not ham-
pered by either choice. However, the results reported here are based on a random
restart in such cases. In this simulation, search was restarted from a random point at
x = 4.483.
Fig. 6.2 Iteration 3 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = 4.483 (indicated
by the circle) and terminated at the local minimum at x = 4.0. The regressor obtained using
the three points, depicted by squares, is shown as the solid concave curve. The minimum of
this curve lies at x = -0.8422
In the third iteration, shown in Fig. 6.2, local search is started from x = 4.483, de-
picted by a circle. The local minimum was found to be at x = 4.0, and is depicted by
a square in the figure. When the information of these three local minima was used,
the SVR regressor shown as the solid concave curve was obtained. The minimum of
this curve lies at x = -0.8422.
The start state for iteration 4 was given by the minimum of the regressor obtained
in the previous iteration, given by x = -0.8422. This point is depicted as a circle in
Fig. 6.3. The local minimum obtained from this starting state is again depicted as the
square at the end of the slope. The regressor obtained using these four local minima
is shown as a bowl shaped curve, the minimum of which is located at x = −0.1130.
In the next iteration, depicted in Fig 6.4, local search from x = −0.1130, depicted
by a circle, led us to the global minimum at x = 0.0, depicted by a square in the
figure.
Fig. 6.3 Iteration 4 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the minimum of the regressor obtained in the
previous iteration, given by x = -0.8422 (indicated by the circle). It terminated at the local
minimum at x = -1.0. The regressor obtained using the four local minima obtained till the
current iteration, depicted by squares, is shown as the solid convex shaped curve. The mini-
mum of this curve lies at x = −0.1130
Fig. 6.4 Iteration 5 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the start state given by the minimum of the regressor
obtained at the end of the previous iteration, given by x = -0.1130 (indicated by the circle)
and terminated at the local minimum at x = 0.0 where f (x) = -10.0. The regressor obtained
using all the local minima obtained, depicted by squares, is shown as the solid convex curve
6.3.2 Two Dimensional Case: Ackley’s Function
Ackley’s function is a multimodal benchmark optimization problem, that is widely
used for testing global optimization algorithms. The n-dimensional Ackley’s
function is given by
f (x) = −a exp(−b √((1/n) ∑_{i=1}^{n} x_i^2)) − exp((1/n) ∑_{i=1}^{n} cos(c x_i)) + a + e,

where a = 20, b = 0.2, and c = 2π. Its global minimum is located at xi = 0, ∀i =
1, 2, . . . , n, with the function value f (0) = 0. For the purpose of illustration, we
consider the two dimensional Ackley’s function.
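In code, the definition above (with a = 20, b = 0.2, c = 2π) reads:

```python
import math

def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """n-dimensional Ackley function; global minimum f(0, ..., 0) = 0."""
    n = len(x)
    s1 = sum(xi * xi for xi in x) / n          # mean of squares
    s2 = sum(math.cos(c * xi) for xi in x) / n # mean of cosines
    return -a * math.exp(-b * math.sqrt(s1)) - math.exp(s2) + a + math.e

print(ackley([0.0, 0.0]))    # 0.0 (up to floating-point rounding)
print(ackley([1.0, 1.0]))    # about 3.6254: a nearby local basin, well above 0
```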
As seen in Fig. 6.5, Ackley’s function has a large number of local minima that
hinder the search for the global minimum.
Fig. 6.5 Ackley’s function. A huge number of local minima are seen that obstruct the search
for the global minimum at (0, 0)
Figures 6.6 through 6.8 show the plots of the regressor function obtained after
iterations 2, 3, and 4 respectively. Note that although Figures 6.7 and 6.8 look
similar, there is a difference in the locations of their minima. The minimum of the
bowl shaped Fit function of Fig. 6.8, when used as the start state for next local
minimization procedure, led to the global minimum of Ackley’s function.
Fig. 6.6 Regressor obtained after iteration 2, while optimizing Ackley’s function
Fig. 6.7 Regressor obtained after iteration 3, while optimizing Ackley’s function
6.3.3 Comparison with PSO and GA on Higher Dimensional Problems
Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) are evolutionary
techniques, that also use information revealed during search to generate new search
points. We compare our algorithm with both these approaches on several global op-
timization benchmark problems, ranging in dimension (number of variables n) from
Fig. 6.8 Regressor obtained after iteration 4, while optimizing Ackley’s function. Local
search starting from the minimum of this Fit function led to the global minimum
2 to 100. The Particle Swarm optimization toolbox was obtained from [30], while
the Genetic Algorithm optimization toolbox (GAOT) is the one available at [16].
The next start-state in PSO and GA is obtained by simple mathematical or logical
operations, whereas for GOSAM it is generated after determining the SVR followed
by minimization of a quadratic problem. Therefore, an iteration of GOSAM takes
more time than an iteration of either of these algorithms. Moreover, GA and PSO
run a number of agents in parallel, whereas the current implementation of GOSAM
is a sequential one. However, the difference in the number of function evaluations
required is so dramatic that GOSAM always found the global minimum significantly
faster.
In all our experiments, we evaluated the three algorithms on three different per-
formance criteria. The first criterion is the number of function evaluations required
to reach the global optimum. The second criterion is the number of times the global
optimum is reached in 20 runs, each from a randomly chosen starting point. The
third measure is the average CPU time taken to reach the global optimum.
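Given per-run records, the three criteria reduce to simple aggregates. The record layout (function evaluations, best objective value found, CPU seconds) and the tolerance used to declare a run successful are hypothetical, not taken from the chapter:

```python
def summarize(runs, f_global, tol=1e-6):
    """Reduce per-run records (evals, best f, seconds) to the three criteria:
    mean function evaluations, number of successful runs, mean CPU time."""
    n = len(runs)
    mean_evals = sum(r[0] for r in runs) / n
    successes = sum(1 for r in runs if abs(r[1] - f_global) <= tol)
    mean_time = sum(r[2] for r in runs) / n
    return mean_evals, successes, mean_time

# two mock runs of one algorithm on a problem whose global optimum is 0.0
stats = summarize([(120, 0.0, 0.5), (180, 0.3, 0.7)], f_global=0.0)
print(stats)   # (150.0, 1, 0.6)
```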
Table 6.1 presents the results obtained. Each value indicated in the table is the av-
erage over 20 runs of the corresponding algorithm. For each run, the initial start
state of all the algorithms was the same randomly chosen point. The reported re-
sults have been obtained on a PC with a 1.6 GHz processor and 512 MB RAM. The
first row in the evaluation parameter for each benchmark function (Fn. Evals.) gives
the average number of function evaluations required by each algorithm to find the
global optimum. The number of times that the global optimum was obtained out of
the 20 runs is given in the second row (GO. Obtd.). If the global minimum was not
obtained in all runs, then the average value and the standard deviation of the best
optima obtained over all the runs have been reported within parentheses. The third
row (T(s)) indicates the average time taken in seconds by each algorithm in a run.
Though any number of local minima may be used for building a predictor, we
used a maximum of 100 local minima. The 101st local minimum found overwrote
the 1st, and so on. In each case, the search was confined to lie
within a box [−10, 10]n where n is the dimension. In all our experiments, we used
the conventional SVR framework [14]. Techniques such as SMO [28] or online
SVMs [7] could be used to speed up the training process further. Our focus
in this work is to show the use of machine learning techniques to help predict the
location of better local minima.
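The bounded pool of local minima described above behaves like a circular buffer; in Python it can be sketched with a bounded deque (the capacity of 100 matches the text, while the stored pairs are purely illustrative):

```python
from collections import deque

minima = deque(maxlen=100)        # the 101st insertion evicts the oldest entry
for k in range(101):
    minima.append((k, float(k)))  # (location, objective value) pairs, made up
print(len(minima), minima[0])     # -> 100 (1, 1.0)
```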
The parameters for GA and PSO (c1 = 2, c2 = 2, c3 = 1, chi = 1, and swarm
size = 20) were kept the same as the default ones. For GOSAM, the SVR parameters
were taken to be ε = 10−3 , C = 1000, and the kernel to be the second-degree polynomial
kernel with t = 1.
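For readers who want to experiment, the loop is easy to imitate. The following toy 1-D sketch is our own illustration, not the chapter's implementation: an exact quadratic fit through three local minima stands in for the degree-2 polynomial SVR, a crude derivative-free descent stands in for the local search, and the objective is contrived so that its local minima lie on a parabola centred at the global minimum x = 3.

```python
import math

def f(x):
    # toy multimodal objective: local minima near the integers, whose
    # quality follows a quadratic envelope minimised at x = 3
    return 0.05 * (x - 3.0) ** 2 - math.cos(2.0 * math.pi * x)

def local_search(x, step=1e-3, iters=20000):
    # crude derivative-free descent, standing in for any local minimiser
    for _ in range(iters):
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            break
    return x

def fit_minimiser(p1, p2, p3):
    # vertex of the exact quadratic through three (x, y) points; this plays
    # the role of minimising the SVR-based Fit function
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / d
    b = (x3 ** 2 * (y1 - y2) + x2 ** 2 * (y3 - y1) + x1 ** 2 * (y2 - y3)) / d
    return -b / (2.0 * a)

# multistart local search supplies the training set of local minima ...
minima = [(x, f(x)) for x in (local_search(0.3), local_search(1.2), local_search(5.2))]
# ... and the fitted model predicts where a better local minimum should lie
x_best = local_search(fit_minimiser(*minima))
print(x_best)  # lands in the basin of the global minimum near x = 3
```

None of the three starting points lies in the global basin, yet the model fitted through the three discovered minima points straight at it; this is the prediction effect the chapter attributes to the SVR.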
Table 6.1 shows that GOSAM consistently outperforms both PSO and GA by a
large margin. This difference gets dramatically highlighted in higher dimensions.
Finding the global minimum becomes increasingly difficult as the dimension n in-
creases; PSO and GA fail to find the global optimum in many cases, despite a large
number of function evaluations. However, GOSAM always found the global mini-
mum after a relatively small number of function evaluations (the count for function
evaluation for GOSAM also includes the number of times the objective function
was evaluated during local search). We believe that this result is significant, because
it shows that GOSAM scales very effectively to large dimensional problems. The
experimental results strikingly demonstrate that GOSAM not only finds the global
optimum consistently, but also does so with significantly fewer function
evaluations.

6.4 Extension to Constrained Optimization Problems


Constrained optimization problems are usually handled by solving related uncon-
strained problems, which are obtained through the use of penalty or barrier func-
tions. We take recourse to Sequential Unconstrained Minimization Techniques
(SUMTs), which we briefly review.

6.4.1 Sequential Unconstrained Minimization Techniques


SUMTs comprise a class of non-linear programming methods that solve a
sequence of unconstrained optimization tasks. Given a problem of the form

Minimize a(x) (6.1)

subject to the constraints

gi(x) ≤ 0, for i = 1, . . ., M, (6.2)



Table 6.1 Comparison of GOSAM with PSO and GA on Difficult Benchmark Problems

N    Benchmark       Evaluation   GOSAM     PSO                            GAOT
     function        parameter

2    Ackley          Fn. Evals.   122.75    12580.0                        2202.75
                     GO Obtd.     20        20                             20
                     T(s)         0.02970   0.535524                       0.677470

2    Rastrigin       Fn. Evals.   129.5     23037.0                        2198.15
                     GO Obtd.     20        20                             20
                     T(s)         0.0328    0.913196                       0.662956

2    Griewangk       Fn. Evals.   108.95    91824.0                        1180.15
                     GO Obtd.     20        20                             19‡ (0.0057 ± 2.5e-5)
                     T(s)         0.02265   3.758419                       0.357913

2    Rotated Hyper   Fn. Evals.   24.35     22542.0                        647.75
     Ellipsoid       GO Obtd.     20        20                             20
                     T(s)         0.0078    0.972250                       0.236874

2    Rosenbrock’s    Fn. Evals.   105.05    49702.00                       10600.75
     Valley          GO Obtd.     20        20                             20
                     T(s)         0.0148    2.046909                       3.19688

2    Schwefel        Fn. Evals.   37.4      24000                          866.30
                     GO Obtd.     20        1†                             20
                     T(s)         0.01325   5.338205                       0.268143

2    Branin’s Rcos   Fn. Evals.   81.85     52341.0                        649.75
                     GO Obtd.     20        20                             20
                     T(s)         0.01795   2.371105                       0.206017

2    Six Hump        Fn. Evals.   64.5      29892.0                        643.95
     Camelback       GO Obtd.     20        20                             20
                     T(s)         0.01715   1.213340                       0.199104

10   Ackley          Fn. Evals.   208.09    145746.0                       11226.10
                     GO Obtd.     20        20                             20
                     T(s)         0.03516   6.636169                       3.973002

10   Rastrigin       Fn. Evals.   298.65    300040.0                       11219.25
                     GO Obtd.     20        0‡ (3.035 ± 2.054)             12‡ (0.4478 ± 0.4587)
                     T(s)         0.0398    12.854138                      3.435049

10   Rotated Hyper   Fn. Evals.   46.9      105417.00                      4766.35
     Ellipsoid       GO Obtd.     20        20                             20
                     T(s)         0.01255   4.995237                       1.448820

10   Rosenbrock’s    Fn. Evals.   2177.90   300040.0                       22917.3
     Valley          GO Obtd.     20        3‡ (1.9104 ± 1.2932)           3‡ (2.9247 ± 3.0229)
                     T(s)         0.04460   12.467250                      7.737082

100  Ackley          Fn. Evals.   7437.4    300040.0                       24708.40
                     GO Obtd.     20        0‡ (8.0094 ± 5.7940)           0‡ (1.6847 ± 0.2302)
                     T(s)         8.707     24.979903                      14.027134

100  Rastrigin       Fn. Evals.   4931.5    2000040.00                     36528.90
                     GO Obtd.     20        0 (486.3559 ± 737.8523)        0 (60.2946 ± 8.7735)
                     T(s)         7.615     128.840828                     23.566207

100  Rotated Hyper   Fn. Evals.   431.5     292666.0                       37001.70
     Ellipsoid       GO Obtd.     20        0‡ (0.3350 ± 1.0915)           0‡ (2.2683 ± 0.9803)
                     T(s)         0.0823    46.069183                      29.993730

100  Rosenbrock’s    Fn. Evals.   8676.20   500040.0                       61365.65
     Valley          GO Obtd.     20        0‡ (14324550.17 ± 1535813.439) 0‡ (357.198 ± 165.5572)
                     T(s)         0.89920   17.913865                      36.662314

‡ The global optimum was not obtained in all the 20 runs. The value in the corresponding
parentheses indicates the mean and the standard deviation of the quality of global minima
obtained in the 20 runs.
† The global optimum obtained was not within the specified bounds.

where a(x) is the objective function, and gi(x), for i = 1, . . ., M, are the M constraints.
One kind of SUMT, the quadratic penalty function method, minimizes a sequence
of functions of the form (p = 1, 2, . . .)

Fp(x) = a(x) + Σi=1..M αpi max(0, gi(x))², (6.3)

where αpi is a scalar weight, and p is the problem number. The minimizer for the pth
problem in the sequence forms the guess or starting point for the (p + 1)th problem.
The scalars change from one problem to the next based on the rule that αpi ≥
α(p−1)i; they are typically increased geometrically, by say 10%. These weights
indicate the relative emphasis of the constraints and the objective function.
In the limit, as the constraint penalties become overwhelmingly large, the sequence
of minima of the unconstrained problems converges to a solution of the original
constrained optimization problem. We now illustrate the use of SUMT through the
application of GOSAM to the graph coloring problem.
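The quadratic penalty scheme of (6.3) can be made concrete with a small self-contained sketch. The toy problem, the numerical-gradient descent, and the penalty schedule below are our own illustrative choices, not from the chapter: minimize a(x) = (x1 − 2)² + (x2 − 2)² subject to x1 + x2 − 1 ≤ 0, whose constrained optimum is (0.5, 0.5).

```python
def a(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 2.0) ** 2

def g(x):                        # single constraint, g(x) <= 0
    return x[0] + x[1] - 1.0

def penalized(x, alpha):
    # F_p(x) = a(x) + alpha * max(0, g(x))^2, as in (6.3) with one constraint
    return a(x) + alpha * max(0.0, g(x)) ** 2

def num_grad(f, x, h=1e-6):
    # central-difference gradient, since no analytical derivative is assumed
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

def descend(f, x, iters=300):
    # gradient descent with a simple backtracking line search
    for _ in range(iters):
        grad = num_grad(f, x)
        step = 1.0
        while step > 1e-12 and f([xi - step * gi for xi, gi in zip(x, grad)]) >= f(x):
            step *= 0.5
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

x = [0.0, 0.0]
alpha = 1.0
for _ in range(5):               # SUMT: a sequence of unconstrained problems,
    x = descend(lambda y: penalized(y, alpha), x)
    alpha *= 10.0                # tightening the penalty geometrically
print(x)                         # -> approximately [0.5, 0.5]
```

Each unconstrained solve is warm-started from the previous one, exactly as the text describes; early minimizers sit outside the feasible region and are pulled onto the constraint as alpha grows.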
Given a graph with a set of nodes or vertices, and an adjacency matrix D, the
Graph Coloring Problem (GCP) requires coloring each node or vertex so that no
two adjacent nodes have the same color. The adjacency matrix entry di j is a 1 if
nodes i and j are adjacent, and is 0 otherwise.
A minimal coloring requires finding a valid coloring that uses the least number of
colors. The GCP can be solved through an energy minimization approach. We used
an approach based on the Compact Analogue Neural Network (CANN) formulation
[17]. In this approach, an N-vertex GCP is solved by considering a network of N
neurons, whose outputs denote the node colors. The outputs are represented by a set
of real numbers Xi , i = 1, 2, . . . , N. The color is not assumed to be an integer as is
done conventionally.
The GCP is solved by minimizing a sequence (p = 1, 2, . . .) of functions of the
form

E = (A/2) Σi=1..N Σj=1..N (1 − dij) Vm ln cosh β(Xi − Xj)
  + (Bp/2) Σi=1..N Σj=1..N dij Vm ln [ cosh β(Xi − Xj + δ) cosh β(Xi − Xj − δ) / cosh² β(Xi − Xj) ]   (6.4)

In keeping with the earlier literature on neural network approaches to the GCP, we
refer to E in (6.4) as an energy function.
The first term of equation (6.4) is present only for di j = 0, i.e. for non-adjacent
nodes. The term is minimized when Xi = X j . The term therefore minimizes the
number of distinct colors used. The second term is minimized if the values of Xi and
X j corresponding to adjacent nodes differ by at least δ . This term corresponds to the
adjacency constraint in the GCP, and becomes large as the problem sequence index
p increases. Nodes colored by colors that differ by less than δ correspond to nodes
with identical colors.
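To make the structure of this energy concrete, here is a small self-contained sketch of (6.4); the parameter values (A, Bp, Vm, β, δ) are our own illustrative guesses, not the settings of [17] or of the experiments below.

```python
import math

def energy(X, adj, A=1.0, Bp=1.0, Vm=1.0, beta=2.0, delta=1.0):
    # CANN-style energy of (6.4) for real-valued node "colors" X and a 0/1
    # adjacency matrix adj (parameter values are illustrative guesses)
    N = len(X)
    E = 0.0
    for i in range(N):
        for j in range(N):
            d = X[i] - X[j]
            if adj[i][j]:
                # adjacency term: penalises adjacent nodes whose colors
                # differ by less than delta; its weight Bp grows with p
                num = math.cosh(beta * (d + delta)) * math.cosh(beta * (d - delta))
                den = math.cosh(beta * d) ** 2
                E += 0.5 * Bp * Vm * math.log(num / den)
            else:
                # compactness term: pulls non-adjacent nodes toward the same
                # color, minimising the number of distinct colors used
                E += 0.5 * A * Vm * math.log(math.cosh(beta * d))
    return E

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # K3 needs three colors
print(energy([0.0, 0.0, 0.0], triangle))       # invalid coloring: high energy
print(energy([0.0, 1.1, 2.2], triangle))       # well-separated coloring: lower
```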

We used GOSAM to minimize the energy functions corresponding to difficult
GCP benchmark problems [1], which have a large number of connections. Of these,
the Myciel instances are graphs based on the Mycielski transformation. These
graphs are difficult to solve because they are triangle free (clique number 2), but
their chromatic number increases with the problem size. The “Huck” instance is a
graph in which each node represents a character; two nodes are connected by an
edge if the corresponding characters encounter each other in Twain’s “Huckleberry
Finn”. In the “Games120” instance, the games played in a college football season
are represented by a graph whose nodes are the college teams; two teams are
connected by an edge if they played each other during the season.
The energy functions for these problems are very complex and lead to extremely
hard global optimization problems. However, the constrained optimizer was able
to obtain the optimal coloring for each of these instances. Table 6.2 sums up the
results obtained. Note that the starting value of B, and the amount of increment in
B for successive iterations are both related to the time taken to reach the optimal
solution. One would like to start with a value of B, which quickly takes us into the
feasible region. This leads us to believe that a large value of B would do the trick.
However, if the value of B is taken to be too large then we might not be able to reach
the optimal solution. Thus there is no obvious way to determine a good starting
value of B; instead, it is chosen by educated guess. A natural heuristic is that
for dense adjacency matrices a large value of B should be chosen while a relatively
smaller value would suffice for sparse adjacency matrices. If we reach the feasible
region, then we could slowly and cautiously (making sure that we don’t exit the
feasible region) increase the value of A till we reach the optimal solution. We defer
a more detailed discussion of this aspect as it is beyond the scope of this chapter.

Table 6.2 Constrained optimization on benchmark GCP instances

Instance Nodes Edges Optimal coloring Best Solution Obtained Iterations required
Myciel3 11 20 4 4 3
Myciel4 23 71 5 5 5
Huck 74 301 11 11 8
Games120 120 638 9 9 10

6.5 Design Optimization Problems


Designers are usually confronted with the problem of finding optimal settings for a
large number of design parameters, with respect to several simulated product or pro-
cess characteristics. Problems of design and synthesis in the electronic domain are
generally constrained non-linear optimization problems. The principal characteris-
tics of these problems are very time-consuming function evaluations and the absence
of derivative information. In most cases, evaluating the cost or objective function re-
quires a system simulation, and the function is rarely available in an analytical form.
In fact, the use of classical optimization techniques to give an optimal solution is
nearly impossible. For instance, VLSI design engineers carry out time-consuming
function evaluations by using circuit or other simulation tools, e.g. Spectre [2], and
choose a circuit with optimal component values. Since there are still many possi-
ble design parameter settings and computer simulations are time-consuming, it is
crucial to find the best possible design with a minimum number of simulations. We
used GOSAM to solve several circuit optimization problems. The interface between
the optimizer and the circuit simulators is shown in Fig. 6.9. Preliminary details of
this work were reported in [18].

[Figure: block diagram — the Optimizer sends updated design variables to the
Interface; the Interface updates the netlist and invokes Spectre to run a
simulation; Spectre writes the function value to an output file, which the
Interface reads and returns to the Optimizer]

Fig. 6.9 GOSAM’s interface with the circuit simulator Spectre

We initially start with values for the design variables that are provided by a de-
signer, or choose them randomly. Since there are no analytical formulae to compute
the output for the input design parameters, the function values are calculated by using
a circuit simulator such as Spectre. The simulator writes the output value to a file,
which is read by the interface and returned to GOSAM. GOSAM then uses SVR
on the set of values obtained so far, to determine the Fit function. The SVR yields a
smooth and differentiable regressor. GOSAM then computes the minimum of the Fit
function, and sends it as the vector of new design parameters, to the interface. A key
feature of this approach is that we can apply it even to scenarios where the objective
function is not available in analytical form or is not differentiable. A major bonus
is that examination of the Fit function yields information about what constitutes a
good design. We now briefly discuss a few interesting circuit optimization examples.
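The data flow just described can be sketched as follows. In the real system the interface invokes Spectre on a netlist; this stand-alone sketch substitutes a stub "simulator" and a toy objective, and every name and file path in it is hypothetical.

```python
import os
import tempfile

def run_simulator(design, out_path):
    # stands in for "invoke Spectre": evaluate a toy objective for the
    # current design variables and write the value to the output file
    value = sum((v - 1.0) ** 2 for v in design)
    with open(out_path, "w") as fh:
        fh.write(repr(value))

def evaluate(design, out_path):
    # the interface: run a simulation, then read the function value back;
    # the optimizer only ever sees this black-box mapping
    run_simulator(design, out_path)
    with open(out_path) as fh:
        return float(fh.read())

out_file = os.path.join(tempfile.gettempdir(), "gosam_demo_result.txt")
print(evaluate([1.0, 2.0], out_file))  # -> 1.0
```

Because the optimizer only exchanges design vectors and objective values with `evaluate`, any simulator that respects this file protocol can be plugged in unchanged.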

6.5.1 Sample and Hold Circuit


For a sample and hold circuit, the objective function was to hold the sampled value
as constant as possible during the hold period. The design variables are the widths
of 22 MOSFETs, along with the values of four capacitors named C1, C2, C3, and
C4. The transistor widths were constrained to lie between 250nm and 1200nm. Ca-
pacitor C3 was required to be between 1fF and 5000fF, while all other capacitors
were constrained to lie between 1fF and 500fF. Simulations show that the sampled
value is maintained well during the hold period. As of date, numerous complex
VLSI circuits have been designed using GOSAM interfaced with the circuit simu-
lator Spectre. The chosen circuits include Phase Locked Loops (PLLs), a variety of
operational amplifiers, and filters. In these examples, transistor sizes and other com-
ponent values have been selected to optimize specified objectives such as jitter, gain,
phase margin, and power, while meeting specified constraints on other performance
measures as well as on transistor sizes.

Fig. 6.10 Response of the optimized Sample-and-Hold Circuit, showing output voltage
versus time. The goal was to keep the output constant during the hold period

6.5.2 Folded Cascode Amplifier


For a folded cascode amplifier, the design objective was to maximize the phase
margin. The variables for the optimization task were taken to be the widths of 16
MOSFET transistors. The result, depicted in Figure 6.11, shows that GOSAM
obtained a maximum phase margin of 169.73◦, as well as an excellent solution
with a phase margin of around 120◦. An industry-level commercial tool found a
solution with a phase margin of around 59◦.

6.6 Discussion
An important question relates to assumptions that may be implicitly or explicitly
made regarding the function to be optimized. We mentioned previously that any
local search mechanism could be used in conjunction with GOSAM. Figure 6.12
illustrates this with the help of a toy example. For the objective function shown
by the dashed curve in Fig. 6.12, the gradient cannot be computed to reach two of
the minima. A line search method is used in the outer triangular regions, while for
the parabolic region in the middle the gradient is available and a simple gradient
descent leads us to the local minimum. These three local minima, when used by
SVR to construct the regressor, yield the parabolic shaped solid curve of Fig. 6.12.
Local search starting from the minimum of this curve led to the global minimum.

Fig. 6.11 Phase margin versus iteration count for a folded cascode amplifier

Fig. 6.12 A toy example illustrating that any local minimizing procedure can be used with
GOSAM. The function is depicted as the dotted curve. For the outer triangular regions, the
gradient information cannot be used, so the local minima are found by a line search method.
However for the inner parabolic region, the local minimum can be found using gradient de-
scent. The regressor obtained is shown by the solid curve that passes through the local minima
obtained

In the worst case, GOSAM performs similarly to a random multistart. This is be-
cause whenever it is not possible to use the minimum of the Fit function (for exam-
ple, when it is out of bounds, or when the previous two iterations yield almost the
same minimum), GOSAM restarts the search from a random state. Therefore in the
worst case it will randomly explore the search space for new starting points. How-
ever, real applications never involve functions that are discontinuous everywhere,
and we have not encountered this worst case.

Fig. 6.13 A toy example to illustrate that the regressor for the objective function f (x), de-
noted as Fit of f (x) is smoother than f (x). Recursively, the Fit for the Fit of f (x) is smoother
than the Fit of f (x), and in the limit leads to a convex underestimate of f (x)

[Figure: block diagram — a Client sends a request to the Web Server, which
invokes an instance of the GOSAM optimizer; the server requests function
values at chosen points, the client sends them back, and the server returns
the optimized points]

Fig. 6.14 Testing: A web based service

Minimization of the regressor function is an essential step in GOSAM. In all


our experiments we used local search to accomplish this step. A natural question
is what might happen if the Fit function itself turns out to have multiple
local minima. Such a situation is certainly possible, and is theoretically interesting.
An alternative approach that we suggest is to use GOSAM recursively. This idea is
intuitive because the regressor function, called the Fit function in Fig. 6.13, is smoother
than the objective function, as it is a smooth interpolation of only the local minima
of the objective function encountered earlier. Therefore, a Fit of the Fit function’s
local minima would be even smoother. This is depicted pictorially in Fig. 6.13,
which uses a hypothetical example to illustrate what the application of GOSAM
to f (x) and recursively to Fit functions, might achieve. The original function f (x)
has a number of minima. As can be seen, the number of minima reduces at each
step, the sequence of recursively computed Fit functions becomes increasingly
smooth, and the sequence terminates at a convex function that is related to the
double conjugate of the original function. However, local minimization of the Fit
function seems to be more than adequate, as is done in the present implementation.
It is possible to construct functions where GOSAM’s strategy will fail. For ex-
ample, it would be impossible to learn any structure from a function with a uniform
distribution of randomly located minima, or a function that is discontinuous almost
everywhere. However, on most problems of any practical interest, small perturbations
from a local minimum will lead us to another locally minimal configuration. This im-
plies that a learning tool can be used to predict locations of other minima from the
knowledge of only a few.

6.7 Conclusion and Future Work


In this paper, we presented GOSAM, a fast and effective multistart algorithm for
global optimization. GOSAM applies support vector
regression on the training set formed by previously discovered local minima, to
guide search towards better local minima. This is different from approximating a fit-
ness landscape; GOSAM attempts to predict how local minima are distributed, and
where the best one might lie. A regressor that fits local minima is smoother than one
that tries to fit the original function. Approximating the fitness landscape requires
fitting all points and not just a few minima. The use of Support Vector Regression
allows only support vectors to be retained, and redundant information can be dis-
carded. Experimental results on large benchmarks show that GOSAM searches far
more efficiently, uses significantly fewer function evaluations, and finds the global
optimum more consistently than other state-of-the-art methods. The effectiveness of
GOSAM confirms that the generalizing ability of SVRs is very useful in predicting
where good local minima lie. We have also shown how GOSAM can be applied
to constrained tasks, as well as combinatorial optimization tasks such as graph
coloring, that are traditionally solved as integer programming problems. GOSAM
does not require the function to be known in terms of an analytical expression. It is
enough to have a black box that can evaluate the function at a chosen point. This
allows GOSAM to be interfaced to any such black box. We have presented results
in the VLSI domain, where GOSAM has been interfaced to a commercial circuit
simulator and used to optimize MOSFET sizes and component values to meet de-
sired objectives subject to specified constraints. The objectives are typically com-
plex, such as phase margin of a folded cascode amplifier, or jitter in a Phase Locked
Loop. A current version of GOSAM is equipped with a web interface that allows a
user to access it without revealing information about the function being optimized.
The set up of the web based service is shown in Fig. 6.14. As the figure illustrates,
only vectors and corresponding cost values are exchanged between the GOSAM
server and a client running a simulator or emulator. This allows GOSAM to be pro-
vided as a service across the web while protecting proprietary information about the
optimizer and the objective function.
Other aspects worthy of investigation include the use of different approaches to
SVR, such as online learning techniques, and parallelizing operations in GOSAM
to speed up search. Ongoing efforts include extending GOSAM to multi-objective
optimization tasks. GOSAM may be obtained from the authors for non-commercial
academic use on a trial basis.

Acknowledgements. The authors would like to thank Dr. R. Kothari of IBM India Re-
search Laboratory, Prof. R. Newcomb, University of Maryland, College Park, USA, and Prof.
S.C.Dutta Roy of the Department of Electrical Engineering, IIT Delhi, for their valuable
comments and a critical appraisal of the manuscript.

References
1. http://mat.gsia.cmu.edu/COLOR02/
2. http://www.cadence.com/products/custom_ic/spectre/index.aspx
3. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson,
J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimisation.
In: Proceedings of the 4th Annual International Symposium on Code Generation and
Optimization (CGO), New York, NY, USA, pp. 295–305 (2006)
4. Baluja, S., Barto, A., Boese, K., Boyan, J., Buntine, W., Carson, T., Caruana, R., Davies,
S., Dean, T., Dietterich, T., Hazlehurst, S., Impagliazzo, R., Jagota, A., Kim, K., Mc-
Govern, A., Moll, R., Moss, E., Perkins, T., Sanchis, L., Su, L., Wang, X., Wolpert, D.:
Statistical machine learning for large-scale optimization. Neural Computing Surveys 3,
1–58 (2000)
5. Black, F., Litterman, R.: Global portfolio optimization. Financial Analysts Journal 48(5),
28–43 (1992)
6. Boese, K., Kahng, A.B., Muddu, S.: A new adaptive multi-start technique for combina-
torial global optimizations. Operations Research Letters 16(2), 101–113 (1994)
7. Bordes, A., Bottou, L.: The huller: A simple and efficient online SVM. In: Gama, J.,
Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI),
vol. 3720, pp. 505–512. Springer, Heidelberg (2005),
http://leon.bottou.org/papers/bordes-bottou-2005
8. Boyan, J.: Learning evaluation functions for global optimization. Phd dissertation, CMU
(1998)
9. Boyan, J., Moore, A.: Learning evaluation functions for global optimization and
boolean satisfiability. In: Proceedings of the Fifteenth National Conference on Arti-
ficial Intelligence, vol. 15, pp. 3–10. John Wiley and Sons Ltd., Chichester (1998),
http://www.cs.cmu.edu/˜jab/cv/pubs/boyan.stage2.ps.gz
10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: Optimization by a colony of cooper-
ating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41
(1996),
http://iridia.ulb.ac.be/˜mdorigo/ACO/publications.html
11. GEATbx: Genetic and evolutionary algorithm toolbox (1994),
http://www.geatbx.com/docu/fcnindex.html
12. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley Longman Publishing Co., Inc., Boston (1989)
13. Grossmann, I.E. (ed.): Global optimization in engineering design. Kluwer Academic
Publishers, Dordrecht (1996)
14. Gunn, S.: Support vector machines for classification and regression. Technical report,
Image Speech and Intelligent Systems Research Group, University of Southampton, UK
(1998), http://www.isis.ecs.soton.ac.uk/resources/svminfo/
15. Horst, R., Tuy, H.: Global optimization: deterministic approaches. Springer, Berlin
(1993)
16. Houck, C., Joines, J., Kay, M.: A genetic algorithm for function optimization: A matlab
implementation. NCSU-IE TR 95-09 (1995),
http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/
17. Jayadeva, Dutta Roy, S.C., Chaudhary, A.: Compact analogue neural network: A
new paradigm for neural based combinatorial optimisation. IEE Proc-Circuits Devices
Syst. 146(3) (1999)
18. Jayadeva, Shah, S., Chandra, S.: Learning to optimize VLSI design problems. In: INDI-
CON, pp. 1–5. IEEE, New Delhi (2006)
19. Johnson, D., McGeoch, L.: The travelling salesman problem: A case study in local opti-
misation. In: Local Search in Combinatorial Optimisation, pp. 215–310. John Wiley and
Sons, London (1997)
20. Kazerounian, K., Wang, Z.: Global versus local optimization in redundancy resolution of
robotic manipulators. The International Journal of Robotics Research 7(5), 2–12 (1988)
21. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the IEEE
International Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942–1948
(1995)
22. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Sci-
ence 220(4598), 671–680 (1983)
23. LINDO SYSTEMS Inc.: LINDO API User’s Manual (2002)
24. Madsen, K.: Test problems for global optimization,
http://www2.imm.dtu.dk/˜km/Test_ex_forms/test_ex.html
25. Mangasarian, O.: Nonlinear Programming. SIAM, Philadelphia (1994)
26. Moles, C., Mendes, P., Banga, J.: Parameter estimation in biochemical pathways: a com-
parison of global optimization methods. Genome Research 13, 2467–2474 (2003)
27. Neumaier, A.: Global optimization,
http://www.mat.univie.ac.at/˜neum/glopt/applications.html
28. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimiza-
tion. In: Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT
Press, Cambridge (1999)
29. Russel, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice Hall, Engle-
wood Cliffs (1995)
30. Singh, J.: PSO algorithm toolbox (2003),
http://psotoolbox.sourceforge.net/
31. Wang, M., Yang, X., Sarrafzadeh, M.: Congestion minimization during placement. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(10),
1140–1148 (2000)
Chapter 7
Multi-objective Optimization Using Surrogates

Ivan Voutchkov and Andy Keane

Abstract. Until recently, optimization was regarded as a discipline of rather theo-
retical interest, with limited real-life applicability due to the computational or ex-
perimental expense involved. Practical multiobjective optimization was considered
almost a utopia even in academic studies due to the multiplication of this ex-
pense. This paper discusses the idea of using surrogate models for multiobjective
optimization. With recent advances in grid and parallel computing more companies
are buying inexpensive computing clusters that can work in parallel. This allows,
for example, efficient fusion of surrogates and finite element models into a multiob-
jective optimization cycle. The research presented here demonstrates this idea using
several response surface methods on a pre-selected set of test functions. We aim to
show that there are a number of techniques which can be used to tackle difficult prob-
lems, and we also demonstrate that a careful choice of response surface methods is
important when carrying out surrogate assisted multiobjective search.

7.1 Introduction
In the world of real engineering design, there are often multiple targets which man-
ufacturers are trying to achieve. For instance in the aerospace industry, a general
problem is to minimize weight, cost and fuel consumption while keeping perfor-
mance and safety at a maximum. Each of these targets might be easy to achieve
individually. An airplane made of balsa wood would be very light and will have low
fuel consumption, however it will not be structurally strong enough to perform at
high speeds or carry useful payload. Also such an airplane might not be very safe,
i.e., robust to various weather and operational conditions. On the other hand, a solid
body and a very powerful engine will make the aircraft structurally sound and able
to fly at high speeds, but its cost and fuel consumption will increase enormously. So
engineers are continuously making trade-offs and producing designs that will satisfy
as many requirements as possible, while industrial, commercial and ecological
standards are at the same time getting ever tighter.

Ivan Voutchkov
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: iiv@soton.ac.uk
Andy Keane
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: ajk@soton.ac.uk

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 155–175.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
156 I. Voutchkov and A. Keane

Multiobjective optimization (MO) is a tool that aids engineers in choosing the
best design in a world where many targets need to be satisfied. Unlike conventional
optimization, MO will not produce a single solution, but rather a set of solutions,
most commonly referred to as the Pareto front (PF) [12]. By definition it will contain
only non-dominated solutions¹. It is up to the engineer to select a final design by
examining this front.
Over the past few decades with the rapid growth of computational power, the fo-
cus in optimization algorithms in general has shifted from local approaches that find
the optimal value with the minimal number of function evaluations to more global
strategies which are not necessarily as efficient as local searches but (some more
than others) promise to converge to global solutions, the main players being
various strands of genetic and evolutionary algorithms. At the same time, comput-
ing power has essentially stopped growing in terms of flops per CPU core. Instead
parallel processing is an integral part of any modern computer system. Computing
clusters are ever more accessible through various techniques and interfaces such as
multi-threading, multi-core, Windows HPC, Condor, Globus, etc.
Parallel processing means that several function evaluations can be obtained at
the same time, which perfectly suits the ideology behind genetic and evolutionary
algorithms. For example, genetic algorithms are based on an idea borrowed from
biological reproduction, where the offspring of two parents copy the best genes of
their parents but also introduce some mutation to allow diversity. The entire gener-
ation of offspring produced by parents in a generation represent designs that can be
evaluated in parallel. The fittest individuals survive and are copied into the next gen-
eration, whilst weak designs are given some random chance with low probability to
survive. Such parallel search methods are conveniently applicable to multiobjective
optimization problems, where the fitness of an individual is measured by how close
to the Pareto front this design is. All individuals are ranked, those that are part of
the Pareto front get the lowest (best) rank, the next best have higher rank and so on.
Thus the multiobjective optimization is reduced to single objective minimization of
the rank of the individuals. This idea was developed by Deb and implemented
in NSGA2 [5].
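As an illustration, the ranking just described can be sketched in a few lines. This is our own minimal version of non-dominated ranking for minimization problems, not the actual NSGA2 code:

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective
    # and strictly better in at least one (minimization)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_ranks(points):
    """Assign ranks: rank 0 is the non-dominated front, rank 1 the next front, etc."""
    remaining = set(range(len(points)))
    ranks = {}
    r = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = r
        remaining -= front
        r += 1
    return [ranks[i] for i in range(len(points))]
```

NSGA2 additionally uses a crowding-distance measure to break ties within a rank, which this sketch omits.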
In the context of this paper, the aim of MO is to produce a well spread out set
of optimal designs, with as few function evaluations as possible. There are a number
of methods published and widely used to do this – MOGA, SPEA, PAES, VEGA,
NSGA2, etc. Some are better than others - generally the most popular in the litera-
ture are NSGA2 (Deb) and SPEA2 (Zitzler), because they are found to achieve good
results for most problems [2, 3, 4, 5, 6]. The first is based on genetic algorithms and
1 Non-dominated designs are those for which performance in any particular goal cannot be
improved without making performance in at least one other goal worse.
7 Multi-objective Optimization Using Surrogates 157

the second on an evolutionary algorithm, both of which are known to need many
function evaluations. In real engineering problems the cost of evaluating a design
is probably the biggest obstacle that prevents extensive use of optimization proce-
dures. In the multiobjective world, this cost is multiplied, because there are multi-
ple expensive results to obtain. Evaluating directly a finite element model can take
several days, which makes it very expensive to try hundreds or thousands of designs.

7.2 Surrogate Models for Optimization


It seems that increased computing power leads to increased hunger for even more
computing power, as engineers realise that they can run more detailed and realis-
tic models. In essence, from an engineering point of view, the available computing
power is never enough and this tendency does not seem to be changing at least in
the foreseeable future. To put these words into perspective, to be useful to an engi-
neering company, a modern optimization approach should be able to tackle a global
multiobjective optimization problem in about a week. The problem would typically
have 20-30 variables, 2-5 objectives, 2-5 constraints with evaluation times of about
12-48h per design and often per objective. Unless you have access to 5000-7000
parallel CPUs, the only way to currently tackle such problems is to use surrogate
models.
In the single objective world, approaches using surrogate models are fairly well
established and have proven to successfully deal with the problem of computational

Fig. 7.1 Direct search versus surrogate models for optimization


158 I. Voutchkov and A. Keane

expense (see Fig. 7.1) [22]. Since their introduction, more and more companies
have adopted surrogate assisted optimization techniques and some are making steps
to incorporate this approach in their design cycle as standard. The reason for this is
that instead of using the expensive computational models during the optimization
step, they are substituted with a much cheaper but still accurate replica. This makes
optimization not only useful, but usable and affordable. The key idea that makes
surrogate models efficient is that they should become more accurate in the region of
interest as the search progresses, rather than being equally accurate over the entire
design space, as an FE representation will tend to be. This is achieved by adding to
the surrogate knowledge base only at points of interest. The procedure is referred to
as surrogate update.
Various publications address the idea of surrogate models and multiobjective op-
timisation [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As one would expect, no approxi-
mation method is universal. Factors such as function modality, number of variables,
number of objectives, constraints, computation time, etc., all have to be taken into
account when choosing an approximation method. The work presented here aims to
demonstrate this diversity and hints at some possible strategies to make best use of
surrogates for multi-objective problems.

7.3 Multi-objective Optimization Using Surrogates


To illustrate the basic idea, the zdt1 – zdt6 test function suite [3] will be used to be-
gin with. It is a good suite to demonstrate the effectiveness of surrogate models, as it
is fairly simple for response surface (surrogate) modelling. Fig. 7.2 shows the
zdt2 function and the optimisation results, a striking comparison that demon-
strates the surrogate approach. The problem has two objective functions and two
design variables. The Pareto front obtained using surrogates with 40 function evalu-
ations is far superior to the one without surrogates and the same number of function
evaluations.

Table 7.1 Full function evaluations for ZDT2 - Fig. 7.2

Number of variables 2 5 10
Number of function evaluations without surrogates 2500 5000 10000
Number of function evaluations with surrogates 40 40 60

On the other hand 2500 evaluations without surrogates were required to obtain a
similar quality of Pareto front to the case with surrogates and 40 evaluations. The
difference is even more significant if more variables are added – see Table 7.1.
Here we have chosen objective functions with simple shapes to demonstrate the
effectiveness of using surrogates. Both functions would be readily approximated
using most available methods. It is not uncommon to have relationships of simi-
lar simplicity in real problems, although external noise factors could make them

Fig. 7.2 A (left) – Function ZDT2; B (right) – ZDT2 – Pareto front achieved in 40 evalua-
tions: Diamonds – Pareto front with surrogates; Circles – solution without surrogates

look rougher. Relationships of a higher order of multimodality would be more of a
challenge for most methods, as will be demonstrated later.

7.4 Pareto Fronts - Challenges


Depending on the search algorithm, the quality of the Pareto front could vary greatly.
There are various characteristics that describe a good quality Pareto front:
1. Spacing – better search techniques will space the points on the Pareto front
uniformly rather than producing clusters. See Fig. 7.3a
2. Richness – better search techniques will put more points on the Pareto front than
others. See Fig. 7.3b
3. Diversity – better search techniques will produce fronts that are spread out better
with respect to all objectives. See Fig. 7.3c
4. Optimality – better search techniques will produce fronts that dominate the fronts
produced by less good techniques. In test problems this is usually measured
as ‘generational distance’ to an ideal Pareto front. We discuss this later. See
Fig. 7.3d
5. Globality – the obtained Pareto front is global as opposed to local. Similar to
single objective optimization, in the multiobjective world, it is also possible to
have local and global optimal solutions. This concept is demonstrated using the
F5 test function (a full description is given in sections 7.5 and 7.7). Fig. 7.4
illustrates the function and the optimization procedure. Due to the sharp nature
of the global solution it cannot be guaranteed that with a small number of GA
evaluations, the correct solution will be found. Furthermore, since the surrogate is
based only on sampled data, if this data does not contain any points in the global
optimum area, then the surrogate will never know about its existence. Therefore

any optimization based only on such surrogates will lead us to the local solution.
Therefore conventional optimization approaches based on surrogate models rely
on constant updating of the surrogate. A widely accepted technique in single
objective optimization is to update the surrogate with its current optimal solution.
In multiobjective terms this will translate to updating the surrogate with one or
more points belonging to its Pareto front. If the surrogate Pareto front is local
and not global, then the next update will also be around the local Pareto front.
Continuing with this procedure the surrogate model will become more and more
accurate in the area of the local optimal solution, but will never know about the
existence of the global solution.
6. Robust convergence from any start design with any random number sequence. It
turns out that the success of a conventional multiobjective optimization based on
surrogates, using updates at previously found optimal locations strongly depends
on the initial data used to train the first surrogate before any updates are added.
If this data happens to contain points around the global Pareto front, then the
algorithm will be able to quickly converge and find a nice global Pareto front.
However the odds are that the local Pareto fronts have smoother, easier to find
shapes and in most cases this is where the procedure will converge unless suitable
global exploration steps are taken.
7. Efficiency and convergence – better search techniques will converge using less
function evaluations.

Fig. 7.3 Pareto front potential problems - (a) clustering; (b) too few points; (c) lack of diver-
sity; (d) non-optimality

Fig. 7.4 F5: Local and global solutions

7.5 Response Surface Methods, Optimization Procedure and Test Functions
In a previous publication [20] we have shown that for complex and high-dimensional
functions Kriging is the response surface method of choice [22]. We have also
stressed the importance of applying a high level of understanding when using Krig-
ing. There have been various publications that critique kriging, due to lack of under-
standing. Our opinion is that if the user understands the strengths and weaknesses
of this approach it can become an invaluable tool, often the only one capable of
producing meaningful results in reasonable times.
Kriging is a Response Surface Method (RSM), designed in the 1960s for geological
surveys [7]. It can be a very efficient RSM model for cases where it is expensive to
obtain large amounts of data. A significant number of publications discuss the krig-
ing procedure in detail. An important factor in the success of the method is the tuning
of its hyper-parameters. It should be mentioned that researchers who have chosen
rigorous training procedures, report positive results when using kriging, while pub-
lications that use basic training procedures often reject this method. Nevertheless,
the method is becoming increasingly popular in the world of optimization as it often
provides a surrogate with usable accuracy.
This method was used to build surrogates for the above test cases, therefore it is
useful to briefly outline its major pros and cons:

Pros:
• can always predict with no error at sample points,
• the error in close proximity to sample points is minimal,
• requires small number of sample points in comparison to other response surface
methods,
• reasonably good behaviour with high dimensional problems.

Cons:
• for a large number of data points and variables, training of the hyper-parameters
and prediction may become computationally expensive.
Researchers should make a conscious decision when choosing Kriging for their
RSMs. Such a decision should take into account the cost of a direct function eval-
uation including constraints (if any), available computational power, and dimen-
sionality of the problem. Sometimes it might be possible to use kriging for one
of the objectives while another is evaluated directly, or a different RSM is used to
minimize the cost.
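Although the chapter does not reproduce them, the kriging predictor and its error estimate commonly take the following form (see e.g. Jones et al. [7]); we state them here for reference only, and some formulations add a further correction term to the error estimate:

```latex
\hat{y}(x) = \hat{\mu} + r(x)^{\top} R^{-1} \left( y - \mathbf{1}\hat{\mu} \right),
\qquad
s^{2}(x) = \hat{\sigma}^{2} \left( 1 - r(x)^{\top} R^{-1} r(x) \right),
```

where R is the correlation matrix between the sampled designs, r(x) is the vector of correlations between a new design x and the samples, and the estimates of the mean and process variance, together with the correlation hyper-parameters, are obtained by maximum likelihood — this is where the hyper-parameter tuning discussed above enters.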
As this paper aims to demonstrate various approaches in making a better use of
surrogate models, we will use kriging throughout, but most conclusions could be
generalised for other RS methods as well. The chosen multiobjective algorithm is
NSGA2. Other multiobjective optimizers might show slightly different behaviour.
The basic procedure is as follows:

1. Carry out 20 LPtau [8] spaced initial direct function evaluations.
2. Train hyper-parameters, using a combination of GA and DHC
(dynamic hill climbing) [23].
3. Choose a selection of update strategies with specified num-
ber of updates.
4. Search the RSMs using each of the selected methods
5. Select designs that are best in terms of ranking and space
filling properties.
6. Evaluate selected designs and add to data set.
7. Produce Pareto front and compare with previous. Stop if 2-3
consecutive Pareto fronts are identical. Otherwise continue.
8. If Pareto front contains too many points, choose specified
number of points that are furthest away from each other
9. Repeat from step 2.
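The procedure above can be sketched end-to-end as follows. This is a deliberately crude, self-contained illustration: a toy ZDT1-like function stands in for the expensive simulation, a nearest-neighbour lookup stands in for the trained kriging model, random sampling replaces the LPtau sequence, and the update selection of step 5 is reduced to a naive sum-of-objectives scalarization. All names are ours, not from the OPTIONS suite.

```python
import random

def expensive(x):
    # toy ZDT1-like objective pair standing in for a costly simulation
    f1 = x[0]
    g = 1 + 9 * sum(x[1:]) / (len(x) - 1)
    return (f1, g * (1 - (f1 / g) ** 0.5))

class NearestSurrogate:
    """Crude stand-in for kriging: return the objectives of the nearest sampled design."""
    def __init__(self, X, Y):
        self.X, self.Y = X, Y
    def predict(self, x):
        i = min(range(len(self.X)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, self.X[k])))
        return self.Y[i]

def nondominated(Y):
    def dom(a, b):
        return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))
    return [i for i in range(len(Y))
            if not any(dom(Y[j], Y[i]) for j in range(len(Y)) if j != i)]

random.seed(1)
dim, n_iters, n_updates = 3, 5, 5
X = [[random.random() for _ in range(dim)] for _ in range(20)]  # step 1 (random DOE in place of LPtau)
Y = [expensive(x) for x in X]
for _ in range(n_iters):
    model = NearestSurrogate(X, Y)                              # step 2 ("training", trivial here)
    cand = [[random.random() for _ in range(dim)] for _ in range(200)]
    preds = [model.predict(c) for c in cand]                    # step 4: search the surrogate
    picks = sorted(range(len(cand)), key=lambda i: sum(preds[i]))[:n_updates]  # step 5, naive
    for i in picks:                                             # step 6: evaluate and add to data set
        X.append(cand[i])
        Y.append(expensive(cand[i]))
front = [Y[i] for i in nondominated(Y)]                         # step 7: current Pareto front
```

In the real procedure the surrogate search of step 4 uses several update strategies at once and the stopping test of step 7 compares consecutive Pareto fronts; both are omitted here.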

There are several possible stopping criteria:


• fixed number of update iterations,
• stop when all update points are dominated,
• stop if the percentage of new update points that belong to the Pareto front falls
below a pre-defined value,
• stop if the percentage of old points on the current Pareto front rises above a
pre-defined value,
• stop when there is no further improvement of the Pareto front quality. The quality
of the Pareto front is a complex multiobjective problem on its own. The best
Pareto front could be defined as the one being as close as possible to the origin
of the objective function space, while having the best diversity, i.e., spread on all

objectives and the points are evenly distributed. Metrics for assessing the quality
of the Pareto front are discussed by Deb [3].
We have used the last of these criteria for our studies.

7.6 Update Strategies and Related Parameters


One of the main aims of this publication is to show the effect of different update
strategies and number of updates. Here we consider the following six approaches in
various combinations:
• UPDMOD = 1; (Nr) - Random updates. These can help escape from local Pareto
fronts and enrich the genetic material,
• UPDMOD = 2; (Nrsm) - RSM Pareto front. A specified number of points are
extracted from the Pareto front obtained after the search of the response surface
models of the objectives and constraints (if any). When the RSM Pareto front is
rich it is possible to extract data that is uniformly distributed.
• UPDMOD = 3; (Nsl) - Secondary NSGA2 layer. A completely independent
NSGA2 algorithm is applied directly to the non-RSM objective functions and
constraints. This exploits the well known property of the NSGA2 which makes
it (slowly) converge to global solutions. During each update iteration, the direct
NSGA2 is run for one generation with population size of Nsl. There are two
strands to this approach. The first one is referred to as ‘decoupled’. The genetic
material is completely independent from the other update strategies. No entries
other than those from the direct NSGA2 are used. The second strand is referred
to as ‘coupled’, where the genetic information is composed of suitable designs
obtained by other participating update strategies. Suitable designs are selected in
terms of Pareto optimality, or rank in terms of NSGA2. Please note that although
it might sound similar, this is a completely different approach from the MμGA
algorithm proposed by Coello and Toscano (2000).
• UPDMOD = 4; (Nrmse) – Root Mean Squared Error (RMSE). When us-
ing kriging as a response surface model, it is possible to compute an estimate of
the RMSE, at no significant computational cost. The value of this metric is large
where there are large gaps between data points. RMSE is minimal close to or
at existing data points. Therefore adding updates at the location of the maximum
RMSE should significantly improve the quality and coverage of the response sur-
face model. When dealing with multiple objectives/constraints it is appropriate
to construct a Pareto front of maximum RMSEs for all objectives and extract
Nrmse points from it.
• UPDMOD = 5; (Nie) – Expected improvement (EI). This is another kriging spe-
cific function which represents the probability of finding the optimal point in a
new location. The update points are extracted from the Pareto front of the max-
imum values of the EI for all objectives. For constrained problems, the values
of EI for all objectives are multiplied by the feasibility of the con-
straints, which is 1 for satisfied constraints and 0 for infeasible ones, with a rather
smooth ridge around the constraint boundary; see Forrester et al. [22].

• UPDMOD = 6; (Nkmean) – The RSMs are searched using GA or DHC and points
are extracted using a k-mean cluster detection algorithm.
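For reference, the expected improvement update (UPDMOD = 5) relies, in the single-objective case, on the following commonly used expression (Jones et al. [7]):

```latex
EI(x) = \left( y_{\min} - \hat{y}(x) \right)
        \Phi\!\left( \frac{y_{\min} - \hat{y}(x)}{s(x)} \right)
      + s(x)\, \phi\!\left( \frac{y_{\min} - \hat{y}(x)}{s(x)} \right),
```

where ŷ(x) and s(x) are the kriging prediction and its estimated error, y_min is the best value sampled so far, and Φ and φ are the standard normal CDF and PDF (EI is taken as zero where s(x) = 0). The multiobjective strategy described above extracts update points from the Pareto front of these per-objective EI values.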
All these update strategies have their own strengths and weaknesses, and therefore
a suitable combination should be carefully considered. The results section of this
chapter provides some insights on the effects of each of these strategies when used
in various combinations.
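The k-mean selection step (UPDMOD = 6) can be sketched as follows. This is our own minimal k-means, initialized deterministically from the first k candidates rather than at random:

```python
def kmeans_representatives(points, k, iters=20):
    """Plain k-means over candidate designs; returns the candidate nearest
    to each cluster centre, giving k well-spread update points."""
    centres = [tuple(p) for p in points[:k]]  # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # recompute centres; an empty cluster keeps its previous centre
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    # map each centre back to the nearest actual candidate design
    return [min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c)))
            for c in centres]
```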

Additional Parameters that Can Affect the Search


The following parameters can also affect the performance of a multi-objective
RSM search:
• RSMSIZE – number of points used for RSM construction. It is expected that the
more points that are used, the more accurate the RSM predictions, however this
comes at increasing training cost. Therefore the number of training points should
be limited.
• EVALSIZE – number of points used during RSM evaluation. This stage is con-
siderably less expensive than training and therefore more points can be used dur-
ing the evaluation stage. Ultimately this should increase the density of candidate
designs and therefore leave fewer gaps for the RSM to approximate.
• EPREWARD – endpoint reward factor. Higher value rewards are given at the
end points of the Pareto front, which improves its spread. A lower value would
increase the pressure on the GA to explore the centre of the Pareto front.
• GA NPOP and GA NGEN – the population size and number of generations used
to search the RSM, RMSE and EI Pareto fronts.

7.7 Test Functions


Several test functions with various degrees of complexity have been chosen to
demonstrate the behaviour of the RS methods for the purpose of multiobjective opti-
mization. These functions are well known from the literature:
F5: (Fig. 7.4). High complexity shape – has a smooth and a sharp feature. The
combination of both makes it easier for the optimization procedure to converge to
the smooth feature, which represents a local Pareto front. The global Pareto front
lies around the sharp feature which is harder to reach. Two objectives, x_i ∈ [0, 1],
i = 1, 2; no constraints ([3], page 350).
ZDT1 - ZDT6: Clustered and discontinuous Pareto fronts. Shape complexity
is moderate. Two objectives, n variables (in the present study n = 2), no constraints,
x_i ∈ [0, 1] ([3], page 357).
ZDT1cons: Same formulation as for ZDT1 but with 25 variables and 2
constraints. Constraints are described in [3], page 368.
Bump: The bump function, 25 variables, 2 objectives, 1 constraint. We have used
the function as provided in [21], which is a single objective with two constraints.
We have made one of the constraints into a second objective, so that the optimiza-
tion problem is defined as: maximise the original objective, minimize the sum of

variables whilst keeping the product of the variables greater than 0.75. There are 25
variables, each varying between 0 and 3.

7.8 Pareto Front Metrics


To measure the performance of the various strategies discussed in this paper, we
have adopted several metrics. Some of them use comparison to an ‘ideal’ solution
which is denoted by Q and represents the Pareto front obtained using direct search
with a large number of iterations (20,000). All metrics are designed so that smaller
is better.

7.8.1 Generational Distance ([3], p. 326)


The average of the minimum Euclidean distance between each point of the two
Pareto fronts,

    gd = \frac{\sum_{i=1}^{|Q|} d_i}{|Q|},

where

    d_i = \min_{k=1,\dots,|P|} \sqrt{\sum_{j=1}^{M} \left( f_j^{(i)} - p_j^{(k)} \right)^2}

is the Euclidean distance between the solution (i) and the nearest member of Q.

7.8.2 Spacing
Standard deviation of the absolute differences between the solution (i) and the near-
est member of Q,

    sp = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( d_i - \bar{d} \right)^2},

where

    d_i = \min_{k=1,\dots,|P|} \sum_{j=1}^{M} \left| f_j^{(i)} - p_j^{(k)} \right|.

7.8.3 Spread

    \Delta = 1 - \frac{\sum_{m=1}^{M} d_m^e - \sum_{i=1}^{|Q|} \left| d_i - \bar{d} \right|}
                     {\sum_{m=1}^{M} d_m^e + |Q| \, \bar{d}},

where d_i is the absolute difference between neighbouring solutions and d_m^e is the
distance between the extreme points of the two fronts in the m-th objective. For
compatibility with the above metrics, the value of the spread is subtracted from 1,
so that a wider spread will produce a smaller value.

7.8.4 Maximum Spread


Normalized distance between the most distant points on the Pareto front. The dis-
tance is normalized against the maximum spread of the ‘ideal’ Pareto front. For
compatibility with the above metrics, the value of the maximum spread is subtracted
from 1, so that a wider spread will produce a smaller value,

    MS = 1 - \sqrt{\frac{1}{M} \sum_{m=1}^{M}
         \left( \frac{\max_{i=1,\dots,|Q|} f_m^{(i)} - \min_{i=1,\dots,|Q|} f_m^{(i)}}
                     {P_m^{\max} - P_m^{\min}} \right)^2}.
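The generational distance, spacing and maximum spread metrics can be coded directly from these definitions. The sketch below is our own; for simplicity it uses the Euclidean nearest-neighbour distance for d_i in both of the first two metrics, and assumes each front is a list of objective tuples:

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _nearest_distances(front, ideal):
    # d_i: distance from each obtained point to the nearest 'ideal' point
    return [min(_dist(f, p) for p in ideal) for f in front]

def generational_distance(front, ideal):
    d = _nearest_distances(front, ideal)
    return sum(d) / len(d)

def spacing(front, ideal):
    # standard deviation of the nearest-neighbour distances
    d = _nearest_distances(front, ideal)
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))

def maximum_spread(front, ideal):
    # 1 minus the extent of the obtained front, normalized by the ideal front's extent
    M = len(front[0])
    total = 0.0
    for m in range(M):
        extent = max(f[m] for f in front) - min(f[m] for f in front)
        ideal_extent = max(p[m] for p in ideal) - min(p[m] for p in ideal)
        total += (extent / ideal_extent) ** 2
    return 1 - math.sqrt(total / M)
```

All three return zero (their best value) when the obtained front coincides with the ideal one.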

7.9 Results
The study carried out aims to show the effect of applying various update strategies,
number of training and evaluation points, etc. The performance of each particular
approach is measured using the metrics described in the previous section.
An overall summary is given at the end of this section, but the best recipe ap-
pears to be highly problem dependent. It is also not possible to show all results for
all functions due to limited space, and we have therefore chosen several that best
represent the ideas discussed.
To correctly appreciate the results, please bear in mind that they are meant to
show diversity rather than a magic recipe that works in all situations.

7.9.1 Understanding the Results


The legend on the figures represents the selected strategy in the form
[Nr]-[Nrsm]-[Nsl]-[Nrmse]-[Nie]-[Nkmean]MUPD[RSMSIZE]MEVL[EVALSIZE]
so that an 8-14-15-10-3-3MUPD50MEVL300 would represent 8 random update
points, 14 RSM updates, 15 NSGA2 Second layer updates, 10 RMSE updates, 3
EI updates, 3 KMEAN updates with 50 kriging training points and 300 kriging
evaluation points.
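Decoding these labels mechanically may help when reproducing the figures; a small helper (our own, not part of the OPTIONS suite) could look like:

```python
import re

def parse_strategy(label):
    """Decode a strategy label like '8-14-15-10-3-3MUPD50MEVL300' into its parameters."""
    m = re.fullmatch(r"(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-(\d+)MUPD(\d+)MEVL(\d+)", label)
    if m is None:
        raise ValueError(f"unrecognised strategy label: {label!r}")
    keys = ["Nr", "Nrsm", "Nsl", "Nrmse", "Nie", "Nkmean", "RSMSIZE", "EVALSIZE"]
    return dict(zip(keys, map(int, m.groups())))
```

Note that the ‘dec’ and ‘43’ suffixes described below are not covered by this pattern.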
All approaches were given a maximum of 60 update iterations and a stopping cri-
terion of two consecutive unchanged Pareto fronts. The total number of runs is
recorded for each update iteration and all metrics are plotted against the number of real
function evaluations, (i.e. likely cost on real expensive problems).
Strategies with ‘dec’ appended to their name – indicate that the decoupled Second
layer is used, as opposed to coupled for those where Nsl = 30 and without any
appendix. Those labelled ‘43’ use a one pass constraint penalty expected improvement
strategy whilst those that have Nie = 30 and no appendix use a constraint feasibility
algorithm.

7.9.2 Preliminary Calculations


7.9.2.1 Finding the Ideal Pareto Front
As mentioned in section 7.8, most of the Pareto front metrics are based on com-
parison to an ‘ideal’ Pareto front. To find it, each of the test functions has been
run through a direct NSGA2 search (direct = without the use of surrogates) with
a population size of 100 for 200 generations, which takes 20000 function evaluations.

7.9.2.2 How Many Generations for the RSM Search?


We have conducted a study for each of the test functions to find the minimum
number of generations needed to achieve good convergence. We found that a
population size of 70 with 80 generations is sufficient for all of the test problems
and this is what we have used for our tests. Some test functions,
such as ZDT1 - ZDT6 with two variables could be converged using a smaller num-
ber of individuals and generations, however for comparison purposes we decided to
use the same settings for all functions.

7.9.2.3 What Is the Best Value for EPREWARD during the RSM Search?

The EPREWARD value is strictly individual for each function. Taking into account
the specifics of the test function it can improve the diversity of the Pareto front. The
default value is 0.65, which works well for most of the functions, but we have also
conducted studies where this parameter is varied between -1 and 1 in steps of 0.1,
and an individual value for each function is selected based on the best Pareto front metrics.

7.9.3 The Effect of the Update Strategy Selection


Fig. 7.5 shows that the selection of update strategy is important even for functions
with only two variables. F5 has a deceptive Pareto front and several update strategies
were not able to escape from the local Pareto front.
Fig. 7.6 clearly shows that some strategies have converged earlier than the others,
but some of them to the local front. Generally methods such as Random updates and
Secondary NSGA2 layer updates are not based on the RSM and are the strongest
candidates when deceptive features in the multiobjective space are expected. It is
a common observation amongst most of the low dimensional objective functions
(two or three variables) that using all the update techniques together is not necessar-
ily the winning strategy. However combining at least one RSM and one non-RSM
technique proves to work well. It is worth noting that the second
NSGA2 layer shows its effect after the sixth or seventh update iteration, as it needs time
to converge and gather genetic information.
Update strategies that employ a greater variety of techniques prove to be more
successful for functions with a higher number of variables (25).

Fig. 7.5 Pareto front for F5

Fig. 7.6 Generational distance for F5

Fig. 7.9 and Fig. 7.10 show that the ‘bump’ function is particularly difficult for
all strategies, which makes it a good test problem. This function has an extremely
tight constraint and multimodal features. It is not yet clear which combination of
strategies should be recommended, as the ‘ideal’ Pareto front has not been reached,
however it seems that a decoupled secondary NSGA2 layer is showing a good

Fig. 7.7 Pareto front for ZDT1cons

Fig. 7.8 Generational distance for ZDT1cons

advancement. We are continuing studies on this function and will give results in
future publications.
To summarize the performance of each strategy, an average statistic is com-
puted. It is derived as follows. The actual performance in most cases is a trade-
off between a given metric and the number of function evaluations needed for

Fig. 7.9 Pareto front for the ‘bump’ function

Fig. 7.10 Generational distance for the ‘bump’ function

convergence. Therefore the four metrics can be ranked against the number of runs,
in the same way as ranks are obtained during NSGA2 operation. The obtained ranks
are then averaged across all test functions. Low average rank means that the strategy
has been optimal for more metrics and functions. These results are summarized in
Table 7.2.

Table 7.2 Summary of performance

Random RSM PF SL RMSE EI KMEAN Av. Rank Min. Rank Max. Rank Note
0 30 0 0 30 0 1.53 1 2 EI const.feas
0 30 30 0 0 0 1.83 1 3.33 SL coupled
0 30 0 30 0 0 2 1.33 3.33 RMSE
0 30 0 0 30 0 2.2 1.33 3 EI normal
0 30 30 0 0 0 2.8 1.33 4 SL decoupled
30 30 0 0 0 0 2.84 2 4 Random
0 60 0 0 0 0 2.85 2 3.33 RSM PF

The summary shows that all strategies are generally better than using only the
conventional RSM based updates, which is expected, as the conventional method is
almost always bound to converge at local solutions. However it must be underlined
that the correct selection is problem dependent and must be made with care and
understanding.

7.9.4 The Effect of the Initial Design of Experiments


All methods presented here start from a given initial design of experiments. This is
the starting point and this is what the initial surrogate model is based on. It is of
course important to show the effect of these initial conditions. In what follows we
have shown that effect by using a range of different initial DOEs. We have again

Fig. 7.11 Generational distance for zdt1 starting from different initial DOEs

Fig. 7.12 Generational distance for F5 starting from different initial DOEs

Fig. 7.13 Pareto fronts for ‘bump’ starting from different initial DOEs

used 10 updates for each of the techniques (60 updates per iteration in total) for all
functions. The only difference being the starting set of designs.
Fig. 7.11 and Fig. 7.12 illustrate the generational distance for zdt1 and f5 func-
tions - both with two variables. They both demonstrate good repeatability,

Fig. 7.14 Generational distance for ‘bump’ starting from different initial DOEs

Fig. 7.15 Pareto fronts for ‘zdt1cons’ starting from different initial DOEs

confirming once again that the surrogate updates are fairly robust for functions with
a low number of variables.
Figures 7.13, 7.14 and 7.15 illustrate much greater variance and show that high
dimensionality is a difficult challenge for surrogate strategies, however one should
also consider the low number of function evaluations used here.

7.10 Summary
In this publication we have aimed to share our experience in tackling expensive
multiobjective problems. We have shown that as soon as we decide to use surrogate
models to substitute for expensive objective functions, we need to consider a num-
ber of other specifics in order to produce a useful Pareto front. We have discussed
the challenges that one might face when using surrogates and have proposed six up-
date strategies that one might wish to use. Given an understanding of these strategies,
the researcher should decide on the budget of updates they can afford and then
spread this budget over several update strategies. We have shown that it is best to
use at least two different strategies – ideally a mixture of RSM and non-RSM based
techniques. When solving problems with few variables we have shown that a com-
bination of two or three techniques is sufficient, however with higher dimensional
problems, one should consider using more techniques.
It is also beneficial to constrain the number of designs that are used for RSM
training and for RSM evaluation, to limit the cost. The method for selecting these
designs is open to further research. In this material we have used selection based
on Pareto front ranking.
Our research also included parameters that reward the search for exploring the
end points on the Pareto front. Although not explicitly mentioned in this material,
our studies use features such as improved crossover, mutation and selection
strategies, and a declustering algorithm applied in both the variable and objective
spaces to avoid data clustering. Data is also automatically conditioned and filtered,
and advanced kriging tuning techniques are used. These features are part of the
OPTIONS [1], OptionsMATLAB and OptionsNSGA2 RSM suites [24].

Acknowledgements. This work was funded by Rolls-Royce Plc, whose support is
gratefully acknowledged.

References
1. Keane, A.J.: OPTIONS manual,
http://www.soton.ac.uk/˜ajk/options.ps
2. Obayashi, S., Jeong, S., Chiba, K.: Multi-Objective Design Exploration for Aerodynamic
Configurations, AIAA-2005-4666
3. Deb, K.: Multi-objective optimization using evolutionary algorithms. John Wiley &
Sons, Ltd., New York (2003)
4. Zitzler et al.: Comparison of multiobjective evolutionary algorithms: Empirical results.
Evolutionary Computation 8(2), 125–148 (2000)
5. Knowles, J., Corne, D.: The Pareto archived evolution strategy: A new baseline algorithm
for multiobjective optimisation. In: Proceedings of the 1999 Congress on Evolutionary
Computation, pp. 98–105. IEEE Service Center, Piscataway (1999)
6. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint han-
dling with evolutionary algorithms - Part II: Application example. IEEE Transactions on
Systems, Man, and Cybernetics: Part A: Systems and Humans, 38–47 (1998)
7 Multi-objective Optimization Using Surrogates 175

7. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-
box functions. Journal of Global Optimization 13, 455–492 (1998)
8. Sobol’, I.M., Turchaninov, V.I., Levitan, Y.L., Shukhman, B.V.: Quasi-Random Sequence
Generators. Keldysh Institute of Applied Mathematics, Russian Academy of Sciences,
Moscow (1992)
9. Nowacki, H.: Modelling of Design Decisions for CAD. In: Goos, G., Hartmanis, J.
(eds.) Computer Aided Design Modelling, Systems Engineering, CAD-Systems. LNCS,
vol. 89. Springer, Heidelberg (1980)
10. Kumano, T., et al.: Multidisciplinary Design Optimization of Wing Shape for a Small Jet
Aircraft Using Kriging Model. In: 44th AIAA Aerospace Sciences Meeting and Exhibit,
January 2006, pp. 1–13 (2006)
11. Nain, P.K.S., Deb, K.: A multi-objective optimization procedure with successive approx-
imate models. KanGAL Report No. 2005002 (March 2005)
12. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of
Excellence (2005) ISBN: 0-470-85540-1
13. Leary, S., Bhaskar, A., Keane, A.J.: A derivative based surrogate model for approximat-
ing and optimizing the output of an expensive computer simulation. J. Global Optimiza-
tion 30, 39–58 (2004)
14. Leary, S., Bhaskar, A., Keane, A.J.: A Constraint Mapping Approach to the Structural
Optimization of an Expensive Model using Surrogates. Optimization and Engineering 2,
385–398 (2001)
15. Emmerich, M., Naujoks, B.: Metamodel-assisted multiobjective optimization strategies
and their application in airfoil design. In: Parmee, I. (ed.) Proc of. Fifth Int’l. Conf.
on Adaptive Design and Manufacture (ACDM), Bristol, UK, April 2004, pp. 249–260.
Springer, Berlin (2004)
16. Giotis, A.P., Giannakoglou, K.C.: Single- and Multi-Objective Airfoil Design Using Ge-
netic Algorithms and Artificial Intelligence. In: EUROGEN 1999, Evolutionary Algo-
rithms in Engineering and Computer Science (May 1999)
17. Knowles, J., Hughes, E.J.: Multiobjective optimization on a budget of 250 evaluations.
In: Coello Coello, C.A., Hernández Aguirre, A., Zitzler, E. (eds.) EMO 2005. LNCS,
vol. 3410, pp. 176–190. Springer, Heidelberg (2005)
18. Chafekar, D., et al.: Multi-objective GA optimization using reduced models. IEEE
SMCC 35(2), 261–265 (2005)
19. Nain, P.: A computationally efficient multi-objective optimization procedure using suc-
cessive function landscape models. Ph.D. dissertation, Department of Mechanical Engi-
neering, Indian Institute of Technology (July 2005)
20. Voutchkov, I.I., Keane, A.J.: Multiobjective optimization using surrogates. In: Proc. 7th
Int. Conf. Adaptive Computing in Design and Manufacture (ACDM 2006), Bristol, pp.
167–175 (2006) ISBN 0-9552885-0-9
21. Keane, A.J.: Bump: A Hard (?) Problem (1994),
http://www.soton.ac.uk/~ajk/bump.html
22. Forrester, A., Sobester, A., Keane, A.: Engineering Design via Surrogate Modelling. Wiley,
Chichester (2008)
23. Yuret, D., Maza, M.: Dynamic hill climbing: Overcoming the limitations of optimization
techniques. In: The Second Turkish Symposium on Artificial Intelligence and Neural
Networks, pp. 208–212 (1993)
24. OptionsMatlab & OptionsNSGA2 RSM,
http://argos.e-science.soton.ac.uk/blogs/OptionsMatlab/
Chapter 8
A Review of Agent-Based Co-Evolutionary
Algorithms for Multi-Objective Optimization

Rafał Dreżewski and Leszek Siwik

Abstract. Agent-based evolutionary algorithms result from combining two paradigms:
multi-agent systems and evolutionary algorithms. Agent-based co-evolutionary
algorithms allow many species and sexes of agents to exist within the system, and
allow co-evolutionary interactions between those species and sexes to be defined.
Algorithms based on the model of a co-evolutionary multi-agent system have already
been applied in many domains, such as multi-modal optimization, generation of
investment strategies, portfolio optimization, and multi-objective optimization. In
this chapter we present an overview of selected agent-based co-evolutionary algorithms,
their formal models, and the results of experiments with standard test problems and a
financial problem, aimed at comparing agent-based and "classical" state-of-the-art
multi-objective algorithms. The presented results show that, depending on the problem
being solved, agent-based algorithms obtain comparable, and sometimes even better,
results than "classical" algorithms; they are, of course, not a universal solver for
every multi-objective optimization problem.

8.1 Introduction
Despite the huge potential dormant in evolutionary algorithms and their many successful
applications to difficult optimization and search problems, such methods are frequently
unable to deal with a given problem, and the results obtained are unsatisfying. Among
the reasons for this situation, the following can be mentioned:
• the centralization of the evolutionary process, in which both selection and the
creation of new generations are controlled by one single algorithm;
Rafał Dreżewski · Leszek Siwik
Department of Computer Science
AGH University of Science and Technology, Kraków, Poland
e-mail: {drezew,siwik}@agh.edu.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 177–209.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
178 R. Dreżewski and L. Siwik

• the reduction of individuals to a (system of) genes, with no capability of exerting
any influence on the process of evolution;
• the omission of some operations and processes observable in nature that are
crucial from the point of view of evolution and adaptation capabilities. Moreover,
opinions can be found in the literature that crossover and mutation are only
variants of one single destructive, exploration-oriented operator, and that there is
no agreement on whether (and if so, when) they should be used, or even whether
they should be distinguished [17];
• during decision making, individuals can neither gather nor utilize any kind of
information from the environment in order to realize their own goals;
• individuals are deprived of biological and social behaviors that are absolutely
natural and obvious in nature, such as competition, rivalry, cooperation, etc.;
• as a consequence of the previous points (the limited number of operators), it is
almost impossible to define more sophisticated (and at the same time more
effective) advanced algorithms and computational methods within classical
evolutionary algorithms.
Consequently, arguments are raised in the literature that classical evolutionary
algorithms are methods of adapting and fitting an algorithm's parameters to defined
conditions rather than truly creative methods of search and optimization.
It is therefore not strange that intensive research is being performed on methods
utilizing the ideas and conceptions of computer models of the Darwinian evolution
observable in nature, but at the same time on methods that should be devoid of the
shortcomings mentioned above and which could be perceived as a full analogy to
natural processes.
During this research, decentralization and autonomy have been in the limelight.
The resulting method, called the Evolutionary Multi-Agent System (EMAS) [2],
should be perceived as a new trend among evolutionary algorithms, allowing the
defined postulates to be realized by simultaneously utilizing the advantages of both
evolutionary and agent-based approaches.
The proposed paradigm of the evolutionary multi-agent system is characterized by
the following features, which are crucial given the shortcomings of classical
evolutionary algorithms:
• autonomous agents take part in the process of evolution. Agents are able to make
decisions to realize their own goals; they are not passive units of a global and
central evolution, limited and reduced to the role of a (group of) genes;
• the process of evolution is decentralized, and the agents taking part in it are able
to create advanced social structures and to realize sophisticated strategies of
cooperation, competition, interaction and reciprocal relations;
• agents taking part in the process of evolution are able to observe the environment
(and the changes occurring in it) and to make appropriate decisions and take
actions, which additionally enriches the spectrum of complex and effective
computational methods and algorithms that can be realized.

During further research on realizing advanced, complex social and biological
mechanisms within the confines of EMAS, a general model of so-called
co-evolutionary multi-agent systems (CoEMAS) [8] has been proposed. It has turned
out that with the use of such a model almost any kind of interaction, cooperation or
competition among many species or sexes of co-evolving agents is possible, which
allows the quality of the obtained results to be improved. Such improvement results
mainly from better maintenance of population diversity, which is especially important
when applying such systems to multi-modal or multi-objective optimization tasks.
In the course of this chapter we focus on applying co-evolutionary multi-agent
systems to multi-objective optimization tasks.
Following [5], the multi-objective optimization problem (MOOP) in its general
form is defined as follows:

         ⎧ Minimize/Maximize  f_m(x̄),                  m = 1, 2, …, M
MOOP ≡   ⎪ subject to         g_j(x̄) ≥ 0,              j = 1, 2, …, J
         ⎨                    h_k(x̄) = 0,              k = 1, 2, …, K
         ⎩                    x_i^(L) ≤ x_i ≤ x_i^(U),  i = 1, 2, …, N

The authors of this chapter assume that readers are familiar with at least the
fundamental concepts and notions of multi-objective optimization in the Pareto sense
(the domination relation, the Pareto frontier and Pareto set, etc.), so their explanation
is omitted in this paper (interested readers can find definitions and a deep analysis of
all the necessary concepts and notions of Pareto multi-objective optimization in, for
instance, [3, 5]).
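A minimal sketch of the two basic ingredients of this formulation, constraint feasibility and Pareto domination, may help fix the notation (minimization assumed; the helper names and tolerance handling are our own, not part of [5]):

```python
def dominates(f_a, f_b):
    """Pareto domination for minimization: f_a dominates f_b if it is no
    worse in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(f_a, f_b)) and any(a < b for a, b in zip(f_a, f_b))

def is_feasible(x, g_list, h_list, lower, upper, tol=1e-9):
    """Check the MOOP constraints: g_j(x) >= 0, h_k(x) = 0 (within tol),
    and the box bounds x_i^(L) <= x_i <= x_i^(U)."""
    if any(g(x) < -tol for g in g_list):      # inequality constraints
        return False
    if any(abs(h(x)) > tol for h in h_list):  # equality constraints
        return False
    return all(lo - tol <= xi <= hi + tol
               for xi, lo, hi in zip(x, lower, upper))
```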
This chapter is organized as follows:
• in Section 8.2 a formal model and a detailed description of the Co-Evolutionary
Multi-Agent System (CoEMAS) is presented;
• in Section 8.3 detailed descriptions and formal models of two realizations of
CoEMAS applied to solving MOOPs are given: the Co-Evolutionary Multi-Agent
System with Predator-Prey interactions (PPCoEMAS) and the Co-Evolutionary
Multi-Agent System with Cooperation (CCoEMAS);
• in Section 8.4 we briefly discuss the test suite and the performance metrics used
during the experiments, and then review the results obtained by both systems
presented in the course of this chapter (PPCoEMAS and CCoEMAS);
• in Section 8.5 the most important remarks, conclusions and comments are given.

8.2 Model of Co-Evolutionary Multi-Agent System

Agent-based models of evolutionary algorithms are the result of mixing two
paradigms: multi-agent systems and evolutionary algorithms. The result is a
decentralized evolutionary system in which agents "live" within the environment of
the system, compete for limited resources, reproduce, die, migrate from one
computational node to another, observe the environment and other agents, and can
communicate with other agents and change the environment.
The basic model of an agent-based evolutionary algorithm (the so-called
evolutionary multi-agent system, or EMAS, model) was proposed in [2]. The EMAS
model includes all the features mentioned above. However, in the case of some
problems, for example multi-modal or multi-objective optimization, it turned out that
these mechanisms are not sufficient. Such problems require mechanisms for
maintaining population diversity, speciation mechanisms, and the possibility of
introducing additional biologically and socially inspired mechanisms in order to
solve the problem and obtain satisfying results.
The above-mentioned limitations of the basic EMAS model, and research aimed at
applying agent-based evolutionary algorithms to multi-modal and multi-objective
problems, led to the formulation of the model of the co-evolutionary multi-agent
system (CoEMAS) [8]. This model includes the possibility of different species and
sexes existing in the system and allows co-evolutionary interactions between them to
be defined. Below we present the basic ideas and notions of the CoEMAS model,
which we will use in Section 8.3, where the systems used in the experiments are
described.

8.2.1 Co-Evolutionary Multi-Agent System

The CoEMAS is described as a 4-tuple:

    CoEMAS = ⟨E, S, Γ, Ω⟩                                                (8.1)

where E is the environment of the CoEMAS, S is the set of species (s ∈ S) that
co-evolve in the CoEMAS, Γ is the set of resource types that exist in the system
(the amount of resource of type γ will be denoted by r^γ), and Ω is the set of
information types that exist in the system (information of type ω will be denoted
by i^ω).

8.2.2 Environment
The environment of the CoEMAS may be described as a 3-tuple:

    E = ⟨T^E, Γ^E, Ω^E⟩                                                  (8.2)

where T^E is the topography of environment E, Γ^E is the set of resource types
that exist in the environment, and Ω^E is the set of information types that exist in
the environment. The topography of the environment is given by:

    T^E = ⟨H, l⟩                                                         (8.3)

where H is a directed graph with a cost function c defined: H = ⟨V, B, c⟩, where V
is the set of vertices and B is the set of arcs. The distance between two nodes is
defined as the length of the shortest path between them in graph H.
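For non-negative arc costs, this graph distance can be sketched with a standard shortest-path computation (an illustrative helper, not part of the CoEMAS formalism; the data layout for `arcs` and `cost` is our own):

```python
import heapq

def distance(vertices, arcs, cost, src, dst):
    """Shortest-path distance in the directed graph H = <V, B, c>:
    `arcs` maps each vertex to its out-neighbours and `cost` maps each
    arc (u, v) to a non-negative length (Dijkstra's algorithm)."""
    dist = {v: float("inf") for v in vertices}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist[u]:          # stale heap entry
            continue
        for v in arcs.get(u, ()):
            nd = d + cost[(u, v)]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist[dst]             # inf if dst is unreachable
```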

Fig. 8.1 Co-evolutionary multi-agent system

The l function makes it possible to locate a particular agent in the environment
space:

    l : A → V                                                            (8.4)

where A is the set of agents that exist in the CoEMAS.
Vertex v is given by:

    v = ⟨A^v, Γ^v, Ω^v, ϕ⟩                                               (8.5)

where A^v is the set of agents located in vertex v, Γ^v is the set of resource types
that exist within v (Γ^v ⊆ Γ^E), Ω^v is the set of information types that exist
within v (Ω^v ⊆ Ω^E), and ϕ is the fitness function.

8.2.3 Species
Species s ∈ S is defined as follows:

    s = ⟨A^s, SX^s, Z^s, C^s⟩                                            (8.6)

where:
• A^s is the set of agents of species s (by a^s we denote an agent of species s,
a^s ∈ A^s);
• SX^s is the set of sexes within s;
• Z^s is the set of actions which can be performed by the agents of species s
(Z^s = ∪_{a∈A^s} Z^a, where Z^a is the set of actions which can be performed by
agent a);
• C^s is the set of relations with the other species that exist within the CoEMAS.
The set of relations of s_i with other species (C^{s_i}) is the union of the following
sets of relations:

    C^{s_i} = { −→^(s_i,z−) : z ∈ Z^{s_i} } ∪ { −→^(s_i,z+) : z ∈ Z^{s_i} }      (8.7)

where −→^(s_i,z−) and −→^(s_i,z+) are relations between species, based on some
action z ∈ Z^{s_i} which can be performed by the agents of species s_i:

    −→^(s_i,z−) = { ⟨s_i, s_j⟩ ∈ S × S : agents of species s_i can decrease the fitness
                   of agents of species s_j by performing the action z ∈ Z^{s_i} }    (8.8)

    −→^(s_i,z+) = { ⟨s_i, s_j⟩ ∈ S × S : agents of species s_i can increase the fitness
                   of agents of species s_j by performing the action z ∈ Z^{s_i} }    (8.9)
If s_i −→^(s_i,z−) s_i then we are dealing with intra-species competition, for example
competition for limited resources, and if s_i −→^(s_i,z+) s_i then there is some form
of co-operation within species s_i.
With the use of the above relations we can define many different co-evolutionary
interactions, e.g., mutualism, predator-prey, host-parasite, etc. For example,
mutualism between two species s_i and s_j (i ≠ j) takes place if and only if
∃ z_k ∈ Z^{s_i}, ∃ z_l ∈ Z^{s_j} such that s_i −→^(s_i,z_k+) s_j and
s_j −→^(s_j,z_l+) s_i, and these two species live in tight co-operation.
Predator-prey interactions between two species, s_i (predators) and s_j (preys)
(i ≠ j), take place if and only if ∃ z_k ∈ Z^{s_i}, ∃ z_l ∈ Z^{s_j} such that
s_i −→^(s_i,z_k−) s_j and s_j −→^(s_j,z_l+) s_i,
where z_k is the action of killing the prey (kill), and z_l is the action of death (die).
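As an illustration, such relation sets can be stored as a simple mapping and queried for the predator-prey pattern (the data layout below, keyed by species, action and sign, is our own, not part of the formal model):

```python
def is_predator_prey(relations, si, sj):
    """True if some action of si decreases the fitness of sj (e.g. `kill`)
    while some action of sj increases the fitness of si (e.g. `die`).
    `relations` maps (species, action, sign) to the set of ordered species
    pairs that the signed relation contains."""
    si_harms_sj = any(sign == "-" and (si, sj) in pairs
                      for (s, _z, sign), pairs in relations.items() if s == si)
    sj_helps_si = any(sign == "+" and (sj, si) in pairs
                      for (s, _z, sign), pairs in relations.items() if s == sj)
    return si_harms_sj and sj_helps_si
```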

8.2.4 Sex
The sex sx ∈ SX^s within species s is defined as follows:

    sx = ⟨A^sx, Z^sx, C^sx⟩                                              (8.10)

where A^sx is the set of agents of sex sx and species s (A^sx ⊆ A^s):

    A^sx = {a : a ∈ A^s ∧ a is an agent of sex sx}                       (8.11)

By a^sx we denote an agent of sex sx (a^sx ∈ A^sx). Z^sx is the set of actions
which can be performed by the agents of sex sx, Z^sx = ∪_{a∈A^sx} Z^a, where Z^a
is the set of actions which can be performed by agent a. Finally, C^sx is the set of
relations between sx and the other sexes of species s.
Analogously to the case of species, we can define the relations between the sexes
of the same species. The set of all relations of the sex sx_i ∈ SX^s with the other
sexes of species s (C^{sx_i}) is the union of the following sets of relations:

    C^{sx_i} = { −→^(sx_i,z−) : z ∈ Z^{sx_i} } ∪ { −→^(sx_i,z+) : z ∈ Z^{sx_i} }   (8.12)

where −→^(sx_i,z−) and −→^(sx_i,z+) are the relations between sexes in which some
action z ∈ Z^{sx_i} is used:

    −→^(sx_i,z−) = { ⟨sx_i, sx_j⟩ ∈ SX^s × SX^s : agents of sex sx_i can decrease the
                    fitness of agents of sex sx_j by performing the action z ∈ Z^{sx_i} }  (8.13)

    −→^(sx_i,z+) = { ⟨sx_i, sx_j⟩ ∈ SX^s × SX^s : agents of sex sx_i can increase the
                    fitness of agents of sex sx_j by performing the action z ∈ Z^{sx_i} }  (8.14)

With the use of the presented relations between sexes we can model, for example,
sexual-selection interactions, in which agents of one sex choose partners for
reproduction from agents of the other sex within the same species, taking into
account some preferred features (see [10]).

8.2.5 Agent
Agent a (see Fig. 8.2) of sex sx and species s (in order to simplify the notation we
assume that a ≡ a^{sx,s}) is defined as follows:

    a = ⟨gn^a, Z^a, Γ^a, Ω^a, PR^a⟩                                      (8.15)

where:
• gn^a is the genotype of agent a, which may be composed of any number of
chromosomes (for example gn^a = (x_1, x_2, …, x_k), where x_i ∈ ℝ, gn^a ∈ ℝ^k);
• Z^a is the set of actions which agent a can perform;
• Γ^a is the set of resource types which are used by agent a (Γ^a ⊆ Γ);
• Ω^a is the set of information types which agent a can possess and use (Ω^a ⊆ Ω);
• PR^a is a partially ordered set of profiles of agent a (PR^a ≡ ⟨PR^a, ≽⟩) with a
defined partial order relation ≽.

Fig. 8.2 Agent in the CoEMAS

The relation ≽ is defined in the following way:

    ≽ = { ⟨pr_i, pr_j⟩ ∈ PR^a × PR^a : the realization of the active goals of
          profile pr_i has equal or higher priority than the realization of
          the active goals of profile pr_j }                             (8.16)

An active goal (denoted gl*) is a goal gl which should be realized at the given time.
The relation ≽ is reflexive, transitive and antisymmetric, and partially orders the
set PR^a:

    pr ≽ pr  for every pr ∈ PR^a                                         (8.17a)

    (pr_i ≽ pr_j ∧ pr_j ≽ pr_k) ⇒ pr_i ≽ pr_k
        for every pr_i, pr_j, pr_k ∈ PR^a                                (8.17b)

    (pr_i ≽ pr_j ∧ pr_j ≽ pr_i) ⇒ pr_i = pr_j
        for every pr_i, pr_j ∈ PR^a                                      (8.17c)

The set of profiles PR^a is defined in the following way:

    PR^a = {pr_1, pr_2, …, pr_n}                                         (8.18a)

    pr_1 ≽ pr_2 ≽ ⋯ ≽ pr_n                                               (8.18b)

Profile pr_1 is the basic profile: the realization of its goals has the highest priority,
and they will be realized before the goals of the other profiles.
A profile pr of agent a (pr ∈ PR^a) can be one in which only resources are used:

    pr = ⟨Γ^pr, ST^pr, RST^pr, GL^pr⟩                                    (8.19)

Algorithm 6. Basic activities of agent a in CoEMAS

    r^γ ← r^γ_init    /* r^γ_init is the initial amount of resource given to the agent */
    while r^γ > 0 do
        activate the profile pr_i ∈ PR^a with the highest priority and with an active
            goal gl*_j ∈ GL^{pr_i}
        if pr_i is the resource profile then
            if 0 < r^γ < r^γ_min then    /* r^γ_min is the minimal amount of resource
                                            needed by the agent to realize its activities */
                choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be
                    used to take some resources from the environment or another agent
                perform the actions contained within st_k
            else if r^γ = 0 then
                execute the die strategy
            end
        else if pr_i is the reproduction profile then
            if r^γ > r^{rep,γ}_min then    /* r^{rep,γ}_min is the minimal amount of
                                              resource needed for reproduction */
                choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be
                    used to reproduce
                perform the actions contained within st_k
            end
        else if pr_i is the migration profile then
            if r^γ > r^{mig,γ}_min then    /* r^{mig,γ}_min is the minimal amount of
                                              resource needed for migration */
                choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be
                    used to migrate
                perform the actions contained within st_k
                give r^{mig,γ}_min of resource to the environment
            end
        end
    end

A profile can also be one in which only information is used:

    pr = ⟨Ω^pr, M^pr, ST^pr, RST^pr, GL^pr⟩                              (8.20)

or one in which both resources and information are used:

    pr = ⟨Γ^pr, Ω^pr, M^pr, ST^pr, RST^pr, GL^pr⟩                        (8.21)
where:
• Γ^pr is the set of resource types which are used within profile pr (Γ^pr ⊆ Γ^a);
• Ω^pr is the set of information types which are used within profile pr
(Ω^pr ⊆ Ω^a);
• M^pr is the set of information representing the agent's knowledge about the
environment and other agents (it is the model of the environment of agent a);
• ST^pr is the partially ordered set of strategies (ST^pr ≡ ⟨ST^pr, ≽⟩) which can be
used by the agent within profile pr in order to realize an active goal of this profile;
• RST^pr is the set of strategies that are realized within profile pr; generally, not
all of the strategies from the set ST^pr have to be realized within profile pr, as
some of them may be realized within other profiles;
• GL^pr is the partially ordered set of goals (GL^pr ≡ ⟨GL^pr, ≽⟩) which the agent
has to realize within profile pr.
The relation ≽ on strategies is defined in the following way:

    ≽ = { ⟨st_i, st_j⟩ ∈ ST^pr × ST^pr : strategy st_i has equal or higher
          priority than strategy st_j }                                  (8.22)

This relation is reflexive, transitive and antisymmetric, and partially orders the set
ST^pr. Every single strategy st ∈ ST^pr consists of actions whose ordered
performance leads to the realization of some active goal of profile pr:

    st = ⟨z_1, z_2, …, z_k⟩,   st ∈ ST^pr,  z_i ∈ Z^a                    (8.23)

The relation ≽ on goals is defined in the following way:

    ≽ = { ⟨gl_i, gl_j⟩ ∈ GL^pr × GL^pr : goal gl_i has equal or higher
          priority than goal gl_j }                                      (8.24)

This relation is reflexive, transitive and antisymmetric, and partially orders the set
GL^pr.
The partially ordered sets of profiles PR^a, goals GL^pr and strategies ST^pr are
used by the agent in order to make decisions about the goal to realize and to choose
the appropriate strategy for realizing that goal. The basic activities of agent a are
shown in Algorithm 6.
In CoEMAS systems the set of profiles is usually composed of a resource profile
(pr_1), a reproduction profile (pr_2), and a migration profile (pr_3):

    PR^a = {pr_1, pr_2, pr_3}                                            (8.25a)

    pr_1 ≽ pr_2 ≽ pr_3                                                   (8.25b)

The resource profile has the highest priority, followed by the reproduction profile
and, finally, the migration profile.
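One pass of the Algorithm 6 loop can be sketched in Python as follows (the class layout, threshold names and the strategy callbacks are illustrative assumptions, not the original implementation; in the full model the resource profile's goal would only be active when resources are low):

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    kind: str            # "resource", "reproduction" or "migration"
    active: bool = True  # whether the profile currently has an active goal

@dataclass
class Agent:
    resource: float
    r_min: float = 2.0       # minimal resource needed for normal activity
    r_rep_min: float = 5.0   # minimal resource needed for reproduction
    r_mig_min: float = 1.0   # resource paid to the environment on migration
    alive: bool = True
    profiles: list = field(default_factory=lambda: [
        Profile("resource"), Profile("reproduction"), Profile("migration")])

def agent_step(agent, take_resource, reproduce, migrate):
    """One pass of the Algorithm 6 loop: activate the highest-priority
    profile with an active goal, then act according to its thresholds.
    `take_resource`, `reproduce` and `migrate` stand in for the profile
    strategies (seek/get, seekPartner/clone/rec/mut, migr)."""
    profile = next(p for p in agent.profiles if p.active)  # priority-ordered
    r = agent.resource
    if profile.kind == "resource":
        if 0 < r < agent.r_min:
            agent.resource += take_resource(agent)
        elif r == 0:
            agent.alive = False                # the die strategy
    elif profile.kind == "reproduction":
        if r > agent.r_rep_min:
            reproduce(agent)
    elif profile.kind == "migration":
        if r > agent.r_mig_min:
            migrate(agent)
            agent.resource -= agent.r_mig_min  # given back to the environment
```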

8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective
Optimization
In this section we describe the two co-evolutionary multi-agent systems used in the
experiments. Each of these systems uses a different co-evolutionary mechanism:
co-operation and predator-prey interactions, respectively. Both systems are based on
the general model of co-evolution in a multi-agent system described in Section 8.2;
in this section only those elements of the systems are described that are specific to
these instantiations of the general model. In both systems presented below,
real-valued vectors are used as agents' genotypes. Mutation with self-adaptation and
intermediate recombination are used as evolutionary operators [1].

8.3.1 Co-Evolutionary Multi-Agent System with Co-Operation
Mechanism (CCoEMAS)
The co-evolutionary multi-agent system with a co-operation mechanism is defined
as follows (see Eq. (8.1)):

    CCoEMAS = ⟨E, S, Γ, Ω⟩                                               (8.26)

The number of species corresponds to the number of criteria (n) of the
multi-objective problem being solved, S = {s_1, …, s_n}. Three information types
(Ω = {ω_1, ω_2, ω_3}) and one resource type (Γ = {γ}) are used. Information of
type ω_1 denotes the nodes to which an agent can migrate. Information of type ω_2
denotes (for an agent of a given species) all agents from other species that are
located within the same node at time t. Information of type ω_3 denotes (for a given
agent) all agents from the same species located within the same node.

8.3.1.1 Species
The species s is defined as follows:

    s = ⟨A^s, SX^s = {sx}, Z^s, C^s⟩                                     (8.27)

where SX^s is the set of sexes which exist within species s, Z^s is the set of actions
that agents of species s can perform, and C^s is the set of relations of species s with
the other species that exist in the CCoEMAS.
Actions
The set of actions Z^s is defined as follows:

    Z^s = {die, seek, get, give, accept, seekPartner, clone, rec, mut, migr}   (8.28)

where:
• die is the action of death (an agent dies when it is out of resources);
• seek is the action of finding a dominated agent from the same species in order to
take some resources from it;
• the get action takes some resource from another agent located within the same
node which is dominated by the agent performing the get action;
• the give action gives some resources to the agent performing the get action;
• the accept action accepts a partner for reproduction when the amount of resource
possessed by the agent is above the given level;
• the seekPartner action seeks a partner for reproduction, such that it comes from
another species and has an amount of resource above the minimal level needed
for reproduction;
• clone is the action of producing offspring (parents give some of their resources
to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the
agent loses some of its resource.

Relations
The set of relations of species s_i with the other species that exist within the
system is defined as follows:

    C^{s_i} = { −→^(s_i,get−), −→^(s_i,accept+) }                        (8.29)

The first relation models intra-species competition for limited resources:

    −→^(s_i,get−) = { ⟨s_i, s_i⟩ }                                       (8.30)

The second one models co-operation between species:

    −→^(s_i,accept+) = { ⟨s_i, s_j⟩ }                                    (8.31)

8.3.1.2 Agent
Agent a of species s (a ≡ a^s) is defined as follows:

    a = ⟨gn^a, Z^a = Z^s, Γ^a = Γ, Ω^a = Ω, PR^a⟩                        (8.32)

The genotype of agent a consists of two vectors (chromosomes): x of real-coded
decision parameters' values and σ of standard deviations' values, which are used
during mutation with self-adaptation. Agents of a given species are evaluated
according to only the one criterion associated with that species. Z^a = Z^s (see
Eq. (8.28)) is the set of actions which agent a can perform, Γ^a is the set of resource
types used by the agent, and Ω^a is the set of information types. The basic activities
of agent a in the CCoEMAS, with the use of profiles, are presented in Algorithm 7.

Algorithm 7. Basic activities of agent a in CCoEMAS

    r^γ ← r^γ_init
    while r^γ > 0 do
        activate the profile pr_i ∈ PR^a with the highest priority and with an active
            goal gl*_j ∈ GL^{pr_i}
        if pr_1 is activated then
            if 0 < r^γ < r^γ_min then
                seek, get
                r^γ ← r^γ + r^γ_get
            else if r^γ = 0 then
                die
            end
        else if pr_2 is activated then
            if r^γ > r^{rep,γ}_min then
                seekPartner, clone, rec, mut
                r^γ ← r^γ − r^{rep,γ}_give
            end
        else if pr_3 is activated then
            if accept is activated then
                r^γ ← r^γ − r^{rep,γ}_give
            else if give is activated then
                r^γ ← r^γ − r^γ_get
            end
        else if pr_4 is activated then
            if r^γ > r^{mig,γ}_min then
                migr
                r^γ ← r^γ − r^{mig,γ}_min
            end
        end
    end

Profiles
The partially ordered set of profiles includes a resource profile (pr_1), a
reproduction profile (pr_2), an interaction profile (pr_3), and a migration profile
(pr_4):

    PR^a = {pr_1, pr_2, pr_3, pr_4}                                      (8.33a)

    pr_1 ≽ pr_2 ≽ pr_3 ≽ pr_4                                            (8.33b)

The resource profile is defined in the following way:

    pr_1 = ⟨Γ^{pr_1} = Γ, Ω^{pr_1} = {ω_3}, M^{pr_1} = {i^{ω_3}},
            ST^{pr_1}, RST^{pr_1} = ST^{pr_1}, GL^{pr_1}⟩                (8.34)

The set of strategies includes two strategies, built from the actions:

    ST^{pr_1} = {die, seek, get}                                         (8.35)

The goal of the pr_1 profile is to keep the amount of resources above the minimal
level, or to die when the amount of resources falls to zero. This profile uses the
model M^{pr_1} = {i^{ω_3}}.
The reproduction profile is defined as follows:

    pr_2 = ⟨Γ^{pr_2} = Γ, Ω^{pr_2} = {ω_2}, M^{pr_2} = {i^{ω_2}},
            ST^{pr_2}, RST^{pr_2} = ST^{pr_2}, GL^{pr_2}⟩                (8.36)

The set of strategies includes one strategy, built from the actions:

    ST^{pr_2} = {seekPartner, clone, rec, mut}                           (8.37)

The only goal of the pr_2 profile is to reproduce. In order to realize this goal the
agent can use the strategy of reproduction: ⟨seekPartner, clone, rec, mut⟩. During
reproduction the agent transfers the amount r^{rep,γ}_give of resources to the
offspring.
The interaction profile is defined as follows:

    pr_3 = ⟨Γ^{pr_3} = Γ, Ω^{pr_3} = {ω_2, ω_3}, M^{pr_3} = {i^{ω_2}, i^{ω_3}},
            ST^{pr_3} = {accept, give}, RST^{pr_3} = ST^{pr_3}, GL^{pr_3}⟩   (8.38)

The goal of the pr_3 profile is to interact with agents from another species with the
use of the accept and give strategies.
The migration profile is defined as follows:

    pr_4 = ⟨Γ^{pr_4} = Γ, Ω^{pr_4} = {ω_1}, M^{pr_4} = {i^{ω_1}},
            ST^{pr_4} = {⟨migr⟩}, RST^{pr_4} = ST^{pr_4}, GL^{pr_4}⟩     (8.39)

The goal of the pr_4 profile is to migrate within the environment. In order to realize
this goal the migration strategy ⟨migr⟩ is used, which first chooses the node on the
basis of the information {i^{ω_1}} and then realizes the migration. As a result of
migrating, the agent loses some of its resources.

8.3.2 Co-Evolutionary Multi-Agent System with Predator-Prey
Interactions (PPCoEMAS)
The co-evolutionary multi-agent system with predator-prey interactions
(PPCoEMAS) is defined as follows (see Eq. (8.1)):

    PPCoEMAS = ⟨E, S, Γ, Ω⟩                                              (8.40)

The set of species includes two species, preys and predators: S = {prey, pred}.
Two information types (Ω = {ω_1, ω_2}) and one resource type (Γ = {γ}) are used.
Information of type ω_1 denotes the nodes to which an agent can migrate.
Information of type ω_2 denotes the prey agents that are located within a particular
node at time t.

8.3.2.1 Prey Species
The prey species (prey) is defined as follows:

    prey = ⟨A^prey, SX^prey = {sx}, Z^prey, C^prey⟩                      (8.41)

where SX^prey is the set of sexes which exist within the prey species, Z^prey is the
set of actions that agents of species prey can perform, and C^prey is the set of
relations of the prey species with the other species that exist in the PPCoEMAS.

Actions

The set of actions Z^prey is defined as follows:

    Z^prey = {die, seek, get, give, accept, seekPartner,
              clone, rec, mut, migr}                                     (8.42)

where:
• die is the action of death (a prey dies when it runs out of resources);
• seek searches for another prey agent that is dominated by the prey performing
  this action or is too close to it in the criteria space;
• get takes some resource from another prey agent located within the same node
  which is dominated by the agent performing the get action or is too close to it
  in the criteria space;
• give gives some resource to another agent (the one performing the get action);
• accept accepts a partner for reproduction when the amount of resource possessed
  by the prey agent is above the given level;
• seekPartner is used in order to find a partner for reproduction when the amount
  of resource is above the given level and the agent can reproduce;
• clone is the action of producing offspring (the parents give some of their
  resources to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the
  agent loses some of its resource.
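The get/give pair above implements a dominance-driven flow of resource between preys. A minimal sketch of the underlying dominance test and transfer (the function names, the dictionary representation and the transferred amount are ours, for illustration only; two minimised criteria are assumed):

```python
def dominates(a, b):
    """True if criteria vector a Pareto-dominates b (all criteria minimised)."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def get_give(winner, loser, amount):
    """Transfer `amount` of resource from the dominated prey to the dominating
    one (the amount actually transferred, r_get, is a system parameter)."""
    taken = min(amount, loser["resource"])
    loser["resource"] -= taken
    winner["resource"] += taken

a = {"criteria": (1.0, 2.0), "resource": 50}
b = {"criteria": (2.0, 3.0), "resource": 50}
if dominates(a["criteria"], b["criteria"]):
    get_give(a, b, amount=30)   # a now holds 80, b holds 20
```

The same test doubles as the trigger of the seek action; the "too close in the criteria space" condition would additionally need a distance threshold.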

Relations

The set of relations of the prey species with other species existing within the system
is defined as follows:

    C prey = { →[prey,get−] , →[prey,give+] }                        (8.43)
192 R. Dreżewski and L. Siwik

The first relation models intra-species competition for limited resources:

    →[prey,get−] = {prey, prey}                                      (8.44)

The second one models predator-prey interactions:

    →[prey,give+] = {prey, pred}                                     (8.45)

8.3.2.2 Predator Species


The predator species (pred) is defined as follows:
pred = ⟨A pred , SX pred = {sx}, Z pred , C pred ⟩                   (8.46)

Actions

The set of actions Z pred is defined as follows:

Z pred = {seek, getFromPrey, migr} (8.47)

where:
• the seek action allows finding the “worst” (according to the criterion associated
  with the given predator) prey located within the same node as the predator;
• the getFromPrey action takes all resources from the chosen prey;
• the migr action allows the predator to migrate between nodes of the graph H—
  this results in losing some of its resources.

Relations

The set of relations of the pred species with other species existing within the system
is defined as follows:

    C pred = { →[pred,getFromPrey−] }                                (8.48)

This relation models predator-prey interactions:

    →[pred,getFromPrey−] = {pred, prey}                              (8.49)

As a result of the getFromPrey action, which takes all resources from the selected
prey, the prey dies.

8.3.2.3 Prey Agent


Agent a of species prey (a ≡ a prey ) is defined as follows:

a = ⟨gn a , Z a = Z prey , Γ a = Γ , Ω a = Ω , PRa ⟩                 (8.50)



Algorithm 8. Basic activities of agent a ≡ a prey in PPCoEMAS

 1  r^γ ← r^γ_init ;
 2  while r^γ > 0 do
 3      activate the profile pri ∈ PR^a with the highest priority and with the active
        goal gl*_j ∈ GL^pri ;
 4      if pr1 is activated then
 5          if 0 < r^γ < r^γ_min then
 6              seek, get;
 7              r^γ ← r^γ + r^γ_get ;
 8          else if r^γ = 0 then
 9              die;
10          end
11      else if pr2 is activated then
12          if r^γ > r^{rep,γ}_min then
13              if seekPartner, clone, rec, mut is performed then
14                  r^γ ← r^γ − r^{clone,γ}_give ;
15              else if accept is performed then
16                  r^γ ← r^γ − r^{accept,γ}_give ;
17              end
18          end
19      else if pr3 is activated then
20          if get is performed by a prey agent then
21              give;
22              r^γ ← r^γ − r^γ_give ;
23          else if get is performed by a predator agent then
24              give;
25              r^γ ← 0;
26          end
27      else if pr4 is activated then
28          if r^γ > r^{mig,γ}_min then
29              migr;
30              r^γ ← r^γ − r^{mig,γ}_min ;
31          end
32      end
33  end

The genotype of agent a consists of two vectors (chromosomes): x of real-coded
decision parameters’ values and σ of standard deviations’ values, which are used
during mutation with self-adaptation. Z a = Z prey (see Eq. (8.42)) is the set of actions
which agent a can perform. Γ a is the set of resource types used by the agent, and
Ω a is the set of information types. The basic activities of agent a are presented in
Alg. 8.
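The self-adaptive mutation mentioned above can be sketched with the classic evolution-strategy rule: each standard deviation σi is perturbed log-normally and then used to perturb the corresponding xi. The learning rates below are the usual textbook choices; the exact operator of [1] may differ in detail:

```python
import math
import random

def self_adaptive_mutation(x, sigma, rng):
    """Classic ES self-adaptation: mutate the step sizes first, then the variables."""
    n = len(x)
    tau_prime = 1.0 / math.sqrt(2.0 * n)        # global learning rate
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # coordinate-wise learning rate
    common = tau_prime * rng.gauss(0.0, 1.0)    # shared by all coordinates
    new_sigma = [s * math.exp(common + tau * rng.gauss(0.0, 1.0)) for s in sigma]
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma

rng = random.Random(42)
x2, sigma2 = self_adaptive_mutation([0.0, 0.0], [0.1, 0.1], rng)
```

Because the step sizes are part of the genotype, they evolve together with the solution, which is what makes the operator self-adaptive.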

Profiles

The partially ordered set of profiles includes the resource profile (pr1 ), the repro-
duction profile (pr2 ), the interaction profile (pr3 ), and the migration profile (pr4 ):

PRa = {pr1 , pr2 , pr3 , pr4 } (8.51a)


pr1 ≺ pr2 ≺ pr3 ≺ pr4                                               (8.51b)

The resource profile is defined in the following way:


pr1 = ⟨Γ pr1 = Γ , Ω pr1 = {ω2 }, M pr1 = {iω2 },
       ST pr1 , RST pr1 = ST pr1 , GL pr1 ⟩                          (8.52)

The set of strategies includes three strategies:

ST pr1 = {die, seek, get}                                            (8.53)

The goal of the pr1 profile is to keep the amount of resources above the minimal
level or to die when the amount of resources falls to zero. This profile uses the
model M pr1 = {iω2 }.
The reproduction profile is defined as follows:
pr2 = ⟨Γ pr2 = Γ , Ω pr2 = {ω2 }, M pr2 = {iω2 },
       ST pr2 , RST pr2 = ST pr2 , GL pr2 ⟩                          (8.54)

The set of strategies includes five strategies:

ST pr2 = {seekPartner, clone, rec, mut, accept}                      (8.55)

The only goal of the pr2 profile is to reproduce. In order to realize this goal the agent
can use the reproduction strategies seekPartner, clone, rec and mut, or can accept
partners for reproduction (accept).
The interaction profile is defined as follows:
pr3 = ⟨Γ pr3 = Γ , Ω pr3 = ∅, M pr3 = ∅, ST pr3 = {give},
       RST pr3 = ST pr3 , GL pr3 ⟩                                   (8.56)

The goal of the pr3 profile is to interact with predators and other preys with the use
of the strategy give.
The migration profile is defined as follows:
pr4 = ⟨Γ pr4 = Γ , Ω pr4 = {ω1 }, M pr4 = {iω1 },
       ST pr4 = {migr}, RST pr4 = ST pr4 , GL pr4 ⟩                  (8.57)

The goal of the pr4 profile is to migrate within the environment. In order to realize
this goal the migration strategy is used, which first chooses the destination node and
then performs the migration. As a result of migrating, the prey loses some amount
of resource.

Algorithm 9. Basic activities of agent a ≡ a pred in PPCoEMAS

 1  r^γ ← r^γ_init ;
 2  while r^γ > 0 do
 3      activate the profile pri ∈ PR^a with the highest priority and with the active
        goal gl*_j ∈ GL^pri ;
 4      if pr1 is activated then
 5          if 0 < r^γ < r^γ_min then
 6              seek, getFromPrey;
 7              r^γ ← r^γ + r^{prey,γ}_get ;   /* r^{prey,γ}_get is all the resource of
                the prey agent that was chosen by a */
 8          end
 9      else if pr2 is activated then
10          if r^γ > r^{mig,γ}_min then
11              migr;
12              r^γ ← r^γ − r^{mig,γ}_min ;
13          end
14      end
15  end

8.3.2.4 Predator Agent


Agent a of the species pred is defined analogically to the prey agent (see Eq. (8.50)).
There are two main differences. The genotype of a predator agent consists only of
the information about the criterion associated with the given agent, and the set of
profiles consists of only two profiles, the resource profile (pr1 ) and the migration
profile (pr2 ): PRa = {pr1 , pr2 }, where pr1 ≺ pr2 . The basic activities of agent a
are presented in Alg. 9.

Profiles

The resource profile is defined in the following way:


pr1 = ⟨Γ pr1 = Γ , Ω pr1 = {ω2 }, M pr1 = {iω2 },
       ST pr1 = {seek, getFromPrey}, RST pr1 = ST pr1 , GL pr1 ⟩     (8.58)

The goal of the pr1 profile is to keep the amount of resource above the minimal level
with the use of the strategies seek and getFromPrey.
The migration profile is defined as follows:
pr2 = ⟨Γ pr2 = Γ , Ω pr2 = {ω1 }, M pr2 = {iω1 },
       ST pr2 = {migr}, RST pr2 = ST pr2 , GL pr2 ⟩                  (8.59)

The goal of the pr2 profile is to migrate within the environment. In order to realize
this goal the migration strategy migr is used. The realization of the migration
strategy results in losing some of the resource possessed by the agent.

8.4 Experimental Results


The agent-based co-evolutionary approaches for multi-objective optimization pre-
sented formally in Section 8.3 have been tentatively assessed. The preliminary
results obtained during the experiments were presented in some of our previous
papers, and in this section they are briefly summarized.

8.4.1 Test Suite, Performance Metric and State-of-the-Art Algorithms
As the first test problem, a slightly modified version of the so-called Laumanns
multi-objective problem was used, which is defined as follows [15, 18]:

            ⎧ f1 (x) = x1² + x2²
Laumanns =  ⎨ f2 (x) = (x1 + 2)² + x2²                               (8.60)
            ⎩ −5 ≤ x1 , x2 ≤ 5

Secondly, the so-called Kursawe problem was used. Its definition is as follows [18]:

           ⎧ f1 (x) = ∑i=1..n−1 (−10 exp(−0.2 √(xi² + xi+1²)))
Kursawe =  ⎨ f2 (x) = ∑i=1..n (|xi|^0.8 + 5 sin xi³)                 (8.61)
           ⎩ n = 3,  −5 ≤ x1 , x2 , x3 ≤ 5
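Both benchmark functions are straightforward to implement. A minimal sketch, assuming the standard n = 3 form of the Kursawe problem (both objectives minimised):

```python
import math

def laumanns(x1, x2):
    """Slightly modified Laumanns problem (Eq. 8.60): two objectives to minimise."""
    return x1**2 + x2**2, (x1 + 2)**2 + x2**2

def kursawe(x):
    """Kursawe problem (Eq. 8.61) for n = len(x) decision variables."""
    f1 = sum(-10.0 * math.exp(-0.2 * math.sqrt(x[i]**2 + x[i + 1]**2))
             for i in range(len(x) - 1))
    f2 = sum(abs(xi)**0.8 + 5.0 * math.sin(xi**3) for xi in x)
    return f1, f2

print(laumanns(0.0, 0.0))        # → (0.0, 4.0)
print(kursawe([0.0, 0.0, 0.0]))  # → (-20.0, 0.0)
```

An optimiser then works with the objective vectors these functions return, subject to the box constraints given above.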

In one of the experiments discussed briefly in this chapter, the problem of building
an effective portfolio was used. The assumed definition as well as the true Pareto
frontier for this problem can be found in [16].
Additionally, well-known and commonly used test suites were employed during
the experiments, inter alia the ZDT test suite ([19, p. 57–63], [21], [5, p. 356–362],
[4, p. 194–199]).

[Figure: f1 × f2 plot of the true Pareto frontier, annotated with the two goals
“drifting towards the true Pareto frontier” and “dispersing solutions over the whole
approximation of the true Pareto frontier”]

Fig. 8.3 Two goals of multi-objective optimization



The two main distinguishing features of a high-quality solution of a MOOP are
closeness to the true Pareto frontier and dispersion of the found non-dominated
solutions over the whole (approximation of the) Pareto frontier (see Figure 8.3).
Consequently, although using one single measure when assessing the effectiveness
of (evolutionary) algorithms for multi-objective optimization is in general not
enough [23], the Hypervolume Ratio measure (HVR) [20] allows both of these
aspects to be estimated—in this chapter the discussion and presentation of the
obtained results is therefore based on this very measure.
Hypervolume (HV), or the hypervolume ratio (HVR), describes the area covered
by the solutions of the obtained result set. For each solution, a hypercube is evalu-
ated with respect to a fixed reference point. In order to evaluate the hypervolume
ratio, the hypervolume value of the obtained set is normalized by the hypervolume
value computed for the true Pareto frontier. HV and HVR are defined as follows:

    HV = v( ∪i=1..N vi )                                             (8.62a)

    HVR = HV(PF*) / HV(PF)                                           (8.62b)

where vi is the hypercube computed for the i-th solution, PF* represents the obtained
Pareto frontier, and PF is the true Pareto frontier.
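For two objectives the measure reduces to an area computation. A minimal sketch for minimisation problems (the reference point must be supplied by the user; the non-dominated filter and the sweep over sorted points are standard conventions, not taken from [20]):

```python
def nondominated(points):
    """Keep only points not dominated by any other point (2-objective minimisation)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

def hv_2d(points, ref):
    """Hypervolume (Eq. 8.62a): area dominated by the front w.r.t. a reference point."""
    front = sorted(nondominated(points))   # f1 ascending, hence f2 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # one horizontal slice of the union
        prev_f2 = f2
    return hv

def hvr(obtained, true_front, ref):
    """Hypervolume ratio (Eq. 8.62b): HV(PF*) / HV(PF)."""
    return hv_2d(obtained, ref) / hv_2d(true_front, ref)

print(hv_2d([(1.0, 2.0), (2.0, 1.0)], (3.0, 3.0)))   # → 3.0
```

The same reference point must be used for both fronts, otherwise the ratio is meaningless.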
To assess PPCoEMAS and CCoEMAS in a quantitative way, a comparison with
results obtained with the use of state-of-the-art algorithms has to be made. That
is why the results obtained by the approaches discussed in this chapter are com-
pared with results obtained by the NSGA-II [6, 7] and SPEA2 [12, 22] algorithms,
since these are the most efficient and most commonly used evolutionary multi-
objective optimization algorithms. Additionally, the obtained results are also com-
pared with the NPGA [13] and PPES [15] algorithms.

8.4.2 A Glance at Assessing the Co-operation Based Approach (CCoEMAS)
The co-evolutionary multi-agent system with the co-operation mechanism
(CCoEMAS) presented in Section 8.3.1 was assessed tentatively using, inter alia,
the ZDT test suite. The population sizes of CCoEMAS and of the benchmark al-
gorithms (NSGA-II and SPEA2) assumed during the presented experiments were
as follows: CCoEMAS—200, NSGA-II—300 and SPEA2—100. Selected parame-
ters and their values assumed during those experiments are as follows: r^γ_init = 50
(the level of resources possessed by an individual just after its creation), r^γ_get = 30
(the amount of resources transferred in the case of domination), r^{rep,γ}_min = 30
(the level of resources required for reproduction), and p_mut = 0.5 (mutation
probability).

[Figure: two plots of HVR measure value versus time [s] for NSGA2, CCoEMAS
and SPEA]

Fig. 8.4 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT1 (a) and ZDT2 (b) [11]

As one may see after analyzing the results presented in Figures 8.4 and 8.5,
CCoEMAS—an algorithm not as complex as NSGA-II or SPEA2—initially allows
better solutions to be obtained, but over time the classical algorithms, especially
NSGA-II, become the better alternatives. It is, however, worth mentioning that in the case of

[Figure: three plots of HVR measure value versus time [s] for NSGA2, CCoEMAS
and SPEA2]

Fig. 8.5 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT3 (a), ZDT4 (b) and ZDT6 (c) [11]

the ZDT4 problem this characteristic seems to be reversed—i.e., initially the clas-
sical algorithms seem to be the better alternatives, but finally CCoEMAS allows
better solutions to be obtained (observed as higher values of the HVR metric). A
deeper analysis of the results obtained during the presented experiments can be
found in [11].

8.4.3 A Glance at Assessing the Predator-Prey Based Approach (PPCoEMAS)
In this section some selected results regarding the co-evolutionary multi-agent
system with predator-prey interactions presented in Section 8.3.2 are shown. Among
others, PPCoEMAS was assessed with the use of some of the classical benchmark
problems presented in Section 8.4.1: firstly, the Laumanns [15] and Kursawe [14]
test problems were used. Classical algorithms other than NSGA-II and SPEA2 were
also used during the experiments with the predator-prey approach: this time the
predator-prey evolutionary strategy (PPES) and the niched-pareto genetic algorithm
(NPGA) were used. In this section only a summary of the obtained results is given;
a more detailed analysis can be found in [9, 16].

[Figure: two f1 × f2 scatter plots: (a) PPCoEMAS frontier after 6000 steps,
(b) PPES frontier after 6000 steps]

Fig. 8.6 Pareto frontier approximations obtained by the PPCoEMAS (a) and PPES (b)
algorithms for the Laumanns problem after 6000 steps [9]

[Figure: bar charts of (a) HV (range 50–70) and (b) HVR (range 0.7–1) for NPGA,
PPES and PPCoEMAS]

Fig. 8.7 The values of the HV (a) and HVR (b) measures for the Laumanns problem obtained
by PPCoEMAS, PPES and NPGA after 6000 steps

In the very first experiments with PPCoEMAS the relatively simple Laumanns test
problem was used. Figure 8.6 presents the Pareto frontier approximations obtained
by the PPCoEMAS and PPES algorithms, and Figure 8.7 presents the values of the
HV and HVR metrics for all three compared algorithms (PPCoEMAS, PPES and
NPGA). As can be seen, the differences between the analyzed algorithms are not
very distinct; however, the proposed PPCoEMAS system seems to be the best
alternative.
The second problem used was the more demanding multi-objective Kursawe
problem, with both a disconnected Pareto set and a disconnected Pareto frontier.
Figure 8.9 presents the final approximations of the Pareto frontier obtained by
PPCoEMAS and by the reference algorithms after 6000 time steps. As one may
notice, there is no doubt that PPCoEMAS is definitely the best alternative, since it
is able to obtain a Pareto frontier that is located very close to the model solution,
that is very well dispersed, and what

[Figure: bar charts of (a) HV (range 350–600) and (b) HVR (range 0.6–1) for
NPGA, PPES and PPCoEMAS]

Fig. 8.8 The values of the HV (a) and HVR (b) measures for the Kursawe problem obtained
by PPCoEMAS, PPES and NPGA after 6000 steps

is also very important—it is more numerous than the PPES- and NPGA-based solu-
tions. The above observations are fully confirmed by the values of the HV and HVR
metrics presented in Figure 8.8.
The proposed co-evolutionary multi-agent system with predator-prey interactions
was also assessed with the use of the problem of building an effective portfolio. In
this case, each individual in the prey population is represented as a p-dimensional
vector; each dimension represents the percentage participation of the i-th (i ∈ 1 . . . p)
share in the whole portfolio.
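A minimal sketch of this representation, assuming the raw genotype is normalised to percentages summing to one and that profit and risk are the mean and standard deviation of the historical portfolio return (the exact objective formulation of [16] may differ; the return history below is invented):

```python
import statistics

def normalise(weights):
    """Map a raw p-dimensional genotype to percentage participations summing to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def portfolio_objectives(weights, returns):
    """Evaluate one prey: (profit, risk) of the portfolio it encodes.
    `returns` holds one per-period vector of share returns."""
    w = normalise(weights)
    series = [sum(wi * ri for wi, ri in zip(w, period)) for period in returns]
    return statistics.mean(series), statistics.pstdev(series)

# Invented three-period history for a three-share portfolio (experiment I size):
history = [[0.02, 0.01, 0.00], [0.00, 0.03, 0.01], [0.01, -0.01, 0.02]]
profit, risk = portfolio_objectives([1.0, 1.0, 2.0], history)   # profit ≈ 0.01
```

The two objectives are then in conflict in the usual way: profit is maximised while risk is minimised.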
During the presented experiments, Warsaw Stock Exchange quotations from 2003-
01-01 until 2005-12-31 were taken into consideration. The portfolio consists of the
following three (experiment I) or seventeen (experiment II) stocks quoted on the
Warsaw Stock Exchange—in experiment I: RAFAKO, PONARFEH, PKOBP; in
experiment II: KREDYTB, COMPLAND, BETACOM, GRAJEWO, KRUK,
COMARCH, ATM, HANDLOWY, BZWBK, HYDROBUD, BORYSZEW,

[Figure: three f1 × f2 scatter plots of frontiers after 6000 steps: (a) PPCoEMAS,
(b) PPES, (c) NPGA]

Fig. 8.9 Pareto frontier approximations for the Kursawe problem obtained by PPCoEMAS
(a), PPES (b) and NPGA (c) after 6000 steps [9]

ARKSTEEL, BRE, KGHM, GANT, PROKOM, BPHPBK. As the market index,
WIG20 has been taken into consideration.
Figure 8.10 presents the final Pareto frontiers obtained using the PPCoEMAS,
NPGA and PPES algorithms after 1000 steps in experiment I. As one may notice, in
this case the frontier obtained by PPCoEMAS is more numerous than the NPGA-
based one and as numerous as the PPES-based one. Unfortunately, in this case the
diversity of the population in the PPCoEMAS approach is visibly worse than in the
case of the NPGA- or PPES-based frontiers.

[Figure: three Profit-versus-Risk plots of Pareto frontiers after 1000 steps:
(a) PPCoEMAS-based, (b) PPES-based, (c) NPGA-based]

Fig. 8.10 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building an effective portfolio consisting of 3 stocks [16]

A similar situation can also be observed in Figure 8.11, which presents the Pareto
frontiers obtained by PPCoEMAS, NPGA and PPES—but this time the portfolio
being optimized consists of 17 shares. Also this time the PPCoEMAS-based frontier
is quite numerous and quite close to the true Pareto frontier, but the tendency to
focus solutions around only selected part(s) of the whole frontier is very distinct.
The explanation of the observed tendency can be found in [9, 16], and on a very

[Figure: three Profit-versus-Risk plots of Pareto frontiers after 1000 steps:
(a) PPCoEMAS-based, (b) PPES-based, (c) NPGA-based]

Fig. 8.11 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building an effective portfolio consisting of 17 stocks [16]

general level it can be said that it is caused by the stagnation of the evolution pro-
cess in PPCoEMAS. Hypothetical non-dominated average portfolios for experi-
ments I and II are presented in Figure 8.12 and in Figure 8.13, respectively (in
Figure 8.13 the shares are presented from left to right in the order in which they
were listed above).

[Figure: two bar charts of percentage share in the portfolio (RAFAKO, PONAR,
PKOBP): (a) PPCoEMAS portfolio after 1 step, (b) after 900 steps]

Fig. 8.12 Effective portfolio consisting of three stocks proposed by PPCoEMAS [16]

[Figure: two bar charts of percentage share in the portfolio for the seventeen stocks:
(a) PPCoEMAS portfolio after 1 step, (b) after 900 steps]

Fig. 8.13 Effective portfolio consisting of seventeen stocks proposed by PPCoEMAS [16]

8.5 Summary and Conclusions


Agent-based (co-)evolutionary algorithms have already been applied in many dif-
ferent domains, including multi-modal optimization, multi-objective optimization,
and financial problems. Agent-based models of evolutionary algorithms allow dif-
ferent bio-inspired techniques and algorithms to be mixed and used simultaneously
within one coherent agent model, and allow new biologically and socially inspired
operators and mechanisms to be added in a very natural way. Agent-based models
of evolutionary algorithms also allow parallel and decentralized computations to be
used without any additional changes, because these models are decentralized and
use asynchronous computations.
In this chapter we have presented two selected agent-based co-evolutionary algo-
rithms for multi-objective optimization—one of them using a co-operative mecha-
nism and the other one a predator-prey mechanism. Formal models of these systems,
as well as the results of experiments with standard multi-objective test problems and
with the financial problem of multi-objective portfolio optimization, were presented.
The results of the experiments show that agent-based algorithms may obtain quite
satisfactory results, comparable to or, in the case of some problems, even better than
state-of-the-art multi-objective evolutionary algorithms; however, there is of course
still room for improvement and further research. The presented results also lead to
the conclusion that no single evolutionary algorithm for multi-objective optimiza-
tion can solve all problems in the best way—there is, and always will be, space for
new algorithms and improvements suited to particular problems.
Future research on the agent-based models will concentrate on improvements
to the already proposed algorithms, as well as on new algorithms and techniques.
Examples of new techniques which may be incorporated into agent-based models of
evolutionary algorithms include cultural and immunological mechanisms. Another
direction of development would be adding a social and economical layer to the
existing biological one and using such agent-based models for the modeling and
simulation of complex and emergent phenomena from social and economical life.

References
1. Bäck, T., Fogel, D., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation.
IOP Publishing and Oxford University Press (1997)
2. Cetnarowicz, K., Kisiel-Dorohinicki, M., Nawarecki, E.: The application of evolution
process in multi-agent world to the prediction system. In: Tokoro, M. (ed.) Proceedings
of the 2nd International Conference on Multi-Agent Systems (ICMAS 1996). AAAI
Press, Menlo Park (1996)
3. Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary Algorithms for Solving Multi-
Objective Problems, 2nd edn. Springer, New York (2007)
4. Coello Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary algorithms for solv-
ing multi-objective problems, 2nd edn. Genetic and evolutionary computation. Springer,
Heidelberg (2007)
5. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley &
Sons, Chichester (2001)

6. Deb, K., Agrawal, S., Pratab, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting
Genetic Algorithm for Multi-Objective Optimization: NSGA-II. In: Deb, K., Rudolph,
G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000.
LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000),
citeseer.ist.psu.edu/article/deb00fast.html
7. Deb, K., Pratab, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective ge-
netic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 181–197
(2002)
8. Dreżewski, R.: A model of co-evolution in multi-agent system. In: Mařı́k, V., Müller, J.P.,
Pěchouček, M. (eds.) CEEMAS 2003. LNCS (LNAI), vol. 2691, pp. 314–323. Springer,
Heidelberg (2003)
9. Dreżewski, R., Siwik, L.: The application of agent-based co-evolutionary system with
predator-prey interactions to solving multi-objective optimization problems. In: Proceed-
ings of the 2007 IEEE Symposium Series on Computational Intelligence. IEEE, Los
Alamitos (2007)
10. Dreżewski, R., Siwik, L.: Agent-based co-evolutionary techniques for solving multi-
objective optimization problems. In: Kosiński, W. (ed.) Advances in Evolutionary Al-
gorithms. IN-TECH, Vienna (2008)
11. Dreżewski, R., Siwik, L.: Agent-based co-operative co-evolutionary algorithm for multi-
objective optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M.
(eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 388–397. Springer, Heidelberg
(2008)
12. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary
algorithm for multiobjective optimization. In: Giannakoglou, K., et al. (eds.) Evolution-
ary Methods for Design, Optimisation and Control with Application to Industrial Prob-
lems (EUROGEN 2001). International Center for Numerical Methods in Engineering
(CIMNE), pp. 95–100 (2002)
13. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched pareto genetic algorithm for multi-
objective optimization. In: Proceedings of the First IEEE Conference on Evolutionary
Computation. IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87.
IEEE Service Center, Piscataway (1994),
citeseer.ist.psu.edu/horn94niched.html
14. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.-
P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 193–197. Springer, Heidelberg
(1991), citeseer.ist.psu.edu/kursawe91variant.html
15. Laumanns, M., Rudolph, G., Schwefel, H.P.: A spatial predator-prey approach to multi-
objective optimization: A preliminary study. In: Eiben, A.E., Bäck, T., Schoenauer, M.,
Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 241. Springer, Heidelberg (1998)
16. Siwik, L., Dreżewski, R.: Co-evolutionary multi-agent system for portfolio optimization.
In: Brabazon, A., O’Neill, M. (eds.) Natural Computation in Computational Finance, pp.
273–303. Springer, Heidelberg (2008)
17. Spears, W.: Crossover or mutation? In: Proceedings of the 2-nd Foundation of Genetic
Algorithms, pp. 221–237. Morgan Kauffman, San Francisco (1992)
18. Van Veldhuizen, D.A.: Multiobjective evolutionary algorithms: Classifications, analyses
and new innovations. PhD thesis, Graduate School of Engineering of the Air Force Insti-
tute of Technology Air University (1999)
19. Zitzler, E.: Evolutionary algorithms for multiobjective optimization: methods and appli-
cations. PhD thesis, Swiss Federal Institute of Technology, Zurich (1999)

20. Zitzler, E., Thiele, L.: An evolutionary algorithm for multiobjective optimization: The
strength pareto approach. Tech. Rep. 43, Swiss Federal Institute of Technology, Zurich,
Gloriastrasse 35, CH-8092 Zurich, Switzerland (1998),
citeseer.ist.psu.edu/article/zitzler98evolutionary.html
21. Zitzler, E., Deb, K., Thiele, L.: Comparison of Multiobjective Evolutionary Algorithms:
Empirical Results. Evolutionary Computation 8(2), 173–195 (2000)
22. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary
algorithm. Tech. Rep. TIK-Report 103, Computer Engineering and Networks Labora-
tory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology
(ETH) Zurich, ETH Zentrum, Gloriastrasse 35, CH-8092 Zurich, Switzerland (2001)
23. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance
assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on
Evolutionary Computation 7(2), 117–132 (2003)
Chapter 9
A Game Theory-Based Multi-Agent System for
Expensive Optimisation Problems

Abdellah Salhi and Özgun Töreyen

Abstract. This paper is concerned with the development of a novel approach to
solving expensive optimisation problems. The approach relies on game theory and a
multi-agent framework in which a number of existing algorithms, cast as agents, are
deployed with the aim of solving the problem in hand as efficiently as possible. The
key factor in the success of this approach is a dynamic resource allocation biased
toward the algorithms that are promising on the given problem. This is achieved by
allowing the agents to play a cooperative-competitive game, the outcomes of which
are used to decide which algorithms, if any, will drop out of the list of solver-agents
and which will remain in use. A successful implementation of this framework will
result in the algorithm(s) most suited to the given problem being predominantly
used on the available computing platform. In other words, it guarantees the best use
of the resources, both algorithms and hardware, with the by-product being the best
approximate solution for the problem given the available resources. The resulting
system, GTMAS, is tested on a standard collection of TSP problems, and the results
are included.

9.1 Introduction
Modelling problems arising in real-world applications, taking into account the non-
linearity and the combinatorial aspects of solution sets, often leads to expensive-to-
solve optimisation problems; they are inherently intractable. Indeed, even checking
a given solution for optimality is NP-hard [10, 17, 32]. It is, therefore, not reason-
able, in general, to expect the optimum solution to be found in acceptable time.
What one can almost always expect, instead, is an approximate solution, the quality
of which is crucial to its potential use.
Abdellah Salhi · Özgun Töreyen
Department of Mathematical Sciences, The University of Essex, Colchester CO4 3SQ, UK
e-mail: as@essex.ac.uk, otoreyen@aselsan.com.tr

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 211–232.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
212 A. Salhi and Ö. Töreyen

It is well known that, at least in the case of stochastic algorithms, the quality of
the approximate solution (or the confidence the user may have in it) is proportional
to the time spent in the search for it [22, 23, 24]. As in many applications there is
a time constraint—a deadline beyond which a better approximate solution is of no
use—it is essential that all available resources (software and hardware) be used as
well as possible, to ensure that the best approximate solution, under the circum-
stances, is obtained. This is what the novel approach suggested here attempts to
achieve. To do so, it must:
1. find which algorithm(s), in the suite of algorithms, is the most appropriate for the
given instance of the expensive optimisation problem;
2. replicate this algorithm(s) on all available processor nodes in a parallel environ-
ment, or allocate to it all of the remaining CPU time if a single-processor, or
sequential, environment is used.
Point (1) above is dealt with by measuring the performance of the algorithms
used. Point (2) is dealt with via the implementation of a cooperative/competitive
game of the Iterated Prisoners’ Dilemma (IPD) type [4, 7, 16, 20, 25, 27]. Although
other paradigms of cooperative/competitive behaviour, such as the Stag Hunt game
[7], can be used, the IPD seems appropriate. Note that implementing cooperation is
fairly straightforward, while implementing competition is not. We believe that com-
petition is at least as important as cooperation between agents for an effective search.
To the best of our knowledge, this is the first time that implementing competition for
optimisation purposes has been attempted. We use payoff matrices as a handle to
manipulate it. Two algorithms (agents) cooperate by exchanging their current solu-
tions; they compete by not exchanging them. Note that, intuitively, cooperation may
lead to early convergence to a local optimum, by virtue of propagating a given solu-
tion potentially to all algorithms and having all of them search the same area. Com-
petition, on the other hand, may lead to good coverage of the search space by virtue
of not sharing solutions, i.e. helping algorithms “stay away” from each other and,
therefore, potentially explore different areas of the search space.
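The share/withhold choice between two solver agents can be scored with a standard IPD payoff matrix. A minimal sketch (the payoffs 5, 3, 1, 0 are the conventional illustrative values with T > R > P > S, not necessarily those used in GTMAS):

```python
# "C" = cooperate (share the current solution), "D" = defect (withhold it).
PAYOFF = {
    ("C", "C"): (3, 3),   # reward: both share
    ("C", "D"): (0, 5),   # sucker vs. temptation
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),   # punishment: both withhold
}

def play_round(move1, move2):
    """One IPD round between two solver agents; returns their payoffs."""
    return PAYOFF[(move1, move2)]

def iterated_game(moves1, moves2):
    """Accumulated payoffs over an iterated game; the totals could then bias
    CPU-time allocation toward the better-scoring solver agent."""
    totals = [0, 0]
    for m1, m2 in zip(moves1, moves2):
        p1, p2 = play_round(m1, m2)
        totals[0] += p1
        totals[1] += p2
    return totals

print(iterated_game("CDD", "CCD"))   # → [9, 4]
```

Tuning the four payoff values is exactly the "handle" mentioned above for shifting the balance between cooperation and competition.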
Although the study presents the prototype of a generic solver that can involve
any number of solver algorithms and run on any computing platform, here a system
with only two search algorithms, implemented sequentially, is investigated. This
simplified model, however, has the inherent complexities of a system with many
more agents and should show how good or otherwise the general system can be for
expensive optimisation.
Note that the generic nature of this approach makes it applicable in any discipline
where problem solving is involved and more than one solution method is available.
This document is organised as follows. In Section 9.2, a brief literature review
is given. In Sections 9.3 and 9.4, the design and implementation of the system is
explained. Section 9.5 explains how the system is applied to solve the Travelling
Salesman Problem. The results are presented in Section 9.6. Finally, conclusions
are drawn and future research prospects are outlined in Section 9.7.
9 A Game Theory-Based Multi-Agent System 213

9.2 Background
In the following, a brief review of the three main topics involved, i.e. optimisation,
the IPD and agent systems, will be given.

9.2.1 Optimisation
The general optimisation problem is of the global type, constrained, nonlinear and
involves mixed variables, i.e. both discrete and continuous variables. However,
many optimisation problems encountered in real applications do not have
example, involves only binary variables and has one single constraint, but is still
NP-Hard. The general optimisation problem can be cast in the following form.
Let f be a function from Rn to R and A ⊂ Rn , then find x∗ ∈ A such that ∀x ∈ A,
f (x∗ ) ≤ f (x).
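This definition can be made concrete with a small sketch (the helper names and the toy instance below are ours, purely illustrative): a brute-force global minimiser over a finite feasible set, applied to a tiny 0-1 knapsack instance of the kind mentioned above.

```python
from itertools import product

# Illustrative sketch (our own toy instance): find x* in A with f(x*) <= f(x)
# for all x in A, where A is a finite feasible set.

def global_minimiser(f, A):
    """Return x* in A such that f(x*) <= f(x) for every x in A."""
    return min(A, key=f)

# A tiny 0-1 knapsack: binary variables, a single capacity constraint.
values, weights, capacity = [6, 10, 12], [1, 2, 3], 5
A = [x for x in product((0, 1), repeat=3)
     if sum(w * xi for w, xi in zip(weights, x)) <= capacity]

# Maximising the collected value == minimising its negation.
f = lambda x: -sum(v * xi for v, xi in zip(values, x))
best = global_minimiser(f, A)   # -> (0, 1, 1), value 22
```

Enumeration is, of course, only viable for toy instances; for the NP-hard problems discussed here it is replaced by the heuristic solvers in the system's library.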

9.2.2 Game Theory: The Iterated Prisoners’ Dilemma


The Prisoners’ Dilemma (PD), brought to attention by Merrill Flood of the Rand
Corporation in 1951 and later formulated by Al Tucker [8, 11], is a popular paradigm
for the problem of cooperation. Its formulation can be as follows.

Table 9.1 Formulation of PD: The Payoff Matrix

Player2
C D
Player1 C R=3,R=3 S=0,T=5
D T=5,S=0 P=1,P=1

In the payoff matrix of Table 9.1, actions C and D stand for ‘Cooperate’ and ‘Defect’, and payoffs R, P, T, and S stand for ‘Reward’, ‘Punishment’, ‘Temptation’, and
‘Sucker’s’ payoff, respectively. This payoff matrix shows that defecting is beneficial to both players for two reasons. First, it leads to a greater payoff (T = 5) in
case the other player cooperates, (S = 0). Second, it is a safe move because neither
knows what the other’s move will be. So, to rational players, defecting is the best
choice. But, if both players choose to defect then it leads to a worse payoff (P = 1)
as compared to cooperating (R = 3). That is the dilemma.
The special setting of the one shot PD is seen by many to be contrary to the idea of
cooperation. This is because the only equilibrium point is the outcome [P, P] which
is a Nash equilibrium, [7]. Also, [P, P] is at the intersection of minimax strategy
choices for both players. These minimax strategies are dominant for both players,
hence the exclusion in principle of cooperation (by virtue of the dominance of the
chosen strategies). Moreover, even if cooperative strategies were chosen, the resulting cooperative ‘solution’ is not an equilibrium point. This means that it is not stable
214 A. Salhi and Ö. Töreyen

due to the fact that both players are tempted to defect from it. It should also be noted
that cooperative problems in real life are likely to be faced repeatedly. This makes
the IPD a more appropriate model for the study of cooperation than the one-shot
version of the game.
The PD game is characterised by the strict inequality relations between the payoffs: T > R > P > S. To avoid coordination or total agreement getting a ‘helping
hand’, most experimental PD games have a payoff matrix satisfying 2R > S + T , as
in Table 9.1.
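These two conditions are easy to check mechanically; the sketch below (the helper names are ours) verifies them for the Table 9.1 payoffs.

```python
# Check the PD conditions for the Table 9.1 payoffs (a small sketch).

def is_pd(T, R, P, S):
    """Strict ordering T > R > P > S that defines a Prisoners' Dilemma."""
    return T > R > P > S

def no_helping_hand(T, R, P, S):
    """2R > S + T: mutual cooperation beats alternating exploitation."""
    return 2 * R > S + T

T, R, P, S = 5, 3, 1, 0   # Temptation, Reward, Punishment, Sucker's payoff
assert is_pd(T, R, P, S) and no_helping_hand(T, R, P, S)
```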
The close analysis of the IPD reveals that, unlike the one-shot PD, it has a large
number of Nash equilibria. These lie inside the convex hull of the outcomes
(0,5), (3,3), (5,0) and (1,1) of the pure strategies in the one-shot PD (see Figure 9.1).
Note that (1,1) corresponding to [P, P] is a Nash equilibrium for the IPD also. For a
comprehensive investigation of the IPD, please refer to [4, 5, 15].

Fig. 9.1 Set of Nash Equilibrium Points

9.2.3 Multi-Agent Systems


Multi-Agent Systems (MAS) are collections of agents that work together to accomplish a task that is normally beyond the capabilities of a single agent, [19], [30],
[31]. In our case, however, every agent is capable of solving the instance of the
optimisation problem in hand.
The agents in a classical MAS communicate between themselves, cooperate, collaborate and sometimes negotiate ([1, 9, 12]). They do not, normally, compete. In the
framework used here, however, they are allowed, and in fact required, to do so.
The present paper describes a Game Theoretic Multi-Agent Solver (GTMAS),
[18], which implements the ideas introduced above. It will be limited to three agents
(Figure 9.2): a coordinator-agent, Solver-Agent 1 (SA1), running the Genetic Al-
gorithm (GA), [14, 26], and Solver-Agent 2 (SA2), running Simulated Annealing
(SA), [28]. The system, implemented in Matlab, is tested on a set of Travelling
Salesman Problems (TSP) from TSPLIB, [21].

9.3 Constructing GTMAS


The GTMAS architecture closely follows the goal-based MAS architecture of Park and
Suguraman [19] and the IPD model as described in [2], [3], [6].
Let the problem to be solved be ℘. The overall goal of GTMAS is to solve ℘ ef-
ficiently using the best available algorithm(s) in the system’s library. This is equiv-
alent to completing two tasks: selecting the best algorithm(s) among all available
algorithms and obtaining a ‘good’ solution.
The overall goal is divided into sub-goals which are then matched to the system’s agents. Each of the solver-agents runs a different algorithm and tries to be
the first to solve ℘ by using as much as possible of the available computing facili-
ties, here CPU time. The coordinator-agent enables coordination, manages the game
through which the solver-agents compete for the facilities, allocates the facilities to
the solver-agents, and also communicates with the user, i.e. the owner of problem
℘, (Figure 9.2).

Fig. 9.2 Goal Hierarchy Diagram

Solver-agents cooperate (C) or compete (D) with each other by sharing, or not,
their solutions. If an agent can take an opponent’s possibly better solution when it
is stuck in a local optimum, say, and use it, then it can improve its own search.
The decision to cooperate or to compete is autonomously made by the agents us-
ing their beliefs (no notable change in the objective value in the last few iterations,
for instance, may mean convergence to a local optimum), the history of the previous encounters with their opponents (the number of times they cooperated and
competed), and certain rules which follow observations of the behaviours of agents.
Some are explained below.
The rules are set to prevent the game from converging too soon to a near
pure competition game (which is equivalent to playing the GRIM strategy, [2]).
A go-it-alone type strategy cannot contribute more to the solution quality than running an algorithm on its own. These rules are:
• If the number of times SA1 knows the solution of SA2 increases, then the likelihood that SA2 finds the solution to ℘ first decreases. Therefore, SA2 is unlikely
to cooperate. Since all solver-agents are aware of this, they would cooperate less
and take their opponent’s solution more often, given the chance.
• If SA1 does not cooperate when SA2 cooperates, then SA2 would retaliate. This
leads to the TIT-FOR-TAT and the go-it-alone type strategy.
• If a solver-agent cooperates in the first encounter with another solver-agent, then
it can be perceived as being in need of help, i.e. stuck at a local optimum. Agents,
therefore, perceive the first cooperation of their opponent as a “forceful invitation
to cooperate or else...”, in the sense of bullet point 2 above.
There are all sorts of rules which are implicit in the IPD. Agents, however, do not
have to apply them systematically.

9.3.1 GTMAS at Work: Illustration


Recall that the solver-agents, in order to solve ℘, play the IPD game. In each en-
counter, they either cooperate (C) or compete (D).
Figure 9.3 shows an encounter of two agents after they both obtain their intermediary solutions.

Fig. 9.3 Decision Tree of the Game

Node 1 is the starting node which shows the decision alternatives
of SA1. It can cooperate and end up in Node 2 or Node 3. Nodes 2 and 3 are the
decision nodes of SA2 which has the same two alternative decisions as SA1. Node 4
follows the cooperation of both of the agents that may result in a solution exchange.
Node 5 shows the situation where SA1 cooperates and SA2 competes which means
SA2 may take the solution of SA1 and SA1 takes nothing. Node 6 depicts the same
situation as that leading to node 5 but with agents taking different actions. In node
7, neither gives its solution; they continue without any exchange.
The decision tree is expanded further with branching from nodes 4-7, but with
alternatives now being: “Take the opponent’s solution” and “Do not take the op-
ponent’s solution”. This branching determines which agent has a better solution
and is essential for setting up the payoff matrices that drive the system. Eight new
nodes (leaves) arise. Each pair of sibling nodes yields a different payoff matrix. The
labels (G) (for good) and (B) (for bad) refer to agents having a better solution than their
opponent, or otherwise, respectively. The crossed cells refer to impossible
outcomes.
Managing the resources is based on the outcomes of the decisions of the solver-
agents. When an agent cooperates it gains one unit (of CPU time or equivalent in
terms of iterations it is allowed to do) and loses double that. When it competes it
gains two units and loses one (or half of the initial gain). This means, the GTMAS
payoff matrix rewards competition. The idea behind supporting competition is to
counter the “helping hand” that cooperation gets from the rules underpinning the
construction of GTMAS (see above). It can also be argued that, intuitively at least,
too frequent exchanging of solutions will lead to early convergence to local optima.
So, competition gives solver-agents the chance to cover the search space better.
The 4 payoff matrices in Figure 9.3 can be combined in one payoff matrix
(Table 9.2).

Table 9.2 Combined Payoff Table for Evaluating and Rewarding Agents

B
C D
G C (1,-2) (1,-1)
D (2,-2) (2,-1)

The equilibrium point for the payoff matrix is (D, D) with payoffs 2 and -1. It
is also a regret-free point. The payoff matrix at the core of GTMAS is different
from those commonly found in the literature. These matrices would be drawn im-
mediately after decisions have been taken, i.e. at nodes 4 to 7 in Figure 9.3. Here,
they are drawn after other decisions are taken. In fact, one can highlight three main
differences:
(i) The return of a player does not depend directly on the opponent’s choice. Whether
the opponent cooperates or competes only becomes relevant after the exchange of
solutions has been decided;
(ii) The payoff is affected by what has been achieved in terms of the quality of
solution after exchange (or otherwise of solutions);
Unlike traditional games, here, after the players (solver-agents) have made their
choices, they are given a chance to progress with the consequences of those choices.
Only after that are they rewarded/punished. This was made explicit in the above
paragraph where reasons for rewarding competition/penalising cooperation were
given; for instance, when we said that a cooperating agent “gains one unit and loses
double that”, we meant that the solver-agent runs first for a unit of CPU time (or
its equivalent in iterations) and only after that is it penalised by taking 2 units of CPU
time from its account. Basically, the quality of the solution following decisions has
to be measured first, before the payoffs are allocated. Time is an important factor in
the IPD.
(iii) The third difference is that the players are not “Solver-Agent 1” and “Solver-
Agent 2”, but instead “Solver-Agent with the better solution” and “Solver-Agent
with the worse solution”. The configuration of the table may change at each stage
according to the solution qualities of the solver-agents. The one with better solution
is always placed as the row player.
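The combined table can be read off programmatically; the sketch below (our own naming, minimisation assumed) always treats the agent holding the better solution as the row player, as described above.

```python
# Table 9.2 as a lookup: row = better agent (G), column = worse agent (B).
PAYOFF = {('C', 'C'): (1, -2), ('C', 'D'): (1, -1),
          ('D', 'C'): (2, -2), ('D', 'D'): (2, -1)}

def rewards(action1, action2, obj1, obj2):
    """Rewards for (agent 1, agent 2); lower objective = better solution."""
    if obj1 <= obj2:                          # agent 1 is the row player (G)
        return PAYOFF[(action1, action2)]
    r_g, r_b = PAYOFF[(action2, action1)]     # agent 2 is the row player (G)
    return r_b, r_g
```

For instance, `rewards('D', 'C', 10, 20)` returns `(2, -2)`: the better agent competes and gains two units, while the cooperating, worse agent loses two.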

9.4 The GTMAS Algorithm


GTMAS is a generic (n + 1)-agent system that consists of a coordinator-agent and
n solver-agents. It can be seen as a loosely coupled hybrid algorithm that uses
any number of algorithms contributing to the hybridisation. The pseudocode of the
agents is given below.

Coordinator-Agent Pseudocode
1. Initialise belief. Initialise resources, n.
2. For Nstage stages, play the game and update belief
where Nstage limits the number of stages the game is played.
2.1. Start decision phase: Run the solver-agents to decide.
2.2. Manage the solution exchange.
2.3. Start competition phase: Run the solver-agents to compete.
2.4. Evaluate and reward/punish the solver-agents.
Update resources, m1 , m2 iterations, where
mi = n + ∑_{j=1}^{currentstage−1} rij , and rij is the reward of agent i
at stage j.
2.5. Increment stage.
3. End the game. Select the best algorithm. Report the results.
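The stage loop and the resource-account update of step 2.4 can be sketched as follows (a simplified two-agent sketch under our own naming: `decide` and `reward` stand in for the solver-agents' decision phase and the payoff-table evaluation, and the competition-phase runs are omitted).

```python
# Sketch of the coordinator's stage loop with resource accounting:
# m_i = n + sum_{j=1}^{stage-1} r_ij, the iterations agent i may run.

def allowance(n, rewards_so_far):
    return n + sum(rewards_so_far)

def play_game(n, n_stage, decide, reward):
    history = {1: [], 2: []}              # rewards earned per agent per stage
    for stage in range(1, n_stage + 1):
        m1 = allowance(n, history[1])     # current resource accounts
        m2 = allowance(n, history[2])
        d1 = decide(1, stage, m1)         # 'C' or 'D'
        d2 = decide(2, stage, m2)
        r1, r2 = reward(d1, d2)           # evaluate via the payoff table
        history[1].append(r1)
        history[2].append(r2)
    return history
```

With five stages, an always-defecting pair rewarded (2, −1) per stage leaves the better agent with an allowance of n + 8 at the final stage.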
Solver-Agents Pseudocode
1. Initialise belief.
2. If it is a decision phase, do:
2.1. If it is the first stage, do:
2.1.1. Initialise memory and algorithm specific parameters.
2.1.2. Run own algorithm for n iterations.
2.1.3. Cooperate.
2.1.4. End run. Send the results to the Coordinator-Agent.
2.2. If it is the second stage, do:
2.2.1. Update belief.
2.2.2. Run own algorithm for mi iterations.
2.2.3. Compete.
2.2.4. End run. Send the results to the Coordinator-Agent.
2.3. If stage > 2, do:
2.3.1. Update belief.
2.3.2. Run own algorithm for mi iterations.
2.3.3. Decide to cooperate/compete.
2.3.4. End run. Send the results to the Coordinator-Agent.
3. If it is a competition phase, do:
3.1. Update belief.
3.2. Run own algorithm for n iterations.
3.3. End run. Send the results to the Coordinator-Agent.
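The staged behaviour of the decision phase above can be sketched as follows (names are ours; `run_algorithm` abstracts the agent's own GA or SA run, and `decide_from_belief` stands in for procedure Decide() of Section 9.4.1).

```python
# Sketch of a solver-agent's decision phase: cooperate at stage 1, compete
# at stage 2, and use the belief-based Decide() procedure afterwards.

def decision_phase(stage, n, m_i, run_algorithm, decide_from_belief):
    if stage == 1:
        run_algorithm(n)                 # initial run of n iterations
        return 'C'                       # first encounter: always cooperate
    run_algorithm(m_i)                   # run for the earned allowance m_i
    if stage == 2:
        return 'D'                       # second encounter: always compete
    return decide_from_belief()          # stage > 2: memory-based decision
```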

A prototype GTMAS is constructed with 3 agents: a coordinator-agent and two
solver-agents. GTMAS starts with the initialisation of the coordinator-agent. The
coordinator-agent reads the problem data and initialises the payoff table. It also
initialises the resources accounts (seconds of CPU time or number of iterations)
assigned to the solver-agents. The overall iterations of GTMAS are called stages
which consist of decision and competition phases. In the decision phases, the
coordinator-agent asks the solver-agents for their decisions.
The solver-agents start with the initialisation of their memory of the outcomes of
encounters with their opponents, their algorithm parameters (e.g. the population for the
Genetic Algorithm) and their decision parameters (η , β , α , σ and γ ). After they run for
the given number of iterations determined by the coordinator-agent to obtain their
initial solutions for ℘, they make their decisions.
The decisions in the first two stages are not subject to analysis since no historical
data (memory of earlier encounters) exist yet. Both agents cooperate in the first
stage and compete in the second stage, regardless of the results and without prior
analysis. The following stages differ from the first two stages with the introduction
of memory. Decisions are made according to procedure Decide() below.

9.4.1 Solver-Agents Decision Making Procedure


Let SA1 be a solver-agent and SA2 its opponent in an IPD game. SA1 decides
whether to compete or cooperate according to the following procedure.
Procedure Decide() Pseudocode


0. Begin
1. If (SA2 has cooperated in the last move) then
1.1. If (P(IC|OC) < σ %) then
1.1.1. Cooperate.
1.2. Else If (P(OC|IC) < η %) then
1.2.0. Cooperate.
1.2.1. Else If (SA1 didn’t improve by γ % in last 2 stages) then
1.2.1.0. Decide randomly to cooperate or compete
1.2.1.1. Else
1.2.1.1.0. Compete.
1.2.1.2. End If
1.2.2. End If
1.3. End If
2. Else (Ask SA2 for its solution);
2.1. If (SA2 solution is α % better than that of SA1) then
2.1.1. If (SA1 is stuck with γ %) then
2.1.1.0. Cooperate.
2.1.2. Else If (SA2 solution is 2α % better than that of SA1) then
2.1.2.1. If (P(OC|ID) > β %) then
2.1.2.1.0. Compete.
2.1.2.2. Else
2.1.2.2.0. Cooperate.
2.1.2.3. End If
2.1.3. End If
2.3. Else
2.3.1. Compete.
2.4. End If
3. End If
4. Stop
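A direct Python transcription of procedure Decide() is given below (a sketch: the probability estimates and status flags are assumed to be supplied by the agent's belief and memory structures, and the parameter names follow the description in the sequel of Section 9.4.1).

```python
import random

def decide(opp_cooperated_last, p_ic_oc, p_oc_ic, p_oc_id,
           improved_by_gamma, stuck, alpha_better, two_alpha_better,
           sigma, eta, beta, rng=random):
    """'C' = cooperate, 'D' = compete, following procedure Decide()."""
    if opp_cooperated_last:
        if p_ic_oc < sigma:              # own responsiveness check
            return 'C'
        if p_oc_ic < eta:                # opponent's responsiveness check
            return 'C'
        if not improved_by_gamma:        # no improvement in last 2 stages
            return rng.choice(['C', 'D'])
        return 'D'
    if alpha_better:                     # opponent's solution alpha% better
        if stuck:                        # own solution stuck within gamma%
            return 'C'
        if two_alpha_better:             # opponent's solution 2*alpha% better
            return 'D' if p_oc_id > beta else 'C'
        return 'D'
    return 'D'
```

Where the pseudocode and the prose description that follows differ in a branch, the transcription follows the pseudocode.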

In the decision-making process, P(IC|OC) is the probability that SA1 will cooperate
in the next iteration given that SA2 cooperates in this iteration. It is equal to the
ratio between the number of times SA1 cooperates in the (n + 1)st iteration given
that SA2 cooperated in the nth iteration and the total number of encounters.
P(OC|IC) is the probability that SA2 will cooperate in the next iteration given that
SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2
cooperates in (n + 1)st iteration given SA1 cooperated in the nth iteration to the total
number of encounters.
P(OC|ID) is the probability that SA2 will cooperate in the next iteration given that
SA1 competes in this iteration. It is equal to the ratio of the number of times SA2
cooperates in the (n + 1)st iteration given that SA1 competed in the nth iteration to the
total number of encounters.
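These ratios can be estimated directly from the recorded encounter history; a sketch (with our own helper name, moves recorded as 'C'/'D') is:

```python
def cond_freq(trigger_moves, response_moves, trigger='C', response='C'):
    """Count encounters where `trigger` at stage n is followed by `response`
    at stage n+1, divided by the total number of encounters (as defined in
    the text)."""
    total = len(trigger_moves)
    hits = sum(1 for t, r in zip(trigger_moves, response_moves[1:])
               if t == trigger and r == response)
    return hits / total if total else 0.0

sa1 = ['C', 'D', 'C', 'C']          # SA1's moves over four encounters
sa2 = ['C', 'C', 'D', 'D']          # SA2's moves
p_ic_oc = cond_freq(sa2, sa1)                 # P(IC|OC)
p_oc_id = cond_freq(sa1, sa2, trigger='D')    # P(OC|ID)
```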
σ is a measure of how likely it is for SA1 to adopt a TIT-FOR-TAT strategy; it is referred to as “responsiveness”;
η is a measure of how likely it is for SA2 to adopt a TIT-FOR-TAT strategy;
γ is how likely it is for either agent to get stuck in a local optimum;
α is the threshold difference between SA1’s solution and that of SA2;
β is how likely it is for the opponent to cooperate (be nice!).
If the opponent cooperated in the last move, the agents check their own responsive-
ness, first. If it is less than σ , then they cooperate; if not, they check the opponent’s
responsiveness. If the latter is less than η , they conclude that the opponent is not as
responsive as it should be, so they compete. If the opponent is responsive, then they
check their own status: if they performed γ % better than what they obtained in the
last 2 stages before, then they compete, otherwise they make a random choice as to
whether they cooperate or compete.
If the opponent did not cooperate in the last move, then they compare their own
status with that of their opponent. If the opponent’s last solution is not α % better
than their own solution, they compete. If it is at least α % better, then they check
their own progress. If they are stuck with γ %, or in other words, the solution has
not improved more than γ % in the last 2 stages, then they cooperate. If they are not
stuck, they check the difference between the opponent’s and their own solutions. If
the solution of the opponent is not 2α % better than their own solution, they com-
pete. If it is 2α % better, then they check the opponent’s attitude to competition.
If the opponent is likely to cooperate, i.e. if the probability that it cooperates after competition is larger than β , they compete; otherwise they cooperate. Note that this
description is given as the procedure Decide().
After an agent makes a decision, it is sent to the coordinator-agent which manages
the solution exchange. Solution exchange is settled randomly when a solution is
offered, following a cooperate move; it is accepted with probability 0.5.
The worse performing agent is not entirely removed. It is kept but it is only
allowed one iteration in each stage. This is specific to the two solver-agent case as it
seems, from experimental results, that the weaker algorithm still helps the stronger
one to get, overall, a better solution. This, however, may not be the case if a large
number of algorithms were used.
When all the stages are completed, the results of the best solver-agent are
reported.

9.5 Application of GTMAS to TSP


GTMAS is applied to a collection of Travelling Salesman Problems (TSP), [21]. The
Genetic Algorithm (GA) and Simulated Annealing (SA) are selected as the solvers
in the library. GA is coded in Matlab 7.0 and Simulated Annealing Matlab code is
borrowed from Matlab Central ([29]). They are customised to be incorporated in
GTMAS which is also coded in Matlab 7.0.
Generic parameters of GTMAS are defined by pre-experimentation. Nstage , the
number of stages the game is played, is set to 5. Preliminary analyses show that
it is sufficient for convergence to the optimum or a good solution. n, the number of
iterations the solver-agents run for at the start of the decision phase and in the competition phases,
is set to 10. Accordingly, all the entries in the payoff table, rij , are quadrupled for
faster progress (see Table 9.3).

Table 9.3 Payoff Table of GTMAS

B
C D
G C (4,-8) (4,-4)
D (8,-8) (8,-4)

GTMAS is customised specifically for the GA-SA competition. SA is fast in the initial iterations because of the high temperature and the large probability of accepting bad
solutions in order to escape local optima. Afterwards, it slows down considerably
as the temperature decreases. The resource in the decision phase is CPU time.
The number of iterations SA is run, mSA , is updated at the beginning of each stage
to balance the CPU time usage of SA and that of GA.
mSA = (10/currentstage) ∗ mSA .
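As a one-line sketch of this update (the function name is ours):

```python
def update_m_sa(m_sa, current_stage):
    """Rebalance SA's iteration budget: m_SA <- (10 / current_stage) * m_SA."""
    return 10 / current_stage * m_sa

# The multiplier 10/current_stage shrinks as the game progresses, reflecting
# SA slowing down as the temperature decreases.
```

For example, `update_m_sa(10, 1)` gives 100.0, while at stage 5 the multiplier is only 2.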
Prior to experimentation, GA is expected to perform better than SA. GA is coded
for the specific problem in hand and its parameters are tuned accordingly. The pa-
rameters of SA are default parameters as found in the literature. The game played
between the agents affects the solution quality substantially. It depends on the solver-
agents’ attitude to cooperation and competition, the solution exchanges and the pay-
off matrix. These parameters are summarised with their possible values in Table 9.4.

Table 9.4 GTMAS Parameter Values

Payoff Matrix          Decision Model   Characteristic of GA   Characteristic of SA
simple                 random           random                 random
cooperation-rewarded   evaluative       cooperative            cooperative
competition-rewarded                    competitive            competitive
time-dependent

The payoff matrix can be simple, cooperation-rewarded, competition-rewarded
or time-dependent.
9 A Game Theory-Based Multi-Agent System 223

Table 9.5 Simple (top-left), Cooperation-Rewarded (bottom-left), Competition-Rewarded (top-right) and Time-Dependent (bottom-right) Payoff Matrices

B B
C D C D
G C (4,-4) (4,-4) G C (4,-8) (4,-4)
D (4,-4) (4,-4) D (8,-8) (8,-4)

B B
C D C D
G C (8,-4) (8,-8) G C ( 4t ,−4t) ( 4t ,− 4t )
D (4,-4) (4,-8) D (4t,−4t) (4t,− 4t )

The solver-agents arrive at decisions using the decision-making parameters α ,
β , γ , η and σ . These parameters were explained earlier and their values are given in
Table 9.6.

Table 9.6 The Values of the Decision Parameters

Agent
Characteristics α β γ η σ
cooperative 0.1 0.9 0.01 0.2 0.6
competitive 0.5 0.1 0.01 0.8 0.2

Twenty combinations of parameters were used and, for each, five runs were carried out on the problems of Table 9.13, from TSPLIB ([21]). GA is found to be the
better algorithm in 98 runs and SA in only the remaining 2. The
final solutions, the elapsed times, the solver algorithm and the series of coopera-
tion/competition bouts and exchange of solutions are recorded. The results are en-
tered into SPSS 14 for analysis of significant factors. The cooperation/competition
series and the exchange series are categorised prior to analysis. The categorisation
is summarised in Table 9.7.

Table 9.7 Cooperation and Solution Exchange Sequences of Solver-Agents

Cooperate Take Opponent’s Solution


less than twice never takes
more than twice takes in first stage
takes after second stage
takes in both

These are added to the factors of the experiment. The final factors of the question
are summarised in Table 9.8 with their corresponding values and the number of
occurrences.
Table 9.8 ANOVA - Data Summary

Factors    Values    Number of Observations
PAYOFF simple 1 25
cooperation-rewarded 2 25
competition-rewarded 3 25
time-dependent 4 25
DECISION PROCESS coop GA vs coop SA 1 20
coop GA vs comp SA 2 20
comp GA vs coop SA 3 20
comp GA vs comp SA 4 20
random 5 20
CATEGORY GA COOPERATES less than twice 1 64
more than twice 2 36
CATEGORY SA COOPERATES less than twice 1 43
more than twice 2 57
CATEGORY GA TAKES never takes 1 25
SA’S SOLUTION takes in first stage 2 23
takes after second stage 3 26
takes in both 4 26
CATEGORY SA TAKES never takes 1 22
GA’S SOLUTION takes in first stage 2 33
takes after second stage 3 33
takes in both 4 12

Table 9.9 shows ANOVA results for the dependent variable deviation. Here, deviation is the difference between the true solution objective value and the objective
value of the solution found. All factors and reasonable multiple interactions are included in the model. Most of them are highly insignificant due to the high random
variability. However, the interaction of the solution-taking sequences of the agents is
significant at the 11% level; therefore, the solution-taking sequence factors themselves are significant. Even though this significance level is not very strong, these are the
factors most expected to be significant in explaining the data, since the solution quality
is expected to depend on when solution exchanges occur.
Table 9.10 shows ANOVA results for the dependent variable time. The only significant factor, at the 12% significance level, is the solution exchange sequence of the SA solver-agent. This matches expectations, since SA varies
a lot both between iterations within problems and between problems. When it takes
a solution in any stage, the average elapsed time is about 100 seconds. When it
does not take the GA solution, the average elapsed time is about 60 seconds.
Table 9.9 ANOVA - Significant Factors Affecting Deviation From True Solution Value

Source Type III SS Deg.F Mean Sq. F.Value Sig.


Corrected Model 589.795a 76 7.760 .865 .689
Intercept 2098.415 1 2098.415 233.899 .000
PAYOFF 26.754 3 8.918 .994 .413
DECISION PROCESS 16.223 4 4.056 .452 .770
PAYOFF*DECISION
PROCESS 25.528 4 6.382 .711 .593
CATEGORY COOP1*
CATEGORY COOP2 5.214 1 5.214 .581 .454
CATEGORY TAKEN1*
CATEGORY TAKEN2 124.516 7 17.788 1.983 .102
PAYOFF*CATEGORY
COOP1*CATEGORY
COOP2 .000 0 . . .
PAYOFF*CATEGORY
TAKEN1*CATEGORY
TAKEN2 106.348 13 8.181 .912 .555
DECISION PROCESS*
CATEGORY COOP1*
CATEGORY COOP2 .000 0 . . .
DECISION PROCESS*
CATEGORY TAKEN1*
CATEGORY TAKEN2 7.187 4 1.797 .200 .936
PAYOFF*DECISION
PROCESS*
CATEGORY COOP1*
CATEGORY COOP2 .000 0 . . .
PAYOFF*DECISION
PROCESS*
CATEGORY TAKEN1*
CATEGORY TAKEN2 5.806 2 2.903 .324 .727
PAYOFF*DECISION
PROCESS*
CATEGORY COOP1*
CATEGORY COOP2*
CATEGORY TAKEN1*
CATEGORY TAKEN2 .000 0 . . .
CATEGORY COOP1 .041 1 .041 .005 .947
CATEGORY COOP2 4.731 1 4.731 .527 .475
CATEGORY TAKEN1 9.223 3 3.074 .343 .795
CATEGORY TAKEN2 11.471 3 3.824 .426 .736
Error 206.344 23 8.971
Total 4238.542 100
Corrected Total 796.139 99
a. R Squared = .741 (Adjusted R Squared =-.116)
Table 9.10 ANOVA - Significant Factors Affecting Time

Source Type III SS Deg.F Mean Sq. F.Value Sig.


Corrected Model 331363.405a 76 4360.045 .522 .981
Intercept 626245.612 1 626245.612 74.956 .000
CATEGORY COOP1 6.103 1 6.103 .001 .979
CATEGORY COOP2 111.092 1 111.092 .013 .909
CATEGORY TAKEN1 7246.948 3 2415.649 .289 .833
CATEGORY TAKEN2 54448.233 3 18149.411 2.172 .119
PAYOFF 2783.043 3 927.681 .111 .953
DECISION PROCESS 878.624 4 219.656 .026 .999
CATEGORY COOP1*
CATEGORY COOP2 .538 1 .538 .000 .994
CATEGORY TAKEN1*
CATEGORY TAKEN2 50147.857 7 7163.980 .857 .553
PAYOFF*DECISION
PROCESS 794.601 4 198.650 .024 .999
CATEGORY COOP1*
CATEGORY COOP2* .000 0 . . .
DECISION PROCESS
CATEGORY COOP1*
CATEGORY COOP2* .000 0 . . .
PAYOFF
CATEGORY TAKEN1*
CATEGORY TAKEN2* 1624.845 4 406.211 .049 .995
DECISION PROCESS
CATEGORY TAKEN1*
CATEGORY TAKEN2* 25019.542 13 1924.580 .230 .996
PAYOFF
CATEGORY TAKEN1*
CATEGORY TAKEN2* 786.468 2 393.234 .047 .954
PAYOFF*DECISION
PROCESS
CATEGORY COOP1*
CATEGORY COOP2* .000 0 . . .
PAYOFF*DECISION
PROCESS
CATEGORY COOP1*
CATEGORY COOP2*
CATEGORY TAKEN1*
CATEGORY TAKEN2* .000 0 . . .
PAYOFF*DECISION
PROCESS*
Error 192161.123 23 8354.831
Total 1536492.299 100
Corrected Total 523524.528 99
a. R Squared = .633 (Adjusted R Squared =-.580)
Table 9.11 Results with Respect to Solution Exchanges

GA Takes SA’s SA Takes GA’s Average Average


Solution Solution Deviation Time (sec)
never takes never takes - -
never takes takes in first stage 5.51% 108.8
never takes takes after second stage 4.89% 81.28
never takes takes in both 8.01% 98.8
takes in first stage never takes 4.62% 60.41
takes in first stage takes in first stage 6.85% 105.6
takes in first stage takes after second stage 5.18% 211.16
takes in first stage takes in both 6.06% 71.12
takes after second stage never takes 3.79% 58.24
takes after second stage takes in first stage 6.30% 119.97
takes after second stage takes after second stage 7.08% 93.08
takes after second stage takes in both 7.50% 296.77
takes in both never takes 7.41% 57.16
takes in both takes in first stage 3.36% 140.57
takes in both takes after second stage 5.74% 92.54
takes in both takes in both 5.34% 116.08

Table 9.12 Selection of Best Parameters

Characteristic Characteristic # of Time


Payoff of GA of SA Occur. Dev. (sec.)
coop-rewarded cooperative cooperative 2 3.59% 61.97
coop-rewarded cooperative competitive 1 1.27% 45.49
coop-rewarded competitive cooperative 1 3.33% 60.49
coop-rewarded competitive competitive 2 7.39% 50.17
coop-rewarded random random - - -
comp-rewarded cooperative cooperative - - -
comp-rewarded cooperative competitive - - -
comp-rewarded competitive cooperative - - -
comp-rewarded competitive competitive 1 0.65% 58.09
comp-rewarded random random - - -
simple cooperative cooperative - - -
simple cooperative competitive - - -
simple competitive cooperative 1 6.50% 66.41
simple competitive competitive 1 3.20% 70.54
simple random random - - -
time-dependent cooperative cooperative - - -
time-dependent cooperative competitive - - -
time-dependent competitive cooperative - - -
time-dependent competitive competitive 1 1.34% 54.77
time-dependent random random 1 3.47% 60.31
In Table 9.11, the best deviation is found when GA takes SA’s solution in both
the first stage and after the second stage and SA takes GA’s solution only in the
first stage. Average deviation is 3.36%. However, the average time elapsed to obtain
this average deviation is quite high, at 141 seconds. The second best deviation is
observed when GA takes SA’s solution after the second stage and SA never takes
GA’s solution. The average deviation is 3.79% with an average elapsed time of 58
seconds. From these results, it can be said that, in this setting, i.e. when GA com-
petes and takes the solution of SA and SA cooperates by offering its own solution
and never taking that of GA, the best performance is obtained. Whether obtaining
this solution exchange setting is random is not clear. What is clear is that it occurs
quite often. Table 9.12 records some of its occurrences. Amongst these 11 occurrences, the best average deviation is obtained with a competition-rewarded payoff
matrix and both agents being competitive. The deviation is 0.65% which actually
comes from only one occurrence, and the time is 58 seconds. This analysis does not
show that if competitive agents play against each other in a competition-rewarded
environment, then this is the best environment; rather, it shows that if competitive
agents play against each other in a competition-rewarded environment and their so-
lution exchange happens to be one-way benefit to one of the solver-agents, then this
might be the best setting.

9.6 Tests and Results


GTMAS is tested on 10 problems from TSPLIB [21]. The results are summarised
in Table 9.13.
Table 9.13 shows the runs for GA alone, SA alone and GA and SA together
under the framework of GTMAS, sequentially. For each problem instance, GTMAS
selected GA as the best solving agent.
When average deviations are compared within problem instances, GTMAS is
found to dominate GA: on average, GTMAS always finds better solutions than GA alone.
This is due to GA benefiting from the presence of SA, i.e. a synergistic
effect. It is observed that when GA takes the solution of SA, the quality of the overall
solution increases considerably. Solution exchange therefore seems to play a critical role in
determining the quality of the solution.
However, when elapsed times are compared, the average time of GTMAS is almost
double that of GA for almost all problems. This rather unfavorable time count
is the cost of keeping the SA solver-agent, because it improves the quality of the
overall solution. There are also other overheads that come with the need for coordination,
decision making and so on.
It should also be noted that the recorded time counts for GTMAS are those of a
sequential implementation. The times of a parallel implementation are expected to
be significantly lower.
9 A Game Theory-Based Multi-Agent System 229

Table 9.13 Results of GTMAS Applied to TSP

Problem Agent 1 Agent 2 Avg. Deviation Avg. Time (sec) Selected Algorithm
burma14 GA - 0.70% 6.95
burma14 - SA 0.53% 8.34
burma14 GA SA 0.00% 13.49 GA
ulysses16 GA - 0.27% 8.26
ulysses16 - SA 0.17% 42.28
ulysses16 GA SA 0.00% 14.71 GA
ulysses22 GA - 1.56% 9.83
ulysses22 - SA 1.16% 97.12
ulysses22 GA SA 1.47% 17.78 GA
att48 GA - 4.97% 41.23
att48 - SA 31.48% 10.52
att48 GA SA 4.06% 129.87 GA
eil51 GA - 4.46% 44.45
eil51 - SA 18.17% 423.26
eil51 GA SA 2.72% 91.58 GA
berlin52 GA - 8.67% 42.99
berlin52 - SA 36.37% 11.44
berlin52 GA SA 5.32% 66.94 GA
st70 GA - 11.62% 66.17
st70 - SA 24.89% 232.21
st70 GA SA 7.97% 143.52 GA
eil76 GA - 6.84% 74.90
eil76 - SA 33.34% 1162.02
eil76 GA SA 5.88% 136.23 GA
pr76 GA - 6.25% 94.02
pr76 - SA 35.91% 254.20
pr76 GA SA 5.71% 140.05 GA
eil101 GA - 10.37% 143.99
eil101 - SA 50.11% 220.71
eil101 GA SA 8.58% 211.98 GA

9.7 Conclusion and Further Work


A generic smart solver, GTMAS, has been constructed that combines a multi-agent
system architecture and game theory to deal with expensive optimisation problems.
Within GTMAS different algorithms attached to agents play an Iterated Prisoners’
Dilemma type game in which they cooperate to solve the problem and compete
over the computing facilities available (here CPU time). In the process, the system
finds the most appropriate algorithm for the given problem from a library of avail-
able algorithms and solves the problem. It also obtains a better quality approximate

solution than the best algorithm would obtain on its own. This is because of the
synergistic effect of the algorithms working together.
GTMAS implements an interesting resource allocation process that uses a purpose-built
payoff matrix to encourage competition for the available computing resources.
Solver-agents are rewarded for good performance by increasing their access to the
computing facilities; they are punished for bad performance by reducing
their access to the computing facilities. This simple rule guarantees that the computing
platform is increasingly dedicated to the most suited algorithm. In other
words, the bulk of the computing platform will eventually be used by the best performing
algorithm, which is synonymous with the computing resources being used
efficiently.
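The reward-and-punish rule above can be sketched as a simple reallocation of CPU-time shares. This is a minimal illustration, not the chapter's actual implementation; the `reward`, `penalty`, and `floor` parameters are hypothetical choices.

```python
def update_cpu_shares(shares, improved, reward=1.25, penalty=0.8, floor=0.01):
    """Reallocate CPU-time shares among solver-agents after one round.

    shares:   dict mapping agent name -> current fraction of CPU time.
    improved: dict mapping agent name -> True if the agent improved its
              best solution this round (good performance), else False.
    Good performers gain access to the computing facilities, bad ones
    lose it, but no agent is starved below `floor` before renormalising.
    """
    raw = {a: s * (reward if improved[a] else penalty) for a, s in shares.items()}
    raw = {a: max(s, floor) for a, s in raw.items()}
    total = sum(raw.values())
    return {a: s / total for a, s in raw.items()}  # shares again sum to 1

shares = {"GA": 0.5, "SA": 0.5}
shares = update_cpu_shares(shares, {"GA": True, "SA": False})
```

Iterated over many rounds, this rule concentrates the bulk of the computing platform on the best-performing algorithm, as the text describes.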
GTMAS as implemented here involves only two players. The study would benefit
from a more extensive investigation with a larger number of algorithms. To extend
it to n players, the results obtained here can be used. The game can be designed such
that, given the players A1, A2, ..., An, pair-wise games are considered and each game
is evaluated separately according to the same 2-by-2 payoff matrix introduced here.
The solvers that fail in the simultaneous 2-by-2 competitions are eliminated and the tournament continues with the ones that survive.
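One way to realise the pair-wise elimination just described is a knockout bracket over the solver-agents. The sketch below is hypothetical: `play_game` stands in for a full GTMAS 2-by-2 game and here simply returns the winner of a pair.

```python
def knockout(agents, play_game):
    """Run pair-wise 2-by-2 games; losers are eliminated each round and
    the tournament continues with the survivors until one agent remains."""
    survivors = list(agents)
    while len(survivors) > 1:
        next_round = []
        # pair adjacent survivors; an odd agent out gets a bye
        for i in range(0, len(survivors) - 1, 2):
            next_round.append(play_game(survivors[i], survivors[i + 1]))
        if len(survivors) % 2:
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0]

# toy example: a hypothetical 'strength' score decides each 2-by-2 game
strength = {"A1": 3, "A2": 1, "A3": 4, "A4": 2}
best = knockout(["A1", "A2", "A3", "A4"],
                lambda a, b: a if strength[a] >= strength[b] else b)
print(best)  # A3
```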
Another approach to the n-by-n game could be to play it simultaneously,
using notions of Nash's poker game [13], with a specially created n-by-n payoff
matrix that would evaluate all agents at once but select the best iteratively. Current
and future research directions concern extending the ideas of the GTMAS prototype
to a general n-by-n environment which deals with n algorithms, running in parallel,
according to one of the two proposed payoff matrices.

References
1. Aldea, A., Alcantra, R.B., Jimenez, L., Moreno, A., Martinez, J., Riano, D.: The scope
of application of multi-agent systems in the process industry: Three case studies. Expert
Systems with Applications 26, 39–47 (2004)
2. Axelrod, R.: Effective choice in the prisoner’s dilemma. Journal of Conflict Resolu-
tion 24(1), 3–25 (1980)
3. Axelrod, R.: More effective choice in the prisoner’s dilemma. Journal of Conflict Reso-
lution 24(3), 379–403 (1980)
4. Axelrod, R.: The Evolution of Cooperation. Basic Books, New York (1984)
5. Axelrod, R.: The evolution of strategies in the iterated prisoners’ dilemma. In: Davis, L.
(ed.) Genetic Algorithms and Simulated Annealing, pp. 32–42. Morgan Kaufmann, Los
Altos (1987)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science 211, 1390–1396
(1981)
7. Binmore, K.: Fun and Games. D.C. Heath, Lexington (1991)
8. Binmore, K.: Playing fair: Game theory and the social contract. MIT Press, Cambridge
(1994)
9. Bratman, M.E.: Shared cooperative activity. The Philosophical Review 101(2), 327–341
(1992)

10. Byrd, R.H., Dert, C.L., Rinnooy Kan, A.H.G., Schnabel, R.B.: Concurrent stochastic
methods for global optimization. Mathematical Programming 46, 1–30 (1990)
11. Colman, A.M.: Game Theory and Experimental Games. Pergamon Press Ltd., Oxford
(1982)
12. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On cooperation in multi-agent
systems. The Knowledge Engineering Review 12(3), 309–314 (1997)
13. Nash, J.F.: Non-cooperative games. Annals of Mathematics 54(2), 286–295 (1951)
14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor (1975)
15. Linster, B.: Essays on Cooperation and Competition. PhD thesis, University of Michigan,
Michigan (1990)
16. Luce, R., Raiffa, H.: Games and Decisions. Wiley, New York (1957)
17. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear pro-
gramming. Mathematical Programming 39, 117–130 (1987)
18. Töreyen, Ö.: A game-theory based multi-agent system for solving complex optimisation
problems and a clustering application related to the integration of Turkey into the EU
community. M.Sc. Thesis, Department of Mathematical Sciences, University
of Essex, UK (2008)
19. Park, S., Sugumaran, V.: Designing multi-agent systems: A framework and application.
Expert Systems with Applications 28, 259–271 (2005)
20. Rapoport, A., Chammah, A.M.: Prisoner’s Dilemma: A Study in Conflict and Coopera-
tion. University of Michigan Press, Ann Arbor (1965)
21. Reinelt, G.: TSPLIB,
http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95
22. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part I:
Clustering methods. Mathematical Programming 39, 27–56 (1987)
23. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part II:
Multi-level methods. Mathematical Programming 39, 57–78 (1987)
24. Rinnooy Kan, A.H.G., Timmer, G.T.: Global optimization. In: Nemhauser, G.L., Rin-
nooy Kan, A.H.G., Todd, M.J. (eds.) Optimization. Handbooks in Operations Research
and Management Science, ch. IX, vol. 1, pp. 631–662. North Holland, Amsterdam
(1989)
25. Salhi, A., Glaser, H., De Roure, D.: A genetic approach to understanding cooperative
behaviour. In: Osmera, P. (ed.) Proceedings of the 2nd International Mendel Conference
on Genetic Algorithms, MENDEL 1996, pp. 129–136 (1996)
26. Salhi, A., Glaser, H., De Roure, D.: Parallel implementation of a genetic-programming
based tool for symbolic regression. Information Processing Letters 66(6), 299–307
(1998)
27. Salhi, A., Glaser, H., De Roure, D., Putney, J.: The prisoners’ dilemma revisited. Techni-
cal Report DSSE-TR-96-2, Department of Electronics and Computer Science, The Uni-
versity of Southampton, U.K. (February 1996)
28. Salhi, A., Proll, L.G., Rios Insua, D., Martin, J.: Experiences with stochastic algorithms
for a class of global optimisation problems. RAIRO Operations Research 34(22), 183–
197 (2000)
29. Seshadri, A.: Simulated annealing for travelling salesman problem,
http://www.mathworks.com/matlabcentral/fileexchange

30. Tweedale, J., Ichalkaranje, H., Sioutis, C., Jarvis, B., Consoli, A., Phillips-Wren, G.:
Innovations in multi-agent systems. Journal of Network and Computer Applications 30,
1089–1115 (2007)
31. Wooldridge, M., Jennings, N.R.: Intelligent agents: Theory and practice. Knowledge En-
gineering Review 10(2), 115–152 (1995)
32. Zhigljavsky, A.A.: Theory of Global Search. Mathematics and its applications, Soviet
Series, vol. 65. Kluwer Academic Publishers, Dordrecht (1991)
Chapter 10
Optimization with Clifford Support Vector Machines and Applications

N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

Abstract. This chapter introduces a generalization of the real- and complex-valued
SVMs using the Clifford algebra. In this framework we handle the design of kernels
involving the geometric product for linear and nonlinear classification and regression.
The major advantage of our approach is that we redefine the optimization
variables as multivectors. This allows us to have a multivector as output and, therefore,
to represent multiple classes according to the dimension of the geometric
algebra in which we work. By using the CSVM with one Clifford kernel we greatly reduce
the complexity of the computation. This is possible thanks to the Clifford
product, which performs the direct product between the spaces of different grade
involved in the optimization problem. We conduct comparisons between the CSVM
and the most widely used approaches to multi-class classification to show that ours
is more suitable for practical use on certain types of problems. The chapter
includes several experiments that apply the CSVM to classification
and regression problems, as well as to 3D object recognition for visually guided
robotics. In addition, we show the design of a recurrent system involving an LSTM
network connected with a CSVM, and we study the performance of this system in
time-series experiments and robot navigation using reinforcement learning.

10.1 Introduction
The Support Vector Machine (SVM) [1, 2, 3, 4] is a powerful optimization algorithm to solve classification and regression problems, but it was originally designed
N. Arana-Daniel · C. López-Franco
Computer Science Department, Exact Sciences and Engineering Campus, CUCEI,
University of Guadalajara, Av. Revolucion 1500, Col. Olı́mpica, C.P. 44430,
Guadalajara, Jalisco, México
e-mail: {nancy.arana,carlos.lopez}@cucei.udg.mx
E. Bayro-Corrochano
Cinvestav del IPN, Department of Electrical Engineering and Computer Science,
Zapopan, Jalisco, México

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 233–262.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

for binary classification. The methodology for extending this algorithm to multi-class
classification is still an ongoing research issue. Currently there are two main types
of approaches for multi-class SVM [6, 7]. One is to construct and combine
several binary classifiers, while the other is to consider all data directly in one
big optimization problem; the latter approach is computationally more
expensive for solving multi-class problems. This is why the authors were motivated
to develop an SVM-based algorithm for multi-classification and multi-regression,
which furthermore is based on the framework of Clifford (or geometric) algebra [5].
The authors' hypothesis was that these algebras are the appropriate mathematical
framework in which to develop this algorithm, because Clifford algebras allow us to
express in a compact way many geometric entities (which are used to represent
multiple classes) and the products between them.
This chapter presents the results obtained from the development of the above-mentioned
hypothesis: i) the design of the generalization of the real- and complex-valued
Support Vector Machines for classification and regression using the
Clifford geometric algebra, which from now on will be called Clifford Support
Vector Machines (CSVM), ii) the development of Multiple Input Multiple Output
(MIMO) CSVM and iii) the application of CSVM as classifiers, regressors and
as an important component of a recurrent system. This work is a continuation of a
first one on the generalization of SVMs [8].

10.2 Geometric Algebra


Let $G_n$ denote the geometric (Clifford) algebra of n dimensions; this is a graded
linear space. As well as vector addition and scalar multiplication we have a non-commutative
product which is associative and distributive over addition – this is the
geometric or Clifford product. A further distinguishing feature of the algebra is that
any vector squares to give a scalar. The geometric product of two vectors a and b is
written ab and can be expressed as a sum of its symmetric and antisymmetric parts

$$ab = a \cdot b + a \wedge b, \qquad (10.1)$$

where the inner product $a \cdot b$ and the outer product $a \wedge b$ are defined by

$$a \cdot b = \tfrac{1}{2}(ab + ba), \qquad a \wedge b = \tfrac{1}{2}(ab - ba). \qquad (10.2)$$

The inner product of two vectors is the standard scalar or dot product and produces
a scalar. The outer or wedge product of two vectors is a new quantity which we call
a bivector. We think of a bivector as an oriented area in the plane containing a and b,
formed by sweeping a along b. Thus, $b \wedge a$ will have the opposite orientation, making
the wedge product anti-commutative, as given in (10.2). The outer product is
immediately generalizable to higher dimensions – for example, $(a \wedge b) \wedge c$, a trivector,
is interpreted as the oriented volume formed by sweeping the area $a \wedge b$ along
vector c. The outer product of k vectors is a k-vector or k-blade, and such a quantity
is said to have grade k. A multivector $A \in G_n$ is the sum of k-blades of different or

equal grade. This linear combination is called homogeneous of grade r ($A = \langle A \rangle_r$) if
it contains terms of only a single grade.

10.2.1 The Geometric Algebra of n-D Space


In an n-dimensional space $V^n$ we can introduce an orthonormal basis of vectors
$\{\sigma_i\}$, $i = 1, \ldots, n$, such that $\sigma_i \cdot \sigma_j = \delta_{ij}$. This leads to a basis for the entire algebra:

$$1,\; \{\sigma_i\},\; \{\sigma_i \wedge \sigma_j\},\; \{\sigma_i \wedge \sigma_j \wedge \sigma_k\},\; \ldots,\; \sigma_1 \wedge \sigma_2 \wedge \ldots \wedge \sigma_n = I, \qquad (10.3)$$

which spans the entire geometric algebra $G_n$. Here I is the hypervolume called the
pseudoscalar, which commutes with all the multivectors and is used as a dualization
operator as well. Note that the basis vectors are not represented by bold symbols.
Any multivector can be expressed in terms of this basis. Because the addition of
k-vectors (homogeneous vectors of grade k) is closed and the multiplication of a
k-vector by a scalar is again a k-vector, the set of all k-vectors is a vector space,
denoted $\bigwedge^k V^n$. Each of these spaces is spanned by $\binom{n}{k}$ k-vectors, where
$\binom{n}{k} := \frac{n!}{(n-k)!\,k!}$. Thus, our geometric algebra $G_n$, which is spanned by
$\sum_{k=0}^{n} \binom{n}{k} = 2^n$ elements, is a direct sum of its homogeneous subspaces
of grades $0, 1, 2, \ldots, n$, that is,

$$G_n = \bigwedge^0 V^n \oplus \bigwedge^1 V^n \oplus \bigwedge^2 V^n \oplus \cdots \oplus \bigwedge^n V^n \qquad (10.4)$$

where $\bigwedge^0 V^n = \mathbb{R}$ is the set of real numbers and $\bigwedge^1 V^n = V^n$ corresponds to the linear
n-dimensional vector space. Thus, any multivector of $G_n$ can be expressed in terms
of the bases of these subspaces.
In this chapter we will specify a geometric algebra $G_n$ of the n-dimensional space
by $G_{p,q,r}$, where p, q and r stand for the number of basis vectors which square to 1,
-1 and 0 respectively, and fulfill $n = p + q + r$. Its even subalgebra will be denoted by
$G^{+}_{p,q,r}$.
In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vectors),
grade 2 (bivectors), grade 3 (trivectors), etc., up to grade n. Any two such multivectors
can be multiplied using the geometric product. Consider two multivectors
$A_r$ and $B_s$ of grades r and s respectively. The geometric product of $A_r$ and $B_s$ can be
written as

$$A_r B_s = \langle AB \rangle_{r+s} + \langle AB \rangle_{r+s-2} + \ldots + \langle AB \rangle_{|r-s|} \qquad (10.5)$$

where $\langle M \rangle_t$ is used to denote the t-grade part of multivector M, e.g. consider the
geometric product of two vectors $ab = \langle ab \rangle_0 + \langle ab \rangle_2 = a \cdot b + a \wedge b$. Another simple
illustration is the geometric product of $A = 4\sigma_3 + 2\sigma_1\sigma_2$ and $b = 8\sigma_2 + 6\sigma_3$:

$$Ab = 24(\sigma_3)^2 + 16\sigma_1(\sigma_2)^2 + 32\sigma_3\sigma_2 + 12\sigma_1\sigma_2\sigma_3 = 24 + 16\sigma_1 - 32\sigma_2\sigma_3 + 12I \qquad (10.6)$$
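The bookkeeping in (10.5)–(10.6) is mechanical enough to check by machine. Below is a minimal sketch (not from the chapter) of the geometric product in $G_{3,0,0}$, using the common trick of encoding each basis blade as a bitmask with bit i standing for $\sigma_{i+1}$; it reproduces the example (10.6).

```python
from collections import defaultdict

def reorder_sign(a, b):
    """Sign acquired when sorting the factors of the blade product a*b into
    canonical (ascending-index) order; each transposition of anticommuting
    basis vectors contributes a factor of -1."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1

def gp(A, B):
    """Geometric product in G_{3,0,0}. Multivectors are dicts mapping a
    blade bitmask (bit i set = sigma_{i+1} is a factor) to a coefficient;
    with the Euclidean signature sigma_i^2 = +1, common factors cancel and
    the resulting blade is simply a XOR b."""
    out = defaultdict(float)
    for a, ca in A.items():
        for b, cb in B.items():
            out[a ^ b] += reorder_sign(a, b) * ca * cb
    return {k: v for k, v in out.items() if v}

# Example (10.6): A = 4*s3 + 2*s1s2,  b = 8*s2 + 6*s3
A = {0b100: 4, 0b011: 2}
b = {0b010: 8, 0b100: 6}
result = gp(A, b)
# result encodes 24 + 16*s1 - 32*s2s3 + 12*I, matching (10.6)
```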

Note here that the Clifford product gives $\sigma_i\sigma_i = (\sigma_i)^2 = \sigma_i \cdot \sigma_i = 1$, because the wedge
product $\sigma_i \wedge \sigma_i = 0$; and $\sigma_i\sigma_j = \sigma_i \wedge \sigma_j$, i.e. the geometric product of different
unit basis vectors is equal to their wedge, which for simple notation can be
omitted. Using equation (10.5) we can express the inner and outer products for the
multivectors as

$$A_r \cdot B_s = \langle A_r B_s \rangle_{|r-s|} \qquad (10.7)$$

$$A_r \wedge B_s = \langle A_r B_s \rangle_{r+s} \qquad (10.8)$$

In order to deal with more general multivectors, we define the scalar product

$$A * B = \langle AB \rangle_0 \qquad (10.9)$$

For an r-grade multivector $A_r = \sum_{i=0}^{r} \langle A_r \rangle_i$, the following operations are defined:

$$\text{Grade involution:} \quad \hat{A}_r = \sum_{i=0}^{r} (-1)^i \langle A \rangle_i \qquad (10.10)$$

$$\text{Reversion:} \quad A_r^{\dagger} = \sum_{i=0}^{r} (-1)^{\frac{i(i-1)}{2}} \langle A \rangle_i \qquad (10.11)$$

$$\text{Clifford conjugation:} \quad \tilde{A}_r = \hat{A}_r^{\dagger} = \sum_{i=0}^{r} (-1)^{\frac{i(i+1)}{2}} \langle A \rangle_i \qquad (10.12, 10.13)$$

The grade involution simply negates the odd-grade blades of a multivector. The
reversion can also be obtained by reversing the order of basis vectors making up the
blades in a multivector and then rearranging them to their original order using the
anti-commutativity of the Clifford product. The scalar product $*$ is positive definite,
i.e. one can associate with any multivector $A = \langle A \rangle_0 + \langle A \rangle_1 + \ldots + \langle A \rangle_n$ a unique
positive scalar magnitude $|A|$ defined by

$$|A|^2 = A^{\dagger} * A = \sum_{r=0}^{n} |\langle A \rangle_r|^2 \geq 0, \qquad (10.14)$$

where $|A| = 0$ if and only if $A = 0$. For a homogeneous multivector $A_r$ its magnitude
is defined as $|A_r| = \sqrt{A_r^{\dagger} A_r}$. In particular, for an r-vector $A_r$ of the form $A_r = a_1 \wedge
a_2 \wedge \ldots \wedge a_r$: $A_r^{\dagger} = (a_1 \ldots a_{r-1} a_r)^{\dagger} = a_r a_{r-1} \ldots a_1$ and thus $A_r^{\dagger} A_r = a_1^2 a_2^2 \ldots a_r^2$; so
we will say that such an r-vector is null if and only if it has a null vector for a factor.
If in such a factorization of $A_r$, p, q and s factors square to a positive number, a negative
number and zero, respectively, we will say that $A_r$ is an r-vector with signature (p, q, s). In
particular, if $s = 0$ such a non-singular r-vector has a multiplicative inverse

$$A^{-1} = (-1)^q \frac{A^{\dagger}}{|A|^2} = \frac{A}{A^2} \qquad (10.15)$$

In general, the inverse A−1 of a multivector A, if it exists, is defined by the equation


A−1 A = 1.
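The sign patterns of (10.10)–(10.11) and the magnitude (10.14) are easy to compute once a multivector is stored as a map from basis blades to coefficients — a hypothetical representation for illustration, where the grade of a blade is its number of basis-vector factors.

```python
def reversion(A):
    """Reversion A†: a grade-k blade picks up the sign (-1)^(k(k-1)/2),
    per (10.11). Multivectors are dicts: blade bitmask -> coefficient,
    where bit i set means sigma_{i+1} is a factor of the blade."""
    def sign(blade):
        k = bin(blade).count("1")            # grade = number of factors
        return -1 if (k * (k - 1) // 2) % 2 else 1
    return {blade: sign(blade) * c for blade, c in A.items()}

def magnitude(A):
    """|A| from (10.14); in an orthonormal Euclidean basis, A† * A reduces
    to the sum of the squared blade coefficients."""
    return sum(c * c for c in A.values()) ** 0.5

A = {0b000: 1.0, 0b011: 2.0}                 # A = 1 + 2*s1s2
assert reversion(A) == {0b000: 1.0, 0b011: -2.0}   # (s1s2)† = s2s1 = -s1s2
print(magnitude(A))  # sqrt(5)
```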

10.2.2 The Geometric Algebra of 3-D Space


The basis for the geometric algebra $G_{3,0,0}$ of the 3-D space has $2^3 = 8$ elements
and is given by:

$$\underbrace{1}_{\text{scalar}},\; \underbrace{\{\sigma_1, \sigma_2, \sigma_3\}}_{\text{vectors}},\; \underbrace{\{\sigma_1\sigma_2, \sigma_2\sigma_3, \sigma_3\sigma_1\}}_{\text{bivectors}},\; \underbrace{\{\sigma_1\sigma_2\sigma_3\} \equiv I}_{\text{trivector}}. \qquad (10.16)$$

In $G_{3,0,0}$ a typical multivector v will be of the form $v = \alpha_0 + \alpha_1\sigma_1 + \alpha_2\sigma_2 + \alpha_3\sigma_3 +
\alpha_4\sigma_2\sigma_3 + \alpha_5\sigma_3\sigma_1 + \alpha_6\sigma_1\sigma_2 + \alpha_7 I_3 = \langle v \rangle_0 + \langle v \rangle_1 + \langle v \rangle_2 + \langle v \rangle_3$, where
the $\alpha_i$ are real numbers and $\langle v \rangle_0 = \alpha_0 \in \bigwedge^0 V^n$, $\langle v \rangle_1 = \alpha_1\sigma_1 + \alpha_2\sigma_2 + \alpha_3\sigma_3 \in
\bigwedge^1 V^n$, $\langle v \rangle_2 = \alpha_4\sigma_2\sigma_3 + \alpha_5\sigma_3\sigma_1 + \alpha_6\sigma_1\sigma_2 \in \bigwedge^2 V^n$, $\langle v \rangle_3 = \alpha_7 I_3 \in \bigwedge^3 V^n$.
In geometric algebra a rotor (short name for rotator), R, is an even-grade element
of the algebra which satisfies $R\tilde{R} = 1$, where $\tilde{R}$ stands for the conjugate of R. If $A =
\{a_0, a_1, a_2, a_3\} \in G_{3,0,0}$ represents a unit quaternion, then the rotor which performs
the same rotation is simply given by

$$R = \underbrace{a_0}_{\text{scalar}} + \underbrace{a_1(I\sigma_1) - a_2(I\sigma_2) + a_3(I\sigma_3)}_{\text{bivectors}} \qquad (10.17)$$
$$= a_0 + a_1\sigma_2\sigma_3 + a_2\sigma_3\sigma_1 + a_3\sigma_1\sigma_2. \qquad (10.18)$$

The quaternion algebra is therefore seen to be a subset of the geometric algebra of
3-space. The conjugate of a rotor is given by $\tilde{R} = a_0 - a_1\sigma_2\sigma_3 + a_2\sigma_3\sigma_1 - a_3\sigma_1\sigma_2$.
The transformation in terms of a rotor, $a \rightarrow Ra\tilde{R} = b$, is a very general way of
handling rotations; it works for multivectors of any grade and in spaces of any dimension,
in contrast to quaternion calculus. Rotors combine in a straightforward
manner, i.e. a rotor $R_1$ followed by a rotor $R_2$ is equivalent to a total rotor R where
$R = R_2 R_1$.
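Since the rotor algebra of $G^{+}_{3,0,0}$ is isomorphic to the quaternions, the sandwich $a \rightarrow Ra\tilde{R}$ can be checked numerically with plain quaternion products. This sketch uses the usual (w, x, y, z) quaternion convention (an assumption for illustration, not the chapter's notation) to rotate a vector about an axis.

```python
import math

def quat_mul(p, q):
    """Hamilton product; quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def rotate(axis, angle, v):
    """a -> R a R~ with the rotor built from a unit axis and an angle."""
    n = math.sqrt(sum(a * a for a in axis))
    ax = [a / n for a in axis]
    h = angle / 2.0
    R = (math.cos(h), math.sin(h)*ax[0], math.sin(h)*ax[1], math.sin(h)*ax[2])
    Rc = (R[0], -R[1], -R[2], -R[3])       # conjugate rotor R~
    _, x, y, z = quat_mul(quat_mul(R, (0.0, *v)), Rc)
    return (x, y, z)

print(rotate((0, 0, 1), math.pi / 2, (1, 0, 0)))  # ~ (0, 1, 0)
```

Composing two quarter-turns this way reproduces a half-turn, matching the rule $R = R_2 R_1$.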

10.3 Linear Clifford Support Vector Machines for Classification
For the case of the Clifford SVM for classification we represent the data set in a certain
Clifford algebra $G_n$, where $n = p + q + r$ and any multivector basis element squares
to 1, -1 or 0 depending on whether it belongs to the p, q or r basis vectors respectively. We
consider the general case of an input comprising D multivectors and one multivector
output, i.e. each ith vector has D multivector entries $x_i = [x_{i1}, x_{i2}, \ldots, x_{iD}]^T$, where
$x_{ij} \in G_n$ and D is its dimension. Thus the ith vector dimension is $D \times 2^n$, and each

data ith vector $x_i \in G_n^D$. Each of the ith vectors will be associated with one output
of the $2^n$ possibilities given by the following multivector output

$$y_i = y_{is} + y_{i\sigma_1}\sigma_1 + y_{i\sigma_2}\sigma_2 + \ldots + y_{iI} I \in \{\pm 1 \pm \sigma_1 \pm \sigma_2 \ldots \pm I\} \qquad (10.19)$$

where the first subindex s stands for the scalar part. For the classification the CSVM
separates these multivector-valued samples into $2^n$ groups by selecting a good enough
function from the set of functions
$$\{ f(x) = w^{\dagger T} x + b \}, \qquad (10.20)$$

where $x, w = [w_1, w_2, \ldots, w_D]^T \in G_n^D$ and $f(x), b \in G_n$. An entry of the optimal hyperplane
w is given by $w_i = w_{is} + w_{i\sigma_1}\sigma_1 + \ldots + w_{i\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + w_{iI} I$.
Let us look at the last equation in detail:
$$f(x) = w^{\dagger T} x + b = [w_1^{\dagger}, w_2^{\dagger}, \ldots, w_D^{\dagger}]^T [x_1, x_2, \ldots, x_D] + b = \sum_{i=1}^{D} w_i^{\dagger} x_i + b \qquad (10.21)$$

where $w_i^{\dagger} x_i$ corresponds to the Clifford product of two multivectors and $w_i^{\dagger}$ is the
reversion of the multivector $w_i$.
Next, we introduce a structural risk functional similar to the real-valued
one of the SVM for classification. By using a loss function similar to Vapnik's $\xi$-insensitive
one, we utilize the following linearly constrained quadratic programming problem for the
primal equation

$$\min L(w, b, \xi) = \frac{1}{2} w^{\dagger T} w + C \sum_{i,j} \xi_{ij}$$
$$\text{subject to} \quad y_{ij}(f(x_i))_j = y_{ij}(w^{\dagger T} x_i + b)_j \geq 1 - \xi_{ij}, \quad \xi_{ij} \geq 0 \text{ for all } i, j, \qquad (10.22)$$

where the $\xi_{ij}$ stand for the slack variables, i indicates the data ith vector and j indexes
the multivector component, i.e. j = 1 for the coefficient of the scalar part, j = 2 for
the coefficient of $\sigma_1$, ..., $j = 2^n$ for the coefficient of I. The dual expression of this
problem can be derived straightforwardly. Firstly let us consider the expression for
the orientation of the optimal hyperplane

$$w = [w_1, w_2, \ldots, w_D]^T \qquad (10.23)$$

where each of the $w_k$ is given by the multivector

$$w_k = w_{ks} + w_{k\sigma_1}\sigma_1 + \ldots + w_{k\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + w_{kI} I. \qquad (10.24)$$



Each component of these weights is computed as follows:

$$w_{ks} = \sum_{j=1}^{l} (\alpha_s)_j (y_s)_j (x_{ks})_j, \quad w_{k\sigma_1} = \sum_{j=1}^{l} (\alpha_{\sigma_1})_j (y_{\sigma_1})_j (x_{k\sigma_1})_j, \quad \ldots, \quad w_{kI} = \sum_{j=1}^{l} (\alpha_I)_j (y_I)_j (x_{kI})_j, \qquad (10.25)$$

where $(\alpha_s)_j, (\alpha_{\sigma_1})_j, \ldots, (\alpha_I)_j$, $j = 1, \ldots, l$ are the Lagrange multipliers. According
to the Wolfe dual programming [1] the dual form reads

$$\min \frac{1}{2}(w^{\dagger T} w) - \sum_{i,j} \alpha_{ij} \qquad (10.26)$$

subject to $a^T \cdot \mathbf{1} = 0$, and all the Lagrange multipliers should fulfill $0 \leq (\alpha_s)_j \leq C$,
$0 \leq (\alpha_{\sigma_1})_j \leq C$, ..., $0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C$, ..., $0 \leq (\alpha_I)_j \leq C$ for $i = 1, \ldots, D$ and
$j = 1, \ldots, l$. In $a^T \cdot \mathbf{1} = 0$, $\mathbf{1}$ denotes a vector of all ones. The entries of the vector
$$a = [a_s, a_{\sigma_1}, a_{\sigma_2}, \ldots, a_{\sigma_1\sigma_2}, \ldots, a_I] \qquad (10.27)$$

are given by

$$a_s^T = [(\alpha_s)_1 (y_s)_1, (\alpha_s)_2 (y_s)_2, \ldots, (\alpha_s)_l (y_s)_l]$$
$$a_{\sigma_1}^T = [(\alpha_{\sigma_1})_1 (y_{\sigma_1})_1, (\alpha_{\sigma_1})_2 (y_{\sigma_1})_2, \ldots, (\alpha_{\sigma_1})_l (y_{\sigma_1})_l]$$
$$\vdots$$
$$a_I^T = [(\alpha_I)_1 (y_I)_1, (\alpha_I)_2 (y_I)_2, \ldots, (\alpha_I)_l (y_I)_l] \qquad (10.28)$$

Note that the vector $a^T$ has dimension $(l \times 2^n) \times 1$. We require a compact and
easy representation of the resultant Gram matrix of the multi-components; this will
help with the programming of the algorithm. For that, let us first consider the Clifford
product $(w^{\dagger T} w)$, which can be expressed as follows

$$w^{\dagger T} w = \langle w^{\dagger T} w \rangle_s + \langle w^{\dagger T} w \rangle_{\sigma_1} + \langle w^{\dagger T} w \rangle_{\sigma_2} + \ldots + \langle w^{\dagger T} w \rangle_I \qquad (10.29)$$

Since w has the components presented in (10.25), equation (10.29) can be rewritten
as follows

$$w^{\dagger T} w = a_s^T \langle x^{\dagger T} x \rangle_s a_s + \ldots + a_s^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_s^T \langle x^{\dagger T} x \rangle_I a_I + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_s a_s + \ldots + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_I a_I + \ldots + a_I^T \langle x^{\dagger T} x \rangle_s a_s + a_I^T \langle x^{\dagger T} x \rangle_{\sigma_1} a_{\sigma_1} + \ldots + a_I^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_I^T \langle x^{\dagger T} x \rangle_I a_I. \qquad (10.30)$$

Renaming the matrices of the t-grade parts $\langle x^{\dagger T} x \rangle_t$, we rewrite the previous
equation as:

$$w^{\dagger T} w = a_s^T H_s a_s + a_s^T H_{\sigma_1} a_{\sigma_1} + a_s^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_s^T H_I a_I + a_{\sigma_1}^T H_s a_s + a_{\sigma_1}^T H_{\sigma_1} a_{\sigma_1} + \ldots + a_{\sigma_1}^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_{\sigma_1}^T H_I a_I + \ldots + a_I^T H_s a_s + a_I^T H_{\sigma_1} a_{\sigma_1} + \ldots + a_I^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_I^T H_I a_I. \qquad (10.31)$$

Taking into consideration the previous equations and definitions, the primal equation
(10.22) now reads as follows:

$$\min L(w, b, \xi) = \frac{1}{2} a^T H a + C \cdot \sum_{i,j} \alpha_{ij} \qquad (10.32)$$

Using the previous definitions and equations we can define the dual optimization
problem as follows

$$\max\; a^T \mathbf{1} - \frac{1}{2} a^T H a$$
$$\text{subject to} \quad 0 \leq (\alpha_s)_j \leq C,\; 0 \leq (\alpha_{\sigma_1})_j \leq C,\; \ldots,\; 0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C,\; \ldots,\; 0 \leq (\alpha_I)_j \leq C \quad \text{for } j = 1, \ldots, l, \qquad (10.33)$$

where a is given by (10.27) and, again, $\mathbf{1}$ denotes a vector of all ones.


H is a positive semidefinite matrix which is the expected generalized Gram matrix.
In terms of the matrices of the t-grade parts $\langle x^{\dagger T} x \rangle_t$, this matrix is written as
follows:

$$H = \begin{bmatrix} H_s & H_{\sigma_1} & H_{\sigma_2} & \cdots & H_{\sigma_1\sigma_2} & \cdots & H_I \\ H_{\sigma_1}^T & H_s & H_{\sigma_1} & \cdots & & \cdots & \\ H_{\sigma_2}^T & H_{\sigma_1}^T & H_s & \cdots & & \cdots & \\ \vdots & & & \ddots & & & \vdots \\ H_I^T & \cdots & & \cdots & H_{\sigma_2}^T & H_{\sigma_1}^T & H_s \end{bmatrix}, \qquad (10.34)$$

Note that the diagonal entries equal $H_s$ and, since H is a symmetric matrix, the
lower block matrices are transposed. The optimal weight vector w is as given by (10.23).
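Reading (10.34) as a symmetric block-Toeplitz arrangement — one plausible interpretation of the printed matrix, with $H_s$ on the diagonal, the blocks repeating along diagonals, and transposed blocks below — the generalized Gram matrix can be assembled with NumPy as a sketch:

```python
import numpy as np

def assemble_H(blocks):
    """Assemble the generalized Gram matrix of (10.34) from the component
    matrices of the t-grade parts, given as [H_s, H_sigma1, ..., H_I].
    Block (i, j) holds blocks[j - i] above the diagonal and its transpose
    below, so the result is symmetric with H_s on the diagonal."""
    m = len(blocks)
    return np.block([[blocks[j - i] if j >= i else blocks[i - j].T
                      for j in range(m)] for i in range(m)])

# toy check with 1x1 blocks
Hs, H1, H2 = (np.array([[v]], float) for v in (1.0, 2.0, 3.0))
H = assemble_H([Hs, H1, H2])
print(H)  # symmetric, with the H_s entry on the diagonal
```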
The threshold $b \in G_n^D$ can be computed by using the KKT conditions with the Clifford
support vectors as follows

$$b = b_s + b_{\sigma_1}\sigma_1 + \ldots + b_{\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + b_I I = \frac{1}{l}\sum_{j=1}^{l} (y_j - w^{\dagger T} x_j). \qquad (10.35)$$

The decision function can be seen as sectors reserved for each involved class; i.e.
in the case of complex numbers ($G_{1,0,0}$) or quaternions ($G_{0,2,0}$) we can see that the
circle or the sphere is divided by means of spherical vectors. Thus the decision function
can be envisaged as

$$y = \mathrm{csign}_m(f(x)) = \mathrm{csign}_m(w^{\dagger T} x + b) = \mathrm{csign}_m\Big(\sum_{j=1}^{l} (\alpha_j \circ y_j)(x_j^{\dagger T} x) + b\Big), \qquad (10.36)$$

where $\mathrm{csign}_m(f(x))$ is the function for detecting the sign of $f(x)$, m stands for
the different values which indicate the state valency, e.g. bivalent, tetravalent, and
the operation "$\circ$" is defined as

$$(\alpha_j \circ y_j) = \langle \alpha_j \rangle_0 \langle y_j \rangle_0 + \langle \alpha_j \rangle_1 \langle y_j \rangle_1 \sigma_1 + \ldots + \langle \alpha_j \rangle_{2^n} \langle y_j \rangle_{2^n} I, \qquad (10.37)$$

i.e. one simply takes as coefficients of the multivector basis the products of the
coefficients of blades of the same degree. For clarity we introduce this operation
"$\circ$", which takes place implicitly in the previous equation (10.25).
Note that the cases of complex numbers 2-state (outputs 1 for $-\frac{\pi}{2} \leq \arg(f(x)) < \frac{\pi}{2}$
and -1 for $\frac{\pi}{2} \leq \arg(f(x)) < \frac{3\pi}{2}$) and 4-state (outputs 1+i for $0 \leq \arg(f(x)) < \frac{\pi}{2}$,
-1+i for $\frac{\pi}{2} \leq \arg(f(x)) < \pi$, -1-i for $\pi \leq \arg(f(x)) < \frac{3\pi}{2}$ and 1-i for $\frac{3\pi}{2} \leq \arg(f(x)) <
2\pi$) can be solved by the multi-class real-valued SVM; however, in the case of higher
representations like the 16-state using quaternions, it would be awkward to resort to
the multi-class real-valued SVMs.
The major advantage of our approach is that we redefine the optimization vector
variables as multivectors. This allows us to utilize the components of the multivector
output to represent different classes. The number of achievable class outputs is
directly proportional to the dimension of the involved geometric algebra. The key
idea for solving multi-class classification in the geometric algebra is to prevent the
multivector elements of different grade from being collapsed into a scalar; this can be done
thanks to the redefinition of the primal problem involving the Clifford product instead
of the inner product (10.22). The reader should bear in mind that the Clifford
product performs the direct product between the spaces of different grade and its
result is represented by a multivector; thus the outputs of the CSVM are represented
by $y = y_s + y_{\sigma_1}\sigma_1 + y_{\sigma_2}\sigma_2 + \ldots + y_I I \in \{\pm 1 \pm \sigma_1 \pm \sigma_2 \ldots \pm I\}$.
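As a sketch of how the multivector output encodes classes: each of the m output components contributes the sign of one coefficient, so an m-component output distinguishes up to 2^m states (4 for complex numbers, 16 for quaternions). The encoding below is a hypothetical illustration of csign_m, not the chapter's exact implementation.

```python
import numpy as np

def csign(f):
    """Componentwise sign of a multivector output f, given as an array of
    blade coefficients [f_s, f_s1, ..., f_I]; the result lies in
    {+-1 +- s1 ... +- I}, i.e. one sign per component."""
    return np.where(np.asarray(f) >= 0, 1, -1)

def decode_class(f):
    """Map the sign pattern to an integer label: component u contributes
    bit u, so m components can address 2**m classes."""
    bits = csign(f) > 0
    return int(sum(int(b) << u for u, b in enumerate(bits)))

# complex-valued case (m = 2): four states
print(decode_class([0.8, -0.3]))   # signs (+, -) -> bits (1, 0) -> class 1
print(decode_class([-0.2, 0.9]))   # signs (-, +) -> bits (0, 1) -> class 2
```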

10.4 Nonlinear Clifford Support Vector Machines for Classification
For nonlinear Clifford-valued classification problems we require a Clifford-valued
kernel K(x, y). In order to fulfill the Mercer theorem we resort to a component-wise
Clifford-valued mapping

$$x \in G_n \xrightarrow{\phi} \Phi(x) = \Phi_s(x) + \Phi_{\sigma_1}(x)\sigma_1 + \Phi_{\sigma_2}(x)\sigma_2 + \ldots + \Phi_I(x) I \in G_n.$$

In general we build a Clifford kernel $K(x_m, x_j)$ by taking the Clifford product between
the reversion of $x_m$ and $x_j$ as follows

$$K(x_m, x_j) = \Phi(x_m)^{\dagger} \Phi(x_j), \qquad (10.38)$$

note that the kind of reversion operation $(\cdot)^{\dagger}$ of a multivector depends on the signature
of the involved geometric algebra $G_{p,q,r}$. Next, as an illustration, we present kernels
using different geometric algebras. According to the Mercer theorem, there exists
a mapping $u : G \rightarrow F$, which maps the multivectors $x \in G_n$ into the complex
Euclidean space: $x \xrightarrow{u} u(x) = u_r(x) + I u_I(x)$.

Complex-valued linear kernel function in $G_{1,0,0}$ (the center of this geometric algebra,
i.e. $\{s, I = \sigma_1\sigma_2\}$, is isomorphic with $\mathbb{C}$):

$$K(x_m, x_n) = u(x_m)^{\dagger} u(x_n) = (u(x_m)_s u(x_n)_s + u(x_m)_I u(x_n)_I) + I(u(x_m)_s u(x_n)_I - u(x_m)_I u(x_n)_s) = (k(x_m, x_n)_{ss} + k(x_m, x_n)_{II}) + I(k(x_m, x_n)_{Is} - k(x_m, x_n)_{sI}) = H_r + I H_i \qquad (10.39)$$

where $(x_s)_m, (x_s)_n, (x_I)_m, (x_I)_n$ are vectors of the individual components of the
complex numbers $(x)_m = (x_s)_m + I(x_I)_m \in G_{1,0,0}$ and $(x)_n = (x_s)_n + I(x_I)_n \in G_{1,0,0}$
respectively.
For the quaternion-valued Gabor kernel function, we use $i = \sigma_2\sigma_3$, $j = -\sigma_3\sigma_1$,
$k = \sigma_1\sigma_2$. The Gaussian window Gabor kernel function reads

$$K(x_m, x_n) = g(x_m, x_n)\, e^{-i w_0^T (x_m - x_n)} \qquad (10.40)$$

where the normalized Gaussian window function is given by

$$g(x_m, x_n) = \frac{1}{\sqrt{2\pi}\rho} \exp\left(-\frac{\|x_m - x_n\|^2}{2\rho^2}\right) \qquad (10.41)$$

and the variables w0 and xm − xn stand for the frequency and space domains
respectively.
Unlike the Hartley transform or the 2D complex Fourier this kernel function sep-
arates nicely the even and odd components of the involved signal, i.e.

K(xm , xn ) = K(xm , xn )s + K(xm , xn )σ2 σ3 + ...


+K(xm , xn )σ3 σ1 + K(xm , xn )σ1 σ2
= g(xm , xn )cos(wT0 xm )cos(wT0 xm ) + ...
+g(xm , xn )cos(wT0 xm )sin(wT0 xm )i + ...
+g(xm , xn )sin(wT0 xm )cos(wT0 xm ) j + ...
+g(xm , xn )sin(wT0 xm )sin(wT0 xm )k.

Since g(xm , xn ) fulfills the Mercer’s condition it is straightforward to prove that


k(xm , xn )u in the above equations satisfy these conditions as well.
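A numerical sketch of the kernel of (10.40)–(10.41), returning the four quaternion coefficients (scalar, i, j, k); the argument names and the return convention are assumptions for illustration.

```python
import numpy as np

def quat_gabor_kernel(xm, xn, w0, rho=1.0):
    """Quaternion Gabor kernel: Gaussian window (10.41) times the
    even/odd (cos/sin) phase products, returned as the coefficients of
    (1, i, j, k)."""
    xm, xn, w0 = (np.asarray(v, float) for v in (xm, xn, w0))
    d = xm - xn
    g = np.exp(-(d @ d) / (2.0 * rho ** 2)) / (np.sqrt(2.0 * np.pi) * rho)
    cm, sm = np.cos(w0 @ xm), np.sin(w0 @ xm)
    cn, sn = np.cos(w0 @ xn), np.sin(w0 @ xn)
    return (g * cm * cn, g * cm * sn, g * sm * cn, g * sm * sn)

k = quat_gabor_kernel([0.0, 0.0], [0.0, 0.0], w0=[1.0, 1.0])
# at xm == xn == 0 the sine terms vanish, leaving only the scalar part
```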
After defining these kernels, we can proceed with the formulation of the SVM conditions. We substitute the mapped data Φ(x) = ∑_{u=1}^{2^n} ⟨Φ(x)⟩u into the linear function f(x) = w†T Φ(x) + b. The problem can be stated similarly as in (10.22-10.26). In fact, we can replace the kernel function in (10.33) to accomplish the Wolfe dual programming and thereby obtain the kernel function group for nonlinear classification

Hs = [Ks(xm, xj)]m,j=1,...,l
Hσ1 = [Kσ1(xm, xj)]m,j=1,...,l
...
Hσn = [Kσn(xm, xj)]m,j=1,...,l
...
HI = [KI(xm, xj)]m,j=1,...,l.                               (10.42)

In the same way, we use the kernel functions to replace the dot product of the input data in (10.36). In general, the output function of the nonlinear Clifford SVM reads

y = csignm(f(x)) = csignm(w†T Φ(x) + b),                    (10.43)

where m stands for the state valency.
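The csign_m decision can be pictured as reading off the sign of each multivector component and treating the resulting sign pattern as a class index. The following sketch (our own construction; the chapter gives no code) does this for a quaternion-valued output, assuming all components are nonzero:

```python
import numpy as np

def csign(mv):
    """Component-wise sign of a multivector stored as a coefficient array."""
    return np.sign(np.asarray(mv, dtype=float))

def state_to_class(signs):
    """Interpret the +1/-1 pattern as bits, giving a class index in [0, 2**4 - 1]."""
    bits = (np.asarray(signs) > 0).astype(int)
    return int(sum(b << i for i, b in enumerate(bits)))
```

For example, `state_to_class(csign([0.5, -2.0, 1.0, -0.1]))` maps the pattern (+, -, +, -) to class 5; with four components there are 2⁴ = 16 such states.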

10.5 Clifford SVM for Regression


The representation of the data set for the case of Clifford SVM for regres-
sion is the same as for Clifford SVM for classification; we represent the data set
244 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

in a certain Clifford algebra Gn. Each ith data vector has multivector entries xi = [xi1, xi2, ..., xiD]T, where xij ∈ Gn and D is its dimension. Let (x1, y1), (x2, y2), ..., (xj, yj), ..., (xl, yl) be the training set of independently and identically distributed multivector-valued sample pairs, where each label is yi = yis + yiσ1 σ1 + yiσ2 σ2 + ... + yiI I, and the subindex s stands for the scalar part. The regression problem using multivectors is to find a multivector-valued function f(x) that has at most ε-deviation from the actually obtained targets yi ∈ Gn for all the training data and, at the same time, is as flat as possible. We will use a multivector-valued ε-insensitive loss function and arrive at the formulation of Vapnik [1]:

min (1/2) w†T w + C · ∑_{i,j} (ξij + ξ̃ij)
subject to
(yi − w†T xi − b)j ≤ ε + ξij                                (10.44)
(w†T xi + b − yi)j ≤ ε + ξ̃ij
ξij ≥ 0, ξ̃ij ≥ 0 for all i, j,
where w, x ∈ GnD, and (·)j extracts the scalar accompanying a multivector base. Next we proceed as in Section 10.3; since the expression for the orientation of the optimal hyperplane is the same as in (10.23), each of the wi is computed as follows:
ws = ∑_{j=1}^{l} ((αs)j − (α̃s)j) (xs)j,
wσ1 = ∑_{j=1}^{l} ((ασ1)j − (α̃σ1)j) (xσ1)j, ...,
wI = ∑_{j=1}^{l} ((αI)j − (α̃I)j) (xI)j.

We can now redefine the entries of the vector in (10.27); these are given by

aTs = [(αs1 − α̃s1), (αs2 − α̃s2), ..., (αsl − α̃sl)],
aTσ1 = [(ασ1 1 − α̃σ1 1), (ασ1 2 − α̃σ1 2), ..., (ασ1 l − α̃σ1 l)],
...                                                         (10.45)
aTI = [(αI1 − α̃I1), (αI2 − α̃I2), ..., (αIl − α̃Il)].

Now we can rewrite the Clifford product, as we did in (10.29 - 10.31), to get the primal problem as follows:

min (1/2) aT Ha + C · ∑_{i=1}^{l} (ξi + ξ̃i)
subject to
(w†T x + b − y)j ≤ (ε + ξ)j
(y − w†T x − b)j ≤ (ε + ξ̃)j
ξij ≥ 0, ξ̃ij ≥ 0 for all i, j.

Thereafter, we straightforwardly write the dual of this primal problem for solving the regression problem:

max −α̃T(ε − y) − αT(ε + y) − (1/2) aT Ha
subject to
∑_{j=1}^{l} (αsj − α̃sj) = 0, ∑_{j=1}^{l} (ασ1 j − α̃σ1 j) = 0, ...,
∑_{j=1}^{l} (αI j − α̃I j) = 0,
0 ≤ αis ≤ C, 0 ≤ αiσ1 ≤ C, ..., 0 ≤ αiσ1σ2 ≤ C, ..., 0 ≤ αiI ≤ C,
0 ≤ α̃is ≤ C, 0 ≤ α̃iσ1 ≤ C, ..., 0 ≤ α̃iσ1σ2 ≤ C, ..., 0 ≤ α̃iI ≤ C,
for i = 1, ..., l,                                          (10.46)

For nonlinear regression, similarly as explained in Subsection 10.4, we utilize a particular kernel for computing k(xm, xj) = Φ̃(xm)Φ(xj); again, this kind of conjugation operation (·)∗ of a multivector depends on the signature of the involved geometric algebra Gp,q,r. We can use the kernels described in Subsection 10.4.

10.6 Recurrent Clifford SVM


SVMs are very powerful for solving regression and classification tasks. They carry out predictions by linearly combining kernel basis functions. By mapping the input feature space to a higher-dimensional space, SVMs can linearly separate clusters by means of an optimal hyperplane. A rather limited way to apply existing SVMs to sequence prediction [? ?] or classification [12] is to build a training set either by transforming the sequential input to an input vector of some static domain (e.g., a frequency or phase representation, a Hidden Markov Model (HMM) [13, 14]), or by simple frequency counting of patterns, symbols or substrings, or by taking fixed time windows of k sequential values [10]. The window-based approaches, of course, fail if the temporal dependency exceeds the length of k steps. As for training HMMs with long sequences, unfortunately they run into numerous local minima [15, 16]. Suykens and Vandewalle [17] incorporate the dynamic equations into the primal problem for an SVM solution. The major disadvantage of this approach is that the problem is no longer convex, so there is no guarantee
of finding an optimal global solution. In none of these attempts has there been a recurrent SVM which learns tasks involving time lags of arbitrary length between important input events. However, a pioneering attempt using real-valued SVMs and neuroevolution for sequence prediction was made by Schmidhuber et al. [18]. Unfortunately, at present the research activity on recurrent SVMs is very scarce. We started to explore a way to build a CSVM-based recurrent system which profits from all the advantages of the CSVM: it helps to maintain convexity, it is MIMO, and it is suited to processing sequences with geometric characteristics. To do so, we decided to connect two processing modules in cascade: a Long Short-Term Memory (LSTM) [20] and a CSVM.
246 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

LSTM-CSVM is an Evolino- and Evoke-based system [18, 19]: the underlying idea of these systems is that two cascaded modules are needed: a robust module to process short- and long-time dependencies (the LSTM) and an optimization module to produce precise outputs (a CSVM, the Moore-Penrose pseudo-inverse method, or an SVM, respectively). The LSTM module addresses the disadvantage of having relevant pieces of information outside the history window, and it also avoids the "vanishing error" problem exhibited by algorithms like Back-Propagation Through Time (BPTT, e.g., Williams and Zipser 1992) or Real-Time Recurrent Learning (RTRL, e.g., Robinson and Fallside 1987)1. Meanwhile, the CSVM maps the internal activations of the first module to a set of precise outputs; again, the multivector output representation is exploited to implement a system with fewer processing units and therefore lower computational complexity.
LSTM-CSVM works as follows: a sequence of input vectors (u(0)...u(t)) is given
to the LSTM which in turn feeds the CSVM with the outputs of each of its memory
cells, see Fig. 10.1.

Fig. 10.1 LSTM-CSVM system

The CSVM is aimed at finding the expected nonlinear mapping of the training data. The input and output equations of Figure 10.1 are

φ(t) = f(W, u(t), u(t−1), ..., u(0)),

y(t) = b + ∑_{i=1}^{k} wi K(φ(t), φi(t)),                   (10.47)

where φ(t) = [ψ1, ψ2, ..., ψn]T ∈ Rn is the activation at time t of the n units of the LSTM; this serves as input to the CSVM, given the input vectors (u(0)...u(t)) and the weight matrix W. Since the LSTM is a recurrent net, the argument of the function f(·) represents the history of the input vectors.
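A schematic version of (10.47), with a Gaussian kernel standing in for the Clifford kernel and all names chosen by us, is:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel used here as a stand-in for the Clifford kernel K."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def csvm_output(phi_t, support_phis, weights, bias):
    """Eq. (10.47): y(t) = b + sum_i w_i K(phi(t), phi_i)."""
    return bias + sum(w * rbf(phi_t, p) for w, p in zip(weights, support_phis))
```

Here `phi_t` plays the role of the LSTM activation vector φ(t), and `support_phis` are the stored activations φi(t) of the support vectors.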

1 The reader can get more information about BPTT and RTRL-vanishing error versus
LSTM-constant error flow in [20].

First, the LSTM-CSVM system was trained using the conventional algorithm for the LSTM. Although the system learns, unfortunately it takes too long to find a suitable matrix W. Instead, propagating the training data through the LSTM-CSVM system, we evolved the rows of the matrix using the evolutionary algorithm known as Enforced Sub-Populations (ESP) [21]. This approach differs from the standard methods because, instead of evolving the complete set of net parameters, it evolves subpopulations of the LSTM memory cells. For the mutation of the chromosomes, ESP uses a Cauchy density function.
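The Cauchy mutation at the heart of the ESP step can be sketched as follows (our own minimal version; the name alpha for the Cauchy scale parameter is an assumption):

```python
import numpy as np

def cauchy_mutate(chromosome, alpha=1e-3, rng=None):
    """Perturb every gene of a chromosome with Cauchy-distributed noise,
    as ESP does when evolving the LSTM weight rows."""
    rng = np.random.default_rng() if rng is None else rng
    chromosome = np.asarray(chromosome, dtype=float)
    return chromosome + alpha * rng.standard_cauchy(chromosome.shape)
```

The heavy tails of the Cauchy density allow occasional large jumps that can escape local optima, while most perturbations remain small.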

10.7 Applications
In this section we present five interesting experiments. The first one shows multi-class classification using the CSVM on a simulated example. Here we also present the number of variables computed per approach and a time comparison between the CSVM and three real-valued SVM approaches to multi-class classification. The second concerns object multi-class classification with two types of training data: Phase a) artificial data and Phase b) real data obtained from a stereo vision system. We also compared the CSVM against MLPs (for multi-class classification). The third experiment presents a multi-class interpolation. The fourth and fifth include the experimental analysis of the recurrent CSVM.

10.7.1 3D Spiral: Nonlinear Classification Problem


We extended the well-known 2D spiral problem to 3D space. This experiment should test whether the CSVM is able to separate five 1D manifolds embedded in R³. In this application we used a quaternion-valued CSVM which works in G0,2,0 2, which allows us to have quaternion inputs and outputs; therefore, with one output quaternion we can represent up to 2⁴ = 16 classes. The functions were generated as follows:
fi(t) = [xi(t), yi(t), zi(t)]
= [zi cos(θ) sin(θ), zi sin(θ) sin(θ), zi cos(θ)], for i = 1, ..., 5.

To depict these vectors, they were normalized by 10. In Fig. 10.2 one can see that the problem is highly nonlinearly separable. The CSVM uses for training 50 input quaternions from each of the five functions; since these have three coordinates, we use simply
2 The dimension of this geometric algebra is 2² = 4.


Fig. 10.2 3D spiral with five classes. The marks represent the support multivectors found by
the CSVM

the bivector part of the quaternion, namely xi = xi(t)σ2σ3 + yi(t)σ3σ1 + zi(t)σ1σ2 ≡ [0, xi(t), yi(t), zi(t)]. The CSVM used the kernel given by (10.42). Note that the CSVM indeed manages to separate the five classes.
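For readers who want to reproduce a data set of this flavor, the following sketch generates five spiral-like classes encoded as pure-bivector quaternions [0, x, y, z]; the sampling of θ and the offsets zi are our guesses, since the chapter only gives the functional form:

```python
import numpy as np

def spiral_class(z_offset, n=50, turns=3.0):
    """Sample n points of one 3D spiral class and normalize by 10,
    returning rows [0, x, y, z] (the scalar part is unused)."""
    theta = np.linspace(0.1, turns * np.pi, n)
    z = z_offset + theta                      # separates the five classes
    x = z * np.cos(theta) * np.sin(theta)
    y = z * np.sin(theta) * np.sin(theta)
    zz = z * np.cos(theta)
    return np.stack([np.zeros(n), x, y, zz], axis=1) / 10.0

classes = [spiral_class(k) for k in range(5)]  # five 1D manifolds in R^3
```

Each class is a one-dimensional curve in R³, so a classifier must draw genuinely nonlinear boundaries to separate them.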

10.7.1.1 Comparisons Using 3D Spiral Example


According to [22], the most used methods for multi-class classification are: one-against-all [23], one-against-one [24], DAGSVM [26], and some methods that solve the multi-class problem in one step, known as all-together methods [27]. Table 10.1 shows a comparison of the number of variables computed per approach, including the CSVM.
The experiments shown in [22] indicate that "one-against-one and DAG methods are more suitable for practical use than the other methods". Of these, we have chosen the one-against-one method and the earliest implementation for SVM multi-class classification, the one-against-all approach, for comparison against our proposed CSVM. The comparisons were made using the 3D spiral toy example and the quaternion CSVM shown in the previous subsection. The number of classes was increased in each experiment; we started with K=3 classes and 50 training inputs for each class. Since the training inputs have three coordinates, we use simply the bivector part of the quaternion for the CSVM approach, namely xi = xi(t)σ2σ3 + yi(t)σ3σ1 + zi(t)σ1σ2 ≡ [0, xi(t), yi(t), zi(t)]; therefore the CSVM computes D ∗ N = 3 ∗ 150 = 450 variables. The one-against-all and one-against-one approaches compute 450 and 300 variables, respectively; however, the training times of the CSVM and one-against-one are very similar in the first experiment. Note that when we increase the number of classes, the performance of the CSVM is much better than that of the other approaches because the number of variables to compute is greatly reduced.
We improved the computational efficiency of all these algorithms by utilizing the decomposition method [28] and the shrinking technique [29]. We can see in Table 10.2 that the CSVM, using a quarter of the variables, is still faster, with around a quarter of the processing time of the other approaches. The classification performance of the four approaches is presented in Table 10.3. We used 50 and 20 vectors per class during training and test, respectively. We can see that the CSVM for classification has overall the best performance.

Table 10.1 Number of variables per approach

Approach  NQP  NVQP  TNV

CSVM  1  D*N  D*N
One-against-all  K  N  K*N
One-against-one  K(K-1)/2  2*N/K  N(K-1)
DAGSVM  K(K-1)/2  2*N/K  N(K-1)
All-together (considering all data at once)  1  K*N  K*N

NQP Number of quadratic problems to solve


NVQP Number of variables to compute per quadratic problem
TNV Total Number of Variables
D Training input data dimension
N Total number of training examples
K Number of classes
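The TNV column of Table 10.1 can be captured in a small helper (our own), which reproduces the counts used in the comparison below:

```python
def total_variables(approach, K, N, D):
    """Total number of variables (TNV) per multi-class approach, Table 10.1.
    K: number of classes, N: total training examples, D: input dimension."""
    return {
        "CSVM": D * N,
        "one-against-all": K * N,
        "one-against-one": N * (K - 1),
        "DAGSVM": N * (K - 1),
        "all-together": K * N,
    }[approach]
```

For K = 3 classes, N = 150 examples and D = 3, this gives 450 variables for the CSVM and one-against-all, and 300 for one-against-one, matching the figures quoted in the text; note that the CSVM count stays fixed at D*N as K grows.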

Table 10.2 Time training per approach (seconds)

Approach K=3, N=150 K=5, N=250 K=16, N=800


(Variables) (Variables) (Variables)
CSVM 0.07 0.987 10.07
C=1000 (450) (750) (3200)
One-against-all 0.11 8.54 131.24
(C, σ )=(1000,2−3) (450) (1250) (12800)
One-against-one 0.09 2.31 30.86
(C, σ )=(1000,2−2) (300) (1000) (12000)
DAGSVM 0.10 3.98 38.88
(C, σ )=(1000,2−3) (300) (1000) (12000)

K Number of classes; N Number of training examples (50 per class).
Used kernels K(xi, xj) = e^(−σ||xi−xj||²) with parameters taken from σ = {2, 2⁰, 2⁻¹, 2⁻², 2⁻³} and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.

Table 10.3 Percent of accuracy in training and test

Approach Ntrain=150 Ntrain=250 Ntrain=800


Ntest=60 Ntest=100 Ntest=320
K=3 K=5 K=16
CSVM  98.66  99.2  99.87
C=1000  (95.00)  (98.00)  (99.68)
One-against-all  96.00  98.00  99.75
(C, σ)=(1000, 2⁻³)  (90.00)  (96.00)  (99.06)
One-against-one  98.00  98.4  99.87
(C, σ)=(1000, 2⁻²)  (95.00)  (99.00)  (99.375)
DAGSVM  97.33  98.4  99.87
(C, σ)=(1000, 2⁻³)  (95.00)  (97.00)  (99.68)

K Number of classes; Ntrain = number of total training vectors; Ntest = number of test vectors. Percent accuracy in the training phase is given above; below, in brackets, the percent accuracy in the test phase.
Used kernels K(xi, xj) = e^(−σ||xi−xj||²) with parameters taken from σ = {2, 2⁰, 2⁻¹, 2⁻², 2⁻³} and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.

10.7.2 Object Recognition


In this subsection we show an application of the Clifford SVM for multi-class object classification. In the experiments shown here, we want to use only one CSVM with a quaternion as input and a quaternion as output, which allows us to have up to 2⁴ = 16 classes. Basically, we packed into a feature quaternion one 3D point (which lies on the surface of the object) and the magnitude of the distance between this point and the point which lies on the main axis of the object in the same level curve. Fig. 10.3 depicts the 4 features taken from the object:

Xi = δi s + xi σ2σ3 + yi σ3σ1 + zi σ1σ2                     (10.48)
≡ [δi, (xi, yi, zi)]T
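Packing the features of (10.48) is a one-liner; this sketch (names are ours) stores the distance δi in the scalar slot and the 3D point in the bivector slots:

```python
import numpy as np

def feature_quaternion(surface_point, axis_point):
    """Build [delta, x, y, z]: delta is the distance from the surface point
    to the point on the object's main axis at the same level curve."""
    p = np.asarray(surface_point, dtype=float)
    a = np.asarray(axis_point, dtype=float)
    return np.concatenate([[np.linalg.norm(p - a)], p])
```

For instance, a surface point (3, 4, 0) whose axis point is the origin yields the feature quaternion [5, 3, 4, 0].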

For each object we trained the CSVM using a set of several feature quaternions obtained from different level curves; that means that each object is represented by several feature quaternions and not only one. Because of this way of training the CSVM, the order in which the feature quaternions are shown to the CSVM is important: we begin to sample data from the bottom to the top of the objects, and we show the training and test data in this order to the CSVM. We processed the outputs using a counter that computes which class fires the most for each training or test set in


Fig. 10.3 Geometric characteristics of one training object: the magnitude δi and the 3D coordinates (xi, yi, zi) that build the feature vector [δi, (xi, yi, zi)]

order to decide which class the object belongs to, see Fig. 10.4. Note carefully that this experiment is in any case a challenge for any recognition algorithm, because the feature signature is sparse. We will show later that, using this kind of feature vectors, the CSVM's performance is superior to that of the MLP. Of course, if more time is spent improving the quality of the feature signature, the CSVM's performance will increase accordingly.

Fig. 10.4 After we get the outputs, they are accumulated using a counter to determine which class the object belongs to

It is important to say that all the objects (synthetic and real) were preprocessed in order to have a common center and the same scale, so our learning process can be seen as centering- and scale-invariant.
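The per-object voting described above amounts to a winner-take-all counter over the per-quaternion predictions, e.g.:

```python
from collections import Counter

def winner_class(per_quaternion_predictions):
    """Return the class that fires most often across the feature quaternions
    of one object (the counter of Fig. 10.4)."""
    return Counter(per_quaternion_predictions).most_common(1)[0][0]
```

This makes the final object label robust to a minority of misclassified feature quaternions along the level curves.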

10.7.2.1 Phase a) Synthetic Data


In the first phase of this experiment, we used training data obtained from synthetic objects; the training set is shown in Fig. 10.5. Note that we have six different objects, which means a six-class classification problem, and we solve it with only one CSVM, making use of its multi-output characteristic. In general, for the "one-versus-all" approach one needs n SVMs (one for each class). In contrast, the CSVM needs only one machine because its quaternion output allows up to 16 class outputs. For the input-data coding, we used a 3D point which is packed into

the σ2σ3, σ3σ1, σ1σ2 basis of the feature quaternion, and the magnitude was packed into the scalar part of the quaternion. Figure 10.6 shows the 3D points sampled from the objects. We compared the performance of the following approaches: the CSVM, a 4-7-6 MLP, and the real-valued SVM approaches one-against-one, one-against-all and DAGSVM. The results in Tables 10.4 and 10.5 show that the CSVM has better generalization and fewer training errors than the MLP approach and the real-valued SVM approaches. Note that all methods were sped up using the acceleration techniques [28, 29]. The authors think that the MLP presents more training and generalization errors because the way we represent the objects (as feature quaternion sets) makes the MLP get stuck in local minima very often during the learning phase, whereas the CSVM is guaranteed to find the optimal solution to the classification problem because it solves a convex quadratic problem with a global minimum. With respect to the real-valued SVM based approaches, the CSVM takes advantage of the Clifford product, which enhances the discriminatory power of the classifier itself, unlike the other approaches, which are based solely on inner products.


Fig. 10.5 Training synthetic object set

10.7.2.2 Phase b) Real Data


In this phase of the experiment we obtained the training data using our robot "Geometer", shown on the right side of Fig. 10.7. We took two stereoscopic views of each object: one frontal view and one view rotated by 180 degrees (w.r.t. the frontal view). After that, we applied the Harris corner detector to each view in order to obtain the object corners and then, with the stereo system, the 3D points (xi, yi, zi) which lie on the object surface, and to calculate the magnitude δi for the feature quaternion (10.48). This process

Table 10.4 Object-recognition performance in percent (%) during training using synthetic
data

Object NTS CSVM MLP 1-vs-all 1-vs-1 DAGSVM


C=1200 a) b) c)
C 86 93.02 48.83 87.2 90.69 90.69
S 84 89.28 46.42 89.28 90.47 90.47
F 84 85.71 40.47 83.33 84.52 83.33
W 86 91.86 46.51 90.69 91.86 93.02
D 80 93.75 50.00 87.5 91.25 90.00
U 84 86.90 48.80 82.14 83.33 84.52
C=cylinder, S=sphere, F=fountain, W=worm, D=Diamond,
U=cube
NTS= Number of Training Vectors.
Used kernels K(xi, xj) = e^(−σ||xi−xj||²) with parameters taken from σ = {2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵} and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these σ × C combinations, the best result was selected for each approach: a) (2⁻⁴, 1500), b) (2⁻³, 1200), c) (2⁻⁴, 1400).

Table 10.5 Object-recognition performance in percent (%) during test using synthetic data

Object NTS CSVM MLP 1-vs-all 1-vs-1 DAGSVM


C=1200 a) b) c)
C 52 94.23 80.76 90.38 96.15 96.15
S 66 87.87 45.45 83.33 84.84 86.36
F 66 90.90 51.51 83.33 86.36 84.84
W 66 89.39 57.57 86.36 83.33 86.36
D 58 93.10 55.17 93.10 93.10 93.10
U 66 92.42 46.96 89.39 90.90 89.39
C=cylinder, S=sphere, F=fountain, W=worm, D=Diamond,
U=cube
NTS= Number of Training Vectors.
Used kernels K(xi, xj) = e^(−σ||xi−xj||²) with parameters taken from σ = {2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵} and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these σ × C combinations, the best result was selected for each approach: a) (2⁻⁴, 1500), b) (2⁻³, 1200), c) (2⁻⁴, 1400).

Fig. 10.6 Sampling of the training synthetic object set

is illustrated in Fig. 10.7, and the whole training object set is shown in Fig. 10.8. We take the non-normalized 3D point for the bivector basis σ23, σ31, σ12 of the feature quaternion in (10.48).

Fig. 10.7 Left: sampling view of a real object; a big white cross is used for the depiction. Right: stereo vision system in the experiment environment

After the training, we tested with a set of feature quaternions that the machine did not see during its training, and we used the "winner-take-all" approach to decide which class the object belongs to. The results of the training and test are shown in Table 10.6. We trained the CSVM with an equal number of training data for each object, that is, 90 feature quaternions per object, but we tested with different numbers of data per object. Note that we have two pairs of objects that are very similar to each other. The first pair is composed of the half sphere shown in Fig. 10.8.c) and the rock in Fig. 10.8.d); in spite of these similarities, we got very good accuracy percentages in the

Fig. 10.8 Training real object set, stereo pair images. We include only the frontal views

test phase for both objects: 65.9% for the half sphere and 84% for the rock. We think we got better results for the rock because this object has a lot of texture, which produces many corners that in turn capture the irregularities better; therefore we have more test feature quaternions for the rock than for the half sphere (75 against 44, respectively). The second pair of similar objects is shown in Fig. 10.8.e) and Fig. 10.8.f): these are two identical plastic bottles of juice, but one of them (Fig. 10.8.f)) is burned; that makes the difference between them and gives the CSVM enough distinguishing features to make two object classes. As shown in table 10.6, we got 60% correctly classified test samples for the bottle in Fig. 10.8.e) against 61% for the burned bottle in Fig. 10.8.f). The lower recognition rates for the last objects (Fig. 10.8 c), e) and f)) arise because the CSVM confuses the classes a bit, due to the fact that the feature vectors are not large and rich enough.

Table 10.6 Experimental Results using real data

Object NTS NES CTS %


Cube 90 50 38 76.00
Prism 90 43 32 74.42
Half sphere 90 44 29 65.90
Rock 90 75 63 84.00
Plastic bottle 1 90 65 39 60.00
Plastic bottle 2 90 67 41 61.20
NTS = number of training samples; NES = number of test samples; CTS = number of correctly classified test samples; % = percentage of correctly classified test samples.

10.7.3 Multi-case Interpolation


A real-valued SVM can carry out regression and interpolation for multiple inputs and one real output. Remarkably, a Clifford-valued SVM can have multiple inputs and 2ⁿ outputs for an n-dimensional space Rⁿ. For handling regression we use 1.0 > ε > 0, where the diameter of the tube surrounding the optimal hyperplane is twice ε. For the case of interpolation we use ε = 0. We have chosen an interesting task where we use a CSVM for interpolation in order to code a certain kind of behavior we want a visually guided robot to perform. The robot should autonomously draw a complicated 2D pattern. This capacity should be coded internally in Long Term Memory (LTM), so that the robot reacts immediately without the need for reasoning, much as a capable person, for example a tennis player or a tango dancer, reacts in milliseconds with incredible precision to accomplish a very difficult task. For our purpose we trained, off line, a CSVM using two
real-valued functions. The CSVM used the geometric algebra G₃⁺ (the quaternion algebra), with two inputs (two components of the quaternion input) and two outputs (two components of the quaternion output). The first input u and first output x coded the relation x = a sin(3u) cos(u) for one axis. The second input v and second output y coded the relation y = a sin(3v) sin(v) for the other axis, see Fig. 10.9.a), b). The 2D pattern can be drawn using the 50 points generated by the functions for x and y, see Fig. 10.9.c). We tested whether the CSVM could interpolate well enough using 100 and 400 previously unseen input tuples {u, v}, see Fig. 10.9.d), e) respectively. Once the CSVM was trained, we incorporated it as part of the LTM of the visually guided robot shown in Fig. 10.9.f). To carry out its task, the robot called the CSVM for a sequence of input patterns. The robot was able to draw the desired 2D pattern, as we see in Fig. 10.10.a)-d). The reader should bear in mind that this experiment was designed using the equation of a standard function in order to have a ground truth. Anyhow, our algorithm should also be able to learn 3D curves which do not have explicit equations.
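The training pattern itself is easy to regenerate; with u = v, the two relations trace a three-petaled rose. A sketch, with the sampling interval assumed by us:

```python
import numpy as np

def pattern_points(a=1.0, n=50):
    """Sample the 2D pattern x = a sin(3u) cos(u), y = a sin(3u) sin(u)."""
    u = np.linspace(0.0, np.pi, n)
    x = a * np.sin(3.0 * u) * np.cos(u)
    y = a * np.sin(3.0 * u) * np.sin(u)
    return x, y
```

In polar form this is simply r = a sin(3u), so every sampled point lies within radius a of the origin.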


Fig. 10.9 a) and b) Continuous curves of training output data for axes x and y (50 points).
c) 2D result by testing with 400 input data. d) Experiment environment

Fig. 10.10 a), b), c) Image sequence while the robot is drawing. d) The robot's drawing: result of testing with 400 input data

10.7.4 Experiments Using Recurrent CSVM


In this section we first analyze the performance of the recurrent CSVM against state-of-the-art algorithms on a time-series problem. Then, in a second experiment, we apply the recurrent CSVM to tackle a partially observable problem in robotics.

10.7.4.1 Time Series


We utilized the data of water levels of the Venice Lagoon during the periods from 1980 to 1989 and from 1990 to 1995.3 The recurrent CSVM was trained with the first 400 series values. The LSTM module was evolved with four memory cells during 100 generations, using the Cauchy parameter α = 10⁻³. The CSVM was trained using as inputs the output values of these four memory cells. The achieved training error was 0.0019, and the recurrent CSVM was tested with 600 steps. The system was able to predict 600 steps of the series in advance. Figure 10.11.a) shows the prediction performance of the recurrent CSVM on the training data, and Figure 10.11.b) depicts the results of predictions using 600 unforeseen test values. In the figures the ordinate's range is [0..1].

Fig. 10.11 a) Time series of the Venice Lagoon, training. b) Recall data. Thick line (in red): real data; thin line: values predicted by the LSTM-CSVM
3 A. Tomasin, CNR-ISDMG Universita Ca’Foscari, Venice.

In the next test, we employed the Mackey-Glass time series, which is commonly used for testing the generalization and prediction ability of an algorithm. The series is generated by the following differential equation:

ẏ(t) = α y(t − τ) / (1 + y(t − τ)^β) − γ y(t),              (10.49)

where the parameters are usually set as α = 0.2, β = 10 and γ = 0.1. This equation is chaotic when the delay is τ > 16.8. We select as delay the most commonly used value of τ = 17. The task is to predict the series value y[t + P] after the delay by using the previous points y[t], y[t − 6], y[t − 12], y[t − 18]. For P = 50 sampled values, it is expected that the algorithm learns the four-dimensional function y(t) = f(y[t − 50], y[t − 50 − 6], y[t − 50 − 12], y[t − 50 − 18]).
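For reference, the series of (10.49) can be generated by simple Euler integration; the constant initial history of 1.2 and the step dt = 1 are our assumptions:

```python
import numpy as np

def mackey_glass(n=3000, tau=17, alpha=0.2, beta=10.0, gamma=0.1, dt=1.0):
    """Generate n values of the Mackey-Glass series by Euler integration
    of Eq. (10.49); chaotic for the usual parameters once tau > 16.8."""
    y = np.empty(n + tau + 1)
    y[: tau + 1] = 1.2                   # assumed constant initial history
    for t in range(tau + 1, n + tau + 1):
        y_del = y[t - 1 - tau]           # delayed value y(t - 1 - tau)
        y[t] = y[t - 1] + dt * (alpha * y_del / (1.0 + y_del ** beta) - gamma * y[t - 1])
    return y[tau + 1:]
```

A finer integration step (or a higher-order scheme) gives a smoother series; the Euler version above is merely the simplest way to obtain training data of this kind.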
The LSTM-CSVM was trained with the first 3000 values of the series using P = 100. The LSTM module, with 4 memory cells, was evolved with a Cauchy parameter α = 10⁻⁵ over 150 generations. The "Echo state" approach was trained with 1000 neurons and achieved a mean square error of 10⁻⁴, while the Evolino system achieved an error of 1.9 × 10⁻³ with 30 cells evolved over 200 generations [19, 30]. It has been reported [31] that using an LSTM the minimum error achieved was 0.1184, using the same number of 4 neurons as in our LSTM-CSVM.

Table 10.7 Time series Mackey-Glass

Approach  Units  Generations  Error

Echo state  1000  200  10⁻⁴
Evolino  30  200  1.9 × 10⁻³
LSTM  4  --  0.1184
LSTM-CSVM  4 (plus CSVM)  150  0.011

Table 10.7 shows a summary of the comparison results. Here we note that the LSTM alone has poorer performance than the LSTM-CSVM, showing that the CSVM clearly improves the prediction precision. Note that for this complex time series, as opposed to the other two approaches (the Echo state approach and Evolino), the LSTM-CSVM uses a lower number of neurons and requires fewer generations during training for an acceptable error of 0.011.

10.7.4.2 Robot Navigation in Discrete Labyrinths


Finally, we utilize the LSTM-CSVM with reinforcement learning in a task of a robotic perception-action system. The robot system comprises a stereoscopic camera, a 6-D.O.F. robot arm and a 4-finger Barrett hand. The task consisted of moving the robot hand through a real 2D labyrinth. This was built using 10 blocks of 10 cm height each. To enhance the visibility, their top faces were painted red, which facilitated the segmentation of the blocks by the stereoscopic cameras. The stereoscopic system took images from an angle of 45 degrees; for that, we needed to calibrate the cameras and correct the camera views as if they were oriented perpendicularly above the labyrinth. With this information we had a complete 3D view from above. Using a color segmentation algorithm, we obtained the vector coordinates of the block corners. These observation vectors were then fed to the LSTM-CSVM.
The architecture of the LSTM-CSVM with reinforcement learning and the training were the same as in the previous application. The differences with the simulated experiment were: i) the 3D vectors of the block corners were obtained by the stereoscopic camera (the blocks build a 2D labyrinth), ii) the robot actions were hand movements through the 2D labyrinth, and iii) the length of this real labyrinth was smaller than the previous simulated one. We had 4 different labyrinths; each was 10 blocks long.
The evolution of the system consisted of 50 generations using a Cauchy noise parameter of α = 10⁻⁴. The CSVM module is fed with a vector of the outputs of the last 4 memory cells of the LSTM. The 4 outputs of the CSVM represent the 4 different actions to be carried out during the navigation through the labyrinth. After each generation, the best net was kept, and the task was considered fulfilled with a perfect reward of 4.0. The four possible actions of the system are robot hand movements of 10 cm length towards left, right, back and forth. The initial position of the robot arm was located at the entry of a labyrinth. In all the labyrinths we exploited the internal state (support state), i.e. the coordinates of the exit, which was the same for all cases.


Fig. 10.12 Training labyrinths 1 and 2; recall labyrinths 3 and 4. (Third column) The robot
hand is at the entry of the labyrinth holding a plastic object. (Fourth column) Position of
the hand marked with a cross
260 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

Figure 10.12 shows the four labyrinths. The images in the first and third
columns were obtained by the stereoscopic system. The images in the second and
fourth columns were obtained after perspective correction and color
segmentation. Labyrinths 1 and 2 were used for training, whereas labyrinths 3
and 4 were used for recall.
The third and fourth columns in Figure 10.12 show the agent at the beginning of
the labyrinths. In each labyrinth, only one trajectory was successful (reward
4.0). The training and the test were done offline; the robot then had to follow
the action vectors computed by the LSTM-CSVM system.

10.8 Conclusions
This chapter generalizes the real-valued SVM to the Clifford-valued SVM, which
is used for classification, regression, and recurrence. The CSVM accepts
multiple multivector inputs and multivector outputs, like a MIMO architecture,
which allows us to build multi-class applications. We can use the CSVM over
complex, quaternion, or hypercomplex numbers according to our needs. The
application section presents experiments in pattern recognition and visually
guided robotics which illustrate the power of the algorithms and help the reader
understand the Clifford SVM and use it in various tasks of complex and
quaternion signal and image processing, pattern recognition, and computer vision
using high-dimensional geometric primitives. This generalization appears
promising, particularly in geometric computing and its applications, such as
graphics, augmented reality, and robot vision.

References
1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
2. Burges, C.J.C.: A tutorial on Support Vector Machines for Pattern Recognition. In:
Knowledge Discovery and Data Mining, vol. 2, pp. 1–43. Kluwer Academic Publish-
ers, Dordrecht (1998)
3. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An Introduction to Kernel-
Based Learning Algorithms. IEEE Trans. on Neural Networks 12(2), 181–202 (2001)
4. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learn-
ing methods. Cambridge University Press, Cambridge (2000)
5. Hestenes, D., Li, H., Rockwood, A.: New algebraic tools for classical geometry. In: Som-
mer, G. (ed.) Geometric Computing with Clifford Algebras. Springer, Heidelberg (2001)
6. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines, Technical Report
No. 1043, University of Wisconsin, Department of Statistics, pp. 10–35 (2001)
7. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In:
Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN),
pp. 185–201 (1999)
8. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Geometric Preprocess-
ing, geometric feedforward neural networks and Clifford support vector machines for
visual learning. Journal Neurocomputing 67, 54–105 (2005)
9. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Recurrent Clifford Sup-
port Machines. In: Proceedings IEEE World Congress on Computational Intelligence,
Hong-Kong (2008)
10. Mukherjee, S., Osuna, E., Girosi, F.: Nonlinear prediction of chaotic time series using a
support vector machine. In: Principe, J., Giles, L., Morgan, N., Wilson, E. (eds.) Neural
Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, New
York, pp. 511–520 (1997)
11. Müller, K.-R., Smola, A.J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.N.:
Predicting time series with support vector machines. In: Gerstner, W., Hasler, M., Ger-
mond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–1004. Springer,
Heidelberg (1997)
12. Salomon, J., King, S., Osborne, M.: Framewise phone classification using support vector
machines. In: Proc. Int. Conference on Spoken Language Processing, Denver (2002)
13. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector machines. In:
Proc. Int. Conference on Machine Learning (2003)
14. Jaakkola, T.S., Haussler, D.: Exploiting generative models in discriminative classifiers.
In: Proc. of the Conference on Advances in Neural Information Systems II, Cambridge,
pp. 487–493 (1998)
15. Bengio, Y., Frasconi, P.: Diffusion of credit in Markovian models. In: Tesauro, G.,
Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Systems 14. MIT
Press, Cambridge (2002)
16. Hochreiter, S., Mozer, M.: A discrete probabilistic memory for discovering dependencies
in time. In: Int. Conference on Neural Networks, pp. 661–668 (2001)
17. Suykens, J.A.K., Vanderwalle, J.: Recurrent least squares support vector machines. IEEE
Transactions on Circuits and Systems-I 47, 1109–1114 (2000)
18. Schmidhuber, J., Gagliolo, M., Wierstra, D., Gomez, F.: Recurrent Support Vector Ma-
chines, Technical Report, no. IDSIA 19-05 (2005)
19. Schmidhuber, J., Wierstra, D., Gómez, F.J.: Hybrid neuroevolution optimal linear search
for sequence prediction. In: Kaufman, M. (ed.) Proceedings of the 19th International
Joint Conference on Artificial Intelligence, IJCAI, pp. 853–858 (2005)
20. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory, Technical Report FKI-207-
95 (1996)
21. Gómez, F.J., Miikkulainen, R.: Active guidance for a finless rocket using neuroevolution.
In: Proc. GECCO, pp. 2084–2095 (2003)
22. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class Support Vector Machines.
Technical report, National Taiwan University, Taiwan (2001)
23. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L.Y., Muller, U.,
Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study
in handwritten digit recognition. In: International Conference on Pattern Recognition,
pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994)
24. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise proce-
dure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing:
Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
25. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges,
C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, pp.
255–268. MIT Press, Cambridge (1999)
26. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classifi-
cation. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547–553.
MIT Press, Cambridge (2000)
27. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-
TR-98-04, Royal Holloway, University of London, Egham (1998)
28. Hsu, C.W., Lin, C.J.: A simple decomposition method for Support Vector Machines.
Machine Learning 46, 291–314 (2002)
29. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges,
C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods-Support Vector Learning. MIT
Press, Cambridge (1998)
30. Jaeger, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in
wireless communication. Science 304, 78–80 (2004)
31. Gers, F.A., Eck, D., Schmidhuber, J.: Applying LSTM to time series predictable through
time-window approaches. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001.
LNCS, vol. 2130, pp. 669–685. Springer, Heidelberg (2001)
Chapter 11
A Classification Method Based on Principal
Component Analysis and Differential Evolution
Algorithm Applied for Prediction Diagnosis
from Clinical EMR Heart Data Sets

Pasi Luukka and Jouni Lampinen

Abstract. In this article we study a classification method that first
preprocesses the data using principal component analysis and then uses the
compressed data in the actual classification process, which is based on the
differential evolution algorithm, an evolutionary optimization algorithm. The
method is applied to prediction diagnosis from clinical data sets of patients
with a chief complaint of chest pain, using classical Electronic Medical Record
(EMR) heart data sets. For experimentation we used a set of five frequently
applied benchmark data sets: the Cleveland, Hungarian, Long Beach, Switzerland,
and Statlog data sets. These data sets contain demographic properties, clinical
symptoms, clinical findings, laboratory test results, specific
electrocardiography (ECG) results pertaining to angina and coronary infarction,
etc. In other words, they are classical EMR data pertaining to the evaluation of
a chest pain patient and ruling out angina and/or Coronary Artery Disease
(CAD). The prediction diagnosis results with the proposed classification
approach were found to be promisingly accurate. For example, the Switzerland
data set was classified with 94.5% ± 0.4% accuracy. Combining all these data
sets resulted in a classification accuracy of 82.0% ± 0.5%. We compared the
results of the proposed method with the corresponding results of other methods
reported in the literature that have demonstrated relatively high classification
performance on this problem. Depending on the case, the results of the proposed
method were on a par with the best compared methods, or outperformed their
Pasi Luukka
Laboratory of Applied Mathematics, Lappeenranta University of Technology,
P.O. Box 20, FIN-53851 Lappeenranta, Finland
e-mail: pasi.luukka@lut.fi
Jouni Lampinen
Department of Computer Science, University of Vaasa, P.O. Box 700,
FI-65101 Vaasa, Finland
e-mail: jouni.lampinen@uwasa.fi

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 263–283.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
264 P. Luukka and J. Lampinen

classification accuracy clearly. In general, the results suggest that the
proposed method has potential for this task.

11.1 Introduction
Many real-world data sets are inevitably contaminated with noise. Noise can be
defined as random error or variance in a measured variable [13]. Data analysis
is almost always burdened with uncertainty of different kinds. There are several
techniques for dealing with noisy data [7].
A major problem in mining scientific data sets is that the data is often
high-dimensional. In many cases a large number of features represents each
object. One problem is that the computational time of pattern recognition
algorithms can become prohibitive when the number of dimensions grows large.
This can be a severe problem, especially when some of the features are not
discriminatory. Besides the computational cost, irrelevant features may also
reduce the accuracy of some algorithms.
To address this problem of high dimensionality, a common approach is to identify
the most important features associated with an object, so that further processing can
be simplified without compromising the quality of the final results. There are several
different ways in which the dimension of a problem can be reduced. The simplest
approach is to identify important attributes based on the input from domain experts.
Another commonly used approach is Principal Component Analysis (PCA) [19],
which defines new attributes (principal components or PCs) as mutually-orthogonal
linear combinations of the original attributes. For many data sets, it is sufficient to
consider only the first few PCs, thus reducing the dimension. However, for some
data sets, PCA does not provide a satisfactory representation: mutually
orthogonal linear combinations are not always the best way to define new
attributes, and, e.g., nonlinear combinations sometimes need to be considered.
The problem of dealing with high-dimensional data is both difficult and subtle.
The information loss caused by these methods is also sometimes a problem.
One of the latest methods in evolutionary computation is the differential
evolution (DE) algorithm [30]. In this paper we examine the applicability of a
classification method, in which the data is first preprocessed with PCA and the
resulting data is then classified with a DE classifier, to the diagnosis of
heart disease. In the literature there are several papers in which evolutionary
computation research has addressed the theory and practice of classifier
systems [4], [16], [17], [18], [31], [35], [10]. The differential evolution
algorithm has been studied in unsupervised learning problems, which can in a
sense be recast as classification problems, in [26], [11]. DE was also combined
with artificial neural networks in [1] for the diagnosis of breast cancer. It
has also been used to tune classifier parameter values in [12], and in a
similarity classifier [23] to tune the parameters of similarity measures.
11 A Classification Method Based on PCA and DE 265

Here we propose a method which first preprocesses the data using PCA and then
classifies the processed data using a differential evolution classification
method. The differential evolution algorithm is applied to find an optimal class
vector to represent each class; a sample is then classified by comparing it with
the class vectors. In addition, DE is also applied to determine the value of a
distance parameter that we use in making the final classification decision.
One advantage of this procedure is that we reduce the dimensionality, and hence
the computational cost, which would otherwise be intolerably high, especially
for high-dimensional data sets. Another advantage is that we filter out noise,
which improves the creation of the class vector for each class in the
classifier. The class vectors are optimized using the DE algorithm. With this
procedure we also find the optimal dimension for these data sets. Combining the
search for the best reduced dimension, the filtering of noise from the data, and
the optimization of the class vectors and the required parameters yields a more
accurate solution to the problem at hand.
The data sets for empirical evaluation of the proposed approach were taken from
the UCI Machine Learning Repository [25]. The classifier and the preprocessing
methods were implemented with MATLAB™ software.
From the optimization and modelling point of view, the classification problem
under investigation can be divided into two parts: the classification model and
the optimization approach applied for fitting (or learning) the model.
Generally, a multipurpose classifier can be viewed as a scalable and learnable
model that can be fitted to a particular data set by scaling it to the data
dimensionality and by optimizing a set of model parameters to maximize the
classification accuracy. For the optimization, the classification accuracy over
the learning set may simply serve as the objective function value to be
maximized. Alternatively, the optimization problem can be formulated as a
minimization task, as we did here, where the number of misclassified samples is
to be minimized. In the literature, mostly linear or nonlinear local
optimization approaches have been applied for solving the actual classifier
model optimization problem, or approaches that can be viewed as such. This is
the most common approach despite the fact that the underlying optimization
problem is a global optimization problem. For example, the weight set of a
feed-forward neural network classifier is typically optimized with a
gradient-descent-based local optimizer, or alternatively by some other local
optimizer such as the Levenberg-Marquardt algorithm. This kind of usage of
limited-capacity optimizers for fitting the classification model limits the
achievable classification accuracy in two ways. First, the model must be
restricted so that local optimizers can be applied to fit it. This means that
only very moderately multimodal classification models can be applied, and owing
to this modelling limitation, the classification capability is limited
correspondingly. Secondly, if a local optimizer is applied to optimize (to fit
or to learn) even a moderately multimodal classification model, it is likely to
get trapped in a local optimum, a suboptimal solution. Thereby, the only way to
obtain classifier models with a higher modelling capacity, and also to get the
full capacity out of current multimodal classification models, is to apply global
optimization for fitting the classification models to the data to be classified.
For example, in the case of a nonlinear feed-forward neural network classifier,
the model is clearly multimodal, but it is practically always fitted by applying
a local optimizer that is capable of providing only locally optimal solutions.
Thus, we consider that applying global optimization instead of local
optimization is an important fundamental issue that is currently severely
constraining the further development of classifiers. The capabilities of
currently used local optimizers limit the selection of applicable classifier
models, and the capabilities of currently used models that include multimodal
properties are likewise limited by the capabilities of the optimizers applied to
fit them to the data.
Based on the above considerations, our basic motivation for applying a global
optimizer for learning the applied classifier model comes from the fact that
typically local (nonlinear) optimizers have been applied for this purpose,
despite the underlying optimization problem actually being a multimodal global
optimization problem, where a local optimizer should be expected to become
trapped in a locally suboptimal solution. The advantage of our proposed method
is that, since DE does not get trapped in a local minimum, we can expect it to
find better solutions than those found in the nearest local minimum.
Another motivation was that we also wished to optimize the parameter p of the
Minkowski distance metric (see Section 11.3). In practice, that means increased
nonlinearity and increased multimodality of the classification model, resulting
in more locally optimal points in the search space, where a local optimizer
would be even more likely to get trapped. Practically, optimizing p successfully
requires the use of an effective global optimizer, since local optimizers are
unlikely to provide even an acceptably good suboptimal solution any more. With a
global optimizer, in contrast, optimization of p becomes possible. A twofold
advantage was expected from this. First, by optimizing the value of p
systematically, instead of selecting it a priori by trial and error as before, a
higher classification accuracy may be reached. Secondly, the selection of the
value of p can be done automatically this way, and laborious trial-and-error
experimentation by the user is not needed at all. Furthermore, the potential for
further developments is increased. Local optimization approaches severely limit
the selection of the classifier models to be used, and the possible problem
formulations for the classifier model optimization task become limited, too.
Simply put, local optimizers can fit, or learn, only classifier models for which
trapping in a locally suboptimal solution is not a major problem, while global
optimizers have no such fundamental limitations. For example, the range of
possible class membership functions can be extended to those requiring global
optimization (due to increased nonlinearity and multimodality), which cannot be
handled any more by simple local optimizers, even nonlinear ones. In addition,
we remark that we have not yet fully utilized the further development
capabilities provided by our global optimization approach. For example, even
more difficult optimization problem settings are now within reach, and
differential evolution has good capabilities for multi-objective and
multi-constrained nonlinear optimization, which provides further possibilities
for our future developments.
11.2 Heart Data Sets


The heart data sets that we applied for experimentation were all taken from
[25], where they are freely available. They all contain 13 attributes (which
have been extracted from a larger set of 75). Information about the attributes
can be found in Table 11.1, and the basic properties of the data sets are
summarized in Table 11.2. Regarding attribute types: attributes 1, 4, 5, 8, 10,
and 12 are real-valued, attribute 11 is ordered, attributes 2, 6, and 9 are
binary, and attributes 3, 7, and 13 are nominal. The variable to be predicted is
the absence or presence of heart disease. The data sets were collected in
different locations, and the principal investigators responsible for the data
collection were: 1) Andras Janosi, Hungarian Institute of Cardiology; 2) William
Steinbrunn, University Hospital, Zurich; 3) Matthias Pfisterer, University
Hospital, Basel; 4) Robert Detrano, V.A. Medical Center, Long Beach, and
Cleveland Clinic Foundation. The donor of the Statlog data set was Ross D. King,
University of Strathclyde, Glasgow. The Statlog data set is a slightly modified
version of the Cleveland data set (using only 270 of the 303 samples).

Table 11.1 Heart data sets attribute information

no Attribute
1. Age
2. Sex
3. Chest pain type (4 values)
4. Resting blood pressure
5. Serum cholesterol in mg/dl
6. Fasting blood sugar > 120 mg/dl
7. Resting electrocardiographic results (values 0,1,2)
8. Maximum heart rate achieved
9. Exercise induced angina
10. Oldpeak = ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Table 11.2 Test data sets and their properties

Name Nb. classes Dim Nb. cases


Heart-Cleveland 2 13 303
Heart-Hungarian 2 13 294
Heart-Long-Beach-va 2 13 200
Heart-Switzerland 2 13 123
Heart-Statlog 2 13 270
11.3 Classification Method


The heart data sets were classified by first preprocessing the data using the
PCA algorithm and then classifying the resulting data using a classification
method based on differential evolution. Below we first explain the principal
component analysis method in more detail, then the classification method based
on differential evolution, and after this we give a more thorough description of
the differential evolution algorithm.

11.3.1 Dimension Reduction Using Principal Component Analysis

High-dimensional data sets present many mathematical challenges as well as some
opportunities, and are bound to give rise to new theoretical developments [7]. One
of the problems with high-dimensional data sets is that, in many cases, not all the
measured variables are ”important” for understanding the underlying phenomena of
interest. In mathematical terms, the problem we investigate in dimension reduction
can be stated as follows: given the r-dimensional random variable x = (x1 , ..., xr )T ,
find a lower dimensional representation of it, y = (y1 , ..., yk )T with k < r, that cap-
tures the content in the original data, according to some criteria. The components of
y are sometimes called the hidden components. Different fields use different names
for r multivariate vectors: the term ”variable” is mostly used in statistics, while
”feature” and ”attribute” are alternatives commonly used in the computer science
and machine learning literature.
PCA [19] is the best linear dimension reduction technique in the mean-square
error sense. In various fields, it is also known as the singular value decomposition
(SVD), the Karhunen-Loeve transform, the Hotelling transform, and the empirical
orthogonal function (EOF) method.
Let x1 , ..., xn be the n r-dimensional real vectors constituting the data set.
In PCA the data is first centered:

\[ \frac{1}{n}\sum_{p=1}^{n} x_p = 0 \tag{11.1} \]

PCA attempts to find a k-dimensional subspace L of R^r such that the orthogonal
projections P_L x_p of the n points on L have maximal variance. If L is the line
spanned by a unit vector u, the projection of x ∈ R^r on L is

\[ P_L x = (u' x)\, u \tag{11.2} \]

where the prime denotes transposition. The variance of the data in the direction
of L is therefore

\[ \frac{1}{n}\sum_{p=1}^{n} (u' x_p)^2
   = \frac{1}{n}\sum_{p=1}^{n} u' x_p x_p' u
   = u' \Big( \frac{1}{n}\sum_{p=1}^{n} x_p x_p' \Big) u
   = u' S u \tag{11.3} \]
where S is the sample covariance matrix of the data. PCA thus looks for the
vector u* which maximizes u'Su under the constraint ||u|| = 1. It is easy to
show that the solution is the normalized eigenvector u1 of S associated with its
largest eigenvalue λ1 , and

\[ u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1 \tag{11.4} \]
This is then extended to find the k-dimensional subspace L on which the
projected points P_L x_p have maximal variance. The lines spanned by the
eigenvectors u_j are called the principal axes of the data, and the k new
features y_j = u_j' x, defined by the coordinates of x along the principal axes,
are called principal components. The vector y_p of principal components for each
initial pattern vector x_p may easily be computed in matrix form as
y_p = U_k' x_p , where U_k = [u_1 , ..., u_k ] is the r × k matrix having the k
normalized eigenvectors of S as its columns.
PCA can be used in classification problems to display data in the form of infor-
mative plots. The score values have the same properties as the weighted averages,
i.e., they are not sensitive to random noise but show the processes that affect several
variables simultaneously in a systematic way. This makes them suitable for detect-
ing multivariate trends, such as the clustering of objects or variables in multivariate
data sets. PCA can be seen as a data compression method which can be used to (1)
display multivariate data sets, (2) filter noise and (3) study and interpret multivariate
processes. One clear limitation of the PCA is that it can only handle linear relations
between variables [9]. We acknowledge the fact that this may not be the best kernel
for the approach but here in our procedure it seems to be working.

11.3.2 Classification Based on Differential Evolution


The problem of classification is basically one of partitioning the feature space into
regions, one region for each category. Ideally, one would like to arrange this parti-
tioning so that none of the decisions is ever wrong [8].
The objective is to classify a set X of objects into N different classes
C1 , . . . ,CN by their features. We suppose that T is the number of different
kinds of features that we can measure from the objects. The key idea is to
determine for each class i the ideal vector

\[ y_i = (y_{i1}, \ldots, y_{iT}) \tag{11.5} \]

that represents class i as well as possible. Later on we call these vectors
class vectors. Once these class vectors have been determined, we have to decide
to which class a sample x belongs according to some criterion. This can be done,
e.g., by computing the distances di between the class vectors and the sample
which we want to classify. For computing the distance we used the Minkowski
metric:

\[ d(x, y) = \Big( \sum_{j=1}^{T} |x_j - y_j|^p \Big)^{1/p} \tag{11.6} \]
We used the Minkowski metric because it is more general than the Euclidean
metric; the Euclidean metric is still included as the special case p = 2. We
also found that when the value of p was optimized using DE, the optimum was not
even near p = 2, which corresponds to the Euclidean metric.
After computing the distances between the sample and the class vectors, we can
make the classification decision according to the shortest distance: we decide
that x ∈ Cm if

\[ d(x, y_m) = \min_{i=1,\ldots,N} d(x, y_i). \tag{11.7} \]
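The distance of Eq. (11.6) and the minimum-distance decision rule of Eq. (11.7) can be sketched as follows; the class vectors and the sample used here are illustrative placeholders, not values from the chapter:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of Eq. (11.6); p = 2 gives the Euclidean metric."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def classify(x, class_vectors, p):
    """Assign x to the class whose class vector is nearest, Eq. (11.7)."""
    dists = [minkowski(x, y_i, p) for y_i in class_vectors]
    return int(np.argmin(dists))

# Two illustrative class vectors in a 3-dimensional feature space
class_vectors = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
x = np.array([0.9, 0.8, 1.1])
print(classify(x, class_vectors, p=2))  # -> 1 (x is closer to the second class vector)
```

Note that the power p enters the decision only through the ordering of the distances, which is why it can be optimized jointly with the class vectors as a single extra parameter.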

Before doing the actual classification, all the parameters of the classifier
must be decided. These parameters are
1. the class vectors yi = (yi (1), . . . , yi (T )) for each class i = 1, . . . , N, and
2. the power value p in (11.6).
In this study we used the differential evolution algorithm [30] to optimize both
the class vectors and the value of p. For this purpose we split the data into a
learning set learn and a testing set test. The split was made so that half of
the data was used in the learning set and half in the testing set. We used the
data available in the learning set to find the optimal class vectors yi , and
the data in the testing set test was applied for assessing the classification
performance of the proposed classifier. A brief description of the differential
evolution algorithm is presented in the following section. The number of
parameters that the differential evolution algorithm needs to optimize here is
classes × dimension + 1, the additional parameter coming from the Minkowski
distance. As the results will later show, PCA can be used to lower the data's
dimensionality, and even with low dimensions we can find results which are
clearly better than those obtained by simply using the DE classifier. If we are
not satisfied with just lowering the data's dimensionality and the enhancement
achieved this way, but want to find the best lowered dimension, we have to
repeat this for every dimension lower than the maximum dimension, so that

\[ \sum_{i=1}^{\text{dimension}} \big( \text{classes} \cdot (\text{dimension} - i) + 1 \big) \]

parameters are optimized in total.
In short, the procedure of our algorithm is as follows:
1. Divide the data into a learning set and a testing set.
2. Create initial class vectors for each class (here we simply used random numbers).
3. Compute the distances between the samples in the learning set and the class vectors.
4. Classify the samples according to their minimum distance.
5. Compute the classification accuracy (number of correctly classified samples /
total number of samples in the learning set).
6. Compute the objective function value to be minimized as cost = 1 − accuracy.
7. Create new class vectors for each class for the next population using the
selection, mutation, and crossover operations of the differential evolution
algorithm, and go to step 3 until the stopping criterion is reached (e.g. the
maximum number of iterations).
8. Classify the data in the testing set according to the minimum distance
between the class vectors and the samples.
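Steps 3–6 above define the objective function that DE minimizes. A sketch of that objective is given below; the flat encoding of the N class vectors followed by p into one parameter vector is an assumed convention, and the toy learning set is illustrative:

```python
import numpy as np

def cost(params, X_learn, labels, n_classes, dim):
    """Objective of step 6: cost = 1 - accuracy on the learning set.

    `params` packs the N class vectors followed by the Minkowski power p,
    i.e. it has n_classes * dim + 1 entries (an assumed encoding).
    """
    class_vectors = params[:n_classes * dim].reshape(n_classes, dim)
    p = max(params[-1], 1.0)  # keep the Minkowski metric valid
    correct = 0
    for x, label in zip(X_learn, labels):
        # Steps 3-4: distances to all class vectors, nearest one wins
        d = np.sum(np.abs(class_vectors - x) ** p, axis=1) ** (1.0 / p)
        if np.argmin(d) == label:
            correct += 1
    accuracy = correct / len(X_learn)  # step 5
    return 1.0 - accuracy              # step 6

# Toy learning set: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(1, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
# Hand-picked parameters: class vectors at the cluster centres, p = 2
params = np.array([0.0, 0.0, 1.0, 1.0, 2.0])
print(cost(params, X, y, n_classes=2, dim=2))  # -> 0.0 (all samples classified correctly)
```

In the actual method, step 7 would hand this cost function to DE, which evolves the packed parameter vector until the stopping criterion is met.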

11.3.3 Differential Evolution


The DE algorithm [33], [30] was introduced by Storn and Price in 1995 and it be-
longs to the family of Evolutionary Algorithms (EAs). The design principles of DE
are simplicity, efficiency, and the use of floating-point encoding instead of binary
numbers. As a typical EA, DE has a random initial population that is then improved
using selection, mutation, and crossover operations. Several ways exist to determine
a stopping criterion for EAs but usually a predefined upper limit Gmax for the num-
ber of generations to be computed provides an appropriate stopping condition. Other
control parameters for DE are the crossover control parameter CR, the mutation fac-
tor F, and the population size NP.
In each generation G, DE goes through each D-dimensional decision vector vi,G
of the population and creates the corresponding trial vector ui,G as follows in
the most common DE version, DE/rand/1/bin [29]:

r1 , r2 , r3 ∈ {1, 2, . . . , NP}, randomly selected,
    mutually different and different from i
jrand = floor(randi [0, 1) · D) + 1
for (j = 1; j ≤ D; j = j + 1)
{
    if (randj [0, 1) < CR ∨ j = jrand)
        uj,i,G = vj,r3,G + F · (vj,r1,G − vj,r2,G)
    else
        uj,i,G = vj,i,G
}

In this DE version, NP must be at least four, and it remains fixed, along with CR and
F, during the whole execution of the algorithm. Parameter CR ∈ [0, 1], which controls
the crossover operation, represents the probability that an element of the trial vector
is chosen from a linear combination of three randomly chosen vectors rather than from
the old vector vi,G . The condition “ j = jrand ” ensures that at least one element
differs from the corresponding element of the old vector. The parameter F is a scaling
factor for mutation and its value typically lies in (0, 1+]¹. In practice, CR controls the
rotational invariance of the search: a small value (e.g., 0.1) is suitable for
separable problems, while larger values (e.g., 0.9) suit non-separable problems.
The control parameter F controls the speed and robustness of the search, i.e., a lower
value of F increases the convergence rate but also adds the risk of getting stuck
in a local optimum. Parameters CR and NP have the same kind of effect on the
convergence rate as F has.
¹ The notation means that the practical upper limit is about 1 but is not strictly defined.
272 P. Luukka and J. Lampinen

After the mutation and crossover operations, the trial vector ui,G is compared to
the old vector vi,G . If the trial vector has an equal or better objective value, then it
replaces the old vector in the next generation. This can be presented as follows (in
this paper minimization of objectives is assumed) [29]:

vi,G+1 = ui,G   if f (ui,G ) ≤ f (vi,G ),
vi,G+1 = vi,G   otherwise.

DE is an elitist method since the best population member is always preserved and
the average objective value of the population will never get worse.
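A full generation, combining the trial-vector construction with the greedy selection rule above, might look like this sketch (the sphere function used in the test stands in for the classification cost; the names are illustrative):

```python
import random

def de_generation(pop, costs, f, F=0.5, CR=0.9, rng=random):
    """One DE/rand/1/bin generation: each trial replaces its parent only if its
    objective value is equal or better, so the population best never worsens."""
    NP, D = len(pop), len(pop[0])
    new_pop, new_costs = [], []
    for i in range(NP):
        r1, r2, r3 = rng.sample([r for r in range(NP) if r != i], 3)
        j_rand = rng.randrange(D)
        u = [pop[r3][j] + F * (pop[r1][j] - pop[r2][j])
             if (rng.random() < CR or j == j_rand) else pop[i][j]
             for j in range(D)]
        fu = f(u)
        if fu <= costs[i]:  # greedy, elitist selection
            new_pop.append(u); new_costs.append(fu)
        else:
            new_pop.append(list(pop[i])); new_costs.append(costs[i])
    return new_pop, new_costs
```

Caching the parent costs, as done here, means each generation needs only NP new objective evaluations.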
As the objective function, f , to be minimized we applied the number of incor-
rectly classified learning set samples. Each population member, vi,G , as well as each
new trial solution, ui,G , contains the class vectors for all classes and the power value
p. In other words, DE seeks the vector (y(1), ..., y(T ), p) that minimizes the
objective function f . After the optimization process, the final solution, which defines
the optimized classifier, is the best member of the last generation's (Gmax ) population,
i.e., the individual vi,Gmax . The best individual is the one providing the lowest objective
function value and therefore the best classification performance on the learning set.
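Each DE individual thus concatenates all class vectors with the power value p. A hypothetical decoding helper could look as follows; the flat layout (all class vectors first, then p) is our assumption, not stated in the chapter:

```python
def decode(solution, n_classes, dim):
    """Split a flat DE vector (y(1), ..., y(T), p) into the class vectors
    and the power value p. Assumed layout: n_classes * dim entries, then p."""
    vectors = [solution[c * dim:(c + 1) * dim] for c in range(n_classes)]
    p = solution[n_classes * dim]
    return vectors, p
```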
The control parameters of the DE algorithm were set here as follows: CR = 0.9 and
F = 0.5 were applied for all classification problems, and NP was chosen to be six
times the number of optimized parameters.
However, these selections were mainly based on general recommendations and
practical experience with DE; no systematic investigation was performed to find the
optimal control parameter values. Further classification performance improvements
through better control parameter settings may therefore be possible in the future.

11.4 Classification Results


All data sets were split in half; one half was used for training and the other half
for testing the classifier. The training sets were randomly created 30 times for each
dimension. The results are also compared to other existing results in the literature.
Table 11.3 reports the results for the applied data sets. The achieved results
are also compared with the results achieved without PCA. The first column gives the
data set and whether the data were first preprocessed with PCA. The second column
gives the best classification accuracy and the third the mean classification
accuracy. The variance is reported next, followed by the reduced dimension providing
the best results. Finally, the optimized p-value is given in the last column. Results
for the Cleveland heart data set are given as Heart-Cleveland, for the Hungarian as
Heart-Hungarian, for the Switzerland heart data set as Heart-Switzerland, and for
the Long-Beach data set as Heart-Long-Beach. All four data sets combined are given
as Heart-All. There are several missing values in these data sets, and a dummy value
of −9 is simply used for each missing value. Results for the heart-statlog data set
are given as Heart-statlog. The best mean classification accuracies are in boldface.

Table 11.3 Classification results for the heart data sets: comparison of classification results
with the original data and with data preprocessed by PCA. The best mean accuracy is in
boldface

Data Best result (in %) Mean result (in %) Variance (in %) dim p-value
Heart-Cleveland 89.44 % 82.86 % 7.71 13 19.3
Heart-Cleveland(PCA) 91.55 % 86.48 % 2.82 12 82.8
Heart-Hungarian 88.44 % 83.42 % 5.95 13 88.1
Heart-Hungarian(PCA) 93.20 % 87.48 % 3.34 11 96.7
Heart-Switzerland 95.16 % 94.35 % 0.67 13 70.8
Heart-Switzerland(PCA) 95.16 % 94.46 % 0.66 5 82.1
Heart-Long Beach 80.20 % 78.32 % 1.31 13 54.4
Heart-Long Beach(PCA) 85.15 % 79.93 % 2.70 12 67.9
Heart-All 78.22 % 76.98 % 0.94 13 1.8
Heart-All(PCA) 84.22 % 82.01 % 1.05 13 49.2
Heart-statlog 88.89 % 83.21 % 10.80 13 81.1
Heart-statlog(PCA) 91.86 % 87.63 % 4.01 13 90.9

Cleveland data set: From Table 11.3 we can observe that the best mean classification
accuracy for the Cleveland data set is 86.5%; when the 99% confidence interval
is computed for the results (using Student's t distribution, μ ± t1−α/2 Sμ/√n), we get
the confidence interval 86.5% ± 0.8%. This result was obtained when the data was
first preprocessed with PCA. The preprocessing by PCA enhanced the results by over
3%. The best mean accuracy was found with a target dimensionality of 12.
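The quoted interval follows from μ ± t1−α/2 Sμ/√n. A small sketch, with the critical value t0.995,29 ≈ 2.756 for n = 30 runs passed in explicitly (to avoid depending on a statistics library for the t quantile), and mock accuracy values in the test:

```python
import math
import statistics

def confidence_interval(accuracies, t_crit):
    """Two-sided CI for the mean: mu ± t_crit * s / sqrt(n), where s is the
    sample standard deviation and t_crit the Student's t quantile for n-1 dof."""
    n = len(accuracies)
    mu = statistics.mean(accuracies)
    half = t_crit * statistics.stdev(accuracies) / math.sqrt(n)
    return mu - half, mu + half
```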
The results achieved with the Cleveland data set are compared to other results in Ta-
bles 11.4–11.6. In Table 11.4 the classification results obtained by our DE based
approach are compared to the corresponding results reported in [32], where a method
called Classification by Feature Partitioning (CFP) was introduced. This is an
inductive, incremental, and supervised learning method. There the data set was
divided into two sets, as here, but the training and testing set sizes were slightly
different. When comparing our results with those of Sirin and Güvenir [32], we
observed that the DE classifier classified the Cleveland data set with a higher
accuracy (82.9%) than the IB classifiers and C4, but yielded a slightly lower accuracy
than CFP (84.0%). When the data was first preprocessed with PCA, the DE classifier
reached a classification accuracy of 86.5%. In Table 11.5 the classification results
obtained by the DE based approach are compared to the results of the classifiers
reported in [21]. They used a decision tree classifier and also preprocessed the data
with the wavelet transform. They also used a two-fold technique, as here, but the
division into training and testing sets was 80–20. For their decision tree classifier
an accuracy of 76% was reported; in comparison, DE yielded an accuracy of 83%. Li
et al. [21] managed to enhance the results by preprocessing the data first with the
wavelet transform, gaining about a 4% unit enhancement and a mean accuracy of 80%.
We reached about a 3% unit enhancement using PCA, corresponding to an 86%
classification accuracy.
In Table 11.6 we have compared our results with the results reported in [3], where
tenfold cross-validation was used instead of the two-fold approach of our experiment.
As can be seen there, the smart crossover operator with multiple parents for a Pittsburgh Learning
Classifier seems to give around 10% better performance with this data set.
Generally, the results obtained here by the DE classifier with PCA preprocessing
appear rather promising.

Table 11.4 Comparison of the DE classifier's classification results with the results reported
by Sirin and Güvenir [32] for the Cleveland and Hungarian data sets

data set IB1 IB2 IB3 C4 CFP DE PCA + DE


Hungarian 58.7 55.9 80.5 78.2 82.3 83.4 87.5
Cleveland 75.7 71.4 78.0 75.5 84.0 82.9 86.5

Table 11.5 Comparison of the DE classifier's classification results with the results of Li et al.
[21] for the Cleveland, Hungarian, and Switzerland data sets

Data set Decisiontree(Dt) Dt + wavelet DE PCA + DE


Hungarian 76 80 83 87
Cleveland 76 80 83 86
Switzerland 88 88 94 94

Table 11.6 Comparison of the DE classifier's classification results with the results of Bacardit
and Krasnogor [3] for the Cleveland, Hungarian, and Statlog data sets

Data set EnhancedPLCS DE PCA + DE


Hungarian 86.05 83.42 87.48
Cleveland 95.54 82.86 86.48
Statlog 94.44 83.21 87.63

Hungarian data set: With the Hungarian data set the same situation was observed:
the best results were found when the data was first preprocessed with PCA. The best
mean accuracy with a 99% confidence interval was 87.5% ± 0.9%. Preprocessing with
PCA enhanced the results by over 3%. The best accuracy was found with a target
dimensionality of 11.
The results obtained with the Hungarian data set are compared to the results
of the other classifiers in Tables 11.4–11.7. When the results are compared with
those reported by Sirin and Güvenir [32] (Table 11.4), the DE classifier yielded a
slightly higher mean accuracy, 83.4%, than the second best, CFP, with an accuracy
of 82.3%. When the Hungarian data set was preprocessed with PCA, the accuracy of
the DE classifier increased to 87.5%, which can be considered a remarkably good
result. In Table 11.5 our results are compared with the corresponding ones of Li et al.
[21]. They reported an accuracy of 76% with their decision tree classifier, while in
comparison our DE classifier reached 83% accuracy. Li et al. also preprocessed the
data, and their wavelet transform preprocessing gained about a 4% unit enhancement
in accuracy (80%). We obtained a 4% unit enhancement in accuracy when we
preprocessed the data using PCA and then performed the classification with the
DE classifier, reaching an accuracy of 87%. In Table 11.6 our results are compared
with the results of [3]. With their method ac