
Computational Intelligence in Optimization

Adaptation, Learning, and Optimization, Volume 7

Series Editors-in-Chief

Meng-Hiot Lim

Nanyang Technological University, Singapore

E-mail: emhlim@ntu.edu.sg

Yew-Soon Ong

Nanyang Technological University, Singapore

E-mail: asysong@ntu.edu.sg

Adaptive Differential Evolution, 2009

ISBN 978-3-642-01526-7

Computational Intelligence in Expensive Optimization Problems, 2010

ISBN 978-3-642-10700-9

Exploitation of Linkage Learning in Evolutionary Algorithms, 2010

ISBN 978-3-642-12833-2

Differential Evolution in Electromagnetics, 2010

ISBN 978-3-642-12868-4

Agent-Based Evolutionary Search, 2010

ISBN 978-3-642-13424-1

Unified Computational Intelligence for Complex Systems, 2010

ISBN 978-3-642-03179-3

Computational Intelligence in Optimization, 2010

ISBN 978-3-642-12774-8

Yoel Tenne and Chi-Keong Goh (Eds.)

Computational Intelligence in

Optimization

Applications and Implementations


Dr. Yoel Tenne

Department of Mechanical Engineering and Science, Faculty of Engineering,

Kyoto University, Yoshida-honmachi,

Sakyo-ku, Kyoto 606-8501, Japan

E-mail: yoel.tenne@ky3.ecs.kyoto-u.ac.jp

Formerly: School of Aerospace Mechanical and Mechatronic Engineering,

Sydney University, NSW 2006, Australia

Dr. Chi-Keong Goh

Advanced Technology Centre,

Rolls-Royce Singapore Pte Ltd

50 Nanyang Avenue, Block N2,

Level B3C, Unit 05-08, Singapore 639798

E-mail: chi.keong.goh@rolls-royce.com

DOI 10.1007/978-3-642-12775-5

© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part

of the material is concerned, specifically the rights of translation, reprinting, reuse

of illustrations, recitation, broadcasting, reproduction on microfilm or in any other

way, and storage in data banks. Duplication of this publication or parts thereof is

permitted only under the provisions of the German Copyright Law of September 9,

1965, in its current version, and permission for use must always be obtained from

Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this

publication does not imply, even in the absence of a specific statement, that such

names are exempt from the relevant protective laws and regulations and therefore

free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

987654321

springer.com

To our families for their love and support.

Preface

Many real-world applications involve complex optimization processes, which are difficult to solve without advanced computational tools. With the increasing challenge of fulfilling the optimization goals of current applications, there is a strong drive to advance the development of efficient optimizers. The challenges introduced by emerging problems include:

• objective functions which are prohibitively expensive to evaluate, so typically only a small number of objective function evaluations can be made during the entire search,

• objective functions which are highly multimodal or discontinuous, and

• non-stationary problems which may change in time (dynamic).

Classical optimizers may perform poorly or may even fail to produce any improvement over the starting vector in the face of such challenges. This has motivated researchers to explore the use of computational intelligence (CI) to augment classical methods in tackling such challenging problems. Such methods include: a) population-based search methods such as evolutionary algorithms and particle swarm optimization, and b) non-linear mapping and knowledge-embedding approaches such as artificial neural networks and fuzzy logic, to name a few. Such approaches have been shown to perform well in challenging settings. Specifically, CI methods are powerful tools which offer several potential benefits, such as: a) robustness (they impose little or no requirements on the objective function), b) versatility (they handle highly non-linear mappings), c) self-adaptation to improve performance, and d) operation in parallel (making it easy to decompose complex tasks). However, the successful application of CI methods to real-world problems is not straightforward and requires both expert knowledge and trial-and-error experiments. As such, the goal of this volume is to survey a wide range of studies where CI has been successfully applied to challenging real-world optimization problems, while highlighting the insights researchers have obtained. Broadly, the studies in this volume focus on four main disciplines: continuous optimization, classification, scheduling, and hardware implementations.

For continuous optimization, Neto et al. study the use of artificial neural networks (ANNs) and heuristic rules for solving large-scale optimization problems. They focus on a recurrent ANN to solve a quadratic programming problem and propose several techniques to accelerate the convergence of the algorithm. Their method is more efficient than one using an ANN only. Starzyk et al. propose a direct-search optimization algorithm which uses reinforcement learning, resulting in an algorithm which ‘learns’ the best path during the search. The algorithm weights past steps based on their success to yield a new candidate search step. They benchmark their algorithm with several mathematical test functions and apply it to the training of a multi-layer perceptron neural network for image recognition. Ventresca et al. use the opposition sampling approach to decrease the number of function evaluations. The approach attempts to sample the function in a subspace generated by the ‘opposites’ of an existing population of candidates. They apply their method to differential evolution and incremental learning and show that the opposition method improves performance over baseline variants. Bazan studies an optimization algorithm for problems where the objective function requires large computational resources. His proposed algorithm uses locally regularized approximations of the objective function based on radial basis functions. He provides convergence proofs and formulates a framework which can be applied to other algorithms, such as the Gauss-Seidel or Conjugate Directions methods.
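The ‘opposites’ idea used by Ventresca et al. is simple to state: for a candidate whose coordinate x_i lies in [a_i, b_i], the opposite point has coordinates a_i + b_i − x_i. The following minimal sketch of opposition-based initialization illustrates the general idea only (it is not the authors’ implementation; `sphere` is a stand-in objective):

```python
import random

def opposite(x, bounds):
    # The opposite of coordinate x_i in [a_i, b_i] is a_i + b_i - x_i.
    return [a + b - xi for xi, (a, b) in zip(x, bounds)]

def opposition_init(f, bounds, n):
    # Evaluate each random candidate and its opposite; keep the better one.
    pop = []
    for _ in range(n):
        x = [random.uniform(a, b) for a, b in bounds]
        xo = opposite(x, bounds)
        pop.append(min(x, xo, key=f))
    return pop

# Illustrative usage on a simple sphere function.
bounds = [(-5.0, 5.0)] * 2
sphere = lambda x: sum(xi * xi for xi in x)
pop = opposition_init(sphere, bounds, 10)
```

Keeping the better member of each candidate-opposite pair tends to start the search closer to an optimum, at the cost of one extra evaluation per candidate.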

Ruiz-Torrubiano et al. study hybrid methods for solving large-scale optimization problems with cardinality constraints, a class of problems arising in diverse areas such as finance, machine learning, and statistical data analysis. While existing methods (such as branch-and-bound) can provide exact solutions, they require large resources. As such, the study focuses on methods which can efficiently identify approximate solutions while requiring far less computer resources. For problems where it is expensive to evaluate the objective function, Jayadeva et al. propose using a support-vector machine to predict the location of yet-undiscovered optima. Their framework can be applied to problems where little or no a priori information is available on the objective function, as the algorithm ‘learns’ during the search process. Benchmarks show their method can outperform existing methods such as particle swarm optimization or genetic algorithms. Voutchkov and Keane study multi-objective optimization problems using surrogate models. They investigate how to efficiently update the surrogates under a small optimization ‘budget’ and compare different updating strategies. They also show that using a number of surrogate models can improve the optimization search and that the size of the ensemble should increase with the problem dimension.
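The surrogate-update loop that such chapters build on can be sketched generically (an illustrative toy, not Voutchkov and Keane’s procedure; `expensive_f` stands in for a costly simulation): fit a cheap model to the evaluated points, optimize the model instead of the true function, evaluate the true objective at the model’s minimizer, and refit.

```python
import numpy as np

def expensive_f(x):
    # Stand-in for a costly simulation (1-D for illustration).
    return (x - 1.3) ** 2 + 0.1 * np.sin(5 * x)

rng = np.random.default_rng(0)
X = list(rng.uniform(-2, 2, size=4))    # small initial design
y = [expensive_f(x) for x in X]

for _ in range(5):                      # tight evaluation budget
    coeffs = np.polyfit(X, y, deg=2)    # cheap quadratic surrogate
    grid = np.linspace(-2, 2, 401)
    x_new = grid[np.argmin(np.polyval(coeffs, grid))]  # optimize the surrogate
    X.append(x_new)                     # evaluate the truth there and refit
    y.append(expensive_f(x_new))

best = X[int(np.argmin(y))]
```

Each iteration spends exactly one true evaluation, which is the point of the approach: all other work is done on the cheap model.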

Others study agent-based algorithms, that is, algorithms where the optimization is done by agents which co-operate during the search. Dreżewski and Siwik review agent-based co-evolutionary algorithms for multi-objective problems. Such algorithms combine co-evolution (multiple species) with the agent approach (interaction). They review and compare existing methods and benchmark them over a range of test problems. Results show that agent-based co-evolutionary algorithms can perform equally well as, and even surpass, some of the best existing multi-objective evolutionary algorithms. Salhi and Töreyen propose a multi-agent algorithm based on game theory. Their framework uses multiple solvers (agents) which compete over available resources, and their algorithm identifies the most successful solver. In the spirit of game theory, successful solvers are rewarded by increasing their computing resources, and vice versa. Test results show the framework provides a better final solution when compared to using a single solver.
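The reward mechanism can be illustrated with a toy version of the idea (not the GTMAS algorithm itself; the two ‘solvers’ here are just random searches with different step sizes): solvers split an evaluation budget each round, and the winner of a round receives a larger share of the next one.

```python
import random

def random_search(f, x, steps, scale):
    # A trivial solver: random perturbations, keeping improvements only.
    best = x
    for _ in range(steps):
        cand = [xi + random.gauss(0, scale) for xi in best]
        if f(cand) < f(best):
            best = cand
    return best

def compete(f, x0, rounds=10, budget=60):
    random.seed(1)
    solvers = {"coarse": 1.0, "fine": 0.1}      # two step-size "agents"
    shares = {name: 0.5 for name in solvers}    # equal budget at the start
    best = x0
    for _ in range(rounds):
        results = {}
        for name, scale in solvers.items():
            steps = max(1, int(budget * shares[name]))
            results[name] = random_search(f, best, steps, scale)
        winner = min(results, key=lambda n: f(results[n]))
        best = min(best, results[winner], key=f)
        # Reward the winner with a larger share of the next round's budget.
        shares = {n: (0.7 if n == winner else 0.3) for n in solvers}
    return best

sphere = lambda x: sum(xi * xi for xi in x)
best = compete(sphere, [3.0, -4.0])
```

In this sketch the coarse solver tends to win early (large moves pay off far from the optimum) and the fine solver later, so the budget follows whichever strategy is currently productive.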

For applications in classification, Arana-Daniel et al. use Clifford algebra to generalize support vector machines (SVMs) for classification (with an extension to regression). They represent input data as a multivector and use a single Clifford kernel for multi-class problems. This approach significantly reduces the computational complexity involved in training the SVM. Tests using real-world applications in signal processing and computer vision show the merit of their approach. Luukka and Lampinen propose a classification method which combines principal component analysis to pre-process the data with subsequent optimization of the classifier parameters using a differential evolution algorithm. Specifically, they optimize the class vectors used by the classifier and the power of the distance metric. Test results using real-world data sets show the proposed approach performs as well as or better than some of the best existing classifiers. Lastly in this category, Zhang et al. study the problem of feature selection in high-dimensional problems. They focus on the GA-SVM approach, where a genetic algorithm (GA) optimizes the parameters of the SVM (the GA uses the SVM output as the objective values). The problem requires large computational resources, which makes it difficult to apply the approach to large or high-dimensional data sets. As such, they propose several measures, such as parallelization, neighbour search, and caching, to accelerate the search. Test results show their approach can reduce the computational cost of training an SVM classifier.
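Of these measures, evaluation caching is the easiest to illustrate (a generic sketch, not the authors’ code; `train_and_score` is a hypothetical stand-in for training and scoring an SVM): because a GA population often revisits identical feature subsets, memoizing each subset’s fitness avoids repeating the expensive training.

```python
# Generic fitness cache for a GA over feature subsets (illustrative only).
calls = 0

def train_and_score(subset):
    global calls
    calls += 1                     # count expensive "training" evaluations
    return -len(subset)            # pretend fewer features is better

cache = {}

def fitness(subset):
    key = frozenset(subset)        # order-independent cache key
    if key not in cache:
        cache[key] = train_and_score(subset)
    return cache[key]

# A GA that revisits subsets only pays for the distinct ones.
population = [[0, 1], [1, 0], [2], [0, 1], [2]]
scores = [fitness(s) for s in population]
```

Here five fitness queries trigger only two expensive evaluations, since `[0, 1]` and `[1, 0]` map to the same key and repeats hit the cache.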

Two studies focus on difficult scheduling problems. First, Pieters studies the problem of railway timetable design scheduling, which is an NP-hard problem with additional challenging features, such as being reactive and dynamic. He studies solving the problem with Symbiotic Networks, a class of neural networks inspired by the symbiosis phenomenon in nature, in which the network uses ‘agents’ to adapt itself to the problem. Test results show the Symbiotic Network can successfully handle this complex scheduling problem. Next, Srivastava et al. propose an approach combining evolutionary algorithms, neural networks, and fuzzy logic to solve multiobjective time-cost trade-off problems. They consider a range of such problems, including non-linear time-cost relationships, constrained resources, and project uncertainties. They show the merit of their approach by testing it on a real-world test case.


Lastly, for hardware implementations, the first chapter in this category surveys the use of systolic arrays for implementing artificial neural networks on VLSI and FPGA platforms, with a focus on efficient hardware implementations of neural networks for real-time applications. The chapter surveys various approaches and current achievements, as well as future directions such as mixed analog-digital neural networks. This is followed by Thangavelautham et al., who propose using coarse-coding techniques to evolve multi-robot controllers, aiming to evolve both the controller and the sensor configurations simultaneously. To make the problem tractable, they use an Artificial Neural Tissue to exploit regularity in the sensor data. Test results show their approach outperforms a reference one.

Overall, the chapters in this volume address a spectrum of issues arising in the application of computational intelligence to difficult real-world optimization problems. The chapters discuss both current accomplishments and remaining open issues, and point to future research directions in the field.

Yoel TENNE

Chi-Keong GOH

Acknowledgement to Reviewers

The editors would like to thank the following experts, who have kindly reviewed chapters for this edited book. Their assistance has been invaluable to our endeavors.

Will Browne
Pedro M. S. Carvalho
Jia Chen
Sheng Chen
Tsung-Che Chiang
Siang-Yew Chong
Antonio Della Cioppa
Carlos A. Coello Coello
Marco Cococcioni
Claudio De Stefano
Antonio Gaspar-Cunha
Kyriakos C. Giannakoglou
David Ginsbourger
Frederico Guimarães
Martin Holena
Amitay Isaacs
Jayadeva
Wee Tat Koo
Slawomir Koziel
Jouni Lampinen
Xiaodong Li
Pasi Luukka
Pramod Kumar Meher
Hirotaka Nakayama
Ferrante Neri
Thai Dung Nguyen
Alberto Ochoa
Yew-Soon Ong
Khaled Rasheed
Tapabrata Ray
Abdellah Salhi
Vui Ann Shim
Ofer M. Shir
Dimitri Solomatine
Sanjay Srivastava
Janusz Starzyk
Stephan Stilkerich
Haldun Süral
Mohamed B. Trabia
Massimiliano Vasile
Lingfeng Wang
Chee How Wong

Contents

Optimization Problems and Increase Guaranteed Optimal

Convergence Speed of Recurrent ANN . . . . . . . . . . . . . . . . . . . . . . . . . 1

Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, Milde M.S. Lira

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Neural Network of Maa and Shanblatt: Two-Phase

Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Hybrid Intelligent System Description . . . . . . . . . . . . . . . . . . . . . . . 7

1.3.1 Method of Tendency Based on the Dynamics in

Space-Time (TDST) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3.2 Method of Tendency Based on the Dynamics in

State-Space (TDSS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.1 Case 1: Mathematical Linear Programming

Problem – Four Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.2 Case 2: Mathematical Linear Programming

Problem – Eleven Variables . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.3 Case 3: Mathematical Quadratic Programming

Problem – Three Variables . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 A Novel Optimization Algorithm Based on Reinforcement

Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Janusz A. Starzyk, Yinyin Liu, Sebastian Batog

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.2.1 Basic Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


2.2.2 Optimized Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.2.3 Predicting New Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.2.4 Stopping Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.2.5 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.3 Simulation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.1 Finding Global Minimum of a Multi-variable

Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.2 Optimization of Weights in Multi-layer Perceptron

Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.3.3 Micro-saccade Optimization in Active Vision for

Machine Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 The Use of Opposition for Decreasing Function Evaluations in

Population-Based Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Mario Ventresca, Shahryar Rahnamayan, Hamid Reza Tizhoosh

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2 Theoretical Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2.1 Definitions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.2.2 Consequences of Opposition . . . . . . . . . . . . . . . . . . . . . . . 52

3.2.3 Lowering Function Evaluations . . . . . . . . . . . . . . . . . . . . . 53

3.2.4 Comparison to Existing Methods . . . . . . . . . . . . . . . . . . . 54

3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.3.1 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.3.2 Opposition-Based Differential Evolution . . . . . . . . . . . . . 57

3.3.3 Population-Based Incremental Learning . . . . . . . . . . . . . . 57

3.3.4 Oppositional Population-Based Incremental

Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.4.1 Evolutionary Image Thresholding . . . . . . . . . . . . . . . . . . . 59

3.4.2 Parameter Settings and Solution Representation . . . . . . . 63

3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.5.1 ODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.5.2 OPBIL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4 Search Procedure Exploiting Locally Regularized Objective

Approximation: A Convergence Theorem for Direct Search

Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Marek Bazan

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.2 The Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.3 Zangwill’s Method to Prove Convergence . . . . . . . . . . . . . . . . . . . . 75


4.4.1 Closedness of the Algorithmic Transformation . . . . . . . . 78

4.4.2 A Perturbation in the Line Search . . . . . . . . . . . . . . . . . . . 80

4.5 The Radial Basis Approximation . . . . . . . . . . . . . . . . . . . . . . . . 87

4.5.1 Detecting Dense Regions . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.5.2 Regularization Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4.5.3 Choice of the Regularization Parameter λ Value . . . . . . . 90

4.5.4 Error Bounds for Radial Basis Approximation . . . . . . . . 91

4.6 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.6.1 Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5 Optimization Problems with Cardinality Constraints . . . . . . . . . . . . 105

Rubén Ruiz-Torrubiano, Sergio Garcı́a-Moratilla, Alberto Suárez

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.2 Approximate Methods for the Solution of Optimization

Problems with Cardinality Constraints . . . . . . . . . . . . . . . . . . . . . 108

5.2.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.2.2 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.2.3 Estimation of Distribution Algorithms . . . . . . . . . . . . . . . 111

5.3 Benchmark Optimization Problems with Cardinality

Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.3.1 The Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.3.2 Ensemble Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5.3.3 Portfolio Optimization with Cardinality Constraints . . . . 119

5.3.4 Index Tracking by Partial Replication . . . . . . . . . . . . . . . . 122

5.3.5 Sparse Principal Component Analysis . . . . . . . . . . . . . . . 124

5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6 Learning Global Optimization through a Support Vector

Machine Based Adaptive Multistart Strategy . . . . . . . . . . . . . . . . . . . 131

Jayadeva, Sameena Shah, Suresh Chandra

6.1 Introduction and Background Research . . . . . . . . . . . . . . . . . . . . . . 132

6.2 Global Optimization with Support Vector Regression Based

Adaptive Multistart (GOSAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

6.3.1 One Dimensional Wave Function . . . . . . . . . . . . . . . . . . . 137

6.3.2 Two Dimensional Case: Ackley’s Function . . . . . . . . . . . 140

6.3.3 Comparison with PSO and GA on Higher

Dimensional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.4 Extension to Constrained Optimization Problems . . . . . . . . . . . . . . 143

6.4.1 Sequential Unconstrained Minimization Techniques . . . 143

6.5 Design Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147


6.5.2 Folded Cascode Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

7 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . . . 155

Ivan Voutchkov, Andy Keane

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

7.2 Surrogate Models for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 157

7.3 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . 158

7.4 Pareto Fronts - Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

7.5 Response Surface Methods, Optimization Procedure and Test

Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

7.6 Update Strategies and Related Parameters . . . . . . . . . . . . . . . . . . . . 163

7.7 Test Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

7.8 Pareto Front Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.8.1 Generational Distance ([3], pp.326) . . . . . . . . . . . . . . . . . 165

7.8.2 Spacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.8.3 Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.8.4 Maximum Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

7.9 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

7.9.1 Understanding the Results . . . . . . . . . . . . . . . . . . . . . . . . . 166

7.9.2 Preliminary Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 167

7.9.3 The Effect of the Update Strategy Selection . . . . . . . . . . . 167

7.9.4 The Effect of the Initial Design of Experiments . . . . . . . . 171

7.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

Multi-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Rafał Dreżewski, Leszek Siwik

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

8.2 Model of Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 179

8.2.1 Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 180

8.2.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8.2.3 Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

8.2.4 Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

8.2.5 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective

Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

8.3.1 Co-Evolutionary Multi-Agent System with

Co-Operation Mechanism (CCoEMAS) . . . . . . . . . . . . . . 187

8.3.2 Co-Evolutionary Multi-Agent System with

Predator-Prey Interactions (PPCoEMAS) . . . . . . . . . . . . . 190

8.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196


8.4.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 196

8.4.2 A Glance at Assessing Co-operation Based Approach

(CCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

8.4.3 A Glance at Assessing Predator-Prey Based

Approach (PPCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

8.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

9 A Game Theory-Based Multi-Agent System for Expensive

Optimisation Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

Abdellah Salhi, Özgun Töreyen

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

9.2.1 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

9.2.2 Game Theory: The Iterated Prisoners’ Dilemma . . . . . . 213

9.2.3 Multi-Agent Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

9.3 Constructing GTMAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

9.3.1 GTMAS at Work: Illustration . . . . . . . . . . . . . . . . . . . . . . 216

9.4 The GTMAS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

9.4.1 Solver-Agents Decision Making Procedure . . . . . . . . . . . 219

9.5 Application of GTMAS to TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

9.6 Tests and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

9.7 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

10 Optimization with Clifford Support Vector Machines and

Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

N. Arana-Daniel, C. López-Franco, E. Bayro-Corrochano

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

10.2 Geometric Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

10.2.1 The Geometric Algebra of n-D Space . . . . . . . . . . . . . . . . 235

10.2.2 The Geometric Algebra of 3-D Space . . . . . . . . . . . . . . . 237

10.3 Linear Clifford Support Vector Machines for Classification . . . . . 237

10.4 Non Linear Clifford Support Vector Machines for

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

10.5 Clifford SVM for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

10.6 Recurrent Clifford SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

10.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

10.7.1 3D Spiral: Nonlinear Classification Problem . . . . . . . . . . 247

10.7.2 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

10.7.3 Multi-case Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

10.7.4 Experiments Using Recurrent CSVM . . . . . . . . . . . . . . . . 257

10.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260


and Differential Evolution Algorithm Applied for Prediction

Diagnosis from Clinical EMR Heart Data Sets . . . . . . . . . . . . . . . . . . 263

Pasi Luukka, Jouni Lampinen

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

11.2 Heart Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

11.3 Classification Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

11.3.1 Dimension Reduction Using Principal Component

Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

11.3.2 Classification Based on Differential Evolution . . . . . . . . 269

11.3.3 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

11.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

11.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

12 An Integrated Approach to Speed Up GA-SVM Feature Selection

Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh,

Gary Kee Khoon Lee

12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

12.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

12.2.1 Parallel/Distributed GA . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

12.2.2 Parallel SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

12.2.3 Neighbor Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

12.2.4 Evaluation Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

12.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

Timetable Problems with Symbiotic Networks . . . . . . . . . . . . . . . . . . 299

Kees Pieters

13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

13.1.1 Convergence Inducing Process . . . . . . . . . . . . . . . . . . . . . 300

13.1.2 A Classification of Problem Domains . . . . . . . . . . . . . . . . 301

13.2 Railway Timetable Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

13.3 Symbiotic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

13.3.1 A Theory of Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

13.3.2 Premature Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

13.4 Symbiotic Networks as Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . 313

13.5 Trains as Symbiots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

13.5.1 Trains in Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

13.5.2 The Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

13.5.3 The Trains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

13.5.4 The Optimizing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

13.5.5 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 319


13.5.7 A Symbiotic Network as a CCGA . . . . . . . . . . . . . . . . . . . 321

13.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

14 Project Scheduling: Time-Cost Tradeoff Problems . . . . . . . . . . . . . . . 325

Sanjay Srivastava, Bhupendra Pathak, Kamal Srivastava

14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

14.1.1 A Mathematical Description of TCT Problems . . . . . . . . 328

14.2 Resource-Constrained Nonlinear TCT . . . . . . . . . . . . . . . . . . . . . . . 329

14.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330

14.2.2 Working of ANN and Heuristic Embedded Genetic

Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

14.2.3 ANNHEGA for a Case Study . . . . . . . . . . . . . . . . . . . . . . 334

14.3 Sensitivity Analysis of TCT Profiles . . . . . . . . . . . . . . . . . . . . . . . . 336

14.3.1 Working of IFAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

14.3.2 IFAG for a Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

14.4 Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

14.4.1 Working of Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . 345

14.4.2 HMH Approach for Case Studies . . . . . . . . . . . . . . . . . . . 348

14.4.3 Standard Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

14.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

15 Systolic VLSI and FPGA Realization of Artificial Neural

Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

Pramod Kumar Meher

15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360

15.2 Direct-Design of VLSI for Artificial Neural Network . . . . . . . . . . 362

15.3 Design Considerations and Systolic Building Blocks for ANN . . . 364

15.4 Systolic Architectures for ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371

15.4.1 Systolic Architecture for Hopfield Net . . . . . . . . . . . . . . . 371

15.4.2 Systolic Architecture for Multilayer Neural Network . . . 373

15.4.3 Systolic Implementation of Back-Propagation

Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

15.4.4 Implementation of Advance Algorithms and

Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376

15.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

16 Application of Coarse-Coding Techniques for Evolvable

Multirobot Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

Jekanthan Thangavelautham, Paul Grouchy,

Gabriele M.T. D’Eleuterio

16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

16.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384


16.2.2 Task Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385

16.2.3 Machine-Learning Techniques and Modularization . . . . 386

16.2.4 Fixed versus Variable Topologies . . . . . . . . . . . . . . . . . . . 387

16.2.5 Regularity in the Environment . . . . . . . . . . . . . . . . . . . . . . 388

16.3 Artificial Neural Tissue Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

16.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

16.3.2 The Decision Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390

16.3.3 Evolution and Development . . . . . . . . . . . . . . . . . . . . . . . . 391

16.3.4 Sensory Coarse Coding Model . . . . . . . . . . . . . . . . . . . . . 393

16.4 An Example Task: Resource Gathering . . . . . . . . . . . . . . . . . . . . . . 395

16.4.1 Coupled Motor Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 397

16.4.2 Evolutionary Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

16.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

16.5.1 Evolution and Robot Density . . . . . . . . . . . . . . . . . . . . . . . 403

16.5.2 Behavioral Adaptations . . . . . . . . . . . . . . . . . . . . . . . . . . . 403

16.5.3 Evolved Controller Scalability . . . . . . . . . . . . . . . . . . . . . . 406

16.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

16.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

Chapter 1

New Hybrid Intelligent Systems to Solve Linear

and Quadratic Optimization Problems and

Increase Guaranteed Optimal Convergence

Speed of Recurrent ANN

Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, and Milde M.S. Lira

Abstract. This chapter deals with the study of artificial neural networks (ANNs) and

Heuristic Rules (HR) to solve optimization problems. The study of ANNs as optimization tools for solving large-scale problems was due to the fact that this technique has

great potential for hardware VLSI implementation, in which it may be more efficient

than traditional optimization techniques. However, the implementation of the computational algorithm has shown that the proposed technique, while efficient, is slow compared with traditional mathematical methods. In order to make it

a fast method, we will show two ways to increase the speed of convergence of the

computational algorithm. For analysis and comparison, we solved three test cases. This chapter considers recurrent ANNs to solve linear and quadratic programming problems. These networks are based on the solution of a set of differential equations that

are obtained from a transformation of an augmented Lagrange energy function. The

proposed hybrid systems combining recurrent ANN and HR presented a reduced

computational effort in relation to the one using only the recurrent ANN.

1.1 Introduction

The early 1980’s were marked by a resurgence of interest in artificial neural net-

works (ANNs). At that time, the development of ANNs had the important charac-

teristic of temporal processing. Many researchers have attributed the resumption of

researches on ANNs in the eighties to the Hopfield model presented in 1982 [1].

This recurrent Hopfield model constituted a great advance over the state of knowledge in the area of neural networks at that time.

Nowadays, it is known that there are two ways of incorporating temporal computation in a neural network: the first is by using a static neural net to accomplish a dynamical mapping in a structure of short-term memory; and the

Otoni Nóbrega Neto · Ronaldo R.B. de Aquino · Milde M.S. Lira

Electrical Engineering Department, Federal University of Pernambuco, Brazil

e-mail: otoninobrega@hotmail.com,rrba@ufpe.br

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 1–26.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010


second one is by making internal feedback connections that may be made by single

or multi-loop feedback in which the neural network can be fully connected. Artifi-

cial neural networks that have feedback connections in their topology are known as

recurrent neural networks [2]. The theoretical study and applications of recurrent

neural nets were developed in several subsequent works [3, 4, 5, 6, 7, 8, 9]. Indeed, Hopfield's works showed that a value of energy

could be associated to each state of the net and that this energy decreases monoton-

ically as the path is described within the state-space towards a fixed point. These

fixed points are therefore stable points of energy [10], i.e., the described energy

function behaves as a Lyapunov function for the model described in detail in Hopfield's works. It is here that questions of stability arise in recurrent neural nets. When considering the stability of a non-linear dynamical system, we usually

think about stability in the sense of Lyapunov. The Direct Method of Lyapunov is

broadly used for stability analysis of linear and non-linear systems which may be

either time-variant or time-invariant. Therefore, it can be directly applicable to the

stability analysis of ANNs [2].

In 1985, Hopfield solved the traveling salesman problem [7] that is a problem

in combinatorial optimization using a continuous model of the recurrent neural net-

work as an optimization tool. In 1986, Hopfield proposed a specialized ANN to

solve specific problems of linear programming (LP) [9] based on analog circuits,

studied since 1956 by Insley B. Pyne and presented in [11]. On that occasion, Hop-

field demonstrated that the dynamics involved in recurrent artificial neural nets were

described by a Lyapunov function and that for this reason, it was demonstrated that

this network is stable and also that the point of stability is the solution to the problem

for which the ANN was modeled.

In 1987, Kennedy and Chua demonstrated that the ANN proposed by Hopfield in 1986, although it searches for the minimum of the energy function, had not been modeled to be bounded below, except when the saturation of an operational amplifier of the circuit was reached [12]. Due to this

deficiency, Kennedy and Chua proposed a new circuit for LP problems that also

proved to be able to solve quadratic programming (QP) problems. These circuits

were nominated as “canonical non-linear programming circuit”, which are based on

the Kuhn-Tucker (KT) conditions [12]. In this kind of ANN-based optimization,

the problem has to be “hard-wired” in the network and the convergence behavior of

the ANN depends greatly on how the cost function is modeled.

Later on, further studies [13, 14] confirmed that, for non-linear programming problems, the model proposed by Kennedy and Chua [15] completely satisfies the

optimization of KT conditions and the penalty method. Besides, under appropri-

ate conditions this net is stable. In spite of the important progresses presented in

Kennedy and Chua’s studies, a deficiency was observed in the model, which appears

when the equilibrium point of the net happens in the neighborhood of the optimal

point of the original problem, but the distance between the optimal point and the

equilibrium point of the network can be reduced by increasing the penalty parame-

ter (s), as in [14] and [16]. Even so, Kennedy and Chua’s network is able to solve a

great class of optimization problems with and without constraints. However, when the solution lies on the boundary of the feasible region, i.e., when equality or active inequality constraints are involved, the network only converges to an approximate solution that may lie outside the feasible region [17].

This is explained by the application of the penalty function theorem [16]. For ap-

plications in which an unfeasible solution cannot be tolerated, the usefulness of this

technique (Kennedy and Chua’s neural networks) is seriously jeopardized. With the

intention of overcoming this difficulty, Maa and Shanblatt proposed the two-phase

method [14]. This work reveals an innovation in the method presented by David

W. Tank and John J. Hopfield [18] and it guarantees that, in certain conditions, the

proposed network evolves towards the exact solution of the optimization problem.

Since the Kennedy and Chua network contains a finite penalty parameter, it generates only approximate solutions and presents an implementation problem when the penalty parameter is very large. To reach an exact solution, the Maa and Shanblatt method uses another penalty parameter in the second phase. Therefore, to

avoid using penalty parameters, some significant works have been done in recent

years. Among them, a few primal-dual neural networks with two-layer and one-

layer structure were developed for solving linear and quadratic programming prob-

lems [18, 19, 20, 21]. These neural networks were proved to be globally convergent

to an exact solution when the objective function is convex.

Nowadays, recurrent ANNs have been used to solve real-world problems such as hydrothermal scheduling: [22] is based on the augmented Lagrange Hopfield network, and [23, 24, 25] are based on the Maa and Shanblatt two-phase neural network.

In this work, ANNs were applied to solve optimization problems, using the method proposed by Maa and Shanblatt. The study of ANNs as optimization tools for solving large-scale problems was due to the fact that this technique

has a great potential for hardware VLSI implementation, in which it can be more

efficient than traditional optimization techniques. However, the implementation of

the method in software has shown that, in spite of the technique being efficient in

the solution of optimization problems, the speed of convergence could become slow

when compared with traditional mathematical methods. In this regard, heuristic rules were created and proposed in a hybrid form to aid and accelerate the convergence

of the two-phase method in the software. It is important to point out that the soft-

ware implementation of the method is an important part of the development and

analysis of the method in hardware. An important characteristic for choosing Maa

and Shanblatt network is that it is ready to solve linear and quadratic optimization

problems with equality and inequality linear constraints without using mathematical

transformations, which would increase the dimension of the problem. As we plan to

apply the developed HIS in this work to solve the hydrothermal scheduling problem

[24, 25], which does not need an exact solution, the first phase of the Maa and Shanblatt network was chosen for this implementation. In future works, we may try other kinds

of recurrent ANNs.

Decision trees and classification rules are important and common methods used

for knowledge representation in the expert systems [26]. Heuristic rules are rules

which have no particular foundation in a scientific theory, but which are only based

on the observation of general patterns and derived from facts. These rules are


applicable to many problems as shown in [27, 28, 29, 30]. Here the basis of the

proposed heuristic rules is the dynamical behavior of neural networks. From the

convergence analysis, we identified the parameters and their relationships, which

are then transformed into a set of heuristic rules. We developed an algorithm based

on the heuristic rules and carried out some experiments to evaluate and support the

proposed technique.

In this work, two possible implementations were developed, tested, and compared, and a large reduction in computational effort was observed when using the proposed heuristic rules. This reduction is related to the decrease in the number of ODEs computed during the convergence process. Other possible implementations are also

indicated.

This work is organized as follows: it begins with a review of the two-phase method of Maa and Shanblatt; next, we present the proposed heuristic rules and show the

solutions for test cases using the previously discussed techniques; next, the simula-

tion results are presented and analyzed; and finally, we draw conclusions about the

proposed work.

Optimization

The operation of the Hopfield network model and the subsequent models is based

on the constraint violations of the optimization problem. When a constraint violation occurs, the magnitude and the direction of the violation are fed back to adjust the

states of the neurons of the network so that the overall energy function of the net-

work always decreases until it reaches a minimum level. These ANN models have

dynamical characteristics according to the Lyapunov function. Therefore, it can be

demonstrated that these networks are stable and that the equilibrium point is the

solution of LP and QP problems that the network represents. This type of neural

network was first improved in [15] and later in [14]. The network used in the last

version is the one used in this work.

The Maa and Shanblatt network is able to solve constrained or unconstrained convex

quadratic and linear programming problems.

Consider the following convex P problem:

(P)  min f (x) = (1/2) xT Qx + cT x
     s.t. g(x) = Dx − b ≤ 0
          h(x) = Hx − w = 0
          x ∈ Rn                                           (1.1)

where Q is symmetric and positive definite or positive semidefinite, and f , the gi 's, and the h j 's are functions from Rn to R. Assume that the feasible domain of P is not empty and that the objective function is bounded below over the domain of P.

Here the gi 's are convex functions, and the h j 's are affine functions. Another particularity of the formulation can

be observed when Q is a zero matrix and the cost function is thus reduced to

f (x) = cT x. In this case, if the inequality and equality constraints have linear for-

mulation then the problem P becomes a linear programming problem.
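To make the formulation concrete, a small instance of (1.1) can be written down directly. The matrices below are illustrative values chosen for this sketch, not one of the chapter's test cases:

```python
import numpy as np

# A small convex QP in the form of problem (P) in (1.1).
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])    # symmetric, positive definite
c = np.array([-2.0, -5.0])
D = np.array([[1.0, 1.0]])    # one inequality: x1 + x2 <= 3
b = np.array([3.0])
H = np.array([[1.0, -1.0]])   # one equality: x1 - x2 = 0
w = np.array([0.0])

def f(x):
    """Objective of (P): (1/2) x^T Q x + c^T x."""
    return 0.5 * x @ Q @ x + c @ x

def g(x):
    """Inequality constraints; x is feasible where g(x) <= 0."""
    return D @ x - b

def h(x):
    """Equality constraints; x is feasible where h(x) = 0."""
    return H @ x - w
```

Setting Q to the zero matrix reduces f (x) to cT x and, as noted above, turns (P) into a linear program.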

The method presented by Maa and Shanblatt [14] is composed of two phases.

The first phase of the method aims to initialize the problem and to converge quickly

without high accuracy towards the neighborhood of the optimal point, while the sec-

ond phase aims to reach the exact solution of the problem. To this end, the dynamic

of the first phase is based on the exact penalty Lagrangian function, or energy

function L(s, x):

L(s, x) = f (x) + (s/2) [ ∑i=1..p (g+i (x))2 + ∑ j=1..q (h j (x))2 ]          (1.2)

where s is a large positive real number, and the function g+i (x(t)) = max{0, gi (x(t))}, whose notation was simplified to g+ = [g+1 , . . . , g+m ]T , according to [14].

As the system converges, x(t) → x̂, sg+i (x(t)) → λi and sh j (x(t)) → μ j ,

which are the Lagrange multipliers associated with each corresponding constraint.

In the first phase, an approximation of the Lagrange multipliers is already obtained.

The block diagram of a two-phase optimization network is shown in Fig. 1.1.

The dynamics that happen in the first phase are in the time range 0 ≤ t ≤ t1 (t1 is the

time instant when the switch is closed connecting the first phase to the second one).

The network operates according to the following dynamics:

dx/dt = −∇ f (x) − s [ ∑i=1..p ∇gi (x)g+i (x) + ∑ j=1..q ∇h j (x)h j (x) ]          (1.3)
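In software, the first-phase dynamics (1.3) can be approximated with a forward-Euler loop. The sketch below is one hypothetical discretization; the function name, step size, penalty value s, and iteration count are arbitrary choices, not values prescribed by the chapter:

```python
import numpy as np

def phase1(Q, c, D, b, H, w, s=100.0, dt=1e-3, steps=20000):
    """Forward-Euler integration of the first-phase dynamics (1.3)
    for the quadratic program (1.1)."""
    x = np.zeros(Q.shape[1])  # multipliers are null, so x(0) is unrestricted
    for _ in range(steps):
        g_plus = np.maximum(0.0, D @ x - b)  # g_i^+(x) = max{0, g_i(x)}
        h_val = H @ x - w
        grad_f = Q @ x + c                   # gradient of (1/2)x^T Qx + c^T x
        # D.T @ g_plus sums grad(g_i) g_i^+(x); H.T @ h_val the h_j terms
        x = x + dt * (-grad_f - s * (D.T @ g_plus + H.T @ h_val))
    return x

# One-dimensional example: min (1/2)x^2 - x subject to x <= 0.5.
# The penalty fixed point is (1 + 0.5 s)/(1 + s), which approaches the
# constrained optimum 0.5 from outside the feasible region as s grows.
x_hat = phase1(np.array([[1.0]]), np.array([-1.0]),
               np.array([[1.0]]), np.array([0.5]),
               np.zeros((0, 1)), np.zeros(0))
```

The slightly infeasible result is exactly the behavior the penalty function theorem predicts for a finite s.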

In the second phase (t ≥ t1 ), the network begins to shift the directional vector sg+i (x) gradually to λi , and sh j (x) to μ j . By imposing a small positive real value ε , the update rate of d λi /dt and d μ j /dt, represented in (1.6) and (1.7), respectively,

is comparatively much slower than that of dx/dt (1.5). Approximation of such dy-

namics is possible by considering λ and μ to be fixed. Then it can be seen that (1.5) is seeking a minimum point of the augmented Lagrangian function La (s, x):

La (s, x) = f (x) + λ T g(x) + μ T h(x) + (s/2) ( ‖g+ (x)‖2 + ‖h(x)‖2 )          (1.4)

In the block diagram of Fig. 1.1, in the first phase, the subsystems within the two

large rectangles do not contribute during t ≤ t1 and in the second phase, when t > t1 ,

the dynamics of the network become:

dx/dt = −∇ f (x) − [ ∑i=1..p ∇gi (x)(sg+i (x) + λi ) + ∑ j=1..q ∇h j (x)(sh j (x) + μ j ) ]          (1.5)


Fig. 1.1 Block diagram of the dynamical system of the Maa and Shanblatt network

d λi (t + Δ t)/dt = ε sg+i (x(t)), for i = 1, . . . , p, and          (1.6)

d μ j (t + Δ t)/dt = ε sh j (x(t)), for j = 1, . . . , q.          (1.7)

A practical value is ε = 1/s according to [14], which leaves the network with just one

adjustment parameter. However, using ε independently of s gives more freedom to

control the dynamics of the network. During the first phase, the Lagrange multipliers

are null, thus there is no restriction on the initial value of x(t).

According to the theorem of penalty function, the solution achieved in the first

phase is not equivalent to the minimum of the function f (x), unless the penalty

parameter s is infinite. In this way, the use of the second phase of optimization is

necessary to any finite value of s. The system reaches equilibrium when:


g+i = 0,
h j = 0, and
∇ f (x) + ∑i=1..p ∇gi (x)λi + ∑ j=1..q ∇h j (x)μ j = 0,          (1.8)

which is identical to the optimality condition of the KT theorem; thus the equilibrium

point of the two-phase network is precisely a global minimum point to a convex

problem (P).
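A matching sketch of the second phase combines (1.5)–(1.7) in a single loop: x follows the fast dynamics while the multipliers drift slowly. The function below and its parameter values (s, ε, step size, iteration count) are illustrative assumptions; the example uses the one-dimensional problem min (1/2)x² − x subject to x ≤ 0.5, whose exact solution is x = 0.5 with multiplier λ = 0.5:

```python
import numpy as np

def phase2(x, Q, c, D, b, H, w, s=100.0, eps=0.05, dt=1e-3, steps=200000):
    """Euler integration of the second-phase dynamics (1.5)-(1.7),
    starting from a first-phase estimate x and null multipliers."""
    lam = np.zeros(D.shape[0])  # lambda_i, one per inequality
    mu = np.zeros(H.shape[0])   # mu_j, one per equality
    for _ in range(steps):
        g_plus = np.maximum(0.0, D @ x - b)
        h_val = H @ x - w
        dx = -((Q @ x + c) + D.T @ (s * g_plus + lam)
               + H.T @ (s * h_val + mu))
        x = x + dt * dx                    # fast update, eq. (1.5)
        lam = lam + dt * eps * s * g_plus  # slow update, eq. (1.6)
        mu = mu + dt * eps * s * h_val     # slow update, eq. (1.7)
    return x, lam, mu

x_star, lam, mu = phase2(np.array([0.505]),
                         np.array([[1.0]]), np.array([-1.0]),
                         np.array([[1.0]]), np.array([0.5]),
                         np.zeros((0, 1)), np.zeros(0))
```

Unlike the first phase alone, x now settles on the exact constrained optimum, and λ converges to the Lagrange multiplier of the active constraint.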

In [12] it is demonstrated that the Kennedy and Chua network for linear and

quadratic programming problems completely satisfies the optimality conditions of KT and the penalty function method. It is also shown that under appropriate conditions

this network is completely stable. Moreover, it is shown that the equilibrium point

happens in the neighborhood of the optimal point of the original problem and that

the distance between them can be made arbitrarily small by selecting a sufficiently large value of the penalty parameter (s).

For problems that cannot tolerate a solution in the infeasible region, due to physical limits of operational amplifiers, a two-phase optimization network model is proposed. In the second phase, we can obtain both the exact solution for these problems and the corresponding Lagrange multipliers associated with each constraint.

The proposed network by Maa and Shanblatt has two attractive features which are

the property of guaranteed global convergence of the mathematical programming

problem and the possibility of physical implementation of the neural network in a

circuit with electrical components where the response time of the dynamic of the cir-

cuit would be imposed by the capacitance in the circuit, thus the convergence time

would be negligible. In spite of these attractive characteristics, the time required

for processing the computational algorithm becomes a barrier in the ANN-based

applications for solving large-scale mathematical programming problems, since it requires the solution of several differential equations. In this regard, problems

with a larger number of variables and constraints will involve a large number of differential equations to be solved. In order to mitigate this problem, heuristic rules were

developed to accelerate the convergence of the computational algorithm involved in

recurrent neural networks.

The combination of recurrent ANN with heuristic rules forms the Hybrid Intelli-

gent System (HIS), in which these two techniques interact and exchange information

with one another until the optimal solution of the problem is achieved.

The basis of the proposed heuristic rules is the dynamical behavior of the neural

networks. Control theory studies in depth the dynamical behavior of a process. With the aid of control theory and of the Lyapunov

theorem for the network [2], it can be stated that from any given initial state of the

state vectors x(0), the network will always change the values of the state variables

xi (t) in the direction in which the value of the Lyapunov function for the network decreases, eventually reaching the global minimum of the programming problem. The trajectory of the variables is

illustrated in Fig. 1.2. It depicts a two-dimensional trajectory (orbit) of a dynamical

system, where it is possible to observe the state variables of the system at certain

time instants (t0 ,t1 , . . . ,t5 ). The dotted vector can be understood as the tendency of

the convergence (indicated by the gradient vector) of the variables in the dynamics

of the system.

Trajectories of the state variables for the same system are exemplified graphically

in Fig. 1.3. These trajectories are distinct due to the fact that the state variables have

different initial states. The dynamics of a recurrent ANN have the same properties and are, therefore, similar to the dynamics shown in Fig. 1.3.

Although the Maa and Shanblatt model deals with a continuous-time recurrent network, in a computational algorithm the iterations are performed in discrete time, since the calculation of the integral equations demands a small, but non-null, step size. Therefore, we have total control over the course of the iterations of the algorithm in the network.

Detailed observations were carried out during tests of the algorithm of the Maa and Shanblatt model, showing that the computational convergence is slow and the trajectories in the state-space of convergence of recurrent networks are smooth and

possibly predictable. We then observed that, under certain conditions, it is not only possible to estimate a point closer to the minimum point of the energy function of the network, but also to estimate a point that leaves the initial orbit of convergence and becomes the initial point of a new orbit of convergence. This new orbit

would have a shorter curvature and, consequently, a smaller Euclidean distance to

the optimal point. In this regard, the number of steps to calculate the convergence

of the computational algorithm can be reduced and, consequently, the time to

compute the equilibrium point of the network.


Fig. 1.3 An illustration of a two-dimensional state (phase) space of a dynamical system and the associated vector field

To achieve the equilibrium point, we use two methods. In the first one, the point

is calculated starting from the evolution of the dynamics in the space-time plane (in this

work, we considered only autonomous systems). In the second method, the calcu-

lation is performed by observing the evolution of the variable in state-space. The

mechanism of these two methods and the way they operate in the proposed HIS are

described as follows.

(TDST)

Consider the convergences of dynamical systems of first order according to the

graphs in Fig. 1.4.

Observing the curves of Fig. 1.4, and noting that time is always increasing, i.e., the next point x1 (t) is always ahead of x1 (t − Δ t), we can conclude that a point closer to convergence would be located outside the internal area of the concavity of the convergence curve, in case the curve behaves as in Fig. 1.4(a). For example,

for t = 0.060, x1 (0.060) ≈ 0.3; in this case, a better estimation to the point would

be x1 (0.061) = 0.45 for Δ t = 0.001s. Restarting the network with this initial state

would generate a convergence curve as illustrated in graph Fig. 1.5. However, this

rule would apply only to curves of types (a) and (c) as shown in Fig. 1.4, since for curves (b) and (d) a better estimate of the point is located inside the internal area of the concavity.

Observing the particularities of the possible curvatures of the convergence curve

in the space-time plane, the following parameters were created, in relation to:

• curvature {curve, straight line};

• concavity, when it exists, {concave downwards, concave upwards};


[Fig. 1.4 shows four panels, (a)–(d), each plotting x1 (t) against t on axes from 0 to 1.]

Fig. 1.4 Dynamical convergences of first order (single variable systems): graphs of evolution

of state variables in time

• variable xi (t) {increasing, decreasing}.

In order to assess these parameters, the network must provide at least three points

(P0 , P1 , P2 ) in the convergence curve. Next, these three points are normalized in the

horizontal axis and also in the vertical axis, in order to avoid problems in the algo-

rithm used to estimate a better point. The chosen normalization equation is presented

in (1.9), where M is the maximum and m is the minimum of the three values to be

normalized, a and b are chosen according to the range of the normalized values. In

this work, the values are normalized into the range [0, 1], so a = 0 and

b = 1. z is the value to be normalized, and zN is the normalized value.

zN = [b(z − m) − a(z − M)] / (M − m)          (1.9)
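Equation (1.9) transcribes directly into code; `normalize` is a hypothetical helper name, applied separately to the horizontal and vertical coordinates of P0 , P1 , P2 :

```python
def normalize(values, a=0.0, b=1.0):
    """Map values into [a, b] via (1.9): zN = (b(z - m) - a(z - M)) / (M - m),
    where M and m are the maximum and minimum of the values."""
    M, m = max(values), min(values)
    return [(b * (z - m) - a * (z - M)) / (M - m) for z in values]
```

With a = 0 and b = 1, the smallest of the three values maps to 0 and the largest to 1.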


[Fig. 1.5 plots x1 (t) against t, with curves labeled "Original Dynamic" and "Advanced Dynamic" and a marker labeled "Predict Point by Heuristic Rule (HIS−1)".]

Fig. 1.5 Action of the HIS to calculate a better point in dynamics evolving over time

From the normalized points, the vectors v1 and v2 are computed, and the following relevant information is obtained from them: the Euclidean norm, the angle (θi ) of each vector in relation to the horizontal axis and, finally, the angle between them (Δ θ = θ2 − θ1 ). Therefore, when there is a spatial curvature over the normalized

points, classification regions (decision regions) are generated as shown in Fig. 1.6.

To understand the illustration in Fig. 1.6 better, consider that the normalized initial

point (P0N ) is always found in the beginning of each area (S4, S5, S6, S7, S8 and

S9). Besides these six possibilities, there are more three others that occur in case of

straightforward convergences where Δ θ is approximately zero.
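Assuming the vectors are built as v1 = P1N − P0N and v2 = P2N − P1N (the construction is not spelled out explicitly in the text), the curvature and concavity tests can be sketched as:

```python
import math

def classify_curvature(p0, p1, p2, tol=1e-3):
    """Classify the convergence shape from three normalized points.

    Builds v1 = p1 - p0 and v2 = p2 - p1, measures each vector's angle
    to the horizontal axis, and takes d_theta = theta2 - theta1.
    |d_theta| ~ 0 means a straight line; the sign of d_theta gives the
    concavity (assumed convention: negative -> concave downwards)."""
    v1 = (p1[0] - p0[0], p1[1] - p0[1])
    v2 = (p2[0] - p1[0], p2[1] - p1[1])
    theta1 = math.atan2(v1[1], v1[0])
    theta2 = math.atan2(v2[1], v2[0])
    d_theta = theta2 - theta1
    if abs(d_theta) < tol:
        return "straight line"
    return "concave downwards" if d_theta < 0 else "concave upwards"

print(classify_curvature((0.0, 0.0), (0.5, 0.7), (1.0, 1.0)))  # → concave downwards
```

The tolerance tol plays the role of the "Δθ approximately zero" test that separates the straight-line regions S1–S3 from the curved regions S4–S9.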

Region 4 (S4) is similar to the beginning of the convergence shown in Fig. 1.4(a), while region 5 (S5) describes a behavior close to the curve formed by the end of the convergence in Fig. 1.4(a) and the beginning of the convergence in Fig. 1.4(b). Region 6 (S6) has a convergence similar to that shown in Fig. 1.4(b).


Region 7 (S7) describes the dynamics of the type shown in Fig. 1.4(d), and region 8 (S8) describes a behavior close to the curve formed by the end of the convergence in Fig. 1.4(c) and the beginning of the convergence in Fig. 1.4(d), while region 9 (S9) represents a behavior close to that shown in Fig. 1.4(c).

The straight-line regime can be of three types: increasing, where the derivative of the curve is positive and not close to zero, corresponding to region 1 (S1); constant, where the derivative of the curve is approximately zero, corresponding to region 2 (S2); and decreasing, where the derivative is negative and not close to zero, corresponding to region 3 (S3). Note that the regimes described by regions S5 and S8 can be considered close to the constant straight-line regime. Thus, we modeled the following heuristic rules:

Rule 1: if <curvature is a straight line> and <variable increases> and <time rating is high or mean> then <Action I>.

Rule 2: if <curvature is a straight line> and <time rating is low> then <Action II>.

Rule 3: if <curvature is a straight line> and <variable decreases> and <time rating is high or mean> then <Action III>.

Rule 4: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave downward> then <Action IV>.

Rule 5: if <curvature is a curve> and <time rating is low> and <concavity is concave downward> then <Action II>.

Rule 6: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave downward> then <Action V>.

Rule 7: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave upward> then <Action V>.

Rule 8: if <curvature is a curve> and <time rating is low> and <concavity is concave upward> then <Action II>.

Rule 9: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave upward> then <Action VI>.

The actions indicated by the rules lead to sub-functions that return a better value for the next initialization point of the network. The straight-line condition implies that either the system is converging very slowly or the step size of the integration algorithm is very small. In this case, the linear function shown in (1.10) can be applied, yielding the vector v3 (v3 = P3 − P2). The rules above are summarized in Table 1.1.


Table 1.1 Description of the actions to be taken by the heuristic rules, according to each decision region

Action  Description                                                            Regions
I       P3N is calculated according to (1.6) with a = a1.                      S1
II      P3N is calculated according to (1.6) with a = a2.                      S2, S5, S8
III     P3N is calculated according to (1.6) with a = a3.                      S3
IV      P3N takes the coordinates of the highest point of the circumference    S4
        that passes through the normalized points P0N, P1N and P2N.
V       P3N takes the coordinates of the rightmost point of the circumference  S6, S7
        that passes through the normalized points P0N, P1N and P2N.
VI      P3N takes the coordinates of the lowest point of the circumference     S9
        that passes through the normalized points P0N, P1N and P2N.

Having the normalized point P3N estimated by the heuristic rules, we need to denormalize it to obtain the value P3. This value will be used to restart the recurrent network.
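Actions IV–VI all need the circle through the three normalized points. The chapter gives no formula for it, so the sketch below uses the standard circumcenter (perpendicular-bisector) expression; the helper names are ours:

```python
def circumcircle(p0, p1, p2):
    """Center (ux, uy) and radius of the circle through three points,
    via the standard determinant (perpendicular-bisector) formula."""
    (ax, ay), (bx, by), (cx_, cy_) = p0, p1, p2
    d = 2.0 * (ax * (by - cy_) + bx * (cy_ - ay) + cx_ * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy_) + (bx**2 + by**2) * (cy_ - ay)
          + (cx_**2 + cy_**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx_ - bx) + (bx**2 + by**2) * (ax - cx_)
          + (cx_**2 + cy_**2) * (bx - ax)) / d
    r = ((ax - ux)**2 + (ay - uy)**2) ** 0.5
    return (ux, uy), r

# Actions IV, V and VI pick, respectively, the highest, rightmost and
# lowest point of this circle:
(cx, cy), r = circumcircle((0.0, 0.0), (1.0, 1.0), (2.0, 0.0))
action_IV = (cx, cy + r)   # highest point
action_V = (cx + r, cy)    # rightmost point
action_VI = (cx, cy - r)   # lowest point
```

For the sample points above the circle has center (1, 0) and radius 1, so the three candidate points are easy to verify by hand.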

Method of Tendency Based on the Dynamics in State-Space (TDSS)

To calculate a better point using the dynamics in state-space, two facts must be pointed out: firstly, the convergence of each variable depends on the convergence of the other variables; secondly, when we are working in the state-space, the variations in the state of the variables are mapped by taking one variable xi as reference, so that the curve is free to evolve in all directions, which does not happen when we are working in the space-time plane. From these facts, and observing the convergence orbits in the state-space shown in Fig. 1.2 and Fig. 1.3, we reached the conclusion that a better point in the state-space must be located inside the concavity of the orbit of the system dynamics.

To calculate such a point in the state-space, we do the following: first we take one of the variables as a reference (for example, x1(t)) and draw n − 1 complex planes (for a system with n variables); thus we have the planes x1(t)0x2(t), x1(t)0x3(t), ..., x1(t)0xn(t). Having the three points P2 = x(t), P1 = x(t − Δt) and P0 = x(t − 2Δt) provided by the network, we can create state vectors in each of the n − 1 complex planes. For example, for the plane x1(t)0x2(t), we have:

v1(t) = (x1(t) + i·x2(t)) − (x1(t − Δt) + i·x2(t − Δt))  (1.11)

v2(t) = (x1(t − Δt) + i·x2(t − Δt)) − (x1(t − 2Δt) + i·x2(t − 2Δt))  (1.12)


From these vectors, we carry out a rotation transformation of the axes using (1.13) and (1.14), according to Fig. 1.7:

v1' = v1 exp(−iθ1)  (1.13)

v2' = v2 exp(−iθ1)  (1.14)

where θ1 is the angle of v1(t); and we also apply a translation transformation using:

x1'(t − 2Δt) = −|v1|
x2'(t − 2Δt) = 0
x1'(t − Δt) = 0
x2'(t − Δt) = 0
x1'(t) = x1(t) − x1(t − 2Δt)
x2'(t) = x2(t) − x2(t − 2Δt)  (1.15)

The rotation and translation transformations facilitate the analysis of the behavior of vector v2 in relation to vector v1. Therefore, heuristic rules can be applied to perform the gain in module and angle of the state vectors, yielding a vector v3' in each of the n − 1 complex planes. Finally, a strategy is created to determine the final value of the reference variable. An effective strategy is to add to the value of the reference variable the average of the increments of this variable calculated in the complex planes.
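Since the state vectors live in a complex plane, the rotation in (1.13)–(1.14) is a single complex multiplication; a minimal sketch (function and variable names are ours):

```python
import cmath

def rotate_to_reference(p0, p1, p2):
    """Build the state vectors v1 = p1 - p0 and v2 = p2 - p1 in the
    complex plane and rotate both by -theta1, where theta1 is the
    angle of v1, as in (1.13)-(1.14). After the rotation, v1' lies
    on the positive real axis."""
    v1 = complex(*p1) - complex(*p0)
    v2 = complex(*p2) - complex(*p1)
    theta1 = cmath.phase(v1)
    rot = cmath.exp(-1j * theta1)
    return v1 * rot, v2 * rot

v1r, v2r = rotate_to_reference((0.0, 0.0), (1.0, 1.0), (1.0, 2.0))
# v1r is now purely real; the angle of v2r equals the original angle
# between v1 and v2 (here +45 degrees).
```

Rotating both vectors by the same −θ1 preserves the angle between them, which is exactly what makes the subsequent heuristic analysis of v2 relative to v1 straightforward.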

Fig. 1.8 shows two pictures associated with two examples of sets of heuristic rules that can be used to produce the vector v3'. In each picture, the point closest to the left of the circumference is P0', the point fixed at the center of the circumference is P1', and the points marked with tiny circles on the circumference symbolize several possibilities for the point P2'. Finally, the results of the heuristic rules, the points P3', are marked with green circles. In order to obtain the final point P3, we apply the inverse translation and rotation transformations to the point P3', thus generating the appropriate value to initialize the recurrent network.

Fig. 1.9 shows an example of the application of the heuristic rules to estimate a better point through the dynamics in the state-space. The external curve represents the dynamics of the recurrent network without the heuristic rules, and the internal one represents the dynamics using the HIS (ANN and TDSS), which is based on the heuristic rules. Note that, in the internal curve, the points marked with circles are the iteration points computed by the network and the points marked with plus signs are the points estimated by the heuristic rules (P3).

Fig. 1.8 Variations of the rules applied to the points P0, P1, P2 to calculate the estimated point P3 [two panels (a) and (b), each plotting X2'(t) against X1'(t)]

Fig. 1.9 Graph of the convergence orbit of the state variables x1 and x2: external curve – ANN (original orbit); internal curve – HIS (advanced orbit, with points computed by the recurrent NN and points predicted by the heuristic rule, HIS-2)


By iteratively applying the ANN and the heuristic rules (HR), the system reduces the curvature of the orbit in the state-space, jumping from one orbit to another until it reaches the solution of the problem (the equilibrium point of the recurrent network). In order to test the proposed hybrid intelligent system, we chose mathematical programming problems previously solved in [16, 31, 32].

The following cases were solved using the HIS, which was implemented in MATLAB. In addition, to solve the differential equations involved in the problem, we implemented the Richardson extrapolation method of level four, which has in its structure the classical fourth-order Runge–Kutta method and uses a fixed integration step size [33].
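The integration scheme can be sketched as follows (one classical RK4 step refined by Richardson extrapolation over two half-steps; the level-four scheme of [33] iterates this idea further, so this is only an approximation of it):

```python
def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step for x' = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def richardson_rk4(f, t, x, h):
    """Richardson extrapolation over one full step and two half steps.
    RK4 has order 4, so the combination below cancels the leading
    local error term."""
    full = rk4_step(f, t, x, h)
    half = rk4_step(f, t + h / 2, rk4_step(f, t, x, h / 2), h / 2)
    return half + (half - full) / (2**4 - 1)

# Example: x' = -x with x(0) = 1, exact solution exp(-t).
x = 1.0
for i in range(10):
    x = richardson_rk4(lambda t, y: -y, i * 0.1, x, 0.1)
# x is now very close to exp(-1) ≈ 0.367879
```

With a fixed step size, the extrapolation raises the effective order of the local error from O(h^5) to O(h^6) at the price of three RK4 evaluations per step.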

Case 1: Linear Programming Problem with Four Variables

A classical linear programming problem [31] was used to choose the parameters

of the developed heuristic rules. In [31] the problem was solved to compare the

performance of several recurrent network models. In this case we deal with the

following problem LP1 :

s.t. x1 + x3 = 40

x2 + x4 = 60

−5x1 + 5x2 ≤ 0

2x1 − 3x2 ≤ 0

x≥0

x ∈ R4 (1.16)

Case 2: Linear Programming Problem with Eleven Variables

With the heuristic rule parameters already defined in case 1, we chose a larger-scale problem to compare the models' performance: a minimum-cost flow problem (LP2) solved in [32]. In this problem there are several nodes (points) representing consumers and suppliers, which are connected through paths (arcs). The aim is to calculate the flow through all paths so as to minimize the total cost, whose value is the sum of the products of the cost and the flow in each arc. The problem can be represented as a graph, as shown in Fig. 1.10, where the amount of flow on a node cannot exceed the capacity of the node. A network is a set of elements called nodes and a set of elements called arcs, each arc eij being an ordered pair (i, j) of distinct nodes i and j. If eij is an arc, then node i is called the tail of eij and node j is called the head of eij. The directed graph shown in Fig. 1.10 is formed of 6 nodes and 11 arcs.

[Fig. 1.10: a directed graph with 6 nodes and 11 arcs: e12, e13, e15, e23, e42, e43, e53, e54, e56, e62, e64]

The cost of each arc is represented by the vector c, its maximum capacity flow by the vector b, and the demands by the vector w. If wi ≤ 0, the node is a supplier (source) and, if wi > 0, the node is a consumer (sink). Suppose, for instance, that we have wT = [−9 4 17 1 −5 −8]. The matrix H is called the incidence matrix of our network. More generally, the incidence matrix of a network with n nodes and m arcs has n rows and m columns. Thus, our matrix H has size 6 × 11 and is formed as follows:

    ⎡ −1 −1 −1  0  0  0  0  0  0  0  0 ⎤
    ⎢  1  0  0 −1  1  1  0  0  0  0  0 ⎥
H = ⎢  0  1  0  1  0  0  1  1  0  0  0 ⎥   (1.17)
    ⎢  0  0  0  0 −1  0  0 −1  1  1  0 ⎥
    ⎢  0  0  1  0  0  0 −1  0 −1  0 −1 ⎥
    ⎣  0  0  0  0  0 −1  0  0  0 −1  1 ⎦

Considering that there are no losses in the network, i.e., everything that is produced is consumed, the sum of all elements wi of the graph is zero. This condition turns the matrix H into a linearly dependent (LD) matrix; in other words, any row can be obtained as a linear combination of the other rows. In order to overcome this problem, we remove one row of the matrix H and one element of the column vector w. Here, the last row of the matrix H was removed, turning this matrix and the vector w into a truncated incidence matrix and a truncated vector, according to [32].

    ⎡ −1 −1 −1  0  0  0  0  0  0  0  0 ⎤
    ⎢  1  0  0 −1  1  1  0  0  0  0  0 ⎥
H = ⎢  0  1  0  1  0  0  1  1  0  0  0 ⎥   (1.18)
    ⎢  0  0  0  0 −1  0  0 −1  1  1  0 ⎥
    ⎣  0  0  1  0  0  0 −1  0 −1  0 −1 ⎦


s.t. −x1 − x2 − x3 = −9
     x1 − x4 + x5 + x6 = 4
     x2 + x4 + x7 + x8 = 17
     −x5 − x8 + x9 + x10 = 1
     x3 − x7 − x9 − x11 = −5
     x ≤ [2 10 10 6 8 7 9 9 10 8 6]T
     x ≥ 0
     x ∈ R11  (1.19)

(The right-hand sides follow Hx = w, with the truncated H of (1.18) and the truncated demand vector w = [−9 4 17 1 −5]T.)
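The constraint structure of LP2 is easy to verify programmatically; a small self-check (the flow vector below is one hand-found feasible point, not the optimum, since the arc costs c of [32] are not reproduced in the text):

```python
# Truncated incidence matrix H (1.18), truncated demands w, capacities b.
H = [
    [-1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 1,  0,  0, -1,  1,  1,  0,  0,  0,  0,  0],
    [ 0,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0],
    [ 0,  0,  0,  0, -1,  0,  0, -1,  1,  1,  0],
    [ 0,  0,  1,  0,  0,  0, -1,  0, -1,  0, -1],
]
w = [-9, 4, 17, 1, -5]
b = [2, 10, 10, 6, 8, 7, 9, 9, 10, 8, 6]

def is_feasible(x):
    """Check flow conservation H x = w plus the bounds 0 <= x <= b."""
    balance = all(sum(h * xi for h, xi in zip(row, x)) == wi
                  for row, wi in zip(H, w))
    bounds = all(0 <= xi <= bi for xi, bi in zip(x, b))
    return balance and bounds

# One feasible flow of the network (found by hand, for illustration):
x = [2, 7, 0, 0, 0, 2, 5, 5, 0, 6, 0]
print(is_feasible(x))  # True
```

A linear-programming solver (e.g. scipy.optimize.linprog with A_eq = H, b_eq = w and the bounds above) would then minimize cᵀx over this feasible set once the cost vector c is supplied.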

Case 3: Quadratic Programming Problem with Three Variables

As an example of quadratic programming (QP), the economic dispatch problem was solved as formulated in [16]. In this problem, the aim is to minimize the total generation cost while meeting the demand of the power system. The formulation contemplates 3 thermal generators (n = 3) connected to a single load.

Defining fi as the generation cost of the i-th generation unit, xi as the power generated by the i-th generation unit, and w as the total power demand of the load, the limits xi,min and xi,max are defined by the physical limitations of the i-th generation unit. The economic power dispatch is expressed as the QP:

(QP) min f(x) = ∑(i=1..n) fi(xi)

s.t. ∑(i=1..n) xi − w = 0
     xmin ≤ x ≤ xmax
     x ∈ Rn  (1.20)

Data were obtained from [16]: xmin = [150 100 50]T in MW, xmax = [600 400 200]T in MW, w = 850 MW, and the following costs for the generator units:

f2(x2) = 310 + 7.85x2 + 0.00194x2²
f3(x3) = 78 + 7.95x3 + 0.00482x3²  (1.21)
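One standard way to solve a dispatch of the form (1.20) is lambda iteration (equal incremental cost); the sketch below uses the given f2 and f3 and a clearly hypothetical placeholder for the coefficients of f1, which are not reproduced in this excerpt:

```python
def dispatch(units, w, tol=1e-9):
    """Economic dispatch by lambda iteration. Each unit is a tuple
    (a, b, c, xmin, xmax) with cost a + b*x + c*x**2. At the optimum,
    unconstrained units run at equal incremental cost lam, so
    x_i = (lam - b_i) / (2*c_i), clipped to the unit limits; lam is
    found by bisection so that total output meets the demand w."""
    def output(lam):
        return [min(max((lam - b) / (2 * c), lo), hi)
                for (_, b, c, lo, hi) in units]
    lo_l, hi_l = 0.0, 100.0
    while hi_l - lo_l > tol:
        lam = (lo_l + hi_l) / 2
        if sum(output(lam)) < w:
            lo_l = lam
        else:
            hi_l = lam
    return output((lo_l + hi_l) / 2)

units = [
    (500.0, 7.9, 0.0016, 150.0, 600.0),    # f1: ASSUMED placeholder only
    (310.0, 7.85, 0.00194, 100.0, 400.0),  # f2, from (1.21)
    (78.0, 7.95, 0.00482, 50.0, 200.0),    # f3, from (1.21)
]
x = dispatch(units, 850.0)
print(x, sum(x))  # the three outputs sum to the 850 MW demand
```

With the actual f1 from [16] in place of the placeholder tuple, this reproduces the standard equal-incremental-cost solution of the three-unit dispatch.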


1.5 Simulations

The parameters chosen to simulate the problems LP1 and LP2 were: an integration step size of 1e−3; the value 100 for the parameter s of the neural network in both the first and the second phase; and the value 1.1 for the parameter e of the network in the second phase. The main results of the LP1 and LP2 simulations are presented in Table 1.2.

Table 1.2 Main results of the simulations of problems LP1 and LP2

                                                        LP1                       LP2
                                               ANN    HIS-1a   HIS-2b    ANN    HIS-1a   HIS-2b
No. points by the ANN (Phase 1)               8316      1000     2391   8670      2965     3099
No. points by the HR (Phase 1)                   –       331      795      –       986     1031
Total no. points (Phase 1)                    8316      1331     3186   8670      3951     4130
Normalized computer processing time (Phase 1) 1.00      0.12     0.29   1.00      0.33     0.35
Time instant when the switch is closed (s) = t1  8.32   1.33     3.19   8.67      3.95     4.13
No. of calculated points by the ANN (Phase 2) 8607      8277     8316   7692      7251     7263
No. of calculated points in both phases      16923      9608    11502  16362     11202    11393
Initial cost (Phase 1) = f(x(0))           −260.00   −260.00  −260.00   0.00      0.00     0.00
Final cost (Phase 1) = f(x(t1))            −741.77   −741.82  −741.78  55.67     55.65    55.63
Final cost (Phase 2) = f(x(tend))          −740.00   −740.00  −740.00  56.00     55.99    56.00

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

The results shown in row 5 of Table 1.2 point out that both proposed hybrid systems (HIS-1 and HIS-2) were able to advance the dynamics of the simulated linear problems efficiently. This greatly reduces the time to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for the problem LP1, the total number of points calculated at the end of the first phase using the ANN alone was 8316, meaning 33264 ODEs were solved; using the HIS-1, only 1000 points, yielding 4000 ODEs, were necessary to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 88% compared to the ANN. For the LP2 problem, this rate was approximately 66%. The rates of computational-effort reduction for problems LP1 and LP2 when comparing the ANN to the HIS-2 were 71% and 64%, respectively.

Figs. 1.11–1.13 present the simulation results for the LP1 problem and Figs. 1.14–1.16 for the LP2 problem.

The parameters chosen to simulate the problem QP were: an integration step size of 1e−2 and the value 50 for the parameter s of the neural network in the first phase. The initial condition used was x(0) = [400 300 150]T.


Fig. 1.11 Dynamics of the problem LP1 obtained by the ANN with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.12 Dynamics of the problem LP1 obtained by the HIS-1 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

The results shown in row 5 of Table 1.3 point out that both proposed hybrid systems were able to advance the dynamics of the simulated quadratic problem efficiently. This greatly reduces the time to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for the problem QP, the total number of points calculated at the end of the first phase using the ANN alone was 172629, meaning 517887 ODEs were solved; using the HIS-1, only 8257 points, yielding 24771 ODEs, were necessary to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 95% compared to the


Fig. 1.13 Dynamics of the problem LP1 obtained by the HIS-2 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.14 Dynamics of the problem LP2 obtained by the ANN with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

ANN. For the HIS-2 on the QP problem, this rate was approximately 70%. Figs. 1.17–1.19 present the simulation results for the problem QP.

All case studies were carried out on the same computer; thus we take the processing time of the ANN in phase 1 as the base to normalize the hybrid cases in the same phase. We point out that the hybrid systems were not used in phase 2. As a result, we have observed that for the LP1 case the HIS-1 took a processing time of 12%, while the HIS-2 took 29%; for the LP2 case the HIS-1 took 33%, while the HIS-2 took 35%; and for the QP case the HIS-1 took 6%, while the HIS-2 took 45%. It is important


Fig. 1.15 Dynamics of the problem LP2 obtained by the HIS-1 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.16 Dynamics of the problem LP2 obtained by the HIS-2 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

to note that the efficiency of the heuristic rules varies according to the type of problem. In this work, the results showed that the HIS-1 yielded the better performance, specifically in the LP1 and QP problems. Therefore, we observed a decrease in processing time produced by the implemented heuristic rules, together with a reduction in the number of ODEs computed. These rules estimate the next value of each variable of the problem throughout the convergence. We highlight that, as the ODE solver also calculates points during the application of the HIS, the proposed systems can correct themselves in the case of an incorrect estimate, which makes them resilient.


Table 1.3 Main results of the simulation of problem QP

                                                       QP
                                               ANN      HIS-1a    HIS-2b
No. points by the ANN (Phase 1)             172629       8257     51900
No. points by the HR (Phase 1)                   –       2750     17299
Total no. points (Phase 1)                  172629      11007     69199
Normalized computer processing time (Phase 1) 1.00       0.06      0.45
Time instant when the switch is closed (s) = t1  1726.29  110.07   691.99
Final cost (Phase 1) = f(x(t1))           22680.05   22680.05  22680.05

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

Fig. 1.17 Dynamics of the problem QP obtained by the ANN with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

1.6 Conclusion

In this paper, two Hybrid Intelligent Systems have been proposed. These systems combine the Maa and Shanblatt network with heuristic rules. The Maa and Shanblatt network is a two-phase recurrent neural network that provides the exact solution of linear and quadratic programming problems. Compared to conventional linear and nonlinear optimization techniques, the two-phase network formulation is advantageous because no matrix inversion is required. The main aim of the proposed HIS is to increase the speed of convergence towards the optimal point, which is guaranteed by the ANN. In the cases presented, the optimal convergence


Fig. 1.18 Dynamics of the problem QP obtained by the HIS-1 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.19 Dynamics of the problem QP obtained by the HIS-2 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

was reached. The proposed systems thus have both advantages. The simulation analyses show a reduction in the computational effort of approximately 95% compared to the ANN in the QP case solved in this paper, with guaranteed optimal convergence and without inverting matrices. The proposed HIS was developed with a view to solving large-scale operational planning problems, which may be addressed in future works: that is, a large-scale economic power dispatch, with the scheduling of hydro, thermal and wind power plants to minimize the overall production cost while satisfying the load demand in the mid-term operation planning of hydrothermal generation systems. In future works, we will propose the combination of these heuristic rules and/or their application in the second phase of the Maa and Shanblatt method.


References

1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computa-

tional abilities. Proc. Natl. Acad. Sci. USA 79, 2552–2558 (1982)

2. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall, USA

(1999)

3. Hopfield, J.J.: Neurons with graded response have collective computational properties

like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092 (1984)

4. Hopfield, J.J.: Learning algorithms and probability distributions in feed-forward and

feed-back networks. Proc. Natl. Acad. Sci. USA 84, 8429–8433 (1987)

5. Hopfield, J.J.: The effectiveness of analogue neural network hardware. Network: Com-

putation in Neural Systems 1(1), 27–40 (1990)

6. Hopfield, J.J., Feinstein, D.I., Palmer, R.G.: Unlearning has a stabilizing effect in collec-

tive memories. Nature 304, 158–159 (1983)

7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problem.

Biological Cybernetics 52, 141–152 (1985)

8. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233(8),

625–633 (1986)

9. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter,

Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. on Circuits and

Systems 33(5), 533–541 (1986)

10. Ludemir, T.B., Braga, A.P., Carvalho, A.C.P.L.F.: Redes Neurais Artificiais: Teoria e Aplicações, 1st edn. LTC – Livros Técnicos e Científicos Editora S.A., Rio de Janeiro (2000)

11. Pyne, I.B.: Linear Programming on an electronic analogue computer. Trans. AIEE. Part

I (Comm. & Elect.) 75, 139–143 (1956)

12. Kennedy, M.P., Chua, L.O.: Unifying Tank and Hopfield Linear Programming Circuit

and the Canonical Nonlinear Programming Circuit of Chua and Lin. IEEE Trans. on

Circuits and Systems 34(2), 210–214 (1987)

13. Chiu, C., Maa, C.Y., Shanblatt, M.A.: An artificial neural network algorithm for dynamic

programming. Int. J. Neural Syst. 1(3), 211–220 (1990)

14. Maa, C.Y., Shanblatt, M.A.: A Two-Phase Optimization Neural Network. IEEE Trans-

actions on Neural Networks 3(6), 1003–1009 (1992)

15. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans.

on Circuits and Systems 35(5), 210–220 (1988)

16. Maa, C.Y., Shanblatt, M.A.: Linear and Quadratic Programming Neural Network Anal-

ysis. IEEE Transactions on Neural Networks 3(4), 580–594 (1992)

17. Chiu, C., Maa, C.Y., Shanblatt, M.A.: Energy Function Analysis of Dynamic Program-

ming Neural Networks. IEEE Transactions on Neural Networks 2(4) (July 1991)

18. Xia, Y.S.: A New Neural Network for Solving Linear and Quadratic Programming Prob-

lems. IEEE Transactions on Neural Networks 7(6), 1544–1547 (1996)

19. Tao, Q., Cao, J.D., Xue, M.S., Qiao, H.: A High Performance Neural Network for Solv-

ing Nonlinear Programming Problems with Hybrid Constraints. Phys. Lett. A 288(2),

88–94 (2001)

20. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Op-

timization Neural Networks. IEEE Transactions on Neural Networks 9(6), 1331–1343

(1998)


21. Xia, Y.S., Wang, J.: A Recurrent Neural Network for Solving Nonlinear Convex Pro-

grams Subject to Linear Constraints. IEEE Transactions on Neural Networks 16(2),

379–386 (2005)

22. Dieu, V.N., Ongsakul, W.: Enhanced Merit Order and Augmented Lagrange Hop-

field Network for Hydrothermal Scheduling. Electrical Power and Energy Systems 30,

93–101 (2008)

23. Naresh, R., Dubey, J., Sharma, J.: Two-phase Neural Network Based Modeling Frame-

work of Constrained Economic Load Dispatch. IEE Proc. Gener. Transm. Distrib. 151(3)

(May 2004)

24. Aquino, R.R.B.: Recurrent Artificial Neural Networks: an application to optimization of

hydro thermal power systems (in Portuguese), Ph.D. Thesis, COPELE/UFPE, Campina

Grande, Brazil (January 2001)

25. Rosas, P., Aquino, R.R.B., et al.: Study of Impacts of a Large Penetration of Wind Power

and Distributed Power Generation as a Whole on the Brazilian Power System. In: Euro-

pean Wind Energy Conference (EWEC), London (November 2004)

26. Witten, I.H., Frank, E.: Data Mining Practical Machine Learning Tools and Techniques,

2nd edn. Morgan Kaufmann, San Francisco (2005)

27. Mitra, S., Mitra, M., Chaudhuri, B.B.: Pattern Defined Heuristic Rules and Directional

Histogram Based Online ECG Parameter Extraction. Measurement 42, 150–156 (2009)

28. Tuncel, G.: A Heuristic Rule-Based Approach for Dynamic Scheduling of Flexible Man-

ufacturing Systems. In: Levner, E. (ed.) Multiprocessor Scheduling: Theory and Appli-

cations, December 2007, p. 436. Itech Education and Publishing, Vienna (2007)

29. Baykasoglu, A., Ozbakir, L., Dereli, T.: Multiple Dispatching Rule Based Heuristic for

Multi-Objective Scheduling of Job Shops Using Tabu Search. In: Proceedings of MIM

2002: 5th International Conference on Managing Innovations in Manufacturing (MIM)

Milwaukee, Wisconsin, USA, September 9-11, pp. 1–6 (2002)

30. Idris, N., Baba, S., Abdullah, R.: Using Heuristic Rules from Sentence Decomposition

of Experts Summaries to Detect Students Summarizing Strategies. International Journal

of Human and Social Sciences 2, 1 (Winter 2008), www.waset.org

31. Zak, S.H., Upatising, V., Hui, S.: Solving Linear Programming Problems with Neural

Networks: A Comparative Study. IEEE Transactions on Neural Networks 6(1), 94–104

(1995)

32. Chvatal, V.: Linear Programming. W.H. Freeman and Company, New York (1983)

33. Lastman, G.J., Sinha, N.K.: Microcomputer-Based Numerical Methods for Science and Engineering. Saunders College Publishing, USA (1988)

Chapter 2

A Novel Optimization Algorithm Based on

Reinforcement Learning

problems with hard-to-evaluate objective functions. It uses the reinforcement-learning principle to determine the particle moves in the search for the optimum. A model of successful actions is built, and future actions are based on past experience. The step increment combines exploitation of the known search path with exploration for an improved search direction. The algorithm does not require any prior knowledge of the objective function, nor does it require any characteristics of such a function. It is simple, intuitive, and easy to implement and tune. The optimization algorithm was tested using several multi-variable functions and compared with other widely used random-search optimization algorithms. Furthermore, the training of a multi-layer perceptron, to find a set of optimized weights, is treated as an optimization problem. The optimized multi-layer perceptron was applied to Iris database classification. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.

2.1 Introduction

Optimization is the process of finding the maximum or the minimum value of a function within given constraints by changing the values of its multiple variables. It can be essential for solving complex engineering problems in such areas as computer science, aerospace, machine intelligence applications, etc. When the analytical relation

Janusz A. Starzyk · Yinyin Liu

Ohio University, School of Electrical Engineering and Computer Science, U.S.A.

e-mail: starzyk@bobcat.ent.ohiou.edu,yliu@bobcat.ent.ohiou.edu

Sebastian Batog

Silesian University of Technology, Institute Of Computer Science, Poland

e-mail: sebastian.batog@gmail.com

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 27–47. springerlink.com © Springer-Verlag Berlin Heidelberg 2010

28 J.A. Starzyk, Y. Liu, and S. Batog

between the variables and the objective function value is explicitly known, analyti-

cal methods, such as Lagrange multiplier methods [1], interior point methods [18],

Newton methods [30], gradient descent methods [25], etc., can be applied. How-

ever, in many practical applications, analytical methods do not apply. This happens when the objective functions are unknown, when the relations between the variables and the function value are not given or are difficult to find, when the functions are known but their derivatives are not available, or when the optimum value of the function cannot be verified. In these cases, iterative search processes are required to find the function optimum.

Direct search algorithms [10] contain a set of optimization methods that do not

require derivatives and do not approximate either the objective functions or their

derivatives. These algorithms find locations with better function values following a

search strategy. They only need to compare the objective function values in succes-

sive iterative steps to make the move decision. Within the category of direct search,

distinctions can be made among three classes including pattern search methods [28],

simplex methods [6], and adaptive sets of search directions [23]. In pattern search

methods, the variables of the function are varied either by steps of predetermined magnitude, or the step sizes are reduced by a uniform factor [15]. Simplex methods construct a simplex in ℜN using N+1 points and use the simplex to drive the

search for optimum. The methods with adaptive sets of search directions, proposed

by Rosenbrock [23] and Powell [21], construct conjugate directions using the infor-

mation about the curvature of the objective function during the search.

In order to avoid local minima, random search methods are developed utilizing

randomness in setting the initial search points and other search parameters like the

search direction or the step size. In Optimized Step-Size Random Search (OSSRS)

[24], the step size is determined by fitting a quadratic function for the optimized

function values in each of the random directions. The random direction is generated

with a normal distribution of a given mean and standard deviation. Monte-Carlo op-

timizations adopted randomness in the search process to generate the possibilities

to escape from the local minima. Simulate Annealing (SA) [13] is one typical kind

of Monte-Carlo algorithm. It exploits the analogy between the search for a mini-

mum in the optimization problem and the annealing process in which a metal cools

and stabilizes into a minimum energy crystalline structure. It accepts the move to a

new position with worse function value with a probability, which is controlled by

the ”temperature” parameter, and the probability decreases along the ”cooling pro-

cess”. SA can deal with highly nonlinear, chaotic problems provided that the cooling

schedule and other parameters are carefully tuned.
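As an illustration, the acceptance rule described above can be sketched as follows; the function name and the decision to always accept improvements are our own conventions, since the chapter does not prescribe a particular SA implementation:

```python
import math
import random

def sa_accept(current_value, new_value, temperature, rng=random):
    # Metropolis acceptance rule: always accept an improvement; accept a
    # worse value with probability exp(-delta/T), which shrinks as T cools.
    if new_value < current_value:
        return True
    delta = new_value - current_value
    return rng.random() < math.exp(-delta / temperature)
```

At high temperature almost any move is accepted; as the temperature drops, only improving moves survive, which is the "cooling process" mentioned above.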

Particle Swarm Optimization (PSO) [11] is a population-based evolutionary com-

putational algorithm. It exploits the cooperation within the solution population in-

stead of the competition among them. At each iteration in PSO, a group of search

particles make moves in a mutually coordinated fashion. The step size of a particle is

a function of both the best solution found by that particle and the best solution found

so far by all the particles in the group. The use of a population of search particles

and the cooperation among them enable the algorithm to evaluate function values in

a wide range of variables in the input space and to find the optimum position. Each

2 A Novel Optimization Algorithm Based on Reinforcement Learning 29

particle only remembers its best solution and the global best solution of the group

to determine its step sizes.
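The coordinated step described above is commonly written as the velocity update rule; the inertia and acceleration coefficients used below are typical textbook defaults, not parameters given in this chapter:

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    # One PSO move: the step (velocity) blends the particle's momentum with
    # attraction toward its own best position and the group's best position.
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(len(x)), rng.random(len(x))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```

Each particle thus needs to remember only its own best solution and the global best of the group, as noted above.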

Generally, during the course of the search, these optimization methods make a sequence of decisions on the step sizes and obtain a number of function values. To implement an efficient search for the optimum point, it is desirable that such historical information be utilized in the optimization process.

Reinforcement Learning (RL) [27] is a type of learning process to maximize cer-

tain numerical values by combining exploration and exploitation and using rewards

as learning stimuli. In the reinforcement learning problem, the learning agent per-

forms the experiments to interact with the unknown environment and accumulate

the knowledge during this process. It is a trial-and-error exploratory process with

the objective to find the optimum action. During this process, an agent can learn

to build the model of the environment to instruct its search, so that the agent can

predict the environment’s response to its actions and choose the most useful actions

for its objectives based on its past exploring experience.

Surrogate-based optimization refers to the idea of speeding up the optimization process by using surrogates for the objective and constraint functions. The surrogates also allow for the optimization of problems with non-smooth or noisy responses, and can provide insight into the nature of the design space. The max-min SAGA approach [20] searches for designs that have the best worst-case performance in the presence of parameter uncertainty. By leveraging a trust-region approach that uses computationally cheap surrogate models, it allows for the possibility of achieving robust design solutions on a limited computational budget.

Another example of a surrogate based optimization is the surrogate assisted

Hooke-Jeeves (SAHJA) algorithm [8] which can be used as a local component of a

global optimization algorithm. This local searcher uses the Hooke-Jeeves method,

which performs its exploration of the input space intelligently, employing both the real fitness function and an approximated one.

The idea of building knowledge about an unknown problem through exploration

can be applied in the optimization problems. To find the optimum of an unknown

multivariable function, an efficient search procedure can be performed using only

historical information from conducted experiments to expedite the search. In this

chapter, a novel and efficient optimization algorithm based on reinforcement learn-

ing is presented. This algorithm uses simple search operators and will be called

reinforcement learning optimization (RLO) in the later sections. It does not require

any prior knowledge of the objective function or function’s gradient information,

nor does it require any characteristics of the objective function. In addition, it is

conceptually very simple and easy to implement. This approach to optimization

is compatible with neural networks and learning through interaction; thus it is useful for systems of embodied intelligence and motivated learning, as presented in

[26]. The following section presents the RLO method and illustrates it within several

machine learning applications.


2.2.1 Basic Search Procedure

An N-variable optimization objective function V = f(p1, p2, ..., pN) could have several local minima and several global minima Vopt1, ..., VoptN. It is

desired that the search process, initiated from a random point, finds a path to the

global optimum point. Unlike particle swarm optimization [11], this process can be

performed with a single search particle that learns how to find its way to the opti-

mum point. It does not require the cooperation among a group of particles, although

implementing the cooperation among several search particles may further enhance

the search process in this method.

At each point of the search, the search particle intends to find a new location

with a better value within a searching range around it and then determines the di-

rection and the step size for the next move. It tries to reach the optimum by explor-

ing weighted random search of each variable (coordinate). The step size of search

in each variable is randomly generated with its own probability density function.

These functions are gradually learned during the search process. It is expected that

at the later stage of search, the probability density functions are approximated for

each variable. Then the stochastically randomized path to the minimum point of the

function from the start point is learned.

The step sizes of all the coordinates determine the center of the new searching

area and the standard deviations of the probability functions determine the size of

the new searching area around the center. In the new searching area, several loca-

tions PS are randomly generated. If there is a location p’ with better value than the

current one, the search operator moves to it. From this new location, new step sizes

and new searching range are determined, so that the search for optimum continues.

If in the current searching area, there is no point with better value that the search

particle can move to, another set of random points are generated until no improve-

ment is obtained after several, say M, trials. Then the searching area size and step

sizes are modified in order to find a better function value. If no better value is found

after K trials of generating different searching areas or the proposed stopping crite-

rion is met, we can claim that the optimum point has been found. The algorithm for searching for the minimum point is shown schematically in Figure 2.1.

2.2.2 Approximation

After the search particle makes a sequence of n moves, the step sizes of these moves

d pti (t = 1, 2, ..., n; i = 1, 2, ..., N) are available for learning. These historical steps

have made the search particle move towards better values of the objective function

and hopefully get closer to the optimum location. In this sense, these steps are the


Fig. 2.1 The algorithm of RLO searching for the minimum point

successful actions during the trial. It is proposed that the successful actions which result in a positive reinforcement (the step sizes of each coordinate) follow a function of the iterative steps t,

\[ dp^i = f^i(t) \quad (i = 1, 2, \ldots, N), \qquad (2.1) \]

where dpi represents the step size on the ith coordinate and f^i(t) is the function for coordinate i.

These unknown functions f i (t) can be approximated, for example, using polynomi-

als through the least-squared fit (LSF) process.

\[
\begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^B \\
\vdots & \vdots & \vdots & & \vdots \\
1 & t_n & t_n^2 & \cdots & t_n^B
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} dp_1^i \\ \vdots \\ dp_n^i \end{Bmatrix}
\qquad (2.2)
\]

In (2.2), the step sizes dp_1^i to dp_n^i are the step sizes on a certain coordinate during the n steps, and they are fitted as unknown function values using polynomials of order B. The polynomial coefficients a_0 to a_B can be obtained and will represent the function f^i(t) used to estimate dpi,

\[ dp^i = \sum_{j=0}^{B} a_j t^j. \qquad (2.3) \]

Using polynomials for function approximation is easy and efficient. However, considering the characteristics of optimization problems, we have two concerns. First, in order to generate a good approximation while avoiding overfitting, a proper order of the polynomial must be selected. In the optimized approximation algorithm (OAA) presented in [17], the goodness of fit is determined by the so-called signal-to-noise ratio figure (SNRF). Based on SNRF, an approximation stopping criterion


was developed. Using a certain set of basis functions for approximation, the error signal, computed as the difference between the approximated function and the sampled data, can be examined by SNRF to determine how much useful information it contains. The SNRF for the error signal, denoted SNRF_e, is compared to the pre-calculated SNRF for white Gaussian noise (WGN), denoted SNRF_WGN. If SNRF_e is higher than SNRF_WGN, more basis functions should be used to improve the learning. Otherwise, the error signal shows the characteristics of WGN and should not be reduced any further, to avoid fitting the noise; the obtained approximation is then the optimum function. Such a process can be applied to determine the proper order of the polynomial.

The second concern is that in the case of reinforcement learning, the knowledge about the originally unknown environment is gradually accumulated throughout the learning process. The information that the learning system obtains at the beginning of the process is mostly based on initially random exploration. During the

process of interaction, the learning system collects the historical information and

builds the model of the environment. The model can be updated after each step of

interaction. The decisions made at the later stages of the interaction are more based

on the built model rather than a random exploration. This means that the recent re-

sults are more important and should be weighted more heavily than the old ones.

For example, the weights applied can be exponentially increasing from the initial

trials to the recent ones, as

\[ w_t = \frac{\alpha^t}{n} \quad (t = 1, 2, \ldots, n), \qquad (2.4) \]

where we define α so that α^n = n. As a result, the weights lie in the interval (0, 1],

and weight is 1 for the most recent sample. Applying the weights in the LSF, we

have the weighted least-squared fit (WLSF), expressed as follows:

\[
\begin{bmatrix}
w_1 & w_1 t_1 & w_1 t_1^2 & \cdots & w_1 t_1^B \\
\vdots & \vdots & \vdots & & \vdots \\
w_n & w_n t_n & w_n t_n^2 & \cdots & w_n t_n^B
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} w_1\, dp_1 \\ \vdots \\ w_n\, dp_n \end{Bmatrix}
\qquad (2.5)
\]

Due to the weights applied to the given samples, the approximated function will fit

to the recent data better than to the old ones.
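A minimal sketch of this weighted fit in Python follows Eq. (2.5) directly: each Vandermonde row and each target value is scaled by its weight from Eq. (2.4) before solving the least-squares system. The function name is ours:

```python
import numpy as np

def weighted_lsf(t, dp, B, alpha=None):
    # Weighted least-squared fit of step sizes dp(t) by a degree-B polynomial.
    n = len(t)
    if alpha is None:
        alpha = n ** (1.0 / n)            # alpha^n = n, so w_n = 1 (Eq. 2.4)
    w = alpha ** np.arange(1, n + 1) / n  # exponentially increasing weights
    V = np.vander(t, B + 1, increasing=True)   # columns: 1, t, t^2, ..., t^B
    # Scale rows and targets by w_t, as in Eq. (2.5), then solve for a_0..a_B.
    a, *_ = np.linalg.lstsq(V * w[:, None], dp * w, rcond=None)
    return a
```

Because the weights grow toward the most recent sample, the returned coefficients favor the latest step sizes, as described above.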

To utilize the concept of OAA to obtain an optimized WLSF, the SNRF for the error signal or for WGN has to be estimated considering the sample weights. In the original OAA for the one-dimensional problem [17], the SNRF for the error signal was calculated as

\[ SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})}, \qquad (2.6) \]

where C represents the correlation calculation, e_j represents the error signal (j = 1, 2, ..., n), and e_{j-1} represents the circularly shifted version of e_j. The characteristics


of SNRF for WGN, expressed through the average value and the standard deviation, can be estimated from Monte-Carlo simulation as (see the derivation in [17])

\[ \mu_{SNRF_{WGN}}(n) = 0, \qquad (2.7) \]

\[ \sigma_{SNRF_{WGN}}(n) = \frac{1}{\sqrt{n}}. \qquad (2.8) \]

Then the threshold, which determines whether SNRF_e shows the characteristics of SNRF_WGN so that the fitting error should not be reduced further, is

\[ th_{SNRF_{WGN}}(n) = \mu_{SNRF_{WGN}}(n) + 1.5\,\sigma_{SNRF_{WGN}}(n) = \frac{1.5}{\sqrt{n}}. \qquad (2.9) \]

For the weighted approximation, the SNRF for the error signal is calculated as

\[ SNRF_e = \frac{C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j,\; e_j \cdot w_j) - C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}. \qquad (2.10) \]

In this case, \(\sigma_{SNRF_{WGN}}(n)\) can be estimated as

\[ \sigma_{SNRF_{WGN}}(n) = \frac{2}{\sqrt{n}}. \qquad (2.11) \]

It is found that the 5% significance level can be approximated by the average value plus 1.5 standard deviations for an arbitrary n. Fig. 2.2(b) illustrates the histogram of SNRF_WGN with 2^16 samples, as an example. The threshold in this case of a dataset with 2^16 samples can be calculated as μ + 1.5σ = 0 + 1.5 × 0.0078 = 0.0117.
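The unweighted criterion of Eq. (2.6) can be sketched as follows; taking C(x, y) as the inner product of mean-removed signals is our assumption, since the chapter does not spell out the correlation calculation:

```python
import numpy as np

def snrf_error(e):
    # SNRF of an error signal per Eq. (2.6): correlation of the signal with its
    # circularly shifted copy, relative to the uncorrelated remainder.
    e = np.asarray(e, float) - np.mean(e)     # assumed: mean-removed signals
    c_auto = e @ e                            # C(e_j, e_j)
    c_shift = e @ np.roll(e, 1)               # C(e_j, e_{j-1}), circular shift
    return c_shift / (c_auto - c_shift)

def snrf_threshold(n):
    # Detection threshold: mean 0 plus 1.5 standard deviations, Eqs. (2.7)-(2.9).
    return 1.5 / np.sqrt(n)
```

A smooth, information-bearing residual yields a large SNRF_e, while a noise-like residual stays near zero and below the threshold, signaling that the fit should not be refined further.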

Therefore, to obtain an optimized weighted approximation in one-dimensional

case, the following algorithm is performed.

Step (2.1). Assume that an unknown function F, with input t ∈ ℜ, is described by n training samples dp_t (t = 1, 2, ..., n).

Step (2.2). The signal detection threshold is pre-calculated for the given number of samples n based on SNRF_WGN. For a one-dimensional problem,

\[ th_{SNRF_{WGN}}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}. \]

Step (2.3). Take a set of basis functions, for example, polynomials of order from 0 up to order B.

Step (2.4). Use these B+1 basis functions to obtain the approximated function

\[ \hat{dp}_t = \sum_{l=1}^{B+1} f_l(t) \quad (t = 1, 2, \ldots, n). \qquad (2.12) \]


Step (2.5). Compute the error signal as the difference between the sampled data and the approximated function, e_t = dp_t − \(\hat{dp}_t\).

Step (2.6). Calculate SNRF_e for the weighted error signal using (2.10).

Step (2.7). Compare SNRF_e with th_SNRF_WGN. If SNRF_e is equal to or less than th_SNRF_WGN, or if B exceeds the number of samples, stop the procedure. In that case F̂ is the optimized approximation. Otherwise, add one basis function, in this example increasing the order of the approximating polynomial to B+1, and repeat Steps (2.4)-(2.7).

Using the above algorithm, the proper order of the polynomial is determined to extract the useful (but not the noise) information from the historical data. Also, the extracted information will fit the recent results better than the old ones. We illustrate this process of learning from historical information by considering a 2-variable function as an example.

Example

The function V(p_1, p_2) = p_2^2 sin(1.5 p_2) + 2 p_1^2 sin(2 p_1) + p_1 sin(2 p_2) has several local minima, but only one global minimum, as shown in Fig. 2.3. In the process of interaction, the historical information after each iteration is collected. The historical step sizes of the two coordinates are approximated separately, as shown in Fig. 2.4 (a) and 2.4 (b). The step sizes of the two coordinates are approximated by quadratic polynomials whose order is determined by OWAA and whose coefficients are obtained using WLSF. In Fig. 2.4, the approximated functions are compared with the quadratic polynomials whose coefficients are obtained from LSF. Again, it is observed that the function obtained using WLSF fits the data in later iterations more closely than the function obtained using LSF.

The level of the approximation error signal e_t for the step sizes of a certain coordinate dpi, which is the difference between the observed sampled data and the approximated function, can be measured by its standard deviation, as shown in (2.14):

\[ \sigma_{p^i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2}. \qquad (2.14) \]

This standard deviation will be called the approximation deviation in the follow-

ing discussion. It represents the maximum deviation of the location of the search

particle from the prediction by the approximated function in the unknown function

optimization problem.


The approximated functions will be used to determine the step sizes for the next iteration, as shown in (2.15) and in Fig. 2.5 along with the approximated functions:

\[ dp^i_{t+1} = f^i(t+1). \qquad (2.15) \]

Fig. 2.5 Prediction of the step sizes for the next iteration

The step size functions are the model of environment that the learning system builds

during the process of interaction based on historical information. The future step

size determined by such model can be employed as exploitation of the existing

model. However, such model built during the learning process cannot be treated

as exact. Besides exploitation which best utilizes the obtained model, exploration

is desired to a certain degree in order to improve the model and discover better

solutions. The exploration can be implemented using Gaussian random generator

(GRG). As a good trade-off between exploitation and exploration is needed, we pro-

pose to use the step sizes for the next iteration determined by the step size functions

as the mean value and the approximation deviation as the standard deviation of the

random generator. Gaussian random generators give several random choices of the

step sizes. Effectively, the determined step sizes of multiple coordinates generate

the center of the searching area, and the size of the searching range is determined

by the standard deviations of GRG for the coordinates. The multiple random values

generated by GRG for each coordinate effectively create multiple locations within

the searching area. The objective function values of these locations will be com-

pared and the location with the best value, called current best location, will be

chosen as the place from which the search particle will continue searching in the

next iteration. Therefore, the actual step sizes are calculated using the distance from

the "previous best location" to the "current best location". The actual step sizes will be added to the historical step sizes and used to update the model of the unknown environment.
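One exploration step of this scheme can be sketched as follows; the function name and parameter defaults are ours (the chapter uses Ps = 10 random points in its experiments):

```python
import numpy as np

def propose_move(p, dp, sigma, objective, Ps=10, rng=None):
    # One RLO exploration step: generate Ps candidate locations from a Gaussian
    # with the predicted step sizes dp as the mean and the approximation
    # deviations sigma as the standard deviation; move only on improvement.
    rng = rng or np.random.default_rng()
    steps = rng.normal(loc=dp, scale=sigma, size=(Ps, len(p)))
    candidates = p + steps
    values = np.array([objective(c) for c in candidates])
    best = values.argmin()
    if values[best] < objective(p):       # move only to a better location
        return candidates[best], values[best]
    return p, objective(p)                # no improvement in this trial
```

The distance between the returned location and the previous one is the actual step size that gets appended to the history for model learning.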


This process is illustrated in Fig. 2.6 using a 2-variable function as an example. The search particle was located at the previous best location p_prev(p1_prev, p2_prev), and the previous step size was found as dp_prev(dp1_prev, dp2_prev) after the current best location p(p1, p2) was found as the best location in the previous searching area (an area containing p(p1, p2), not shown in the figure). At the current best position p(p1, p2), using the environment model built with the historical step sizes, the current step size is determined to be dp1 on coordinate 1 and dp2 on coordinate 2, so that the center of the searching area is determined. The approximation deviations of the two coordinates, σp1 and σp2, give the size of the searching range. Within the searching range, several random points are generated in order to find a better position to which the search operator will move.

The search particle moves from every "previous best location" to the "current best location", and the step sizes actually taken are used for model learning. As new step sizes are generated, the search particle is expected to move to locations with better objective function values. In the proposed algorithm, the search particle only makes a move when a location with a better function value is found.

However, if all the points generated in the current searching range have no better

function values than the current best value, the search particle does not move and

the GRG will repeat generating groups of particle locations for several trials. If no

better location is found after M trials, we suspect that the current searching range

is too small or the current step size is too large, which makes us miss the locations

with better function values. In such case, we should enlarge the size of the searching

area, and reduce the step size, as in (2.16),

38 J.A. Starzyk, Y. Liu, and S. Batog

σ pi = α σ pi

(i = 1, 2, ..., N), (2.16)

d pi = ε d pi

where α > 1 and ε < 1. If this new search is still not successful, the searching range and the step size will continue changing until some points with better function values are found. If, at a certain step of the search process, the current step size is reduced so much that the search particle cannot move anywhere, it indicates that the optimum point has been reached. The stop criterion can be defined by the current step size becoming β times smaller than the previous step size, as

\[ \left| dp^i_{t+1} \right| < \beta \left| dp^i_t \right| \quad (i = 1, 2, \ldots, N). \qquad (2.17) \]

Based on previous discussion, the proposed optimization algorithm (RLO) can be

described as follows.

(a). The procedure starts from a random point of the objective function with N variables, V = f(p1, p2, ..., pN). It will try to make a series of moves to get closer to the global optimum point.

(b). To change from the current location, the step size dpi (i = 1, 2, ..., N) and the standard deviation σpi (i = 1, 2, ..., N) for each coordinate are generated from a uniform probability distribution.

(c). The step sizes dpi (i=1, 2, . . . N) determine the center of the searching area.

The deviations of all the coordinates σ pi (i = 1, 2, ..., N) determine the size of the

searching area. Several points Ps in this range are randomly chosen from a Gaussian distribution using dpi as the mean values and σpi as the standard deviations.

(d). The objective function values are evaluated at these new points. Compare the

objective function values on random points with that at the current location.

(e). If the new points generated in Step (c) have no better values than the current position, Step (c) is repeated for up to M trials, until a point with a better function value is found.

(f). If the search fails after M trials, enlarge the size of the searching area, and reduce

the step size, as in (2.16).

(g). If the search with the updated searching-area size and step sizes from Step (f) is not successful, the range and the step size keep being adjusted until either some points with better values are found, the current step sizes are much smaller than the previous step sizes as in (2.17), or the function value changes by less than a pre-specified threshold. If either of the latter two conditions occurs, the algorithm terminates, which indicates that the optimum point has been reached.

(h). Move the search particle to the point p with the best function value V_b (a local minimum or maximum, depending on the optimization objective). The distance between the previous best point p_prev and the current best point p gives the actual step sizes dpi (i = 1, 2, ..., N). Collect the historical information of the step sizes taken during the search process.


(i). Approximate the step sizes as a function of the iterative steps using the weighted least-squared fit, as in (2.5). The proper maximum order of the basis functions is determined using the SNRF described in Section 2.2.2 to avoid overfitting.

(j). Use the modeled function to determine the step sizes dpi (i = 1, 2, ..., N) for the next iteration step. The difference between the approximated step sizes and the actual step sizes gives the approximation deviation σpi (i = 1, 2, ..., N). Repeat Steps (c) to (j).
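The steps above can be sketched end-to-end as follows. This is a simplified illustration under our own assumptions: the polynomial order is fixed at 2 instead of being selected by the SNRF criterion, and the stop test uses an absolute step-size floor in the spirit of (2.17); the parameter defaults follow the experiments reported below:

```python
import numpy as np

def rlo_minimize(f, p0, Ps=10, M=5, K=50, alpha=1.1, eps=0.9, beta=0.005,
                 max_iter=50, seed=0):
    """Sketch of the RLO loop, Steps (a)-(j), for minimization."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, float)
    N = p.size
    dp = rng.uniform(-1.0, 1.0, N)        # Step (b): initial step sizes
    sigma = np.ones(N)                    # initial searching-area size
    history = []                          # actual step sizes taken
    best = f(p)
    for _ in range(max_iter):
        moved = False
        for _ in range(K):                # Steps (f)-(g): area adjustments
            for _ in range(M):            # Steps (c)-(e): sample the area
                cand = p + rng.normal(dp, sigma, (Ps, N))
                vals = np.array([f(c) for c in cand])
                j = vals.argmin()
                if vals[j] < best:        # Step (h): move to the best point
                    step, p, best, moved = cand[j] - p, cand[j], vals[j], True
                    break
            if moved:
                break
            sigma, dp = alpha * sigma, eps * dp   # Eq. (2.16)
            if np.abs(dp).max() < beta:           # stop: step size collapsed
                return p, best
        if not moved:
            return p, best
        history.append(step)
        if len(history) >= 3:             # Steps (i)-(j): quadratic WLSF model
            H = np.array(history)
            n = len(H)
            t = np.arange(1.0, n + 1)
            w = (n ** (1.0 / n)) ** t / n                  # Eq. (2.4)
            V = np.vander(t, 3, increasing=True)
            for i in range(N):
                a, *_ = np.linalg.lstsq(V * w[:, None], H[:, i] * w, rcond=None)
                dp[i] = a @ (n + 1.0) ** np.arange(3)      # Eq. (2.15)
                sigma[i] = max(np.std(H[:, i] - V @ a), 1e-3)
    return p, best
```

The driver only ever accepts improving moves, so the returned value is never worse than the starting point; all the learning happens in how the next searching area is placed and sized.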

In general, the optimization algorithm based on reinforcement learning builds a model of successful moves for a given objective function. The model is built from historical successful actions, and it is used to determine new actions. The algorithm combines exploitation and exploration of the search using random generators. The optimization algorithm does not require any prior knowledge of the objective function or its derivatives, nor are there any special requirements placed on the objective function. The search operator is conceptually very simple and intuitive. In the following section, the algorithm is verified using several experiments.

2.3.1 Finding Global Minimum of a Multi-variable Function

2.3.1.1 A Synthetic Bivariate Function

The synthetic bivariate function V(p_1, p_2) = p_2^2 sin(1.5 p_2) + 2 p_1^2 sin(2 p_1) + p_1 sin(2 p_2), used previously in the example in Section 2.2.2, is used as the objective function.

The optimization algorithm starts at a random point and performs the search process

looking for the optimum point (minimum in this example). The number of random

points Ps generated in the searching area in each step is 10. The scaling factors α

and ε in (2.16) are 1.1 and 0.9. The β in (2.17) is 0.005.

One possible search path, from the start location to the final optimum location found by the RLO algorithm, is shown in Fig. 2.7. The global optimum is found in 13 iterative steps. The historical locations are shown in the figure as well. During

the search process, the historical step sizes taken are shown in Fig. 2.8 with their

approximation by WLSF.

Another example of the search process, starting from a different random point, is shown in Fig. 2.9. The global optimum is found in 10 iterative steps.

Table 2.1 shows changes in the numerical function values and adjustment of the step

sizes dp1 and dp2 for p1 and p2 in the successive search steps. Notice how the step

size was initially reduced to be increased again once the algorithm started to follow

a correct path towards the optimum.



Table 2.1 Function values and step sizes in the successive search steps

Step        V           dp1        dp2
1           1.4430      2.9455     0.8606
2         -34.8100      0.3570    -1.7924
3         -61.4957     -0.0508    -0.7299
4         -69.8342     -0.0477    -0.3114
5         -70.5394     -0.1232     0.2015
6         -71.5813      0.0000     4.4358
7        -109.0453     -0.0281     0.3408
8        -110.8888      0.0495    -0.0531
9        -112.0104      0.0438    -0.0772
10       -112.1666

This search process was repeated for 300 random trials. The success rate of finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 function evaluations to find the optimum in this problem.

The same problems were tested with several other direct-search-based optimization algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rates of finding the global optimum and the average numbers of function evaluations are compared in Tables 2.2, 2.3 and 2.4. All the simulations were performed on an Intel Core Duo 2.2 GHz PC with 2 GB of RAM.

Table 2.2 Optimization performance on the synthetic bivariate function (300 random trials)

                                            RLO      SA       PSO      OSSRS
Success rate of finding the global optimum  93.78%   29.08%   94.89%   52.21%
Number of function evaluations              4299     13118    4087     313
CPU time consumption [s]                    28.4     254.35   20.29    1.95

The classic 2D six-hump camel back function [5] has 6 local minima and 2 global minima. The function is given as

\[ V(p_1, p_2) = \left(4 - 2.1 p_1^2 + \frac{p_1^4}{3}\right) p_1^2 + p_1 p_2 + \left(-4 + 4 p_2^2\right) p_2^2 \quad (p_1 \in [-3, 3],\; p_2 \in [-2, 2]). \]

Within the specified bounded region, the function has 2 global minima equal to -1.0316. The optimization performances of these algorithms over 300 random trials are compared in Table 2.3.


Table 2.3 Optimization performance on the six-hump camel back function (300 random trials)

                                            RLO      SA        PSO      OSSRS
Success rate of finding the global optimum  80.33%   45.22%    86.44%   42.67%
Number of function evaluations              5016     8045.5    3971     256
CPU time consumption [s]                    33.60    151.86    20.35    1.63

Rosenbrock's famous "banana function" [23],

\[ V(p_1, p_2) = 100\,(p_2 - p_1^2)^2 + (1 - p_1)^2, \]

has 1 global minimum, equal to 0, lying inside a narrow, curved valley. The optimization performances of these algorithms over 300 random trials are compared in Table 2.4.

Table 2.4 Optimization performance on Rosenbrock's function (300 random trials)

                                            RLO       SA       PSO     OSSRS
Success rate of finding the global optimum  74.55%    3.33%    41%     88.89%
Number of function evaluations              48883.7   28412    4168    882.4
CPU time consumption [s]                    320.74    539.38   20.27   5.15

The RLO algorithm achieved this performance without particular tuning of its parameters. However, the other methods show different levels of efficiency and different capabilities of handling various problems.

The output of a multi-layer perceptron (MLP) can be viewed as the value of a function with the weights as its variables. Training the MLP, in the sense of finding optimal values of the weights to accomplish the learning task, can therefore be treated as an optimization problem. We take the Iris plant database [22] as a test case. The Iris database contains 3 classes, 5 numerical features and 150 samples. In order to accomplish the classification of the iris samples, a 3-layered MLP

with an input layer, a hidden layer and an output layer can be used. The size of the

input layer should be equal to the number of features. The size of the hidden layer

is chosen to be 6, and since the class IDs are numerical values equal to 1, 2 and

3, the size of the output layer is 1. The weight matrix between the input layer and

the hidden layer contains 30 elements, and the one between the hidden layer and the


output layer contains 6 elements. Overall, there are 36 weight elements (parameters) to be optimized. In a typical trial, the optimization algorithm finds the optimal set of weights after only 3 iterations. In the testing stage, the outputs of the MLP are rounded to the nearest integers to indicate the predicted class IDs. Comparing the given class IDs with the class IDs predicted by the MLP in Fig. 2.10, 146 out of 150 iris samples are correctly classified by this set of weights, a correct-classification percentage of 97.3%. A single support vector machine (SVM) achieved a 96.73% classification rate [12]. In addition, an MLP with the same structure, trained by back-propagation (BP), achieved 96% on the Iris test case. The MLP and BP were implemented using the MATLAB neural network toolbox.
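Casting the training as black-box optimization can be sketched as follows; the tanh hidden activation and the rounding rule are our assumptions based on the description above (a 5x6 and a 6x1 weight matrix, outputs rounded to class IDs 1-3):

```python
import numpy as np

def mlp_objective(weights, X, y, hidden=6):
    # Unpack the flat 36-element weight vector into the two layers described
    # in the text and return the misclassification fraction on (X, y).
    n_in = X.shape[1]                              # 5 input features
    W1 = weights[:n_in * hidden].reshape(n_in, hidden)
    W2 = weights[n_in * hidden:].reshape(hidden, 1)
    out = np.tanh(X @ W1) @ W2                     # assumed tanh activation
    pred = np.clip(np.rint(out.ravel()), 1, 3)     # round to class IDs 1..3
    return float(np.mean(pred != y))               # value to be minimized
```

Any black-box minimizer, including the RLO search described earlier, can then be applied to this 36-dimensional objective.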


In the area of machine intelligence, active vision has become an interesting topic. Instead of taking in the whole scene captured by the camera and making sense of all the information, as in the conventional computer vision approach, an active vision agent focuses on small parts of the scene and moves its fixation frequently. Humans and other animals use such quick movements of both eyes, called saccades [3], to focus on the interesting parts of the scene and use their resources efficiently. The interesting parts are usually important features of the input, and with the important features extracted, the high-resolution scene is analyzed and recognized with a relatively small number of samples.

In the saccade movement network (SMN) presented in [16], the original images are transformed into a set of low-resolution images after saccade movements and retina

44 J.A. Starzyk, Y. Liu, and S. Batog

sampling. This set of images, as the sampled features, is fed to the self-organizing winner-take-all classifier (SOWTAC) network for recognition. To find interesting features of the input image and to direct the saccade movements, image segmentation, edge detection and basic morphology tools [4] are utilized.

Fig. 2.11 Face image and its interesting features in active vision [16]

Fig. 2.11 (a) shows a face image from [7] with 320×240 pixels. The interesting features found are shown in Fig. 2.11 (b). The stars represent the centers of the four interesting features found on the face image, and the rectangles represent the feature boundaries. The retina sampling model [16] then places its fovea at the center of each interesting feature so that these features are extracted.

In practice, the centers of the interesting features found by the image processing tools [4] are not guaranteed to be accurate, which affects the accuracy of the feature extraction and pattern recognition process. To help find the optimum sampling position, the RLO algorithm can be used to direct the movement of the fovea of the retina and find the closest match between the obtained sample features and pre-stored reference sample features. These slight movements during fixation to find the optimum sampling positions can be called microsaccades in the active vision process, although the actual role of microsaccades has been debated, unresolved, for several decades [19].

Fig. 2.12 (a) shows a group of ideal samples of important features in face recognition. Fig. 2.12 (b) shows the group of sampled features at the initial sampling positions. In the optimization process, the x-y coordinates are optimized so that the sampled images have the greatest similarity to the ideal images. The level of similarity is measured by the sum of squared intensity differences [9]; in this metric, increased similarity corresponds to decreased intensity difference. This problem can also be viewed as an image registration problem. The two-variable objective function V(x, y), the sum of squared intensity differences, is minimized


through the RLO algorithm. Note that the only information available is that V can be a function of the x and y coordinates; how the function is expressed and what its characteristics are is completely unknown. The minimum value of the objective function is not known either. RLO is therefore a suitable algorithm for such an optimization problem. Fig. 2.12 (c) shows the optimized sampled images using RLO-directed microsaccades. The optimized feature samples are closer to the ideal feature samples, which helps the processing of the face image.
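The sum-of-squared-intensity-differences measure can be sketched as follows. The 6×6 synthetic image, the 2×2 reference patch, and the exhaustive scan are illustrative assumptions; in the chapter, the surface V(x, y) is searched by RLO rather than enumerated.

```python
def ssd(image, x, y, ref):
    """Sum of squared intensity differences between the patch of `image`
    anchored at (x, y) and the reference patch `ref` (lower = more similar)."""
    h, w = len(ref), len(ref[0])
    return sum((image[y + r][x + c] - ref[r][c]) ** 2
               for r in range(h) for c in range(w))

# Synthetic 6x6 "image" containing the 2x2 reference patch at (x, y) = (3, 2).
ref = [[9, 8], [7, 6]]
image = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        image[2 + r][3 + c] = ref[r][c]

# Exhaustive check over all candidate sampling positions: the true position
# minimizes V(x, y). An optimizer such as RLO would search this surface
# with small moves (microsaccades) instead of enumerating it.
best = min(((x, y) for x in range(5) for y in range(5)),
           key=lambda p: ssd(image, p[0], p[1], ref))
print(best)  # (3, 2)
```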

After the featured images are obtained through RLO-directed microsaccades,

these low-resolution images, instead of the entire high-resolution face image, are

sent to the SOWTAC network for further processing or recognition.

2.4 Conclusions

In this chapter, a novel and efficient optimization algorithm is presented for problems in which the objective function is unknown. The search particle is able to build a model of successful actions and choose its future action based on its past exploration experience. Decisions on the step sizes (and directions) are made based on a trade-off between exploitation of the known search path and exploration for an improved search direction. In this sense, the algorithm falls into the category of reinforcement learning based optimization (RLO) methods. The algorithm does not require any prior knowledge of the objective function, nor any characteristics of this function. It is conceptually simple and intuitive, as well as easy to implement and tune.

The optimization algorithm was tested and verified using several multi-variable functions and compared with several other widely used random search optimization methods. The learning of an MLP, in the sense of finding a set of optimized weights to accomplish the learning task, was treated as an optimization problem, and the proposed RLO was used to find the weights of the MLP in the training problem on the Iris database. Finally, the algorithm was used in an image recognition process to find a familiar object with retina sampling and micro-saccades.

The performance of RLO depends to a certain degree on the values of several parameters that the algorithm uses. With certain preset parameters, the performance of RLO meets our requirements in several machine learning problems involved in our current research. In future research, a theoretical and systematic analysis of the effect of these parameters will be conducted. In addition, using a group of search particles and their cooperation and competition, a population-based RLO can be developed. With the help of model approximation techniques and the trade-off between exploration and exploitation proposed in this work, the population-based RLO is expected to have better performance.

References

1. Arfken, G.: Lagrange Multipliers. §17.6 in Mathematical Methods for Physicists, 3rd edn., pp. 945–950. Academic Press, Orlando (1985)
2. Belur, S.: A random search method for the optimization of a function of n variables. MATLAB Central File Exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=100
3. Cassin, B., Solomon, S.: Dictionary of Eye Terminology. Triad Publishing Company, Gainesville (1990)
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, The MathWorks, http://www.mathworks.com/products/image/demos.html
5. Dixon, L.C.W., Szego, G.P.: The optimization problem: An introduction. In: Towards Global Optimization II. North Holland, New York (1978)
6. Nelder, J.A., Mead, R.: A simplex method for function minimization. The Computer Journal 7, 308–313 (1965)
7. Facegen Modeller. Singular Inversions, http://www.facegen.com/products.htm
8. del Toro Garcia, X., Neri, F., Cascella, G.L., Salvatore, N.: A surrogate assisted Hooke-Jeeves algorithm to optimize the control system of a PMSM drive. In: IEEE ISIE, pp. 347–352 (July 2006)
9. Hill, D.L.G., Batchelor, P.: Registration methodology: concepts and algorithms. In: Hajnal, J.V., Hill, D.L.G., Hawkes, D.J. (eds.) Medical Image Registration. CRC Press, Boca Raton (2001)
10. Hooke, R., Jeeves, T.A.: Direct search solution of numerical and statistical problems. Journal of the Association for Computing Machinery 8, 212–229 (1961)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neural Networks, Perth, Australia, December 1995, vol. 4, pp. 1942–1948 (1995)
12. Kim, H., Pang, S., Je, H.: Support vector machine ensemble with bagging. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, p. 397. Springer, Heidelberg (2002)
13. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)


14. Leontitsis, A.: Hybrid Particle Swarm Optimization. MATLAB Central File Exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6497
15. Lewis, R.M., Torczon, V., Trosset, M.W.: Direct search methods: Then and now. Journal of Computational and Applied Mathematics 124(1), 191–207 (2000)
16. Li, Y.: Active Vision through Invariant Representations and Saccade Movements. Master's thesis, School of Electrical Engineering and Computer Science, Ohio University (2006)
17. Liu, Y., Starzyk, J.A., Zhu, Z.: Optimized approximation algorithm in neural networks without overfitting. IEEE Trans. on Neural Networks 19(4), 983–995 (2008)
18. Lustig, I.J., Marsten, R.E., Shanno, D.F.: Computational experience with a primal-dual interior point method for linear programming. Linear Algebra and its Applications 152, 191–222 (1991)
19. Martinez-Conde, S., Macknik, S.L., Hubel, D.H.: The role of fixational eye movements in visual perception. Nature Reviews Neuroscience 5(3), 229–240 (2004)
20. Ong, Y.-S.: Max-min surrogate-assisted evolutionary algorithm for robust design. IEEE Trans. on Evolutionary Computation 10(4), 392–404 (2006)
21. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal 7, 155–162 (1964)
22. Fisher, R.A.: Iris Plants Database (July 1988), http://faculty.cs.byu.edu/˜cgc/Teaching/CS_478/iris.arff
23. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. The Computer Journal 3, 175–184 (1960)
24. Sheela, B.V.: An optimized step-size random search. Computer Methods in Applied Mechanics and Engineering 19(1), 99–106 (1979)
25. Snyman, J.A.: Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer, Heidelberg (2005)
26. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Frontiers in Robotics, Automation and Control, pp. 83–110. I-Tech Education and Publishing (October 2008), http://www.intechweb.org/book.php?%20id=78&content=subject&sid=11
27. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
28. Torczon, V.: On the convergence of pattern search algorithms. SIAM Journal on Optimization 7(1), 1–25 (1997)
29. Vandekerckhove, J.: General simulated annealing algorithm. MATLAB Central File Exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=10548
30. Ypma, T.J.: Historical development of the Newton-Raphson method. SIAM Review 37(4), 531–551 (1995)

Chapter 3

The Use of Opposition for Decreasing Function Evaluations in Population-Based Search

Abstract. This chapter discusses the application of opposition-based computing to reducing the amount of function calls required to perform optimization by population-based search. We provide motivation and comparison to similar, but different, approaches including antithetic variates and quasi-randomness/low-discrepancy sequences. We employ differential evolution and population-based incremental learning as optimization methods for image thresholding. Our results confirm improvements in required function calls, as well as support the oppositional principles used to attain them.

3.1 Introduction

Global optimization is concerned with discovering an optimal (minimum or maximum) solution to a given problem, generally within a large search space. In some instances the search space may be simple (i.e. concave or convex optimization can be used). However, most real-world problems are multi-modal and deceptive [5], which often causes traditional optimization algorithms to become trapped at local optima. Many strategies have been developed to overcome this for global optimization including, but not limited to, simulated annealing [9], tabu search [4], evolutionary algorithms [7] and swarm intelligence [3].

Some of these methods employ a single-solution-per-iteration methodology, whereby only one solution is generated and successively perturbed towards more

Mario Ventresca · Hamid Reza Tizhoosh

Department of Systems Design Engineering, The University of Waterloo, Ontario, Canada

e-mail: mventres@pami.uwaterloo.ca, tizhoosh@uwaterloo.ca

Shahryar Rahnamayan

Faculty of Engineering and Applied Science,

The University of Ontario Institute of Technology, Ontario, Canada

e-mail: shahryar.rahnamayan@uoit.ca

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 49–71.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010


desirable solutions. These methods inherently require low computational time per iteration; however, they often lack sufficient diversity to adequately explore the search space. An alternative

is via population-based techniques, where many solutions are generated at each iteration and all (or most) are used to determine the search direction (i.e. evolutionary and swarm intelligence algorithms). By considering many solutions per iteration these methods have higher diversity, but require a larger amount of computation. In this chapter we focus on population-based optimization.

Real-world problems also present the possible issue of complexity in the evaluation of a solution. That is, determining the quality of a given solution may be computationally expensive. Investigating simpler evaluation metrics is one possible direction; however, evaluation may still prove expensive. Reducing the time spent (i.e. the number of function evaluations) then becomes an important goal.

Opposition-based computing (OBC) is a newly devised concept having, as one of its aims, the improvement of the convergence rates of algorithms by defining and simultaneously considering pairs of opposite solutions [24]. This improved convergence rate is also usually accompanied by a more desirable final result. To date, OBC has shown improvements in reinforcement learning [20, 22, 23], evolutionary algorithms [14, 15, 16], ant colony optimization [12], simulated annealing [28], estimation of distribution algorithms [29], and neural networks [26, 27, 30].

In this chapter we discuss the application of OBC to reducing the number of function calls required to achieve a desired accuracy in population-based search. We show the theoretical reasoning behind OBC (which has roots in monotonic optimization) and provide mathematical motivation for its ability to reduce function calls and improve the accuracy of simulation results. The key to accomplishing this is the simultaneous consideration of negatively associated variables and their effect on the target evaluation function and search process. We highlight the improvements for the task of image thresholding using differential evolution and population-based incremental learning.

The rest of this chapter is organized as follows: Section 3.2 discusses the theoretical motivations behind our approach. Differential evolution and population-based incremental learning are discussed in Section 3.3, as are their respective opposition-based counterparts. The experimental setup is given in Section 3.4 and results are presented in Section 3.5. Conclusions are given in Section 3.6.

3.2 Theoretical Motivations

In this section we introduce the notation and definitions required to explain the concept of opposition and its ability to reduce the number of function calls. We also provide a brief comparison of OBC to antithetic variates and quasi-random/low-discrepancy sequences.


In the following definitions, assume that A ⊆ ℜd is non-empty, compact and d-dimensional. Without loss of generality, f : A → ℜ is a continuous function to be maximized. We assume all of A is feasible.

The purpose of a global search method is to discover the global optimum (either minimum or maximum) of a given function and not converge to one of the local optima.

Definition 1 (Global Optimum). A solution θ∗ ∈ A is a global optimum if f (θ) ≤ f (θ∗) for all θ ∈ A. There may exist more than one global optimum.

Definition 2 (Local Optimum). A solution θ′ ∈ A is a local optimum if there exists an ε-neighborhood Nε(θ′) of radius ε > 0 such that f (θ) ≤ f (θ′) for all θ ∈ A ∩ Nε(θ′), where g(θ, θ′) < ε for a distance function g.

Recent research [1, 25, 32] has shown the benefit of utilizing monotonic transformations of the evaluation criteria as a means of discovering global optima. The transformation causes a reordering of the solutions, and a gradient-based method can be used to search the reordered space. An issue with these convexification and concavification methods, which transform certain functions to a monotonic form, is that the mapping must be known a priori. Otherwise, optimization on the transformed function is unreliable.

Definition 3 (Monotonicity). Function φ : ℜ → ℜ is monotonic if for x, y ∈ ℜ and

x < (>)y, then φ (x) ≤ (≥)φ (y). A strictly monotonic function is one which does

not permit equality, (i.e. φ (x) < (>)φ (y)).

Theoretically, a monotonic transformation is ideal; however, OBC does not require one. Instead, opposition extends the monotonic global search idea through the use of opposite solutions, which are simultaneously considered, and the more desirable (w.r.t. f and the problem definition) is used during the search.

Definition 4 (Opposite). A pair (x, x̆) ∈ A are opposites if there exists a function

Φ : A → A where Φ (x) = y and Φ (y) = x. The breve notation will be used to

denote the opposite element (i.e. x̆ = Φ (x) = y).

The function Φ referred to in Definition 4 is the key to employing opposition-based techniques. It determines which elements are opposites, and a poorly selected function can lead to poor performance (see the following section).

Definition 5 (Opposite Mapping). A one-to-one function Φ : A → A where every

pair x, x̆ ∈ A are unique (i.e. for z ∈ A , if Φ (x) = y and Φ (y) = x then there does

not exist Φ (y) = z or Φ (x) = z).

Φ can be determined via prior knowledge, intuition, or through some a priori or online learning procedure. Simultaneous use of the opposites (for example, in a maximization problem) is easily accomplished by letting f (θ) = f (θ̆) = max( f (θ), f (θ̆)) and searching for the solution S which corresponds to the most desirable solution:

S = arg maxθ f (θ). (3.1)
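A common concrete choice of Φ (one assumption among many valid maps) reflects a point through the midpoint of its interval; keeping the better of each opposite pair then realizes f(θ) = f(θ̆) = max(f(θ), f(θ̆)). A minimal sketch:

```python
def opposite(x, a, b):
    """Opposite of x in [a, b]: reflection through the interval midpoint.
    Phi is its own inverse, so opposite(opposite(x)) == x."""
    return a + b - x

def paired_value(f, x, a, b):
    """Simultaneously evaluate x and its opposite, keep the better value."""
    return max(f(x), f(opposite(x, a, b)))

# Maximize f on [0, 4]; the optimum is at x = 3.
f = lambda x: -(x - 3) ** 2
assert opposite(opposite(1.0, 0, 4), 0, 4) == 1.0
print(paired_value(f, 1.0, 0, 4))  # f(3) = 0, since opposite(1) = 3
```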


As with a monotonic transformation, the "optimal" opposition map is one which effectively reorders the elements of A such that φ( f ) is monotonic. Under this transformation cor(X, X̆) ≤ 0 for X, X̆ ⊂ A, as shown in Figure 3.1.

Fig. 3.1 Example of an evaluation function transformed to a monotonic function. The values in X and X̆ are negatively correlated. The original function (not shown) could have been any nonlinear, continuous function

The opposite mapping Φ allows for simultaneous examination of x, x̆ and returns the most desirable. That is, we aim for a negative correlation between the two guesses. A consequence of Φ is an effective halving of the possible evaluations (as shown in Figure 3.2), where f (θ) = f (θ̆) = max( f (θ), f (θ̆)). The halving results from full information of the transformed function, such that the opposite solution need not be observed to compute the max operation. In the more general case, it is sufficient to determine a function such that Pr(max( f (θ1), f (θ̆1)) < max( f (θ1), f (θ2))) > 0.5, for θ1, θ2 ∈ A.

A further consequence of the simultaneous consideration of opposites is a provably more desirable E[ f ] and lower variance [28]. Alone this does not guarantee a higher quality outcome; the probability density function of the opposite-transformed function values should also be more desirable than the joint p.d.f. of random sampling (i.e. the distribution corresponding to Pr(max(x1, x2))).


Fig. 3.2 Taking f (x) = max( f (x), f (−x)), we see that the possible evaluations in the search space have been halved in the optimal situation of full knowledge. In the general case, the transformed function will have a more desirable mean and lower variance

While not investigated in this chapter, we note that successive applications of different opposite maps will lead to further smoothing of f . For example,

f 2(θ) = max( f (z = arg maxθ f (θ)), f (z̆)) (3.2)

represents two applications of opposite mappings. In the limit,

limi→∞ f i(θ) = max( f i(zi = arg maxθi f i−1(θi)), f (z̆i)) = f (θ∗) (3.3)

for i > 0 and global optimum f (θ∗). Effectively, this flattens the entire error surface of f , except for the global optimum(s). A more feasible alternative is to use k > 0 transformations, which give reasonable results when 0 ≤ | f k−1 − f k | < ε does not diminish greatly.

We briefly discuss conditions on the design of the opposition map Φ which often lead to improvements over purely random sampling with respect to lowering function evaluations. If using an algorithm solely based on randomly generating solutions to the problem at hand, then we require that for some ε > 0 and δ > 0.5


where x, y, x̆ ∈ A. That is, the distribution of max(x, x̆) must be less desirable than the distribution of i.i.d. random guesses. If this condition is met, then the probability that the optimal solution (or one of higher quality) is discovered is higher using opposite guesses. The goal in developing Φ is to maximize ε and δ. A similar goal is to determine Φ such that E[g(x, x̆)] is maximized for some distance function g. Satisfying this condition implies (3.4).

Thus, probabilistically we expect a lower number of function calls to find a solution of a given threshold quality. If employing this strategy within a guided search method, the dynamics of the algorithm must be considered to assure the guarantee (i.e. the algorithm adds bias to the search, which affects the convergence rate).

Practically, the simplest manner to decide a satisfactory Φ is through intuition

or prior knowledge about f . A possibility is to utilize modified low-discrepancy

sequences (see below), which aim to distribute guesses evenly throughout the search

space.

While opposition may sometimes employ methods from antithetic variates and low-discrepancy sequences, in general this is not the case. To elucidate the uniqueness of opposition, in the following we distinguish it from these two methods.

Suppose we desire to estimate ξ = E[ f ] with the unbiased estimator

ξ̂ = (Y1 + Y2)/2. (3.5)

If Y1, Y2 are i.i.d. then var(ξ̂) = var(Y1)/2. However, if cov(Y1, Y2) < 0 then the variance can be further reduced.

One method to accomplish this is through the use of a monotonic function h. Generate Y1 as an i.i.d. value as before; utilizing h, our two variables are h(Y1) and h(1 − Y1), which are monotonic over the interval [0, 1], and

ξ̂ = (h(Y1) + h(1 − Y1))/2 (3.6)

will be an unbiased estimator of E[ f ].

Opposition is similar in its selection of negatively correlated samples. In the antithetic approach, however, there is no guideline for constructing such a monotonic function, although such a function has been proven to exist [17]. Opposition provides various means to accomplish this, as well as to incorporate the idea into optimization while guaranteeing more desirable expected values and lower variance in the target function.
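The variance reduction behind antithetic pairing is easy to check numerically. The sketch below estimates E[exp(U)] for U ~ Uniform(0, 1); the choice h = exp and the pairing u ↦ 1 − u are the standard textbook construction, not code from this chapter.

```python
import math
import random

def mc_estimates(n_pairs, antithetic, seed=0):
    """Per-pair Monte Carlo estimates of E[exp(U)], U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_pairs):
        u = rng.random()
        v = 1.0 - u if antithetic else rng.random()  # negatively correlated partner
        out.append((math.exp(u) + math.exp(v)) / 2.0)
    return out

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

plain = mc_estimates(20000, antithetic=False)
anti = mc_estimates(20000, antithetic=True)
# Both estimators target e - 1, but the antithetic pairing has much
# lower variance because cov(exp(U), exp(1 - U)) < 0.
print(variance(anti) < variance(plain))
```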


Further, opposition is not restricted to sampling-based algorithms. It can also be applied to algorithm behavior and can be used to relate concepts expressed linguistically, where we have not found evidence that antithetic variates have application.

These methods aim to combine the randomness of pseudorandom generators, which select values i.i.d., with the advantages of generating points distributed in a grid-like fashion. The goal is to uniformly distribute values over the search space by achieving a low discrepancy. However, this is achieved at the cost of statistical independence.

The discrepancy of a sequence is a measure of its uniformity and can be calculated via [6]:

DN (X) = supB∈J | |B ∩ X|/N − λd (B) | (3.7)

where λd is the d-dimensional Lebesgue measure, |B ∩ X| is the number of points of X = (x1, ..., xN) that fall into the interval B, and J is the set of d-dimensional intervals defined as:

∏i=1..d [ai, bi) = {x ∈ ℜd : ai ≤ xi < bi} (3.8)

for 0 ≤ ai < bi ≤ 1.

That is, the actual number of points within each interval for a given sample is

close to the expected number. Such sequences have been widely studied in quasi-

Monte Carlo methods [10].
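In one dimension, the star discrepancy of N sorted points has the closed form D*N = 1/(2N) + maxi |x(i) − (2i − 1)/(2N)|, which makes the contrast between random and low-discrepancy sampling easy to verify. The van der Corput generator and the comparison below are illustrative additions, not part of the chapter:

```python
import random

def star_discrepancy_1d(points):
    """Closed-form star discrepancy of a 1-D point set in [0, 1)."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(abs(x - (2 * i + 1) / (2.0 * n))
                               for i, x in enumerate(xs))

def van_der_corput(n, base=2):
    """First n points of the base-b van der Corput low-discrepancy sequence."""
    seq = []
    for k in range(1, n + 1):
        q, denom, x = k, 1.0, 0.0
        while q:                      # reverse the digits of k around the radix point
            q, r = divmod(q, base)
            denom *= base
            x += r / denom
        seq.append(x)
    return seq

random.seed(3)
rand_pts = [random.random() for _ in range(64)]
vdc_pts = van_der_corput(64)
# The quasi-random sequence covers [0, 1) far more evenly than i.i.d. samples.
print(star_discrepancy_1d(vdc_pts) < star_discrepancy_1d(rand_pts))
```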

Opposition may utilize low-discrepancy sequences in some situations. In general, though, low-discrepancy sequences are simply a means of attaining a uniform distribution, without regard to the correlation between the evaluations at these points. Further, opposition-based techniques simultaneously consider two points in order to smooth the evaluation function and improve the performance of the sampling algorithm, whereas quasi-random sequences are often concerned with many more points.

These methods have been applied to evolutionary algorithms, where a performance study found that sampling methods such as Uniform, Normal, Halton, Sobol, Faure, and Low-Discrepancy are valuable only for low-dimensional (d < 10, and so non-highly-sparse) populations [11].

3.3 Algorithms

In this section we describe Differential Evolution (DE) and Population-Based Incremental Learning (PBIL), which are the


parent algorithms for this study. We also describe the oppositional variants of these

methods.

Differential Evolution (DE) was proposed by Price and Storn in 1995 [21]. It is an effective, robust, and simple global optimization algorithm [8]. DE is a population-based directed search method. Like other evolutionary algorithms, it starts with an initial population vector. If no preliminary knowledge about the solution space is available, the population is randomly generated. Each vector of the initial population can be generated as follows [8]:

Xi, j = a j + rand j (0, 1) × (b j − a j ), j = 1, 2, ..., d, (3.9)

where d is the problem dimension; a j and b j are the lower and upper boundaries of variable j, respectively, and rand(0, 1) is a uniformly generated random number in [0, 1].

Assume Xi,t (i = 1, 2, ..., Np) are candidate solution vectors in generation t, where Np is the population size. Successive populations are generated by adding the weighted difference of two randomly selected vectors to a third randomly selected vector. For classical DE (DE/rand/1/bin), the mutation, crossover, and selection operators are defined as follows:

Mutation: For each vector Xi,t in generation t a mutant vector Vi,t is defined by

Vi,t = Xa,t + F (Xc,t − Xb,t ), (3.10)

where i = 1, 2, ..., Np and a, b, and c are mutually different random integer indices selected from {1, 2, ..., Np}. Further, i, a, b, and c must be unique, so Np ≥ 4 is necessary. F ∈ [0, 2] is a real constant which determines the amplification of the added differential variation Xc,t − Xb,t. Larger values of F result in higher diversity in the generated population, and lower values lead to faster convergence.

Crossover: By shuffling competing solution vectors, DE utilizes the crossover operation to generate new solutions and also to increase population diversity. Classical DE (DE/rand/1/bin) uses binary crossover (denoted by 'bin' in the notation). It defines the following trial vector:

Ui,t = (U1i,t , U2i,t , ..., Udi,t ), (3.11)

where

U ji,t = V ji,t if rand j (0, 1) ≤ Cr ∨ j = k; X ji,t otherwise, (3.12)

for Cr ∈ (0, 1) the predefined crossover rate, and rand j (0, 1) the jth evaluation of a uniform random number generator. k ∈ {1, 2, ..., d} is a random parameter index,


chosen once for each i to make sure that at least one parameter is always selected

from the mutated vector, V ji,t . Most popular values for Cr are in the range of (0.4, 1)

[14].

Selection: This decides which vector (Ui,t or Xi,t) should be a member of the next (new) generation t + 1. For a minimization problem, the vector with the lower objective function value is chosen (greedy selection).

This evolutionary cycle (i.e., mutation, crossover, and selection) is repeated Np (population size) times to generate a new population. Successive generations are produced until the predefined termination criteria are met.
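The full DE/rand/1/bin cycle described above fits in a few lines. The sketch below uses typical parameter values (Np = 20, F = 0.5, Cr = 0.9) and the sphere function as a stand-in objective; these choices are illustrative, not prescriptions from the chapter.

```python
import random

def de(f, bounds, np_=20, F=0.5, Cr=0.9, gens=200, seed=7):
    """Classical DE (DE/rand/1/bin): mutation, binomial crossover, greedy selection."""
    rng = random.Random(seed)
    d = len(bounds)
    pop = [[a + rng.random() * (b - a) for a, b in bounds] for _ in range(np_)]
    for _ in range(gens):
        for i in range(np_):
            a, b, c = rng.sample([j for j in range(np_) if j != i], 3)
            v = [pop[a][j] + F * (pop[c][j] - pop[b][j]) for j in range(d)]  # mutation
            k = rng.randrange(d)      # index always taken from the mutant vector
            u = [v[j] if rng.random() <= Cr or j == k else pop[i][j]
                 for j in range(d)]                                          # crossover
            if f(u) <= f(pop[i]):                                            # selection
                pop[i] = u
    return min(pop, key=f)

sphere = lambda x: sum(xi * xi for xi in x)
best = de(sphere, [(-5.0, 5.0)] * 3)
print(sphere(best))  # close to 0
```

Note that trial vectors are not clamped back into the bounds after mutation; production implementations usually add that step.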

By utilizing opposite points, we can obtain fitter starting candidate solutions even when there is no a priori knowledge about the solution(s), according to:

1. Random initialization of the population P(Np).
2. Calculation of the opposite population by

OPi, j = a j + b j − Pi, j , i = 1, 2, ..., Np ; j = 1, 2, ..., D, (3.13)

where Pi, j and OPi, j denote the jth variable of the ith vector of the population and the opposite population, respectively.
3. Selection of the Np fittest individuals from {P ∪ OP} as the initial population.

The general ODE scheme also employs generation jumping, but it is not used in this work, in lieu of only population initialization and sample generation.
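Steps 1-3 above can be sketched directly; the sphere objective and the specific bounds below are illustrative assumptions:

```python
import random

def opposition_init(f, n_pop, bounds, seed=5):
    """Opposition-based population initialization: random population P,
    opposite population OP[i][j] = a_j + b_j - P[i][j], then keep the
    n_pop fittest individuals of the union (minimization of f)."""
    rng = random.Random(seed)
    pop = [[a + rng.random() * (b - a) for a, b in bounds] for _ in range(n_pop)]
    opp = [[a + b - x for x, (a, b) in zip(ind, bounds)] for ind in pop]
    return sorted(pop + opp, key=f)[:n_pop]

sphere = lambda x: sum(xi * xi for xi in x)
init = opposition_init(sphere, 10, [(-5.0, 5.0)] * 2)
# `init` is at least as fit as a purely random population of the same size,
# since it is the best half of {P ∪ OP}.
```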

PBIL is a stochastic search which abstracts the population of samples found in evolutionary computation with a probability distribution over each variable of the solution [2]. At each generation a new sample population is generated based on the current probability distribution. The best individual is retained, and the probability model is updated accordingly to reflect the belief regarding the final solution. The update rule is similar to that found in reinforcement learning.

A population is represented by a probability matrix M := (mi, j)d×c which stores the probability distribution over each possible element in the solution. If considering a binary problem, then the solution S := (si, j)d×c ∈ {0, 1} and mi, j ∈ [0, 1] is the probability that element si, j = 1. For continuous problems, probability distributions are used instead of a threshold value, as is the case for discrete problems [18].

Learning consists of utilizing M to generate a population P of k samples. After evaluation of each sample according to function f, the "best" solution B∗ is retained and M is updated according to


Mt = (1 − α )Mt−1 + α B∗ (3.14)

where 0 < α < 1 is the learning rate and t ≥ 1 is the iteration. Initially, mi, j = 0.5 to reflect the lack of prior information.

To abstract the crossover and mutation operators of evolutionary computation, PBIL employs a randomization of M. Let 0 < β, γ < 1 be the probability of mutation and the degree of mutation, respectively. Then, with probability β,

mi, j = (1 − γ )mi, j + γ · random(0 or 1).

1: {Initialize probabilities}

2: M0 := (mi, j ) = 0.5

3: for t = 1 to ω do

4: {Generate samples}

5: G1 = generate samples(k, Mt−1 )
6: {Select best}
7: B∗ = select best({B∗ } ∪ G1)
8: {Update M}
9: Mt = (1 − α )Mt−1 + α B∗
10: {Mutate M}
11: for i = 0...d and j = 0...c do

12: if random(0, 1) < β then

13: mi, j = (1 − γ )mi, j + γ · random(0 or 1)

14: end if

15: end for

16: end for
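A runnable sketch of the listing above, under stated assumptions: binary solutions only, a OneMax toy objective, and parameter values (k, α, β, γ) chosen for illustration rather than taken from the chapter.

```python
import random

def pbil(f, d, iters=300, k=20, alpha=0.1, beta=0.02, gamma=0.05, seed=11):
    """Binary PBIL: sample k bitstrings from probability vector m, retain the
    best-so-far solution B*, shift m toward it, then randomly mutate m."""
    rng = random.Random(seed)
    m = [0.5] * d                      # m_{i,j} = 0.5: no prior information
    best = None
    for _ in range(iters):
        samples = [[1 if rng.random() < p else 0 for p in m] for _ in range(k)]
        gen_best = max(samples, key=f)
        if best is None or f(gen_best) > f(best):
            best = gen_best            # B* = select_best({B*} ∪ G1)
        m = [(1 - alpha) * p + alpha * b for p, b in zip(m, best)]  # update rule
        m = [(1 - gamma) * p + gamma * rng.choice((0.0, 1.0))       # mutation
             if rng.random() < beta else p for p in m]
    return best

onemax = lambda bits: sum(bits)        # toy objective: maximize the number of ones
best = pbil(onemax, d=16)
print(onemax(best))  # best-so-far one count (16 is the optimum)
```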

The opposition-based version of PBIL (OPBIL), shown in Algorithm 2, employs the opposite concept to improve diversity within the sample generation phase. A direct effect on the convergence rate is observed as a consequence of this mechanism. Further, OPBIL has an ability to escape the local optima which estimation of distribution algorithms such as PBIL are prone to becoming trapped on [19]. The description provided here is brief; the interested reader is invited to read [29] for a more detailed description.

The general structure of the PBIL algorithm remains, however aside from the

sampling procedure the update and mutation rules are altered to reflect a degrading

3 The Use of Opposition for Decreasing Function Evaluations 59

degree of opposition with respect to the number of iterations. That is, as the number of iterations t → ∞, the amount by which two opposite solutions differ approaches 1 bit (w.r.t. Hamming distance).

Sampling is accomplished using an opposite guessing strategy whereby half of the population R1 is generated using the probability matrix M and the other half is generated via a change in Hamming distance to a given element of R1. The distance is calculated using an exponentially decaying function of the iteration, where l is the maximum number of bits in a guess and c < 0 is a user-defined constant.
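The exact decay function is not reproduced here; one plausible form, assuming the distance shrinks exponentially from l bits toward 1 bit under a constant c < 0 (the actual function in [29] may differ), is:

```python
import math
import random

def opposite_distance(t, l, c=-0.05):
    """Hypothetical decaying Hamming distance: starts near l bits at t = 0
    and approaches 1 bit as t grows, as described in the text."""
    return max(1, round(l * math.exp(c * t)))

def generate_opposite(guess, t, c=-0.05, rng=None):
    """Produce an opposite guess by flipping a decaying number of randomly
    chosen bits of a binary guess (a sketch, not the reference implementation)."""
    rng = rng or random.Random(0)
    flip = set(rng.sample(range(len(guess)), opposite_distance(t, len(guess), c)))
    return [1 - b if i in flip else b for i, b in enumerate(guess)]
```

Early in the search an opposite differs from its guess in almost every bit; late in the search it differs in a single bit, matching the limiting behaviour described above.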

Updating of M is performed in lines 14–28. If a new global best solution has been discovered (i.e. η = B∗), or otherwise with probability pamp, the sample best solution is used to focus the search. When no new optima have been discovered this strategy tends away from B∗. The actual update is performed in line 16 and is based on a reinforcement learning update using the sample best solution. The degree to which M is updated is controlled by the user-defined parameter 0 < ρ < 1.

Should the above criteria for update fail, a decay of M is attempted with probability pdecay in line 17. The decay, performed in lines 21–27, slowly tends M away from B∗. This portion of the algorithm is intended to prevent premature convergence and to aid exploration through small, smooth updates. The parameter 0 < τ < 1 is user defined, where typically τ ≪ ρ.

The equations in lines 11 and 12 were determined experimentally and no argument regarding their optimality is provided. Indeed, there likely exist many functions which would yield more desirable results. These were chosen because they tend to lead to good behavior and outcomes for PBIL.

In this section we provide a discussion of the image thresholding problem and the application of evolutionary algorithms to solving it. Additionally, the evaluation measure we employ to grade the quality of a segmentation is presented. Parameter settings and problem representation are also given.

Image segmentation involves partitioning an image I into a set of segments, with the goal of locating objects in the image which are sufficiently similar. Thresholding is a special case of image segmentation with only two classes, defined by whether a given pixel is above or below a specific threshold value ω. This task has numerous applications and several general segmentation algorithms have been proposed [33]. Due to the variety of image types, no single algorithm segments all images optimally.
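As a concrete illustration (not taken from the chapter), thresholding reduces to a per-pixel comparison against ω:

```python
def threshold_image(image, omega):
    """Binarize a gray-level image: pixels strictly above the threshold omega
    become 1 (object), the rest 0 (background). `image` is a list of rows of
    integer gray levels."""
    return [[1 if pixel > omega else 0 for pixel in row] for row in image]
```

For example, `threshold_image([[10, 200], [90, 120]], 100)` keeps only the two pixels exceeding 100.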


Require: Maximum iterations, ω

Require: Number of samples per iteration, k

1: {Initialize probabilities}

2: M0 := (mi,j ) = 0.5

3: for t = 1 to ω do

4: {Generate samples}

5: R1 = generate samples(k/2,M)

6: R̆1 = generate opposites(R1)

8: η = select best({R1 ∪ R̆1 })

9: B∗ = select best(B∗ , η )

11: pamp(Δ) = 1 − e^(−bΔ)

12: pdecay(Δ; f(B∗), f(η)) = (1 − (f(B∗) − f(η))/f(B∗)) / √(Δ + 1)

13: {Update M}

14: if η = B∗ OR random(0, 1) < pamp then

15: Δ =0

16: Mt = (1 − ρ )Mt−1 + ρη

17: else if random(0, 1) < pdecay then

18: if random(0, 1) < pdecay then

19: use B∗ in line 23 instead of η

20: end if

21: for all i, j each with probability < pdecay do

22: if ηi,j = B∗i,j then

23: mi,j = mi,j · (1 − τ · random(0, 1)) if ηi,j = 1, else mi,j · (1 + τ · random(0, 1))

24: else

25: mi,j = mi,j · (1 + τ · random(0, 1)) if ηi,j = 1, else mi,j · (1 − τ · random(0, 1))

26: end if

27: end for

28: end if

29: Δ = Δ +1

30: end for


Many general-purpose segmentation algorithms are histogram based and aim to discover a deep valley between two peaks, setting ω equal to that value. However, many real-world problems have multimodal histograms, and deciding which value (i.e. valley) corresponds to the best thresholding is not obvious. The difficulty is compounded by the fact that the relative size of the peaks may be large (making the valley hard to distinguish) or the valleys may be very broad. Several algorithms have been proposed to overcome this [33]. Other methods based on information theory and other statistical approaches have been proposed as well [13].

Typically, the problem of segmentation involves a high degree of uncertainty, which makes it difficult to solve. Stochastic searches such as evolutionary algorithms and population-based incremental learning often cope well with uncertainty in optimization; hence they provide an interesting alternative to traditional methods.

The main difficulty associated with the use of population-based methods is that they are computationally expensive, due to the large number of function calls required during the optimization process. One approach to minimizing uncertainty is to split the image into subimages which (hopefully) have characteristics allowing for an easy segmentation. Combining the subimages together then forms the entire segmented image, although this requires extra function calls to analyze each subimage. An important caveat is that a local segmentation may be good for its subimage, but not useful with respect to the image as a whole.

In this chapter we investigate thresholding with population-based techniques. Using ODE we do not perform any splitting into subimages, whereas for OPBIL we split I into four equal-sized subregions, each having its own threshold value. In both cases we require a single evaluation to perform the segmentation, and we show that the opposition-based techniques reduce the required number of function calls.

As stated above, there exist many different segmentation algorithms. Furthermore, numerous methods for evaluating the quality of a segmentation have also been put forth [34]. In this chapter we use a simple method which aims to minimize the discrepancy between the original M × N gray-level image I and its thresholded image T [31]:

∑_{i=1}^{M} ∑_{j=1}^{N} |Ii,j − Ti,j|     (3.17)

where | · | represents the absolute value operation. Using a different evaluation measure would change the outcome of the algorithm; however, the problem of segmentation posed in this manner nonetheless remains computationally expensive.
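The discrepancy of Eq. (3.17) can be computed directly. Here T is assumed to hold, for each pixel, the gray level assigned by the thresholding (e.g. a representative level of its class); that reconstruction detail follows [31] and is an assumption of this sketch:

```python
def discrepancy(I, T):
    """Eq. (3.17): sum of absolute pixel differences between the original
    gray-level image I and its thresholded reconstruction T (lists of rows)."""
    return sum(
        abs(i_px - t_px)
        for row_i, row_t in zip(I, T)
        for i_px, t_px in zip(row_i, row_t)
    )
```

A perfect reconstruction yields a discrepancy of zero; the optimizer seeks the threshold minimizing this sum.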

We use the images shown in Figure 3.3 to evaluate the algorithms. The first column represents the original image, the second is the gold standard, and the third column corresponds to the approximate target image for ODE and OPBIL (i.e. the value-to-reach targets, discussed below). We show the gold image for completeness; it is not required in the experiments.

Both experiments employ a value-to-reach (VTR) stopping criterion, which measures the time or function calls required to reach a specific value. The VTR values have been determined experimentally and are given in the following table (Table 3.1). Due to


Fig. 3.3 The images used to benchmark the algorithms. The first column is the original gray-level image, the second is the gold standard and the third column is the target image of the optimization within the required function calls


the respective algorithms' differing abilities to solve this problem, given the representation and convergence behavior, these values differ for ODE and OPBIL.

Table 3.1 Experimentally determined value-to-reach (VTR) for each image

Image   VTR (ODE)   VTR (OPBIL)
1       19579       19850
2        3391        4925
3        7139        7175
4       19449       19850
5       19650       19700
6       22081       22700

The ODE and OPBIL algorithms differ in that the former is a real-valued optimization algorithm and the latter operates in the binary domain. Therefore, the solution representations also differ, and consequently directly affect the quality of results. However, the goal of this chapter is to show the ability of opposition to decrease the required number of function evaluations, and so fine-tuning aspects of these algorithms is not the focus of this investigation.

ODE Settings

The user-defined parameter settings were determined empirically, as shown in Table 3.2.

Parameter Value

Population size Np = 5

Amplification factor F = 0.9

Crossover probability Cr = 0.9

Mutation strategy DE/rand/1/bin

Maximum function calls MAXNFC = 200

Jumping rate (no jumping) Jr = −1

In order to maintain a reliable and fair comparison, these settings are kept unchanged for all conducted experiments for both the DE and ODE algorithms.


OPBIL Settings

Thresholding aims to discover an integer value 0 ≤ T ≤ 255 to perform the segmentation operation I > T. Additionally, we use an approach of splitting I into subimages I1, . . . , I16, where each Ii is an equal-sized square region of the original image.

The encoding was determined to be a matrix R := (ri,j)16×8, which corresponds to 16 subimages each having a gray-level value < 2^8 = 256. Each row of R is converted to an integer which is used to segment the respective region of I. The extra regions increase problem difficulty, as they result in more deceptive and multimodal problems.
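Converting a row of R to a per-subimage threshold is a standard bits-to-integer decoding; a sketch follows, assuming most-significant-bit-first ordering (the chapter does not state the bit order):

```python
def decode_thresholds(R):
    """Convert each 8-bit row of the binary matrix R into an integer threshold
    in [0, 255], one per subimage (MSB-first bit order assumed)."""
    return [
        sum(bit << (len(row) - 1 - k) for k, bit in enumerate(row))
        for row in R
    ]
```

Each decoded integer then thresholds its corresponding region of I.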

Parameter settings for PBIL and OPBIL are as follows:

Parameter Value

Maximum iterations t = 150

Sample size k = 24

PBIL Only

Learning rate α = 0.35

Mutation probability β = 0.15

Mutation degree γ = 0.25

OPBIL Only

Update frequency control b = 0.1

Learning rate ρ = 0.25

Probability decay τ = 0.0005

Using the test images and parameter settings stated above, we show the ability of oppositional concepts to decrease the required function calls. Only the results for OPBIL are presented in detail, due to space limitations, but similar behavior should be observed for ODE. All results presented correspond to the average of 30 runs, unless otherwise noted.

3.5.1 ODE

Table 3.4 presents a summary of the results regarding function calls for ODE versus DE. Except for image 2, we observe a decrease in function calls for all images. Images 4 and 5 show statistically significant improvements with respect to the decreased number of function calls, using a t-test at the 0.9 confidence level. Furthermore, except for image 6, we observe a lower standard deviation, which indicates higher reliability of the results.


Overall we have 322 − 277 = 45 saved function calls, which equates to an average of 45/6 = 7.5 saved function calls per image. This implies a savings ratio of 322/277 ≈ 1.16, indicating approximately 16% fewer function calls. For expensive optimization problems this can correspond to a great amount of savings with respect to algorithm run-time.

Table 3.4 Summary results for DE vs. ODE with respect to required function calls. μ and σ

correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image   μDE   σDE   μODE   σODE
1        74    41     60     34
2        32    20     35     16
3        42    23     37     19
4        74    36     54     30
5        47    37     45     22
6        63    26     46     31
Total   322           277

3.5.2 OPBIL

Table 3.5 shows the expected number of iterations (each iteration comprises 24 function calls) needed to attain the value-to-reach given in Table 3.1. In all cases OPBIL reaches its goal in fewer iterations than PBIL, where the results for images 2, 5 and 6 are statistically significant using a t-test at the 0.9 confidence level. Additionally, in all cases we find a lower standard deviation, indicating more reliable behavior for OPBIL.

Overall we have 444 − 347 = 97 saved iterations using OPBIL, an average of 97/6 ≈ 16 iterations, or 16 × 24 = 384 function calls, per image. The approximate savings ratio is 444/347 ≈ 1.28, which is about a 28% improvement in required iterations.
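The savings arithmetic above can be checked directly from the totals in Tables 3.4 and 3.5:

```python
# Totals reported in Tables 3.4 (function calls) and 3.5 (iterations).
de_total, ode_total = 322, 277
pbil_total, opbil_total = 444, 347

ode_saving = de_total - ode_total        # 45 function calls saved in total
ode_ratio = de_total / ode_total         # ≈ 1.16, i.e. about 16% fewer calls
opbil_saving = pbil_total - opbil_total  # 97 iterations saved in total
opbil_ratio = pbil_total / opbil_total   # ≈ 1.28, i.e. about 28% fewer iterations
```

These ratios are the 16% and 28% improvements quoted in the text.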

Table 3.5 Summary results for PBIL vs. OPBIL with respect to required iterations. μ and σ correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image   μPBIL   σPBIL   μOPBIL   σOPBIL
1         62      19       53       12
2         80      25       65        9
3         61      12       60        5
4         47      14       40       10
5         68      13       53        9
6        128      21       76       14
Total    444              347


In the following we analyze the correlation and distance for each sample per iteration. This is to examine whether the negative-correlation and larger-distance properties between a guess and its opposite are present in the samples. If so, we have supported (although not confirmed) the hypothesis that the observed improvements may be due to these characteristics.

Figure 3.4 shows the averaged correlation cor(R1^t, R̆1^t) for randomly generated R1^t and respective opposites R̆1^t at iteration t. The solid line corresponds to OPBIL and the dotted line to PBIL. These plots show that OPBIL indeed has a lower correlation (with respect to the evaluation function) than PBIL (for which we generate R1 as above, and let R̆1 also be randomly generated). In all cases the correlation is much stronger for PBIL (noting that if the algorithm reaches the VTR then we set the correlation to 0).
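The per-iteration correlation between the fitness values of paired guesses and opposites can be computed with a standard Pearson estimator (a generic sketch; the chapter does not specify the exact estimator used):

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of fitness values,
    e.g. the fitnesses of guesses and of their opposites at one iteration.
    Returns 0.0 when either list has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0
```

Values near −1 indicate the negative correlation between guesses and opposites that the analysis looks for.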


Fig. 3.4 Sample mean correlation over 30 trials for PBIL (dotted) versus OPBIL (solid). We

find OPBIL indeed yields a lower correlation than PBIL


We also compute the mean fitness-distance

ḡ = (2/k) ∑_{i=1}^{k/2} g(R^t_{1,i}, R̆^t_{1,i})     (3.18)

which measures the fitness-distance between the ith guess R^t_{1,i} and its opposite R̆^t_{1,i} at iteration t; it is shown in Figure 3.5. The distance for PBIL is relatively low

throughout the 150 iterations, gently decreasing as the algorithm converges. However, as a consequence of OPBIL's ability to maintain diversity, the distances between samples increase during the early stages of the search and then similarly rapidly decrease. Indeed, this is consistent with the lower correlation shown above.
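Eq. (3.18) averages a pairwise fitness-distance g over the k/2 guess/opposite pairs. Taking g as the absolute fitness difference (an assumption; this excerpt does not pin down g), the computation is:

```python
def mean_fitness_distance(fits, opp_fits):
    """Eq. (3.18): average fitness-distance over k/2 guess/opposite pairs,
    normalized by the sample size k = 2 * len(fits). Here g is taken as the
    absolute fitness difference |f(R) - f(R_opp)|, an illustrative choice."""
    k = 2 * len(fits)
    return (2.0 / k) * sum(abs(a - b) for a, b in zip(fits, opp_fits))
```

A larger value indicates that opposites land far from their guesses in fitness, the diversity property OPBIL exploits.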


Fig. 3.5 Sample mean distance over 30 trials for samples of PBIL (dotted) versus OPBIL

(solid). We find OPBIL indeed yields a higher distance between paired samples than PBIL


The final test examines the standard deviation (w.r.t. the evaluation function) of the distance between samples, given in Figure 3.6. Both algorithms have similarly shaped plots with respect to this measure, reflecting their convergence rates. It seems that the use of opposition aids by infusing diversity early in the search and then quickly focusing once a high-quality optimum is found. Conversely, basic PBIL does not include this bias, and therefore its convergence is less rapid.


Fig. 3.6 Sample mean standard deviations over 30 trials for samples of PBIL (dotted) versus

OPBIL (solid)

3.6 Conclusion

In this chapter we have discussed the application of opposition-based computing

techniques to reducing the required number of function calls. Firstly, a brief


introduction to the underlying concepts of opposition was given, along with conditions under which opposition-based methods should be successful. A comparison to the similar concepts of antithetic variates and quasi-random/low-discrepancy sequences made the uniqueness of our method apparent.

Two recently proposed algorithms, ODE and OPBIL, were briefly introduced, and the manner in which opposition is used to improve their parent algorithms, DE and PBIL respectively, was described. The manner in which opposites are used differs between the two cases, but the underlying concepts are the same.

Using the expensive optimization problem of image thresholding as a benchmark, we examined the ability of ODE and OPBIL to lower the function calls required to reach the pre-specified target value. It was found that both algorithms reduce the expected number of function calls: ODE by approximately 16% (function calls) and OPBIL by 28% (iterations), respectively. Furthermore, concentrating on OPBIL, we showed the hypothesized lower correlation and higher fitness-distance measures for a quality opposite mapping.

Our results are very promising; however, future work is required in various regards. Firstly, a stronger theoretical basis for opposition and for choosing opposite mappings is needed. This could lead to general implementation strategies when no prior knowledge is available. Further application to different real-world problems is also desired.

Acknowledgements

This work has been partially supported by Natural Sciences and Engineering Research Coun-

cil of Canada (NSERC).

References

1. Bai, F., Wu, Z.: A novel monotonization transformation for some classes of global opti-

mization problems. Asia-Pacific Journal of Operational Research 23(3), 371–392 (2006)

2. Baluja, S.: Population Based Incremental Learning - A Method for Integrating Genetic

Search Based Function Optimization and Competitive Learning. Tech. rep., Carnegie

Mellon University, CMU-CS-94-163 (1994)

3. Engelbrecht, A.: Fundamentals of Computational Swarm Intelligence. Wiley, Chichester

(2005)

4. Glover, F., Laguna, M.: Tabu Search. Kluwer, Dordrecht (1997)

5. Goldberg, D.E., Horn, J., Deb, K.: What makes a problem hard for a classifier system?

Tech. rep. In: Collected Abstracts for the First International Workshop on Learning Clas-

sifier Systems (IWLCS 1992), NASA Johnson Space (1992)

6. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. Society

for Industrial and Applied Mathematics (1992)

7. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan

Press (1975)

8. Price, K., Storn, R., Lampinen, J.A.: Differential Evolution: A Practical Approach to

Global Optimization. Springer, Heidelberg (2005)


9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Sci-

ence 220(4598), 671–680 (1983)

10. Lemieux, C.: Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Heidelberg

(2009)

11. Maaranen, H., Miettinen, K., Penttinen, A.: On initial populations of genetic algorithms

for continuous optimization problems. Journal of Global Optimization 37(3), 405–436

(2007)

12. Montgomery, J., Randall, M.: Anti-pheromone as a tool for better exploration of search

space. In: Third International Workshop, ANTS, pp. 1–3 (2002)

13. O’Gorman, L., Sammon, M., Seul, M. (eds.): Practical Algorithms for Image Analysis.

Cambridge University Press, Cambridge (2008)

14. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based differential evolu-

tion. IEEE Transactions on Evolutionary Computation 12(1), 64–79 (2008)

15. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution

Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 7363–7370 (2006)

16. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution

Algorithms for Optimization of Noisy Problems. In: IEEE Congress on Evolutionary

Computation, pp. 6756–6763 (2006)

17. Rubinstein, R.: Monte Carlo Optimization, Simulation and Sensitivity of Queueing Net-

works. Wiley, Chichester (1986)

18. Sebag, M., Ducoulombier, A.: Extending Population-Based Incremental Learning to

Continuous Search Spaces. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P.

(eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998)

19. Shapiro, J.: Diversity loss in general estimation of distribution algorithms. In: Parallel

Problem Solving in Nature IX, pp. 92–101 (2006)

20. Shokri, M., Tizhoosh, H.R., Kamel, M.: Opposition-based Q(lambda) Algorithm. In:

IEEE International Joint Conference on Neural Networks, pp. 646–653 (2006)

21. Storn, R., Price, K.: Differential evolution- a simple and efficient heuristic for global

optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)

22. Tizhoosh, H.R.: Reinforcement Learning Based on Actions and Opposite Actions. In:

International Conference on Artificial Intelligence and Machine Learning (2005)

23. Tizhoosh, H.R.: Opposition-based Reinforcement Learning. Journal of Advanced Com-

putational Intelligence and Intelligent Informatics 10(4), 578–585 (2006)

24. Tizhoosh, H.R., Ventresca, M. (eds.): Oppositional Concepts in Computational Intelli-

gence. Springer, Heidelberg (2008)

25. Toh, K.: Global optimization by monotonic transformation. Computational Optimization

and Applications 23(1), 77–99 (2002)

26. Ventresca, M., Tizhoosh, H.R.: Improving the Convergence of Backpropagation by Op-

posite Transfer Functions. In: IEEE International Joint Conference on Neural Networks,

pp. 9527–9534 (2006)

27. Ventresca, M., Tizhoosh, H.R.: Opposite Transfer Functions and Backpropagation

Through Time. In: IEEE Symposium on Foundations of Computational Intelligence, pp.

570–577 (2007)

28. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: IEEE

Symposium on Foundations of Computational Intelligence, pp. 186–192 (2007)

29. Ventresca, M., Tizhoosh, H.R.: A diversity maintaining population-based incremental

learning algorithm. Information Sciences 178(21), 4038–4056 (2008)


30. Ventresca, M., Tizhoosh, H.R.: Numerical condition of feedforward networks with op-

posite transfer functions. In: IEEE International Joint Conference on Neural Networks,

pp. 3232–3239 (2008)

31. Weszka, J., Rosenfeld, A.: Threshold evaluation techniques. IEEE Transactions on Sys-

tems, Man and Cybernetics 8(8), 622–629 (1978)

32. Wu, Z., Bai, F., Zhang, L.: Convexification and concavification for a general class of

global optimization problems. Journal of Global Optimization 31(1), 45–60 (2005)

33. Yoo, T. (ed.): Insight into Images: Principles and Practice for Segmentation, Registration,

and Image Analysis. AK Peters (2004)

34. Zhang, H., Fritts, J., Goldman, S.: Image segmentation evaluation: A survey of unsuper-

vised methods. Computer Vision and Image Understanding 110, 260–280 (2008)

Chapter 4

Search Procedure Exploiting Locally

Regularized Objective Approximation:

A Convergence Theorem for Direct Search

Algorithms

Marek Bazan

Locally regularized objective approximation is a method to speed up local optimization processes in which the objective function evaluation is expensive. It was introduced in [1] and further developed in [2]. In this paper we present the convergence theorem of the method. The theorem is proved for the EXTREM [6] algorithm but applies to any Gauss-Seidel algorithm that uses sequential quadratic interpolation (SQI) as a line search method. After some extension it can also be applied to conjugate direction algorithms. The proof is based on the Zangwill theory of closed transformations. This method of proof was chosen instead of the sufficient decrease approach, since the crucial element of the presented proof is an extension of the SQI convergence proof from [14], which is based on that approach.

4.1 Introduction

Optimization processes with objective functions that are expensive to evaluate – since their evaluation usually requires solving a large system of linear equations or simulating some physical process – occur in many fields of modern design. The main strategy for speeding up such processes by constructing a model to approximate the objective function is trust region methods [4]. The application of radial basis function approximation as the approximation model in trust region methods was discussed in [13]. The standard method to prove convergence of a trust region method is the method of sufficient decrease.

In [1] and [2] we presented the search procedure which can be viewed as an

alternative to trust region methods. It relies on combining the direct search algo-

rithm EXTREM [6] with the locally regularized radial basis approximation. The

Marek Bazan

Institute of Informatics, Automatics and Robotics, Department of Electronics,

Wrocław University of Technology, ul. Janiszewskiego 11/17, 50-372 Wrocław, Poland

e-mail: marek.bazan@pwr.wroc.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 73–103.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010


method replaces direct function evaluations by radial basis approximation. The method combined with

the EXTREM algorithm was implemented within the computer program ROXIE for superconducting magnet design and optimization [3]; however, it is a general framework and, as we shall see, any search algorithm that is based on the Gauss-Seidel non-gradient algorithm or a conjugate direction algorithm can be used. We will call our method the Search Procedure Exploiting Locally Regularized Objective Approximation (SPELROA).

In this paper we give the convergence theorem for the SPELROA method combined with the EXTREM minimization algorithm, under the assumption that the radial basis approximation has a relative error less than ε. The method of proof is different from those used in proofs of trust region methods, since it is based on the Zangwill theory of closed transformations. The crucial element of the proof is the modification of the proof of convergence of quadratic interpolation as a line search method (c.f. [14]) under the assumption that a perturbation of the function value may be introduced to the algorithm at each step. We give the conditions on the function values, as well as on ε, that maintain convergence. The proof of convergence of the sequential quadratic interpolation in [14] is based on Zangwill's theory, and as we essentially extend it, we chose this method also to prove the convergence of the whole SPELROA method.

The plan of this chapter is as follows. In the next section we sketch the SPELROA method. In the third section we give some theory from [21] to be used in the following section to prove the main result. Finally, in the fourth section we discuss the radial basis approximation and the heuristics used for its construction. We also describe the difficulties in establishing a strict error bound in the current state of development of radial basis function approximation for sparse data. In the remainder of the paper we give numerical results for three test functions of 6, 8 and 11 variables from a set of test functions proposed in [12].

Let there be given a direct search optimization algorithm A that uses quadratic interpolation as a line search method. The SPELROA method combined with algorithm A can be written in the form of the following algorithm (c.f. [2]).

While generating the set Z we have to take care that data points are not placed too close to each other. When two points are too close to each other – where the distance is controlled by a user-supplied parameter whose value is relative to the diameter of the set Z – one of the points has to be replaced by another point not yet included. This procedure of constructing Z keeps the separation distance (c.f. [16]) greater than the user-supplied parameter value and therefore guarantees that the radial basis function interpolation matrix is not singular. The crucial step of the scheme is point 3, containing a threefold check of whether the approximation f˜(xk ) can be used in algorithm A instead of f (xk ) evaluated directly. The conditions checked in steps 3.a) and 3.b) are related to the radial basis approximation and will be discussed


The Search Procedure Exploiting Locally Regularized Objective Approximation

Input : f : Rd → R – the objective function,

x 0 ∈ Rd – a starting point,

ε >0 – prescribed accuracy of the approximation,

f˜(·) – a radial basis function approximation of f (·),

Is – number of initial steps performed by the direct optimization

algorithm A,

N < Is – size of the dataset to construct the approximation f˜(·),

ε-check – a procedure to check conditions required by the convergence theorem to hold.

0. Perform Is initial steps of the algorithm A.

1. In the k-th step generate point xk for which the function value is supposed to be

evaluated using algorithm A.

2. Generate a set Z from the N nearest points for which the function was evaluated directly.

3. Judge, whether for point xk a reliable approximation of f (xk ) can be

constructed.

a. If point xk is located in a reliable region of the search domain then construct

the approximation f˜ and evaluate f˜(xk ).

b. If the approximation f˜(xk ) was correctly constructed then perform an

ε -check.

c. If the ε -check is positive (i.e. the procedure returns true) then substitute

f (xk ) ← f˜(xk )

to the algorithm A.

d. Else evaluate f (xk ) directly.

4. If the stopping criterion of algorithm A is satisfied then stop.

Else replace k := k + 1 and go to 1.

in section 4.5 whereas the ε -check procedure from step 3.c) is associated with the

convergence theorem and will be discussed in section 4.4.
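The separation-distance control described above can be sketched as a greedy filter over candidate points ordered nearest-first. This is a hypothetical helper, not the chapter's implementation; in particular, the chapter replaces a too-close point with another not yet included, which this sketch approximates by skipping:

```python
import math

def build_dataset(candidates, delta):
    """Greedy construction of Z: accept a candidate only if it lies at least
    delta away from every point already accepted. This keeps the separation
    distance of Z above delta, so the radial basis interpolation matrix
    stays non-singular. `candidates` is a nearest-first list of point tuples."""
    Z = []
    for x in candidates:
        if all(math.dist(x, z) >= delta for z in Z):
            Z.append(x)
    return Z
```

In practice delta would be the user-supplied parameter, chosen relative to the diameter of Z as described above.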

Feasible direction optimization algorithms can be written as

xk+1 = xk + τk dk


where dk is a search direction and τk is a search step. The k-th iteration of such algorithms can be viewed as a composite algorithmic transformation A = M 1 D, where D : Rd → Rd × Rd is a search direction generating transform

D(x) = (x, d)

and M 1 : Rd × Rd → Rd is a line minimization transform

M 1 (x, d) = {y = x + τ d : f (y) = min over τ ∈ J of f (x + τ d)}.

In his monograph [21] Zangwill gave a method of proving the convergence of feasible direction optimization algorithms based on properties of an algorithmic transformation A. In this section we sketch all crucial definitions and lemmas used to state the main convergence theorem.

Definition 1. A transformation A : V → V is a transformation of a point into a set when each point x ∈ V is assigned a set A(x) of points from V. The result of applying the transform A(·) to a point xk can be any point xk+1 from the set A(xk ), thus

xk+1 ∈ A(xk ).

A transformation A = M 1 D defining a feasible direction optimization algorithm is a

transformation of a point into a set.

Definition 2. We say that a transformation A : V → V is closed in a point x∞ , if the

following implication holds true:

1. xk → x∞ , k ∈ K ,

2. yk ∈ A(xk ), k ∈ K ,

3. yk → y∞ ,

imply

4. y∞ ∈ A(x∞ ),

where K is a sequence of natural numbers. We say that the transformation A is closed on X ⊂ V if it is closed in every point x ∈ X.

The property of closedness for algorithmic transformations is an analogue of the property of continuity for "usual" functions.

Theorem 1. (see [21] page 99)

Let a transformation A : V → V of a point into a set be an algorithm which, for a given point x1 , generates a sequence {xk }∞k=1 . Let S ⊂ V be a set of solutions. Let us assume:

1. All points xk are in a compact set X ⊂ V.
2. There exists a function Z : V → R such that:
a. if point x is not a solution, then for any y ∈ A(x), Z(y) < Z(x);


b. if point x is a solution, then either the algorithm finishes or, for any y ∈ A(x), Z(y) ≤ Z(x).
3. The transformation A is closed in all points that are not solutions.

Then either the algorithm finishes in a point which is a solution or any convergent

subsequence generated by the algorithm has its limit in the solution set S.

Now we additionally need two lemmas from [21] concerning the closedness of a composition of closed transforms.

Lemma 1. Let C : W → Y be a function and let B : Y → Z be a transformation of a point into a set. If the function C is continuous in a point w∞ and B is closed in C(w∞), then the composition A = BC : W → Z is closed in w∞.

Lemma 2. If f is continuous, then the line minimization transformation M¹(·, ·) is closed provided J is a compact and bounded interval.

In practice usually another line search operator is considered, since the implementation of the operator M¹(·, ·) is expensive. Let us consider a line search operator M∗ defined as

M∗(x, d) = M¹(x, d) ∪ {y = x + τ d : f(y) ≤ f(x) − Δ, τ ∈ J}. (4.1)

Its value is the set of points at which the function f is decreased by Δ along the direction d from the point x or, when this set is empty, the minimum of the function f along d. A suggestion of a practical application of the operator M∗ can be found in one of the exercises in [21]. In the algorithm EXTREM such an operator is used instead of M¹. The operator M∗ is implemented as the parabolic search algorithm. Application of the operator M∗ is more practical, in particular in the initial part of the optimization process, where the first steps of the parabolic interpolation search give the most significant decrease of the function value, whereas the subsequent steps usually make the function f decrease much more slowly.
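The relaxed operator M∗ of (4.1) can be sketched as follows: accept any step along d that decreases f by at least Δ, and otherwise fall back to the best trial step, which stands in for the exact line minimum M¹(x, d). The function names and the finite grid of trial steps are illustrative assumptions, not the chapter's code.

```python
import numpy as np

def m_star(f, x, d, delta, taus):
    """Sketch of the line-search operator M* from Eq. (4.1).

    Returns the first trial point y = x + tau*d with f(y) <= f(x) - delta;
    if no trial step qualifies, returns the best trial point found
    (a stand-in for the exact line minimum M1(x, d))."""
    fx = f(x)
    best_tau, best_val = None, np.inf
    for tau in taus:
        y = x + tau * d
        fy = f(y)
        if fy <= fx - delta:          # sufficient decrease: y is in M*(x, d)
            return y
        if fy < best_val:
            best_tau, best_val = tau, fy
    return x + best_tau * d           # fallback: approximate line minimizer

# usage: decrease f(x) = ||x||^2 along d = -e1 from x = (2, 0) by Delta = 0.5
f = lambda x: float(np.dot(x, x))
y = m_star(f, np.array([2.0, 0.0]), np.array([-1.0, 0.0]), 0.5,
           np.linspace(0.0, 4.0, 41))
assert f(y) <= f(np.array([2.0, 0.0])) - 0.5
```

In the unperturbed setting the fallback branch plays the role of M¹; the sufficient-decrease branch is what makes the operator cheap early in the search.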

The main result of this note is the following theorem.

Theorem 2. Let f : Ω → R be the objective function of the considered minimization problem. Let there be given a method to approximate the objective function f at certain points of the domain Ω with a relative error ε > 0. If the function f is strictly convex, then the SPELROA method combined with the Gauss-Seidel search algorithm, using approximated sequential quadratic interpolation as the line search method, converges to a stationary point x0 ∈ S, where S = {x : ∇f(x) = 0}.


The composite transformation of one iteration of the algorithm is

A = RM∗Dd M∗Dd−1 . . . M∗D2 M∗D1,

where Di chooses the i-th direction from the orthogonal direction base of the k-th iteration and R is an orthogonalization step that produces a new base along the direction x0^{k−1} xd^{k−1} from the (k − 1)-th iteration.

To show that point 3 of Theorem 1 is fulfilled, we first have to show that the transformation M∗ defined in (4.1) is closed. We prove the following lemma.

Lemma 3. If f is continuous, the transformation M∗ defined in (4.1) is closed in (x∞, d∞) provided J is a compact and bounded interval.

Proof. Consider sequences {(xk, dk)}∞k=1 and {yk}∞k=1. We assume that

1. (xk, dk) → (x∞, d∞), k ∈ K,
2. yk ∈ M∗(xk, dk), k ∈ K,
3. yk → y∞, k ∈ K.

So we have that

yk = xk + τk dk,

where τk ∈ J is such that

f(xk + τk dk) ≤ f(xk) − Δ.

Since J is compact, we can choose a convergent subsequence

τk → τ∞, k ∈ K¹,

where K¹ ⊂ K and τ∞ ∈ J.

For a fixed τ ∈ J, from the definition of yk it follows that

f(yk) < f(xk + τ dk), (4.2)

or

f(yk) − f(xk + τ dk) ≤ Δ. (4.3)

Note that if (4.3) is not satisfied then (4.2) is satisfied. Since f is continuous, in the limit over k ∈ K¹ we get

f(y∞) ≤ f(x∞ + τ d∞), (4.4)

or

f(y∞) − f(x∞ + τ d∞) ≤ Δ. (4.5)


Because for any τ either (4.4) or (4.5) is fulfilled, for any point y∗ ∈ M∗(x∞, d∞) we have

f(y∞) < f(y∗), (4.6)

or

f(y∞) − f(y∗) ≤ Δ. (4.7)

On the other hand, at a point y∗ ∈ M∗(x∞, d∞) the function f attains its least value over τ ∈ J, and

y∞ = x∞ + τ∞ d∞, τ∞ ∈ J,

or

f(y∞) < f(x∞) − Δ, (4.8)

therefore

f(y∗) − f(y∞) ≤ Δ. (4.9)

Comparing (4.8) and (4.9) with (4.6) and (4.7) we get the result

y∞ ∈ M ∗ (x∞ , d∞ ).

To apply Lemma 1 we also have to notice that the transformations Di(x) = (x, di) (i = 1, . . . , d) are continuous functions. For direct search algorithms the transformations Di generate orthogonal directions; they are the same as in the Gauss-Seidel algorithm (c.f. [21]). For conjugate direction search algorithms the transformations Di generate successive conjugate directions. The transformation R that generates the orthogonal search directions [d0^new, . . . , d(d−1)^new] for the step k is defined as

R(x0^{k−1}, xd^{k−1}) = [d0^new, . . . , d(d−1)^new],

where the sequence of the new orthogonal vectors is uniquely defined by the orthogonalization of the vectors w0, w1, . . . , wd−1:

w0 = s0 d0 + s1 d1 + . . . + sd−1 dd−1
w1 =         s1 d1 + . . . + sd−1 dd−1
. . .
wd−1 =                      sd−1 dd−1,

where the scalars s0, s1, . . . , sd−1 correspond to the step sizes in all directions in the step k − 1. The transformation R is uniquely defined without any conditions on the scalars s0, s1, . . . , sd−1 as long as the orthogonalization is performed using the algorithm presented in [15]; in this case it is also a continuous function. Finally, since the transformation A is a composition of the closed transformations M∗ with the continuous functions Di (i = 0, . . . , d − 1) and R, the assumptions of Lemma 1 are satisfied and the transformation A is closed. That proves that assumption 3 of Theorem 1 holds for the unperturbed M∗. In the succeeding subsection we show that the transformation M∗ can be realized by the perturbed transformation M¹.


For non-gradient optimization algorithms with an orthogonal basis as a set of search directions, the transformation M∗ is the only place in the algorithm where the perturbation arising from the approximation of the objective function is introduced by the SPELROA method. Therefore, to show that point 2 of Theorem 1 is fulfilled, it is sufficient to show that the implementation of the transformation M∗, which allows a perturbation of the function value at the level ε > 0, minimizes the function f along the direction d.

A proof of the convergence of the parabolic interpolation line search method can

be found in [8] or [14]. Here we will give conditions on the perturbation of the

function so that the proof given in [14] holds true. We will keep the notation as

close as possible to that in [14].

Let the function f : R → R be unimodal. Let for a triplet ζ(i) = (ζ1(i), ζ2(i), ζ3(i)) it hold that f(ζ2(i)) ≤ min{f(ζ1(i)), f(ζ3(i))}, i.e. the interval [ζ1(i), ζ3(i)] contains a unique minimizer of the function f.

For a non-perturbed objective function we define the set of feasible triplets T ⊂ R³ defining an interval [ζ1, ζ3] that contains the minimizer λ̂ as

T := {ζ ∈ R³ : ζ1 < ζ2 < ζ3, f(ζ2) ≤ min{f(ζ1), f(ζ3)}}
∪ {ζ ∈ R³ : ζ1 = ζ2 < ζ3, f′(ζ1) ≤ 0, f(ζ3) ≥ f(ζ1)}
∪ {ζ ∈ R³ : ζ1 < ζ2 = ζ3, f′(ζ3) ≥ 0, f(ζ1) ≥ f(ζ3)}
∪ {ζ ∈ R³ : ζ1 = ζ2 = ζ3 = λ̂}.

For ζ ∈ T with ζ1 < ζ2 < ζ3 the minimum of the quadratic interpolating the points (ζ1, f(ζ1)), (ζ2, f(ζ2)), (ζ3, f(ζ3)) equals

λ∗(ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)].
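The vertex formula for λ∗(ζ) can be checked numerically; the helper below is an illustrative implementation (not the chapter's code) of the minimizer of the interpolating quadratic.

```python
def parabola_vertex(z, fz):
    """Minimizer lambda*(zeta) of the quadratic interpolating the points
    (z1, f(z1)), (z2, f(z2)), (z3, f(z3)).  Illustrative helper."""
    z1, z2, z3 = z
    f1, f2, f3 = fz
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den

# for f(t) = (t - 1)^2 the quadratic model is exact, so the vertex is 1
assert abs(parabola_vertex((0.0, 0.5, 2.0), (1.0, 0.25, 1.0)) - 1.0) < 1e-12
```

Since the interpolating quadratic is exact for any quadratic f, the routine recovers the minimizer in one evaluation in that case; for general unimodal f it is one step of the sequential procedure below.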

Then the set of admissible replacement triplets A(ζ) is a set of candidate triplets that may replace ζ ∈ T, defining a smaller interval containing λ̂ in the next iteration of the algorithm. A0(ζ) is defined as A0(ζ) := T ∩ {u1(ζ), u2(ζ), u3(ζ), u4(ζ)}, where

u1(ζ) = (ζ1, λ∗(ζ), ζ2),
u2(ζ) = (ζ2, λ∗(ζ), ζ3),
u3(ζ) = (λ∗(ζ), ζ2, ζ3),
u4(ζ) = (ζ1, ζ2, λ∗(ζ)). (4.10)

The crucial assumption on the perturbation that we introduce into the triplet used to construct a quadratic is that we always allow a perturbation in only one of the three points. Without loss of generality let us assume that the minimum of the perturbed quadratic is constructed only for ζ such that ζ1 < ζ2 < ζ3.

For a perturbation of the value of the function f at the level ε > 0 we define three sets of triplets:


T1(ε) := {ζ ∈ R³ : ζ1 < ζ2 < ζ3, f(ζ2) ≤ min{f(ζ1)(1 − |ε|), f(ζ3)}}, (4.11)
T2(ε) := {ζ ∈ R³ : ζ1 < ζ2 < ζ3, f(ζ2)(1 + |ε|) ≤ min{f(ζ1), f(ζ3)}}, (4.12)
T3(ε) := {ζ ∈ R³ : ζ1 < ζ2 < ζ3, f(ζ2) ≤ min{f(ζ1), f(ζ3)(1 − |ε|)}}. (4.13)

The corresponding perturbed minima are

λ̃1∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1)(1 + |ε|) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1)(1 + |ε|) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)],

λ̃2∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2)(1 + |ε|) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2)(1 + |ε|) + (ζ2 − ζ1) f(ζ3)],

λ̃3∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)(1 + |ε|)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)(1 + |ε|)].

Such definitions of the sets Tl(ε) (l ∈ {1, 2, 3}) provide that the perturbed minima are contained within the interval [ζ1, ζ3]. The corresponding sets of admissible triplets are defined as

Ãl(ε; ζ) := Tl(ε) ∩ {ũl1(ε; ζ), ũl2(ε; ζ), ũl3(ε; ζ), ũl4(ε; ζ)},

where

ũl1(ε; ζ) = (ζ1, λ̃l∗(ε; ζ), ζ2),
ũl2(ε; ζ) = (ζ2, λ̃l∗(ε; ζ), ζ3),
ũl3(ε; ζ) = (λ̃l∗(ε; ζ), ζ2, ζ3),
ũl4(ε; ζ) = (ζ1, ζ2, λ̃l∗(ε; ζ)), (4.14)

with l ∈ {1, 2, 3}.

Lemma 4. Let S denote the set of stationary points of the function f.

1. For every ζ ∈ T, the set A(ζ) = A0(ζ) ∪ Ã1(ε; ζ) ∪ Ã2(ε; ζ) ∪ Ã3(ε; ζ) is nonempty.
2. The set valued map A(·) is closed.
3. For every ζ ∈ T\S := {ζ ∈ T : ζ ∉ S} such that ζ1 < ζ2 < ζ3 there is c(y) < c(ζ) for all y ∈ A(ζ), for the cost function

c(ζ) := f̃1(ζ1) + f̃2(ζ2) + f̃3(ζ3),

where f̃l(·) = f(·) or f̃l(·) = f(·)(1 + |ε|) depending on whether the value at the point ζl was exact or perturbed.

Proof. Let us introduce the following notation: f̃l(·) = f(·) when the value at the point x = ζl is exact, and f̃l(·) = f(·)(1 + ε) when the value at x = ζl is perturbed. At the beginning let us note that Tl(ε) ⊂ T for l ∈ {1, 2, 3} and ε > 0.


1. Let ζ = (ζ1, ζ2, ζ3) ∈ T be fixed. If f(ζ1), f(ζ2) and f(ζ3) are computed without perturbation, then A(ζ) is not empty by the proof in [14]. Let us consider then that a function value was approximated with the relative error equal to ε. The minimum of the quadratic constructed in this case will be λ̃l∗(ε; ζ), where l ∈ {1, 2, 3} depends on the point at which the function value is approximated. Let us consider the case when λ̃l∗(ε; ζ) ∈ [ζ1, ζ2], providing moreover that the minimum λ∗(ζ) obtained as if the function were evaluated without any perturbation also belongs to [ζ1, ζ2]. Then A(ζ) is empty if and only if both ũ1(ε; ζ) and ũ3(ε; ζ) are not in A(ζ), i.e. (for a perturbation in ζ1) if and only if

f(λ̃1∗(ε; ζ)) > f(ζ2) (4.15)

and

f(ζ2) > min{f(λ̃1∗(ε; ζ)), f(ζ3)} ≥ min{f(λ̃1∗(ε; ζ)), f(ζ2)} (4.16)

(and analogously for perturbations in ζ2 and ζ3).

Since the inequalities (4.16) imply that f(ζ2) ≥ f(λ̃1∗(ε; ζ)), we get a contradiction with the inequality (4.15), which proves the thesis when the perturbation is in the point ζ1. A similar contradiction is obtained from the analogous inequalities for perturbations in ζ2 and ζ3. Please note that the condition that the corresponding unperturbed λ∗(ζ) is also in [ζ1, ζ2] guarantees that if we compare ỹ = (ỹ1, ỹ2, ỹ3) ∈ Ãl(ζ) (l ∈ {1, 2, 3}) with y = (y1, y2, y3) ∈ A0(ζ), we get ỹ1 ≤ y1 and ỹ3 ≥ y3. The case when λ̃∗(ζ) belongs to [ζ2, ζ3] is symmetric. This shows that A(ζ) is not empty.

2. We will prove the closedness of A from Definition 2. Let us assume that ζ(i) → ζ∗ ∈ T and that there exist ζ∗∗ ∈ T and an infinite subsequence K ⊂ N such that ζ(i+1) ∈ A(ζ(i)) for every i ∈ K and ζ(i+1) →K ζ∗∗ as i → ∞. Then there must exist k ∈ {1, 2, 3, 4} and an infinite subsequence K′ ⊂ K such that ζ(i+1) = uk(ζ(i)) or ζ(i+1) = ũl_i k(ε; ζ(i)) (where l_i ∈ {1, 2, 3}) for every i ∈ K′. As shown in [14], the functions uk(·) are continuous, and therefore when the sequence {ζ(i)}∞i=0 does not contain any ζ(i) generated by approximated values, it follows from the continuity of uk(·) with respect to ζ and the closedness of the set T that uk(ζ(i)) → uk(ζ∗) = ζ∗∗ ∈ A(ζ∗). This proves the closedness of the transformation A if no approximation is used in any sequence ζ(i).

Now we will consider sequences containing approximated triplets. Introducing the approximated triplets ζ(i+1) (i.e. ζ(i+1) ∈ Ãl(ε; ζ(i))) introduces discontinuities of the first kind into the functions uk(ζ), and the argument based on the continuity of the functions uk(ζ) cannot be applied directly. We will show that the algorithm A introduces a finite number of isolated discontinuity points, which guarantees that from any sequence {ζ(i)}∞i=1, after removing some finite number of initial elements, we can apply the proof from [14].

We have to consider two cases


• If the number of approximated triplets in {ζ(i)}∞i=0 is finite, then the proof of the closedness of the transformation A given in [14] applies after removing from {ζ(i)}∞i=0 some number of initial elements.

• Let us then assume that the triplets ũlk occur an infinite number of times in {ζ(i)}∞i=0. Since {ζ(i)}∞i=1 is convergent, for any δ ∈ R there exists i0 such that for all i > i0 we have

|ζ(i) − ζ(i+1)| = |(ζ1(i), ζ2(i), ζ3(i)) − (ζ1(i+1), ζ2(i+1), ζ3(i+1))| < δ. (4.17)

Let us choose such an i1 for which inequality (4.17) holds. Since the approximation in the next iterations is used infinitely many times, there exists a subsequence K = {i : i > i1} ⊂ N such that if i ∈ K then

the approximation is used in iteration i. We will show that for any δ it is possible to choose ε such that

|ζ(i) − ζ(i+1)| > δ for i ∈ K. (4.18)

To give conditions on ε, we solve the inequalities (4.18) using the expressions (4.14) for ũlk(ε, ζ) for k = 1, . . . , 4 and l ∈ {1, 2, 3}, applying the method of the parabola transformation q̂ from Appendix A.

For example, for a perturbation in ζ1(i) we get

ε1 > (C − B − A) / (A(Ka − 1)) (4.19)

and

A(1 + ε1) + B − C > 0 and A(Ka − 1) < 0,  or  A(1 + ε1) + B − C < 0 and A(Ka − 1) > 0, (4.20)

where

ε1 = ε / f(ζl(i)),  Ka = 2δ/(ζ3 − ζ1) − [1 − ζ(r)]²,

with A, B and C defined in Appendix A. We leave to the reader showing that these conditions are not contradictory, as well as the derivation of the analogous inequalities for l = 2 and l = 3.

The above considerations mean that for any δ the value ε can be chosen so that (4.18) holds. But this contradicts the assumption that {ζ(i)}∞i=0 converges, since we can choose two subsequences of {ζ(i)}∞i=0 that converge to two different accumulation points: the first formed from the elements ζ(i1) and the second from the elements ζ(i1+1). Both are infinite. From this we conclude that, from some index on, the sequence


{ζ(i)}∞i=0 cannot contain approximated points, and therefore the proof from [14] also applies in this case.

This finishes the proof of the closedness of transformation A(ζ ).

3. Let us assume that ζ ∈ T\S. Then λ̃∗(ζ) ∈ (ζ1, ζ3). Hereafter in this point we consider the quadratic obtained with the transformation q̂ from Appendix A; this transformation preserves all properties used in this proof. In the following considerations we will use the expression Δl(ε; ζ), which is the distance between the unperturbed and the perturbed minimum, i.e. Δl(ε; ζ) = |λ̃l∗(ζ) − λ∗(ζ)|. See Appendix B for the expression for Δl(ε; ζ).

Here the situation is also symmetric with respect to ζ1 and ζ3, and we will consider the case where λ∗(ζ) ∈ (ζ1, ζ2] as well as λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).

Firstly we observe that f(ζ2) < f(ζ3), because if there were f(ζ2) = f(ζ3), then λ̃∗(ζ) = (1/2)(ζ2 + ζ3), since either no value is perturbed or only the value at the point ζ2 is perturbed and this perturbation has no effect on the minimum. Both cases contradict the assumption that λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).

When λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}), then only u1(ζ) and ũl1(ε; ζ) (l ∈ {1, 2, 3}) and u3(ζ) and ũl3(ζ) (l ∈ {1, 2, 3}) can belong to A(ζ). For unperturbed function values we have λ∗(ζ) < ζ2, and for perturbed function values we have λ̃l∗(ε; ζ) < ζ2, so the line of proof from [14] remains valid. We get only three cases.

a. A(ζ) = {u1(ζ), ũ11(ζ), ũ21(ζ), ũ31(ζ)}. Then, since u3(ζ) ∉ A(ζ) as well as ũl3(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), and f(λ∗(ζ)) < f(ζ2) as well as f(λ̃l∗(ε; ζ)) < f(ζ2) (l = 1, 3) and

f(λ̃2∗(ε; ζ)) < f(ζ2)(1 + |ε|), (4.22)

we must have

c(ũl1(ζ)) = f̃l(ζ1) + f(λ̃l∗(ζ)) + f̃l(ζ3) < f̃l(ζ1) + f̃l(ζ2) + f̃l(ζ3) = c(ζ),

where f̃l(·) = f(·) or f̃l(·) = f(·)(1 + ε) depending on which value was perturbed.

b. A(ζ) = {u3(ζ), ũ13(ζ), ũ23(ζ), ũ33(ζ)}. Then, since u1(ζ) ∉ A(ζ) as well as ũl1(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), we must have that f(ζ2) ≤ f(λ∗(ζ)) as well as f̃(ζ2) ≤ f(λ̃l∗(ζ)) (l ∈ {1, 2, 3}), depending on which value was perturbed. Also f(λ∗(ζ)) < f(ζ1) as well as f(λ̃l∗(ζ)) < f̃l(ζ1) (l ∈ {1, 2, 3}), since otherwise we would have a local maximum in [ζ1, ζ2], contradicting unimodality. Therefore in this case we must have

c(ũl3(ζ)) = f(λ̃l∗(ζ)) + f̃2(ζ2) + f̃3(ζ3) < f̃l(ζ1) + f̃l(ζ2) + f̃l(ζ3) = c(ζ).


c. Finally, we can have the case A(ζ) = {u1(ζ), u3(ζ)}. In this case we are not able to include in A(ζ) any of the approximated triplets ũl(ε; ζ) (l = 1, 2, 3). This is because of the following properties:

i. f(ζ2) < f(ζ1) by assumption,
ii. λ∗(ζ) ≤ ζ2,
iii. f(λ∗(ζ)) = f(ζ2), which implies λ∗(ζ) = ζ2.

These equalities hold since otherwise we would have a contradiction with the unimodality of f(·). Approximating any of the values would mean that we would not be able to guarantee property iii. Therefore, since f(ζ2) < min{f(ζ1), f(ζ3)}, we get c(u1(ζ)) < c(ζ) and c(u3(ζ)) < c(ζ). From a practical point of view, for a given ζ(i) where one of the coordinates is approximated or all are exact, we can decide whether we have to use exact values, i.e. remove the approximation, by checking whether condition (4.23) holds.

This exhausts all the possibilities and finishes the proof of the third point.

We can now formulate the Sequential Quadratic Interpolation Algorithm with Perturbation (c.f. [14] for the unperturbed version).

Please note that in the following we use the expressions "approximated" and "perturbed" function values interchangeably; they are synonymous here. Moreover, the adjective "exact" here means perturbed at the level of the machine precision εM, where εM ≪ ε.

The main condition under which we can make use of the available approximation

of the function in one of the points is the separation property, i.e. for the approxi-

mation used in a point with an index l ∈ {1, 2, 3} the triplet ζi belongs to Tl where

Tl is defined by (4.11), (4.12) or (4.13) respectively.

Now we can formulate a convergence theorem for the above algorithm, analogous to that in ([14], p. 155).

Theorem 3. Let {ζ(i)}∞i=0 be a sequence constructed by Algorithm 2 for minimizing a continuously differentiable and unimodal function f : R → R. Then ζ(i) → ζ̂ as i → ∞ with ζ̂ ∈ S.

To prove the above theorem we will show the way in which we can apply the proof

given in [14] making use of the Lemma 4.

Proof. The main difficulty in the application of the method of the proof from [14] is the fact that using the approximation at certain points of the domain causes a discontinuity of the cost function c as well as a discontinuity of the functional computing the minimum of the quadratic with respect to a parameter triplet ζ(i).

Let us observe first that the proof of the third point of Lemma 4 shows that allowing the perturbation according to (4.23) provides that {ζ1(i)}∞i=0 is monotone increasing and {ζ3(i)}∞i=0 is monotone decreasing. Since both these sequences are bounded, they are both convergent. Moreover, keeping Δ(ε; ζ(i)) at a level such that otherwise the approximation


Algorithm 2. The Sequential Quadratic Interpolation Algorithm with the Objective Function Perturbation

Input: ζ(0) ∈ T – a starting point,
ε > 0 – the relative error of the approximation available at certain points of the function evaluation.

0. Set i = 0.
1. Compute λ∗ = λ∗(ζ(i)) or λ∗ = λ̃l∗(ε; ζ(i)) depending on whether the function value was exact in all points ζ1(i), ζ2(i), ζ3(i) or it was perturbed in the point ζl(i) (l ∈ {1, 2, 3}), respectively.
2. If λ∗ = ζ1(i) or λ∗ = ζ3(i) then STOP, else construct the set A(ζ(i)):
   a. If the approximation to any value of the triplet ζ(i) is not available, then A(ζ(i)) = A0 according to (4.10).
   b. If the approximation in the point ζl(i) (l ∈ {1, 2, 3}) is available:
      i. Compute the transformation q̂ as described in Appendix A.
      ii. Compute Δl(ε; ζ(i)).
      iii. If |λ̃l∗(ε; ζ(i)) − ζ2(i)| < Δl(ε; ζ(i)), then A(ζ(i)) = A0 and go to 3.
      iv. Otherwise, A(ζ(i)) = A0 ∪ Ãl.
3. Compute ζ(i+1) ∈ arg min{c(ζ) : ζ ∈ A(ζ(i))}.
4. Replace i := i + 1 and go to step 1.

would not be used guarantees that ζ̂ ∈ [ζ1(i), ζ3(i)] for any i ∈ N. We have to distinguish between two nontrivial cases.

1. {ζ(i)}∞i=1 → ζ̂ and ζ̂ = (ζ̂1, ζ̂2, ζ̂3) is an accumulation point such that ζ̂1 < ζ̂2 < ζ̂3. In this case the contradiction is obtained using the continuity of the cost function c, as if an unperturbed algorithm were used. We can use the continuity argument here since, as was shown in the proof of Lemma 4, the approximation can only be used in a finite number of steps. Therefore, after removing a certain number of initial steps, the proof from [14] applies.
2. The first case does not apply, i.e. the sequence constructed by Algorithm 2 can have two accumulation points. In [14] it is shown that the same


holds true if the constructed sequence ends up in the stopping criterion with an unperturbed triplet. Since the sequence is not infinite, we have to consider one more case, namely when the algorithm stops with a perturbed triplet.

The two accumulation points are ζ∗ = (ζ̂1, ζ̂1, ζ̂3) and ζ∗∗ = (ζ̂1, ζ̂3, ζ̂3). To provide the separation of function values, in the first case the perturbation can only be in the point ζ̂3, and in the second case only in the point ζ̂1. In the above sequences we therefore have ζ2(i) → ζ1(i) or ζ2(i) → ζ3(i). On the other hand, the separation between λ∗(ε; ζ(i)) and ζ2(i) has to be greater than Δ3(ε; ζ(i)) and Δ1(ε; ζ(i)) in the two cases, respectively. This gives the contradiction with the convergence.

From the practical point of view, to provide the convergence of the perturbed SQI algorithm, and therefore of the whole SPELROA method, it is sufficient to ensure that the procedure ε-check in step 3.c) of Algorithm 1 is launched in each iteration of the perturbed SQI algorithm. It takes the last three points x_{k−2}, x_{k−1}, x_k, computes ζk = (0, t1, t2) such that

t1 = (x_{k−2} − x_{k−1}) / (x_{k−2} − x_k),  t2 = 1,

then computes the transformation q̂ defined by (4.34) to obtain the points (0, −2), (ζ(r), q̂(ζ(r))), (1, −1) when f̃(x_{k−2}) < f̃(x_k) (for the case when f̃(x_{k−2}) > f̃(x_k) it is sufficient to rotate the scaled parabola around the point 0.5), and scales ε:

ε := ε / f(x_{k−2}), when the perturbation is in the point x_{k−2},
ε := ε / f(x_{k−1}), when the perturbation is in the point x_{k−1},
ε := ε / f(x_k), when the perturbation is in the point x_k.

Then:

1. If the perturbation is at 0, then if ε satisfies (4.35) and fulfills (4.36) or (4.37), the approximation can be used; otherwise calculate the objective function deterministically.
2. If the perturbation is at ζ(r), then if ε satisfies (4.38) and fulfills (4.37) or (4.40), the approximation can be used; otherwise calculate the objective function deterministically.
3. If the perturbation is at 1, then ε has to satisfy analogous conditions (left to the reader) for the approximation to be used by the algorithm.

In this section we will discuss the main aspects and possibilities of constructing a

radial basis approximation of the objective function in Algorithm 1.

Since the data set Z = {xi}Ni=1 is very sparse, a method to detect regions rich in data within the convex hull Ω of Z is required. In [2] we introduced the merit function

γ(x, Z) := ( Σ_{i<j} aij Wij ) / ( Σ_{i<j} Wij ), (4.24)

where aij = dij/(ri + rj), Wij = 1/(ri + rj), dij = ||xj − xi||2 and rj = ||x − xj||2. γ(x, Z) measures how well data points from Z surround the evaluation point x. The numbers aij measure how far the point x is placed from the interval xixj; the maximal value 1 is attained by aij if x lies on this interval. The weights Wij emphasize in γ(x, Z) the impact of the intervals xixj that are close to the point x. The additional division by the sum of the weights Wij provides that for any x ∈ Rd the range of values of γ(x, Z) is (0, 1]. If the value of γ(x, Z) is greater than a certain threshold value, then at the evaluation point x the construction of an approximation with a good local quality can be expected.
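A direct implementation of (4.24) follows; the forms aij = dij/(ri + rj) and Wij = 1/(ri + rj) are reconstructed from the surrounding text and should be treated as assumptions rather than the chapter's exact code.

```python
import numpy as np

def gamma_merit(x, Z):
    """Data-surrounding merit gamma(x, Z) of Eq. (4.24); sketch assuming
    a_ij = d_ij / (r_i + r_j) and W_ij = 1 / (r_i + r_j)."""
    r = np.linalg.norm(Z - x, axis=1)            # r_j = ||x - x_j||
    num = den = 0.0
    n = len(Z)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.linalg.norm(Z[j] - Z[i])   # d_ij = ||x_j - x_i||
            w = 1.0 / (r[i] + r[j])              # W_ij
            num += (d_ij * w) * w                # a_ij * W_ij
            den += w
    return num / den

# x on the segment between the two data points -> a_ij = 1, so gamma = 1
Z = np.array([[0.0, 0.0], [2.0, 0.0]])
assert abs(gamma_merit(np.array([1.0, 0.0]), Z) - 1.0) < 1e-12
```

By the triangle inequality dij ≤ ri + rj, each aij lies in (0, 1], which is why the weighted average γ stays in (0, 1] as stated in the text.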

Fig. 4.1 γ(x, Z) defined by (4.24). Here the set Z is a data set constructed at the 50-th optimization step of a 2-parameter Rosenbrock function optimization with the EXTREM algorithm (axes x1, x2).

For a given data set Z = {xi, f(xi)}Ni=1 ⊂ Rd × R of pairwise different data points xi, and for the Gaussian radial basis function φ(x) = exp(−x²/r²), r ≥ 0 (see [9]), a radial basis function interpolant s(x) is defined as

s(x) = ΣNi=1 wi φ(||x − xi||), (4.25)

where

s(xi) = f(xi) = fi, i = 1, . . . , N.


The interpolation conditions lead to the linear system

Φw = f, (4.26)

where Φij = φ(||xi − xj||). Positive definiteness [9] of φ guarantees the nonsingularity of the matrix Φ, and thus there exists a unique solution w for which s(x) interpolates the data from the set Z. Although another choice of positive definite radial basis functions is possible, we choose the Gaussian function due to the natural interpretation of its parameter r, which can be set to a value proportional to the diameter of the data set Z = {xi}Ni=1. There are two reasons for which we solve an approximation problem rather than the interpolation one. Firstly, when the data set Z is very irregular and φ is strictly positive definite, the matrix Φ is very ill-conditioned. Secondly, solving (4.26) yields a solution s(x) which may oscillate between data points where data is sparse.

The approximation solution is sought by means of Tikhonov regularization. Suppose that we are given a linear mapping T : H → R and define the regularization operator J : H → R by J(s) = ||Ts||²R, where H is the native space of the underlying radial basis function φ(·). For a function f ∈ H and for a prescribed value of the single regularization parameter λ, its approximation is a function fλ(x) ∈ H of the form (4.25) which is the solution of the minimization problem

min { ΣNi=1 [f(xi) − fλ(xi)]² + λ J(fλ) : fλ ∈ H }. (4.27)

The solution is obtained from the regularized normal equations

(ΦᵀΦ + λ²I)w = Φᵀf (4.28)

for a λ > 0 governing the trade-off between data reproduction and the desired smoothness of the solution fλ(x). Due to the ill-conditioning of the matrix Φ, a direct inversion of the matrix (ΦᵀΦ + λ²I) in (4.28) is not numerically stable.

To solve (4.28) in a numerically stable way we use the singular value decomposition of the matrix Φ, defined as

Φ = USVᵀ, (4.29)

where U and V are orthogonal, S = diag(σ1, . . . , σN), and σ1 ≥ σ2 ≥ · · · ≥ σN are the singular values of Φ. The singular value decomposition is unique up to the signs of the columns of the matrices U and Vᵀ. Using the above decomposition we express the solution operator of (4.28) as

(ΦᵀΦ + λ²I)⁻¹Φᵀ = VΩλUᵀ, (4.30)

where

Ωλ = diag( σ1/(σ1² + λ²), σ2/(σ2² + λ²), . . . , σN/(σN² + λ²) ).


Using the above equation, the weight vector wλ is expressed (see [5]) by an expansion with respect to the singular vectors of the matrix Φ:

wλ = VΩλUᵀf = ΣNi=1 [σi/(σi² + λ²)] (uiᵀf) vi. (4.31)

Comparing the expansion (4.31) for λ > 0 with the expansion for λ = 0, i.e. for the interpolation problem, which reads

w = VS⁻¹Uᵀf = ΣNi=1 (1/σi) (uiᵀf) vi, (4.32)

one can see the role of the regularization parameter λ. For σp ≥ λ ≥ σp+1 we have

σk/(σk² + λ²) ≪ 1/σk (k > p),

and hence the impact of the singular vectors corresponding to singular values σk < λ is damped in the expansion (4.31). This enables us to avoid oscillations of the solution that are introduced by inverting small singular values in the expansion of the weight vector. Another approach to the problem of the ill-conditioning of the interpolation matrix generated by multiquadric functions was presented in [7].
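The SVD filtering of (4.31) can be sketched as follows for the Gaussian basis; this is an illustrative implementation, not the code used in the chapter.

```python
import numpy as np

def tikhonov_rbf_weights(X, fvals, r, lam):
    """Weights w_lambda via the SVD filter factors sigma/(sigma^2 + lambda^2)
    of Eq. (4.31), for the Gaussian basis phi(t) = exp(-t^2 / r^2)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = np.exp(-D**2 / r**2)                 # interpolation matrix
    U, S, Vt = np.linalg.svd(Phi)
    filt = S / (S**2 + lam**2)                 # damps sigma_k < lambda
    return Vt.T @ (filt * (U.T @ fvals))       # w_lambda = V Omega U^T f

def rbf_eval(x, X, w, r):
    d = np.linalg.norm(X - x, axis=1)
    return float(np.exp(-d**2 / r**2) @ w)

# usage: lam = 0 reproduces plain interpolation at the data points
X = np.array([[0.0], [0.5], [1.0]])
fvals = np.array([0.0, 0.25, 1.0])
w = tikhonov_rbf_weights(X, fvals, r=1.0, lam=0.0)
assert abs(rbf_eval(np.array([0.5]), X, w, 1.0) - 0.25) < 1e-8

# heavier regularization shrinks the weight vector (smoother solution)
w_smooth = tikhonov_rbf_weights(X, fvals, r=1.0, lam=10.0)
assert np.linalg.norm(w_smooth) < np.linalg.norm(w)
```

Setting lam = 0 recovers the expansion (4.32); increasing lam damps the contributions of the small singular values exactly as described above.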

An appropriate procedure to choose the λ parameter is a crucial issue in Tikhonov regularization. The chosen procedure should be able to find a λ for which the solution s(x) reproduces the data set Z well and, at the same time, generalizes well in-between the data points. These goals are conflicting, and therefore the method should strike the right balance between reproduction of the data set and generalization to data from outside it. In [2] we introduced a method that relates the choice of λ to the reproduction quality at the data points close to the evaluation point x.

The method was defined as follows. Let us define

rj = ||x − xj||2, j = 1, . . . , N.

For a sequence of λ's that covers the singular value spectrum (σN, σ1) of the matrix S from the decomposition (4.29), we calculate the Normalized Local Mean Square Error


NLMSEλ,Z(x) = [ ΣNj=1 ([sλ(xj) − fj]²/fj²) · (1/rj²) ] / [ ΣNj=1 1/rj² ], (4.33)

and the oscillation measure

WGVλ,Z(x) = [ ΣNj=1 ||∇sλ(xj) − Gλ,Z(x)||² / rj² ] / [ ΣNj=1 1/rj² ],

where

Gλ,Z(x) = [ ΣNj=1 ∇sλ(xj) / rj² ] / [ ΣNj=1 1/rj² ],

sλ is the approximation constructed on the data set Z, and ∇ is the gradient operator. It is a discrepancy method, since only those λ's are considered for which the value of NLMSE is smaller than a prescribed, user-defined threshold value NLMSEthr. The threshold value NLMSEthr says which approximation quality has to be preserved at the points of Z nearest to the evaluation point x. From the set of λ's for which NLMSEλ,Z(x) is smaller than NLMSEthr, the minimizer of the oscillation measure WGVλ,Z(x) is chosen as the optimum.
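The resulting selection rule (keep only the λ's whose NLMSE lies below NLMSEthr, then minimize WGV among them) can be sketched as below; the interface, with precomputed values sλ(xj) and gradients per candidate λ, is an assumption of ours.

```python
import numpy as np

def select_lambda(lams, s_vals, grads, fvals, r, thr):
    """Discrepancy-style choice of lambda: among candidates whose NLMSE
    (Eq. 4.33) is below thr, pick the one minimizing WGV.
    s_vals[k, j] = s_lam(x_j) and grads[k, j] = grad s_lam(x_j)
    for lambda = lams[k]; illustrative interface."""
    wts = 1.0 / r**2                                      # weights 1 / r_j^2
    best_lam, best_wgv = None, np.inf
    for k, lam in enumerate(lams):
        nlmse = np.sum(((s_vals[k] - fvals) / fvals)**2 * wts) / wts.sum()
        if nlmse > thr:
            continue                                      # reproduction too poor
        G = (grads[k] * wts[:, None]).sum(0) / wts.sum()  # G_{lam,Z}(x)
        wgv = np.sum(np.sum((grads[k] - G)**2, 1) * wts) / wts.sum()
        if wgv < best_wgv:
            best_lam, best_wgv = lam, wgv
    return best_lam

# usage: both candidates reproduce the data; the flatter gradient field wins
lams = np.array([0.1, 1.0])
s_vals = np.array([[1.0, 1.0], [1.0, 1.0]])
grads = np.array([[[1.0, 0.0], [-1.0, 0.0]], [[0.1, 0.0], [0.1, 0.0]]])
fvals = np.array([1.0, 1.0])
r = np.array([1.0, 2.0])
assert select_lambda(lams, s_vals, grads, fvals, r, 1e-6) == 1.0
```

The helper returns None when no candidate passes the NLMSE threshold, which corresponds to falling back to an exact objective evaluation in the SPELROA scheme.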

Figure 4.2 shows a) the data set of 30 points from the 2-parameter Rosenbrock function optimization by the EXTREM algorithm, b) a plot of NLMSEλ,Z(x) for two different points of the domain, and c) the corresponding plots of WGVλ,Z(A) and WGVλ,Z(B). Figure 4.3 shows a) the obtained approximation and b) the chosen value of the λ parameter.

In the previous section we presented a method to choose the value of the regularization parameter λ. Unfortunately, using this method neither guarantees that the generalization error is smaller than a prescribed value ε nor allows us to estimate the generalization error directly. This method, similarly to Generalized Cross Validation (c.f. [18]), estimates an error at the data points (WGV at the points nearest to the evaluation point, and GCV at all points of the data set).

Error bounds for radial basis interpolation have been studied extensively for more than two decades. The first bounds establishing the rate of convergence of a radial basis interpolant for functions from the native space of the underlying radial basis function were given in [10]. The latter paper laid the foundations for the development of the theory of convergence of radial basis interpolation (see


Fig. 4.2 a) Data set consisting of 30 points from the optimization path of the EXTREM algorithm optimizing the 2-parameter Rosenbrock function. b) Local reproduction of the data near points A and B, measured by NLMSEλ,Z(A) and NLMSEλ,Z(B) respectively; NLMSEthr = 10⁻⁵ is depicted by a straight line. c) The measure of the oscillation of the solution WGVλ,Z(x) at points A and B; the optimal λ for points A and B are depicted by dots.

e.g. [17], [19] and references therein). The rate of convergence considered is with respect to the data set density, i.e. the global fill distance

h_{Z,Ω} = sup_{y∈Ω} min_{1≤i≤N} ||y − xi||2,


Fig. 4.3 a) The approximation error for λ chosen by the measure WGVλ,Z(x) for NLMSEthr = 5.0 · 10⁻⁶. b) The chosen value of the λ parameter; it can be noticed that in regions where data is sparse the method suggests a greater value of λ.

where Z = {x_i}_{i=1}^{N} ⊆ Ω. The fill distance is a mesh norm measuring the radius of the biggest ball contained in the domain Ω that does not contain any data point, and the domain Ω is assumed to satisfy the interior cone condition with radius r and angle θ. A domain Ω satisfies the interior cone condition if there exist an angle θ ∈ (0, π/2) and a radius r > 0 such that for every x ∈ Ω a unit vector ξ(x) exists for which the cone C(x, ξ(x), θ, r) := {x + ηy : y ∈ R^d, ||y||₂ = 1, yᵀξ(x) ≥ cos θ, η ∈ [0, r]} is contained in Ω.
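The fill distance defined above is easy to estimate numerically by sampling the domain densely; a minimal sketch (the function name and the sampling strategy are our own illustration, not part of the method in the text):

```python
import numpy as np

def fill_distance(Z, samples):
    """Estimate h(Z, Omega): the largest distance from a point of the
    domain (approximated here by a dense sample) to its nearest data
    point in Z."""
    # pairwise distances between every sample point and every data point
    d = np.linalg.norm(samples[:, None, :] - Z[None, :, :], axis=2)
    # distance from each sample to its nearest data point; take the worst case
    return float(d.min(axis=1).max())

# Example: the four corners of the unit square. The worst-covered point of
# [0, 1]^2 is the centre, at distance sqrt(2)/2 from every corner.
Z = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
g = np.linspace(0.0, 1.0, 101)
samples = np.array([[x, y] for x in g for y in g])
h = fill_distance(Z, samples)
```

For a regular grid the estimate converges to the exact mesh norm as the sample of Ω is refined.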

A summary of error bounds for various radial basis functions was given in [16]. A very precise derivation of the error bounds for Gaussian radial basis function interpolation was given in [20]; in the latter paper one can find a derivation of all the constants involved in the bound. Analogous error estimates (with all constants involved) for the approximation with positive definite radial basis functions constructed with Tikhonov regularization, for functions from the Sobolev space Wpτ, were presented in [19]. Here we show that the error estimates contained in [19] cannot be used in our scheme, due to the small number of points available in the optimization process, and that using the heuristic methods of the previous sections is therefore justified.

All of the error bounds rely on a common property of local polynomial reproduction that has to be guaranteed by the approximation procedure (c.f. [19]). The error bounds can be formulated in the form of the following theorem. Let Ω satisfy the interior cone condition with angle θ and radius r. Let m be the maximal degree of polynomials reproduced by fλ in the form (4.25) defined as the solution of (4.27). Define the following quantities

94 M. Bazan

ϑ := 2 arcsin( sin θ / (4(1 + sin θ)) ),

Q(m, θ) := sin θ sin ϑ / (8 m² (1 + sin θ)(1 + sin ϑ)).

Let us consider the unit ball B(0, 1) as the domain of the approximation. It satisfies the interior cone condition with r = 1 and θ = π/3. The approximation can be seen to reproduce only linear polynomials, i.e. m = 1, and therefore for the above bounds to be satisfied there has to be

h(Z, Ω) ≤ Q(1, π/3) < 0.012613.

This means that the distance from one data point to another must be no greater than 1.3% of the radius of the ball containing the whole data set. Such a number of points cannot be generated by any local optimization algorithm.
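A quick numeric check of the constant, using the formulas for ϑ and Q(m, θ) as they appear above (our reading of the garbled display, so the exact value should be taken with caution); the computed Q(1, π/3) is consistent with the bound of 0.012613 quoted in the text:

```python
import math

def Q(m, theta):
    """Constant Q(m, theta) from the error-bound theorem, following the
    formulas given above (treated here as an approximation sketch)."""
    vartheta = 2.0 * math.asin(math.sin(theta) / (4.0 * (1.0 + math.sin(theta))))
    return (math.sin(theta) * math.sin(vartheta)
            / (8.0 * m ** 2 * (1.0 + math.sin(theta)) * (1.0 + math.sin(vartheta))))

q = Q(1, math.pi / 3)  # unit-ball example: m = 1, theta = pi/3
```

The value is of the order of one percent of the domain radius, which illustrates why the bound is unattainable for the sparse data sets produced by a local optimizer.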

The above considerations show that the existing accurate error bounds for regularized approximation with radial basis functions cannot be applied to estimate the approximation error in the SPELROA method. This is due to the sparseness of the data in the data sets constructed by local optimization algorithms.

Numerical results of the performance on real optimization problems from the LHC magnet design process, with up to 5 parameters, were presented in [2]. Here we present results for three test problems from the set of test problems proposed in [12].

1. Six variable problem and eight variable problem I

As the six variable and the first eight variable problem we considered the Extended Rosenbrock function (problem no. 21 in [12]). It is defined as

f([x1, x2, . . . , xd]) = ∑_{i=1}^{d/2} [ 100(x_{2i} − x_{2i−1}²)² + (1 − x_{2i−1})² ].

The minimum equals f* = 0 at (1, . . . , 1).
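The Extended Rosenbrock function is straightforward to implement; a minimal sketch (0-based indexing, so the pair (x_{2i−1}, x_{2i}) of the formula becomes (x[2i], x[2i+1])):

```python
def ext_rosenbrock(x):
    """Extended Rosenbrock function (problem no. 21 in [12]); len(x) must be even."""
    assert len(x) % 2 == 0
    return sum(100.0 * (x[2 * i + 1] - x[2 * i] ** 2) ** 2
               + (1.0 - x[2 * i]) ** 2
               for i in range(len(x) // 2))
```

At the minimizer (1, . . . , 1) the value is 0, matching f* = 0 stated above.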

4 Search Procedure Exploiting Locally Regularized Objective Approximation 95

2. Eight variable problem II

As the second eight variable problem we chose the Chebquad function (problem no. 35 in [12]). It is defined as

f(x1, . . . , xd) = ∑_{i=1}^{d} f_i(x)²,   where   f_i(x) = (1/d) ∑_{j=1}^{d} T_i(x_j) − ∫_0^1 T_i(x) dx,

and T_i is the i-th Chebyshev polynomial shifted to the interval [0, 1]. The standard starting point is x0 = (ξ_j) where ξ_j = j/(d + 1), and the minimum for d = 8 equals f* = 3.51687 · 10−3.
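The Chebquad function can be implemented with NumPy's Chebyshev basis: the shifted polynomial is T_i(2x − 1), whose exact integral over [0, 1] is 0 for odd i and 1/(1 − i²) for even i (a standard fact we rely on here; this is our own sketch, not the authors' code):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def chebquad(x):
    """Chebquad function (problem no. 35 in [12]) for d = len(x)."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    total = 0.0
    for i in range(1, d + 1):
        Ti = Chebyshev.basis(i, domain=[0.0, 1.0])  # T_i shifted to [0, 1]
        integral = 0.0 if i % 2 else 1.0 / (1.0 - i * i)
        fi = Ti(x).mean() - integral
        total += fi * fi
    return total

d = 8
x0 = np.array([j / (d + 1) for j in range(1, d + 1)])  # standard starting point
```

For d = 1 the starting point x = 0.5 is already the minimizer with f = 0, which gives a cheap sanity check of the implementation.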

3. Eleven variable problem

As the eleven variable problem we chose the Osborne 2 function (problem no. 19 in [12]). It is defined as

f(x1, . . . , x11) = ∑_{i=1}^{65} f_i(x)²,   f_i(x) = y_i − ( x1 e^{−t_i x5} + x2 e^{−(t_i − x9)² x6} + x3 e^{−(t_i − x10)² x7} + x4 e^{−(t_i − x11)² x8} ),

where t_i = (i − 1)/10 and the data values y_i are tabulated in [12]. The standard starting point is x0 = [1.3, 0.65, 0.65, 0.7, 0.6, 3, 5, 7, 2, 4.5, 5.5] and the minimum is f* ≈ 4.01377 · 10−2.

4.6.2 Results

For each problem we ran Algorithm 1, combined with the EXTREM algorithm, with the three following objective function approximation methods:

1. Radial basis function approximation without regularization,
2. Radial basis function approximation with regularization, using Generalized Cross Validation [18] to choose the λ parameter,
3. Radial basis function approximation with regularization, using the Weighted Gradient Variance to choose the λ parameter.

To construct the approximation of the objective function with one of these methods we used 30 Gaussian radial basis functions with an equal shape parameter, set to half of the distance between the two most distant centers. No additional parameters were required for the first and the second method. The single user-defined threshold NLMSEthr for the measure (4.33) was used in the third method. Apart from the parameters concerning the construction of the radial basis approximation we had to set up three parameters related directly to Algorithm 1 itself: the number of initial steps Is = 50, ε = 10−3 for the ε-check procedure, and γthr – the threshold value for the measure (4.24) used to detect the reliable region.

In the tables below we show the performance of Algorithm 1, with the above objective function approximation strategies, for the problems defined in the previous subsection, compared to the EXTREM algorithm. The first column shows the number


Table 4.1 6-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||  f
250        2.267853    2.885768
500        0.650821    0.109817
750        0.418187    0.031570
1000       0.246522    0.012666
1250       0.029985    0.001228
1523       0.002014    0.000001

Algorithm 1, no regularization:
step num.  ||x − x*||  f         num. approx.
250        2.459117    3.711824  22
500        1.318398    0.559350  45
750        0.675244    0.089372  35
933        0.130808    0.005606  10

Table 4.2 6-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 5 · 10−6. In both cases γthr = 0.65

Algorithm 1 with GCV:
step num.  ||x − x*||  f         num. approx.
250        2.363547    3.991352  11
500        1.094954    0.334820  28
750        0.778636    0.162569  48
1000       0.218992    0.009666  47
1250       0.025537    0.000131  49
1288       0.025537    0.000131  10

Algorithm 1 with WGV:
step num.  ||x − x*||  f         num. approx.
250        2.292762    2.923941  12
500        1.250487    0.435774  16
750        0.630858    0.150665  28
1000       0.513698    0.048269  38
1250       0.094284    0.001763  47
1262       0.094284    0.001763  4

Table 4.3 8-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and γthr = 0.65

EXTREM:
step num.  ||x − x*||  f
250        3.923525    10.438226
500        2.731706    5.606541
750        1.771076    1.032648
1000       1.009230    0.251387
1250       0.484663    0.050215
1500       0.274941    0.015386
1750       0.262434    0.012950
2000       0.208392    0.007465
3471       0.000553    0.000000

Algorithm 1, no regularization:
step num.  ||x − x*||  f         num. approx.
250        3.030179    6.376938  22
500        2.153959    1.855113  23
750        1.380208    0.471618  30
1000       0.910852    0.215283  46
1250       0.361361    0.028169  43
1500       0.186309    0.006926  55
1537       0.186309    0.006926  7


Table 4.4 8-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10−6. In both cases γthr = 0.65

Algorithm 1 with GCV:
step num.  ||x − x*||  f         num. approx.
250        3.100123    7.932378  17
500        2.353671    3.580721  28
750        1.810286    1.085058  26
1000       0.946269    0.240704  41
1200       0.705151    0.118365  43

Algorithm 1 with WGV:
step num.  ||x − x*||  f         num. approx.
250        3.263247    9.379781  11
500        2.288176    2.394216  15
750        1.272541    0.435303  17
1000       0.767617    0.143009  35
1250       0.505116    0.057994  42
1500       0.146317    0.004114  51
1750       0.028930    0.000169  35
1861       0.028958    0.000169  12

Table 4.5 8-variable Chebquad function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||  f
1          0.161512    0.038618
250        0.097709    0.006187
500        0.059121    0.004681
750        0.016460    0.003698
1000       0.000351    0.003517
1237       0.000000    0.003517

Algorithm 1, no regularization:
step num.  ||x − x*||  f         num. approx.
250        0.098420    0.006216  4
500        0.059421    0.004684  18
750        0.009926    0.003584  10
1000       0.000454    0.003517  48
1197       0.000141    0.003517  97

of steps of the algorithm, i.e. the sum of the number of direct function evaluations and the number of steps in which the radial basis function approximation was used. The second column shows the distance from the minimum and the third column shows the objective function value. The last column shows the number of steps, within the previous 250 steps, in which the objective function approximation was used.

As we can see, in all cases SPELROA required considerably fewer steps to stop than the pure EXTREM algorithm. Using the WGV method to build the radial basis approximation gave the best convergence results, i.e. the best stopping point of SPELROA with WGV compared with the stopping points given by the other methods. For all problems NLMSEthr was chosen intuitively as ∼10−6, which means that the training set was reproduced in the vicinity of the evaluation point at the level of 10−6. Setting γthr = 0.65 to detect the reliable region was sufficient to preserve convergence of the method. In the optimization of the 8-variable Chebquad function it turned out to be possible to reduce γthr to 0.6, which gave 16 points in which the objective function was approximated in the first 250 steps


Table 4.6 8-variable Chebquad function optimization using: Left) Algorithm 1 with regularization using GCV and γthr = 0.65, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10−6 and with γthr = 0.60

Algorithm 1 with GCV:
step num.  ||x − x*||  f         num. approx.
250        0.098419    0.006216  4
500        0.059477    0.004684  17
750        0.008713    0.003584  11
1000       0.00156     0.003518  118
1095       0.00156     0.003518  68

Algorithm 1 with WGV:
step num.  ||x − x*||  f         num. approx.
250        0.109306    0.007110  16
500        0.064139    0.004759  73
750        0.009802    0.003586  46
1000       0.001266    0.003519  31
1077       0.001051    0.003517  52

Table 4.7 11-variable Osborne 2 function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||  f
1          4.755269    2.093420
250        1.004450    0.081672
500        0.099411    0.041034
750        0.073530    0.041772
1000       0.007715    0.040138
1250       0.001092    0.040138
1434       0.000000    0.040138

Algorithm 1, no regularization:
step num.  ||x − x*||  f         num. approx.
250        1.379902    0.977272  29
500        0.568322    0.059211  31
710        0.335485    0.041512  43

Table 4.8 11-variable Osborne 2 function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10−6. In both cases γthr = 0.65

Algorithm 1 with GCV:
step num.  ||x − x*||  f         num. approx.
250        1.673133    0.365797  32
500        0.975217    0.045721  31
750        0.496683    0.041805  43
1000       0.068066    0.040462  35
1106       0.062328    0.040177  25

Algorithm 1 with WGV:
step num.  ||x − x*||  f         num. approx.
250        0.953416    0.106864  14
500        0.334653    0.047136  38
750        0.0441106   0.040209  49
919        0.037960    0.040185  29

instead of 4 such points when γthr = 0.65. An interesting result was also obtained for the Osborne 2 function. The EXTREM algorithm found a better solution than that suggested in [12]. Algorithm 1 did not converge to this minimum with any of the approximation methods: the method without regularization did not converge at all, whereas GCV and WGV converged to the minimum suggested in [12] rather than to the one found by EXTREM.

4.7 Summary

The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes. The method combines a non-gradient optimization algorithm with a regularized local radial basis function approximation. It relies on using the local regularized radial basis function approximation instead of a direct objective function evaluation in a certain number of function evaluation steps of the optimization algorithm. In this chapter we presented a proof of the convergence of the Search Procedure Exploiting Regularized Objective Approximation, which applies to any Gauss–Seidel or conjugate direction search algorithm that uses sequential quadratic interpolation as a line search procedure. The convergence is proven under the assumption that the approximation of the objective function with the prescribed relative approximation error is exploited only in the sequential quadratic interpolation. The performance of the method was demonstrated on the 6- and 8-parameter Rosenbrock function, the 8-parameter Chebquad function and the 11-parameter Osborne 2 function. Further studies will compare the method with trust region methods.

Acknowledgements. I would like to thank all referees for very valuable comments.

Appendix A

The minimum of the quadratic q(ζ) built on the three points (ζ1, f(ζ1)), (ζ2, f(ζ2)) and (ζ3, f(ζ3)), where {ζ1, ζ2, ζ3} ⊂ R, equals

λ* = (1/2) · [ (ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3) ] / [ (ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3) ].
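The three-point minimizer formula above can be checked directly; a minimal sketch of the standard quadratic-interpolation step, with the signs written out explicitly:

```python
def quad_min(z1, f1, z2, f2, z3, f3):
    """Minimizer of the parabola through (z1, f1), (z2, f2), (z3, f3) --
    the sequential quadratic interpolation step used as a line search."""
    num = (z3 ** 2 - z2 ** 2) * f1 - (z3 ** 2 - z1 ** 2) * f2 + (z2 ** 2 - z1 ** 2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den

# q(z) = (z - 3)^2 + 1 sampled at z = 1, 2, 5 has its minimum at z = 3
lam = quad_min(1.0, 5.0, 2.0, 2.0, 5.0, 5.0)
```

The denominator vanishes when the three points are collinear, which is exactly the degenerate case in which no parabola minimum exists.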

The quadratic q is transformed into a quadratic q̂(x) = âx² + b̂x + ĉ under the following assumptions:

1. The transformation maps the three points onto (0, f(ζ1)), (ζ(r), f(ζ2)) and (1, f(ζ3)), where ζ(r) = (ζ2 − ζ1)/(ζ3 − ζ1).
2. a) If f(ζ1) > f(ζ3): q̂(0) = −1, q̂(1) = −2;  b) if f(ζ1) < f(ζ3): q̂(0) = −2, q̂(1) = −1.


The Lagrange quadratic through the transformed points is L2(x) = a x² + b x + c, where

a = f(ζ1)/ζ(r) + f(ζ2)/(ζ(r)(ζ(r) − 1)) + f(ζ3)/(1 − ζ(r)),

b = −[ f(ζ1)(ζ(r) + 1)/ζ(r) + f(ζ2)/(ζ(r)(ζ(r) − 1)) + f(ζ3)ζ(r)/(1 − ζ(r)) ],

c = f(ζ1).

In cases a) and b) respectively we set

a) E = 1/(a + b),   D = 2 − Ec = 2 − c/(a + b),
b) E = −1/(a + b),  D = 1 + Ec = 1 − c/(a + b),

and the quadratic q̂ is obtained from L2 by means of E and D.

The crucial properties of this transformation are:

1. q̂(ζ(r)) < −1.
2. p(ζ(r)) = f(ζ2).
3. q̂(ζ(r)) does not depend on f(ζ2).
4. For the minimum point λ* of q such that λ* ∈ [ζ1, ζ3] we get the minimum point λ̂* = (λ* − ζ1)/(ζ3 − ζ1), where λ̂* is the minimum point of q̂.
5. The transformation has a singularity when a = −b.

This transformation reduces the number of free parameters from 6 to 3.

Appendix B

The unperturbed minimum λ ∗ (ζ ) is related to the perturbed minimum λ̃l∗ (ε ; ζ ) as

λ̃l∗ (ε ; ζ ) = λ ∗ (ζ ) ± Δl (ε ; ζ ).

Here we assume that both λ*(ζ) and λ̃l*(ε; ζ) are minima of the quadratics obtained by the transformation q̂(·) from Appendix A under the assumption that f(ζ1) < f(ζ3); if the opposite is true, then the quadratics have to be rotated with respect to the center of the interval [ζ1, ζ3].

Let us introduce some notation to simplify the derivations. Denote A = 2(ζ(r) − 1), B = q̂(ζ(r)), C = ζ(r). Then for the unperturbed quadratic we have

λ* = (1/2) · [ A(ζ(r) + 1) + B − ζ(r)C ] / [ A + B − C ],

whereas for the perturbed ones we have

λ1*(ε; ζ) = (1/2) · [ A(ζ(r) + 1)(1 + ε) + B − ζ(r)C ] / [ A(1 + ε) + B − C ],

λ2*(ε; ζ) = (1/2) · [ A(ζ(r) + 1) + B(1 + ε) − ζ(r)C ] / [ A + B(1 + ε) − C ],

λ3*(ε; ζ) = (1/2) · [ A(ζ(r) + 1) + B − ζ(r)C(1 + ε) ] / [ A + B − C(1 + ε) ],

for the perturbation at 0, ζ(r) and 1 respectively, where ζ(r) = (ζ2 − ζ1)/(ζ3 − ζ1). We can simplify the expression |λ* − λ̃*(ε; ζ)| to get

Δ1(ε; ζ) = [ (C − ζ(r)B) / (2(A + B − C)) ] · [ Aε / (Aε + A + B − C) ].

If the inequalities (4.21) are satisfied then we have a guarantee that λ*(ζ) < ζ2. This is because if λ̃l*(ε; ζ) is shifted by Δl(ε; ζ) to the left, then shifting it back to the right will not make it greater than ζ2; if it is shifted to the right, then we have a margin of 2Δl(ε; ζ). As previously mentioned, we will consider only the perturbations at 0 and ζ(r), i.e. l = 1 and l = 2. For l = 1 we get

[ A(ζ(r) + 1)(1 + ε) + B − ζ(r)C ] / [ A(1 + ε) + B − C ] < 2ζ2.

1. If A(1 + ε) + B − C > 0 then we get an inequality which again is divided into two cases depending on the sign of the coefficient of ε:

a. When A(1 − ζ(r)) > 0 then

ε < [ −A(1 − ζ(r)) − B ] / [ A(1 − ζ(r)) ].

b. When A(1 − ζ(r)) < 0 then

ε > [ −A(1 − ζ(r)) − B ] / [ A(1 − ζ(r)) ].

2. If A(1 + ε) + B − C < 0:

a. When A(1 − ζ(r)) > 0 then

ε > [ −A(1 − ζ(r)) − B ] / [ A(1 − ζ(r)) ].

b. When A(1 − ζ(r)) < 0 then

ε < [ −A(1 − ζ(r)) − B ] / [ A(1 − ζ(r)) ].

Only upper bounds for ε, i.e. 1.a) and 2.b), are interpretable as a solution to our problem. So finally we get two regions for ε and ζ(r), where

ε < [ −A(1 − ζ(r)) − B ] / [ A(1 − ζ(r)) ],   (4.35)

for

A(1 + ε) + B − C > 0  and  A(1 − ζ(r)) > 0,   (4.36)

or

A(1 + ε) + B − C < 0  and  A(1 − ζ(r)) < 0.   (4.37)

In the same way we can obtain the conditions for l = 2:

ε < [ −B(1 − ζ(r)) − A ] / [ B(1 − ζ(r)) ],   (4.38)

for

A + B(1 + ε) − C > 0  and  B(1 − ζ(r)) < 0,   (4.39)

or

A + B(1 + ε) − C < 0  and  B(1 − ζ(r)) > 0.   (4.40)

References

1. Bazan, M., Russenschuck, S.: Using neural networks to speed up optimization algo-

rithms. Eur. Phys. J. AP 12, 109–115 (2000)

2. Bazan, M., Aleksa, M., Russenschuck, S.: An improved method using radial basis func-

tion neural networks to speed up optimization algorithms. IEEE Trans. on Magnetics 38,

1081–1084 (2002)


3. Bazan, M., Aleksa, M., Lucas, J., Russenschuck, S., Ramberger, S., Völlinger, C.: Integrated design of superconducting magnets with the CERN field computation program ROXIE. In: Proc. 6th International Computational Accelerator Physics Conference, Darmstadt, Germany (September 2000)

4. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM, Philadelphia (2005)

5. Hansen, P.C.: Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia

(1998)

6. Jacob, H.G.: Rechnergestützte Optimierung statischer und dynamischer Systeme.

Springer, Heidelberg (1982)

7. Kansa, E.J., Hon, Y.C.: Circumventing the ill-conditioning problem with multiquadric radial basis functions: Applications to elliptic partial differential equations. Comp. Math. with App. 39(7-8), 123–137 (2000)

8. Luenberger, D.G.: Introduction to linear and nonlinear programming, 2nd edn. Addison-

Wesley, New York (1984)

9. Micchelli, C.A.: Interpolation of Scattered Data: Distance Matrices and Conditionally

Positive Definite Functions. Constructive Approximation 2, 11–22 (1986)

10. Madych, W.R., Nelson, S.A.: Multivariate interpolation and conditionally positive definite functions II. Math. Comp. 54(189), 211–230 (1990)

11. Madych, W.R.: Miscellaneous error bounds for multiquadric and related interpolators.

Comp. Math. with Appl. 24(12), 121–138 (1992)

12. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software.

ACM Trans. Math. Software 7(1), 17–41 (1981)

13. Oeuvray, R.: Trust region method based on radial basis functions with application on

biomedical imaging, Ecole Polytechnique Federale de Lausanne (2005)

14. Polak, E.: Optimization. Algorithms and Consistent Approximations. Applied Mathe-

matical Sciences, vol. 124. Springer, Heidelberg (1997)

15. Powell, M.J.D.: On calculation of orthogonal vectors. The Computer Journal 11(3),

302–304 (1968)

16. Schaback, R.: Error estimates and condition number for radial basis function interpola-

tion. Adv. Comput. Math. 3, 251–264 (1995)

17. Schaback, R.: Native Hilbert Spaces for Radial Basis Functions I. The new development

in Approximation Theory. Birkhäuser, Basel (1999)

18. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)

19. Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting

smoothing parameters. Numerische Mathematik 101, 643–662 (2005)

20. Wendland, H.: Gaussian Interpolation Revisited. In: Kopotun, K., Lyche, T., Neamtu,

M. (eds.) Trends in Approximation Theory, pp. 427–436. Vanderbilt University Press,

Nashville (2001)

21. Zangwill, W.I.: Nonlinear Programming; a Unified Approach. Prentice-Hall Interna-

tional Series. Prentice-Hall, Englewood Cliffs (1969)

Chapter 5

Optimization Problems with Cardinality

Constraints

Abstract. In this article we review several hybrid techniques that can be used to

accurately and efficiently solve large optimization problems with cardinality con-

straints. Exact methods, such as branch-and-bound, require lengthy computations

and are, for this reason, infeasible in practice. As an alternative, this study focuses

on approximate techniques that can identify near-optimal solutions at a reduced

computational cost. Most of the methods considered encode the candidate solutions

as sets. This representation, when used in conjunction with specially devised search

operators, is especially suited to problems whose solution involves the selection of

optimal subsets of specified cardinality. The performance of these techniques is il-

lustrated in optimization problems of practical interest that arise in the fields of

machine learning (pruning of ensembles of classifiers), quantitative finance (port-

folio selection), time-series modeling (index tracking) and statistical data analysis

(sparse principal component analysis).

5.1 Introduction

Many practical optimization problems involve the selection of subsets of specified

cardinality from a collection of items. These problems can be solved by exhaustive

enumeration of all the candidate solutions of the specified cardinality. In practice,

only small problems of this type can be exactly solved within a reasonable amount of

Rubén Ruiz-Torrubiano

Computer Science Department, Universidad Autónoma de Madrid, Spain

e-mail: ruben.ruizt@estudiante.uam.es

Sergio Garcı́a-Moratilla

Computer Science Department, Universidad Autónoma de Madrid, Spain

e-mail: sergio.garcia@uam.es

Alberto Suárez

Computer Science Department, Universidad Autónoma de Madrid, Spain

e-mail: alberto.suarez@uam.es

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 105–130.

springerlink.com c Springer-Verlag Berlin Heidelberg 2010


time. The number of steps required to find the optimal solution can be reduced using

branch-and-bound techniques. Nevertheless, the computational complexity of the

search remains exponential, which means that large problems cannot be handled by

these exact methods. It is therefore important to design algorithms that can identify

near-optimal solutions at a reduced computational cost. In this article we present a unified framework for handling optimization problems with cardinality constraints.

A number of approximate methods within this framework are analyzed and their

performance is tested in extensive benchmark experiments.

In its general form, an optimization problem with cardinality constraints can be formulated in terms of a vector of binary variables z = {z1, z2, . . . , zD}, zi ∈ {0, 1}. The goal is to minimize a cost function that depends on z, subject to a constraint that specifies the number of non-zero bits in z:

min_z F(z)   subject to   ∑_{i=1}^{D} zi = k.   (5.1)

A problem with the inequality constraint ∑_{i=1}^{D} zi ≤ K can be solved by selecting the best of the solutions of the K optimization problems with the equality constraint ∑_{i=1}^{D} zi = k; k = 1, 2, . . . , K. Finally, the solution of a combinatorial optimization problem without restrictions can be obtained by solving the sequence of problems with cardinality constraints ∑_{i=1}^{D} zi = k; k = 1, 2, . . . , D.

Continuous optimization tasks with cardinality constraints can also be analyzed within this framework. Consider the problem of minimizing a function that depends on a D-dimensional continuous parameter θ. We search for solutions with exactly k non-zero components of θ:

min_θ F(θ)   subject to   θ ∈ R^D,  ∑_{i=1}^{D} I(θi ≠ 0) = k,   (5.2)

where I(·) is an indicator function (I(true) = 1, I(false) = 0). This hybrid problem can be transformed into a purely combinatorial one of the type (5.1) by introducing a D-dimensional binary vector z whose i-th component indicates whether variable i is allowed to take a non-zero value (zi = 1) or is set to zero (zi = 0):

min_z F*(z)   subject to   ∑_i zi = k,   (5.3)

where F*(z) is the optimal value of the auxiliary continuous problem in the reduced space defined by z,

F*(z) = min_{θ[z]} F(θ),   (5.4)

where θ[z] denotes the k-dimensional vector formed by the components of θ for which the value of the corresponding component of z is 1. The remaining components of θ are set to zero in the auxiliary problem.

This decomposition makes it clear how hybrid methods that combine techniques

for combinatorial and continuous optimization can be applied to identify the so-

lution of the subset selection problem with a continuous objective function: For a

given value of z, the optimal θ [z] is calculated by solving the surrogate problem

defined by (5.4), where z determines which components of θ are allowed to take a

non-zero value. The final solution is obtained by searching in the purely combinato-

rial space of possible values of z, using the optimal function value that is a solution

of (5.4) to guide the exploration.
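The decomposition can be illustrated on a small subset-selection least-squares problem; in this sketch (the function name and the enumeration strategy are ours, for illustration only) the continuous subproblem is solved exactly with np.linalg.lstsq for every cardinality-k pattern z, and plain enumeration stands in for the combinatorial search:

```python
import numpy as np
from itertools import combinations

def best_subset_lstsq(A, y, k):
    """For each pattern z with k non-zero bits, solve the continuous
    subproblem min_theta ||A[:, z] theta - y||^2 exactly, then keep the
    best pattern (enumeration is feasible only for small D)."""
    D = A.shape[1]
    best_f, best_cols, best_theta = np.inf, None, None
    for cols in combinations(range(D), k):
        theta, *_ = np.linalg.lstsq(A[:, list(cols)], y, rcond=None)
        r = y - A[:, list(cols)] @ theta
        f = float(r @ r)
        if f < best_f:
            best_f, best_cols, best_theta = f, cols, theta
    return best_f, best_cols, best_theta

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
y = 2.0 * A[:, 1] + 3.0 * A[:, 3]      # exact 2-sparse target
f, cols, theta = best_subset_lstsq(A, y, 2)
```

For a 2-sparse target the search recovers the generating columns with (numerically) zero residual; replacing the enumeration by one of the stochastic searches of Section 5.2 gives the hybrid methods discussed in the text.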

The success of this hybrid approach depends on the availability of a continuous

optimization algorithm that can efficiently identify the globally optimal solution

of the auxiliary optimization problem defined in (5.4) and on the efficiency of the

algorithm used to address the combinatorial part of the search. For simple forms

of the continuous objective function and of the remaining restrictions (other than

the cardinality constraint), the auxiliary problem can be efficiently solved by exact

optimization techniques. For instance, efficient linear and quadratic programming

algorithms are available if the function is linear or quadratic, respectively [1]. For

more complex objective functions, general non-linear optimization techniques (such

as quasi-Newton [2] or interior-point methods [3]) may be necessary. In these cases,

there is no guarantee that the solution of the auxiliary problem be globally optimal.

As a consequence, if the solutions found are far from the global optimum, the com-

binatorial search that is used to solve the original problem (5.3) can be seriously

misled.

In this work, we assume that the continuous optimization task defined by (5.4)

can be solved exactly and focus on the solution of the combinatorial part of the orig-

inal problem. Section 5.2 describes how standard combinatorial optimization tech-

niques can be adapted to handle the cardinality constraints considered. Emphasis is

placed on the use of an appropriate encoding for the search states in terms of sets.

This set-based encoding is particularly well-suited for the definition of search oper-

ators that preserve the cardinality of the candidate solutions. With this adaptation,

the approximate methods described provide a practicable alternative to identifying

the exact solution by exhaustive search, which becomes computationally infeasible

in large problems, or to computationally inexpensive optimization methods, such as

greedy search, which tend to find suboptimal solutions. The experiments presented

in Section 5.3 illustrate how the techniques reviewed find near-optimal solutions

with limited computational resources and can therefore be used to address optimal

subset selection problems of practical interest. Novel results regarding the applica-

tion of these techniques to some of these problems (ensemble pruning and sparse

PCA) are also provided. Finally, the last section summarizes the conclusions of this work.


Problems with Cardinality Constraints

In this section we describe how simulated annealing (SA), genetic algorithms (GA),

and estimation of distribution algorithms (EDA) can be used to solve large opti-

mization problems with cardinality constraints. They are stochastic search methods

that involve the generation of candidate solutions, which are then rejected or se-

lected according to their performance. In their standard formulation, no particular

consideration is given to the number of non-zero components in the candidate solu-

tions generated. Cardinality constraints can be taken into account using one of the

following approaches:

(i) No candidate solution violating the constraint is generated at any time by the al-

gorithm. Enforcing this property requires the design of appropriate genetic and

neighborhood operators, such that the space of solutions of a given cardinality

is closed under these search operators [4] [5].

(ii) Solutions that violate the cardinality constraint can be generated by the suc-

cessor operators. Whenever a violation occurs, a repair algorithm is applied to

transform the infeasible solution into a solution of the desired cardinality. Typ-

ically, a local search is used to obtain the closest feasible solution, but random

repair mechanisms can be used as well [6] [7].

(iii) Solutions that violate the cardinality constraint can be generated by the succes-

sor operators. In contrast with the previous approach, infeasible solutions are

not repaired. Instead, a penalty term is introduced into the evaluation function so

that infeasible candidate solutions have worse scores than feasible ones with an

equivalent performance [6].
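Approach (ii) can be sketched with a simple random repair that restores the cardinality by flipping randomly chosen bits (the function below is illustrative, not taken from the chapter):

```python
import random

def repair(z, k, rng):
    """Flip randomly chosen bits of the binary vector z until exactly
    k of them are set, returning a feasible solution."""
    z = list(z)
    ones = [i for i, b in enumerate(z) if b]
    zeros = [i for i, b in enumerate(z) if not b]
    rng.shuffle(ones)
    rng.shuffle(zeros)
    while len(ones) > k:            # too many selected: drop random ones
        z[ones.pop()] = 0
    while len(ones) < k:            # too few selected: add random zeros
        i = zeros.pop()
        z[i] = 1
        ones.append(i)
    return z
```

A local-search repair would instead pick the flips that degrade the objective least, at the cost of extra function evaluations.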

In the experiments described in Section 5.3, the best overall performance is obtained

by methods that use a set-based representation together with appropriately designed

successor operators that preserve the cardinality of the solutions. These results un-

derscore the importance of using a representation that properly reflects the structure

of the problem. Therefore, the focus of this study is on the design of search opera-

tors that preserve the cardinality of the candidate solutions. These specially adapted

methods are generally preferable to standard schemes that take into account the re-

strictions by either ad-hoc repair mechanisms or by including a term in the cost

function that penalizes violations of the constraints.

Simulated annealing (SA) is an optimization technique inspired by the field of ther-

modynamics [8]. The main idea is to mimic the physical process of melting a solid

and then cooling it to allow the formation of a regular crystalline structure that at-

tains a minimum of the system’s free energy. In simulated annealing the function to

be minimized F(z) (objective or cost function) takes the role of the free energy in

the physical system. The physical configuration space is replaced by the space of

candidate solutions, which are connected by transitions defined by a neighborhood


operator. The stochastic search proceeds by considering transitions from the cur-

rent state z(cur) to a neighboring configuration zl ∈ N (z(cur) ) generated at random.

The proposed transition is accepted if the value of the objective function decreases.

Otherwise, if the candidate configuration is of higher cost, the transition is accepted

only with a certain probability. This probability is expressed as a Boltzmann factor

P_accept(z_l, z^(cur); T_k) = exp( −(F(z_l) − F(z^(cur))) / T_k ),   (5.5)

where the parameter Tk plays the role of a temperature. A general version of this

technique is given as Algorithm 1. In this pseudocode, the function annealingSchedule returns the temperature Tk for the following epoch. It is common to use a geometric schedule Tk = γTk−1, where the factor γ < 1, usually close to one, regulates how fast the temperature is decreased.

• Generate initial configuration z(0) and initial temperature T0

• z(cur) ← z(0)

• i←0

• While convergence criteria are not met [Annealing loop]

– i ← i+1

– Fix temperature for epoch i: Ti = annealingSchedule(Ti−1 )

– Fix length for epoch i: Li

– For l = 1, . . . , Li [Epoch loop]

1. Select randomly an element zl ∈ N (z(cur) ).

2. If F(zl ) < F(z(cur) ), then z(cur) ← zl

3. Else, generate u ∼ U[0, 1]

If u < Paccept (zl , z(cur) ; Ti ), then z(cur) ← zl

• Return the best value found.

Cardinality constraints can be handled in SA by choosing a suitable encoding for z and a corresponding neighborhood N (z). In particular, the candidate

solutions can be encoded as sets of specified cardinality. The components of the

binary vector z are then interpreted as indicating membership to the set: if zi = 1,

the ith element is included in the solution. Otherwise, if zi = 0 it is excluded from

the selection. It is also necessary to design a neighborhood operator that preserves

the cardinality constraints, so that no penalty or repair mechanisms are needed. A

simple design is to exchange an element included in the current candidate solution

with an element excluded from it. This is the version of SA that will be used in

Section 5.3.
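To make the scheme concrete, the cardinality-preserving SA described above can be sketched as follows. The objective F, the epoch lengths and the cooling parameters are illustrative placeholders, not the settings used in the experiments.

```python
import math
import random

def simulated_annealing(F, universe, k, T0=1.0, gamma=0.9,
                        epochs=50, epoch_len=100, seed=0):
    """Minimize F over subsets of `universe` with fixed cardinality k.

    The neighborhood operator exchanges one selected element with one
    excluded element, so every visited state satisfies the constraint."""
    rng = random.Random(seed)
    current = set(rng.sample(sorted(universe), k))
    best, best_cost = set(current), F(current)
    T = T0
    for _ in range(epochs):
        for _ in range(epoch_len):
            # Swap neighborhood: move out one selected, move in one excluded.
            out_elem = rng.choice(sorted(current))
            in_elem = rng.choice(sorted(set(universe) - current))
            candidate = (current - {out_elem}) | {in_elem}
            delta = F(candidate) - F(current)
            # Accept improvements; accept uphill moves with the
            # Boltzmann probability of (5.5).
            if delta < 0 or rng.random() < math.exp(-delta / T):
                current = candidate
                if F(current) < best_cost:
                    best, best_cost = set(current), F(current)
        T *= gamma  # geometric annealing schedule
    return best, best_cost
```

With F(z) equal to the sum of the selected values, for example, the minimum over subsets of size k is attained at the k smallest elements.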

110 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

Genetic algorithms are a class of optimization methods that mimic the process of natural evolution [9]. Optimization is achieved by selection from a population

that exhibits some random variability. The outline of a general genetic algorithm is

shown in Algorithm 2.

• Generate an initial population P0 with P individuals.

• For each individual I j ∈ P0 , calculate fitness Φ (I j ).

• Initialize the generation counter t ← 0.

• While convergence criteria are not met:

– Increase the generation counter t ← t + 1.

– Select a parent set Πt ⊂ Pt composed of nP individuals from the population.

– While Πt ≠ ∅:

· Extract two individuals I1 and I2 from Πt .

· Apply the crossover operator Θ (I1 , I2 ) and generate nC children (with

probability pC ).

· Apply the mutation operator to the nC children (with probability pM ).

– Calculate the fitness value of the new individuals.

– Add the new individuals to the population.

– Select P individuals that make up Pt+1 , the population for generation t + 1.

For problems with cardinality constraints, two alternative encodings for the can-

didate solutions are considered. A first possibility is a standard binary representa-

tion, where the chromosomes are bit-strings. The difficulty with this encoding is

that standard mutation and crossover operators do not preserve the number of non-

zero bits of the parents. A possible solution to this problem is to assign a lower

fitness value to individuals in the population that violate the cardinality constraint.

Assuming that a problem with an inequality cardinality constraint is considered, a

penalized fitness function can be built by subtracting from the standard fitness func-

tion a penalty term that depends on the magnitude of the violation of the cardinality

constraint

Δk (z) = |Card(z) − k| (5.6)

The penalized fitness function is

Φ̃(I) = Φ(I) − β Δk (z),   (5.7)

where β > 0 represents the strength of the penalty.
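For instance, a linear penalty of strength β built on (5.6) can be evaluated as follows (a sketch; the function name is illustrative):

```python
def penalized_fitness(fitness, z, k, beta):
    """Fitness of a binary chromosome z minus a linear penalty of
    strength beta on the cardinality violation (5.6)."""
    violation = abs(sum(z) - k)     # Delta_k(z) = |Card(z) - k|
    return fitness - beta * violation
```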

Another option is to repair infeasible individuals when they are generated. Sev-

eral repair mechanisms can be defined for this purpose. For instance, an individual


can be repaired by randomly setting some bits to 0 or 1, as needed, until the cardi-

nality constraint is satisfied (random repair). Another alternative is to use a heuristic

to determine which bits must be set to 0 or to 1 (heuristic repair). The results of a

greedy optimization or the solutions of a relaxed version of the problem can also be

used to achieve this objective [10].
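A minimal sketch of the random repair mechanism for an equality constraint Card(z) = k (the function name is illustrative):

```python
import random

def random_repair(z, k, rng=random):
    """Randomly flip bits of a binary chromosome z until Card(z) = k."""
    z = list(z)
    ones = [i for i, zi in enumerate(z) if zi == 1]
    zeros = [i for i, zi in enumerate(z) if zi == 0]
    while len(ones) > k:            # too many bits set: clear a random one
        i = ones.pop(rng.randrange(len(ones)))
        z[i] = 0
        zeros.append(i)
    while len(ones) < k:            # too few bits set: set a random zero
        i = zeros.pop(rng.randrange(len(zeros)))
        z[i] = 1
        ones.append(i)
    return z
```

A heuristic repair would replace the two random draws with a problem-specific ranking of the candidate bits.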

An alternative to binary encoding is to use the set representation introduced in

simulated annealing. The use of this representation simplifies the design of crossover

and mutation operators that preserve the cardinality of the individuals. The neigh-

boring operator defined in SA can be used to construct mutated individuals. Since

this operator swaps a variable in the set of selected variables with another variable in

the complement of this set, the cardinality of the original chromosome is preserved

by the mutation.

Some crossover operators on sets were introduced in [5]. They are defined taking

into account the properties of respect and assortment [11]. Respect ensures that the

offspring inherit the common genetic material of the parents. Assortment guarantees

that every combination of the alleles of the two parents is possible in the child, pro-

vided that these alleles are compatible. When cardinality constraints are considered,

it is no longer possible to design crossover operators that guarantee both respect and

assortment.

A crossover operation that provides a good balance of these properties and en-

sures that the cardinality of the parents is preserved in the offspring is random as-

sorting recombination (RAR). RAR crossover is described in Algorithm 3. In this

algorithm, the integer parameter w ≥ 0 determines the amount of common informa-

tion from both parents that is retained by the offspring. For w = 0, elements that are

present in the chromosomes of both parents are not allowed in the child. Higher val-

ues of w assign more importance to the elements in the intersection of the parents’

sets (chromosomes). In the limit w → ∞, the child contains every element that is in

both of the parents’ chromosomes with a probability that approaches 1.

Estimation of distribution algorithms (EDAs) are a class of evolutionary methods in

which diversity is generated by a probabilistic sampling scheme [12]. Depending

on the nature of this sampling scheme, different variants of EDAs can be designed.

In this work, we consider the Population Based Incremental Learning (PBIL) algo-

rithm as a representative algorithm of the EDA family [13]. It operates on binary

chromosomes of fixed length (z) and assumes statistical independence among the

genes {zi ; i = 1, 2, . . . , D}. In generation g, the genotype of the population is char-

acterized by the probability vector p(g) , whose ith component is the probability of

assigning the value 1 to the gene in the ith position. The update of the probability

distribution using DgSe (see Algorithm 4) in PBIL is

p(g+1) = (α/M) ∑m=1,...,M z(gim ) + (1 − α) p(g) ,   (5.8)


1 Input: Two parents I1 and I2 , and a fixed cardinality k.

2 Output: A child chromosome Θ .

• Create auxiliary sets A, B,C, D, E:

– A = elements present in both parents.

– B = elements not present in any of the parents.

– C ≡ D = elements present in only one of the parents (each such element receives one inclusion vote, via C, and one exclusion vote, via D).

– E = ∅.

• Build set

G = {w copies of elements from A and B, and 1 copy of elements in C and D}

• While |Θ | < k and G ≠ ∅:

– Extract g ∈ G without replacement.

– If (g ∈ A or g ∈ C) and g ∉ E, then Θ = Θ ∪ {g}.

– If g ∈ B or g ∈ D, then E = E ∪ {g}.

• If |Θ | < k, add elements chosen at random from U − Θ until chromosome is

complete.
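A possible implementation of RAR crossover is sketched below. The multiset G is represented as a list of (element, vote) pairs, which is one of several equivalent ways of encoding the w copies; elements of A and C carry inclusion votes, elements of B and D exclusion votes.

```python
import random

def rar_crossover(parent1, parent2, universe, k, w=1, rng=random):
    """Random assorting recombination (RAR) on set-encoded chromosomes
    of fixed cardinality k (a sketch of Algorithm 3)."""
    both = parent1 & parent2                        # A: in both parents
    neither = set(universe) - (parent1 | parent2)   # B: in neither parent
    one = (parent1 | parent2) - both                # C = D: in exactly one
    # The multiset G: w 'include' copies of A, w 'exclude' copies of B,
    # and one 'include' plus one 'exclude' copy of each element of C = D.
    pool = ([(g, True) for g in both] * w
            + [(g, False) for g in neither] * w
            + [(g, True) for g in one]
            + [(g, False) for g in one])
    rng.shuffle(pool)
    child, excluded = set(), set()
    while len(child) < k and pool:
        g, include = pool.pop()
        if include and g not in excluded:
            child.add(g)
        elif not include:
            excluded.add(g)
    # Complete the chromosome at random if necessary.
    remaining = list(set(universe) - child)
    while len(child) < k:
        child.add(remaining.pop(rng.randrange(len(remaining))))
    return child
```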

• Initialize the distribution that characterizes the population P(0) (z)

• Initialize generation counter g ← 0.

• While convergence criteria are not met

– Sample a population of P individuals using P(g) (z): Dg = {z(g1) , . . . , z(gP) }.

– Sort the individuals by decreasing fitness: Dg = {z(gi1 ) , z(gi2 ) , . . . , z(giP ) }.

– Select the M best individuals: DgSe = {z(gi1 ) , z(gi2 ) , . . . , z(giM ) }.

– Update the probability distribution using DgSe , as in (5.8).

– Update generation counter g ← g + 1

• Return the best solution found.


where z(gim ) represents the individual in the im -th position in generation g, and

α ∈ (0, 1] is a smoothing parameter included to avoid strong fluctuations in the

estimates of the probability distribution. Individuals are sorted by decreasing fitness

values. The Univariate Marginal Distribution Algorithm (UMDA) [14] from the EDA family is recovered when α = 1. Even though the encoding is binary,

the cardinality constraints can be enforced in the sampling of individuals. Algorithm

5 describes a sampling method that generates individuals of a specified cardinality k

from a distribution of bits characterized by the probability vector p. The application

of this method to sample new individuals guarantees that the algorithm is closed

with respect to the cardinality constraint.

• Initialize p̂ ← p

• Initialize individual x = 0.

• For i = 1, 2, . . . , k

– Generate a random number u ∼ U[0, 1]

– Determine the value of j such that ∑i=1,..., j−1 p̂i < u ≤ ∑i=1,..., j p̂i .

– Set x j = 1.

– Update the value p̂ j ← 0.

– Renormalize p̂i ← p̂i / ∑k=1,...,D p̂k , i = 1, 2, . . . , D, so that p̂ can again be interpreted as a probability vector (∑i=1,...,D p̂i = 1).

• Return the generated individual x.
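The constrained sampling of Algorithm 5 and the update rule (5.8) can be sketched as follows; the function names are illustrative, and drawing u uniformly on [0, total) replaces the explicit renormalization step.

```python
import random

def sample_with_cardinality(p, k, rng=random):
    """Sample a binary individual with exactly k ones from the
    probability vector p (cardinality-preserving sampling, Algorithm 5)."""
    p_hat = list(p)
    D = len(p_hat)
    x = [0] * D
    for _ in range(k):
        total = sum(p_hat)
        if total == 0.0:
            # Fewer than k nonzero probabilities remain: fall back to a
            # uniform choice among the still-unselected positions.
            j = rng.choice([i for i in range(D) if x[i] == 0])
        else:
            # Drawing u uniformly on [0, total) is equivalent to
            # renormalizing p_hat before inverting its cumulative sums.
            u = rng.random() * total
            acc, j = 0.0, D - 1
            for idx, pi in enumerate(p_hat):
                acc += pi
                if u < acc:
                    j = idx
                    break
        x[j] = 1
        p_hat[j] = 0.0   # exclude the selected position from future draws
    return x

def pbil_update(p, selected, alpha=0.1):
    """PBIL update (5.8): mix the mean of the M best individuals
    (the set DgSe) into the probability vector with smoothing alpha."""
    M = len(selected)
    means = [sum(ind[i] for ind in selected) / M for i in range(len(p))]
    return [alpha * m + (1 - alpha) * pi for m, pi in zip(means, p)]
```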

5.3 Optimization Problems with Cardinality Constraints

This section introduces a collection of optimization problems with cardinality con-

straints that are used to illustrate the application of the methods described in the pre-

vious section. These include standard combinatorial optimization problems, such as

the knapsack problem, and real-world problems that arise in the fields of machine

learning (ensemble pruning), quantitative finance (cardinality-constrained portfolio

optimization), time series modeling (index tracking by partial replication) and statis-

tical data analysis (sparse principal components analysis). The problems considered

are either purely combinatorial or involve the optimization of continuous parame-

ters. In a purely combinatorial optimization problem F(z) can be directly evaluated

once z is known. The knapsack problem and ensemble pruning are of this type. The

remaining problems considered (portfolio selection, index tracking and sparse PCA)

are hybrid optimization tasks, in which the evaluation of F(z) for a fixed value of

the binary vector z requires the solution of an auxiliary continuous optimization


problem. While it is possible to address the combinatorial and the continuous op-

timization problems simultaneously, we concentrate on strategies that handle these

aspects separately. Therefore, the outcome of the continuous optimization algorithm

is used to guide the combinatorial optimization search, as in (5.4). For the hybrid

problems considered, the secondary optimization task can be efficiently solved in an

exact manner by quadratic programming. Nonetheless, the scheme can be directly

generalized when the evaluation of F(z) requires a more complex programming so-

lution, possibly without guarantee of convergence to the global solution of the sur-

rogate optimization problem. Under these conditions the algorithm used to address

the combinatorial part can actually be misled by the suboptimal solutions found in

the auxiliary problem.

Knapsack problems are a family of combinatorial optimization problems that in-

volve selecting a subset from a pool of items [15]. In this work, we consider the

0/1 knapsack problem, which can be shown to be NP-complete [16]. In an instance

of the 0/1 knapsack D items are available to fill up a knapsack. A profit pi and a

weight wi are associated to the i-th item, i = 1, 2, . . . , D. The objective is to identify

the subset of items whose accumulated profit is maximum and whose overall weight

does not exceed a given capacity W

max ∑i=1,...,D pi zi    s.t.  ∑i=1,...,D wi zi ≤ W ,   zi ∈ {0, 1} , i = 1, 2, . . . , D.   (5.9)

Both exact and approximate methods have been used to address the 0/1 knapsack

problem. Exact algorithms based on branch-and-bound approaches and dynamic

programming are reviewed in [17]. Genetic algorithms [18, 19] and EDAs [12] have

also been used to address this problem.
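For reference, an exact dynamic-programming solution of the 0/1 knapsack with integer weights can be sketched in a few lines (this is the textbook recursion, not a specific algorithm from [17]):

```python
def knapsack_dp(profits, weights, capacity):
    """Exact 0/1 knapsack by dynamic programming over capacities.

    Assumes integer weights; runs in O(D * capacity) time."""
    best = [0] * (capacity + 1)   # best[c]: max profit within capacity c
    for p, w in zip(profits, weights):
        # Traverse capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + p)
    return best[capacity]
```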

Cardinality constraints are generally not considered in the standard 0/1 knapsack

problem. Nevertheless, the optimum of the unconstrained problem can be obtained

by solving D cardinality-constrained knapsack problems with ∑i=1,...,D zi = k; k = 1, 2, . . . , D.

The k-th element in this sequence is a knapsack problem with the restriction that

only k items can be included in the knapsack. To compare the performance of the

different optimization methods analyzed in this work, we use the testing protocol

proposed in [20, 18]. Three types of problems, defined in terms of two parameters

v, r ∈ R+ , v > 1, are considered:

(1) Uncorrelated: Weights and profits are generated randomly in [1, v].

(2) Weakly correlated: Weights are generated randomly in [1, v] and profits are

generated in the interval [wi − r, wi + r].

(3) Strongly correlated: Weights are generated randomly in [1, v] and profit pi =

wi + r.
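An instance generator following this protocol might look as follows; the interface and function name are illustrative, and the restrictive capacity W = 2v is filled in as the value used in the experiments described next.

```python
import random

def generate_knapsack(D, kind, v=10.0, r=5.0, rng=random):
    """Generate a 0/1 knapsack instance of the given correlation type.

    kind is 'uncorrelated', 'weak' or 'strong'; weights are uniform in
    [1, v], and the returned capacity is the restrictive W = 2v."""
    weights = [rng.uniform(1, v) for _ in range(D)]
    if kind == 'uncorrelated':
        profits = [rng.uniform(1, v) for _ in range(D)]
    elif kind == 'weak':
        # profits lie within r of the corresponding weight
        profits = [rng.uniform(w - r, w + r) for w in weights]
    elif kind == 'strong':
        profits = [w + r for w in weights]
    else:
        raise ValueError("unknown instance type: %s" % kind)
    return profits, weights, 2 * v
```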

In general, knapsack problems with correlations between weights and profits are more

difficult to solve than problems in which the weights and profits are independent. We


use v = 10, r = 5 and a capacity W = 2v, which tends to include very few items in

the solution. The results reported are averages over 25 realizations of each problem,

which are solved using the different approximate methods: SA, a standard GA with

linear penalty, a GA using set encoding and the RAR operator (w = 1), and PBIL.

The conditions under which the search is conducted are determined on exploratory

experiments. A geometric annealing schedule Tk = γ Tk−1 with γ = 0.9 is used in

SA. The GAs evolve populations composed of 100 individuals. The probabilities of

crossover and mutation are pc = 1, pm = 10−2 , respectively. In PBIL, a population

composed of 1000 individuals is used. The probability distribution is updated using

10% of the individuals. The smoothing parameter α is 0.1. Exact results obtained with

the solver SYMPHONY from the COIN-OR project [21] implementing a branch-and-

cut (B&C) approach [22], are also reported for reference. In the strongly correlated

problems it was not possible to find the exact solutions within a reasonable amount

of time.

Table 5.1 Results for the 0-1 Knapsack problem with restrictive capacity

Type    D    GA Lin.        GA RAR         SA              PBIL           B&C (exact)
             Profit  Time   Profit  Time   Profit   Time   Profit  Time   Profit  Time
none    100   79.36  26.2    82.09  54.5    80.70    98.4   81.89  24.1    82.11   62.0
none    250   90.63  38.1   105.34 134.9   102.91   284.4  104.51  47.4   106.43  178.7
none    500   95.93  57.2   119.88 261.9   118.07   531.7  117.28  91.5   123.93  568.8
weak    100   52.97  26.9    54.38  53.9    53.53   99.87   54.33  24.2    54.43   81.8
weak    250   59.07  38.3    66.24 130.4    65.13   286.4   65.85  47.7    67.10  180.3
weak    500   60.40  56.1    74.17 266.1    73.40   531.9   72.05  87.9    76.61  560.1
strong  100   76.19  26.2    79.77  57.8    79.73    98.7   78.99  24.0      −      −
strong  250   83.98  37.8    94.20 139.5    94.15   286.0   92.39  47.3      −      −
strong  500   84.52  55.3   101.40 272.2   102.16   525.6   96.60  86.9      −      −

Table 5.1 displays the average profit obtained and the time (in seconds) to reach

a solution for each method. The experiments were performed on an AMD Turion

computer with a 1.79 GHz processor and 1 GB of RAM. None of the approximate

methods reaches the optimal profit, which is calculated using an exact branch-and-

cut method. The highest profit obtained by an approximate optimization is high-

lighted in boldface. In all cases, the algorithms that use a set encoding (GA with

RAR crossover and SA) exhibit the best performance. They also require longer

times to reach a solution, especially SA. PBIL obtains good results only in small

uncorrelated knapsack problems. This is explained by the fact that the sampling and

estimation of probability distributions becomes progressively more difficult as the

dimensionality of the problem increases. Furthermore, PBIL assumes statistical in-

dependence between the variables, which makes the algorithm perform worse on

problems in which correlations are present. The standard GA with linear penalty

has a very poor performance in all the knapsack problems analyzed.


Consider the problem of automatic induction of classifiers from a collection of in-

stances {(xn , yn ); n = 1, 2, . . . , N}, where yn is the class label of the example char-

acterized by the vector of attributes xn . The goal is to induce from these data an

autonomous system that accurately predicts the class label on the basis of the vec-

tor of attributes of a previously unseen instance. There are a number of algorithms

that can be used for learning different types of classifiers: decision trees, neural

networks, support vector machines, etc. In practice, one of the most successful

paradigms is ensemble learning [23]. Ensembles are composed of a diverse col-

lection of classifiers that are generated from the same training data by introduc-

ing variations in the algorithm used for induction or in the conditions under which

learning takes place. The outputs of the individual classifiers are then combined (for

instance, by majority voting) to produce the prediction of the ensemble. Pooling the

decisions of the ensemble members has the potential of improving the generaliza-

tion capacity of a single learner. However, ensembles are costly to generate and have

large storage requirements. Furthermore, the time required to classify an unlabeled

instance increases linearly with the size of the ensemble. Recent work has shown

that the storage requirements and classification times can be significantly reduced

by selecting a subset of classifiers whose generalization capacity is equivalent and

sometimes superior to the original complete ensemble. This process receives the

name of ensemble pruning [24], selection, [25] or thinning [26].

Ensemble pruning has been a subject of great interest in the recent literature on

machine learning (see refs. in [27]). Most studies focus on the definition of appropri-

ate quantities that can be optimized on the training set to obtain pruned ensembles

with good generalization performance. The individual properties of classifiers are

not useful to guide the selection process. The generalization capacity of the pruned

ensemble crucially depends on the complementarity of the classifiers that are part of

it. The search in the space of subensembles is usually greedy. A notable exception

is [28, 29, 30], where genetic algorithms, generally with real-valued chromosomes,

are used.

Another exception is [31], where ensemble pruning is formulated as a quadratic

integer programming problem. Consider an ensemble composed of D classifiers and

a set of labeled instances. Define a matrix G, whose element Gi j is the number of

common errors between classifier i and classifier j, where i, j = 1, 2, . . . , D. The

value of the diagonal term Gii is the number of errors made by classifier i. The

matrix is then symmetrized and its elements normalized so that they are in the same

scale:

G̃ii = Gii / N ,    G̃i j, i≠ j = (1/2) ( Gi j / Gii + G ji / G j j ) .   (5.10)

Intuitively, ∑i G̃ii measures the overall strength of the ensemble classifiers and

∑i≠ j G̃i j measures their diversity. The subensemble selection problem of size k

can now be formulated as a quadratic integer programming problem


argminz zT · G̃ · z ,   s.t.  ∑i=1,...,D zi = k ,  zi ∈ {0, 1}.   (5.11)

The binary variable zi indicates whether classifier i should be selected. The size of

the pruned ensemble, k, is specified beforehand. The selection process is a combinatorial optimization problem whose exact solution requires evaluating the performance of the exponentially many (D choose k) subensembles of size k that can be

extracted from an ensemble of size D. In [31] the solution is approximated in poly-

nomial time by applying semi-definite programming (SDP) to a convex relaxation

of the original problem.
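Given the per-example error indicators of the D classifiers as a binary matrix, the matrix G̃ of (5.10) and the objective of (5.11) can be evaluated as follows (a sketch; it assumes every classifier misclassifies at least one example, so that the normalization is well defined):

```python
import numpy as np

def pruning_objective(errors, z):
    """Evaluate z^T G-tilde z of (5.11) from a binary error matrix.

    errors: (N, D) array with errors[n, i] = 1 if classifier i
    misclassifies example n; z: binary selection vector of length D.
    Assumes every classifier makes at least one error."""
    errors = np.asarray(errors, dtype=float)
    N, D = errors.shape
    G = errors.T @ errors                  # G[i, j]: number of common errors
    diag = np.diag(G)
    # Off-diagonal terms of (5.10): 0.5 * (G_ij / G_ii + G_ji / G_jj).
    G_tilde = 0.5 * (G / diag[:, None] + G.T / diag[None, :])
    np.fill_diagonal(G_tilde, diag / N)    # strength terms on the diagonal
    z = np.asarray(z, dtype=float)
    return float(z @ G_tilde @ z)
```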

To investigate the performance in the ensemble pruning problem of the opti-

mization methods described in Section 5.2, we generate bagging ensembles for five

representative benchmark problems from the UCI repository: heart, pima, satellite,

waveform and wdbc (Breast Cancer Wisconsin) [32]. The individual classifiers in

the ensemble are trained on different bootstrap samples of the original data [33].

If the classifiers used as base learners are unstable the fluctuations in the bootstrap

sample lead to the induction of different predictors. Assuming that the errors of

these classifiers are uncorrelated, pooling their decisions by majority voting should

improve the accuracy of the predictions. In the experiments performed, bagging en-

sembles of 101 CART trees are built [34]. The original ensemble is pruned to k = 21

decision trees. The strength-diversity measure G, the time consumed in seconds and

the number of evaluations are averaged over 5 ten-fold cross-validations for heart,

pima, satellite and wdbc, and over 50 independent partitions for waveform. The suc-

cess rate is the average over 50 repetitions of the optimization for a given partition

of the data into training and testing sets.

The parameters for the metaheuristic optimization methods are determined in

exploratory experiments using the results of SDP as a gauge. For the GAs, popu-

lations with 100 individuals are evolved using a steady state generational substitu-

tion scheme. The crossover probability is set to 1. The mutation probability is 10−2

for GAs with binary representation and 10−3 for GAs with set representation. The

strength of the penalty term in the GA with linear penalties is β = 400. If the best

individual of the final population does not satisfy the cardinality constraint, a greedy

search is performed to fulfill the restriction. The value w = 1 is used in RAR-GA. A

geometric annealing schedules with γ = 0.9 is used in SA. In these experiments, the

best solution in 10 independent executions of the SA algorithm is chosen. For PBIL,

a population of 1000 individuals is generated, where 10% of the individuals are used

to update the probability distribution. The smoothing constant is set to α = 0.1.

The results of the ensemble pruning experiments performed are summarized in

Table 5.2. Most of the optimization methods analyzed reach similar solutions in

all the classification problems considered, with the exception of the standard GA

with linear penalty, which obtains the worst values of the objective function. In

terms of this quantity, the best overall results correspond to SA and SDP. In terms

of efficiency, SDP should be preferred. In machine learning, the relevant measure

of performance is the generalization capacity of the classifiers generated. The test


Table 5.2 Results for the GA, SA and EDA approaches in the ensemble pruning problem

Method               Problem    Objective   Success rate  Time (s)  Test error
SA                   heart      156.2940    1.00           7.575    18.06
                     pima       234.8931    0.98          14.644    23.99
                     satellite  185.1163    1.00          63.490    13.15
                     waveform   105.2465    1.00          37.621    19.64
                     wdbc       121.9183    0.88          34.085     4.50
GA Linear Penalty    heart      157.8668    0.98           2.017    17.96
                     pima       235.0572    0.92           2.072    23.86
                     satellite  186.0706    1.00           0.870    12.85
                     waveform   105.8294    0.02           5.179    19.70
                     wdbc       122.7625    0.06           4.963     4.45
GA Heuristic Repair  heart      156.3127    1.00           0.851    17.87
                     pima       234.8931    1.00           1.665    23.99
                     satellite  185.1163    1.00           0.520    13.15
                     waveform   105.2465    0.80           8.168    19.64
                     wdbc       121.9183    0.90           7.875     4.50
GA RAR (w = 1)       heart      156.3860    1.00           0.697    17.96
                     pima       234.9190    1.00           1.381    24.09
                     satellite  185.1163    1.00           0.910    13.15
                     waveform   105.2510    0.48           6.880    19.62
                     wdbc       121.9399    0.40           6.449     4.50
PBIL                 heart      156.4111    1.00          38.409    17.69
                     pima       234.9358    0.96          38.426    24.05
                     satellite  185.1163    1.00          16.400    13.15
                     waveform   105.2663    0.36          16.086    19.67
                     wdbc       122.0467    0.34          38.392     4.34
SDP                  heart      156.3034    1.00           1.137    18.15
                     pima       234.8956    1.00           1.159    24.09
                     satellite  185.1163    1.00           1.230    13.15
                     waveform   104.9984    0.90           1.230    19.60
                     wdbc       121.9143    0.90           1.117     4.39

error displayed in the last column of the table provides an estimate of the error rate

in examples that have not been used to train the classifiers. Lower test errors indicate

better generalization capacity. According to this measure the ranking of methods is

rather different: classifiers that were optimal according to the objective function are

suboptimal in terms of their generalization capacity. This indicates that the learning

process is affected by overfitting, because the objective function is estimated on the

training data. Nevertheless, the generalization performance of the pruned ensembles

is very similar for all the optimization methods considered. Table 5.3 shows the test

error of a single CART tree, of a complete bagging ensemble and the range of values


Table 5.3 Test errors for CART, standard bagging and pruned bagging

Problem    CART   Bagging  Pruned bagging
heart      23.63  21.48    [17.69, 18.15]
pima       24.84  24.67    [23.86, 24.09]
satellite  13.80  14.25    [12.85, 13.15]
waveform   30.27  22.53    [19.62, 19.67]
wdbc        7.28   5.68    [4.34, 4.50]

of the test error obtained by pruned bagging ensembles of size k = 21. In all the

classification problems considered, pruned ensembles have a lower test error than

CART and complete bagging.

The selection of optimal investment portfolios is a problem of great interest in the

area of quantitative finance and has attracted much attention in the scientific com-

munity (see refs. in [10]). It is a multiobjective optimization task with two opposed

goals: The maximization of profit and the minimization of risk. Several methods

have been proposed to address this problem, mostly within the classical mean-

variance model developed by H. Markowitz [35]. In this framework, the returns of

the assets considered for investment are modeled as white noise. Profit is quantified

in terms of the expected return of the portfolio. The variance of the portfolio re-

turns is used as a measure of risk. In its simplest version, the problem can be solved

by quadratic programming [1]. However, if cardinality constraints are included, the

problem becomes a mixed-integer quadratic problem, which can be shown to be

NP-Complete [10]

min_z  w[z]T · Σ[z,z] · w[z]                                  (5.12)

s.t.   w[z]T · r̄[z] = R∗                                      (5.13)

       a[z] ≤ w[z] ≤ b[z] ,  a[z] ≥ 0, b[z] ≥ 0               (5.14)

       l ≤ A[z] · w[z] ≤ u                                    (5.15)

       zT · 1 ≤ K                                             (5.16)

       wT · 1 = 1,  w ≥ 0.                                    (5.17)

The inputs of the algorithm are r̄, the vector of expected asset returns and Σ, the

covariance matrix of the asset returns. The goal is to determine the optimal weights

of the assets in the portfolio; i.e. the value of w that maximizes the variance of the

portfolio returns (5.12), for a given value of the expected return of the portfolio, R∗

(5.13). The elements of the binary vector z specify whether asset i is included in the

final portfolio (zi = 1) or not (zi = 0). Column vectors x[z] are obtained by remov-

ing from the corresponding vector x those components i for which zi = 0. Similarly,


the matrix A[z] is obtained by eliminating the i-th column of A whenever zi = 0. Fi-

nally, Σ[z,z] is obtained by removing from Σ the rows and columns for which the

corresponding indicator is zero (zi = 0). The symbols 0 and 1 denote vectors of the

appropriate size whose entries are all equal to 0 or to 1, respectively. Minimum and

maximum investment constraints, which set a lower and an upper bound on the in-

vestment of each asset in the portfolio are captured by (5.14). Vectors a and b are

D × 1 column vectors with the lower and upper bounds on the portfolio weights, re-

spectively. Inequality (5.15) summarizes the M concentration of capital constraints.

The m-th row of the M × D matrix A is the vector of coefficients of the linear combi-

nation that defines the constraint. The M × 1 column vectors l and u correspond to the

lower and upper bounds of the M linear restrictions, respectively. Concentration of

capital constraints can be used, for instance, to control the amount of capital invested

in a group of assets, so that investor preferences or limits for investment in certain

asset classes can be formally expressed. Since these constraints are linear, they do

not increase the difficulty of the problem, which can still be solved efficiently by

quadratic programming. Expression (5.16) corresponds to the cardinality constraint,

which limits the number of assets that can be included in the final portfolio. Finally,

equation (5.17) ensures that all the capital is invested in the portfolio.
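For a fixed z, the reduced continuous problem is obtained by masking: rows, columns and bounds corresponding to zi = 0 are simply dropped before the quadratic program over the selected assets is solved. A minimal sketch (the function name and dictionary interface are illustrative):

```python
import numpy as np

def reduced_problem(z, Sigma, r_bar, a, b, A):
    """Build the data of the reduced continuous problem at a fixed z.

    Rows, columns and bounds with z_i = 0 are removed; the quadratic
    program (5.12)-(5.15), (5.17) is then solved over the selected
    assets only."""
    idx = np.flatnonzero(z)
    return {
        'Sigma': Sigma[np.ix_(idx, idx)],   # Sigma[z,z]
        'r_bar': r_bar[idx],                # expected returns r̄[z]
        'a': a[idx], 'b': b[idx],           # bounds a[z], b[z]
        'A': A[:, idx],                     # A[z]: drop columns with z_i = 0
    }
```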

The cardinality-constrained problem is difficult to solve by standard optimization

techniques. Branch-and-Bound methods can be used to find exact solutions [36]. De-

spite the improvements in efficiency, the complexity of the search is still exponential.

Genetic algorithms have also been used to address this problem: In [37], the perfor-

mance of GAs is compared to SA and to tabu search (TS) [38]. According to this in-

vestigation, the best-performing portfolios are obtained by pooling the results of the

different heuristics. In [39] SA is used to search directly in the space of real-valued

asset weights. Tabu search is employed in [40]. This work focuses on the design

of appropriate neighborhood operators to improve the efficiency of the search. In

[7, 41], Multi-Objective Evolutionary Algorithms (MOEAs) are used to address the

problem. These algorithms employ a hybrid encoding instead of a pure continuous

one and heuristic repair mechanisms to handle infeasible individuals. The impact of

local search improvements is also investigated in this work. The authors conclude

that the hybrid encoding improves the overall performance of the algorithm.

In the experiments carried out in this investigation, we address the problem of

optimal portfolio selection with lower bounds and cardinality constraints. The pa-

rameters of the constraints considered are li = 0.1, ui = 0.1, i = 1, . . . , D and K = 10.

The performance of the different optimization methods is compared by calculating

the efficient frontier for the problem with and without these constraints. Points on

the efficient frontier correspond to minimum-risk portfolios for a given expected

return, or, alternatively, to portfolios that have the largest expected return from a

family of portfolios with equal risk. As a measure of the quality of the solution ob-

tained, the average relative distance to the unconstrained efficient frontier (without

cardinality and lower bound constraints) is calculated

D = (1/NF ) ∑i=1,...,NF (σic − σi∗ ) / σi∗ ,   (5.18)


where NF = 100 is the number of frontier points considered, σic is the solution of

the constrained problem in the i-th point of the frontier, and σi∗ is the solution of the

corresponding unconstrained problem.
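Once the two frontiers are available at matched expected returns, the distance (5.18) is a one-line computation (a sketch):

```python
import numpy as np

def frontier_distance(sigma_constrained, sigma_unconstrained):
    """Average relative distance (5.18) between the constrained and the
    unconstrained efficient frontiers, evaluated at NF matched points."""
    sc = np.asarray(sigma_constrained, dtype=float)
    su = np.asarray(sigma_unconstrained, dtype=float)
    return float(np.mean((sc - su) / su))
```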

Table 5.4 Results for the GA, SA and EDA approaches in the portfolio selection problem

Method               Index      D (5.18)     Success rate  Time (s)   QP solved
SA                   Hang Seng  0.00321150   1.00           1499.9    3.87 · 10^7
                     DAX        2.53162860   0.98           2877.3    7.63 · 10^7
                     FTSE       1.92205745   0.92           3610.4    8.87 · 10^7
                     S&P        4.69373181   0.91           3567.8    9.54 · 10^7
                     Nikkei     0.20197748   0.95           4274.5    9.25 · 10^7
GA Linear Penalty    Hang Seng  0.00327011   0.86            750.9    1.36 · 10^7
                     DAX        2.53314271   0.69           2999.0    4.60 · 10^7
                     FTSE       1.93255870   0.51           3539.3    5.76 · 10^7
                     S&P        4.69373181   0.76           4636.8    7.03 · 10^7
                     Nikkei     0.22992173   0.42           4811.7    6.47 · 10^7
GA Heuristic Repair  Hang Seng  0.00321150   1.00           1122.9    2.18 · 10^7
                     DAX        2.53162860   1.00           4730.6    7.45 · 10^7
                     FTSE       1.92150019   0.94           6301.4    9.70 · 10^7
                     S&P        4.69373181   1.00           7860.6   11.42 · 10^7
                     Nikkei     0.20197748   0.99          10191.2   11.47 · 10^7
GA RAR (w = 1)       Hang Seng  0.00321150   1.00           1200.8    2.77 · 10^7
                     DAX        2.53162860   1.00           3178.5    6.14 · 10^7
                     FTSE       1.92150019   0.95           6384.6   12.02 · 10^7
                     S&P        4.69373181   0.99           6575.6   12.34 · 10^7
                     Nikkei     0.20197748   1.00           9893.3   14.17 · 10^7
PBIL                 Hang Seng  0.00321150   1.00           2292.8    5.55 · 10^7
                     DAX        2.53162860   0.94           4489.1    7.70 · 10^7
                     FTSE       1.92208910   0.85           4782.3    8.06 · 10^7
                     S&P        4.69570006   0.88           5100.2    8.28 · 10^7
                     Nikkei     0.30164777   0.43           7486.5    8.21 · 10^7

The expected returns and the covariance matrix of the components of five major

world markets included in the OR-Library [42] are used as inputs for the optimiza-

tion: Hang Seng (Hong-Kong, 31 assets), DAX (Germany, 85 assets), FTSE (UK,

89 assets), Standard and Poor’s (U.S.A., 98 assets) and Nikkei (Japan, 225 assets).

The methods compared are SA, standard GA with linear penalty, standard GA with

heuristic repair, GA with a set representation and RAR (w = 1) crossover, and PBIL.

The SA heuristic is used with a geometric annealing scheme with constant γ = 0.9.

Populations of 100 individuals are used for the GAs. The mutation and crossover

probabilities are pm = 10^−2 and pc = 1, respectively. PBIL samples populations

122 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

of 400 individuals, 10% of which are used to update the probability distribution.

The heuristic repair scheme performs an unconstrained optimization without the

cardinality constraint, and then either includes in the chromosome those products

with the highest weights or eliminates the products with the smallest weights in the

unconstrained solution, as needed.
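A minimal sketch of such a repair step is given below. Keeping the K largest weights and renormalising is a simplification we introduce for illustration; in the actual hybrid algorithms, the weights of the selected assets would be re-optimized by a quadratic program rather than merely rescaled:

```python
import numpy as np

def repair_cardinality(weights, K):
    """Sketch of a heuristic repair step: keep the K assets with the
    largest unconstrained weights, zero out the rest, and renormalise
    so that the budget constraint (weights sum to one) still holds."""
    w = np.asarray(weights, dtype=float)
    keep = np.argsort(w)[-K:]          # indices of the K largest weights
    repaired = np.zeros_like(w)
    repaired[keep] = w[keep]
    return repaired / repaired.sum()   # restore the budget constraint
```

The repaired chromosome is always feasible with respect to the cardinality constraint, so no penalty term is needed in the objective.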

Table 5.4 summarizes the results of the experiments. The value of D (5.18) displayed in the third column is the best out of 5 executions of each of the methods considered. The proportion of attempts in which the corresponding optimization algorithm obtains the best known solution is given in the column labeled success rate. The last

two columns report the time employed (in seconds) and the number of quadratic

optimizations performed, respectively. In terms of the quality of the obtained solu-

tions, using a binary encoding with linear penalties performs worse than all the other

approximate methods. By contrast, the heuristic repair scheme identifies the best of

the known solutions in all the problems investigated. GA with a set representation

and RAR (w = 1) crossover has also an excellent performance and is slightly more

efficient on average. High quality solutions are also obtained by SA, albeit at higher

computational cost. PBIL performs well only in problems in which the number of

assets considered for investment is small. As the dimensionality of the problem in-

creases, sampling and estimation of the probability distribution in algorithms of the

EDA family become less effective.

Index tracking is a passive investment strategy whose goal is to match the perfor-

mance of a reference financial index. The problem can be exactly solved by invest-

ing on each asset an amount of capital that is proportional to the corresponding

weight in the index. In practice, this strategy has the drawback of incurring high

initial transaction costs. Furthermore, there is an overhead in managing a portfolio

that invests in every constituent of the index. In particular, rebalancing the portfolio

can be costly if the composition of the index is revised. An alternative is to create a

tracking portfolio that invests only in a reduced set of assets. This partial replication

strategy will in general be unable to perfectly reproduce the behavior of the index.

However, a portfolio that invests in a fixed number of assets and closely follows the

evolution of the index can be obtained by minimizing the tracking error

\min_{w,z} \; \sqrt{\frac{1}{T} \sum_{t=1}^{T} \Big( \sum_{j=1}^{D} w_j r_j(t) - r_t \Big)^2}    (5.19)

subject to

\sum_{i=1}^{D} w_i = 1,    (5.20)

l \le A \cdot w \le u,    (5.21)

z_i \in \{0, 1\}, \quad a_i z_i \le w_i \le b_i z_i, \quad a_i \ge 0, \; b_i \ge 0, \quad i = 1, 2, \ldots, D,    (5.22)

\sum_{i=1}^{D} z_i \le K,    (5.23)


where T is the length of the time series considered, D is the number of constituents

of the index, r j (t) is the return of asset j at time t and rt is the return of the index

at time t. Restriction (5.20) is a budget constraint, which ensures that all the cap-

ital is invested in the portfolio. Investment concentration constraints are captured

by (5.21). Expression (5.22) reflects lower and upper bound constraints. The binary

variables {z1 , z2 , . . . , zD } indicate whether an asset is included or excluded from the

tracking portfolio. Note that when zi = 0, the lower and upper bounds for the weight

of asset i are both equal to zero, which effectively excludes this asset from the

investment. The cardinality constraint is expressed by Eq. (5.23).
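For a fixed portfolio, the objective (5.19) reduces to the root mean squared difference between the portfolio and index return series; the following sketch evaluates it (the function name and the array layout are our choices):

```python
import numpy as np

def tracking_error(w, asset_returns, index_returns):
    """Empirical tracking error of Eq. (5.19).

    w             : (D,) portfolio weights
    asset_returns : (T, D) array with r_j(t) in row t, column j
    index_returns : (T,) array with the index return r_t
    """
    R = np.asarray(asset_returns, dtype=float)
    port = R @ np.asarray(w, dtype=float)        # portfolio return at each t
    diff = port - np.asarray(index_returns, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

A portfolio that replicates the index exactly has a tracking error of zero; the optimization searches for the K-asset portfolio that drives this quantity as close to zero as possible.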

Index tracking has been extensively investigated in the literature. The hybrid GA

with set encoding and RAR crossover described in Section 5.2 is used in [4]. In-

stead of the tracking error, this work minimizes the variance of the difference be-

tween the returns of the index and of the tracking portfolio. Optimal impulse control

techniques are used in [43]. In [44] the problem is solved by using the threshold ac-

cepting (TA) heuristic, which is a deterministic analogue of simulated annealing, in

which transitions are rejected only when they lead to a deterioration in performance

that is above a specified threshold. Evolutionary algorithms with real-valued chro-

mosome representations are used in [45]. This investigation focuses on the influence

of transaction costs and portfolio rebalancing. In [46] the portfolio optimization and

index tracking problems are addressed by means of a heuristic relaxation method

that consists in solving a small number of convex optimization problems with fixed

transaction costs. Hybrid optimization approaches to minimizing the tracking error by

partial replication are also investigated in [47, 48, 49].

In the current investigation, publicly available benchmark data from the OR-

Library [42] is used to compare the optimization techniques described in Section

5.2. Five major world market indices are used in the experiments: Hang-Seng, DAX,

FTSE, S&P and Nikkei. For each index, the time series of 290 weekly returns for

the index and for its constituents are given. From these data, the first 145 values

are used to create a tracking portfolio that includes a maximum of K = 10 assets.

The last 145 values are used to measure the out-of-sample tracking error. The pop-

ulation sizes are 350 for the GAs and 1000 for PBIL. The values of the remaining

parameters coincide with those used in the portfolio selection problem.

Table 5.5 presents a summary of the experiments performed. The best out of 5 ex-

ecutions of the different optimization methods are reported. GA with random repair

obtains the best overall results. GA with set encoding and RAR (w = 1) crossover

matches these results except in Nikkei, which is the index with the largest number of

constituents. PBIL also has a good performance, but the computational cost is higher

than for the other algorithms. In fact, the algorithm reached the maximum number

of optimizations established without converging. The results of SA and GA with

binary encoding and linear penalty are suboptimal in all but the simplest problems.

They also exhibit low success rates. In all problems investigated, the out-of-sample

error is typically larger than the in-sample error, but of the same order of magnitude.


Table 5.5 Results for the GA, SA and EDA approaches in the index tracking problem

Method               Index       In-sample error  Out-of-sample error  Success rate  Time (s)  QP optimizations
SA                   Hang Seng   1.3462 · 10^−5   2.0575 · 10^−5       0.40            1.12       19342
SA                   DAX         8.0837 · 10^−6   7.4824 · 10^−5       0.40            1.73       27101
SA                   FTSE        2.3951 · 10^−5   7.0007 · 10^−5       0.20            1.44        1.43
SA                   S&P         1.6781 · 10^−5   4.7347 · 10^−5       0.20            1.97       29764
SA                   Nikkei      2.1974 · 10^−5   1.0719 · 10^−4       0.20           95.00     1476549
GA, linear penalty   Hang Seng   1.3462 · 10^−5   2.0575 · 10^−5       0.60            4.15       51509
GA, linear penalty   DAX         8.0837 · 10^−6   7.4824 · 10^−5       0.20           13.69      144868
GA, linear penalty   FTSE        2.7345 · 10^−5   5.3148 · 10^−5       0.20           17.66      158465
GA, linear penalty   S&P         1.7974 · 10^−5   5.2898 · 10^−5       0.20           36.89      311008
GA, linear penalty   Nikkei      2.0061 · 10^−5   1.0707 · 10^−4       0.20          123.28     1015774
GA, random repair    Hang Seng   1.3462 · 10^−5   2.0575 · 10^−5       1.00            5.92       81690
GA, random repair    DAX         8.0837 · 10^−6   7.4824 · 10^−5       1.00           18.89      231840
GA, random repair    FTSE        2.1836 · 10^−5   8.0091 · 10^−5       0.40           21.20      255820
GA, random repair    S&P         1.6573 · 10^−5   5.5457 · 10^−5       0.20           47.02      508313
GA, random repair    Nikkei      1.8255 · 10^−5   6.9574 · 10^−5       0.20          170.62     1664696
GA, RAR (w = 1)      Hang Seng   1.3462 · 10^−5   2.0575 · 10^−5       1.00            4.67       51513
GA, RAR (w = 1)      DAX         8.0837 · 10^−6   7.4824 · 10^−5       1.00           14.17      124717
GA, RAR (w = 1)      FTSE        2.1836 · 10^−5   8.0091 · 10^−5       0.40           18.83      156456
GA, RAR (w = 1)      S&P         1.6573 · 10^−5   5.5457 · 10^−5       0.20           42.31      311002
GA, RAR (w = 1)      Nikkei      1.8917 · 10^−5   8.1057 · 10^−5       0.20          175.34     1015766
PBIL                 Hang Seng   1.3462 · 10^−5   2.0575 · 10^−5       1.00          167.04     2010000
PBIL                 DAX         8.0837 · 10^−6   7.4824 · 10^−5       1.00          199.28     2010000
PBIL                 FTSE        2.1836 · 10^−5   8.0091 · 10^−5       1.00          195.31     2010000
PBIL                 S&P         1.6781 · 10^−5   4.7347 · 10^−5       0.60          314.77     2010000
PBIL                 Nikkei      1.9510 · 10^−5   7.4572 · 10^−5       0.20          222.86     2010000

Principal Component Analysis (PCA) is a dimensionality reduction technique that is

frequently used in data analysis, data compression and data visualization. The goal

is to identify the directions along which the multidimensional data have the largest

variance. The principal components can be obtained by maximizing the variance of

normalized linear combinations of the original variables. Typically they have a non-

zero projection on all the original coordinates, which can make their interpretation

difficult. The goal of sparse PCA is to find principal components that have non-

zero loadings in only a small number of the original directions, while at the same

time explaining most of the variance. The first sparse principal component can be

obtained by solving the cardinality-constrained optimization problem


" T

#

max w[z] · Σ[z,z] · w[z] (5.24)

w,z

z · 1 ≤ K,

T

(5.26)

where Σ is the data covariance matrix. As in the previous problems, the elements of

the binary vector z encode whether the principal component has a non-zero projec-

tion along the corresponding direction. Once the first principal component has been

found, if more principal components are to be calculated, the covariance matrix Σ

is deflated as follows

\Sigma = \Sigma - \left( w^{T} \cdot \Sigma \cdot w \right) w \cdot w^{T}    (5.27)

and a new problem of the form given by (5.24), defined now in terms of this de-

flated covariance matrix is solved. The decomposition stops after a maximum of

Rank(Σ) iterations. In practice, the number of principal components is either spec-

ified beforehand or determined by the percentage of the total variance of the data

explained.
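The deflation step of Eq. (5.27) can be sketched in a few lines; the function below assumes the component w has unit norm, as required by the normalization constraint:

```python
import numpy as np

def deflate(cov, w):
    """Deflation step of Eq. (5.27): remove from the covariance matrix
    the variance captured by the (unit-norm) principal component w."""
    cov = np.asarray(cov, dtype=float)
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    captured = (w.T @ cov @ w).item()     # variance along w
    return cov - captured * (w @ w.T)
```

After deflation, the direction w carries zero variance, so the next sparse component found on the deflated matrix is orthogonal in variance to the previous one.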

The problem of finding sparse principal components has also received a fair

amount of attention in the recent literature. Greedy search is used in [50]. In [51]

SPCA is formulated as a regression problem, so that LASSO techniques [52] can be

used to favor sparse solutions. In LASSO, an L1 -norm penalty for non-zero values

of the factor loadings is used. A higher weight of the penalty term in the objective

function induces models that are sparser. However, it is not possible to have direct control on the number of non-zero coefficients in the solution. The cardinality

constraint is explicitly considered in [53], which uses a method based on solving a

relaxation of the problem by semidefinite programming (SDP).

To compare the performance of the different methods analyzed, we use the bench-

mark problem introduced in [54]. Consider the sparse vector v, whose components

are

v_i = \begin{cases} 1, & \text{if } i \le 50 \\ 1/(i - 50), & \text{if } 50 < i \le 100 \\ 0, & \text{otherwise} \end{cases}    (5.28)

A covariance matrix is built from this vector and U, a square matrix of dimensions

150 × 150 whose elements are U[0, 1] random variables

\Sigma = \sigma \, v v^{T} + U^{T} U,    (5.29)
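Building this benchmark covariance is mechanical; the sketch below follows Eqs. (5.28)–(5.29), but note that the signal strength and the RNG seed are illustrative assumptions of ours, not values taken from the text:

```python
import numpy as np

def benchmark_covariance(signal=10.0, dim=150, seed=0):
    """Synthetic covariance of Eqs. (5.28)-(5.29): a sparse signal
    direction v buried in uniform noise U^T U.  The `signal` strength
    and `seed` are illustrative choices, not from the benchmark."""
    v = np.zeros(dim)
    v[:50] = 1.0                        # v_i = 1 for i <= 50
    v[50:100] = 1.0 / np.arange(1, 51)  # v_i = 1/(i - 50) for 50 < i <= 100
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, size=(dim, dim))  # elements are U[0, 1]
    return signal * np.outer(v, v) + U.T @ U    # Eq. (5.29)

Sigma = benchmark_covariance()
```

The resulting matrix is symmetric by construction, with the true cardinality-50 pattern hidden in the leading block of v.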

In this construction, the sparsity pattern of v is partially masked by noise. In our experiments the results of SA, binary GA

with linear penalties, binary GA with random repair, set GA with RAR crossover

operator and w = 1, PBIL and DSPCA, an approximate method based on semidef-

inite programming [53, 55] are compared. SA uses a geometric annealing scheme

with γ = 0.9. The GAs use a population of 50 individuals. Crossover and mutation are performed with probabilities pc = 1 and pm = 10^−2, respectively. PBIL is

executed with a population of 400 individuals and α = 0.1. In this algorithm, the


best 10% of the individuals are used to update the probability distribution. The first

sparse principal component is then calculated. For each of the methods that involve

stochastic search (all except DSPCA), the best out of 5 independent executions of

the algorithm is taken. Figure 5.1 displays the variance explained by the first sparse

principal component as a function of its cardinality K = 1, 2, . . . , 140, for all the

methods considered. GA using a linear penalty does not obtain good solutions in

this high-dimensional problem. PBIL performs slightly better, but is clearly infe-

rior to SA, GAs with random repair, GA with set encoding and DSPCA. Table 5.6

shows the detailed results for cardinality K = 50, which is the cardinality of the true

hidden pattern. In this table, the largest value of the variance achieved is highlighted

in bold. The success rates, the computation times on an AMD Turion machine with

1.79 GHz processor speed and 1 GB of RAM and the total number of optimizations are

also given. The times for the DSPCA algorithm are not given, because a

MATLAB implementation was used [54], which cannot be directly compared with

the other results, obtained with code written in C. The GA with set encoding and

RAR (w = 1) crossover and the GA with binary encoding and random repair obtain

the best results and explain more variance than the solution obtained by DSPCA.

The first of these methods is slightly faster. SA is very fast and achieves a result

that is only slightly worse with a success rate of 100%. PBIL and GA with binary

encoding and linear penalty obtain solutions that are clearly inferior.
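The PBIL update used throughout these experiments shifts the sampling probabilities towards the frequencies observed in the best individuals of each generation; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def pbil_update(p, elite, alpha=0.1):
    """One PBIL learning step: move the bit-sampling probabilities p
    towards the empirical frequencies of the elite individuals.

    p     : (D,) current sampling probabilities
    elite : (n_best, D) binary array with the best individuals
    alpha : learning rate (0.1 in the experiments above)
    """
    freq = np.mean(np.asarray(elite, dtype=float), axis=0)
    return (1.0 - alpha) * np.asarray(p, dtype=float) + alpha * freq
```

With a population of 400 and the best 10% retained, `elite` would hold 40 rows per generation; the probabilities drift towards the bit patterns of the best solutions found.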

[Fig. 5.1: Variance explained by the first sparse principal component as a function of its cardinality, for GA with linear penalty, GA with random repair, GA with RAR crossover, SA, PBIL and DSPCA. Horizontal axis: Cardinality (25–150); vertical axis: Variance (0–45).]


Table 5.6 Results for the GA, SA, EDA and SDP approaches in the synthetic problem for

K = 50

Method                Variance  Success rate  Time (s)  Optimizations
SA                    22.5727   1.00           65.95     11639
GA + Linear Penalty   19.7881   0.20          126.1       5137
GA + Random Repair    22.7423   0.80          172.1       7981
GA + RAR (w = 1)      22.7423   1.00          105.41      5146
PBIL                  20.1778   1.00          198.20     40800
SDP                   22.5001   −              −          −

5.4 Conclusions

Many tasks of practical interest can be formulated as optimization problems with

cardinality constraints. The examples analyzed in this article arise in various fields

of application: ensemble pruning, optimal portfolio selection, financial index track-

ing and sparse principal component analysis. They are large optimization problems

whose solution by standard optimization methods is computationally expensive. In

practice, using exact methods like branch-and-bound is feasible only for small prob-

lem instances. A practicable alternative is to use approximate optimization methods

that can identify near-optimal solutions at a lower computational cost: Genetic al-

gorithms, simulated annealing and estimation of distribution algorithms. However,

the search operators used in the standard formulations of these techniques are ill-

suited to the problem because they do not preserve the cardinality of the candi-

date solutions. This means that either ad-hoc penalization or repair mechanisms are

needed to enforce the constraints. Including penalty terms in the objective func-

tion distorts the search and generally leads to suboptimal solutions. Applying repair

mechanisms to infeasible configurations provides a more elegant and effective ap-

proach to the problem. Nonetheless, the best option is to use a set representation,

in conjunction with specially designed search operators that preserve the cardinality

of the candidate solutions. Some of the problems considered, such as the knapsack

problem and ensemble pruning are purely combinatorial optimization tasks. In prob-

lems like portfolio selection, index tracking and sparse PCA both combinatorial and

continuous aspects are present. For these we advocate the use of hybrid methods

that separately handle the combinatorial and the continuous aspects of cardinality-

constrained optimization problems. Among the approximate methods considered,

a genetic algorithm with set encoding and RAR crossover obtains the best overall

performance. In problems where the comparison was possible, the solutions ob-

tained are close to the exact ones and to those identified by approximate methods

that use semidefinite programming. Using the same encoding, simulated annealing

also obtains fairly good solutions, generally at a higher computational cost. This

indicates that the RAR crossover operator seems to enhance the search by introducing in the population individuals that effectively combine advantageous features of previously found solutions. Algorithms of the EDA family perform well on small and medium-sized problem instances.

solutions on large problems. The reason for this loss of efficacy is that the sampling

and estimation of probability distributions becomes progressively more difficult as

the dimensionality of the problem increases.
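As an illustration of such cardinality-preserving operators, a set-encoded swap mutation never leaves the feasible region, so neither penalties nor repair are needed; the function below is our sketch, not the RAR crossover itself:

```python
import random

def swap_mutation(solution, universe, rng=None):
    """Cardinality-preserving mutation on a set-encoded chromosome:
    swap one selected item for one currently unselected item, so that
    |solution| = K holds before and after the mutation."""
    rng = rng or random.Random()
    sol = set(solution)
    outside = sorted(set(universe) - sol)
    if not sol or not outside:
        return sol                       # nothing to swap
    sol.remove(rng.choice(sorted(sol)))  # drop one selected item
    sol.add(rng.choice(outside))         # bring in one unselected item
    return sol
```

Because every offspring satisfies the cardinality constraint by construction, the search operates entirely within the feasible set, which is precisely the advantage argued for above.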

Acknowledgments

This research has been supported by Dirección General de Investigación (Spain),

project TIN2007-66862-C02-02.

References

1. Gill, P.E., Murray, W., Saunders, M.A., Wright, M.H.: Inertia-controlling methods for

general quadratic programming. SIAM Review 33, 1–36 (1991)

2. Gill, P., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA Journal

of Applied Mathematics 9 (1), 91–108 (1972)

3. Adler, I., Karmarkar, N., Resende, M.G.C., Veiga, G.: An implementation of Kar-

markar’s algorithm for linear programming. Mathematical Programming 44, 297–335

(1989)

4. Shapcott, J.: Index tracking: genetic algorithms for investment portfolio selection. Tech-

nical report, EPCC-SS92-24, Edinburgh, Parallel Computing Centre (1992)

5. Radcliffe, N.J.: Genetic set recombination. Foundations of Genetic Algorithms. Morgan

Kaufmann Publishers, San Francisco (1993)

6. Coello, C.: Theoretical and numerical constraint-handling techniques used with evolu-

tionary algorithms: a survey of the state of the art. Computer Methods in Applied Me-

chanics and Engineering 191, 1245–1287 (2002)

7. Streichert, F., Ulmer, H., Zell, A.: Evaluating a hybrid encoding and three crossover

operators on the constrained portfolio selection problem. In: Proceedings of the Congress

on Evolutionary Computation (CEC 2004), vol. 1, pp. 932–939 (2004)

8. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-

ence 4598, 671–679 (1983)

9. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning.

Addison-Wesley, Reading (1989)

10. Moral-Escudero, R., Ruiz-Torrubiano, R., Suárez, A.: Selection of optimal investment

portfolios with cardinality constraints. In: Proceedings of the IEEE World Congress on

Evolutionary Computation, pp. 2382–2388 (2006)

11. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5,

183–205 (1991)

12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms: A New Tool

for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)

13. Baluja, S.: Population-based incremental learning: A method for integrating genetic

search based function optimization and competitive learning. Technical Report CMU-

CS-94-163, Carnegie Mellon University (1994)

14. Muehlenbein, H.: The equation for response to selection and its use for prediction. Evo-

lutionary Computation 5, 303–346 (1998)

15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004)


16. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press, New York (1972)

17. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research,

2271–2284 (2005)

18. Simões, A., Costa, E.: An evolutionary approach to the zero/one knapsack problem: Test-

ing ideas from biology. In: Proceedings of the Fifth International Conference on Artificial

Neural Networks and Genetic Algorithms, ICANNGA (2001)

19. Ku, S., Lee, B.: A set-oriented genetic algorithm and the knapsack problem. In: Proceed-

ings of the IEEE World Congress on Evolutionary Computation, CEC 2001 (2001)

20. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer,

Heidelberg (1996)

21. Ladanyi, L., Ralphs, T., Guzelsoy, M., Mahajan, A.: SYMPHONY (2009),

https://projects.coin-or.org/SYMPHONY

22. Padberg, M.W., Rinaldi, G.: A branch-and-cut algorithm for the solution of large scale

traveling salesman problems. SIAM Review 33, 60–100 (1991)

23. Dietterich, T.G.: An experimental comparison of three methods for constructing ensem-

bles of decision trees: Bagging, boosting, and randomization. Machine Learning 40,

139–157 (2000)

24. Margineantu, D.D., Dietterich, T.G.: Pruning adaptive boosting. In: Proc. of the 14th

International Conference on Machine Learning, pp. 211–218. Morgan Kaufmann, San

Francisco (1997)

25. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from li-

braries of models. In: Proc. of the 21st International Conference on Machine Learning,

p. 18. ACM Press, New York (2004)

26. Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: Ensemble diversity mea-

sures and their application to thinning. Information Fusion 6, 49–62 (2005)

27. Martı́nez-Muñoz, G., Lobato, D.H., Suárez, A.: An analysis of ensemble pruning tech-

niques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence 31, 245–259 (2009)

28. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than

all. Artificial Intelligence 137, 239–263 (2002)

29. Zhou, Z.H., Tang, W.: Selective ensemble of decision trees. In: Liu, Q., Yao, Y., Skowron,

A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 476–483. Springer, Heidelberg

(2003)

30. Hernández-Lobato, D., Hernández-Lobato, J.M., Ruiz-Torrubiano, R., Valle, Á.: Pruning

adaptive boosting ensembles by means of a genetic algorithm. In: Corchado, E., Yin,

H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 322–329. Springer,

Heidelberg (2006)

31. Zhang, Y., Burer, S., Street, W.N.: Ensemble pruning via semi-definite programming.

Journal of Machine Learning Research 7, 1315–1338 (2006)

32. Asuncion, A., Newman, D.: UCI machine learning repository (2007)

33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)

34. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression

Trees. Chapman & Hall, New York (1984)

35. Markowitz, H.: Portfolio selection. Journal of Finance 7, 77–91 (1952)

36. Bienstock, D.: Computational study of a family of mixed-integer quadratic program-

ming problems. In: Balas, E., Clausen, J. (eds.) IPCO 1995. LNCS, vol. 920. Springer,

Heidelberg (1995)


37. Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M.: Heuristics for cardinality con-

strained portfolio optimisation. Computers and Operations Research 27, 1271–1302

(2000)

38. Glover, F.: Future paths for integer programming and links to artificial intelligence. Com-

puters and Operations Research 13, 533–549 (1986)

39. Crama, Y., Schyns, M.: Simulated annealing for complex portfolio selection problems.

Technical report, Groupe d’Etude des Mathematiques du Management et de l’Economie

9911, Université de Liège (1999)

40. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Com-

putational Economics 20, 177–190 (2002)

41. Streichert, F., Tanaka-Yamawaki, M.: The effect of local search on the constrained port-

folio selection problem. In: Proceedings of the IEEE World Congress on Evolutionary

Computation (CEC 2006), Vancouver, Canada, pp. 2368–2374 (2006)

42. Beasley, J.E.: Or-library: Distributing test problems by electronic mail. Journal of the

Operational Research Society 41(11), 1069–1072 (1990)

43. Buckley, I., Korn, R.: Optimal index tracking under transaction costs and impulse con-

trol. International Journal of Theoretical and Applied Finance 1(3), 315–330 (1998)

44. Gilli, M., Këllezi, E.: Threshold accepting for index tracking. Computing in Economics

and Finance 72 (2001)

45. Beasley, J.E., Meade, N., Chang, T.: An evolutionary heuristic for the index tracking

problem. European Journal of Operational Research 148(3), 621–643 (2003)

46. Lobo, M., Fazel, M., Boyd, S.: Portfolio optimization with linear and fixed transaction

costs. Annals of Operations Research, special issue on financial optimization 152(1),

376–394 (2007)

47. Jeurissen, R., van den Berg, J.: Index tracking using a hybrid genetic algorithm. In: ICSC

Congress on Computational Intelligence Methods and Applications 2005 (2005)

48. Jeurissen, R., van den Berg, J.: Optimized index tracking using a hybrid genetic algo-

rithm. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC

2008), pp. 2327–2334 (2008)

49. Ruiz-Torrubiano, R., Suárez, A.: A hybrid optimization approach to index tracking. Ac-

cepted for publication in Annals of Operations Research (2007)

50. Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA. In: Advances in

Neural Information Processing Systems, NIPS 2005 (2005)

51. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Com-

putational and Graphical Statistics 15(2), 265–286 (2006)

52. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society B 58, 267–288 (1996)

53. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: A direct formulation for

sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007)

54. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal compo-

nent analysis. Journal of Machine Learning Research 9, 1269–1294 (2008)

55. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: MATLAB code for DSPCA

(2008), http://www.princeton.edu/˜aspremon/DSPCA.htm

Chapter 6

Learning Global Optimization through a

Support Vector Machine Based Adaptive

Multistart Strategy

Abstract. This chapter presents GOSAM (Global Optimization using Support vector regression based Adaptive Multistart), an algorithm that applies

statistical machine learning techniques, viz. Support Vector Regression (SVR) to

adaptively direct iterative search in large-scale global optimization. At each itera-

tion, GOSAM builds a training set of the objective function’s local minima discov-

ered till the current iteration, and applies SVR to construct a regressor that learns

the structure of the local minima. In the next iteration the search for the local min-

imum is started from the minimum of this regressor. The idea is that the regressor

for local minima will generalize well to the local minima not obtained so far in the

search, and hence its minimum would be a ‘crude approximation’ to the global min-

imum. This approximation improves over time, leading the search towards regions

that yield better local minima and eventually the global minimum. Simulation re-

sults on well known benchmark problems show that GOSAM requires significantly

fewer function evaluations to reach the global optimum, in comparison with meth-

ods like Particle Swarm optimization and Genetic Algorithms. GOSAM proves to

be relatively more efficient as the number of design variables (dimension) increases.

GOSAM does not require explicit knowledge of the objective function, and also

Jayadeva

Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,

New Delhi - 110016, India

e-mail: jayadeva@ee.iitd.ac.in

Sameena Shah

Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,

New Delhi - 110016, India

e-mail: sameena.shah@gmail.com

Suresh Chandra

Dept. of Mathematics, Indian Institute of Technology, Hauz Khas,

New Delhi - 110016, India

e-mail: chandras@maths.iitd.ac.in

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 131–154.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010

132 Jayadeva, S. Shah, and S. Chandra

does not assume any specific properties. We also discuss some real world applica-

tions of GOSAM involving constrained and design optimization problems.
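The adaptive-multistart loop described above can be sketched on a one-dimensional function. In the sketch below, kernel ridge regression with an RBF kernel stands in for the SVR regressor, a crude fixed-step descent stands in for the local optimizer, and all names, parameter values and the grid-based minimization of the regressor are our illustrative assumptions rather than details of GOSAM itself:

```python
import numpy as np

def local_min(f, x0, step=0.1, iters=200):
    """Crude fixed-step 1-D descent, standing in for any local optimizer."""
    x = float(x0)
    for _ in range(iters):
        for cand in (x - step, x + step):
            if f(cand) < f(x):
                x = cand
    return x

def rbf_fit_predict(X, y, grid, gamma=1.0, lam=1e-6):
    """Kernel ridge regressor over the local minima found so far
    (a simple stand-in for the SVR regressor)."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return np.exp(-gamma * (grid[:, None] - X[None, :]) ** 2) @ alpha

def adaptive_multistart(f, starts, rounds=5, grid=None):
    """Fit a regressor to the local minima discovered so far and restart
    the local search from the regressor's minimiser."""
    grid = np.linspace(-5.0, 5.0, 201) if grid is None else np.asarray(grid)
    X, y = [], []
    for s in starts:                       # initial (random) restarts
        xm = local_min(f, s)
        X.append(xm); y.append(f(xm))
    for _ in range(rounds):                # regressor-guided restarts
        pred = rbf_fit_predict(np.array(X), np.array(y), grid)
        xm = local_min(f, float(grid[int(np.argmin(pred))]))
        X.append(xm); y.append(f(xm))
    return X[int(np.argmin(y))]
```

The key design point is that each new start state is derived from the model of the minima found so far, rather than drawn at random, so the quality of the restarts can improve over iterations.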

Global optimization involves finding the optimal or best possible configuration from

a large space of possible configurations. It is among the most fundamental of compu-

tational tasks, and has numerous applications including bio-informatics [26],

robotics [20], portfolio optimization [5], VLSI design [31], and nearly every

engineering application [27].

If the search space is small, then obtaining the global optimum is trivial; otherwise

some special structure like linearity, convexity or differentiability of the problem

needs to be exploited. Classical mathematical optimization techniques are based

on utilizing such special structures. Difficulties in obtaining the global optimum

arise when the objective function neither has any special structure, nor possesses

properties like continuity and differentiability, or if it has numerous local optima that

obstruct search for the global optimum [25]. Such objective functions are common

in many applications including data mining, location problems, and computational

chemistry, amongst others [13, 15]. Similar difficulties arise if some structure exists

but is not known a priori.

For the global optimization of these kinds of objective functions, one utilizes

the broad class of general-purpose algorithms called local search algorithms [29].

Global optimizers typically depend on local search algorithms that search multiple

states in the neighborhood of a given state or configuration. Such methods find a

local optimum, which depends on the starting state. Local search generally does not

yield the global optimum, because it gets stuck in a local optimum. Therefore, local

search methods are usually augmented with some strategy to escape from local op-

tima. For instance, in simulated annealing (SA), first introduced by [22], the escape

strategy is a probabilistic shaking that is associated with a temperature parameter.

The “temperature” is reduced iteratively from a high initial value, based on a cooling

schedule.
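A minimal simulated-annealing loop with such a geometric cooling schedule might look as follows; the acceptance rule is the standard Metropolis criterion, and all parameter values here are illustrative:

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=1.0, gamma=0.9, steps=200, seed=0):
    """Minimal SA loop with geometric cooling T <- gamma * T.
    `neighbor(x, rng)` proposes a perturbed state near x."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(steps):
        y = neighbor(x, rng)
        fy = f(y)
        # always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-(fy - fx) / t)
        if fy <= fx or rng.random() < math.exp(-(fy - fx) / t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= gamma                      # geometric cooling schedule
    return best, fbest
```

As the temperature decays geometrically, uphill moves become increasingly improbable and the search effectively freezes into a local (ideally global) optimum.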

On the other hand, though local search strategies suffer from entrapment in local

optima, they are very fast. The time required to run an iteration of simulated an-

nealing is sufficient for several iterations of gradient descent. Therefore, instead of

escaping from local optima, an alternative is the use of multi-restart local search ap-

proaches. These start from a new initial state once a local search step has terminated

in a local optimum. Multistart approaches are known to outperform other strategies, such as simulated annealing, on some problems, e.g. the Travelling Salesperson Problem (TSP) [19].

The performance of any local search procedure depends on the starting state, and

multi-restart local search algorithms start from a randomly chosen state. None of the

above mentioned approaches exploit knowledge of the space that has been explored

so far, to guide further search. In other words, their search strategy does not evolve

with time. A question that comes to mind is whether, based on some knowledge

6 Learning Global Optimization through SVM Based Adaptive Multistart 133

collected about the function, it is possible to generate a start state that is better than

a random one. If the answer is in the affirmative, then successive iterations will lead

us closer to the global minimum.

Evolutionary algorithms like Particle Swarm optimization (PSO), Genetic Al-

gorithms (GA) and Ant Colony optimization (ACO), are distributed iterative search

algorithms, which indirectly use some form of information about the space explored

so far, to direct search. Initially, there is a finite number of “agents” that search for

the global optimum. The paths of these agents are dynamically and independently

updated during the search based on the results obtained till the current update.

PSO, developed by Kennedy and Eberhart [21], is inspired by the flocking behav-

ior of birds. In PSO, particles start search in different regions of the solution space,

and every particle is made aware of the best local optimum amongst those found

by its neighbors, as well as the global optimum obtained up to the current iteration.

Each particle then iteratively adapts its path and velocity accordingly. The algorithm

converges when a particle finds a highly desirable path to proceed on, and the other

particles effectively follow its lead.

Genetic Algorithms [12] are motivated by the biological evolutionary opera-

tions of selection, mutation and crossover. In real life, the fittest individuals tend

to survive, reproduce and improve over generations. Based on this, “chromosomes”

that yield better optima are considered to correspond to fitter individuals, and are

used for creating the next generation of chromosomes that hopefully lead us to bet-

ter optima. The population of chromosomes is updated till convergence, or until a

specified number of updates is completed.

Ant colony optimization [10] mimics the behavior of a group of ants following

the shortest path to a food source. Ants (agents) exchange information indirectly,

through a mechanism called “stigmergy”, by leaving a trail of pheromone on the

paths traversed. States believed to be good are marked by heavier concentrations

of pheromone to guide the ants that arrive later. Therefore, decisions that are taken

subsequently get biased by previous decisions and their outcomes.

Some heuristic techniques use an alternate approach to guide further search by

application of machine learning techniques on past search results. Machine learning

techniques help in discovering relationships by analyzing the search data that other

techniques may ignore. If any relationship exists, then it could be exploited to re-

duce search time or improve the quality of optima. For this task, some papers try

to understand the structure of the search space, while others try to tune algorithms

accordingly (cf. [4] for a survey of these algorithms).

Boyan used information about the complete trajectories to the local minima, and the
corresponding values of the local minima reached, to construct evaluation functions
[8], [9]. The minimum of the evaluation function determined a new starting point.

Optimal solutions were obtained for many combinatorial optimization problems like

bin packing, channel routing, etc.

Agakov et al. [3] gave a compiler optimization algorithm that trains on a set of

computer programs and predicts which parts of the optimization space are likely

to give large performance improvements for programs. Boese et al. [6] explored the

use of local minima to adapt the optimization algorithm. For graph bisection and the

134 Jayadeva, S. Shah, and S. Chandra

TSP, they found a “big valley” structure to the set of minima. Using this information

they were able to hand code a strategy to find good starting states for these problems.

Is this possible for other problems as well? The proposed work is motivated by
the question: for any general global optimization problem, is there a structure
to the set of local optima? If so, can it be learnt automatically through the use
of machine learning?

We propose a new algorithm for the general global optimization problem, termed
Global Optimization using Support Vector Regression based Adaptive Multistart

(GOSAM). GOSAM attempts to learn the structure of local minima based on local

minima discovered during earlier local searches. GOSAM uses Support Vector Ma-

chine based learning to learn a Fit function (regressor) that passes through all the

local minima, thereby learning the structure of the locations of local minima. Since

the regressor can only learn the structure of the local minima encountered till the

present iteration, the idea is that the regressor for local minima will generalize well

to the local minima not obtained so far in the search. Consequently, its minimum

would be a ‘crude approximation’ to the global minimum of the objective function.

In the next iteration the search for the local minimum is started from the minimum of

this regressor. The new local minimum obtained is added as a new training point and

the Fit function is re-constructed. Over time, this approximation gets better, leading

the search towards regions that yield better local minima and eventually the global

minimum. Surprisingly, for most problems this algorithm tends to direct search to

the region containing the global minimum in just a few iterations and is significantly

faster than other methods. The results reinforce our belief that many problems have

some ‘structure’ to the location of local minima, which can be exploited in direct-

ing further search. It is important to emphasize that GOSAM’s approach is differ-

ent from approximating a fitness landscape; GOSAM attempts to predict how local

minima are distributed, and where the best one might lie. This turns out to be very

efficient in practice.

In this chapter, we wish to demonstrate the same by testing on many benchmark

global optimization problems against established evolutionary methods. The rest of

the chapter is organized as follows. Section 6.2 discusses the proposed algorithm.

Section 6.3 is devoted to GOSAM’s performance on benchmark optimization prob-

lems, as well as a comparison with GA and PSO. Section 6.4 extends the algorithm

for constrained optimization problems. Section 6.5 demonstrates how the algorithm

may be applied to design optimization problems. Section 6.6 is devoted to a general

discussion on the convergence of GOSAM to the global optimum, while Section 6.7

contains concluding remarks.

6.2 Global Optimization Using SVR Based Adaptive Multistart (GOSAM)

The motivation of the proposed algorithm is to use the information about the local

minima encountered in earlier steps, to predict the location of other better minima.

We denote the objective function to be minimized by f (x), where x (xi , ∀i = 1, . . . , n)

is a n dimensional vector of variables. We assume that the lower and upper bounds

of each of these variables is known. For an unconstrained optimization problem, the


feasible region is the complete search space that lies within the lower and upper

bounds of all variables.

We now summarize the flow of the GOSAM algorithm. At each iteration, the

algorithm performs a local search¹, starting from a location termed the start-state.
At the end of each iteration, the algorithm determines the start-state for the next iteration.

1. Initialize start-state to an initial guess, possibly chosen randomly, lying in the

search space.

2. Starting from start-state, obtain a locally optimal solution using a local search

procedure. Term this solution as current-local-optimum.

3. Store current-local-optimum x∗ = (x∗_1, . . . , x∗_n) and the corresponding function
value f(x∗) in the training set.

4. Apply Support Vector Regression treating all the local optima collected so far, as

the independent variables and their corresponding function values as the target

values. The regressor obtained will be called the current Fit function.

5. Obtain the minimum of the current Fit function using a local search procedure.

6. Set start-state to the minimum obtained in Step 5. If the minimum is out of bounds
or is the same as that obtained in the previous r runs, set the start-state to a random
one.

7. If the termination criteria have been met, proceed to Step 8. Otherwise, go to

Step 2.

8. Return the best local minimum obtained.
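The loop above can be sketched in a few lines of Python (an illustration only, not the authors' MATLAB/LINDO implementation: scipy's Nelder-Mead stands in for the local search, and an ordinary least-squares quadratic fit stands in for the SVR regressor; all function and variable names here are ours):

```python
import numpy as np
from scipy.optimize import minimize

def quad_features(x):
    # features of a quadratic model: 1, x_i, and all products x_i * x_j (i <= j)
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate(([1.0], x, cross))

def gosam(f, lo, hi, iters=20, r=3, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    minima, values, recent = [], [], []
    start = rng.uniform(lo, hi)                        # Step 1: random start-state
    for _ in range(iters):
        res = minimize(f, start, method="Nelder-Mead")  # Step 2: local search
        x_loc = res.x
        minima.append(x_loc)                            # Step 3: grow the training set
        values.append(f(x_loc))
        # Step 4: build the current Fit function (least-squares quadratic stand-in for SVR)
        Phi = np.array([quad_features(m) for m in minima])
        w, *_ = np.linalg.lstsq(Phi, np.array(values), rcond=None)
        fit = lambda x: quad_features(x) @ w
        # Step 5: minimise the Fit function (multistart, since it may be non-convex)
        cand = min((minimize(fit, rng.uniform(lo, hi), method="Nelder-Mead")
                    for _ in range(5)), key=lambda c: c.fun)
        start = cand.x
        # Step 6: random restart if out of bounds or repeating recent start-states
        if (np.any(start < lo) or np.any(start > hi)
                or any(np.allclose(start, s, atol=1e-3) for s in recent)):
            start = rng.uniform(lo, hi)
        recent = (recent + [start])[-r:]
    best = int(np.argmin(values))                       # Step 8: best local minimum
    return minima[best], values[best]
```

On a smooth convex test function such as the sphere, f(x) = Σ x_i², the very first local search already reaches the global minimum; on multimodal functions the Fit function gradually steers the restarts toward better basins.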

Initially, we generate a feasible random state and assign it to start-state. In step 2,

we perform local search on the objective function starting from the start-state. The

search terminates at a local minimum. We store the local minimum and the corre-

sponding value of the objective function in an array. This constitutes our training

data. We treat local minima as data points and their corresponding function values

as target values. In Step 4, we use Support Vector Regression (SVR) to perform
regression on the training data comprising the local minima obtained till the current
iteration. In general, fitting the local minima with a linear SVR regressor would

incur a large error. Nonlinear regression can be achieved with a wide choice of

kernel functions, and choice of the kernel will also impact the number of function

evaluations required to reach the global optimum. In all our experiments, we chose

the most commonly used 2nd degree polynomial kernel, also termed as a quadratic

kernel, primarily to simplify computation. It also facilitates Step 5 of the GOSAM

algorithm, which requires minimization of the SVR regressor. For a polynomial ker-

nel of degree 2, the problem to be minimized is a quadratic one, which can be solved

efficiently.
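To see why, write the SVR regressor in terms of its support vectors x_k and dual coefficients α_k (α_k standing for the usual difference of the two dual variables) and expand the quadratic kernel K(x, z) = (xᵀz + 1)²:

```latex
\hat{f}(x) = \sum_{k} \alpha_k \,\bigl(x^{\top} x_k + 1\bigr)^{2} + b
           = x^{\top}\Bigl(\sum_{k} \alpha_k\, x_k x_k^{\top}\Bigr) x
             + 2\Bigl(\sum_{k} \alpha_k\, x_k\Bigr)^{\top} x
             + \sum_{k} \alpha_k + b ,
```

so Step 5 amounts to minimizing a quadratic form in x whose coefficients are assembled from the support vectors alone.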

We found that GOSAM is not handicapped by the choice of kernel, and that a

quadratic kernel worked well over a wide range of problems. All the local minima

1 To the best of our knowledge, there is no restriction on the choice of local search procedure

used.


obtained till the current iteration are treated as training samples, and their corre-

sponding function values as the target values. The regressor obtained in Step 4,

termed as the current Fit function, approximates local minima of the objective func-

tion. In the limit that all local minima are known, SVR will construct a regressor

that passes through all local minima of the objective function. The global minimum

of this function would then correspond to the global minimum of the original objec-

tive function. If we knew all the local minima, then regression is not required and

one can easily determine the best local minimum. We utilize only the information

of the few local minima obtained through local search till the current iteration. We

then rely on the excellent generalization properties of SVRs to predict how the local

minima are distributed. Search is redirected to the region containing the minimum

of the regressor or the Fit function. Because of the limited size of the training set,

this regressor will not be an exact approximation of the local minima of the objective

function. However, over successive iterations, the Fit function tends to better local-

ize the global minimum of the function being optimized. This is demonstrated by

the experiments presented in section 6.3, that show that the ‘predictor’ turns out to

be so good that search terminates in the global minimum within very few iterations.

Apart from the generalization ability of SVRs, which is imperative in predicting

better starting points and finding the global optimum quickly, the choice of using

SVR for function approximation is also motivated by the fact that the regressors ob-

tained using SVR are generally very simple and can be constructed by using only a

few support vectors. Since minimization of the Fit function requires evaluating it at

several points, the use of only the support vectors contributes to computational ef-

ficiency. Regardless of the complexity of the kernel used, the optimization problem

that needs to be solved remains a convex quadratic one, because only a kernel matrix

that contains an inner product between the points is required. The meagre amounts

of data to be fit, i.e. the small number of local minima and their corresponding func-

tion values, also contribute to making the process fast and efficient.

In Step 5, we minimize the Fit function obtained and reset the start-state to its minimum.

If the local minimum obtained from this start-state is the same as the one obtained

in the previous r iterations, or out of bounds, then we conclude that the search has

become too localized, and needs to explore other regions to discover new minima.

In such a case, we reset the start-state to a random state.

GOSAM was encoded in MATLAB and run on an 800 MHz Pentium III PC with

256 MB RAM. For all our test cases, we used the local minimizer of LINDO API

[23]. We tested GOSAM on a number of difficult global optimization benchmark

problems, which are available online at [11, 24].² Most of the benchmark problems
available in [11, 24] are parameterized in the number of variables, and can thus
be scaled to any desired dimension. We begin with one and two variable examples
that possess several local minima. These examples will help visualize GOSAM's
working. We then discuss results for higher dimensional benchmarks.

2 The websites also include visualizations of the two dimensional examples, a discussion of
why each of these problems is difficult, and a mention of the estimated number of local
minima of the benchmarks.

Figures 6.1 through 6.4 show the objective function f (x) = (|x|−10)cos(2π x) (taken

from [8]) as a dotted wave. The bounds for this problem were taken to be x = -10.0

to x = 10.0. As seen in Figs. 6.1 - 6.4, the objective function has several local min-

ima, but the global minimum lies at x = 0. We now show how GOSAM finds the

global minimum for this example.
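The first local search of this example can be reproduced in a few lines (a sketch using scipy's Nelder-Mead in place of the LINDO local minimizer used in the chapter):

```python
import numpy as np
from scipy.optimize import minimize

# objective from the example: many local minima, global minimum at x = 0
f = lambda x: (abs(x[0]) - 10.0) * np.cos(2.0 * np.pi * x[0])

# iteration 1: local search from the random start-state x = -1.3444
res = minimize(f, x0=[-1.3444], method="Nelder-Mead")
x_loc, f_loc = res.x[0], res.fun
```

The search settles in the nearest basin, at x ≈ −1.0 with f ≈ −9.0 (the exact stationary point is displaced from −1 by a fraction of a percent because of the (|x| − 10) envelope), matching Fig. 6.1.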


Fig. 6.1 Iteration 1 for the global minimization of the objective function f (x) = (|x| −

10)cos(2π x). Local search started from a random start state given by x = -1.3444 (indicated

by the circle) and terminated at the local minimum x = -1.0 where f (x) = -9.0. Using only

this one local minimum in the training set, the regressor obtained till the end of iteration 1 is

shown by the solid line

The initial randomly chosen starting state is x = -1.3444. This is shown as the

circled point in Fig. 6.1. Local search from this point led to the local minimum at

x = -1.0, indicated by a square in Fig. 6.1. At this point, the objective function has a

value of f (−1.0) = -9.0. Using only one local minimum in the training set, the SVR

regressor that was obtained is shown by the solid line parallel to the x-axis. Since

this regressor has a constant function value, its minimum is the same everywhere;

therefore, any random point can be selected as the minimum. In our simulation,

the random point returned was x = -6.3. Local search from this point terminated at


the local minimum x = -6.0. The regressor obtained using these two points led to

a minimum at the boundary. In cases when the minimum is at a boundary, we find

that one can start the next local search from either the boundary point, or from

a random new starting point. The search for the global optimum was not ham-

pered by either choice. However, the results reported here are based on a random

restart in such cases. In this simulation, search was restarted from a random point at

x = 4.483.


Fig. 6.2 Iteration 3 for the global minimization of the objective function f (x) = (|x| −

10)cos(2π x). Local search started from a random start state given by x = 4.483 (indicated

by the circle) and terminated at the local minimum at x = 4.0. The regressor obtained using

the three points, depicted by squares, is shown as the solid concave curve. The minimum of

this curve lies at x = -0.8422

In the third iteration, shown in Fig. 6.2, local search is started from x = 4.483, de-

picted by a circle. The local minimum was found to be at x = 4.0, and is depicted by

a square in the figure. When the information of these three local minima was used,

the SVR regressor shown as the solid concave curve was obtained. The minimum of

this curve lies at x = -0.8422.
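The pull of the fitted curve can be checked numerically. Fitting an exact interpolating quadratic through the three minima (−1, −9), (−6, −4) and (4, −6) (a simplification of the ε-insensitive SVR fit, which is why its minimum differs somewhat from the x = −0.8422 reported above):

```python
import numpy as np

# the three local minima found so far and their objective values
xs = np.array([-1.0, -6.0, 4.0])
ys = np.array([-9.0, -4.0, -6.0])

# fit y = a x^2 + b x + c exactly through the three points
a, b, c = np.polyfit(xs, ys, deg=2)
x_next = -b / (2.0 * a)   # minimum of the quadratic (a > 0, so it is convex)
```

The fitted parabola opens upward (a = 0.16) and its minimum lies at x = −0.375, already inside the central region that contains the global minimum at x = 0.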

The start state for iteration 4 was given by the minimum of the regressor obtained

in the previous iteration, given by x = -0.8422. This point is depicted as a circle in

Fig. 6.3. The local minimum obtained from this starting state is again depicted as the

square at the end of the slope. The regressor obtained using these four local minima

is shown as a bowl shaped curve, the minimum of which is located at x = −0.1130.

In the next iteration, depicted in Fig 6.4, local search from x = −0.1130, depicted

by a circle, led us to the global minimum at x = 0.0, depicted by a square in the

figure.



Fig. 6.3 Iteration 4 for the global minimization of the objective function f (x) = (|x| −

10)cos(2π x). Local search started from the minimum of the regressor obtained in the

previous iteration, given by x = -0.8422 (indicated by the circle). It terminated at the local

minimum at x = -1.0. The regressor obtained using the four local minima obtained till the

current iteration, depicted by squares, is shown as the solid convex shaped curve. The mini-

mum of this curve lies at x = −0.1130


Fig. 6.4 Iteration 5 for the global minimization of the objective function f (x) = (|x| −

10)cos(2π x). Local search started from the start state given by the minimum of the regressor

obtained at the end of the previous iteration, given by x = -0.1130 (indicated by the circle)

and terminated at the local minimum at x = 0.0 where f (x) = -10.0. The regressor obtained

using all the local minima obtained, depicted by squares, is shown as the solid convex curve


Ackley’s function is a multimodal benchmark problem that is widely used for testing
global optimization algorithms. The n-dimensional Ackley’s function is given by

f(x) = −a exp( −b √( (1/n) ∑_{i=1}^{n} x_i² ) ) − exp( (1/n) ∑_{i=1}^{n} cos(c x_i) ) + a + e,

with the global minimum at x_i = 0, i = 1, 2, . . . , n, where the function value is
f(0) = 0. For the purpose of illustration, we consider the two dimensional Ackley’s
function.

As seen in Fig. 6.5, Ackley’s function has a large number of local minima that

hinder the search for the global minimum.

Fig. 6.5 Ackley’s function. A huge number of local minima are seen that obstruct the search

for the global minimum at (0, 0)

Figures 6.6 through 6.8 show the plots of the regressor function obtained after

iterations 2, 3, and 4 respectively. Note that though both figures 6.7 and 6.8 look

similar, there is a difference in the locations of their minima. The minimum of the

bowl shaped Fit function of Fig. 6.8, when used as the start state for next local

minimization procedure, led to the global minimum of Ackley’s function.


Fig. 6.6 Regressor obtained after iteration 2, while optimizing Ackley’s function

Fig. 6.7 Regressor obtained after iteration 3, while optimizing Ackley’s function

Comparison with PSO and GA on Benchmark Problems

Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) are evolutionary

techniques that also use information revealed during search to generate new search

points. We compare our algorithm with both these approaches on several global op-

timization benchmark problems, ranging in dimension (number of variables n) from


Fig. 6.8 Regressor obtained after iteration 4, while optimizing Ackley’s function. Local

search starting from the minimum of this Fit function led to the global minimum

2 to 100. The Particle Swarm optimization toolbox was obtained from [30], while

the Genetic Algorithm optimization toolbox (GAOT) is the one available at [16].

The next start-state in PSO and GA is obtained by simple mathematical or logical

operations, whereas for GOSAM it is generated after determining the SVR followed

by minimization of a quadratic problem. Therefore, an iteration of GOSAM takes

more time than an iteration of either of these algorithms. Moreover, GA and PSO

run a number of agents in parallel, whereas the current implementation of GOSAM

is a sequential one. However, the difference in the number of function evaluations

required is so dramatic that GOSAM always found the global minimum significantly

faster.

In all our experiments, we evaluated the three algorithms on three different per-

formance criteria. The first criterion is the number of function evaluations required

to reach the global optimum. The second criterion is the number of times the global

optimum is reached in 20 runs, each from a randomly chosen starting point. The

third measure is the average CPU time taken to reach the global optimum.

Table 6.1 presents the results obtained. Each value indicated in the table is the av-

erage over 20 runs of the corresponding algorithm. For each run, the initial start

state of all the algorithms was the same randomly chosen point. The reported re-

sults have been obtained on a PC with a 1.6 GHz processor and 512 MB RAM. The

first row in the evaluation parameter for each benchmark function (Fn. Evals.) gives

the average number of function evaluations required by each algorithm to find the

global optimum. The number of times that the global optimum was obtained out of

the 20 runs is given in the second row (GO. Obtd.). If the global minimum was not

obtained in all runs, then the average value and the standard deviation of the best


optima obtained over all the runs has been mentioned within parentheses. The third

row (T (s)) indicates the average time taken in seconds by each algorithm in a run.

Though any number of local minima may be used for building a predictor, we

used a maximum of 100 local minima. The 101st local minimum overwrote the 1st
local minimum obtained, and so on.

within a box [−10, 10]n where n is the dimension. In all our experiments, we used

the conventional SVR framework [14]. The use of techniques such as SMO [28],

or online SVMs [7] can be used to speed up the training process further. Our focus

in this work is to show the use of machine learning techniques to help predict the

location of better local minima.

The parameters for GA and PSO (c1 = 2, c2 = 2, c3 = 1, chi = 1, and swarm

size = 20) were kept the same as the default ones. For GOSAM, the SVR parameters

were taken to be ε = 10^−3, C = 1000, and the kernel was the degree two polynomial
kernel with t = 1.

Table 6.1 shows that GOSAM consistently outperforms both PSO and GA by a

large margin. This difference gets dramatically highlighted in higher dimensions.

Finding the global minimum becomes increasingly difficult as the dimension n in-

creases; PSO and GA fail to find the global optimum in many cases, despite a large

number of function evaluations. However, GOSAM always found the global mini-

mum after a relatively small number of function evaluations (the count for function

evaluation for GOSAM also includes the number of times the objective function

was evaluated during local search). We believe that this result is significant, because

it shows that GOSAM scales very effectively to large dimensional problems. The

experimental results strikingly demonstrate that GOSAM not only finds the global

optimum consistently, but also does so with a significantly fewer number of function

evaluations.

6.4 Extension to Constrained Optimization Problems

Constrained optimization problems are usually solved by solving related unconstrained
problems, which are obtained through the use of penalty or barrier functions.
We take recourse to Sequential Unconstrained Minimization Techniques

(SUMTs), which we briefly review.

SUMTs comprise a class of non-linear programming methods that solve a

sequence of unconstrained optimization tasks. Given a problem of the form

min a(x) subject to g_i(x) ≤ 0, i = 1, . . . , M,


Table 6.1 Comparison of GOSAM with PSO and GA on Difficult Benchmark Problems

Function (n)      Criterion    GOSAM      PSO                              GA

Ackley (n = 2)    Fn. Evals.   122.75     12580.0                          2202.75
                  GO Obtd.     20         20                               20
                  T(s)         0.02970    0.535524                         0.677470

                  GO Obtd.     20         20                               20
                  T(s)         0.0328     0.913196                         0.662956

                  GO Obtd.     20         20                               19‡ (0.0057 ± 2.5e-5)
                  T(s)         0.02265    3.758419                         0.357913

Hyper Ellipsoid   GO Obtd.     20         20                               20
                  T(s)         0.0078     0.972250                         0.236874

Valley            GO Obtd.     20         20                               20
                  T(s)         0.0148     2.046909                         3.19688

                  GO Obtd.     20         1†                               20
                  T(s)         0.01325    5.338205                         0.268143

                  GO Obtd.     20         20                               20
                  T(s)         0.01795    2.371105                         0.206017

Camelback         GO Obtd.     20         20                               20
                  T(s)         0.01715    1.213340                         0.199104

                  GO Obtd.     20         20                               20
                  T(s)         0.03516    6.636169                         3.973002

                  GO Obtd.     20         0‡ (3.035 ± 2.054)               12‡ (0.4478 ± 0.4587)
                  T(s)         0.0398     12.854138                        3.435049

Hyper Ellipsoid   GO Obtd.     20         20                               20
                  T(s)         0.01255    4.995237                         1.448820

Valley            GO Obtd.     20         3‡ (1.9104 ± 1.2932)             3‡ (2.9247 ± 3.0229)
                  T(s)         0.04460    12.467250                        7.737082

                  GO Obtd.     20         0‡ (8.0094 ± 5.7940)             0‡ (1.6847 ± 0.2302)
                  T(s)         8.707      24.979903                        14.027134

                  GO Obtd.     20         0 (486.3559 ± 737.8523)          0 (60.2946 ± 8.7735)
                  T(s)         7.615      128.840828                       23.566207

Hyper Ellipsoid   GO Obtd.     20         0‡ (0.3350 ± 1.0915)             0‡ (2.2683 ± 0.9803)
                  T(s)         0.0823     46.069183                        29.993730

Valley            GO Obtd.     20         0‡ (14324550.17 ± 1535813.439)   0‡ (357.198 ± 165.5572)
                  T(s)         0.89920    17.913865                        36.662314

‡ The global optimum was not obtained in all the 20 runs. The value in the corresponding
parentheses indicates the mean and the standard deviation of the quality of global minima
obtained in the 20 runs.
† The global optimum obtained was not within the specified bounds.


where a(x) is the objective function, and gi (x), for i = 1, . . . M are the M constraints.

One kind of SUMT, the quadratic penalty function method, minimizes a sequence
(p = 1, 2, . . .) of functions of the form

F_p(x) = a(x) + ∑_{i=1}^{M} α_pi Max(0, g_i(x))²,   (6.3)

where α_pi is a scalar weight, and p is the problem number. The minimizer for the pth
problem in the sequence forms the guess or starting point for the (p + 1)th problem.
The scalars change from one problem to the next based on the rule that α_pi ≥
α_(p−1)i; they are typically increased geometrically, by say 10%. These weights

indicate the relative emphasis of the constraints and the objective function.
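The penalty-function sequence just described can be sketched as follows (our own minimal illustration; a single scalar α plays the role of the weights α_pi, and scipy's local minimizer stands in for GOSAM's inner solver):

```python
import numpy as np
from scipy.optimize import minimize

def sumt(a, constraints, x0, alpha0=1.0, growth=10.0, rounds=8):
    """Quadratic penalty method: minimise Eq. (6.3) for a growing penalty weight."""
    x, alpha = np.asarray(x0, dtype=float), alpha0
    for _ in range(rounds):
        # objective plus weighted squared constraint violations
        F = lambda z, w=alpha: a(z) + w * sum(max(0.0, g(z)) ** 2
                                              for g in constraints)
        x = minimize(F, x, method="Nelder-Mead").x  # warm start from previous minimiser
        alpha *= growth                             # raise the penalty weight
    return x

# toy problem: minimise a(x) = x^2 subject to g(x) = 1 - x <= 0; optimum at x = 1
x_star = sumt(lambda z: z[0] ** 2, [lambda z: 1.0 - z[0]], x0=[3.0])
```

For this toy problem each unconstrained minimizer is x = α/(1 + α), so the iterates approach the constrained optimum x = 1 from the infeasible side as α grows.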

In the limit, as the penalty weights become overwhelmingly large, the sequence of
minima of the unconstrained problems converges to a solution of the original
constrained optimization problem. We now illustrate the use of SUMT through the
application of GOSAM to the graph coloring problem.

Given a graph with a set of nodes or vertices, and an adjacency matrix D, the

Graph Coloring Problem (GCP) requires coloring each node or vertex so that no

two adjacent nodes have the same color. The adjacency matrix entry di j is a 1 if

nodes i and j are adjacent, and is 0 otherwise.

A minimal coloring requires finding a valid coloring that uses the least number of

colors. The GCP can be solved through an energy minimization approach. We used

an approach based on the Compact Analogue Neural Network (CANN) formulation

[17]. In this approach, an N-vertex GCP is solved by considering a network of N

neurons, whose outputs denote the node colors. The outputs are represented by a set

of real numbers Xi , i = 1, 2, . . . , N. The color is not assumed to be an integer as is

done conventionally.

The GCP is solved by minimizing a sequence (p = 1, 2, . . .) of functions of the form

E = (A/2) ∑_{i=1}^{N} ∑_{j=1}^{N} (1 − d_ij) V_m ln cosh β(X_i − X_j)
  + (B_p/2) ∑_{i=1}^{N} ∑_{j=1}^{N} d_ij V_m ln [ cosh β(X_i − X_j + δ) cosh β(X_i − X_j − δ) / cosh² β(X_i − X_j) ]   (6.4)

In keeping with the earlier literature on neural network approaches to the GCP, we

term E in (6.4) as an energy function.

The first term of equation (6.4) is present only for di j = 0, i.e. for non-adjacent

nodes. The term is minimized when Xi = X j . The term therefore minimizes the

number of distinct colors used. The second term is minimized if the values of Xi and

X j corresponding to adjacent nodes differ by at least δ . This term corresponds to the

adjacency constraint in the GCP, and becomes large as the problem sequence index

p increases. Nodes colored by colors that differ by less than δ correspond to nodes

with identical colors.
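The energy of Eq. (6.4) is straightforward to evaluate for a candidate coloring X; a sketch (our own illustration, taking V_m = 1 and illustrative values for the remaining constants):

```python
import numpy as np

def gcp_energy(X, D, A=1.0, Bp=1.0, beta=5.0, delta=1.0, Vm=1.0):
    """Energy of Eq. (6.4): the first term merges colors of non-adjacent nodes,
    the second pushes colors of adjacent nodes at least delta apart."""
    X = np.asarray(X, dtype=float)
    D = np.asarray(D, dtype=float)
    diff = X[:, None] - X[None, :]
    lc = np.log(np.cosh(beta * diff))
    term1 = (A / 2.0) * np.sum((1.0 - D) * Vm * lc)
    # log of the cosh ratio, expanded as a sum of log-cosh terms
    ratio = (np.log(np.cosh(beta * (diff + delta)))
             + np.log(np.cosh(beta * (diff - delta))) - 2.0 * lc)
    term2 = (Bp / 2.0) * np.sum(D * Vm * ratio)
    return term1 + term2

# two adjacent nodes: identical colors are penalised, well-separated ones are not
D = [[0, 1], [1, 0]]
```

For this pair, gcp_energy([0.0, 0.0], D) is large while gcp_energy([0.0, 1.5], D) is near zero, reflecting the adjacency constraint.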


We tested our approach on several GCP benchmark problems [1], which have a large
number of connections. Of these, the Myciel instances are graphs based on the
Mycielski transformation. These

graphs are difficult to solve because they are triangle free (clique number 2) but

the coloring number increases with the problem size. The “Huck” instance is a
graph in which each node represents a character in Mark Twain's “Huckleberry
Finn”; two nodes are connected by an edge if the corresponding characters encounter
each other in the book. In the “Games120” instance, the games played in a college
football season are represented by a graph where each node represents a college
team, and two teams are connected by an edge if they played each other during the season.

The energy functions for these problems are very complex and lead to extremely

hard global optimization problems. However, the constrained optimizer was able

to obtain the optimal coloring for each of these instances. Table 6.2 sums up the

results obtained. Note that the starting value of B, and the amount of increment in

B for successive iterations are both related to the time taken to reach the optimal

solution. One would like to start with a value of B, which quickly takes us into the

feasible region. This leads us to believe that a large value of B would do the trick.

However, if the value of B is taken to be too large then we might not be able to reach

the optimal solution. Thus there is no obvious answer to determine a good starting

value of B, instead it is based on educated guesses. A natural reasoning would be that

for dense adjacency matrices a large value of B should be chosen while a relatively

smaller value would suffice for sparse adjacency matrices. If we reach the feasible

region, then we could slowly and cautiously (making sure that we don’t exit the

feasible region) increase the value of A till we reach the optimal solution. We defer

a more detailed discussion of this aspect as it is beyond the scope of this chapter.

Table 6.2 Results on graph coloring benchmark instances

Instance Nodes Edges Optimal coloring Best Solution Obtained Iterations required

Myciel3 11 20 4 4 3

Myciel4 23 71 5 5 5

Huck 74 301 11 11 8

Games120 120 638 9 9 10

6.5 Application to Design Optimization Problems

Designers are usually confronted with the problem of finding optimal settings for a

large number of design parameters, with respect to several simulated product or pro-

cess characteristics. Problems of design and synthesis in the electronic domain are

generally constrained non-linear optimization problems. The principal characteris-

tics of these problems are very time consuming function evaluations and the absence

of derivative information. In most cases, evaluating the cost or objective function re-

quires a system simulation, and the function is rarely available in an analytical form.

In fact, the use of classical optimization techniques to give an optimal solution is


nearly impossible. For instance, VLSI design engineers carry out time-consuming

function evaluations by using circuit or other simulation tools, e.g. Spectre [2], and

choose a circuit with optimal component values. Since there are still many possi-

ble design parameter settings and computer simulations are time consuming, it is

crucial to find the best possible design with a minimum number of simulations. We

used GOSAM to solve several circuit optimization problems. The interface between

the optimizer and the circuit simulators is shown in Fig. 6.9. Preliminary details of

this work were reported in [18].

Fig. 6.9 Interface between the optimizer and the circuit simulator Spectre. The optimizer
sends updated design variables through the interface to the netlist; Spectre runs a simulation
and writes the function value to an output file, which the interface reads and returns to the
optimizer

We initially start with values for the design variables that are provided by a de-

signer, or choose them randomly. Since there are no analytical formulae to compute

the output for the input design parameters, the function values are calculated by using

a circuit simulator such as Spectre. The simulator writes the output value to a file,

which is read by the interface and returned to GOSAM. GOSAM then uses SVR

on the set of values obtained so far, to determine the Fit function. The SVR yields a

smooth and differentiable regressor. GOSAM then computes the minimum of the Fit

function, and sends it as the vector of new design parameters, to the interface. A key

feature of this approach is that we can apply it even to scenarios where the objective

function is not available in analytical form or is not differentiable. A major bonus

is that examination of the Fit function yields information about what constitutes a

good design. We now briefly discuss a few interesting circuit optimization examples.
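Before turning to those examples, the loop just described can be sketched in code. This is a toy illustration rather than the authors' implementation: the quadratic `simulate` merely stands in for a call to a circuit simulator such as Spectre, and the SVR settings are illustrative.

```python
# Toy sketch of the optimize-simulate-regress loop described above.
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVR

def simulate(x):
    # hypothetical stand-in for a circuit simulation (a smooth bowl with a ripple)
    return float(np.sum((x - 0.3) ** 2) + 0.05 * np.sin(5.0 * np.sum(x)))

rng = np.random.default_rng(0)
dim, lo, hi = 2, 0.0, 1.0
X = rng.uniform(lo, hi, size=(5, dim))            # designer-provided or random starts
y = np.array([simulate(x) for x in X])

for _ in range(10):
    fit = SVR(kernel="rbf", C=100.0).fit(X, y)    # smooth, differentiable "Fit" regressor
    res = minimize(lambda x: float(fit.predict(x.reshape(1, -1))[0]),
                   x0=X[np.argmin(y)], bounds=[(lo, hi)] * dim)
    x_new = np.clip(res.x, lo, hi)                # minimum of the Fit -> next design vector
    X = np.vstack([X, x_new])
    y = np.append(y, simulate(x_new))             # evaluate via the "simulator"

best_x, best_y = X[np.argmin(y)], y.min()
```

The same loop applies unchanged to any black-box evaluator, which is the point made above: only function values at requested points are ever needed.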

For a sample and hold circuit, the objective function was to hold the sampled value

as constant as possible during the hold period. The design variables are the widths

of 22 MOSFETs, along with the values of four capacitors named C1, C2, C3, and
C4. The transistor widths were constrained to lie between 250nm and 1200nm. Capacitor C3 was required to lie between 1fF and 5000fF, while all other capacitors

were constrained to lie between 1fF and 500fF. Simulations show that the sampled

6 Learning Global Optimization through SVM Based Adaptive Multistart 149

value is maintained well during the hold period. As of date, numerous complex

VLSI circuits have been designed using GOSAM interfaced with the circuit simu-

lator Spectre. The chosen circuits include Phase Locked Loops (PLLs), a variety of

operational amplifiers, and filters. In these examples, transistor sizes and other com-

ponent values have been selected to optimize specified objectives such as jitter, gain,

phase margin, and power, while meeting specified constraints on other performance

measures as well as on transistor sizes.

Fig. 6.10 Response of the optimized Sample-and-Hold Circuit, showing output voltage

versus time. The goal was to keep the output constant during the hold period

For a folded cascode amplifier, the design objective was to maximize the phase mar-

gin. The variables for the optimization task were taken to be the widths of 16 MOS-

FET transistors. The result, depicted in Figure 6.11, shows that GOSAM obtained a maximum phase margin of 169.73◦, as well as an excellent solution with a phase margin of around 120◦. An industry-level commercial tool found a solution with a phase margin of around 59◦.

6.6 Discussion

An important question relates to assumptions that may be implicitly or explicitly

made regarding the function to be optimized. We mentioned previously that any

local search mechanism could be used in conjunction with GOSAM. Figure 6.12

illustrates this with the help of a toy example. For the objective function shown

by the dashed curve in Fig. 6.12, the gradient cannot be computed to reach two of

the minima. A line search method is used in the outer triangular regions, while for

the parabolic region in the middle the gradient is available and a simple gradient

descent leads us to the local minimum. These three local minima, when used by

SVR to construct the regressor, yield the parabolic shaped solid curve of Fig. 6.12.

Local search starting from the minimum of this curve led to the global minimum.


Fig. 6.11 Phase margin versus iteration count for a folded cascode amplifier

Fig. 6.12 A toy example illustrating that any local minimizing procedure can be used with

GOSAM. The function is depicted as the dotted curve. For the outer triangular regions, the

gradient information cannot be used, so the local minima are found by a line search method.

However for the inner parabolic region, the local minimum can be found using gradient de-

scent. The regressor obtained is shown by the solid curve that passes through the local minima

obtained

In the worst case, GOSAM performs similarly to a random multistart. This is be-

cause whenever it is not possible to use the minimum of the Fit function (for exam-

ple when it is out of bounds or almost the same minimum is given by the previous

two iterations), GOSAM restarts the search from a random state. Therefore in the

worst case it will randomly explore the search space for new starting points. How-

ever, real applications never involve functions that are discontinuous everywhere,

and we have not encountered this worst case.
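The restart rule described above can be sketched as follows. This is an assumed reading of the text, not the authors' code: fall back to a random start when the Fit minimum leaves the bounds or repeats over the previous two iterations.

```python
# Assumed sketch of GOSAM's restart rule: random multistart as a fallback.
import numpy as np

def next_start(fit_min, history, lo, hi, tol=1e-6, rng=None):
    rng = rng or np.random.default_rng()
    out_of_bounds = np.any(fit_min < lo) or np.any(fit_min > hi)
    repeated = len(history) >= 2 and all(
        np.linalg.norm(fit_min - h) < tol for h in history[-2:])
    if out_of_bounds or repeated:
        # random exploration of the search space for a new starting point
        return rng.uniform(lo, hi, size=np.shape(fit_min))
    return fit_min
```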


Fig. 6.13 A toy example to illustrate that the regressor for the objective function f (x), de-

noted as Fit of f (x) is smoother than f (x). Recursively, the Fit for the Fit of f (x) is smoother

than the Fit of f (x), and in the limit leads to a convex underestimate of f (x)

Fig. 6.14 Set-up of the web-based optimization service: a client sends a request, the web server invokes an instance of GOSAM, the server requests function values and sends optimized points, and the client sends back function values at the requested points

In our experiments we used local search to accomplish this step. A question that comes

to mind is what might happen if the Fit function itself turns out to have multiple

local minima. Such a situation is certainly possible, and is theoretically interesting.

An alternative approach that we suggest is to use GOSAM recursively. This idea is

intuitive because the regressor function, called Fit function in Fig 6.13, is smoother

than the objective function, as it is a smooth interpolation of only the local minima

of the objective function encountered earlier. Therefore, a Fit of the Fit function’s


local minima would be even smoother. This is depicted pictorially in Fig. 6.13,

which uses a hypothetical example to illustrate what the application of GOSAM

to f (x) and recursively to Fit functions, might achieve. The original function f (x)

has a number of minima. As can be seen, the number of minima reduces at each

step and the sequence of recursively computed Fit functions become increasingly

smoother, and the sequence terminates at a convex function that is related to the

double conjugate of the original function. However, local minimization of the Fit

function seems to be more than adequate, as is done in the present implementation.

It is possible to construct functions where GOSAM’s strategy will fail. For ex-

ample, it would be impossible to learn any structure from a function with a uniform

distribution of randomly located minima, or a function that is discontinuous almost

everywhere. However, on most problems of any practical interest, small perturbations

from a local minimum will lead us to another locally minimal configuration. This im-

plies that a learning tool can be used to predict locations of other minima from the

knowledge of only a few.

In this paper, we presented GOSAM, a fast and effective multistart global minimiza-

tion algorithm for solving optimization problems. GOSAM applies support vector

regression on the training set formed by previously discovered local minima, to

guide search towards better local minima. This is different from approximating a fit-

ness landscape; GOSAM attempts to predict how local minima are distributed, and

where the best one might lie. A regressor that fits local minima is smoother than one

that tries to fit the original function. Approximating the fitness landscape requires

fitting all points and not just a few minima. The use of Support Vector Regression

allows only support vectors to be retained, and redundant information can be dis-

carded. Experimental results on large benchmarks show that GOSAM searches far

more efficiently, uses significantly fewer function evaluations, and finds the global

optimum more consistently than other state-of-the-art methods. The effectiveness of

GOSAM confirms that the generalizing ability of SVRs is very useful in predicting

where good local minima lie. We have also shown how GOSAM can be applied

to unconstrained tasks, as well as combinatorial optimization tasks such as graph

coloring, that are traditionally solved as integer programming problems. GOSAM

does not require the function to be known in terms of an analytical expression. It is

enough to have a black box that can evaluate the function at a chosen point. This

allows GOSAM to be interfaced to any such black box. We have presented results

in the VLSI domain, where GOSAM has been interfaced to a commercial circuit

simulator and used to optimize MOSFET sizes and component values to meet de-

sired objectives subject to specified constraints. The objectives are typically com-

plex, such as phase margin of a folded cascode amplifier, or jitter in a Phase Locked

Loop. A current version of GOSAM is equipped with a web interface that allows a

user to access it without revealing information about the function being optimized.

The set up of the web based service is shown in Fig. 6.14. As the figure illustrates,


only vectors and corresponding cost values are exchanged between the GOSAM

server and a client running a simulator or emulator. This allows GOSAM to be pro-

vided as a service across the web while protecting proprietary information about the

optimizer and the objective function.
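The exchange in Fig. 6.14 can be sketched as below. All names and the message format are invented for illustration; the point is that only design vectors and cost values cross the wire, so the client's objective function stays private.

```python
# Hypothetical sketch of the server-client exchange: vectors out, costs back.
import json

def client_evaluate(request_json):
    # runs on the client, next to the simulator; the body is proprietary
    points = json.loads(request_json)["points"]
    values = [sum((xi - 0.5) ** 2 for xi in p) for p in points]
    return json.dumps({"values": values})

# the GOSAM server only ever sees requested points and returned costs
server_request = json.dumps({"points": [[0.2, 0.8], [0.5, 0.5]]})
reply = json.loads(client_evaluate(server_request))["values"]
```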

Other aspects worthy of investigation include the use of different approaches to

SVR, such as online learning techniques, and parallelizing operations in GOSAM

to speed up search. Ongoing efforts include extending GOSAM to multi-objective

optimization tasks. GOSAM may be obtained from the authors for non-commercial

academic use on a trial basis.

Acknowledgements. The authors would like to thank Dr. R. Kothari of IBM India Re-

search Laboratory, Prof. R. Newcomb, University of Maryland, College Park, USA, and Prof.

S.C. Dutta Roy of the Department of Electrical Engineering, IIT Delhi, for their valuable

comments and a critical appraisal of the manuscript.

References

1. http://mat.gsia.cmu.edu/COLOR02/

2. http://www.cadence.com/products/custom_ic/spectre/index.aspx

3. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson,

J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimisation.

In: Proceedings of the 4th Annual International Symposium on Code Generation and

Optimization (CGO), New York, NY, USA, pp. 295–305 (2006)

4. Baluja, S., Barto, A., Boese, K., Boyan, J., Buntine, W., Carson, T., Caruana, R., Davies,

S., Dean, T., Dietterich, T., Hazlehurst, S., Impagliazzo, R., Jagota, A., Kim, K., Mc-

Govern, A., Moll, R., Moss, E., Perkins, T., Sanchis, L., Su, L., Wang, X., Wolpert, D.:

Statistical machine learning for large-scale optimization. Neural Computing Surveys 3,

1–58 (2000)

5. Black, F., Litterman, R.: Global portfolio optimization. Financial Analysts Journal 48(5),

28–43 (1992)

6. Boese, K., Kahng, A.B., Muddu, S.: A new adaptive multi-start technique for combina-

torial global optimizations. Operations Research Letters 16(2), 101–113 (1994)

7. Bordes, A., Bottou, L.: The huller: A simple and efficient online SVM. In: Gama, J.,

Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI),

vol. 3720, pp. 505–512. Springer, Heidelberg (2005),

http://leon.bottou.org/papers/bordes-bottou-2005

8. Boyan, J.: Learning evaluation functions for global optimization. Phd dissertation, CMU

(1998)

9. Boyan, J., Moore, A.: Learning evaluation functions for global optimization and

boolean satisfiability. In: Proceedings of the Fifteenth National Conference on Arti-

ficial Intelligence, vol. 15, pp. 3–10. John Wiley and Sons Ltd., Chichester (1998),

http://www.cs.cmu.edu/˜jab/cv/pubs/boyan.stage2.ps.gz

10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: Optimization by a colony of cooper-

ating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41

(1996),

http://iridia.ulb.ac.be/˜mdorigo/ACO/publications.html


11. http://www.geatbx.com/docu/fcnindex.html

12. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning.

Addison-Wesley Longman Publishing Co., Inc., Boston (1989)

13. Grossmann, I.E. (ed.): Global optimization in engineering design. Kluwer Academic

Publishers, Dordrecht (1996)

14. Gunn, S.: Support vector machines for classification and regression. Technical report,

Image Speech and Intelligent Systems Research Group, University of Southampton, UK

(1998), http://www.isis.ecs.soton.ac.uk/resources/svminfo/

15. Horst, R., Tuy, H.: Global optimization:deterministic approaches. Springer, Berlin

(1993)

16. Houck, C., Joines, J., Kay, M.: A genetic algorithm for function optimization: A matlab

implementation. NCSU-IE TR 95-09 (1995),

http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/

17. Jayadeva, Dutta Roy, S.C., Chaudhary, A.: Compact analogue neural network: A

new paradigm for neural based combinatorial optimisation. IEE Proc-Circuits Devices

Syst. 146(3) (1999)

18. Jayadeva, Shah, S., Chandra, S.: Learning to optimize VLSI design problems. In: INDICON, pp. 1–5. IEEE, New Delhi (2006)

19. Johnson, D., McGeoch, L.: The travelling salesman problem: A case study in local opti-

misation. In: Local Search in Combinatorial Optimisation, pp. 215–310. John Wiley and

Sons, London (1997)

20. Kazerounian, K., Wang, Z.: Global versus local optimization in redundancy resolution of

robotic manipulators. The International Journal of Robotics Research 7(5), 2–12 (1988)

21. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the IEEE

International Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942–1948

(1995)

22. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Sci-

ence 220(4598), 671–680 (1983)

23. LINDO SYSTEMS Inc.: LINDO API User’s Manual (2002)

24. Madsen, K.: Test problems for global optimization,

http://www2.imm.dtu.dk/˜km/Test_ex_forms/test_ex.html

25. Mangasarian, O.: Nonlinear Programming. SIAM, Philadelphia (1994)

26. Moles, C., Mendes, P., Banga, J.: Parameter estimation in biochemical pathways: a com-

parison of global optimization methods. Genome Research 13, 2467–2474 (2003)

27. Neumaier, A.: Global optimization,

http://www.mat.univie.ac.at/˜neum/glopt/applications.html

28. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimiza-

tion. In: Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT

Press, Cambridge (1999)

29. Russell, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice Hall, Engle-

wood Cliffs (1995)

30. Singh, J.: PSO algorithm toolbox (2003),

http://psotoolbox.sourceforge.net/

31. Wang, M., Yang, X., Sarrafzadeh, M.: Congestion minimization during placement. IEEE

Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(10),

1140–1148 (2000)

Chapter 7

Multi-objective Optimization Using Surrogates

retical interest, with limited real-life applicability due to the computational or ex-

perimental expense involved. Practical multiobjective optimization was considered

almost as an utopia even in academic studies due to the multiplication of this ex-

pense. This paper discusses the idea of using surrogate models for multiobjective

optimization. With recent advances in grid and parallel computing more companies

are buying inexpensive computing clusters that can work in parallel. This allows,

for example, efficient fusion of surrogates and finite element models into a multiob-

jective optimization cycle. The research presented here demonstrates this idea using

several response surface methods on a pre-selected set of test functions. We aim to

show that there are a number of techniques which can be used to tackle difficult prob-

lems and we also demonstrate that a careful choice of response surface methods is

important when carrying out surrogate assisted multiobjective search.

7.1 Introduction

In the world of real engineering design, there are often multiple targets which man-

ufacturers are trying to achieve. For instance in the aerospace industry, a general

problem is to minimize weight, cost and fuel consumption while keeping perfor-

mance and safety at a maximum. Each of these targets might be easy to achieve

individually. An airplane made of balsa wood would be very light and will have low

fuel consumption, however it will not be structurally strong enough to perform at

high speeds or carry useful payload. Also such an airplane might not be very safe,

Ivan Voutchkov

University of Southampton, Southampton SO17 1BJ, United Kingdom

e-mail: iiv@soton.ac.uk

Andy Keane

University of Southampton, Southampton SO17 1BJ, United Kingdom

e-mail: ajk@soton.ac.uk

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 155–175.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010

156 I. Voutchkov and A. Keane

i.e., robust to various weather and operational conditions. On the other hand, a solid

body and a very powerful engine will make the aircraft structurally sound and able

to fly at high speeds, but its cost and fuel consumption will increase enormously. So

engineers are continuously making trade-offs and producing designs that will sat-

isfy as many requirements as possible, while industrial, commercial and ecological

standards are at the same time getting ever tighter.

Multiobjective optimization (MO) is a tool that aids engineers in choosing the

best design in a world where many targets need to be satisfied. Unlike conventional

optimization, MO will not produce a single solution, but rather a set of solutions,

most commonly referred to as the Pareto front (PF) [12]. By definition it will contain
only non-dominated solutions1. It is up to the engineer to select a final design by

examining this front.

Over the past few decades with the rapid growth of computational power, the fo-

cus in optimization algorithms in general has shifted from local approaches that find

the optimal value with the minimal number of function evaluations to more global

strategies which are not necessarily as efficient as local searches but (some more

than the others) promise to converge to global solutions, the main players being

various strands of genetic and evolutionary algorithms. At the same time, comput-

ing power has essentially stopped growing in terms of flops per CPU core. Instead

parallel processing is an integral part of any modern computer system. Computing

clusters are ever more accessible through various techniques and interfaces such as

multi-threading, multi-core, Windows HPC, Condor, Globus, etc.

Parallel processing means that several function evaluations can be obtained at

the same time, which perfectly suits the ideology behind genetic and evolutionary

algorithms. For example Genetic algorithms are based on the idea borrowed from

biological reproduction, where the offspring of two parents copy the best genes of

their parents but also introduce some mutation to allow diversity. The entire gener-

ation of offspring produced by parents in a generation represent designs that can be

evaluated in parallel. The fittest individuals survive and are copied into the next gen-

eration, whilst weak designs are given some random chance with low probability to

survive. Such parallel search methods are conveniently applicable to multiobjective

optimization problems, where the fitness of an individual is measured by how close

to the Pareto front this design is. All individuals are ranked: those that are part of

the Pareto front get the lowest (best) rank, the next best have higher rank and so on.

Thus the multiobjective optimization is reduced to single objective minimization of

the rank of the individuals. This idea was developed by Deb and implemented

in NSGA2 [5].
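The ranking reduction described above can be sketched as follows. This is a simplified peel-off version for illustration, not NSGA2's fast non-dominated sorting procedure.

```python
# Pareto (non-domination) ranking sketch: rank 0 is the current Pareto front,
# and successive fronts are peeled off one at a time (O(n^2) per front).
import numpy as np

def pareto_ranks(F):
    F = np.asarray(F, dtype=float)      # (n, m) objective values, minimization
    ranks = np.full(len(F), -1)
    remaining = np.arange(len(F))
    rank = 0
    while remaining.size:
        sub = F[remaining]
        # a point is non-dominated if no other point is <= in all objectives
        # and strictly < in at least one
        nondom = [i for i in range(len(sub))
                  if not np.any(np.all(sub <= sub[i], axis=1) &
                                np.any(sub < sub[i], axis=1))]
        ranks[remaining[nondom]] = rank
        remaining = np.delete(remaining, nondom)
        rank += 1
    return ranks

F = [[1, 4], [2, 2], [4, 1], [3, 3], [5, 5]]
```

Minimizing these ranks (with a tie-breaker for diversity) is what turns the multiobjective problem into a single-objective one.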

In the context of this paper, the aim of MO is to produce a well spread out set

of optimal designs, with as few function evaluations as possible. There are a number

of methods published and widely used to do this – MOGA, SPEA, PAES, VEGA,

NSGA2, etc. Some are better than others - generally the most popular in the litera-

ture are NSGA2 (Deb) and SPEA2 (Zitzler), because they are found to achieve good

results for most problems [2, 3, 4, 5, 6]. The first is based on genetic algorithms and

1 Non-dominated designs are those where to improve performance in any particular goal

performance in at least one other goal must be made worse.


the second on an evolutionary algorithm, both of which are known to need many

function evaluations. In real engineering problems the cost of evaluating a design

is probably the biggest obstacle that prevents extensive use of optimization proce-

dures. In the multiobjective world, this cost is multiplied, because there are multi-

ple expensive results to obtain. Evaluating directly a finite element model can take

several days, which makes it very expensive to try hundreds or thousands of designs.

It seems that increased computing power leads to increased hunger for even more

computing power, as engineers realise that they can run more detailed and realis-

tic models. In essence, from an engineering point of view, the available computing

power is never enough and this tendency does not seem to be changing at least in

the foreseeable future. To put these words into perspective, to be useful to an engi-

neering company, a modern optimization approach should be able to tackle a global

multiobjective optimization problem in about a week. The problem would typically

have 20-30 variables, 2-5 objectives, 2-5 constraints with evaluation times of about

12-48h per design and often per objective. Unless you have access to 5000-7000

parallel CPUs, the only way to currently tackle such problems is to use surrogate

models.

In the single objective world, approaches using surrogate models are fairly well

established and have proven to successfully deal with the problem of computational


expense (see Fig. 7.1) [22]. Since their introduction, more and more companies

have adopted surrogate assisted optimization techniques and some are making steps

to incorporate this approach in their design cycle as standard. The reason for this is

that instead of using the expensive computational models during the optimization

step, they are substituted with a much cheaper but still accurate replica. This makes

optimization not only useful, but usable and affordable. The key idea that makes

surrogate models efficient is that they should become more accurate in the region of

interest as the search progresses, rather than being equally accurate over the entire

design space, as an FE representation will tend to be. This is achieved by adding to

the surrogate knowledge base only at points of interest. The procedure is referred to

as surrogate update.
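A minimal single-objective illustration of the update idea just described, where the surrogate is refined only at points of interest. The names and the toy `expensive` function are invented; a kriging-style Gaussian process stands in for the surrogate.

```python
# Surrogate-update sketch: re-train the model and add only points of interest
# (here, the current surrogate minimum), so accuracy grows where the search goes.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive(x):                       # stand-in for a finite element evaluation
    return float((x - 0.7) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 8).reshape(-1, 1)
y = np.array([expensive(v[0]) for v in X])

grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)
for _ in range(5):
    surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                                         alpha=1e-8, optimizer=None).fit(X, y)
    x_new = grid[np.argmin(surrogate.predict(grid))]   # point of interest
    X = np.vstack([X, x_new])                          # the "surrogate update"
    y = np.append(y, expensive(x_new[0]))
```

In the multiobjective setting the update points would instead be drawn from the surrogate's Pareto front, as discussed below.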

Various publications address the idea of surrogate models and multiobjective op-

timisation [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As one would expect, no approxi-

mation method is universal. Factors such as function modality, number of variables,

number of objectives, constraints, computation time, etc., all have to be taken into

account when choosing an approximation method. The work presented here aims to

demonstrate this diversity and hints at some possible strategies to make best use of

surrogates for multi-objective problems.

To illustrate the basic idea, the zdt1 – zdt6 test function suite [3] will be used to be-

gin with. It is a good suite to demonstrate the effectiveness of surrogate models, as it

is fairly simple for response surface (surrogate) modelling. Fig. 7.2 shows the zdt2 function and the optimisation results; the comparison strikingly demonstrates the surrogate approach. The problem has two objective functions and two

design variables. The Pareto front obtained using surrogates with 40 function evalu-

ations is far superior to the one without surrogates and the same number of function

evaluations.

Number of variables                                    2      5      10
Number of function evaluations without surrogates   2500   5000   10000
Number of function evaluations with surrogates        40     40      60

On the other hand, 2500 evaluations without surrogates were required to obtain a similar quality of Pareto front to the case with surrogates and 40 evaluations. The difference is even more significant if more variables are added – see Table 7.1.

Here we have chosen objective functions with simple shapes to demonstrate the

effectiveness of using surrogates. Both functions would be readily approximated

using most available methods. It is not uncommon to have relationships of similar simplicity in real problems, although external noise factors could make them a challenge for most methods, as will be demonstrated later.

Fig. 7.2 A (left) – Function ZDT2; B (right) – ZDT2 – Pareto front achieved in 40 evaluations: Diamonds – Pareto front with surrogates; Circles – solution without surrogates
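For reference, zdt2 has the standard form given by Zitzler et al. [3]:

```python
# Standard definition of the ZDT2 test function (minimize both objectives).
import numpy as np

def zdt2(x):
    x = np.asarray(x, dtype=float)                 # each x_i in [0, 1]
    f1 = x[0]
    g = 1.0 + 9.0 * np.sum(x[1:]) / (len(x) - 1)
    f2 = g * (1.0 - (f1 / g) ** 2)
    return f1, f2

# Pareto-optimal designs have x_2 = ... = x_n = 0, so g = 1 and f2 = 1 - f1**2
```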

Depending on the search algorithm, the quality of the Pareto front could vary greatly.

There are various characteristics that describe a good quality Pareto front:

1. Spacing – better search techniques will space the points on the Pareto front

uniformly rather than producing clusters. See Fig. 7.3a

2. Richness – better search techniques will put more points on the Pareto front than

others. See Fig. 7.3b

3. Diversity – better search techniques will produce fronts that are spread out better

with respect to all objectives. See Fig. 7.3c

4. Optimality – better search techniques will produce fronts that dominate the fronts

produced by less good techniques. In test problems this is usually measured

as ‘generational distance’ to an ideal Pareto front. We discuss this later. See

Fig. 7.3d

5. Globality – the obtained Pareto front is global, as opposed to local. Similar to

single objective optimization, in the multiobjective world, it is also possible to

have local and global optimal solutions. This concept is demonstrated using the

F5 test function (a full description is given in Sections 7.4 and 7.5). Fig. 7.4

illustrates the function and the optimization procedure. Due to the sharp nature

of the global solution it cannot be guaranteed that with a small number of GA

evaluations, the correct solution will be found. Furthermore, since the surrogate is

based only on sampled data, if this data does not contain any points in the global

optimum area, then the surrogate will never know about its existence. Therefore


any optimization based only on such surrogates will lead us to the local solution.

Therefore conventional optimization approaches based on surrogate models rely

on constant updating of the surrogate. A widely accepted technique in single

objective optimization is to update the surrogate with its current optimal solution.

In multiobjective terms this will translate to updating the surrogate with one or

more points belonging to its Pareto front. If the surrogate Pareto front is local

and not global, then the next update will also be around the local Pareto front.

Continuing with this procedure the surrogate model will become more and more

accurate in the area of the local optimal solution, but will never know about the

existence of the global solution.

6. Robust convergence from any start design with any random number sequence. It

turns out that the success of a conventional multiobjective optimization based on

surrogates, using updates at previously found optimal locations strongly depends

on the initial data used to train the first surrogate before any updates are added.

If this data happens to contain points around the global Pareto front, then the

algorithm will be able to quickly converge and find a nice global Pareto front.

However the odds are that the local Pareto fronts are smoother and easier to find

shapes and in most cases this is where the procedure will converge unless suitable

global exploration steps are taken.

7. Efficiency and convergence – better search techniques will converge using less

function evaluations.

Fig. 7.3 Pareto front potential problems - (a) clustering; (b) too few points; (c) lack of diver-

sity; (d) non-optimality
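Criterion 1 above is often quantified with a spacing metric. The sketch below is a variant of Schott's spacing measure, assumed here for illustration rather than taken from this chapter: it measures the spread of nearest-neighbour distances on the front, so lower values mean more uniform spacing.

```python
# Spacing metric sketch: standard deviation of nearest-neighbour L1 distances.
import numpy as np

def spacing(front):
    F = np.asarray(front, dtype=float)
    D = np.abs(F[:, None, :] - F[None, :, :]).sum(axis=2)   # pairwise L1 distances
    np.fill_diagonal(D, np.inf)                             # ignore self-distances
    return float(D.min(axis=1).std())                       # spread of nearest distances

uniform   = [[0.0, 3.0], [1.0, 2.0], [2.0, 1.0], [3.0, 0.0]]
clustered = [[0.0, 3.0], [0.1, 2.9], [0.2, 2.8], [3.0, 0.0]]
# the uniformly spaced front scores lower (better) than the clustered one
```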


Test Functions

In a previous publication [20] we have shown that for complex and high-dimensional

functions Kriging is the response surface method of choice [22]. We have also

stressed the importance of applying a high level of understanding when using Krig-

ing. There have been various publications that critique kriging, due to lack of under-

standing. Our opinion is that if the user understands the strengths and weaknesses

of this approach it can become an invaluable tool, often the only one capable of

producing meaningful results in reasonable times.

Kriging is a Response Surface Method (RSM), developed in the 1960s for geological

surveys [7]. It can be a very efficient RSM model for cases where it is expensive to

obtain large amounts of data. A significant number of publications discuss the krig-

ing procedure in detail. An important role for the success of the method is the tuning

of its hyper parameters. It should be mentioned that researchers who have chosen

rigorous training procedures, report positive results when using kriging, while pub-

lications that use basic training procedures often reject this method. Nevertheless,

the method is becoming increasingly popular in the world of optimization as it often

provides a surrogate with usable accuracy.

This method was used to build surrogates for the above test cases, therefore it is

useful to briefly outline its major pros and cons:

Pros:

• can always predict with no error at sample points,

• the error in close proximity to sample points is minimal,

• requires a small number of sample points in comparison to other response surface

methods,

• reasonably good behaviour with high dimensional problems.


Cons:

• for a large number of data points and variables, training of the hyper-parameters

and prediction may become computationally expensive.

Researchers should make a conscious decision when choosing Kriging for their

RSMs. Such a decision should take into account the cost of a direct function eval-

uation including constraints (if any), available computational power, and dimen-

sionality of the problem. Sometimes it might be possible to use kriging for one

of the objectives while another is evaluated directly, or a different RSM is used to

minimize the cost.
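As a concrete illustration of the interpolation property listed in the pros, here is a kriging sketch using scikit-learn's Gaussian process regressor (one common kriging implementation, not the authors' code):

```python
# Noise-free kriging reproduces the sample points exactly, with (numerically)
# zero predicted error there and small error in their close proximity.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
y = np.sin(4.0 * X).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=0.2)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-8).fit(X, y)
# fit() tunes the hyper-parameters by maximizing the marginal likelihood,
# which is the training step whose cost grows with data size

pred, std = gp.predict(X, return_std=True)
# pred reproduces y at the sample points; std is near zero there
```

The hyper-parameter tuning inside `fit()` is exactly the step whose cost the cons above warn about for large data sets.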

As this paper aims to demonstrate various approaches in making a better use of

surrogate models, we will use kriging throughout, but most conclusions could be

generalised for other RS methods as well. The chosen multiobjective algorithm is

NSGA2. Other multiobjective optimizers might show slightly different behaviour.

The basic procedure is as follows:

1. Build initial RSMs of the objectives from an initial set of function evaluations.

2. Train hyper-parameters, using a combination of GA and DHC

(dynamic hill climbing) [23]

3. Choose a selection of update strategies with specified num-

ber of updates.

4. Search the RSMs using each of the selected methods

5. Select designs that are best in terms of ranking and space

filling properties.

6. Evaluate selected designs and add to data set.

7. Produce Pareto front and compare with previous. Stop if 2-3

consecutive Pareto fronts are identical. Otherwise continue.

8. If Pareto front contains too many points, choose specified

number of points that are furthest away from each other

9. Repeat from step 2.
Possible stopping criteria for the surrogate update loop include:

Several stopping criteria can be considered:

• a fixed number of update iterations,
• stop when all update points are dominated,
• stop if the percentage of new update points that belong to the Pareto front falls below a pre-defined value,
• stop if the percentage of old points on the current Pareto front rises above a pre-defined value,
• stop when there is no further improvement in the quality of the Pareto front. Assessing this quality is a complex multiobjective problem in its own right. The best Pareto front could be defined as the one that is as close as possible to the origin of the objective function space while having the best diversity, i.e., a good spread over all objectives with evenly distributed points. Metrics for assessing the quality of the Pareto front are discussed by Deb [3].

7 Multi-objective Optimization Using Surrogates 163

We have used the last of these criteria for our studies.

One of the main aims of this publication is to show the effect of different update strategies and number of updates. Here we consider the following six approaches in various combinations:

• UPDMOD = 1 (Nr) - Random updates. These can help escape from local Pareto fronts and enrich the genetic material.
• UPDMOD = 2 (Nrsm) - RSM Pareto front. A specified number of points are extracted from the Pareto front obtained after the search of the response surface models of the objectives and constraints (if any). When the RSM Pareto front is rich it is possible to extract data that is uniformly distributed.
• UPDMOD = 3 (Nsl) - Secondary NSGA2 layer. A completely independent NSGA2 algorithm is applied directly to the non-RSM objective functions and constraints. This exploits the well-known property of the NSGA2 which makes it (slowly) converge to global solutions. During each update iteration, the direct NSGA2 is run for one generation with a population size of Nsl. There are two strands to this approach. The first is referred to as 'decoupled': the genetic material is completely independent of the other update strategies, and no entries other than those from the direct NSGA2 are used. The second strand is referred to as 'coupled', where the genetic information is composed of suitable designs obtained by the other participating update strategies. Suitable designs are selected in terms of Pareto optimality, or rank in the NSGA2 sense. Please note that, although it might sound similar, this is a completely different approach from the MμGA algorithm proposed by Coello and Toscano (2000).
• UPDMOD = 4 (Nrmse) - Root Mean Squared Error (RMSE). When using Kriging as a response surface model, it is possible to compute an estimate of the RMSE at no significant computational cost. The value of this metric is large where there are large gaps between data points, and minimal close to or at existing data points. Therefore adding updates at the location of the maximum RMSE should significantly improve the quality and coverage of the response surface model. When dealing with multiple objectives/constraints it is appropriate to construct a Pareto front of maximum RMSEs for all objectives and extract Nrmse points from it.
• UPDMOD = 5 (Nie) - Expected improvement (EI). This is another Kriging-specific function which represents the probability of finding the optimal point at a new location. The update points are extracted from the Pareto front of the maximum values of the EI for all objectives. For constrained problems, the values of EI for all objectives are multiplied by the value of the feasibility of the constraints, which is 1 for satisfied constraints and 0 for infeasible ones, with a rather smooth ridge around the constraint boundary; see Forrester et al. [22].
• UPDMOD = 6 (Nkmean) - The RSMs are searched using a GA or DHC and points are extracted using a k-means cluster detection algorithm.

All these update strategies have their own strengths and weaknesses, and therefore a suitable combination should be carefully considered. The results section of this chapter provides some insights into the effects of each of these strategies when used in various combinations.
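Strategies 4 and 5 both rely on the error estimate that Kriging provides. As a small illustration of strategy 5, the single-objective expected improvement has a well-known closed form; the sketch below implements that standard formula and is not code from the chapter. In the multiobjective variant described above, a Pareto front would be built over the per-objective EI values (multiplied by constraint feasibility for constrained problems).

```python
import math

def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for a Gaussian prediction N(mu, sigma^2) of a
    minimisation objective, given the best observed value f_min."""
    if sigma <= 0.0:
        # Degenerate prediction: improvement is deterministic.
        return max(f_min - mu, 0.0)
    u = (f_min - mu) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return (f_min - mu) * cdf + sigma * pdf
```

At a point predicted to equal the current best (`mu == f_min`), EI reduces to `sigma * pdf(0)`, so locations with larger predictive uncertainty are favoured.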

The following parameters can also affect the performance of a multi-objective RSM search:

• RSMSIZE - the number of points used for RSM construction. It is expected that the more points are used, the more accurate the RSM predictions; however, this comes at increasing training cost. Therefore the number of training points should be limited.
• EVALSIZE - the number of points used during RSM evaluation. This stage is considerably less expensive than training, and therefore more points can be used during the evaluation stage. Ultimately this should increase the density of quality genetic material and therefore leave fewer gaps for the RSM to approximate.
• EPREWARD - endpoint reward factor. Higher value rewards are given at the end points of the Pareto front, which improves its spread. A lower value would increase the pressure of the GA to explore the centre of the Pareto front.
• GA NPOP and GA NGEN - the population size and number of generations used to search the RSM, RMSE and EI Pareto fronts.
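For illustration, these parameters might be gathered into a single settings object, with the training set capped at RSMSIZE points. This is a hypothetical sketch, not part of the original tool chain; the default values are only indicative (0.65 for EPREWARD is the default quoted later in the chapter, 50 and 300 match the example legend used in the results section, and the GA settings are arbitrary).

```python
from dataclasses import dataclass

@dataclass
class SearchSettings:
    """Illustrative container for the tuning parameters listed above."""
    rsm_size: int = 50        # RSMSIZE: points used to train the Kriging model
    eval_size: int = 300      # EVALSIZE: points used when evaluating the RSM
    ep_reward: float = 0.65   # EPREWARD: reward factor for Pareto-front endpoints
    ga_npop: int = 70         # GA NPOP: population size for the RSM search
    ga_ngen: int = 80         # GA NGEN: generations for the RSM search

def training_subset(data, settings):
    """Cap the training set at rsm_size points, here simply keeping the
    most recent designs (the chapter uses Pareto-rank-based selection)."""
    return data[-settings.rsm_size:]
```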

Several test functions with various degrees of complexity have been chosen to demonstrate the overview of the RS methods for the purpose of multiobjective optimization. These functions are well known from the literature:

F5 (Fig. 7.4): High-complexity shape, with both a smooth and a sharp feature. The combination of the two makes it easier for the optimization procedure to converge to the smooth feature, which represents a local Pareto front. The global Pareto front lies around the sharp feature, which is harder to reach. Two objectives, x_i ∈ [0, 1], i = 1, 2; no constraints [3], page 350.

ZDT1 - ZDT6: Clustered and discontinuous Pareto fronts. Shape complexity is moderate. Two objectives, n variables (in the present study n = 2), no constraints, x_i ∈ [0, 1], i = 1, 2 [3], page 357.
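The ZDT functions have standard published definitions. As an example, a minimal implementation of ZDT1 might look as follows; this is the textbook formula (Pareto-optimal front f2 = 1 − sqrt(f1), reached when all variables after the first are zero), not the authors' code.

```python
import math

def zdt1(x):
    """ZDT1 test function: two objectives, n >= 2 variables in [0, 1]."""
    f1 = x[0]
    g = 1.0 + 9.0 * sum(x[1:]) / (len(x) - 1)   # g = 1 on the Pareto front
    f2 = g * (1.0 - math.sqrt(f1 / g))
    return f1, f2
```

For a Pareto-optimal design such as `[0.25, 0.0]` this yields `(0.25, 0.5)`, while raising any of the trailing variables inflates `g` and pushes `f2` off the front.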

ZDT1cons: Same formulation as for ZDT1 but with 25 variables and 2 constraints. The constraints are described in [3], page 368.

Bump: The bump function, 25 variables, 2 objectives, 1 constraint. We have used the function as provided in [21], which is single-objective with two constraints. We have turned one of the constraints into a second objective, so that the optimization problem is defined as: maximise the original objective and minimise the sum of the variables, whilst keeping the product of the variables greater than 0.75. There are 25 variables, each varying between 0 and 3.

To measure the performance of the various strategies discussed in this paper, we have adopted several metrics. Some of them use a comparison to an 'ideal' solution, which is denoted by Q and represents the Pareto front obtained using a direct search with a large number of iterations (20,000). All metrics are designed so that smaller is better.

7.8.1 Generational Distance

The average of the minimum Euclidian distance between each point of the two Pareto fronts,

gd = \frac{\sum_{i=1}^{|Q|} d_i}{|Q|},

where d_i = \min_{k=1,\dots,|P|} \sqrt{\sum_{j=1}^{M} \left( f_j^{(i)} - p_j^{(k)} \right)^2} is the Euclidian distance between the solution (i) and the nearest member of Q.

7.8.2 Spacing

The standard deviation of the absolute differences between the solution (i) and the nearest member of Q,

sp = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( d_i - \bar{d} \right)^2},

where d_i = \min_{k=1,\dots,|P|} \sum_{j=1}^{M} \left| f_j^{(i)} - p_j^{(k)} \right|.
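The two metrics above might be implemented as follows; the exact pairing of the two fronts in the original formulas is slightly ambiguous, so this sketch measures each point of the obtained front against the ideal front Q, using Euclidean distances for the generational distance and absolute differences for the spacing, as in the definitions.

```python
import math
import statistics

def gen_distance(front, ideal):
    """Generational distance: mean Euclidean distance from each point of
    the obtained front to its nearest point on the ideal front Q."""
    d = [min(math.dist(p, q) for q in ideal) for p in front]
    return sum(d) / len(d)

def spacing(front, ideal):
    """Spacing: population standard deviation of the per-point
    nearest-member distances, using absolute differences per objective."""
    d = [min(sum(abs(a - b) for a, b in zip(p, q)) for q in ideal)
         for p in front]
    return statistics.pstdev(d)
```

Both metrics are zero when the obtained front coincides with the ideal one, and smaller values are better, consistent with the convention stated above.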

7.8.3 Spread

\Delta = 1 - \frac{\sum_{m=1}^{M} d_m^e + \sum_{i=1}^{|Q|} \left| d_i - \bar{d} \right|}{\sum_{m=1}^{M} d_m^e + |Q| \bar{d}},

where d_i is the absolute difference between neighbouring solutions. For compatibility with the above metrics, the value of the spread is subtracted from 1, so that a wider spread will produce a smaller value.

7.8.4 Maximum Spread

The normalized distance between the most distant points on the Pareto front. The distance is normalized against the maximum spread of the 'ideal' Pareto front. For compatibility with the above metrics, the value of the maximum spread is subtracted from 1, so that a wider spread will produce a smaller value,

MS = 1 - \frac{1}{M} \sum_{m=1}^{M} \left( \frac{\max_{i=1,\dots,|Q|} f_m^{(i)} - \min_{i=1,\dots,|Q|} f_m^{(i)}}{P_m^{\max} - P_m^{\min}} \right)^2 .
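The maximum spread formula can be sketched directly; as an assumption, the normalising extremes P_m^max and P_m^min are taken here as the per-objective extremes of the ideal front.

```python
def max_spread(front, ideal):
    """Maximum spread of the obtained front, normalised per objective by
    the extent of the ideal front and subtracted from 1 (smaller = better)."""
    n_obj = len(front[0])
    total = 0.0
    for m in range(n_obj):
        obtained = [p[m] for p in front]
        ref = [p[m] for p in ideal]
        ratio = (max(obtained) - min(obtained)) / (max(ref) - min(ref))
        total += ratio ** 2
    return 1.0 - total / n_obj
```

A front matching the ideal extremes scores 0; a single point (no spread at all) scores 1, the worst possible value.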

7.9 Results

The study carried out aims to show the effect of applying various update strategies, numbers of training and evaluation points, etc. The performance of each particular approach is measured using the metrics described in the previous section.

An overall summary is given at the end of this section, but the best recipe appears to be highly problem dependent. It is also not possible to show all results for all functions due to limited space, and we have therefore chosen several that best represent the ideas discussed.

To correctly appreciate the results, please bear in mind that they are meant to show diversity rather than a magic recipe that works in all situations.

The legend in the figures represents the selected strategy in the form

[Nr]-[Nrsm]-[Nsl]-[Nrmse]-[Nie]-[Nkmean]MUPD[RSMSIZE]MEVL[EVALSIZE]

so that 8-14-15-10-3-3MUPD50MEVL300 would represent 8 random update points, 14 RSM updates, 15 NSGA2 second-layer updates, 10 RMSE updates, 3 EI updates and 3 KMEAN updates, with 50 krig training points and 300 krig evaluation points.
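A label in this form can be decoded mechanically; the following regular-expression helper is hypothetical (not part of the original tool chain) and simply illustrates the encoding.

```python
import re

def parse_legend(label):
    """Split a strategy label such as '8-14-15-10-3-3MUPD50MEVL300'
    into its six update counts and the two Kriging set sizes."""
    m = re.fullmatch(r"(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-(\d+)"
                     r"MUPD(\d+)MEVL(\d+)", label)
    if m is None:
        raise ValueError(f"not a strategy label: {label!r}")
    nr, nrsm, nsl, nrmse, nie, nkmean, rsmsize, evalsize = map(int, m.groups())
    return {"Nr": nr, "Nrsm": nrsm, "Nsl": nsl, "Nrmse": nrmse,
            "Nie": nie, "Nkmean": nkmean,
            "RSMSIZE": rsmsize, "EVALSIZE": evalsize}
```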

All approaches were given a maximum of 60 update iterations and a stopping criterion of reaching two consecutive unchanged Pareto fronts. The total number of runs is recorded for each update iteration and all metrics are plotted against the number of real function evaluations (i.e. the likely cost on real expensive problems).

Strategies with 'dec' appended to their name indicate that the decoupled second layer is used, as opposed to coupled for those with Nsl = 30 and no suffix. Those labelled '43' use a one-pass constraint-penalty expected improvement strategy, whilst those that have Nie = 30 and no suffix use a constraint feasibility algorithm.


7.9.2.1 Finding the Ideal Pareto Front

As mentioned in Section 7.8, most of the Pareto front metrics are based on a comparison to an 'ideal' Pareto front. To find it, each of the test functions has been run through a direct NSGA2 search (direct = without the use of surrogates) with a population size of 100 for 200 generations, which takes 20,000 function evaluations. We have conducted a study for each of the test functions to find the minimum number of generations they should be run for in order to achieve best convergence. We found that a population size of 70 with 80 generations is sufficient for all of the test problems, and this is what we have used for our tests. Some test functions, such as ZDT1 - ZDT6 with two variables, could be converged using a smaller number of individuals and generations; however, for comparison purposes we decided to use the same settings for all functions.

7.9.2.3 What Is the Best Value for EPREWARD during the RSM Search?

The EPREWARD value is strictly individual to each function. Taking into account the specifics of the test function, it can improve the diversity of the Pareto front. The default value is 0.65, which works well for most of the functions, but we have also conducted studies where this parameter is varied between -1 and 1 in steps of 0.1, and an individual value for each function is selected based on the best Pareto front metrics.

Fig. 7.5 shows that the selection of update strategy is important even for functions with only two variables. F5 has a deceptive Pareto front and several update strategies were not able to escape from the local Pareto front.

Fig. 7.6 clearly shows that some strategies converged earlier than others, but some of them to the local front. Methods such as random updates and secondary NSGA2 layer updates are not based on the RSM and are the strongest candidates when deceptive features in the multiobjective space are expected. It is a common observation amongst most of the low-dimensional objective functions (two or three variables) that using all the update techniques together is not necessarily the winning strategy. However, combining at least one RSM and one non-RSM technique proves to work well. It is worth noting that the second NSGA2 layer shows its effect after the sixth or seventh update iteration, as it needs time to converge and gather genetic information.

Update strategies that employ a greater variety of techniques prove to be more successful for functions with a higher number of variables (25).


Fig. 7.9 and Fig. 7.10 show that the 'bump' function is particularly difficult for all strategies, which makes it a good test problem. This function has an extremely tight constraint and multimodal features. It is not yet clear which combination of strategies should be recommended, as the 'ideal' Pareto front has not been reached; however, it seems that a decoupled secondary NSGA2 layer shows good advancement. We are continuing studies on this function and will give results in future publications.

To summarize the performance of each strategy, an average statistic is computed, derived as follows. The actual performance in most cases is a trade-off between a given metric and the number of function evaluations needed for convergence. Therefore the four metrics can be ranked against the number of runs, in the same way as ranks are obtained during NSGA2 operation. The obtained ranks are then averaged across all test functions. A low average rank means that the strategy has been optimal for more metrics and functions. These results are summarized in the table below.

Random  RSM PF  SL   RMSE  EI   KMEAN | Av. Rank  Min. Rank  Max. Rank | Note
  0       30     0     0    30    0   |   1.53       1          2      | EI const. feas.
  0       30    30     0     0    0   |   1.83       1          3.33   | SL coupled
  0       30     0    30     0    0   |   2          1.33       3.33   | RMSE
  0       30     0     0    30    0   |   2.2        1.33       3      | EI normal
  0       30    30     0     0    0   |   2.8        1.33       4      | SL decoupled
 30       30     0     0     0    0   |   2.84       2          4      | Random
  0       60     0     0     0    0   |   2.85       2          3.33   | RSM PF
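The rank averaging can be illustrated with a small non-dominated-ranking routine; this is one plausible reading of "ranks obtained during NSGA2 operation" applied to (function evaluations, metric) pairs, both to be minimised, and is not the authors' code.

```python
def nd_ranks(points):
    """NSGA-II-style non-dominated ranks (1 = best) for a list of
    minimisation points; duplicates never dominate each other."""
    remaining = dict(enumerate(points))
    ranks, level = {}, 1
    while remaining:
        # Points not dominated by any other remaining point form one level.
        front = [i for i, p in remaining.items()
                 if not any(all(x <= y for x, y in zip(q, p)) and q != p
                            for q in remaining.values())]
        for i in front:
            ranks[i] = level
            del remaining[i]
        level += 1
    return [ranks[i] for i in range(len(points))]
```

Averaging such ranks over the four metrics and all test functions yields summary values of the kind tabulated above.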

The summary shows that all strategies are generally better than using only the conventional RSM-based updates, which is expected, as the conventional method is almost always bound to converge to local solutions. However, it must be underlined that the correct combination is problem dependent and must be selected with care and understanding.

All methods presented here start from a given initial design of experiments. This is the starting point and is what the initial surrogate model is based on. It is of course important to show the effect of these initial conditions. In what follows we have shown that effect by using a range of different initial DOEs. We have again used 10 updates for each of the techniques (60 updates per iteration in total) for all functions, the only difference being the starting set of designs.

Fig. 7.11 Generational distance for zdt1 starting from different initial DOEs

Fig. 7.12 Generational distance for F5 starting from different initial DOEs

Fig. 7.13 Pareto fronts for 'bump' starting from different initial DOEs

Fig. 7.11 and Fig. 7.12 illustrate the generational distance for the zdt1 and F5 functions, both with two variables. They both demonstrate consistent behaviour across the different starting sets, confirming once again that the surrogate updates are fairly robust for functions with a low number of variables.

Fig. 7.14 Generational distance for 'bump' starting from different initial DOEs

Fig. 7.15 Pareto fronts for 'zdt1cons' starting from different initial DOEs

Figures 7.13, 7.14 and 7.15 illustrate much greater variance and show that high dimensionality is a difficult challenge for surrogate strategies; however, one should also consider the low number of function evaluations used here.


7.10 Summary

In this publication we have aimed to share our experience in tackling expensive multiobjective problems. We have shown that as soon as we decide to use surrogate models to substitute for expensive objective functions, we need to consider a number of other specifics in order to produce a useful Pareto front. We have discussed the challenges that one might face when using surrogates and have proposed six update strategies that one might wish to use. Given an understanding of these strategies, the researcher should decide on the budget of updates they can afford and then spread this budget over several update strategies. We have shown that it is best to use at least two different strategies, ideally a mixture of RSM and non-RSM based techniques. When solving problems with few variables, a combination of two or three techniques is sufficient; however, with higher-dimensional problems one should consider using more techniques.

It is also beneficial to constrain the number of designs that are used for RSM training and for RSM evaluation, to limit the cost. The method used to select these designs remains open to further research. In this material we have used selection based on Pareto front ranking.

Our research also included parameters that reward the search for exploring the end points of the Pareto front. Although not explicitly mentioned in this material, our studies use features such as improved crossover, mutation and selection strategies, and a declustering algorithm applied in both the variable and objective spaces to avoid data clustering. Data is also automatically conditioned and filtered, and advanced Kriging tuning techniques are used. These features are part of the OPTIONS [1], OptionsMATLAB and OptionsNSGA2 RSM suites [24].

Acknowledgements. This work was funded by Rolls-Royce plc, whose support is gratefully acknowledged.

References

1. Keane, A.J.: OPTIONS manual, http://www.soton.ac.uk/˜ajk/options.ps
2. Obayashi, S., Jeong, S., Chiba, K.: Multi-Objective Design Exploration for Aerodynamic Configurations, AIAA-2005-4666
3. Deb, K.: Multi-objective optimization using evolutionary algorithms. John Wiley & Sons, Ltd., New York (2003)
4. Zitzler, E., et al.: Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary Computational Journal 8(2), 125–148 (2000)
5. Knowles, J., Corne, D.: The Pareto archived evolution strategy: A new baseline algorithm for multiobjective optimisation. In: Proceedings of the 1999 Congress on Evolutionary Computation, pp. 98–105. IEEE Service Center, Piscataway (1999)
6. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint handling with evolutionary algorithms - Part II: Application example. IEEE Transactions on Systems, Man, and Cybernetics: Part A: Systems and Humans, 38–47 (1998)
7. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 455–492 (1998)
8. Sobol', I.M., Turchaninov, V.I., Levitan, Y.L., Shukhman, B.V.: Quasi-Random Sequence Generators. Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Moscow (1992)
9. Nowacki, H.: Modelling of Design Decisions for CAD. In: Goos, G., Hartmanis, J. (eds.) Computer Aided Design Modelling, Systems Engineering, CAD-Systems. LNCS, vol. 89. Springer, Heidelberg (1980)
10. Kumano, T., et al.: Multidisciplinary Design Optimization of Wing Shape for a Small Jet Aircraft Using Kriging Model. In: 44th AIAA Aerospace Sciences Meeting and Exhibit, January 2006, pp. 1–13 (2006)
11. Nain, P.K.S., Deb, K.: A multi-objective optimization procedure with successive approximate models. KanGAL Report No. 2005002 (March 2005)
12. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of Excellence (2005) ISBN: 0-470-85540-1
13. Leary, S., Bhaskar, A., Keane, A.J.: A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. J. Global Optimization 30, 39–58 (2004)
14. Leary, S., Bhaskar, A., Keane, A.J.: A Constraint Mapping Approach to the Structural Optimization of an Expensive Model using Surrogates. Optimization and Engineering 2, 385–398 (2001)
15. Emmerich, M., Naujoks, B.: Metamodel-assisted multiobjective optimization strategies and their application in airfoil design. In: Parmee, I. (ed.) Proc. of Fifth Int'l. Conf. on Adaptive Design and Manufacture (ACDM), Bristol, UK, April 2004, pp. 249–260. Springer, Berlin (2004)
16. Giotis, A.P., Giannakoglou, K.C.: Single- and Multi-Objective Airfoil Design Using Genetic Algorithms and Artificial Intelligence. In: EUROGEN 1999, Evolutionary Algorithms in Engineering and Computer Science (May 1999)
17. Knowles, J., Hughes, E.J.: Multiobjective optimization on a budget of 250 evaluations. In: Coello Coello, C.A., Hernández Aguirre, A., Zitzler, E. (eds.) EMO 2005. LNCS, vol. 3410, pp. 176–190. Springer, Heidelberg (2005)
18. Chafekar, D., et al.: Multi-objective GA optimization using reduced models. IEEE SMCC 35(2), 261–265 (2005)
19. Nain, P.: A computationally efficient multi-objective optimization procedure using successive function landscape models. Ph.D. dissertation, Department of Mechanical Engineering, Indian Institute of Technology (July 2005)
20. Voutchkov, I.I., Keane, A.J.: Multiobjective optimization using surrogates. In: Proc. 7th Int. Conf. Adaptive Computing in Design and Manufacture (ACDM 2006), Bristol, pp. 167–175 (2006) ISBN 0-9552885-0-9
21. Keane, A.J.: Bump: A Hard (?) Problem (1994), http://www.soton.ac.uk/˜ajk/bump.html
22. Forrester, A., Sobester, A., Keane, A.: Engineering Design via Surrogate Modelling. Wiley, Chichester (2008)
23. Yuret, D., Maza, M.: Dynamic hill climbing: Overcoming the limitations of optimization techniques. In: The Second Turkish Symposium on Artificial Intelligence and Neural Networks, pp. 208–212 (1993)
24. OptionsMatlab & OptionsNSGA2 RSM, http://argos.e-science.soton.ac.uk/blogs/OptionsMatlab/

Chapter 8

A Review of Agent-Based Co-Evolutionary Algorithms for Multi-Objective Optimization

Agent-based co-evolutionary algorithms are the result of mixing two paradigms: multi-agent systems and evolutionary algorithms. Agent-based co-evolutionary algorithms allow many species and sexes of agents to exist within the system, as well as the definition of co-evolutionary interactions between species and sexes. Algorithms based on the model of the co-evolutionary multi-agent system have already been applied in many domains, such as multi-modal optimization, generation of investment strategies, portfolio optimization, and multi-objective optimization. In this chapter we present an overview of selected agent-based co-evolutionary algorithms, their formal models, and the results of experiments with standard test problems and a financial problem, aimed at comparing agent-based and "classical" state-of-the-art multi-objective algorithms. The presented results show that, depending on the problem being solved, agent-based algorithms obtain comparable, and sometimes even better, results than "classical" algorithms; they are, however, of course not a universal solver for every multi-objective optimization problem.

8.1 Introduction

In spite of the huge potential dormant in evolutionary algorithms and the many successful applications of such algorithms in solving difficult optimization and search problems, such methods frequently prove unable to deal with the defined problem and the obtained results are not satisfying. Among the reasons for such a situation the following can be mentioned:

• centralization of the evolutionary process, where the process of selection as well as the process of creating new generations are controlled by one single algorithm;
• passivity of the specimens, which are deprived of any influence on the process of evolution;
• omitting some operations and processes observable in nature that are crucial from the point of view of evolution and adaptation capabilities. Moreover, in the literature there are opinions that crossover and mutation are only variants of one single destructive, exploration-oriented operator, and there is no agreement on whether (and if so, when) they should be used, or even whether they should be distinguished [17];
• specimens being able neither to gather nor to utilize any kind of information from the environment in order to realize their own goals during the decision-making process;
• depriving specimens of biological and social behaviours that are absolutely natural and obvious in nature, like competition, rivalry, cooperation, etc.;
• as a consequence of the previous points (the limited set of operators), it is almost impossible to define more sophisticated (and simultaneously more effective) advanced algorithms and computational methods within classical evolutionary algorithms.

As a consequence, arguments are raised in the literature that classical evolutionary algorithms are methods of adapting and fitting an algorithm's parameters to defined conditions rather than truly creative methods of searching and optimization.

Rafał Dreżewski · Leszek Siwik
Department of Computer Science
AGH University of Science and Technology, Kraków, Poland
e-mail: {drezew,siwik}@agh.edu.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 177–209.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

It is no surprise, then, that intensive research is being performed on methods utilizing the ideas and conceptions of computer models of the Darwinian evolution observable in nature, but at the same time on methods that are devoid of the shortcomings mentioned above and which could be perceived as a fuller analogy to natural processes. During this research, decentralization and autonomy have been in the limelight. The resulting method, called Evolutionary Multi-Agent System (EMAS) [2], should be perceived as a new trend among evolutionary algorithms, allowing the defined postulates to be realized by simultaneously utilizing the advantages of both the evolutionary and the agent-based approach.

The proposed paradigm of the evolutionary multi-agent system is characterized by the following features, crucial when taking the shortcomings of classical evolutionary algorithms into account:

• autonomous agents take part in the process of evolution. Agents are able to make decisions to realize their own goals and are not passive units of a global and central evolution, limited and reduced to the role of a (group of) genes;
• the process of evolution is decentralized, and the agents taking part in it are able to create advanced social structures and to realize sophisticated strategies of cooperation, competition, interaction and reciprocal relations;
• agents taking part in the process of evolution are able to observe the environment (and occurring changes) and to make appropriate decisions and take actions, which additionally enriches the spectrum of complex and effective computational methods and algorithms that can be realized.

8 A Review of Agent-Based Co-Evolutionary Algorithms 179

During further research on realizing advanced, complex social and biological mechanisms within the confines of EMAS, a general model of so-called co-evolutionary multi-agent systems (CoEMAS) [8] has been proposed. It has turned out that with the use of such a model almost any kind of interaction, cooperation or competition among many species or sexes of co-evolving agents is possible, which allows the quality of the obtained results to be improved. Such improvement results mainly from better maintenance of population diversity, which is especially important when applying such systems to solving multi-modal or multi-objective optimization tasks.

In the course of this chapter we focus on applying co-evolutionary multi-agent systems to solving multi-objective optimization tasks.

Following [5], a multi-objective optimization problem (MOOP) in its general form is defined as follows:

MOOP ≡ { Minimize/Maximize  f_m(x̄),            m = 1, 2, ..., M;
          subject to         g_j(x̄) ≥ 0,        j = 1, 2, ..., J;
                             h_k(x̄) = 0,        k = 1, 2, ..., K;
                             x_i^(L) ≤ x_i ≤ x_i^(U),   i = 1, 2, ..., N. }
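The constraint block of this definition is straightforward to encode. The sketch below is a generic feasibility check for the form above; the function name, argument layout and equality tolerance are our own choices, with g_j, h_k passed as callables and the box bounds as (lower, upper) pairs.

```python
def is_feasible(x, gs=(), hs=(), bounds=(), tol=1e-9):
    """Check g_j(x) >= 0, h_k(x) == 0 (within tol) and the box bounds
    x_i^(L) <= x_i <= x_i^(U) from the general MOOP definition."""
    return (all(g(x) >= 0.0 for g in gs)
            and all(abs(h(x)) <= tol for h in hs)
            and all(lo <= xi <= hi for xi, (lo, hi) in zip(x, bounds)))
```

For example, with one inequality constraint g(x̄) = x_1, one equality constraint h(x̄) = x_1 + x_2 − 1 and both variables bounded to [0, 1], the design (0.3, 0.7) is feasible while (0.3, 0.5) violates the equality.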

The authors of this chapter assume that readers are familiar with at least the fundamental concepts and notions regarding multi-objective optimization in the Pareto sense (the domination relation, Pareto frontier, Pareto set, etc.), so their explanation is omitted here (interested readers can find definitions and a deep analysis of all the necessary concepts and notions of Pareto multi-objective optimization, for instance, in [3, 5]).

This chapter is organized as follows:

• in Section 8.2 a formal model as well as a detailed description of the co-evolutionary multi-agent system (CoEMAS) is presented;
• in Section 8.3 detailed descriptions and formal models of two realizations of CoEMAS applied to solving MOOPs are given. In this section the Co-Evolutionary Multi-Agent System with Predator-Prey interactions (PPCoEMAS) as well as the Co-Evolutionary Multi-Agent System with Cooperation (CCoEMAS) are discussed;
• in Section 8.4 we briefly discuss the test suite and performance metrics used during the experiments, and then we glance at the results obtained by both systems presented in the course of this chapter (PPCoEMAS and CCoEMAS);
• in Section 8.5 the most important remarks, conclusions and comments are given.

8.2 Co-Evolutionary Multi-Agent Systems

Agent-based models of evolutionary algorithms are the result of mixing two paradigms: multi-agent systems and evolutionary algorithms. The result is a decentralized evolutionary system, in which agents "live" within the environment of the system, compete for limited resources, reproduce, die, migrate from one computational node to another, observe the environment and other agents, and can communicate with other agents and change the environment.

The basic model of the agent-based evolutionary algorithm (the so-called evolutionary multi-agent system, EMAS) was proposed in [2]. The EMAS model included all the features mentioned above. However, in the case of some problems, for example multi-modal or multi-objective optimization, it turned out that these mechanisms are not sufficient. Such types of problems require mechanisms for maintaining population diversity, speciation mechanisms, and the possibility of introducing additional biologically and socially inspired mechanisms in order to solve the problem and obtain satisfying results.

The above-mentioned limitations of the basic EMAS model, and research aimed at applying agent-based evolutionary algorithms to multi-modal and multi-objective problems, led to the formulation of the model of the co-evolutionary multi-agent system (CoEMAS) [8]. This model included the possibility of different species and sexes existing in the system and allowed co-evolutionary interactions between them to be defined. Below we present the basic ideas and notions of the CoEMAS model, which we will use in Section 8.3 when the systems used in the experiments are described.

The CoEMAS is described as a 4-tuple:

CoEMAS = ⟨E, S, Γ, Ω⟩    (8.1)

where E is the environment of the system, S is the set of species that co-evolve in CoEMAS, Γ is the set of resource types that exist in the system (the amount of resource of type γ will be denoted by r^γ), and Ω is the set of information types that exist in the system (the information of type ω will be denoted by i^ω).

8.2.2 Environment

The environment of CoEMAS may be described as a 3-tuple:

E = ⟨T^E, Γ^E, Ω^E⟩    (8.2)

where T^E is the topography of the environment, Γ^E is the set of resource types that exist in the environment, and Ω^E is the set of information types that exist in the environment. The topography of the environment is given by:

T^E = ⟨H, l⟩    (8.3)

where H is a directed graph with a cost function c defined: H = ⟨V, B, c⟩, V is the set of vertices, and B is the set of arcs. The distance between two nodes is defined as the length of the shortest path between them in the graph H.

8 A Review of Agent-Based Co-Evolutionary Algorithms 181

The function l locates agents in the space:

l : A → V (8.4)

where A is the set of agents that exist in CoEMAS.

Vertex v is given by:

v = ⟨A^v, Γ^v, Ω^v, ϕ⟩ (8.5)

where A^v is the set of agents that are located in the vertex v, Γ^v is the set of resource types that exist within v (Γ^v ⊆ Γ^E), Ω^v is the set of information types that exist within v (Ω^v ⊆ Ω^E), and ϕ is the fitness function.

8.2.3 Species

Species s ∈ S is defined as follows:

s = ⟨A^s, SX^s, Z^s, C^s⟩ (8.6)

where:

• A^s is the set of agents of species s (by a^s we will denote an agent of species s, a^s ∈ A^s);
• SX^s is the set of sexes within s;

182 R. Dreżewski and L. Siwik

• Z^s is the set of actions which can be performed by the agents of species s (Z^s = ⋃_{a∈A^s} Z^a, where Z^a is the set of actions which can be performed by the agent a);
• C^s is the set of relations with other species that exist within CoEMAS.

The set of relations of si with other species (C^si) is the sum of the following sets of relations:

C^si = { −→^{si,z−} : z ∈ Z^si } ∪ { −→^{si,z+} : z ∈ Z^si } (8.7)

where −→^{si,z−} and −→^{si,z+} are relations between species, based on some action z ∈ Z^si which can be performed by the agents of species si:

−→^{si,z−} = { ⟨si, sj⟩ ∈ S × S : agents of species si can decrease the fitness of agents of species sj by performing the action z ∈ Z^si } (8.8)

−→^{si,z+} = { ⟨si, sj⟩ ∈ S × S : agents of species si can increase the fitness of agents of species sj by performing the action z ∈ Z^si } (8.9)

If si −→^{si,z−} si then we are dealing with intra-species competition, for example the competition for limited resources, and if si −→^{si,z+} si then there is some form of co-operation within the species si.

With the use of the above relations we can define many different co-evolutionary interactions, e.g., mutualism, predator-prey, host-parasite, etc. For example, mutualism between two species si and sj (i ≠ j) takes place if and only if ∃zk ∈ Z^si ∃zl ∈ Z^sj such that si −→^{si,zk+} sj and sj −→^{sj,zl+} si, and these two species live in tight co-operation. Predator-prey interactions between two species, si (predators) and sj (preys) (i ≠ j), take place if and only if ∃zk ∈ Z^si ∃zl ∈ Z^sj such that si −→^{si,zk−} sj and sj −→^{sj,zl+} si, where zk is the action of killing the prey (kill) and zl is the action of death (die).
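These interaction patterns can be expressed directly on top of the relation sets. The sketch below (species and action names are purely illustrative, not taken from the model) stores each relation as a set of (source, target) species pairs keyed by the action and its sign, and tests the mutualism and predator-prey conditions given above:

```python
# Each entry: (action, sign) -> set of (s_i, s_j) pairs,
# mirroring the relations defined in Eqs. (8.8)-(8.9).
relations = {
    ("pollinate", "+"): {("bee", "flower")},
    ("feed", "+"): {("flower", "bee")},
    ("kill", "-"): {("wolf", "hare")},
    ("die", "+"): {("hare", "wolf")},
}

def mutualism(si, sj, rel):
    """si and sj are mutualists if each can increase the other's fitness."""
    inc = lambda a, b: any((a, b) in pairs
                           for (z, sign), pairs in rel.items() if sign == "+")
    return si != sj and inc(si, sj) and inc(sj, si)

def predator_prey(pred, prey, rel):
    """pred preys on prey: some action of pred decreases prey's fitness,
    while some action of prey (its death) increases pred's fitness."""
    dec = any((pred, prey) in pairs
              for (z, sign), pairs in rel.items() if sign == "-")
    inc = any((prey, pred) in pairs
              for (z, sign), pairs in rel.items() if sign == "+")
    return dec and inc

print(mutualism("bee", "flower", relations))    # True
print(predator_prey("wolf", "hare", relations))  # True
```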

8.2.4 Sex

The sex sx ∈ SX^s within the species s is defined as follows:

sx = ⟨A^sx, Z^sx, C^sx⟩ (8.10)

With a^sx we will denote an agent of sex sx (a^sx ∈ A^sx). Z^sx is the set of actions which can be performed by the agents of sex sx:

Z^sx = ⋃_{a∈A^sx} Z^a (8.11)

where Z^a is the set of actions which can be performed by the agent a. And finally C^sx is the set of relations between sx and other sexes of the species s.

Analogously as in the case of species, we can define relations between the sexes of the same species. The set of all relations of the sex sxi ∈ SX^s with other sexes of species s (C^sxi) is the sum of the following sets of relations:

C^sxi = { −→^{sxi,z−} : z ∈ Z^sxi } ∪ { −→^{sxi,z+} : z ∈ Z^sxi } (8.12)

where −→^{sxi,z−} and −→^{sxi,z+} are the relations between sexes, in which some actions z ∈ Z^sxi are used:

−→^{sxi,z−} = { ⟨sxi, sxj⟩ ∈ SX^s × SX^s : agents of sex sxi can decrease the fitness of agents of sex sxj by performing the action z ∈ Z^sxi } (8.13)

−→^{sxi,z+} = { ⟨sxi, sxj⟩ ∈ SX^s × SX^s : agents of sex sxi can increase the fitness of agents of sex sxj by performing the action z ∈ Z^sxi } (8.14)

With the use of the presented relations between sexes we can model, for example, sexual selection interactions, in which agents of one sex choose partners for reproduction from among agents of the other sex within the same species, taking into account some preferred features (see [10]).

8.2.5 Agent

Agent a (see Fig. 8.2) of sex sx and species s (in order to simplify the notation we assume that a ≡ a^{sx,s}) is defined as follows:

a = ⟨gn^a, Z^a, Γ^a, Ω^a, PR^a⟩ (8.15)

where:

• gn^a is the genotype of agent a, which may be composed of any number of chromosomes (for example: gn^a = (x1, x2, …, xk), where xi ∈ ℝ, gn^a ∈ ℝ^k);

• Z^a is the set of actions which agent a can perform;
• Γ^a is the set of resource types which are used by agent a (Γ^a ⊆ Γ);
• Ω^a is the set of information types which agent a can possess and use (Ω^a ⊆ Ω);

• PR^a is the partially ordered set of profiles of agent a (PR^a ≡ ⟨PR^a, ⪯⟩) with the defined partial order relation ⪯.

184 R. Dreżewski and L. Siwik

⪯ = { ⟨pri, prj⟩ ∈ PR^a × PR^a : realization of the active goals of profile pri has equal or higher priority than the realization of the active goals of profile prj } (8.16)

The active goal (which is denoted as gl∗) is the goal gl which should be realized at the given time. The relation ⪯ is reflexive, transitive and antisymmetric, and partially orders the set PR^a:

pri ⪯ pri for every pri ∈ PR^a (8.17a)

(pri ⪯ prj ∧ prj ⪯ prk) ⇒ pri ⪯ prk for every pri, prj, prk ∈ PR^a (8.17b)

(pri ⪯ prj ∧ prj ⪯ pri) ⇒ pri = prj for every pri, prj ∈ PR^a (8.17c)
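For a finite relation these three properties can be checked mechanically; the sketch below (with illustrative profile names) verifies reflexivity, transitivity and antisymmetry for a relation given as a set of pairs:

```python
def is_partial_order(elements, rel):
    """rel is a set of (x, y) pairs meaning x precedes-or-equals y."""
    reflexive = all((x, x) in rel for x in elements)
    transitive = all((x, z) in rel
                     for (x, y1) in rel for (y2, z) in rel if y1 == y2)
    antisymmetric = all(not ((x, y) in rel and (y, x) in rel and x != y)
                        for x in elements for y in elements)
    return reflexive and transitive and antisymmetric

profiles = ["pr1", "pr2", "pr3"]
# pr1 before pr2 before pr3, plus the pairs forced by reflexivity
# and transitivity
order = ({(p, p) for p in profiles}
         | {("pr1", "pr2"), ("pr2", "pr3"), ("pr1", "pr3")})
print(is_partial_order(profiles, order))  # True
```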

pr1 ⪯ pr2 ⪯ ··· ⪯ prn (8.18b)

Profile pr1 is the basic profile—it means that the realization of its goals has the highest priority and they will be realized before the goals of the other profiles.

Profile pr of agent a (pr ∈ PR^a) can be a profile in which only resources are used:

pr = ⟨Γ^pr, ST^pr, RST^pr, GL^pr⟩ (8.19)

8 A Review of Agent-Based Co-Evolutionary Algorithms 185

Algorithm 6. Basic activities of the agent a

r^γ ← r^γ_init /* r^γ_init is the initial amount of resource given to the agent */
while r^γ > 0 do
  activate the profile pri ∈ PR^a with the highest priority and with the active goal gl∗j ∈ GL^pri
  if pri is the resource profile then
    if 0 < r^γ < r^γ_min then /* r^γ_min is the minimal amount of resource needed by the agent to realize its activities */
      choose the strategy stk ∈ ST^pri with the highest priority that can be used to take some resources from the environment or another agent
      perform the actions contained within stk
    else if r^γ = 0 then
      execute the die strategy
  else if pri is the reproduction profile then
    if r^γ > r^{rep,γ}_min then /* r^{rep,γ}_min is the minimal amount of resource needed for reproduction */
      choose the strategy stk ∈ ST^pri with the highest priority that can be used to reproduce
      perform the actions contained within stk
  else if pri is the migration profile then
    if r^γ > r^{mig,γ}_min then /* r^{mig,γ}_min is the minimal amount of resource needed for migration */
      choose the strategy stk ∈ ST^pri with the highest priority that can be used to migrate
      perform the actions contained within stk
      give the r^{mig,γ}_min amount of resource to the environment
end

a profile in which only information is used:

pr = ⟨Ω^pr, M^pr, ST^pr, RST^pr, GL^pr⟩ (8.20)

or one in which both resources and information are used:

pr = ⟨Γ^pr, Ω^pr, M^pr, ST^pr, RST^pr, GL^pr⟩ (8.21)

where:

• Γ pr is the set of resource types, which are used within the profile pr (Γ pr ⊆ Γ a );

• Ω pr is the set of information types, which are used within the profile pr (Ω pr ⊆

Ω a );

186 R. Dreżewski and L. Siwik

• M^pr is the model used within the profile pr, composed of information about the environment and other agents (it is the model of the environment of agent a);

• ST^pr is the partially ordered set of strategies (ST^pr ≡ ⟨ST^pr, ⪯⟩) which can be used by the agent within the profile pr in order to realize an active goal of this profile;
• RST^pr is the set of strategies that are realized within the profile pr—generally, not all of the strategies from the set ST^pr have to be realized within the profile pr, some of them may be realized within other profiles;
• GL^pr is the partially ordered set of goals (GL^pr ≡ ⟨GL^pr, ⪯⟩) which the agent has to realize within the profile pr.

The relation ⪯ is defined in the following way:

⪯ = { ⟨sti, stj⟩ ∈ ST^pr × ST^pr : strategy sti has equal or higher priority than strategy stj } (8.22)

This relation is reflexive, transitive and antisymmetric, and partially orders the set ST^pr. Every single strategy st ∈ ST^pr consists of actions whose ordered performance leads to the realization of some active goal of the profile pr:

st = ⟨z1, z2, …, zk⟩, st ∈ ST^pr, zi ∈ Z^a (8.23)

⪯ = { ⟨gli, glj⟩ ∈ GL^pr × GL^pr : goal gli has equal or higher priority than the goal glj } (8.24)

This relation is reflexive, transitive and antisymmetric, and partially orders the set GL^pr.

The partially ordered sets of profiles PR^a, goals GL^pr, and strategies ST^pr are used by the agent in order to make decisions about the goal to realize and to choose the appropriate strategy for realizing that goal. The basic activities of the agent a are shown in Algorithm 6.

In CoEMAS systems the set of profiles is usually composed of the resource profile (pr1), the reproduction profile (pr2), and the migration profile (pr3):

PR^a = {pr1, pr2, pr3}, pr1 ⪯ pr2 ⪯ pr3 (8.25)

The resource profile has the highest priority, followed by the reproduction profile and finally the migration profile.
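The resource-then-reproduction-then-migration priority scheme of Algorithm 6 reduces to trying the profiles in order and activating the first one whose resource precondition holds. A minimal sketch of one decision step (thresholds and amounts are illustrative, not taken from the chapter):

```python
# Illustrative thresholds: minimal activity, reproduction and
# migration resource levels.
R_MIN, R_REP, R_MIG = 10, 30, 15

def step(r):
    """One pass of the agent's decision loop: profiles are tried in
    priority order (resource, reproduction, migration)."""
    if r == 0:
        return "die", 0               # out of resources
    if 0 < r < R_MIN:
        return "seek/get", r + 15     # resource profile: replenish first
    if r > R_REP:
        return "reproduce", r - 25    # give resources to the offspring
    if r > R_MIG:
        return "migrate", r - R_MIG   # migration costs R_MIG
    return "idle", r

r = 5
for _ in range(4):
    action, r = step(r)
    print(action, r)
```

Note how the earlier branches shadow the later ones: an agent low on resources never reproduces or migrates, exactly as the profile ordering prescribes.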


8.3 Agent-Based Co-Evolutionary Systems for Multi-Objective Optimization

In this section we will describe two co-evolutionary multi-agent systems used in the experiments. Each of these systems uses a different co-evolutionary mechanism: co-operation and predator-prey interactions. All of the systems are based on the general model of co-evolution in a multi-agent system described in Section 8.2—in this section only those elements of the systems are described that are specific to these instantiations of the general model. In all the systems presented below, real-valued vectors are used as agents' genotypes. Mutation with self-adaptation and intermediate recombination are used as evolutionary operators [1].
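A standard evolution-strategies realization of these two operators, which the chapter adopts after [1], can be sketched as follows (the learning-rate choices are the usual recommendations, not parameters taken from the chapter):

```python
import math, random

def intermediate_recombination(p1, p2):
    """Offspring is the component-wise mean of the two parents."""
    return [(a + b) / 2.0 for a, b in zip(p1, p2)]

def self_adaptive_mutation(x, sigma):
    """Log-normal self-adaptation: mutate the step sizes first, then
    perturb the decision variables with the new step sizes."""
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    g = random.gauss(0.0, 1.0)  # one global draw shared by all components
    new_sigma = [s * math.exp(tau_global * g + tau_local * random.gauss(0, 1))
                 for s in sigma]
    new_x = [xi + si * random.gauss(0, 1) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma

parent1, parent2 = [0.0, 1.0], [2.0, 3.0]
child = intermediate_recombination(parent1, parent2)
child, child_sigma = self_adaptive_mutation(child, [0.1, 0.1])
print(child, child_sigma)
```

Because the step sizes σ are part of the genotype, selection acting on the agents indirectly tunes the mutation strength itself, which is the point of self-adaptation.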

8.3.1 Co-Evolutionary Multi-Agent System with Co-Operation Mechanism (CCoEMAS)

The co-evolutionary multi-agent system with co-operation mechanism is defined as

follows (see Eq. (8.1)):

CCoEMAS = ⟨E, S, Γ, Ω⟩ (8.26)

The number of species corresponds to the number of criteria (n) of the multi-objective problem being solved: S = {s1, …, sn}. Three information types (Ω = {ω1, ω2, ω3}) and one resource type (Γ = {γ}) are used. Information of type ω1 denotes the nodes to which an agent can migrate. Information of type ω2 denotes (for an agent of a given species) all agents from other species that are located within the same node at time t. Information of type ω3 denotes (for a given agent) all agents from the same species located within the same node.

8.3.1.1 Species

The species s is defined as follows:

s = ⟨A^s, SX^s = {sx}, Z^s, C^s⟩ (8.27)

where SX^s is the set of sexes which exist within the species s, Z^s is the set of actions that agents of species s can perform, and C^s is the set of relations of species s with other species that exist in the CCoEMAS.

Actions

The set of actions Z s is defined as follows:

Z s = {die, seek, get, give, accept, seekPartner, clone, rec, mut, migr} (8.28)

where:

• die is the action of death (agent dies when it is out of resources);

• seek is the action of finding a dominated agent from the same species in order to

take some resources from it;

188 R. Dreżewski and L. Siwik

• get action gets some resource from another agent located within the same node,

which is dominated by the agent that performs get action;

• give action gives some resources to the agent that performs get action;

• accept action accepts partner for reproduction when the amount of resource pos-

sessed by the agent is above the given level;

• seekPartner action seeks a partner for reproduction, such that it comes from another species and has an amount of resource above the minimal level needed for reproduction;

• clone is the action of producing offspring (parents give some of their resources

to the offspring during this action);

• rec is the recombination operator (intermediate recombination is used [1]);

• mut is the mutation operator (mutation with self-adaptation is used [1]);

• migr is the action of migrating from one node to another. During this action agent

loses some of its resource.

Relations

The set of relations of the si species with other species that exist within the system is defined as follows:

C^si = { −→^{si,get−}, −→^{si,accept+} } (8.29)

The first relation models intra-species competition for limited resources:

−→^{si,get−} = { ⟨si, si⟩ } (8.30)

The second relation models co-operation between species, realized by accepting agents of other species as partners for reproduction:

−→^{si,accept+} = { ⟨si, sj⟩ } (8.31)

8.3.1.2 Agent

Agent a of species s (a ≡ a^s) is defined as follows:

a = ⟨gn^a, Z^a, Γ^a, Ω^a, PR^a⟩ (8.32)

where the genotype gn^a is composed of two vectors: x of decision parameters' values and σ of standard deviations' values, which are used during mutation with self-adaptation. Agents of a given species are evaluated according only to the one criterion associated with this species. Z^a = Z^s (see Eq. (8.28)) is the set of actions which agent a can perform, Γ^a is the set of resource types used by the agent, and Ω^a is the set of information types. The basic activities of agent a in CCoEMAS, with the use of profiles, are presented in Algorithm 7.

8 A Review of Agent-Based Co-Evolutionary Algorithms 189

Algorithm 7. Basic activities of the agent a in CCoEMAS

r^γ ← r^γ_init
while r^γ > 0 do
  activate the profile pri ∈ PR^a with the highest priority and with the active goal gl∗j ∈ GL^pri
  if pr1 is activated then
    if 0 < r^γ < r^γ_min then
      seek, get
      r^γ ← r^γ + r^γ_get
    else if r^γ = 0 then
      die
  else if pr2 is activated then
    if r^γ > r^{rep,γ}_min then
      ⟨seekPartner, clone, rec, mut⟩
      r^γ ← r^γ − r^{rep,γ}_give
  else if pr3 is activated then
    if accept is activated then
      r^γ ← r^γ − r^{rep,γ}_give
    else if give is activated then
      r^γ ← r^γ − r^γ_get
  else if pr4 is activated then
    if r^γ > r^{mig,γ}_min then
      migr
      r^γ ← r^γ − r^{mig,γ}_min
end

Profiles

The partially ordered set of profiles includes the resource profile (pr1), the reproduction profile (pr2), the interaction profile (pr3), and the migration profile (pr4):

PR^a = {pr1, pr2, pr3, pr4} (8.33a)

pr1 ⪯ pr2 ⪯ pr3 ⪯ pr4 (8.33b)

The resource profile is defined as follows:

pr1 = ⟨Γ^pr1 = Γ, Ω^pr1 = {ω3}, M^pr1 = {i^ω3}, ST^pr1, RST^pr1 = ST^pr1, GL^pr1⟩ (8.34)

190 R. Dreżewski and L. Siwik

The goal of the pr1 profile is to keep the amount of resources above the minimal level or to die when the amount of resources falls to zero. This profile uses the model M^pr1 = {i^ω3}.

The reproduction profile is defined as follows:

pr2 = ⟨Γ^pr2 = Γ, Ω^pr2 = {ω2}, M^pr2 = {i^ω2}, ST^pr2, RST^pr2 = ST^pr2, GL^pr2⟩ (8.36)

The only goal of the pr2 profile is to reproduce. In order to realize this goal the agent can use the strategy of reproduction ⟨seekPartner, clone, rec, mut⟩. During the reproduction the agent transfers the amount r^{rep,γ}_give of resources to the offspring.

The interaction profile is defined as follows:

pr3 = ⟨Γ^pr3 = Γ, Ω^pr3 = {ω2, ω3}, M^pr3 = {i^ω2, i^ω3}, ST^pr3 = {accept, give}, RST^pr3 = ST^pr3, GL^pr3⟩ (8.38)

The goal of the pr3 profile is to interact with agents from other species with the use of the accept and give strategies.

The migration profile is defined as follows:

pr4 = ⟨Γ^pr4 = Γ, Ω^pr4 = {ω1}, M^pr4 = {i^ω1}, ST^pr4 = {migr}, RST^pr4 = ST^pr4, GL^pr4⟩ (8.39)

The goal of the pr4 profile is to migrate within the environment. In order to realize such a goal the migration strategy migr is used, which firstly chooses the node on the basis of the information {i^ω1} and then realizes the migration. As a result of migrating the agent loses some of its resources.

8.3.2 Co-Evolutionary Multi-Agent System with Predator-Prey Interactions (PPCoEMAS)

The co-evolutionary multi-agent system with predator-prey interactions (PPCoEMAS) is defined as follows (see Eq. (8.1)):

PPCoEMAS = ⟨E, S, Γ, Ω⟩ (8.40)

8 A Review of Agent-Based Co-Evolutionary Algorithms 191

The set of species includes two species, preys and predators: S = {prey, pred}. Two information types (Ω = {ω1, ω2}) and one resource type (Γ = {γ}) are used. Information of type ω1 denotes the nodes to which an agent can migrate. Information of type ω2 denotes the prey agents that are located within a particular node at time t.

The prey species (prey) is defined as follows:

prey = ⟨A^prey, SX^prey = {sx}, Z^prey, C^prey⟩ (8.41)

where SX^prey is the set of sexes which exist within the prey species, Z^prey is the set of actions that agents of species prey can perform, and C^prey is the set of relations of the prey species with other species that exist in the PPCoEMAS.

Actions

The set of actions Z^prey is defined as follows:

Z^prey = {die, seek, get, give, accept, seekPartner, clone, rec, mut, migr} (8.42)

where:

• die is the action of death (prey dies when it is out of resources);

• seek action seeks another prey agent that is dominated by the prey performing this action or is too close to it in the criteria space;
• get action gets some resource from another prey agent located within the same node, which is dominated by the agent that performs the get action or is too close to it in the criteria space;

• give action gives some resource to another agent (which performs get action);

• accept action accepts partner for reproduction when the amount of resource pos-

sessed by the prey agent is above the given level;

• seekPartner action is used in order to find the partner for reproduction when the

amount of resource is above the given level and agent can reproduce;

• clone is the action of producing offspring (parents give some of their resources

to the offspring during this action);

• rec is the recombination operator (intermediate recombination is used [1]);

• mut is the mutation operator (mutation with self-adaptation is used [1]);

• migr is the action of migrating from one node to another. During this action agent

loses some of its resource.

Relations

The set of relations of the prey species with other species that exist within the system is defined as follows:

C^prey = { −→^{prey,get−}, −→^{prey,give+} } (8.43)

192 R. Dreżewski and L. Siwik

The first relation models intra-species competition for limited resources:

−→^{prey,get−} = { ⟨prey, prey⟩ } (8.44)

The second relation models the prey-predator interactions, in which a prey gives its resources to a predator:

−→^{prey,give+} = { ⟨prey, pred⟩ } (8.45)

The predator species (pred) is defined as follows:

pred = ⟨A^pred, SX^pred = {sx}, Z^pred, C^pred⟩ (8.46)

Actions

The set of actions Z^pred is defined as follows:

Z^pred = {seek, getFromPrey, migr} (8.47)

where:

• The seek action allows finding the "worst" (according to the criterion associated with the given predator) prey located within the same node as the predator;
• The getFromPrey action gets all resources from the chosen prey;
• The migr action allows the predator to migrate between nodes of the graph H—this results in losing some of the resources.

Relations

The set of relations of the pred species with other species that exist within the system is defined as follows:

C^pred = { −→^{pred,getFromPrey−} } (8.48)

−→^{pred,getFromPrey−} = { ⟨pred, prey⟩ } (8.49)

As a result of a predator performing the getFromPrey action and taking all of its resources, the selected prey dies.

Agent a of species prey (a ≡ a^prey) is defined as follows:

a = ⟨gn^a, Z^a, Γ^a, Ω^a, PR^a⟩ (8.50)


Algorithm 8. Basic activities of the prey agent in PPCoEMAS

r^γ ← r^γ_init
while r^γ > 0 do
  activate the profile pri ∈ PR^a with the highest priority and with the active goal gl∗j ∈ GL^pri
  if pr1 is activated then
    if 0 < r^γ < r^γ_min then
      seek, get
      r^γ ← r^γ + r^γ_get
    else if r^γ = 0 then
      die
  else if pr2 is activated then
    if r^γ > r^{rep,γ}_min then
      if ⟨seekPartner, clone, rec, mut⟩ is performed then
        r^γ ← r^γ − r^{clone,γ}_give
      else if accept is performed then
        r^γ ← r^γ − r^{accept,γ}_give
  else if pr3 is activated then
    if get is performed by a prey agent then
      give
      r^γ ← r^γ − r^γ_give
    else if get is performed by a predator agent then
      give
      r^γ ← 0
  else if pr4 is activated then
    if r^γ > r^{mig,γ}_min then
      migr
      r^γ ← r^γ − r^{mig,γ}_min
end

where the genotype gn^a is composed of two vectors: x of decision parameters' values and σ of standard deviations' values, which are used during mutation with self-adaptation. Z^a = Z^prey (see Eq. (8.42)) is the set of actions which agent a can perform, Γ^a is the set of resource types used by the agent, and Ω^a is the set of information types. The basic activities of agent a are presented in Algorithm 8.

Profiles

The partially ordered set of profiles includes resource profile (pr1 ), reproduction

profile (pr2 ), interaction profile (pr3 ), and migration profile (pr4 ):

194 R. Dreżewski and L. Siwik

PR^a = {pr1, pr2, pr3, pr4} (8.51a)

pr1 ⪯ pr2 ⪯ pr3 ⪯ pr4 (8.51b)

The resource profile is defined as follows:

pr1 = ⟨Γ^pr1 = Γ, Ω^pr1 = {ω2}, M^pr1 = {i^ω2}, ST^pr1, RST^pr1 = ST^pr1, GL^pr1⟩ (8.52)

The goal of the pr1 profile is to keep the amount of resources above the minimal level or to die when the amount of resources falls to zero. This profile uses the model M^pr1 = {i^ω2}.

The reproduction profile is defined as follows:

pr2 = ⟨Γ^pr2 = Γ, Ω^pr2 = {ω2}, M^pr2 = {i^ω2}, ST^pr2, RST^pr2 = ST^pr2, GL^pr2⟩ (8.54)

The only goal of the pr2 profile is to reproduce. In order to realize this goal the agent can use the strategy of reproduction ⟨seekPartner, clone, rec, mut⟩ or can accept partners for reproduction (accept).

The interaction profile is defined as follows:

pr3 = ⟨Γ^pr3 = Γ, Ω^pr3 = ∅, M^pr3 = ∅, ST^pr3 = {give}, RST^pr3 = ST^pr3, GL^pr3⟩ (8.56)

The goal of the pr3 profile is to interact with predators and preys with the use of

strategy give.

The migration profile is defined as follows:

pr4 = ⟨Γ^pr4 = Γ, Ω^pr4 = {ω1}, M^pr4 = {i^ω1}, ST^pr4 = {migr}, RST^pr4 = ST^pr4, GL^pr4⟩ (8.57)

The goal of the pr4 profile is to migrate within the environment. In order to realize such a goal the migration strategy is used, which firstly chooses the node and then realizes the migration. As a result of migrating the prey loses some amount of resource.


Algorithm 9. Basic activities of the predator agent in PPCoEMAS

r^γ ← r^γ_init
while r^γ > 0 do
  activate the profile pri ∈ PR^a with the highest priority and with the active goal gl∗j ∈ GL^pri
  if pr1 is activated then
    if 0 < r^γ < r^γ_min then
      seek, getFromPrey
      r^γ ← r^γ + r^{prey,γ}_get /* r^{prey,γ}_get are all the resources of the prey agent that was chosen by a */
  else if pr2 is activated then
    if r^γ > r^{mig,γ}_min then
      migr
      r^γ ← r^γ − r^{mig,γ}_min
end

Agent a of species pred is defined analogously to the prey agent (see Eq. (8.50)). There exist two main differences. The genotype of a predator agent consists only of the information about the criterion associated with the given agent. The set of profiles consists of only two profiles, the resource profile (pr1) and the migration profile (pr2): PR^a = {pr1, pr2}, where pr1 ⪯ pr2. The basic activities of agent a are presented in Algorithm 9.
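The predator's resource profile (Algorithm 9) boils down to locating the worst prey in the node according to the predator's criterion and draining all of its resources. A sketch with illustrative data structures (dictionaries for agents, a plain function for the criterion):

```python
def predator_step(predator, preys, criterion):
    """Find the 'worst' prey in the node w.r.t. the predator's criterion
    (seek) and take all of its resources (getFromPrey); the drained
    prey then dies."""
    if not preys:
        return None
    # Minimization problem: the worst prey has the highest criterion value.
    worst = max(preys, key=lambda p: criterion(p["x"]))
    predator["r"] += worst["r"]  # transfer everything to the predator
    worst["r"] = 0               # out of resources -> the prey dies
    preys.remove(worst)
    return worst

f1 = lambda x: x[0] ** 2 + x[1] ** 2  # one criterion, cf. Eq. (8.60)
preys = [{"x": (0.1, 0.1), "r": 20}, {"x": (3.0, 0.0), "r": 50}]
predator = {"r": 5}
killed = predator_step(predator, preys, f1)
print(predator["r"], len(preys))  # 55 1
```

This per-criterion culling is what couples the predator population to selection pressure on the preys: each predator removes preys that are poor on its associated objective.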

Profiles

pr1 = ⟨Γ^pr1 = Γ, Ω^pr1 = {ω2}, M^pr1 = {i^ω2}, ST^pr1 = {seek, getFromPrey}, RST^pr1 = ST^pr1, GL^pr1⟩ (8.58)

The goal of the pr1 profile is to keep the amount of resource above the minimal level with the use of the strategy ⟨seek, getFromPrey⟩.

The migration profile is defined as follows:

pr2 = ⟨Γ^pr2 = Γ, Ω^pr2 = {ω1}, M^pr2 = {i^ω1}, ST^pr2 = {migr}, RST^pr2 = ST^pr2, GL^pr2⟩ (8.59)

The goal of the pr2 profile is to migrate within the environment. In order to realize this goal the migration strategy migr is used. The realization of the migration strategy results in losing some of the resource possessed by the agent.


The agent-based co-evolutionary approaches for multi-objective optimization presented formally in Section 8.3 have been tentatively assessed. The preliminary results obtained during the experiments were presented in some of our previous papers, and in this section they are shortly summarized.

8.4.1 Test Problems and Benchmarking Algorithms

As a test problem, firstly the slightly modified so-called Laumanns multi-objective problem was used, which is defined as follows [15, 18]:

Laumanns:
  f1(x) = x1^2 + x2^2
  f2(x) = (x1 + 2)^2 + x2^2   (8.60)
  −5 ≤ x1, x2 ≤ 5

Secondly, the so-called Kursawe problem was used. Its definition is as follows [18]:

Kursawe:
  f1(x) = Σ_{i=1}^{n−1} ( −10 exp( −0.2 √(xi^2 + x_{i+1}^2) ) )
  f2(x) = Σ_{i=1}^{n} ( |xi|^0.8 + 5 sin(xi^3) )   (8.61)
  n = 3, −5 ≤ x1, x2, x3 ≤ 5
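Both benchmark problems are straightforward to evaluate directly; the sketch below implements the standard formulations of the Laumanns and Kursawe objectives (both to be minimized):

```python
import math

def laumanns(x1, x2):
    """Laumanns problem: two quadratic objectives, -5 <= x1, x2 <= 5."""
    return x1 ** 2 + x2 ** 2, (x1 + 2) ** 2 + x2 ** 2

def kursawe(x):
    """Kursawe problem with n = len(x) (n = 3 in the chapter):
    disconnected Pareto set and Pareto frontier."""
    f1 = sum(-10.0 * math.exp(-0.2 * math.sqrt(x[i] ** 2 + x[i + 1] ** 2))
             for i in range(len(x) - 1))
    f2 = sum(abs(xi) ** 0.8 + 5.0 * math.sin(xi ** 3) for xi in x)
    return f1, f2

print(laumanns(0.0, 0.0))        # (0.0, 4.0)
print(kursawe([0.0, 0.0, 0.0]))  # (-20.0, 0.0)
```

The conflict between the objectives is already visible at the origin: f1 of the Laumanns problem is minimal there while f2 is not, so no single point minimizes both.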

In one of the experiments discussed shortly in this chapter the problem of building an effective portfolio was used. The assumed definition, as well as the true Pareto frontier for this problem, can be found in [16].

Obviously, during our experiments well known and commonly used test suites were also applied, inter alia the ZDT test suite ([19, p. 57–63], [21], [5, p. 356–362], [4, p. 194–199]).

[Fig. 8.3: closeness to the true Pareto frontier and dispersion of solutions over the whole approximation of the frontier]


When assessing multi-objective algorithms one has to take into account both the closeness to the true Pareto frontier as well as the dispersion of the found non-dominated solutions over the whole (approximation of the) Pareto frontier (see Figure 8.3). In consequence, although using only one single measure when assessing the effectiveness of (evolutionary) algorithms for multi-objective optimization is not enough [23], the Hypervolume Ratio (HVR) measure [20] allows for estimating both of these aspects, and therefore in this chapter the discussion and presentation of the obtained results is based on this very measure.

The hypervolume (HV), or hypervolume ratio (HVR), describes the area covered by the solutions of the obtained result set. For each solution, a hypercube is evaluated with respect to a fixed reference point. In order to evaluate the hypervolume ratio, the value of the hypervolume for the obtained set is normalized with the hypervolume value computed for the true Pareto frontier. HV and HVR are defined as follows:

HV = v( ⋃_{i=1}^{N} vi ) (8.62a)

HVR = HV(PF∗) / HV(PF) (8.62b)

where PF∗ is the obtained approximation of the Pareto frontier and PF is the true Pareto frontier.
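For a two-objective minimization problem the hypervolume can be computed by sweeping the mutually non-dominated points sorted by the first objective; HVR then follows directly from Eq. (8.62b). A sketch (the point sets and the reference point are illustrative assumptions):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of mutually non-dominated points
    (both objectives minimized) w.r.t. reference point `ref`."""
    pts = sorted(points)  # ascending in f1 implies descending in f2
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

def hvr(front, true_front, ref):
    """Hypervolume ratio: HV of the obtained front normalized by the
    HV of the true Pareto frontier."""
    return hypervolume_2d(front, ref) / hypervolume_2d(true_front, ref)

true_pf = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
approx = [(0.1, 1.0), (0.6, 0.5), (1.0, 0.1)]
print(hvr(approx, true_pf, ref=(2.0, 2.0)))  # ≈ 0.923
```

The sweep works because, for a non-dominated set, sorting by f1 makes each point add a disjoint rectangle between its own f2 and the previous point's f2.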

To assess PPCoEMAS and CCoEMAS in a quantitative way, a comparison with results obtained with the use of state-of-the-art algorithms has to be made. That is why we compare the results obtained by the approaches discussed in this chapter with the results obtained by the NSGA-II [6, 7] and SPEA2 [12, 22] algorithms, since these very algorithms are among the most efficient and most commonly used evolutionary multi-objective optimization algorithms. Additionally, the obtained results are also compared with the NPGA [13] and PPES [15] algorithms.

8.4.2 Assessing the Co-Evolutionary Multi-Agent System with Co-Operation Mechanism (CCoEMAS)

The co-evolutionary multi-agent system with co-operation mechanism (CCoEMAS) presented in Section 8.3.1 was assessed tentatively using, inter alia, the ZDT test suite. The population sizes of CCoEMAS and of the benchmarking algorithms (NSGA-II and SPEA2) assumed during the presented experiments were as follows: CCoEMAS—200, NSGA-II—300 and SPEA2—100. The selected parameters and their values assumed during those experiments were as follows: r^γ_init = 50 (the level of resources possessed initially by an individual just after its creation), r^γ_get = 30 (the amount of resources transferred in the case of domination), r^{rep,γ}_min = 30 (the level of resources required for reproduction), and pmut = 0.5 (mutation probability).



Fig. 8.4 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s

problems ZDT1 (a) and ZDT2 (b) [11]

As one may see after the analysis of the results presented in Figures 8.4 and 8.5, CCoEMAS, as an algorithm not so complex as NSGA-II or SPEA2, initially allows for obtaining better solutions, but with time the classical algorithms—especially NSGA-II—are the better alternatives. It is, however, worth mentioning that in the case of some of the problems initially the classical algorithms seem to be better alternatives, but finally CCoEMAS allows for obtaining better solutions (observed as higher values of the HVR metric). A deeper analysis of the results obtained during the presented experiments can be found in [11].

Fig. 8.5 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler's problems ZDT3 (a), ZDT4 (b) and ZDT6 (c) [11]

8.4.3 Assessing the Co-Evolutionary Multi-Agent System with Predator-Prey Interactions (PPCoEMAS)

In this section some selected results regarding the co-evolutionary multi-agent system with predator-prey interactions presented in Section 8.3.2 are given. Among others, PPCoEMAS was assessed with the use of some of the classical benchmarking problems presented in Section 8.4.1: firstly the Laumanns [15] and Kursawe [14] test problems were used. Also, classical algorithms other than NSGA-II and SPEA2 were used during the experiments with the predator-prey approach: this time the predator-prey evolutionary strategy (PPES) and the niched-pareto genetic algorithm (NPGA) were employed. In this section only a kind of summary of the obtained results is given; a more detailed analysis can be found in [9, 16].


Fig. 8.6 Pareto frontier approximations obtained by PPCoEMAS (a) and PPES (b) algo-

rithms for Laumanns problem after 6000 steps [9]


Fig. 8.7 The value of HV (a) and HVR (b) measure for Laumanns problem obtained by

PPCoEMAS, PPES and NPGA after 6000 steps

In the very first experiments with PPCoEMAS the relatively simple Laumanns test problem was used. In Figure 8.6 the Pareto frontier approximations obtained by the PPCoEMAS and PPES algorithms are presented, and in Figure 8.7 the values of the HV and HVR metrics for all three algorithms being compared (PPCoEMAS, PPES and NPGA) are shown. As can be seen, the differences between the analyzed algorithms are not so distinct; however, the proposed PPCoEMAS system seems to be the best alternative.

The second problem used was the more demanding multi-objective Kursawe problem, with both a disconnected Pareto set and a disconnected Pareto frontier. In Figure 8.9 the final approximations of the Pareto frontier obtained by PPCoEMAS and by the reference algorithms after 6000 time steps are presented. As one may notice, there is no doubt that PPCoEMAS is definitely the best alternative, since it is able to obtain a Pareto frontier that is located very close to the model solution, that is very well dispersed and, what is also very important, more numerous than the PPES and NPGA-based solutions. The above observations are fully confirmed by the values of the HV and HVR metrics presented in Figure 8.8.

Fig. 8.8 The value of HV (a) and HVR (b) measure for Kursawe problem obtained by PPCoEMAS, PPES and NPGA after 6000 steps

The proposed co-evolutionary multi-agent system with predator-prey interactions was also assessed with the use of the problem of building an effective portfolio. In this case, each individual in the prey population is represented as a p-dimensional vector. Each dimension represents the percentage participation of the i-th (i ∈ 1 … p) share in the whole portfolio.
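Under a Markowitz-style reading of this two-objective problem (expected profit against risk; the exact formulation used in the chapter is the one given in [16]), a prey's genotype and its evaluation can be sketched as follows — all numbers below are illustrative assumptions, not data from the experiments:

```python
def normalize(w):
    """Make the p weights a valid portfolio (non-negative, summing to 1)."""
    w = [max(0.0, wi) for wi in w]
    total = sum(w) or 1.0  # guard against the all-zero genotype
    return [wi / total for wi in w]

def evaluate(weights, mean_returns, cov):
    """(profit, risk) of a portfolio: expected return and variance."""
    profit = sum(w * m for w, m in zip(weights, mean_returns))
    risk = sum(wi * wj * cov[i][j]
               for i, wi in enumerate(weights)
               for j, wj in enumerate(weights))
    return profit, risk

w = normalize([0.5, 0.2, 0.8])  # a 3-stock portfolio, cf. experiment I
mu = [0.10, 0.05, 0.15]         # assumed mean returns
cov = [[0.04, 0.00, 0.01],
       [0.00, 0.02, 0.00],
       [0.01, 0.00, 0.09]]      # assumed covariance matrix
print(evaluate(w, mu, cov))
```

Normalizing after every variation operator keeps mutated genotypes inside the feasible simplex, which is one simple way to pair the real-valued representation above with the operators of Section 8.3.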

During the presented experiments, Warsaw Stock Exchange quotations from 2003-01-01 until 2005-12-31 were taken into consideration. Simultaneously, the portfolio consists of the following three (experiment I) or seventeen (experiment II) stocks quoted on the Warsaw Stock Exchange—in experiment I: RAFAKO, PONARFEH, PKOBP; in experiment II: KREDYTB, COMPLAND, BETACOM, GRAJEWO, KRUK, COMARCH, ATM, HANDLOWY, BZWBK, HYDROBUD, BORYSZEW,


Fig. 8.9 Pareto frontier approximations for Kursawe problem obtained by PPCoEMAS (a),

PPES (b) and NPGA (c) after 6000 steps [9]

WIG20 has been taken into consideration.

In Figure 8.10 the final Pareto frontiers obtained using the PPCoEMAS, NPGA and PPES algorithms after 1000 steps in experiment I are presented. As one may notice, in this case the frontier obtained by PPCoEMAS is more numerous than the NPGA-based one and as numerous as the PPES-based one. Unfortunately, in this case the diversity of the population in the PPCoEMAS approach is visibly worse than in the case of the NPGA or PPES-based frontiers.


Fig. 8.10 Pareto frontier approximations (profit vs. risk) after 1000 steps obtained by PPCoEMAS (a), PPES (b), and NPGA (c) for building an effective portfolio consisting of 3 stocks [16]

A similar situation can be observed in Figure 8.11, which presents the Pareto frontiers obtained by PPCoEMAS, NPGA and PPES when the optimized portfolio consists of 17 shares. This time the PPCoEMAS-based frontier is again quite numerous and quite close to the true Pareto frontier, but its tendency to focus solutions around only selected parts of the whole frontier is very distinct. The explanation of this tendency can be found in [9, 16] and on the very


Fig. 8.11 Pareto frontier approximations (profit vs. risk) after 1000 steps obtained by PPCoEMAS (a), PPES (b), and NPGA (c) for building an effective portfolio consisting of 17 stocks [16]

general level it can be said that it is caused by the stagnation of the evolution process in PPCoEMAS. Hypothetical non-dominated average portfolios for experiments I and II are presented in Figures 8.12 and 8.13, respectively (in Figure 8.13 the shares are presented from left to right in the order in which they were listed above).


Fig. 8.12 Effective portfolio consisting of three stocks (RAFAKO, PONAR, PKOBP) proposed by PPCoEMAS after 1 step (a) and after 900 steps (b); the vertical axis is the percentage share in the portfolio [16]

Fig. 8.13 Effective portfolio consisting of seventeen stocks proposed by PPCoEMAS after 1 step (a) and after 900 steps (b); the vertical axis is the percentage share in the portfolio [16]


Agent-based (co-)evolutionary algorithms have already been applied in many different domains, including multi-modal optimization, multi-objective optimization, and financial problems. Agent-based models of evolutionary algorithms allow different bio-inspired techniques and algorithms to be mixed and used simultaneously within one coherent agent model, and allow new biologically and socially inspired operators and mechanisms to be added in a very natural way. Agent-based models of evolutionary algorithms also support parallel and decentralized computations without any additional changes, because these models are decentralized and use asynchronous computations.

In this chapter we have presented two selected agent-based co-evolutionary algorithms for multi-objective optimization: one of them uses co-operative mechanisms and the other a predator-prey mechanism. Formal models of these systems, as well as results of experiments with standard multi-objective test problems and the financial problem of multi-objective portfolio optimization, were presented. The results of the experiments show that agent-based algorithms may obtain quite satisfactory results, comparable to, and for some problems even better than, those of state-of-the-art multi-objective evolutionary algorithms, although there is of course still room for improvement and further research. The presented results also lead to the conclusion that no single existing evolutionary algorithm for multi-objective optimization can solve all problems best; there is, and always will be, space for new algorithms and improvements suited to particular problems.

Future research on agent-based models will concentrate on improvements to the already proposed algorithms as well as on new algorithms and techniques. Examples of new techniques which may be incorporated into agent-based models of evolutionary algorithms include cultural and immunological mechanisms. Another direction of development would be adding a social and economic layer to the existing biological one and using such agent-based models for the modeling and simulation of complex and emergent phenomena from social and economic life.

References

1. Bäck, T., Fogel, D., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation.

IOP Publishing and Oxford University Press (1997)

2. Cetnarowicz, K., Kisiel-Dorohinicki, M., Nawarecki, E.: The application of evolution

process in multi-agent world to the prediction system. In: Tokoro, M. (ed.) Proceedings

of the 2nd International Conference on Multi-Agent Systems (ICMAS 1996). AAAI

Press, Menlo Park (1996)

3. Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary Algorithms for Solving Multi-

Objective Problems, 2nd edn. Springer, New York (2007)

4. Coello Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary algorithms for solv-

ing multi-objective problems, 2nd edn. Genetic and evolutionary computation. Springer,

Heidelberg (2007)

5. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley &

Sons, Chichester (2001)


6. Deb, K., Agrawal, S., Pratab, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting

Genetic Algorithm for Multi-Objective Optimization: NSGA-II. In: Deb, K., Rudolph,

G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000.

LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000),

citeseer.ist.psu.edu/article/deb00fast.html

7. Deb, K., Pratab, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 181–197 (2002)

8. Dreżewski, R.: A model of co-evolution in multi-agent system. In: Mařı́k, V., Müller, J.P.,

Pěchouček, M. (eds.) CEEMAS 2003. LNCS (LNAI), vol. 2691, pp. 314–323. Springer,

Heidelberg (2003)

9. Dreżewski, R., Siwik, L.: The application of agent-based co-evolutionary system with

predator-prey interactions to solving multi-objective optimization problems. In: Proceed-

ings of the 2007 IEEE Symposium Series on Computational Intelligence. IEEE, Los

Alamitos (2007)

10. Dreżewski, R., Siwik, L.: Agent-based co-evolutionary techniques for solving multi-

objective optimization problems. In: Kosiński, W. (ed.) Advances in Evolutionary Al-

gorithms. IN-TECH, Vienna (2008)

11. Dreżewski, R., Siwik, L.: Agent-based co-operative co-evolutionary algorithm for multi-

objective optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M.

(eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 388–397. Springer, Heidelberg

(2008)

12. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm for multiobjective optimization. In: Giannakoglou, K., et al. (eds.) Evolutionary Methods for Design, Optimisation and Control with Application to Industrial Problems (EUROGEN 2001). International Center for Numerical Methods in Engineering (CIMNE), pp. 95–100 (2002)

13. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched pareto genetic algorithm for multi-

objective optimization. In: Proceedings of the First IEEE Conference on Evolutionary

Computation. IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87.

IEEE Service Center, Piscataway (1994),

citeseer.ist.psu.edu/horn94niched.html

14. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.-

P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 193–197. Springer, Heidelberg

(1991), citeseer.ist.psu.edu/kursawe91variant.html

15. Laumanns, M., Rudolph, G., Schwefel, H.P.: A spatial predator-prey approach to multi-

objective optimization: A preliminary study. In: Eiben, A.E., Bäck, T., Schoenauer, M.,

Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 241. Springer, Heidelberg (1998)

16. Siwik, L., Dreżewski, R.: Co-evolutionary multi-agent system for portfolio optimization.

In: Brabazon, A., O’Neill, M. (eds.) Natural Computation in Computational Finance, pp.

273–303. Springer, Heidelberg (2008)

17. Spears, W.: Crossover or mutation? In: Proceedings of the 2-nd Foundation of Genetic

Algorithms, pp. 221–237. Morgan Kauffman, San Francisco (1992)

18. Van Veldhuizen, D.A.: Multiobjective evolutionary algorithms: Classifications, analyses

and new innovations. PhD thesis, Graduate School of Engineering of the Air Force Insti-

tute of Technology Air University (1999)

19. Zitzler, E.: Evolutionary algorithms for multiobjective optimization: methods and appli-

cations. PhD thesis, Swiss Federal Institute of Technology, Zurich (1999)


20. Zitzler, E., Thiele, L.: An evolutionary algorithm for multiobjective optimization: The

strength pareto approach. Tech. Rep. 43, Swiss Federal Institute of Technology, Zurich,

Gloriastrasse 35, CH-8092 Zurich, Switzerland (1998),

citeseer.ist.psu.edu/article/zitzler98evolutionary.html

21. Zitzler, E., Deb, K., Thiele, L.: Comparison of Multiobjective Evolutionary Algorithms:

Empirical Results. Evolutionary Computation 8(2), 173–195 (2000)

22. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Tech. Rep. TIK-Report 103, Computer Engineering and Networks Laboratory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology (ETH) Zurich, ETH Zentrum, Gloriastrasse 35, CH-8092 Zurich, Switzerland (2001)

23. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance

assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on

Evolutionary Computation 7(2), 117–132 (2003)

Chapter 9

A Game Theory-Based Multi-Agent System for

Expensive Optimisation Problems

This chapter presents the Game Theoretic Multi-Agent Solver (GTMAS), an approach designed to solve expensive optimisation problems. The approach relies on game theory and a

multi-agent framework in which a number of existing algorithms, cast as agents, are

deployed with the aim to solve the problem in hand as efficiently as possible. The

key factor for the success of this approach is a dynamic resource allocation biased

toward promising algorithms on the given problem. This is achieved by allowing

the agents to play a cooperative-competitive game the outcomes of which will be

used to decide which algorithms, if any, will drop out of the list of solver-agents

and which will remain in use. A successful implementation of this framework will

result in the most suited algorithm(s) for the given problem being predominantly

used on the available computing platform. In other words it guarantees the best

use of the resources both algorithms and hardware with the by-product being the

best approximate solution for the problem given the available resources. GTMAS is

tested on a standard collection of TSP problems. The results are included.

9.1 Introduction

Modelling problems arising in real-world applications, taking into account the nonlinearity and the combinatorial aspects of solution sets, often leads to optimisation problems that are expensive to solve; they are inherently intractable. Indeed, even checking a given solution for optimality is NP-hard [10, 17, 32]. It is, therefore, not reasonable, in general, to expect the optimum solution to be found in acceptable times. In general, one can only expect an approximate solution, the quality of which is crucial to its potential use.

Abdellah Salhi · Özgun Töreyen

Department of Mathematical Sciences, The University of Essex, Colchester CO4 3SQ, UK

e-mail: as@essex.ac.uk, otoreyen@aselsan.com.tr

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 211–232.

springerlink.com © Springer-Verlag Berlin Heidelberg 2010

212 A. Salhi and Ö. Töreyen

It is well known that, at least in the case of stochastic algorithms, the quality of the approximate solution (or the confidence the user may have in it) is proportional to the time spent in the search for it [22, 23, 24]. As in many applications there is a time constraint, a deadline beyond which a better approximate solution is of no use, it is essential that all available resources (software and hardware) be used as well as possible, to ensure that the best approximate solution under the circumstances is obtained. This is what the novel approach suggested here attempts to achieve.

To do so, it must:

1. find which algorithm(s), in the suite of algorithms, is the most appropriate for the given instance of the expensive optimisation problem;
2. replicate this algorithm(s) on all available processor nodes in a parallel environment, or allocate to it all of the remaining CPU time if a single-processor, or sequential, environment is used.

Point (1) above is dealt with through measuring the performance of the algorithms used. Point (2) is dealt with via the implementation of a cooperative/competitive game of the Iterated Prisoners' Dilemma (IPD) type [4, 7, 16, 20, 25, 27]. Although other paradigms of cooperative/competitive behaviour, such as the Stag Hunt game [7], could be used, the IPD seems appropriate. Note that implementing cooperation is fairly straightforward, while implementing competition is not; we believe, however, that competition is at least as important as cooperation between agents for an effective search. To the best of our knowledge, this is the first time that implementing competition for optimisation purposes has been attempted. We use payoff matrices as a handle to manipulate it. Two algorithms (agents) cooperate by exchanging their current solutions; they compete by not exchanging them. Note that, intuitively, cooperation may lead to early convergence to a local optimum, by virtue of propagating a given solution potentially to all algorithms and having all of them search the same area. Competition, on the other hand, may lead to good coverage of the search space by virtue of not sharing solutions, i.e. helping algorithms "stay away" from each other and, therefore, potentially explore different areas of the search space.

Although the study presents the prototype of a generic solver that can involve

any number of solver algorithms and run on any computing platform, here, a system

with only two search algorithms, implemented sequentially, is investigated. This

simplified model, however, has the inherent complexities of a system with many

more agents and should show how good or otherwise the general system can be for

expensive optimisation.

Note that the generic nature of this approach makes it applicable in any discipline

where problem solving is involved and more than one solution method is available.

This document is organised as follows. In Section 9.2, a brief literature review

is given. In Sections 9.3 and 9.4, the design and implementation of the system are explained. Section 9.5 explains how the system is applied to solve the Travelling

Salesman Problem. The results are presented in Section 9.6. Finally, conclusions

are drawn and future research prospects are outlined in Section 9.7.

9 A Game Theory-Based Multi-Agent System 213

9.2 Background

In the following, a brief review of the three main topics involved, i.e. optimisation, the IPD and agent systems, is given.

9.2.1 Optimisation

The general optimisation problem is of the global type, constrained, nonlinear, and involves mixed variables, i.e. both discrete and continuous variables. However, many optimisation problems encountered in real applications do not have all of these characteristics, but are still intractable. The 0-1 Knapsack problem, for example, involves only binary variables and has a single constraint, but is still NP-hard. The general optimisation problem can be cast in the following form: let f be a function from Rn to R and A ⊂ Rn; find x∗ ∈ A such that f (x∗ ) ≤ f (x) for all x ∈ A.
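As an illustration of this definition, the sketch below searches a finite sample of the feasible set A for a minimiser; the objective function and candidate set are arbitrary examples of our own, not taken from the chapter.

```python
def argmin_over(f, A):
    """Return x* in the finite candidate set A with f(x*) <= f(x) for all x in A."""
    best = None
    for x in A:
        if best is None or f(x) < f(best):
            best = x
    return best

# Example: minimise f(x) = (x - 2)^2 over the integers -5..5.
x_star = argmin_over(lambda x: (x - 2) ** 2, range(-5, 6))  # x* = 2
```

For expensive problems the feasible set cannot be enumerated like this, which is precisely why approximate search algorithms are needed.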

The Prisoners' Dilemma (PD), brought to attention by Merrill Flood of the RAND Corporation in 1951 and later formulated by Al Tucker [8, 11], is a popular paradigm for the problem of cooperation. Its formulation can be as follows.

Table 1 The payoff matrix of the Prisoners' Dilemma

                     Player 2
                     C            D
Player 1   C     R=3, R=3     S=0, T=5
           D     T=5, S=0     P=1, P=1

In the payoff matrix of Table 1, actions C and D stand for 'Cooperate' and 'Defect', and the payoffs R, P, T, and S stand for 'Reward', 'Punishment', 'Temptation', and 'Sucker's' payoff, respectively. This payoff matrix shows that defecting is beneficial to both players for two reasons. First, it leads to a greater payoff (T = 5) in case the other player cooperates (S = 0). Second, it is a safe move, because neither player knows what the other's move will be. So, to rational players, defecting is the best choice. But if both players choose to defect, then it leads to a worse payoff (P = 1) compared to cooperating (R = 3). That is the dilemma.

The special setting of the one-shot PD is seen by many to be contrary to the idea of cooperation. This is because the only equilibrium point is the outcome [P, P], which is a Nash equilibrium [7]. Also, [P, P] is at the intersection of the minimax strategy choices for both players. These minimax strategies are dominant for both players, hence the exclusion in principle of cooperation (by virtue of the dominance of the chosen strategies). Moreover, even if cooperative strategies were chosen, the resulting cooperative 'solution' is not an equilibrium point. This means that it is not stable


due to the fact that both players are tempted to defect from it. It should also be noted

that cooperative problems in real life are likely to be faced repeatedly. This makes

the IPD a more appropriate model for the study of cooperation than the one-shot

version of the game.

The PD game is characterised by the strict inequality relations between the payoffs: T > R > P > S. To avoid coordination or total agreement getting a 'helping hand', most experimental PD games have a payoff matrix satisfying 2R > S + T, as in Table 1.

A close analysis of the IPD reveals that, unlike the one-shot PD, it has a large number of Nash equilibria. These lie inside the convex hull of the outcomes (0,5), (3,3), (5,0), (1,1) of the pure strategies in the one-shot PD (see Figure 9.1). Note that (1,1), corresponding to [P, P], is a Nash equilibrium for the IPD also. For a comprehensive investigation of the IPD, please refer to [4, 5, 15].

Fig. 9.1 The convex hull of the pure-strategy outcomes (0, 5), (3, 3), (1, 1) and (5, 0) of the one-shot PD
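The repeated game just described can be exercised with a short simulation. The TIT-FOR-TAT and always-defect strategies below are standard illustrations of our own choosing, not the strategies analysed later in this chapter.

```python
# PD payoffs (row, column) for moves C/D, with T > R > P > S and 2R > S + T.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def play_ipd(strat1, strat2, rounds):
    """Iterate the PD; each strategy maps the opponent's previous move
    (None on the first round) to 'C' or 'D'. Returns the total payoffs."""
    s1 = s2 = 0
    prev1 = prev2 = None
    for _ in range(rounds):
        a1, a2 = strat1(prev2), strat2(prev1)
        p1, p2 = PAYOFF[(a1, a2)]
        s1, s2 = s1 + p1, s2 + p2
        prev1, prev2 = a1, a2
    return s1, s2

tit_for_tat = lambda prev: 'C' if prev in (None, 'C') else 'D'
always_defect = lambda prev: 'D'
```

Two TIT-FOR-TAT players sustain mutual cooperation (payoff 3 per round each), while an unconditional defector wins its first encounter and then locks both players into the inferior (P, P) outcome.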

Multi-Agent Systems (MAS) are collections of agents that work together to accomplish a task that is normally beyond the capabilities of a single agent [19, 30, 31]. In our case, however, every agent is capable of solving the instance of the optimisation problem in hand.

The agents in a classical MAS communicate between themselves, cooperate, collaborate and sometimes negotiate [1, 9, 12]. They do not normally compete. But they are allowed, in fact required, to do so in the framework used here.

The present paper describes a Game Theoretic Multi-Agent Solver (GTMAS) [18], which implements the ideas introduced above. It will be limited to three agents (Figure 9.2): a coordinator-agent; Solver-Agent 1 (SA1), running the Genetic Algorithm (GA) [14, 26]; and Solver-Agent 2 (SA2), running Simulated Annealing


(SA). GTMAS is tested on Travelling Salesman Problems (TSP) from TSPLIB [21].

The GTMAS architecture closely follows the goal-based MAS architecture of Park and Sugumaran [19] and the IPD model as described in [2], [3], [6].

Let the problem to be solved be ℘. The overall goal of GTMAS is to solve ℘ ef-

ficiently using the best available algorithm(s) in the system’s library. This is equiv-

alent to completing two tasks: selecting the best algorithm(s) among all available

algorithms and obtaining a ‘good’ solution.

The overall goal is divided into sub-goals which are then matched to the system's agents. Each of the solver-agents runs a different algorithm and tries to be

the first to solve ℘ by using as much as possible of the available computing facili-

ties, here CPU time. The coordinator-agent enables coordination, manages the game

through which the solver-agents compete for the facilities, allocates the facilities to

the solver-agents, and also communicates with the user, i.e. the owner of problem

℘, (Figure 9.2).

Solver-agents cooperate (C) or compete (D) with each other by sharing or not

their solutions. If an agent can take an opponent’s possibly better solution when it

is stuck in a local optimum say, and uses it, then it can improve its own search.

The decision to cooperate or to compete is autonomously made by the agents us-

ing their beliefs (no notable change in the objective value in the last few iterations,

for instance, may mean convergence to a local optimum), the history of the previous encounters with their opponents (the number of times they cooperated and

competed), and certain rules which follow observations of the behaviours of agents.

Some are explained below.

The rules are set to prevent the game from converging too soon to a near

pure competition game (which is equivalent to playing the GRIM strategy, [2]).


A go-it-alone type strategy cannot contribute to the solution quality more than running an algorithm on its own. These rules are:

• If the number of times SA1 knows the solution of SA2 increases, then the likeli-

hood that SA2 finds the solution to ℘ first, decreases. Therefore, SA2 is unlikely

to cooperate. Since all solver-agents are aware of this, they would cooperate less

and take their opponent’s solution more often, given the chance.

• If SA1 does not cooperate when SA2 cooperates, then SA2 would retaliate. This

leads to the TIT-FOR-TAT and the go-it-alone type strategy.

• If a solver-agent cooperates in the first encounter with another solver-agent, then

it can be perceived as in need of help; i.e. it is stuck at a local optimum. Agents,

therefore, perceive the first cooperation of their opponent as a “forceful invitation

to cooperate or else...” from bullet point 2 above.

There are all sorts of rules which are implicit in the IPD. Agents, however, do not

have to apply them systematically.

Recall that the solver-agents, in order to solve ℘, play the IPD game. In each en-

counter, they either cooperate (C) or compete (D).

Figure 9.3 shows an encounter of 2 agents after they both obtain their interme-

diary solutions. Node 1 is the starting node which shows the decision alternatives


of SA1. It can cooperate and end up in Node 2, or Node 3. Nodes 2 and 3 are the

decision nodes of SA2 which has the same two alternative decisions as SA1. Node 4

follows the cooperation of both of the agents that may result in a solution exchange.

Node 5 shows the situation where SA1 cooperates and SA2 competes which means

SA2 may take the solution of SA1 and SA1 takes nothing. Node 6 depicts the same

situation as that leading to node 5 but with agents taking different actions. In node

7, neither gives its solution; they continue without any exchange.

The decision tree is expanded further with branching from nodes 4-7, but with

alternatives now being: “Take the opponent’s solution” and “Do not take the op-

ponent’s solution”. This branching determines which agent has a better solution

and is essential for setting up the payoff matrices that drive the system. Eight new nodes (leaves) arise. Each pair of sibling nodes yields a different payoff matrix. The

labels (G)(for good) and (B)(for bad) refer to agents having a better solution than the

opponent or otherwise, respectively. The cells that are crossed refer to impossible

outcomes.

Managing the resources is based on the outcomes of the decisions of the solver-

agents. When an agent cooperates it gains one unit (of CPU time or equivalent in

terms of iterations it is allowed to do) and loses double that. When it competes it

gains two units and loses one (or half of the initial gain). This means, the GTMAS

payoff matrix rewards competition. The idea behind supporting competition is to

counter the “helping hand” that cooperation gets from the rules underpinning the

construction of GTMAS (see above). It can also be argued that, intuitively at least,

too frequent exchanging of solutions will lead to early convergence to local optima.

So, competition gives solver-agents the chance to cover the search space better.

The 4 payoff matrices in Figure 9.3 can be combined in one payoff matrix

(Table 9.2).

Table 9.2 Combined Payoff Table for Evaluating and Rewarding Agents

                  B
                  C           D
G       C     (1, -2)     (1, -1)
        D     (2, -2)     (2, -1)
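Read this way, a solver-agent's net stage reward depends only on its own move and on whether it currently holds the better solution; a minimal sketch with our own naming:

```python
def stage_reward(action, has_better_solution):
    """Net resource reward of one stage, matching the combined payoff table:
    the agent with the better solution (G) nets +1 for C and +2 for D;
    the agent with the worse solution (B) nets -2 for C and -1 for D."""
    if has_better_solution:                 # row player G
        return 1 if action == 'C' else 2
    return -2 if action == 'C' else -1      # column player B
```

This mirrors the gain/loss description in the text: cooperation gains one unit and then loses two, while competition gains two units and then loses one.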

The equilibrium point of this payoff matrix is (D, D), with payoffs 2 and -1; it is also a regret-free point. The payoff matrix at the core of GTMAS is different from those commonly found in the literature. Such matrices would normally be drawn immediately after the decisions have been taken, i.e. at nodes 4 to 7 in Figure 9.3; here, they are drawn after further decisions are taken. In fact, one can highlight three main differences:


(i) The return of a player is not dependent on the opponent’s choice directly. Whether

the opponent cooperates or competes becomes only relevant after the exchange of

solutions has been decided;

(ii) The payoff is affected by what has been achieved in terms of the quality of

solution after exchange (or otherwise of solutions);

Unlike traditional games, here, after the players (solver-agents) have made their choices, they are given a chance to progress with the consequences of those choices. Only after that are they rewarded/punished. This was made explicit in the above

paragraph where reasons for rewarding competition/penalising cooperation, were

given; for instance when we said that a cooperating agent “gains one unit and loses

double that”, we meant that the solver-agent runs first for a unit of CPU time (or

equivalent in iterations) and only after that is it penalised by taking 2 units of CPU

time from its account. Basically, the quality of the solution following decisions has

to be measured first before the payoffs are allocated. Time is an important factor in

the IPD.

(iii) The third difference is that the players are not “Solver-Agent 1” and “Solver-

Agent 2”, but instead “Solver-Agent with the better solution” and “Solver-Agent

with the worse solution”. The configuration of the table may change at each stage

according to the solution qualities of the solver-agents. The one with better solution

is always placed as the row player.

GTMAS is a generic (n + 1)-agent system that consists of a coordinator-agent and

n solver-agents. It can be seen as a loosely coupled hybrid algorithm that uses

any number of algorithms contributing to the hybridisation. The pseudocode of the

agents is given below.

Coordinator-Agent Pseudocode

1. Initialise belief. Initialise resources, n.
2. For Nstage stages, play the game and update belief,
   where Nstage limits the number of stages the game is played.
2.1. Start decision phase: Run the solver-agents to decide.
2.2. Manage the solution exchange.
2.3. Start competition phase: Run the solver-agents to compete.
2.4. Evaluate and reward/punish the solver-agents.
     Update resources to m1 and m2 iterations, where
     m_i = n + ∑_{j=1}^{currentstage−1} r_ij, and r_ij is the reward of agent i at stage j.
2.5. Increment stage.
3. End the game. Select the best algorithm. Report the results.
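The resource-update rule in step 2.4 can be sketched as follows; the dictionary-based interface is our own illustration, not the authors' implementation.

```python
def budgets(n, reward_history):
    """Iteration budget of each solver-agent at the current stage:
    m_i = n + sum of the rewards r_ij earned in all previous stages
    (rewards may be negative, i.e. punishments)."""
    return {agent: n + sum(rewards) for agent, rewards in reward_history.items()}

# Example: with a base budget of 10 iterations, a GA agent rewarded
# 1, 2, -2 in earlier stages and an SA agent rewarded -1, -1:
print(budgets(10, {'GA': [1, 2, -2], 'SA': [-1, -1]}))  # {'GA': 11, 'SA': 8}
```

In this way the better-performing algorithm gradually accumulates more of the available computing resources, which is the dynamic resource allocation the chapter aims for.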


Solver-Agents Pseudocode

1. Initialise belief.

2. If it is a decision phase, do:

2.1. If it is the first stage, do:

2.1.1. Initialise memory and algorithm specific parameters.

2.1.2. Run own algorithm for n iterations.

2.1.3. Cooperate.

2.1.4. End run. Send the results to the Coordinator-Agent.

2.2. If it is the second stage, do:

2.2.1. Update belief.

2.2.2. Run own algorithm for mi iterations.

2.2.3. Compete.

2.2.4. End run. Send the results to the Coordinator-Agent.

2.3. If stage > 2, do:

2.3.1. Update belief.

2.3.2. Run own algorithm for mi iterations.

2.3.3. Decide to cooperate/compete.

2.3.4. End run. Send the results to the Coordinator-Agent.

3. If it is a competition phase, do:

3.1. Update belief.

3.2. Run own algorithm for n iterations.

3.3. End run. Send the results to the Coordinator-Agent.

solver-agents. GTMAS starts with initialisation of the coordinator-agent. The

coordinator-agent reads the problem data and initialises the payoff table. It also

initialises the resources accounts (seconds of CPU time or number of iterations)

assigned to the solver-agents. The overall iterations of GTMAS are called stages

which consist of decision and competition phases. In the decision phases, the

coordinator-agent asks the solver-agents for their decisions.

The solver-agents start with the initialisation of their memory (the outcomes of encounters with their opponents), algorithm-specific parameters (e.g. the population for the Genetic Algorithm) and decision parameters (η, β, α, σ and γ). After they run for the given number of iterations determined by the coordinator-agent to obtain their initial solutions for ℘, they make their decisions.

The decisions in the first two stages are not subject to analysis since no historical

data (memory of earlier encounters) exist yet. Both agents cooperate in the first

stage and compete in the second stage, regardless of the results and without prior

analysis. The following stages differ from the first two stages with the introduction

of memory. Decisions are made according to procedure Decide() below.

Let SA1 be a solver-agent and SA2 its opponent in an IPD game. SA1 decides

whether to compete or cooperate according to the following procedure.


0. Begin

1. If (SA2 has cooperated in the last move) then

1.1. If (P(IC|OC) < σ %) then

1.1.1. Cooperate.

1.2. Else If (P(OC|IC) < η %) then
1.2.0. Compete.

1.2.1. Else If (SA1 didn’t improve by γ % in last 2 stages) then

1.2.1.0. Decide randomly to cooperate or compete

1.2.1.1. Else

1.2.1.1.0. Compete.

1.2.1.2. End If

1.2.2. End If

1.3. End If

2. Else (Ask SA2 for its solution);

2.1. If (SA2 solution is α % better than that of SA1) then

2.1.1. If (SA1 is stuck with γ %) then

2.1.1.0. Cooperate.

2.1.2. Else If (SA2 solution is 2α % better than that of SA1) then

2.1.2.1. If (P(OC|ID) > β %) then

2.1.2.1.0. Compete.

2.1.2.2. Else

2.1.2.2.0. Cooperate.

2.1.2.3. End If

2.1.3. End If

2.3. Else

2.3.1. Compete.

2.4. End If

3. End If

4. Stop
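The procedure above can be transcribed roughly as follows. This is one consistent reading of the pseudocode together with the prose explanation in the surrounding text; all parameter names are ours, and the inputs (responsiveness estimates, recent improvement, and the opponent's relative advantage) are assumed to be computed elsewhere by the agent.

```python
import random

def decide(opp_cooperated_last, p_ic_oc, p_oc_ic, p_oc_id,
           own_improvement, opp_advantage,
           sigma, eta, gamma, alpha, beta):
    """Return 'C' (cooperate) or 'D' (compete) for one IPD encounter.

    p_*             -- responsiveness estimates from the agent's memory
    own_improvement -- relative improvement over the last 2 stages
    opp_advantage   -- relative quality gap of the opponent's solution
    sigma, eta, gamma, alpha, beta -- decision thresholds
    """
    if opp_cooperated_last:
        if p_ic_oc < sigma:             # own responsiveness is low: be nice
            return 'C'
        if p_oc_ic < eta:               # opponent is not responsive enough
            return 'D'
        if own_improvement < gamma:     # no notable progress: random choice
            return random.choice(['C', 'D'])
        return 'D'                      # progressing well: compete
    # The opponent competed last time: compare solution qualities.
    if opp_advantage < alpha:           # opponent is not clearly better
        return 'D'
    if own_improvement < gamma:         # stuck in a local optimum: ask for help
        return 'C'
    if opp_advantage >= 2 * alpha:      # opponent is much better
        return 'D' if p_oc_id > beta else 'C'
    return 'D'
```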

In the decision-making process, P(IC|OC) is the probability that SA1 will cooperate

in the next iteration given that SA2 cooperates in this iteration. It is equal to the

ratio between the number of times SA1 cooperates in the (n + 1)st iteration given

that SA2 cooperated in the nth iteration and the total number of encounters.

P(OC|IC) is the probability that SA2 will cooperate in the next iteration given that

SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2

cooperates in (n + 1)st iteration given SA1 cooperated in the nth iteration to the total

number of encounters.

P(OC|ID) is the probability that SA2 will cooperate in the next iteration given that

SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2

cooperates in (n + 1)st iteration given that SA1 competed in the nth iteration to the

total number of encounters.
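These estimates can be computed directly from a history of encounters. A minimal sketch in Python, where the move encoding ("C"/"D") and function names are ours; note that, per the definitions above, the denominator is the total number of encounters, not only those where the condition held:

```python
def p_ic_given_oc(history):
    """Estimate P(IC|OC): SA1 cooperates at n+1 given SA2 cooperated at n.

    history[n] = (sa1_move, sa2_move), "C" = cooperate, "D" = compete.
    Per the chapter's definition, the ratio is taken over the total
    number of encounters (history assumed non-empty).
    """
    hits = sum(1 for n in range(len(history) - 1)
               if history[n][1] == "C" and history[n + 1][0] == "C")
    return hits / len(history)


def p_oc_given_id(history):
    """Estimate P(OC|ID): SA2 cooperates at n+1 given SA1 competed at n."""
    hits = sum(1 for n in range(len(history) - 1)
               if history[n][0] == "D" and history[n + 1][1] == "C")
    return hits / len(history)
```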

9 A Game Theory-Based Multi-Agent System 221

These conditional probabilities are referred to as “responsiveness”.

η is a measure of how likely it is for SA2 to adopt a TIT-FOR-TAT strategy;
γ is how likely it is for either agent to get stuck in a local optimum;
α is the difference between SA1's solution and that of SA2;
β is how likely it is for the opponent to cooperate (be nice!).

If the opponent cooperated in the last move, the agents first check their own responsiveness. If it is less than σ, they cooperate; if not, they check the opponent's responsiveness. If the latter is less than η, they conclude that the opponent is not as responsive as it should be, so they compete. If the opponent is responsive, they check their own status: if they performed γ% better than what they obtained in the last 2 stages, they compete; otherwise they make a random choice as to whether to cooperate or compete.

If the opponent did not cooperate in the last move, they compare their own status with that of their opponent. If the opponent's last solution is not α% better than their own, they compete. If it is at least α% better, they check their own progress. If they are stuck with γ%, in other words if their solution has not improved by more than γ% in the last 2 stages, they cooperate. If they are not stuck, they check the difference between the opponent's solution and their own. If the opponent's solution is not 2α% better than their own, they compete. If it is 2α% better, they check the opponent's attitude to competition: if the opponent is likely to cooperate, i.e. if the probability that it cooperates after competition is larger than β, they compete; otherwise they cooperate. Note that this description is given as the procedure Decide() above.
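Under assumed inputs (the probability estimates and progress measures computed elsewhere), the procedure can be sketched as follows. The function name, signature and move strings are illustrative, and we follow the procedure as printed, which compares P(OC|IC) against γ, even though the accompanying prose uses η at that step:

```python
import random


def decide(opponent_cooperated_last, p_ic_oc, p_oc_ic, p_oc_id,
           improvement, opponent_lead, alpha, beta, gamma, sigma):
    """Illustrative sketch of procedure Decide() for solver-agent SA1.

    improvement: SA1's relative improvement over the last 2 stages.
    opponent_lead: how much better (relatively) SA2's solution is.
    Returns "cooperate" or "compete".
    """
    if opponent_cooperated_last:
        if p_ic_oc < sigma:            # step 1.1: own responsiveness is low
            return "cooperate"
        if p_oc_ic < gamma:            # step 1.2 (as printed)
            return "cooperate"
        if improvement < gamma:        # step 1.2.1: SA1 is stuck
            return random.choice(["cooperate", "compete"])
        return "compete"
    # opponent competed: ask SA2 for its solution and compare (step 2)
    if opponent_lead >= alpha:         # SA2's solution is alpha% better
        if improvement < gamma:        # step 2.1.1: SA1 is stuck
            return "cooperate"
        if opponent_lead >= 2 * alpha:         # step 2.1.2
            return "compete" if p_oc_id > beta else "cooperate"
    return "compete"
```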

After an agent makes a decision, it is sent to the coordinator-agent which manages

the solution exchange. Solution exchange is settled randomly when a solution is

offered, following a cooperate move; it is accepted with probability 0.5.

The worse performing agent is not entirely removed. It is kept but it is only

allowed one iteration in each stage. This is specific to the two solver-agent case as it

seems, from experimental results, that the weaker algorithm still helps the stronger

one to get, overall, a better solution. This, however, may not be the case if a large

number of algorithms were used.

When all the stages are completed, the results of the best solver-agent are

reported.

GTMAS is applied to a collection of Travelling Salesman Problems (TSPs) [21]. The Genetic Algorithm (GA) and Simulated Annealing (SA) are selected as the solvers in the library. GA is coded in Matlab 7.0 and the Simulated Annealing Matlab code is borrowed from Matlab Central [29]. Both are customised for incorporation into GTMAS, which is also coded in Matlab 7.0.

Generic parameters of GTMAS are defined by pre-experimentation. Nstage, the number of stages the game is played for, is set to 5. Preliminary analyses showed that the number of iterations solver-agents start the decision phase with, and run for in the competition phases, should be set to 10. Accordingly, all the entries in the payoff table, ri j, are quadrupled for faster progress (see Table 9.3).

Table 9.3 Quadrupled payoff matrix

              B
              C         D
   G   C   (4,-8)    (4,-4)
       D   (8,-8)    (8,-4)
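Table 9.3 can be encoded as a simple lookup. The convention that G (GA) is the row player and B (SA) the column player, and the "C"/"D" move encoding, are our reading of the table:

```python
# Payoff table of Table 9.3; ("C" = cooperate, "D" = compete/defect).
# Keys are (G_move, B_move); values are (payoff to G, payoff to B),
# which GTMAS translates into iteration budgets.
PAYOFF = {
    ("C", "C"): (4, -8),
    ("C", "D"): (4, -4),
    ("D", "C"): (8, -8),
    ("D", "D"): (8, -4),
}


def payoffs(g_move, b_move):
    return PAYOFF[(g_move, b_move)]
```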

SA progresses quickly in its initial iterations because of the high temperature and the large probability of accepting bad solutions in order to escape local optima. Afterwards, it slows down considerably as the temperature decreases. The resource in the decision phase is CPU time. The number of iterations SA is run for, mSA, is updated at the beginning of each stage to balance the CPU time usage of SA and that of GA.

mSA = (10 / currentstage) × mSA.
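The stage-wise update of SA's iteration budget is then a one-liner (the function name is ours): it multiplies the budget by 10 in the first stage and shrinks the factor as the stage counter grows.

```python
def sa_iteration_budget(m_sa, current_stage):
    # m_SA <- (10 / currentstage) * m_SA, refreshed at the start of each stage
    return (10 / current_stage) * m_sa
```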

Prior to experimentation, GA is expected to perform better than SA. GA is coded for the specific problem at hand and its parameters are tuned accordingly. The parameters of SA are the default parameters found in the literature. The game played between the agents affects the solution quality substantially. It depends on the solver-agents' attitude to cooperation and competition, the solution exchanges and the payoff matrix. These parameters are summarised with their possible values in Table 9.4.

Table 9.4 Game parameters and their possible values

Payoff Matrix            Decision Model   Attitude of GA   Attitude of SA
simple                   random           random           random
cooperation-rewarded     evaluative       cooperative      cooperative
competition-rewarded                      competitive      competitive
time-dependent


Table 9.5 Simple (top-left), Competition-Rewarded (top-right), Cooperation-Rewarded (bottom-left) and Time-Dependent (bottom-right) Payoff Matrices

              B                               B
              C         D                     C            D
   G   C   (4,-4)    (4,-4)        G   C   (4,-8)       (4,-4)
       D   (4,-4)    (4,-4)            D   (8,-8)       (8,-4)

              B                               B
              C         D                     C            D
   G   C   (8,-4)    (8,-8)        G   C   (4/t,-4t)    (4/t,-4/t)
       D   (4,-4)    (4,-8)            D   (4t,-4t)     (4t,-4/t)

The agents' attitudes are characterised by the parameters η, γ, β, α and σ. These parameters were explained earlier and their values are given in Table 9.6.

Table 9.6 Agent characteristics

Agent characteristics    α     β     γ      η     σ
cooperative              0.1   0.9   0.01   0.2   0.6
competitive              0.5   0.1   0.01   0.8   0.2
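For reference, Table 9.6 can be carried around as a dictionary of parameter profiles (the structure and key names are ours):

```python
# Parameter profiles of Table 9.6; keys spell out the Greek symbols.
AGENT_PROFILES = {
    "cooperative": {"alpha": 0.1, "beta": 0.9, "gamma": 0.01,
                    "eta": 0.2, "sigma": 0.6},
    "competitive": {"alpha": 0.5, "beta": 0.1, "gamma": 0.01,
                    "eta": 0.8, "sigma": 0.2},
}
```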

Twenty combinations of parameters were used and, for each, five runs were carried out on the problems of Table 9.13, from TSPLIB [21]. GA is found to be the better algorithm in 98 runs and SA in the remaining 2. The final solutions, the elapsed times, the solver algorithm and the series of cooperation/competition bouts and exchanges of solutions are recorded. The results are entered into SPSS 14 for analysis of significant factors. The cooperation/competition series and the exchange series are categorised prior to analysis. The categorisation is summarised in Table 9.7.

Cooperation/competition series    Exchange series
less than twice                   never takes
more than twice                   takes in first stage
                                  takes after second stage
                                  takes in both

These are added to the factors of the experiment. The final factors of the question

are summarised in Table 9.8 with their corresponding values and the number of

occurrences.


Factors                      Values                     Code   Number of Observations
PAYOFF                       simple                      1            25
                             cooperation-rewarded        2            25
                             competition-rewarded        3            25
                             time-dependent              4            25
DECISION PROCESS             coop GA vs coop SA          1            20
                             coop GA vs comp SA          2            20
                             comp GA vs coop SA          3            20
                             comp GA vs comp SA          4            20
                             random                      5            20
CATEGORY GA COOPERATES       less than twice             1            64
                             more than twice             2            36
CATEGORY SA COOPERATES       less than twice             1            43
                             more than twice             2            57
CATEGORY GA TAKES            never takes                 1            25
SA'S SOLUTION                takes in first stage        2            23
                             takes after second stage    3            26
                             takes in both               4            26
CATEGORY SA TAKES            never takes                 1            22
GA'S SOLUTION                takes in first stage        2            33
                             takes after second stage    3            33
                             takes in both               4            12

Table 9.9 shows ANOVA results for the dependent variable deviation. Here deviation is the difference between the objective value of the true solution and the objective value of the solution found. All factors and reasonable multiple interactions are included in the model. Most of them are highly insignificant due to the high random variability. However, the interaction of the solution-taking sequences of the agents is significant at the 11% level, and therefore the solution-taking sequence factors themselves are significant. Even though this is not a very reliable significance level, these are the factors most expected to be significant in explaining the data, since the solution quality is expected to depend on the times at which solution exchanges occur.

Table 9.10 shows ANOVA results for the dependent variable time. The only significant factor, at the 12% significance level, is the solution exchange sequence of the SA solver-agent. This matches expectations, since SA varies a lot both between iterations within problems and between problems. When it takes a solution in any stage, the average elapsed time is about 100 seconds; when it does not take the GA solution, the average elapsed time is about 60 seconds.


Table 9.9 ANOVA - Significant Factors Affecting Deviation From True Solution Value

Source                                        Sum of Squares   df   Mean Square        F    Sig.
Corrected Model                                    589.795a    76         7.760     .865    .689
Intercept                                         2098.415      1      2098.415  233.899    .000
PAYOFF                                              26.754      3         8.918     .994    .413
DECISION PROCESS                                    16.223      4         4.056     .452    .770
PAYOFF * DECISION PROCESS                           25.528      4         6.382     .711    .593
CATEGORY COOP1 * CATEGORY COOP2                      5.214      1         5.214     .581    .454
CATEGORY TAKEN1 * CATEGORY TAKEN2                  124.516      7        17.788    1.983    .102
PAYOFF * CATEGORY COOP1 *
  CATEGORY COOP2                                      .000      0          .         .       .
PAYOFF * CATEGORY TAKEN1 *
  CATEGORY TAKEN2                                  106.348     13         8.181     .912    .555
DECISION PROCESS * CATEGORY COOP1 *
  CATEGORY COOP2                                      .000      0          .         .       .
DECISION PROCESS * CATEGORY TAKEN1 *
  CATEGORY TAKEN2                                    7.187      4         1.797     .200    .936
PAYOFF * DECISION PROCESS *
  CATEGORY COOP1 * CATEGORY COOP2                     .000      0          .         .       .
PAYOFF * DECISION PROCESS *
  CATEGORY TAKEN1 * CATEGORY TAKEN2                  5.806      2         2.903     .324    .727
PAYOFF * DECISION PROCESS *
  CATEGORY COOP1 * CATEGORY COOP2 *
  CATEGORY TAKEN1 * CATEGORY TAKEN2                   .000      0          .         .       .
CATEGORY COOP1                                        .041      1          .041     .005    .947
CATEGORY COOP2                                       4.731      1         4.731     .527    .475
CATEGORY TAKEN1                                      9.223      3         3.074     .343    .795
CATEGORY TAKEN2                                     11.471      3         3.824     .426    .736
Error                                              206.344     23         8.971
Total                                             4238.542    100
Corrected Total                                    796.139     99

a. R Squared = .741 (Adjusted R Squared = -.116)


Table 9.10 ANOVA - Significant Factors Affecting Elapsed Time

Source                                        Sum of Squares   df   Mean Square        F    Sig.
Corrected Model                                 331363.405a    76      4360.045     .522    .981
Intercept                                       626245.612      1    626245.612   74.956    .000
CATEGORY COOP1                                       6.103      1         6.103     .001    .979
CATEGORY COOP2                                     111.092      1       111.092     .013    .909
CATEGORY TAKEN1                                   7246.948      3      2415.649     .289    .833
CATEGORY TAKEN2                                  54448.233      3     18149.411    2.172    .119
PAYOFF                                            2783.043      3       927.681     .111    .953
DECISION PROCESS                                   878.624      4       219.656     .026    .999
CATEGORY COOP1 * CATEGORY COOP2                       .538      1          .538     .000    .994
CATEGORY TAKEN1 * CATEGORY TAKEN2                50147.857      7      7163.980     .857    .553
PAYOFF * DECISION PROCESS                          794.601      4       198.650     .024    .999
CATEGORY COOP1 * CATEGORY COOP2 *
  DECISION PROCESS                                    .000      0          .         .       .
CATEGORY COOP1 * CATEGORY COOP2 *
  PAYOFF                                              .000      0          .         .       .
CATEGORY TAKEN1 * CATEGORY TAKEN2 *
  DECISION PROCESS                                1624.845      4       406.211     .049    .995
CATEGORY TAKEN1 * CATEGORY TAKEN2 *
  PAYOFF                                         25019.542     13      1924.580     .230    .996
CATEGORY TAKEN1 * CATEGORY TAKEN2 *
  PAYOFF * DECISION PROCESS                        786.468      2       393.234     .047    .954
CATEGORY COOP1 * CATEGORY COOP2 *
  PAYOFF * DECISION PROCESS                           .000      0          .         .       .
CATEGORY COOP1 * CATEGORY COOP2 *
  CATEGORY TAKEN1 * CATEGORY TAKEN2 *
  PAYOFF * DECISION PROCESS                           .000      0          .         .       .
Error                                           192161.123     23      8354.831
Total                                          1536492.299    100
Corrected Total                                 523524.528     99

a. R Squared = .633 (Adjusted R Squared = -.580)


Table 9.11 Deviation and elapsed time by solution-exchange categories

GA takes SA's solution     SA takes GA's solution     Deviation   Time (sec)
never takes                never takes                -           -
never takes                takes in first stage       5.51%       108.8
never takes                takes after second stage   4.89%       81.28
never takes                takes in both              8.01%       98.8
takes in first stage       never takes                4.62%       60.41
takes in first stage       takes in first stage       6.85%       105.6
takes in first stage       takes after second stage   5.18%       211.16
takes in first stage       takes in both              6.06%       71.12
takes after second stage   never takes                3.79%       58.24
takes after second stage   takes in first stage       6.30%       119.97
takes after second stage   takes after second stage   7.08%       93.08
takes after second stage   takes in both              7.50%       296.77
takes in both              never takes                7.41%       57.16
takes in both              takes in first stage       3.36%       140.57
takes in both              takes after second stage   5.74%       92.54
takes in both              takes in both              5.34%       116.08

Table 9.12 Occurrences of the best solution-exchange setting

Payoff Matrix    Attitude of GA   Attitude of SA   Occur.   Dev.    Time (sec.)
coop-rewarded    cooperative      cooperative      2        3.59%   61.97
coop-rewarded    cooperative      competitive      1        1.27%   45.49
coop-rewarded    competitive      cooperative      1        3.33%   60.49
coop-rewarded    competitive      competitive      2        7.39%   50.17
coop-rewarded    random           random           -        -       -
comp-rewarded    cooperative      cooperative      -        -       -
comp-rewarded    cooperative      competitive      -        -       -
comp-rewarded    competitive      cooperative      -        -       -
comp-rewarded    competitive      competitive      1        0.65%   58.09
comp-rewarded    random           random           -        -       -
simple           cooperative      cooperative      -        -       -
simple           cooperative      competitive      -        -       -
simple           competitive      cooperative      1        6.50%   66.41
simple           competitive      competitive      1        3.20%   70.54
simple           random           random           -        -       -
time-dependent   cooperative      cooperative      -        -       -
time-dependent   cooperative      competitive      -        -       -
time-dependent   competitive      cooperative      -        -       -
time-dependent   competitive      competitive      1        1.34%   54.77
time-dependent   random           random           1        3.47%   60.31


In Table 9.11, the best deviation is found when GA takes SA's solution both in the first stage and after the second stage, and SA takes GA's solution only in the first stage. The average deviation is 3.36%. However, the average time elapsed to obtain this deviation is quite high, at 141 seconds. The second best deviation is observed when GA takes SA's solution after the second stage and SA never takes GA's solution: the average deviation is 3.79%, with an average elapsed time of 58 seconds. From these results it can be said that the best performance is obtained in this setting, i.e. when GA competes and takes the solution of SA, while SA cooperates by offering its own solution and never taking that of GA. Whether obtaining this solution-exchange setting is random is not clear; what is clear is that it occurs quite often. Table 9.12 records some of its occurrences. Amongst these 11 occurrences, the best average deviation is obtained with a competition-rewarded payoff matrix and both agents being competitive. The deviation is 0.65%, which actually comes from only one occurrence, and the time is 58 seconds. This analysis does not show that competitive agents playing against each other in a competition-rewarded environment is the best environment; rather, it shows that if competitive agents play against each other in a competition-rewarded environment and their solution exchange happens to benefit only one of the solver-agents, then this might be the best setting.

GTMAS is tested on 10 problems from TSPLIB [21]. The results are summarised

in Table 9.13.

Table 9.13 shows the runs for GA alone, SA alone and GA and SA together

under the framework of GTMAS, sequentially. For each problem instance, GTMAS

selected GA as the best solving agent.

When average deviations are compared within problem instances, GTMAS is found to dominate GA: on average, GTMAS always finds better solutions than GA. This is due to GA benefiting from the presence of SA, a synergistic effect. It is observed that when GA takes the solution of SA, the quality of the overall solution increases considerably. Solution exchange seems to play a critical role in determining the quality of the solution.

However, when elapsed times are compared, the average time of GTMAS is almost double that of GA for almost all problems. This rather unfavourable time count is the cost of keeping the SA solver-agent, because it improves the quality of the overall solution. There are also other overheads that come with the need for coordination, decision making and so on.

It should also be noted that the recorded time counts for GTMAS are those of a

sequential implementation. The times of a parallel implementation are expected to

be significantly lower.


Problem     Agent 1   Agent 2   Deviation   Time (sec)   Algorithm
burma14     GA        -         0.70%       6.95
burma14     -         SA        0.53%       8.34
burma14     GA        SA        0.00%       13.49        GA
ulysses16   GA        -         0.27%       8.26
ulysses16   -         SA        0.17%       42.28
ulysses16   GA        SA        0.00%       14.71        GA
ulysses22   GA        -         1.56%       9.83
ulysses22   -         SA        1.16%       97.12
ulysses22   GA        SA        1.47%       17.78        GA
att48       GA        -         4.97%       41.23
att48       -         SA        31.48%      10.52
att48       GA        SA        4.06%       129.87       GA
eil51       GA        -         4.46%       44.45
eil51       -         SA        18.17%      423.26
eil51       GA        SA        2.72%       91.58        GA
berlin52    GA        -         8.67%       42.99
berlin52    -         SA        36.37%      11.44
berlin52    GA        SA        5.32%       66.94        GA
st70        GA        -         11.62%      66.17
st70        -         SA        24.89%      232.21
st70        GA        SA        7.97%       143.52       GA
eil76       GA        -         6.84%       74.90
eil76       -         SA        33.34%      1162.02
eil76       GA        SA        5.88%       136.23       GA
pr76        GA        -         6.25%       94.02
pr76        -         SA        35.91%      254.20
pr76        GA        SA        5.71%       140.05       GA
eil101      GA        -         10.37%      143.99
eil101      -         SA        50.11%      220.71
eil101      GA        SA        8.58%       211.98       GA

A generic smart solver, GTMAS, has been constructed that combines a multi-agent

system architecture and game theory to deal with expensive optimisation problems.

Within GTMAS different algorithms attached to agents play an Iterated Prisoners’

Dilemma type game in which they cooperate to solve the problem and compete

over the computing facilities available (here CPU time). In the process, the system

finds the most appropriate algorithm for the given problem from a library of avail-

able algorithms and solves the problem. It also obtains a better quality approximate


solution than the best algorithm would obtain on its own. This is because of the

synergistic effect of the algorithms working together.

GTMAS implements an interesting resource-allocation process that uses a purpose-built payoff matrix to encourage competition for the available computing resources. Solver-agents are rewarded for good performance by increased access to the computing facilities and punished for bad performance by reduced access. This simple rule guarantees that the computing platform is increasingly dedicated to the most suited algorithm. In other

words, the bulk of the computing platform will eventually be used by the best per-

forming algorithm, which is synonymous with the computing resources being used

efficiently.

GTMAS as implemented here involves only two players. The study would benefit from a more extensive investigation with a larger number of algorithms. The results obtained can be used to extend it to n players: the game can be designed such that, given the players A1, A2, ..., An, pair-wise games are considered and each game is evaluated separately according to the same 2-by-2 payoff matrix introduced here. The solvers that fail in the simultaneous 2-by-2 competitions are eliminated and the tournament continues with the ones that survive.
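A hedged sketch of this pair-wise elimination follows: losers of 2-by-2 games drop out until one solver survives. `play_pair` stands in for a full GTMAS stage between two solvers, and all names here are illustrative:

```python
def knockout(solvers, play_pair):
    """Pair-wise elimination tournament over a list of solver names.

    play_pair(a, b) -> True if solver a beats solver b in a 2-by-2 game.
    An odd solver out gets a bye into the next round.
    """
    round_ = list(solvers)
    while len(round_) > 1:
        survivors = []
        for i in range(0, len(round_) - 1, 2):
            a, b = round_[i], round_[i + 1]
            survivors.append(a if play_pair(a, b) else b)
        if len(round_) % 2:          # odd one out gets a bye
            survivors.append(round_[-1])
        round_ = survivors
    return round_[0]
```

Here "TS" below is a hypothetical third solver (e.g. tabu search), used only to exercise the bye logic.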

Another approach to the n-by-n game could be to play it simultaneously, using notions of Nash's poker game [13], with a specially created n-by-n payoff matrix that would evaluate all agents at once but select the best iteratively. Current and future research directions concern extending the ideas of the GTMAS prototype to a general n-by-n environment which deals with n algorithms, running in parallel, according to one of the two proposed payoff matrices.

References

1. Aldea, A., Alcantra, R.B., Jimenez, L., Moreno, A., Martinez, J., Riano, D.: The scope

of application of multi-agent systems in the process industry: Three case studies. Expert

Systems with Applications 26, 39–47 (2004)

2. Axelrod, R.: Effective choice in the prisoner’s dilemma. Journal of Conflict Resolu-

tion 24(1), 3–25 (1980)

3. Axelrod, R.: More effective choice in the prisoner’s dilemma. Journal of Conflict Reso-

lution 24(3), 379–403 (1980)

4. Axelrod, R.: The Evolution of Cooperation. Basic Books, New York (1984)

5. Axelrod, R.: The evolution of strategies in the iterated prisoners’ dilemma. In: Davis, L.

(ed.) Genetic Algorithms and Simulated Annealing, pp. 32–42. Morgan Kaufmann, Los

Altos (1987)

6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science 211, 1390–1396

(1981)

7. Binmore, K.: Fun and Games. D.C. Heath, Lexington (1991)

8. Binmore, K.: Playing fair: Game theory and the social contract. MIT Press, Cambridge

(1994)

9. Bratman, M.E.: Shared cooperative activity. The Philosophical Review 101(2), 327–341

(1992)


10. Byrd, R.H., Dert, C.L., Rinnooy Kan, A.H.G., Schnabel, R.B.: Concurrent stochastic

methods for global optimization. Mathematical Programming 46, 1–30 (1990)

11. Colman, A.M.: Game Theory and Experimental Games. Pergamon Press Ltd., Oxford

(1982)

12. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On cooperation in multi-agent

systems. The Knowledge Engineering Review 12(3), 309–314 (1997)

13. Nash, J.F.: Non-cooperative games. Annals of Mathematics 54(2), 286–295 (1951)

14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan

Press, Ann Arbor (1975)

15. Linster, B.: Essays on Cooperation and Competition. PhD thesis, University of Michigan,

Michigan (1990)

16. Luce, R., Raiffa, H.: Games and Decisions. Wiley, New York (1957)

17. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear pro-

gramming. Mathematical Programming 39, 117–130 (1987)

18. Töreyen, Ö.: A game-theory based multi-agent system for solving complex optimisation problems and a clustering application related to the integration of Turkey into the EU community. M.Sc. thesis, Department of Mathematical Sciences, University of Essex, UK (2008)

19. Park, S., Sugumaran, V.: Designing multi-agent systems: A framework and application.

Expert Systems with Applications 28, 259–271 (2005)

20. Rapoport, A., Chammah, A.M.: Prisoner’s Dilemma: A Study in Conflict and Coopera-

tion. University of Michigan Press, Ann Arbor (1965)

21. Reinelt, G.: TSPLIB, http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95

22. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part I:

Clustering methods. Mathematical Programming 39, 27–56 (1987)

23. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part II:

Multi-level methods. Mathematical Programming 39, 57–78 (1987)

24. Rinnooy Kan, A.H.G., Timmer, G.T.: Global optimization. In: Nemhauser, G.L., Rin-

nooy Kan, A.H.G., Todd, M.J. (eds.) Optimization. Handbooks in Operations Research

and Management Science, ch. IX, vol. 1, pp. 631–662. North Holland, Amsterdam

(1989)

25. Salhi, A., Glaser, H., De Roure, D.: A genetic approach to understanding cooperative

behaviour. In: Osmera, P. (ed.) Proceedings of the 2nd International Mendel Conference

on Genetic Algorithms, MENDEL 1996, pp. 129–136 (1996)

26. Salhi, A., Glaser, H., De Roure, D.: Parallel implementation of a genetic-programming

based tool for symbolic regression. Information Processing Letters 66(6), 299–307

(1998)

27. Salhi, A., Glaser, H., De Roure, D., Putney, J.: The prisoners’ dilemma revisited. Techni-

cal Report DSSE-TR-96-2, Department of Electronics and Computer Science, The Uni-

versity of Southampton, U.K. (February 1996)

28. Salhi, A., Proll, L.G., Rios Insua, D., Martin, J.: Experiences with stochastic algorithms

for a class of global optimisation problems. RAIRO Operations Research 34(22), 183–

197 (2000)

29. Seshadri, A.: Simulated annealing for travelling salesman problem,

http://www.mathworks.com/matlabcentral/fileexchange


30. Tweedale, J., Ichalkaranje, H., Sioutis, C., Jarvis, B., Consoli, A., Phillips-Wren, G.:

Innovations in multi-agent systems. Journal of Network and Computer Applications 30,

1089–1115 (2007)

31. Wooldridge, M., Jennings, N.R.: Intelligent agents: Theory and practice. Knowledge En-

gineering Review 10(2), 115–152 (1995)

32. Zhigljavsky, A.A.: Theory of Global Search. Mathematics and its applications, Soviet

Series, vol. 65. Kluwer Academic Publishers, Dordrecht (1991)

Chapter 10

Optimization with Clifford Support Vector

Machines and Applications

This chapter presents a generalization of SVMs using the Clifford algebra. In this framework we handle the design of kernels involving the geometric product for linear and nonlinear classification and regression. The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to have a multivector as output and, therefore, we can represent multiple classes according to the dimension of the geometric algebra in which we work. By using the CSVM with one Clifford kernel we greatly reduce the complexity of the computation. This is possible thanks to the Clifford product, which performs the direct product between the spaces of different grade involved in the optimization problem. We conduct comparisons between CSVM and the most used approaches to multi-class classification to show that ours is more suitable for practical use on certain types of problems. The chapter includes several experiments showing the application of CSVM to classification and regression problems, as well as to 3D object recognition for visually guided robotics. In addition, we show the design of a recurrent system involving an LSTM network connected with a CSVM, and we study the performance of this system on time-series experiments and robot navigation using reinforcement learning.

10.1 Introduction

The Support Vector Machine (SVM) [1, 2, 3, 4] is a powerful optimization algorithm for solving classification and regression problems, but it was originally designed

N. Arana-Daniel · C. López-Franco

Computer Science Department, Exact Sciences and Engineering Campus, CUCEI,

University of Guadalajara, Av. Revolucion 1500, Col. Olı́mpica, C.P. 44430,

Guadalajara, Jalisco, México

e-mail: {nancy.arana,carlos.lopez}@cucei.udg.mx

E. Bayro-Corrochano

Cinvestav del IPN, Department of Electrical Engineering and Computer Science,

Zapopan, Jalisco, México

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 233–262.

springerlink.com c Springer-Verlag Berlin Heidelberg 2010

234 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

for binary classification. Its extension to multi-class classification is still an on-going research issue. Currently there are two main types

of approaches for multi-class SVM [6, 7]. One is by constructing and combining several binary classifiers, while the other is by directly considering all data in one big optimization problem. The latter approach is computationally more expensive for multi-class problems. This is why the authors were motivated to develop an SVM-based algorithm for multi-classification and multi-regression, based on the Clifford (or geometric) algebra framework [5]. The authors' hypothesis was that these algebras could be the appropriate mathematical framework to develop this algorithm, because Clifford algebras allow us to express in a compact way many geometric entities (which are used to represent multiple classes) and the products between them.

This chapter presents the results obtained from the development of the above-mentioned hypothesis: i) the design of the generalization of the real- and complex-valued Support Multi-Vector Machines for classification and regression using the Clifford geometric algebra, which from now onwards will be called Clifford Support Vector Machines (CSVM); ii) the development of Multiple Input Multiple Output (MIMO) CSVM; and iii) the application of CSVM as classifiers, regressors and as an important component of a recurrent system. This work is a continuation of a first one on the generalization of SVMs [8].

Let Gn denote the geometric (Clifford) algebra of n dimensions; this is a graded linear space. As well as vector addition and scalar multiplication we have a non-

commutative product which is associative and distributive over addition – this is the

geometric or Clifford product. A further distinguishing feature of the algebra is that

any vector squares to give a scalar. The geometric product of two vectors a and b is

written ab and can be expressed as a sum of its symmetric and antisymmetric parts

ab = a·b + a∧b, (10.1)

where the inner product a·b and the outer product a∧b are defined by

a · b = (1/2)(ab + ba),
a ∧ b = (1/2)(ab − ba).     (10.2)

The inner product of two vectors is the standard scalar or dot product and produces a scalar. The outer or wedge product of two vectors is a new quantity which we call a bivector. We think of a bivector as an oriented area in the plane containing a and b, formed by sweeping a along b. Thus, b ∧ a has the opposite orientation, making the wedge product anti-commutative, as given in (10.2). The outer product is immediately generalizable to higher dimensions – for example, (a ∧ b) ∧ c, a trivector, is interpreted as the oriented volume formed by sweeping the area a ∧ b along vector c. The outer product of k vectors is a k-vector or k-blade, and such a quantity is said to have grade k. A multivector A ∈ Gn is a sum of k-blades of different grades; it is called homogeneous if

10 Optimization with Clifford Support Vector Machines and Applications 235

it contains terms of only a single grade.

In an n-dimensional space V^n we can introduce an orthonormal basis of vectors {σi}, i = 1, ..., n, such that σi · σj = δij. This leads to a basis for the entire algebra:

{1, {σi}, {σi ∧ σj}, ..., σ1 ∧ σ2 ∧ ... ∧ σn ≡ I},     (10.3)

which spans the entire geometric algebra Gn. Here I is the hypervolume, called the pseudoscalar, which commutes with all multivectors and is also used as a dualization operator. Note that the basis vectors are not represented by bold symbols.

Any multivector can be expressed in terms of this basis. Because the addition of k-vectors (homogeneous multivectors of grade k) is closed and the multiplication of a k-vector by a scalar is again a k-vector, the k-vectors form a vector space, denoted ⋀^k V^n. Each of these spaces is spanned by (n choose k) k-vectors, where (n choose k) := n!/((n−k)! k!).

Thus, our geometric algebra Gn, which is spanned by ∑_{k=0}^{n} (n choose k) = 2^n elements, is a direct sum of its homogeneous subspaces of grades 0, 1, 2, ..., n, that is,

Gn = ⋀^0 V^n ⊕ ⋀^1 V^n ⊕ ⋀^2 V^n ⊕ ... ⊕ ⋀^n V^n,     (10.4)

where ⋀^0 V^n = R is the set of real numbers and ⋀^1 V^n = V^n corresponds to the linear n-dimensional vector space. Thus, any multivector of Gn can be expressed in terms of the bases of these subspaces.
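The dimension count above is easy to check numerically; a small sketch (function name ours) computes the dimension of each graded subspace and verifies that they sum to 2^n:

```python
from math import comb


def grade_dims(n):
    # dimension of the homogeneous subspace of grade k is C(n, k)
    return [comb(n, k) for k in range(n + 1)]

# For G3 the grades 0..3 have dimensions [1, 3, 3, 1], summing to 2^3 = 8.
```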

In this chapter we will specify a geometric algebra Gn of the n-dimensional space by G_{p,q,r}, where p, q and r stand for the numbers of basis vectors which square to 1, -1 and 0, respectively, and fulfill n = p + q + r. Its even subalgebra will be denoted by G^+_{p,q,r}.

In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vectors),

grade 2 (bivectors), grade 3 (trivectors), etc... up to grade n. Any two such multi-

vectors can be multiplied using the geometric product. Consider two multivectors

Ar and Bs of grades r and s respectively. The geometric product of Ar and Bs can be

written as

Ar Bs = ⟨Ar Bs⟩_{r+s} + ⟨Ar Bs⟩_{r+s−2} + ... + ⟨Ar Bs⟩_{|r−s|},     (10.5)

where ⟨M⟩_t is used to denote the t-grade part of multivector M; e.g., consider the geometric product of two vectors ab = ⟨ab⟩_0 + ⟨ab⟩_2 = a · b + a ∧ b. Another simple illustration is the geometric product of A = 4σ3 + 2σ1σ2 and b = 8σ2 + 6σ3:

Ab = (4σ3 + 2σ1σ2)(8σ2 + 6σ3) = 24 + 16σ1 − 32σ2σ3 + 12I.     (10.6)
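The worked product (10.6) can be verified with a tiny bitmask implementation of the geometric product for a Euclidean algebra; the representation ({blade_bitmask: coefficient}, bit i set when σ_{i+1} is a factor) and the function names are ours, not the chapter's:

```python
def reorder_sign(a, b):
    # Sign picked up when multiplying two basis blades given as bitmasks,
    # counting the swaps needed to bring the factors into canonical order
    # (Euclidean metric: each basis vector squares to +1).
    a >>= 1
    total = 0
    while a:
        total += bin(a & b).count("1")
        a >>= 1
    return -1 if total % 2 else 1


def gp(A, B):
    # Geometric product of multivectors stored as {blade_bitmask: coefficient}.
    out = {}
    for ka, va in A.items():
        for kb, vb in B.items():
            k = ka ^ kb                  # surviving basis vectors
            out[k] = out.get(k, 0) + reorder_sign(ka, kb) * va * vb
    return {k: v for k, v in out.items() if v}


# A = 4*sigma3 + 2*sigma1 sigma2,  b = 8*sigma2 + 6*sigma3
A = {0b100: 4, 0b011: 2}
b = {0b010: 8, 0b100: 6}
# gp(A, b) equals {0b000: 24, 0b001: 16, 0b110: -32, 0b111: 12},
# i.e. 24 + 16*sigma1 - 32*sigma2 sigma3 + 12*I, matching (10.6).
```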


Note here that the Clifford product σiσi = (σi)^2 = σi · σi = 1, because the wedge product σi ∧ σi = 0, and σiσj = σi ∧ σj: the geometric product of different unit basis vectors is equal to their wedge product, which for simple notation can be omitted. Using equation (10.5) we can express the inner and outer products for the multivectors as

Ar · Bs = ⟨Ar Bs⟩_{|r−s|},     (10.7)
Ar ∧ Bs = ⟨Ar Bs⟩_{r+s}.      (10.8)

In order to deal with more general multivectors, we define the scalar product

A ∗ B = ⟨AB⟩_0.     (10.9)

For an r-grade multivector Ar = ∑_{i=0}^{r} ⟨Ar⟩_i, the following operations are defined:

Grade Involution:       Â_r = ∑_{i=0}^{r} (−1)^i ⟨A⟩_i            (10.10)
Reversion:              A†_r = ∑_{i=0}^{r} (−1)^{i(i−1)/2} ⟨A⟩_i   (10.11)
Clifford Conjugation:   Ã_r = Â†_r                                (10.12)
                            = ∑_{i=0}^{r} (−1)^{i(i+1)/2} ⟨A⟩_i    (10.13)

The grade involution simply negates the odd-grade blades of a multivector. The reversion can also be obtained by reversing the order of the basis vectors making up the blades in a multivector and then rearranging them to their original order using the anti-commutativity of the Clifford product. The scalar product ∗ is positive definite, i.e. one can associate with any multivector A = ⟨A⟩_0 + ⟨A⟩_1 + ... + ⟨A⟩_n a unique positive scalar magnitude |A| defined by

|A|^2 = A† ∗ A = ∑_{r=0}^{n} |⟨A⟩_r|^2 ≥ 0,     (10.14)

: if A = 0. For an homogeneous multivector Ar its magnitude

is defined as |Ar| = √(A†r Ar). In particular, for an r-vector Ar of the form Ar = a1 ∧ a2 ∧ … ∧ ar we have A†r = (a1 … ar−1 ar)† = ar ar−1 … a1 and thus A†r Ar = a1² a2² … ar², so we will say that such an r-vector is null if and only if it has a null vector for a factor. If in such a factorization of Ar, p, q and s factors square to a positive number, a negative number and zero, respectively, we will say that Ar is an r-vector with signature (p, q, s). In particular, if s = 0, such a non-singular r-vector has a multiplicative inverse

A⁻¹ = (−1)^q A†/|A|² = A/A², (10.15)

10 Optimization with Clifford Support Vector Machines and Applications 237

such that A⁻¹A = 1.

The basis for the geometric algebra G3,0,0 of the 3-D space has 2³ = 8 elements and is given by:

{1} (scalar), {σ1, σ2, σ3} (vectors), {σ2σ3, σ3σ1, σ1σ2} (bivectors), {I3 = σ1σ2σ3} (trivector). (10.16)

A generic multivector can thus be written v = α0 + α1σ1 + α2σ2 + α3σ3 + α4σ2σ3 + α5σ3σ1 + α6σ1σ2 + α7I3 = ⟨v⟩0 + ⟨v⟩1 + ⟨v⟩2 + ⟨v⟩3, where the αi are real numbers and ⟨v⟩0 = α0 ∈ Λ⁰V^n, ⟨v⟩1 = α1σ1 + α2σ2 + α3σ3 ∈ Λ¹V^n, ⟨v⟩2 = α4σ2σ3 + α5σ3σ1 + α6σ1σ2 ∈ Λ²V^n, ⟨v⟩3 = α7I3 ∈ Λ³V^n.

In geometric algebra a rotor (short for rotator), R, is an even-grade element of the algebra which satisfies RR̃ = 1, where R̃ stands for the conjugate of R. If A = {a0, a1, a2, a3} ∈ G3,0,0 represents a unit quaternion, then the rotor which performs the same rotation is simply given by

R = a0 + a1(Iσ1) − a2(Iσ2) + a3(Iσ3) (10.17)
= a0 + a1σ2σ3 + a2σ3σ1 + a3σ1σ2, (10.18)

where the first term is the scalar part and the remaining terms are bivectors. The conjugate of a rotor is given by R̃ = a0 − a1σ2σ3 + a2σ3σ1 − a3σ1σ2.

The transformation in terms of a rotor, a → RaR̃ = b, is a very general way of handling rotations; it works for multivectors of any grade and in spaces of any dimension, in contrast to quaternion calculus. Rotors combine in a straightforward manner, i.e. a rotor R1 followed by a rotor R2 is equivalent to a total rotor R, where R = R2R1.
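Rotor composition can be illustrated with plain unit quaternions, since by (10.17)-(10.18) the two are in one-to-one correspondence. A hedged sketch (the helper names are ours):

```python
# Illustrative sketch: R = R2 R1 mirrors unit-quaternion multiplication,
# and v -> q v q* is the quaternion analogue of a -> R a R~.
import math

def qmul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def rotate(q, v):
    qc = (q[0], -q[1], -q[2], -q[3])          # conjugate, analogue of R~
    return qmul(qmul(q, (0.0,) + v), qc)[1:]  # vector part of q v q*

c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
R1 = (c, 0.0, 0.0, s)        # 90 degrees about z
R2 = (c, s, 0.0, 0.0)        # 90 degrees about x
R  = qmul(R2, R1)            # total rotor: R1 first, then R2
print(rotate(R, (1.0, 0.0, 0.0)))  # ~ (0, 0, 1): x -> y by R1, then y -> z by R2
```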

Linear Clifford SVM for Classification

For the case of the Clifford SVM for classification we represent the data set in a certain Clifford algebra Gn, where n = p + q + r and any multivector basis element squares to 1, −1 or 0 depending on whether it belongs to the p, q or r multivector bases respectively. We consider the general case of an input comprising D multivectors and one multivector output, i.e. each ith vector has D multivector entries xi = [xi1, xi2, ..., xiD]^T, where xij ∈ Gn and D is its dimension. Thus the ith vector dimension is D × 2^n, and each data ith vector xi ∈ Gn^D. Each ith vector will be associated with one output of the 2^n possibilities given by the following multivector output

yi = yis + yiσ1 σ1 + yiσ2 σ2 + … + yiI I, (10.19)

where the first subindex s stands for the scalar part. For the classification the CSVM separates these multivector-valued samples into 2^n groups by selecting a good enough function from the set of functions

f(x) = w†^T x + b. (10.20)

Each multivector component of the optimal hyperplane w is given by wi = wis + wiσ1 σ1 + … + wiσ1σ2 σ1σ2 + … + wiI I.

Let us see the last equation in detail:

f(x) = w†^T x + b = [w1†, w2†, ..., wD†]^T [x1, x2, ..., xD] + b = ∑_{i=1}^{D} wi† xi + b, (10.21)

where wi† xi corresponds to the Clifford product of two multivectors and wi† is the reversion of the multivector wi.

Next, we introduce a structural risk functional similar to the real-valued one of the SVM for classification. Using a loss function similar to Vapnik's ξ-insensitive one, we utilize the following linearly constrained quadratic programming problem for the primal equation

min L(w, b, ξ) = ½ w†^T w + C ∑_{i,j} ξij
subject to (10.22)
yij(f(xi))j = yij(w†^T xi + b)j ≥ 1 − ξij
ξij ≥ 0 for all i, j,

where the ξij stand for the slack variables, i indexes the data ith vector and j indexes the multivector component, i.e. j = 1 for the coefficient of the scalar part, j = 2 for the coefficient of σ1, ..., j = 2^n for the coefficient of I. The dual expression of this problem can be derived straightforwardly. First let us consider the expression for the orientation of the optimal hyperplane.


wks = ∑_{j=1}^{l} (αs)j (ys)j (xks)j,
wkσ1 = ∑_{j=1}^{l} (ασ1)j (yσ1)j (xkσ1)j, ...,
wkI = ∑_{j=1}^{l} (αI)j (yI)j (xkI)j, (10.25)

where (αs)j, (ασ1)j, ..., (αI)j, j = 1, ..., l, are the Lagrange multipliers. According to the Wolfe dual programming [1] the dual form reads

min ½ (w†^T w) − ∑_{i,j} αij (10.26)

subject to 0 ≤ (αs)j ≤ C, ..., 0 ≤ (ασ1σ2)j ≤ C, ..., 0 ≤ (αI)j ≤ C for i = 1, ..., D and j = 1, ..., l, and a^T · 1 = 0, where 1 denotes a vector of all ones. The entries of the vector

a = [as , aσ1 , aσ2 , ..., aσ1 σ2 , aI ] (10.27)

are given by

as^T = [(αs)1(ys)1, (αs)2(ys)2, ..., (αs)l(ys)l]
aσ1^T = [(ασ1)1(yσ1)1, (ασ1)2(yσ1)2, ..., (ασ1)l(yσ1)l]
⋮ (10.28)
aI^T = [(αI)1(yI)1, (αI)2(yI)2, ..., (αI)l(yI)l];

note that the vector a^T has dimension (l × 2^n) × 1. We require a compact and easy representation of the resultant Gram matrix of the multi-components; this will help with the programming of the algorithm. For that, let us first consider the Clifford product w†^T w, which can be expressed as follows

w†^T w = ⟨w†^T w⟩s + ⟨w†^T w⟩σ1 + ⟨w†^T w⟩σ2 + … + ⟨w†^T w⟩I (10.29)

Since w has the components presented in (10.25), the equation (10.29) can be rewrit-

ten as follows

w†^T w = as^T x†^T xs as + ... + as^T x†^T xσ1σ2 aσ1σ2 + ... + as^T x†^T xI aI
+ aσ1^T x†^T xs as + ... + aσ1^T x†^T xσ1σ2 aσ1σ2 + ... + aσ1^T x†^T xI aI
+ ⋮
+ aI^T x†^T xs as + aI^T x†^T xσ1 aσ1 + ... + aI^T x†^T xσ1σ2 aσ1σ2 + ... + aI^T x†^T xI aI. (10.30)


Renaming the matrices of the t-grade parts of x†^T ⟨x⟩t, we rewrite the previous equation as:

w†^T w = as^T Hs as + as^T Hσ1 aσ1 + ... + as^T Hσ1σ2 aσ1σ2 + ... + as^T HI aI
+ aσ1^T Hs as + aσ1^T Hσ1 aσ1 + ... + aσ1^T Hσ1σ2 aσ1σ2 + ... + aσ1^T HI aI
+ ⋮
+ aI^T Hs as + aI^T Hσ1 aσ1 + ... + aI^T Hσ1σ2 aσ1σ2 + ... + aI^T HI aI. (10.31)

Taking into consideration the previous equations and definitions, the primal equation (10.22) now reads as follows:

min L(w, b, ξ) = ½ a^T H a + C · ∑_{i,j} ξij; (10.32)

using the previous definitions and equations we can define the dual optimization problem as follows

max a^T 1 − ½ a^T H a
subject to
0 ≤ (αs)j ≤ C, 0 ≤ (ασ1)j ≤ C, ...,
0 ≤ (ασ1σ2)j ≤ C, ..., 0 ≤ (αI)j ≤ C
for j = 1, ..., l. (10.33)

H is a positive semidefinite matrix: the expected generalized Gram matrix. In terms of the matrices of the t-grade parts of x†^T ⟨x⟩t, this matrix is written as follows:

H = [ Hs      Hσ1     Hσ2      ...  Hσ1σ2   ...   HI ]
    [ Hσ1^T   Hs      ...      ...  ...     ...   .. ]
    [ Hσ2^T   Hσ1^T   Hs       ...  ...     ...   .. ]
    [ ⋮                                              ]
    [ HI^T    ...     Hσ1σ2^T  ...  Hσ2^T  Hσ1^T  Hs ]; (10.34)

note that the diagonal blocks equal Hs and that, since H is a symmetric matrix, the lower blocks are the transposes of the upper ones. The optimal weight vector w is given by (10.23).

The threshold b ∈ Gn^D can be computed by using the KKT conditions with the Clifford support vectors as follows

b = ∑_{j=1}^{l} (yj − w†^T xj)/l. (10.35)

The decision function can be seen as sectors reserved for each involved class, i.e. in the case of complex numbers (G1,0,0) or quaternions (G0,2,0) we can see that the circle or the sphere is divided by means of spherical vectors. Thus the decision function can be envisaged as

y = csignm(f(x)) = csignm(w†^T x + b)
= csignm(∑_{j=1}^{l} (αj ◦ yj)(xj†^T x) + b), (10.36)

where csignm(f(x)) is the function for detecting the sign of f(x), m stands for the different values which indicate the state valency, e.g. bivalent or tetravalent, and the operation “◦” is defined as

αj ◦ yj = ⟨αj⟩0 ⟨yj⟩0 + ⟨αj⟩1 ⟨yj⟩1 σ1 + … + ⟨αj⟩2^n ⟨yj⟩2^n I, (10.37)

i.e. one simply takes as coefficients of the multivector basis the products of the coefficients of blades of the same grade. For clarity we introduce this operation “◦”, which takes place implicitly in the previous equation (10.25).

Note that the cases of complex numbers 2-state (outputs 1 for −π/2 ≤ arg(f(x)) < π/2 and −1 for π/2 ≤ arg(f(x)) < 3π/2) and 4-state (outputs 1+i for 0 ≤ arg(f(x)) < π/2, −1+i for π/2 ≤ arg(f(x)) < π, −1−i for π ≤ arg(f(x)) < 3π/2 and 1−i for 3π/2 ≤ arg(f(x)) < 2π) can be solved by the multi-class real-valued SVM; however, in the case of higher representations like the 16-state using quaternions, it would be awkward to resort to multi-class real-valued SVMs.
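The 4-state case described above is simple to sketch in code; the quadrant boundaries follow the intervals just listed (the handling of the exact boundary angles is our choice):

```python
# Illustrative sketch of the 4-state csign decision for complex outputs:
# the argument of f(x) selects one of the four labels 1+i, -1+i, -1-i, 1-i.
import cmath

def csign4(fx):
    a = cmath.phase(fx) % (2 * cmath.pi)      # arg(f(x)) mapped into [0, 2*pi)
    if a < cmath.pi / 2:
        return 1 + 1j
    if a < cmath.pi:
        return -1 + 1j
    if a < 3 * cmath.pi / 2:
        return -1 - 1j
    return 1 - 1j

print(csign4(2 + 0.5j))   # (1+1j)
print(csign4(-1 - 3j))    # (-1-1j)
```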

The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to utilize the components of the multivector output to represent different classes. The number of achievable class outputs is directly proportional to the dimension of the involved geometric algebra. The key idea for solving multi-class classification in the geometric algebra is to prevent the multivector elements of different grade from collapsing into a scalar; this is possible thanks to the redefinition of the primal problem involving the Clifford product instead of the inner product (10.22). The reader should bear in mind that the Clifford product performs the direct product between the spaces of different grade and its result is represented by a multivector, thus the outputs of the CSVM are represented by y = ys + yσ1 + yσ2 + ... + yI ∈ {±1 ± σ1 ± σ2 … ± I}.


Nonlinear Clifford SVM for Classification

For nonlinear Clifford-valued classification problems we require a Clifford-valued kernel K(x, y). In order to fulfill the Mercer theorem we resort to a componentwise Clifford-valued mapping

x ∈ Gn → Φ(x) = Φs(x) + Φσ1(x)σ1 + Φσ2(x)σ2 + … + ΦI(x)I ∈ Gn.

In general we build a Clifford kernel K(xm, xj) by taking the Clifford product between the reversion of xm and xj as follows

K(xm, xj) = Φ(xm)† Φ(xj); (10.38)

note that the kind of reversion operation (·)† of a multivector depends on the signature of the involved geometric algebra Gp,q,r. Next, as illustration, we present kernels using different geometric algebras. According to the Mercer theorem, there exists a mapping u : G → F which maps the multivectors x ∈ Gn into the complex Euclidean space, x → u(x) = ur(x) + I uI(x).

Complex-valued linear kernel function in G1,0,0 (the center of this geometric algebra, i.e. {1, I = σ1σ2}, is isomorphic with C):

K(xm, xn) = (u(xm)s u(xn)s + u(xm)I u(xn)I) + I(u(xm)s u(xn)I − u(xm)I u(xn)s)
= (k(xm, xn)ss + k(xm, xn)II) + I(k(xm, xn)Is − k(xm, xn)sI)
= Hr + I Hi, (10.39)

where (xs)m, (xs)n, (xI)m, (xI)n are vectors of the individual components of the complex numbers (x)m = (xs)m + I(xI)m ∈ G1,0,0 and (x)n = (xs)n + I(xI)n ∈ G1,0,0, respectively.
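For complex feature vectors, the kernel (10.39) is simply the Hermitian Gram matrix of the data. A brief numpy sketch (the sample values are arbitrary illustrations, not from the chapter):

```python
# Complex linear kernel of Eq. (10.39): K[m, n] = sum_d conj(x_md) x_nd,
# whose real and imaginary parts give Hr and Hi.
import numpy as np

X = np.array([[1 + 2j, 0 + 1j],
              [2 - 1j, 1 + 0j],
              [0 + 0j, 3 + 1j]])        # l = 3 samples, D = 2 complex entries

K = np.conj(X) @ X.T                    # Hermitian Gram matrix
Hr, Hi = K.real, K.imag                 # K = Hr + I*Hi as in the text

# Hermitian symmetry: Hr is symmetric, Hi is antisymmetric.
assert np.allclose(K, np.conj(K).T)
print(Hr)
print(Hi)
```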

For the quaternion-valued Gabor kernel function we use i = σ2σ3, j = −σ3σ1, k = σ1σ2. The Gaussian window Gabor kernel function (10.40) is built from the Gaussian window

g(xm, xn) = (1/(√(2π)ρ)) exp(−‖xm − xn‖²/(2ρ²)), (10.41)

and the variables w0 and xm − xn stand for the frequency and space domains respectively.

Unlike the Hartley transform or the 2D complex Fourier transform, this kernel function separates nicely the even and odd components of the involved signal, i.e.

K(xm, xn) = K(xm, xn)s + K(xm, xn)σ2σ3 + K(xm, xn)σ3σ1 + K(xm, xn)σ1σ2
= g(xm, xn) cos(w0^T xm) cos(w0^T xn) + g(xm, xn) cos(w0^T xm) sin(w0^T xn) i
+ g(xm, xn) sin(w0^T xm) cos(w0^T xn) j + g(xm, xn) sin(w0^T xm) sin(w0^T xn) k.

The component kernels k(xm, xn)u in the above equations satisfy the Mercer conditions as well.
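A numpy sketch of the component form above (illustrative only; ρ, w0 and the sample points are assumed values, and we take the second trigonometric argument in each product to be xn):

```python
# Quaternion Gabor kernel components: a Gaussian window times the four
# even/odd cos/sin products, returned as the (s, i, j, k) coefficients.
import numpy as np

def quat_gabor_kernel(xm, xn, w0, rho=1.0):
    g = np.exp(-np.sum((xm - xn) ** 2) / (2 * rho ** 2)) / (np.sqrt(2 * np.pi) * rho)
    cm, sm = np.cos(w0 @ xm), np.sin(w0 @ xm)
    cn, sn = np.cos(w0 @ xn), np.sin(w0 @ xn)
    return g * np.array([cm * cn, cm * sn, sm * cn, sm * sn])  # s, i, j, k

xm = np.array([0.3, -0.1])
xn = np.array([0.2, 0.4])
w0 = np.array([1.0, 0.5])
print(quat_gabor_kernel(xm, xn, w0))
```

Swapping the arguments exchanges the i and j components while the window g stays the same, which is the even/odd separation mentioned above.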

After we have defined these kernels we can proceed with the formulation of the SVM conditions. We substitute the mapped data Φ(x) = ∑_{u=1}^{2^n} ⟨Φ(x)⟩u into the linear function f(x) = w†^T Φ(x) + b. The problem can be stated similarly as in (10.22)-(10.26). In fact we can replace the kernel function in (10.33) to accomplish the Wolfe dual programming and thereby obtain the kernel function group for nonlinear classification

Hs = [Ks(xm, xj)]m,j=1,..,l
Hσ1 = [Kσ1(xm, xj)]m,j=1,..,l
⋮
Hσn = [Kσn(xm, xj)]m,j=1,..,l
⋮
HI = [KI(xm, xj)]m,j=1,..,l. (10.42)

In the same way we use the kernel functions to replace the dot product of the input data in (10.36). In general the output function of the nonlinear Clifford SVM reads

y = csignm(f(x)) = csignm(w†^T Φ(x) + b). (10.43)

Clifford SVM for Regression

The representation of the data set for the case of the Clifford SVM for regression is the same as for the Clifford SVM for classification; we represent the data set xi = [xi1, xi2, ..., xiD]^T, where xij ∈ Gn and D is its dimension. Let (x1, y1), (x2, y2), ..., (xj, yj), ..., (xl, yl) be the training set of independently and identically distributed multivector-valued sample pairs, where each label yi = yis + yiσ1 σ1 + yiσ2 σ2 + ... + yiI I, and the first subindex s stands for the scalar part. The regression problem using multivectors is to find a multivector-valued function f(x) that has at most ε-deviation from the actually obtained targets yi ∈ Gn for all the training data and, at the same time, is as flat as possible. We will use a multivector-valued

ε-insensitive loss function and arrive at the formulation of Vapnik [1]:

min L(w, ξ, ξ̃) = ½ w†^T w + C ∑_{i,j} (ξij + ξ̃ij)
subject to (10.44)
(yi − w†^T xi − b)j ≤ ε + ξij
(w†^T xi + b − yi)j ≤ ε + ξ̃ij
ξij ≥ 0, ξ̃ij ≥ 0 for all i, j,
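The componentwise ε-insensitive penalty implicit in the constraints of (10.44) can be sketched as follows (an assumed, simplified form; the coefficient values are arbitrary):

```python
# Componentwise epsilon-insensitive loss: each multivector coefficient j
# contributes only the part of its deviation that exceeds epsilon.
import numpy as np

def eps_insensitive(y, f, eps):
    """y, f: arrays of multivector coefficients; returns per-component loss."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

y = np.array([1.0, -0.5, 0.2, 0.0])      # target quaternion coefficients
f = np.array([1.3, -0.45, -0.4, 0.05])   # predicted coefficients
print(eps_insensitive(y, f, eps=0.1))    # only the excess beyond eps survives
```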

where w, x ∈ Gn^D and (·)j extracts the scalar accompanying a multivector basis element. Next we proceed as in section 10.3; since the expression for the orientation of the optimal hyperplane is the same as in (10.23), each of the wi is computed as follows:

ws = ∑_{j=1}^{l} ((αs)j − (α̃s)j)(xs)j,
wσ1 = ∑_{j=1}^{l} ((ασ1)j − (α̃σ1)j)(xσ1)j, ...,
wI = ∑_{j=1}^{l} ((αI)j − (α̃I)j)(xI)j.

We can now redefine the entries of the vector in (10.27); these are given by

as^T = [(αs1 − α̃s1), (αs2 − α̃s2), ..., (αsl − α̃sl)]
aσ1^T = [(ασ11 − α̃σ11), (ασ12 − α̃σ12), ..., (ασ1l − α̃σ1l)]
⋮ (10.45)
aI^T = [(αI1 − α̃I1), (αI2 − α̃I2), ..., (αIl − α̃Il)]

Now we can rewrite the Clifford product, as we did in (10.29)-(10.31), to get the primal problem as follows:

min ½ a^T H a + C · ∑_{i=1}^{l} (ξ + ξ̃)
subject to
(w† x + b − y)j ≤ (ε + ξ)j
(y − w† x − b)j ≤ (ε + ξ̃)j
ξij ≥ 0, ξ̃ij ≥ 0 for all i, j.

Thereafter, we write straightforwardly the dual of the primal problem above for solving the regression problem

max −α̃^T(ε − y) − α^T(ε + y) − ½ a^T H a
subject to
∑_{j=1}^{l} (αsj − α̃sj) = 0, ∑_{j=1}^{l} (ασ1j − α̃σ1j) = 0, ...,
∑_{j=1}^{l} (αIj − α̃Ij) = 0,
0 ≤ (αs)j ≤ C, 0 ≤ (ασ1)j ≤ C, ..., 0 ≤ (ασ1σ2)j ≤ C, ..., 0 ≤ (αI)j ≤ C,
0 ≤ (α̃s)j ≤ C, 0 ≤ (α̃σ1)j ≤ C, ..., 0 ≤ (α̃σ1σ2)j ≤ C, ..., 0 ≤ (α̃I)j ≤ C,
j = 1, ..., l. (10.46)

For the nonlinear regression we can use a particular kernel for computing k(xm, xj) = Φ(x̃m)Φ(xj); again, this kind of conjugation operation (·)∗ of a multivector depends on the signature of the involved geometric algebra Gp,q,r. We can use the kernels described in subsection 10.4.

SVMs are very powerful for solving regression and classification tasks. They carry out predictions by linearly combining kernel basis functions. By mapping the input feature space to a higher-dimensional space, SVMs can separate clusters linearly by means of an optimal hyperplane. A rather limited way to apply existing SVMs to sequence prediction [? ?] or classification [12] is to build a training set either by transforming the sequential input into an input vector of some static domain (e.g., a frequency or phase representation, or a Hidden Markov Model, HMM [13, 14]), or by simple frequency counting of patterns, symbols or substrings, or by taking fixed time windows of k sequential values [10]. The window-based approaches of course fail if the temporal dependency exceeds the length of k steps. As for training HMMs with long sequences, unfortunately they get stuck in numerous local minima [15, 16]. Suykens and Vandewalle [17] incorporate the dynamic equations in the primal problem for an SVM solution. The major disadvantage of this approach is that the problem is no longer convex, thus there is no guarantee of finding an optimal global solution. In all these attempts there has not been a recurrent SVM which learns tasks involving time lags of arbitrary length between important input events. However, a pioneering attempt using real-valued SVM and neuroevolution for sequence prediction was done by Schmidhuber et al. [18]. Unfortunately, at present the research activity on recurrent SVMs is very scarce. We started to explore a way to build a CSVM-based recurrent system which profits from all the advantages of the CSVM, namely: it helps to maintain convexity, it is MIMO, and it is suited to process sequences with geometric characteristics. In order to do that, we decided to connect two processing modules in cascade: a Long Short-Term Memory (LSTM) [20] and a CSVM.


LSTM-CSVM is an Evolino- and Evoke-based system [18, 19]: the underlying idea of these systems is that two cascaded modules are needed: a robust module to process short- and long-time dependencies (the LSTM) and an optimization module to produce precise outputs (a CSVM, the Moore-Penrose pseudoinverse method, or an SVM, respectively). The LSTM module addresses the disadvantage of having relevant pieces of information outside the history window and also avoids the “vanishing error” problem presented by algorithms like Back-Propagation Through Time (BPTT, e.g., Williams and Zipser 1992) or Real-Time Recurrent Learning (RTRL, e.g., Robinson and Fallside 1987)¹. Meanwhile the CSVM maps the internal activations of the first module to a set of precise outputs; again, advantage is taken of the multivector output representation to implement a system with fewer processing units which is therefore less computationally complex.

LSTM-CSVM works as follows: a sequence of input vectors (u(0) ... u(t)) is given to the LSTM, which in turn feeds the CSVM with the outputs of each of its memory cells, see Fig. 10.1.

The CSVM aims at finding the expected nonlinear mapping of the training data. The input and output equations of Figure 10.1 are

y(t) = b + ∑_{i=1}^{k} wi K(φ(t), φi(t)), (10.47)

where φ(t) = [ψ1, ψ2, ..., ψn]^T ∈ R^n is the activation at time t of the n units of the LSTM given the input vectors (u(0) ... u(t)) and the weight matrix W; this serves as input to the CSVM. Since the LSTM is a recurrent net, the argument of the function f(·) represents the history of the input vectors.
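A very rough sketch of the cascade of (10.47), with a plain tanh recurrence standing in for the LSTM and a Gaussian kernel readout; every name and constant here is an assumption for illustration only:

```python
# Toy cascade: a recurrent module turns the input sequence into an
# activation vector phi(t), and the readout is a kernel expansion over
# stored activations, as in Eq. (10.47).
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 5                                  # hidden units, stored activations
W_in = rng.normal(size=(n, 2))
W_rec = rng.normal(size=(n, n)) * 0.3

def phi(seq):
    h = np.zeros(n)                          # stand-in for LSTM cell outputs
    for u in seq:
        h = np.tanh(W_in @ u + W_rec @ h)    # NOT a real LSTM: tanh recurrence
    return h

support = [phi(rng.normal(size=(4, 2))) for _ in range(k)]   # phi_i(t)
w, b = rng.normal(size=k), 0.1

def y(seq, sigma=1.0):                       # Eq. (10.47) with a Gaussian kernel
    p = phi(seq)
    return b + sum(wi * np.exp(-np.sum((p - pi) ** 2) / (2 * sigma ** 2))
                   for wi, pi in zip(w, support))

print(y(rng.normal(size=(6, 2))))
```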

¹ The reader can get more information about the BPTT and RTRL vanishing error versus the LSTM constant error flow in [20].


First, the LSTM-CSVM system was trained using the conventional algorithm for the LSTM. Although the system learns, unfortunately it takes too long to find a suitable matrix W. Instead, propagating the training data through the LSTM-CSVM system, we evolved the rows of the matrix using the evolutionary algorithm known as Enforced Sub-Populations (ESP) [21]. This approach differs from the standard methods because, instead of evolving the complete set of net parameters, it rather evolves subpopulations of the LSTM memory cells. For the mutation of the chromosomes, ESP uses a Cauchy density function.
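The Cauchy mutation used by ESP can be sketched in a few lines (the scale parameter and helper name are assumptions of ours):

```python
# Minimal sketch of a Cauchy mutation on a chromosome of real-valued weights.
import numpy as np

def cauchy_mutate(chromosome, scale=0.1, rng=None):
    rng = rng or np.random.default_rng()
    # Heavy-tailed Cauchy noise: mostly small steps, occasional large jumps,
    # which helps escape local optima compared with Gaussian mutation.
    return chromosome + scale * rng.standard_cauchy(size=chromosome.shape)

rng = np.random.default_rng(42)
parent = np.zeros(6)
child = cauchy_mutate(parent, scale=0.1, rng=rng)
print(child)
```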

10.7 Applications

In this section we present five interesting experiments. The first one shows multi-class classification using the CSVM with a simulated example. Here, we also present the number of variables computed per approach and a time comparison between the CSVM and three approaches for multi-class classification using real-valued SVMs. The second is about object multi-class classification with two types of training data: phase a) artificial data and phase b) real data obtained from a stereo vision system. We also compared the CSVM against MLPs (for multi-class classification). The third experiment presents a multi-class interpolation. The fourth and fifth include the experimental analysis of the recurrent CSVM.

We extended the well-known 2-D spiral problem to the 3-D space. This experiment should test whether the CSVM is able to separate five 1-D manifolds embedded in R³. In this application we used a quaternion-valued CSVM which works in G0,2,0², which allows us to have quaternion inputs and outputs; therefore, with one output quaternion we can represent up to 2⁴ = 16 classes. The functions were generated as follows:

fi(t) = [xi(t), yi(t), zi(t)] = [zi cos(θ) sin(θ), zi sin(θ) sin(θ), zi cos(θ)], for i = 1, ..., 5.
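The per-class offsets distinguishing f1, ..., f5 are not shown explicitly above; a plausible way to generate such a family (an assumption on our part) is to phase-shift one conical spiral per class:

```python
# Five phase-shifted spherical-coordinate spirals of the parametric form above.
import numpy as np

def spiral_class(i, n=50, turns=3):
    t = np.linspace(0.1, 1.0, n)
    theta = 2 * np.pi * (turns * t + i / 5.0)   # assumed per-class phase offset
    z = t
    return np.column_stack([z * np.cos(theta) * np.sin(theta),
                            z * np.sin(theta) * np.sin(theta),
                            z * np.cos(theta)])

classes = [spiral_class(i) for i in range(5)]
print(classes[0].shape)   # (50, 3)
```

Note that every point of fi lies at distance |zi| from the origin, so the five curves wind around each other on nested shells, which is what makes the problem highly nonlinear.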

To depict these vectors they were normalized by 10. In Fig. 10.2 one can see that the problem is highly nonlinearly separable. For training, the CSVM uses 50 input quaternions from each of the five functions; since these have three coordinates we simply use

² The dimension of this geometric algebra is 2² = 4.



Fig. 10.2 3D spiral with five classes. The marks represent the support multivectors found by

the CSVM

[0, xi(t), yi(t), zi(t)]. The CSVM used the kernel given by (10.42). Note that the CSVM indeed manages to separate the five classes.

According to [22] the most used methods for multi-class classification are: one-against-all [23], one-against-one [24], DAGSVM [26], and some methods that solve the multi-class problem in one step, known as all-together methods [27]. Table 10.1 shows a comparison of the number of variables computed per approach, considering also the CSVM. The experiments shown in [22] indicate that “one-against-one and DAG methods are more suitable for practical use than the other methods”; of these methods we have chosen one-against-one, as well as the earliest implementation for SVM multi-class classification, the one-against-all approach, for comparisons against our proposed CSVM. The comparisons were made using the 3D spiral toy example and the quaternion CSVM shown in the previous subsection. The number of classes was increased in each experiment; we started with K = 3 classes and 50 training inputs for each class. Since the training inputs have three coordinates we simply use the bivector part of the quaternion for the CSVM approach, namely xi = xi(t)σ2σ3 + yi(t)σ3σ1 + zi(t)σ1σ2 ≡ [0, xi(t), yi(t), zi(t)]; therefore the CSVM computes D ∗ N = 3 ∗ 150 = 450 variables. The approaches one-against-all and one-against-one compute 450 and 300 variables respectively; however, the training times of the CSVM and one-against-one are very similar in the first experiment. Note that when we increase the number of classes the performance of the CSVM is much better than that of the other approaches, because the number of variables to compute is greatly reduced. We improved the computational efficiency of all these algorithms utilizing the decomposition method [28] and the shrinking technique [29]. We can see in Table 10.2 that the CSVM, using a quarter of the variables, is still faster, with around a quarter of the processing time of the other approaches. The classification performance of the four approaches is presented in Table 10.3. We used 50 and 20 vectors per class during training and test respectively. We can see that the CSVM for classification has overall the best performance.


Table 10.1 Comparison of the number of variables computed per approach

Method                        QP problems   NVQP     TNV
CSVM                          1             D*N      D*N
One-against-all               K             N        K*N
One-against-one               K(K-1)/2      2*N/K    N(K-1)
DAGSVM                        K(K-1)/2      2*N/K    N(K-1)
Considering all data at once  1             K*N      K*N

NVQP: number of variables to compute per quadratic problem
TNV: total number of variables
D: training input data dimension
N: total number of training examples
K: number of classes
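The counts of Table 10.1 written out as code; the example numbers reproduce the D∗N = 3∗150 = 450 and N(K−1) = 300 figures quoted in the text for the 3-D spiral:

```python
# Total number of variables (TNV) per multi-class approach, from Table 10.1.

def total_vars(method, D, N, K):
    return {"CSVM": D * N,
            "one-against-all": K * N,
            "one-against-one": N * (K - 1),
            "DAGSVM": N * (K - 1),
            "all-at-once": K * N}[method]

D, N, K = 3, 150, 3                            # 3-D spiral, 50 inputs per class
print(total_vars("CSVM", D, N, K))             # 450
print(total_vars("one-against-all", D, N, K))  # 450
print(total_vars("one-against-one", D, N, K))  # 300
```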

Table 10.2 Training times in seconds (number of variables in parentheses); 50 training vectors per class

                          K=3        K=5        K=16
CSVM                      0.07       0.987      10.07
C=1000                    (450)      (750)      (3200)
One-against-all           0.11       8.54       131.24
(C, σ)=(1000, 2⁻³)        (450)      (1250)     (12800)
One-against-one           0.09       2.31       30.86
(C, σ)=(1000, 2⁻²)        (300)      (1000)     (12000)
DAGSVM                    0.10       3.98       38.88
(C, σ)=(1000, 2⁻³)        (300)      (1000)     (12000)

Used kernels K(xi, xj) = e^{−σ‖xi−xj‖²} with parameters taken from σ = {2, 2⁰, 2⁻¹, 2⁻², 2⁻³} and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.


Table 10.3 Classification performance in percent

                          Ntest=60   Ntest=100  Ntest=320
                          K=3        K=5        K=16
CSVM                      98.66      99.2       99.87
C=1000                    (95.00)    (98.00)    (99.68)
One-against-all           96.00      98.00      99.75
(C, σ)=(1000, 2⁻³)        (90.00)    (96.00)    (99.06)
One-against-one           98.00      98.4       99.87
(C, σ)=(1000, 2⁻²)        (95.00)    (99.00)    (99.375)
DAGSVM                    97.33      98.4       99.87
(C, σ)=(1000, 2⁻³)        (95.00)    (97.00)    (99.68)

Ntest = number of test vectors; accuracy in the training phase above and, below in brackets, accuracy in the test phase.
Used kernels K(xi, xj) = e^{−σ‖xi−xj‖²} with parameters taken from σ = {2, 2⁰, 2⁻¹, 2⁻², 2⁻³} and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.

In this subsection we show an application of the Clifford SVM for multi-class object classification. In the experiments shown here we want to use only one CSVM with a quaternion as input and a quaternion as output, which allows us to have up to 2⁴ = 16 classes. Basically we packed into a feature quaternion one 3-D point (which lies on the surface of the object) and the magnitude of the distance between this point and the point lying on the main axis of the object in the same level curve. Fig. 10.3 depicts the 4 features taken from the object:

Xi = δi + xi σ2σ3 + yi σ3σ1 + zi σ1σ2 ≡ [δi, (xi, yi, zi)]^T (10.48)

For each object we trained the CSVM using a set of several feature quaternions obtained from different level curves; that means that each object is represented by several feature quaternions and not only one. Due to this way of training the CSVM, the order in which the feature quaternions are shown to the CSVM is important: we begin to sample data from the bottom to the top of the objects, and we show the training and test data in this order to the CSVM. We processed the outputs using a counter that computes which class fires the most for each training or test set in



Fig. 10.3 Geometric characteristics of one training object. The magnitude is δi , and the 3D

coordinates (xi , yi , zi ) to build the feature vector: [δi , (xi , yi , zi )]

order to decide which class the object belongs to, see Fig. 10.4. Note carefully that this experiment is in any case a challenge for any recognition algorithm, because the feature signature is sparse. We will show later that, using this kind of feature vector, the CSVM's performance is superior to the MLP's. Of course, if more time is spent improving the quality of the feature signature, the CSVM's performance will increase accordingly.


Fig. 10.4 After we get the outputs, these are accumulated using a counter to calculate which class the object belongs to

It is important to say that all the objects (synthetic and real) were preprocessed in order to have a common center and the same scale; thus our learning process can be seen as centering- and scale-invariant.

In the first phase of this experiment we used training data obtained from synthetic objects; the training set is shown in Fig. 10.5. Note that we have six different objects, which means a six-class classification problem, and we solve it with only one CSVM, making use of its multi-output characteristic. In general, for the “one versus all” approach one needs n SVMs (one for each class). In contrast, the CSVM needs only one machine because its quaternion output allows up to 16 class outputs. For the input-data coding we used a 3D point, which is packed into


the σ2σ3, σ3σ1, σ1σ2 basis of the feature quaternion, and the magnitude was packed into the scalar part of the quaternion. Figure 10.6 shows the 3D points sampled from the objects. We compared the performance of the following approaches: the CSVM, a 4-7-6 MLP, and the real-valued SVM approaches one-against-one, one-against-all and DAGSVM. The results in Tables 10.4 and 10.5 show that the CSVM has better generalization and fewer training errors than the MLP approach and the real-valued SVM approaches. Note that all methods were sped up using the acceleration techniques [28, 29]. The authors think that the MLP presents more training and generalization errors because the way we represent the objects (as feature quaternion sets) makes the MLP get stuck in local minima very often during the learning phase, whereas the CSVM is guaranteed to find the optimal solution to the classification problem because it solves a convex quadratic problem with a global minimum. With respect to the real-valued SVM based approaches, the CSVM takes advantage of the Clifford product, which enhances the discriminatory power of the classifier itself, unlike the other approaches which are based solely on inner products.

Fig. 10.5 Training synthetic object set, panels a)-f)

In this phase of the experiment we obtained the training data using our robot “Geometer”, shown on the right side of Fig. 10.7. We take two stereoscopic views of each object: one frontal view and one view rotated 180° (w.r.t. the frontal view). After that, we applied the Harris filter on each view in order to get the object corners and then, with the stereo system, the 3D points (xi, yi, zi) which lie on the object surface, and to calculate the magnitude δi for the quaternion equation (10.49). This process


Table 10.4 Object-recognition performance in percent (%) during training using synthetic data

     NTS   CSVM     MLP    a)     b)     c)
           C=1200
C    86    93.02    48.83  87.2   90.69  90.69
S    84    89.28    46.42  89.28  90.47  90.47
F    84    85.71    40.47  83.33  84.52  83.33
W    86    91.86    46.51  90.69  91.86  93.02
D    80    93.75    50.00  87.5   91.25  90.00
U    84    86.90    48.80  82.14  83.33  84.52

C=cylinder, S=sphere, F=fountain, W=worm, D=diamond, U=cube
NTS = number of training vectors.
Used kernels K(xi, xj) = e^{−σ‖xi−xj‖²} with parameters taken from σ = {2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵} and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these 8 × 5 combinations, the best result was selected for each approach: a) (2⁻⁴, 1500), b) (2⁻³, 1200), c) (2⁻⁴, 1400).

Table 10.5 Object-recognition performance in percent (%) during test using synthetic data

     NTS   CSVM     MLP    a)     b)     c)
           C=1200
C    52    94.23    80.76  90.38  96.15  96.15
S    66    87.87    45.45  83.33  84.84  86.36
F    66    90.90    51.51  83.33  86.36  84.84
W    66    89.39    57.57  86.36  83.33  86.36
D    58    93.10    55.17  93.10  93.10  93.10
U    66    92.42    46.96  89.39  90.90  89.39

C=cylinder, S=sphere, F=fountain, W=worm, D=diamond, U=cube
NTS = number of training vectors.
Used kernels K(xi, xj) = e^{−σ‖xi−xj‖²} with parameters taken from σ = {2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵} and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these 8 × 5 combinations, the best result was selected for each approach: a) (2⁻⁴, 1500), b) (2⁻³, 1200), c) (2⁻⁴, 1400).

254 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano


Fig. 10.6 Sampling of the training synthetic object set

is illustrated in Fig. 10.7, and the whole training object set is shown in Fig. 10.8. We take the non-normalized 3D point for the bivector basis σ23, σ31, σ12 of the feature quaternion in (10.49).
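The construction of the feature quaternions from the stereo data can be sketched as follows. This is a plain-Python illustration, not the authors' code, and it assumes (hypothetically) that the magnitude δi is the Euclidean norm of the 3D corner point; the text does not specify δi precisely.

```python
import math

def feature_quaternion(point3d):
    """Map a 3D corner point (x, y, z) to a feature quaternion
    q = delta + x*s23 + y*s31 + z*s12, stored as (delta, x, y, z).
    delta is assumed here to be the norm of the point (an assumption)."""
    x, y, z = point3d
    delta = math.sqrt(x * x + y * y + z * z)
    return (delta, x, y, z)

# One feature quaternion per reconstructed corner point:
corners = [(1.0, 2.0, 2.0), (0.0, 3.0, 4.0)]
features = [feature_quaternion(p) for p in corners]
```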

Fig. 10.7 Left: sampling view of a real object; a big white cross is used for the depiction. Right: stereo vision system in the experiment environment

After the training, we tested with a set of feature quaternions that the machine did not see during its training, and we used a 'winner-takes-all' approach to decide which class each object belongs to. The results of the training and test are shown in Table 10.6. We trained the CSVM with an equal number of training data for each object, that is, 90 feature quaternions per object, but we tested with a different number of samples per object. Note that we have two pairs of objects that are very similar to each other; the first pair is composed of the half sphere shown in Fig. 10.8.c) and the rock in Fig. 10.8.d). In spite of this similarity, we obtained very good accuracy percentages in


Fig. 10.8 Training real object set, stereo pair images. We include only the frontal views

test phase for both objects: 65.9% for the half sphere and 84% for the rock. We think we obtained better results for the rock because this object has a lot of texture that produces many corners, which in turn capture its irregularities better; therefore we have more test feature quaternions for the rock than for the half sphere (75 against 44, respectively). The second pair of similar objects is shown in Fig. 10.8.e) and Fig. 10.8.f); these are two identical plastic juice bottles, but one of them (Fig. 10.8.f)) is burned. That makes the difference between them and gives the CSVM enough distinguishing features to form two object classes, as shown in Table 10.6: we obtained 60% correctly classified test samples for the bottle in Fig. 10.8.e) against 61% for the burned bottle in Fig. 10.8.f). The lower recognition rates for the last objects (Fig. 10.8 c), e) and f)) arise because the CSVM confuses the classes somewhat, due to the fact that the feature vectors are not large and rich enough.
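The winner-takes-all decision used above amounts to scoring every class (e.g. by the decision value of its classifier output) and picking the class with the largest score. A minimal illustration with made-up scores:

```python
def winner_takes_all(scores):
    """Return the class label with the highest decision score."""
    return max(scores, key=scores.get)

# Hypothetical per-class decision values for one test feature quaternion:
scores = {"cube": 0.71, "prism": 0.65, "rock": 0.84, "half sphere": 0.80}
predicted = winner_takes_all(scores)  # -> "rock"
```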

Table 10.6 Object-recognition performance using real data

Object            Training samples  Test samples  Correctly classified  Percentage (%)
Cube              90                50            38                    76.00
Prism             90                43            32                    74.42
Half sphere       90                44            29                    65.90
Rock              90                75            63                    84.00
Plastic bottle 1  90                65            39                    60.00
Plastic bottle 2  90                67            41                    61.20


A real-valued SVM can carry out regression and interpolation for multiple inputs and one real output. Remarkably, a Clifford-valued SVM can have multiple inputs and 2^n outputs for an n-dimensional space R^n. For handling regression we use 1.0 > ε > 0, where the diameter of the tube surrounding the optimal hyperplane is twice ε. For the case of interpolation we use ε = 0. We have chosen an interesting task where we use a CSVM for interpolation in order to encode a certain kind of behavior that we want a visually guided robot to perform. The robot should autonomously draw a complicated 2D pattern. This capability should be encoded internally in Long Term Memory (LTM), so that the robot reacts immediately without the need for reasoning, much as a skilled person reacts in milliseconds with incredible precision when accomplishing a very difficult task, for example a tennis player or a tango dancer. For our purpose we trained a CSVM off line using two real-valued functions. The CSVM used the geometric algebra G3+ (quaternion algebra), with two inputs (two components of the input quaternion) and two outputs (two components of the output quaternion). The first input u and first output x encoded the relation x = a sin(3u) cos(u) for one axis. The second input v and second output y encoded the relation y = a sin(3v) sin(v) for the other axis; see Fig. 10.9.a), b). The 2D pattern can be drawn using the 50 points generated by the functions for x and y; see Fig. 10.9.c. We tested whether the CSVM can interpolate well enough using 100 and 400 previously unseen input tuples {u, v}; see Fig. 10.9.d), e) respectively. Once the CSVM was trained we incorporated it as part of the LTM of the visually guided robot shown in Fig. 10.9.f. To carry out its task the robot called the CSVM for a sequence of input patterns. The robot was able to draw the desired 2D pattern, as we see in Fig. 10.10.a)-d). The reader should bear in mind that this experiment was designed using the equation of a standard function, in order to have a ground truth. In any case, our algorithm should also be able to learn 3D curves which do not have explicit equations.
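The training data for this experiment is easy to reproduce. The sketch below (plain Python; the amplitude a = 1 and the sampling interval are arbitrary choices, not taken from the chapter) generates the 50 training points of the three-petal rose pattern from the two relations given above:

```python
import math

def rose_points(n=50, a=1.0):
    """Sample the pattern x = a sin(3u) cos(u), y = a sin(3v) sin(v)
    at n parameter values u = v uniformly spaced in [0, 2*pi)."""
    pts = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        x = a * math.sin(3.0 * t) * math.cos(t)
        y = a * math.sin(3.0 * t) * math.sin(t)
        pts.append((x, y))
    return pts

training_set = rose_points()       # the 50 training points (cf. Fig. 10.9.c)
dense_set = rose_points(n=400)     # a denser grid, as used in the tests
```

Interpolating between the 50 training points with ε = 0 then amounts to requiring the regression surface to pass exactly through them, which is what the CSVM is asked to do here.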


Fig. 10.9 a) and b) Continuous curves of training output data for axes x and y (50 points).

c) 2D result by testing with 400 input data. d) Experiment environment



Fig. 10.10 a), b), c) Image sequence while the robot is drawing. d) The robot's drawing; result of testing with 400 input data

In this section we first analyze the performance of the recurrent CSVM against state-of-the-art algorithms on a time-series problem. Then, in a second experiment, we apply the recurrent CSVM to tackle a partially observable problem in robotics. We utilized the data of water levels of the Venice Lagoon during the periods from 1980 to 1989 and from 1990 to 1995.³ The recurrent CSVM was trained with the first 400 series values. The LSTM module was evolved with four memory cells during 100 generations, using the Cauchy parameter α = 10^−3. The CSVM was trained using as inputs the output values of these four memory cells. The achieved training error was 0.0019 and the recurrent CSVM was tested with 600 steps. The system was able to predict 600 steps of the series in advance. Figure 10.11.a) shows the prediction performance of the recurrent CSVM using the training data, and Figure 10.11.b) depicts the results of predictions using the 600 unforeseen test values. In the figures the ordinate's range is [0..1].

Fig. 10.11 a) Time series of the Venice Lagoon, training. b) Recall data. Thick line (in red): real data; thin line: values predicted by the LSTM-CSVM

3 A. Tomasin, CNR-ISDMG Universita Ca’Foscari, Venice.


In the next test, we employed the Mackey-Glass time series, which is commonly used for testing the generalization and prediction ability of an algorithm. The series is generated by the following differential equation

ẏ(t) = α y(t − τ) / (1 + y(t − τ)^β) − γ y(t),   (10.49)

where the parameters are usually set to α = 0.2, β = 10 and γ = 0.1. This equation is chaotic when the delay is τ > 16.8. We select the most commonly used delay value, τ = 17. The task is to predict the series value y[t + P] after the delay by using the previous points y[t], y[t − 6], y[t − 12], y[t − 18]. For P = 50 sampled values, it is expected that the algorithm learns the four-dimensional function y(t) = f(y[t − 50], y[t − 50 − 6], y[t − 50 − 12], y[t − 50 − 18]).
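The benchmark series can be generated, for instance, with a simple forward-Euler discretization of (10.49) using a unit time step; this is a common choice for the benchmark, though the chapter does not state the integration scheme it used:

```python
def mackey_glass(n, alpha=0.2, beta=10.0, gamma=0.1, tau=17, y0=1.2):
    """Generate n values of the Mackey-Glass series with a forward-Euler
    step of 1; the history y(t) for t <= 0 is initialized to y0."""
    y = [y0] * (tau + 1)          # buffer holding the delayed values
    out = []
    for _ in range(n):
        y_tau = y[-(tau + 1)]     # y(t - tau)
        y_next = y[-1] + alpha * y_tau / (1.0 + y_tau ** beta) - gamma * y[-1]
        y.append(y_next)
        out.append(y_next)
    return out

series = mackey_glass(3000)       # length of the training portion used below

# One (input, target) training pair for P = 50:
t = 100
inputs = (series[t - 50], series[t - 56], series[t - 62], series[t - 68])
target = series[t]
```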

The LSTM-CSVM was trained with the first 3000 values of the series using P = 100. The LSTM module with 4 memory cells was evolved with a Cauchy parameter α = 10^−5 over 150 generations. The echo state approach was trained with 1000 neurons and achieved a mean square error of 10^−4, while the Evolino system achieved an error of 1.9 × 10^−3 with 30 cells evolved over 200 generations [19, 30]. It has been reported [31] that using an LSTM the minimum error achieved was 0.1184, using the same number of 4 neurons as in our LSTM-CSVM.

Method      Neurons/cells  Generations  Error
Echo state  1000           200          10^−4
Evolino     30             200          1.9 × 10^−3
LSTM        4              —            0.1184
LSTM-CSVM   4 and CSVM     150          0.011

Table 10.7 shows a summary of the comparison results. Here we note that the LSTM alone has poorer performance than the LSTM-CSVM, showing that the CSVM clearly improves the prediction precision. Note that for this complex time series, in contrast to the other two approaches (the echo state approach and Evolino), the LSTM-CSVM uses a smaller number of neurons and requires fewer generations during training to reach an acceptable error of 0.011.

Finally, we utilize the LSTM-CSVM with reinforcement learning in a robotic perception-action task. The robot system comprises a stereoscopic camera, a 6-D.O.F. robot arm and a 4-finger Barrett hand. The task consisted of moving the robot hand through a real 2D labyrinth. This was built using 10 blocks of 10 cm height each. To enhance the visibility, their top faces were painted red, which facilitated the segmentation of the blocks by the stereoscopic cameras. The stereoscopic system took images from an angle of 45 degrees; for that we needed to calibrate the cameras and rectify the camera views as if they were oriented perpendicularly above the labyrinth. With this information we had a complete 3D view from above. Using a color segmentation algorithm, we obtained the vector coordinates of the block corners. These observation vectors were then fed to the LSTM-CSVM.

The architecture of the LSTM-CSVM with reinforcement learning and the training were the same as in the previous application. The differences with respect to the simulated experiment were: i) the 3D vectors of the block corners were obtained by the stereoscopic camera (the blocks form a 2D labyrinth), ii) the robot actions were hand movements through the 2D labyrinth, and iii) the length of this real labyrinth was smaller than the previously simulated one. We had 4 different labyrinths, each 10 blocks long. The evolution of the system consisted of 50 generations using a Cauchy noise parameter of α = 10^−4. The CSVM module is fed with a vector of the outputs of the last 4 memory cells of the LSTM. The 4 outputs of the CSVM represent the 4 different actions to be carried out during navigation through the labyrinth. After each generation, the best net was kept, and the task was considered fulfilled at a perfect reward of 4.0. The four possible actions of the system are robot hand movements of 10 cm length towards left, right, back and forth. The initial position of the robot arm was located at the entry of a labyrinth. In all the labyrinths we exploited the internal state (support state), i.e. the coordinates of the exit, which was the same in all cases.


Fig. 10.12 a) Training labyrinths 1 and 2; recall labyrinths 3 and 4. (Third column) The robot hand is at the entry of the labyrinth holding a plastic object. (Fourth column) Position of the hand marked with a cross


Figure 10.12 shows the four labyrinths. The images in the first and third columns were obtained by the stereoscopic system. The images in the second and fourth columns were obtained after perspective correction and color segmentation. Labyrinths 1 and 2 were used for training, whereas 3 and 4 were used for recall. The third and fourth columns in Figure 10.12 show the agent at the beginning of the labyrinths. In each labyrinth, only one trajectory was successful (reward 4.0). The training and the test were done off line; then the robot had to follow the action vectors computed by the LSTM-CSVM system.

10.8 Conclusions

This chapter generalizes the real-valued SVM to the Clifford-valued SVM, which is used for classification, regression and recurrence. The CSVM accepts multiple multivector inputs and multivector outputs, like a MIMO architecture, which allows us to build multi-class applications. We can use the CSVM over complex, quaternion or hypercomplex numbers according to our needs. The application section presents experiments in pattern recognition and visually guided robotics which illustrate the power of the algorithms and help the reader understand the Clifford SVM and use it in various tasks of complex and quaternion signal and image processing, pattern recognition and computer vision using high-dimensional geometric primitives. This generalization appears promising particularly in geometric computing and its applications, such as graphics, augmented reality and robot vision.

References

1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

2. Burges, C.J.C.: A tutorial on Support Vector Machines for Pattern Recognition. In:

Knowledge Discovery and Data Mining, vol. 2, pp. 1–43. Kluwer Academic Publish-

ers, Dordrecht (1998)

3. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An Introduction to Kernel-

Based Learning Algorithms. IEEE Trans. on Neural Networks 12(2), 181–202 (2001)

4. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learn-

ing methods. Cambridge University Press, Cambridge (2000)

5. Hestenes, D., Li, H., Rockwood, A.: New algebraic tools for classical geometry. In: Som-

mer, G. (ed.) Geometric Computing with Clifford Algebras. Springer, Heidelberg (2001)

6. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines, Technical Report

No. 1043, University of Wisconsin, Departament of Statistics, pp. 10–35 (2001)

7. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In:

Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN),

pp. 185–201 (1999)

8. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Geometric Preprocess-

ing, geometric feedforward neural networks and Clifford support vector machines for

visual learning. Journal Neurocomputing 67, 54–105 (2005)

9. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Recurrent Clifford Sup-

port Machines. In: Proceedings IEEE World Congress on Computational Intelligence,

Hong-Kong (2008)


10. Mukherjee, S., Osuna, E., Girosi, F.: Nonlinear prediction of chaotic time series using a

support vector machine. In: Principe, J., Giles, L., Morgan, N., Wilson, E. (eds.) Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, New

York, pp. 511–520 (1997)

11. Müller, K.-R., Smola, A.J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.N.:

Predicting time series with support vector machines. In: Gerstner, W., Hasler, M., Ger-

mond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–1004. Springer,

Heidelberg (1997)

12. Salomon, J., King, S., Osborne, M.: Framewise phone classification using support vector

machines. In: Proc. Int. Conference on Spoken Language Processing, Denver (2002)

13. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector machines. In:

Proc. Int. Conference on Machine Learning (2003)

14. Jaakkola, T.S., Haussler, D.: Exploiting generative models in discriminative classifiers.

In: Proc. of the Conference on Advances in Neural Information Systems II, Cambridge,

pp. 487–493 (1998)

15. Bengio, Y., Frasconi, P.: Diffusion of credit in Markovian models. In: Tesauro, G.,

Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Systems 14. MIT

Press, Cambridge (2002)

16. Hochreiter, S., Mozer, M.: A discrete probabilistic memory for discovering dependencies

in time. In: Int. Conference on Neural Networks, pp. 661–668 (2001)

17. Suykens, J.A.K., Vandewalle, J.: Recurrent least squares support vector machines. IEEE

Transactions on Circuits and Systems-I 47, 1109–1114 (2000)

18. Schmidhuber, J., Gagliolo, M., Wierstra, D., Gomez, F.: Recurrent Support Vector Ma-

chines, Technical Report, no. IDSIA 19-05 (2005)

19. Schmidhuber, J., Wierstra, D., Gómez, F.J.: Hybrid neuroevolution optimal linear search

for sequence prediction. In: Kaufman, M. (ed.) Proceedings of the 19th International

Joint Conference on Artificial Intelligence, IJCAI, pp. 853–858 (2005)

20. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory, Technical Report FKI-207-

95 (1996)

21. Gómez, F.J., Miikkulainen, R.: Active guidance for a finless rocket using neuroevolution.

In: Proc. GECCO, pp. 2084–2095 (2003)

22. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class Support Vector Machines.

Technical report, National Taiwan University, Taiwan (2001)

23. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L.Y., Muller, U.,

Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study

in handwriting digit recognition. In: International Conference on Pattern Recognition,

pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994)

24. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise proce-

dure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing:

Algorithms, Architectures and Applications. Springer, Heidelberg (1990)

25. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges,

C.J.J., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, pp.

255–268. MIT Press, Cambridge (1999)

26. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classifi-

cation. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547–553.

MIT Press, Cambridge (2000)

27. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-

TR-98-04, Royal Holloway, University of London, Egham (1998)


28. Hsu, C.W., Lin, C.J.: A simple decomposition method for Support Vector Machines.

Machine Learning 46, 291–314 (2002)

29. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges,

C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1998)

30. Jaeger, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in

wireless communication. Science 304, 78–80 (2004)

31. Gers, F.A., Eck, D., Schmidhuber, J.: Applying LSTM to time series predictable through

time-window approaches. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001.

LNCS, vol. 2130, pp. 669–685. Springer, Heidelberg (2001)

Chapter 11

A Classification Method Based on Principal

Component Analysis and Differential Evolution

Algorithm Applied for Prediction Diagnosis

from Clinical EMR Heart Data Sets

Abstract. In this article we study the use of a classification method based on first preprocessing the data using principal component analysis, and then using the compressed data in the actual classification process, which is based on the differential evolution algorithm, an evolutionary optimization algorithm. The method is applied here for prediction diagnosis from clinical data sets of patients with a chief complaint of chest pain, using classical Electronic Medical Record (EMR) heart data sets. For experimentation we used a set of five frequently applied benchmark data sets, including the Cleveland, Hungarian, Long Beach, Switzerland and Statlog data sets. These data sets contain demographic properties, clinical symptoms, clinical findings, laboratory test results, specific electrocardiography (ECG) results, results pertaining to angina and coronary infarction, etc.; in other words, classical EMR data pertaining to the evaluation of a chest pain patient and to ruling out angina and/or Coronary Artery Disease (CAD). The prediction diagnosis results with the proposed classification approach were found to be promisingly accurate. For example, the Switzerland data set was classified with 94.5% ± 0.4% accuracy. Combining all these data sets resulted in a classification accuracy of 82.0% ± 0.5%. We compared the results of the proposed method with the corresponding results of other methods reported in the literature that have demonstrated relatively high classification performance on this problem. Depending on the case, the results of the proposed method were on a par with the best compared methods, or outperformed their

Pasi Luukka

Laboratory of Applied Mathematics, Lappeenranta University of Technology,

P.O. Box 20, FIN-53851 Lappeenranta, Finland

e-mail: pasi.luukka@lut.fi

Jouni Lampinen

Department of Computer Science, University of Vaasa, P.O. Box 700,

FI-65101 Vaasa, Finland

e-mail: jouni.lampinen@uwasa.fi

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 263–283.

springerlink.com c Springer-Verlag Berlin Heidelberg 2010

264 P. Luukka and J. Lampinen

classification accuracy clearly. In general, the results suggest that the proposed method has potential in this task.

11.1 Introduction

Many data sets that come from the real world are unavoidably contaminated with noise. Noise can be defined as a random error or variance of a measured variable [13]. Data analysis is almost always burdened with uncertainty of different kinds. There are several different techniques for dealing with noisy data [7].

A major problem in mining scientific data sets is that the data is often high dimensional. In many cases there is a large number of features representing each object. One problem is that the computational time of pattern recognition algorithms can become prohibitive when the number of dimensions grows high. This can be a severe problem, especially as some of the features are not discriminatory. Besides the computational cost, irrelevant features may also cause a reduction in the accuracy of some algorithms.

To address this problem of high dimensionality, a common approach is to identify the most important features associated with an object, so that further processing can be simplified without compromising the quality of the final results. There are several different ways in which the dimension of a problem can be reduced. The simplest approach is to identify important attributes based on input from domain experts. Another commonly used approach is Principal Component Analysis (PCA) [19], which defines new attributes (principal components, or PCs) as mutually orthogonal linear combinations of the original attributes. For many data sets, it is sufficient to consider only the first few PCs, thus reducing the dimension. However, for some data sets, PCA does not provide a satisfactory representation: mutually orthogonal linear combinations are not always the best way to define new attributes, and e.g. nonlinear combinations sometimes need to be considered. The analysis of the problem of dealing with high-dimensional data is both difficult and subtle. The information loss caused by these methods is also sometimes a problem.
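For intuition, PCA on two attributes can be carried out by hand: center the data, form the covariance matrix, and project onto its leading eigenvector. The sketch below (plain Python with a closed-form 2×2 eigen-decomposition) is only an illustration; the chapter's experiments use 13-attribute data and a full PCA implementation.

```python
import math

def pca_2d(points):
    """Return (eigenvalues, leading eigenvector, projections on PC1)
    for a list of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    d = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    l1, l2 = (a + c + d) / 2.0, (a + c - d) / 2.0
    # Eigenvector for the larger eigenvalue
    vx, vy = (b, l1 - a) if b != 0 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    v = (vx / norm, vy / norm)
    proj = [x * v[0] + y * v[1] for x, y in centered]
    return (l1, l2), v, proj

# Perfectly correlated data: the first PC carries all the variance,
# so the second eigenvalue vanishes and one dimension can be dropped.
(l1, l2), v, proj = pca_2d([(i, 2.0 * i) for i in range(5)])
```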

One of the latest methods in evolutionary computation is the differential evolution (DE) algorithm [30]. In this paper we examine the applicability of a classification method, in which the data is first preprocessed with PCA and the resulting data is then classified with a DE-classifier, to the diagnosis of heart disease. In the literature there are several papers where evolutionary computation research has concerned the theory and practice of classifier systems [4], [16], [17], [18], [31], [35], [10]. The differential evolution algorithm has been studied in unsupervised learning problems, which can in a sense be recast as classification problems, in [26], [11]. DE was also combined with artificial neural networks in [1] for the diagnosis of breast cancer. It has also been used to tune classifier parameter values in [12], and in a similarity classifier [23] to tune the parameters of similarity measures.

11 A Classification Method Based on PCA and DE 265

Here we propose a method which first preprocesses the data using PCA and then classifies the processed data using a differential evolution classification method. The differential evolution algorithm is applied for finding an optimal class vector to represent each class; a sample is then classified by comparing it with the class vectors. In addition, DE is also applied for determining the value of a distance parameter that we use for making the final classification decision.

The advantages of this procedure are that we are able to reduce the dimensionality, and hence the computational cost that would otherwise be intolerably high, especially for high-dimensional data sets. A further advantage is that we are able to filter out noise, which improves the creation of the class vector for each class in the classifier. The class vectors are optimized using the DE algorithm. Using this procedure we also find the optimal reduced dimension for these data sets. The combination of finding the best reduced dimension, filtering out noise from the data, and optimizing the class vectors and the parameters needed for the problem at hand yields a more accurate solution to the problem. The data sets for the empirical evaluation of the proposed approach were taken from the UCI Repository of Machine Learning data sets [25]. The classifier and preprocessing methods were implemented with MATLAB™ software.

From the optimization and modelling point of view, the classification problem subject to our investigations can be divided into two parts: the classification model, and the optimization approach applied for fitting (or learning) the model. Generally, a multipurpose classifier can be viewed as a scalable and learnable model that can be fitted to a particular data set by scaling it to the data dimensionality and by optimizing a set of model parameters to maximize the classification accuracy. For the optimization, simply the classification accuracy over the learning set may serve as the objective function value to be maximized. Alternatively, the optimization problem can be formulated as a minimization task, as we did here, where the number of misclassified samples is to be minimized. In the literature, mostly linear or nonlinear local optimization approaches have been applied for solving the actual classifier model optimization problem, or approaches that can be viewed as such. This is the most common approach despite the fact that the underlying optimization problem is a global optimization problem. For example, the weight set of a feed-forward neural network classifier is typically optimized with a gradient-descent-based local optimizer, or alternatively by some other local optimizer such as the Levenberg-Marquardt algorithm. This kind of use of limited-capacity optimizers for fitting the classification model limits the achievable classification accuracy in two ways. First, the model must be limited so that local optimizers can be applied to fit it. This means that only very moderately multimodal classification models can be applied, and due to such a modelling limitation, the classification capability will be limited correspondingly. Secondly, if a local optimizer is applied to optimize (to fit, or to learn) even a moderately multimodal classification model, it is likely to get trapped in a local optimum, a suboptimal solution. Thereby, the only way to get classifier models with a higher modelling capacity at our disposal, and also to get the full capacity out of current multimodal classification models, is to apply global


optimization for fitting the classification models to the data to be classified. For example, in the case of a nonlinear feed-forward neural network classifier, the model is clearly multimodal, but it is practically always fitted by applying a local optimizer that is capable of providing only locally optimal solutions. Thus, we consider that applying global optimization instead of local optimization is an important fundamental issue that is currently severely constraining the further development of classifiers. The capabilities of currently used local optimizers limit the selection of applicable classifier models, and the capabilities of currently used models that include multimodal properties are likewise limited by the capabilities of the optimizers applied to fit them to the data. Based on the above considerations, our basic motivation for applying a global optimizer for learning the applied classifier model comes from the fact that typically local (nonlinear) optimizers have been applied for this purpose, despite the underlying optimization problem actually being a multimodal global optimization problem, in which a local optimizer should be expected to become trapped in a locally suboptimal solution. The advantage of our proposed method is that, since DE does not get trapped in local minima, we can expect it to find better solutions than those found in the nearest local minimum.
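The trapping argument can be made concrete on a small multimodal test function. In the sketch below (plain Python with illustrative, not chapter-specific, parameters), gradient descent started at x = 2.5 settles in the nearby local minimum of a Rastrigin-like function, while a simple DE population escapes that basin and finds a much deeper one:

```python
import math
import random

def f(x):
    """Multimodal test function; global minimum f(0) = 0,
    local minima near every nonzero integer."""
    return x * x + 10.0 - 10.0 * math.cos(2.0 * math.pi * x)

def grad_descent(x=2.5, lr=1e-3, steps=2000):
    """Plain gradient descent; converges to the nearest local minimum."""
    for _ in range(steps):
        g = 2.0 * x + 20.0 * math.pi * math.sin(2.0 * math.pi * x)
        x -= lr * g
    return x

def de_minimize(np_=30, f_w=0.5, gens=150, lo=-5.0, hi=5.0, seed=3):
    """DE/rand/1 (crossover omitted: the problem is one-dimensional)."""
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(np_)]
    for _ in range(gens):
        for i in range(np_):
            r1, r2, r3 = rng.sample([j for j in range(np_) if j != i], 3)
            trial = pop[r1] + f_w * (pop[r2] - pop[r3])
            if f(trial) <= f(pop[i]):
                pop[i] = trial
    return min(pop, key=f)

x_gd = grad_descent()   # trapped in the local minimum near x = 2
x_de = de_minimize()    # reaches a much deeper basin than gradient descent
```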

Another motivation was that we would also like to optimize the parameter p of the Minkowski distance metric (see Section 3). In practice, that means increased nonlinearity and increased multimodality of the classification model, resulting in more locally optimal points in the search space, where a local optimizer would be even more likely to get trapped. Practically, optimizing p successfully requires the use of an effective global optimizer, since local optimizers are unlikely to provide even an acceptably good suboptimal solution anymore; with a global optimizer, optimization of p becomes possible. Two advantages were expected from this. First, by optimizing the value of p systematically, instead of selecting it a priori by trial and error as before, a higher classification accuracy may be reached. Secondly, the selection of the value of p can be done automatically this way, and laborious trial-and-error experimentation by the user is not needed at all. Furthermore, the potential for further developments is increased. Local optimization approaches severely limit the selection of classifier models to be used, and the possible problem formulations for the classifier model optimization task become limited, too. Simply put, local optimizers are limited to fitting, or learning, only classifier models where trapping in a locally suboptimal solution is not a major problem, while global optimizers have no such fundamental limitations. For example, the range of possible class membership functions can be extended to those requiring global optimization (due to increased nonlinearity and multimodality), which cannot be handled anymore by simple local optimizers, even nonlinear ones. In addition, we would like to remark that we have not yet fully utilized the further development capabilities provided by our global optimization approach. For example, even more difficult optimization problem settings are now within reach, and differential evolution has good capabilities for multi-objective and multi-constrained nonlinear optimization, which provides further possibilities for our future developments.
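The Minkowski distance referred to above, whose order p the DE also optimizes, has the usual definition; a plain-Python sketch:

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between equal-length vectors.
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

d2 = minkowski([0, 0], [3, 4], 2)   # Euclidean: 5.0
d1 = minkowski([0, 0], [3, 4], 1)   # Manhattan: 7.0
```

Because p enters through an exponent, the classification error as a function of p (and of the class vectors) is nonlinear and multimodal, which is precisely why a global optimizer such as DE is useful for tuning it.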


The heart data sets that we applied for experimentation were all taken from [25], where they are freely available. They all contain 13 attributes (which have been extracted from a larger set of 75). Information about the attributes can be found in Table 11.1, and the basic properties of the data sets are summarized in Table 11.2. Regarding attribute types: attributes 1, 4, 5, 8, 10 and 12 are real-valued; attribute 11 is of ordered type; attributes 2, 6 and 9 are binary; and attributes 3, 7 and 13 are nominal. The variable to be predicted is the absence or presence of heart disease. The data sets were collected at different locations, and the principal investigators responsible for the data collection are: 1) Andras Janos, Hungarian Institute of Cardiology, 2) William Steinbrunn, University Hospital, Zurich, 3) Matthias Pfisterer, University Hospital, Basel, 4) Robert Detrano, V.A. Medical Center, Long Beach and Cleveland Clinic Foundation. The donor of the Statlog data set was Ross D. King, University of Strathclyde, Glasgow. The Statlog data set is slightly modified from the Cleveland data set (it uses only 270 of the 303 samples).

Table 11.1 Attributes of the heart data sets

no  Attribute
1.  Age
2.  Sex
3.  Chest pain type (4 values)
4.  Resting blood pressure
5.  Serum cholesterol in mg/dl
6.  Fasting blood sugar > 120 mg/dl
7.  Resting electrocardiographic results (values 0, 1, 2)
8.  Maximum heart rate achieved
9.  Exercise-induced angina
10. Oldpeak = ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Table 11.2 Basic properties of the heart data sets

Data set             Classes  Attributes  Samples
Heart-Cleveland      2        13          303
Heart-Hungarian      2        13          294
Heart-Long-Beach-va  2        13          200
Heart-Switzerland    2        13          123
Heart-Statlog        2        13          270

268 P. Luukka and J. Lampinen

The heart data sets were classified by first preprocessing the data using the PCA algorithm and then classifying the resulting data using a classification method based on differential evolution. In the following, we first explain the principal component analysis method in more detail, then the classification method based on differential evolution, and after that we give a more thorough description of the differential evolution algorithm.

Principal Component Analysis

High-dimensional data sets present many mathematical challenges as well as some

opportunities, and are bound to give rise to new theoretical developments [7]. One

of the problems with high-dimensional data sets is that, in many cases, not all the

measured variables are ”important” for understanding the underlying phenomena of

interest. In mathematical terms, the problem we investigate in dimension reduction

can be stated as follows: given the r-dimensional random variable x = (x1 , ..., xr )T ,

find a lower dimensional representation of it, y = (y1 , ..., yk )T with k < r, that cap-

tures the content in the original data, according to some criteria. The components of

y are sometimes called the hidden components. Different fields use different names for the components of such multivariate vectors: the term "variable" is mostly used in statistics, while "feature" and "attribute" are alternatives commonly used in the computer science and machine learning literature.

PCA [19] is the best linear dimension reduction technique in the mean-square

error sense. In various fields, it is also known as the singular value decomposition

(SVD), the Karhunen-Loève transform, the Hotelling transform, and the empirical

orthogonal function (EOF) method.

Let $x_1, \ldots, x_n$ be the $n$ $r$-dimensional real vectors constituting the data set. In PCA the data is first centered so that

$$\frac{1}{n} \sum_{p=1}^{n} x_p = 0 \qquad (11.1)$$

PCA then looks for the line $L$ through the origin such that the projections $P_L x_p$ of the $n$ points on $L$ have maximal variance. If $L$ is the line spanned by the unit vector $u$, the projection of $x \in \mathbb{R}^r$ on $L$ is

$$P_L x = (u' x)\, u \qquad (11.2)$$

where the prime denotes transposition. The variance of the data in the direction of $L$ is therefore

$$\frac{1}{n} \sum_{p=1}^{n} (u' x_p)^2 = \frac{1}{n} \sum_{p=1}^{n} u' x_p x_p' u = u' \Big( \frac{1}{n} \sum_{p=1}^{n} x_p x_p' \Big) u = u' S u \qquad (11.3)$$


where $S$ is the sample covariance matrix of the data. PCA thus looks for the vector $u^*$ which maximizes $u' S u$ under the constraint $\|u\| = 1$. It is easy to show that the solution is the normalized eigenvector $u_1$ of $S$ associated with its largest eigenvalue $\lambda_1$, and

$$u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1 \qquad (11.4)$$

This is then extended to find the $k$-dimensional subspace $L$ on which the projected points $P_L x_p$ have maximal variance. The lines spanned by the eigenvectors $u_j$ are called the principal axes of the data, and the $k$ new features $y_j = u_j' x$ defined by the coordinates of $x$ along the principal axes are called principal components. The vector $y_p$ of principal components for each initial pattern vector $x_p$ may easily be computed in matrix form as $y_p = U_k' x_p$, where $U_k = [u_1, \ldots, u_k]$ is the $r \times k$ matrix having the $k$ normalized eigenvectors of $S$ as its columns.
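The centering, covariance, and projection steps above can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation; the function name `pca_project` and the synthetic data are our own.

```python
import numpy as np

def pca_project(X, k):
    """Project the n x r data matrix X onto its first k principal axes."""
    Xc = X - X.mean(axis=0)               # centering, cf. Eq. (11.1)
    S = (Xc.T @ Xc) / len(Xc)             # sample covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder: largest eigenvalue first
    U_k = eigvecs[:, order[:k]]           # r x k matrix of principal axes
    return Xc @ U_k                       # y_p = U_k' x_p for every sample

# Toy data: 100 samples with 13 attributes, reduced to 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
Y = pca_project(X, 5)
print(Y.shape)  # (100, 5)
```

The projected columns come out ordered by decreasing variance, so the first component carries the most information, matching the role of the principal axes described above.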

PCA can be used in classification problems to display data in the form of infor-

mative plots. The score values have the same properties as the weighted averages,

i.e., they are not sensitive to random noise but show the processes that affect several

variables simultaneously in a systematic way. This makes them suitable for detect-

ing multivariate trends, such as the clustering of objects or variables in multivariate

data sets. PCA can be seen as a data compression method which can be used to (1)

display multivariate data sets, (2) filter noise and (3) study and interpret multivariate

processes. One clear limitation of the PCA is that it can only handle linear relations

between variables [9]. We acknowledge that this may not be the best kernel for the approach, but in our procedure it seems to work well.

The problem of classification is basically one of partitioning the feature space into

regions, one region for each category. Ideally, one would like to arrange this parti-

tioning so that none of the decisions is ever wrong [8].

The objective is to classify a set $X$ of objects into $N$ different classes $C_1, \ldots, C_N$ by their features. We suppose that $T$ is the number of different kinds of features that we can measure from objects. The key idea is to determine for each class the ideal vector

$$y_i = (y_{i1}, \ldots, y_{iT}) \qquad (11.5)$$

that represents class $i$ as well as possible. Later on we call these vectors class vectors. When the class vectors have been determined, we have to decide to which class the sample $x$ belongs according to some criterion. This can be done, e.g., by computing the distances $d_i$ between the class vectors and the sample we want to classify. For computing the distance we used the Minkowski metric:

$$d(x, y) = \Big( \sum_{j=1}^{T} |x_j - y_j|^p \Big)^{1/p} \qquad (11.6)$$


We used the Minkowski metric because it is more general than the Euclidean metric; the Euclidean metric is still included as the special case $p = 2$. We also found that when the $p$ value was optimized using DE, the optimum was not even near $p = 2$, which corresponds to the Euclidean metric.

After computing the distances between the samples and the class vectors, we can make the classification decision according to the shortest distance: for $x, y_i \in \mathbb{R}^T$, we decide that $x \in C_m$ if

$$d(x, y_m) = \min_{i=1,\ldots,N} d(x, y_i)$$
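The Minkowski distance of Eq. (11.6) and the minimum-distance decision rule can be sketched as follows. The function names are our own, and the class vectors below are toy values rather than optimized ones.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of Eq. (11.6); p = 2 gives the Euclidean metric."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** p).sum() ** (1.0 / p))

def classify(x, class_vectors, p):
    """Return the index m of the nearest class vector y_m."""
    dists = [minkowski(x, y_i, p) for y_i in class_vectors]
    return int(np.argmin(dists))

# Two toy class vectors in a 2-dimensional feature space
class_vectors = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
print(classify([0.5, 1.0], class_vectors, p=2))     # 0: nearer the first vector
print(classify([4.0, 6.0], class_vectors, p=19.3))  # 1: nearer the second vector
```

Note that large values of p (such as those found by the optimization in this chapter) push the metric toward the Chebyshev (max) distance, which behaves quite differently from the Euclidean case.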

Before doing the actual classification, all the parameters of the classifier have to be decided. These parameters are

1. the class vectors $y_i = (y_{i1}, \ldots, y_{iT})$ for each class $i = 1, \ldots, N$;
2. the power value $p$ in (11.6).

In this study we used the differential evolution algorithm [30] to optimize both the class vectors and the $p$ value. For this purpose we split the data into a learning set and a testing set; the split was made so that half of the data was used for learning and half for testing. We used the data in the learning set to find the optimal class vectors $y_i$, and the data in the testing set was applied to assess the classification performance of the proposed classifier. A brief description of the differential evolution algorithm is presented in the following section. The number of parameters that the differential evolution algorithm needs to optimize here is classes × dimension + 1, the additional parameter coming from the Minkowski distance. As the results will later show, PCA can be used to lower the data's dimensionality, and with low dimensions we can still find results that are clearly better than those found by using the DE classifier alone.

If we are not satisfied with just lowering the data's dimensionality and the enhancement achieved this way, but want to find the best reduced dimension, we have to repeat this for every dimension lower than the maximum dimension, so we get

$$\sum_{i=1}^{\text{dimension}} \big( \text{classes} \cdot (\text{dimension} - i) + 1 \big)$$

parameters to be optimized in total, where the additional 1 is again the parameter coming from the Minkowski distance.
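As a small worked example of these counts (following the formulas exactly as stated above), consider one of the heart data sets with 2 classes and 13 attributes:

```python
# Parameter counts for the DE-optimized classifier, per the text:
# a single run optimizes classes * dimension class-vector entries plus
# the Minkowski power p; a sweep over all reduced dimensions sums the
# per-dimension counts.
classes, dimension = 2, 13

per_run = classes * dimension + 1  # 2 * 13 + 1 = 27 parameters in one run
total = sum(classes * (dimension - i) + 1 for i in range(1, dimension + 1))

print(per_run)  # 27
print(total)    # 169
```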

In short, the procedure of our algorithm is as follows:

1. Divide the data into a learning set and a testing set.
2. Create initial class vectors for each class (here we simply used random numbers).
3. Compute the distances between the samples in the learning set and the class vectors.
4. Classify the samples according to their minimum distance.
5. Compute the classification accuracy (number of correctly classified samples / total number of samples in the learning set).
6. Compute the objective function value to be minimized as cost = 1 − accuracy.
7. Create new class vectors for each class for the next population using the selection, mutation and crossover operations of the differential evolution algorithm, and go to step 3 until the stopping criterion is reached (e.g. the maximum number of iterations).
8. Classify the data in the testing set according to the minimum distance between the class vectors and the samples.
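Steps 3-6 above collapse into a single objective function for the optimizer: a candidate solution packs all class vectors plus p, and the cost to minimize is 1 − accuracy. The following is a sketch under our own naming, not the authors' code.

```python
import numpy as np

def cost(candidate, X_learn, labels, n_classes):
    """Steps 3-6: unpack class vectors and p, classify, return 1 - accuracy."""
    T = X_learn.shape[1]
    p = candidate[-1]                                  # Minkowski power
    class_vectors = candidate[:-1].reshape(n_classes, T)
    # Minkowski distances of every sample to every class vector, Eq. (11.6)
    d = (np.abs(X_learn[:, None, :] - class_vectors[None, :, :]) ** p
         ).sum(axis=2) ** (1.0 / p)
    predicted = d.argmin(axis=1)                       # step 4: nearest class
    accuracy = (predicted == labels).mean()            # step 5
    return 1.0 - accuracy                              # step 6

# Toy learning set: two samples, one per class, with matching class vectors
X_learn = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 1])
candidate = np.array([0.0, 0.0, 5.0, 5.0, 2.0])        # two class vectors + p
print(cost(candidate, X_learn, labels, n_classes=2))   # 0.0 (all correct)
```

A DE loop (step 7) would then repeatedly mutate, cross over, and select candidate vectors that lower this cost on the learning set.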

The DE algorithm [33], [30] was introduced by Storn and Price in 1995 and it be-

longs to the family of Evolutionary Algorithms (EAs). The design principles of DE

are simplicity, efficiency, and the use of floating-point encoding instead of binary

numbers. As a typical EA, DE has a random initial population that is then improved

using selection, mutation, and crossover operations. Several ways exist to determine

a stopping criterion for EAs but usually a predefined upper limit Gmax for the num-

ber of generations to be computed provides an appropriate stopping condition. Other

control parameters for DE are the crossover control parameter CR, the mutation fac-

tor F, and the population size NP.

In each generation G, DE goes through each D dimensional decision vector vi,G

of the population and creates the corresponding trial vector ui,G as follows in the

most common DE version, DE/rand/1/bin [29]:

r1, r2, r3 ∈ {1, 2, ..., NP} (randomly selected, except mutually different and different from i)
jrand = floor(randi[0, 1) · D) + 1
for (j = 1; j ≤ D; j = j + 1)
{
    if (randj[0, 1) < CR ∨ j = jrand)
        uj,i,G = vj,r3,G + F · (vj,r1,G − vj,r2,G)
    else
        uj,i,G = vj,i,G
}
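The DE/rand/1/bin trial-vector construction above can be sketched in Python as follows. The element-wise loop deliberately mirrors the pseudocode; the function name and the toy population are our own.

```python
import numpy as np

def make_trial(pop, i, F, CR, rng):
    """Build the trial vector u_{i,G} from population pop (DE/rand/1/bin)."""
    NP, D = pop.shape
    # r1, r2, r3: randomly selected, mutually different, and different from i
    choices = [r for r in range(NP) if r != i]
    r1, r2, r3 = rng.choice(choices, size=3, replace=False)
    j_rand = rng.integers(D)          # forces at least one mutated element
    trial = pop[i].copy()
    for j in range(D):
        if rng.random() < CR or j == j_rand:
            trial[j] = pop[r3, j] + F * (pop[r1, j] - pop[r2, j])
    return trial

rng = np.random.default_rng(1)
pop = rng.normal(size=(6, 4))         # NP = 6 (must be at least 4), D = 4
u = make_trial(pop, i=0, F=0.5, CR=0.9, rng=rng)
print(u.shape)  # (4,)
```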

In this DE version, NP must be at least four, and it remains fixed, along with CR and F, during the whole execution of the algorithm. Parameter CR ∈ [0, 1], which controls the crossover operation, represents the probability that an element of the trial vector is chosen from a linear combination of three randomly chosen vectors rather than from the old vector vi,G. The condition "j = jrand" ensures that at least one element of the trial vector differs from the elements of the old vector. The parameter F is a scaling factor for mutation, and its value typically lies in (0, 1+]1. In practice, CR controls the rotational invariance of the search: a small value (e.g., 0.1) is practicable with separable problems, while larger values (e.g., 0.9) suit non-separable problems. The control parameter F controls the speed and robustness of the search, i.e., a lower value of F increases the convergence rate but also adds the risk of getting stuck in a local optimum. Parameters CR and NP have the same kind of effect on the convergence rate as F has.

1 The notation means that the practical upper limit is about 1 but is not strictly defined.


After the mutation and crossover operations, the trial vector ui,G is compared to

the old vector vi,G . If the trial vector has an equal or better objective value, then it

replaces the old vector in the next generation. This can be presented as follows (in

this paper minimization of objectives is assumed) [29]:

$$v_{i,G+1} = \begin{cases} u_{i,G} & \text{if } f(u_{i,G}) \le f(v_{i,G}) \\ v_{i,G} & \text{otherwise} \end{cases}$$

DE is an elitist method since the best population member is always preserved and

the average objective value of the population will never get worse.
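The one-to-one selection rule can be expressed directly: the trial vector replaces the old one only when its objective value is equal or better (minimization assumed). A minimal sketch with a toy objective of our own choosing:

```python
def select(v_old, u_trial, f):
    """DE selection: keep the trial vector only if it is not worse."""
    return u_trial if f(u_trial) <= f(v_old) else v_old

# Toy objective: the sphere function (sum of squares), to be minimized
f = lambda x: sum(xi * xi for xi in x)

print(select([2.0, 2.0], [1.0, 1.0], f))  # [1.0, 1.0]: trial is better, kept
print(select([1.0, 0.0], [3.0, 0.0], f))  # [1.0, 0.0]: trial is worse, rejected
```

Because the incumbent is only ever replaced by an equal-or-better trial, the best member survives every generation, which is exactly the elitism property noted above.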

As the objective function f to be minimized, we applied the number of incorrectly classified learning set samples. Each population member vi,G, as well as each new trial solution ui,G, contains the class vectors of all classes and the power value p. In other words, DE is seeking the vector (y1, ..., yN, p) that minimizes the objective function f. After the optimization process, the final solution defining the optimized classifier is the best member of the last generation's (Gmax) population, the individual vi,Gmax. The best individual is the one providing the lowest objective function value and therefore the best classification performance on the learning set.

The control parameters of the DE algorithm were set here as follows: CR = 0.9 and F = 0.5 were applied for all classification problems, and NP was chosen to be six times the number of optimized parameters.

However, these selections were mainly based on general recommendations and practical experience with the usage of DE, and no systematic investigation was performed to find the optimal control parameter values; therefore, further classification performance improvements through better control parameter settings remain possible in the future.

All data sets were split in half; one half was used for training and the other half

for testing the classifier. The training sets were randomly created 30 times for each

dimension. The results are also compared to other existing results in the literature.

In Table 11.3 the results for the applied data sets are reported, and the achieved results are compared with the results achieved without PCA. The first column reports the data set and whether the data was first preprocessed with PCA. The second column gives the best classification accuracy and the third the mean classification accuracy. The variance is reported next, followed by the reduced dimension providing the best results. Finally, the optimized p value is given in the last column. The results for the Cleveland heart data set are given under Heart-Cleveland, for the Hungarian data set under Heart-Hungarian, for the Switzerland heart data set under Heart-Switzerland, and for the Long Beach data set under Heart-Long Beach. All four of these data sets combined are reported under Heart-All. There are several missing values in these data sets, and a dummy value of −9 is simply used for the missing values. The results for the heart-statlog data set are under Heart-statlog. The best mean classification accuracies are in boldface.


Table 11.3 Classification results for the heart data sets: comparison of results with the original data and with data preprocessed using PCA. The best mean accuracy for each data set is in boldface.

Data                     Best result (%)  Mean result (%)  Variance  dim  p-value
Heart-Cleveland          89.44            82.86            7.71      13   19.3
Heart-Cleveland (PCA)    91.55            86.48            2.82      12   82.8
Heart-Hungarian          88.44            83.42            5.95      13   88.1
Heart-Hungarian (PCA)    93.20            87.48            3.34      11   96.7
Heart-Switzerland        95.16            94.35            0.67      13   70.8
Heart-Switzerland (PCA)  95.16            94.46            0.66      5    82.1
Heart-Long Beach         80.20            78.32            1.31      13   54.4
Heart-Long Beach (PCA)   85.15            79.93            2.70      12   67.9
Heart-All                78.22            76.98            0.94      13   1.8
Heart-All (PCA)          84.22            82.01            1.05      13   49.2
Heart-statlog            88.89            83.21            10.80     13   81.1
Heart-statlog (PCA)      91.86            87.63            4.01      13   90.9

Cleveland data set: From Table 11.3 we can observe that the best mean classification accuracy for the Cleveland data set is 86.5%, and when the 99% confidence interval is computed for the results (using the Student's t distribution, $\mu \pm t_{1-\alpha/2}\, s_\mu / \sqrt{n}$), we get 86.5% ± 0.8%. This result was obtained when the data was first preprocessed with PCA; the preprocessing by PCA enhanced the results by over 3%. The best mean accuracy was found with a target dimensionality of 12.
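The confidence interval computation just described can be sketched as follows. The accuracy values below are made up for illustration (they are not the experiment's per-run accuracies), and the critical value t = 4.032 is the standard table value of $t_{0.995}$ for 5 degrees of freedom.

```python
import math
from statistics import mean, stdev

def t_interval(values, t_crit):
    """Student's t confidence interval: mean +/- t_crit * s / sqrt(n)."""
    n = len(values)
    half = t_crit * stdev(values) / math.sqrt(n)
    return mean(values) - half, mean(values) + half

runs = [86.5, 87.1, 85.9, 86.8, 86.2, 86.4]  # hypothetical accuracies (%)
t_crit = 4.032                                # t_{0.995} for n - 1 = 5 dof
lo, hi = t_interval(runs, t_crit)
print(round(lo, 2), round(hi, 2))
```

With the 30 runs of the actual experiments, the appropriate critical value would instead be $t_{0.995}$ for 29 degrees of freedom.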

The results achieved with the Cleveland data set are compared to other results in Tables 11.4-11.6. In Table 11.4 the classification results obtained by our DE based approach are compared to the corresponding results reported in [32], where a method called Classification by Feature Partitioning (CFP) was introduced; CFP is an inductive, incremental and supervised learning method. There the data set was divided into two sets as here, but the training and testing set sizes were slightly different. When comparing our results with the results of Sirin and Güvenir [32], we observed that the DE classifier classified the Cleveland data set with a higher accuracy (83.4%) than the IB classifiers and C4, but yielded a slightly lower accuracy than CFP (84.0%).

When the data was first preprocessed with PCA, a classification accuracy of 87.5% was reached by the DE classifier. In Table 11.5 the classification results obtained by the DE based approach are compared to the results of the classifiers reported in [21], which used a decision tree classifier and also preprocessed the data with a wavelet transform. They also used a twofold technique as here, but the division into training and testing sets was 80-20. For their decision tree classifier a 76% accuracy was reported; in comparison, DE yielded an accuracy of 83%. Li et al. [21] managed to enhance their results by first preprocessing the data with a wavelet transform, gaining about a 4 percentage point improvement to a mean accuracy of 80%. We reached about a 3 percentage point improvement using PCA, corresponding to an 86% classification accuracy.

In Table 11.6 we have compared our results with the results reported in [3], where tenfold cross-validation was used instead of the twofold approach of our experiment. As can be seen there, the smart crossover operator with multiple parents for a Pittsburgh Learning Classifier seems to give around 10% better performance on this data set. Generally, the results obtained here by the DE classifier with PCA preprocessing appear rather promising.

Table 11.4 DE classifier results compared to the results Sirin and Güvenir reported in [32] for the Cleveland and Hungarian data sets (the first four columns are the other classifiers compared in [32]; the last three are CFP, DE, and DE with PCA preprocessing)

Hungarian  58.7  55.9  80.5  78.2  82.3  83.4  87.5
Cleveland  75.7  71.4  78.0  75.5  84.0  82.9  86.5

Table 11.5 DE classifier results compared to the results of Li et al. [21] for the Cleveland, Hungarian and Switzerland data sets

Data set     Decision tree [21]  DT + wavelet [21]  DE  DE (PCA)
Hungarian    76                  80                 83  87
Cleveland    76                  80                 83  86
Switzerland  88                  88                 94  94

Table 11.6 DE classifier results compared to the results of Bacardit and Krasnogor [3] for the Cleveland, Hungarian and Statlog data sets

Data set   [3]    DE     DE (PCA)
Hungarian  86.05  83.42  87.48
Cleveland  95.54  82.86  86.48
Statlog    94.44  83.21  87.63

Hungarian data set: With the Hungarian data set the same situation was observed: the best results were found when the data was first preprocessed with PCA. The best mean accuracy with a 99% confidence interval was 87.5% ± 0.9%. Preprocessing with PCA enhanced the results by over 3%. The best accuracy was found with a target dimensionality of 11.

The results obtained with the Hungarian data set are compared to the results of the other classifiers in Tables 11.4-11.7. When the results are compared with those reported by Sirin and Güvenir [32] (Table 11.4), the DE classifier yielded a slightly higher mean accuracy, 83.4%, than the second best method, CFP, with an accuracy of 82.3%. When the Hungarian data set was preprocessed with PCA, the accuracy of the DE classifier increased to 87.5%, which can be considered a remarkably good result. In Table 11.5 our results are compared with the corresponding ones by Li et al. [21], who reported an accuracy of 76% with their decision tree classifier, while in comparison our DE classifier reached an 83% accuracy. Li et al. also preprocessed the data, and their wavelet transform preprocessing gained about a 4 percentage point enhancement in accuracy (80%). We obtained a 4 percentage point enhancement in accuracy when we preprocessed the data using PCA and then performed the classification using the DE classifier, reaching an accuracy of 87%. In Table 11.6 our results are compared with the results of [3]. With their method ac
