Yves Crama
HEC Management School, University of Liège
January 2014
Contents
1 Introduction
2 Combinatorial optimization and computational complexity
  2.1 Examples
    2.1.1 The shortest path problem
    2.1.2 The Chinese postman problem
    2.1.3 The traveling salesman problem
    2.1.4 The 0-1 linear programming problem
    2.1.5 The graph equipartitioning problem
    2.1.6 The graph coloring problem
    2.1.7 Combinatorial optimization in practice
    2.1.8 Exercises.
  2.2 A glimpse at computational complexity
    2.2.1 Computational performance criteria
    2.2.2 Problems and problem instances
    2.2.3 Easy and hard problems
    2.2.4 Exercises.
3 Heuristics for combinatorial optimization problems
  3.1 Introduction
  3.2 Reformulation, rounding and decomposition
  3.3 List-processing heuristics
    3.3.1 Exercises.
  3.4 Neighborhoods and neighbors
    3.4.1 Some examples
    3.4.2 Exercises.
  3.5 Steepest descent
    3.5.1 Initialization
    3.5.2 Local minima
    3.5.3 Choice of neighborhood structure
    3.5.4 Selection of neighbor
    3.5.5 Fast computation of the objective function
    3.5.6 Flat objective functions
    3.5.7 Exercises.
  3.6 Simulated annealing
    3.6.1 The simulated annealing metaheuristic
    3.6.2 Choice of the transition probabilities
    3.6.3 Stopping criteria
    3.6.4 Implementing the SA algorithm
    3.6.5 Variants of the SA
  3.7 Tabu search
    3.7.1 Introduction
    3.7.2 The algorithm
    3.7.3 Example
  3.8 Genetic algorithms
    3.8.1 Introduction
    3.8.2 Diversification via crossover
    3.8.3 A basic genetic algorithm
    3.8.4 Intensification and local search
    3.8.5 Encodings
    3.8.6 Implementing a genetic algorithm
4 Modeling languages for mathematical programming
  4.1 Introduction
5 Integer programming
  5.1 Introduction
    5.1.1 Integer programming models.
    5.1.2 Exercises.
  5.2 Branch-and-bound method
    5.2.1 Partitioning
    5.2.2 Evaluation
    5.2.3 Heuristic solution
    5.2.4 Tight formulations
    5.2.5 Some final comments
6 Neural networks
  6.1 Feedforward neural networks
  6.2 Neural networks as computing devices
  6.3 Neural networks as function approximation devices
  6.4 Unconstrained nonlinear optimization
    6.4.1 Minimization problems in one variable: introduction
    6.4.2 Equations in one variable
    6.4.3 Minimization problems in one variable: algorithms
    6.4.4 Multivariable minimization problems
  6.5 Application to NN design: the backpropagation algorithm
    6.5.1 Extensions of the delta rule
    6.5.2 Model validation
  6.6 Applications
  6.7 Notes on PROPAGATOR software
    6.7.1 Input files
    6.7.2 Menus
    6.7.3 Main window
7 Cases
  7.1 Container packing at Titanic Corp.
  7.2 Stacking boxes at Gizeh Inc.
  7.3 A high technology routing system for Meals-on-Wheels
  7.4 Operations scheduling in Hobbitland
  7.5 Setup optimization for the assembly of printed circuit boards
  7.6 A new product line for Legiacom
Bibliography
Chapter 1
Introduction
The aim of the course Advanced Operations Research is to present several perspectives on mathematical
modeling and problem-solving strategies as they are used in operations research.
The course contains several independent parts, namely:
• general-purpose heuristic strategies for the solution of combinatorial optimization problems, such
  as simulated annealing, tabu search or genetic algorithms;
• learning of a modeling language, i.e. a computer language specially devoted to the formulation,
  the solution and the analysis of large-scale optimization models (linear or nonlinear programming
  problems);
• an introduction to mixed integer programming models and algorithms;
• other numerical methods, as time allows: neural networks, simulation, ...
These lecture notes propose a preliminary draft of the material usually covered in the course. They
concentrate mostly on combinatorial optimization heuristics, on mixed integer programming methods and
on neural networks. Modeling languages are handled more superficially, as this topic is mostly illustrated
through the development of numerical models in the computer lab.
The course assumes that the reader has had a first introduction to operations research and has some
elementary knowledge of mathematical modeling, of mathematical programming and of graph theory.
Special thanks are due to Jean-Philippe Peters who drafted the first version of these classroom notes.
Chapter 2
Combinatorial optimization and computational complexity: Basic notions
The generic combinatorial optimization (CO) problem is
\[ \text{minimize } \{F(x) \mid x \in X\} \tag{2.1} \]
where $X$ is a finite (or at least, discrete¹) set of feasible solutions and $F$ is a real-valued objective
function defined on $X$. Of course, if $X$ is given in extension, i.e., by a complete explicit list of its
elements, then solving (CO) is quite easy: it suffices to compute the value of $F(x)$ for all elements
$x \in X$ and to retain the best element. But when $X$ is defined implicitly rather than in extension, the
problem may become much harder.
2.1 Examples
2.1.1 The shortest path problem
Nowadays, many commercial software products allow you to select effortlessly the shortest possible
route from your current location to a chosen destination (for example, from Liège to Torremolinos). The
optimization problem which has to be solved whenever you address a query to the system can be modelled
as follows.
¹ Intuitively, a set is discrete if it does not contain any continuous subset.
There is a graph $G = (V, E)$ where $V$ is a finite set of elements called vertices and $E$ is a collection of
pairs of vertices called edges (think of $V$ as a list of geographical locations and of $E$ as a road network;
see e.g. Figure 2.1 for a representation). Assume that every edge $e \in E$ has a nonnegative length $\ell(e)$,
and let $s$ and $t$ be two vertices in $G$. The shortest path problem is to find a path (a connected sequence of
edges) through the graph that starts at $s$ and ends at $t$, and which has the shortest possible total length.
This is clearly a CO problem, where $X$ is the (finite) set of all paths from $s$ to $t$ and $F(x)$ is the total
length of path $x$. Note that the cardinality of $X$ can be of the same order of magnitude as $2^{|V|}$, which is
quite large as compared to the size of the graph.
[Figure 2.1: A graph with 6 vertices and 8 edges]
2.1.2 The Chinese postman problem
This problem is similar to the shortest path problem, except that we consider here the additional
constraint that every edge of $G$ should be traversed at least once by the path from $s$ to $t$ (the postman
has to visit every street in his district). It is also usual to assume that $s = t$ in this problem (the
postman returns to the depot at the end of the day).
Besides its postal illustration, this model finds applications in a variety of vehicle routing
situations (garbage collection, snow plowing, street cleaning, etc.) and in the design of automatic drawing
software.
2.1.3 The traveling salesman problem
The traveling salesman problem (denoted TSP) is again similar to the shortest path problem, with the
added requirement that every vertex should be visited exactly once by the path from $s$ to $t$: the salesman
must visit each and every customer (located in the cities in $V$) along the way. In the sequel, we shall
always assume that $G$ is a complete graph (i.e., it contains all possible edges) and that $s = t$. Thus, we
speak of a traveling salesman tour rather than path. Then, $X$ can simply be viewed as the set of all
permutations of the elements of $V$ and $|X| = |V|!$. For instance, if $|V| = 30$, then $|V|!$ is roughly
$2.65 \times 10^{32}$.
This famous combinatorial optimization problem has numerous applications, either in its pure form
or as a subproblem of more complex models. It arises for instance in many production scheduling settings
(sequencing of tasks on a single machine when the setup time between two successive tasks depends on
the identity of these tasks, sequencing of drilling operations in metal sheets, sequencing of component
placement operations for the assembly of printed circuit boards, etc.) and in various types of vehicle
routing models (truck delivery problems, mail pickup, etc.).
2.1.4 The 0-1 linear programming problem
We can express the 0-1 LP problem as
\[ \min\ cx \quad \text{subject to } Ax \le b \text{ and } x \in \{0, 1\}^n \]
where $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times n}$ are the parameters (or numerical data) of the
problem and $x \in \mathbb{R}^n$ is a vector of (unknown) decision variables. Note that, if we drop the constraint
$x \in \{0, 1\}^n$, then the problem is simply a linear programming problem which can be solved by a variety
of efficient algorithms (e.g., the simplex method or an interior-point method). However, the requirement
that $x \in \{0, 1\}^n$ leads to a (much harder) CO problem where $X = \{x \in \{0, 1\}^n : Ax \le b\}$. The cardinality
of this set, although finite, is potentially as large as $2^n$ (when $n = 30$, this is approximately $10^9$).
The knapsack problem is the special case of 0-1 LP with only one inequality constraint:
\[ \max\ cx \quad \text{subject to } ax \le b \text{ and } x \in \{0, 1\}^n \]
where $a, c \in \mathbb{R}^n_+$ and $b \in \mathbb{R}$. The usual interpretation of this problem is that the indices
$i = 1, 2, \ldots, n$ denote $n$ objects that a hiker may want to carry in her knapsack, $c_i$ is the utility of
object $i$, $a_i$ is its weight and $b$ is the maximum weight that the hiker is able to carry.
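To make the formulation concrete, here is a minimal Python sketch that solves a knapsack instance by listing all $2^n$ vectors of $\{0, 1\}^n$. The function name and the numerical instance are hypothetical and serve only as an illustration; the approach is only usable for very small $n$.

```python
from itertools import product

def knapsack_by_enumeration(c, a, b):
    """Solve max c.x subject to a.x <= b, x in {0,1}^n, by listing all 2^n vectors."""
    best_value, best_x = None, None
    for x in product((0, 1), repeat=len(c)):
        weight = sum(ai * xi for ai, xi in zip(a, x))
        if weight > b:  # infeasible: the knapsack would be too heavy
            continue
        value = sum(ci * xi for ci, xi in zip(c, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

# Hypothetical instance: 4 objects, maximum weight 10.
utilities = [10, 7, 12, 4]
weights = [5, 4, 6, 3]
value, chosen = knapsack_by_enumeration(utilities, weights, 10)
print(value, chosen)  # 19 (0, 1, 1, 0)
```

Here the best choice is to carry objects 2 and 3, which exactly fill the knapsack.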
2.1.5 The graph equipartitioning problem
Let $G = (V, E)$ be a non-oriented graph with an even number of vertices. We want to partition $V$ into
two subsets $V_1$ and $V_2$ so that $|V_1| = |V_2|$, and so that the number of edges having one endpoint in $V_1$
and the other endpoint in $V_2$ is minimized.
Here, the set $X$ of feasible solutions is the set of all partitions of $V$ into two subsets of equal size. Of
course, there are many possible equipartitions for any given graph (as many as
$\frac{1}{2}\binom{|V|}{|V|/2}$, which is about $10^8$ when $|V| = 30$). For example, two partitions of the graph in
Figure 2.1 are provided in Table 2.1, together with the value of the objective function for each of them.

              V_1        V_2        F
Solution 1    {1,2,6}    {3,4,5}    5
Solution 2    {1,2,3}    {4,5,6}    3

Table 2.1: Possible partitions
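As an illustration, the objective function and the number of feasible solutions can be computed as in the following Python sketch. The edge list is a hypothetical 6-vertex, 8-edge graph chosen for the example; it is an assumption and is not claimed to be the graph of Figure 2.1, so the cut values need not match Table 2.1.

```python
from math import comb

def cut_size(edges, part1):
    """Number of edges with exactly one endpoint in part1."""
    part1 = set(part1)
    return sum((u in part1) != (v in part1) for u, v in edges)

def number_of_equipartitions(n):
    """(1/2) * C(n, n/2) equipartitions for an even number n of vertices."""
    return comb(n, n // 2) // 2

# Hypothetical graph with 6 vertices and 8 edges (not the edge set of Figure 2.1).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5), (4, 6), (5, 6)]
print(cut_size(edges, {1, 2, 3}))    # only edges (2,4) and (3,5) are cut
print(number_of_equipartitions(30))  # 77558760, i.e. about 10^8
```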
2.1.6 The graph coloring problem
Let $G = (V, E)$ be a non-oriented graph. Coloring $G$ consists in assigning one of the labels R(ed),
G(reen), Y(ellow), B(lue),..., or 1, 2, 3,... to every vertex of $G$. More formally, a coloring of $G$ is a
function $V \to \{1, 2, \ldots, C\}$, where $C$ is the number of colors used. We say that a coloring is feasible
if no pair of adjacent vertices receive the same color. The minimum number of colors required to color
graph $G$ is called its chromatic number and is denoted $\chi(G)$. The problem of determining $\chi(G)$ is called
the graph coloring problem. Figure 2.2 provides an illustration of a feasible (but non-optimal) coloring
of a small graph.
This problem comes up for instance in school timetabling applications. Assume that each vertex of
$G$ represents a course to be given and that two vertices are joined by an edge exactly when their meeting
times overlap. Then, $\chi(G)$ is the minimum number of classrooms needed to accommodate all the courses.
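The feasibility condition is easy to check by computer, as the following Python sketch suggests. The graph (a cycle on 5 vertices) and the two colorings are hypothetical examples, not taken from Figure 2.2.

```python
def is_feasible_coloring(edges, color):
    """True if no edge joins two vertices that received the same color."""
    return all(color[u] != color[v] for u, v in edges)

def colors_used(color):
    """Number C of distinct colors used by the coloring."""
    return len(set(color.values()))

# Hypothetical graph: a cycle on 5 vertices.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
good = {1: "R", 2: "G", 3: "R", 4: "G", 5: "Y"}
bad = {1: "R", 2: "G", 3: "R", 4: "G", 5: "R"}  # vertices 5 and 1 clash

print(is_feasible_coloring(edges, good), colors_used(good))
print(is_feasible_coloring(edges, bad))
```

Checking a given coloring is fast; the difficulty of the graph coloring problem lies in finding a feasible coloring with as few colors as possible.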
[Figure 2.2: A feasible coloring]
2.1.7 Combinatorial optimization in practice
Many combinatorial optimization problems arise in various areas of industrial and business practice. For
some selected examples of such situations, we refer the interested reader to Aarts and Lenstra (1997),
Applegate, Bixby, Chvátal, and Cook (2006), Barnhart, Johnson, Nemhauser, Sigismondi, and Vance
(1993), Bartholdi, Platzman, Collins, and Warden (1983), Bollapragada, Cheng, Phillips, Garbiras,
Scholes, Gibbs, and Humphreville (2002), Crama, van de Klundert, and Spieksma (2002), Crama,
Oerlemans, and Spieksma (1996), Jain, Johnson, and Safai (1996), Glover and Laguna (1997), Kohli and
Krishnamurti (1987), Moonen and Spieksma (2003), Oliveira, Ferreira, and Vidal (1993), Tyagi and
Bollapragada (2003), etc.
2.1.8 Exercises.
Exercise 1. Consider the Meals-on-Wheels case in Section 7.3. Explain the similarities that this problem
shares with the traveling salesman problem, as well as the differences between the problems.
2.2 A glimpse at computational complexity
In order to fully appreciate the field of combinatorial optimization, it is necessary to understand, at least
at an intuitive level, some of the basic concepts of computational complexity. This part of theoretical
computer science deals with fundamental, but extremely deep questions like: "what tasks can be carried
out by a computer?", or "how much time does a given computational task require?"
In this section, we attempt to introduce some elements of computational complexity, in a very informal
and hand-waving way. We refer the interested reader to Tovey (2002) for a more formal tutorial, and to
Papadimitriou and Steiglitz (1982) for a rigorous treatment of the topic.
2.2.1 Computational performance criteria
What do we expect from a CO algorithm? An obvious answer is that the algorithm should
always return an optimal solution of the problem. Is it the only game in town? Certainly not. We
might also want it to be fast, or efficient. Combining these two expectations is crucial. Indeed,
the time required to solve a problem increases with the size of this problem, where the
size can be measured by the amount of data needed to describe a particular instance of the problem.
Let us take a look at an example. Suppose that we want to solve a 0-1 linear programming problem
involving $n$ variables $x_j \in \{0, 1\}$, $j = 1, \ldots, n$. We can certainly find an optimal solution by listing all
possible vectors $(x_1, x_2, \ldots, x_n)$, by checking for each of them whether it is feasible or not, by computing
the value of the objective function for each such feasible solution, and by retaining the best solution
found in the process. If we decide to go that way, then we must consider $2^n$ vectors. For $n = 50$, that
means $2^{50} \approx 10^{15} = 1{,}000{,}000{,}000{,}000{,}000$ vectors! If our algorithm is able to enumerate one million
(1,000,000) solutions per second, the whole procedure takes $10^9$ seconds, or about 30 years. And for
$n = 60$, the enumeration of the $2^{60}$ solutions would take about 30,000 years!
Note that adding 10 variables to the problem increases the computing time by a multiplicative factor
of $2^{10} \approx 1{,}000$. So, with $n = 80$ variables (a rather modest problem size), the same algorithm would run
for 30 billion years, which is about twice the age of the universe. Not really efficient, by any practical
standards...
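These orders of magnitude are easy to reproduce. The short Python sketch below assumes, as in the text, a rate of one million enumerated vectors per second; the function name is hypothetical. The exact figures come out a little higher than the rounded ones quoted above, because $2^{10} = 1024$ rather than $1000$.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
RATE = 1_000_000  # vectors enumerated per second, as assumed in the text

def enumeration_years(n, rate=RATE):
    """Years needed to list all 2^n binary vectors at `rate` vectors per second."""
    return 2 ** n / rate / SECONDS_PER_YEAR

for n in (50, 60, 80):
    print(f"n = {n}: about {enumeration_years(n):.3g} years")
```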
Let us look at this issue from another vantage point. Consider the well-known Moore's law: Gordon
Moore, co-founder of the chip giant Intel, predicted in 1965 that the number of transistors per square
inch on integrated circuits would double every 18 months starting from 1962, the year the
integrated circuit was invented (see the original paper of Moore (1965) for more details). In other words,
your PC processor runs twice as fast every year and a half, meaning that its speed is multiplied by 100
in 10 years.² So, if you were able to enumerate $2^n$ solutions in one hour in 1997, you could enumerate
$100 \times 2^n < 2^{n+7}$ solutions in one hour in 2007. This great increase in computing speed thus allows you
to gain only 6 or 7 variables in 10 years!
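The factor 100 and the gain of 6 or 7 variables can be checked with a short computation, assuming one doubling of speed every 1.5 years:

```latex
\[
  2^{10/1.5} = 2^{6.67} \approx 102 \approx 100,
  \qquad
  \log_2 100 \approx 6.64,
  \qquad\text{so}\qquad
  2^{n+6} < 100 \times 2^{n} < 2^{n+7}.
\]
```

In other words, ten years of hardware progress buy between 6 and 7 additional binary variables for the same enumeration time.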
Conclusion: exhaustive enumeration is not feasible in practice for large-scale (or even medium-scale)
CO problems. Furthermore, we should not count on technical progress alone to improve the situation in
any significant way. Only algorithmic, or mathematical, advances can help in this respect.
² This rate has slowed down since Moore made his claim. It is now generally accepted that the number of
transistors doubles every two years.
So, how much progress can we expect on the theoretical front? Before we provide a tentative answer
to this question, let us try to formulate more precisely some of the notions that have just been sketched.
2.2.2 Problems and problem instances
Formally speaking, a (computational) problem is a generic question whose formulation includes a number
of undetermined parameters. Here are some simple examples.
1. Matrix addition problem: The parameters are $n$, $A$ and $B$ where $n \in \mathbb{N}$, and $A$ and $B$ are two
   $n \times n$ matrices. Question: what is the value of $A + B$?
2. Shortest path problem: The parameters are a graph $G = (V, E)$, two vertices $s, t \in V$, and the
   length $\ell(e) \ge 0$ of every edge $e \in E$. Question: find a shortest path from $s$ to $t$.
3. Traveling salesman problem: The parameters are a graph $G = (V, E)$, and the length $\ell(e) \ge 0$ of
   every edge $e \in E$. Question: find a shortest traveling salesman tour in $G$.
We call instance of a problem $P$ the specific question obtained by specifying the value of all
undetermined parameters of $P$ (or, more intuitively, by specifying the input file that contains the numerical
data). So, a problem can also be viewed as a collection of instances. Note that an instance admits an
answer, but a problem does not (try to give the answer of the matrix addition problem above!).
An algorithm for a problem $P$ is a step-by-step procedure that describes how to compute a solution
for every instance of $P$. To compare the efficiency of different algorithms for the same problem $P$, we
can determine the time required by each algorithm to solve an instance of $P$. Note that this obviously
depends on the particular instance which is to be solved, but also on the speed of the computer, on the
skills of the programmer, etc. Therefore, we need again to define this notion in a more formal way.
The size of an instance $I$, denoted by $s(I)$, is the number of bits needed to represent $I$. It is determined
both by the number of parameters and by their magnitude. (Intuitively, this can be viewed as the size
of the input file of a computer program which solves $I$.)
The running time of an algorithm $A$ on an instance $I$, denoted $t_A(I)$, is the number of elementary
operations (additions, multiplications, comparisons,...) performed by $A$ when it runs on the instance $I$.
Determining the running time of an algorithm for each particular instance $I$ is not an easy task. However,
it is often easier to estimate the running time as a function of the size of the instance.
Consider again the examples defined above.
1. Matrix addition problem:
   Instance size: $2n^2$.
   Algorithm: any naive addition algorithm.
   Running time: $n^2$ (additions). We denote this by $O(n^2)$, meaning that the running time grows
   at most like $n^2$.
2. Shortest path problem:
   Instance size: $O(n^2)$ where $n = |V|$.
   Algorithm 1: enumerate all possible paths between $s$ and $t$.
   Running time of Algorithm 1: there could be exponentially many paths and $t_{A1} = O(2^n)$.
   Algorithm 2: Dijkstra's algorithm (see Nemhauser and Wolsey (1988)).
   Running time of Algorithm 2: $O(n^2)$ operations.
3. Traveling salesman problem:
   Instance size: $O(n^2)$ where $n = |V|$.
   Algorithm: enumerate all possible tours.
   Running time: $O(n!)$.
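To see where the $O(n^2)$ bound for Dijkstra's algorithm comes from, here is a minimal Python sketch of the classical array-based version: the main loop runs at most $n$ times, and each pass does $O(n)$ work to select a vertex and $O(n)$ work to update distances. The function name, the matrix representation of the graph and the 4-vertex instance are assumptions made for this illustration; this is not the presentation of Nemhauser and Wolsey (1988).

```python
import math

def dijkstra_dense(length, s, t):
    """Shortest s-t distance, array-based O(n^2) version of Dijkstra's algorithm.

    length[u][v] is the nonnegative length of edge (u, v),
    or math.inf if there is no such edge.
    """
    n = len(length)
    dist = [math.inf] * n
    dist[s] = 0
    done = [False] * n
    for _ in range(n):
        # Select the closest vertex not yet finalized: O(n) per pass.
        u = min((v for v in range(n) if not done[v]), key=dist.__getitem__)
        if dist[u] == math.inf or u == t:
            break
        done[u] = True
        for v in range(n):  # update distances through u: O(n) per pass
            if dist[u] + length[u][v] < dist[v]:
                dist[v] = dist[u] + length[u][v]
    return dist[t]

INF = math.inf
# Hypothetical instance with 4 vertices numbered 0..3.
length = [
    [0, 1, 4, INF],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [INF, 6, 3, 0],
]
print(dijkstra_dense(length, 0, 3))  # 6, along the path 0-1-2-3
```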
In view of these examples, we are led to the following concept: the complexity of an algorithm $A$ for
a problem $P$ is the function
\[ c_A(n) = \max\{t_A(I) \mid I \text{ is an instance of } P \text{ with size } s(I) = n\}. \tag{2.2} \]
This is sometimes called the worst-case complexity of $A$: indeed, the definition focuses on the worst-case
running time of $A$ on an instance of size $n$, rather than on its average running time.
2.2.3 Easy and hard problems
Figure 2.3 represents different types of complexity behaviors for algorithms.
[Figure 2.3: (a) Linear: $F(n) = an + b$; (b) Exponential: $F(n) = a\,2^n$]
The algorithm $A$ is polynomial if $c_A(n)$ is a polynomial (or is bounded by a polynomial) in $n$, and
exponential if $c_A(n)$ grows faster than any polynomial function in $n$. Intuitively, we can probably accept
the idea that a polynomial algorithm is more efficient than an exponential one.
For instance, the obvious algorithms for the addition or the multiplication of matrices are
polynomial. So is the Gaussian elimination algorithm for the solution of systems of linear equations. On
the other hand, the simplex method (or at least, some variants of it) for linear programming problems
is known to be exponential³, while interior point methods are polynomial. This clearly illustrates the
emphasis on the worst-case running time which was already underlined above: indeed, in an average
sense, the simplex algorithm is an efficient method.
The complete enumeration approach for shortest path, Chinese postman or traveling salesman
problems is exponential, since all these problems have an exponential number of feasible solutions. But
polynomial algorithms exist for the shortest path problem or the Chinese postman problem.
For the traveling salesman problem or for 0-1 integer programming problems, by contrast, only
exponential algorithms are known. In fact, it is widely suspected that there does not exist any polynomial
algorithm for these problems. This is a typical feature of so-called NP-hard problems which we define
(very informally again) as follows (see Papadimitriou and Steiglitz (1982) for details).
Definition 2.2.1. A problem $P$ is NP-hard if it is at least as difficult as the 0-1 linear programming
problem, in the sense that any algorithm for $P$ can be used to solve the 0-1 LP problem with a polynomial
increase in running time.
The next claim has resisted all proof attempts (and there have been many) since the early 70s, but
the vast majority of computer scientists and operations researchers believe that it holds true.
³ Klee and Minty (1972) provide instances $I$ of the LP problem such that $t_{\mathrm{simplex}} \ge 2^{s(I)}$.
P ≠ NP conjecture. If a problem is NP-hard, then it cannot be solved by a polynomial algorithm.
The P ≠ NP conjecture, if true, expresses a deep and fundamental fact of complexity theory. Its
implications are of enormous importance for the development of algorithms in operations research and
related areas. Indeed, a large number of combinatorial optimization problems turn out to be NP-hard,
and hence difficult to solve.
Proposition 2.2.1. The following problems are NP-hard:
• traveling salesman problem;
• graph equipartitioning problem;
• graph coloring problem;
• knapsack problem;
• assembly line balancing;
• facility layout;
• job-shop scheduling;
• several hundred other CO problems.
It is quite remarkable that most of the problems in the above list can in fact be formulated as special
cases of the 0-1 LP problem. This is obvious, in particular, for the knapsack problem, but it is also
true (though less obvious) for graph equipartitioning, or for the traveling salesman problem, or for graph
coloring, etc. Thus, the actual meaning of Proposition 2.2.1 is that all these NP-hard problems are
somehow equivalent, in the sense that an efficient algorithm for any of them would immediately provide
an efficient algorithm for all of them.
From a practical point of view, however, some NP-hard problems turn out to be more difficult than
others. For instance, the knapsack problem is quite easy to solve as compared with general 0-1
LP problems. Nevertheless, NP-hard problems seem to be intrinsically tougher than linear systems, LP
problems or shortest path problems.
As a consequence, for the solution of such difficult (or apparently difficult) problems, heuristic
algorithms are often used in practice.
Definition 2.2.2. A heuristic for an optimization problem $P$ is an algorithm which is based on intuitively
appealing principles, but which does not guarantee to provide an optimal solution of $P$.
So, when running on a particular CO problem, a heuristic could for instance
• return an optimal solution of the problem, or
• return a suboptimal solution, or
• return an infeasible solution, or
• fail to return any solution at all,
etc.
This very broad definition of a heuristic may seem rather surprising at first sight. It raises again the
question of the criteria which can be applied to analyze the performance of a particular heuristic. We
mention here two criteria which will be of particular concern in this course.
Computational complexity
Generally speaking, we want heuristics to be fast, at least when compared with the highly exponential
running times mentioned above. In fact, the main reason for giving up optimality is that we want the
heuristic to compute quickly a reasonably good solution. Thus, the basic trade-off that we want to achieve
reads
SOLUTION QUALITY vs. RUNNING TIME
Quality of approximation
The solution returned by the heuristic should provide a good approximation of the optimal solution. To
understand how to measure this, let $x^H$ be the solution computed by heuristic $H$ for a particular instance
and let $x^{\mathrm{opt}}$ be an optimal solution for this instance.
Then,
\[ E(x^H) = \frac{F(x^H) - F(x^{\mathrm{opt}})}{F(x^{\mathrm{opt}})} \ge 0 \tag{2.3} \]
provides a relative error measure: the closer it is to 0, the better the solution $x^H$.
In general, however, $F(x^{\mathrm{opt}})$ is unknown. So, suppose now that we know how to compute a lower
bound on $F(x^{\mathrm{opt}})$, i.e. a number $\underline{F}$ such that $\underline{F} \le F(x^{\mathrm{opt}})$ (this is often much easier to compute than
$F(x^{\mathrm{opt}})$). Define
\[ \bar{E}(x^H) = \frac{F(x^H) - \underline{F}}{\underline{F}}. \tag{2.4} \]
Then we have
\[ E(x^H) = \frac{F(x^H)}{F(x^{\mathrm{opt}})} - 1 \le \frac{F(x^H)}{\underline{F}} - 1 = \bar{E}(x^H) \tag{2.5} \]
(the inequality uses $0 < \underline{F} \le F(x^{\mathrm{opt}})$), which means that $\bar{E}(x^H)$ overestimates the relative error
$E(x^H)$. So, if $\bar{E}(x^H)$ is small, we can certainly be happy with the quality of the solution provided by $H$.
(Note also that if the lower bound $\underline{F}$ is close to $F(x^{\mathrm{opt}})$, then $\bar{E}(x^H)$ actually provides a good
estimate of the error.)
For example, consider the traveling salesman instance described by the (symmetric) distance matrix
$L$, where $\ell_{ij}$ represents the distance from $i$ to $j$, $i, j = 1, 2, \ldots, 6$:
\[
L = \begin{pmatrix}
0 & 4 & 7 & 2 & 6 & 3 \\
4 & 0 & 3 & 5 & 5 & 7 \\
7 & 3 & 0 & 2 & 6 & 5 \\
2 & 5 & 2 & 0 & 9 & 8 \\
6 & 5 & 6 & 9 & 0 & 5 \\
3 & 7 & 5 & 8 & 5 & 0
\end{pmatrix}.
\]
Assume now that a heuristic returns the tour $x^H = (1, 2, 3, 4, 5, 6)$ (displayed in Figure 2.4).
[Figure 2.4: A feasible tour]
The total length of this tour is $F(x^H) = 4 + 3 + 2 + 9 + 5 + 3 = 26$. On the other hand, an obvious
lower bound on the optimal tour length is given by the sum of the 6 shortest distances in $L$. Thus
$\underline{F} = 2 + 2 + 3 + 3 + 4 + 5 = 19$ and
\[ \bar{E}(x^H) = \frac{26 - 19}{19} \approx 0.37. \]
We can therefore conclude that $x^H$ is at most 37% longer than the optimal tour.
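The figures of this example can be reproduced with the following Python sketch. The matrix is the one given in the text; the function names are hypothetical.

```python
from itertools import combinations

# Distance matrix L from the text (vertices 1..6 stored at indices 0..5).
L = [
    [0, 4, 7, 2, 6, 3],
    [4, 0, 3, 5, 5, 7],
    [7, 3, 0, 2, 6, 5],
    [2, 5, 2, 0, 9, 8],
    [6, 5, 6, 9, 0, 5],
    [3, 7, 5, 8, 5, 0],
]

def tour_length(tour):
    """Length of the closed tour that visits `tour` in order and returns to its start."""
    return sum(L[i - 1][j - 1] for i, j in zip(tour, tour[1:] + tour[:1]))

def shortest_distances_bound(n_edges):
    """Sum of the n_edges smallest pairwise distances (a tour uses n_edges distinct edges)."""
    all_lengths = sorted(L[i][j] for i, j in combinations(range(len(L)), 2))
    return sum(all_lengths[:n_edges])

f_h = tour_length([1, 2, 3, 4, 5, 6])
f_low = shortest_distances_bound(6)
print(f_h, f_low, round((f_h - f_low) / f_low, 2))  # 26 19 0.37
```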
In order to compute lower bounds for combinatorial optimization problems, a simple but powerful
principle can often be used: when a constraint of a minimization problem P is relaxed (i.e., when
the constraint is either removed or replaced by a weaker one), then the optimal value of the resulting
relaxed problem provides a lower bound on the optimal value of P. This principle will be illustrated
on the examples below.
2.2.4 Exercises.
Exercise 1. Consider again the traveling salesman problem. For every vertex $v \in V$, select the shortest edge $e_v$ incident to $v$. Show that $\sum_{v \in V} \ell(e_v)$ is a lower bound on the length of the optimal tour. Compute this lower bound for the numerical example in Section 2.2.3. Can you improve this lower bound by taking into account the two shortest edges incident to every vertex $v$? What bound do you obtain for the numerical example?
Exercise 2. Consider the following problem: you want to save $n$ electronic files with respective sizes $s_1, s_2, \ldots, s_n \ge 0$ on the smallest possible number of storing devices (say, floppy disks) with capacity $C$. This problem is known under the name of bin packing problem, and it is NP-hard. Can you compute a lower bound on its optimal value?
Exercise 3. Show that the optimal value of the linear programming problem
$$\min\ cx \quad \text{subject to } Ax \le b,\ 0 \le x_j \le 1\ (j = 1, 2, \ldots, n)$$
provides a lower bound on the optimal value of the 0-1 LP problem
$$\min\ cx \quad \text{subject to } Ax \le b,\ x_j \in \{0, 1\}\ (j = 1, 2, \ldots, n).$$
Exercise 4. Show that the lower bounds obtained in Exercises 1-3 can all be viewed as optimal solutions of a relaxation of the original problem.
Chapter 3
Heuristics for combinatorial
optimization problems
3.1 Introduction
Even though there does not really exist any general theory of heuristics, certain common strategies can be identified in many successful heuristics. The aim of this chapter is to present such fundamental principles of heuristic algorithms for combinatorial optimization problems of the form
$$\text{minimize } \{F(x) \mid x \in X\}. \qquad \text{(CO)}$$
In Sections 3.2 and 3.3 below, we successively describe a few simple ideas of this nature, namely reformulation, decomposition, rounding, and list-processing.
Then, we turn to more elaborate frameworks, or guidelines, which have been proposed to develop specific heuristics for a broad variety of optimization problems. These frameworks go by the name of metaheuristic schemes, or metaheuristics for short. Thus, metaheuristics can be viewed as recipes for the solution of (CO) problems.
We focus more particularly on so-called local search heuristics. Broadly speaking, local search heuristics rely on a common, rather natural and intuitive approach to find a good solution of (CO): starting from an initial solution, they move from solution to solution in the feasible region $X$, in an attempt (or hope) to locate a good solution along the way (see Figure 3.1, where $N(x)$ represents the neighborhood of a current solution $x$). Most metaheuristics (like simulated annealing or tabu search) specifically generate local search heuristics. They constitute the main topic of this chapter.
Figure 3.1: Local search (a sequence of solutions $x^0, x^1, \ldots, x^4$ in the feasible region $X$, with the neighborhoods $N(x^0), \ldots, N(x^3)$)
Additional information on heuristics can be found for instance in Aarts and Lenstra (1997), Glover and Laguna (1997), Hoos and Stützle (2005), Papadimitriou and Steiglitz (1982), Pirlot (1992), and many other sources.
3.2 Reformulation, rounding and decomposition
Many heuristics rely on a few simple and natural ideas. One such idea is to replace the original hard problem (CO) by an easier, but closely related one, say (CO'). This can be accomplished, for instance, by changing the definition of the objective function, or by dropping some of the constraints of (CO). In the latter case, solving the simplified problem (CO') usually produces an infeasible solution of (CO), and this solution needs to be somehow repaired in order to produce a feasible (but suboptimal) solution.
A specific, but extremely useful and common application of this idea is found in rounding algorithms for 0-1 linear programming problems of the form
$$\min\ cx \quad \text{subject to } Ax \le b \text{ and } x \in \{0, 1\}^n.$$
We have already observed in Section 2.1.4 that, when we drop the constraint $x \in \{0, 1\}^n$ from this problem formulation, we obtain a linear programming problem which can be easily solved. Of course, the optimal solution of the LP model is typically fractional, and hence infeasible for the original problem. However, it is sometimes possible to round this optimal solution in such a way as to obtain feasible 0-1 solutions of the 0-1 LP problem.
Rounding has been used in countless algorithms for 0-1 LP problems, be it in theoretical developments, in implementations of generic solvers, or in specific industrial applications; see for instance Bollapragada et al. (2002) for a recent illustration.
Another general idea for solving hard problems is to decompose them into a collection or a sequence
of simpler subproblems. Then each subproblem can be solved either optimally or heuristically, and the
solutions of the subproblems are patched together in order to provide a feasible solution of the original
problem. Similar decomposition approaches are sometimes called divide and conquer strategies in the
broader context of algorithmic design. Note, however, that they usually result in suboptimal solutions of
the original CO problem.
Examples of the decomposition strategy are abundant in real-world settings. In a very broad sense, if we assume that the ultimate objective of management is to optimize the revenues, or the shareholders' profit, or the survivability of a firm, then the functional organization of the firm in marketing, production, and finance departments can be viewed as a way to decompose the global optimization issue into a number of subproblems, linked together by appropriate coordination mechanisms (e.g., strategic or business plans).
More specific examples can be found in classical production planning approaches, for instance in MRP techniques (Material Requirements Planning; see Crama (2002)). Here, for the sake of simplicity, the optimal lot size is usually determined independently for each component arising in a bill-of-materials. But in fact, the actual cost-minimization problem faced by the firm involves many interactions among these components: use of common production equipment, possibilities of joint orders from suppliers, etc. Therefore, the componentwise decomposition only provides a heuristic way of handling the global issue.
Illustrations of decomposition approaches can also be found in the papers by Crama, van de Klundert,
and Spieksma (2002) or by Tyagi and Bollapragada (2003) and in numerous other publications.
3.3 List-processing heuristics
List-processing heuristics (also called greedy or myopic heuristics) can be viewed as a special type of local search heuristics, and are among the simplest of them (see Section 3.4 below). We do not want to try to characterize them very precisely here: let us simply say that they apply in particular to CO problems of the form
$$\min \text{ (or } \max\text{) } F(S) \quad \text{subject to } S \subseteq E,\ S \in \mathcal{I}, \qquad \text{(IS)}$$
where $E$ is a finite set of elements and $\mathcal{I}$ is a collection of subsets of $E$.
The elements of $E$ can be viewed as the decision variables of the problem. For instance, the knapsack problem (see Subsection 2.1.4) can be interpreted in this way: here, $E$ is a set of objects, and $S \in \mathcal{I}$ if the subset of objects $S$ fits in the knapsack.
Now, list-processing heuristics construct a feasible solution of (CO) in successive iterations, starting from the initial solution $S = \emptyset$ and adding elements to this solution, one by one, in the order prescribed by some prespecified priority list. They terminate as soon as the priority list has been exhausted. In particular, no effort is made to improve this solution in subsequent steps (which justifies the names myopic or greedy).
Thus, the list-processing metaheuristic can be sketched as in Figure 3.2 below.
1. Establish a priority list $L$ of the elements of $E$.
2. Set $S := \emptyset$.
3. Repeat: if $L$ is empty then return $S$ and stop; else
   consider the next element in $L$, say $e_i$, and remove $e_i$ from $L$;
   if $S \cup \{e_i\}$ is feasible, i.e. if $S \cup \{e_i\} \in \mathcal{I}$, then set $S := S \cup \{e_i\}$.
Figure 3.2: The list-processing metaheuristic
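The scheme of Figure 3.2 translates almost line by line into code. The sketch below assumes that membership in $\mathcal{I}$ is available as a feasibility oracle `is_feasible` (a hypothetical callback, not something defined in the text), and the toy instance is invented purely for illustration.

```python
def list_processing(priority_list, is_feasible):
    """Scan the priority list once; keep each element that preserves feasibility."""
    solution = []
    for element in priority_list:
        if is_feasible(solution + [element]):
            solution.append(element)
    return solution

# Hypothetical toy instance: pick numbers, largest first, with total at most 10.
items = sorted([7, 5, 4, 3, 1], reverse=True)
chosen = list_processing(items, lambda s: sum(s) <= 10)
print(chosen)  # [7, 3]
```

Note that the element 1 is rejected at the end because the solution is never revised: this is exactly the myopic behavior described above.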
Intuitively speaking, the choice of the list $L$ should be dictated by the impact of each element of $E$ on the objective function $F$: those variables with a smaller marginal cost (for a minimization problem) or a heavier marginal contribution (for a maximization problem) should receive higher priority. But these general guidelines leave room for many possible implementations. Let us illustrate this discussion on a few examples.
Example: The knapsack problem. Consider the knapsack problem
$$\max\ cx \quad \text{subject to } ax \le b \text{ and } x \in \{0, 1\}^n$$
where $a, c \in \mathbb{R}^n_+$ and $b \in \mathbb{R}$. Various list-processing strategies can be proposed for this problem.
Strategy 1.
1. Sort the variables by nonincreasing utility value: if $c_i > c_j$, then $x_i$ precedes $x_j$ in $L$.
2. Set $x := (0, 0, \ldots, 0)$.
3. Run through $L$; increase the current variable to 1 if the resulting partial solution is feasible; otherwise leave it equal to 0.
Let us apply this strategy to the instance:
$$\max\ 3x_1 + 10x_2 + 3x_3 + 7x_4 + 6x_5$$
$$\text{subject to } 2x_1 + 6x_2 + 5x_3 + 8x_4 + 3x_5 \le 16$$
$$x_i \in \{0, 1\}\ (i = 1, 2, \ldots, 5).$$
For this instance, we successively obtain:
$L = (x_2, x_4, x_5, x_1, x_3)$
$x := (0, 0, 0, 0, 0)$
$x_2 := 1$; $x := (0, 1, 0, 0, 0)$;
$x_4 := 1$; $x := (0, 1, 0, 1, 0)$;
$x_5 := 0$; $x_1 := 1$; $x := (1, 1, 0, 1, 0)$;
$x_3 := 0$; $x := (1, 1, 0, 1, 0)$.
So, the algorithm returns the heuristic solution $(1, 1, 0, 1, 0)$, with value 20.
An obvious shortcoming of Strategy 1 is that it does not take the value of the coefficients $a_j$ into account when fixing the priority list. So, in the previous instance, variable $x_4$ is given higher priority than $x_5$ when in fact, for a comparable utility, $x_5$ adds much less weight to the knapsack than $x_4$. This observation leads to the next strategy.
Strategy 2.
1. Sort the variables by nonincreasing value of the ratios $\frac{c_i}{a_i}$: if $\frac{c_i}{a_i} > \frac{c_j}{a_j}$, then $x_i$ precedes $x_j$ in $L$.
2. Set $x := (0, 0, \ldots, 0)$.
3. Run through $L$; increase the current variable to 1 if the resulting partial solution is feasible; otherwise leave it equal to 0.
Going back to the numerical instance, we now obtain $L = (x_5, x_2, x_1, x_4, x_3)$. The resulting heuristic solution is $(1, 1, 1, 0, 1)$, with value 22.
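Both strategies differ only in the sorting key, which the following sketch makes explicit. The helper names are illustrative, and the brute-force enumeration at the end is added only to compare the heuristic values with the true optimum of this small instance; it is not part of either strategy.

```python
from itertools import product

c = [3, 10, 3, 7, 6]  # utilities
a = [2, 6, 5, 8, 3]   # weights
b = 16                # capacity

def greedy_knapsack(order):
    """Set variables to 1 in the given order whenever the capacity allows."""
    x, load = [0] * len(c), 0
    for i in order:
        if load + a[i] <= b:
            x[i], load = 1, load + a[i]
    return x

def value(x):
    return sum(ci * xi for ci, xi in zip(c, x))

n = len(c)
by_utility = sorted(range(n), key=lambda i: -c[i])       # Strategy 1
by_ratio = sorted(range(n), key=lambda i: -c[i] / a[i])  # Strategy 2
x1, x2 = greedy_knapsack(by_utility), greedy_knapsack(by_ratio)

# Exhaustive search over all 2^5 vectors, for comparison only.
best = max(value(x) for x in product([0, 1], repeat=n)
           if sum(ai * xi for ai, xi in zip(a, x)) <= b)
print(x1, value(x1))  # [1, 1, 0, 1, 0] 20
print(x2, value(x2))  # [1, 1, 1, 0, 1] 22
print(best)           # 22
```

On this particular instance Strategy 2 happens to reach the optimal value; this is not guaranteed in general.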
Interestingly, it can be proved that this strategy is equivalent to the following one (which combines rounding with list-processing): solve the LP relaxation of the knapsack problem to obtain a fractional solution $x^*$, build the priority list from the values $x^*_i$, and continue as in Steps 2-3 of Strategy 2 (see e.g. Nemhauser and Wolsey (1988)).
Example: The traveling salesman problem. The TSP can be viewed as a minimization problem
of the form (IS), where E is the set of edges of the underlying graph, and S is in I if and only if S is
a subset of edges which can be extended to a TSP tour. Assume now that the priority list L sorts the
edges by nondecreasing length. The resulting greedy heuristic is known in the literature as the shortest
edge heuristic.
Example: The maximum forest (MFT) problem. Let $G = (V, E)$ be a nonoriented graph with weight $w(e) \ge 0$ on each edge $e \in E$. If $S$ is any subset of edges of $G$, the weight of $S$ is $w(S) = \sum_{e \in S} w(e)$. A forest of $G$ is a subset of edges of $G$ which does not contain any cycle (i.e., closed path). The maximum forest problem asks for a forest of $G$ of maximum weight.
The greedy (list-processing) algorithm for this problem is:
Greedy MFT
1. Sort the edges of $G$ by nonincreasing weight: if $w(e_i) > w(e_j)$, then $e_i$ precedes $e_j$ in $L$.
2. Set $T := \emptyset$.
3. Run through $L$; if $T \cup \{e_i\}$ is a forest (i.e., is cycle-free), then set $T := T \cup \{e_i\}$.
Let us look at the instance in Figure 3.3, with the following weights (we denote by $w(1, 2)$ the weight of edge $\{1, 2\}$, etc.): $w(1, 2) = 10$, $w(3, 5) = 8$, $w(1, 3) = 7$, $w(2, 3) = 7$, $w(5, 6) = 6$, $w(3, 6) = 6$, $w(2, 4) = 2$, $w(4, 5) = 2$. Note that we have listed the weights by nonincreasing value.
Figure 3.3: A graph with 6 vertices and 8 edges
The Greedy algorithm successively produces:
$T := \emptyset$
$T := \{(1, 2)\}$
$T := \{(1, 2), (3, 5)\}$
$T := \{(1, 2), (3, 5), (1, 3)\}$
$T := \{(1, 2), (3, 5), (1, 3), (5, 6)\}$
$T := \{(1, 2), (3, 5), (1, 3), (5, 6), (2, 4)\}$.
The resulting forest has weight 33 and it is easy to check that this is the optimal solution for this instance (although there are several alternative optimal forests).
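The only nontrivial step of Greedy MFT is the cycle test. One common way to implement it, assumed here and not prescribed by the text, is a union-find structure over the vertices: an edge closes a cycle exactly when its two endpoints already lie in the same component of the current forest.

```python
def greedy_forest(n_vertices, weighted_edges):
    """Greedy maximum-weight forest; cycles are detected with union-find."""
    parent = list(range(n_vertices + 1))  # vertices are numbered 1..n

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest = []
    # sorted() is stable, so ties keep the order in which edges were listed.
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                       # adding {u, v} creates no cycle
            parent[ru] = rv
            forest.append((u, v))
    return forest

edges = [(1, 2, 10), (3, 5, 8), (1, 3, 7), (2, 3, 7),
         (5, 6, 6), (3, 6, 6), (2, 4, 2), (4, 5, 2)]
T = greedy_forest(6, edges)
weight = sum(w for u, v, w in edges if (u, v) in T)
print(T, weight)  # [(1, 2), (3, 5), (1, 3), (5, 6), (2, 4)] 33
```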
Actually, Proposition 3.3.1 hereunder shows that the Greedy algorithm is not only a heuristic, but also an exact algorithm for the Maximum forest problem. Together with some of its far-reaching generalizations, this result plays a central role in combinatorial theory.
We first need a lemma. Recall that a tree is a connected forest.
Lemma 3.3.1. If $G = (V, E)$ is a connected graph, then every maximal forest of $G$ is a tree containing $|V| - 1$ edges. More generally, if $G$ has $c$ connected components $G_i = (V_i, E_i)$ $(i = 1, 2, \ldots, c)$, then every maximal forest of $G$ is the union of $c$ trees and contains $\sum_{i=1}^{c} (|V_i| - 1)$ edges.
Proof. We leave the proof to the reader. QED
Proposition 3.3.1. The Greedy algorithm delivers an optimal solution for every instance of the Maximum forest problem.
Proof. Let $T = \{e_1, e_2, \ldots, e_t\}$ be the solution returned by the greedy algorithm and let $S = \{e'_1, e'_2, \ldots, e'_t\}$ be an optimal solution, where $e_i$ precedes $e_{i+1}$ and $e'_i$ precedes $e'_{i+1}$ in $L$. We want to show by induction that, for $k = 1, 2, \ldots, t$, $w(e_k) \ge w(e'_k)$, which will imply that $w(T) \ge w(S)$.
For $k = 1$, we have $w(e_1) \ge w(e'_1)$ by definition of the Greedy algorithm.
Consider now an index $k > 1$. Suppose that $w(e_i) \ge w(e'_i)$ for $1 \le i < k$ and $w(e'_k) > w(e_k)$. Note that $e'_k$ precedes $e_k$ in $L$.
Consider the edge-set $R = \{e \in E \mid w(e) \ge w(e'_k)\}$ and the forests $F = \{e_1, e_2, \ldots, e_{k-1}\}$ and $H = \{e'_1, e'_2, \ldots, e'_k\}$. We claim that $F$ is a maximal forest in $R$, i.e. every edge of $R \setminus F$ creates a cycle in $F$: indeed, if $e \in R \setminus F$, then $w(e) \ge w(e'_k) > w(e_k)$ and the greedy algorithm should have chosen $e$ rather than $e_k$.
Since $|F| = k - 1 < |H|$, we conclude that the graph $(V, R)$ contains two maximal forests of different cardinalities, contradicting Lemma 3.3.1. QED
Beyond their application to CO problems of the form (IS), list-processing algorithms can be extended to handle associated partitioning problems like
$$\min\ m \quad \text{subject to } S_1 \cup S_2 \cup \ldots \cup S_m = E, \quad S_i \in \mathcal{I}\ (i = 1, 2, \ldots, m). \qquad \text{(PART)}$$
Thus, problem (PART) is here to partition $E$ into a smallest number of sets in $\mathcal{I}$. This problem can be attacked by solving a sequence of optimization subproblems over (IS), with $F(S) = |S|$: try first to determine a large set $S_1 \in \mathcal{I}$, then remove from $E$ all elements of $S_1$, repeat the process in order to determine $S_2$, and so on. If each step is solved by a list-processing algorithm, then the resulting procedure is also called a list-processing algorithm for (PART); see Exercises 3 and 4 hereunder.
Additional examples of list-processing algorithms can be found, for instance, in Crama (2002).
3.3.1 Exercises.
Exercise 1. Apply the shortest edge heuristic to the TSP instance given in Section 2.2.3. Compare the
length of this tour with the lower bounds computed in Section 2.2.4.
Exercise 2. Prove Lemma 3.3.1.
Exercise 3. Let $G = (V, E)$ be a graph. A subset of vertices $S \subseteq V$ is called stable (or independent) in $G$ if it contains no edges, that is if the following condition holds: for all $u, v \in S$, $\{u, v\} \notin E$. The maximum stable set problem consists in finding a stable set of maximum size in a given graph. Provide a greedy heuristic for this problem.
Exercise 4. Show that the graph coloring problem (Section 2.1.6) and the bin packing problem (Section
2.2.4) are partitioning problems of the form (PART). Develop a greedy heuristic for each of these problems.
3.4 Neighborhoods and neighbors
In this and the following sections, we concentrate on local search procedures. A common feature of all
local search procedures is that they exploit the neighborhood concept (see Figure 3.1).
Definition 3.4.1. A neighborhood structure for the set $X$ is a collection of subsets $N(x) \subseteq X$, one for each $x \in X$. We call $N(x)$ the neighborhood of solution $x$, and we say that every element in $N(x)$ is a neighbor of $x$.
The neighborhood concept is naturally linked to the concept of local optimality.
Denition 3.4.2. A solution x
, i.e. if F(x
is
called a 2exchange. The 2exchange neighborhood structure is naturally dened by
N(C) = {C
 C
1
, V
2
) : V
1
= V
1
{v} \ {u}, V
2
= V
2
{u} \ {v} for some pair of nodes u V
1
, v V
2
}.
Figure 3.4: 2-exchange neighborhood concept for the traveling salesman problem
Figure 3.5: Neighborhood concept for the graph equipartitioning problem
We have imposed $N(x) \subseteq X$ for all $x$. When $X$ is small (i.e., for heavily constrained problems), it is sometimes difficult, or overly restrictive, to define neighborhoods that obey this condition. For instance, when partitioning a graph, it may be natural to consider the alternative neighborhood structure
$$N'(V_1, V_2) = \{(V'_1, V'_2) : V'_1 = V_1 \cup \{v\},\ V'_2 = V_2 \setminus \{v\} \text{ for some node } v \in V_2\}.$$
In this case, a problem occurs as the feasibility condition $|V'_1| = |V'_2|$ does not hold, that is $(V'_1, V'_2) \notin X$. One way around this difficulty is to reformulate the original CO problem into an equivalent problem that admits more feasible solutions (i.e., to extend $X$) and to penalize all solutions that are not in $X$.
For example, for any partition $(V_1, V_2)$ (not necessarily into equal parts) of the vertex set $V$, define $e(V_1, V_2)$ to be the number of edges from $V_1$ to $V_2$. Then, the graph equipartitioning problem
$$\text{minimize } e(V_1, V_2) \quad \text{subject to } V_1 \cap V_2 = \emptyset,\ V_1 \cup V_2 = V,\ |V_1| = |V_2|$$
has the same optimal solutions as the following one:
$$\text{minimize } h(V_1, V_2) = e(V_1, V_2) + M (|V_1| - |V_2|)^2 \quad \text{subject to } V_1 \cap V_2 = \emptyset,\ V_1 \cup V_2 = V$$
where $M$ is a very large number (penalty). Such a problem reformulation makes it possible to enlarge the feasible set, hence to move more freely within this set and to find an initial feasible solution $x \in X$ more easily. (A similar reformulation is used in the big M method of linear programming.)
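As a small illustration of the penalized objective $h$, the sketch below evaluates it on a hypothetical 4-cycle; the graph and the value $M = 100$ are arbitrary choices made for the example, not data from the text.

```python
def penalized_cut(edges, part1, part2, big_m):
    """h(V1, V2) = e(V1, V2) + M * (|V1| - |V2|)^2."""
    cut = sum(1 for u, v in edges if (u in part1) != (v in part1))
    return cut + big_m * (len(part1) - len(part2)) ** 2

# Hypothetical 4-cycle 1-2-3-4-1; M = 100 is an arbitrary large penalty.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
balanced = penalized_cut(edges, {1, 2}, {3, 4}, big_m=100)
unbalanced = penalized_cut(edges, {1, 2, 3}, {4}, big_m=100)
print(balanced, unbalanced)  # 2 402
```

Both partitions cut two edges, but the unbalanced one pays the penalty $M \cdot 2^2$, so a search that temporarily visits it is pushed back towards equal parts.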
3.4.2 Exercises.
Exercise 1. For each of the neighborhood structures defined in Section 3.4.1, estimate the size of the neighborhood of a feasible solution as a function of the size of the instance (number of variables, number of vertices, etc.).
Exercise 2. Consider problem (IS) in Section 3.3. Show that a list heuristic is obtained by applying the local search principle to (IS) with the following neighborhood structure
$$N(T) = \{S \in \mathcal{I} \mid T \subseteq S \text{ and } |S| = |T| + 1\}$$
(i.e., $S$ results from $T$ by adding one element to it).
3.5 Steepest descent
The steepest descent metaheuristic is one of the most natural local search heuristics: it simply recommends moving from the current solution to the best solution in its neighborhood, until no further improvement can be found.
A more formal description of the algorithm is given in Figure 3.6. We assume here that a particular neighborhood structure has been selected. For $k = 1, 2, \ldots$, we denote by $x^k$ the current solution at iteration $k$. We denote by $x^*$ the best solution found so far and by $F^* = F(x^*)$ its value.
1. Select $x^1 \in X$, set $F^* := F(x^1)$, $x^* := x^1$ and $k := 1$.
2. Repeat:
   find the best solution $\hat{x}$ in $N(x^k)$: $F(\hat{x}) = \min\{F(x) : x \in N(x^k)\}$;
   if $F(\hat{x}) < F(x^k)$ then $x^{k+1} := \hat{x}$, $F^* := F(\hat{x})$, $x^* := \hat{x}$ and $k := k + 1$
   else return $x^*$, $F^*$ and stop.
Figure 3.6: The steepest descent metaheuristic
Note that steepest descent really is a metaheuristic, not an algorithm. In particular, it cannot be applied directly to any particular CO problem until the initialization procedure has been described or, more fundamentally, until the neighborhood structure has been specified for this problem.
Note also that, when dealing with maximization (rather than minimization) problems, we speak of
steepest ascent rather than steepest descent.
We now proceed with a number of further comments on this framework.
3.5.1 Initialization
How should we select $x^1$? Intuitively, it seems preferable to start from a good solution, such as a solution selected by a list-processing heuristic. Experiments show, however, that this is not necessarily the case and that starting from a random solution may sometimes be a good idea. The influence of the initial solution may be reduced if we run the algorithm several times with different initial solutions.
3.5.2 Local minima
By definition, steepest descent heuristics terminate with a local optimum of CO which is not necessarily a global optimum.
For example, consider the following instance of the knapsack problem:
$$\max\ 2x_1 - 3x_2 + x_3 + 4x_4 - 2x_5$$
$$\text{subject to } 2x_1 - 3x_2 + 2x_3 + 3x_4 - x_5 \le 2$$
$$x_i \in \{0, 1\} \text{ for } i = 1, 2, \ldots, 5$$
and consider the neighborhood structure defined by $N_1(x) = \{y \in X : \sum_{i=1}^{5} |x_i - y_i| \le 1\}$.
Suppose that the initial solution is $x^1 = (0, 0, 0, 0, 0)$, with $F^* = 0$. Then steepest ascent moves to $x^2 = (1, 0, 0, 0, 0)$ with $F^* = 2$ and stops, since no neighbor of $x^2$ has a larger value.
Suppose now that we start with another initial solution, say $x^1 = (0, 1, 0, 0, 1)$ and $F^* = -5$. Then, we successively get $x^2 = (0, 1, 0, 1, 1)$, $F^* = -1$, and next $x^3 = (0, 0, 0, 1, 1)$, $F^* = 2$, where the algorithm stops with $x^* = x^3$.
So, in both cases, we have only found local maxima, whereas the global maximum is $x^* = (1, 1, 0, 1, 0)$, with $F(x^*) = 3$.
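The two runs can be replayed with a direct implementation of steepest ascent on this instance. This is a sketch under two assumptions: neighbors are generated by flipping a single component (the solution itself, which also belongs to $N_1(x)$, is irrelevant for the search), and the global maximum is found by plain enumeration, which is only viable because the instance has 32 candidate vectors.

```python
from itertools import product

c = [2, -3, 1, 4, -2]
a = [2, -3, 2, 3, -1]
b = 2

def F(x):
    return sum(ci * xi for ci, xi in zip(c, x))

def feasible(x):
    return sum(ai * xi for ai, xi in zip(a, x)) <= b

def neighbors(x):
    """Feasible solutions at Hamming distance exactly 1 from x."""
    for i in range(len(x)):
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if feasible(y):
            yield y

def steepest_ascent(x):
    """Move to the best neighbor as long as it strictly improves F."""
    while True:
        best = max(neighbors(x), key=F, default=x)
        if F(best) <= F(x):
            return x
        x = best

run1 = steepest_ascent((0, 0, 0, 0, 0))
run2 = steepest_ascent((0, 1, 0, 0, 1))
optimum = max((x for x in product((0, 1), repeat=5) if feasible(x)), key=F)
print(run1, F(run1))        # (1, 0, 0, 0, 0) 2
print(run2, F(run2))        # (0, 0, 0, 1, 1) 2
print(optimum, F(optimum))  # (1, 1, 0, 1, 0) 3
```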
3.5.3 Choice of neighborhood structure
A further observation (closely related to the previous one) is that, when $N(x)$ is too small, the risk of missing the global optimum is high. But conversely, when $N(x)$ is large, the heuristic may spend a lot of time exploring the neighborhood of the current solution in order to determine the best neighbor $\hat{x}$. This is another manifestation of the quality vs. time tradeoff already mentioned in Section 2.2.3.
This is illustrated (although in caricature) by considering two extreme cases:
- if $N(x) = \{x\}$ for all $x \in X$ (a very small neighborhood, indeed), then the algorithm stops at the first iteration and simply returns the initial solution;
- at the other extreme, if $N(x) = X$ for all $x \in X$, then $\hat{x}$ is the global optimum of the problem (which, of course, may be very hard to find).
More interestingly, this brief discussion points to the fact that the subproblem $\min\{F(x) : x \in N(x^k)\}$ which is to be solved at every iteration of steepest descent is fundamentally a problem of the same nature as CO itself, but over a restricted region of the search space. In many cases, this subproblem will be solved by exhaustive search, i.e., by complete enumeration of all solutions in $N(x)$. This observation may guide the choice of an appropriate neighborhood structure.
3.5.4 Selection of neighbor
Some variants of the algorithm do not completely explore the neighborhood of the current solution $x^k$ in order to find the best neighbor $\hat{x}$, but rather select, for instance, the first solution $\hat{x}$ such that $F(\hat{x}) < F(x^k)$ found during the exploration phase, or the best solution among the first ten candidates, etc. (This is akin to the partial pricing strategy used in certain implementations of the simplex method for linear programming.)
3.5.5 Fast computation of the objective function
When exploring the neighborhood $N(x^k)$ of the current solution, it is sometimes possible to improve efficiency by not recomputing $F(x)$ from scratch for all $x \in N(x^k)$, and by making use of the information that is already available about the value of $F(x^k)$.
For example, assume (as in a knapsack problem) that $F(x^k) = \sum_{j=1}^{n} c_j x^k_j$ and that $x \in N_1(x^k)$ differs from $x^k$ only on the 5th component. How should we compute $F(x)$ in this case? Brute force computation of the expression $F(x) = \sum_{j=1}^{n} c_j x_j$ requires $n$ multiplications and $n - 1$ additions. By contrast, only 2 multiplications and 2 additions are required if we notice that $F(x) = F(x^k) - c_5 x^k_5 + c_5 x_5$. Similarly, if $X = \{x \mid \sum_{j=1}^{n} a_j x_j \le b\}$, then we can check whether $x \in X$ by storing the value $\sum_{j=1}^{n} a_j x^k_j$ and simply checking whether
$$\sum_{j=1}^{n} a_j x^k_j - a_5 x^k_5 + a_5 x_5 \le b.$$
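The constant-time update can be written as a small helper. The sketch below reuses the knapsack data of Section 3.3 as an arbitrary test instance; the function name `flip_delta` and the choice of starting solution are illustrative assumptions.

```python
def flip_delta(c, a, x, value, load, j):
    """Objective value and load after flipping component j of x, in O(1)."""
    new_xj = 1 - x[j]
    return (value - c[j] * x[j] + c[j] * new_xj,
            load - a[j] * x[j] + a[j] * new_xj)

c, a, b = [3, 10, 3, 7, 6], [2, 6, 5, 8, 3], 16
x = [0, 1, 0, 1, 0]
value = sum(ci * xi for ci, xi in zip(c, x))  # computed once from scratch
load = sum(ai * xi for ai, xi in zip(a, x))

# Flip the 5th component (index 4) without re-summing all the terms.
new_value, new_load = flip_delta(c, a, x, value, load, j=4)
print(new_value, new_load, new_load <= b)  # 23 17 False
```

Here the neighbor would have a better objective value but violates the capacity constraint, and both facts are established with a constant number of operations.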
Let us consider another example. For the traveling salesman problem, let $C$ be a feasible tour (set of edges) with length $L(C)$. After the 2-exchange displayed in Figure 3.7, we obtain a tour $C'$ with length
$$L(C') = L(C) - d_{ij} - d_{kl} + d_{ik} + d_{jl},$$
and the computation of $L(C')$ [...]
Figure 3.7: 2-exchange
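The update formula for a 2-exchange can be checked numerically. The sketch below assumes a symmetric distance matrix and represents a tour as a list of cities, so that the exchange amounts to reversing the segment between the two removed edges; it reuses the 6-city matrix of Section 2.2.3 with 0-based indices, and the function names are illustrative.

```python
def two_exchange_delta(d, tour, p, q):
    """Length change when edges (tour[p], tour[p+1]) and (tour[q], tour[q+1])
    are replaced by (tour[p], tour[q]) and (tour[p+1], tour[q+1]); needs p < q."""
    n = len(tour)
    i, j = tour[p], tour[p + 1]
    k, l = tour[q], tour[(q + 1) % n]
    return d[i][k] + d[j][l] - d[i][j] - d[k][l]

def tour_length(d, tour):
    n = len(tour)
    return sum(d[tour[t]][tour[(t + 1) % n]] for t in range(n))

d = [
    [0, 4, 7, 2, 6, 3],
    [4, 0, 3, 5, 5, 7],
    [7, 3, 0, 2, 6, 5],
    [2, 5, 2, 0, 9, 8],
    [6, 5, 6, 9, 0, 5],
    [3, 7, 5, 8, 5, 0],
]
tour = [0, 1, 2, 3, 4, 5]
p, q = 0, 3
delta = two_exchange_delta(d, tour, p, q)
# The exchanged tour is obtained by reversing the segment tour[p+1..q].
new_tour = tour[:p + 1] + tour[p + 1:q + 1][::-1] + tour[q + 1:]
print(delta, tour_length(d, tour) + delta, tour_length(d, new_tour))  # -6 20 20
```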
Figure 3.8: A feasible coloring with 4 colors (vertex colors: $v_1$: 1, $v_2$: 2, $v_3$: 2, $v_4$: 1, $v_5$: 3, $v_6$: 2, $v_7$: 1, $v_8$: 3, $v_9$: 4, $v_{10}$: 4)
of the current solution (all neighbors have the same objective function value) and it is difficult to find descent directions (see Figure 3.9).
A possible remedy to this difficulty is to modify both $F(x)$ and $X$ in the definition of the problem! For instance, let us select a tentative number of colors $C$ (for example, $C = 3$) and define
$$X_C = \{\text{colorings of } V \text{ using the colors } \{1, 2, \ldots, C\}\},$$
where the colorings are not necessarily required to be feasible (cf. Section 2.1.6), and let
F
(x) = 4.
Of course, the graph can be colored with C colors if and only if min{F
(x)  x X
C
} = 0. In other
words, the chromatic number of a graph is the smallest value of C for which min{F
(x)  x X
C
} = 0.
So, the original graph coloring problem can be transformed into a sequence of problems of the form
(X
C
, F
(x) = 0,
meaning that the graph is feasibly colored with 3 colors.
3.5.7 Exercises.
Exercise 1. Explain why the simplex algorithm for linear programming can be called a steepest descent method.
Exercise 2. Show that changing the color of v
9
from 1 to 3 in Figure 3.10 leads to a local optimum of
F
(x).
3.6 Simulated annealing
The major weakness of steepest descent algorithms is that they tend to stop too early, i.e. they get trapped in local optima of poor quality. How can we avoid this weakness?
A possible solution is to run the algorithm repeatedly from multiple initial solutions. This multistart strategy may work well in some cases, but other, more complex approaches have proved to be much more powerful for large, difficult instances of CO problems.
In this section, we want to explore the following ideas.
Figure 3.9: Flat objective function
Figure 3.10: An infeasible coloring with 3 colors (vertex colors: $v_1$: 1, $v_2$: 2, $v_3$: 2, $v_4$: 1, $v_5$: 1, $v_6$: 2, $v_7$: 1, $v_8$: 1, $v_9$: 1, $v_{10}$: 3)
Idea # 1. In order to escape local minima, it may be useful to take steps which deteriorate the objective function, at least once in a while. One way to achieve this goal may be to replace $x^k$ by a neighbor $x^{k+1}$ chosen randomly in $N(x^k)$. This idea has proved to be especially useful when combined with the next ingredient.
Idea # 2. Select a good neighbor with higher probability than a bad one.
Taken together, these two ideas result in the very popular simulated annealing algorithm. Various aspects of the implementation of SA algorithms are discussed at length, for instance, in two papers by Johnson, Aragon, McGeoch and Schevon (1989, 1991) or in Pirlot (1992). We only provide here some basic elements of information and we refer to these papers for additional details.
3.6.1 The simulated annealing metaheuristic
The generic framework of the simulated annealing metaheuristic is shown in Figure 3.11. We suppose again that a particular neighborhood structure has been selected and we use the same notations $x^*$, $F^*$, $x^k$ as in the steepest descent heuristic. Moreover, we assume that for $k = 1, 2, \ldots$, a number (called transition probability) $0 < p_k < 1$ is available (see below for comments regarding the choice of $p_k$).
3.6.2 Choice of the transition probabilities
How should we fix the transition probabilities $p_k$?
(i) Intuitively, we would like to reject bad moves with high probability, as suggested in Figure 3.12 (where $\Delta F_k = F(x) - F(x^k)$). The usual choice is to compute $p_k$ by a function of the type
$$p_k = e^{-\Delta F_k / T_k} \quad \text{for } \Delta F_k \ge 0,$$
where $T_k$ is a positive scalar parameter called the temperature of the process. Now, how should we choose the sequence $T_k$?
(ii) In the course of iterations (that is, as $k$ increases), our willingness to accept bad moves decreases steadily, and thus we would like the probability $p_k$ to decrease for each fixed value of $\Delta F_k$ (see Figure 3.13). This means that the ratio $-\frac{1}{T_k}$ must decrease (or at least, must not increase) when $k$ increases.
Thus, altogether, we define $p_k$ to be a sequence of the form
$$p(k) = e^{-\Delta F_k / T_k}$$
where $T_k$ is a nonincreasing positive sequence called the cooling schedule of the algorithm (since it defines the way in which the temperature is progressively lowered). A common choice is to use a so-called geometric cooling schedule whereby the temperature decreases by a constant factor (the cooling factor) after a constant number $L$ of iterations. The iterations performed at constant temperature constitute a plateau (see Figure 3.14).
1. Select $x^1 \in X$, set $F^* := F(x^1)$, $x^* := x^1$ and $k := 1$.
2. Repeat:
   Choose $x$ randomly in $N(x^k)$ (Propose a move).
   If $F(x) < F(x^k)$ then AcceptMove($x$) else Toss($x^k$, $x$).
   Evaluate the stopping conditions.
   If Terminate = True then return $x^*$, $F^*$ and stop.
Procedure AcceptMove($x$)
   let $x^{k+1} := x$; if $F(x) < F(x^*)$ then $F^* := F(x)$, $x^* := x$.
Procedure Toss($x^k$, $x$)
   let $x^{k+1} := x$ with probability equal to $p_k$ (Accept the move)
   else, let $x^{k+1} := x^k$ (Reject the move).
Procedure Stopping conditions
   if the stopping conditions are satisfied then Terminate := True
   else $k := k + 1$ and Terminate := False.
Figure 3.11: The simulated annealing metaheuristic
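The scheme of Figure 3.11, combined with a geometric cooling schedule, can be sketched as follows. Several choices here are assumptions made for the sake of a runnable example rather than part of the text: the stopping rule is simply a fixed number of plateaus, the parameter values are arbitrary, and the test problem (a bumpy function over the integers 0 to 99) is invented.

```python
import math
import random

def simulated_annealing(x, F, random_neighbor, t0, cooling, plateau, n_plateaus, rng):
    """Minimize F; a worsening move is accepted with probability exp(-dF / T)."""
    best, f_best, temp = x, F(x), t0
    for _ in range(n_plateaus):
        for _ in range(plateau):              # iterations at constant temperature
            y = random_neighbor(x, rng)
            d_f = F(y) - F(x)
            if d_f < 0 or rng.random() <= math.exp(-d_f / temp):
                x = y                         # accept the move
                if F(x) < f_best:
                    best, f_best = x, F(x)
        temp *= cooling                       # geometric cooling
    return best, f_best

# Hypothetical toy problem: a bumpy function over the integers 0..99.
F = lambda x: (x - 70) ** 2 / 100 + 5 * math.sin(x)
step = lambda x, rng: min(99, max(0, x + rng.choice((-1, 1))))
best, f_best = simulated_annealing(0, F, step, t0=10.0, cooling=0.9,
                                   plateau=50, n_plateaus=60, rng=random.Random(0))
print(best, round(f_best, 3))
```

Passing an explicit `random.Random` instance keeps the run reproducible, which is convenient when comparing cooling parameters.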
3.6.3 Stopping criteria
Note that, contrary to local search, simulated annealing may perform an infinite number of iterations if we do not impose some limitation on its running time. So, when should we terminate the process?
A common criterion is to stop when a large number of iterations has been performed without any improvement in the objective function and when the process seems to be stalling. One way to implement this idea requires selecting two positive numbers, say $\varepsilon_2$ and $K_2$ (for example, $\varepsilon_2 = 2$ and $K_2 = 5$).
Figure 3.12: Fixing p(k)
Figure 3.13: Fixing p(k) II
Figure 3.14: A geometric cooling schedule (temperature $T_k$ as a function of the iteration $k$, starting from $T_0$ and constant on plateaus of length $L$)
The algorithm stops if, in $K_2$ successive plateaus of length $L$, the best available value $F^*$
F
T
0
e
T
0
= p
0
1. Select $x^1 \in X$, set $F^* := F(x^1)$, $x^* := x^1$, $k := 1$ and $T := T_0$.
2. Repeat:
   Choose $x$ randomly in $N(x^k)$ (Propose a move).
   If $F(x) < F(x^k)$ then AcceptMove($x$) else Toss($x^k$, $x$).
   Evaluate the stopping conditions.
   If Terminate = True then return $x^*$, $F^*$ and stop.
Procedure AcceptMove($x$)
   let $x^{k+1} := x$; if $F(x) < F(x^*)$ then $F^* := F(x)$, $x^* := x$.
Procedure Toss($x^k$, $x$)
   compute $\Delta F := F(x) - F(x^k)$ and $p_k = e^{-\Delta F / T}$ (transition probability)
   draw a number $u$, randomly and uniformly distributed in $[0, 1]$
   if $u \le p_k$ then $x^{k+1} := x$ (Accept the move)
   else $x^{k+1} := x^k$ (Reject the move).
Procedure Stopping conditions
   if the number of iterations since the last decrease of temperature is less than $L$
   then $k := k + 1$ and Terminate := False (Continue with the same plateau)
   else
if no improvement of F
[...] $e^{-\Delta F / T}$ is quite high. A non-negligible speedup can be obtained if we replace this expression by its approximation $1 - \frac{\Delta F}{T}$ (25 times faster for comparable quality; see Oliveira, Ferreira, and Vidal (1993) for details).
Once again, we refer the reader to Aarts and Lenstra (1997), Johnson, Aragon, McGeoch and Schevon
(1989, 1991), Pirlot (1992) and to other references in the bibliography for more information on simulated
annealing algorithms.
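The following Python sketch puts together the ingredients discussed in this section: random proposal of a move, the Toss procedure with $p_k = e^{-\Delta F / T}$, and a geometric cooling schedule with plateaus of length $L$. It is only a minimal illustration, not a tuned implementation. The function name, the parameter names (`alpha` for the cooling factor, `K2` for the number of idle plateaus) and the toy problem are assumptions made for the example; in particular, the stopping rule used here (stop after `K2` successive plateaus without any improvement of $F^*$) is one simple variant of the criterion of Section 3.6.3.

```python
import math
import random

def simulated_annealing(F, neighbors, x1, T0, alpha=0.9, L=50, K2=5, rng=None):
    """Minimize F by simulated annealing with a geometric cooling schedule.

    Assumed stopping rule: stop after K2 successive plateaus (of L iterations
    each) during which the best value F* has not improved.
    """
    rng = rng or random.Random()
    x, T = x1, T0
    best_x, best_F = x1, F(x1)
    idle_plateaus = 0
    while idle_plateaus < K2:
        improved = False
        for _ in range(L):                     # one plateau at constant temperature
            y = rng.choice(neighbors(x))       # propose a move
            delta = F(y) - F(x)
            # Improving moves are always accepted; otherwise "toss":
            # accept with probability p_k = exp(-delta / T).
            if delta < 0 or rng.random() <= math.exp(-delta / T):
                x = y
                if F(x) < best_F:
                    best_x, best_F = x, F(x)
                    improved = True
        idle_plateaus = 0 if improved else idle_plateaus + 1
        T *= alpha                             # geometric cooling
    return best_x, best_F

# Toy problem (hypothetical): minimize (x - 3)^2 over the integers -20..20,
# where the neighbors of x are x - 1 and x + 1 (clipped to the interval).
def toy_F(x):
    return (x - 3) ** 2

def toy_neighbors(x):
    return [max(x - 1, -20), min(x + 1, 20)]

best_x, best_F = simulated_annealing(toy_F, toy_neighbors, 15, T0=10.0,
                                     rng=random.Random(1))
print(best_x, best_F)
```

Passing an explicit random generator makes runs reproducible, which is convenient when comparing cooling factors or plateau lengths.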
3.7 Tabu search
(To be revised and completed...)
3.7.1 Introduction
Idea: at each iteration, choose a neighbor $x$ of $x_n$ that minimizes $F(x)$ in $N(x_n)$. (This is the steepest descent, mildest ascent method; see Hansen and Jaumard (1990) for more about this topic.)
Consider the following example:
$$\max\ x_1 + 10x_2 + 3x_3 + 7x_4 + 6x_5$$
$$\text{subject to } 2x_1 + 6x_2 + 5x_3 + 8x_4 + 3x_5 \le 16$$
$$x_j \in \{0, 1\}, \quad j = 1, \dots, 5.$$
The neighbors are the solutions within a Hamming distance of 1. Let $x_0 = (0, 0, 0, 0, 0)$ be the initial solution. Thus we might have
$x_0 = (0, 0, 0, 0, 0)$
$x_1 = (0, 1, 0, 0, 0)$
$x_2 = (0, 1, 0, 1, 0)$
$x_3 = (1, 1, 0, 1, 0)$
$x_4 = (0, 1, 0, 1, 0)$
Here, $x_4 = x_2$, which highlights a danger of this method: the cycling problem. Now, suppose that coming back to the last explored solution is forbidden. We could therefore have
$x_4 = (1, 1, 0, 0, 0)$ ($x_2$ is tabu)
$x_5 = (1, 1, 0, 0, 1)$ ($x_3$ is tabu)
$x_6 = (1, 1, 1, 0, 1)$ (optimal solution)
The interested reader will find a generic description of this problem in Section 4.1 of Pirlot (1992).
3.7.2 The algorithm
Initialization: select $x_1 \in X$; set $F^* = F(x_1)$, $x^* \leftarrow x_1$ and the tabu list $TL := \emptyset$.
Step $k$ (with $k = 1, 2, \dots$):
   Choose the best neighbor $x$ of $x_k$ that is not tabu:
   $F(x) = \min\{F(y) : y \in N(x_k),\ y \notin TL\}$.
   $x_{k+1} \leftarrow x$.
   If $F(x) < F(x^*)$ then $x^* \leftarrow x$, $F^* \leftarrow F(x)$.
   $TL \leftarrow TL \cup \{x_k\}$.
   If $|TL| > K$ then remove the oldest tabu solution from $TL$.
   If the stopping criterion is not fulfilled
   then go to step $k + 1$,
   else stop.
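As an illustration, here is a minimal Python sketch of the algorithm with an explicit tabu list, applied to the 0-1 example of Section 3.7.1. The function and variable names are choices made for this example only. Since the algorithm is stated for minimization, the example is rewritten with $F(x) = -(x_1 + 10x_2 + 3x_3 + 7x_4 + 6x_5)$, and the neighborhood is restricted (an assumption) to feasible solutions at Hamming distance 1. The stopping criterion is simply a fixed number of iterations.

```python
from collections import deque

def tabu_search(F, neighbors, x1, K=3, max_iter=10):
    """Minimize F with an explicit tabu list holding the K most recent solutions."""
    x = x1
    best_x, best_F = x1, F(x1)
    tabu = deque(maxlen=K)           # the oldest solution drops out automatically
    for _ in range(max_iter):
        candidates = [y for y in neighbors(x) if y not in tabu]
        if not candidates:
            break                    # every neighbor is tabu
        y = min(candidates, key=F)   # best non-tabu neighbor, even if worse than x
        tabu.append(x)
        x = y
        if F(x) < best_F:
            best_x, best_F = x, F(x)
    return best_x, best_F

# Data of the example in Section 3.7.1.
c = (1, 10, 3, 7, 6)
a = (2, 6, 5, 8, 3)
b = 16

def F(x):
    """Objective written as a minimization: F(x) = -value(x)."""
    return -sum(cj * xj for cj, xj in zip(c, x))

def neighbors(x):
    """Feasible solutions at Hamming distance 1 from x."""
    result = []
    for i in range(len(x)):
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if sum(aj * yj for aj, yj in zip(a, y)) <= b:
            result.append(y)
    return result

best_x, best_F = tabu_search(F, neighbors, (0, 0, 0, 0, 0))
print(best_x, -best_F)   # (1, 1, 1, 0, 1) 20
```

With these settings, the first six iterations visit exactly the solutions $x_1, \dots, x_6$ listed in Section 3.7.1 (second variant), and the optimal value 20 is reached at the sixth iteration.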
Here are several comments on this algorithm:
1. An explicit tabu list requires a large amount of computer memory, and comparing each neighbor $x \in N(x_k)$ with all the elements of $TL$ takes a lot of time. (Just think about a simple case where $x$ is a $20 \times 20$ matrix: 400 equalities need to be checked for each tabu solution!) To avoid this problem, an attribute of the new tabu solution $x_k$ is often stored rather than the entire solution itself. Typically, we choose the attribute modified during the transition $x_k \to x_{k+1}$.
Examples:
(i) Suppose that during the transition $x_k \to x_{k+1}$, we only modify $x_i$ from 0 to 1. We can store $(i, 0)$ in the tabu list, meaning that it is forbidden to fix $x_i = 0$.
(ii) Consider the 2-exchange in Figure 3.18. Here we will store $(i, j, k, l)$ in the tabu list.
Note that by putting attributes in $TL$ rather than entire solutions, access to yet unexplored but potentially good solutions is sometimes forbidden (for instance, all the solutions with $x_i = 0$). To solve that, the TS algorithm can be modified by considering that $F(x) = \min\{F(y) : y \in N(x_k)\}$ if $F(x) < F^*$; that is, a tabu move is accepted when it improves on the best available value (aspiration criterion).
Figure 3.18: Tabu list (a 2-exchange on the vertices $i$, $j$, $k$, $l$)
$x_1$: $F(x_1) = 5$.
The best move is to change vertex (1) into V (or vertex (3) into V, or vertex (4) into V). That way, $F(x_2) = 3$ and $TL = \{(1, B)\}$.
Then, the best move is to change vertex (3) into V. That way, $F(x_3) = 1$ and $TL = \{(1, B), (3, B)\}$.
All the moves now increase the value of $F$. Choose for instance to change vertex (4) into V, so that $F(x_4) = 3$ and $TL = \{(1, B), (3, B), (4, B)\}$.
The best move is to change vertex (1) into B, which is tabu, but we accept it since it satisfies the aspiration criterion. Thus, $F(x_5) = 1$ and $TL = \{(1, B), (3, B), (4, B), (1, V)\}$.
...
Figure 3.19: Chromatic number: tabu search (graph on vertices 1 to 11, colored with B, V and R)
3.8 Genetic algorithms
3.8.1 Introduction
Steepest descent, simulated annealing and tabu search are designed to improve an initial solution by exploring solutions that are close to it. This approach is sometimes called an intensification strategy, since it intensifies the search in the vicinity of a current solution.
A major drawback of such strategies is that they cannot easily reach areas that are very distant from the initial solution; that is, they cannot diversify the exploration of the feasible set $X$.
A possible remedy to this drawback is to apply the algorithm a large number of times from many different initial solutions (multistart, or sampling strategy). But here again, several problems occur: first, a large number of runs (1000 runs, 10000 runs) can still be quite small as compared to the size of the space to explore. Second, it is hard to ensure that the sample of initial solutions faithfully represents the set $X$. (This is a general problem with sampling methods.)
Genetic algorithms (GA) offer a specific, quite powerful approach to the diversification issue (see e.g. Goldberg (1989)). In fact, they alternate diversification and intensification phases. At each iteration, they produce a population (i.e., a subset of solutions): at step $k$, the population is denoted $X^{(k)} = \{x^{(k)}_1, x^{(k)}_2, \dots, x^{(k)}_N\} \subseteq X$.
3.8.2 Diversification via crossover
Consider a pair of solutions $x$ and $y$ (to be called parents) in the current population. We can combine these solutions to produce one or two new solutions (called children) $u$ and $v$ that share some features of both $x$ and $y$. The operator that associates a child (or two children) to a pair of parents is called crossover.
Intuitively (and just as in real life), the children obtained by crossover should look like their parents, but should also introduce some diversity in the current population.
Suppose for example that $x$ and $y$ are binary vectors:
x = (11010011)
y = (01100101)
A possible crossover operator produces the single child $u$, where $u_i = x_i$ with probability 0.5 and $u_i = y_i$ with probability 0.5. For our example, this operator could produce the child
u = (11100111).
Note that the second, fifth and eighth elements of $u$ are predetermined since they are common to $x$ and $y$.
Another crossover method works by randomly choosing an index $i$, splitting $x$ and $y$ at coordinate $i$ and exchanging the initial segments of $x$ and $y$. For instance, with $i = 4$ in the previous example, we produce two children:
u = (11010101)
v = (01100011).
The crossover operators defined above are uniform operators, meaning: if $z = (z_1, \dots, z_n)$ is a child of $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, then, for each $i$, either $z_i = x_i$ or $z_i = y_i$. Note that nonuniform crossovers are also frequently used in the literature (see Mühlenbein (1997) for details).
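The two crossover operators described above are easy to express in code. The short Python sketch below is given for illustration only (the function names are ours); it encodes binary vectors as tuples and reproduces the children obtained with $i = 4$ in the example.

```python
import random

def uniform_crossover(x, y, rng):
    """Single child: each coordinate is copied from x or from y with probability 0.5."""
    return tuple(xi if rng.random() < 0.5 else yi for xi, yi in zip(x, y))

def one_point_crossover(x, y, i):
    """Two children obtained by exchanging the first i coordinates of x and y."""
    return x[:i] + y[i:], y[:i] + x[i:]

x = (1, 1, 0, 1, 0, 0, 1, 1)
y = (0, 1, 1, 0, 0, 1, 0, 1)

u, v = one_point_crossover(x, y, 4)
print(u)   # (1, 1, 0, 1, 0, 1, 0, 1)
print(v)   # (0, 1, 1, 0, 0, 0, 1, 1)

child = uniform_crossover(x, y, random.Random(0))
# Coordinates on which both parents agree (2nd, 5th, 8th) are inherited for sure.
print(child[1], child[4], child[7])   # 1 0 1
```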
Ideally, the new individuals created by crossover should inherit desirable features from their parents: we would like to produce good children from good parents. This goal can be achieved by combining the following elements:
• When picking a pair of parents to mate, good parents should be selected with a higher probability than bad ones. For instance, $x$ and $y$ could be drawn in $X^{(k)}$ with probability equal to
$$\mathrm{Prob}(x) = \frac{F_{\max} - F(x)}{\sum_{j=1}^{N} [F_{\max} - F(x_j)]} \qquad (3.1)$$
where $F_{\max} = \max\{F(x_j) : j = 1, \dots, N\}$. See Table 3.1 for an example.
• Common features of the parents (those that are expected to be typical of good solutions) or, at least, some of those features, should be preserved when producing children (see later). As an example, consider the traveling salesman problem. If the salesman is to visit every European capital, then, in a reasonable tour, Helsinki and Madrid will never be visited successively (neither will London and Athens). This feature should be preserved when crossing two reasonable parents.
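Formula (3.1) can be checked numerically. The Python sketch below (an illustration; the function name is ours) computes the selection probabilities exactly with fractions for the values $F = 15, 12, 10, 10$ used in Table 3.1. The uniform fallback used when all values coincide is an assumption: formula (3.1) is undefined in that case.

```python
from fractions import Fraction

def selection_probabilities(values):
    """Probabilities (3.1) for a minimization problem: Prob(x) is proportional
    to F_max - F(x), so that the worst solution is never selected."""
    f_max = max(values)
    gaps = [f_max - f for f in values]
    total = sum(gaps)
    if total == 0:
        # Assumption: uniform distribution when all values are equal.
        return [Fraction(1, len(values))] * len(values)
    return [Fraction(g, total) for g in gaps]

probs = selection_probabilities([15, 12, 10, 10])
print([str(p) for p in probs])   # ['0', '3/13', '5/13', '5/13']
```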
3.8.3 A basic genetic algorithm
We are now ready to describe a primitive genetic algorithm for the combinatorial optimization problem $\min\{F(x) : x \in X\}$. The algorithm depends on the choice of a crossover operator, and on the choice of a probability distribution Prob(.) defined on every finite subset of $X$. Let us assume that the following parameters have also been selected: $N$ (the population size) and $M$ (the number of children produced in each generation) with $M \le N$. Then, the basic genetic metaheuristic is presented in Figure 3.20.
$X^{(k)}$    $F(x)$           Prob
$x_1$        $F(x_1) = 15$    0
$x_2$        $F(x_2) = 12$    3/13
$x_3$        $F(x_3) = 10$    5/13
$x_4$        $F(x_4) = 10$    5/13
Table 3.1: Genetic algorithms: selecting good parents
1. Initialization: Select an initial population $X^{(1)} \subseteq X$ with $|X^{(1)}| = N$, set $F^* := \min\{F(x) : x \in X^{(1)}\}$, $x^* := \arg\min\{F(x) : x \in X^{(1)}\}$ and $k := 1$.
2. Repeat:
   Selection of parents: Create a new temporary population $Y^{(k)} = \{y_1, \dots, y_{2M}\}$, drawn randomly (with replacement) from $X^{(k)}$ according to the distribution Prob(x).
   Crossover: For $j = 1, \dots, M$, cross the pair of parents $(y_{2j-1}, y_{2j})$ to produce the set of children $Z^{(k)} = \{z_1, \dots, z_M\}$.
   Survival of the fittest: Draw randomly $N - M$ elements from $X^{(k)}$ (with probability Prob(x)) and add them to $Z^{(k)}$ in order to create the next-generation population $X^{(k+1)} = \{x^{(k+1)}_1, \dots, x^{(k+1)}_N\}$. (An alternative procedure would draw $N$ elements from $X^{(k)} \cup Z^{(k)}$.)
   Let $x := \arg\min\{F(y) : y \in X^{(k+1)}\}$. If $F(x) < F^*$ then $F^* := F(x)$ and $x^* := x$.
   If the stopping criterion is satisfied then return $x^*$, $F^*$ [...]
[...] a measure of the gap between $F^*$ and a lower bound on $\min F(x)$, etc. For GAs, another criterion is also commonly used. Let us define the fitness of population $X^{(k)}$ as the average value of $F(x)$ over $X^{(k)}$, that is, the value
$$Q_k = \frac{1}{|X^{(k)}|} \sum_{x \in X^{(k)}} F(x).$$
Convergence of $Q_k$ toward a fixed value indicates that the population is increasingly homogeneous and that the procedure reaches a stationary state. Thus, if the difference $|Q_{k+1} - Q_k|$ is small for several successive iterations, then the algorithm can stop.
In its primitive form, the genetic algorithm presented above is generally not a very efficient approach to the solution of hard combinatorial optimization problems. Before it becomes a practical method, some enhancements have to be added to this basic scheme. In the next subsections, we proceed with a discussion of such possible refinements.
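For concreteness, the basic scheme of this section can be sketched in a few lines of Python. This is only an illustrative skeleton under several assumptions: the function and parameter names are ours, the stopping criterion is a fixed number of generations, survivors are drawn with replacement, and the distribution (3.1) is replaced by the uniform distribution when all members of the population have the same value.

```python
import random

def genetic_algorithm(F, population, crossover, M, generations, rng):
    """Basic GA for min F(x): selection, crossover, survival, update of the best solution."""
    N = len(population)

    def draw(pop, count):
        # Distribution (3.1); uniform fallback (an assumption) if all values are equal.
        f_max = max(F(x) for x in pop)
        weights = [f_max - F(x) for x in pop]
        if sum(weights) == 0:
            weights = [1] * len(pop)
        return rng.choices(pop, weights=weights, k=count)

    best = min(population, key=F)
    for _ in range(generations):
        parents = draw(population, 2 * M)                        # selection of parents
        children = [crossover(parents[2 * j], parents[2 * j + 1], rng)
                    for j in range(M)]                           # crossover
        population = children + draw(population, N - M)          # survival of the fittest
        candidate = min(population, key=F)
        if F(candidate) < F(best):
            best = candidate
    return best, F(best)

# Hypothetical toy problem: minimize the number of ones in a binary vector.
def uniform_crossover(x, y, rng):
    return tuple(xi if rng.random() < 0.5 else yi for xi, yi in zip(x, y))

rng = random.Random(0)
initial = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(6)]
best, value = genetic_algorithm(sum, initial, uniform_crossover, M=3,
                                generations=20, rng=rng)
print(best, value)
```

The skeleton makes the weakness mentioned above visible: nothing in it exploits the structure of the problem, apart from the bias of the distribution Prob.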
3.8.4 Intensification and local search
In the simple GA outlined above, the average quality (or fitness) of a population is driven up by a single factor in the course of iterations, namely the random bias introduced in the selection of parents and in the survival of the fittest step. However, by itself, this bias is generally insufficient to significantly improve a bad initial population.
Moreover, in spite of everything we said earlier, solutions (children) arising from a crossover operation are frequently quite different from their parents and may turn out to be much worse.
These observations lead to an improvement of the GA scheme which is conceptually simple, but very powerful in practice: it consists in introducing a local search (intensification) phase within the diversification strategy of GA. This is simply done, for instance, by adding the following step right after the crossover step. (Some authors speak of memetic algorithms when this step is introduced in the basic GA scheme.)
Local improvement: For $j = 1, 2, \dots, M$, let $z_j^*$ be the best solution produced by a local search algorithm (either greedy, or steepest descent, or SA, ...) starting from $z_j$ as initial solution. Replace $z_j$ by $z_j^*$ in $Z^{(k)}$.
In picturesque terms, we could say that children must be raised before they can be incorporated in the population. More abstractly, with the above modification, we can view GA as performing a succession of multistart rounds, where each round is initialized from members of the current population.
Whatever the interpretation, interlacing the basic GA scheme with some form of local search seems to be a sine qua non condition for the efficiency of the procedure. Let us illustrate this on some examples.
Example: Knapsack problem. Consider the knapsack problem
$$\max\ cx \quad \text{subject to } ax \le b \text{ and } x \in \{0, 1\}^n$$
and the particular instance:
$$\max\ 2x_1 + 3x_2 + 5x_3 + x_4 + 4x_5$$
$$\text{subject to } 5x_1 + 4x_2 + 4x_3 + 3x_4 + 7x_5 \le 14$$
$$x_i \in \{0, 1\} \text{ for } i = 1, \dots, 5.$$
We use the following crossover operator: if the parents are $x$ and $y$, then the child $z$ has $z_i = 1$ when $x_i = y_i = 1$, and $z_i = 0$ otherwise ($i = 1, 2, \dots, n$). (The child inherits an object only if both its parents own it.) So we obtain for instance:
x = 11010, value = 6
y = 10001, value = 6
z = 10000, value = 2
Note that this crossover, even though it ensures feasibility of the children, will systematically produce children of lower quality than their parents.
Assume now that we apply a variant of the classical greedy algorithm during the improvement phase: first, we sort the indices $(1, 2, \dots, n)$ by nonincreasing ratios $c_j / a_j$ into a priority list $L$. Then, without changing the components of $z$ that are already equal to 1, we run through $L$ and fix the next variable to 1 as long as the knapsack constraint is not violated.
In our example, this procedure yields the priority list $L = (3, 2, 5, 1, 4)$, and successively produces the solutions: $z = 10000 \to 10100 \to 11100 = z^*$.
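The knapsack example can be replayed with the following Python sketch (illustrative only; the function names are ours). One point is an assumption: when an object does not fit, the sketch skips it and continues with the next object of the list, which is one possible reading of the greedy rule. On this instance, stopping at the first object that does not fit gives the same result.

```python
c = [2, 3, 5, 1, 4]    # values
a = [5, 4, 4, 3, 7]    # weights
b = 14                 # capacity

def and_crossover(x, y):
    """The child keeps an object only if both parents own it."""
    return [xi & yi for xi, yi in zip(x, y)]

def greedy_improvement(z):
    """Scan the objects by nonincreasing ratio c_j / a_j and add those that still fit."""
    order = sorted(range(len(c)), key=lambda j: c[j] / a[j], reverse=True)
    z = list(z)
    load = sum(aj * zj for aj, zj in zip(a, z))
    for j in order:
        if z[j] == 0 and load + a[j] <= b:
            z[j] = 1
            load += a[j]
    return z, [j + 1 for j in order]   # 1-based priority list, as in the text

z = and_crossover([1, 1, 0, 1, 0], [1, 0, 0, 0, 1])
z_star, L = greedy_improvement(z)
value = sum(cj * zj for cj, zj in zip(c, z_star))
print(z, L, z_star, value)   # [1, 0, 0, 0, 0] [3, 2, 5, 1, 4] [1, 1, 1, 0, 0] 10
```

After the improvement step, the child has value 10, which is better than both parents (value 6 each).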
Suppose that $T$ and $T'$ are two distinct solutions of the traveling salesman problem, viewed as sets of edges. A child of $T$ and $T'$ can be produced by keeping all edges that occur in both parent solutions, and by using a greedy procedure to complete the resulting partial solution $T \cap T'$.
Merz and Freisleben (2001) propose more specifically to apply the DPX crossover operator shown in Figure 3.21 (we skip some details). They show that variants of this crossover operator, when combined with effective local improvement steps, provide excellent solutions for the TSP.
DPX Crossover:
1. compute $C = T \cap T'$; let $P_1, P_2, \dots, P_k$ be the subpaths that make up $C$, and let $u_j$, $v_j$ be the endpoints of subpath $P_j$ for $j = 1, 2, \dots, k$;
2. while $C$ is not a tour, repeat
   if $C$ is a path containing all vertices, then add the missing edge that closes the tour; else,
      choose randomly one of the endpoints $u_j$;
      choose the closest vertex $w$ to $u_j$ among all vertices $w \in \{u_1, v_1, u_2, v_2, \dots, u_k, v_k\}$, $w \notin \{u_j, v_j\}$, such that the edge $(u_j, w)$ is not included in $T \cup T'$;
      add the edge $(u_j, w)$ to $C$;
3. return $C$;
Figure 3.21: DPX crossover for the TSP
Such adaptations of the basic genetic algorithm make it possible to enrich it with heuristics that have been specifically developed for the problem at hand. Indeed, whereas the special features of a problem are usually included quite naturally in a steepest descent or in a simulated annealing algorithm (via the neighborhood structure), this is not immediately true in the basic GA formulation displayed in Figure 3.20.
A similar objective can sometimes be attained through a judicious encoding of the solutions. The next section deals with this issue.
3.8.5 Encodings
The crossover operator and the encoding of solutions are not independent of each other (just like the choice of the neighborhood structure can be influenced by the choice of the encoding; remember our discussion of 2-exchanges in Section 3.4.1). For instance, consider a traveling salesman instance and suppose that the following choices are made:
• Encoding of feasible tours: permutation of cities, e.g.
x = (5, 2, 1, 3, 6, 4, 7),
y = (3, 1, 4, 2, 5, 6, 7).
• Crossover operator: cut $x$ and $y$ at coordinate $i$ (randomly selected) and swap their initial segments. With $i = 3$, this yields
$z_1$ = (5, 2, 1, 2, 5, 6, 7),
$z_2$ = (3, 1, 4, 3, 6, 4, 7).
As clearly evidenced by this example, this choice of the crossover operator usually produces infeasible solutions (that is, it does not produce permutations of the set of cities).
Similar difficulties arise whenever we develop a genetic algorithm for a particular problem. In general terms, they can be described as follows: the solutions of the problem are represented by strings of characters (strings of integers for the TSP, binary strings for the knapsack problem, etc.), but not all strings correspond to feasible solutions (in order to be feasible, the string must be a permutation of $\{1, 2, \dots, n\}$, or must satisfy the constraint $\sum_{j=1}^{n} a_j x_j \le b$, etc.). The trouble is of course that, when crossing feasible strings, some infeasible strings can be produced.
Several strategies are available in order to remedy this difficulty. Here are a few of them.
Strategy 1: Define the encoding and/or the crossover operator so that they only produce feasible strings.
This is what we have done, for example, when we introduced two possible crossovers for the knapsack problem, or the DPX crossover for the TSP, in Section 3.8.4.
Let us consider another application of this strategy to the traveling salesman problem (see Hertz (1997)). We assume as before that a feasible tour is represented as a permutation of the set of cities $\{1, 2, \dots, n\}$. Then, given two parents $x$ and $y$, produce a single child $z$ as follows:
• Choose a subset $B$ of cities, $B \subseteq \{1, 2, \dots, n\}$.
• In the permutation $z$, place the cities of $B$ at the same rank that they occupy in $x$, and place the other cities in the remaining ranks, in the same order as in $y$.
For example, with
x = (5, 2, 1, 3, 6, 4, 7),
y = (3, 1, 4, 2, 5, 6, 7),
B = {1, 3, 7},
we successively obtain
z = (·, ·, 1, 3, ·, ·, 7) (as in x)
z = (4, 2, 1, 3, 5, 6, 7) (as in y).
Clearly, $z$ is a permutation of the cities (that is, a feasible tour) and resembles both its parents to some extent. (It is not clear, though, that this crossover operator can be extremely effective.)
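The contrast between the two crossover operators for permutations can be verified with a short Python sketch (illustrative; the function names are ours). The first operator swaps initial segments and may return strings that are not permutations; the second one implements the rule based on the subset $B$ and always returns a permutation.

```python
def cut_crossover(x, y, i):
    """Swap the initial segments of length i; the results need not be permutations."""
    return x[:i] + y[i:], y[:i] + x[i:]

def position_order_crossover(x, y, B):
    """Cities of B keep their rank from x; the other cities follow the order of y."""
    rest = iter(city for city in y if city not in B)
    return tuple(city if city in B else next(rest) for city in x)

def is_permutation(z):
    return sorted(z) == list(range(1, len(z) + 1))

x = (5, 2, 1, 3, 6, 4, 7)
y = (3, 1, 4, 2, 5, 6, 7)

z1, z2 = cut_crossover(x, y, 3)
print(z1, is_permutation(z1))   # (5, 2, 1, 2, 5, 6, 7) False
print(z2, is_permutation(z2))   # (3, 1, 4, 3, 6, 4, 7) False

z = position_order_crossover(x, y, {1, 3, 7})
print(z, is_permutation(z))     # (4, 2, 1, 3, 5, 6, 7) True
```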
Let us conclude with another application of the same principle, this time to the graph coloring problem.
In order to represent feasible colorings, we encode each color as a subset of vertices. For instance, the
coloring of the graph in Figure 3.22 is encoded as the list of subsets:
x : {1, 3}{2, 5}{4, 6}{7}.
Figure 3.22: Coloring of a graph (vertex 1: C1, vertex 2: C2, vertex 3: C1, vertex 4: C3, vertex 5: C2, vertex 6: C3, vertex 7: C4)
If $x$ and $y$ represent two legal colorings, then a crossover can be performed as follows:
• Crossover: choose randomly a subset $C$ in $x$, add it to $y$ and delete all the elements of $C$ from the other subsets of $y$.
In other words, all vertices of $C$ receive in $z$ the same color as in $x$, and the other vertices receive the same color as in $y$.
For instance, if
x : {1, 3}{2, 5}{4, 6}{7},
y : {1, 3, 7}{2, 6}{4, 5},
and we randomly choose $C = \{4, 6\}$, then the crossover produces
z = {1, 3, 7}{2}{5}{4, 6}.
This crossover only produces feasible colorings from feasible colorings. It also has the advantage of preserving large parts of the existing color classes (which are potentially good). As usual, however, the child $z$ can often be improved by simple heuristics. In our example, we could for instance merge {2} and {5} into a single class.
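This crossover on color classes takes only a few lines of Python. The sketch below is an illustration (the function name and the representation of a coloring as a list of sets are our choices); it reproduces the child $z$ of the example.

```python
def class_crossover(y, C):
    """Insert the color class C (taken from the other parent x) into the coloring y,
    after removing the vertices of C from the classes of y. Empty classes are dropped."""
    C = frozenset(C)
    trimmed = [frozenset(S) - C for S in y]
    return [S for S in trimmed if S] + [C]

x = [{1, 3}, {2, 5}, {4, 6}, {7}]
y = [{1, 3, 7}, {2, 6}, {4, 5}]

z = class_crossover(y, x[2])          # C = {4, 6}
print([sorted(S) for S in z])         # [[1, 3, 7], [2], [5], [4, 6]]
```

Each class of $z$ is a subset of a class of $x$ or of $y$, which is why feasibility is preserved.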
Strategy 2 - Variant: the strings must be decoded in order to yield a feasible solution.
It is sometimes a good idea to step back from the original definition of a feasible solution, and to encode solutions by intermediate data structures which must be appropriately decoded in order to yield interpretable results. Let us illustrate this generic idea in a special case.
We can encode a feasible solution of the graph coloring problem as an arbitrary permutation of the vertex set $\{1, 2, \dots, n\}$. Given a permutation $z$, the decoding phase works as follows: explore the vertices of $G$ in the order prescribed by $z$ and give to each vertex the first available color among colors $C_1, C_2, \dots$ (this is nothing but a list-processing heuristic; see Section 3.3). See Figure 3.23 for an example with the list $z = (4, 2, 1, 3, 5, 6, 7)$.
Figure 3.23: Coloring based on the list z = (4, 2, 1, 3, 5, 6, 7) (vertex 1: C1, vertex 2: C2, vertex 3: C3, vertex 4: C1, vertex 5: C2, vertex 6: C3, vertex 7: C2)
Now, since every permutation induces a feasible coloring, we can apply in this framework any crossover operator developed for permutations. See Hertz (1997) for more details on this topic.
More generally, the same strategy can be applied to any CO problem for which a list-processing heuristic is available: then, the list $L$ processed in the heuristic (see Figure 3.3) is simply viewed as providing an indirect encoding of the feasible solution itself.
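The decoding phase of Strategy 2 for graph coloring can be sketched as follows in Python. This is an illustration only: the function name is ours, colors are numbered 1, 2, ... instead of $C_1, C_2, \dots$, and the graph is a hypothetical cycle on 5 vertices (not the graph of Figure 3.23, whose edges are not listed in the text).

```python
def decode(order, adjacency):
    """First-fit decoding: each vertex, taken in the given order, receives the
    smallest color (1, 2, ...) not used by its already colored neighbors."""
    color = {}
    for v in order:
        used = {color[w] for w in adjacency[v] if w in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# Hypothetical graph: a cycle 1 - 2 - 3 - 4 - 5 - 1.
adjacency = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}

coloring = decode([4, 2, 1, 3, 5], adjacency)
print(coloring)   # {4: 1, 2: 1, 1: 2, 3: 2, 5: 3}
```

Whatever the permutation, the decoded coloring is feasible by construction, since a vertex never receives the color of an already colored neighbor.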
Strategy 3: Accept all the strings but penalize infeasibilities in the objective function.
We have already encountered a similar idea in other local search approaches (see e.g. Sections 3.4.1 and 3.5.6). Let us illustrate it again on the graph coloring problem. Suppose that the vertices of the graph are numbered from 1 to $n$, and let us encode a coloring as a sequence of letters R, G, B, Y, .... For instance, the sequence (B, G, G, R, B, ...) means that vertices 1 and 5 are Blue, vertices 2 and 3 are Green, vertex 4 is Red, etc. Of course, not every sequence corresponds to a feasible coloring, and it is therefore necessary to penalize infeasibilities (just as in Section 3.5.6). We thus define the objective function as
$$\min\ (\text{number of colors} + M \times \text{number of monochromatic edges}),$$
where $M$ denotes here a large positive penalty.
With these definitions, we are free to use a very simple crossover operator, for instance:
• Crossover: swap initial segments of $x$ and $y$ (that is, recolor some vertices of $x$ as in $y$).
As a side remark, let us observe that this particular method can easily lead to poor crossovers if it is applied blindly. For instance, the parents
x = (R, R, G, G)
y = (G, G, R, R)
could produce the child
z = (R, R, R, R).
This phenomenon is linked to the fact that, in the graph coloring problem, the name (R, G, B, Y, ...) of each color does not have any special meaning. In any particular solution, we could recolor all Green vertices with a new, previously unused color (say, Yellow) without any real effect on the structure of this solution. As a matter of fact, the only meaningful concept is the partition of the vertex set into color classes (as described in our discussion of Strategy 1). From this point of view, the strings x = (R, R, G, G) and y = (G, G, R, R) describe exactly the same solution, and crossing them does not make very much sense.
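The penalized objective function and the remark on color names can both be illustrated with a small Python sketch. Everything specific in it is an assumption made for the example: the function names, the value $M = 10$, the graph (a hypothetical graph on 4 vertices in which vertices 1 and 2 are both adjacent to vertices 3 and 4), and the use of the letters R and G for the two colors of the parents.

```python
def penalized_objective(coloring, edges, M):
    """Number of colors used plus M times the number of monochromatic edges."""
    conflicts = sum(1 for u, v in edges if coloring[u - 1] == coloring[v - 1])
    return len(set(coloring)) + M * conflicts

def color_classes(coloring):
    """Partition of the vertices into color classes, independent of the color names."""
    classes = {}
    for vertex, color in enumerate(coloring, start=1):
        classes.setdefault(color, set()).add(vertex)
    return {frozenset(S) for S in classes.values()}

edges = [(1, 3), (1, 4), (2, 3), (2, 4)]   # hypothetical graph
x = ("R", "R", "G", "G")
y = ("G", "G", "R", "R")
z = ("R", "R", "R", "R")                   # child obtained by swapping initial segments

print(penalized_objective(x, edges, M=10))     # 2
print(penalized_objective(z, edges, M=10))     # 41
print(color_classes(x) == color_classes(y))    # True
```

The two parents define the same partition into color classes and have the same objective value, yet their child is heavily penalized.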
3.8.6 Implementing a genetic algorithm
Here is some useful advice to keep in mind when developing a GA.
• Think carefully about an encoding scheme and a crossover operator that tend to transmit the qualities of the parents to their children.
• Incorporate local optimization procedures (greedy, steepest descent, ...) in the basic GA scheme.
• Major parameters to watch are the size of each population ($N$), the number of crossovers ($M$), the evolution of the best available value $F^*$ [...]
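$$\min \sum_{j=1}^{n} c_j x_j \qquad (4.1)$$
$$\text{subject to } Ax \le b \text{ and } x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n \qquad (4.2)$$
where $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times n}$ are the parameters (or numerical data) of the problem and $x = (x_1, x_2, \dots, x_n)$ is a vector of (unknown) decision variables. A vector $x \in \mathbb{R}^n$ is feasible if it satisfies all the constraints of the problem. Solving the LP problem consists in
• either finding values $x^* = (x_1^*, x_2^*, \dots, x_n^*)$ such that $x^*$ is feasible and $cx^* \le cx$ for every feasible vector $x$,
• or proving that the problem is unbounded (that is, for every feasible vector $x^*$, there exists a feasible vector $x$ such that $cx < cx^*$),
• or proving that there exists no feasible vector.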
In this chapter, we are not so much concerned with algorithms to solve the LP problem as with software tools that make it possible to handle such problems in an efficient and user-friendly way. In order to understand why this is a real, practical issue, let us briefly sketch the evolution of LP software over the last 50 years.
The initial algorithms for the solution of LP models were based on the simplex method proposed by George Dantzig around 1947. A description of this method can be found in many textbooks (e.g. Nemhauser and Wolsey (1988)) and we do not intend to repeat it here. The first computer codes for the simplex method were written in the fifties (simultaneously with the development of the first computers) and allowed the solution of (relatively) large-scale industrial models. Very soon, it appeared that the basic textbook description of the simplex method did not lend itself to efficient implementations, and more efficient versions of the method were proposed by several researchers.
Two key features of real-world LP problems must be taken into account in every efficient implementation:
(i) the size of the problems is often very large; nowadays, LP solvers must be able to deal routinely with problems involving tens of thousands of variables and thousands of constraints;
(ii) the constraint matrix $A$ is often sparse, meaning that most of its coefficients are zero.
To understand how sparsity arises, consider for instance the following transportation model: A firm delivers goods from $K$ warehouses to $P$ customers. Warehouse $i$ has $\mathit{stock}_i$ units of inventory and customer $j$ asks for $\mathit{demand}_j$ units. The cost of shipping one unit of good from warehouse $i$ to customer $j$ is denoted by $\mathit{cost}_{ij}$; the costs are assumed to be additive. The problem faced by the logistics manager of the firm is to determine the transportation plan which fulfills the demand of all customers at minimum cost. If we denote by $\mathit{transp}_{ij}$ the number of units to be shipped from warehouse $i$ to customer $j$ (these quantities are the decision variables which define the transportation plan), then the problem can be formulated as:
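$$\min\ \mathit{TotalCost} = \sum_{i=1}^{K} \sum_{j=1}^{P} \mathit{cost}_{ij}\, \mathit{transp}_{ij} \qquad (4.3)$$
$$\text{subject to } \sum_{j=1}^{P} \mathit{transp}_{ij} \le \mathit{stock}_i \quad (i = 1, \dots, K) \qquad (4.4)$$
$$\sum_{i=1}^{K} \mathit{transp}_{ij} = \mathit{demand}_j \quad (j = 1, \dots, P) \qquad (4.5)$$
$$\mathit{transp}_{ij} \ge 0 \quad (i = 1, \dots, K;\ j = 1, \dots, P). \qquad (4.6)$$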
Disregarding the nonnegativity restrictions, the constraint matrix has $K + P$ rows and $KP$ columns, for a total number of $(K + P)KP$ elements. However, only $2KP$ of these elements are nonzero, since each variable $\mathit{transp}_{ij}$ appears in exactly one constraint (4.4) and in exactly one constraint (4.5). If we define the density of an $m \times n$ matrix as
$$\text{density} = 100 \times \frac{\text{number of nonzero elements}}{\text{total number of elements}},$$
then the density of the constraint matrix for the transportation problem is equal to $100 \times 2KP / ((K + P)KP) = 200/(K + P)$. If $K + P = 100$, then the density is 2%. If $K + P = 1000$, then the density is 0.2%.
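The density computation can be checked with a small Python sketch that builds the constraint matrix of (4.4)-(4.5) in sparse form, storing only the nonzero coefficients. This is an illustration: the row-by-row dictionary representation, the column numbering and the function names are choices made for the example.

```python
def transportation_rows(K, P):
    """Sparse constraint matrix of (4.4)-(4.5): one dict {column: coefficient} per row.
    Column i * P + j stands for the variable transp_ij (with 0-based i and j)."""
    stock_rows = [{i * P + j: 1 for j in range(P)} for i in range(K)]
    demand_rows = [{i * P + j: 1 for i in range(K)} for j in range(P)]
    return stock_rows + demand_rows

def density(rows, n_columns):
    """Percentage of nonzero entries in a matrix stored row by row."""
    nonzeros = sum(len(row) for row in rows)
    return 100 * nonzeros / (len(rows) * n_columns)

K, P = 40, 60
rows = transportation_rows(K, P)
print(len(rows), density(rows, K * P))   # 100 2.0
```

With $K + P = 100$, the sparse representation stores 4800 coefficients instead of the 240000 entries of the full array.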
This example illustrates our previous claim (ii) regarding the density of LP constraint matrices. In fact, for larger problems, it is quite usual to observe even lower densities than those indicated here.
Taken together, (i) and (ii) raise difficulties related to the use of computer memory, as well as to the format of input data. Indeed, we would like, ideally, to input and to store only nonzero entries. (We would also like to preserve the sparsity of the constraint matrix throughout the iterations of the optimization algorithm, but we shall not discuss this issue here.) This rules out, in particular, explicitly writing all elements of the constraint matrix $A$ in a gigantic array of dimension $m \times n$. In fact, model readability and user-friendliness would also be significantly enhanced if we were able to give directly as input to a computer software, not the LP model in its generic form (4.1)-(4.2), but rather a more expressive formulation, similar to the expression (4.3)-(4.6) for the transportation model.
These requirements have led to the development, first, of matrix generator programs that automatically generate the constraint matrix on the basis of a succinct description. Then, in the seventies and eighties, several algebraic modeling languages were designed: these languages make it possible to describe a complete mathematical programming model (variables, constraints, parameters, objective function, etc.) in a form which is close to the algebraic formulation familiar to the modeller.
As a last noteworthy development, the reader should realize that, since the early eighties, remarkable advances in our theoretical understanding of LP problems, of data structures, of numerical analysis, and in sheer computer speed, have led to formidable improvements in the efficiency of LP software. These improvements are largely responsible for the widespread popularity of optimization-based models in industry (as witnessed, for instance, by the advent of so-called Advanced Planning and Scheduling software for production planning). Simultaneously, significant progress has also been recorded for the solution of other types of mathematical programming problems: 0-1 LP, mixed integer linear problems (MILP) involving both integer and continuous variables, quadratic problems, general nonlinear problems, etc.
As a result of these considerations, commercial optimization software products are no longer limited to the implementation of a model solver (like the simplex LP method), but are rather designed as integrated systems for formulating, managing, solving, and reporting on the solution of large-scale models. Thus, they typically consist of
• an algebraic modeling language;
• a model generator;
• one or several solvers;
• a report generator that extracts useful information from the output of the model.
Among the best known mathematical programming solvers, let us mention:
CPLEX: for LP, MILP, network models, quadratic programming;
XPRESS: for LP, MILP, network models, quadratic programming, nonlinear programming, etc.;
XA: for LP, MILP and quadratic programming;
OSL (IBM Optimization Solutions and Library): for LP, MILP, network models, quadratic programming;
LINDO: for linear, integer, and nonlinear programming;
CONOPT: for nonlinear problems;
MINOS: for linear, quadratic and nonlinear programming;
SAS, etc.
Some popular modeling languages are available as components of broadly integrated systems:
GAMS (General Algebraic Modeling System): a pioneering language developed at the World Bank around 1980;
AMPL (A Mathematical Programming Language): developed at Bell Labs (1993);
AIMMS (Advanced Integrated Multidimensional Modeling Software, 1993);
OPL Studio;
LINGO;
EXCEL: provides a primitive, non-algebraic modeling language which makes it possible to solve relatively simple models within the spreadsheet environment;
etc.
The algebraic modeling languages mentioned above have much in common. In this course, we will discover some of the basic features of AIMMS, as an example of a typical integrated software system.
Chapter 5
Integer programming
5.1 Introduction
Integrality requirements are sometimes imposed on the variables of a Linear Programming (LP) model: we then face an Integer Linear Programming (ILP) model of the form
$$\max \sum_{j=1}^{n} c_j x_j \qquad (5.1)$$
$$\text{s.t. } \sum_{j=1}^{n} a_{ij} x_j \le b_i \quad \text{for } i = 1, 2, \dots, m, \qquad (5.2)$$
$$x_j \text{ integer} \quad \text{for } j = 1, 2, \dots, n. \qquad (5.3)$$
If only a subset of the variables are required to be integer, while the other ones can assume fractional values, then we speak of Mixed Integer Linear Programming (MILP) models.
The integrality requirement (5.3) truly constitutes an additional (nonlinear) constraint of the model. In a typical mathematical programming language, for instance, it would be translated as the declaration:
var X integer;
Similarly, declaring var X binary; is equivalent to the constraint $X \in \{0, 1\}$, that is, $0 \le X \le 1$ and $X$ integer. When all the variables of an ILP model are required to take 0-1 values, we speak of a 0-1 LP model.
The crucial difference between an LP model and an ILP model is that the feasible region of the ILP is not convex and is not even connected (or continuous), so that purely algebraic methods like the simplex algorithm are no longer sufficient to solve this type of problem: it usually becomes necessary to resort to combinatorial or enumerative methods. Actually, ILP and MILP problems constitute a (very large) subclass of combinatorial optimization problems comprising many difficult (that is, NP-hard) problems like the knapsack problem, the traveling salesman problem, the graph coloring problem, and so forth.
Example 5.1. Figure 5.1 represents the constraints of the simple example:
$$\text{maximize } x_1 + x_2 \qquad (5.4)$$
$$\text{s.t. } x_1 + 2x_2 \le 2 \qquad (5.5)$$
$$2x_1 + x_2 \le 2 \qquad (5.6)$$
$$x_1, x_2 \ge 0 \qquad (5.7)$$
$$x_1, x_2 \text{ integer.} \qquad (5.8)$$
The feasible points are (0,0), (1,0) and (0,1). In particular, the optimal solutions of this model are the
points (1,0) and (0,1). Note however that the optimal solution of the relaxed linear programming
problem obtained by omitting the integrality constraint (5.8) is the point $(\frac{2}{3}, \frac{2}{3})$, which is not integral.
[Figure 5.1: A simple ILP model. The plot marks the points (0,1), (1,0) and $(\frac{2}{3}, \frac{2}{3})$.]
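The claims of Example 5.1 can be checked by direct enumeration. The following Python sketch is only an illustration: the grid bound 0..2 and the names `feasible`, `integer_points` and `lp_point` are assumptions made here, and the fractional point is taken from the text rather than computed by an LP solver.

```python
from fractions import Fraction
from itertools import product

def feasible(x1, x2):
    """Constraints (5.5)-(5.7) of Example 5.1."""
    return x1 + 2 * x2 <= 2 and 2 * x1 + x2 <= 2 and x1 >= 0 and x2 >= 0

# Constraints (5.5)-(5.7) force each variable to be at most 2,
# so it is enough to scan the integer grid 0..2 in each coordinate.
integer_points = [p for p in product(range(3), repeat=2) if feasible(*p)]
best_value = max(x1 + x2 for x1, x2 in integer_points)
optimal_points = [p for p in integer_points if sum(p) == best_value]

# Point quoted in the text for the linear relaxation: (5.5) and (5.6) are both tight there.
lp_point = (Fraction(2, 3), Fraction(2, 3))
lp_value = sum(lp_point)

print(integer_points, optimal_points, lp_value)
```

The relaxed value 4/3 is strictly larger than the integer optimum 1, which is the gap that the branch-and-bound method has to close.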
5.1.1 Integer programming models.
1) We have already presented the knapsack problem in Section 2.1.4. This is one of the simplest 0-1 LP
models since it involves a single constraint. But even so, we have mentioned in Proposition 2.2.1 that it
is NP-hard. Hence, a fortiori, general 0-1 LP, ILP and MILP are NP-hard as well.
2) The stable set problem was introduced in Section 3.3.1. Let $G = (V, E)$ be a graph, with
$V = \{1, 2, \ldots, n\}$. A subset of vertices $S \subseteq V$ is called stable in G if it contains no edges, that is, if the
following condition holds: for all $u, v \in S$, $\{u, v\} \notin E$. The maximum stable set problem consists in
finding a stable set of maximum cardinality in G. In order to formulate this problem as a 0-1 LP, we
define n binary variables $x_1, x_2, \ldots, x_n$, with the interpretation that $x_i = 1$ if and only if $i \in S$. Then
the maximum stable set problem can be expressed as:
$$\max \sum_{j=1}^{n} x_j$$
$$\text{s.t. } x_i + x_j \le 1 \quad \text{for all edges } \{i, j\} \in E,$$
$$x_j \in \{0, 1\} \quad \text{for } j = 1, 2, \ldots, n.$$
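To see the edge constraints at work, the 0-1 model can be solved by brute force on a tiny graph. This is a minimal sketch, not an efficient method: the function name `max_stable_set` and the 5-cycle instance are hypothetical, and the enumeration is exponential in n.

```python
from itertools import product

def max_stable_set(n, edges):
    """Enumerate all 0-1 vectors x and keep a best one satisfying
    x_i + x_j <= 1 for every edge {i, j} (vertices are numbered 1..n)."""
    best = None
    for x in product((0, 1), repeat=n):
        if all(x[i - 1] + x[j - 1] <= 1 for i, j in edges):
            if best is None or sum(x) > sum(best):
                best = x
    return best

# Hypothetical instance: a cycle on 5 vertices.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
x = max_stable_set(5, edges)
stable_set = [i + 1 for i, xi in enumerate(x) if xi == 1]
print(x, stable_set)
```

On a 5-cycle no three vertices are pairwise non-adjacent, so the best vector selects two vertices.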
3) A firm is planning to redesign its distribution network by reconsidering the number and the location of
the warehouses that it operates. The objective is to distribute its products more efficiently to a (known)
set of P customers. A set of K warehouses can be rented on a yearly basis. As in the transportation
problem (see Chapter 4), customer j asks for $\mathit{demand}_j$ units of goods (say, on a yearly basis), the unit
shipping cost from warehouse i to customer j is denoted by $\mathit{cost}_{ij}$, and the costs are assumed to be additive
(for simplicity, we disregard the capacity constraint of the warehouses). Moreover, the fixed rental cost
for warehouse i is $\mathit{rent}_i$ per year. The problem faced by the firm is to decide which warehouses to open
and to determine the transportation plan which fulfills the demand of all customers at total minimum
cost.
Let us denote by $\mathit{transp}_{ij}$ the number of units to be shipped from warehouse i to customer j, and
let us define binary variables $y_i$, $i = 1, 2, \ldots, K$, where $y_i = 1$ if the firm rents warehouse i and $y_i = 0$
otherwise. The facility location model can be formulated as:
$$\min \mathit{TotalCost} = \sum_{i=1}^{K} \mathit{rent}_i \, y_i + \sum_{i=1}^{K} \sum_{j=1}^{P} \mathit{cost}_{ij} \, \mathit{transp}_{ij} \qquad (5.9)$$
$$\text{subject to } \mathit{transp}_{ij} \le M y_i \quad (i = 1, \ldots, K; \ j = 1, \ldots, P) \qquad (5.10)$$
$$\sum_{i=1}^{K} \mathit{transp}_{ij} = \mathit{demand}_j \quad (j = 1, \ldots, P) \qquad (5.11)$$
$$\mathit{transp}_{ij} \ge 0 \quad (i = 1, \ldots, K; \ j = 1, \ldots, P) \qquad (5.12)$$
$$y_i \in \{0, 1\} \quad (i = 1, \ldots, K). \qquad (5.13)$$
The constraint (5.10) expresses the logical implication that, if $\mathit{transp}_{ij} > 0$ (i.e., some units of goods must
be shipped from i to j), then warehouse i must be rented and hence $y_i = 1$. The formulation is correct
as long as M is large enough (that is, at least as large as $\mathit{transp}_{ij}$ in all feasible solutions). For instance, in
the constraint (5.10) associated with customer j, one can choose $M = \mathit{demand}_j$.
An alternative formulation is obtained upon replacement of (5.10) by the aggregated constraints:
$$\sum_{j=1}^{P} \mathit{transp}_{ij} \le \Big( \sum_{j=1}^{P} \mathit{demand}_j \Big) \, y_i \quad (i = 1, \ldots, K). \qquad (5.14)$$
Constraints (5.10) or (5.14) are called fixed charge constraints: they express the fact that if the firm
wants to be able to use warehouse i, it must incur a fixed charge ($\mathit{rent}_i$), independently of the level of
utilization of the warehouse.
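The role of the binary variables can be illustrated by enumerating all choices of open warehouses on a toy instance. The sketch below relies on the absence of capacities: once the open warehouses are fixed, each customer is served by its cheapest open warehouse. The function name and all numerical data are hypothetical, and demands are assumed to be positive.

```python
from itertools import product

def solve_facility_location(rent, cost, demand):
    """Enumerate the 0-1 vectors y. For a fixed y, warehouses are uncapacitated,
    so each customer is best served entirely by its cheapest open warehouse."""
    K, P = len(rent), len(demand)
    best_cost, best_y = None, None
    for y in product((0, 1), repeat=K):
        if sum(y) == 0:
            continue  # no open warehouse: (5.10) forces transp = 0, so (5.11) fails
        total = sum(rent[i] * y[i] for i in range(K))
        for j in range(P):
            total += demand[j] * min(cost[i][j] for i in range(K) if y[i])
        if best_cost is None or total < best_cost:
            best_cost, best_y = total, y
    return best_cost, best_y

# Hypothetical data: 3 candidate warehouses, 4 customers.
rent = [100, 120, 90]
cost = [[2, 3, 6, 7],
        [5, 2, 2, 4],
        [8, 6, 3, 1]]
demand = [10, 20, 15, 5]
total_cost, y = solve_facility_location(rent, cost, demand)
print(total_cost, y)
```

With these numbers, renting only the second warehouse is cheapest: opening more warehouses lowers the shipping costs but not enough to pay for the additional fixed charges.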
4) An airline company proposes some flights on a daily basis (say, Brussels-Paris, Brussels-London,
Paris-Berlin,...). Let $F = \{1, 2, \ldots, m\}$ be this set of flights. Some combinations of flights form feasible
routes (also called pairings), in the sense that they can be covered by a single crew in one day.
Let $R = \{1, 2, \ldots, n\}$ be this set of routes. For instance, we might have Route 1 = {Brussels-Paris,
Paris-London, London-Brussels}, Route 2 = {Brussels-Paris, Paris-Berlin}, and so on.
The objective is to cover all flights while minimizing the total number of routes used (or the cost
of the associated crews). Let $R_f$ be the set of routes which cover flight $f = 1, 2, \ldots, m$, so that
$R = \{1, 2, \ldots, n\} = R_1 \cup \ldots \cup R_m$. We define the decision variables as $x_r = 1$ if route r is selected and $x_r = 0$
otherwise, for $r = 1, 2, \ldots, n$. This leads to the following model:
$$\min \sum_{r=1}^{n} x_r$$
$$\text{s.t. } \sum_{r \in R_f} x_r \ge 1 \quad \text{for } f = 1, 2, \ldots, m,$$
$$x_r \in \{0, 1\} \quad \text{for } r = 1, 2, \ldots, n.$$
More generally, the set covering problem is the 0-1 LP
$$\min \sum_{j=1}^{n} c_j x_j$$
$$\text{s.t. } \sum_{j=1}^{n} a_{ij} x_j \ge 1 \quad \text{for } i = 1, 2, \ldots, m,$$
$$x_j \in \{0, 1\} \quad \text{for } j = 1, 2, \ldots, n,$$
where all constraint coefficients $a_{ij}$ take value either 0 or 1. Models of this nature are routinely solved by
large airline companies. Note that the set of feasible routes is potentially huge, and that the optimization
models must be solved efficiently in order, for instance, to recover easily from unexpected disturbances
(flight delays, cancellations, storms, etc.). Hence the companies must rely on highly sophisticated
algorithms and data handling techniques; more information on this and related models can be found for
instance in Yu, Arguello, Song, McCowan, and White (2003).
5.1.2 Exercises.
Exercise 1. Give a 0-1 LP formulation for the problem "Stacking boxes at Gizeh Inc." described in
Section 7.2.
Exercise 2. A basic planning problem for flexible manufacturing systems (FMS) is the part selection
problem (see e.g. Crama, Oerlemans, and Spieksma (1996)). A part set containing n parts must be
processed, one part at a time, on a single flexible machine. The machine can use different tools, numbered
from 1 to m. Each part requires a specific subset of tools which have to be loaded in the tool magazine
of the machine before the part can be processed: say part i requires $T(i) \subseteq \{1, 2, \ldots, m\}$. The magazine
features C tool slots. When loaded on the machine, tool j occupies $s_j$ slots in the magazine
($j = 1, 2, \ldots, m$). The total number of tools required to process all parts can be much larger than C, so that
it is sometimes necessary to change tools in order to process the complete part set. The part selection
problem consists in determining the largest subset of parts that can be produced without tool changes.
Formulate this problem as a 0-1 LP problem.
Exercise 3. A standard trick to linearize a nonlinear expression of the form xv, where x is a 0-1 variable
and v is a real variable subject to the range constraint $0 \le v \le M$, consists in replacing the product xv
by a new real variable z subject to the following constraints:
$$z \le M x$$
$$z \le v$$
$$z \ge v - M (1 - x)$$
$$z \ge 0.$$
Show that these constraints force z to take the value of the product xv for every assignment of 0-1 values
to x. Use this trick to obtain an MILP formulation of the Legiacom problem described in Section 7.6.
5.2 Branch-and-bound method
The branch-and-bound method is a generic framework for solving integer programming problems (or,
more generally, combinatorial optimization problems) to optimality (as opposed to: heuristically, or
approximately). In order to explain the method, we write the problem in the form
$$Z = \max f(x) \qquad (5.15)$$
$$\text{s.t. } x \in S. \qquad (5.16)$$
For instance, we might have $S = \{x : Ax \le b, \ x \text{ integer}\}$ and $f(x) = cx$.
The branch-and-bound method involves three fundamental building blocks: partitioning, evaluation
and heuristic solution.
5.2.1 Partitioning
Let us partition S into $S_1$ and $S_2$, that is, $S = S_1 \cup S_2$ and $S_1 \cap S_2 = \emptyset$.
Proposition 5.2.1. Let $Z_1 = \max f(x)$ subject to $x \in S_1$ and $Z_2 = \max f(x)$ subject to $x \in S_2$; then
$Z = \max (Z_1, Z_2)$.
For the ILP, suppose that $S_1 = \{x : x \in S \text{ and } x_1 = 0\}$ and $S_2 = \{x : x \in S \text{ and } x_1 \ge 1\}$; then
$S = S_1 \cup S_2$.
We could solve problem (5.15)-(5.16) by iteratively splitting S until the resulting subproblems become
elementary (see Figure 5.2).
This method obviously boils down to complete enumeration of all feasible solutions. Several techniques
can be used to accelerate it.
5.2.2 Evaluation
Consider the following problem which is to be solved at an arbitrary node i in the arborescence obtained
from the partitioning process:
$$Z_i = \max f(x) \qquad (5.17)$$
$$\text{s.t. } x \in S_i.$$
Let us compute an upper bound on $Z_i$, say $\bar{Z}_i$, such that
$$f(x) \le Z_i \le \bar{Z}_i \quad \text{for all } x \in S_i.$$
Suppose also that a solution $x^* \in S$ is known, with value $Z^* = f(x^*)$.
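[Figure 5.2: The branch-and-bound method: Example 5.1 (part I). Enumeration tree. The root is split on
$x_1$. Branch $x_1 = 0$ gives the subproblem max $x_2$ s.t. $2x_2 \le 2$, $x_2 \le 2$, $x_2 \in \{0, 1\}$; branch $x_1 = 1$ gives the
subproblem max $1 + x_2$ s.t. $2x_2 \le 1$, $x_2 \le 0$, $x_2 \in \{0, 1\}$. Each subproblem is split on $x_2 = 0$ and $x_2 = 1$.
The four leaves read "Trivial: Opt = 0", "Trivial: Opt = 1", "Trivial: Opt = 1" and "Non feasible".
Optimum = 1.]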
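Proposition 5.2.2. If $\bar{Z}_i \le Z^*$, then $f(x) \le Z^*$ for all $x \in S_i$, so that $S_i$ contains no solution better
than $x^*$ and node i need not be explored any further.
[Figure: node $P_i$, with $f(x) \le \bar{Z}_i$ for all $x \in S_i$ (if $\bar{Z}_i \le Z^*$ then stop).]
The third building block is a heuristic which provides a feasible solution $x^* \in S$, with value $f(x^*) = Z^*$, that can
be used as mentioned earlier.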
We have already devoted quite a lot of time to a discussion of efficient heuristics in previous chapters.
Let us simply add here that, in the ILP or MILP context, a heuristic solution can sometimes be obtained
by rounding the optimal solution of the current linear relaxation. Note, however, that it is not always
easy to round the solution so as to preserve feasibility.
Example 5.2. Let us give another illustration of the branch-and-bound procedure on a knapsack
problem, where we use the classical greedy heuristic (Strategy 2 in Section 3.3).
$$\max \ 20x_1 + 16x_2 + 11x_3 + 9x_4 + 7x_5 + x_6$$
$$\text{s.t. } 9x_1 + 8x_2 + 6x_3 + 5x_4 + 4x_5 + x_6 \le 12$$
$$0 \le x_j \le 1; \ x_j \text{ integer (for all } j).$$
[Figure 5.4: The branch-and-bound method: Example 5.1 (part II). Root problem P: $Z_{LP} = 4/3$.
Branch $x_1 = 0$ leads to $P_1$: max $x_2$ s.t. $2x_2 \le 2$, $x_2 \le 2$, $x_2 \in \{0, 1\}$, with $Z_{LP} = 1$; its two leaves are
$x = (0, 0)$ with $f(x) = 0$ and $x = (0, 1)$ with $f(x) = 1$.
Branch $x_1 = 1$ leads to $P_2$: max $1 + x_2$ s.t. $2x_2 \le 1$, $x_2 \le 0$, $x_2 \in \{0, 1\}$, with $Z_{LP} = 1$.
STOP: we could never find a better solution than $x^*$.]
For its implementation, we follow a four-step process: (1) split by fixing the variables at 0 or 1 ($x_1$,
then $x_2$, ..., then $x_6$); (2) explore completely each branch $x_j = 1$ before starting the exploration of
branch $x_j = 0$; (3) compute the upper bound by linear relaxation; and (4) use a greedy heuristic. The
resulting branch-and-bound tree is represented in Figure 5.5. It eventually leads to the optimal solution
$x_2 = x_5 = 1$ (all other variables being equal to 0), with value $cx^* = 23$.
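The four steps can be put together in a short program. The following Python sketch is one possible reading of the procedure for Example 5.2, not a reference implementation: the function names, the depth-first stack and the use of exact fractions are choices made here. It assumes that the items are listed by nonincreasing value/weight ratio, as they are in the example.

```python
from fractions import Fraction

def evaluate(values, weights, capacity, fixed):
    """Return (linear relaxation bound, greedy heuristic value) at the node where
    the variables in `fixed` (dict: index -> 0 or 1) are fixed, or None if the
    items fixed at 1 do not fit. Items must be sorted by nonincreasing ratio."""
    room = capacity - sum(weights[j] for j, v in fixed.items() if v == 1)
    if room < 0:
        return None
    base = sum(values[j] for j, v in fixed.items() if v == 1)
    bound, heur, room_lp, room_h, lp_done = Fraction(base), base, room, room, False
    for j in range(len(values)):
        if j in fixed:
            continue
        if not lp_done:
            if weights[j] <= room_lp:
                bound += values[j]
                room_lp -= weights[j]
            else:
                bound += Fraction(values[j] * room_lp, weights[j])  # fractional item
                lp_done = True
        if weights[j] <= room_h:  # greedy: take every item that still fits
            heur += values[j]
            room_h -= weights[j]
    return bound, heur

def branch_and_bound(values, weights, capacity):
    best, nodes = 0, 0          # incumbent value and number of nodes evaluated
    stack = [{}]                # depth-first search
    while stack:
        fixed = stack.pop()
        nodes += 1
        result = evaluate(values, weights, capacity, fixed)
        if result is None:
            continue            # infeasible node
        bound, heur = result
        best = max(best, heur)
        if bound <= best or len(fixed) == len(values):
            continue            # node pruned by its bound, or leaf reached
        j = len(fixed)          # next variable to fix: x_1, then x_2, ...
        stack.append({**fixed, j: 0})
        stack.append({**fixed, j: 1})  # pushed last, hence explored first
    return best, nodes

values = [20, 16, 11, 9, 7, 1]
weights = [9, 8, 6, 5, 4, 1]
best, nodes = branch_and_bound(values, weights, 12)
root_bound, root_heur = evaluate(values, weights, 12, {})
print(best, nodes, root_bound, root_heur)
```

At the root, the sketch computes the bound 26 and the greedy value 21; the search ends with the incumbent value 23 after evaluating far fewer nodes than the complete enumeration tree contains.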
5.2.4 Tight formulations
Two ingredients are absolutely crucial for a branch-and-bound algorithm to be efficient: at each node of
the tree, we should be able to compute a tight bound $\bar{Z}$ and a good heuristic solution $x^*$. Indeed, when
the bounds are close to optimal, the conditions of Proposition 5.2.2 are more likely to be satisfied, and
thus we are frequently in a position to use the proposition and to shortcut the enumeration process.
In particular, in an ILP problem, it is essential that $Z_{LP}$ be as close as possible to Z. The quality
of the bound $Z_{LP}$ itself depends on the selected model. Indeed, it is interesting to note that, in general,
several models can be used to represent the same ILP problem and that some of these models can have a
better linear relaxation than others.
[Figure 5.5: The branch-and-bound method: Example 5.2. Enumeration tree, where $\bar{Z}$ is the upper bound
and $cx^*$ the value of the heuristic solution at each node.
Root: $\bar{Z} = 26$; $cx^* = 21$ ($x_1 = x_6 = 1$).
Branch $x_1 = 1$: $\bar{Z} = 26$, $cx^* = 21$.
  $x_2 = 1$: infeasible. $x_2 = 0$: $\bar{Z} = 25.5$, $cx^* = 21$.
  $x_3 = 1$: infeasible. $x_3 = 0$: $\bar{Z} = 25.4$, $cx^* = 21$.
  $x_4 = 1$: infeasible. $x_4 = 0$: $\bar{Z} = 25.25$, $cx^* = 21$.
  $x_5 = 1$: infeasible. $x_5 = 0$: $\bar{Z} = 21 \le cx^*$. STOP: $x_1 = x_6 = 1$ is optimal along branch $x_1 = 1$.
Branch $x_1 = 0$: $\bar{Z} = 23.3$, $cx^* = 23$ (we could stop here by noting that the coefficients are integers).
  $x_2 = 1$: $\bar{Z} = 23.3$, $cx^* = 23$. $x_2 = 0$: $\bar{Z} = 21.75 \le cx^*$. STOP.
  $x_3 = 1$: infeasible. $x_3 = 0$: $\bar{Z} = 23.2$, $cx^* = 23$.
  $x_4 = 1$: infeasible. $x_4 = 0$: $\bar{Z} = 23 \le cx^*$. STOP: $x_2 = x_5 = 1$ is optimal along branch $x_1 = 0$, $x_2 = 1$.]
For instance, let us again consider the model of Example 5.1 and its feasible set $S = \{(0, 0), (0, 1), (1, 0)\}$.
In an ILP formulation, we can express S by a set of linear inequalities, as in the model:
$$S = \{(x_1, x_2) : x_1 + 2x_2 \le 2, \ 2x_1 + x_2 \le 2, \ x_1 \ge 0, \ x_2 \ge 0, \ x_1 \text{ integer}, \ x_2 \text{ integer}\}.$$
Let us call this model formulation 1. Another formulation of the same feasible set is given as:
$$S = \{(x_1, x_2) : x_1 + x_2 \le 1, \ x_1 \ge 0, \ x_2 \ge 0, \ x_1 \text{ integer}, \ x_2 \text{ integer}\}.$$
We call it formulation 2. The feasible sets of the linear relaxations of both formulations are represented
as $P_1$ and $P_2$ in Figure 5.6. Observe now that if we want to maximize the objective function $f(x) = x_1 + x_2$
over these relaxed sets, then formulation 1 yields the bound $f(\frac{2}{3}, \frac{2}{3}) = \frac{4}{3}$ while formulation 2 immediately
leads to the optimal value $f(1, 0) = f(0, 1) = 1$.
[Figure 5.6: Two formulations of the same problem. Two panels show the relaxed feasible sets $P_1$ and $P_2$,
each with the points (0,1) and (1,0) marked.]
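The two bounds can be recomputed numerically. The sketch below maximizes a linear function over a bounded polygon by enumerating its vertices, which is adequate only for tiny two-dimensional examples like this one; the helper `lp_max_2d` and the encoding of each constraint $a \cdot x \le b$ as a pair `(a, b)` are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations

def lp_max_2d(constraints, c):
    """Maximize c.x over {x : a.x <= b for (a, b) in constraints} in the plane by
    enumerating vertices (intersections of pairs of constraint lines).
    Assumes that the region is a bounded, nonempty polygon."""
    best = None
    for (a, b), (d, e) in combinations(constraints, 2):
        det = a[0] * d[1] - a[1] * d[0]
        if det == 0:
            continue  # parallel lines: no intersection point
        # Cramer's rule for a.x = b, d.x = e
        x = (Fraction(b * d[1] - a[1] * e, det), Fraction(a[0] * e - b * d[0], det))
        if all(p[0] * x[0] + p[1] * x[1] <= q for p, q in constraints):
            value = c[0] * x[0] + c[1] * x[1]
            if best is None or value > best[0]:
                best = (value, x)
    return best

nonneg = [((-1, 0), 0), ((0, -1), 0)]              # x1 >= 0, x2 >= 0
formulation1 = [((1, 2), 2), ((2, 1), 2)] + nonneg
formulation2 = [((1, 1), 1)] + nonneg
bound1 = lp_max_2d(formulation1, (1, 1))
bound2 = lp_max_2d(formulation2, (1, 1))
print(bound1, bound2)
```

Formulation 1 gives the bound 4/3 at the fractional vertex (2/3, 2/3), whereas formulation 2 gives the bound 1 at an integer vertex, so no branching is needed for the latter.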
In order to put these observations in a more general framework, let us consider the ILP problem
$Z = \max cx$ s.t. $x \in S$ with the alternative formulations $S = \{x : A_1 x \le b_1, \ x \text{ integer}\}$ (formulation 1)
and $S = \{x : A_2 x \le b_2, \ x \text{ integer}\}$ (formulation 2).
Let $P_1 = \{x : A_1 x \le b_1\}$ and $P_2 = \{x : A_2 x \le b_2\}$ denote the feasible sets associated with the linear
relaxations of these two formulations. Using Theorem 5.2.3, we have $Z \le \max\{cx : x \in P_i\}$, $i = 1, 2$.
And if $\max\{cx : x \in P_1\} \le \max\{cx : x \in P_2\}$, then formulation 1 should be preferred to formulation 2
because it yields a better upper bound. When this is the case, we say that formulation 1 is tighter than
formulation 2. There is a simple (sufficient) condition for a formulation to be tighter than another one:
Proposition 5.2.4. If $P_1 \subseteq P_2$, then formulation 1 is tighter than formulation 2.
[Figure 5.7: Proposition 5.2.4: an illustration. The figure shows the regions $P_1$ and $P_2$ together with the
integer points of S.]
See Figure 5.7 for an illustration.
Obtaining tight formulations is an essential component of the art of modeling integer programming
problems. It has led to the development of a whole body of knowledge and of associated algorithmic
methods which go by the name of polyhedral techniques, cutting plane procedures, etc. A more detailed
description of these approaches is beyond the scope of the present course; we refer the reader to Nemhauser
and Wolsey (1988) for additional information.
Let us simply mention here two rules which can frequently be used in order to tighten a given
formulation.
(i) If $S = \{x : \sum_j a_j x_j \le b_1, \ x \text{ integer}\}$, $S = \{x : \sum_j a_j x_j \le b_2, \ x \text{ integer}\}$ and $b_1 \le b_2$, then $P_1 \subseteq P_2$.
More generally, reducing the size of the coefficients often improves IP formulations.
(ii) Disaggregation of constraints: We state this as a formal mathematical result.
Proposition 5.2.5. Let $T = \{y : Ay \le b, \ y \ge 0\}$ and suppose that, for all $y \in T$ and for all $i = 1, \ldots, n$,
$y_i \le M_i$ where $M_i$ is a constant. Let
$$S_1 = \{(x, y) : y \in T, \ \sum_{i=1}^{n} y_i \le \Big(\sum_{i=1}^{n} M_i\Big) x, \ x \in \{0, 1\}\}$$
and
$$S_2 = \{(x, y) : y \in T, \ y_i \le M_i x \ (i = 1, \ldots, n), \ x \in \{0, 1\}\}.$$
Then $S_1 = S_2$ and $P_2 \subseteq P_1$.
Proof:
Let $(x, y) \in S_1$, and thus $y \in T$. If $x = 0$, then $y_i = 0$ (for $i = 1, \ldots, n$) and thus $(x, y) \in S_2$. If $x = 1$,
then $(x, y) \in S_2$ by the hypothesis $y_i \le M_i$.
Let $(x, y) \in S_2$, and thus $y \in T$. If $x = 0$, then $y_i = 0$ (for $i = 1, \ldots, n$) and thus $(x, y) \in S_1$. If $x = 1$, then
$y_i \le M_i$ ($i = 1, \ldots, n$) and, by addition of these inequalities, we see that $(x, y) \in S_1$.
Let $(x, y) \in P_2$; by addition of the inequalities $y_i \le M_i x$ ($i = 1, \ldots, n$), we have
$$\sum_{i=1}^{n} y_i \le \Big(\sum_{i=1}^{n} M_i\Big) x$$
and hence $(x, y) \in P_1$. QED.
In general, $P_1 \ne P_2$. For instance, consider the following example where $M_1 = M_2 = 1$. Let
$$S_1 = \{(x, y) : y_1 + y_2 \le 2x, \ 0 \le y_1 \le 1, \ 0 \le y_2 \le 1, \ x \in \{0, 1\}\}$$
and
$$S_2 = \{(x, y) : y_1 \le x, \ y_2 \le x, \ 0 \le y_1 \le 1, \ 0 \le y_2 \le 1, \ x \in \{0, 1\}\}.$$
Then $(x, y_1, y_2) = (0.5, 1, 0) \in P_1 \setminus P_2$. This is the optimal solution of $\max\{10y_1 - x : (x, y) \in P_1\}$ but
not the optimal solution of $\max\{10y_1 - x : (x, y) \in P_2\}$ (which is (1, 1, 0)).
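The membership claims of this small example are easy to verify. The sketch below is only an illustration: it assumes the objective $10y_1 - x$ (the minus sign is inferred from the two optimal solutions quoted above), tests the point (0.5, 1, 0) against both relaxations, and checks the inclusion $P_2 \subseteq P_1$ on a coarse grid, which illustrates Proposition 5.2.5 but does not prove it.

```python
from fractions import Fraction
from itertools import product

def in_P1(x, y1, y2):
    """Linear relaxation of S1 (aggregated constraint)."""
    return y1 + y2 <= 2 * x and 0 <= y1 <= 1 and 0 <= y2 <= 1 and 0 <= x <= 1

def in_P2(x, y1, y2):
    """Linear relaxation of S2 (disaggregated constraints)."""
    return y1 <= x and y2 <= x and 0 <= y1 <= 1 and 0 <= y2 <= 1 and 0 <= x <= 1

def objective(x, y1, y2):
    return 10 * y1 - x  # assumed objective function

point = (Fraction(1, 2), 1, 0)

# Grid check of P2 included in P1, with coordinates in {0, 1/4, ..., 1}.
grid = [Fraction(k, 4) for k in range(5)]
inclusion_holds = all(in_P1(*p) for p in product(grid, repeat=3) if in_P2(*p))

print(in_P1(*point), in_P2(*point), objective(*point), objective(1, 1, 0), inclusion_holds)
```

The fractional point has value 9.5 in the aggregated relaxation, against 9 for the integer point (1, 1, 0): the weaker relaxation gives the weaker bound.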
This general principle can be applied to the facility location problem. We have presented two
formulations for this problem in Section 5.1.1. As a consequence of Proposition 5.2.5, the use of the disaggregated
constraints (5.10) leads to a tighter formulation than their aggregated variant (5.14) and thus, they yield
a preferable formulation. Observe that this is true in spite of the fact that the disaggregated constraints
(5.10) are much more numerous than their aggregated counterparts. Actually, the linear relaxation of
model (5.9)-(5.13) often has an optimal solution with integer values for the binary variables $y_i$.
For other illustrations of the same principle, see the paper by Barnhart, Johnson, Nemhauser,
Sigismondi, and Vance (1993). The authors of this study report the following results for a large scale mixed
integer distribution problem:
1. Aggregate model, MPSX solver: no feasible solution was obtained after 100 hours of computing time
on a mainframe.
2. Disaggregated model, MPSX solver: optimal solution obtained after 5 hours of computing time on
a mainframe.
3. Disaggregated model, OSL2 solver: optimal solution obtained within a computing time of 13 minutes
on a simple workstation.
5.2.5 Some final comments
Other general principles which should be used when developing integer programming models:
• Reduce as much as possible the number of integer variables.
• Integer variables should be, in general, either binary variables or variables representing quantities
(which can often be rounded in a meaningful way), but not ordinal variables, like: x = sequence
number of a part to be processed on a workstation, with $1 \le x \le n$ and where x is an integer.
Indeed, such variables are difficult to round and to enumerate.
Chapter 6
Neural networks
This chapter contains a brief introduction to neural networks, with a particular focus on the design of
feedforward networks used in the context of learning and function approximation. (Other types of neural
networks, for instance Boltzmann machines, will not be discussed here.) More information about neural
networks can be found in many books, e.g. in Bishop (1995), Braspenning, Thuijsman, and Weijters
(1995), Ripley (1996), Smith (1996). Articles by DeBodt, Cottrell, and Levasseur (1996) or West,
Brockett, and Golden (1997) provide gentle introductions to the basic theory with an eye toward management
applications.
6.1 Feedforward neural networks
A feedforward neural network is a directed acyclic graph whose nodes are labelled by transfer functions
(or activation functions) and whose arcs are labelled by numerical weights. More precisely (see Figure
6.1 for an illustration):
1. Directed, acyclic graph: The graph defines the architecture of the network. Its nodes can be
organized in N + 1 layers, numbered from 0 to N, where layer p contains $m_p$ nodes. Layer 0 is the input
layer. Layers 1 to N - 1 are called hidden layers. Layer N is the output layer: for simplicity, we
assume in these notes that the output layer contains only one node, i.e. $m_N = 1$.
2. Arc weights: The arc between node (i, p) (node i in layer p) and node (j, p + 1) (node j in layer
p + 1) has weight $w_{ij}^{p+1} \in \mathbb{R}$.
3. Transfer function: Each node (i, p) is associated with a function $f_i^p : \mathbb{R} \to \mathbb{R}$. We assume that $f_i^0$ is
the identity function for all nodes (i, 0) in the input layer.
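[Figure 6.1: A feedforward neural network. Layers 0, ..., p, p + 1, ..., N are shown; the arc from node i of
layer p (with transfer function $f_i^p$) to node j of layer p + 1 carries the weight $w_{ij}^{p+1}$.]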
6.2 Neural networks as computing devices
Let $x_1, \ldots, x_n$ be real variables (inputs) attached to the nodes of layer 0 (with $n = m_0$). For each node
(j, p) of the network, we can compute a numerical value $y_j^p$ by repeated application of the following rules:
$$y_j^0 = x_j \quad \text{for } j = 1, \ldots, n; \qquad (6.1)$$
$$y_j^p = f_j^p\Big(\sum_{i=1}^{m_{p-1}} w_{ij}^p \, y_i^{p-1}\Big) \quad \text{for } j = 1, \ldots, m_p; \ p = 1, \ldots, N. \qquad (6.2)$$
The previous equations can also be decomposed as follows (see Figure 6.2):
$$y_j^0 = x_j \quad \text{for } j = 1, \ldots, n; \qquad (6.3)$$
$$y_j^p = f_j^p(\xi_j^p) \quad \text{for } j = 1, \ldots, m_p; \ p = 1, \ldots, N; \qquad (6.4)$$
$$\xi_j^p = \sum_{i=1}^{m_{p-1}} w_{ij}^p \, y_i^{p-1} \quad \text{for } j = 1, \ldots, m_p; \ p = 1, \ldots, N. \qquad (6.5)$$
The value computed at the unique node of the last layer, i.e. $y^N$, can be viewed as the value of a function
of n variables $(x_1, \ldots, x_n)$ computed by the neural network. We sometimes denote this function by
$NN(x_1, \ldots, x_n)$; thus, by definition,
$$NN(x_1, \ldots, x_n) = y^N.$$
[Figure 6.2: Computing the value $y_j^p$ at node (j, p) of a neural network. The inputs $y_1^{p-1}, y_2^{p-1}, \ldots, y_m^{p-1}$
arrive through arcs with weights $w_{1j}^p, w_{2j}^p, \ldots, w_{mj}^p$; the node computes $\xi_j^p = \sum_{i=1}^{m} w_{ij}^p \, y_i^{p-1}$ and outputs
$y_j^p = f(\xi_j^p)$.]
[Figure 6.3: A small example with transfer functions $f(x) = \frac{1}{1+e^{-x}}$. Inputs: $x_1 = 2$, $x_2 = 4$. Hidden layer:
$\xi_1^1 = 10$, $y_1^1 \approx 1$; $\xi_2^1 = 0$, $y_2^1 = 1/2$. Output node: $\xi_1^2 = 5$, $y^N = 5$. Arc weights: -1 and 3 into the first
hidden node, -1 and 0.5 into the second hidden node, 3 and 4 into the output node.]
An example of a small neural network is provided in Figure 6.3. Here we assume that the transfer
function is $f(x) = \frac{1}{1+e^{-x}}$ (logistic function) for the two nodes in the middle (hidden) layer and is the
identity for the output node. If $x_1 = 2$ and $x_2 = 4$, we successively compute the weighted sums
$\xi_1^1 = (2)(-1) + (4)(3) = 10$ and $\xi_2^1 = (-1)(2) + (0.5)(4) = 0$. Hence, $y_1^1 = \frac{1}{1+e^{-10}} \approx 1$,
$y_2^1 = \frac{1}{1+e^{0}} = 0.5$, and the network outputs the value $y^N = 3y_1^1 + 4y_2^1 \approx 5$. More generally, for any
input pair $(x_1, x_2)$, the network computes the function
$$NN(x_1, x_2) = 3 \, \frac{1}{1 + e^{-\xi_1^1}} + 4 \, \frac{1}{1 + e^{-\xi_2^1}} = \frac{3}{1 + e^{-(-x_1 + 3x_2)}} + \frac{4}{1 + e^{-(-x_1 + 0.5x_2)}}.$$
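The layer-by-layer computation of rules (6.1)-(6.2) translates directly into code. The following sketch is an illustration under stated assumptions: the representation of a layer as a pair (weight matrix, transfer function) is a choice made here, all nodes of a layer are assumed to share the same transfer function, and the weights -1, 3, -1, 0.5, 3 and 4 are those consistent with the weighted sums 10 and 0 of the example.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def identity(t):
    return t

def forward(x, layers):
    """Apply rule (6.2) layer by layer. `layers` is a list of (W, f) pairs, one per
    layer p = 1..N, where W[i][j] is the weight of the arc from node i of layer
    p-1 to node j of layer p, and f is the transfer function used in layer p."""
    y = list(x)                                   # rule (6.1): y^0 = x
    for W, f in layers:
        sums = [sum(W[i][j] * y[i] for i in range(len(y)))
                for j in range(len(W[0]))]        # weighted sums of the layer
        y = [f(s) for s in sums]
    return y

hidden = ([[-1, -1],      # arcs leaving input node 1
           [3, 0.5]],     # arcs leaving input node 2
          logistic)
output = ([[3], [4]], identity)

hidden_values = forward([2, 4], [hidden])
nn_value = forward([2, 4], [hidden, output])[0]
print(hidden_values, nn_value)
```

The hidden values are very close to 1 and exactly 0.5, and the output is close to 5, in agreement with the hand computation.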
6.3 Neural networks as function approximation devices
In view of the above, one can use a neural network to approximate an (unknown) function F given by a
sample of its values:
$$(X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(1)}; \ F(X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(1)}) = Y^{(1)})$$
$$(X_1^{(2)}, X_2^{(2)}, \ldots, X_n^{(2)}; \ F(X_1^{(2)}, X_2^{(2)}, \ldots, X_n^{(2)}) = Y^{(2)})$$
$$\ldots$$
just as in a classical regression model.
Example. Let us consider the linear regression model described in DeBodt, Cottrell, and Levasseur
(1996), pp. 29-30. The objective is to find $\alpha$, $\beta_1$ and $\beta_2$ so that $\alpha + \beta_1 ROI^{(e)} + \beta_2 \frac{TN^{(e)}}{PT^{(e)}}$ gives a good
approximation of $y^{(e)}$ for a sample of 10 observations ($e = 1, \ldots, 10$). So, we must solve the least-squares
minimization problem:
$$\min SSE = \sum_{e=1}^{10} \Big( y^{(e)} - \alpha - \beta_1 ROI^{(e)} - \beta_2 \frac{TN^{(e)}}{PT^{(e)}} \Big)^2. \qquad (6.6)$$
Setting the first derivatives to zero leads to the system of equations:
$$\frac{\partial SSE}{\partial \alpha} = 0 \iff \sum_{e=1}^{10} \Big( y^{(e)} - \alpha - \beta_1 ROI^{(e)} - \beta_2 \frac{TN^{(e)}}{PT^{(e)}} \Big) = 0$$
$$\frac{\partial SSE}{\partial \beta_1} = 0 \iff \ldots$$
$$\frac{\partial SSE}{\partial \beta_2} = 0 \iff \ldots$$
whose solution is $\alpha = 1.28$; $\beta_1 = 2.34$; $\beta_2 = 1.86$.
The linear function
$$Y = \alpha + \beta_1 ROI + \beta_2 \frac{TN}{PT}$$
is computed by a very simple neural network, as shown in Figure 6.4. But of course, a better fit of the
observations (i.e., a smaller sum of squared errors) could potentially be obtained by adopting a more
complex neural network.
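[Figure 6.4: NN: linear function. The inputs ROI and TN/PT feed a single output node which computes the
weighted sum $\alpha + \beta_1 ROI + \beta_2 \frac{TN}{PT}$ and applies the identity transfer function (id).]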
In practice, to define a neural network model, we would fix the architecture of the network (that is, the
underlying graph) and the transfer functions, and we would determine the arc weights which provide the
best fit under these restrictions. Note that this is quite similar to the classical statistical approach where
we fix the functional form of the regression model (linear, or polynomial, or ...) and we subsequently
determine the best possible choice of the model coefficients.
In this framework, sigmoid functions of the form $f(x) = \frac{1}{1+e^{-\lambda x}}$ (where $\lambda > 0$) are often used as transfer
functions; the logistic function is the special sigmoid function obtained when $\lambda = 1$. There are (at least)
two main motivations for the use of logistic transfer functions:
• It can be proved that any reasonable continuous function can be approximated arbitrarily closely by
a network containing a single (but sufficiently large) hidden layer with logistic transfer functions (see e.g.
Funahashi (1989)).
• If f is the logistic function, then $f'(x) = f(x) (1 - f(x))$, so that the derivative of f is easily computed
from the value of f.
For a sample of s observations, the quality of the fit provided by the network is measured by the sum of
squared errors
$$E(w) = \sum_{i=1}^{s} \Big( NN(X^{(i)}) - Y^{(i)} \Big)^2, \qquad (6.7)$$
where $NN(X^{(i)})$ is the value computed by the network on the input $X^{(i)}$. Minimizing this error is a nonlinear
optimization problem (in particular, because the function computed by the network is highly nonlinear in the
weights $w_{ij}^p$). Therefore, we digress for a (very) short introduction to nonlinear optimization techniques.
6.4 Unconstrained nonlinear optimization
6.4.1 Minimization problems in one variable: introduction
We consider the problem
$$\min_{w \in \mathbb{R}} f(w) \qquad (6.8)$$
where f(w) is a real-valued function of the single variable w. Various numerical methods are available
for attacking this type of model, although most of them are limited to finding (in the best case) a local
minimizer of f. In this short introduction, we restrict ourselves to twice continuously differentiable functions
over some suitable open interval I. In this case, the following necessary condition holds:
Proposition 6.4.1. If $w^* \in I$ is a local minimum of f, then $w^*$ is a stationary point of f, that is, $f'(w^*) = 0$.
Observe that a stationary point may be either a local minimum, or a local maximum, or even no local
optimum at all. But we also know that:
Proposition 6.4.2. If $w^*$ is a stationary point of f and $f''(w^*) > 0$, then $w^*$ is a local
minimum of f.
Proposition 6.4.1 suggests an approach to solving the optimization problem (6.8): let $g(w) = f'(w)$
and solve the equation:
$$g(w) = 0, \quad w \in I. \qquad (6.9)$$
So, let us focus momentarily on solution techniques for (6.9).
6.4.2 Equations in one variable
Let $w_0$ be an initial point in I. In a neighborhood of $w_0$, g can be approximated by the linear function
$$h(w) = g(w_0) + g'(w_0) \, (w - w_0). \qquad (6.10)$$
Therefore, it seems like a natural idea to solve the equation h(w) = 0 rather than the more intricate
equation (6.9). Assuming that $g'(w_0) \ne 0$, the unique solution of h(w) = 0 is given by
$$w_1 = w_0 - \frac{g(w_0)}{g'(w_0)}. \qquad (6.11)$$
Repeating the same process with $w_1$ as the new initial solution leads to a next point $w_2$ and, iteratively,
to a (potentially infinite) sequence of points $w_k$ where
$$w_{k+1} = w_k - \frac{g(w_k)}{g'(w_k)} \quad \text{for } k = 0, 1, 2, \ldots. \qquad (6.12)$$
The iterations (6.12) give rise to Newton's method (sometimes called the Newton-Raphson method). The main
properties of Newton's method can be summarized as follows:
Proposition 6.4.3. If $g'(w)$ is bounded away from 0 on the interval I, if the equation (6.9) has a solution
$w^* \in I$ and if $w_0$ is close enough to $w^*$, then the sequence $w_k$ converges to $w^*$ and there exists a constant c
such that $|w_{k+1} - w^*| \le c \, |w_k - w^*|^2$.
Two points are worth commenting on with regard to this result. First, Newton's method is a local
method in the sense that its convergence is only guaranteed when the method starts close enough to a
root of g. As an example, with g(w) = arctan(w), the method converges towards $w^* = 0$ if $|w_0| < 1.39$
but it diverges if $|w_0| > 1.40$. However, when the method converges, the latter part of Proposition 6.4.3
implies that it does so quite rapidly: at every iteration, the distance between the current solution $w_k$ and
the root $w^*$ is reduced quadratically.
In order to ensure global convergence, some modifications must be brought to Newton's method.
Several possibilities exist. Press, Teukolsky, Vetterling, and Flannery (1995) recommend starting by
bracketing a root of g(w) between two values a and b such that g(a)g(b) < 0 and locating the root
approximately by some variant of a bisection approach (e.g., golden section search). The resulting value
can be used as an initial solution for Newton's method. Alternatively, Dennis and Schnabel (1996) describe
quasi-Newton methods based on the following idea: if the point $w_{k+1}$ defined by Newton's iterate (6.12)
does not get us closer to 0 (that is, if $|g(w_{k+1})| \ge |g(w_k)|$), then backtrack on the segment from $w_{k+1}$
to $w_k$ until a better point than $w_k$ is found (such a point always exists). Such strategies combine global
convergence with the fast local convergence properties of Newton's method.
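The behavior of Newton's method on g(w) = arctan(w), and the effect of backtracking, can be observed numerically. The sketch below is a simplified illustration and not the method of Dennis and Schnabel: the safeguard used here (halving the step until |g| decreases, with at most 50 halvings) and the function name `newton` are assumptions of this example.

```python
import math

def newton(g, dg, w0, steps=20, backtrack=False):
    """Newton iterations (6.12) for g(w) = 0. With backtrack=True, the Newton
    step is halved until |g| decreases (a simple safeguard, assumed here)."""
    w = w0
    for _ in range(steps):
        if g(w) == 0:
            break
        step = -g(w) / dg(w)
        if backtrack:
            halvings = 0
            while abs(g(w + step)) >= abs(g(w)) and halvings < 50:
                step /= 2
                halvings += 1
        w += step
    return w

g = math.atan

def dg(w):
    return 1.0 / (1.0 + w * w)

close_start = newton(g, dg, 1.0)                   # starts inside the convergence zone
far_start = newton(g, dg, 1.5, steps=5)            # pure Newton from 1.5
safeguarded = newton(g, dg, 1.5, backtrack=True)   # same start, with backtracking
print(close_start, far_start, safeguarded)
```

From $w_0 = 1$ the iterates reach the root 0 within a few steps; from $w_0 = 1.5$ the pure Newton iterates alternate in sign with growing magnitude, whereas the safeguarded version converges to 0.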
Finally, observe that Newton's method can also be adapted to the case where the first derivative of g
is unavailable or expensive to express in analytical form. Such variants rely on numerical estimates of $g'$.
6.4.3 Minimization problems in one variable: algorithms
Let us now revert to the optimization problem (6.8). As mentioned above, Proposition 6.4.1 directly suggests
solving the equation
$$f'(w) = 0, \quad w \in I. \qquad (6.13)$$
Assuming that f is twice continuously differentiable, Newton's iterates for equation (6.13) become:
$$w_{k+1} = w_k - \frac{f'(w_k)}{f''(w_k)} \quad \text{for } k = 0, 1, 2, \ldots. \qquad (6.14)$$
It is easy to check that iteration k amounts to approximating f(w) by the quadratic function
$$h(w) = f(w_k) + f'(w_k) \, (w - w_k) + \frac{1}{2} f''(w_k) \, (w - w_k)^2 \qquad (6.15)$$
(compare with (6.10)) and computing the (unique) stationary point $w_{k+1}$ of h(w).
Example. Let $f(w) = (w - 2)^4$. Then, $f'(w) = 4(w - 2)^3$ and $f''(w) = 12(w - 2)^2$. So,
$$w_{k+1} = \frac{2}{3} w_k + \frac{2}{3}$$
and $w_k \to 2$ from every initial solution $w_0$. For instance, with $w_0 = 1$, we successively get $w_1 = 1.33$,
$w_2 = 1.56$, $w_3 = 1.70$, $w_4 = 1.80$, $w_5 = 1.87$, $w_6 = 1.91$, ...
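The iterates of this example are easily reproduced. The following sketch applies the Newton step (6.14) to $f(w) = (w - 2)^4$ starting from $w_0 = 1$; the function name `newton_min` is an assumption of this illustration.

```python
def newton_min(df, d2f, w0, steps):
    """Iterates (6.14): w_{k+1} = w_k - f'(w_k) / f''(w_k)."""
    iterates = [w0]
    for _ in range(steps):
        w = iterates[-1]
        iterates.append(w - df(w) / d2f(w))
    return iterates

def df(w):
    return 4 * (w - 2) ** 3

def d2f(w):
    return 12 * (w - 2) ** 2

iterates = newton_min(df, d2f, 1.0, 6)
rounded = [round(w, 2) for w in iterates]
ratios = [(b - 2) / (a - 2) for a, b in zip(iterates, iterates[1:])]
print(rounded, ratios)
```

Each step multiplies the distance to the minimizer 2 by the constant factor 2/3, so that convergence is only linear here: the quadratic rate of Proposition 6.4.3 is lost because the second derivative of f vanishes at the solution.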
The convergence properties of this method are similar to those stated for the solution of nonlinear
equations (observe that convergence is rather slow in the previous example, due to the fact that $f''$
vanishes at $w^* = 2$). A major problem, however, is that, even when the iterates converge to a stationary
point of f, there is no guarantee that this stationary point corresponds to a local optimum of f, let alone
to a local minimum. Indeed, the Newton steps (6.14) associated with the function f are exactly the same
as those associated with its negative $-f$, so that the method is unable to distinguish between minimization
and maximization problems.
Here again, several approaches have been proposed in order to remedy this difficulty. Dennis and
Schnabel (1996) suggest a quasi-Newton method which ensures that successive iterates converge
monotonically towards a local optimum, similarly to what was done for the solution of nonlinear equations. Assume
(for a minimization problem) that $f(w_{k+1}) \ge f(w_k)$. There are two possibilities.
If $f''(w_k) > 0$, then (6.14) guarantees that the segment $[w_k, w_{k+1}]$ initially points towards decreasing
values of f (in the direction of the descending gradient). Therefore, backtracking from $w_{k+1}$ to $w_k$
eventually leads to a point $w^+$ with better value than $w_k$ and this point can be chosen as the next iterate
of the quasi-Newton method.
Alternatively, if $f''(w_k) < 0$, then the method must be adapted to take a step in the opposite
direction. Here again, searching along the direction of the descending gradient $-f'(w_k)$ makes it possible to locate
a point with better value than $f(w_k)$.
(Observe that the approximation h(w) (see (6.15)) is a convex parabola when $f''(w_k) > 0$, whereas
h(w) is concave when $f''(w_k) < 0$.)
A simpler, but less powerful alternative is to abandon Newton's iterates altogether and to replace
them by iterates of the form:
$$w_{k+1} = w_k - \gamma_k f'(w_k) \qquad (6.16)$$
where $\gamma_k$ ($k = 0, 1, 2, \ldots$) is a sequence of positive multipliers, or step sizes. Such steepest descent or
gradient methods can be made monotonically decreasing by choosing $\gamma_k$ small enough at each step. Notice
however that excessively small steps may result in very slow convergence or in premature convergence to
nonstationary points. Necessary conditions for convergence towards a stationary point can be found for
instance in Bertsimas (1995).
6.4.4 Multivariable minimization problems
The ideas exposed in the previous section can be generalized, to some extent, to multidimensional unconstrained minimization problems. For a function $f : \mathbb{R}^n \to \mathbb{R}$, the general expression of a Newton step is given as:
$$w_{k+1} = w_k - H(w_k)^{-1} \, \mathrm{grad}_f(w_k) \qquad (6.17)$$
($k = 0, 1, 2, \ldots$), where $H(w_k)$ is the Hessian matrix of $f$ computed at $w_k$ and $\mathrm{grad}_f(w_k)$ is the gradient of $f$ at $w_k$ (compare with (6.14)). Here again, several difficulties arise with the sequence of iterates (6.17); for instance, $H(w_k)$ is not necessarily invertible and the sequence $f(w_k)$ does not necessarily converge monotonically towards a local minimum. Remedies are similar in spirit, but more complex to describe, than those suggested in the one-dimensional case. A fundamental idea consists in ensuring that the Hessian $H(w_k)$ (or a suitable approximation) remains positive definite at each iteration (see Dennis and Schnabel (1996) for details).
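The Newton step (6.17) can be sketched in a few lines of Python for $n = 2$. The quadratic test function, the 2x2 solver based on Cramer's rule and all function names are assumptions made for this illustration; for larger $n$ one would solve the linear system $H(w_k) d = \mathrm{grad}_f(w_k)$ with a numerical library rather than invert $H$.

```python
def solve_2x2(h, g):
    """Solve H d = g for a 2x2 matrix H by Cramer's rule."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Hessian is (numerically) singular: Newton step undefined")
    return ((d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det)


def newton_step(grad, hessian, w):
    """One iterate w_{k+1} = w_k - H(w_k)^{-1} grad_f(w_k)."""
    direction = solve_2x2(hessian(w), grad(w))
    return (w[0] - direction[0], w[1] - direction[1])


# Illustrative function (an assumption): f(w) = w1^2 + w1*w2 + 2*w2^2 - 3*w1.
def grad(w):
    return (2.0 * w[0] + w[1] - 3.0, w[0] + 4.0 * w[1])


def hessian(w):
    return ((2.0, 1.0), (1.0, 4.0))  # constant and positive definite for this f


w_next = newton_step(grad, hessian, (0.0, 0.0))
print(w_next)  # for a strictly convex quadratic, one step reaches the minimizer (12/7, -3/7)
```

The explicit singularity check mirrors the first difficulty mentioned above: when $H(w_k)$ is not invertible, the step is simply not defined.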
86 CHAPTER 6. NEURAL NETWORKS
Additional numerical issues pop up in the multidimensional case: estimating and inverting the $n \times n$ Hessian matrix at each iteration can be very expensive (i.e., time consuming), matrices may be ill-conditioned (possibly resulting in important rounding errors), etc. For these reasons, simpler, though less efficient, gradient methods are sometimes adopted for the solution of large scale problems. The general form of such methods is given as:
$$w_{k+1} = w_k - \alpha_k \, \mathrm{grad}_f(w_k), \qquad (6.18)$$
where $\alpha_k$ ($k = 0, 1, 2, \ldots$) is a suitably chosen sequence of step sizes. We refer to Bertsimas (1995) for a discussion of suitable choices of step sizes. Let us simply mention here that excessively large step sizes may prevent convergence of the algorithm, whereas excessively small steps may result in very slow convergence or in premature convergence to non-stationary points.
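The influence of the step size in (6.18) can be observed on a small example. The sketch below is only an illustration: the function $f(w) = w_1^2 + 10 w_2^2$, the starting point and the three constant step sizes are assumptions.

```python
def gradient_descent(grad, w0, step_size, iterations):
    """Iterate w_{k+1} = w_k - alpha * grad_f(w_k) with a constant step size."""
    w = list(w0)
    for _ in range(iterations):
        g = grad(w)
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
    return w


# Illustrative function (an assumption): f(w) = w1^2 + 10 * w2^2, minimized at (0, 0).
def f(w):
    return w[0] ** 2 + 10.0 * w[1] ** 2


def grad_f(w):
    return [2.0 * w[0], 20.0 * w[1]]


start = [1.0, 1.0]
final_values = {}
for alpha in (0.11, 0.09, 0.0001):
    final_values[alpha] = f(gradient_descent(grad_f, start, alpha, 50))
    print(alpha, final_values[alpha])
```

With these data, the largest step makes the second coordinate oscillate with growing amplitude, the intermediate step converges, and the smallest step leaves $f$ close to its initial value of 11 after 50 iterations.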
6.5 Application to NN design: the backpropagation algorithm
The NN design problem has been introduced in Section 6.3. Recall that the sample of observations is $(X^{(1)}, Y^{(1)}), \ldots, (X^{(s)}, Y^{(s)})$ and that the architecture of the network is fixed (see Figure 6.2; we assume for simplicity that the transfer function is $f(x) = \frac{1}{1+e^{-x}}$ at every node). We must determine the values of the weights $w^p_{ij}$ that minimize the objective function
$$E(w) = \sum_{i=1}^{s} \left( NN(X^{(i)}) - Y^{(i)} \right)^2, \qquad (6.19)$$
where $NN(X^{(i)})$ is the value computed by the network on the input $X^{(i)}$.
We simplify the presentation by assuming that $s = 1$, in which case we can drop the superscript $i$ in the notation $(X^{(i)}, Y^{(i)})$. Thus, the objective is to adjust the weights so as to reproduce as well as possible the observed value $Y$ on the input $X$.
When applied to the NN design problem, the gradient descent principle described by equation (6.18) gives rise to a specialized algorithm called the backpropagation algorithm, which we now proceed to explain. Note first that the gradient of $E$ is
$$\left( \frac{\partial E}{\partial w^1_{11}}, \frac{\partial E}{\partial w^1_{12}}, \ldots, \frac{\partial E}{\partial w^p_{ij}}, \ldots, \frac{\partial E}{\partial w^N_{i}} \right).$$
Given a current vector of weights, say $w$, we can compute the values $\nu^p_j$ and $y^p_j$ at all nodes of the network by application of the equations (6.3)–(6.5). Now, one step of the algorithm consists in replacing $w$ by a modified vector $\omega = w - \alpha \, \mathrm{grad}_E(w)$, where
$$\omega^p_{ij} = w^p_{ij} - \alpha \frac{\partial E}{\partial w^p_{ij}} \quad \text{for all } i, j, p. \qquad (6.20)$$
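Since $E$ depends on $w^p_{ij}$ via $\nu^p_j$ (by equations (6.4)–(6.5)), we can write
$$\frac{\partial E}{\partial w^p_{ij}} = \frac{\partial E}{\partial \nu^p_j} \, \frac{\partial \nu^p_j}{\partial w^p_{ij}}. \qquad (6.21)$$
From equation (6.5), we obtain
$$\frac{\partial \nu^p_j}{\partial w^p_{ij}} = y^{p-1}_i \quad \text{(value of node $i$ in layer $p-1$)}. \qquad (6.22)$$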
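Let us introduce the notation $\frac{\partial E}{\partial \nu^p_j} = \delta^p_j$. There holds
$$\delta^p_j = \frac{\partial E}{\partial \nu^p_j} = \frac{\partial E}{\partial y^p_j} \, \frac{\partial y^p_j}{\partial \nu^p_j}. \qquad (6.23)$$
Since $y^p_j = f(\nu^p_j)$ (Eq. (6.4)) and $f$ is the logistic function, we find
$$\frac{\partial y^p_j}{\partial \nu^p_j} = y^p_j (1 - y^p_j). \qquad (6.24)$$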
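Now, the only derivative that is yet to be computed is $\frac{\partial E}{\partial y^p_j}$. Let us first consider the case $p = N$ and remember that $NN(X) = y^N$. Thus,
$$\frac{\partial E}{\partial y^N} = \frac{\partial (y^N - Y)^2}{\partial y^N} = 2 (y^N - Y). \qquad (6.25)$$
Putting together the pieces (6.21)–(6.25), we finally obtain
$$\frac{\partial E}{\partial w^N_i} = \delta^N y^{N-1}_i \quad \text{with} \quad \delta^N = 2 y^N (1 - y^N)(y^N - Y). \qquad (6.26)$$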
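Similarly, for $p < N$, we have
$$\frac{\partial E}{\partial y^p_j} = \sum_{k=1}^{m_{p+1}} \frac{\partial E}{\partial \nu^{p+1}_k} \, \frac{\partial \nu^{p+1}_k}{\partial y^p_j} = \sum_{k=1}^{m_{p+1}} \delta^{p+1}_k w^{p+1}_{jk}.$$
It follows that, for $p < N$ and for all $i, j$,
$$\frac{\partial E}{\partial w^p_{ij}} = \delta^p_j y^{p-1}_i \quad \text{with} \quad \delta^p_j = y^p_j (1 - y^p_j) \sum_{k=1}^{m_{p+1}} \delta^{p+1}_k w^{p+1}_{jk}. \qquad (6.27)$$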
Note that the value of $\delta^N$ is directly available from equation (6.26) ($y^N$ is computed with the current weights $w^p_{ij}$) and the remaining values $\delta^p_j$ for $p < N$ can be recursively obtained from (6.27). Therefore, we can compute all partial derivatives $\frac{\partial E}{\partial w^p_{ij}}$ and the gradient algorithm can be implemented on this basis.
Equations (6.26) and (6.27) define the so-called delta rule. The name backpropagation is justified by the fact that the parameters $\delta$ are computed backward by the recursion (6.26)–(6.27). On the other hand, each computation of $NN(X)$ (or $E(w)$) requires a forward pass through the neural network.
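The recursion (6.26)–(6.27) can be checked numerically on a tiny network. The following Python sketch is an illustration under stated assumptions: a single observation, a single output node, logistic transfer functions everywhere, no bias nodes, and a nested-list storage `weights[p][i][j]` (arc from node i of one layer to node j of the next), none of which is prescribed by the text. The finite-difference routine is only there to verify the analytic gradient.

```python
import copy
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, x):
    """Forward pass: node values of every layer (layer 0 is the input X)."""
    ys = [list(x)]
    for w in weights:  # w[i][j]: arc from node i of the previous layer to node j
        prev = ys[-1]
        nus = [sum(w[i][j] * prev[i] for i in range(len(prev))) for j in range(len(w[0]))]
        ys.append([sigmoid(nu) for nu in nus])
    return ys


def error(weights, x, target):
    """E(w) = (NN(X) - Y)^2 for one observation and one output node."""
    return (forward(weights, x)[-1][0] - target) ** 2


def backprop_gradient(weights, x, target):
    """Partial derivatives dE/dw obtained by the backward recursion."""
    ys = forward(weights, x)
    last = len(weights) - 1
    y_out = ys[-1][0]
    deltas = [None] * len(weights)
    deltas[last] = [2.0 * y_out * (1.0 - y_out) * (y_out - target)]  # (6.26)
    for p in range(last - 1, -1, -1):                                 # (6.27)
        y_layer, w_next = ys[p + 1], weights[p + 1]
        deltas[p] = [
            y_layer[j] * (1.0 - y_layer[j])
            * sum(deltas[p + 1][k] * w_next[j][k] for k in range(len(w_next[j])))
            for j in range(len(y_layer))
        ]
    # dE/dw_ij = delta_j * (value of node i in the previous layer)
    return [
        [[deltas[p][j] * ys[p][i] for j in range(len(weights[p][i]))]
         for i in range(len(weights[p]))]
        for p in range(len(weights))
    ]


def numerical_gradient(weights, x, target, h=1e-6):
    """Central finite differences, used only as a check."""
    grads = copy.deepcopy(weights)
    for p, layer in enumerate(weights):
        for i, row in enumerate(layer):
            for j in range(len(row)):
                plus, minus = copy.deepcopy(weights), copy.deepcopy(weights)
                plus[p][i][j] += h
                minus[p][i][j] -= h
                grads[p][i][j] = (error(plus, x, target) - error(minus, x, target)) / (2 * h)
    return grads


weights = [
    [[0.1, -0.2], [0.4, 0.3]],  # 2 input nodes -> 2 hidden nodes
    [[0.5], [-0.6]],            # 2 hidden nodes -> 1 output node
]
x, target = [1.0, 0.5], 1.0
analytic = backprop_gradient(weights, x, target)
numeric = numerical_gradient(weights, x, target)
max_gap = max(
    abs(a - n)
    for la, ln in zip(analytic, numeric)
    for ra, rn in zip(la, ln)
    for a, n in zip(ra, rn)
)
print(max_gap)
```

The backward pass needs one forward pass and then a single sweep over the arcs, whereas the finite-difference check needs two forward passes per weight; this is the practical benefit of the delta rule.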
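[Figure 6.5: Backpropagation: an example when $p < N$. Node $j$ of layer $p$, with value $y^p_j$, is linked by arcs of weights $w^{p+1}_{j1}, \ldots, w^{p+1}_{jk}, \ldots, w^{p+1}_{j,m_{p+1}}$ to nodes $1, \ldots, k, \ldots, m_{p+1}$ of layer $p+1$, where $\nu^{p+1}_k = \sum_{i=1}^{m_p} w^{p+1}_{ik} y^p_i$.]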
6.5.1 Extensions of the delta rule
Usually, the learning sample contains more than one observation and the backpropagation algorithm
needs to be generalized accordingly. This generalization can be performed in various ways.
1. Cumulative delta rule: In this approach, we compute the error $E = \sum_{i=1}^{s} (NN(X^{(i)}) - Y^{(i)})^2$ whenever the weights $w$ are modified. Thus, in each iteration (or cycle), a forward pass through the NN must be performed with all the learning examples in order to compute the error and the descent direction. This procedure is closest to the basic gradient rule, but it is also computationally expensive.
2. Generalized delta rule: Here, after the weights $w$ have been updated, exactly one example $X^{(i)}$ is used to recompute the node labels $y^p_j$ (one forward pass in the network) and to estimate the new value of the error. Then, $w$ is updated again according to equations (6.20)–(6.27) and the process is continued with the next example $X^{(i+1)}$. After $s$ iterations (constituting one learning cycle), all examples have been considered once. This only constitutes an approximation of the gradient algorithm, since the error function $E$ is not computed exactly at every iteration, but it is faster than the cumulative delta rule for large networks, and its performance is usually good.
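The difference between the two rules can be sketched on the simplest possible network, a single logistic output node fed directly by the inputs. The toy sample, the learning rate, the number of cycles and the function names below are assumptions made for the illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def example_gradient(w, x, y_obs):
    """Gradient of (NN(X) - Y)^2 for one example, as in (6.26) without hidden layer."""
    y = predict(w, x)
    delta = 2.0 * y * (1.0 - y) * (y - y_obs)
    return [delta * xi for xi in x]


def sse(w, sample):
    return sum((predict(w, x) - y) ** 2 for x, y in sample)


def cumulative_cycle(w, sample, rate):
    """One cycle: sum the gradients over all examples, then update once."""
    total = [0.0] * len(w)
    for x, y in sample:
        total = [t + g for t, g in zip(total, example_gradient(w, x, y))]
    return [wi - rate * ti for wi, ti in zip(w, total)]


def generalized_cycle(w, sample, rate):
    """One cycle: update the weights after each single example."""
    for x, y in sample:
        w = [wi - rate * g for wi, g in zip(w, example_gradient(w, x, y))]
    return w


# Toy sample (an assumption); the first input is a constant bias equal to 1.
sample = [([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0), ([1.0, 2.0], 1.0), ([1.0, -1.0], 0.0)]
w_cum = w_gen = [0.0, 0.0]
for _ in range(200):
    w_cum = cumulative_cycle(w_cum, sample, rate=0.2)
    w_gen = generalized_cycle(w_gen, sample, rate=0.2)
print(sse([0.0, 0.0], sample), sse(w_cum, sample), sse(w_gen, sample))
```

Each cycle of either rule touches every example once; the generalized rule simply performs $s$ small updates per cycle instead of one.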
Whatever method is used, convergence is never guaranteed. Even if convergence occurs, the process may lead to a local minimum of the error function (depending on the learning rate, the order in which the examples are presented, the initial choice of the weights, etc.). Other optimization algorithms have also been tested for implementing the learning mechanism (conjugate gradient, method of moments, etc.), but backpropagation remains the most popular approach in this framework. See DeBodt, Cottrell, and Levasseur (1996) and Henseler (1995) for additional details.
6.5.2 Model validation
A good adjustment obtained on the training (or learning) dataset does not guarantee that the performance will be satisfactory for all subsequent examples presented to the network. A common phenomenon is called overlearning or overfitting: excellent performance on the training set can be achieved by relying on a very complex architecture and on fine tuning of the weights, but the predictive power of the resulting network may be somewhat chaotic and erratic on any example outside of the training set. Intuitively, this happens when the network incorporates extremely specific features of the training set that are not generalizable to other examples.
Example. Assume that a training set contains three observations associated with three firms. There is one input variable (ROI) and one output variable (with value 1 if the firm is considered financially healthy, and value 0 otherwise).

ROI     Health
0.10    1
0.12    1
0.07    0

A network which specifies that Health = 1 if and only if ROI = 0.10 or ROI = 0.12 would make no error on this dataset, and hence may be considered perfect. It should be clear, however, that such a network would have very low generalizing power: for instance, when probing a new firm with ROI equal to 0.15, it would classify it as unhealthy on this basis.
In order to avoid overlearning, it is often recommended to use the simplest architecture that yields good results on the training set and simultaneously on another, independent dataset. These observations (which are not specific to neural network models, but apply more broadly to all classification or learning frameworks) lead us to rely on, and to distinguish among, different types of data sets:
1. Training set: A data sample $(X^{(i)}, Y^{(i)})$ $(i = 1, \ldots, s)$. The backpropagation algorithm attempts to determine the values of the weights $w^p_{ik}$ which minimize
$$E = SSE_s = \frac{1}{s} \sum_{i=1}^{s} \left( NN(X^{(i)}) - Y^{(i)} \right)^2, \qquad (6.28)$$
where $SSE_s$ is an indicator of the quality of the fit provided by the neural network on this training set.
2. Validation set: A data sample $(X^{(i)}, Y^{(i)})$ $(i = s+1, \ldots, v)$ used to compare the predictions made by the network, viz. $(NN(X^{(s+1)}), \ldots, NN(X^{(v)}))$, with the observed values $(Y^{(s+1)}, \ldots, Y^{(v)})$. As final values of the weights $w^p_{jk}$, we keep those values for which the validation error
$$SSE_v = \frac{1}{v} \sum_{i=s+1}^{v} \left( NN(X^{(i)}) - Y^{(i)} \right)^2 \qquad (6.29)$$
is the smallest, among all values explored by the backpropagation algorithm. Here, $SSE_v$ is viewed as an indicator for the prediction performance of the network.
3. Testing set: A data sample $(X^{(i)}, Y^{(i)})$ $(i = v+1, \ldots, t)$ used after having fixed all the parameters of the network in order to evaluate its performance (independently of the samples used to fix the weights). The error
$$SSE_t = \frac{1}{t} \sum_{i=v+1}^{t} \left( NN(X^{(i)}) - Y^{(i)} \right)^2 \qquad (6.30)$$
observed on the testing set is sometimes called out-of-sample error.
4. Prediction set: A data sample $(X^{(i)})$ $(i = t+1, \ldots, n)$ for which we would like to predict the corresponding (unknown) values $Y^{(i)} = F(X^{(i)})$. Thus, prediction sets are those datasets which are typically of interest to a user of the network, in contrast with its designer.
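The respective roles of the validation and testing sets can be sketched as follows. Everything in this Python fragment is hypothetical: the one-parameter model `predict`, the list of candidate weights (standing for the weights explored by backpropagation) and the small data sets. The error is divided by the number of observations in each set, which is an assumption about the intended normalization.

```python
def mean_squared_error(predict, weights, data):
    """Average squared error of the model on a list of (x, y) observations."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def select_by_validation(predict, candidates, validation):
    """Keep the candidate weights with the smallest validation error."""
    return min(candidates, key=lambda w: mean_squared_error(predict, w, validation))


# Hypothetical one-parameter model standing in for NN(X).
def predict(w, x):
    return w * x


candidates = [0.5, 0.9, 1.0, 1.2]            # weights explored during training
validation = [(1.0, 1.0), (2.0, 2.0)]        # used to choose among the candidates
testing = [(3.0, 3.0), (4.0, 4.0)]           # used only once the weights are fixed

best = select_by_validation(predict, candidates, validation)
out_of_sample_error = mean_squared_error(predict, best, testing)
print(best, out_of_sample_error)
```

The testing set plays no part in the choice of `best`, which is what makes the last figure an out-of-sample error.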
6.6 Applications
The economic and management literature contains a huge number of reports on the use of neural networks for the analysis of real-world problems. We only mention here a few representative pointers.
Numerous financial applications are surveyed by DeBodt, Cottrell, and Levasseur (1996) and Fadlalla and Lin (2001).
Swamidass, Nair, and Mistry (1999) construct a neural network model of factories in order to investigate the impact on productivity (output) when selected input variables (including product line width) are changed.
A specific production management model developed for a major semiconductor manufacturing plant is reported by Mollaghasemi, LeCroy, and Georgiopoulos (1998).
West, Brockett, and Golden (1997) examine the ability of neural networks to predict consumer behavior. More generally, Hill (1996) provides a comparison of the accuracy of neural networks and classical statistical models for time series forecasting.
6.7 Notes on PROPAGATOR software
The software has an online help. We only provide here some additional information in order to facilitate
its use and to establish the connection with the material covered in our general presentation of neural
networks.
6.7.1 Input files
The input (sample) files are text files containing one line per observation, in the following format:
$$\begin{array}{cccccc}
X^{(1)}_1 & X^{(1)}_2 & X^{(1)}_3 & \cdots & X^{(1)}_n & Y^{(1)} \\
X^{(2)}_1 & X^{(2)}_2 & X^{(2)}_3 & \cdots & X^{(2)}_n & Y^{(2)} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
X^{(s)}_1 & X^{(s)}_2 & X^{(s)}_3 & \cdots & X^{(s)}_n & Y^{(s)}
\end{array}$$
where $X^{(i)}_1, X^{(i)}_2, X^{(i)}_3, \ldots, X^{(i)}_n$ are the values of the input variables for the $i$th observation and $Y^{(i)}$ is the value of the output variable for the $i$th observation. Separators can be either spaces, or tabs, or line returns.
PROPAGATOR provides menus to work with training sets, validation sets and testing sets. Note however that it cannot directly read a prediction dataset. This difficulty can easily be resolved by creating a data file where the output values $Y^{(i)}$ are arbitrary (for instance, 0) and by handling this file as a testing file. The values $Y^{(i)}$ can be ignored in the interpretation of the results, since we are only interested in the value $NN(X^{(i)})$ computed by the neural network.
6.7.2 Menus
File: Allows the user to open, create or save a network.
Graphs: Three types of graphs can be visualized.
  Error vs. Cycle: Shows the value of $SSE_s$ (error on the training sample; see equation (6.28)) and $SSE_v$ (error on the validation sample; see equation (6.29)) as a function of the number of cycles performed by the gradient algorithm.
  Error vs. Output unit: Shows $SSE_s$ for each of the output nodes (frequently, there is only one such output).
  Output error vs. Target: The output error for observation $i$ is $(Y^{(i)} - NN(X^{(i)}))$; the target is $Y^{(i)}$.
Network:
  Reinitialize: Resets the network to its initial configuration, before optimization.
Preferences:
  Write output vs. target file when testing
  Prompt for file name
  This creates a file containing the pairs $(NN(X^{(i)}), Y^{(i)})$ after each application of the network on the test file.
  Write best weights file
  This creates a file (*.bw) containing the value of the weights $w^p_{jk}$ for each arc from node $j$ to node $k$, from layer $p$ to layer $p+1$. This file is created whenever the network is saved (menu File, Save ...).
6.7.3 Main window
Architecture
Number of layers: Including input and output layers.
Nodes per layer: Number of nodes in successive layers; for instance, 3 2 2 1. The first layer always includes a constant input node (bias).
Transfer functions: All functions in the same layer are of the same type, either L (linear, i.e. identity) or S (sigmoid, i.e. $f(x) = \frac{1}{1+e^{-x}}$).
Note: When the sigmoid function is used, it is recommended, for better results, to normalize all data (input and output) between 0 and 1. This can be achieved by running the procedure Scale on all sample files.
Connectivity: Full: all the arcs between successive layers are present in the network. Partial: some arcs are missing; the arcs of the network are defined in an auxiliary file.
Learning rule: Select either Generalized Delta rule or Cumulative Delta rule, depending on the learning algorithm that you want to use.
Training parameters
Learning rate: This is the parameter $\alpha$ in the update formula of the gradient algorithm:
$$w_{k+1} := w_k - \alpha \, \mathrm{grad}_{SSE_s}(w_k). \qquad (6.31)$$
Typical values for $\alpha$ range between 0 and 1 (often in [0.2, 0.6]).
Momentum factor: This is a second factor which can be used in the update formula (6.31) in order to accelerate convergence during the learning phase. This value should be modified with care, by trial and error. Typical values: [0.2, 0.6].
Total training cycles: Parameter $T$ which determines the main stopping criterion. The gradient algorithm stops after $T$ cycles.
Minimum training error: Parameter $\varepsilon$ which determines a second stopping criterion. The gradient algorithm stops when the average Sum of Squared Errors on the training sample is smaller than $\varepsilon$:
$$\frac{1}{s} SSE_s = \frac{1}{s} \sum_{i=1}^{s} \left( Y^{(i)} - NN(X^{(i)}) \right)^2 < \varepsilon, \qquad (6.32)$$
where $s$ is the size of the training sample, $NN(X^{(i)})$ is the value returned by the neural network on the $i$th input point $X^{(i)}$, and $Y^{(i)}$ is the output value in the $i$th observation.
Training pattern order: Order in which the training examples are presented to the algorithm (cf. Generalized Delta rule).
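The interplay of these training parameters can be sketched in Python. This is not PROPAGATOR's code: the text does not state how the momentum factor enters the update, so the classical form (new step = momentum × previous step − learning rate × gradient) is an assumption, as are the quadratic error function and all names used below.

```python
def train(grad, err, w, rate, momentum, max_cycles, min_error):
    """Gradient iterations with two stopping criteria: T cycles or error below epsilon."""
    step = [0.0] * len(w)
    for cycle in range(1, max_cycles + 1):
        g = grad(w)
        # Assumed momentum form: reuse a fraction of the previous step.
        step = [momentum * s - rate * gi for s, gi in zip(step, g)]
        w = [wi + si for wi, si in zip(w, step)]
        if err(w) < min_error:
            return w, cycle, "minimum training error reached"
    return w, max_cycles, "total training cycles reached"


# Stand-in error function (an assumption): err(w) = (w1 - 1)^2 + (w2 + 2)^2.
def err(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


def grad(w):
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


long_run = train(grad, err, [0.0, 0.0], rate=0.2, momentum=0.3,
                 max_cycles=1000, min_error=1e-6)
short_run = train(grad, err, [0.0, 0.0], rate=0.2, momentum=0.3,
                  max_cycles=3, min_error=1e-6)
print(long_run[1:], short_run[1:])
```

Whichever criterion fires first ends the run, so a small $T$ can stop training long before the error threshold is met.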
Chapter 7
Cases
7.1 Container packing at Titanic Corp.
The DuPack company specializes in the development of management information systems and decision support systems (MIS-DSS). It carries out most of its activities in the field of operational logistics: warehousing, shipment preparation, transportation, etc. The growing acceleration and integration of logistic flows in distribution networks, as well as a permanent quest for cost reduction, have led DuPack's customers to express increasingly complex needs regarding the contents of the containers to be shipped: decrease in volumes, optimization of filling rates, reduction of preparation and delivery lead times, etc. Most commercial DSS available for logistic management offer elementary tools for container packing. However, no sophisticated integrated solution seems to exist on the market.
Based on this observation, Laetitia Di Castaprio, DuPack's R&D manager, has launched an evaluation program with the objective of determining the extent to which optimization techniques could improve the performance of their current DSS. She is especially interested in local search heuristics, like simulated annealing, which seem to offer considerable promise in this respect.
Laetitia has first decided to examine a simplified version of the question formulated by one of their clients, Titanic Corp. This client is a trucking company which ships a large variety of parcels from the local airport to distant warehouses. The main defining characteristics of each parcel are its volume and its weight. Before transportation, the parcels must be placed in containers. Several parcels can be packed into the same container, but each container has a maximal (normalized) capacity of one volume unit and one weight unit. The main objective of Titanic Corp. is to minimize the number of containers (and hence, the number of trucks) which are needed in order to deliver all the parcels.
Can you help Laetitia solve this problem? You should develop an optimization heuristic (simulated annealing, tabu search, genetic algorithm) which packs the parcels into a small number of containers. All algorithmic details are left to your creativity. The heuristic will be tested on datasets corresponding to daily shipments performed by Titanic Corp. from the airport to one of its regular customers. Your report should mention the number of containers required for each dataset, and you should compare the value of your solutions to appropriate lower bounds on the optimal number of containers.
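As a possible starting point, the following Python sketch computes a simple lower bound (total volume and total weight, rounded up) and an initial solution by a first-fit decreasing rule. It is not the metaheuristic requested in the case; the parcel data, the sorting key and the numerical tolerance are assumptions.

```python
import math

EPS = 1e-9  # tolerance for floating-point sums


def lower_bound(parcels):
    """Each container holds at most one unit of volume and one unit of weight."""
    total_volume = sum(v for v, _ in parcels)
    total_weight = sum(w for _, w in parcels)
    return max(math.ceil(total_volume - EPS), math.ceil(total_weight - EPS))


def first_fit_decreasing(parcels):
    """Place each parcel, largest first, in the first container where it fits."""
    containers = []  # each container: [used_volume, used_weight, list_of_parcels]
    for v, w in sorted(parcels, key=lambda p: max(p), reverse=True):
        for c in containers:
            if c[0] + v <= 1.0 + EPS and c[1] + w <= 1.0 + EPS:
                c[0] += v
                c[1] += w
                c[2].append((v, w))
                break
        else:
            containers.append([v, w, [(v, w)]])
    return containers


# Hypothetical parcels given as (volume, weight) pairs.
parcels = [(0.5, 0.2), (0.4, 0.7), (0.3, 0.3), (0.6, 0.1), (0.2, 0.6), (0.5, 0.5)]
packing = first_fit_decreasing(parcels)
print(len(packing), lower_bound(parcels))
```

A packing produced this way can serve as the initial solution of a local search, and the gap between the two printed numbers bounds how much the search can still gain.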
7.2 Stacking boxes at Gizeh Inc.
Now that Laetitia Di Castaprio has successfully tackled the challenge formulated by Titanic Corp. (see Case 7.1), she turns to the problem posed by Gizeh Inc., another of DuPack's trucking customers. Gizeh ships rectangular boxes of different sizes. The boxes are all relatively light and of the same height, so that their distinguishing features are mostly their length and width. Rather than packing the boxes into containers, the employees stack them upon each other, put each stack on a pallet and load the pallets directly in the trucks. In order to fit in the trucks and to ensure stability, the stacks must obey two simple rules: first, a stack cannot consist of more than 12 boxes; second, the boxes must be stacked up in a pyramidal shape so that, if box A lies under box B in a stack, then A must be at least as long and at least as wide as B. The main objective of Gizeh Inc. is to minimize the number of stacks used for a given collection of boxes.
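The two stacking rules translate directly into a feasibility test, sketched below in Python. The representation of a stack as a bottom-to-top list of (length, width) pairs, the absence of box rotation and the function names are assumptions; the bound shown only exploits the limit of 12 boxes per stack.

```python
import math

MAX_BOXES_PER_STACK = 12


def is_feasible_stack(stack):
    """stack lists the boxes as (length, width) pairs, from bottom to top."""
    if len(stack) > MAX_BOXES_PER_STACK:
        return False
    # Pyramidal shape: each box is at least as long and as wide as the box above it.
    return all(
        lower[0] >= upper[0] and lower[1] >= upper[1]
        for lower, upper in zip(stack, stack[1:])
    )


def cardinality_lower_bound(n_boxes):
    """No solution can use fewer than ceil(n / 12) stacks."""
    return math.ceil(n_boxes / MAX_BOXES_PER_STACK)


print(is_feasible_stack([(5, 4), (5, 3), (2, 2)]))  # pyramidal
print(is_feasible_stack([(5, 4), (6, 3)]))          # upper box is longer
print(cardinality_lower_bound(25))
```

Checking only consecutive boxes is enough, since the "at least as long and as wide" relation is transitive.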
Can you help Laetitia develop an efficient algorithm for this stacking problem? The algorithm should be based on one of the classical metaheuristics for CO problems (either simulated annealing, or tabu search, or genetic algorithms) and should be tested on several datasets. As usual, the value of the solutions should be compared to appropriate lower bounds on the optimal number of stacks.
Can you also give a 0–1 linear programming formulation of the problem?
References: Bischoff (1991), Moonen and Spieksma (2003).
7.3 A high technology routing system for Meals-on-Wheels
Routing problems must be solved daily by countless companies delivering goods or services (be it express mail, heating oil, food or air transportation) to spatially dispersed customers.
In this project, we consider the situation faced by Meals-on-Wheels, a small organization which delivers hot meals to senior citizens located in a North American city (this case is inspired by a study by Bartholdi, Platzman, Collins, and Warden (1983)). Four drivers are in charge of deliveries. We assume that, for all practical purposes, the capacity of their vans is unlimited. All drivers start at the same time from a single depot. For simplicity, we assume that driving time is directly proportional to distance and that the time spent at each customer's location is negligible in comparison with driving time. If we describe the customers' addresses by $(x, y)$-coordinates, then the distance between any pair of subscribers can be approximated by their Manhattan distance: $\mathrm{dist}((a, b), (c, d)) = |a - c| + |b - d|$.
The daily challenge of Meals-on-Wheels is to deliver the last meal as early as possible: this is their primary performance criterion. As a secondary criterion, the firm would also like to minimize the average delivery time of meals (averaged over all customers). Can you help them with their problem?
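Both criteria are easy to evaluate for a given delivery plan, as the following Python sketch suggests. The plan representation (one ordered list of customer coordinates per driver), the unit speed and the sample coordinates are assumptions made for the illustration.

```python
def manhattan(p, q):
    """dist((a, b), (c, d)) = |a - c| + |b - d|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def evaluate_plan(depot, routes):
    """Return (time of the last delivery, average delivery time) for a plan.

    routes holds one list of customer coordinates per driver, in visiting order;
    driving time is taken equal to distance (unit speed, an assumption).
    """
    delivery_times = []
    for route in routes:
        clock, position = 0, depot
        for customer in route:
            clock += manhattan(position, customer)
            position = customer
            delivery_times.append(clock)
    return max(delivery_times), sum(delivery_times) / len(delivery_times)


# Hypothetical instance: four drivers, three customers.
depot = (0, 0)
routes = [[(1, 2), (3, 2)], [(-1, -1)], [], []]
last_delivery, average_delivery = evaluate_plan(depot, routes)
print(last_delivery, average_delivery)
```

A simulated annealing or genetic algorithm would call such an evaluation function on every candidate plan, comparing first the last delivery time and then the average.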
You are asked to develop a simulated annealing algorithm or a genetic algorithm to handle the delivery problem of Meals-on-Wheels. All algorithmic details are left to your creativity! In order to evaluate the quality of your algorithm, a numerical table containing the $(x, y)$ coordinates of the depot and of 120 (current) customers will be provided separately. You should compare the quality of the solution that you obtain to a lower bound on its optimal value.
The quality of your work will be evaluated on the basis of:
• the quality of the delivery plan computed for the numerical instance;
• the correctness, elegance, sophistication and efficiency of the algorithmic approach;
• the quality of the written report and of the oral presentation.
Reference: Bartholdi, Platzman, Collins, and Warden (1983).
7.4 Operations scheduling in Hobbitland
The Hobbit company manufactures steel rings of various sizes and shapes, according to the specifications imposed by industrial customers who use the rings as essential parts of their equipment. Producing a ring usually requires a rather long sequence of operations which are carried out on the 15 high-precision machines available in the shops of Moria. Some sophisticated rings may require up to 15 successive manufacturing steps.
Every week, a new production schedule has to be established for the current orders. Mr. Frodo, scheduling manager at Hobbit, is responsible for this task. He receives from Mr. Sauron (planning manager) a collection of production orders together with their routing sequence. A small example is shown below. In this example, the machines are numbered 0, 1, ..., 4 and there are 3 rings to be processed.

Ring    Sequence
1       (0,80) (2,55) (4,30)
2       (1,75) (0,60) (2,40)
3       (3,18) (2,56)

Ring 1 is to be processed on machine 0 for 80 minutes, next on machine 2 for 55 minutes, and finally on machine 4 for 30 minutes. Ring 2 goes first on machine 1 for 75 minutes, then on machine 0 for 60 minutes, then on machine 2 for 40 minutes; and so on. Note that the rings can be processed in any order, but the routing sequence of each ring must be strictly respected, and every machine can only process one ring at any time. Moreover, once an operation is started on a machine, it cannot be interrupted until its total processing time has elapsed.
Now, Mr. Frodo's problem is to design a schedule for the set of orders released by Mr. Sauron in such a way as to complete all orders as early as possible. For instance, in the above example, it is possible to process all jobs within 180 minutes: just start every operation as early as possible! But unfortunately, this simple rule does not always yield optimal schedules, and Mr. Frodo spends a lot of time in his quest for the elusive ideal scheduling rule.
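The 180-minute schedule of the example can be reproduced with a short Python sketch. The dispatching rule used here (repeatedly schedule the pending operation that can start earliest, ties broken by ring number) is one possible reading of "start every operation as early as possible" and is an assumption, as are the function names. The lower bound shown is the larger of the longest ring routing and the heaviest machine load.

```python
def earliest_start_schedule(routings):
    """Greedy dispatching: always schedule the operation that can start earliest."""
    ready = {ring: 0 for ring in routings}     # time at which each ring is available
    next_op = {ring: 0 for ring in routings}   # index of the next operation per ring
    machine_free = {}                          # time at which each machine is free
    schedule = []
    remaining = sum(len(ops) for ops in routings.values())
    while remaining:
        best = None
        for ring in sorted(routings):
            k = next_op[ring]
            if k == len(routings[ring]):
                continue
            machine, duration = routings[ring][k]
            start = max(ready[ring], machine_free.get(machine, 0))
            if best is None or start < best[0]:
                best = (start, ring, machine, duration)
        start, ring, machine, duration = best
        end = start + duration
        schedule.append((ring, machine, start, end))
        ready[ring] = end
        machine_free[machine] = end
        next_op[ring] += 1
        remaining -= 1
    return schedule, max(op[3] for op in schedule)


def makespan_lower_bound(routings):
    """Longest total routing of a ring, or heaviest total load of a machine."""
    ring_bound = max(sum(d for _, d in ops) for ops in routings.values())
    loads = {}
    for ops in routings.values():
        for machine, duration in ops:
            loads[machine] = loads.get(machine, 0) + duration
    return max(ring_bound, max(loads.values()))


# Data of the example: ring -> list of (machine, processing time in minutes).
routings = {
    1: [(0, 80), (2, 55), (4, 30)],
    2: [(1, 75), (0, 60), (2, 40)],
    3: [(3, 18), (2, 56)],
}
schedule, makespan = earliest_start_schedule(routings)
print(makespan, makespan_lower_bound(routings))
```

On this instance the rule gives a makespan of 180 minutes against a lower bound of 175 (the total routing time of ring 2).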
Can you help Frodo by developing a decision support system which would assist him in his planning activities? The DSS should be able to read the production routings of the rings and to propose a feasible schedule. As mentioned above, Frodo's objective is to minimize the production makespan, i.e. the completion time of the last operation.
The core of your DSS must be either a simulated annealing algorithm or a genetic algorithm. Your algorithm is to be tested on numerical examples provided by the instructor. Your report should mention the makespan of the schedule that you have obtained for each example, and you should compare the value of your solutions to appropriate lower bounds on the corresponding optimal makespans.
References: Crama (2002).
7.5 Setup optimization for the assembly of printed circuit boards
The company Starac&Demy is a subsidiary of L5, a large producer of electronic devices. Starac&Demy specializes in the assembly of printed circuit boards (PCBs) which are used within various end products. PCB assembly consists in inserting a number of electronic components of prespecified types at prespecified locations on a bare board. Several hundred components of a few distinct types (resistors, capacitors, transistors, integrated circuits, etc.) may be inserted on each board.
The assembly shop at Starac&Demy consists of several computerized machines which take care of the assembly operations (see Crama, Spieksma and van de Klundert (1999) for details). Each machine can perform a large number of high speed, high precision assembly operations involving various tools and components. The machines are laid out into three distinct assembly lines and a conveyor connects the machines within each line. Each line can handle at most 20 different component types, which are assigned to the line as part of the production setup procedures.
Indeed, at the beginning of every month, Mario (the shop manager) dedicates each line to the production of certain PCBs which appear in the monthly production plan prepared by Jennifer (the production manager). For this purpose, Mario takes into account the Bill-Of-Materials (BOM) table illustrated below. In this table, each line corresponds to a PCB type appearing in Jennifer's plan (for instance, a PCB to be included in cell phones) and each column corresponds to a type of electronic component (for instance, a certain model of transistor). The entry $a_{ij} = 1$ if PCB type $i$ requires component $j$; otherwise, $a_{ij} = 0$.
Then, Mario attempts to partition the rows and columns of the table into three subsets each, so as to
identify three diagonal blocks corresponding to homogeneous subfamilies of PCBs and of components.
To illustrate this approach, consider the small example in Table 7.1, where we assume for simplicity that
there are only two assembly lines and that each line can hold at most 5 component types.
1 2 3 4 5 6 7 8 9 10
1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1 1
6 1 1 1 1
Table 7.1: Original BOM table
After permuting its rows and columns, this table can be transformed into the table shown in Table
7.2.
This new presentation suggests that the PCBs can be partitioned into two subfamilies, where
3 5 6 9 10 1 2 4 7 8
2 1 1 1 1
4 1 1 1 1
5 1 1 1 1 1
1 1 1 1
3 1 1 1 1
6 1 1 1 1
Table 7.2: Permuted BOM table
• family 1 consists of PCBs 2, 4 and 5, which mostly require components 3, 5, 6, 9 and 10;
• family 2 consists of PCBs 1, 3 and 6, which only require components 1, 2, 4, 7 and 8.
Therefore, Mario can dedicate (during the next month) one of the lines to the insertion of components 3, 5, 6, 9 and 10, and the second line to the insertion of components 1, 2, 4, 7 and 8 (this is in the spirit of the family setup procedures mentioned in Crama, van de Klundert, and Spieksma (2002) and of the group technology approaches discussed for instance in Askin and Standridge (1993)).
Note however that, when producing PCB types 4 and 5, it will be necessary to move the products from line 1 to line 2, an operation which results in costly time losses in the form of material handling and, more importantly, extra line setups. (Alternatively, components 4 and 7 could also be inserted manually on PCBs 4 and 5, which again requires additional time and cost.) Therefore, Mario's objective is to partition the component types so as to minimize the total number of exceptions, that is, the number of 1's which fall outside the diagonal blocks (in the above case, there are three such exceptions).
Can you help Mario by developing a decision support system which will assist him in his monthly planning activities? The DSS should be able to read the BOM table and to propose an assignment of PCBs and component types to each production line, under the following constraints:
a) there are 3 production lines;
b) no more than 20 component types can be assigned to any line;
c) each component type can be assigned to one line only.
As mentioned above, Mario's objective is to minimize the number of exceptions.
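The objective function and the capacity constraint can be evaluated as in the Python sketch below. The small BOM table used here is hypothetical (it is not Table 7.1), as are the function names and the encoding of an assignment by two lists giving the line of each PCB type and of each component type.

```python
def count_exceptions(bom, pcb_line, component_line):
    """Number of 1's of the BOM table lying outside the diagonal blocks."""
    return sum(
        1
        for i, row in enumerate(bom)
        for j, a in enumerate(row)
        if a == 1 and pcb_line[i] != component_line[j]
    )


def respects_capacity(component_line, n_lines, capacity):
    """Constraint b): at most `capacity` component types on each line."""
    return all(component_line.count(line) <= capacity for line in range(n_lines))


# Hypothetical BOM table: 4 PCB types (rows) and 6 component types (columns).
bom = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1],
]
pcb_line = [0, 0, 1, 1]               # line assigned to each PCB type
component_line = [0, 0, 0, 1, 1, 1]   # line assigned to each component type

exceptions = count_exceptions(bom, pcb_line, component_line)
feasible = respects_capacity(component_line, n_lines=2, capacity=3)
print(exceptions, feasible)
```

Constraint c) holds by construction in this encoding, since `component_line` stores exactly one line per component type.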
The core of your DSS must be either a simulated annealing or a genetic algorithm. The algorithm should be tested on several large BOM tables. Your report should mention the value of the solution that you have obtained for each BOM table and you should compare this value to appropriate lower bounds on the optimal number of exceptions.
References: Askin and Standridge (1993), Crama, van de Klundert, and Spieksma (2002), Jain, Johnson, and Safai (1996).
7.6 A new product line for Legiacom
For many years, Legiacom has benefited from a quasi-monopolistic situation on the Walloon telecommunication market. Recently, however, this comfortable situation has been threatened by the opening of the European market and by the increasing pressure of competitors who rely on aggressive marketing strategies. In order to position itself favorably in this emerging competitive context, Legiacom has decided to reengineer its operations and to improve all its activities, ranging from the design of its communication networks to direct customer service.
Pierre Hafeux, R&D manager in the marketing department, has been given the specific assignment to redesign the wireless telephone product line. His mission has been formulated as follows in the project description approved by the Marketing Director:
As part of his/her basic contract, every Legiacom subscriber will be allowed to choose a
wireless telephone set among 4 types of sets to be designed by the R&D division. The global
objective of the R&D division is to satisfy as well as possible the needs and wishes expressed
by all our basic customer segments.
Before diving into his assignment, Pierre Hafeux has studied the (broad) literature dealing with the development and positioning of new products. He has elected to represent the customers' preferences by the conjoint analysis method. In this approach, a number of attributes (or criteria) are selected to describe the most relevant features of a product type, as viewed by the customers. Each attribute possesses several modalities: for instance, for a phone set, the attribute "color" could assume any of the modalities "white", "black", "grey" or "brown". On the basis of market research and statistical studies, the analyst computes a set of parameters $w_{ijk}$, where $w_{ijk}$ indicates the value which the customers in market segment $i$ assign to modality $j$ of attribute $k$ (a higher value means a more attractive feature for this market segment).
Pierre Hafeux has selected 7 attributes which seem to explain the preferences expressed by the customers for different telephone sets. Here is the list of these attributes, together with their respective modalities:
1. design: modern, traditional, futuristic;
2. color: white, black, grey, brown;
3. keys: round, square, rectangular;
4. support: desktop or wall;
5. screen display: black or red;
6. memory: special keys or numeric codes;
7. weight: ultralight or normal.
The modalities of each attribute can be arbitrarily combined to compose a telephone type.
The market studies performed by Legiacom have identified 8 distinct market segments with similar population sizes. The $w_{ijk}$ values have been estimated for these different segments (they are measured on a 5-point scale, ranging from 0 to 4). They are shown in Tables 7.3–7.9 hereunder; the value $w_{ijk}$ is given by element $(i, j)$ in matrix $W(k)$. So, for instance, the value of modality "black" for the attribute "screen display" is equal to 4 for market segment 3.
The utility (or total value) of a telephone type described by a set of modalities $M$ is computed, for each market segment $i$, by addition of the $w_{ijk}$ values corresponding to $M$. For instance, the utility, for each customer in market segment 2, of a modern black phone set with round keys, desktop support, black screen display, numeric code memory and normal weight, is equal to: 1 + 4 + 3 + 4 + 0 + 1 + 3 = 16.
If we further assume that every customer selects the phone set which offers him the highest total
utility, then the problem to be solved by Pierre Hafeux can be stated easily: he must define 4 types of
telephones (described by their respective modalities) so as to maximize the sum of the utilities collected
by the entire customer population.
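One possible way to write this problem down is sketched here. The symbols $p$, $p^t$ and $u_i$ are notation introduced for this sketch and do not appear in the case description. The formulation also assumes that all 8 segments carry the same weight, which seems reasonable since their population sizes are similar. A telephone type is a vector $p = (p_1, \dots, p_7)$, where $p_k$ is the modality chosen for attribute $k$.

```latex
\[
  u_i(p) = \sum_{k=1}^{7} w_{i\,p_k\,k},
  \qquad
  \max_{p^1,\dots,p^4} \; \sum_{i=1}^{8} \; \max_{t=1,\dots,4} \, u_i(p^t).
\]
```

The inner maximum expresses that each segment buys the type it prefers among the 4 types offered.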
You are asked to develop an algorithm for the computation of an optimal product line. Your algorithm
must be based on the generic guidelines offered by metaheuristics like simulated annealing, tabu search,
or genetic algorithms. It should be flexible enough to handle arbitrary datasets with a similar structure.
The quality of the solution that you obtain should be assessed against appropriate upper bounds on the
optimal value.
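As a starting point, here is a minimal sketch of one possible approach. It is not a reference solution. It assumes equally weighted segments, a neighborhood that changes one modality of one telephone type, and a geometric cooling schedule; the function names and the parameter values (`steps`, `t_start`, `t_end`) are arbitrary choices made for the illustration. The upper bound used here is the simplest one available: the value obtained if every segment could be offered its own ideal telephone.

```python
import math
import random

# W[k][i][j]: value given by segment i to modality j of attribute k
# (0-based indices, data copied from Tables 7.3-7.9).
W = [
    [[1, 1, 4], [1, 4, 1], [4, 2, 1], [4, 1, 2], [3, 4, 2], [4, 1, 2], [3, 2, 4], [4, 1, 2]],  # design
    [[4, 1, 1, 3], [2, 4, 2, 1], [0, 0, 2, 4], [1, 4, 0, 2],
     [1, 0, 4, 1], [2, 1, 3, 4], [1, 4, 2, 0], [2, 0, 4, 1]],                                  # color
    [[4, 1, 2], [3, 0, 4], [4, 0, 1], [0, 4, 1], [3, 2, 0], [1, 4, 2], [1, 2, 4], [0, 2, 4]],  # keys
    [[2, 1], [4, 0], [0, 3], [1, 4], [2, 4], [4, 1], [0, 3], [4, 0]],                          # support
    [[3, 1], [0, 4], [4, 1], [0, 3], [2, 1], [1, 4], [1, 4], [3, 0]],                          # screen display
    [[1, 4], [3, 1], [0, 3], [4, 1], [4, 1], [2, 4], [0, 3], [4, 1]],                          # memory
    [[1, 4], [0, 3], [4, 0], [3, 1], [1, 4], [0, 4], [4, 1], [4, 1]],                          # weight
]
N_SEGMENTS = 8
N_PHONES = 4


def utility(phone, i):
    """Utility, for segment i, of a phone given as a list of modality indices."""
    return sum(W[k][i][j] for k, j in enumerate(phone))


def line_value(line):
    """Each segment picks its preferred phone; segments weigh equally (assumption)."""
    return sum(max(utility(p, i) for p in line) for i in range(N_SEGMENTS))


def upper_bound():
    """Value reached if every segment were offered its own ideal phone."""
    return sum(max(row) for table in W for row in table)


def random_line(rng):
    """Draw N_PHONES phones with uniformly random modalities."""
    return [[rng.randrange(len(W[k][0])) for k in range(len(W))] for _ in range(N_PHONES)]


def simulated_annealing(seed=0, steps=5000, t_start=5.0, t_end=0.05):
    rng = random.Random(seed)
    current = random_line(rng)
    cur_val = line_value(current)
    best, best_val = [p[:] for p in current], cur_val
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        t, k = rng.randrange(N_PHONES), rng.randrange(len(W))
        old = current[t][k]
        # Move: give attribute k of phone t a different modality.
        current[t][k] = rng.choice([j for j in range(len(W[k][0])) if j != old])
        new_val = line_value(current)
        delta = new_val - cur_val
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur_val = new_val  # accept the move
            if cur_val > best_val:
                best, best_val = [p[:] for p in current], cur_val
        else:
            current[t][k] = old  # reject the move
    return best, best_val


if __name__ == "__main__":
    line, value = simulated_annealing()
    print("best line found:", line)
    print("value:", value, "upper bound:", upper_bound())
```

For the data of Tables 7.3-7.9 this simple bound evaluates to 209. Since only 4 types can be offered to 8 segments, the bound is probably not tight, and the gap between it and the value returned by the heuristic should be interpreted with care; sharper bounds are worth investigating.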
References: This case was inspired by the article by Kohli and Krishnamurti (1987). More details on
conjoint analysis can be found, for instance, in Green and Krieger (1989).
Segment  Modern  Traditional  Futuristic
1        1       1            4
2        1       4            1
3        4       2            1
4        4       1            2
5        3       4            2
6        4       1            2
7        3       2            4
8        4       1            2
Table 7.3: W(1): design
Segment  White  Black  Grey  Brown
1        4      1      1     3
2        2      4      2     1
3        0      0      2     4
4        1      4      0     2
5        1      0      4     1
6        2      1      3     4
7        1      4      2     0
8        2      0      4     1
Table 7.4: W(2): color
Segment  Round  Square  Rectangular
1        4      1       2
2        3      0       4
3        4      0       1
4        0      4       1
5        3      2       0
6        1      4       2
7        1      2       4
8        0      2       4
Table 7.5: W(3): keys
Segment  Desktop  Wall
1        2        1
2        4        0
3        0        3
4        1        4
5        2        4
6        4        1
7        0        3
8        4        0
Table 7.6: W(4): support
Segment  Black  Red
1        3      1
2        0      4
3        4      1
4        0      3
5        2      1
6        1      4
7        1      4
8        3      0
Table 7.7: W(5): screen display
Segment  Special keys  Numeric codes
1        1             4
2        3             1
3        0             3
4        4             1
5        4             1
6        2             4
7        0             3
8        4             1
Table 7.8: W(6): memory
Segment  Ultralight  Normal
1        1           4
2        0           3
3        4           0
4        3           1
5        1           4
6        0           4
7        4           1
8        4           1
Table 7.9: W(7): weight
Bibliography
Aarts, E. and J. Lenstra (1997). Local Search in Combinatorial Optimization. New York: John Wiley
& Sons.
Applegate, D. L., R. E. Bixby, V. Chvátal, and W. J. Cook (2006). The Traveling Salesman Problem.
Princeton: Princeton University Press.
Askin, R. G. and C. R. Standridge (1993). Modeling and Analysis of Manufacturing Systems. New
York: Wiley.
Barnhart, C., E. Johnson, G. Nemhauser, G. Sigismondi, and P. Vance (1993). Formulating a mixed
integer distribution problem to improve solvability. Operations Research 41, 1013-1019.
Bartholdi, J. J., L. K. Platzman, R. L. Collins, and W. H. Warden (1983). A minimal technology
routing system for Meals-on-Wheels. Interfaces 13, 1-8.
Bertsimas, D. (1995). Nonlinear Programming. Belmont, Massachusetts: Athena Scientific.
Bischoff, E. E. (1991). Stability aspects of pallet loading. OR Spektrum 13, 189-197.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
Bollapragada, S., H. Cheng, M. Phillips, M. Garbiras, M. Scholes, T. Gibbs, and M. Humphreville
(2002). NBC's optimization systems increase revenues and productivity. Interfaces 32, 47-60.
Braspenning, P., F. Thuijsman, and A. Weijters (1995). Artificial Neural Networks, Volume 931 of
Lecture Notes in Computer Science. Berlin Heidelberg New York: Springer-Verlag.
Crama, Y. (2002). Gestion de la Production. Université de Liège: Lecture notes.
Crama, Y., A. Kolen, and E. Pesch (1995). Local search in combinatorial optimization (Braspenning,
P.J., Thuijsman, F. and Weijters, A.J.M.M., eds.), Volume 931 of Lecture Notes in Computer Science,
pp. 157-174. Berlin Heidelberg New York: Springer-Verlag.
Crama, Y., A. G. Oerlemans, and F. C. R. Spieksma (1996). Production Planning in Automated
Manufacturing. Berlin: Springer.
Crama, Y., J. van de Klundert, and F. C. R. Spieksma (2002). Production planning problems in printed
circuit board assembly. Discrete Applied Mathematics 123, 339-361.
DeBodt, E., M. Cottrell, and M. Levasseur (1996). Les réseaux de neurones en finance: Principes et
revue de la littérature. Finance 16, 25-92.
Dennis, J. and R. Schnabel (1996). Numerical Methods for Unconstrained Optimization and Nonlinear
Equations. Philadelphia, PA: SIAM Classics in Applied Mathematics.
Fadlalla, A. and C.-H. Lin (2001). An analysis of the applications of neural networks in finance.
Interfaces 31, 112-122.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks.
Neural Networks 2, 183-192.
Glover, F. and M. Laguna (1997). Tabu Search. Boston, Mass., etc.: Kluwer Academic Publishers.
Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading,
Mass., etc.: Addison-Wesley Publ. Co.
Green, P. E. and A. M. Krieger (1989). Recent contributions to optimal product positioning and buyer
segmentation. European Journal of Operational Research 41, 127-141.
Hansen, P. and B. Jaumard (1990). Algorithms for the maximum satisfiability problem. Computing 44,
279-303.
Henseler, J. (1995). Back Propagation (Braspenning, P.J., Thuijsman, F. and Weijters, A.J.M.M., eds.),
Volume 931 of Lecture Notes in Computer Science. Berlin Heidelberg New York: Springer-Verlag.
Hertz, A. (1997). A colourful look on evolutionary techniques. Belgian Journal of Operations Research,
Statistics and Computer Science (JORBEL) 35, 23-39.
Hill, T. (1996). Neural network models for time series forecasts. Management Science 42, 1082-1090.
Hoos, H. H. and T. Stützle (2005). Stochastic Local Search: Foundations and Applications. San Francisco:
Morgan Kaufmann Publishers.
Jain, S., M. E. Johnson, and F. Safai (1996). Implementing setup optimisation on the shop floor.
Operations Research 44, 843-851.
Johnson, D. S., C. R. Aragon, L. A. McGeoch, and C. Schevon (1989). Optimization by simulated
annealing: An experimental evaluation. Part I, Graph partitioning. Operations Research 37, 865-892.
Johnson, D. S., C. R. Aragon, L. A. McGeoch, and C. Schevon (1991). Optimization by simulated
annealing: An experimental evaluation. Part II, Graph coloring and number partitioning. Operations
Research 39, 378-406.
Klee, V. and G. Minty (1972). How good is the simplex algorithm? (O. Shisha, ed.), pp. 159-175.
Academic Press, New York.
Kohli, R. and R. Krishnamurti (1987). A heuristic approach to product design. Management Science 33,
1525-1533.
Kolen, A. and E. Pesch (1994). Genetic local search in combinatorial optimization. Discrete Applied
Mathematics 48, 273-284.
Merz, P. and B. Freisleben (2001). Memetic algorithms for the traveling salesman problem. Complex
Systems 13, 297-345.
Mollaghasemi, M., K. LeCroy, and M. Georgiopoulos (1998). Application of neural networks and
simulation modeling in manufacturing system design. Interfaces 28, 100-114.
Moonen, L. S. and F. C. R. Spieksma (2003). Partitioning a permutation graph: algorithms and an
application. Working paper, Katholieke Universiteit Leuven.
Moore, G. (1965). Cramming more components onto integrated circuits. Electronics 38, 114-117.
Mühlenbein, H. (1997). Genetic algorithms (Aarts, E., and Lenstra, J.K., eds.). Lecture Notes in
Computer Science, Springer. New York: John Wiley & Sons.
Nemhauser, G. and L. Wolsey (1988). Integer and Combinatorial Optimization. New York:
Wiley-Interscience.
Oliveira, J., J. Ferreira, and R. Vidal (1993). Algorithms for nesting problems. Belgian Journal of
Operations Research, Statistics and Computer Science (JORBEL) 33, 843-851.
Papadimitriou, C. and K. Steiglitz (1982). Combinatorial Optimization: Algorithms and Complexity.
Englewood Cliffs: Prentice Hall.
Pirlot, M. (1992). General local search heuristics in combinatorial optimization: a tutorial. Belgian
Journal of Operations Research, Statistics and Computer Science (JORBEL) 32, 9-67.
Press, W., S. Teukolsky, W. Vetterling, and B. Flannery (1995). Numerical Recipes in Pascal.
Cambridge: Cambridge University Press.
Ripley, C. (1996). Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University
Press.
Smith, M. (1996). Neural Networks for Statistical Modeling. Boston, MA: International Thomson
Computer Press.
Swamidass, P., S. Nair, and S. Mistry (1999). The use of a neural factory to investigate the effect of
product line width on manufacturing performance. Management Science 45, 1524-1538.
Tovey, C. (2002). Tutorial on computational complexity. Interfaces 32, 30-61.
Tyagi, R. and S. Bollapragada (2003). SES Americom maximizes satellite revenues by optimally
configuring transponders. Interfaces 33, 36-44.
West, P., P. Brockett, and L. Golden (1997). A comparative analysis of neural networks and statistical
methods for predicting consumer choice. Marketing Science 16, 370-391.
Yu, G., M. Arguello, G. Song, S. McCowan, and A. White (2003). A new era for crew recovery at
Continental Airlines. Interfaces 33, 5-22.