
# Nonlinear Programming Models

## Fabio Schoen 2008

Introduction

http://gol.dsi.unifi.it/users/schoen

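## NLP problems

$$\min f(x), \qquad x \in S \subseteq \mathbb{R}^n$$

Standard form:

$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0,\ i = 1, \dots, m; \qquad g_j(x) \le 0,\ j = 1, \dots, k$$

## Local and global optima

A global minimum or global optimum is any $x^\star \in S$ such that

$$x \in S \Rightarrow f(x) \ge f(x^\star)$$

A point $\bar{x}$ is a local optimum if $\exists\, \varepsilon > 0$ such that

$$x \in S \cap B(\bar{x}, \varepsilon) \Rightarrow f(x) \ge f(\bar{x})$$

where $B(\bar{x}, \varepsilon) = \{x \in \mathbb{R}^n : \|x - \bar{x}\| \le \varepsilon\}$ is a ball in $\mathbb{R}^n$. Any global optimum is also a local optimum, but the opposite is generally false.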


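## Convex Functions

A set $S \subseteq \mathbb{R}^n$ is convex if

$$x, y \in S \Rightarrow \lambda x + (1 - \lambda) y \in S$$

for all choices of $\lambda \in [0, 1]$. Let $\Omega \subseteq \mathbb{R}^n$ be a non-empty convex set. A function $f : \Omega \to \mathbb{R}$ is convex iff

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$

for all $x, y \in \Omega$ and $\lambda \in [0, 1]$.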

## Properties of convex functions

Every convex function is continuous in the interior of $\Omega$. It might be discontinuous, but only on the frontier. If $f$ is continuously differentiable then it is convex iff

$$f(y) \ge f(x) + (y - x)^T \nabla f(x)$$

for all $x, y \in \Omega$.

[Figure: Convex functions; labels $x$, $y$]


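If $f$ is twice continuously differentiable then it is convex iff its Hessian matrix is positive semi-definite:

$$\nabla^2 f(x) := \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right] \succeq 0$$

where $\nabla^2 f(x) \succeq 0$ iff $v^T \nabla^2 f(x) v \ge 0$ $\forall\, v \in \mathbb{R}^n$.

Example: an affine function is convex (and concave). For a quadratic function ($Q$: symmetric matrix)

$$f(x) = \frac{1}{2} x^T Q x + b^T x + c$$

we have

$$\nabla f(x) = Qx + b, \qquad \nabla^2 f(x) = Q \quad \Rightarrow \quad f \text{ is convex iff } Q \succeq 0$$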
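The Hessian test for quadratics can be tried numerically. The sketch below is only an illustration under stated assumptions: it is restricted to $2 \times 2$ symmetric matrices (where positive semidefiniteness is equivalent to nonnegative principal minors), and the helper names `is_psd_2x2`, `quad` and `violates_convexity` are hypothetical, not part of the slides.

```python
import random


def is_psd_2x2(q11, q12, q22, tol=1e-12):
    """A symmetric 2x2 matrix [[q11, q12], [q12, q22]] is positive
    semidefinite iff all of its principal minors are nonnegative."""
    return q11 >= -tol and q22 >= -tol and q11 * q22 - q12 * q12 >= -tol


def quad(Q, b, c, x):
    """Evaluate f(x) = 1/2 x^T Q x + b^T x + c for a 2x2 matrix Q."""
    Qx = [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]
    return 0.5 * (x[0] * Qx[0] + x[1] * Qx[1]) + b[0] * x[0] + b[1] * x[1] + c


def violates_convexity(Q, b, c, trials=2000, seed=0):
    """Look for a random pair (x, y) and weight lam that breaks
    f(lam x + (1 - lam) y) <= lam f(x) + (1 - lam) f(y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        y = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        lam = rng.random()
        z = [lam * x[i] + (1 - lam) * y[i] for i in range(2)]
        lhs = quad(Q, b, c, z)
        rhs = lam * quad(Q, b, c, x) + (1 - lam) * quad(Q, b, c, y)
        if lhs > rhs + 1e-9:
            return True
    return False
```

With $Q = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ the minor test passes and no violating pair is found; with $Q = \mathrm{diag}(1, -1)$ the minor test fails and random sampling quickly finds a pair that breaks the convexity inequality.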

## Convex Optimization Problems

$$\min_{x \in S} f(x)$$

is a convex optimization problem iff $S$ is a convex set and $f$ is convex on $S$. For a problem in standard form

$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0,\ i = 1, \dots, m; \qquad g_j(x) \le 0,\ j = 1, \dots, k$$

if $f$ is convex, the $h_i(x)$ are affine functions and the $g_j(x)$ are convex functions, then the problem is convex.

## Maximization

Slight abuse in notation: a problem

$$\max_{x \in S} f(x)$$

is called convex iff $S$ is a convex set and $f$ is a concave function. This is not to be confused with minimization of a concave function (or maximization of a convex function), which are NOT convex optimization problems.

## Convex and non convex optimization

Convex optimization is easy, non convex optimization is usually very hard.

Fundamental property of convex optimization problems: every local optimum is also a global optimum (a proof is given later).

Minimizing a positive semidefinite quadratic function on a polyhedron is easy (polynomially solvable); if even a single eigenvalue of the Hessian is negative, the problem becomes NP-hard.

## Convex functions: examples

Many (of course not all . . . ) functions are convex!

- affine functions $a^T x + b$
- quadratic functions $\frac{1}{2} x^T Q x + b^T x + c$ with $Q = Q^T$, $Q \succeq 0$
- any norm is a convex function
- $x \log x$ (however $\log x$ is concave)
- $f$ is convex if and only if $\forall\, x_0, d \in \mathbb{R}^n$ its restriction to any line, $\varphi(\alpha) = f(x_0 + \alpha d)$, is a convex function
- a linear non negative combination of convex functions is convex
- $g(x, y)$ convex in $x$ for all $y$ $\Rightarrow$ $\int g(x, y)\, dy$ convex

## more examples . . .

- $\max_i \{a_i^T x + b\}$ is convex
- $f, g$ convex $\Rightarrow$ $\max\{f(x), g(x)\}$ is convex
- $f_a$ convex functions for any $a \in A$ (a possibly uncountable set) $\Rightarrow$ $\sup_{a \in A} f_a(x)$ is convex
- $f$ convex $\Rightarrow$ $f(Ax + b)$ is convex
- let $S \subseteq \mathbb{R}^n$ be any set $\Rightarrow$ $f(x) = \sup_{s \in S} \|x - s\|$ is convex
- $\mathrm{Trace}(A^T X) = \sum_{i,j} A_{ij} X_{ij}$ is convex (it is linear!)
- $\log \det X^{-1}$ is convex over the set of matrices $X \in \mathbb{R}^{n \times n}$: $X \succ 0$

## Data Approximation

- norm approximation
- maximum likelihood
- robust estimation

## Norm approximation

Problem:

$$\min_x \|Ax - b\|$$

where $A$, $b$ are parameters. Usually the system is over-determined, i.e. $b \notin \mathrm{Range}(A)$. For example, this happens when $A \in \mathbb{R}^{m \times n}$ with $m > n$ and $A$ has full rank. $r := Ax - b$ is the residual.


## Examples

- $\|r\| = \sqrt{r^T r}$: least squares (or regression)
- $\|r\| = \sqrt{r^T P r}$ with $P \succ 0$: weighted least squares
- $\|r\| = \max_i |r_i|$: minimax, or $\ell_\infty$, or Chebyshev approximation
- $\|r\| = \sum_i |r_i|$: absolute or $\ell_1$ approximation

Possible (convex) additional constraints:

- maximum deviation from an initial estimate: $\|x - x_{est}\| \le \epsilon$
- simple bounds: $\ell_i \le x_i \le u_i$
- ordering: $x_1 \le x_2 \le \cdots \le x_n$

## Example: $\ell_1$ norm

Matrix $A \in \mathbb{R}^{100 \times 30}$.

[Figure: histogram of the norm 1 residuals]

## $\ell_\infty$ norm

[Figure: histogram of the $\ell_\infty$ norm residuals]

## $\ell_2$ norm

[Figure: histogram of the norm 2 residuals]
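To make the choice of norm concrete, here is a minimal sketch that evaluates the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of a residual $r = Ax - b$. The data are a toy assumption (fitting a single constant to three numbers, so $A$ is a column of ones), and the helper names are hypothetical. For this special case the best constant is known to be the median, the mean and the midrange respectively.

```python
import math


def residual(A, x, b):
    """r = A x - b, with A given as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) - bi for row, bi in zip(A, b)]


def norm_1(r):
    return sum(abs(v) for v in r)


def norm_2(r):
    return math.sqrt(sum(v * v for v in r))


def norm_inf(r):
    return max(abs(v) for v in r)


# Fitting one constant x to the data b means A is a column of ones.
b = [0.0, 1.0, 10.0]
A = [[1.0], [1.0], [1.0]]
candidates = {
    "mean": sum(b) / len(b),            # expected to win in the 2-norm
    "median": sorted(b)[len(b) // 2],   # expected to win in the 1-norm
    "midrange": (min(b) + max(b)) / 2,  # expected to win in the inf-norm
}
for name, x in candidates.items():
    r = residual(A, [x], b)
    print(name, round(norm_1(r), 3), round(norm_2(r), 3), round(norm_inf(r), 3))
```

The outlier at 10 pulls the least squares fit towards it, while the $\ell_1$ fit stays at the median: this is the usual reason for preferring $\ell_1$ when a few residuals are expected to be large.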


Variants
min
i

[Figure: comparison]

## Maximum likelihood

Given a sample $X_1, X_2, \dots, X_k$ and a parametric family of probability density functions $L(\cdot; \theta)$, the maximum likelihood estimate of $\theta$ given the sample is

$$\hat\theta = \arg\max_\theta L(X_1, \dots, X_k; \theta)$$

Example: linear measures with additive i.i.d. (independent identically distributed) noise:

$$X_i = a_i^T \theta + \varepsilon_i \tag{1}$$

where the $\varepsilon_i$ are i.i.d. random variables with density $p(\cdot)$:

$$L(X_1, \dots, X_k; \theta) = \prod_{i=1}^k p(X_i - a_i^T \theta)$$

## Max likelihood estimate - MLE

(taking the logarithm, which does not change optimum points):

$$\hat\theta = \arg\max_\theta \sum_i \log\big(p(X_i - a_i^T \theta)\big)$$

If $p$ is log-concave, this problem is convex. Examples:

- $\varepsilon \sim N(0, \sigma^2)$, i.e. $p(z) = (2\pi\sigma^2)^{-1/2} \exp(-z^2 / 2\sigma^2)$: the MLE is the $\ell_2$ estimate $\hat\theta = \arg\min \|A\theta - X\|_2$;
- $p(z) = (1/(2a)) \exp(-|z|/a)$: the MLE is the $\ell_1$ estimate $\hat\theta = \arg\min \|A\theta - X\|_1$;
- $p(z) = (1/a) \exp(-z/a)\, 1_{\{z \ge 0\}}$ (negative exponential): the estimate can be found by solving the LP problem $\min 1^T (X - A\theta)$ subject to $A\theta \le X$;
- $p$ uniform on $[-a, a]$: the MLE is any $\theta$ such that $\|A\theta - X\|_\infty \le a$.

## Ellipsoids

An ellipsoid is a subset of $\mathbb{R}^n$ of the form

$$E = \{x \in \mathbb{R}^n : (x - x_0)^T P^{-1} (x - x_0) \le 1\}$$

where $x_0 \in \mathbb{R}^n$ is the center of the ellipsoid and $P$ is a symmetric positive-definite matrix. Alternative representations:

$$E = \{x \in \mathbb{R}^n : \|Ax - b\|_2 \le 1\}$$

where $A \succ 0$, or

$$E = \{x \in \mathbb{R}^n : x = x_0 + Au,\ \|u\|_2 \le 1\}$$

where $A$ is square and non singular (affine transformation of the unit ball).

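## Robust Least Squares

Least Squares:

$$\hat{x} = \arg\min_x \sqrt{\sum_i (a_i^T x - b_i)^2}$$

Hp: $a_i$ not known, but it is known that

$$a_i \in E_i = \{\bar{a}_i + P_i u : \|u\| \le 1\}$$

where $P_i = P_i^T \succeq 0$. Definition: worst case residuals:

$$\max_{a_i \in E_i} \sqrt{\sum_i (a_i^T x - b_i)^2}$$

A robust estimate of $x$ is the solution of

$$\hat{x}_r = \arg\min_x \max_{a_i \in E_i} \sqrt{\sum_i (a_i^T x - b_i)^2}$$

## RLS

It holds:

$$|\alpha + \beta^T y| \le |\alpha| + \|\beta\| \, \|y\|$$

Then, choosing $y^\star = \beta / \|\beta\|$ if $\alpha \ge 0$ and $y^\star = -\beta / \|\beta\|$ if $\alpha < 0$, we have $\|y^\star\| = 1$ and

$$|\alpha + \beta^T y^\star| = \big|\alpha + \beta^T \beta / \|\beta\| \, \mathrm{sign}(\alpha)\big| = |\alpha| + \|\beta\|$$

Then:

$$\max_{a_i \in E_i} |a_i^T x - b_i| = \max_{\|u\| \le 1} |\bar{a}_i^T x - b_i + u^T P_i x| = |\bar{a}_i^T x - b_i| + \|P_i x\|$$

Thus the Robust Least Squares problem reduces to

$$\min_x \left( \sum_i \big(|\bar{a}_i^T x - b_i| + \|P_i x\|\big)^2 \right)^{1/2}$$

(a convex optimization problem). Transformation:

$$\min_{x, t} \|t\|_2 \quad \text{s.t.} \quad |\bar{a}_i^T x - b_i| + \|P_i x\| \le t_i \quad \forall i$$

i.e.

$$\min_{x, t} \|t\|_2 \quad \text{s.t.} \quad \bar{a}_i^T x - b_i + \|P_i x\| \le t_i, \qquad -\bar{a}_i^T x + b_i + \|P_i x\| \le t_i$$

(Second Order Cone Problem). A norm cone is a convex set

$$C = \{(x, t) \in \mathbb{R}^{n+1} : \|x\| \le t\}$$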
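The closed form of the worst case residual, $|\bar{a}^T x - b| + \|P x\|$, can be checked numerically. The following is a small sketch under stated assumptions: dimension 2, a symmetric $P$, and brute-force sampling of $u$ on the unit circle (enough here, since the quantity being maximized is convex in $u$ and so attains its maximum on the boundary). The function names are hypothetical.

```python
import math


def worst_case_formula(a_bar, P, x, b):
    """Closed form |a_bar^T x - b| + ||P x||_2 for a 2x2 symmetric P."""
    Px = [P[0][0] * x[0] + P[0][1] * x[1], P[1][0] * x[0] + P[1][1] * x[1]]
    return abs(a_bar[0] * x[0] + a_bar[1] * x[1] - b) + math.hypot(Px[0], Px[1])


def worst_case_sampled(a_bar, P, x, b, n=20000):
    """Maximize |(a_bar + P u)^T x - b| over sampled u with ||u|| = 1."""
    best = 0.0
    for k in range(n):
        t = 2.0 * math.pi * k / n
        u = (math.cos(t), math.sin(t))
        a = [a_bar[0] + P[0][0] * u[0] + P[0][1] * u[1],
             a_bar[1] + P[1][0] * u[0] + P[1][1] * u[1]]
        best = max(best, abs(a[0] * x[0] + a[1] * x[1] - b))
    return best


a_bar, P, x, b = [1.0, 2.0], [[0.5, 0.1], [0.1, 0.3]], [1.0, -1.0], 0.5
print(worst_case_formula(a_bar, P, x, b), worst_case_sampled(a_bar, P, x, b))
```

The sampled maximum never exceeds the closed form and approaches it as the sampling gets finer, which is what the inequality and the choice of $y^\star$ above predict.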


## Geometrical Problems

- projections and distances
- polyhedral intersection
- extremal volume ellipsoids
- classification problems

## Projection on a set

Given a set $C$, the projection of $x$ on $C$ is defined as:

$$P_C(x) = \arg\min_{z \in C} \|z - x\|$$

## Projection on a convex set

If

$$C = \{x : Ax = b,\ f_i(x) \le 0,\ i = 1, \dots, m\}$$

where the $f_i$ are convex, then $C$ is a convex set and the problem

$$P_C(x) = \arg\min \|x - z\| \quad \text{s.t.} \quad Az = b, \quad f_i(z) \le 0,\ i = 1, \dots, m$$

is convex.
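As a simple worked case, the projection on a single hyperplane $C = \{z : a^T z = b\}$ with the Euclidean norm has the closed form $P_C(x) = x - \frac{a^T x - b}{a^T a}\, a$. The sketch below assumes this special case only (one linear equality, no inequality constraints); the function name is hypothetical.

```python
import math


def project_onto_hyperplane(x, a, b):
    """Euclidean projection of x onto {z : a^T z = b}, with a != 0."""
    aa = sum(ai * ai for ai in a)
    step = (sum(ai * xi for ai, xi in zip(a, x)) - b) / aa
    return [xi - step * ai for xi, ai in zip(x, a)]


x = [3.0, 4.0]
a, b = [1.0, 1.0], 1.0
z = project_onto_hyperplane(x, a, b)
distance = math.hypot(x[0] - z[0], x[1] - z[1])
print(z, distance)
```

The displacement $x - P_C(x)$ is parallel to $a$, i.e. orthogonal to the hyperplane, which is the optimality condition of the convex problem above in this special case.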

## Distance between convex sets

$$\mathrm{dist}(C^{(1)}, C^{(2)}) = \min_{x \in C^{(1)},\, y \in C^{(2)}} \|x - y\|$$

If $C^{(j)} = \{x : A^{(j)} x = b^{(j)},\ f_i^{(j)}(x) \le 0\}$, then the minimum distance can be found through a convex model:

$$\min \|x^{(1)} - x^{(2)}\| \quad \text{s.t.} \quad A^{(j)} x^{(j)} = b^{(j)}, \quad f_i^{(j)}(x^{(j)}) \le 0, \qquad j = 1, 2$$

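## Polyhedral intersection

1: polyhedra described by means of linear inequalities:

$$P_1 = \{x : Ax \le b\}, \qquad P_2 = \{x : Cx \le d\}$$

- $P_1 \cap P_2 = \emptyset$? It is a linear feasibility problem: $Ax \le b$, $Cx \le d$.
- $P_1 \subseteq P_2$? Just check $\sup\{c_k^T x : Ax \le b\} \le d_k$ $\forall k$, where $c_k^T$ is the $k$-th row of $C$.

## Polyhedral intersection (2)

2: polyhedra (polytopes) described through vertices:

$$P_1 = \mathrm{conv}\{v_1, \dots, v_k\}, \qquad P_2 = \mathrm{conv}\{w_1, \dots, w_h\}$$

- $P_1 \cap P_2 = \emptyset$? Need to find $\lambda_1, \dots, \lambda_k, \mu_1, \dots, \mu_h \ge 0$ such that $\sum_i \lambda_i = 1$, $\sum_j \mu_j = 1$, $\sum_i \lambda_i v_i = \sum_j \mu_j w_j$.
- $P_1 \subseteq P_2$? For each $i = 1, \dots, k$ check whether $\exists\, \mu_j \ge 0$ such that $\sum_j \mu_j = 1$, $\sum_j \mu_j w_j = v_i$.

## Minimal ellipsoid containing k points

Given $v_1, \dots, v_k \in \mathbb{R}^n$ find an ellipsoid

$$E = \{x : \|Ax - b\| \le 1\}$$

with minimal volume containing the $k$ given points, where $A = A^T \succ 0$. The volume of $E$ is proportional to $\det A^{-1}$, which leads to a convex optimization problem (in the unknowns $A$, $b$):

$$\min \log \det A^{-1} \quad \text{s.t.} \quad A = A^T, \quad A \succ 0, \quad \|A v_i - b\| \le 1,\ i = 1, \dots, k$$

[Figure: minimal ellipsoid containing a set of points]

## Max. ellipsoid contained in a polyhedron

Given $P = \{x : Ax \le b\}$ find an ellipsoid

$$E = \{By + d : \|y\| \le 1\}$$

contained in $P$ with maximum volume. The condition $E \subseteq P$ is equivalent to

$$a_i^T (By + d) \le b_i \quad \forall\, y : \|y\| \le 1, \ \forall i$$

$$\sup_{\|y\| \le 1} \{a_i^T B y + a_i^T d\} \le b_i \quad \forall i$$

$$\|B a_i\| + a_i^T d \le b_i \quad \forall i$$

so that the problem becomes

$$\max_{B, d} \log \det B \quad \text{s.t.} \quad B = B^T \succ 0, \quad \|B a_i\| + a_i^T d \le b_i \quad \forall i$$

[Figure: maximal ellipsoid contained in a polyhedron]

## Difficult variants

These problems are hard:

- find a maximal volume ellipsoid contained in a polyhedron given by its vertices;
- find a minimal volume ellipsoid containing a polyhedron described as a system of linear inequalities.

It is already a difficult problem to show whether a given ellipsoid $E$ contains a polyhedron $P = \{Ax \le b\}$. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization in a polyhedron, which is an NP-hard concave optimization problem.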

## Linear classification (separation)

Given two point sets $X_1, \dots, X_k$ and $Y_1, \dots, Y_h$, find a hyperplane $a^T x = t$ such that:

$$a^T X_i \ge 1, \quad i = 1, \dots, k; \qquad a^T Y_j \le -1, \quad j = 1, \dots, h$$

## Robust separation

Find a maximal separation:

$$\max_{a : \|a\| \le 1} \left( \min_i a^T X_i - \max_j a^T Y_j \right)$$

equivalent to the convex problem:

$$\max\ t_1 - t_2 \quad \text{s.t.} \quad a^T X_i \ge t_1 \ \ \forall i, \qquad a^T Y_j \le t_2 \ \ \forall j, \qquad \|a\| \le 1$$

# Optimality Conditions

Fabio Schoen 2008

http://gol.dsi.unifi.it/users/schoen

## Optimality Conditions: descent directions

Let $S \subseteq \mathbb{R}^n$ be a convex set and consider the problem

$$\min_{x \in S} f(x)$$

where $f : S \to \mathbb{R}$. Let $x_1, x_2 \in S$ and $d = x_2 - x_1$: $d$ is a feasible direction. If there exists $\bar\epsilon > 0$ such that $f(x_1 + \epsilon d) < f(x_1)$ $\forall\, \epsilon \in (0, \bar\epsilon)$, $d$ is called a descent direction at $x_1$. Elementary necessary optimality condition: if $x^\star$ is a local optimum, no descent direction may exist at $x^\star$.

## Optimality Conditions for Convex Sets

If $x^\star \in S$ is a local optimum for $f(\cdot)$ and there exists a neighborhood $U(x^\star)$ such that $f \in C^1(U(x^\star))$, then

$$d^T \nabla f(x^\star) \ge 0 \quad \forall\, d : \text{feasible direction}$$

## proof

Taylor expansion:

$$f(x^\star + \epsilon d) = f(x^\star) + \epsilon\, d^T \nabla f(x^\star) + o(\epsilon)$$

$d$ cannot be a descent direction, so, if $\epsilon$ is sufficiently small, then $f(x^\star + \epsilon d) \ge f(x^\star)$. Thus

$$\epsilon\, d^T \nabla f(x^\star) + o(\epsilon) \ge 0$$

and dividing by $\epsilon$,

$$d^T \nabla f(x^\star) + \frac{o(\epsilon)}{\epsilon} \ge 0$$

Letting $\epsilon \downarrow 0$ the proof is complete.

## Optimality Conditions: tangent cone

General case:

$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \dots, m, \qquad x \in X \subseteq \mathbb{R}^n \ (X: \text{open set})$$

Let $S = \{x \in X : g_i(x) \le 0,\ i = 1, \dots, m\}$. Tangent cone to $S$ in $\bar{x}$:

$$T(\bar{x}) = \left\{ d \in \mathbb{R}^n : \frac{d}{\|d\|} = \lim_{x_k \to \bar{x}} \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|} \right\}$$

where $x_k \in S$.

## Some examples

- $S = \mathbb{R}^n$ $\Rightarrow$ $T(x) = \mathbb{R}^n$ $\forall x$
- $S = \{Ax = b\}$ $\Rightarrow$ $T(x) = \{d : Ad = 0\}$
- $S = \{Ax \le b\}$; let $I$ be the set of active constraints in $\bar{x}$: $a_i^T \bar{x} = b_i$ for $i \in I$, $a_i^T \bar{x} < b_i$ for $i \notin I$.

Let $d = \lim_k (x_k - \bar{x}) / \|x_k - \bar{x}\|$. For $i \in I$:

$$a_i^T d = a_i^T \lim_k \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|} = \lim_k \frac{a_i^T x_k - b_i}{\|x_k - \bar{x}\|} \le 0$$

Thus if $d \in T(\bar{x})$ then $a_i^T d \le 0$ for $i \in I$.

Vice versa, let $x_k = \bar{x} + \alpha_k d$. If $a_i^T d \le 0$ for $i \in I$, then

$$a_i^T x_k = a_i^T (\bar{x} + \alpha_k d) = b_i + \alpha_k a_i^T d \le b_i, \qquad i \in I$$

while for $i \notin I$, since $a_i^T \bar{x} < b_i$,

$$a_i^T x_k = a_i^T \bar{x} + \alpha_k a_i^T d \le b_i$$

if $\alpha_k$ is small enough. Thus

$$T(\bar{x}) = \{d : a_i^T d \le 0 \ \ \forall\, i \in I\}$$

## Example

Let $S = \{(x, y) \in \mathbb{R}^2 : x^2 - y = 0\}$ (parabola). Tangent cone at $(0, 0)$? Let $\{(x_k, y_k)\} \to (0, 0)$, i.e. $x_k \to 0$, $y_k = x_k^2$:

$$\|(x_k, y_k) - (0, 0)\| = \sqrt{x_k^2 + x_k^4} = |x_k| \sqrt{1 + x_k^2}$$

$$\lim_{x_k \to 0^+} \frac{x_k}{|x_k| \sqrt{1 + x_k^2}} = 1, \qquad \lim_{x_k \to 0^-} \frac{x_k}{|x_k| \sqrt{1 + x_k^2}} = -1$$

$$\lim_{x_k \to 0^+} \frac{y_k}{|x_k| \sqrt{1 + x_k^2}} = 0, \qquad \lim_{x_k \to 0^-} \frac{y_k}{|x_k| \sqrt{1 + x_k^2}} = 0$$

thus, as unit directions, $T(0, 0) = \{(-1, 0), (1, 0)\}$.

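## Descent direction

$d \in \mathbb{R}^n$ is a feasible direction in $\bar{x} \in S$ if $\exists\, \bar\alpha > 0$: $\bar{x} + \alpha d \in S$ $\forall\, \alpha \in [0, \bar\alpha)$.

$d$ feasible $\Rightarrow$ $d \in T(\bar{x})$, but in general the converse is false. If $f(\bar{x} + \alpha d) < f(\bar{x})$ $\forall\, \alpha \in (0, \bar\alpha)$, $d$ is a descent direction.

## I order necessary opt condition

Let $\bar{x} \in S \subseteq \mathbb{R}^n$ be a local optimum for $\min_{x \in S} f(x)$; let $f \in C^1(U(\bar{x}))$. Then

$$d^T \nabla f(\bar{x}) \ge 0 \quad \forall\, d \in T(\bar{x})$$

Proof: let $d = \lim_k (x_k - \bar{x}) / \|x_k - \bar{x}\|$. Taylor expansion:

$$f(x_k) = f(\bar{x}) + \nabla^T f(\bar{x})(x_k - \bar{x}) + o(\|x_k - \bar{x}\|) = f(\bar{x}) + \nabla^T f(\bar{x})(x_k - \bar{x}) + \|x_k - \bar{x}\|\, o(1)$$

$\bar{x}$ local optimum $\Rightarrow$ $\exists\, U(\bar{x})$: $f(x) \ge f(\bar{x})$ $\forall\, x \in U \cap S$. If $k$ is large enough, $x_k \in U(\bar{x})$:

$$f(x_k) - f(\bar{x}) \ge 0$$

thus

$$\nabla^T f(\bar{x})(x_k - \bar{x}) + \|x_k - \bar{x}\|\, o(1) \ge 0$$

Dividing by $\|x_k - \bar{x}\|$:

$$\nabla^T f(\bar{x}) \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|} + o(1) \ge 0$$

and in the limit

$$\nabla^T f(\bar{x})\, d \ge 0.$$

## Examples

Unconstrained problems: every $d \in \mathbb{R}^n$ belongs to the tangent cone, so at a local optimum

$$\nabla^T f(\bar{x})\, d \ge 0 \quad \forall\, d \in \mathbb{R}^n$$

Choosing $d = e_i$ and $d = -e_i$ we get

$$\nabla f(\bar{x}) = 0$$

NB: the same is true if $\bar{x}$ is a local minimum in the relative interior of the feasible region.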

## Linear equality constraints

$$\min f(x) \quad \text{s.t.} \quad Ax = b$$

Tangent cone: $\{d : Ad = 0\}$. Necessary conditions:

$$\nabla^T f(\bar{x})\, d \ge 0 \quad \forall\, d : Ad = 0$$

equivalent statement:

$$\min_d \nabla^T f(\bar{x})\, d = 0 \quad \text{s.t.} \quad Ad = 0$$

(a linear program). From LP duality:

$$\max 0^T \lambda = 0 \quad \text{s.t.} \quad A^T \lambda = \nabla f(\bar{x})$$

Thus at a local minimum point there exist Lagrange multipliers:

$$\exists\, \lambda : A^T \lambda = \nabla f(\bar{x})$$

## Linear inequalities

$$\min f(x) \quad \text{s.t.} \quad Ax \le b$$

Tangent cone at a local minimum $\bar{x}$: $\{d \in \mathbb{R}^n : a_i^T d \le 0 \ \forall\, i \in I(\bar{x})\}$. Let $A_I$ be the rows of $A$ associated to active constraints at $\bar{x}$. Then

$$\min_d \nabla^T f(\bar{x})\, d = 0 \quad \text{s.t.} \quad A_I d \le 0$$

From LP duality:

$$\max 0^T \lambda = 0 \quad \text{s.t.} \quad A_I^T \lambda = \nabla f(\bar{x}), \quad \lambda \le 0$$

Thus, at a local optimum, the gradient is a non positive linear combination of the coefficients of active constraints.

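## Farkas Lemma

Let $A$ be a matrix in $\mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. One and only one of the following sets:

$$\{y : A^T y \le 0,\ b^T y > 0\}$$

and

$$\{x : Ax = b,\ x \ge 0\}$$

is non empty.

## Geometrical interpretation

[Figure: the cone $\{z : \exists\, x : z = Ax,\ x \ge 0\}$ generated by the columns $a_1$, $a_2$, the vector $b$, and the set $\{y : A^T y \le 0\}$]

## Proof

1) If $\exists\, x \ge 0$: $Ax = b$, then $b^T y = x^T A^T y$. Thus if $A^T y \le 0$ then $b^T y \le 0$.

2) Premise: separating hyperplane theorem. Let $C$ and $D$ be two convex nonempty sets with $C \cap D = \emptyset$. Then there exist $a \ne 0$ and $b$ such that

$$a^T x \le b \quad \forall\, x \in C, \qquad a^T x \ge b \quad \forall\, x \in D$$

If $C$ is a point and $D$ is a closed convex set, separation is strict, i.e.

$$a^T C < b, \qquad a^T x > b \quad \forall\, x \in D$$

## Farkas Lemma (proof)

2) Let $\{x : Ax = b,\ x \ge 0\} = \emptyset$. Let

$$S = \{y \in \mathbb{R}^m : \exists\, x \ge 0,\ Ax = y\}$$

$S$ is closed, convex and $b \notin S$. From the separating hyperplane theorem, $\exists\, \alpha \in \mathbb{R}^m$, $\alpha \ne 0$, and $\beta \in \mathbb{R}$ such that

$$\alpha^T y \le \beta \quad \forall\, y \in S, \qquad \alpha^T b > \beta$$

$0 \in S$ $\Rightarrow$ $\beta \ge 0$ $\Rightarrow$ $\alpha^T b > 0$; moreover $\alpha^T A x \le \beta$ for all $x \ge 0$. This is possible iff $\alpha^T A \le 0$. Letting $y = \alpha$ we obtain a solution of

$$A^T y \le 0, \qquad b^T y > 0$$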

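## First order feasible variations cone

$$G(\bar{x}) = \{d \in \mathbb{R}^n : \nabla^T g_i(\bar{x})\, d \le 0 \ \ \forall\, i \in I\}$$

## First order variations

$G(\bar{x}) \supseteq T(\bar{x})$. In fact, if $\{x_k\}$ is feasible and

$$d = \lim_k \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|}$$

then $g(\bar{x}) \le 0$ and

$$g\Big(\bar{x} + \lim_k (x_k - \bar{x})\Big) \le 0$$

$$g\Big(\bar{x} + \lim_k \|x_k - \bar{x}\| \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|}\Big) \le 0$$

$$g\Big(\bar{x} + \lim_k \|x_k - \bar{x}\|\, d\Big) \le 0$$

Let $\alpha_k = \|x_k - \bar{x}\|$; if $\alpha_k \approx 0$:

$$g(\bar{x} + \alpha_k d) \le 0$$

Expanding,

$$g_i(\bar{x} + \alpha_k d) = g_i(\bar{x}) + \alpha_k \nabla^T g_i(\bar{x})\, d + o(\alpha_k)$$

where $\alpha_k > 0$ and $d$ belongs to the tangent cone $T(\bar{x})$. If the $i$-th constraint is active, then

$$g_i(\bar{x} + \alpha_k d) = \alpha_k \nabla^T g_i(\bar{x})\, d + o(\alpha_k) \le 0$$

and dividing by $\alpha_k$,

$$\nabla^T g_i(\bar{x})\, d + \frac{o(\alpha_k)}{\alpha_k} \le 0$$

Letting $\alpha_k \to 0$ the result is obtained.

## example

$G(\bar{x}) \ne T(\bar{x})$:

$$-x^3 + y \le 0, \qquad -y \le 0$$

## KKT necessary conditions

(Karush–Kuhn–Tucker) Let $\bar{x} \in X \subseteq \mathbb{R}^n$, $X \ne \emptyset$, be a local optimum for

$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \dots, m, \qquad x \in X$$

and let $I$ be the set of indices of the constraints active at $\bar{x}$. If

1. $f(x), g_i(x) \in C^1(\bar{x})$ for $i \in I$
2. constraint qualification conditions, $T(\bar{x}) = G(\bar{x})$, hold in $\bar{x}$;

then there exist Lagrange multipliers $\lambda_i \ge 0$, $i \in I$:

$$\nabla f(\bar{x}) + \sum_{i \in I} \lambda_i \nabla g_i(\bar{x}) = 0.$$

## Proof

$\bar{x}$ local optimum $\Rightarrow$ if $d \in T(\bar{x})$ then $d^T \nabla f(\bar{x}) \ge 0$. But $d \in T(\bar{x})$ $\Leftrightarrow$ $d^T \nabla g_i(\bar{x}) \le 0$ $\forall\, i \in I$. Thus it is impossible that

$$-\nabla^T f(\bar{x})\, d > 0, \qquad \nabla^T g_i(\bar{x})\, d \le 0 \quad i \in I$$

From Farkas Lemma there exists a solution of:

$$\sum_{i \in I} \lambda_i \nabla^T g_i(\bar{x}) = -\nabla^T f(\bar{x}), \qquad \lambda_i \ge 0 \quad i \in I$$

## Constraint qualifications: examples

- polyhedra (linear constraints);
- linear independence: $X$ open set, $g_i(x)$, $i \notin I$, continuous in $\bar{x}$ and $\{\nabla g_i(\bar{x})\}$, $i \in I$, linearly independent;
- Slater condition: $X$ open set, $g_i(x)$, $i \in I$, convex differentiable functions in $\bar{x}$, $g_i(x)$, $i \notin I$, continuous in $\bar{x}$, and $\exists\, \hat{x} \in X$ strictly feasible: $g_i(\hat{x}) < 0$ $\forall\, i \in I$.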

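## Convex problems

An optimization problem

$$\min_{x \in S} f(x)$$

is a convex problem if

- $S$ is a convex set, i.e. $x, y \in S \Rightarrow \lambda x + (1 - \lambda) y \in S$ $\forall\, \lambda \in [0, 1]$
- $f$ is a convex function on $S$, i.e. $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ $\forall\, \lambda \in [0, 1]$ and $x, y \in S$

## Standard convex problem

$$\min f(x) \quad \text{s.t.} \quad h_j(x) = 0,\ j = 1, \dots, k; \qquad g_i(x) \le 0,\ i = 1, \dots, m$$

if

- $f$ is convex
- the $g_i$ are convex
- the $h_j$ are affine (i.e. of the form $\alpha^T x + \beta$)

then the problem is convex.

## Convex problems

Every local optimum is a global one. Proof: let $\bar{x}$ be a local optimum for $\min_S f(x)$ and $x^\star$ a global optimum. $S$ convex $\Rightarrow$ $\lambda x^\star + (1 - \lambda) \bar{x} \in S$. Thus if $\lambda \approx 0$

$$f(\bar{x}) \le f(\lambda x^\star + (1 - \lambda) \bar{x}) \le \lambda f(x^\star) + (1 - \lambda) f(\bar{x})$$

$$\Rightarrow f(\bar{x}) \le f(x^\star)$$

thus $\bar{x}$ is a global minimum.

## Sufficiency of 1st order conditions

(for a convex differentiable problem): if $d^T \nabla f(\bar{x}) \ge 0$ $\forall\, d \in T(\bar{x})$, then $\bar{x}$ is a (global) optimum. Proof:

$$f(y) \ge f(\bar{x}) + (y - \bar{x})^T \nabla f(\bar{x}) \quad \forall\, y \in S$$

But $d = y - \bar{x} \in T(\bar{x})$, so

$$f(y) \ge f(\bar{x}) + d^T \nabla f(\bar{x}) \ge f(\bar{x}) \quad \forall\, y \in S$$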

## Convexity of the set of global optima

(for convex problems) The set of global minima of a convex problem is a convex set. In fact, let $\bar{x}$ and $\bar{y}$ be global minima for the convex problem

$$\min_{x \in S} f(x)$$

Then, choosing $\lambda \in [0, 1]$ we have $\lambda \bar{x} + (1 - \lambda) \bar{y} \in S$, as $S$ is convex. Moreover

$$f(\lambda \bar{x} + (1 - \lambda) \bar{y}) \le \lambda f(\bar{x}) + (1 - \lambda) f(\bar{y}) = \lambda f^\star + (1 - \lambda) f^\star = f^\star$$

where $f^\star$ is the global minimum value. Thus the equality holds and the proof is complete.

## KKT for equality constraints

$\bar{x}$: local optimum for

$$\min f(x) \quad \text{s.t.} \quad h_j(x) = 0,\ j = 1, \dots, k; \qquad g_i(x) \le 0,\ i = 1, \dots, m; \qquad x \in X \subseteq \mathbb{R}^n$$

Let $I$ be the set of active inequalities in $\bar{x}$. If $f(x)$, $g_i(x)$, $i \in I$, $h_j(x) \in C^1$ and constraint qualifications hold in $\bar{x}$, then $\exists\, \lambda_i \ge 0$ $\forall\, i \in I$ and $\mu_j \in \mathbb{R}$ $\forall\, j = 1, \dots, k$:

$$\nabla f(\bar{x}) + \sum_{i \in I} \lambda_i \nabla g_i(\bar{x}) + \sum_{j=1}^{k} \mu_j \nabla h_j(\bar{x}) = 0$$

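## Complementarity

KKT equivalent formulation:

$$\nabla f(\bar{x}) + \sum_{i=1}^{m} \lambda_i \nabla g_i(\bar{x}) + \sum_{j=1}^{k} \mu_j \nabla h_j(\bar{x}) = 0$$

$$\lambda_i\, g_i(\bar{x}) = 0 \qquad i = 1, \dots, m$$

## II order necessary conditions

If $f, g_i, h_j \in C^2$ in $\bar{x}$ and the gradients of active constraints in $\bar{x}$ are linearly independent, then there exist multipliers $\lambda_i \ge 0$, $i \in I$, and $\mu_j$, $j = 1, \dots, k$, such that

$$\nabla f(\bar{x}) + \sum_{i \in I} \lambda_i \nabla g_i(\bar{x}) + \sum_{j=1}^{k} \mu_j \nabla h_j(\bar{x}) = 0$$

and

$$d^T \nabla^2 L(\bar{x})\, d \ge 0$$

for every direction $d$ such that $d^T \nabla g_i(\bar{x}) = 0$, $i \in I$, and $d^T \nabla h_j(\bar{x}) = 0$, where

$$\nabla^2 L(x) := \nabla^2 f(x) + \sum_{i \in I} \lambda_i \nabla^2 g_i(x) + \sum_{j=1}^{k} \mu_j \nabla^2 h_j(x)$$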

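## Sufficient conditions

Let $f, g_i, h_j$ be twice continuously differentiable. Let $x^\star, \lambda^\star, \mu^\star$ be such that:

$$\nabla f(x^\star) + \sum_{i \in I} \lambda_i^\star \nabla g_i(x^\star) + \sum_{j=1}^{k} \mu_j^\star \nabla h_j(x^\star) = 0$$

$$\lambda_i^\star g_i(x^\star) = 0, \qquad \lambda_i^\star \ge 0$$

$$d^T \nabla^2 L(x^\star)\, d > 0 \quad \forall\, d \ne 0 : d^T \nabla h_j(x^\star) = 0, \quad d^T \nabla g_i(x^\star) = 0,\ i \in I$$

then $x^\star$ is a local minimum.

## Lagrange Duality

Problem:

$$f^\star = \min f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \qquad x \in X$$

Definition: Lagrange function:

$$L(x; \lambda) = f(x) + \sum_i \lambda_i g_i(x), \qquad \lambda \ge 0,\ x \in X$$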

## Relaxation

Given an optimization problem

$$\min_{x \in S} f(x)$$

a relaxation is a problem

$$\min_{x \in Q} g(x)$$

where

$$S \subseteq Q, \qquad g(x) \le f(x) \quad \forall\, x \in S.$$

Weak Duality: the optimal value of a relaxation is a lower bound on the optimum value of the problem.

## Lagrange minimization is a relaxation

Proof: the feasible set of the Lagrange problem is $X$ (it contains the original one). If $g(x) \le 0$ and $\lambda \ge 0$, then

$$L(x, \lambda) = f(x) + \lambda^T g(x) \le f(x)$$

## Dual Lagrange function

with respect to constraints $g(x) \le 0$:

$$\theta(\lambda) = \inf_{x \in X} L(x, \lambda) = \inf_{x \in X} \big(f(x) + \lambda^T g(x)\big)$$

For every choice of $\lambda \ge 0$, $\theta(\lambda)$ is a lower bound for every feasible solution and, in particular, is a lower bound for the global minimum value of the problem.

## Example

$$\min\ -r$$

$$4r^2 - (x_i - x_j)^2 - (y_i - y_j)^2 \le 0 \qquad \forall\, 1 \le i < j \le N$$

$$x_i, y_i \ge 0 \qquad i = 1, \dots, N$$

$$x_i, y_i \le 1 \qquad i = 1, \dots, N$$

## solution

When $N = 2$, relaxing the first constraint:

$$\theta(\lambda) = \min_{x, y, r}\ -r + \lambda\big(4r^2 - (x_1 - x_2)^2 - (y_1 - y_2)^2\big)$$

$$x_1, x_2, y_1, y_2 \ge 0, \qquad x_1, x_2, y_1, y_2 \le 1$$

Minimizing with respect to $x, y$ gives $|x_1 - x_2| = |y_1 - y_2| = 1$, from which

$$\theta(\lambda) = \min_r\ -r + 4\lambda r^2 - 2\lambda \quad \Rightarrow \quad r = \frac{1}{8\lambda} \quad \Rightarrow \quad \theta(\lambda) = -2\lambda - \frac{1}{16\lambda}$$

This is a lower bound on the optimum value. Best possible lower bound:

$$\theta^\star = \max_\lambda \theta(\lambda) \quad \Rightarrow \quad \lambda^\star = \frac{1}{4\sqrt{2}}, \qquad \theta^\star = -\frac{\sqrt{2}}{2}$$

Choosing $(x_1, y_1) = (0, 0)$ and $(x_2, y_2) = (1, 1)$, a feasible solution with $r = \sqrt{2}/2$ is obtained. The Lagrange dual gives a lower bound equal to $-\sqrt{2}/2$: same as the objective function at a feasible solution $\Rightarrow$ optimal solution! (an exception, not the rule!)

## Lagrange Dual

$$\theta^\star = \max_{\lambda \ge 0} \theta(\lambda)$$

This problem might:

1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained in correspondence with a single solution $x$
4. have many different maxima, each connected with a different solution $x$
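The dual bound of the two-circle example ($N = 2$) can be reproduced numerically. The sketch below maximizes $\theta(\lambda) = -2\lambda - 1/(16\lambda)$ by a plain grid search; the search interval $[0.01, 1]$ and the helper names are assumptions made only for this illustration.

```python
import math


def theta(lam):
    """Dual function of the two-circle example, valid for lam > 0."""
    return -2.0 * lam - 1.0 / (16.0 * lam)


def maximize_on_grid(fn, lo, hi, n=200001):
    """Return (argmax, max) of fn over n equally spaced points in [lo, hi]."""
    best_l, best_v = lo, fn(lo)
    for k in range(1, n):
        l = lo + (hi - lo) * k / (n - 1)
        v = fn(l)
        if v > best_v:
            best_l, best_v = l, v
    return best_l, best_v


lam_star, theta_star = maximize_on_grid(theta, 0.01, 1.0)

# Primal feasible point: centers (0, 0) and (1, 1), radius sqrt(2)/2.
r = math.sqrt(2.0) / 2.0
feasible = 4.0 * r * r - 1.0 - 1.0 <= 1e-12
print(lam_star, theta_star, -r, feasible)
```

The best dual value found on the grid agrees with the primal objective $-r$ at the feasible point up to the grid resolution, which is the zero duality gap noted on the slide.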

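## Equality constraints

$$f^\star = \min f(x) \quad \text{s.t.} \quad h_j(x) = 0,\ j = 1, \dots, k; \qquad g_i(x) \le 0,\ i = 1, \dots, m; \qquad x \in X$$

Lagrange function:

$$L(x; \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T h(x)$$

where $\lambda \ge 0$, but $\mu$ is free.

## Linear Programming

$$\min c^T x \quad \text{s.t.} \quad Ax \le b$$

Dual Lagrange function:

$$\theta(\lambda) = \min_x\ c^T x + \lambda^T (Ax - b) = -\lambda^T b + \min_x\ (c^T + \lambda^T A)\, x.$$

but:

$$\min_x\ (c^T + \lambda^T A)\, x = \begin{cases} 0 & \text{if } c^T + \lambda^T A = 0 \\ -\infty & \text{otherwise.} \end{cases}$$

Lagrange dual function:

$$\theta(\lambda) = \begin{cases} -\lambda^T b & \text{if } c + A^T \lambda = 0 \\ -\infty & \text{otherwise.} \end{cases}$$

Lagrange dual:

$$\max\ -\lambda^T b \quad \text{s.t.} \quad \lambda^T A + c^T = 0, \quad \lambda \ge 0$$

which is equivalent to:

$$\max\ \lambda^T b \quad \text{s.t.} \quad \lambda^T A = c^T, \quad \lambda \le 0$$

## Quadratic Programming (QP)

$$\min \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax = b$$

($Q$: symmetric). Lagrange dual function:

$$\theta(\lambda) = \min_x\ \frac{1}{2} x^T Q x + c^T x + \lambda^T (Ax - b) = -\lambda^T b + \min_x\ \frac{1}{2} x^T Q x + (c^T + \lambda^T A)\, x$$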

## QP Case 1

$Q$ has at least one negative eigenvalue $\Rightarrow$

$$\min_x\ \frac{1}{2} x^T Q x + (c^T + \lambda^T A)\, x = -\infty$$

In fact $\exists\, d$: $d^T Q d < 0$. Choosing $x = \alpha d$ with $\alpha > 0$:

$$\frac{1}{2} x^T Q x + (c^T + \lambda^T A)\, x = \frac{1}{2} \alpha^2 d^T Q d + \alpha (c^T + \lambda^T A)\, d$$

and for large values of $\alpha$ this can be made as small as desired.

## QP Case 2

$Q$ positive definite $\Rightarrow$ minimum point of the dual Lagrange function:

$$Q \bar{x} + (c + A^T \lambda) = 0$$

i.e.

$$\bar{x} = -Q^{-1} (c + A^T \lambda)$$

Lagrange function value:

$$\theta(\lambda) = -\lambda^T b + \frac{1}{2} \bar{x}^T Q \bar{x} + (c^T + \lambda^T A)\, \bar{x}$$

$$= -\lambda^T b + \frac{1}{2} (c + A^T \lambda)^T Q^{-1} Q Q^{-1} (c + A^T \lambda) - (c^T + \lambda^T A)\, Q^{-1} (c + A^T \lambda)$$

$$= -\lambda^T b + \frac{1}{2} (c + A^T \lambda)^T Q^{-1} (c + A^T \lambda) - (c^T + \lambda^T A)\, Q^{-1} (c + A^T \lambda)$$

$$= -\lambda^T b - \frac{1}{2} (c + A^T \lambda)^T Q^{-1} (c + A^T \lambda)$$

Lagrange dual (seen as a min problem):

$$\min_\lambda\ \lambda^T b + \frac{1}{2} (c + A^T \lambda)^T Q^{-1} (c + A^T \lambda)$$

Optimality conditions:

$$b + A Q^{-1} (c + A^T \lambda) = 0$$

But recalling that $\bar{x} = -Q^{-1}(c + A^T \lambda)$:

$$b - A \bar{x} = 0$$

i.e. feasibility of $\bar{x}$. If we find optimal multipliers $\lambda$ (a linear system) we get the optimal solution $\bar{x}$ (thanks to feasibility and weak duality)!
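A tiny instance makes the recipe of Case 2 concrete: solve the dual optimality conditions for $\lambda$, recover $\bar{x} = -Q^{-1}(c + A^T\lambda)$, and compare primal and dual values. The sketch assumes a diagonal positive definite $Q$ and a single equality constraint $a^T x = b$, so that the linear system for $\lambda$ is a scalar equation; the function names and the data are hypothetical.

```python
def solve_equality_qp_diag(q, c, a, b):
    """min 1/2 sum_i q_i x_i^2 + c^T x  s.t.  a^T x = b, with all q_i > 0.

    Dual optimality: b + a^T Q^{-1} (c + a lam) = 0, a scalar equation in lam.
    """
    aQa = sum(ai * ai / qi for ai, qi in zip(a, q))
    aQc = sum(ai * ci / qi for ai, ci, qi in zip(a, c, q))
    lam = -(b + aQc) / aQa
    x = [-(ci + ai * lam) / qi for ci, ai, qi in zip(c, a, q)]
    return x, lam


def primal_value(q, c, x):
    return sum(0.5 * qi * xi * xi + ci * xi for qi, ci, xi in zip(q, c, x))


def dual_value(q, c, a, b, lam):
    """theta(lam) = -lam b - 1/2 (c + a lam)^T Q^{-1} (c + a lam)."""
    return -lam * b - 0.5 * sum((ci + ai * lam) ** 2 / qi
                                for ci, ai, qi in zip(c, a, q))


q, c, a, b = [2.0, 4.0], [-2.0, -4.0], [1.0, 1.0], 1.0
x, lam = solve_equality_qp_diag(q, c, a, b)
print(x, lam, primal_value(q, c, x), dual_value(q, c, a, b, lam))
```

For these data the recovered point satisfies $a^T x = b$ and the primal and dual values coincide, which by weak duality certifies optimality.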

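## Properties of the Lagrange dual

For any problem

$$f^\star = \min f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \dots, m, \qquad x \in X$$

where $X$ is non empty and compact, if $f$ and $g_i$ are continuous then the Lagrange dual function is concave.

## Proof

From Weierstrass theorem

$$\theta(\lambda) = \min_{x \in X} f(x) + \lambda^T g(x)$$

exists and is finite. For $\eta \in [0, 1]$:

$$\theta(\eta a + (1 - \eta) b) = \min_{x \in X} \big(f(x) + (\eta a + (1 - \eta) b)^T g(x)\big)$$

$$= \min_{x \in X} \big(\eta (f(x) + a^T g(x)) + (1 - \eta)(f(x) + b^T g(x))\big)$$

$$\ge \eta \min_{x \in X} \big(f(x) + a^T g(x)\big) + (1 - \eta) \min_{x \in X} \big(f(x) + b^T g(x)\big)$$

$$= \eta\, \theta(a) + (1 - \eta)\, \theta(b).$$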

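## Solution of the Lagrange dual

$$\max_\lambda \theta(\lambda) = \max_\lambda \min_{x \in X} \big(f(x) + \lambda^T g(x)\big)$$

is equivalent to

$$\max\ z \quad \text{s.t.} \quad \lambda \ge 0, \quad z \le f(x) + \lambda^T g(x) \quad \forall\, x \in X$$

After having computed $f$ and $g$ in $x_1, x_2, \dots, x_k$, a restricted dual can be defined:

$$\max\ z \quad \text{s.t.} \quad \lambda \ge 0, \quad z \le f(x_j) + \lambda^T g(x_j) \quad \forall\, j = 1, \dots, k$$

Let $\bar{z}, \bar\lambda$ be the optimal solution of the restricted dual. Is it an optimal dual solution? Is it true that $\bar{z} \le f(x) + \bar\lambda^T g(x)$ $\forall\, x \in X$? Check: we look for $\bar{x}$, optimal solution of

$$\min_{x \in X} f(x) + \bar\lambda^T g(x)$$

If $\bar{z} \le f(\bar{x}) + \bar\lambda^T g(\bar{x})$, the restricted solution is optimal for the dual; otherwise the pair $\bar{x}, f(\bar{x})$ is added to the restricted dual and a new solution is computed.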

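## Geometric programming

Unconstrained Geometric program:

$$\min_{x > 0} \sum_{k=1}^{m} c_k \prod_{j=1}^{n} x_j^{\alpha_{kj}}, \qquad \alpha_{kj} \in \mathbb{R},\ c_k > 0$$

(non convex). Variable substitution:

$$x_j = \exp(y_j), \qquad y_j \in \mathbb{R}$$

Transformed problem:

$$\min_y \sum_{k=1}^{m} c_k \prod_{j=1}^{n} e^{\alpha_{kj} y_j} = \min_y \sum_{k=1}^{m} e^{\alpha_k^T y + \beta_k}, \qquad \beta_k = \log c_k$$

This is a convex problem (a sum of exponentials of affine functions), and its logarithm is convex as well.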

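## Duality example

Dual of

$$\min_x \log \sum_{k=1}^{m} \exp(\alpha_k^T x + \beta_k)$$

No constraints $\Rightarrow$ the dual Lagrange function is identical to $f(x)$! Strong duality holds, but is useless. Simple transformation:

$$\min \log \sum_{k=1}^{m} \exp y_k \quad \text{s.t.} \quad \alpha_k^T x + \beta_k = y_k$$

## Dual function

With $A$ the matrix whose rows are the $\alpha_k^T$:

$$L(\lambda) = \min_{x, y} \log \sum_{k=1}^{m} \exp y_k + \lambda^T (Ax + \beta - y)$$

The minimization in $x$ is unbounded unless $\lambda^T A = 0$; in that case

$$L(\lambda) = \min_y \log \sum_{k=1}^{m} \exp y_k + \lambda^T (\beta - y)$$

First order (unconstrained) optimality conditions w.r.t. $y_i$:

$$\frac{\exp y_i}{\sum_k \exp y_k} - \lambda_i = 0$$

Lagrange multipliers exist provided that

$$\sum_i \lambda_i = 1, \qquad \lambda_i > 0 \quad \forall i$$

Substituting $\lambda_j = \exp y_j / \sum_k \exp y_k$:

$$L(\lambda) = \lambda^T \beta + \log \sum_j \exp y_j - \sum_j \lambda_j y_j$$

$$= \lambda^T \beta + \log \sum_j \exp y_j - \sum_j y_j \frac{\exp y_j}{\sum_k \exp y_k}$$

$$= \lambda^T \beta + \frac{1}{\sum_k \exp y_k} \sum_k \exp y_k \Big( \log \sum_j \exp y_j - y_k \Big)$$

$$= \lambda^T \beta - \sum_k \frac{\exp y_k}{\sum_j \exp y_j} \log \frac{\exp y_k}{\sum_j \exp y_j}$$

$$= \lambda^T \beta - \sum_k \lambda_k \log \lambda_k$$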

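## Lagrange Dual

The Lagrange Dual becomes:

$$\max\ \beta^T \lambda - \sum_k \lambda_k \log \lambda_k \quad \text{s.t.} \quad \sum_k \lambda_k = 1, \quad A^T \lambda = 0, \quad \lambda \ge 0$$

## Special cases: linear constraints

$$\min f(x) \quad \text{s.t.} \quad Ax \ge b$$

Lagrange function:

$$L(x, \lambda) = f(x) + \lambda^T (b - Ax)$$

Constraint qualifications always hold (polyhedron). If $x^\star$ is a local optimum there exists $\lambda^\star \ge 0$:

$$\nabla f(x^\star) = A^T \lambda^\star, \qquad A x^\star \ge b, \qquad \lambda^{\star T} (b - A x^\star) = 0$$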

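## Non negativity constraints

$$\min f(x) \quad \text{s.t.} \quad x \ge 0$$

Lagrange function: $L(x, \lambda) = f(x) - \lambda^T x$. KKT conditions:

$$\nabla f(x^\star) = \lambda^\star, \qquad \lambda^\star \ge 0, \qquad x^\star \ge 0, \qquad (\lambda^\star)^T x^\star = 0$$

Thus

$$\lambda_j^\star = \frac{\partial f(x^\star)}{\partial x_j}, \qquad j = 1, \dots, n$$

from which

$$\frac{\partial f(x^\star)}{\partial x_j} = 0 \quad \forall\, j : x_j^\star > 0, \qquad \frac{\partial f(x^\star)}{\partial x_j} \ge 0 \quad \text{otherwise}$$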

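## Box constraints

$$\min f(x) \quad \text{s.t.} \quad \ell \le x \le u \qquad (\ell_i < u_i \ \ \forall i)$$

Lagrange function: $L(x, \lambda, \mu) = f(x) + \lambda^T (\ell - x) + \mu^T (x - u)$. KKT conditions:

$$\nabla f(x^\star) = \lambda^\star - \mu^\star, \qquad (\ell - x^\star)^T \lambda^\star = 0, \qquad (x^\star - u)^T \mu^\star = 0, \qquad (\lambda^\star, \mu^\star) \ge 0$$

Given $x^\star$ let $J_\ell = \{j : x_j^\star = \ell_j\}$, $J_u = \{j : x_j^\star = u_j\}$, $J_0 = \{j : \ell_j < x_j^\star < u_j\}$.

## Box constr. (cont)

Then, from complementarity,

$$\frac{\partial f(x^\star)}{\partial x_j} = \lambda_j^\star \quad j \in J_\ell, \qquad \frac{\partial f(x^\star)}{\partial x_j} = -\mu_j^\star \quad j \in J_u, \qquad \frac{\partial f(x^\star)}{\partial x_j} = 0 \quad j \in J_0$$

Thus

$$\frac{\partial f(x^\star)}{\partial x_j} \ge 0 \quad j \in J_\ell, \qquad \frac{\partial f(x^\star)}{\partial x_j} \le 0 \quad j \in J_u, \qquad \frac{\partial f(x^\star)}{\partial x_j} = 0 \quad j \in J_0$$

with feasibility $\ell \le x^\star \le u$.

## Optimization over the simplex

$$\min f(x) \quad \text{s.t.} \quad 1^T x = 1, \quad x \ge 0$$

Lagrange function: $L(x, \lambda, \mu) = f(x) - \lambda^T x + \mu (1^T x - 1)$. KKT:

$$\nabla f(x^\star) = \lambda^\star - \mu^\star 1, \qquad 1^T x^\star = 1, \qquad (x^\star, \lambda^\star) \ge 0, \qquad (\lambda^\star)^T x^\star = 0$$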

## simplex . . .

$$\frac{\partial f(x^\star)}{\partial x_j} - \lambda_j^\star = -\mu^\star$$

(all equal). Thus, from complementarity, if $x_j^\star > 0$ then $\lambda_j^\star = 0$ and

$$\frac{\partial f(x^\star)}{\partial x_j} = -\mu^\star; \qquad \text{otherwise} \quad \frac{\partial f(x^\star)}{\partial x_j} \ge -\mu^\star.$$

Thus, if $j$: $x_j^\star > 0$, then $\forall\, k$

$$\frac{\partial f(x^\star)}{\partial x_j} \le \frac{\partial f(x^\star)}{\partial x_k}$$

## Application: Min var portfolio

Given $n$ assets with random returns $R_1, \dots, R_n$, how to invest 1 € in such a way that the resulting portfolio has minimum variance? If $x_j$ denotes the percentage of the investment on asset $j$, how to compute the variance of this portfolio $P(x)$?

$$\mathrm{Var} = E\big(P(x) - E(P(x))\big)^2 = E\Big(\sum_{j=1}^{n} (R_j - E(R_j))\, x_j\Big)^2 = \sum_{i,j} E\big[(R_i - E(R_i))(R_j - E(R_j))\big]\, x_i x_j = x^T Q x$$

where $Q$ is the variance-covariance matrix of the $n$ assets.

## Min var portfolio

Problem (objective multiplied by 1/2 for simpler computations):

$$\min \frac{1}{2} x^T Q x \quad \text{s.t.} \quad 1^T x = 1, \quad x \ge 0$$

## Optimal portfolio

KKT: for all $i$ with $x_i^\star > 0$:

$$\sum_j Q_{ij} x_j^\star \le \sum_j Q_{kj} x_j^\star \qquad \forall\, k$$

The vector $Qx$ might be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of elements of $Qx$). Thus in the optimal portfolio, all assets with positive level give equal (and minimal) contribution to the total risk.
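The equal-marginal-risk property can be observed on a two-asset portfolio, where the problem reduces to one variable $t = x_1$, $x_2 = 1 - t$. The sketch below is an illustration under stated assumptions: exactly two assets, $\mathrm{Var}(R_1 - R_2) > 0$, and made-up covariance matrices; the function names are hypothetical.

```python
def min_variance_two_assets(Q):
    """Minimize x^T Q x over x = (t, 1 - t), 0 <= t <= 1, for a 2x2 covariance Q.

    The objective is a convex parabola in t, so the constrained minimizer is
    the unconstrained stationary point clipped to [0, 1].
    """
    q11, q12, q22 = Q[0][0], Q[0][1], Q[1][1]
    denom = q11 + q22 - 2.0 * q12  # equals Var(R1 - R2)
    if denom <= 0.0:
        raise ValueError("this sketch assumes Var(R1 - R2) > 0")
    t = min(1.0, max(0.0, (q22 - q12) / denom))
    return [t, 1.0 - t]


def marginal_risk(Q, x):
    """Gradient of 1/2 x^T Q x, i.e. the vector Q x."""
    return [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]


Q_interior = [[0.04, 0.01], [0.01, 0.09]]   # both assets held
Q_corner = [[0.01, 0.02], [0.02, 0.09]]     # only the first asset held
for Q in (Q_interior, Q_corner):
    x = min_variance_two_assets(Q)
    print(x, marginal_risk(Q, x))
```

In the first case both weights are positive and the two entries of $Qx$ coincide; in the second only asset 1 is held and its entry of $Qx$ is the smaller one, as the KKT conditions require.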

# Algorithms for unconstrained local optimization

Fabio Schoen 2008

http://gol.dsi.unifi.it/users/schoen

## Optimization Algorithms

Most common form for optimization algorithms: line search-based methods. Given a starting point $x_0$, a sequence is generated:

$$x_{k+1} = x_k + \alpha_k d_k$$

where $d_k \in \mathbb{R}^n$ is the search direction and $\alpha_k > 0$ the step. Usually $d_k$ is chosen first and then the step is obtained, often from a one-dimensional optimization.

## Trust-region algorithms

A model $m(x)$ and a confidence region $U(x_k)$ containing $x_k$ are defined. The new iterate is chosen as the solution of the constrained optimization problem

$$\min_{x \in U(x_k)} m(x)$$

The model and the confidence region are possibly updated at each iteration.

## Speed measures

Let $x^\star$ be a local optimum. The error in $x_k$ might be measured e.g. as

$$e(x_k) = \|x_k - x^\star\| \qquad \text{or} \qquad e(x_k) = |f(x_k) - f(x^\star)|.$$

Given $\{x_k\} \to x^\star$, if $\exists\, q > 0$, $\beta \in (0, 1)$ such that (for $k$ large enough)

$$e(x_k) \le q \beta^k$$

then $\{x_k\}$ is linearly convergent, or converges with order 1; $\beta$ is the convergence rate. A sufficient condition for linear convergence:

$$\limsup \frac{e(x_{k+1})}{e(x_k)} \le \beta$$
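The ratio test in the sufficient condition is easy to try on explicit error sequences. The sketch below uses two made-up sequences, $e_k = 1/k$ and $e_k = 2^{-k}$; the helper name `error_ratios` is hypothetical.

```python
def error_ratios(errors):
    """Return e(x_{k+1}) / e(x_k) for consecutive errors."""
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]


# Hypothetical error sequences, chosen only for illustration.
sublinear = [1.0 / k for k in range(1, 60)]   # e_k = 1/k
linear = [0.5 ** k for k in range(1, 60)]     # e_k = q * beta^k with q = 1, beta = 1/2

print("1/k   last ratio:", error_ratios(sublinear)[-1])
print("2^-k  last ratio:", error_ratios(linear)[-1])
```

For $2^{-k}$ the ratio is constantly $1/2$, so the sequence converges linearly with rate $\beta = 1/2$. For $1/k$ the ratio $k/(k+1)$ tends to 1, so no $\beta < 1$ bounds it and the sufficient condition fails.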

## superlinear convergence

If for every $\beta \in (0, 1)$ there exists $q$ such that

$$e(x_k) \le q \beta^k$$

then convergence is superlinear. Sufficient condition:

$$\limsup \frac{e(x_{k+1})}{e(x_k)} = 0$$

## Higher order convergence

If, given $p > 1$, $\exists\, q > 0$, $\beta \in (0, 1)$ such that

$$e(x_k) \le q \beta^{(p^k)}$$

then $\{x_k\}$ is said to converge with order at least $p$. If $p = 2$: quadratic convergence. Sufficient condition:

$$\limsup \frac{e(x_{k+1})}{e(x_k)^p} < \infty$$

## Examples

$$\frac{1}{k}, \qquad \frac{1}{k^2}, \qquad \frac{1}{2^{2^k}}$$

## Descent directions and the gradient

Let $f \in C^1(\mathbb{R}^n)$, $x_k \in \mathbb{R}^n$: $\nabla f(x_k) \ne 0$. Let $d \in \mathbb{R}^n$. If

$$d^T \nabla f(x_k) < 0$$

then $d$ is a descent direction. Taylor expansion:

$$f(x_k + \alpha d) - f(x_k) = \alpha\, d^T \nabla f(x_k) + o(\alpha)$$

$$\frac{f(x_k + \alpha d) - f(x_k)}{\alpha} = d^T \nabla f(x_k) + o(1)$$

Thus if $\alpha$ is small enough, $f(x_k + \alpha d) - f(x_k) < 0$. NB: $d$ might be a descent direction even if $d^T \nabla f(x_k) = 0$.
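Both the sufficient condition $d^T \nabla f(x_k) < 0$ and the closing remark can be checked on small examples. The two test functions below are assumptions chosen for illustration, as are the helper names: for $f(x) = x_1 - x_2^2$ at the origin, the direction $d = (0, 1)$ has zero slope yet decreases $f$.

```python
def slope(grad, d):
    """Directional derivative d^T grad f(x)."""
    return sum(g * di for g, di in zip(grad, d))


def decreases(f, x, d, alphas=(1e-1, 1e-2, 1e-3, 1e-4)):
    """True if f(x + alpha d) < f(x) for every trial step alpha."""
    fx = f(x)
    return all(f([xi + a * di for xi, di in zip(x, d)]) < fx for a in alphas)


def f1(x):  # gradient (2 x1, 6 x2)
    return x[0] ** 2 + 3.0 * x[1] ** 2


def f2(x):  # gradient (1, -2 x2)
    return x[0] - x[1] ** 2


# Negative slope implies descent.
print(slope([2.0, 6.0], [-1.0, 0.0]), decreases(f1, [1.0, 1.0], [-1.0, 0.0]))
# Zero slope, not a descent direction for f1 ...
print(slope([2.0, 6.0], [3.0, -1.0]), decreases(f1, [1.0, 1.0], [3.0, -1.0]))
# ... but zero slope and still a descent direction for f2 at the origin.
print(slope([1.0, 0.0], [0.0, 1.0]), decreases(f2, [0.0, 0.0], [0.0, 1.0]))
```

When the slope is zero the first order term gives no information, and the behaviour along $d$ is decided by higher order terms.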

## Convergence of line search methods

If a sequence $x_{k+1} = x_k + \alpha_k d_k$ is generated in such a way that:

- $L_0 = \{x : f(x) \le f(x_0)\}$ is compact
- $d_k \ne 0$ whenever $\nabla f(x_k) \ne 0$
- $f(x_{k+1}) \le f(x_k)$
- if $\nabla f(x_k) \ne 0$ $\forall\, k$ then
  $$\lim_k \frac{d_k^T \nabla f(x_k)}{\|d_k\|} = 0$$
- if $d_k \ne 0$ then
  $$\frac{|d_k^T \nabla f(x_k)|}{\|d_k\|} \ge \sigma(\|\nabla f(x_k)\|)$$

Then either there exists a finite index $\bar{k}$ such that $\nabla f(x_{\bar{k}}) = 0$, or otherwise

- $x_k \in L_0$ and all of its limit points are in $L_0$
- $\{f(x_k)\}$ admits a limit
- $\lim_k \nabla f(x_k) = 0$

Comments on the assumptions:

- $f(x_{k+1}) \le f(x_k)$: most optimization methods choose $d_k$ as a descent direction. If $d_k$ is a descent direction, choosing $\alpha_k$ sufficiently small ensures the validity of the assumption.
- $\lim_k d_k^T \nabla f(x_k) / \|d_k\| = 0$: given a normalized direction $d_k$, the scalar product $d_k^T \nabla f(x_k)$ is the directional derivative of $f$ along $d_k$: it is required that this goes to zero. This can be achieved through precise line searches (choosing the step so that $f$ is minimized along $d_k$).
- $|d_k^T \nabla f(x_k)| / \|d_k\| \ge \sigma(\|\nabla f(x_k)\|)$: with, e.g., $\sigma(t) = ct$, $c > 0$, if $d_k$: $d_k^T \nabla f(x_k) < 0$ then the condition becomes
  $$\frac{d_k^T \nabla f(x_k)}{\|d_k\|\, \|\nabla f(x_k)\|} \le -c$$

Recalling that

$$\cos \theta_k = \frac{d_k^T \nabla f(x_k)}{\|d_k\|\, \|\nabla f(x_k)\|}$$

the condition becomes

$$\cos \theta_k \le -c$$

that is, the angle between $d_k$ and $\nabla f(x_k)$ is bounded away from orthogonality.

## General scheme

$$x_{k+1} = x_k - \alpha_k D_k \nabla f(x_k)$$

with $D_k \succ 0$ and $\alpha_k > 0$. If $\nabla f(x_k) \ne 0$ then

$$d_k = -D_k \nabla f(x_k)$$

is a descent direction. In fact

$$d_k^T \nabla f(x_k) = -\nabla^T f(x_k)\, D_k \nabla f(x_k) < 0$$

## Steepest Descent

$$D_k := I$$

i.e. $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$. If $\nabla f(x_k) \ne 0$ then $d_k = -\nabla f(x_k)$ is a descent direction. Moreover, it is the steepest (w.r.t. the Euclidean norm):

$$\min_{d \in \mathbb{R}^n} \nabla^T f(x_k)\, d \quad \text{s.t.} \quad \|d\| \le 1$$

Writing the constraint as $\sqrt{d^T d} \le 1$, the KKT conditions are: in the interior, $\nabla f(x_k) = 0$; if the constraint is active,

$$\nabla f(x_k) + \lambda \frac{d}{\|d\|} = 0, \qquad \sqrt{d^T d} = 1, \qquad \lambda \ge 0$$

$$\Rightarrow \quad d = -\frac{\nabla f(x_k)}{\|\nabla f(x_k)\|}$$

## Newton's method

$$D_k := \big(\nabla^2 f(x_k)\big)^{-1}$$

Motivation: Taylor expansion of $f$:

$$f(x) \approx f(x_k) + \nabla^T f(x_k)(x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k)(x - x_k)$$

Minimizing the approximation:

$$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$$

If the Hessian is non singular,

$$x = x_k - \big(\nabla^2 f(x_k)\big)^{-1} \nabla f(x_k)$$
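Since the Newton step minimizes the quadratic model exactly, on a strictly convex quadratic it reaches the minimizer in a single step. The sketch below illustrates this in two dimensions; the restriction to $2 \times 2$ Hessians, the function names and the data are assumptions made for the example.

```python
def solve_2x2(H, g):
    """Solve H s = g for a non singular 2x2 matrix H (Cramer's rule)."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if abs(det) < 1e-14:
        raise ValueError("singular Hessian")
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def newton_step(x, grad, hess):
    """One iteration x - (hess(x))^{-1} grad(x)."""
    s = solve_2x2(hess(x), grad(x))
    return [x[0] - s[0], x[1] - s[1]]


# f(x) = 1/2 x^T Q x + c^T x with a positive definite Q (illustrative data).
Q = [[4.0, 1.0], [1.0, 3.0]]
c = [-1.0, -2.0]


def grad(x):
    return [Q[0][0] * x[0] + Q[0][1] * x[1] + c[0],
            Q[1][0] * x[0] + Q[1][1] * x[1] + c[1]]


def hess(x):
    return Q


x1 = newton_step([10.0, -7.0], grad, hess)
print(x1, grad(x1))
```

For a general nonlinear $f$ the model is only a local approximation, so the step has to be repeated.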

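Step choice
Given $d_k$, how to choose $\alpha_k$ so that $x_{k+1} = x_k + \alpha_k d_k$? "Optimal" choice (one-dimensional optimization):
$$\alpha_k = \arg\min_{\alpha \ge 0} f(x_k + \alpha d_k).$$
Analytical expression of the optimal step is available only in few cases. E.g. if $f(x) = \frac{1}{2} x^T Q x + c^T x$ with $Q \succ 0$. Then
$$f(x_k + \alpha d_k) = \frac{1}{2}(x_k + \alpha d_k)^T Q (x_k + \alpha d_k) + c^T (x_k + \alpha d_k) = \frac{1}{2}\alpha^2 d_k^T Q d_k + \alpha (Q x_k + c)^T d_k + \beta$$
where $\beta$ does not depend on $\alpha$. Minimizing w.r.t. $\alpha$:
$$\alpha\, d_k^T Q d_k + (Q x_k + c)^T d_k = 0 \quad\Rightarrow\quad \alpha = -\frac{(Q x_k + c)^T d_k}{d_k^T Q d_k} = -\frac{d_k^T \nabla f(x_k)}{d_k^T \nabla^2 f(x_k)\, d_k}$$
E.g., in steepest descent:
$$\alpha_k = \frac{\|\nabla f(x_k)\|^2}{\nabla^T f(x_k)\, \nabla^2 f(x_k)\, \nabla f(x_k)}$$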
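The closed-form step for a quadratic can be checked numerically. The following is a minimal sketch, assuming a small dense problem stored as Python lists; the function name `exact_step` and the data `Q`, `c` are illustrative, not part of the slides.

```python
def exact_step(Q, c, x, d):
    """Exact minimizer of alpha -> f(x + alpha*d) for f(x) = 0.5 x^T Q x + c^T x.

    Assumes d^T Q d > 0 (true for any d != 0 when Q is positive definite).
    """
    n = len(x)
    grad = [sum(Q[i][j] * x[j] for j in range(n)) + c[i] for i in range(n)]
    Qd = [sum(Q[i][j] * d[j] for j in range(n)) for i in range(n)]
    slope = sum(grad[i] * d[i] for i in range(n))      # d^T grad f(x)
    curvature = sum(d[i] * Qd[i] for i in range(n))    # d^T Q d
    return -slope / curvature


# Illustrative data: steepest descent direction from the origin.
Q = [[2.0, 0.0], [0.0, 10.0]]
c = [-2.0, -10.0]
x = [0.0, 0.0]
d = [2.0, 10.0]                      # d = -grad f(x) = -(Qx + c)
alpha = exact_step(Q, c, x, d)
x_new = [x[i] + alpha * d[i] for i in range(2)]
```

At the new point the directional derivative along `d` vanishes, which is the defining property of the exact step.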

## Approximate step size

Rules for choosing a step-size (from the sufficient condition for convergence):
$$f(x_{k+1}) < f(x_k), \qquad \lim_{k\to\infty} \frac{d_k^T \nabla f(x_k)}{\|d_k\|} = 0$$
Often it is also required that
$$\|x_{k+1} - x_k\| \to 0, \qquad d_k^T \nabla f(x_k + \alpha_k d_k) \to 0$$
In general it is important to ensure a sufficient reduction of $f$ and a sufficiently large step $x_{k+1} - x_k$.

## Avoid too small steps

[Figure: iterates taking steps that are too small.]

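Armijo's rule
Input: $\delta \in (0, 1)$, $\gamma \in (0, 1/2)$, $\Delta_k > 0$
  $\alpha := \Delta_k$;
  while $f(x_k + \alpha d_k) > f(x_k) + \gamma\, \alpha\, d_k^T \nabla f(x_k)$ do
    $\alpha := \delta \alpha$;
  end
  return $\alpha$

Typical values: $\delta \in [0.1, 0.5]$, $\gamma \in [10^{-4}, 10^{-3}]$. On exit the returned step is such that
$$f(x_k + \alpha d_k) \le f(x_k) + \gamma\, \alpha\, d_k^T \nabla f(x_k)$$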
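The backtracking loop above can be sketched in a few lines. This is an illustrative implementation, assuming points are plain Python lists and that the caller supplies the directional derivative `slope` $= d_k^T \nabla f(x_k) < 0$; the names `armijo`, `step0` and `max_reductions` are assumptions introduced here.

```python
def armijo(f, x, d, slope, delta=0.5, gamma=1e-4, step0=1.0, max_reductions=60):
    """Backtracking line search: shrink alpha by delta until sufficient decrease.

    slope is d^T grad f(x) and must be negative (d is a descent direction).
    """
    fx = f(x)
    alpha = step0
    for _ in range(max_reductions):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + gamma * alpha * slope:
            return alpha
        alpha *= delta
    raise RuntimeError("no acceptable step found; is d a descent direction?")


def f(v):
    return v[0] ** 2 + 10.0 * v[1] ** 2

x = [1.0, 1.0]
grad = [2.0, 20.0]
d = [-g for g in grad]                       # steepest descent direction
slope = sum(g * di for g, di in zip(grad, d))
alpha = armijo(f, x, d, slope)
```

The iteration cap is a safeguard that the slide's pseudocode does not have; with a true descent direction and a smooth $f$ the loop terminates on its own.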

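[Figure: acceptable steps under Armijo's rule; the slope at $\alpha = 0$ is $d_k^T \nabla f(x_k)$.]

## Line search in practice

How to choose the initial step size $\Delta_k$? Let $\phi(\alpha) = f(x_k + \alpha d_k)$. A possibility is to choose $\Delta_k = \alpha^\star$, the minimizer of a quadratic approximation to $\phi(\cdot)$. Example:
$$q(\alpha) = c_0 + c_1 \alpha + \frac{1}{2} c_2 \alpha^2, \qquad q(0) = c_0 := f(x_k), \qquad q'(0) = c_1 := d_k^T \nabla f(x_k)$$
Then $\alpha^\star = -c_1 / c_2$.

Third condition? If an estimate $\hat f$ of the minimum of $f(x_k + \alpha d_k)$ is available, choose $c_2$ so that $\min q(\alpha) = \hat f$:
$$\min q(\alpha) = q(-c_1/c_2) = c_0 - \frac{1}{2}\frac{c_1^2}{c_2} := \hat f \quad\Rightarrow\quad c_2 = \frac{c_1^2}{2 (c_0 - \hat f)}$$
$$\alpha^\star = -\frac{c_1}{c_2} = 2\, \frac{\hat f - c_0}{c_1}, \qquad\text{i.e.}\qquad \Delta_k = 2\, \frac{\hat f - f(x_k)}{d_k^T \nabla f(x_k)}$$
A reasonable estimate might be to choose
$$\Delta_k = 2\, \frac{f(x_k) - f(x_{k-1})}{d_k^T \nabla f(x_k)}$$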

## Convergence of steepest descent

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
If a sufficiently accurate step size is used, the conditions of the theorem on global convergence are satisfied, so the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.

## Local analysis of steepest descent

Behaviour of the algorithm when minimizing
$$f(x) = \frac{1}{2} x^T Q x$$
where $Q \succ 0$. (Local and global) optimum: $x^\star = 0$. Steepest descent method:
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) = x_k - \alpha_k Q x_k = (I - \alpha_k Q) x_k$$
Error (in $x$) at step $k+1$:
$$\|x_{k+1} - 0\| = \|(I - \alpha_k Q) x_k\| = \sqrt{x_k^T (I - \alpha_k Q)^2 x_k}$$

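Analysis
Let $A$: symmetric with eigenvalues $\lambda_1 < \cdots < \lambda_n$. Then
$$\lambda_1 \|v\|^2 \le v^T A v \le \lambda_n \|v\|^2 \qquad \forall v \in \mathbb{R}^n$$
- $\lambda$ is an eigenvalue of $A$ iff $\alpha\lambda$ is an eigenvalue of $\alpha A$
- $\lambda$ is an eigenvalue of $A$ iff $1 + \lambda$ is an eigenvalue of $I + A$

...
$$x_k^T (I - \alpha_k Q)^2 x_k \le \max_i\, (1 - \alpha_k \lambda_i)^2\; x_k^T x_k$$
where $\lambda_i$ are the eigenvalues of $Q$. The maximum eigenvalue will be:
$$\max\{(1 - \alpha_k \lambda_1)^2, (1 - \alpha_k \lambda_n)^2\}$$
thus
$$\|x_{k+1}\| \le \sqrt{\max\{(1 - \alpha_k \lambda_1)^2, (1 - \alpha_k \lambda_n)^2\}}\; \|x_k\| = \max\{|1 - \alpha_k \lambda_1|, |1 - \alpha_k \lambda_n|\}\; \|x_k\|$$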

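...
Eliminating the dependency on $\alpha_k$:
$$\max\{|1 - \alpha\lambda_1|, |1 - \alpha\lambda_n|\} = \max\{1 - \alpha\lambda_1,\; -1 + \alpha\lambda_1,\; 1 - \alpha\lambda_n,\; -1 + \alpha\lambda_n\}$$
Since $\alpha \ge 0$ and $\lambda_1 \le \lambda_n$,
$$-1 + \alpha\lambda_1 \le -1 + \alpha\lambda_n, \qquad 1 - \alpha\lambda_1 \ge 1 - \alpha\lambda_n$$
and thus
$$\max\{|1 - \alpha\lambda_1|, |1 - \alpha\lambda_n|\} = \max\{1 - \alpha\lambda_1,\; -1 + \alpha\lambda_n\}$$

[Figure: $|1 - \alpha\lambda_1|$ and $|1 - \alpha\lambda_n|$ as functions of $\alpha$.]

Minimum point:
$$1 - \alpha\lambda_1 = -1 + \alpha\lambda_n$$
i.e.
$$\alpha^\star = \frac{2}{\lambda_1 + \lambda_n}$$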

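Analysis
In the best possible case
$$\frac{\|x_{k+1}\|}{\|x_k\|} \le |1 - \alpha^\star \lambda_1| = \left|1 - \frac{2\lambda_1}{\lambda_1 + \lambda_n}\right| = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} = \frac{\rho - 1}{\rho + 1}$$
where $\rho = \lambda_n / \lambda_1$: condition number of $Q$.
- $\rho \gg 1$ (ill-conditioned problem): very slow convergence
- $\rho \approx 1$: very fast convergence

Zigzagging
$$\min \frac{1}{2}(x^2 + M y^2)$$
where $M > 0$. Optimum: $x^\star = y^\star = 0$. Starting point: $(M, 1)$. Iterates:
$$\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} x_k \\ y_k \end{pmatrix} - \alpha_k \begin{pmatrix} x_k \\ M y_k \end{pmatrix}$$
With optimal step size
$$\begin{pmatrix} x_k \\ y_k \end{pmatrix} = \begin{pmatrix} M \left(\dfrac{M-1}{M+1}\right)^k \\[2mm] \left(-\dfrac{M-1}{M+1}\right)^k \end{pmatrix}$$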

Zigzagging
Convergence is rapid if $M \approx 1$, very slow and zigzagging if $M \gg 1$ or $M \ll 1$.

Slow convergence and zigzagging are general phenomena (especially when the starting point is near the longest axes of the ellipsoidal level sets).

[Figure: zigzagging path of the steepest descent iterates.]
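The closed form of the iterates can be reproduced numerically. The sketch below runs steepest descent with the exact step on $\frac{1}{2}(x^2 + M y^2)$ from $(M, 1)$; the function name and the choice $M = 10$ are illustrative.

```python
def steepest_descent_quadratic(M, iterations):
    """Exact-step steepest descent on f(x, y) = 0.5*(x^2 + M*y^2), from (M, 1)."""
    x, y = float(M), 1.0
    path = [(x, y)]
    for _ in range(iterations):
        gx, gy = x, M * y                                   # gradient of f
        alpha = (gx * gx + gy * gy) / (gx * gx + M * gy * gy)  # g^T g / g^T Q g
        x, y = x - alpha * gx, y - alpha * gy
        path.append((x, y))
    return path


M = 10.0
path = steepest_descent_quadratic(M, 5)
ratio = (M - 1.0) / (M + 1.0)
```

The sign of `y` flips at every iteration while `x` shrinks by the factor `ratio`, which is the zigzag pattern described above.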

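## Analysis of Newton's method

Newton-Raphson method: $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$. Let $x^\star$: local optimum. Taylor expansion of $\nabla f$:
$$\nabla f(x^\star) = 0 = \nabla f(x_k) + \nabla^2 f(x_k)(x^\star - x_k) + o(\|x^\star - x_k\|)$$
If $\nabla^2 f(x_k)$ is non singular and $\|(\nabla^2 f(x_k))^{-1}\|$ is bounded,
$$0 = (\nabla^2 f(x_k))^{-1} \nabla f(x_k) + (x^\star - x_k) + (\nabla^2 f(x_k))^{-1}\, o(\|x^\star - x_k\|) = x^\star - x_{k+1} + o(\|x^\star - x_k\|)$$
Thus
$$x^\star - x_{k+1} = o(\|x^\star - x_k\|)$$
i.e.
$$\frac{\|x^\star - x_{k+1}\|}{\|x^\star - x_k\|} \le \frac{o(\|x^\star - x_k\|)}{\|x^\star - x_k\|} \to 0$$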
## Local Convergence of Newton's Method

Let $f \in C^2(U(x^\star, \varepsilon_1))$, where $U$: ball with radius $\varepsilon_1$ and center $x^\star$; let $\nabla^2 f(x^\star)$ be nonsingular. Then:
1. $\exists\, \varepsilon > 0$: if $x_0 \in U(x^\star, \varepsilon)$ then $\{x_k\}$ is well defined and converges to $x^\star$ at least superlinearly.
2. If $\exists\, \delta > 0$, $L > 0$, $M > 0$:
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$$
and
$$\|(\nabla^2 f(x))^{-1}\| \le M$$
then, if $x_0 \in U(x^\star, \varepsilon)$, Newton's method converges with order at least 2 and
$$\|x_{k+1} - x^\star\| \le \frac{LM}{2} \|x_k - x^\star\|^2$$

Difficulties
Many things might go wrong:
- at some iteration, $\nabla^2 f(x_k)$ might be singular. For example: if $x_k$ belongs to a flat region $f(x) = \text{constant}$.
- even if non singular, inverting $\nabla^2 f(x_k)$ or, in any case, solving a linear system with coefficient matrix $\nabla^2 f(x_k)$ is numerically unstable and computationally demanding
- there is no guarantee that $\nabla^2 f(x_k) \succ 0$, so the Newton direction might not be a descent direction

Difficulties
- Newton's method just tries to solve the system
$$\nabla f(x) = 0$$
and thus might very well be attracted towards a maximum
- the method lacks global convergence: it converges only if started "near" a local optimum

Newton-type methods
- line search variant: $x_{k+1} = x_k - \alpha_k (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$
- Modified Newton method: replace $\nabla^2 f(x_k)$ by $(\nabla^2 f(x_k) + D_k)$ where $D_k$ is chosen so that $\nabla^2 f(x_k) + D_k$ is positive definite

Quasi-Newton methods
Consider solving the nonlinear system $\nabla f(x) = 0$. Taylor expansion of the gradient:
$$\nabla f(x_k) \approx \nabla f(x_{k+1}) + \nabla^2 f(x_{k+1})(x_k - x_{k+1})$$
Let $B_{k+1}$ be an approximation of the hessian in $x_{k+1}$. Quasi-Newton equation:
$$B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$$

Quasi-Newton equation
Let:
$$s_k := x_{k+1} - x_k, \qquad y_k := \nabla f(x_{k+1}) - \nabla f(x_k)$$
Quasi-Newton equation: $B_{k+1} s_k = y_k$. If $B_k$ was the previous approximate hessian, we ask that
1. the variation between $B_k$ and $B_{k+1}$ is "small"
2. nothing changes along directions which are normal to the step $s_k$:
$$B_k z = B_{k+1} z \qquad \forall z : z^T s_k = 0$$
Choosing $n - 1$ vectors $z$ which are orthogonal to $s_k$ gives $n^2$ linearly independent equations in $n^2$ unknowns, hence a unique solution.

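Broyden updating
It can be shown that the unique solution is given by:
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k}$$
Theorem: let $B_k \in \mathbb{R}^{n \times n}$ and $s_k \neq 0$. The unique solution to:
$$\min_{\hat B} \|B_k - \hat B\|_F \quad \text{s.t.} \quad \hat B s_k = y_k$$
is Broyden's update $B_{k+1}$; here $\|X\|_F = \sqrt{\operatorname{Tr} X^T X}$ denotes the Frobenius norm.

proof
For any $\hat B$ such that $\hat B s_k = y_k$:
$$\|B_{k+1} - B_k\|_F = \left\| \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k} \right\|_F = \left\| \frac{(\hat B s_k - B_k s_k)\, s_k^T}{s_k^T s_k} \right\|_F = \left\| (\hat B - B_k) \frac{s_k s_k^T}{s_k^T s_k} \right\|_F$$
$$\le \|\hat B - B_k\|_F \left\| \frac{s_k s_k^T}{s_k^T s_k} \right\|_F = \|\hat B - B_k\|_F\, \frac{\sqrt{\operatorname{Tr}\, s_k s_k^T s_k s_k^T}}{s_k^T s_k} = \|\hat B - B_k\|_F$$
Uniqueness is a consequence of the strict convexity of the norm and the convexity of the feasible region.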
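The two defining properties of Broyden's update (the secant equation holds, and nothing changes along directions orthogonal to $s_k$) are easy to verify on a small example. This is a minimal sketch with matrices as nested lists; `broyden_update` and `matvec` are names introduced here for illustration.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def broyden_update(B, s, y):
    """Rank-one update B + (y - B s) s^T / (s^T s)."""
    Bs = matvec(B, s)
    ss = sum(si * si for si in s)
    residual = [yi - bi for yi, bi in zip(y, Bs)]
    return [[B[i][j] + residual[i] * s[j] / ss for j in range(len(s))]
            for i in range(len(s))]


B = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 2.0]
y = [3.0, 1.0]
B_new = broyden_update(B, s, y)
z = [2.0, -1.0]                      # orthogonal to s
```

Here `matvec(B_new, s)` reproduces `y`, while `matvec(B_new, z)` equals `matvec(B, z)`. Note that `B_new` is not symmetric even though `B` is.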

## Quasi-Newton and optimization

Special situation:
1. the hessian matrix in optimization problems is symmetric;
2. in gradient methods, when we let $x_{k+1} = x_k - (B_{k+1})^{-1} \nabla f(x_k)$, it is desirable that $B_{k+1}$ be positive definite.

Broyden's update:
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k}$$
is in general not symmetric, even when $B_k$ is.

Symmetry
Remedy: let
$$C_1 = B_k + \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k}$$
symmetrization:
$$C_2 = \frac{1}{2}(C_1 + C_1^T)$$
However, it does not satisfy the Quasi-Newton equation. Broyden update of $C_2$:
$$C_3 = C_2 + \frac{(y_k - C_2 s_k)\, s_k^T}{s_k^T s_k}$$
which is not symmetric; the two steps are then repeated.

PSB update
In the limit
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^T + s_k (y_k - B_k s_k)^T}{s_k^T s_k} - \frac{\left(s_k^T (y_k - B_k s_k)\right) s_k s_k^T}{(s_k^T s_k)^2}$$
(PSB: Powell-symmetric-Broyden update). Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, y_k^T + y_k (y_k - B_k s_k)^T}{y_k^T s_k} - \frac{\left(s_k^T (y_k - B_k s_k)\right) y_k y_k^T}{(y_k^T s_k)^2}$$
$$= \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) B_k \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) + \frac{y_k y_k^T}{y_k^T s_k}$$

BFGS
Same ideas, but applied to the approximate inverse Hessian. Inverse Quasi-Newton equation:
$$s_k = H_{k+1} y_k$$
lead to the most common Quasi-Newton update, BFGS (Broyden-Fletcher-Goldfarb-Shanno):
$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$

BFGS method
$$x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$$
$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$

## Trust Region methods

Possible defect of standard Newton method: the approximation becomes less and less precise if we move away from the current point. Long step, bad approximation. Idea: constrained minimization of the quadratic approximation:
$$x_{k+1} = \arg\min_{\|x - x_k\| \le \Delta_k} m_k(x)$$
where
$$m_k(x) = f(x_k) + \nabla^T f(x_k)(x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$$
$\Delta_k > 0$: parameter. First advantage (over pure Newton): the step is always defined (thanks to Weierstrass's theorem).

## Outline of Trust Region

Let $m_k(\cdot)$ be a local model function. E.g. in Newton Trust Region methods,
$$m_k(s) = f(x_k) + s^T \nabla f(x_k) + \frac{1}{2} s^T \nabla^2 f(x_k)\, s$$
or in a Quasi-Newton Trust Region method
$$m_k(s) = f(x_k) + s^T \nabla f(x_k) + \frac{1}{2} s^T B_k s$$
How to choose and update the trust region radius $\Delta_k$? Given a step $s_k$, let
$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(0) - m_k(s_k)}$$
the ratio between the actual reduction and the predicted reduction.

Model updating
$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(0) - m_k(s_k)}$$
The predicted reduction is always non negative;
- if $\rho_k$ is small (surely if it is negative) the model and the function strongly disagree: the step must be rejected and the trust region reduced
- if $\rho_k \approx 1$ it is safe to expand the trust region

Algorithm
Data: $\hat\Delta > 0$, $\Delta_0 \in (0, \hat\Delta)$, $\eta \in [0, 1/4]$
for $k = 0, 1, \ldots$ do
  find the step $s_k$ minimizing the model in the trust region, and compute $\rho_k$;
  if $\rho_k < 1/4$ then
    $\Delta_{k+1} = \Delta_k / 4$;
  else if …
  end
  …
  else …
  end
end

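## Solving the model

How to find
$$\min_s\; \nabla f(x_k)^T s + \frac{1}{2} s^T B_k s \quad \text{s.t.} \quad \|s\| \le \Delta\; ?$$
If $B_k \succ 0$, KKT conditions are necessary and sufficient; rewriting the constraint as $s^T s \le \Delta^2$:
$$\nabla f(x_k) + B_k s + 2\lambda s = 0, \qquad \lambda(\Delta - \|s\|) = 0$$
Thus either $s$ is in the interior of the ball with radius $\Delta$, in which case $\lambda = 0$ and we have the (quasi)-Newton step:
$$p = -B_k^{-1} \nabla f(x_k)$$
or $\|s\| = \Delta$, and if $\lambda > 0$ then
$$2\lambda s = -\nabla f(x_k) - B_k s = -\nabla m_k(s)$$
so $s$ is parallel to the negative gradient of the model and normal to its contour lines.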

## The Cauchy Point

Strategy to approximately solve the trust region subproblem. Find the "Cauchy point": the minimizer of $m_k$ along the direction $-\nabla f(x_k)$ within the trust region. First find the direction:
$$p_k^s = \arg\min_p\; f_k + \nabla f(x_k)^T p \quad \text{s.t.} \quad \|p\| \le \Delta_k$$
Then along this direction find a minimizer
$$\tau_k = \arg\min_{\tau \ge 0}\; m_k(\tau p_k^s) \quad \text{s.t.} \quad \|\tau p_k^s\| \le \Delta_k$$
The Cauchy point is $x_k + \tau_k p_k^s$.

## Finding the Cauchy point

Finding $p_k^s$ is easy: analytic solution:
$$p_k^s = -\Delta_k \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|}$$
For the step size $\tau_k$:
- if $\nabla f(x_k)^T B_k \nabla f(x_k) \le 0$: negative curvature direction, so take the largest possible step, $\tau_k = 1$
- otherwise the model along the line is convex, so
$$\tau_k = \min\left\{1,\; \frac{\|\nabla f(x_k)\|^3}{\Delta_k\, \nabla f(x_k)^T B_k \nabla f(x_k)}\right\}$$

Choosing the Cauchy point gives global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched starting from the Cauchy one.
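The two-case formula for $\tau_k$ translates directly into code. The following sketch assumes the gradient is non-zero and uses nested lists for $B_k$; the function name `cauchy_step` is illustrative.

```python
import math


def cauchy_step(g, B, radius):
    """Step tau * p_s from the current point to the Cauchy point.

    g: gradient (non-zero), B: model Hessian as a list of rows, radius: Delta_k.
    """
    n = len(g)
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    Bg = [sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]
    gBg = sum(g[i] * Bg[i] for i in range(n))
    if gBg <= 0.0:
        tau = 1.0                                  # negative curvature: go to the boundary
    else:
        tau = min(1.0, gnorm ** 3 / (radius * gBg))
    return [-tau * radius * gi / gnorm for gi in g]


identity = [[1.0, 0.0], [0.0, 1.0]]
inside = cauchy_step([1.0, 0.0], identity, 2.0)     # model minimizer lies inside the region
clipped = cauchy_step([1.0, 0.0], identity, 0.5)    # model minimizer lies outside
concave = cauchy_step([1.0, 0.0], [[-1.0, 0.0], [0.0, -1.0]], 0.5)
```

With the identity as model Hessian and a large radius, the Cauchy step coincides with the full steepest descent minimizer; with a small radius or negative curvature it stops at the boundary of the trust region.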

Pattern Search
For smooth optimization, but without knowledge of derivatives. Elementary idea: if $x \in \mathbb{R}^2$ is not a local minimum for $f$, then at least one of the directions $e_1, e_2, -e_1, -e_2$ (moving towards E, N, W, S) forms an acute angle with $-\nabla f(x)$, so it is a descent direction. Direct search: explores all the directions in search of one which gives a descent.

Coordinate search
Let $D_\oplus = \{\pm e_i\}$ be the set of coordinate directions and their opposites.
Data: $k = 0$, $\Delta_0$ an initial step length, $x_0$ a starting point
while $\Delta_k$ is large enough do
  if $f(x_k + \Delta_k d) < f(x_k)$ for some $d \in D_\oplus$ then
    $x_{k+1} = x_k + \Delta_k d$ (step accepted);
  else
    $\Delta_{k+1} = 0.5\, \Delta_k$;
  end
  $k = k + 1$;
end

Pattern search
It is not necessary to explore $2n$ directions. It is sufficient that the set of directions forms a positive span, i.e. every $v \in \mathbb{R}^n$ should be expressible as a non negative linear combination of the vectors in the set. Formally, $G$ is a generating set iff
$$\forall v \neq 0 \in \mathbb{R}^n\;\; \exists g \in G : v^T g > 0$$
Cosine measure of $G$:
$$\kappa(G) := \min_{v \neq 0}\, \max_{d \in G}\, \frac{v^T d}{\|v\|\, \|d\|}$$
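The coordinate search loop can be sketched as follows. This is an illustrative implementation that accepts the first improving direction it finds; the stopping tolerance `tol` and the iteration cap are assumptions added here to make "large enough" concrete.

```python
def coordinate_search(f, x0, step0=1.0, tol=1e-8, max_iter=10000):
    """Derivative-free search along +/- coordinate directions with step halving."""
    x = list(x0)
    fx = f(x)
    step = step0
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                f_trial = f(trial)
                if f_trial < fx:                 # simple decrease: accept the step
                    x, fx = trial, f_trial
                    improved = True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5                          # failure: halve the step length
    return x, fx


best_x, best_f = coordinate_search(
    lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2, [0.0, 0.0])
```

On this separable quadratic the search reaches the minimizer $(1, -2)$ with unit steps and then only halves the step until it falls below `tol`.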

Examples
[Figure: three examples of generating sets in $\mathbb{R}^2$.]
In the first case $\kappa \approx 0.19612$, in the second $\kappa = 0.5$, in the third $\kappa = \sqrt{0.5} \approx 0.7071$.

Step Choice
$$x_{k+1} = \begin{cases} x_k + \Delta_k d_k & \text{if } f(x_k + \Delta_k d_k) < f(x_k) - \rho(\Delta_k) \text{ (success)} \\ x_k & \text{otherwise (failure)} \end{cases}$$
where $\rho(t) = o(t)$. We let
$$\Delta_{k+1} = \phi_k \Delta_k$$
where $\phi_k \ge 1$ for successful iterations, $\phi_k < 1$ otherwise. Direct methods possess good convergence properties.

Nelder-Mead simplex method
Given a simplex $S = \{v_1, \ldots, v_{n+1}\}$ in $\mathbb{R}^n$, let $v_r$ be the worst point: $r = \arg\max_i \{f(v_i)\}$. Let $C$ be the centroid of $S \setminus \{v_r\}$:
$$C = \frac{1}{n} \sum_{i \neq r} v_i$$
The algorithm performs a sort of line search along the direction $C - v_r$. Let
$$R = C + (C - v_r)$$
be the reflection of the worst point along the direction. Let $\hat f$ be the best function value in the current simplex. Three cases might occur:

1: Reflection
Check $f(R)$: if it is intermediate, i.e. better than the worst and worse than the best, then accept the reflection, i.e. discard the worst point in the simplex and replace it with $R$.

[Figure: reflection step, showing the worst point and its reflection.]

2: Improvement
If the trial step is an improvement:
$$f(R) < \hat f$$
then attempt an expansion: try to move $R$ to $\hat R = R + (R - C)$. If successful ($f(\hat R) < f(R)$) then accept the expansion and discard the worst point. If unsuccessful, then accept $R$ as a new point and discard the worst one.

[Figure: expansion, showing the worst point, the reflection and the expansion.]

3: Contraction
If however the reflected point $R$ is worse than all points in the simplex (possibly except the worst $v_r$), then a contraction step is performed:
- if $f(R) > f(v_r)$ ($R$ is worse than all points in the simplex), add $0.5 (v_r + C)$ to the simplex and discard $v_r$;
- otherwise add $0.5 (R + C)$ to the simplex and discard $v_r$.

[Figure: contraction, showing the worst point, the reflection and the contraction.]

Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners. However it may fail to converge to a local minimum. There are examples of strictly convex functions in $\mathbb{R}^2$ on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the $n$-dimensional simplex degenerates into a lower dimensional space. Moreover the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.
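One iteration of the three-case scheme can be sketched as below. This follows the simplified description in the slides (no shrink step), and it assumes that "intermediate" means better than the second-worst vertex; that reading, and the name `nelder_mead_step`, are choices made for this illustration.

```python
def nelder_mead_step(f, simplex):
    """One reflection / expansion / contraction step; returns the new simplex."""
    values = [f(v) for v in simplex]
    r = max(range(len(simplex)), key=lambda i: values[i])    # index of the worst vertex
    worst = simplex[r]
    others = [v for i, v in enumerate(simplex) if i != r]
    other_values = [fv for i, fv in enumerate(values) if i != r]
    n = len(others)
    C = [sum(v[j] for v in others) / n for j in range(len(worst))]   # centroid
    R = [2.0 * c - w for c, w in zip(C, worst)]                      # C + (C - v_r)
    fR = f(R)
    if fR < min(other_values):                       # case 2: try an expansion
        E = [2.0 * rj - cj for rj, cj in zip(R, C)]  # R + (R - C)
        new = E if f(E) < fR else R
    elif fR < max(other_values):                     # case 1: accept the reflection
        new = R
    elif fR > values[r]:                             # case 3: contract towards v_r
        new = [0.5 * (w + c) for w, c in zip(worst, C)]
    else:                                            # case 3: contract towards R
        new = [0.5 * (rj + c) for rj, c in zip(R, C)]
    return others + [new]


def sphere(v):
    return v[0] ** 2 + v[1] ** 2

simplex = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
after_two = nelder_mead_step(sphere, nelder_mead_step(sphere, simplex))
later = after_two
for _ in range(40):
    later = nelder_mead_step(sphere, later)
```

Since only the worst vertex is ever replaced, the best function value in the simplex never increases from one step to the next.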

Implicit filtering
Let
$$f(x) = h(x) + w(x)$$
where $h(x)$ is a smooth function, while $w(x)$ can be considered as an additive, typically random, noise. The method performs a rough estimate of the gradient (finite difference with a "large" step) and proceeds with an Armijo line search. If unsuccessful, the step for finite differences is reduced.

Implicit filtering
Data: $\{\varepsilon_k\} \downarrow 0$, params $\delta, \gamma, \Delta$ of Armijo's rule
repeat
  OuterIteration = false;
  repeat
    compute $f(x_k)$ and a finite difference estimate of $\nabla f(x_k)$:
    $\nabla_{\varepsilon_k} f(x_k) = \left[ \left( f(x_k + \varepsilon_k e_i) - f(x_k - \varepsilon_k e_i) \right) / 2\varepsilon_k \right]$;
    if $\|\nabla_{\varepsilon_k} f(x_k)\| \le \varepsilon_k$ then
      OuterIteration = true;
    else
      Armijo: if successful accept the Armijo step; otherwise let OuterIteration = true;
    end
  until OuterIteration;
  $k = k + 1$;
until convergence criterion

Convergence properties
If
- $\nabla^2 h(x)$ is Lipschitz continuous
- the sequence $\{x_k\}$ generated by the method is infinite
- $$\lim_{k\to\infty} \left( \varepsilon_k^2 + \frac{\eta(x_k; \varepsilon_k)}{\varepsilon_k} \right) = 0 \qquad \text{where} \qquad \eta(x; \varepsilon) = \sup_{z : \|z - x\| \le \varepsilon} |w(z)|$$
- unsuccessful Armijo steps occur at most a finite number of times

then all limit points of $\{x_k\}$ are stationary.
## Algorithms for constrained local optimization

Fabio Schoen 2008

## Feasible direction methods

http://gol.dsi.unifi.it/users/schoen

Frank-Wolfe method
Let $X$: convex set. Consider the problem:
$$\min_{x \in X} f(x)$$
Let $x_k \in X$; choosing a feasible direction $d_k$ corresponds to choosing a point $x \in X$: $d_k = x - x_k$. "Steepest descent" choice:
$$\min_{x \in X} \nabla^T f(x_k)(x - x_k)$$
(a linear objective with convex constraints, usually easy to solve). Let $\bar x_k$ be an optimal solution of this problem.

Frank-Wolfe
If $\nabla^T f(x_k)(\bar x_k - x_k) = 0$ then
$$\nabla^T f(x_k)\, d \ge 0$$
for every feasible direction $d$, so first order necessary conditions hold. Otherwise, letting $d_k = \bar x_k - x_k$, this is a descent direction along which a step $\alpha_k \in (0, 1]$ might be chosen according to Armijo's rule.

## Convergence of Frank-Wolfe method

Under mild conditions the method converges to a point satisfying first order necessary conditions. However it is usually extremely slow (convergence may be sublinear). It might find applications in very large scale problems in which solving the sub-problem for direction determination is very easy (e.g. when $X$ is a polytope).

Gradient projection
Generic iteration:
$$x_{k+1} = x_k + \alpha_k (\bar x_k - x_k)$$
where the direction $d_k = \bar x_k - x_k$ is obtained finding
$$\bar x_k = [x_k - s_k \nabla f(x_k)]^+$$
with $[\cdot]^+$ the projection onto $X$ and $s_k > 0$.

The method is slightly faster than Frank-Wolfe, with a linear convergence rate similar to that of (unconstrained) steepest descent. It might be applied if projection is relatively cheap, e.g. when the feasible set is a box. A point $x_k$ satisfies first order necessary conditions $d^T \nabla f(x_k) \ge 0$ iff
$$x_k = [x_k - s_k \nabla f(x_k)]^+$$
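When the feasible set is a box, the projection is a componentwise clamp, and the iteration above takes a few lines. This is a minimal sketch with fixed $s_k = s$ and $\alpha_k = 1$ (both simplifying assumptions); the names `project_box` and `projected_gradient` are introduced here.

```python
def project_box(x, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper}."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def projected_gradient(grad, x0, lower, upper, s=0.1, alpha=1.0, iterations=200):
    """Iterate x <- x + alpha * ([x - s*grad(x)]^+ - x)."""
    x = project_box(x0, lower, upper)
    for _ in range(iterations):
        g = grad(x)
        x_bar = project_box([xi - s * gi for xi, gi in zip(x, g)], lower, upper)
        x = [xi + alpha * (xb - xi) for xi, xb in zip(x, x_bar)]
    return x


def grad(v):
    # Gradient of (x - 2)^2 + (y + 1)^2; its unconstrained minimizer (2, -1) is infeasible.
    return [2.0 * (v[0] - 2.0), 2.0 * (v[1] + 1.0)]

lower, upper = [0.0, 0.0], [1.0, 1.0]
x_star = projected_gradient(grad, [0.5, 0.5], lower, upper)
fixed = project_box([xi - 0.1 * gi for xi, gi in zip(x_star, grad(x_star))], lower, upper)
```

The returned point is the corner $(1, 0)$ of the box, and it satisfies the fixed-point characterization $x = [x - s \nabla f(x)]^+$.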

Barrier Methods
$$\min f(x) \quad \text{s.t.} \quad g_j(x) \le 0, \quad j = 1, \ldots, r$$
A barrier is a continuous function which tends to $+\infty$ whenever $x$ approaches the boundary of the feasible region. Examples of barrier functions:
$$B(x) = -\sum_j \log(-g_j(x)) \qquad \text{logarithmic barrier}$$
$$B(x) = -\sum_j \frac{1}{g_j(x)} \qquad \text{inverse barrier}$$

Barrier Method
Let $\varepsilon_k \downarrow 0$ and $x_0$ strictly feasible, i.e. $g_j(x_0) < 0\;\; \forall j$. Then let
$$x_k = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \varepsilon_k B(x) \right)$$
Proposition: every limit point of $\{x_k\}$ is a global minimum of the constrained optimization problem.
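## Analysis of Barrier methods

Special case: a single constraint (might be generalized). Let $\bar x$ be a limit point of $\{x_k\}$ (a global minimum). If KKT conditions hold, then there exists a unique $\lambda \ge 0$:
$$\nabla f(\bar x) + \lambda \nabla g(\bar x) = 0$$
(with $\lambda g(\bar x) = 0$). $x_k$, solution of the barrier problem
$$\min f(x) + \varepsilon_k B(x) \quad \text{s.t.} \quad g(x) < 0$$
satisfies
$$\nabla f(x_k) + \varepsilon_k \nabla B(x_k) = 0$$

...
If $B(x) = \phi(g(x))$,
$$\nabla f(x_k) + \varepsilon_k \phi'(g(x_k)) \nabla g(x_k) = 0$$
In the limit, for $k \to \infty$:
$$\lim_k \varepsilon_k \phi'(g(x_k)) \nabla g(x_k) = \lambda \nabla g(\bar x)$$
- if $\lim_k g(x_k) < 0$: $\phi'(g(x_k)) \nabla g(x_k) \to K$ (finite) and $K \varepsilon_k \to 0$
- if $\lim_k g(x_k) = 0$: (thanks to the uniqueness of Lagrange multipliers)
$$\lambda = \lim_k \varepsilon_k \phi'(g(x_k))$$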
## Difficulties in Barrier Methods

- strong numeric instability: the condition number of the hessian matrix grows as $\varepsilon_k \to 0$
- need for an initial strictly feasible point $x_0$
- (partial) remedy: $\varepsilon_k$ is very slowly decreased and the solution of the $(k+1)$-th problem is obtained starting an unconstrained optimization from $x_k$

Example
$$\min (x - 1)^2 + (y - 1)^2 \quad \text{s.t.} \quad x + y \le 1$$
Logarithmic barrier problem:
$$\min (x - 1)^2 + (y - 1)^2 - \varepsilon_k \log(1 - x - y) \quad \text{s.t.} \quad x + y - 1 < 0$$
Gradient:
$$\begin{pmatrix} 2(x - 1) + \dfrac{\varepsilon_k}{1 - x - y} \\[2mm] 2(y - 1) + \dfrac{\varepsilon_k}{1 - x - y} \end{pmatrix}$$
Stationary points:
$$x = y = \frac{3}{4} \pm \frac{\sqrt{1 + 4\varepsilon_k}}{4}$$
(only the "$-$" solution is acceptable)
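The closed-form stationary point of this example can be checked against the gradient, and followed as $\varepsilon_k \to 0$. The sketch below only evaluates the formulas above; the function names are illustrative.

```python
import math


def barrier_stationary_point(eps):
    """The feasible ('minus') root x = y = 3/4 - sqrt(1 + 4*eps)/4."""
    t = (3.0 - math.sqrt(1.0 + 4.0 * eps)) / 4.0
    return t, t


def barrier_gradient(x, y, eps):
    """Gradient of (x-1)^2 + (y-1)^2 - eps*log(1 - x - y)."""
    slack = 1.0 - x - y
    return (2.0 * (x - 1.0) + eps / slack, 2.0 * (y - 1.0) + eps / slack)


trajectory = [barrier_stationary_point(eps) for eps in (1.0, 0.1, 1e-3, 1e-8)]
```

The points in `trajectory` are strictly feasible and approach $(1/2, 1/2)$, the solution of the constrained problem, as the barrier parameter decreases.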

## Barrier methods and L.P.

$$\min c^T x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0$$
Logarithmic barrier on $x \ge 0$:
$$\min c^T x - \varepsilon \sum_j \log x_j \quad \text{s.t.} \quad Ax = b, \quad x > 0$$

## The central path

The starting point is usually associated with $\varepsilon = \infty$ and is the unique solution of
$$\min -\sum_j \log x_j \quad \text{s.t.} \quad Ax = b, \quad x > 0$$
The trajectory $x(\varepsilon)$ of solutions to the barrier problem is called the central path and leads to an optimal solution of the LP.

Penalty Methods
Penalized problem:
$$\min f(x) + \rho P(x)$$
where $\rho > 0$ and $P(x) \ge 0$ with $P(x) = 0$ if $x$ is feasible. Example:
$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0, \quad i = 1, \ldots, m$$
A penalized problem might be:
$$\min f(x) + \rho \sum_i h_i(x)^2$$

## Convergence of the quadratic penalty method

(for equality constrained problems): let
$$P(x; \rho) = f(x) + \rho \sum_i h_i(x)^2$$
Given $\rho_0 > 0$, $x_0 \in \mathbb{R}^n$, $k = 0$, let
$$x_{k+1} = \arg\min P(x; \rho_k)$$
(found with an iterative method initialized at $x_k$); let $\rho_{k+1} > \rho_k$, $k := k + 1$. If $x_{k+1}$ is a global minimizer of $P$ and $\rho_k \to \infty$ then every limit point of $\{x_k\}$ is a global optimum of the constrained problem.

Exact penalties
Exact penalties: there exists a penalty parameter value s.t. the optimal solution to the penalized problem is the optimal solution of the original one. $\ell_1$ penalty function:
$$P_1(x; \rho) = f(x) + \rho \sum_i |h_i(x)|$$

Exact penalties
For inequality constrained problems:
$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0, \quad g_j(x) \le 0$$
$$P_1(x; \rho) = f(x) + \rho \sum_i |h_i(x)| + \rho \sum_j \max(0, g_j(x))$$
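## Augmented Lagrangian method

Given an equality constrained problem, reformulate it as:
$$\min f(x) + \frac{1}{2}\rho \|h(x)\|^2 \quad \text{s.t.} \quad h(x) = 0$$
The Lagrange function of this problem is called Augmented Lagrangian:
$$L_\rho(x; \lambda) = f(x) + \frac{1}{2}\rho \|h(x)\|^2 + \lambda^T h(x)$$

Motivation
$$\min_x\; f(x) + \frac{1}{2}\rho \|h(x)\|^2 + \lambda^T h(x)$$
$$\nabla_x L_\rho(x, \lambda) = \nabla f(x) + \sum_i \lambda_i \nabla h_i(x) + \rho \nabla h(x)\, h(x) = \nabla_x L(x, \lambda) + \rho \nabla h(x)\, h(x)$$
$$\nabla^2_{xx} L_\rho(x, \lambda) = \nabla^2 f(x) + \sum_i \lambda_i \nabla^2 h_i(x) + \rho \nabla^2 h(x)\, h(x) + \rho \nabla h(x) \nabla^T h(x)$$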

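motivation . . .
Let $(x^\star, \lambda^\star)$ be an optimal (primal and dual) solution. Necessarily: $\nabla_x L(x^\star, \lambda^\star) = 0$; moreover $h(x^\star) = 0$, thus
$$\nabla_x L_\rho(x^\star, \lambda^\star) = \nabla_x L(x^\star, \lambda^\star) + \rho \nabla h(x^\star)\, h(x^\star) = 0$$
so $(x^\star, \lambda^\star)$ is a stationary point for the augmented lagrangian.

motivation . . .
Observe that:
$$\nabla^2_{xx} L_\rho(x, \lambda) = \nabla^2_{xx} L(x, \lambda) + \rho \nabla^2 h(x)\, h(x) + \rho \nabla h(x) \nabla^T h(x)$$
$$= \nabla^2_{xx} L(x, \lambda) + \rho \nabla h(x) \nabla^T h(x) \qquad \text{(when } h(x) = 0\text{)}$$
Assume that sufficient optimality conditions hold:
$$v^T \nabla^2_{xx} L(x^\star, \lambda^\star)\, v > 0 \qquad \forall v : v^T \nabla h(x^\star) = 0,\; v \neq 0$$

...
Let $v \neq 0$ : $v^T \nabla h(x^\star) = 0$. Then
$$v^T \nabla^2_{xx} L_\rho(x^\star, \lambda^\star)\, v = v^T \nabla^2_{xx} L(x^\star, \lambda^\star)\, v + \rho\, v^T \nabla h(x^\star) \nabla^T h(x^\star)\, v = v^T \nabla^2_{xx} L(x^\star, \lambda^\star)\, v > 0$$

...
Let $v \neq 0$ : $v^T \nabla h(x^\star) \neq 0$. Then
$$v^T \nabla^2_{xx} L_\rho(x^\star, \lambda^\star)\, v = v^T \nabla^2_{xx} L(x^\star, \lambda^\star)\, v + \rho\, v^T \nabla h(x^\star) \nabla^T h(x^\star)\, v = v^T \nabla^2_{xx} L(x^\star, \lambda^\star)\, v + \rho\, \|v^T \nabla h(x^\star)\|^2$$
where the first term might be negative. However $\exists\, \bar\rho > 0$: if $\rho \ge \bar\rho$ then $v^T \nabla^2_{xx} L_\rho(x^\star, \lambda^\star)\, v > 0$. Thus, if $\rho$ is large enough, the Hessian of the augmented lagrangian is positive definite and $x^\star$ is a (strict) local minimum of $L_\rho(\cdot, \lambda^\star)$.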

Inequality constraints
The problem
$$\min f(x) \quad \text{s.t.} \quad g_j(x) \le 0, \quad j = 1, \ldots, p$$
can be rewritten with squared slack variables as
$$\min_{x, s} f(x) \quad \text{s.t.} \quad g_j(x) + s_j^2 = 0, \quad j = 1, \ldots, p$$

Given the problem
$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0,\; i = 1, \ldots, m, \qquad g_j(x) \le 0,\; j = 1, \ldots, p$$
an Augmented Lagrangian problem might be defined as
$$\min_{x, z} L_\rho(x, z; \lambda, \mu) = \min\; f(x) + \lambda^T h(x) + \frac{1}{2}\rho \|h(x)\|^2 + \sum_j \mu_j (g_j(x) + z_j^2) + \frac{1}{2}\rho \sum_j (g_j(x) + z_j^2)^2$$

...
Consider minimization with respect to the $z$ variables:
$$\min_z \sum_j \left[ \mu_j (g_j(x) + z_j^2) + \frac{1}{2}\rho (g_j(x) + z_j^2)^2 \right] = \min_{u \ge 0} \sum_j \left[ \mu_j (g_j(x) + u_j) + \frac{1}{2}\rho (g_j(x) + u_j)^2 \right]$$
(quadratic minimization over the nonnegative orthant). Solution:
$$u_j^\star = \max\{0, \hat u_j\}$$
where $\hat u$ is the unconstrained optimum:
$$\hat u : \mu_j + \rho (g_j(x) + \hat u_j) = 0$$

...
Thus:
$$u_j^\star = \max\left\{0,\; -\frac{\mu_j}{\rho} - g_j(x)\right\}.$$
Substituting:
$$L_\rho(x; \lambda, \mu) = f(x) + \lambda^T h(x) + \frac{1}{2}\rho \|h(x)\|^2 + \frac{1}{2\rho} \sum_j \left( \max\{0, \mu_j + \rho g_j(x)\}^2 - \mu_j^2 \right)$$
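The elimination of the slack variables can be verified numerically for a single constraint. The sketch below compares the term minimized over $u \ge 0$ with the closed form obtained by substitution; the function names and the test grid are illustrative.

```python
def slack_term(g, mu, rho, u):
    """mu*(g + u) + 0.5*rho*(g + u)^2 for one inequality constraint value g."""
    return mu * (g + u) + 0.5 * rho * (g + u) ** 2


def best_slack(g, mu, rho):
    """Minimizer over u >= 0: max{0, -mu/rho - g}."""
    return max(0.0, -mu / rho - g)


def closed_form_term(g, mu, rho):
    """(max{0, mu + rho*g}^2 - mu^2) / (2*rho)."""
    return (max(0.0, mu + rho * g) ** 2 - mu ** 2) / (2.0 * rho)


grid = [(g / 4.0, mu / 2.0, rho)
        for g in range(-8, 9) for mu in range(0, 7) for rho in (0.5, 2.0, 10.0)]
max_gap = max(abs(slack_term(g, mu, rho, best_slack(g, mu, rho))
                  - closed_form_term(g, mu, rho)) for g, mu, rho in grid)
```

The value `max_gap` is at the level of rounding error, which confirms that the substituted expression equals the minimum over the slack.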

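Sequential Quadratic Programming (SQP)
$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0$$
Idea: apply Newton's method to solve the KKT equations. Lagrangian function:
$$L(x; \lambda) = f(x) + \sum_i \lambda_i h_i(x)$$
Let $H(x) = [h_i(x)]$, $\nabla H(x) = [\nabla h_i(x)]$. KKT conditions:
$$F[x; \lambda] = \begin{pmatrix} \nabla f(x) + \nabla H(x)^T \lambda \\ H(x) \end{pmatrix} = 0$$

## Newton step for SQP

Jacobian of the KKT system:
$$F'(x, \lambda) = \begin{pmatrix} \nabla^2_{xx} L(x; \lambda) & \nabla H(x)^T \\ \nabla H(x) & 0 \end{pmatrix}$$
Newton step:
$$\begin{pmatrix} x_{k+1} \\ \lambda_{k+1} \end{pmatrix} = \begin{pmatrix} x_k \\ \lambda_k \end{pmatrix} + \begin{pmatrix} d_k \\ \Delta_k \end{pmatrix}$$
where
$$\begin{pmatrix} \nabla^2_{xx} L(x_k; \lambda_k) & \nabla H(x_k)^T \\ \nabla H(x_k) & 0 \end{pmatrix} \begin{pmatrix} d_k \\ \Delta_k \end{pmatrix} = \begin{pmatrix} -\nabla f(x_k) - \nabla H^T(x_k) \lambda_k \\ -H(x_k) \end{pmatrix}$$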

existence
The Newton step exists if
- the Jacobian of the constraint set $\nabla H(x_k)$ has full row rank
- the Hessian $\nabla^2_{xx} L(x_k; \lambda_k)$ is positive definite

In this case the Newton step is the unique solution of the following quadratic program.

## Alternative view: SQP

$$\min_d\; f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2_{xx} L(x_k; \lambda_k)\, d \quad \text{s.t.} \quad \nabla H(x_k)\, d + H(x_k) = 0$$
KKT conditions:
$$\nabla^2_{xx} L(x_k; \lambda_k)\, d_k + \nabla H^T(x_k) \Delta_k + \nabla f(x_k) + \nabla H^T(x_k) \lambda_k = 0$$
$$\nabla H(x_k)\, d_k + H(x_k) = 0$$
Under the same conditions as before this QP has a unique solution $d_k$ with Lagrange multipliers $\lambda_k + \Delta_k = \lambda_{k+1}$.

## Alternative view: SQP

$$\min_d\; L(x_k, \lambda_k) + \nabla_x^T L(x_k, \lambda_k)\, d + \frac{1}{2} d^T \nabla^2_{xx} L(x_k; \lambda_k)\, d \quad \text{s.t.} \quad \nabla H(x_k)\, d + H(x_k) = 0$$
KKT conditions:
$$\nabla^2_{xx} L(x_k; \lambda_k)\, d + \nabla f(x_k) + \nabla H^T(x_k) \lambda_k + \nabla H^T(x_k) \Delta_k = 0$$
Under the same conditions as before this QP has a unique solution $d_k$ with Lagrange multipliers $\Delta_k = \lambda_{k+1} - \lambda_k$.

Thus SQP can be seen as a method which minimizes a quadratic approximation to the Lagrangian subject to a first order approximation of the constraints.

Inequalities
If the original problem is
$$\min f(x) \quad \text{s.t.} \quad h_i(x) = 0, \quad g_j(x) \le 0$$
then the SQP iteration solves
$$\min_d\; f_k + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2_{xx} L(x_k, \lambda_k)\, d$$
$$\text{s.t.} \quad \nabla^T h_i(x_k)\, d + h_i(x_k) = 0, \qquad \nabla^T g_j(x_k)\, d + g_j(x_k) \le 0$$

Filter Methods
Basic idea:
$$\min f(x) \quad \text{s.t.} \quad g(x) \le 0$$
can be considered as a problem with two objectives:
- minimize $f(x)$
- minimize $g(x)$

(the second objective has priority over the first)

Filter
Given the problem
$$\min f(x) \quad \text{s.t.} \quad g_j(x) \le 0, \quad j = 1, \ldots, k$$
let us consider the bi-criteria optimization problem
$$\min f(x), \qquad \min h(x)$$
where
$$h(x) = \sum_j \max\{g_j(x), 0\}$$
Let $\{f_k, h_k,\; k = 1, 2, \ldots\}$ be the observed values of $f$ and $h$ at points $x_1, x_2, \ldots$. A pair $(f_k, h_k)$ dominates a pair $(f_\ell, h_\ell)$ iff
$$f_k \le f_\ell \quad \text{and} \quad h_k \le h_\ell$$
A filter is a list of pairs none of which dominates another.

[Figure: filter entries in the $(f(x), h(x))$ plane.]

## Trust region SQP

Consider a Trust-region SQP method:
$$\min_d\; f_k + \nabla L(x_k; \lambda_k)^T d + \frac{1}{2} d^T \nabla^2_{xx} L(x_k; \lambda_k)\, d$$
$$\text{s.t.} \quad \nabla^T g_j(x_k)\, d + g_j(x_k) \le 0, \qquad \|d\|_\infty \le \rho$$
(the $\infty$ norm is used here in order to keep the problem a QP). Traditional (unconstrained) trust region methods: if the current step is a failure, reduce the trust region; eventually the step will become a pure gradient step, hence convergence!

[Figure: the feasible set $g_j(x) \le 0$ and its linearization $\nabla^T g_j(x_k)\, d + g_j(x_k) \le 0$ at $x_k$.]

Filter methods
Data: $x_0$: starting point, $\rho$, $k = 0$
while the convergence criterion is not satisfied do
  if the QP is infeasible then
    find $x_{k+1}$ minimizing constraint violation;
  else
    solve the QP and get a step $d_k$; try setting $x_{k+1} = x_k + d_k$;
    if $(f_{k+1}, h_{k+1})$ is acceptable to the filter then
      accept $x_{k+1}$ and add $(f_{k+1}, h_{k+1})$ to the filter;
      remove dominated points from the filter;
      possibly increase $\rho$;
    else
      reject the step; reduce $\rho$;
    end
  end
  set $k = k + 1$;
end
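The bookkeeping of the filter (dominance test, acceptance, removal of dominated entries) can be sketched as follows. This is a simplified illustration: "acceptable" here just means "not dominated by any filter entry", without the small margins that practical filter methods add; the names `dominates` and `try_add` are introduced for this sketch.

```python
def dominates(a, b):
    """True if the pair a = (f, h) is at least as good as b in both objectives."""
    return a[0] <= b[0] and a[1] <= b[1]


def try_add(filter_pairs, candidate):
    """Return (new_filter, accepted) for a candidate pair (f, h)."""
    if any(dominates(p, candidate) for p in filter_pairs):
        return list(filter_pairs), False              # reject: some entry dominates it
    kept = [p for p in filter_pairs if not dominates(candidate, p)]
    kept.append(candidate)                            # accept and prune dominated entries
    return kept, True


current = [(5.0, 0.1), (3.0, 1.0)]
f1, ok1 = try_add(current, (4.0, 0.5))    # a trade-off point: accepted
f2, ok2 = try_add(f1, (6.0, 2.0))         # worse in both objectives: rejected
f3, ok3 = try_add(f1, (2.0, 0.05))        # better than everything: filter collapses
```

A rejected candidate leaves the filter unchanged, which is the situation in which the algorithm above reduces the trust region radius.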

## Introduction to Global Optimization

Fabio Schoen 2008

http://gol.dsi.unifi.it/users/schoen

## Global Optimization Problems

$$\min_{x \in S \subseteq \mathbb{R}^n} f(x)$$
What is meant by global optimization? Of course we would like to find
$$f^\star = \min_{x \in S \subseteq \mathbb{R}^n} f(x)$$
and
$$x^\star = \arg\min f(x) : \quad f(x^\star) \le f(x) \quad \forall x \in S$$

This definition is unsatisfactory:
- the problem is "ill posed" in $x$ (two objective functions which differ only slightly might have global optima which are arbitrarily far)
- it is however well posed in the optimal values: $\|f - g\| \le \delta \Rightarrow |f^\star - g^\star| \le \varepsilon$

Quite often we are satisfied in looking for $f^\star$ and search for one or more feasible solutions such that
$$f(\bar x) \le f(x^\star) + \varepsilon$$

## Research in Global Optimization

- the problem is highly relevant, especially in applications
- the problem is very hard (perhaps too much) to solve
- there are plenty of publications on global optimization algorithms for specific problem classes
- there are only relatively few papers with relevant theoretical contents
- often from elegant theories, weak algorithms have been produced and vice versa, the best computational methods often lack a sound theoretical support
- many global optimization papers get published in applied research journals

Bazaraa, Sherali, Shetty, "Nonlinear Programming: theory and algorithms", 1993: the words "global optimum" appear for the first time on page 99, the second time at page 132, then at page 247: "A desirable property of an algorithm for solving [an optimization] problem is that it generates a sequence of points converging to a global optimal solution. In many cases however we may have to be satisfied with less favorable outcomes." After this (in 638 pages) it never appears anymore. "Global optimization" is never cited.

Similar situation in Bertsekas, Nonlinear Programming (1999): 777 pages, but only the definition of global minima and maxima is given!

Nocedal & Wright, "Numerical Optimization", 2nd edition, 2006: "Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate ... many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied"

Complexity
Global optimization is "hopeless": without "global information" no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of "global information"; some examples:
- number of local optima
- global optimum value
- for global optimization problems over a box, (an upper bound on) the Lipschitz constant
$$|f(y) - f(x)| \le L \|x - y\| \qquad \forall x, y$$
- concavity of the objective function + convexity of the feasible region
- an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region)

Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: Quadratic programming:
$$\min \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad l \le Ax \le u$$
is NP-hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990].

Many special cases are still NP-hard:
- norm maximization on a parallelotope:
$$\max \|x\| \quad \text{s.t.} \quad b \le Ax \le c$$
- quadratic optimization on a hyper-rectangle ($A = I$) when even only one eigenvalue of $Q$ is negative
- quadratic minimization over a simplex
$$\min \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad x \ge 0, \quad \sum_j x_j = 1$$

## Applications of global optimization

- concave minimization: quantity discounts, scale economies
- fixed charge
- combinatorial optimization, binary linear programming:
$$\min c^T x + K x^T (\mathbf{1} - x) \quad \text{s.t.} \quad Ax = b, \quad x \in [0, 1]$$
or:
$$\min c^T x \quad \text{s.t.} \quad Ax = b, \quad x \in [0, 1], \quad x^T (\mathbf{1} - x) = 0$$

Minimization of cost functions which are neither convex nor concave. E.g.: finding the minimum conformation of complex molecules: Lennard-Jones micro-clusters, protein folding, protein-ligand docking.

Example: Lennard-Jones: pair potential due to two atoms at $X_1, X_2 \in \mathbb{R}^3$:
$$v(r) = \frac{1}{r^{12}} - \frac{2}{r^6}$$
where $r = \|X_1 - X_2\|$. The total energy of a cluster of $N$ atoms located at $X_1, \ldots, X_N \in \mathbb{R}^3$ is defined as:
$$\sum_{i = 1, \ldots, N}\; \sum_{j < i} v(\|X_i - X_j\|)$$
This function has a number of local (non global) minima which grows like $\exp(N)$.
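The cluster energy is simple to evaluate, which makes it a convenient test function. The sketch below implements the scaled pair potential $v(r) = r^{-12} - 2 r^{-6}$ given above, whose minimum is $v(1) = -1$; the function names are illustrative.

```python
import math


def pair_potential(r):
    """Scaled Lennard-Jones pair potential; minimum value -1 at r = 1."""
    return r ** -12 - 2.0 * r ** -6


def lj_energy(points):
    """Total energy: sum of the pair potential over all unordered pairs of atoms."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i):
            total += pair_potential(math.dist(points[i], points[j]))
    return total


dimer = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, math.sqrt(3.0) / 2.0, 0.0)]
```

For two atoms at unit distance the energy is $-1$; for three atoms at the vertices of a unit equilateral triangle all three pairs sit at the optimal distance and the energy is $-3$.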

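Lennard-Jones potential
[Figure: plot of the attractive term, the repulsive term and the Lennard-Jones potential as functions of the distance.]

## Protein folding and docking

Potential energy model: $E = E_l + E_a + E_d + E_v + E_e$ where:
$$E_l = \sum_{i \in L} \frac{1}{2} K_i^b (r_i - r_i^0)^2$$
(contribution of pairs of bonded atoms);
$$E_a = \sum_{i \in A} \frac{1}{2} K_i^\theta (\theta_i - \theta_i^0)^2$$
(angle between 3 bonded atoms);
$$E_d = \sum_{i \in T} \frac{1}{2} K_i^\phi \left[ 1 + \cos(n \phi_i - \gamma) \right]$$
(dihedrals);
$$E_v = \sum_{(i,j) \in C} \left( \frac{A_{ij}}{R_{ij}^{12}} - \frac{B_{ij}}{R_{ij}^6} \right)$$
(van der Waals);
$$E_e = \frac{1}{2} \sum_{(i,j) \in C} \frac{q_i q_j}{R_{ij}}$$
(Coulomb interaction)

Docking
Given two macro-molecules $M_1, M_2$, find their minimal energy coupling. If no bonds are changed, to find the optimal docking it is sufficient to minimize:
$$E_v + E_e = \sum_{i \in M_1, j \in M_2} \left( \frac{A_{ij}}{R_{ij}^{12}} - \frac{B_{ij}}{R_{ij}^6} \right) + \frac{1}{2} \sum_{i \in M_1, j \in M_2} \frac{q_i q_j}{R_{ij}}$$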

## Main algorithmic strategies

Two main families:
1. with global information ("structured problems")
2. without global information ("unstructured problems")

Structured problems: stochastic and deterministic methods. Unstructured problems: typically stochastic algorithms. Every global optimization method should try to find a balance between
- exploration of the feasible region
- approximations of the optimum

Example: Lennard-Jones
$$LJ_N = \min LJ(X) = \min \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( \frac{1}{\|X_i - X_j\|^{12}} - \frac{2}{\|X_i - X_j\|^6} \right)$$
This is a highly structured problem. But is it easy/convenient to use its structure? And how?

LJ

The map

$$F_1 : \mathbb{R}^{3N} \to \mathbb{R}_+^{N(N-1)/2}, \qquad F_1(X_1, \dots, X_N) = \left( \|X_1 - X_2\|^2, \dots, \|X_{N-1} - X_N\|^2 \right)$$

is convex and the function

$$F_2 : \mathbb{R}_+^{N(N-1)/2} \to \mathbb{R}, \qquad F_2(r_{12}, \dots, r_{N-1,N}) = \sum_{i<j} \left( \frac{1}{r_{ij}^{6}} - \frac{2}{r_{ij}^{3}} \right)$$

is the difference between two convex functions. Thus $LJ(X)$ can be seen as the difference between two convex functions (a d.c. programming problem).

NB: every $C^2$ function is d.c., but often its d.c. decomposition is not known. D.C. optimization is very elegant, there exists a nice duality theory, but algorithms are typically very inefficient.

## A primal method for d.c. optimization

Cutting plane method (just an example, not particularly efficient, useless for high dimensional problems). Any unconstrained d.c. problem can be represented as an equivalent problem with linear objective, a convex constraint and a reverse convex constraint. If $g, h$ are convex, then $\min g(x) - h(x)$ is equivalent to:

$$\min z \quad \text{s.t.}\quad g(x) - h(x) \le z$$

which is equivalent to

$$\min z \quad \text{s.t.}\quad g(x) \le w,\quad h(x) + z \ge w$$

## D.C. canonical form

$$\min c^T x \quad \text{s.t.}\quad g(x) \le 0,\quad h(x) \ge 0$$

where $h, g$: convex. Let

$$\Omega = \{x : g(x) \le 0\}, \qquad C = \{x : h(x) \le 0\}$$

Hp:

$$0 \in \operatorname{int}\Omega \cap \operatorname{int} C, \qquad c^T x > 0 \quad \forall x \in \Omega \setminus \operatorname{int} C$$

Fundamental property: if a D.C. problem admits an optimum, at least one optimum belongs to

$$\partial\Omega \cap \partial C$$

## Discussion of the assumptions

$g(0) < 0$, $h(0) < 0$, $c^T x > 0$ $\forall$ feasible $x$. Let $\bar x$ be a solution to the convex problem

$$\min c^T x \quad \text{s.t.}\quad g(x) \le 0$$

If $h(\bar x) \ge 0$ then $\bar x$ solves the d.c. problem. Otherwise $c^T x > c^T \bar x$ for all feasible $x$. Coordinate transformation: $y = x - \bar x$:

$$\min c^T y \quad \text{s.t.}\quad \hat g(y) \le 0,\quad \hat h(y) \ge 0$$

where $\hat g(y) = g(y + \bar x)$ and $\hat h(y) = h(y + \bar x)$. Then $c^T y > 0$ for all feasible solutions and $\hat h(0) = h(\bar x) < 0$; by continuity it is possible to choose $\bar x$ so that also $\hat g(0) < 0$.

[Figure: illustration of the assumptions, with the line $c^T x = 0$.]

Let $\bar x$ be the best known solution. Let

$$D(\bar x) = \{x \in \Omega : c^T x \le c^T \bar x\}$$

If $D(\bar x) \subseteq C$ then $\bar x$ is optimal.

Check: a polytope $P$ (with known vertices) is built which contains $D(\bar x)$. If all vertices of $P$ are in $C$ $\Rightarrow$ optimal solution. Otherwise let $v$ be the best feasible vertex; the intersection of the segment $[0, v]$ with $\partial C$ (if feasible) is an improving point $x$. Otherwise a cut is introduced in $P$ which is tangent to $\Omega$ in $x$.

[Figure: the set $D(\bar x) = \{x \in \Omega : c^T x \le c^T \bar x\}$, the line $c^T x = 0$ and the point $\bar x$.]

Initialization

A polytope $P$ is chosen such that

$$P \supseteq D(\bar x)$$

i.e.

$$y \in \Omega,\ c^T y \le c^T \bar x \ \Rightarrow\ y \in P$$

If $P \subseteq C$, i.e. if $\forall y \in P$: $h(y) \le 0$, then $\bar x$ is optimal. Checking is easy if we know the vertices of $P$: $h$ is convex, so its maximum over $P$ is attained at a vertex.

$P$: $D(\bar x) \subseteq P$ with vertices $V_1, \dots, V_k$. $V^* := \arg\max_j h(V_j)$

[Figure: the polytope $P \supseteq D(\bar x)$, the line $c^T x = 0$, the point $\bar x$ and the vertex $V^*$.]

Step 1

Let $V^*$ be the vertex with largest $h(\cdot)$ value. Surely $h(V^*) > 0$ (otherwise we stop with an optimal solution). Moreover: $h(0) < 0$ ($0$ is in the interior of $C$). Thus the line from $V^*$ to $0$ must intersect the boundary of $C$. Let $x_k$ be the intersection point. It might be feasible ($\Rightarrow$ improving) or not.
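One way to compute the intersection point $x_k$ in practice is bisection along the segment from $0$ to $V^*$, since $h$ is negative at one end and positive at the other. The sketch below assumes $h$ is available as a Python callable; the function name `boundary_point` and the example data (unit disc, vertex $(3, 4)$) are hypothetical.

```python
def boundary_point(h, v, tol=1e-10):
    """Return the point t*v on the segment [0, v] where h changes sign.

    Assumes h is convex (hence continuous), h(0) < 0 and h(v) > 0.
    """
    origin = [0.0] * len(v)
    assert h(origin) < 0 < h(v)
    lo, hi = 0.0, 1.0              # invariant: h(lo*v) < 0 <= h(hi*v)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h([mid * vi for vi in v]) < 0:
            lo = mid
        else:
            hi = mid
    return [hi * vi for vi in v]

# Hypothetical data: C is the unit disc, h(x) = ||x||^2 - 1, and V* = (3, 4).
h = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
xk = boundary_point(h, [3.0, 4.0])
print(xk)
```

Because $h$ is convex along the segment, the sign change is unique, so bisection cannot converge to a spurious crossing. In this toy case the exact intersection is $(0.6, 0.8)$.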
[Figure: the point $x_k = \partial C \cap [V^*, 0]$, together with $C$, the line $c^T x = 0$, $\bar x$ and $V^*$.]

If $x_k \in \Omega$, set $\bar x := x_k$.

Otherwise, if $x_k \notin \Omega$, the polytope is divided.

[Figures: the two cases, each drawn with the line $c^T x = 0$ and the point $\bar x$.]

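[Figure: a further iteration of the cutting plane method, with the line $c^T x = 0$.]

For the d.c. problem $\min_{x \in S}\, g(x) - h(x)$,

$$\inf \{ h^*(u) - g^*(u) : u : h^*(u) < +\infty \}$$

is the Fenchel-Rockafellar dual, where $g^*$ and $h^*$ denote the convex conjugates of $g$ and $h$. If $\min g(x) - h(x)$ admits an optimum, then the Fenchel dual is a strong dual.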

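If $x^* \in \arg\min g(x) - h(x)$ then

$$u^* \in \partial h(x^*)$$

($\partial$ denotes subdifferential) is dual optimal and if $u^* \in \arg\min h^*(u) - g^*(u)$ then

$$x^* \in \partial g^*(u^*)$$

A primal/dual algorithm

$$P_k:\ \min\ g(x) - \left( h(x_k) + (x - x_k)^T y_k \right)$$

and

$$D_k:\ \min\ h^*(y) - \left( g^*(y_{k-1}) + x_k^T (y - y_{k-1}) \right)$$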
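To make the primal step $P_k$ concrete, here is a small sketch on a one-dimensional toy problem. It assumes that $y_k$ is a subgradient of $h$ at $x_k$, and it uses the illustrative decomposition $g(x) = x^4$, $h(x) = 2x^2$ (so $f = x^4 - 2x^2$), for which $P_k$ has a closed-form solution. None of these choices come from the slides.

```python
import math

def dc_iteration(x0, iters=60):
    """Repeat: linearize h at x_k, then minimize the resulting convex function.

    Toy data: g(x) = x^4, h(x) = 2 x^2, so f = g - h = x^4 - 2 x^2.
    """
    x = x0
    for _ in range(iters):
        y = 4.0 * x                        # y_k = h'(x_k), the only subgradient
        # P_k: min_x x^4 - y*x  ->  4 x^3 = y  ->  x = cube root of y / 4
        x = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
    return x

f = lambda x: x ** 4 - 2.0 * x ** 2
x_star = dc_iteration(0.5)
print(x_star, f(x_star))
```

Starting from $x_0 = 0.5$ the iterates increase towards $x = 1$, one of the two global minimizers of $f$; starting from a negative point they approach $x = -1$. The point $x = 0$ is a fixed point of the iteration although it is a local maximum of $f$, which illustrates that such schemes only guarantee critical points.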


## Exact Global Optimization

GlobOpt - relaxations

Consider the global optimization problem (P):

$$\min_{x \in X} f(x)$$

and assume the min exists and is finite and that we can use a relaxation (R):

$$\min_{y \in Y} g(y)$$

Usually both $X$ and $Y$ are subsets of the same space $\mathbb{R}^n$. Recall: (R) is a relaxation of (P) iff:

$$X \subseteq Y \qquad \text{and} \qquad g(x) \le f(x) \quad \forall x \in X$$

## Branch and Bound

1. Solve the relaxation (R) and let $L$ be the (global) optimum value (assume (R) is feasible)
2. (Heuristically) solve the original problem (P) (or, more generally, find a good feasible solution to (P) in $X$). Let $U$ be the best feasible function value known
3. if $U - L \le \varepsilon$ then stop: $U$ is a certified optimum for (P) (within the tolerance $\varepsilon$)
4. otherwise split $X$ and $Y$ into two parts and apply to each of them the same method
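The four steps can be sketched on a one-dimensional problem. In the fragment below the relaxation is not a convex envelope but a simple Lipschitz lower bound, which is an assumption made only to keep the example self-contained; the function name, the test objective and the Lipschitz constant 3.5 are all hypothetical.

```python
import heapq
import math

def branch_and_bound(f, a, b, lipschitz, eps=1e-4):
    """Minimize f on [a, b]; returns (U, x_best) with U - L <= eps."""
    def lower(lo, hi):
        # valid lower bound on [lo, hi] if |f(x) - f(y)| <= lipschitz * |x - y|
        mid = 0.5 * (lo + hi)
        return f(mid) - lipschitz * 0.5 * (hi - lo), mid

    bound, mid = lower(a, b)
    U, x_best = f(mid), mid                  # incumbent value (upper bound)
    heap = [(bound, a, b)]                   # open sub-intervals, smallest bound first
    while heap:
        L, lo, hi = heapq.heappop(heap)      # L = lowest bound among open sub-intervals
        if U - L <= eps:
            break                            # U is certified within eps
        mid = 0.5 * (lo + hi)
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):          # branching
            sub_bound, sub_mid = lower(sub_lo, sub_hi)
            if f(sub_mid) < U:
                U, x_best = f(sub_mid), sub_mid                # better feasible point
            if sub_bound < U - eps:
                heapq.heappush(heap, (sub_bound, sub_lo, sub_hi))   # otherwise fathomed
    return U, x_best

f = lambda x: math.sin(3.0 * x) + 0.5 * x    # illustrative objective, |f'| <= 3.5
U, x_best = branch_and_bound(f, 0.0, 4.0, lipschitz=3.5)
print(U, x_best)
```

Sub-intervals whose lower bound cannot beat the incumbent by more than `eps` are discarded (fathomed), so the search concentrates around the global minimizer, which for this objective lies near $x \approx 1.515$.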

Tools

- good relaxations: easy yet accurate
- good upper bounding, i.e., good heuristics for (P)

Good relaxations can be obtained, e.g., through:
- convex relaxations
- domain reduction

Convex relaxations

Assume $X$ is convex and $Y = X$. If $g$ is the convex envelope of $f$ on $X$, then solving the convex relaxation (R) gives, in one step, the certified global optimum for (P).

$g(x)$ is a convex under-estimator of $f$ on $X$ if:
- $g(x)$ is convex
- $g(x) \le f(x)$ $\forall x \in X$

$g$ is the convex envelope of $f$ on $X$ if:
- $g$ is a convex under-estimator of $f$
- $g(x) \ge h(x)$ $\forall x \in X$, $\forall h$: convex under-estimator of $f$

A 1-D example

[Figures: a 1-D example over the set $X$, in four panels: the function, its convex under-estimator, branching, and bounding (with the upper bound, the lower bounds and a fathomed sub-interval marked).]

## Relaxation of the feasible domain

Let

$$\min_{x \in S} f(x)$$

be a GlobOpt problem where $f$ is convex, while $S$ is non convex. A relaxation (outer approximation) is obtained replacing $S$ with a larger set $Q$. If $Q$ is convex $\Rightarrow$ convex optimization problem. If the optimal solution to

$$\min_{x \in Q} f(x)$$

belongs to $S$ $\Rightarrow$ optimal solution to the original problem.

Example

$$\min_{x \in [0,5],\ y \in [0,3]} -x - 2y \quad \text{s.t.}\quad xy \le 3$$

[Figure: the feasible region of the example.]

Relaxation

We know that:

$$(x + y)^2 = x^2 + y^2 + 2xy$$

thus

$$xy = \left( (x + y)^2 - x^2 - y^2 \right) / 2$$

and, as $x \in [0,5]$ and $y \in [0,3]$, $x^2 \le 5x$, $y^2 \le 3y$, thus a (convex) relaxation of $xy \le 3$ is

$$(x + y)^2 - 5x - 3y \le 6$$

[Figure: the feasible region of this relaxation.]

Stronger Relaxation

$$(5 - x)(3 - y) \ge 0 \ \Rightarrow\ 15 - 3x - 5y + xy \ge 0 \ \Rightarrow\ xy \ge 3x + 5y - 15$$

Thus:

$$3x + 5y - 15 \le xy \le 3$$

i.e.: $3x + 5y \le 18$

[Figure: the feasible region of the stronger relaxation.]
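The linear relaxation $3x + 5y \le 18$ on the box $[0,5] \times [0,3]$ is small enough to be solved by enumerating the vertices of the relaxed region. The sketch below does this in plain Python, taking the objective as $-x - 2y$; the vertex enumeration is a device chosen for this illustration, not a method described in the slides.

```python
from itertools import combinations

# Boundary lines a*x + b*y = c of the region 3x + 5y <= 18, 0 <= x <= 5, 0 <= y <= 3.
lines = [(3, 5, 18), (1, 0, 0), (1, 0, 5), (0, 1, 0), (0, 1, 3)]

def feasible(x, y, tol=1e-9):
    return 3 * x + 5 * y <= 18 + tol and -tol <= x <= 5 + tol and -tol <= y <= 3 + tol

vertices = []
for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
    det = a1 * b2 - a2 * b1
    if det == 0:
        continue                           # parallel lines never meet
    x = (c1 * b2 - c2 * b1) / det          # Cramer's rule
    y = (a1 * c2 - a2 * c1) / det
    if feasible(x, y):
        vertices.append((x, y))

best = min(vertices, key=lambda p: -p[0] - 2 * p[1])
print(best, -best[0] - 2 * best[1], best[0] * best[1] <= 3)
```

The best vertex also satisfies the original constraint $xy \le 3$, so the lower bound given by the relaxation is attained by a feasible point of the original problem and no branching is needed.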

The optimal solution of the convex (linear) relaxation is $(1, 3)$, which is feasible $\Rightarrow$ optimal for the original problem.

## Convex (concave) envelopes

How to build convex envelopes of a function or how to relax a non convex constraint?
- Convex envelopes $\Rightarrow$ lower bounds
- Convex envelopes of $-f(x)$ $\Rightarrow$ upper bounds
- Constraint: $g(x) \le 0$ $\Rightarrow$ if $h(x)$ is a convex underestimator of $g$ then $h(x) \le 0$ is a convex relaxation
- Constraint: $g(x) \ge 0$ $\Rightarrow$ if $h(x)$ is concave and $h(x) \ge g(x)$, then $h(x) \ge 0$ is a convex constraint

Convex envelopes

Definition: a function is polyhedral if it is the pointwise maximum of a finite number of linear functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.)

Generating sets

The generating set $X$ of a function $f$ over a convex set $P$ is the set

$$X = \{x \in \mathbb{R}^n : (x, f(x)) \text{ is a vertex of } \operatorname{epi}(\operatorname{conv}_P(f))\}$$

I.e., given $f$ we first build its convex envelope in $P$ and then define its epigraph $\{(x, y) : x \in P,\ y \ge \operatorname{conv}_P(f)(x)\}$. This is a convex set whose extreme points can be denoted by $V$. $X$ are the $x$ coordinates of $V$.

Characterization

Let $f(x)$ be continuously differentiable in a polytope $P$. The convex envelope of $f$ on $P$ is polyhedral if and only if

$$X(f) = \operatorname{Vert}(P)$$

(the generating set is the vertex set of $P$).

Corollary: let $f_1, \dots, f_m \in C^1(P)$ and $\sum_i f_i(x)$ possess polyhedral convex envelopes on $P$. Then

$$\operatorname{Conv}\Big( \sum_i f_i(x) \Big) = \sum_i \operatorname{Conv} f_i(x)$$

if and only if the generating set of $\sum_i \operatorname{Conv}(f_i(x))$ is $\operatorname{Vert}(P)$.

Characterization

If $f(x)$ is such that $\operatorname{Conv} f(x)$ is polyhedral, then an affine function $h(x)$ such that
1. $h(x) \le f(x)$ for all $x \in \operatorname{Vert}(P)$
2. there exist $n + 1$ affinely independent vertices of $P$, $V_1, \dots, V_{n+1}$, such that $f(V_i) = h(V_i)$, $i = 1, \dots, n + 1$

belongs to the polyhedral description of $\operatorname{Conv} f(x)$ and

$$h(x) = \operatorname{conv} f(x)$$

for every $x \in \operatorname{Conv}(V_1, \dots, V_{n+1})$.

Characterization

The condition may be reversed: given $m$ affine functions $h_1, \dots, h_m$ such that, for each of them
1. $h_j(x) \le f(x)$ for all $x \in \operatorname{Vert}(P)$
2. there exist $n + 1$ affinely independent vertices of $P$, $V_1, \dots, V_{n+1}$, such that $f(V_i) = h_j(V_i)$, $i = 1, \dots, n + 1$

Then the function $\psi(x) = \max_j h_j(x)$ is the convex envelope of a polyhedral function $f$ iff
- the generating set of $\psi$ is $\operatorname{Vert}(P)$
- for every vertex $V_i$ we have $\psi(V_i) = f(V_i)$

Sufficient condition

If $f(x)$ is lower semi-continuous in $P$ and for all $x \notin \operatorname{Vert}(P)$ there exists a line $\ell_x$ such that $x$ belongs to the interior of $P \cap \ell_x$ and $f$ is concave in a neighborhood of $x$ on $\ell_x$, then $\operatorname{Conv} f(x)$ is polyhedral.

Application: let

$$f(x) = \sum_{i<j} \alpha_{ij} x_i x_j$$

The sufficient condition holds for $f$ in $[0,1]^n$ $\Rightarrow$ bilinear forms are polyhedral in a hypercube.

## Application: a bilinear term

(Al-Khayyal, Falk (1983)): let $x \in [\ell_x, u_x]$, $y \in [\ell_y, u_y]$. Then the convex envelope of $xy$ in $[\ell_x, u_x] \times [\ell_y, u_y]$ is

$$\varphi(x, y) = \max\{\ell_y x + \ell_x y - \ell_x \ell_y;\ u_y x + u_x y - u_x u_y\}$$

In fact: $\varphi(x, y)$ is an under-estimate of $xy$:

$$(x - \ell_x)(y - \ell_y) \ge 0 \ \Rightarrow\ xy \ge \ell_y x + \ell_x y - \ell_x \ell_y$$

Bilinear terms

$$xy \ge \varphi(x, y) = \max\{\ell_y x + \ell_x y - \ell_x \ell_y;\ u_y x + u_x y - u_x u_y\}$$

No other (polyhedral) function underestimating $xy$ is tighter. In fact $\ell_y x + \ell_x y - \ell_x \ell_y$ belongs to the convex envelope: it underestimates $xy$ and coincides with $xy$ at 3 vertices ($(\ell_x, \ell_y)$, $(\ell_x, u_y)$, $(u_x, \ell_y)$). Analogously for the other affine function. All vertices are interpolated by these 2 underestimating hyperplanes $\Rightarrow$ they form the convex envelope of $xy$.
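A quick numerical check of the bilinear envelope: the sketch below evaluates the maximum of the two affine under-estimators on a grid and compares it with $xy$. The function name and the box $[-1, 2] \times [0.5, 3]$ are hypothetical choices made for the check.

```python
def bilinear_envelope(x, y, lx, ux, ly, uy):
    """Convex envelope of x*y on [lx, ux] x [ly, uy]: max of two affine minorants."""
    return max(ly * x + lx * y - lx * ly,
               uy * x + ux * y - ux * uy)

# Hypothetical box, chosen only for the check below.
lx, ux, ly, uy = -1.0, 2.0, 0.5, 3.0
grid = [(lx + (ux - lx) * i / 20, ly + (uy - ly) * j / 20)
        for i in range(21) for j in range(21)]

never_above = all(bilinear_envelope(x, y, lx, ux, ly, uy) <= x * y + 1e-12
                  for x, y in grid)
worst_gap = max(x * y - bilinear_envelope(x, y, lx, ux, ly, uy) for x, y in grid)
corner_gaps = [abs(x * y - bilinear_envelope(x, y, lx, ux, ly, uy))
               for x in (lx, ux) for y in (ly, uy)]
print(never_above, worst_gap, corner_gaps)
```

The envelope matches $xy$ at the four corners of the box, and the largest gap occurs at the center, where it equals $(u_x - \ell_x)(u_y - \ell_y)/4$. This is why branching on the box (shrinking its sides) tightens the relaxation.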

## All easy then?

Of course not! Many things can go wrong . . . It is true that, on the hypercube, a bilinear form:

$$\sum_{i<j} \alpha_{ij} x_i x_j$$

is polyhedral (easy to see), but we cannot guarantee in general that the generating set of the envelope is the set of vertices of the hypercube! (in particular, if the $\alpha$'s have opposite signs)

If the set is not a hypercube, even a bilinear term might be non polyhedral: e.g. $xy$ on the triangle $\{0 \le x \le y \le 1\}$.

Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope $P$ is NP-hard!

Fractional terms

A convex underestimate of a fractional term $x/y$ over a box can be obtained through

$$w \ge \ell_x / y + x / u_y - \ell_x / u_y \qquad \text{if } \ell_x \ge 0$$

(analogous inequalities cover the cases $\ell_x < 0$, $u_x \ge 0$ and $u_x < 0$).

## Univariate concave terms

If $f(x)$, $x \in [\ell_x, u_x]$, is concave, then the convex envelope is simply its linear interpolation at the extremes of the interval:

$$f(\ell_x) + \frac{f(u_x) - f(\ell_x)}{u_x - \ell_x}\,(x - \ell_x)$$

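## Underestimating a general nonconvex function

Let $f(x) \in C^2$ be a general non convex function. Then a convex underestimate on a box can be defined as

$$\phi(x) = f(x) - \sum_{i=1}^{n} \alpha_i (x_i - \ell_i)(u_i - x_i)$$

where $\alpha_i > 0$ are parameters. The Hessian of $\phi$ is

$$\nabla^2 \phi(x) = \nabla^2 f(x) + 2\operatorname{diag}(\alpha)$$

$\phi$ is convex iff $\nabla^2 \phi(x)$ is positive semi-definite.

How to choose the $\alpha_i$'s? One possibility: uniform choice: $\alpha_i = \alpha$. In this case convexity of $\phi$ is obtained iff

$$\alpha \ge \max\left\{ 0,\ -\frac{1}{2} \min_{x \in [\ell, u]} \lambda_{\min}(x) \right\}$$

where $\lambda_{\min}(x)$ is the minimum eigenvalue of $\nabla^2 f(x)$.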
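The construction can be tried on a one-dimensional example. The sketch below assumes $f = \sin$ on $[0, 2\pi]$, where $f'' = -\sin \ge -1$, so the uniform rule gives $\alpha = 1/2$; this choice of $f$, the helper name `alpha_bb` and the grid checks are illustrative assumptions.

```python
import math

def alpha_bb(f, lo, hi, alpha):
    """Return phi(x) = f(x) - alpha * (x - lo) * (hi - x) on the interval [lo, hi]."""
    return lambda x: f(x) - alpha * (x - lo) * (hi - x)

# Illustrative choice: f = sin on [0, 2*pi]; f'' = -sin >= -1, so alpha = 1/2 suffices.
lo, hi, alpha = 0.0, 2.0 * math.pi, 0.5
phi = alpha_bb(math.sin, lo, hi, alpha)

xs = [lo + (hi - lo) * i / 200 for i in range(201)]
step = xs[1] - xs[0]
is_under = all(phi(x) <= math.sin(x) + 1e-12 for x in xs)
# discrete convexity check: second differences of phi are non-negative
is_convex = all(phi(x - step) - 2 * phi(x) + phi(x + step) >= -1e-9 for x in xs[1:-1])
max_gap = max(math.sin(x) - phi(x) for x in xs)
print(is_under, is_convex, max_gap, alpha * (hi - lo) ** 2 / 4)
```

The largest gap between $f$ and $\phi$ is attained at the midpoint of the interval and equals $\alpha (u - \ell)^2 / 4$, so the underestimator improves quadratically as the interval is split.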


Key properties

- $\phi(x) \le f(x)$
- $\phi$ is convex
- $\phi$ interpolates $f$ at all vertices of $[\ell, u]$

Maximum separation:

$$\max (f(x) - \phi(x)) = \frac{1}{4} \sum_i \alpha_i (u_i - \ell_i)^2$$

Estimation of $\alpha$

Compute an interval Hessian $[H]$: $[H(x)]_{ij} = [h^L_{ij}(x), h^U_{ij}(x)]$ in $[\ell, u]$. Find $\alpha$ such that $[H] + 2\operatorname{diag}(\alpha) \succeq 0$.

Gerschgorin theorem for real matrices:

$$\lambda_{\min} \ge \min_i \Big( h_{ii} - \sum_{j \ne i} |h_{ij}| \Big)$$

Extension to interval matrices:

$$\lambda_{\min} \ge \min_i \Big( h^L_{ii} - \sum_{j \ne i} \max\{|h^L_{ij}|, |h^U_{ij}|\}\, \frac{u_j - \ell_j}{u_i - \ell_i} \Big)$$

Improvements

New relaxation functions (other than quadratic). Example:

$$\Phi(x; \gamma) = -\sum_{i=1}^{n} \left( 1 - e^{\gamma_i (x_i - \ell_i)} \right) \left( 1 - e^{\gamma_i (u_i - x_i)} \right)$$

gives a tighter underestimate than the quadratic function.

Partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex underestimator in each region; join the underestimators to form a single convex function in the whole domain.

## Domain (range) reduction

Techniques for cutting the feasible region without cutting the global optimum solution. Simplest approaches: feasibility-based and optimality-based range reduction (RR). Let the problem be:

$$\min_{x \in S} f(x)$$

Feasibility based RR asks for solving

$$\ell_i = \min_{x \in S} x_i \qquad u_i = \max_{x \in S} x_i$$

for all $i \in 1, \dots, n$ and then adding the constraints $x \in [\ell, u]$ to the problem (or to the sub-problems generated during Branch & Bound).

Feasibility Based RR

If $S$ is a polyhedron, RR requires the solution of LPs:

$$[\ell_k, u_k] = \min / \max\ x_k \quad \text{s.t.}\quad Ax \le b,\ x \in [L, U]$$

Poor man's L.P. based RR: from every constraint $\sum_j a_{ij} x_j \le b_i$ in which $a_{ik} > 0$:

$$x_k \le \frac{1}{a_{ik}} \Big( b_i - \sum_{j \ne k} a_{ij} x_j \Big) \le \frac{1}{a_{ik}} \Big( b_i - \sum_{j \ne k} \min\{a_{ij} L_j,\ a_{ij} U_j\} \Big)$$

Optimality Based RR

Given an incumbent solution $\bar x \in S$, ranges are updated by solving the sequence:

$$\ell_i = \min x_i \quad \text{s.t.}\quad \underline{f}(x) \le f(\bar x),\ x \in S \qquad\qquad u_i = \max x_i \quad \text{s.t.}\quad \underline{f}(x) \le f(\bar x),\ x \in S$$

where $\underline{f}(x)$ is a convex underestimate of $f$ in the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds).

generalization

Let

$$(P) \qquad \min_{x \in X} f(x) \quad \text{s.t.}\quad g(x) \le 0$$

be a (non convex) problem; let

$$(R) \qquad \min_{x \in \bar X} \underline{f}(x) \quad \text{s.t.}\quad \underline{g}(x) \le 0$$

be a convex relaxation of $(P)$:

$$\{x \in X : g(x) \le 0\} \subseteq \{x \in \bar X : \underline{g}(x) \le 0\}, \qquad \underline{f}(x) \le f(x) \quad \forall x \in X : g(x) \le 0$$

R.H.S. perturbation

Let

$$(R_y) \qquad \phi(y) = \min_{x \in \bar X} \underline{f}(x) \quad \text{s.t.}\quad \underline{g}(x) \le y$$

be a perturbation of $(R)$. $(R)$ convex $\Rightarrow$ $(R_y)$ convex for any $y$. Let $\bar x$ be an optimal solution of $(R)$ and assume that the $i$-th constraint is active:

$$\underline{g}_i(\bar x) = 0$$

Duality

Assume $(R)$ has a finite optimum at $\bar x$ with value $\phi(0)$ and Lagrange multipliers $\lambda$. Then the hyperplane

$$H(y) = \phi(0) - \lambda^T y$$

is a supporting hyperplane of the graph of $\phi(y)$ at $y = 0$, i.e.

$$\phi(y) \ge \phi(0) - \lambda^T y \qquad \forall y \in \mathbb{R}^m$$

Main result

If $(R)$ is convex with optimum value $\phi(0)$, constraint $i$ is active at the optimum and the Lagrange multiplier is $\lambda_i > 0$ then, if $U$ is an upper bound for the original problem $(P)$, the constraint:

$$\underline{g}_i(x) \ge -(U - L)/\lambda_i$$

(where $L = \phi(0)$) is valid for the original problem $(P)$, i.e. it does not exclude any feasible solution with value better than $U$.

proof

Problem $(R_y)$ can be seen as a convex relaxation of the perturbed non convex problem

$$\Phi(y) = \min_{x \in X} f(x) \quad \text{s.t.}\quad g(x) \le y$$

and thus $\phi(y) \le \Phi(y)$. Thus underestimating $(R_y)$ produces an underestimate of $\Phi(y)$. Let $y := e_i y_i$. From duality:

$$L - \lambda^T e_i y_i \le \phi(e_i y_i) \le \Phi(e_i y_i)$$

If $y_i < 0$ then $U$ is an upper bound also for $\Phi(e_i y_i)$, thus $L - \lambda_i y_i \le U$. But if $y_i < 0$ then constraint $i$ is active. For any feasible $x$ there exists a $y_i < 0$ such that $\underline{g}(x) \le y_i$ is active $\Rightarrow$ we may substitute $y_i$ with $\underline{g}_i(x)$ and deduce

$$L - \lambda_i \underline{g}_i(x) \le U$$

Applications

Range reduction: let $x \in [\ell, u]$ in the convex relaxed problem. If variable $x_i$ is at its upper bound in the optimal solution, then we can deduce

$$x_i \ge \max\{\ell_i,\ u_i - (U - L)/\lambda_i\}$$

where $\lambda_i$ is the optimal multiplier associated to the $i$-th upper bound. Analogously for active lower bounds:

$$x_i \le \min\{u_i,\ \ell_i + (U - L)/\lambda_i\}$$

Let the constraint

$$a_i^T x \le b_i$$

be active in an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality

$$a_i^T x \ge b_i - (U - L)/\lambda_i$$

## Methods based on merit functions

Bayesian algorithm: the objective function is considered as a realization of a stochastic process

$$f(x) = F(x; \omega)$$

A loss function is defined, e.g.:

$$L(x_1, \dots, x_n; \omega) = \min_{i=1,n} F(x_i; \omega) - \min_x F(x; \omega)$$

and the next point to sample is placed in order to minimize the expected loss (or risk)

$$x_{n+1} = \arg\min E\big( L(x_1, \dots, x_n, x_{n+1}; \omega) \mid x_1, \dots, x_n \big) = \arg\min E\big( \min(F(x_{n+1}; \omega) - F(x; \omega)) \mid x_1, \dots, x_n \big)$$

Given $k$ observations $(x_1, f_1), \dots, (x_k, f_k)$, an interpolant is built:

$$s(x) = \sum_{i=1}^{k} \lambda_i\, \phi(\|x - x_i\|) + p(x)$$

$p$: polynomial of a (prefixed) small degree $m$. $\phi$: radial function like, e.g.:

- $\phi(r) = r$ (linear)
- $\phi(r) = r^3$ (cubic)
- $\phi(r) = r^2 \log r$ (thin plate spline)
- $\phi(r) = e^{-\gamma r^2}$ (gaussian)

Polynomial $p$ is necessary to guarantee existence of a unique interpolant (i.e. when the matrix $\{\Phi_{ij} = \phi(\|x_i - x_j\|)\}$ is singular).

Bumpiness

Let $f^*_k$ be an estimate of the value of the global optimum after $k$ observations. Let $s^y_k$ be the (unique) interpolant of the data points

$$(x_i, f_i),\ i = 1, \dots, k, \qquad (y, f^*_k)$$

Idea: the most likely location of $y$ is such that the resulting interpolant has minimum bumpiness. Bumpiness measure:

$$\sigma(s_k) = (-1)^{m+1} \sum_i \lambda_i\, s^y_k(x_i)$$

TO BE DONE

Stochastic methods
- Pure Random Search: random uniform sampling over the feasible region
- Best start: like Pure Random Search, but a local search is started from the best observation
- Multistart: local searches started from randomly generated starting points
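The three schemes differ only in where the local search is applied, which a few lines of Python make explicit. The sketch below uses a hypothetical one-dimensional test function with two local minima on $[-2, 2]$ and a fixed-step projected gradient descent as a stand-in for a real local solver; all names and parameters are assumptions made for the illustration.

```python
import random

def f(x):
    return x ** 4 - 3.0 * x ** 2 + x          # illustrative test function, two local minima

def local_search(x, lo=-2.0, hi=2.0, step=0.01, iters=3000):
    """Projected gradient descent with a fixed step (a stand-in for any local solver)."""
    for _ in range(iters):
        grad = 4.0 * x ** 3 - 6.0 * x + 1.0
        x = min(hi, max(lo, x - step * grad))
    return x

def pure_random_search(n, rng, lo=-2.0, hi=2.0):
    return min((rng.uniform(lo, hi) for _ in range(n)), key=f)

def best_start(n, rng):
    return local_search(pure_random_search(n, rng))

def multistart(n, rng, lo=-2.0, hi=2.0):
    return min((local_search(rng.uniform(lo, hi)) for _ in range(n)), key=f)

rng = random.Random(0)
for method in (pure_random_search, best_start, multistart):
    x = method(50, rng)
    print(method.__name__, round(x, 4), round(f(x), 4))
```

Pure Random Search only returns the best sampled point, Best start spends one local search, and Multistart spends one local search per sample. On this function the global minimizer is near $x \approx -1.3008$ and the other local minimizer is near $x \approx 1.1309$.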


Clustering methods

- Given a uniform sample, evaluate the objective function
- Sample Transformation (or concentration): either a fraction of worst points are discarded, or a few steps of a gradient method are performed
- Remaining points are clustered
- From the best point in each cluster a single local search is started

[Figures: the phases of a clustering method, one panel each: uniform sample, sample concentration, clustering, local optimization.]

Clustering: MLSL

Sampling proceeds in batches of $N$ points. Given sample points $X_1, \dots, X_k \in [0,1]^n$, label $X_j$ as clustered iff $\exists\, Y \in \{X_1, \dots, X_k\}$:

$$\|X_j - Y\| \le \Delta_k := \frac{1}{\sqrt{2\pi}} \left( \log k \cdot \sigma\, \Gamma\!\left( 1 + \frac{n}{2} \right) \Big/ k \right)^{1/n}$$

and

$$f(Y) \le f(X_j)$$

A sequential sample is generated (batches consist of a single observation). A local search is started only from the last sampled point (i.e. there is no recall) unless there exists a sufficiently near sampled point with better function value.

Smoothing methods

Given $f : \mathbb{R}^n \to \mathbb{R}$, the Gaussian transform is defined as:

$$\langle f \rangle_\lambda (x) = \frac{1}{\pi^{n/2} \lambda^n} \int_{\mathbb{R}^n} f(y) \exp\left( -\|y - x\|^2 / \lambda^2 \right) dy$$

When $\lambda$ is sufficiently large $\langle f \rangle_\lambda$ is convex. Idea: starting with a large enough $\lambda$, minimize the smoothed function and slowly decrease $\lambda$ towards 0.

[Figures: four surface plots illustrating the smoothing, with both horizontal axes ranging from -10 to 10.]

## Transformed function landscape

Elementary idea: local optimization smooths out many high frequency oscillations

[Figure: surface plot illustrating the transformed function landscape, with both horizontal axes ranging from -10 to 10.]

Monotonic Basin-Hopping

    k := 0; f* := +∞;
    while k < MaxIter do
        X_k : random initial solution
        X*_k = arg min f(x; X_k)      (local minimization started at X_k)
        f_k = f(X*_k);
        if f_k < f* ⇒ f* := f_k
        NoImprove := 0;
        while NoImprove < MaxImprove do
            X = random perturbation of X*_k
            Y = arg min f(x; X);
            if f(Y) < f* ⇒ X*_k := Y; NoImprove := 0; f* := f(Y)
            otherwise NoImprove++
        end while
        k := k + 1
    end while
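A runnable sketch of the Monotonic Basin-Hopping loop is given below, under several assumptions that are not in the slides: a one-dimensional test function on $[-2, 2]$, a fixed-step projected gradient descent in place of the local minimization, a uniform perturbation of radius 1.5, and a small tolerance in the improvement test so that rounding noise does not reset the counter.

```python
import random

def f(x):
    return x ** 4 - 3.0 * x ** 2 + x           # illustrative test function

def local_min(x, lo=-2.0, hi=2.0, step=0.01, iters=3000):
    """Projected gradient descent, standing in for the local minimization."""
    for _ in range(iters):
        x = min(hi, max(lo, x - step * (4.0 * x ** 3 - 6.0 * x + 1.0)))
    return x

def monotonic_basin_hopping(max_iter=10, max_no_improve=20, radius=1.5, seed=0):
    rng = random.Random(seed)
    x_star, f_star = None, float("inf")
    for _ in range(max_iter):
        x_k = local_min(rng.uniform(-2.0, 2.0))      # random start, then local search
        if f(x_k) < f_star:
            x_star, f_star = x_k, f(x_k)
        no_improve = 0
        while no_improve < max_no_improve:
            trial = min(2.0, max(-2.0, x_k + rng.uniform(-radius, radius)))
            y = local_min(trial)                     # local search from the perturbed point
            if f(y) < f_star - 1e-12:                # accept only strict improvements
                x_k, x_star, f_star = y, y, f(y)
                no_improve = 0
            else:
                no_improve += 1
    return x_star, f_star

x_star, f_star = monotonic_basin_hopping()
print(x_star, f_star)
```

The perturbation radius is the key design parameter: it has to be large enough to leave the current basin of attraction, yet small enough that the search stays in the neighborhood of the current local minimum rather than degenerating into Multistart.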

References

In this year's course the global optimization part has been expanded, so it is possible that some parts of nonlinear optimization will be skipped. Here is an essential reference list for the material covered during the course:

- Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.
- Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
- Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.
- Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, in: Mathematical Programming, volume 103, pages 225-249, 2005.
- I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, Journal of Global Optimization, volume 7, number 4, pages 337-363, 1995.
- A. Rikun, A Convex Envelope Formula for Multilinear Functions, Journal of Global Optimization, volume 10, pages 425-437, 1997.
- Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, in: Mathematical Programming, volume 110, number 2, pages 373-404, 2007.