Contents

1 Numerical optimization
  1.1 Optimization of single-variable functions
    1.1.1 Golden Section Search
    1.1.2 Fibonacci Search
  1.2 Algorithms for unconstrained multivariable optimization
    1.2.1 Basic concepts
      1.2.1.1 Line search algorithms
      1.2.1.2 Trust region methods
    1.2.2 Newton methods
      1.2.2.1 The algorithm
      1.2.2.2 Geometrical interpretation
      1.2.2.3 Implementation aspects
      1.2.2.4 Modified Newton method
    1.2.3 Gradient methods
      1.2.3.1 Directional derivative and the gradient
      1.2.3.2 The method of steepest descent
    1.2.4 Conjugate gradient methods
    1.2.5 Quasi-Newton methods
    1.2.6 Algorithms for minimization without derivatives
      1.2.6.1 Nelder-Mead method
      1.2.6.2 Rosenbrock method
Chapter 1

Numerical optimization
1.1 Algorithms for optimization of single-variable functions. Bracketing techniques
Consider a single-variable real-valued function f(x) : [a, b] → R for which it is required to find an optimum in the interval [a, b]. Among the algorithms for univariate optimization, the golden section and the Fibonacci search techniques are fast, accurate and robust, and they do not require derivatives (Sinha, 2007; Press et al., 2007; Mathews, 2005). These methods can be used if the optimum is bracketed in a finite interval and the function is unimodal on [a, b]. The algorithms will be described for finding a minimum. The problem of finding a maximum is similar and may be solved with a very slight alteration of the algorithm or by changing the sign of the function.

A function f(x) is unimodal on an interval [a, b] if it has exactly one local minimum x* on the interval, i.e., it is strictly decreasing on [a, x*] and strictly increasing on [x*, b].

The initial interval [a, b] is called the initial interval of uncertainty. The general strategy of the golden section and Fibonacci methods is to progressively reduce this interval. This is accomplished by placing experiments and eliminating parts of the interval that cannot contain the optimum. Consider the interval divided by two points x1 and x2 located at the same distance from the boundaries a and b, as shown in Figure 1.2, i.e., the distance from a to x2 is the same as the distance from x1 to b. The function value is evaluated
at these two points and the interval is reduced to [a, x2] if f(x1) ≤ f(x2), or to [x1, b] otherwise. The procedure is iterative and it stops when the length of the interval is smaller than a given tolerance ε.

The interior points are placed at the same fraction r of the interval length, measured from each end:

    x2 − a = r(b − a)    (1.2)

    b − x1 = r(b − a)    (1.3)
We shall assume now that the function has a smaller value at x1 than at x2, thus the new search interval will be [a, x2]. The reasoning is similar for the case when f(x2) > f(x1). A point x3 is introduced within the new interval so that it is located at the same distance, d2, from x2 as x1 is from a:

    d2 = x1 − a = x2 − x3    (1.4)

The ratio r must remain constant on each new interval, which means:

    x1 − a = r(x2 − a)    (1.5)

    x2 − x3 = r(x2 − a)    (1.6)
If x2 from (1.2) and x1 from (1.3) are replaced into (1.5), the value of r can be determined as follows:

    b − r(b − a) − a = r[a + r(b − a) − a]    (1.7)

Rearranging, we get:

    (b − a)(1 − r − r²) = 0    (1.8)

and since the initial length of the interval is nonzero, the ratio r is one of the roots of

    1 − r − r² = 0    (1.9)
The solutions of (1.9) are:

    r1 = (−1 + √5)/2 = 0.618,    r2 = (−1 − √5)/2    (1.10)

Only the first root is acceptable, because the ratio must satisfy 1/2 < r < 1.

The golden section search guarantees that after each interval reduction the minimum will be bracketed in an interval having a length equal to 0.618 times the size of the preceding one.
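For completeness, here is a minimal Python sketch of the golden section search. The test function, interval and tolerance are illustrative choices, not values prescribed in the text.

import math

def golden_section_search(f, a, b, tol=1e-5):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    r = (math.sqrt(5.0) - 1.0) / 2.0          # golden ratio, about 0.618
    x1 = b - r * (b - a)                      # interior points, eqs. (1.2)-(1.3)
    x2 = a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    while (b - a) > tol:
        if f1 <= f2:                          # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = f(x1)
        else:                                 # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2.0

# Example: minimum of f(x) = x**4 - x**2 on [0, 1] (close to 1/sqrt(2))
print(golden_section_search(lambda x: x**4 - x**2, 0.0, 1.0))

Only one new function evaluation is needed per iteration, because the property r² = 1 − r makes one of the old interior points reusable in the reduced interval.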
1.1.2 Fibonacci Search

The Fibonacci search uses the sequence of Fibonacci numbers, defined by F0 = F1 = 1 and

    Fk = Fk−1 + Fk−2,    k = 2, 3, ...    (1.11)

As in the golden section method, two interior points x1k and x2k are placed in the current interval [ak, bk]. Assume the function has the smaller of the two values at x1k; the new search interval is then [ak+1 = ak, bk+1 = x2k] and the point x1k is relabeled as x2,k+1.
At each iteration, the interval is reduced by a ratio rk, 1/2 < rk < 1, and the distance dk is calculated as:

    dk = rk(bk − ak) = rk·dk−1    (1.12)

The point added at the next iteration is placed at a distance dk+1 = rk+1·dk from one end of the new interval, and it must coincide with the point retained from the previous iteration, which lies at a distance dk−1 − dk = (1 − rk)dk−1 from that same end. Therefore rk+1·dk = (1 − rk)dk−1 and, using dk = rk·dk−1,

    rk+1 = (1 − rk)/rk    (1.16)
In the Fibonacci search, the ratios rk are chosen as quotients of consecutive Fibonacci numbers:

    rk = F_{n−k}/F_{n−k+1}    (1.17)

Replacing this expression in (1.16) gives:

    rk+1 = (1 − F_{n−k}/F_{n−k+1}) / (F_{n−k}/F_{n−k+1}) = (F_{n−k+1} − F_{n−k})/F_{n−k}    (1.18)

and, using the recurrence (1.11),

    rk+1 = F_{n−k−1}/F_{n−k}    (1.19)

The sequence of reduction ratios is therefore:

    r1 = F_{n−1}/Fn,  r2 = F_{n−2}/F_{n−1},  ...,  rn−2 = F2/F3,  rn−1 = F1/F2 = 1/2    (1.20)
At the last step, n − 1, no new points can be added. The distance dn−1 = dn−2·rn−1 = dn−2/2, i.e., at step n − 1 the points x1,n−1 and x2,n−2 are equal, and:

    rn = F0/F1 = 1/1 = 1    (1.21)
The last interval obtained is dn = rn(bn − an) = rn·dn−1 = dn−1 and no other steps are possible. It may be written as:

    dn = rn·dn−1 = (F0/F1)·dn−1 = (F0·F1)/(F1·F2)·dn−2 = ... = (F0·F1 ··· Fn−2·Fn−1)/(F1·F2 ··· Fn−1·Fn)·d0 = (F0/Fn)·d0 = (1/Fn)·d0    (1.22)

Since the initial interval of uncertainty has the length d0 = b − a, the length of the final interval is:

    dn = (b − a)/Fn    (1.24)

so the number of iterations n can be chosen in advance such that (b − a)/Fn is smaller than the desired accuracy.
The Fibonacci numbers can also be expressed in closed form as:

    Fn = (1/√5)·[((1 + √5)/2)^n − ((1 − √5)/2)^n]    (1.25)
The search may also be stopped when the distance between two successive points is smaller than a given tolerance ε1,

    |xk+1 − xk| < ε1    (1.26)

or when the function values calculated at two successive points do not vary by more than a given tolerance ε2:

    |f(xk+1) − f(xk)| < ε2    (1.27)
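A corresponding Python sketch of the Fibonacci search is given below; the number of iterations n is fixed in advance, as discussed above, and the test function, interval and value of n are illustrative.

def fibonacci_search(f, a, b, n):
    """Minimize a unimodal f on [a, b] using n Fibonacci-search evaluations."""
    F = [1, 1]                             # Fibonacci numbers with F0 = F1 = 1, eq. (1.11)
    while len(F) < n + 1:
        F.append(F[-1] + F[-2])
    r = F[n - 1] / F[n]                    # first ratio r1 = F(n-1)/Fn, eq. (1.20)
    x1 = b - r * (b - a)
    x2 = a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    for k in range(1, n - 1):
        r = F[n - k - 1] / F[n - k]        # next ratio, eq. (1.19)
        if f1 <= f2:                       # keep [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = f(x1)
        else:                              # keep [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2.0

# Example: same illustrative test function as for the golden section search
print(fibonacci_search(lambda x: x**4 - x**2, 0.0, 1.0, 20))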
1.2 Algorithms for unconstrained multivariable optimization

1.2.1 Basic concepts

For functions of several variables, the stopping criteria are formulated in terms of vector norms, for example:

    ‖xk+1 − xk‖ < ε    (1.28)

where ‖x‖ is the Euclidean norm of a vector x = [x1 x2 ... xn]^T, given by:

    ‖x‖ = √(x1² + x2² + ... + xn²)    (1.29)

In general, there are two strategies for moving from one point to the next: line search and trust region. An extensive description of algorithms from both categories can be found in (Nocedal and Wright, 1999; Snyman, 2005; Conn et al., 2000).
1.2.1.1 Line search algorithms

[Figure: successive iterates x0, x2, ..., xn generated by a descent algorithm in the (x1, x2) plane.]

In a line search strategy, every iteration computes a search direction and a step along it:

    xk+1 = xk + sk·dk    (1.30)
where xk and dk are the current point and the direction of search, respectively, and sk is the step length, which is determined by solving a one-dimensional minimization problem:

    F(sk) = f(xk + sk·dk)    (1.31)

The line search is necessary for the following reason. If the step size is very small, the value of the function decreases only by a small amount at each iteration. On the other hand, large values of the step length may also lead to the situation where the minimum point is not reached in a small number of iterations, as shown in Figure 1.5: the function f is reduced at every step, but by a very small amount compared to the step length.

The direction dk must be a descent direction, i.e., the derivative of F at sk = 0 must be negative:

    F′(0) = dk^T ∇f(xk) < 0    (1.33)
1.2.2 Newton methods

The Newton method can be derived from the problem of finding a zero of a scalar function g(x). At the current point xk, the graph of g is approximated by its tangent line,

    y = g(xk) + g′(xk)(x − xk)
To obtain the point of intersection with the x-axis, we set the tangent approximation equal to zero and solve for x:

    x = xk − g(xk)/g′(xk)    (1.47)

Let us denote the new point by xk+1. The iterative scheme is:

    xk+1 = xk − g(xk)/g′(xk)    (1.49)

A stationary point of a function f is a zero of its first derivative, so applying the scheme (1.49) to g = f′ gives the Newton iteration for optimization:

    xk+1 = xk − f′(xk)/f″(xk)    (1.50)
For a function of several variables, the derivatives in (1.50) are replaced by the gradient and the Hessian matrix, and the iteration becomes:

    xk+1 = xk − H(xk)⁻¹ ∇f(xk)    (1.53)

If the function is quadratic, of the form (1.51), the Newton method will converge to the exact stationary point in one step, no matter which initial point is selected.
Example 1.1 Consider the function f(x1, x2) = x1² + 2x2² + x1 + 7. It can be written in the form (1.51) as:

    f(x1, x2) = (1/2)·[x1 x2]·[2 0 ; 0 4]·[x1 ; x2] + [1 0]·[x1 ; x2] + 7    (1.54)
Let x0 = [x10 x20]^T be the initial point of the Newton algorithm. The gradient and Hessian matrix of f are:

    ∇f(x1, x2) = [2x1 + 1 ; 4x2],    H(x1, x2) = [2 0 ; 0 4]    (1.55)

The first Newton iterate is:

    [x11 ; x12] = [x10 ; x20] − [1/2 0 ; 0 1/4]·[2x10 + 1 ; 4x20]    (1.56)

    = [−1/2 ; 0]    (1.57)

which is the stationary point of f, reached in a single step regardless of the starting point.
If the function is not quadratic, the selection of the initial point is very important for the location of the extremum found by the Newton algorithm. The convergence towards a local extremum is very fast, but the method does not reveal the type of the stationary point unless the second-order conditions are verified at the solution. Since the calculation of the second-order derivatives and of the Hessian matrix is already required for the implementation of the Newton method, the second-order conditions can be easily examined.
Consider, as an example, the function f(x) = x⁴ − x², with f′(x) = 4x³ − 2x and

    f″(x) = 12x² − 2    (1.58)

A starting point x0 and a tolerance ε (i.e. ε = 10⁻⁵) must be selected. The next points are calculated according to the iterative scheme:

    xk+1 = xk − f′(xk)/f″(xk)    (1.60)
Figures 1.8, 1.9 and 1.10 illustrate the result of the Newton method for various values of the initial point x0: the extremum x* obtained and its type, determined directly from the computation of the second derivative at x*.

[Figure 1.8: iterations started from x0 = 0.9 stop at x* = 0.707; f″(x*) = 4 > 0, so x* is a local minimum.]

[Figure 1.9: iterations started from x0 = 0.3 stop at x* = 0; f″(x*) = −2 < 0, so x* is a local maximum.]

[Figure 1.10: iterations started from a third initial point stop at a local minimum, where f″(x*) = 4 > 0.]

The advantages of the Newton method include:

- The method is powerful and easy to implement.
- It will converge to a local extremum from any sufficiently close starting value.

However, the method has some drawbacks:

- The method requires the knowledge of the gradient and the Hessian, which is not always available. For large problems, determining the Hessian matrix by finite differences is costly in terms of function evaluations.
- There is no guarantee of convergence unless the initial point is reasonably close to the extremum, thus the global convergence properties are poor.
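The scheme (1.60) can be sketched in a few lines of Python. The derivatives below are those of the example function f(x) = x⁴ − x²; the starting point, tolerance and iteration limit are illustrative.

def newton_1d(f1, f2, x0, tol=1e-5, max_iter=100):
    """Newton iteration x_{k+1} = x_k - f'(x_k)/f''(x_k), as in eq. (1.60).

    f1 and f2 are the first and second derivatives of the objective."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f1(x) / f2(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# f(x) = x**4 - x**2:  f'(x) = 4x**3 - 2x,  f''(x) = 12x**2 - 2
x_star = newton_1d(lambda x: 4*x**3 - 2*x, lambda x: 12*x**2 - 2, x0=0.9)
print(x_star, 12*x_star**2 - 2)   # about 0.707; positive second derivative -> local minimum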
1.2.2.4 Modified Newton method
The classical Newton method requires calculation of the Hessian at each iteration. Moreover, the matrix has to be inverted, which may be inconvenient
for large problems. A modified Newton method uses the Hessian calculated
at the initial point. The convergence of the algorithm is still good for starting
points close to extrema but the convergence is linear instead of quadratic (as
in the case of the classical method).
Algorithm 7 Modified Newton method
  Define f(x) and a tolerance ε
  Choose an initial point x0
  Compute the gradient ∇f(x)
  Compute and evaluate the Hessian at the initial point, H(x0)
  Compute the inverse of the Hessian, H(x0)⁻¹
  while ‖H(x0)⁻¹ ∇f(xk)‖ > ε do
    Calculate a new point xk+1 = xk − H(x0)⁻¹ ∇f(xk)
    Set k ← k + 1
  end while
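A short Python sketch of Algorithm 7 follows; the gradient and Hessian of the quadratic from Example 1.1 are used as an illustrative test problem, and the starting point and tolerance are arbitrary choices.

import numpy as np

def modified_newton(grad, hess, x0, tol=1e-6, max_iter=200):
    """Modified Newton method: the Hessian is evaluated and inverted only at x0."""
    x = np.asarray(x0, dtype=float)
    H0_inv = np.linalg.inv(hess(x))           # inverse Hessian computed once
    for _ in range(max_iter):
        step = H0_inv @ grad(x)
        if np.linalg.norm(step) < tol:        # stop when ||H(x0)^-1 grad f(xk)|| <= tol
            break
        x = x - step
    return x

# Example 1.1: f(x1, x2) = x1**2 + 2*x2**2 + x1 + 7
grad = lambda x: np.array([2*x[0] + 1, 4*x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(modified_newton(grad, hess, [3.0, -2.0]))   # returns approximately [-0.5, 0]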
1.2.3 Gradient methods

1.2.3.1 Directional derivative and the gradient

The directional derivative of a function f at a point x, along a unit vector u, is defined as:

    ∇u f = ∇f · u    (1.61)

where ∇f is the gradient of the function f and (·) is the scalar (or inner) product of vectors.
It generalizes the notion of a partial derivative, in which the direction is always taken parallel to one of the coordinate axes. The slopes of the surface f(x1, x2) in the directions of the coordinate axes and in the direction of an arbitrary vector u are illustrated in Figure 1.11.

[Figure 1.11: the surface f(x1, x2) and its slopes in the direction of x1, in the direction of x2 and in the direction of an arbitrary vector u.]

[Figure 1.12: inner product of the vectors ∇f and u.]
The inner product from the definition of the directional derivative can be expressed as:

    ∇u f = ∇f · u = ‖∇f‖ ‖u‖ cos θ = ‖∇f‖ cos θ    (1.62)

where ‖∇f‖ and ‖u‖ are the norms of the corresponding vectors and θ is the angle between them (Figure 1.12). Because u is a unit vector, its norm is equal to one.

From (1.62) we obtain that:

- The largest possible value of ∇u f is obtained when cos θ assumes its maximum value of 1, at θ = 0, i.e., when the vector u points in the direction of the gradient ∇f. Thus, the directional derivative is maximized in the direction of the gradient and this maximum value is the magnitude of the gradient, ‖∇f‖.
- If θ = π, the cosine has its smallest value of −1 and the unit vector u points in the opposite direction of the gradient, −∇f. The directional derivative is minimized in the direction of −∇f and ∇u f = −‖∇f‖.
1.2.3.2 The method of steepest descent

The method of steepest descent moves, at every iteration, along the direction in which the function decreases most rapidly, i.e., along the negative of the gradient. The steps of the method are the following:

  Define f(x), a starting point x0 and a tolerance ε
  repeat
    Compute the gradient ∇f(xk)
    Determine the step length sk by solving the one-dimensional problem
        min  f(xk − sk·∇f(xk)/‖∇f(xk)‖)
        sk>0
    Calculate a new point
        xk+1 ← xk − sk·∇f(xk)/‖∇f(xk)‖
    Set k ← k + 1
  until ‖∇f(xk)‖ < ε (or ‖xk+1 − xk‖ < ε or |f(xk+1) − f(xk)| < ε)
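A Python sketch of the method is given below. The exact one-dimensional minimization over sk is replaced here by a simple backtracking rule, and the quadratic test function, starting point and tolerances are illustrative choices.

import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=500):
    """Steepest descent with the normalized direction -grad f / ||grad f||."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        g_norm = np.linalg.norm(g)
        if g_norm < tol:                      # stop when ||grad f(xk)|| < tol
            break
        d = -g / g_norm                       # normalized steepest-descent direction
        s, fx = 1.0, f(x)
        # backtracking substitute for the exact minimization of F(s)
        while f(x + s * d) > fx - 1e-4 * s * g_norm and s > 1e-12:
            s *= 0.5
        x = x + s * d
    return x

# Example: the quadratic f(x) = 0.5 x^T Q x with Q = [[2, 2], [2, 8]]
Q = np.array([[2.0, 2.0], [2.0, 8.0]])
print(steepest_descent(lambda x: 0.5 * x @ Q @ x, lambda x: Q @ x, [-2.5, 0.0]))
# converges toward the minimum at [0, 0]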
Quadratic functions

In the particular case when the function to be minimized is quadratic, we are able to determine explicitly the optimal step size that minimizes f(xk − sk·∇f(xk)). Consider a function having the form:

    f(x) = (1/2)·x^T Q x + b^T x + c    (1.66)

with Q a symmetric matrix. The gradient of f is:

    ∇f(x) = Qx + b    (1.67)

Notice also that the matrix Q is the Hessian of f, because it is obtained as the second derivative of f with respect to the vector x.
We shall determine the step sk that minimizes f(xk − sk·∇f(xk)) for a given point xk:

    f(xk − sk·∇f(xk)) = (1/2)·(xk − sk·∇f(xk))^T Q (xk − sk·∇f(xk)) + b^T (xk − sk·∇f(xk))    (1.68)

Expanding this expression and differentiating it with respect to sk gives:

    df/dsk = sk·∇f(xk)^T Q ∇f(xk) − ∇f(xk)^T ∇f(xk)    (1.70)

where the last term in the right-hand side was obtained using (1.67). If we set the derivative df/dsk equal to zero, the optimal step is given by:

    sk = ∇f(xk)^T ∇f(xk) / (∇f(xk)^T Q ∇f(xk))    (1.71)
Thus, for quadratic functions, the method does not require a line search procedure and the steepest descent iteration is given by:

    xk+1 = xk − [∇f(xk)^T ∇f(xk) / (∇f(xk)^T Q ∇f(xk))]·∇f(xk)    (1.72)
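For quadratic functions, the iteration (1.72) can be coded directly, without any line search. In the sketch below, the matrix Q, the vector b and the starting point are illustrative.

import numpy as np

def steepest_descent_quadratic(Q, b, x0, tol=1e-8, max_iter=1000):
    """Steepest descent for f(x) = 0.5 x^T Q x + b^T x with the exact step (1.71)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = Q @ x + b                      # gradient of the quadratic, eq. (1.67)
        if np.linalg.norm(g) < tol:
            break
        s = (g @ g) / (g @ Q @ g)          # optimal step length, eq. (1.71)
        x = x - s * g                      # iteration (1.72)
    return x

print(steepest_descent_quadratic(np.array([[2.0, 2.0], [2.0, 8.0]]),
                                 np.zeros(2), [-2.5, 0.0]))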
1.2.4 Conjugate gradient methods

The conjugate gradient method combines the current negative gradient with the previous search direction in order to build directions that are conjugate with respect to the Hessian:

    dk+1 = −∇f(xk+1) + βk·dk    (1.76)

where the coefficient βk is calculated by one of the formulas named Fletcher-Reeves (FR), Polak-Ribière (PR) or Hestenes-Stiefel (HS) after their developers.

Fletcher-Reeves formula:

    βk = ∇f(xk+1)^T ∇f(xk+1) / (∇f(xk)^T ∇f(xk))    (1.77)

Polak-Ribière formula:

    βk = ∇f(xk+1)^T [∇f(xk+1) − ∇f(xk)] / (∇f(xk)^T ∇f(xk))    (1.78)

Hestenes-Stiefel formula:

    βk = ∇f(xk+1)^T [∇f(xk+1) − ∇f(xk)] / (dk^T [∇f(xk+1) − ∇f(xk)])    (1.79)

For quadratic functions of the form (1.66), the optimal step length along dk can be computed explicitly, as in the case of steepest descent:

    sk = −∇f(xk)^T dk / (dk^T Q dk)    (1.81)
The conjugate gradient algorithm is:

  Choose a starting point x0, compute d0 = −∇f(x0) and set k = 0
  repeat
    Compute the step length sk (by a line search, or from (1.81) for quadratic functions)
    Compute xk+1 = xk + sk·dk
    Compute βk using one of the FR (1.77), PR (1.78) or HS (1.79) formulas
    Compute the new direction
        dk+1 = −∇f(xk+1) + βk·dk
    Set k ← k + 1
  until convergence criterion is satisfied.
Consider, as an example, the minimization of the quadratic function

    f(x1, x2) = (1/2)·[x1 x2]·[2 2 ; 2 8]·[x1 ; x2] = (1/2)·x^T Q x    (1.83)

by the conjugate gradient method with the Fletcher-Reeves formula, starting from the point x0 = [−2.5 0]^T.

The gradient of the function is:

    ∇f(x) = [2x1 + 2x2 ; 2x1 + 8x2]    (1.84)

and its value at the starting point is:

    ∇f(x0) = [2·(−2.5) + 2·0 ; 2·(−2.5) + 8·0] = [−5 ; −5]    (1.85)

The first search direction is the steepest descent direction, d0 = −∇f(x0) = [5 ; 5]^T.
The optimal step length for quadratic functions, given by (1.81), is:

    s0 = −∇f(x0)^T d0 / (d0^T Q d0) = −[−5 −5]·[5 ; 5] / ([5 5]·[2 2 ; 2 8]·[5 ; 5]) = 0.1429    (1.86)

The next point is:

    x1 = x0 + s0·d0 = [−2.5 ; 0] + 0.1429·[5 ; 5] = [−1.7857 ; 0.7143]    (1.87)

and the gradient at this point is:

    ∇f(x1) = [−2.1429 ; 2.1429]    (1.88)
We compute the next iterate using the Fletcher-Reeves formula to determine the coefficient β0:

    β0 = ∇f(x1)^T ∇f(x1) / (∇f(x0)^T ∇f(x0)) = [−2.1429 2.1429]·[−2.1429 ; 2.1429] / ([−5 −5]·[−5 ; −5]) = 0.1837    (1.89)

The new search direction is:

    d1 = −∇f(x1) + β0·d0 = [2.1429 ; −2.1429] + 0.1837·[5 ; 5] = [3.0612 ; −1.2245]    (1.90)

the corresponding optimal step length is:

    s1 = −∇f(x1)^T d1 / (d1^T Q d1) = 0.5833    (1.91)

and the new point is:

    x2 = x1 + s1·d1 = [−1.7857 ; 0.7143] + 0.5833·[3.0612 ; −1.2245] = [0 ; 0]    (1.92)

which is the minimum of the function, reached after only two iterations.
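The example above can be reproduced with a short Fletcher-Reeves implementation that uses the exact step length (1.81) for quadratic functions; the matrix Q and starting point below are those of the example, while the tolerance is an arbitrary choice.

import numpy as np

def conjugate_gradient_quadratic(Q, x0, tol=1e-8):
    """Fletcher-Reeves conjugate gradient for f(x) = 0.5 x^T Q x with exact steps."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x                                  # gradient at the current point
    d = -g                                     # first direction: steepest descent
    for _ in range(2 * len(x)):
        if np.linalg.norm(g) < tol:
            break
        s = -(g @ d) / (d @ Q @ d)             # optimal step length, eq. (1.81)
        x = x + s * d
        g_new = Q @ x
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient, eq. (1.77)
        d = -g_new + beta * d                  # new direction, eq. (1.76)
        g = g_new
    return x

Q = np.array([[2.0, 2.0], [2.0, 8.0]])
print(conjugate_gradient_quadratic(Q, [-2.5, 0.0]))   # reaches [0, 0] in two iterations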
1.2.5 Quasi-Newton methods

Quasi-Newton methods replace the inverse of the Hessian in the Newton iteration by an approximation Bk which is improved at every step:

    xk+1 = xk − sk·Bk·∇f(xk)    (1.93)

    Bk+1 = Bk + Bk^u    (1.94)

The update matrix Bk^u can be calculated using one of two widely used formulas: Davidon-Fletcher-Powell (DFP) or Broyden-Fletcher-Goldfarb-Shanno (BFGS). Both of them require the value of the point and of the gradient at the current and previous iterations:

    Δxk = xk+1 − xk    (1.95)

    ΔGk = ∇f(xk+1) − ∇f(xk)    (1.96)

Davidon-Fletcher-Powell formula:

    Bk^u = (Δxk·Δxk^T)/(Δxk^T ΔGk) − (Bk·ΔGk)(Bk·ΔGk)^T/(ΔGk^T Bk ΔGk)    (1.97)

Broyden-Fletcher-Goldfarb-Shanno formula:

    Bk^u = (ΔGk·ΔGk^T)/(ΔGk^T Δxk) − (Bk·Δxk)(Bk·Δxk)^T/(Δxk^T Bk Δxk)    (1.98)

Two properties of these formulas are worth noting:

- If Bk is positive definite and the step size sk > 0 is chosen to satisfy the Wolfe conditions, then Bk+1 is also positive definite.
- For a quadratic objective function, the DFP method simultaneously generates the directions of the conjugate gradient method while constructing the inverse Hessian Bk.
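The following Python sketch applies the DFP update (1.97) inside the iteration (1.93). A crude backtracking rule stands in for a Wolfe line search, the initial approximation B0 = I is a common default, and the quadratic test function is illustrative.

import numpy as np

def dfp_update(B, dx, dG):
    """DFP update (1.97): dx = x_{k+1} - x_k, dG = grad f(x_{k+1}) - grad f(x_k)."""
    BdG = B @ dG
    return B + np.outer(dx, dx) / (dx @ dG) - np.outer(BdG, BdG) / (dG @ BdG)

def quasi_newton_dfp(f, grad, x0, tol=1e-6, max_iter=200):
    """Quasi-Newton iterations x_{k+1} = x_k - s_k B_k grad f(x_k) with DFP updates."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                          # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -B @ g
        s, fx = 1.0, f(x)
        while f(x + s * d) > fx and s > 1e-12:  # crude backtracking line search
            s *= 0.5
        x_new = x + s * d
        g_new = grad(x_new)
        dx, dG = x_new - x, g_new - g
        if dx @ dG > 1e-12:                     # keep the approximation positive definite
            B = dfp_update(B, dx, dG)
        x, g = x_new, g_new
    return x

Q = np.array([[2.0, 2.0], [2.0, 8.0]])
print(quasi_newton_dfp(lambda x: 0.5 * x @ Q @ x, lambda x: Q @ x, [-2.5, 0.0]))
# approaches the minimum at [0, 0]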
1.2.6 Algorithms for minimization without derivatives

1.2.6.1 Nelder-Mead method

The Nelder-Mead method searches for the minimum of a function of two variables by moving and reshaping a triangle. The function is evaluated at the three vertices, which are labeled B (best, the smallest value), G (good) and W (worst, the largest value). If B = (x1, y1) and G = (x2, y2), the midpoint M of the good side BG has the coordinates:

    xM = (x1 + x2)/2,    yM = (y1 + y2)/2    (1.100)

i.e.,

    M = (B + G)/2 = ((x1 + x2)/2, (y1 + y2)/2)    (1.101)

Reflection. The triangle is reflected away from the worst vertex: the test point R is obtained by reflecting W through the midpoint M,

    R = 2M − W    (1.102)

Expansion. If the function value at R is smaller than the value at B, the move was made in a good direction and the search is extended further, to the point

    E = 2R − M    (1.103)

The reflected and extended points are shown in Figure 1.22.

[Figure 1.22: the triangle BGW, the reflected point R and the extended point E.]
Contraction. If the function values at R and W are the same, another point must be tested. Perhaps the function is smaller at M, but we cannot replace W with M because we need a triangle. Consider the two midpoints C1 and C2 of the line segments WM and MR, respectively (Figure 1.23). The point with the smaller function value is called C, and the new triangle is BGC:

    C1 = (W + M)/2,    C2 = (R + M)/2    (1.104)
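Before continuing with a worked example, here is a minimal Python sketch of these steps. The acceptance rules coded below are one common variant of the scheme described above, the shrink step is omitted, and the test function and initial triangle are those used in the example that follows.

import numpy as np

def nelder_mead_triangle(f, vertices, tol=1e-8, max_iter=500):
    """Nelder-Mead simplex search in the plane, following the B, G, W scheme."""
    V = [np.asarray(v, dtype=float) for v in vertices]
    for _ in range(max_iter):
        V.sort(key=f)                          # V[0] = B (best), V[1] = G, V[2] = W (worst)
        B, G, W = V
        if abs(f(B) - f(W)) < tol:             # vertices have almost equal function values
            break
        M = (B + G) / 2.0                      # midpoint of the good side, eq. (1.101)
        R = 2.0 * M - W                        # reflected point, eq. (1.102)
        if f(R) < f(B):                        # promising direction: try E = 2R - M
            E = 2.0 * R - M
            V[2] = E if f(E) < f(R) else R
        elif f(R) < f(W):                      # better than the worst vertex: keep R
            V[2] = R
        else:                                  # contraction: the better of C1 and C2, eq. (1.104)
            C1, C2 = (W + M) / 2.0, (R + M) / 2.0
            V[2] = C1 if f(C1) < f(C2) else C2
    return min(V, key=f)

# The example below: f(x, y) = (x - 10)^2 + (y - 10)^2, initial triangle (0,0), (2,0), (0,6)
f = lambda p: (p[0] - 10.0) ** 2 + (p[1] - 10.0) ** 2
print(nelder_mead_triangle(f, [(0.0, 0.0), (2.0, 0.0), (0.0, 6.0)]))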
As an example, consider the function f(x, y) = (x − 10)² + (y − 10)² and the initial triangle with vertices (0, 6), (2, 0) and (0, 0). The function values are f(0, 6) = 116, f(2, 0) = 164 and f(0, 0) = 200, so the vertices are labeled B = (0, 6), G = (2, 0) and W = (0, 0).

First iteration. The midpoint of the good side is:

    M = (B + G)/2 = (1, 3)    (1.110)

and the reflected point is:

    R = 2M − W = (2, 6)    (1.111)

The function value at R, f(R) = f(2, 6) = 80, is smaller than f(B), and the extended point E will be calculated from:

    E = 2R − M = (3, 9)    (1.112)
Since f(E) = f(3, 9) = 50 is smaller than f(R), the worst vertex W is replaced by E, and the new triangle has the vertices (3, 9), (0, 6) and (2, 0).

Second iteration. Because f(3, 9) < f(0, 6) < f(2, 0), the new labels are:

    B = (3, 9), G = (0, 6), W = (2, 0)    (1.113)

    M = (B + G)/2 = (1.5, 7.5),    R = 2M − W = (1, 15)    (1.114)

The value of f at the reflected point is f(1, 15) = 106. It is less than f(G) = f(0, 6) = 116 but greater than f(B) = f(3, 9) = 50. Therefore, the extended point will not be calculated and the new triangle is RGB.
Third iteration. Because f(3, 9) < f(1, 15) < f(0, 6), the new labels are (Figure 1.27):

    B = (3, 9), G = (1, 15), W = (0, 6)    (1.115)

    M = (B + G)/2 = (2, 12),    R = 2M − W = (4, 18)    (1.116)

and the function takes at R the value f(R) = f(4, 18) = 100. This is again less than the value at G, but it is not less than the value at B, thus the point R will be a vertex of the next triangle RGB.
Fourth iteration. Because f(3, 9) < f(4, 18) < f(1, 15), the new labels are:

    B = (3, 9), G = (4, 18), W = (1, 15)    (1.117)

    M = (B + G)/2 = (3.5, 13.5),    R = 2M − W = (6, 12)    (1.118)

Because f(R) = f(6, 12) = 20 < f(B), we shall build the extended point E:

    E = 2R − M = (8.5, 10.5)    (1.119)

The function takes here the value f(E) = f(8.5, 10.5) = 2.5, thus the new triangle is GBE.
The calculations continue until the function values at the vertices are almost equal (within some allowable tolerance ε). It will still take a large number of steps until the procedure reaches the minimum point (10, 10), where the function value is zero.
1.2.6.2 Rosenbrock method

The Rosenbrock method searches for the minimum along a set of orthogonal directions that are rotated at every stage. In the two-dimensional case, the initial directions are the coordinate axes:

    d1 = [1 ; 0],    d2 = [0 ; 1]    (1.120)

Starting from an initial point A, a step of length s1 is taken along d1 to reach a point B:

    [x1B ; x2B] = [x1A ; x2A] + s1·[1 ; 0]    (1.121)

and then a step of length s2 along d2 leads to a point C:

    xC = xB + s2·d2    (1.122)
The coordinate system is now rotated and the new orthogonal directions of the axes are d′1, d′2. If we move further on to the point D along d′1 and then to E along d′2, with the distances s′1 and s′2, the coordinates are given by similar relations:

    xD = xC + s′1·d′1    (1.124)

    xE = xD + s′2·d′2    (1.125)

The initial data for the Rosenbrock method include (Lucaci and Agachi, 2002):

- an initial vector x0 = [x10 x20 ... xn0]^T;
- a set of initial orthogonal normalized vectors d10, d20, ..., dn0; usually they are taken as the columns of an n-th order unit matrix;
- an initial vector of step lengths along each direction, s = [s10 s20 ... sn0]^T.
After a stage is completed, let Di denote the total distance moved along the direction di. The following vectors are formed:

    A1 = D1·d1 + D2·d2 + ... + Dn·dn    (1.126)

    A2 = D2·d2 + ... + Dn·dn    (1.127)

    ...

    An = Dn·dn    (1.128)

Thus Ai is the vector joining the initial and final points obtained by the use of the vectors dk, k = i, ..., n. Orthogonal unit vectors of the new coordinate system (d′1, d′2, ..., d′n) are obtained from Ai by the Gram-Schmidt orthogonalization procedure:
    B1 = A1,    d′1 = B1/‖B1‖    (1.130)

    B2 = A2 − (A2^T B1/‖B1‖²)·B1,    d′2 = B2/‖B2‖    (1.131)

    ...

    Bn = An − Σ_{j=1}^{n−1} (An^T Bj/‖Bj‖²)·Bj,    d′n = Bn/‖Bn‖    (1.132)
The procedure is repeated until a convergence criterion is satisfied, for example when the variation of the function between successive iterations becomes negligible:

    |f(xk+1) − f(xk)| < ε    (1.133)
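A simplified Python sketch of a Rosenbrock-type search is given below. It performs a single trial step along every direction per stage and then rotates the directions by Gram-Schmidt, so the step-management details differ from the full method; the factors alpha and beta, the initial step s0, the tolerance and the test function are all illustrative choices.

import numpy as np

def rosenbrock_search(f, x0, s0=0.5, alpha=3.0, beta=0.5, tol=1e-6, max_iter=500):
    """Simplified rotating-directions search in the spirit of Rosenbrock's method."""
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    D = np.eye(n)                                # columns d1, ..., dn
    s = np.full(n, s0)                           # step lengths along each direction
    for _ in range(max_iter):
        lengths = np.zeros(n)                    # total distance D_i moved along d_i
        for i in range(n):                       # one stage: a trial step along every direction
            trial = x + s[i] * D[:, i]
            if f(trial) < f(x):
                x = trial
                lengths[i] += s[i]
                s[i] *= alpha                    # expand a successful step
            else:
                s[i] *= -beta                    # shrink and reverse a failed step
        if np.max(np.abs(s)) < tol:              # all step lengths have become negligible
            break
        if np.all(np.abs(lengths) > 0.0):        # rotate only after progress along every direction
            # A_i = sum_{k >= i} D_k d_k, eqs. (1.126)-(1.128)
            A = np.column_stack([D[:, i:] @ lengths[i:] for i in range(n)])
            # Gram-Schmidt orthonormalization, eqs. (1.130)-(1.132)
            Dn = np.zeros((n, n))
            for i in range(n):
                B = A[:, i].copy()
                for j in range(i):
                    B -= (A[:, i] @ Dn[:, j]) * Dn[:, j]
                Dn[:, i] = B / np.linalg.norm(B)
            D = Dn
    return x

# Example: f(x1, x2) = x1**2 + 2*x2**2 + x1 + 7 (Example 1.1), minimum at (-0.5, 0)
f = lambda x: x[0] ** 2 + 2 * x[1] ** 2 + x[0] + 7
print(rosenbrock_search(f, [3.0, -2.0]))         # approaches [-0.5, 0]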
Bibliography

Pierre, D. A. (1986). Optimization Theory with Applications. Courier Dover Publications.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.

Rosenbrock, H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175-184.

Sinha, P. (2007). Bracketing an optima in univariate optimization. From http://www.optimization-online.org/.

Snyman, J. A. (2005). Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Springer.

Weisstein, E. (2004). Directional derivative. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/DirectionalDerivative.html.