Hessian Matrix
$$H(x) = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \end{bmatrix} \tag{1}$$
The diagonal elements tell you the second derivative of $f$ as you move along each axis. The off-diagonal elements tell you how the partial derivative of $f$ in the $x_1$ direction changes as you move in the $x_2$ direction. It may help to imagine the partial derivative of $f$ in the $x_1$ direction as a new scalar field $g(x)$. Then, you take the partial derivative of $g(x)$ in the $x_2$ direction.
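For concreteness, here is a minimal NumPy sketch (an illustration; the helper name hessian_fd and the test function $f(x) = x_1^2 x_2$ are our arbitrary choices) that approximates each Hessian entry by a central finite difference, i.e. exactly the "derivative of the derivative" picture above:

```python
import numpy as np

def hessian_fd(f, x, h=1e-4):
    """Approximate the Hessian of f at x by central finite differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            # Mixed partial d^2 f / dx_i dx_j as a difference of differences.
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

# f(x) = x1^2 * x2 has exact Hessian [[2*x2, 2*x1], [2*x1, 0]].
f = lambda x: x[0]**2 * x[1]
print(hessian_fd(f, np.array([1.0, 2.0])))  # approx [[4, 2], [2, 0]]
```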
Use of Hessian in second-order Taylor expansion. The Hessian appears in the second-order Taylor expansion of a scalar field:
$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0) + O(\|x - x_0\|^3). \tag{3}$$
Taking the first three terms, we see an expression of the form $\frac{1}{2} x^T H(x_0) x + b^T x + c$ for a vector $b$ and scalar $c$. This is known as a quadratic form because each element of $x$ is raised to a maximum power of two. We see that it is the unique quadratic form that passes through $x_0$ and has the same first and second partial derivatives as $f$ at $x_0$.
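As a quick numerical check of (3), the sketch below (illustrative; the function $e^{x_1}\sin x_2$ and the point $x_0$ are arbitrary choices) compares $f$ against its quadratic model: the error should shrink roughly like the cube of the step size.

```python
import numpy as np

f = lambda x: np.exp(x[0]) * np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]) * np.sin(x[1]),
                           np.exp(x[0]) * np.cos(x[1])])
hess = lambda x: np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                           [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x0 = np.array([0.3, 0.7])
def taylor2(x):
    d = x - x0
    return f(x0) + grad(x0) @ d + 0.5 * d @ hess(x0) @ d

for eps in [1e-1, 1e-2, 1e-3]:
    x = x0 + eps * np.array([1.0, -1.0])
    print(eps, abs(f(x) - taylor2(x)))  # error is O(eps^3)
```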
Use of Hessian as a second derivative test. Note that the second-order Taylor approximation at a critical point $x_0$ (one where $\nabla f(x_0) = 0$) is $f(x) \approx \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0) + f(x_0)$. So, to a second-order approximation, the Hessian captures all the relevant information about the shape of $f$ in a local neighborhood. Let us now look at the possible shapes captured by $H$.
It turns out that multiple differentiation is invariant to changes in order. That is, $\frac{\partial^2 f}{\partial x_2 \partial x_1} = \frac{\partial^2 f}{\partial x_1 \partial x_2}$. So, Hessian matrices are symmetric. Because symmetric matrices have a full set of real eigenvalues and an orthogonal basis of eigenvectors, we can perform the eigendecomposition $H = Q \Lambda Q^T$, where $Q$ is orthogonal and $\Lambda$ is diagonal. This is useful to understand how $H$ affects the shape. If we perform a change of variables $y(x)$ by performing the following translation and rotation: $y(x) = Q^T (x - x_0)$, then we see that the quadratic form is simply $f(y) - f(x_0) \approx \frac{1}{2} y^T \Lambda y$. Since $\Lambda$ is diagonal, we have a sum of axis-aligned parabolas: $f(y) - f(x_0) \approx \frac{1}{2}(\lambda_1 y_1^2 + \cdots + \lambda_n y_n^2)$. We therefore need to look at the signs of the eigenvalues $\lambda_1, \ldots, \lambda_n$ to determine the shape of the function.
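The sign test is easy to state in code. Here is a small sketch (the helper name classify_critical_point is ours) using numpy.linalg.eigvalsh, which assumes its argument is symmetric:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalue signs of its Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric matrix: all eigenvalues real
    if np.all(eigvals > tol):
        return "local minimum (H positive definite)"
    if np.all(eigvals < -tol):
        return "local maximum (H negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (eigenvalues of mixed sign)"
    return "degenerate (an eigenvalue is near zero; the test is inconclusive)"

# f(x) = x1^2 - x2^2 has a saddle at the origin, with H = diag(2, -2).
print(classify_critical_point(np.diag([2.0, -2.0])))
```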
The multivariate Newton's method for minimization is very similar to Newton's method for root finding. Recall that when Newton's method is used for minimization of the function $f$, you try to find the root of the function $g = f'$. Expanding the Newton iteration $x_{t+1} = x_t - g(x_t)/g'(x_t)$ in terms of $f$, we have
$$x_{t+1} = x_t - f'(x_t)/f''(x_t). \tag{4}$$
For functions of many variables the iteration is very similar, except that the
gradient and Hessian take the place of f 0 and f 00 , respectively. Also, we must
take heed of the important differences between matrix algebra and scalar
algebra.
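A minimal sketch of the multivariate iteration, assuming the gradient and Hessian are available in closed form (the Rosenbrock test function is our choice, not from the text). Note that it solves the linear system $H(x_t) d = \nabla f(x_t)$ rather than forming the inverse, one of the differences from scalar algebra alluded to above:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Multivariate Newton iteration: x <- x - H(x)^{-1} grad(x)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # avoid explicit inversion
    return x

# Rosenbrock function f(x) = 100 (x2 - x1^2)^2 + (1 - x1)^2, minimum at (1, 1).
grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[1200 * x[0]**2 - 400 * x[1] + 2, -400 * x[0]],
                           [-400 * x[0], 200.0]])
print(newton_minimize(grad, hess, np.array([-1.2, 1.0])))  # approx [1, 1]
```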
2.1 Derivation
Replacing $f$ by its second-order Taylor expansion (3) about the current iterate $x_t$ gives
$$f(x) \approx f(x_t) + \nabla f(x_t)^T (x - x_t) + \frac{1}{2} (x - x_t)^T H(x_t)(x - x_t). \tag{5}$$
We would like to find a point at which the quadratic form on the right-hand side is minimized. Supposing that $H(x_t)$ is positive definite, the unique minimum is obtained at $x = x_t - H(x_t)^{-1} \nabla f(x_t)$. To see this, note that the gradient of a quadratic form $q(x) = \frac{1}{2} x^T H x + b^T x + c$ is
$$\nabla q(x) = Hx + b. \tag{6}$$
This follows from the product rule for the gradient of an inner product of two vector fields,
$$f(x) = g(x)^T h(x), \tag{7}$$
whose gradient is
$$\nabla f(x) = \begin{bmatrix} \frac{\partial h(x)^T}{\partial x_1}\, g(x) \\ \vdots \\ \frac{\partial h(x)^T}{\partial x_n}\, g(x) \end{bmatrix} + \begin{bmatrix} \frac{\partial g(x)^T}{\partial x_1}\, h(x) \\ \vdots \\ \frac{\partial g(x)^T}{\partial x_n}\, h(x) \end{bmatrix}. \tag{8}$$
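Identity (6) is easy to verify numerically; the sketch below (illustrative) compares a central-finite-difference gradient of a random quadratic form against $Hx + b$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A + A.T                       # a symmetric H
b = rng.standard_normal(n)
q = lambda x: 0.5 * x @ H @ x + b @ x + 1.5

x = rng.standard_normal(n)
h = 1e-6
fd_grad = np.array([(q(x + h * e) - q(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(fd_grad, H @ x + b, atol=1e-6))  # True
```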
2.2 Practical considerations
Quasi-Newton Methods
3.1 Derivation
A finite-difference approximation of the second derivative gives
$$f''(x_t) \approx \frac{f'(x_t) - f'(x_{t-1})}{x_t - x_{t-1}}, \tag{9}$$
or in other words,
$$f''(x_t)(x_t - x_{t-1}) \approx f'(x_t) - f'(x_{t-1}). \tag{10}$$
The multivariate analogue, with the Hessian in place of $f''$ and gradients in place of $f'$, is
$$\nabla^2 f(x_t)(x_t - x_{t-1}) \approx \nabla f(x_t) - \nabla f(x_{t-1}). \tag{11}$$
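The approximation (11) can be checked numerically. In the sketch below (an arbitrary test function, chosen for illustration), the two sides agree up to terms of second order in the step:

```python
import numpy as np

# f(x) = exp(x1) + x1 * x2^2, with closed-form gradient and Hessian.
grad = lambda x: np.array([np.exp(x[0]) + x[1]**2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                           [2 * x[1], 2 * x[0]]])

x_prev = np.array([0.5, 1.0])
x_t = x_prev + np.array([1e-3, -2e-3])          # a small step
lhs = hess(x_t) @ (x_t - x_prev)                # Hessian times step
rhs = grad(x_t) - grad(x_prev)                  # gradient change
print(lhs, rhs)  # agree up to O(||step||^2)
```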
The idea is to find a matrix $H_t \approx \nabla^2 f(x_t)$ that makes (11) an equality. But there are $n^2$ entries in $H_t$ and only $n$ constraints, so this is an underdetermined system. The question now is which matrix to choose.
Quasi-Newton methods begin with some estimate $H_0$ of the Hessian and incrementally improve it over time: $H_1, H_2, \ldots$. Each $H_t$ is derived from $H_{t-1}$ such that equality in (11) is fulfilled, but also such that $H_t$ differs from $H_{t-1}$ as little as possible (in some way, to be defined in a moment). We will also maintain that $H_0$ and all subsequent $H_t$ are positive definite.
DFP Method. The Davidon-Fletcher-Powell (DFP) update uses the following formula:
$$H_t = \left(I_n - \frac{q_t x_t^T}{q_t^T x_t}\right) H_{t-1} \left(I_n - \frac{x_t q_t^T}{q_t^T x_t}\right) + \frac{q_t q_t^T}{q_t^T x_t}, \tag{12}$$
where $x_t$ and $q_t$ abbreviate the step $x_t - x_{t-1}$ and the gradient change $\nabla f(x_t) - \nabla f(x_{t-1})$ appearing in (11).
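Here is a sketch of a single DFP step (the names dfp_update, x_step, q_diff are ours); the final check confirms that the updated matrix satisfies the secant condition $H_t x_t = q_t$ exactly:

```python
import numpy as np

def dfp_update(H_prev, x_step, q_diff):
    """One DFP update (12); x_step and q_diff are the step and gradient change."""
    n = H_prev.shape[0]
    rho = q_diff @ x_step   # q^T x; it must be positive to preserve definiteness
    V = np.eye(n) - np.outer(q_diff, x_step) / rho
    return V @ H_prev @ V.T + np.outer(q_diff, q_diff) / rho

rng = np.random.default_rng(1)
n = 3
x_step = rng.standard_normal(n)
q_diff = x_step + 0.1 * rng.standard_normal(n)   # keeps q^T x > 0 here
H1 = dfp_update(np.eye(n), x_step, q_diff)
print(np.allclose(H1 @ x_step, q_diff))  # True: secant condition holds
```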
Sherman-Morrison-Woodbury Formula
The SMW (or Woodbury) formula gives an analytical expression for the inverse of $B = A + uv^T$, where $A$ is a square matrix and $u$ and $v$ are vectors:
$$B^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}. \tag{13}$$
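Formula (13) is easy to verify against a direct inverse (illustrative sketch; the shift by $nI$ merely keeps $A$ well conditioned):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned square matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)

B = A + np.outer(u, v)                           # rank-one update of A
A_inv = np.linalg.inv(A)
B_inv_smw = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1 + v @ A_inv @ u)
print(np.allclose(B_inv_smw, np.linalg.inv(B)))  # True
```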
The BFGS method instead maintains an approximation $B_t$ of the inverse Hessian $H_t^{-1}$ directly, updating it as
$$B_t = \left(I_n - \frac{x_t q_t^T}{q_t^T x_t}\right) B_{t-1} \left(I_n - \frac{q_t x_t^T}{q_t^T x_t}\right) + \frac{x_t x_t^T}{q_t^T x_t}. \tag{16}$$
Note that this formula simply switches $H$ with $B$ and swaps $q_t$ and $x_t$ in (12).
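Putting the pieces together, here is a minimal quasi-Newton loop built on the inverse-Hessian update (16). The fixed unit step and the convex quadratic test problem are simplifying assumptions for illustration; practical implementations add a line search:

```python
import numpy as np

def bfgs_inverse_update(B_prev, x_step, q_diff):
    """One BFGS update (16) of the inverse-Hessian approximation B."""
    n = B_prev.shape[0]
    rho = q_diff @ x_step
    V = np.eye(n) - np.outer(x_step, q_diff) / rho
    return V @ B_prev @ V.T + np.outer(x_step, x_step) / rho

def bfgs_minimize(grad, x0, max_iter=100, tol=1e-8):
    """Quasi-Newton iteration x <- x - B grad(x), updating B after each step."""
    x = x0.astype(float)
    B = np.eye(x.size)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - B @ g
        g_new = grad(x_new)
        s, q = x_new - x, g_new - g
        if q @ s > 1e-12:           # update only if the curvature condition holds
            B = bfgs_inverse_update(B, s, q)
        x, g = x_new, g_new
    return x

# Convex quadratic f(x) = 0.5 x^T M x - m^T x; the minimizer solves M x = m.
M = np.array([[3.0, 1.0], [1.0, 2.0]])
m = np.array([1.0, -1.0])
grad = lambda x: M @ x - m
print(bfgs_minimize(grad, np.zeros(2)))  # approx np.linalg.solve(M, m) = [0.6, -0.8]
```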
Decades of experimentation have demonstrated that the BFGS method is somewhat more effective than the DFP method, and it is often preferred in practice.
4.1 Practical concerns
Exercises
1.