
Appendix D

Matrix calculus

From too much study, and from extreme passion, cometh madnesse.
−Isaac Newton [86, §5]

D.1 Directional derivative, Taylor series


D.1.1 Gradients
The gradient of a differentiable real function \(f(x) : \mathbb{R}^K \to \mathbb{R}\) with respect to its vector domain is defined

\[
\nabla f(x) =
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_K}
\end{bmatrix} \in \mathbb{R}^K
\tag{1354}
\]

while the second-order gradient of the twice differentiable real function with respect to its vector domain is traditionally called the Hessian;

\[
\nabla^2 f(x) =
\begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_K} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_K \partial x_1} & \frac{\partial^2 f(x)}{\partial x_K \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_K^2}
\end{bmatrix} \in \mathbb{S}^K
\tag{1355}
\]
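As a quick numerical sanity check of definitions (1354) and (1355), the sketch below compares a central-difference gradient against the closed form for a quadratic taken from Table D.2.1. It assumes NumPy is available; the test function, step size, and tolerance are illustrative choices, not part of the text.

```python
import numpy as np

def gradient_fd(f, x, h=1e-6):
    """Central-difference approximation of the gradient (1354)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
f = lambda x: x @ A @ x + b @ x        # f(x) = x'Ax + b'x

x = rng.standard_normal(4)
analytic = (A + A.T) @ x + b           # gradient, per Table D.2.1
assert np.allclose(gradient_fd(f, x), analytic, atol=1e-4)
```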

© 2001 Jon Dattorro. CO&EDG version 04.18.2006. All rights reserved.
Citation: Jon Dattorro, Convex Optimization & Euclidean Distance Geometry, Meboo Publishing USA, 2005.

The gradient of vector-valued function \(v(x) : \mathbb{R} \to \mathbb{R}^N\) on real domain is a row vector

\[
\nabla v(x) = \begin{bmatrix} \frac{\partial v_1(x)}{\partial x} & \frac{\partial v_2(x)}{\partial x} & \cdots & \frac{\partial v_N(x)}{\partial x} \end{bmatrix} \in \mathbb{R}^N
\tag{1356}
\]

while the second-order gradient is

\[
\nabla^2 v(x) = \begin{bmatrix} \frac{\partial^2 v_1(x)}{\partial x^2} & \frac{\partial^2 v_2(x)}{\partial x^2} & \cdots & \frac{\partial^2 v_N(x)}{\partial x^2} \end{bmatrix} \in \mathbb{R}^N
\tag{1357}
\]

The gradient of vector-valued function \(h(x) : \mathbb{R}^K \to \mathbb{R}^N\) on vector domain is

\[
\nabla h(x) =
\begin{bmatrix}
\frac{\partial h_1(x)}{\partial x_1} & \frac{\partial h_2(x)}{\partial x_1} & \cdots & \frac{\partial h_N(x)}{\partial x_1} \\
\frac{\partial h_1(x)}{\partial x_2} & \frac{\partial h_2(x)}{\partial x_2} & \cdots & \frac{\partial h_N(x)}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial h_1(x)}{\partial x_K} & \frac{\partial h_2(x)}{\partial x_K} & \cdots & \frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \bigl[\, \nabla h_1(x) \;\; \nabla h_2(x) \;\; \cdots \;\; \nabla h_N(x) \,\bigr] \in \mathbb{R}^{K\times N}
\tag{1358}
\]

while the second-order gradient has a three-dimensional representation dubbed cubix;\(^{\text{D.1}}\)

\[
\nabla^2 h(x) =
\begin{bmatrix}
\nabla \frac{\partial h_1(x)}{\partial x_1} & \nabla \frac{\partial h_2(x)}{\partial x_1} & \cdots & \nabla \frac{\partial h_N(x)}{\partial x_1} \\
\vdots & \vdots & & \vdots \\
\nabla \frac{\partial h_1(x)}{\partial x_K} & \nabla \frac{\partial h_2(x)}{\partial x_K} & \cdots & \nabla \frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \bigl[\, \nabla^2 h_1(x) \;\; \nabla^2 h_2(x) \;\; \cdots \;\; \nabla^2 h_N(x) \,\bigr] \in \mathbb{R}^{K\times N\times K}
\tag{1359}
\]

where the gradient of each real entry is with respect to vector x as in (1354).

\(^{\text{D.1}}\) The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater, meaning mother.

The gradient of real function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\) on matrix domain is

\[
\nabla g(X) =
\begin{bmatrix}
\frac{\partial g(X)}{\partial X_{11}} & \frac{\partial g(X)}{\partial X_{12}} & \cdots & \frac{\partial g(X)}{\partial X_{1L}} \\
\frac{\partial g(X)}{\partial X_{21}} & \frac{\partial g(X)}{\partial X_{22}} & \cdots & \frac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial g(X)}{\partial X_{K1}} & \frac{\partial g(X)}{\partial X_{K2}} & \cdots & \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
= \bigl[\, \nabla_{X(:,1)}\, g(X) \;\; \nabla_{X(:,2)}\, g(X) \;\; \cdots \;\; \nabla_{X(:,L)}\, g(X) \,\bigr] \in \mathbb{R}^{K\times 1\times L}
\tag{1360}
\]

where the gradient \(\nabla_{X(:,i)}\) is with respect to the i-th column of X. The strange appearance of (1360) in \(\mathbb{R}^{K\times 1\times L}\) is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix). The second-order gradient has representation

\[
\nabla^2 g(X) =
\begin{bmatrix}
\nabla \frac{\partial g(X)}{\partial X_{11}} & \nabla \frac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla \frac{\partial g(X)}{\partial X_{1L}} \\
\vdots & \vdots & & \vdots \\
\nabla \frac{\partial g(X)}{\partial X_{K1}} & \nabla \frac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
= \bigl[\, \nabla \nabla_{X(:,1)}\, g(X) \;\; \cdots \;\; \nabla \nabla_{X(:,L)}\, g(X) \,\bigr] \in \mathbb{R}^{K\times 1\times L\times K\times L}
\tag{1361}
\]

where the gradient \(\nabla\) is with respect to matrix X.



The gradient of vector-valued function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^N\) on matrix domain is a cubix

\[
\nabla g(X) =
\begin{bmatrix}
\nabla_{X(:,1)}\, g_1(X) & \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla_{X(:,1)}\, g_N(X) \\
\nabla_{X(:,2)}\, g_1(X) & \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla_{X(:,2)}\, g_N(X) \\
\vdots & \vdots & & \vdots \\
\nabla_{X(:,L)}\, g_1(X) & \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \bigl[\, \nabla g_1(X) \;\; \nabla g_2(X) \;\; \cdots \;\; \nabla g_N(X) \,\bigr] \in \mathbb{R}^{K\times N\times L}
\tag{1362}
\]

while the second-order gradient has a five-dimensional representation;

\[
\nabla^2 g(X) =
\begin{bmatrix}
\nabla \nabla_{X(:,1)}\, g_1(X) & \cdots & \nabla \nabla_{X(:,1)}\, g_N(X) \\
\vdots & & \vdots \\
\nabla \nabla_{X(:,L)}\, g_1(X) & \cdots & \nabla \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \bigl[\, \nabla^2 g_1(X) \;\; \nabla^2 g_2(X) \;\; \cdots \;\; \nabla^2 g_N(X) \,\bigr] \in \mathbb{R}^{K\times N\times L\times K\times L}
\tag{1363}
\]


The gradient of matrix-valued function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}\) on matrix domain has a four-dimensional representation called quartix

\[
\nabla g(X) =
\begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X) \\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L}
\tag{1364}
\]

while the second-order gradient has a six-dimensional representation

\[
\nabla^2 g(X) =
\begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L}
\tag{1365}
\]

and so on.

D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X),

\[
\nabla_X \bigl( f(X)^T g(X) \bigr) = \nabla_X (f)\, g + \nabla_X (g)\, f
\tag{1366}
\]

while [35, §8.3] [205]

\[
\nabla_X \operatorname{tr}\bigl( f(X)^T g(X) \bigr) = \nabla_X \Bigl( \operatorname{tr}\bigl( f(X)^T g(Z) \bigr) + \operatorname{tr}\bigl( g(X)^T f(Z) \bigr) \Bigr)\Big|_{Z \leftarrow X}
\tag{1367}
\]

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.

D.1.2.0.1 Example. Cubix.
Suppose \(f(X) : \mathbb{R}^{2\times2} \to \mathbb{R}^2 = X^T a\) and \(g(X) : \mathbb{R}^{2\times2} \to \mathbb{R}^2 = X b\). We wish to find

\[
\nabla_X \bigl( f(X)^T g(X) \bigr) = \nabla_X \bigl( a^T X^2 b \bigr)
\tag{1368}
\]

using the product rule. Formula (1366) calls for

\[
\nabla_X \bigl( a^T X^2 b \bigr) = \nabla_X (X^T a)\, X b + \nabla_X (X b)\, X^T a
\tag{1369}
\]

Consider the first of the two terms:

\[
\nabla_X (f)\, g = \nabla_X (X^T a)\, X b = \bigl[\, \nabla (X^T a)_1 \;\; \nabla (X^T a)_2 \,\bigr]\, X b
\tag{1370}
\]

The gradient of \(X^T a\) forms a cubix in \(\mathbb{R}^{2\times2\times2}\) whose entries are the partial derivatives \(\partial (X^T a)_i / \partial X_{kl}\) (1371). Because gradient of the product (1368) requires total change with respect to change in each entry of matrix X, the vector Xb must make an inner product with each vector in the second dimension of the cubix;

\[
\nabla_X (X^T a)\, X b =
\begin{bmatrix}
a_1 (b_1 X_{11} + b_2 X_{12}) & a_1 (b_1 X_{21} + b_2 X_{22}) \\
a_2 (b_1 X_{11} + b_2 X_{12}) & a_2 (b_1 X_{21} + b_2 X_{22})
\end{bmatrix}
= a b^T X^T \in \mathbb{R}^{2\times2}
\tag{1372}
\]

where the cubix appears as a complete 2×2×2 matrix. In like manner for the second term \(\nabla_X (g)\, f\),

\[
\nabla_X (X b)\, X^T a = X^T a b^T \in \mathbb{R}^{2\times2}
\tag{1373}
\]

The solution

\[
\nabla_X \bigl( a^T X^2 b \bigr) = a b^T X^T + X^T a b^T
\tag{1374}
\]

can be found from Table D.2.1 or verified using (1367). □
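The identity (1374) can also be spot-checked numerically via the directional-derivative identity \(\operatorname{tr}(\nabla g(X)^T Y) = \frac{d}{dt} g(X+tY)\big|_{t=0}\) from (1393). A minimal sketch assuming NumPy; the data, step size, and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(2), rng.standard_normal(2)
X, Y = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

g = lambda X: a @ X @ X @ b                         # g(X) = a'X²b
grad = np.outer(a, b) @ X.T + X.T @ np.outer(a, b)  # ab'X' + X'ab', (1374)

t = 1e-6
fd = (g(X + t * Y) - g(X - t * Y)) / (2 * t)        # d/dt g(X+tY) at t = 0
assert np.isclose(np.trace(grad.T @ Y), fd, atol=1e-6)
```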

D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional representations, such as the cubix or quartix, is to first vectorize matrices as in (29). This device gives rise to the Kronecker product of matrices; a.k.a. direct product or tensor product. Although it sees reversal in the literature, [211, §2.1] we adopt the definition: for \(A \in \mathbb{R}^{m\times n}\) and \(B \in \mathbb{R}^{p\times q}\)

\[
B \otimes A =
\begin{bmatrix}
B_{11} A & B_{12} A & \cdots & B_{1q} A \\
B_{21} A & B_{22} A & \cdots & B_{2q} A \\
\vdots & \vdots & & \vdots \\
B_{p1} A & B_{p2} A & \cdots & B_{pq} A
\end{bmatrix} \in \mathbb{R}^{pm\times qn}
\tag{1375}
\]

One advantage to vectorization is existence of a traditional two-dimensional matrix representation for the second-order gradient of a real function with respect to a vectorized matrix. For example, from §A.1.1 no.22 (§D.2.1) for square \(A, B \in \mathbb{R}^{n\times n}\) [96, §5.2] [10, §3],

\[
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla^2_{\operatorname{vec} X}\, \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = B \otimes A^T + B^T \otimes A \in \mathbb{R}^{n^2 \times n^2}
\tag{1376}
\]

A disadvantage is a large new but known set of algebraic rules, and the fact that mere vectorization does not generally guarantee a two-dimensional matrix representation of gradients.
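The vectorization device is easy to exercise numerically. The sketch below, assuming NumPy and a column-stacking vec as in (29), checks \(\operatorname{tr}(AXBX^T) = \operatorname{vec}(X)^T (B^T \otimes A)\operatorname{vec} X\) and that the matrix gradient from Table D.2.3 vectorizes to \((B \otimes A^T + B^T \otimes A)\operatorname{vec} X\), consistent with (1376).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B, X = (rng.standard_normal((n, n)) for _ in range(3))
vec = lambda M: M.flatten(order='F')          # stack columns, as in (29)

lhs = np.trace(A @ X @ B @ X.T)
rhs = vec(X) @ np.kron(B.T, A) @ vec(X)
assert np.isclose(lhs, rhs)

grad_matrix = A.T @ X @ B.T + A @ X @ B       # ∇_X tr(AXBX'), Table D.2.3
grad_vec = (np.kron(B, A.T) + np.kron(B.T, A)) @ vec(X)
assert np.allclose(vec(grad_matrix), grad_vec)
```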

D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X) [137, §15.7]

\[
\nabla_X g\bigl( f(X)^T \bigr) = \nabla_X f^T\, \nabla_f g
\tag{1377}
\]

\[
\nabla_X^2 g\bigl( f(X)^T \bigr) = \nabla_X \bigl( \nabla_X f^T\, \nabla_f g \bigr) = \nabla_X^2 f\, \nabla_f g + \nabla_X f^T\, \nabla_f^2 g\, \nabla_X f
\tag{1378}
\]

D.1.3.1 Two arguments

\[
\nabla_X g\bigl( f(X)^T, h(X)^T \bigr) = \nabla_X f^T\, \nabla_f g + \nabla_X h^T\, \nabla_h g
\tag{1379}
\]

D.1.3.1.1 Example. Chain rule for two arguments. [28, §1.1]

\[
g\bigl( f(x)^T, h(x)^T \bigr) = \bigl( f(x) + h(x) \bigr)^T A \bigl( f(x) + h(x) \bigr)
\tag{1380}
\]

\[
f(x) = \begin{bmatrix} x_1 \\ \varepsilon\, x_2 \end{bmatrix}, \qquad h(x) = \begin{bmatrix} \varepsilon\, x_1 \\ x_2 \end{bmatrix}
\tag{1381}
\]

\[
\nabla_x\, g\bigl( f(x)^T, h(x)^T \bigr) = \begin{bmatrix} 1 & 0 \\ 0 & \varepsilon \end{bmatrix} (A + A^T)(f + h) + \begin{bmatrix} \varepsilon & 0 \\ 0 & 1 \end{bmatrix} (A + A^T)(f + h)
\tag{1382}
\]

\[
\nabla_x\, g\bigl( f(x)^T, h(x)^T \bigr) = \begin{bmatrix} 1 + \varepsilon & 0 \\ 0 & 1 + \varepsilon \end{bmatrix} (A + A^T) \left( \begin{bmatrix} x_1 \\ \varepsilon\, x_2 \end{bmatrix} + \begin{bmatrix} \varepsilon\, x_1 \\ x_2 \end{bmatrix} \right)
\tag{1383}
\]

\[
\lim_{\varepsilon \to 0}\, \nabla_x\, g\bigl( f(x)^T, h(x)^T \bigr) = (A + A^T)\, x
\tag{1384}
\]

from Table D.2.1. □
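A brief numerical illustration of the limit (1384), assuming NumPy: the gradient (1383) tends to \((A+A^T)x\) as ε shrinks. Seed and test values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 2))
x = rng.standard_normal(2)

def grad_g(x, eps):
    fph = (1 + eps) * x                  # f(x) + h(x)
    J = np.diag([1 + eps, 1 + eps])      # sum of the two diagonal factors in (1382)
    return J @ (A + A.T) @ fph           # (1383)

errs = [np.linalg.norm(grad_g(x, e) - (A + A.T) @ x) for e in (1e-1, 1e-3, 1e-6)]
assert errs[0] > errs[1] > errs[2]       # error vanishes with ε, per (1384)
```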

These formulae remain correct when the gradients produce hyperdimensional representations:

D.1.4 First directional derivative

Assume that a differentiable function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}\) has continuous first- and second-order gradients \(\nabla g\) and \(\nabla^2 g\) over dom g, which is an open set. We seek simple expressions for the first and second directional derivatives in direction \(Y \in \mathbb{R}^{K\times L}\): respectively, \(\overset{\rightarrow Y}{dg} \in \mathbb{R}^{M\times N}\) and \(\overset{\rightarrow Y}{dg^2} \in \mathbb{R}^{M\times N}\).

Assuming that the limit exists, we may state the partial derivative of the mn-th entry of g with respect to the kl-th entry of X;

\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}} = \lim_{t\to0} \frac{g_{mn}(X + t\, e_k e_l^T) - g_{mn}(X)}{t} \in \mathbb{R}
\tag{1385}
\]

where \(e_k\) is the k-th standard basis vector in \(\mathbb{R}^K\) while \(e_l\) is the l-th standard basis vector in \(\mathbb{R}^L\). The total number of partial derivatives equals KLMN, while the gradient is defined in their terms; the mn-th entry of the gradient is

\[
\nabla g_{mn}(X) =
\begin{bmatrix}
\frac{\partial g_{mn}(X)}{\partial X_{11}} & \frac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \frac{\partial g_{mn}(X)}{\partial X_{1L}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial g_{mn}(X)}{\partial X_{K1}} & \frac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \frac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
\tag{1386}
\]

while the gradient is a quartix

\[
\nabla g(X) =
\begin{bmatrix}
\nabla g_{11}(X) & \cdots & \nabla g_{1N}(X) \\
\vdots & & \vdots \\
\nabla g_{M1}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L}
\tag{1387}
\]

By simply rotating our perspective of the four-dimensional representation of the gradient matrix, we find one of three useful transpositions of this quartix (connoted \(T_1\)):

\[
\nabla g(X)^{T_1} =
\begin{bmatrix}
\frac{\partial g(X)}{\partial X_{11}} & \frac{\partial g(X)}{\partial X_{12}} & \cdots & \frac{\partial g(X)}{\partial X_{1L}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial g(X)}{\partial X_{K1}} & \frac{\partial g(X)}{\partial X_{K2}} & \cdots & \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N}
\tag{1388}
\]

When the limit for \(t \in \mathbb{R}\) exists, it is easy to show by substitution of variables in (1385)

\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \lim_{t\to0} \frac{g_{mn}(X + t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{t} \in \mathbb{R}
\tag{1389}
\]

which may be interpreted as the change in \(g_{mn}\) at X when the change in \(X_{kl}\) is equal to \(Y_{kl}\), the kl-th entry of any \(Y \in \mathbb{R}^{K\times L}\). Because the total change in \(g_{mn}(X)\) due to Y is the sum of change with respect to each and every \(X_{kl}\), the mn-th entry of the directional derivative is the corresponding total differential [137, §15.8]

\[
dg_{mn}(X)\big|_{dX\to Y} = \sum_{k,l} \frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \operatorname{tr}\bigl( \nabla g_{mn}(X)^T\, Y \bigr)
\tag{1390}
\]
\[
= \sum_{k,l}\, \lim_{t\to0} \frac{g_{mn}(X + t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{t}
\tag{1391}
\]
\[
= \lim_{t\to0} \frac{g_{mn}(X + t\, Y) - g_{mn}(X)}{t}
\tag{1392}
\]
\[
= \frac{d}{dt}\bigg|_{t=0} g_{mn}(X + t\, Y)
\tag{1393}
\]

where \(t \in \mathbb{R}\). Assuming finite Y, equation (1392) is called the Gateaux differential [27, App.A.5] [125, §D.2.1] [234, §5.28] whose existence is implied by the existence of the Frechet differential, the sum in (1390). [157, §7.2] Each may be understood as the change in \(g_{mn}\) at X when the change in X is equal in magnitude and direction to Y.\(^{\text{D.2}}\) Hence the directional derivative,



\[
\overset{\rightarrow Y}{dg}(X) =
\begin{bmatrix}
dg_{11}(X) & dg_{12}(X) & \cdots & dg_{1N}(X) \\
dg_{21}(X) & dg_{22}(X) & \cdots & dg_{2N}(X) \\
\vdots & \vdots & & \vdots \\
dg_{M1}(X) & dg_{M2}(X) & \cdots & dg_{MN}(X)
\end{bmatrix}\Bigg|_{dX\to Y} \in \mathbb{R}^{M\times N}
\]
\[
= \begin{bmatrix}
\operatorname{tr}\bigl( \nabla g_{11}(X)^T Y \bigr) & \cdots & \operatorname{tr}\bigl( \nabla g_{1N}(X)^T Y \bigr) \\
\vdots & & \vdots \\
\operatorname{tr}\bigl( \nabla g_{M1}(X)^T Y \bigr) & \cdots & \operatorname{tr}\bigl( \nabla g_{MN}(X)^T Y \bigr)
\end{bmatrix}
\]
\[
= \begin{bmatrix}
\sum_{k,l} \frac{\partial g_{11}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{1N}(X)}{\partial X_{kl}} Y_{kl} \\
\vdots & & \vdots \\
\sum_{k,l} \frac{\partial g_{M1}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{MN}(X)}{\partial X_{kl}} Y_{kl}
\end{bmatrix}
\tag{1394}
\]

from which it follows

\[
\overset{\rightarrow Y}{dg}(X) = \sum_{k,l} \frac{\partial g(X)}{\partial X_{kl}}\, Y_{kl}
\tag{1395}
\]

Yet for all \(X \in \operatorname{dom} g\), any \(Y \in \mathbb{R}^{K\times L}\), and some open interval of \(t \in \mathbb{R}\),

\[
g(X + t\, Y) = g(X) + t\, \overset{\rightarrow Y}{dg}(X) + o(t^2)
\tag{1396}
\]

which is the first-order Taylor series expansion about X. [137, §18.4] [85, §2.3.4] Differentiation with respect to t and subsequent t-zeroing isolates the second term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry \(g_{mn}(X + tY)\) as in (1393). So the directional derivative of g(X) in any direction \(Y \in \mathbb{R}^{K\times L}\), evaluated at \(X \in \operatorname{dom} g\), becomes

\[
\overset{\rightarrow Y}{dg}(X) = \frac{d}{dt}\bigg|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M\times N}
\tag{1397}
\]

[177, §2.1, §5.4.5] [25, §6.3.1] which is simplest.

\(^{\text{D.2}}\) Although Y is a matrix, we may regard it as a vector in \(\mathbb{R}^{KL}\).
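Definition (1397) is directly computable. A sketch assuming NumPy: for \(g(X) = X^2\), the directional derivative in direction Y is \(XY + YX\), recovered here by differencing in t; data and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
g = lambda X: X @ X

t = 1e-6
dg = (g(X + t * Y) - g(X - t * Y)) / (2 * t)   # d/dt g(X+tY) at t = 0, per (1397)
assert np.allclose(dg, X @ Y + Y @ X, atol=1e-6)
```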
Figure 97: Drawn is a convex quadratic bowl in \(\mathbb{R}^2\times\mathbb{R}\); \(f(x) = x^T x : \mathbb{R}^2 \to \mathbb{R}\) versus x on some open disc in \(\mathbb{R}^2\). Plane slice ∂H is perpendicular to function domain. Slice intersection with domain connotes bidirectional vector y. Tangent line T slope at point \((\alpha, f(\alpha))\) is directional-derivative value \(\nabla_x f(\alpha)^T y\) (1424) at α in slice direction y. Recall, negative gradient \(-\nabla_x f(x) \in \mathbb{R}^2\) is always steepest descent direction [248]. [137, §15.6] When vector \(\nu \in \mathbb{R}^3\) entry \(\nu_3\) is half directional derivative in gradient direction at α, and when \([\nu_1\;\nu_2]^T = \nabla_x f(\alpha)\), then \(-\nu\) points directly toward bowl bottom.
The derivative with respect to t makes the directional derivative (1397) resemble ordinary calculus (D.2); e.g., when g(X) is linear, \(\overset{\rightarrow Y}{dg}(X) = g(Y)\). [157, §7.2]

D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function \(f(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\), the directional derivative of f(X) at X in any direction Y yields the slope of f along the line X + tY through its domain (parametrized by \(t \in \mathbb{R}\)) evaluated at t = 0. For higher-dimensional functions, by (1394), this slope interpretation can be applied to each entry of the directional derivative. Unlike the gradient, the directional derivative does not expand dimension; e.g., the directional derivative in (1397) retains the dimensions of g.

Figure 97, for example, shows a plane slice of a real convex bowl-shaped function f(x) along a line α + ty through its domain. The slice reveals a one-dimensional real function of t; f(α + ty). The directional derivative at x = α in direction y is the slope of f(α + ty) with respect to t at t = 0. In the case of a real function having vector argument \(h(X) : \mathbb{R}^K \to \mathbb{R}\), its directional derivative in the normalized direction of its gradient is the gradient magnitude. (1424) For a real function of real variable, the directional derivative evaluated at any point in the function domain is just the slope of that function there scaled by the real direction. (confer §3.1.1.4)

D.1.4.1.1 Theorem. Directional derivative condition for optimization. [157, §7.4]
Suppose \(f(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\) is minimized on convex set \(\mathcal{C} \subseteq \mathbb{R}^{K\times L}\) by \(X^\star\), and the directional derivative of f exists there. Then, for all \(X \in \mathcal{C}\),

\[
\overset{\rightarrow X - X^\star}{df}(X^\star) \ge 0
\tag{1398}
\]
⋄

D.1.4.1.2 Example. Simple bowl.
Bowl function (Figure 97)

\[
f(x) : \mathbb{R}^K \to \mathbb{R} = (x - a)^T (x - a) - b
\tag{1399}
\]

has function offset \(-b \in \mathbb{R}\), axis of revolution at x = a, and positive definite Hessian (1355) everywhere in its domain (an open hyperdisc in \(\mathbb{R}^K\)); id est, strictly convex quadratic f(x) has unique global minimum equal to \(-b\) at x = a. A vector \(-\nu\) based anywhere in \(\operatorname{dom} f \times \mathbb{R}\) pointing toward the unique bowl-bottom is specified:

\[
\nu \propto \begin{bmatrix} x - a \\ f(x) + b \end{bmatrix} \in \mathbb{R}^K \times \mathbb{R}
\tag{1400}
\]

Such a vector is

\[
\nu = \begin{bmatrix} \nabla_x f(x) \\[2pt] \tfrac{1}{2}\, \overset{\rightarrow \nabla_x f(x)}{df}(x) \end{bmatrix}
\tag{1401}
\]

since the gradient is

\[
\nabla_x f(x) = 2(x - a)
\tag{1402}
\]

and the directional derivative in the direction of the gradient is (1424)

\[
\overset{\rightarrow \nabla_x f(x)}{df}(x) = \nabla_x f(x)^T\, \nabla_x f(x) = 4 (x - a)^T (x - a) = 4 \bigl( f(x) + b \bigr)
\tag{1403}
\]
□
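The bowl example checks out numerically. A sketch assuming NumPy; offset, seed, and step size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.standard_normal(3)
b = 0.7
x = rng.standard_normal(3)

f = lambda x: (x - a) @ (x - a) - b
grad = 2 * (x - a)                       # (1402)

t = 1e-7
dd = (f(x + t * grad) - f(x)) / t        # directional derivative toward ∇f(x), per (1397)
assert np.isclose(dd, 4 * (f(x) + b), atol=1e-4)   # (1403)
assert np.isclose(dd, grad @ grad, atol=1e-4)
```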

D.1.5 Second directional derivative

By similar argument, it so happens: the second directional derivative is equally simple. Given \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}\) on open domain,

\[
\nabla \frac{\partial g_{mn}(X)}{\partial X_{kl}} = \frac{\partial\, \nabla g_{mn}(X)}{\partial X_{kl}} =
\begin{bmatrix}
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{11}} & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{12}} & \cdots & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{1L}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{K1}} & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{K2}} & \cdots & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
\tag{1404}
\]

\[
\nabla^2 g_{mn}(X) =
\begin{bmatrix}
\nabla \frac{\partial g_{mn}(X)}{\partial X_{11}} & \cdots & \nabla \frac{\partial g_{mn}(X)}{\partial X_{1L}} \\
\vdots & & \vdots \\
\nabla \frac{\partial g_{mn}(X)}{\partial X_{K1}} & \cdots & \nabla \frac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
= \begin{bmatrix}
\frac{\partial\, \nabla g_{mn}(X)}{\partial X_{11}} & \cdots & \frac{\partial\, \nabla g_{mn}(X)}{\partial X_{1L}} \\
\vdots & & \vdots \\
\frac{\partial\, \nabla g_{mn}(X)}{\partial X_{K1}} & \cdots & \frac{\partial\, \nabla g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix}
\tag{1405}
\]

Rotating our perspective, we get several views of the second-order gradient:

\[
\nabla^2 g(X) =
\begin{bmatrix}
\nabla^2 g_{11}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L}
\tag{1406}
\]

\[
\nabla^2 g(X)^{T_1} =
\begin{bmatrix}
\frac{\partial\, \nabla g(X)}{\partial X_{11}} & \cdots & \frac{\partial\, \nabla g(X)}{\partial X_{1L}} \\
\vdots & & \vdots \\
\frac{\partial\, \nabla g(X)}{\partial X_{K1}} & \cdots & \frac{\partial\, \nabla g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N\times K\times L}
\tag{1407}
\]

\[
\nabla^2 g(X)^{T_2} =
\begin{bmatrix}
\nabla \frac{\partial g(X)}{\partial X_{11}} & \cdots & \nabla \frac{\partial g(X)}{\partial X_{1L}} \\
\vdots & & \vdots \\
\nabla \frac{\partial g(X)}{\partial X_{K1}} & \cdots & \nabla \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L\times M\times N}
\tag{1408}
\]

Assuming the limits exist, we may state the partial derivative of the mn-th entry of g with respect to the kl-th and ij-th entries of X;

\[
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}} = \lim_{\Delta\tau, t\to0} \frac{g_{mn}(X + t\, e_k e_l^T + \Delta\tau\, e_i e_j^T) - g_{mn}(X + t\, e_k e_l^T) - \bigl( g_{mn}(X + \Delta\tau\, e_i e_j^T) - g_{mn}(X) \bigr)}{\Delta\tau\, t}
\tag{1409}
\]

Differentiating (1389) and then scaling by \(Y_{ij}\),

\[
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl}\, Y_{ij}
= \lim_{t\to0} \frac{1}{t}\left( \frac{\partial g_{mn}(X + t\, Y_{kl}\, e_k e_l^T)}{\partial X_{ij}} - \frac{\partial g_{mn}(X)}{\partial X_{ij}} \right) Y_{ij}
\]
\[
= \lim_{\Delta\tau, t\to0} \frac{g_{mn}(X + t\, Y_{kl}\, e_k e_l^T + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X + t\, Y_{kl}\, e_k e_l^T) - \bigl( g_{mn}(X + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X) \bigr)}{\Delta\tau\, t}
\tag{1410}
\]

which can be proved by substitution of variables in (1409). The mn-th second-order total differential due to any \(Y \in \mathbb{R}^{K\times L}\) is

\[
d^2 g_{mn}(X)\big|_{dX\to Y} = \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl}\, Y_{ij} = \operatorname{tr}\Bigl( \nabla_X \operatorname{tr}\bigl( \nabla g_{mn}(X)^T\, Y \bigr)^T\, Y \Bigr)
\tag{1411}
\]
\[
= \lim_{t\to0} \sum_{i,j} \frac{1}{t}\left( \frac{\partial g_{mn}(X + t\, Y)}{\partial X_{ij}} - \frac{\partial g_{mn}(X)}{\partial X_{ij}} \right) Y_{ij}
\tag{1412}
\]
\[
= \lim_{t\to0} \frac{g_{mn}(X + 2t\, Y) - 2\, g_{mn}(X + t\, Y) + g_{mn}(X)}{t^2}
\tag{1413}
\]
\[
= \frac{d^2}{dt^2}\bigg|_{t=0} g_{mn}(X + t\, Y)
\tag{1414}
\]

Hence the second directional derivative,
\[
\overset{\rightarrow Y}{dg^2}(X) =
\begin{bmatrix}
d^2 g_{11}(X) & d^2 g_{12}(X) & \cdots & d^2 g_{1N}(X) \\
d^2 g_{21}(X) & d^2 g_{22}(X) & \cdots & d^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
d^2 g_{M1}(X) & d^2 g_{M2}(X) & \cdots & d^2 g_{MN}(X)
\end{bmatrix}\Bigg|_{dX\to Y} \in \mathbb{R}^{M\times N}
\]
\[
= \begin{bmatrix}
\operatorname{tr}\Bigl( \nabla \operatorname{tr}\bigl( \nabla g_{11}(X)^T Y \bigr)^T Y \Bigr) & \cdots & \operatorname{tr}\Bigl( \nabla \operatorname{tr}\bigl( \nabla g_{1N}(X)^T Y \bigr)^T Y \Bigr) \\
\vdots & & \vdots \\
\operatorname{tr}\Bigl( \nabla \operatorname{tr}\bigl( \nabla g_{M1}(X)^T Y \bigr)^T Y \Bigr) & \cdots & \operatorname{tr}\Bigl( \nabla \operatorname{tr}\bigl( \nabla g_{MN}(X)^T Y \bigr)^T Y \Bigr)
\end{bmatrix}
\]
\[
= \begin{bmatrix}
\sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{11}(X)}{\partial X_{kl}\,\partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{1N}(X)}{\partial X_{kl}\,\partial X_{ij}} Y_{kl} Y_{ij} \\
\vdots & & \vdots \\
\sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{M1}(X)}{\partial X_{kl}\,\partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{MN}(X)}{\partial X_{kl}\,\partial X_{ij}} Y_{kl} Y_{ij}
\end{bmatrix}
\tag{1415}
\]

from which it follows

\[
\overset{\rightarrow Y}{dg^2}(X) = \sum_{i,j}\sum_{k,l} \frac{\partial^2 g(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl}\, Y_{ij} = \sum_{i,j} \frac{\partial}{\partial X_{ij}}\, \overset{\rightarrow Y}{dg}(X)\, Y_{ij}
\tag{1416}
\]

Yet for all \(X \in \operatorname{dom} g\), any \(Y \in \mathbb{R}^{K\times L}\), and some open interval of \(t \in \mathbb{R}\),

\[
g(X + t\, Y) = g(X) + t\, \overset{\rightarrow Y}{dg}(X) + \frac{t^2}{2!}\, \overset{\rightarrow Y}{dg^2}(X) + o(t^3)
\tag{1417}
\]

which is the second-order Taylor series expansion about X. [137, §18.4] [85, §2.3.4] Differentiating twice with respect to t and subsequent t-zeroing isolates the third term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry \(g_{mn}(X + tY)\) as in (1414). So the second directional derivative becomes

\[
\overset{\rightarrow Y}{dg^2}(X) = \frac{d^2}{dt^2}\bigg|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M\times N}
\tag{1418}
\]

[177, §2.1, §5.4.5] [25, §6.3.1] which is again simplest. (confer (1397))
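A numerical sketch of (1418), assuming NumPy: for the real function \(g(X) = \operatorname{tr}(X^3)\), the second directional derivative in direction Y is \(6\operatorname{tr}(XY^2)\), recovered by a second difference in t as in (1413); data and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
X, Y = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
g = lambda X: np.trace(X @ X @ X)

t = 1e-4
d2_fd = (g(X + t * Y) - 2 * g(X) + g(X - t * Y)) / t**2  # second difference, cf. (1413)
d2_exact = 6 * np.trace(X @ Y @ Y)                       # d²/dt² tr((X+tY)³) at t = 0
assert np.isclose(d2_fd, d2_exact, atol=1e-5)
```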

D.1.6 Taylor series

Series expansions of the differentiable matrix-valued function g(X), of matrix argument, were given earlier in (1396) and (1417). Assuming g(X) has continuous first-, second-, and third-order gradients over the open set dom g, then for \(X \in \operatorname{dom} g\) and any \(Y \in \mathbb{R}^{K\times L}\) the complete Taylor series is expressed on some open interval of \(\mu \in \mathbb{R}\)

\[
g(X + \mu\, Y) = g(X) + \mu\, \overset{\rightarrow Y}{dg}(X) + \frac{\mu^2}{2!}\, \overset{\rightarrow Y}{dg^2}(X) + \frac{\mu^3}{3!}\, \overset{\rightarrow Y}{dg^3}(X) + o(\mu^4)
\tag{1419}
\]

or on some open interval of \(\|Y\|\)

\[
g(Y) = g(X) + \overset{\rightarrow Y - X}{dg}(X) + \frac{1}{2!}\, \overset{\rightarrow Y - X}{dg^2}(X) + \frac{1}{3!}\, \overset{\rightarrow Y - X}{dg^3}(X) + o(\|Y\|^4)
\tag{1420}
\]

which are third-order expansions about X. The mean value theorem from calculus is what insures the finite order of the series. [28, §1.1] [27, App.A.5] [125, §0.4] [137]

In the case of a real function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\), all the directional derivatives are in \(\mathbb{R}\):

\[
\overset{\rightarrow Y}{dg}(X) = \operatorname{tr}\bigl( \nabla g(X)^T\, Y \bigr)
\tag{1421}
\]
\[
\overset{\rightarrow Y}{dg^2}(X) = \operatorname{tr}\Bigl( \nabla_X \operatorname{tr}\bigl( \nabla g(X)^T Y \bigr)^T\, Y \Bigr) = \operatorname{tr}\Bigl( \nabla_X\, \overset{\rightarrow Y}{dg}(X)^T\, Y \Bigr)
\tag{1422}
\]
\[
\overset{\rightarrow Y}{dg^3}(X) = \operatorname{tr}\biggl( \nabla_X \operatorname{tr}\Bigl( \nabla_X \operatorname{tr}\bigl( \nabla g(X)^T Y \bigr)^T Y \Bigr)^T\, Y \biggr) = \operatorname{tr}\Bigl( \nabla_X\, \overset{\rightarrow Y}{dg^2}(X)^T\, Y \Bigr)
\tag{1423}
\]

In the case \(g(X) : \mathbb{R}^K \to \mathbb{R}\) has vector argument, they further simplify:

\[
\overset{\rightarrow Y}{dg}(X) = \nabla g(X)^T\, Y
\tag{1424}
\]
\[
\overset{\rightarrow Y}{dg^2}(X) = Y^T\, \nabla^2 g(X)\, Y
\tag{1425}
\]
\[
\overset{\rightarrow Y}{dg^3}(X) = \nabla_X \bigl( Y^T\, \nabla^2 g(X)\, Y \bigr)^T\, Y
\tag{1426}
\]

and so on.
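A sketch verifying the vector-argument simplifications (1424)-(1425) through the second-order expansion (1417), for the smooth test function \(g(x) = e^{x^T x}\); it assumes NumPy, and the function, step, and tolerance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.standard_normal(3), rng.standard_normal(3)

g = lambda x: np.exp(x @ x)
grad = 2 * x * g(x)                                   # ∇g(x)
hess = (4 * np.outer(x, x) + 2 * np.eye(3)) * g(x)    # ∇²g(x)

t = 1e-3
taylor = g(x) + t * (grad @ y) + 0.5 * t**2 * (y @ hess @ y)   # (1417) via (1424)-(1425)
assert np.isclose(g(x + t * y), taylor, rtol=1e-5)
```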

D.1.6.0.1 Exercise. log det. (confer [39, p.644])
Find the first two terms of the Taylor series expansion (1420) for log det X.

D.1.7 Correspondence of gradient to derivative

From the foregoing expressions for directional derivative, we derive a relationship between the gradient with respect to matrix X and the derivative with respect to real variable t:

D.1.7.1 first-order

Removing from (1397) the evaluation at t = 0,\(^{\text{D.3}}\) we find an expression for the directional derivative of g(X) in direction Y evaluated anywhere along a line X + tY (parametrized by t) intersecting dom g

\[
\overset{\rightarrow Y}{dg}(X + t\, Y) = \frac{d}{dt}\, g(X + t\, Y)
\tag{1427}
\]

In the general case \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}\), from (1390) and (1393) we find

\[
\operatorname{tr}\bigl( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \bigr) = \frac{d}{dt}\, g_{mn}(X + t\, Y)
\tag{1428}
\]

which is valid at t = 0, of course, when \(X \in \operatorname{dom} g\). In the important case of a real function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\), from (1421) we have simply

\[
\operatorname{tr}\bigl( \nabla_X\, g(X + t\, Y)^T\, Y \bigr) = \frac{d}{dt}\, g(X + t\, Y)
\tag{1429}
\]

When, additionally, \(g(X) : \mathbb{R}^K \to \mathbb{R}\) has vector argument,

\[
\nabla_X\, g(X + t\, Y)^T\, Y = \frac{d}{dt}\, g(X + t\, Y)
\tag{1430}
\]

\(^{\text{D.3}}\) Justified by replacing X with X + tY in (1390)-(1392); beginning,

\[
dg_{mn}(X + t\, Y)\big|_{dX\to Y} = \sum_{k,l} \frac{\partial g_{mn}(X + t\, Y)}{\partial X_{kl}}\, Y_{kl}
\]

D.1.7.1.1 Example. Gradient.
\(g(X) = w^T X^T X w\), \(X \in \mathbb{R}^{K\times L}\), \(w \in \mathbb{R}^L\). Using the tables in §D.2,

\[
\operatorname{tr}\bigl( \nabla_X\, g(X + t\, Y)^T\, Y \bigr) = \operatorname{tr}\bigl( 2\, w w^T (X^T + t\, Y^T)\, Y \bigr)
\tag{1431}
\]
\[
= 2\, w^T (X^T Y + t\, Y^T Y)\, w
\tag{1432}
\]

Applying the equivalence (1429),

\[
\frac{d}{dt}\, g(X + t\, Y) = \frac{d}{dt}\, w^T (X + t\, Y)^T (X + t\, Y)\, w
\tag{1433}
\]
\[
= w^T \bigl( X^T Y + Y^T X + 2 t\, Y^T Y \bigr)\, w
\tag{1434}
\]
\[
= 2\, w^T (X^T Y + t\, Y^T Y)\, w
\tag{1435}
\]

which is the same as (1432); hence, equivalence is demonstrated.

It is easy to extract \(\nabla g(X)\) from (1435) knowing only (1429):

\[
\begin{aligned}
\operatorname{tr}\bigl( \nabla_X\, g(X + t\, Y)^T\, Y \bigr) &= 2\, w^T (X^T Y + t\, Y^T Y)\, w \\
&= 2 \operatorname{tr}\bigl( w w^T (X^T + t\, Y^T)\, Y \bigr) \\
\operatorname{tr}\bigl( \nabla_X\, g(X)^T\, Y \bigr) &= 2 \operatorname{tr}\bigl( w w^T X^T Y \bigr) \\
\Leftrightarrow\quad \nabla_X\, g(X) &= 2\, X w w^T
\end{aligned}
\tag{1436}
\]
□
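The extracted gradient can be confirmed by entrywise finite differences. A minimal sketch assuming NumPy; dimensions, seed, and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
K, L = 4, 3
X = rng.standard_normal((K, L))
w = rng.standard_normal(L)

g = lambda X: w @ X.T @ X @ w
grad = 2 * X @ np.outer(w, w)                 # ∇g(X) = 2Xww', (1436)

h = 1e-6
fd = np.zeros_like(X)
for k in range(K):
    for l in range(L):
        E = np.zeros_like(X); E[k, l] = h
        fd[k, l] = (g(X + E) - g(X - E)) / (2 * h)
assert np.allclose(fd, grad, atol=1e-5)
```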

D.1.7.2 second-order

Likewise removing the evaluation at t = 0 from (1418),

\[
\overset{\rightarrow Y}{dg^2}(X + t\, Y) = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1437}
\]

we can find a similar relationship between the second-order gradient and the second derivative. In the general case \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}\), from (1411) and (1414),

\[
\operatorname{tr}\Bigl( \nabla_X \operatorname{tr}\bigl( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \bigr)^T\, Y \Bigr) = \frac{d^2}{dt^2}\, g_{mn}(X + t\, Y)
\tag{1438}
\]

In the case of a real function \(g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}\) we have, of course,

\[
\operatorname{tr}\Bigl( \nabla_X \operatorname{tr}\bigl( \nabla_X\, g(X + t\, Y)^T\, Y \bigr)^T\, Y \Bigr) = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1439}
\]

From (1425), in the simpler case where the real function \(g(X) : \mathbb{R}^K \to \mathbb{R}\) has vector argument,

\[
Y^T\, \nabla_X^2\, g(X + t\, Y)\, Y = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1440}
\]

D.1.7.2.1 Example. Second-order gradient.
Given real function g(X) = log det X having domain \(\operatorname{int} \mathbb{S}^K_+\), we want to find \(\nabla^2 g(X) \in \mathbb{R}^{K\times K\times K\times K}\). From the tables in §D.2,

\[
h(X) \triangleq \nabla g(X) = X^{-1} \in \operatorname{int} \mathbb{S}^K_+
\tag{1441}
\]

so \(\nabla^2 g(X) = \nabla h(X)\). By (1428) and (1396), for \(Y \in \mathbb{S}^K\)

\[
\operatorname{tr}\bigl( \nabla h_{mn}(X)^T\, Y \bigr) = \frac{d}{dt}\bigg|_{t=0} h_{mn}(X + t\, Y)
\tag{1442}
\]
\[
= \left( \frac{d}{dt}\bigg|_{t=0} h(X + t\, Y) \right)_{mn}
\tag{1443}
\]
\[
= \left( \frac{d}{dt}\bigg|_{t=0} (X + t\, Y)^{-1} \right)_{mn}
\tag{1444}
\]
\[
= -\bigl( X^{-1}\, Y\, X^{-1} \bigr)_{mn}
\tag{1445}
\]

Setting Y to a member of the standard basis \(E_{kl} = e_k e_l^T\), for \(k, l \in \{1 \dots K\}\), and employing a property of the trace function (31), we find

\[
\nabla^2 g(X)_{mnkl} = \operatorname{tr}\bigl( \nabla h_{mn}(X)^T\, E_{kl} \bigr) = \nabla h_{mn}(X)_{kl} = -\bigl( X^{-1} E_{kl}\, X^{-1} \bigr)_{mn}
\tag{1446}
\]
\[
\nabla^2 g(X)_{kl} = \nabla h(X)_{kl} = -\, X^{-1} E_{kl}\, X^{-1} \in \mathbb{R}^{K\times K}
\tag{1447}
\]
□

From all these first- and second-order expressions, we may generate new ones by evaluating both sides at arbitrary t (in some open interval) but only after the differentiation.
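The slice formula (1447) is readily confirmed numerically; the derivative of the inverse, taken without regard to symmetry as §D.2 advises, gives \(-X^{-1}E_{kl}X^{-1}\). A sketch assuming NumPy; the positive definite construction of X and the chosen indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
K = 3
M = rng.standard_normal((K, K))
X = M @ M.T + K * np.eye(K)                   # X positive definite
Xinv = np.linalg.inv(X)

k, l = 1, 2
E = np.zeros((K, K)); E[k, l] = 1.0           # standard basis matrix E_kl
h = 1e-6
fd = (np.linalg.inv(X + h * E) - np.linalg.inv(X - h * E)) / (2 * h)
assert np.allclose(fd, -Xinv @ E @ Xinv, atol=1e-6)   # (1447)
```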

D.2 Tables of gradients and derivatives

[96] [43]

When proving results for symmetric matrices algebraically, it is critical to take gradients ignoring symmetry and to then substitute symmetric entries afterward.

\(a, b \in \mathbb{R}^n\); \(x, y \in \mathbb{R}^k\); \(A, B \in \mathbb{R}^{m\times n}\); \(X, Y \in \mathbb{R}^{K\times L}\); \(t, \mu \in \mathbb{R}\); i, j, k, ℓ, K, L, m, n, M, N are integers, unless otherwise noted.

\(x^\mu\) means \(\delta(\delta(x)^\mu)\) for \(\mu \in \mathbb{R}\); id est, entrywise vector exponentiation. δ is the main-diagonal linear operator (1036). \(x^0 \triangleq 1\), \(X^0 \triangleq I\) if square.

\(\frac{d}{dx} \triangleq \begin{bmatrix} \frac{d}{dx_1} \\ \vdots \\ \frac{d}{dx_k} \end{bmatrix}\), \(\overset{\rightarrow}{dg}(x)\), \(\overset{\rightarrow}{dg^2}(x)\) (directional derivatives §D.1), log x, sgn x, sin x, x/y (Hadamard quotient), \(\sqrt{x}\) (entrywise square root), etcetera, are maps \(f : \mathbb{R}^k \to \mathbb{R}^k\) that maintain dimension; e.g., (§A.1.1)

\[
\frac{d}{dx}\, x^{-1} \triangleq \nabla_x \bigl( \mathbf{1}^T \delta(x)^{-1} \mathbf{1} \bigr) = -x^{-2}
\tag{1448}
\]

The standard basis: \(\bigl\{ E_{kl} = e_k e_\ell^T \in \mathbb{R}^{K\times K} \,\big|\, k, \ell \in \{1 \dots K\} \bigr\}\)

For A a scalar or matrix, we have the Taylor series [45, §3.6]

\[
e^A = \sum_{k=0}^{\infty} \frac{1}{k!}\, A^k
\tag{1449}
\]

Further, [215, §5.4]

\[
e^A \succ 0 \qquad \forall A \in \mathbb{S}^m
\tag{1450}
\]

For all square A and integer k,

\[
\operatorname{det}^k A = \det A^k
\tag{1451}
\]

Table entries with notation \(X \in \mathbb{R}^{2\times2}\) have been algebraically verified in that dimension but may hold more broadly.

D.2.1 Algebraic

\[
\nabla_x\, x = \nabla_x\, x^T = I \in \mathbb{R}^{k\times k} \qquad \nabla_X\, X = \nabla_X\, X^T \triangleq I \in \mathbb{R}^{K\times L\times K\times L} \quad \text{(identity)}
\]
\[
\nabla_x (A x - b) = A^T
\]
\[
\nabla_x \bigl( x^T A - b^T \bigr) = A
\]
\[
\nabla_x\, (A x - b)^T (A x - b) = 2 A^T (A x - b)
\]
\[
\nabla_x^2\, (A x - b)^T (A x - b) = 2 A^T A
\]
\[
\nabla_x \bigl( x^T A x + 2 x^T B y + y^T C y \bigr) = (A + A^T)\, x + 2 B y
\]
\[
\nabla_x^2 \bigl( x^T A x + 2 x^T B y + y^T C y \bigr) = A + A^T
\]
\[
\nabla_X\, a^T X b = \nabla_X\, b^T X^T a = a b^T
\]
\[
\nabla_X\, a^T X^2 b = X^T a b^T + a b^T X^T
\]
\[
\nabla_X\, a^T X^{-1} b = -X^{-T} a b^T X^{-T}
\]
\[
\nabla_X (X^{-1})_{kl} = \frac{\partial X^{-1}}{\partial X_{kl}} = -X^{-1} E_{kl}\, X^{-1}, \quad \text{confer (1388)(1447)}
\]
\[
\nabla_x\, a^T x^T x b = 2 x a^T b \qquad \nabla_X\, a^T X^T X b = X (a b^T + b a^T)
\]
\[
\nabla_x\, a^T x x^T b = (a b^T + b a^T)\, x \qquad \nabla_X\, a^T X X^T b = (a b^T + b a^T)\, X
\]
\[
\nabla_x\, a^T x^T x a = 2 x a^T a \qquad \nabla_X\, a^T X^T X a = 2 X a a^T
\]
\[
\nabla_x\, a^T x x^T a = 2 a a^T x \qquad \nabla_X\, a^T X X^T a = 2 a a^T X
\]
\[
\nabla_x\, a^T y x^T b = b\, a^T y \qquad \nabla_X\, a^T Y X^T b = b a^T Y
\]
\[
\nabla_x\, a^T y^T x b = y\, b^T a \qquad \nabla_X\, a^T Y^T X b = Y a b^T
\]
\[
\nabla_x\, a^T x y^T b = a\, b^T y \qquad \nabla_X\, a^T X Y^T b = a b^T Y
\]
\[
\nabla_x\, a^T x^T y b = y\, a^T b \qquad \nabla_X\, a^T X^T Y b = Y b a^T
\]
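A representative spot-check of one entry, \(\nabla_X\, a^T X^{-1} b = -X^{-T} a b^T X^{-T}\), by entrywise finite differences. A sketch assuming NumPy; the diagonal shift keeping X invertible and the tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X comfortably invertible
a, b = rng.standard_normal(n), rng.standard_normal(n)

g = lambda X: a @ np.linalg.inv(X) @ b
XiT = np.linalg.inv(X).T
grad = -XiT @ np.outer(a, b) @ XiT                # −X⁻ᵀab'X⁻ᵀ

h = 1e-6
fd = np.zeros_like(X)
for k in range(n):
    for l in range(n):
        E = np.zeros_like(X); E[k, l] = h
        fd[k, l] = (g(X + E) - g(X - E)) / (2 * h)
assert np.allclose(fd, grad, atol=1e-6)
```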
Algebraic continued

\[
\frac{d}{dt}\, (X + t\, Y) = Y
\]
\[
\frac{d}{dt}\, B^T (X + t\, Y)^{-1} A = -B^T (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1} A
\]
\[
\frac{d}{dt}\, B^T (X + t\, Y)^{-T} A = -B^T (X + t\, Y)^{-T}\, Y^T (X + t\, Y)^{-T} A
\]
\[
\frac{d}{dt}\, B^T (X + t\, Y)^{\mu} A = \dots\,, \qquad -1 \le \mu \le 1,\; X, Y \in \mathbb{S}^M_+
\]
\[
\frac{d^2}{dt^2}\, B^T (X + t\, Y)^{-1} A = 2\, B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A
\]
\[
\frac{d}{dt} \bigl( (X + t\, Y)^T A (X + t\, Y) \bigr) = Y^T A X + X^T A Y + 2 t\, Y^T A Y
\]
\[
\frac{d^2}{dt^2} \bigl( (X + t\, Y)^T A (X + t\, Y) \bigr) = 2\, Y^T A Y
\]
\[
\frac{d}{dt} \bigl( (X + t\, Y)\, A\, (X + t\, Y) \bigr) = Y A X + X A Y + 2 t\, Y A Y
\]
\[
\frac{d^2}{dt^2} \bigl( (X + t\, Y)\, A\, (X + t\, Y) \bigr) = 2\, Y A Y
\]

D.2.2 Trace Kronecker

\[
\nabla_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla_{\operatorname{vec} X}\, \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = (B \otimes A^T + B^T \otimes A)\, \operatorname{vec} X
\]
\[
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla^2_{\operatorname{vec} X}\, \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = B \otimes A^T + B^T \otimes A
\]

D.2.3 Trace

\[
\nabla_x\, \mu x = \mu I \qquad \nabla_X \operatorname{tr} \mu X = \nabla_X\, \mu \operatorname{tr} X = \mu I
\]
\[
\nabla_x\, \mathbf{1}^T \delta(x)^{-1} \mathbf{1} = \frac{d}{dx}\, x^{-1} = -x^{-2} \qquad \nabla_X \operatorname{tr} X^{-1} = -X^{-2T}
\]
\[
\nabla_x\, \mathbf{1}^T \delta(x)^{-1} y = -\delta(x)^{-2}\, y \qquad \nabla_X \operatorname{tr}(X^{-1} Y) = \nabla_X \operatorname{tr}(Y X^{-1}) = -X^{-T} Y^T X^{-T}
\]
\[
\frac{d}{dx}\, x^{\mu} = \mu x^{\mu-1} \qquad \nabla_X \operatorname{tr} X^{\mu} = \mu X^{(\mu-1)T}, \quad X \in \mathbb{R}^{2\times2}
\]
\[
\nabla_X \operatorname{tr} X^{j} = j X^{(j-1)T}
\]
\[
\nabla_x (b - a^T x)^{-1} = (b - a^T x)^{-2}\, a \qquad \nabla_X \operatorname{tr}\bigl( (B - A X)^{-1} \bigr) = \bigl( (B - A X)^{-2} A \bigr)^T
\]
\[
\nabla_x (b - a^T x)^{\mu} = -\mu (b - a^T x)^{\mu-1}\, a
\]
\[
\nabla_x\, x^T y = \nabla_x\, y^T x = y \qquad \nabla_X \operatorname{tr}(X^T Y) = \nabla_X \operatorname{tr}(Y X^T) = \nabla_X \operatorname{tr}(Y^T X) = \nabla_X \operatorname{tr}(X Y^T) = Y
\]
\[
\nabla_X \operatorname{tr}(A X B X^T) = \nabla_X \operatorname{tr}(X B X^T A) = A^T X B^T + A X B
\]
\[
\nabla_X \operatorname{tr}(A X B X) = \nabla_X \operatorname{tr}(X B X A) = A^T X^T B^T + B^T X^T A^T
\]
\[
\nabla_X \operatorname{tr}(A X A X A X) = \nabla_X \operatorname{tr}(X A X A X A) = 3 (A X A X A)^T
\]
\[
\nabla_X \operatorname{tr}(Y X^k) = \nabla_X \operatorname{tr}(X^k Y) = \sum_{i=0}^{k-1} \bigl( X^i\, Y\, X^{k-1-i} \bigr)^T
\]
\[
\nabla_X \operatorname{tr}(Y^T X X^T Y) = \nabla_X \operatorname{tr}(X^T Y Y^T X) = 2\, Y Y^T X
\]
\[
\nabla_X \operatorname{tr}(Y^T X^T X Y) = \nabla_X \operatorname{tr}(X Y Y^T X^T) = 2\, X Y Y^T
\]
\[
\nabla_X \operatorname{tr}\bigl( (X + Y)^T (X + Y) \bigr) = 2 (X + Y)
\]
\[
\nabla_X \operatorname{tr}\bigl( (X + Y)(X + Y) \bigr) = 2 (X + Y)^T
\]
\[
\nabla_X \operatorname{tr}(A^T X B) = \nabla_X \operatorname{tr}(X^T A B^T) = A B^T
\]
\[
\nabla_X \operatorname{tr}(A^T X^{-1} B) = \nabla_X \operatorname{tr}(X^{-T} A B^T) = -X^{-T} A B^T X^{-T}
\]
\[
\nabla_X\, a^T X b = \nabla_X \operatorname{tr}(b a^T X) = \nabla_X \operatorname{tr}(X b a^T) = a b^T
\]
\[
\nabla_X\, b^T X^T a = \nabla_X \operatorname{tr}(X^T a b^T) = \nabla_X \operatorname{tr}(a b^T X^T) = a b^T
\]
\[
\nabla_X\, a^T X^{-1} b = \nabla_X \operatorname{tr}(X^{-T} a b^T) = -X^{-T} a b^T X^{-T}
\]
\[
\nabla_X\, a^T X^{\mu} b = \dots
\]

Trace continued

\[
\frac{d}{dt} \operatorname{tr} g(X + t\, Y) = \operatorname{tr} \frac{d}{dt}\, g(X + t\, Y)
\]
\[
\frac{d}{dt} \operatorname{tr}(X + t\, Y) = \operatorname{tr} Y
\]
\[
\frac{d}{dt} \operatorname{tr}^j (X + t\, Y) = j \operatorname{tr}^{j-1}(X + t\, Y)\, \operatorname{tr} Y
\]
\[
\frac{d}{dt} \operatorname{tr}(X + t\, Y)^j = j \operatorname{tr}\bigl( (X + t\, Y)^{j-1}\, Y \bigr) \qquad (\forall\, j)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)\, Y \bigr) = \operatorname{tr} Y^2
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)^k\, Y \bigr) = \frac{d}{dt} \operatorname{tr}\bigl( Y (X + t\, Y)^k \bigr) = k \operatorname{tr}\bigl( (X + t\, Y)^{k-1}\, Y^2 \bigr), \quad k \in \{0, 1, 2\}
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)^k\, Y \bigr) = \frac{d}{dt} \operatorname{tr}\bigl( Y (X + t\, Y)^k \bigr) = \sum_{i=0}^{k-1} \operatorname{tr}\bigl( (X + t\, Y)^i\, Y\, (X + t\, Y)^{k-1-i}\, Y \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y \bigr) = -\operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( B^T (X + t\, Y)^{-1} A \bigr) = -\operatorname{tr}\bigl( B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( B^T (X + t\, Y)^{-T} A \bigr) = -\operatorname{tr}\bigl( B^T (X + t\, Y)^{-T}\, Y^T (X + t\, Y)^{-T} A \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( B^T (X + t\, Y)^{-k} A \bigr) = \dots\,, \qquad k > 0
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( B^T (X + t\, Y)^{\mu} A \bigr) = \dots\,, \qquad -1 \le \mu \le 1,\; X, Y \in \mathbb{S}^M_+
\]
\[
\frac{d^2}{dt^2} \operatorname{tr}\bigl( B^T (X + t\, Y)^{-1} A \bigr) = 2 \operatorname{tr}\bigl( B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)^T A (X + t\, Y) \bigr) = \operatorname{tr}\bigl( Y^T A X + X^T A Y + 2 t\, Y^T A Y \bigr)
\]
\[
\frac{d^2}{dt^2} \operatorname{tr}\bigl( (X + t\, Y)^T A (X + t\, Y) \bigr) = 2 \operatorname{tr}\bigl( Y^T A Y \bigr)
\]
\[
\frac{d}{dt} \operatorname{tr}\bigl( (X + t\, Y)\, A\, (X + t\, Y) \bigr) = \operatorname{tr}(Y A X + X A Y + 2 t\, Y A Y)
\]
\[
\frac{d^2}{dt^2} \operatorname{tr}\bigl( (X + t\, Y)\, A\, (X + t\, Y) \bigr) = 2 \operatorname{tr}(Y A Y)
\]
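These derivative entries hold at arbitrary t, per the closing remark of §D.1.7. A sketch checking one of them, \(\frac{d}{dt}\operatorname{tr}(B^T(X+tY)^{-1}A) = -\operatorname{tr}(B^T(X+tY)^{-1}Y(X+tY)^{-1}A)\), at a nonzero t; it assumes NumPy, and the definite construction of X and the evaluation point are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 3
A, B, Y = (rng.standard_normal((n, n)) for _ in range(3))
M = rng.standard_normal((n, n))
X = M @ M.T + n * np.eye(n)                   # keeps X + t₀Y invertible near t₀

f = lambda t: np.trace(B.T @ np.linalg.inv(X + t * Y) @ A)
t0, h = 0.1, 1e-6
fd = (f(t0 + h) - f(t0 - h)) / (2 * h)
Z = np.linalg.inv(X + t0 * Y)
assert np.isclose(fd, -np.trace(B.T @ Z @ Y @ Z @ A), atol=1e-5)
```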

D.2.4 Log determinant

x ≻ 0, det X > 0 on some neighborhood of X, and det(X + tY) > 0 on some open interval of t; otherwise, log( ) would be discontinuous.

\[
\frac{d}{dx} \log x = x^{-1} \qquad \nabla_X \log \det X = X^{-T}
\]
\[
\nabla_X^2 \log \det(X)_{kl} = \frac{\partial X^{-T}}{\partial X_{kl}} = \bigl( -X^{-1} E_{kl}\, X^{-1} \bigr)^T, \quad \text{confer (1405)(1447)}
\]
\[
\frac{d}{dx} \log x^{-1} = -x^{-1} \qquad \nabla_X \log \det X^{-1} = -X^{-T}
\]
\[
\frac{d}{dx} \log x^{\mu} = \mu\, x^{-1} \qquad \nabla_X \log \operatorname{det}^{\mu} X = \mu\, X^{-T}
\]
\[
\nabla_X \log \det X^{\mu} = \mu\, X^{-T}, \quad X \in \mathbb{R}^{2\times2}
\]
\[
\nabla_X \log \det X^{k} = \nabla_X \log \operatorname{det}^{k} X = k\, X^{-T}
\]
\[
\nabla_X \log \det (X + t\, Y) = (X + t\, Y)^{-T}
\]
\[
\nabla_x \log(a^T x + b) = a\, \frac{1}{a^T x + b} \qquad \nabla_X \log \det(A X + B) = A^T (A X + B)^{-T}
\]
\[
\nabla_X \log \det(I \pm A^T X A) = \dots
\]
\[
\nabla_X \log \det(X + t\, Y)^{k} = \nabla_X \log \operatorname{det}^{k}(X + t\, Y) = k\, (X + t\, Y)^{-T}
\]
\[
\frac{d}{dt} \log \det(X + t\, Y) = \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d^2}{dt^2} \log \det(X + t\, Y) = -\operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d}{dt} \log \det(X + t\, Y)^{-1} = -\operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d^2}{dt^2} \log \det(X + t\, Y)^{-1} = \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d}{dt} \log \det\bigl( \delta(A(x + t\, y) + a)^2 + \mu I \bigr) = \operatorname{tr}\Bigl( \bigl( \delta(A(x + t\, y) + a)^2 + \mu I \bigr)^{-1}\, 2\, \delta(A(x + t\, y) + a)\, \delta(A y) \Bigr)
\]

D.2.5 Determinant

\[
\nabla_X \det X = \nabla_X \det X^T = \det(X)\, X^{-T}
\]
\[
\nabla_X \det X^{-1} = -\det(X^{-1})\, X^{-T} = -\det(X)^{-1}\, X^{-T}
\]
\[
\nabla_X \operatorname{det}^{\mu} X = \mu \operatorname{det}^{\mu}(X)\, X^{-T}
\]
\[
\nabla_X \det X^{\mu} = \mu \det(X^{\mu})\, X^{-T}, \quad X \in \mathbb{R}^{2\times2}
\]
\[
\nabla_X \det X^{k} = k \operatorname{det}^{k-1}(X) \bigl( \operatorname{tr}(X)\, I - X^T \bigr), \quad X \in \mathbb{R}^{2\times2}
\]
\[
\nabla_X \det X^{k} = \nabla_X \operatorname{det}^{k} X = k \det(X^k)\, X^{-T} = k \operatorname{det}^{k}(X)\, X^{-T}
\]
\[
\nabla_X \operatorname{det}^{\mu}(X + t\, Y) = \mu \operatorname{det}^{\mu}(X + t\, Y)\, (X + t\, Y)^{-T}
\]
\[
\nabla_X \det(X + t\, Y)^{k} = \nabla_X \operatorname{det}^{k}(X + t\, Y) = k \operatorname{det}^{k}(X + t\, Y)\, (X + t\, Y)^{-T}
\]
\[
\frac{d}{dt} \det(X + t\, Y) = \det(X + t\, Y)\, \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d^2}{dt^2} \det(X + t\, Y) = \det(X + t\, Y) \Bigl( \operatorname{tr}^2\bigl( (X + t\, Y)^{-1}\, Y \bigr) - \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \bigr) \Bigr)
\]
\[
\frac{d}{dt} \det(X + t\, Y)^{-1} = -\det(X + t\, Y)^{-1}\, \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y \bigr)
\]
\[
\frac{d^2}{dt^2} \det(X + t\, Y)^{-1} = \det(X + t\, Y)^{-1} \Bigl( \operatorname{tr}^2\bigl( (X + t\, Y)^{-1}\, Y \bigr) + \operatorname{tr}\bigl( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \bigr) \Bigr)
\]
\[
\frac{d}{dt} \operatorname{det}^{\mu}(X + t\, Y) = \dots
\]

D.2.6 Logarithmic

[128, §6.6, prob.20]

\[
\frac{d}{dt} \log (X + t\, Y)^{\mu} = \dots\,, \qquad -1 \le \mu \le 1,\; X, Y \in \mathbb{S}^M_+
\]

D.2.7 Exponential

[45, §3.6, §4.5] [215, §5.4]

\[
\nabla_X\, e^{\operatorname{tr}(Y^T X)} = \nabla_X \det e^{Y^T X} = e^{\operatorname{tr}(Y^T X)}\, Y \qquad (\forall\, X, Y)
\]
\[
\nabla_X \operatorname{tr}\, e^{Y^T X} = e^{Y X^T}\, Y = Y\, e^{X^T Y}
\]

log-sum-exp & geometric mean [39, p.74]...

\[
\frac{d^j}{dt^j}\, e^{\operatorname{tr}(X + t\, Y)} = e^{\operatorname{tr}(X + t\, Y)}\, \operatorname{tr}^j(Y)
\]
\[
\frac{d}{dt}\, e^{t\, Y} = e^{t\, Y}\, Y = Y\, e^{t\, Y}
\]
\[
\frac{d}{dt}\, e^{X + t\, Y} = e^{X + t\, Y}\, Y = Y\, e^{X + t\, Y}, \qquad XY = YX
\]
\[
\frac{d^2}{dt^2}\, e^{X + t\, Y} = e^{X + t\, Y}\, Y^2 = Y\, e^{X + t\, Y}\, Y = Y^2\, e^{X + t\, Y}, \qquad XY = YX
\]

\(e^X\) for symmetric X of dimension less than 3 [39, pg.110]...
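A sketch of the commuting-case exponential entry, assuming SciPy's expm is available: X is built as a polynomial in Y so that XY = YX, and \(\frac{d}{dt}e^{X+tY} = Y e^{X+tY}\) is checked by a central difference at a nonzero t; scaling, seed, and tolerance are arbitrary choices.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(12)
Y = rng.standard_normal((3, 3)) / 2
X = Y @ Y - np.eye(3)                # a polynomial in Y, hence XY = YX

t0, h = 0.4, 1e-6
fd = (expm(X + (t0 + h) * Y) - expm(X + (t0 - h) * Y)) / (2 * h)
assert np.allclose(fd, Y @ expm(X + t0 * Y), atol=1e-6)
```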
