Håvard Berland
Department of Mathematical Sciences, NTNU
Abstract
Automatic differentiation is introduced to an audience with basic mathematical prerequisites. Numerical examples show the deficiency of divided differences, and dual numbers serve to introduce the algebra, being one example of how to derive automatic differentiation. An example with forward mode is given first, and source transformation and operator overloading are illustrated. Then reverse mode is briefly sketched, followed by some discussion.
(45 minute talk)
Automatic differentiation

Automatic differentiation (AD) is software to transform code for one function into code for the derivative of the function:

    f(x) {...};   -->   df(x) {...};
    y = f(x)            y' = f'(x)

The transformation can be done by a human programmer writing df by hand, or automatically by an AD tool.
But how to compute f'(x_n) when we only know f(x)? Symbolic differentiation? Divided differences? Automatic differentiation? Yes.
Divided differences

    f'(x) ≈ (f(x + h) - f(x)) / h
[Figure: log-log plot of the error in the divided-difference approximation to the derivative 3x² of x³, as a function of step size h. Truncation error shrinks with h for both the forward and centered difference formulas, until finite-precision roundoff error takes over and the accuracy becomes useless.]
Dual numbers
Extend all numbers by adding a second component,

    x ↦ x + x'd

d is just a symbol distinguishing the second component, analogous to the imaginary unit i = √-1. But let d² = 0, as opposed to i² = -1.

Arithmetic on dual numbers:

    (x + x'd) + (y + y'd) = x + y + (x' + y')d
    (x + x'd) · (y + y'd) = xy + x'y d + xy'd + x'y'd²     (the d² term is 0)
                          = xy + (x'y + xy')d
    -(x + x'd) = -x - x'd
    1/(x + x'd) = 1/x - (x'/x²)d       (x ≠ 0)
Similarly, one may derive

    sin(x + x'd) = sin(x) + cos(x) x'd
    cos(x + x'd) = cos(x) - sin(x) x'd
    e^(x + x'd)  = e^x + e^x x'd
    log(x + x'd) = log(x) + (x'/x)d        (x ≠ 0)
    √(x + x'd)   = √x + (x'/(2√x))d        (x ≠ 0)
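The sine rule, for instance, follows from the angle-addition formula together with the Taylor series of sin and cos, since every power (x'd)^k with k ≥ 2 vanishes:

```latex
\begin{aligned}
\sin(x + x'd) &= \sin x \,\cos(x'd) + \cos x \,\sin(x'd) \\
\cos(x'd) &= 1 - \tfrac{(x'd)^2}{2!} + \cdots = 1
  \qquad \text{since } d^2 = 0 \\
\sin(x'd) &= x'd - \tfrac{(x'd)^3}{3!} + \cdots = x'd \\
\Rightarrow\quad \sin(x + x'd) &= \sin x + \cos(x)\, x'd
\end{aligned}
```

The dual component thus carries exactly the derivative, with no truncation error.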
The example function

    f(x1, x2) = x1 x2 + sin(x1)

is decomposed into intrinsic operations

    w1 = x1,  w2 = x2,  w3 = w1 w2,  w4 = sin(w1),  w5 = w3 + w4 = f.

The chain rule

    ∂f/∂x2 = (∂f/∂w5)(∂w5/∂w3)(∂w3/∂w2)(∂w2/∂x2)

ensures that we can propagate the dual components throughout the computation.
Our current procedure:

1. Decompose original code into intrinsic functions
2. Differentiate the intrinsic functions, effectively symbolically
3. Multiply together according to the chain rule

How to automatically transform the original program into the dual program? Two approaches:

- Source code transformation (C, Fortran 77)
- Operator overloading (C++, Fortran 90)
Source code transformation

function.c:

    double f(double x1, double x2) {
        double w3 = x1 * x2;
        double w4 = sin(x1);
        double w5 = w3 + w4;
        return w5;
    }
The source file is transformed and then compiled:

    function.c --> [AD source transformer] --> differentiated C source --> C compiler --> diff_function.o
Operator overloading
function.c++
    Number f(Number x1, Number x2) {
        Number w3 = x1 * x2;
        Number w4 = sin(x1);
        Number w5 = w3 + w4;
        return w5;
    }
Forward mode AD
We have until now only described forward mode AD. Repetition of the procedure using the computational graph of f(x1, x2) = x1 x2 + sin(x1), with forward propagation of derivative values from the inputs to the output:

    w'4 = cos(w1) w'1          (w4 = sin(w1))
    w'3 = w'1 w2 + w1 w'2      (w3 = w1 w2)
    w'5 = w'3 + w'4            (w5 = w3 + w4 = f)

Seeds: w'1, w'2 ∈ {0, 1}.
Reverse mode AD
The chain rule works in both directions. The computational graph is now traversed from the top, with backward propagation of derivative values (adjoints, written w̄):

    f̄ = w̄5 = 1                         (seed)
    w̄4 = w̄5 ∂w5/∂w4 = w̄5 · 1
    w̄3 = w̄5 ∂w5/∂w3 = w̄5 · 1
    w̄2 = w̄3 ∂w3/∂w2 = w̄3 w1
    w̄1a = w̄4 cos(w1)                   (contribution through sin)
    w̄1b = w̄3 w2                        (contribution through the product)

    x̄1 = w̄1a + w̄1b = cos(x1) + x2
    x̄2 = w̄2 = x1
Jacobian computation
Given F : R^n → R^m, the Jacobian J = DF(x) ∈ R^(m×n) has entries

    J_ij = ∂f_i/∂x_j,    i = 1, …, m,   j = 1, …, n,

i.e. row i holds the partial derivatives of f_i with respect to x1, …, xn.
One sweep of forward mode can calculate one column vector of the Jacobian, J x', where x' is a column vector of seeds. One sweep of reverse mode can calculate one row vector of the Jacobian, ȳ J, where ȳ is a row vector of seeds. The computational cost of one sweep forward or reverse is roughly equivalent, but reverse mode requires access to intermediate variables, requiring more memory.
Forward and reverse mode represent just two possible (extreme) ways of recursing through the chain rule. For n > 1 and m > 1 there is a golden mean, but finding the optimal way is probably an NP-hard problem.
Discussion
- Accuracy is guaranteed, and complexity is not worse than that of the original function.
- AD works on iterative solvers, on functions consisting of thousands of lines of code.
- AD is trivially generalized to higher derivatives. Hessians are used in some optimization algorithms. Complexity is quadratic in the highest derivative degree.
- The alternative to AD is usually symbolic differentiation, or rather using algorithms not relying on derivatives.
- Divided differences may be just as good as AD in cases where the underlying function is based on discrete or measured quantities, or is the result of stochastic simulations.
Applications of AD
- Newton's method for solving nonlinear equations
- Optimization (utilizing gradients/Hessians)
- Inverse problems/data assimilation
- Neural networks
- Solving stiff ODEs

For software and publication lists, visit www.autodiff.org

Recommended literature: Andreas Griewank: Evaluating Derivatives. SIAM, 2000.