We have already argued that using derivatives gives optimization algorithms a boost. There are three classes of methods for computing derivatives (and gradients, Jacobians, Hessians): symbolic, numerical finite-difference (FD), and automatic (algorithmic) differentiation (AD).

Symbolic methods
These are essentially the methods that we have all learnt to apply with pen and paper: a collection of rules. The input to the procedure is a function and the output is another function. For example, for f(x) = x^2, the derivative is f'(x) = 2x.
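For instance, this is what the Symbolics.jl package (which we will also use below) can do; a minimal sketch:

using Symbolics

@variables x                     # declare a symbolic variable
D = Differential(x)              # the differential operator d/dx
expand_derivatives(D(x^2))       # returns the symbolic expression 2x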
Although straightforward and automatable, symbolic methods are not always the best choice. When does this happen?
The function to be differentiated is already rather complicated, and its derivative will typically be even more complicated; its evaluation may then be computationally expensive. We will see this in the example below.
The function to be differentiated is not available in closed form (as a formula), but only as a software implementation (even if an open-source one).
Example 1 (Symbolic computation of the gradient of a function of the simulated trajectory) Consider the state-space model of a pendulum
\underbrace{
\begin{bmatrix}
\dot \theta\\ \dot \omega
\end{bmatrix}}_{\dot{\bm x}}
=
\underbrace{
\begin{bmatrix}
\omega\\ -\frac{g}{l}\sin\theta
\end{bmatrix}}_{\mathbf f(\bm x)},
where l=1\,\mathrm{m} is the length of the pendulum, g=9.81\,\mathrm{m}/\mathrm{s}^2 is the acceleration due to gravity, and \theta and \omega are the angle and angular velocity of the pendulum, respectively. We are going to simulate the trajectory of a pendulum that starts at some nonzero angle, say \theta(0) = \theta_0 = \pi/4, with zero angular velocity, \omega(0) = \omega_0 = 0. As the cost function to be minimized we consider the 2-norm (actually its square, for convenience) of the state vector at the end of the simulation interval, and we need to evaluate its gradient with respect to the initial state.
First, we write an ODE solver that obtains an approximation of the final point \bm x(t_\mathrm{f}) of the state trajectory on the interval [0,t_\mathrm{f}], and a function that computes the cost as a function of the initial state J(\bm x_0).
Alternative (and still imperfect) notation
The state at the final time is a function of the state at the initial time, hence we could also write it as \bm x(t_\mathrm{f};\bm x_0), in which case the cost could be written as J(\bm x(t_\mathrm{f};\bm x_0)). The dependence of the cost on the initial state is still visible, but the notation is a bit clumsier (and abusive anyway).
function solve_for_final_state_fd(f, x₀, tspan, h)   # fixed-step forward-Euler ODE solver
    t0, tf = tspan
    t = t0
    x = x₀
    while t < tf
        x = x + h*f(x)        # Euler step
        t = t + h
    end
    return x
end

function J(x₀)                # cost: squared 2-norm of the final state
    x_final = solve_for_final_state_fd(f, x₀, tspan, h)
    return x_final[1]^2 + x_final[2]^2
end
And now we use the solver to compute the final state of the trajectory and the corresponding cost:
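A minimal self-contained sketch of this step follows; the concrete final time t_f and the solver step h are our assumptions (the original values are not shown here):

g, l = 9.81, 1.0                       # gravitational acceleration and pendulum length
f(x) = [x[2], -(g/l)*sin(x[1])]        # right-hand side of the pendulum ODE above

x₀ = [π/4, 0.0]                        # initial state [θ₀, ω₀]
tspan = (0.0, 1.0)                     # simulation interval [0, t_f] (t_f assumed)
h = 0.01                               # integration step (assumed)

x_final = solve_for_final_state_fd(f, x₀, tspan, h)
J(x₀)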
We now use the Symbolics.jl package to compute the gradient of the cost function at the initial state. We first define symbolic state variables and then obtain a symbolic expression for the cost function simply by evaluating the function we already have at these symbolic state variables.
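A sketch of this step (the names of the symbolic variables are arbitrary):

using Symbolics

@variables θ₀ ω₀                 # symbolic initial state
x₀_sym = [θ₀, ω₀]
J_sym = J(x₀_sym)                # symbolic expression for the cost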
If the shortcomings of symbolic methods have not yet started surfacing, scroll to the right in the output field. Rather long, n’est-ce pas? And we have just started, because we now must differentiate this long expression symbolically (and then we convert it from a symbolic expression back to a standard function in Julia):
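Again just a sketch of how this can be done with Symbolics.jl; the concluding timing comparison against forward-mode AD (ForwardDiff.jl) is our assumed setup, and the measured times will of course differ from machine to machine:

∇J_sym = Symbolics.gradient(J_sym, x₀_sym)      # symbolic gradient (a very long expression)

# convert the symbolic gradient back into an ordinary (compiled) Julia function
∇J_fun = eval(Symbolics.build_function(∇J_sym, x₀_sym)[1])

# compare the two approaches on the numeric initial state
using ForwardDiff
@time ∇J_fun(x₀)
@time ForwardDiff.gradient(J, x₀)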
Note that while for the former (the symbolic approach) quite some work had to be done before the timing analysis could even start (namely, the symbolic expression for the gradient had to be found and compiled), for the latter (automatic differentiation) we started with nothing but the function definition. And yet the latter approach wins hands down. But before we start exploring AD methods, we give a brief overview of numerical methods based on finite-difference (FD) approximations.
Numerical finite-difference (FD) methods
These methods approximate the derivative by computing differences between function values at nearby points, hence the name finite-difference (FD) methods. The simplest FD methods follow from the definition of the derivative by omitting the limit:
\frac{\mathrm d f(x)}{\mathrm d x} \approx \frac{f(x+h)-f(x)}{h}\qquad\qquad \text{forward difference}
or
\frac{\mathrm d f(x)}{\mathrm d x} \approx \frac{f(x)-f(x-h)}{h}\qquad\qquad \text{backward difference}
or
\frac{\mathrm d f(x)}{\mathrm d x} \approx \frac{f(x+\frac{h}{2})-f(x-\frac{h}{2})}{h}\qquad\qquad \text{central difference}
For functions of vector variables the same idea applies, but now we have to perturb each component of the vector separately; approximating the full gradient thus costs one extra function evaluation per component (for the forward or backward difference), as sketched below.
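A minimal sketch of a forward-difference gradient (the function name fd_gradient is ours, not taken from any package):

# Forward-difference approximation of the gradient of f: Rⁿ → R
function fd_gradient(f, x; h=1e-8)
    g = similar(x, Float64)
    fx = f(x)
    for i in eachindex(x)
        xp = copy(x)
        xp[i] += h                  # perturb the i-th component
        g[i] = (f(xp) - fx) / h
    end
    return g
end

fd_gradient(x -> sum(xᵢ^2 for xᵢ in x), [1.0, 2.0, 3.0])   # ≈ [2.0, 4.0, 6.0]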
Dependence of the error on the step size
The finite-difference methods only approximate the derivatives. The smaller the h in the above formulas, the smaller the approximation error. Really? Not quite. Let’s explore it through an example.
Example 2 (Dependence of the error on the step size) Consider a scalar function of a vector argument f(\bm x) = \sum_{i=1}^n x_i^2.
f(x) = sum(xᵢ^2 for xᵢ in x)
f (generic function with 1 method)
Now, in order to compute the gradient \nabla f, we need to compute all the individual partial derivatives, that is, the individual components of the gradient vector. Let’s restrict ourselves to just one of them, say the first one, \frac{\partial f(\mathbf x)}{\partial x_1}.
In this simple example, a formula can be written down upon inspection: \frac{\partial f(\mathbf x)}{\partial x_1} = 2x_1:
x₀ = rand(10)
∇f_1_exact = 2*x₀[1]
0.6656067086287529
but let’s pretend that this answer is not available to us (we will only use it to evaluate the approximation errors of the chosen FD methods).
h = 0.1     # FD step size (value consistent with the outputs below)

function ∇f_1(f, x₀::Vector{T}, h::T) where T<:Real
    # forward difference for ∂f/∂x₁ (this particular f also accepts a scalar argument)
    (f(x₀[1]+h) - f(x₀[1]))/h
end
∇f_1 (generic function with 1 method)
We can now compute the first entry of the gradient for the particular vector represented in the IEEE double-precision format (the default in Julia):
∇f_1(f,x₀,h)
0.7656067086287528
We can also compute the same quantity for the same vector represented in the IEEE single-precision format:
∇f_1(f,Vector{Float32}(x₀),Float32(h))
0.76560676f0
Obviously, both answers differ from the exact value computed above.
Now, let’s examine the error as a function of the size of the interval h:
h_range = exp10.(range(-13, stop=-1, length=1000));
abs_err_64 = [abs(∇f_1_exact - ∇f_1(f, x₀, h)) for h in h_range];
abs_err_32 = [abs(∇f_1_exact - ∇f_1(f, Vector{Float32}(x₀), Float32(h))) for h in h_range];

using Plots
plot(h_range, abs_err_64, xaxis=:log, yaxis=:log, xlabel="h", ylabel="|e|", label="IEEE double")
plot!(h_range, abs_err_32, xaxis=:log, yaxis=:log, xlabel="h", ylabel="|e|", label="IEEE single")
If we read the graph from right to left, that is, as h gets smaller, we observe that the error initially decreases for both the 64-bit and 32-bit floating-point formats, and at the same rate. The reason for this decrease is that the truncation error dominates in this regime (by using the FD approximation we effectively truncate the Taylor series), and this error decreases with h.
The major intended takeaway from this example is that this reduction of the error only continues down to some value of h, below which the error no longer decreases; in fact, it actually increases as h gets smaller. The reason is that for h this small, the rounding errors dominate. Apparently, they start showing up at smaller values of h with the 64-bit format than with the 32-bit format. The rounding errors become dominant because we are subtracting two numbers that get closer and closer to each other as h decreases; this is known as catastrophic cancellation.
It is known from rigorous numerical analysis that for the simple forward or backward finite-difference approximation of a scalar derivative, both the optimal step size and the smallest achievable error scale with \sqrt{\epsilon}, where \epsilon is the machine precision (unit roundoff). Here we can expect an even worse outcome as the dimension n of the vector grows. Note that for the IEEE double-precision format \epsilon is
2.0^(-53)
1.1102230246251565e-16
which is one half of what Julia’s eps() function returns (eps() gives the spacing between 1.0 and the next larger representable number, 2^{-52}, and takes the floating-point type as an input argument; Float64 is assumed if none is given). For the order-of-magnitude argument here the factor of two does not matter, and the square root is
sqrt(eps())
1.4901161193847656e-8
Similarly, the 32-bit version is
sqrt(eps(Float32))
0.00034526698f0
Apparently, these are roughly the “cut-off” values of h observed in the plot above.
Automatic (algorithmic) differentiation (AD)

The forward mode of AD can be implemented elegantly using dual numbers. Similar to a complex number, a dual number has two components, one corresponding to the value and the other to the derivative:
x = v + d\epsilon,
where the special property of \epsilon is
\epsilon^2=0.
(Compare it with the property of the imaginary unit: i^2=-1.)
Multiplication of two dual numbers, y = x_1\cdot x_2, is then defined naturally as
\begin{aligned}
y &= (v_1 + d_1\epsilon)\cdot (v_2 + d_2\epsilon)\\
&= v_1v_2 + (v_1d_2+d_1v_2)\epsilon + d_1d_2\epsilon^2\\
&= v_1v_2 + (v_1d_2+d_1v_2)\epsilon.
\end{aligned}
Similar rules follow for other operations and elementary functions. We illustrate this using the following example.
Example 3 Consider a function y(x) = \cos(x^2). Its derivative is trivially \frac{\mathrm d y}{\mathrm d x} = -2x\sin(x^2).
Let’s now evaluate this result at a particular value of x, say
x = 2
2
First, we are going to develop our own data type (a struct) in Julia for dual numbers. The code below is inspired by the code from [3] (the authors even provide Jupyter notebooks).
Then we show the usage of a functionality already provided in ForwardDiff.jl package implementing the forward mode AD.
struct Dual
    v   # the VALUE part
    d   # the DERIVATIVE part
end
Now we need to overload the basic operations involved, such as addition and multiplication of two dual numbers, multiplication by a scalar, squaring, and evaluation of the cosine function; a sketch of such overloads follows.
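The following is a minimal sketch consistent with the outputs shown below; the original code (and the version in [3]) may differ in details such as promotion rules and the set of overloaded functions:

import Base: +, *, ^, cos

+(x::Dual, y::Dual) = Dual(x.v + y.v, x.d + y.d)              # sum rule
*(x::Dual, y::Dual) = Dual(x.v*y.v, x.v*y.d + x.d*y.v)        # product rule
*(a::Number, x::Dual) = Dual(a*x.v, a*x.d)                    # scaling by a constant
^(x::Dual, n::Integer) = Dual(x.v^n, n*x.v^(n-1)*x.d)         # power rule
cos(x::Dual) = Dual(cos(x.v), -sin(x.v)*x.d)                  # chain rule for cosine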
Let’s now check the functionality of the individual operations:
X = Dual(x, 1)
Dual(2, 1)
Y = Dual(3, 0)
Dual(3, 0)
3*X
Dual(6, 3)
Y*X
Dual(6, 3)
X^2
Dual(4, 4)
cos(X)
Dual(-0.4161468365471424, -0.9092974268256817)
Finally, let’s use the new functionality to compute the derivative of our function \cos(x^2) at x=2:
cos(X^2)
Dual(-0.6536436208636119, 3.027209981231713)
The derivative part indeed agrees with -2x\sin(x^2) evaluated at x=2, that is, -4\sin(4) \approx 3.0272. In practice, you will hardly ever feel the need to implement your own library for algorithmic differentiation. Instead, you may want to use one of the available ones, such as ForwardDiff.jl.
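A minimal usage sketch (ForwardDiff.derivative handles scalar inputs; for vector inputs there is ForwardDiff.gradient):

using ForwardDiff

y(x) = cos(x^2)
ForwardDiff.derivative(y, 2.0)   # ≈ 3.027209981231713, matching -2x*sin(x^2) at x = 2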
[1] S. Gros and M. Diehl, “Numerical Optimal Control (Draft).” Systems Control and Optimization Laboratory IMTEK, Faculty of Engineering, University of Freiburg, Apr. 2022. Available: https://www.syscop.de/files/2020ss/NOC/book-NOCSE.pdf
[2] J. R. R. A. Martins and A. Ning, Engineering Design Optimization. Cambridge; New York, NY: Cambridge University Press, 2022. Available: https://mdobook.github.io/
[3] M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. The MIT Press, 2019. Available: https://algorithmsbook.com/optimization/