Our motivation for studying numerical algorithms for unconstrained optimization remains the same as when we studied the conditions of optimality for such unconstrained problems – such algorithms constitute building blocks for constrained optimization problems. Indeed, many algorithms for constrained problems are based on reformulating the constrained problem into an unconstrained one and then applying the algorithms studied in this section.
It may be useful to recapitulate our motivation for studying optimization algorithms in general – after all, there are dozens of commercial and free and open-source software tools for solving optimization problems. Why not just use them? There are two answers beyond the traditional “in grad school we should understand what we are using”:
There is certainly no shortage of algorithms for unconstrained optimization. In this crash course we can cover only a few. But the few we cover here certainly form a solid theoretical basis and provide practically usable tools.
One possible way to classify the algorithms is based on whether or not they use derivatives of the objective function. In this course we only consider the former approaches, as they lead to more efficient algorithms. For the latter, derivative-free methods, we refer to the literature (a prominent example is the Nelder–Mead method).
All the relevant methods are iterative. Based on what happens within each iteration, we can classify them into two categories: descent (line search) methods and trust region methods.
A single iteration of a descent method consists of the following step: \boxed{ \bm x_{k+1} = \bm x_{k} + \alpha_k \bm d_k,} \tag{1} where \bm x_k is the current solution, \bm d_k is the search direction, and \alpha_k is the step length.
The obvious quality that the search direction needs to satisfy is that the cost function decreases along it, at least locally (for a small step length).
Definition 1 (Descent direction) At the current iterate \bm x_k, the direction \bm d_k is called a descent direction if \nabla f(\bm x_k)^\top \bm d_k < 0, that is, the directional derivative is negative along the direction \bm d_k.
The product above is an inner product of the two vectors \bm d_k and \nabla f(\bm x_k). Recall that it is defined as \nabla f(\bm x_k)^\top \bm d_k = \|\nabla f(\bm x_k)\| \|\bm d_k\| \cos \theta, where \theta is the angle between the gradient and the search direction. This condition has a nice geometric interpretation in a contour plot for an optimization problem in \mathbb R^2. Consider the line tangent to the function contour at \bm x_k. A descent direction must lie in the half-plane generated by the tangent line opposite to the one into which the gradient \nabla f(\bm x_k) points.
Beware that the cost function is only guaranteed to decrease if the length of the step is sufficiently small. For longer steps the higher-order terms in the Taylor series approximation of the cost function can dominate.
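Definition 1 is directly checkable in code. A trivial helper (the function name is ours, not from any library) verifies whether a candidate direction is a descent direction:

using LinearAlgebra # For the dot function.

# True if d is a descent direction at a point where the gradient is ∇fₖ.
is_descent_direction(∇fₖ, d) = dot(∇fₖ, d) < 0

is_descent_direction([2.0, 6.0], [-2.0, -6.0]) # true: the negative gradient always qualifies.
is_descent_direction([2.0, 6.0], [1.0, 0.0])   # false: the directional derivative is positive.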
Before we proceed to the question of which descent direction to choose, we address the question of how far to go along the chosen direction. This is the problem of line search.
Note that once the search direction has been fixed (whether we used the negative of the gradient or any other descent direction), the problem of finding the step length \alpha_k is just a scalar optimization problem. It turns out, however, that instead of finding the true minimum along the search direction, it is often sufficient to find the minimum only approximately, or even to skip minimization altogether and work with a fixed step length.
Here we give some guidance on the choice of the length of the step. But we need to introduce a useful concept first.
Definition 2 (L-smoothness) A continuously differentiable function f is said to be L-smooth if there exists a constant L>0 such that \|\nabla f(x) - \nabla f(y)\| \leq L \|x-y\| for all x and y, that is, if its gradient is Lipschitz continuous with constant L.
Note that if the second derivatives exist, L is an upper bound on the norm of the Hessian, \|\nabla^2 f\|\leq L.
For a quadratic function, L is the norm of the Hessian; for a positive (semi)definite \mathbf Q this is its largest eigenvalue, L = \max_i \lambda_i (\mathbf Q).
The usefulness of the concept of L-smoothness is that it provides a quadratic function that serves as an upper bound for the original function. This is formulated as the following lemma.
Lemma 1 (Descent lemma) Consider an L-smooth function f. Then for any \mathbf x_k and \mathbf x_{k+1}, the following inequality holds f(\mathbf x_{k+1}) \leq f(\mathbf x_{k}) + \nabla f(\mathbf x_k)^\top (\mathbf x_{k+1}-\mathbf x_{k}) + \frac{L}{2}\|\mathbf x_{k+1}-\mathbf x_{k}\|^2
What implication does the result have on the determination of the step length? Substituting the gradient step \bm x_{k+1} = \bm x_{k} - \alpha \nabla f(\bm x_k) into the descent lemma gives f(\bm x_{k+1}) \leq f(\bm x_{k}) - \alpha\left(1-\frac{\alpha L}{2}\right)\|\nabla f(\bm x_k)\|^2, so the cost is guaranteed to decrease for any \alpha\in(0,2/L), and the upper bound on the right is minimized by the fixed step length
\alpha = \frac{1}{L}.
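For a quadratic function the constant L is available explicitly (the largest eigenvalue of \mathbf Q), which suggests the following minimal sketch of gradient descent with the fixed step length (the function name and the data are ours):

using LinearAlgebra

# A sketch of gradient descent with the fixed step α = 1/L for the quadratic
# cost 1/2 x'Qx + c'x, whose gradient is Qx + c.
function gradient_descent_fixed_step(Q, c, x0; N=100)
    L = eigmax(Q)            # L-smoothness constant: the largest eigenvalue of Q.
    α = 1/L                  # Fixed step length suggested by the descent lemma.
    x = x0
    for k = 1:N
        x = x - α*(Q*x + c)  # Guaranteed descent for α ∈ (0, 2/L).
    end
    return x
end

gradient_descent_fixed_step([1.0 0.0; 0.0 3.0], [1.0, 2.0], [2.0, 3.0]) # ≈ -Q\c = [-1, -2/3]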
A number of methods exist: bisection, golden section, Newton, … As finding the true minimum in each iteration is often too computationally costly and hardly needed, we do not cover these methods here. One exception is Newton’s method, which for vector variables constitutes another descent method on its own; we cover it later.
Another exception is the case of a quadratic function in the following example.
Example 1 Here we develop a solution for exact minimization of a quadratic function f(\bm x) = \frac{1}{2} \bm x^\top\mathbf Q \bm x + \mathbf c^\top \bm x along a given direction. We show that it leads to a closed-form formula. Although not particularly useful in practice (for a quadratic function we already know we can find the minimizer by solving a system of linear equations), it is a good exercise in understanding the problem of line search. Furthermore, we will use it later to demonstrate the behaviour of the steepest descent method. The problem is to \operatorname*{minimize}_{\alpha_k} f(\bm x_k + \alpha_k \bm d_k). We express the cost as a function of the current iterate, the direction, and the step length. \begin{aligned} f(\bm x_k + \alpha_k \bm d_k) &= \frac{1}{2}(\bm x_k + \alpha_k\bm d_k)^\top\mathbf Q (\bm x_k + \alpha_k\bm d_k) +\mathbf c^\top(\bm x_k + \alpha_k\bm d_k)\\ &= \frac{1}{2} \bm x_k^\top\mathbf Q \bm x_k + \bm d_k^\top\mathbf Q\bm x_k \alpha_k + \frac{1}{2} \bm d_k^\top\mathbf Q\bm d_k \alpha_k^2+ \mathbf c^\top(\bm x_k + \alpha_k\bm d_k). \end{aligned}
Considering the current iterate and the search direction constant, by differentiating the function with respect to the length of the step, we get \frac{\mathrm{d}f(\bm x_k + \alpha_k\bm d_k)}{\mathrm{d}\alpha_k} = \bm d_k^\top \underbrace{(\mathbf Q\bm x_k + \mathbf c)}_{\nabla f(\bm x_k)} + \bm d_k^\top\mathbf Q\bm d_k \alpha_k.
And now setting the derivative to zero, we find the optimal step length \boxed{ \alpha_k = -\frac{\bm d_k^\top \nabla f(\bm x_k)}{\bm d_k^\top\mathbf Q\bm d_k} = -\frac{\bm d_k^\top (\mathbf Q\bm x_k + \mathbf c)}{\bm d_k^\top\mathbf Q\bm d_k}.} \tag{2}
As we have mentioned, this result will be useful for some benchmarking later.
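For later reference, Eq. 2 is a one-liner in code (a sketch; the function name is ours):

using LinearAlgebra

# Exact step length (Eq. 2) minimizing 1/2 x'Qx + c'x along the direction d.
# For d = -∇f(x) it reduces to dot(∇f,∇f)/dot(∇f,Q*∇f), which we use later.
exact_step_quadratic(Q, c, x, d) = -dot(d, Q*x + c)/dot(d, Q*d)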
There are several methods for approximate line search. Here we describe the backtracking algorithm, which is based on the sufficient decrease condition (also known as the Armijo condition), which reads f(\bm x_k+\alpha_k\bm d_k) - f(\bm x_k) \leq \gamma \alpha_k \bm d_k^\top \nabla f(\bm x_k), where \gamma\in(0,1) is typically very small, say \gamma = 10^{-4}.
The term on the right can be viewed as a linear function of \alpha_k. Its negative slope is a bit less steep than the directional derivative of the function f at \bm x_k along \bm d_k. The condition of sufficient decrease thus requires that the cost function (as a function of \alpha_k) lies below the graph of this linear function.
Now, the backtracking algorithm is parameterized by three parameters: the initial step length \alpha_0>0, the typically very small \gamma\in(0,1) that parameterizes the Armijo condition, and yet another parameter \beta\in(0,1).
The k-th iteration of the algorithm goes like this: failure of the sufficient decrease condition for a given \alpha_k or, equivalently, satisfaction of the condition f(\bm x_k) - f(\bm x_k+\alpha_k\bm d_k) < -\gamma \alpha_k \bm d_k^\top \nabla f(\bm x_k) sends the algorithm into another reduction of \alpha_k by \alpha_k = \beta\alpha_k. A reasonable choice for \beta is 0.5, which corresponds to halving the step length upon failure to decrease sufficiently.
The backtracking algorithm can be implemented as follows
using LinearAlgebra # For the dot function.

function backtracking_line_search(f, ∇fₖ, xₖ, dₖ; α₀=1.0, β=0.5, γ=0.1)
    αₖ = α₀
    # Shrink the step length until the sufficient decrease (Armijo) condition holds.
    while f(xₖ) - f(xₖ + αₖ*dₖ) < -γ*αₖ*dot(dₖ, ∇fₖ)
        αₖ *= β
    end
    return αₖ
end
backtracking_line_search (generic function with 1 method)
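A quick sanity check of the routine on a quadratic function, with the negative gradient as the search direction (the test data is ours):

Q = [1 0; 0 3]; c = [1, 2]
fq(x) = 1/2*dot(x, Q*x) + dot(c, x)        # Quadratic test function.
xₖ = [2.0, 3.0]
∇fₖ = Q*xₖ + c                             # Gradient of fq at xₖ.
dₖ = -∇fₖ                                  # Steepest descent direction.
backtracking_line_search(fq, ∇fₖ, xₖ, dₖ)  # Returns 0.5 for this data: α = 1 fails the test once.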
Now we are ready to proceed to the question of choosing a descent direction.
A natural candidate for a descent direction is the negative of the gradient \bm d_k = -\nabla f(\bm x_k).
In fact, among all descent directions, this is the one for which the descent is steepest (the gradient determines the direction of steepest ascent), though we will see later that this does not mean that the convergence of the method is the fastest.
In each iteration of the gradient method, this is how the solution is updated
\boxed{ \bm x_{k+1} = \bm x_{k} - \alpha_k \nabla f(\bm x_{k}),} where the determination of the step length \alpha_k has already been discussed in the previous section.
Let’s now examine the behaviour of the method by applying it to minimization of a quadratic function. Well, for a quadratic function it is obviously overkill, but we use it in the example because we can compute the step length exactly using Eq. 2, which then helps the method show its best performance.
Example 2 (Steepest descent method for a quadratic function with exact line search)
using LinearAlgebra # For dot() function.
using Printf # For formatted output.
x0 = [2, 3] # Initial vector.
Q = [1 0; 0 3] # Positive definite matrix defining the quadratic form.
c = [1, 2] # Vector defining the linear part.
xs = -Q\c # Stationary point, automatically the minimizer for posdef Q.
ϵ = 1e-5 # Threshold on the norm of the gradient.
N = 100; # Maximum number of steps.
function gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
    x = x0
    iter = 0
    f = 1/2*dot(x,Q*x)+dot(x,c)
    ∇f = Q*x+c
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)   # Exact step length from Eq. 2 with d = -∇f.
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        @printf("i = %3d   ||∇f(x)|| = %6.4e   f(x) = %6.4e\n", iter, norm(∇f), f)
        if iter >= N
            return f,x
        end
    end
    return f,x
end
fopt,xopt = gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
i = 1 ||∇f(x)|| = 2.0229e+00 f(x) = 7.8495e-01
i = 2 ||∇f(x)|| = 9.0210e-01 f(x) = -1.0123e+00
i = 3 ||∇f(x)|| = 1.6005e-01 f(x) = -1.1544e+00
i = 4 ||∇f(x)|| = 7.1374e-02 f(x) = -1.1657e+00
i = 5 ||∇f(x)|| = 1.2663e-02 f(x) = -1.1666e+00
i = 6 ||∇f(x)|| = 5.6470e-03 f(x) = -1.1667e+00
i = 7 ||∇f(x)|| = 1.0019e-03 f(x) = -1.1667e+00
i = 8 ||∇f(x)|| = 4.4679e-04 f(x) = -1.1667e+00
i = 9 ||∇f(x)|| = 7.9269e-05 f(x) = -1.1667e+00
i = 10 ||∇f(x)|| = 3.5350e-05 f(x) = -1.1667e+00
i = 11 ||∇f(x)|| = 6.2718e-06 f(x) = -1.1667e+00
(-1.1666666666479069, [-0.9999939492423319, -0.6666672167355456])
We can also decorate the code a bit to visualize how the iterations proceeded.
function gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)
    x = x0
    X = x                             # Stores all iterates as columns.
    f = 1/2*dot(x,Q*x)+dot(x,c)
    F = [f,]                          # Stores the cost at all iterates.
    ∇f = Q*x+c
    iter = 0
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        X = hcat(X,x)
        push!(F,f)
        if iter >= N
            return F,X
        end
    end
    return F,X
end
F,X = gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)
x1_grid = x2_grid = -4:0.01:4;
f(x) = 1/2*dot(x,Q*x)+dot(x,c)
z_grid = [f([x1,x2]) for x2=x2_grid, x1=x1_grid];
using Plots
contour(x1_grid,x2_grid,z_grid)
plot!(X[1,:],X[2,:],label="xk",marker=:diamond,aspect_ratio=1)
scatter!([x0[1],],[x0[2],],label="x0")
scatter!([xs[1],],[xs[2],],label="xopt")
xlabel!("x1");ylabel!("x2");
xlims!(-4,4); ylims!(-4,4)

Although the number of iterations in the above example is acceptable, a major characteristic of the method is already visible: its convergence slows down as we approach a local minimum, which is visually recognizable as oscillations, or zig-zagging. And it can be much worse for some data, as the next example shows.
Example 3 (Steepest descent method for an ill-conditioned quadratic function with exact line search) Consider minimization of the quadratic cost function f(\bm x) = \frac{1}{2}\bm x^\top \mathbf Q \bm x = 500x_1^2 + 20x_1x_2 + \frac{1}{2}x_2^2, with the matrix \mathbf Q defined in the code below.
Q = [1000 20; 20 1]
c = [0, 0]
x0 = [1,1000]
F,X = gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
i = 1 ||∇f(x)|| = 5.9951e+02 f(x) = 2.9939e+05
i = 2 ||∇f(x)|| = 1.2093e+04 f(x) = 1.7221e+05
i = 3 ||∇f(x)|| = 3.4484e+02 f(x) = 9.9052e+04
i = 4 ||∇f(x)|| = 6.9560e+03 f(x) = 5.6974e+04
i = 5 ||∇f(x)|| = 1.9835e+02 f(x) = 3.2771e+04
i = 6 ||∇f(x)|| = 4.0011e+03 f(x) = 1.8850e+04
i = 7 ||∇f(x)|| = 1.1409e+02 f(x) = 1.0842e+04
i = 8 ||∇f(x)|| = 2.3014e+03 f(x) = 6.2364e+03
i = 9 ||∇f(x)|| = 6.5623e+01 f(x) = 3.5872e+03
i = 10 ||∇f(x)|| = 1.3237e+03 f(x) = 2.0633e+03
i = 11 ||∇f(x)|| = 3.7746e+01 f(x) = 1.1868e+03
i = 12 ||∇f(x)|| = 7.6141e+02 f(x) = 6.8264e+02
i = 13 ||∇f(x)|| = 2.1711e+01 f(x) = 3.9265e+02
i = 14 ||∇f(x)|| = 4.3796e+02 f(x) = 2.2585e+02
i = 15 ||∇f(x)|| = 1.2488e+01 f(x) = 1.2991e+02
i = 16 ||∇f(x)|| = 2.5191e+02 f(x) = 7.4722e+01
i = 17 ||∇f(x)|| = 7.1831e+00 f(x) = 4.2980e+01
i = 18 ||∇f(x)|| = 1.4490e+02 f(x) = 2.4722e+01
i = 19 ||∇f(x)|| = 4.1317e+00 f(x) = 1.4220e+01
i = 20 ||∇f(x)|| = 8.3344e+01 f(x) = 8.1791e+00
i = 21 ||∇f(x)|| = 2.3765e+00 f(x) = 4.7046e+00
i = 22 ||∇f(x)|| = 4.7939e+01 f(x) = 2.7061e+00
i = 23 ||∇f(x)|| = 1.3670e+00 f(x) = 1.5565e+00
i = 24 ||∇f(x)|| = 2.7574e+01 f(x) = 8.9529e-01
i = 25 ||∇f(x)|| = 7.8627e-01 f(x) = 5.1497e-01
i = 26 ||∇f(x)|| = 1.5861e+01 f(x) = 2.9621e-01
i = 27 ||∇f(x)|| = 4.5226e-01 f(x) = 1.7038e-01
i = 28 ||∇f(x)|| = 9.1229e+00 f(x) = 9.7999e-02
i = 29 ||∇f(x)|| = 2.6014e-01 f(x) = 5.6369e-02
i = 30 ||∇f(x)|| = 5.2474e+00 f(x) = 3.2423e-02
i = 31 ||∇f(x)|| = 1.4963e-01 f(x) = 1.8649e-02
i = 32 ||∇f(x)|| = 3.0183e+00 f(x) = 1.0727e-02
i = 33 ||∇f(x)|| = 8.6065e-02 f(x) = 6.1701e-03
i = 34 ||∇f(x)|| = 1.7361e+00 f(x) = 3.5490e-03
i = 35 ||∇f(x)|| = 4.9504e-02 f(x) = 2.0414e-03
i = 36 ||∇f(x)|| = 9.9859e-01 f(x) = 1.1742e-03
i = 37 ||∇f(x)|| = 2.8475e-02 f(x) = 6.7539e-04
i = 38 ||∇f(x)|| = 5.7439e-01 f(x) = 3.8848e-04
i = 39 ||∇f(x)|| = 1.6378e-02 f(x) = 2.2345e-04
i = 40 ||∇f(x)|| = 3.3038e-01 f(x) = 1.2853e-04
i = 41 ||∇f(x)|| = 9.4207e-03 f(x) = 7.3928e-05
i = 42 ||∇f(x)|| = 1.9003e-01 f(x) = 4.2523e-05
i = 43 ||∇f(x)|| = 5.4188e-03 f(x) = 2.4459e-05
i = 44 ||∇f(x)|| = 1.0931e-01 f(x) = 1.4069e-05
i = 45 ||∇f(x)|| = 3.1168e-03 f(x) = 8.0922e-06
i = 46 ||∇f(x)|| = 6.2873e-02 f(x) = 4.6546e-06
i = 47 ||∇f(x)|| = 1.7928e-03 f(x) = 2.6773e-06
i = 48 ||∇f(x)|| = 3.6164e-02 f(x) = 1.5400e-06
i = 49 ||∇f(x)|| = 1.0312e-03 f(x) = 8.8578e-07
i = 50 ||∇f(x)|| = 2.0801e-02 f(x) = 5.0949e-07
i = 51 ||∇f(x)|| = 5.9314e-04 f(x) = 2.9306e-07
i = 52 ||∇f(x)|| = 1.1965e-02 f(x) = 1.6856e-07
i = 53 ||∇f(x)|| = 3.4117e-04 f(x) = 9.6958e-08
i = 54 ||∇f(x)|| = 6.8821e-03 f(x) = 5.5769e-08
i = 55 ||∇f(x)|| = 1.9624e-04 f(x) = 3.2078e-08
i = 56 ||∇f(x)|| = 3.9585e-03 f(x) = 1.8451e-08
i = 57 ||∇f(x)|| = 1.1288e-04 f(x) = 1.0613e-08
i = 58 ||∇f(x)|| = 2.2769e-03 f(x) = 6.1045e-09
i = 59 ||∇f(x)|| = 6.4925e-05 f(x) = 3.5113e-09
i = 60 ||∇f(x)|| = 1.3097e-03 f(x) = 2.0197e-09
i = 61 ||∇f(x)|| = 3.7345e-05 f(x) = 1.1617e-09
i = 62 ||∇f(x)|| = 7.5331e-04 f(x) = 6.6821e-10
i = 63 ||∇f(x)|| = 2.1480e-05 f(x) = 3.8435e-10
i = 64 ||∇f(x)|| = 4.3330e-04 f(x) = 2.2107e-10
i = 65 ||∇f(x)|| = 1.2355e-05 f(x) = 1.2716e-10
i = 66 ||∇f(x)|| = 2.4923e-04 f(x) = 7.3142e-11
i = 67 ||∇f(x)|| = 7.1068e-06 f(x) = 4.2071e-11
(4.207097012988499e-11, [-2.37187654299357e-7, 1.1842143766173123e-5])
While for the previous problem of the same kind and size the steepest descent method converged in just a few steps, for this particular data it takes many dozens of steps.
The culprit here is the poor conditioning of the Hessian matrix Q. By “poor conditioning” we mean the so-called ill-conditioning, which is reflected in a very high condition number. Recall that the condition number \kappa of a given matrix \mathbf A is defined as
\kappa(\mathbf A) = \|\mathbf A^{-1}\|\cdot \|\mathbf A\|,
and it can be computed as the ratio of the largest and smallest singular values, that is,
\kappa(\mathbf A) = \frac{\sigma_{\max}(\mathbf A)}{\sigma_{\min}(\mathbf A)}.
Ideally this number should not be much larger than 1. In the example above it is
cond(Q)
1668.0010671466664
which is well above 1000. Is there anything that we can do about it? The answer is yes. We can scale the original data to improve the conditioning.
Upon introducing a matrix \mathbf S that relates the original vector variable \bm x with a new vector variable \bm y according to \bm x = \mathbf S \bm y, the optimization cost function changes from f(\bm x) to f(\mathbf S \bm y). Let’s rename the latter to g(\bm y). We will now examine how the steepest descent iteration changes. Straightforward application of the chain rule for differentiating composite functions yields g'(\bm y) = \frac{\mathrm{d}}{\mathrm{d}\bm y}f(\mathbf S\bm y) = f'(\mathbf S\bm y)\,\mathbf S.
Keeping in mind that gradients are transposes of derivatives, we can write \nabla g(\bm y) = \mathbf S^\top \nabla f(\mathbf S\bm y).
Steepest descent iterations then change accordingly
\begin{aligned} \bm y_{k+1} &= \bm y_k - \alpha_k \nabla g(\bm y_k)\\ \bm y_{k+1} &= \bm y_k - \alpha_k \mathbf S^\top\nabla f(\mathbf S \bm y_k)\\ \underbrace{\mathbf S \bm y_{k+1}}_{\bm x_{k+1}} &= \underbrace{\mathbf S\bm y_k}_{\bm x_k} - \alpha_k \underbrace{\mathbf S \mathbf S^\top}_{\mathbf D}\nabla f(\underbrace{\mathbf S \bm y_k}_{\bm x_k}). \end{aligned}
Upon renaming the product \mathbf S \mathbf S^\top as a scaling matrix \mathbf D, a single iteration changes to \boxed{\bm x_{k+1} = \bm x_{k} - \alpha_k \mathbf D\nabla f(\bm x_{k}).} \tag{3}
The key question now is: how to choose the matrix \mathbf D?
We would like to make the Hessian of the transformed problem, \nabla^2 g(\bm y) = \mathbf S^\top \nabla^2 f(\mathbf S \bm y)\,\mathbf S (which in the case of a quadratic cost is \mathbf S^\top \mathbf Q\, \mathbf S with the matrix \mathbf Q as we used it above), better conditioned. Ideally, \nabla^2 g(\bm y)\approx \mathbf I.
A simple way of improving the conditioning is to define the scaling matrix \mathbf D as a diagonal matrix whose diagonal entries are \mathbf D_{ii} = 1/[\nabla^2 f(\bm x_k)]_{ii}.
In words, the diagonal entries of the Hessian matrix are inverted and they then form the diagonal of the scaling matrix.
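As a sketch of how Eq. 3 with this diagonal scaling might look in code (the function name is ours; we assume the diagonal entries of the Hessian are nonzero), consider again the quadratic cost, whose Hessian is the constant matrix Q:

using LinearAlgebra

# One diagonally scaled gradient step (Eq. 3) for the cost 1/2 x'Qx + c'x.
function scaled_gradient_step(Q, c, x, α)
    D = Diagonal(1 ./ diag(Q))  # Inverted diagonal entries of the Hessian.
    return x - α*D*(Q*x + c)    # Scaled steepest descent step.
end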
It is worth emphasizing how the algorithm changed: the direction of steepest descent (the negative of the gradient) is premultiplied by some (scaling) matrix. We will see in a few moments that another method – Newton’s method – has a perfectly identical structure.
Newton’s method is one of the flagship algorithms in numerical computing. With confidence I include it in my personal Top 10 list of algorithms relevant for engineers and scientists. We may encounter the method in two settings: as a method for solving (systems of) nonlinear equations, and as a method for optimization.
The two are inherently related, and it is useful to be able to see the connection.
The problem to be solved is that of finding x\in\mathbb R for which a given function g() vanishes. In other words, we solve the following equation g(x) = 0.
The scalar version stated above also has a vector extension \mathbf g(\bm x) = \mathbf 0, in which \bm x stands for an n-tuple of variables, that is, \bm x\in \mathbb R^n, and \mathbf g() denotes an n-tuple of functions. An even more general version allows for different numbers of variables and equations.
We start with the scalar version. A single iteration of the method evaluates not only the value g(x_k) of the function g at the given point x_k but also its derivative g'(x_k). It then uses the two to approximate the function g() at x_k by a linear (actually affine) function and computes the intersection of this approximating function with the horizontal axis. This gives us x_{k+1}, that is, the (k+1)-th approximation to a solution (root). We can write this down as \begin{aligned} \underbrace{g(x_{k+1})}_{0} &= g(x_{k}) + g'(x_{k})(x_{k+1}-x_k)\\ 0 &= g(x_{k}) + g'(x_{k})x_{k+1}-g'(x_{k})x_k, \end{aligned} from which the famous formula follows \boxed{x_{k+1} = x_{k} - \frac{g(x_k)}{g'(x_k)}.}
In the vector form, the formula is \boxed{\bm x_{k+1} = \bm x_{k} - [\nabla \mathbf g(\bm x_k)^\top]^{-1}\mathbf g(\bm x_k),} where \nabla \mathbf g(\bm x_k)^\top is the (Jacobian) matrix of the first derivatives of \mathbf g at \bm x_k, that is, \nabla \mathbf g() is a matrix with the gradient of the g_i(\bm x) function as its i-th column.
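A minimal sketch of the vector version, with the Jacobian supplied by ForwardDiff (the stopping rule, the function name, and the example system are our choices):

using ForwardDiff, LinearAlgebra

# A sketch of Newton's method for solving the system g(x) = 0.
function newton_rootfinding(g, x₀; ϵ=1e-10, N=100)
    x = x₀
    for k = 1:N
        norm(g(x)) ≤ ϵ && break
        x = x - ForwardDiff.jacobian(g, x)\g(x)  # Newton step for the system.
    end
    return x
end

g(x) = [x[1]^2 + x[2]^2 - 1, x[1] - x[2]]  # A unit circle intersected with a line.
newton_rootfinding(g, [1.0, 0.0])          # ≈ [√2/2, √2/2]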
Once again, we restrict ourselves to a scalar case first. The problem is \operatorname*{minimize}_{x\in\mathbb{R}}\quad f(x).
At the k-th iteration of the algorithm, the solution is x_k. The function to be minimized is approximated at x_k by a quadratic function m_k(). In order to find parameterization of this quadratic function, not only the function f() itself but also its first and second derivatives, f'() and f''(), respectively, must be evaluated at x_k. Using these three, a function m_k(x) approximating f(x) at some x not too far from x_k can be defined m_k(x) = f(x_k) + f'(x_k)(x-x_k) + \frac{1}{2}f''(x_k)(x-x_k)^2.
The problem of minimizing this new function in the k-th iteration is then formulated, namely,
\operatorname*{minimize}_{x_{k+1}\in\mathbb{R}}\quad m_k(x_{k+1}).
The way to find this solution is straightforward: find the derivative of m_k() and find the value of x_{k+1} for which this derivative vanishes. The result is \boxed{x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)}.}
Example 4 (Newton’s method for minimization in the scalar case) We consider the scalar function of a single (real) variable
# f(x) = 3cos(x)*sin(2x+2)^2
f(x) = -x^4 + 12x^3 - 47x^2 + 60x
f (generic function with 1 method)
Let’s have a look at its graph on some interval.
x_grid = 1:0.01:5.5
f_grid = f.(x_grid)
using Plots
plot(x_grid,f_grid,label="f(x)",xlabel="x")

Now, the essence of the method is that at a given point xₖ we approximate the function locally by a quadratic function. Say, we choose
#xₖ = 0.25
xₖ = 2.4
2.4
We can evaluate at this point not only the function but its first and second derivatives as well. We use one of the packages for automatic differentiation (AD).
using ForwardDiff
Df = x -> ForwardDiff.derivative(f,x)
D²f = x -> ForwardDiff.derivative(Df,x)
#6 (generic function with 1 method)
Now, we can write the quadratic “model function” as
mₖ(x) = f(xₖ) + Df(xₖ)*(x-xₖ) + 1/2*D²f(xₖ)*(x-xₖ)^2
mₖ (generic function with 1 method)
We now evaluate this model on the interval and plot it:
mₖ_grid = mₖ.(x_grid)
plot!(x_grid,mₖ_grid,label="mₖ(x)")
scatter!([xₖ,],[f(xₖ),],label="f(xₖ)")

From the formula for Newton’s step we get that
xₖ₊₁ = xₖ - Df(xₖ)/D²f(xₖ)
3.798347107438017
scatter!([xₖ₊₁],[f(xₖ₊₁)],label="f(xₖ₊₁)")

Let’s go for another iteration:
plot(x_grid,f_grid,label="f(x)",xlabel="x")
mₖ₊₁(x) = f(xₖ₊₁) + Df(xₖ₊₁)*(x-xₖ₊₁) + 1/2*D²f(xₖ₊₁)*(x-xₖ₊₁)^2
mₖ₊₁_grid = mₖ₊₁.(x_grid)
plot!(x_grid,mₖ₊₁_grid,label="mₖ₊₁(x)")
scatter!([xₖ,],[f(xₖ),],label="f(xₖ)")
scatter!([xₖ₊₁,],[f(xₖ₊₁),],label="f(xₖ₊₁)")
xₖ₊₂ = xₖ₊₁ - Df(xₖ₊₁)/D²f(xₖ₊₁)
scatter!([xₖ₊₂],[f(xₖ₊₂)],label="f(xₖ₊₂)")

And the iterations would go on…
Before we leave this example, you are invited to experiment with setting different initial values of x.
The vector version of the Newton’s step is \boxed{\bm x_{k+1} = \bm x_{k} - [\nabla^2 f(\bm x_k)]^{-1} \nabla f(\bm x_k).}
Although the mathematical formula contains a symbol for the inverse of a matrix, computationally it is better to formulate this computation in a way that a system of linear equations is solved. Namely, denoting the Newton step (our search direction) by \bm d_k = -[\nabla^2 f(\bm x_k)]^{-1} \nabla f(\bm x_k), we can find \bm d_k by solving the following system of linear equations
\nabla^2 f(\bm x_k) \bm d_k = -\nabla f(\bm x_k).
Indeed, discussion forums for various programming languages and environments abound with questions about how to find the inverse of a matrix. The answer is almost always: are you sure you really need it? Most probably what you need is to solve a system of linear equations, and that is a (slightly) different task.
Example 5 (Newton’s method for minimization in the vector case) Consider the function
f(x) = (x[1]+1)^4 + x[1]*x[2] + (x[2]+1)^4
f (generic function with 1 method)
The graph of the function is in Fig. 2 below.
x₁_grid = x₂_grid = -3:0.1:3;
f_grid = [f([x₁,x₂]) for x₁=x₁_grid, x₂=x₂_grid];
using Plots
surface(x₁_grid,x₂_grid,f_grid') # Note the transpose.
xlabel!("x₁")
ylabel!("x₂")
xlims!(-3,3)
ylims!(-3,3)

Some more insight can perhaps be obtained by plotting the contours as in Fig. 3.
contour(x₁_grid,x₂_grid,f_grid',levels=50) # Note the transpose.
xlabel!("x₁")
ylabel!("x₂")
xlims!(-3,3)
ylims!(-3,3)

We choose to compute the gradient manually/symbolically:
∇f(x) = [4*(1 + x[1])^3 + x[2], x[1] + 4*(1 + x[2])^3]
∇f (generic function with 1 method)
and similarly we do for the Hessian:
∇²f(x) = [12*(1 + x[1])^2 1; 1 12*(1 + x[2])^2]
∇²f (generic function with 1 method)
We now consider a particular value of the vector variable and use not only the function value but also the gradient and the Hessian to compute the local quadratic model:
xₖ = [0.7,0.5]
using LinearAlgebra # Because of the dot function
mₖ(x) = f(xₖ) + dot(∇f(xₖ),(x-xₖ)) + 1/2*dot((x-xₖ),∇²f(xₖ),(x-xₖ))
mₖ (generic function with 1 method)
The contours of the local quadratic model and the contours of the original function can be plotted together as in Fig. 4.
mₖ_grid = [mₖ([x₁,x₂]) for x₁=x₁_grid, x₂=x₂_grid];
contour!(x₁_grid, x₂_grid, mₖ_grid',levels=10)
scatter!([xₖ[1],],[xₖ[2],],label="xₖ")

The next step of Newton’s method consists in finding the minimum of the quadratic model, that is, in locating the center of the contour ellipses. It can be guessed that this minimum will be located to the bottom left of the initial point, which is consistent with the contours of the original cost function.
xₖ₊₁ = xₖ - ∇²f(xₖ)\∇f(xₖ)
scatter!([xₖ₊₁[1],],[xₖ₊₁[2],],label="xₖ₊₁")

We now compute a whole sequence of such local minimizers of quadratic models. In other words, we demonstrate the functionality of a basic implementation of Newton’s method.
using Printf
function newton_method(f,∇f,∇²f,x₀,ϵ,N)
    x = x₀
    k = 0
    while (norm(∇f(x)) > ϵ)
        k = k+1
        x = x - ∇²f(x)\∇f(x)   # Newton step computed by solving a linear system.
        @printf("iter = %3d   ||∇f|| = %17.6f   f = %19.6f \n",k,norm(∇f(x)),f(x))
        if k >= N
            return f(x),x
        end
    end
    return f(x),x
end
newton_method (generic function with 1 method)
After setting the termination parameters we are ready to start the iterative algorithm:
ϵ = 1e-8
N = 100
x₀ = xₖ
f_opt,x_opt = newton_method(f,∇f,∇²f,x₀,ϵ,N)
iter = 1 ||∇f|| = 7.104527 f = 2.630340
iter = 2 ||∇f|| = 1.872767 f = 0.650808
iter = 3 ||∇f|| = 0.365832 f = 0.390200
iter = 4 ||∇f|| = 0.034334 f = 0.375189
iter = 5 ||∇f|| = 0.000780 f = 0.375000
iter = 6 ||∇f|| = 0.000001 f = 0.375000
iter = 7 ||∇f|| = 0.000000 f = 0.375000
(0.375, [-0.4999999999999264, -0.5000000000000205])
Finally, plot the minimizer into the contour plots.
scatter!([x_opt[1],],[x_opt[2],],label="x⋆")

In our straightforward implementation of Newton’s method, we formulated a linear system of equations \mathbf A\bm x = \mathbf b, and we used the backslash operator (introduced by K. Hensel in 1929, adopted and popularized by Matlab, and then imported by other languages, including Julia) to find the solution by writing x = A\b. Although there is some decision tree behind the backslash operator (for example, depending on the shape of the matrix), a more detailed analysis of the properties of the matrix is typically not performed automatically when calling x = A\b. But the Hessian matrix \nabla^2 f(\bm x_k) (playing the role of the matrix \mathbf A in the previous sentences) does possess important properties. First, it is symmetric. This can be exploited to solve the system of equations more efficiently. Second, well-behaved Hessians are positive definite in the vicinity of a local minimum. For this class of matrices, an efficient way to solve the corresponding system of equations is to use the Cholesky decomposition (factorization). (By the way, the pronunciation is ʃəˈlɛski.) A real positive-definite matrix \mathbf A can be decomposed as
\mathbf A = \bm L \bm L^\top, where \bm L is a lower triangular matrix.
Example 6 (Solving a system of linear equations with a symmetric positive definite matrix using Cholesky decomposition)
A = [10 1 0; 1 20 3; 0 3 30]
b = [1, 2, 3]
using LinearAlgebra
C = cholesky(A)
Cholesky{Float64, Matrix{Float64}}
U factor:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
3.16228 0.316228 0.0
⋅ 4.46094 0.672504
⋅ ⋅ 5.43578
C is a structure that contains the decomposition. We can access the lower triangular matrix \bm L as
L = C.L
3×3 LowerTriangular{Float64, Matrix{Float64}}:
3.16228 ⋅ ⋅
0.316228 4.46094 ⋅
0.0 0.672504 5.43578
Having computed the lower triangular matrix \bm L, we can write the original problem as \bm L \bm L^\top \bm x = \mathbf b, which can be reformulated into solving two triangular systems by simple forward and back substitution: \begin{aligned} \bm L \bm y &= \mathbf b,\\ \bm L^\top \bm x &= \bm y. \end{aligned}
y = L\b
3-element Vector{Float64}:
0.31622776601683794
0.42591904767910926
0.49920457702436966
x = L'\y
3-element Vector{Float64}:
0.09183673469387756
0.08163265306122447
0.09183673469387756
Both steps are realized by the following single line of code in Julia, but note that behind the scenes it really solves the two triangular systems by substitution
x = C\b
3-element Vector{Float64}:
0.09183673469387756
0.08163265306122447
0.09183673469387757
You can verify by yourself (using @btime macro from the BenchmarkTools package) that the decomposition followed by solving two triangular systems is faster than calling a general solver for linear systems of equations.
Back to Newton’s method. We can make a few observations:
If compared to the general prescription for descent direction methods (as described in Eq. 1), Newton’s method determines the direction and the step length at once (both \alpha_k and \mathbf d_k are contained in the term - [\nabla^2 f(\mathbf x_k)]^{-1} \nabla f(\mathbf x_k)).
If compared with steepest descent (gradient) method, especially with its scaled version in Eq. 3, Newton’s method fits into the framework nicely because the inverse [\nabla^2 f(\mathbf x_k)]^{-1} of the Hessian can be regarded as a particular choice of a scaling matrix \mathbf D. Indeed, you can find arguments in some textbooks that Newton’s method involves scaling that is optimal in some sense. We skip the details here because we only wanted to highlight the similarity in the structure of the two methods.
The great popularity of Newton’s method is mainly due to its fast convergence – quadratic. Although we skip any discussion of convergence rates here, note that for all other methods this is an ideal to be approached.
The favourable convergence rate of Newton’s method is paid for by a few disadvantages: the Hessian must be computed (and a linear system solved) in every iteration; away from a minimum the Hessian may fail to be positive definite, in which case the Newton step need not be a descent step; and the method may diverge when initialized far from a solution.
The last two issues are handled by the following modifications of the standard Newton’s method.
A parameter \alpha\in(0,1) is introduced that shortens the step as in \bm x_{k+1} = \bm x_{k} - \alpha\left[\nabla^2 f(\bm x_k)\right]^{-1} \nabla f(\bm x_k).
Another modification adds a positive multiple \lambda>0 of the identity matrix to the Hessian, which for sufficiently large \lambda makes the matrix positive definite: \bm x_{k+1} = \bm x_{k} - \left[\nabla^2 f(\bm x_k)+ \lambda \mathbf I\right]^{-1} \nabla f(\bm x_k).
Yet another modification replaces the (inverse of the) Hessian with some constant matrix \mathbf B, so that the step is determined as \bm x_{k+1} = \bm x_{k} - \mathbf B \nabla f(\bm x_k).
Note that the interpretation of the constant \mathbf B in the position of the (inverse of the) Hessian in the rootfinding setting is that the slope of the approximating linear (affine) function is kept constant.
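These modifications amount to one-line changes of the basic Newton update. As a sketch of the regularized variant (the function name is ours, and λ is left as a tuning parameter):

using LinearAlgebra

# One step of a regularized Newton's method. Adding λI makes the system matrix
# positive definite for λ large enough, interpolating between the pure Newton
# step (λ → 0) and a short gradient step (λ large).
regularized_newton_step(∇f, ∇²f, x, λ) = x - (∇²f(x) + λ*I)\∇f(x)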
Now that we have admitted something other than just the (inverse of the) Hessian into the formula for Newton’s method, we can explore this new freedom further. This will bring us to a family of methods called quasi-Newton methods.
#TODO In the meantime, have a look at [1, Sec. 4.4.4], or [2, Sec. 6.3], or [3, Ch. 13].
Similarly to what we did when introducing Newton’s method, we start our exposition with solving equations. Quasi-Newton methods (indeed, the plural is appropriate here because there is a whole family of methods under this name) generalize the key idea behind the (scalar) secant method for rootfinding. Let’s recall it here. The method is based on the secant approximation of the derivative: f'(x_k) \approx \frac{f(x_k)-f(x_{k-1})}{x_k-x_{k-1}}.
We substitute this approximation into Newton’s formula x_{k+1} = x_k - \underbrace{\frac{x_k-x_{k-1}}{f(x_k)-f(x_{k-1})}}_{\approx 1/f'(x_k)}f(x_k).
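In code, the scalar secant iteration is a direct transcription of this formula (a minimal sketch; the tolerance and the two starting points are our choices):

# A sketch of the secant method for solving g(x) = 0 in the scalar case.
function secant_method(g, x₀, x₁; ϵ=1e-12, N=100)
    for k = 1:N
        abs(g(x₁)) ≤ ϵ && break
        # The derivative in Newton's formula is replaced by the slope of the
        # secant through the two most recent iterates.
        x₀, x₁ = x₁, x₁ - (x₁ - x₀)/(g(x₁) - g(x₀))*g(x₁)
    end
    return x₁
end

secant_method(x -> x^2 - 2, 1.0, 2.0) # ≈ √2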
Transitioning from scalar rootfinding to optimization is as easy as increasing the order of the derivatives in the formula
f''(x_k) \approx \frac{f'(x_k)-f'(x_{k-1})}{x_k-x_{k-1}} =: b_k, which can be rewritten into the secant condition b_k (\underbrace{x_k-x_{k-1}}_{s_{k-1}}) = \underbrace{f'(x_k)-f'(x_{k-1})}_{y_{k-1}}.
The vector version of the secant condition is
\begin{aligned}\boxed{ \bm B_{k+1} \mathbf s_k = \mathbf y_k}, \end{aligned} where \bm B_{k+1} is a matrix (to be determined) with Hessian-like properties
\bm B_{k+1} = \bm B_{k+1}^\top, \qquad \bm B_{k+1} \succ \mathbf 0.
How can we get it? Computing the matrix anew at every step is not computationally efficient. The preferred way is to compute the matrix \bm B_{k+1} just by adding as small an update as possible to the matrix \bm B_{k} computed in the previous step: \bm B_{k+1} = \bm B_{k} + \text{small update}.
Several update schemes are documented in the literature. Particularly attractive are schemes that directly update the inverse \bm H_{k+1} := \bm B_{k+1}^{-1}, since it is the inverse that appears in the step computation. One popular update is BFGS:
\boxed{ \begin{aligned} \bm H_{k+1} &= \bm H_{k} + \left(1+\frac{\mathbf y_k^\top \bm H_k \mathbf y_k}{\mathbf s_k^\top\mathbf y_k}\right)\cdot\frac{\mathbf s_k\mathbf s_k^\top}{\mathbf s_k^\top \mathbf y_k} - \frac{\mathbf s_k \mathbf y_k^\top \bm H_k + \bm H_k\mathbf y_k \mathbf s_k^\top}{\mathbf y_k^\top \mathbf s_k}. \end{aligned}}
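To indicate how the update is used, here is a bare-bones sketch of a quasi-Newton iteration with the BFGS update of the inverse Hessian approximation \bm H_k (our own illustrative code: it uses a fixed step length, whereas a practical implementation would use a line search that also guarantees the curvature condition \mathbf s_k^\top \mathbf y_k > 0):

using LinearAlgebra

function bfgs_sketch(∇f, x₀; α=0.5, ϵ=1e-8, N=1000)
    x = x₀
    H = Matrix{Float64}(I, length(x), length(x)) # Initial inverse Hessian approximation.
    for k = 1:N
        g = ∇f(x)
        norm(g) ≤ ϵ && break
        s = -α*H*g                               # Quasi-Newton step.
        y = ∇f(x + s) - g                        # Difference of consecutive gradients.
        x = x + s
        ρ = dot(s, y)                            # Must stay positive for H to remain posdef.
        H = H + (1 + dot(y, H*y)/ρ)*(s*s')/ρ - (s*(y'*H) + (H*y)*s')/ρ # BFGS update.
    end
    return x
end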
#TODO In the meantime, have a look at [1, Sec. 4.5], or [2, Sec. 4.4].
The key concept of trust region methods is that of… a trust region. A trust region is a region (typically a ball or an ellipsoid) around the current point in which we trust some approximation of the original cost function. We then find the minimum of this approximating function subject to the constraint on the norm of the step. Typically, the approximating function that is simple enough to minimize is a quadratic one, m_k(\bm d) = f(\bm x_k) + \nabla f(\bm x_k)^\top \bm d + \frac{1}{2}\bm d^\top \underbrace{\nabla^2 f(\bm x_k)}_{\text{or} \approx} \bm d, and we trust the model only within \|\bm d\|_2 \leq \delta_k.
In other words, we formulate the constrained optimization problem \boxed{ \begin{aligned} \operatorname*{minimize}_{\bm d\in\mathbb R^n} &\quad m_k(\bm d)\\ \text{subject to} &\quad \|\bm d\|_2 \leq \delta_k. \end{aligned}}
For later convenience (when differentiating the Lagrangian), we rewrite the constraint as \frac{1}{2}\left(\|\bm d\|_2^2 - \delta_k^2\right) \leq 0.
Let’s write down the optimality conditions for this constrained problem. With \mu denoting the Lagrange multiplier, the Lagrangian is L(\bm x_k, \bm d) = f(\bm x_k) + \nabla f(\bm x_k)^\top \bm d + \frac{1}{2}\bm d^\top \nabla^2 f(\bm x_k)\bm d + \frac{\mu}{2} (\|\bm d\|^2-\delta_k^2)
The necessary conditions (the KKT conditions) can be written upon inspection \begin{aligned} \nabla_{\bm{d}}L(\bm x_k, \bm d) = \nabla f(\bm x_k) + \nabla^2 f(\bm x_k) \bm d + \mu \bm d &= 0,\\ \|\bm d\|_2^2 - \delta_k^2 &\leq 0,\\ \mu &\geq 0,\\ \mu \left(\|\bm d\|_2^2 - \delta_k^2\right) &= 0. \end{aligned}
Now, there are two scenarios: either the optimal step \bm d keeps the updated \bm x_{k+1} still strictly inside the trust region, or the updated \bm x_{k+1} is at the boundary of the trust region. In the former case, since the constraint is satisfied strictly, the dual variable \mu=0 and the optimality condition simplifies to \nabla f(\bm x_k) + \nabla^2 f(\bm x_k) \bm d= 0, which leads to the standard Newton’s update \bm d = -[\nabla^2 f(\bm x_k)]^{-1}\nabla f(\bm x_k). In the latter case the update is
\bm d = -[\nabla^2 f(\bm x_k) + \mu \mathbf I]^{-1}\nabla f(\bm x_k), which has the form that we have already discussed when mentioning modifications of Newton’s method.
Let’s recall here our discussion of line search methods – we argued that there is rarely a need to compute the minimum at each step, and “good enough” reductions of the cost function typically suffice. The situation is similar for trust region methods – an approximate solution to the minimization (sub)problem is enough. However, we are not going to discuss such methods here.
One issue that does require discussion, however, is that of evaluating the predictive quality of the (quadratic) model. If the model is not good enough, the trust region must be shrunk; if it is fairly good, the trust region can be expanded. In both cases, the constrained optimization (sub)problem must be solved again.
One metric we can use to evaluate the model is
\eta = \frac{\text{actual improvement}}{\text{predicted improvement}} = \frac{f(\bm x_k)-f(\bm x_{k+1})}{m_k(\bm 0)-m_k(\bm d_k)}.
We shrink the region for small \eta (\approx 0), and expand it for larger \eta (\approx 1).
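In code, the radius update driven by η can be sketched as follows (the thresholds 0.25 and 0.75 and the factors 1/2 and 2 are conventional choices, not the only possible ones):

# A sketch of the trust-region radius update based on the ratio η of the
# actual to the predicted improvement.
function update_radius(η, δ; δ_max=10.0)
    if η < 0.25
        return δ/2              # The model predicted poorly: shrink the region.
    elseif η > 0.75
        return min(2δ, δ_max)   # The model predicted well: expand, up to a cap.
    else
        return δ                # Otherwise keep the radius unchanged.
    end
end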
We conclude this short discussion of trust region methods by comparing them with the descent methods. While in descent methods we set the direction first and then perform a line search in the chosen direction, in trust region methods this sequence is reversed. Kind of. By setting the radius of the trust region, we essentially set an upper bound on the step length; the subsequent optimization subproblem can then be viewed as a search for a direction.