Our motivation for studying numerical algorithms for unconstrained optimization remains the same as when we studied the conditions of optimality for such unconstrained problems – such algorithms constitute building blocks for constrained optimization problems. Indeed, many algorithms for constrained problems are based on reformulating the constrained problem into an unconstrained one and then applying the algorithms studied in this section.
It may be useful to recapitulate our motivation for studying optimization algorithms in general – after all, there are dozens of commercial or free and open-source software tools for solving optimization problems. Why not just use them? There are two answers beyond the traditional “in grad school we should understand what we are using”:
There is no single solver that works best for all problems. Therefore we must be aware of the principles, strengths, and weaknesses of the algorithms in order to choose the right one for our problem.
This is a control engineering course and numerical optimization is becoming an integral part of control systems. While developing a control system, we may find ourselves in need of developing our own implementation of an optimization algorithm or adjusting an existing one. This requires deeper understanding of algorithms than just casual usage of high-level functions in Matlab or Python.
There is certainly no shortage of algorithms for unconstrained optimization. In this crash course we can cover only a few. But the few we cover here certainly form a solid theoretical basis and provide practically usable tools.
One possible way to classify the algorithms is based on whether or not they use derivatives of the objective function. In this course, we only consider the former approach, as it leads to more efficient algorithms. For the latter, derivative-free methods, we refer to the literature (a prominent example is the Nelder-Mead method).
All the relevant methods are iterative. Based on what happens within each iteration, we can classify them into two categories:
Descent methods
In each iteration, fix the search direction d_k first, and then determine how far to go along that direction, that is, find the step length \alpha_k that minimizes (exactly or approximately) f(x_k + \alpha_k d_k). In the next iteration the search direction is updated.
Trust region methods
In each iteration, fix the region (typically a ball) around the current solution, in which a simpler (typically a quadratic) function approximates the original cost function reasonably accurately, and then find the minimum of this simpler cost function.
Descent methods
The obvious property that the search direction needs to have is that the cost function decreases along it, at least locally (for a small step length).
Definition 1 (Descent direction) At the current iterate \bm x_k, the direction \bm d_k is called a descent direction if
\nabla f(\bm x_k)^\top \bm d_k < 0,
that is, the directional derivative is negative along the direction \bm d_k.
The product above is an inner product of the two vectors \bm d_k and \nabla f(\mathbf x_k). Recall that it is defined as
\nabla f(\bm x_k)^\top \bm d_k = \|\nabla f(\bm x_k)\| \|\bm d_k\| \cos \theta,
where \theta is the angle between the gradient and the search direction. This condition has a nice geometric interpretation in a contour plot for an optimization in \mathbb R^2. Consider the line tangent to the function contour at \bm x_k. A descent direction must lie in the half-plane generated by this tangent line opposite to the one into which the gradient \nabla f(\bm x_k) points.
Beware that the cost function is only guaranteed to decrease if the length of the step is sufficiently small. For longer steps the higher-order terms in the Taylor series approximation of the cost function can dominate.
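Numerically, checking the descent condition amounts to evaluating a single inner product. A trivial sketch in Julia (the function name and the numbers are ours, just for illustration):

using LinearAlgebra

# Returns true if d is a descent direction for f at x, given the gradient ∇fx = ∇f(x).
is_descent_direction(∇fx, d) = dot(∇fx, d) < 0

is_descent_direction([2.0, 6.0], [-1.0, 0.0])   # true: the directional derivative is -2 < 0.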
Before we proceed to the question of which descent direction to choose and how to find it, we address the question of how far to go along the chosen direction. This is the problem of line search.
Step length determination (aka line search)
Note that once the search direction has been fixed (whether we used the negative of the gradient or any other descent direction), the problem of finding the step length \alpha_k is just a scalar optimization problem. It turns out, however, that instead of finding the true minimum along the search direction, it is often sufficient to find the minimum only approximately, or even not to aim at minimization at all and work with a fixed step length instead.
Fixed length of the step
Here we give guidance on the choice of the length of the step. But we need to introduce a useful concept first.
Definition 2 (L-smoothness) A continuously differentiable function f is said to be L-smooth (equivalently, its gradient \nabla f is Lipschitz continuous with constant L) if there exists a constant L>0 such that
\|\nabla f(x) - \nabla f(y)\| \leq L \|x-y\| \quad \text{for all } x, y.
Note that if the second derivatives exist, L is an upper bound on the norm of the Hessian
\|\nabla^2 f\|\leq L.
For quadratic functions, L is the largest eigenvalue of the Hessian
L = \max_i \lambda_i (\mathbf Q).
The usefulness of the concept of L-smoothness is that it provides a quadratic function that serves as an upper bound for the original function. This is formulated as the following lemma.
Lemma 1 (Descent lemma) Consider an L-smooth function f. Then for any \mathbf x_k and \mathbf x_{k+1}, the following inequality holds
f(\mathbf x_{k+1}) \leq f(\mathbf x_{k}) + \nabla f(\mathbf x_k)^\top (\mathbf x_{k+1}-\mathbf x_{k}) + \frac{L}{2}\|\mathbf x_{k+1}-\mathbf x_{k}\|^2.
What implication does the result have for the determination of the step length? Substituting the gradient step \bm x_{k+1} = \bm x_k - \alpha \nabla f(\bm x_k) into the inequality gives
f(\bm x_{k+1}) \leq f(\bm x_{k}) - \alpha\left(1-\frac{L\alpha}{2}\right)\|\nabla f(\bm x_k)\|^2,
and minimizing the right-hand side over \alpha yields the fixed step length
\alpha = \frac{1}{L},
which guarantees a decrease of the cost by at least \frac{1}{2L}\|\nabla f(\bm x_k)\|^2 in every iteration.
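For a quadratic cost f(\bm x) = \frac{1}{2}\bm x^\top\mathbf Q\bm x + \mathbf c^\top\bm x (the form we will also use in the examples below), the constant L is readily available as the largest eigenvalue of \mathbf Q, so a fixed-step gradient descent is easy to sketch. The following Julia function is only an illustrative sketch of ours, not part of any library:

using LinearAlgebra

function gradient_descent_fixed_step(Q, c, x0; ϵ=1e-5, N=1000)
    L = maximum(eigvals(Symmetric(Q)))   # Lipschitz constant of the gradient of a quadratic cost.
    α = 1/L                              # Fixed step length suggested by the descent lemma.
    x = x0
    for _ in 1:N
        ∇f = Q*x + c                     # Gradient of f(x) = 1/2 x'Qx + c'x.
        norm(∇f) > ϵ || break
        x = x - α*∇f                     # Gradient step with the fixed step length.
    end
    return x
end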
Exact line search
A number of methods exist: bisection, golden section, Newton's, and others. As finding the true minimum in each iteration is often too computationally costly and hardly needed, we do not cover them here. The only exception is Newton's method, which for vector variables constitutes another descent method on its own, and we cover it later.
Example 1 Here we develop a solution for an exact minimization of a quadratic function f(\bm x) = \frac{1}{2} \bm x^\top\mathbf Q \bm x + \mathbf c^\top \bm x along a given direction. We show that it leads to a closed-form formula. Although not particularly useful in practice, it is a good exercise in understanding the problem of line search. Furthermore, we will use it later to demonstrate the behaviour of the steepest descent method. The problem is to \operatorname*{minimize}_{\alpha_k} f(\bm x_k + \alpha_k \bm d_k). We express the cost as a function of the current iterate, the direction, and the step length.
\begin{aligned}
f(\bm x_k + \alpha_k \bm d_k) &= \frac{1}{2}(\bm x_k + \alpha_k\bm d_k)^\top\mathbf Q (\bm x_k + \alpha_k\bm d_k) +\mathbf c^\top(\bm x_k + \alpha_k\bm d_k)\\
&= \frac{1}{2} \bm x_k^\top\mathbf Q \bm x_k + \bm d_k^\top\mathbf Q\bm x_k \alpha_k + \frac{1}{2} \bm d_k^\top\mathbf Q\bm d_k \alpha_k^2+ \mathbf c^\top(\bm x_k + \alpha_k\bm d_k).
\end{aligned}
Differentiating the function with respect to the length of the step, we get
\frac{\mathrm{d}f(\bm x_k + \alpha_k\bm d_k)}{\mathrm{d}\alpha_k} = \bm d_k^\top \underbrace{(\mathbf Q\bm x_k + \mathbf c)}_{\nabla f(\bm x_k)} + \bm d_k^\top\mathbf Q\bm d_k \alpha_k.
And now setting the derivative to zero, we find the optimal step length
\boxed{
\alpha_k = -\frac{\bm d_k^\top \nabla f(\bm x_k)}{\bm d_k^\top\mathbf Q\bm d_k} = -\frac{\bm d_k^\top (\mathbf Q\bm x_k + \mathbf c)}{\bm d_k^\top\mathbf Q\bm d_k}.}
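For later reference, the boxed formula translates into a one-line Julia function (our own sketch; it assumes \bm d_k^\top\mathbf Q\bm d_k > 0, which holds whenever \mathbf Q is positive definite):

using LinearAlgebra

# Exact minimizer of α ↦ f(x + α d) for the quadratic f(x) = 1/2 x'Qx + c'x.
exact_step_length(Q, c, x, d) = -dot(d, Q*x + c) / dot(d, Q*d)

Note that with \bm d_k = -\nabla f(\bm x_k) this reduces to \alpha_k = \frac{\nabla f(\bm x_k)^\top \nabla f(\bm x_k)}{\nabla f(\bm x_k)^\top \mathbf Q \nabla f(\bm x_k)}, the expression used in the steepest descent code below.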
Approximate line search – backtracking
There are several methods for approximate line search. Here we describe the backtracking algorithm, which is based on the sufficient decrease condition (also known as Armijo condition), which reads
f(\bm x_k+\alpha_k\bm d_k) - f(\bm x_k) \leq \gamma \alpha_k \bm d_k^\top \nabla f(\bm x_k),
where \gamma\in(0,1) is typically very small, say \gamma = 10^{-4}.
The term on the right can be viewed as a linear function of \alpha_k. Its negative slope is a bit less steep than the directional derivative of the function f at \bm x_k. The condition of sufficient decrease thus requires that the cost function (as a function of \alpha_k) lies below the graph of this linear function.
Now, the backtracking algorithm is parameterized by three parameters: the initial step length \alpha_0>0, the typically very small \gamma\in(0,1) that parameterizes the Armijo condition, and yet another parameter \beta\in(0,1).
The k-th iteration of the algorithm goes like this: failure of the sufficient decrease condition for a given \alpha_k, or, equivalently, satisfaction of the condition
f(\bm x_k) - f(\bm x_k+\alpha_k\bm d_k) < -\gamma \alpha_k \bm d_k^\top \nabla f(\bm x_k)
sends the algorithm into another reduction of \alpha_k by \alpha_k = \beta\alpha_k. A reasonable choice for \beta is 0.5, which corresponds to halving the step length upon failure to decrease sufficiently.
The backtracking algorithm can be implemented as follows
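A minimal sketch consistent with the description above could look as follows (the function name matches the output below, while the argument list, keyword defaults, and the cap on the number of reductions are our own choices):

using LinearAlgebra

function backtracking_line_search(f, ∇fx, x, d; α₀=1.0, γ=1e-4, β=0.5, maxiter=50)
    α = α₀
    for _ in 1:maxiter
        # Sufficient decrease (Armijo) condition: accept the step as soon as it holds.
        if f(x + α*d) - f(x) <= γ*α*dot(d, ∇fx)
            return α
        end
        α = β*α        # Otherwise shrink the step length and try again.
    end
    return α           # Give up after maxiter reductions and return the last (tiny) α.
end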
backtracking_line_search (generic function with 1 method)
Now we are ready to proceed to the question of choosing a descent direction.
Steepest descent (aka gradient descent) method
A natural candidate for a descent direction is the negative of the gradient
\bm d_k = -\nabla f(\bm x_k).
In fact, among all descent directions, this is the one for which the descent is steepest (the gradient determines the direction of steepest ascent), though we will see later that this does not mean that the convergence of the method is the fastest.
In each iteration of the gradient method, this is how the solution is updated
\boxed{
\bm x_{k+1} = \bm x_{k} - \alpha_k \nabla f(\bm x_{k}),}
where the determination of the step length \alpha_k has already been discussed in the previous section.
Let’s now examine the behaviour of the method by applying it to the minimization of a quadratic function. Admittedly, for a quadratic function it is obviously an overkill, but we use it in the example because we can compute the step length exactly, which then helps the method show its best.
Example 2
using LinearAlgebra     # For dot() function.
using Printf            # For formatted output.

x0 = [2, 3]             # Initial vector.
Q = [1 0; 0 3]          # Positive definite matrix defining the quadratic form.
c = [1, 2]              # Vector defining the linear part.
xs = -Q\c               # Stationary point, automatically the minimizer for posdef Q.
ϵ = 1e-5                # Threshold on the norm of the gradient.
N = 100;                # Maximum number of steps.

function gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
    x = x0
    iter = 0
    f = 1/2*dot(x,Q*x)+dot(x,c)
    ∇f = Q*x+c
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        @printf("i = %3d ||∇f(x)|| = %6.4e f(x) = %6.4e\n", iter, norm(∇f), f)
        if iter >= N
            return f,x
        end
    end
    return f,x
end

fopt,xopt = gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
i = 1 ||∇f(x)|| = 2.0229e+00 f(x) = 7.8495e-01
i = 2 ||∇f(x)|| = 9.0210e-01 f(x) = -1.0123e+00
i = 3 ||∇f(x)|| = 1.6005e-01 f(x) = -1.1544e+00
i = 4 ||∇f(x)|| = 7.1374e-02 f(x) = -1.1657e+00
i = 5 ||∇f(x)|| = 1.2663e-02 f(x) = -1.1666e+00
i = 6 ||∇f(x)|| = 5.6470e-03 f(x) = -1.1667e+00
i = 7 ||∇f(x)|| = 1.0019e-03 f(x) = -1.1667e+00
i = 8 ||∇f(x)|| = 4.4679e-04 f(x) = -1.1667e+00
i = 9 ||∇f(x)|| = 7.9269e-05 f(x) = -1.1667e+00
i = 10 ||∇f(x)|| = 3.5350e-05 f(x) = -1.1667e+00
i = 11 ||∇f(x)|| = 6.2718e-06 f(x) = -1.1667e+00
We can also decorate the code a bit to visualize how the iterations proceeded.
function gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)
    x = x0
    X = x
    f = 1/2*dot(x,Q*x)+dot(x,c)
    F = [f,]
    ∇f = Q*x+c
    iter = 0
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        X = hcat(X,x)
        push!(F,f)
        if iter >= N
            return F,X
        end
    end
    return F,X
end

F,X = gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)

x1_data = x2_data = -4:0.01:4;
f(x) = 1/2*dot(x,Q*x)+dot(x,c)
z_data = [f([x1,x2]) for x2=x2_data, x1=x1_data];

using Plots
contour(x1_data,x2_data,z_data)
plot!(X[1,:],X[2,:],label="xk",marker=:diamond,aspect_ratio=1)
scatter!([x0[1],],[x0[2],],label="x0")
scatter!([xs[1],],[xs[2],],label="xopt")
xlabel!("x1"); ylabel!("x2")
xlims!(-4,4); ylims!(-4,4)
Although the number of iterations in the above example is acceptable, a major characteristic of the method is already visible: its convergence slows down as we approach a local minimum, which is visually recognizable as oscillations or zig-zagging. For some data, however, it can be much worse.
Gradient method converges slowly for ill-conditioned problems
Example 3 Consider minimization of the following cost function f(\bm x) = 1000x_1^2 + 40x_1x_2 + x_2^2.
While for the previous problem of the same kind and size the steepest descent method converged in just a few steps, for this particular data it takes many dozens of steps.
The culprit here is the Hessian matrix Q and its bad properties. By “bad properties” we mean the so-called ill-conditioning, which is reflected in a very high condition number. Recall that the condition number \kappa for a given matrix \mathbf A is defined as
\kappa(\mathbf A) = \|\mathbf A^{-1}\|\cdot \|\mathbf A\|
and it can be computed as the ratio of the largest and smallest singular values, that is,
\kappa(\mathbf A) = \frac{\sigma_{\max}(\mathbf A)}{\sigma_{\min}(\mathbf A)}.
Ideally this number should be around 1. In the example above it is
cond(Q)
1668.0010671466664
which is well above 1000. Is there anything we can do about it? The answer is yes: we can scale the original data to improve the conditioning.
Scaled gradient method for ill-conditioned problems
Upon introducing a matrix \mathbf S that relates the original vector variable \bm x with a new vector variable \bm y according to
\bm x = \mathbf S \bm y,
the optimization cost function changes from f(\bm x) to f(\mathbf S \bm y). Let’s relabel the latter as g(\bm y). We will now examine how the steepest descent iteration changes. A straightforward application of the chain rule for differentiating composite functions yields
g'(\bm y) = f'(\mathbf S\bm y)\,\mathbf S.
Keeping in mind that gradients are transposes of derivatives, we can write
\nabla g(\bm y) = \mathbf S^\top \nabla f(\mathbf S\bm y).
Steepest descent iterations in the new variable then read
\bm y_{k+1} = \bm y_{k} - \alpha_k \nabla g(\bm y_k) = \bm y_{k} - \alpha_k \mathbf S^\top \nabla f(\mathbf S\bm y_k).
Premultiplying both sides by \mathbf S to return to the original variable, and defining the scaling matrix \mathbf D as \mathbf S \mathbf S^\top, a single iteration changes to
\boxed{\bm x_{k+1} = \bm x_{k} - \alpha_k \mathbf D_k\nabla f(\bm x_{k}).}
The question now is: how to choose the matrix \mathbf D? We would like the Hessian of the scaled problem, \nabla^2 g(\bm y) = \mathbf S^\top \nabla^2 f(\mathbf S \bm y)\,\mathbf S (which in the case of the quadratic form above is \mathbf S^\top \mathbf Q\, \mathbf S), to be better conditioned. Ideally, \nabla^2 g(\bm y)\approx \mathbf I.
A simple way of improving the conditioning is to define the scaling matrix \mathbf D as a diagonal matrix whose diagonal entries are given by
\mathbf D_{ii} = \left([\nabla^2 f(\bm x_k)]_{ii}\right)^{-1}.
In words, the diagonal entries of the Hessian matrix are inverted and they then form the diagonal of the scaling matrix.
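For the quadratic cost used above, whose Hessian is the constant matrix \mathbf Q, one diagonally scaled iteration might be sketched as follows (the function name is ours):

using LinearAlgebra

# One diagonally scaled gradient step for f(x) = 1/2 x'Qx + c'x.
function scaled_gradient_step(Q, c, x, α)
    ∇f = Q*x + c
    D = Diagonal(1 ./ diag(Q))   # Invert the diagonal entries of the Hessian.
    return x - α*(D*∇f)          # Scaled steepest descent update.
end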
It is worth emphasizing how the algorithm changed: the direction of steepest descent (the negative of the gradient) is premultiplied by some (scaling) matrix. We will see in a few moments that another method – Newton’s method – has a perfectly identical structure.
Newton’s method
Newton’s method is one of the flagship algorithms in numerical computing. I am certainly not exaggerating if I include it in my personal Top 10 list of algorithms relevant for engineers. We may encounter the method in two settings: as a method for solving (sets of) nonlinear equations and as a method for optimization. The two are inherently related and it is useful to be able to see the connection.
Newton’s method for rootfinding
The problem to be solved is that of finding x for which a given function g() vanishes. In other words, we solve the following equation
g(x) = 0.
The scalar version stated above also has its vector extension
\mathbf g(\bm x) = \mathbf 0,
in which \bm x stands for an n-tuple of variables and \mathbf g() actually stands for an n-tuple of functions. An even more general version allows for different numbers of variables and equations.
We start with the scalar version. A single iteration of the method evaluates not only the value of the function g(x_k) at the given point but also its derivative g'(x_k). It then uses the two to approximate the function g() at x_k by a linear (actually affine) function and computes the intersection of this approximating function with the horizontal axis. This gives us x_{k+1}, that is, the (k+1)-th approximation to a solution (root). We can write this down as
\begin{aligned}
\underbrace{g(x_{k+1})}_{0} &= g(x_{k}) + g'(x_{k})(x_{k+1}-x_k)\\
0 &= g(x_{k}) + g'(x_{k})x_{k+1}-g'(x_{k})x_k,
\end{aligned}
from which the famous formula follows
\boxed{x_{k+1} = x_{k} - \frac{g(x_k)}{g'(x_k)}.}
In the vector form, the formula is
\boxed{\bm x_{k+1} = \bm x_{k} - [\nabla \mathbf g(\bm x_k)^\top]^{-1}\mathbf g(\bm x_k),}
where \nabla \mathbf g(\bm x_k)^\top is the (Jacobian) matrix of the first derivatives of \mathbf g at \bm x_k, that is, \nabla \mathbf g() is a matrix with the gradient of the g_i(\bm x) function in its i-th column.
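In the scalar case the iteration is a one-liner; the following is a small self-contained sketch (the function name and the \sqrt{2} example are ours):

# Newton's method for solving the scalar equation g(x) = 0.
function newton_root(g, dg, x0; tol=1e-10, maxiter=100)
    x = x0
    for _ in 1:maxiter
        abs(g(x)) > tol || break
        x = x - g(x)/dg(x)   # Newton update: intersect the tangent with the horizontal axis.
    end
    return x
end

newton_root(x -> x^2 - 2, x -> 2x, 1.0)   # ≈ 1.4142135623730951, that is, √2.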
Newton’s method for optimization
Once again, restrict ourselves to a scalar case first. The problem is
\operatorname*{minimize}_{x\in\mathbb{R}}\quad f(x).
At the k-th iteration of the algorithm, the solution is x_k. The function to be minimized is approximated by a quadratic function m_k() in x. In order to find the parameterization of this quadratic function, the function f() but also its first and second derivatives, f'() and f''(), respectively, need to be evaluated at x_k. Using these three, a function m_k(x) approximating f(x) at some x not too far from x_k can be defined
m_k(x) = f(x_k) + f'(x_k)(x-x_k) + \frac{1}{2}f''(x_k)(x-x_k)^2.
The problem of minimizing this new function in the k-th iteration is then formulated, namely,
\operatorname*{minimize}_{x_{k+1}\in\mathbb{R}}\quad m_k(x_{k+1})
and solved for some x_{k+1}. The way to find this solution is straightforward: find the derivative of m_k() and find the value of x_{k+1} for which this derivative vanishes. The result is
\boxed{x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)}.}
The vector version of the Newton’s step is
\boxed{\bm x_{k+1} = \bm x_{k} - [\nabla^2 f(\bm x_k)]^{-1} \nabla f(\bm x_k).}
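A minimal Julia sketch of the vector iteration, with the gradient and the Hessian supplied as functions, could look as follows (the names are ours; the usage example reuses the ill-conditioned quadratic from Example 3):

using LinearAlgebra

function newton_method(∇f, Hf, x0; ϵ=1e-8, N=100)
    x = x0
    for _ in 1:N
        norm(∇f(x)) > ϵ || break
        x = x - Hf(x) \ ∇f(x)   # Solve a linear system instead of forming the inverse explicitly.
    end
    return x
end

Q = [2000 40; 40 2]                           # Hessian of f(x) = 1000x₁² + 40x₁x₂ + x₂².
newton_method(x -> Q*x, x -> Q, [2.0, 3.0])   # Converges to the origin in a single step.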
A few observations
If compared to the general prescription for descent direction methods, Newton’s method determines the direction and the step length at once (both \alpha_k and \mathbf d_k are hidden in - [\nabla^2 f(\mathbf x_k)]^{-1} \nabla f(\mathbf x_k)).
If compared with the steepest descent (gradient) method, especially with its scaled version, Newton’s method fits into the framework nicely because the inverse [\nabla^2 f(\mathbf x_k)]^{-1} of the Hessian can be regarded as a kind of scaling matrix \mathbf D. Indeed, you can find arguments in some textbooks that Newton’s method involves scaling that is optimal in some sense. We skip the details here because we only wanted to highlight the similarity in the structure of the two methods.
The great popularity of Newton’s method is mainly due to its nice convergence rate – quadratic. Although we skip any discussion of convergence rates here, note that for all other methods this is an ideal they can only aim to approach.
The nice convergence rate of Newton’s method comes at the price of a few disadvantages:
The need to compute the Hessian. This may not be an issue for simple problems, but it can play some role for huge ones.
Once the Hessian is computed, it must be inverted (actually, a linear system must be solved). But this assumes that the Hessian is nonsingular. How can we guarantee this for a given problem?
It is not only that the Hessian must be nonsingular, it must also be positive definite. Note that in the scalar case this corresponds to the situation when the second derivative is positive. A negative second derivative can send the algorithm in the opposite direction – away from the local minimum – which would ruin the convergence of the algorithm.
The last two issues are handled by the following modifications of the standard Newton’s method.
Damped Newton’s method
A parameter \alpha\in(0,1) is introduced that shortens the step as in
\bm x_{k+1} = \bm x_{k} - \alpha(\nabla^2 f(\bm x_k))^{-1} \nabla f(\bm x_k).
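In code, the damped update differs from the pure Newton step only by the factor \alpha (a one-line sketch of ours):

# One damped Newton step with damping parameter 0 < α ≤ 1,
# given the gradient ∇fx and the Hessian Hfx evaluated at x.
damped_newton_step(∇fx, Hfx, x, α) = x - α*(Hfx \ ∇fx)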
Fixed constant positive definite matrix instead of the inverse of the Hessian
The step is determined as
\bm x_{k+1} = \bm x_{k} - \mathbf B \nabla f(\bm x_k).
Note that, in the rootfinding setting, using the constant \mathbf B in place of the (inverse of the) Hessian amounts to assuming that the slope of the approximating linear (affine) function is always the same.
Now that we have admitted something other than just the (inverse of the) Hessian in the formula for Newton’s method, we can explore this new freedom further. This will bring us to a family of methods called quasi-Newton methods.