Algorithms for unconstrained optimization

Our motivation for studying numerical algorithms for unconstrained optimization remains the same as when we studied the conditions of optimality for such unconstrained problems – such algorithms constitute building blocks for constrained optimization problems. Indeed, many algorithms for constrained problems are based on reformulating the constrained problem into an unconstrained one and then applying the algorithms studied in this section.

It may be useful to recapitulate our motivation for studying optimization algorithms in general – after all, there are dozens of commercial and free and open-source software tools for solving optimization problems. Why not just use them? There are two answers beyond the traditional “in grad school we should understand what we are using”:

There is certainly no shortage of algorithms for unconstrained optimization. In this crash course we can cover only a few, but those few form a solid theoretical basis and provide practically usable tools.

One possible way to classify the algorithms is based on whether or not they use derivatives of the objective function. In this course we only consider the former approach, as it leads to more efficient algorithms. For the latter, derivative-free methods we refer to the literature (a prominent example is the Nelder–Mead method).

All the relevant methods are iterative. Based on what happens within each iteration, we can classify them into two categories:

Descent methods
In each iteration, fix the search direction \bm d_k first, and then determine how far to go along that direction, that is, find the step length \alpha_k that minimizes (exactly or approximately) f(\bm x_k + \alpha_k \bm d_k). In the next iteration the search direction is updated.
Trust region methods
In each iteration, fix a region (typically a ball) around the current solution in which a simpler (typically quadratic) function approximates the original cost function reasonably accurately, and then find the minimizer of this simpler cost function over that region.

Descent methods

The obvious property that the search direction needs to satisfy is that the cost function decreases along it, at least locally (for a sufficiently small step length).

Definition 1 (Descent direction) At the current iterate \bm x_k, the direction \bm d_k is called a descent direction if \nabla f(\bm x_k)^\top \bm d_k < 0, that is, the directional derivative is negative along the direction \bm d_k.

The product above is the inner product of the two vectors \bm d_k and \nabla f(\bm x_k). Recall that it can be written as \nabla f(\bm x_k)^\top \bm d_k = \|\nabla f(\bm x_k)\| \|\bm d_k\| \cos \theta, where \theta is the angle between the gradient and the search direction. The condition has a nice geometric interpretation in a contour plot for an optimization in \mathbb R^2. Consider the line tangent to the function contour at \bm x_k. A descent direction must lie in the half-plane determined by this tangent line that is opposite to the one into which the gradient \nabla f(\bm x_k) points.

Beware that the cost function is only guaranteed to decrease if the step length is sufficiently small. For longer steps the higher-order terms in the Taylor series approximation of the cost function can dominate.
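To make the definition concrete, here is a quick numerical check (the data is illustrative, chosen to match the quadratic example below): for f(\bm x) = \frac{1}{2}\bm x^\top\mathbf Q\bm x + \mathbf c^\top\bm x the gradient at \bm x is \mathbf Q\bm x + \mathbf c, and the negative gradient always qualifies as a descent direction.

using LinearAlgebra

Q = [1 0; 0 3]; c = [1, 2]; x = [2, 3]
∇f = Q*x + c                # Gradient at x.
d = -∇f                     # Candidate direction: the negative gradient.
dot(∇f, d) < 0              # True, hence d is a descent direction at x.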

Before we proceed to the question of which descent direction to choose and how to find it, we address the question of how far to go along the chosen direction. This is the problem of line search.
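A widely used strategy is backtracking: start with a candidate step length and shrink it until a sufficient-decrease (Armijo) condition holds. Below is a minimal sketch; the function name and the parameter values are common but assumed choices, not something prescribed in this text.

using LinearAlgebra         # For dot().

# Backtracking (Armijo) line search along a given direction d.
# Assumes d is indeed a descent direction, that is, dot(∇fx, d) < 0.
function backtracking(f, ∇fx, x, d; α=1.0, β=0.5, γ=1e-4)
    while f(x + α*d) > f(x) + γ*α*dot(∇fx, d)   # Sufficient decrease violated.
        α *= β                                  # Shrink the step.
    end
    return α
end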

Steepest descent (aka gradient descent) method

A natural candidate for a descent direction is the negative of the gradient \bm d_k = -\nabla f(\bm x_k).

In fact, among all descent directions this is the one along which the descent is steepest (the gradient determines the direction of steepest ascent; by the Cauchy–Schwarz inequality, \nabla f(\bm x_k)^\top \bm d is minimized over unit-length \bm d by \bm d = -\nabla f(\bm x_k)/\|\nabla f(\bm x_k)\|). We will see later, however, that steepest descent does not imply fastest convergence of the method.

In each iteration of the gradient method, the solution is updated as

\boxed{ \bm x_{k+1} = \bm x_{k} - \alpha_k \nabla f(\bm x_{k}),} where the determination of the step length \alpha_k has already been discussed above.

Let’s now examine the behaviour of the method by applying it to minimization of a quadratic function. For a quadratic function the method is obviously overkill, but we use it in the example because the step length can be computed exactly, which helps the method show its best.
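In fact, for a quadratic cost f(\bm x) = \frac{1}{2}\bm x^\top\mathbf Q\bm x + \mathbf c^\top\bm x with \mathbf Q positive definite, the exact step length has a closed form. Setting the derivative of \alpha \mapsto f(\bm x_k - \alpha\nabla f(\bm x_k)) to zero gives \alpha_k = \frac{\nabla f(\bm x_k)^\top \nabla f(\bm x_k)}{\nabla f(\bm x_k)^\top \mathbf Q\, \nabla f(\bm x_k)}, which is the formula used in the code below.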

Example 2  

using LinearAlgebra         # For dot() function.
using Printf                # For formatted output.

x0 = [2, 3]                 # Initial vector.
Q = [1 0; 0 3]              # Positive definite matrix defining the quadratic form.
c = [1, 2]                   # Vector defining the linear part.

xs = -Q\c                   # Stationary point, automatically the minimizer for posdef Q. 

ϵ  = 1e-5                   # Threshold on the norm of the gradient.
N  = 100                    # Maximum number of steps.

function gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
    x = x0
    iter = 0
    f = 1/2*dot(x,Q*x)+dot(x,c)
    ∇f = Q*x+c
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        @printf("i = %3d   ||∇f(x)|| = %6.4e   f(x) = %6.4e\n", iter, norm(∇f), f)
        if iter >= N
            return f,x
        end
    end
    return f,x
end

fopt,xopt = gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
i =   1   ||∇f(x)|| = 2.0229e+00   f(x) = 7.8495e-01
i =   2   ||∇f(x)|| = 9.0210e-01   f(x) = -1.0123e+00
i =   3   ||∇f(x)|| = 1.6005e-01   f(x) = -1.1544e+00
i =   4   ||∇f(x)|| = 7.1374e-02   f(x) = -1.1657e+00
i =   5   ||∇f(x)|| = 1.2663e-02   f(x) = -1.1666e+00
i =   6   ||∇f(x)|| = 5.6470e-03   f(x) = -1.1667e+00
i =   7   ||∇f(x)|| = 1.0019e-03   f(x) = -1.1667e+00
i =   8   ||∇f(x)|| = 4.4679e-04   f(x) = -1.1667e+00
i =   9   ||∇f(x)|| = 7.9269e-05   f(x) = -1.1667e+00
i =  10   ||∇f(x)|| = 3.5350e-05   f(x) = -1.1667e+00
i =  11   ||∇f(x)|| = 6.2718e-06   f(x) = -1.1667e+00
(-1.1666666666479069, [-0.9999939492423319, -0.6666672167355456])

We can also decorate the code a bit to visualize how the iterations proceeded.

function gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)
    x = x0
    X = x
    f = 1/2*dot(x,Q*x)+dot(x,c)
    F = [f,]
    ∇f = Q*x+c
    iter = 0
    while (norm(∇f) > ϵ)
        α = dot(∇f,∇f)/dot(∇f,Q*∇f)
        x = x - α*∇f
        iter = iter+1
        f = 1/2*dot(x,Q*x)+dot(x,c)
        ∇f = Q*x+c
        X = hcat(X,x)
        push!(F,f)
        if iter >= N
            return F,X
        end
    end
    return F,X
end

F,X = gradient_descent_quadratic_exact_decor(Q,c,x0,ϵ,N)

x1_data = x2_data = -4:0.01:4;
f(x) = 1/2*dot(x,Q*x)+dot(x,c)
z_data = [f([x1,x2]) for x2=x2_data, x1=x1_data];

using Plots
contour(x1_data,x2_data,z_data)
plot!(X[1,:],X[2,:],label="xk",marker=:diamond,aspect_ratio=1)
scatter!([x0[1],],[x0[2],],label="x0")
scatter!([xs[1],],[xs[2],],label="xopt")
xlabel!("x1");ylabel!("x2");
xlims!(-4,4); ylims!(-4,4)
Figure 1: Zigzagging of the steepest descent method for a quadratic function

Although the number of iterations in the above example is acceptable, a major characteristic of the method is already visible: its convergence slows down as it approaches a local minimum, which is visually recognizable as oscillations or zig-zagging. And it can be much worse for some data.

Gradient method converges slowly for ill-conditioned problems

Example 3 Consider minimization of the cost function f(\bm x) = \frac{1}{2}\left(1000x_1^2 + 40x_1x_2 + x_2^2\right), that is, f(\bm x) = \frac{1}{2}\bm x^\top \mathbf Q \bm x with \mathbf Q as in the code below.

Q = [1000 20; 20 1]         # Ill-conditioned positive definite matrix.
c = [0, 0]
x0 = [1, 1000]

fopt,xopt = gradient_descent_quadratic_exact(Q,c,x0,ϵ,N)
i =   1   ||∇f(x)|| = 5.9951e+02   f(x) = 2.9939e+05
i =   2   ||∇f(x)|| = 1.2093e+04   f(x) = 1.7221e+05
i =   3   ||∇f(x)|| = 3.4484e+02   f(x) = 9.9052e+04
i =   4   ||∇f(x)|| = 6.9560e+03   f(x) = 5.6974e+04
i =   5   ||∇f(x)|| = 1.9835e+02   f(x) = 3.2771e+04
i =   6   ||∇f(x)|| = 4.0011e+03   f(x) = 1.8850e+04
i =   7   ||∇f(x)|| = 1.1409e+02   f(x) = 1.0842e+04
i =   8   ||∇f(x)|| = 2.3014e+03   f(x) = 6.2364e+03
i =   9   ||∇f(x)|| = 6.5623e+01   f(x) = 3.5872e+03
i =  10   ||∇f(x)|| = 1.3237e+03   f(x) = 2.0633e+03
i =  11   ||∇f(x)|| = 3.7746e+01   f(x) = 1.1868e+03
i =  12   ||∇f(x)|| = 7.6141e+02   f(x) = 6.8264e+02
i =  13   ||∇f(x)|| = 2.1711e+01   f(x) = 3.9265e+02
i =  14   ||∇f(x)|| = 4.3796e+02   f(x) = 2.2585e+02
i =  15   ||∇f(x)|| = 1.2488e+01   f(x) = 1.2991e+02
i =  16   ||∇f(x)|| = 2.5191e+02   f(x) = 7.4722e+01
i =  17   ||∇f(x)|| = 7.1831e+00   f(x) = 4.2980e+01
i =  18   ||∇f(x)|| = 1.4490e+02   f(x) = 2.4722e+01
i =  19   ||∇f(x)|| = 4.1317e+00   f(x) = 1.4220e+01
i =  20   ||∇f(x)|| = 8.3344e+01   f(x) = 8.1791e+00
i =  21   ||∇f(x)|| = 2.3765e+00   f(x) = 4.7046e+00
i =  22   ||∇f(x)|| = 4.7939e+01   f(x) = 2.7061e+00
i =  23   ||∇f(x)|| = 1.3670e+00   f(x) = 1.5565e+00
i =  24   ||∇f(x)|| = 2.7574e+01   f(x) = 8.9529e-01
i =  25   ||∇f(x)|| = 7.8627e-01   f(x) = 5.1497e-01
i =  26   ||∇f(x)|| = 1.5861e+01   f(x) = 2.9621e-01
i =  27   ||∇f(x)|| = 4.5226e-01   f(x) = 1.7038e-01
i =  28   ||∇f(x)|| = 9.1229e+00   f(x) = 9.7999e-02
i =  29   ||∇f(x)|| = 2.6014e-01   f(x) = 5.6369e-02
i =  30   ||∇f(x)|| = 5.2474e+00   f(x) = 3.2423e-02
i =  31   ||∇f(x)|| = 1.4963e-01   f(x) = 1.8649e-02
i =  32   ||∇f(x)|| = 3.0183e+00   f(x) = 1.0727e-02
i =  33   ||∇f(x)|| = 8.6065e-02   f(x) = 6.1701e-03
i =  34   ||∇f(x)|| = 1.7361e+00   f(x) = 3.5490e-03
i =  35   ||∇f(x)|| = 4.9504e-02   f(x) = 2.0414e-03
i =  36   ||∇f(x)|| = 9.9859e-01   f(x) = 1.1742e-03
i =  37   ||∇f(x)|| = 2.8475e-02   f(x) = 6.7539e-04
i =  38   ||∇f(x)|| = 5.7439e-01   f(x) = 3.8848e-04
i =  39   ||∇f(x)|| = 1.6378e-02   f(x) = 2.2345e-04
i =  40   ||∇f(x)|| = 3.3038e-01   f(x) = 1.2853e-04
i =  41   ||∇f(x)|| = 9.4207e-03   f(x) = 7.3928e-05
i =  42   ||∇f(x)|| = 1.9003e-01   f(x) = 4.2523e-05
i =  43   ||∇f(x)|| = 5.4188e-03   f(x) = 2.4459e-05
i =  44   ||∇f(x)|| = 1.0931e-01   f(x) = 1.4069e-05
i =  45   ||∇f(x)|| = 3.1168e-03   f(x) = 8.0922e-06
i =  46   ||∇f(x)|| = 6.2873e-02   f(x) = 4.6546e-06
i =  47   ||∇f(x)|| = 1.7928e-03   f(x) = 2.6773e-06
i =  48   ||∇f(x)|| = 3.6164e-02   f(x) = 1.5400e-06
i =  49   ||∇f(x)|| = 1.0312e-03   f(x) = 8.8578e-07
i =  50   ||∇f(x)|| = 2.0801e-02   f(x) = 5.0949e-07
i =  51   ||∇f(x)|| = 5.9314e-04   f(x) = 2.9306e-07
i =  52   ||∇f(x)|| = 1.1965e-02   f(x) = 1.6856e-07
i =  53   ||∇f(x)|| = 3.4117e-04   f(x) = 9.6958e-08
i =  54   ||∇f(x)|| = 6.8821e-03   f(x) = 5.5769e-08
i =  55   ||∇f(x)|| = 1.9624e-04   f(x) = 3.2078e-08
i =  56   ||∇f(x)|| = 3.9585e-03   f(x) = 1.8451e-08
i =  57   ||∇f(x)|| = 1.1288e-04   f(x) = 1.0613e-08
i =  58   ||∇f(x)|| = 2.2769e-03   f(x) = 6.1045e-09
i =  59   ||∇f(x)|| = 6.4925e-05   f(x) = 3.5113e-09
i =  60   ||∇f(x)|| = 1.3097e-03   f(x) = 2.0197e-09
i =  61   ||∇f(x)|| = 3.7345e-05   f(x) = 1.1617e-09
i =  62   ||∇f(x)|| = 7.5331e-04   f(x) = 6.6821e-10
i =  63   ||∇f(x)|| = 2.1480e-05   f(x) = 3.8435e-10
i =  64   ||∇f(x)|| = 4.3330e-04   f(x) = 2.2107e-10
i =  65   ||∇f(x)|| = 1.2355e-05   f(x) = 1.2716e-10
i =  66   ||∇f(x)|| = 2.4923e-04   f(x) = 7.3142e-11
i =  67   ||∇f(x)|| = 7.1068e-06   f(x) = 4.2071e-11
(4.207097012941924e-11, [-2.371876542980443e-7, 1.1842143766107573e-5])

While for the previous problem of the same kind and size the steepest descent method converged in just a few steps, for this particular data it takes many dozens.

The culprit here is the poor conditioning of the Hessian matrix \mathbf Q, which is reflected in its very high condition number. Recall that the condition number \kappa of a given matrix \mathbf A is defined as \kappa(\mathbf A) = \|\mathbf A^{-1}\|\cdot \|\mathbf A\|, and it can be computed as the ratio of the largest and smallest singular values, that is, \kappa(\mathbf A) = \frac{\sigma_{\max}(\mathbf A)}{\sigma_{\min}(\mathbf A)}.

Ideally this number should be around 1. In the example above it is

cond(Q)
1668.0010671466664

which is well above 1000. Is there anything we can do about it? The answer is yes: we can scale the original data to improve the conditioning.

Scaled gradient method for ill-conditioned problems

Upon introducing a matrix \mathbf S that relates the original vector variable \bm x to a new vector variable \bm y according to \bm x = \mathbf S \bm y, the optimization cost function changes from f(\bm x) to f(\mathbf S \bm y); let’s relabel the latter as g(\bm y). We will now examine how the steepest descent iteration changes. A straightforward application of the chain rule for derivatives of composite functions yields g'(\bm y) = f'(\mathbf S\bm y)\,\mathbf S.

Keeping in mind that gradients are transposes of derivatives, we can write \nabla g(\bm y) = \mathbf S^\top \nabla f(\mathbf S\bm y).

Steepest descent iterations then change accordingly

\begin{aligned} \bm y_{k+1} &= \bm y_k - \alpha_k \nabla g(\bm y_k)\\ \bm y_{k+1} &= \bm y_k - \alpha_k \mathbf S^\top\nabla f(\mathbf S \bm y_k)\\ \underbrace{\mathbf S \bm y_{k+1}}_{\bm x_{k+1}} &= \underbrace{\mathbf S\bm y_k}_{\bm x_k} - \alpha_k \underbrace{\mathbf S \mathbf S^\top}_{\mathbf D}\nabla f(\underbrace{\mathbf S \bm y_k}_{\bm x_k}). \end{aligned}

Upon defining the scaling matrix \mathbf D as \mathbf S \mathbf S^\top, a single iteration changes to \boxed{\bm x_{k+1} = \bm x_{k} - \alpha_k \mathbf D_k\nabla f(\bm x_{k}).}

The question now is how to choose the matrix \mathbf D. We would like to make the Hessian of the scaled problem, \nabla^2 g(\bm y) = \mathbf S^\top \nabla^2 f(\mathbf S \bm y)\,\mathbf S (where \nabla^2 f in the case of our quadratic form is the matrix \mathbf Q used above), better conditioned. Ideally, \nabla^2 g(\bm y)\approx \mathbf I.

A simple way of improving the conditioning is to define the scaling matrix \mathbf D as a diagonal matrix whose diagonal entries are given by \mathbf D_{ii} = \left([\nabla^2 f(\bm x_k)]_{ii}\right)^{-1}.

In words, the diagonal entries of the Hessian matrix are inverted and they then form the diagonal of the scaling matrix.
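A minimal sketch of the resulting diagonally scaled gradient method for our quadratic test problem follows; the function name and the exact line search along the scaled direction are our own choices, mirroring the code above.

using LinearAlgebra

function scaled_gradient_descent_quadratic(Q,c,x0,ϵ,N)
    D = Diagonal(1 ./ diag(Q))      # Inverted diagonal of the (constant) Hessian.
    x = x0
    ∇f = Q*x + c
    iter = 0
    while norm(∇f) > ϵ && iter < N
        d = -D*∇f                   # Scaled descent direction.
        α = -dot(∇f,d)/dot(d,Q*d)   # Exact minimizer along d.
        x = x + α*d
        ∇f = Q*x + c
        iter += 1
    end
    return 1/2*dot(x,Q*x)+dot(x,c), x, iter
end

For the matrix \mathbf Q from Example 3, this rescaling reduces the condition number of the effectively minimized quadratic (that of \mathbf S^\top\mathbf Q\mathbf S with \mathbf S = \mathbf D^{1/2}) from roughly 1668 to below 5, which suggests far fewer iterations.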

It is worth emphasizing how the algorithm changed: the direction of steepest descent (the negative of the gradient) is premultiplied by some (scaling) matrix. We will see in a few moments that another method – Newton’s method – has exactly the same structure.

Newton’s method

Newton’s method is one of the flagship algorithms in numerical computing. I am certainly not exaggerating if I include it in my personal Top 10 list of algorithms relevant for engineers. We may encounter the method in two settings: as a method for solving (sets of) nonlinear equations and as a method for optimization. The two are inherently related, and it is useful to be able to see the connection.

Newton’s method for rootfinding

The problem to be solved is that of finding x for which a given function g() vanishes. In other words, we solve the following equation g(x) = 0.

The scalar version stated above also has a vector extension \mathbf g(\bm x) = \mathbf 0, in which \bm x stands for an n-tuple of variables and \mathbf g() for an n-tuple of functions. An even more general version allows for different numbers of variables and equations.

We start with the scalar version. A single iteration of the method evaluates not only the value of the function g(x_k) at the current point but also its derivative g'(x_k). It then uses the two to approximate g() at x_k by a linear (actually affine) function and computes the intersection of this approximating function with the horizontal axis. This gives us x_{k+1}, that is, the (k+1)-th approximation of the root. We can write this down as \begin{aligned} \underbrace{g(x_{k+1})}_{0} &= g(x_{k}) + g'(x_{k})(x_{k+1}-x_k)\\ 0 &= g(x_{k}) + g'(x_{k})x_{k+1}-g'(x_{k})x_k, \end{aligned} from which the famous formula follows \boxed{x_{k+1} = x_{k} - \frac{g(x_k)}{g'(x_k)}.}
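A minimal sketch of the scalar iteration in Julia (the function name, arguments, and stopping rule are illustrative choices of ours):

# Scalar Newton iteration for g(x) = 0; g and its derivative dg are passed in.
function newton_root(g, dg, x0; ϵ=1e-12, N=100)
    x = x0
    for _ in 1:N
        abs(g(x)) < ϵ && break
        x = x - g(x)/dg(x)      # The Newton update.
    end
    return x
end

newton_root(x -> x^2 - 2, x -> 2x, 1.0)     # Converges to √2 ≈ 1.4142.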

In the vector form, the formula is \boxed{\bm x_{k+1} = \bm x_{k} - [\nabla \mathbf g(\bm x_k)^\top]^{-1}\mathbf g(\bm x_k),} where \nabla \mathbf g(\bm x_k)^\top is the (Jacobian) matrix of the first derivatives of \mathbf g at \bm x_k, that is, \nabla \mathbf g() is a matrix with the gradient of the g_i(\bm x) function in its i-th column.

Newton’s method for optimization

Once again, we restrict ourselves to the scalar case first. The problem is \operatorname*{minimize}_{x\in\mathbb{R}}\quad f(x).

At the k-th iteration of the algorithm, the solution is x_k. The function to be minimized is approximated by a quadratic function m_k() in x. In order to parameterize this quadratic function, not only the function f() but also its first and second derivatives, f'() and f''(), need to be evaluated at x_k. Using these three, a function m_k(x) approximating f(x) at some x not too far from x_k can be defined as m_k(x) = f(x_k) + f'(x_k)(x-x_k) + \frac{1}{2}f''(x_k)(x-x_k)^2.

The problem of minimizing this new function in the k-th iteration is then formulated, namely,
\operatorname*{minimize}_{x_{k+1}\in\mathbb{R}}\quad m_k(x_{k+1}), and solved for some x_{k+1}. The way to find this solution is straightforward: set the derivative of m_k() to zero, 0 = m_k'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1}-x_k), which yields \boxed{x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)}.}

The vector version of the Newton’s step is \boxed{\bm x_{k+1} = \bm x_{k} - [\nabla^2 f(\bm x_k)]^{-1} \nabla f(\bm x_k).}
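A minimal sketch of the resulting iteration, with the gradient and Hessian supplied as functions (the names are our own); note that a linear system is solved rather than the inverse formed explicitly:

using LinearAlgebra

function newton_min(∇f, Hf, x0; ϵ=1e-8, N=100)
    x = x0
    for _ in 1:N
        g = ∇f(x)
        norm(g) < ϵ && break
        x = x - Hf(x)\g         # Newton step: solve ∇²f(x) Δx = ∇f(x).
    end
    return x
end

# For a quadratic cost, a single Newton step lands exactly on the minimizer:
Q = [1000 20; 20 1]; c = [0, 0]
newton_min(x -> Q*x + c, x -> Q, [1.0, 1000.0])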

A few observations

  • If compared to the general prescription for descent direction methods, Newton’s method determines the direction and the step length at once (both \alpha_k and \bm d_k are hidden in -[\nabla^2 f(\bm x_k)]^{-1} \nabla f(\bm x_k)).

  • If compared with the steepest descent (gradient) method, especially with its scaled version, Newton’s method fits into the framework nicely, because the inverse [\nabla^2 f(\bm x_k)]^{-1} of the Hessian can be regarded as a kind of scaling matrix \mathbf D. Indeed, some textbooks argue that Newton’s method involves scaling that is optimal in some sense. We skip the details here, as we only wanted to highlight the structural similarity of the two methods.

The great popularity of Newton’s method is mainly due to its nice – quadratic – rate of convergence. Although we skip any discussion of convergence rates here, note that this is an ideal that all other methods aim to approach.

The nice convergence rate of Newton’s method comes at the cost of a few disadvantages:

  • The need to compute the Hessian. This may not be an issue for simple problems, but it can play a role for huge ones.
  • Once the Hessian is computed, it must be inverted (actually, a linear system must be solved). This assumes the Hessian is nonsingular. How can we guarantee this for a given problem?
  • Not only must the Hessian be nonsingular, it must also be positive definite. In the scalar case this corresponds to a positive second derivative. A negative second derivative can send the algorithm in the opposite direction – away from the local minimum – which would ruin the convergence of the algorithm.

The last two issues are handled by modifications of the standard Newton’s method.

Damped Newton’s method

A parameter \alpha\in(0,1) is introduced that shortens the step as in \bm x_{k+1} = \bm x_{k} - \alpha(\nabla^2 f(\bm x_k))^{-1} \nabla f(\bm x_k).
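Under the same assumed interface as the Newton sketch above, the damped variant merely scales the step:

using LinearAlgebra

function damped_newton_min(∇f, Hf, x0; α=0.5, ϵ=1e-8, N=100)
    x = x0
    for _ in 1:N
        g = ∇f(x)
        norm(g) < ϵ && break
        x = x - α*(Hf(x)\g)     # Shortened (damped) Newton step, α ∈ (0,1).
    end
    return x
end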

Fixed constant positive definite matrix instead of the inverse of the Hessian

The step is determined as \bm x_{k+1} = \bm x_{k} - \mathbf B \nabla f(\bm x_k).

Note that, in the rootfinding setting, using a constant matrix \mathbf B in place of the (inverse of the) Hessian corresponds to approximating the function by an affine function whose slope is the same in every iteration.

Now that we have allowed something other than the (inverse of the) Hessian in the formula for Newton’s method, we can explore this new freedom further. This brings us to a family of methods called quasi-Newton methods.

Quasi-Newton methods

Conjugate gradient method

Trust region methods
