Dynamic programming and discrete-time optimal control
In the previous two chapters we explained direct and indirect approaches to discrete-time optimal control. While the former conveniently allows incorporating almost arbitrary constraints, it only provides a control trajectory (a finite sequence of values of the control variable); if feedback is needed, the optimization must be performed in every sampling period (thus implementing the concept of receding horizon or model predictive control, MPC). The latter, in contrast, can lead to a (state) feedback control law, but this only happens in special cases such as regulation of a linear system minimizing a quadratic cost (LQR) while assuming no bound constraints on the control or state variables; in the general case it leads to a two-point boundary value problem, which can only be solved numerically for trajectories.
In this chapter we present yet another approach — dynamic programming (DP). It also allows imposing constraints (in fact, even constraints such as integrality of variables, which are not compatible with our derivative-based optimization toolset exploited so far), and yet it directly leads to feedback controllers.
While in the case of linear systems with a quadratic cost function, dynamic programming provides another route to the theoretical results that we already know — the Riccati equation based solution to the LQR problem — in the case of general nonlinear dynamical systems with general cost functions, the feedback controllers come in the form of look-up tables. This format of a feedback controller hints at the disadvantages of DP, namely, that both the computation and the subsequent use of these look-up tables scale poorly with the dimension of the state space (the so-called curse of dimensionality). Various approximation schemes exist — one promising branch is known as reinforcement learning.
Bellman’s principle of optimality and dynamic programming
We start by considering the following example.
Example 1 (Reusing the plan for a trip from Prague to Ostrava) We are planning a car trip from Prague to Ostrava and we are searching for a route that minimizes the total travel time. Using an online route planner we learn that the fastest route from Prague to Ostrava is — as bizarre as it sounds — via (actually around) Brno.
Now, is it possible to reuse this plan for our friends from Brno who are also heading for Ostrava?
The answer is yes, as the planner confirms. Surely we did not even need the planner to answer such a trivial question. And yet it demonstrates the key wisdom of the whole chapter — Bellman’s principle of optimality — which we now state formally.
Theorem 1 (Bellman’s principle of optimality) An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
We now investigate this idea a bit more quantitatively using a simple computational example of finding a shortest path in a graph.
Example 2 (Shortest path in a graph) We consider a directed graph with nodes A through I and edges with the prescribed lengths as in the figure below.
The task is now to find the shortest path from A to I. What are possible solution strategies? We can start enumerating all the possible paths and calculate their costs (by summing the costs of the participating edges). Needless to say, this strategy based on enumeration scales very badly with the growing number of nodes.
Alternatively, we solve the problem using dynamic programming, relying on Bellman’s principle of optimality. Before we proceed, we need to define the concept of a stage. It is perhaps less common and natural when it comes to solving graph problems, but we introduce it in anticipation of discrete-time optimal control problems. By the kth stage we understand the node at which the kth decision needs to be made. In our case, starting at A, 4 decisions need to be made to reach the final node. But let’s agree that we also call the final node a stage, the 5th one, even though no decision is to be made there. The total number of stages is then N=5.
The crucial attribute of the strategy based on dynamic programming is that we proceed backwards. We start at the very final stage. At this stage, there is just one node and there is nothing we can do, but note that it also makes sense to formulate problems with several possible nodes at the final stage, each with a different (terminal) cost — we will actually use this once we switch to the optimal control setting. Now we proceed backwards to the last but one, that is, the (N-1)th stage.
These are the F and H nodes at this 4th stage. In these two nodes there is again no freedom in the choice of action, but for each of them we can record their respective cost to go: 4 for the F node and 2 for the H node. These costs reflect how costly it is to reach the terminal node from them.
Things only get interesting once we proceed to the 3rd stage. We now have to consider three possible nodes: C, E, and G. For the C and G nodes there is still just one action and we can only record their costs to go. The cost to go for the C node can be computed as the cost of the immediate transition from C to F plus the cost to go for the F node, which we recorded previously, that is, 3+4=7. We record the value of 7 with the C node. Similarly for the G node. For the E node there are two possible actions — two possible decisions to be made, two possible paths to choose from. Either to the left (or, actually, up in our orientation of the graph), which would bring us to the node F, or to the right (or down), which would bring us to the node H. We compute the costs to go for both decisions and choose the decision with the smaller cost. Here the cost to go for the decision to go left is composed of the cost of the transition from E to F plus the cost to go from F, that is, 3+4=7. The cost to go for the decision to go right is composed of the cost of the transition from E to H plus the cost to go from H, that is, 2+2=4. Obviously, the optimal decision is to go right, that is, to the node H. Here, on top of the value of the optimal (smallest) cost to go from the node, we also record the optimal decision (go to the right/down). We do it by coloring the edge in blue.
Note that in principle we should also have highlighted the edges from F to I, from C to F, and from G to H. It was unnecessary here since these were the only edges emanating from those nodes.
We proceed backwards to the 2nd stage, and we compute the costs to go for the nodes B and D. Again we record their optimal values and the actual optimal decisions.
One last shift backwards and we are at the initial node A, for which we can do the same computation of the costs to go. Note that here, coincidentally, both decisions have the same cost to go, hence both possible decisions/actions are optimal and we can just toss a coin.
Maybe it is not immediately clear from the graph, but when viewed as an itinerary for a trip, it provides a feedback controller. Even if for whatever reason we find ourselves off the optimal path, we can always have a look at the graph — it will guide us along the path that is optimal from the given node. For example, if we happen to be in the node C, we still have a plan. Well, here it is misleadingly simple as there is no decision to be made, but you get the point.
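The whole backward pass can be written in a few lines of code. Below is a minimal Python sketch of it; only the edge costs explicitly quoted above (F to I: 4, H to I: 2, C to F: 3, E to F: 3, E to H: 2) come from the example, while the remaining costs are made-up placeholders, since the figure with all the lengths is not reproduced here.

```python
# Backward dynamic programming pass for the shortest-path example. Only the
# edge costs mentioned in the text (F->I: 4, H->I: 2, C->F: 3, E->F: 3,
# E->H: 2) are taken from it; the remaining costs are illustrative
# placeholders (chosen so that the two decisions at A tie, as in the text).

edges = {                      # edges[node] = list of (successor, edge cost)
    "A": [("B", 2), ("D", 2)],           # placeholder costs
    "B": [("C", 3), ("E", 1)],           # placeholder costs
    "D": [("E", 1), ("G", 3)],           # placeholder costs
    "C": [("F", 3)],
    "E": [("F", 3), ("H", 2)],
    "G": [("H", 5)],                     # placeholder cost
    "F": [("I", 4)],
    "H": [("I", 2)],
}

cost_to_go = {"I": 0}          # nothing to do at the terminal node
policy = {}                    # optimal next node, i.e., the "feedback law"

# Process the stages backwards: {F, H}, then {C, E, G}, then {B, D}, then {A}.
for stage in (["F", "H"], ["C", "E", "G"], ["B", "D"], ["A"]):
    for node in stage:
        # Bellman recursion: minimize (edge cost + cost to go of the successor).
        successor, cost = min(
            ((succ, c + cost_to_go[succ]) for succ, c in edges[node]),
            key=lambda pair: pair[1],
        )
        cost_to_go[node] = cost
        policy[node] = successor

print(cost_to_go)   # optimal cost to go recorded at every node
print(policy)       # where to go next from any node, optimal even off the path
```

Note that the result of the backward pass is not a single path but the dictionary policy, which tells us where to go next from any node, which is exactly the feedback interpretation discussed above.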
Bellman’s principle of optimality applied to the discrete-time optimal control problem
Let’s recapitulate here the problem of optimal control for a discrete-time system. In particular, we consider the system modelled by \bm x_{k+1} = \mathbf f_k(\bm x_k,\bm u_k), defined on the discrete time interval [i,N], with the initial state \bm x_i fixed (\bm x_i = \mathbf x_i). We aim at minimizing the cost function J_i^N\left(\bm x_i, \bm u_i, \bm u_{i+1}, \ldots, \bm u_{N-1}\right) = \phi(\bm x_N,N) + \sum_{k=i}^{N-1}L_k(\bm x_k,\bm u_k).
Before we proceed, some comments on the notation are in order. Indeed, a well tuned and systematically used notation is instrumental in dynamic programming.
While the cost function does depend on the final time too, in most if not all our analyses we assume that it is fixed and understood from the context. Hence we will not explicitly indicate the dependence on the final time. We will write just J_i(\ldots) instead of J_i^N(\ldots). This may help reduce the notational clutter as we are going to need the upper index for something else soon.
The cost function is clearly a function of the full sequence \bm x_i, \bm x_{i+1},\ldots, \bm x_N of the state vectors too. In the previous chapters we handled it systematically (either by considering them as optimization variables in the simultaneous direct approach or by introducing Lagrange multipliers in the indirect approach). But here we want to emphasize the fact that starting with \bm x_{i+1}, the whole state trajectory is uniquely determined by the initial state \bm x_i and the corresponding control trajectory \bm u_i, \bm u_{i+1},\ldots, \bm u_{N-1}. Therefore, we write the cost function as a function of the initial state, the initial time (we already agreed above not to emphasize the final time), and the sequence of controls.
The dependence on the discrete time is reflected by the lower indices: not only in \bm x_k and \bm u_k but also in \mathbf f_k(), L_k() and J_k(). We could perhaps write these as \mathbf f(\cdot,\cdot,k), L(\cdot,\cdot,k) and J(\cdot,\cdot,k) to better indicate that k is really an argument for these functions, but we prefer making it compatible with the way we indicate the time dependence of \bm x_k and \bm u_k.
Having introduced the cost function parameterized by the initial state, initial time and the full sequence of controls, we now introduce the optimal cost function
\boxed{ J^\star_i(\bm x_i) = \min_{\bm u_i,\ldots, \bm u_{N-1}} J_i\left(\bm x_i, \bm u_i, \bm u_{i+1}, \ldots, \bm u_{N-1}\right).} \tag{1}
The sequence of controls in the above minimization may be subject to some constraints, but we do not indicate them here for the sake of notational simplicity.
Understanding the difference is crucial. While the cost function J_i depends on the (initial) state, the (initial) time and the sequence of controls applied over the whole interval, the optimal cost function J^\star_i only depends on the (initial) state and the (initial) time.
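The distinction can also be seen computationally: evaluating the cost function J_i for one particular control sequence amounts to simulating the system forward and summing the costs, as in the following minimal Python sketch. The dynamics f, stage cost L and terminal cost phi used here are illustrative (and time-invariant) choices, not taken from the text.

```python
# A minimal sketch of evaluating the cost function J_i for one particular
# control sequence: simulate the system forward and accumulate the costs.
# The dynamics f, stage cost L and terminal cost phi below are illustrative
# (time-invariant) choices, not taken from the text.

def evaluate_cost(x_i, u_seq, f, L, phi):
    """Return J_i(x_i, u_i, ..., u_{N-1}) for a given control sequence u_seq."""
    x = x_i
    cost = 0.0
    for u in u_seq:
        cost += L(x, u)        # accumulate the stage cost L_k(x_k, u_k)
        x = f(x, u)            # advance the state: x_{k+1} = f_k(x_k, u_k)
    return cost + phi(x)       # add the terminal cost phi(x_N)

# Example usage with assumed scalar dynamics and costs:
f = lambda x, u: 0.9 * x + u
L = lambda x, u: x**2 + 0.1 * u**2
phi = lambda x: 10.0 * x**2

print(evaluate_cost(1.0, [0.0] * 5, f, L, phi))   # cost of applying zero control

# The optimal cost J_i*(x_i) would additionally require minimizing this value
# over all admissible control sequences, which is what dynamic programming
# will let us avoid doing directly.
```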
Assume now that we can find an optimal control sequence from any given state \bm x_{k+1} at time k+1 on, i.e., we can find \bm u_{k+1}^\star,\bm u_{k+2}^\star,\ldots, \bm u_{N-1}^\star yielding the optimal cost J_{k+1}^\star(\bm x_{k+1}). We will soon show how to actually find it, but for the time being we just assume that we have it. We now show how it can be used to find the optimal cost J_k^\star(\bm x_k) at state \bm x_k and time k.
Let’s now consider the following strategy: with the system at state \bm x_k and time k we apply some control \bm u_k, not necessarily an optimal one, which brings the system to the state \bm x_{k+1} in the next time k+1. But from then on we use the control sequence \bm u_{k+1}^\star,\bm u_{k+2}^\star,\ldots, \bm u_{N-1}^\star that is optimal from \bm x_{k+1}. The corresponding cost is L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1}). \tag{2}
Bellman’s principle of optimality states that if we optimize the above expression over \bm u_k, we get the optimal cost J_k^\star(\bm x_k) at time k \boxed{J_k^\star(\bm x_k) = \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1})\right).} \tag{3}
Hence, at a given state \bm x_{k} and time k, the optimization is performed over only one (possibly vector) control \bm u_k and not over the whole control trajectory as the definition of the optimal cost in Equation 1 suggests! What a simplification!
The minimization needs to be performed over the whole sum L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1}), because \bm x_{k+1} is a function of \bm u_k (recall that \bm x_{k+1} = \mathbf f_k(\bm x_k,\bm u_k)). We can also write Equation 3 as J_k^\star(\bm x_k) = \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\mathbf f_k(\bm x_k,\bm u_k))\right), which makes it more apparent.
Once we have the optimal cost function J^\star_{k+1}, the optimal control \bm u_k^\star(\bm x_k) at a given time k and state \bm x_k is obtained by \boxed{ \bm u_k^\star(\bm x_k) = \arg \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\mathbf f_k(\bm x_k,\bm u_k))\right).}
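To see how these formulas turn into the look-up-table feedback controller mentioned at the beginning of the chapter, consider the following minimal Python sketch of the backward pass over a gridded scalar problem. The dynamics, costs, horizon and grids are illustrative assumptions, not part of the text; the tabulated cost to go is evaluated at off-grid successor states by linear interpolation.

```python
import numpy as np

# A minimal sketch of the backward recursion (3) on a gridded scalar problem.
# The dynamics f, the costs L and phi, the horizon and the grids are all
# illustrative assumptions. The cost-to-go table J[k+1] is evaluated at
# off-grid successor states by linear interpolation (clamped at the grid ends).

N = 10                                  # number of stages
x_grid = np.linspace(-2.0, 2.0, 81)     # discretized state space
u_grid = np.linspace(-1.0, 1.0, 41)     # discretized (and bounded) control set

f = lambda x, u: 0.9 * x + u            # assumed dynamics x_{k+1} = f(x_k, u_k)
L = lambda x, u: x**2 + 0.1 * u**2      # assumed stage cost L(x_k, u_k)
phi = lambda x: 10.0 * x**2             # assumed terminal cost phi(x_N)

J = np.empty((N + 1, x_grid.size))      # optimal cost-to-go table J_k(x)
U = np.empty((N, x_grid.size))          # optimal feedback law (look-up table)
J[N] = phi(x_grid)                      # initialize with the terminal cost

for k in range(N - 1, -1, -1):          # proceed backwards in time
    for i, x in enumerate(x_grid):
        x_next = f(x, u_grid)                                   # all successors
        costs = L(x, u_grid) + np.interp(x_next, x_grid, J[k + 1])
        j_best = np.argmin(costs)                               # min over u_k
        J[k, i] = costs[j_best]
        U[k, i] = u_grid[j_best]

# U[k] is the look-up table: the optimal control at time k as a function of
# the gridded state, that is, a feedback controller.
```

Each row of U is one such table; with an n-dimensional state gridded into m points per dimension, the table would have m^n entries per time step, which is precisely the curse of dimensionality mentioned earlier.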
Alternative formulation of dynamic programming using Q-factors
The cost function in Equation 2 is sometimes called the Q-factor, and we denote it Q_k^\star(\bm x_k,\bm u_k). We write its definition here for convenience: Q^\star_k(\bm x_k,\bm u_k) = L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1}).
It is the cost of choosing the control \bm u_k at state \bm x_k and time k and then following the optimal control from the next time on.
The optimal cost function J_k^\star(\bm x_k) can be recovered from the optimal Q-factor Q_k^\star(\bm x_k,\bm u_k) by taking the minimum over \bm u_k J_k^\star(\bm x_k) = \min_{\bm u_k} Q_k^\star(\bm x_k,\bm u_k).
Bellman’s principle of optimality can be then expressed using the optimal Q-factor as \boxed{Q_k^\star(\bm x_k,\bm u_k) = L_k(\bm x_k,\bm u_k) + \min_{\bm u_{k+1}} Q_{k+1}^\star(\bm x_{k+1},\bm u_{k+1})}.
Optimal control is then obtained from the optimal Q-factor as the minimizing control \boxed{\bm u_k^\star(\bm x_k) = \arg \min_{\bm u_k} Q_k^\star(\bm x_k,\bm u_k).}
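For completeness, here is the gridded sketch from above rewritten in terms of Q-factors; as before, all numerical choices are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

# The same gridded scalar example as in the previous sketch, now written in
# terms of Q-factors. All numerical choices are illustrative assumptions.

N = 10
x_grid = np.linspace(-2.0, 2.0, 81)     # discretized state space
u_grid = np.linspace(-1.0, 1.0, 41)     # discretized control set

f = lambda x, u: 0.9 * x + u            # assumed dynamics
L = lambda x, u: x**2 + 0.1 * u**2      # assumed stage cost
phi = lambda x: 10.0 * x**2             # assumed terminal cost

Q = np.empty((N, x_grid.size, u_grid.size))   # tabulated Q-factors Q_k(x, u)
J = np.empty((N + 1, x_grid.size))            # tabulated optimal costs J_k(x)
J[N] = phi(x_grid)

for k in range(N - 1, -1, -1):                # backward in time
    for i, x in enumerate(x_grid):
        # Q_k(x, u) = L_k(x, u) + J_{k+1}(f_k(x, u)), interpolating J_{k+1}
        Q[k, i] = L(x, u_grid) + np.interp(f(x, u_grid), x_grid, J[k + 1])
    J[k] = Q[k].min(axis=1)                   # J_k(x)  = min_u Q_k(x, u)

u_star = u_grid[Q.argmin(axis=2)]             # u_k*(x) = arg min_u Q_k(x, u)
```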