HT2013Introduction to Optimal Control theory This note covers the material of the last four lectures. We give some basic ideas and techniques for solving optimal control problems. We introduce the idea of dynamic programming and the principle of optimality, explain Pontryagin’s maximum principle, and heuristic derivation of the theorem. Some worked examples are given in details. The topics are selected from last three chapters of our textbook. Optimal control theory has an extremely large literature so the aim of the note is of necessity to give a very concise treatment. We strongly recommend that readers study the material in the book for rigorous mathematical treatment. We start with some motivation and philosophy behind optimization. Optimization is a key tool in modelling. Sometimes it is important to solve a problem optimally. Other times either a near-optimal solution is good enough, or the real problem does not have a single criterion by which a solution can be judged. However, even then optimization is useful as a way to test thinking. If the optimal solution is ridiculous it may suggest ways in which both modelling and thinking can be refined. Optimal control theory is concerned with dynamical systems and their optimization over time. It takes stochastic, unknown or imperfect observations into account and deals with optimization problem over evolution processes. This is in contrast with optimization models in most topics in operations research, e.g. our advanced course Optimization where things are static and nothing is random or hidden. The three features of optimal control are imperfect state observations, dynamical and stochastic evolution. They give rise to new other type of optimization problems and require other way of thinking. This course does not deal with stochastic systems. However, it is my intention to lead the course to a more advanced level. So I also give a short introduction to Kalman filter and the Bellman principle of optimality applied to Markov process and heuristically to diffusion processes by dynamic programming approach to meet some students’ requirement. Note that these topics are beyond the scope of the course curriculum. 1 Performance indices Measures of performance We consider systems described by a general set of n nonlinear differential equations ˙ x(t) = f(x, u, t) (1.1) subject to x(t 0 ) = x 0 , (1.2) 1 where the components of f are continuous and satisfy standard conditions, such as having continuous first partial derivatives so that the solution (1.1) exists and is unique for given initial conditions (1.2). A cost functional, or a performance index is a scalar which provides a measure by which the performance of the cost of the system can be judged. Minimum-time problems. Here u(t) is to be chosen so as to transfer the system from an initial state x 0 to a specified state in the shortest possible time. This is equivalent to minimizing the performance index J = t 1 −t 0 = _ t 1 t 0 dt (1.3) where t 1 is the first instant of time at which the desired state is reached. Example 1. An aircraft pursues a ballistic missile and wishes to intercept it as quickly as possible. For simplicity neglect gravitational and aerodynamic forces and suppose that the trajectories are horizontal. At t = 0 the aircraft is a distance a from the missile, whose motion is known to be descried by x(t) = a + bt 2 , where b is a positive constant. The motion of the aircraft is given by ¨ x(t) a = u, where the thrust u(t) is subject to |u| ≤ 1, with suitably chosen units. 
Clearly the optimal strategy for the aircraft is to accelerate with maximum thrust u(t) = +1. After a time t the aircraft has then travelled a distance ct + 1 2 t 2 , where ˙ x a (0) = c, so interception will occur at time T where cT + 1 2 T 2 = a +bT 2 . This equation may not have any real positive solution; in other words, this minimum-time problem may have no solution for certain initial conditions. Terminal control. In this case the final state x f = x ( t 1 ) is to be brought as near as possible to some desired state r(t 1 ). A suitable performance measure to be minimized is e (t 1 )Me(t 1 ), (1.4) where e(t) = x(t)−r(t) and M is a real symmetric positive definite n×n matrix. A special case is when M is the identity matrix and (1.4) reduces to x f −r(t 1 ) 2 e . More generally, if M = diag[m 1 , ...m n ] then m i are chosen so as to weight the relative importance of the deviations [x i (t 1 ) −r i (t 1 )] 2 . If some of the r i (t 1 ) are not specified then the corresponding elements of M will be zero and M will be only positive semidefinite. Minimum effort. The desired final state is now to be attained with minimum total expenditure of control effort. Suitable performance indices to be minimized are _ t 1 t 0 β i |u i |dt (1.5) 2 or _ t 1 t 0 u Rudt (1.6) where R is real positive definite and the β i and r ij are weighting factors. Tracking problems. The aim here is to follow or “track” as closely as possible some desired state r(t) throughout the interval t 0 ≤ t ≤ t 1 . Following (1.4) and (1.6) a suitable index is _ t 1 t 0 e Qedt (1.7) where Q is real symmetric positive semidefinite. If u i (t) are unbounded then minimization of (1.7) can lead to a control vector having infinite components. This is unacceptable for real life problems, so to restrict the total control effort a combination of (1.6) and (1.7) can be used, giving _ t 1 t 0 (e Qe +u Ru)dt (1.8) expressions of the form (1.6), (1.7) and (1.8) are termed quadratic performance indices. Example 2. A landing vehicle separates from a spacecraft at time t = 0 at an altitude h from the surface of a planet, with initial downward velocity ν. For simplicity assume that gravitational forces can be neglected and that the mass of the vehicle is constant. Consider vertical motion only, with upwards regarded as the positive direction. let x 1 denote the altitude, x 2 velocity and u(t) the thrust exerted by the rocket motor, subject to |u(t)| ≤ 1 with suitable scaling. The equations of motion are ˙ x 1 = x 2 , ˙ x 2 = u (1.9) and the initial conditions are x 1 (0) = h, x 2 (0) = ν. (1.10) For a “soft landing” at some time t we require x 1 (t f ) = 0, x 2 (t f ) = 0. (1.11) A suitable performance index might be _ t f 0 (|u| +k)dt, (1.12) which is a combination of (1.3) and (1.5). The expression (1.12) represents a sum of total fuel consumption and time to landing, k being a factor which weights the relative importance of these two quantities. The expression for the optimal control which minimizes (1.12) subject to (1.9), (1.10) and (1.11) will be developed later. Of course the simple equation (1.9) arise in a variety of situations. 3 Performance indices given above are termed functionals, since they assign a unique real number to a set of functions x i (t), u j (t). 
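A quick numerical companion to Example 1: with full thrust u = +1 the interception time is the smallest positive root of cT + T²/2 = a + bT², and for some data no positive real root exists, so the minimum-time problem has no solution. A minimal sketch in Python (the function name and the numerical values of a, b, c are illustrative choices, not part of the example):

```python
import math

def interception_time(a, b, c):
    """Smallest T > 0 with c*T + T**2/2 = a + b*T**2 (full thrust u = +1),
    or None if the aircraft never catches the missile."""
    # Rewrite as (1/2 - b)*T**2 + c*T - a = 0.
    A, B, C = 0.5 - b, c, -a
    if abs(A) < 1e-12:                       # degenerate case b = 1/2
        return a / c if c > 0 else None
    disc = B * B - 4 * A * C
    if disc < 0:
        return None                          # no real root: no interception
    roots = [(-B + s * math.sqrt(disc)) / (2 * A) for s in (1.0, -1.0)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None

print(interception_time(a=10.0, b=0.2, c=1.0))   # about 4.34: interception occurs
print(interception_time(a=10.0, b=2.0, c=1.0))   # None: no solution for these data
```

In the second call the missile's acceleration dominates the available thrust and the discriminant is negative, so no interception time exists, illustrating the remark above.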
In the classical optimization literature more general functionals are used, for instance the problem of Bolza is to choose u(t) so as to minimize J(u) = Q[x(t 1 ), t 1 )] + _ t 1 t 0 q(x, u, t)dt (1.13) subject to (1.1), the scalar function Q and q being continuous and having continuous first partial derivatives. 2 Dynamic programming We begin with a simple example to illustrate the idea of the Bellman optimality principle. A key idea is that optimization over time can often be thought of as optimization in stages. We trade off our desire to obtain the lowest possible cost at the present stage against the implication this would have for costs at future stages. The best action minimizes the sum of the cost incurred at the current stage and the least total cost that can be incurred from all subsequent stages, consequent on this decision. 2.1 Bellman’s principle of optimality The optimality principle stated by Bellman is as follows: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. According to Bellman, the basic idea of dynamic programming was initiated by himself in his research done during 1949-1951, mainly for multistage decision problems. He also found that the technique was also applicable to the calculus of variations and to optimal control problems whose state equations are ordinary differential equations, which led to a nonlinear PDE now called the Hamilton-Jacobi-Bellman equation. In 1960s, Kalman pointed out such a relation and probably was the first to use the name HJB equation of the control problem. As a matter of fact, the idea of the principle of the optimality actually goes back to Jakob Bernoulli while he was soling the famous brachistochrone problem, posed by his brother Johann Bernoulli in 1696. Moreover, an equation identical to what was nowadays called the Bellman equation was first derived by Carath´eodory in 1926 while he was study- ing the sufficient conditions of the calculus of variations problem (his approach was named Carath´eodory’s royal road. The work of Swedish mathematician Wald on sequential anal- ysis in the late 1940s contains some ideas very similar to that of dynamic programming. Briefly we can state the Bellman’s principle of optimality as follows. From any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem 4 initiated at that point. We illustrate this principle by an example. Example 3. (The shortest path problem) Consider the “stagecoach problem” in which a traveler wishes to minimize the length of a journey from town A to town J by first traveling to one of B,C, or D and then onwards to one of E,F or G then onwards to one of H or I and finally to J. Thus there are four “stages”. The arcs are marked with distances between towns. A B C D E F G H I J 2 4 3 7 4 6 3 2 4 4 1 5 3 3 3 6 4 1 3 4 Figure 1: Road system for stagecoach problem Let V (x) be the minimal distance from town X to town J. Then obviously V (J) = 0, V (H) = 3, V (I) = 4. Recursively we compute next stage and obtain the following. V (F) = min(6 +V (H), 3 +V (I)) = 7, V (E) = min(1 +V (H), 4 +V (I)) = 4, V (G) = min(3 +V (H), 3 +V (I)) = 6, The underline above indicates that the minimum is attained at that point. 
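The remaining values V(B), V(C), V(D) and V(A) are worked out by hand just below; the same backward sweep is easy to automate. A small sketch (the arc lengths are read off Figure 1; since several towns tie, the path recovered here is one of the shortest paths and need not be the one singled out below):

```python
# Arc lengths from Figure 1.
arcs = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6},
    'C': {'E': 3, 'F': 2, 'G': 4},
    'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4},
    'F': {'H': 6, 'I': 3},
    'G': {'H': 3, 'I': 3},
    'H': {'J': 3},
    'I': {'J': 4},
}

V = {'J': 0}                 # value function: minimal distance from each town to J
best = {}                    # an optimal successor for each town
# Sweep the stages backwards: first H, I, then E..G, then B..D, then A.
for stage in (['H', 'I'], ['E', 'F', 'G'], ['B', 'C', 'D'], ['A']):
    for town in stage:
        succ = min(arcs[town], key=lambda y: arcs[town][y] + V[y])
        best[town] = succ
        V[town] = arcs[town][succ] + V[succ]

print(V['A'])                # 11
path, town = ['A'], 'A'
while town != 'J':
    town = best[town]
    path.append(town)
print(' -> '.join(path))     # one shortest path of length 11 (ties make it non-unique)
```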
V (B) = min(7 +V (E), 4 +V (F), 6 +V (G)) = 11, V (C) = min(3 +V (E), 2 +V (F), 4 +V (G)) = 7, V (D) = min(5 +V (G), 1 +V (F), 4 +V (E)) = 8, and finally V (A) = min(2 +V (B), 4 +V (C), 3 +V (D)) So the shortest path is A →D →F →I →J. From above calculation we know that the shortest path is not unique. (Find all of the shortest paths.) 2.2 The optimality equation To avoid complication of technicality we first derive the optimality equation for discrete- time problem. In this case t takes integer values, say t = 0, 1, 2, .... Suppose that u t is a 5 control variable whose value is to be chosen at time t. Let U t (= u 0 , u 1 , ..., u t−1 ) denote the partial sequence of controls (or decisions) taken over the first t stages. Suppose the cost up to the time horizon N is given by C = G(U N−1 ) = G(u 0 , u 1 , ..., u N−1 ). Then the principle of Optimality is phrased in the following theorem. Theorem 2.1 (The principle of optimality). Define the functions G(U t−1 ) = inf ut,u t+1 ,...,u N−1 G(U N−1 ). Then these functions obey the recursion G(U t−1 , t) = inf ut G(U t , t + 1), t < N, with terminal condition G(U N−1 , N) = G(U N−1 ). The proof is immediate, since G(U t−1 , t) = inf ut inf u t+1 ,...,u N−1 G(u 0 , u 1 , ..., u t , u t+1 , ..., u N−1 ). Now we consider the dynamical system x t+1 = f(x t , u t , t) where x ∈ R n is a state variable, the control variable u t is chosen on the basis of knowing U t−1 = (u 0 , ..., u t−1 ), which determines everything else. But a more economical represen- tation of the past history is often sufficient. Suppose we wish to minimize a cost function of the form J = N−1 t=0 J(x t , u t , t) +J N (x N ) (2.1) by choice of controls {u 0 , u 1 , ..., u N−1 }. A cost function that can be written in this way is called decomposable cost. Define the so-called cost-to-go function from time t onwards as J t = N−1 τ=t J(x τ , u τ , τ) +J N (x N ) (2.2) and the minimal cost-to-go function as an optimization over {u t , ..., u N−1 } conditional on x t = x, V (x, t) = inf ut,...,u N−1 J t . Here V (x, t) is the minimal future cost from time t onwards, given that the state x at time t. The function V (x, t) is also called a value function. Then by induction we can prove that V (x, t) = inf u [J(x, u, t) +V (f(x, u, t), t + 1)] t < N (2.3) 6 with terminal condition V (x, N) = J N (x N ), where x is a generic value of x t . The min- imizing u in (2.3) is the optimal control u(x, t) and values of x 0 , ..., x t−1 are irrelevant. The optimality equation (2.3) is also called the dynamical programming equation (DP) or Bellman equation. The DP equation defines an optimal control problem in feedback or closed loop form, with u t = u(x t , t). This is in contrast to the open loop formulation in which {u 0 , ..., u N−1 } are to be determined all at once at time 0. A policy (or strategy) is a rule for choosing the value of the control variable under all possible circumstances as a function of the perceived circumstances. Keep the following in mind. (i) The optimal u t is a function of x t and t, i.e., u t = u(x t , t). (ii) The DP equation yields the optimal control u t in closed loop form. It is optimal whatever the past control policy may have been. (iii) The DP equation is a backward recursion in time (from which we get the optimum at N −1, then N −2 and so on.) The later policy is decided first. It could also be instructive to remember the citation from Kierkegaard “Life must be lived forward and understood backwards”. Example 4. Managing spending and savings. 
An investor receives annual income from a building society of x t kr in year t. He consumes u t and adds x t − u t to his capital, 0 ≤ u t ≤ x t . The capital is invested at interest rate θ × 100%, and so his income in year t + 1 increases to x t+1 = f(x t , u t ) = x t +θ(x t −u t ). He desires to maximize his total consumption over N years, J = N−1 t=0 u t . (What is your guess?) It is clear that this problem is time invariant. So we can drop t in all functions involved. Let now the value function V s (x) denote the maximal reward obtainable, starting in state x and when there is time s = N −t to go. The DP equation is V s (x) = max 0≤u≤x [u +V s−1 (x +θ(x −u))], with the terminal condition V 0 (x) = 0, (since no more can be obtained once time N is reached). Remember that we use short hand notation for x and u for x s and u s . The idea is to substitute backwards and soon guess the form of the solution. First, V 1 (x) = max 0≤u≤x [u +V 0 (x +θ(x −u))] = max 0≤u≤x [u + 0] = x. Next, V 2 (x) = max 0≤u≤x [u +V 1 (x +θ(x −u))] = max 0≤u≤x [u +x +θ(x −u)]. 7 Since the function that is to be maximized is linear in u, its maximum attains at either u = 0 or u = x. Thus V 2 (x) = max((1 +θ)x, 2x) = max(1 +θ, 2)x = c 2 x. This motivates the guess V s−1 (x) = c s−1 x. We shall show that this is a right answer. The first induction step is already done. So we assume this is valid at s −1. We find that V s (x) = max 0≤u≤x [u +c s−1 (x +θ(x −u))] = max[(1 +θ)c s−1 , 1 +c s−1 ]x = c s x. Thus our guess is verified and V s (x) = c s x, where c s satisfy the recursion implicitly in the above computation. that is c s = c s−1 + max[θc s−1 , 1], which yields c s = _ _ _ s s ≤ s ∗ (1 +θ) s−s ∗ s ∗ s ≥ s ∗ , where s ∗ is the least integer such that s ∗ ≥ 1 θ . The optimal strategy is to invest the whole of the income in years 0, ..., N −s ∗ −1, (to build up capital) and then consume the whole of the income in years N −s ∗ , ..., N −1. Is this a surprise for you? What we learn from this example is the following. (i) It is often useful to frame things in terms of time to go, s. (ii) Although the form of the DP equation can sometimes look messy, try working backwards from V 0 (x) (which is known). Often we would get a pattern from which we can piece together a solution. (iii) When the dynamics are linear, the optimal control lies at an extreme point of the set of admissible controls. This form of policy is known as bang-bang control. 2.3 Markov decision processes ∗ This section is optional. It is written for completeness and for a student who wishes to know how the theory can be used in non-deterministic optimal control problems. We shall state the theory for controlled diffusion processes later. Consider now stochastic evolution. Let X t = (x 0 , .., x t ) and U t = (u 0 , ..., u t ) denote the x and u histories at time t. As before, state structure is defined by a dynamic system having value at x t at time t with the following properties. (a) Markov dynamics: (i.e., a stochastic dynamical system) P(x t+1 |X t , U t ) = P(x t+1 |x t , u t ), where P( ) is probability. (b) Decomposable cost: the cost given by (2.1) 8 These assumptions define state structure. To avoid loosing insight we also require the following property. (c) perfect state observation: The current value of the state is observable. That is, x t is known at the time at which u t must be chosen. Let W t denote the observed history at time t. Assume W t = ((X t , U t−1 ). Note that J is determined by W N , so we might write J = J(W N ). 
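Before continuing with the stochastic setting, the closed form for c_s and the switch-over policy of Example 4 are easy to check numerically. A small sketch, with θ = 0.15 and N = 20 as illustrative choices:

```python
import math

theta, N = 0.15, 20
s_star = math.ceil(1 / theta)          # least integer s* with s* >= 1/theta

# Value-function coefficients V_s(x) = c_s * x via the recursion derived above.
c = [0.0]
for s in range(1, N + 1):
    c.append(c[-1] + max(theta * c[-1], 1.0))

# Closed form: c_s = s for s <= s*, and c_s = (1+theta)^(s-s*) * s* for s >= s*.
closed = [s if s <= s_star else (1 + theta) ** (s - s_star) * s_star
          for s in range(N + 1)]
assert all(abs(a - b) < 1e-9 for a, b in zip(c, closed))

# Simulate the bang-bang policy: invest everything while s > s*, then consume everything.
x, total = 1.0, 0.0
for t in range(N):
    s = N - t                                  # time to go
    u = 0.0 if s > s_star else x               # u = 0 or u = x (extreme points)
    total += u
    x = x + theta * (x - u)
print(round(total, 6), round(c[N], 6))         # both equal V_N(1) = c_N
```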
These assumptions define what is known as a discrete-time Markov decision processes (MDP). It is widely used in applications. As before the cost-to-go function is given by (2.2). Denote the minimal expected cost-to-go function by V (W t ) = inf π E π [J t |W t ], where π denote a policy, i.e., a rule for choosing the controls u 0 , ..., u N−1 , and E( ) denotes the mathematical expectation. We have the following theorem. Theorem 2.2. V (W t ) is a function of x t and t alone, say V (x t , t). It satisfies the opti- mality equation V (x t , t) = inf ut [J(x t , u t , t) +E[V (x t+1 , t + 1)|x t , u t ]], t < N, (2.4) with terminal condition V (x N , N) = J N (x N ). Moreover, a minimizing value of u t , which is also a function of x t and t in (2.4) is optimal. Proof. Use induction. First, the value of V (W N ) is J N (x N ), so the theorem is valid at time N. Assume it is valid at time t + 1. The DP equation is then V (W t ) = inf ut [J(x t , u t , t) +E[V (x t+1 , t + 1)|X t , U t ]]. But by assumption (a), the right-hand side of above equation reduces to the right-hand side of (2.4). This proves the theorem. Example 5. Exercising a stock option. The owner of a call option has the option to buy a share at fixed “striking price” p. The option must be exercised by day N. If he exercises the option on day t and then immediately sells the share at the current price x t , he can make a profit of x t − p. Suppose the price sequence obeys the equation x t+1 = x t + t , where the t are i.i.d random variables for which E|| < ∞. The aim is to exercise the option optimally. Let V s (x) be the value function (maximal expected profit) when the share price is x and there are s days to go. Show that (i) V s (x) is non-decreasing in s, (ii) V s (x) − x is non-decreasing in x and (iii) V s (x) is continuous in x. Deduce that the optimal policy can be characterized as follows. There exists a non-decreasing sequence {a s } such that an 9 optimal policy is to exercise the option the first time that x ≥ x s , where x is the current price and s is the number of days to go before expiry of the option. The state variable at time t is, strictly speaking, x t plus a variable which indicates whether the option has been exercised or not. However, it is only the latter case which is of interest, so x is the effective state variable. Since DP makes its calculations backwards, from the terminal point, it is often an advantage to write things in terms of s, time to go, as pointed earlier. Let V s (x) be the value function (maximal expected profit) with s days to go then V 0 (x) = max(x −p, 0) and so the DP equation is V s (x) = max(x −p, E[V s−1 (x +)]), s = 1, 2, ... Note that the expectation operator comes outside, not inside, V s−1 (·). We can use induc- tion to show (i)-(iii). For example, (i) is obvious, since increasing s means we have more time over which to exercise the option. However, for a formal proof we have V 1 (x) = max(x −p, E[V 0 (x +)]) ≥ max(x −p, 0) = V 0 (x). Now suppose V s−1 ≥ V s−2 . Then V s (x) = max(x −p, E[V s−1 (x +)]) ≥ max(x −p, E[V s−2 (x +)]) = V s−1 (x). Therefore V s is non-decreasing in s Similarly, an inductive proof of (ii) follows from V s (x) −x = max(−x, E[V s−1 (x +) −(x +)] +E()), since the left-hand inherits the non-decreasing character of the term V s−1 (x+)−(x+) in the right-hand side . 
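The argument is completed in the next paragraph. As a numerical companion, the recursion V_s(x) = max(x − p, E[V_{s−1}(x + ε)]) and the thresholds a_s can be computed directly on a price grid. A sketch in which the strike p, the two-point distribution of ε and the grid are illustrative assumptions:

```python
import numpy as np

p = 10.0                                    # striking price
eps_vals = np.array([-1.0, 1.0])            # illustrative price steps (negative drift)
eps_prob = np.array([0.6, 0.4])

x_grid = np.linspace(0.0, 40.0, 4001)       # price grid, assumed wide enough
V = np.maximum(x_grid - p, 0.0)             # V_0(x) = max(x - p, 0)

thresholds = []
for s in range(1, 11):
    # Continuation value E[V_{s-1}(x + eps)]; values off the grid are clamped by interp.
    cont = sum(pr * np.interp(x_grid + e, x_grid, V)
               for e, pr in zip(eps_vals, eps_prob))
    V = np.maximum(x_grid - p, cont)
    # a_s: smallest price at which exercising (x - p) is at least as good as waiting.
    idx = np.argmax(x_grid - p >= cont - 1e-9)
    thresholds.append(x_grid[idx])

print(np.round(thresholds, 2))              # non-decreasing in s, as claimed
```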
Thus the optimal strategy can be characterized as asserted, because from (ii) and (iii) and the fact that V s (x) ≥ x − p it follows that there exists an a s such that V s (x) is greater than x − p if x < a s nd equals x − p is x ≥ a s . It follows from (i) that a s is non-decreasing in s. The constant s s is the smallest x for which V s (x) = x −p. Example 6. Accepting the best offer: We are to interview N candidates for a job. At the end of each interview we must either hire or reject the candidate we have just seen, and may not change this decision later. Candidates are seen in random order and can be ranked against those seen previously. The aim is to maximize the probability of choosing the candidate of greatest rank. Let W − t be the history of observations up to time t, i.e., after we have interviewed the tth candidate. All that matters are the value of t and whether the tth candidate is better than all her predecessors: let x t = 1 if this is true and x t = 0 if it is not. In the case x t = 1, the probability she is the best of all N candidates is P(best of N|best of first t) = P(best of N) P(best of first t) = 1/N 1/t = t N . 10 Now the fact that the tth candidate is the best of the t candidates seen so far places no restriction on the relative ranks of the first t − 1 candidates. Thus x t = 1 and W t−1 are statistically independent and we have P(x t = 1|W t−1 ) = P(W t−1 |x t = 1) P(W t−1 ) P(X −t = 1) = P(x t = 1) = 1 t . Let V (0, t−1) be the probability that under an optimal policy we select the best candidate, given that we have seen t −1 candidates so far and the last one was not the best of those. DP gives V (0, t −1) = t −1 t V (0, t) + 1 t max( t N , V (0, t)) = max( t −1 t V (0, t) + 1 h , V (0, t)) These imply V (0, t − 1) ≥ V (0, t) for all t ≤ N. Therefore, since t N and V (0, t) are respectively increasing and non-decreasing in t, it must be that for small t we have V (0, t) > t N and for large t we have V (0, t) < t N . Let t 0 be the smallest t such that V (0, t) ≤ t N . Then V (0, t −1) = _ _ _ V (0, t 0 ), t < t 0 , t−1 t V (0, t) + 1 N , t ≥ t 0 . Solving the second of these backwards from the point t = N, V (0, N) = 0, we obtain V (0, t −1) t −1 = 1 N(t −1) + V (0, t) t = · · · = 1 N(t −1) + 1 ht +· · · 1 N(N −1) , Hence, V (0, t −1) = t −1 N N−1 τ=t 0 1 τ , t ≥ t 0 . Since we require V (0, t 0 ) ≤ t 0 N , it must be that t 0 is the smallest integer satisfying N−1 τ=t 0 1 τ ≤ 1. For large N the sum on the left above is about log(N/t 0 ), so log(N/t 0 ) ≈ 1 and we find t 0 ≈ h e . The optimal policy is to interview about N e candidates, but without selecting any of these, and then select the first one thereafter that is the best of all those seen so far. The probability of success is V (0, t 0 ) ∼ t 0 N ∼ 1 e = 0.3679. It is surprising that the probability of success is large for arbitrarily large N. There are couple of things we should learn from this example. (i) It is often useful to try to establish the fact that terms over which a maximum is being taken are monotone in opposite directions, as we did with t N and V (0, t). (ii) A typical approach is to first deter- mine the form of the solution, then find the optimal cost (reward) function by backward recursion from the terminal point where its value is known. 11 2.4 LQ optimal control problem In this section we shall solve the optimal control problem where the dynamical system is linear and cost function is quadratic, using DP. 
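Returning for a moment to Example 6: the threshold t_0 and the ≈ 1/e success probability can be checked directly. A small sketch with N = 100 as an illustrative choice; success_prob computes the exact winning probability of the cutoff rule "reject the first r candidates, then accept the first one better than all seen so far":

```python
import math

N = 100

# t0: smallest integer t with sum_{tau=t}^{N-1} 1/tau <= 1, as derived above.
t0 = next(t for t in range(1, N) if sum(1 / tau for tau in range(t, N)) <= 1)

def success_prob(r, N):
    """Exact success probability of the cutoff rule that skips the first r candidates."""
    if r == 0:
        return 1 / N
    return (r / N) * sum(1 / (k - 1) for k in range(r + 1, N + 1))

best_r = max(range(N), key=lambda r: success_prob(r, N))

# The DP threshold t0 corresponds to skipping the first t0 - 1 candidates; both are close to N/e.
print(t0 - 1, best_r, round(N / math.e, 1))
print(round(success_prob(best_r, N), 4), round(1 / math.e, 4))   # both close to 1/e
```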
We begin with the discrete-time problem and then extend the results to continuous time. Consider now the system in state space form x t+1 = A t x t +B t u t where x ∈ R n and u ∈ R m , A is n×n and B is n×m. We wish to minimize the quadratic cost J = N−1 t=0 (x t , u t ) _ Q t S t S t R t __ x t u t _ +x N P N x N where Q and P N are given, positive semi-definite, and R is given and is positive definite. This is a model for regulation of x, u to the point (0, 0). The following lemma will be used frequently. Lemma 2.3. Suppose x and u are vectors. Consider a quadratic form f(x, u) := (x , u ) _ P xx P xu P ux P uu __ x u _ . Assume that the matrix _ P xx P xu P ux P uu _ is symmetric, P uu is positive definite. Then the minimum with respect to u is attained at u = −P −1 uu P ux x, and is equal to x (P xx −P xu P −1 uu P ux )x. Proof. Suppose that the quadratic form is minimized at u. Then According to the neces- sary condition for optimality f u = 2P uu u +P ux x = 0 Here the symmetry of the matrix is used. So we have u = −P −1 uu P ux x. Since R uu is positive definite, the function has a global minimum. So the above value of u is indeed an optimal solution. A straightforward calculation leads to the optimal value as stated in the lemma. 12 Now we are processing the solution to LQ control problem. Let the cost-to-go function be J t (x t , u t , t) = N−1 τ=t (x τ , u τ ) _ Q τ S τ S τ R τ __ x τ u τ _ +x N P N x N Then the optimal cost-to-go function is V t (x t , t) = min ut J t (x t , u t , t) and the optimal cost to our problem is V (x 0 , 0). According to the optimality principle, we find optimal cost function backwards. Obviously V N (x N , N) = x N P N x N . Next we have the DP equation V N−1 (x N−1 , N −1) = min u N−1 [(x N−1 , u N−1 ) _ Q N−1 S N−1 S N−1 R N−1 __ x N−1 u N−1 _ +V N (A N−1 x N−1 +B N−1 u N−1 , N)] = min u N−1 [(x N−1 , u N−1 ) _ Q N−1 S N−1 S N−1 R N−1 __ x N−1 u N−1 _ + (A N−1 x N−1 +B N−1 u N−1 ) P N (A N−1 x N−1 +B N−1 u N−1 )] = min u N−1 [(x N−1 , u N−1 ) _ Q N−1 +A N−1 P N A N−1 S N−1 +A N−1 P N B N−1 S N−1 +B N−1 P N A N−1 R N−1 +B N−1 P N B N−1 __ x N−1 u N−1 _ ] Since R N−1 is positive definite and P N is positive semidefinite the matrix R N−1 +B N−1 P N B N−1 is positive definite. By Lemma 2.3 u N−1 = −(R N−1 +B N−1 P N B N−1 ) −1 (S N−1 +B N−1 P N A N−1 )x N−1 is the optimal control at this stage and the value function is V N−1 (x N−1 , N −1) = x N−1 P N−1 x N−1 , where P N−1 = Q N−1 +A N−1 P N A N−1 −(S N−1 +B N−1 P N A N−1 ) (R N−1 +B N−1 P N B N−1 ) −1 (S N−1 +B N−1 P N A N−1 ). Continue in this way backwards, we obtain the following theorem. Theorem 2.4. The value function to the LQ problem has the quadratic form V t (x t , t) = x t P t x t , for t ≤ N, and the optimal control is u t = K t x t , t < N 13 where K t = −(R t +B t P t+1 B t ) −1 (S t +B t P t+1 A t ), t < N. The time-varying matrix P t satisfies the Riccati equation P t = Q t +A t P t+1 A t −(S t +B t P t+1 A t ) (R t +B t P t+1 B t ) −1 (S t +B t P t+1 A t ), t < N with the terminal condition P N given. Remark 2.5. (i) Note that S can be normalized to zero by choosing a new control ˜ u = u +R −1 Sx, and setting ˜ A = A−BR −1 S, ˜ Q = Q−S R −1 S. (ii) The optimally controlled process obeys x t+1 = Γ t x t . Here the matrix Γ t is so-called the gain matrix and is defined by Γ t = A t +B t K t . (iii) We can derive the solution for continuous-time LQ problem from the discrete-time solution. 
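Before passing to continuous time, the backward recursion of Theorem 2.4 is straightforward to implement. A sketch with illustrative, randomly generated time-invariant data; the closed-loop simulation at the end checks that the cost actually incurred equals x_0' P_0 x_0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 30
A = rng.standard_normal((n, n)) * 0.5       # illustrative time-invariant data
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)
S, PN = np.zeros((m, n)), np.eye(n)

# Backward Riccati recursion and feedback gains of Theorem 2.4.
P, K = [None] * (N + 1), [None] * N
P[N] = PN
for t in range(N - 1, -1, -1):
    M = R + B.T @ P[t + 1] @ B
    L = S + B.T @ P[t + 1] @ A
    K[t] = -np.linalg.solve(M, L)
    P[t] = Q + A.T @ P[t + 1] @ A - L.T @ np.linalg.solve(M, L)

# Simulate u_t = K_t x_t and compare the incurred cost with x0' P_0 x0.
x = rng.standard_normal(n)
cost, x0 = 0.0, x.copy()
for t in range(N):
    u = K[t] @ x
    cost += x @ Q @ x + 2 * u @ S @ x + u @ R @ u
    x = A @ x + B @ u
cost += x @ PN @ x
print(np.isclose(cost, x0 @ P[0] @ x0))     # True
```

For time-varying data the only change is to index A, B, Q, S, R by t inside the backward loop.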
In continuous-time we take the state space representation ˙ x = Ax + Bu and the cost function J = _ t f 0 _ x u _ _ Q S S R __ x u _ dt + (x Px)(t f ). Moving forward in time in increments of ∆ we have x t+1 →x t+∆ , A →I +A∆, B →B∆, R, S, Q →R∆, S∆, Q∆. Then as before, V (x, t) = x Px, where P obeys the Riccati equation, after a lengthy though straightforward calculation (letting ∆ → 0 and dropping the higher order terms) dP dt +Q+A P +PA−(S +PB)R −1 (S +B P) = 0. Observe that this is simpler than the discrete time version. The optimal control is u(t) = K(t)x(t) where K(t) = −R −1 (S +B P(t)). The optimally controlled system is ˙ x = Γ(t)x, where Γ(t) = A+BK(t). (iv) The solvability of Riccati equation and existence of unique positive definite solution of the Riccati equation are important and have their own interest. However, they are beyond this course. 14 2.5 DP in continuous time We study deterministic dynamic programming in continuous time. Consider the dynamical system ˙ x = f(x, u, t), x(t 0 ) = x 0 along with a cost functional J(t 0 , x o , u(·)) = _ t f t 0 q(x, u, t)dt +Q(x(t f ), t f ), where x ∈ R n , u ∈ U ⊆ R m , U is a set of admissible controls. We wish to minimize this cost. The cost-to-go functional can be defined as J(t , x(t ), u(·)) = _ t f t q(x(τ), u(τ), τ)dτ +Q(x(t f ), t f ). Next we define the optimal cost-to-go functional, the value funtion, as J ∗ (t , x(t )) = min u(·) J(t , x(t ), u(·)). where the optmization is performed over all admissbile controls. This means in particular that the optimization problem can be written as (P) min u(·) J(t 0 , x 0 , u(·)) (2.5) where the optimization is performed over all admissible controls u(t) ∈ U. Moreover, if u ∗ (·) is the optimal control function then V (x 0 , t 0 ) = J ∗ (t 0 , x 0 ) = J(t 0 , x 0 , u ∗ (· · · )) Note that J ∗ (t 0 , x 0 ) should satisfy the boundary condition J ∗ (t f , x) = Q(x, t f ). Next we prove the optimality principle. Proposition 2.6 (The optimality Principle). Let u ∗ : [t 0 , t f ] → R m be an optimal condition for (P) that generates the optimal trajectory x ∗ : [t 0 , t f ] → R n . Then for any t ∈ (t 0 , t f ], the restriction of the optimal control to [t , t f ], u ∗ | [t ,t f ] , is optimal for min u(·) J(t , x ∗ (t ), u(·)) and the corresponding optimal trajectory is x ∗ | [t ,t f ] . Proof. By additivity of the cost functional we have J ∗ (t 0 , t f ) = _ t t 0 q(x ∗ (t), u ∗ (t), t)dt +J(t , x ∗ (t ), u ∗ | [t ,t f ] ). Assume u ∗ | [t ,t f ] was not optimal over the interval [t , t f ] when the initial point is x(t ) = x ∗ (t ). Then there would exist an admissible control function ˆ u(·) defined on [t , t f ] such that J(t , x ∗ (t ), ˆ u(·)) < J(t , x ∗ (t ), u ∗ | [t ,t f ] ). 15 The control u(t) = _ _ _ u ∗ (t) t ∈ [t 0 , t ) ˆ u(t) t ∈ [t , t f ] is admissible and gives the cost J(t 0 , x 0 , u(·)) = _ t f t 0 q(x ∗ (t), u ∗ (t), t)dt +J(t , x ∗ (t ), ˆ u(·)) < _ t f t 0 q(x ∗ (t), u ∗ (t), t)dt +J(t , x ∗ (t ), u ∗ | [t ,t f ] ) = J ∗ (t 0 , x 0 )) This contradicts the optimality of u ∗ (·) and we have shown that the restriction u ∗ | [t ,t f ] is optimal over [t , t f ]. Now we are in a position to erive the dynamic programming equation. Following the logic of the discrete-time case, the value function should be approximately V (x, t) ≈ inf u [q(x, u, t)∆ +V (x +f(x, u, t)∆, t + ∆)]. By Taylor expansion, V (x +f(x, u, t)∆, t + ∆) ≈ V (x, t) +V t (x, t)∆ +V x (x, t)f(x, u, t), t)∆. 
Then V (x, t) = inf u∈U [q(x, u, t)∆ +V (x, t) +V t (x, t)∆ +V x (x, t)f(x, u, t)∆] Since V (x, t) is independent of u, we can pull it outside the minimization. This leads to the following partial differential equation V t (x, t) + inf u∈U [q(x, u, t) +V x (x, t)f(x, u, t)] = 0 (2.6) with the boundary condition V (x, t f ) = Q(x, t f ). This optimality equation equation is called the Hamilton-Jacobi-Bellman equation. Remark 2.7. Note that what we have derived is the following. Assume there is an op- timal control u(·), and the optimal cost-to-go function V (x, t) has continuously partial derivatives in both x and t, which we say a C 1 function. Then V (x, t) is a solution of the HJB equation and u(t) is the minimizing argument in (2.6) (pointwise). This necessary condition for optimality is not so useful since it assumes that the value function V is C 1 . Of course, this is not always the case. This is one of the drawbacks of this approach while Pontryagin’s maximum principle can do better. Another drawback is that the HJB equation is a PDE whose closed form solution is often difficult to find. Moreover, it is also difficult to solve it numerically, since a numerical procedure requires a discretization of the whole state space which is normally very large. However, the advantage of the the HJB equation is that it also gives a sufficient condition for optimality. We shall prove the so-called verification theorem for DP. 16 Theorem 2.8 (Verification Theorem for DP). Assume 1. V : R n ×[t 0 , t f ] →R is C 1 (in both arguments) and solves (HJB); 2. µ(t, x) = arg min u∈U [q(x, u, t) +V x (x, t)f(x, u, )] is admissible. Then V (x, t) = J ∗ (t, x) for all (t, x) ∈ [t 0 , t f ] ×R n and µ(t, x(t)) is the optimal feedback control. Proof. Let u(·) be an admissible control on [t 0 , t f ] and let x(·) be the corresponding solu- tion to ˙ x(t) = f(x(t), u(t), t), x(t 0 ) = x 0 . We have V (x(t f ), t f ) −V (x 0 , t 0 ) = _ t f t 0 ˙ V (x(t), t)dt = _ t f t 0 (V t (x(t), t) +V x (x(t), t)f(x(t), u(t), t)) dt ≥ − _ t f t 0 q(x(t), u(t), t)dt where the inequlity follows since (HJB)implies V t (x(t), t) +V x (x(t), t)f(x(t), u(t), t) ≥ −q(x(t), u(t), t). Using the boundary condition V (x(t f ), t f ) = Q(x(t f ), t f ) yields V (x 0 , t 0 ) ≤ Q(x(t f ), t f ) + _ t f t 0 q(x(t), u(t), t)dt = J(t 0 , x 0 , u(·)). This inequlity holds for all admissble controls, in particular the optimal u ∗ (·). Therefore we have shown that V (x 0 , t 0 ) ≤ J ∗ (t 0 , x 0 ) for all initial points. We shall next to show that equality is achieved by using u(t) = µ(t, x(t)). Note that Condition (ii) means min u∈U (q(x, u, t) +V x (x, t)f(x, u, t)) = q(x, µ(t, x(t)) +V x (x, t)f(x, µ(t, x(t)), t). This, together with (HJB), yields V t (x, t) +V x (x, t)f(x, µ(t, x(t)), t) = −q(x, µ(t, x(t)), t). Integrating this equation gives V (x 0 , t 0 ) = Q(x(t f ), t f ) + _ t f t 0 q(x(t), µ(t, x(t)), t)dt = J(t 0 , x 0 , u(·)) ≥ J ∗ (t 0 , x 0 ) where the last inequality follows sine J ∗ (t 0 , x 0 ) = min u(·) J ( t 0 , x 0 , u(·)). Thus we have J ∗ (t 0 , x 0 ) ≤ V (x(t 0 ), t 0 ) ≤ J ∗ (t 0 , x 0 ). This shows that (x(t), µ(t, x(t)) is the optimal state and control trajectory, i.e., x ∗ (t) = x(t) and u ∗ (t) = µ(t, x(t)). Since t = −0, x 0 are arbitrary this complets the proof of the theorem. 17 Note that the dynamic programming holds for all initial values. Now we give some examples to illustrate how to solve the HJB equation. Example 7. LQ-problem, revisited. 
Now we derive the solution of the continuous-time LQ problem by the HJB equation. To make the computation simpler we normalize the matrix S to zero. So we have the following PDE, 0 = V t (x, t) + min u [x Qx +u Ru +V x (x, t)(Ax +Bu)] and the boundary condition V (x, t f ) = x P(t f )x. We make an ansatz: V (x, t) = x P(t)x, where P is an n ×n symmetric matrix. Then V t (x, t) = x ˙ Px, V x (x, t) = 2x P. Substituting this into the previous equation yields 0 = x ˙ Px + min u [x Qx +u Ru + 2xAx + 2x PBu]. Now we shall find the minimizing u first. That the matrix R is positive definite guarantees the global minimum. The minimum is attained at u = −R −1 B P(t)x. So the equation becomes 0 = x ( ˙ P +Q+PA+A P +PBR −1 B P)x It holds if P satisfies the Riccati equation ˙ P +Q+PA+A P −PBR −1 B P = 0. In the LQ problem we can find a closed form solution for the HJB equation. The following example is an application of LQ-control. Example 8. (Tracking under disturbances) A problem which arises often is that of finding controls so as to force the output of a given system to track back (follow) a desired reference signal r(·). We also allow a disturbance ϕ to act on the system. Let Σ be a linear system over R with outputs ˙ x = A(t)x +B(t)u +ϕ(t), y = C(t)x, where ϕ is an R n -valued fixed measurable essentially bounded function. Assume given an R p -valued fixed measurable essentially bounded function r. We consider a cost criterion: J = _ τ σ [ω(t) R(t)ω(t) +e(t) Qe(t)]dt +ξ(τ) Sξ(τ), 18 where ˙ ξ = Aξ + Bω, ξ(σ) = x 0 and e = Cξ − r is the tracking error. We assume that R is an m × m symmetric matrix of measurable essentially bounded functions, Q is a p ×p symmetric matrix of measurable essentially bounded functions, and S is a constant symmetric n × n matrix. Further we assume that R is positive definite, Q and S are positive semidefinite, for each t. The problem is then that of minimizing J for given disturbance ϕ, reference signal r, and initial state, by an appropriate choice of controls. This problem can be reduced to LQ-control problem. See details on pages 372–375 in Sontag. Example 9. (Deterministic Kalman filtering) Let (A, B, C) be a time-varying continuous- time linear system, and let Q, R, S be measurable essentially bounded matrix function of sizes n × n, m × m and n × n, respectively. Assume that both S and R(·) are positive definite. For any given measurable essentially bounded function ¯ y on an interval [σ, τ], minimize the expression _ τ σ [ω(t) R(t)ω(t) + (C(t)ξ(t) − ¯ y(t)) Q(t)(C(t)ξ(t) − ¯ y(t))]dt +ξ(σ) Sξ(σ) over the set of all possible trajectories (ξ, ω) of ˙ ξ = Aξ +Bω on [σ, τ]. By reversing time, so that the cost is imposed on the final state, this problem can be reduced to Tracking problem in the preceding example. This problem is sometimes termed as (deterministic) Kalman filtering. See motivation and solution on pages 375–379 in Sontag. Before doing next example we want to point out a special type of optimal control problem, that is, optimization of the so-called discounted cost J = _ t f 0 e βt q(x, u, t)dt +e −βt f q(x(t f ), t f ). Let the value function for the discounted optimal time-to-go function be ˜ V (x, t). By the HJB equation, it satisfies the following PDE ˜ V t (x, t) + inf u∈U [e −βt q(x, u, t) + ˜ V x (x, t)f(x, u, t)] = 0 with the boundary value condition ˜ V (x, t f ) = e −βt f Q(x, t f ). Now let ˜ V (x, t) = e −βt V (x, t). 
Then V (x, t) satisfies the following equation V t (x, t) −βV (x, t) + inf u∈U [q(x, u, t) +V x (x, t)f(x, u, t)] = 0 (2.7) with V (x, t f ) = Q(x, t f ). We call this equation the discounted HJB equation. 19 Example 10. Estate planning. A man is considering his lifetime plan of investment and expenditure. He has an initial level of savings S and no income other than that which he obtains from investment at a fixed interest rate. His total capital x is therefore governed by the equation ˙ x = αx −u where α > 0 and u denotes his rate of expenditure. His immediate enjoyment due to expenditure is U(u) where U is his utility function. In his case U(u) = u 1 2 . Future enjoyment, at time t, is counted less today, at time 0, by incorporation of a discount term e −βt . Thus, he wishes to maximize J = _ t f 0 e −βt U(u)dt, assuming α 2 < β. We try to solve equation (2.7), i.e., 0 = max u ( √ u −βB +V t +V x (αx −u)) with V (x, t f ) = 0. We try a solution of the form V (x, t) = f(t) √ x. For this to work we need 0 = max u ( √ u −βf(t) √ x + ˙ f(t) √ x + f 2 √ x (αx −u)) The maximizing control is u = x f 2 . (Do not forget to check the sufficient condition!) The optimal value is thus √ x f ( 1 2 −(β − 1 2 α)f 2 +f ˙ f) Therefore we shall have a solution if we choose f to make the term within ( ) equal to 0. The boundary condition V (x, t f ) = 0 implies that f(t f ) _ x(t f ) = 0 which implies f(t f ) = 0. Hence we shall find solution to 1 2 −(β − 1 2 α)f 2 +f ˙ f = 0 with the terminal condition f(t f ) = 0. Thus we find f(t) 2 = 1 −e −( α 2 −β)(t f −t) 2α −β . There are few problems that can be solved analytically. We close this section by a short comment on the use of the HJB equation. For a given optimal control problem on the form min J = _ t f 0 q(x, u, t)dt +Q(x(t f ), t f ) subject to ˙ x = f(x, u, t) with u ∈ U we take the following steps. 20 1. Optimize pointwise over u to obtain u ∗ (p, x, t) = arg min u∈U [q(x, u, t) +p f(x, u, t)]. Here p ∈ R n is a parameter vector. 2. Define H(p, x, t) = q(x, u ∗ (p, x, t), t) +p f(x, u ∗ (p, x, t)). 3. Solve the PDE −V t (x, t) = H(V x (x, t), x, t) with the boundary condition V (x, t f ) = Q(x, t f ). If we define H(p, x, u) = q(x, u, t) + p f(x, u, t) then u ∗ (x, t, p) = arg min u∈U H(p, x, u, t). The same optimization is a part of the condition in Pontryagin’s minimum (maximum) principle. 3 Pontryagin’s principle I: a dynamic programming approach We explain Pontryagin’s maximum principle, derive it first by dynamic programming (through the HJB equation) under stronger assumptions and then by variational calculus, and give some examples of its use. Consider time-invariant optimal control problem. Problem: Minimize the cost function J = _ t f 0 q(x, u)dt +Q(x(t f )) subject to ˙ x = f(x, u), with x(0) = x 0 being fixed and u ∈ U, a set of admissible controls. The HJB equation is V t (x, t) + inf u [q(x, u) +V x (x)f(x, u)] = 0 (3.1) and the boundary condition V (x, t f ) = Q(x). Assume that (u ∗ , x ∗ ) is an optimal solution then (3.1) 0 = V t (x ∗ , t) +q(x ∗ , u ∗ ) +V x (x ∗ )f(x ∗ , u ∗ ). (3.2) Define now the so-called adjoint variable (or co-state variable) p = V x (x ∗ ) 21 where p ∈ R n , and the Hamiltonian function H(p, x ∗ , u ∗ ) = q(x ∗ , u ∗ ) +p f(x ∗ , u ∗ ). It is obvious that H p = f(x ∗ , u ∗ ) = ˙ x ∗ . To our purpose we make a stronger assumption on V : V is twice differentiable. Then ˙ p = dV x /dt = V xt (x ∗ , t) + ( ˙ x ∗ ) V xx (x ∗ , t) = V xt (x ∗ , t) +f(x ∗ , u ∗ ) V xx (x ∗ , t). 
On the other hand, differentiating (3.2) with respect to x gives V xt (x ∗ , t) +q x (x ∗ , u ∗ ) +f(x ∗ , u ∗ ) V xx (x ∗ , t) +V x (x ∗ , t)f x (x ∗ , u ∗ ) = 0 or equivalently, V tx (x ∗ , t) +f(x ∗ , u ∗ ) V xx (x ∗ , t) = −q x (x ∗ , u ∗ ) −p f x (x, u ∗ ). Therefore we have a system for the adjoint state −˙ p = q x +p f x with the terminal condition p(t f ) = Q x (x(t f )). If u ∗ is an optimizing argument then H(p(t), x(t), v) ≥ H(p(t), x(t), u(t)) for all t, 0 ≤ t ≤ t f , and all admissible controls v ∈ U. Note that the fact that optimizing the Hamiltonian function over u is the same as in dynamic programming approach. We summarize above arguments as a theorem. Theorem 3.1 (Pontryagin’s minimum principle (PMP)). Suppose that u(t) ∈ U and x(t) represent the optimal control and state trajectory for the optimal control problem Problem. Then there is an adjoint trajectory p(t) such that together u(t), x(t), and p(t) satisfy (i) state equation and initial state condition: ˙ x(t) = f(x(t), u(t)), x(0) = x 0 , (ii) adjoint equation and final condition: −˙ p = p f x (x(t), u(t)) +q x (x(t), u(t)), p(t f ) = Q x (x(t f )), (iii) minimum condition: for all t, 0 ≤ t ≤ t f , and all v ∈ U H(p(t), x(t), v) ≥ H(p(t), x(t), u(t)) where H is the Hamiltonian H(p, x, u) = p f(x, u) +q(x, u). 22 In PMP we have n state equations, n adjoint equations and m conditions from the minimum condition. So they decide the 2n+m variables in general. Note that we did not assume that the optimal cost function is twice differentiable although we used it in the derivation. The reason is that it is not needed when you use variational calculus. Neither do we need C 1 condition on the optimal cost function. PMP gives a necessary condition for optimality. Some other remarks are in order. In fact we can claim that the problem stated earlier covers most of optimal control problems where the terminal time is fixed. Remark 3.2. This theory can be easily extended to problems where the system equations, the constraints, and the cost functions are all explicit functions of time. Consider the problem with ˙ x(t) = f(x(t), u(t), t), x(0) = x 0 , u(t) ∈ U J = _ t f 0 (q(x(t), u(t), t)dt +Q(x(t f ), t f ). (3.3) Define an additional state variable x n+1 = t, this problem can be converted to an equivalent problem without explicit t dependence. Let z(t) = _ x x n+1 _ . Then z(0) = _ x 0 0 _ , ¯ f = _ f 1 _ . Then we have state equation ˙ z(t) = ¯ f(z(t), u(t)) with initial condition z(0) given, and u(t) ∈ U, and the cost function J = _ t f 0 q(z(t), u(t))dt +Q(z(t f )). It is easy to see that the state equation, the state initial condition and the minimum condition are the same as in PMP above, but with explicit t dependence. Nevertheless there are now n + 1 adjoint variables. We keep the same notation as in the theorem, ¯ p = (p , p n+1 ). The first n adjoint variables satisfy the condition −˙ p = p f x (x, u, t) +q x (x, u, t), while −˙ p n+1 = p f t (x, u, t) +q t (x, u, t) with the terminal condition p n+1 (t f ) = Q t (x(t f ), t f ) Hence the minimum principle for the time-variant optimal control problem is (i) state equation and initial state condition: ˙ x(t) = f(x(t), u(t), t), x(0) = x 0 , 23 (ii) adjoint equation and final condition: −˙ p(t) = p(t) f x (x(t), u(t), t) +q x (x(t), u(t), t), p(t f ) = q x (x(t f ), t f ), −˙ p n+1 = p f t (x, u, t) +q t (x, u, t). Note that the terminal condition for p n+1 is free because the terminal time is fixed. 
(iii) minimum condition: for all t, 0 ≤ t ≤ t f , and all v ∈ U H(x(t), v, p(t)) ≥ H(x(t), u(t), p(t)) where H is the Hamiltonian H(x, u, p, t) = p f(x, u, t) +q(x, u, t). Remark 3.3. From the theorem and the previous remark we can draw the following con- clusions: (i) the adjoint variable is the gradient of the value function with respect to the state vector. (ii) In DP the value function is obtained by solving a PDE, (the HJB equation). This is a consequence of the approach of looking for an optimal control from any initial point. (iii) In PMP we only solve the value function (or rather its gradient which is the adjoint variable) for a special initial condition. This gives a two point boundary value prob- lem for ODEs, which is normally much easier to deal with than the HJB equation. 4 Calculus of variations This subject dates back to Newton, and we have room for only a brief treatment. In particular we shall not mention the well-known Euler equation approach which can be found in standard texts on variational calculus. We consider the problem of minimizing J(u) in (1.13) subject to the system (1.1) and initial conditions (1.2). We assume that there are no constraints on the control functions u i (t) and that J(u) is differential, i.e., if u and u + δu are two controls for which J is defined then ∆J = J(u +δu) −J(u) = δJ(u, δu) +j(u, δu)δu (4.1) where δJ is linear in δu and j(u, δu) →0 as δu →0 (using any suitable norm). In (4.1) δJ is termed the variation in J corresponding to a variation δu in u. The control u ∗ is an extremal, and J has a relative minimum, if there exists an ε > 0 such that for all functions u satisfying u−u ∗ < ε the difference J(u) −J(u ∗ ) is nonnegative. A fundamental result is the following. 24 Theorem 4.1. A necessary condition for u ∗ to be an extremal is that δJ(u ∗ , δu) = 0 for all δu. The proof is omitted. We now apply Theorem 4.1. Introduce a vector of Lagrange multipliers p = [p 1 , ..., p n ] so as to form an augmented functional incorporating the constraints: J a = Q[x(t 1 ), t 1 ] + _ t 1 t 0 [q(x, u, t) +p (f − ˙ x)]dt (4.2) Integrating the last term by parts gives J a = Q[x(t 1 ), t 1 ] −[p x] t 1 t 0 + _ t 1 t 0 [H + ( ˙ p) x]dt (4.3) where the Hamiltonian function is defined by H(x, u, t) = q(x, u, t) +p f. (4.4) Assume that u is continuous and differentialable on t 0 ≤ t ≤ t 1 and that t 0 and t 1 are fixed. The variation in J a corresponding to a variation δu in u is δJ a = __ ∂Q ∂x −p _ δx _ t=t 1 + _ t 1 t 0 _ ∂H ∂x δx + ∂H ∂u δu + ( ˙ p) δx _ dt (4.5) where δx is the variation in x in the differential equations (1.1) due to δu, using the notation ∂H ∂x = _ ∂H ∂x 1 , · · · , ∂H ∂x n _ and similarly for ∂Q/∂x and ∂H/∂u. Note that since x(t 0 ) is specified, (δx) t=t 0 = 0. It is convenient to remove the term in (4.5) involving δx by suitably choosing p, i.e. by taking ˙ p i = − ∂H ∂x i , i = 1, 2, ..., n (4.6) and p i (t 1 ) = _ ∂Q ∂x i _ t=t 1 . (4.7) Eqn (4.5) then reduces to δJ a = _ t 1 t 0 _ ∂H ∂u δu _ dt (4.8) so by Theorem 4.1 a necessary condition for u ∗ to be an extremal is that _ ∂H ∂u i _ u=u ∗ = 0, t 0 ≤ t ≤ t 1 , i = 1, ..., m. (4.9) We have therefore established: 25 Theorem 4.2. Necessary conditions for u ∗ to be an extremal for (1.13) subject to (1.1) and (1.2) are that (4.6),(4.7), and (4.9) hold. Remark 4.3. The Hamiltonian is constant. Suppose u is unconstrained and f and q are autonomous, i.e., they do not explicitly depend on t. 
Let p, x, u satisfy the necessary conditions for optimality, then H(p, x, u) is constant for t 0 ≤ t ≤ t 1 . The following calculation proves this statement. Let H(p, x, u) = p f(x, u) +q(x, u). dH dt = H x ˙ x +H u ˙ u + ˙ p f = H x f −H x f = 0 where we used the fact that H u (p, x, u) = 0 and the state and adjoint equations. Thus the Hamiltonian is constant. In other words, H is constant along the optimal trajectory. Example 11. LQ-problem, revisited. We wish to minimize the cost function J = 1 2 _ t 1 0 (x Qx +u Ru)dt +x(t 1 ) P t 1 x(t 1 ) such that ˙ x = Ax +Bu, x(0) = x 0 , where Q and P t 1 are symmetric positive semi-definite matrices and R is symmetric positive definite matrix. The Hamiltonian is H(p, x, u) = p Ax +p Bu + 1 2 x Qx + 1 2 u Ru. Since this is an unconstrained problem, we can use the minimum condition H u = 0, which is p B + u R = 0, i.e., u = −R −1 B p. This together with the state equation, gives the following two-point boundary value problem ˙ x = Ax −BR −1 B p, x(0) = x 0 ˙ p = −Qx −A p, p(t 1 ) = P t 1 x(t 1 ) Or in matrix form _ ˙ x ˙ p _ = _ A −BR −1 B −Q −A __ x p _ , _ x(0) p(t 1 ) _ = _ x 0 P t 1 x(t 1 ) _ . In DP, we learned that u = −R −1 B Px. So we know less from Theorem 4.2. However, we shall show that the solution is in fact the same. Note that the state and adjoint equations form a linear system of 2n variables, which can be solved by the state transition matrix Φ(t, s) portioned as _ Φ 11 (t, s) Φ 12 (t, s) Φ 21 (t, s) Φ 22 (t, s) _ . Each block is n ×n. Therefore _ x 0 p(0) _ = _ Φ 11 (0, t 1 ) Φ 12 (0, t 1 ) Φ 21 (0, t 1 ) Φ 22 (0, t 1 ) __ x(t 1 ) P t 1 x(t 1 ) _ = _ Φ 11 (0, t 1 ) Φ 12 (0, t 1 ) Φ 21 (0, t 1 ) Φ 22 (0, t 1 ) __ I P t 1 _ x(t 1 ) (4.10) 26 So we have p(0) in terms of x 0 and the transition matrix. p(0) = (Φ 21 (0, t 1 ) + Φ 22 (0, t 1 )P t 1 )(Φ 11 (0, t 1 ) + Φ 12 (0, t 1 )P t 1 ) −1 x 0 if (Φ 11 (0, t 1 ) + Φ 12 (0, t 1 )P t 1 ) −1 exists. In a similar manner, we have _ x(t) p(t) _ = _ Φ 11 (t, t 1 ) Φ 12 (t, t 1 ) Φ 21 (t, t 1 ) Φ 22 (t, t 1 ) __ I P t 1 _ x(t 1 ) (4.11) It gives p(t) = (Φ 21 (0, t 1 ) + Φ 22 (0, t 1 )P t 1 )(Φ 11 (0, t 1 ) + Φ 12 (0, t 1 )P t 1 ) −1 x(t) =: P(t)x(t) It in turn gives u = −R −1 B Px. Now we have to show that P satisfies the Riccati equation. Substituting the control law obtained and the relation between p and x above yield (− ˙ P −PA−A P +PBR −1 B P −Q)x = 0 This holds if − ˙ P −PA−A P +PBR −1 B P −Q = 0 with terminal condition P(t 1 ) = P t 1 . We can also prove that P(t) can be obtained by P(t) = Y (t)X(t) −1 where X(t) and Y (t) solves the linear matrix equation _ ˙ X(t) ˙ Y (t) _ = _ A −BR −1 B −Q −A __ X Y _ , _ X(0) Y (0) _ = _ I P t 1 _ backwards in time over the interval [0, t 1 ]. So far we have assumed that t 1 is fixed and x ( t 1 ) is free. If this is not necessarily the case, then by considering (4.3) the terms in δJ a in (4.5) outside the integral are obtained to be __ ∂Q ∂x −p _ δx + _ H + ∂Q ∂t _ δt _ u=u ∗ t=t 1 (4.12) The expression in (4.12) must be zero by virtue of Theorem 4.1, since (4.6) and (4.9) still hold, making the integral in (4.5) zero. The implications of this for some important special cases are now listed. The initial conditions (1.2) hold throughout. Final time t 1 specified. (i) x(t 1 ) free In (4.12) we have δt 1 = 0 but δx(t 1 ) is arbitrary, so the conditions (4.7) must hold (with the fact that H is a constant when appropriate), as before. 27 (ii) x(t 1 ) specified In this case δt 1 = 0, δx(t 1 ) = 0 so (4.12) is automatically zero. 
The conditions are thus x ∗ (t 1 ) = x f the final state (4.13) and (4.13) replaces (4.7). Final time t 1 free (iii) x(t 1 ) free Both δt 1 = 0 and δx(t 1 ) are arbitrary, so for the expression in (4.12)to vanish, (4.7) must hold together with _ H + ∂Q ∂t _ u = u ∗ t = t 1 = 0 (4.14) In particular, if Q, q and f do not explicitly depend on t, then (4.14) and that H is a constant imply (H) u=u ∗ = 0, t 0 ≤ t ≤ t 1 . (4.15) (iv) x(t 1 ) specified Only δt 1 = 0 is now arbitrary in (4.12) so the conditions are (4.13) and (4.14) and (4.15). If the preceding conditions on x(t 1 ) apply to only some of its components, then since the δx i (t 1 ) in (4.12) are independent it follows that the appropriate conditions hold only for these components. Example 12. A particle of unit mass moves along the x-axis subject to a force u(t). It is required to determine the control which transfers the particle from rest at the origin to rest at x = 1 in unit time, so as to minimize the effort involved, measured by _ 1 0 u 2 dt. The equation of motion is ¨ x = u, and taking x 1 = x, x 2 = ˙ x we obtain the state equations ˙ x 1 = x 2 , ˙ x 2 = u (4.16) and from (4.4) H = p 1 x 2 +p 2 u +u 2 . From (4.9) the optimal control is given by 2u ∗ +p ∗ 2 = 0 (4.17) 28 and by (4.6) the adjoint equations are ˙ p ∗ 1 = 0, ˙ p ∗ 2 = −p 1 . (4.18) Integrating (4.18) gives p ∗ 2 = c 1 t + c 2 , where c 1 and c 2 are constants. From (4.16) and (4.17) we obtain ˙ x ∗ 2 = − 1 2 (c 1 t +c 2 ) which on integrating, and using the given conditions x 2 (0) = 0 = x 2 (1) produces x ∗ 2 (t) = 1 2 c 2 (t 2 −t), c 1 = −2c 2 . Finally, integrating the first equation in (4.16) and using x 1 (0) = 0, x 1 (1) = 1 gives x ∗ 1 (t) = 1 2 t 2 (3 −2t), c 2 = −12 and hence from (4.17) the optimal control is u ∗ (t) = 6(1 −2t). Example 13. A ship moves through a region of strong currents. For simplicity, and by a suitable choice of coordinates, assume that the current is parallel to the x 1 -axis and has velocity c = −V x 2 /a, where a is a positive constant, and V is the magnitude (assumed constant) of the ship’s velocity relative to the water. The problem is to steer the ship so as to minimize the time of travel from some given pointA to the origin. We see in the x u Current V x 2 1 Figure 2: figure that the control variable is the angel u. The equations of motion are ˙ x 1 = V cos u +c = V (cos u −x 2 /a) (4.19) ˙ x 2 = V sin u, (4.20) 29 where (x 1 (t), x 2 (t)) denotes the position of the ship at time t. The performance index is (1.3) with t 0 = 0, so from (4.4) H = 1 +p 1 V (cos u −x 2 /a) +p 2 V sin u. (4.21) The condition (4.9) gives −p ∗ 1 V sin u ∗ +p ∗ 2 V cos u ∗ = 0 so that tan u ∗ = p ∗ 2 /p ∗ 1 . (4.22) The adjoint equations (4.6) are ˙ p ∗ 1 = 0, ˙ p ∗ 2 = p ∗ 1 V/a, (4.23) which imply that p ∗ 1 = c 1 , a constant. Since t 1 is not specified we have case (iv) so that (4.16)-(4.18) hold. From (4.21) his gives 0 = 1 +p ∗ 1 V (cos u ∗ −x ∗ 2 /a) +p ∗ 2 V sin u ∗ = 1 +c 1 V (cos u ∗ −x ∗ 2 /a) +c 1 V sin 2 u ∗ / cos u ∗ , (4.24) using the expression for p ∗ 2 in (4.22). Substituting x ∗ 2 (t 1 ) = 0 into (4.24) leads to c 1 = −cos u 1 /V, (4.25) where u 1 = u ∗ (t 1 ). Eqn (4.25) reduces (4.24), after some rearrangement, to x ∗ 2 /a = sec u ∗ −sec u 1 . (4.26) Differentiating (4.26) with respect to t gives du ∗ dt sec u ∗ tan u ∗ = ˙ x ∗ 2 /a = V sin u ∗ /a by (4.20). Hence (V/a) dt du ∗ = sec 2 u ∗ which on integration produces tan u ∗ −tan u 1 = (V/a)(t −t 1 ). 
(4.27) Use of (4.19), (4.26) and (4.27) and some straightforward manipulation leads to an ex- pression for x ∗ 1 in terms of u ∗ and u 1 , which enables the optimal path to be computed. A typical minimum-time path is shown in Figure 3. 30 Figure 3: If the state at time T (assumed fixed) is to lie on a surface defined by some function m[x(t)] = 0, where m may in general be a k-vector, then it can be shown that in addition to the k conditions m[x ∗ (t 1 )] = 0 (4.28) there are further n conditions which can be written as ∂Q ∂x −p = d 1 _ ∂m 1 ∂x _ +d 2 _ ∂m 2 ∂x _ +· · · +d k _ ∂m k ∂x _ (4.29) both sides being evaluated at t = t 1 , u = u ∗ , x = x ∗ , p = p ∗ . The d i in (4.29) are constants to be determined. Together with the 2nconstants of integration there are thus a total 2n + k unknowns and 2n + k conditions (4.28) , (4.29) and (1.2). If t 1 is free then in addition (4.14) holds. The conditions which hold at t = t 1 for the various cases we have covered are summa- rized in the table: t 1 fixed t 1 free x(t 1 ) free (4.7) (4.7) and (4.14) x(t 1 ) fixed (4.13) (4.13) and (4.14) x(t 1 ) lies on m[x(t 1 )] = 0 (4.28) and (4.29) (4.14), (4.28) and (4.29) Example 14. If a second order system is to be transferred from the origin to a circle of unit radius, centre the origin, at some time t 1 then we must have [x ∗ 1 (t 1 )] 2 + [x ∗ 2 (t 1 )] 2 = 1. (4.30) Since m(x) = x 2 1 +x 2 2 −1 31 (4.29) gives −[p ∗ 1 (t 1 ), p ∗ 2 (t 1 )] = d 1 [2x ∗ 1 (t 1 ), 2x ∗ 2 (t 1 )], assuming Q ≡ 0, and hence p ∗ 1 (t 1 )/p ∗ 2 (t 1 ) = x ∗ 1 (t 1 )/x ∗ 2 (t 1 ). (4.31) Eqns (4.30) and (4.31) are the conditions to be satisfied at t = t 1 . Example 15. A system described by ˙ x 1 = x 2 , ˙ x 2 = −x 2 +u (4.32) is to be transferred from x(0) = 0 to the line ax 1 +bx 2 = c at time t 1 so as to minimize _ t 1 0 u 2 dt, which is of the form (1.6). The values a, b, c and t 1 are given. From (4.4) H = u 2 +p 1 x 2 −p 2 x 2 +p 2 u and (4.9) gives u ∗ = − 1 2 p ∗ 2 . (4.33) The adjoint equations (4.6) are ˙ p ∗ 1 = 0, ˙ p ∗ 2 = −p ∗ 1 +p ∗ 2 so that p ∗ 1 = c 1 , p ∗ 2 = c 2 e t +c 1 (4.34) where c 1 and c 2 are constants. Substituting (4.33) and (4.34) into (4.32) leads to x ∗ 1 = c 3 e −t − 1 4 c 2 e t − 1 2 c 1 t +c 4 , x ∗ 2 = −c 3 e −t − 1 4 c 2 e t − 1 2 c 1 , and the conditions x ∗ 1 (0) = 0, x ∗ 2 (0) = 0, ax ∗ 1 (t 1 ) +bx ∗ 2 (t 1 ) = c (4.35) must hold. It is easy to verify that (4.29) produces p ∗ 1 (t 1 )/p ∗ 2 (t 1 ) = a/b, (4.36) and (4.35) and and (4.36) give four equations for the four unknown constants c i . The optimal control u ∗ (t) is then obtained from (4.33) and (4.34). 32 In some problems the restriction on the total amount of control effort which can be expended to carry out a required task may be expressed in the form _ t 1 t 0 q(x, u, t)dt = c (4.37) where c is a given constant, such a constraint being termed isoperimetric. A convenient way of dealing with (4.37) is to define a new variable x n+1 (t) = _ t t 0 q(x, u, s)ds so that ˙ x n+1 (t) = q(x, u, t). (4.38) The differential equation (4.38) is simply added to the original set (1.1) together with the conditions x n+1 (t 0 ) = 0, x n+1 (t 1 ) = c and the procedure then continues as before ignoring (4.37). Example 16. LQ control. Minimize J = _ t 1 0 u(t) 2 dt subject to _ ˙ x(t) = Ax(t) +Bu(t) x(0) = x 0 , x(t 1 ) = 0 Assume that the system is completely controllable. We start by minimizing the Hamilto- nian H(˜ p, x, u) = p 0 u 2 +p (Ax +Bu). There are two cases. You can skip the first part without any serious loss. 
Case 1: p 0 = 0 If this is the case, we have arg min u H(˜ p, x, u) = arg min u [p (Ax +Bu)] = ±∞ unless p B = 0. It is however, impossible to have u = ±∞ on a nonzero time interval since then the cost would be infinite, which clearly cannot be the minimum since we know that the system can be driven to origin with finite energy expenditure. The other alternative p(t) B = 0 for t ∈ [0, t 1 ] is also impossible. To see this we note that the adjoint equation ˙ p(t) = −A p(t) has the solution p(t) = e −A t p(0). Hence, in order for p(t) B = 0 for t ∈ [0, t 1 ] we need p(0) B = 0 ˙ p(0) B = 0 . . . p (n−1) (0) B = 0 ⇔ p(0) B = 0 p(0) AB = 0 . . . p(0) A n−1 B = 0 ⇔p(0) [B, AB, ..., A n−1 B] = 0. 33 If the system is controllable, then the matrix [B, AB, A n−1 B] has full rank, which implies that p(0) = 0. However, then p(t) = 0 and ˜ p(t) = 0 which contradicts the theorem. This leads to the conclusion that p 0 = 0 is impossible for a controllable system. Case 2: p 0 = 1. We have u(t) = − 1 2 B p minimizing the Hamiltonian. The adjoint equation is ˙ p(t) = −A p(t) which has the solution p(t) = e −A t p(0), x(0) = x 0 . By the variation of constants formula we obtain x(t 1 ) = e At 1 x 0 − 1 2 _ t 1 0 e A(t 1 −s) BB e −A s dsp(0) = e At 1 x 0 − 1 2 W(t 1 , 0)e −A t 1 p(0) where the reachability Grammian is W(t 1 , 0) = _ t 1 0 e A(t 1 −s) BB e A (t 1 −s) ds In our case the system is controllable and therefore W(t 1 , 0) is positive definite and thus invertible. We can solve for p(0), which gives p(0) = 2e A t 1 W(t 1 , 0) −1 e At 1 x 0 This gives the optimal control u(t) = − 1 2 B e −A t p(0) = −B e A (t 1 −t) W(t 1 , 0) −1 e At 1 x 0 and the optimal cost becomes (after some calculations) J ∗ = x 0 e A t 1 W(t 1 , 0) −1 e At 1 x 0 . Exercises 1. A system is described by ˙ x 1 = −2x 1 +u, and the control u(t) is to be chosen so as to minimize _ 1 0 u 2 dt. Show that the optimal control which transfers the system from x 1 (0) = 1 to x 1 (1) = 0 is u ∗ = −4e 2t /(e 4 −1). 34 2. The equations describing a production scheduling problem are dI dt = −S +P, dS dt = −αP where I(t) is the level of inventory (or stock), S(t) is the rate of sales and α is a positive constant. The production rate P(t) can be controlled and is assumed unbounded. It is also assumed that the rate of production costs is proportional to P 2 . It is required to choose the production rate which will change I(0) = I 0 , S(0) = S 0 into I(t f ) = I 1 , S(t f ) = S 1 in a fixed time t f whilst minimizing the total production cost. Show that the optimal production rate has the form P ∗ = k 1 +k 2 t and indicate how the constants k 1 and k 2 can be determined. 3. A particle of mass m moves on a smooth horizontal plane with rectangular Cartesian coordinates x and y. Initially the particle is at rest at the origin, and a force of constant magnitude ma is applied to it so as to ensure that after a fixed time t f the particle is moving along a given line parallel to the x-axis with maximum speed. The angle u(t) made by the force with the positive x direction is the control variable, and is unbounded. Show that the optimal control is given by tan u ∗ = tan u 0 +ct where c is a constant and u 0 = u ∗ (0). Hence deduce that ˙ y ∗ (t) = (a/c)(sec u ∗ −sec u 0 ) and obtain a similar expression for x ∗ (t) (hint: change the independent variable from t to u). 4. 
For the system in Example 15 described by eqn (4.32), determine the control which transfers it from x(0) = 0 to the line x₁ + 5x₂ = 15 and minimizes
½[x₁(2) − 5]² + ½[x₂(2) − 2]² + ½∫₀² u² dt.
5. Complete the details in deriving (4.28) and (4.29).

5 Pontryagin's principle II: a variational calculus approach

In real-life problems the control variables are usually subject to constraints on their magnitudes, typically of the form |uᵢ(t)| ≤ kᵢ. This implies that the set of final states which can be achieved is restricted. Our aim here is to derive the necessary conditions for optimality corresponding to Theorem 4.2 for the unbounded case. An admissible control is one which satisfies the constraints, and we consider variations such that u∗ + δu is admissible and δu is sufficiently small so that the sign of ∆J = J(u∗ + δu) − J(u∗), where J is defined in (1.13), is determined by δJ in (4.1). Because of the restriction on δu, Theorem 4.1 no longer applies, and instead a necessary condition for u∗ to minimize J is
δJ(u∗, δu) ≥ 0. (5.1)
To see this we give a heuristic derivation. Recall that the control u∗ causes the functional J to have a relative minimum if J(u) − J(u∗) = ∆J ≥ 0 for all admissible controls sufficiently close to u∗. If we let u = u∗ + δu, the increment in J can be expressed as
∆J(u∗, δu) = δJ(u∗, δu) + higher-order terms;
δJ is linear in δu, and the higher-order terms approach zero as the norm of δu tends to zero. If the control were unbounded, we could use the linearity of δJ with respect to δu, and the fact that δu can vary arbitrarily, to show that a necessary condition for u∗ to be an extremal control is that the variation δJ(u∗, δu) be zero for all admissible δu having a sufficiently small norm. Since we are no longer assuming that the admissible controls are unbounded, δu is arbitrary only if the extremal control lies strictly within the boundary for all time in the interval [t₀, t₁]; in this case the boundary has no effect on the solution of the problem. If, however, an extremal control lies on the boundary during at least one subinterval [τ₁, τ₂] of [t₀, t₁], then admissible control variations δû exist whose negatives −δû are not admissible. If only these variations are considered, a necessary condition for u∗ to minimize J is that δJ(u∗, δû) ≥ 0. On the other hand, for variations δũ which are nonzero only for t outside the subinterval [τ₁, τ₂], it is necessary that δJ(u∗, δũ) = 0; the reasoning used in proving the fundamental theorem applies. Considering all admissible variations with δu small enough so that the sign of ∆J is determined by δJ, we conclude that a necessary condition for u∗ to minimize J is δJ(u∗, δu) ≥ 0.

It seems reasonable to ask whether this result has an analogue in calculus. Recall that the differential df is the linear part of the increment ∆f. Consider the end points t₀ and t₁ of the interval, and admissible values of the increment ∆t which are small enough so that the sign of ∆f is determined by the sign of df. If t₀ is a point where f has a relative minimum, then df(t₀, ∆t) must be greater than or equal to zero. The same requirement applies for f(t₁) to be a relative minimum. Thus necessary conditions for the function f to have a relative minimum at the end points of the interval are
df(t₀, ∆t) ≥ 0 for admissible ∆t ≥ 0,
df(t₁, ∆t) ≥ 0 for admissible ∆t ≤ 0,
and a necessary condition for f to have a relative minimum at an interior point t, t₀ < t < t₁, is df(t, ∆t) = 0.
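As a simple illustration (the particular function is chosen purely for this purpose): f(t) = t² on the interval [1, 2] attains its minimum at the left end point t₀ = 1, where df(1, ∆t) = 2∆t ≥ 0 for every admissible ∆t ≥ 0 even though the differential does not vanish there; df(t, ∆t) = 0 is required only at an unconstrained interior minimum.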
For the control problem the analogous necessary conditions are δJ(u∗, δu) ≥ 0 if u∗ lies on the boundary during any portion of the time interval [t₀, t₁], and δJ(u∗, δu) = 0 if u∗ lies within the boundary during the entire time interval [t₀, t₁]. The development then proceeds as earlier: Lagrange multipliers are introduced to define Jₐ in (4.2) and are chosen so as to satisfy (4.6) and (4.7). The only difference is that the expression for δJₐ in (4.9) is replaced by
δJₐ(u, δu) = ∫_{t₀}^{t₁} [H(x, u + δu, p, t) − H(x, u, p, t)] dt. (5.2)
It therefore follows from (5.1) that a necessary condition for u = u∗ to be a minimizing control is that δJₐ(u∗, δu) in (5.2) be nonnegative for all admissible δu. This in turn implies that
H(x∗, u∗ + δu, p∗, t) ≥ H(x∗, u∗, p∗, t) (5.3)
for all admissible δu and all t ∈ [t₀, t₁]; for if (5.3) did not hold on some interval τ₁ ≤ t ≤ τ₂, say, with τ₂ − τ₁ arbitrarily small, then by choosing δu = 0 for t outside this interval δJₐ(u∗, δu) would be made negative. Eqn (5.3) states that u∗ minimizes H, so we have established:

Theorem 5.1 (Pontryagin's minimum principle). Necessary conditions for u∗ to minimize (1.13) are (4.6), (4.7) and (5.3).

With a slightly different definition of H the principle becomes one of maximizing H, and it is then referred to in the literature as the maximum principle. Note that u∗ is now allowed to be piecewise continuous. We omit the rigorous proof here. Our derivation assumed that t₁ was fixed and x(t₁) free; the boundary conditions for the other situations are precisely the same as those given in the preceding section. It can also be shown that when H does not depend explicitly on t then H is constant, and (4.15) still holds, for the respective cases where the final time t₁ is fixed or free.

Example 17 (Minimum revolved area). Consider the problem of finding the (Lipschitz continuous) function y : [0, 1] → R₊ that joins the points (0, 1) and (1, y₁) and has the property that, when its graph is revolved around the x-axis, the resulting surface of revolution has minimum area. The surface in question has area
A = ∫₀¹ 2πy √(1 + (dy/dx)²) dx.
Thus the minimization problem is an endpoint-constrained problem with x₀ = x(0) = 1, x_f = x(1) = y₁. It can be formulated as an optimal control problem, in which we drop the factor 2π since it is irrelevant to the minimization:
min J(u) = ∫₀¹ x√(1 + u²) dt subject to ẋ = u, x(0) = 1, x(1) = y₁.
Now define the Hamiltonian function to be
H(p, x, u) = x√(1 + u²) + pu.
Applying PMP, we have
ṗ = −∂H/∂x = −√(1 + u²),
0 = ∂H/∂u = xu/√(1 + u²) + p ⇔ u² = p²/(x² − p²),
which gives us the optimal control. In order to solve the equations for p and x we rewrite the stationarity condition in the equivalent form p = −xu/√(1 + u²). Since the problem is autonomous, the Hamiltonian is constant along the optimal trajectory; substituting this expression for p, together with ẋ = u, gives
x√(1 + ẋ²) − xẋ²/√(1 + ẋ²) = c (c a constant), (5.4)
that is, x = c√(1 + ẋ²), where, because x(0) = 1, the constant must satisfy 0 < c ≤ 1. From
ẋ² = (x/c)² − 1 (5.5)
it follows, on differentiating, that ẋẍ = xẋ/c². We assume that ẋ does not vanish identically (which can happen only if x ≡ c and hence y₁ = 1). So there is some nonempty open interval on which ẍ = x/c², and hence there are constants α₁, β₁ such that
x(t) = α₁ cosh(t/c) + β₁ sinh(t/c)
on that interval. (Here we need the solution to be real-analytic, so that x takes this form on the entire interval by the principle of analytic continuation.)
The condition x(0) = 1 implies that α 1 = 1 and the equation (5.5) evaluated at t = 0 gives _ β 1 c _ 2 = _ 1 c _ 2 −1 and hence β 1 = ± _ 1 −c 2 ∈ ] −1, 1[. Pick d ∈ R so that tanh d = β 1 which implies that coshd = 1/c. Then (using cosh(z +y) = cosh z cosh y + sinh z sinh y) x(t) = cosh(t cosh d +d) cosh d Every minimizing curve must be of this form. We must however meet the second bundary condition: x(1) = y 1 . This can be done if and only if one can solve y 1 = cosh(cosh d +d) cosh d for d which requires y 1 to be sufficiently large (approximately > 0.587); in general there may be none, one or two solutions. For example if y 1 = cosh 1 the solutions are d = 0 and d ≈ −2.3, respectively. The integrals _ 1 0 x √ 1 + ˙ x 2 dt are approximately 1.407 and 1.764. So if a minimum exists, it must be x(t) = cosh t. We can show that this function is the unique global minimizer. (In this case, p(t) = −x(t) ˙ x(t)/ _ 1 + ˙ x(t) 2 = −cosh t tanh t. Remark: If we use the Euler-Lagrange equations to solve this problem the Euler-Lagrange equation is q(x(t), ˙ x(t)) − ˙ x(t)q u (x(t), ˙ x(t)) ≡ c. Compare this equation with equation (5.4) we can easily see that p = −q u . This is the best I can explain the Lagrange multiplier or the costae p in relation to the Euler-Lagrange equations. In fact we can prove that the extremal Lagrange multipliers, or costates, are the sensitivity of the minimum value of the performance measure to changes in the state value. 39 Example 18. Consider again the “soft landing” problem described in Example 2 where (1.12) is to be minimized subject to the system equations (1.9). The Hamiltonian (4.4) is H = |u| +k +p 1 x 2 +p 2 u. (5.6) Since the admissible range of control is −1 ≤ u(t) ≤ 1, it follows that H will be minimized by the following: u ∗ (t) = _ ¸ ¸ ¸ _ ¸ ¸ ¸ _ −1 if p ∗ 2 (t) > 1 0 if 1 > p ∗ 2 (t) > −1 1 if p ∗ 2 (t) < −1. (5.7) Such a control is referred to in the literature by the graphic term bang-zero-bang, since only maximum thrust is applied in a forward or reverse direction; no intermediate nonzero values are used. If there is no period in which u ∗ is zero the control is called bang-bang. For example, a racing-car driver approximates to bang-bang operation, since he tends to use either full throttle or maximum braking when attempting to circuit a track as quickly as possible. In (5.7) u ∗ (t) switches in values according to the value of p ∗ 2 (t), which is therefore termed (in this example) the switching function. The adjoint equations (4.6) are ˙ p ∗ 1 (t) = 0, ˙ p ∗ 2 = −p ∗ 1 and integrating these gives p ∗ 1 (t) = c 1 , p ∗ 2 (t) = −c 1 tc 2 (5.8) where c 1 and c 2 are constants. Since p ∗ 2 is linear in t, it follows that it can take each of the values +1 and −1 at most once in 0 ≤ t ≤ t f , so u ∗ (t) can switch at most twice. We must however use physical considerations to determine an actual optimal control. Since the landing vehicle begins with a downwards veloscily at an altitude h, logical sequences of control would seem to either u ∗ = 0, followed by u ∗ = +1 (upwards is regarded as positive), or u ∗ = −1, then u ∗ = 0, then u ∗ = +1. (5.9) Consider the first possibility and suppose that u ∗ switches from zero to one at time t 1 . By virtue of (5.7) this sequence of control is possible if p ∗ 2 decreases with time. 
It is easy to verify that the solution of (1.9) subject to the initial conditions (1.10) is x ∗ 1 = h −µt, x ∗ 2 = −ν, 0 ≤ t ≤ t 1 x ∗ 1 = h −νt + 1 2 (t −t 1 ) 2 , x ∗ 2 = −ν + (t −t 1 ), t 1 ≤ t ≤ t f (5.10) 40 Substituting the soft landing requirements (1.11) into (5.10) gives t f = h/ν + 1 2 ν, t 1 = h/ν − 1 2 ν. (5.11) Because the final time is not specified and because of the form H in (5.6) eqn (4.15) holds, so in particular (H) u=u ∗ = 0 at t = 0, i.e. with t = 0 in (5.6) k +p ∗ 1 (0)x ∗ 2 (0) = 0 or p ∗ 1 (0) = k/ν. Hence from (5.8) we have p ∗ 1 (t) = k/ν, t ≥ 0 and p ∗ 2 (t) = −kt/ν −1 +kt 1 /ν (5.12) using the assumption that p ∗ 2 (t 1 ) = −1. Thus the assumed optimal control will be valid if t 1 > 0 and p ∗ 2 (0) < 1 (the latter conditions being necessary since u ∗ (0) = 0), and using (5.11) and (5.12) these conditions imply h > 1 2 ν 2 , k < 2ν 2 /(h − 1 2 ν 2 ). (5.13) If these inequalities do not hold then some different control strategy, such as (5.9), becomes optimal. For example, if k is increased so that the second inequality in (5.13) is violated then this means that more emphasis is placed on the time to landing in the performance index (1.12). It is therefore reasonable to expect this time would be reduced by first accelerating downwards with u ∗ = −1 before coasting with u ∗ = 0, as in (5.9). It is interesting to note that provided (5.13) holds then the total time t f to landing in (5.11) is independent of k. Example 19. Suppose that in the preceding example it is now required to determine a control which achieves a soft landing in the least possible time, starting with an arbitrary given initial state x(0) = x 0 . The performance index is just (1.3) with t 0 = 0, t 1 = t f . The Hamitonian (4.4) is now H = 1 +p 1 x 2 +p 2 xu and Theorem 5.1 gives the optimal control: u ∗ = 1, p 2 < 0; u ∗ = −1, p ∗ 2 > 0, or more succinctly, u ∗ (y) = sgn(−p ∗ 2 ) (5.14) where sgn stands for the sign function. The optimal control thus has bang-bang form, and we must determine the switching function p ∗ 2 (t). Using (4.6) we again obtain ˙ p ∗ 1 = 0, ˙ p ∗ 2 = −p ∗ 1 41 so p ∗ 1 = c 1 , p ∗ 2 = −c 1 t +c 2 where c 1 and c 2 are constants. Since p ∗ 2 is a linear function of t it can change sign at most once in [0, t f ], so the optimal control (5.14) must take one of the following forms: u ∗ (t) = _ ¸ ¸ ¸ ¸ ¸ ¸ _ ¸ ¸ ¸ ¸ ¸ ¸ _ +1 0 ≤ t ≤ t f −1, 0 ≤ t ≤ t f +1, 0 ≤ t < t 1 ; −1, t 1 ≤ t ≤ t f −1, 0 ≤ t < t 2 ; +1, t 2 ≤ t ≤ t f . (5.15) Integrating the state equation (1.9) with u = ±1 gives Figure 4: x 1 = ± 1 2 t 2 +c 3 t +c 4 , x 2 = ±t +c 3 . (5.16) Eliminating t in (5.16) yields x 1 (t) = 1 2 x 2 2 (t) +c 5 , when u ∗ = +1, (5.17) x 1 (t) = − 1 2 x 2 2 (t) +c 6 , when u ∗ = −1. (5.18) The trajectories (5.17) and (5.18) represent two families of parabolas, shown in Figure 4. The direction of the arrows represents t increasing. We can now investigate the various cases in (5.15) (i) u ∗ = +1, 0 ≤ t ≤ t f . The initial state x 0 must lie on the lower part of the curve PO corresponding to c 5 = 0 in Figure 4(a). 42 (ii) u ∗ = −1, 0 ≤ t ≤ t f . The initial state x 0 must lie on the upper part of the curve QO corresponding to c 6 = 0 in Figure 4(b). (iii) With the third case in (5.15), since u ∗ = −1 for t 1 ≤ t ≤ t f it follows that x ∗ (t 1 ) must lie on the curve QO. The point x ∗ 1 (t 1 ) is reached using u ∗ = +1 for 0 ≤ t < t 1 , so the initial part of the optimal trajectory must belong to the curves in Figure 4(a). The optimal trajectory will thus be as shown in Figure 5. 
The point R corresponds to t = t 1 , and is where u ∗ switches from +1 to −1; QO is therefore termed the switching curve. By considering Figure 4 it is clear that the situation just described holds for any initial state lying to the left of both PO and QO. Figure 5: (iv) A similar argument shows that the last case in (5.15) applies for any initial state lying to the right of PO and QO, a typical optimal trajectory being shown in Figure 6. The switching now takes place on PO, so the complete switching curve is QOP, shown in Figure 7. To summarize, if x 0 lies on the switching curve then u ∗ = ±1 according as x 1 (0) is positive or negative. If x 0 does not lie on the switching curve then u ∗ must initially be chosen so as to move x ∗ (t) towards the switching curve. Exercises 1. A system is described by d 3 z dt 3 = u(t) where z(t) denotes displacement. Starting from some given initial position with given velocity and acceleration it is required to choose u(t), which is constrained by |u(t)| ≤ k, so as to make displacement, velocity, and acceleration equal to zero in 43 Figure 6: Figure 7: the least possible time. Show using Theorem 5.1 that the optimal control consists of u ∗ = ±k with zero, one, or two switchings. 2. A linear system is described by ¨ z(t) +a ˙ z(t) +bz(t) = u(t) where a > 0 and a 2 < 4b. The control variable is subject to |u(t)| ≤ k and is to be chosen so that the system reaches the state z(T) = 0, ˙ z(Y ) = 0 in minimum possible time. Show that the optimal control is u ∗ = ksgnp(t), where p(t) is a periodic function of t. 44 3. A rocket is ascending vertically above the earth, which is assumed flat. It is also assumed that aerodynamic forces can be neglected, that the gravitational attraction is constant, and that the thrust from the motor acts vertically downwards. The equations of motion are d 2 h dt 2 = −g + cu m , dm dt = −u(t) where h(t) is the vertical height, m(t) is the rocket mass, and c and g are positive constants. The propellant mass flow can be controlled subject to 0 ≤ u(t) ≤ k. The mass, height, and velocity at t = 0 are known and it is required to maximize the height subsequently reached. Show that the optimal control has the form u ∗ (t) = k, s > 0; u ∗ (t) = 0, s < 0 where the switching function s(t) satisfies the equation ds/dt = −c/m. If switching occurs at time t 1 show that s(t) = _ _ _ c k ln m(t) m(t 1 ) , 0 ≤ t ≤ t 1 c(t 1 −t)/m(t 1 ), t 1 ≤ t ≤ T. 4. A reservoir is assumed to have a constant cross-section, and the depth of the water at time t is x 1 (t). The net inflow rate of water u(t) can be controlled subject to 0 ≤ u(t) ≤ k, but the reservoir leaks, the differential equation describing the system being ˙ x 1 = −0.1x 1 +u. Find the control policy which maximizes the total amount of water in the reservoir over 100 units if time, i.e. _ 100 0 x 1 (t)dt. If during the same period the total inflow of water is limited by _ 100 0 u(t)dt = 60k determine the optimal control in this case. 5. In Example 19 let x 1 (0) = α, x 2 (0) = β be arbitrary initial point to the right of the switching curve in Figure 6, with α ≥ 0. Show that the minimum time to the origin is T ∗ = β + (4α + 2β 2 ) 1/2 . 6. Consider the system described by the equations (1.9) subject to |u(t)| ≤ 1. It is required to transfer the system to some point lying on the perimeter of the square in the (x 1 , x 2 ) plane having vertices (±1, ±1) in minimum time, starting from an arbitrary point outside the square. Determine the switching curves. 
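The switching-curve synthesis of Example 19 is easy to check numerically. The following sketch (all numerical values in it are assumed purely for illustration) starts from a state (α, β) to the right of the switching curve, applies u = −1 until the trajectory meets the final arc x₁ = ½x₂² with x₂ ≤ 0, and then applies u = +1; the arrival time at the origin agrees with the closed-form minimum time quoted in Exercise 5 above.

import numpy as np
from scipy.integrate import solve_ivp

# Assumed initial state (alpha, beta), chosen to lie to the right of the switching curve QOP.
alpha, beta = 1.0, 1.0
t_switch = beta + np.sqrt(alpha + beta**2 / 2.0)   # time at which the arc x1 = x2**2/2 (x2 < 0) is met
T_min = beta + np.sqrt(4.0*alpha + 2.0*beta**2)    # minimum time quoted in Exercise 5

def double_integrator(u):
    # equations (1.9) with constant thrust u: x1' = x2, x2' = u
    return lambda t, x: [x[1], u]

leg1 = solve_ivp(double_integrator(-1.0), (0.0, t_switch), [alpha, beta], rtol=1e-10, atol=1e-12)
leg2 = solve_ivp(double_integrator(+1.0), (t_switch, T_min), leg1.y[:, -1], rtol=1e-10, atol=1e-12)

print("state at the switch :", leg1.y[:, -1])      # lies on x1 = x2**2/2 with x2 < 0
print("state at t = T_min  :", leg2.y[:, -1])      # approximately the origin
print("minimum time T_min  :", T_min)

For the values above the switch occurs at t ≈ 2.22 and the origin is reached at T ≈ 3.45, in agreement with T∗ = β + (4α + 2β²)^½.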
45 6 Kalman filtering and certainty equivalence ∗ We present the important concepts of the Kalman filter, certainty-equivalence and the separation principle, as we stated in the theory of output feedback. 6.1 Imperfect state observation with noise The elements needed to define a control optimization are specification of (i) the dynamics of the process, (ii) which quantities are observable at a given time, and (iii) an optimization criterion. In the LQG model the system equation and observation relations are linear, the cost is quadratic, and the noise is Gaussian (jointly normal). The LQG model is important because it has a complete theory and introduces some key concepts, such as controllability, observability and the certainty equivalence principle. Imperfect observation is the most important point. The model is x t = Ax t−1 +Bu t−1 + t y t = Cx t−1 +η t (6.1) where t is process noise, y t is the observation at time t and η t is the observation noise. The state observations are degraded in that we observe only Cx t−1 . Assume cov _ η _ = E _ η __ η _ = _ Q L L R _ and that x 0 ∼ N(ˆ x 0 , V 0 ). Let W t = (Y t , U t−1 ) = (y 1 , ..., y t ; u 0 , ..., u t−1 ) denote the ob- served history up to time t. Of course we assume that t, A, B, C, N, L, M, ˆ x 0 and V 0 are also known; W t denotes what might be different if the process were rerun. In the next subsection we turn to the question of estimating x from y. We consider the issue of state estimation and optimal control and shall show the following things. (i) ˆ x t can be calculated recursively from the Kalman filter (a linear operator): ˆ x t = Aˆ x t−1 +Bu t−1 +H t (y t −Cˆ x t−1 ), which is like the system equation except that the noise is given by an innovation process, ˜ y t = y t −Cˆ x t−1 , rather than the white noise. Compare this with observer! (ii) If there is full information (i.e., y t = x t ) and the optimal control is u t = K t x t , then without full information the optimal control is K t ˆ x t , where ˆ x t is the best linear least squares estimate of x t based on the information (Y t , U t−1 ) at time t. Many of the ideas we encounter in this analysis are not related to the special state structure and are therefore worth noting as general observations about control with im- perfect information. 46 6.2 The Kalman filter Consider the state system (6.1). Note that both x t and y t can be written as a linear functions of the unknown noise and the known values of u 0 , ..., u t−1 . Thus the disturbance of x t conditional on W t = (Y t , U t−1 ) must be normal, with some mean ˆ x t and covariance matrix V t . The following theorem describes recursive updating relations for these two quantities. A preliminary result is needed to make the proof simpler. Lemma 6.1. Suppose that x and y are jointly normal with zero means and covariance matrix cov _ x y _ = _ V xx V xy V yx V yy _ . Then the distribution of x conditional on y is Gaussian, with E(x|y) = V xy V −1 yy y, and cov(x|y) = V xx −V xy V −1 yy V yx . (6.2) Proof. Both y and x − V xy V −1 yy y are linear functions of x and y and therefore they are Gaussian. From E[(x − V xy V −1 yy y)y ] = 0 it follows that they are uncorrelated and this implies they are independent. Hence the distribution of x − V xy V −1 yy y conditional on y is identical with its unconditional distribution, and this is Gaussian with zero mean and the covariance matrix given by (6.2). 
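To make Lemma 6.1 concrete, the following sketch applies it to the innovations of the model (6.1) and produces the recursive estimate previewed in item (i) of Section 6.1. The system matrices and noise covariances below are assumed purely for illustration, and the cross-covariance L between the process and observation noise is taken to be zero; with a nonzero cross term the gain acquires the additional L-dependent terms appearing in Theorem 6.2 below.

import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative model of the form (6.1): x_t = A x_{t-1} + B u_{t-1} + eps_t,
# y_t = C x_{t-1} + eta_t, with cov(eps) = Q, cov(eta) = R and zero cross-covariance.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

x = np.array([0.0, 1.0])                 # true (unobserved) state
x_hat, V = np.zeros(2), np.eye(2)        # prior mean and covariance of x_0

for t in range(50):
    u = np.array([0.0])                  # any known control can be inserted here
    x_prev = x
    x = A @ x_prev + B @ u + rng.multivariate_normal(np.zeros(2), Q)
    y = C @ x_prev + rng.multivariate_normal(np.zeros(1), R)   # y_t observes C x_{t-1}

    # Lemma 6.1 applied to the innovation y - C x_hat: gain and covariance update.
    S = R + C @ V @ C.T                  # covariance of the innovation
    H = A @ V @ C.T @ np.linalg.inv(S)   # gain H_t (cross-covariance L = 0)
    x_hat = A @ x_hat + B @ u + H @ (y - C @ x_hat)
    V = Q + A @ V @ A.T - H @ (C @ V @ A.T)

print("true state      :", x)
print("filter estimate :", x_hat)
print("posterior std   :", np.sqrt(np.diag(V)))

Note that the observation convention of (6.1) is respected: the innovation is formed against Cx̂_{t−1}, and after a few dozen steps the unobserved second state component is recovered from the scalar measurements alone.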
The estimate of x in terms of y defined as ˆ x = Ky = V xy V −1 yy y is known as the linear least squares estimate of x in terms of y. Even without the assumption that x and y are jointly normal, this linear function of y has a smaller covariance matrix than any other unbiased estimate for x that is a linear function of y. In the Gaussian case, it is also the maximum likelihood estimator. Theorem 6.2 (The Kalman filter). Suppose that conditional on W 0 , the initial state x 0 is distributed N(ˆ x 0 , V 0 ) and the state and observations satisfy the recursions of the LQG model (6.1). Then conditional on W t , the current state is distributed N(ˆ x t , V t ). The conditional mean and variance satisfy the updating recursions ˆ x t = Aˆ x t−1 +Bu t−1 +H t (y t −Cˆ x t−1 ), (6.3) V t = Q+AV t−1 A −(L +AV t−1 C )(R +CV t−1 C ) −1 (L +CV t−1 A ), (6.4) where H t = (L +AV t−1 C )(R +CV t−1 C ) −1 . (6.5) 47 Proof. We do induction on t. Consider the moment when u t−1 has been determined but y t has not yet observed. The distribution of (x t , y t ) conditional on W t−1 , u t−1 ) is jointly normal with means E(x t |W t−1 , u t−1 ) = Aˆ x t−1 +Bu t−1 , E(y t |W t−1 , u t−1 ) = Cˆ x t−1 . Let e t−1 := ˆ x t−1 − x t−1 , which by an inductive assumption is N(0, V t−1 ). Consider the innovations ξ t = x t −E(x t |W t−1 , u t−1 ) = x t −(Aˆ x t−1 +Bu t−1 ) = t −Ae t−1 , ζ t = y t −E(y t |W t−1 , u t−1 ) = y t −Cˆ x t−1 = η t −Ce t−1 . Conditional on (W t−1 , u t−1 ), these quantities are normally distributed with zero means and covariance matrix cov _ t −Ae t−1 η t −Ce t−1 _ = _ Q+AV t−1 A L +AV t−1 C L +CV t−1 A R +CV t−1 C _ = _ V ξξ V ξζ V ζξ V ζζ _ Thus it follows from Lemma 6.1 that the distribution of ξ t conditional on knowing (W t−1 , u t−1 , ζ t ), (which is equivalent to knowing W t ), is normal with mean V ξζ V −1 ζζ ζ t and covariance matrix V ξξ −V ξζ V −1 ζζ V ζξ . These prove the theorem. 6.3 Certainty equivalence We say that a quantity a is policy-independent if E π (a|W 0 ) is independent of π. Theorem 6.3. Suppose LQG model assumptions hold. Then (i) V (W t ) = ˆ x t P t ˆ x t +· · · (6.6) where ˆ x t is the linear least squares estimate of x t whose evolution is determined by the Kalman filter in Theorem 6.2 and “+· · · ” indicates terms that are policy independent; (ii) the optimal control is given by u t = K t ˆ x t , where P t and K t are the same matrices as in the full information case of Theo- rem 2.4. It is important to grasp the remarkable fact that (ii) states: the optimal control u t is exactly the same as it would be if all unknowns were known and took values equal to their 48 linear least square estimates (equivalently, their conditional means) based on observations up to time t. This is the idea known as certainty equivalence. As we have seen in the previous subsection, the distribution of the estimation error ˆ x t − x t does not depend on U t−1 . The fact that the problems of optimal estimation and optimal control can be decoupled in this way is known as the separation principle, as we have seen in the theory of observer and output feedback. Proof. The proof is by backward induction. Suppose equation (6.6) holds at t. Recall that ˆ x t = Aˆ x t−1 +Bu t−1 +H t ζ t , e t−1 = ˆ x t−1 −x t−1 . 
Then with a quadratic cost of the form J(x, u) = x Qx + 2u Sx +u Ru, we have V (W t−1 ) = min u t−1 [J(x t−1 , u t−1 ) + ˆ x t P t ˆ x t +· · · |W t−1 u t−1 ] = min u t−1 E[J(ˆ x t−1 −e t−1 , u t−1 ) (Aˆ x t−1 +Bu t−1 +H t ζ t ) P t (Aˆ x t−1 +Bu t−1 +H t ζ t ) · · · |W t−1 u t−1 ] = min u t−1 [J(ˆ x t−1 , u t−1 ) + (Aˆ x t−1 +Bu t−1 ) P t (Aˆ x t−1 +Bu t−1 )] +· · · , where we used the fact that conditional on W t−1 , u t−1 both e t−1 and ζ t have zero means and are policy-independent. This ensures that when we expand the quadratics in powers of e t−1 and H t ζ t the expected value of the linear terms in these quantities are zero and the expected value of the quadratic terms (represented by +· · · ) are policy independent. 7 Optimal control problem with Controlled diffusion processes ∗ We give a brief introduction to controlled continuous time stochastic models with a con- tinuous state space, i.e., controlled diffusion processes. 7.1 Diffusion processes and controlled diffusion processes TheWiener precess {W(t)}, is a scalar process for which W(0) = 0, the increments in W over disjoint time intervals are statistically independent and W(t) is normally distributed with zero mean and variance t. This is also called Brownian motion. This specification is internally consistent because, for example, W(t) = W(t 1 ) + [W(t) −W(t 1 )] and for 0 ≤ t 1 ≤ t the two terms on the right-hand side are independent normal variables of zero mean and with variance t 1 and t −t 1 , respectively. If δW is the increment of W in a time interval of length δt then E[δW] = 0, E[(δW) 2 ] = δt, E[(δW) j ] = o(δt), for j > 2, 49 where the expectation is one conditional on the past of the process. Note that since E[( δW δt ) 2 ] = O(δt) −1 ) →∞, the formal derivative = DW dt (contiguous-time “white noise”) does not exist in a mean square sense, but expectations such as E _ __ α(t)(t)dt _ 2 _ = E _ __ α(t)dW(t) _ 2 _ = _ α(t) 2 dt make sense if the integral is convergent. Now consider a stochastic differential equation δx = a(x, u)δt +q(x, u)δW, which we shall write formally as ˙ x = a(x, u) +q(x, u). This, as a Markov process, has an infinitesimal generator with action p(u)φ(x) = lim δt→0 E _ φ(x(t +δt)) −φ(x) δt ¸ ¸ x(t) = x, u(t) = u _ = φ x a + 1 2 φ xx g 2 = φ x a + 1 2 Nφ xx where N(x, u) = q(x, u) 2 . The DP equation is thus inf u [J +V t +V x a + 1 2 NV xx ] = 0. In the vector case this becomes the controlled diffusion process inf u [J +V t +V x a + 1 2 tr(NV xx )] = 0. Example 20. (LQG in continuous time). The DP equation is inf u [x Qx +u Ru +V t +V x (Ax +Bu) + 1 2 tr(NV xx )] = 0. In analogy with the discrete and deterministic continuous cases where we have considered previously, we try a solution of the form, V (x, t) = x P(t)x +γ(t). This leads to the same Riccati equation 0 = x [Q+PA+A P −PBR −1 B P + ˙ P]x. 50 and also ˙ γ + tr(NP(t)) = 0, giving γ(t) = _ T t tr(NP(τ))dτ. Most parts in Section 1 and treatment of variational calculus are taken from ”Introduction to mathematical control theory” by Barnett & Cameron. Please send comments, report of errors and other suggestions to
[email protected]. Yishao Zhou, September 30, 2013.