The Power of Predictions in Online Control
Chenkai Yu
IIIS, Tsinghua University

Guanya Shi
CMS, Caltech

Soon-Jo Chung
CMS, Caltech

Yisong Yue
CMS, Caltech

Adam Wierman
CMS, Caltech
Abstract
We study the impact of predictions in online Linear Quadratic Regulator (LQR) control with both stochastic and adversarial disturbances in the dynamics. In both settings, we characterize the optimal policy and derive tight bounds on the minimum cost and dynamic regret. Perhaps surprisingly, our analysis shows that the conventional greedy MPC approach is a near-optimal policy in both stochastic and adversarial settings. Specifically, for length-$T$ problems, MPC requires only $O(\log T)$ predictions to reach $O(1)$ dynamic regret, which matches (up to lower-order terms) our lower bound on the required prediction horizon for constant regret.

1 Introduction

This paper studies the effect of using predictions for online control in a linear dynamical system governed by $x_{t+1} = Ax_t + Bu_t + w_t$, where $x_t$, $u_t$, and $w_t$ are the state, control, and disturbance respectively. At each time step $t$, the controller incurs a quadratic cost $c(x_t, u_t)$. Recently, considerable effort has been made to leverage and integrate ideas from learning, optimization, and control theory to study the design of optimal controllers under various performance criteria, such as static regret [2, 3, 12, 13, 15, 20, 29], dynamic regret [16, 23] and competitive ratio [17, 28]. However, the study of online control when incorporating predictions has been largely absent.

Indeed, a key aspect of online control is the amount of information available when making decisions. Most recent studies focus on the basic setting where only historical information, $x_0, w_0, \dots, w_{t-1}$, is available for $u_t$ at every time step [2, 13, 15, 28]. However, this basic setting does not effectively characterize situations where we have accurate predictions, e.g., when $x_0, w_0, \dots, w_{t+k-1}$ are available at step $t$. These types of accurate predictions are often available in many applications, including robotics [8, 27], energy systems [30], and data center management [22]. Moreover, there are many practical algorithms that leverage predictions, such as the popular Model Predictive Control (MPC) [6–9, 18, 19].

While there has been increased interest in studying online guarantees for control with predictions, to our knowledge there has been no such study for the case of a finite-time horizon with disturbances. Several previous works studied the economic MPC problem by analyzing the asymptotic performance without disturbances [6, 7, 18, 19]. Rosolia and Borrelli [25, 26] studied learning for MPC but focused on the episodic setting with asymptotic convergence guarantees. Li et al. [23] considered a linear system where finite predictions of costs are available, and analyzed the dynamic regret of a new algorithm they proposed; however, they neither consider disturbances nor study the more practically relevant MPC approach. Goel and Hassibi [16] characterized the offline optimal policy (i.e., with infinite predictions) and cost in LQR control with i.i.d. zero-mean stochastic disturbances, but those results do not apply to limited predictions or non-i.i.d. disturbances. Other prior work analyzes the power of predictions in online optimization [11, 24], but the connection to online control in dynamical systems is unclear.

From this literature, fundamental questions about online control with predictions have emerged:

1. What are the cost-optimal and regret-minimizing policies when given $k$ predictions? What are the corresponding cost and regret of these policies?
2. What is the marginal benefit from each additional prediction used by the policy, and how many predictions are needed to achieve (near-)optimal performance?

3. How well does MPC with $k$ predictions perform compared to the cost-optimal and regret-minimizing policies?

Main contributions.
We systematically address each of the questions above in the context of LQR systems with general stochastic and adversarial disturbances in the dynamics. In the stochastic case, we explicitly derive the cost-optimal and dynamic-regret-minimizing policies with $k$ predictions. In both the stochastic and adversarial cases, we derive (mostly tight) upper bounds for the optimal cost and minimum dynamic regret given access to $k$ predictions. We also show that the marginal benefit of an extra prediction decays exponentially as $k$ increases. Additionally, for MPC specifically, we show that it has a bounded performance ratio against the cost-optimal policy in both stochastic and adversarial settings. We further show that MPC is near-optimal in terms of dynamic regret, and needs only $O(\log T)$ predictions to achieve $O(1)$ dynamic regret (the same order as is needed by the dynamic-regret-minimizing policy) in both settings.

We would like to emphasize the generality of these results. The model we consider is the general LQR setting with disturbance in the dynamics, where only stabilizability is assumed [4]. Further, in the stochastic setting we consider general distributions, which are not necessarily i.i.d. or zero-mean. Additionally, our results compare to the globally optimal policies for cost and regret, rather than to the optimal linear or static policy. Finally, our upper bounds are (almost) tight, i.e., there exist systems for which the bounds are (nearly) attained, up to lower-order terms.

It is perhaps surprising that classic MPC, which is a simple greedy policy (up to the prediction horizon), is near-optimal even with adversarial disturbances in the dynamics. Our results thus highlight the power of predictions to reduce the need for algorithmic sophistication. In that sense, our results somewhat mirror recent developments in the study of exploration strategies in online LQR control with unknown dynamics $\{A, B\}$: after a decade of research beginning with the work of Abbasi-Yadkori and Szepesvári [1], Simchowitz and Foster [29] recently showed that naive exploration is optimal. Taken together with the result from [29], our paper provides additional evidence for the idea that the structure of LQR allows simple algorithmic ideas to be effective, which sheds light on key algorithmic principles and fundamental limits in continuous control.

2 Preliminaries
We consider the Linear Quadratic Regulator (LQR) optimal control problem with disturbances in the dynamics. In particular, we consider a linear system initialized with $x_0 \in \mathbb{R}^n$ and controlled by $u_t \in \mathbb{R}^d$, with dynamics $x_{t+1} = Ax_t + Bu_t + w_t$ and cost
$$J = \sum_{t=0}^{T-1}\big(x_t^\top Q x_t + u_t^\top R u_t\big) + x_T^\top Q_f x_T,$$
where $T \geq 1$ is the total length of the control period. The goal of the controller is to minimize the cost given $A, B, Q, R, Q_f, x_0$, and the characterization of the disturbance $w_t$. Throughout this paper, we use $\rho(\cdot)$ to denote the spectral radius of a matrix and $\|\cdot\|$ to denote the 2-norm of a vector or the spectral norm of a matrix.

We assume $Q, Q_f \succeq 0$, $R \succ 0$, and that the pair $(A, B)$ is stabilizable, i.e., there exists a matrix $K \in \mathbb{R}^{d \times n}$ such that $\rho(A - BK) < 1$. Further, we assume the pair $(A, Q)$ is detectable, i.e., $(A^\top, Q)$ is stabilizable, to guarantee stability of the closed loop. Note that detectability of $(A, Q)$ is more general than $Q \succ 0$, i.e., $Q \succ 0$ implies $(A, Q)$ is detectable. For $w_t$, in the stochastic case we assume $\{w_t\}_{t=0,1,\dots}$ are sampled from a joint distribution with bounded cross-correlation, i.e., $\mathbb{E}[w_t^\top w_{t'}] \leq m$ for any $t, t'$; in the adversarial case, we assume $w_t$ is picked from a bounded set $\Omega$. These are standard assumptions in the literature, e.g., [13, 15, 29], and it is worth noting that our notion of stochasticity is much more general than typically considered [10, 12, 13]. We also note that many important problems can be straightforwardly converted to our model — for example, input-disturbed systems and the Linear Quadratic (LQ) tracking problem [4].
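To make the setup concrete, the following minimal sketch (our illustration, not part of the paper) rolls out the dynamics $x_{t+1} = Ax_t + Bu_t + w_t$ and evaluates the cost $J$ for a given control and disturbance sequence; the system matrices, noise scale, and the zero controller are arbitrary placeholder choices.

```python
import numpy as np

def rollout_cost(A, B, Q, R, Qf, x0, us, ws):
    """Roll out x_{t+1} = A x_t + B u_t + w_t and evaluate
    J = sum_t (x_t' Q x_t + u_t' R u_t) + x_T' Qf x_T."""
    x, J = x0, 0.0
    for u, w in zip(us, ws):
        J += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + w
    return J + x @ Qf @ x

# A 2-d example system with scalar input (values are illustrative only).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = Qf = np.eye(2)
R = np.eye(1)
T = 10
ws = rng.normal(scale=0.1, size=(T, 2))   # stochastic disturbances w_0..w_{T-1}
us = [np.zeros(1) for _ in range(T)]      # an (arbitrary) zero controller
print(rollout_cost(A, B, Q, R, Qf, np.zeros(2), us, ws))
```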
Example: linear quadratic tracking. The standard quadratic tracking problem is defined with dynamics $x_{t+1} = Ax_t + Bu_t + w_t$ and cost function $J = \sum_{t=0}^{T-1}\big((x_{t+1} - d_{t+1})^\top Q (x_{t+1} - d_{t+1}) + u_t^\top R u_t\big)$, where $\{d_t\}_{t=1}^{T}$ is the desired trajectory to track. To map this to our model, let $\tilde{x}_t = x_t - d_t$. Then we get $J = \sum_{t=0}^{T-1} \tilde{x}_{t+1}^\top Q \tilde{x}_{t+1} + u_t^\top R u_t$ and $\tilde{x}_{t+1} = A\tilde{x}_t + Bu_t + \tilde{w}_t$, which is an LQR control problem with disturbance $\tilde{w}_t = w_t + Ad_t - d_{t+1}$ in the dynamics.

In the classical model, at each step $t$, the controller decides $u_t$ after observing $w_{t-1}$ and $x_t$. In other words, $u_t$ is a function of all the previous information: $x_0, x_1, \dots, x_t$ and $w_0, w_1, \dots, w_{t-1}$, or equivalently, of $x_0, w_0, w_1, \dots, w_{t-1}$. We describe this scenario via the following event sequence:
$$x_0 \; u_0 \; w_0 \; u_1 \; w_1 \; \cdots \; u_{T-1} \; w_{T-1},$$
where each $u_t$ denotes the decision of a control policy, each $w_t$ denotes the observation of a disturbance, and each decision may depend on previous events.

However, in many real-world applications the controller may have some knowledge about the future. In particular, at time step $t$, the controller may have predictions of the immediate $k$ future disturbances and make decision $u_t$ based on $x_0, w_0, \dots, w_{t+k-1}$. In this case, the event sequence is given by:
$$x_0 \; w_0 \; w_1 \; \cdots \; w_{k-1} \; u_0 \; w_k \; u_1 \; w_{k+1} \; \cdots \; u_{T-k-1} \; w_{T-1} \; u_{T-k} \; \cdots \; u_{T-1}.$$
The existence of predictions is common in many applications, such as disturbance estimation in robotics [27] and model predictive control (MPC) [9], which is a common approach for the LQ tracking problem. When given $k$ predictions of $d_t$, the LQ tracking problem can be formulated as an LQR problem with $k-1$ predictions of future disturbances. In this paper we assume all the predictions are exact, and leave inexact predictions [11, 28] as future work. This is common in the literature on online algorithms with predictions, e.g., [23, 24].

The characteristics of the disturbances have a fundamental impact on the optimal control policy and cost. We consider two types of disturbance: stochastic disturbances, which are drawn from a joint distribution (not necessarily i.i.d.), and adversarial disturbances, which are chosen by an adversary to maximize the overall control cost of the policy.

In the stochastic setting, we model the disturbance sequence $\{w_t\}_{t=0}^{T-1}$ as a discrete-time stochastic process with joint distribution $\mathcal{W}$, which is known to the controller. Let $\mathcal{W}_t = \mathcal{W}_t(w_0, \dots, w_{t-1})$ be the conditional distribution of $w_t$ given $w_0, \dots, w_{t-1}$. Then the cost of the optimal online policy with $k$ predictions is given by:
$$STO^T_k = \mathbb{E}_{w_0 \sim \mathcal{W}_0, \dots, w_{k-1} \sim \mathcal{W}_{k-1}}\Big(\min_{u_0}\Big(\mathbb{E}_{w_k \sim \mathcal{W}_k}\Big(\cdots \min_{u_{T-k-1}}\Big(\mathbb{E}_{w_{T-1} \sim \mathcal{W}_{T-1}}\Big(\min_{u_{T-k},\dots,u_{T-1}} J\Big)\Big)\Big)\Big)\Big).$$
Note that the cost $J = J(x_0, u_0, \dots, u_{T-1}, w_0, \dots, w_{T-1})$. Two extreme cases are noteworthy: $k = 0$ reduces to the classical case without prediction, and $k = T$ reduces to the offline optimal.

In the adversarial setting, each disturbance $w_t$ is selected by an adversary from a bounded set $\Omega \subseteq \mathbb{R}^n$ in order to maximize the cost. The controller has no information about the disturbance except that it is in $\Omega$. Similar to the stochastic setting, we define:
$$ADV^T_k = \sup_{w_0,\dots,w_{k-1} \in \Omega}\Big(\min_{u_0}\Big(\sup_{w_k \in \Omega}\Big(\cdots \min_{u_{T-k-1}}\Big(\sup_{w_{T-1} \in \Omega}\Big(\min_{u_{T-k},\dots,u_{T-1}} J\Big)\Big)\Big)\Big)\Big).$$
This can be viewed as online $H_\infty$ control [31] with predictions.

The average cost in an infinite horizon is particularly important in both the control and learning communities for understanding asymptotic behaviors. We use separate notation for it:
$$STO_k = \lim_{T \to \infty} \frac{1}{T} STO^T_k, \qquad ADV_k = \lim_{T \to \infty} \frac{1}{T} ADV^T_k.$$

Algorithm 1: Model predictive control with $k$ predictions

Parameter: $\{A, B, Q, R\}$ and $\tilde{Q}_f \in \mathbb{R}^{n \times n}$
Input: $x_0, w_0, \dots, w_{k-1}$
for $t = 0$ to $T-1$ do:
    Input: $x_t$, $w_{t+k-1}$    // the controller now knows $x_0, \dots, x_t$ and $w_0, \dots, w_{t+k-1}$
    $(u_t, \dots, u_{t+k-1}) = \arg\min_u \sum_{i=t}^{t+k-1}\big(x_i^\top Q x_i + u_i^\top R u_i\big) + x_{t+k}^\top \tilde{Q}_f x_{t+k}$
        subject to $x_{i+1} = A x_i + B u_i + w_i$ for $i = t, \dots, t+k-1$
    Output: $u_t$
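For concreteness, the $k$-step subproblem inside Algorithm 1 has a closed-form solution via a backward affine Riccati recursion (the same recursion as Equations (6)–(7) in the appendix, applied with deterministic disturbances). The sketch below is our own rendering of one MPC step based on that observation; the names `mpc_step` and `w_preds` are ours.

```python
import numpy as np
from numpy.linalg import solve

def mpc_step(A, B, Q, R, Qf_tilde, x_t, w_preds):
    """One step of Algorithm 1: solve the k-step problem with known
    disturbances w_t, ..., w_{t+k-1} and return the first control u_t."""
    P = Qf_tilde                      # terminal matrix P_{t+k} = Q_f tilde
    v = np.zeros(A.shape[0])          # linear coefficient of the cost-to-go
    for i in reversed(range(len(w_preds))):
        S = R + B.T @ P @ B
        if i == 0:
            # u_t = -S^{-1} B' (P A x_t + P w_t + v/2), cf. Eq. (6)
            return -solve(S, B.T @ (P @ A @ x_t + P @ w_preds[0] + v / 2))
        H = B @ solve(S, B.T)         # H = B S^{-1} B'
        Ft = A.T - A.T @ P @ H        # A' - A' P H
        v = Ft @ (v + 2 * P @ w_preds[i])
        P = Q + Ft @ P @ A            # = Q + A'PA - A'PHPA, cf. Eq. (7a)
```

Applied at every $t$ with the current state and the $k$ previewed disturbances, this reproduces the receding-horizon behavior of Algorithm 1.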
We emphasize that we do not impose any constraints (like linearity) on the policy space, and both $STO^T_k$ and $ADV^T_k$ are globally optimal for the corresponding type of disturbance. This point is important in light of recent results showing that linear policies cannot make use of predictions at all [16, 28], i.e., the cost of the best linear policy with infinite predictions ($k = \infty$) is asymptotically equal to that with no predictions ($k = 0$) in the setting with i.i.d. zero-mean stochastic disturbances. In this paper, we explicitly derive the optimal policy for every $k > 0$, which is nonlinear in general.

Model predictive control (MPC) is perhaps the most common control policy for situations where predictions are available. MPC is a greedy algorithm with a receding horizon based on all available current predictions. Algorithm 1 provides a formal definition, and we additionally refer the reader to the book [9] for a literature review on MPC. We adopt a conventional definition of MPC as an online optimal control problem over a finite-time horizon with dynamics constraints. Note that other prior work on MPC sometimes considers additional input and state constraints [9].

MPC is a practical algorithm in many scenarios such as robotics [8], energy systems [30], and data center cooling [22]. The existing theoretical studies of MPC focus on asymptotic stability and performance [6, 7, 18, 19, 25]. To our knowledge, we provide the first general dynamic regret guarantee for MPC in this paper.

In this paper, we study the performance of MPC in three different cases, where disturbances are i.i.d. zero-mean stochastic, generally stochastic, and adversarial, corresponding to Sections 3 to 5 respectively. We define the performance of MPC in the stochastic and adversarial settings as follows:
$$MPCS_k = \lim_{T \to \infty} \frac{1}{T} MPCS^T_k = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{w_0,\dots,w_{T-1}} J^{MPC_k}, \qquad MPCA_k = \lim_{T \to \infty} \frac{1}{T} MPCA^T_k = \lim_{T \to \infty} \frac{1}{T}\sup_{w_0,\dots,w_{T-1}} J^{MPC_k},$$
where $J^{MPC_k}$ is the cost of MPC given a specific disturbance sequence, i.e., $J^{MPC_k}(w) = J(u, w)$ where, for each $t$, $u_t = \phi(x_t, w_t, \dots, w_{t+k-1})$ and $\phi(\cdot)$ is the function that maps $x_t, w_t, \dots, w_{t+k-1}$ to the control $u_t$, as defined in Algorithm 1. By definition, $MPCS_k \geq STO_k$ and $MPCA_k \geq ADV_k$ for every $k \geq 0$, since they use the same information but the latter are defined to be optimal.

In this paper, we focus on two performance metrics, the dynamic regret and the performance ratio.

Dynamic regret.
Regret is a standard metric in online learning and provides a bound on the cost difference between an online algorithm and the optimal static policy given complete information. We focus on the dynamic regret, which compares to the optimal dynamic offline policy rather than the optimal static offline policy. Note that the optimal offline policy may be nonlinear. It is important to consider nonlinear policies because recent results highlight that the optimal offline policy can have cost that is arbitrarily lower than that of the optimal linear policy in hindsight [16, 28].

More specifically, we compare the cost of an online algorithm with $k$ predictions to that of the offline optimal (nonlinear) algorithm, i.e., one that has predictions of all disturbances. For MPC with $k$ predictions:
$$Reg^S(MPC_k) = \mathbb{E}_{(w_0,\dots,w_{T-1}) \sim \mathcal{W}}\Big(J^{MPC_k}(w) - \min_{u'_0,\dots,u'_{T-1}} J(u', w)\Big),$$
$$Reg^A(MPC_k) = \sup_{w_0,\dots,w_{T-1} \in \Omega}\Big(J^{MPC_k}(w) - \min_{u'_0,\dots,u'_{T-1}} J(u', w)\Big).$$
As compared to (static) regret, dynamic regret does not place any restriction on the comparison policies $u'_0, \dots, u'_{T-1}$, and thus differs from other notions of regret where $u'_0, \dots, u'_{T-1}$ are limited in special ways. For example, in the classical form of regret, $u'_0 = \cdots = u'_{T-1}$; and in the regret compared to the best offline linear controller [2, 12], $u'_t = -K^* x_t$.

In this work, we obtain both upper bounds and lower bounds on dynamic regret. For lower bounds, we define the minimum possible regret that an algorithm with $k$ predictions can achieve (i.e., the regret of the algorithm that minimizes the regret):
$$Reg^{S*}_k = \mathbb{E}_{w_0,\dots,w_{k-1}}\min_{u_0}\mathbb{E}_{w_k}\cdots\min_{u_{T-k-1}}\mathbb{E}_{w_{T-1}}\min_{u_{T-k},\dots,u_{T-1}}\Big(J(u, w) - \min_{u'_0,\dots,u'_{T-1}} J(u', w)\Big),$$
$$Reg^{A*}_k = \sup_{w_0,\dots,w_{k-1}}\min_{u_0}\sup_{w_k}\cdots\min_{u_{T-k-1}}\sup_{w_{T-1}}\min_{u_{T-k},\dots,u_{T-1}}\Big(J(u, w) - \min_{u'_0,\dots,u'_{T-1}} J(u', w)\Big).$$

Finally, we end our discussion of dynamic regret with a note highlighting an important contrast between the stochastic and adversarial settings. In the stochastic setting,
$$Reg^{S*}_k = \mathbb{E}_{w_0,\dots,w_{k-1}}\min_{u_0}\mathbb{E}_{w_k}\cdots\min_{u_{T-k-1}}\mathbb{E}_{w_{T-1}}\Big(\min_{u_{T-k},\dots,u_{T-1}} J(u, w) - \min_{u'_0,\dots,u'_{T-1}} J(u', w)\Big)$$
$$= \mathbb{E}_{w_0,\dots,w_{k-1}}\min_{u_0}\mathbb{E}_{w_k}\cdots\min_{u_{T-k-1}}\mathbb{E}_{w_{T-1}}\min_{u_{T-k},\dots,u_{T-1}} J(u, w) - \mathbb{E}_{w_0,\dots,w_{T-1}}\min_{u'_0,\dots,u'_{T-1}} J(u', w) = STO^T_k - STO^T_T.$$
This equality still holds if we take $\arg\min$ instead of $\min$, and thus the regret-optimal policy is the same as the cost-optimal policy. However, in the adversarial case, a similar reasoning gives only an inequality:
$Reg^{A*}_k \geq ADV^T_k - ADV^T_T$, and correspondingly, the regret-optimal and cost-optimal policies can be different. Similarly, for MPC, we have $Reg^S(MPC_k) = MPCS^T_k - STO^T_T$ while $Reg^A(MPC_k) \geq MPCA^T_k - ADV^T_T$.

Performance ratio.
The second metric we study is a new metric that we term the performance ratio. It characterizes the ratio of the cost of an online algorithm with $k$ predictions to the cost of the optimal online algorithm using $k$ predictions. Thus, it gives a way of comparing to a weaker benchmark than regret — one that has the same amount of information as the algorithm. Note that it is related to, but different than, the competitive ratio in this context. Formally, the performance ratio of the MPC algorithm in the stochastic and adversarial settings, respectively, is defined as:
$$PR^S(MPC_k) = \frac{MPCS_k}{STO_k}, \qquad PR^A(MPC_k) = \frac{MPCA_k}{ADV_k}.$$

While the dynamic regret indicates whether the algorithm can match the optimal offline policy (which has complete information), the performance ratio measures whether the algorithm is using the information available to it in as efficient a manner as possible. Thus, the contrast between the two separates the efficiency of the algorithm from the inefficiency created by the lack of information about future disturbances.

Finally, one may wonder if there are connections between dynamic regret and performance ratio. As might be expected, in both the stochastic and adversarial settings, the performance ratio of an online policy with $k$ predictions provides a lower bound on its (time-averaged) dynamic regret:
$$PR^S(MPC_k) - 1 \leq \frac{MPCS_k}{STO_\infty} - 1 = \frac{MPCS_k - STO_\infty}{STO_\infty} = \frac{1}{STO_\infty}\lim_{T \to \infty}\frac{1}{T}Reg^S(MPC_k),$$
$$PR^A(MPC_k) - 1 \leq \frac{MPCA_k}{ADV_\infty} - 1 = \frac{MPCA_k - ADV_\infty}{ADV_\infty} \leq \frac{1}{ADV_\infty}\lim_{T \to \infty}\frac{1}{T}Reg^A(MPC_k).$$

3 Zero-mean i.i.d. disturbances
We begin our analysis with the simplest of the three settings we consider: the disturbances $w_t$ are independent and identically distributed with zero mean. Though i.i.d. zero-mean is a limited setting, it is still complex enough to study predictions, and the first results characterizing the optimal policy with predictions appeared only recently [16], focusing only on the optimal policy when $k \to \infty$. Before diving into our results, we first recap the classical infinite-horizon linear quadratic stochastic regulator [4, 5], i.e., the case when $k = 0$:

Proposition 3.1 (Anderson and Moore [5]). Let $w_t$ be i.i.d. with zero mean and covariance matrix $W$. Then, the optimal control policy corresponding to $STO_0$ is given by:
$$u_t = -(R + B^\top P B)^{-1} B^\top P A x_t =: -K x_t,$$
where $P$ is the solution of the discrete-time algebraic Riccati equation (DARE)
$$P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A. \tag{1}$$
The corresponding closed-loop dynamics $A - BK$ is exponentially stable, i.e., $\rho(A - BK) < 1$. Further, the optimal cost is given by $STO_0 = \mathrm{Tr}\{PW\}$.

This result has been extensively studied in optimal control theory [4, 21] as well as in reinforcement learning [13, 14, 29]. We want to emphasize two important properties of the optimal policy $u_t = -Kx_t$. First, the policy is linear in the state $x_t$. In contrast, we show later that the optimal policy when $k \neq 0$ is, in general, nonlinear. Second, under the assumptions of our model, this policy is exponentially stable, i.e., $\rho(A - BK) < 1$. We leverage this to show the power of predictions later in the paper.
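Proposition 3.1 is straightforward to check numerically; a minimal sketch (ours), reusing the example system from the earlier code block:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                # DARE solution, Eq. (1)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
F = A - B @ K                                     # closed-loop dynamics
assert max(abs(np.linalg.eigvals(F))) < 1         # rho(A - BK) < 1

W = 0.01 * np.eye(2)                              # disturbance covariance
print("STO_0 =", np.trace(P @ W))                 # optimal average cost
```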
Optimal policy. Let $F = A - BK$ and $\lambda = \rho(F)^2 < 1$. From Gelfand's formula, there exists a constant $c(n)$ such that $\|F^k\|^2 \leq c(n)\lambda^k$ for all $k \geq 0$.

Theorem 3.2.
Let $w_t$ be i.i.d. with zero mean and covariance matrix $W$. Suppose the controller has $k \geq 0$ predictions. Then, the optimal control policy at each step $t$ is given by:
$$u_t = -(R + B^\top P B)^{-1} B^\top\Big(P A x_t + \sum_{i=0}^{k-1}(A^\top - A^\top P H)^i P w_{t+i}\Big), \tag{2}$$
where $P$ is the solution of the DARE in Equation (1). The cost under this policy is:
$$STO_k = \mathrm{Tr}\Big\{\Big(P - \sum_{i=0}^{k-1} P (A - HPA)^i H (A^\top - A^\top P H)^i P\Big) W\Big\}, \tag{3}$$
where $H = B(R + B^\top P B)^{-1} B^\top$.

Following the approach developed in [16], the proof is based on an analysis of quadratic cost-to-go functions of the form $V_t(x_t) = x_t^\top P_t x_t + v_t^\top x_t + q_t$. Note that $A - HPA = A - B(R + B^\top P B)^{-1} B^\top P A = A - BK = F$. Thus, the online optimal cost $STO_k$ with $k$ predictions approaches the offline optimal cost $STO_\infty$ at an exponential rate. In other words, $STO_k / STO_\infty = 1 + O(\|F^k\|^2) = 1 + O(\lambda^k)$. Two extreme cases of our result are noteworthy. When $k = 0$, it reduces to the classical Proposition 3.1. When $k \to \infty$, it reduces to the offline optimal case derived by Goel and Hassibi [16].
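A sketch (ours) of the policy in Equation (2): the feedback gain on $x_t$ and the $k$ feedforward gains on the previewed disturbances can be precomputed once; the function names are our own.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def predictive_lqr_gains(A, B, Q, R, k):
    """Gains of Eq. (2): u_t = -(Kx @ x_t + sum_i G_i @ w_{t+i})."""
    P = solve_discrete_are(A, B, Q, R)
    S = R + B.T @ P @ B
    H = B @ np.linalg.solve(S, B.T)
    Ft = A.T - A.T @ P @ H                   # (A' - A'PH)
    Kx = np.linalg.solve(S, B.T @ P @ A)     # feedback gain on x_t
    gains, M = [], P.copy()
    for _ in range(k):
        gains.append(np.linalg.solve(S, B.T @ M))  # S^{-1} B' (A'-A'PH)^i P
        M = Ft @ M
    return Kx, gains

def u_opt(Kx, gains, x_t, w_preds):
    return -(Kx @ x_t + sum(G @ w for G, w in zip(gains, w_preds)))
```

Because $\|(A^\top - A^\top P H)^i\|$ decays geometrically, the feedforward gains shrink rapidly with $i$, which is the mechanism behind the exponentially decaying marginal value of predictions.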
Model predictive control. As might be expected, since the disturbances are i.i.d., future disturbances have no dependence on the current ones. As a result, MPC gives the optimal policy.
Theorem 3.3.
In Algorithm 1, let $\tilde{Q}_f = P$. Then, the MPC policy with $k$ predictions is also given by Equation (2). Assuming i.i.d. disturbances with zero mean, the MPC policy is optimal.

Due to its greedy nature, MPC does not utilize any properties of the disturbance distribution, so the first part of Theorem 3.3 holds not only for i.i.d. disturbances but also for the other types of disturbance considered in the later sections, i.e., the MPC policy with $k$ predictions is always given by Equation (2).

4 General stochastic disturbances
In this section, we consider a general form of stochastic disturbance, more general than typically considered in this context [10, 12, 13]. Suppose the disturbance sequence $\{w_t\}_{t=0,1,2,\dots}$ is sampled from a joint distribution $\mathcal{W}$ such that the cross-correlation of each pair is uniformly bounded, i.e., there exists $m > 0$ such that for all $t, t' \geq 0$, $\mathbb{E}[w_t^\top w_{t'}] \leq m$.

Optimal policy.
In the case of general stochastic disturbances, we cannot obtain as clean a form for $STO_k$ as in the i.i.d. case in Section 3. However, the marginal benefit of having an extra prediction decays at the same (exponential) rate, and the optimal policy is similar to that in Section 3, but with some additional terms that characterize the expected future disturbances given the current information.

Theorem 4.1.
The optimal control policy with general stochastic disturbances is given by:
$$u_t = -(R + B^\top P B)^{-1} B^\top\Big(P A x_t + \sum_{i=0}^{k-1} F^{\top i} P w_{t+i} + \sum_{i=k}^{\infty} F^{\top i} P \mu_{t+i|t+k-1}\Big), \tag{4}$$
where $\mu_{t'|t} = \mathbb{E}[w_{t'} \mid w_0, \dots, w_t]$. Under this policy, the marginal benefit of obtaining an extra prediction decays exponentially fast in the existing number $k$ of predictions. Formally, for $k \geq 0$,
$$STO_k - STO_{k+1} = O(\|F^k\|^2) = O(\lambda^k).$$
The proof leverages a novel difference analysis of cost-to-go functions. Note that for some distributions,
$STO_k$ may approach $STO_\infty$ much faster than at an exponential rate. It is even possible that $STO_k = STO_\infty$ for finite $k$, as we show in Example 4.2 below. On the other hand, there are scenarios where $STO_k$ approaches $STO_\infty$ in an exactly exponential manner, as we show in Example 4.3 below.
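To illustrate the conditional-mean terms in Equation (4), suppose — purely as a hypothetical example, not a model from the paper — the disturbance follows an AR(1) process $w_{t+1} = \varphi w_t + \varepsilon_t$ with $|\varphi| < 1$ and zero-mean $\varepsilon_t$. Then $\mu_{t'|t} = \varphi^{t'-t} w_t$, and the infinite tail of Equation (4) can be approximated by truncation:

```python
import numpy as np

def ar1_tail(Ft, P, phi, w_last, k, n_terms=200):
    """Truncated tail of Eq. (4) for AR(1) disturbances:
    sum_{i>=k} Ft^i P mu_{t+i|t+k-1}, where mu_{t+i|t+k-1} =
    phi**(i-k+1) * w_{t+k-1}.  Here Ft stands for (A' - A'PH)."""
    total = np.zeros_like(w_last)
    M = np.linalg.matrix_power(Ft, k) @ P     # Ft^k P
    for i in range(k, k + n_terms):
        total += (phi ** (i - k + 1)) * (M @ w_last)
        M = Ft @ M                            # advance to Ft^(i+1) P
    return total
```

MPC drops exactly this tail term, which is why it is a truncation of the optimal policy (see the "Model predictive control" paragraph below).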
Example 4.2. Define the joint distribution $\mathcal{W}$ such that with probability $1/2$, all $w_t = w$, and otherwise all $w_t = -w$. In this case, one prediction is equivalent to infinite predictions, since it is enough to distinguish these two scenarios with only $w_0$. As a result, $STO_1 = STO_\infty$.

Example 4.3.
Suppose the system is 1-d ($n = d = 1$) and the disturbance is i.i.d. with zero mean, i.e., the setting of Section 3. Then, according to Equation (3), as long as $F, P, H, W$ are non-zero,
$$STO_k - STO_\infty = \sum_{i=k}^{\infty} F^{2i} P^2 H W = \Theta(F^{2k}).$$
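The geometric decay in Example 4.3 can be checked directly; a scalar sketch (ours, with arbitrary scalar values):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A, B, Q, R, W = 2.0, 1.0, 1.0, 1.0, 1.0          # illustrative scalars
P = solve_discrete_are([[A]], [[B]], [[Q]], [[R]])[0, 0]
H = B**2 / (R + B**2 * P)
K = B * P * A / (R + B**2 * P)
F = A - B * K                                    # here |F| < 1

def sto(k):                                      # Eq. (3), scalar form
    return (P - P**2 * H * sum(F**(2 * i) for i in range(k))) * W

for k in range(1, 6):
    print(k, sto(k) - sto(200))                  # shrinks like F**(2k)
```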
Model predictive control. Comparing the MPC policy in Equation (2) with the optimal policy in Equation (4) reveals that MPC is a truncation of the optimal policy and is no longer optimal, because MPC is a greedy policy that does not consider the dependence of future disturbances on current information. Nevertheless, it is still a near-optimal policy, as characterized by the following results.
Theorem 4.4.
$MPCS_k - MPCS_{k+1} = O(\|F^k\|^2) = O(\lambda^k)$. Moreover, in Example 4.3, $MPCS_k - MPCS_{k+1} = \Theta(\|F^k\|^2)$.

In other words, the marginal benefit for the MPC algorithm of an extra prediction decays exponentially fast, paralleling the result for the optimal policy in Equation (4). Theorem 4.4 implies that MPC has a bounded performance ratio, which converges to 1 at an exponential rate in the number of available predictions. Formally:
Corollary 4.5. $PR^S(MPC_k) = \frac{MPCS_k}{STO_k} \leq \frac{MPCS_k}{STO_\infty} = \frac{MPCS_k}{MPCS_\infty} = 1 + O(\|F^k\|^2) = 1 + O(\lambda^k)$. Moreover, in Example 4.2, we have $PR^S(MPC_k) = 1 + \Theta(\|F^k\|^2)$.

Besides, the dynamic regret of MPC (nearly) matches the order of the optimal dynamic regret.
Theorem 4.6 (Main result). $Reg^S(MPC_k) = MPCS^T_k - STO^T_T = O(\|F^k\|^2 T + 1) = O(\lambda^k T + 1)$, where the second term results from the difference between finite and infinite horizons.

Theorem 4.7.
The optimal dynamic regret satisfies $Reg^{S*}_k = STO^T_k - STO^T_T = O(\|F^k\|^2 T + 1) = O(\lambda^k T + 1)$, and there exist $A$, $B$, $Q$, $R$, $Q_f$, $x_0$, and $\mathcal{W}$ such that $Reg^{S*}_k = \Theta(\|F^k\|^2(T - k))$.

Note that, in the stochastic case, the regret-optimal policy is the same as the cost-optimal policy, i.e., the policy for
$STO^T_k$ is the same as that for $Reg^{S*}_k$.
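These stochastic regret results can be probed by simulation. The sketch below (our construction; the paper reports no experiments) estimates $Reg^S(MPC_k)$ by Monte Carlo, reusing `mpc_step` from the earlier sketch. The offline optimum $\min_u J(u, w)$ is obtained by running the same solver with all $T$ disturbances previewed and the true terminal cost $Q_f$; shrinking the preview window near the end of the horizon is our adaptation of Algorithm 1 to a finite horizon.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def mpc_total_cost(A, B, Q, R, Qf, P, x0, ws, k):
    """Cost of MPC with k previews on a fixed disturbance sequence ws,
    using terminal matrix P inside the horizon and Qf at the very end."""
    T, x, J = len(ws), x0, 0.0
    for t in range(T):
        h = min(k, T - t)                       # preview window
        term = Qf if t + h == T else P
        u = mpc_step(A, B, Q, R, term, x, ws[t:t + h])
        J += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + ws[t]
    return J + x @ Qf @ x

def estimate_regret(A, B, Q, R, Qf, k, T=60, trials=200, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    P = solve_discrete_are(A, B, Q, R)
    x0 = np.zeros(A.shape[0])
    reg = 0.0
    for _ in range(trials):
        ws = rng.normal(scale=sigma, size=(T, A.shape[0]))
        reg += mpc_total_cost(A, B, Q, R, Qf, P, x0, ws, k)  # J(MPC_k, w)
        reg -= mpc_total_cost(A, B, Q, R, Qf, P, x0, ws, T)  # min_u J(u, w)
    return reg / trials
```

With i.i.d. disturbances, MPC with $k$ predictions matches the cost-optimal online policy (Theorem 3.3), so the estimate reflects purely the information gap; in either case the estimate should flatten out as $k$ grows, consistent with the $O(\lambda^k T + 1)$ bound.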
5 Adversarial disturbances

We now move from stochastic to adversarial disturbances. In this section, the disturbances are chosen from a bounded set $\Omega \subseteq \mathbb{R}^n$ by an adversary in order to maximize the controller's cost. Maintaining small regret is more challenging in adversarial models than in stochastic ones, so one may expect weaker bounds. Perhaps surprisingly, we obtain bounds of the same order.

Optimal policy.
In the adversarial setting, the cost of the optimal policy, defined with a sequence of $\min$'s and $\sup$'s, is the equilibrium value of a two-player zero-sum game. In general, it is impossible to give an analytical expression for either $ADV_k$ or the corresponding optimal policy. However, we prove a result that is structurally similar to the results from the stochastic setting, highlighting the exponential improvement from predictions.

Theorem 5.1.
For $k \geq 0$, $ADV_k - ADV_{k+1} = O(\|F^k\|^2) = O(\lambda^k)$.

Similarly to Example 4.2 for the stochastic case, in the adversarial setting the optimal cost with $k$ predictions may approach the offline optimal cost (under infinite predictions) much faster than at an exponential rate, and it is possible that $ADV_k = ADV_\infty$ for finite $k$, as shown in Example 5.2.

Example 5.2.
Let $A = B = Q = R = 1$ and $\Omega = [-1, 1]$. In this case, one prediction is enough to leverage the full power of prediction. Formally, we have $ADV_1 = ADV_\infty = 1$. In other words, for all $k \geq 1$, $ADV_k = 1$. The optimal control policy (as $T \to \infty$) is a piecewise function:
$$u^*(x, w) = \begin{cases} -(x+w), & -1 \leq x+w \leq 1 \\ -(x+w) + \frac{3-\sqrt{5}}{2}(x+w-1), & x+w > 1 \\ -(x+w) + \frac{3-\sqrt{5}}{2}(x+w+1), & x+w < -1. \end{cases}$$
The proof leverages two different cost-to-go functions for the min player and the sup player.
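A short simulation (ours) of this policy against the greedy adversary $w_t = \mathrm{sign}(x_t)$ used in the proof illustrates that the long-run average cost approaches $ADV_1 = 1$:

```python
import numpy as np

C = (3.0 - np.sqrt(5.0)) / 2.0

def u_star(x, w):
    """Optimal one-prediction policy of Example 5.2 (A = B = Q = R = 1)."""
    z = x + w
    if -1.0 <= z <= 1.0:
        return -z
    return -z + C * (z - np.sign(z))

x, cost, N = 2.0, 0.0, 10_000
for _ in range(N):
    w = np.sign(x) if x != 0 else 1.0   # adversary pushes |x + w| outward
    u = u_star(x, w)
    x = x + u + w
    cost += u**2 + x**2                 # per-step cost u_t^2 + x_{t+1}^2
print(cost / N)                         # approaches ADV_1 = 1
```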
Note that the optimal policy can be much more complex. Unlike Example 5.2, where the optimal policy is piecewise linear with only 3 pieces, for other values of $A, B, Q, R$ this function may have many more pieces.
Model predictive control.
Under adversarial disturbances, MPC is suboptimal, e.g., in Example 5.2. However, its performance ratio and dynamic regret bounds turn out to be the same as those in the stochastic setting.
Theorem 5.3.
$MPCA_k - MPCA_{k+1} = O(\|F^k\|^2) = O(\lambda^k)$.

Corollary 5.4.
For $k \geq 0$, $PR^A(MPC_k) = \frac{MPCA_k}{ADV_k} \leq \frac{MPCA_k}{ADV_\infty} = \frac{MPCA_k}{MPCA_\infty} = 1 + O(\|F^k\|^2) = 1 + O(\lambda^k)$.

This highlights that MPC has a bounded performance ratio, which converges to 1 at an exponential rate. Additionally, MPC has the same order of dynamic regret as in the stochastic case:
Theorem 5.5 (Main result). $Reg^A(MPC_k) = O(\|F^k\|^2 T + 1) = O(\lambda^k T + 1)$.

This dynamic regret is linear in the horizon $T$ if we fix the number of predictions. However, if $k$ is a super-constant function of $T$ — an increasing function of $T$ that is not upper-bounded by a constant — then the regret is sub-linear. Furthermore, if we let $k = \log T / \log(1/\lambda)$, then $Reg^A(MPC_k) = O(1)$. In other words, we can get constant regret with $O(\log T)$ predictions, even with adversarial disturbances. Finally, as implied by the following result, the $O(\log T)$ horizon cannot be improved, since even the regret-minimizing algorithm needs the same order of predictions to reach constant regret.
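Concretely, the bound $O(\lambda^k T + 1)$ becomes $O(1)$ as soon as $\lambda^k T \leq 1$, i.e., $k \geq \log T / \log(1/\lambda)$. A tiny sketch (ours) of the required prediction horizon:

```python
import numpy as np

def preds_for_constant_regret(lam, T):
    """Smallest k with lam**k * T <= 1, i.e., ceil(log T / log(1/lam))."""
    return int(np.ceil(np.log(T) / np.log(1.0 / lam)))

for T in (10**3, 10**6, 10**9):
    print(T, preds_for_constant_regret(0.4, T))   # grows like log T
```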
Theorem 5.6. $Reg^{A*}_k = O(\|F^k\|^2 T + 1) = O(\lambda^k T + 1)$. Moreover, there exist $A$, $B$, $Q$, $R$, $Q_f$, $x_0$, and $\Omega$ such that $Reg^{A*}_k = \Omega(\|F^k\|^2(T - k))$. (Here $\Omega(\cdot)$ is the growth-order notation and has nothing to do with the bounded set $\Omega$.)

6 Concluding remarks

We conclude with several open problems and potential future research directions. Our results highlight the power of predictions and show that, given predictions, a simple greedy policy (MPC) is near-optimal for LQR control with disturbances in the dynamics, in terms of dynamic regret. Building on

References
[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
[2] Naman Agarwal, Brian Bullins, Elad Hazan, Sham M Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019.
[3] Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
[4] Brian DO Anderson and John B Moore. Optimal control: linear quadratic methods. Courier Corporation, 2007.
[5] Brian DO Anderson and John B Moore. Optimal filtering. Courier Corporation, 2012.
[6] David Angeli, Rishi Amrit, and James B Rawlings. On average performance and stability of economic model predictive control. IEEE Transactions on Automatic Control, 57(7):1615–1626, 2011.
[7] David Angeli, Alessandro Casavola, and Francesco Tedesco. Theoretical advances on economic model predictive control with time-varying costs. Annual Reviews in Control, 41:218–224, 2016.
[8] Tomas Baca, Daniel Hert, Giuseppe Loianno, Martin Saska, and Vijay Kumar. Model predictive trajectory tracking and collision avoidance for reliable outdoor deployment of unmanned aerial vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6753–6760. IEEE, 2018.
[9] Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
[10] Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. arXiv preprint arXiv:2002.08095, 2020.
[11] Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan LH Andrew. Online convex optimization using predictions. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 191–204, 2015.
[12] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning (ICML), 2019.
[13] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Neural Information Processing Systems (NeurIPS), 2018.
[14] Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
[15] Dylan J Foster and Max Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020.
[16] Gautam Goel and Babak Hassibi. The power of linear controllers in LQR control. arXiv preprint arXiv:2002.02574, 2020.
[17] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and LQR control. Proceedings of Machine Learning Research, 89:2504–2513, 2019.
[18] Lars Grüne and Simon Pirkelmann. Economic model predictive control for time-varying system: Performance and stability results. Optimal Control Applications and Methods, 2018.
[19] Lars Grüne and Marleen Stieler. Asymptotic stability and transient optimality of economic MPC without terminal conditions. Journal of Process Control, 24(8):1187–1196, 2014.
[20] Elad Hazan, Sham M Kakade, and Karan Singh. The nonstochastic control problem. In Conference on Algorithmic Learning Theory (ALT), 2020.
[21] Donald E Kirk. Optimal control theory: an introduction. Courier Corporation, 2004.
[22] Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, pages 3814–3823, 2018.
[23] Yingying Li, Xin Chen, and Na Li. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. In Advances in Neural Information Processing Systems, pages 14858–14870, 2019.
[24] Yiheng Lin, Gautam Goel, and Adam Wierman. Online optimization with predictions and non-convex losses. arXiv preprint arXiv:1911.03827, 2019.
[25] Ugo Rosolia and Francesco Borrelli. Learning model predictive control for iterative tasks. A data-driven control framework. IEEE Transactions on Automatic Control, 63(7):1883–1896, 2017.
[26] Ugo Rosolia and Francesco Borrelli. Sample-based learning model predictive control for linear uncertain systems. arXiv preprint arXiv:1904.06432, 2019.
[27] Guanya Shi, Xichen Shi, Michael O'Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. In International Conference on Robotics and Automation (ICRA), 2019.
[28] Guanya Shi, Yiheng Lin, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Beyond no-regret: Competitive control via online optimization with memory. arXiv preprint arXiv:2002.05318, 2020.
[29] Max Simchowitz and Dylan J Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.
[30] Sergio Vazquez, Jose Rodriguez, Marco Rivera, Leopoldo G Franquelo, and Margarita Norambuena. Model predictive control for power converters and drives: Advances and trends. IEEE Transactions on Industrial Electronics, 64(2):935–947, 2016.
[31] Kemin Zhou and John Comstock Doyle. Essentials of robust control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.
A Proofs of Section 3
In all proofs in this paper, for a sequence $x = (x_0, x_1, \dots, x_n)$, we use $x_{a:b}$ to denote its consecutive subsequence $(x_a, x_{a+1}, \dots, x_b)$.

A.1 Proof of Theorem 3.2
Let $w_t$ be i.i.d. with zero mean and covariance matrix $W$. Suppose the controller has $k \geq 0$ predictions. Then, the optimal control policy at each step $t$ is given by:
$$u_t = -(R + B^\top P B)^{-1} B^\top\Big(P A x_t + \sum_{i=0}^{k-1}(A^\top - A^\top P H)^i P w_{t+i}\Big), \tag{2}$$
where $P$ is the solution of the DARE in Equation (1). The cost under this policy is:
$$STO_k = \mathrm{Tr}\Big\{\Big(P - \sum_{i=0}^{k-1} P (A - HPA)^i H (A^\top - A^\top P H)^i P\Big) W\Big\}, \tag{3}$$
where $H = B(R + B^\top P B)^{-1} B^\top$.

Proof. Our proof technique closely follows that in Section 4.1 of [16]. To begin, note that the definition of $STO^T_k$ has a structure of repeating $\min$'s and $\mathbb{E}$'s. We use dynamic programming to compute the value iteratively. In particular, we apply backward induction to solve the optimal cost-to-go functions, from time step $T$ back to the initial state. Given state $x_t$ and predictions $w_t, \dots, w_{t+k-1}$, we define the cost-to-go function:
$$V_t(x_t; w_{t:t+k-1}) := \min_{u_t}\mathbb{E}_{w_{t+k}}\min_{u_{t+1}}\cdots\mathbb{E}_{w_{T-1}}\min_{u_{T-k},\dots,u_{T-1}}\sum_{i=t}^{T-1}\big(x_i^\top Q x_i + u_i^\top R u_i\big) + x_T^\top Q_f x_T \tag{5}$$
$$= x_t^\top Q x_t + \min_{u_t}\Big(u_t^\top R u_t + \mathbb{E}_{w_{t+k}}\big[V_{t+1}(Ax_t + Bu_t + w_t; w_{t+1:t+k})\big]\Big)$$
with $V_T(x_T; \cdot) = x_T^\top Q_f x_T$. Note that $\mathbb{E}_{w_{t+k}}$ has no effect for $t \geq T - k$. This function measures the expected overall control cost from a given state to the end, assuming the controller makes the optimal decision at each time.

We will show by backward induction that for every $t = 0, \dots, T$,
$$V_t(x_t; w_{t:t+k-1}) = x_t^\top P_t x_t + v_t^\top x_t + q_t,$$
where $P_t, v_t, q_t$ are coefficients that may depend on $w_{t:t+k-1}$. This is clearly true for $t = T$. Suppose it is true at $t+1$. Then,
$$V_t(x; w_{t:t+k-1}) = x^\top Q x + \min_u\Big(u^\top R u + (Ax + Bu + w_t)^\top P_{t+1}(Ax + Bu + w_t) + \mathbb{E}_{w_{t+k}}[v_{t+1}]^\top (Ax + Bu + w_t) + \mathbb{E}_{w_{t+k}}[q_{t+1}]\Big)$$
$$= x^\top Q x + (Ax + w_t)^\top P_{t+1}(Ax + w_t) + \mathbb{E}_{w_{t+k}}[v_{t+1}]^\top (Ax + w_t) + \mathbb{E}_{w_{t+k}}[q_{t+1}] + \min_u\Big(u^\top (R + B^\top P_{t+1} B) u + 2u^\top B^\top\Big(P_{t+1} A x + P_{t+1} w_t + \frac{1}{2}\mathbb{E}_{w_{t+k}}[v_{t+1}]\Big)\Big).$$
The optimal $u$ is obtained by setting the derivative to zero:
$$u^* = -(R + B^\top P_{t+1} B)^{-1} B^\top\Big(P_{t+1} A x + P_{t+1} w_t + \frac{1}{2}\mathbb{E}_{w_{t+k}}[v_{t+1}]\Big). \tag{6}$$
Let $H_t = B(R + B^\top P_{t+1} B)^{-1} B^\top$. Plugging $u^*$ back into $V_t$, we have
$$V_t(x; w_{t:t+k-1}) = x^\top Q x + (Ax + w_t)^\top P_{t+1}(Ax + w_t) + \mathbb{E}_{w_{t+k}}[v_{t+1}]^\top (Ax + w_t) + \mathbb{E}_{w_{t+k}}[q_{t+1}] - \Big(P_{t+1} A x + P_{t+1} w_t + \frac{1}{2}\mathbb{E}_{w_{t+k}}[v_{t+1}]\Big)^\top H_t\Big(P_{t+1} A x + P_{t+1} w_t + \frac{1}{2}\mathbb{E}_{w_{t+k}}[v_{t+1}]\Big)$$
$$= x^\top\big(Q + A^\top P_{t+1} A - A^\top P_{t+1} H_t P_{t+1} A\big)x + x^\top\Big((A^\top - A^\top P_{t+1} H_t)\mathbb{E}_{w_{t+k}}[v_{t+1}] + 2(A^\top - A^\top P_{t+1} H_t) P_{t+1} w_t\Big) + w_t^\top (P_{t+1} - P_{t+1} H_t P_{t+1}) w_t + w_t^\top (I - P_{t+1} H_t)\mathbb{E}_{w_{t+k}}[v_{t+1}] - \frac{1}{4}\mathbb{E}_{w_{t+k}}[v_{t+1}]^\top H_t\,\mathbb{E}_{w_{t+k}}[v_{t+1}] + \mathbb{E}_{w_{t+k}}[q_{t+1}].$$
Thus, the recursive formulae, which parallel [16], are given by:
$$P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} H_t P_{t+1} A, \tag{7a}$$
$$v_t = (A^\top - A^\top P_{t+1} H_t)\mathbb{E}_{w_{t+k}}[v_{t+1}] + 2(A^\top - A^\top P_{t+1} H_t) P_{t+1} w_t, \tag{7b}$$
$$q_t = w_t^\top (P_{t+1} - P_{t+1} H_t P_{t+1}) w_t + w_t^\top (I - P_{t+1} H_t)\mathbb{E}_{w_{t+k}}[v_{t+1}] - \frac{1}{4}\mathbb{E}_{w_{t+k}}[v_{t+1}]^\top H_t\,\mathbb{E}_{w_{t+k}}[v_{t+1}] + \mathbb{E}_{w_{t+k}}[q_{t+1}]. \tag{7c}$$
As $T - t \to \infty$, $P_t$ and $H_t$ converge to $P$ and $H$ respectively, where $P$ is the solution of the discrete-time algebraic Riccati equation (DARE) $P = Q + A^\top P A - A^\top P H P A$, and $H = B(R + B^\top P B)^{-1} B^\top$. Note that $v_T = 0$ and $q_T = 0$.
Then,
$$v_t = 2\sum_{i=0}^{k-1}(A^\top - A^\top P H)^{i+1} P w_{t+i}, \tag{8}$$
$$q_t = w_t^\top (P - PHP) w_t + w_t^\top (I - PH)\mathbb{E}_{w_{t+k}}[v_{t+1}] - \frac{1}{4}\mathbb{E}_{w_{t+k}}[v_{t+1}]^\top H\,\mathbb{E}_{w_{t+k}}[v_{t+1}] + \mathbb{E}_{w_{t+k}}[q_{t+1}], \tag{9}$$
$$\mathbb{E}_{w_{t+k}}[v_{t+1}] = 2\sum_{i=1}^{k-1}(A^\top - A^\top P H)^i P w_{t+i}. \tag{10}$$
Taking the expectation of $q_t$ over all randomness, namely $w_0, w_1, w_2, \dots$, we have
$$\mathbb{E}[q_t] = \mathrm{Tr}\{(P - PHP)W\} - \sum_{i=1}^{k-1}\mathrm{Tr}\big\{P(A - HPA)^i H (A^\top - A^\top P H)^i P W\big\} + \mathbb{E}[q_{t+1}] = \mathrm{Tr}\Big\{\Big(P - \sum_{i=0}^{k-1} P(A - HPA)^i H (A^\top - A^\top P H)^i P\Big)W\Big\} + \mathbb{E}[q_{t+1}], \tag{11}$$
where in the first equality we use $\mathbb{E}[w_t] = 0$ and the independence of the disturbances. Thus, as $T \to \infty$, a constant cost is incurred in each time step, and the average cost $STO_k$ is exactly this value:
$$STO_k = \lim_{T\to\infty}\frac{1}{T}STO^T_k = \lim_{T\to\infty}\frac{1}{T}\mathbb{E}\big[V_0(x_0; w_{0:k-1})\big] = \lim_{T\to\infty}\frac{1}{T}\mathbb{E}[q_0] = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\big(\mathbb{E}[q_t] - \mathbb{E}[q_{t+1}]\big) = \mathrm{Tr}\Big\{\Big(P - \sum_{i=0}^{k-1} P(A - HPA)^i H (A^\top - A^\top P H)^i P\Big)W\Big\}.$$
The explicit form of the optimal control policy is obtained by combining Equations (6) and (10). ∎
A.2 Proof of Theorem 3.3
In Algorithm 1, let $\tilde{Q}_f = P$. Then, the MPC policy with $k$ predictions is also given by Equation (2). Assuming i.i.d. disturbances with zero mean, the MPC policy is optimal.

Proof. Due to its greedy nature, the MPC policy is given by the solution of a length-$k$ optimal control problem with deterministic $w_t, \dots, w_{t+k-1}$. In other words, we want to derive the optimal policy $(u_t, \dots, u_{t+k-1})$ that minimizes
$$\sum_{i=t}^{t+k-1}\big(x_i^\top Q x_i + u_i^\top R u_i\big) + x_{t+k}^\top P x_{t+k},$$
where $x_{i+1} = Ax_i + Bu_i + w_i$, given $x_t, w_t, \dots, w_{t+k-1}$. Define the cost-to-go function at time $i$ given $x_i, w_i, \dots, w_{t+k-1}$:
$$V_i(x_i; w_{i:t+k-1}) = \min_{u_{i:t+k-1}}\sum_{j=i}^{t+k-1}\big(x_j^\top Q x_j + u_j^\top R u_j\big) + x_{t+k}^\top P x_{t+k} = x_i^\top Q x_i + \min_{u_i}\big(u_i^\top R u_i + V_{i+1}(Ax_i + Bu_i + w_i; w_{i+1:t+k-1})\big).$$
Note that $V_{t+k}(x_{t+k}) = x_{t+k}^\top P x_{t+k}$. Similar to the proof of Theorem 3.2, we can inductively show that $V_i(x_i; w_{i:t+k-1}) = x_i^\top P x_i + v_i^\top x_i + q_i$ for some $v_i$ and $q_i$. Note that the second-degree coefficient no longer depends on the index $i$ as in the previous proof, because we start from $P$, the solution of the DARE. We then have the following equations, which parallel Equations (6) and (8):
$$v_i = 2\sum_{j=0}^{t+k-i-1} F^{\top(j+1)} P w_{i+j},$$
$$u_i^* = -(R + B^\top P B)^{-1} B^\top\Big(P A x_i + P w_i + \frac{1}{2}v_{i+1}\Big) = -(R + B^\top P B)^{-1} B^\top\Big(P A x_i + \sum_{j=0}^{t+k-i-1} F^{\top j} P w_{i+j}\Big).$$
The case $i = t$ gives:
$$u_t^* = -(R + B^\top P B)^{-1} B^\top\Big(P A x_t + \sum_{j=0}^{k-1} F^{\top j} P w_{t+j}\Big),$$
which is the MPC policy at time step $t$, and is the same as Equation (2). ∎

B Proofs of Section 4
B.1 Proof of Theorem 4.1
The optimal control policy with general stochastic disturbances is given by:
$$u_t = -(R + B^\top P B)^{-1} B^\top\Big(P A x_t + \sum_{i=0}^{k-1} F^{\top i} P w_{t+i} + \sum_{i=k}^{\infty} F^{\top i} P \mu_{t+i|t+k-1}\Big), \tag{4}$$
where $\mu_{t'|t} = \mathbb{E}[w_{t'} \mid w_0, \dots, w_t]$. Under this policy, the marginal benefit of obtaining an extra prediction decays exponentially fast in the existing number $k$ of predictions. Formally, for $k \geq 0$, $STO_k - STO_{k+1} = O(\|F^k\|^2) = O(\lambda^k)$.

Proof.
Similar to the proof of Theorem 3.2, we assume $V_t(x_t; w_{t:t+k-1}) = x_t^\top P_t x_t + x_t^\top v_t + q_t$, where $V_t$ has a definition similar to Equation (5) but may further depend on $w_0, \dots, w_{t-1}$ because the disturbance sequence is no longer Markovian. In this case, $P_t$, $v_t$, and $q_t$ still satisfy the recursive forms in Equation (7). However, the expected values of $w_t$ and $v_t$ are different, since we now have a more general distribution. Let $T - t \to \infty$, $\mu_{t'|t} = \mathbb{E}[w_{t'} \mid w_0, \dots, w_t]$, and $F = A - HPA$. Then,
$$v^k_t = 2\sum_{i=0}^{k-1} F^{\top(i+1)} P w_{t+i} + 2\sum_{i=k}^{\infty} F^{\top(i+1)} P \mu_{t+i|t+k-1}, \tag{12}$$
$$q^k_t = w_t^\top (P - PHP) w_t + w_t^\top (I - PH)\mathbb{E}_{w_{t+k}}[v^k_{t+1}] - \frac{1}{4}\mathbb{E}_{w_{t+k}}[v^k_{t+1}]^\top H\,\mathbb{E}_{w_{t+k}}[v^k_{t+1}] + \mathbb{E}_{w_{t+k}}[q^k_{t+1}],$$
where the superscript $k$ denotes the number of predictions.

The optimal policy in this case has the same form as Equation (6). Plugging Equation (12) into it, we obtain the optimal policy in the theorem.

Further,
$$\mathbb{E}\big[q^k_t - q^{k+1}_t\big] = \mathbb{E}\Big[w_t^\top (I - PH)\Big(\mathbb{E}_{w_{t+k}}[v^k_{t+1}] - \mathbb{E}_{w_{t+k+1}}[v^{k+1}_{t+1}]\Big)\Big] \tag{13a}$$
$$+ \frac{1}{4}\mathbb{E}\Big[\mathbb{E}_{w_{t+k+1}}[v^{k+1}_{t+1}]^\top H\,\mathbb{E}_{w_{t+k+1}}[v^{k+1}_{t+1}] - \mathbb{E}_{w_{t+k}}[v^k_{t+1}]^\top H\,\mathbb{E}_{w_{t+k}}[v^k_{t+1}]\Big] \tag{13b}$$
$$+ \mathbb{E}\big[q^k_{t+1} - q^{k+1}_{t+1}\big], \tag{13c}$$
where the expectation $\mathbb{E}$ is taken over all randomness. Part (13a) is zero because $\mathbb{E}_{w_{t+k}}[v^k_{t+1}] = \mathbb{E}_{w_{t+k}, w_{t+k+1}}[v^{k+1}_{t+1}]$. For the same reason, part (13b) equals
$$\frac{1}{4}\mathbb{E}_{w_{t+k}}\Big[\Big(\mathbb{E}_{w_{t+k+1}}[v^{k+1}_{t+1}] - \mathbb{E}_{w_{t+k}}[v^k_{t+1}]\Big)^\top H\Big(\mathbb{E}_{w_{t+k+1}}[v^{k+1}_{t+1}] - \mathbb{E}_{w_{t+k}}[v^k_{t+1}]\Big)\Big] = \mathbb{E}_{w_{t+k}}\big[z_{k,t}^\top H z_{k,t}\big],$$
where
$$z_{k,t} = F^{\top k} P\big(w_{t+k} - \mu_{t+k|t+k-1}\big) + \sum_{i=k+1}^{\infty} F^{\top i} P\big(\mu_{t+i|t+k} - \mu_{t+i|t+k-1}\big).$$
Note that $z_{k,t} = F^\top z_{k-1,t+1} = F^{\top k} z_{0,t+k}$. Thus,
$$STO_k - STO_{k+1} = \lim_{T\to\infty}\frac{1}{T}\mathbb{E}\big[q^k_0 - q^{k+1}_0\big] = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[z_{k,t}^\top H z_{k,t}\big] = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[z_{0,t+k}^\top F^k H F^{\top k} z_{0,t+k}\big] = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{Tr}\big\{F^k H F^{\top k}\,\mathbb{E}\big[z_{0,t+k} z_{0,t+k}^\top\big]\big\} \leq \|F^k\|^2\|H\|\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{Tr}\,\mathbb{E}\big[z_{0,t+k} z_{0,t+k}^\top\big],$$
where in the last line we use the fact that if $A$ is symmetric, then $\mathrm{Tr}\{AB\} \leq \lambda_{\max}(A)\,\mathrm{Tr}\{B\}$ for $B \succeq 0$. Finally, we just need to show that the last item, $\mathrm{Tr}\,\mathbb{E}[z_{0,t} z_{0,t}^\top]$, is uniformly bounded for all $t$. This is straightforward because the cross-correlation of each disturbance pair is uniformly bounded, i.e., there exists $m > 0$ such that for all $t, t' \geq 0$, $\mathbb{E}[w_t^\top w_{t'}] \leq m$.
Using the convention $\mu_{t|t} = w_t$,
$$\mathrm{Tr}\,\mathbb{E}\big[z_{0,t} z_{0,t}^\top\big] = \sum_{i,j=0}^{\infty}\mathrm{Tr}\,\mathbb{E}\Big[P F^i F^{\top j} P\big(\mu_{t+j|t} - \mu_{t+j|t-1}\big)\big(\mu_{t+i|t} - \mu_{t+i|t-1}\big)^\top\Big] = \sum_{i,j=0}^{\infty}\mathrm{Tr}\Big\{P F^i F^{\top j} P\,\mathbb{E}\big[\mu_{t+j|t}\mu_{t+i|t}^\top - \mu_{t+j|t-1}\mu_{t+i|t-1}^\top\big]\Big\} \leq \sum_{i,j=0}^{\infty}\|F^i\|\|F^j\|\|P\|^2 \cdot 2m \leq \sum_{i,j=0}^{\infty} c\,\lambda^{i/2}\lambda^{j/2}\|P\|^2 \cdot 2m = \frac{2c}{(1-\sqrt{\lambda})^2}\|P\|^2 m$$
for some constant $c$ from Gelfand's formula, where the factor $2m$ comes from the uniform cross-correlation bound. Thus, $\mathrm{Tr}\,\mathbb{E}[z_{0,t} z_{0,t}^\top]$ is bounded by a constant independent of $t$, and therefore $STO_k - STO_{k+1} = O(\|F^k\|^2)$. ∎

B.2 Proof of Theorem 4.4
$MPCS_k - MPCS_{k+1} = O(\|F^k\|^2) = O(\lambda^k)$. Moreover, in Example 4.3, $MPCS_k - MPCS_{k+1} = \Theta(\|F^k\|^2)$.

Proof. To recursively calculate the value of $J^{MPC_k}$, we define:
$$V^{MPC_k}_t(x_t; w_{t:\infty}) = \sum_{i=t}^{T-1}\big(x_i^\top Q x_i + u_i^\top R u_i\big) + x_T^\top Q_f x_T = x_t^\top Q x_t + u_t^\top R u_t + V^{MPC_k}_{t+1}(Ax_t + Bu_t + w_t; w_{t+1:\infty})$$
as the cost-to-go function with MPC as the policy, i.e., $u_t$ is the control at time step $t$ from the MPC policy with $k$ predictions. Similar to the previous proofs, we assume $V^{MPC_k}_t(x) = x^\top P_t x + x^\top v_t + q_t$ (which turns out to be correct by induction) and $T - t \to \infty$ so that $P_t = P$. Then,
$$V^{MPC_k}_t(x_t; \cdot) = x_t^\top Q x_t + u_t^\top R u_t + (Ax_t + Bu_t + w_t)^\top P (Ax_t + Bu_t + w_t) + (Ax_t + Bu_t + w_t)^\top v_{t+1} + q_{t+1}$$
$$= u_t^\top (R + B^\top P B) u_t + 2u_t^\top B^\top\big(P A x_t + P w_t + v_{t+1}/2\big) + x_t^\top Q x_t + (Ax_t + w_t)^\top P (Ax_t + w_t) + (Ax_t + w_t)^\top v_{t+1} + q_{t+1}. \tag{14}$$
Let $F = A - HPA$. Plugging in the formula for $u_t$ from Theorem 3.3, we have
$$V^{MPC_k}_t(x_t; \cdot) = \Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big)^\top H\Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big) - \Big(P A x_t + P w_t + \frac{1}{2}v_{t+1}\Big)^\top H\Big(P A x_t + P w_t + \frac{1}{2}v_{t+1}\Big) + x_t^\top Q x_t + (Ax_t + w_t)^\top P (Ax_t + w_t) + (Ax_t + w_t)^\top v_{t+1} + q_{t+1}$$
$$= x_t^\top\big(Q + A^\top P A - A^\top P H P A\big)x_t + x_t^\top\big(F^\top v_{t+1} + 2F^\top P w_t\big) + \Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big)^\top H\Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big) - \Big(P w_t + \frac{1}{2}v_{t+1}\Big)^\top H\Big(P w_t + \frac{1}{2}v_{t+1}\Big) + w_t^\top P w_t + w_t^\top v_{t+1} + q_{t+1} = x_t^\top P x_t + x_t^\top v_t + q_t.$$
Thus,
$$v_t = F^\top v_{t+1} + 2F^\top P w_t = 2\sum_{i=0}^{\infty} F^{\top(i+1)} P w_{t+i}.$$
Then, we can plug $v_{t+1}$ into $q_t$:
$$q_t = q_{t+1} + \Big(\sum_{i=k}^{\infty}F^{\top i}P w_{t+i}\Big)^\top H\Big(\sum_{i=k}^{\infty}F^{\top i}P w_{t+i}\Big) - \Big(\sum_{i=0}^{\infty}F^{\top i}P w_{t+i}\Big)^\top H\Big(\sum_{i=0}^{\infty}F^{\top i}P w_{t+i}\Big) + w_t^\top P w_t + 2w_t^\top\Big(\sum_{i=1}^{\infty}F^{\top i}P w_{t+i}\Big). \tag{15}$$
Note that Equation (15) is for MPC with $k$ predictions. With the disturbance sequence $\{w_t\}$ fixed, we can compare the per-step cost of MPC with $k$ predictions and that with $k+1$ predictions:
$$q^k_t - q^{k+1}_t = q^k_{t+1} - q^{k+1}_{t+1} + \Big(\sum_{i=k}^{\infty}F^{\top i}P w_{t+i}\Big)^\top H\Big(\sum_{i=k}^{\infty}F^{\top i}P w_{t+i}\Big) - \Big(\sum_{i=k+1}^{\infty}F^{\top i}P w_{t+i}\Big)^\top H\Big(\sum_{i=k+1}^{\infty}F^{\top i}P w_{t+i}\Big) = q^k_{t+1} - q^{k+1}_{t+1} + w_{t+k}^\top P F^k H F^{\top k}\Big(P w_{t+k} + 2\sum_{i=1}^{\infty}F^{\top i}P w_{t+i+k}\Big). \tag{16}$$
Thus,
$$\mathbb{E}\big[q^k_t - q^{k+1}_t - (q^k_{t+1} - q^{k+1}_{t+1})\big] = \mathbb{E}\Big[w_{t+k}^\top P F^k H F^{\top k}\Big(P w_{t+k} + 2\sum_{i=1}^{\infty}F^{\top i}P w_{t+i+k}\Big)\Big] = \mathrm{Tr}\Big\{P F^k H F^{\top k}\Big(P\,\mathbb{E}\big[w_{t+k} w_{t+k}^\top\big] + 2\sum_{i=1}^{\infty}F^{\top i}P\,\mathbb{E}\big[w_{t+i+k} w_{t+k}^\top\big]\Big)\Big\} = \mathrm{Tr}\big\{P F^k H F^{\top k} Z_{k,t}\big\},$$
where $Z_{k,t} = P\,\mathbb{E}[w_{t+k} w_{t+k}^\top] + 2\sum_{i=1}^{\infty}F^{\top i}P\,\mathbb{E}[w_{t+i+k} w_{t+k}^\top]$. Note that $Z_{k,t} = Z_{k-1,t+1}$. Hence,
$$MPCS_k - MPCS_{k+1} = \lim_{T\to\infty}\frac{1}{T}\mathbb{E}\big[q^k_0 - q^{k+1}_0\big] = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{Tr}\big\{P F^k H F^{\top k} Z_{k,t}\big\} \leq \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\|P\|\|H\|\|F^k\|^2\,\mathrm{Tr}\{Z_{k,t}\},$$
where in the last line we use the fact that if $A$ is symmetric, then $\mathrm{Tr}\{AB\} \leq \|A\|\,\mathrm{Tr}\{B\}$ for $B \succeq 0$. Similarly to the last part of the proof of Theorem 4.1, we now just need to show that $\mathrm{Tr}\{Z_{k,t}\}$ is uniformly bounded for all $t$. Again, this is because the cross-correlation of each disturbance pair is uniformly bounded:
$$\mathrm{Tr}\{Z_{k,t}\} \leq \|P\|\,\mathrm{Tr}\,\mathbb{E}\big[w_{t+k} w_{t+k}^\top\big] + 2\sum_{i=1}^{\infty}\|P\|\|F^i\|\,\mathbb{E}\Big[\sum_j \sigma_j\big(w_{t+i+k} w_{t+k}^\top\big)\Big] \leq \|P\|m + 2\sum_{i=1}^{\infty} c\,\lambda^{i/2}\|P\|m = \|P\|m + \frac{2c\sqrt{\lambda}}{1-\sqrt{\lambda}}\|P\|m,$$
where $c$ is some constant, and in the first line we use the fact that $\mathrm{Tr}\{AB\} \leq \|A\|\sum_j \sigma_j(B)$, with $\sigma_j(\cdot)$ denoting the $j$-th singular value. Thus, $\mathrm{Tr}\{Z_{k,t}\}$ is uniformly bounded. Therefore, $MPCS_k - MPCS_{k+1} = O(\|F^k\|^2)$. ∎

B.3 Proof of Theorem 4.6
$Reg^S(MPC_k) = MPCS^T_k - STO^T_T = O(\|F^k\|^2 T + 1) = O(\lambda^k T + 1)$, where the second term results from the difference between finite and infinite horizons.

Proof. To calculate the dynamic regret, we cannot simply let $T - t \to \infty$ as we did before Equation (14) in the proof of Theorem 4.4; instead, we need to handle the expressions in a more delicate manner. In particular, we need to rigorously analyze the impact of the finite horizon. Let $\Delta_t = P_t - P$. Then
$$V^{MPC_k}_t(x_t; \cdot) = u_t^\top (R + B^\top P_{t+1} B) u_t + 2u_t^\top B^\top\big(P_{t+1} A x_t + P_{t+1} w_t + v_{t+1}/2\big) + x_t^\top Q x_t + (Ax_t + w_t)^\top P_{t+1}(Ax_t + w_t) + (Ax_t + w_t)^\top v_{t+1} + q_{t+1}$$
$$= u_t^\top (R + B^\top P B) u_t + 2u_t^\top B^\top\big(P A x_t + P w_t + v_{t+1}/2\big) + x_t^\top Q x_t + (Ax_t + w_t)^\top P (Ax_t + w_t) + (Ax_t + w_t)^\top v_{t+1} + q_{t+1} + u_t^\top B^\top \Delta_{t+1} B u_t + 2u_t^\top B^\top \Delta_{t+1}(Ax_t + w_t) + (Ax_t + w_t)^\top \Delta_{t+1}(Ax_t + w_t).$$
Plugging in the MPC policy as in Theorem 3.3 (whose closed-loop next state is $F x_t + w_t - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+i}$), we have:
$$V^{MPC_k}_t(x_t; \cdot) = x_t^\top\big(Q + A^\top P A - A^\top P H P A + F^\top \Delta_{t+1} F\big)x_t + x_t^\top\Big(F^\top v_{t+1} + 2F^\top P w_t + 2F^\top \Delta_{t+1}\Big(w_t - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+i}\Big)\Big) + \Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big)^\top H\Big(\frac{1}{2}v_{t+1} - \sum_{i=1}^{k-1}F^{\top i}P w_{t+i}\Big) - \Big(P w_t + \frac{1}{2}v_{t+1}\Big)^\top H\Big(P w_t + \frac{1}{2}v_{t+1}\Big) + w_t^\top P w_t + w_t^\top v_{t+1} + q_{t+1} + \Big(w_t - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+i}\Big)^\top \Delta_{t+1}\Big(w_t - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+i}\Big).$$
Comparing this with the induction hypothesis $V^{MPC_k}_t = x_t^\top (P + \Delta_t) x_t + x_t^\top v_t + q_t$, we obtain the recursive formulae for $\Delta_t$, $v_t$, $q_t$. First,
$$\Delta_t = F^\top \Delta_{t+1} F = F^{\top(T-t)} \Delta_T F^{T-t} = F^{\top(T-t)}(Q_f - P) F^{T-t},$$
which implies that $P_t$ converges to $P$ exponentially fast, i.e., $\|\Delta_t\| = O(\|F^{T-t}\|^2) = O(\lambda^{T-t})$. Next,
$$v_t = F^\top v_{t+1} + 2F^\top P w_t + 2F^\top \Delta_{t+1}\Big(w_t - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+i}\Big) = 2\sum_{i=0}^{T-t-1}F^{\top(i+1)}P w_{t+i} + 2\sum_{j=0}^{T-t-1}F^{\top(j+1)}\Delta_{t+j+1}\Big(w_{t+j} - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+j+i}\Big).$$
Denote by $d_t$ the second sum (without the factor 2), so that $v_t = 2\sum_{i=0}^{T-t-1}F^{\top(i+1)}P w_{t+i} + 2d_t$.
We have
$$d_t = \sum_{j=0}^{T-t-1}F^{\top(j+1)}\Delta_{t+j+1}\Big(w_{t+j} - H\sum_{i=0}^{k-1}F^{\top i}P w_{t+j+i}\Big) = \sum_{j=0}^{T-t-1}O\big(\lambda^{j/2}\lambda^{T-t-j-1}\big) = O\big(\lambda^{(T-t)/2}\big),$$
$$d^k_t - d^{k+1}_t = \sum_{j=0}^{T-t-k-1}F^{\top(j+1)}\Delta_{t+j+1} H F^{\top k} P w_{t+j+k} = \sum_{j=0}^{T-t-k-1}O\big(\lambda^{j/2}\lambda^{T-t-j-1}\|F^k\|\big) = O\big(\lambda^{(T-t+k)/2}\|F^k\|\big). \tag{17}$$
Finally, we have a formula for $q_t$ that parallels Equation (15):
$$q_t = q_{t+1} + \Big(d_{t+1} + \sum_{i=k}^{T-t-1}F^{\top i}P w_{t+i}\Big)^\top H\Big(d_{t+1} + \sum_{i=k}^{T-t-1}F^{\top i}P w_{t+i}\Big) - \Big(d_{t+1} + \sum_{i=0}^{T-t-1}F^{\top i}P w_{t+i}\Big)^\top H\Big(d_{t+1} + \sum_{i=0}^{T-t-1}F^{\top i}P w_{t+i}\Big) + w_t^\top P w_t + 2w_t^\top\Big(d_{t+1} + \sum_{i=1}^{T-t-1}F^{\top i}P w_{t+i}\Big).$$
Taking the difference between $k$ and $k+1$ predictions, we have
$$q^k_t - q^{k+1}_t - (q^k_{t+1} - q^{k+1}_{t+1}) = \Big(w_{t+k}^\top P F^k + (d^k_{t+1} - d^{k+1}_{t+1})^\top\Big)H\Big(d^k_{t+1} + d^{k+1}_{t+1} + F^{\top k}P w_{t+k} + 2\sum_{i=1}^{T-t-k-1}F^{\top(i+k)}P w_{t+i+k}\Big) \tag{18}$$
$$= \Big(w_{t+k}^\top P F^k + O\big(\lambda^{(T-t)/2}\|F^k\|\big)\Big)H\Big(O\big(\lambda^{(T-t)/2}\big) + F^{\top k}P w_{t+k} + 2\sum_{i=1}^{T-t-k-1}F^{\top(i+k)}P w_{t+i+k}\Big),$$
and thus $\mathbb{E}\big[q^k_t - q^{k+1}_t - (q^k_{t+1} - q^{k+1}_{t+1})\big] = O\big(\|F^k\|(\lambda^{(T-t)/2} + \|F^k\|)\big)$. Telescoping over $t$ and then over prediction horizons from $k$ to $T-1$ (the sums are geometric in the horizon),
$$\mathbb{E}\big[q^k_0 - q^T_0\big] = O\Big(\sum_{t=0}^{T-1}\|F^k\|\big(\lambda^{(T-t)/2} + \|F^k\|\big)\Big) = O\big(\|F^k\|^2 T + \|F^k\|\big),$$
$$\mathbb{E}\big[v^k_0 - v^T_0\big] = 2\,\mathbb{E}\big[d^k_0 - d^T_0\big] = O\big(\lambda^{(T+k)/2}\|F^k\|\big),$$
$$\mathbb{E}J^{MPC_k} - \mathbb{E}J^{MPC_T} = \mathbb{E}\big[V^k_0(x_0) - V^T_0(x_0)\big] = \mathbb{E}\big[x_0^\top(v^k_0 - v^T_0) + (q^k_0 - q^T_0)\big] = O\big(\|F^k\|^2 T + \|F^k\|\big). \tag{19}$$
By definition, $J^{MPC_T}$ is the cost of the MPC policy given all future disturbances before making any decisions. It almost equals $\min_u J$, the cost of the optimal policy given all future disturbances, except that during optimization MPC assumes the final-step cost to be $x_T^\top P x_T$ instead of $x_T^\top Q_f x_T$. This incurs at most constant extra cost, i.e.,
$$J^{MPC_T} - \min_u J = O(\|P - Q_f\|) = O(1). \tag{20}$$
By Equations (19) and (20),
$$Reg^S(MPC_k) = \mathbb{E}J^{MPC_k} - \mathbb{E}\min_u J = O\big(\|F^k\|^2 T + \|F^k\| + 1\big) = O\big(\|F^k\|^2 T + 1\big). \qquad \blacksquare$$

B.4 Proof of Theorem 4.7
B.4 Proof of Theorem 4.7

The optimal dynamic regret satisfies
\[
\mathrm{Reg}^S_{k*} = \mathrm{STO}^T_k - \mathrm{STO}^T_T = O(\|F^k\|^2 T + 1) = O(\lambda^{2k} T + 1),
\]
and there exist $A$, $B$, $Q$, $R$, $Q_f$, $x_0$, and $W$ such that $\mathrm{Reg}^S_{k*} = \Theta(\|F^k\|^2 (T - k))$.

Proof. The first part follows from Theorem 4.6 and the fact that
$\mathrm{Reg}^S_{k*} \le \mathrm{Reg}^S(\mathrm{MPC}_k)$. The second part is shown by Example 4.3: suppose $n = d = 1$ and the disturbances are i.i.d. and zero-mean. Additionally, let $Q_f = P$ and $x_0 = 0$. In this case, MPC has not only the same policy but also the same cost as the optimal control policy, and $P_t = P$ for all $t$. To calculate the total cost, we follow the approach used in the proof of Theorem 3.2. Since $T$ is now finite, we have a form of $v_t$ that is similar to, but different from, Equation (8):
\[
v_t = 2 \sum_{i=0}^{\min\{k-1,\, T-t-1\}} (F^\top)^{i+1} P w_{t+i},
\]
\[
\mathbb{E}[q_t] = \mathrm{Tr}\Big[\Big(P - \sum_{i=0}^{\min\{k-1,\, T-t-1\}} P F^i H (F^\top)^i P\Big) W\Big] + \mathbb{E}[q_{t+1}],
\]
\[
\mathbb{E}[q_0] = \mathrm{Tr}\Big[\sum_{t=0}^{T-1} \Big(P - \sum_{i=0}^{\min\{k-1,\, T-t-1\}} P F^i H (F^\top)^i P\Big) W\Big].
\]
Let $q^k_t$ denote $q_t$ in the scenario of $k$ predictions. Then
\[
\mathrm{Reg}^S_{k*} = \mathbb{E}\big[q^k_0 - q^T_0\big] = \mathrm{Tr}\Big[\sum_{t=0}^{T-k-1} \sum_{i=k}^{T-t-1} P F^i H (F^\top)^i P\, W\Big] \ge (T - k)\, \mathrm{Tr}\big[P F^k H (F^\top)^k P\, W\big] = \Omega\big(\|F^k\|^2 (T - k)\big),
\]
where the inequality keeps only the $i = k$ term of each inner sum (every term is nonnegative). On the other hand,
\[
\mathrm{Reg}^S_{k*} = \mathbb{E}\big[q^k_0 - q^T_0\big] \le (T - k)\, \mathrm{Tr}\Big[\sum_{i=k}^{\infty} P F^i H (F^\top)^i P\, W\Big] = O\big(\|F^k\|^2 (T - k)\big).
\]
Therefore,
$\mathrm{Reg}^S_{k*} = \Theta\big(\|F^k\|^2 (T - k)\big)$.
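In the scalar example the double sum above can be evaluated exactly, which makes the $\Theta(\|F^k\|^2 (T - k))$ behavior easy to confirm numerically. A minimal sketch (ours; $W = 1$ is an arbitrary illustrative choice):

```python
import numpy as np

# Scalar instance from the proof: A = B = Q = R = 1, disturbance variance W.
W = 1.0
P = (1 + np.sqrt(5)) / 2
H = 1 / (1 + P)                  # H = B (R + B P B)^{-1} B
F = 1 - H * P                    # closed-loop coefficient, |F| < 1

def regret(k, T):
    # Reg = sum_{t=0}^{T-k-1} sum_{i=k}^{T-t-1} P F^i H F^i P W
    return sum(P * F**i * H * F**i * P * W
               for t in range(T - k) for i in range(k, T - t))

T = 500
for k in [1, 2, 4, 8]:
    print(k, regret(k, T), F**(2 * k) * (T - k))   # same order in k and T
```

The ratio of the two printed columns stays approximately constant, i.e., the regret is $\Theta(F^{2k} (T - k))$ in this instance.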
C Proofs of Section 5

C.1 Proof of Theorem 5.1
For $k \ge 1$,
\[
\mathrm{ADV}_k - \mathrm{ADV}_{k+1} = O(\|F^k\|^2) = O(\lambda^{2k}).
\]

Proof. This proof is based on Theorem 5.3: it turns out that the behavior of the MPC policy and its cost are easier to analyze than those of the optimal policy, especially in the adversarial setting. We have
\[
\mathrm{ADV}_k - \mathrm{ADV}_{k+1} \le \mathrm{ADV}_k - \mathrm{ADV}_\infty \le \mathrm{MPCA}_k - \mathrm{ADV}_\infty = \sum_{i=k}^{\infty} \big(\mathrm{MPCA}_i - \mathrm{MPCA}_{i+1}\big).
\]
By Theorem 5.3,
\[
\mathrm{MPCA}_i - \mathrm{MPCA}_{i+1} \le O\big(\|F^i\|^2\big) \le O\big(\|F^k\|^2 \|F^{i-k}\|^2\big) \le O\big(\|F^k\|^2 \lambda^{2(i-k)}\big).
\]
Thus,
\[
\mathrm{ADV}_k - \mathrm{ADV}_{k+1} \le O\Big(\|F^k\|^2 \sum_{i=k}^{\infty} \lambda^{2(i-k)}\Big) = O(\|F^k\|^2).
\]
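For concreteness, the geometric series in the final step can be written out. Assuming $\|F^t\| \le c \lambda^t$ with $\lambda < 1$ (the stability property used throughout), we have
\[
\|F^k\|^2 \sum_{i=k}^{\infty} \lambda^{2(i-k)} = \frac{\|F^k\|^2}{1 - \lambda^2} \le \frac{c^2}{1 - \lambda^2}\, \lambda^{2k},
\]
which is the claimed $O(\|F^k\|^2) = O(\lambda^{2k})$ bound with an explicit constant.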
C.2 Proof of Example 5.2

Let $A = B = Q = R = 1$ and $\Omega = [-1, 1]$. In this case, one prediction is enough to leverage the full power of predictions: formally, $\mathrm{ADV}_1 = \mathrm{ADV}_\infty = 1$, so $\mathrm{ADV}_k = 1$ for all $k \ge 1$. The optimal control policy (as $T \to \infty$) is the piecewise function
\[
u^*(x, w) =
\begin{cases}
-(x + w), & -1 \le x + w \le 1, \\
-(x + w) + \frac{3 - \sqrt{5}}{2} (x + w - 1), & x + w > 1, \\
-(x + w) + \frac{3 - \sqrt{5}}{2} (x + w + 1), & x + w < -1.
\end{cases}
\]
The proof leverages two different cost-to-go functions, one for the min player and one for the sup player.

Proof. We will show
$\mathrm{ADV}_1 = 1$ and $\mathrm{ADV}_\infty = 1$ separately. The system dynamics are given by $x_{t+1} = x_t + u_t + w_t$ with $w_t \in [-1, 1]$ and, without loss of generality (the initial state contributes only the constant $x_0^2$), $x_0 = 0$, so that
\[
\mathrm{ADV}^T_1 = \max_{w_0} \min_{u_0} \cdots \max_{w_{T-1}} \min_{u_{T-1}} \Big[\sum_{t=0}^{T-1} (x_t^2 + u_t^2) + x_T^2\Big] = \max_{w_0} \min_{u_0} \cdots \max_{w_{T-1}} \min_{u_{T-1}} \sum_{t=0}^{T-1} (u_t^2 + x_{t+1}^2).
\]
We calculate the result of each min and max by dynamic programming. In particular, we define two cost-to-go functions, one for the min player and one for the max player. Let $z_t = x_t + w_t$; then $z_t$ can be regarded as the disturbed state. This is natural since the controller has one prediction and decides $u_t$ after knowing $w_t$; thus, the dynamics split into two stages, $z_t = x_t + w_t$ and $x_{t+1} = z_t + u_t$. Let
\[
f_t(z_t) = \min_{u_t} \max_{w_{t+1}} \min_{u_{t+1}} \cdots \max_{w_{T-1}} \min_{u_{T-1}} \sum_{i=t}^{T-1} (u_i^2 + x_{i+1}^2) = \min_{u_t} \big(u_t^2 + (z_t + u_t)^2 + g_{t+1}(z_t + u_t)\big),
\]
\[
g_t(x_t) = \max_{w_t} \min_{u_t} \cdots \max_{w_{T-1}} \min_{u_{T-1}} \sum_{i=t}^{T-1} (u_i^2 + x_{i+1}^2) = \max_{w_t} f_t(x_t + w_t).
\]
For $t = T - 1$ (with $g_T \equiv 0$), we have
\[
f_{T-1}(z) = \min_u \big(u^2 + (z + u)^2\big) = \tfrac{1}{2} z^2, \qquad g_{T-1}(x) = \max_w \tfrac{1}{2} (x + w)^2 = \tfrac{1}{2} (|x| + 1)^2.
\]
We prove by backward induction that $g_t(x) = a_t x^2 + 2 b_t |x| + c_t$, where $a_t, b_t, c_t$ are coefficients with $0 < b_t < 1$. Assuming this holds at $t$, we show it holds at $t - 1$:
\[
f_{t-1}(z) = \min_u \big(u^2 + (z + u)^2 + g_t(z + u)\big) = \min_y \big((y - z)^2 + y^2 + a_t y^2 + 2 b_t |y| + c_t\big) = \min_y \big((a_t + 2) y^2 - 2 (z - b_t \operatorname{sign}(y)) y + z^2 + c_t\big),
\]
where $y = z + u = x + w + u$ is the state after the control is applied; the function $y(z)$ maps the disturbed old state to the new state. The optimal $y$ is given by
\[
y^*(z) = \arg\min_y \big((a_t + 2) y^2 - 2 (z - b_t \operatorname{sign}(y)) y + z^2 + c_t\big) =
\begin{cases}
0, & -b_t \le z \le b_t, \\
\dfrac{z - b_t \operatorname{sign}(z)}{a_t + 2}, & \text{otherwise}.
\end{cases} \tag{21}
\]
Thus, for $z < -b_t$ or $z > b_t$, we have
\[
f_{t-1}(z) = -\frac{(z - b_t \operatorname{sign}(z))^2}{a_t + 2} + z^2 + c_t = -\frac{z^2 - 2 b_t |z| + b_t^2}{a_t + 2} + z^2 + c_t = \frac{a_t + 1}{a_t + 2} z^2 + \frac{2 b_t}{a_t + 2} |z| + c_t - \frac{b_t^2}{a_t + 2}.
\]
For $z \in [-b_t, b_t]$, the value of $f_{t-1}(z)$ is not needed in the calculation of $g_{t-1}(x)$, because $0 < b_t < 1$ (induction hypothesis) and the adversary, who wants to maximize the convex, even function $f_{t-1}(z_{t-1})$, will never choose $w_{t-1}$ such that $z_{t-1} = x_{t-1} + w_{t-1} \in (-1, 1)$, since $w_{t-1}$ can be chosen from $[-1, 1]$. Hence
\[
\begin{aligned}
g_{t-1}(x) = \max_w f_{t-1}(x + w) = f_{t-1}(x + \operatorname{sign}(x)) &= \frac{a_t + 1}{a_t + 2} (x^2 + 2 |x| + 1) + \frac{2 b_t}{a_t + 2} (|x| + 1) + c_t - \frac{b_t^2}{a_t + 2} \\
&= \frac{a_t + 1}{a_t + 2} x^2 + \frac{2 (a_t + b_t + 1)}{a_t + 2} |x| + c_t + \frac{a_t + 1 + 2 b_t - b_t^2}{a_t + 2} \\
&= a_{t-1} x^2 + 2 b_{t-1} |x| + c_{t-1}.
\end{aligned}
\]
We have thus obtained the recursive formulae
\[
a_{t-1} = \frac{a_t + 1}{a_t + 2}, \qquad b_{t-1} = \frac{a_t + b_t + 1}{a_t + 2}, \qquad c_{t-1} = c_t + \frac{a_t + 1 + 2 b_t - b_t^2}{a_t + 2},
\]
with initial values $a_{T-1} = b_{T-1} = c_{T-1} = 1/2$.

Let $f_i$ be the $i$-th Fibonacci number with $f_1 = 0$, $f_2 = 1$. Then $a_{T-i} = f_{2i+1} / f_{2i+2}$, so $a_{T-i} \to \frac{\sqrt{5} - 1}{2}$ as $i \to \infty$.

For $b_t$, we have $1 - b_{T-(i+1)} = (1 - b_{T-i}) / (a_{T-i} + 2)$. When $i$ is large, $1 - b_{T-i}$ approaches $0$ but is always positive; thus, $b_{T-i}$ approaches $1$ but is always less than $1$.

For $c_t$, we have
\[
c_{T-(i+1)} = c_{T-i} + 1 - \frac{(1 - b_{T-i})^2}{a_{T-i} + 2},
\]
and thus $c_{T-(i+1)} - c_{T-i} \to 1$. Therefore, $\mathrm{ADV}_1 = \lim_{T \to \infty} \mathrm{ADV}^T_1 / T = \lim_{T \to \infty} c_0 / T = 1$.

The optimal control policy is obtained by plugging the limiting values $a_t \to \frac{\sqrt{5} - 1}{2}$ and $b_t \to 1$ back into Equation (21), using $1 / (a_t + 2) \to \frac{2}{3 + \sqrt{5}} = \frac{3 - \sqrt{5}}{2}$:
\[
u^*(x, w) = -(x + w) + y^*(x + w) = -(x + w) +
\begin{cases}
0, & -1 \le x + w \le 1, \\
\frac{3 - \sqrt{5}}{2} \big(x + w - \operatorname{sign}(x + w)\big), & \text{otherwise}.
\end{cases}
\]
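The three scalar recursions above are easy to iterate numerically; the following minimal sketch (ours, not from the paper) confirms $a_t \to (\sqrt{5} - 1)/2 \approx 0.618$, $b_t \to 1$, and a per-step increment of $c$ approaching $1$:

```python
import numpy as np

# Backward recursions from the proof (t runs from T-1 downward):
#   a' = (a+1)/(a+2),  b' = (a+b+1)/(a+2),  c' = c + 1 - (1-b)^2/(a+2)
a = b = c = 0.5                      # values at t = T-1
for i in range(2, 40):               # i indexes t = T-i
    c_old = c
    # tuple assignment evaluates the right-hand side with the old (a, b, c)
    a, b, c = ((a + 1) / (a + 2),
               (a + b + 1) / (a + 2),
               c + 1 - (1 - b)**2 / (a + 2))
    print(i, a, b, c - c_old)

print("a limit:", (np.sqrt(5) - 1) / 2)  # golden-ratio conjugate
```

Since the increment $c_{T-(i+1)} - c_{T-i}$ tends to $1$, the time-averaged game value tends to $1$, which is exactly the statement $\mathrm{ADV}_1 = 1$.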
For $\mathrm{ADV}_\infty$, we will show that $\mathrm{STO}_\infty = 1$ for a specific disturbance sequence: $w_t = 1$ for all $t$. Because $\mathrm{STO}_\infty \le \mathrm{ADV}_\infty \le \mathrm{ADV}_1 = 1$, this yields $\mathrm{ADV}_\infty = 1$. According to Equations (8) and (9) with $k \to \infty$,
\[
\mathrm{STO}_\infty = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \big(2 w_t \psi_t - P w_t^2 - H \psi_t^2\big), \qquad \text{where } \psi_t = \sum_{i=0}^{\infty} F^i P w_{t+i}.
\]
Solving the Riccati equation gives $P = \frac{1 + \sqrt{5}}{2}$ and $H = F = \frac{3 - \sqrt{5}}{2}$. When $w_t = 1$ for all $t$, we get $\psi_t = P / (1 - F) = \frac{3 + \sqrt{5}}{2}$, every summand equals $1$, and hence $\mathrm{STO}_\infty = 1$.
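These golden-ratio identities can be checked in a few lines; a quick numerical sketch (ours):

```python
import numpy as np

P = (1 + np.sqrt(5)) / 2          # scalar DARE solution for A = B = Q = R = 1
H = 1 / (1 + P)                   # = (3 - sqrt(5))/2
F = 1 - H * P                     # closed-loop coefficient; equals H here
psi = P / (1 - F)                 # psi_t for the constant sequence w_t = 1

print(P**2 - P - 1)               # ~0: P satisfies P^2 = P + 1
print(H - F)                      # ~0: H and F coincide in this example
print(2 * psi - P - H * psi**2)   # ~1: the per-step summand, so STO_inf = 1
```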
C.3 Proof of Theorem 5.3

\[
\mathrm{MPCA}_k - \mathrm{MPCA}_{k+1} = O(\|F^k\|^2) = O(\lambda^{2k}).
\]

Proof. Note that Equation (16) in the proof of Theorem 4.4 does not rely on the type of disturbance, i.e., Equation (16) holds for adversarial disturbances as well. Let $r = \sup_{w \in \Omega} \|w\|$. Then
\[
\begin{aligned}
q^k_t - q^{k+1}_t - \big(q^k_{t+1} - q^{k+1}_{t+1}\big) &= w_{t+k}^\top P F^k H (F^\top)^k \Big(P w_{t+k} + 2 \sum_{i=1}^{\infty} (F^\top)^i P w_{t+i+k}\Big) \\
&\le \|w_{t+k}\| \|P\| \|H\| \|F^k\|^2 \Big(\|P\| \|w_{t+k}\| + 2 \sum_{i=1}^{\infty} \|F^i\| \|P\| \|w_{t+i+k}\|\Big) \\
&\le \|F^k\|^2 \Big(1 + 2 \sum_{i=1}^{\infty} \|F^i\|\Big) \|H\| \|P\|^2 r^2 \le \|F^k\|^2 \Big(1 + \frac{2 c \lambda}{1 - \lambda}\Big) \|H\| \|P\|^2 r^2
\end{aligned}
\]
for some constant $c$. Hence
\[
\begin{aligned}
\mathrm{MPCA}_k - \mathrm{MPCA}_{k+1} &= \lim_{T \to \infty} \frac{1}{T} \Big(\max_w q^k_0 - \max_w q^{k+1}_0\Big) \le \lim_{T \to \infty} \frac{1}{T} \max_w \big(q^k_0 - q^{k+1}_0\big) \\
&\le \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \max_w \big(q^k_t - q^{k+1}_t - (q^k_{t+1} - q^{k+1}_{t+1})\big) \le \|F^k\|^2 \Big(1 + \frac{2 c \lambda}{1 - \lambda}\Big) \|H\| \|P\|^2 r^2 = O(\|F^k\|^2).
\end{aligned}
\]
C.4 Proof of Theorem 5.5

\[
\mathrm{Reg}^A(\mathrm{MPC}_k) = O(\|F^k\|^2 T + 1) = O(\lambda^{2k} T + 1).
\]

Proof. We follow the notation in the proof of Theorem 4.6. Equation (18) does not rely on the type of disturbance, so it holds for adversarial disturbances as well. By Equation (18) and the fact that $w_t$ is bounded, we have
\[
q^k_t - q^{k+1}_t - \big(q^k_{t+1} - q^{k+1}_{t+1}\big) = O\big(\|F^k\| (\lambda^{T-t} + \|F^k\|)\big),
\]
where the constant in the big-$O$ notation does not depend on the disturbance sequence $w$. Thus, arguing as in the proof of Theorem 4.6 (summing over $t$ and then over prediction horizons),
\[
\max_w \big(q^k_0 - q^T_0\big) \le \sum_{t=0}^{T-1} \max_w \big(q^k_t - q^{k+1}_t - (q^k_{t+1} - q^{k+1}_{t+1})\big) = O\big(\|F^k\|^2 T + \|F^k\|\big).
\]
By Equation (17) and the boundedness of $w_t$,
\[
\max_w \big(v^k_0 - v^T_0\big) = 2 \max_w \big(d^k_0 - d^T_0\big) = O\big(\lambda^{T+k} \|F^k\|\big).
\]
Hence
\[
\max_w \big(J^{\mathrm{MPC}_k} - J^{\mathrm{MPC}_T}\big) = \max_w \big(V^k_0(x_0) - V^T_0(x_0)\big) \le \max_w \big(x_0^\top (v^k_0 - v^T_0)\big) + \max_w \big(q^k_0 - q^T_0\big) = O\big(\|F^k\|^2 T + \|F^k\|\big).
\]
As in Equation (20), $J^{\mathrm{MPC}_T} - \min_u J = O(1)$. Thus,
\[
\mathrm{Reg}^A(\mathrm{MPC}_k) = \max_w \big(J^{\mathrm{MPC}_k} - \min_u J\big) \le \max_w \big(J^{\mathrm{MPC}_k} - J^{\mathrm{MPC}_T}\big) + \max_w \big(J^{\mathrm{MPC}_T} - \min_u J\big) = O\big(\|F^k\|^2 T + \|F^k\| + 1\big) = O\big(\|F^k\|^2 T + 1\big).
\]
C.5 Proof of Theorem 5.6

\[
\mathrm{Reg}^A_{k*} = O(\|F^k\|^2 T + 1) = O(\lambda^{2k} T + 1).
\]
Moreover, there exist $A$, $B$, $Q$, $R$, $Q_f$, $x_0$, and $\Omega$ such that $\mathrm{Reg}^A_{k*} = \Omega(\|F^k\|^2 (T - k))$.

Proof. The first part of the theorem follows from Theorem 5.5 and the fact that
$\mathrm{Reg}^A_{k*} \le \mathrm{Reg}^A(\mathrm{MPC}_k)$. We reduce the second part of the theorem to the second part of Theorem 4.7. Since the proof of Theorem 4.7 works for any fixed distribution of $w_t$ (with finite second moment), we can restrict that distribution to one with bounded support; denote this bounded support by $\Omega$. Then we have
\[
\begin{aligned}
\mathrm{Reg}^A_{k*} &= \sup_{w_0, \dots, w_{k-1}} \min_{u_0} \sup_{w_k} \cdots \min_{u_{T-k-1}} \sup_{w_{T-1}} \min_{u_{T-k}, \dots, u_{T-1}} \Big(J(u, w) - \min_{u'_0, \dots, u'_{T-1}} J(u', w)\Big) \\
&\ge \mathbb{E}_{w_0, \dots, w_{k-1}} \min_{u_0} \mathbb{E}_{w_k} \cdots \min_{u_{T-k-1}} \mathbb{E}_{w_{T-1}} \min_{u_{T-k}, \dots, u_{T-1}} \Big(J(u, w) - \min_{u'_0, \dots, u'_{T-1}} J(u', w)\Big) \\
&= \mathrm{Reg}^S_{k*} = \Theta\big(\|F^k\|^2 (T - k)\big).
\end{aligned}
\]
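The displayed inequality is simply "worst case dominates average case". As a toy illustration (ours; it uses the fixed policy $\mathrm{MPC}_k$ rather than the optimal $k$-prediction policy, and uniform sampling on $\Omega = [-1, 1]$ stands in for the bounded-support distribution), the largest sampled regret already dominates the sampled average:

```python
import numpy as np

# Scalar setup as before: A = B = Q = R = 1, Q_f = P, x_0 = 0.
A = B = Q = R = 1.0
P = (1 + np.sqrt(5)) / 2
H = 1 / (R + P)
F = A - H * P * A

def cost(k, w):
    """Cost of MPC with k predictions (k = len(w) is the clairvoyant optimum)."""
    T = len(w); x = 0.0; J = 0.0
    for t in range(T):
        m = min(k, T - t)
        u = -H * (P * A * x + sum(F**i * P * w[t + i] for i in range(m)))
        J += Q * x**2 + R * u**2
        x = A * x + B * u + w[t]
    return J + P * x**2

rng = np.random.default_rng(1)
T, k, n = 100, 2, 500
regs = [cost(k, w) - cost(T, w)
        for w in rng.uniform(-1, 1, size=(n, T))]
print(max(regs), ">=", np.mean(regs))    # sup over w dominates E over w
```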