Competitive Control with Delayed Imperfect Information
Chenkai Yu, Guanya Shi, Soon-Jo Chung, Yisong Yue, Adam Wierman
Chenkai Yu
Tsinghua University
[email protected]

Guanya Shi, Soon-Jo Chung, Yisong Yue, Adam Wierman
California Institute of Technology
{gshi,sjchung,yyue,adamw}@caltech.edu

Abstract
This paper studies the impact of imperfect information in online control with adversarial disturbances. In particular, we consider both delayed state feedback and inexact predictions of future disturbances. We introduce a greedy, myopic policy that yields a constant competitive ratio against the offline optimal policy with delayed feedback and inexact predictions. A special case of our result is a constant competitive policy for the case of exact predictions and no delay, a previously open problem. We also analyze the fundamental limits of online control with limited information by showing that our competitive ratio bounds for the greedy, myopic policy in the adversarial setting match (up to lower-order terms) lower bounds in the stochastic setting.
The design and analysis of controllers with imperfect information is a long-standing challenge for the fields of online learning and robust control. This paper studies the impact of imperfect information in the context of a disturbed linear dynamical system with feedback delay. Specifically, we consider a disturbed online Linear Quadratic Regulator (LQR) optimal control problem with state feedback delay and inexact predictions of future disturbances, governed by x_{t+1} = A x_t + B u_t + w_t, where x_t, u_t, and w_t are the state, control, and disturbance respectively.

A growing literature at the interface of learning and control has emerged in recent years with the goal of designing controllers under various criteria, such as regret (Dean et al., 2018; Agarwal et al., 2019), dynamic regret (Li et al., 2019; Yu et al., 2020), and competitive ratio (Shi et al., 2020; Goel and Wierman, 2019). However, this line of work has made little progress when it comes to using imperfect prediction information, and has not approached the challenge of delayed feedback at all. In fact, even in the case of perfect information and no delay, the question of whether there exists a constant competitive policy for general LQR systems is unresolved.

The task of designing controllers given imperfect predictions has a long history in the control literature, as well as in real-world applications (Shi et al., 2019; Lazic et al., 2018). The basic idea is that, at state x_t, one has access to imperfect estimates ŵ_{t+i} of the true future disturbances w_{t+i} (i ≥ 0). At the same time, due to feedback delay, one has only observed states up to x_{t−d} (d ≥ 0).

Preprint. Under review.

Contributions.
In this paper, we show that a simple, myopic, and practical policy that generalizes model predictive control (MPC) obtains a constant competitive ratio bound in the case of delayed imperfect information (Theorem 3). We also show that the competitive ratio bound exponentially increases (decreases) as the amount of delay (prediction) increases, which highlights the cost associated with delay and the power of predictions, even when they are inexact. To the best of our knowledge, this result represents the first constant competitive bound for general LQR control with adversarial disturbance, even in the case of exact predictions with no delay. Further, it represents the first finite-time performance bounds (regret or competitive ratio) for either the case of inexact predictions or delayed feedback. Additionally, we prove that this myopic policy is near-optimal in terms of competitive ratio, by showing that our competitive ratio bounds for the myopic policy in the adversarial setting match the lower bounds in the stochastic setting.

We would like to emphasize the generality of our result. The model we consider is the general LQR setting with bounded adversarial disturbance in the dynamics, where only stabilizability is assumed. Further, the prediction errors are assumed to be adversarial. Additionally, our results compare to the globally optimal policies without any constraints, rather than to the optimal linear or static policy.

Our result adds further evidence that the structure of LQR allows simple algorithmic ideas to be effective: Simchowitz and Foster (2020) recently proved that naive exploration is optimal in the online LQR adaptive control problem with unknown {A, B}, and Yu et al.
(2020) proved that classic MPC is near-optimal in online LQR control with exact future predictions. Combined with the current paper, there is growing evidence that simple, myopic policies that build on MPC are constant-competitive and near-optimal, even in adversarial settings with delayed imperfect information, which sheds light on key algorithmic principles and fundamental limits in continuous control.

Related work.
There is a growing literature of papers that approach the control of linear dynamical systems with tools and concepts from machine learning. Within this literature, most work focuses on the design of controllers with low regret (Dean et al., 2018; Agarwal et al., 2019; Simchowitz and Foster, 2020). While the study of competitive ratio has received some attention (Goel and Wierman, 2019; Shi et al., 2020), only special LQR systems are covered and results have been more difficult to obtain.

In these lines of work, very few papers focus on settings where the controller has access to predictions of future disturbances. The few existing results with predictions focus on settings where the predictions are exact (Yu et al., 2020). For example, Li et al. (2019) analyzed dynamic regret with predictions of future cost functions, but the predictions are exact and there is no disturbance in the dynamics. Even outside of control, in the related area of online optimization, when predictions are considered they are typically assumed to be exact (Lin et al., 2019). One exception is Chen et al. (2015), which uses a less general model of prediction error than the current paper, but the connection to control is unclear.

In contrast to the literature on predictions, there is no work studying the regret or competitive ratio of policies subject to delayed feedback. The issue has received considerable attention in the control community (Kim and Park, 1999; Bejczy et al., 1990), but the focus is typically on stability, and no finite-time performance bounds exist to this point, to the best of our knowledge.
We consider an online Linear Quadratic Regulator (LQR) optimal control problem with adversarial disturbances in the dynamics. In particular, we consider a linear system initialized with x_0 ∈ R^n and controlled by u_t ∈ R^m at each step t ∈ {0, 1, . . . , T − 1}, where T is the total length of the problem. The system dynamics is governed by:

x_{t+1} = A x_t + B u_t + w_t,

where w_t is the disturbance. We assume that w_t is bounded, i.e., ‖w_t‖ ≤ r. The goal of the controller is to minimize the following cost:

J = Σ_{t=0}^{T−1} (x_t^⊤ Q x_t + u_t^⊤ R u_t) + x_T^⊤ Q_f x_T,   (1)

given matrices A, B, Q, R, Q_f. We consider an online setting where an adversary selects {w_t}_{t=0}^{T−1} in an adaptive manner, and the controller makes the decision u_t at every time step t, potentially based on delayed imperfect information (see Section 2.1).

We make the standard assumptions that Q, Q_f ⪰ 0, R ≻ 0, and (A, B) is stabilizable (Goel and Hassibi, 2020; Yu et al., 2020), i.e., there exists K ∈ R^{m×n} such that ρ(A − BK) < 1. We further assume (A^⊤, Q) is stabilizable to guarantee the stability of the closed loop (Anderson and Moore, 2007). This assumption is more general than the standard assumption that Q ≻ 0, since Q ≻ 0 implies that (A^⊤, Q) is stabilizable.

Throughout this paper, we use ρ(·) to denote the spectral radius of a matrix and ‖·‖ to denote the 2-norm of a vector or the spectral norm of a matrix.

Note that many important problems can be seen to be special cases of the model described above. Two motivating examples are input-disturbed systems and the Linear Quadratic (LQ) tracking problem (Anderson and Moore, 2007).
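To make the setting concrete, the following minimal simulation rolls out the disturbed dynamics and evaluates the cost in Equation (1). The system matrices, horizon, disturbance model, and the do-nothing policy here are illustrative stand-ins, not values from the paper.

```python
import numpy as np

# Illustrative 2-dimensional system (stand-in values, not from the paper).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
Q_f = np.eye(2)
T, r = 50, 0.5
rng = np.random.default_rng(0)

def rollout(policy):
    """Simulate x_{t+1} = A x_t + B u_t + w_t and return the cost J of Eq. (1)."""
    x = np.zeros(2)
    cost = 0.0
    for t in range(T):
        u = policy(t, x)
        cost += x @ Q @ x + u @ R @ u
        # Per-coordinate uniform noise scaled so that ||w_t|| <= r.
        w = rng.uniform(-r / np.sqrt(2), r / np.sqrt(2), size=2)
        x = A @ x + B @ u + w
    return cost + x @ Q_f @ x  # terminal cost

# Cost of the do-nothing policy u_t = 0 under one disturbance sequence.
zero_cost = rollout(lambda t, x: np.zeros(1))
```

Any policy, including the myopic policy studied below, can be plugged into `rollout` to compare online costs under the same disturbance sequence.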
Example: Linear Quadratic (LQ) Tracking.
The LQ tracking problem is defined via dynamics x_{t+1} = A x_t + B u_t + w̃_t and cost function:

J = Σ_{t=0}^{T−1} (x_{t+1} − y_{t+1})^⊤ Q (x_{t+1} − y_{t+1}) + u_t^⊤ R u_t,

where {y_t}_{t=1}^{T} is the desired trajectory to track. To fit this into our model, let x̃_t = x_t − y_t. Then, we get J = Σ_{t=0}^{T−1} x̃_{t+1}^⊤ Q x̃_{t+1} + u_t^⊤ R u_t and x̃_{t+1} = A x̃_t + B u_t + w_t, which is an LQR control problem with disturbance w_t = w̃_t + A y_t − y_{t+1} in the dynamics. Note that in many LQ tracking problems, delayed observations and imperfect predictions are fundamental challenges (Anderson and Moore, 2007).

The LQR optimal control problem introduced above is typically studied without predictions or delays. Classically, at each time t ≥ 0, the controller observes x_t and then decides u_t without knowing w_t. In other words, the controller is given x_0 to start and then, at each time, it decides u_t before observing w_t. Thus, u_t is a function of all the previous information: u_t = π_t(x_0, u_0, u_1, . . . , u_{t−1}, w_0, w_1, . . . , w_{t−1}), or equivalently, u_t = π_t(x_0, u_0, u_1, . . . , u_{t−1}, x_1, x_2, . . . , x_t).

The motivation for the work in this paper is that in many real-world problems (Shi et al., 2019; Lazic et al., 2018), predictions of future information are available and using them is crucially important, even though they are typically noisy. For example, data-driven model-based control is a prominent and successful approach where it is crucial to consider model mismatch due to statistical learning error. Further, in many situations, there is feedback delay in the system, which means the controller must make decisions u_t before the current state x_t is observed. The addition of feedback delay leads to considerable additional difficulty.

Formally, to model the delayed inexact predictions that are common in many applications, we model the revealed information at time t as follows. At time step t, the revealed information is:

x_0, u_0, . . . , u_{t−1}, w_0, . . . , w_{t−d−1}, ŵ_{t−d|t}, . . . , ŵ_{T−1|t},

or equivalently,

x_0, u_0, . . . , u_{t−1}, x_1, . . . , x_{t−d}, ŵ_{t−d|t}, . . . , ŵ_{T−1|t},

where d ≥ 0 is the feedback delay and ŵ_{s|t} is the potentially inexact prediction of w_s available at time t.

We use e_{t−d+i|t} = w_{t−d+i} − ŵ_{t−d+i|t} (i ≥ 0) to represent the estimation error, and we assume the predictor satisfies ‖e_{t−d+i|t}‖ ≤ ε_i ‖w_{t−d+i}‖ for all i and t. Thus, ε_i measures the quality of the predictor. Note that we may assume 0 ≤ ε_i ≤ 1, since if ε_i > 1 we can simply let ŵ_{t−d+i|t} = 0, which gives ε_i = 1 (if w_{t−d+i} = 0 we define ε_i = 0). Additionally, note that, although predictions are available for every future time step, predictions far into the future may have bad quality, i.e., ε_i is typically large for large i. Therefore, a good control policy may not use all predictions in the same way; only using the predictions with smaller estimation error may yield better performance.

The delayed imperfect information setting we consider generalizes many existing settings for studying LQR control. The classic setting is the special case where d = 0 and ŵ_{t+i|t} = 0 for all i and t. The offline optimal setting considered by Goel and Hassibi (2020) is the special case where d = 0 and ε_i = 0 for all i. The setting considered by Yu et al. (2020), where exact predictions without delay are available, corresponds to d = 0, ε_i = 0 for 0 ≤ i ≤ k − 1, and ŵ_{t+i|t} = 0 for all t and i ≥ k.

In this paper, we study a myopic policy that extends the classic model predictive control (MPC) approach to the setting of delayed imperfect information. When there are predictions, but no delays, MPC is a popular and successful approach (Lazic et al., 2018; Camacho and Alba, 2013). In fact, Yu et al. (2020) recently showed that MPC has near-optimal dynamic regret in the case of exact predictions and no delay. To define MPC formally, suppose the controller uses k predictions. At each time step t, the controller optimizes based on x_t, ŵ_{t|t}, . . . , ŵ_{t+k−1|t}:

(u_t, . . . , u_{t+k−1}) = argmin_u Σ_{i=t}^{t+k−1} (x_i^⊤ Q x_i + u_i^⊤ R u_i) + x_{t+k}^⊤ Q̃_f x_{t+k},
s.t. x_{i+1} = A x_i + B u_i + ŵ_{i|t}, t ≤ i ≤ t + k − 1.   (2)

This optimization is myopic in the sense that it assumes that the length of the problem is k instead of T and only uses predicted future disturbances within those k steps. The terminal cost matrix Q̃_f in Equation (2) may or may not be the same as the terminal cost matrix Q_f of the original problem (Equation (1)), and can be viewed as a hyper-parameter of the algorithm. Similarly, k is also a hyper-parameter. Larger k is not necessarily better, because predictions in the far future may have very large errors. In this paper, we let Q̃_f = P, where P is the solution of the discrete algebraic Riccati equation (DARE):

P = Q + A^⊤ P A − A^⊤ P B (R + B^⊤ P B)^{−1} B^⊤ P A.   (3)

The output of Equation (2) is k control actions corresponding to times t, t + 1, . . . , t + k − 1, respectively, but only the first (u_t) is applied to the system. The rest (i.e., u_{t+1}, . . . , u_{t+k−1}) are discarded. The explicit solution of the MPC optimization in Equation (2) can be computed and is given below (Yu et al., 2020).

Proposition 1.
The MPC policy at time t is:

u_t = −(R + B^⊤ P B)^{−1} B^⊤ ( P A x_t + Σ_{i=0}^{k−1} (F^⊤)^i P ŵ_{t+i|t} ),

where F = A − B (R + B^⊤ P B)^{−1} B^⊤ P A =: A − BK.

Note that ρ(F) < 1, i.e., the closed loop is stable (Yu et al., 2020). As stated above, MPC does not directly apply to the case of delayed imperfect information. To adapt it, we consider two cases: (i) when the number of predictions available is longer than the feedback delay, i.e., k ≥ d, and (ii) when the delay is longer than the number of predictions available, i.e., k < d.

When k ≥ d, the extension is perhaps straightforward. Here, although the controller does not know the current state x_t, it knows x_{t−d} and ŵ_{t−d|t}, . . . , ŵ_{t−1|t}. Thus, it can estimate the current state. This means that it is possible to simply use this estimate, x̂_{t|t}, as a replacement for x_t in the algorithm, which yields the following:

u_t = −(R + B^⊤ P B)^{−1} B^⊤ ( P A x̂_{t|t} + Σ_{i=0}^{k−d−1} (F^⊤)^i P ŵ_{t+i|t} ).   (4)

When k < d, the extension is not as obvious. In this setting, the quality of the predictions is poor enough that it is better not to use the predictions to estimate the current state. Thus, one cannot simply estimate the current state and run classic MPC. In this case, the key is to view (classic) MPC from a different perspective: MPC locally solves an optimal control problem by treating known disturbances (using predictions) as exact, and treating unknown disturbances as zero (Yu et al., 2020; Camacho and Alba, 2013). This view highlights the fact that underlying MPC is the assumption that predictions are exact. Following this philosophy, in the case when predictions are not enough to be used to estimate the current state, we can instead assume that unknown disturbances are exactly zero. The following theorem derives the optimal policy under this "optimistic" assumption.

Theorem 2.
Suppose there are d steps of delay and k exact predictions with k < d. Assume all used predictions are exact and the other disturbances (with unused predictions) are zero. Then the optimal policy at time t is:

u_t = −(R + B^⊤ P B)^{−1} B^⊤ P A ( A^{d−k} x̂_{t−d+k|t} + Σ_{i=0}^{d−k−1} A^i B u_{t−1−i} ).   (5)

In other words, the policy in Equation (5) first obtains the greedy estimate x̂_{t−d+k|t} using predictions ŵ_{t−d|t}, . . . , ŵ_{t−d+k−1|t}, and then estimates the current state by treating w_{t−d+k} = · · · = w_{t−1} = 0. In fact, instead of treating them as zero, we can impose other values or distributions on those disturbances. This would generalize Theorem 2 to a broader class of policies.

To summarize the two cases above, the myopic generalization of MPC we study in this paper is as follows. Suppose we want to use k predictions. If k ≥ d, then we estimate the current state x_t and apply Equation (4). If k < d, then we estimate the state at time t − d + k and apply Equation (5). In fact, the two cases coincide when k = d.

To study the performance of the policy described above, we analyze its competitive ratio, which bounds the worst-case ratio of the cost of the online algorithm (Alg) compared to the optimal offline cost (Opt) with full exact knowledge of {w_t}_{t=0}^{T−1}. Formally, we study the so-called weak competitive ratio, which allows for an additive constant factor. In particular, we say that a policy is (weakly) c-competitive when Alg ≤ c · Opt + O(1) for any fixed A, B, Q, R, Q_f and r, and for any adversarially and adaptively chosen disturbances {w_t}_{t=0}^{T−1}. When the competitive ratio c is a constant independent of T, we say that the algorithm is constant competitive.

While there has been considerable success in recent years designing control policies that are no-regret (including dynamic regret), e.g., Dean et al. (2018); Agarwal et al. (2019); Yu et al. (2020), there have been very few examples of constant competitive controllers. The few results that exist, e.g., Shi et al. (2020); Goel and Wierman (2019), tend to have restrictive assumptions on the dynamics and/or disturbances. This is because the study of competitive ratio adds difficulty in a few dimensions. In particular, results studying regret tend to focus on regret compared to the offline optimal static linear policy (Agarwal et al., 2019), while the competitive ratio directly compares to the optimal offline (potentially non-linear and non-static) policy (Shi et al., 2020). Characterizing the optimal offline policy is known to be difficult. In fact, the optimal static linear policy can have cost that is an order of magnitude larger than the optimal offline cost, which makes achieving a constant competitive ratio more challenging (Shi et al., 2020).

Our main result provides bounds on the competitive ratio for a generalized form of MPC (Section 2.2) in the case of inexact delayed predictions. To our knowledge, this represents the first constant-competitive bound for a policy in the general online LQR setting, even without considering delayed feedback or inexact predictions.
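The myopic policy of Section 2.2 can be sketched in a few lines. The following Python function combines Equations (4) and (5); the function names and the fixed-point DARE solver are our own scaffolding, not from the paper, and are meant only as an illustration under the stated information structure.

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=500):
    """Fixed-point (value) iteration for the DARE in Equation (3)."""
    P = Q.copy()
    for _ in range(iters):
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(
            R + B.T @ P @ B, B.T @ P @ A)
    return P

def myopic_control(A, B, Q, R, x_obs, w_hats, u_hist, d, k):
    """One step of the generalized MPC policy: Eq. (4) if k >= d, else Eq. (5).

    x_obs  : latest observed state x_{t-d}
    w_hats : predictions [w_hat_{t-d|t}, ..., w_hat_{t-d+k-1|t}]  (length k)
    u_hist : applied inputs [u_{t-d}, ..., u_{t-1}]               (length d)
    """
    P = solve_dare(A, B, Q, R)
    G = np.linalg.inv(R + B.T @ P @ B) @ B.T
    F = A - B @ (G @ P @ A)              # closed-loop matrix, rho(F) < 1
    if k >= d:
        x = x_obs                        # estimate x_t from x_{t-d}
        for i in range(d):
            x = A @ x + B @ u_hist[i] + w_hats[i]
        tail = sum((np.linalg.matrix_power(F.T, i) @ P @ w_hats[d + i]
                    for i in range(k - d)), np.zeros_like(x_obs))
        return -G @ (P @ A @ x + tail)   # Equation (4)
    x = x_obs                            # estimate x_{t-d+k} greedily
    for i in range(k):
        x = A @ x + B @ u_hist[i] + w_hats[i]
    drift = sum((np.linalg.matrix_power(A, i) @ B @ u_hist[d - 1 - i]
                 for i in range(d - k)), np.zeros_like(x_obs))
    return -G @ (P @ A @ (np.linalg.matrix_power(A, d - k) @ x + drift))  # Eq. (5)
```

With d = 0 and k = 0 this reduces to the classic certainty-equivalent LQR feedback u_t = −K x_t, and with d = 0 and k > 0 it recovers the MPC policy of Proposition 1.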
We present our general result below and then discuss the special cases of (i) exact predictions and no delay, (ii) inexact predictions and no delay, and (iii) delay but no access to predictions. The special cases illustrate the contrast between inexact and exact predictions as well as the impact of delay.
Theorem 3 (Main result). Let c = ‖P‖ ‖P^{−1}‖ (1 + ‖F‖)² and H = B (R + B^⊤ P B)^{−1} B^⊤. Suppose there are d steps of delay and the controller uses k predictions. When k ≥ d,

Alg ≤ [ ( c Σ_{i=0}^{d−1} ε_i ‖A^{d−i}‖ + c Σ_{i=d}^{k−1} ε_i ‖F^{i−d}‖ + ‖F^{k−d}‖ )² ‖H‖ / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

When k ≤ d,

Alg ≤ [ ( c Σ_{i=0}^{k−1} ε_i ‖A^{d−i}‖ + c Σ_{i=k}^{d−1} ‖A^{d−i}‖ + 1 )² ‖H‖ / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

The O(1) is with respect to T. It may depend on the system parameters A, B, Q, R, Q_f and the range of disturbances r, but not on T. When Q_f = P, the O(1) is zero.

The two cases in the theorem correspond to the two cases in the algorithm: when predictions are of high enough quality to allow estimation of the current state, and when they are not. Note that the closed-loop dynamics is stable, i.e., ρ(F) = ρ(A − BK) < 1, so there exists γ such that ‖F^i‖ ≤ γ ((ρ(F) + 1)/2)^i for all i ≥ 0. Hence, in the first case, the bound decreases exponentially as the number of predictions k grows when the ε_i are small. In the second case, we see that the amount of delay d exponentially increases the bound if ρ(A) > 1. We explore the insights that follow from the bound further by looking at special cases in the subsections that follow but, before moving to the special cases, we provide an overview of the proof of Theorem 3.
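For intuition, the quantities appearing in Theorem 3 can be evaluated numerically. The sketch below implements the k ≥ d expression as stated above; the helper name and the scalar example system are illustrative stand-ins, not from the paper.

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=500):
    """Fixed-point iteration for the DARE in Equation (3)."""
    P = Q.copy()
    for _ in range(iters):
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(
            R + B.T @ P @ B, B.T @ P @ A)
    return P

def theorem3_bound(A, B, Q, R, eps, d, k):
    """Evaluate the k >= d competitive-ratio bound of Theorem 3 (illustrative)."""
    assert k >= d
    P = solve_dare(A, B, Q, R)
    H = B @ np.linalg.solve(R + B.T @ P @ B, B.T)
    F = A - B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    nrm = lambda M: np.linalg.norm(M, 2)          # spectral norm
    c = nrm(P) * nrm(np.linalg.inv(P)) * (1 + nrm(F)) ** 2
    inner = (c * sum(eps[i] * nrm(np.linalg.matrix_power(A, d - i)) for i in range(d))
             + c * sum(eps[i] * nrm(np.linalg.matrix_power(F, i - d)) for i in range(d, k))
             + nrm(np.linalg.matrix_power(F, k - d)))
    Pinv = np.linalg.inv(P)
    lam = np.linalg.eigvalsh(Pinv - F @ Pinv @ F.T - H).min()
    return inner ** 2 * nrm(H) / lam + 1

# Scalar example: with exact predictions (eps_i = 0) the bound shrinks as k grows.
A1, B1, Q1, R1 = (np.array([[v]]) for v in (1.1, 1.0, 1.0, 1.0))
bounds = [theorem3_bound(A1, B1, Q1, R1, [0.0] * 8, d=0, k=k) for k in (1, 2, 4)]
```

Running this on the scalar example reproduces the qualitative message of the theorem: the bound is a constant greater than 1 that decays toward 1 as more exact predictions are used.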
Proof Sketch.
We first prove Theorem 3 in the case Q_f = P. In this case, the O(1) is not needed. Then, we analyze the impact of the terminal cost Q_f and show that it introduces at most an O(1) additional cost.

Lemma 4.
Suppose Q_f = P. Then, the conclusion of Theorem 3 holds.

The proof of the result in the case Q_f = P can be found in the Appendix and follows from a novel difference analysis of the quadratic cost-to-go functions. Here, we focus on the second part of the proof, i.e., reducing the case Q_f ≠ P to the case Q_f = P. To that end, let Alg(X) be the cost of our algorithm when the terminal cost is X and, similarly, let Opt_Y(X) be the cost of the policy that is optimal for terminal cost Y when the terminal cost is actually X.

Our analysis proceeds by first bounding the impact of the terminal cost on the gap between the algorithm and the optimal cost.

Lemma 5.
For any algorithm,
Alg(Q_f) − Opt_P(Q_f) ≤ Alg(P) − Opt_P(P) + O(1).

Then, we prove that the terminal cost only has an O(1) impact on the optimal cost.

Lemma 6.
The following are equal up to an O(1) difference: Opt_P(P), Opt_P(Q_f), Opt_{Q_f}(Q_f).

Together, these imply that
Alg(Q_f) − Opt_{Q_f}(Q_f)
≤ Alg(Q_f) − Opt_P(Q_f) + O(1)
≤ Alg(P) − Opt_P(P) + O(1)
≤ [ (Alg(P) − Opt_P(P)) / Opt_P(P) ] · Opt_P(P) + O(1)
≤ [ (Alg(P) − Opt_P(P)) / Opt_P(P) ] · Opt_{Q_f}(Q_f) + O(1).
Thus, we can complete the proof by concluding that

Alg(Q_f) ≤ [ Alg(P) / Opt_P(P) ] · Opt_{Q_f}(Q_f) + O(1).

In the sections that follow, we explore special cases of Theorem 3 in order to highlight the impact of inexact predictions and delay. First, we present the special case of k exact predictions and no feedback delay. Formally, we have d = 0, ε_i = 0 for 0 ≤ i ≤ k − 1, and ŵ_{t+i|t} = 0 for all t and i ≥ k. The main result for this setting is given below. It directly follows from the k ≥ d case of Theorem 3.

Theorem 7.
Suppose there are k exact predictions and no feedback delay. Then:

Alg ≤ [ ‖F^k‖² ‖H‖ / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

Thus, the competitive ratio decreases exponentially as k goes up. To illustrate how the primary parameters A, B, Q and R affect the competitive ratio in this case, it is useful to consider the case n = m = 1, as shown in both Corollary 7.1 and Figure 1.

Corollary 7.1.
Assume there are k exact predictions and no feedback delay, and let n = m = 1 and Q_f = P. If A² ≫ B²Q/R + 1,

Alg/Opt ≤ 1 + A^{4−2k} / (B²Q/R).

If B²Q/R ≫ A² + 1,

Alg/Opt ≤ 1 + A^{2k} / (B²Q/R)^{2k−1}.

Interestingly, in this case, the competitive bound only depends on A² and B²Q/R. It does not depend on the sign of A, nor on B, Q or R individually, as long as B²Q/R is fixed. Further, when k ≥ 3, we see that the competitive ratio is small if B, Q are small, R is large, or A is either very large or very small. However, when k = 0 or 1, a large A can result in a large competitive ratio. When k = 0, a large value of B²Q/R also results in a large competitive ratio. We see below that this phenomenon is similar to the case of delay (see Section 3.3).

Theorem 7 is tight in the sense that there exist systems where the competitive ratio is 1 + Θ(‖F^k‖²) (see Appendix).

We next consider the case where predictions are inexact, but there is no feedback delay. The contrast with the previous section highlights the impact of prediction error. As discussed in Section 2.1, the controller should optimize k to utilize predictions with smaller estimation errors while avoiding the use of those with larger errors. The following directly follows from the d = 0 case of Theorem 3 and reduces to the exact case when ε_i = 0.

Theorem 8.
Suppose there are k inexact predictions and no feedback delay. Then:

Alg ≤ [ ‖H‖ ( c Σ_{i=0}^{k−1} ε_i ‖F^i‖ + ‖F^k‖ )² / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

This subsection differs from the previous one in that the controller can minimize the bound in Theorem 8 with respect to k. We characterize this optimization in the following result for 1-d systems, and also provide simulation evidence in Section 4 (see Figure 3).

Corollary 8.1.
Suppose there are k inexact predictions and no feedback delay. Assume n = m = 1. Given non-decreasing {ε_i}, to minimize the competitive ratio bound in Theorem 8, the optimal number k of predictions to use is such that:

ε_{k−1} < (1 − |F|)/|F| < ε_k.

As in the previous section, the one-dimensional setting highlights the dependence of the competitive ratio on the system parameters.
Corollary 8.2.
Assume there are k inexact predictions and no feedback delay, and let n = m = 1 and Q_f = P. If A² ≫ B²Q/R + 1,

Alg/Opt ≤ 1 + (A⁴ / (B²Q/R)) ( Σ_{i=0}^{k−1} ε_i / |A|^i + 1/|A|^k )².

If B²Q/R ≫ A² + 1,

Alg/Opt ≤ 1 + (B²Q/R) ( Σ_{i=0}^{k−1} ε_i |A|^i / (B²Q/R)^i + |A|^k / (B²Q/R)^k )².

The dependence of the competitive ratio on A, B, Q, R is similar to the case of exact predictions. In particular, we find that prediction quality in the near future is (exponentially) more important than further in the future, which is consistent with the robust MPC literature (Cannon and Kouvaritakis, 2005).

In the exact prediction case we show that Theorem 7 is tight with respect to ‖F^k‖. In contrast, in the inexact case the tightness of the dependence on ε_i and ‖F^i‖ in Theorem 8 remains an open question.

Figure 1 (panels (a) k = 0, (b) k = 1, (c) k = 3): Illustration of the competitive ratio bound in Theorem 7. The two axes are A² and B²Q/R respectively. The system is one-dimensional (n = m = 1) and there are no delays (d = 0). With no predictions (k = 0), the bound is small only if both A² and B²Q/R are small. When k = 1, the bound is small if A² is small or B²Q/R is large. When k = 3, the competitive ratio is small if A² is either small or large, or if B²Q/R is large.
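The trade-off behind Corollary 8.1 can be checked numerically by minimizing the 1-d factor from Theorem 8 over k. The error sequence, |F|, and the constant c below are illustrative stand-ins, not values from the paper.

```python
import numpy as np

def inner_term(F_abs, eps, c, k):
    """The 1-d factor c * sum_{i<k} eps_i |F|^i + |F|^k from Theorem 8."""
    return c * sum(eps[i] * F_abs ** i for i in range(k)) + F_abs ** k

def best_k(F_abs, eps, c):
    """Number of predictions minimizing the bound factor (cf. Corollary 8.1)."""
    return min(range(len(eps) + 1), key=lambda k: inner_term(F_abs, eps, c, k))

# Quadratically growing prediction error, as in the experiments of Section 4.
eps = [min(1.0, 0.05 * i ** 2) for i in range(20)]
k_star = best_k(0.5, eps, c=1.0)  # -> 4: only the first few predictions are used
```

Consistent with the discussion above, the minimizer uses only the first few predictions and ignores the noisier ones further into the future.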
The last special case we consider is the case with delays but no (usable) predictions. This case separates the impact of delay from that of predictions. Here, ŵ_{t−d+i|t} = 0 for all t and i ≥ 0. When k ≤ d, via Theorem 3 we have that:

Alg ≤ [ ‖H‖ ( c Σ_{i=1}^{d} ‖A^i‖ + 1 )² / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

Depending on whether the spectral radius satisfies ρ(A) < 1 or ρ(A) > 1, we can bound ‖A^i‖ in two ways: (i) if ρ(A) < 1, then ‖A^i‖ ≤ κ a^i for a = (ρ(A) + 1)/2 < 1 and some constant κ, and (ii) if ρ(A) > 1, then ‖A^i‖ ≤ ‖A‖^i. These yield the following result.

Theorem 9.

Suppose there are d steps of delay and no predictions are available. If ρ(A) < 1, then the competitive ratio is bounded by a constant no matter how many steps of delay there are:

Alg ≤ [ ‖H‖ ( cκ (a − a^{d+1})/(1 − a) + 1 )² / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1)
≤ [ ‖H‖ ( cκ a/(1 − a) + 1 )² / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

If ρ(A) > 1, then the competitive ratio bound grows exponentially fast with the number of delay steps:

Alg ≤ [ ‖H‖ ( c (‖A‖^{d+1} − ‖A‖)/(‖A‖ − 1) + 1 )² / λ_min(P^{−1} − F P^{−1} F^⊤ − H) + 1 ] Opt + O(1).

As in the previous subsections, it is useful to consider the one-dimensional case to get insights about the impact of the system parameters.
Corollary 9.1.
Assume there are d steps of delay and no predictions are available, and let n = m = 1 and Q_f = P. If A² ≫ B²Q/R + 1,

Alg/Opt ≤ 1 + A^{2d+4} / (B²Q/R).

If B²Q/R ≫ A² + 1, then if |A| > 1,

Alg/Opt ≤ 1 + A^{2d+2} (B²Q/R) / (|A| − 1)².

If |A| < 1,

Alg/Opt ≤ 1 + (1 − |A|^{d+1})² (B²Q/R) / (1 − |A|)² ≤ 1 + (B²Q/R) / (1 − |A|)².

In contrast to the case with k ≥ 1 exact predictions, a large B²Q/R or A² does not lead to a small competitive ratio. Instead, in the case of feedback delay, this results in a large competitive ratio. This is consistent with results from robust control theory, i.e., the less stable the open loop is (the larger |A| is), the more impact delay has (Zhou and Doyle, 1998).

Theorem 9 is tight in the sense that there exist systems such that the competitive ratio is 1 + Θ(‖A‖^{2d}) (see Appendix).

To illustrate our results, we end the paper by presenting numerical examples that highlight the impact of delayed, inexact predictions. To that end, we consider a 2-d tracking problem with the desired trajectory used by Li et al. (2019), illustrated in Figure 2.
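A rough, self-contained sketch of this kind of tracking experiment is given below, reusing the tracking-to-LQR reduction from Section 2. The step size, noise level, cost weights, and the simple circular reference are stand-in values (not the paper's experimental constants), and the controller is plain certainty-equivalent LQR on the error state rather than the full myopic policy.

```python
import numpy as np

# Double-integrator tracking via the LQ-tracking reduction of Section 2.
# All constants below are illustrative stand-ins, not the paper's values.
dt, T, noise = 0.1, 200, 0.05
A = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])
B = np.vstack([np.zeros((2, 2)), dt * np.eye(2)])
Q = np.diag([1.0, 1.0, 1e-6, 1e-6])   # penalize (mostly) position error
R = 0.1 * np.eye(2)
rng = np.random.default_rng(1)

# Simple circular position reference with zero target velocity.
y = np.array([[np.sin(0.05 * t), np.cos(0.05 * t), 0.0, 0.0]
              for t in range(T + 1)])

def solve_dare(A, B, Q, R, iters=500):
    """Fixed-point iteration for the DARE in Equation (3)."""
    P = Q.copy()
    for _ in range(iters):
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(
            R + B.T @ P @ B, B.T @ P @ A)
    return P

P = solve_dare(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

x, err = y[0].copy(), 0.0
for t in range(T):
    xt = x - y[t]                        # error state  x~_t = x_t - y_t
    u = -K @ xt                          # certainty-equivalent feedback
    w_tilde = rng.uniform(-noise, noise, 4)
    w = w_tilde + A @ y[t] - y[t + 1]    # reduced disturbance  w_t
    x = (A @ xt + B @ u + w) + y[t + 1]  # x~_{t+1} + y_{t+1}
    err += np.linalg.norm(x[:2] - y[t + 1, :2])
avg_err = err / T                        # average position tracking error
```

Swapping the feedback `-K @ xt` for the delayed/predictive myopic policy of Section 2.2, and sweeping d and k, reproduces the qualitative comparisons reported in this section.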
We see that thecost exponentially decreases (increases) as the numberof predictions (delays) increases.In the second experiment, we study the effect of inex-act predictions and show that the controller needs tooptimize how many predictions are used — it is betterto use only a few predictions and ignore those that aretoo noisy. Specifically, we let (cid:15) i = i , i.e., the noiselevel of predictions grows quadratically fast with thenumber of steps into the future. Each estimation error e t + i | t = w t + i − ˆ w t + i | t is independently sampled fromU[ − (cid:15) i (cid:107) w t + i (cid:107) , (cid:15) i (cid:107) w t + i (cid:107) ] . This process is repeated 8 Figure 3: The impact of inexact predictions. The rel-ative cost (
Alg / Opt −
1) of MPC using k exact (blue)or inexact (orange) predictions is shown. The red lineshows the maximum of all the inexact instances.times, with each instance depicted by an orange lineand their maximum represented by a red line in Fig-ure 3. Figure 3 summarizes the results, suggesting that(i) with exact predictions, the cost will decrease as thenumber of predictions increase (the blue line); and (ii)with inexact predictions, using fewer predictions mayyield better performance. Our result presents the first constant-competitive pol-icy for general LQR control with adversarial distur-bances and delayed imperfect information. We alsoshow that in the case of exact predictions with no de-lay, or in the case of delay with no predictions, thecompetitive ratio bounds of the proposed myopic pol-icy match the lower bound. However, in the inexactprediction case, the tightness of (cid:15) i in our bounds re-mains as an open question. Other important exten-sions include nonlinear dynamics and time-variant lin-ear systems, which can also lead to studying onlinelearning of robust controllers under model mismatch. henkai Yu, Guanya Shi, Soon-Jo Chung, Yisong Yue, Adam Wierman References
Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019.

Brian D. O. Anderson and John B. Moore. Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007.

Antal K. Bejczy, Won S. Kim, and Steven C. Venema. The phantom robot: Predictive displays for teleoperation with time delay. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 546-551. IEEE, 1990.

Alberto Bemporad and Manfred Morari. Robust model predictive control: A survey. In Robustness in Identification and Control, pages 207-226. Springer, 1999.

Eduardo F. Camacho and Carlos Bordons Alba. Model Predictive Control. Springer Science & Business Media, 2013.

Mark Cannon and Basil Kouvaritakis. Optimizing prediction dynamics for robust MPC. IEEE Transactions on Automatic Control, 50(11):1892-1897, 2005.

Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan L. H. Andrew. Online convex optimization using predictions. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 191-204, 2015.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Neural Information Processing Systems (NeurIPS), 2018.

Gautam Goel and Babak Hassibi. The power of linear controllers in LQR control. arXiv preprint arXiv:2002.02574, 2020.

Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and LQR control. Proceedings of Machine Learning Research, 89:2504-2513, 2019.

Jong Hae Kim and Hong Bae Park. $H_\infty$ state feedback control for generalized continuous/discrete time-delay system. Automatica, 35(8):1443-1451, 1999.

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, pages 3814-3823, 2018.

Yingying Li, Xin Chen, and Na Li. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. In Advances in Neural Information Processing Systems, pages 14858-14870, 2019.

Yiheng Lin, Gautam Goel, and Adam Wierman. Online optimization with predictions and non-convex losses. arXiv preprint arXiv:1911.03827, 2019.

Guanya Shi, Xichen Shi, Michael O'Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. In International Conference on Robotics and Automation (ICRA), 2019.

Guanya Shi, Yiheng Lin, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Beyond no-regret: Competitive control via online optimization with memory. In Neural Information Processing Systems (NeurIPS), 2020.

Max Simchowitz and Dylan J. Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.

Chenkai Yu, Guanya Shi, Soon-Jo Chung, Yisong Yue, and Adam Wierman. The power of predictions in online control. In Neural Information Processing Systems (NeurIPS), 2020.

Kemin Zhou and John Comstock Doyle. Essentials of Robust Control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.
A COST CHARACTERIZATION LEMMA
Before we start our proofs, we first present a technical lemma that is used in many of the proofs below. This lemma connects the control cost of a policy to its difference from the offline optimal policy.
Lemma 10.
Suppose at each time $t$, the controller applies the following policy:
\[
u_t = -(R + B^\top P B)^{-1} B^\top \Big( P A x_t + \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} - \eta_t \Big). \tag{6}
\]
If $Q_f = P$, then the control cost is given by:
\[
\mathrm{Alg} = \sum_{t=0}^{T-1} \Big( w_t^\top P w_t + 2 w_t^\top \sum_{i=1}^{T-t-1} (F^\top)^i P w_{t+i} - \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big)^{\!\top} H \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big) \Big) + \sum_{t=0}^{T-1} \eta_t^\top H \eta_t + x_0^\top P x_0 + 2 x_0^\top \sum_{i=0}^{T-1} (F^\top)^{i+1} P w_i. \tag{7}
\]
Note that the offline optimal policy has $\eta_t = 0$ for all $t$ in Equation (7) (as derived by Goel and Hassibi (2020)), and as a result, the extra cost of $\mathrm{Alg}$ is given by
\[
\mathrm{Alg} - \mathrm{Opt} = \sum_{t=0}^{T-1} \eta_t^\top H \eta_t. \tag{8}
\]
In Equation (6), $\eta_t$ can be regarded as the difference between the applied policy and the offline optimal policy. We also present below a lemma that has appeared in the body, as we will prove the two lemmas at one time.

Lemma 5.
For any algorithm,
$\mathrm{Alg}(Q_f) - \mathrm{Opt}_P(Q_f) \le \mathrm{Alg}(P) - \mathrm{Opt}_P(P) + O(1)$, where the $O(1)$ term is with respect to $T$ and it is zero when $Q_f = P$.

Proof of Lemmas 5 and 10. Given a disturbance sequence $w$, we define the cost-to-go function of a policy described by Equation (6):
\[
V_t^{\mathrm{Alg}}(x_t; w) := \sum_{i=t}^{T-1} (x_i^\top Q x_i + u_i^\top R u_i) + x_T^\top Q_f x_T = x_t^\top Q x_t + u_t^\top R u_t + V_{t+1}^{\mathrm{Alg}}(x_{t+1}; w).
\]
We will show by backward induction that $V_t^{\mathrm{Alg}}(x_t; w) = x_t^\top P_t x_t + x_t^\top v_t + q_t$ for some $P_t$, $v_t$ and $q_t$. Let $\Delta_t = P_t - P$, where $P$ is the solution of the DARE (Equation (3)). When $t = T$, we have $P_T = Q_f$, $v_T = 0$ and $q_T = 0$. Assume the claim holds at $t + 1$. For brevity, write $m_t := \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i}$, so that the policy in Equation (6) reads $u_t = -(R + B^\top P B)^{-1} B^\top (P A x_t + m_t - \eta_t)$, and note that $m_t = P w_t + F^\top m_{t+1}$. Expanding the dynamics and grouping the terms in $u_t$,
\begin{align*}
V_t^{\mathrm{Alg}}(x_t; w) &= x_t^\top Q x_t + u_t^\top R u_t + (A x_t + B u_t + w_t)^\top P_{t+1} (A x_t + B u_t + w_t) + (A x_t + B u_t + w_t)^\top v_{t+1} + q_{t+1} \\
&= u_t^\top (R + B^\top P_{t+1} B) u_t + 2 u_t^\top B^\top (P_{t+1} A x_t + P_{t+1} w_t + v_{t+1}/2) \\
&\quad + x_t^\top Q x_t + (A x_t + w_t)^\top P_{t+1} (A x_t + w_t) + (A x_t + w_t)^\top v_{t+1} + q_{t+1}.
\end{align*}
Splitting $P_{t+1} = P + \Delta_{t+1}$, the quadratic-in-$x_t$ part of the terms involving $\Delta_{t+1}$ is exactly $x_t^\top F^\top \Delta_{t+1} F x_t$, while their contribution to the linear and constant parts is $O(\|\Delta_{t+1}\|)$. By the induction hypothesis, $v_{t+1} = 2 \sum_{i=1}^{T-t-1} (F^\top)^i P w_{t+i} + O(\lambda^{T-t-1}) = 2(m_t - P w_t) + O(\lambda^{T-t-1})$, so $P w_t + v_{t+1}/2 = m_t + O(\lambda^{T-t-1})$. Substituting $u_t = -(R + B^\top P B)^{-1} B^\top (P A x_t + m_t - \eta_t)$ and completing the square (recall $H = B (R + B^\top P B)^{-1} B^\top$),
\[
u_t^\top (R + B^\top P B) u_t + 2 u_t^\top B^\top (P A x_t + m_t) = -(P A x_t + m_t)^\top H (P A x_t + m_t) + \eta_t^\top H \eta_t.
\]
Using $F^\top = A^\top (I - P H)$ and the DARE, collecting the quadratic, linear, and constant parts of $V_t^{\mathrm{Alg}}$ gives
\[
P + \Delta_t = P_t = Q + A^\top P A - A^\top P H P A + F^\top \Delta_{t+1} F = P + F^\top \Delta_{t+1} F,
\]
and thus $\Delta_t = F^\top \Delta_{t+1} F = O(\lambda^{T-t})$, where $\lambda = \rho(F)^2$. The recursive formulae for $v_t$ and $q_t$ are given by:
\begin{align*}
v_t &= 2 F^\top P w_t + F^\top v_{t+1} + O(\lambda^{T-t}) = 2 \sum_{i=0}^{T-t-1} (F^\top)^{i+1} P w_{t+i} + O(\lambda^{T-t}), \\
q_t &= q_{t+1} + w_t^\top P w_t + 2 w_t^\top \sum_{i=1}^{T-t-1} (F^\top)^i P w_{t+i} - m_t^\top H m_t + \eta_t^\top H \eta_t + O(\lambda^{T-t}).
\end{align*}
Then,
\begin{align*}
\mathrm{Alg} = V_0^{\mathrm{Alg}}(x_0; w) &= x_0^\top P_0 x_0 + x_0^\top v_0 + q_0 \\
&= x_0^\top P x_0 + x_0^\top \Big( 2 \sum_{i=0}^{T-1} (F^\top)^{i+1} P w_i + O(\lambda^T) \Big) + \sum_{t=0}^{T-1} \eta_t^\top H \eta_t + \sum_{t=0}^{T-1} O(\lambda^{T-t}) \\
&\quad + \sum_{t=0}^{T-1} \Big( w_t^\top P w_t + 2 w_t^\top \sum_{i=1}^{T-t-1} (F^\top)^i P w_{t+i} - \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big)^{\!\top} H \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big) \Big).
\end{align*}
If $Q_f = P$, then $\Delta_{t+1} = 0$ for all $t$, so the $O(\lambda^T)$ and $O(\lambda^{T-t})$ terms above are all zero and we obtain Equation (7). Otherwise,
\[
\mathrm{Alg}(Q_f) - \mathrm{Opt}_P(Q_f) = x_0^\top O(\lambda^T) + \sum_{t=0}^{T-1} \eta_t^\top H \eta_t + O(1).
\]
Therefore, $(\mathrm{Alg}(Q_f) - \mathrm{Opt}_P(Q_f)) - (\mathrm{Alg}(P) - \mathrm{Opt}_P(P)) = O(1)$. $\square$

B PROOF OF THEOREM 2
Theorem 2.
Suppose there are $d$ delays and $k$ exact predictions with $k < d$. Assume all used predictions are exact and other disturbances (with unused predictions) are zero. The optimal policy at time $t$ is:
\[
u_t = -(R + B^\top P B)^{-1} B^\top P A \Big( A^{d-k} \hat{x}_{t-d+k|t} + \sum_{i=0}^{d-k-1} A^i B u_{t-1-i} \Big). \tag{5}
\]
Proof.
Lemma 10 implies that when $Q_f = P$, the offline optimal policy is given by
\[
u_t = -(R + B^\top P B)^{-1} B^\top \Big( P A x_t + \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big).
\]
However, we are looking for the optimal policy under the incorrect assumptions that (i) $w_{t-d+k}$ and all later disturbances are zero, and (ii) $w_{t-d}, \ldots, w_{t-d+k-1}$ equal $\hat{w}_{t-d|t}, \ldots, \hat{w}_{t-d+k-1|t}$, respectively. Replacing $w_{t+i}$ by zero and $x_t$ by $\hat{x}_{t|t}$ in the above policy, we obtain:
\[
u_t = -(R + B^\top P B)^{-1} B^\top P A \hat{x}_{t|t}. \tag{9}
\]
\begin{align*}
\hat{x}_{t|t} &= A \hat{x}_{t-1|t} + B u_{t-1} = A (A \hat{x}_{t-2|t} + B u_{t-2}) + B u_{t-1} \\
&\;\;\vdots \\
&= A^{d-k} \hat{x}_{t-d+k|t} + A^{d-k-1} B u_{t-d+k} + \cdots + B u_{t-1} = A^{d-k} \hat{x}_{t-d+k|t} + \sum_{i=0}^{d-k-1} A^i B u_{t-1-i}.
\end{align*}
As such, we obtain Theorem 2. $\square$
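The telescoping computation of $\hat{x}_{t|t}$ above is easy to check numerically. The sketch below uses arbitrary toy matrices and controls (our own choices) and verifies that, when the disturbances beyond the predicted range are exactly zero, the reconstruction $A^{d-k}\hat{x}_{t-d+k|t} + \sum_{i=0}^{d-k-1} A^i B u_{t-1-i}$ coincides with the true state.

```python
import numpy as np

# Numerical check of the state reconstruction used in Theorem 2: with d steps
# of delay and k exact predictions (k < d), the estimate
#   xhat_{t|t} = A^{d-k} xhat_{t-d+k|t} + sum_{i=0}^{d-k-1} A^i B u_{t-1-i}
# equals the true state when the unpredicted disturbances are zero.
# System matrices and controls below are arbitrary test values.

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
d, k, t = 5, 2, 20

w = rng.normal(size=(t, 2)) * 0.1
w[t - (d - k):] = 0.0                  # unpredicted disturbances are zero
u = rng.normal(size=(t, 1))            # arbitrary past controls

x = np.zeros(2)
traj = [x]
for s in range(t):
    x = A @ x + B @ u[s] + w[s]
    traj.append(x)

# xhat_{t-d+k|t}: with exact predictions, the controller knows this state exactly
x_hat = np.linalg.matrix_power(A, d - k) @ traj[t - d + k]
for i in range(d - k):
    x_hat = x_hat + np.linalg.matrix_power(A, i) @ (B @ u[t - 1 - i])

print(np.allclose(x_hat, traj[t]))  # the reconstruction matches the true state
```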
C PROOF OF THEOREM 3
Theorem 3.
Let $c = \|P\| \|P^{-1}\| (1 + \|F\|)$ and $H = B (R + B^\top P B)^{-1} B^\top$. Suppose there are $d$ steps of delay and the controller uses $k$ predictions. When $k \ge d$,
\[
\mathrm{Alg} \le \left[ \Big( c \sum_{i=0}^{d-1} \epsilon_i \|A^{d-i}\| + c \sum_{i=d}^{k-1} \epsilon_i \|F^{i-d}\| + \|F^{k-d}\| \Big)^{\!2} \, \frac{\|H\|}{\lambda_{\min}(P^{-1} - F P^{-1} F^\top - H)} + 1 \right] \mathrm{Opt} + O(1).
\]
When $k \le d$,
\[
\mathrm{Alg} \le \left[ \Big( c \sum_{i=0}^{k-1} \epsilon_i \|A^{d-i}\| + c \sum_{i=k}^{d-1} \|A^{d-i}\| + 1 \Big)^{\!2} \, \frac{\|H\|}{\lambda_{\min}(P^{-1} - F P^{-1} F^\top - H)} + 1 \right] \mathrm{Opt} + O(1).
\]
The $O(1)$ is with respect to $T$. It may depend on the system parameters $A$, $B$, $Q$, $R$, $Q_f$ and the range of disturbances $r$, but not on $T$. When $Q_f = P$, the $O(1)$ is zero.

The proof outline provided in the body lays out a set of lemmas that, together, prove Theorem 3. Here, we provide proofs for each of them.
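To make the constants in Theorem 3 concrete, the sketch below instantiates $c$, $H$, $F$, and $\lambda_{\min}(P^{-1} - F P^{-1} F^\top - H)$ for a scalar system and evaluates the $k \ge d$ bound. All numeric values, and the quadratically growing error levels $\epsilon_i$, are our own test choices.

```python
import numpy as np

# Evaluate the k >= d competitive-ratio bound of Theorem 3 on a scalar system
# (a = b = q = r = 1, kept as 1x1 arrays so the code generalizes).

def dare(A, B, Q, R, iters=5000):
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ (P @ A - P @ B @ K)
    return P

A = np.array([[1.0]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])

P = dare(A, B, Q, R)                       # scalar solution: golden ratio
H = B @ np.linalg.solve(R + B.T @ P @ B, B.T)
F = A - H @ P @ A                          # closed-loop matrix
Pinv = np.linalg.inv(P)
c = np.linalg.norm(P, 2) * np.linalg.norm(Pinv, 2) * (1 + np.linalg.norm(F, 2))
lam = np.linalg.eigvalsh(Pinv - F @ Pinv @ F.T - H).min()

def ratio_bound(d, k, eps):
    """Theorem 3 upper bound on Alg/Opt for k >= d (ignoring the O(1) term)."""
    assert k >= d
    s1 = sum(eps[i] * np.linalg.norm(np.linalg.matrix_power(A, d - i), 2)
             for i in range(d))
    s2 = sum(eps[i] * np.linalg.norm(np.linalg.matrix_power(F, i - d), 2)
             for i in range(d, k))
    tail = np.linalg.norm(np.linalg.matrix_power(F, k - d), 2)
    return (c * s1 + c * s2 + tail) ** 2 * np.linalg.norm(H, 2) / lam + 1

eps = [0.01 * i ** 2 for i in range(20)]   # quadratically growing error levels
print(ratio_bound(d=2, k=10, eps=eps))     # a constant competitive ratio
```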
C.1 Proof of Lemma 4

Lemma 4.
Suppose $Q_f = P$. Then, the conclusion of Theorem 3 holds.

Proof. This lemma considers the case of $Q_f = P$. Lemma 10 implies that when $Q_f = P$, the cost of the offline optimal policy is:
\[
\mathrm{Opt} = \sum_{t=0}^{T-1} \Big( w_t^\top P w_t + 2 w_t^\top \sum_{i=1}^{T-t-1} (F^\top)^i P w_{t+i} - \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big)^{\!\top} H \Big( \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i} \Big) \Big) + x_0^\top P x_0 + 2 x_0^\top \sum_{i=0}^{T-1} (F^\top)^{i+1} P w_i.
\]
We consider the following substitution:
\[
\psi_t = \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i}, \qquad w_t = P^{-1} (\psi_t - F^\top \psi_{t+1}). \tag{10}
\]
Then, the offline optimal cost can be lower bounded:
\begin{align*}
\mathrm{Opt} &= \sum_{t=0}^{T-1} ( w_t^\top P w_t + 2 w_t^\top F^\top \psi_{t+1} - \psi_t^\top H \psi_t ) + x_0^\top P x_0 + 2 x_0^\top F^\top \psi_0 \\
&= \sum_{t=0}^{T-1} ( \psi_t^\top P^{-1} \psi_t - \psi_{t+1}^\top F P^{-1} F^\top \psi_{t+1} - \psi_t^\top H \psi_t ) + x_0^\top P x_0 + 2 x_0^\top F^\top \psi_0 \\
&= \sum_{t=0}^{T-1} ( \psi_t^\top P^{-1} \psi_t - \psi_t^\top F P^{-1} F^\top \psi_t - \psi_t^\top H \psi_t ) + \psi_0^\top F P^{-1} F^\top \psi_0 + x_0^\top P x_0 + 2 x_0^\top F^\top \psi_0 \\
&= \sum_{t=0}^{T-1} \psi_t^\top ( P^{-1} - F P^{-1} F^\top - H ) \psi_t + ( F^\top \psi_0 + P x_0 )^\top P^{-1} ( F^\top \psi_0 + P x_0 ) \\
&\ge \lambda_{\min}( P^{-1} - F P^{-1} F^\top - H ) \sum_{t=0}^{T-1} \|\psi_t\|^2. \tag{11}
\end{align*}
The myopic policy has two cases and we analyze each of them below.

Case 1: $k \ge d$. In this case, the controller estimates $x_t$ using $x_{t-d}$ and $\hat{w}_{t-d|t}, \ldots, \hat{w}_{t-1|t}$:
\[
x_t - \hat{x}_{t|t} = (A x_{t-1} + B u_{t-1} + w_{t-1}) - (A \hat{x}_{t-1|t} + B u_{t-1} + \hat{w}_{t-1|t}) = A (x_{t-1} - \hat{x}_{t-1|t}) + e_{t-1|t}.
\]
Applying similar procedures repetitively, we obtain:
\[
x_t - \hat{x}_{t|t} = e_{t-1|t} + A e_{t-2|t} + \cdots + A^{d-1} e_{t-d|t} = \sum_{i=1}^{d} A^{i-1} e_{t-i|t}.
\]
Comparing Equations (4) and (6), we have
\[
\eta_t = \sum_{i=1}^{d} P A^i e_{t-i|t} + \sum_{i=0}^{k-d-1} (F^\top)^i P e_{t+i|t} + \sum_{i=k-d}^{T-t-1} (F^\top)^i P w_{t+i}. \tag{12}
\]
Using the substitution in Equation (10), we bound Equation (12) as follows.
\begin{align*}
\|\eta_t\| &= \Big\| \sum_{i=1}^{d} P A^i e_{t-i|t} + \sum_{i=0}^{k-d-1} (F^\top)^i P e_{t+i|t} + (F^\top)^{k-d} \psi_{t+k-d} \Big\| \\
&\le \sum_{i=1}^{d} \|P\| \|A^i\| \epsilon_{d-i} \|w_{t-i}\| + \sum_{i=0}^{k-d-1} \|F^i\| \|P\| \epsilon_{i+d} \|w_{t+i}\| + \|F^{k-d}\| \|\psi_{t+k-d}\| \\
&\le \sum_{i=1}^{d} \|P\| \|A^i\| \epsilon_{d-i} \|P^{-1}\| ( \|\psi_{t-i}\| + \|F\| \|\psi_{t-i+1}\| ) \\
&\quad + \sum_{i=0}^{k-d-1} \|F^i\| \|P\| \epsilon_{i+d} \|P^{-1}\| ( \|\psi_{t+i}\| + \|F\| \|\psi_{t+i+1}\| ) + \|F^{k-d}\| \|\psi_{t+k-d}\|. \tag{13}
\end{align*}
Note that when $t < d$, some terms in Equation (13) have negative subscripts. Those terms do not actually exist and should be regarded as zero. However, for the clarity of the proof, we keep them in the formula. In the later derivations, although we treat them as potentially non-zero, they do not affect our result because we are looking for an upper bound. Let $\eta = (\|\eta_0\|, \ldots, \|\eta_{T-1}\|) \in \mathbb{R}^T$ and $\psi = (\|\psi_0\|, \ldots, \|\psi_{T-1}\|)$. Equation (13) provides a linear inequality relationship between $\eta$ and $\psi$. We define the matrix $M = \{M_{t,s}\}_{t,s=0}^{T-1} \in \mathbb{R}^{T \times T}$ such that $M_{t,s}$ is the coefficient of $\|\psi_s\|$ in the bound of $\|\eta_t\|$ in Equation (13). Then, $\eta \le M \psi$ and
\[
\sum_{t=0}^{T-1} \eta_t^\top H \eta_t \le \|H\| \, \eta^\top \eta \le \|H\| \, \psi^\top M^\top M \psi \le \lambda_{\max}(M^\top M) \|H\| \|\psi\|^2. \tag{14}
\]
Proposition 11 (Gershgorin circle theorem). Let $A \in \mathbb{C}^{n \times n}$. Let $D(A_{i,i}, R_i) \subseteq \mathbb{C}$ be a closed disc centered at $A_{i,i}$ with radius $R_i = \sum_{j \ne i} |A_{i,j}|$. Then, every eigenvalue of $A$ lies within at least one of the discs $D(A_{i,i}, R_i)$.

We use Proposition 11 to bound the eigenvalues of $M^\top M$:
\[
\lambda_{\max}(M^\top M) \le \max_i \sum_{j=0}^{T-1} (M^\top M)_{i,j} = \|M^\top M\|_\infty. \tag{15}
\]
Plugging $\|\psi_s\| = 1$ for all $s$ into Equation (13), we see that every row sum of $M$ is at most
\[
\|P\| \|P^{-1}\| (1 + \|F\|) \Big( \sum_{i=1}^{d} \|A^i\| \epsilon_{d-i} + \sum_{i=0}^{k-d-1} \|F^i\| \epsilon_{i+d} \Big) + \|F^{k-d}\|.
\]
Thus, Equation (15) can be further bounded:
\[
\lambda_{\max}(M^\top M) \le \Big( c \Big( \sum_{i=1}^{d} \|A^i\| \epsilon_{d-i} + \sum_{i=0}^{k-d-1} \|F^i\| \epsilon_{i+d} \Big) + \|F^{k-d}\| \Big)^{\!2},
\]
where $c = \|P\| \|P^{-1}\| (1 + \|F\|)$. Together with Equations (8) and (14), this implies that
\[
\mathrm{Alg} - \mathrm{Opt} = \sum_{t=0}^{T-1} \eta_t^\top H \eta_t \le \Big( c \Big( \sum_{i=1}^{d} \|A^i\| \epsilon_{d-i} + \sum_{i=0}^{k-d-1} \|F^i\| \epsilon_{i+d} \Big) + \|F^{k-d}\| \Big)^{\!2} \|H\| \|\psi\|^2.
\]
Together with Equation (11), we have
\[
\frac{\mathrm{Alg} - \mathrm{Opt}}{\mathrm{Opt}} \le \Big( c \Big( \sum_{i=1}^{d} \|A^i\| \epsilon_{d-i} + \sum_{i=0}^{k-d-1} \|F^i\| \epsilon_{i+d} \Big) + \|F^{k-d}\| \Big)^{\!2} \|H\| \, \lambda_{\min}^{-1}( P^{-1} - F P^{-1} F^\top - H ).
\]
Case 2: $k < d$. We start from Equation (9). In this case, we have the following equations:
\begin{align*}
x_t - \hat{x}_{t|t} &= A (x_{t-1} - \hat{x}_{t-1|t}) + w_{t-1}, \\
&\;\;\vdots \\
x_{t-d+k+1} - \hat{x}_{t-d+k+1|t} &= A (x_{t-d+k} - \hat{x}_{t-d+k|t}) + w_{t-d+k}, \\
x_{t-d+k} - \hat{x}_{t-d+k|t} &= A (x_{t-d+k-1} - \hat{x}_{t-d+k-1|t}) + w_{t-d+k-1} - \hat{w}_{t-d+k-1|t}, \\
&\;\;\vdots \\
x_{t-d+1} - \hat{x}_{t-d+1|t} &= A (x_{t-d} - \hat{x}_{t-d|t}) + w_{t-d} - \hat{w}_{t-d|t}.
\end{align*}
Note that in the last line, $x_{t-d} = \hat{x}_{t-d|t}$. Thus, all of the above equations can be combined into the following:
\[
x_t - \hat{x}_{t|t} = \sum_{i=0}^{k-1} A^{d-i-1} e_{t-d+i|t} + \sum_{i=k}^{d-1} A^{d-i-1} w_{t-d+i}.
\]
Therefore, the policy can be written as:
\[
u_t = -(R + B^\top P B)^{-1} B^\top P A \hat{x}_{t|t} = -(R + B^\top P B)^{-1} B^\top P A \Big( x_t - \sum_{i=0}^{k-1} A^{d-i-1} e_{t-d+i|t} - \sum_{i=k}^{d-1} A^{d-i-1} w_{t-d+i} \Big).
\]
We compare this with Equation (6) to get
\[
\eta_t = \sum_{i=0}^{k-1} P A^{d-i} e_{t-d+i|t} + \sum_{i=k}^{d-1} P A^{d-i} w_{t-d+i} + \sum_{i=0}^{T-t-1} (F^\top)^i P w_{t+i}.
\]
With the substitution in Equation (10),
\begin{align*}
\|\eta_t\| &= \Big\| \sum_{i=0}^{k-1} P A^{d-i} e_{t-d+i|t} + \sum_{i=k}^{d-1} P A^{d-i} w_{t-d+i} + \psi_t \Big\| \\
&\le \sum_{i=0}^{k-1} \|P\| \|A^{d-i}\| \epsilon_i \|w_{t-d+i}\| + \sum_{i=k}^{d-1} \|P\| \|A^{d-i}\| \|w_{t-d+i}\| + \|\psi_t\| \\
&\le \sum_{i=0}^{k-1} \|P\| \|A^{d-i}\| \epsilon_i \|P^{-1}\| ( \|\psi_{t-d+i}\| + \|F\| \|\psi_{t-d+i+1}\| ) \\
&\quad + \sum_{i=k}^{d-1} \|P\| \|A^{d-i}\| \|P^{-1}\| ( \|\psi_{t-d+i}\| + \|F\| \|\psi_{t-d+i+1}\| ) + \|\psi_t\|. \tag{16}
\end{align*}
Similar to the previous case, we define the matrix $M = \{M_{t,s}\}_{t,s=0}^{T-1} \in \mathbb{R}^{T \times T}$ such that $M_{t,s}$ is the coefficient of $\|\psi_s\|$ in the bound of $\|\eta_t\|$ in Equation (16). Then, by Proposition 11,
\[
\lambda_{\max}(M^\top M) \le \Big( c \sum_{i=0}^{k-1} \|A^{d-i}\| \epsilon_i + c \sum_{i=k}^{d-1} \|A^{d-i}\| + 1 \Big)^{\!2}.
\]
Hence,
\[
\mathrm{Alg} - \mathrm{Opt} = \sum_{t=0}^{T-1} \eta_t^\top H \eta_t \le \|H\| \, \eta^\top \eta \le \|H\| \, \psi^\top M^\top M \psi \le \lambda_{\max}(M^\top M) \|H\| \|\psi\|^2,
\]
and therefore
\[
\frac{\mathrm{Alg} - \mathrm{Opt}}{\mathrm{Opt}} \le \Big( c \sum_{i=0}^{k-1} \|A^{d-i}\| \epsilon_i + c \sum_{i=k}^{d-1} \|A^{d-i}\| + 1 \Big)^{\!2} \|H\| \, \lambda_{\min}^{-1}( P^{-1} - F P^{-1} F^\top - H ). \qquad \square
\]

C.2 Proof of Lemma 5
See Appendix A.
C.3 Proof of Lemma 6

Lemma 6.
The following quantities are equal up to an $O(1)$ difference: $\mathrm{Opt}_P(P)$, $\mathrm{Opt}_P(Q_f)$, $\mathrm{Opt}_{Q_f}(Q_f)$.

Proof. By definition,
\[
\mathrm{Opt}_P(P) = \min_X \mathrm{Opt}_X(P) \le \mathrm{Opt}_{Q_f}(P), \qquad \mathrm{Opt}_{Q_f}(Q_f) = \min_X \mathrm{Opt}_X(Q_f) \le \mathrm{Opt}_P(Q_f).
\]
Moreover, for any $X$,
\[
\mathrm{Opt}_X(Q_f) - \mathrm{Opt}_X(P) = x_T^\top (Q_f - P) x_T = O(1),
\]
where $x_T$ is the final state obtained by the policy that is optimal assuming the terminal cost is $X$. Therefore,
\[
\mathrm{Opt}_P(P) \le \mathrm{Opt}_{Q_f}(P) = \mathrm{Opt}_{Q_f}(Q_f) + O(1) \le \mathrm{Opt}_P(Q_f) + O(1) = \mathrm{Opt}_P(P) + O(1).
\]
As a result,
$\mathrm{Opt}_P(P)$, $\mathrm{Opt}_P(Q_f)$, $\mathrm{Opt}_{Q_f}(Q_f)$, $\mathrm{Opt}_{Q_f}(P)$ are all equal up to a difference of $O(1)$. $\square$

D TIGHTNESS OF THEOREM 7
For the case of $k$ exact predictions and no delays, we give an example where the competitive ratio is $1 + \Omega(\|F^k\|^2)$, where the asymptotic notation is with respect to $k$.

Let $n = m = 1$, $Q_f = P$, $x_0 = 0$, and let $A, B, Q, R$ be such that $F > 0$. Using the notation in Equation (6), we have
\[
\eta_t = \sum_{i=k}^{T-t-1} F^i P w_{t+i}.
\]
Let $w_t = r$ be a constant for all $t$. Then,
\[
\mathrm{Alg} - \mathrm{Opt} = \sum_{t=0}^{T-1} H \eta_t^2 \ge \sum_{t=0}^{T-k-1} H (F^k P r)^2 = \Omega(F^{2k}) \cdot T.
\]
Since $\mathrm{Opt} = \Theta(T)$, as a result,
\[
\frac{\mathrm{Alg}}{\mathrm{Opt}} = 1 + \Omega(F^{2k}).
\]
This example directly generalizes to higher dimensions by stacking independent one-dimensional systems together, so that all matrices are diagonal.
E TIGHTNESS OF THEOREM 9
For the case of $d$ steps of delay and no usable predictions, we give an example where the competitive ratio is $1 + \Omega(\|A^d\|^2)$, where the asymptotic notation is with respect to $d$.

Let $n = m = 1$, $Q_f = P$, $x_0 = 0$, and $w_t = r$ for all $t$. Then,
\[
\eta_t = \sum_{i=0}^{d-1} P A^{d-i} w_{t-d+i} + \sum_{i=0}^{T-t-1} F^i P w_{t+i},
\]
and
\[
\mathrm{Alg} - \mathrm{Opt} = \sum_{t=0}^{T-1} H \eta_t^2 \ge \sum_{t=d}^{T-1} H (P A^d r)^2 = \Omega(A^{2d}) \cdot T.
\]
Since $\mathrm{Opt} = \Theta(T)$,
\[
\frac{\mathrm{Alg}}{\mathrm{Opt}} = 1 + \Omega(A^{2d}).
\]
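Similarly, the growth of the gap with the delay $d$ can be observed numerically. The sketch below (our own test values, with $a = 1.2 > 1$ so that $A^d$ grows) compares the delayed myopic policy against the offline optimal cost.

```python
import numpy as np

# Scalar sketch of the Section E construction: d-step delayed state feedback,
# no predictions, constant disturbances. With a > 1 the per-step gap contains
# the term p*a^d*w, so Alg - Opt grows with the delay d.
# All numeric values are our own test choices.

a, b, q, r = 1.2, 1.0, 1.0, 1.0

p = q
for _ in range(200):                     # scalar DARE by fixed-point iteration
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
g = r + b * b * p
f = a - b * b * p * a / g                # stable closed-loop coefficient

def opt_cost(T=100, w0=0.1):
    """Offline optimal cost (full disturbance previews, Q_f = p)."""
    x, c = 0.0, 0.0
    for t in range(T):
        ff = sum(f ** i * p * w0 for i in range(T - t))
        u = -(b / g) * (p * a * x + ff)
        c += q * x * x + r * u * u
        x = a * x + b * u + w0
    return c + p * x * x

def delayed_cost(d, T=100, w0=0.1):
    """Myopic policy that only sees the state from d steps ago."""
    xs, us = [0.0], []
    x, c = 0.0, 0.0
    for t in range(T):
        dd = min(d, t)                   # before step d, only x_0 is known
        xhat = a ** dd * xs[t - dd]
        for i in range(dd):
            xhat += a ** i * b * us[t - 1 - i]
        u = -(b / g) * (p * a * xhat)
        c += q * x * x + r * u * u
        x = a * x + b * u + w0
        xs.append(x)
        us.append(u)
    return c + p * x * x

opt = opt_cost()
gaps = [delayed_cost(d) - opt for d in range(1, 6)]
print(gaps)  # the gap grows with the delay d
```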